8.4. How the Buffer Manager Works

This section describes how the buffer manager works. When a backend process wants to access a desired page, it calls the ReadBufferExtended() function.

The behavior of the ReadBufferExtended() function depends on three logical cases. Each case is described in the following subsections. In addition, the PostgreSQL clock sweep page replacement algorithm is described in the final subsection.

8.4.1. Accessing a Page Stored in the Buffer Pool

First, the simplest case is described, in which the desired page is already stored in the buffer pool. In this case, the buffer manager performs the following steps:

  • (1) Create the buffer_tag of the desired page (in this example, the buffer_tag is ‘Tag_C’) and compute the hash bucket slot that contains the associated entry of the created buffer_tag, using the hash function.

  • (2) Acquire the BufMappingLock partition that covers the obtained hash bucket slot in shared mode (this lock will be released in step (5)).

  • (3) Look up the entry whose tag is ‘Tag_C’ and obtain the buffer_id from the entry. In this example, the buffer_id is 2.

  • (4) Pin the buffer descriptor for buffer_id 2, increasing the refcount and usage_count of the descriptor by 1. ( Section 8.3.2 describes pinning).

  • (5) Release the BufMappingLock.

  • (6) Access the buffer pool slot with buffer_id 2.

Fig. 8.8. Accessing a page stored in the buffer pool.

Then, when reading rows from the page in the buffer pool slot, the PostgreSQL process acquires the shared content_lock of the corresponding buffer descriptor. Therefore, buffer pool slots can be read by multiple processes simultaneously.

When inserting (and updating or deleting) rows to the page, a Postgres process acquires the exclusive content_lock of the corresponding buffer descriptor. (Note that the dirty bit of the page must be set to 1.)

After accessing the pages, the refcount values of the corresponding buffer descriptors are decreased by 1.

8.4.2. Loading a Page from Storage to Empty Slot

In this second case, assume that the desired page is not in the buffer pool and the freelist has free elements (empty descriptors). In this case, the buffer manager performs the following steps:

  • (1) Look up the buffer table (we assume that it is not found).

    1. Create the buffer_tag of the desired page (in this example, the buffer_tag is ‘Tag_E’) and compute the hash bucket slot.
    2. Acquire the BufMappingLock partition in shared mode.
    3. Look up the buffer table. (Not found according to the assumption.)
    4. Release the BufMappingLock.
  • (2) Obtain an empty buffer descriptor from the freelist, and pin it. In this example, the buffer_id of the obtained descriptor is 4.

  • (3) Acquire the BufMappingLock partition in exclusive mode. (This lock will be released in step (6).)

  • (4) Create a new data entry that comprises the buffer_tag ‘Tag_E’ and buffer_id 4. Insert the created entry to the buffer table.

  • (5) Load the desired page data from storage to the buffer pool slot with buffer_id 4 as follows:

    1. In versions 9.5 or earlier, acquire the exclusive io_in_progress_lock of the corresponding descriptor.
    2. Set the io_in_progress bit of the corresponding descriptor to 1 to prevent access by other processes.
    3. Load the desired page data from storage to the buffer pool slot.
    4. Change the states of the corresponding descriptor: the io_in_progress bit is set to 0, and the valid bit is set to 1.
    5. In versions 9.5 or earlier, release the io_in_progress_lock.
  • (6) Release the BufMappingLock.

  • (7) Access the buffer pool slot with buffer_id 4.

Fig. 8.9. Loading a page from storage to an empty slot.

8.4.3. Loading a Page from Storage to a Victim Buffer Pool Slot

In this case, assume that all buffer pool slots are occupied by pages but the desired page is not stored. The buffer manager performs the following steps:

  • (1) Create the buffer_tag of the desired page and look up the buffer table. In this example, we assume that the buffer_tag is ‘Tag_M’ (the desired page is not found).

  • (2) Select a victim buffer pool slot using the clock-sweep algorithm. Obtain the old entry, which contains the buffer_id of the victim pool slot, from the buffer table and pin the victim pool slot in the buffer descriptors layer. In this example, the buffer_id of the victim slot is 5 and the old entry is ‘Tag_F, id=5’. The clock sweep is described in Section 8.4.4.

  • (3) Flush (write and fsync) the victim page data if it is dirty; otherwise proceed to step (4).
    The dirty page must be written to storage before overwriting with new data. Flushing a dirty page is performed as follows:

    1. Acquire the shared content_lock and the exclusive io_in_progress lock of the descriptor with buffer_id 5 (released in step 6).
    2. Change the states of the corresponding descriptor; the io_in_progress bit is set to 1 and the just_dirtied bit is set to 0.
    3. Depending on the situation, the XLogFlush() function is invoked to write WAL data on the WAL buffer to the current WAL segment file (details are omitted; WAL and the XLogFlush() function are described in Chapter 9.
    4. Flush the victim page data to storage.
    5. Change the states of the corresponding descriptor; the io_in_progress bit is set to 0 and the valid bit is set to 1.
    6. Release the io_in_progress and content_lock locks.
  • (4) Acquire the old BufMappingLock partition that covers the slot that contains the old entry, in exclusive mode.

  • (5) Acquire the new BufMappingLock partition and insert the new entry to the buffer table:

    1. Create the new entry comprised of the new buffer_tag ‘Tag_M’ and the victim’s buffer_id.
    2. Acquire the new BufMappingLock partition that covers the slot containing the new entry in exclusive mode.
    3. Insert the new entry to the buffer table.
Fig. 8.10. Loading a page from storage to a victim buffer pool slot.
  • (6) Delete the old entry from the buffer table, and release the old BufMappingLock partition.

  • (7) Load the desired page data from the storage to the victim buffer slot. Then, update the flags of the descriptor with buffer_id 5; the dirty bit is set to 0 and other bits are initialized.

  • (8) Release the new BufMappingLock partition.

  • (9) Access the buffer pool slot with buffer_id 5.

Fig. 8.11. Loading a page from storage to a victim buffer pool slot (continued from Fig. 8.10).

8.4.4. Page Replacement Algorithm: Clock Sweep

The rest of this section describes the clock-sweep algorithm. This algorithm is a variant of NFU (Not Frequently Used) with low overhead; it selects less frequently used pages efficiently.

Imagine buffer descriptors as a circular list (Fig. 8.12). The nextVictimBuffer, an unsigned 32-bit integer, is always pointing to one of the buffer descriptors and rotates clockwise. The pseudocode and description of the algorithm are follows:

Pseudocode: clock-sweep
    WHILE true
(1)    Obtain the candidate buffer descriptor pointed by the nextVictimBuffer
(2)    IF the candidate descriptor is 'unpinned' THEN
(3)       IF the candidate descriptor's usage_count == 0 THEN
             BREAK WHILE LOOP  /* the corresponding slot of this descriptor */
                               /* is victim slot.                           */
          ELSE
             Decrease the candidate descriptpor's usage_count by 1
          END IF
       END IF
(4)    Advance nextVictimBuffer to the next one
    END WHILE 
(5) RETURN buffer_id of the victim
  • (1) Obtain the candidate buffer descriptor pointed by nextVictimBuffer.
  • (2) If the candidate buffer descriptor is unpinned, proceed to step (3). Otherwise, proceed to step (4).
  • (3) If the usage_count of the candidate descriptor is 0, select the corresponding slot of this descriptor as a victim and proceed to step (5). Otherwise, decrease this descriptor’s usage_count by 1 and proceed to step (4).
  • (4) Advance the nextVictimBuffer to the next descriptor (if at the end, wrap around) and return to step (1). Repeat until a victim is found.
  • (5) Return the buffer_id of the victim.

A specific example is shown in Fig. 8.12. The buffer descriptors are shown as blue or cyan boxes, and the numbers in the boxes show the usage_count of each descriptor.

Fig. 8.12. Clock Sweep.
  • 1) The nextVictimBuffer points to the first descriptor (buffer_id 1). However, this descriptor is skipped because it is pinned.

  • 2) The nextVictimBuffer points to the second descriptor (buffer_id 2). This descriptor is unpinned but its usage_count is 2. Thus, the usage_count is decreased by 1, and the nextVictimBuffer advances to the third candidate.

  • 3) The nextVictimBuffer points to the third descriptor (buffer_id 3). This descriptor is unpinned and its usage_count is 0. Thus, this is the victim in this round.

Whenever the nextVictimBuffer sweeps an unpinned descriptor, its usage_count is decreased by 1. Therefore, if unpinned descriptors exist in the buffer pool, this algorithm can always find a victim, whose usage_count is 0, by rotating the nextVictimBuffer.