Transaction logs are an essential part of databases because they ensure that no data is lost even when a system failure, such as a power failure or a server crash, occurs. A transaction log is a history of all changes and actions in a database system. It contains sufficient information about every transaction that has already been executed, so that after a server crash the database server can recover the database cluster by replaying the changes and actions recorded in the log.

In the field of computer science, WAL is an acronym for Write-Ahead Logging, a protocol (or rule) that requires changes and actions to be written to a transaction log before the corresponding data files are modified. In PostgreSQL, however, WAL is also an acronym for Write Ahead Log: the term is used interchangeably with transaction log, and it also refers to the implemented mechanism for writing actions to the transaction log. Although this can be confusing, this document adopts the PostgreSQL definition.

The WAL mechanism was first implemented in version 7.1 to mitigate the impact of server crashes. It also made possible the implementation of Point-in-Time Recovery (PITR) and Streaming Replication (SR), described in Chapters 10 and 11, respectively.

Although an understanding of the WAL mechanism is essential for system integration and administration with PostgreSQL, its complexity makes it impossible to summarize briefly. Therefore, this chapter gives a complete explanation of WAL in PostgreSQL, covering the following topics:

  • The logical and physical structures of the WAL (transaction log)
  • The internal layout of WAL data
  • Writing of WAL data
  • WAL writer process
  • The checkpoint processing
  • The database recovery processing
  • Managing WAL segment files
  • Continuous archiving

9.1. Overview

Let's take a look at the overview of the WAL mechanism. To clarify the issue that WAL solves, the first subsection shows what happens when a crash occurs if PostgreSQL does not implement WAL. The second subsection introduces some key concepts and shows an overview of the main subjects in this chapter: the writing of WAL data and the database recovery process. The final subsection completes the overview of WAL by adding one more key concept.

In this section, to simplify the description, the table TABLE_A, which contains just one page, is used.

9.1.1. Insertion Operations without WAL

As described in Chapter 8, every DBMS implements a shared buffer pool to provide efficient access to the relation's pages.

Assume that we insert some data tuples into TABLE_A on PostgreSQL which does not implement the WAL feature. This situation is illustrated in Fig. 9.1.

Fig. 9.1. Insertion operations without WAL.
  • (1) When we issue the first INSERT statement, PostgreSQL loads the TABLE_A page from the database cluster into the in-memory shared buffer pool and inserts a tuple into the page. The page is not written to the database cluster immediately. (As mentioned in Chapter 8, modified pages are generally called dirty pages.)
  • (2) When we issue the second INSERT statement, PostgreSQL inserts a new tuple into the page in the buffer pool. The page has not been written to storage yet.
  • (3) If the operating system or PostgreSQL server should fail for any reason, such as a power failure, all of the inserted data would be lost.

Therefore, a database without WAL is vulnerable to system failures.


Historical Info

Before WAL was introduced (in versions 7.0 or earlier), PostgreSQL performed synchronous writes to disk, issuing a sync system call whenever a page was changed in memory, in order to ensure durability. This made modification commands such as INSERT and UPDATE perform very poorly.


9.1.2. Insertion Operations and Database Recovery

To deal with the system failures mentioned above without compromising performance, PostgreSQL supports WAL. In this subsection, some keywords and key concepts are described, followed by the writing of WAL data and the recovery of the database.

PostgreSQL writes all modifications as history data into persistent storage, in preparation for failures. In PostgreSQL, this history data is known as XLOG record(s) or WAL data.

XLOG records are written into the in-memory WAL buffer by change operations such as insertion and deletion, and by commit actions. They are written to a WAL segment file on the storage immediately when a transaction commits or aborts. (To be precise, XLOG records may be written in other cases as well; the details are described in Section 9.5.) The LSN (Log Sequence Number) of an XLOG record represents the location where the record is written in the transaction log. The LSN of a record is used as the unique id of the XLOG record.
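To make the LSN concept concrete, the following sketch (illustrative C, not PostgreSQL's own code) converts the textual form of an LSN, such as '1/00002D3E', into the 64-bit byte position it represents. The textual form shows the upper and lower 32 bits of the position separated by a slash.

```c
#include <stdint.h>
#include <stdio.h>

/* Parse an LSN written as "hi/lo" (both parts in hexadecimal) into the
 * 64-bit byte position in the transaction log that it represents.
 * Returns 0 on success, -1 on a malformed string. */
int parse_lsn(const char *text, uint64_t *lsn)
{
    unsigned int hi, lo;

    if (sscanf(text, "%X/%X", &hi, &lo) != 2)
        return -1;
    *lsn = ((uint64_t) hi << 32) | lo;
    return 0;
}
```

Because the LSN is a plain byte position, comparing two LSNs tells which record was written later, which is exactly how they are used during recovery.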

When considering how a database system recovers, one question that may arise is: from what point does PostgreSQL start to recover? The answer is the REDO point, i.e., the write location of the XLOG record at the moment the latest checkpoint started. (Checkpoints in PostgreSQL are described in Section 9.7.) In fact, the database recovery process is closely linked to the checkpoint process; the two are inseparable.


The WAL and checkpoint process were implemented at the same time in version 7.1.


Now that the major keywords and concepts have been introduced, the following describes tuple insertion with WAL. See Fig. 9.2 and the following description.

Fig. 9.2. Insertion operations with WAL.

Notation

'TABLE_A's LSN' shows the value of pd_lsn within the page header of TABLE_A; 'page's LSN' is used in the same manner.


  • (1) The checkpointer, a background process, periodically performs checkpointing. Whenever the checkpointer starts, it writes an XLOG record called the checkpoint record to the current WAL segment. This record contains the location of the latest REDO point.
  • (2) When we issue the first INSERT statement, PostgreSQL loads the TABLE_A page into the shared buffer pool, inserts a tuple into the page, creates and writes an XLOG record for this statement into the WAL buffer at location LSN_1, and updates TABLE_A's LSN from LSN_0 to LSN_1.
    In this example, this XLOG record is a pair of header data and the entire tuple.
  • (3) As this transaction commits, PostgreSQL creates and writes an XLOG record of this commit action into the WAL buffer, and then writes and flushes all XLOG records in the WAL buffer to the WAL segment file, starting from LSN_1.
  • (4) When we issue the second INSERT statement, PostgreSQL inserts a new tuple into the page, creates and writes this tuple's XLOG record to the WAL buffer at LSN_2, and updates TABLE_A's LSN from LSN_1 to LSN_2.
  • (5) When this statement's transaction commits, PostgreSQL operates in the same manner as in step (3).
  • (6) Imagine that an operating system failure occurs. Even though all of the data in the shared buffer pool is lost, all modifications to the page have been written to the WAL segment files as history data.

The following steps show how our database cluster is recovered back to the state immediately before the crash. There is no need to do anything special: PostgreSQL automatically enters recovery mode when restarted. See Fig. 9.3. PostgreSQL sequentially reads and replays the XLOG records within the appropriate WAL segment files, starting from the REDO point.

Fig. 9.3. Database recovery using WAL.
  • (1) PostgreSQL reads the XLOG record of the first INSERT statement from the appropriate WAL segment file, and loads the TABLE_A page from the database cluster into the shared buffer pool.
  • (2) Before trying to replay the XLOG record, PostgreSQL compares the XLOG record's LSN with the corresponding page's LSN. (The reason for doing this is described in Section 9.8.) The rules for replaying XLOG records are as follows:
    If the XLOG record's LSN is larger than the page's LSN, the data portion of the XLOG record is inserted into the page, and the page's LSN is updated to the XLOG record's LSN.
    On the other hand, if the XLOG record's LSN is smaller, there is nothing to do other than to read the next WAL record.
    In this example, the XLOG record is replayed because the XLOG record's LSN (LSN_1) is larger than TABLE_A's LSN (LSN_0). TABLE_A's LSN is then updated from LSN_0 to LSN_1.
  • (3) PostgreSQL replays the remaining XLOG record(s) in the same way.
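The comparison rule above can be sketched in C. This is an illustrative sketch with hypothetical, greatly simplified page and record types; it is not PostgreSQL's actual replay code.

```c
#include <stdint.h>
#include <string.h>

typedef uint64_t XLogRecPtr;    /* an LSN */

/* Hypothetical, simplified types for illustration only. */
typedef struct { XLogRecPtr pd_lsn; char data[8192]; } Page;
typedef struct {
    XLogRecPtr  lsn;        /* where the record was written in the log */
    const char *payload;    /* the record's data portion */
    size_t      len;
    size_t      off;        /* where the data belongs within the page */
} XLogRecordStub;

/* Apply the record only when its LSN is newer than the page's LSN,
 * then advance the page's LSN. Returns 1 if the record was replayed,
 * 0 if it was skipped because the page already reflects it. */
int replay_record(Page *page, const XLogRecordStub *rec)
{
    if (rec->lsn <= page->pd_lsn)
        return 0;                             /* change already on the page */
    memcpy(page->data + rec->off, rec->payload, rec->len);
    page->pd_lsn = rec->lsn;                  /* page now reflects this record */
    return 1;
}
```

The LSN comparison is what makes replay idempotent: a record that has already reached the page is simply skipped on a second pass.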

PostgreSQL can recover itself in this way, replaying the XLOG records written in the WAL segment files in chronological order. Thus, PostgreSQL's XLOG records are a REDO log.


PostgreSQL does not support an UNDO log.


Although writing XLOG records certainly has a cost, it is small compared to writing entire modified pages. The benefit we gain, namely tolerance of system failures, outweighs that cost.

9.1.3. Full-Page Writes

Suppose that TABLE_A's page data on the storage is corrupted because the operating system failed while the background writer process was writing the dirty page. As XLOG records cannot be replayed on a corrupted page, an additional feature is needed.

PostgreSQL supports a feature called full-page writes to deal with such failures. When it is enabled (the default), PostgreSQL writes a pair of the header data and the entire page as an XLOG record during the first change of each page after every checkpoint. In PostgreSQL, such an XLOG record containing the entire page is called a backup block (or full-page image).

Let's describe the insertion of tuples again, but with full-page writes enabled. See Fig. 9.4 and the following description.

Fig. 9.4. Full page writes.
  • (1) The checkpointer starts a checkpoint process.
  • (2) When we issue the first INSERT statement, PostgreSQL operates in the same way as in the previous subsection, except that this XLOG record is the backup block of this page (in other words, it contains the entire page), because this is the first write to this page after the latest checkpoint.
  • (3) As this transaction commits, PostgreSQL operates in the same way as in the previous subsection.
  • (4) When we issue the second INSERT statement, PostgreSQL operates in the same way as in the previous subsection, since this XLOG record is not a backup block.
  • (5) When this statement's transaction commits, PostgreSQL operates in the same way as in the previous subsection.
  • (6) To demonstrate the effectiveness of full-page writes, let's consider the case in which the TABLE_A page on the storage has been corrupted due to an operating system failure that occurred while the background writer was writing it to the storage (HDD or SSD).

Restart the PostgreSQL server to repair the broken cluster. See Fig. 9.5 and the following description.

Fig. 9.5. Database recovery with backup block.
  • (1) PostgreSQL reads the XLOG record of the first INSERT statement and loads the corrupted TABLE_A page from the database cluster into the shared buffer pool. In this example, the XLOG record is a backup block, because the first XLOG record for each page after a checkpoint is always its backup block, according to the writing rule of full-page writes.
  • (2) When an XLOG record is a backup block, another replay rule applies: the record's data portion (i.e., the page itself) is overwritten onto the page regardless of both LSN values, and the page's LSN is updated to the XLOG record's LSN.
    In this example, PostgreSQL overwrites the corrupted page with the data portion of the record and updates TABLE_A's LSN to LSN_1. In this way, the corrupted page is restored from its backup block.
  • (3) Since the second XLOG record is a non-backup block, PostgreSQL operates in the same way as the instruction in the previous subsection.

In this way, PostgreSQL can recover the database even if some data write errors occur due to a process or operating system crash.
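The backup-block replay rule can be sketched in the same illustrative style as before (hypothetical, greatly simplified types; not PostgreSQL's actual code). The key point is that the stored full-page image is applied unconditionally, which is what makes it possible to restore a corrupted page.

```c
#include <stdint.h>
#include <string.h>

typedef uint64_t XLogRecPtr;

/* Hypothetical, simplified types for illustration only. */
typedef struct { XLogRecPtr pd_lsn; char data[8192]; } Page;
typedef struct {
    XLogRecPtr lsn;
    int        is_backup_block;   /* record carries a full-page image */
    char       image[8192];       /* the entire page, when is_backup_block */
} XLogRecordStub;

/* A backup block is applied unconditionally: the stored page image is
 * copied over the (possibly corrupted) page and the page's LSN is set
 * to the record's LSN, with no LSN comparison at all. */
void replay_backup_block(Page *page, const XLogRecordStub *rec)
{
    if (rec->is_backup_block) {
        memcpy(page->data, rec->image, sizeof page->data);
        page->pd_lsn = rec->lsn;
    }
}
```

Skipping the LSN comparison matters here because a corrupted page may contain an arbitrary (even seemingly newer) pd_lsn value, which cannot be trusted.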


WAL, Backup, and Replication

As mentioned above, WAL can prevent data loss due to process or operating system crashes. However, if a file system or media failure occurs, the data will be lost. To deal with such failures, PostgreSQL provides online backup and replication features.

If online backups are taken regularly, the database can be restored from the most recent backup, even if a media failure occurs. However, it is important to note that the changes made after taking the last backup cannot be restored.

The synchronous replication feature can store all changes to another storage or host in real time. This means that if a media failure occurs on the primary server, the data can be restored from the secondary server.

For more information, see Chapters 10 and 11, respectively.


9.2. Transaction Log and WAL Segment Files

Logically, PostgreSQL writes XLOG records into a transaction log that is a single virtual file with an 8-byte address space (16 exabytes).

An 8-byte address space is vast enough that the capacity of the transaction log is effectively unlimited; however, it is impossible for us to handle one file of such enormous size. Therefore, a transaction log in PostgreSQL is divided into files of 16 megabytes by default, each of which is known as a WAL segment. See Fig. 9.6.


WAL segment file size

In versions 11 or later, the size of WAL segment files can be configured using the --wal-segsize option when a PostgreSQL cluster is created by the initdb command.


Fig. 9.6. Transaction log and WAL segment files

The WAL segment filename is a 24-digit hexadecimal number, and the naming rule is as follows:

\begin{align} \verb|WAL segment file name| = \verb|timelineId| + (\verb|uint32|) \frac{\verb|LSN|-1}{16\verb|M|*256} + (\verb|uint32|)\left(\frac{\verb|LSN|-1}{16\verb|M|}\right) \% 256 \end{align}
timelineId

PostgreSQL's WAL contains the concept of timelineId (4-byte unsigned integer), which is for Point-in-Time Recovery (PITR) described in Chapter 10. However, the timelineId is fixed to 0x00000001 in this chapter because this concept is not required in the following descriptions.


The first WAL segment file is 000000010000000000000001. When the first one is filled up with XLOG records, the second one, 000000010000000000000002, is provided. Files are used in ascending order in succession. After 0000000100000000000000FF is filled up, the next one, 000000010000000100000000, is provided. In this way, whenever the last 2 digits carry over, the middle 8-digit number is incremented by one.

Similarly, after 0000000100000001000000FF has been filled up, 000000010000000200000000 will be provided, and so on.
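The naming rule above can be sketched as a small C function. This is an illustrative sketch assuming the default 16 MB segment size, not PostgreSQL's actual implementation.

```c
#include <stdint.h>
#include <stdio.h>

/* Build the 24-hexdigit WAL segment file name for a given timelineId
 * and LSN, following the naming rule above: the overall segment number
 * (LSN - 1) / 16M is split into a middle part (/ 256) and a last part
 * (% 256). buf must hold at least 25 bytes. */
void wal_segment_file_name(char buf[25], uint32_t timeline_id, uint64_t lsn)
{
    const uint64_t seg_size = 16 * 1024 * 1024;   /* default: 16 MB */
    uint64_t segno = (lsn - 1) / seg_size;        /* overall segment number */

    snprintf(buf, 25, "%08X%08X%08X",
             (unsigned) timeline_id,
             (unsigned) (segno / 256),    /* middle 8 digits */
             (unsigned) (segno % 256));   /* last 8 digits */
}
```

For example, timelineId 1 with LSN '1/00002D3E' yields 000000010000000100000000, which matches the pg_xlogfile_name example below.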


pg_xlogfile_name / pg_walfile_name

Using the built-in function pg_xlogfile_name (versions 9.6 or earlier) or pg_walfile_name (version 10 or later), we can find the WAL segment file name that contains a specified LSN. An example is shown below:

testdb=# SELECT pg_xlogfile_name('1/00002D3E');  # In versions 10 or later, "SELECT pg_walfile_name('1/00002D3E');"
     pg_xlogfile_name     
--------------------------
 000000010000000100000000
(1 row)

9.3. Internal Layout of WAL Segment

A WAL segment is a 16 MB file by default, and it is internally divided into pages of 8192 bytes (8 KB). The first page has header data defined by the XLogLongPageHeaderData structure, while the headers of all other pages hold the page information defined by the XLogPageHeaderData structure. Following the page header, XLOG records are written into each page from the beginning, in the order they are created. See Fig. 9.7.

Fig. 9.7. Internal layout of a WAL segment file.

The XLogLongPageHeaderData structure and the XLogPageHeaderData structure are defined in src/include/access/xlog_internal.h. The explanation of both structures is omitted because they are not required in the following descriptions.
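The position of a given LSN within this layout can be computed with simple arithmetic. The sketch below is illustrative C under the default settings (16 MB segments, 8 KB pages), not PostgreSQL's actual macros.

```c
#include <stdint.h>

#define WAL_SEGMENT_SIZE (16 * 1024 * 1024)  /* default segment size: 16 MB */
#define XLOG_PAGE_SIZE   8192                /* page size within a segment: 8 KB */

/* Offset of the LSN from the beginning of its WAL segment file. */
uint32_t lsn_segment_offset(uint64_t lsn)
{
    return (uint32_t) (lsn % WAL_SEGMENT_SIZE);
}

/* Index of the 8 KB page (within the segment) that contains the LSN. */
uint32_t lsn_page_index(uint64_t lsn)
{
    return lsn_segment_offset(lsn) / XLOG_PAGE_SIZE;
}
```

For instance, LSN '1/00002D3E' falls 0x2D3E (11582) bytes into its segment, i.e., in the segment's second page (index 1).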

9.4. Internal Layout of XLOG Record

An XLOG record comprises a general header portion and an associated data portion. The first subsection describes the header structure. The remaining two subsections explain the structure of the data portion in versions 9.4 and earlier and in versions 9.5 and later, respectively. (The data format changed in version 9.5.)

9.4.1. Header Portion of XLOG Record

All XLOG records have a general header portion defined by the XLogRecord structure. The structure of versions 9.4 and earlier is shown below; note that it was changed in version 9.5.

typedef struct XLogRecord
{
   uint32          xl_tot_len;   /* total len of entire record */
   TransactionId   xl_xid;       /* xact id */
   uint32          xl_len;       /* total len of rmgr data. This variable was removed in ver.9.5. */
   uint8           xl_info;      /* flag bits, see below */
   RmgrId          xl_rmid;      /* resource manager for this record */
   /* 2 bytes of padding here, initialize to zero */
   XLogRecPtr      xl_prev;      /* ptr to previous record in log */
   pg_crc32        xl_crc;       /* CRC for this record */
} XLogRecord;

The Header Portion of the XLOG Record in versions 9.5 or later.

In versions 9.5 or later, one variable (xl_len) was removed from the XLogRecord structure to refine the XLOG record format, which reduced its size by a few bytes.


Most of the variables are so obvious that they need no description, apart from the following two.

Both xl_rmid and xl_info are variables related to resource managers, which are collections of operations associated with the WAL feature, such as writing and replaying of XLOG records. The number of resource managers tends to increase with each PostgreSQL version. Version 10 contains the following:

Operation                                 Resource manager
----------------------------------------------------------------------------
Heap tuple operations                     RM_HEAP, RM_HEAP2
Index operations                          RM_BTREE, RM_HASH, RM_GIN, RM_GIST,
                                          RM_SPGIST, RM_BRIN
Sequence operations                       RM_SEQ
Transaction operations                    RM_XACT, RM_MULTIXACT, RM_CLOG,
                                          RM_XLOG, RM_COMMIT_TS
Tablespace operations                     RM_SMGR, RM_DBASE, RM_TBLSPC, RM_RELMAP
Replication and hot standby operations    RM_STANDBY, RM_REPLORIGIN,
                                          RM_GENERIC_ID, RM_LOGICALMSG_ID

Here are some representative examples of how resource managers work:

  • If an INSERT statement is issued, the header variables xl_rmid and xl_info of its XLOG record are set to 'RM_HEAP' and 'XLOG_HEAP_INSERT', respectively. When recovering the database cluster, the RM_HEAP's function heap_xlog_insert() is selected according to the xl_info and replays this XLOG record.
  • Similarly, for an UPDATE statement, the header variable xl_info of the XLOG record is set to 'XLOG_HEAP_UPDATE', and the RM_HEAP's function heap_xlog_update() replays its record when the database recovers.
  • When a transaction commits, the header variables xl_rmid and xl_info of its XLOG record are set to 'RM_XACT' and 'XLOG_XACT_COMMIT', respectively. When recovering the database cluster, the function xact_redo_commit() replays this record.

The XLogRecord structure of versions 9.4 or earlier is defined in src/include/access/xlog.h, and that of versions 9.5 or later is defined in src/include/access/xlogrecord.h.

The functions heap_xlog_insert() and heap_xlog_update() are defined in src/backend/access/heap/heapam.c, while the function xact_redo_commit() is defined in src/backend/access/transam/xact.c.
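The dispatch described above can be pictured as a table lookup keyed by the resource-manager id. The following is an illustrative sketch with hypothetical, greatly simplified types and stub redo functions; it is not PostgreSQL's actual rmgr table.

```c
/* Hypothetical, simplified stand-ins for illustration only. */
typedef enum { RM_HEAP, RM_XACT } RmgrId;
typedef struct { RmgrId xl_rmid; int xl_info; } XLogRecordStub;
typedef void (*rm_redo_fn)(const XLogRecordStub *rec);

static int heap_replayed = 0;
static int xact_replayed = 0;

/* Stub redo functions; real ones would branch on xl_info internally,
 * e.g. to heap-insert vs. heap-update handling. */
static void heap_redo(const XLogRecordStub *rec) { (void) rec; heap_replayed++; }
static void xact_redo(const XLogRecordStub *rec) { (void) rec; xact_replayed++; }

/* The redo function is selected by the record's resource-manager id. */
static const rm_redo_fn rmgr_redo_table[] = {
    [RM_HEAP] = heap_redo,
    [RM_XACT] = xact_redo,
};

void replay(const XLogRecordStub *rec)
{
    rmgr_redo_table[rec->xl_rmid](rec);
}
```

This table-driven design is why adding a new kind of WAL-logged operation mainly means registering another resource manager.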


9.4.2. Data Portion of XLOG Record (versions 9.4 or earlier)

The data portion of an XLOG record can be classified into either a backup block (which contains the entire page) or a non-backup block (which contains different data depending on the operation).

Fig. 9.8. Examples of XLOG records (versions 9.4 or earlier).

The internal layouts of XLOG records are described below, using some specific examples.

9.4.2.1. Backup Block

A backup block is shown in Fig. 9.8(a). It is composed of two data structures and one data object:

  1. The XLogRecord structure (header portion)
  2. The BkpBlock structure
  3. The entire page, except for its free space

The BkpBlock structure contains the variables that identify the page in the database cluster (i.e., the relfilenode and fork number of the relation that contains the page, and the page's block number), as well as the starting position and length of the page's free space.

9.4.2.2. Non-Backup Block

In non-backup blocks, the layout of the data portion differs depending on the operation. Here, the XLOG record for an INSERT statement is explained as a representative example. See Fig. 9.8(b). In this case, the XLOG record for the INSERT statement is composed of two data structures and one data object:

  1. The XLogRecord (header-portion) structure
  2. the xl_heap_insert structure
  3. The inserted tuple, with a few bytes removed

The xl_heap_insert structure contains the variables that identify the inserted tuple in the database cluster (i.e., the relfilenode of the table that contains this tuple, and the tuple's tid), as well as a visibility flag of this tuple.


The reason why a few bytes are removed from the inserted tuple is described in the source-code comment of the xl_heap_header structure:

We don't store the whole fixed part (HeapTupleHeaderData) of an inserted or updated tuple in WAL; we can save a few bytes by reconstructing the fields that are available elsewhere in the WAL record, or perhaps just plain needn't be reconstructed.

One more example is shown here. See Fig. 9.8(c). The XLOG record for a checkpoint is quite simple; it is composed of two data structures:

  1. the XLogRecord structure (header-portion)
  2. the CheckPoint structure, which contains the checkpoint information (described in more detail in Section 9.7)

The xl_heap_header structure is defined in src/include/access/htup.h while the CheckPoint structure is defined in src/include/catalog/pg_control.h.


9.4.3. Data Portion of XLOG Record (versions 9.5 or later)

In versions 9.4 or earlier, there was no common format for XLOG records, so each resource manager had to define its own format. This made it increasingly difficult to maintain the source code and implement new features related to WAL. To address this issue, a common structured format that is independent of resource managers was introduced in version 9.5.

The data portion of an XLOG record can be divided into two parts: header and data. See Fig. 9.9.

Fig. 9.9. Common XLOG record format.

The header part contains zero or more XLogRecordBlockHeaders and zero or one XLogRecordDataHeaderShort (or XLogRecordDataHeaderLong). It must contain at least one of these. When the record stores a full-page image (i.e., a backup block), the XLogRecordBlockHeader includes the XLogRecordBlockImageHeader, and also includes the XLogRecordBlockCompressHeader if its block is compressed.

The data part is composed of zero or more block data and zero or one main data, which correspond to the XLogRecordBlockHeader(s) and to the XLogRecordDataHeader, respectively.


WAL compression

In versions 9.5 or later, full-page images within XLOG records can be compressed using the LZ compression method by setting the parameter wal_compression = on. In that case, the XLogRecordBlockCompressHeader structure is added.

This feature has two advantages and one disadvantage. The advantages are reduced I/O cost when writing records and lower consumption of WAL segment files. The disadvantage is the extra CPU resources consumed by compression.


Fig. 9.10. Examples of XLOG records (versions 9.5 or later).

Some specific examples are shown below, as in the previous subsection.

9.4.3.1. Backup Block

The backup block created by an INSERT statement is shown in Fig. 9.10(a). It is composed of four data structures and one data object:

  1. the XLogRecord structure (header-portion)
  2. the XLogRecordBlockHeader structure, including one XLogRecordBlockImageHeader structure
  3. the XLogRecordDataHeaderShort structure
  4. a backup block (block data)
  5. the xl_heap_insert structure (main data)

The XLogRecordBlockHeader structure contains the variables to identify the block in the database cluster (the relfilenode, the fork number, and the block number). The XLogRecordBlockImageHeader structure contains the length of this block and its offset number. (These two header structures together can store the same data as the BkpBlock structure used up to version 9.4.)

The XLogRecordDataHeaderShort structure stores the length of the xl_heap_insert structure, which is the main data of the record. (See below.)


The main data of an XLOG record that contains a full-page image is not used except in some special cases, such as logical decoding and speculative insertions. It is ignored when the record is replayed, making it redundant data. This may be improved in the future.

In addition, the main data of backup block records depends on the statements that create them. For example, an UPDATE statement appends xl_heap_lock or xl_heap_updated.


9.4.3.2. Non-Backup Block

Next, I will describe the non-backup block record created by the INSERT statement (see Fig. 9.10(b)). It is composed of four data structures and one data object:

  1. the XLogRecord structure (header-portion)
  2. the XLogRecordBlockHeader structure
  3. the XLogRecordDataHeaderShort structure
  4. an inserted tuple (to be exact, an xl_heap_header structure and the entire inserted data)
  5. the xl_heap_insert structure (main data)

The XLogRecordBlockHeader structure contains three values (the relfilenode, the fork number, and the block number) to specify the block that the tuple was inserted into, and the length of the data portion of the inserted tuple. The XLogRecordDataHeaderShort structure contains the length of the new xl_heap_insert structure, which is the main data of this record.

The new xl_heap_insert structure contains only two values: the offset number of this tuple within the block, and a visibility flag. It became very simple because the XLogRecordBlockHeader structure stores most of the data that was contained in the old xl_heap_insert structure.

As the final example, a checkpoint record is shown in Fig. 9.10(c). It is composed of three data structures:

  1. the XLogRecord structure (header-portion)
  2. the XLogRecordDataHeaderShort structure, which contains the main-data length
  3. the CheckPoint structure (main data)

The xl_heap_header structure is defined in src/include/access/htup.h, and the CheckPoint structure is defined in src/include/catalog/pg_control.h.


Although the new format is a little complicated for us, it is well designed for the parsers of the resource managers. Additionally, the sizes of many types of XLOG records are usually smaller than before. The sizes of the main structures are shown in Figs. 9.8 and 9.10, so you can calculate the sizes of those records and compare them. (The new checkpoint record is larger than the previous one, but it contains more variables.)

9.5. Writing of XLOG Records

Having finished the warm-up exercises, we are now ready to understand the writing of XLOG records. I will explain it as precisely as possible in this section.

First, issue the following statement to explore the PostgreSQL internals:

testdb=# INSERT INTO tbl VALUES ('A');

By issuing the above statement, the internal function exec_simple_query() is invoked. The pseudocode of exec_simple_query() is shown below:

exec_simple_query() @postgres.c

(1) ExtendCLOG() @clog.c                  /* Writes the state of this transaction,
                                           * "IN_PROGRESS", to the CLOG.
                                           */
(2) heap_insert() @heapam.c               /* Inserts a tuple, creates an XLOG record,
                                           * and invokes the function XLogInsert().
                                           */
(3)   XLogInsert() @xloginsert.c (9.4 or earlier, xlog.c)
                                          /* Writes the XLOG record of the inserted tuple
                                           * to the WAL buffer, and updates the page's pd_lsn.
                                           */
(4) finish_xact_command() @postgres.c     /* Invokes the commit action. */
      XLogInsert() @xloginsert.c (9.4 or earlier, xlog.c)
                                          /* Writes an XLOG record of this commit action
                                           * to the WAL buffer.
                                           */
(5)   XLogWrite() @xlog.c                 /* Writes and flushes all XLOG records in
                                           * the WAL buffer to the WAL segment file.
                                           */
(6) TransactionIdCommitTree() @transam.c  /* Changes the state of this transaction
                                           * from "IN_PROGRESS" to "COMMITTED" in the CLOG.
                                           */

In the following paragraphs, each line of the pseudocode will be explained to help you understand the writing of XLOG records. See also Figs. 9.11 and 9.12.

  • (1) The function ExtendCLOG() writes the state of this transaction 'IN_PROGRESS' in the (in-memory) CLOG.
  • (2) The function heap_insert() inserts a heap tuple into the target page in the shared buffer pool, creates the XLOG record for that page, and invokes the function XLogInsert().
  • (3) The function XLogInsert() writes the XLOG record created by the heap_insert() to the WAL buffer at LSN_1, and then updates the modified page's pd_lsn from LSN_0 to LSN_1.
  • (4) The function finish_xact_command(), which is invoked to commit this transaction, creates an XLOG record for the commit action, and then the function XLogInsert() writes this record into the WAL buffer at LSN_2.
Fig. 9.11. Write-sequence of XLOG records.

The XLOG records in these figures are shown in the version 9.4 format.
  • (5) The function XLogWrite() writes and flushes all XLOG records on the WAL buffer to the WAL segment file.
    If the parameter wal_sync_method is set to 'open_sync' or 'open_datasync', the records are written synchronously, because the function writes all records using the open() system call with the O_SYNC or O_DSYNC flag specified.
    If the parameter is set to 'fsync', 'fsync_writethrough', or 'fdatasync', the respective system call – fsync(), fcntl() with the F_FULLFSYNC option, or fdatasync() – is executed. In any case, all XLOG records are ensured to be written to storage.
  • (6) The function TransactionIdCommitTree() changes the state of this transaction from 'IN_PROGRESS' to 'COMMITTED' on the CLOG.
Fig. 9.12. Write-sequence of XLOG records (continued from Fig. 9.11).

In the above example, the commit action caused the writing of XLOG records to the WAL segment, but such writing may be caused by any of the following:

  1. One running transaction has committed or aborted.
  2. The WAL buffer has been filled up with XLOG records. (The WAL buffer size is set by the parameter wal_buffers.)
  3. A WAL writer process writes periodically. (See the next section.)

If any of the above occurs, all WAL records on the WAL buffer are written into a WAL segment file regardless of whether their transactions have been committed or not.

It is taken for granted that DML (Data Manipulation Language) operations write XLOG records, but so do some non-DML operations. As described above, a commit action writes an XLOG record that contains the id of the committed transaction. Another example is a checkpoint action, which writes an XLOG record that contains general information about the checkpoint. Furthermore, a SELECT statement creates XLOG records in special cases, although it does not usually create them. For example, if HOT (Heap Only Tuple) processing deletes unnecessary tuples and defragments the necessary tuples in a page during a SELECT statement, the XLOG records of the modified pages are written to the WAL buffer.


Direct I/O

PostgreSQL versions 15 and earlier do not support direct I/O, although it has been discussed. Refer to this discussion on the pgsql-ML and this article.

In version 16, the debug_io_direct option has been added. This option is intended for developers to improve the use of direct I/O in PostgreSQL. If development goes well, direct I/O will be officially supported in the near future.


9.6. WAL Writer Process

The WAL writer is a background process that periodically checks the WAL buffer and writes any unwritten XLOG records to the WAL segments. This process helps to avoid bursts of XLOG record writing. If the WAL writer were not enabled, the writing of XLOG records could become a bottleneck when a large amount of data is committed at once.

The WAL writer is enabled by default and cannot be disabled. The check interval is set by the wal_writer_delay configuration parameter, which defaults to 200 milliseconds.
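For reference, the relevant settings look like this in postgresql.conf (the values shown are the defaults; wal_writer_flush_after is available in versions 9.6 or later):

```
wal_writer_delay = 200ms        # interval between WAL writer wake-ups
wal_writer_flush_after = 1MB    # amount of WAL written before the writer issues a flush
```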

9.7. Checkpoint Processing in PostgreSQL

In PostgreSQL, the checkpointer (background) process performs checkpoints. It starts when one of the following occurs:

  1. The interval time set for checkpoint_timeout has elapsed since the previous checkpoint (the default interval is 300 seconds (5 minutes)).
  2. In versions 9.4 or earlier, the number of WAL segment files set for checkpoint_segments has been consumed since the previous checkpoint (the default number is 3).
  3. In versions 9.5 or later, the total size of the WAL segment files in the pg_wal (in versions 9.6 or earlier, pg_xlog) directory has exceeded the value of the parameter max_wal_size (the default value is 1GB (64 files)).
  4. The PostgreSQL server stops in smart or fast mode.
  5. A superuser issues the CHECKPOINT command manually.
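As a reference, the parameters behind triggers (1) and (3) can be set in postgresql.conf; the values shown are the defaults in versions 9.5 or later:

```
checkpoint_timeout = 5min   # maximum time between automatic checkpoints
max_wal_size = 1GB          # a checkpoint is triggered when WAL exceeds this size
```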

In versions 9.1 or earlier, as mentioned in Section 8.6, the background writer process did both checkpointing and dirty-page writing.


In the following subsections, the outline of checkpointing and the pg_control file, which holds the metadata of the current checkpoint, are described.


9.7.1. Outline of the Checkpoint Processing

The checkpoint process has two aspects: preparing for database recovery, and cleaning dirty pages in the shared buffer pool. In this subsection, we will focus on the former aspect and describe its internal processing. See Fig. 9.13 for an overview.

Fig. 9.13. Internal processing of PostgreSQL Checkpoint.
  • (1) When a checkpoint process starts, the REDO point is stored in memory. The REDO point is the location of the XLOG record that was written at the moment the latest checkpoint was started. It is the starting point for database recovery.
  • (2) An XLOG record of this checkpoint (i.e., the checkpoint record) is written to the WAL buffer. The data portion of the record is defined by the CheckPoint structure, which contains several variables such as the REDO point stored in step (1).
    The location where the checkpoint record is written is also called the checkpoint.
  • (3) All data in shared memory (e.g., the contents of the CLOG) is flushed to storage.
  • (4) All dirty pages in the shared buffer pool are gradually written and flushed to the storage.
  • (5) The pg_control file is updated. This file contains fundamental information such as the location where the checkpoint record was written (a.k.a. the checkpoint location). We will discuss this file in more detail later.

To summarize the description above from the perspective of database recovery, checkpointing creates a checkpoint record that contains the REDO point, and stores the checkpoint location and other information in the pg_control file. This allows PostgreSQL to recover itself by locating the latest checkpoint record via the pg_control file, obtaining the REDO point from that record, and replaying WAL data from there.
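To make the recovery-related bookkeeping concrete, the sequence above can be sketched as a toy model in Python. This is not PostgreSQL's implementation: the CheckPoint and PgControl classes, the wal list, and the LSN values are all illustrative stand-ins.

```python
from dataclasses import dataclass

@dataclass
class CheckPoint:        # data portion of a checkpoint XLOG record (simplified)
    redo: int            # the REDO point stored in step (1)

@dataclass
class PgControl:         # a few fields of the pg_control file (simplified)
    state: str = "in production"
    checkpoint_location: int = 0   # where the latest checkpoint record was written

wal = []                 # stands in for the WAL segment files

def do_checkpoint(pg_control: PgControl, current_lsn: int) -> PgControl:
    redo_point = current_lsn                      # (1) remember the REDO point
    wal.append(CheckPoint(redo=redo_point))       # (2) write the checkpoint record
    checkpoint_location = len(wal) - 1            #     this position is "the checkpoint"
    # (3)(4) flushing shared memory and dirty pages is omitted in this sketch
    pg_control.checkpoint_location = checkpoint_location   # (5) update pg_control
    return pg_control

pg = do_checkpoint(PgControl(), current_lsn=100)
redo = wal[pg.checkpoint_location].redo   # recovery would replay WAL from this point
```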

9.7.2. pg_control File

As the pg_control file contains the fundamental information of the checkpoint, it is essential for database recovery. If it is corrupted or unreadable, the recovery process cannot start because it cannot obtain a starting point.

Although the pg_control file stores more than 40 items, the three items that will be needed in the next section are shown below:

  • State – The state of the database server at the time the latest checkpoint was started. There are seven states in total:
    • 'start up' means that the system is starting up.
    • 'shut down' means that the system was shut down normally by the shutdown command.
    • 'in production' means that the system is running.
    and so on.
  • Latest checkpoint location – The LSN location of the latest checkpoint record.
  • Prior checkpoint location – The LSN location of the prior checkpoint record. Note that this is deprecated in version 11; the details are described below.

The pg_control file is stored in the global subdirectory under the base directory. Its contents can be displayed using the pg_controldata utility.

postgres> pg_controldata  /usr/local/pgsql/data
pg_control version number:            1300
Catalog version number:               202306141
Database system identifier:           7250496631638317596
Database cluster state:               in production
pg_control last modified:             
Latest checkpoint location:           0/16AF0090
Latest checkpoint's REDO location:    0/16AF0090
Latest checkpoint's REDO WAL file:    000000010000000000000016

... snip ...

Removal of prior checkpoint in PostgreSQL 11

PostgreSQL 11 or later stores only the WAL segments that contain the latest checkpoint or newer. Older segment files, which contain the prior checkpoint, are no longer kept, in order to reduce the disk space used for saving WAL segment files under the pg_wal subdirectory. See this thread for details.


9.8. Database Recovery in PostgreSQL

PostgreSQL implements redo log-based recovery. If the database server crashes, PostgreSQL can restore the database cluster by sequentially replaying the XLOG records in the WAL segment files from the REDO point.

We have already talked about database recovery several times up to this section. Here, I will describe two things about recovery that have not been explained yet.

The first thing is how PostgreSQL starts the recovery process. When PostgreSQL starts up, it first reads the pg_control file. The following are the details of the recovery process from that point. See Fig. 9.14 and the following description.

Fig. 9.14. Details of the recovery process.
  • (1) PostgreSQL reads all items in the pg_control file when it starts. If the state item is 'in production', PostgreSQL enters recovery-mode because this means that the database was not shut down normally. If it is 'shut down', PostgreSQL enters normal startup-mode.
  • (2) PostgreSQL reads the latest checkpoint record, the location of which is written in the pg_control file, from the appropriate WAL segment file. It then gets the REDO point from the record. If the latest checkpoint record is invalid, PostgreSQL reads the one prior to it. If both records are unreadable, it gives up recovering by itself. (Note that the prior checkpoint is not stored in PostgreSQL 11 or later.)
  • (3) The appropriate resource managers read and replay XLOG records in sequence from the REDO point until they reach the end of the latest WAL segment. When an XLOG record is replayed and if it is a backup block, it is overwritten on the corresponding table page regardless of its LSN. Otherwise, a (non-backup block) XLOG record is replayed only if the LSN of the record is greater than the pd_lsn of the corresponding page.
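The decision rule in step (3) can be sketched as follows; the record and page representations here are illustrative, not PostgreSQL's actual structures:

```python
def should_replay(record: dict, page_pd_lsn: int) -> bool:
    """Decide whether recovery applies an XLOG record to a page."""
    if record["is_backup_block"]:
        return True                      # backup blocks are always restored
    return record["lsn"] > page_pd_lsn   # non-backup: only strictly newer records

# Suppose the page on storage was last flushed when its pd_lsn was 1.
page_pd_lsn = 1
records = [
    {"lsn": 1, "is_backup_block": False},   # already reflected in the page
    {"lsn": 2, "is_backup_block": False},   # not yet reflected
    {"lsn": 1, "is_backup_block": True},    # full-page image: always replayed
]
decisions = [should_replay(r, page_pd_lsn) for r in records]   # [False, True, True]
```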

The second point is about the comparison of LSNs: why the non-backup block's LSN and the corresponding page's pd_lsn should be compared. Unlike the previous examples, this will be explained using a specific example that emphasizes the need for this comparison. See Figs. 9.15 and 9.16. (Note that the WAL buffer is omitted to simplify the description.)

Fig. 9.15. Insertion operations during the background writer working.
  • (1) PostgreSQL inserts a tuple into the TABLE_A, and writes an XLOG record at LSN_1.
  • (2) The background-writer process writes the TABLE_A page to storage. At this point, this page's pd_lsn is LSN_1.
  • (3) PostgreSQL inserts a new tuple into the TABLE_A, and writes an XLOG record at LSN_2. The modified page has not been written to storage yet.

Unlike the examples in the overview, the TABLE_A's page has already been written to storage once in this scenario.

Now, suppose that the server is shut down in immediate mode and then restarted.

Fig. 9.16. Database recovery.
  • (1) PostgreSQL loads the first XLOG record and the TABLE_A page, but does not replay the record, because its LSN (LSN_1) is not larger than the pd_lsn of the page (also LSN_1). In fact, it is clear that there is no need to replay it.
  • (2) Next, PostgreSQL replays the second XLOG record because the LSN of the record (LSN_2) is greater than the current pd_lsn of the TABLE_A page (LSN_1).

As you can see from this example, if the replay order of non-backup blocks is incorrect or non-backup blocks are replayed more than once, the database cluster will no longer be consistent. In short, the redo (replay) operation of a non-backup block is not idempotent. Therefore, to preserve the correct replay order, non-backup block records should be replayed only if their LSN is greater than the corresponding page's pd_lsn.

On the other hand, as the redo operation of a backup block is idempotent, backup blocks can be replayed any number of times regardless of their LSN.
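This difference can be illustrated with a toy model (the page and record representations below are invented for illustration): a backup-block record carries a full page image and simply overwrites the page, while a non-backup record applies an incremental change.

```python
def replay_backup_block(page, record):
    return list(record["full_page_image"])    # overwrite the whole page: idempotent

def replay_non_backup(page, record):
    return page + [record["new_tuple"]]       # append a tuple: NOT idempotent

page = ["tuple1"]
backup = {"full_page_image": ["tuple1", "tuple2"]}
incremental = {"new_tuple": "tuple2"}

# Replaying a backup block a second time leaves the page unchanged.
once = replay_backup_block(page, backup)
assert replay_backup_block(once, backup) == once

# Replaying a non-backup record a second time corrupts the page.
applied = replay_non_backup(page, incremental)        # ['tuple1', 'tuple2']
reapplied = replay_non_backup(applied, incremental)   # 'tuple2' inserted twice
```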

9.9. WAL Segment Files Management

PostgreSQL writes XLOG records to one of the WAL segment files stored in the pg_wal subdirectory (in versions 9.6 or earlier, pg_xlog subdirectory). A new WAL segment file is switched in if the old one has been filled up. The number of WAL files varies depending on several configuration parameters, as well as server activity. In addition, the management policy for WAL segment files has been improved in version 9.5.

The following subsections describe how WAL segment files are switched and managed.

9.9.1. WAL Segment Switches

WAL segment switches occur when one of the following events happens:

  1. A WAL segment file has been filled up.
  2. The function pg_switch_wal (pg_switch_xlog in versions 9.6 or earlier) is invoked.
  3. archive_mode is enabled and the time set in archive_timeout has been exceeded.

When a WAL segment file is switched, it is usually recycled (renamed and reused) for future use. However, it may be removed later if it is not needed.

9.9.2. WAL Segment Management

Whenever a checkpoint starts, PostgreSQL estimates and prepares the number of WAL segment files that will be needed for the next checkpoint cycle. This estimate is based on the number of WAL segment files that were consumed in previous checkpoint cycles.

The number of WAL segment files is counted from the segment that contains the prior REDO point, and the value must be between the min_wal_size parameter (which defaults to 80 MB, or 5 files) and the max_wal_size parameter (which defaults to 1 GB, or 64 files).
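With the default 16 MB WAL segment size, those byte values translate into the file counts quoted above:

```python
SEGMENT_SIZE_MB = 16                            # default WAL segment size

min_wal_size_files = 80 // SEGMENT_SIZE_MB      # 80 MB  -> 5 files
max_wal_size_files = 1024 // SEGMENT_SIZE_MB    # 1 GB   -> 64 files
```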

If a checkpoint starts, PostgreSQL will keep or recycle the necessary WAL segment files, and remove any unnecessary files.

A specific example is shown in Fig. 9.17. Assuming that there are six WAL segment files before a checkpoint starts, WAL_3 contains the prior REDO point (in versions 10 or earlier; in versions 11 or later, the REDO point), and PostgreSQL estimates that five files will be needed. In this case, WAL_1 will be renamed as WAL_7 for recycling and WAL_2 will be removed.


The files older than the one that contains the prior REDO point can be removed, because, as is clear from the recovery mechanism described in Section 9.8, they would never be used.


Fig. 9.17. Recycling and removing WAL segment files at a checkpoint.

If more WAL segment files are required due to a spike in WAL activity, new WAL segment files will be created while the total size of the WAL segment files is less than the max_wal_size parameter. For example, in Fig. 9.18, if WAL_7 has been filled up, WAL_8 will be newly created.

Fig. 9.18. Creating WAL segment file.

The number of WAL segment files adapts to the server activity. If the amount of WAL data writing has been constantly increasing, the estimated number of WAL segment files, as well as the total size of the WAL segment files, will gradually increase. In the opposite case (i.e., the amount of WAL data writing has decreased), these will decrease.

If the total size of the WAL segment files exceeds the max_wal_size parameter, a checkpoint will be started. Figure 9.19 illustrates this situation. By checkpointing, a new REDO point will be created and the prior REDO point will be discarded; then, unnecessary old WAL segment files will be recycled. In this way, PostgreSQL will always keep just the WAL segment files that are needed for database recovery.

Fig. 9.19. Checkpointing and recycling WAL segment files.

The wal_keep_size parameter (wal_keep_segments in versions 12 or earlier) and the replication slot feature also affect the number of WAL segment files.

9.10. Continuous Archiving and Archive Logs

Continuous archiving is a feature that copies WAL segment files to an archival area at the time when a WAL segment switch occurs. It is performed by the archiver (background) process. The copied file is called an archive log. This feature is typically used for hot physical backup and PITR (Point-in-Time Recovery), which are described in Chapter 10.

The archiving command is set in the archive_command configuration parameter. For example, the following setting copies WAL segment files to the directory '/home/postgres/archives/' every time a segment switch occurs:

archive_command = 'cp %p /home/postgres/archives/%f'

where the %p placeholder is replaced with the path of the WAL segment file to be archived, and the %f placeholder with its file name only.
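A slightly more defensive variant, in the style shown in the PostgreSQL documentation, refuses to overwrite an archive log that already exists (the directory path here matches the example above):

```
archive_command = 'test ! -f /home/postgres/archives/%f && cp %p /home/postgres/archives/%f'
```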

Fig. 9.20. Continuous archiving.

When the WAL segment file WAL_7 is switched, the file is copied to the archival area as Archive log 7.

The archive_command parameter can be set to any Unix command or tool. This means that you can use the scp command or any other file backup tool to transfer the archive logs to another host, instead of using a simple copy command.


archive_library

In versions 14 or earlier, continuous archiving could only use shell commands. In version 15, PostgreSQL introduced a loadable library feature that allows you to achieve continuous archiving using a library. For more information, see the archive_library and basic_archive documentation.



PostgreSQL does not automatically clean up created archive logs. Therefore, you must properly manage the logs when using this feature. If you do nothing, the number of archive logs will continue to grow.

The pg_archivecleanup utility is one of the useful tools for managing archive log files.

You can also use the find command to delete archive logs. For example, the following command would delete all archive logs that were created more than three days ago:

$ find /home/postgres/archives -mtime +3 -exec rm -f {} \;