9.5. Writing of XLOG Records

Having finished the warm-up exercises, we are now ready to understand the writing of XLOG records. I will explain it as precisely as possible in this section.

First, issue the following statement to explore the PostgreSQL internals:

testdb=# INSERT INTO tbl VALUES ('A');

Issuing the above statement invokes the internal function exec_simple_query(). The pseudocode of exec_simple_query() is shown below:

exec_simple_query() @postgres.c

(1) ExtendCLOG() @clog.c                  /* Write the state of this transaction
                                           * "IN_PROGRESS" to the CLOG.
                                           */
(2) heap_insert() @heapam.c               /* Insert a tuple, create an XLOG record,
                                           * and invoke the function XLogInsert().
                                           */
(3)   XLogInsert() @xloginsert.c (9.4 or earlier, xlog.c)
                                          /* Write the XLOG record of the inserted tuple
                                           *  to the WAL buffer, and update page's pd_lsn.
                                           */
(4) finish_xact_command() @postgres.c     /* Invoke commit action.*/   
      XLogInsert() @xloginsert.c (9.4 or earlier, xlog.c)
                                          /* Write an XLOG record of this commit action
                                           * to the WAL buffer.
                                           */
(5)   XLogWrite() @xlog.c
                                          /* Write and flush all XLOG records on
                                           * the WAL buffer to WAL segment.
                                           */
(6) TransactionIdCommitTree() @transam.c  /* Change the state of this transaction 
                                           * from "IN_PROGRESS" to "COMMITTED"
                                           * on the CLOG.
                                           */

In the following paragraphs, each line of the pseudocode will be explained to help you understand the writing of XLOG records. See also Figs. 9.11 and 9.12.

  • (1) The function ExtendCLOG() writes the state of this transaction, ‘IN_PROGRESS’, to the (in-memory) CLOG.

  • (2) The function heap_insert() inserts a heap tuple into the target page in the shared buffer pool, creates the XLOG record for that page, and invokes the function XLogInsert().

  • (3) The function XLogInsert() writes the XLOG record created by the heap_insert() to the WAL buffer at LSN_1, and then updates the modified page’s pd_lsn from LSN_0 to LSN_1.

  • (4) The function finish_xact_command(), which is invoked to commit this transaction, creates the XLOG record for the commit action, and then the function XLogInsert() writes this record to the WAL buffer at LSN_2.

The format of these XLOG records is version 9.4.
Fig. 9.11. Write-sequence of XLOG records.

  • (5) The function XLogWrite() writes and flushes all XLOG records on the WAL buffer to the WAL segment file.
    If the parameter wal_sync_method is set to ‘open_sync’ or ‘open_datasync’, the records are written synchronously, because the WAL segment file is opened with the open() system call using the flag ‘O_SYNC’ or ‘O_DSYNC’, respectively.
    If the parameter is set to ‘fsync’, ‘fsync_writethrough’ or ‘fdatasync’, the respective system call – fsync(), fcntl() with the F_FULLFSYNC option, or fdatasync() – is executed after the write. In every case, all XLOG records are guaranteed to be written to storage.

  • (6) The function TransactionIdCommitTree() changes the state of this transaction from ‘IN_PROGRESS’ to ‘COMMITTED’ on the CLOG.
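
The two families of wal_sync_method settings described in step (5) can be illustrated with a small sketch. This is not PostgreSQL code: the function write_durably() and its argument names are hypothetical, and the sketch assumes a POSIX system such as Linux where Python's os module provides O_SYNC, O_DSYNC and fdatasync(). The platform-specific ‘fsync_writethrough’ setting is left out.

```python
import os


def write_durably(path, data, wal_sync_method="fdatasync"):
    """Append data to path and return only after the chosen sync step.

    Hypothetical helper for illustration; assumes a POSIX system.
    """
    methods = ("open_sync", "open_datasync", "fsync", "fdatasync")
    if wal_sync_method not in methods:
        raise ValueError("unsupported method: " + wal_sync_method)

    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND
    # open_* settings: the file itself is opened in synchronous mode,
    # so every write() blocks until the data has been synced.
    if wal_sync_method == "open_sync":
        flags |= os.O_SYNC
    elif wal_sync_method == "open_datasync":
        flags |= os.O_DSYNC

    fd = os.open(path, flags, 0o600)
    try:
        # A short write is possible in general; this sketch assumes
        # small payloads that are written in a single call.
        os.write(fd, data)
        # Other settings: an ordinary write followed by an explicit sync call.
        if wal_sync_method == "fsync":
            os.fsync(fd)
        elif wal_sync_method == "fdatasync":
            os.fdatasync(fd)
    finally:
        os.close(fd)
```

With the ‘open_*’ settings the synchronization is a property of the file descriptor, whereas with the other settings it is a separate system call issued after the write.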

Fig. 9.12. Write-sequence of XLOG records. (continued from Fig. 9.11)
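
The six steps above can be summarized as a toy model. The sketch below is only an illustration under simplifying assumptions: the names Page, ToyWal and insert_and_commit() are hypothetical, an LSN is modeled as a simple counter rather than a byte position in the WAL stream, and a Python list stands in for the WAL segment file.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    tuples: list = field(default_factory=list)
    pd_lsn: int = 0  # corresponds to LSN_0 in the text


@dataclass
class ToyWal:
    buffer: list = field(default_factory=list)   # in-memory WAL buffer
    segment: list = field(default_factory=list)  # stands in for the WAL segment file
    next_lsn: int = 1

    def xlog_insert(self, record):
        """Append a record to the WAL buffer and return its LSN."""
        lsn = self.next_lsn
        self.next_lsn += 1
        self.buffer.append((lsn, record))
        return lsn

    def xlog_write(self):
        """Move every buffered record to the segment (write and flush)."""
        self.segment.extend(self.buffer)
        self.buffer.clear()


def insert_and_commit(xid, value, page, wal, clog):
    clog[xid] = "IN_PROGRESS"                               # (1)
    page.tuples.append(value)                               # (2)
    page.pd_lsn = wal.xlog_insert(("INSERT", xid, value))   # (3)
    wal.xlog_insert(("COMMIT", xid))                        # (4)
    wal.xlog_write()                                        # (5)
    clog[xid] = "COMMITTED"                                 # (6)


page, wal, clog = Page(), ToyWal(), {}
insert_and_commit(100, "A", page, wal, clog)
print(page.pd_lsn, wal.segment, clog)
```

Note the ordering in the sketch: the commit record reaches the segment in step (5) before the CLOG entry is changed to ‘COMMITTED’ in step (6).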

In the above example, the commit action caused the writing of XLOG records to the WAL segment, but such writing may be caused by any of the following:

  1. A running transaction has committed or aborted.

  2. The WAL buffer has filled up with XLOG records. (The WAL buffer size is set by the parameter wal_buffers.)

  3. A WAL writer process writes periodically. (See the next section.)

If any of the above occurs, all WAL records on the WAL buffer are written into a WAL segment file regardless of whether their transactions have been committed or not.
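
The three triggers, and the fact that a flush carries records of uncommitted transactions along with it, can be sketched as follows. This is a toy model, not PostgreSQL code: the class ToyWalBuffer and its methods are hypothetical, and the capacity is counted in records here purely for simplicity.

```python
class ToyWalBuffer:
    def __init__(self, capacity):
        self.capacity = capacity  # stands in for wal_buffers
        self.buffer = []          # records not yet written
        self.segment = []         # stands in for the WAL segment file

    def flush(self):
        """Write all buffered records, committed or not, to the segment."""
        self.segment.extend(self.buffer)
        self.buffer.clear()

    def insert(self, xid, kind):
        # Trigger 2: the buffer is full, so make room before inserting.
        if len(self.buffer) >= self.capacity:
            self.flush()
        self.buffer.append((xid, kind))
        # Trigger 1: a transaction has committed or aborted.
        if kind in ("COMMIT", "ABORT"):
            self.flush()

    def wal_writer_tick(self):
        # Trigger 3: the periodic write by the WAL writer process.
        self.flush()


wal = ToyWalBuffer(capacity=8)
wal.insert(1, "INSERT")   # transaction 1 stays uncommitted
wal.insert(2, "INSERT")
wal.insert(2, "COMMIT")   # the commit of transaction 2 flushes everything
print(wal.segment)
```

After the commit of transaction 2, the record of the still-running transaction 1 is also in the segment, which is the behavior described above.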

It is taken for granted that DML (Data Manipulation Language) operations write XLOG records, but so do non-DML operations. As described above, a commit action writes an XLOG record that contains the id of the committed transaction. Another example is a checkpoint action, which writes an XLOG record that contains general information about the checkpoint.

Furthermore, a SELECT statement can create XLOG records in special cases, although it usually does not. For example, if HOT (Heap Only Tuple) processing removes unnecessary tuples and defragments the remaining tuples in a page during a SELECT statement, the XLOG records for the modified pages are written to the WAL buffer.

Direct I/O

PostgreSQL versions 15 and earlier do not support direct I/O, although it has been discussed. Refer to this discussion on the pgsql-ML and this article.

In version 16, the debug_io_direct option was added. This option is intended for developers working to improve the use of direct I/O in PostgreSQL. If development goes well, direct I/O may be officially supported in the future.