9.4. Internal Layout of XLOG Record
An XLOG record comprises a general header portion and associated data portions.
The first subsection describes the header structure. The remaining two subsections explain the data portion structures for versions 9.4 and earlier, and version 9.5 and later, respectively. (The data format changed in version 9.5.)
9.4.1. Header Portion of XLOG Record
All XLOG records have a general header portion defined by the XLogRecord structure. The structure for version 9.4 and earlier is shown below:
typedef struct XLogRecord
{
uint32 xl_tot_len; /* total len of entire record */
TransactionId xl_xid; /* xact id */
uint32 xl_len; /* total len of rmgr data. This variable was removed in ver.9.5. */
uint8 xl_info; /* flag bits, see below */
RmgrId xl_rmid; /* resource manager for this record */
/* 2 bytes of padding here, initialize to zero */
XLogRecPtr xl_prev; /* ptr to previous record in log */
pg_crc32 xl_crc; /* CRC for this record */
} XLogRecord;In versions 9.5 or later, the xl_len variable was removed from the XLogRecord structure to refine the format and reduce the size by a few bytes.
Most variables are self-explanatory.
Both xl_rmid and xl_info relate to resource managers, which are collections of operations for the WAL feature, such as writing and replaying XLOG records. The number of resource managers tends to increase with each PostgreSQL version.
Table 9.1 lists the resource managers:
| Operation | Resource manager |
|---|---|
| Heap tuple operations | RM_HEAP, RM_HEAP2 |
| Index operations | RM_BTREE, RM_HASH, RM_GIN, RM_GIST, RM_SPGIST, RM_BRIN |
| Sequence operations | RM_SEQ |
| Transaction operations | RM_XACT, RM_MULTIXACT, RM_CLOG, RM_XLOG, RM_COMMIT_TS |
| Tablespace operations | RM_SMGR, RM_DBASE, RM_TBLSPC, RM_RELMAP |
| replication and hot standby operations | RM_STANDBY, RM_REPLORIGIN, RM_GENERIC_ID, RM_LOGICALMSG_ID |
Here are representative examples of how resource managers work:
-
When an INSERT statement is executed, the header variables xl_rmid and xl_info are set to RM_HEAP and XLOG_HEAP_INSERT, respectively. During database recovery, PostgreSQL selects the heap_xlog_insert() function from RM_HEAP based on xl_info to replay the record.
-
For an UPDATE statement, xl_info is set to XLOG_HEAP_UPDATE. The heap_xlog_update() function replays the record during recovery.
-
When a transaction commits, xl_rmid and xl_info are set to RM_XACT and XLOG_XACT_COMMIT, respectively. The xact_redo_commit() function replays this record during recovery.
The XLogRecord structure in versions 9.4 or earlier is defined in xlog.h, and in versions 9.5 or later, it is in xlogrecord.h.
The heap_xlog_insert and heap_xlog_update functions are defined in heapam.c; xact_redo_commit is defined in xact.c.
9.4.2. Data Portion of XLOG Record (versions 9.4 or earlier)
The data portion of an XLOG record is classified as either a backup block (containing an entire page) or a non-backup block (containing data that varies by operation).
Figure 9.8. Examples of XLOG records (versions 9.4 or earlier).
The internal layouts of XLOG records are described below using specific examples.
9.4.2.1. Backup Block
A backup block is shown in Figure 9.8(a). It consists of two data structures and one data object:
- The XLogRecord structure (header portion).
- The
BkpBlockstructure. - The entire page, excluding its free space.
The BkpBlock structure contains variables that identify the page in the database cluster (the relfilenode, the fork number of the relation, and the block number).
It also stores the starting position and length of the page’s free space.
9.4.2.2. Non-Backup Block
In non-backup blocks, the layout of the data portion differs depending on the operation. The XLOG record for an INSERT statement is explained here as a representative example. See Figure 9.8(b). In this case, the XLOG record consists of two data structures and one data object:
- The XLogRecord (header portion) structure.
- The
xl_heap_insertstructure. - The inserted tuple, with a few bytes removed.
The xl_heap_insert structure contains variables that identify the inserted tuple in the database cluster (the relfilenode of the table and the tuple’s TID) and a visibility flag for the tuple.
The reason for removing a few bytes from the inserted tuple is described in the source code comment of the xl_heap_header structure:
We don’t store the whole fixed part (HeapTupleHeaderData) of an inserted or updated tuple in WAL; we can save a few bytes by reconstructing the fields that are available elsewhere in the WAL record, or perhaps just plain needn’t be reconstructed.
One more example is shown here. See Figure 9.8(c).
The XLOG record for a checkpoint is simple and consists of two data structures:
- The XLogRecord structure (header portion).
- The CheckPoint structure, which contains checkpoint information (see Section 9.7 for details).
The xl_heap_header structure (versions 9.4 or earlier) is defined in heapam_xlog.h, while the CheckPoint structure is defined in pg_control.h.
9.4.3. Data Portion of XLOG Record (versions 9.5 or later)
In versions 9.4 or earlier, XLOG records had no common format, so each resource manager defined its own. This made it increasingly difficult to maintain the source code and implement new WAL features.
To address this issue, version 9.5 introduced a common structured format independent of resource managers.
The data portion of an XLOG record consists of two parts: header and data. See Figure 9.9.
Figure 9.9. Common XLOG record format.
The header part contains zero or more XLogRecordBlockHeaders and zero or one XLogRecordDataHeaderShort (or XLogRecordDataHeaderLong). It must contain at least one of these.
When a record stores a full-page image (FPI), the XLogRecordBlockHeader includes the XLogRecordBlockImageHeader.
It also includes the XLogRecordBlockCompressHeader if the block is compressed.
The data part consists of zero or more block data and zero or one main data, which correspond to the XLogRecordBlockHeader(s) and the XLogRecordDataHeader, respectively.
In versions 9.5 or later, full-page images within XLOG records can be compressed using the LZ method by setting “wal_compression = enable”. In that case, the XLogRecordBlockCompressHeader structure is added.
This feature provides two advantages: it reduces the I/O cost for writing records and suppresses the consumption of WAL segment files.
The disadvantage is the increased CPU resource consumption required for compression.
Figure 9.10. Examples of XLOG records (versions 9.5 or later).
Some specific examples are shown below.
9.4.3.1. Backup Block
The backup block created by an INSERT statement is shown in Figure 9.10(a). It consists of four data structures and one data object:
- The XLogRecord structure (header-portion).
- The XLogRecordBlockHeader structure, including one XLogRecordBlockImageHeader structure.
- The XLogRecordDataHeaderShort structure.
- A backup block (block data).
- The xl_heap_insert structure (main data).
The XLogRecordBlockHeader structure contains variables to identify the block in the database cluster (the relfilenode, the fork number, and the block number). The XLogRecordBlockImageHeader structure contains the length and offset number of this block.
These two header structures together store the same data as the BkBlock structure used until version 9.4.
Main Data Section
The XLogRecordDataHeaderShort structure stores the length of the xl_heap_insert structure, which serves as the main data of the record.
The content of the Main Data section in a WAL record containing an FPI varies depending on the operation. For example, an UPDATE statement adds structures such as xl_heap_lock or xl_heap_update.
In the context of WAL-based physical recovery, the data in the Main Data section of a backup block is redundant and remains unused, as the FPI itself provides the complete state of the page.
When wal_level is set to logical, the behavior changes: the actual tuple data is explicitly appended to the Main Data section, even if an FPI is present. See Figure 9.11.
Figure 9.11. XLOG record with a Backup Block (wal_level = logical).
This design is crucial for logical replication. It allows the walsender to skip the physical FPI and directly decode the tuple information stored in the Main Data. Consequently, PostgreSQL achieves an efficient decoding process independent of the physical page layout.
This ensures high performance and consistent throughput even during “checkpoint spikes” when FPI generation is frequent.
9.4.3.2. Non-Backup Block
The non-backup block record created by an INSERT statement is shown in Figure 9.10(b). It consists of four data structures and one data object:
- The XLogRecord structure (header-portion).
- The XLogRecordBlockHeader structure.
- The XLogRecordDataHeaderShort structure.
- An inserted tuple (specifically, an xl_heap_header structure and the entire inserted data).
- The
xl_heap_insertstructure (main data).
The XLogRecordBlockHeader structure contains three values (the relfilenode, the fork number, and the block number) to specify the target block, and the length of the inserted tuple’s data portion.
The XLogRecordDataHeaderShort structure contains the length of the xl_heap_insert structure.
The xl_heap_insert structure contains only two values: the offset number of the tuple within the block and a visibility flag. This structure is simplified because the XLogRecordBlockHeader now stores most of the data previously contained in xl_heap_insert.
A checkpoint record is shown in Figure 9.10(c). It consists of three data structures:
- The XLogRecord structure (header-portion).
- The XLogRecordDataHeaderShort structure, which contains the main data length.
- The CheckPoint structure (main data).
The xl_heap_header structure is defined in htup.h and the CheckPoint structure is defined in pg_control.h.
The new XLOG format is optimized for parser efficiency, although it is more complex for human interpretation. Additionally, many XLOG record types are now smaller.
Figures 9.8 and 9.10 show the sizes of the main structures, allowing for the calculation and comparison of record sizes1.
-
While the new checkpoint record is larger than the previous one, it includes more variables. ↩︎