9.4. Internal Layout of XLOG Record

An XLOG record comprises a general header portion and associated data portions.

The first subsection describes the header structure. The remaining two subsections explain the data portion structures for versions 9.4 and earlier, and version 9.5 and later, respectively. (The data format changed in version 9.5.)

9.4.1. Header Portion of XLOG Record

All XLOG records have a general header portion defined by the XLogRecord structure. The structure for version 9.4 and earlier is shown below:

typedef struct XLogRecord
{
   uint32          xl_tot_len;   /* total len of entire record */
   TransactionId   xl_xid;       /* xact id */
   uint32          xl_len;       /* total len of rmgr data. This variable was removed in ver.9.5. */
   uint8           xl_info;      /* flag bits, see below */
   RmgrId          xl_rmid;      /* resource manager for this record */
   /* 2 bytes of padding here, initialize to zero */
   XLogRecPtr      xl_prev;      /* ptr to previous record in log */
   pg_crc32        xl_crc;       /* CRC for this record */
} XLogRecord;

The Header Portion of XLOG Record in versions 9.5 or later.

In versions 9.5 or later, the xl_len variable was removed from the XLogRecord structure to refine the format and reduce the size by a few bytes.

typedef struct XLogRecord
{
        uint32          xl_tot_len;             /* total len of entire record */
        TransactionId   xl_xid;                 /* xact id */
        XLogRecPtr      xl_prev;                /* ptr to previous record in log */
        uint8           xl_info;                /* flag bits, see below */
        RmgrId          xl_rmid;                /* resource manager for this record */
        /* 2 bytes of padding here, initialize to zero */
        pg_crc32c       xl_crc;                 /* CRC for this record */
        /* XLogRecordBlockHeaders and XLogRecordDataHeader follow, no padding */
} XLogRecord;
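
The size effect of dropping xl_len and reordering the fields can be checked with a small sketch. The two structs below are stand-ins that mirror the listings above with fixed-width types; the typedef widths (4-byte TransactionId, 8-byte XLogRecPtr, 4-byte CRC) and the names XLogRecordV94, XLogRecordV95, header_bytes_v94 and header_bytes_v95 are assumptions of this example, not PostgreSQL definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for PostgreSQL's typedefs (assumed widths). */
typedef uint32_t TransactionId;
typedef uint8_t  RmgrId;
typedef uint64_t XLogRecPtr;
typedef uint32_t pg_crc32;

/* Header layout up to version 9.4 (mirrors the first listing). */
typedef struct XLogRecordV94
{
    uint32_t      xl_tot_len;
    TransactionId xl_xid;
    uint32_t      xl_len;
    uint8_t       xl_info;
    RmgrId        xl_rmid;
    /* 2 bytes of padding here */
    XLogRecPtr    xl_prev;
    pg_crc32      xl_crc;
} XLogRecordV94;

/* Header layout from version 9.5 on (mirrors the second listing). */
typedef struct XLogRecordV95
{
    uint32_t      xl_tot_len;
    TransactionId xl_xid;
    XLogRecPtr    xl_prev;
    uint8_t       xl_info;
    RmgrId        xl_rmid;
    /* 2 bytes of padding here */
    pg_crc32      xl_crc;
} XLogRecordV95;

/* Bytes from the start of the header through the end of xl_crc. */
size_t header_bytes_v94(void)
{
    return offsetof(XLogRecordV94, xl_crc) + sizeof(pg_crc32);
}

size_t header_bytes_v95(void)
{
    /* xl_prev sits right after the two 4-byte fields, so no gap is needed */
    assert(offsetof(XLogRecordV95, xl_prev) == 8);
    return offsetof(XLogRecordV95, xl_crc) + sizeof(pg_crc32);
}
```

Under these assumptions the older header occupies 28 bytes up to the end of xl_crc and the newer one 24. On platforms that align 8-byte integers to 8 bytes, sizeof(XLogRecordV94) would additionally round up to 32.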

Most variables are self-explanatory.

Both xl_rmid and xl_info relate to resource managers, which are collections of operations for the WAL feature, such as writing and replaying XLOG records. The number of resource managers tends to increase with each PostgreSQL version.

Table 9.1 lists the resource managers:

Table 9.1: Resource managers in version 18.

Operation                               Resource manager
--------------------------------------  ------------------------------------------------------------
Heap tuple operations                   RM_HEAP, RM_HEAP2
Index operations                        RM_BTREE, RM_HASH, RM_GIN, RM_GIST, RM_SPGIST, RM_BRIN
Sequence operations                     RM_SEQ
Transaction operations                  RM_XACT, RM_MULTIXACT, RM_CLOG, RM_XLOG, RM_COMMIT_TS
Tablespace operations                   RM_SMGR, RM_DBASE, RM_TBLSPC, RM_RELMAP
Replication and hot standby operations  RM_STANDBY, RM_REPLORIGIN, RM_GENERIC, RM_LOGICALMSG

Here are representative examples of how resource managers work:

  • When an INSERT statement is executed, the header variables xl_rmid and xl_info are set to RM_HEAP and XLOG_HEAP_INSERT, respectively. During database recovery, PostgreSQL selects the heap_xlog_insert() function from RM_HEAP based on xl_info to replay the record.

  • For an UPDATE statement, xl_info is set to XLOG_HEAP_UPDATE. The heap_xlog_update() function replays the record during recovery.

  • When a transaction commits, xl_rmid and xl_info are set to RM_XACT and XLOG_XACT_COMMIT, respectively. The xact_redo_commit() function replays this record during recovery.
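
The two-level selection described in these examples (first by xl_rmid, then by xl_info) can be sketched as a pair of switch statements. This is a simplified illustration, not PostgreSQL's recovery code: the numeric IDs and flag values are assumed for the example, redo_function_name and heap_redo_name are hypothetical helpers, and the functions return only the name of the replay routine rather than calling it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative values only; the authoritative ones live in PostgreSQL's headers. */
enum { RM_XACT = 1, RM_HEAP = 10 };

#define XLOG_HEAP_INSERT  0x00
#define XLOG_HEAP_UPDATE  0x20
#define XLOG_HEAP_OPMASK  0x70   /* bits of xl_info that select the heap operation */
#define XLOG_XACT_COMMIT  0x00
#define XLOG_XACT_OPMASK  0x70   /* bits of xl_info that select the xact operation */

/* Second level: the heap resource manager picks a routine from xl_info. */
static const char *heap_redo_name(uint8_t xl_info)
{
    switch (xl_info & XLOG_HEAP_OPMASK)
    {
        case XLOG_HEAP_INSERT: return "heap_xlog_insert";
        case XLOG_HEAP_UPDATE: return "heap_xlog_update";
        default:               return "unhandled";
    }
}

/* First level: xl_rmid selects the resource manager. */
const char *redo_function_name(uint8_t xl_rmid, uint8_t xl_info)
{
    switch (xl_rmid)
    {
        case RM_HEAP:
            return heap_redo_name(xl_info);
        case RM_XACT:
            if ((xl_info & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT)
                return "xact_redo_commit";
            return "unhandled";
        default:
            return "unhandled";
    }
}
```

For instance, redo_function_name(RM_HEAP, XLOG_HEAP_UPDATE) yields "heap_xlog_update" in this sketch.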

Info

The XLogRecord structure in versions 9.4 or earlier is defined in xlog.h, and in versions 9.5 or later, it is in xlogrecord.h.

The heap_xlog_insert and heap_xlog_update functions are defined in heapam.c; xact_redo_commit is defined in xact.c.

9.4.2. Data Portion of XLOG Record (versions 9.4 or earlier)

The data portion of an XLOG record is classified as either a backup block (containing an entire page) or a non-backup block (containing data that varies by operation).

Figure 9.8. Examples of XLOG records (versions 9.4 or earlier).

The internal layouts of XLOG records are described below using specific examples.

9.4.2.1. Backup Block

A backup block is shown in Figure 9.8(a). It consists of two data structures and one data object:

  1. The XLogRecord structure (header portion).
  2. The BkpBlock structure.
  3. The entire page, excluding its free space.

The BkpBlock structure contains variables that identify the page in the database cluster (the relfilenode, the fork number of the relation, and the block number). It also stores the starting position and length of the page’s free space.

/* defined in include/access/xlog_internal.h */
typedef struct BkpBlock
{
  RelFileNode node;        /* relation containing block */
  ForkNumber  fork;        /* fork within the relation */
  BlockNumber block;       /* block number */
  uint16      hole_offset; /* number of bytes before "hole" */
  uint16      hole_length; /* number of bytes in "hole" */

  /* ACTUAL BLOCK DATA FOLLOWS AT END OF STRUCT */
} BkpBlock;
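
Because the free space (the "hole") is left out of the stored page, replay has to put it back. A minimal sketch of that reconstruction follows, assuming the default 8192-byte page size; HoleInfo and restore_page are hypothetical names that carry only the two hole fields of BkpBlock.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192   /* default page size; an assumption of this sketch */

/* Only the two hole fields of BkpBlock matter here. */
typedef struct HoleInfo
{
    uint16_t hole_offset;   /* number of bytes before the hole */
    uint16_t hole_length;   /* number of bytes in the hole */
} HoleInfo;

/*
 * Rebuild a full page from the block data that follows BkpBlock.
 * 'stored' holds BLCKSZ - hole_length bytes; 'page' must have room for BLCKSZ.
 */
void restore_page(const HoleInfo *bkpb, const char *stored, char *page)
{
    size_t before = bkpb->hole_offset;
    size_t hole   = bkpb->hole_length;

    assert(before + hole <= BLCKSZ);

    /* bytes in front of the hole are stored verbatim */
    memcpy(page, stored, before);
    /* the hole itself is regenerated as zero bytes */
    memset(page + before, 0, hole);
    /* the rest of the page follows the pre-hole bytes in the record */
    memcpy(page + before + hole, stored + before, BLCKSZ - (before + hole));
}
```

With hole_offset = 100 and hole_length = 8000, only 192 bytes of page content would need to be stored in the record.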

9.4.2.2. Non-Backup Block

In non-backup blocks, the layout of the data portion differs depending on the operation. The XLOG record for an INSERT statement is explained here as a representative example. See Figure 9.8(b). In this case, the XLOG record consists of two data structures and one data object:

  1. The XLogRecord structure (header portion).
  2. The xl_heap_insert structure.
  3. The inserted tuple, with a few bytes removed.

The xl_heap_insert structure contains variables that identify the inserted tuple in the database cluster (the relfilenode of the table and the tuple’s TID) and a visibility flag for the tuple.

typedef struct BlockIdData
{
   uint16          bi_hi;
   uint16          bi_lo;
} BlockIdData;

typedef uint16 OffsetNumber;

typedef struct ItemPointerData
{
   BlockIdData     ip_blkid;
   OffsetNumber    ip_posid;
} ItemPointerData;

typedef struct RelFileNode
{
   Oid             spcNode;             /* tablespace */
   Oid             dbNode;              /* database */
   Oid             relNode;             /* relation */
} RelFileNode;

typedef struct xl_heaptid
{
   RelFileNode     node;
   ItemPointerData tid;                 /* changed tuple id */
} xl_heaptid;

typedef struct xl_heap_insert
{
   xl_heaptid      target;              /* inserted tuple id */
   bool            all_visible_cleared; /* PD_ALL_VISIBLE was cleared */
} xl_heap_insert;
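
The TID stored in xl_heaptid keeps its block number as two 16-bit halves (bi_hi and bi_lo). The sketch below shows how the halves combine into a 32-bit block number; the struct definitions repeat the ones above with fixed-width types, and tid_block_number is a hypothetical helper introduced for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct BlockIdData
{
    uint16_t bi_hi;   /* upper 16 bits of the block number */
    uint16_t bi_lo;   /* lower 16 bits of the block number */
} BlockIdData;

typedef uint16_t OffsetNumber;

typedef struct ItemPointerData
{
    BlockIdData  ip_blkid;
    OffsetNumber ip_posid;
} ItemPointerData;

/* Combine the two halves into the 32-bit block number of the tuple. */
uint32_t tid_block_number(const ItemPointerData *tid)
{
    assert(tid != NULL);
    return ((uint32_t) tid->ip_blkid.bi_hi << 16) | tid->ip_blkid.bi_lo;
}
```

The pair (tid_block_number(&tid), tid.ip_posid) then identifies the page and the line pointer of the inserted tuple.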

Info

The reason for removing a few bytes from the inserted tuple is described in the source code comment of the xl_heap_header structure:

We don’t store the whole fixed part (HeapTupleHeaderData) of an inserted or updated tuple in WAL; we can save a few bytes by reconstructing the fields that are available elsewhere in the WAL record, or perhaps just plain needn’t be reconstructed.

One more example is shown here. See Figure 9.8(c).

The XLOG record for a checkpoint is simple and consists of two data structures:

  1. The XLogRecord structure (header portion).
  2. The CheckPoint structure, which contains checkpoint information (see Section 9.7 for details).

Info

The xl_heap_header structure (versions 9.4 or earlier) is defined in heapam_xlog.h, while the CheckPoint structure is defined in pg_control.h.

9.4.3. Data Portion of XLOG Record (versions 9.5 or later)

In versions 9.4 or earlier, XLOG records had no common format, so each resource manager defined its own. This made it increasingly difficult to maintain the source code and implement new WAL features.

To address this issue, version 9.5 introduced a common structured format independent of resource managers.

The data portion of an XLOG record consists of two parts: header and data. See Figure 9.9.

Figure 9.9. Common XLOG record format.

The header part contains zero or more XLogRecordBlockHeaders and zero or one XLogRecordDataHeaderShort (or XLogRecordDataHeaderLong). It must contain at least one of these.
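
Which of the two data headers is written depends on the main data length: according to the comments in xlogrecord.h, the short form with a one-byte length covers payloads under 256 bytes, and the long form carries an unaligned uint32 otherwise. A minimal sketch of that choice follows. The ID values 255 and 254 are given as I understand the header file and should be treated as assumptions, and write_main_data_header is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed ID values for the main-data headers. */
#define XLR_BLOCK_ID_DATA_SHORT 255
#define XLR_BLOCK_ID_DATA_LONG  254

/*
 * Write the main-data header into buf (room for at least 5 bytes)
 * and return the number of bytes written.
 */
size_t write_main_data_header(uint8_t *buf, uint32_t main_data_len)
{
    assert(buf != NULL);

    if (main_data_len == 0)
        return 0;                       /* no main data, so no data header */

    if (main_data_len < 256)
    {
        buf[0] = XLR_BLOCK_ID_DATA_SHORT;
        buf[1] = (uint8_t) main_data_len;
        return 2;                       /* id byte + one length byte */
    }

    buf[0] = XLR_BLOCK_ID_DATA_LONG;
    /* the length is unaligned, so it is copied byte-wise in native order */
    memcpy(buf + 1, &main_data_len, sizeof(uint32_t));
    return 1 + sizeof(uint32_t);        /* id byte + four length bytes */
}
```

A 3-byte main data payload would therefore cost 2 header bytes, while a 300-byte payload would cost 5.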

When a record stores a full-page image (FPI), the XLogRecordBlockHeader includes the XLogRecordBlockImageHeader. It also includes the XLogRecordBlockCompressHeader if the block is compressed.

/*
 * Header info for block data appended to an XLOG record.
 *
 * 'data_length' is the length of the rmgr-specific payload data associated
 * with this block. It does not include the possible full page image, nor
 * XLogRecordBlockHeader struct itself.
 *
 * Note that we don't attempt to align the XLogRecordBlockHeader struct!
 * So, the struct must be copied to aligned local storage before use.
 */
typedef struct XLogRecordBlockHeader
{
	uint8		id;				/* block reference ID */
	uint8		fork_flags;		/* fork within the relation, and flags */
	uint16		data_length;	/* number of payload bytes (not including page
								 * image) */

	/* If BKPBLOCK_HAS_IMAGE, an XLogRecordBlockImageHeader struct follows */
	/* If BKPBLOCK_SAME_REL is not set, a RelFileLocator follows */
	/* BlockNumber follows */
} XLogRecordBlockHeader;

/*
 * The fork number fits in the lower 4 bits in the fork_flags field. The upper
 * bits are used for flags.
 */
#define BKPBLOCK_FORK_MASK	0x0F
#define BKPBLOCK_FLAG_MASK	0xF0
#define BKPBLOCK_HAS_IMAGE	0x10	/* block data is an XLogRecordBlockImage */
#define BKPBLOCK_HAS_DATA	0x20
#define BKPBLOCK_WILL_INIT	0x40	/* redo will re-init the page */
#define BKPBLOCK_SAME_REL	0x80	/* RelFileLocator omitted, same as
									 * previous */

/*
 * Additional header information when a full-page image is included
 * (i.e. when BKPBLOCK_HAS_IMAGE is set).
 *
 * The XLOG code is aware that PG data pages usually contain an unused "hole"
 * in the middle, which contains only zero bytes.  Since we know that the
 * "hole" is all zeros, we remove it from the stored data (and it's not counted
 * in the XLOG record's CRC, either).  Hence, the amount of block data actually
 * present is (BLCKSZ - <length of "hole" bytes>).
 *
 * Additionally, when wal_compression is enabled, we will try to compress full
 * page images using one of the supported algorithms, after removing the
 * "hole". This can reduce the WAL volume, but at some extra cost of CPU spent
 * on the compression during WAL logging. In this case, since the "hole"
 * length cannot be calculated by subtracting the number of page image bytes
 * from BLCKSZ, basically it needs to be stored as an extra information.
 * But when no "hole" exists, we can assume that the "hole" length is zero
 * and no such an extra information needs to be stored. Note that
 * the original version of page image is stored in WAL instead of the
 * compressed one if the number of bytes saved by compression is less than
 * the length of extra information. Hence, when a page image is successfully
 * compressed, the amount of block data actually present is less than
 * BLCKSZ - the length of "hole" bytes - the length of extra information.
 */
typedef struct XLogRecordBlockImageHeader
{
	uint16		length;			/* number of page image bytes */
	uint16		hole_offset;	/* number of bytes before "hole" */
	uint8		bimg_info;		/* flag bits, see below */

	/*
	 * If BKPIMAGE_HAS_HOLE and BKPIMAGE_COMPRESSED(), an
	 * XLogRecordBlockCompressHeader struct follows.
	 */
} XLogRecordBlockImageHeader;

/* Information stored in bimg_info */
#define BKPIMAGE_HAS_HOLE		0x01	/* page image has "hole" */
#define BKPIMAGE_APPLY			0x02	/* page image should be restored
										 * during replay */
/* compression methods supported */
#define BKPIMAGE_COMPRESS_PGLZ	0x04
#define BKPIMAGE_COMPRESS_LZ4	0x08
#define BKPIMAGE_COMPRESS_ZSTD	0x10

#define	BKPIMAGE_COMPRESSED(info) \
	((info & (BKPIMAGE_COMPRESS_PGLZ | BKPIMAGE_COMPRESS_LZ4 | \
			  BKPIMAGE_COMPRESS_ZSTD)) != 0)

/*
 * Extra header information used when page image has "hole" and
 * is compressed.
 */
typedef struct XLogRecordBlockCompressHeader
{
	uint16		hole_length;	/* number of bytes in "hole" */
} XLogRecordBlockCompressHeader;

/*
 * XLogRecordDataHeaderShort/Long are used for the "main data" portion of
 * the record. If the length of the data is less than 256 bytes, the short
 * form is used, with a single byte to hold the length. Otherwise the long
 * form is used.
 *
 * (These structs are currently not used in the code, they are here just for
 * documentation purposes).
 */
typedef struct XLogRecordDataHeaderShort
{
	uint8		id;				/* XLR_BLOCK_ID_DATA_SHORT */
	uint8		data_length;	/* number of payload bytes */
}			XLogRecordDataHeaderShort;

#define SizeOfXLogRecordDataHeaderShort (sizeof(uint8) * 2)

typedef struct XLogRecordDataHeaderLong
{
	uint8		id;				/* XLR_BLOCK_ID_DATA_LONG */
	/* followed by uint32 data_length, unaligned */
}			XLogRecordDataHeaderLong;

#define SizeOfXLogRecordDataHeaderLong (sizeof(uint8) + sizeof(uint32))

The data part consists of zero or more block data and zero or one main data, which correspond to the XLogRecordBlockHeader(s) and the XLogRecordDataHeader, respectively.
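
The optional pieces that follow an XLogRecordBlockHeader are governed by the flag bits in the listings above. The sketch below adds up the header bytes for one block reference. The flag values are copied from those listings; the byte sizes (4 for the block header, 5 for the image header, 2 for the compress header, 12 for a RelFileLocator, 4 for a BlockNumber) are assumptions based on the field widths, and block_header_size and block_fork_number are hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BKPBLOCK_FORK_MASK      0x0F
#define BKPBLOCK_HAS_IMAGE      0x10
#define BKPBLOCK_HAS_DATA       0x20
#define BKPBLOCK_SAME_REL       0x80

#define BKPIMAGE_HAS_HOLE       0x01
#define BKPIMAGE_APPLY          0x02
#define BKPIMAGE_COMPRESS_PGLZ  0x04
#define BKPIMAGE_COMPRESS_LZ4   0x08
#define BKPIMAGE_COMPRESS_ZSTD  0x10
#define BKPIMAGE_COMPRESSED(info) \
    (((info) & (BKPIMAGE_COMPRESS_PGLZ | BKPIMAGE_COMPRESS_LZ4 | \
                BKPIMAGE_COMPRESS_ZSTD)) != 0)

/* Assumed on-disk sizes in bytes (the structs are stored without padding). */
enum
{
    SIZE_BLOCK_HEADER    = 4,   /* id + fork_flags + data_length */
    SIZE_IMAGE_HEADER    = 5,   /* length + hole_offset + bimg_info */
    SIZE_COMPRESS_HEADER = 2,   /* hole_length */
    SIZE_REL_LOCATOR     = 12,  /* three 4-byte OIDs */
    SIZE_BLOCK_NUMBER    = 4
};

/* Total header bytes for one block reference. */
size_t block_header_size(uint8_t fork_flags, uint8_t bimg_info)
{
    size_t size = SIZE_BLOCK_HEADER;

    if (fork_flags & BKPBLOCK_HAS_IMAGE)
    {
        size += SIZE_IMAGE_HEADER;
        if ((bimg_info & BKPIMAGE_HAS_HOLE) && BKPIMAGE_COMPRESSED(bimg_info))
            size += SIZE_COMPRESS_HEADER;
    }
    if (!(fork_flags & BKPBLOCK_SAME_REL))
        size += SIZE_REL_LOCATOR;       /* locator omitted when same as previous */

    return size + SIZE_BLOCK_NUMBER;    /* the block number is always present */
}

/* The fork number lives in the lower four bits of fork_flags. */
unsigned block_fork_number(uint8_t fork_flags)
{
    return fork_flags & BKPBLOCK_FORK_MASK;
}
```

Under these assumptions a plain block reference takes 20 header bytes, one carrying an uncompressed full-page image takes 25, and one carrying a compressed image with a hole takes 27.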

WAL compression

In versions 9.5 or later, full-page images within XLOG records can be compressed by setting the wal_compression parameter. Setting it to on selects pglz, PostgreSQL's built-in LZ-family method; versions 15 or later also accept lz4 and zstd. When a compressed page image also contains a hole, the XLogRecordBlockCompressHeader structure is added to store the hole length.

This feature provides two advantages: it reduces the I/O cost of writing records and slows the consumption of WAL segment files.

The disadvantage is the increased CPU resource consumption required for compression.
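
The reason the extra header is needed only for compressed images follows from the xlogrecord.h comment quoted earlier: without compression the hole length can be derived from the image length, but after compression it cannot. A minimal sketch of that rule, assuming the default 8192-byte BLCKSZ; image_hole_length is a hypothetical helper, and stored_hole_length stands for the value read from XLogRecordBlockCompressHeader when that header is present.

```c
#include <assert.h>
#include <stdint.h>

#define BLCKSZ 8192   /* default page size; an assumption of this sketch */

#define BKPIMAGE_HAS_HOLE       0x01
#define BKPIMAGE_COMPRESS_PGLZ  0x04
#define BKPIMAGE_COMPRESS_LZ4   0x08
#define BKPIMAGE_COMPRESS_ZSTD  0x10
#define BKPIMAGE_COMPRESSED(info) \
    (((info) & (BKPIMAGE_COMPRESS_PGLZ | BKPIMAGE_COMPRESS_LZ4 | \
                BKPIMAGE_COMPRESS_ZSTD)) != 0)

/*
 * Determine the hole length of a full-page image.
 * 'length' is the number of page image bytes stored in the record.
 */
uint16_t image_hole_length(uint8_t bimg_info, uint16_t length,
                           uint16_t stored_hole_length)
{
    if (!(bimg_info & BKPIMAGE_HAS_HOLE))
        return 0;                       /* no hole, nothing was removed */

    if (BKPIMAGE_COMPRESSED(bimg_info))
        return stored_hole_length;      /* must come from the compress header */

    assert(length <= BLCKSZ);
    return (uint16_t) (BLCKSZ - length);    /* derivable when uncompressed */
}
```

For an uncompressed image of 192 bytes this gives a hole of 8000 bytes, whereas a compressed image has to carry the value 8000 explicitly.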

Figure 9.10. Examples of XLOG records (versions 9.5 or later).

Some specific examples are shown below.

9.4.3.1. Backup Block

The backup block created by an INSERT statement is shown in Figure 9.10(a). It consists of four data structures and one data object:

  1. The XLogRecord structure (header portion).
  2. The XLogRecordBlockHeader structure, including one XLogRecordBlockImageHeader structure.
  3. The XLogRecordDataHeaderShort structure.
  4. A backup block (block data).
  5. The xl_heap_insert structure (main data).

The XLogRecordBlockHeader structure contains variables to identify the block in the database cluster (the relfilenode, the fork number, and the block number). The XLogRecordBlockImageHeader structure contains the length of the stored page image and the offset of the hole within the page.

These two header structures together store the same data as the BkpBlock structure used until version 9.4.

Main Data Section

The XLogRecordDataHeaderShort structure stores the length of the xl_heap_insert structure, which serves as the main data of the record.

The content of the Main Data section in a WAL record containing an FPI varies depending on the operation. For example, an UPDATE statement adds structures such as xl_heap_lock or xl_heap_update.

In the context of WAL-based physical recovery, the data in the Main Data section of a backup block is redundant and remains unused, as the FPI itself provides the complete state of the page.

Logical Replication

When wal_level is set to logical, the behavior changes: the actual tuple data is explicitly appended to the Main Data section, even if an FPI is present. See Figure 9.11.

Figure 9.11. XLOG record with a Backup Block (wal_level = logical).

This design is crucial for logical replication. It allows the walsender to skip the physical FPI and directly decode the tuple information stored in the Main Data. Consequently, PostgreSQL achieves an efficient decoding process independent of the physical page layout.

This ensures high performance and consistent throughput even during “checkpoint spikes” when FPI generation is frequent.

9.4.3.2. Non-Backup Block

The non-backup block record created by an INSERT statement is shown in Figure 9.10(b). It consists of four data structures and one data object:

  1. The XLogRecord structure (header portion).
  2. The XLogRecordBlockHeader structure.
  3. The XLogRecordDataHeaderShort structure.
  4. An inserted tuple (specifically, an xl_heap_header structure and the entire inserted data).
  5. The xl_heap_insert structure (main data).

The XLogRecordBlockHeader structure contains three values (the relfilenode, the fork number, and the block number) to specify the target block, and the length of the inserted tuple’s data portion.

The XLogRecordDataHeaderShort structure contains the length of the xl_heap_insert structure.

The xl_heap_insert structure contains only two values: the offset number of the tuple within the block and a visibility flag. This structure is simplified because the XLogRecordBlockHeader now stores most of the data previously contained in xl_heap_insert.

typedef struct xl_heap_insert
{
        OffsetNumber    offnum;            /* inserted tuple's offset */
        uint8           flags;

        /* xl_heap_header & TUPLE DATA in backup block 0 */
} xl_heap_insert;
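
Putting the five parts together gives the total length of such an INSERT record. The sketch below adds them up. The byte sizes are assumptions based on the field widths of the structures involved (24 for XLogRecord, 20 for a block header with relation locator and block number, 2 for the short data header, 5 for xl_heap_header, 3 for xl_heap_insert), and heap_insert_record_size is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed byte sizes of the parts of a non-backup-block INSERT record. */
enum
{
    SIZE_XLOG_RECORD       = 24,  /* header portion */
    SIZE_BLOCK_HEADER_FULL = 20,  /* block header + locator (12) + block number (4) */
    SIZE_DATA_HEADER_SHORT = 2,   /* id + one-byte length */
    SIZE_HEAP_HEADER       = 5,   /* xl_heap_header */
    SIZE_HEAP_INSERT       = 3    /* offnum + flags */
};

/*
 * Total record length for an INSERT whose tuple contributes
 * 'tuple_data_len' bytes after the xl_heap_header.
 */
size_t heap_insert_record_size(size_t tuple_data_len)
{
    /* the short data header can describe at most 255 bytes of main data */
    assert(SIZE_HEAP_INSERT < 256);

    return SIZE_XLOG_RECORD
         + SIZE_BLOCK_HEADER_FULL
         + SIZE_DATA_HEADER_SHORT
         + SIZE_HEAP_HEADER + tuple_data_len   /* block data */
         + SIZE_HEAP_INSERT;                   /* main data */
}
```

With these numbers the fixed overhead is 54 bytes, so a tuple that contributes 5 bytes of data would produce a 59-byte record.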

A checkpoint record is shown in Figure 9.10(c). It consists of three data structures:

  1. The XLogRecord structure (header portion).
  2. The XLogRecordDataHeaderShort structure, which contains the main data length.
  3. The CheckPoint structure (main data).

Info

The xl_heap_header structure is defined in heapam_xlog.h and the CheckPoint structure is defined in pg_control.h.

The new XLOG format is optimized for parser efficiency, although it is more complex for human interpretation. Additionally, many XLOG record types are now smaller.

Figures 9.8 and 9.10 show the sizes of the main structures, allowing for the calculation and comparison of record sizes [1].

[1] While the new checkpoint record is larger than the previous one, it includes more variables.