9.4. Internal Layout of XLOG Record

An XLOG record comprises a general header portion and an associated data portion. The first subsection describes the header structure. The remaining two subsections explain the structure of the data portion in versions 9.4 and earlier, and in versions 9.5 and later, respectively. (The data format changed in version 9.5.)

9.4.1. Header Portion of XLOG Record

All XLOG records have a general header portion defined by the XLogRecord structure. The structure used in versions 9.4 and earlier is shown first; it was changed in version 9.5.

typedef struct XLogRecord
{
   uint32          xl_tot_len;   /* total len of entire record */
   TransactionId   xl_xid;       /* xact id */
   uint32          xl_len;       /* total len of rmgr data. This variable was removed in ver.9.5. */
   uint8           xl_info;      /* flag bits, see below */
   RmgrId          xl_rmid;      /* resource manager for this record */
   /* 2 bytes of padding here, initialize to zero */
   XLogRecPtr      xl_prev;      /* ptr to previous record in log */
   pg_crc32        xl_crc;       /* CRC for this record */
} XLogRecord;

The Header Portion of XLOG Record in versions 9.5 or later.

In versions 9.5 or later, one variable (xl_len) has been removed from the XLogRecord structure to refine the XLOG record format, which reduced the size by a few bytes.

typedef struct XLogRecord
{
        uint32          xl_tot_len;             /* total len of entire record */
        TransactionId   xl_xid;                 /* xact id */
        XLogRecPtr      xl_prev;                /* ptr to previous record in log */
        uint8           xl_info;                /* flag bits, see below */
        RmgrId          xl_rmid;                /* resource manager for this record */
        /* 2 bytes of padding here, initialize to zero */
        pg_crc32c       xl_crc;                 /* CRC for this record */
        /* XLogRecordBlockHeaders and XLogRecordDataHeader follow, no padding */
} XLogRecord;
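To make the size difference concrete, here is a minimal sketch that mirrors both layouts with fixed-width stand-in types and measures each header up to and including xl_crc. The names OldXLogRecord, NewXLogRecord and the demo_* functions are hypothetical, and the type widths (32-bit TransactionId, 64-bit XLogRecPtr, 32-bit CRC) are assumptions of this sketch; trailing alignment padding is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for PostgreSQL's typedefs (widths are assumptions of this sketch). */
typedef uint32_t TransactionId;
typedef uint64_t XLogRecPtr;
typedef uint8_t  RmgrId;

typedef struct OldXLogRecord        /* layout of versions 9.4 and earlier */
{
    uint32_t      xl_tot_len;
    TransactionId xl_xid;
    uint32_t      xl_len;
    uint8_t       xl_info;
    RmgrId        xl_rmid;
    /* 2 bytes of padding here */
    XLogRecPtr    xl_prev;
    uint32_t      xl_crc;
} OldXLogRecord;

typedef struct NewXLogRecord        /* layout of versions 9.5 and later */
{
    uint32_t      xl_tot_len;
    TransactionId xl_xid;
    XLogRecPtr    xl_prev;
    uint8_t       xl_info;
    RmgrId        xl_rmid;
    /* 2 bytes of padding here */
    uint32_t      xl_crc;
} NewXLogRecord;

/* Header bytes up to and including xl_crc (no trailing struct padding). */
size_t demo_old_header_bytes(void)
{
    return offsetof(OldXLogRecord, xl_crc) + sizeof(uint32_t);
}

size_t demo_new_header_bytes(void)
{
    return offsetof(NewXLogRecord, xl_crc) + sizeof(uint32_t);
}

/* Bytes saved by dropping xl_len and reordering the fields. */
size_t demo_header_bytes_saved(void)
{
    assert(demo_old_header_bytes() >= demo_new_header_bytes());
    return demo_old_header_bytes() - demo_new_header_bytes();
}
```

Using offsetof rather than sizeof keeps the comparison independent of whatever padding a compiler adds after the last field.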

Apart from two variables, most of the variables are so obvious that they do not need to be described.

Both xl_rmid and xl_info are variables related to resource managers, which are collections of operations associated with the WAL feature, such as writing and replaying of XLOG records. The number of resource managers tends to increase with each PostgreSQL version. Version 10 contains the following:

Operation                                 Resource manager
Heap tuple operations                     RM_HEAP, RM_HEAP2
Index operations                          RM_BTREE, RM_HASH, RM_GIN, RM_GIST, RM_SPGIST, RM_BRIN
Sequence operations                       RM_SEQ
Transaction operations                    RM_XACT, RM_MULTIXACT, RM_CLOG, RM_XLOG, RM_COMMIT_TS
Tablespace operations                     RM_SMGR, RM_DBASE, RM_TBLSPC, RM_RELMAP
Replication and hot standby operations    RM_STANDBY, RM_REPLORIGIN, RM_GENERIC_ID, RM_LOGICALMSG_ID

Here are some representative examples of how resource managers work:

  • If an INSERT statement is issued, the header variables xl_rmid and xl_info of its XLOG record are set to ‘RM_HEAP’ and ‘XLOG_HEAP_INSERT’, respectively. When recovering the database cluster, the RM_HEAP’s function heap_xlog_insert() is selected according to the xl_info and replays this XLOG record.

  • Similarly, for an UPDATE statement, the header variable xl_info of the XLOG record is set to ‘XLOG_HEAP_UPDATE’, and the RM_HEAP’s function heap_xlog_update() replays its record when the database recovers.

  • When a transaction commits, the header variables xl_rmid and xl_info of its XLOG record are set to ‘RM_XACT’ and ‘XLOG_XACT_COMMIT’, respectively. When recovering the database cluster, the function xact_redo_commit() replays this record.
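The two-step selection in these examples (first the resource manager by xl_rmid, then the operation by xl_info) can be sketched as a small dispatch table. This is only an illustration: every DEMO_* identifier and the numeric values are hypothetical stand-ins, not PostgreSQL's actual definitions, and the "replay" functions merely report which real function would handle the record.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for resource manager IDs and xl_info values. */
enum { DEMO_RM_XACT, DEMO_RM_HEAP, DEMO_RM_COUNT };
enum { DEMO_XLOG_XACT_COMMIT = 0x00 };
enum { DEMO_XLOG_HEAP_INSERT = 0x00, DEMO_XLOG_HEAP_UPDATE = 0x20 };

/* Which replay function a record would be routed to. */
typedef enum DemoReplay
{
    DEMO_REPLAY_NONE,
    DEMO_REPLAY_HEAP_XLOG_INSERT,   /* stands for heap_xlog_insert() */
    DEMO_REPLAY_HEAP_XLOG_UPDATE,   /* stands for heap_xlog_update() */
    DEMO_REPLAY_XACT_REDO_COMMIT    /* stands for xact_redo_commit() */
} DemoReplay;

typedef DemoReplay (*DemoRedoFn) (uint8_t xl_info);

/* Step 2 for the heap resource manager: choose by xl_info. */
DemoReplay demo_heap_redo(uint8_t xl_info)
{
    switch (xl_info)
    {
        case DEMO_XLOG_HEAP_INSERT:
            return DEMO_REPLAY_HEAP_XLOG_INSERT;
        case DEMO_XLOG_HEAP_UPDATE:
            return DEMO_REPLAY_HEAP_XLOG_UPDATE;
        default:
            return DEMO_REPLAY_NONE;
    }
}

/* Step 2 for the transaction resource manager. */
DemoReplay demo_xact_redo(uint8_t xl_info)
{
    return (xl_info == DEMO_XLOG_XACT_COMMIT)
        ? DEMO_REPLAY_XACT_REDO_COMMIT
        : DEMO_REPLAY_NONE;
}

/* Indexed by xl_rmid: entry 0 is DEMO_RM_XACT, entry 1 is DEMO_RM_HEAP. */
static const DemoRedoFn demo_rmgr_table[DEMO_RM_COUNT] = {
    demo_xact_redo,
    demo_heap_redo
};

/* Step 1: choose the resource manager by xl_rmid, then delegate. */
DemoReplay demo_replay(uint8_t xl_rmid, uint8_t xl_info)
{
    if (xl_rmid >= DEMO_RM_COUNT)
        return DEMO_REPLAY_NONE;
    assert(demo_rmgr_table[xl_rmid] != NULL);
    return demo_rmgr_table[xl_rmid](xl_info);
}
```

Note that the same xl_info value can mean different operations under different resource managers, which is why xl_rmid has to be examined first.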

Info

The XLogRecord structure in versions 9.4 or earlier is defined in src/include/access/xlog.h, and that of versions 9.5 or later is defined in src/include/access/xlogrecord.h.

The functions heap_xlog_insert and heap_xlog_update are defined in src/backend/access/heap/heapam.c, while the function xact_redo_commit is defined in src/backend/access/transam/xact.c.

9.4.2. Data Portion of XLOG Record (versions 9.4 or earlier)

The data portion of an XLOG record can be classified into either a backup block (which contains the entire page) or a non-backup block (which contains different data depending on the operation).

Fig. 9.8. Examples of XLOG records (versions 9.4 or earlier).

The internal layouts of XLOG records are described below, using some specific examples.

9.4.2.1. Backup Block

A backup block is shown in Fig. 9.8(a). It is composed of two data structures and one data object:

  1. The XLogRecord structure (header portion).

  2. The BkpBlock structure.

  3. The entire page, except for its free space.

The BkpBlock structure contains the variables that identify the page in the database cluster (i.e., the relfilenode and the fork number of the relation that contains the page, and the page’s block number), as well as the starting position and length of the page’s free space.
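Because the free space is omitted from the stored image, replay has to put it back. The following is a minimal sketch of that step, assuming the default 8192-byte page size and that the free space (the "hole") is refilled with zero bytes; demo_image_bytes and demo_restore_page are hypothetical names, not PostgreSQL functions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_BLCKSZ 8192    /* assumed page size */

/* Number of page bytes actually stored in the record. */
size_t demo_image_bytes(uint16_t hole_length)
{
    assert(hole_length <= DEMO_BLCKSZ);
    return DEMO_BLCKSZ - (size_t) hole_length;
}

/*
 * Rebuild a full page from a stored image that omits the hole.
 * 'image' must hold demo_image_bytes(hole_length) bytes and
 * 'page' must have room for DEMO_BLCKSZ bytes.
 */
void demo_restore_page(uint8_t *page, const uint8_t *image,
                       uint16_t hole_offset, uint16_t hole_length)
{
    size_t tail_start = (size_t) hole_offset + hole_length;

    assert(tail_start <= DEMO_BLCKSZ);

    /* bytes before the hole */
    memcpy(page, image, hole_offset);
    /* the hole itself: only zeros, so nothing was stored for it */
    memset(page + hole_offset, 0, hole_length);
    /* bytes after the hole follow directly in the stored image */
    memcpy(page + tail_start, image + hole_offset, DEMO_BLCKSZ - tail_start);
}
```

With hole_length set to zero the function degenerates into a plain copy of the whole page.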

/* include/access/xlog_internal.h */
typedef struct BkpBlock
{
  RelFileNode node;        /* relation containing block */
  ForkNumber  fork;        /* fork within the relation */
  BlockNumber block;       /* block number */
  uint16      hole_offset; /* number of bytes before "hole" */
  uint16      hole_length; /* number of bytes in "hole" */

  /* ACTUAL BLOCK DATA FOLLOWS AT END OF STRUCT */
} BkpBlock;

9.4.2.2. Non-Backup Block

In non-backup blocks, the layout of the data portion differs depending on the operation. Here, the XLOG record for an INSERT statement is explained as a representative example. See Fig. 9.8(b). In this case, the XLOG record for the INSERT statement is composed of two data structures and one data object:

  1. The XLogRecord (header-portion) structure.

  2. The xl_heap_insert structure.

  3. The inserted tuple, with a few bytes removed.

The xl_heap_insert structure contains the variables that identify the inserted tuple in the database cluster (i.e., the relfilenode of the table that contains this tuple, and the tuple’s tid), as well as a visibility flag of this tuple.

typedef struct BlockIdData
{
   uint16          bi_hi;
   uint16          bi_lo;
} BlockIdData;

typedef uint16 OffsetNumber;

typedef struct ItemPointerData
{
   BlockIdData     ip_blkid;
   OffsetNumber    ip_posid;
} ItemPointerData;

typedef struct RelFileNode
{
   Oid             spcNode;             /* tablespace */
   Oid             dbNode;              /* database */
   Oid             relNode;             /* relation */
} RelFileNode;

typedef struct xl_heaptid
{
   RelFileNode     node;
   ItemPointerData tid;                 /* changed tuple id */
} xl_heaptid;

typedef struct xl_heap_insert
{
   xl_heaptid      target;              /* inserted tuple id */
   bool            all_visible_cleared; /* PD_ALL_VISIBLE was cleared */
} xl_heap_insert;

Info

The reason for removing a few bytes from the inserted tuple is described in the source code comment of the structure xl_heap_header:

We don’t store the whole fixed part (HeapTupleHeaderData) of an inserted or updated tuple in WAL; we can save a few bytes by reconstructing the fields that are available elsewhere in the WAL record, or perhaps just plain needn’t be reconstructed.

One more example is shown here. See Fig. 9.8(c). The XLOG record for a checkpoint is quite simple; it is composed of two data structures:

  1. the XLogRecord structure (header-portion).
  2. the CheckPoint structure, which contains its checkpoint information (see more detail in Section 9.7).

Info

The xl_heap_header structure is defined in src/include/access/htup.h while the CheckPoint structure is defined in src/include/catalog/pg_control.h.

9.4.3. Data Portion of XLOG Record (versions 9.5 or later)

In versions 9.4 or earlier, there was no common format for XLOG records, so each resource manager had to define its own format. This made it increasingly difficult to maintain the source code and implement new features related to WAL. To address this issue, a common structured format that is independent of resource managers was introduced in version 9.5.

The data portion of an XLOG record can be divided into two parts: header and data. See Fig. 9.9.

Fig. 9.9. Common XLOG record format.

The header part contains zero or more XLogRecordBlockHeaders and zero or one XLogRecordDataHeaderShort (or XLogRecordDataHeaderLong). It must contain at least one of these.

When the record stores a full-page image (i.e., a backup block), the XLogRecordBlockHeader includes the XLogRecordBlockImageHeader, and also includes the XLogRecordBlockCompressHeader if its block is compressed.
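How these optional parts add up can be sketched as a small size calculation driven by the flag bits. The flag values are the ones defined in the listing in this subsection; the byte counts in the enum are assumptions derived from the field widths (unpadded on-disk sizes, with a 12-byte relation identifier and a 4-byte block number), and demo_block_header_size is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag bits as defined for fork_flags and bimg_info. */
#define BKPBLOCK_HAS_IMAGE      0x10
#define BKPBLOCK_SAME_REL       0x80
#define BKPIMAGE_HAS_HOLE       0x01
#define BKPIMAGE_COMPRESS_PGLZ  0x04
#define BKPIMAGE_COMPRESS_LZ4   0x08
#define BKPIMAGE_COMPRESS_ZSTD  0x10

/* Unpadded sizes in bytes (assumed from the field widths). */
enum
{
    DEMO_BLOCK_HDR    = 4,  /* id + fork_flags + data_length */
    DEMO_IMAGE_HDR    = 5,  /* length + hole_offset + bimg_info */
    DEMO_COMPRESS_HDR = 2,  /* hole_length */
    DEMO_REL_ID       = 12, /* tablespace, database and relation */
    DEMO_BLOCK_NUMBER = 4
};

/* Total header bytes written for one block reference. */
size_t demo_block_header_size(uint8_t fork_flags, uint8_t bimg_info)
{
    size_t size = DEMO_BLOCK_HDR;

    /* bimg_info is meaningful only when an image is present */
    assert((fork_flags & BKPBLOCK_HAS_IMAGE) || bimg_info == 0);

    if (fork_flags & BKPBLOCK_HAS_IMAGE)
    {
        int compressed = (bimg_info & (BKPIMAGE_COMPRESS_PGLZ |
                                       BKPIMAGE_COMPRESS_LZ4 |
                                       BKPIMAGE_COMPRESS_ZSTD)) != 0;

        size += DEMO_IMAGE_HDR;
        if ((bimg_info & BKPIMAGE_HAS_HOLE) && compressed)
            size += DEMO_COMPRESS_HDR;
    }
    if (!(fork_flags & BKPBLOCK_SAME_REL))
        size += DEMO_REL_ID;    /* omitted when same as previous block */

    return size + DEMO_BLOCK_NUMBER;
}
```

Under these assumptions, a plain block reference takes 20 bytes of header, and a full-page image with a hole adds 5 bytes (7 when the image is also compressed).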

/*
 * Header info for block data appended to an XLOG record.
 *
 * 'data_length' is the length of the rmgr-specific payload data associated
 * with this block. It does not include the possible full page image, nor
 * XLogRecordBlockHeader struct itself.
 *
 * Note that we don't attempt to align the XLogRecordBlockHeader struct!
 * So, the struct must be copied to aligned local storage before use.
 */
typedef struct XLogRecordBlockHeader
{
	uint8		id;				/* block reference ID */
	uint8		fork_flags;		/* fork within the relation, and flags */
	uint16		data_length;	/* number of payload bytes (not including page
								 * image) */

	/* If BKPBLOCK_HAS_IMAGE, an XLogRecordBlockImageHeader struct follows */
	/* If BKPBLOCK_SAME_REL is not set, a RelFileLocator follows */
	/* BlockNumber follows */
} XLogRecordBlockHeader;

/*
 * The fork number fits in the lower 4 bits in the fork_flags field. The upper
 * bits are used for flags.
 */
#define BKPBLOCK_FORK_MASK	0x0F
#define BKPBLOCK_FLAG_MASK	0xF0
#define BKPBLOCK_HAS_IMAGE	0x10	/* block data is an XLogRecordBlockImage */
#define BKPBLOCK_HAS_DATA	0x20
#define BKPBLOCK_WILL_INIT	0x40	/* redo will re-init the page */
#define BKPBLOCK_SAME_REL	0x80	/* RelFileLocator omitted, same as
									 * previous */

/*
 * Additional header information when a full-page image is included
 * (i.e. when BKPBLOCK_HAS_IMAGE is set).
 *
 * The XLOG code is aware that PG data pages usually contain an unused "hole"
 * in the middle, which contains only zero bytes.  Since we know that the
 * "hole" is all zeros, we remove it from the stored data (and it's not counted
 * in the XLOG record's CRC, either).  Hence, the amount of block data actually
 * present is (BLCKSZ - <length of "hole" bytes>).
 *
 * Additionally, when wal_compression is enabled, we will try to compress full
 * page images using one of the supported algorithms, after removing the
 * "hole". This can reduce the WAL volume, but at some extra cost of CPU spent
 * on the compression during WAL logging. In this case, since the "hole"
 * length cannot be calculated by subtracting the number of page image bytes
 * from BLCKSZ, basically it needs to be stored as an extra information.
 * But when no "hole" exists, we can assume that the "hole" length is zero
 * and no such an extra information needs to be stored. Note that
 * the original version of page image is stored in WAL instead of the
 * compressed one if the number of bytes saved by compression is less than
 * the length of extra information. Hence, when a page image is successfully
 * compressed, the amount of block data actually present is less than
 * BLCKSZ - the length of "hole" bytes - the length of extra information.
 */
typedef struct XLogRecordBlockImageHeader
{
	uint16		length;			/* number of page image bytes */
	uint16		hole_offset;	/* number of bytes before "hole" */
	uint8		bimg_info;		/* flag bits, see below */

	/*
	 * If BKPIMAGE_HAS_HOLE and BKPIMAGE_COMPRESSED(), an
	 * XLogRecordBlockCompressHeader struct follows.
	 */
} XLogRecordBlockImageHeader;

/* Information stored in bimg_info */
#define BKPIMAGE_HAS_HOLE		0x01	/* page image has "hole" */
#define BKPIMAGE_APPLY			0x02	/* page image should be restored
										 * during replay */
/* compression methods supported */
#define BKPIMAGE_COMPRESS_PGLZ	0x04
#define BKPIMAGE_COMPRESS_LZ4	0x08
#define BKPIMAGE_COMPRESS_ZSTD	0x10

#define	BKPIMAGE_COMPRESSED(info) \
	((info & (BKPIMAGE_COMPRESS_PGLZ | BKPIMAGE_COMPRESS_LZ4 | \
			  BKPIMAGE_COMPRESS_ZSTD)) != 0)

/*
 * Extra header information used when page image has "hole" and
 * is compressed.
 */
typedef struct XLogRecordBlockCompressHeader
{
	uint16		hole_length;	/* number of bytes in "hole" */
} XLogRecordBlockCompressHeader;

/*
 * XLogRecordDataHeaderShort/Long are used for the "main data" portion of
 * the record. If the length of the data is less than 256 bytes, the short
 * form is used, with a single byte to hold the length. Otherwise the long
 * form is used.
 *
 * (These structs are currently not used in the code, they are here just for
 * documentation purposes).
 */
typedef struct XLogRecordDataHeaderShort
{
	uint8		id;				/* XLR_BLOCK_ID_DATA_SHORT */
	uint8		data_length;	/* number of payload bytes */
}			XLogRecordDataHeaderShort;

#define SizeOfXLogRecordDataHeaderShort (sizeof(uint8) * 2)

typedef struct XLogRecordDataHeaderLong
{
	uint8		id;				/* XLR_BLOCK_ID_DATA_LONG */
	/* followed by uint32 data_length, unaligned */
}			XLogRecordDataHeaderLong;

#define SizeOfXLogRecordDataHeaderLong (sizeof(uint8) + sizeof(uint32))

The data part is composed of zero or more block data and zero or one main data, which correspond to the XLogRecordBlockHeader(s) and to the XLogRecordDataHeader, respectively.
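The correspondence between headers and data can be illustrated with a simplified scanner that walks the header part and totals the data bytes each header announces. This is a sketch under several assumptions: the buffer starts right after the XLogRecord structure, multi-byte fields are in native byte order, the special IDs 255 and 254 mark the short and long main data headers, block IDs range from 0 to 32, and any other special ID is rejected. DemoSummary and demo_scan_record are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define XLR_MAX_BLOCK_ID         32
#define XLR_BLOCK_ID_DATA_SHORT  255
#define XLR_BLOCK_ID_DATA_LONG   254
#define BKPBLOCK_HAS_IMAGE       0x10
#define BKPBLOCK_SAME_REL        0x80
#define BKPIMAGE_HAS_HOLE        0x01
#define DEMO_COMPRESS_ANY        0x1C   /* PGLZ | LZ4 | ZSTD bits */

typedef struct DemoSummary
{
    int      nblocks;           /* XLogRecordBlockHeaders seen */
    uint32_t block_data_len;    /* block payload plus page image bytes */
    uint32_t main_data_len;     /* length announced by the data header */
} DemoSummary;

/* Returns 0 if headers and data lengths are consistent, -1 otherwise. */
int demo_scan_record(const uint8_t *p, size_t len, DemoSummary *out)
{
    size_t remaining = len;
    size_t datatotal = 0;

    assert(p != NULL && out != NULL);
    memset(out, 0, sizeof(*out));

    /* Headers come first; stop once only announced data is left. */
    while (remaining > datatotal)
    {
        uint8_t id = *p++;

        remaining--;
        if (id == XLR_BLOCK_ID_DATA_SHORT || id == XLR_BLOCK_ID_DATA_LONG)
        {
            size_t   width = (id == XLR_BLOCK_ID_DATA_SHORT) ? 1 : 4;
            uint32_t n = 0;

            if (remaining < width)
                return -1;
            if (width == 1)
                n = *p;
            else
                memcpy(&n, p, 4);       /* unaligned field, so copy it out */
            remaining -= width;
            out->main_data_len = n;
            datatotal += n;
            break;                      /* the main data header is the last one */
        }
        else if (id <= XLR_MAX_BLOCK_ID)
        {
            uint8_t  fork_flags;
            uint16_t data_length;
            size_t   skip = 4;          /* BlockNumber */

            if (remaining < 3)
                return -1;
            fork_flags = p[0];
            memcpy(&data_length, p + 1, 2);
            p += 3;
            remaining -= 3;
            datatotal += data_length;

            if (fork_flags & BKPBLOCK_HAS_IMAGE)
            {
                uint16_t image_len;
                uint8_t  bimg_info;

                if (remaining < 5)
                    return -1;
                memcpy(&image_len, p, 2);
                bimg_info = p[4];
                p += 5;
                remaining -= 5;
                datatotal += image_len;
                if ((bimg_info & BKPIMAGE_HAS_HOLE) &&
                    (bimg_info & DEMO_COMPRESS_ANY))
                    skip += 2;          /* XLogRecordBlockCompressHeader */
            }
            if (!(fork_flags & BKPBLOCK_SAME_REL))
                skip += 12;             /* relation identifier */
            if (remaining < skip)
                return -1;
            p += skip;
            remaining -= skip;
            out->nblocks++;
        }
        else
            return -1;                  /* other IDs are outside this sketch */
    }

    out->block_data_len = (uint32_t) (datatotal - out->main_data_len);
    return (remaining == datatotal) ? 0 : -1;
}
```

The final check expresses the invariant of the format: once all headers are consumed, the bytes that remain must be exactly the block data and main data the headers announced.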

WAL compression

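In versions 9.5 or later, full-page images within XLOG records can be compressed by setting the parameter wal_compression = on. The original method is PGLZ; the BKPIMAGE_COMPRESS_LZ4 and BKPIMAGE_COMPRESS_ZSTD flags shown above indicate that later versions also support LZ4 and Zstandard. When a compressed page image also has a hole, the XLogRecordBlockCompressHeader structure is added.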

This feature has two advantages and one disadvantage. The advantages are a lower I/O cost for writing records and slower consumption of WAL segment files. The disadvantage is the extra CPU time spent on compression.
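The source comment quoted earlier also notes that the compressed image is used only when it really saves space once the extra hole_length header is counted. A minimal sketch of that decision follows, assuming an 8192-byte page and a 2-byte extra header; demo_stored_image_bytes is a hypothetical helper, and a negative compressed_len stands for a failed compression attempt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_BLCKSZ        8192     /* assumed page size */
#define DEMO_COMPRESS_HDR  2        /* one uint16 hole_length */

/*
 * Returns the number of page image bytes that would be stored and
 * reports through *used_compression whether the compressed form won.
 */
uint32_t demo_stored_image_bytes(uint16_t hole_length, int32_t compressed_len,
                                 int *used_compression)
{
    uint32_t orig_len;
    uint32_t extra_bytes;

    assert(hole_length <= DEMO_BLCKSZ);
    assert(used_compression != NULL);

    /* the hole is removed before compression is attempted */
    orig_len = DEMO_BLCKSZ - hole_length;
    /* the hole length must be stored separately only if a hole exists */
    extra_bytes = (hole_length > 0) ? DEMO_COMPRESS_HDR : 0;

    if (compressed_len >= 0 &&
        (uint32_t) compressed_len + extra_bytes < orig_len)
    {
        *used_compression = 1;
        return (uint32_t) compressed_len;
    }

    /* not worth it: keep the uncompressed image */
    *used_compression = 0;
    return orig_len;
}
```

For example, a page with a 100-byte hole that compresses to 8091 bytes would be stored uncompressed, because the two extra header bytes outweigh the single byte saved.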

Fig. 9.10. Examples of XLOG records (versions 9.5 or later).

Some specific examples are shown below, as in the previous subsection.

9.4.3.1. Backup Block

The backup block created by an INSERT statement is shown in Fig. 9.10(a). It is composed of four data structures and one data object:

  1. the XLogRecord structure (header-portion).
  2. the XLogRecordBlockHeader structure, including one XLogRecordBlockImageHeader structure.
  3. the XLogRecordDataHeaderShort structure.
  4. a backup block (block data).
  5. the xl_heap_insert structure (main data).

The XLogRecordBlockHeader structure contains the variables to identify the block in the database cluster (the relfilenode, the fork number, and the block number). The XLogRecordBlockImageHeader structure contains the length of this block's page image and the offset of its hole. (These two header structures together can store the same data as the BkpBlock structure used until version 9.4.)

The XLogRecordDataHeaderShort structure stores the length of the xl_heap_insert structure, which is the main data of the record. (See below.)

Info

The main data of an XLOG record that contains a full-page image is not used except in some special cases, such as logical decoding and speculative insertions. It is ignored when the record is replayed, making it redundant data. This may be improved in the future.

In addition, the main data of backup block records depends on the statements that create them. For example, an UPDATE statement appends xl_heap_lock or xl_heap_update.

9.4.3.2. Non-Backup Block

Next, I will describe the non-backup block record created by the INSERT statement (see Fig. 9.10(b)). It is composed of four data structures and one data object:

  1. the XLogRecord structure (header-portion).
  2. the XLogRecordBlockHeader structure.
  3. the XLogRecordDataHeaderShort structure.
  4. an inserted tuple (to be exact, an xl_heap_header structure and the entire inserted data).
  5. the xl_heap_insert structure (main data).

The XLogRecordBlockHeader structure contains three values (the relfilenode, the fork number, and the block number) to specify the block that the tuple was inserted into, and the length of the data portion of the inserted tuple. The XLogRecordDataHeaderShort structure contains the length of the new xl_heap_insert structure, which is the main data of this record.

The new xl_heap_insert structure contains only two values: the offset number of this tuple within the block, and a visibility flag. It became very simple because the XLogRecordBlockHeader structure stores most of the data that was contained in the old xl_heap_insert structure.

typedef struct xl_heap_insert
{
        OffsetNumber    offnum;            /* inserted tuple's offset */
        uint8           flags;

        /* xl_heap_header & TUPLE DATA in backup block 0 */
} xl_heap_insert;
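Putting the pieces of this example together, the total length of such a non-backup INSERT record can be estimated with a short calculation. The byte counts below are assumptions of this sketch, derived from the field widths of the structures (in particular, xl_heap_header is taken to be two uint16 fields plus one uint8, and the relation identifier is taken to be 12 bytes); demo_insert_record_len is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Unpadded sizes in bytes (assumed from the field widths). */
enum
{
    DEMO_SIZE_XLOG_RECORD    = 24,  /* XLogRecord up to and including xl_crc */
    DEMO_SIZE_BLOCK_HEADER   = 4,   /* id + fork_flags + data_length */
    DEMO_SIZE_REL_ID         = 12,  /* tablespace, database and relation */
    DEMO_SIZE_BLOCK_NUMBER   = 4,
    DEMO_SIZE_DATA_HDR_SHORT = 2,   /* id + data_length */
    DEMO_SIZE_HEAP_HEADER    = 5,   /* xl_heap_header */
    DEMO_SIZE_HEAP_INSERT    = 3    /* offnum + flags */
};

/*
 * Total bytes of an INSERT record without a full-page image.
 * tuple_data_len is the inserted tuple's data without its fixed header.
 */
size_t demo_insert_record_len(size_t tuple_data_len)
{
    size_t block_data = DEMO_SIZE_HEAP_HEADER + tuple_data_len;

    /* data_length in XLogRecordBlockHeader is a uint16 */
    assert(block_data <= 0xFFFF);

    return DEMO_SIZE_XLOG_RECORD
        + DEMO_SIZE_BLOCK_HEADER + DEMO_SIZE_REL_ID + DEMO_SIZE_BLOCK_NUMBER
        + DEMO_SIZE_DATA_HDR_SHORT
        + block_data
        + DEMO_SIZE_HEAP_INSERT;
}
```

Under these assumptions the fixed overhead is 54 bytes, so a tuple carrying 5 bytes of data would produce a 59-byte record.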

As the final example, a checkpoint record is shown in Fig. 9.10(c). It is composed of three data structures:

  1. the XLogRecord structure (header-portion).
  2. the XLogRecordDataHeaderShort structure, which contains the main data length.
  3. the CheckPoint structure (main data).

Info

The xl_heap_header structure is defined in src/include/access/heapam_xlog.h and the CheckPoint structure is defined in src/include/catalog/pg_control.h.

Although the new format is a little complicated for us, it is well-designed for the parsers of the resource managers. Additionally, many types of XLOG records are smaller than in the previous format. The sizes of the main structures are shown in Figures 9.8 and 9.10, so you can calculate the sizes of those records and compare them. (The new checkpoint record is larger than the previous one, but it contains more variables.)