9.8. Database Recovery in PostgreSQL

PostgreSQL implements redo log-based recovery. If the database server crashes, PostgreSQL restores the database cluster by sequentially replaying the XLOG records in the WAL segment files from the REDO point.

Database recovery has been discussed several times in previous sections. This section covers two additional aspects of recovery.

9.8.1. Recovery Processing

The first aspect is the start of the recovery process. When PostgreSQL starts up, it reads the pg_control file. The following details describe the recovery process from that point.

See Figure 9.20 and the following description.

Figure 9.20. Details of the recovery process.
  1. PostgreSQL reads all items in the pg_control file at startup.

    • If the state item is ‘in production’, PostgreSQL enters recovery mode because the database was not shut down normally.
    • If the state is ‘shut down’, PostgreSQL enters normal startup mode.
  2. PostgreSQL reads the latest checkpoint record from the appropriate WAL segment file using the location recorded in the pg_control file. It then retrieves the REDO point from that record.

  3. Resource managers read and replay XLOG records in sequence from the REDO point to the end of the latest WAL segment.

    • If a replayed XLOG record is a backup block, the manager overwrites the corresponding table page regardless of its LSN.
    • Otherwise, the manager replays a non-backup block XLOG record only if the record’s LSN is greater than the pd_lsn of the corresponding page.

9.8.2. LSN Comparison and Idempotency

The second point concerns LSN comparison: why the LSN of a non-backup block and the pd_lsn of the corresponding page must be compared.

This is explained using a specific example that emphasizes the need for this comparison. See Figures 9.21 and 9.22. (Note that the WAL buffer is omitted to simplify the description.)

Figure 9.21. Insertion operations during the background writer working.
  1. PostgreSQL inserts a tuple into TABLE_A and writes an XLOG record at LSN_1.
  2. The checkpointer process writes the TABLE_A page to storage. At this point, the page’s pd_lsn is LSN_1.
  3. PostgreSQL inserts a new tuple into TABLE_A and writes an XLOG record at LSN_2. The modified page is not yet written to storage.

In this scenario, unlike previous overview examples, the TABLE_A page has been written to storage once.

If the database is shut down in immediate mode and then started, the following recovery occurs:

Figure 9.22. Database recovery.
  1. PostgreSQL loads the first XLOG record and the TABLE_A page. It does not replay the record because the record’s LSN (LSN_1) is not greater than the page’s LSN (also LSN_1). There is no need to replay it.

  2. Next, PostgreSQL replays the second XLOG record because the record’s LSN (LSN_2) is greater than the current pd_lsn of the TABLE_A page (LSN_1).

If the replay order of non-backup blocks is incorrect or if they are replayed more than once, the database cluster becomes inconsistent. This shows that the redo operation for non-backup blocks is not idempotent. To ensure the correct replay order, non-backup block records should only be replayed when their LSN is greater than the pd_lsn of the corresponding page.

In contrast, the redo operation for backup blocks is idempotent, meaning these blocks can be replayed multiple times regardless of their LSN.