9.8. Database Recovery in PostgreSQL
PostgreSQL implements redo log-based recovery. If the database server crashes, PostgreSQL can restore the database cluster by sequentially replaying the XLOG records in the WAL segment files from the REDO point.
We have already talked about database recovery several times up to this section. Here, I will describe two things about recovery that have not been explained yet.
The first thing is how PostgreSQL starts the recovery process. When PostgreSQL starts up, it first reads the pg_control file. The following are the details of the recovery process from that point. See Fig. 9.14 and the following description.
(1) PostgreSQL reads all items in the pg_control file when it starts. If the state item is ‘in production’, PostgreSQL enters recovery-mode because this means that the database was not shut down normally. If it is ‘shut down’, PostgreSQL enters normal startup-mode.
(2) PostgreSQL reads the latest checkpoint record, the location of which is written in the pg_control file, from the appropriate WAL segment file. It then gets the REDO point from the record. If the latest checkpoint record is invalid, PostgreSQL reads the one prior to it. If both records are unreadable, it gives up recovering by itself. (Note that the prior checkpoint is not stored in PostgreSQL 11 or later.)
(3) The appropriate resource managers read and replay XLOG records in sequence from the REDO point until they reach the end of the latest WAL segment. When an XLOG record is replayed and if it is a backup block, it is overwritten on the corresponding table page regardless of its LSN. Otherwise, a (non-backup block) XLOG record is replayed only if the LSN of the record is greater than the ‘pd_lsn’ of the corresponding page.
The second point is about the comparison of LSNs: why the non-backup block’s LSN and the corresponding page’s pd_lsn should be compared. Unlike the previous examples, this will be explained using a specific example that emphasizes the need for this comparison. See Figs. 9.15 and 9.16. (Note that the WAL buffer is omitted to simplify the description.)
(1) PostgreSQL inserts a tuple into the TABLE_A, and writes an XLOG record at LSN_1.
(2) The background-writer process writes the TABLE_A page to storage. At this point, this page’s pd_lsn is LSN_1.
(3) PostgreSQL inserts a new tuple into the TABLE_A, and writes a XLOG record at LSN_2. The modified page is not written into the storage yet.
Unlike the examples in overview, the TABLE_A’s page has been once written to the storage in this scenario.
Do shutdown with immediate-mode, and then start the database.
(1) PostgreSQL loads the first XLOG record and the TABLE_A page, but does not replay it because the LSN of the record (LSN_1) is not larger than the LSN of the page (also LSN_1). In fact, it is clear that there is no need to replay it.
(2) Next, PostgreSQL replays the second XLOG record because the LSN of the record (LSN_2) is greater than the current LSN of the TABLE_A page (LSN_1).
As you can see from this example, if the replay order of non-backup blocks is incorrect or non-backup blocks are replayed more than once, the database cluster will no longer be consistent. In short, the redo (replay) operation of non-backup block is not idempotent. Therefore, to preserve the correct replay order, non-backup block records should be replayed only if their LSN is greater than the corresponding page’s pd_lsn.
On the other hand, as the redo operation of backup block is idempotent, backup blocks can be replayed any number of times regardless of their LSN.