The transaction log is an essential part of a database, because every database management system is required not to lose any data even when a system failure occurs. It is a history of all changes and actions in the database system, kept so that no data is lost due to a failure such as a power outage or some other failure that causes the server to crash. Because the log contains sufficient information about every transaction already executed, the database server can recover the database cluster by replaying the changes and actions recorded in the transaction log after a crash.
In the field of computer science, WAL is an acronym for Write-Ahead Logging, a protocol or rule for writing both changes and actions into a transaction log, whereas in PostgreSQL, WAL is an acronym for Write-Ahead Log. There the term is used as a synonym for the transaction log, and it also refers to the implemented mechanism for writing actions to the transaction log (WAL). Though this is a little confusing, the PostgreSQL definition is adopted in this document.
The WAL mechanism was first implemented in version 7.1 to mitigate the impact of server crashes. It also made possible the implementation of Point-in-Time Recovery (PITR) and Streaming Replication (SR), which are described in Chapter 10 and Chapter 11 respectively.
Although an understanding of the WAL mechanism is essential for system integration and administration using PostgreSQL, its complexity makes it impossible to summarize briefly. The complete explanation of WAL in PostgreSQL is therefore organized as follows. The first section provides an overall picture of WAL, introducing some important concepts and keywords. The subsequent sections describe the following topics:
Let's take a look at an overview of the WAL mechanism. To clarify the problem that WAL addresses, the first subsection shows what happens when a crash occurs if PostgreSQL does not implement WAL. The second subsection introduces some key concepts and outlines the main subjects of this chapter: the writing of WAL data and the database recovery processing. The final subsection completes the overview of WAL by adding one more key concept.
In this section, to simplify the description, the table TABLE_A, which contains just one page, is used.
As described in Chapter 8, to provide efficient access to the relation's pages, every DBMS implements a shared buffer pool.
Assume that we insert some data tuples into TABLE_A on a PostgreSQL that does not implement the WAL feature; this situation is illustrated in Fig. 9.1.
Fig. 9.1. Insertion operations without WAL.
So, a database without WAL is vulnerable to system failures.
Before WAL was introduced (version 7.0 or earlier), PostgreSQL performed synchronous writes to disk, issuing a sync system call whenever a page was changed in memory, in order to ensure durability. Therefore, modification commands such as INSERT and UPDATE performed very poorly.
To deal with the system failures mentioned above without compromising performance, PostgreSQL supports WAL. In this subsection, some keywords and key concepts are introduced, and then the writing of WAL data and the recovery of the database are described.
PostgreSQL writes all modifications as history data into persistent storage to prepare for failures. In PostgreSQL, these history data are known as XLOG record(s) or WAL data.
XLOG records are written into the in-memory WAL buffer by change operations such as insertion, deletion, or a commit action. They are immediately written to a WAL segment file on storage when a transaction commits or aborts. (To be precise, the writing of XLOG records may occur in other cases as well; the details are described in Section 9.5.) The LSN (Log Sequence Number) of an XLOG record represents the location at which the record is written in the transaction log, and it is used as the unique id of the record.
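As a hedged illustration (not PostgreSQL source code), the following sketch treats an LSN as a 64-bit byte position in the transaction log and prints it in the familiar high/low hexadecimal form such as '1/00002D3E':

#include <stdio.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;   /* an LSN is a 64-bit byte position in the log */

/* Print an LSN in the conventional "X/XXXXXXXX" form. */
static void print_lsn(XLogRecPtr lsn)
{
    printf("%X/%08X\n",
           (unsigned) (lsn >> 32),          /* high 32 bits */
           (unsigned) (lsn & 0xFFFFFFFF));  /* low 32 bits  */
}

int main(void)
{
    XLogRecPtr lsn = ((uint64_t) 0x1 << 32) | 0x00002D3E;  /* "1/00002D3E" */
    print_lsn(lsn);
    return 0;
}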
When we consider how the database system recovers, one question arises: from what point does PostgreSQL start to recover? The answer is the REDO point, that is, the location at which the XLOG record was being written at the moment the latest checkpoint started (checkpoints in PostgreSQL are described in Section 9.7). In fact, database recovery processing is strongly linked to checkpoint processing, and the two are inseparable.
The WAL and checkpoint process were implemented at the same time in version 7.1.
Now that the major keywords and concepts have been introduced, the following describes tuple insertion with WAL. See Fig. 9.2 and the following description. (Also refer to this slide.)
Fig. 9.2. Insertion operations with WAL.
'TABLE_A's LSN' shows the value of pd_lsn within the page header of TABLE_A; 'page's LSN' is used in the same manner.
The following shows how to recover our database cluster back to the state immediately before the crash. There is no need to do anything special, since PostgreSQL automatically enters recovery mode when it is restarted. See Fig. 9.3 (and this slide). PostgreSQL will sequentially read and replay the XLOG records within the appropriate WAL segment files from the REDO point.
Fig. 9.3. Database recovery using WAL.
PostgreSQL recovers itself in this way by replaying the XLOG records written in the WAL segment files in chronological order. Thus, PostgreSQL's XLOG records are clearly a REDO log.
PostgreSQL does not support UNDO log.
Though writing XLOG records certainly has a cost, it is nothing compared to writing out the entire modified pages. We are sure that the benefit we gain, i.e. tolerance of system failures, is greater than the price we pay.
Suppose that TABLE_A's page data on the storage has been corrupted because the operating system failed while the background-writer process was writing the dirty pages. As XLOG records cannot be replayed on a corrupted page, an additional feature is needed.
PostgreSQL supports a feature referred to as full-page writes to deal with such failures. If it is enabled, PostgreSQL writes the header data together with the entire page as an XLOG record during the first change of each page after every checkpoint; it is enabled by default. In PostgreSQL, such an XLOG record containing the entire page is referred to as a backup block (or full-page image).
Let's describe the insertion of tuples again, but with the full-page-writes enabled. See Fig. 9.4 and the following description.
Fig. 9.4. Full page writes.
Restart the PostgreSQL server to repair the broken cluster. See Fig. 9.5 and the following description.
Fig. 9.5. Database recovery with backup block.
In this way, PostgreSQL can recover the database even if some data write errors occur due to a process or operating-system crash.
As mentioned above, WAL can prevent data loss caused by a process or operating-system crash. However, if a file-system or media failure occurs, the data will be lost. To deal with such failures, PostgreSQL provides online backup and replication features.
If online backups are taken regularly, the database can be restored from the recent backup, even if media failure occurs. However, note that the changes made after taking the last backup cannot be restored.
The synchronous replication feature can store all changes to another storage or host in real-time.
Logically, PostgreSQL writes XLOG records into the transaction log, which is a virtual file with an 8-byte address space (16 exabytes).
Although the capacity of the transaction log is effectively unlimited because the 8-byte address space is vast, it is impossible for us to handle a single file of that size. Therefore, a transaction log in PostgreSQL is divided, by default, into files of 16 MB, each of which is known as a WAL segment. See Fig. 9.6.
In version 11 or later, the size of a WAL segment file can be configured using the --wal-segsize option when the PostgreSQL cluster is created by the initdb command.
The WAL segment file name is a 24-digit hexadecimal number, and the naming rule is as follows:
\begin{align} \verb|WAL segment file name| = \verb|timelineId| + (\verb|uint32|) \frac{\verb|LSN|-1}{16\verb|M|*256} + (\verb|uint32|)\left(\frac{\verb|LSN|-1}{16\verb|M|}\right) \% 256 \end{align}
PostgreSQL's WAL contains the concept of timelineId (a 4-byte unsigned integer), which is used for Point-in-Time Recovery (PITR), described in Chapter 10. However, the timelineId is fixed at 0x00000001 in this chapter because this concept is not required in the following descriptions.
The first WAL segment file is 000000010000000000000001. When the first one has been filled up with XLOG records, the second one, 000000010000000000000002, is provided. Subsequent files are used in ascending order; after 0000000100000000000000FF has been filled up, the next one, 000000010000000100000000, is provided. In this way, whenever the last 2 digits carry over, the middle 8-digit number increases by one.
Similarly, after 0000000100000001000000FF has been filled up, 000000010000000200000000 will be provided, and so on.
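To make the naming rule concrete, here is a hedged sketch, not taken from the PostgreSQL source, that computes a segment file name from a timelineId and an LSN under the default 16 MB segment size; it follows the formula above and produces the same name as the pg_xlogfile_name example shown below:

#include <stdio.h>
#include <stdint.h>

#define WAL_SEGMENT_SIZE    ((uint64_t) 16 * 1024 * 1024)  /* 16 MB, the default */
#define SEGMENTS_PER_XLOGID 256                            /* 4 GB / 16 MB */

/* Build the 24-digit hexadecimal segment file name for a given timeline and LSN. */
static void wal_segment_name(char *buf, size_t len, uint32_t timeline_id, uint64_t lsn)
{
    uint64_t segno = (lsn - 1) / WAL_SEGMENT_SIZE;

    snprintf(buf, len, "%08X%08X%08X",
             (unsigned) timeline_id,
             (unsigned) (segno / SEGMENTS_PER_XLOGID),
             (unsigned) (segno % SEGMENTS_PER_XLOGID));
}

int main(void)
{
    char name[25];

    /* LSN '1/00002D3E'; the expected name is 000000010000000100000000 */
    wal_segment_name(name, sizeof(name), 0x00000001, ((uint64_t) 1 << 32) + 0x2D3E);
    printf("%s\n", name);
    return 0;
}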
Using the built-in function pg_xlogfile_name (version 9.6 or earlier) or pg_walfile_name (version 10 or later), we can find the WAL segment file name that contains a specified LSN. An example is shown below:
testdb=# SELECT pg_xlogfile_name('1/00002D3E');
     pg_xlogfile_name     
--------------------------
 000000010000000100000000
(1 row)

(In version 10 or later, use "SELECT pg_walfile_name('1/00002D3E');".)
A WAL segment is a 16 MB file by default, and it is internally divided into pages of 8192 bytes (8 KB). The first page has header data defined by the structure XLogLongPageHeaderData, while the headings of all other pages have the page information defined by the structure XLogPageHeaderData. Following the page header, XLOG records are written into each page from the beginning, in descending order. See Fig. 9.7.
Fig. 9.7. Internal layout of a WAL segment file.
The XLogLongPageHeaderData and XLogPageHeaderData structures are defined in src/include/access/xlog_internal.h. The explanation of both structures is omitted because they are not required in the following descriptions.
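As a hedged illustration of this layout (the constants below are assumptions for the default configuration, not PostgreSQL source code), the following computes which 8 KB page inside a segment a given LSN falls on, and the offset within that page:

#include <stdio.h>
#include <stdint.h>

#define WAL_SEGMENT_SIZE ((uint64_t) 16 * 1024 * 1024)  /* 16 MB segment, default */
#define XLOG_PAGE_SIZE   8192                           /* 8 KB pages inside a segment */

int main(void)
{
    uint64_t lsn = ((uint64_t) 1 << 32) + 0x2D3E;   /* LSN '1/00002D3E' */

    uint64_t offset_in_segment = lsn % WAL_SEGMENT_SIZE;
    uint64_t page_in_segment   = offset_in_segment / XLOG_PAGE_SIZE;
    uint64_t offset_in_page    = offset_in_segment % XLOG_PAGE_SIZE;

    printf("segment offset = %llu, page = %llu, offset in page = %llu\n",
           (unsigned long long) offset_in_segment,
           (unsigned long long) page_in_segment,
           (unsigned long long) offset_in_page);
    return 0;
}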
An XLOG record comprises a general header portion and an associated data portion. The first subsection describes the header structure; the remaining two subsections explain the structure of the data portion in version 9.4 or earlier and in version 9.5 or later, respectively. (The data format changed in version 9.5.)
All XLOG records have a general header portion defined by the structure XLogRecord. The structure used in version 9.4 and earlier is shown below; it was changed in version 9.5.
typedef struct XLogRecord
{
    uint32          xl_tot_len;   /* total len of entire record */
    TransactionId   xl_xid;       /* xact id */
    uint32          xl_len;       /* total len of rmgr data */
    uint8           xl_info;      /* flag bits, see below */
    RmgrId          xl_rmid;      /* resource manager for this record */
    /* 2 bytes of padding here, initialize to zero */
    XLogRecPtr      xl_prev;      /* ptr to previous record in log */
    pg_crc32        xl_crc;       /* CRC for this record */
} XLogRecord;
Apart from two variables, most of the variables are so obvious that they need no description.
Both xl_rmid and xl_info are variables related to resource managers, which are collections of operations associated with the WAL feature, such as the writing and replaying of XLOG records. The number of resource managers tends to increase with each PostgreSQL version; version 10 contains the following:
| Operation | Resource manager |
|---|---|
| Heap tuple operations | RM_HEAP, RM_HEAP2 |
| Index operations | RM_BTREE, RM_HASH, RM_GIN, RM_GIST, RM_SPGIST, RM_BRIN |
| Sequence operations | RM_SEQ |
| Transaction operations | RM_XACT, RM_MULTIXACT, RM_CLOG, RM_XLOG, RM_COMMIT_TS |
| Tablespace operations | RM_SMGR, RM_DBASE, RM_TBLSPC, RM_RELMAP |
| Replication and hot standby operations | RM_STANDBY, RM_REPLORIGIN, RM_GENERIC_ID, RM_LOGICALMSG_ID |
Here are some representative examples of how resource managers work:
In version 9.5 or later, one variable (xl_len) has been removed from the structure XLogRecord to refine the XLOG record format, which reduced its size by a few bytes.
XLogRecord structure in version 9.4 or earlier is defined in src/include/access/xlog.h and that of version 9.5 or later is defined in src/include/access/xlogrecord.h.
The functions heap_xlog_insert and heap_xlog_update are defined in src/backend/access/heap/heapam.c, while the function xact_redo_commit is defined in src/backend/access/transam/xact.c.
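To illustrate how these two fields drive replay, here is a hedged, greatly simplified sketch; the constants, types, and output are illustrative stand-ins, not PostgreSQL's actual dispatch code:

#include <stdint.h>
#include <stdio.h>

/* Illustrative constants and record shape; the real ones live in the PostgreSQL source. */
enum { RM_XACT = 1, RM_HEAP = 10 };
enum { XLOG_HEAP_INSERT = 0x00, XLOG_HEAP_UPDATE = 0x20, XLOG_HEAP_OPMASK = 0x70 };

typedef struct
{
    uint8_t xl_rmid;   /* which resource manager owns this record */
    uint8_t xl_info;   /* which operation of that manager */
} DemoRecord;

/* Replay one record by handing it to the owning resource manager. */
static void demo_redo(const DemoRecord *rec)
{
    switch (rec->xl_rmid)
    {
        case RM_HEAP:
            if ((rec->xl_info & XLOG_HEAP_OPMASK) == XLOG_HEAP_INSERT)
                printf("replay a heap insert (cf. heap_xlog_insert)\n");
            else if ((rec->xl_info & XLOG_HEAP_OPMASK) == XLOG_HEAP_UPDATE)
                printf("replay a heap update (cf. heap_xlog_update)\n");
            break;
        case RM_XACT:
            printf("replay a transaction record (cf. xact_redo_commit)\n");
            break;
        default:
            printf("hand off to another resource manager: %u\n", rec->xl_rmid);
    }
}

int main(void)
{
    DemoRecord rec = { RM_HEAP, XLOG_HEAP_INSERT };
    demo_redo(&rec);
    return 0;
}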
The data portion of an XLOG record is classified as either a backup block (an entire page) or a non-backup block (data that differs by operation).
Fig. 9.8. Examples of XLOG records (version 9.4 or earlier).
The internal layouts of XLOG records are described below, using some specific examples.
A backup block is shown in Fig. 9.8(a). It is composed of two data structures and one data object as shown below:
The BkpBlock contains the variables to identify this page in the database cluster (i.e. the relfilenode and the fork number of the relation that contains this page, and this page's block number), as well as the beginning position and the length of this page's free space.
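For reference, a rough, self-contained sketch of this header is shown below; it is simplified from the 9.4-era definition in src/include/access/xlog_internal.h, and the stand-in types (RelFileNode and so on) are illustrative only:

#include <stdint.h>
#include <stdio.h>

/* Simplified stand-ins for PostgreSQL types, for illustration only. */
typedef struct { uint32_t spcNode, dbNode, relNode; } RelFileNode;
typedef int32_t  ForkNumber;
typedef uint32_t BlockNumber;

/* Rough sketch of the 9.4-era BkpBlock header that precedes a full-page image. */
typedef struct BkpBlock
{
    RelFileNode node;         /* relation containing the block */
    ForkNumber  fork;         /* fork within the relation */
    BlockNumber block;        /* block number */
    uint16_t    hole_offset;  /* where the unused "hole" (free space) starts */
    uint16_t    hole_length;  /* length of the hole, which is not stored */
    /* the page image itself, minus the hole, follows */
} BkpBlock;

int main(void)
{
    printf("sketch of BkpBlock header: %zu bytes\n", sizeof(BkpBlock));
    return 0;
}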
In non-backup blocks, the layout of the data portion differs depending on the operation. Here, an INSERT statement's XLOG record is explained as a representative example. See Fig. 9.8(b). In this case, the XLOG record of the INSERT statement is composed of two data structures and one data object, as shown below:
The structure xl_heap_insert contains the variables to identify the inserted tuple in the database cluster (i.e. the relfilenode of the table that contains this tuple, and this tuple's tid), and a visibility flag of this tuple.
The reason for removing a few bytes from the inserted tuple is described in the source-code comment of the structure xl_heap_header:
We don't store the whole fixed part (HeapTupleHeaderData) of an inserted or updated tuple in WAL; we can save a few bytes by reconstructing the fields that are available elsewhere in the WAL record, or perhaps just plain needn't be reconstructed.
One more example is shown here. See Fig. 9.8(c). The XLOG record of a checkpoint is quite simple; it is composed of two data structures, as shown below:
The xl_heap_header structure is defined in src/include/access/htup.h while the CheckPoint structure is defined in src/include/catalog/pg_control.h.
In version 9.4 or earlier, there was no common format for XLOG records, so each resource manager had to define its own. This made it increasingly difficult to maintain the source code and to implement new features related to WAL. To deal with this issue, a common structured format, which does not depend on resource managers, was introduced in version 9.5.
The data portion of an XLOG record can be divided into two parts: header and data. See Fig. 9.9.
Fig. 9.9. Common XLOG record format.
The header part contains zero or more XLogRecordBlockHeaders and zero or one XLogRecordDataHeaderShort (or XLogRecordDataHeaderLong); it must contain at least one of these. When the record stores a full-page image (i.e. a backup block), XLogRecordBlockHeader includes an XLogRecordBlockImageHeader, and also an XLogRecordBlockCompressHeader if the block is compressed.
The data part is composed of zero or more block data and zero or one main data, which correspond to the XLogRecordBlockHeader(s) and the XLogRecordDataHeader respectively.
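A hedged sketch of the per-block header follows; it is simplified from src/include/access/xlogrecord.h, and the exact fields and comments may differ between versions:

#include <stdint.h>
#include <stdio.h>

/*
 * Rough sketch of the per-block header used by the common record format
 * (simplified; the authoritative definition is in the PostgreSQL source).
 */
typedef struct XLogRecordBlockHeader
{
    uint8_t  id;           /* block reference id within the record */
    uint8_t  fork_flags;   /* fork number plus flag bits (e.g. "has image") */
    uint16_t data_length;  /* length of the block data that follows */
    /* if the record carries a full-page image, an XLogRecordBlockImageHeader
     * follows; then the relation/block identifiers follow */
} XLogRecordBlockHeader;

int main(void)
{
    printf("block header sketch: %zu bytes\n", sizeof(XLogRecordBlockHeader));
    return 0;
}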
In version 9.5 or later, full-page images within an XLOG record can be compressed using the LZ compression method by enabling the parameter wal_compression (e.g. wal_compression = on). In that case, the structure XLogRecordBlockCompressHeader is added.
This feature has two advantages and one disadvantage. The advantages are reducing the I/O cost of writing records and suppressing the consumption of WAL segment files. The disadvantage is the extra CPU needed for compression.
Some specific examples are shown below as in the previous subsection.
A backup block created by an INSERT statement is shown in Fig. 9.10(a). It is composed of four data structures and one data object, as shown below:
XLogRecordBlockHeader contains the variables to identify the block in the database cluster (the relfilenode, the fork number, and the block number); XLogRecordBlockImageHeader contains the length of this block and the offset number. (These two header structures together can store the same data as the BkpBlock used until version 9.4.)
XLogRecordDataHeaderShort stores the length of xl_heap_insert structure which is the main data of the record. (See below.)
The main data of an XLOG record that contains a full-page image is not used except in some special cases (e.g. logical decoding and speculative insertions). It is ignored when the record is replayed, so it is redundant data. This might be improved in the future.
In addition, the main data of backup-block records depends on the statement that created them. For example, an UPDATE statement appends xl_heap_lock or xl_heap_update.
Next, a non-backup-block record created by an INSERT statement is described (see also Fig. 9.10(b)). It is composed of four data structures and one data object, as shown below:
XLogRecordBlockHeader contains three values (the relfilenode, the fork number, and the block number) to specify the block into which the tuple was inserted, and the length of the data portion of the inserted tuple. XLogRecordDataHeaderShort contains the length of the new xl_heap_insert structure, which is the main data of this record.
The new xl_heap_insert contains only two values: the offset number of this tuple within the block, and a visibility flag. It became very simple because XLogRecordBlockHeader stores most of the data contained in the old structure.
As the final example, a checkpoint record is shown in Fig. 9.10(c). It is composed of three data structures, as shown below:
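A hedged sketch of this slimmed-down structure is shown below; it is simplified from the PostgreSQL source, and the OffsetNumber typedef here is an illustrative stand-in:

#include <stdint.h>
#include <stdio.h>

typedef uint16_t OffsetNumber;   /* stand-in for PostgreSQL's OffsetNumber */

/*
 * Rough sketch of the slimmed-down xl_heap_insert main data used by the
 * common record format. The block-identifying values now live in
 * XLogRecordBlockHeader, so only the in-page location and flags remain.
 */
typedef struct xl_heap_insert
{
    OffsetNumber offnum;   /* offset of the inserted tuple within the block */
    uint8_t      flags;    /* visibility-related flag bits */
} xl_heap_insert;

int main(void)
{
    printf("new xl_heap_insert sketch: %zu bytes\n", sizeof(xl_heap_insert));
    return 0;
}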
The structure xl_heap_header is defined in src/include/access/htup.h and the CheckPoint structure is defined in src/include/catalog/pg_control.h.
Though the new format is a little complicated for us, it is well designed for the parsers of the resource managers, and the size of many types of XLOG records is usually smaller than before. The sizes of the main structures are shown in Figs. 9.8 and 9.10, so you can calculate the sizes of those records and compare them. (The size of the new checkpoint record is greater than the previous one, but it contains more variables.)
Having finished the warming-up exercises, we are now ready to understand the writing of XLOG records, so I will explain it as precisely as possible in this section.
First, issue the following statement to explore the PostgreSQL internals:
testdb=# INSERT INTO tbl VALUES ('A');
By issuing the above statement, the internal function exec_simple_query() is invoked; the pseudo code of exec_simple_query() is shown below:
exec_simple_query() @postgres.c

(1) ExtendCLOG() @clog.c                   /* Write the state of this transaction
                                            * "IN_PROGRESS" to the CLOG. */

(2) heap_insert() @heapam.c                /* Insert a tuple, create an XLOG record,
                                            * and invoke the function XLogInsert. */

(3)   XLogInsert() @xlog.c (9.5 or later, xloginsert.c)
                                           /* Write the XLOG record of the inserted tuple
                                            * to the WAL buffer, and update the page's pd_lsn. */

(4) finish_xact_command() @postgres.c      /* Invoke commit action. */
      XLogInsert() @xlog.c (9.5 or later, xloginsert.c)
                                           /* Write an XLOG record of this commit action
                                            * to the WAL buffer. */

(5)   XLogWrite() @xlog.c                  /* Write and flush all XLOG records in
                                            * the WAL buffer to the WAL segment. */

(6) TransactionIdCommitTree() @transam.c   /* Change the state of this transaction
                                            * from "IN_PROGRESS" to "COMMITTED" in the CLOG. */
In the following paragraphs, each line of the pseudo code will be explained for understanding the writing of XLOG records; see also Figs. 9.11 and 9.12.
In the above example, the commit action caused the writing of XLOG records into the WAL segment, but such writing may also occur when any one of the following happens:
If one of the above occurs, all WAL records in the WAL buffer are written into a WAL segment file regardless of whether their transactions have been committed or not.
It is taken for granted that DML (Data Manipulation Language) operations write XLOG records, but non-DML operations do so as well. As described above, a commit action writes an XLOG record that contains the id of the committed transaction. Another example is a checkpoint action, which writes an XLOG record containing general information about that checkpoint. Furthermore, a SELECT statement creates XLOG records in special cases, though it does not usually create them. For example, if deletion of unnecessary tuples and defragmentation of the necessary tuples in a page occur by HOT (Heap Only Tuple) during SELECT statement processing, the XLOG records of the modified pages are written into the WAL buffer.
The WAL writer is a background process that periodically checks the WAL buffer and writes all unwritten XLOG records into the WAL segments. Its purpose is to avoid bursts of XLOG record writing. Without it, the writing of XLOG records could become a bottleneck when a large amount of data is committed at one time.
The WAL writer runs by default and cannot be disabled. The check interval is set by the configuration parameter wal_writer_delay; the default value is 200 milliseconds.
In PostgreSQL, the checkpointer (background) process performs checkpoints. It starts a checkpoint when one of the following occurs:
The checkpointer also performs a checkpoint when a superuser issues the CHECKPOINT command manually.
In version 9.1 or earlier, as mentioned in Section 8.6, the background writer process did both checkpointing and dirty-page writing.
In the following subsections, the outline of checkpointing and the pg_control file, which holds the metadata of the current checkpoint, are described.
The checkpoint process has two aspects: the preparation for database recovery, and the cleaning of dirty pages in the shared buffer pool. In this subsection, its internal processing is described, focusing on the former. See Fig. 9.13 and the following description.
Fig. 9.13. Internal processing of PostgreSQL's checkpoint.
To summarize the description above from the viewpoint of database recovery, checkpointing creates the checkpoint record, which contains the REDO point, and stores the checkpoint location and more in the pg_control file. Therefore, PostgreSQL can recover itself by replaying WAL data from the REDO point, which is obtained from the checkpoint record located via the pg_control file.
As the pg_control file contains the fundamental information about the checkpoint, it is certainly essential for database recovery. If it is broken or unreadable, the recovery process cannot start because it cannot obtain a starting point.
Even though the pg_control file stores over 40 items, the three items required in the next section are shown below:
A pg_control file is stored in the global subdirectory under the base-directory; its contents can be shown using the pg_controldata utility.
postgres> pg_controldata /usr/local/pgsql/data
pg_control version number:            937
Catalog version number:               201405111
Database system identifier:           6035535450242021944
Database cluster state:               in production
pg_control last modified:
Latest checkpoint location:           0/C000F48
Prior checkpoint location:            0/C000E70

... snip ...
PostgreSQL 11 or later only stores the WAL segments that contain the latest checkpoint or newer; the older segment files, which contain the prior checkpoint, are no longer kept, in order to reduce the disk space needed for WAL segment files under the pg_xlog (pg_wal) subdirectory. See this thread for details.
PostgreSQL implements a redo-log-based recovery feature. Should the database server crash, PostgreSQL restores the database cluster by sequentially replaying the XLOG records in the WAL segment files from the REDO point.
We have already talked about database recovery several times up to this section, so here I will describe two things regarding recovery that have not been explained yet.
The first thing is how PostgreSQL begins the recovery process. When PostgreSQL starts up, it reads the pg_control file first. The following are the details of the recovery processing from that point. See Fig. 9.14 and the following description.
Fig. 9.14. Details of the recovery process.
The second point is about the comparison of LSNs: why the non-backup block's LSN and the corresponding page's pd_lsn must be compared. Unlike the previous examples, this is explained here using a specific example that emphasizes the need for comparing the two LSNs. See Figs. 9.15 and 9.16. (Note that the WAL buffer is omitted to simplify the description.)
Fig. 9.15. Insertion operations while the background writer is working.
Unlike the examples in the overview, the TABLE_A's page has already been written to storage once in this scenario.
Shut down the server in immediate mode, and then start it again.
Fig. 9.16. Database recovery.
As can be seen from this example, if the replay order of non-backup blocks is incorrect or non-backup blocks are replayed more than once, the database cluster will no longer be consistent. In short, the redo (replay) operation of a non-backup block is not idempotent. Therefore, to preserve the correct replay order, non-backup block records should be replayed if and only if their LSN is greater than the corresponding page's pd_lsn.
On the other hand, as the redo operation of a backup block is idempotent, backup blocks can be replayed any number of times regardless of their LSN.
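The replay rule can be summarized in a short hedged sketch; all type and function names below are illustrative, not PostgreSQL's:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

/* Illustrative page and record shapes; the real ones are in the PostgreSQL source. */
typedef struct { XLogRecPtr pd_lsn; /* ... page contents ... */ } DemoPage;
typedef struct { XLogRecPtr lsn; bool is_full_page_image; } DemoRecord;

/* Decide whether a record must be replayed against the given page. */
static bool needs_replay(const DemoRecord *rec, const DemoPage *page)
{
    if (rec->is_full_page_image)
        return true;                    /* restoring a full-page image is idempotent */
    return rec->lsn > page->pd_lsn;     /* non-backup blocks: replay only newer records */
}

int main(void)
{
    DemoPage   page = { .pd_lsn = 0x0A000140 };
    DemoRecord rec  = { .lsn = 0x0A000180, .is_full_page_image = false };

    printf("replay? %s\n", needs_replay(&rec, &page) ? "yes" : "no");
    return 0;
}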
PostgreSQL writes XLOG records to one of the WAL segment files stored in the pg_xlog subdirectory (in version 10 or later, the pg_wal subdirectory), and switches to a new one when the old one has been filled up. The number of WAL files varies depending on several configuration parameters as well as server activity. In addition, their management policy has been improved in version 9.5.
In the following subsections, switching and managing of WAL segment files are described.
WAL segment switches occur when one of the following occurs:
A switched file is usually recycled (renamed and reused) for future use, but it may be removed later if it is not necessary.
Whenever a checkpoint starts, PostgreSQL estimates and prepares the number of WAL segment files required for the next checkpoint cycle. The estimate is made based on the number of files consumed in previous checkpoint cycles. They are counted from the segment that contains the prior REDO point, and the value is kept between min_wal_size (by default 80 MB, i.e. 5 files) and max_wal_size (1 GB, i.e. 64 files). When a checkpoint starts, necessary files are kept or recycled, while unnecessary ones are removed.
A specific example is shown in Fig. 9.17. Assuming that there are six files before the checkpoint starts, WAL_3 contains the prior REDO point (in version 10 or earlier; in version 11 or later, the REDO point), and PostgreSQL estimates that five files are needed. In this case, WAL_1 will be renamed WAL_7 for recycling and WAL_2 will be removed.
The files older than the one that contains the prior REDO point can be removed, because, as is clear from the recovery mechanism described in Section 9.8, they would never be used.
If more files are required due to a spike in WAL activity, new files will be created while the total size of WAL files is less than max_wal_size. For example, in Fig. 9.18, if WAL_7 has been filled up, WAL_8 is newly created.
Fig. 9.18. Creating a WAL segment file.
The number of WAL files changes adaptively depending on the server activity. If the amount of WAL data written keeps increasing, the estimated number of WAL segment files, as well as the total size of the WAL files, also gradually increases. In the opposite case (i.e. the amount of WAL data written decreases), these decrease as well.
If the total size of the WAL files exceeds max_wal_size, a checkpoint will be started. Figure 9.19 illustrates this situation. By checkpointing, a new REDO point will be created and the last REDO point will become the prior one; then the unnecessary old files will be recycled. In this way, PostgreSQL always holds just the WAL segment files needed for database recovery.
Fig. 9.19. Checkpointing and recycling WAL segment files.
The configuration parameter wal_keep_size (or wal_keep_segments in version 12 or earlier) and the replication slot feature also affect the number of WAL segment files.
Continuous archiving is a feature that copies WAL segment files to an archival area at the time of a WAL segment switch, and it is performed by the archiver (background) process. The copied file is called an archive log. This feature is usually used for hot physical backups and PITR (Point-in-Time Recovery), described in Chapter 10.
The command that copies files to the archival area is set in the configuration parameter archive_command. For example, with the following setting, WAL segment files are copied to the directory '/home/postgres/archives/' every time a segment switch occurs:
archive_command = 'cp %p /home/postgres/archives/%f'
where the placeholder %p is the path of the WAL segment file to be copied and %f is the file name of the archive log.
Fig. 9.20. Continuous archiving.
The parameter archive_command can be set to any Unix command or tool, so you can transfer archive logs to another host by using the scp command or any file backup tool instead of an ordinary copy command.
In version 14 or earlier, continuous archiving could only use shell commands. Since version 15, PostgreSQL has supported a loadable library feature to archive logs. For details, see archive_library and the reference implementation basic_archive.
PostgreSQL does *not* clean up created archive logs, so you should properly manage the logs when using this feature. If you do nothing, the number of archive logs continues to increase.
The pg_archivecleanup utility is one useful tool for archive log management.
The unix find command can also be used to delete archive logs. The following command deletes archive logs that were created more than three days ago:
$ find /home/postgres/archives -mtime +3d -exec rm -f {} \;