The synchronous streaming replication was implemented in version 9.1. It is a so-called single-master-multi-slaves type replication, and those two terms – master and slave(s) – are usually referred to as primary and standby(s) respectively in PostgreSQL.
This native replication feature is based on the log shipping, one of the general replication techniques, in which a primary server continues to send WAL data and then, each standby server replays the received data immediately.
This chapter covers the following topics focusing on how streaming replication works:
Though the first replication feature, only for asynchronous replication, had implemented in version 9.0, it had replaced with the new implementation (currently in use) for synchronous replication in 9.1.
In Streaming Replication, three kinds of processes work cooperatively. A walsender process on the primary server sends WAL data to standby server; and then, a walreceiver and a startup processes on standby server receives and replays these data. A walsender and a walreceiver communicate using a single TCP connection.
In this section, we will explore the start-up sequence of the streaming replication to understand how those processes are started and how the connection between them is established. Figure 11.1 shows the startup sequence diagram of streaming replication:Figure 11.1: SR startup sequence
Each walsender process keeps a state appropriate for the working phase of connected walreceiver or any application (Note that it is not the state of walreceiver or application connected to the walsender.) The following are the possible states of it:
The pg_stat_replication view shows the state of all running walsenders. An example is shown below:
testdb=# SELECT application_name,state FROM pg_stat_replication; application_name | state ------------------+----------- standby1 | streaming standby2 | streaming pg_basebackup | backup (3 rows)
As shown in the above result, two walsenders are running to send WAL data for the connected standby servers, and another one is running to send all files of the database cluster for pg_basebackup utility.
In version 9.3 or earlier, if the primary's WAL segments required by the standby server have already been recycled, the standby cannot catch up with the primary server. There is no reliable solution for this problem, but only to set a large value to the configuration parameter wal_keep_segments to reduce the possibility of the occurrence. It's a stopgap solution.
In version 9.4 or later, this problem can be prevented by using replication slot. The replication slot is a feature that expands the flexibility of the WAL data sending, mainly for the logical replication, which also provides the solution to this problem – the WAL segment files which contain unsent data under the pg_xlog (or pg_wal if version 10 or later) can be kept in the replication slot by pausing recycling process. Refer the official document for detail.
Streaming replication has two aspects: log shipping and database synchronization. Log shipping is obviously one of those aspects since the streaming replication is based on it – the primary server sends WAL data to the connected standby servers whenever the writing of them occurs. Database synchronization is required for synchronous replication – the primary server communicates with each multiple-standby server to synchronize its database clusters.
To accurately understand how streaming replication works, we should explore how one primary server manages multiple standby servers. To make it rather simple, the special case (i.e. single-primary single-standby system) is described in this section, while the general case (single-primary multi-standbys system) will be described in the next section.
Assume that the standby server is in the synchronous replication mode, but the configuration parameter hot-standby is disabled and wal_level is 'archive'. The main parameter of the primary server is shown below:
synchronous_standby_names = 'standby1' hot_standby = off wal_level = archive
Additionally, among the three triggers to write the WAL data mentioned in Section 9.5, we focus on the transaction commits here.
Suppose that one backend process on the primary server issues a simple INSERT statement in autocommit mode. The backend starts a transaction, issues an INSERT statement, and then commits the transaction immediately. Let's explore further how this commit action will be completed. See the following sequence diagram in Figure 11.2:Figure 11.2: Streaming Replication's communication sequence diagram
If the configuration parameter wal_level is 'hot_standby' or 'logical', PostgreSQL writes a WAL record regarding the hot standby feature, following the records of a commit or abort action. (In this example, PostgreSQL does not write that record because it’s 'archive'.)
Each ACK response informs the primary server of the internal information of standby server. It contains four items below:
XLogWalRcvSendReply(void)@src/backend/replication/walreceiver.c /* Construct a new message */ reply_message.write = LogstreamResult.Write; reply_message.flush = LogstreamResult.Flush; reply_message.apply = GetXLogReplayRecPtr(); reply_message.sendTime = now; /* Prepend with the message type and send it. */ buf = 'r'; memcpy(&buf, &reply_message, sizeof(StandbyReplyMessage)); walrcv_send(buf, sizeof(StandbyReplyMessage) + 1);
Walreceiver returns ACK responses not only when WAL data have been written and flushed, but also periodically as the heartbeat of standby server. The primary server therefore always grasps the status of all connected standby servers.
By issuing the queries as shown below, the LSN related information of the connected standby servers can be displayed.
testdb=# SELECT application_name AS host, write_location AS write_LSN, flush_location AS flush_LSN, replay_location AS replay_LSN FROM pg_stat_replication; host | write_lsn | flush_lsn | replay_lsn ----------+-----------+-----------+------------ standby1 | 0/5000280 | 0/5000280 | 0/5000280 standby2 | 0/5000280 | 0/5000280 | 0/5000280 (2 rows)
The heartbeat's interval is set to the parameter wal_receiver_status_interval, which is 10 seconds by default.
In this subsection, descriptions are made on how the primary server behaves when the synchronous standby server has failed, and how to deal with the situation.
Even if the synchronous standby server has failed and is no longer able to return an ACK response, the primary server continues to wait for responses forever. So, the running transactions cannot commit and subsequent query processing cannot be started. In other words, all of the primary server operations are practically stopped. (Streaming replication does not support a function to revert automatically to asynchronous-mode by timing out.)
There are two ways to avoid such situation. One of them is to use multiple standby servers to increase the system availability, and the other is to switch from synchronous to asynchronous mode by performing the following steps manually.
synchronous_standby_names = ''
postgres> pg_ctl -D $PGDATA reload
The above procedure does not affect the connected clients. The primary server continues the transaction processing as well as all sessions between clients and the respective backend processes are kept.
In this section, the way streaming replication works with multiple standby servers is described.
Primary server gives sync_priority and sync_state to all managed standby servers, and treats each standby server depending on its respective values. (The primary server gives those values even if it manages just one standby server; haven’t mentioned this in the previous section.)
sync_priority indicates the priority of standby server in synchronous-mode and is a fixed value. The smaller value shows the higher priority, while 0 is the special value that means "in asynchronous-mode". Priorities of standby servers are given in the order listed in the primary's configuration parameter synchronous_standby_names. For example, in the following configuration, priorities of standby1 and standby2 are 1 and 2, respectively.
synchronous_standby_names = 'standby1, standby2'
(Standby servers not listed on this parameter are in asynchronous-mode, and their priority is 0.)
sync_state is the state of the standby server. It is variable according to the running status of all standby servers and the individual priority. The followings are the possible states:
The priority and the state of the standby servers can be shown by issuing the following query:
testdb=# SELECT application_name AS host, sync_priority, sync_state FROM pg_stat_replication; host | sync_priority | sync_state ----------+---------------+------------ standby1 | 1 | sync standby2 | 2 | potential (2 rows)
The primary server waits for ACK responses from the synchronous standby server alone. In other words, the primary server confirms only synchronous standby's writing and flushing of WAL data. Streaming replication, therefore, ensures that only synchronous standby is in the consistent and synchronous state with the primary.
Figure 11.3 shows the case in which the ACK response of potential standby has been returned earlier than that of the primary standby. There, the primary server does not complete the commit action of the current transaction, and continues to wait for the primary's ACK response. And then, when the primary's response is received, the backend process releases the latch and completes the current transaction processing.Figure 11.3: Managing multiple standby servers
In the opposite case (i.e. the primary's ACK response has been returned earlier than the potential's one), the primary server immediately completes the commit action of the current transaction without ensuring if the potential standby writes and flushes WAL data or not.
Once again, see how the primary server behaves when the standby server has failed.
When either a potential or an asynchronous standby server has failed, the primary server terminates the walsender process connected to the failed standby and continues all processing. In other words, transaction processing of the primary server would not be affected by the failure of either type of standby server.
When a synchronous standby server has failed, the primary server terminates the walsender process connected to the failed standby, and replaces synchronous standby with the highest priority potential standby. See Figure 11.4. In contrast to the failure described above, query processing on the primary server will be paused from the point of failure to the replacement of synchronous standby. (Therefore, failure detection of standby server is a very important function to increase availability of replication system. Failure detection will be described in the next section.)Figure 11.4: Replacing of synchronous standby server
In any case, if one or more standby server shall be running in syncrhonous-mode, the primary server keeps only one synchronous standby server at all times, and the synchronous standby server is always in a consistent and synchronous state with the primary.
Streaming replication uses two common failure detection procedures that will not require any special hardware at all.
When a connection drop between walsender and walreceiver has been detected, the primary server immediately determines that the standby server or walreceiver process is faulty. When a low level network function returns an error by failing to write or to read the socket interface of walreceiver, the primary also immediately determines its failure.
If a walreceiver returns nothing within the time set for the parameter wal_sender_timeout (default 60 seconds), the primary server determines that the standby server is faulty. In contrast to the failure described above, it takes a certain amount of time – maximum is wal_sender_timeout seconds – to confirm the standby's death on the primary server even if a standby server is no longer able to send any response by some failures (e.g. standby server's hardware failure, network failure, and so on).
Depending on the types of failures, it can usually be detected immediately after a failure occurs, while sometimes there might be a time lag between the occurrence of failure and the detection of it. In particular, if a latter type of failure occurs in synchronous standby server, all transaction processing on the primary server will be stopped until detecting the failure of standby, even though multiple potential standby servers may have been working.
The parameter wal_sender_timeout was called as replication_timeout in version 9.2 or earlier.