11.1. Starting the Streaming Replication

In streaming replication, three types of processes work cooperatively:

  • A walsender process on the primary server sends WAL (Write-Ahead Log) data to the standby server.
  • A walreceiver process on the standby server receives the WAL data and writes it to disk.
  • A startup process on the standby server starts the walreceiver process and replays the received WAL data.

The walsender and walreceiver communicate using a single TCP connection.

The startup sequence of streaming replication is shown in Figure 11.1:

Fig. 11.1. SR startup sequence.
  • (1) Start the primary and standby servers.

  • (2) The standby server starts the startup process.

  • (3) The standby server starts a walreceiver process.

  • (4) The walreceiver sends a connection request to the primary server. If the primary server is not running, the walreceiver retries the request periodically.

  • (5) When the primary server receives a connection request, it starts a walsender process and a TCP connection is established between the walsender and walreceiver.

  • (6) The walreceiver sends the latest LSN (Log Sequence Number) of the standby’s database cluster. This exchange is generally known as handshaking.

  • (7) If the standby’s latest LSN is less than the primary’s latest LSN (standby’s LSN < primary’s LSN), the walsender sends the WAL data between the two LSNs. This WAL data is read from the WAL segments stored in the primary’s pg_wal subdirectory (pg_xlog in versions 9.6 or earlier). The standby server then replays the received WAL data. Because the standby catches up with the primary in this phase, it is called catch-up.

  • (8) Streaming replication begins.
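
The LSN comparison in steps (6) and (7) can be illustrated with a minimal Python sketch. It relies on the fact that PostgreSQL prints an LSN as two hexadecimal numbers separated by a slash (the high and low 32 bits of a 64-bit position). The function names parse_lsn and catch_up_range are hypothetical helpers for illustration only; they are not part of PostgreSQL.

```python
def parse_lsn(text: str) -> int:
    """Convert an LSN in 'hi/lo' hexadecimal text form to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def catch_up_range(standby_lsn: str, primary_lsn: str):
    """Return the (start, end) positions of the WAL data the standby is missing.

    Returns None when the standby is not behind the primary, in which
    case no catch-up phase is needed (simplified model of step (7)).
    """
    start = parse_lsn(standby_lsn)
    end = parse_lsn(primary_lsn)
    if start < end:
        return (start, end)
    return None


if __name__ == "__main__":
    # Example LSN values are made up for illustration.
    missing = catch_up_range("0/3000060", "0/5000000")
    if missing is not None:
        print("bytes of WAL to send:", missing[1] - missing[0])
```

Comparing LSNs as integers rather than as strings matters because the text form is not zero-padded, so a lexical comparison can give the wrong order.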

Each walsender process maintains a state that reflects the current working phase of the connected walreceiver or application. The following are the possible states of a walsender process:

  • start-up – From the start of the walsender until the end of handshaking; reported as startup in pg_stat_replication. See Figs. 11.1(5)–(6).
  • catch-up – During the catch-up phase; reported as catchup in pg_stat_replication. See Fig. 11.1(7).
  • streaming – While streaming replication is working. See Fig. 11.1(8).
  • backup – While sending the files of the whole database cluster to a backup tool such as the pg_basebackup utility.
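
As a rough illustration of how these states relate to the startup sequence, the following Python sketch models the state of a walsender that serves a standby. It is a simplified model, not the server's actual logic: the function replication_state and its parameters are hypothetical, and the backup state is listed only for completeness because it belongs to a separate kind of connection. The string values are the ones the state column of pg_stat_replication reports.

```python
from enum import Enum


class WalSenderState(Enum):
    STARTUP = "startup"      # until handshaking ends
    CATCHUP = "catchup"      # standby is behind the primary
    STREAMING = "streaming"  # standby has caught up
    BACKUP = "backup"        # serving a base backup (not modeled below)


def replication_state(handshake_done: bool,
                      standby_lsn: int,
                      primary_lsn: int) -> WalSenderState:
    """Pick the state of a walsender serving a standby (simplified)."""
    if not handshake_done:
        return WalSenderState.STARTUP
    if standby_lsn < primary_lsn:
        return WalSenderState.CATCHUP
    return WalSenderState.STREAMING


if __name__ == "__main__":
    print(replication_state(False, 0, 100).value)
    print(replication_state(True, 40, 100).value)
    print(replication_state(True, 100, 100).value)
```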

The pg_stat_replication view shows the state of all running walsenders. An example is shown below:

testdb=# SELECT application_name,state FROM pg_stat_replication;
 application_name |   state
------------------+-----------
 standby1         | streaming
 standby2         | streaming
 pg_basebackup    | backup
(3 rows)

As the result above shows, two walsenders are running to send WAL data to the connected standby servers, and another one is running to send all files of the database cluster to the pg_basebackup utility.

What happens if a standby server restarts after being stopped for a long time?

In versions 9.3 or earlier, if the primary’s WAL segments required by the standby server have already been recycled, the standby cannot catch up with the primary server.

No reliable solution exists for this problem in those versions. The only mitigation is to set the configuration parameter wal_keep_segments to a large value, which reduces the likelihood of the problem but does not eliminate it. This is a stopgap solution.
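
To see why this is only a stopgap, consider the following Python sketch, which estimates whether the segment a standby needs is still guaranteed to exist on the primary. It assumes the default WAL segment size of 16 MB and treats wal_keep_segments as the number of past segments kept behind the primary's current segment. The helper names segment_number and guaranteed_available are hypothetical, and the real retention logic on the server involves checkpoints and other settings, so this is a conservative model rather than an exact rule.

```python
WAL_SEGMENT_BYTES = 16 * 1024 * 1024  # assumption: default 16 MB segments


def segment_number(lsn: int, segment_bytes: int = WAL_SEGMENT_BYTES) -> int:
    """Return the number of the WAL segment that contains the given LSN."""
    return lsn // segment_bytes


def guaranteed_available(standby_lsn: int,
                         primary_lsn: int,
                         wal_keep_segments: int) -> bool:
    """True if the segment holding standby_lsn is within the kept range."""
    oldest_kept = max(segment_number(primary_lsn) - wal_keep_segments, 0)
    return segment_number(standby_lsn) >= oldest_kept


if __name__ == "__main__":
    primary = 100 * WAL_SEGMENT_BYTES + 5  # made-up position in segment 100
    # With wal_keep_segments = 32, segments 68 and later are kept.
    print(guaranteed_available(68 * WAL_SEGMENT_BYTES, primary, 32))
    print(guaranteed_available(68 * WAL_SEGMENT_BYTES - 1, primary, 32))
```

However large the value, a standby that stays down long enough falls outside the kept range, which is why the setting cannot guarantee that catch-up is possible.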

In versions 9.4 or later, this problem can be prevented by using a replication slot. Replication slots are a feature that makes the sending of WAL data more flexible, mainly for logical replication, and they also solve this problem: the primary pauses the recycling of the WAL segment files under pg_wal (pg_xlog in versions 9.6 or earlier) that contain unsent data, so those files are kept until the standby has received them. Refer to the official documentation for details.
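
The retention effect of a replication slot can be sketched in Python as follows. The sketch assumes 16 MB segments and uses each slot's restart_lsn (a column of the pg_replication_slots view) as the oldest position that slot still needs. The function recyclable_segments is a hypothetical helper, and the model ignores the other conditions, such as checkpoints, that also govern recycling on a real server.

```python
WAL_SEGMENT_BYTES = 16 * 1024 * 1024  # assumption: default 16 MB segments


def recyclable_segments(segment_numbers, slot_restart_lsns,
                        segment_bytes=WAL_SEGMENT_BYTES):
    """Return the segments that no replication slot still needs.

    A segment may be recycled only if it is older than the segment
    containing the smallest restart_lsn among all slots.
    """
    if not slot_restart_lsns:
        raise ValueError("this sketch models only the case with slots")
    oldest_needed = min(slot_restart_lsns) // segment_bytes
    return [s for s in sorted(segment_numbers) if s < oldest_needed]


if __name__ == "__main__":
    segments = range(10, 16)  # made-up segments present on the primary
    slots = [12 * WAL_SEGMENT_BYTES + 100, 14 * WAL_SEGMENT_BYTES]
    print(recyclable_segments(segments, slots))
```

Because the slowest slot determines what is kept, a standby that never reconnects causes WAL to accumulate on the primary, which is the trade-off of this approach.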