How streaming replication starts up

This post is a part of my document.

In this post, I will explore the start-up sequence of streaming replication to understand how its processes are started and how the connection between them is established.

Startup sequence

In streaming replication, three kinds of processes work cooperatively. A walsender process on the primary server sends WAL data to the standby server; a walreceiver process and a startup process on the standby server then receive and replay that data. The walsender and the walreceiver communicate over a single TCP connection.

Figure 1 shows the startup sequence diagram of streaming replication:

Figure 1: SR startup sequence

Start primary and standby servers.
The standby server starts a startup process.
The standby server starts a walreceiver process.
The walreceiver sends a connection request to the primary server. If the primary server is not running, the walreceiver retries the request periodically.
When the primary server receives a connection request, it starts a walsender process and a TCP connection is established between the walsender and walreceiver.
The walreceiver sends the latest LSN of the standby’s database cluster to the walsender. This phase is generally known as handshaking.
If the standby’s latest LSN is less than the primary’s latest LSN (Standby’s LSN < Primary’s LSN), the walsender sends the WAL data from the former LSN to the latter LSN. This WAL data is read from the WAL segments stored in the primary’s pg_xlog subdirectory (renamed pg_wal in version 10). The standby server then replays the received WAL data. In this phase, the standby catches up with the primary, so it is called catch-up.
Streaming Replication begins to work.
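
The LSN comparison in the catch-up step can be sketched in a few lines of Python. This is only an illustration, not PostgreSQL's own code: it assumes the usual textual LSN form of two hexadecimal numbers separated by a slash (the high and low 32 bits of a 64-bit position), and the function names and sample LSNs are hypothetical.

```python
def parse_lsn(text: str) -> int:
    """Convert an LSN written as 'hi/lo' (two hex numbers) into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def format_lsn(value: int) -> str:
    """Convert a 64-bit integer back into the 'hi/lo' text form."""
    return f"{value >> 32:X}/{value & 0xFFFFFFFF:X}"


def catch_up_range(standby_lsn: str, primary_lsn: str):
    """Return (start, end, byte_count) of the WAL to send, or None if caught up."""
    start, end = parse_lsn(standby_lsn), parse_lsn(primary_lsn)
    if start >= end:
        return None  # standby is not behind; no catch-up phase is needed
    return format_lsn(start), format_lsn(end), end - start


# Hypothetical LSNs, chosen only for illustration.
print(catch_up_range("0/3000060", "0/5000000"))
```

When the standby's LSN is lower, the returned range is the WAL that the walsender would have to send before regular streaming can begin.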

What happens if a standby server restarts after being stopped for a long time?

In version 9.3 or earlier, if the primary’s WAL segments required by the standby server have already been recycled, the standby cannot catch up with the primary server. There is no reliable solution to this problem; the only option is to set the configuration parameter wal_keep_segments to a large value, which reduces the likelihood of it occurring. This is a stopgap solution.
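
To see why wal_keep_segments is only a stopgap, a rough estimate helps. The sketch below assumes the default WAL segment size of 16 MB and a steady WAL generation rate; the function names and the sample numbers are hypothetical, and real workloads are rarely this uniform.

```python
SEGMENT_SIZE_MB = 16  # default WAL segment size; an assumption if the build differs


def retained_wal_mb(wal_keep_segments: int) -> int:
    """Approximate amount of past WAL (in MB) kept for standbys."""
    return wal_keep_segments * SEGMENT_SIZE_MB


def max_downtime_hours(wal_keep_segments: int, wal_rate_mb_per_hour: float) -> float:
    """Rough upper bound on how long a standby can be down and still catch up."""
    return retained_wal_mb(wal_keep_segments) / wal_rate_mb_per_hour


# Hypothetical setting and workload: 256 segments, 512 MB of WAL per hour.
print(retained_wal_mb(256))          # 4096
print(max_downtime_hours(256, 512))  # 8.0
```

Any outage longer than this estimate, or any burst of write activity, can still leave the standby unable to catch up, which is why the setting only reduces the risk.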

In version 9.4 or later, this problem can be prevented by using a replication slot. Replication slots make WAL data sending more flexible, mainly for logical replication, and they also solve this problem: WAL segment files under pg_xlog that contain unsent data are retained because the replication slot pauses their recycling. Refer to the official documentation for details.
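
The effect of a replication slot on recycling can be illustrated with a toy model. This is a simplified sketch, not PostgreSQL's actual logic: it assumes 16 MB segments numbered from zero, treats LSNs as plain byte offsets, and models the slot as a single "oldest LSN still needed" value (named slot_restart_lsn here after the restart_lsn column of pg_replication_slots). The function name and sample values are hypothetical.

```python
SEGMENT_SIZE = 16 * 1024 * 1024  # 16 MB, the default WAL segment size


def removable_segments(segment_numbers, checkpoint_redo_lsn, slot_restart_lsn=None):
    """Return the segment numbers that may be recycled in this toy model.

    A segment is kept if it contains, or comes after, the oldest LSN that is
    still needed by either crash recovery or the replication slot.
    """
    oldest_needed = checkpoint_redo_lsn
    if slot_restart_lsn is not None:
        oldest_needed = min(oldest_needed, slot_restart_lsn)
    oldest_needed_segment = oldest_needed // SEGMENT_SIZE
    return [n for n in segment_numbers if n < oldest_needed_segment]


segments = list(range(10))            # segments 0..9 exist on the primary
redo_lsn = 8 * SEGMENT_SIZE + 100     # latest checkpoint starts in segment 8
standby_lsn = 3 * SEGMENT_SIZE + 500  # stopped standby last received segment 3

print(removable_segments(segments, redo_lsn))               # [0, 1, 2, 3, 4, 5, 6, 7]
print(removable_segments(segments, redo_lsn, standby_lsn))  # [0, 1, 2]
```

Without the slot, segments 3 to 7 would be recycled and the standby could no longer catch up. With the slot they are retained, at the cost of disk space on the primary while the standby is away.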