11.4. Detecting Failures of Standby Servers
Streaming replication uses two common failure detection procedures that do not require any special hardware.
Failure detection of the standby server process:
- When a connection drop between the walsender and walreceiver is detected, the primary server immediately determines that the standby server or walreceiver process is faulty.
- When a low-level network function returns an error because a write to or a read from the walreceiver's socket has failed, the primary server likewise immediately determines that the walreceiver has failed.
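The two cases above both surface as an immediate result from an ordinary socket call. The following is a minimal Python sketch of that idea, not PostgreSQL's actual code: the function name peer_connection_lost and the use of a local socket pair as stand-ins for the walsender and walreceiver endpoints are assumptions made for illustration only.

```python
import socket


def peer_connection_lost(sock: socket.socket) -> bool:
    """Return True if a non-blocking read shows the peer closed or the socket errored.

    Hypothetical helper for illustration; it is not part of PostgreSQL.
    """
    try:
        # MSG_PEEK inspects pending data without consuming it.
        data = sock.recv(1, socket.MSG_PEEK)
    except BlockingIOError:
        # Nothing to read right now, but the connection is still open.
        return False
    except OSError:
        # Any other socket error (for example, a connection reset) is a failure.
        return True
    # An empty read means the peer closed its end of the connection.
    return data == b""


# Stand-ins for the walsender (primary) and walreceiver (standby) endpoints.
primary_end, standby_end = socket.socketpair()
primary_end.setblocking(False)

lost_before = peer_connection_lost(primary_end)  # standby end still open
standby_end.close()                              # simulate the standby process exiting
lost_after = peer_connection_lost(primary_end)   # closure is visible right away
primary_end.close()
```

No timer is involved here: as soon as the peer's end is closed, the next socket call reports it, which is why this class of failure is detected without delay.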
Failure detection of hardware and networks:
- If the walreceiver sends no response within the time set by the parameter wal_sender_timeout (default: 60 seconds), the primary server determines that the standby server is faulty.
- In contrast to the failures described above, this kind of failure takes time to confirm. When a standby server can no longer send any response (e.g., because of a hardware failure on the standby or a network failure), the primary server needs up to wal_sender_timeout seconds to confirm that the standby is down.
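The timeout rule above can be sketched as a simple comparison between the current time and the time of the last reply. This is a simplified Python model, not the walsender's implementation: the names standby_timed_out, last_reply_at, and now are hypothetical, and times are plain seconds. The sketch also assumes the documented behavior that a wal_sender_timeout of zero disables the timeout mechanism.

```python
WAL_SENDER_TIMEOUT_S = 60.0  # default value of wal_sender_timeout, in seconds


def standby_timed_out(last_reply_at: float, now: float,
                      timeout_s: float = WAL_SENDER_TIMEOUT_S) -> bool:
    """Return True once no reply has arrived for at least timeout_s seconds.

    Hypothetical helper for illustration; all times are in seconds.
    """
    if timeout_s <= 0:
        # Assumption: a timeout of zero disables the mechanism.
        return False
    return now - last_reply_at >= timeout_s


# A standby that last replied at t = 0 is still considered alive at t = 59.9
# and is declared faulty once the full timeout has elapsed.
still_alive = not standby_timed_out(last_reply_at=0.0, now=59.9)
declared_faulty = standby_timed_out(last_reply_at=0.0, now=60.0)
```

Unlike a socket error, nothing signals the failure here; the primary can only infer it from silence, so detection cannot happen before the timeout has run out.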
Depending on its type, a failure can usually be detected immediately after it occurs. In some cases, however, there is a time lag between the occurrence of the failure and its detection. In particular, if the latter type of failure (a hardware or network failure) occurs on a synchronous standby server, all transaction processing on the primary server stops until the failure of the standby is detected, even if multiple potential standby servers are working.
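As a rough worked example of this time lag, the length of the stall can be estimated from the moment of the standby's last reply. The Python sketch below is a simplified model under stated assumptions: the primary is assumed to declare the standby faulty exactly when wal_sender_timeout has elapsed since the last reply, and the function name commit_stall_seconds and its parameters are hypothetical.

```python
def commit_stall_seconds(last_reply_at: float, failure_at: float,
                         timeout_s: float = 60.0) -> float:
    """Estimate how long transactions wait after a silent synchronous-standby failure.

    Simplified model: detection is assumed to happen exactly timeout_s seconds
    after the last reply. All times are in seconds.
    """
    detected_at = last_reply_at + timeout_s
    return max(0.0, detected_at - failure_at)


# The standby replied at t = 10 and failed silently at t = 12:
# transactions wait from t = 12 until detection at t = 70.
stall = commit_stall_seconds(last_reply_at=10.0, failure_at=12.0)

# Worst case: the standby fails right after replying, so the stall
# equals the full timeout.
worst_case = commit_stall_seconds(last_reply_at=10.0, failure_at=10.0)
```

Under this model the stall never exceeds wal_sender_timeout, which matches the "up to wal_sender_timeout seconds" bound stated above.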