12.1. Overview and Key Concepts
Beta Version: Work in progress.
This section introduces key concepts necessary to understand the subsequent sections.
12.1.1. Related Processes
In logical replication, four types of processes work cooperatively:
- A walsender on the publisher sends WAL data to the subscriber and performs various coordination tasks.
- A logical replication launcher on the subscriber launches apply workers.
- An apply worker on the subscriber connects to the publisher’s walsender, receives logical change streams, parses messages, and updates target tables.
- A table sync worker on the subscriber performs initial data synchronization for a specific table. It catches up to the main replication stream before handing over updates to the apply worker. Refer to Section 12.2.2 for details.
12.1.2. Outline of Logical Replication
PostgreSQL’s logical replication is row-based. It captures changes to individual rows and streams them in a decoded format. Because logical messages contain the already-evaluated results of operations, logical replication avoids inconsistencies caused by non-deterministic functions like random() or now().
The following example compares logical replication with streaming replication, as illustrated in Figure 12.2.
Figure 12.2. Conceptual comparison of data sending: Streaming vs. Logical Replication.
Consider a scenario where two concurrent transactions (txid=99 and 100) execute SQL statements affecting tbl_a and tbl_b.
In streaming replication, the publisher writes WAL data ($w_{1}$ to $w_{5}$) to the WAL file sequentially as each SQL statement executes. The walsender reads these records and sends them immediately to the standby’s walreceiver, regardless of the transaction’s commit status. The walreceiver writes the received data into its own WAL file in the exact order received, maintaining a physical mirror of the primary server.
In contrast, logical replication processes data based on transaction boundaries and publication scopes. While the walsender reads WAL data as it is generated, it does not send the data immediately. Instead, it accumulates changes in a memory area called the ReorderBuffer, where changes are reassembled per transaction.
The ReorderBuffer within the walsender involves three main processes:
-
Filtering: The walsender filters decoded changes based on the publication scope. Even after txid=100 commits, the walsender only prepares changes related to subscribed tables for sending.
For instance, if the subscriber only tracks tbl_a, the walsender discards change $c_{3}$ (decoded from $w_{3}$ for tbl_b). Refer to Section 12.4.1 for details.
-
Decoding & Buffering: As the walsender reads the WAL, it decodes each record $w_{n}$ into a logical change $c_{n}$ and stores it in the ReorderBuffer. Refer to Section 12.4.3 for details.
-
Sending: Upon encountering the commit record $w_{5}$ for txid=100, the walsender gathers the relevant buffered changes ($c_{1}$ and $c_{4}$) and sends them to the subscriber as a series of messages, finalized by the commit message $c_{5}$.
Changes from uncommitted transactions, such as $c_{2}$ from txid=99, remain buffered and are not sent yet. Refer to Section 12.6 for details.
The structure of the ReorderBuffer is described in Section 12.3.
PostgreSQL provides the pgoutput plugin by default for standard logical replication, though the output process is extensible via plugins. Refer to Section 12.5 for details.
The subscriber’s apply worker achieves replication by reconstructing and executing transactions based on the received messages. Further details are in Section 12.7.
12.1.2.1. Asynchronous vs. Synchronous
PostgreSQL supports both asynchronous and synchronous modes for logical replication. Asynchronous is the default. See Figure 12.3.
Figure 12.3. Comparison of transaction flow in asynchronous and synchronous logical replication modes.
In asynchronous mode, a COMMIT statement on the publisher completes immediately after the local WAL is flushed, without waiting for a subscriber response.
In synchronous mode (specifically when synchronous_commit is set to ‘remote_apply’), the publisher’s commit process waits until it receives an acknowledgment (ACK) from the subscriber. As illustrated in Figure 12.3, the apply worker sends this ACK only after it finishes applying the changes to the subscriber’s database.
The additional time required for a synchronous commit to finalize, compared to asynchronous mode, is the End-to-End Latency.
Transactions involving large volumes of changes require more time for decoding, transmission, and application. Consequently:
- In asynchronous mode, larger transaction volumes increase replication lag, extending the window of data inconsistency.
- In synchronous mode, larger volumes increase commit latency on the publisher, as the backend process must wait for the entire replication pipeline to finalize.
12.1.2.2. Management and Optimization of Large Transactions
This section examines the behavior of “Large Transactions” — those with significant change volumes — and the evolution of optimization techniques.
[1] Standard Transaction Processing (streaming = off)
When change volume exceeds the ReorderBuffer capacity (defined by logical_decoding_work_mem), the walsender persists data to spill files on disk. This is illustrated in Figure 12.4 [1].
Figure 12.4: Behavior of large transactions when streaming is disabled.
The processing flow is:
- INSERT operations execute on tbl_a, and decoded changes accumulate in the ReorderBuffer.
- Upon exceeding the memory limit, the ReorderBuffer scans currently held records.
- The walsender generates spill files and moves memory-overflow data to disk to free buffer space.
- The transaction COMMITS on the publisher.
- After detecting the commit, the walsender sends the aggregated changes (from spill files and memory) to the subscriber.
- The apply worker applies the received messages sequentially.
Mechanics regarding spill files are in Section 12.4.4.
Under this configuration, the walsender sends changes for only one transaction at a time per connection. Consequently, the subscriber serializes the application of changes, which often results in significant replication lag (Figure 12.4 [2]).
[2] In-Progress Streaming of Large Transactions (Version 14 and Later)
Version 14 introduced the pre-emptive sending of change data before a transaction commits. This mitigates transfer overhead and reduces replication lag. This is illustrated in Figure 12.5 [1].
Figure 12.5: Mechanism of in-progress streaming for large transactions.
The operational flow is:
- INSERT operations execute, and data accumulates in the ReorderBuffer.
- Upon exceeding the memory threshold (defined by logical_decoding_work_mem), the walsender initiates sending immediately, without waiting for the commit.
- The apply worker buffers incoming data in memory using a
StringInfoDatastructure named original_msg. - If the subscriber-side memory limit is reached, the worker persists data to Temp files.
- The transaction COMMITS on the publisher.
- The apply worker retrieves the data from Temp files (or memory) and applies the changes.
Unlike the walsender which manages changes within a structured ReorderBuffer, the apply worker buffers incoming changes as a raw binary stream within a generic StringInfoData structure (original_msg).
Enable streaming by specifying “streaming = on” in the CREATE SUBSCRIPTION command. The apply worker’s message buffer capacity is also governed by logical_decoding_work_mem. Temp file management details are in Section 12.7.2.1.
This method eliminates the transfer latency that typically occurs after a commit. Even with concurrent large transactions, pre-emptive transfer ensures the application process begins promptly after the commit (Figure 12.5 [2]).
[3] Parallel Application of Streamed Changes (Version 16 and Later)
Version 16 introduced the capability to parallelize the application process during streaming before the transaction commit. This overlaps both data transfer and application. This is illustrated in Figure 12.6 [1].
Figure 12.6: Mechanism and efficiency of parallel apply workers during streaming.
The flow for parallel streaming is:
- INSERT operations lead to data accumulation in the ReorderBuffer.
- Upon exceeding the memory threshold, the walsender begins sending change data.
- The leader apply worker receives the stream and dispatches data to a parallel apply worker.
- The parallel apply worker immediately begins applying changes to target tables.
- The COMMIT executes on the publisher.
- The leader apply worker receives the commit message and sends it to the parallel apply worker.
- The parallel apply worker completes the local transaction and finalizes the application.
Enable parallel application by specifying streaming = ‘parallel’ in the CREATE SUBSCRIPTION.
This architecture allows nearly total overlap between publisher processing and subscriber application. This drastically reduces replication lag during large updates (Figure 12.6 [2]).
However, parallel execution is not guaranteed in every scenario. If active workers reach the max_parallel_apply_workers_per_subscription limit, the subscriber reverts to serial application upon commit – matching the behavior of “streaming = on”. Further constraints are discussed in Section 12.7.3.
12.1.3. Replica Identity
PostgreSQL logical replication is row-based and operates on logical data rows rather than physical storage layouts (such as blocks or offsets). Consequently, when the subscriber executes UPDATE or DELETE operations, the system requires a “search key” to identify exactly which row to modify. This configuration is known as the Replica Identity.
If a proper Replica Identity is not defined, the subscriber cannot uniquely identify target rows. This leads to replication errors or unintended data modifications. To prevent such issues, the publisher rejects UPDATE or DELETE attempts on a table lacking a Replica Identity by returning an error:
testdb=# UPDATE tbl SET data = 'updated_value' WHERE id = 1;
ERROR: cannot update table "tbl" because it does not have a replica identity and publishes updates
HINT: To enable updating the table, set REPLICA IDENTITY using ALTER TABLE.12.1.3.1. Types of Replica Identity
PostgreSQL provides four Replica Identity modes, which are configurable on a per-table basis:
| Mode | Description | Usage and Characteristics |
|---|---|---|
| DEFAULT | Uses the Primary Key columns as the identifier. | This mode is applied automatically when a Primary Key is defined. |
| USING INDEX | Uses a specific unique, non-null index as the identifier. | This is useful for tables without a Primary Key where a specific unique index serves as the key. |
| FULL | Records the old values of all columns in the row. | This is required for tables with no unique constraints. This mode increases message size; see Section 12.4.3 for details. |
| NOTHING | Records no identity information. | This is the default for tables without a PK. INSERT operations proceed, but UPDATE and DELETE cannot be replicated. |
12.1.3.2. Configuration and Verification
The ALTER TABLE … REPLICA IDENTITY command configures the Replica Identity. When a Primary Key is created, the system automatically sets the mode to DEFAULT.
Example: Specifying a unique index
testdb=# CREATE TABLE tbl_ri (id int NOT NULL, name text, data int NOT NULL);
testdb=# CREATE UNIQUE INDEX tbl_ri_idx ON tbl_ri (id, data);
testdb=# ALTER TABLE tbl_ri REPLICA IDENTITY USING INDEX tbl_ri_idx;Example: Configuration to FULL
testdb=# CREATE TABLE tbl_ri_full (id int, name text, data int);
testdb=# ALTER TABLE tbl_ri_full REPLICA IDENTITY FULL;Each table’s Replica Identity configuration is stored in the relreplident column of the pg_class system catalog. The following SQL query verifies this configuration:
testdb=# -- Values: 'd' (default), 'n' (nothing), 'f' (full), 'i' (index)
testdb=# SELECT relname, relreplident FROM pg_class WHERE relname = 'tbl_ri';
relname | relreplident
---------+--------------
tbl_ri | i
(1 row)
testdb=# SELECT relname, relreplident FROM pg_class WHERE relname = 'tbl_ri_full';
relname | relreplident
-------------+--------------
tbl_ri_full | f
(1 row)While pg_class.relreplident indicates the type of Replica Identity, it does not store the specific index OID. Instead, when “REPLICA IDENTITY USING INDEX” is configured, the designated index is recorded in the pg_index system catalog. Specifically, the indisreplident column (a boolean type) is set to true for the chosen index.
The following SQL query identifies which specific index serves as the Replica Identity for a given table:
testdb=# SELECT rel.relname AS table_name, idx_rel.relname AS index_name
FROM pg_class rel
JOIN pg_index idx ON rel.oid = idx.indrelid
JOIN pg_class idx_rel
ON idx.indexrelid = idx_rel.oid
WHERE rel.relname = 'tbl_ri' AND idx.indisreplident = true;
table_name | index_name
------------+------------
tbl_ri | tbl_ri_idx
(1 row)12.1.4. Replication Origin
Replication Origin identifies the source of a data change. It serves two primary purposes:
-
Tracking Replication Progress (Recovery Control): When applying data from an external node, the subscriber records the publisher’s COMMIT LSN (Log Sequence Number) within its own WAL. By maintaining this mapping, the system ensures logical replication accurately resumes from the correct point after an interruption.
-
Prevention of Infinite Replication Loops (Circular Replication): In bidirectional replication, a change sent from Node A to Node B might inadvertently be sent back to Node A. By “stamping” each change with an origin, the system distinguishes local changes from replicated changes, preventing redundant re-transmissions.
While Replication Origin is deeply integrated into the logical replication framework, its most fundamental element is the origin_id. This local identifier allows a subscriber to internally distinguish between different publishers (origins).
The mechanisms for tracking progress are embedded within the architecture and detailed in Section 12.8. This section focuses on loop prevention, a feature introduced in version 16.
It is essential to recognize that an origin_id is not a cluster-wide or globally unique identifier. Instead, a subscriber assigns this value internally to distinguish among the multiple publishers it connects to.
Consequently, the specific numeric value of an origin_id carries no significance to other nodes in the replication topology. This value serves only as a simple binary distinction: whether it is zero or non-zero.
- An origin_id of 0 signifies changes from transactions originally executed on that node.
- An origin_id of 1 or greater signifies the node was replaying changes received from an upstream source.
12.1.4.1. Preventing Infinite Replication Loops via Replication Origin
PostgreSQL logical replication allows subscriber-side tables to remain writable. Leveraging this allows two nodes to function simultaneously as both publishers and subscribers, achieving active-active replication (multi-primary configuration), as shown in Figure 12.7 [1].
Figure 12.7. Active-Active replication and the infinite replication loop.
Prior to version 16, logical replication unconditionally forwarded all decoded changes. In a bidirectional configuration, a change originating on Node 1 propagated to Node 2; Node 2 then treated this as a “new local change” and sent it back to Node 1. This unstoppable chain is an infinite replication loop (or circular replication).
Version 16 addressed this issue with the following mechanism. Although the architecture is versatile, the current implementation follows these behaviors:
- origin = ‘any’ (Default): The walsender sends changes regardless of whether the WAL was generated locally or by applying messages from a publisher.
- origin = ’none’: The walsender excludes WAL records generated by applying messages from an external node.
The “origin” setting is a configuration option in the CREATE SUBSCRIPTION command.
Internal Mechanism and Filtering Sequence
The subscriber and publisher coordinate to identify and filter origins:
-
Origin Stamping: When the subscriber’s apply worker commits a transaction, it attaches an origin_id to the COMMIT (or ABORT) WAL record1. Current implementations assign a non-zero integer to changes from external nodes.
-
Evaluation by walsender: While decoding a transaction in the ReorderBuffer, the publisher’s walsender examines the origin information within the COMMIT WAL record.
-
Execution of Filtering: If origin = “none” is configured and an origin_id is present, the walsender skips the transmission of that entire transaction and discards the data.
Figure 12.8 illustrates this behavior using a cascaded setup (Node 1 $\rightarrow$ Node 2 $\rightarrow$ Node 3).
Figure 12.8. Data propagation behavior based on the origin setting.
[1] Case: origin = ‘any’ (Propagation across all nodes)
- Node 1: Executes an INSERT. Since no origin exists in the COMMIT, the walsender sends the data to Node 2.
- Node 2: The apply worker applies the data and records “origin_id = 1”. Because origin = “any”, the walsender sends this stamped data to Node 3.
- Node 3: The apply worker receives data from Node 2. Despite the origin_id, the worker applies the change because `origin = “any”.
[2] Case: origin = ’none’ (Termination at the intermediate node)
- Node 1: Executes an INSERT. The data is sent to Node 2 as no origin is present.
- Node 2: The apply worker applies the change and records “origin_id = 1”. When the walsender identifies the origin_id via the ReorderBuffer, it skips sending the message to Node 3.
- Node 3: No data is received because the walsender on Node 2 filters out the transaction.
Configuring origin = “none” in the subscription settings on all participating nodes prevents infinite replication loops, which enables active-active topologies.
12.1.5. Replication Slot
Logical replication slots include five additional attributes compared to physical streaming replication. The following three are essential for the discussions in later sections:
- plugin: The name of the output plugin used for logical decoding (e.g., pgoutput).
- database: The name of the database to which the replication slot is attached. While physical replication slots are instance-wide, logical replication slots are scoped strictly to a single database.
- confirmed_flush_lsn: The Log Sequence Number (LSN) up to which the subscriber’s apply worker has confirmed data receipt. The publisher no longer retains committed transactions prior to this LSN, making them eligible for removal. See Section 12.8 for details.
Note that although logical slots also include attributes such as catalog_xmin and two_phase, this documentation omits their descriptions. Refer to the official documentation for details.
12.1.6. Conflicts
Logical replication does not replicate DDL operations and is not affected by VACUUM processes on the publisher, unlike physical streaming replication. Consequently, the conflict types associated with streaming replication do not occur.
However, conflicts arise primarily due to concurrent data modifications at the application level on the subscriber. For example, if a row is deleted directly on the subscriber and the publisher subsequently attempts to update that same row, an “update_missing” conflict occurs.
A comprehensive list of conflicts detected by PostgreSQL is available in the official documentation: Logical Replication: Conflicts.
-
In the COMMIT (or ABORT) WAL record, the origin_id is included in the header, while the “origin_commit_lsn” and “origin_commit_timestamp” are added to the extended section. ↩︎