Model Checking pgpool-II with SPIN
I remembered doing some model checking of pgpool-II (a PostgreSQL replication tool) with SPIN back in 2010, so I'm sharing some of the results.
To start, I investigated the behavior of replication-strict mode, which was abolished in v2.0.
The goal was to prepare for and reproduce a bug that was later fixed in v2.2 (and also to get familiar with SPIN).
Note: more precisely, the "abolition of replication-strict mode" means that the strict behavior became the default from v2.0 onward, and thus the configuration setting itself was removed.
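For reference, here is a sketch of how the setting might have looked in pgpool.conf before its removal (the comment wording is mine, not the shipped sample file):

```
# pgpool.conf (before v2.0; the parameter was removed in v2.0)
# When true, pgpool waits for the master to finish each query
# before sending it to the secondary (strict replication).
replication_strict = true
```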
The SPIN (Promela) code is as follows:
First, I checked the behavior when replication-strict=false.
I commented out both STRICT and SERIALIZABLE in the source code’s declaration section.
This sets replication-strict=false and the transaction isolation level to READ_COMMITTED.
I then investigated a scenario where two clients simultaneously UPDATE the same column.
Specifically, I performed a full state space search for potential data inconsistency (checked by the assert in the watchdog process) and the possibility of a deadlock (invalid end states).
/* #define STRICT */
/* #define SERIALIZABLE */
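The full .pml file is not reproduced here, but to make the output below easier to read, here is a rough, illustrative sketch of the model's shape. The master, slave, and pool proctypes are omitted; the names follow the variables visible in the simulation output, and everything else is my reconstruction:

```
mtype = { LOCKED, UNLOCKED, ACK, OK };

mtype master_mutex = UNLOCKED;      /* row lock on the master    */
mtype slave_mutex  = UNLOCKED;      /* row lock on the slave     */
byte  master_rowval, slave_rowval;  /* the replicated column     */
bit   pool_end[2];                  /* per-session "done" flags  */

chan to_master[2] = [1] of { mtype, byte };
chan to_slave[2]  = [1] of { mtype, byte };

/* Once both client sessions have finished, the replicas must
   agree; a violation of this assert is the "data inconsistency"
   mentioned above. */
active proctype watchdog()
{
    (pool_end[0] == 1 && pool_end[1] == 1);
    assert(master_rowval == slave_rowval)
}
```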
Running the SPIN commands:
$ spin -a pgpool-v2.0-strict.pml
$ gcc -w -o pan -D_POSIX_SOURCE -DMEMLIM=128 -DSAFETY -DNOCLAIM -DXUSAFE -DNOFAIR pan.c
$ ./pan -v -m10000 -w19 -c10
The result showed “errors: 2”, indicating that some errors were found.
pan:1: invalid end state (at depth 76)
... (omitted) ...
Full statespace search for:
never claim - (not selected)
assertion violations +
cycle checks - (disabled by -DSAFETY)
invalid end states +
State-vector 96 byte, depth reached 111, errors: 2
Simulating the first error with spin -j0:
$ spin -p -v -g -s -r -t -j0 pgpool-v2.0-strict.pml
... (omitted) ...
77: proc 6 (pool) line 137 "pgpool-v2.0-strict.pml" Recv ACK,OK <- queue 1 (to_master[0])
77: proc 6 (pool) line 137 "pgpool-v2.0-strict.pml" (state 4) [to_master[p]?ACK,pm]
spin: trail ends after 77 steps
#processes: 8
master_mutex = LOCKED
slave_mutex = LOCKED
master_rowval = 9
slave_rowval = 9
pool_end[0] = 0
pool_end[1] = 0
77: proc 7 (pool) line 138 "pgpool-v2.0-strict.pml" (state 7)
77: proc 6 (pool) line 137 "pgpool-v2.0-strict.pml" (state 5)
77: proc 5 (slave) line 100 "pgpool-v2.0-strict.pml" (state 23)
77: proc 4 (slave) line 19 "pgpool-v2.0-strict.pml" (state 5)
77: proc 3 (master) line 19 "pgpool-v2.0-strict.pml" (state 5)
77: proc 2 (master) line 89 "pgpool-v2.0-strict.pml" (state 23)
77: proc 1 (watchdog) line 74 "pgpool-v2.0-strict.pml" (state 8)
77: proc 0 (:init:) line 166 "pgpool-v2.0-strict.pml" (state 12) <valid end state>
8 processes created
The simulation output clearly shows that after 77 steps, both master_mutex and slave_mutex are LOCKED, resulting in a deadlock.
Note that SPIN doesn’t inherently understand LOCK/UNLOCK; it recognizes that the simulation cannot progress further and reports that state. We, the users, interpret this as a deadlock.
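To see how SPIN surfaces such a deadlock, here is a minimal, self-contained example (separate from the pgpool model): two processes acquire the same two locks in opposite order.

```
bit m1, m2;  /* two locks: 0 = free, 1 = held */

inline lock(m)   { atomic { m == 0 -> m = 1 } }
inline unlock(m) { m = 0 }

/* A takes m1 then m2; B takes m2 then m1. In the interleaving
   where each holds its first lock, neither can progress. */
active proctype A() { lock(m1); lock(m2); unlock(m2); unlock(m1) }
active proctype B() { lock(m2); lock(m1); unlock(m1); unlock(m2) }
```

Verifying this with the same spin -a / gcc / ./pan sequence reports an invalid end state, which we again interpret as a deadlock.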
The second error was similar, also detecting a deadlock (symmetrical because there are two client processes).
Next, I investigated replication-strict=true by uncommenting #define STRICT.
#define STRICT
/* #define SERIALIZABLE */
The execution result showed “errors: 0”. This confirmed that no data inconsistency or deadlock occurred under these conditions.
... (omitted) ...
State-vector 96 byte, depth reached 111, errors: 0
This suggests that using STRICT mode with the READ_COMMITTED level offers safety, at least in this scenario.
Finally, I switched the transaction isolation level from READ_COMMITTED to SERIALIZABLE.
#define STRICT
#define SERIALIZABLE
The result changed to “errors: 4”.
pan:1: assertion violated (master_rowval==slave_rowval) (at depth 111)
... (omitted) ...
Full statespace search for:
never claim - (not selected)
assertion violations +
cycle checks - (disabled by -DSAFETY)
invalid end states +
State-vector 96 byte, depth reached 355, errors: 4
The output indicates that the assertion assert(master_rowval==slave_rowval) was violated, meaning a data inconsistency occurred between the master and slave databases.
Conclusion: even with STRICT mode, data inconsistency could occur in v2.0 under the SERIALIZABLE isolation level.
This demonstrates that, given an appropriate model, SPIN can discover or highlight potential bugs and subtle system behaviors. Testing distributed systems is notoriously difficult, so leveraging formal methods such as model checking, together with any available tools, is highly recommended.
This brings us to the main part. The behavior described above reproduced a bug that was fixed in pgpool-II v2.2.
The fix addresses a scenario where an error (like a “could not serialize access due to concurrent update” error) occurs, but the transactions on all DB nodes are not aborted, leading to data inconsistency.
The fix in v2.2 ensures that all DB node transactions are aborted if a non-serializable error occurs.
Here is pgpool-II's documentation of the inconsistency before the fix:
* English version:
pool_process_query.c: Check serialization failure error and
abort all nodes if so. Otherwise we allow data inconsistency
among DB nodes. See following scenario: (M:master, S:slave)
* Japanese version (translated):
When an error occurs that prevents a transaction from being serialized,
transactions on all DB nodes are now aborted.
Without this, data inconsistency can occur between DB nodes (Tatsuo).
An example follows (M: master, S: slave).
M:S1:BEGIN;
M:S2:BEGIN;
S:S1:BEGIN;
S:S2:BEGIN;
M:S1:SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
M:S2:SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
S:S1:SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
S:S2:SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
M:S1:UPDATE t1 SET i = i + 1;
S:S1:UPDATE t1 SET i = i + 1;
M:S2:UPDATE t1 SET i = i + 1; <-- blocked
S:S1:COMMIT;
M:S1:COMMIT;
M:S2:ERROR: could not serialize access due to concurrent update
S:S2:UPDATE t1 SET i = i + 1; <-- success in UPDATE and data becomes inconsistent!
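The essence of the fix can be sketched in Promela roughly as follows (illustrative only; the names master_res and slave_res and the message types are mine, not taken from the actual model):

```
mtype = { COMMIT, ABORT, OK, NG };

chan to_master = [1] of { mtype };
chan to_slave  = [1] of { mtype };

active proctype pool()
{
    /* suppose the master reported a serialization failure (NG) */
    mtype master_res = NG;
    mtype slave_res  = OK;
    if
    :: master_res == NG || slave_res == NG ->
        /* v2.2 behavior: abort the transaction on ALL DB nodes */
        to_master!ABORT;
        to_slave!ABORT
    :: else ->
        to_master!COMMIT;
        to_slave!COMMIT
    fi
}
```

The v2.0 behavior (OLD_VERSION in the model) corresponds to aborting only the failing node, which lets the slave commit the extra UPDATE shown in the scenario above.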
In this scenario, S2's UPDATE fails on the master with a serialization error, but the corresponding UPDATE succeeds on the slave (where S1 has already committed), so the row values diverge. The updated model for v2.2 incorporates the fix; the macro OLD_VERSION can be toggled to compare the v2.2 and v2.0 behaviors.
The SPIN code is as follows:
Running the model checking commands again:
$ spin -a pgpool-v2.2-serializable.pml
$ gcc -w -o pan -D_POSIX_SOURCE -DMEMLIM=128 -DSAFETY -DNOCLAIM -DXUSAFE -DNOFAIR pan.c
$ ./pan -v -m10000 -w19 -c10
The run now completes with "errors: 0".
... (omitted) ...
Full statespace search for:
never claim - (not selected)
assertion violations +
cycle checks - (disabled by -DSAFETY)
invalid end states +
State-vector 96 byte, depth reached 205, errors: 0
This model verification confirms that the improvement in pgpool-II was successful.
Consider trying to test all such subtle timing issues in a real distributed system; it would be nearly impossible. However, if the modeling is correct (if all the assumptions in the model are accurate), model checking exhaustively verifies the modeled behaviors and gives a level of assurance that testing alone cannot.
While it's often impossible to check all states (in theory, full state space exploration is PSPACE-hard), it's still better than doing nothing. No one avoids testing just because they can't cover all combinations. Model checking should be approached with the same light attitude: it's worth doing!