Reduce EXPLAIN ANALYZE overhead for row counting

This post is a combined and re-edited version of two earlier posts: The 3 Steps to a Faster EXPLAIN ANALYZE and Two More Steps to a Faster EXPLAIN ANALYZE.

Introduction

A critical issue with EXPLAIN ANALYZE is its significant performance overhead. I therefore focused solely on counting actual rows—i.e., the case where TIMING, BUFFERUSAGE, and WALUSAGE are all FALSE. Through several iterations of optimization, I reduced this overhead from 26.60% down to 1.64%. (See the Appendix for benchmarks.)

The optimization consists of three phases:

  • Phase 1

    • Action / Approach: Simplification of ExecProcNodeInstr() and InstrStopNode().
    • Resulting Overhead: 8.13%
  • Phase 2

    • Action / Approach: Direct instrumentation inside the ExecXXX() family (e.g., ExecSeqScan, ExecIndexScan).
    • Resulting Overhead: 3.95%
    • Issue: Current implementation causes a 0.47% performance degradation for normal SELECT queries.
  • Phase 3

    • Action / Approach: Memory optimization incl. PlanState changes and bit-packing.
    • Resulting Overhead: 1.64%
    • Issue: Inherits the performance degradation issue found in Phase 2.
    • Status: JIT unsupported.

If implemented, this feature offers substantial benefits:

  • Enhanced Monitoring: Enables frequent use of tools like auto_explain to better identify performance issues.
  • Optimizer Improvement: Allows external extensions like Adaptive Query Optimization (AQO) by PostgresPro to perform frequent, low-impact training, leading to more accurate optimizers and stimulating research.
  • Query Progress Observation: Makes it feasible to observe query progress directly by snapshotting the executing plan and its actual row counts.

To make replication easy, I have prepared an environment that allows anyone to reproduce the results by running just a few shell commands:

$ git clone https://github.com/s-hironobu/fast_explain_analyze.git
$ cd fast_explain_analyze
$ bash ./bench.sh setup
$ bash ./bench.sh benchmark

Detailed Design Specification

Profiler results indicate that InstrStartNode(), InstrStopNode(), and ExecProcNodeInstr() are the primary sources of overhead.

  • Measured on Ubuntu 24.04.
  • Test3 size: 35,000 rows

Query-1: SELECT

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 27.38     34.05    34.05 4000040003     0.00     0.00  ExecInterpExpr
 15.68     53.55    19.50 2000000001     0.00     0.00  ExecNestLoop
 11.44     67.77    14.22                             _init
  7.95     77.66     9.89 2000010000     0.00     0.00  tuplestore_gettupleslot
  7.02     86.39     8.73        2     4.37    54.90  ExecAgg
  6.68     94.70     8.31 2000010000     0.00     0.00  tuplestore_gettuple
  6.48    102.76     8.06 1999960000     0.00     0.00  ExecStoreMinimalTuple
  5.13    109.14     6.38 2000050000     0.00     0.00  ExecMaterial
  4.18    114.34     5.20 2000000001     0.00     0.00  fetch_input_tuple
...

Query-2: EXPLAIN ANALYZE (TIMING FALSE)

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 21.24     36.39    36.39 4000040003     0.00     0.00  ExecInterpExpr
 15.53     62.99    26.60                             _init
 12.08     83.69    20.70 2000000001     0.00     0.00  ExecNestLoop
  8.44     98.15    14.46 2000010016     0.00     0.00  tuplestore_gettupleslot
  6.01    108.44    10.29 4000215014     0.00     0.00  InstrStartNode
  5.43    117.74     9.30 2000050000     0.00     0.00  ExecMaterial
  5.32    126.86     9.12 4000215014     0.00     0.00  InstrStopNode
  5.26    135.87     9.01 2000050000     0.00     0.00  tuplestore_ateof
  4.13    142.94     7.07 4000215015     0.00     0.00  ExecProcNodeInstr
...

My approach systematically eliminates these sources while utilizing the existing Instrumentation Framework. (A major architectural overhaul is outside the scope of this proposal.)

Note that preliminary results indicate that inlining these key functions reduces overhead by several percentage points in non-target cases (e.g., when BUFFERUSAGE or WALUSAGE are TRUE). Therefore, all subsequent phases assume inlining is applied.

Phase 1

Approach: Simplifying ExecProcNodeInstr()

When ANALYZE is TRUE but other metrics are FALSE, we can skip InstrStartNode() and reduce InstrStopNode() to just two operations: incrementing tuplecount and setting running=true.

I introduced ExecProcNodeInstrRowCount() to execute only these minimal statements:

static TupleTableSlot *
ExecProcNodeInstrRowCount(PlanState *node)
{
	TupleTableSlot *result;

	 /* Skip calling InstrStartNode() */

	result = node->ExecProcNodeReal(node);

	 /* Perform the essential steps from InstrStopNode() */
	if (!TupIsNull(result))
		node->instrument->tuplecount += 1.0;
	node->instrument->running = true; /* Always true */

	return result;
}

ExecProcNodeFirst() is extended to switch between ExecProcNodeInstr and ExecProcNodeInstrRowCount based on the requested metrics.

(While this proposal focuses solely on counting actual rows, optimization of other metrics can be achieved without conflicts by employing conditional branching within ExecProcNodeFirst().)
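The dispatch logic can be sketched as a standalone model. The struct layout, choose_wrapper(), and the placeholder TupleTableSlot below are simplified stand-ins for the real executor definitions, not the actual patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the executor structures (not the real definitions) */
typedef struct PlanState PlanState;
typedef int TupleTableSlot;		/* placeholder */
typedef TupleTableSlot *(*ExecProcNodeMtd) (PlanState *node);

typedef struct Instrumentation
{
	bool		need_timer;		/* TIMING */
	bool		need_bufusage;	/* BUFFERS */
	bool		need_walusage;	/* WAL */
	double		tuplecount;
	bool		running;
} Instrumentation;

struct PlanState
{
	ExecProcNodeMtd ExecProcNode;		/* current dispatch target */
	ExecProcNodeMtd ExecProcNodeReal;	/* underlying ExecXXX function */
	Instrumentation *instrument;
};

static TupleTableSlot *
ExecProcNodeInstr(PlanState *node)	/* full path: all metrics */
{
	/* InstrStartNode() ... InstrStopNode() would run around this call */
	return node->ExecProcNodeReal(node);
}

static TupleTableSlot *
ExecProcNodeInstrRowCount(PlanState *node)	/* fast path: rows only */
{
	TupleTableSlot *result = node->ExecProcNodeReal(node);

	if (result != NULL)
		node->instrument->tuplecount += 1.0;
	node->instrument->running = true;
	return result;
}

/* ExecProcNodeFirst()-style selection: pick the wrapper on the first call */
static void
choose_wrapper(PlanState *node)
{
	Instrumentation *instr = node->instrument;

	if (instr == NULL)
		node->ExecProcNode = node->ExecProcNodeReal;
	else if (!instr->need_timer && !instr->need_bufusage && !instr->need_walusage)
		node->ExecProcNode = ExecProcNodeInstrRowCount;	/* rows only */
	else
		node->ExecProcNode = ExecProcNodeInstr;
}
```

The key property is that the branch runs once per node, not once per tuple: after the first call, every subsequent ExecProcNode() invocation goes straight through the chosen function pointer.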

Results:

Overhead reduced to 8.13%. Crucially, this introduces no performance degradation for normal (non-instrumented) execution.

Phase 2

Phase 2 is a Proof-of-Concept (PoC) model.

Approach: Direct Instrumentation in ExecXXX Functions

Phase 2 removes the wrapper function overhead by modifying every ExecXXX function (e.g., ExecSeqScan, ExecIndexScan) to call a new, inlined helper, ExecNodeRowcountEnd(), at the final step.

ExecSeqScan() Example:

/* BEFORE */
static TupleTableSlot *
ExecSeqScan(PlanState *pstate)
{
	SeqScanState *node = castNode(SeqScanState, pstate);

	return ExecScanExtended(&node->ss,
			(ExecScanAccessMtd) SeqNext,
			(ExecScanRecheckMtd) SeqRecheck,
			NULL,
			NULL,
			NULL);
}
/* AFTER */
static TupleTableSlot *
ExecSeqScan(PlanState *pstate)
{
	TupleTableSlot *result;
	SeqScanState *node = castNode(SeqScanState, pstate);

	result = ExecScanExtended(&node->ss,
			  (ExecScanAccessMtd) SeqNext,
			  (ExecScanRecheckMtd) SeqRecheck,
			  NULL,
			  NULL,
			  NULL);

	return ExecNodeRowcountEnd(pstate, result);
}

ExecNodeRowcountEnd() Definition:

static pg_attribute_always_inline TupleTableSlot *
ExecNodeRowcountEnd(PlanState *node, TupleTableSlot *result)
{
	Instrumentation *const instr = node->instrument;

	if (!instr)
		return result;

	if (!instr->fast_path_instr)
		return result;

	/* count the returned tuples */
	if (!TupIsNull(result))
		instr->tuplecount += 1.0;
	instr->running = true; /* Always true */

	return result;
}

I added a boolean flag, fast_path_instr, to the Instrumentation struct. This is set to TRUE only when counting actual rows is the sole active metric.

typedef struct Instrumentation
{
	bool            fast_path_instr;                /* true if we can bypass InstrStartNode and InstrStopNode */
	/* Parameters set at node creation: */

	...

} Instrumentation;
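How the flag is derived is not shown in the excerpt; a plausible sketch, assuming it is computed from the existing need_timer / need_bufusage / need_walusage fields of Instrumentation at node initialization:

```c
#include <assert.h>
#include <stdbool.h>

/* Reduced Instrumentation with only the fields relevant here */
typedef struct Instrumentation
{
	bool		fast_path_instr;	/* rows-only fast path usable? */
	bool		need_timer;			/* TIMING */
	bool		need_bufusage;		/* BUFFERS */
	bool		need_walusage;		/* WAL */
} Instrumentation;

/* InstrInit()-style setup: the fast path applies only when counting
 * actual rows is the sole active metric. */
static void
set_fast_path(Instrumentation *instr)
{
	instr->fast_path_instr =
		!instr->need_timer && !instr->need_bufusage && !instr->need_walusage;
}
```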

Execution Paths:

  1. Standard SELECT: ExecNodeRowcountEnd is called but returns immediately (if (!instr)).
     • Issue: The function call and check cause a performance degradation.
  2. EXPLAIN ANALYZE (Fast Path): fast_path_instr is TRUE. Executes minimal logic.
  3. EXPLAIN ANALYZE (Full): Normal path. InstrStartNode runs, ExecNodeRowcountEnd returns early (fast_path_instr is FALSE), then InstrStopNode runs.

Results and Degradation Analysis:

Overhead reduced to 3.95%.

However, the non-instrumented baseline (Query-1) suffered a 0.47% slowdown (approx. 0.57 s) due to the unconditional call to ExecNodeRowcountEnd. Code duplication (e.g., an ExecSeqScan vs. ExecSeqScanWithInstr pair) could eliminate it, but at an unacceptable maintenance cost. I am seeking a cleaner solution.

Phase 3

Phase 3 is also a Proof-of-Concept (PoC) and Feasibility Study. It is currently ad-hoc and does not support JIT.

The performance degradation observed here stems from the same reasons as in Phase 2 (unconditional checks in the hot path).

Approach: Reducing Register Loads

To further minimize overhead, I focused on memory and register efficiency:

  1. Consolidating Variables: tuplecount (bits 0-61), running (bit 62), and fast_path_instr (bit 63) are packed into a single uint64 hot_instr.
  2. Relocating the Variable: hot_instr is moved from Instrumentation to PlanState.

Note: A 62-bit integer is sufficient for tuplecount as the double mantissa limit is 53 bits.
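This 53-bit limit is easy to demonstrate with a standalone check (not part of the patch): every integer up to 2^53 is exactly representable as a double, but 2^53 + 1 already rounds back to 2^53, so 62 integer bits lose nothing relative to the old double counter.

```c
#include <assert.h>
#include <stdint.h>

/* A double's significand holds 53 bits: 2^53 and 2^53 + 1 collapse to
 * the same double value, so the old double tuplecount was never exact
 * beyond 53 bits anyway. */
static int
double_loses_precision_past_53_bits(void)
{
	uint64_t	limit = UINT64_C(1) << 53;

	return (double) limit == (double) (limit + 1);
}
```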

PlanState Structure Modification:

typedef struct PlanState
{
	...
	uint64_t    hot_instr;		/* bit 63: fast_path_instr, bit 62: running, bit 0-61: tuplecount */
	Instrumentation *instrument;    /* Optional runtime stats for this node */
	WorkerInstrumentation *worker_instrument;	/* per-worker instrumentation */
	...
}

ExecNodeRowcountEnd() Implementation:

static pg_attribute_always_inline TupleTableSlot *
ExecNodeRowcountEnd(PlanState *node, TupleTableSlot *result)
{
	uint64_t current = node->hot_instr;
	if ((current & FAST_PATH_INSTR_MASK) == 0)
		return result;

	if (!TupIsNull(result))
		if ((current & TUPLECOUNT_MASK) != MAX_TUPLECOUNT) /* Prevent overflow */
			current += 1;		/* tuplecount++ */
	current |= RUNNING_MASK;   		/* running = true */

	/* store current */
	node->hot_instr = current;
	return result;
}
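The masks are not defined in the excerpt; the following self-contained sketch shows the assumed bit layout (bit 63: fast_path_instr, bit 62: running, bits 0–61: tuplecount) and the per-tuple update as a pure function over the packed word. The exact macro definitions in the patch may differ:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FAST_PATH_INSTR_MASK	(UINT64_C(1) << 63)		/* bit 63 */
#define RUNNING_MASK			(UINT64_C(1) << 62)		/* bit 62 */
#define TUPLECOUNT_MASK			(RUNNING_MASK - 1)		/* bits 0-61 */
#define MAX_TUPLECOUNT			TUPLECOUNT_MASK

/* One per-tuple update on the packed word */
static uint64_t
hot_instr_count(uint64_t current, bool tuple_returned)
{
	if ((current & FAST_PATH_INSTR_MASK) == 0)
		return current;			/* fast path disabled: leave untouched */

	if (tuple_returned &&
		(current & TUPLECOUNT_MASK) != MAX_TUPLECOUNT)	/* overflow guard */
		current += 1;			/* tuplecount++ */
	current |= RUNNING_MASK;	/* running = true */
	return current;
}
```

Because the flag, the running bit, and the counter share one word, a single load brings all three into a register, and a single store writes them back.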

This reduces the number of loads from cache/memory into registers from five (Phase 2) to two (Phase 3). Specifically, the load of the Instrumentation pointer is eliminated entirely, and because fast_path_instr, running, and tuplecount are packed into one word, a single load replaces three separate ones.

Results:

Overhead reduced to 1.64%.

As with Phase 2, a performance degradation in standard queries was observed (1.11%). See Appendix for a discussion on the variance of these degradation metrics.

Appendix: Benchmark

A.1: Environment

| Component | Version | Note |
|---|---|---|
| CPU | M1 Max | ARM architecture |
| OS | macOS 15.5 | |
| Compiler | Apple clang version 17.0.0 | Target: arm64-apple-darwin24.5.0 |
| PostgreSQL | 19dev | configure options: --without-icu CFLAGS="-O3 -g" |
| postgresql.conf | Default settings, except: shared_buffers = 512MB, max_parallel_workers_per_gather = 0, max_parallel_workers = 0 | Disables parallel queries. |

Note: Although the benchmark disables parallel queries, all patches support them.

A.2: Dataset

CREATE TABLE test1 (id int, data int);
CREATE INDEX test1_id_idx ON test1 (id);
CREATE TABLE test2 (id int PRIMARY KEY, data int);
CREATE TABLE test3 (id int PRIMARY KEY, data int);
INSERT INTO test1 (id, data) SELECT i, i % 51  FROM generate_series(1, 150000) AS i;
INSERT INTO test1 (id, data) SELECT i, i % 51 FROM generate_series(1, 5000) AS i;
INSERT INTO test2 (id, data) SELECT i, floor(random() * 50 + 1)::int FROM generate_series(1, 50000) AS i;
INSERT INTO test3 (id, data) SELECT i, i FROM generate_series(1, 100000) AS i;
ANALYZE;

A.3: Queries

  • Query-1: Standard SELECT
  • Query-2: EXPLAIN ANALYZE with TIMING=FALSE
-- Query-1
SELECT count(*) FROM test1 AS a, test2 AS b, test3 AS c WHERE a.id = c.id;

-- Query-2
EXPLAIN (ANALYZE TRUE, TIMING FALSE, BUFFERS FALSE) SELECT count(*) FROM test1 AS a, test2 AS b, test3 AS c WHERE a.id = c.id;

A.4: Results (Test3 size: 100,000 rows)

Query-1 and Query-2 were each executed 10 times. The results show the average execution time (and variance) and the derived overheads.

  • Overhead Definition: $$ \text{Overhead} = \frac{\text{ExecutionTime(Query-2)} - \text{ExecutionTime(Query-1)}}{\text{ExecutionTime(Query-1)}} \times 100 \\ \text{Overhead based on Original} = \frac{\text{ExecutionTime(Query-2)} - \text{ExecutionTime(Original Query-1)}}{\text{ExecutionTime(Original Query-1)}} \times 100 $$
| | Query-1 [sec] (Var. [$\text{sec}^{2}$]) | Query-1 Degradation [%] | Query-2 [sec] (Var. [$\text{sec}^{2}$]) | Overhead [%] | Overhead based on Original [%] |
|---|---|---|---|---|---|
| Original | 120.7846 (0.1260) | N/A | 152.9188 (0.3539) | 26.60 | 26.60 |
| Phase 1 | 120.7209 (0.0822) | N/A | 130.6013 (0.0257) | 8.18 | 8.13 |
| Phase 2 | 121.3559 (0.0007) | 0.47 | 125.5515 (0.0039) | 3.46 | 3.95 |
| Phase 3 | 122.1207 (0.1115) | 1.11 | 122.7696 (0.1403) | 0.53 | 1.64 |
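Plugging the Phase 1 row into the definitions above reproduces both overhead columns (Query-2 = 130.6013 s against its own baseline 120.7209 s, and against the original baseline 120.7846 s):

```c
#include <assert.h>

/* Overhead of Query-2 relative to a Query-1 baseline, in percent */
static double
overhead_pct(double query2_sec, double baseline_sec)
{
	return (query2_sec - baseline_sec) / baseline_sec * 100.0;
}
```

For Phase 1 this yields 8.18% against its own Query-1 time and 8.13% against the original Query-1 time, matching the table.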

A.5: Sensitivity Analysis

Test3 size: 35,000 rows

| | Query-1 [sec] (Var. [$\text{sec}^{2}$]) | Query-1 Degradation [%] | Query-2 [sec] (Var. [$\text{sec}^{2}$]) | Overhead [%] | Overhead based on Original [%] |
|---|---|---|---|---|---|
| Original | 45.3314 (0.0090) | N/A | 58.4396 (0.0721) | 28.92 | 28.92 |
| Phase 1 | 45.3670 (0.0115) | N/A | 50.5628 (0.0133) | 11.45 | 11.54 |
| Phase 2 | 45.8230 (0.0196) | 1.08 | 48.4476 (0.0159) | 5.73 | 6.87 |
| Phase 3 | 45.3970 (0.2161) | 0.15 | 46.2231 (0.1641) | 1.82 | 1.96 |

Test3 size: 200,000 rows

| | Query-1 [sec] (Var. [$\text{sec}^{2}$]) | Query-1 Degradation [%] | Query-2 [sec] (Var. [$\text{sec}^{2}$]) | Overhead [%] | Overhead based on Original [%] |
|---|---|---|---|---|---|
| Original | 177.4159 (0.2237) | N/A | 226.9384 (0.6501) | 27.91 | 27.91 |
| Phase 1 | 176.2791 (0.1037) | N/A | 192.8065 (0.3016) | 9.38 | 8.67 |
| Phase 2 | 177.9841 (0.2546) | 0.32 | 185.5116 (0.3097) | 4.23 | 4.56 |
| Phase 3 | 178.8132 (0.2190) | 0.78 | 181.2868 (0.4573) | 1.38 | 2.18 |

Discussion on Results

The measured results contain statistical variance due to environmental factors. For example, the execution time for the original Query-1 is slightly higher than that of Phase 1, which suggests measurement noise.

Furthermore, the benchmark results are highly sensitive to the dataset size and environmental factors. As demonstrated by reducing the size of the test3 table from 100,000 rows to 35,000 rows, the Query-1 performance degradation in Phase 3 is sharply reduced from 1.11% to 0.15%.

Therefore, these benchmark figures should be interpreted not as absolute performance metrics, but primarily as an indication of the significant trend of overhead reduction achieved through the proposed phases.