The 3 Steps to a Faster EXPLAIN ANALYZE
1-Line Abstract: Cut the overhead of EXPLAIN ANALYZE by roughly 87[%].

In July 2025, I gave a talk at the Japan PostgreSQL Developer Meetup titled: “EXPLAIN ANALYZE Is ALL We Need – But what if we could always get actual rows without it?”

My proposal was simple: whenever a query is executed, PostgreSQL should count the actual number of rows – without requiring an explicit EXPLAIN ANALYZE. I believe such a feature could significantly enhance PostgreSQL’s capabilities:
- Detection of Cardinality Estimation Errors
  - Useful for improving the query optimizer, including with modern AI techniques.
- Query Progress Monitoring
  - Enables real-time monitoring of query execution progress.
- Anomaly Detection
  - A first step toward achieving greater observability.
However, there is one serious drawback: significant overhead.
- Instrumentation Overhead
  - When counting rows, the instrumentation module can consume 25–30% of total execution time.
When I gave the talk in July, I had not yet performed a detailed investigation – I was simply sharing an idea.
After that, before attempting my ultimate goal, I began researching EXPLAIN ANALYZE and managed to reduce its overhead by about 87[%], so I’d like to share the technique here.
I have summarized the results into three improvement steps. To make replication easy, I prepared an environment that allows anyone to reproduce the results by running just a few shell commands:
$ git clone https://github.com/s-hironobu/explain_analyze_bench.git
$ cd explain_analyze_bench
$ bash ./bench.sh setup
$ bash ./bench.sh benchmark
If this interests you, please try it out: explain_analyze_bench
Component | Version | Note |
---|---|---|
CPU | M1 Max | |
OS | macOS 15.5 | |
Compiler | Apple clang version 17.0.0 (clang-1700.0.13.5) | Target: arm64-apple-darwin24.5.0 |
PostgreSQL | 19dev | commit: 6a46089e458f2d700dd3b8c3f6fc782de933529a; config options: --without-icu, CFLAGS="-O3 -g" |
postgresql.conf | Default settings, except: shared_buffers = 512MB, max_parallel_workers_per_gather = 0, max_parallel_workers = 0 | Disable parallel queries. |
Profiling was conducted on Ubuntu 24.04 running in Vagrant on macOS using gprof and perf.
CREATE TABLE test1 (id int, data int);
CREATE INDEX test1_id_idx ON test1 (id);
CREATE TABLE test2 (id int PRIMARY KEY, data int);
CREATE TABLE test3 (id int PRIMARY KEY, data int);
INSERT INTO test1 (id, data) SELECT i, i % 51 FROM generate_series(1, 150000) AS i;
INSERT INTO test1 (id, data) SELECT i, i % 51 FROM generate_series(1, 5000) AS i;
INSERT INTO test2 (id, data) SELECT i, floor(random() * 50 + 1)::int FROM generate_series(1, 50000) AS i;
INSERT INTO test3 (id, data) SELECT i, i FROM generate_series(1, 35000) AS i;
ANALYZE;
-- Query-1
SELECT count(*) FROM test1 AS a, test2 AS b, test3 AS c WHERE a.id = c.id;
-- Query-2
EXPLAIN (ANALYZE TRUE, TIMING FALSE, BUFFERS FALSE) SELECT count(*) FROM test1 AS a, test2 AS b, test3 AS c WHERE a.id = c.id;
-- Query-3
EXPLAIN (ANALYZE TRUE, TIMING TRUE, BUFFERS FALSE) SELECT count(*) FROM test1 AS a, test2 AS b, test3 AS c WHERE a.id = c.id;
- Query-1: Standard SELECT
- Query-2: EXPLAIN ANALYZE with TIMING=FALSE
- Query-3: EXPLAIN ANALYZE with TIMING=TRUE
First, I benchmarked the unmodified PostgreSQL 19dev. Each query (Query-1, Query-2, Query-3) was executed 10 times.
The table below shows the mean execution time, sample variance, and overhead for each query.
Query | Query-1 | Query-2 | Query-3 |
---|---|---|---|
Execution Time [sec] (sample-variance [$\text{sec}^{2}$]) | 45.4885 (0.0062) | 58.8461 (0.0064) | 160.8744 (0.1695) |
Overhead [%] | N/A | 29.36 | 253.66 |
Note:
$$ \text{Overhead} = \frac{\text{ExecutionTime(Query-X)} - \text{ExecutionTime(Query-1)}}{\text{ExecutionTime(Query-1)}} \times 100 $$
where Query-X is either Query-2 or Query-3.
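For example, plugging the Query-2 figures from the table above into this formula gives:
$$ \frac{58.8461 - 45.4885}{45.4885} \times 100 \approx 29.36 \text{[\%]} $$
which matches the overhead reported in the table.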
Below is a partial gprof profiling output for Query-1 and Query-2 (measured on Ubuntu 24.04).
Query-1: SELECT
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
27.38 34.05 34.05 4000040003 0.00 0.00 ExecInterpExpr
15.68 53.55 19.50 2000000001 0.00 0.00 ExecNestLoop
11.44 67.77 14.22 _init
7.95 77.66 9.89 2000010000 0.00 0.00 tuplestore_gettupleslot
7.02 86.39 8.73 2 4.37 54.90 ExecAgg
6.68 94.70 8.31 2000010000 0.00 0.00 tuplestore_gettuple
6.48 102.76 8.06 1999960000 0.00 0.00 ExecStoreMinimalTuple
5.13 109.14 6.38 2000050000 0.00 0.00 ExecMaterial
4.18 114.34 5.20 2000000001 0.00 0.00 fetch_input_tuple
2.93 117.98 3.64 4000425019 0.00 0.00 MemoryContextReset
1.94 120.39 2.41 2000115013 0.00 0.00 tts_virtual_clear
1.62 122.40 2.01 2000000000 0.00 0.00 int8inc
1.29 124.01 1.61 2000050000 0.00 0.00 tuplestore_ateof
...
Query-2: EXPLAIN ANALYZE (TIMING FALSE)
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
21.24 36.39 36.39 4000040003 0.00 0.00 ExecInterpExpr
15.53 62.99 26.60 _init
12.08 83.69 20.70 2000000001 0.00 0.00 ExecNestLoop
8.44 98.15 14.46 2000010016 0.00 0.00 tuplestore_gettupleslot
6.01 108.44 10.29 4000215014 0.00 0.00 InstrStartNode
5.43 117.74 9.30 2000050000 0.00 0.00 ExecMaterial
5.32 126.86 9.12 4000215014 0.00 0.00 InstrStopNode
5.26 135.87 9.01 2000050000 0.00 0.00 tuplestore_ateof
4.13 142.94 7.07 4000215015 0.00 0.00 ExecProcNodeInstr
3.63 149.16 6.22 2000010016 0.00 0.00 tuplestore_gettuple
2.51 153.46 4.30 4000425033 0.00 0.00 MemoryContextReset
2.46 157.68 4.22 2 2.11 12.75 ExecAgg
2.36 161.72 4.04 2000000001 0.00 0.00 fetch_input_tuple
...
Full profiling results:
The most significant difference between Query-1 and Query-2 – and the main source of EXPLAIN ANALYZE overhead – comes from the functions InstrStartNode(), InstrStopNode(), and ExecProcNodeInstr(), each executed roughly $4 \times 10^{9}$ times.
ExecProcNodeInstr() is defined as follows:
/*
 * ExecProcNode wrapper that performs instrumentation calls. By keeping
 * this a separate function, we avoid overhead in the normal case where
 * no instrumentation is wanted.
 */
static TupleTableSlot *
ExecProcNodeInstr(PlanState *node)
{
    TupleTableSlot *result;

    InstrStartNode(node->instrument);

    result = node->ExecProcNodeReal(node);

    InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);

    return result;
}
This wrapper calls InstrStartNode() and InstrStopNode(), as well as the plan node’s actual execution function (for example, ExecSeqScan() for sequential scans, ExecIndexScan() for index scans, etc.).
From a plan node perspective, the execution path differs as follows (using ExecSeqScan as an example):
- In Query-1 (a normal SELECT), ExecSeqScan() is called directly.
- In Query-2 and Query-3 (EXPLAIN ANALYZE), ExecProcNodeInstr() is called, which in turn calls InstrStartNode(), then ExecSeqScan(), and finally InstrStopNode().
Query-1: SELECT
    node->ExecProcNodeReal = ExecSeqScan();

Query-2 and -3: EXPLAIN ANALYZE
    ExecProcNodeInstr();
        InstrStartNode();
        node->ExecProcNodeReal = ExecSeqScan();
        InstrStopNode();
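For reference, the per-row dispatch itself goes through the ExecProcNode() inline function in executor.h, which simply jumps through whichever wrapper was installed for the node. A simplified view (paraphrased from the PostgreSQL source; details vary by version):

/* Simplified sketch of the per-row dispatch (cf. src/include/executor/executor.h) */
static inline TupleTableSlot *
ExecProcNode(PlanState *node)
{
    if (node->chgParam != NULL)   /* something changed? */
        ExecReScan(node);         /* let ReScan handle this */

    /* Calls ExecProcNodeFirst on the first call, and afterwards either
     * ExecProcNodeInstr or the node's real execution function. */
    return node->ExecProcNode(node);
}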
ExecProcNodeInstr() is invoked every time a plan node processes a single row. For example, when sequentially scanning a table with 1,000,000 rows, it will be called 1,000,000 times.
Query-3 shows a much larger overhead because TIMING = TRUE triggers a time measurement in every call to InstrStartNode() and InstrStopNode().
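To make that cost concrete, here is a condensed sketch of what the timed instrumentation does for every row (paraphrased from instrument.c; the real functions also handle buffer/WAL usage and some error checks):

/* Condensed sketch of the per-row timed instrumentation (cf. instrument.c) */
void
InstrStartNode(Instrumentation *instr)
{
    if (instr->need_timer)
        INSTR_TIME_SET_CURRENT(instr->starttime);   /* clock read */
    /* ... buffer/WAL usage snapshots omitted ... */
}

void
InstrStopNode(Instrumentation *instr, double nTuples)
{
    /* count the returned tuples */
    instr->tuplecount += nTuples;

    if (instr->need_timer)
    {
        instr_time  endtime;

        INSTR_TIME_SET_CURRENT(endtime);            /* clock read again */
        INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->starttime);
    }
    /* ... buffer/WAL usage accumulation and firsttuple bookkeeping omitted ... */
    instr->running = true;
}

With TIMING = TRUE, every row therefore costs two system clock reads per plan node.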
As shown in the flamegraph for Query-3 below (measured on Ubuntu 24.04), most of the execution time is spent in clock_gettime() retrieving the system clock.

Flamegraph of Query-3
The natural conclusion is that timing should be set aside.
From this point on, I will focus on minimizing the overhead of counting rows, ignoring time measurement.
The flamegraphs for Query-1, Query-2, and Query-3 can be found here:
In principle, any measurement affects the system being observed.
However, with EXPLAIN ANALYZE and TIMING = TRUE, the impact on execution time is disproportionately large. The accuracy and validity of timing values obtained in this way should be questioned.
A fundamental improvement might be needed – for example, sampling timing less frequently (e.g., once every 100 or 1,000 calls) to reduce overhead while still gathering statistically meaningful data.
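As a purely hypothetical illustration of such sampling (not part of any patch; SAMPLE_INTERVAL and the sample_counter field are invented for this sketch), the wrapper could take timestamps only on every Nth row and extrapolate afterwards, while still counting every row exactly:

/* Hypothetical sketch: take timestamps for only 1 out of every
 * SAMPLE_INTERVAL rows; sample_counter is an invented field. */
#define SAMPLE_INTERVAL 1000

static TupleTableSlot *
ExecProcNodeInstrSampled(PlanState *node)
{
    Instrumentation *instr = node->instrument;
    bool        sampled = (instr->sample_counter++ % SAMPLE_INTERVAL == 0);
    TupleTableSlot *result;

    if (sampled)
        InstrStartNode(instr);      /* clock read only for sampled rows */

    result = node->ExecProcNodeReal(node);

    if (sampled)
        InstrStopNode(instr, TupIsNull(result) ? 0.0 : 1.0);
    else if (!TupIsNull(result))
        instr->tuplecount += 1.0;   /* rows are still counted exactly */

    /* When reporting, the accumulated time would be scaled by
     * SAMPLE_INTERVAL to estimate the node's total time. */
    return result;
}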
The first step in improving EXPLAIN ANALYZE performance was to inline the key instrumentation functions: InstrStartNode(), InstrStopNode(), InstrUpdateTupleCount(), and InstrEndLoop().
Function calls are not usually a performance bottleneck. However, in the case of InstrStartNode() and InstrStopNode() – which are executed billions of times – their overhead is no longer negligible. To reduce this cost, I inlined these functions.
Inlining removes the call overhead for this massive number of invocations, which should reduce total execution time. The implementation is straightforward: move the function definitions from instrument.c into instrument.h and declare them with static pg_attribute_always_inline.
Here’s a before-and-after example for InstrStartNode():
Before:
void
InstrStartNode(Instrumentation *instr)
{
...
After:
static pg_attribute_always_inline void
InstrStartNode(Instrumentation *instr)
{
...
See the patch for details: step1-improved-explain-analyze.patch
The table below shows the mean execution time, sample variance, and overhead for each query in the Original and Step 1 benchmarks:
Version | Query-1 [sec] (Var. [$\text{sec}^{2}$]) | Query-2 [sec] (Var. [$\text{sec}^{2}$]) | Overhead [%] |
---|---|---|---|
Original | 45.4885 (0.0062) | 58.8461 (0.0064) | 29.36 |
Step 1 | 45.5090 (0.0454) | 54.8780 (0.0292) | 20.59 |

In Step 1, Query-2’s execution time decreased by 3.9681 [sec], reducing its overhead from 29.36 [%] to 20.59 [%].
If we assume this improvement comes entirely from inlining InstrStartNode() and InstrStopNode(), we can estimate the cost per function call as:
$$ \frac{3.9681 \text{[sec]}}{2 \times 4 \times 10^{9} \text{[times]}} = 0.496 \times 10^{-9} \text{[sec]} = 0.496 \text{[nsec]} $$
For comparison, on a CPU running at 5 [GHz], one clock cycle takes about 0.2 [nsec]. This means that while a single function call is extremely cheap, the sheer number of calls in EXPLAIN ANALYZE turns them into a significant performance cost.
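In other words, relative to that 5 [GHz] reference, the estimated per-call cost amounts to only a couple of clock cycles:
$$ \frac{0.496 \text{[nsec]}}{0.2 \text{[nsec/cycle]}} \approx 2.5 \text{[cycles]} $$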
(Unlike the Zilog Z80 – which I once wrote hand-assembled code for – modern CPUs execute instructions in parallel using pipelines with more than 10 stages. So this is only a rough back-of-the-envelope calculation, meant to give a sense of scale.)
This result reinforces that our optimization target involves performance changes at the CPU clock cycle level.
When ANALYZE is TRUE but TIMING, BUFFERUSAGE, and WALUSAGE are all FALSE, we can skip calling InstrStartNode() and execute only the two essential steps from InstrStopNode().
I implemented a simplified version of ExecProcNodeInstr() called ExecProcNodeInstrLite(), executed only when TIMING, BUFFERUSAGE, and WALUSAGE are all FALSE.
Compared with the original ExecProcNodeInstr(), ExecProcNodeInstrLite() skips InstrStartNode() entirely and includes only the two minimal steps from InstrStopNode().
ExecProcNodeInstr():
/*
 * ExecProcNode wrapper that performs instrumentation calls. By keeping
 * this a separate function, we avoid overhead in the normal case where
 * no instrumentation is wanted.
 */
static TupleTableSlot *
ExecProcNodeInstr(PlanState *node)
{
    TupleTableSlot *result;

    /* Inlined */
    InstrStartNode(node->instrument);

    result = node->ExecProcNodeReal(node);

    /* Inlined */
    InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);

    return result;
}
ExecProcNodeInstrLite():
/*
 * A simplified version of ExecProcNodeInstr. When need_timer,
 * need_bufusage, and need_walusage are all false, this function
 * avoids calling InstrStartNode and InstrStopNode and instead
 * performs only the two essential steps from InstrStopNode.
 */
static TupleTableSlot *
ExecProcNodeInstrLite(PlanState *node)
{
    TupleTableSlot *result;

    /* Skip calling InstrStartNode() */
    result = node->ExecProcNodeReal(node);

    /* Perform the essential steps from InstrStopNode() */
    if (!TupIsNull(result))
        node->instrument->tuplecount += 1.0;

    if (!node->instrument->running)
        node->instrument->running = true;

    return result;
}
Note: Both wrappers invoke the node’s execution function through the node->ExecProcNodeReal function pointer, so the compiler cannot inline that call.
To select between ExecProcNodeInstr() and ExecProcNodeInstrLite() based on the conditions, I modified ExecProcNodeFirst() as follows:
Before:
static TupleTableSlot *
ExecProcNodeFirst(PlanState *node)
{
    check_stack_depth();

    /*
     * If instrumentation is required, change the wrapper to one that just
     * does instrumentation. Otherwise we can dispense with all wrappers and
     * have ExecProcNode() directly call the relevant function from now on.
     */
    if (node->instrument)
        node->ExecProcNode = ExecProcNodeInstr;
    else
        node->ExecProcNode = node->ExecProcNodeReal;

    return node->ExecProcNode(node);
}
After:
static TupleTableSlot *
ExecProcNodeFirst(PlanState *node)
{
    check_stack_depth();

    if (node->instrument)
    {
        /*--
         * Use ExecProcNodeInstrLite() to avoid calling InstrStartNode()
         * and InstrStopNode() when the following conditions are not required:
         *   - need_timer
         *   - need_bufusage
         *   - need_walusage
         */
        if (node->instrument->need_timer
            || node->instrument->need_bufusage
            || node->instrument->need_walusage)
            node->ExecProcNode = ExecProcNodeInstr;
        else
            node->ExecProcNode = ExecProcNodeInstrLite;
    }
    else
        node->ExecProcNode = node->ExecProcNodeReal;

    return node->ExecProcNode(node);
}
Note: ExecProcNodeFirst() is executed only once for each plan node at the start of query execution.
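The reason it runs only once is that ExecProcNodeFirst() overwrites node->ExecProcNode with the chosen wrapper and then dispatches through it, so every subsequent call goes straight to that wrapper. The same self-replacing function-pointer pattern, reduced to a standalone C toy (illustration only, not PostgreSQL code):

/* Standalone toy of the self-replacing dispatch pattern. */
#include <stdio.h>

struct node;
typedef int (*exec_fn)(struct node *);

struct node {
    exec_fn exec;       /* like PlanState->ExecProcNode */
    int     counter;
};

static int exec_real(struct node *n)
{
    return ++n->counter;
}

static int exec_first(struct node *n)
{
    /* decide once which implementation to use, then get out of the way */
    n->exec = exec_real;
    return n->exec(n);
}

int main(void)
{
    struct node n = { exec_first, 0 };

    for (int i = 0; i < 3; i++)
        printf("%d\n", n.exec(&n));   /* exec_first runs only on the 1st call */
    return 0;
}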
See the patch for details: step2-improved-explain-analyze.patch
Version | Query-1 [sec] (Var. [$\text{sec}^{2}$]) | Query-2 [sec] (Var. [$\text{sec}^{2}$]) | Overhead [%] |
---|---|---|---|
Original | 45.4885 (0.0062) | 58.8461 (0.0064) | 29.36 |
Step 1 | 45.5090 (0.0454) | 54.8780 (0.0292) | 20.59 |
Step 2 | 45.5785 (0.0082) | 50.4877 (0.0097) | 10.77 |

In Step 2, Query-2’s execution time decreased by 8.3584 [sec] compared with the original, reducing the overhead to 10.77 [%].
From a micro-level perspective, using SeqScan as an example:
Original: Query-1
    node->ExecProcNodeReal = ExecSeqScan();

Step 1: Query-2
    ExecProcNodeInstr();
        (inline) InstrStartNode();
        node->ExecProcNodeReal = ExecSeqScan();
        (inline) InstrStopNode();

Step 2: Query-2
    ExecProcNodeInstrLite();
        node->ExecProcNodeReal = ExecSeqScan();
        if (!TupIsNull(result)) node->instrument->tuplecount += 1.0;
        if (!running) node->instrument->running = true;
- In Query-1 (a plain SELECT), processing each row calls ExecSeqScan() directly.
- In Step 1’s Query-2, each row goes through ExecProcNodeInstr(), which calls both inlined instrumentation functions plus ExecSeqScan(). Inlining reduced function call overhead, but the internal work of both instrumentation functions still runs.
- In Step 2’s Query-2, each row calls ExecProcNodeInstrLite(), which executes ExecSeqScan() followed by just two minimal updates. The per-row call cost for ExecSeqScan() remains, but removing most instrumentation logic greatly reduces the cumulative overhead.
In Step 2, I reduced overhead by replacing ExecProcNodeInstr() with a lighter version, ExecProcNodeInstrLite(), when no timing, buffer usage, or WAL usage information was needed.
While this eliminated unnecessary calls to InstrStartNode() and most of InstrStopNode(), it still introduced one extra function call per processed row—namely the call to ExecProcNodeInstrLite() itself.
In Step 3, I take this idea further by removing the extra wrapper entirely. Instead of routing execution through a separate function, I modify every plan node execution function in the ExecXX family (ExecSeqScan, ExecIndexScan, ExecIndexOnlyScan, etc.) so that the minimal instrumentation logic is performed inside the node’s own execution path.
Concretely, at the final step of each ExecXX function, I insert a call to PostExecProcNodeInstr(), which runs only the two essential steps originally from InstrStopNode().
By inlining this logic directly into each execution function, we completely avoid the per-row wrapper call overhead from Step 2.
Here is PostExecProcNodeInstr():
static pg_attribute_always_inline TupleTableSlot *
PostExecProcNodeInstr(PlanState *node, TupleTableSlot *result)
{
    Instrumentation *const instr = node->instrument;

    /* Most common case: no instrumentation */
    if (!instr)
        return result;

    /* count the returned tuples */
    if (!TupIsNull(result))
        instr->tuplecount += 1.0;

    /* Is this the first tuple of this cycle? */
    /*--
    if (!instr->running)
        instr->running = true;
    */
    instr->running = true;      /* Always true */

    return result;
}
The commented-out if statement always ends with instr->running set to true, so I hand-optimized it into the unconditional assignment shown above.
For a normal query (e.g., a plain SELECT), PostExecProcNodeInstr() simply assigns instr, evaluates the if (!instr) condition, and returns result.
When EXPLAIN ANALYZE is run with TIMING, BUFFERS, and WAL all set to false, it executes the two minimal steps extracted from InstrStopNode() before returning result.
Below is an example of the before-and-after transformation for ExecSeqScan():
Before:
static TupleTableSlot *
ExecSeqScan(PlanState *pstate)
{
    SeqScanState *node = castNode(SeqScanState, pstate);

    return ExecScanExtended(&node->ss,
                            (ExecScanAccessMtd) SeqNext,
                            (ExecScanRecheckMtd) SeqRecheck,
                            NULL,
                            NULL,
                            NULL);
}
After:
static TupleTableSlot *
ExecSeqScan(PlanState *pstate)
{
    TupleTableSlot *result;
    SeqScanState *node = castNode(SeqScanState, pstate);

    result = ExecScanExtended(&node->ss,
                              (ExecScanAccessMtd) SeqNext,
                              (ExecScanRecheckMtd) SeqRecheck,
                              NULL,
                              NULL,
                              NULL);

    return PostExecProcNodeInstr(pstate, result);
}
All results now pass through PostExecProcNodeInstr().
I applied this same modification to all functions in the ExecXX() family.
To match these changes, I also modified ExecProcNodeFirst(): if need_timer, need_bufusage, and need_walusage are all false, it runs the regular (already modified) ExecProcNodeReal, as in a normal SELECT. Otherwise, EXPLAIN ANALYZE uses the regular instrumentation path via ExecProcNodeInstr().
static TupleTableSlot *
ExecProcNodeFirst(PlanState *node)
{
    check_stack_depth();

    if (node->instrument)
    {
        /*--
         * Use node->ExecProcNodeReal, which is modified to call
         * PostExecProcNodeInstr before returning the result,
         * to avoid calling InstrStartNode() and InstrStopNode()
         * when the following conditions are not required:
         *   - need_timer
         *   - need_bufusage
         *   - need_walusage
         */
        if (node->instrument->need_timer
            || node->instrument->need_bufusage
            || node->instrument->need_walusage)
            node->ExecProcNode = ExecProcNodeInstr;
        else
            node->ExecProcNode = node->ExecProcNodeReal;
    }
    else
        node->ExecProcNode = node->ExecProcNodeReal;

    return node->ExecProcNode(node);
}
See the patch for details: step3-improved-explain-analyze.patch
Version | Query-1 [sec] (Var. [$\text{sec}^{2}$]) | Query-2 [sec] (Var. [$\text{sec}^{2}$]) | Overhead [%] |
---|---|---|---|
Original | 45.4885 (0.0062) | 58.8461 (0.0064) | 29.36 |
Step 1 | 45.5090 (0.0454) | 54.8780 (0.0292) | 20.59 |
Step 2 | 45.5785 (0.0082) | 50.4877 (0.0097) | 10.77 |
Step 3 | 45.9527 (0.0095) | 47.6872 (0.0089) | 3.77 |

In Step 3, Query-1 (a normal SELECT) took 45.9527 [sec], while Query-2 took 47.6872 [sec], reducing the gap to 1.7345 [sec]. The overhead is now just 3.77[%], which means it has been reduced by roughly 87[%].
For Query-1, modifying the ExecXX() functions caused a slight regression: execution time increased by about 0.5 [sec] compared with the original – roughly a 1[%] slowdown.
To pinpoint the source, I profiled both Query-1 and Query-2 with gprof (10 runs each to improve accuracy). Functions with more than 200 calls per query (or 2,000 in gprof’s total output) showed no significant differences between the two queries.
Full profiling results:
This suggests that – unlike in Step 1’s InstrStartNode() / InstrStopNode() or Step 2’s ExecProcNodeInstrLite() – there is no single, high-frequency function dominating the loss. Instead, the instrumentation-related code in PostExecProcNodeInstr() – now embedded in every plan node execution function – is likely responsible.
static pg_attribute_always_inline TupleTableSlot *
PostExecProcNodeInstr(PlanState *node, TupleTableSlot *result)
{
    Instrumentation *const instr = node->instrument;

    /* Most common case: no instrumentation */
    if (!instr)
        return result;

    /* count the returned tuples */
    if (!TupIsNull(result))
        instr->tuplecount += 1.0;

    /* Is this the first tuple of this cycle? */
    /*--
    if (!instr->running)
        instr->running = true;
    */
    instr->running = true;      /* Always true */

    return result;
}
As of August 2025, I have not yet measured the exact cost of this code path, nor completely ruled out the influence of other factors.
Before running the Step 3 benchmark, I conducted preliminary experiments comparing three variants of the final statement: the original, an unlikely() version, and a hand-optimized version.
Original:
if (!instr->running)
    instr->running = true;
unlikely version:
if (unlikely(!instr->running))
    instr->running = true;
hand-optimized version:
/* Always true */
instr->running = true;
The results are shown below:
Version | Original | unlikely version | hand-optimized version |
---|---|---|---|
Execution Time [sec] | 48.0040 | 47.8312 | 47.6872 |
At first glance, these numbers look convincing, but they are best-case (“champion”) results.
In reality, I could not fully control disturbances during measurement (e.g., interference from other processes), so the results vary slightly from run to run.
For this reason, while I chose the hand-optimized version here, other variants might perform better depending on the compiler and CPU.
As a feasibility test, I optimized EXPLAIN ANALYZE row counting and reduced its overhead from 29.36[%] to 3.77[%] in three steps. However, the final step introduced a small slowdown of about 1[%] for regular SELECT queries.
My ultimate goal is to count actual rows during normal query execution without using EXPLAIN ANALYZE, while keeping the overhead around 1[%] – roughly 30 seconds of extra execution time per hour.
This experiment demonstrates that reducing EXPLAIN ANALYZE overhead is relatively straightforward, thanks to its inefficient baseline. However, counting actual rows in normal execution still incurs higher costs than expected. Achieving the 1[%] overhead target will likely require a different approach.