A Deep Dive into the Faster EXPLAIN ANALYZE
In my previous blog post, I explained how to optimize EXPLAIN ANALYZE. I demonstrated how to reduce its overhead from 29.82[%] to 5.33[%] in just three steps.
While the analysis at the end of that post was brief and a bit superficial, in this blog, I will perform a deeper analysis to gain insights for further improvements.
| Query-1 [sec] (Var. [$\text{sec}^{2}$]) | Query-2 [sec] (Var. [$\text{sec}^{2}$]) | Overhead [%] | Overhead based on Original [%] |
---|---|---|---|---|
Original | 45.2046 (0.0070) | 58.6842 (0.0153) | 29.82 | 29.82 |
Step 1 | 45.1979 (0.0033) | 54.6081 (0.0104) | 20.82 | 20.80 |
Step 2 | 45.1916 (0.0065) | 50.5348 (0.0433) | 11.82 | 11.79 |
Step 3 | 45.4063 (0.0023) | 47.6129 (0.0006) | 4.86 | 5.33 |

First, let’s take a closer look at the PostExecProcNodeInstr() function I created in the last step. The goal of this analysis is to identify which parts can be either optimized or omitted entirely to improve performance.
Below is the code for PostExecProcNodeInstr() as defined in Step 3:
/* Macros from PostgreSQL's executor/tuptable.h */
#define TTS_FLAG_EMPTY (1 << 1)
#define TTS_EMPTY(slot) (((slot)->tts_flags & TTS_FLAG_EMPTY) != 0)
#define TupIsNull(slot) \
    ((slot) == NULL || TTS_EMPTY(slot))

TupleTableSlot *
PostExecProcNodeInstr(PlanState *node, TupleTableSlot *result)
{
    Instrumentation *const instr = node->instrument;

    /* The node is not being instrumented: nothing to do. */
    if (!instr)
        return result;

    /* Count the tuple only when the slot actually holds one. */
    if (!TupIsNull(result))
        instr->tuplecount += 1.0;

    instr->running = true;
    return result;
}
Next, we will examine the LLVM-IR (Intermediate Representation) generated by compiling the non-inlined PostExecProcNodeInstr() function, along with a simple flowchart for easy reference.

From a qualitative analysis of the code, we can anticipate that the following items will have a high cost:
- The TupIsNull() check, specifically the cost of inspecting tts_flags.
- The cost of loading data from the Instrumentation and TupleTableSlot structs into registers.
Below is the corresponding Apple Silicon M1 (arm64) assembly code for the LLVM-IR shown above.
_PostExecProcNodeInstr:                 ; @PostExecProcNodeInstr
        .cfi_startproc
; %bb.0:
        ldr     x8, [x0, #40]
        cbz     x8, LBB0_5
; %bb.1:
        cbz     x1, LBB0_4
; %bb.2:
        ldrh    w9, [x1, #4]
        tbnz    w9, #1, LBB0_4
; %bb.3:
        ldr     d0, [x8, #32]
        fmov    d1, #1.00000000
        fadd    d0, d0, d1
        str     d0, [x8, #32]
LBB0_4:
        mov     w9, #1                  ; =0x1
        strb    w9, [x8, #4]
LBB0_5:
        mov     x0, x1
        ret
        .cfi_endproc
- Registers:
  - General-Purpose Registers:
    - x0 - x30: 64-bit general-purpose registers.
    - w0 - w30: 32-bit views of the same registers.
  - SIMD and Floating-Point Registers:
    - v0 - v31: 128-bit SIMD (Single Instruction, Multiple Data) registers.
    - d0 - d31: 64-bit floating-point views of the v registers.
    - s0 - s31: 32-bit floating-point views.
    - h0 - h31: 16-bit floating-point views.
    - b0 - b31: 8-bit views.
While it would be ideal to know the clock cycles required for each instruction (ldr, cbz, etc.), modern CPUs employ advanced optimizations like pipelining and branch prediction. Additionally, Apple doesn’t publicly disclose the precise details of its M1 chip’s architecture. As a result, it’s impossible to accurately estimate execution time from the assembly code alone.
Instead, let’s take a more direct approach and estimate the execution time for each statement by running benchmarks.
We will use the original PostExecProcNodeInstr() as Version 0 and create three additional versions by removing one statement at a time. Then, we will run a benchmark for each version.
- Version 0: Original
- Version 1: Removed instr->running = true from Version 0
- Version 2: Removed the if (!TupIsNull(result)) check from Version 1
- Version 3: Removed instr->tuplecount += 1.0 from Version 2
Here are the benchmark results:
Version | 0 | 1 | 2 | 3 |
---|---|---|---|---|
Query-2 [sec] | 47.7896 | 47.7220 | 46.8282 | 46.1475 |
diff from previous version [sec] | N/A | 0.0667 | 0.8938 | 0.6807 |
The benchmark difference between Version 0 and Version 1 is a mere 0.0667 [sec]. This tiny difference suggests that the cost of the assignment to the instr->running variable is very low.
Version 0
_PostExecProcNodeInstr:                 ; @PostExecProcNodeInstr
        .cfi_startproc
; %bb.0:
        ldr     x8, [x0, #40]
        cbz     x8, LBB0_5
; %bb.1:
        cbz     x1, LBB0_4
; %bb.2:
        ldrh    w9, [x1, #4]
        tbnz    w9, #1, LBB0_4
; %bb.3:
        ldr     d0, [x8, #32]
        fmov    d1, #1.00000000
        fadd    d0, d0, d1
        str     d0, [x8, #32]
LBB0_4:
        mov     w9, #1                  ; =0x1
        strb    w9, [x8, #4]
LBB0_5:
        mov     x0, x1
        ret
        .cfi_endproc
Version 1
_PostExecProcNodeInstr:                 ; @PostExecProcNodeInstr
        .cfi_startproc
; %bb.0:
        cbz     x1, LBB0_4
; %bb.1:
        ldr     x8, [x0, #40]
        cbz     x8, LBB0_4
; %bb.2:
        ldrh    w9, [x1, #4]
        tbnz    w9, #1, LBB0_4
; %bb.3:
        ldr     d0, [x8, #32]
        fmov    d1, #1.00000000
        fadd    d0, d0, d1
        str     d0, [x8, #32]
LBB0_4:
        mov     x0, x1
        ret
        .cfi_endproc
The difference between Version 1 and Version 2 is a significant 0.8938 [sec]. As the LLVM-IR flowchart suggested, the cost of the if (!TupIsNull(result)) conditional, specifically the ldrh instruction that loads tts_flags into a register, is quite high.
Version 1
_PostExecProcNodeInstr:                 ; @PostExecProcNodeInstr
        .cfi_startproc
; %bb.0:
        cbz     x1, LBB0_4
; %bb.1:
        ldr     x8, [x0, #40]
        cbz     x8, LBB0_4
; %bb.2:
        ldrh    w9, [x1, #4]
        tbnz    w9, #1, LBB0_4
; %bb.3:
        ldr     d0, [x8, #32]
        fmov    d1, #1.00000000
        fadd    d0, d0, d1
        str     d0, [x8, #32]
LBB0_4:
        mov     x0, x1
        ret
        .cfi_endproc
Version 2
_PostExecProcNodeInstr:                 ; @PostExecProcNodeInstr
        .cfi_startproc
; %bb.0:
        ldr     x8, [x0, #40]
        cbz     x8, LBB0_2
; %bb.1:
        ldr     d0, [x8, #32]
        fmov    d1, #1.00000000
        fadd    d0, d0, d1
        str     d0, [x8, #32]
LBB0_2:
        mov     x0, x1
        ret
        .cfi_endproc
The difference between Version 2 and Version 3 is 0.6807 [sec]. It would be a mistake to immediately conclude that this entire difference is due to the cost of the floating-point addition (fadd) to tuplecount.
As the assembly code below shows, the compiler's optimizations completely remove the if (!instr) conditional in Version 3. This means the benchmark difference isn't just the result of removing the floating-point addition: it also includes the cost of loading result and instr->tuplecount into registers and storing the result back to memory.
Version 2
_PostExecProcNodeInstr:                 ; @PostExecProcNodeInstr
        .cfi_startproc
; %bb.0:
        ldr     x8, [x0, #40]
        cbz     x8, LBB0_2
; %bb.1:
        ldr     d0, [x8, #32]
        fmov    d1, #1.00000000
        fadd    d0, d0, d1
        str     d0, [x8, #32]
LBB0_2:
        mov     x0, x1
        ret
        .cfi_endproc
Version 3
_PostExecProcNodeInstr:                 ; @PostExecProcNodeInstr
        .cfi_startproc
; %bb.0:
        mov     x0, x1
        ret
        .cfi_endproc
Considering that a separate experiment showed a negligible difference when the data type was changed from double to int64, it's highly likely that the primary cause of the difference between Version 2 and Version 3 is the cost of the memory accesses: loading result and instr->tuplecount into registers.
The difference between the Version 3 result (46.1475 [sec]) and the Query-1 result of the original (45.2046 [sec]) is 0.9429 [sec].
Since the PostExecProcNodeInstr() function in Version 3 has almost no overhead, this difference is primarily a result of the overhead inherent in the EXPLAIN process itself. This includes operations like preparing the instrument structures and executing ExplainNode().
Based on our benchmark, this irreducible overhead is estimated at 2.1[%] of the total query time.
$$ \frac{46.1475 - 45.2046}{45.2046} \times 100 \approx 2.1 \text{[%]} $$
This percentage represents the practical lower bound of the overhead that can be achieved through EXPLAIN ANALYZE optimizations.
Based on the benchmark results and analysis, the following tentative conclusions are presented for future improvements.
- Macro-level Optimization: From a macro perspective, the if (!TupIsNull()) and if (!instr) checks need to be simplified. This would require large-scale modifications to the PostgreSQL core, not just to the Explain and Instrument modules. However, because these checks account for a significant portion of the reducible overhead, I will continue to analyze and improve them.
- Micro-level Optimization: From a micro perspective, the cost of loading variables into CPU registers accounts for a large portion of the overhead. This is closely related to the macro-level optimization, but it can also be addressed to some extent by improving the Explain and Instrument modules. I will pursue this approach in the next steps.