A Deep Dive into The Faster EXPLAIN ANALYZE

In my previous blog post, I explained how to optimize EXPLAIN ANALYZE. I demonstrated how to reduce its overhead from 29.82[%] to 5.33[%] in just three steps.

While the analysis at the end of that post was brief and somewhat superficial, in this post I will perform a deeper analysis to find insights for further improvements.

          Query-1 [sec] (Var. [$\text{sec}^{2}$])   Query-2 [sec] (Var. [$\text{sec}^{2}$])   Overhead [%]   Overhead based on Original [%]
Original  45.2046 (0.0070)                          58.6842 (0.0153)                          29.82          29.82
Step 1    45.1979 (0.0033)                          54.6081 (0.0104)                          20.82          20.80
Step 2    45.1916 (0.0065)                          50.5348 (0.0433)                          11.82          11.79
Step 3    45.4063 (0.0023)                          47.6129 (0.0006)                           4.86           5.33
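For reference, the two overhead columns can be reproduced as follows (the formulas are not stated explicitly in the table, but they match every row; the last column uses the Original Query-1 time, 45.2046 [sec], as its baseline):

$$ \text{Overhead} = \frac{\text{Query-2} - \text{Query-1}}{\text{Query-1}} \times 100\,[\%], \qquad \text{Overhead}_{\text{Original}} = \frac{\text{Query-2} - 45.2046}{45.2046} \times 100\,[\%] $$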

Analysis of The Last Step

First, let’s take a closer look at the PostExecProcNodeInstr() function I created in the last step. The goal of this analysis is to identify which parts can be either optimized or omitted entirely to improve performance.

Below is the code for PostExecProcNodeInstr() defined in Step 3:

#define TTS_FLAG_EMPTY  (1 << 1)

#define TTS_EMPTY(slot) (((slot)->tts_flags & TTS_FLAG_EMPTY) != 0)

#define TupIsNull(slot) \
    ((slot) == NULL || TTS_EMPTY(slot))

TupleTableSlot *
PostExecProcNodeInstr(PlanState *node, TupleTableSlot *result)
{
    Instrumentation *const instr = node->instrument;

    if (!instr)
        return result;

    if (!TupIsNull(result))
        instr->tuplecount += 1.0;

    instr->running = true;

    return result;
}

Next, we will examine the LLVM-IR (Intermediate Representation) generated by compiling the non-inlined PostExecProcNodeInstr() function, along with a simple flowchart for easy reference.

From a qualitative analysis of the code, we can anticipate that the following items will have a high cost:

  1. The TupIsNull() check, specifically the cost of inspecting tts_flags.

  2. The cost of loading data from the Instrumentation and TupleTableSlot structs into registers.

Below is the corresponding Apple Silicon M1 (arm64) assembly code for the LLVM-IR shown above.

_PostExecProcNodeInstr:                 ; @PostExecProcNodeInstr
        .cfi_startproc
; %bb.0:
        ldr     x8, [x0, #40]
        cbz     x8, LBB0_5
; %bb.1:
        cbz     x1, LBB0_4
; %bb.2:
        ldrh    w9, [x1, #4]
        tbnz    w9, #1, LBB0_4
; %bb.3:
        ldr     d0, [x8, #32]
        fmov    d1, #1.00000000
        fadd    d0, d0, d1
        str     d0, [x8, #32]
LBB0_4:
        mov     w9, #1                          ; =0x1
        strb    w9, [x8, #4]
LBB0_5:
        mov     x0, x1
        ret
        .cfi_endproc
  • A64 – Base Instructions

    • ldr : Load register (immediate)
    • ldrh : Load register halfword (immediate)
    • cbz : Compare and branch on zero
    • tbnz : Test bit and branch if nonzero
    • fadd : Floating-point add
    • str : Store register (immediate)
    • strb : Store register byte (immediate)
  • Registers:

    • General-Purpose Registers:
      • x0 - x30: 64-bit general-purpose registers.
      • w0 - w30: 32-bit views of the corresponding x registers.
    • SIMD and Floating-Point Registers:
      • v0 - v31: 128-bit SIMD (Single Instruction, Multiple Data) registers.
      • d0 - d31: 64-bit (double-precision floating-point) views.
      • s0 - s31: 32-bit (single-precision floating-point) views.
      • h0 - h31: 16-bit (half-precision floating-point) views.
      • b0 - b31: 8-bit (byte) views.

While it would be ideal to know the clock cycles required for each instruction (ldr, cbz, etc.), modern CPUs employ advanced optimizations like pipelining and branch prediction. Additionally, Apple doesn’t publicly disclose the precise details of its M1 chip’s architecture. As a result, it’s impossible to accurately estimate execution time from the assembly code alone.

Instead, let’s take a more direct approach and estimate the execution time for each statement by running benchmarks.

Estimating Execution Time by Statement

We will use the original PostExecProcNodeInstr() as Version 0 and create three additional versions by removing one statement at a time. Then, we will run a benchmark for each version.

Version 0: Original

static pg_attribute_always_inline TupleTableSlot *
PostExecProcNodeInstr(PlanState *node, TupleTableSlot *result)
{
	Instrumentation *const instr = node->instrument;

	if (!instr)
		return result;

	if (!TupIsNull(result))
		instr->tuplecount += 1.0;

	instr->running = true; /* Always true */

	return result;
}

Version 1: Removed instr->running = true

static pg_attribute_always_inline TupleTableSlot *
PostExecProcNodeInstr(PlanState *node, TupleTableSlot *result)
{
	Instrumentation *const instr = node->instrument;

	if (!instr)
		return result;

	if (!TupIsNull(result))
		instr->tuplecount += 1.0;



	return result;
}

Version 2: Removed if (!TupIsNull(result))

static pg_attribute_always_inline TupleTableSlot *
PostExecProcNodeInstr(PlanState *node, TupleTableSlot *result)
{
	Instrumentation *const instr = node->instrument;

	if (!instr)
		return result;


	instr->tuplecount += 1.0;



	return result;
}

Version 3: Removed instr->tuplecount += 1.0

static pg_attribute_always_inline TupleTableSlot *
PostExecProcNodeInstr(PlanState *node, TupleTableSlot *result)
{
	Instrumentation *const instr = node->instrument;

	if (!instr)
		return result;






	return result;
}

Here are the benchmark results:

                                    Version 0   Version 1   Version 2   Version 3
Query-2 [sec]                         47.7896     47.7220     46.8282     46.1475
Diff from previous version [sec]          N/A      0.0676      0.8938      0.6807
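As a sanity check, the three per-version differences (0.0676, 0.8938, and 0.6807 [sec]) sum exactly to the total gap between Version 0 and Version 3:

$$ 0.0676 + 0.8938 + 0.6807 = 47.7896 - 46.1475 = 1.6421\,[\text{sec}] $$

so the three removed statements fully account for the reducible per-call cost measured here.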

Difference between Version 0 and Version 1

The benchmark difference between Version 0 and Version 1 is a mere 0.0676 [sec] (47.7896 − 47.7220). This tiny difference suggests that the cost of the assignment to instr->running is very low.

Version 0

_PostExecProcNodeInstr:                 ; @PostExecProcNodeInstr
        .cfi_startproc
; %bb.0:
        ldr     x8, [x0, #40]
        cbz     x8, LBB0_5
; %bb.1:
        cbz     x1, LBB0_4
; %bb.2:
        ldrh    w9, [x1, #4]
        tbnz    w9, #1, LBB0_4
; %bb.3:
        ldr     d0, [x8, #32]
        fmov    d1, #1.00000000
        fadd    d0, d0, d1
        str     d0, [x8, #32]
LBB0_4:
        mov     w9, #1                          ; =0x1
        strb    w9, [x8, #4]
LBB0_5:
        mov     x0, x1
        ret
        .cfi_endproc

Version 1

_PostExecProcNodeInstr:                 ; @PostExecProcNodeInstr
        .cfi_startproc
; %bb.0:
        cbz     x1, LBB0_4
; %bb.1:
        ldr     x8, [x0, #40]
        cbz     x8, LBB0_4
; %bb.2:
        ldrh    w9, [x1, #4]
        tbnz    w9, #1, LBB0_4
; %bb.3:
        ldr     d0, [x8, #32]
        fmov    d1, #1.00000000
        fadd    d0, d0, d1
        str     d0, [x8, #32]
LBB0_4:
        mov     x0, x1
        ret
        .cfi_endproc

Difference between Version 1 and Version 2

The difference between Version 1 and Version 2 is a significant 0.8938 [sec].

As the LLVM-IR flowchart suggested, the cost of the if (!TupIsNull(result)) conditional, specifically the ldrh instruction to load tts_flags into a register, is quite high.

Version 1

_PostExecProcNodeInstr:                 ; @PostExecProcNodeInstr
        .cfi_startproc
; %bb.0:
        cbz     x1, LBB0_4
; %bb.1:
        ldr     x8, [x0, #40]
        cbz     x8, LBB0_4
; %bb.2:
        ldrh    w9, [x1, #4]
        tbnz    w9, #1, LBB0_4
; %bb.3:
        ldr     d0, [x8, #32]
        fmov    d1, #1.00000000
        fadd    d0, d0, d1
        str     d0, [x8, #32]
LBB0_4:
        mov     x0, x1
        ret
        .cfi_endproc

Version 2

_PostExecProcNodeInstr:                 ; @PostExecProcNodeInstr
        .cfi_startproc
; %bb.0:
        ldr     x8, [x0, #40]
        cbz     x8, LBB0_2
; %bb.1:
        ldr     d0, [x8, #32]
        fmov    d1, #1.00000000
        fadd    d0, d0, d1
        str     d0, [x8, #32]
LBB0_2:
        mov     x0, x1
        ret
        .cfi_endproc

Difference between Version 2 and Version 3

The difference between Version 2 and Version 3 is 0.6807 [sec]. It would be a mistake to immediately conclude that this entire difference is due to the cost of the floating-point addition (fadd) to tuplecount.

As the assembly code below shows, the compiler’s optimizations completely remove the if (!instr) conditional in Version 3. This means the benchmark difference isn’t just the result of removing the floating-point addition: it also includes the cost of loading node->instrument and instr->tuplecount into registers and storing the updated count back to memory.

Version 2

_PostExecProcNodeInstr:                 ; @PostExecProcNodeInstr
        .cfi_startproc
; %bb.0:
        ldr     x8, [x0, #40]
        cbz     x8, LBB0_2
; %bb.1:
        ldr     d0, [x8, #32]
        fmov    d1, #1.00000000
        fadd    d0, d0, d1
        str     d0, [x8, #32]
LBB0_2:
        mov     x0, x1
        ret
        .cfi_endproc

Version 3

_PostExecProcNodeInstr:                 ; @PostExecProcNodeInstr
        .cfi_startproc
; %bb.0:
        mov     x0, x1
        ret
        .cfi_endproc

Considering that a separate experiment showed a negligible difference when changing the data type from double to int64, it’s highly likely that the primary cause of the difference between Version 2 and Version 3 is the cost of the memory access operations: loading node->instrument and instr->tuplecount into registers and storing the result back.

The Practical Limit of Overhead Optimization

The difference between the Version 3 result (46.1475 [sec]) and the Query-1 result of the original (45.2046 [sec]) is 0.9429 [sec].

Since the PostExecProcNodeInstr() function in Version 3 has almost no overhead, this difference is primarily a result of the overhead inherent in the EXPLAIN process itself. This includes operations like preparing the instrument structures and executing ExplainNode().

Based on our benchmark, this irreducible overhead is estimated at 2.1[%] of the total query time.

$$ \frac{46.1475 - 45.2046}{45.2046} \times 100 \approx 2.1\,[\%] $$

This percentage represents the practical lower bound of the overhead that can be achieved through EXPLAIN ANALYZE optimizations.


Conclusion

Based on the benchmark results and analysis, the following tentative conclusions are presented for future improvements.

  1. Macro-level Optimization: From a macro perspective, the if (!TupIsNull()) and if (!instr) checks need to be simplified. This would require large-scale modifications to the PostgreSQL core, not just to the Explain and Instrument modules. However, because these checks account for a significant portion of the reducible overhead, I will continue to analyze and improve them.

  2. Micro-level Optimization: From a micro perspective, the cost of loading variables into CPU registers accounts for a large portion of the overhead. This is closely related to the Macro-level Optimization, but can also be addressed to some extent by improving the Explain and Instrument modules. I will pursue this approach in the next steps.