Two More Steps to a Faster EXPLAIN ANALYZE

Building on the improvements shown in this blog post, I will present the results of further optimizations.

In this blog, I’ll challenge conventional wisdom and explore methods to reduce overhead as much as possible. To that end, I will first discuss data types before explaining the specific improvements.

I have summarized the results into five improvement steps. To make replication easy, I prepared an environment that allows anyone to reproduce the results by running just a few shell commands:

$ git clone https://github.com/s-hironobu/explain_analyze_bench.git
$ cd explain_analyze_bench
$ bash ./bench.sh setup
$ bash ./bench.sh benchmark

If this interests you, please try it out: explain_analyze_bench

Double vs. 64-bit Integer

Before describing the improvements, I will discuss the data type of tuplecount.

Analysis of EXPLAIN ANALYZE Tuple Counts

The tuplecount data type in EXPLAIN ANALYZE is a double. As a 64-bit floating-point number, double uses 8 bytes of memory, which are allocated as follows:

  • 1 bit for the sign
  • 11 bits for the exponent
  • 52 bits for the mantissa

As we’ve seen, tuplecount is incremented one by one during an EXPLAIN ANALYZE operation. A double’s 52-bit mantissa (plus one implicit bit) represents integers exactly only up to $2^{53}$, so the count can no longer be incremented accurately once it exceeds that value. In other words, the practical limit of tuplecount is far smaller than that of a 64-bit integer (int64), which is $2^{63} - 1$.
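The following minimal C program, which anyone can compile, demonstrates this limit (it is purely illustrative and not part of any patch):

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	double		d = 9007199254740992.0;		/* 2^53 */
	uint64_t	i = (uint64_t) 1 << 53;		/* 2^53 */

	/* The double silently drops the increment: 2^53 + 1 is not representable. */
	printf("double: %.1f\n", d + 1.0);	/* prints 9007199254740992.0 */

	/* The 64-bit integer counts exactly. */
	printf("int64:  %llu\n", (unsigned long long) (i + 1));	/* prints 9007199254740993 */

	return 0;
}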

Practicality of Counting to the int64 Limit

Let’s consider this from a different perspective.

Suppose we perform a sequential scan on a table with $2^{63} - 1$ rows. Even if we disregard the massive minimum file size of 334 [exabytes] required to store such a table, the sheer time needed for the operation makes it practically impossible.

Assuming a CPU clock speed of 10 [GHz] (0.1 [nsec] per clock cycle), the counting operation would take over 29 years:

$$(2^{63}-1) \times 0.1\text{[nsec]} = (9.223 \times 10^{18}) \times 10^{-10}\text{[sec]} = 9.223 \times 10^8\text{[sec]} = 29.23\text{[year]}$$

A more plausible overflow scenario is a nested loop join between two tables, each with $2^{32}$ rows (about 4.3 billion rows): such a join enumerates $2^{32} \times 2^{32} = 2^{64}$ row combinations, enough to overflow an int64 counter. However, even if each combination could be counted in a mere 0.1 [nsec], merely reaching the int64 overflow point ($2^{63}$ increments) would take over 29 [years]. Even a hypothetical 100 GHz CPU would need about 3 [years] to get there.
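For reference, the 100 GHz figure follows from the same arithmetic, applied to the $2^{63}$ increments needed to reach the overflow point:

$$2^{63} \times 0.01\text{[nsec]} = (9.223 \times 10^{18}) \times 10^{-11}\text{[sec]} = 9.223 \times 10^{7}\text{[sec]} \approx 2.92\text{[year]}$$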

Conclusion

Given these time scales, using a double for the EXPLAIN ANALYZE tuplecount serves no practical purpose, as int64 is more than sufficient to handle any count that can be measured within a reasonable timeframe.

In practice, a 62-bit or even a 61-bit counter is sufficient, which is exactly what Step 4 below exploits.

Step 4

Approach: Combining three variables into one

As discussed, by changing the original tuplecount data type from double to uint64_t, we can pack fast_path_instr, running, and tuplecount into a single uint64_t variable named hot_instr.

The most significant bit is assigned to the fast_path_instr flag, and the next most significant bit is assigned to the running flag.
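Schematically, the packed 64-bit layout looks like this:

  63                 62               61 ..................... 0
+-------------------+-----------------+------------------------+
|  fast_path_instr  |     running     |  tuplecount (62 bits)  |
+-------------------+-----------------+------------------------+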

The expected benefits of this approach are:

  • A reduction in data loading costs.
  • A reduction in computational costs by replacing floating-point operations with integer operations.

Below is the modified data structure, along with a set of macros to assist with the new operations:

typedef struct Instrumentation
{
    uint64_t    hot_instr;		/* bit 63: fast_path_instr, bit 62: running, bit 0-61: tuplecount */

    /* Parameters set at node creation: */
    bool        need_timer;     /* true if we need timer data */
    bool        need_bufusage;  /* true if we need buffer usage data */
...

} Instrumentation;

/*
 * For hot_instr in Instrumentation
 */
/* bit definition */
#define FAST_PATH_INSTR_BIT   63
#define RUNNING_BIT    62

#define FAST_PATH_INSTR_MASK  (1ULL << FAST_PATH_INSTR_BIT)
#define RUNNING_MASK   (1ULL << RUNNING_BIT)
#define TUPLECOUNT_MASK ((1ULL << RUNNING_BIT) - 1)  //(0..61)
#define MAX_TUPLECOUNT	TUPLECOUNT_MASK

// fast_path_instr
#define is_fast_path_instr_true(instr)    (((instr)->hot_instr & FAST_PATH_INSTR_MASK) != 0)
#define is_fast_path_instr_false(instr)   (((instr)->hot_instr & FAST_PATH_INSTR_MASK) == 0)
#define set_fast_path_instr_true(instr)   ((instr)->hot_instr |= FAST_PATH_INSTR_MASK)
#define set_fast_path_instr_false(instr)  ((instr)->hot_instr &= ~FAST_PATH_INSTR_MASK)

// running
#define is_running_true(instr)    (((instr)->hot_instr & RUNNING_MASK) != 0)
#define is_running_false(instr)   (((instr)->hot_instr & RUNNING_MASK) == 0)
#define set_running_true(instr)   ((instr)->hot_instr |= RUNNING_MASK)
#define set_running_false(instr)  ((instr)->hot_instr &= ~RUNNING_MASK)

// tuplecount
#define get_tuplecount(instr)     ((instr)->hot_instr & TUPLECOUNT_MASK)
#define set_tuplecount(instr, n)  \
	((instr)->hot_instr = ((instr)->hot_instr & ~TUPLECOUNT_MASK) | ((n) & TUPLECOUNT_MASK))
#define add_tuplecount(instr, n) do { \
		uint64_t current_tuple = (instr)->hot_instr & TUPLECOUNT_MASK; \
		uint64_t new_tuple = current_tuple + (n); \
		if (new_tuple <= MAX_TUPLECOUNT) { \
			(instr)->hot_instr = ((instr)->hot_instr & ~TUPLECOUNT_MASK) | new_tuple; \
		} \
	} while(0)
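As a quick sanity check of the packing, the following hypothetical standalone test (assuming the struct and macros above are in scope; it is not part of the patch) exercises the key properties:

#include <assert.h>

int
main(void)
{
	Instrumentation instr = {0};

	set_fast_path_instr_true(&instr);
	add_tuplecount(&instr, 3);

	assert(is_fast_path_instr_true(&instr));
	assert(is_running_false(&instr));
	assert(get_tuplecount(&instr) == 3);

	/*
	 * add_tuplecount saturates: once the 62-bit counter is full, further
	 * additions are ignored instead of spilling into the flag bits.
	 */
	set_tuplecount(&instr, MAX_TUPLECOUNT);
	add_tuplecount(&instr, 1);
	assert(get_tuplecount(&instr) == MAX_TUPLECOUNT);
	assert(is_fast_path_instr_true(&instr));

	return 0;
}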

The following are the implementations of the PostExecProcNodeInstr(), ExecProcNodeFirst(), and InstrStopNode() functions, which correspond to the data structure above.

#define TupIsNull(slot) \
    ((slot) == NULL || TTS_EMPTY(slot))

#define TTS_EMPTY(slot) (((slot)->tts_flags & TTS_FLAG_EMPTY) != 0)

#define			TTS_FLAG_EMPTY			(1 << 1)

TupleTableSlot *
PostExecProcNodeInstr(PlanState *node, TupleTableSlot *result)
{
	uint64_t current;
	Instrumentation *instr = node->instrument;

	/* Most common case: no instrumentation or no fast_path_instr */
	if (!instr)
		return result;
	if (is_fast_path_instr_false(instr))
		return result;

	current = instr->hot_instr;

	if (!TupIsNull(result))
		/* Prevent overflow */
		if ((current & TUPLECOUNT_MASK) != MAX_TUPLECOUNT)
			/* tuplecount++ */
			current += 1;

	/* running = true */
	current |= RUNNING_MASK;

	/* store current */
	instr->hot_instr = current;

	return result;
}

static TupleTableSlot *
ExecProcNodeFirst(PlanState *node)
{
	check_stack_depth();

	if (node->instrument)
	{
		/*--
		 * Use node->ExecProcNodeReal, which is modified to call PostExecProcNodeInstr
		 * before returning the result, to avoid calling InstrStartNode() and
		 * InstrStopNode() when the following conditions are not required:
		 * - need_timer
		 * - need_bufusage
		 * - need_walusage
		 */
		if (node->instrument->need_timer || node->instrument->need_bufusage
			|| node->instrument->need_walusage)
		{
			set_fast_path_instr_false(node->instrument);
			node->ExecProcNode = ExecProcNodeInstr;
		}
		else
		{
			set_fast_path_instr_true(node->instrument);
			node->ExecProcNode = node->ExecProcNodeReal;
		}
	}
	else
		node->ExecProcNode = node->ExecProcNodeReal;

	return node->ExecProcNode(node);
}

static pg_attribute_always_inline void
InstrStopNode(Instrumentation *instr, uint64_t nTuples)
{
	uint64_t 	save_tuplecount = get_tuplecount(instr);
	instr_time	endtime;

	add_tuplecount(instr, nTuples);

	/* let's update the time only if the timer was requested */
	if (instr->need_timer)
	{
		if (INSTR_TIME_IS_ZERO(instr->starttime))
			elog(ERROR, "InstrStopNode called without start");

		INSTR_TIME_SET_CURRENT(endtime);
		INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->starttime);

		INSTR_TIME_SET_ZERO(instr->starttime);
	}

	/* Add delta of buffer usage since entry to node's totals */
	if (instr->need_bufusage)
		BufferUsageAccumDiff(&instr->bufusage,
							 &pgBufferUsage, &instr->bufusage_start);

	if (instr->need_walusage)
		WalUsageAccumDiff(&instr->walusage,
						  &pgWalUsage, &instr->walusage_start);

	/* Is this the first tuple of this cycle? */
	if (!is_running_true(instr))
	{
		set_running_true(instr);
		if (instr->need_timer)
			instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);
	}
	else
	{
		/*
		 * In async mode, if the plan node hadn't emitted any tuples before,
		 * this might be the first tuple
		 */
		if (instr->need_timer && instr->async_mode && save_tuplecount < 1)
			instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);
	}
}

See the patch for details: step4-improved-explain-analyze.patch

Results

|          | Query-1 [sec] (Var. [$\text{sec}^{2}$]) | Query-2 [sec] (Var. [$\text{sec}^{2}$]) | Overhead [%] | Overhead based on Original [%] |
|----------|------------------------------------------|------------------------------------------|--------------|--------------------------------|
| Original | 45.2046 (0.0070)                         | 58.6842 (0.0153)                         | 29.82        | 29.82                          |
| Step 1   | 45.2138 (0.0012)                         | 54.9460 (0.0012)                         | 21.52        | 21.54                          |
| Step 2   | 45.2957 (0.1936)                         | 50.2848 (0.0020)                         | 11.01        | 11.23                          |
| Step 3   | 45.4849 (0.0007)                         | 48.2952 (0.0012)                         | 6.18         | 6.84                           |
| Step 4   | 45.8483 (0.0012)                         | 47.6494 (0.0089)                         | 3.93         | 5.41                           |

Note:

$$ \text{Overhead} = \frac{\text{ExecutionTime(Query-2)} - \text{ExecutionTime(Query-1)}}{\text{ExecutionTime(Query-1)}} \times 100 $$
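For example, the Step 4 row gives $\frac{47.6494 - 45.8483}{45.8483} \times 100 = 3.93$, while the last column uses the original Query-1 time as its baseline: $\frac{47.6494 - 45.2046}{45.2046} \times 100 = 5.41$.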

Analysis

Below is a comparison of the assembly code for Step 3 and Step 4.

Step 3:

_PostExecProcNodeInstr:                 ; @PostExecProcNodeInstr
        .cfi_startproc
; %bb.0:
        ldr     x8, [x0, #40]
        cbz     x8, LBB0_6
; %bb.1:
        ldrb    w9, [x8]
        cmp     w9, #1
        b.ne    LBB0_6
; %bb.2:
        cbz     x1, LBB0_5
; %bb.3:
        ldrh    w9, [x1, #4]
        tbnz    w9, #1, LBB0_5
; %bb.4:
        ldr     d0, [x8, #32]
        fmov    d1, #1.00000000
        fadd    d0, d0, d1
        str     d0, [x8, #32]
LBB0_5:
        mov     w9, #1                          ; =0x1
        strb    w9, [x8, #5]
LBB0_6:
        mov     x0, x1
        ret
        .cfi_endproc

Step 4:

_PostExecProcNodeInstr:                 ; @PostExecProcNodeInstr
        .cfi_startproc
; %bb.0:
        ldr     x8, [x0, #40]
        cbz     x8, LBB0_2
; %bb.1:
        ldr     x9, [x8]
        tbnz    x9, #63, LBB0_3
LBB0_2:
        mov     x0, x1
        ret
LBB0_3:
        cbz     x1, LBB0_5
; %bb.4:
        ldrh    w10, [x1, #4]
        mov     x11, #4611686018427387903       ; =0x3fffffffffffffff
        bics    xzr, x11, x9
        cset    w11, ne
        tst     w10, #0x2
        csel    w10, wzr, w11, ne
        add     x9, x9, x10
LBB0_5:
        orr     x9, x9, #0x4000000000000000
        str     x9, [x8]
        mov     x0, x1
        ret
        .cfi_endproc

The main differences are as follows:

  • Fewer Memory Accesses: In Step 3, the address of instr is first loaded into register x8. From this base address, the code separately loads fast_path_instr (ldrb w9, [x8]) and tuplecount (ldr d0, [x8, #32]); tts_flags is also loaded into a register from the result pointer (ldrh w9, [x1, #4]). (Note that running is never loaded into a register; its value is written directly to memory.)
    In contrast, Step 4 needs fewer loads. After loading the address of instr into register x8, the code loads the entire hot_instr into register x9 (ldr x9, [x8]) and tts_flags from the result pointer into register w10 (ldrh w10, [x1, #4]); no other loads are needed.
  • More Efficient Bit Operations: Step 4 uses branchless integer instructions such as bics and csel to perform the tuplecount increment, and a single orr to set the running flag, entirely within the CPU’s registers. This is more efficient than the floating-point operations (fmov, fadd) and the two separate store instructions used in Step 3.
These differences matter because memory access latency dominates at this scale. For reference:

| Instruction                        | Latency                       |
|------------------------------------|-------------------------------|
| Fetch from L1 cache memory         | 0.5 [nsec]                    |
| Fetch from L2 cache memory         | 7 [nsec]                      |
| Fetch from main memory             | 100 [nsec]                    |
| Read 1MB sequentially from memory  | 250,000 [nsec] = 0.25 [msec]  |

Step 5

This step is currently experimental.

  • Support for triggers has not yet been thoroughly tested.

Approach: Migrating hot_instr into PlanState

Finally, I migrate the uint64_t variable hot_instr from the Instrumentation struct directly into the PlanState struct. This helps eliminate the overhead of accessing the Instrumentation struct from the PlanState.

typedef struct PlanState
{
    pg_node_attr(abstract)

    NodeTag     type;
    ...

    uint64_t    hot_instr;		/* bit 63: fast_path_instr, bit 62: running, bit 0-61: tuplecount */
    Instrumentation *instrument;    /* Optional runtime stats for this node */
    ...
} PlanState;

/* bit definition */
#define FAST_PATH_INSTR_BIT   63
#define RUNNING_BIT    62

#define FAST_PATH_INSTR_MASK  (1ULL << FAST_PATH_INSTR_BIT)
#define RUNNING_MASK   (1ULL << RUNNING_BIT)
#define TUPLECOUNT_MASK ((1ULL << RUNNING_BIT) - 1)  //(0..61)
#define MAX_TUPLECOUNT	TUPLECOUNT_MASK

// fast_path_instr
#define is_fast_path_instr_true(node)    (((node)->hot_instr & FAST_PATH_INSTR_MASK) != 0)
#define is_fast_path_instr_false(node)   (((node)->hot_instr & FAST_PATH_INSTR_MASK) == 0)
#define set_fast_path_instr_true(node)   ((node)->hot_instr |= FAST_PATH_INSTR_MASK)
#define set_fast_path_instr_false(node)  ((node)->hot_instr &= ~FAST_PATH_INSTR_MASK)

// running
#define is_running_true(node)    (((node)->hot_instr & RUNNING_MASK) != 0)
#define is_running_false(node)   (((node)->hot_instr & RUNNING_MASK) == 0)
#define set_running_true(node)   ((node)->hot_instr |= RUNNING_MASK)
#define set_running_false(node)  ((node)->hot_instr &= ~RUNNING_MASK)

// tuplecount
#define get_tuplecount(node)     ((node)->hot_instr & TUPLECOUNT_MASK)
#define set_tuplecount(node, n)  \
	((node)->hot_instr = ((node)->hot_instr & ~TUPLECOUNT_MASK) | ((n) & TUPLECOUNT_MASK))
#define add_tuplecount(node, n) do { \
		uint64_t current_tuple = (node)->hot_instr & TUPLECOUNT_MASK; \
		uint64_t new_tuple = current_tuple + (n); \
		if (new_tuple <= MAX_TUPLECOUNT) { \
			(node)->hot_instr = ((node)->hot_instr & ~TUPLECOUNT_MASK) | new_tuple; \
		} \
	} while(0)

The following are the implementations of the PostExecProcNodeInstr() and ExecProcNodeFirst() functions, which correspond to the data structure above.

#define TupIsNull(slot) \
    ((slot) == NULL || TTS_EMPTY(slot))

#define TTS_EMPTY(slot) (((slot)->tts_flags & TTS_FLAG_EMPTY) != 0)

#define			TTS_FLAG_EMPTY			(1 << 1)

static pg_attribute_always_inline TupleTableSlot *
PostExecProcNodeInstr(PlanState *node, TupleTableSlot *result)
{
	uint64_t current = node->hot_instr;

	if ((current & FAST_PATH_INSTR_MASK) == 0) /* is_fast_path_instr_false(node) */
		return result;

	if (!TupIsNull(result))
		/* Prevent overflow */
		if ((current & TUPLECOUNT_MASK) != MAX_TUPLECOUNT)
			/* tuplecount++ */
			current += 1;

	/* running = true */
	current |= RUNNING_MASK;

	/* store current */
	node->hot_instr = current;

	return result;
}

static TupleTableSlot *
ExecProcNodeFirst(PlanState *node)
{
	check_stack_depth();

	set_fast_path_instr_false(node);

	if (node->instrument)
	{
		/*--
		 * Use node->ExecProcNodeReal, which is modified to call PostExecProcNodeInstr
		 * before returning the result, to avoid calling InstrStartNode() and
		 * InstrStopNode() when the following conditions are not required:
		 * - need_timer
		 * - need_bufusage
		 * - need_walusage
		 */
		if (node->instrument->need_timer || node->instrument->need_bufusage
			|| node->instrument->need_walusage)
		{
			node->ExecProcNode = ExecProcNodeInstr;
		}
		else
		{
			set_fast_path_instr_true(node);
			node->ExecProcNode = node->ExecProcNodeReal;
		}
	}
	else
		node->ExecProcNode = node->ExecProcNodeReal;

	return node->ExecProcNode(node);
}

Despite the simplicity of the idea, the implementation was quite challenging. Consequently, the current implementation is rather ad hoc.

See the patch for details: step5-improved-explain-analyze.patch

Results

|          | Query-1 [sec] (Var. [$\text{sec}^{2}$]) | Query-2 [sec] (Var. [$\text{sec}^{2}$]) | Overhead [%] | Overhead based on Original [%] |
|----------|------------------------------------------|------------------------------------------|--------------|--------------------------------|
| Original | 45.2046 (0.0070)                         | 58.6842 (0.0153)                         | 29.82        | 29.82                          |
| Step 1   | 45.2138 (0.0012)                         | 54.9460 (0.0012)                         | 21.52        | 21.54                          |
| Step 2   | 45.2957 (0.1936)                         | 50.2848 (0.0020)                         | 11.01        | 11.23                          |
| Step 3   | 45.4849 (0.0007)                         | 48.2952 (0.0012)                         | 6.18         | 6.84                           |
| Step 4   | 45.8483 (0.0012)                         | 47.6494 (0.0089)                         | 3.93         | 5.41                           |
| Step 5   | 45.5007 (0.0005)                         | 46.1107 (0.0006)                         | 1.34         | 2.00                           |

Analysis

Below is a comparison of the assembly code for Step 4 and Step 5.

Step 4:

_PostExecProcNodeInstr:                 ; @PostExecProcNodeInstr
        .cfi_startproc
; %bb.0:
        ldr     x8, [x0, #40]
        cbz     x8, LBB0_2
; %bb.1:
        ldr     x9, [x8]
        tbnz    x9, #63, LBB0_3
LBB0_2:
        mov     x0, x1
        ret
LBB0_3:
        cbz     x1, LBB0_5
; %bb.4:
        ldrh    w10, [x1, #4]
        mov     x11, #4611686018427387903       ; =0x3fffffffffffffff
        bics    xzr, x11, x9
        cset    w11, ne
        tst     w10, #0x2
        csel    w10, wzr, w11, ne
        add     x9, x9, x10
LBB0_5:
        orr     x9, x9, #0x4000000000000000
        str     x9, [x8]
        mov     x0, x1
        ret
        .cfi_endproc

Step 5:

_PostExecProcNodeInstr:                 ; @PostExecProcNodeInstr
        .cfi_startproc
; %bb.0:
        ldr     x8, [x0, #40]
        tbnz    x8, #63, LBB0_2
; %bb.1:
        mov     x0, x1
        ret
LBB0_2:
        cbz     x1, LBB0_4
; %bb.3:
        ldrh    w9, [x1, #4]
        mov     x10, #4611686018427387903       ; =0x3fffffffffffffff
        bics    xzr, x10, x8
        cset    w10, ne
        tst     w9, #0x2
        csel    w9, wzr, w10, ne
        add     x8, x8, x9
LBB0_4:
        orr     x8, x8, #0x4000000000000000
        str     x8, [x0, #40]
        mov     x0, x1
        ret
        .cfi_endproc

Let’s look into how the hot_instr variable is loaded.

In Step 4, the load is a two-step process. First, ldr x8, [x0, #40] loads node->instrument, the address of the Instrumentation struct whose first member is hot_instr, into register x8. Then, ldr x9, [x8] loads the actual value of hot_instr from that address into x9. Because the second load cannot begin until the first completes, this pointer chase is a performance bottleneck.

Step 5, on the other hand, completes the task in a single step: since hot_instr now lives in PlanState itself, the ldr x8, [x0, #40] instruction loads its value directly into register x8. This reduces the number of ldr instructions to just one.
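In C terms, the difference is a pointer chase versus a direct member access (a schematic, not the actual patch code):

uint64_t	v;

/*
 * Step 4: two dependent loads; the second cannot start until the first
 * completes, because it needs the pointer value.
 */
v = node->instrument->hot_instr;	/* ldr x8, [x0, #40]; ldr x9, [x8] */

/*
 * Step 5: hot_instr is embedded in PlanState, so a single load at a
 * fixed offset suffices.
 */
v = node->hot_instr;				/* ldr x8, [x0, #40] */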

The results from Steps 3, 4, and 5 strongly suggest that the latency of loading data from memory (whether from cache or main memory) accounts for a substantial part of the overhead. This finding supports the strategy of combining the running flag and tuplecount into a single variable to reduce the number of memory loads.¹

Beyond that, rather than consolidating fast_path_instr, running, and tuplecount into a 64-bit data type, it has been suggested that a 128-bit data type could be used. I have, however, decided against implementing this for now due to the lack of a suitable built-in library in PostgreSQL.
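For what it is worth, such a 128-bit variant could be sketched with the non-standard unsigned __int128 extension of GCC and Clang; the names below are hypothetical, and this kind of non-portable dependency is precisely what argues against it:

#include <stdint.h>

typedef unsigned __int128 hot_instr128;	/* compiler extension, not standard C */

#define FP_MASK128		((hot_instr128) 1 << 127)
#define RUN_MASK128		((hot_instr128) 1 << 126)
#define COUNT_MASK128	(RUN_MASK128 - 1)	/* bits 0-125 */

static inline void
add_tuplecount128(hot_instr128 *h, uint64_t n)
{
	hot_instr128 cnt = (*h & COUNT_MASK128) + n;

	/* Saturate instead of overflowing into the flag bits. */
	if (cnt <= COUNT_MASK128)
		*h = (*h & ~COUNT_MASK128) | cnt;
}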


Conclusion

I have completed a feasibility study on reducing EXPLAIN ANALYZE overhead, ultimately decreasing it from 29.82[%] to 2.00[%].

This improvement was mainly achieved by:

  • Reducing the number of function calls from step 1 to step 3.
  • Reducing the number of data loads to registers in steps 4 and 5.

As a feasibility study, this was a success. I have demonstrated that the most straightforward method of managing row counts within a PlanNode is effective. This result shows that my ultimate goal² (counting actual rows during normal query execution without using EXPLAIN ANALYZE) is realistic, with an overhead of roughly 1[%], i.e. about 36 seconds of extra execution time per hour.

Of course, future challenges remain. The ad-hoc implementation will need a structurally sound refactoring, which is expected to be quite complex.


  1. It also reinforces the decision not to use atomic operations, which exist primarily to handle race conditions between concurrent processes; they would do nothing to speed up the loads themselves. ↩︎

  2. See the introduction for details. ↩︎