Mathematical Approaches for Selectivity Error Correction
In previous posts, post1 and post2, I demonstrated that it’s possible to continuously obtain the actual row counts from EXPLAIN ANALYZE with an overhead of about 2.0%.
The next step is to examine methods for correcting the selectivity (or cardinality) estimation results.
Below, I will explain the correction methods based on Most Common Values (MCVs) and Histograms.
First, let’s consider adjusting the joint probability distribution based on MCVs for columns in two tables.
- The MCVs for column $C_{A}$ in table $T_{A}$ are [$X_{a}, X_{b}, X_{c}, X_{d}$], with corresponding Most Common Frequencies (MCFs) of [0.2, 0.25, 0.35, 0.2].
- The MCVs for column $C_{B}$ in table $T_{B}$ are [$Y_{a}, Y_{b}, Y_{c}$], with corresponding MCFs of [0.4, 0.55, 0.05].
In many RDBMSs, including PostgreSQL, the joint probability for $( C_{A} = X_{a} \wedge C_{B} = Y_{a} )$ is calculated by assuming independence between $C_{A}$ and $C_{B}$, which gives:
$$ P(C_{A} = X_{a}) \times P(C_{B} = Y_{a}) = 0.2 \times 0.4 = 0.08 $$
The resulting joint probability distribution for all combinations is as follows:
| X\Y | Y_a | Y_b | Y_c |
|---|---|---|---|
| X_a | 0.08 | 0.11 | 0.01 |
| X_b | 0.10 | 0.14 | 0.01 |
| X_c | 0.14 | 0.19 | 0.02 |
| X_d | 0.08 | 0.11 | 0.01 |
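For reference, here is a minimal NumPy sketch (my own illustration, not code from the posts) that reproduces the table above from the two MCF vectors:

```python
import numpy as np

P_A = np.array([0.2, 0.25, 0.35, 0.2])   # MCFs of C_A: X_a, X_b, X_c, X_d
P_B = np.array([0.4, 0.55, 0.05])        # MCFs of C_B: Y_a, Y_b, Y_c

# Joint distribution under the independence assumption: outer product of the marginals.
print(np.round(np.outer(P_A, P_B), 2))
```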
Now, suppose that the actual measured joint probability $ P(X_{a}, Y_{a}) $ is found to be $0.2$.
We must now adjust the entire joint probability distribution while preserving the marginal distributions $P(X)$ and $P(Y)$. Several methods exist, but here we employ Iterative Proportional Fitting (IPF), which is relatively simple to implement and minimizes the Kullback-Leibler (KL) divergence.
The source code and execution result are shown below, followed by the mathematical proof.
The source code is shown below: mcv-ipf.py
The execution result of mcv-ipf.py is shown below:
$ python mcv-ipf.py
==== Initial Conditions ====
P_A: [X_a, X_b, X_c, X_d]
[0.2 0.25 0.35 0.2 ]
P_B: [Y_a, Y_b, Y_c]
[0.4 0.55 0.05]
==== Joint Distribution (Independence Assumption) ====
Y_a Y_b Y_c
X_a 0.08 0.11 0.01
X_b 0.10 0.14 0.01
X_c 0.14 0.19 0.02
X_d 0.08 0.11 0.01
==== Final Result (P(X_a, Y_a)=0.2 Fixed, KL Minimization) ====
Y_a Y_b Y_c
X_a 0.20 0.00 0.00
X_b 0.06 0.17 0.02
X_c 0.09 0.24 0.02
X_d 0.05 0.14 0.01
Row Marginal Sums (compared to P_A_example): [0.2 0.25 0.35 0.2 ]
Column Marginal Sums (compared to P_B_example): [0.4 0.55 0.05]
The IPF algorithm adjusts the joint probability distribution so as to minimize the Kullback-Leibler (KL) divergence without ever computing the divergence directly. This is because IPF can be viewed as a type of Expectation-Maximization (EM) algorithm or coordinate descent: each IPF step provably decreases the divergence.
The objective of IPF is to find an adjusted joint probability distribution $\mathbf{P}$ that minimizes the KL divergence $D_{KL}(\mathbf{P} \parallel \mathbf{P}^{(0)})$ from the initial distribution $\mathbf{P}^{(0)}$, while satisfying the marginal probability constraints $\mathbf{P}_A$ and $\mathbf{P}_B$:
$$ \min_{P} D_{KL}(P \parallel P^{(0)}) = \min_{P} \sum_{i, j} P_{i, j} \log \left( \frac{P_{i, j}}{P^{(0)}_{i, j}} \right) $$
Constraints:
$$ \sum_{j} P_{i, j} = P_{A, i} \quad (\text{Row Marginal}) $$ $$ \sum_{i} P_{i, j} = P_{B, j} \quad (\text{Column Marginal}) $$
Each iteration alternates between adjusting rows and columns until convergence.
IPF finds the solution to this optimization problem through iterative scaling operations without the need for direct calculation of the KL divergence.
To make the current distribution $\mathbf{P}^{(k)}$ conform to the row marginal constraint, a row-specific scaling factor $\lambda_i$ is applied:
$$ P_{i,j}^{(k + 1/2)} = P_{i,j}^{(k)} \times \lambda_{i} $$
Here, the scaling factor $\lambda_{i}$ is defined as:
$$ \lambda_{i} = \frac{P_{A, i}}{\sum_{j'} P^{(k)}_{i, j'}} $$
The row adjustment update formula is expressed by the following ratio:
$$ P_{i, j}^{(k + 1/2)} = P_{i, j}^{(k)} \times \left( \frac{P_{A, i}}{\sum_{j'} P_{i, j'}^{(k)}} \right) $$
Next, a column-specific scaling factor $\mu_j$ is applied to the adjusted distribution $\mathbf{P}^{(k + 1/2)}$ to make it conform to the column marginal constraint:
$$ P_{i, j}^{(k + 1)} = P_{i, j}^{(k + 1/2)} \times \mu_{j} $$
Here, the scaling factor $\mu_{j}$ is defined as:
$$ \mu_{j} = \frac{P_{B, j}}{\sum_{i'} P_{i', j}^{(k + 1/2)}} $$
The column adjustment update formula is expressed by the following ratio:
$$ P_{i, j}^{(k + 1)} = P_{i, j}^{(k + 1/2)} \times \left( \frac{P_{B, j}}{\sum_{i'} P_{i', j}^{(k + 1/2)}} \right) $$
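To make the update formulas concrete, here is a minimal, self-contained sketch of the alternating row/column scaling (my own illustration, not the actual mcv-ipf.py; in particular, it does not pin the observed cell $P(X_{a}, Y_{a}) = 0.2$ the way the full script does):

```python
import numpy as np

# Marginals from the MCV example above.
P_A = np.array([0.2, 0.25, 0.35, 0.2])   # rows: X_a..X_d
P_B = np.array([0.4, 0.55, 0.05])        # columns: Y_a..Y_c

def ipf(P, P_A, P_B, iters=100, tol=1e-12):
    """Alternate the row and column scaling steps until both marginals match."""
    for _ in range(iters):
        P = P * (P_A / P.sum(axis=1))[:, None]   # row step: multiply by lambda_i
        P = P * (P_B / P.sum(axis=0))[None, :]   # column step: multiply by mu_j
        if (np.abs(P.sum(axis=1) - P_A).max() < tol and
                np.abs(P.sum(axis=0) - P_B).max() < tol):
            break
    return P

# Start from the independence assumption, then perturb one cell to mimic an
# observed correction; IPF restores the marginals while staying close to it.
P0 = np.outer(P_A, P_B)
P0[0, 0] = 0.2
print(np.round(ipf(P0, P_A, P_B), 2))
```

Starting from the pure independence matrix, the marginals are already satisfied, so the perturbed cell above is what forces a non-trivial adjustment.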
When using histograms, two key differences arise compared to Most Common Values (MCVs):
- Range Query Support: Query constraints must accommodate not only equi-joins but also range queries. Iterative Proportional Fitting (IPF) is primarily designed for equality constraints (e.g., summing marginal probabilities), making it challenging to directly and efficiently incorporate area constraints involving inequalities (range queries).
- High Dimensionality: The number of histogram bins (typically 100 by default) is large compared to MCVs. This necessitates reducing the data storage space required for the full joint probability distribution.
To simultaneously address both challenges, we adopt a Neural Network (NN) approach, utilizing function approximation to fine-tune the joint probability distribution.
We will implement the simplest machine learning method, which minimizes the Mean Squared Error (MSE). Specifically, this approach estimates a joint probability distribution $P_{\text{NN}}$ that satisfies two predefined statistical constraints: marginal distributions ($H_{A}, H_{B}$) and a specific area sum ($C_{\text{true}}$).
The core idea is to treat the estimation as a constrained optimization problem, where the NN parameters $\mathbf{\Theta}$ are adjusted to minimize a total loss function, $L_{\text{total}}$.
The total loss function $L_{\text{total}}$ is defined as the weighted sum of two components: the area loss ($L_{\text{area}}$) and the marginal loss ($L_{\text{margin}}$).
$$ L_{\text{total}} = \lambda_{\text{area}} L_{\text{area}} + \lambda_{\text{margin}} L_{\text{margin}} $$
Here, $\lambda_{\text{area}}$ and $\lambda_{\text{margin}}$ are hyperparameters (weights) used to adjust the importance of each constraint.
- Area Loss ($L_{\text{area}}$):
  This term penalizes the difference between the sum of probabilities estimated by the NN over a specific range $\mathcal{R}$ and the true target probability $C_{\text{true}}$. The range $\mathcal{R}$ corresponds to the query range that needs correction based on the observed true row count.
  $$ L_{\text{area}} = \left( \sum_{\mathbf{x} \in \mathcal{R}} P_{\text{NN}}(\mathbf{x}; \mathbf{\Theta}) - C_{\text{true}} \right)^2 $$
  Here, $\mathbf{x}$ represents the histogram bin coordinates $(i, j)$, and $C_{\text{true}}$ is the normalized probability obtained by dividing the observed actual row count by the total number of rows $N$.
- Marginal Loss ($L_{\text{margin}}$):
  This term ensures that the marginal distributions derived from the NN output match the predefined marginal histograms $H_{A}$ and $H_{B}$.
  $$ L_{\text{margin}} = \sum_{i} \left( \sum_{j} P_{\text{NN}}(i, j; \mathbf{\Theta}) - H_{A}(i) \right)^2 + \sum_{j} \left( \sum_{i} P_{\text{NN}}(i, j; \mathbf{\Theta}) - H_{B}(j) \right)^2 $$
  $i$ and $j$ represent the bin indices for the $A$ and $B$ axes, respectively.
The Neural Network (NN), such as the simple multi-layer perceptron implemented in the code, is trained using the following steps:
- Step 1: Input and Output
  - The model (ProbabilityModel) takes the normalized bin coordinates $(i, j)$ of the 2D histogram grid as input, encoded as a concatenated one-hot vector (dimension $NUM_{A} + NUM_{B}$).
  - The network’s output is a single logit value for that specific coordinate.
- Step 2: Probability Normalization
  - The network first calculates logit values for all bins. The joint probability $P_{\text{NN}}$ is then computed by applying the softmax function across the logits of all bins. This step is crucial because it guarantees that the probabilities over all bins sum to exactly 1.
  $$ P_{\text{NN}}(i, j; \mathbf{\Theta}) = \frac{\exp(\text{logits}_{i, j})}{\sum_{k, l} \exp(\text{logits}_{k, l})} $$
- Step 3: Optimization
  - The loss functions $L_{\text{area}}$ and $L_{\text{margin}}$ are calculated, and the weighted total loss $L_{\text{total}}$ is obtained using the hyperparameters $\lambda_{\text{area}}$ and $\lambda_{\text{margin}}$.
  - The parameters $\mathbf{\Theta}$ are iteratively updated using gradient descent with the Adam optimizer (optim.Adam) to minimize $L_{\text{total}}$.
- Step 4: Fine-tuning Goal
  - The process is essentially a fine-tuning of the NN. The network is adjusted from an initial state to incorporate the new, more accurate selectivity information $C_{\text{true}}$ into the area $\mathcal{R}$.
  - This adjustment is performed while simultaneously minimizing the marginal loss, ensuring that the generated distribution $P_{\text{NN}}$ remains consistent with the predefined marginal histograms $H_{A}$ and $H_{B}$. The weights ($\lambda_{\text{area}}$, $\lambda_{\text{margin}}$) balance these two competing objectives: fitting the observed data and maintaining statistical consistency.
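The steps above can be condensed into a short PyTorch sketch. This is my own minimal illustration of the procedure, not the actual hist-nn.py; the bin counts, target range, marginal histograms, and hyperparameter values below are assumptions chosen only to make the example self-contained.

```python
import torch
import torch.nn as nn
import torch.optim as optim

NUM_A, NUM_B = 10, 8                      # assumed bin counts (a 10x8 grid, as in the output below)
C_TRUE = 0.30                             # assumed observed probability for the target range
LAMBDA_AREA, LAMBDA_MARGIN = 10.0, 1.0    # assumed weights
H_A = torch.full((NUM_A,), 1.0 / NUM_A)   # assumed (uniform) marginal histograms
H_B = torch.full((NUM_B,), 1.0 / NUM_B)

class ProbabilityModel(nn.Module):
    """Maps a concatenated one-hot (i, j) bin coordinate to a single logit."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NUM_A + NUM_B, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

# One-hot inputs for every bin of the grid: row k encodes (a, b) = (k // NUM_B, k % NUM_B).
inputs = torch.cat([
    torch.eye(NUM_A).repeat_interleave(NUM_B, dim=0),
    torch.eye(NUM_B).repeat(NUM_A, 1)], dim=1)

model = ProbabilityModel()
optimizer = optim.Adam(model.parameters(), lr=1e-2)

for step in range(500):
    logits = model(inputs)                               # Step 1: one logit per bin
    p = torch.softmax(logits, dim=0).view(NUM_A, NUM_B)  # Step 2: probabilities sum to 1
    l_area = (p[0:5, 0:4].sum() - C_TRUE) ** 2           # assumed range: A[0..4] x B[0..3]
    l_margin = ((p.sum(dim=1) - H_A) ** 2).sum() + ((p.sum(dim=0) - H_B) ** 2).sum()
    loss = LAMBDA_AREA * l_area + LAMBDA_MARGIN * l_margin   # Step 3: weighted total loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(p.detach().numpy().round(3))                       # fine-tuned joint distribution
```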
The source code is shown below: hist-nn.py
The execution result of hist-nn.py is shown below:
$ python hist-nn.py
--- Before Fine-tuning (Initial Random State) ---
---------------------------------------------------------------
A\B | B[0] | B[1] | B[2] | B[3] | B[4] | B[5] | B[6] | B[7]
---------------------------------------------------------------
A[0] | 0.013 | 0.013 | 0.012 | 0.012 | 0.012 | 0.013 | 0.013 | 0.013 |
A[1] | 0.013 | 0.013 | 0.013 | 0.012 | 0.013 | 0.013 | 0.013 | 0.013 |
A[2] | 0.012 | 0.012 | 0.011 | 0.011 | 0.012 | 0.012 | 0.013 | 0.012 |
A[3] | 0.011 | 0.012 | 0.011 | 0.011 | 0.012 | 0.012 | 0.012 | 0.012 |
A[4] | 0.012 | 0.013 | 0.012 | 0.012 | 0.012 | 0.012 | 0.012 | 0.012 |
A[5] | 0.013 | 0.014 | 0.013 | 0.013 | 0.013 | 0.013 | 0.014 | 0.013 |
A[6] | 0.013 | 0.013 | 0.013 | 0.012 | 0.013 | 0.013 | 0.013 | 0.013 |
A[7] | 0.012 | 0.012 | 0.012 | 0.011 | 0.011 | 0.012 | 0.012 | 0.012 |
A[8] | 0.013 | 0.013 | 0.012 | 0.012 | 0.013 | 0.013 | 0.014 | 0.013 |
A[9] | 0.013 | 0.013 | 0.013 | 0.012 | 0.013 | 0.013 | 0.013 | 0.013 |
---------------------------------------------------------------
Verification: Sum of area P(A[0:4], B[0:3]): 0.241186 (Target: 0.300000)
Verification: Marginal Error (MSE): 0.01066820
--- Starting Fine-tuning ---
Step 50: L_Total=0.011445, L_Area=0.000037, L_Margin=0.011070, C_NN=0.293877
Step 100: L_Total=0.010102, L_Area=0.000001, L_Margin=0.010089, C_NN=0.298861
Step 150: L_Total=0.008607, L_Area=0.000000, L_Margin=0.008604, C_NN=0.299434
Step 200: L_Total=0.006526, L_Area=0.000000, L_Margin=0.006522, C_NN=0.299367
Step 250: L_Total=0.004152, L_Area=0.000001, L_Margin=0.004146, C_NN=0.299279
Step 300: L_Total=0.002355, L_Area=0.000001, L_Margin=0.002348, C_NN=0.299156
Step 350: L_Total=0.001419, L_Area=0.000001, L_Margin=0.001412, C_NN=0.299172
Step 400: L_Total=0.000997, L_Area=0.000001, L_Margin=0.000990, C_NN=0.299193
Step 450: L_Total=0.000784, L_Area=0.000001, L_Margin=0.000778, C_NN=0.299235
Step 500: L_Total=0.000645, L_Area=0.000000, L_Margin=0.000640, C_NN=0.299303
--- Fine-tuning Complete ---
--- After Fine-tuning (Constraints Applied) ---
---------------------------------------------------------------
A\B | B[0] | B[1] | B[2] | B[3] | B[4] | B[5] | B[6] | B[7]
---------------------------------------------------------------
A[0] | 0.005 | 0.007 | 0.019 | 0.029 | 0.025 | 0.013 | 0.005 | 0.004 |
A[1] | 0.006 | 0.009 | 0.025 | 0.035 | 0.030 | 0.015 | 0.006 | 0.004 |
A[2] | 0.004 | 0.005 | 0.014 | 0.022 | 0.020 | 0.010 | 0.003 | 0.003 |
A[3] | 0.007 | 0.011 | 0.028 | 0.043 | 0.038 | 0.021 | 0.008 | 0.006 |
A[4] | 0.003 | 0.005 | 0.009 | 0.015 | 0.015 | 0.007 | 0.003 | 0.003 |
A[5] | 0.007 | 0.010 | 0.025 | 0.041 | 0.045 | 0.021 | 0.007 | 0.005 |
A[6] | 0.003 | 0.004 | 0.010 | 0.018 | 0.022 | 0.010 | 0.003 | 0.003 |
A[7] | 0.002 | 0.003 | 0.006 | 0.010 | 0.012 | 0.006 | 0.002 | 0.002 |
A[8] | 0.005 | 0.007 | 0.019 | 0.032 | 0.041 | 0.018 | 0.006 | 0.004 |
A[9] | 0.003 | 0.004 | 0.007 | 0.012 | 0.016 | 0.007 | 0.003 | 0.002 |
---------------------------------------------------------------
Verification: Sum of area P(A[0:4], B[0:3]): 0.299303 (Target: 0.300000)
Verification: Marginal Error (MSE): 0.00063787
An alternative approach combines Neural Network (NN) estimation with the Iterative Proportional Fitting (IPF) procedure: the NN provides the expressiveness needed for local refinement, while IPF rigorously enforces the marginal constraints to maintain global consistency.
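As a rough illustration of the idea (my own sketch, not the actual hist-hybrid.py), the NN output can be post-processed with a few IPF sweeps so that the marginal constraints are enforced exactly:

```python
import numpy as np

def ipf_project(P, H_A, H_B, iters=50):
    """Project an NN-estimated joint distribution onto the marginal constraints."""
    for _ in range(iters):
        P = P * (H_A / P.sum(axis=1))[:, None]   # enforce row marginals
        P = P * (H_B / P.sum(axis=0))[None, :]   # enforce column marginals
    return P

# Stand-in for the softmax output of the fine-tuned network (random here).
rng = np.random.default_rng(0)
P_nn = rng.random((10, 8))
P_nn /= P_nn.sum()

H_A = np.full(10, 1 / 10)   # assumed marginal histograms
H_B = np.full(8, 1 / 8)

P_final = ipf_project(P_nn, H_A, H_B)
print(abs(P_final.sum(axis=1) - H_A).max(), abs(P_final.sum(axis=0) - H_B).max())
```

In this split, the NN absorbs the observed area correction, and IPF acts as a cheap projection step that restores exact marginal consistency.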
The implementation details are provided in the source code below: hist-hybrid.py
The proposed method allows the joint probability distribution to be updated based on observed data.
However, there are still several hurdles to overcome for practical application in a real-world Relational Database (RDB).
We will discuss these challenges from two perspectives: Computational Complexity and Memory Footprint Reduction.
When a probability distribution stored as a histogram is used for Selectivity Estimation, the values often need to be summed up based on a given condition.
For example, to get the value for the range $h[3:7]$ in a 1D histogram $h[0:19]$, you must sum the values from $h[3]$ to $h[7]$, i.e., $ \sum_{i=3}^{7} h[i]$. The computational cost for this summation is $O(n)$, where $n$ is the length of the histogram $h$.
While this cost is acceptable for the 1D case (in fact, PostgreSQL uses this method), it becomes non-negligible for a 2D histogram, where the calculation cost increases to $O(n^{2})$.
Fortunately, there is a method to address this situation: the Cumulative Sum (or Summed-Area Table in 2D).
Simply put, instead of sequentially adding the values in the desired range, we use a new histogram $H$ (the cumulative sum histogram) in which each element is the accumulated sum of the original histogram $h$. By taking the difference between the cumulative values at the upper and lower bounds of the region, the calculation can always be performed at $O(1)$ computational cost.
The Cumulative Sum $H$ is defined as follows (assuming $H[-1]=0$):
$$H[i] = \sum_{j=0}^{i} h[j]$$ $$H[i] = H[i-1] + h[i] \quad \text{for } i \ge 0$$
Let’s consider calculating $S = \sum_{i=3}^{7} h[i]$:
- Using the original histogram $h$, you must sum the values from $h[3]$ to $h[7]$: $$ S = \sum_{i=3}^{7} h[i] = h[3] + h[4] + h[5] + h[6] + h[7] $$
- In contrast, using the Cumulative Sum Histogram $H$, only the following subtraction is required: $$ S = H[7] - H[2] $$
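A minimal NumPy sketch of the 1D case (my own illustration, with hypothetical histogram values):

```python
import numpy as np

h = np.arange(20)        # a 1D histogram h[0:19] with hypothetical values 0..19
H = np.cumsum(h)         # cumulative sum histogram: H[i] = h[0] + ... + h[i]

# S = h[3] + ... + h[7] in O(1) instead of O(n):
assert H[7] - H[2] == h[3:8].sum()
print(H[7] - H[2])       # 25
```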
The same principle applies in the 2D case.
If $h[i,j]$ is the original 2D histogram and $H[i,j]$ is the Cumulative Sum Table (or Summed-Area Table), $H[i,j]$ is defined as the sum of all bin probabilities (or frequencies) contained in the rectangular region from the origin $(0,0)$ up to coordinates $(i,j)$.
$$\mathbf{H}[i, j] = \sum_{x=0}^{i} \sum_{y=0}^{j} \mathbf{h}[x, y]$$
For an original histogram $h$ of dimension $M \times N$, $\mathbf{H}$ can be created efficiently with a computational complexity of $O(MN)$ using the following recurrence relation:
$$\mathbf{H}[i, j] = \mathbf{h}[i, j] + \mathbf{H}[i-1, j] + \mathbf{H}[i, j-1] - \mathbf{H}[i-1, j-1]$$
- $\mathbf{h}[i, j]$: The value of the current bin.
- $\mathbf{H}[i-1, j]$: The cumulative sum from the area above.
- $\mathbf{H}[i, j-1]$: The cumulative sum from the area to the left.
- $\mathbf{H}[i-1, j-1]$: This is the top-left corner area which was doubly added in the previous two terms, so it must be subtracted once.
To find the probability $P$ for any arbitrary rectangular region $[a_{1}, a_{2}] \times [b_{1}, b_{2}]$ (with top-left corner $(a_{1}, b_{1})$ and bottom-right corner $(a_{2}, b_{2})$), we can calculate it in $O(1)$ time using the following four cumulative sum values:
$$P = \mathbf{H}[a_2, b_2] - \mathbf{H}[a_1-1, b_2] - \mathbf{H}[a_2, b_1-1] + \mathbf{H}[a_1-1, b_1-1]$$
- Calculation Breakdown:
- $\mathbf{H}[a_2, b_2]$: The total area ($(0, 0)$ to $(a_2, b_2)$)
- $\mathbf{H}[a_1-1, b_2]$: Subtract the unnecessary region above the target region.
- $\mathbf{H}[a_2, b_1-1]$: Subtract the unnecessary region to the left of the target region.
- $\mathbf{H}[a_1-1, b_1-1]$: The top-left corner was doubly subtracted in steps 2 and 3, so we add it back once.
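The 2D construction and the $O(1)$ query above can be sketched as follows (my own illustration, using NumPy’s cumulative sums rather than the explicit recurrence):

```python
import numpy as np

rng = np.random.default_rng(42)
h = rng.random((100, 100))
h /= h.sum()                               # a hypothetical 100x100 2D histogram
H = h.cumsum(axis=0).cumsum(axis=1)        # summed-area table, built in O(MN)

def area_sum(H, a1, a2, b1, b2):
    """Sum of h[a1..a2, b1..b2] in O(1) via inclusion-exclusion on the summed-area table."""
    total = H[a2, b2]
    if a1 > 0:
        total -= H[a1 - 1, b2]             # remove the region above
    if b1 > 0:
        total -= H[a2, b1 - 1]             # remove the region to the left
    if a1 > 0 and b1 > 0:
        total += H[a1 - 1, b1 - 1]         # add back the doubly subtracted corner
    return total

print(np.isclose(area_sum(H, 10, 30, 20, 60), h[10:31, 20:61].sum()))   # True
```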
To reduce the memory required for the 2D histogram and allow for faster fine-tuning, I adopted an NN (Neural Network). However, directly handling an NN inside an RDB is computationally expensive.
Here, the Count-Min Sketch (CMS), a probabilistic data structure, can dramatically reduce the storage required for the 2D histogram.
The Count-Min Sketch is a probabilistic data structure used to approximately estimate the frequency (count) of specific elements in a stream or massive dataset. It stores data in a fixed, small-sized matrix (the “sketch”). By using multiple independent hash functions, the impact of hash collisions is minimized. To estimate the count of any element, we adopt the minimum value among the multiple buckets that the element maps to. The biggest advantage of this technique is that the estimation accuracy ($\epsilon$) and the probability of failure ($\delta$) can be specified by the user, fixing the memory size to be extremely small, regardless of the data’s cardinality (number of unique items) or the total count.
Therefore, the practical approach would be to adjust the probability distribution using the NN and then map that resulting distribution onto the CMS for operational use.
Note that the probability distribution produced by the NN is typically a floating-point type (like double), while the CMS uses an integer type. Therefore, when using the CMS, you must separately store the total sample count $N$ and divide the value retrieved from the CMS by $N$ to obtain the probability.
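Below is a minimal Count-Min Sketch in Python (my own illustration, not the actual cms.py; the hash construction is simplified, and the bin and row-count numbers are hypothetical). It stores integer counts and recovers probabilities by dividing by the separately stored sample count $N$, as described above:

```python
import math
import random

class CountMinSketch:
    def __init__(self, epsilon=0.01, delta=0.01):
        self.w = math.ceil(math.e / epsilon)        # bucket width
        self.d = math.ceil(math.log(1.0 / delta))   # number of hash functions (rows)
        self.table = [[0] * self.w for _ in range(self.d)]
        # Simple seeded hashes for illustration; a production implementation
        # would use pairwise-independent hash functions.
        self.seeds = [random.randrange(1 << 31) for _ in range(self.d)]

    def _index(self, key, row):
        return hash((self.seeds[row], key)) % self.w

    def update(self, key, count=1):
        for r in range(self.d):
            self.table[r][self._index(key, r)] += count

    def estimate(self, key):
        # Take the minimum over rows to limit the effect of hash collisions.
        return min(self.table[r][self._index(key, r)] for r in range(self.d))

# Store raw counts per (i, j) histogram bin; keep N separately and divide to get probabilities.
cms = CountMinSketch(epsilon=0.01, delta=0.01)
N = 1_000_000
cms.update((3, 5), 25_000)        # hypothetical: 25,000 of N rows fall into bin (3, 5)
print(cms.estimate((3, 5)) / N)   # ~0.025 (a CMS may overestimate, never underestimate)
```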
Two parameters are required to calculate the memory capacity needed for a Count-Min Sketch:
- Permissible Error Range: $\epsilon$
- Acceptable Failure Probability: $\delta$
The size ($d \times w$) of the CM Sketch matrix is determined by the following formulas:
- Number of Rows $d$ (Number of Hash Functions): $$d = \left\lceil \ln\left(\frac{1}{\delta}\right) \right\rceil $$
- Number of Columns $w$ (Bucket Width): $$w = \left\lceil \frac{e}{\epsilon} \right\rceil $$
Let’s calculate the required memory using specific values. Assuming the histogram size is $100 \times 100$, and setting both the permissible error $\epsilon$ and the acceptable failure probability $\delta$ to 0.01 (i.e., 1%), the number of rows $d$ and columns $w$ of the CMS are:
$$d = \left\lceil \ln\left(\frac{1}{\delta}\right) \right\rceil = \lceil \ln(100) \rceil = \lceil 4.605 \rceil = 5$$
$$w = \left\lceil \frac{e}{\epsilon} \right\rceil = \lceil \frac{2.718}{0.01} \rceil = \lceil 271.8 \rceil = 272$$
Thus, the total number of elements in the sketch is $d \times w = 5 \times 272 = 1360$.
Assuming each element is an 8-byte integer (like long long), the memory required for the CMS is:
$$\text{CM Sketch Memory} = 1,360 \ \text{elements} \times 8 \text{ bytes/element} = 10,880 \text{ bytes} \approx \mathbf{10.63 \ Kbytes}$$
| Item | Calculation | Memory Capacity |
|---|---|---|
| Original Histogram | $100 \times 100 \times 8 \text{ bytes}$ | $800,000 \text{ bytes} \approx \mathbf{781.25 \ Kbytes}$ |
| CM Sketch | $5 \times 272 \times 8 \text{ bytes}$ | $10,880 \text{ bytes} \approx \mathbf{10.63 \ Kbytes}$ |
In this example, using the Count-Min Sketch reduces the memory footprint of the original histogram to approximately 1/74th of its original size.
The source code is shown below: cms.py
As stated above, the CMS uses integers. Therefore, when dealing with probability distributions, remember to separately store the total number of samples $N$ and divide the value retrieved from the CMS by $N$ to obtain the probability.
The core takeaway is that I have demonstrated the feasibility of continuously improving selectivity estimation by acquiring actual row counts for every query.
- If time permits, this functionality will be implemented in pg_plan_inspector.
Two issues remain to be addressed:
- Persistence of Corrections After ANALYZE
  I deliberately avoided discussing the following issue until now: whenever the ANALYZE command refreshes the selectivity statistics, all previous corrections are effectively invalidated.
  Therefore, we must find a way to reuse those corrections or to integrate the pre-ANALYZE distribution updates into the new statistics. Assuming that a sudden, drastic change in data distribution is a rare event, this approach seems practical.
- Accuracy Verification of Probabilistic Data Structures
  Both the NN approximation and the Count-Min Sketch are probabilistic, approximate techniques. The use of a CMS, in particular, carries the risk of adding estimation error through hash collisions (a CMS can only overestimate counts).
  Consequently, the accuracy of selectivity estimation obtained by combining these two approaches requires rigorous qualitative and quantitative analysis. Since the overarching goal of these three blog posts is to improve the accuracy of selectivity estimation, if the final result cannot guarantee sufficient accuracy, then all the effort invested in this approach would be meaningless.
My primary goal was to enable real-time query progress display by continuously obtaining actual row counts. By reducing the overhead of EXPLAIN ANALYZE to 2%, that objective has been achieved.
However, the accuracy of the Query Progress display depends directly on the quality of the Selectivity Estimation. Thus, although secondary to my main goal, improving the performance of Selectivity Estimation is still necessary.
I hope that providing researchers with a constant stream of actual row counts will inspire them to devise entirely new methods for Selectivity Estimation.