Mathematical Approaches for Selectivity Error Correction
In previous posts, post1 and post2, I demonstrated that the actual row counts produced by EXPLAIN ANALYZE can be collected continuously with an overhead of about 2.0%.
The next step is to examine methods for correcting the selectivity (or cardinality) estimation results.
Below, I will explain the correction methods based on Most Common Values (MCVs) and Histograms.
1. MCVs
First, let’s consider adjusting the joint probability distribution based on MCVs for columns in two tables.
The MCVs for column $C_{A}$ in table $T_{A}$ are [$X_{a}, X_{b}, X_{c}, X_{d}$], with corresponding Most Common Frequencies (MCFs) of [0.2, 0.25, 0.35, 0.2].
The MCVs for column $C_{B}$ in table $T_{B}$ are [$Y_{a}, Y_{b}, Y_{c}$], with corresponding MCFs of [0.4, 0.55, 0.05].
In many RDBMSs, including PostgreSQL, the joint probability for $( C_{A} = X_{a} \wedge C_{B} = Y_{a} )$ is calculated by assuming independence between $C_{A}$ and $C_{B}$, which gives:
$$
P(C_{A} = X_{a}) \times P(C_{B} = Y_{a}) = 0.2 \times 0.4 = 0.08
$$
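The full joint distribution under this independence assumption is just the outer product of the two MCF vectors. Here is a minimal NumPy sketch (variable names are illustrative; the complete script in Section 1.1 below does the same before applying IPF):

```python
import numpy as np

mcf_A = np.array([0.2, 0.25, 0.35, 0.2])   # MCFs of C_A: X_a, X_b, X_c, X_d
mcf_B = np.array([0.4, 0.55, 0.05])        # MCFs of C_B: Y_a, Y_b, Y_c

# Independence assumption: P(C_A = X_i, C_B = Y_j) = P(C_A = X_i) * P(C_B = Y_j)
P_joint = np.outer(mcf_A, mcf_B)
print(np.round(P_joint, 2))
```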
The resulting joint probability distribution for all combinations is as follows:
| X\Y | $Y_{a}$ | $Y_{b}$ | $Y_{c}$ |
|---|---|---|---|
| $X_{a}$ | 0.08 | 0.11 | 0.01 |
| $X_{b}$ | 0.10 | 0.14 | 0.01 |
| $X_{c}$ | 0.14 | 0.19 | 0.02 |
| $X_{d}$ | 0.08 | 0.11 | 0.01 |
Now, suppose that the actual measured joint probability $ P(X_{a}, Y_{a}) $ is found to be $0.2$.
We must now adjust the entire joint probability distribution while preserving the marginal distributions $P(X)$ and $P(Y)$.
Several methods exist, but here we employ Iterative Proportional Fitting (IPF), which is relatively simple to implement and minimizes the Kullback-Leibler (KL) divergence.
The source code and execution result are shown below, followed by the mathematical proof.
1.1. Source code
The source code is shown below: mcv-ipf.py
```python
import numpy as np


def print_mat(P_mat):
    age = ["X_a", "X_b", "X_c", "X_d"]
    # Print column headers
    print("\tY_a\tY_b\tY_c")
    i = 0
    for row in P_mat:
        # Print row header
        print("{}".format(age[i]), end="")
        for element in row:
            # Print elements formatted to two decimal places
            print(f"\t{element:.2f}", end=" ")
        print()
        i += 1


def iterative_proportional_fitting_fixed_cell(P_A, P_B, fixed_coords, fixed_value,
                                              max_iterations=100, tolerance=1e-6):
    m = len(P_A)
    n = len(P_B)
    P_A_target = P_A.reshape(-1, 1)
    P_B_target = P_B.reshape(1, -1)

    # 1. Set the fixed value in the initial distribution P^{(0)} and fill the rest based on independence.
    P_initial = np.outer(P_A, P_B)
    P_initial[fixed_coords] = fixed_value

    # 2. Adjustment for the remaining cells:
    #    subtract the fixed cell P(X_a, Y_a) value from the marginals.
    P_A_adj = P_A.copy()
    P_B_adj = P_B.copy()
    P_A_adj[fixed_coords[0]] -= fixed_value
    P_B_adj[fixed_coords[1]] -= fixed_value

    # 3. Check that the remaining cells are non-negative (axiom of probability)
    if P_A_adj[fixed_coords[0]] < 0 or P_B_adj[fixed_coords[1]] < 0:
        raise ValueError("The fixed value exceeds the marginal probability. IPF cannot be applied.")

    # 4. Matrix to mask the location of the fixed cell
    mask = np.ones((m, n))
    mask[fixed_coords] = 0

    # 5. Set the initial joint distribution P^{(0)} (fixed cell remains, normalize the rest using independence).
    #    Calculate the product of the remaining marginal sums and allocate proportionally to the remaining cells.
    P_rest_sum = np.sum(P_A_adj)
    # Calculate the remaining initial distribution (starting here as the outer product of adjusted marginals)
    P_current = np.outer(P_A_adj, P_B_adj)
    # Normalize so the remaining marginal sum is 1.0 (normalize to the sum of P_rest_sum)
    P_current /= P_current.sum()

    # 6. Add the fixed value to P_current to form the overall initial distribution
    P_current *= mask                      # Apply the mask
    P_current[fixed_coords] = fixed_value  # Set the fixed value
    # Final normalization of the initial distribution
    P_current /= P_current.sum()

    for i in range(max_iterations):
        P_prev = P_current.copy()

        # === Match Row Constraint (P_A) ===
        # Only adjust the part that does not include the fixed cell:
        # apply standard IPF and finally overwrite the fixed cell.
        row_sums = P_current.sum(axis=1).reshape(-1, 1)
        adjustment_factor_row = P_A_target / (row_sums + 1e-18)
        P_current *= adjustment_factor_row
        # Overwrite the fixed cell (to satisfy the constraint)
        P_current[fixed_coords] = fixed_value

        # === Match Column Constraint (P_B) ===
        col_sums = P_current.sum(axis=0).reshape(1, -1)
        adjustment_factor_col = P_B_target / (col_sums + 1e-18)
        P_current *= adjustment_factor_col
        # Overwrite the fixed cell (to satisfy the constraint)
        P_current[fixed_coords] = fixed_value

        # === Convergence Check ===
        if np.sum(np.abs(P_current - P_prev)) < tolerance:
            # print(f"--- Converged. Iterations: {i+1} ---")
            break
    else:
        print(f"--- Maximum iterations ({max_iterations}) reached ---")

    return P_current


if __name__ == "__main__":
    # --- Example Usage ---
    # Use the data from the example above and incorporate the constraint P(X_a, Y_a) = 0.2.
    # P_A: [X_a, X_b, X_c, X_d] (4 items)
    P_A_example = np.array([0.2, 0.25, 0.35, 0.2])
    # P_B: [Y_a, Y_b, Y_c] (3 items)
    P_B_example = np.array([0.4, 0.55, 0.05])

    print("==== Initial Conditions ====")
    print("P_A: [X_a, X_b, X_c, X_d]")
    print(P_A_example)
    print("P_B: [Y_a, Y_b, Y_c]")
    print(P_B_example)
    print("")

    P_current = np.outer(P_A_example, P_B_example)
    # Row marginal distribution (constraint) shaped as (m, 1)
    P_A_target = P_A_example.reshape(-1, 1)
    # Column marginal distribution (constraint)
    P_B_target = P_B_example.reshape(1, -1)

    print("==== Joint Distribution (Independence Assumption) ====")
    print_mat(P_current)
    print("")

    # Fix P(X_a, Y_a) = 0.2 and satisfy the marginal constraints
    fixed_coords = (0, 0)  # X_a (0th row), Y_a (0th column)
    fixed_value = 0.2
    P_fixed = iterative_proportional_fitting_fixed_cell(P_A_example, P_B_example,
                                                        fixed_coords, fixed_value)

    print("\n==== Final Result (P(X_a, Y_a)=0.2 Fixed, KL Minimization) ====")
    print_mat(P_fixed)
    print("\nRow Marginal Sums (compared to P_A_example):", P_fixed.sum(axis=1))
    print("Column Marginal Sums (compared to P_B_example):", P_fixed.sum(axis=0))
```
1.2. Execution Result
The execution result of mcv-ipf.py is shown below:
1.3. Mathematical Proof
The IPF algorithm adjusts the joint probability distribution so as to minimize the Kullback-Leibler (KL) divergence without ever computing it directly.
This is because IPF can be viewed as a form of the Expectation-Maximization (EM) algorithm, or equivalently as coordinate descent.
Each step of the IPF procedure provably decreases the divergence.
1.3.1. The Objective Function to be Minimized
The objective of IPF is to find an adjusted joint probability distribution $P$ that minimizes the KL divergence $D_{KL}(P \parallel P^{(0)})$
from the initial distribution $P^{(0)}$, while satisfying the marginal probability constraints $P_A$ and $P_B$:
$$
\min_{P} D_{KL}(P \parallel P^{(0)}) = \min_{P} \sum_{i, j} P_{i, j} \log \frac{P_{i, j}}{P^{(0)}_{i, j}}
\quad \text{subject to} \quad \sum_{j} P_{i, j} = P_A(i), \quad \sum_{i} P_{i, j} = P_B(j)
$$
Step 1: Projection (Adjustment) to the Row Marginal $P_{A}$
First, a row-specific scaling factor $\lambda_i$ is applied to the current distribution $P^{(k)}$ to make it conform to the row marginal constraint:
$$
P^{(k + 1/2)}_{i, j} = \lambda_i \, P^{(k)}_{i, j}, \qquad \lambda_i = \frac{P_A(i)}{\sum_{j'} P^{(k)}_{i, j'}}
$$
Step 2: Projection (Adjustment) to the Column Marginal $P_{B}$
Next, a column-specific scaling factor $\mu_j$ is applied to the adjusted distribution $P^{(k + 1/2)}$ to make it conform to the column marginal constraint:
$$
P^{(k + 1)}_{i, j} = \mu_j \, P^{(k + 1/2)}_{i, j}, \qquad \mu_j = \frac{P_B(j)}{\sum_{i'} P^{(k + 1/2)}_{i', j}}
$$
Each projection yields the KL-closest distribution to its predecessor that satisfies the corresponding marginal constraint; alternating the two projections therefore converges to the distribution that minimizes $D_{KL}(P \parallel P^{(0)})$ subject to both constraints.
2. Histograms
When using histograms, two key differences arise compared to Most Common Values (MCVs):
Range Query Support: Query constraints must accommodate not only equi-joins but also range queries. Iterative Proportional Fitting (IPF) is primarily designed for equality constraints (e.g., summing marginal probabilities), making it challenging to directly and efficiently incorporate area constraints involving inequalities (range queries).
High Dimensionality: The number of histogram bins (typically 100 by default) is large compared to MCVs. This necessitates reducing the data storage space required for the full joint probability distribution.
To simultaneously address both challenges, we adopt a Neural Network (NN) approach, utilizing function approximation to fine-tune the joint probability distribution.
We implement a simple machine-learning approach that minimizes the Mean Squared Error (MSE). Specifically, it estimates a joint probability distribution $P_{\text{NN}}$ that satisfies two predefined statistical constraints: the marginal distributions ($H_{A}, H_{B}$) and a specific area sum ($C_{\text{true}}$).
The core idea is to treat the estimation as a constrained optimization problem, where the NN parameters $\mathbf{\Theta}$ are adjusted to minimize a total loss function, $L_{\text{total}}$.
2.1. Mathematical Explanation
2.1.1. Objective Function
The total loss function $L_{\text{total}}$ is defined as the weighted sum of two components: the area loss ($L_{\text{area}}$) and the marginal loss ($L_{\text{margin}}$).
$$
L_{\text{total}} = \lambda_{\text{area}} L_{\text{area}} + \lambda_{\text{margin}} L_{\text{margin}}
$$
Here, $\lambda_{\text{area}}$ and $\lambda_{\text{margin}}$ are hyperparameters (weights) used to adjust the importance of each constraint.
Area Loss ($L_{\text{area}}$):
This term penalizes the difference between the sum of probabilities estimated by the NN over a specific range $\mathcal{R}$ and the true target probability $C_{\text{true}}$. The range $\mathcal{R}$ corresponds to the query range that needs correction based on the observed true row count ($C_{\text{true}}$).
$$
L_{\text{area}} = \left( \sum_{\mathbf{x} \in \mathcal{R}} P_{\text{NN}}(\mathbf{x}; \mathbf{\Theta}) - C_{\text{true}} \right)^2
$$
Here, $\mathbf{x}$ represents the histogram bin coordinates $(i, j)$, and $C_{\text{true}}$ is the normalized target probability, i.e., the observed actual row count divided by the total number of rows $N$.
Marginal Loss ($L_{\text{margin}}$):
This term ensures that the marginal distributions derived from the NN output match the predefined marginal histograms $H_{A}$ and $H_{B}$.
$$
L_{\text{margin}} = \sum_{i} \left( \sum_{j} P_{\text{NN}}(i, j; \mathbf{\Theta}) - H_{A}(i) \right)^2 + \sum_{j} \left( \sum_{i} P_{\text{NN}}(i, j; \mathbf{\Theta}) - H_{B}(j) \right)^2
$$
Here, $i$ and $j$ represent the bin indices for the $A$ and $B$ axes, respectively.
2.1.2. Training Procedure
The Neural Network (NN), such as the simple multi-layer perceptron implemented in the code, is trained using the following steps:
Step 1: Input and Output:
The model (ProbabilityModel) takes the bin coordinates $(i, j)$ of the 2D histogram grid as input, encoded as a concatenated one-hot vector of dimension $NUM_{A} + NUM_{B}$.
The network’s output is a single logit value for that specific coordinate.
Step 2: Probability Normalization:
The network first calculates logit values for all bins.
The joint probability $P_{\text{NN}}$ is then computed by applying the softmax function across the logits of all bins.
This step is crucial as it guarantees that the sum of probabilities for all bins is exactly 1.
$$ P_{NN} (i, j; \mathbf{\Theta}) = \frac{\exp(\text{logits}_{i, j})} { \sum_{k, l} \exp(\text{logits}_{k, l}) } $$
Step 3: Optimization:
The loss functions $L_{\text{area}}$ and $L_{\text{margin}}$ are calculated, and the weighted total loss $L_{\text{total}}$ is obtained using the hyperparameters $\lambda_{\text{area}}$ and $\lambda_{\text{margin}}$.
The parameters $\mathbf{\Theta}$ are iteratively updated using a gradient descent approach with the Adam optimizer (optim.Adam) to minimize $L_{\text{total}}$.
Step 4: Fine-tuning Goal:
The process is essentially a fine-tuning of the NN. The network is adjusted from an initial state to incorporate the new, more accurate selectivity information $C_{\text{true}}$ into the area $\mathcal{R}$.
This adjustment is performed while simultaneously minimizing the marginal loss to ensure the generated distribution $P_{\text{NN}}$ remains consistent with the pre-defined marginal histograms $H_{A}$ and $H_{B}$. The weights ($\lambda_{\text{area}}$, $\lambda_{\text{margin}}$) are used to balance these two competing objectives: fitting the observed data and maintaining statistical consistency.
2.2. Source code
The source code is shown below: hist-nn.py
```python
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np


# NN Model Definition
class ProbabilityModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(NUM_A + NUM_B, 128),
            nn.ReLU(),
            nn.Linear(128, 1),  # Output size is 1 (a single logit per bin)
        )

    def forward(self, coords):
        # Simulate one-hot encoding
        coords_int = coords.long()
        features = []
        for i in range(coords.size(0)):
            one_hot_A = torch.zeros(NUM_A, device=coords.device)
            one_hot_A[coords_int[i, 0]] = 1.0
            one_hot_B = torch.zeros(NUM_B, device=coords.device)
            one_hot_B[coords_int[i, 1]] = 1.0
            features.append(torch.cat([one_hot_A, one_hot_B]))
        features_tensor = torch.stack(features)
        logits = self.fc(features_tensor)
        return logits.squeeze(1)


def display_joint_probability(model, coords_tensor, num_a, num_b,
                              title="Joint Probability Distribution P(A, B)"):
    """
    Retrieves and displays the joint probability distribution from the trained NN model.
    """
    print(f"\n--- {title} ---")
    with torch.no_grad():
        # 1. Get logits for all bins from NN (NN forward pass)
        individual_logits = model(coords_tensor)
        # 2. Calculate joint probability distribution P_NN (shape [80])
        P_NN_flat = torch.softmax(individual_logits, dim=0)
        # 3. Reshape into a 2D matrix [NUM_A, NUM_B]
        P_NN_matrix = P_NN_flat.view(num_a, num_b).cpu().numpy()

    # Header (indices of H_B)
    header = "A\\B | " + " | ".join([f"B[{j}]" for j in range(num_b)])
    print("-" * (len(header) + 4))
    print(header)
    print("-" * (len(header) + 4))
    for i in range(num_a):
        row_str = f"A[{i}] | "
        for j in range(num_b):
            prob = P_NN_matrix[i, j]
            row_str += f"{prob:.3f} | "
        print(row_str.strip())
    print("-" * (len(header) + 4))

    # Verification: check whether the area sum is close to the target
    area_sum_check = P_NN_matrix[:N_CONSTRAINT, :M_CONSTRAINT].sum()
    P_A_check = P_NN_matrix.sum(axis=1)
    P_B_check = P_NN_matrix.sum(axis=0)
    # Calculate marginal error
    L_Margin = (nn.MSELoss()(torch.tensor(P_A_check), H_A).item()
                + nn.MSELoss()(torch.tensor(P_B_check), H_B).item())
    print(f"Verification: Sum of area P(A[0:{N_CONSTRAINT-1}], B[0:{M_CONSTRAINT-1}]): {area_sum_check:.6f} (Target: {C_TRUE:.6f})")
    print(f"Verification: Marginal Error (MSE): {L_Margin:.8f}")


# Loss Function and Fine-tuning
def fine_tune_constrained(model, optimizer, c_true, area_indices, H_A_target, H_B_target,
                          lambda_area=10.0, lambda_margin=1.0, num_steps=500):
    criterion_mse = nn.MSELoss()
    for step in range(num_steps):
        optimizer.zero_grad()
        individual_logits = model(all_coords_tensor)
        P_NN_flat = torch.softmax(individual_logits, dim=0)

        # Loss 1: Area Sum Constraint (L_area)
        c_nn = P_NN_flat[area_indices].sum()
        c_true_tensor = torch.tensor(c_true).float().to(c_nn.device)
        loss_area = criterion_mse(c_nn, c_true_tensor)

        # Loss 2: Marginal Constraints (L_margin)
        P_NN_matrix = P_NN_flat.view(NUM_A, NUM_B)
        P_A_NN = P_NN_matrix.sum(dim=1)
        P_B_NN = P_NN_matrix.sum(dim=0)
        loss_margin_A = criterion_mse(P_A_NN, H_A_target)
        loss_margin_B = criterion_mse(P_B_NN, H_B_target)
        loss_margin = loss_margin_A + loss_margin_B

        # Total Loss
        loss_total = lambda_area * loss_area + lambda_margin * loss_margin
        loss_total.backward()
        optimizer.step()

        if (step + 1) % 50 == 0:
            print(f"Step {step+1}: L_Total={loss_total.item():.6f}, L_Area={loss_area.item():.6f}, "
                  f"L_Margin={loss_margin.item():.6f}, C_NN={c_nn.item():.6f}")


if __name__ == "__main__":
    # --- 1. Initial Conditions and Setup ---
    # Distribution histograms (fixed as marginal constraints)
    H_A = torch.tensor(
        [0.085, 0.112, 0.063, 0.141, 0.039, 0.187, 0.095, 0.051, 0.155, 0.072],
        dtype=torch.float32,
    )
    H_B = torch.tensor(
        [0.021, 0.058, 0.150, 0.245, 0.280, 0.148, 0.065, 0.033],
        dtype=torch.float32,
    )
    NUM_A = H_A.size(0)          # 10
    NUM_B = H_B.size(0)          # 8
    TOTAL_BINS = NUM_A * NUM_B   # 80

    # Target value for the area constraint (fitting target)
    C_TRUE = 0.3
    # Area constraint definition: H_A[0:n] and H_B[0:m] (assume n=5, m=4)
    N_CONSTRAINT = 5
    M_CONSTRAINT = 4

    # --- Generate coordinates (i, j) for all bins (global scope) ---
    all_coords = []
    for i in range(NUM_A):
        for j in range(NUM_B):
            all_coords.append([i, j])
    all_coords_tensor = torch.tensor(all_coords).float()  # Shape [80, 2]

    # --- Identify flat indices of bins included in the area constraint ---
    AREA_INDICES = []
    for i in range(N_CONSTRAINT):
        for j in range(M_CONSTRAINT):
            AREA_INDICES.append(i * NUM_B + j)
    AREA_INDICES = torch.tensor(AREA_INDICES)  # Shape [20]

    # --- Execution ---
    model = ProbabilityModel()
    optimizer = optim.Adam(model.parameters(), lr=5e-4)

    # Display before fine-tuning
    display_joint_probability(
        model, all_coords_tensor, NUM_A, NUM_B,
        title="Before Fine-tuning (Initial Random State)",
    )

    print("\n--- Starting Fine-tuning ---")
    fine_tune_constrained(model, optimizer, C_TRUE, AREA_INDICES, H_A, H_B)
    print("--- Fine-tuning Complete ---")

    # Display after fine-tuning
    display_joint_probability(
        model, all_coords_tensor, NUM_A, NUM_B,
        title="After Fine-tuning (Constraints Applied)",
    )
```
2.3. Execution Result
The execution result of hist-nn.py is shown below:
An alternative approach combines Neural Network (NN) estimation with the Iterative Proportional Fitting (IPF) procedure.
This hybrid pairs the expressiveness of the NN with the rigorous constraint enforcement of IPF: IPF provides global consistency, while the NN performs local refinement.
Step 1: Generate Target Distribution ($P^{\ast}$) using IPF
We first compute an ideal target distribution $P^{\ast}$: the distribution closest to the initial distribution $P_{\text{initial}}$ (which in the implementation is the initial NN output) that satisfies all constraints. $P^{\ast}$ is defined by minimizing the Kullback-Leibler (KL) divergence:
$$P^{\ast} = \arg \min_{P} D_{KL}(P \parallel P_{\text{initial}})$$
subject to the marginal constraints and the area constraint:
$$\sum_{j} P_{i,j}^{\ast} = H_{A,i}, \quad \sum_{i} P_{i,j}^{\ast} = H_{B,j}, \quad \sum_{(i,j) \in \mathcal{R}} P_{i,j}^{\ast} = C_{\text{true}}$$
where $H_{A}$ and $H_{B}$ are the marginal histograms, and $\mathcal{R}$ is the constrained area corresponding to the corrected true count $C_{\text{true}}$. The function calculate_p_star_with_constraints in the source code executes this constrained IPF process.
Step 2: Fine-tune NN to Match $P^{\ast}$
The neural network is subsequently trained to approximate the static target distribution $P^{\ast}$ calculated in Step 1.
This process is commonly known as knowledge distillation.
The loss function $L_{KL}$ minimizes the Kullback-Leibler (KL) divergence $D_{KL}(P^{\ast} \parallel P_{\text{NN}})$.
This is the standard method used when training a student model ($P_{\text{NN}}$) to match a fixed teacher distribution ($P^{\ast}$):
$$L_{KL}(\mathbf{\Theta}) = D_{KL}(P^{\ast} \parallel P_{\text{NN}}) = \sum_{i, j} P_{i, j}^{\ast} \log \frac{P_{i, j}^{\ast}}{P_{\text{NN}, i, j}(\mathbf{\Theta})}$$
This formulation, which is implemented using nn.KLDivLoss(log_probs_nn, P_star), encourages the neural network output $P_{\text{NN}}$ to closely match the IPF-corrected target distribution $P^{\ast}$.
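As a reminder of the PyTorch semantics, here is a minimal, self-contained sketch with dummy tensors (hist-hybrid.py below does the same inside its training loop): nn.KLDivLoss expects the student's log-probabilities as the input and the teacher's probabilities as the target.

```python
import torch
import torch.nn as nn

criterion_kl = nn.KLDivLoss(reduction="batchmean", log_target=False)

# Dummy teacher distribution P* and student logits over 80 bins.
P_star = torch.softmax(torch.randn(1, 80), dim=1)            # teacher probabilities
log_probs_nn = torch.log_softmax(torch.randn(1, 80), dim=1)  # student log-probabilities

# KLDivLoss(input=log Q, target=P) computes D_KL(P || Q), i.e. D_KL(P* || P_NN).
loss_kl = criterion_kl(log_probs_nn, P_star)
print(loss_kl.item())
```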
Implementation Note:
In the actual implementation, the IPF-corrected distribution $P^{\ast}$ is computed once and used as a static target during the neural network’s training (static distillation).
Although this is a simplification, it provides a stable and practical approximation compared to fully iterative IPF-NN updates.
The implementation details are provided in the source code below: hist-hybrid.py
```python
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import copy


# NN Model Definition
class ProbabilityModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(NUM_A + NUM_B, 128),
            nn.ReLU(),
            nn.Linear(128, 1),  # Output size is 1 (a single logit per bin)
        )

    def forward(self, coords):
        # Simulate one-hot encoding
        coords_int = coords.long()
        features = []
        for i in range(coords.size(0)):
            one_hot_A = torch.zeros(NUM_A, device=coords.device)
            one_hot_A[coords_int[i, 0]] = 1.0
            one_hot_B = torch.zeros(NUM_B, device=coords.device)
            one_hot_B[coords_int[i, 1]] = 1.0
            features.append(torch.cat([one_hot_A, one_hot_B]))
        features_tensor = torch.stack(features)
        logits = self.fc(features_tensor)
        return logits.squeeze(1)


def display_joint_probability(model, coords_tensor, num_a, num_b, H_A_target, H_B_target,
                              C_true, area_indices,
                              title="Joint Probability Distribution P(A, B)"):
    """
    Retrieves the joint probability distribution from the trained NN model and displays it using nested loops.
    """
    print(f"\n--- {title} ---")
    with torch.no_grad():
        # 1. Get logits for all bins from NN (NN forward pass)
        individual_logits = model(coords_tensor)
        # 2. Calculate joint probability distribution P_NN (shape [80])
        P_NN_flat = torch.softmax(individual_logits, dim=0)
        # 3. Reshape into a 2D matrix [NUM_A, NUM_B]
        P_NN_matrix = P_NN_flat.view(num_a, num_b).cpu().numpy()

    # Header (indices of H_B)
    header = "A\\B | " + " | ".join([f"B[{j}]" for j in range(num_b)])
    print("-" * (len(header) + 4))
    print(header)
    print("-" * (len(header) + 4))
    for i in range(num_a):
        row_str = f"A[{i}] | "
        for j in range(num_b):
            prob = P_NN_matrix[i, j]
            row_str += f"{prob:.3f} | "
        print(row_str.strip())
    print("-" * (len(header) + 4))

    # Verification
    area_sum_check = P_NN_matrix[:N_CONSTRAINT, :M_CONSTRAINT].sum()
    P_A_check = P_NN_matrix.sum(axis=1)
    P_B_check = P_NN_matrix.sum(axis=0)
    L_Margin = (nn.MSELoss()(torch.tensor(P_A_check), H_A_target).item()
                + nn.MSELoss()(torch.tensor(P_B_check), H_B_target).item())
    print(f"Verification: Sum of area P(A[0:{N_CONSTRAINT-1}], B[0:{M_CONSTRAINT-1}]): {area_sum_check:.6f} (Target: {C_true:.6f})")
    print(f"Verification: Marginal Error (MSE): {L_Margin:.8f}")


# Function to Calculate P* (Constrained IPF Principle)
def calculate_p_star_with_constraints(P_initial, H_A_target, H_B_target, C_true, area_indices,
                                      max_iter=100, tolerance=1e-6):
    P_current = P_initial.view(NUM_A, NUM_B).clone()

    # Convert marginal sums to NumPy arrays
    H_A_np = H_A_target.cpu().numpy()
    H_B_np = H_B_target.cpu().numpy()

    # Area indices (2D)
    area_rows, area_cols = np.unravel_index(area_indices.cpu().numpy(), (NUM_A, NUM_B))

    for i in range(max_iter):
        P_prev = P_current.clone()

        # === Step 1: Adjust Area Constraint ===
        P_np = P_current.cpu().numpy()
        current_area_sum = P_np[area_rows, area_cols].sum()
        if abs(current_area_sum - C_true) > 1e-18:
            # Calculate ratios for the area and the rest
            ratio_area = C_true / (current_area_sum + 1e-18)
            P_rest_sum = 1.0 - current_area_sum
            P_rest_target = 1.0 - C_true
            ratio_rest = P_rest_target / (P_rest_sum + 1e-18)

            P_temp_np = P_np.copy()
            # Adjust inside the constrained area
            P_temp_np[area_rows, area_cols] *= ratio_area
            # Adjust outside the constrained area
            mask_rest = np.ones_like(P_np, dtype=bool)
            mask_rest[area_rows, area_cols] = False
            P_temp_np[mask_rest] *= ratio_rest

            P_current = torch.tensor(P_temp_np).float()
            P_current /= P_current.sum()  # Ensure normalization after adjustment

        # === Step 2: Adjust Marginal Constraints (Standard IPF Steps) ===
        # Adjustment for A (row) direction
        row_sums = P_current.sum(dim=1).unsqueeze(1)
        adjustment_factor_row = H_A_target.unsqueeze(1) / (row_sums + 1e-18)
        P_current *= adjustment_factor_row

        # Adjustment for B (column) direction
        col_sums = P_current.sum(dim=0).unsqueeze(0)
        adjustment_factor_col = H_B_target.unsqueeze(0) / (col_sums + 1e-18)
        P_current *= adjustment_factor_col

        # === Convergence Check ===
        if torch.sum(torch.abs(P_current - P_prev)) < tolerance:
            break

    return P_current.view(1, TOTAL_BINS)  # Shape [1, 80]


# Learning Step Using KL Divergence as Teacher
def fine_tune_to_p_star(model, optimizer, P_star, num_steps=500):
    # KL Divergence Loss: minimize D_KL(P* || P_NN)
    criterion_kl = nn.KLDivLoss(reduction="batchmean", log_target=False)
    for step in range(num_steps):
        optimizer.zero_grad()
        individual_logits = model(all_coords_tensor)
        log_probs_nn = torch.log_softmax(individual_logits, dim=0).unsqueeze(0)
        # Calculate loss: loss_kl = D_KL(P* || P_NN)
        loss_kl = criterion_kl(log_probs_nn, P_star)
        loss_total = loss_kl
        loss_total.backward()
        optimizer.step()
        if (step + 1) % 50 == 0:
            print(f"Step {step+1}: KL_Loss = {loss_kl.item():.6f}")


if __name__ == "__main__":
    # --- 1. Initial Conditions and Setup ---
    # Distribution histograms (fixed as marginal constraints)
    H_A = torch.tensor(
        [0.085, 0.112, 0.063, 0.141, 0.039, 0.187, 0.095, 0.051, 0.155, 0.072],
        dtype=torch.float32,
    )
    H_B = torch.tensor(
        [0.021, 0.058, 0.150, 0.245, 0.280, 0.148, 0.065, 0.033],
        dtype=torch.float32,
    )
    NUM_A = H_A.size(0)          # 10
    NUM_B = H_B.size(0)          # 8
    TOTAL_BINS = NUM_A * NUM_B   # 80

    # Target value for the area constraint (fitting target)
    C_TRUE = 0.3
    # Area constraint definition: H_A[0:n] and H_B[0:m] (assume n=5, m=4)
    N_CONSTRAINT = 5
    M_CONSTRAINT = 4

    # --- Generate coordinates (i, j) for all bins (global scope) ---
    all_coords = []
    for i in range(NUM_A):
        for j in range(NUM_B):
            all_coords.append([i, j])
    all_coords_tensor = torch.tensor(all_coords).float()  # Shape [80, 2]

    # --- Identify flat indices of bins included in the area constraint ---
    AREA_INDICES = []
    for i in range(N_CONSTRAINT):
        for j in range(M_CONSTRAINT):
            AREA_INDICES.append(i * NUM_B + j)
    AREA_INDICES = torch.tensor(AREA_INDICES)  # Shape [20]

    # --- Execution ---
    model_hybrid = ProbabilityModel()
    optimizer_hybrid = optim.Adam(model_hybrid.parameters(), lr=1e-3)

    # ----------------------------------------------------
    # Display Before Fine-tuning
    # ----------------------------------------------------
    display_joint_probability(
        model_hybrid, all_coords_tensor, NUM_A, NUM_B, H_A, H_B, C_TRUE, AREA_INDICES,
        title="Before Fine-tuning (Initial Random State)",
    )

    print("\n--- Calculating P* Target Distribution (Applying IPF Principle) ---")
    # 1. Get the initial output of the NN (P_initial)
    with torch.no_grad():
        initial_logits = model_hybrid(all_coords_tensor)
        P_initial = torch.softmax(initial_logits, dim=0)

    # 2. Modify P_initial and calculate the target distribution P*
    P_star = calculate_p_star_with_constraints(P_initial, H_A, H_B, C_TRUE, AREA_INDICES)

    # Check P* constraint achievement
    P_star_matrix_check = P_star.squeeze().view(NUM_A, NUM_B)
    P_star_area_sum_check = P_star_matrix_check[:N_CONSTRAINT, :M_CONSTRAINT].sum().item()
    print(f"P* Target Area Sum: {P_star_area_sum_check:.6f} (Target: {C_TRUE:.6f})")

    # 3. Fine-tune the NN to approach P*
    print("\n--- Starting NN Fine-tuning (KL Minimization) ---")
    fine_tune_to_p_star(model_hybrid, optimizer_hybrid, P_star)
    print("--- NN Fine-tuning Complete ---")

    # ----------------------------------------------------
    # Display After Fine-tuning
    # ----------------------------------------------------
    display_joint_probability(
        model_hybrid, all_coords_tensor, NUM_A, NUM_B, H_A, H_B, C_TRUE, AREA_INDICES,
        title="After Fine-tuning (Approximation to P* Target)",
    )
```
3. Discussions
The proposed method allows the joint probability distribution to be updated based on observed data.
However, there are still several hurdles to overcome for practical application in a real-world Relational Database (RDB).
We will discuss these challenges from two perspectives: Computational Complexity and Memory Footprint Reduction.
3.1. Reducing Computational Complexity: Cumulative Sum or Summed-Area Table
When a probability distribution stored as a histogram is used for Selectivity Estimation, the values often need to be summed up based on a given condition.
For example, to get the value for the range $h[3:7]$ in a 1D histogram $h[0:19]$, you must sum the values from $h[3]$ to $h[7]$, i.e., $ \sum_{i=3}^{7} h[i]$.
The computational cost for this summation is $O(n)$, where $n$ is the length of the histogram $h$.
While this cost is acceptable in the 1D case (PostgreSQL in fact uses this method),
it becomes non-negligible for a 2D histogram, where the cost grows to $O(n^{2})$.
Fortunately, there is a method to address this situation: the Cumulative Sum (or Summed-Area Table in 2D).
Simply put, instead of sequentially adding the values in the desired range, we use a new histogram H (the cumulative sum histogram) where each element is the accumulated sum of the original histogram $h$.
By finding the difference between the upper and lower bounds of the region, the calculation can always be performed with an $O(1)$ computational cost.
The Cumulative Sum $H$ is defined as follows (assuming $H[-1] = 0$):
$$
H[i] = H[i-1] + h[i] = \sum_{k=0}^{i} h[k]
$$
Using the original histogram $h$, you must sum the values from $h[3]$ to $h[7]$:
$$
S = \sum_{i=3}^{7} h[i] = h[3] + h[4] + h[5] + h[6] + h[7]
$$
Conversely, using the Cumulative Sum Histogram $H$, only the following subtraction is required:
$$
S = H[7] - H[2]
$$
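In NumPy, the 1D case is a one-liner. A minimal sketch with an arbitrary 20-bin histogram:

```python
import numpy as np

h = np.arange(20)     # an arbitrary 20-bin histogram h[0:19]
H = np.cumsum(h)      # H[i] = h[0] + h[1] + ... + h[i]

assert H[7] - H[2] == h[3:8].sum()   # O(1) range sum for h[3:7]
```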
The same principle applies in the 2D case.
If $h[i,j]$ is the original 2D histogram and $H[i,j]$ is the Cumulative Sum Table (or Summed-Area Table),
$H[i,j]$ is defined as the sum of all bin probabilities (or frequencies) contained in the rectangular region from the origin $(0,0)$ up to coordinates $(i,j)$.
3.1.1. Creating the Cumulative Sum Table $\mathbf{H}$: $\text{O}(MN)$
For an original histogram $h$ of dimension $M \times N$,
$\mathbf{H}$ can be created efficiently with a computational complexity of $O(MN)$ using the following recurrence relation (terms with a negative index are treated as $0$):
$$
\mathbf{H}[i, j] = h[i, j] + \mathbf{H}[i-1, j] + \mathbf{H}[i, j-1] - \mathbf{H}[i-1, j-1]
$$
$\mathbf{H}[i-1, j]$: The cumulative sum from the area above.
$\mathbf{H}[i, j-1]$: The cumulative sum from the area to the left.
$\mathbf{H}[i-1, j-1]$: This is the top-left corner area which was doubly added in the previous two terms, so it must be subtracted once.
3.1.2. Calculating Region Probability: $\text{O}(1)$
To find the probability $P$ of an arbitrary rectangular region $[a_{1}, a_{2}] \times [b_{1}, b_{2}]$ (with top-left corner $(a_{1}, b_{1})$ and bottom-right corner $(a_{2}, b_{2})$),
we can calculate it in $O(1)$ time by combining the following four cumulative sum values, as shown after the list:
$\mathbf{H}[a_2, b_2]$: The total area ($(0, 0)$ to $(a_2, b_2)$)
$\mathbf{H}[a_1-1, b_2]$: Subtract the unnecessary region on the left side.
$\mathbf{H}[a_2, b_1-1]$: Subtract the unnecessary region on the top side.
$\mathbf{H}[a_1-1, b_1-1]$: The top-left corner was doubly subtracted in steps 2 and 3, so we add it back once.
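Combining the four terms gives the standard summed-area-table identity:
$$
P = \mathbf{H}[a_2, b_2] - \mathbf{H}[a_1-1, b_2] - \mathbf{H}[a_2, b_1-1] + \mathbf{H}[a_1-1, b_1-1]
$$
A minimal NumPy sketch (array names are illustrative):

```python
import numpy as np

# Toy 2D histogram h (M x N) holding bin probabilities.
rng = np.random.default_rng(0)
h = rng.random((10, 8))
h /= h.sum()

# Build the summed-area table H in O(MN): cumulative sums along both axes.
H = h.cumsum(axis=0).cumsum(axis=1)

def region_prob(H, a1, a2, b1, b2):
    """Probability mass of the rectangle h[a1:a2+1, b1:b2+1] in O(1) via the summed-area table."""
    total = H[a2, b2]
    if a1 > 0:
        total -= H[a1 - 1, b2]
    if b1 > 0:
        total -= H[a2, b1 - 1]
    if a1 > 0 and b1 > 0:
        total += H[a1 - 1, b1 - 1]
    return total

# Cross-check against direct summation over the same rectangle.
assert np.isclose(region_prob(H, 2, 6, 1, 5), h[2:7, 1:6].sum())
```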
3.2. Reducing Memory Footprint: Count-Min Sketch
To reduce the memory required for the 2D histogram and allow for faster fine-tuning, I adopted an NN (Neural Network). However, directly handling an NN inside an RDB is computationally expensive.
Here, the Count-Min Sketch (CMS), a probabilistic data structure, can dramatically reduce the storage required for the 2D histogram.
The Count-Min Sketch is a probabilistic data structure used to approximately estimate the frequency (count) of specific elements in a stream or massive dataset.
It stores data in a fixed, small-sized matrix (the “sketch”).
By using multiple independent hash functions, the impact of hash collisions is minimized.
To estimate the count of any element, we adopt the minimum value among the multiple buckets that the element maps to.
The biggest advantage of this technique is that the estimation accuracy ($\epsilon$) and the acceptable failure probability ($\delta$) can be specified by the user,
which fixes the memory size at an extremely small value regardless of the data's cardinality (number of unique items) or the total count.
Therefore, the practical approach would be to adjust the probability distribution using the NN and then map that resulting distribution onto the CMS for operational use.
Note that the probability distribution produced by the NN typically uses a floating-point type (such as double), while the CMS uses an integer type. Therefore, when using the CMS, you must separately store the total sample count $N$ and divide the value retrieved from the CMS by $N$ to obtain the probability.
3.2.1. Memory Calculation
Two parameters are required to calculate the memory capacity needed for a Count-Min Sketch:
Permissible Error Range: $\epsilon$
Acceptable Failure Probability: $\delta$
The size ($d \times w$) of the CM Sketch matrix is determined by the following formulas:
Number of Rows $d$ (Number of Hash Functions):
$$d = \left\lceil \ln\left(\frac{1}{\delta}\right) \right\rceil $$
Number of Columns $w$ (Bucket Width):
$$w = \left\lceil \frac{e}{\epsilon} \right\rceil $$
Let's calculate the required memory using specific values. Assuming the histogram size is $100 \times 100$,
and setting both the permissible error $\epsilon$ and the acceptable failure probability $\delta$ to 0.01 (1%),
the number of rows $d$ and columns $w$ for the CMS are:
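Plugging these values into the formulas above (a minimal sketch; cms.py below performs the same calculation):

```python
import math

epsilon = 0.01  # permissible error
delta = 0.01    # acceptable failure probability

d = math.ceil(math.log(1 / delta))  # number of hash functions: ceil(ln 100)  = 5
w = math.ceil(math.e / epsilon)     # bucket width:             ceil(e / 0.01) = 272

print(d, w, d * w)                  # 5 272 1360 counters in total
```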
In this example, using the Count-Min Sketch reduces the memory footprint of the original histogram to approximately 1/74th of its original size.
3.2.3. Source code
The source code is shown below: cms.py
```python
import numpy as np
import math
import hashlib


class CountMinSketch:
    def __init__(self, d_rows, w_cols):
        # Initialize the sketch matrix with zeros
        self.sketch = np.zeros((d_rows, w_cols), dtype=np.int32)
        self.d = d_rows
        self.w = w_cols
        # Seeds to simulate different hash functions
        # (NumPy random seeds here; in practice, cryptographic hash functions are often used)
        self.seeds = np.random.randint(0, 1000000, d_rows)

    def _hash(self, i, j, seed):
        """
        Calculates a hash value within w_cols using 2D indices (i, j) and a seed.
        """
        # Combine indices and hash (simple concatenation and hashing); i and j are assumed to be integers
        combined_key = str(i) + "-" + str(j)
        # Append the seed to ensure diversity of hash values
        h = int(hashlib.sha256((combined_key + str(seed)).encode("utf-8")).hexdigest(), 16)
        # Map to the range of w_cols
        return h % self.w

    def add(self, i: int, j: int, k: int):
        """
        Adds value k to the array cell (i, j).

        Args:
            i (int): Index of the first dimension
            j (int): Index of the second dimension
            k (int): Value (frequency or weight) to add
        """
        if i >= N_DIM or j >= M_DIM:
            raise IndexError("Index out of bounds for the assumed array size.")
        for row in range(self.d):
            # Use a different hash function for each row
            col_index = self._hash(i, j, self.seeds[row])
            self.sketch[row, col_index] += k

    def get(self, i: int, j: int) -> int:
        """
        Returns the estimated value of the array cell (i, j).

        Args:
            i (int): Index of the first dimension
            j (int): Index of the second dimension

        Returns:
            int: Estimated value of cell (i, j)
        """
        if i >= N_DIM or j >= M_DIM:
            raise IndexError("Index out of bounds for the assumed array size.")
        min_value = np.inf
        for row in range(self.d):
            col_index = self._hash(i, j, self.seeds[row])
            min_value = min(min_value, self.sketch[row, col_index])
        return int(min_value)

    def show(self, max_dim=10):
        """
        Displays the estimated data for the array (only a subset is shown due to computation cost).

        Args:
            max_dim (int): Maximum dimension to display (e.g., shows the top-left max_dim x max_dim)
        """
        print("\n--- Estimated Joint Probability Distribution (CM Sketch) ---")
        print(f" (Displaying estimate for top-left {min(N_DIM, max_dim)}x{min(M_DIM, max_dim)})")
        # Array to store results
        estimated_array = np.zeros((min(N_DIM, max_dim), min(M_DIM, max_dim)), dtype=np.int32)
        # Estimate all elements in the nested loop and display
        for i in range(min(N_DIM, max_dim)):
            for j in range(min(M_DIM, max_dim)):
                estimated_array[i, j] = self.get(i, j)
        # Print the formatted NumPy array
        print(estimated_array)
        print("-" * 40)


if __name__ == "__main__":
    # --- 1. Initial Parameter Setup ---
    # Corresponds to user input: error epsilon and failure probability delta
    EPSILON = 0.01  # Tolerance for error (1%)
    DELTA = 0.01    # Tolerance for failure probability (1%)
    # Array dimensions (assuming 400x400, but this does not directly affect the CM Sketch size)
    N_DIM = 400
    M_DIM = 400

    # --- 2. CM Sketch Parameter Calculation ---
    # Number of rows d (number of hash functions)
    D_ROWS = math.ceil(math.log(1 / DELTA))
    # Number of columns w (width of buckets)
    W_COLS = math.ceil(math.e / EPSILON)

    # --- Example Execution ---
    # 1. Initialize the Sketch
    cm_sketch = CountMinSketch(D_ROWS, W_COLS)

    # 2. Add data (Add operation); i, j are in the range 0 to 399
    print("Adding data (Add Operation)...")
    cm_sketch.add(1, 1, 100)
    cm_sketch.add(1, 1, 50)     # Total 150
    cm_sketch.add(1, 2, 20)     # Total 20
    cm_sketch.add(200, 300, 5)  # Far index
    cm_sketch.add(2, 1, 100)    # Data that might cause a hash collision with (1, 1)

    # 3. Get the value of specific cells (Get operation)
    val_1_1 = cm_sketch.get(1, 1)
    val_1_2 = cm_sketch.get(1, 2)
    val_2_1 = cm_sketch.get(2, 1)
    val_200_300 = cm_sketch.get(200, 300)

    print(f"Estimated value (1, 1): {val_1_1} (True value: 150)")
    print(f"Estimated value (1, 2): {val_1_2} (True value: 20)")
    print(f"Estimated value (2, 1): {val_2_1} (True value: 100)")
    print(f"Estimated value (200, 300): {val_200_300} (True value: 5)")

    # 4. Display the entire array (Show operation)
    cm_sketch.show(max_dim=5)
```
As stated above, the CMS uses integers.
Therefore, when dealing with probability distributions, remember to separately store the total number of samples $N$ and divide the value retrieved from the CMS by $N$ to obtain the probability.
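For example, the following minimal sketch, appended to the end of the main block of cms.py above, converts the count stored for cell $(1, 1)$ back into a probability; the total count 275 is simply the sum of the values added in that example.

```python
    # Convert an integer count retrieved from the CMS back into a probability estimate.
    N_TOTAL = 100 + 50 + 20 + 5 + 100   # total count added in the cms.py example
    p_est = cm_sketch.get(1, 1) / N_TOTAL
    print(f"P(1, 1) ~= {p_est:.4f}")    # roughly 150 / 275
```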
4. Conclusion
The core takeaway is that I have demonstrated the feasibility of continuously improving selectivity estimation by acquiring actual row counts for every query.
If time permits, I plan to implement this functionality in pg_plan_inspector.
Remaining Work
Persistence of Corrections After ANALYZE
I deliberately avoided discussing the following issue until now: whenever the ANALYZE command refreshes the selectivity statistics,
all previous corrections are effectively invalidated.
Therefore, we must find a way to reuse those corrections or integrate the pre-ANALYZE distribution updates into the new statistics.
Assuming a sudden, drastic change in data distribution is a rare event, this approach seems practical.
Accuracy Verification of Probabilistic Data Structures
Neural Networks (NNs) and Count-Min Sketches (CMSs) both produce approximate, probabilistic estimates.
The use of CMSs, however, carries the additional risk of overestimation: hash collisions can only inflate the stored counts, and this error compounds with the NN's approximation error.
Consequently, the accuracy of selectivity estimation obtained by combining these two approaches requires rigorous qualitative and quantitative analysis.
Since the overarching goal of these three blog posts is to improve the accuracy of selectivity estimation,
if the final result cannot guarantee sufficient accuracy, then all the effort invested in this approach would be meaningless.
Personal Thoughts
My primary goal was to enable real-time query progress display by continuously obtaining actual row counts.
By reducing the overhead of EXPLAIN ANALYZE to 2%, that objective has been achieved.
However, the accuracy of the Query Progress display depends directly on the quality of the Selectivity Estimation.
Thus, although secondary to my main goal, improving the performance of Selectivity Estimation is still necessary.
I hope that providing researchers with a constant stream of actual row counts will inspire them to devise entirely new methods for Selectivity Estimation.