Mathematical Approaches for Selectivity Error Correction

In the previous posts, post1 and post2, I demonstrated that it is possible to continuously collect actual row counts from EXPLAIN ANALYZE with an overhead of about 2.0%.

The next step is to examine methods for correcting the selectivity (or cardinality) estimation results.

Below, I will explain the correction methods based on Most Common Values (MCVs) and Histograms.

1. MCVs

First, let’s consider adjusting the joint probability distribution based on MCVs for columns in two tables.

  • The MCVs for column $C_{A}$ in table $T_{A}$ are [$X_{a}, X_{b}, X_{c}, X_{d}$], with corresponding Most Common Frequencies (MCFs) of [0.2, 0.25, 0.35, 0.2].
  • The MCVs for column $C_{B}$ in table $T_{B}$ are [$Y_{a}, Y_{b}, Y_{c}$], with corresponding MCFs of [0.4, 0.55, 0.05].

In this case, the joint probability for $( C_{A} = X_{a} \wedge C_{B} = Y_{a} )$ is calculated in many RDBMSs, including PostgreSQL, by assuming that $C_{A}$ and $C_{B}$ are independent, which gives:

$$ P(C_{A} = X_{a}) \times P(C_{B} = Y_{a}) = 0.2 \times 0.4 = 0.08 $$

The resulting joint probability distribution for all combinations is as follows:

X\Y     Y_a     Y_b     Y_c
X_a     0.08    0.11    0.01
X_b     0.10    0.14    0.01
X_c     0.14    0.19    0.02
X_d     0.08    0.11    0.01
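
As a quick check, this table can be reproduced with NumPy's outer product (a minimal sketch; the values are rounded to two decimals as in the table):

import numpy as np

P_A = np.array([0.2, 0.25, 0.35, 0.2])   # MCFs of C_A
P_B = np.array([0.4, 0.55, 0.05])        # MCFs of C_B
# Joint distribution under the independence assumption
print(np.round(np.outer(P_A, P_B), 2))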

Now, suppose that the actual measured joint probability $ P(X_{a}, Y_{a}) $ is found to be $0.2$.

We must now adjust the entire joint probability distribution while preserving the marginal distributions $P(X)$ and $P(Y)$. Several methods exist, but here we employ Iterative Proportional Fitting (IPF), which is relatively simple to implement and minimizes the Kullback-Leibler (KL) divergence.

The source code and execution result are shown below, followed by the mathematical proof.

1.1. Source code

The source code is shown below: mcv-ipf.py

import numpy as np

def print_mat(P_mat):
    row_labels = ["X_a", "X_b", "X_c", "X_d"]
    # Print column headers
    print("\tY_a\tY_b\tY_c")
    for label, row in zip(row_labels, P_mat):
        # Print row header
        print("{}".format(label), end="")
        for element in row:
            # Print elements formatted to two decimal places
            print(f"\t{element:.2f}", end="  ")
        print()


def iterative_proportional_fitting_fixed_cell(
    P_A, P_B, fixed_coords, fixed_value, max_iterations=100, tolerance=1e-6
):
    m = len(P_A)
    n = len(P_B)
    P_A_target = P_A.reshape(-1, 1)
    P_B_target = P_B.reshape(1, -1)

    # 1. Subtract the fixed value P(X_a, Y_a) from the corresponding marginals
    #    to obtain the marginal mass left for the remaining cells.
    P_A_adj = P_A.copy()
    P_B_adj = P_B.copy()

    P_A_adj[fixed_coords[0]] -= fixed_value
    P_B_adj[fixed_coords[1]] -= fixed_value

    # 2. Check that the remaining marginals are non-negative (axiom of probability)
    if P_A_adj[fixed_coords[0]] < 0 or P_B_adj[fixed_coords[1]] < 0:
        raise ValueError(
            "The fixed value exceeds the marginal probability. IPF cannot be applied."
        )

    # 3. Matrix that masks the location of the fixed cell
    mask = np.ones((m, n))
    mask[fixed_coords] = 0

    # 4. Build the initial joint distribution P^{(0)}:
    #    the remaining cells start from the outer product of the adjusted marginals
    P_current = np.outer(P_A_adj, P_B_adj)
    P_current /= P_current.sum()  # normalize so the matrix sums to 1.0

    # 5. Zero out the fixed cell's position and insert the fixed value
    P_current *= mask
    P_current[fixed_coords] = fixed_value

    # 6. Final normalization of the initial distribution
    P_current /= P_current.sum()

    for i in range(max_iterations):
        P_prev = P_current.copy()

        # === Match Row Constraint (P_A) ===
        # Only adjust the part that does not include the fixed cell
        # Use the approach of applying standard IPF and finally overwriting the fixed cell

        # Calculate row sums
        row_sums = P_current.sum(axis=1).reshape(-1, 1)
        adjustment_factor_row = P_A_target / (row_sums + 1e-18)
        P_current *= adjustment_factor_row

        # Overwrite the fixed cell (to satisfy the constraint)
        P_current[fixed_coords] = fixed_value

        # === Match Column Constraint (P_B) ===
        col_sums = P_current.sum(axis=0).reshape(1, -1)
        adjustment_factor_col = P_B_target / (col_sums + 1e-18)
        P_current *= adjustment_factor_col

        # Overwrite the fixed cell (to satisfy the constraint)
        P_current[fixed_coords] = fixed_value

        # === Convergence Check ===
        if np.sum(np.abs(P_current - P_prev)) < tolerance:
            # print(f"--- Converged. Iterations: {i+1} ---")
            break
    else:
        print(f"--- Maximum iterations ({max_iterations}) reached ---")

    return P_current


if __name__ == "__main__":
    # --- Example Usage ---
    # Use the data from the example above and incorporate the constraint P(X_a, Y_a) = 0.2 while preserving the marginals

    # P_A: [X_a, X_b, X_c, X_d] (4 items)
    P_A_example = np.array([0.2, 0.25, 0.35, 0.2])
    # P_B: [Y_a, Y_b, Y_c] (3 items)
    P_B_example = np.array([0.4, 0.55, 0.05])

    print("==== Initial Conditions ====")
    print("P_A: [X_a, X_b, X_c, X_d]")
    print(P_A_example)
    print("P_B: [Y_a, Y_b, Y_c]")
    print(P_B_example)
    print("")

    P_current = np.outer(P_A_example, P_B_example)


    print("==== Joint Distribution (Independence Assumption) ====")

    print_mat(P_current)
    print("")

    # Example case: fix P(X_a, Y_a) = 0.2 and satisfy the marginal constraints
    fixed_coords = (0, 0)  # X_a (0th row), Y_a (0th column)
    fixed_value = 0.2

    P_fixed = iterative_proportional_fitting_fixed_cell(
        P_A_example, P_B_example, fixed_coords, fixed_value
    )

    print("\n==== Final Result (P(X_a, Y_a)=0.2 Fixed, KL Minimization) ====")
    print_mat(P_fixed)
    print("\nRow Marginal Sums (compared to P_A_example):", P_fixed.sum(axis=1))
    print("Column Marginal Sums (compared to P_B_example):", P_fixed.sum(axis=0))

1.2. Execution Result

The execution result of mcv-ipf.py is shown below:

$ python mcv-ipf.py 
==== Initial Conditions ====
P_A: [X_a, X_b, X_c, X_d]
[0.2  0.25 0.35 0.2 ]
P_B: [Y_a, Y_b, Y_c]
[0.4  0.55 0.05]

==== Joint Distribution (Independence Assumption) ====
          Y_a      Y_b       Y_c
X_a     0.08     0.11      0.01
X_b     0.10     0.14      0.01
X_c     0.14     0.19      0.02
X_d     0.08     0.11      0.01


==== Final Result (P(X_a, Y_a)=0.2 Fixed, KL Minimization) ====
          Y_a      Y_b       Y_c
X_a     0.20      0.00     0.00
X_b     0.06      0.17     0.02
X_c     0.09      0.24     0.02
X_d     0.05      0.14     0.01

Row Marginal Sums (compared to P_A_example): [0.2  0.25 0.35 0.2 ]
Column Marginal Sums (compared to P_B_example): [0.4  0.55 0.05]

1.3. Mathematical Explanation

The IPF algorithm adjusts the joint probability distribution so as to minimize the Kullback-Leibler (KL) divergence without ever calculating it directly. This is because IPF can be viewed as a form of coordinate descent (or, equivalently, as alternating information projections, closely related to the Expectation-Maximization (EM) algorithm). Each projection step provably does not increase the divergence.

1.3.1. The Objective Function to be Minimized

The objective of IPF is to find an adjusted joint probability distribution $\mathbf{P}$ that minimizes the KL divergence $D_{KL}(\mathbf{P} \parallel \mathbf{P}^{(0)})$ from the initial distribution $\mathbf{P}^{(0)}$, while satisfying the marginal probability constraints $\mathbf{P}_A$ and $\mathbf{P}_B$:

$$ \min_{P} D_{KL}(P \parallel P^{(0)}) = \min_{P} \sum_{i, j} P_{i, j} \log \left( \frac{P_{i, j}}{P^{(0)}_{i, j}} \right) $$

Constraints:

$$ \sum_{j} P_{i, j} = P_{A, i} \quad (\text{Row Marginal}) $$ $$ \sum_{i} P_{i, j} = P_{B, j} \quad (\text{Column Marginal}) $$

Each iteration alternates between adjusting rows and columns until convergence.

1.3.2. Derivation of the Scaling Step (Information Projection Theorem)

IPF finds the solution to this optimization problem through iterative scaling operations without the need for direct calculation of the KL divergence.

Step 1: Projection (Adjustment) onto the Row Marginal $\mathbf{P}_{A}$

To make the current distribution $\mathbf{P}^{(k)}$ conform to the row marginal constraint, a row-specific scaling factor $\lambda_i$ is applied:

$$ P_{i,j}^{(k + 1/2)} = P_{i,j}^{(k)} \times \lambda_{i} $$

Here, the scaling factor $\lambda_{i}$ is defined as:

$$ \lambda_{i} = \frac{P_{A, i}}{\sum_{j'} P^{(k)}_{i, j'}} $$

The row adjustment update formula is expressed by the following ratio:

$$ P_{i, j}^{(k + 1/2)} = P_{i, j}^{(k)} \times \left( \frac{P_{A, i}}{\sum_{j'} P_{i, j'}^{(k)}} \right) $$

Step 2: Projection (Adjustment) onto the Column Marginal $\mathbf{P}_{B}$

Next, a column-specific scaling factor $\mu_j$ is applied to the adjusted distribution $\mathbf{P}^{(k + 1/2)}$ to make it conform to the column marginal constraint:

$$ P_{i, j}^{(k + 1)} = P_{i, j}^{(k + 1/2)} \times \mu_{j} $$

Here, the scaling factor $\mu_{j}$ is defined as:

$$ \mu_{j} = \frac{P_{B, j}}{\sum_{i'} P_{i', j}^{(k + 1/2)}} $$

The column adjustment update formula is expressed by the following ratio:

$$ P_{i, j}^{(k + 1)} = P_{i, j}^{(k + 1/2)} \times \left( \frac{P_{B, j}}{\sum_{i'} P_{i', j}^{(k + 1/2)}} \right) $$
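
To make these two projection steps concrete, here is a minimal NumPy sketch of plain IPF (without the fixed-cell handling used in mcv-ipf.py), alternating the row and column scaling above until the marginals match:

import numpy as np

def ipf(P0, P_A, P_B, max_iter=100, tol=1e-9):
    P = P0.copy()
    for _ in range(max_iter):
        P_prev = P.copy()
        # Step 1: scale each row i by lambda_i = P_A[i] / sum_j P[i, j]
        P *= (P_A / P.sum(axis=1)).reshape(-1, 1)
        # Step 2: scale each column j by mu_j = P_B[j] / sum_i P[i, j]
        P *= (P_B / P.sum(axis=0)).reshape(1, -1)
        if np.abs(P - P_prev).sum() < tol:
            break
    return P

P_A = np.array([0.2, 0.25, 0.35, 0.2])
P_B = np.array([0.4, 0.55, 0.05])
P0 = np.full((4, 3), 1.0 / 12)       # start from a uniform joint distribution
P = ipf(P0, P_A, P_B)
print(P.sum(axis=1), P.sum(axis=0))  # marginals now match P_A and P_B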


2. Histograms

When using histograms, two key differences arise compared to Most Common Values (MCVs):

  1. Range Query Support: Query constraints must accommodate not only equi-joins but also range queries. Iterative Proportional Fitting (IPF) is primarily designed for equality constraints (e.g., summing marginal probabilities), making it challenging to directly and efficiently incorporate area constraints involving inequalities (range queries).
  2. High Dimensionality: The number of histogram bins (typically 100 by default) is large compared to MCVs. This necessitates reducing the data storage space required for the full joint probability distribution.

To simultaneously address both challenges, we adopt a Neural Network (NN) approach, utilizing function approximation to fine-tune the joint probability distribution.

We will implement a simple approach that minimizes Mean Squared Error (MSE) losses. Specifically, it estimates a joint probability distribution $P_{\text{NN}}$ that satisfies two predefined statistical constraints: the marginal distributions ($H_{A}, H_{B}$) and a specific area sum ($C_{\text{true}}$).

The core idea is to treat the estimation as a constrained optimization problem, where the NN parameters $\mathbf{\Theta}$ are adjusted to minimize a total loss function, $L_{\text{total}}$.

2.1. Mathematical Explanation

2.1.1. Objective Function

The total loss function $L_{\text{total}}$ is defined as the weighted sum of two components: the area loss ($L_{\text{area}}$) and the marginal loss ($L_{\text{margin}}$).

$$ L_{\text{total}} = \lambda_{\text{area}} L_{\text{area}} + \lambda_{\text{margin}} L_{\text{margin}} $$

Here, $\lambda_{\text{area}}$ and $\lambda_{\text{margin}}$ are hyperparameters (weights) used to adjust the importance of each constraint.

  1. Area Loss ($L_{\text{area}}$):
    This term penalizes the difference between the sum of probabilities estimated by the NN over a specific range $\mathcal{R}$ and the true target probability $C_{\text{true}}$. The range $\mathcal{R}$ corresponds to the query range that needs correction based on the observed true row count. $$ L_{\text{area}} = \left( \sum_{\mathbf{x} \in \mathcal{R}} P_{\text{NN}}(\mathbf{x}; \mathbf{\Theta}) - C_{\text{true}} \right)^2 $$ Here, $\mathbf{x}$ represents the histogram bin coordinates $(i, j)$, and $C_{\text{true}}$ is the normalized probability obtained by dividing the observed row count by the total number of rows $N$.

  2. Marginal Loss ($L_{\text{margin}}$):
    This term ensures that the marginal distributions derived from the NN output match the predefined marginal histograms $H_{A}$ and $H_{B}$. $$ L_{\text{margin}} = \sum_{i} \left( \sum_{j} P_{\text{NN}}(i, j; \mathbf{\Theta}) - H_{A}(i) \right)^2 + \sum_{j} \left( \sum_{i} P_{\text{NN}}(i, j; \mathbf{\Theta}) - H_{B}(j) \right)^2 $$ Here, $i$ and $j$ represent the bin indices for the $A$ and $B$ axes, respectively. A minimal sketch of both loss terms follows this list.

2.1.2. Training Procedure

The Neural Network (NN), such as the simple multi-layer perceptron implemented in the code, is trained using the following steps:

  • Step 1: Input and Output:

    • The model (ProbabilityModel) takes the bin coordinates $(i, j)$ of the 2D histogram grid as input, encoded as a concatenated one-hot vector (dimension $NUM_{A} + NUM_{B}$).
    • The network’s output is a single logit value for that specific coordinate.
  • Step 2: Probability Normalization:

    • The network first calculates logit values for all bins. The joint probability $P_{\text{NN}}$ is then computed by applying the softmax function across the logits of all bins. This step is crucial as it guarantees that the sum of probabilities for all bins is exactly 1. $$ P_{NN} (i, j; \mathbf{\Theta}) = \frac{\exp(\text{logits}_{i, j})} { \sum_{k, l} \exp(\text{logits}_{k, l}) } $$
  • Step 3: Optimization:

    • The loss functions $L_{\text{area}}$ and $L_{\text{margin}}$ are calculated, and the weighted total loss $L_{\text{total}}$ is obtained using the hyperparameters $\lambda_{\text{area}}$ and $\lambda_{\text{margin}}$.
    • The parameters $\mathbf{\Theta}$ are iteratively updated using a gradient descent approach with the Adam optimizer (optim.Adam) to minimize $L_{\text{total}}$.
  • Step 4: Fine-tuning Goal:

    • The process is essentially a fine-tuning of the NN. The network is adjusted from an initial state to incorporate the new, more accurate selectivity information $C_{\text{true}}$ into the area $\mathcal{R}$.
    • This adjustment is performed while simultaneously minimizing the marginal loss to ensure the generated distribution $P_{\text{NN}}$ remains consistent with the pre-defined marginal histograms $H_{A}$ and $H_{B}$. The weights ($\lambda_{\text{area}}$, $\lambda_{\text{margin}}$) are used to balance these two competing objectives: fitting the observed data and maintaining statistical consistency.

2.2. Source code

The source code is shown below: hist-nn.py

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np


# NN Model Definition
class ProbabilityModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(NUM_A + NUM_B, 128),
            nn.ReLU(),
            nn.Linear(128, 1),  # Output a single logit per coordinate
        )

    def forward(self, coords):
        # Simulate one-hot encoding
        coords_int = coords.long()
        features = []
        for i in range(coords.size(0)):
            one_hot_A = torch.zeros(NUM_A, device=coords.device)
            one_hot_A[coords_int[i, 0]] = 1.0
            one_hot_B = torch.zeros(NUM_B, device=coords.device)
            one_hot_B[coords_int[i, 1]] = 1.0
            features.append(torch.cat([one_hot_A, one_hot_B]))

        features_tensor = torch.stack(features)
        logits = self.fc(features_tensor)
        return logits.squeeze(1)


def display_joint_probability(
    model, coords_tensor, num_a, num_b, title="Joint Probability Distribution P(A, B)"
):
    """
    Retrieves and displays the joint probability distribution from the trained NN model.
    """
    print(f"\n--- {title} ---")

    with torch.no_grad():
        # 1. Get logits for all bins from NN (NN forward pass)
        individual_logits = model(coords_tensor)
        # 2. Calculate joint probability distribution P_NN (shape [80])
        P_NN_flat = torch.softmax(individual_logits, dim=0)
        # 3. Reshape into a 2D matrix [NUM_A, NUM_B]
        P_NN_matrix = P_NN_flat.view(num_a, num_b).cpu().numpy()

    # Header (Indices of H_B)
    header = "A\\B | " + " | ".join([f"B[{j}]" for j in range(num_b)])
    print("-" * (len(header) + 4))
    print(header)
    print("-" * (len(header) + 4))

    for i in range(num_a):
        row_str = f"A[{i}] | "
        for j in range(num_b):
            prob = P_NN_matrix[i, j]
            row_str += f"{prob:.3f} | "
        print(row_str.strip())
    print("-" * (len(header) + 4))

    # Verification: Check if the area sum is close to the target
    area_sum_check = P_NN_matrix[:N_CONSTRAINT, :M_CONSTRAINT].sum()
    P_A_check = P_NN_matrix.sum(axis=1)
    P_B_check = P_NN_matrix.sum(axis=0)

    # Calculate marginal error
    L_Margin = (
        nn.MSELoss()(torch.tensor(P_A_check), H_A).item()
        + nn.MSELoss()(torch.tensor(P_B_check), H_B).item()
    )
    print(
        f"Verification: Sum of area P(A[0:{N_CONSTRAINT-1}], B[0:{M_CONSTRAINT-1}]): {area_sum_check:.6f} (Target: {C_TRUE:.6f})"
    )
    print(f"Verification: Marginal Error (MSE): {L_Margin:.8f}")


# Loss Function and Fine-tuning
def fine_tune_constrained(
    model,
    optimizer,
    c_true,
    area_indices,
    H_A_target,
    H_B_target,
    lambda_area=10.0,
    lambda_margin=1.0,
    num_steps=500,
):

    criterion_mse = nn.MSELoss()

    for step in range(num_steps):
        optimizer.zero_grad()

        individual_logits = model(all_coords_tensor)
        P_NN_flat = torch.softmax(individual_logits, dim=0)

        # Loss 1: Area Sum Constraint (L_area)
        c_nn = P_NN_flat[area_indices].sum()
        c_true_tensor = torch.tensor(c_true).float().to(c_nn.device)
        loss_area = criterion_mse(c_nn, c_true_tensor)

        # Loss 2: Marginal Constraints (L_margin)
        P_NN_matrix = P_NN_flat.view(NUM_A, NUM_B)
        P_A_NN = P_NN_matrix.sum(dim=1)
        P_B_NN = P_NN_matrix.sum(dim=0)

        loss_margin_A = criterion_mse(P_A_NN, H_A_target)
        loss_margin_B = criterion_mse(P_B_NN, H_B_target)
        loss_margin = loss_margin_A + loss_margin_B

        # Total Loss
        loss_total = lambda_area * loss_area + lambda_margin * loss_margin
        loss_total.backward()
        optimizer.step()

        if (step + 1) % 50 == 0:
            print(
                f"Step {step+1}: L_Total={loss_total.item():.6f}, L_Area={loss_area.item():.6f}, L_Margin={loss_margin.item():.6f}, C_NN={c_nn.item():.6f}"
            )


if __name__ == "__main__":

    # --- 1. Initial Conditions and Setup ---

    # Distribution Histograms (Fixed as Marginal Constraints)
    H_A = torch.tensor(
        [0.085, 0.112, 0.063, 0.141, 0.039, 0.187, 0.095, 0.051, 0.155, 0.072],
        dtype=torch.float32,
    )
    H_B = torch.tensor(
        [0.021, 0.058, 0.150, 0.245, 0.280, 0.148, 0.065, 0.033], dtype=torch.float32
    )

    NUM_A = H_A.size(0)  # 10
    NUM_B = H_B.size(0)  # 8
    TOTAL_BINS = NUM_A * NUM_B  # 80

    # Target Value for Area Constraint (Fitting Target)
    C_TRUE = 0.3

    # Area Constraint Definition: H_A[0:n] and H_B[0:m] (Assume n=5, m=4)
    N_CONSTRAINT = 5
    M_CONSTRAINT = 4

    # --- Generate coordinates (i, j) for all bins (Global scope) ---
    all_coords = []
    for i in range(NUM_A):
        for j in range(NUM_B):
            all_coords.append([i, j])
    all_coords_tensor = torch.tensor(all_coords).float()  # Shape [80, 2]

    # --- Identify flat indices of bins included in the area constraint ---
    AREA_INDICES = []
    for i in range(N_CONSTRAINT):
        for j in range(M_CONSTRAINT):
            AREA_INDICES.append(i * NUM_B + j)
    AREA_INDICES = torch.tensor(AREA_INDICES)  # Shape [20]

    # --- Execution ---
    model = ProbabilityModel()
    optimizer = optim.Adam(model.parameters(), lr=5e-4)

    # Display before fine-tuning
    display_joint_probability(
        model,
        all_coords_tensor,
        NUM_A,
        NUM_B,
        title="Before Fine-tuning (Initial Random State)",
    )

    print("\n--- Starting Fine-tuning ---")
    fine_tune_constrained(model, optimizer, C_TRUE, AREA_INDICES, H_A, H_B)
    print("--- Fine-tuning Complete ---")

    # Display after fine-tuning
    display_joint_probability(
        model,
        all_coords_tensor,
        NUM_A,
        NUM_B,
        title="After Fine-tuning (Constraints Applied)",
    )

2.3. Execution Result

The execution result of hist-nn.py is shown below:

$ python hist-nn.py 

--- Before Fine-tuning (Initial Random State) ---
---------------------------------------------------------------
A\B | B[0] | B[1] | B[2] | B[3] | B[4] | B[5] | B[6] | B[7]
---------------------------------------------------------------
A[0] | 0.013 | 0.013 | 0.012 | 0.012 | 0.012 | 0.013 | 0.013 | 0.013 |
A[1] | 0.013 | 0.013 | 0.013 | 0.012 | 0.013 | 0.013 | 0.013 | 0.013 |
A[2] | 0.012 | 0.012 | 0.011 | 0.011 | 0.012 | 0.012 | 0.013 | 0.012 |
A[3] | 0.011 | 0.012 | 0.011 | 0.011 | 0.012 | 0.012 | 0.012 | 0.012 |
A[4] | 0.012 | 0.013 | 0.012 | 0.012 | 0.012 | 0.012 | 0.012 | 0.012 |
A[5] | 0.013 | 0.014 | 0.013 | 0.013 | 0.013 | 0.013 | 0.014 | 0.013 |
A[6] | 0.013 | 0.013 | 0.013 | 0.012 | 0.013 | 0.013 | 0.013 | 0.013 |
A[7] | 0.012 | 0.012 | 0.012 | 0.011 | 0.011 | 0.012 | 0.012 | 0.012 |
A[8] | 0.013 | 0.013 | 0.012 | 0.012 | 0.013 | 0.013 | 0.014 | 0.013 |
A[9] | 0.013 | 0.013 | 0.013 | 0.012 | 0.013 | 0.013 | 0.013 | 0.013 |
---------------------------------------------------------------
Verification: Sum of area P(A[0:4], B[0:3]): 0.241186 (Target: 0.300000)
Verification: Marginal Error (MSE): 0.01066820

--- Starting Fine-tuning ---
Step 50: L_Total=0.011445, L_Area=0.000037, L_Margin=0.011070, C_NN=0.293877
Step 100: L_Total=0.010102, L_Area=0.000001, L_Margin=0.010089, C_NN=0.298861
Step 150: L_Total=0.008607, L_Area=0.000000, L_Margin=0.008604, C_NN=0.299434
Step 200: L_Total=0.006526, L_Area=0.000000, L_Margin=0.006522, C_NN=0.299367
Step 250: L_Total=0.004152, L_Area=0.000001, L_Margin=0.004146, C_NN=0.299279
Step 300: L_Total=0.002355, L_Area=0.000001, L_Margin=0.002348, C_NN=0.299156
Step 350: L_Total=0.001419, L_Area=0.000001, L_Margin=0.001412, C_NN=0.299172
Step 400: L_Total=0.000997, L_Area=0.000001, L_Margin=0.000990, C_NN=0.299193
Step 450: L_Total=0.000784, L_Area=0.000001, L_Margin=0.000778, C_NN=0.299235
Step 500: L_Total=0.000645, L_Area=0.000000, L_Margin=0.000640, C_NN=0.299303
--- Fine-tuning Complete ---

--- After Fine-tuning (Constraints Applied) ---
---------------------------------------------------------------
A\B | B[0] | B[1] | B[2] | B[3] | B[4] | B[5] | B[6] | B[7]
---------------------------------------------------------------
A[0] | 0.005 | 0.007 | 0.019 | 0.029 | 0.025 | 0.013 | 0.005 | 0.004 |
A[1] | 0.006 | 0.009 | 0.025 | 0.035 | 0.030 | 0.015 | 0.006 | 0.004 |
A[2] | 0.004 | 0.005 | 0.014 | 0.022 | 0.020 | 0.010 | 0.003 | 0.003 |
A[3] | 0.007 | 0.011 | 0.028 | 0.043 | 0.038 | 0.021 | 0.008 | 0.006 |
A[4] | 0.003 | 0.005 | 0.009 | 0.015 | 0.015 | 0.007 | 0.003 | 0.003 |
A[5] | 0.007 | 0.010 | 0.025 | 0.041 | 0.045 | 0.021 | 0.007 | 0.005 |
A[6] | 0.003 | 0.004 | 0.010 | 0.018 | 0.022 | 0.010 | 0.003 | 0.003 |
A[7] | 0.002 | 0.003 | 0.006 | 0.010 | 0.012 | 0.006 | 0.002 | 0.002 |
A[8] | 0.005 | 0.007 | 0.019 | 0.032 | 0.041 | 0.018 | 0.006 | 0.004 |
A[9] | 0.003 | 0.004 | 0.007 | 0.012 | 0.016 | 0.007 | 0.003 | 0.002 |
---------------------------------------------------------------
Verification: Sum of area P(A[0:4], B[0:3]): 0.299303 (Target: 0.300000)
Verification: Marginal Error (MSE): 0.00063787

2.4. Alternative Hybrid Approach

An alternative approach combines Neural Network (NN) estimation with the Iterative Proportional Fitting (IPF) procedure. This hybrid method leverages the rigorous constraint enforcement and global consistency of IPF while using the expressiveness of the NN for local refinement.

  • Step 1: Generate Target Distribution ($P^{\ast}$) using IPF
    We first compute an ideal target distribution $P^{\ast}$ that is the closest possible distribution to the initial distribution ($P_{\text{initial}}$, which in the implementation is the initial NN output), while satisfying all constraints. $P^{\ast}$ is defined by minimizing the Kullback–Leibler (KL) Divergence: $$P^{\ast} = \arg \min_{P} D_{KL}(P \parallel P_{\text{initial}})$$ subject to the marginal constraints and the area constraint: $$\sum_{j} P_{i,j}^{\ast} = H_{A,i}, \quad \sum_{i} P_{i,j}^{\ast} = H_{B,j}, \quad \sum_{(i,j) \in \mathcal{R}} P_{i,j}^{\ast} = C_{\text{true}}$$ where $H_{A}$ and $H_{B}$ are the marginal histograms, and $\mathcal{R}$ is the constrained area corresponding to the corrected true count $C_{\text{true}}$. The function calculate_p_star_with_constraints in the source code executes this constrained IPF process.

  • Step 2: Fine-tune NN to Match $P^{\ast}$
    The neural network is subsequently trained to approximate the static target distribution $P^{\ast}$ calculated in Step 1. This process is commonly known as knowledge distillation. The loss function $L_{KL}$ minimizes the Reverse Kullback–Leibler (KL) Divergence ($D_{KL}(P^{\ast} \parallel P_{\text{NN}})$). This is the standard method used when training a student model ($P_{\text{NN}}$) to match a fixed teacher distribution ($P^{\ast}$): $$L_{KL}(\mathbf{\Theta}) = D_{KL}(P^{\ast} \parallel P_{\text{NN}}) = \sum_{i, j} P_{i, j}^{\ast} \log \frac{P_{i, j}^{\ast}}{P_{\text{NN}, i, j}(\mathbf{\Theta})}$$ This formulation, which is implemented using nn.KLDivLoss(log_probs_nn, P_star), encourages the neural network output $P_{\text{NN}}$ to closely match the IPF-corrected target distribution $P^{\ast}$.

Implementation Note:

In the actual implementation, the IPF-corrected distribution $P^{\ast}$ is computed once and used as a static target during the neural network’s training (static distillation). Although this is a simplification, it provides a stable and practical approximation compared to fully iterative IPF-NN updates.

The implementation details are provided in the source code below: hist-hybrid.py

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np


# NN Model Definition
class ProbabilityModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(NUM_A + NUM_B, 128),
            nn.ReLU(),
            nn.Linear(128, 1),  # Output a single logit per coordinate
        )

    def forward(self, coords):
        # Simulate one-hot encoding
        coords_int = coords.long()
        features = []
        for i in range(coords.size(0)):
            one_hot_A = torch.zeros(NUM_A, device=coords.device)
            one_hot_A[coords_int[i, 0]] = 1.0
            one_hot_B = torch.zeros(NUM_B, device=coords.device)
            one_hot_B[coords_int[i, 1]] = 1.0
            features.append(torch.cat([one_hot_A, one_hot_B]))

        features_tensor = torch.stack(features)
        logits = self.fc(features_tensor)
        return logits.squeeze(1)


def display_joint_probability(
    model,
    coords_tensor,
    num_a,
    num_b,
    H_A_target,
    H_B_target,
    C_true,
    area_indices,
    title="Joint Probability Distribution P(A, B)",
):
    """
    Retrieves the joint probability distribution from the trained NN model and displays it using nested loops.
    """
    print(f"\n--- {title} ---")

    with torch.no_grad():
        # 1. Get logits for all bins from NN (NN forward pass)
        individual_logits = model(coords_tensor)
        # 2. Calculate joint probability distribution P_NN (shape [80])
        P_NN_flat = torch.softmax(individual_logits, dim=0)
        # 3. Reshape into a 2D matrix [NUM_A, NUM_B]
        P_NN_matrix = P_NN_flat.view(num_a, num_b).cpu().numpy()

    # Header (Indices of H_B)
    header = "A\\B | " + " | ".join([f"B[{j}]" for j in range(num_b)])
    print("-" * (len(header) + 4))
    print(header)
    print("-" * (len(header) + 4))

    for i in range(num_a):
        row_str = f"A[{i}] | "
        for j in range(num_b):
            prob = P_NN_matrix[i, j]
            row_str += f"{prob:.3f} | "
        print(row_str.strip())
    print("-" * (len(header) + 4))

    # Verification
    area_sum_check = P_NN_matrix[:N_CONSTRAINT, :M_CONSTRAINT].sum()
    P_A_check = P_NN_matrix.sum(axis=1)
    P_B_check = P_NN_matrix.sum(axis=0)

    L_Margin = (
        nn.MSELoss()(torch.tensor(P_A_check), H_A_target).item()
        + nn.MSELoss()(torch.tensor(P_B_check), H_B_target).item()
    )

    print(
        f"Verification: Sum of area P(A[0:{N_CONSTRAINT-1}], B[0:{M_CONSTRAINT-1}]): {area_sum_check:.6f} (Target: {C_true:.6f})"
    )
    print(f"Verification: Marginal Error (MSE): {L_Margin:.8f}")


# Function to Calculate P* (Constrained IPF Principle)
def calculate_p_star_with_constraints(
    P_initial,
    H_A_target,
    H_B_target,
    C_true,
    area_indices,
    max_iter=100,
    tolerance=1e-6,
):

    P_current = P_initial.view(NUM_A, NUM_B).clone()

    # Convert marginal sums to NumPy arrays
    H_A_np = H_A_target.cpu().numpy()
    H_B_np = H_B_target.cpu().numpy()

    # Area indices (2D)
    area_rows, area_cols = np.unravel_index(area_indices.cpu().numpy(), (NUM_A, NUM_B))

    for i in range(max_iter):
        P_prev = P_current.clone()

        # === Step 1: Adjust Area Constraint ===
        P_np = P_current.cpu().numpy()
        current_area_sum = P_np[area_rows, area_cols].sum()

        if abs(current_area_sum - C_true) > 1e-18:
            # Calculate ratios for the area and the rest
            ratio_area = C_true / (current_area_sum + 1e-18)
            P_rest_sum = 1.0 - current_area_sum
            P_rest_target = 1.0 - C_true
            ratio_rest = P_rest_target / (P_rest_sum + 1e-18)

            P_temp_np = P_np.copy()

            # Adjust inside the constrained area
            P_temp_np[area_rows, area_cols] *= ratio_area

            # Adjust outside the constrained area
            mask_rest = np.ones_like(P_np, dtype=bool)
            mask_rest[area_rows, area_cols] = False
            P_temp_np[mask_rest] *= ratio_rest

            P_current = torch.tensor(P_temp_np).float()
            P_current /= P_current.sum()  # Ensure normalization after adjustment

        # === Step 2: Adjust Marginal Constraints (Standard IPF Steps) ===

        # Adjustment for A (Row) direction
        row_sums = P_current.sum(dim=1).unsqueeze(1)
        adjustment_factor_row = H_A_target.unsqueeze(1) / (row_sums + 1e-18)
        P_current *= adjustment_factor_row

        # Adjustment for B (Column) direction
        col_sums = P_current.sum(dim=0).unsqueeze(0)
        adjustment_factor_col = H_B_target.unsqueeze(0) / (col_sums + 1e-18)
        P_current *= adjustment_factor_col

        # === Convergence Check ===
        if torch.sum(torch.abs(P_current - P_prev)) < tolerance:
            break

    return P_current.view(1, TOTAL_BINS)  # Shape [1, 80]


# Learning Step Using KL Divergence as Teacher
def fine_tune_to_p_star(model, optimizer, P_star, num_steps=500):

    # KL Divergence Loss: Minimize D_KL(P* || P_NN)
    criterion_kl = nn.KLDivLoss(reduction="batchmean", log_target=False)

    for step in range(num_steps):
        optimizer.zero_grad()

        individual_logits = model(all_coords_tensor)
        log_probs_nn = torch.log_softmax(individual_logits, dim=0).unsqueeze(0)

        # Calculate loss: loss_kl = D_KL(P* || P_NN)
        loss_kl = criterion_kl(log_probs_nn, P_star)

        loss_total = loss_kl
        loss_total.backward()
        optimizer.step()

        if (step + 1) % 50 == 0:
            print(f"Step {step+1}: KL_Loss = {loss_kl.item():.6f}")


if __name__ == "__main__":

    # --- 1. Initial Conditions and Setup ---

    # Distribution Histograms (Fixed as Marginal Constraints)
    H_A = torch.tensor(
        [0.085, 0.112, 0.063, 0.141, 0.039, 0.187, 0.095, 0.051, 0.155, 0.072],
        dtype=torch.float32,
    )
    H_B = torch.tensor(
        [0.021, 0.058, 0.150, 0.245, 0.280, 0.148, 0.065, 0.033], dtype=torch.float32
    )

    NUM_A = H_A.size(0)  # 10
    NUM_B = H_B.size(0)  # 8
    TOTAL_BINS = NUM_A * NUM_B  # 80

    # Target Value for Area Constraint (Fitting Target)
    C_TRUE = 0.3

    # Area Constraint Definition: H_A[0:n] and H_B[0:m] (Assume n=5, m=4)
    N_CONSTRAINT = 5
    M_CONSTRAINT = 4

    # --- Generate coordinates (i, j) for all bins (Global scope) ---
    all_coords = []
    for i in range(NUM_A):
        for j in range(NUM_B):
            all_coords.append([i, j])
    all_coords_tensor = torch.tensor(all_coords).float()  # Shape [80, 2]

    # --- Identify flat indices of bins included in the area constraint ---
    AREA_INDICES = []
    for i in range(N_CONSTRAINT):
        for j in range(M_CONSTRAINT):
            AREA_INDICES.append(i * NUM_B + j)
    AREA_INDICES = torch.tensor(AREA_INDICES)  # Shape [20]

    # --- Execution ---
    model_hybrid = ProbabilityModel()
    optimizer_hybrid = optim.Adam(model_hybrid.parameters(), lr=1e-3)

    # ----------------------------------------------------
    # Display Before Fine-tuning
    # ----------------------------------------------------
    display_joint_probability(
        model_hybrid,
        all_coords_tensor,
        NUM_A,
        NUM_B,
        H_A,
        H_B,
        C_TRUE,
        AREA_INDICES,
        title="Before Fine-tuning (Initial Random State)",
    )

    print("\n--- Calculating P* Target Distribution (Applying IPF Principle) ---")

    # 1. Get the initial output of the NN (P_initial)
    with torch.no_grad():
        initial_logits = model_hybrid(all_coords_tensor)
        P_initial = torch.softmax(initial_logits, dim=0)

    # 2. Modify P_initial and calculate the target distribution P*
    P_star = calculate_p_star_with_constraints(
        P_initial, H_A, H_B, C_TRUE, AREA_INDICES
    )

    # Check P* constraint achievement
    P_star_matrix_check = P_star.squeeze().view(NUM_A, NUM_B)
    P_star_area_sum_check = (
        P_star_matrix_check[:N_CONSTRAINT, :M_CONSTRAINT].sum().item()
    )

    print(f"P* Target Area Sum: {P_star_area_sum_check:.6f} (Target: {C_TRUE:.6f})")

    # 3. Fine-tune the NN to approach P*
    print("\n--- Starting NN Fine-tuning (KL Minimization) ---")
    fine_tune_to_p_star(model_hybrid, optimizer_hybrid, P_star)
    print("--- NN Fine-tuning Complete ---")

    # ----------------------------------------------------
    # Display After Fine-tuning
    # ----------------------------------------------------
    display_joint_probability(
        model_hybrid,
        all_coords_tensor,
        NUM_A,
        NUM_B,
        H_A,
        H_B,
        C_TRUE,
        AREA_INDICES,
        title="After Fine-tuning (Approximation to P* Target)",
    )

3. Discussions

The proposed method allows the joint probability distribution to be updated based on observed data.

However, there are still several hurdles to overcome for practical application in a real-world Relational Database (RDB).

We will discuss these challenges from two perspectives: Computational Complexity and Memory Footprint Reduction.

3.1. Reducing Computational Complexity: Cumulative Sum or Summed-Area Table

When a probability distribution stored as a histogram is used for Selectivity Estimation, the values often need to be summed up based on a given condition.

For example, to get the value for the range $h[3:7]$ in a 1D histogram $h[0:19]$, you must sum the values from $h[3]$ to $h[7]$, i.e., $ \sum_{i=3}^{7} h[i]$. The computational cost for this summation is $O(n)$, where $n$ is the length of the histogram $h$.

While this cost is acceptable for a 1D case (in fact, PostgreSQL uses this method), it becomes non-negligible for a 2D histogram, where the calculation cost increases to $O(n^{2})$ .

Fortunately, there is a method to address this situation: the Cumulative Sum (or Summed-Area Table in 2D).

Simply put, instead of sequentially adding the values in the desired range, we use a new histogram H (the cumulative sum histogram) where each element is the accumulated sum of the original histogram $h$. By finding the difference between the upper and lower bounds of the region, the calculation can always be performed with an $O(1)$ computational cost.

The Cumulative Sum $H$ is defined as follows (assuming $H[-1]=0$):

$$H[i] = \sum_{j=0}^{i} h[j]$$ $$H[i] = H[i-1] + h[i] \quad \text{for } i \ge 0$$

Let’s consider calculating $S = \sum_{i=3}^{7} h[i]$:

  • Using the original histogram $h$, you must sum the values from $h[3]$ to $h[7]$: $$ S = \sum_{i=3}^{7} h[i] = h[3] + h[4] + h[5] + h[6] + h[7] $$
  • Conversely, using the Cumulative Sum Histogram $H$, only the following subtraction is required (a short sketch follows below): $$ S = H[7] - H[2] $$
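
A minimal NumPy sketch of the 1D case (using a toy histogram h of length 20):

import numpy as np

h = np.random.default_rng(0).random(20)
h /= h.sum()          # toy 1D histogram, normalized so the bins sum to 1
H = np.cumsum(h)      # cumulative sum: H[i] = h[0] + ... + h[i]

# S = h[3] + ... + h[7] computed in O(1) as H[7] - H[2]
print(np.isclose(H[7] - H[2], h[3:8].sum()))   # True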

The same principle applies in the 2D case.

If $h[i,j]$ is the original 2D histogram and $H[i,j]$ is the Cumulative Sum Table (or Summed-Area Table), $H[i,j]$ is defined as the sum of all bin probabilities (or frequencies) contained in the rectangular region from the origin $(0,0)$ up to coordinates $(i,j)$.

$$\mathbf{H}[i, j] = \sum_{x=0}^{i} \sum_{y=0}^{j} \mathbf{h}[x, y]$$

3.1.1. Creating the Cumulative Sum Table $\mathbf{H}$: $\text{O}(MN)$

For an original histogram $h$ of dimension $M \times N$, $\mathbf{H}$ can be created efficiently with a computational complexity of $O(MN)$ using the following recurrence relation:

$$\mathbf{H}[i, j] = \mathbf{h}[i, j] + \mathbf{H}[i-1, j] + \mathbf{H}[i, j-1] - \mathbf{H}[i-1, j-1]$$

  • $\mathbf{h}[i, j]$: The value of the current bin.
  • $\mathbf{H}[i-1, j]$: The cumulative sum from the area above.
  • $\mathbf{H}[i, j-1]$: The cumulative sum from the area to the left.
  • $\mathbf{H}[i-1, j-1]$: This is the top-left corner area which was doubly added in the previous two terms, so it must be subtracted once.

3.1.2. Calculating Region Probability: $\text{O}(1)$

To find the probability $P$ for any arbitrary rectangular region $[a_{1}, a_{2}] \times [b_{1}, b_{2}]$ (with top-left corner $(a_{1}, b_{1})$ and bottom-right corner $(a_{2}, b_{2})$), we can calculate it in $O(1)$ time using the following four cumulative sum values; a short NumPy sketch follows the breakdown below:

$$P = \mathbf{H}[a_2, b_2] - \mathbf{H}[a_1-1, b_2] - \mathbf{H}[a_2, b_1-1] + \mathbf{H}[a_1-1, b_1-1]$$

  • Calculation Breakdown:
    1. $\mathbf{H}[a_2, b_2]$: The total area ($(0, 0)$ to $(a_2, b_2)$)
    2. $\mathbf{H}[a_1-1, b_2]$: Subtract the unnecessary region on the left side.
    3. $\mathbf{H}[a_2, b_1-1]$: Subtract the unnecessary region on the top side.
    4. $\mathbf{H}[a_1-1, b_1-1]$: The top-left corner was doubly subtracted in steps 2 and 3, so we add it back once.
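
The corresponding 2D sketch (again with a toy histogram; the boundary cases $a_{1}=0$ or $b_{1}=0$ are handled by treating the missing $\mathbf{H}$ terms as 0):

import numpy as np

h = np.random.default_rng(0).random((6, 5))
h /= h.sum()                                  # toy 2D histogram (M=6, N=5)

# Summed-area table built in O(MN): H[i, j] = sum of h[0..i, 0..j]
H = h.cumsum(axis=0).cumsum(axis=1)

def region_sum(H, a1, a2, b1, b2):
    # O(1) probability of the rectangle [a1, a2] x [b1, b2] (inclusive bounds)
    total  = H[a2, b2]
    left   = H[a1 - 1, b2] if a1 > 0 else 0.0
    top    = H[a2, b1 - 1] if b1 > 0 else 0.0
    corner = H[a1 - 1, b1 - 1] if (a1 > 0 and b1 > 0) else 0.0
    return total - left - top + corner

print(np.isclose(region_sum(H, 2, 4, 1, 3), h[2:5, 1:4].sum()))   # True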

3.2. Reducing Memory Footprint: Count-Min Sketch

To reduce the memory required for the 2D histogram and allow for faster fine-tuning, I adopted an NN (Neural Network). However, directly handling an NN inside an RDB is computationally expensive.

Here, the Count-Min Sketch (CMS), a probabilistic data structure, can dramatically reduce the storage required for the 2D histogram.

The Count-Min Sketch is a probabilistic data structure used to approximately estimate the frequency (count) of specific elements in a stream or massive dataset. It stores data in a fixed, small matrix (the “sketch”) and uses multiple independent hash functions to limit the impact of hash collisions. To estimate the count of an element, the minimum value among the buckets that the element maps to is taken. The biggest advantage of this technique is that the estimation accuracy ($\epsilon$) and the failure probability ($\delta$) can be specified by the user, fixing the memory size at an extremely small value regardless of the data’s cardinality (number of unique items) or the total count.

Therefore, the practical approach would be to adjust the probability distribution using the NN and then map that resulting distribution onto the CMS for operational use.

Note that the probability distribution produced by the NN is typically a floating-point type (such as double), while the CMS uses an integer type. Therefore, when using the CMS, you must separately store the total sample count $N$ and divide the value retrieved from the CMS by $N$ to obtain the probability.

3.2.1. Memory Calculation

Two parameters are required to calculate the memory capacity needed for a Count-Min Sketch:

  1. Permissible Error Range: $\epsilon$
  2. Acceptable Failure Probability: $\delta$

The size ($d \times w$) of the CM Sketch matrix is determined by the following formulas:

  1. Number of Rows $d$ (Number of Hash Functions): $$d = \left\lceil \ln\left(\frac{1}{\delta}\right) \right\rceil $$
  2. Number of Columns $w$ (Bucket Width): $$w = \left\lceil \frac{e}{\epsilon} \right\rceil $$

Let’s calculate the required memory using specific values. Assuming the histogram size is $100 \times 100$, and setting both the permissible error $\epsilon$ and the acceptable failure probability $\delta$ to 0.01 (i.e., 1%), the number of rows $d$ and columns $w$ for the CMS are:

$$d = \left\lceil \ln\left(\frac{1}{\delta}\right) \right\rceil = \lceil \ln(100) \rceil = \lceil 4.605 \rceil = 5$$

$$w = \left\lceil \frac{e}{\epsilon} \right\rceil = \lceil \frac{2.718}{0.01} \rceil = \lceil 271.8 \rceil = 272$$

Thus, the total number of elements in the sketch is $d \times w = 5 \times 272 = 1360$.

Assuming each element is an 8-byte integer (like long long), the memory required for the CMS is:

$$\text{CM Sketch Memory} = 1,360 \ \text{elements} \times 8 \text{ bytes/element} = 10,880 \text{ bytes} \approx \mathbf{10.63 \ Kbytes}$$
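
The arithmetic above can be checked with a few lines of Python:

import math

epsilon, delta = 0.01, 0.01
d = math.ceil(math.log(1 / delta))   # number of hash functions: 5
w = math.ceil(math.e / epsilon)      # bucket width: 272
print(d, w, d * w * 8)               # 5 272 10880  (bytes, with 8-byte integers)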

3.2.2. Comparison of Memory Reduction

Item                  Calculation                        Memory Capacity
Original Histogram    $100 \times 100 \times 8$ bytes    $80{,}000$ bytes $\approx \mathbf{78.13 \ Kbytes}$
CM Sketch             $5 \times 272 \times 8$ bytes      $10{,}880$ bytes $\approx \mathbf{10.63 \ Kbytes}$

In this example, the Count-Min Sketch reduces the memory footprint to roughly 1/7th of the original histogram. Note that the sketch size depends only on $\epsilon$ and $\delta$, not on the histogram dimensions, so the savings grow with the histogram: for the $400 \times 400$ histogram assumed in cms.py ($1{,}280{,}000$ bytes $\approx$ 1,250 Kbytes), the reduction is roughly 1/118th.

3.2.3. Source code

The source code is shown below: cms.py

import numpy as np
import math
import hashlib


class CountMinSketch:
    def __init__(self, d_rows, w_cols):
        # Initialize the sketch matrix with zeros
        self.sketch = np.zeros((d_rows, w_cols), dtype=np.int32)
        self.d = d_rows
        self.w = w_cols

        # Seeds used to derive d independent hash functions
        # (each row uses SHA-256 combined with its own seed; see _hash below)
        self.seeds = np.random.randint(0, 1000000, d_rows)

    def _hash(self, i, j, seed):
        """
        Calculates a hash value within w_cols using 2D indices (i, j) and a seed.
        """
        # Combine indices and hash (simple concatenation and hashing)
        # Assuming i and j are integers
        combined_key = str(i) + "-" + str(j)

        # Hash the combined key together with the seed using SHA-256
        # (appending the seed makes each row behave like an independent hash function)
        h = int(
            hashlib.sha256((combined_key + str(seed)).encode("utf-8")).hexdigest(), 16
        )

        # Map to the range of w_cols
        return h % self.w

    def add(self, i: int, j: int, k: int):
        """
        Adds value k to the array cell (i, j).
        Args:
            i (int): Index of the first dimension
            j (int): Index of the second dimension
            k (int): Value (frequency or weight) to add
        """
        if i >= N_DIM or j >= M_DIM:
            raise IndexError("Index out of bounds for the assumed array size.")

        for row in range(self.d):
            # Use a different hash function for each row
            col_index = self._hash(i, j, self.seeds[row])
            self.sketch[row, col_index] += k

    def get(self, i: int, j: int) -> int:
        """
        Returns the estimated value of the array cell (i, j).
        Args:
            i (int): Index of the first dimension
            j (int): Index of the second dimension
        Returns:
            int: Estimated value of cell (i, j)
        """
        if i >= N_DIM or j >= M_DIM:
            raise IndexError("Index out of bounds for the assumed array size.")

        min_value = np.inf

        for row in range(self.d):
            col_index = self._hash(i, j, self.seeds[row])
            min_value = min(min_value, self.sketch[row, col_index])

        return int(min_value)

    def show(self, max_dim=10):
        """
        Displays the estimated data for the array (only a subset is shown due to computation cost).
        Args:
            max_dim (int): Maximum dimension to display (e.g., shows the top-left max_dim x max_dim)
        """
        print(f"\n--- Estimated Joint Probability Distribution (CM Sketch) ---")
        print(
            f" (Displaying estimate for top-left {min(N_DIM, max_dim)}x{min(M_DIM, max_dim)})"
        )

        # Array to store results
        estimated_array = np.zeros(
            (min(N_DIM, max_dim), min(M_DIM, max_dim)), dtype=np.int32
        )

        # Estimate all elements in the nested loop and display
        for i in range(min(N_DIM, max_dim)):
            for j in range(min(M_DIM, max_dim)):
                estimated_array[i, j] = self.get(i, j)

        # Print the formatted NumPy array
        print(estimated_array)
        print("-" * 40)


if __name__ == "__main__":

    # --- 1. Initial Parameter Setup ---
    # Corresponds to user input: error epsilon and failure probability delta
    EPSILON = 0.01  # Tolerance for error (1%)
    DELTA = 0.01  # Tolerance for failure probability (1%)

    # Array dimensions (assuming 400x400, but doesn't directly affect CM Sketch size)
    N_DIM = 400
    M_DIM = 400

    # --- 2. CM Sketch Parameter Calculation ---
    # Number of rows d (number of hash functions)
    D_ROWS = math.ceil(math.log(1 / DELTA))

    # Number of columns w (width of buckets)
    W_COLS = math.ceil(math.e / EPSILON)

    # --- Example Execution ---
    # 1. Initialize the Sketch
    cm_sketch = CountMinSketch(D_ROWS, W_COLS)

    # 2. Add Data (Add operation)
    # i, j are in the range 0 to 399
    print("Adding data (Add Operation)...")
    cm_sketch.add(1, 1, 100)
    cm_sketch.add(1, 1, 50)  # Total 150
    cm_sketch.add(1, 2, 20)  # Total 20
    cm_sketch.add(200, 300, 5)  # Far index
    cm_sketch.add(2, 1, 100)  # Data that might cause hash collision with (1,1)

    # 3. Get value of specific cells (Get operation)
    val_1_1 = cm_sketch.get(1, 1)
    val_1_2 = cm_sketch.get(1, 2)
    val_2_1 = cm_sketch.get(2, 1)
    val_200_300 = cm_sketch.get(200, 300)

    print(f"Estimated value (1, 1): {val_1_1} (True value: 150)")
    print(f"Estimated value (1, 2): {val_1_2} (True value: 20)")
    print(f"Estimated value (2, 1): {val_2_1} (True value: 100)")
    print(f"Estimated value (200, 300): {val_200_300} (True value: 5)")

    # 4. Display the entire array (Show operation)
    cm_sketch.show(max_dim=5)

As stated above, the CMS uses integers. Therefore, when dealing with probability distributions, remember to separately store the total number of samples $N$ and divide the value retrieved from the CMS by $N$ to obtain the probability.
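
For illustration, a minimal sketch of that workflow, reusing the CountMinSketch class and the D_ROWS/W_COLS constants from cms.py (N_TOTAL is a hypothetical total row count stored alongside the sketch):

# Store an NN-derived probability as an integer count, then recover it.
N_TOTAL = 1_000_000
prob_from_nn = 0.0125                               # e.g. P_NN for bin (3, 4)
cms = CountMinSketch(D_ROWS, W_COLS)
cms.add(3, 4, int(round(prob_from_nn * N_TOTAL)))   # add as an integer count
print(cms.get(3, 4) / N_TOTAL)                      # divide by N to get ~0.0125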

4. Conclusion

The core takeaway is that I have demonstrated the feasibility of continuously improving selectivity estimation by acquiring actual row counts for every query.

Remaining Work

  1. Persistence of Corrections After ANALYZE
    I deliberately avoided discussing the following issue until now: whenever the ANALYZE command refreshes the selectivity statistics, all previous corrections are effectively invalidated.

    Therefore, we must find a way to reuse those corrections or integrate the pre-ANALYZE distribution updates into the new statistics. Assuming a sudden, drastic change in data distribution is a rare event, this approach seems practical.

  2. Accuracy Verification of Probabilistic Data Structures
    Both the NN-based estimate and the Count-Min Sketch (CMS) are approximate. The use of a CMS, in particular, carries the risk of introducing or compounding overestimation errors caused by hash collisions.

    Consequently, the accuracy of selectivity estimation obtained by combining these two approaches requires rigorous qualitative and quantitative analysis. Since the overarching goal of these three blog posts is to improve the accuracy of selectivity estimation, if the final result cannot guarantee sufficient accuracy, then all the effort invested in this approach would be meaningless.

Personal Thoughts

My primary goal was to enable real-time query progress display by continuously obtaining actual row counts. By reducing the overhead of EXPLAIN ANALYZE to 2%, that objective has been achieved.

However, the accuracy of the Query Progress display depends directly on the quality of the Selectivity Estimation. Thus, although secondary to my main goal, improving the performance of Selectivity Estimation is still necessary.

I hope that providing researchers with a constant stream of actual row counts will inspire them to devise entirely new methods for Selectivity Estimation.