A Practical Guide To Gradient Boosting: Xgboost’S Weighted Quantile Sketch And Sparsity Aware Split Finding

2022-01-10 · Leonardo Benicio

Summary

A comprehensive technical exploration of a practical guide to gradient boosting: xgboost’s weighted quantile sketch and sparsity aware split finding, covering key concepts, practical implementations, and real-world applications.

Contents

A Practical Guide To Gradient Boosting: XGBoost’s Weighted Quantile Sketch And Sparsity Aware Split Finding

If you have spent any time in the world of machine learning over the last decade, you have felt the gravitational pull of XGBoost. It is the algorithm that won Kaggle competitions before deep learning stole the spotlight for unstructured data. It is the go-to tool for every credit risk model, every customer churn prediction, and every click-through rate forecast in production systems worldwide. For many practitioners, XGBoost is simply the magic black box that “just works.” You feed in a messy DataFrame, you call model.fit(), and somehow, out pops a model that is accurate, robust, and—most importantly—fast.

But here is the uncomfortable truth that separates a skilled practitioner from an expert: what happens inside that black box when the data is sparse, when the features are continuous, and when the dataset is too large to fit into memory? Most users have no idea. They set max_depth and learning_rate by intuition, or worse, by copying parameters from a forum post. They know that XGBoost is “optimized,” but they do not know why. This blog post is designed to pull back the curtain on two of the most critical, yet least understood, innovations that make XGBoost the titan it is: the Weighted Quantile Sketch and Sparsity-Aware Split Finding.

Understanding these two mechanisms is not just an academic exercise. It is the key to unlocking the next level of performance in your models. It is the difference between blindly tuning hyperparameters and strategically guiding your algorithm to find better splits, faster. It is the difference between struggling with memory errors on a 10GB dataset and confidently scaling to terabytes. By the end of this guide, you will not just be a user of XGBoost—you will understand the reason behind the legend.

To appreciate the genius of these innovations, we must first step back and understand the fundamental challenge that every gradient boosting algorithm faces: how to find the best split points for continuous features when the dataset is enormous, sparse, or both. Traditional decision tree algorithms like CART (Classification and Regression Trees) operate on the assumption that all features are dense and that the entire dataset fits in memory. They sort the feature values once and then scan through them to evaluate every possible split. This is fine for the iris dataset, but it collapses under the weight of modern industrial data—millions of rows, thousands of features, and high sparsity due to one-hot encoding or missing values.

XGBoost was designed from the ground up to tackle these real-world constraints. Its authors, Tianqi Chen and Carlos Guestrin, published the seminal paper “XGBoost: A Scalable Tree Boosting System” in 2016, which introduced the two core innovations we will dissect here. By the time you finish reading, you will understand not only how they work, but why they are essential for anyone who wants to push the boundaries of gradient boosting on large-scale, sparse data.

1. The Gradient Boosting Landscape: A Quick Refresher

Before we dive into the innovations, let’s establish a common vocabulary. Gradient boosting builds an ensemble of weak learners—typically decision trees—in a sequential, additive manner. At each iteration (t), we have a current model (F_{t-1}(x)). We compute the negative gradient (also called the pseudo-residual) of the loss function with respect to the prediction for each training example. Then we fit a new regression tree (h_t(x)) to these gradients, and update the model:

[ Ft(x) = F{t-1}(x) + \eta \cdot h_t(x) ]

where (\eta) is the learning rate.

XGBoost generalizes this by using a second-order Taylor expansion of the loss function, similar to Newton’s method. For a given data point (x_i) with true label (y_i) and current prediction (\hat{y}_i^{(t-1)}), we define:

[ g_i = \frac{\partial L(y_i, \hat{y})}{\partial \hat{y}} \quad \text{(first derivative)} ] [ h_i = \frac{\partial^2 L(y_i, \hat{y})}{\partial \hat{y}^2} \quad \text{(second derivative)} ]

At each step, we want to find a tree that minimizes the following objective:

[ \text{Obj} = \sum_{i=1}^n \left[ L(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) \right] + \Omega(f_t) ]

Using the Taylor expansion, the objective for a given tree structure can be approximated as:

[ \text{Obj}^{(t)} \approx \sum_{i=1}^n \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t(x_i)^2 \right] + \Omega(f_t) ]

Here (f_t(x)) is the output of the tree at iteration (t). The regularization term (\Omega) penalizes the number of leaves and the magnitude of leaf weights. This formulation is crucial because it turns the tree-building problem into a quadratic optimization over leaf weights. For a given tree structure, the optimal weight for leaf (j) is:

[ wj^* = -\frac{\sum{i \in Ij} g_i}{\sum{i \in I_j} h_i + \lambda} ]

where (I_j) is the set of indices in leaf (j) and (\lambda) is a regularization parameter. The corresponding loss reduction (gain) for a split is:

[ \text{Gain} = \frac{1}{2} \left[ \frac{(\sum_{i \in I_L} g_i)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i + \lambda} \right] ]

This gain formula is used to evaluate each candidate split. Notice that both first and second derivatives appear as sums over the data points in each leaf. This is the heart of the algorithm: we need to find split points that maximize this gain. But with large datasets, we cannot afford to examine every possible split point for every continuous feature. That is where the Weighted Quantile Sketch comes in.

2. The Quantile Sketch Problem: Why Exact Greedy Splitting Fails

2.1 The Exact Greedy Algorithm

In a classic decision tree, building a split for a continuous feature (x_j) works as follows: sort all training examples by their value of (x_j). Then, for every distinct value (or every point between distinct values), compute the loss reduction if you split there. This “exact greedy” algorithm is optimal in the sense that it examines all possible splits. But it is computationally infeasible for large datasets.

Memory cost: Sorting a feature of (n) values takes (O(n \log n)) time and (O(n)) memory for the sorted array. When you have millions of rows and thousands of features, you cannot store all the sorted arrays in memory simultaneously. You could sort on the fly, but that would be too slow.

Time cost: For each feature, you must iterate through all (n) sorted examples to compute cumulative sums of (g_i) and (h_i). That is (O(n)) per feature per tree level. If you have (d) features, that’s (O(d n)) per level, and the total time across all trees becomes prohibitive.

2.2 The Need for Approximate Splitting

Instead of examining every possible split, we can predefine a set of candidate split points—a “sketch” of the feature distribution—and evaluate only those. The challenge is to choose candidate points that are representative of where the loss function is sensitive. Random uniform sampling of the feature values would be wasteful because the loss function’s Hessian (h_i) tells us how much curvature there is. Points with larger (h_i) (higher second derivative) contribute more to the quadratic approximation, so we should allocate more candidate splits where (h_i) is large.

This is a weighted quantile problem: we want to find split points such that the sum of weights (h_i) in each bucket is roughly equal. If we use uniform weights (i.e., (h_i = 1) for all), we get ordinary quantiles. But by weighting with the Hessian, we ensure that regions of high curvature get more granular splits.

2.3 Formal Definition: Weighted Quantile Sketch

Given a set of (n) data points with feature values (x_1, x_2, \dots, x_n) and associated weights (w_i) (in our case, (w_i = h_i)), the weighted quantile sketch of order (\epsilon) is a data structure that can answer queries of the form: what is the smallest value (x) such that the cumulative weight of points with value (\le x) is at least a given fraction (r) of the total weight?

In other words, we want to construct a set of approximate quantiles (s_1, s_2, \dots, s_m) where the difference between the cumulative weight at successive quantiles is at most (\epsilon \cdot \text{total weight}). Here (\epsilon) is a user-controlled parameter that determines the fidelity of the sketch—smaller (\epsilon) gives more candidate splits.

In XGBoost, this is used per feature to generate a list of candidate split points. The parameter that controls this is sketch_eps (or approx_quantile in the original paper), typically set to 0.03 or 0.1. A smaller sketch_eps yields more candidate splits and potentially better accuracy, but also increases computation and memory.

3. How the Weighted Quantile Sketch Works in XGBoost

3.1 Data Structure: The GK Sketch

The original paper uses a variant of the Greenwald-Khanna (GK) algorithm for quantile sketches, adapted to handle weights. The GK sketch maintains a small set of “summary” tuples that approximate the distribution with provable error bounds. It allows merging of multiple sketches (useful for distributed computation) and can answer quantile queries in (O(\log m)) time.

However, XGBoost’s implementation (starting from the early versions) uses a simpler yet effective approach: weighted quantile sampling. For each feature, it picks a set of candidate split points that partition the cumulative weight into nearly equal intervals. The algorithm works as follows:

Sort the data points by the feature value (x_j). (Yes, there is still a sort, but it’s done once per feature per tree level, and the sorted order can be reused until the tree structure changes? Actually, XGBoost uses block structures to avoid repeated sorting—more on that later.)
Accumulate the weights (h_i) in sorted order.
Select split points at cumulative weight thresholds: (0, \epsilon, 2\epsilon, \dots, 1) (scaled to total weight). These thresholds define the candidate splits.

This gives at most (\lceil 1/\epsilon \rceil + 1) candidate splits per feature. For example, with (\epsilon = 0.1), you get at most 11 candidate splits. That is a huge reduction from the original (n) possible splits.

3.2 Why Weight by the Hessian?

The insight is that the loss reduction formula depends quadratically on the gradient sums and linearly on the Hessian sums. The Hessian (h_i) is the curvature—the second derivative of the loss. Points where the current model is uncertain (i.e., the gradient is large and the Hessian is small) are more important to split precisely. But actually, a large Hessian indicates high local curvature; the quadratic approximation is most accurate there, so we want to place more split points in those regions to better capture the shape of the loss function.

Consider a regression with squared error loss: (L(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2). Then (g_i = \hat{y}_i - y_i) (residual) and (h_i = 1) for all (i) (constant). With constant Hessian, weighted quantile reduces to ordinary quantile. But for classification with logistic loss, the Hessian varies: for a binary prediction (\hat{p}_i = \sigma(\hat{y}_i)), we have (h_i = \hat{p}_i (1 - \hat{p}_i)). This is large when (\hat{p}_i \approx 0.5) (high uncertainty) and small when the model is confident. So weighted quantile will allocate more split points in regions where the model is uncertain—exactly where you need them.

3.3 Implementation Details in XGBoost

XGBoost’s codebase implements the weighted quantile sketch as part of its “Approx” tree method (the default for large datasets). The process for building a tree level is:

Compute gradient statistics (g_i, h_i) for all data points from the current model.
For each feature (or a random subset of features if colsample_by* is used):
- Sort the data by that feature’s value.
- Scan to compute cumulative sum of (h_i) (and optionally cumulative sum of (g_i) for later gain calculation).
- Using the cumulative weight, determine candidate split points (feature values) such that the cumulative weight between consecutive points is at most (\epsilon \cdot \text{total weight}).
- For each candidate split, compute the gain using the precomputed cumulative sums of (g_i) and (h_i) for the left and right partitions.
Select the best split across all features.
Recursively split child nodes (using the same data, but with reordered blocks? Actually, XGBoost uses a column block structure to avoid repeated sorting—each block stores a feature column in sorted order for the entire dataset, and it maintains a “position” array to know which rows go to which node. This is another optimization.)

The weighted quantile sketch is what makes the approximate algorithm feasible. Without it, you would have to examine all (n) potential splits, which is (O(n)) per feature per level. With the sketch, you examine only (O(1/\epsilon)) candidates, which is constant in (n). This is the key to XGBoost’s scalability.

3.4 A Simple Python Example

To illustrate, let’s implement a minimal weighted quantile sketch in Python:

import numpy as np

def weighted_quantile_sketch(values, weights, eps=0.1):
    # Sort by values
    idx = np.argsort(values)
    sorted_vals = values[idx]
    sorted_weights = weights[idx]

    total_weight = np.sum(sorted_weights)
    # target cumulative weight intervals
    step = eps * total_weight
    targets = np.arange(0, total_weight, step)  # quantile thresholds
    # merge targets that are too close? Not necessary.
    cumsum = 0
    candidate_splits = []
    target_idx = 0
    for i in range(len(sorted_vals)):
        cumsum += sorted_weights[i]
        while target_idx < len(targets) and cumsum >= targets[target_idx]:
            # Use the feature value at this point as candidate split
            candidate_splits.append(sorted_vals[i])
            target_idx += 1
    return candidate_splits

# Example usage:
x = np.array([1, 2, 3, 100, 101, 102])
h = np.array([0.1, 0.2, 0.2, 0.5, 0.5, 0.5])  # hessians
splits = weighted_quantile_sketch(x, h, eps=0.3)
print(splits)  # will place more splits around large weights

Notice that with small weights on the first three points, the algorithm may skip them and only pick splits near the large weights. This is exactly the desired behavior.

4. Sparsity-Aware Split Finding: Handling Missing Values and Structural Zeros

4.1 The Challenge of Sparse Data

Real-world datasets are often sparse. This sparsity can come from:

Missing values: In surveys or sensor data, not all features are recorded for every instance.
One-hot encoding: Categorical variables with many levels produce matrices where most entries are zero.
Feature engineering: Counts, indicator variables, or embeddings from text can all produce zeros.

A naive approach would treat zeros as regular numeric values and consider splits like “feature > 0” versus “feature == 0”. But that is highly inefficient. Moreover, missing values require special handling: you cannot simply replace them with a default value because that introduces bias.

XGBoost’s sparsity-aware algorithm handles both missing values (NaN) and user-specified sparsity (e.g., zeros) in a unified way. The key idea: learn the optimal direction for missing values during training.

4.2 Default Direction: Learn Where Missing Values Go

For each split, the algorithm tests two possible default directions: putting all missing values to the left child or to the right child. It picks the one that yields the highest gain. This is done efficiently by scanning only the non-missing values once.

Let’s formalize: For a given feature (j), suppose we have a set of non-missing values (x_{ij}) with corresponding gradient statistics (g_i, h_i). We want to find a split point (t) and a default direction (d \in {L, R}) such that the gain is maximized. The algorithm proceeds as follows:

Sort the non-missing data by feature value.
Sketch candidate split points (using weighted quantile on the non-missing data only; weights are Hessians).
For each candidate split point (t):
- If we default left: missing values go left, so the left branch gets all missing weights plus those with (x < t). The right branch gets the rest.
- If we default right: missing values go right.
- Compute gain using cumulative sums (built during the sort scan) and the total sum of missing weights.
- Keep the best combination of split point and default direction.

Because the missing values are all assigned to one side, this adds only constant overhead per candidate split. The algorithm essentially learns a rule like: “If feature value is missing, send the instance right; otherwise, if value < threshold, send left; else right.” That is why it’s called sparsity-aware: it treats missing as a separate case without requiring imputation.

4.3 Handling Structural Zeros (Sparse Matrices)

XGBoost can also handle zeros that come from sparse representations (e.g., scipy.sparse matrices). The algorithm treats zero as a “missing” value by default, but you can change this behavior with the missing parameter. By setting missing=None, you tell XGBoost to treat all zeros as actual numeric values. However, the sparsity-aware split finding still applies: the algorithm will consider both directions for the zero/non-zero distinction.

In practice, for one-hot encoded features, it is beneficial to treat zeros as missing because a split on “feature > 0” essentially separates the presence of a category versus its absence. The default direction learns which side to send zeros (i.e., the absence) to achieve the best loss reduction.

4.4 Computational Efficiency

Because only non-missing values need to be sorted and scanned, the cost per feature is proportional to the number of non-missing entries, not (n). This is a huge win for highly sparse datasets where the sparsity ratio (fraction of zeros or NaNs) is, say, 99%. XGBoost’s column block structure stores each feature as a compressed sparse column (CSC) or similar, allowing quick iterative access to non-missing values.

4.5 Example: Putting It All Together

Imagine a feature “age” with 10% missing values. The algorithm:

Sorts the 90% non-missing ages.
Picks candidate splits from the weighted quantile sketch of the non-missing ages.
For each split, computes gain assuming missing go left, and gain assuming missing go right. Chooses the best combination.
The resulting tree node might have: “If age is missing, go left; else if age < 35, go left; else go right.”

This is far more powerful than simply imputing with the mean, which would force all missing values to one side of any split.

5. The Synergy: How Weighted Quantile Sketch and Sparsity Work Together

XGBoost’s genius is that these two innovations are orthogonal and complementary. The weighted quantile sketch reduces the number of split candidates to a manageable size, while the sparsity-aware algorithm handles missing/zero values without needing to sort them. Together, they allow XGBoost to scale to datasets with billions of rows and thousands of features, as long as the data fits in distributed memory.

Consider a real-world example: a click-through rate (CTR) prediction dataset. Such datasets often have millions of rows, thousands of categorical features (one-hot encoded), and a high missing data rate due to user interactions being optional (e.g., “did the user scroll?”). Without weighted quantile, evaluating all splits would be impossible. Without sparsity-aware handling, the tree would waste time sorting billions of zeros.

XGBoost’s block structure (column blocks) further enhances efficiency. Each feature’s non-missing values are stored in a contiguous array along with cumulative sums of gradients and Hessians. When training a tree, the algorithm pre-computes the cumulative sums for each feature block, then for each node, it can quickly find the best split by scanning the block once. This is possible because the cumulative sums can be reused across different nodes? Actually, for each node, you need to consider only the rows that belong to that node. XGBoost uses a position buffer to mark which rows are in which node, and then performs a linear scan over the feature block, ignoring rows not in the current node. This is still efficient, especially with sparsity.

6. Practical Implications and Tuning Advice

Now that you understand these mechanisms, you can make informed decisions about hyperparameters that directly affect them.

6.1 `sketch_eps` and `tree_method`

The parameter tree_method controls the split-finding algorithm. For moderate datasets (say, < 100k rows), you can use tree_method='exact' (or the default ‘auto’ which picks exact). For larger datasets, tree_method='approx' uses the weighted quantile sketch. The approx method also works well on dense data.

The sketch_eps parameter (or approx_quantile in some interfaces) governs the number of candidate splits per feature. A value of 0.03 means you will have about 33 candidate splits per feature. Lower values (e.g., 0.01) give more splits and potentially better accuracy, but increase memory and time. In practice, 0.03 is a good default. If you are memory-constrained, increase it to 0.1.

6.2 `max_delta_step` and Hessian Behavior

Since the weighted quantile sketch uses the Hessian as weights, you can influence the sketch by shrinking the Hessian. The max_delta_step parameter clamps the step size and affects Hessian magnitudes in logistic loss. If you set max_delta_step high, the Hessians become larger for uncertain points, which might lead to denser splits in those regions.

6.3 `missing` and Sparsity

For datasets with explicit NaNs, set missing=np.nan. For sparse matrices, XGBoost automatically treats zeros as missing (by default). If you want to treat zeros as real values (e.g., a count of zero has meaning), you can set missing=None. But be careful: doing so may make the algorithm slower because it will process all zeros as non-missing values.

6.4 Colsample_bytree and Feature Subsampling

When colsample_bytree < 1, XGBoost randomly selects a subset of features for each tree. The weighted quantile sketch is built only for the selected features, reducing memory. This is especially helpful when you have thousands of features.

7. Comparison with Other Gradient Boosting Implementations

7.1 LightGBM

LightGBM uses a different approach called Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB) . While GOSS addresses gradient imbalance, LightGBM’s split finding uses a histogram-based algorithm that bins feature values into discrete bins. This binning is essentially a fixed quantile sketch (uniform bins) rather than a weighted quantile sketch. LightGBM’s histogram method is faster for very wide datasets, but the binning may lose precision compared to XGBoost’s adaptive weighted quantile.

7.2 CatBoost

CatBoost uses ordered boosting and handles categorical features natively. Its split-finding is also histogram-based, but it applies specific handling for categorical features. It does not use a weighted quantile sketch.

7.3 Scikit-learn’s GradientBoostingRegressor

Scikit-learn’s implementation uses the exact greedy algorithm (sort all values). It is not designed for large-scale sparse data. XGBoost’s sparse-aware algorithm is a clear winner for missing values.

8. Code Walkthrough: Analyzing Split Quality and Quantile Sketch Behavior

Let’s simulate a small experiment to see the effect of Hessian weighting on candidate splits.

import numpy as np
import matplotlib.pyplot as plt

# Simulate data
np.random.seed(42)
n = 2000
x = np.random.exponential(scale=2, size=n)  # skewed feature
true_y = np.sin(x) + np.random.normal(0, 0.1, n)

# Train a quick XGBoost model (first iteration) to get g and h
import xgboost as xgb
dtrain = xgb.DMatrix(x.reshape(-1,1), label=true_y)
params = {'objective':'reg:squarederror', 'tree_method':'approx', 'sketch_eps':0.05}
model = xgb.train(params, dtrain, num_boost_round=1)
# Get g and h from the model (not directly available, but we can compute manually)
# Let's compute for a simple regression with initial prediction 0.5
initial_pred = np.full(n, 0.5)
g = initial_pred - true_y  # residual
h = np.ones(n)  # constant for squared error

# Now compute weighted quantile candidates
def get_candidates(values, weights, eps):
    idx = np.argsort(values)
    sorted_vals = values[idx]
    sorted_weights = weights[idx]
    total = np.sum(sorted_weights)
    step = eps * total
    cumsum = 0
    candidates = []
    tgt = step
    for i in range(len(sorted_vals)):
        cumsum += sorted_weights[i]
        while cumsum >= tgt:
            candidates.append(sorted_vals[i])
            tgt += step
    return candidates

# Uniform weights (h=1)
candidates_uniform = get_candidates(x, np.ones(n), eps=0.05)
print("Number of uniform candidates:", len(candidates_uniform))

# Now, suppose we have a different problem where Hessian varies (e.g., logistic)
# Let's create a fake hessian that is large for medium x values
h_var = np.exp(-0.5*(x-5)**2)  # Gaussian around 5
candidates_weighted = get_candidates(x, h_var, eps=0.05)
print("Number of weighted candidates:", len(candidates_weighted))

plt.hist(x, bins=50, alpha=0.5, label='data')
plt.scatter(candidates_uniform, [0]*len(candidates_uniform), marker='|', color='red', label='uniform')
plt.scatter(candidates_weighted, [0.1]*len(candidates_weighted), marker='|', color='blue', label='weighted')
plt.legend()
plt.show()

You will see that the weighted candidates are concentrated around the region where (h_i) is large (around x=5), whereas uniform candidates are spread across the feature range.

9. Conclusion: From Black Box to Transparent Power

You have now peeled back the curtain on two of XGBoost’s most important innovations. The Weighted Quantile Sketch allows the algorithm to focus its split search on the most informative regions of the feature space, dramatically reducing computation without sacrificing accuracy. The Sparsity-Aware Split Finding enables XGBoost to elegantly handle missing values and structural zeros, turning a common nuisance into a competitive advantage.

These innovations are not just academic niceties—they are the reason XGBoost can train on terabytes of data in distributed environments, and they are the reason your credit risk model can gracefully handle missing income data without imputation. By understanding them, you can make smarter choices about hyperparameters, diagnose performance issues, and even extend the algorithm to new use cases.

Next time you call model.fit() on a sparse, massive dataset, remember the quantile sketch silently carving out candidate splits, and the default direction learning where missing values belong. The black box becomes a well-engineered machine, and you—the practitioner—become the expert who understands the legend.

Further Reading:

Original XGBoost paper: Chen & Guestrin (2016)
Greenwald-Khanna quantile sketch (2001)
LightGBM’s gradient-based one-side sampling
CatBoost’s ordered boosting

If you enjoyed this deep dive, consider sharing it with a colleague who still thinks XGBoost is magic.