Machine Learning·Intermediate

Decision Trees: CART, Splitting Criteria, and Pruning

Decision trees from first principles — CART's greedy recursive binary splitting, Gini vs entropy vs MSE split criteria, cost-complexity pruning with cross-validation, surrogate splits for missing values, feature importance bias (Strobl 2007) and the TreeSHAP fix, and why single trees are always ensembled in production. Covers ID3, C4.5, CART, and oblivious trees (CatBoost).


Why Decision Trees Are the Building Block of Modern Tabular ML

A decision tree is a piecewise-constant function on a hierarchical partition of feature space. At each internal node, the data is split on a single-feature threshold; at each leaf, predictions are the majority class (classification) or mean (regression) of the training samples routed there. Trees matter in 2026 not because they win alone — they don't; a single tree is beaten by a linear model on most real datasets — but because they are the atomic unit of every dominant tabular learner: Random Forest, XGBoost, LightGBM, CatBoost. Understanding a single tree's failure modes (variance, bias toward high-cardinality features, axis-aligned splits) is what separates a candidate who can tune an ensemble from one who parrots defaults.

The algorithm you need to know in an interview is CART (Classification and Regression Trees), introduced by Breiman et al. in 1984. CART is a greedy, top-down, recursive binary splitter: at each node, it exhaustively tries every feature-threshold combination, picks the one that most reduces impurity, and recurses. Greedy because it never backtracks — finding the globally optimal tree is NP-hard (Hyafil & Rivest, 1976). Binary because multi-way splits fragment the data too quickly (at depth 3, a 10-way categorical split leaves each leaf with 1/1000 of the data, versus 1/8 for binary splits). Top-down because bottom-up agglomeration has no obvious stopping rule.

IMPORTANT

What Interviewers Are Actually Evaluating

A 6/10 candidate recites 'Gini vs entropy' and stops. A 9/10 candidate says three things the 6/10 never will: (1) Gini vs entropy almost never matters (Raileanu & Stoffel 2004 found disagreement on under 2% of splits), but entropy is ~10% slower due to the log. (2) The default feature_importances_ in sklearn is statistically biased toward high-cardinality features (Strobl 2007) — never use it for feature selection; use permutation importance or TreeSHAP. (3) Single decision trees have no place in production — they exist only as the weak learner inside Random Forest/XGBoost/LightGBM/CatBoost. If an interviewer asks 'how would you deploy a decision tree?', the right answer is 'I wouldn't — I'd deploy an ensemble of them and explain individual predictions via TreeSHAP.' Staff-level signal comes from mentioning oblivious trees (CatBoost's symmetric trees enable ~100x faster vectorized inference) and the correct use of surrogate splits for production missing-value handling.

Clarifying Questions Before Choosing Decision Trees

01. What is the data shape — n samples and d features?

Trees cost O(n log n · d) per tree. For n < 10K and d < 50, exact CART is fine. For n > 1M, you need histogram-based approximation (LightGBM/XGBoost hist mode, 256 bins). For d > n (wide, short), trees perform badly — linear models with L1 dominate.

02. Is interpretability required by regulation or product?

In medical, credit, or insurance settings, the EU's GDPR Article 22 and the US Equal Credit Opportunity Act require explainable decisions. A single tree of depth ≤ 5 gives you a readable decision rule per prediction. Ensembles need TreeSHAP (Lundberg 2020) — exact Shapley values for tree models in polynomial time — to provide the same explanation.

03. What is the feature type mix — numerical, categorical, missing?

CART handles numerical splits with sorted thresholds in O(n log n). For categoricals, CART's optimal partition of k categories into two groups has 2^(k−1) − 1 candidates, but Fisher 1958 proved that sorting categories by target mean reduces this to k − 1 candidates without loss for binary/regression targets — this is why LightGBM/CatBoost handle categoricals natively. For missing values, CART uses surrogate splits (a backup split chosen to mimic the primary split's routing) at each node; sklearn does not implement these, so you must impute.

04. What is the noise level in the labels?

Trees overfit aggressively on noisy labels because a leaf with one sample always has zero training error. Set min_samples_leaf ≥ 5 as a hard floor in noisy domains, and never set max_depth=None with small data.

05. Will this be deployed standalone or as an ensemble base learner?

Standalone → prune aggressively (cost-complexity pruning with CV). Ensemble base learner → skip pruning entirely. Random Forest wants high-variance unpruned trees that get averaged; XGBoost wants shallow trees (depth 3–8) that focus on residual correction. Pruning a tree inside an ensemble destroys the diversity the ensemble relies on.

The CART Algorithm — Greedy Recursive Binary Splitting

At each node, CART solves: find the feature j and threshold t that maximize impurity reduction on the data routed to this node. For n samples and d features, the naive cost is O(n² · d) per node if you recompute child impurities from scratch at every candidate threshold. Sorting each feature once (O(n log n)) and sweeping with running class counts brings a node down to O(d · n log n), and reusing a global presort across nodes (as XGBoost's exact mode does with its sorted column blocks) amortizes the cost to O(n · d) per level.

Why binary splits, not multi-way? Multi-way splits on a k-valued categorical create k children per node. At depth 3, binary gives you 8 leaves; 10-way multi-way gives you 1000. Each leaf has 1/1000 the data, so variance explodes. CART always does binary — for a k-category feature, it finds the optimal binary partition of the k categories into two groups.

Why greedy, not optimal? Finding the globally optimal binary tree of bounded depth is NP-hard (Hyafil & Rivest 1976). Greedy top-down is a heuristic that works well in practice because most useful signal is additive: the best split at the root is usually close to the best feature overall, and subsequent splits refine. The failure mode: XOR-like interactions where no single-feature split at the root reveals signal. A depth-2 tree can represent XOR (split on x₁, then on x₂ under each branch), but the greedy root split sees zero gain and may never grow. In practice, this is why ensembles help — different bootstrap samples surface the interaction differently.
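
The greedy-root failure is easy to see numerically. The sketch below (illustrative toy data and a coarse threshold grid, not part of the original text) shows that on ideal XOR data the best single-feature root split has essentially zero Gini gain, even though a depth-2 axis-aligned rule is exact.

xor_greedy_gap.py (Python sketch)
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Ideal XOR: label is 1 iff exactly one of x1, x2 exceeds 0.5.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(1000, 2))
y = ((X[:, 0] > 0.5) ^ (X[:, 1] > 0.5)).astype(int)

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / len(labels)
    return 1.0 - np.sum(p ** 2)

# Best single-feature root split over a coarse threshold grid: gain stays ~0.
parent = gini(y)
best_gain = 0.0
for j in range(2):
    for t in np.linspace(0.05, 0.95, 19):
        left, right = y[X[:, j] <= t], y[X[:, j] > t]
        child = len(left) / len(y) * gini(left) + len(right) / len(y) * gini(right)
        best_gain = max(best_gain, parent - child)
print(f"best root-split gain: {best_gain:.4f}")      # ~0: greedy sees nothing at the root

# A depth-2 rule (x1 > 0.5, then x2 > 0.5 in each branch) is exact, so the
# representation exists, but a greedy depth-2 tree may start from a spurious
# root split and fail to recover it (results vary with the random seed).
greedy = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(f"greedy depth-2 training accuracy: {greedy.score(X, y):.3f}")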

Split Criteria — Gini, Entropy, MSE

Split Criteria Comparison — When Does the Choice Matter?

| Criterion | Formula | Task | Cost | When to Use |
| --- | --- | --- | --- | --- |
| Gini | 1 − Σ p_k² | Classification | Fast (no log) | Default for CART; sklearn default. ~10% faster than entropy. |
| Entropy / Info Gain | −Σ p_k log p_k | Classification | ~10% slower (log call) | ID3/C4.5 default. Agrees with Gini on >98% of splits (Raileanu 2004) — rarely worth the cost. |
| MSE / Variance | Σ (y − ȳ)² | Regression | Standard | Default for regression trees. Assumes squared-error loss. |
| MAE | Σ abs(y − median(y)) | Regression (robust) | Slower (median needed) | Use when outliers dominate; ~3× slower than MSE in sklearn. |
| Friedman MSE | (n_L n_R / n)(ȳ_L − ȳ_R)² | GBDT residual fitting | Same as MSE | Default in sklearn GradientBoosting. Slightly better for boosting residuals. |
| XGBoost gain | ½[G_L²/(H_L+λ) + G_R²/(H_R+λ) − G²/(H+λ)] − γ | Any differentiable loss | Same as MSE | Second-order Taylor approximation — used by XGBoost/LightGBM. |
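
The last table row is easiest to internalize as code. Below is a minimal sketch of the second-order gain formula; the helper name xgb_split_gain and the toy numbers are mine, not a real XGBoost API.

xgb_gain.py (Python sketch)
import numpy as np

def xgb_split_gain(g, h, left_mask, lam=1.0, gamma=0.0):
    """Second-order split gain (hypothetical helper, not the XGBoost API).

    g, h     : per-sample gradient and Hessian of the loss at the current
               prediction (for squared error: g = pred - y, h = 1).
    left_mask: boolean mask of samples routed to the left child.
    lam      : L2 penalty on leaf weights (lambda); gamma: per-leaf penalty.
    """
    def leaf_score(G, H):
        return G ** 2 / (H + lam)

    G, H = g.sum(), h.sum()
    G_L, H_L = g[left_mask].sum(), h[left_mask].sum()
    G_R, H_R = G - G_L, H - H_L
    return 0.5 * (leaf_score(G_L, H_L) + leaf_score(G_R, H_R) - leaf_score(G, H)) - gamma

# Toy check with squared error at prediction 0: g = -y, h = 1.
y = np.array([1.0, 1.2, 0.9, 5.0, 5.1, 4.8])
g, h = -y, np.ones_like(y)
print(xgb_split_gain(g, h, left_mask=np.array([1, 1, 1, 0, 0, 0], dtype=bool)))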

[Diagram: CART tree construction — greedy recursive splitting]

CART from Scratch — Gini Split Finding + Recursive Build

cart_from_scratch.py (Python)
import numpy as np
from collections import Counter
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: Optional[int] = None        # split feature index
    threshold: Optional[float] = None    # split threshold
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    value: Optional[int] = None          # leaf prediction (majority class)
    impurity: float = 0.0                # for pruning later
    n_samples: int = 0

def gini(y: np.ndarray) -> float:
    """Gini impurity: 1 - sum(p_k^2). Fast: no log, vectorized."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    probs = counts / len(y)
    return 1.0 - np.sum(probs ** 2)

def best_split(X: np.ndarray, y: np.ndarray) -> tuple:
    """
    Exhaustive O(n log n * d) split search.
    Returns (feature, threshold, gain) of best split, or (None, None, 0) if no split found.
    """
    n, d = X.shape
    parent_gini = gini(y)
    best_gain, best_feat, best_thresh = 0.0, None, None

    for j in range(d):
        # Sort samples by feature j once (O(n log n))
        order = np.argsort(X[:, j])
        xs, ys = X[order, j], y[order]

        # Running class counts for left/right
        left_counts = Counter()
        right_counts = Counter(ys)
        n_left, n_right = 0, n

        for i in range(n - 1):
            cls = ys[i]
            left_counts[cls] += 1
            right_counts[cls] -= 1
            n_left, n_right = n_left + 1, n_right - 1

            # Skip duplicate thresholds (can't split between identical values)
            if xs[i] == xs[i + 1]:
                continue

            # Weighted child impurity
            p_l = np.array(list(left_counts.values())) / n_left
            p_r = np.array(list(right_counts.values())) / n_right
            g_l = 1.0 - np.sum(p_l ** 2)
            g_r = 1.0 - np.sum(p_r ** 2)
            child = (n_left / n) * g_l + (n_right / n) * g_r
            gain = parent_gini - child

            if gain > best_gain:
                best_gain = gain
                best_feat = j
                best_thresh = (xs[i] + xs[i + 1]) / 2.0

    return best_feat, best_thresh, best_gain

def build_tree(X, y, depth=0, max_depth=10, min_samples_split=2,
               min_impurity_decrease=0.0) -> Node:
    """Recursive CART builder with all standard stopping criteria."""
    node = Node(
        value=Counter(y).most_common(1)[0][0],
        impurity=gini(y),
        n_samples=len(y),
    )

    # Stopping: pure, too deep, too few samples, single class
    if (depth >= max_depth
            or len(y) < min_samples_split
            or node.impurity == 0.0):
        return node

    feat, thresh, gain = best_split(X, y)
    if feat is None or gain < min_impurity_decrease:
        return node  # no useful split found -> leaf

    mask = X[:, feat] <= thresh
    node.feature, node.threshold = feat, thresh
    node.left = build_tree(X[mask], y[mask], depth + 1, max_depth,
                           min_samples_split, min_impurity_decrease)
    node.right = build_tree(X[~mask], y[~mask], depth + 1, max_depth,
                            min_samples_split, min_impurity_decrease)
    node.value = None  # internal node
    return node
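
The builder above returns a Node but no inference routine. A minimal traversal sketch, assuming the Node and build_tree definitions above (the names predict_one and predict are mine):

cart_predict.py (Python sketch)
def predict_one(node: Node, x: np.ndarray) -> int:
    """Route one sample down the tree; internal nodes have value=None."""
    while node.value is None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.value

def predict(node: Node, X: np.ndarray) -> np.ndarray:
    """Predict the majority class for each row of X."""
    return np.array([predict_one(node, x) for x in X])

# Usage sketch (X_train, y_train assumed to be numpy arrays):
# tree = build_tree(X_train, y_train, max_depth=5, min_samples_split=10)
# y_pred = predict(tree, X_train)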

Stopping Criteria — Pre-Pruning vs Post-Pruning

There are two philosophies for preventing overfitting. Pre-pruning stops tree growth early using thresholds: max_depth (hard depth cap), min_samples_split (refuse to split a node with fewer samples), min_samples_leaf (refuse splits that create small leaves), min_impurity_decrease (refuse splits with gain below threshold). Pre-pruning is fast (less work) but myopic — it may stop at a node whose children would have been very useful (the XOR trap again: zero gain at root, huge gain at depth 2).
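
In sklearn, these pre-pruning thresholds are plain constructor arguments; a minimal sketch with illustrative values that you would tune per dataset:

pre_pruning.py (Python sketch)
from sklearn.tree import DecisionTreeClassifier

pre_pruned = DecisionTreeClassifier(
    max_depth=6,                  # hard depth cap
    min_samples_split=20,         # refuse to split nodes with fewer samples
    min_samples_leaf=5,           # refuse splits that create tiny leaves
    min_impurity_decrease=1e-3,   # refuse splits whose gain is below this
    random_state=0,
)
# pre_pruned.fit(X_train, y_train)   # X_train, y_train assumed available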

Post-pruning grows a full tree, then collapses subtrees that don't improve validation performance. The canonical algorithm is cost-complexity pruning (Breiman 1984), also called weakest-link pruning. Each subtree T is scored as R_α(T) = R(T) + α|T|, where R(T) is the training error and |T| is the number of leaves. As α increases from 0 to ∞, subtrees are collapsed in a nested sequence from the fully-grown tree to the root. You then pick α via k-fold cross-validation. Sklearn exposes this as the ccp_alpha parameter; clf.cost_complexity_pruning_path(X, y) returns the candidate α values.

Why ensembles skip pruning entirely. In Random Forest, you want each tree to overfit — averaging many overfit trees is exactly how bagging reduces variance. Pruning would reduce the variance of each tree but also reduce ensemble diversity, leaving you with fewer, more correlated trees. The net effect is worse than no pruning. Same logic applies to XGBoost: shallow trees (depth 3–6) are the regularizer; pruning is redundant and would destroy the sequential residual-fitting.

Cost-Complexity Pruning Loop with Cross-Validation

ccp_pruning.py (Python)
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def select_ccp_alpha(X, y, cv=5, random_state=42):
    """
    Breiman's cost-complexity pruning via cross-validation.
    Returns the best alpha, the pruned tree refit on all data, and the per-alpha CV mean/std scores.

    Step 1: Fit a full tree, extract the nested sequence of alphas.
    Step 2: For each alpha, cross-validate a tree pruned at that alpha.
    Step 3: Pick alpha* = argmax(mean CV score).
    """
    # Step 1: get the candidate alpha path (nested subtrees)
    full_tree = DecisionTreeClassifier(random_state=random_state).fit(X, y)
    path = full_tree.cost_complexity_pruning_path(X, y)
    alphas = path.ccp_alphas  # sorted ascending; alphas[-1] collapses to root

    # Drop the last alpha (fully collapsed = root-only = not useful)
    alphas = alphas[:-1]

    # Step 2: CV-evaluate each pruned tree
    mean_scores, std_scores = [], []
    for alpha in alphas:
        clf = DecisionTreeClassifier(ccp_alpha=alpha, random_state=random_state)
        scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
        mean_scores.append(scores.mean())
        std_scores.append(scores.std())

    mean_scores = np.array(mean_scores)
    std_scores = np.array(std_scores)

    # Step 3: pick alpha*. Use the "1-SE rule" (Breiman) for robustness:
    #   choose the largest alpha whose CV score is within one standard error
    #   of the best. This favors simpler trees when performance is similar.
    best_idx = np.argmax(mean_scores)
    se_best = std_scores[best_idx] / np.sqrt(cv)      # standard error of the mean CV score
    threshold = mean_scores[best_idx] - se_best
    candidates = np.where(mean_scores >= threshold)[0]
    alpha_star = alphas[candidates.max()]   # largest alpha within 1 SE of the best

    final = DecisionTreeClassifier(
        ccp_alpha=alpha_star, random_state=random_state
    ).fit(X, y)
    return alpha_star, final, mean_scores, std_scores
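
A usage sketch on a small built-in dataset (the dataset choice is purely illustrative):

ccp_usage.py (Python sketch)
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
alpha_star, pruned_tree, cv_means, cv_stds = select_ccp_alpha(X, y, cv=5)
print(f"chosen ccp_alpha = {alpha_star:.5f}, leaves = {pruned_tree.get_n_leaves()}")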

Categorical Features — CART's Optimal Partition Trick

For a k-valued categorical feature, CART must find the best binary partition of the k categories into two groups. Brute force: 2^(k-1) − 1 non-trivial partitions. For k = 20, that's 500K — infeasible at every split.

Fisher 1958's insight: for binary classification or regression, sort the categories by their target mean (mean of y within each category). The optimal binary partition is always a prefix split on this sorted order, reducing the search to k − 1 candidates — O(k) instead of O(2^k). This is a rare case where greedy-with-sorting is provably optimal.
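
A minimal sketch of the trick for a regression or binary target using a variance criterion; the function name best_categorical_split is mine, not a library API:

fisher_categorical_split.py (Python sketch)
import numpy as np

def best_categorical_split(cats, y):
    """Search the k-1 prefix partitions of categories sorted by target mean."""
    categories = np.unique(cats)
    means = np.array([y[cats == c].mean() for c in categories])
    order = categories[np.argsort(means)]          # categories sorted by mean(y)

    parent_sse = np.sum((y - y.mean()) ** 2)
    best_gain, best_left = -np.inf, None
    for i in range(1, len(order)):                 # only k - 1 candidates
        left_mask = np.isin(cats, order[:i])
        yl, yr = y[left_mask], y[~left_mask]
        child_sse = np.sum((yl - yl.mean()) ** 2) + np.sum((yr - yr.mean()) ** 2)
        if parent_sse - child_sse > best_gain:
            best_gain, best_left = parent_sse - child_sse, set(order[:i])
    return best_left, best_gain                    # categories sent left, SSE reduction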

LightGBM and CatBoost exploit this trick natively for categoricals; XGBoost and sklearn do not — they require one-hot encoding. And here is the interview trap: one-hot encoding destroys decision tree performance on high-cardinality categoricals. Each one-hot dummy is a binary feature that can only isolate 1 category at a time. A split on dummy = city_paris peels off Paris versus everyone else, then the next split peels off London, and so on. Depth is wasted, leaves become imbalanced (one category versus the union of all others), and the model fails to learn groupings of similar categories. The correct approach with sklearn/XGBoost is target encoding with smoothing (replace each category with its target mean, regularized by a prior), which recovers most of Fisher's advantage.
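
One common smoothed-target-encoding recipe, as a sketch; the function name and the smoothing constant m are illustrative, and in practice the encoding should be fit on training folds only to avoid target leakage:

smoothed_target_encoding.py (Python sketch)
import pandas as pd

def smoothed_target_encode(cat: pd.Series, y: pd.Series, m: float = 20.0) -> pd.Series:
    """Encode each category as (n_c * mean_c + m * prior) / (n_c + m)."""
    prior = y.mean()
    stats = y.groupby(cat).agg(["mean", "count"])
    enc = (stats["count"] * stats["mean"] + m * prior) / (stats["count"] + m)
    return cat.map(enc).fillna(prior)   # unseen categories fall back to the prior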

Missing Values — Surrogate Splits vs Imputation vs Learned Default

CART's original treatment of missing values is the surrogate split: at each internal node, CART computes not only the primary split but a ranked list of backup (surrogate) splits that mimic the primary split's routing as closely as possible. At inference, if a sample's primary-split feature is missing, the first available surrogate is used. This handles missingness without any imputation and preserves any information encoded in missingness patterns.
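
The core of the surrogate mechanism is small: rank backup splits by how well they reproduce the primary split's left/right routing. A rough sketch (names are mine; real CART implementations additionally discard surrogates that do no better than simply sending every missing sample with the majority):

surrogate_splits.py (Python sketch)
import numpy as np

def find_surrogates(X, primary_feat, primary_thresh, n_surrogates=2):
    """Rank backup (feature, threshold) splits by agreement with the primary routing."""
    primary_left = X[:, primary_feat] <= primary_thresh
    candidates = []
    for j in range(X.shape[1]):
        if j == primary_feat:
            continue
        for t in np.unique(X[:, j])[:-1]:
            agree = np.mean((X[:, j] <= t) == primary_left)
            agree = max(agree, 1.0 - agree)        # surrogates may route in the flipped direction
            candidates.append((agree, j, t))
    candidates.sort(reverse=True)
    return candidates[:n_surrogates]               # used when the primary feature is missing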

What sklearn does not implement. Sklearn's DecisionTreeClassifier does not support surrogate splits. You must impute (mean/median for numerics, mode for categoricals) or encode missingness as a separate feature; recent sklearn releases can also route NaNs with a learned default direction, but surrogates remain absent. rpart (R's CART implementation) does implement surrogates, giving you the original Breiman algorithm. In practice, for sklearn pipelines, the common fix is adding an is_missing_<feature> binary indicator alongside an imputed value, letting the tree learn to split on missingness if it's informative.
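
In sklearn, the indicator-plus-impute pattern is a one-liner with SimpleImputer(add_indicator=True); a minimal sketch:

missing_indicator_pipeline.py (Python sketch)
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Median-impute each feature and append a binary missing-indicator column for
# every feature that had NaNs, so the tree can split on missingness itself.
pipe = make_pipeline(
    SimpleImputer(strategy="median", add_indicator=True),
    DecisionTreeClassifier(max_depth=6, min_samples_leaf=5, random_state=0),
)
# pipe.fit(X_train, y_train)   # X_train may contain np.nan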

XGBoost's learned default direction is a modern alternative: at each split, try routing missing values left vs right, pick the direction that maximizes gain, store as the default. This is simpler than surrogate splits (one default per node vs a full ranked list) but achieves similar effect. LightGBM and CatBoost do the same.

[Diagram: decision boundary comparison — axis-aligned tree partitions vs linear vs kNN]

The Bias-Variance Profile of a Single Tree

A fully-grown decision tree (no pruning, no depth cap, one sample per leaf) has zero training error by construction — every sample is its own leaf. This is maximum variance, near-zero bias. A tiny change in training data (flip one label, add one sample) can completely restructure the tree, because the greedy split at the root cascades into different downstream splits. This instability is both the weakness (generalization is poor) and the strength (it's exactly what bagging exploits: average many unstable, uncorrelated estimators to get a low-variance estimate).

The opposite extreme — a tree of depth 1 (a 'stump', one split) — has high bias, low variance. This is what AdaBoost and gradient boosting use as the weak learner: many high-bias stumps combined to reduce bias sequentially. The depth parameter is the bias-variance knob.

Extrapolation failure. A tree predicts only within the range of training targets. For regression, every leaf predicts the mean of its training samples — so the model's output range is bounded by [min(y_train), max(y_train)]. A tree trained on house prices from 2020 will never predict a 2026 price higher than the highest 2020 price. This is critical in time series, physical systems, or any extrapolation setting — use linear models or Gaussian processes instead.
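
The bound is easy to demonstrate: train a regression tree on a roughly linear trend over x in [0, 10], then ask for a prediction at x = 20. A toy sketch:

tree_extrapolation.py (Python sketch)
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(200, 1))
y_train = 3.0 * X_train.ravel() + rng.normal(0, 0.5, size=200)   # roughly y = 3x

tree = DecisionTreeRegressor(max_depth=6, random_state=0).fit(X_train, y_train)
print(tree.predict([[20.0]]))   # stays near max(y_train) ~ 30, nowhere near 60
print(y_train.max())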

Feature Importance — The Bias Most Engineers Don't Know About

The default feature_importances_ in sklearn (and importance_type='gain' in XGBoost) is Mean Decrease in Impurity (MDI): sum the impurity reduction a feature causes across all splits in all trees, normalize. It's fast — computed for free during training — but statistically biased (Strobl et al. 2007). The bias has two sources: (1) high-cardinality features have more potential split points, so they appear more often as 'best split' even when they carry no signal; (2) correlated features 'share credit' — one gets inflated importance, the other gets near-zero.

Permutation importance (Breiman 2001) fixes the cardinality bias: after training, shuffle one feature across the test set and measure the increase in error. A feature with no predictive value shows no error increase under permutation. Works for any model, costs O(d · n_repeats · inference_time). Still has a weakness: correlated features both show low permutation importance (shuffling one still leaves the other to carry the signal).

TreeSHAP (Lundberg et al. 2020, NeurIPS paper 'From local explanations to global understanding with explainable AI for trees') is the gold standard. It computes exact Shapley values for tree models in polynomial time O(T · L · D²) where T = trees, L = max leaves, D = max depth. Without the tree structure trick, exact Shapley is O(2^d) — exponential. TreeSHAP satisfies three axioms MDI violates: efficiency (feature attributions sum exactly to prediction − baseline), symmetry (features with identical contribution get identical SHAP), dummy (non-contributing features get exactly zero). For any production use — regulatory audits, debugging, feature selection — use TreeSHAP. Reserve MDI for quick sanity checks during model development.
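
A quick way to see both the bias and its fix: add a pure-noise, high-cardinality column next to one genuinely informative feature, then compare MDI with permutation importance on held-out data. A toy sketch (for TreeSHAP, the shap package's TreeExplainer is the usual entry point):

importance_bias_demo.py (Python sketch)
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
signal = rng.normal(size=n)                       # the only informative feature
noise_id = rng.integers(0, 1000, size=n)          # 1000 unique values, zero signal
X = np.column_stack([signal, noise_id]).astype(float)
y = (signal + 0.5 * rng.normal(size=n) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

print("MDI:        ", rf.feature_importances_)    # noise_id picks up substantial spurious credit
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
print("Permutation:", perm.importances_mean)      # noise_id drops to ~0 on held-out data

# Exact per-prediction attributions (TreeSHAP):
# import shap; shap_values = shap.TreeExplainer(rf).shap_values(X_te)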

Tree Algorithm Variants — ID3, C4.5, CART, Oblivious

| Algorithm | Author / Year | Split Type | Criterion | Categorical Handling | Key Use Today |
| --- | --- | --- | --- | --- | --- |
| ID3 | Quinlan 1986 | Multi-way (one branch per category) | Information gain (entropy) | Native multi-way | Historical — taught in textbooks, rarely used. Fragments data fast. |
| C4.5 | Quinlan 1993 | Multi-way for categorical, binary for numeric | Gain ratio (normalizes for cardinality) | Native | Still used in Weka. Handles continuous features and missing values. Error-based post-pruning. |
| CART | Breiman 1984 | Binary only | Gini (classification) / MSE (regression) | Optimal partition via Fisher 1958 sorting | Dominant today: sklearn, rpart, base of Random Forest. |
| Oblivious (CatBoost) | Dorogush et al. 2018 | Binary, symmetric (same split at every node of a level) | MSE with ordered boosting | Native with target encoding | Production inference: vectorized, cache-friendly, ~100× faster than level-wise. Less expressive but more robust to overfitting. |
| Conditional Inference Trees | Hothorn 2006 | Binary, statistical tests | Permutation-test p-values | Supported | Unbiased feature selection (corrects MDI bias). Slower than CART. |

Staff-Level Deep Dive — Oblivious Trees and Vectorized Inference

Standard CART / XGBoost / LightGBM trees have a different split at every internal node — the structure is irregular, which means inference requires a per-sample traversal with branch mispredictions on every level. On modern CPUs, this limits throughput to ~100K predictions/sec/core for a depth-6 tree.

Oblivious trees (used by CatBoost, also in Microsoft's MatrixNet since 2009) are symmetric: at every node of level d, the same feature-threshold split is used. A depth-6 oblivious tree is fully described by 6 (feature, threshold) pairs and 64 leaf values. Inference becomes a branchless 6-bit lookup: compute 6 comparisons, pack them into a 6-bit index, index into a 64-element lookup table. This vectorizes trivially with SIMD — achievable throughput is ~10M predictions/sec/core, a roughly 100× speedup.
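
A minimal sketch of the lookup-table idea for a single oblivious tree; all names and values here are illustrative, not CatBoost internals. One vectorized comparison per level, pack the bits into a leaf index, then index a table of 2^D leaf values:

oblivious_inference.py (Python sketch)
import numpy as np

def oblivious_predict(X, features, thresholds, leaf_values):
    """Branchless inference for one depth-D oblivious tree.

    features, thresholds: length-D arrays (one shared split per level).
    leaf_values: length-2^D array of leaf predictions.
    """
    D = len(features)
    bits = (X[:, features] > thresholds).astype(np.int64)   # (n, D) comparison matrix
    idx = bits @ (1 << np.arange(D))                         # pack D bits into a leaf index
    return leaf_values[idx]

# Toy depth-3 tree: 3 (feature, threshold) pairs, 8 leaf values, 5 samples, 4 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
print(oblivious_predict(X,
                        features=np.array([0, 2, 1]),
                        thresholds=np.array([0.0, 0.5, -0.3]),
                        leaf_values=rng.normal(size=8)))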

The tradeoff: oblivious trees are less expressive per tree (symmetric structure can't adapt branches differently), so you need more trees to match the accuracy of a standard ensemble. For latency-critical inference (ad serving, fraud scoring at hundreds of thousands of QPS), the 100× inference speedup is worth the 2× more trees. This is why CatBoost is preferred at Yandex, Uber, and other companies with strict latency SLAs on tree-based models.

Interview signal: mentioning oblivious trees and the symmetric-structure → vectorized-inference argument is an instant staff-level marker. Almost no prep content covers this.

⚠ WARNING

Five Decision Tree Traps That Fail Interviews

1. 'One-hot encode your high-cardinality categorical.' Wrong for trees. One-hot forces trees to peel off categories one at a time, wasting depth. Use target encoding with smoothing, or a library that handles categoricals natively (LightGBM, CatBoost).

2. 'Trees can fit any function.' Only piecewise — trees cannot fit a smooth curve without infinite leaves. sin(x) fits badly, extrapolation fails (predictions are bounded by training target range). For smooth signal, use Gaussian processes or neural networks.

3. 'Feature importance tells you which features matter.' Not if you use the default MDI. Strobl 2007 proved MDI is biased toward high-cardinality features. Use permutation importance (fixes cardinality bias) or TreeSHAP (fixes both cardinality and correlation bias).

4. 'Just prune the tree and it'll generalize.' Only for standalone trees. Inside Random Forest or XGBoost, pruning destroys ensemble diversity. The regularizer in ensembles is shallow depth + learning rate, not pruning.

5. 'Trees are scale-invariant, so I don't need to normalize.' True for the splits — trees only use feature ranks. But if you're adding tree features to a linear model downstream (stacking), you still need to normalize that downstream model's inputs.

Evaluation and Diagnostic Checklist for Decision Trees

| Check | What to Measure | Pass Criterion | Failure Mode It Catches |
| --- | --- | --- | --- |
| Train vs validation gap | Train acc − val acc | < 5 pp for a single tree | Overfitting — reduce max_depth or increase min_samples_leaf |
| Learning curve | Error vs training set size | Both train and val converge | Flat val curve → high bias; diverging → high variance |
| Feature importance stability | MDI ranking across 10 bootstrap fits | Top-10 overlap > 80% | Unstable importance → correlated features or too little data |
| Permutation importance | Error increase under shuffle | Top features match MDI top | Mismatch → MDI biased; trust permutation |
| Tree size | Number of leaves | < 100 for interpretability; < 1000 before pruning | Too many leaves → overfitting, high memory |
| Leaf sample counts | Distribution of samples per leaf | Min ≥ 5 samples | Leaves with 1 sample = memorization |
| Prediction confidence | Distribution of leaf purities | Not all 0 or 1 | All extreme → uncalibrated; apply Platt scaling if using as a classifier |

TIP

What to Say in Your Interview Tomorrow

When asked about decision trees, anchor on four things: (1) CART is greedy because the globally optimal tree is NP-hard (Hyafil & Rivest 1976), and binary-only because multi-way fragments data too fast. (2) Gini vs entropy rarely matters — Raileanu 2004 showed they agree on > 98% of splits; use Gini because it's ~10% faster (no log). (3) A single tree is always beaten by an ensemble — single trees exist only as weak learners inside Random Forest / XGBoost / LightGBM / CatBoost. Never deploy one alone. (4) Default feature importance is biased (Strobl 2007); use permutation importance for quick checks, TreeSHAP (Lundberg 2020) for regulatory-grade explanations. For staff-level signal, mention oblivious trees (CatBoost) — symmetric structure enables vectorized SIMD inference, ~100× throughput over standard trees. The interviewer's trap is usually 'would you one-hot encode?' — answer: no, one-hot wastes tree depth on high-cardinality features; use target encoding or a library with native categorical support.
