What Is a Random Forest?
A Random Forest is an ensemble of many decision trees that vote (for classification) or average (for regression) to produce a single, more stable prediction. In the previous chapter you saw that a lone decision tree is powerful but twitchy: change a few rows of training data and the tree can redraw its splits completely. That instability is the tree's biggest weakness. A Random Forest fixes it by growing hundreds of slightly different trees and pooling their opinions.
The core insight is statistical. A single deep tree has low bias but high variance — it fits the training data well but is sensitive to noise. If you average many noisy-but-roughly-correct predictors, the errors partly cancel out and the variance drops, while the bias stays about the same. The forest keeps the tree's flexibility but throws away most of its jitter.
Intuitive analogy. Imagine a hiring committee at an Indian IT services firm deciding whether to make an offer to a candidate named Rahul. If one interviewer decides alone, the outcome swings on that person's mood, biases, and the two questions they happened to ask. But if fifteen interviewers each see a slightly different slice of the interview (one focuses on coding, another on system design, another on communication) and then vote, the panel's collective decision is far more reliable than any single interviewer's. A Random Forest is that panel: each tree is one imperfect interviewer, and the forest is the vote.
Goal: build many decorrelated trees so that their individual mistakes are independent enough to average away, then aggregate their predictions into one low-variance model.
Examples:
→ Predict whether a customer will churn from usage + billing features
→ Score credit-card applications as approve / reject
→ Predict house price bands from area, locality, age, amenities
→ Detect fraudulent UPI transactions from transaction metadata
Two Sources of Randomness: Bagging + Random Features
A Random Forest is not just "many trees on the same data": that would give you 100 nearly identical trees whose extra votes add nothing (they'd all make the same mistakes). The magic is in deliberately making the trees different, and the forest does this in two ways.
Bagging = Bootstrap Aggregating
Bagging stands for bootstrap aggregating. For each tree, the forest draws a bootstrap sample: it samples rows from the training set with replacement, drawing the same number of rows as the original data. Because sampling is with replacement, some rows appear several times and others are left out entirely for that tree.
Bootstrap sample (n rows, sampled WITH replacement):
Original rows: [1, 2, 3, 4, 5, 6, 7, 8]
Tree 1 sees: [2, 2, 5, 1, 7, 8, 3, 3] (rows 4, 6 left out)
Tree 2 sees: [1, 4, 4, 6, 6, 2, 8, 5] (rows 3, 7 left out)
...
Each tree trains on a slightly different dataset → each tree is slightly different.
On average, a bootstrap sample of size n includes about 63.2% of the unique original rows; the roughly 36.8% that are left out are called the out-of-bag (OOB) rows for that tree. (That number comes from the limit 1 - (1 - 1/n)^n → 1 - 1/e ≈ 0.632.) We will use those left-out rows for free validation in a moment.
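The 63.2% figure is easy to check empirically. The sketch below uses only NumPy to draw one bootstrap sample and count the unique and out-of-bag rows; the sample size and seed are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility only
n = 10_000                       # illustrative training-set size

# Draw n row indices WITH replacement: one tree's bootstrap sample.
sample_idx = rng.integers(0, n, size=n)

in_bag = np.unique(sample_idx)               # rows this tree sees at least once
oob = np.setdiff1d(np.arange(n), in_bag)     # rows this tree never sees

unique_fraction = in_bag.size / n
oob_fraction = oob.size / n

print(f"unique rows in sample:   {unique_fraction:.3f}")   # close to 0.632
print(f"out-of-bag rows:         {oob_fraction:.3f}")      # close to 0.368
print(f"theory, 1 - (1 - 1/n)^n: {1 - (1 - 1/n) ** n:.3f}")
```

Rerunning with a different seed moves the fractions only in the third decimal place, which is why the 63.2% / 36.8% split can be treated as a constant for any reasonably large n.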
Random Feature Subsets at Each Split
Bagging alone is not enough. If one feature is very strong (say, credit_score in a loan model), every tree would pick it as the top split, and all the trees would look similar — their errors would be correlated, so averaging wouldn't help much.
Random Forests break this correlation with a second trick: at every split, the tree is only allowed to consider a random subset of the features (of size max_features), not all of them. Sometimes the strong feature isn't even in the candidate set, forcing that split to use something else. The result is a collection of decorrelated trees that make different kinds of mistakes.
Typical max_features defaults:
Classification: sqrt(total_features) e.g. 30 features → try ~5 per split
Regression: total_features / 3 (the classic rule of thumb) or 1.0 (all features, scikit-learn's current default)
Smaller max_features → more decorrelation, more diverse trees, but each tree is weaker.
Larger max_features → trees stronger individually but more correlated.
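To see what these settings resolve to in practice, the following sketch fits tiny forests on synthetic 30-feature data and reads the resolved integer from a fitted tree's max_features_ attribute. The dataset and the particular settings tried are assumptions chosen only to mirror the 30-feature example above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 30 features, matching the example in the text.
X, y = make_classification(n_samples=300, n_features=30, random_state=0)

resolved = {}
for setting in ["sqrt", "log2", 0.5, 10]:
    rf = RandomForestClassifier(n_estimators=5, max_features=setting, random_state=0)
    rf.fit(X, y)
    # Each fitted tree stores the resolved integer count in max_features_.
    resolved[setting] = rf.estimators_[0].max_features_
    print(f"max_features={setting!r:>6} -> {resolved[setting]} candidate features per split")
```

A float is read as a fraction of the total feature count, and an int is used as-is.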
Why decorrelation matters mathematically: if you average B predictions each with variance σ² and pairwise correlation ρ, the variance of the average is ρσ² + (1 - ρ)σ² / B. As B → ∞ the second term vanishes, but the first term ρσ² does not. So the lower the correlation ρ, the lower the floor on the ensemble's variance. Random feature selection exists purely to push ρ down.
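The variance formula can be checked with a small simulation. This is a sketch under simplifying assumptions: the "predictors" are just Gaussian noise terms built from one shared component and one private component so that their pairwise correlation is exactly ρ; no trees are involved, and B, σ, and the trial count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, B, n_trials = 1.0, 50, 20_000

def variance_of_average(rho):
    """Empirical variance of the mean of B predictors with pairwise correlation rho."""
    shared = rng.standard_normal((n_trials, 1))    # error common to all B predictors
    private = rng.standard_normal((n_trials, B))   # error unique to each predictor
    preds = sigma * (np.sqrt(rho) * shared + np.sqrt(1 - rho) * private)
    return preds.mean(axis=1).var()

results = {}
for rho in [0.0, 0.3, 0.8]:
    empirical = variance_of_average(rho)
    theory = rho * sigma**2 + (1 - rho) * sigma**2 / B
    results[rho] = (empirical, theory)
    print(f"rho={rho:.1f}  empirical={empirical:.4f}  formula={theory:.4f}")
```

With ρ = 0 the variance shrinks roughly as 1/B, while with ρ = 0.8 it stays near 0.8σ² no matter how many predictors are averaged, which is the floor the formula describes.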
How Predictions Are Made
Once the forest is trained, prediction is simple: run the input down every tree and combine the answers.
Classification (RandomForestClassifier):
Each tree predicts a class → the forest takes a MAJORITY VOTE.
(scikit-learn actually averages the trees' class PROBABILITIES,
then picks the argmax — a soft vote.)
Regression (RandomForestRegressor):
Each tree predicts a number → the forest AVERAGES them.
Because the final prediction is an average (or vote) over many trees, the forest's output moves in many small steps instead of a few large ones, so it is far less jumpy than a single tree's step-like output.
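The soft vote can be reproduced by hand. The sketch below, on an assumed synthetic dataset, averages each tree's predict_proba output and checks that the result agrees with what the forest itself reports.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
rf = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Soft vote by hand: average every tree's class-probability matrix.
tree_probas = np.stack([tree.predict_proba(X) for tree in rf.estimators_])
manual_proba = tree_probas.mean(axis=0)
manual_pred = rf.classes_[manual_proba.argmax(axis=1)]

matches_proba = np.allclose(manual_proba, rf.predict_proba(X))
matches_pred = np.array_equal(manual_pred, rf.predict(X))
print("probabilities match:", matches_proba)
print("predicted labels match:", matches_pred)
```

A hard majority vote over the trees' predicted labels usually gives the same answer, but it can differ on borderline rows where a few confident trees outweigh many lukewarm ones.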
Out-of-Bag (OOB) Error: Free Validation
Here is one of the most elegant features of Random Forests. Remember that each tree left out about 36.8% of the rows (its OOB rows). For any given training row, roughly a third of the trees never saw it during training. So we can predict that row using only the trees that did not train on it — a genuine held-out prediction — and we get this for every row without ever setting aside a separate validation set.
OOB error procedure:
1. For row i, find all trees that did NOT include row i in their bootstrap sample.
2. Aggregate ONLY those trees' predictions for row i.
3. Compare against the true label for row i.
4. Average the error over all rows → the OOB error (or oob_score_ = accuracy/R²).
The OOB score is an (almost) unbiased estimate of the test error —
similar in spirit to cross-validation, but essentially free.
Turn it on with oob_score=True. It is a great quick sanity check, though for a final reported estimate many practitioners still use a proper train-test split or cross-validation (see the Train-Test Split & Cross-Validation chapter), especially on small datasets where the OOB estimate can be noisy.
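The OOB procedure can be sketched by hand with plain decision trees. This is an illustrative re-implementation, not scikit-learn's internal code: it uses a hard vote over labels (scikit-learn averages probabilities), and the synthetic dataset, tree count, and seeds are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
n, n_trees = len(X), 100

votes = np.zeros((n, 2))  # per-row vote counts for class 0 and class 1

for _ in range(n_trees):
    idx = rng.integers(0, n, size=n)     # bootstrap sample for this tree
    oob_mask = np.ones(n, dtype=bool)
    oob_mask[idx] = False                # True only for rows the tree never saw
    tree = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.integers(0, 2**31 - 1))
    )
    tree.fit(X[idx], y[idx])
    # Steps 1-2: each row collects votes only from trees that did not train on it.
    oob_pred = tree.predict(X[oob_mask])
    votes[np.flatnonzero(oob_mask), oob_pred] += 1

has_vote = votes.sum(axis=1) > 0         # rows that were OOB for at least one tree
# Steps 3-4: compare the aggregated OOB vote with the true label and average.
oob_accuracy = (votes[has_vote].argmax(axis=1) == y[has_vote]).mean()
print(f"manual OOB accuracy: {oob_accuracy:.3f}")
```

Each row ends up scored by roughly 37 of the 100 trees, so no separate validation set is needed to get an honest estimate.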
Feature Importance (and Its Caveats)
A Random Forest can tell you which features mattered via the feature_importances_ attribute. The default measure is mean decrease in impurity (MDI), also called Gini importance: for each feature, sum up how much every split on that feature reduced impurity (Gini or entropy), weighted by how many samples reached that split, then average over all trees and normalise so the importances sum to 1.
importance(feature f) ≈ average over all trees of
Σ (weighted impurity decrease at every split that used feature f)
Importances are normalised to sum to 1 across all features.
This is useful but comes with serious caveats you must know:
- Bias toward high-cardinality features. MDI inflates the importance of continuous features and categorical features with many levels (like a customer ID or a pincode), simply because they offer more places to split. A useless ID column can look "important."
- Correlated features share credit. If two features carry the same signal, the forest splits on them somewhat arbitrarily, so the importance gets diluted across both. Neither looks as important as it truly is, which can hide a strong predictor.
- It is impurity-based, not accuracy-based. A feature can reduce training impurity without helping generalisation.
The more trustworthy alternative is permutation importance: shuffle one feature's values in a held-out set and measure how much the score drops. If shuffling a feature barely hurts performance, it wasn't really being used. Use sklearn.inspection.permutation_importance on validation data. Treat all importances as a ranking hint, never as proof of causation.
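The idea behind permutation importance fits in a few lines. The sketch below is a hand-rolled version for illustration only (in practice, prefer sklearn.inspection.permutation_importance); the synthetic data, with three informative columns followed by three pure-noise columns, and the helper name permutation_drop are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# shuffle=False keeps the 3 informative features in columns 0-2; columns 3-5 are noise.
X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

rng = np.random.default_rng(0)
baseline = rf.score(X_val, y_val)

def permutation_drop(col, n_repeats=10):
    """Mean drop in validation accuracy after shuffling one column."""
    drops = []
    for _ in range(n_repeats):
        X_perm = X_val.copy()
        X_perm[:, col] = rng.permutation(X_perm[:, col])  # break the feature-target link
        drops.append(baseline - rf.score(X_perm, y_val))
    return float(np.mean(drops))

drops = [permutation_drop(c) for c in range(X.shape[1])]
for c, d in enumerate(drops):
    kind = "informative" if c < 3 else "noise"
    print(f"feature {c} ({kind}): mean accuracy drop = {d:+.3f}")
```

Noise columns should show drops near zero (sometimes slightly negative), because the model's held-out accuracy never depended on them.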
Key Hyperparameters
You can go a long way with defaults, but understanding the main knobs lets you trade off accuracy, speed, and overfitting.
| Hyperparameter | What it controls | Typical values | Effect of increasing it |
|---|---|---|---|
| n_estimators | Number of trees in the forest | 100 to 1000+ | More trees = more stable, diminishing returns; never overfits, just slower |
| max_depth | Maximum depth of each tree | None (grow full) or 10-30 | Deeper = lower bias, higher variance per tree |
| max_features | Features considered per split | sqrt, log2, a float, or an int | Larger = stronger but more correlated trees |
| min_samples_leaf | Min samples required in a leaf | 1 to 50 | Larger = smoother, more regularised trees |
| min_samples_split | Min samples to allow a split | 2 upward | Larger = shallower, more regularised trees |
| bootstrap | Whether to bootstrap rows | True (default) | False disables bagging (rarely wanted) |
| class_weight | Reweight classes for imbalance | None or balanced | balanced up-weights the rare class |
| n_jobs | CPU cores to use | -1 for all cores | Faster training; no effect on the model itself |
Two rules of thumb worth memorising:
- n_estimators is not a regularisation knob. Adding trees cannot make a Random Forest overfit; it only reduces variance until it plateaus. Pick a number where the OOB score stops improving (often a few hundred) and stop there for speed.
- The regularisation knobs are max_depth, min_samples_leaf, and max_features. If the forest overfits, shrink depth, raise min_samples_leaf, or lower max_features.
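One way to find where the OOB score stops improving is to grow the forest incrementally with warm_start=True, which keeps the existing trees and only adds new ones on each refit. The sketch below assumes a synthetic dataset and an arbitrary schedule of forest sizes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=0
)

# warm_start=True reuses the trees already grown and only adds new ones on refit.
rf = RandomForestClassifier(warm_start=True, oob_score=True, random_state=0)

oob_by_size = {}
for n_trees in [25, 50, 100, 200, 400]:
    rf.set_params(n_estimators=n_trees)
    rf.fit(X, y)
    oob_by_size[n_trees] = rf.oob_score_
    print(f"{n_trees:>4} trees -> OOB accuracy {rf.oob_score_:.3f}")
```

The score typically climbs quickly at first and then flattens; past that point extra trees cost training time without buying accuracy.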
A Full scikit-learn Example
Let's build a churn classifier for a fictional Indian telecom, "BharatFiber," predicting whether a customer will churn from a handful of account features. The code shows a train-test split, OOB scoring, feature importances, and permutation importance.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.inspection import permutation_importance
# --- 1. Load / prepare data (illustrative synthetic frame) ---
# Features: monthly bill (Rs), tenure (months), support calls, data usage (GB), is_prepaid
df = pd.read_csv("bharatfiber_churn.csv")
feature_cols = ["monthly_bill", "tenure_months", "support_calls", "data_gb", "is_prepaid"]
X = df[feature_cols]
y = df["churned"] # 1 = churned, 0 = stayed
# --- 2. Train-test split (stratify to keep class ratio) ---
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# --- 3. Build and train the forest ---
rf = RandomForestClassifier(
    n_estimators=500,
    max_features="sqrt",      # try sqrt(n_features) candidates per split
    max_depth=None,           # let trees grow; bagging controls variance
    min_samples_leaf=2,       # light regularisation
    oob_score=True,           # free out-of-bag validation
    class_weight="balanced",  # churn is usually the minority class
    n_jobs=-1,                # use all CPU cores
    random_state=42,
)
rf.fit(X_train, y_train)
# --- 4. Evaluate ---
print(f"OOB score (accuracy): {rf.oob_score_:.3f}")
y_pred = rf.predict(X_test)
print(f"Test accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, digits=3))
Illustrative output:
OOB score (accuracy): 0.842
Test accuracy: 0.872
[[612 38]
[ 71 129]]
precision recall f1-score support
0 0.896 0.942 0.918 650
1 0.772 0.645 0.703 200
accuracy 0.872 850
Notice the OOB score (0.842) lands within a few points of the test accuracy (0.872), a reassuring sign that the OOB estimate is doing its job as a stand-in for held-out data.
Reading the Importances
# --- 5a. Built-in (MDI / Gini) importance ---
importances = pd.Series(rf.feature_importances_, index=feature_cols)
print(importances.sort_values(ascending=False))
# --- 5b. Permutation importance on the TEST set (more trustworthy) ---
perm = permutation_importance(
    rf, X_test, y_test, n_repeats=20, random_state=42, n_jobs=-1
)
perm_imp = pd.Series(perm.importances_mean, index=feature_cols)
print(perm_imp.sort_values(ascending=False))
Illustrative output:
# MDI importance (sums to 1.0)
tenure_months 0.34
monthly_bill 0.27
support_calls 0.21
data_gb 0.13
is_prepaid 0.05
# Permutation importance (mean drop in accuracy when shuffled)
tenure_months 0.061
support_calls 0.048
monthly_bill 0.033
data_gb 0.009
is_prepaid 0.002
The two rankings mostly agree, but support_calls looks more important under permutation than under MDI — a reminder to cross-check. A cheap final validation with cross-validation:
scores = cross_val_score(rf, X, y, cv=5, scoring="f1")
print(f"5-fold F1: {scores.mean():.3f} +/- {scores.std():.3f}")
# Illustrative: 5-fold F1: 0.698 +/- 0.021
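One practical convenience: Random Forests are insensitive to feature scale. Unlike KNN or SVM, they do not need feature scaling because splits compare a feature against a threshold rather than measuring distances. You still need to encode categorical features, and unless your scikit-learn version handles NaN natively in forests (only recent releases do), impute missing values before fitting.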
Why It Usually Beats a Single Tree
Put simply: a single tree is a high-variance estimator, and a Random Forest is a variance-reduction machine built out of many such trees.
Single deep tree: low bias, HIGH variance → overfits, unstable
Random Forest: low bias, LOW variance → averages the jitter away
Bias stays about the same (each tree is still flexible),
but variance drops because averaging decorrelated predictors cancels errors.
You will meet this bias-variance trade-off formally in the Bias-Variance, Overfitting & Regularization chapter. For now, the takeaway is that the forest keeps the tree's ability to model complex, non-linear interactions while sanding down the instability that makes a single tree unreliable. In practice a Random Forest is one of the strongest baselines you can reach for on tabular data: it works well out of the box, tolerates messy features, and rarely embarrasses you.
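The variance claim can be measured directly by retraining on many fresh samples and watching how much the predictions move. The sketch below is a toy setup under stated assumptions: a one-feature noisy sine curve, 20 independent training sets, and 50 trees per forest. With a single feature, the forest here relies on bagging alone, since there is no feature subset to randomise.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x_grid = np.linspace(0, 6, 50).reshape(-1, 1)   # fixed points where predictions are compared

def draw_training_set(n=200):
    """Fresh noisy sample from the same underlying curve, y = sin(x) + noise."""
    x = rng.uniform(0, 6, size=(n, 1))
    y = np.sin(x[:, 0]) + rng.normal(0, 0.3, size=n)
    return x, y

tree_preds, forest_preds = [], []
for seed in range(20):                          # 20 independent training sets
    x, y = draw_training_set()
    tree = DecisionTreeRegressor(random_state=seed).fit(x, y)
    forest = RandomForestRegressor(n_estimators=50, random_state=seed).fit(x, y)
    tree_preds.append(tree.predict(x_grid))
    forest_preds.append(forest.predict(x_grid))

# Variance across training sets at each grid point, then averaged over the grid.
tree_var = np.var(tree_preds, axis=0).mean()
forest_var = np.var(forest_preds, axis=0).mean()
print(f"single tree variance: {tree_var:.4f}")
print(f"forest variance:      {forest_var:.4f}")
```

The single tree's variance sits close to the noise level it memorises, while the forest's is noticeably lower on the same data.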
Random Forest vs Single Tree vs Boosting
Random Forest is a bagging ensemble — trees are built independently and in parallel. Boosting (covered in the Ensemble Methods chapter) builds trees sequentially, each new tree focusing on the previous ones' mistakes. They reduce error in fundamentally different ways.
| Aspect | Single Decision Tree | Random Forest (bagging) | Boosting (e.g. XGBoost) |
|---|---|---|---|
| How trees are built | One tree | Many, in parallel, independent | Many, sequential, error-correcting |
| Primary error reduced | Neither well | Variance | Bias (and some variance) |
| Trees are | The model | Deep, decorrelated | Usually shallow (weak learners) |
| Overfitting risk | High | Low | Higher; needs careful tuning |
| Tuning effort | Low | Low (great defaults) | High (learning rate, depth, rounds) |
| Training | Fast | Parallel, moderate | Slower, harder to parallelise |
| Interpretability | High (readable rules) | Lower (importances only) | Lowest |
| Typical accuracy on tabular | Baseline | Strong | Often the best |
When to reach for each: use a single tree when you need a human-readable rule set; use a Random Forest as your reliable, low-effort strong baseline; escalate to boosting when you need to squeeze out the last few points of accuracy and are willing to tune carefully.
Common Mistakes
1. Treating n_estimators like a regularisation dial
More trees never cause overfitting in a Random Forest — they only reduce variance until performance plateaus, then waste compute. If your forest is overfitting, do not cut trees; instead reduce max_depth, raise min_samples_leaf, or lower max_features.
2. Trusting MDI feature importance blindly
Default feature_importances_ is biased toward high-cardinality and continuous features and dilutes credit across correlated ones. A meaningless customer_id column can top the chart. Cross-check with permutation importance on held-out data before making any business claim, and never read importance as causation.
3. Reporting the OOB score as your final test metric on tiny data
OOB is a fine quick estimate, but on small datasets it is noisy and can be optimistic. For a number you'll put in a report, use a proper hold-out or cross-validation. And do not expect an OOB score with bootstrap=False: without bootstrap sampling there are no out-of-bag rows, and scikit-learn raises an error if you combine it with oob_score=True.
4. Forgetting to handle class imbalance
On a 2%-fraud dataset a forest can hit 98% accuracy by predicting "not fraud" every time. Use class_weight="balanced", resampling, or threshold tuning, and evaluate with precision/recall/F1 rather than raw accuracy (see the Model Evaluation Metrics chapter).
5. Leaking data before the split
Fitting encoders, imputers, or feature-selection on the full dataset and then splitting leaks test information into training and inflates the OOB and CV scores. Do all preprocessing inside a Pipeline fit only on the training fold.
6. Expecting good extrapolation
A regression forest predicts by averaging the training targets stored in its leaves, so its outputs are bounded by the range of the training targets. It will not extrapolate a rising trend beyond the data (a common surprise when forecasting time series with a Random Forest).
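The extrapolation limit in mistake 6 is easy to demonstrate. The sketch below trains a regression forest on an assumed toy trend, y = 2x for x between 0 and 10, and then asks for predictions well outside that range.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy rising trend: y = 2x on x in [0, 10].
x_train = np.linspace(0, 10, 200).reshape(-1, 1)
y_train = 2 * x_train[:, 0]

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(x_train, y_train)

# The last two query points lie outside the training range.
x_new = np.array([[5.0], [10.0], [20.0], [50.0]])
preds = rf.predict(x_new)
for x_val, p in zip(x_new[:, 0], preds):
    print(f"x={x_val:>5.1f}  true={2 * x_val:>6.1f}  forest={p:>6.2f}")
```

Every query beyond x = 10 falls into the same rightmost leaf of each tree, so the forest returns the same flat value near the largest training target no matter how far out you go.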
Practice Exercises
- Bootstrap intuition. For a training set of n = 1000 rows, roughly how many unique rows appear in one bootstrap sample, and how many are out-of-bag? Which fraction gets used for that tree's OOB prediction?
- Decorrelation. Explain in two or three sentences why setting max_features to all features (so every tree can pick the best split from every feature) tends to hurt a Random Forest, even though each individual tree becomes stronger.
- Tuning. Your forest gets 99% training accuracy but only 78% on the test set. List three hyperparameter changes that would reduce this overfitting, and say which change would not help (and why).
- OOB check. Train a RandomForestClassifier(oob_score=True) on any tabular dataset (e.g. the built-in load_breast_cancer). Compare oob_score_ against a 5-fold cross-validated accuracy. Are they close? What does a large gap suggest?
- Importance audit. On the same dataset, compare feature_importances_ (MDI) with permutation_importance on a held-out set. Add a random noise column and a duplicate of an existing column, refit, and describe how each importance measure reacts to them.
- Ensemble comparison. Train a single DecisionTreeClassifier, a RandomForestClassifier, and (if available) a boosting model on the same split. Compare their test F1 scores and training times, and relate the differences back to the bias-variance and bagging-vs-boosting ideas from this chapter.
Summary
In this chapter you learned:
- A Random Forest is an ensemble of decision trees that vote (classification) or average (regression), turning many high-variance trees into one low-variance model.
- Its power comes from two sources of randomness: bagging (bootstrap samples of rows drawn with replacement) and random feature subsets (max_features) at each split, which together produce decorrelated trees.
- Averaging decorrelated predictors reduces variance without raising bias, the key reason a forest usually beats a single tree.
- Out-of-bag (OOB) error uses the ~36.8% of rows each tree didn't train on to give near-free, almost-unbiased validation via oob_score=True.
- Feature importance (feature_importances_, i.e. MDI/Gini) ranks features but is biased toward high-cardinality features and dilutes credit across correlated ones; verify with permutation importance.
- Key hyperparameters: n_estimators (more is safer, never overfits), and the real regularisers max_depth, max_features, and min_samples_leaf.
- In scikit-learn, RandomForestClassifier and RandomForestRegressor need no feature scaling and support n_jobs=-1 (the classifier also supports class_weight="balanced"), and they make an excellent strong baseline on tabular data.
- Random Forest is a bagging ensemble (reduces variance); boosting builds trees sequentially to reduce bias. You'll compare them fully in the Ensemble Methods chapter.
A Random Forest is the model to reach for when you want strong, stable, low-effort results on tabular data before investing in heavier tuning.
Next up: Naive Bayes — a fast, probabilistic classifier built on Bayes' theorem and a bold independence assumption that works surprisingly well for text and high-dimensional data.