Why We Never Evaluate on Training Data
Imagine Priya is preparing for a competitive exam. She practises with a question bank, memorises the answers, and then re-attempts the same question bank — scoring 100%. Is she ready for the real exam? Obviously not. The only honest test is a fresh set of questions she has never seen.
Machine learning models face exactly this trap. A model that is both trained and scored on the same data can simply memorise it — including the noise — and report a dazzling accuracy that collapses the moment real, unseen data arrives. This gap is the difference between memorisation and generalisation, and it is the single most important idea in evaluating any model.
- Memorisation: the model fits the training examples (and their quirks) very closely. Training score is high; real-world score is low. This is overfitting.
- Generalisation: the model captures the underlying pattern, so it performs well on data it has never encountered. This is the goal.
The fix is simple to state: hold out some data, train on the rest, and evaluate only on the held-out portion. That held-out data is the model's "real exam". This chapter is about doing that correctly — because doing it incorrectly is astonishingly easy and quietly ruins more projects than any algorithm choice ever will.
Golden rule of ML evaluation:
The data used to TRAIN a model must never be used to SCORE it.
A score computed on training data is meaningless as a predictor of real performance.
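To put numbers on the gap between memorisation and generalisation, here is a minimal sketch. It assumes a synthetic dataset with deliberately noisy labels and a 1-nearest-neighbour classifier; both are illustrative choices rather than anything used later in this chapter. 1-NN is convenient because it memorises the training rows by construction.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (an assumption for illustration). flip_y=0.2 assigns a random
# class to about 20% of the rows, so there is label noise to memorise.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

# 1-NN stores the training set; each training row is its own nearest neighbour.
memoriser = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)

train_acc = memoriser.score(X_tr, y_tr)
test_acc = memoriser.score(X_te, y_te)
print(f"Train accuracy: {train_acc:.3f}")
print(f"Test accuracy:  {test_acc:.3f}")
```

The training score is perfect because the model is looking up answers it has already stored. Only the test score says anything about unseen data.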
The Train-Test Split
The most basic honest evaluation is the holdout method: randomly partition your dataset into two parts.
- Training set (typically 70–80%): the model learns its parameters here.
- Test set (typically 20–30%): kept in a locked drawer until the very end, used once to report final performance.
Scikit-learn provides train_test_split for this.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
X, y = load_breast_cancer(return_X_y=True)
print(X.shape) # (569, 30) -> 569 samples, 30 features
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2, # 20% goes to the test set
random_state=42, # fixed seed -> reproducible split
stratify=y # preserve class proportions (see below)
)
print(X_train.shape, X_test.shape) # (455, 30) (114, 30)
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train) # learn on training data only
print("Train accuracy:", model.score(X_train, y_train))
print("Test accuracy:", model.score(X_test, y_test))
Illustrative output:
(569, 30)
(455, 30) (114, 30)
Train accuracy: 0.978
Test accuracy: 0.965
The test accuracy is the number you trust. If training accuracy were 0.99 and test accuracy only 0.72, that large gap would scream overfitting.
Key Arguments of train_test_split
| Argument | What it controls | Typical value |
|---|---|---|
| test_size | Fraction (or count) of rows sent to the test set | 0.2 or 0.25 |
| train_size | Fraction for training (optional; inferred if omitted) | usually left blank |
| random_state | Seed for the random shuffle so the split is reproducible | any fixed int, e.g. 42 |
| stratify | Keeps class proportions identical in train and test | y for classification |
| shuffle | Whether to shuffle before splitting (default True) | False only for time series |
Why random_state Matters
Without a fixed random_state, every run produces a different split and therefore a different score. Fixing the seed makes results reproducible — a teammate running your notebook gets the exact same numbers. It does not make the model better; it just removes randomness from the comparison.
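A quick way to convince yourself is to split a list of row ids twice with the same seed and once with a different one. The sketch below uses a stand-in array of row ids (an assumption for illustration) instead of real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(100)  # stand-in row ids, purely illustrative

# Same seed twice -> identical test rows.
_, test_a = train_test_split(rows, test_size=0.2, random_state=42)
_, test_b = train_test_split(rows, test_size=0.2, random_state=42)

# A different seed -> a different selection of test rows.
_, test_c = train_test_split(rows, test_size=0.2, random_state=7)

print("Seed 42 vs seed 42 identical:", np.array_equal(test_a, test_b))
print("Seed 42 vs seed 7 identical: ", np.array_equal(test_a, test_c))
```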
Why stratify Matters
Suppose Rahul is building a fraud detector where only 3% of transactions are fraudulent. A naive random split might, by bad luck, put almost all fraud cases into the training set and leave the test set with barely any. Stratified splitting keeps each class's share in train and test as close as possible to its share in the full dataset.
Full dataset: 97% legit, 3% fraud
Without stratify -> test set might be 99% legit, 1% fraud (misleading)
With stratify -> test set is about 97% legit, 3% fraud (faithful)
Always pass stratify=y for classification, especially with imbalanced classes.
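The effect is easy to check on made-up labels. The sketch below assumes a hypothetical set of 1000 transactions with 30 fraud cases and compares the fraud share in the test set with and without stratify across a few seeds; the array names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 1000 transactions, 30 of them fraudulent (3%).
y_fraud = np.array([1] * 30 + [0] * 970)
X_fraud = np.arange(len(y_fraud)).reshape(-1, 1)  # placeholder feature column

plain_counts, strat_counts = [], []
for seed in range(3):
    _, _, _, y_plain = train_test_split(
        X_fraud, y_fraud, test_size=0.2, random_state=seed
    )
    _, _, _, y_strat = train_test_split(
        X_fraud, y_fraud, test_size=0.2, random_state=seed, stratify=y_fraud
    )
    plain_counts.append(int(y_plain.sum()))
    strat_counts.append(int(y_strat.sum()))
    print(f"seed={seed}  fraud rows in test -> plain: {plain_counts[-1]}, "
          f"stratified: {strat_counts[-1]}")
```

With a 200-row test set, the stratified split places 6 fraud rows (3%) in the test set on every seed, while the plain split is free to drift from seed to seed.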
The Three-Way Split: Train / Validation / Test
A single train-test split has a subtle flaw. As soon as you look at the test score, tweak a hyperparameter, and look again, you are learning from the test set. Repeat this a dozen times and you have quietly overfit to the test set — its score is no longer an honest estimate.
The clean solution is a three-way split:
- Training set — fit model parameters.
- Validation set — tune hyperparameters and compare models. You may look at this many times.
- Test set — touched exactly once, at the very end, to report final performance.
Full data (100%)
+---------------------------+----------+----------+
| Training | Val | Test |
| 60% | 20% | 20% |
+---------------------------+----------+----------+
fit params tune/compare report once
You can build it with two calls to train_test_split.
# First split off the test set (20%)
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Then split the remainder into train (75% of 80% = 60%) and val (25% of 80% = 20%)
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)
print(X_train.shape[0], X_val.shape[0], X_test.shape[0])
# 341 114 114 (roughly 60% / 20% / 20%)
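The 0.25 in the second call is the validation fraction rescaled to the remainder: 0.20 / (1 - 0.20). A small helper can do that arithmetic for any target proportions. The function name three_way_split and its parameters are hypothetical, not part of scikit-learn.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split


def three_way_split(X, y, val_frac=0.2, test_frac=0.2, seed=42):
    """Stratified train/val/test split (hypothetical helper, not a scikit-learn API)."""
    X_temp, X_test, y_temp, y_test = train_test_split(
        X, y, test_size=test_frac, random_state=seed, stratify=y
    )
    # val_frac of the whole dataset equals val_frac / (1 - test_frac) of the remainder.
    rel_val = val_frac / (1.0 - test_frac)
    X_train, X_val, y_train, y_val = train_test_split(
        X_temp, y_temp, test_size=rel_val, random_state=seed, stratify=y_temp
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_va, X_te, y_tr, y_va, y_te = three_way_split(X_all, y_all)
print(len(X_tr), len(X_va), len(X_te))
```

Calling it with val_frac=0.15 and test_frac=0.15 would give a roughly 70/15/15 split using the same logic.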
The downside: with a modest dataset, carving off two chunks leaves less data to train on, and the validation score depends heavily on which rows landed in that one validation slice. Cross-validation solves both problems.
K-Fold Cross-Validation
A single validation set is one opinion. K-fold cross-validation asks k different validation sets and averages their verdicts, giving a far more robust estimate of performance.
The recipe:
- Split the training data into k equal parts called folds (commonly k = 5 or k = 10).
- Train on k - 1 folds; validate on the remaining 1 fold.
- Rotate so every fold serves as the validation set exactly once.
- Average the k scores. Report the mean and the standard deviation.
Diagram: 5-Fold Cross-Validation
Data split into 5 folds: [ F1 ][ F2 ][ F3 ][ F4 ][ F5 ]
Round 1: VAL train train train train -> score_1
Round 2: train VAL train train train -> score_2
Round 3: train train VAL train train -> score_3
Round 4: train train train VAL train -> score_4
Round 5: train train train train VAL -> score_5
CV score = mean(score_1..score_5) +/- std(score_1..score_5)
Every row is used for training in 4 rounds and for validation in exactly 1 round. Nothing is wasted, and the spread of the five scores tells you how stable the model is.
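Written out by hand, the rotation looks roughly like the loop below. This is a sketch for understanding, not how you would normally run it: it uses plain KFold with shuffling (an illustrative choice) and reloads the breast cancer data under its own variable names so that it runs on its own.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X_cv, y_cv = load_breast_cancer(return_X_y=True)
base_model = LogisticRegression(max_iter=5000)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
times_validated = np.zeros(len(X_cv), dtype=int)
fold_scores = []

for train_idx, val_idx in kf.split(X_cv):
    fold_model = clone(base_model)  # fresh, unfitted copy for each round
    fold_model.fit(X_cv[train_idx], y_cv[train_idx])
    fold_scores.append(fold_model.score(X_cv[val_idx], y_cv[val_idx]))
    times_validated[val_idx] += 1   # track how often each row is held out

fold_scores = np.array(fold_scores)
print("Each row validated exactly once:", bool(np.all(times_validated == 1)))
print(f"CV accuracy: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```

In practice, cross_val_score performs this loop in a single call: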
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=5000)
scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
print("Fold scores:", np.round(scores, 3))
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
Illustrative output:
Fold scores: [0.967 0.945 0.978 0.956 0.967]
CV accuracy: 0.963 +/- 0.011
The +/- 0.011 (one standard deviation) is genuinely useful: it says the model's accuracy hovers around 96.3% and does not swing wildly across folds. A large standard deviation would warn you that the model is unstable and the single-split score could have been lucky.
Stratified K-Fold for Classification
For classification, you want each fold to preserve class proportions — the same reasoning as stratify in a plain split. Scikit-learn does this automatically: when you pass an integer cv to cross_val_score with a classifier, it uses StratifiedKFold under the hood. You can also request it explicitly.
from sklearn.model_selection import StratifiedKFold, cross_val_score
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_train, y_train, cv=skf, scoring="f1")
print(f"CV F1: {scores.mean():.3f} +/- {scores.std():.3f}")
For regression, plain KFold is used (there are no classes to balance). If your data has groups or is a time series, use GroupKFold or TimeSeriesSplit instead: an ordinary shuffled split would put rows from the same group on both sides, or let the model train on the future to predict the past.
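The guarantees these two splitters give can be seen on a toy index array. The sketch below assumes 12 rows already in time order and a hypothetical grouping of 4 patients with 3 visits each; both are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

X_toy = np.arange(12).reshape(-1, 1)  # 12 rows, already in time order
groups = np.repeat(np.arange(4), 3)   # hypothetical: 4 patients, 3 visits each

# TimeSeriesSplit: every training index precedes every validation index.
for train_idx, val_idx in TimeSeriesSplit(n_splits=3).split(X_toy):
    print("time   train:", train_idx.tolist(), "val:", val_idx.tolist())

# GroupKFold: a group never appears on both sides of the same split.
for train_idx, val_idx in GroupKFold(n_splits=4).split(X_toy, groups=groups):
    val_groups = np.unique(groups[val_idx]).tolist()
    shared = np.intersect1d(groups[train_idx], groups[val_idx])
    print("group  val groups:", val_groups, "shared with train:", len(shared))
```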
Leave-One-Out Cross-Validation (LOO)
Leave-one-out is k-fold taken to the extreme: k = n, the number of samples. Each fold trains on all rows but one and validates on that single held-out row.
from sklearn.model_selection import LeaveOneOut, cross_val_score
loo = LeaveOneOut()
scores = cross_val_score(model, X_train, y_train, cv=loo, scoring="accuracy")
print("Number of folds:", len(scores)) # equals number of samples
print(f"LOO accuracy: {scores.mean():.3f}")
LOO uses the maximum possible training data each round, so it has low bias, but it fits the model n times (very slow on large data) and its estimate can have high variance. It shines only on small datasets. For most work, k = 5 or k = 10 is the sweet spot.
cross_val_score vs cross_validate
cross_val_score returns one score per fold for a single metric. When you want multiple metrics, timing, or the training scores alongside validation scores, reach for cross_validate.
from sklearn.model_selection import cross_validate
results = cross_validate(
model, X_train, y_train,
cv=5,
scoring=["accuracy", "precision", "recall", "f1"],
return_train_score=True
)
print("Val accuracy :", np.round(results["test_accuracy"], 3))
print("Val F1 :", np.round(results["test_f1"], 3))
print("Train F1 :", np.round(results["train_f1"], 3))
print("Fit time (s) :", np.round(results["fit_time"], 3))
Illustrative output:
Val accuracy : [0.967 0.945 0.978 0.956 0.967]
Val F1 : [0.974 0.957 0.983 0.965 0.974]
Train F1 : [0.981 0.979 0.982 0.980 0.981]
Fit time (s) : [0.041 0.038 0.040 0.039 0.042]
Comparing train_f1 against test_f1 (validation) fold-by-fold is a quick overfitting check: a consistently large gap between the two means the model is memorising.
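The gap can be computed directly from the returned dictionary. The sketch below swaps in an unpruned decision tree, chosen only because it overfits readily, and passes a single metric, in which case the result keys are train_score and test_score.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X_gap, y_gap = load_breast_cancer(return_X_y=True)

# Unpruned tree: an illustrative stand-in for a model that memorises.
tree = DecisionTreeClassifier(random_state=0)
res = cross_validate(
    tree, X_gap, y_gap, cv=5, scoring="f1", return_train_score=True
)

gap = res["train_score"] - res["test_score"]  # per-fold train minus validation
print("Train F1:", np.round(res["train_score"], 3))
print("Val F1  :", np.round(res["test_score"], 3))
print("Gap     :", np.round(gap, 3))
```

A gap that stays clearly positive in every fold points to memorisation; a gap near zero suggests the model generalises about as well as it fits.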
Data Leakage: The Biggest Silent Killer
Data leakage happens when information that would not be available at prediction time sneaks into training — inflating your evaluation scores and producing a model that fails in production. It is silent because your metrics look fantastic right up until deployment. There are two classic forms.
Leak Type 1: Preprocessing Before the Split
This is the most common leak. Suppose you scale features using the whole dataset, then split.
# WRONG - leakage. Scaler sees the test rows before the split.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # mean/std computed over ALL rows
X_train, X_test, y_train, y_test = train_test_split(
X_scaled, y, test_size=0.2, random_state=42
)
The problem: StandardScaler computes the mean and standard deviation over every row, including the test rows. Statistics from the test set have leaked into training, so the test score is optimistically biased. The correct order is: split first, then fit the scaler on training data only and merely apply it to the test data.
# CORRECT - fit the scaler on train only, transform both.
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train) # learn mean/std from TRAIN only
X_test_s = scaler.transform(X_test) # apply the SAME transform to test
The same rule applies to imputation, encoding, feature selection, PCA, and any step that learns from data (see the Feature Engineering & Scaling chapter). The transform's parameters must come from training data alone.
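One way to verify that nothing leaked is to inspect the statistics the scaler stored. The following sketch reloads the data under its own variable names (an assumption, so it runs independently) and checks that the fitted mean matches the training rows and not the full dataset.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_raw, y_raw = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42, stratify=y_raw
)

scaler = StandardScaler().fit(X_tr)  # statistics come from train only
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

# The stored mean is the training mean, not the full-data mean.
uses_train_mean = np.allclose(scaler.mean_, X_tr.mean(axis=0))
uses_full_mean = np.allclose(scaler.mean_, X_raw.mean(axis=0))
print("Scaler mean equals train mean:", uses_train_mean)
print("Scaler mean equals full mean: ", uses_full_mean)

# Train columns are centred at 0; test columns are only approximately centred.
max_train_mean = np.abs(X_tr_s.mean(axis=0)).max()
max_test_mean = np.abs(X_te_s.mean(axis=0)).max()
print(f"Largest |column mean| in train: {max_train_mean:.6f}")
print(f"Largest |column mean| in test:  {max_test_mean:.3f}")
```

The scaled test columns not being centred exactly at zero is expected and correct: the test rows were transformed with statistics they had no part in computing.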
Leak Type 2: Target Leakage
Target leakage is when a feature contains information about the target that would not exist at prediction time. Examples:
- Predicting loan default, but including a "collections_agency_assigned" column.
That column only gets filled AFTER a customer defaults -> pure leakage.
- Predicting hospital readmission, but including "discharge_medication_for_readmission".
It encodes the outcome you are trying to predict.
- Predicting whether a user will churn next month, using "days_since_last_login"
measured AFTER the churn date.
Target leakage produces near-perfect validation scores that evaporate in production. There is no automatic fix — you must reason about when each feature becomes known. Ask for every feature: "Would I actually have this value at the moment I need to make the prediction?" If not, drop it.
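The symptom can be reproduced on synthetic data. The sketch below is an illustration under stated assumptions: the dataset is generated with noisy labels, and the leaky column is a hypothetical feature built as the target plus a little noise, mimicking a field that is only filled in after the outcome is known.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data with noisy labels (an assumption for illustration).
X_honest, y_tl = make_classification(
    n_samples=1000, n_features=10, flip_y=0.2, random_state=0
)

# Hypothetical leaky column: recorded AFTER the outcome, so it nearly equals the target.
rng = np.random.default_rng(0)
leaky_col = y_tl + rng.normal(scale=0.05, size=len(y_tl))
X_leaky = np.column_stack([X_honest, leaky_col])

clf = LogisticRegression(max_iter=1000)
honest = cross_val_score(clf, X_honest, y_tl, cv=5).mean()
leaky = cross_val_score(clf, X_leaky, y_tl, cv=5).mean()
print(f"CV accuracy without the leaky column: {honest:.3f}")
print(f"CV accuracy with the leaky column:    {leaky:.3f}")
```

Cross-validation cannot catch this kind of leak, because the leaky column is present in every fold. The inflated score only disappears once the column is unavailable at prediction time.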
Pipelines: The Structural Cure for Leak Type 1
The cleanest defence against preprocessing leakage is a scikit-learn Pipeline. A pipeline chains preprocessing and the model into one object. When you call fit, every step is fitted on the training portion; when you predict, the same fitted transforms are applied. Crucially, inside cross-validation the pipeline is re-fitted separately on each fold's training data, so no fold's validation rows ever influence its own preprocessing.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
# Scaling + model bundled together. The scaler is fit ONLY on each fold's train part.
pipe = make_pipeline(
StandardScaler(),
LogisticRegression(max_iter=5000)
)
scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="accuracy")
print(f"Leak-free CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
Using a pipeline is the single most effective habit for leakage-free evaluation. Whenever a preprocessing step learns anything from the data, put it inside a pipeline and let cross-validation drive the pipeline — never the raw estimator plus pre-scaled arrays.
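Putting the pieces together, one possible end-to-end workflow is sketched below: cross-validate the pipeline on the training portion, then refit it on all training rows and score the test set once. Variable names such as final_pipe are illustrative, and the data is reloaded so the block runs independently.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_bc, y_bc, test_size=0.2, random_state=42, stratify=y_bc
)

final_pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))

# 1) Model assessment: cross-validate on the training portion only.
cv_scores = cross_val_score(final_pipe, X_tr, y_tr, cv=5, scoring="accuracy")
print(f"CV accuracy:   {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

# 2) Final report: refit on all training rows, score the test set once.
final_pipe.fit(X_tr, y_tr)
test_acc = final_pipe.score(X_te, y_te)
print(f"Test accuracy: {test_acc:.3f}")

# The scaler inside the pipeline learned its statistics from X_tr alone.
fitted_scaler = final_pipe.named_steps["standardscaler"]
train_only = np.allclose(fitted_scaler.mean_, X_tr.mean(axis=0))
print("Scaler fit on train only:", train_only)
```

make_pipeline names each step after its lowercased class name, which is why the scaler is reachable as named_steps["standardscaler"].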
Comparing the Validation Strategies
| Method | How it splits | Speed | Data efficiency | Best for |
|---|---|---|---|---|
| Holdout (single split) | One train/test cut, e.g. 80/20 | Fastest (1 fit) | Held-out rows are never used for training | Large datasets, quick checks |
| K-Fold (k = 5 or 10) | k rotating folds, each row validated once | Moderate (k fits) | Every row used for train and validation | The default for most projects |
| Stratified K-Fold | Like K-Fold but preserves class ratios | Moderate (k fits) | Same as K-Fold | Classification, imbalanced classes |
| Leave-One-Out (LOO) | k = n; one row out per fold | Slowest (n fits) | Maximum training data per fold | Very small datasets only |
A practical rule of thumb: use a plain holdout for a fast sanity check, 5-fold or 10-fold cross-validation for real model comparison, and reserve LOO for datasets so small that every row counts. Whichever you pick, keep a truly untouched final test set for the one honest number you report at the end.
Common Mistakes
1. Scaling or encoding before splitting
Fitting a StandardScaler, OneHotEncoder, imputer, or PCA on the full dataset and then splitting leaks test statistics into training. Always split first, or wrap preprocessing in a Pipeline so cross-validation handles the boundary for you.
2. Tuning hyperparameters on the test set
The moment you adjust a setting to improve the test score, that score stops being honest. Tune on a validation set or with cross-validation, and touch the test set exactly once.
3. Forgetting stratify with imbalanced classes
A random split can hand you a test set with almost no minority-class examples, producing wildly unstable metrics. Pass stratify=y for classification and use StratifiedKFold.
4. Shuffling time-series data
For temporal data, a shuffled split trains on the future to predict the past — a leak. Use shuffle=False, or TimeSeriesSplit, so training always precedes validation in time.
5. Ignoring the standard deviation of CV scores
Reporting only the mean CV score hides instability. A model at 0.90 +/- 0.02 is far more trustworthy than one at 0.91 +/- 0.15. Always report the spread.
6. Leaking through duplicate or grouped rows
If the same customer, patient, or image appears in both train and test (near-duplicates, multiple records per person), the split leaks. Use GroupKFold for cross-validation, or GroupShuffleSplit for a single split, so all rows for one group stay on the same side.
Practice Exercises
- Load any classification dataset and create a stratified 70/15/15 train/validation/test split using two calls to train_test_split. Print the class proportions in each part to confirm they match.
- Run cross_val_score on a LogisticRegression with cv=5 and again with cv=10. Compare the mean and standard deviation. Which gives a tighter estimate, and why might that be?
- Deliberately create a leak: fit a StandardScaler on the full data, split, train, and record the test accuracy. Then do it the correct way (fit scaler on train only). Report both accuracies and explain any difference.
- Build a Pipeline of StandardScaler plus a classifier and evaluate it with 5-fold cross-validation. Explain in one sentence why the pipeline prevents preprocessing leakage inside each fold.
- Given a dataset of patient records where each patient has multiple visits, describe why a plain random split leaks and which cross-validator you would use instead.
- Use cross_validate with scoring=["accuracy", "f1"] and return_train_score=True. Identify any fold where the train score is much higher than the validation score, and state what that gap indicates.
Summary
In this chapter you learned:
- Never score a model on its training data — it rewards memorisation, not generalisation. Always evaluate on held-out data.
- train_test_split carves data into train and test; key arguments are test_size, random_state (reproducibility), and stratify (preserve class ratios).
- A three-way split (train / validation / test) keeps the test set pristine: fit on train, tune on validation, report on test exactly once.
- K-fold cross-validation rotates k validation folds and averages the scores, giving a robust estimate plus a standard deviation that reveals stability. k = 5 or k = 10 is the practical default.
- Stratified K-Fold preserves class proportions for classification; Leave-One-Out maximises training data but is slow and best reserved for tiny datasets.
- cross_val_score returns one metric per fold; cross_validate adds multiple metrics, timings, and train scores.
- Data leakage is the silent killer: preprocessing before the split and target leakage both inflate scores that then collapse in production.
- Pipelines structurally prevent preprocessing leakage by re-fitting every transform on each fold's training portion — make them your default habit.
Get the split and cross-validation right, and every later chapter's model comparison rests on solid ground.
Next up: Linear Regression — build your first predictive model, fit a line with least squares, and interpret its coefficients to turn data into forecasts.