Overfitting vs Underfitting in Machine Learning: How to Diagnose and Fix Both

Most machine learning practitioners encounter overfitting early and loudly. The model performs brilliantly on training data and collapses on everything else. The diagnosis is obvious. The fix feels straightforward.

Underfitting is quieter and more insidious. It produces models that look mediocre everywhere — not dramatically wrong, just consistently insufficient. It is also more commonly misdiagnosed, more often treated with the wrong intervention, and more frequently produced by practitioners who have learned to fear overfitting without understanding the opposite failure.

The real skill is not avoiding one of these conditions. It is reading your model's behaviour accurately enough to know which condition you are in — and intervening precisely rather than reflexively.

What Overfitting and Underfitting Are Actually Measuring

Both conditions describe the same underlying phenomenon: a mismatch between the complexity of your model and the complexity of the problem you are asking it to solve.

Overfitting is the condition where the model is more complex than the problem requires. It has learned the specific patterns in the training data — including the noise, the outliers, the sampling artefacts — rather than the generalisable signal that the problem actually contains. The model has memorised rather than learned.

Underfitting is the condition where the model is less complex than the problem requires. It has not captured the actual signal in the data. The patterns that determine the outcome are present in the data but the model is not structured or parameterised to express them.

What is not said clearly enough about these conditions:

The dominant framing in most ML education treats overfitting as the primary enemy to be avoided. This framing produces a specific failure mode: practitioners who have been trained to fear overfitting apply regularisation, reduce model complexity, and add dropout reflexively — even when the model's actual problem is underfitting, not overfitting.

The model performs inadequately. The practitioner, conditioned to suspect overfitting, applies a regularisation intervention. The model's performance gets worse. The practitioner does more of the same.

The correct diagnostic frame is: performance inadequacy is not evidence of overfitting. It is evidence of a problem that needs to be diagnosed before it can be treated. The treatment for overfitting actively worsens underfitting, and vice versa.

The Bias-Variance Decomposition: What It Actually Tells You

The formal decomposition of a model's expected error is: Expected Error = Bias² + Variance + Irreducible Noise.
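For squared-error loss, the decomposition can be written out term by term. The notation is introduced here for illustration and is not taken from the article: f is the true function, \hat{f} is the model fitted on a randomly drawn training set, y = f(x) + \varepsilon, and \sigma^2 is the variance of the noise \varepsilon. Expectations are taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible noise}}
```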

The operationally useful version: bias is the error that comes from wrong assumptions in the model — the model is structurally unable to represent the true relationship. Variance is the error that comes from sensitivity to small fluctuations in the training data — the model has learned the specific training set rather than the underlying distribution.

Underfitting is a high-bias condition. The model's structural assumptions are wrong or too simple. The error exists on both training and validation data because the model is not capturing the signal that is present.

Overfitting is a high-variance condition. The model is flexible enough to represent the true relationship, but it has become too sensitive to the training data. The error is low on training data and high on validation data because the model has learned the training set's specific noise alongside its signal.

The diagnostic implication:

A high-bias model cannot be fixed by giving it more data. The structurally wrong assumptions produce errors that persist regardless of data volume, because more data does not change what the model is structurally capable of representing.

A high-variance model cannot be fixed by making it more complex. More complexity gives the model more capacity to overfit, which is the opposite of what is needed.

The interventions are structurally opposite, and applying the wrong one is worse than applying nothing.

Diagnosing Overfitting: The Specific Patterns to Look For

Overfitting has a specific, recognisable signature that is visible in learning curves, evaluation metrics, and model behaviour.

The training-validation gap:

The primary signal of overfitting is a significant gap between training performance and validation performance. Specifically: high training performance, substantially lower validation performance. A rough heuristic: if training accuracy is 94% and validation accuracy is 91%, this is a modest gap that may be acceptable. If training accuracy is 94% and validation accuracy is 72%, this is a large gap that indicates significant overfitting.

The learning curve pattern:

As training set size increases, training performance decreases slightly and validation performance increases. In an overfit model, the two curves begin to converge as data increases, but with a persistent gap. If adding data consistently improves validation performance, more data is a viable treatment. If the curves have converged but both are at an inadequate level, more data will not help.
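A learning curve of this kind can be sketched without any ML library. The example below is a minimal illustration under stated assumptions: the data is synthetic (a 1-D feature with 20% of labels flipped), and the model is a deliberately flexible histogram classifier that predicts the majority label in each of 20 bins. All names (`make_data`, `fit_bins`, `learning_curve`) are hypothetical helpers, not part of any library.

```python
import random

def make_data(n, rng, noise=0.2):
    """Synthetic data: label is 1 when x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < noise:
            y = 1 - y
        data.append((x, y))
    return data

def fit_bins(train, n_bins=20):
    """Histogram classifier: majority label per bin; empty or tied bins predict 0."""
    votes = [[0, 0] for _ in range(n_bins)]
    for x, y in train:
        votes[min(int(x * n_bins), n_bins - 1)][y] += 1
    return [int(ones > zeros) for zeros, ones in votes]

def accuracy(model, data):
    n_bins = len(model)
    hits = sum(model[min(int(x * n_bins), n_bins - 1)] == y for x, y in data)
    return hits / len(data)

def learning_curve(sizes, seed=0):
    """Return (n, training accuracy, validation accuracy) for each training size."""
    rng = random.Random(seed)
    pool = make_data(max(sizes), rng)
    validation = make_data(500, rng)
    curve = []
    for n in sizes:
        train = pool[:n]
        model = fit_bins(train)
        curve.append((n, accuracy(model, train), accuracy(model, validation)))
    return curve

curve = learning_curve([20, 50, 100, 200, 400])
for n, train_acc, val_acc in curve:
    print(f"n={n:4d}  train={train_acc:.2f}  validation={val_acc:.2f}  "
          f"gap={train_acc - val_acc:.2f}")
```

Because 20% of labels are flipped, no model can be expected to exceed roughly 0.8 accuracy on fresh data here. The thing to read from the printed rows is the trend in the gap as n grows, not any single number.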

The real-world scenario:

A fraud detection team at a payments company built a gradient boosting model to classify transactions as fraudulent or legitimate. Training AUC was 0.97. Validation AUC was 0.81. The gap — 0.16 AUC points — was the first diagnostic signal.

Deeper investigation revealed why: the training data contained a specific pattern of transaction timestamps that was strongly correlated with fraud in the training period but was not a stable causal feature — it reflected a seasonal pattern in the specific months the training data covered.

The fix was not regularisation. It was feature engineering: removing or transforming the timestamp feature to eliminate the spurious correlation, then retraining. Validation AUC rose to 0.89.

The lesson: overfitting is not always caused by model complexity. It can be caused by spurious correlations in feature engineering that are stable in training data but not in the real distribution.
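One way to look for this kind of instability is to compare a feature's correlation with the target across time periods. The sketch below uses hypothetical toy data and hypothetical helper names. It is not the fraud team's method, which the scenario does not describe. A correlation that is strong in one period and absent in another is a prompt for closer inspection, not proof of a spurious feature.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def correlation_by_period(feature, target, period):
    """Correlation between feature and target, computed separately per period."""
    result = {}
    for label in sorted(set(period)):
        idx = [i for i, p in enumerate(period) if p == label]
        result[label] = pearson([feature[i] for i in idx],
                                [target[i] for i in idx])
    return result

# Hypothetical toy data: the feature tracks the target in "train" but not "later".
feature = [1, 2, 3, 4, 1, 2, 3, 4]
target = [0, 0, 1, 1, 0, 1, 1, 0]
period = ["train"] * 4 + ["later"] * 4

by_period = correlation_by_period(feature, target, period)
print(by_period)
```

On this toy data the correlation is about 0.89 in the "train" period and 0 in the "later" period.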

Diagnosing Underfitting: The Patterns Most Practitioners Miss

Underfitting is more often misdiagnosed than overfitting, for a specific reason: the symptoms are less dramatic.

The primary signal of underfitting:

Low training performance and low validation performance, with a small gap between them. The gap is small because the model is not sensitive enough to the training data to memorise it.

The key diagnostic question: is the training performance adequate? Not "is it higher than validation performance" but "is it actually good?" If training performance is inadequate, the model has not learned the training data well.

The specific patterns that indicate underfitting:

Training accuracy that plateaus early in the training process
Residual patterns in errors — the errors show systematic patterns that suggest the model is failing to capture structural relationships
Performance that is only marginally better than a naive baseline
Feature importances that are uniformly low, with no feature standing out as informative
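The naive-baseline check in the list above can be made concrete. This is a minimal sketch assuming a regression problem, with "always predict the training mean" as the baseline; the numbers and function names are hypothetical.

```python
import math

def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def improvement_over_mean_baseline(train_targets, actual, predicted):
    """Fractional RMSE reduction relative to always predicting the training mean."""
    mean_prediction = sum(train_targets) / len(train_targets)
    baseline = rmse(actual, [mean_prediction] * len(actual))
    model = rmse(actual, predicted)
    return (baseline - model) / baseline

# Hypothetical numbers, for illustration only.
train_targets = [10, 20, 30, 40]
actual = [10, 20, 30, 40]
predicted = [22, 24, 26, 28]   # barely moves away from the mean of 25

gain = improvement_over_mean_baseline(train_targets, actual, predicted)
print(f"RMSE reduction vs. mean baseline: {gain:.0%}")
```

Here the model reduces RMSE by only 20% relative to predicting the mean, which is the kind of marginal improvement that points toward underfitting. What counts as "marginal" depends on the problem.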

The real-world scenario:

A retail team built a demand forecasting model to predict weekly sales. They used a linear regression with five input features. Training RMSE was 340 units. Validation RMSE was 352 units. The gap was small — only 12 units — which superficially looked like good generalisation. But both numbers were inadequate: inventory planning required RMSE below 150 units.

The team initially attributed the poor performance to data quality issues and spent two weeks cleaning the data. Performance improved marginally.

The actual diagnosis: underfitting. The linear model was structurally incapable of capturing the non-linear relationships in retail demand — particularly the interaction effects between day of week, promotional periods, and seasonality.

The fix: replacing the linear model with a gradient boosting model, adding interaction features, and including lagged sales features at multiple time horizons. Post-fix training RMSE: 112. Validation RMSE: 138.

The Seven Interventions for Overfitting: When to Use Each One

Once overfitting is correctly diagnosed, the intervention choices are numerous and the selection criteria are specific. Applying all of them at once is not a strategy — it is a failure to understand what each intervention does.

Regularisation (L1 and L2):

L2 regularisation penalises large weights, shrinking all of them toward zero so that no single feature dominates the prediction. This is appropriate when you suspect the model is assigning excessive importance to specific features.

L1 regularisation encourages sparsity — it produces models where many weights are exactly zero, effectively performing feature selection. This is appropriate when you have a large feature space and suspect that many features are irrelevant or redundant.

Both work best when the model has the right architecture but too much parameter freedom. They are less effective when the overfitting is caused by spurious feature correlations.
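The difference between the two penalties is easiest to see in the textbook special case of orthonormal features, where L2 rescales every least-squares weight by a constant factor and L1 applies soft-thresholding. The sketch below assumes that special case; the weights, the strength `lam`, and the function names are hypothetical, and the exact constants depend on how the loss is scaled.

```python
def l2_shrink(weights, lam):
    """Ridge-style shrinkage: every weight is scaled toward zero, none reaches it."""
    return [w / (1.0 + lam) for w in weights]

def l1_shrink(weights, lam):
    """Lasso-style soft-thresholding: weights with |w| <= lam become exactly zero."""
    shrunk = []
    for w in weights:
        if abs(w) <= lam:
            shrunk.append(0.0)
        else:
            shrunk.append(w - lam if w > 0 else w + lam)
    return shrunk

weights = [3.0, -0.4, 0.2, -2.0]   # hypothetical unregularised weights
lam = 0.5                          # hypothetical regularisation strength

ridge = l2_shrink(weights, lam)
lasso = l1_shrink(weights, lam)
print("L2:", [round(w, 3) for w in ridge])
print("L1:", [round(w, 3) for w in lasso])
```

With these inputs, L1 sets the two small weights to exactly zero and keeps 2.5 and -1.5, while L2 keeps all four weights non-zero but smaller.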

Dropout (for neural networks):

Dropout randomly deactivates a proportion of neurons during each training step, forcing the network to learn redundant representations rather than memorising specific patterns. Dropout rates between 0.2 and 0.5 are typical for most hidden layers. Applying too much dropout produces underfitting.
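The mechanism can be sketched in a few lines. This is a minimal illustration of the common "inverted dropout" variant, written in plain Python rather than with a deep learning framework; the function name and arguments are hypothetical, and `rate` is assumed to be below 1.

```python
import random

def dropout(activations, rate, training, rng):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    # Dividing by `keep` preserves the expected activation, so nothing needs
    # to change at inference time.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
hidden = [1.0] * 10
train_out = dropout(hidden, rate=0.5, training=True, rng=rng)
eval_out = dropout(hidden, rate=0.5, training=False, rng=rng)
print(train_out)
print(eval_out)
```

During training each unit is either zeroed or scaled up to 2.0; at evaluation time the activations pass through unchanged.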

Early stopping:

Monitor validation performance during training and stop when it begins to degrade. This is one of the simplest and most reliable overfitting interventions because it directly monitors the condition it is treating.
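The stopping rule is usually implemented with a patience parameter, so that a single noisy epoch does not end training. The sketch below assumes validation loss (lower is better) recorded once per epoch; the function name, the loss values, and the patience of 2 are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (best_epoch, stop_epoch), both 0-indexed.

    Training stops once the validation loss has failed to improve on its best
    value for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

# Hypothetical validation losses per epoch.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
best, stopped = early_stopping_epoch(losses, patience=2)
print(f"best epoch: {best}, stopped at epoch: {stopped}")
```

Here the loss bottoms out at epoch 3 and training stops at epoch 5. In practice the weights saved at the best epoch, not the final one, are the ones to keep.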

More training data:

If the overfitting is caused by insufficient data, more data is the correct intervention. The learning curve will show validation performance continuing to improve as data increases. If the overfitting is caused by spurious feature correlations or excessive model complexity, more data will not help.

Reducing model complexity:

Fewer layers in a neural network, shallower trees in a forest, fewer boosting rounds, lower polynomial degree in a regression. This is the right intervention when the model has more capacity than the problem requires.

Feature selection and engineering:

Removing features that are correlated with the training target for spurious reasons, removing redundant features. This is the right intervention when overfitting is caused by feature quality rather than model complexity.

Cross-validation:

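Using k-fold cross-validation instead of a single train-validation split reduces the risk that the performance estimate is itself a product of how the specific split divided the data. It does not reduce overfitting directly; it makes the measurement of the training-validation gap more trustworthy, which every other intervention depends on.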
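The splitting logic behind k-fold cross-validation is short enough to sketch directly. This is a minimal illustration with a hypothetical helper name; it assumes rows are independent, so shuffling is safe. For time-ordered data such as the demand forecasting scenario, a time-aware split would be needed instead.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for each of k shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

splits = list(k_fold_indices(n_samples=10, k=5))
for train, validation in splits:
    print(sorted(validation), "held out;", len(train), "used for training")
```

Every sample is held out exactly once, so the averaged score does not hinge on one particular split.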

The Five Interventions for Underfitting

Increase model capacity:

Adding layers to a neural network, adding trees to a boosting ensemble, increasing polynomial degree. The test: does training performance improve when you increase capacity? If yes, capacity was the constraint.

Improve feature engineering:

The most commonly under-leveraged intervention for underfitting in tabular data problems. The model cannot learn signal that is not present in the features. If the actual drivers of the outcome are expressed in combinations of features that the model cannot represent, adding those features directly is more effective than adding model complexity.
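Two of the feature types mentioned in the demand forecasting scenario, interaction features and lagged values, can be sketched as follows. The column names (`is_promo`, `is_holiday_week`), the sales figures, and the helper functions are all hypothetical.

```python
def add_interaction(rows, a, b):
    """Return copies of the row dicts with a product feature named '<a>_x_<b>'."""
    return [{**row, f"{a}_x_{b}": row[a] * row[b]} for row in rows]

def add_lags(series, lags):
    """One row per time step with the value at each lag (None if unavailable)."""
    rows = []
    for t, value in enumerate(series):
        row = {"y": value}
        for lag in lags:
            row[f"lag_{lag}"] = series[t - lag] if t - lag >= 0 else None
        rows.append(row)
    return rows

# Hypothetical weekly records with 0/1 flags.
weeks = [
    {"is_promo": 1, "is_holiday_week": 0},
    {"is_promo": 1, "is_holiday_week": 1},
    {"is_promo": 0, "is_holiday_week": 1},
]
with_interaction = add_interaction(weeks, "is_promo", "is_holiday_week")

sales = [100, 120, 90, 150]   # hypothetical weekly sales
lagged = add_lags(sales, lags=[1, 2])

print(with_interaction[1])
print(lagged[3])
```

A linear model given only the two flags cannot represent "promotion during a holiday week" as its own effect. With the product column added, it can.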

Reduce regularisation:

If a model has been regularised too aggressively, it may underfit despite having sufficient capacity. Reducing regularisation to allow the model to fit the training data more closely is the correct intervention.

Train for longer:

Models that have not converged benefit from additional training time. This is distinct from the overfitting scenario where early stopping is appropriate: here, training performance has not plateaued and the model genuinely has not finished learning.

Improve data quality:

If the features or labels contain significant noise (measurement error, inconsistent labelling, missing values imputed with uninformative defaults), the model cannot learn clean signal from noisy data.

The Three-Phase Diagnostic Protocol: A Systematic Approach

Most practitioners diagnose overfitting and underfitting reactively. A systematic three-phase protocol produces more reliable diagnoses and more targeted interventions.

Phase one: establish the training-validation performance pattern.

Before any intervention, establish the pattern clearly using this diagnostic table:

High train, low validation, large gap → overfitting
Low train, low validation, small gap → underfitting
High train, high validation, small gap → well-fitted

Do not skip this step. Do not proceed to interventions until the pattern is clearly established.
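The table can be turned into a small helper so that the pattern is recorded the same way every time. This is a sketch for a higher-is-better metric such as accuracy or AUC. The thresholds `adequate` and `max_gap` are hypothetical defaults that must come from the requirements of the problem, and combinations the table does not cover are reported as inconclusive rather than guessed.

```python
def diagnose(train_score, val_score, adequate=0.85, max_gap=0.05):
    """Classify the train/validation pattern for a higher-is-better metric.

    `adequate` and `max_gap` are hypothetical thresholds; set them from the
    requirements of the problem, not from this sketch.
    """
    gap = train_score - val_score
    if train_score < adequate:
        # Low training score: underfitting if the gap is small, otherwise
        # the pattern is outside the table.
        return "underfitting" if gap <= max_gap else "inconclusive"
    if gap > max_gap:
        return "overfitting"
    if val_score >= adequate:
        return "well-fitted"
    return "inconclusive"

print(diagnose(0.94, 0.72))   # accuracy example with the large gap
print(diagnose(0.94, 0.91))   # accuracy example with the modest gap
print(diagnose(0.62, 0.60))   # hypothetical low-train, low-validation case
```

With these default thresholds the three calls return "overfitting", "well-fitted", and "underfitting" respectively.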

Phase two: identify the root cause within the diagnosis.

For overfitting: is the gap caused by insufficient data, excessive model complexity, or spurious feature correlations? The appropriate intervention differs.

For underfitting: is the inadequate performance caused by insufficient model capacity, insufficient feature quality, excessive regularisation, or insufficient training? The appropriate intervention differs.

Phase three: apply a single intervention and observe.

This is the most commonly violated principle in practice. Apply one intervention at a time. Rerun the learning curve. Re-evaluate the pattern. Then decide whether to continue with the same intervention or try another.

A data team building a customer churn prediction model had a large training-validation gap (training AUC 0.91, validation AUC 0.73). They applied three interventions simultaneously: increased L2 regularisation, removed 40% of features, and reduced model depth from 8 layers to 4.

Validation AUC improved to 0.82. But they did not know which intervention had produced the improvement, whether all three were necessary, or whether one had made things worse while the other two overcorrected.

The Specific Failure Mode: Overfitting the Validation Set

There is a third condition that sits between overfitting the training set and genuine generalisation: overfitting the validation set through repeated hyperparameter tuning.

Every time you evaluate your model on the validation set and then tune the model based on validation performance, you are using information from the validation set to make modelling decisions. Over many iterations, the model becomes adapted to the specific validation set.

How to detect this: If your validation performance has improved significantly across many tuning iterations, but your test set performance is substantially lower than the final validation performance, you have overfit the validation set.

How to prevent it: Use a separate test set that is never used during development, only for final evaluation. When possible, use cross-validation on the development data rather than a single train-validation split, and keep the test set held out from it.
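The effect can be reproduced with a small simulation. The sketch below assumes an extreme case chosen for clarity: the labels are pure coin flips, so no model can truly beat 50%. Each of 200 hypothetical "tuning iterations" is a random threshold rule, and the one with the best validation score is selected.

```python
import random

rng = random.Random(0)

def make_split(n):
    """Feature x carries no information: labels are independent coin flips."""
    return [(rng.random(), rng.randint(0, 1)) for _ in range(n)]

def accuracy(model, data):
    threshold, predict_one_above = model
    hits = sum(int((x > threshold) == predict_one_above) == y for x, y in data)
    return hits / len(data)

validation = make_split(100)
test = make_split(10_000)

# 200 hypothetical tuning iterations: random threshold rules, none of which
# can truly beat 50% because the labels are pure noise.
candidates = [(rng.random(), rng.random() < 0.5) for _ in range(200)]
best = max(candidates, key=lambda m: accuracy(m, validation))

val_acc = accuracy(best, validation)
test_acc = accuracy(best, test)
print(f"best validation accuracy: {val_acc:.2f}")
print(f"same model on untouched test set: {test_acc:.2f}")
```

The selected rule scores above 50% on the validation set purely because it was chosen for doing so, while its accuracy on the untouched test set stays close to 50%. Real tuning is less extreme, but the selection effect works the same way.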

The Calibration Problem

A well-fitted model — one that avoids both overfitting and underfitting — can still be a poor model for deployment if it is miscalibrated.

A model is well-calibrated if its predicted probabilities reflect actual frequencies. Overfitting tends to produce overconfident models — models that assign probabilities close to 0 or 1 more often than the data warrants. Underfitting tends to produce under-confident models — probabilities clustered near 0.5.

Consider a credit scoring model, used to make lending decisions, that assigns a 20% default probability to a set of applicants. If the model is overconfident (overfitting pattern), the 20% it predicts might reflect a true rate of 8%: the model overstates the risk by a factor of 2.5, and the bank rejects applicants it could safely approve. If the model is underconfident (underfitting pattern), the 20% might reflect a true rate of 38%: the true risk is nearly double what the model reports, and the bank approves applicants at nearly twice the risk it believes it is taking.
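Calibration can be checked by grouping predictions into bins and comparing the mean predicted probability in each bin with the observed frequency. The sketch below also computes the count-weighted average of those differences, often called expected calibration error. The scores and outcomes are hypothetical, as are the function names and the choice of five equal-width bins.

```python
def calibration_table(probs, outcomes, n_bins=5):
    """Per-bin (mean predicted probability, observed frequency, count)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    table = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            freq = sum(y for _, y in members) / len(members)
            table.append((mean_p, freq, len(members)))
    return table

def expected_calibration_error(probs, outcomes, n_bins=5):
    """Count-weighted average of |predicted - observed| across non-empty bins."""
    table = calibration_table(probs, outcomes, n_bins)
    return sum(n * abs(mean_p - freq) for mean_p, freq, n in table) / len(probs)

# Hypothetical scores: ten applicants scored at 20%, two of whom defaulted,
# and ten scored at 80%, five of whom defaulted.
probs = [0.2] * 10 + [0.8] * 10
outcomes = [1, 1] + [0] * 8 + [1] * 5 + [0] * 5

for mean_p, freq, n in calibration_table(probs, outcomes):
    print(f"predicted {mean_p:.2f}  observed {freq:.2f}  n={n}")
ece = expected_calibration_error(probs, outcomes)
print(f"expected calibration error: {ece:.2f}")
```

In this toy data the 20% group is well calibrated (2 of 10 defaulted) while the 80% group is overconfident (only 5 of 10 defaulted), giving an overall error of 0.15.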

What Good Generalisation Actually Looks Like in Deployed Systems

Understanding overfitting and underfitting at the modelling stage is necessary but not sufficient for building ML systems that perform reliably in production.

Distribution shift occurs when the data the model encounters in production differs from the data it was trained on. The three types:

Covariate shift: The input feature distribution changes in production.

Concept drift: The relationship between features and target changes.

Label shift: The class distribution changes.

The diagnostic for distribution shift: Monitor feature distributions in production and compare them to training distributions. Monitor prediction distributions. Monitor performance on labelled production data when labels become available.
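One simple way to compare a feature's production distribution with its training distribution is the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical cumulative distribution functions. This is one option among several, not a method the article prescribes; the function name and sample values are hypothetical, and the alerting threshold is left to the reader.

```python
import bisect

def ks_statistic(reference, production):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
    empirical CDFs of the two samples (0 = identical, 1 = fully separated)."""
    ref, prod = sorted(reference), sorted(production)
    largest_gap = 0.0
    for value in ref + prod:
        cdf_ref = bisect.bisect_right(ref, value) / len(ref)
        cdf_prod = bisect.bisect_right(prod, value) / len(prod)
        largest_gap = max(largest_gap, abs(cdf_ref - cdf_prod))
    return largest_gap

# Hypothetical values of one feature at training time and in production.
training_values = [1.0, 2.0, 3.0, 4.0, 5.0]
same_distribution = [1.0, 2.0, 3.0, 4.0, 5.0]
shifted = [6.0, 7.0, 8.0, 9.0, 10.0]

print(ks_statistic(training_values, same_distribution))   # 0.0
print(ks_statistic(training_values, shifted))             # 1.0
```

Run per feature on a schedule, a statistic like this gives an early signal of covariate shift before labelled production data is available.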

Closing: From Overfitting Diagnosis to Real-World ML Practice

The overfitting-underfitting framework is foundational — it is the lens through which every model's development phase should be evaluated.

At Meritshot, the Data Science programme is built around exactly these practical challenges. Students do not work on clean, textbook datasets where the right model architecture is obvious. They work on problems that reflect the actual structure of Indian industry data — where the diagnosis of overfitting versus underfitting requires judgment, where the right intervention is not given in advance, and where the evaluation pipeline is part of the deliverable rather than an afterthought.
