Why 95% Model Accuracy Is Misleading: Precision, Recall and F1 Score Explained

A fraud detection model at a payments company achieved 97.3% accuracy. The team celebrated. They pushed it to production. Three months later, the fraud team reported that the model was catching almost no actual fraud.

The investigation found the problem immediately: 97.2% of transactions in the training data were legitimate. The model had learned to predict "not fraud" for almost everything. Its 97.3% accuracy was barely above the 97.2% that a model predicting "not fraud" every single time would score, and it was completely useless.

This is the accuracy paradox. It is not a corner case. It is the default failure mode for any classification model trained on imbalanced data, and imbalanced data is the norm in every commercially important classification problem: fraud detection, cancer screening, churn prediction, spam filtering, loan default prediction.

If you are evaluating your classification model using only accuracy, you are measuring the wrong thing.

The Accuracy Trap: What the Number Is Actually Telling You

Accuracy measures the fraction of all predictions your model got right. On a balanced dataset where the classes occur with roughly equal frequency, this is a reasonable summary.

On an imbalanced dataset, accuracy is dominated by the majority class: it mostly measures how often your model predicts that class, and says very little about the minority class you usually care about.

The real-world scenario that makes this concrete:

A hospital wants to build a model to flag patients at risk of sepsis in the ICU. In the training data, 94% of ICU stays do not involve sepsis. A model that predicts "no sepsis" for every single patient achieves 94% accuracy without learning anything about the actual condition. In deployment, it misses every sepsis case. Patients die.

The accuracy number gave the development team no signal that anything was wrong until the model was in use. That is the problem. A metric that cannot distinguish between a useful model and a completely degenerate one is not a useful metric.

What makes a classification problem imbalanced:

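A problem is imbalanced when one class heavily outnumbers the other. Most real problems are significantly imbalanced, with typical class splits like these: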

Credit card fraud: 99.8% legitimate, 0.2% fraud
Cancer detection: 95-99% negative, 1-5% positive
Customer churn: 85-95% retained, 5-15% churned
Loan default: 92-97% repaid, 3-8% defaulted
Intrusion detection: 99%+ normal traffic, less than 1% attacks

In every one of these cases, a model that predicts the majority class for every observation achieves high accuracy while providing zero practical value.
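To see the trap in numbers, here is a minimal sketch using only the Python standard library. The data is a hypothetical toy set (1,000 transactions, 2 of them fraud) chosen to mirror the credit card split listed above; the "model" is simply a constant prediction of the majority class.

```python
# Hypothetical toy data: 1,000 transactions, 2 of them fraud (label 1),
# mirroring the 99.8% / 0.2% credit card split listed above.
y_true = [1] * 2 + [0] * 998

# A degenerate "model" that predicts the majority class for every row.
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the fraud class: how many real frauds were flagged?
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.1%}")  # accuracy = 99.8%
print(f"recall   = {recall:.1%}")    # recall   = 0.0%
```

The accuracy figure looks excellent, yet not a single fraud case is caught.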

The Confusion Matrix: The Foundation You Need to Read Correctly

For a binary classifier:

True Positives (TP): Predicted positive, actually positive. The correct detections.
True Negatives (TN): Predicted negative, actually negative. The correct rejections.
False Positives (FP): Predicted positive, actually negative. The false alarms. Also called Type I errors.
False Negatives (FN): Predicted negative, actually positive. The misses. Also called Type II errors.
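The four cells can be counted directly from paired labels and predictions. The sketch below assumes labels are encoded as 1 (positive) and 0 (negative); the helper name confusion_counts and the sample data are illustrative, not part of any library.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels where 1 is the positive class."""
    counts = Counter(zip(y_true, y_pred))  # keys are (actual, predicted) pairs
    return {
        "TP": counts[(1, 1)],
        "TN": counts[(0, 0)],
        "FP": counts[(0, 1)],  # predicted positive, actually negative
        "FN": counts[(1, 0)],  # predicted negative, actually positive
    }

# Hypothetical labels and predictions, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]
cm = confusion_counts(y_true, y_pred)
print(cm)  # {'TP': 2, 'TN': 4, 'FP': 1, 'FN': 1}
```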

The real-world scenario that makes the difference between FP and FN visceral:

System A — Spam filter:

False Positive: A legitimate email from your client lands in spam. You miss an important meeting. Costly.
False Negative: A spam email arrives in your inbox. You delete it. Mildly annoying.

System B — Cancer screening:

False Positive: A healthy patient is flagged for further testing. They undergo a biopsy that comes back negative. Stressful and expensive, but correctable.
False Negative: A patient with early-stage cancer is cleared. The cancer progresses undetected for months. Potentially fatal.

In the spam filter, false positives are worse. In cancer screening, false negatives are catastrophically worse. Accuracy treats both errors identically. This is precisely why accuracy fails — it cannot reflect the asymmetric cost of different error types.

Precision: When Being Wrong Is Expensive

Precision answers the question: of all the cases your model flagged as positive, what fraction were actually positive?

Precision = TP / (TP + FP)

The real-world scenario where precision is the dominant metric:

A legal discovery tool at a law firm identifies documents as "potentially relevant to litigation" and routes them to a paralegal for review. The model flags 800 documents. The paralegal reviews all of them. 200 turn out to be relevant. The other 600 were false alarms — documents that were irrelevant but consumed hours of expensive paralegal time.

Precision here is 200 / (200 + 600) = 25%. Every percentage point of precision improvement directly reduces the paralegal's review burden.
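The same calculation as a small sketch. The document counts come from the example above; the 3 minutes of review time per document is a hypothetical figure added only to translate false alarms into hours.

```python
def precision(tp, fp):
    """Fraction of flagged cases that were truly positive."""
    flagged = tp + fp
    # With nothing flagged, precision is undefined; returning 0.0 is a convention chosen here.
    return tp / flagged if flagged else 0.0

# Numbers from the legal discovery example: 800 flagged, 200 relevant.
tp, fp = 200, 600
print(f"precision = {precision(tp, fp):.0%}")  # precision = 25%

# Review time spent on false alarms, assuming a hypothetical 3 minutes per document.
minutes_per_doc = 3
wasted_hours = fp * minutes_per_doc / 60
print(f"hours spent on false alarms = {wasted_hours:.0f}")  # hours spent on false alarms = 30
```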

When to prioritise precision:

Content recommendation: Surfacing an irrelevant recommendation erodes user trust.
Ad targeting: Spending budget serving ads to users who will never convert.
Drug approval: A drug that gets approved but is ineffective or harmful is a false positive.
Judicial systems: A false conviction is a catastrophic FP.

The honest trade-off:

Increasing precision typically means increasing the model's threshold — requiring higher confidence before making a positive prediction. This reduces false positives but increases false negatives. The model becomes more conservative. There is no free lunch. Every threshold decision is a trade-off between precision and recall.
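A small sketch makes the trade-off visible. The scores and labels below are hypothetical, and the helper precision_recall_at is an illustrative name; a case is predicted positive when its score is at or above the threshold.

```python
def precision_recall_at(threshold, y_true, scores):
    """Precision and recall when scores >= threshold are predicted positive."""
    tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
    fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
    fn = sum(1 for t, s in zip(y_true, scores) if s < threshold and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical model scores; higher means "more likely positive".
y_true = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.95, 0.90, 0.80, 0.65, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10]

for threshold in (0.3, 0.5, 0.7, 0.9):
    p, r = precision_recall_at(threshold, y_true, scores)
    print(f"threshold={threshold:.1f}  precision={p:.3f}  recall={r:.3f}")
```

On this toy data, precision climbs from 0.625 to 1.0 as the threshold rises from 0.3 to 0.7, while recall falls from 1.0 to 0.6.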

Recall: When Missing a Case Is Catastrophic

Recall answers the question: of all the actual positives in the dataset, what fraction did your model correctly identify?

Recall = TP / (TP + FN)

Also called sensitivity or true positive rate.

The real-world scenario where recall is the dominant metric:

A biotech company builds a model to screen blood samples for a rare but curable cancer. The cancer affects 2% of the screening population. Missing a true positive — a person with cancer who is cleared — means a missed diagnosis. The cancer progresses. Early-stage curable becomes late-stage difficult.

A false positive — flagging a healthy person for specialist review — means they undergo further tests and are cleared. Stressful, costly, but correctable.

In this context, recall is the survival-relevant metric.

When to prioritise recall:

Medical screening: Missing a positive is dangerous. Screen broadly, investigate further.
Security intrusion detection: Missing an attack is catastrophic. Flag broadly, filter in investigation.
Child safety systems: Missing a case of abuse or trafficking is unacceptable.
Natural disaster early warning: A missed earthquake alert is worse than a false alarm that triggers evacuation.

What low recall looks like in production:

A churn model with 78% recall is catching 78% of customers who will actually churn. The other 22% churn undetected. If the average churned customer is worth ₹12,000 in annual revenue, and 1,000 customers actually churn each month, the 220 missed churners represent about ₹2.64 million in annual revenue leaking silently from every monthly cohort: never flagged, never actioned.
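The leakage arithmetic, as a short sketch. It assumes the 1,000 figure counts customers who actually churn each month (recall is defined over actual positives), and it treats the ₹12,000 as annual revenue lost per missed churner.

```python
recall = 0.78
actual_churners_per_month = 1_000   # assumption: actual churners, not model predictions
annual_value_per_customer = 12_000  # rupees, from the example above

# Churners the model never flags, and the annual revenue attached to them.
missed = round(actual_churners_per_month * (1 - recall))
leaked_revenue = missed * annual_value_per_customer

print(missed)          # 220
print(leaked_revenue)  # 2640000
```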

F1 Score: When You Need to Balance Both

The F1 score is the harmonic mean of precision and recall. It is useful when you need a single summary metric that reflects both and penalises models that are extreme in either direction.

F1 = 2 × (Precision × Recall) / (Precision + Recall)

Why harmonic mean rather than arithmetic mean:

The arithmetic mean of precision and recall rewards models that are extreme in one direction. A model with 100% precision and 1% recall has an arithmetic mean of 50.5% — which sounds almost as good as a model with 80% precision and 80% recall (arithmetic mean: 80%). But the first model is catching essentially nothing. The harmonic mean punishes this asymmetry: the F1 of the first model is 1.98%. The F1 of the second is 80%.
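The comparison is easy to check. This sketch computes both means for the two models described above; the function names are illustrative.

```python
def arithmetic_mean(p, r):
    return (p + r) / 2

def f1(p, r):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    return 2 * p * r / (p + r) if p + r else 0.0

for p, r in [(1.00, 0.01), (0.80, 0.80)]:
    print(f"P={p:.2f} R={r:.2f}  arithmetic={arithmetic_mean(p, r):.4f}  F1={f1(p, r):.4f}")
# P=1.00 R=0.01  arithmetic=0.5050  F1=0.0198
# P=0.80 R=0.80  arithmetic=0.8000  F1=0.8000
```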

The real-world scenario where F1 is the right choice:

A product team at a B2B SaaS company builds a lead scoring model. The sales team has capacity to call 100 leads per week. The model needs to not overwhelm the team with low-quality leads (precision matters) and not miss too many genuinely high-quality leads (recall matters). Neither precision nor recall alone tells them whether the model is serving the sales process well. F1 does.

When F1 is not sufficient:

F1 treats precision and recall as equally important. In many real problems, they are not. The asymmetric cost problem requires a weighted variant: the F-beta score.

F-beta = (1 + β²) × (Precision × Recall) / (β² × Precision + Recall)

When β > 1: recall is weighted more heavily. Use when false negatives are more costly.
When β < 1: precision is weighted more heavily. Use when false positives are more costly.
When β = 1: standard F1 — equal weighting.
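A minimal sketch of the formula above, applied to a hypothetical model with high recall (90%) and modest precision (50%). The function name fbeta is illustrative.

```python
def fbeta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta > 1 favours recall)."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0

# Hypothetical model: high recall, modest precision.
p, r = 0.50, 0.90
for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: {fbeta(p, r, beta):.3f}")
# beta=0.5: 0.549
# beta=1.0: 0.643
# beta=2.0: 0.776
```

The same model scores better as β grows, because larger β gives more credit for its strong recall.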

The Industry-Specific Metric Choice: Which Number to Report and Why

The single most important skill in model evaluation is not calculating these metrics — it is knowing which one to optimise for and which ones to report for a given problem.

Use case 1 — Fraud detection at a payments company:

The fraud team reviews every transaction the model flags. Their capacity is 200 reviews per day. The consequence of missing fraud is financial loss and customer disputes.

Primary metric: Recall, to catch as much fraud as possible within the team's review capacity constraint. Secondary metric: Precision, to ensure the team isn't overwhelmed with false alarms. Reported as: F2 score (β = 2, which treats recall as twice as important as precision) plus a fixed precision floor.

Use case 2 — Churn prediction at a telecom company:

The retention team has budget to call 500 customers per month with a retention offer. The offer costs ₹500 per customer.

Calculate: average LTV of churned customer vs cost of retention call. If LTV = ₹8,000 and call cost = ₹500, and you assume the offer retains every true churner it reaches, you break even at 16 calls per retained customer: 15 false positive calls for every true positive catch. This implies a precision around 6% (1/16) is acceptable, which means you should optimise for recall.
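The break-even point can be derived in a couple of lines. This is a sketch under a simplifying assumption that is not stated in the example: every true churner who receives the offer is retained and the full LTV is saved. If the offer only works some of the time, the acceptable precision is higher.

```python
ltv_saved = 8_000     # rupees retained per true churner reached (assumes the offer always works)
cost_per_call = 500   # rupees per retention offer

# Expected profit per call at precision p is p * ltv_saved - cost_per_call.
# Setting it to zero gives the break-even precision.
break_even_precision = cost_per_call / ltv_saved
calls_per_save = 1 / break_even_precision

print(f"break-even precision = {break_even_precision:.2%}")   # break-even precision = 6.25%
print(f"calls per retained customer = {calls_per_save:.0f}")  # calls per retained customer = 16
```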

Use case 3 — Content moderation at a social platform:

Incorrectly removing a legitimate post alienates the creator. Incorrectly leaving up a violating post harms the community.

For hate speech: aggressive recall (low threshold for removal, accepting some FP). For copyright violation: conservative precision (don't remove content without strong confidence).

Use case 4 — Credit default prediction at a lender:

Both errors have quantifiable financial costs. The optimisation should be driven by weighing the total false positive cost (cost per FP × number of FPs) against the total false negative cost (cost per FN × number of FNs) and choosing the operating point that minimises their sum. This is the expected cost minimisation framework, which subsumes F-score decisions.

Primary metric: Expected cost per decision — which is a function of precision, recall, class distribution, and the financial cost of each error type.
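One way to sketch expected cost minimisation is to score a handful of candidate thresholds by average error cost and keep the cheapest. Everything below is hypothetical: the scores, the labels (1 = defaulted), the two cost figures, and the helper name expected_cost_per_decision.

```python
def expected_cost_per_decision(threshold, y_true, scores, fp_cost, fn_cost):
    """Average cost of errors when scores >= threshold are treated as positive."""
    fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
    fn = sum(1 for t, s in zip(y_true, scores) if s < threshold and t == 1)
    return (fp * fp_cost + fn * fn_cost) / len(y_true)

# Hypothetical default scores (1 = defaulted) and hypothetical costs in rupees.
y_true = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.90, 0.70, 0.60, 0.45, 0.40, 0.30, 0.25, 0.20, 0.10, 0.05]
FP_COST = 2_000    # profit lost by rejecting a borrower who would have repaid
FN_COST = 20_000   # loss from approving a borrower who defaults

thresholds = [0.2, 0.35, 0.5, 0.65, 0.8]
costs = {t: expected_cost_per_decision(t, y_true, scores, FP_COST, FN_COST) for t in thresholds}
best = min(costs, key=costs.get)

for t, c in costs.items():
    print(f"threshold={t:.2f}  expected cost per decision = {c:.0f}")
print(f"cheapest threshold: {best}")  # cheapest threshold: 0.35
```

Because a missed default is ten times as costly as a wrongly rejected borrower in this toy setup, the cheapest threshold is a fairly low one that tolerates two false positives to avoid any false negative.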

The ROC-AUC Trap: When This Metric Also Misleads

While precision, recall, and F1 represent significant improvements over accuracy, the commonly used ROC-AUC score can also paint an overly optimistic picture on imbalanced data.

The ROC curve plots the true positive rate (recall) against the false positive rate as the classification threshold is varied. The AUC summarises the model's discriminative ability across all thresholds.

Why ROC-AUC misleads on imbalanced data:

The false positive rate (FPR) = FP / (FP + TN). On a dataset with 99,000 negatives and 1,000 positives, you can have 500 false positives and the FPR is still only 0.5% — it looks small because the denominator is enormous. The ROC curve looks impressive while the model is generating large numbers of false alarms in absolute terms.
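To make this concrete, the sketch below holds one ROC operating point fixed and varies only the class balance. The false positive rate comes from the example above; the 80% true positive rate and the helper name precision_at_prevalence are assumptions for illustration.

```python
def precision_at_prevalence(tpr, fpr, prevalence):
    """Precision implied by a fixed (TPR, FPR) operating point at a given positive rate."""
    tp = tpr * prevalence          # share of all cases that are true positives
    fp = fpr * (1 - prevalence)    # share of all cases that are false positives
    return tp / (tp + fp)

tpr = 0.80            # assumption for illustration: the model catches 80% of positives
fpr = 500 / 99_000    # from the example above: 500 false positives among 99,000 negatives

for prevalence in (0.5, 0.01, 0.001):
    p = precision_at_prevalence(tpr, fpr, prevalence)
    print(f"positives = {prevalence:.1%} of data  ->  precision = {p:.1%}")
```

The point on the ROC curve never moves, yet precision drops from above 99% on balanced data to roughly 62% at 1% positives and under 15% at 0.1% positives.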

The Precision-Recall curve is the better choice for imbalanced problems:

The PR curve plots precision against recall as the threshold varies. It is not flattered by a large majority class because neither precision nor recall involves the true negative count. A model that is good by PR curve standards is a model that is actually useful for finding the positive class.

When to use ROC-AUC vs PR-AUC:

Balanced datasets: ROC-AUC is appropriate and interpretable.
Imbalanced datasets: PR-AUC is more informative.
When the positive class is rare and you care primarily about performance on positives: always use PR-AUC.
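The gap between the two summaries can be large. The sketch below computes both on a synthetic, fully deterministic dataset of 1,000 negatives and 10 positives. ROC-AUC is computed as the probability that a random positive outscores a random negative, and the PR curve is summarised by average precision (one common way to do so). Both functions are plain-Python illustrations rather than a library API, and the scores are invented.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y_true, scores):
    """Mean of the precision measured at the rank of each true positive."""
    ranked = sorted(zip(scores, y_true), reverse=True)
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)

# Synthetic data: negatives spread evenly over [0, 1); positives all score
# above 0.95, but a few dozen negatives score in the same range.
neg_scores = [i / 1000 for i in range(1000)]
pos_scores = [0.9505 + 0.005 * k for k in range(10)]
y_true = [0] * len(neg_scores) + [1] * len(pos_scores)
scores = neg_scores + pos_scores

auc = roc_auc(y_true, scores)
ap = average_precision(y_true, scores)
print(f"ROC-AUC = {auc:.3f}, average precision = {ap:.3f}")
```

On this data the ROC-AUC is about 0.97 while average precision is below 0.2: the ranking is good in relative terms, but most of the top-scored cases are still negatives.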

The Threshold Decision: The Business Judgment That Metric Selection Cannot Make for You

Most classification models produce a probability or score. The threshold converts that score into a binary prediction. The choice of threshold is not purely a statistical decision; it is a business decision that determines the operating precision and recall of your deployed model.

What threshold selection requires:

The financial or operational cost of a false positive in your context
The financial or operational cost of a false negative in your context
The expected class distribution in the production environment
The capacity constraints of whatever process acts on the model's outputs

A fraud model with 10,000 daily transactions, a review team capacity of 200, and a fraud rate of 0.3% (about 30 fraudulent transactions a day) needs a threshold that produces at most about 200 positive predictions per day while maintaining the highest possible recall. Finding that threshold requires plotting precision-recall at each threshold and identifying the operating point that meets the capacity constraint while maximising recall.
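A capacity-constrained threshold can be sketched as follows. The transaction volume, fraud rate, and review capacity come from the paragraph above; the score distributions are synthetic (drawn with a fixed seed) and stand in for a real model's output, so the printed precision and recall are illustrative only.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# One hypothetical day: 10,000 transactions, 0.3% fraud (30 cases), 200 reviews available.
n_total, n_fraud, capacity = 10_000, 30, 200
y_true = [1] * n_fraud + [0] * (n_total - n_fraud)

# Synthetic scores: fraud tends to score high, legitimate traffic tends to score low.
scores = [random.betavariate(5, 2) if t == 1 else random.betavariate(1, 8) for t in y_true]

# The capacity-th highest score is the lowest threshold that keeps the
# review queue within capacity (assuming no tied scores at the cut-off).
threshold = sorted(scores, reverse=True)[capacity - 1]

flagged = [t for t, s in zip(y_true, scores) if s >= threshold]
tp = sum(flagged)
precision = tp / len(flagged)
recall = tp / n_fraud

print(f"threshold={threshold:.3f}  flagged={len(flagged)}  "
      f"precision={precision:.1%}  recall={recall:.1%}")
```

Taking the lowest threshold that still fits the queue maximises recall for that capacity, since lowering it further could only add flagged cases the team cannot review.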

This cannot be done by looking at aggregate metrics. It requires threshold analysis — and threshold analysis requires knowing the business context of the model's deployment.

Closing: From Metric Selection to Analytical Credibility

The ability to select the right evaluation metric — and explain the selection in business terms — is what separates data scientists who build models from data scientists who build models that get used.

A model that achieves high accuracy on an imbalanced dataset, or high ROC-AUC on a problem where the positive class is rare, may be technically sophisticated and practically useless. The practitioner who catches this before deployment, explains why the model is measuring the wrong thing, and proposes the correct metric is providing value that the model alone cannot provide.

At Meritshot, the Data Science programme teaches evaluation metric selection as a business judgment problem, not just a mathematical one. Students work through cases where accuracy is high and the model is degenerate, where F1 is the wrong choice because the error costs are asymmetric, and where the threshold decision requires financial modelling, not just statistical analysis.
