
Supervised vs Unsupervised vs Reinforcement Learning: Which One to Use When

The correct starting point for ML paradigm choice is the decision you're trying to make — not the data you have. Here's the framework practitioners actually use.

Meritshot · 7 min read

Most teams approach the choice of ML paradigm by asking: "What data do we have, and what algorithm fits it?" This is the wrong starting question. The right starting question is: "What decision are we trying to improve, and what would it mean to improve it?"

The decision question comes first. The data question comes second. Teams that reverse this order end up with technically correct implementations of the wrong paradigm, which run in production for months before the mismatch becomes obvious enough for someone to call it out.

[Figure: Machine learning paradigm comparison showing different model types]

The Routebox Nine-Month Detour

A logistics company — call them Routebox — spent nine months building a reinforcement learning system to optimize delivery routing. The system was technically sophisticated. The RL environment was well-specified. The reward signal was economically meaningful.

Nine months in, a new data scientist asked a simple question: "Do we actually need the agent to learn from interaction, or do we need to predict the best route given inputs we already have?"

The answer was the latter. The problem was a supervised learning problem. Routebox had solved it with the most technically interesting approach available rather than the most appropriate one. The RL system was scrapped. A gradient-boosted model trained on historical route outcomes went into production in six weeks and outperformed the RL system.

This is not a story about RL being bad. It's a story about paradigm choice being driven by technical fascination rather than problem structure.

Supervised Learning: The Labeling Cost Is Real

Supervised learning is the right choice when you have labeled examples of the outcome you care about and you want to predict that outcome for new inputs. The quality of the labels sets the ceiling on the quality of the model.

The labeling cost is routinely underestimated. PathScan, a medical imaging startup, estimated their labeling budget at $600K for 100,000 annotated radiology images. The actual cost was $2.1M. The reasons: radiologists cost more per hour than expected, annotation disagreement required third-annotator tie-breaking for 15% of images, and the annotation schema had to be revised twice after early labels revealed inconsistencies.
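The arithmetic behind the PathScan overrun is worth spelling out. The sketch below uses only the figures quoted above. The `labeling_cost` function is a hypothetical cost model, not PathScan's actual budget sheet: its parameter names, and the default of two annotators per image (inferred from the "third-annotator" tie-break), are assumptions for illustration.

```python
# Figures quoted in the PathScan example.
IMAGES = 100_000
ESTIMATED_BUDGET = 600_000   # USD
ACTUAL_COST = 2_100_000      # USD

estimated_per_image = ESTIMATED_BUDGET / IMAGES   # 6.0 USD per image
actual_per_image = ACTUAL_COST / IMAGES           # 21.0 USD per image
overrun_factor = ACTUAL_COST / ESTIMATED_BUDGET   # 3.5x the estimate


def labeling_cost(n_items, cost_per_annotation, annotators_per_item=2,
                  tiebreak_rate=0.0, schema_revisions=0, relabel_fraction=0.0):
    """Hypothetical cost model: base passes + tie-breaks + relabeling.

    tiebreak_rate    -- share of items needing one extra annotation
    schema_revisions -- number of times the annotation schema changed
    relabel_fraction -- share of the base work redone per revision
    """
    base = n_items * annotators_per_item * cost_per_annotation
    tiebreaks = n_items * tiebreak_rate * cost_per_annotation
    relabeling = schema_revisions * relabel_fraction * base
    return base + tiebreaks + relabeling
```

Running a model like this with pessimistic values for `tiebreak_rate` and `schema_revisions` before committing to a budget is one way to see how quickly the per-image cost moves away from the first estimate.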

Supervised learning is the right choice when:

  • You have labeled data or can acquire it at a defensible cost
  • The labels accurately represent the outcome you care about
  • The prediction itself directly improves the decision

Supervised learning is the wrong choice when:

  • Labels are unavailable, too expensive, or subject to labeler disagreement you can't resolve
  • The outcome you care about isn't directly observable as a label
  • The system needs to discover structure that isn't pre-specified

Unsupervised Learning: Almost Never the Final Destination

Unsupervised learning discovers structure in data without labels. Clustering, dimensionality reduction, anomaly detection, density estimation — these are the core techniques.

The common mistake: deploying unsupervised learning as the end product when it should be an intermediate step.

Tradehouse, a fintech, built a customer segmentation model using k-means clustering. The model produced eight distinct customer clusters. After six months of treating these clusters as customer segments for marketing and product decisions, a business review found that cluster membership had essentially no predictive relationship to any business outcome they cared about — conversion, retention, or LTV.

The clustering had found statistical structure in the data. That statistical structure didn't correspond to meaningful behavioral segments. The team was solving a math problem, not a business problem.
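A check like the one Tradehouse's business review performed can be run before clusters are adopted as segments. The sketch below is one possible test, not the method Tradehouse used: it computes the correlation ratio (eta squared), the share of variance in a business outcome that cluster membership explains. The function name and inputs are illustrative.

```python
from collections import defaultdict


def correlation_ratio(cluster_ids, outcomes):
    """Eta squared: share of outcome variance explained by cluster membership.

    Returns a value between 0 (clusters say nothing about the outcome)
    and 1 (the outcome is fully determined by the cluster).
    """
    if len(cluster_ids) != len(outcomes) or not outcomes:
        raise ValueError("need equally sized, non-empty inputs")

    overall_mean = sum(outcomes) / len(outcomes)
    total_ss = sum((y - overall_mean) ** 2 for y in outcomes)
    if total_ss == 0:
        return 0.0  # the outcome never varies, so there is nothing to explain

    # Group outcomes by cluster, then sum the size-weighted squared
    # distance of each cluster mean from the overall mean.
    groups = defaultdict(list)
    for cluster, y in zip(cluster_ids, outcomes):
        groups[cluster].append(y)
    between_ss = sum(
        len(ys) * (sum(ys) / len(ys) - overall_mean) ** 2
        for ys in groups.values()
    )
    return between_ss / total_ss
```

With `outcomes` set to a retention flag (0 or 1) or an LTV figure per customer, a value near zero is a signal that the clusters are statistical structure rather than behavioral segments.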

Unsupervised learning is most valuable as a pipeline step: feature extraction, dimensionality reduction before a supervised model, anomaly detection to flag unusual inputs. As a standalone product it works in narrow cases (search index construction, de-duplication) and often fails to deliver business value in the rest.

[Figure: Comparison chart of ML paradigms showing use cases and tradeoffs]

Reinforcement Learning: Mostly Contextual Bandits in Disguise

RL is the paradigm for sequential decision-making: an agent takes actions, receives rewards, and learns a policy that maximizes cumulative reward over time. It is the right choice for a small, specific class of problems.

It is the wrong choice for most problems teams try to apply it to.

Verveloop, a B2B SaaS company, built a full RL system for pricing optimization. The agent explored price points, observed conversion rates, and updated its policy. After eight months, the system outperformed a simple A/B test by 2-3% in conversion rate.

The post-mortem question: was the RL system necessary for a 2-3% gain, or would a contextual bandit — a much simpler statistical approach — have achieved the same result at 10% of the development cost?

The answer was almost certainly the latter. Most "RL" problems in production are actually bandit problems: the context is low-dimensional, the reward arrives immediately, and an action does not change the states the system sees later. Contextual bandits handle these well without the long-horizon credit assignment and optimization complexity of full RL.

Full RL is appropriate when:

  • Actions have long-horizon consequences that play out over many steps
  • The environment is dynamic and the policy needs to generalize to new states
  • You can simulate the environment safely before deploying

For pricing, content recommendation, and resource allocation, start with bandits. Move to RL only if the bandit approach demonstrably fails due to the sequential nature of the problem.
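To show how little machinery a bandit needs, here is a minimal sketch of an epsilon-greedy contextual bandit that keeps one running mean reward per (context, action) pair. It is an illustration, not Verveloop's system; the class name, the tabular representation, and the default `epsilon` are assumptions.

```python
import random
from collections import defaultdict


class EpsilonGreedyContextualBandit:
    """Tabular contextual bandit: a running mean reward per (context, action)."""

    def __init__(self, actions, epsilon=0.1, seed=None):
        self.actions = list(actions)
        self.epsilon = epsilon            # probability of a random action
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)    # times each (context, action) was tried
        self.means = defaultdict(float)   # mean observed reward, 0.0 if untried

    def choose(self, context):
        # Explore with probability epsilon, otherwise pick the action with
        # the best mean reward seen so far for this context.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.means[(context, a)])

    def update(self, context, action, reward):
        # Incremental mean: no reward history needs to be stored.
        key = (context, action)
        self.counts[key] += 1
        self.means[key] += (reward - self.means[key]) / self.counts[key]
```

In a pricing setting, the context might be a customer segment, the actions a handful of price points, and the reward 1 for a conversion and 0 otherwise: call `choose(segment)`, show that price, then call `update(segment, price, reward)`. There is no environment simulator and no multi-step return to estimate, which is where most of the cost of full RL comes from.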

Self-Supervised Learning: The Paradigm That's Reshaping Everything

There's a fourth paradigm that is less discussed but increasingly dominant: self-supervised learning. It uses the data itself to generate training signal, without human-annotated labels.

OptiMD, a healthcare AI company, built a diagnostic support system for a rare condition with only 4,000 labeled cases in the literature. Training a supervised model from scratch was not viable at that scale. They used self-supervised pretraining on 1.2 million unlabeled imaging studies from the same modality, then fine-tuned on the 4,000 labeled cases.

The resulting model outperformed models from other companies that were trained on labeled data alone, even though those companies had 10x as many labeled examples. Self-supervised pretraining on unlabeled data provided the foundational representation; the small labeled set provided the final supervision signal.

Self-supervised learning is the right choice when:

  • Labeled data is scarce but unlabeled data is available
  • You're building on a pretrained foundation model (which itself was trained self-supervised)
  • The domain has intrinsic structure that can generate training signal (text prediction, masked image modeling, contrastive pairs)
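The phrase "the data generates its own training signal" can be made concrete with a masking pretext task. The sketch below turns an unlabeled token sequence into an input and a set of targets, in the style of masked language modeling. The function name, the `[MASK]` placeholder, and the default mask rate of 0.15 (a common choice, not something the examples above specify) are assumptions.

```python
import random

MASK = "[MASK]"


def masked_example(tokens, mask_rate=0.15, seed=None):
    """Turn one unlabeled token sequence into (masked_input, targets).

    targets maps each masked position to the original token, so the
    label comes from the data itself rather than from an annotator.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    targets = {}
    for i, token in enumerate(tokens):
        if rng.random() < mask_rate:
            masked[i] = MASK
            targets[i] = token
    return masked, targets
```

A model trained to fill in the masked positions needs no human labels, and the same idea carries over to masked image patches. The small labeled set is then used only for fine-tuning.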

The Four-Axis Decision Framework

When choosing a paradigm, answer these four questions:

  1. What outcome label do I have or can I create? If you have direct labels for the outcome you care about: supervised. If not: the remaining questions point toward self-supervised, unsupervised, or bandit approaches.

  2. What's my labeling budget? If labels are cheap or already exist: supervised is viable. If labels require expensive expert annotation: consider self-supervised pretraining + fine-tuning or unsupervised feature extraction.

  3. Is the decision sequential or one-shot? If each prediction feeds a single, independent decision: supervised or bandit. If actions change the situation that later decisions must respond to: consider RL, but start with bandits.

  4. What does "good" look like? If "good" is measurable at prediction time: supervised. If "good" only emerges from exploration: bandit or RL.
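The four questions above can be written down as a small function. This is a sketch of one reasonable reading, not a definitive rule set: the parameter names, the order in which the checks run, and the final fallback are assumptions, and the return value is meant as a starting point for discussion.

```python
def suggest_paradigm(has_outcome_labels, labels_affordable,
                     has_unlabeled_data, sequential, needs_exploration):
    """Map answers to the four questions onto a starting-point paradigm."""
    # Question 3: actions change what later decisions have to deal with.
    if sequential:
        return "reinforcement learning (try a bandit first)"
    # Question 4: "good" only shows up after trying an action.
    if needs_exploration:
        return "contextual bandit"
    # Questions 1 and 2: labels exist or can be bought at a defensible cost.
    if has_outcome_labels or labels_affordable:
        return "supervised"
    # Question 2: labels are too expensive, but raw data is plentiful.
    if has_unlabeled_data:
        return ("self-supervised pretraining + fine-tuning, "
                "or unsupervised feature extraction")
    # Assumed fallback: nothing above applies.
    return "reconsider the problem framing"
```

For example, `suggest_paradigm(True, True, False, False, False)` returns `"supervised"`, which matches the Routebox outcome: historical route outcomes were already available as labels and each routing decision could be treated as a one-shot prediction.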

When Not to Use ML

The question that doesn't appear in most ML frameworks: is ML the right tool at all?

  • If a simple business rule covers 90% of cases and the remaining 10% are rare enough to handle manually: use the rule.
  • If the decision is made so infrequently that a model can never have enough training data: don't build a model.
  • If the cost of errors is so high that a human must review every model decision anyway: the model adds cost without removing the human work.
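The three checks in this list can be run as a short pre-flight function. The sketch below is illustrative only: the 90% rule-coverage threshold comes from the first bullet, while `min_examples_needed` and the other parameter names are hypothetical inputs a team would have to estimate for itself.

```python
def should_build_model(rule_coverage, expected_training_examples,
                       min_examples_needed, human_reviews_every_decision):
    """Return (build, reason) after the three 'is ML the right tool' checks.

    rule_coverage -- share of cases (0 to 1) a simple business rule handles;
                     assumes the remainder can be handled manually.
    """
    if rule_coverage >= 0.90:
        return False, "a simple rule already covers most cases"
    if expected_training_examples < min_examples_needed:
        return False, "too few decisions to ever train on"
    if human_reviews_every_decision:
        return False, "every model decision is reviewed by a human anyway"
    return True, "no blocker found"
```

A `True` result only means none of the three blockers applies. The paradigm choice still has to be made on the merits of the decision being improved.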

ML is a powerful tool. It is not the right tool for every decision. The paradigm selection process starts with the honesty check: should we be building a model here at all?


Meritshot's Data Science curriculum covers all four paradigms — supervised, unsupervised, reinforcement, and self-supervised — through production case studies that show not just how each works, but when each earns its place.
