
Synthetic Data for Training: When It Helps and When It Quietly Poisons Your Model

Synthetic data is half a good idea. The other half is model collapse, the 1% problem, and the sim-to-real gap. Here's how to use it without paying the long-term tax.

Meritshot | 6 min read
Tags: Synthetic Data, Training Data, Model Collapse, Machine Learning, Data Science, AI

The pitch for synthetic data is irresistible on a slide. It's cheap. It's privacy-safe. It scales infinitely. It can fill the gaps real data leaves behind.

The pitch is half-true. The other half is the part nobody puts on a slide.

Recent peer-reviewed research has established what many practitioners had been seeing in production for years: synthetic data, used carelessly, doesn't just fail to help; it actively degrades models in ways that are subtle, cumulative, and often invisible until the system meets real users. The 2024 Nature paper by Shumailov and colleagues gave the failure mode a name: model collapse. Subsequent work demonstrated that even small proportions of synthetic data can quietly poison training.


The Trade Nobody Names: Cheaper Per Example, Statistically Thinner

Real data carries information. Each example encodes something about the real world — a customer's actual decision, a patient's actual symptom presentation. Edge cases, weird interactions, and tail behaviors are baked into the distribution because they happened.

Synthetic data carries an approximation of information. The generator — whether a GAN, diffusion model, LLM, or rules-based simulator — has its own statistical fingerprint. It tends toward what its training data emphasized, smooths over what its training data missed, and confidently produces examples that look plausible but don't reflect the underlying real-world distribution.

The trade is: synthetic data is cheaper per example because it's thinner per example. You're getting more volume for less money, but the information density of each example is lower than the real-world equivalent.

The Telehealth Reckoning

A telehealth startup built a triage classifier using de-identified historical chat logs and synthetic patient queries generated by an LLM. In offline evaluation, the model performed well. In production, the failure pattern was specific: common symptom presentations were handled well, but rare phrasings, cultural metaphors, and patients describing symptoms in non-clinical language were handled poorly and often confidently misrouted.

The synthetic data had a demographic and linguistic distribution that mirrored the LLM's training data, not the actual patient population. In effect, the model was calibrated to a narrower world than the one it operated in.

Model Collapse: What the Research Actually Shows

Model collapse is the well-documented phenomenon where a model trained on outputs from previous model generations loses fidelity to the original real-world distribution. With each generation of recursive training, rare patterns vanish first, distributions narrow, and outputs drift toward bland central tendencies.
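The "rare patterns vanish first" behavior can be reproduced in a toy setting. The sketch below is an illustration under stated assumptions, not the setup of the cited research: the "model" is just a table of category frequencies, "training" means re-estimating that table from a finite sample drawn from the previous table, and the function names, probabilities, sample size, and generation count are all made up for the example.

```python
import random

def recursive_resample(probs, samples_per_gen, generations, seed=0):
    """Re-estimate a categorical distribution from its own samples, repeatedly.

    Returns one estimated distribution per generation; index 0 is the
    original "real" distribution.
    """
    rng = random.Random(seed)
    categories = list(range(len(probs)))
    current = list(probs)
    history = [current]
    for _ in range(generations):
        # "Generate" a synthetic dataset from the current model.
        draws = rng.choices(categories, weights=current, k=samples_per_gen)
        # "Train" the next model: empirical frequencies of the synthetic data.
        current = [draws.count(c) / samples_per_gen for c in categories]
        history.append(current)
    return history

def support_size(dist):
    """Number of categories the model can still produce."""
    return sum(1 for p in dist if p > 0)

# Toy "real" distribution: common patterns plus a thin tail (illustrative numbers).
real = [0.50, 0.30, 0.15, 0.04, 0.01]
history = recursive_resample(real, samples_per_gen=100, generations=200)
sizes = [support_size(d) for d in history]
print("categories alive at generation 0:  ", sizes[0])
print("categories alive at generation 200:", sizes[-1])
```

The structural point does not depend on the seed: once a category draws zero samples in any generation, its estimated probability is zero and no later generation can bring it back. Low-probability categories are the most likely to hit zero first, which is the tail loss described above.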

Key research findings:

  • The 1% threshold. Dohmatob and colleagues showed that even small proportions of synthetic data (as little as 1% in some setups) can meaningfully harm performance in recursive training settings. The comforting intuition that "a little synthetic data won't hurt" does not hold up empirically in those setups.

  • Mixing helps, but only conditionally. Accumulating real data alongside synthetic data bounds the test error to a finite ceiling rather than letting it grow without bound. The condition that matters is whether real data continues to flow in.

  • Visual quality and training utility are decoupled. Newer text-to-image generators, despite producing more visually realistic images, can actually produce worse training data than older models. The newer generators have collapsed to a narrower, aesthetically optimized distribution that hurts downstream classifier accuracy.
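The difference between replacing data and accumulating it can be sketched with a toy Gaussian simulation. This is a minimal illustration, not the experimental setup of the research summarized above: the "model" is a fitted mean and standard deviation, and the batch size, generation count, and function names are assumptions chosen for the example. In the "replace" strategy each model sees only the previous model's samples; in the "accumulate" strategy the original real batch and every earlier synthetic batch stay in the training pool.

```python
import math
import random

def fit_gaussian(xs):
    """Maximum-likelihood mean and standard deviation of a sample."""
    mean = math.fsum(xs) / len(xs)
    var = math.fsum((x - mean) ** 2 for x in xs) / len(xs)
    return mean, math.sqrt(var)

def simulate(strategy, batch_size=25, generations=400, seed=0):
    """Recursively fit a Gaussian, sample from the fit, and refit.

    Returns the fitted standard deviation of the final model.
    The real data is drawn with standard deviation 1.0.
    """
    rng = random.Random(seed)
    pool = [rng.gauss(0.0, 1.0) for _ in range(batch_size)]  # the only real data
    mean, std = fit_gaussian(pool)
    for _ in range(generations):
        synthetic = [rng.gauss(mean, std) for _ in range(batch_size)]
        if strategy == "replace":
            pool = synthetic           # forget everything older
        elif strategy == "accumulate":
            pool = pool + synthetic    # keep real data and all earlier batches
        else:
            raise ValueError(f"unknown strategy: {strategy!r}")
        mean, std = fit_gaussian(pool)
    return std

replace_std = simulate("replace")
accumulate_std = simulate("accumulate")
print(f"replace:    final std = {replace_std:.2e}")
print(f"accumulate: final std = {accumulate_std:.2f}")
```

With these settings the replace run should end with a standard deviation far below the true value of 1.0, while the accumulate run should stay on the order of 1: each new synthetic batch is a shrinking share of the pool, so the estimation noise it adds shrinks with it.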

[Figure: distribution comparison charts showing model collapse over recursive training generations]

The Fine-Tuning Pipeline That Quietly Degraded

A SaaS company built an internal fine-tuning pipeline for customer-support response generation. Each quarter, they fine-tuned on real customer interactions plus LLM-generated synthetic conversations to fill coverage gaps.

Across four quarterly cycles, internal benchmarks looked stable. A careful audit told a different story:

  • The evaluation set was itself partially LLM-generated
  • On a held-out set of real customer interactions, performance had degraded steadily
  • Linguistic diversity had narrowed measurably
  • Edge-case handling — sarcasm, ambiguous requests, multi-issue conversations — had degraded most

This is collapse in real production: invisible on the polluted benchmark, visible on real data, cumulative across cycles.
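One way to put a number on narrowing linguistic diversity is the distinct-n ratio: the share of n-grams in a set of model outputs that are unique. The source does not say how the audit measured diversity, so this metric is a stand-in chosen for illustration; the whitespace tokenizer and the example responses below are likewise assumptions.

```python
def distinct_n(texts, n=2):
    """Share of n-grams that are unique across a list of texts (0.0 to 1.0).

    Lower values mean more repeated phrasing. Whitespace tokenization is a
    simplifying assumption; a production pipeline would use a real tokenizer.
    """
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0

# Hypothetical model responses from two fine-tuning cycles (made-up examples).
cycle_1 = [
    "sorry about the double charge, I have refunded it",
    "that error usually means the token expired, try signing in again",
    "good catch, the export button moved to the reports tab",
]
cycle_4 = [
    "I apologize for the inconvenience, I have refunded it",
    "I apologize for the inconvenience, try signing in again",
    "I apologize for the inconvenience, the button moved",
]

print(round(distinct_n(cycle_1), 2))  # 1.0
print(round(distinct_n(cycle_4), 2))  # 0.65
```

For the comparison to mean anything, the responses should be generated from the same fixed set of real prompts in every cycle, so that a drop in the ratio reflects the model and not a change in the inputs.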

Where Synthetic Data Legitimately Earns Its Place

For all the warnings, synthetic data is not categorically bad. There are four use cases where it genuinely outperforms alternatives:

Use Case 1: Rare-class augmentation where real examples are vanishingly few. When the positive class is genuinely rare (fraud, equipment failure, certain medical conditions) and you have a small but high-quality set of real positive examples, synthetic augmentation of the rare class can improve recall. The key: synthetic data extends a small real distribution rather than replacing it.
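As a rough sketch of "extending a small real distribution", the example below creates synthetic rare-class points by interpolating between pairs of real ones. It is a simplified, SMOTE-style illustration and not SMOTE itself (pairs are picked at random instead of among nearest neighbours); the function name, feature names, and values are hypothetical.

```python
import random

def interpolate_minority(real_points, n_synthetic, seed=0):
    """Create synthetic minority-class points on segments between real ones.

    Simplified SMOTE-style sketch: each synthetic point lies somewhere on
    the straight line between two randomly chosen real points.
    """
    if len(real_points) < 2:
        raise ValueError("need at least two real examples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        a, b = rng.sample(real_points, 2)
        t = rng.random()  # position along the segment from a to b
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic

# Hypothetical fraud examples as (amount, hours_since_last_login) features.
real_fraud = [(950.0, 0.2), (1200.0, 0.1), (870.0, 0.5), (1500.0, 0.3)]
synthetic_fraud = interpolate_minority(real_fraud, n_synthetic=8)

# Extend the real rare class; never replace it.
training_fraud = real_fraud + synthetic_fraud
print(len(real_fraud), "real +", len(synthetic_fraud), "synthetic")
```

Because every synthetic point sits between two real ones, this kind of augmentation cannot invent behavior outside what the real examples already span. That is a limitation as much as a safeguard: it densifies the known region of the rare class but adds no new information about cases that were never observed.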

Use Case 2: Privacy-preserving workflows in regulated domains. Healthcare, finance, and certain government applications cannot use real data freely. Synthetic data generated under explicit statistical constraints (preserving marginal distributions, specific correlations, and subgroup properties) is sometimes the only viable path. The honest caveat: privacy and utility are in genuine tension, and models trained on privacy-preserving synthetic data typically underperform models trained on the real data it stands in for.

Use Case 3: Edge case generation for safety-critical systems. Autonomous driving and robotics benefit from synthetic data to expose models to dangerous edge cases too rare to encounter organically. The condition: the synthetic data is generated against a physics-grounded simulator, not a general-purpose generator. The simulator's coverage limits are known and documented.

Use Case 4: Cold-start bootstrapping before real data exists. When a product is launching with no existing data, synthetic data can provide initial training material. The key is an explicit transition plan: synthetic data is used as a temporary scaffold, not a permanent foundation, with real data accumulating to replace it.

Mitigation Patterns That Actually Work

Mix and accumulate real data — don't replace it. Real data collection should never stop, even when synthetic data covers most of the training set. Collapse compounds when real data stops flowing.

Use provenance tagging throughout the pipeline. Every example should carry a tag indicating its source: real, synthetic, augmented-from-real, or unknown. The "unknown" category is what bites teams that ingest web-scale data without provenance metadata.
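A minimal sketch of what such a tag might look like in code, assuming a simple in-memory dataset. The class and field names are hypothetical; the four tag values are the ones listed above. The one deliberate design choice is the default: an example that arrives without a tag is recorded as unknown, never as real.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum

class Provenance(Enum):
    REAL = "real"
    SYNTHETIC = "synthetic"
    AUGMENTED_FROM_REAL = "augmented-from-real"
    UNKNOWN = "unknown"

@dataclass(frozen=True)
class Example:
    text: str
    label: str
    # Untagged data is never assumed to be real.
    provenance: Provenance = Provenance.UNKNOWN

def provenance_report(examples):
    """Fraction of the dataset from each source, including empty sources."""
    counts = Counter(ex.provenance for ex in examples)
    total = len(examples)
    return {p.value: (counts[p] / total if total else 0.0) for p in Provenance}

dataset = [
    Example("card was charged twice", "billing", Provenance.REAL),
    Example("cannot reset my password", "login", Provenance.REAL),
    Example("my invoice shows the wrong total", "billing", Provenance.SYNTHETIC),
    Example("app crashes on startup", "bug"),  # scraped, source not recorded
]
print(provenance_report(dataset))
```

A report like this can run on every training set build, so a growing "unknown" share shows up as a number someone has to look at.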

Bound the recursive loop. If your pipeline involves training a model whose outputs feed back into a future training set, bound the depth of recursion. Self-distillation, RLHF-style loops, and active learning with model-generated labels all carry recursive risk.
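One possible way to bound the loop is to track a recursion depth on every example and filter on it. The scheme below is an assumption made for illustration, not an established standard: real data has depth 0, a model inherits the deepest depth in its training set, and anything that model generates is tagged one level deeper. The names and the default limit of 1 are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Example:
    text: str
    depth: int  # 0 = real data; k = output of a model trained on depth k-1 data

def model_depth(training_set):
    """A model is as 'deep' as the deepest example it was trained on."""
    return max((ex.depth for ex in training_set), default=0)

def tag_outputs(texts, training_set):
    """Tag generated texts one level deeper than the model that produced them."""
    depth = model_depth(training_set) + 1
    return [Example(text, depth) for text in texts]

def bound_recursion(candidates, max_depth=1):
    """Keep only examples within the allowed recursion depth."""
    return [ex for ex in candidates if ex.depth <= max_depth]

real = [Example("real ticket A", 0), Example("real ticket B", 0)]
gen_1 = tag_outputs(["synthetic ticket C"], training_set=real)          # depth 1
gen_2 = tag_outputs(["synthetic ticket D"], training_set=real + gen_1)  # depth 2

next_training_set = bound_recursion(real + gen_1 + gen_2, max_depth=1)
print([ex.text for ex in next_training_set])
# ['real ticket A', 'real ticket B', 'synthetic ticket C']
```

The depth only stays meaningful if it is written at generation time. Reconstructing it after the fact from a mixed corpus is generally not possible, which is the same reason the "unknown" provenance category is dangerous.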

Validate against real-data holdouts — always. Never evaluate models on synthetic data. The eval set must be real, drawn from production or production-equivalent conditions. This is obvious until you're under pressure to ship and someone suggests using synthetic data for eval too.
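A small sketch of how that rule could be enforced in an evaluation script, assuming each eval record carries a source tag. The record layout, function names, and sample predictions are hypothetical. The idea is that the release-gating number is computed from real examples only, and the script refuses to produce a score if there are none.

```python
def accuracy_by_source(records):
    """Accuracy per source tag.

    Each record is a (source, prediction, label) tuple, where source is a
    string tag such as "real" or "synthetic".
    """
    correct, total = {}, {}
    for source, prediction, label in records:
        total[source] = total.get(source, 0) + 1
        correct[source] = correct.get(source, 0) + (prediction == label)
    return {source: correct[source] / total[source] for source in total}

def release_metric(records):
    """The number that gates a release: accuracy on real examples only."""
    scores = accuracy_by_source(records)
    if "real" not in scores:
        raise ValueError("no real examples in the eval set; refusing to report a score")
    return scores["real"]

# Hypothetical eval records: (source, prediction, label).
records = [
    ("synthetic", "billing", "billing"),
    ("synthetic", "login", "login"),
    ("synthetic", "billing", "billing"),
    ("real", "billing", "billing"),
    ("real", "login", "billing"),
    ("real", "login", "login"),
    ("real", "billing", "outage"),
]

blended = sum(pred == label for _, pred, label in records) / len(records)
print(f"blended accuracy:   {blended:.2f}")                  # 0.71
print(f"real-only accuracy: {release_metric(records):.2f}")  # 0.50
```

In this made-up example the blended score flatters the model because the synthetic records are the easy ones, which is the same pattern as a benchmark that is partially LLM-generated.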

Use the right generator for the right job. Physics simulators for physics-grounded problems. Differentially private generators for privacy-constrained problems. Class-balancing generators for imbalanced classification. A generic LLM should not be your default synthetic data source for everything.

Synthetic data is a real engineering tool with a real engineering cost. The cost is paid in three currencies — distribution narrowness, generator artifacts, and recursive degradation — that don't appear on the vendor's pricing page. The teams that get value from synthetic data are the ones who understand both what it provides and what it takes.


