
Your A/B Test Shows Significance. Your Stakeholder Knows the Sample Is Wrong.

Statistical significance is the beginning of the conversation, not the end. A practitioner's guide to the four sample composition failures that kill valid tests — and the systems-thinking checks that experienced experimenters run before trusting a significant result.

Meritshot · 11 min read
Tags: A/B Testing, Statistics, Experimentation, Data Science, Product Analytics

The p-value is 0.03. The confidence interval doesn't cross zero. The dashboard is green. You walk into the review meeting ready to ship the feature — and the product leader or data science head asks one question that stops you cold.

"Which users are in this test?"

Not "what's the sample size?" Not "how long did you run it?" The question is about composition. Because a test can be perfectly valid statistically and completely wrong analytically — and the gap between those two things is where most A/B testing mistakes actually live.

This article is about what experienced experimenters see in a test result that junior analysts consistently miss, and why statistical significance is the beginning of the conversation, not the end of it.


Statistical Significance Is a Threshold, Not a Truth

The most dangerous thing that can happen to an analyst after learning about p-values is believing that a p-value below 0.05 means the result is real. What it actually means is narrower: if there were no true difference, a gap at least as large as the one you observed would be unlikely to arise by chance alone, given the data in front of you. The "given the data in front of you" part is where the entire problem lives.

Statistical significance tells you something about internal validity: did the observed difference in your sample exceed what random variation would produce? It says nothing about external validity: does the difference you observed in your sample reflect what would happen if you shipped to your actual user base?
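To make concrete what the significance calculation does and does not see, here is a minimal sketch of a two-sided two-proportion z-test using only the standard library. The function name and the counts are hypothetical, chosen purely for illustration. Note that the only inputs are four numbers: nothing about who the users were enters the calculation.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts, for illustration only.
z, p = two_proportion_z_test(conv_a=400, n_a=5000, conv_b=460, n_b=5000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

The same four counts produce the same p-value whether the 10,000 users mirror your user base or come entirely from one unusual cohort.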

The confusion between these two things is systematic and expensive. Teams ship features that were "statistically significant" in testing and see flat or negative impact in production. Then they run the test again and get a different result. Then someone senior looks at the test setup and finds the sample problem that the p-value never flagged.

Consider a real-world pattern that plays out constantly in Indian growth-stage startups. A fintech company runs an A/B test on a new loan application flow. Control is the existing flow. Treatment is a redesigned onboarding with simplified document upload. The test runs for two weeks. Variant B shows a 14% improvement in application completion rate, significant at p=0.02.

The product team ships it. Application completion goes up 4%. Not 14%.

What happened? The test was run during a period when the marketing team was running a campaign targeting existing users — people who had previously attempted an application and dropped off. This cohort was disproportionately represented in the test sample relative to their share of the ongoing user base. They were already motivated to complete. They responded strongly to the simplified flow. New users — who make up 70% of ongoing traffic — responded much more modestly.

The test was internally valid. The sample was externally invalid. The p-value never flagged this because p-values do not know anything about the composition of your user base. They only know about the data they were given.
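One way to see how a cohort mix can inflate a headline number is to re-weight segment-level results to the production mix. The sketch below is illustrative only: the per-segment rates and the test-sample shares are invented, and the only figure taken from the example above is that new users make up 70% of ongoing traffic.

```python
def blended_relative_lift(segments, share_key):
    """Relative lift of treatment over control when segments are mixed
    according to the shares stored under `share_key`."""
    control = sum(s[share_key] * s["control_rate"] for s in segments)
    treatment = sum(s[share_key] * s["treatment_rate"] for s in segments)
    return treatment / control - 1

# Hypothetical completion rates and test-sample shares, for illustration.
segments = [
    {"name": "returning_dropoffs", "control_rate": 0.40, "treatment_rate": 0.50,
     "test_share": 0.60, "prod_share": 0.30},
    {"name": "new_users", "control_rate": 0.30, "treatment_rate": 0.31,
     "test_share": 0.40, "prod_share": 0.70},
]

test_lift = blended_relative_lift(segments, "test_share")
prod_lift = blended_relative_lift(segments, "prod_share")
print(f"lift under test mix:       {test_lift:.1%}")
print(f"lift under production mix: {prod_lift:.1%}")
```

The segment-level effects are identical in both calculations. Only the weights change, and that alone is enough to move the blended lift.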


The Four Sample Problems That Kill Valid Tests

Most A/B testing guidance focuses on sample size — whether you have enough users to detect the effect you care about. Sample composition is the problem that gets far less attention and causes far more damage in practice.

Novelty Effect Contamination

The novelty effect is the behavioural tendency of users to engage more with anything new, regardless of whether it is actually better.

A team at a media platform tests a new article recommendation algorithm. The variant shows 22% higher click-through on recommended articles over a ten-day test. The result is significant. The algorithm ships. Click-through rates on recommendations return to baseline within three weeks.

The issue: a large share of the users in the test were long-tenured, highly engaged users who clicked on the new recommendations because they were different, not because the algorithm was superior. The novelty-driven engagement inflated the treatment effect. A proper test would have either excluded each user's first ten days of exposure to the new recommendations or run long enough for the novelty response to decay, typically three to four weeks for a UI change on a daily-use product.

The practical intervention is to segment your test results by user tenure before drawing conclusions. New users have no old experience to compare against, so they have nothing to find novel. If treatment outperforms control among users with more than ninety days on the platform but performs equivalently or worse among users with less than thirty days, the effect is likely novelty-driven, not algorithm-driven.
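A tenure split of this kind can be sketched in a few lines. The tuple layout, function names, and sample data below are assumptions made for illustration; a real pipeline would read these fields from your experiment logs.

```python
from collections import defaultdict

def tenure_bucket(days_on_platform):
    """Map tenure in days to three cohorts: <30, 30-90, >90 days."""
    if days_on_platform < 30:
        return "<30d"
    if days_on_platform <= 90:
        return "30-90d"
    return ">90d"

def relative_lift_by_tenure(users):
    """users: iterable of (days_on_platform, variant, clicked) tuples,
    where variant is "control" or "treatment"."""
    # bucket -> variant -> [clicks, users]
    counts = defaultdict(lambda: {"control": [0, 0], "treatment": [0, 0]})
    for days, variant, clicked in users:
        cell = counts[tenure_bucket(days)][variant]
        cell[0] += int(clicked)
        cell[1] += 1

    lifts = {}
    for bucket, arms in counts.items():
        (c_clicks, c_n), (t_clicks, t_n) = arms["control"], arms["treatment"]
        if c_n == 0 or t_n == 0 or c_clicks == 0:
            lifts[bucket] = None  # not enough data to compute a ratio
        else:
            lifts[bucket] = (t_clicks / t_n) / (c_clicks / c_n) - 1
    return lifts

# Hypothetical data: 100 users per arm in two tenure cohorts.
users = (
    [(10, "control", i < 20) for i in range(100)]
    + [(10, "treatment", i < 21) for i in range(100)]
    + [(200, "control", i < 20) for i in range(100)]
    + [(200, "treatment", i < 30) for i in range(100)]
)
print(relative_lift_by_tenure(users))
```

A large gap between cohorts is a prompt to look at how the effect evolves over time within each cohort, not a verdict on its own.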

Survivorship Bias in the Treatment Group

Survivorship bias in A/B tests is subtler and more destructive than in other analytical contexts because it compounds with statistical significance to produce confidently wrong conclusions.

An e-commerce platform tests a new checkout flow with an additional trust signal element. Variant B shows a 9% improvement in checkout conversion. The key detail buried in the segment breakdown: Variant B had a 7% higher drop-off at the step before checkout — the cart review page — because the trust signal element confused lower-intent users into leaving.

The users who made it to checkout in the treatment group were, on average, higher-intent than the users who made it to checkout in the control group. The improvement in the checkout-step conversion rate was real as a number, but it was largely an artefact of a leakier upper funnel: lower-intent users had been filtered out before they reached the step being measured.

The practical check is to measure conversion from a consistent funnel entry point — typically the beginning of the flow — across both variants, not just at the specific step where you expect to see an effect.
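The difference between step-level and entry-level conversion is simple arithmetic, sketched below. The function name and all counts are hypothetical; they are chosen so that the treatment has a better checkout-step rate but fewer users reaching checkout.

```python
def funnel_rates(entered_flow, reached_checkout, purchased):
    """Conversion rates measured from different starting points."""
    return {
        "entry_to_checkout": reached_checkout / entered_flow,
        "checkout_to_purchase": purchased / reached_checkout,
        "entry_to_purchase": purchased / entered_flow,
    }

# Hypothetical counts, for illustration only.
control = funnel_rates(entered_flow=10_000, reached_checkout=4_000, purchased=2_000)
treatment = funnel_rates(entered_flow=10_000, reached_checkout=3_720, purchased=2_027)

for metric in control:
    lift = treatment[metric] / control[metric] - 1
    print(f"{metric:22s} control={control[metric]:.3f} "
          f"treatment={treatment[metric]:.3f} lift={lift:+.1%}")
```

With these made-up numbers, the checkout-step lift looks substantial while the lift measured from flow entry is close to zero, which is the pattern the check is designed to expose.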

Figure: A/B testing funnel analysis showing different drop-off points across variants.

A p-value of 0.02 says nothing about which of these four problems affected your sample. Each one requires a separate check that lives outside the significance calculation.

Temporal Contamination

Running an A/B test across a period that contains a structural discontinuity in user behaviour is one of the most common sources of invalid test samples.

Structural discontinuities include: major marketing campaigns starting or stopping, new user cohorts entering after a referral programme launches, seasonal variation in product usage, and competitor events that shift baseline engagement levels.

An edtech company runs a test on a new study mode feature across two weeks in early October. Variant B shows a 31% uplift in session length. The marketing team ran a back-to-school campaign in week two that disproportionately brought high-intent users into the product during the test window. The high-intent cohort engaged with the new feature extensively. The treatment effect was inflated by a campaign that the experimental design did not account for.

The fix is not necessarily to extend the test — extending into a contaminated period makes things worse. It is to identify the contamination window and either re-run the test outside of it, or segment the analysis to isolate the pre-campaign and post-campaign periods and report them separately.
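Segmenting by contamination window could look like the sketch below. The tuple layout, window labels, dates, and session lengths are all hypothetical; the point is only that the uplift is computed separately on each side of the campaign start.

```python
from datetime import date
from statistics import mean

def uplift_by_window(sessions, campaign_start):
    """sessions: iterable of (day, variant, session_minutes) tuples.
    Returns relative uplift in mean session length for each window."""
    windows = {
        "before_campaign": {"control": [], "treatment": []},
        "from_campaign_start": {"control": [], "treatment": []},
    }
    for day, variant, minutes in sessions:
        key = "before_campaign" if day < campaign_start else "from_campaign_start"
        windows[key][variant].append(minutes)

    report = {}
    for key, arms in windows.items():
        if arms["control"] and arms["treatment"]:
            report[key] = mean(arms["treatment"]) / mean(arms["control"]) - 1
        else:
            report[key] = None  # one arm has no data in this window
    return report

# Hypothetical sessions, for illustration only.
sessions = [
    (date(2024, 10, 2), "control", 10.0), (date(2024, 10, 2), "treatment", 11.0),
    (date(2024, 10, 3), "control", 10.0), (date(2024, 10, 3), "treatment", 11.0),
    (date(2024, 10, 9), "control", 12.0), (date(2024, 10, 9), "treatment", 18.0),
    (date(2024, 10, 10), "control", 12.0), (date(2024, 10, 10), "treatment", 18.0),
]
print(uplift_by_window(sessions, campaign_start=date(2024, 10, 8)))
```

Splitting the window halves the data in each estimate, so the per-window numbers will be noisier than the pooled one. Report them with their own uncertainty rather than as point estimates.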

Interaction Effects Between Concurrent Tests

Running multiple A/B tests simultaneously on overlapping user populations is standard at companies with high testing velocity. It is also a consistent source of sample contamination.

A growth team at a subscription product runs three simultaneous tests: a new pricing page layout, a new onboarding email sequence, and a new in-product upgrade prompt. Every user is assigned to all three, so a user in the pricing page test also receives either the old or new onboarding email and either the old or new upgrade prompt, depending on their random assignments in the other two tests.

The upgrade prompt test shows a significant positive result. It ships. The effect disappears in production. The reason: the upgrade prompt performed well among users who had received the new onboarding email sequence but performed no differently than control among users who had received the old onboarding. The test result was driven by an interaction with a concurrent test, not by the treatment itself.
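An interaction like this shows up when the effect of one test is computed separately within each arm of the other. The following is a minimal sketch with hypothetical arm labels and invented counts; it reports the upgrade prompt's absolute effect inside each onboarding email arm.

```python
def effect_within_strata(rows):
    """rows: iterable of (email_arm, prompt_arm, upgraded) tuples.
    Returns the prompt's absolute effect on upgrade rate per email arm."""
    tallies = {}  # (email_arm, prompt_arm) -> [upgrades, users]
    for email_arm, prompt_arm, upgraded in rows:
        cell = tallies.setdefault((email_arm, prompt_arm), [0, 0])
        cell[0] += int(upgraded)
        cell[1] += 1

    effects = {}
    for email_arm in sorted({key[0] for key in tallies}):
        rates = {}
        for prompt_arm in ("control", "treatment"):
            upgrades, n = tallies.get((email_arm, prompt_arm), (0, 0))
            rates[prompt_arm] = upgrades / n if n else None
        if None in rates.values():
            effects[email_arm] = None  # an arm combination has no users
        else:
            effects[email_arm] = rates["treatment"] - rates["control"]
    return effects

def make_rows(email_arm, prompt_arm, upgrades, total):
    """Build synthetic rows with a fixed number of upgrades."""
    return [(email_arm, prompt_arm, i < upgrades) for i in range(total)]

# Hypothetical counts, for illustration only.
rows = (
    make_rows("old_email", "control", 50, 1000)
    + make_rows("old_email", "treatment", 50, 1000)
    + make_rows("new_email", "control", 50, 1000)
    + make_rows("new_email", "treatment", 80, 1000)
)
effects = effect_within_strata(rows)
interaction = effects["new_email"] - effects["old_email"]
print(effects, f"interaction={interaction:+.3f}")
```

If the effect is positive in one stratum and flat in the other, the pooled result depends on which onboarding email ultimately ships, and the two decisions cannot be made independently.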


The Stakeholder Question Is a Systems Thinking Question

When an experienced product leader or data scientist asks "which users are in this test?", they are not asking for the sample size number. They are asking you to trace the sample from first principles — to explain not just how many users are in the test, but which users, when they entered, what else was happening to them, and whether the sample composition reflects the population you are trying to draw conclusions about.

This is systems thinking applied to experimental design. The user funnel is a system. Marketing, product, infrastructure, and competitor dynamics all interact with it simultaneously.

The specific checks that experienced experimenters run before trusting a significant result:

Segment the result by user tenure. Split treatment and control into user cohorts: less than 30 days, 30–90 days, and more than 90 days on the platform. If the treatment effect is concentrated in new users or highly engaged veterans, the effect is likely confounded by tenure-correlated behaviour.

Check funnel entry rates, not just conversion rates at the target step. If you are testing a checkout flow, measure conversion from session start to purchase across both variants. If Variant B has a lower session-to-checkout entry rate but a higher checkout-to-purchase conversion rate, the net effect may be neutral or negative.

Map the test window against the marketing and product calendar. Before presenting a result, identify every significant event that occurred during the test window — campaigns that launched, promotions that ended, infrastructure incidents. Each one is a potential confound.

Pull the concurrent test assignment matrix. For every user in your test, what other tests were they assigned to? If users in your treatment arm were disproportionately assigned to one arm of another concurrent test, or if your treatment effect differs depending on the other test's arm, the interaction needs to be measured.
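For the last of these checks, one simple first pass is to test whether assignment to your experiment is independent of assignment to a concurrent one. The sketch below runs a Pearson chi-square test on a 2x2 cross-tab of arm counts using only the standard library. The function name and the counts are hypothetical, and the approach assumes both tests have exactly two arms.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts.
    table[i][j] = users in arm i of this test and arm j of the other test."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every arm must contain at least one user")
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # Survival function of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical cross-tab: rows = this test (control, treatment),
# columns = the concurrent test (control, treatment).
skewed = [[2700, 2300], [2300, 2700]]
chi2, p = chi_square_2x2(skewed)
print(f"chi2 = {chi2:.1f}, p = {p:.2e}")
```

A small p-value here says the two assignments are entangled. Balanced assignment does not rule out an interaction in outcomes, so this complements a stratified effect analysis rather than replacing it.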


What Drives External Invalidity in Indian Growth Companies

There are patterns specific to the Indian tech company context that show up repeatedly in failed experiments.

Referral cohort contamination. Indian growth companies rely heavily on referral programmes. Referral-driven users are systematically different from organic users in motivation, intent, and session behaviour. Tests run during a referral programme spike are sampling from a biased pool.

Festival season effects. Consumer behaviour in India shifts dramatically around Diwali, Eid, and regional festivals. A test running across the transition into festival season is contaminated by seasonal uplift that has nothing to do with the treatment. This is one of the most common sources of results that show significant uplift in October that evaporates in November.

Tier 2 and Tier 3 city expansion effects. Many Indian growth companies are actively expanding into non-metro markets. If new market cohorts enter during your test, they dilute the sample with users whose behaviour is systematically different from your established base. The treatment effect among metro users may be strong; the overall effect may appear weak.

Figure: Data analytics dashboard showing statistical test results and user segmentation.

Statistical significance tells you the result exceeded chance in your sample. Systems thinking tells you whether the sample was the right one to draw conclusions from.


Building Experimentation Culture That Catches These Problems

The teams that consistently run valid experiments do not rely on ad hoc judgment. They have built the following into their process:

Pre-registration of test design. Before running a test, document: who is in the sample and why, what concurrent tests are running, what events are expected during the test window, and how you will segment the results. This takes 30 minutes and catches the majority of composition problems before they cost anything.
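A pre-registration document does not need tooling, but a structured record makes it easy to block a launch when a question has been left unanswered. The sketch below is one hypothetical shape for such a record; the class name, field names, and example values are assumptions, not a standard.

```python
from dataclasses import dataclass, fields
from typing import List, Optional

@dataclass
class PreRegistration:
    """Answers to the four pre-registration questions for one test."""
    name: str
    sample_definition: Optional[str] = None      # who is in the sample and why
    concurrent_tests: Optional[List[str]] = None  # [] means "checked, none"
    expected_events: Optional[List[str]] = None   # campaigns, promotions, launches
    planned_segments: Optional[List[str]] = None  # how results will be split

    def unanswered(self):
        """Fields still set to None, meaning nobody has answered them yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]

# Hypothetical draft, for illustration only.
draft = PreRegistration(
    name="simplified-document-upload",
    sample_definition="all users starting a loan application",
    concurrent_tests=[],
)
print(draft.unanswered())
```

Using `None` for "not answered" and an empty list for "checked, nothing found" keeps the two cases distinct, which matters when the honest answer is that no concurrent tests are running.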

Mandatory segment breakdowns before ship decisions. No experiment result is presented without breakdowns by user tenure, acquisition channel, and platform. If the treatment effect is heterogeneous across these segments, the aggregate number is not the number to ship on.

A test-calendar review process. Someone is responsible for reviewing concurrent test assignments before any new test is approved. This is an organisational change, not a statistical one — but it prevents the majority of interaction effect failures.

Post-ship monitoring against the test prediction. If your test showed 14% uplift, set up a monitoring dashboard that compares post-ship metrics against that prediction. If the metrics have not approached the predicted uplift within two weeks of shipping, the experiment is flagged for retrospective analysis. This is how teams learn which sample problems are most common in their specific context.
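The flagging rule can be as simple as the sketch below. The function name, the two-week grace period parameter, and especially the `tolerance` threshold are assumptions for illustration; what counts as "close enough" to the prediction is a judgment each team has to set. The sketch also assumes the predicted lift is positive.

```python
def flag_for_retrospective(predicted_lift, observed_lift, days_since_ship,
                           tolerance=0.5, grace_days=14):
    """Return True when the grace period has passed and the observed lift
    is below `tolerance` times the lift the test predicted."""
    if days_since_ship < grace_days:
        return False  # too early to judge
    return observed_lift < predicted_lift * tolerance

# Using the figures from the loan application example: 14% predicted, 4% observed.
print(flag_for_retrospective(predicted_lift=0.14, observed_lift=0.04,
                             days_since_ship=14))
```

A point comparison like this ignores the uncertainty in both numbers. A more careful version would compare the observed lift against the confidence interval from the original test.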


Where You Learn to Run Experiments That Hold Up

At Meritshot, our Data Science and AI Engineering programs include hands-on experimentation design — not as a statistics module, but as a product engineering discipline. You design tests, identify their failure modes, trace sample composition problems in case studies drawn from real Indian tech company scenarios, and build the pre-registration and monitoring habits that separate junior from senior practitioners.

The most valuable skill in experimentation is not knowing how to calculate a p-value. It is knowing which questions to ask before you trust one — and that skill is built through deliberate practice against real cases, not through reading about probability theory.
