Hypothesis Testing — Concepts & p-Values | Statistics Tutorial | Meritshot

What Is Hypothesis Testing?

Hypothesis testing is a formal procedure for using sample data to make a decision about a population parameter. It answers: Is the pattern we see in the sample real, or could it have arisen by chance?

Scenario: A new training programme is introduced.
Before: average performance score = 70
After training (sample of 40 employees): average = 74.5

Question: Is the improvement real, or just random variation in the sample?
→ A hypothesis test answers this formally.

The Logic of Hypothesis Testing

The key idea: assume the boring explanation is true, then see how surprising the data would be under that assumption.

1. Assume nothing changed (null hypothesis H₀: μ = 70).
2. Calculate how likely it is to see data at least as extreme as the observed data under this assumption.
3. If such data would be very unlikely under H₀, reject H₀ in favour of the alternative.

This mirrors the legal principle of "innocent until proven guilty": H₀ is the presumption of innocence, and the data are the evidence.

Setting Up Hypotheses

Null Hypothesis (H₀)

The default position: no effect, no difference, no change, status quo.

H₀: μ = 70  (training had no effect on performance)
H₀: μ₁ = μ₂  (two groups have equal means)
H₀: p = 0.05  (defect rate has not changed)

Alternative Hypothesis (H₁ or Hₐ)

What you want to detect: there IS an effect, difference, or change.

Two-sided: H₁: μ ≠ 70  (score changed, either direction)
One-sided: H₁: μ > 70  (score increased — directional claim)
           H₁: μ < 70  (score decreased)

Use two-sided when: you care about a change in either direction (a new drug could help or harm).
Use one-sided when: you have a strong prior reason to care about only one direction (you will release the drug only if it helps, and not if it is neutral or harmful).

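One-sided tests are more powerful in the hypothesised direction (the critical value is smaller, so H₀ is easier to reject), but they are less conservative and cannot detect an effect in the opposite direction. Many journals require two-sided tests.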

Significance Level (α)

α is the probability of rejecting H₀ when it's actually true (the Type I error rate). You set α before collecting data.

Common values:
α = 0.05 (5%) → most common in social sciences, business
α = 0.01 (1%) → stricter, medical research
α = 0.10 (10%) → exploratory research
α = 0.001       → very high-stakes decisions

Setting α = 0.05 means: "I accept a 5% chance of a false positive (incorrectly rejecting a true H₀)."

Test Statistic

A test statistic converts the sample evidence into a single number on a known distribution, measuring how many standard errors the sample result is from the hypothesised value.

For a mean (σ known):
Z = (x̄ − μ₀) / (σ/√n)

For a mean (σ unknown, use s):
t = (x̄ − μ₀) / (s/√n)     with df = n−1

For a proportion:
Z = (p̂ − p₀) / √(p₀(1−p₀)/n)

The p-Value

The p-value is the probability of observing a test statistic as extreme or more extreme than the one computed, assuming H₀ is true.

p-value = P(data this extreme or more | H₀ is true)

Decision rule:
p < α → reject H₀ (statistically significant result)
p ≥ α → fail to reject H₀ (not enough evidence)
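As a sketch of how the definition and the decision rule fit together, the snippet below turns a z statistic into a p-value with Python's standard-library `statistics.NormalDist`. The helper names `p_value_z` and `decide`, and the `alternative` labels, are assumptions chosen for illustration.

```python
from statistics import NormalDist

def p_value_z(z, alternative="two-sided"):
    """p-value for a z statistic under the standard normal distribution."""
    std_normal = NormalDist()  # mean 0, SD 1
    if alternative == "two-sided":
        return 2 * (1 - std_normal.cdf(abs(z)))
    if alternative == "greater":   # H1: parameter > hypothesised value
        return 1 - std_normal.cdf(z)
    if alternative == "less":      # H1: parameter < hypothesised value
        return std_normal.cdf(z)
    raise ValueError(f"unknown alternative: {alternative!r}")

def decide(p, alpha=0.05):
    """Decision rule: reject H0 only when p < alpha."""
    return "reject H0" if p < alpha else "fail to reject H0"

p = p_value_z(2.0)
print(round(p, 4), decide(p))
```

For the same z, the one-sided p-value in the direction of the effect is half the two-sided one, which is why the choice of alternative must be made before looking at the data.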

Intuition

p = 0.001 → If H₀ were true, seeing this data is very rare (0.1% chance)
            → Strong evidence against H₀ → reject H₀

p = 0.45  → If H₀ were true, seeing this data is quite plausible (45% chance)
            → No reason to reject H₀ → fail to reject H₀

p = 0.049 → Just barely significant at α=0.05
p = 0.051 → Just barely NOT significant at α=0.05

(The difference between 0.049 and 0.051 is negligible — treat as uncertain)

Types of Errors

Reality vs Decision:

                      H₀ True             H₀ False
Fail to reject H₀     Correct (1−α)       Type II Error (β)
Reject H₀             Type I Error (α)    Correct (Power = 1−β)

Type I Error (False Positive)

Rejecting H₀ when it's actually true. Probability = α.

Example: Concluding the training improved performance when it actually didn't.
Consequence: Spend money rolling out an ineffective programme.
Control: Lower α (e.g., use 0.01 instead of 0.05)
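The statement that the Type I error rate equals α can be checked by simulation. The sketch below repeatedly draws samples from a population in which H₀ is true and counts how often a two-sided z-test rejects. The normal population, the values n = 40, μ₀ = 70, σ = 12, and the function name are assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist, fmean

def false_positive_rate(alpha=0.05, n=40, mu_0=70, sigma=12,
                        trials=5000, seed=0):
    """Fraction of simulated studies that reject a TRUE H0 (two-sided z-test)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = sigma / math.sqrt(n)
    rejections = 0
    for _ in range(trials):
        # Draw a sample from a population where H0 really holds.
        sample = [rng.gauss(mu_0, sigma) for _ in range(n)]
        z = (fmean(sample) - mu_0) / se
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials

rate = false_positive_rate()
print(rate)  # expected to land near alpha = 0.05
```

Rerunning with `alpha=0.01` should give a rate near 1%, which is the sense in which lowering α controls false positives.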

Type II Error (False Negative)

Failing to reject H₀ when H₁ is actually true. Probability = β.

Example: Concluding training had no effect when it actually did improve performance.
Consequence: Don't scale a programme that would have helped.
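Control: Increase sample size, reduce measurement variability, or raise α (at the cost of more Type I errors). A larger true effect also lowers β, but it is rarely under the analyst's control.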

Statistical Power

Power = 1 − β = probability of correctly rejecting a false H₀.

Common target: Power ≥ 0.80 (80%)
Factors that increase power:
→ Larger sample size n
→ Larger true effect size
→ Higher α (more willing to accept false positives)
→ Lower population variability σ
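The four factors above can be seen numerically. The sketch below computes the power of a two-sided one-sample z-test from the normal distribution. Treating the 4.5-point difference from the worked z-test example as the true effect, and using σ = 12 and n = 40 as the baseline, are assumptions made purely for illustration.

```python
import math
from statistics import NormalDist

def power_two_sided_z(delta, sigma, n, alpha=0.05):
    """Power of a two-sided one-sample z-test when the true mean is mu_0 + delta."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = delta * math.sqrt(n) / sigma  # true effect in standard-error units
    # P(Z > z_crit) + P(Z < -z_crit) when Z ~ Normal(shift, 1)
    return (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)

base = power_two_sided_z(delta=4.5, sigma=12, n=40)
print(f"baseline          : {base:.2f}")
print(f"larger n (80)     : {power_two_sided_z(4.5, 12, 80):.2f}")
print(f"larger effect (6) : {power_two_sided_z(6.0, 12, 40):.2f}")
print(f"higher alpha 0.10 : {power_two_sided_z(4.5, 12, 40, alpha=0.10):.2f}")
print(f"lower sigma (9)   : {power_two_sided_z(4.5, 9, 40):.2f}")
```

Under these assumptions the baseline power is roughly 0.66, below the usual 0.80 target, and each of the four changes raises it.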

The 7-Step Hypothesis Testing Procedure

1. STATE the hypotheses (H₀ and H₁)
2. SET the significance level α
3. SELECT the appropriate test and check assumptions
4. COMPUTE the test statistic
5. FIND the p-value (or critical value)
6. MAKE the decision (reject or fail to reject H₀)
7. STATE the conclusion in plain language

One-Sample z-Test: Worked Example

Scenario:
Average exam score under old curriculum: μ₀ = 70 (known population mean)
New curriculum tested on n = 40 students
Sample results: x̄ = 74.5, σ = 12 (population SD known from years of data)

Step 1: Hypotheses
H₀: μ = 70  (new curriculum has same average as old)
H₁: μ ≠ 70  (two-sided — could be better or worse)

Step 2: α = 0.05

Step 3: One-sample z-test (σ known, n=40 large)
Assumption: data approximately normal or n large (CLT)

Step 4: Test statistic
Z = (x̄ − μ₀) / (σ/√n) = (74.5 − 70) / (12/√40)
  = 4.5 / (12/6.32)
  = 4.5 / 1.897
  = 2.37

Step 5: p-value (two-sided)
P(|Z| > 2.37) = 2 × P(Z > 2.37) = 2 × (1 − 0.9911) = 2 × 0.0089 = 0.0178

Step 6: Decision
p = 0.0178 < α = 0.05 → REJECT H₀

Step 7: Conclusion
"There is statistically significant evidence (Z=2.37, p=0.018) that the new
curriculum changed average exam scores. The sample mean of 74.5 is
significantly different from the hypothesised mean of 70."
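Steps 4 to 6 of this example can be reproduced in a few lines. This is a minimal sketch using Python's standard library, with variable names chosen for illustration.

```python
import math
from statistics import NormalDist

mu_0, x_bar, sigma, n = 70, 74.5, 12, 40   # values from the scenario
alpha = 0.05

se = sigma / math.sqrt(n)                   # standard error of the mean
z = (x_bar - mu_0) / se                     # Step 4: test statistic
p = 2 * (1 - NormalDist().cdf(abs(z)))      # Step 5: two-sided p-value
reject = p < alpha                          # Step 6: decision

print(f"SE = {se:.3f}, Z = {z:.2f}, p = {p:.4f}, reject H0: {reject}")
```

The computed p-value is very slightly below the table-based 0.0178 because the code does not round Z to 2.37 before looking up the tail probability; the decision is the same.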

Critical Value Approach

An alternative to p-values: compare the test statistic to a critical value.

For α=0.05, two-sided z-test:
Critical values: z_crit = ±1.96

|Z| > 1.96 → reject H₀
|Z| ≤ 1.96 → fail to reject H₀

Our test: Z = 2.37 > 1.96 → reject H₀ ✓

Rejection regions:
        α/2 = 0.025        α/2 = 0.025
 Reject     │    Fail to reject    │    Reject
            │                     │
────────────┼─────────────────────┼────────────→ Z
          −1.96                 +1.96

One-Sided vs Two-Sided Rejection Regions

Two-sided (H₁: μ ≠ μ₀):
Critical z at α=0.05: ±1.96
Reject if Z < −1.96 OR Z > +1.96

One-sided upper (H₁: μ > μ₀):
Critical z at α=0.05: +1.645
Reject if Z > +1.645

One-sided lower (H₁: μ < μ₀):
Critical z at α=0.05: −1.645
Reject if Z < −1.645

One-sided tests are more powerful for detecting effects in one direction.
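The critical values quoted above come from the inverse of the standard normal CDF. A short sketch, assuming Python's standard-library `statistics.NormalDist`; the function name `z_critical` and the `alternative` labels are illustrative.

```python
from statistics import NormalDist

def z_critical(alpha=0.05, alternative="two-sided"):
    """Critical z value(s) that bound the rejection region."""
    inv = NormalDist().inv_cdf  # inverse of the standard normal CDF
    if alternative == "two-sided":
        z = inv(1 - alpha / 2)   # alpha is split across both tails
        return (-z, z)
    if alternative == "greater":
        return inv(1 - alpha)    # all of alpha in the upper tail
    if alternative == "less":
        return inv(alpha)        # all of alpha in the lower tail
    raise ValueError(f"unknown alternative: {alternative!r}")

lower, upper = z_critical(0.05, "two-sided")
print(round(upper, 3))                        # 1.96
print(round(z_critical(0.05, "greater"), 3))  # 1.645
print(round(z_critical(0.05, "less"), 3))     # -1.645
```

Putting all of α in one tail moves the cutoff from 1.96 to 1.645, which is where the extra power of a one-sided test comes from.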

Practical Examples

Example 1: Defect Rate Test

Claimed defect rate: p₀ = 3%
Sample: n=500 components, 22 defectives
p̂ = 22/500 = 0.044

H₀: p = 0.03  (claimed rate)
H₁: p > 0.03  (one-sided — concerned about higher defects)
α = 0.05

Z = (p̂ − p₀) / √(p₀(1−p₀)/n)
  = (0.044 − 0.030) / √(0.030×0.970/500)
  = 0.014 / √(0.0000582)
  = 0.014 / 0.00763
  = 1.835

p-value (one-sided) = P(Z > 1.835) = 1 − 0.9667 = 0.033

0.033 < 0.05 → REJECT H₀
"Evidence that the defect rate exceeds the claimed 3% (Z=1.84, p=0.033)."

Example 2: Marketing Campaign Effectiveness

Historical conversion rate: 8% (p₀ = 0.08)
New campaign tested on n=200 users: 22 converted
p̂ = 22/200 = 0.11

H₀: p = 0.08
H₁: p > 0.08  (campaign should only be adopted if it improves rate)
α = 0.05

Z = (0.11 − 0.08) / √(0.08×0.92/200)
  = 0.03 / √0.000368
  = 0.03 / 0.01918
  = 1.564

p-value = P(Z > 1.564) = 1 − 0.9411 = 0.059

0.059 > 0.05 → FAIL TO REJECT H₀
"Insufficient evidence that the campaign improved conversion rate (p=0.059).
The result is not statistically significant at the 5% level.
Note: p=0.059 is borderline — consider increasing sample size."
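Both proportion examples in this section can be checked with one small helper. This is a sketch under the same normal approximation used in the hand calculations; the function name is hypothetical.

```python
import math
from statistics import NormalDist

def one_sided_proportion_test(successes, n, p_0, alpha=0.05):
    """Upper-tailed z-test for a proportion (H1: p > p_0)."""
    p_hat = successes / n
    se = math.sqrt(p_0 * (1 - p_0) / n)      # standard error under H0
    z = (p_hat - p_0) / se
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value, p_value < alpha

# A common rule of thumb treats the normal approximation as adequate when
# n*p_0 and n*(1 - p_0) are both at least about 10 (true for both examples).
for label, successes, n, p_0 in [("defects", 22, 500, 0.03),
                                 ("campaign", 22, 200, 0.08)]:
    z, p, reject = one_sided_proportion_test(successes, n, p_0)
    print(f"{label}: Z = {z:.3f}, p = {p:.3f}, reject H0: {reject}")
```

The same 22 successes lead to opposite decisions because the hypothesised rate and the sample size differ between the two examples.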

Statistical Significance vs Practical Significance

A/B test with n=2,000,000 users per group:
Conversion rate A: 5.01%, Conversion rate B: 5.09%
Difference: 0.08 percentage points

With such a large sample, even tiny differences become statistically significant.
Pooled two-proportion z-test: Z ≈ 3.65, p < 0.001 → statistically significant
(With only 100,000 users per group, the same difference gives Z ≈ 0.82, p ≈ 0.41: not significant.)

BUT: A 0.08 percentage-point lift in conversion generates ₹800 additional revenue per million users.
Is that worth the cost of the change? → PRACTICAL significance question

Always report:
1. Statistical significance (p-value)
2. Effect size (the magnitude of the difference)
3. Practical/business significance (does the effect matter in the real world?)
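How strongly significance depends on sample size can be checked numerically. The sketch below holds the 5.01% vs 5.09% conversion rates fixed and varies the group size. It assumes a pooled two-proportion z-test with equal group sizes, which the text does not name explicitly; the function name is illustrative.

```python
import math
from statistics import NormalDist

def two_proportion_p_value(p_a, p_b, n_per_group):
    """Two-sided p-value for a pooled two-proportion z-test, equal group sizes."""
    pooled = (p_a + p_b) / 2                      # simple average is valid for equal n
    se = math.sqrt(pooled * (1 - pooled) * 2 / n_per_group)
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Same 0.08-point difference, increasingly large samples.
for n in (10_000, 100_000, 1_000_000, 10_000_000):
    p = two_proportion_p_value(0.0501, 0.0509, n)
    print(f"n per group = {n:>10,}: p = {p:.4g}")
```

The effect size never changes across the rows; only the p-value does, which is why the p-value alone cannot tell you whether a difference matters.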

Effect Size

Measures the magnitude of the difference, independent of sample size.

Cohen's d (for means):
d = (x̄ − μ₀) / s

Interpretation:
d = 0.2 → small effect
d = 0.5 → medium effect
d = 0.8 → large effect

Worked z-test example (using the known σ = 12 in place of s): d = (74.5 − 70)/12 = 0.375 → small-to-medium effect
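To illustrate what "independent of sample size" means, the sketch below computes Cohen's d and the z statistic for the same 4.5-point difference at several sample sizes. The sample sizes 10 and 400 are hypothetical, and the function names are chosen for illustration.

```python
import math

def cohens_d(x_bar, mu_0, sd):
    """Standardised mean difference; note that n does not appear."""
    return (x_bar - mu_0) / sd

def z_stat(x_bar, mu_0, sd, n):
    """z statistic for the same difference; grows with sqrt(n)."""
    return (x_bar - mu_0) / (sd / math.sqrt(n))

d = cohens_d(74.5, 70, 12)
for n in (10, 40, 400):
    print(f"n = {n:>3}: d = {d:.3f}, Z = {z_stat(74.5, 70, 12, n):.2f}")
```

The effect size stays at 0.375 in every row while Z equals d × √n, so the test statistic keeps growing as more data are collected.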

Common Mistakes

1. "p > 0.05 means H₀ is true"

WRONG: "We found no significant difference, therefore the means are equal."
RIGHT: "We found insufficient evidence to reject H₀ at the 5% level."

Absence of evidence is not evidence of absence.
The study may just have been underpowered (too small a sample).

2. p-value is not the probability H₀ is true

WRONG: "p = 0.03 means there's a 3% chance H₀ is true."
RIGHT: "Given H₀ is true, there's a 3% chance of seeing data at least this extreme."
These are very different statements (prosecutor's fallacy applied to statistics).

3. Multiple testing problem

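Testing 20 hypotheses at α=0.05, when every H₀ is actually true:
Expected number of false positives = 20 × 0.05 = 1
P(at least one false positive) = 1 − 0.95²⁰ ≈ 0.64 (if the tests are independent)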

If you test enough things, some will be "significant" by chance.
Corrections: Bonferroni (α/k per test), Benjamini-Hochberg (FDR control)
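The two corrections can be sketched in a few lines of plain Python. The function names and the list of p-values below are made up for illustration, and the family-wise error formula assumes independent tests with every null hypothesis true.

```python
def familywise_error_rate(k, alpha=0.05):
    """P(at least one false positive) across k independent tests of true nulls."""
    return 1 - (1 - alpha) ** k

def bonferroni(p_values, alpha=0.05):
    """Reject H0 for each test whose p-value is below alpha / k."""
    k = len(p_values)
    return [p < alpha / k for p in p_values]

def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure controlling the FDR at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank r (1-based) whose sorted p-value satisfies p_(r) <= r/m * q.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            cutoff = rank
    reject = [False] * m
    for i in order[:cutoff]:
        reject[i] = True
    return reject

p_vals = [0.001, 0.012, 0.025, 0.039, 0.20]  # made-up p-values
print(round(familywise_error_rate(20), 2))   # 0.64
print(bonferroni(p_vals))                    # [True, False, False, False, False]
print(benjamini_hochberg(p_vals))            # [True, True, True, True, False]
```

On this made-up input Bonferroni keeps only the smallest p-value, while Benjamini-Hochberg keeps four, reflecting the trade-off between strict family-wise control and the more permissive false discovery rate.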

4. HARKing — Hypothesising After Results are Known

Look at data → identify patterns → then write hypotheses as if pre-planned.
This inflates false positive rates. Pre-register hypotheses before data collection.

Practice Exercises

1. A machine is supposed to fill bottles with 500 ml. A sample of 36 bottles: x̄=497, s=9 ml. At α=0.05, is there evidence the machine is underfilling? (One-sided test)
2. Historical click-through rate: 3%. New ad shown to 800 users: 30 clicks. Test whether the new ad has a different click rate (two-sided, α=0.01).
3. If α=0.05 and β=0.20, what is the power of the test? What does this mean practically?
4. A study reports p=0.049. Another reports p=0.051. A colleague says the first is "significant" and the second is "not significant." Critique this interpretation.
5. You run 50 simultaneous A/B tests at α=0.05. How many "significant" results would you expect by pure chance? What approach can control this?

Summary

In this chapter you learned:

Hypothesis testing — formal procedure for deciding whether sample evidence supports a claim about a population
H₀ (null): no effect/difference (default); H₁ (alternative): what we want to detect
α (significance level): acceptable Type I error rate; set before data collection (commonly 0.05)
Test statistic: Z = (estimate − H₀ value) / SE — measures evidence in standard errors
p-value: P(data this extreme | H₀ true); p < α → reject H₀
Type I error (α): reject H₀ when true (false positive); Type II error (β): fail to reject when false (false negative)
Power = 1 − β: probability of correctly detecting a real effect; target ≥ 80%
Two-sided vs one-sided: use one-sided only with strong prior directional justification
Statistical ≠ practical significance: large samples make tiny effects significant; report effect size
p-value is NOT P(H₀ is true); failing to reject H₀ does NOT prove H₀

Next up: t-Tests — the workhorse of hypothesis testing for comparing means.
