What Is Hypothesis Testing?
Hypothesis testing is a formal procedure for using sample data to make a decision about a population parameter. It answers: Is the pattern we see in the sample real, or could it have arisen by chance?
Scenario: A new training programme is introduced.
Before: average performance score = 70
After training (sample of 40 employees): average = 74.5
Question: Is the improvement real, or just random variation in the sample?
→ A hypothesis test answers this formally.
The Logic of Hypothesis Testing
The key idea: assume the boring explanation is true, then see how surprising the data would be under that assumption.
- Assume nothing changed (null hypothesis H₀: μ = 70)
- Calculate how likely the observed data is under this assumption
- If the data is very unlikely under H₀, reject H₀ in favour of the alternative
This mirrors the legal principle of "innocent until proven guilty": H₀ is the presumption of innocence, and the data are the evidence.
Setting Up Hypotheses
Null Hypothesis (H₀)
The default position: no effect, no difference, no change, status quo.
H₀: μ = 70 (training had no effect on performance)
H₀: μ₁ = μ₂ (two groups have equal means)
H₀: p = 0.05 (defect rate has not changed)
Alternative Hypothesis (H₁ or Hₐ)
What you want to detect: there IS an effect, difference, or change.
Two-sided: H₁: μ ≠ 70 (score changed, either direction)
One-sided: H₁: μ > 70 (score increased — directional claim)
H₁: μ < 70 (score decreased)
Use two-sided when: you care about change in either direction (a new drug could help or harm).
Use one-sided when: you have a strong prior reason to care about only one direction (you will release the drug only if it helps, not if it is neutral or harmful).
One-sided tests are more powerful in the hypothesised direction (H₀ is easier to reject there) but cannot detect an effect in the opposite direction, so they are less conservative. Many journals require two-sided tests.
Significance Level (α)
α is the probability of rejecting H₀ when it's actually true (the Type I error rate). You set α before collecting data.
Common values:
α = 0.05 (5%) → most common in social sciences, business
α = 0.01 (1%) → stricter, medical research
α = 0.10 (10%) → exploratory research
α = 0.001 → very high-stakes decisions
Setting α = 0.05 means: "I accept a 5% chance of a false positive (incorrectly rejecting a true H₀)."
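The meaning of α can be checked by simulation. The sketch below is illustrative only: it assumes normally distributed scores with μ = 70 and σ = 12 (so H₀ is true by construction), and the function name, seed, and number of simulated studies are arbitrary choices, not part of the chapter. The fraction of simulated studies that reject H₀ should land close to α.

```python
import random
from statistics import NormalDist, fmean

def simulate_type_i_rate(mu0=70.0, sigma=12.0, n=40, alpha=0.05,
                         n_sims=5000, seed=0):
    """Fraction of simulated studies that reject H0 when H0 is actually true."""
    rng = random.Random(seed)                      # fixed seed for repeatability
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    se = sigma / n ** 0.5                          # standard error of the mean
    rejections = 0
    for _ in range(n_sims):
        # Draw a sample from a population where H0 holds (true mean = mu0).
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        z = (fmean(sample) - mu0) / se
        if abs(z) > z_crit:
            rejections += 1                        # a false positive
    return rejections / n_sims

type_i_rate = simulate_type_i_rate()
print(f"Simulated Type I error rate: {type_i_rate:.3f}")
```

The simulated rate will not equal 0.05 exactly, because it is itself an estimate from a finite number of simulated studies.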
Test Statistic
A test statistic converts the sample evidence into a single number on a known distribution, measuring how many standard errors the sample result is from the hypothesised value.
For a mean (σ known):
Z = (x̄ − μ₀) / (σ/√n)
For a mean (σ unknown, use s):
t = (x̄ − μ₀) / (s/√n) with df = n−1
For a proportion:
Z = (p̂ − p₀) / √(p₀(1−p₀)/n)
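The three formulas above translate directly into code. This is a minimal sketch; the function names are illustrative and the inputs in the usage lines are made-up numbers chosen to give round results.

```python
from math import sqrt

def z_stat_mean(xbar, mu0, sigma, n):
    """Z statistic for a mean when the population SD (sigma) is known."""
    return (xbar - mu0) / (sigma / sqrt(n))

def t_stat_mean(xbar, mu0, s, n):
    """t statistic for a mean when sigma is unknown; returns (t, df)."""
    return (xbar - mu0) / (s / sqrt(n)), n - 1

def z_stat_proportion(p_hat, p0, n):
    """Z statistic for a proportion; the SE uses the hypothesised p0."""
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# Made-up inputs, for illustration only.
print(z_stat_mean(xbar=52, mu0=50, sigma=10, n=25))
print(t_stat_mean(xbar=52, mu0=50, s=8, n=16))
print(z_stat_proportion(p_hat=0.6, p0=0.5, n=100))
```

Note that the proportion test uses p₀, not p̂, in the standard error, because the calculation is carried out under the assumption that H₀ is true.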
The p-Value
The p-value is the probability of observing a test statistic at least as extreme as the one computed, assuming H₀ is true.
p-value = P(data this extreme or more | H₀ is true)
Decision rule:
p < α → reject H₀ (statistically significant result)
p ≥ α → fail to reject H₀ (not enough evidence)
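For a Z statistic, the p-value and the decision rule can be sketched with Python's standard library. The function names and the `alternative` labels ("two-sided", "greater", "less") are assumptions made for this example.

```python
from statistics import NormalDist

def p_value_from_z(z, alternative="two-sided"):
    """p-value for a Z statistic under the standard normal distribution."""
    std_normal = NormalDist()  # mean 0, SD 1
    if alternative == "two-sided":
        return 2 * (1 - std_normal.cdf(abs(z)))   # both tails
    if alternative == "greater":
        return 1 - std_normal.cdf(z)              # upper tail only
    if alternative == "less":
        return std_normal.cdf(z)                  # lower tail only
    raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")

def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject H0 only when p < alpha."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

for z in (1.0, 2.0):
    p = p_value_from_z(z)
    print(f"Z = {z}: p = {p:.4f} -> {decide(p)}")
```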
Intuition
p = 0.001 → If H₀ were true, data this extreme would be very rare (0.1% chance)
→ Strong evidence against H₀ → reject H₀
p = 0.45 → If H₀ were true, data this extreme would be quite plausible (45% chance)
→ No reason to reject H₀ → fail to reject H₀
p = 0.049 → Just barely significant at α=0.05
p = 0.051 → Just barely NOT significant at α=0.05
(The difference between 0.049 and 0.051 is negligible — treat as uncertain)
Types of Errors
Reality vs Decision:
                     H₀ True              H₀ False
Fail to reject H₀    Correct (1−α)        Type II Error (β)
Reject H₀            Type I Error (α)     Correct (Power = 1−β)
Type I Error (False Positive)
Rejecting H₀ when it's actually true. Probability = α.
Example: Concluding the training improved performance when it actually didn't.
Consequence: Spend money rolling out an ineffective programme.
Control: Lower α (e.g., use 0.01 instead of 0.05)
Type II Error (False Negative)
Failing to reject H₀ when H₁ is actually true. Probability = β.
Example: Concluding training had no effect when it actually did improve performance.
Consequence: Don't scale a programme that would have helped.
Control: Increase sample size, reduce measurement variability, or raise α (a larger true effect also lowers β, but you rarely control it)
Statistical Power
Power = 1 − β = probability of correctly rejecting a false H₀.
Common target: Power ≥ 0.80 (80%)
Factors that increase power:
→ Larger sample size n
→ Larger true effect size
→ Higher α (more willing to accept false positives)
→ Lower population variability σ
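The four factors above can be explored numerically. The sketch below computes the power of a two-sided one-sample z-test from the standard normal CDF. The function name is illustrative, and the true effect of 4.5 points with σ = 12 is an assumed scenario, not an established fact about any programme.

```python
from math import sqrt
from statistics import NormalDist

def power_two_sided_z(delta, sigma, n, alpha=0.05):
    """Power of a two-sided one-sample z-test.

    delta is the assumed true difference (mu_true - mu0).
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = delta / (sigma / sqrt(n))   # true effect measured in standard errors
    # Probability that Z lands in the upper or the lower rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

# Assumed scenario: true effect of 4.5 points, sigma = 12.
for n in (20, 40, 80):
    print(f"n = {n}: power = {power_two_sided_z(4.5, 12, n):.2f}")
```

When delta is 0 the function returns α, which matches the definition of the Type I error rate.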
The 7-Step Hypothesis Testing Procedure
1. STATE the hypotheses (H₀ and H₁)
2. SET the significance level α
3. SELECT the appropriate test and check assumptions
4. COMPUTE the test statistic
5. FIND the p-value (or critical value)
6. MAKE the decision (reject or fail to reject H₀)
7. STATE the conclusion in plain language
One-Sample z-Test: Worked Example
Scenario:
Average exam score under old curriculum: μ₀ = 70 (known population mean)
New curriculum tested on n = 40 students
Sample results: x̄ = 74.5, σ = 12 (population SD known from years of data)
Step 1: Hypotheses
H₀: μ = 70 (new curriculum has same average as old)
H₁: μ ≠ 70 (two-sided — could be better or worse)
Step 2: α = 0.05
Step 3: One-sample z-test (σ known, n=40 large)
Assumption: data approximately normal or n large (CLT)
Step 4: Test statistic
Z = (x̄ − μ₀) / (σ/√n) = (74.5 − 70) / (12/√40)
= 4.5 / (12/6.325)
= 4.5 / 1.897
= 2.37
Step 5: p-value (two-sided)
P(|Z| > 2.37) = 2 × P(Z > 2.37) = 2 × (1 − 0.9911) = 2 × 0.0089 = 0.0178
Step 6: Decision
p = 0.0178 < α = 0.05 → REJECT H₀
Step 7: Conclusion
"There is statistically significant evidence (Z=2.37, p=0.018) that the new
curriculum changed average exam scores. The sample mean of 74.5 is
significantly different from the hypothesised mean of 70."
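The worked example can be reproduced in a few lines, using only the figures given in the scenario. This is a sketch with the standard library; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Figures from the scenario above.
mu0, xbar, sigma, n, alpha = 70, 74.5, 12, 40, 0.05

se = sigma / sqrt(n)                            # standard error of the mean
z = (xbar - mu0) / se                           # Step 4: test statistic
p_value = 2 * (1 - NormalDist().cdf(abs(z)))    # Step 5: two-sided p-value

print(f"SE = {se:.3f}, Z = {z:.2f}, p = {p_value:.4f}")
print("reject H0" if p_value < alpha else "fail to reject H0")  # Step 6
```

The computed p-value can differ from the hand calculation in the fourth decimal place, because the hand calculation rounds Z to 2.37 before looking up the table value.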
Critical Value Approach
An alternative to p-values: compare the test statistic to a critical value.
For α=0.05, two-sided z-test:
Critical values: z_crit = ±1.96
|Z| > 1.96 → reject H₀
|Z| ≤ 1.96 → fail to reject H₀
Our test: Z = 2.37 > 1.96 → reject H₀ ✓
Rejection regions:
  α/2 = 0.025                              α/2 = 0.025
    Reject    │    Fail to reject    │    Reject
              │                      │
──────────────┼──────────────────────┼──────────────→ Z
            −1.96                  +1.96
One-Sided vs Two-Sided Rejection Regions
Two-sided (H₁: μ ≠ μ₀):
Critical z at α=0.05: ±1.96
Reject if Z < −1.96 OR Z > +1.96
One-sided upper (H₁: μ > μ₀):
Critical z at α=0.05: +1.645
Reject if Z > +1.645
One-sided lower (H₁: μ < μ₀):
Critical z at α=0.05: −1.645
Reject if Z < −1.645
One-sided tests are more powerful for detecting effects in one direction.
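The critical values quoted above come from the inverse of the standard normal CDF. A minimal sketch follows; the function names and the `alternative` labels are assumptions made for this example.

```python
from statistics import NormalDist

def critical_values(alpha=0.05, alternative="two-sided"):
    """Return (lower, upper) critical z values; None means no bound on that side."""
    std_normal = NormalDist()
    if alternative == "two-sided":
        z = std_normal.inv_cdf(1 - alpha / 2)   # alpha is split across both tails
        return (-z, z)
    if alternative == "greater":
        return (None, std_normal.inv_cdf(1 - alpha))
    if alternative == "less":
        return (std_normal.inv_cdf(alpha), None)
    raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")

def rejects(z, alpha=0.05, alternative="two-sided"):
    """True if z falls in the rejection region."""
    lower, upper = critical_values(alpha, alternative)
    return (lower is not None and z < lower) or (upper is not None and z > upper)

print(critical_values())                        # two-sided bounds
print(critical_values(alternative="greater"))   # one-sided upper bound
print(rejects(1.8), rejects(1.8, alternative="greater"))
```

The last line illustrates the power difference: Z = 1.8 falls short of the two-sided cut-off but exceeds the one-sided upper cut-off.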
Practical Examples
Example 1: Defect Rate Test
Claimed defect rate: p₀ = 3%
Sample: n=500 components, 22 defectives
p̂ = 22/500 = 0.044
H₀: p = 0.03 (claimed rate)
H₁: p > 0.03 (one-sided — concerned about higher defects)
α = 0.05
Z = (p̂ − p₀) / √(p₀(1−p₀)/n)
= (0.044 − 0.030) / √(0.030×0.970/500)
= 0.014 / √(0.0000582)
= 0.014 / 0.00763
= 1.835
p-value (one-sided) = P(Z > 1.835) = 1 − 0.9667 = 0.033
0.033 < 0.05 → REJECT H₀
"Evidence that the defect rate exceeds the claimed 3% (Z=1.84, p=0.033)."
Example 2: Marketing Campaign Effectiveness
Historical conversion rate: 8% (p₀ = 0.08)
New campaign tested on n=200 users: 22 converted
p̂ = 22/200 = 0.11
H₀: p = 0.08
H₁: p > 0.08 (campaign should only be adopted if it improves rate)
α = 0.05
Z = (0.11 − 0.08) / √(0.08×0.92/200)
= 0.03 / √0.000368
= 0.03 / 0.01918
= 1.564
p-value = P(Z > 1.564) = 1 − 0.9411 = 0.059
0.059 > 0.05 → FAIL TO REJECT H₀
"Insufficient evidence that the campaign improved conversion rate (p=0.059).
The result is not statistically significant at the 5% level.
Note: p=0.059 is borderline — consider increasing sample size."
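Both proportion examples follow the same recipe, so they can be checked with one helper. This is a sketch using the counts given above; the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def one_sided_proportion_z_test(successes, n, p0):
    """Upper-tailed z-test of H0: p = p0 against H1: p > p0.

    Returns (p_hat, z, p_value).
    """
    p_hat = successes / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)   # SE is computed under H0
    p_value = 1 - NormalDist().cdf(z)            # upper tail only
    return p_hat, z, p_value

defect = one_sided_proportion_z_test(22, 500, 0.03)     # Example 1
campaign = one_sided_proportion_z_test(22, 200, 0.08)   # Example 2

for name, (p_hat, z, p) in (("Defect rate", defect), ("Campaign", campaign)):
    verdict = "reject H0" if p < 0.05 else "fail to reject H0"
    print(f"{name}: p_hat = {p_hat:.3f}, Z = {z:.3f}, p = {p:.3f} -> {verdict}")
```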
Statistical Significance vs Practical Significance
A/B test with n=10,000,000 users per group:
Conversion rate A: 5.01%, Conversion rate B: 5.09%
Difference: 0.08 percentage points
With such a large sample, even tiny differences become statistically significant.
Z ≈ 8.2, p < 0.001 → statistically significant
(With n=100,000 per group the same difference gives Z ≈ 0.8, p ≈ 0.41: not significant.)
BUT: A 0.08 percentage-point lift means 800 extra conversions per million users, or ₹800 in additional revenue assuming ₹1 per conversion.
Is that worth the cost of the change? → PRACTICAL significance question
Always report:
1. Statistical significance (p-value)
2. Effect size (the magnitude of the difference)
3. Practical/business significance (does the effect matter in the real world?)
Effect Size
Measures the magnitude of the difference, independent of sample size.
Cohen's d (for means):
d = (x̄ − μ₀) / s
Interpretation:
d = 0.2 → small effect
d = 0.5 → medium effect
d = 0.8 → large effect
Our worked example (using the known σ = 12 in place of s): d = (74.5 − 70)/12 = 0.375 → small-to-medium effect
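A short sketch makes the "independent of sample size" point concrete. For a one-sample test, Z = d × √n, so d stays fixed while Z grows with n. The sample sizes other than 40 are hypothetical, and the function name is illustrative.

```python
from math import sqrt

def cohens_d(xbar, mu0, sd):
    """Cohen's d for one sample: the difference expressed in SD units."""
    return (xbar - mu0) / sd

d = cohens_d(74.5, 70, 12)   # figures from the example above

# Hypothetical sample sizes: d is unchanged, but Z = d * sqrt(n) keeps growing.
for n in (40, 400, 4000):
    print(f"n = {n}: d = {d:.3f}, Z = {d * sqrt(n):.2f}")
```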
Common Mistakes
1. "p > 0.05 means H₀ is true"
WRONG: "We found no significant difference, therefore the means are equal."
RIGHT: "We found insufficient evidence to reject H₀ at the 5% level."
Absence of evidence is not evidence of absence.
The study may just have been underpowered (too small a sample).
2. p-value is not the probability H₀ is true
WRONG: "p = 0.03 means there's a 3% chance H₀ is true."
RIGHT: "Given H₀ is true, there's a 3% chance of seeing data this extreme."
These are very different statements (prosecutor's fallacy applied to statistics).
3. Multiple testing problem
Testing 20 hypotheses at α=0.05:
Expected number of false positives (if every H₀ is true) = 20 × 0.05 = 1
If you test enough things, some will be "significant" by chance.
Corrections: Bonferroni (α/k per test), Benjamini-Hochberg (FDR control)
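The two corrections can be sketched in plain Python. The function names and the five p-values are made up for illustration, and the family-wise error calculation assumes the 20 tests are independent.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0 for each test whose p-value is below alpha / k."""
    k = len(p_values)
    return [p < alpha / k for p in p_values]

def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])   # indices by ascending p
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / k * alpha:
            max_rank = rank            # largest rank that meets its threshold
    reject = [False] * k
    for rank, idx in enumerate(order, start=1):
        reject[idx] = rank <= max_rank  # reject everything up to that rank
    return reject

# Chance of at least one false positive in 20 independent tests at alpha = 0.05.
fwer = 1 - (1 - 0.05) ** 20
print(f"P(at least one false positive) = {fwer:.2f}")

p_vals = [0.001, 0.008, 0.012, 0.041, 0.20]   # made-up p-values
print(bonferroni_reject(p_vals))
print(benjamini_hochberg_reject(p_vals))
```

On these made-up values Bonferroni rejects fewer hypotheses than Benjamini-Hochberg, which reflects the trade-off: Bonferroni controls the chance of any false positive, while Benjamini-Hochberg controls the expected share of false positives among the rejections.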
4. HARKing — Hypothesising After Results are Known
Look at data → identify patterns → then write hypotheses as if pre-planned.
This inflates false positive rates. Pre-register hypotheses before data collection.
Practice Exercises
- A machine is supposed to fill bottles with 500 ml. A sample of 36 bottles: x̄=497, s=9 ml. At α=0.05, is there evidence the machine is underfilling? (One-sided test)
- Historical click-through rate: 3%. New ad shown to 800 users: 30 clicks. Test whether the new ad has a different click rate (two-sided, α=0.01).
- If α=0.05 and β=0.20, what is the power of the test? What does this mean practically?
- A study reports p=0.049. Another reports p=0.051. A colleague says the first is "significant" and the second is "not significant." Critique this interpretation.
- You run 50 simultaneous A/B tests at α=0.05. How many "significant" results would you expect by pure chance? What approach can control this?
Summary
In this chapter you learned:
- Hypothesis testing — formal procedure for deciding whether sample evidence supports a claim about a population
- H₀ (null): no effect/difference (default); H₁ (alternative): what we want to detect
- α (significance level): acceptable Type I error rate; set before data collection (commonly 0.05)
- Test statistic: Z = (estimate − H₀ value) / SE — measures evidence in standard errors
- p-value: P(data this extreme | H₀ true); p < α → reject H₀
- Type I error (α): reject H₀ when true (false positive); Type II error (β): fail to reject when false (false negative)
- Power = 1 − β: probability of correctly detecting a real effect; target ≥ 80%
- Two-sided vs one-sided: use one-sided only with strong prior directional justification
- Statistical ≠ practical significance: large samples make tiny effects significant; report effect size
- p-value is NOT P(H₀ is true); failing to reject H₀ does NOT prove H₀
Next up: t-Tests — the workhorse of hypothesis testing for comparing means.