Why the t-Test?
The t-test is one of the most commonly used statistical tests. It compares means when the population standard deviation (σ) is unknown, which is almost always the case in practice. Three versions handle different scenarios:
| Test | Question |
|---|---|
| One-sample t-test | Is the population mean equal to a specific value? |
| Independent two-sample t-test | Are the means of two independent groups equal? |
| Paired t-test | Did the mean change between two related measurements? |
The t-Distribution (Recap)
The t-distribution with df degrees of freedom is:
- Symmetric and bell-shaped (like Z)
- Heavier tails than Z (accounts for uncertainty in estimating σ)
- As df → ∞, t → Z
- df = n − 1 for one-sample; df = n₁ + n₂ − 2 for the pooled two-sample test (Welch's test uses the Welch-Satterthwaite approximation instead)
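The heavier tails can be checked numerically. The sketch below is illustrative only: the helper names `t_pdf` and `t_two_sided_p` are made up for this example, and the tail probability is obtained by Simpson's-rule integration of the t density rather than from a statistics library.

```python
import math
from statistics import NormalDist

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def t_two_sided_p(t, df, steps=4000):
    """P(|T| >= |t|), using Simpson's rule on [0, |t|] (steps must be even)."""
    b = abs(t)
    if b == 0:
        return 1.0
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = 2 * total * h / 3          # P(-|t| < T < |t|)
    return max(0.0, 1.0 - central)

# Probability of landing more than 2 units from zero, for several df
z_tail = 2 * (1 - NormalDist().cdf(2.0))
tails = {df: t_two_sided_p(2.0, df) for df in (2, 5, 19, 100)}
for df, p in tails.items():
    print(f"df={df:>3}: P(|T| > 2) = {p:.4f}")
print(f"normal: P(|Z| > 2) = {z_tail:.4f}")
```

The tail probability shrinks toward the normal value as df grows, which is the "t → Z" behaviour listed above.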
1. One-Sample t-Test
Question: Is the population mean different from a known/hypothesised value μ₀?
Assumptions
- Data is approximately normally distributed OR n ≥ 30 (CLT)
- Data is continuous (interval or ratio scale)
- Sample is randomly selected
- Observations are independent
Test Statistic
t = (x̄ − μ₀) / (s / √n)
where:
x̄ = sample mean
μ₀ = hypothesised population mean
s = sample standard deviation
n = sample size
df = n − 1
Worked Example
Scenario: A company claims new hires have an average onboarding time of 5 days.
HR samples 20 recent new hires and records actual times.
Data: 4.2, 5.8, 6.1, 5.5, 4.9, 6.7, 5.1, 4.8, 5.3, 6.2,
5.9, 5.4, 4.6, 5.7, 6.3, 5.0, 5.5, 6.0, 4.7, 5.8
n = 20
x̄ = 5.475 days
s = 0.649 days
H₀: μ = 5.0 (claimed onboarding time)
H₁: μ ≠ 5.0 (two-sided)
α = 0.05
Step 1: Test statistic
t = (5.475 − 5.0) / (0.649 / √20)
= 0.475 / (0.649 / 4.472)
= 0.475 / 0.1451
= 3.274
Step 2: Degrees of freedom
df = 20 − 1 = 19
Step 3: p-value (two-sided)
From the t-table with df = 19: |t| = 3.274 exceeds 2.861 (two-sided p = 0.01) but is below 3.579 (two-sided p = 0.002)
so 0.002 < p < 0.01
More precisely: p ≈ 0.004
Step 4: Decision
p ≈ 0.004 < α = 0.05 → REJECT H₀
Conclusion: There is significant evidence that the actual average onboarding
time (5.475 days) differs from the claimed 5.0 days (t(19)=3.27, p=0.004).
Confidence Interval Connection
The 95% CI from a one-sample t-test is exactly the set of μ₀ values that would NOT be rejected at α=0.05:
95% CI = x̄ ± t* × (s/√n)
= 5.475 ± 2.093 × 0.1451
= 5.475 ± 0.304
= (5.17, 5.78) days
Since 5.0 is outside (5.17, 5.78), we reject H₀: μ=5, which is consistent with the test. ✓
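The one-sample calculation and its confidence interval can be reproduced directly from the 20 listed values. This is a minimal standard-library sketch, not a reference implementation: `t_pdf` and `t_two_sided_p` are helper names invented here (the p-value comes from numerically integrating the t density), and the critical value 2.093 is the table value used in the text. In practice a library routine such as `scipy.stats.ttest_1samp` would normally be used instead.

```python
import math
from statistics import mean, stdev

data = [4.2, 5.8, 6.1, 5.5, 4.9, 6.7, 5.1, 4.8, 5.3, 6.2,
        5.9, 5.4, 4.6, 5.7, 6.3, 5.0, 5.5, 6.0, 4.7, 5.8]
mu0 = 5.0                      # hypothesised mean

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def t_two_sided_p(t, df, steps=4000):
    """P(|T| >= |t|), using Simpson's rule on [0, |t|] (steps must be even)."""
    b = abs(t)
    if b == 0:
        return 1.0
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1.0 - 2 * total * h / 3)

n = len(data)
x_bar = mean(data)
s = stdev(data)                # sample SD (n - 1 in the denominator)
se = s / math.sqrt(n)          # standard error of the mean
t_stat = (x_bar - mu0) / se
df = n - 1
p_value = t_two_sided_p(t_stat, df)

t_crit = 2.093                 # two-sided 5% critical value for df = 19 (from the text)
ci = (x_bar - t_crit * se, x_bar + t_crit * se)

print(f"mean={x_bar:.3f}, s={s:.3f}, t({df})={t_stat:.3f}, p={p_value:.4f}")
print(f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

Because μ₀ = 5.0 falls below the lower CI bound, the interval and the test agree, as the section describes.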
2. Independent Two-Sample t-Test
Question: Do two independent groups have different population means?
H₀: μ₁ = μ₂ (or equivalently, μ₁ − μ₂ = 0)
H₁: μ₁ ≠ μ₂ (two-sided)
Two Versions
Equal Variances (Pooled t-Test)
Assumes both groups have the same population variance (σ₁² = σ₂²).
s_p² = [(n₁−1)s₁² + (n₂−1)s₂²] / (n₁+n₂−2) [pooled variance]
t = (x̄₁ − x̄₂) / (s_p × √(1/n₁ + 1/n₂))
df = n₁ + n₂ − 2
Unequal Variances (Welch's t-Test)
Does NOT assume equal variances — more robust, generally preferred.
t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂)
df = (s₁²/n₁ + s₂²/n₂)² / [(s₁²/n₁)²/(n₁−1) + (s₂²/n₂)²/(n₂−1)] [Welch-Satterthwaite]
Rule of thumb: Always use Welch's t-test by default unless you have strong reason to assume equal variances.
Worked Example
Scenario: Compare exam scores between two teaching methods.
Method A (traditional): n₁=30, x̄₁=72.3, s₁=8.5
Method B (interactive): n₂=28, x̄₂=78.6, s₂=11.2
H₀: μ₁ = μ₂ (no difference between methods)
H₁: μ₁ ≠ μ₂ (two-sided)
α = 0.05
Using Welch's t-test:
t = (72.3 − 78.6) / √(8.5²/30 + 11.2²/28)
= −6.3 / √(2.408 + 4.480)
= −6.3 / √6.888
= −6.3 / 2.625
= −2.400
Welch df ≈ 50.3 (round down to 50 for the table lookup)
From t-table: t(50) at two-sided 0.05 → critical value ≈ 2.009
|t| = 2.400 > 2.009 → reject H₀
p-value ≈ 0.020
Conclusion: Significant difference in exam scores between methods (t(50.3)=−2.40, p=0.020).
Students taught with Method B scored significantly higher on average (78.6 vs 72.3).
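Both versions of the two-sample statistic can be computed from summary statistics alone. The following sketch applies the pooled and Welch formulas given earlier to the teaching-method numbers; the function names `welch_from_stats` and `pooled_from_stats` are hypothetical, chosen only for this illustration.

```python
import math

def welch_from_stats(m1, s1, n1, m2, s2, n2):
    """Welch's t statistic and Welch-Satterthwaite df from summary statistics."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2          # squared standard errors
    t = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

def pooled_from_stats(m1, s1, n1, m2, s2, n2):
    """Pooled (equal-variance) t statistic and df from summary statistics."""
    sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    t = (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

# Method A (traditional) vs Method B (interactive)
t_w, df_w = welch_from_stats(72.3, 8.5, 30, 78.6, 11.2, 28)
t_p, df_p = pooled_from_stats(72.3, 8.5, 30, 78.6, 11.2, 28)

print(f"Welch:  t = {t_w:.3f}, df = {df_w:.1f}")
print(f"Pooled: t = {t_p:.3f}, df = {df_p}")
```

With similar group sizes the two statistics are close; Welch's version simply spends a few degrees of freedom to avoid relying on equal variances.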
Checking Equal Variance Assumption
Use Levene's test or F-test to check if variances are significantly different:
Ratio of variances: s₂²/s₁² = 11.2²/8.5² = 125.44/72.25 = 1.74
Rule of thumb: if the larger SD is more than twice the smaller SD, variances are likely unequal.
Here: 11.2/8.5 = 1.32 → moderate difference; use Welch's to be safe
3. Paired t-Test
Question: Did the mean change between two related measurements on the same subjects?
When to use:
- Before/after measurements on the same individual
- Matched pairs (e.g., twins, matched controls)
- Cross-over trials (each subject gets both treatments)
Key Idea
Instead of treating the two groups separately, compute the difference for each pair and run a one-sample t-test on the differences.
dᵢ = x_after_i − x_before_i (difference for each subject)
d̄ = mean difference
s_d = standard deviation of differences
t = d̄ / (s_d / √n)
df = n − 1 (where n = number of PAIRS)
H₀: μ_d = 0 (no change on average)
H₁: μ_d ≠ 0 (there was a change)
Worked Example
Training programme: 10 employees measured before and after.
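| Employee | Before | After | Difference (d = After − Before) |
|---|---|---|---|
| 1 | 65 | 72 | +7 |
| 2 | 70 | 75 | +5 |
| 3 | 58 | 62 | +4 |
| 4 | 80 | 85 | +5 |
| 5 | 75 | 80 | +5 |
| 6 | 62 | 68 | +6 |
| 7 | 78 | 82 | +4 |
| 8 | 68 | 73 | +5 |
| 9 | 72 | 79 | +7 |
| 10 | 60 | 65 | +5 |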
n = 10 pairs
d̄ = (7+5+4+5+5+6+4+5+7+5)/10 = 53/10 = 5.3
s_d = √[Σ(dᵢ−d̄)²/(n−1)] = √[10.1/9] = √1.122 = 1.059
H₀: μ_d = 0
H₁: μ_d > 0 (one-sided, because we expect improvement)
α = 0.05
t = d̄ / (s_d/√n) = 5.3 / (1.059/√10) = 5.3 / 0.335 = 15.8
df = 9
Critical t(9, one-sided, 0.05) = 1.833
t = 15.8 >> 1.833 → Highly significant
p-value < 0.0001
Conclusion: The training programme significantly improved scores (paired t(9)=15.8, p<0.0001).
Average improvement: 5.3 points (95% CI: 4.54 to 6.06 points).
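The paired calculation reduces to a one-sample test on the differences, which a few lines of standard-library Python can confirm. This is a sketch under stated assumptions: the one-sided critical value 1.833 is the one quoted in the text, and 2.262 is the standard two-sided 5% table value for df = 9, used here for the interval.

```python
import math
from statistics import mean, stdev

before = [65, 70, 58, 80, 75, 62, 78, 68, 72, 60]
after = [72, 75, 62, 85, 80, 68, 82, 73, 79, 65]

# Step 1: reduce each pair to a single difference (After - Before)
diffs = [a - b for a, b in zip(after, before)]

# Step 2: one-sample t-test on the differences against 0
n = len(diffs)
d_bar = mean(diffs)
s_d = stdev(diffs)                     # sample SD of the differences
se = s_d / math.sqrt(n)
t_stat = d_bar / se
df = n - 1

t_crit_one_sided = 1.833               # t(9), one-sided alpha = 0.05 (from the text)
reject_h0 = t_stat > t_crit_one_sided

t_crit_two_sided = 2.262               # t(9), two-sided 5% (standard table value)
ci = (d_bar - t_crit_two_sided * se, d_bar + t_crit_two_sided * se)

print(f"d_bar={d_bar:.2f}, s_d={s_d:.3f}, t({df})={t_stat:.2f}, reject={reject_h0}")
print(f"95% CI for mean change: ({ci[0]:.2f}, {ci[1]:.2f})")
```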
Why Paired > Independent for Before/After?
Using independent two-sample t-test on the same data:
Before: n₁=10, x̄₁=68.8, s₁=7.57
After: n₂=10, x̄₂=74.1, s₂=7.55
t = (68.8 − 74.1) / √(7.57²/10 + 7.55²/10)
= −5.3 / √(5.73 + 5.70)
= −5.3 / √11.43
= −5.3 / 3.381
= −1.568
df ≈ 18.0, p ≈ 0.13 → NOT significant!
The paired test was overwhelmingly significant (p<0.0001); the independent test gave |t|=1.57 (p≈0.13).
Same data → opposite conclusions!
Why? The paired test REMOVES the between-person variability (everyone's different baseline).
It isolates only the within-person change — much more powerful when there's correlation between pairs.
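The role of correlation can be made explicit with the identity Var(d) = s₁² + s₂² − 2·r·s₁·s₂: when before and after scores are strongly correlated, the variance of the differences collapses. The sketch below checks this on the training data; it uses only the standard library and computes the Pearson correlation by hand.

```python
import math
from statistics import mean, stdev

before = [65, 70, 58, 80, 75, 62, 78, 68, 72, 60]
after = [72, 75, 62, 85, 80, 68, 82, 73, 79, 65]
n = len(before)

m1, m2 = mean(before), mean(after)
s1, s2 = stdev(before), stdev(after)

# Pearson correlation between the two measurements on the same people
cov = sum((b - m1) * (a - m2) for b, a in zip(before, after)) / (n - 1)
r = cov / (s1 * s2)

# Variance of the differences, two ways: via the identity and directly
var_d_identity = s1 ** 2 + s2 ** 2 - 2 * r * s1 * s2
s_d = stdev([a - b for a, b in zip(after, before)])

# Same mean difference, very different denominators
t_paired = (m2 - m1) / (s_d / math.sqrt(n))
t_indep = (m2 - m1) / math.sqrt(s1 ** 2 / n + s2 ** 2 / n)

print(f"r = {r:.3f}")
print(f"var(d): identity = {var_d_identity:.3f}, direct = {s_d ** 2:.3f}")
print(f"paired t = {t_paired:.2f}, independent (Welch) t = {t_indep:.2f}")
```

The independent test keeps the full between-person spread (s₁² + s₂²) in its denominator, while the paired test subtracts the 2·r·s₁·s₂ term, which is why its statistic is so much larger here.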
Assumptions Checking
Normality
Required: Differences (paired) or each group (two-sample) should be normal.
Check with:
- Histogram: roughly bell-shaped?
- QQ plot: points near the diagonal?
- Shapiro-Wilk test: p > 0.05 → can't reject normality
The CLT helps: for n ≥ 30, t-tests are reasonably robust to moderate non-normality, although heavy skew or outliers can still distort the result.
For small n with severe non-normality → use Mann-Whitney U (two-sample) or Wilcoxon signed-rank (paired).
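A QQ plot can be summarised without plotting by correlating the sorted data with theoretical normal quantiles: values near 1 mean the points lie close to a straight line. The sketch below is one simple way to do this and is not a substitute for Shapiro-Wilk; the function name `qq_correlation`, the plotting positions (i + 0.5)/n, and the skewed sample are all assumptions made for illustration.

```python
import math
from statistics import NormalDist, mean

def qq_correlation(data):
    """Correlation between sorted data and standard normal quantiles."""
    xs = sorted(data)
    n = len(xs)
    # Theoretical quantiles at plotting positions (i + 0.5) / n
    qs = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    mx, mq = mean(xs), mean(qs)
    sxy = sum((x - mx) * (q - mq) for x, q in zip(xs, qs))
    sxx = sum((x - mx) ** 2 for x in xs)
    sqq = sum((q - mq) ** 2 for q in qs)
    return sxy / math.sqrt(sxx * sqq)

# Onboarding times from the one-sample example
onboarding = [4.2, 5.8, 6.1, 5.5, 4.9, 6.7, 5.1, 4.8, 5.3, 6.2,
              5.9, 5.4, 4.6, 5.7, 6.3, 5.0, 5.5, 6.0, 4.7, 5.8]
# Hypothetical, strongly right-skewed sample for contrast
skewed = [1, 1, 1, 2, 2, 3, 4, 8, 15, 40]

r_onboarding = qq_correlation(onboarding)
r_skewed = qq_correlation(skewed)
print(f"onboarding: r = {r_onboarding:.3f}")
print(f"skewed:     r = {r_skewed:.3f}")
```

A markedly lower correlation, as for the skewed sample, is a signal to inspect the actual plot and consider a rank-based alternative.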
Independence
Paired t-test: the DIFFERENCES must be independent (one pair doesn't affect another)
Two-sample t-test: the two groups must be independent of each other
One-sample t-test: observations must be independent
Scale of Measurement
All t-tests require interval or ratio data (or ordinal approximated as interval with justification).
Choosing the Right t-Test
Is there a natural pairing (before/after, matched pairs)?
→ YES: Paired t-test
→ NO: Are the two groups independent?
→ YES: Two-sample t-test (use Welch's by default)
→ NO: Re-examine the design; groups that are not independent usually mean the data are paired or clustered
Is there only one group compared to a known standard?
→ YES: One-sample t-test
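The decision flow above can be written down as a small lookup function. This is only a sketch of the logic in this section: `choose_t_test` and its parameters are hypothetical names, and it deliberately covers nothing beyond the three tests discussed here.

```python
def choose_t_test(n_groups: int, paired: bool = False) -> str:
    """Return the t-test suggested by the decision flow in this section."""
    if n_groups == 1:
        # One group compared against a known or hypothesised standard
        return "one-sample t-test"
    if n_groups == 2 and paired:
        # Before/after or matched pairs: analyse the differences
        return "paired t-test"
    if n_groups == 2:
        # Two independent groups: Welch's version by default
        return "two-sample t-test (Welch)"
    raise ValueError("t-tests compare one or two groups only")

print(choose_t_test(1))                # one-sample t-test
print(choose_t_test(2, paired=True))   # paired t-test
print(choose_t_test(2))                # two-sample t-test (Welch)
```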
Practical Examples
Example 1: Product Quality (One-Sample)
Specification: mean weight = 250g
Sample of 35 items: x̄=247.8g, s=6.2g
t = (247.8 − 250) / (6.2/√35) = −2.2/1.048 = −2.099
df = 34, p (two-sided) = 0.043
p < 0.05 → reject H₀
Evidence the process is underfilling (mean < 250g).
Example 2: Drug Trial (Two-Sample, Welch)
Treatment: n₁=40, x̄₁=85.2, s₁=12.1 (blood pressure reduction)
Placebo: n₂=38, x̄₂=79.4, s₂=18.4
t = (85.2 − 79.4) / √(12.1²/40 + 18.4²/38)
= 5.8 / √(3.66 + 8.90)
= 5.8 / √12.56
= 5.8 / 3.544
= 1.636
df ≈ 64, p (two-sided) ≈ 0.107
p = 0.107 > 0.05 → fail to reject H₀
Insufficient evidence that the drug reduces blood pressure more than placebo at 5% level.
(Small-to-medium effect observed; study may be underpowered — consider larger sample)
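The "small-to-medium effect" remark can be quantified with a standardised effect size. The sketch below computes Cohen's d from the summary statistics using the pooled standard deviation; Cohen's d is not defined elsewhere in this chapter, so treat it as an added illustration, with the conventional benchmarks of roughly 0.2 for small and 0.5 for medium.

```python
import math

# Summary statistics from the drug trial example
n1, mean1, sd1 = 40, 85.2, 12.1    # treatment
n2, mean2, sd2 = 38, 79.4, 18.4    # placebo

# Pooled SD (same formula as the pooled variance earlier in the chapter)
pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
pooled_sd = math.sqrt(pooled_var)

# Cohen's d: mean difference expressed in pooled-SD units
cohens_d = (mean1 - mean2) / pooled_sd

print(f"pooled SD = {pooled_sd:.2f}, Cohen's d = {cohens_d:.2f}")
```

A d of this size with about 40 subjects per arm gives limited power, which is consistent with the non-significant result despite a visible difference in means.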
Example 3: Marketing A/B Test (Paired)
Same customers shown two ads on consecutive weeks:
Ad A revenue: 50, 65, 72, 48, 81, 70, 55, 90
Ad B revenue: 58, 72, 80, 55, 88, 76, 62, 98
Differences (B − A): 8, 7, 8, 7, 7, 6, 7, 8
d̄ = 7.25, s_d = 0.707
t = 7.25 / (0.707/√8) = 7.25/0.250 = 29.0
p < 0.00001 → highly significant
Ad B generates significantly more revenue per customer.
Common Mistakes
1. Using independent t-test for paired data
Before/after data are paired — the same person was measured twice.
Using independent t-test ignores this structure → LESS POWERFUL test.
Always use paired t-test for before/after or matched designs.
2. Assuming equal variances without checking
Pooled t-test assumes σ₁ = σ₂.
Use Levene's test or just default to Welch's — it's valid even when variances are equal.
3. Not checking normality for small n
For n=8 (very small), non-normality can invalidate the t-test.
Check with histogram + QQ plot.
If severely non-normal: use Mann-Whitney U (two-sample) or Wilcoxon signed-rank (paired).
4. One-tailed test after seeing the data
Seeing x̄₁ > x̄₂ and then testing H₁: μ₁ > μ₂ one-sided halves the p-value after the fact and roughly doubles the real false-positive rate.
The hypothesis must be set BEFORE looking at the data direction.
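A short simulation shows what goes wrong when the direction is picked after looking at the data. The setup is entirely hypothetical: samples of 30 drawn from a standard normal population where H₀ is true, a fixed random seed, and the standard table value 1.699 for the one-sided 5% critical point of t(29).

```python
import math
import random
from statistics import mean, stdev

random.seed(0)
N, SIMS = 30, 4000
T_CRIT_ONE_SIDED = 1.699   # t(29), one-sided alpha = 0.05 (standard table value)

def t_stat(sample, mu0=0.0):
    """One-sample t statistic against mu0."""
    return (mean(sample) - mu0) / (stdev(sample) / math.sqrt(len(sample)))

honest = 0   # direction fixed in advance: H1: mu > 0
peeked = 0   # direction chosen to match the sign of the sample mean
for _ in range(SIMS):
    sample = [random.gauss(0, 1) for _ in range(N)]   # H0 (mu = 0) is true
    t = t_stat(sample)
    honest += t > T_CRIT_ONE_SIDED
    peeked += abs(t) > T_CRIT_ONE_SIDED               # always test the "winning" side

honest_rate = honest / SIMS
peeked_rate = peeked / SIMS
print(f"pre-specified one-sided test: false-positive rate = {honest_rate:.3f}")
print(f"direction chosen after peeking: false-positive rate = {peeked_rate:.3f}")
```

Choosing the side after the fact is equivalent to running a two-sided test at the one-sided cutoff, so the nominal 5% error rate is no longer what is actually being achieved.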
Practice Exercises
1. A company targets an average delivery time of 3 days. Sample of 25 recent deliveries: x̄=3.4, s=0.8 days. Test at α=0.05 (two-sided) whether delivery time has changed.
2. Two factories produce the same component. Factory A (n=40): x̄=48.2mm, s=2.1mm. Factory B (n=35): x̄=49.5mm, s=3.8mm. Using Welch's t-test, test if the means differ at α=0.01.
3. A diet programme: 8 participants weighed before and after (kg). Test whether the programme reduced weight (one-sided paired t-test, α=0.05).
   Before: 85, 90, 78, 95, 82, 88, 92, 76
   After: 80, 86, 74, 88, 79, 83, 86, 72
4. For Exercise 3, compute a 95% CI for the mean weight loss.
5. Why would using an independent t-test for the diet programme in Exercise 3 be inappropriate? Would it give a different conclusion?
Summary
In this chapter you learned:
- One-sample t-test: t = (x̄−μ₀)/(s/√n), df=n−1; compare one group to a known standard
- Independent two-sample t-test: compare two unrelated groups; use Welch's (unequal variance) by default
- Welch: t = (x̄₁−x̄₂) / √(s₁²/n₁ + s₂²/n₂); df from Welch-Satterthwaite formula
- Paired t-test: t = d̄/(s_d/√n), df=n−1; compute differences first, then one-sample t-test on differences
- The paired test is more powerful than the independent test when pairs are correlated (e.g., before/after)
- All t-tests assume: normality (or large n via CLT), independence, interval/ratio data
- For non-normal small samples: Mann-Whitney U (two-sample) or Wilcoxon signed-rank (paired)
- CI connection: (1−α)% CI for μ is the set of μ₀ values NOT rejected by the t-test at level α
- Set hypotheses before seeing data; use two-sided tests unless directional prediction is pre-specified
Next up: Chi-Square Tests — testing independence and goodness-of-fit for categorical data.