Chapter 14 of 18

t-Tests — One-Sample, Two-Sample & Paired

Compare means with the t-test — one-sample t-test, independent two-sample t-test, paired t-test, and checking assumptions.

Meritshot

Why the t-Test?

The t-test is one of the most widely used statistical tests. It compares means when the population standard deviation (σ) is unknown, which is almost always the case in practice. Three versions handle different scenarios:

Test                            Question
One-sample t-test               Is the population mean equal to a specific value?
Independent two-sample t-test   Are the means of two independent groups equal?
Paired t-test                   Did the mean change between two related measurements?

The t-Distribution (Recap)

t with df degrees of freedom:
- Symmetric and bell-shaped (like Z)
- Heavier tails than Z (accounts for uncertainty in estimating σ)
- As df → ∞, t → Z
- df = n − 1 for one-sample and paired tests; df = n₁ + n₂ − 2 for the pooled two-sample test (Welch's test uses an approximate, usually non-integer df)
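To see the "heavier tails" point numerically, here is a minimal sketch that integrates the t density with Simpson's rule and compares P(|T| > 2) with the standard normal. The helper names `t_pdf` and `t_two_sided_tail` are illustrative, not from any library; a statistics package would normally supply the t CDF directly.

```python
import math
from statistics import NormalDist

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_two_sided_tail(t, df, steps=2000):
    """P(|T| > t), using Simpson's rule on the density over [0, |t|]."""
    b = abs(t)
    h = b / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = 2 * total * h / 3  # P(-t < T < t), by symmetry
    return 1 - central

z_tail = 2 * (1 - NormalDist().cdf(2.0))
tails = {df: t_two_sided_tail(2.0, df) for df in (3, 10, 30, 1000)}

for df, p in tails.items():
    print(f"df={df:>4}: P(|T| > 2) = {p:.4f}")
print(f"normal : P(|Z| > 2) = {z_tail:.4f}")
```

The tail probability shrinks towards the normal value as df grows, which is the "t → Z" behaviour listed above.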

1. One-Sample t-Test

Question: Is the population mean different from a known/hypothesised value μ₀?

Assumptions

  1. Data is approximately normally distributed OR n ≥ 30 (CLT)
  2. Data is continuous (interval or ratio scale)
  3. Sample is randomly selected
  4. Observations are independent

Test Statistic

t = (x̄ − μ₀) / (s / √n)

where:
x̄ = sample mean
μ₀ = hypothesised population mean
s = sample standard deviation
n = sample size
df = n − 1

Worked Example

Scenario: A company claims new hires have an average onboarding time of 5 days.
HR samples 20 recent new hires and records actual times.

Data: 4.2, 5.8, 6.1, 5.5, 4.9, 6.7, 5.1, 4.8, 5.3, 6.2,
      5.9, 5.4, 4.6, 5.7, 6.3, 5.0, 5.5, 6.0, 4.7, 5.8

n = 20
x̄ = 109.5 / 20 = 5.475 days
s = √[Σ(xᵢ − x̄)²/(n − 1)] = √(7.9975/19) = 0.649 days

H₀: μ = 5.0     (claimed onboarding time)
H₁: μ ≠ 5.0     (two-sided)
α = 0.05

Step 1: Test statistic
t = (5.475 − 5.0) / (0.649 / √20)
  = 0.475 / (0.649 / 4.472)
  = 0.475 / 0.1451
  = 3.274

Step 2: Degrees of freedom
df = 20 − 1 = 19

Step 3: p-value (two-sided)
From the t-table with df = 19, t = 3.274 lies between 3.174 (p = 0.005) and 3.579 (p = 0.002)
So 0.002 < p < 0.005
More precisely: p ≈ 0.004

Step 4: Decision
p ≈ 0.004 < α = 0.05 → REJECT H₀

Conclusion: There is significant evidence that the actual average onboarding
time (5.48 days) differs from the claimed 5.0 days (t(19)=3.27, p=0.004).

Confidence Interval Connection

The 95% CI from a one-sample t-test is exactly the set of μ₀ values that would NOT be rejected at α=0.05:

95% CI = x̄ ± t* × (s/√n)
       = 5.475 ± 2.093 × 0.1451
       = 5.475 ± 0.304
       = (5.17, 5.78) days

Since 5.0 is outside (5.17, 5.78), we reject H₀: μ=5, consistent with the test result. ✓
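The one-sample calculation can be checked directly from the 20 listed onboarding times. This is a minimal sketch using only the Python standard library; the critical value 2.093 is taken from a t-table (df = 19, two-sided α = 0.05) rather than computed, and the variable names are illustrative.

```python
import math
from statistics import mean, stdev

times = [4.2, 5.8, 6.1, 5.5, 4.9, 6.7, 5.1, 4.8, 5.3, 6.2,
         5.9, 5.4, 4.6, 5.7, 6.3, 5.0, 5.5, 6.0, 4.7, 5.8]
mu0 = 5.0       # hypothesised mean (the company's claim)
t_crit = 2.093  # t-table value: df = 19, two-sided alpha = 0.05

n = len(times)
x_bar = mean(times)
s = stdev(times)             # sample SD, uses the n - 1 denominator
se = s / math.sqrt(n)        # standard error of the mean
t_stat = (x_bar - mu0) / se

# 95% confidence interval and the equivalent test decision
ci = (x_bar - t_crit * se, x_bar + t_crit * se)
reject = abs(t_stat) > t_crit

print(f"n={n}, mean={x_bar:.3f}, s={s:.3f}, se={se:.4f}")
print(f"t={t_stat:.3f}, 95% CI=({ci[0]:.2f}, {ci[1]:.2f}), reject H0: {reject}")
```

Rejecting when |t| exceeds the critical value and checking whether μ₀ falls outside the interval are the same decision, which is the CI connection described above.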

2. Independent Two-Sample t-Test

Question: Do two independent groups have different population means?

H₀: μ₁ = μ₂  (or equivalently, μ₁ − μ₂ = 0)
H₁: μ₁ ≠ μ₂  (two-sided)

Two Versions

Equal Variances (Pooled t-Test)

Assumes both groups have the same population variance (σ₁² = σ₂²).

s_p² = [(n₁−1)s₁² + (n₂−1)s₂²] / (n₁+n₂−2)    [pooled variance]

t = (x̄₁ − x̄₂) / (s_p × √(1/n₁ + 1/n₂))

df = n₁ + n₂ − 2

Unequal Variances (Welch's t-Test)

Does NOT assume equal variances — more robust, generally preferred.

t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂)

df = (s₁²/n₁ + s₂²/n₂)² / [(s₁²/n₁)²/(n₁−1) + (s₂²/n₂)²/(n₂−1)]    [Welch-Satterthwaite]

Rule of thumb: Always use Welch's t-test by default unless you have strong reason to assume equal variances.

Worked Example

Scenario: Compare exam scores between two teaching methods.
Method A (traditional): n₁=30, x̄₁=72.3, s₁=8.5
Method B (interactive): n₂=28, x̄₂=78.6, s₂=11.2

H₀: μ₁ = μ₂    (no difference between methods)
H₁: μ₁ ≠ μ₂    (two-sided)
α = 0.05

Using Welch's t-test:
t = (72.3 − 78.6) / √(8.5²/30 + 11.2²/28)
  = −6.3 / √(2.408 + 4.480)
  = −6.3 / √6.888
  = −6.3 / 2.625
  = −2.400

Welch df ≈ 50.3 (round down to 50 for the table lookup)
From t-table: t(50) at two-sided 0.05 → critical value ≈ 2.009
|t| = 2.400 > 2.009 → reject H₀

p-value ≈ 0.020

Conclusion: Significant difference in exam scores between methods (t(50.3)=−2.40, p=0.020).
Students taught with Method B scored significantly higher on average (78.6 vs 72.3).
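Both versions of the two-sample test can be computed from the summary statistics alone. The sketch below applies the Welch and pooled formulas from this section to the teaching-method example; `welch_t` and `pooled_t` are hypothetical helper names, and p-values are left to a t-table or a statistics package.

```python
import math

def welch_t(m1, s1, n1, m2, s2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    t = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

def pooled_t(m1, s1, n1, m2, s2, n2):
    """Pooled (equal-variance) t statistic and its degrees of freedom."""
    sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    t = (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

# Method A (traditional) vs Method B (interactive)
t_w, df_w = welch_t(72.3, 8.5, 30, 78.6, 11.2, 28)
t_p, df_p = pooled_t(72.3, 8.5, 30, 78.6, 11.2, 28)

print(f"Welch : t = {t_w:.3f}, df = {df_w:.1f}")
print(f"Pooled: t = {t_p:.3f}, df = {df_p}")
```

With similar group sizes the two versions give close answers here; they diverge more when both the sample sizes and the variances are unequal, which is why Welch's is the safer default.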

Checking Equal Variance Assumption

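Use Levene's test (or an F-test, which is sensitive to non-normality) to check whether the variances differ significantly. A quick informal check is the ratio of the sample variances: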

Ratio of variances: s₂²/s₁² = 11.2²/8.5² = 125.44/72.25 = 1.74

Rule of thumb: if the larger SD is more than twice the smaller SD, variances are likely unequal.
Here: 11.2/8.5 = 1.32 → moderate difference; use Welch's to be safe

3. Paired t-Test

Question: Did the mean change between two related measurements on the same subjects?

When to use:

  • Before/after measurements on the same individual
  • Matched pairs (e.g., twins, matched controls)
  • Cross-over trials (each subject gets both treatments)

Key Idea

Instead of treating the two groups separately, compute the difference for each pair and run a one-sample t-test on the differences.

dᵢ = x_after_i − x_before_i   (difference for each subject)

d̄ = mean difference
s_d = standard deviation of differences

t = d̄ / (s_d / √n)
df = n − 1   (where n = number of PAIRS)

H₀: μ_d = 0  (no change on average)
H₁: μ_d ≠ 0  (there was a change)

Worked Example

Training programme: 10 employees measured before and after.

Employee  Before  After   Difference (d = After − Before)
1         65      72       +7
2         70      75       +5
3         58      62       +4
4         80      85       +5
5         75      80       +5
6         62      68       +6
7         78      82       +4
8         68      73       +5
9         72      79       +7
10        60      65       +5

n = 10 pairs
d̄ = (7+5+4+5+5+6+4+5+7+5)/10 = 53/10 = 5.3
s_d = √[Σ(dᵢ−d̄)²/(n−1)] = √[10.1/9] = √1.122 = 1.059

H₀: μ_d = 0
H₁: μ_d > 0  (one-sided: improvement was predicted before the data were collected)
α = 0.05

t = d̄ / (s_d/√n) = 5.3 / (1.059/√10) = 5.3 / 0.335 = 15.8

df = 9
Critical t(9, one-sided, 0.05) = 1.833
t = 15.8 >> 1.833 → Highly significant

p-value < 0.0001

Conclusion: The training programme significantly improved scores (paired t(9)=15.8, p<0.0001).
Average improvement: 5.3 points (two-sided 95% CI: 4.54 to 6.06 points, using t* = 2.262).

Why Paired > Independent for Before/After?

Using independent two-sample t-test on the same data:
Before: n₁=10, x̄₁=68.8, s₁=7.57
After:  n₂=10, x̄₂=74.1, s₂=7.55

t = (68.8 − 74.1) / √(7.57²/10 + 7.55²/10)
  = −5.3 / √(5.73 + 5.70)
  = −5.3 / √11.43
  = −5.3 / 3.381
  = −1.568

df ≈ 18.0, p (two-sided) ≈ 0.13 → NOT significant!

The paired test gave t=15.8 (p<0.0001); independent gave |t|=1.57 (p≈0.13).
Same data → opposite conclusions!

Why? The paired test REMOVES the between-person variability (everyone's different baseline).
It isolates only the within-person change — much more powerful when there's correlation between pairs.
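The contrast between the two analyses can be reproduced from the raw before/after scores. This is a small sketch using only the standard library; it computes the two test statistics and leaves p-values to a t-table or a statistics package. Variable names are illustrative.

```python
import math
from statistics import mean, stdev

before = [65, 70, 58, 80, 75, 62, 78, 68, 72, 60]
after = [72, 75, 62, 85, 80, 68, 82, 73, 79, 65]
n = len(before)

# Paired analysis: one-sample t-test on the per-employee differences
diffs = [a - b for a, b in zip(after, before)]
se_paired = stdev(diffs) / math.sqrt(n)
t_paired = mean(diffs) / se_paired

# Independent (Welch) analysis: treats the two columns as unrelated groups
se_indep = math.sqrt(stdev(before) ** 2 / n + stdev(after) ** 2 / n)
t_indep = (mean(before) - mean(after)) / se_indep

print(f"SD of differences = {stdev(diffs):.3f}")
print(f"SD before = {stdev(before):.2f}, SD after = {stdev(after):.2f}")
print(f"paired t = {t_paired:.2f}, independent t = {t_indep:.2f}")
```

The mean difference is identical in both analyses; only the standard error changes. The differences vary far less than the raw scores, so the paired standard error is much smaller.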

Assumptions Checking

Normality

Required: Differences (paired) or each group (two-sample) should be normal.
Check with:
- Histogram: roughly bell-shaped?
- QQ plot: points near the diagonal?
- Shapiro-Wilk test: p > 0.05 → can't reject normality

CLT helps: for n ≥ 30 or so, t-tests are reasonably robust to moderate non-normality (heavy skew or outliers may need a larger n).
For small n with severe non-normality → use Mann-Whitney U (two-sample) or Wilcoxon signed-rank (paired).

Independence

Paired t-test: the DIFFERENCES must be independent (one pair doesn't affect another)
Two-sample t-test: the two groups must be independent of each other
One-sample t-test: observations must be independent

Scale of Measurement

All t-tests require interval or ratio data (or ordinal approximated as interval with justification).

Choosing the Right t-Test

Is there a natural pairing (before/after, matched pairs)?
→ YES: Paired t-test
→ NO: Are the two groups independent?
   → YES: Two-sample t-test (use Welch's by default)
   → NO: Re-examine the design. If the groups are related, a pairing probably exists and the paired t-test applies

Is there only one group compared to a known standard?
→ YES: One-sample t-test
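The decision flow above can be written as a tiny helper. This is only an illustrative sketch: `choose_t_test` and its parameters are hypothetical names, and it encodes just the rules stated in this section (Welch's as the two-sample default).

```python
def choose_t_test(n_groups: int, paired: bool = False) -> str:
    """Return the t-test suggested by the decision flow in this chapter."""
    if n_groups == 1:
        # One group compared against a known or hypothesised standard
        return "one-sample t-test"
    if n_groups == 2:
        # Natural pairing (before/after, matched pairs) takes priority
        return "paired t-test" if paired else "Welch two-sample t-test"
    raise ValueError("t-tests compare one or two groups only")

print(choose_t_test(1))                # one-sample t-test
print(choose_t_test(2, paired=True))   # paired t-test
print(choose_t_test(2))                # Welch two-sample t-test
```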

Practical Examples

Example 1: Product Quality (One-Sample)

Specification: mean weight = 250g
Sample of 35 items: x̄=247.8g, s=6.2g

t = (247.8 − 250) / (6.2/√35) = −2.2/1.048 = −2.099
df = 34, p (two-sided) = 0.043

p < 0.05 → reject H₀
Evidence the process is underfilling (mean < 250g).

Example 2: Drug Trial (Two-Sample, Welch)

Treatment: n₁=40, x̄₁=85.2, s₁=12.1  (blood pressure reduction)
Placebo:   n₂=38, x̄₂=79.4, s₂=18.4

t = (85.2 − 79.4) / √(12.1²/40 + 18.4²/38)
  = 5.8 / √(3.66 + 8.90)
  = 5.8 / √12.56
  = 5.8 / 3.544
  = 1.636

df ≈ 63.5 (Welch-Satterthwaite), p (two-sided) ≈ 0.107

p = 0.107 > 0.05 → fail to reject H₀
Insufficient evidence that the drug reduces blood pressure more than placebo at 5% level.
(Small-to-medium effect observed; study may be underpowered — consider larger sample)
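The "small-to-medium effect" remark can be made concrete with a standardised effect size. The sketch below computes Cohen's d for the drug-trial summary statistics; using the pooled SD as the standardiser is an assumption made here (the example itself does not say which measure it has in mind), and `cohens_d` is a hypothetical helper name. Cohen's conventional benchmarks are roughly 0.2 for small and 0.5 for medium.

```python
import math

def cohens_d(m1, s1, n1, m2, s2, n2):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    sp = math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
    return (m1 - m2) / sp

# Treatment vs placebo (blood pressure reduction)
d = cohens_d(85.2, 12.1, 40, 79.4, 18.4, 38)
print(f"Cohen's d = {d:.2f}")  # about 0.37, between "small" and "medium"
```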

Example 3: Marketing A/B Test (Paired)

Same customers shown two ads on consecutive weeks:
Ad A revenue: 50, 65, 72, 48, 81, 70, 55, 90
Ad B revenue: 58, 72, 80, 55, 88, 76, 62, 98

Differences (B − A): 8, 7, 8, 7, 7, 6, 7, 8
d̄ = 7.25, s_d = √(3.5/7) = 0.707

t = 7.25 / (0.707/√8) = 7.25/0.250 = 29.0
p < 0.00001 → highly significant

Ad B generates significantly more revenue per customer.

Common Mistakes

1. Using independent t-test for paired data

Before/after data are paired — the same person was measured twice.
Using an independent t-test ignores this structure: it violates the independence assumption and usually gives a LESS POWERFUL test.
Always use paired t-test for before/after or matched designs.

2. Assuming equal variances without checking

Pooled t-test assumes σ₁ = σ₂.
Use Levene's test or just default to Welch's — it's valid even when variances are equal.

3. Not checking normality for small n

For n=8 (very small), non-normality can invalidate the t-test.
Check with histogram + QQ plot.
If severely non-normal: use Mann-Whitney U (two-sample) or Wilcoxon signed-rank (paired).

4. One-tailed test after seeing the data

Seeing x̄₁ > x̄₂ and then testing H₁: μ₁ > μ₂ one-sided → cheating.
The hypothesis must be set BEFORE looking at the data direction.
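A small simulation illustrates why this counts as cheating. Both groups are drawn from the same population, so H₀ is true and every rejection is a false positive. The set-up is an assumption chosen for illustration: 20 observations per group, the pooled t statistic (equal variances hold by construction, so df = 38 exactly), and the one-sided 5% critical value 1.686 from a t-table.

```python
import math
import random

rng = random.Random(42)   # fixed seed so the run is repeatable
N, SIMS = 20, 4000
T_CRIT = 1.686            # t-table: df = 38, one-sided alpha = 0.05

def pooled_t(x, y):
    """Pooled two-sample t statistic for equal-variance groups."""
    n1, n2 = len(x), len(y)
    m1, m2 = sum(x) / n1, sum(y) / n2
    ss1 = sum((v - m1) ** 2 for v in x)
    ss2 = sum((v - m2) ** 2 for v in y)
    sp2 = (ss1 + ss2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))

prespecified = 0  # H1: mu1 > mu2 fixed before seeing the data
post_hoc = 0      # direction chosen after seeing which mean is larger
for _ in range(SIMS):
    x = [rng.gauss(0, 1) for _ in range(N)]
    y = [rng.gauss(0, 1) for _ in range(N)]
    t = pooled_t(x, y)
    if t > T_CRIT:
        prespecified += 1
    if abs(t) > T_CRIT:
        post_hoc += 1

rate_pre = prespecified / SIMS
rate_post = post_hoc / SIMS
print(f"false-positive rate, pre-specified direction: {rate_pre:.3f}")
print(f"false-positive rate, direction chosen post hoc: {rate_post:.3f}")
```

Picking the direction after looking is equivalent to rejecting whenever |t| exceeds the one-sided cut-off, so the false-positive rate roughly doubles, from about 5% to about 10%.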

Practice Exercises

  1. A company targets average delivery time of 3 days. Sample of 25 recent deliveries: x̄=3.4, s=0.8 days. Test at α=0.05 (two-sided) whether delivery time has changed.

  2. Two factories produce the same component. Factory A (n=40): x̄=48.2mm, s=2.1mm. Factory B (n=35): x̄=49.5mm, s=3.8mm. Using Welch's t-test, test if the means differ at α=0.01.

  3. A diet programme: 8 participants weighed before and after (kg).
     Before: 85, 90, 78, 95, 82, 88, 92, 76
     After:  80, 86, 74, 88, 79, 83, 86, 72
     Test whether the programme reduced weight (one-sided paired t-test, α=0.05).

  4. For Exercise 3, compute a 95% CI for the mean weight loss.

  5. Why would using an independent t-test for the diet programme in Exercise 3 be inappropriate? Would it give a different conclusion?

Summary

In this chapter you learned:

  • One-sample t-test: t = (x̄−μ₀)/(s/√n), df=n−1; compare one group to a known standard
  • Independent two-sample t-test: compare two unrelated groups; use Welch's (unequal variance) by default
    • Welch: t = (x̄₁−x̄₂) / √(s₁²/n₁ + s₂²/n₂); df from Welch-Satterthwaite formula
  • Paired t-test: t = d̄/(s_d/√n), df=n−1; compute differences first, then one-sample t-test on differences
    • More powerful than independent test when pairs are correlated (e.g., before/after)
  • All t-tests assume: normality (or large n via CLT), independence, interval/ratio data
  • For non-normal small samples: Mann-Whitney U (two-sample) or Wilcoxon signed-rank (paired)
  • CI connection: (1−α)% CI for μ is the set of μ₀ values NOT rejected by the t-test at level α
  • Set hypotheses before seeing data; use two-sided tests unless directional prediction is pre-specified

Next up: Chi-Square Tests — testing independence and goodness-of-fit for categorical data.