What Is Correlation?
Correlation measures the strength and direction of the relationship between two variables. It answers: Do the variables tend to move together, and if so, how strongly?
Examples:
→ Do study hours and exam scores move together? (positive)
→ Do absences and grades move together? (negative)
→ Do shoe size and IQ move together? (none)
The correlation coefficient always lies between −1 and +1.
Types of Correlation
r > 0: Positive correlation — as X increases, Y tends to increase
r < 0: Negative correlation — as X increases, Y tends to decrease
r = 0: No linear relationship
Strength (rule-of-thumb cut-offs; conventions vary by field):
|r| ≥ 0.9: Very strong
0.7 ≤ |r| < 0.9: Strong
0.5 ≤ |r| < 0.7: Moderate
0.3 ≤ |r| < 0.5: Weak
|r| < 0.3: Very weak or negligible
1. Pearson Correlation Coefficient (r)
Pearson r measures the linear relationship between two continuous variables.
Formula
r = Σ[(xᵢ − x̄)(yᵢ − ȳ)] / [√Σ(xᵢ − x̄)² × √Σ(yᵢ − ȳ)²]
Equivalently:
r = [Σ(xᵢyᵢ) − n×x̄×ȳ] / [√(Σxᵢ² − n×x̄²) × √(Σyᵢ² − n×ȳ²)]
Properties:
- Dimensionless (no units)
- Symmetric: r(X,Y) = r(Y,X)
- −1 ≤ r ≤ +1
- r = +1 or −1: perfect linear relationship
- r = 0: no linear relationship (may still have non-linear relationship)
Worked Example
Dataset: 8 employees — hours of study per week vs performance score
| Employee | Study hrs (x) | Score (y) |
|---|---|---|
| 1 | 2 | 55 |
| 2 | 4 | 60 |
| 3 | 6 | 70 |
| 4 | 8 | 75 |
| 5 | 10 | 80 |
| 6 | 12 | 88 |
| 7 | 14 | 92 |
| 8 | 16 | 95 |
n = 8
x̄ = (2+4+6+8+10+12+14+16)/8 = 72/8 = 9
ȳ = (55+60+70+75+80+88+92+95)/8 = 615/8 = 76.875
Computing Σ(xᵢ − x̄)(yᵢ − ȳ):
Employee 1: (2−9)(55−76.875) = (−7)(−21.875) = 153.125
Employee 2: (4−9)(60−76.875) = (−5)(−16.875) = 84.375
Employee 3: (6−9)(70−76.875) = (−3)(−6.875) = 20.625
Employee 4: (8−9)(75−76.875) = (−1)(−1.875) = 1.875
Employee 5: (10−9)(80−76.875) = (+1)(+3.125) = 3.125
Employee 6: (12−9)(88−76.875) = (+3)(+11.125) = 33.375
Employee 7: (14−9)(92−76.875) = (+5)(+15.125) = 75.625
Employee 8: (16−9)(95−76.875) = (+7)(+18.125) = 126.875
Σ(xᵢ − x̄)(yᵢ − ȳ) = 153.125 + 84.375 + 20.625 + 1.875 + 3.125 + 33.375 + 75.625 + 126.875 = 499.0
Σ(xᵢ − x̄)²:
(−7)² + (−5)² + (−3)² + (−1)² + (1)² + (3)² + (5)² + (7)²
= 49 + 25 + 9 + 1 + 1 + 9 + 25 + 49 = 168
Σ(yᵢ − ȳ)²:
(−21.875)² + (−16.875)² + (−6.875)² + (−1.875)² + (3.125)² + (11.125)² + (15.125)² + (18.125)²
= 478.516 + 284.766 + 47.266 + 3.516 + 9.766 + 123.766 + 228.766 + 328.516
= 1,504.875
r = 499.0 / √(168 × 1504.875)
= 499.0 / √252,819
= 499.0 / 502.81
= 0.992
Very strong positive correlation: r = 0.99
As study hours increase, performance scores increase almost perfectly linearly.
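The hand calculation above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function names `pearson_r` and `pearson_r_shortcut` are illustrative choices, not part of any library. It evaluates both the deviation form and the equivalent raw-sums form of the formula on the same eight data pairs.

```python
from math import sqrt

# Data from the worked example above
hours = [2, 4, 6, 8, 10, 12, 14, 16]
scores = [55, 60, 70, 75, 80, 88, 92, 95]

def pearson_r(x, y):
    """Deviation form: sum of cross-deviations divided by the root of
    the product of the two sums of squared deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def pearson_r_shortcut(x, y):
    """Raw-sums form: same value, different algebra."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    numerator = sum(a * b for a, b in zip(x, y)) - n * mean_x * mean_y
    denom_x = sqrt(sum(a * a for a in x) - n * mean_x ** 2)
    denom_y = sqrt(sum(b * b for b in y) - n * mean_y ** 2)
    return numerator / (denom_x * denom_y)

r = pearson_r(hours, scores)
r_alt = pearson_r_shortcut(hours, scores)
print(f"deviation form: r = {r:.3f}")
print(f"raw-sums form:  r = {r_alt:.3f}")
```

Both forms should agree to within floating-point error. The raw-sums form subtracts large, nearly equal numbers, so the deviation form is usually the safer choice for data with large values.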
Significance Testing for r
A non-zero r in the sample doesn't necessarily mean the population correlation ρ ≠ 0. Test significance:
H₀: ρ = 0 (no linear relationship in the population)
H₁: ρ ≠ 0
Test statistic:
t = r × √(n−2) / √(1−r²) with df = n−2
For our example:
t = 0.992 × √(8−2) / √(1−0.992²)
= 0.992 × √6 / √(1−0.984)
= 0.992 × 2.449 / √0.016
= 2.429 / 0.1265
= 19.2
df = 6
p < 0.0001 → highly significant
The correlation between study hours and scores is highly significant.
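The test statistic is easy to reproduce in code. The sketch below assumes the rounded r = 0.992 used above and compares |t| against the two-tailed critical value for df = 6 at α = 0.05, which is hard-coded here from a standard t table (an assumption of this sketch; other degrees of freedom need their own table value or a statistics library). The helper name `t_statistic` is illustrative.

```python
from math import sqrt

def t_statistic(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

r, n = 0.992, 8          # rounded r from the worked example
t = t_statistic(r, n)

# Two-tailed critical value for alpha = 0.05 and df = 6,
# copied from a standard t table (assumed, not computed here).
T_CRIT_DF6 = 2.447

reject_h0 = abs(t) > T_CRIT_DF6
print(f"t = {t:.1f}, df = {n - 2}, reject H0 at alpha = 0.05: {reject_h0}")
```

Because 1 − r² is very small when r is close to 1, t is sensitive to how r was rounded: carrying more decimals of r (about 0.9924) gives a t of roughly 19.8 instead of 19.2. The conclusion is the same either way.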
Coefficient of Determination (r²)
r² is the proportion of variance in Y that is explained by the linear relationship with X:
r² = 0.992² = 0.984 = 98.4%
Interpretation: 98.4% of the variability in performance scores can be
explained by (is associated with) variability in study hours.
The remaining 1.6% is due to other factors not in the model.
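To see why r² is read as "variance explained", it can help to fit the least-squares line and measure how much of the variation in Y is left over. The sketch below does this for the study-hours data; the variable names are illustrative, and the line fit uses the usual slope = Sxy / Sxx formula rather than anything specific to this chapter.

```python
hours = [2, 4, 6, 8, 10, 12, 14, 16]
scores = [55, 60, 70, 75, 80, 88, 92, 95]

n = len(hours)
mean_x, mean_y = sum(hours) / n, sum(scores) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores))
sxx = sum((x - mean_x) ** 2 for x in hours)
sst = sum((y - mean_y) ** 2 for y in scores)   # total variation in Y

# Least-squares line: y = intercept + slope * x
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Variation in Y left over after fitting the line
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(hours, scores))

r_squared_from_fit = 1 - sse / sst
r_squared_from_r = sxy ** 2 / (sxx * sst)
print(f"1 - SSE/SST = {r_squared_from_fit:.3f}")
print(f"r squared   = {r_squared_from_r:.3f}")
```

The two quantities match, which is what "proportion of variance explained" means for a straight-line fit. Without intermediate rounding the value is about 0.985; the 0.984 above comes from squaring the already-rounded r = 0.992.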
r vs r²
r = 0.7 → r² = 0.49 (49% of Y variance explained)
r = 0.5 → r² = 0.25 (25% of Y variance explained)
r = 0.3 → r² = 0.09 (9% of Y variance explained)
A "moderate" r of 0.5 only explains 25% of variance — the relationship
is less practically meaningful than the raw r suggests.
Always report r² alongside r.
2. Spearman Rank Correlation (ρ)
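Spearman ρ (rho, also written rₛ) measures the monotonic relationship between two variables. It works on ranks, not raw values. It is a sample statistic and should not be confused with the population correlation ρ used in the significance test above.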
Use when:
- Data is ordinal (ratings, rankings)
- Data is non-normal or has outliers
- The relationship is monotonic but not necessarily linear
- Small sample sizes
Formula
ρ = 1 − (6 × Σdᵢ²) / (n(n²−1))
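Where dᵢ = rank(xᵢ) − rank(yᵢ) for each observation. This shortcut formula is exact only when there are no tied ranks; with ties, assign average ranks and compute Pearson r on the ranks instead.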
Worked Example
Scenario: 8 candidates ranked by two interviewers (A and B)
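| Candidate | Interviewer A rank | Interviewer B rank | d = A−B | d² |
|---|---|---|---|---|
| 1 | 1 | 2 | −1 | 1 |
| 2 | 2 | 1 | +1 | 1 |
| 3 | 3 | 4 | −1 | 1 |
| 4 | 4 | 3 | +1 | 1 |
| 5 | 5 | 7 | −2 | 4 |
| 6 | 6 | 5 | +1 | 1 |
| 7 | 7 | 6 | +1 | 1 |
| 8 | 8 | 8 | 0 | 0 |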
Σd² = 1 + 1 + 1 + 1 + 4 + 1 + 1 + 0 = 10
ρ = 1 − (6 × 10) / (8 × (64−1))
= 1 − 60 / (8 × 63)
= 1 − 60/504
= 1 − 0.119
= 0.881
Strong agreement between the two interviewers (ρ = 0.881).
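The same calculation as a short Python sketch. The function name `spearman_from_ranks` is an illustrative choice; it implements only the shortcut formula, so it assumes the inputs are already ranks and that neither ranking contains ties.

```python
# Ranks from the interviewer example above (no tied ranks)
rank_a = [1, 2, 3, 4, 5, 6, 7, 8]
rank_b = [2, 1, 4, 3, 7, 5, 6, 8]

def spearman_from_ranks(ra, rb):
    """Shortcut formula: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
    Valid only when neither ranking contains ties."""
    n = len(ra)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ra, rb))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

rho = spearman_from_ranks(rank_a, rank_b)
print(f"Spearman rho = {rho:.3f}")
```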
Pearson vs Spearman
| Feature | Pearson r | Spearman ρ |
|---|---|---|
| Measures | Linear relationship | Monotonic relationship |
| Data type | Interval/Ratio | Ordinal or ranked interval/ratio |
| Outlier sensitivity | High | Low (ranks reduce impact) |
| Distribution assumption | Bivariate normal (for testing) | None |
| When to prefer | Normal data, no outliers | Non-normal, ordinal, or outliers present |
| Scale | Raw values | Ranks |
Rule of thumb: Use Pearson when data is continuous, roughly normal, and outlier-free. Use Spearman otherwise.
The Scatter Plot: Always Plot First
Before computing any correlation, always look at the scatter plot:
Pattern Types and Their r:
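r ≈ +1: points cluster around an upward-sloping line
r ≈ −1: points cluster around a downward-sloping line
r ≈ 0: no linear pattern (scattered cloud)
Symmetric curve (e.g. a U-shaped quadratic): r ≈ 0, and Spearman ρ ≈ 0 too, because the pattern is neither linear nor monotonic
Monotonic curve (e.g. exponential growth): Spearman ρ = ±1, while Pearson |r| stays below 1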
ALWAYS PLOT first — r = 0 doesn't mean "no relationship"
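The difference between "linear", "monotonic", and "neither" can be checked numerically. The sketch below uses two small made-up series (a symmetric parabola and a doubling sequence, both chosen purely for illustration) and hand-written helpers; `average_ranks` and `spearman_rho` are hypothetical names, with Spearman computed as Pearson r on average ranks.

```python
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def average_ranks(values):
    """Rank from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1      # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rho as Pearson r computed on average ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

x_sym = [-3, -2, -1, 0, 1, 2, 3]
y_quad = [v ** 2 for v in x_sym]     # U-shape: falls, then rises

x_pos = [0, 1, 2, 3, 4, 5, 6]
y_exp = [2 ** v for v in x_pos]      # always rising, but curved

quad_r, quad_rho = pearson_r(x_sym, y_quad), spearman_rho(x_sym, y_quad)
exp_r, exp_rho = pearson_r(x_pos, y_exp), spearman_rho(x_pos, y_exp)

print(f"U-shape:  Pearson {abs(quad_r):.2f}, Spearman {abs(quad_rho):.2f}")
print(f"Doubling: Pearson {exp_r:.2f}, Spearman {exp_rho:.2f}")
```

For the U-shape both coefficients come out at zero even though Y is an exact function of X, so only a plot reveals the pattern. For the doubling series Spearman reaches 1 while Pearson stays noticeably below it.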
The Anscombe Quartet
Four datasets with the SAME Pearson r ≈ 0.82:
Dataset I: Linear relationship — appropriate to use r
Dataset II: Curved relationship — r misleads!
Dataset III: Almost perfect line with ONE outlier pulling r down
Dataset IV: Vertical cloud with ONE outlier creating the correlation
Moral: Same r, wildly different patterns. ALWAYS plot the scatter first.
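The claim is easy to verify directly. The sketch below hard-codes the quartet's values as published by Anscombe (1973) and computes Pearson r for each dataset with a small helper (`pearson_r` is an illustrative name). It does not draw the plots; any plotting library can be used for that step.

```python
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Datasets I-III share the same x values; dataset IV has its own.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]

quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

r_values = {name: pearson_r(x, y) for name, (x, y) in quartet.items()}
for name, r in r_values.items():
    print(f"Dataset {name}: r = {r:.3f}")
```

All four coefficients agree to about two decimal places, which is exactly why the number alone cannot tell the four patterns apart.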
Correlation Is Not Causation
This is one of the most important concepts in statistics.
A significant, strong correlation between X and Y does NOT, by itself, show that:
1. X causes Y
2. Y causes X
3. They have any direct relationship at all
Explanations for correlation without causation:
1. COMMON CAUSE (confounding variable):
Ice cream sales correlate with drowning rates.
Cause: Hot weather → more ice cream AND more swimming
Fix: Control for temperature
2. REVERSE CAUSATION:
Hospital visits correlate with illness.
Does the hospital cause illness? Or does illness cause hospital visits?
3. COINCIDENTAL CORRELATION (spurious):
Nicolas Cage film releases correlate with swimming-pool drownings (r=0.67)
U.S. per capita cheese consumption correlates with deaths from becoming tangled
in bedsheets (r=0.95)
These correlations look strong on paper but are meaningless coincidences: compare enough unrelated time series and some will line up by chance.
4. SELECTION BIAS:
Among candidates who reach the interview stage, skills and friendliness can
be negatively correlated, but only because applicants who are both unskilled
AND unfriendly are screened out and never appear in the data.
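The common-cause case (item 1 above) can be illustrated with a small simulation. This is a sketch under invented assumptions: temperature is drawn from a standard normal distribution and drives both series with independent noise, and neither series affects the other. The variable names and noise levels are hypothetical. "Controlling for temperature" is done here with the partial correlation formula covered later in this chapter.

```python
import random
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

random.seed(0)   # fixed seed so the run is repeatable
n = 5000

# Hypothetical model: temperature (Z) drives both X and Y.
temperature = [random.gauss(0, 1) for _ in range(n)]
ice_cream = [t + random.gauss(0, 0.5) for t in temperature]
drownings = [t + random.gauss(0, 0.5) for t in temperature]

r_xy = pearson_r(ice_cream, drownings)
r_xz = pearson_r(ice_cream, temperature)
r_yz = pearson_r(drownings, temperature)

# Correlation of X and Y after removing the linear influence of Z
partial = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

print(f"raw correlation:            {r_xy:.2f}")
print(f"controlling for temperature: {partial:.2f}")
```

The raw correlation is strong even though the simulated series never influence each other, and it largely vanishes once temperature is controlled for.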
Establishing Causality
Correlation alone is not sufficient to establish causation. To establish causality:
1. RANDOMISED CONTROLLED EXPERIMENT:
Randomly assign subjects to treatment/control → eliminates confounding
Gold standard for causality
2. TEMPORAL ORDER:
Cause must precede effect (if X causes Y, X must change before Y)
3. MECHANISM:
There should be a plausible biological/physical/economic mechanism
explaining how X affects Y
4. DOSE-RESPONSE:
More X should lead to more Y (or less Y if negative)
5. ELIMINATION OF ALTERNATIVES:
Rule out confounders, reverse causation, selection bias
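Without an experiment, causal inference requires strong assumptions
and careful design, for example instrumental variables (IV), difference-in-differences (DiD), regression discontinuity (RD), or propensity score matching, all widely used in economics.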
Practical Examples
Example 1: Marketing Spend and Revenue
Monthly data for 12 months:
Marketing spend (₹000): 10, 15, 12, 20, 18, 25, 22, 30, 28, 35, 32, 40
Revenue (₹000): 85, 110, 95, 130, 125, 155, 145, 175, 170, 210, 200, 240
r ≈ 0.996 (very strong positive)
r² ≈ 0.992
Scatter plot: linear pattern with tight clustering
Interpretation: Very strong linear relationship between marketing spend
and revenue. Note that r describes how tightly the points follow a line,
not how much revenue changes per unit of spend; that quantity is the regression slope.
Caution: Cannot conclude marketing CAUSES revenue without controlling
for confounders (seasonality, competitors, economic conditions).
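The figures for this example can be recomputed from the twelve data pairs. The sketch below uses only the standard library and also computes the least-squares slope, which (unlike r) is the quantity that carries units of revenue per unit of spend. Variable names are illustrative.

```python
from math import sqrt

# Monthly data from the example above, both in thousands of rupees
spend = [10, 15, 12, 20, 18, 25, 22, 30, 28, 35, 32, 40]
revenue = [85, 110, 95, 130, 125, 155, 145, 175, 170, 210, 200, 240]

n = len(spend)
mean_x, mean_y = sum(spend) / n, sum(revenue) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(spend, revenue))
sxx = sum((x - mean_x) ** 2 for x in spend)
syy = sum((y - mean_y) ** 2 for y in revenue)

r = sxy / sqrt(sxx * syy)   # strength and direction (dimensionless)
r_squared = r ** 2
slope = sxy / sxx           # extra revenue per extra unit of spend

print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")
print(f"slope = {slope:.2f} (thousand rupees of revenue per thousand of spend)")
```

The slope describes an association in this sample only; the caution about confounders applies to it just as much as to r.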
Example 2: Customer Satisfaction and Retention
12 product categories — satisfaction score (1–10) and 1-year retention rate (%):
r = 0.72, r² = 0.518
Interpretation: Strong positive relationship (just above the 0.7 cut-off). Satisfaction explains
52% of the variance in retention rate. The relationship is
practically significant, though 48% of retention variance is unexplained.
Example 3: Outlier Effect on Pearson
Original 7 data points: r = 0.15 (very weak)
Add one extreme outlier (x=100, y=200):
New r = 0.88 (very strong!)
The SINGLE outlier completely changed the interpretation.
Solution: Compute Spearman ρ (far less affected by the outlier):
Spearman ρ = 0.18 (consistent with original pattern)
This shows why Spearman is more robust when outliers are present.
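The seven original points are not listed above, so the sketch below uses invented data with a similarly weak pattern to show the same effect. The values of `x` and `y`, and the helper names `average_ranks` and `spearman_rho`, are assumptions made for this illustration only; the exact numbers will differ from those quoted in the example.

```python
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def average_ranks(values):
    """Rank from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1      # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rho as Pearson r computed on average ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical data with no clear trend (not the chapter's original points)
x = [1, 2, 3, 4, 5, 6, 7]
y = [5, 3, 6, 2, 7, 4, 5]

# Same data plus the single extreme point from the example
x_out = x + [100]
y_out = y + [200]

r_before, r_after = pearson_r(x, y), pearson_r(x_out, y_out)
rho_before, rho_after = spearman_rho(x, y), spearman_rho(x_out, y_out)

print(f"Pearson:  {r_before:.2f} -> {r_after:.2f}")
print(f"Spearman: {rho_before:.2f} -> {rho_after:.2f}")
```

In this made-up data the outlier pushes Pearson r close to 1, while Spearman ρ shifts much less, because the outlier can only ever occupy the top rank no matter how extreme its value is.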
Partial Correlation
Correlation between X and Y after removing the influence of a third variable Z:
r(XY.Z) = (r_XY − r_XZ × r_YZ) / √[(1 − r_XZ²)(1 − r_YZ²)]
Example:
r(education, salary) = 0.65
r(education, experience) = 0.55
r(experience, salary) = 0.70
Partial correlation of education and salary controlling for experience:
r(ed, sal | exp) = (0.65 − 0.55×0.70) / √[(1−0.3025)(1−0.49)]
= (0.65 − 0.385) / √(0.6975 × 0.51)
= 0.265 / √0.3557
= 0.265 / 0.5964
= 0.44
The relationship is weaker once we control for experience — part of the
education-salary correlation was driven by more-educated people also
having more relevant experience.
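The formula translates directly into a small function. In this sketch the name `partial_correlation` and its argument names are illustrative; it takes the three pairwise correlations as inputs rather than raw data.

```python
from math import sqrt

def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation of X and Y after removing the linear influence of Z."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Values from the example above
r_ed_sal = 0.65    # education vs salary
r_ed_exp = 0.55    # education vs experience
r_exp_sal = 0.70   # experience vs salary

r_partial = partial_correlation(r_ed_sal, r_ed_exp, r_exp_sal)
print(f"r(ed, sal | exp) = {r_partial:.2f}")
```

If the third variable were uncorrelated with both X and Y (both of its correlations equal to 0), the function would return r_xy unchanged, which is a quick sanity check on the formula.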
Common Mistakes
1. Computing r without looking at the scatter plot
Non-linear relationships, outliers, and clusters can all produce
misleading r values. Always visualise first.
2. Interpreting r as the slope
r = 0.8 does NOT mean "for every 1 unit increase in X, Y increases by 0.8"
That's the regression coefficient, not r.
r only tells you the strength and direction of the linear relationship.
3. Treating absence of correlation as absence of relationship
r ≈ 0 means no LINEAR relationship.
X and Y could have a perfect quadratic, circular, or sinusoidal relationship
with r = 0. Spearman ρ only detects monotonic patterns, so it can be near 0 for these shapes as well; rely on the scatter plot to spot them.
4. Extrapolating from correlation to causation
Every statistics course teaches this, yet it's violated constantly in news reports.
"Study shows coffee drinkers live longer" → confounders abound.
Always ask: What else could explain this pattern?
Practice Exercises
- Compute Pearson r for X: 1, 3, 5, 7, 9 and Y: 4, 8, 12, 16, 20. Interpret the result and compute r².
- Two teachers rank 6 students. Teacher A: 1, 2, 3, 4, 5, 6; Teacher B: 2, 1, 4, 3, 6, 5. Compute Spearman ρ.
- Sales team size and revenue both increase over 10 years. r=0.95. A manager concludes "hiring more salespeople causes revenue to grow." Identify all possible alternative explanations.
- Dataset with r=0.3. Is this statistically significant at α=0.05 with n=100? What about with n=10? Compute both t-statistics.
- You find r=0.85 between variable A and variable B. A colleague says r²=0.85, so 85% of variance in B is explained by A. What's wrong?
Summary
In this chapter you learned:
- Pearson r: measures linear relationship between two continuous variables; r = Σ[(x−x̄)(y−ȳ)] / [(n−1)×sₓ×sᵧ], where sₓ and sᵧ are the sample standard deviations
- Interpretation: −1=perfect negative, 0=no linear relationship, +1=perfect positive; |r|≥0.7 strong, 0.5–0.7 moderate, 0.3–0.5 weak
- r²: proportion of Y variance explained by X; always report alongside r
- Significance test: t = r√(n−2)/√(1−r²) with df=n−2; tests H₀: ρ=0
- Spearman ρ: ranks-based correlation; robust to outliers and non-normality; use for ordinal data
- Pearson vs Spearman: Pearson = linear, sensitive to outliers; Spearman = monotonic, robust
- Always visualise: scatter plot first — Anscombe's Quartet shows identical r for completely different patterns
- r=0 ≠ no relationship: could be a non-linear or non-monotonic pattern
- Correlation ≠ causation: common cause, reverse causation, spurious correlations all produce r≠0
- Partial correlation: removes the influence of a third variable to isolate the direct relationship
Next up: Linear Regression — the most powerful tool for modelling and predicting relationships.