Data Visualisation for Statistics

Why Visualise Data?

Before running any statistical test, plot your data. Visualisation reveals:

Shape of the distribution (symmetric, skewed, bimodal)
Outliers that could distort results
Relationships between variables
Patterns that summary statistics hide

"A picture is worth a thousand p-values." — statisticians everywhere

Anscombe's Quartet (1973): Four datasets with nearly identical mean, variance, and correlation — but completely different shapes. The lesson: always look at a plot first.
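
The point can be checked numerically. The sketch below uses the published quartet values and only the Python standard library; the helper name pearson_r is introduced here for illustration and is not part of the original text.

```python
from math import sqrt
from statistics import mean, variance

# Anscombe's quartet (1973). Datasets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# (mean of y, sample variance of y, correlation) for each dataset
summary = {
    name: (mean(ys), variance(ys), pearson_r(xs, ys))
    for name, (xs, ys) in quartet.items()
}

for name, (m, v, r) in summary.items():
    print(f"{name}: mean(y)={m:.2f}  var(y)={v:.2f}  r={r:.3f}")
```

The four printed rows agree to about two decimal places, which is why the numbers alone cannot tell the datasets apart; only a scatter plot of each pair shows the difference.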

Choosing the Right Chart

Data type	Goal	Chart
One quantitative variable	Distribution shape	Histogram
One quantitative variable	Centre & spread, outliers	Box plot
One categorical variable	Category counts	Bar chart
Two quantitative variables	Relationship	Scatter plot
Quantitative over time	Trend	Line chart
Part of a whole	Composition	Pie / Stacked bar
One quantitative variable	Check normality	QQ plot
Two categorical variables	Frequency	Mosaic / Grouped bar

Histograms

A histogram shows the distribution of a single quantitative variable by dividing it into intervals (bins) and counting observations in each bin.

How to Read a Histogram

Salary distribution (₹ thousands):
Bin      Frequency
60–70       4
70–80      12
80–90      18
90–100      9
100–110     5
110–120     2
          ──────
Total:     50

The "peak" (tallest bar) = modal class = ₹80–90k
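
As a small sketch, the same reading can be done in Python from the frequency table. The bins and counts are copied from the salary table; the variable names are illustrative only.

```python
# Bins and frequencies copied from the salary table (thousands of rupees).
bins = [(60, 70), (70, 80), (80, 90), (90, 100), (100, 110), (110, 120)]
freq = [4, 12, 18, 9, 5, 2]

total = sum(freq)
# The modal class is the bin with the highest frequency.
modal_class = bins[freq.index(max(freq))]

# A text histogram: one block character per observation.
for (lo, hi), f in zip(bins, freq):
    print(f"{lo:>3}-{hi:<3} {'█' * f} {f}")

print("total:", total)
print("modal class:", modal_class)
```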

Distribution Shapes

SYMMETRIC / BELL-SHAPED:
     █
   █████
 █████████
█████████████
→ Mean ≈ Median ≈ Mode

RIGHT-SKEWED (positive skew):
█
███
█████
█████████████  →  long tail to the right
→ Mean > Median > Mode
→ Examples: income, house prices, wait times

LEFT-SKEWED (negative skew):
             █
           ███
         █████
←  █████████████
→ Mean < Median < Mode
→ Examples: age at death (most die old), exam scores on an easy test

BIMODAL:
     █         █
   █████     █████
█████████ █████████
→ Two peaks → suggests two subgroups
→ Example: heights of a mixed male/female group

UNIFORM:
████████████████
→ All values equally likely
→ Example: fair die rolls

Choosing Bin Width

Too few bins: hides the shape. Too many bins: too noisy to see the pattern.

Rules of thumb:

Square root rule: k = √n bins (n=100 → 10 bins)
Sturges' rule: k = 1 + 3.322 × log₁₀(n)
Freedman-Diaconis: bin width = 2 × IQR / n^(1/3)

Software (Excel, Python, R) auto-calculates bin width — adjust if needed.
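
The three rules can be sketched as small Python functions. Rounding the bin count up to a whole number is an assumption made here (the rules themselves give real numbers), and the function names are illustrative. Note that 3.322 × log₁₀(n) is the same quantity as log₂(n).

```python
import math


def sqrt_bins(n):
    """Square root rule: k = sqrt(n), rounded up."""
    return math.ceil(math.sqrt(n))


def sturges_bins(n):
    """Sturges' rule: k = 1 + log2(n) = 1 + 3.322 * log10(n), rounded up."""
    return math.ceil(1 + math.log2(n))


def fd_bin_width(iqr, n):
    """Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3)."""
    return 2 * iqr / n ** (1 / 3)


n = 100
print("square root rule:", sqrt_bins(n), "bins")
print("Sturges' rule:   ", sturges_bins(n), "bins")
# The IQR of 15 is a made-up value, used only to show the call.
print("Freedman-Diaconis width:", round(fd_bin_width(15, n), 2))
```

The rules often disagree, so treat each result as a starting point and adjust after looking at the plot.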

Histogram vs Bar Chart

Histogram:
- Quantitative data
- Bars touch each other (continuous scale)
- Bar width = bin width (can vary)
- Shows distribution

Bar chart:
- Categorical data
- Bars have gaps between them
- All bars same width
- Shows counts per category

Box Plots (Box-and-Whisker Plots)

Summarises a distribution with five numbers: Min, Q1, Median, Q3, Max.

Reading a Box Plot

         ┌─────┬──────┐
─────────┤     │      ├───────── ∘ (outlier)
         └─────┴──────┘
↑        ↑     ↑      ↑        ↑
Min      Q1  Median   Q3      Max
(or lower                  (or upper
 fence)                     fence)

Box: from Q1 to Q3 (the IQR — contains the middle 50%)
Line in box: median
Whiskers: extend to min/max, OR (Tukey style) to the most extreme data points that still lie within the fences Q1−1.5×IQR and Q3+1.5×IQR
Points beyond whiskers: outliers (plotted individually)
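
A minimal sketch of the Tukey-style calculation in Python. Quartile conventions differ between tools; this version assumes the "inclusive" method of statistics.quantiles, and the salary values are hypothetical.

```python
from statistics import quantiles


def box_plot_summary(data):
    """Quartiles, Tukey fences, whisker ends and outliers for one variable."""
    values = sorted(data)
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme points that are still inside the fences.
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "fences": (lower_fence, upper_fence),
        "whiskers": (min(inside), max(inside)),
        "outliers": outliers,
    }


salaries = [62, 68, 71, 74, 78, 81, 85, 90, 140]  # hypothetical, in thousands
summary = box_plot_summary(salaries)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Here the value 140 lies above the upper fence, so it would be drawn as an individual point and the right whisker would stop at 90.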

Detecting Skew from a Box Plot

SYMMETRIC:
  ┌────┼────┐
──┤    │    ├──
  └────┴────┘
Median in the centre; whiskers roughly equal length

RIGHT-SKEWED:
  ┌──┼──────┐
──┤  │      ├──────────────
  └──┴──────┘
Median closer to Q1; long right whisker (or right outliers)

LEFT-SKEWED:
       ┌──────┼──┐
───────┤      │  ├──
       └──────┴──┘
Median closer to Q3; long left whisker

Side-by-Side Box Plots

Most powerful use: compare distributions across groups.

Department Salaries (₹k):

Finance    ──[════╪════════]──────────────
Technology ────────[═════╪═════════════]──
Marketing  ──[═════╪═══]──
HR         ─[══╪══]──

→ Technology has highest median and widest spread
→ Finance has a long upper whisker (some unusually high salaries)
→ HR has the narrowest range

Scatter Plots

Shows the relationship between two quantitative variables. Each point represents one observation.

Reading a Scatter Plot

Exam study hours (x) vs Score (y):

100 |                         ∘ ∘
 90 |                    ∘  ∘  ∘
 80 |               ∘  ∘
 70 |          ∘  ∘
 60 |     ∘  ∘
 50 |  ∘
     ─────────────────────────────→
     0    2    4    6    8   10   hours

→ Positive association: more study → higher score
→ Roughly linear
→ A point far from this trend, e.g. at (1 hour, 95 score), would be an outlier (possibly prior knowledge); none appears in this plot

Types of Association

Positive linear: y increases as x increases
Negative linear: y decreases as x increases
No relationship: random scatter (no pattern)
Non-linear: curved relationship (e.g., U-shaped)
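
The Pearson correlation coefficient r puts a number on linear association, and a short sketch shows why it cannot replace the plot: a perfect U-shape gives r = 0. The data below are constructed for illustration, and pearson_r is a helper written for this example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


x = [-2, -1, 0, 1, 2]
r_positive = pearson_r(x, [2 * v + 1 for v in x])  # y rises with x
r_negative = pearson_r(x, [-3 * v for v in x])     # y falls as x rises
r_u_shaped = pearson_r(x, [v ** 2 for v in x])     # curved, not linear

print(f"positive linear: r = {r_positive:.2f}")
print(f"negative linear: r = {r_negative:.2f}")
print(f"U-shaped:        r = {r_u_shaped:.2f}")
```

The U-shaped data have an exact relationship (y = x²), yet r reports no linear association, which is the case a scatter plot catches immediately.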

Scatter Plot Matrix

When you have multiple variables, a scatter plot matrix (SPLOM) shows all pairwise relationships simultaneously.

Frequency Tables and Relative Frequency

For categorical data:

Department    Frequency   Relative Freq   Cumulative Rel. Freq
Finance            32        32.0%              32.0%
Technology         28        28.0%              60.0%
Marketing          25        25.0%              85.0%
HR                 15        15.0%             100.0%
Total             100       100.0%

Relative frequency = count / total — allows comparison across samples of different sizes.

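Cumulative relative frequency is the running total of the relative frequencies. It answers "what % of values are at or below X?", so it is only meaningful when the rows have a natural order (ordinal categories or binned quantitative data); for unordered categories such as departments it depends on the row order.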
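
A short Python sketch that rebuilds the table's relative and cumulative columns from the raw counts. The counts come from the table; the variable names are illustrative.

```python
from itertools import accumulate

# Counts copied from the department table.
counts = {"Finance": 32, "Technology": 28, "Marketing": 25, "HR": 15}

total = sum(counts.values())
# Relative frequency = count / total
relative = {dept: c / total for dept, c in counts.items()}
# Cumulative relative frequency = running total, in the table's row order
cumulative = dict(zip(counts, accumulate(relative.values())))

print(f"{'Department':<12}{'Freq':>6}{'Rel %':>8}{'Cum %':>8}")
for dept, c in counts.items():
    print(f"{dept:<12}{c:>6}{relative[dept]:>8.1%}{cumulative[dept]:>8.1%}")
```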

Cumulative Frequency Curves (Ogive)

Plots cumulative relative frequency against values. Used to find percentiles visually.

Exam Scores Ogive:
Cumulative %
100% |                          ────────
 75% |                     ────
 50% |               ────
 25% |         ────
  0% |    ────
      ──────────────────────────────→
      40   50   60   70   80   90  100

→ Read off: 50th percentile (median) ≈ 68
→ Read off: 75th percentile (Q3) ≈ 78
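
Reading a percentile off an ogive amounts to linear interpolation between plotted points. The sketch below assumes a hypothetical set of class boundaries and cumulative percentages, chosen only so that the results land near the values read off above; they are not data from the original chart.

```python
def percentile_from_ogive(edges, cum_pct, p):
    """Value at cumulative percentage p, interpolating linearly between ogive points."""
    points = list(zip(edges, cum_pct))
    for (x0, c0), (x1, c1) in zip(points, points[1:]):
        if c0 <= p <= c1 and c1 > c0:
            return x0 + (x1 - x0) * (p - c0) / (c1 - c0)
    raise ValueError("p is outside the range of the ogive")


# Hypothetical ogive: upper class boundaries and cumulative % at each boundary.
edges = [40, 50, 60, 70, 80, 90, 100]
cum_pct = [0, 5, 25, 55, 80, 95, 100]

median = percentile_from_ogive(edges, cum_pct, 50)
q3 = percentile_from_ogive(edges, cum_pct, 75)
print(f"median ≈ {median:.1f}, Q3 ≈ {q3:.1f}")
```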

QQ Plot (Quantile-Quantile Plot)

Checks whether data follows a theoretical distribution (usually the normal distribution).

How to Read a QQ Plot

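(Convention assumed here: theoretical quantiles on the x-axis, sample quantiles on the y-axis.)

If data is normal:
Points fall along the straight reference line (this is y = x when the data are standardised)

If data is right-skewed:
Points bend above the line at both ends, most visibly at the upper right (a convex, U-like curve)

If data is left-skewed:
Points bend below the line at both ends, most visibly at the lower left (a concave curve)

If data has heavy tails (leptokurtic):
Points fall below the line at the lower left and above it at the upper right (S-shape)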

Why it matters: many statistical tests (t-tests, ANOVA) assume normality. The QQ plot is your primary normality check.
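
A sketch of how the points of a normal QQ plot can be computed with the Python standard library. The plotting positions (i − 0.5)/n are one common convention among several, the sample is hypothetical, and the function name is illustrative.

```python
from statistics import NormalDist, mean, stdev


def normal_qq_points(data):
    """Return (theoretical, sample) quantile pairs for a normal QQ plot.

    Sample values are standardised, so the reference line is y = x.
    """
    values = sorted(data)
    n = len(values)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    m, s = mean(values), stdev(values)
    standardised = [(v - m) / s for v in values]
    return list(zip(theoretical, standardised))


sample = [1, 2, 2, 3, 3, 4, 5, 7, 12]  # hypothetical right-skewed data
points = normal_qq_points(sample)
for t, s in points:
    side = "above" if s > t else "below"
    print(f"theoretical {t:6.2f}   sample {s:6.2f}   ({side} the line)")
```

For this right-skewed sample the first and last points both sit above the line y = x, while the points in the middle fall below it.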

Skewness and Kurtosis

Skewness

Measures the asymmetry of the distribution.

Skewness > 0: right skew (positive skew) — long tail to the right
Skewness = 0: symmetric
Skewness < 0: left skew (negative skew) — long tail to the left

Rule of thumb:
|Skewness| < 0.5 → approximately symmetric
0.5 ≤ |Skewness| < 1 → moderately skewed
|Skewness| ≥ 1 → highly skewed
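
A minimal sketch of the calculation, assuming the simple moment-based formula (third central moment divided by the 1.5 power of the second). Statistical packages often apply a small-sample correction, so their values can differ slightly. The data and function names are illustrative.

```python
def skewness(data):
    """Moment-based skewness: m3 / m2^1.5 (no small-sample correction)."""
    n = len(data)
    m = sum(data) / n
    m2 = sum((x - m) ** 2 for x in data) / n
    m3 = sum((x - m) ** 3 for x in data) / n
    return m3 / m2 ** 1.5


def describe_skew(g):
    """Apply the rule of thumb given above to a skewness value."""
    size = abs(g)
    if size < 0.5:
        return "approximately symmetric"
    if size < 1:
        return "moderately skewed"
    return "highly skewed"


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]  # one large value stretches the right tail

for name, data in [("symmetric", symmetric), ("right-skewed", right_skewed)]:
    g = skewness(data)
    print(f"{name}: skewness = {g:.2f} -> {describe_skew(g)}")
```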

Kurtosis

Measures how heavy the tails are compared to a normal distribution.

Normal distribution: kurtosis = 3 (excess kurtosis = 0)
Leptokurtic (excess kurtosis > 0): heavy tails → more extreme values
Platykurtic (excess kurtosis < 0): thin tails → fewer extreme values

Finance context:
Stock returns are leptokurtic: extreme returns occur more often than a normal distribution predicts
→ Using a normal distribution for VaR underestimates tail risk (as the 2008 crisis showed)
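
Excess kurtosis can be sketched the same way as skewness, here with the simple moment-based formula m4 / m2² − 3 (software packages may apply a bias correction). Both samples are constructed for illustration and are not real return data.

```python
def excess_kurtosis(data):
    """Moment-based excess kurtosis: m4 / m2^2 - 3 (0 for a normal distribution)."""
    n = len(data)
    m = sum(data) / n
    m2 = sum((x - m) ** 2 for x in data) / n
    m4 = sum((x - m) ** 4 for x in data) / n
    return m4 / m2 ** 2 - 3


flat = [1, 2, 3, 4, 5]                             # evenly spread, thin tails
heavy_tailed = [-10, -1, -1, 0, 0, 0, 0, 1, 1, 10]  # mostly small, two extremes

print(f"flat:         excess kurtosis = {excess_kurtosis(flat):.2f}")
print(f"heavy-tailed: excess kurtosis = {excess_kurtosis(heavy_tailed):.2f}")
```

A negative result marks the platykurtic case and a positive result the leptokurtic case described above.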

Practical Examples

Example 1: Salary Distribution Analysis

Plot a histogram of 200 employee salaries:

Observation from histogram:
- Right-skewed (few very high earners)
- Modal class: ₹70–80k
- Long tail toward ₹200k+
- A secondary bump around ₹95–100k (possible bimodality; might be managers)

Choice of summary statistics:
- Report median (not mean), because the data is skewed
- Report IQR (not SD), because it is robust to outliers

Action: Investigate the ₹95–100k group — are they a different job grade?
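
To see why the median and IQR are the safer choice here, the sketch below compares them with the mean and SD on a small hypothetical right-skewed salary sample (12 made-up values, not the 200 salaries in the example). The quartile method used is an assumption.

```python
from statistics import mean, median, quantiles, stdev

# Hypothetical right-skewed salaries (thousands); two very high earners.
salaries = [62, 68, 70, 72, 74, 75, 78, 80, 85, 98, 150, 210]

avg = mean(salaries)
mid = median(salaries)
q1, _, q3 = quantiles(salaries, n=4, method="inclusive")
iqr = q3 - q1
sd = stdev(salaries)

print(f"mean   = {avg:.1f}   (pulled up by the two high earners)")
print(f"median = {mid:.1f}")
print(f"SD     = {sd:.1f}")
print(f"IQR    = {iqr:.2f}  (Q1 = {q1}, Q3 = {q3})")
```

The mean sits well above the median and the SD is much larger than the IQR, even though most salaries fall in a narrow band.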

Example 2: A/B Test Data Exploration

Before running a t-test comparing two groups:

1. Histogram each group → check for normality, outliers
2. Box plot side-by-side → compare medians and spreads
3. Check n per group (rule of thumb: n ≥ 30 per group, or approximately normal data for smaller samples)
4. QQ plot → check normality assumption

If data is severely non-normal and n is small → use a non-parametric test instead (e.g., Mann-Whitney U)

Example 3: Investment Returns Visualisation

Monthly returns for two funds (last 5 years = 60 months):

Fund A Histogram: Symmetric, bell-shaped → normal-like returns
Fund B Histogram: Left-skewed with heavy left tail → rare but large losses

Side-by-side box plots:
- Fund A: compact box, short whiskers → consistent returns
- Fund B: wider box, long left whisker, many outliers → volatile

Conclusion: even if Fund B has the higher average return, the plots show it also carries more downside risk.
A risk-averse investor would likely prefer Fund A.

Common Mistakes

1. Using a bar chart for quantitative data

Wrong: Bar chart of exam scores (one bar per student)
Right: Histogram (group scores into intervals)
→ Bar chart is for categorical variables; histogram for quantitative

2. Ignoring outliers in scatter plots

An outlier can dramatically change the apparent correlation. Always check if the pattern holds without the outlier — and investigate what the outlier represents.
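
A small sketch of this check: compute the correlation with and without the suspect point. The data are constructed for illustration, and pearson_r is a helper written for this example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Six points with only a weak pattern, plus one extreme point at (20, 20).
x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 3, 2, 3, 2]
outlier = (20, 20)

r_without = pearson_r(x, y)
r_with = pearson_r(x + [outlier[0]], y + [outlier[1]])

print(f"r without the outlier: {r_without:.2f}")
print(f"r with the outlier:    {r_with:.2f}")
```

A single point turns a weak correlation into what looks like a very strong one, so a large r should always be checked against the plot.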

3. Using a pie chart with too many slices

More than 5–6 slices → pie chart becomes unreadable. Use a bar chart instead.

4. Starting the y-axis at a non-zero value

Sales chart with bars of 95 and 100, y-axis starting at 90 instead of 0:
→ The bars are drawn with heights 5 and 10, so a ~5% increase looks like a 100% increase
→ Always start a bar chart's y-axis at 0
(Line charts can start at a non-zero value to focus on the trend, but note it clearly)
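
The size of the distortion is simple arithmetic. The sketch below uses two hypothetical sales figures (95 and 100) and compares the real change with the change in drawn bar height for two axis baselines; the function names are illustrative.

```python
def actual_change(old, new):
    """Real relative change between two values."""
    return (new - old) / old


def apparent_change(old, new, baseline):
    """Relative change in drawn bar height when the y-axis starts at `baseline`."""
    return ((new - baseline) - (old - baseline)) / (old - baseline)


old, new = 95, 100  # hypothetical sales figures

print(f"actual change:            {actual_change(old, new):.1%}")
print(f"apparent, axis from 0:    {apparent_change(old, new, 0):.1%}")
print(f"apparent, axis from 90:   {apparent_change(old, new, 90):.1%}")
```

With the axis starting at 0 the bar heights change by the same percentage as the data; with the axis starting at 90 the second bar is drawn twice as tall as the first.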

5. Inferring causation from a scatter plot

Scatter plot shows: as coffee consumption increases, productivity increases
→ Does NOT mean coffee causes productivity
→ Could be that busy workers drink more coffee AND work more
→ Visualisation shows association, not causation

Practice Exercises

1. Sketch (or describe) the expected histogram shape for:
   a) Annual income of Indian households
   b) Body temperature of healthy adults
   c) Rolling a fair six-sided die 1,000 times
   d) Marks in a very difficult exam where most students fail
2. Given Q1=45, Median=60, Q3=75, Min=20, Max=130: draw a box plot and identify any outliers using Tukey's fences.
3. A scatter plot of "number of hours worked" vs "employee satisfaction score" shows a negative relationship. Describe what this means, and propose two possible explanations beyond simple causation.
4. A QQ plot of residuals from a regression model shows points that curve away from the diagonal at both ends (heavy tails). What does this suggest about the residuals?
5. A dataset has mean=80, median=65. What can you infer about its shape? Sketch the expected distribution.

Summary

In this chapter you learned:

Histogram — distribution of one quantitative variable; reveals shape (symmetric, skewed, bimodal, uniform)
Shapes: symmetric (mean≈median), right-skewed (mean>median), left-skewed (mean<median), bimodal
Box plot — Min, Q1, Median, Q3, Max; shows outliers; great for comparing groups
Scatter plot — relationship between two quantitative variables; positive, negative, or no association
QQ plot — checks normality; points on diagonal = normal; deviations = departures from normality
Skewness: positive = right tail; negative = left tail; |skew| ≥ 1 = highly skewed
Kurtosis: heavy tails (leptokurtic) vs thin tails (platykurtic) vs normal
Always choose chart type based on variable type and analytical goal
Bar chart ≠ histogram (bar = categorical; histogram = quantitative)
Visualise before computing — summary statistics can hide the true shape

Next up: Probability Fundamentals — the mathematical foundation for everything that follows: events, rules, and how uncertainty is quantified.