Chapter 6 of 18

Data Visualisation for Statistics

Choose the right chart for statistical data — histograms, box plots, scatter plots, QQ plots, and interpreting distribution shape.

Meritshot · 10 min read

Why Visualise Data?

Before running any statistical test, plot your data. Visualisation reveals:

  • Shape of the distribution (symmetric, skewed, bimodal)
  • Outliers that could distort results
  • Relationships between variables
  • Patterns that summary statistics hide

"A picture is worth a thousand p-values." — statisticians everywhere

Anscombe's Quartet (1973): Four datasets with nearly identical mean, variance, and correlation — but completely different shapes. The lesson: always look at a plot first.
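
As a rough numerical check of this point, the sketch below compares the first two of Anscombe's four datasets using only Python's standard library. The helper name `pearson_r` is our own, not a library function.

```python
from math import sqrt
from statistics import mean, variance

# Anscombe's quartet, datasets I and II (they share the same x values).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def pearson_r(xs, ys):
    """Pearson correlation coefficient (illustrative helper)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / sqrt(sxx * syy)

# Print mean, sample variance and correlation with x for each dataset.
for name, ys in [("I", y1), ("II", y2)]:
    print(name, round(mean(ys), 2), round(variance(ys), 2), round(pearson_r(x, ys), 2))
```

Both rows print essentially the same three numbers, yet a scatter plot shows dataset I as a noisy straight line and dataset II as a smooth curve.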

Choosing the Right Chart

Data type                    Goal                        Chart
One quantitative variable    Distribution shape          Histogram
One quantitative variable    Centre & spread, outliers   Box plot
One categorical variable     Category counts             Bar chart
Two quantitative variables   Relationship                Scatter plot
Quantitative over time       Trend                       Line chart
Part of a whole              Composition                 Pie / Stacked bar
One quantitative variable    Check normality             QQ plot
Two categorical variables    Frequency                   Mosaic / Grouped bar

Histograms

A histogram shows the distribution of a single quantitative variable by dividing it into intervals (bins) and counting observations in each bin.

How to Read a Histogram

Salary distribution (₹ thousands):
Bin      Frequency
60–70       4
70–80      12
80–90      18
90–100      9
100–110     5
110–120     2
          ──────
Total:     50

The "peak" (tallest bar) = modal class = ₹80–90k
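
The same reading can be reproduced in a few lines of Python. This is a minimal sketch that reuses the frequencies from the table above; the variable names are illustrative.

```python
# Frequencies copied from the salary table above (bin label -> count).
salary_bins = {
    "60–70": 4,
    "70–80": 12,
    "80–90": 18,
    "90–100": 9,
    "100–110": 5,
    "110–120": 2,
}

total = sum(salary_bins.values())
# The modal class is the bin with the largest count.
modal_class = max(salary_bins, key=salary_bins.get)

# Draw one block per observation so the shape is visible in a terminal.
for label, count in salary_bins.items():
    print(f"{label:>8} | {'█' * count} {count}")

print("Total:", total)
print("Modal class:", modal_class)
```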

Distribution Shapes

SYMMETRIC / BELL-SHAPED:
     █
   █████
 █████████
█████████████
→ Mean ≈ Median ≈ Mode

RIGHT-SKEWED (positive skew):
█
███
█████
█████████████  →  long tail to the right
→ Typically Mean > Median > Mode
→ Examples: income, house prices, wait times

LEFT-SKEWED (negative skew):
             █
           ███
         █████
←  █████████████
→ Typically Mean < Median < Mode
→ Examples: age at death (most die old), exam scores on easy test

BIMODAL:
     █         █
   █████     █████
█████████ █████████
→ Two peaks → suggests two subgroups
→ Example: heights of mixed male/female group

UNIFORM:
████████████████
→ All values equally likely
→ Example: fair die rolls
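
To see how a long right tail pulls the mean away from the median and mode, here is a small sketch with a made-up sample. The numbers are hypothetical and chosen only to illustrate the ordering.

```python
from statistics import mean, median, mode

# Hypothetical right-skewed sample (e.g. incomes in ₹ thousands).
incomes = [20, 22, 22, 25, 27, 30, 35, 60, 150]

# The single large value drags the mean up; the median and mode barely move.
print("mean:", round(mean(incomes), 1))
print("median:", median(incomes))
print("mode:", mode(incomes))
```

The ordering mean > median > mode is a common pattern for right-skewed data rather than a guarantee, so it is worth checking against the plot.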

Choosing Bin Width

Too few bins: hides the shape. Too many bins: too noisy to see the pattern.

Rules of thumb:

  • Square root rule: k = √n bins (n=100 → 10 bins)
  • Sturges' rule: k = 1 + 3.322 × log₁₀(n)
  • Freedman-Diaconis: bin width = 2 × IQR / n^(1/3)

Software (Excel, Python, R) auto-calculates bin width — adjust if needed.
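
The three rules of thumb can be sketched as small helper functions. The function names are our own, and rounding the bin count up with `ceil` is an assumption, since the rules themselves do not say how to round.

```python
from math import ceil, log10, sqrt

def sqrt_bins(n):
    """Square root rule: number of bins for n observations."""
    return ceil(sqrt(n))

def sturges_bins(n):
    """Sturges' rule: number of bins for n observations."""
    return ceil(1 + 3.322 * log10(n))

def fd_bin_width(iqr, n):
    """Freedman-Diaconis rule: returns a bin width, not a bin count."""
    return 2 * iqr / n ** (1 / 3)

print(sqrt_bins(100))          # 10
print(sturges_bins(100))       # 8 (1 + 3.322 * 2 = 7.644, rounded up)
print(fd_bin_width(15, 1000))  # about 3.0, for a hypothetical IQR of 15
```

Note that the first two rules depend only on the sample size, while Freedman-Diaconis also uses the spread of the data, which makes it less sensitive to outliers.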

Histogram vs Bar Chart

Histogram:
- Quantitative data
- Bars touch each other (continuous scale)
- Bar width = bin width (can vary)
- Shows distribution

Bar chart:
- Categorical data
- Bars have gaps between them
- All bars same width
- Shows counts per category

Box Plots (Box-and-Whisker Plots)

Summarises a distribution with five numbers: Min, Q1, Median, Q3, Max.

Reading a Box Plot

         ┌─────┬──────┐
─────────┤     │      ├───────── ∘ (outlier)
         └─────┴──────┘
↑        ↑     ↑      ↑        ↑
Min      Q1  Median   Q3      Max
(or lower                  (or upper
 fence)                     fence)

Box: from Q1 to Q3 (the IQR — contains the middle 50%)
Line in box: median
Whiskers: extend to min/max OR, under Tukey's rule, to the most extreme data points that lie within the fences Q1−1.5×IQR and Q3+1.5×IQR
Points beyond whiskers: outliers (plotted individually)
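
A minimal sketch of these calculations, using Python's `statistics.quantiles` on a hypothetical sample. Quartile conventions differ between tools; the `"inclusive"` method is one of them, so other software may report slightly different Q1 and Q3.

```python
from statistics import quantiles

# Hypothetical sample with one unusually large value.
data = sorted([32, 45, 50, 55, 60, 65, 70, 75, 130])

q1, med, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are drawn individually as outliers.
outliers = [v for v in data if v < lower_fence or v > upper_fence]
# Whiskers stop at the most extreme points still inside the fences.
inside = [v for v in data if lower_fence <= v <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print("Five numbers:", min(data), q1, med, q3, max(data))
print("Fences:", lower_fence, upper_fence)
print("Whiskers end at:", whisker_low, whisker_high)
print("Outliers:", outliers)
```

Here Q1 = 50 and Q3 = 70, so the fences are 20 and 100; only 130 falls outside, and the upper whisker stops at 75 rather than at the fence itself.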

Detecting Skew from a Box Plot

SYMMETRIC:
  ┌────┼────┐
──┤    │    ├──
  └────┴────┘
Median in the centre; whiskers roughly equal length

RIGHT-SKEWED:
  ┌──┼──────┐
──┤  │      ├──────────────
  └──┴──────┘
Median closer to Q1; long right whisker (or right outliers)

LEFT-SKEWED:
       ┌──────┼──┐
───────┤      │  ├──
       └──────┴──┘
Median closer to Q3; long left whisker

Side-by-Side Box Plots

Most powerful use: compare distributions across groups.

Department Salaries (₹k):

Finance    ──[════╪════════]──────────────
Technology ────────[═════╪═════════════]──
Marketing  ──[═════╪═══]──
HR         ─[══╪══]──

→ Technology has highest median and widest spread
→ Finance has a long upper whisker: a few salaries sit far above the rest
→ HR has the narrowest range

Scatter Plots

Shows the relationship between two quantitative variables. Each point represents one observation.

Reading a Scatter Plot

Exam study hours (x) vs Score (y):

100 |                         ∘ ∘
    |   ∘ ← outlier (1, 95)
 90 |                    ∘  ∘  ∘
 80 |               ∘  ∘
 70 |          ∘  ∘
 60 |     ∘  ∘
 50 |  ∘
     ─────────────────────────────→
     0    2    4    6    8   10   hours

→ Positive association: more study → higher score
→ Roughly linear
→ One outlier at (1 hour, 95 score), possibly prior knowledge

Types of Association

Positive linear: y increases as x increases
Negative linear: y decreases as x increases
No relationship: random scatter (no pattern)
Non-linear: curved relationship (e.g., U-shaped)

Scatter Plot Matrix

When you have multiple variables, a scatter plot matrix (SPLOM) shows all pairwise relationships simultaneously.

Frequency Tables and Relative Frequency

For categorical data:

Department    Frequency   Relative Freq   Cumulative Freq
Finance            32        32.0%              32.0%
Technology         28        28.0%              60.0%
Marketing          25        25.0%              85.0%
HR                 15        15.0%             100.0%
Total             100       100.0%

Relative frequency = count / total — allows comparison across samples of different sizes.

Cumulative frequency — useful for answering "what % of values are below X?"
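
The table above can be rebuilt from the raw counts. This is a short sketch using only the standard library; the variable names are illustrative.

```python
from itertools import accumulate

# Counts from the department table above.
counts = {"Finance": 32, "Technology": 28, "Marketing": 25, "HR": 15}

total = sum(counts.values())
# Relative frequency = count / total.
relative = {dept: n / total for dept, n in counts.items()}
# Cumulative frequency = running sum of the relative frequencies.
cumulative = dict(zip(counts, accumulate(relative.values())))

for dept in counts:
    print(f"{dept:<12}{counts[dept]:>5}{relative[dept]:>8.1%}{cumulative[dept]:>8.1%}")
```

Cumulative frequency is most meaningful when the categories have a natural order; for unordered categories such as departments it depends on the row order chosen.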

Cumulative Frequency Curves (Ogive)

Plots cumulative relative frequency against values. Used to find percentiles visually.

Exam Scores Ogive:
Cumulative %
100% |                          ────────
 75% |                     ────
 50% |               ────
 25% |         ────
  0% |    ────
      ──────────────────────────────→
      40   50   60   70   80   90  100

→ Read off: 50th percentile (median) ≈ 68
→ Read off: 75th percentile (Q3) ≈ 78

QQ Plot (Quantile-Quantile Plot)

Checks whether data follows a theoretical distribution (usually the normal distribution).

How to Read a QQ Plot

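The readings below assume sample quantiles on the y-axis and theoretical normal quantiles on the x-axis.

If data is normal:
Points fall along the straight reference line (y = x when the data are standardised)

If data is right-skewed:
Points bend upward (convex curve): above the line at both ends, most visibly at the upper right

If data is left-skewed:
Points bend downward (concave curve): below the line at both ends, most visibly at the lower left

If data has heavy tails (leptokurtic):
Points curve away from the line at both ends (S-shape): below it at the lower left, above it at the upper right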

Why it matters: many statistical tests (t-tests, ANOVA) assume normality. The QQ plot is your primary normality check.
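
To make the construction concrete, here is a minimal sketch that computes the points of a normal QQ plot without any plotting library. It assumes theoretical quantiles go on the x-axis and sample values on the y-axis, and it uses the plotting positions (i + 0.5) / n, which is one common convention among several. The function name `qq_points` and the sample are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def qq_points(sample):
    """Pair each sorted value with the normal quantile at the same rank."""
    xs = sorted(sample)
    n = len(xs)
    # Fit a normal with the sample's own mean and SD, so the reference line is y = x.
    fitted = NormalDist(mean(xs), stdev(xs))
    # Plotting positions (i + 0.5) / n stay strictly inside (0, 1).
    theoretical = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, xs))

# Hypothetical right-skewed sample.
sample = [1, 2, 2, 3, 3, 3, 4, 5, 8, 15]

# Columns: theoretical quantile, observed value, observed minus theoretical.
for theo, obs in qq_points(sample):
    print(f"{theo:7.2f}  {obs:5.1f}  {obs - theo:+6.2f}")
```

For this sample the last column is positive at both ends and negative in the middle, which is the upward-bending pattern expected for right-skewed data.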

Skewness and Kurtosis

Skewness

Measures the asymmetry of the distribution.

Skewness > 0: right skew (positive skew) — long tail to the right
Skewness = 0: symmetric
Skewness < 0: left skew (negative skew) — long tail to the left

Rule of thumb:
|Skewness| < 0.5 → approximately symmetric
0.5 ≤ |Skewness| < 1 → moderately skewed
|Skewness| ≥ 1 → highly skewed

Kurtosis

Measures how heavy the tails are compared to a normal distribution.

Normal distribution: kurtosis = 3 (excess kurtosis = 0)
Leptokurtic (excess kurtosis > 0): heavy tails → more extreme values
Platykurtic (excess kurtosis < 0): thin tails → fewer extreme values

Finance context:
Stock returns are leptokurtic — more frequent extreme returns than normal distribution predicts
→ Using a normal distribution for VaR underestimates tail risk, a weakness widely discussed after the 2008 financial crisis
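
Both measures can be computed from central moments. The sketch below uses the simple population (divide-by-n) formulas, skewness = m3 / m2^1.5 and excess kurtosis = m4 / m2^2 − 3; statistical libraries may apply small-sample corrections, so their values can differ slightly. The function names and the skewed sample are illustrative.

```python
from statistics import fmean

def central_moment(data, k):
    """k-th central moment, dividing by n (population form)."""
    m = fmean(data)
    return fmean([(x - m) ** k for x in data])

def skewness(data):
    """Moment-based skewness: m3 / m2^1.5."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    """Moment-based kurtosis minus 3, so a normal distribution scores 0."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 4, 5, 8, 15]  # hypothetical sample

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed), excess_kurtosis(right_skewed))
```

The symmetric sample has skewness 0 and negative excess kurtosis (thin tails), while the second sample has skewness above 1, which the rule of thumb above classes as highly skewed.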

Practical Examples

Example 1: Salary Distribution Analysis

Plot a histogram of 200 employee salaries:

Observation from histogram:
- Right-skewed (few very high earners)
- Modal class: ₹70–80k
- Long tail toward ₹200k+

Choice of summary statistics:
- Report median (not mean) — skewed data
- Report IQR (not SD) — robust to outliers
- Note the secondary bump around ₹95–100k (a possible second mode; might be managers)

Action: Investigate the ₹95–100k group — are they a different job grade?

Example 2: A/B Test Data Exploration

Before running a t-test comparing two groups:

1. Histogram each group → check for normality, outliers
2. Box plot side-by-side → compare medians and spreads
3. Check n per group (rule of thumb: the t-test needs n ≥ 30 per group, or approximately normal data)
4. QQ plot → check normality assumption

If data is severely non-normal and n is small → use a non-parametric test instead (e.g., Mann-Whitney U)

Example 3: Investment Returns Visualisation

Monthly returns for two funds (last 5 years = 60 months):

Fund A Histogram: Symmetric, bell-shaped → normal-like returns
Fund B Histogram: Left-skewed with heavy left tail → rare but large losses

Side-by-side box plots:
- Fund A: compact box, short whiskers → consistent returns
- Fund B: wider box, long left whisker, many outliers → volatile

Conclusion: Fund B carries more downside risk. Even if its average return is higher,
a risk-averse investor would likely prefer Fund A.

Common Mistakes

1. Using a bar chart for quantitative data

Wrong: Bar chart of exam scores (one bar per student)
Right: Histogram (group scores into intervals)
→ Bar chart is for categorical variables; histogram for quantitative

2. Ignoring outliers in scatter plots

An outlier can dramatically change the apparent correlation. Always check if the pattern holds without the outlier — and investigate what the outlier represents.
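
A quick way to run this check is to compute the correlation with and without the suspect point. The sketch below uses hypothetical study-hours data and a hand-written `pearson_r` helper, so it runs on the standard library alone.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient (illustrative helper)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data following a clear upward trend.
hours = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
scores = [50, 60, 62, 70, 72, 80, 82, 90, 92, 100]

r_without = pearson_r(hours, scores)
# Add one student who studied 1 hour but scored 95.
r_with = pearson_r(hours + [1], scores + [95])

print("without outlier:", round(r_without, 2))
print("with outlier:   ", round(r_with, 2))
```

A single off-trend point lowers the correlation noticeably here, even though the other ten observations are unchanged.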

3. Using a pie chart with too many slices

More than 5–6 slices → pie chart becomes unreadable. Use a bar chart instead.

4. Starting the y-axis at a non-zero value

Sales chart y-axis starting at 95 instead of 0:
→ A 5% change looks like a 100% change visually
→ Always start bar chart y-axis at 0
(Line charts can start at non-zero for trend focus, but note it clearly)
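
The arithmetic behind this distortion is simple: the eye compares drawn bar heights, which are measured from wherever the axis starts. A minimal sketch, assuming hypothetical sales of 100 and 105 and an illustrative helper named `apparent_change`:

```python
def apparent_change(old, new, axis_start=0.0):
    """Relative change in drawn bar height when the axis starts at axis_start."""
    return (new - axis_start) / (old - axis_start) - 1

# Axis from 0: bars of height 100 and 105, a visual change of about 5%.
print(apparent_change(100, 105))
# Axis from 95: bars of height 5 and 10, so the second bar looks twice as tall.
print(apparent_change(100, 105, axis_start=95))
```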

5. Inferring causation from a scatter plot

Scatter plot shows: as coffee consumption increases, productivity increases
→ Does NOT mean coffee causes productivity
→ Could be that busy workers drink more coffee AND work more
→ Visualisation shows association, not causation

Practice Exercises

  1. Sketch (or describe) the expected histogram shape for:
     a) Annual income of Indian households
     b) Body temperature of healthy adults
     c) Rolling a fair six-sided die 1,000 times
     d) Marks in a very difficult exam where most students fail

  2. Given Q1=45, Median=60, Q3=75, Min=20, Max=130: draw a box plot and identify any outliers using Tukey's fences.

  3. A scatter plot of "number of hours worked" vs "employee satisfaction score" shows a negative relationship. Describe what this means, and propose two possible explanations beyond simple causation.

  4. A QQ plot of residuals from a regression model shows points that curve away from the diagonal at both ends (heavy tails). What does this suggest about the residuals?

  5. A dataset has mean=80, median=65. What can you infer about its shape? Sketch the expected distribution.

Summary

In this chapter you learned:

  • Histogram — distribution of one quantitative variable; reveals shape (symmetric, skewed, bimodal, uniform)
  • Shapes: symmetric (mean≈median), right-skewed (mean>median), left-skewed (mean<median), bimodal
  • Box plot — Min, Q1, Median, Q3, Max; shows outliers; great for comparing groups
  • Scatter plot — relationship between two quantitative variables; positive, negative, or no association
  • QQ plot — checks normality; points on diagonal = normal; deviations = departures from normality
  • Skewness: positive = right tail; negative = left tail; |skew| ≥ 1 = highly skewed
  • Kurtosis: heavy tails (leptokurtic) vs thin tails (platykurtic) vs normal
  • Always choose chart type based on variable type and analytical goal
  • Bar chart ≠ histogram (bar = categorical; histogram = quantitative)
  • Visualise before computing — summary statistics can hide the true shape

Next up: Probability Fundamentals — the mathematical foundation for everything that follows: events, rules, and how uncertainty is quantified.