Why Visualise Data?
Before running any statistical test, plot your data. Visualisation reveals:
- Shape of the distribution (symmetric, skewed, bimodal)
- Outliers that could distort results
- Relationships between variables
- Patterns that summary statistics hide
"A picture is worth a thousand p-values." — statisticians everywhere
Anscombe's Quartet (1973): Four datasets with nearly identical mean, variance, and correlation — but completely different shapes. The lesson: always look at a plot first.
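The quartet's claim can be checked numerically. The sketch below uses the values as they are commonly reproduced from Anscombe's paper; the helper `pearson_r` is written here for illustration and is not part of any library.

```python
from math import sqrt
from statistics import mean, variance

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Anscombe's four datasets (I-III share the same x values).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Mean of y, sample variance of y, and correlation for each dataset.
summary = {}
for name, (xs, ys) in quartet.items():
    summary[name] = (mean(ys), variance(ys), pearson_r(xs, ys))
    print(name, [round(value, 2) for value in summary[name]])
```

The printed summaries agree to about two decimal places, yet a scatter plot of each dataset looks entirely different.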
Choosing the Right Chart
| Data type | Goal | Chart |
|---|---|---|
| One quantitative variable | Distribution shape | Histogram |
| One quantitative variable | Centre & spread, outliers | Box plot |
| One categorical variable | Category counts | Bar chart |
| Two quantitative variables | Relationship | Scatter plot |
| Quantitative over time | Trend | Line chart |
| Part of a whole | Composition | Pie / Stacked bar |
| One quantitative variable | Check normality | QQ plot |
| Two categorical variables | Frequency | Mosaic / Grouped bar |
Histograms
A histogram shows the distribution of a single quantitative variable by dividing it into intervals (bins) and counting observations in each bin.
How to Read a Histogram
Salary distribution (₹ thousands):
Bin Frequency
60–70 4
70–80 12
80–90 18
90–100 9
100–110 5
110–120 2
──────
Total: 50
The "peak" (tallest bar) = modal class = ₹80–90k
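Binning can be sketched in a few lines. The function `bin_counts` and the 16 salaries below are hypothetical (they are not the 50 values behind the table above); each bin is treated as a half-open interval, so a value of exactly 70 would fall in 70–80.

```python
def bin_counts(values, low, width, n_bins):
    """Count values in n_bins half-open intervals [lo, lo + width)."""
    counts = [0] * n_bins
    for v in values:
        idx = int((v - low) // width)
        if 0 <= idx < n_bins:  # values outside the covered range are ignored
            counts[idx] += 1
    return counts

# Hypothetical salaries in ₹ thousands.
salaries = [62, 68, 71, 74, 75, 79, 81, 83, 84, 86, 88, 89, 92, 97, 104, 115]
counts = bin_counts(salaries, low=60, width=10, n_bins=6)

# Text histogram: one row per bin, one block per observation.
for i, c in enumerate(counts):
    lo = 60 + 10 * i
    print(f"{lo:>3}-{lo + 10:<3} {'█' * c} {c}")

# The modal class is the bin with the highest count.
modal_index = counts.index(max(counts))
modal_class = (60 + 10 * modal_index, 60 + 10 * (modal_index + 1))
print("Modal class:", modal_class)
```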
Distribution Shapes
SYMMETRIC / BELL-SHAPED:
      █
    █████
  █████████
█████████████
→ Mean ≈ Median ≈ Mode
RIGHT-SKEWED (positive skew):
█
███
█████
█████████████ → long tail to the right
→ Mean > Median > Mode
→ Examples: income, house prices, wait times
LEFT-SKEWED (negative skew):
              █
            ███
          █████
← █████████████
→ Mean < Median < Mode
→ Examples: age at death (most die old), exam scores on easy test
BIMODAL:
    █             █
  █████         █████
█████████     █████████
→ Two peaks → suggests two subgroups
→ Example: heights of mixed male/female group
UNIFORM:
████████████████
→ All values equally likely
→ Example: fair die rolls
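The ordering of mean, median, and mode for skewed data can be checked on small samples. Both lists below are made up for illustration; the orderings are typical for skewed data but are not guaranteed for every dataset.

```python
from statistics import mean, median, mode

# Hypothetical right-skewed sample (a few large values pull the mean up).
right_skewed = [20, 22, 25, 25, 28, 30, 35, 60, 120]
# Hypothetical left-skewed sample (a few low values pull the mean down).
left_skewed = [30, 65, 80, 85, 88, 90, 92, 92, 95]

for label, data in [("right-skewed", right_skewed), ("left-skewed", left_skewed)]:
    print(f"{label}: mean={mean(data):.1f}, median={median(data)}, mode={mode(data)}")
```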
Choosing Bin Width
Too few bins: hides the shape. Too many bins: too noisy to see the pattern.
Rules of thumb:
- Square root rule: k = √n bins (n=100 → 10 bins)
- Sturges' rule: k = 1 + 3.322 × log₁₀(n)
- Freedman-Diaconis: bin width = 2 × IQR / n^(1/3)
Software (Excel, Python, R) auto-calculates bin width — adjust if needed.
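The three rules are easy to compute by hand or in code. In this sketch the function names are illustrative, bin counts are rounded up, and the quartiles use the standard library's "inclusive" method, which is one of several quartile conventions. Sturges' rule is written with log₂, which is equivalent to 1 + 3.322 × log₁₀(n).

```python
import math
from statistics import quantiles

def sqrt_bins(n):
    """Square root rule: number of bins, rounded up."""
    return math.ceil(math.sqrt(n))

def sturges_bins(n):
    """Sturges' rule: 1 + log2(n), rounded up."""
    return math.ceil(1 + math.log2(n))

def fd_bin_width(values):
    """Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return 2 * (q3 - q1) / len(values) ** (1 / 3)

print(sqrt_bins(100))      # square root rule for n = 100
print(sturges_bins(100))   # Sturges' rule for n = 100
print(fd_bin_width(list(range(1, 28))))  # hypothetical data: 1, 2, ..., 27
```

Note that the first two rules return a number of bins, while Freedman-Diaconis returns a bin width; divide the data range by the width to get a bin count.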
Histogram vs Bar Chart
Histogram:
- Quantitative data
- Bars touch each other (continuous scale)
- Bar width = bin width (can vary)
- Shows distribution
Bar chart:
- Categorical data
- Bars have gaps between them
- All bars same width
- Shows counts per category
Box Plots (Box-and-Whisker Plots)
Summarises a distribution with five numbers: Min, Q1, Median, Q3, Max.
Reading a Box Plot
         ┌─────┬──────┐
─────────┤     │      ├─────────   ∘ (outlier)
         └─────┴──────┘
↑        ↑     ↑      ↑        ↑
Min      Q1  Median   Q3      Max
(or lower fence)      (or upper fence)
Box: from Q1 to Q3 (the IQR — contains the middle 50%)
Line in box: median
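Whiskers: extend to the min and max OR, when outliers are plotted separately, to the most extreme data points that still lie within the fences Q1−1.5×IQR and Q3+1.5×IQR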
Points beyond whiskers: outliers (plotted individually)
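The numbers behind a box plot can be computed directly. This is a minimal sketch: `box_plot_stats` is a hypothetical helper, the scores are made up, and the quartiles use the "inclusive" method (software packages differ slightly in how they compute quartiles).

```python
from statistics import quantiles

def box_plot_stats(values):
    """Quartiles, Tukey fences, whisker ends, and outliers for one variable."""
    data = sorted(values)
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers stop at the most extreme points that are still inside the fences.
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "q1": q1, "median": med, "q3": q3,
        "fences": (lower_fence, upper_fence),
        "whiskers": (inside[0], inside[-1]),
        "outliers": outliers,
    }

scores = [20, 45, 50, 55, 60, 65, 70, 75, 130]  # hypothetical data
stats = box_plot_stats(scores)
for key, value in stats.items():
    print(key, value)
```

Here the upper whisker stops at 75 rather than at the upper fence, because 75 is the largest observation inside the fence; 130 is drawn as an individual point.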
Detecting Skew from a Box Plot
SYMMETRIC:
  ┌────┼────┐
──┤    │    ├──
  └────┴────┘
Median in the centre; whiskers roughly equal length
RIGHT-SKEWED:
  ┌──┼──────┐
──┤  │      ├──────────────
  └──┴──────┘
Median closer to Q1; long right whisker (or right outliers)
LEFT-SKEWED:
       ┌──────┼──┐
───────┤      │  ├──
       └──────┴──┘
Median closer to Q3; long left whisker
Side-by-Side Box Plots
Most powerful use: compare distributions across groups.
Department Salaries (₹k):
Finance ──[════╪════════]──────────────
Technology ────────[═════╪═════════════]──
Marketing ──[═════╪═══]──
HR ─[══╪══]──
→ Technology has highest median and widest spread
→ Finance has a long upper whisker (a few very high salaries)
→ HR has the narrowest range
Scatter Plots
Shows the relationship between two quantitative variables. Each point represents one observation.
Reading a Scatter Plot
Exam study hours (x) vs Score (y):
100 | ∘ ∘
90 | ∘ ∘ ∘
80 | ∘ ∘
70 | ∘ ∘
60 | ∘ ∘
50 | ∘
─────────────────────────────→
0 2 4 6 8 10 hours
→ Positive association: more study → higher score
→ Roughly linear
→ One outlier at (1 hour, 95 score) — possibly prior knowledge
Types of Association
Positive linear: y increases as x increases
Negative linear: y decreases as x increases
No relationship: random scatter (no pattern)
Non-linear: curved relationship (e.g., U-shaped)
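The correlation coefficient r puts a number on the first three cases but can miss the fourth. In this sketch the data are constructed for illustration and `pearson_r` is a hand-written helper; a perfectly U-shaped relationship gives r = 0 even though y depends entirely on x.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

x = list(range(-3, 4))               # -3, -2, ..., 3
positive = [2 * v + 1 for v in x]    # y rises with x
negative = [-v for v in x]           # y falls as x rises
u_shaped = [v * v for v in x]        # curved: falls, then rises

r_positive = pearson_r(x, positive)
r_negative = pearson_r(x, negative)
r_u_shaped = pearson_r(x, u_shaped)
print(r_positive, r_negative, r_u_shaped)
```

This is one reason to look at the scatter plot before trusting r: "no linear correlation" is not the same as "no relationship".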
Scatter Plot Matrix
When you have multiple variables, a scatter plot matrix (SPLOM) shows all pairwise relationships simultaneously.
Frequency Tables and Relative Frequency
For categorical data:
| Department | Frequency | Relative frequency | Cumulative relative frequency |
|---|---|---|---|
| Finance | 32 | 32.0% | 32.0% |
| Technology | 28 | 28.0% | 60.0% |
| Marketing | 25 | 25.0% | 85.0% |
| HR | 15 | 15.0% | 100.0% |
| Total | 100 | 100.0% | |
Relative frequency = count / total — allows comparison across samples of different sizes.
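Cumulative frequency is the running total of the relative frequencies. It is most informative when the categories or bins have a natural order, where it answers "what % of values are below X?"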
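The relative and cumulative columns follow mechanically from the counts. A short sketch using the department counts from the table (variable names are illustrative):

```python
from itertools import accumulate

counts = {"Finance": 32, "Technology": 28, "Marketing": 25, "HR": 15}
total = sum(counts.values())

# Relative frequency in percent: count / total.
relative = {dept: 100 * c / total for dept, c in counts.items()}

# Cumulative relative frequency: running total of counts, as a percentage.
running_totals = accumulate(counts.values())
cumulative = {dept: 100 * c / total for dept, c in zip(counts, running_totals)}

for dept in counts:
    print(f"{dept:<12}{counts[dept]:>4}{relative[dept]:>8.1f}%{cumulative[dept]:>8.1f}%")
```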
Cumulative Frequency Curves (Ogive)
Plots cumulative relative frequency against values. Used to find percentiles visually.
Exam Scores Ogive:
Cumulative %
100% | ────────
75% | ────
50% | ────
25% | ────
0% | ────
──────────────────────────────→
40 50 60 70 80 90 100
→ Read off: 50th percentile (median) ≈ 68
→ Read off: 75th percentile (Q3) ≈ 78
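Reading a percentile off an ogive amounts to linear interpolation within a bin. The sketch below applies that idea to the salary frequency table from the histogram section (not the exam scores above); `grouped_percentile` is a hypothetical helper that assumes values are spread evenly inside each bin.

```python
def grouped_percentile(edges, freqs, p):
    """Interpolate the p-th percentile (0-100) from binned counts."""
    total = sum(freqs)
    target = p / 100 * total      # how many observations lie below the percentile
    cumulative = 0
    for lo, hi, f in zip(edges, edges[1:], freqs):
        if f > 0 and cumulative + f >= target:
            # Move through the bin in proportion to the count still needed.
            return lo + (target - cumulative) / f * (hi - lo)
        cumulative += f
    return edges[-1]

edges = [60, 70, 80, 90, 100, 110, 120]   # bin boundaries, ₹ thousands
freqs = [4, 12, 18, 9, 5, 2]              # frequencies from the salary table

median = grouped_percentile(edges, freqs, 50)
q3 = grouped_percentile(edges, freqs, 75)
print(f"median ≈ {median:.1f}, Q3 ≈ {q3:.1f}")
```

These are estimates: the exact percentiles depend on the individual values, which a frequency table no longer contains.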
QQ Plot (Quantile-Quantile Plot)
Checks whether data follows a theoretical distribution (usually the normal distribution).
How to Read a QQ Plot
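(Assuming theoretical quantiles on the x-axis and sample quantiles on the y-axis.)
If data is normal:
Points fall along the straight reference line (y = x when both axes are standardised)
If data is right-skewed:
Points bend above the line at both ends (a curve that opens upward), most visibly at the upper right
If data is left-skewed:
Points bend below the line at both ends (a curve that opens downward), most visibly at the lower left
If data has heavy tails (leptokurtic):
Points fall below the line at the left end and above it at the right end (S-shape)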
Why it matters: many statistical tests (t-tests, ANOVA) assume normality. The QQ plot is your primary normality check.
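The points of a normal QQ plot can be computed without any plotting library. In this sketch `qq_points` is a hypothetical helper, the sample is made up, and the plotting positions (i − 0.5)/n are one common convention among several. Each point pairs a theoretical standard-normal quantile (x) with a standardised sample value (y), so the reference line is y = x.

```python
from statistics import NormalDist, mean, stdev

def qq_points(sample):
    """Return (theoretical quantile, standardised observed value) pairs."""
    data = sorted(sample)
    n = len(data)
    m, s = mean(data), stdev(data)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    points = []
    for i, x in enumerate(data, start=1):
        theoretical = std_normal.inv_cdf((i - 0.5) / n)
        observed = (x - m) / s
        points.append((theoretical, observed))
    return points

right_skewed = [1, 2, 2, 3, 3, 4, 5, 8, 20]  # hypothetical sample
points = qq_points(right_skewed)
for theoretical, observed in points:
    side = "above" if observed > theoretical else "below"
    print(f"{theoretical:6.2f} {observed:6.2f}  {side} the line")
```

For this right-skewed sample the first and last points sit above the line while the middle points sit below it, which is the upward-opening curve typical of right skew.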
Skewness and Kurtosis
Skewness
Measures the asymmetry of the distribution.
Skewness > 0: right skew (positive skew) — long tail to the right
Skewness = 0: symmetric
Skewness < 0: left skew (negative skew) — long tail to the left
Rule of thumb:
|Skewness| < 0.5 → approximately symmetric
0.5 ≤ |Skewness| < 1 → moderately skewed
|Skewness| ≥ 1 → highly skewed
Kurtosis
Measures how heavy the tails are compared to a normal distribution.
Normal distribution: kurtosis = 3 (excess kurtosis = 0)
Leptokurtic (excess kurtosis > 0): heavy tails → more extreme values
Platykurtic (excess kurtosis < 0): thin tails → fewer extreme values
Finance context:
Stock returns are leptokurtic — more frequent extreme returns than normal distribution predicts
→ Using a normal distribution for VaR (Value at Risk) underestimates tail risk, as the 2008 financial crisis showed
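Both measures come from the central moments of the data. The sketch below uses the simple moment-based formulas, skewness = m₃ / m₂^1.5 and excess kurtosis = m₄ / m₂² − 3; spreadsheet and statistics packages often apply small-sample corrections, so their output may differ slightly. The function names and the sample are illustrative.

```python
from statistics import fmean

def central_moment(values, k):
    """k-th central moment: average of (x - mean)^k."""
    m = fmean(values)
    return fmean([(v - m) ** k for v in values])

def skewness(values):
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

def describe_skew(g):
    """Apply the rule of thumb given above."""
    if abs(g) < 0.5:
        return "approximately symmetric"
    if abs(g) < 1:
        return "moderately skewed"
    return "highly skewed"

sample = [20, 22, 25, 25, 28, 30, 35, 60, 120]  # hypothetical right-skewed data
g1 = skewness(sample)
print(f"skewness = {g1:.2f} ({describe_skew(g1)})")
print(f"excess kurtosis = {excess_kurtosis(sample):.2f}")
```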
Practical Examples
Example 1: Salary Distribution Analysis
Plot a histogram of 200 employee salaries:
Observation from histogram:
- Right-skewed (few very high earners)
- Modal class: ₹70–80k
- Long tail toward ₹200k+
- A smaller second peak around ₹95–100k (might be managers)
Choice of summary statistics:
- Report median (not mean), because the data is skewed
- Report IQR (not SD), because it is robust to outliers
Action: Investigate the ₹95–100k group. Are they a different job grade?
Example 2: A/B Test Data Exploration
Before running a t-test comparing two groups:
1. Histogram each group → check for normality, outliers
2. Box plot side-by-side → compare medians and spreads
3. Check n per group (rule of thumb: a t-test needs roughly n ≥ 30 per group, or approximately normal data)
4. QQ plot → check normality assumption
If data is severely non-normal and n is small → use a non-parametric test instead (e.g., the Mann-Whitney U test)
Example 3: Investment Returns Visualisation
Monthly returns for two funds (last 5 years = 60 months):
Fund A Histogram: Symmetric, bell-shaped → normal-like returns
Fund B Histogram: Left-skewed with heavy left tail → rare but large losses
Side-by-side box plots:
- Fund A: compact box, short whiskers → consistent returns
- Fund B: wider box, long left whisker, many outliers → volatile
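Conclusion: Fund B carries more downside risk than Fund A. The plots alone do not show which fund has the higher average return, so compare the means separately.
Even if Fund B's average return is higher, a risk-averse investor would likely prefer Fund A.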
Common Mistakes
1. Using a bar chart for quantitative data
Wrong: Bar chart of exam scores (one bar per student)
Right: Histogram (group scores into intervals)
→ Bar chart is for categorical variables; histogram for quantitative
2. Ignoring outliers in scatter plots
An outlier can dramatically change the apparent correlation. Always check if the pattern holds without the outlier — and investigate what the outlier represents.
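A quick way to run this check is to compute the correlation twice, with and without the suspect point. The data below are invented for illustration and `pearson_r` is a hand-written helper.

```python
from math import sqrt

def pearson_r(points):
    """Pearson correlation for a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    return sxy / sqrt(sxx * syy)

# Six hypothetical points with no clear pattern, plus one extreme point.
base = [(1, 3), (2, 1), (3, 4), (4, 2), (5, 4), (6, 1)]
outlier = (20, 15)

r_without = pearson_r(base)
r_with = pearson_r(base + [outlier])
print(f"r without outlier: {r_without:.2f}")
print(f"r with outlier:    {r_with:.2f}")
```

A single point turns an essentially patternless cloud into what looks like a strong positive correlation.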
3. Using a pie chart with too many slices
More than 5–6 slices → pie chart becomes unreadable. Use a bar chart instead.
4. Starting the y-axis at a non-zero value
Sales chart y-axis starting at 95 instead of 0:
→ A 5% change looks like a 100% change visually (bars for 100 and 105 are drawn with heights 5 and 10)
→ Always start bar chart y-axis at 0
(Line charts can start at non-zero for trend focus, but note it clearly)
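The distortion is simple arithmetic: a bar's drawn height is its value minus the axis baseline. A small sketch, with illustrative function names and sales figures of 100 and 105:

```python
import math

def true_change(old, new):
    """Actual relative change between two values."""
    return (new - old) / old

def apparent_change(old, new, baseline):
    """Relative change in drawn bar height when the axis starts at `baseline`."""
    return (new - baseline) / (old - baseline) - 1

actual = true_change(100, 105)
truncated = apparent_change(100, 105, baseline=95)
honest = apparent_change(100, 105, baseline=0)
print(f"actual change:            {actual:.0%}")
print(f"bar growth, axis from 95: {truncated:.0%}")
print(f"bar growth, axis from 0:  {honest:.0%}")
```

Only a baseline of 0 makes the bar heights proportional to the values they represent.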
5. Inferring causation from a scatter plot
Scatter plot shows: as coffee consumption increases, productivity increases
→ Does NOT mean coffee causes productivity
→ Could be that busy workers drink more coffee AND work more
→ Visualisation shows association, not causation
Practice Exercises
- Sketch (or describe) the expected histogram shape for:
  a) Annual income of Indian households
  b) Body temperature of healthy adults
  c) Rolling a fair six-sided die 1,000 times
  d) Marks in a very difficult exam where most students fail
- Given Q1=45, Median=60, Q3=75, Min=20, Max=130: draw a box plot and identify any outliers using Tukey's fences.
- A scatter plot of "number of hours worked" vs "employee satisfaction score" shows a negative relationship. Describe what this means, and propose two possible explanations beyond simple causation.
- A QQ plot of residuals from a regression model shows points that curve away from the diagonal at both ends (heavy tails). What does this suggest about the residuals?
- A dataset has mean=80, median=65. What can you infer about its shape? Sketch the expected distribution.
Summary
In this chapter you learned:
- Histogram — distribution of one quantitative variable; reveals shape (symmetric, skewed, bimodal, uniform)
- Shapes: symmetric (mean≈median), right-skewed (mean>median), left-skewed (mean<median), bimodal
- Box plot — Min, Q1, Median, Q3, Max; shows outliers; great for comparing groups
- Scatter plot — relationship between two quantitative variables; positive, negative, or no association
- QQ plot — checks normality; points on diagonal = normal; deviations = departures from normality
- Skewness: positive = right tail; negative = left tail; |skew| ≥ 1 = highly skewed
- Kurtosis: heavy tails (leptokurtic) vs thin tails (platykurtic) vs normal
- Always choose chart type based on variable type and analytical goal
- Bar chart ≠ histogram (bar = categorical; histogram = quantitative)
- Visualise before computing — summary statistics can hide the true shape
Next up: Probability Fundamentals — the mathematical foundation for everything that follows: events, rules, and how uncertainty is quantified.