What Is Statistics?
Statistics is the science of collecting, organising, analysing, interpreting, and presenting data. It provides the tools to make sense of a world full of uncertainty — turning raw numbers into decisions.
Statistics is used everywhere:
- Finance — portfolio risk, return forecasting, option pricing
- Healthcare — clinical trials, drug efficacy, disease surveillance
- Business — A/B testing, demand forecasting, customer segmentation
- Data Science — machine learning models are built on statistical foundations
- Government — census, economic indicators, policy evaluation
- Sports — player performance analysis, game strategy
"Statistics is the grammar of science." — Karl Pearson
Why Statistics Matters for Your Career
Whether you work in investment banking, consulting, marketing, or technology, you will encounter data. The question is whether you can trust it, understand it, and use it to make better decisions than your peers.
Statisticians — and people who understand statistics — can:
- Spot misleading conclusions in reports
- Design experiments that actually answer the right question
- Quantify uncertainty rather than ignoring it
- Communicate findings with appropriate confidence
Two Branches of Statistics
Descriptive Statistics
Describes and summarises data you already have — no generalisations beyond the dataset.
Exam scores of 30 students: 45, 67, 72, 88, 91, ...
Mean: 74.3
Median: 76
Standard Deviation: 12.8
Range: 46
Descriptive statistics answer: What does this data look like?
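As a minimal sketch, the same four summaries can be computed with Python's standard `statistics` module. The scores below are hypothetical and chosen only for illustration; they are not the 30 exam scores from the example, so the results will differ from the figures shown above.

```python
import statistics

# Hypothetical scores for illustration only (not the 30 scores in the example).
scores = [52, 61, 68, 74, 79, 85, 90, 95]

summary = {
    "mean": statistics.mean(scores),
    "median": statistics.median(scores),
    "stdev": statistics.stdev(scores),      # sample SD (n - 1 in the denominator)
    "range": max(scores) - min(scores),
}

for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

Every number printed here describes only these eight scores; nothing is claimed about any other students.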
Inferential Statistics
Uses a sample to draw conclusions about a larger population — with quantified uncertainty.
Survey 500 customers out of 50,000.
→ 68% say they prefer Product A.
→ With 95% confidence, the true proportion is between 64% and 72%.
Inferential statistics answer: What can we conclude about the population from this sample?
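The interval in the survey example can be reproduced with a short sketch. It assumes the simple normal-approximation (Wald) interval for a proportion, which is one common choice; the text does not say which method was used.

```python
import math
from statistics import NormalDist

n = 500            # customers surveyed
p_hat = 0.68       # sample proportion preferring Product A
confidence = 0.95

# Critical value of the standard normal distribution (about 1.96 for 95%).
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

# Standard error of a sample proportion under the normal approximation.
standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
margin = z * standard_error

lower, upper = p_hat - margin, p_hat + margin
print(f"{confidence:.0%} CI: {lower:.1%} to {upper:.1%}")
```

Rounded to whole percentages, the bounds match the 64% to 72% quoted in the example.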
| | Descriptive | Inferential |
|---|---|---|
| Goal | Summarise existing data | Generalise to population |
| Data needed | Entire dataset | Sample from population |
| Examples | Mean, median, charts | Hypothesis tests, confidence intervals |
| Uncertainty | No uncertainty (facts) | Always has uncertainty (estimation) |
Key Definitions
Population vs Sample
- Population: Every individual or item you care about (all 50,000 customers)
- Sample: A subset of the population you actually observe (the 500 surveyed)
- Parameter: A numerical fact about the population (Greek letters: μ, σ, π)
- Statistic: A numerical fact calculated from a sample (Latin letters: x̄, s, p̂)
Population: all students in India who took Class 12 boards (15 million)
Sample: 2,000 students surveyed by a research firm
Parameter (unknown): μ = true average score of ALL students
Statistic (calculated): x̄ = 71.4 (average of the 2,000-student sample)
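The difference between a parameter and a statistic can be sketched with a simulation. The population below is simulated and entirely hypothetical (its size, mean, and spread are assumptions for illustration, not real board results); in practice the parameter is unknown, which is why a sample is taken at all.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 simulated scores.
population = [random.gauss(70, 12) for _ in range(100_000)]
mu = statistics.mean(population)            # parameter: usually unknown

# The research firm only observes a random sample of 2,000 students.
sample = random.sample(population, 2_000)
x_bar = statistics.mean(sample)             # statistic: computed from the sample

print(f"parameter mu    = {mu:.2f}")
print(f"statistic x_bar = {x_bar:.2f}")
```

The two values are close but not identical. That gap is sampling error, and inferential statistics exists to quantify it.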
Observation, Variable, and Dataset
- Observation (unit): one individual thing measured — one student, one customer, one transaction
- Variable: a characteristic that can differ across observations — age, salary, grade
- Dataset: a collection of observations, usually arranged in a table (rows = observations, columns = variables)
Dataset: 5 employees
EmpID Name Dept Salary Rating
001 Priya Sharma Finance 78000 4
002 Raj Patel Technology 95000 5
003 Meera Singh Marketing 68000 3
004 Arjun Nair Finance 82000 4
005 Kavya Menon HR 61000 3
Observations: 5 (one per row)
Variables: 5 (EmpID, Name, Dept, Salary, Rating)
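One way to hold this table in code, sketched here with plain Python dictionaries (a dataframe library would work equally well), is one dictionary per observation with one key per variable.

```python
# Each dict is one observation (row); each key is one variable (column).
employees = [
    {"EmpID": "001", "Name": "Priya Sharma", "Dept": "Finance",    "Salary": 78000, "Rating": 4},
    {"EmpID": "002", "Name": "Raj Patel",    "Dept": "Technology", "Salary": 95000, "Rating": 5},
    {"EmpID": "003", "Name": "Meera Singh",  "Dept": "Marketing",  "Salary": 68000, "Rating": 3},
    {"EmpID": "004", "Name": "Arjun Nair",   "Dept": "Finance",    "Salary": 82000, "Rating": 4},
    {"EmpID": "005", "Name": "Kavya Menon",  "Dept": "HR",         "Salary": 61000, "Rating": 3},
]

n_observations = len(employees)
variables = list(employees[0].keys())

# Pulling out a single variable gives one value per observation.
salaries = [row["Salary"] for row in employees]

print("Observations:", n_observations)
print("Variables:", variables)
print("Salary column:", salaries)
```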
Types of Variables
A crucial early concept — the type of variable determines which statistical methods are valid.
Quantitative (Numerical) Variables
Variables that represent amounts and can be measured on a numeric scale.
Continuous: Can take any value in a range (including decimals)
Height: 167.4 cm, 172.1 cm, 183.8 cm
Temperature: 36.5°C, 38.1°C
Salary: ₹78,432.50
Discrete: Can only take specific values, usually whole numbers
Number of children: 0, 1, 2, 3
Number of defects: 0, 1, 2...
Exam score (out of 100): 45, 72, 88
Qualitative (Categorical) Variables
Variables that represent categories or groups.
Nominal: Categories with no natural order
Department: Finance, Technology, Marketing, HR
Blood type: A, B, AB, O
City: Mumbai, Delhi, Bangalore
Ordinal: Categories with a meaningful order, but the gaps between categories are not necessarily equal
Rating: 1, 2, 3, 4, 5 (order matters, but the step from 1 to 2 need not mean the same as the step from 4 to 5)
Education: School, Undergraduate, Postgraduate, PhD
Survey response: Strongly Disagree, Disagree, Neutral, Agree, Strongly Agree
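To illustrate why the variable type matters, the sketch below picks a summary that suits each type, using the columns of the five-employee table. The pairing (mode for nominal, median for ordinal, mean for quantitative) is a common rule of thumb rather than the only valid choice.

```python
import statistics

dept = ["Finance", "Technology", "Marketing", "Finance", "HR"]   # nominal
rating = [4, 5, 3, 4, 3]                                          # ordinal
salary = [78000, 95000, 68000, 82000, 61000]                      # quantitative

# Nominal: only counting makes sense, so report the most frequent category.
most_common_dept = statistics.mode(dept)

# Ordinal: order is meaningful, so the middle value is a safe summary.
median_rating = statistics.median(rating)

# Quantitative: arithmetic on the values is meaningful, so a mean is valid.
mean_salary = statistics.mean(salary)

print("Most common department:", most_common_dept)
print("Median rating:", median_rating)
print("Mean salary:", mean_salary)
```

A "mean department" cannot be computed at all, and a mean rating would silently assume that the steps between rating levels are equal.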
The Statistical Workflow
A complete data analysis follows these steps:
1. DEFINE THE QUESTION
"Does the new training programme improve employee performance?"
2. COLLECT DATA
Measure performance scores before and after training for 50 employees
3. ORGANISE & CLEAN
Enter into a table, fix errors, handle missing values
4. DESCRIBE
Calculate mean scores before and after; draw a histogram
5. ANALYSE
Run a paired t-test to check if the improvement is statistically significant
6. INTERPRET
p-value = 0.02 → significant improvement at 5% level
7. COMMUNICATE
Report findings with confidence intervals and effect size
Each step has its own tools and pitfalls. This tutorial series covers them all.
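Steps 4 to 6 can be sketched in a few lines. The before and after scores below are hypothetical, and only 8 employees are used instead of the 50 in the example, so the resulting numbers are illustrative and will not reproduce the p-value of 0.02 quoted above.

```python
import math
import statistics

# Hypothetical performance scores for 8 employees (illustration only).
before = [62, 70, 55, 68, 74, 60, 66, 71]
after = [66, 73, 59, 67, 80, 65, 70, 74]

# DESCRIBE: a paired design works with the per-employee differences.
diffs = [a - b for a, b in zip(after, before)]
mean_diff = statistics.mean(diffs)
sd_diff = statistics.stdev(diffs)

# ANALYSE: paired t statistic = mean difference / its standard error.
n = len(diffs)
t_stat = mean_diff / (sd_diff / math.sqrt(n))
df = n - 1

# COMMUNICATE: a standardised effect size (Cohen's d for paired data).
cohens_d = mean_diff / sd_diff

print(f"mean improvement = {mean_diff:.2f}")
print(f"t = {t_stat:.2f} on {df} degrees of freedom")
print(f"effect size d = {cohens_d:.2f}")
```

Turning the t statistic into a p-value requires the t distribution, which is not in the standard library. If SciPy is available, `scipy.stats.ttest_rel(after, before)` is one option.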
Practical Examples
Example 1: Investment Banking Use Case
A bank wants to know if a new credit scoring model reduces loan defaults.
- Population: All loan applicants
- Sample: 1,000 applicants processed through the new model
- Descriptive: Default rate in sample = 3.2% (vs 5.8% historical)
- Inferential: Is 3.2% significantly lower than 5.8%? (Hypothesis test — Chapter 13)
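As a rough preview of the inferential step, here is a minimal sketch of a one-sample z-test for a proportion. It assumes the 5.8% historical rate can be treated as a fixed benchmark and that a one-sided test is appropriate; Chapter 13 may set the test up differently.

```python
import math
from statistics import NormalDist

n = 1000               # applicants scored by the new model
p_sample = 0.032       # default rate observed in the sample
p_historical = 0.058   # benchmark, treated as known (assumption)

# Standard error under the null hypothesis that the true rate is still 5.8%.
standard_error = math.sqrt(p_historical * (1 - p_historical) / n)
z = (p_sample - p_historical) / standard_error

# One-sided p-value: probability of a rate this low or lower under the null.
p_value = NormalDist().cdf(z)

print(f"z = {z:.2f}, one-sided p-value = {p_value:.5f}")
```

Under these assumptions the z statistic is about -3.5, so a drop this large would be very unlikely if the true default rate were unchanged.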
Example 2: E-Commerce A/B Test
An online retailer tests two versions of a product page.
- Version A: current page → 500 visitors → 45 purchases (9% conversion)
- Version B: new design → 500 visitors → 63 purchases (12.6% conversion)
- Question: Is the 3.6 percentage point difference real, or just random variation?
- Answer: Requires a hypothesis test (two-proportion z-test — Chapter 13)
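The two-proportion z-test mentioned above can be sketched as follows. The pooled standard error and the two-sided alternative are assumptions of this sketch; the chapter that covers the test may make different choices.

```python
import math
from statistics import NormalDist

visitors_a, purchases_a = 500, 45   # Version A
visitors_b, purchases_b = 500, 63   # Version B

p_a = purchases_a / visitors_a
p_b = purchases_b / visitors_b

# Pooled conversion rate under the null hypothesis of no difference.
p_pooled = (purchases_a + purchases_b) / (visitors_a + visitors_b)
standard_error = math.sqrt(
    p_pooled * (1 - p_pooled) * (1 / visitors_a + 1 / visitors_b)
)

z = (p_b - p_a) / standard_error
p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided

print(f"difference = {p_b - p_a:.3f}, z = {z:.2f}, p-value = {p_value:.3f}")
```

With these numbers z is roughly 1.83 and the two-sided p-value is roughly 0.067, just above the usual 5% threshold. A difference that looks large on the page can still be compatible with random variation at this sample size.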
Example 3: HR Analytics
HR wants to understand salary distribution across departments.
- Descriptive: Calculate mean, median, SD per department
- Inferential: Are salary differences between departments statistically significant? (ANOVA — Chapter 16)
Common Misconceptions
1. Correlation implies causation
Ice cream sales and drowning deaths are correlated (both rise in summer).
Ice cream does not cause drowning — summer is the hidden cause (confounding variable).
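A small simulation can make the confounding argument concrete. All numbers below are invented for illustration: temperature drives both series, and neither series affects the other.

```python
import math
import random

random.seed(1)  # fixed seed so the sketch is repeatable

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical daily data: temperature is the hidden common cause.
temperature = [random.uniform(10, 35) for _ in range(5000)]
ice_cream = [2.0 * t + random.gauss(0, 5) for t in temperature]
drownings = [0.1 * t + random.gauss(0, 0.5) for t in temperature]

r_all = pearson(ice_cream, drownings)

# Hold the confounder roughly fixed: only days between 20 and 22 degrees.
band = [(i, d) for t, i, d in zip(temperature, ice_cream, drownings)
        if 20 <= t <= 22]
r_band = pearson([i for i, _ in band], [d for _, d in band])

print(f"correlation across all days:       {r_all:.2f}")
print(f"correlation at similar temperature: {r_band:.2f}")
```

Across all days the two series are strongly correlated. Once temperature is held roughly constant the correlation largely disappears, which is what a confounder looks like in data.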
2. A larger sample is always better
Sample quality matters more than size. A biased sample of 100,000 is worse than a representative sample of 1,000. (In 1936 the Literary Digest mailed about 10 million ballots, received roughly 2.4 million responses, and still predicted that FDR would lose, because its mailing lists were skewed toward wealthier households.)
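The effect of sampling bias can be sketched with a simulated electorate. The group sizes and support rates below are invented for illustration and are not the 1936 figures.

```python
import random

random.seed(2)  # fixed seed so the sketch is repeatable

# Hypothetical electorate of 500,000 voters: (group, supports_candidate).
population = (
    [("wealthy", random.random() < 0.40) for _ in range(150_000)]
    + [("other", random.random() < 0.65) for _ in range(350_000)]
)
true_support = sum(vote for _, vote in population) / len(population)

# Biased sample: very large, but drawn only from wealthy households.
wealthy_votes = [vote for group, vote in population if group == "wealthy"]
biased_sample = random.sample(wealthy_votes, 100_000)
biased_estimate = sum(biased_sample) / len(biased_sample)

# Representative sample: small, but drawn at random from everyone.
random_sample = random.sample(population, 1_000)
random_estimate = sum(vote for _, vote in random_sample) / len(random_sample)

print(f"true support:               {true_support:.3f}")
print(f"biased sample (n=100,000):  {biased_estimate:.3f}")
print(f"random sample (n=1,000):    {random_estimate:.3f}")
```

Collecting more responses from the wrong group only makes the wrong answer more precise. The small random sample lands much closer to the true value.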
3. Statistical significance = practical significance
A study of 1 million people might find that a drug reduces blood pressure by 0.3 mmHg with p < 0.001. Statistically significant — clinically meaningless.
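The arithmetic behind this example can be sketched as below. Two things are assumed because the text does not state them: the 1 million participants are split evenly into two groups, and blood pressure has a standard deviation of 15 mmHg.

```python
import math

n_per_group = 500_000   # assumption: 1 million people split evenly
effect = 0.3            # difference in mean blood pressure, mmHg
sd = 15.0               # assumed standard deviation, mmHg (hypothetical)

# Standard error of the difference between two group means.
standard_error = sd * math.sqrt(2 / n_per_group)
z = effect / standard_error

# Two-sided p-value; erfc keeps precision far out in the tail.
p_value = math.erfc(z / math.sqrt(2))

# Standardised effect size (Cohen's d): how big the effect is in SD units.
cohens_d = effect / sd

print(f"z = {z:.1f}, p-value = {p_value:.1e}")
print(f"Cohen's d = {cohens_d:.3f}")
```

The p-value is tiny because the sample is enormous, not because the effect is large. An effect of 0.02 standard deviations is still negligible.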
4. The average tells the whole story
Team A salaries: 60k, 60k, 60k, 60k, 120k → Mean = 72k
Team B salaries: 50k, 60k, 70k, 80k, 100k → Mean = 72k
Same average, very different distributions.
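A quick check with the standard `statistics` module shows how the two teams differ once the median and standard deviation are reported alongside the mean (salaries are in thousands, as above).

```python
import statistics

team_a = [60, 60, 60, 60, 120]   # salaries in thousands
team_b = [50, 60, 70, 80, 100]

results = {}
for name, salaries in [("Team A", team_a), ("Team B", team_b)]:
    results[name] = {
        "mean": statistics.mean(salaries),
        "median": statistics.median(salaries),
        "stdev": statistics.stdev(salaries),   # sample standard deviation
    }
    r = results[name]
    print(f"{name}: mean={r['mean']}k, median={r['median']}k, "
          f"stdev={r['stdev']:.1f}k")
```

Team A's mean is pulled up by a single 120k salary, so its median is 60k, while Team B's median is 70k.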
Practice Exercises
- Classify each variable as quantitative (continuous/discrete) or qualitative (nominal/ordinal): a) Number of transactions per day b) Customer satisfaction (1–5 stars) c) City of residence d) Annual revenue e) Bond credit rating (AAA, AA, A, BBB...)
- You survey 200 employees about job satisfaction. Identify: the population, the sample, a parameter, and a statistic.
- A news headline says "Coffee drinkers earn more." What questions would a statistician ask before accepting this claim?
- Identify whether each scenario requires descriptive or inferential statistics: a) Reporting last quarter's revenue by product line b) Predicting customer churn rate for next year based on historical data c) Calculating the average age of the current workforce d) Determining whether a new pricing strategy significantly increased sales
Summary
In this chapter you learned:
- Statistics = collect, organise, analyse, interpret, present data
- Descriptive statistics — summarise the data you have; no generalisations
- Inferential statistics — use a sample to draw conclusions about a population, with quantified uncertainty
- Population vs Sample: all individuals vs a subset; parameters vs statistics
- Variable types: quantitative (continuous/discrete) and qualitative (nominal/ordinal)
- The statistical workflow: define question → collect → organise → describe → analyse → interpret → communicate
- Common pitfalls: correlation ≠ causation; sample quality > sample size; statistical ≠ practical significance
Next up: Data Types & Measurement Scales — nominal, ordinal, interval, and ratio scales, and why they determine which statistics are valid.