Introduction to Statistics

What Is Statistics?

Statistics is the science of collecting, organising, analysing, interpreting, and presenting data. It provides the tools to make sense of a world full of uncertainty — turning raw numbers into decisions.

Statistics is used everywhere:

Finance — portfolio risk, return forecasting, option pricing
Healthcare — clinical trials, drug efficacy, disease surveillance
Business — A/B testing, demand forecasting, customer segmentation
Data Science — machine learning models are built on statistical foundations
Government — census, economic indicators, policy evaluation
Sports — player performance analysis, game strategy

"Statistics is the grammar of science." — Karl Pearson

Why Statistics Matters for Your Career

Whether you work in investment banking, consulting, marketing, or technology, you will encounter data. The question is whether you can trust it, understand it, and use it to make better decisions than your peers.

Statisticians — and people who understand statistics — can:

Spot misleading conclusions in reports
Design experiments that actually answer the right question
Quantify uncertainty rather than ignoring it
Communicate findings with appropriate confidence

Two Branches of Statistics

Descriptive Statistics

Describes and summarises data you already have — no generalisations beyond the dataset.

Exam scores of 30 students: 45, 67, 72, 88, 91, ...

Mean: 74.3
Median: 76
Standard Deviation: 12.8
Range: 46

Descriptive statistics answer: What does this data look like?
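
As a minimal sketch, these summaries can be computed with Python's standard `statistics` module. The full list of 30 scores is not given above, so the ten scores here are hypothetical and the results will not match the figures quoted.

```python
import statistics

# Hypothetical scores for illustration only; the real list of 30 is not shown in the text.
scores = [45, 52, 67, 72, 76, 80, 84, 88, 90, 91]

mean_score = statistics.mean(scores)
median_score = statistics.median(scores)
sd_score = statistics.stdev(scores)          # sample standard deviation (n - 1 denominator)
score_range = max(scores) - min(scores)

print(f"Mean: {mean_score:.1f}")
print(f"Median: {median_score}")
print(f"Standard Deviation: {sd_score:.1f}")
print(f"Range: {score_range}")
```

Every number printed is a fact about these ten scores only; nothing is claimed about any other students.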

Inferential Statistics

Uses a sample to draw conclusions about a larger population — with quantified uncertainty.

Survey 500 customers out of 50,000.
→ 68% say they prefer Product A.
→ With 95% confidence, the true proportion is between 64% and 72%.

Inferential statistics answer: What can we conclude about the population from this sample?
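
The interval in this example can be reproduced with the normal-approximation (Wald) formula for a proportion. This is a sketch assuming a simple random sample and ignoring the finite-population correction, which is small when 500 are drawn from 50,000.

```python
import math
from statistics import NormalDist

n = 500            # sample size from the example
p_hat = 0.68       # sample proportion preferring Product A
confidence = 0.95

# Two-sided critical value of the standard normal (about 1.96 for 95%)
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
margin = z * standard_error
lower, upper = p_hat - margin, p_hat + margin

print(f"{confidence:.0%} CI: {lower:.1%} to {upper:.1%}")
```

Rounded to whole percentages, this gives the 64% to 72% interval quoted above.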

Aspect       Descriptive               Inferential
Goal         Summarise existing data   Generalise to population
Data needed  Entire dataset            Sample from population
Examples     Mean, median, charts      Hypothesis tests, confidence intervals
Uncertainty  No uncertainty (facts)    Always has uncertainty (estimation)

Key Definitions

Population vs Sample

Population: Every individual or item you care about (all 50,000 customers)
Sample: A subset of the population you actually observe (the 500 surveyed)
Parameter: A numerical fact about the population (Greek letters: μ, σ, π)
Statistic: A numerical fact calculated from a sample (Latin letters: x̄, s, p̂)

Population: all students in India who took Class 12 boards (15 million)
Sample: 2,000 students surveyed by a research firm
Parameter (unknown): μ = true average score of ALL students
Statistic (calculated): x̄ = 71.4 (average of the 2,000-student sample)

Observation, Variable, and Dataset

Observation (unit): one individual thing measured — one student, one customer, one transaction
Variable: a characteristic that can differ across observations — age, salary, grade
Dataset: a collection of observations, usually arranged in a table (rows = observations, columns = variables)

Dataset: 5 employees

EmpID  Name          Dept        Salary   Rating
001    Priya Sharma  Finance     78000    4
002    Raj Patel     Technology  95000    5
003    Meera Singh   Marketing   68000    3
004    Arjun Nair    Finance     82000    4
005    Kavya Menon   HR          61000    3

Observations: 5 (one per row)
Variables: 5 (EmpID, Name, Dept, Salary, Rating)
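
One simple way to hold such a table in code is a list of dictionaries, one dictionary per observation. This is only a sketch; a dataframe library would be a common alternative.

```python
# Each dictionary is one observation (row); each key is one variable (column).
employees = [
    {"EmpID": "001", "Name": "Priya Sharma", "Dept": "Finance",    "Salary": 78000, "Rating": 4},
    {"EmpID": "002", "Name": "Raj Patel",    "Dept": "Technology", "Salary": 95000, "Rating": 5},
    {"EmpID": "003", "Name": "Meera Singh",  "Dept": "Marketing",  "Salary": 68000, "Rating": 3},
    {"EmpID": "004", "Name": "Arjun Nair",   "Dept": "Finance",    "Salary": 82000, "Rating": 4},
    {"EmpID": "005", "Name": "Kavya Menon",  "Dept": "HR",         "Salary": 61000, "Rating": 3},
]

n_observations = len(employees)
variables = list(employees[0].keys())

# Pulling out one variable gives its value for every observation.
salaries = [row["Salary"] for row in employees]

print(f"Observations: {n_observations}")
print(f"Variables: {len(variables)} ({', '.join(variables)})")
```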

Types of Variables

Variable type is a crucial early concept: it determines which statistical methods are valid.

Quantitative (Numerical) Variables

Variables that represent amounts and can be measured on a numeric scale.

Continuous: Can take any value in a range (including decimals)

Height: 167.4 cm, 172.1 cm, 183.8 cm
Temperature: 36.5°C, 38.1°C
Salary: ₹78,432.50

Discrete: Can only take specific values, usually whole numbers

Number of children: 0, 1, 2, 3
Number of defects: 0, 1, 2...
Exam score (out of 100): 45, 72, 88

Qualitative (Categorical) Variables

Variables that represent categories or groups.

Nominal: Categories with no natural order

Department: Finance, Technology, Marketing, HR
Blood type: A, B, AB, O
City: Mumbai, Delhi, Bangalore

Ordinal: Categories with a meaningful order, but the gaps between categories are not necessarily equal

Rating: 1, 2, 3, 4, 5 (order matters, but the gap between 1 and 2 need not mean the same as the gap between 4 and 5)
Education: School, Undergraduate, Postgraduate, PhD
Survey response: Strongly Disagree, Disagree, Neutral, Agree, Strongly Agree
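
To illustrate why the variable type matters, the sketch below applies one commonly recommended summary to each type, using values from the five-employee table earlier in this chapter. The pairing of type and summary is a rule of thumb, not the only valid choice.

```python
import statistics

departments = ["Finance", "Technology", "Marketing", "Finance", "HR"]  # nominal
ratings = [4, 5, 3, 4, 3]                                               # ordinal
salaries = [78000, 95000, 68000, 82000, 61000]                          # quantitative

# Nominal: only counting makes sense, so report the most frequent category.
most_common_dept = statistics.mode(departments)

# Ordinal: order is meaningful, so the median is a safe summary.
median_rating = statistics.median(ratings)

# Quantitative: arithmetic is meaningful, so the mean can be used.
mean_salary = statistics.mean(salaries)

print(most_common_dept, median_rating, mean_salary)
```

Averaging department names is impossible, and averaging ratings quietly assumes equal gaps between categories, which is why the type has to be settled before choosing a method.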

The Statistical Workflow

A complete data analysis follows these steps:

1. DEFINE THE QUESTION
   "Does the new training programme improve employee performance?"

2. COLLECT DATA
   Measure performance scores before and after training for 50 employees

3. ORGANISE & CLEAN
   Enter into a table, fix errors, handle missing values

4. DESCRIBE
   Calculate mean scores before and after; draw a histogram

5. ANALYSE
   Run a paired t-test to check if the improvement is statistically significant

6. INTERPRET
   p-value = 0.02 < 0.05 → the improvement is statistically significant at the 5% level

7. COMMUNICATE
   Report findings with confidence intervals and effect size

Each step has its own tools and pitfalls. This tutorial series covers them all.
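
Steps 4 and 5 can be sketched for the training example as follows. The scores are hypothetical and cover 8 employees rather than 50, to keep the arithmetic visible. The standard library has no t-distribution, so the sketch compares the t statistic against a critical value taken from a t-table instead of computing a p-value.

```python
import math
import statistics

# Hypothetical before/after performance scores for 8 employees (not from the text).
before = [62, 70, 58, 75, 66, 71, 64, 69]
after = [66, 74, 57, 80, 70, 73, 69, 72]

# DESCRIBE: a paired design analyses the per-employee differences.
differences = [a - b for a, b in zip(after, before)]
n = len(differences)
mean_diff = statistics.mean(differences)
sd_diff = statistics.stdev(differences)

# ANALYSE: paired t statistic = mean difference / its standard error.
t_stat = mean_diff / (sd_diff / math.sqrt(n))

# Two-sided 5% critical value for n - 1 = 7 degrees of freedom, from a t-table.
T_CRITICAL_DF7 = 2.365
significant = abs(t_stat) > T_CRITICAL_DF7

print(f"Mean improvement: {mean_diff:.2f}")
print(f"t = {t_stat:.2f}, significant at 5%: {significant}")
```

If SciPy is available, `scipy.stats.ttest_rel(after, before)` returns the same statistic together with a p-value.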

Practical Examples

Example 1: Investment Banking Use Case

A bank wants to know if a new credit scoring model reduces loan defaults.

Population: All loan applicants
Sample: 1,000 applicants processed through the new model
Descriptive: Default rate in sample = 3.2% (vs 5.8% historical)
Inferential: Is 3.2% significantly lower than 5.8%? (Hypothesis test — Chapter 13)
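
A rough sketch of the inferential step is a one-proportion z-test. It assumes the historical 5.8% can be treated as a fixed, known benchmark and that the 1,000 applicants behave like a random sample; both are assumptions rather than facts from the example.

```python
import math
from statistics import NormalDist

n = 1000          # applicants processed through the new model
p_hat = 0.032     # observed default rate in the sample
p0 = 0.058        # historical default rate, treated as a known benchmark (assumption)

# Standard error under the null hypothesis that the true rate is still p0.
se_null = math.sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / se_null

# One-sided p-value: probability of a rate this low or lower if nothing changed.
p_value = NormalDist().cdf(z)

print(f"z = {z:.2f}, one-sided p-value = {p_value:.5f}")
```

Under these assumptions z is about -3.5, so a drop of this size would be unlikely to arise by chance alone. Whether the new model caused the drop is a separate question of study design.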

Example 2: E-Commerce A/B Test

An online retailer tests two versions of a product page.

Version A: current page → 500 visitors → 45 purchases (9% conversion)
Version B: new design → 500 visitors → 63 purchases (12.6% conversion)
Question: Is the 3.6 percentage point difference real, or just random variation?
Answer: Requires a hypothesis test (two-proportion z-test — Chapter 13)
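
As a sketch, the pooled two-proportion z-test can be run on these counts with the standard library alone. It assumes visitors were randomly assigned to the two versions and uses a two-sided alternative.

```python
import math
from statistics import NormalDist

visitors_a, purchases_a = 500, 45   # Version A
visitors_b, purchases_b = 500, 63   # Version B

rate_a = purchases_a / visitors_a
rate_b = purchases_b / visitors_b

# Under the null hypothesis both versions share one conversion rate,
# estimated by pooling the two groups.
pooled = (purchases_a + purchases_b) / (visitors_a + visitors_b)
se = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))

z = (rate_b - rate_a) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided

print(f"Difference: {rate_b - rate_a:.1%}")
print(f"z = {z:.2f}, two-sided p-value = {p_value:.3f}")
```

With these numbers z is roughly 1.83 and the two-sided p-value is roughly 0.07, just above the usual 5% threshold. A difference that looks large on the page can therefore still fall within the range of random variation at this sample size.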

Example 3: HR Analytics

HR wants to understand salary distribution across departments.

Descriptive: Calculate mean, median, SD per department
Inferential: Are salary differences between departments statistically significant? (ANOVA — Chapter 16)
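
Both steps can be sketched on a small hypothetical dataset (salaries in thousands, three departments of four employees each, none of it from the text). The F statistic follows the usual one-way ANOVA decomposition into between-group and within-group variation.

```python
import statistics

# Hypothetical salaries in thousands; for illustration only.
salaries_by_dept = {
    "Finance": [78, 82, 75, 85],
    "Technology": [95, 90, 99, 92],
    "Marketing": [68, 72, 65, 71],
}

# Descriptive: mean, median and SD per department.
for dept, values in salaries_by_dept.items():
    print(dept, statistics.mean(values), statistics.median(values),
          round(statistics.stdev(values), 2))

# Inferential: one-way ANOVA F statistic.
groups = list(salaries_by_dept.values())
all_values = [x for values in groups for x in values]
grand_mean = statistics.mean(all_values)
k = len(groups)             # number of groups
n_total = len(all_values)   # total observations

ss_between = sum(len(v) * (statistics.mean(v) - grand_mean) ** 2 for v in groups)
ss_within = sum((x - statistics.mean(v)) ** 2 for v in groups for x in v)

f_stat = (ss_between / (k - 1)) / (ss_within / (n_total - k))
print(f"F({k - 1}, {n_total - k}) = {f_stat:.2f}")
```

The F statistic would then be compared with an F distribution with (k - 1, n_total - k) degrees of freedom. If SciPy is available, `scipy.stats.f_oneway` performs the same test and reports a p-value.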

Common Misconceptions

1. Correlation implies causation

Ice cream sales and drowning deaths are correlated (both rise in summer).
Ice cream does not cause drowning — summer is the hidden cause (confounding variable).
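
A small simulation can make the confounding visible. In this sketch temperature drives both quantities and neither affects the other; all coefficients and noise levels are made up. The two series are strongly correlated, but the correlation largely disappears once temperature is accounted for.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def residuals(y, x):
    """Residuals from a simple least-squares line of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical model: temperature is the hidden common cause.
temperature = [rng.uniform(10, 35) for _ in range(500)]
ice_cream = [10 * t + rng.gauss(0, 20) for t in temperature]
drownings = [0.2 * t + rng.gauss(0, 1) for t in temperature]

raw_r = pearson(ice_cream, drownings)

# Remove the part of each series explained by temperature, then correlate what is left.
partial_r = pearson(residuals(ice_cream, temperature),
                    residuals(drownings, temperature))

print(f"Correlation (ice cream, drownings): {raw_r:.2f}")
print(f"Correlation after removing temperature: {partial_r:.2f}")
```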

2. A larger sample is always better

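Sample quality matters more than size. A biased sample of 100,000 is worse than a representative sample of 1,000. (In 1936 the Literary Digest predicted that FDR would lose, based on roughly 2.4 million returned ballots out of about 10 million mailed. The poll was badly wrong because its mailing lists skewed toward wealthier households and only a minority of recipients responded.)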
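
The effect of a biased sampling frame can be sketched with a simulation. The electorate here is entirely hypothetical: 30% of voters are wealthy, and support for the candidate is 30% among wealthy voters and 70% among everyone else. Sample sizes are scaled down from the figures in the text to keep the run fast.

```python
import random

rng = random.Random(1936)  # fixed seed so the run is repeatable

# Hypothetical electorate of 100,000 voters: (is_wealthy, supports_candidate).
population = []
for _ in range(100_000):
    wealthy = rng.random() < 0.30
    supports = rng.random() < (0.30 if wealthy else 0.70)
    population.append((wealthy, supports))

true_support = sum(s for _, s in population) / len(population)

# Large but biased sample: 20,000 voters drawn only from wealthy households.
wealthy_only = [s for w, s in population if w]
biased_sample = rng.sample(wealthy_only, 20_000)
biased_estimate = sum(biased_sample) / len(biased_sample)

# Small but representative sample: 1,000 voters drawn from everyone.
representative_sample = rng.sample(population, 1_000)
representative_estimate = sum(s for _, s in representative_sample) / len(representative_sample)

print(f"True support:            {true_support:.1%}")
print(f"Biased, n = 20,000:      {biased_estimate:.1%}")
print(f"Representative, n=1,000: {representative_estimate:.1%}")
```

Adding more respondents to the biased sample only makes the wrong answer more precise; it does not move it toward the truth.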

3. Statistical significance = practical significance

A study of 1 million people might find that a drug reduces blood pressure by 0.3 mmHg with p < 0.001. Statistically significant — clinically meaningless.
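
The arithmetic behind this example can be sketched as a one-sample z-test. The text does not give a standard deviation, so the 15 mmHg used here is an assumption chosen only for illustration.

```python
import math
from statistics import NormalDist

n = 1_000_000
mean_reduction = 0.3    # mmHg, from the example
assumed_sd = 15.0       # mmHg, hypothetical spread of individual changes (assumption)

standard_error = assumed_sd / math.sqrt(n)
z = mean_reduction / standard_error

# Two-sided p-value; it is so small here that it may print as 0.0.
p_value = 2 * (1 - NormalDist().cdf(z))

# Standardised effect size: the reduction measured in standard deviations.
effect_size = mean_reduction / assumed_sd

print(f"z = {z:.1f}, p < 0.001: {p_value < 0.001}")
print(f"Effect size = {effect_size:.2f} standard deviations")
```

The huge sample shrinks the standard error until even a trivial effect yields a large z. The effect size, which does not depend on n, shows how small the change really is.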

4. The average tells the whole story

Team A salaries: 60k, 60k, 60k, 60k, 120k → Mean = 72k
Team B salaries: 50k, 60k, 70k, 80k, 100k → Mean = 72k
Same average, very different distributions.
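
Adding a few more summaries to the two teams makes the difference explicit. This is a short sketch using the salaries above, in thousands.

```python
import statistics

def summarise(values):
    """Return a few descriptive statistics for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),       # sample standard deviation
        "range": max(values) - min(values),
    }

team_a = [60, 60, 60, 60, 120]   # salaries in thousands, from the example
team_b = [50, 60, 70, 80, 100]

summary_a = summarise(team_a)
summary_b = summarise(team_b)

for name, s in (("Team A", summary_a), ("Team B", summary_b)):
    print(f"{name}: mean={s['mean']}, median={s['median']}, "
          f"sd={s['sd']:.1f}, range={s['range']}")
```

Team A's median of 60k shows that its mean is pulled up by a single high earner, while Team B's median of 70k sits close to its mean.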

Practice Exercises

1. Classify each variable as quantitative (continuous/discrete) or qualitative (nominal/ordinal):
   a) Number of transactions per day
   b) Customer satisfaction (1–5 stars)
   c) City of residence
   d) Annual revenue
   e) Bond credit rating (AAA, AA, A, BBB...)

2. You survey 200 employees about job satisfaction. Identify: the population, the sample, a parameter, and a statistic.

3. A news headline says "Coffee drinkers earn more." What questions would a statistician ask before accepting this claim?

4. Identify whether each scenario requires descriptive or inferential statistics:
   a) Reporting last quarter's revenue by product line
   b) Predicting customer churn rate for next year based on historical data
   c) Calculating the average age of the current workforce
   d) Determining whether a new pricing strategy significantly increased sales

Summary

In this chapter you learned:

Statistics = collect, organise, analyse, interpret, present data
Descriptive statistics — summarise the data you have; no generalisations
Inferential statistics — use a sample to draw conclusions about a population, with quantified uncertainty
Population vs Sample: all individuals vs a subset; parameters vs statistics
Variable types: quantitative (continuous/discrete) and qualitative (nominal/ordinal)
The statistical workflow: define question → collect → organise → describe → analyse → interpret → communicate
Common pitfalls: correlation ≠ causation; sample quality > sample size; statistical ≠ practical significance; the average alone can hide the distribution

Next up: Data Types & Measurement Scales — nominal, ordinal, interval, and ratio scales, and why they determine which statistics are valid.