Data Collection & Sampling Methods

Why Sampling Design Matters

Your analysis is only as good as your data collection. A brilliant statistical method applied to biased data produces confidently wrong conclusions. Sampling design is the foundation — get it right before any calculations begin.

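The 1936 US Election: Literary Digest mailed ballots to about 10 million people, received roughly 2.4 million replies, and predicted Alf Landon would beat FDR 57–43. The actual result: FDR won 62–38. The massive sample failed for two reasons. The mailing list was drawn from sources such as telephone directories and car registration lists, which systematically excluded the poor, who voted overwhelmingly for FDR. And the people who chose to reply differed from those who did not.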

Population and the Sampling Frame

Target population: Everyone you want to draw conclusions about (all adult customers)
Sampling frame: The list you actually sample from (customers in your CRM database)
Undercoverage: When the sampling frame misses part of the population (customers who never registered online are excluded)

The gap between target population and sampling frame introduces bias before a single respondent is selected.
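
When a list of the target population is available (often it is not), the gap can be measured directly. The sketch below uses made-up customer IDs, and the variable names are illustrative rather than taken from any real system:

```python
# Hypothetical customer IDs; in practice these would come from your own records.
target_population = {"c01", "c02", "c03", "c04", "c05",
                     "c06", "c07", "c08", "c09", "c10"}
sampling_frame = {"c01", "c02", "c03", "c05", "c08", "c09", "c10"}  # e.g. CRM entries

# Members of the target population that can never be selected from the frame.
not_covered = target_population - sampling_frame

# Share of the target population that the frame actually reaches.
coverage_rate = len(target_population & sampling_frame) / len(target_population)

print(sorted(not_covered))     # ['c04', 'c06', 'c07']
print(f"{coverage_rate:.0%}")  # 70%
```

Anyone in `not_covered` has zero chance of selection, no matter how the sample is later drawn from the frame.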

Types of Data Collection

1. Surveys / Questionnaires

Asking people questions about their opinions, behaviour, or characteristics.

Customer Satisfaction Survey:
- Online form sent via email
- Phone interview
- In-store tablet survey

Pros: cheap, scalable, flexible
Cons: non-response bias, self-report bias, question wording effects

2. Experiments

The researcher controls the treatment and randomly assigns subjects to groups.

Clinical Trial:
- Group A: receives new drug
- Group B: receives placebo
- Randomly assigned → control for confounding variables

Pros: can establish causation; controls for confounders
Cons: expensive, ethical constraints, Hawthorne effect

3. Observational Studies

Observe and record without intervention.

Retrospective: Look at past records (did smokers have higher cancer rates?)
Prospective: Follow subjects forward in time (track health outcomes over 10 years)

Pros: ethical, feasible for large populations
Cons: cannot establish causation (confounders uncontrolled)

4. Administrative / Secondary Data

Use data collected for another purpose (tax records, hospital databases, web logs).

Pros: already collected, large scale, cheap
Cons: not designed for your question; may have errors, inconsistencies

Sampling Methods

Probability Sampling (Every Member Has a Known, Non-Zero Chance of Selection)

These methods allow valid statistical inference to the population.

1. Simple Random Sampling (SRS)

Every member of the population has an equal probability of being selected, and every possible sample of the chosen size is equally likely.

Population: 1,000 employees (numbered 1–1,000)
Sample size: 100

Method: Use a random number generator to pick 100 numbers → contact those employees

Pros: Simple, removes the researcher's discretion over who is selected, valid for inference
Cons: Needs a complete list; can be expensive if population is spread out
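
The employee example can be sketched with Python's standard `random` module. The fixed seed is an assumption added only to make the sketch reproducible; a real survey would not need it:

```python
import random

rng = random.Random(42)  # fixed seed, for reproducibility of this sketch only

employee_ids = range(1, 1001)  # employees numbered 1 to 1,000

# sample() draws without replacement, so no employee is picked twice.
srs = rng.sample(employee_ids, k=100)

print(len(srs), len(set(srs)))  # 100 100
```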

2. Systematic Sampling

Select every kth member from a list after a random start.

Population: 1,000 customer accounts (sorted by account number)
Want sample of 100 → k = 1000/100 = 10
Random start: pick a number between 1 and 10 (say, 7)
Select: 7, 17, 27, 37, 47, ... 997

Pros: Simple to implement; spreads evenly across the list
Cons: Fails if there's a periodic pattern in the list aligned with k
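
A minimal sketch of the same procedure, assuming the list positions are numbered from 1 and the population size is a multiple of the sample size. The function name and parameters are illustrative:

```python
import random

def systematic_sample(population_size, sample_size, start=None, rng=random):
    """Return 1-based list positions: every k-th member after a random start."""
    k = population_size // sample_size   # sampling interval
    if start is None:
        start = rng.randint(1, k)        # random start between 1 and k inclusive
    return list(range(start, population_size + 1, k))[:sample_size]

# Reproduce the worked example by fixing the start at 7.
positions = systematic_sample(1000, 100, start=7)
print(positions[:5], positions[-1])  # [7, 17, 27, 37, 47] 997
```

Leaving `start` unset lets the function choose the random start itself, which is what a real draw would do.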

3. Stratified Sampling

Divide the population into subgroups (strata) and sample from each.

Employee survey — strata by department:
Finance: 300 employees → sample 30 (10%)
Technology: 400 employees → sample 40 (10%)
Marketing: 200 employees → sample 20 (10%)
HR: 100 employees → sample 10 (10%)
Total: 100 employees sampled

Proportional stratified: sample size proportional to stratum size (above)
Disproportional stratified: oversample small groups to ensure representation (then weight results back to population proportions when estimating overall figures)

Use when: You want to ensure all subgroups are represented; or you want separate estimates per subgroup.
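
The proportional design for the employee survey could be drawn as follows. The employee IDs are invented for illustration, and the seed is fixed only so the sketch is reproducible:

```python
import random

rng = random.Random(7)  # fixed seed, for reproducibility of this sketch only

# Hypothetical employee IDs per department, sized as in the example.
strata = {
    "Finance": [f"FIN-{i}" for i in range(300)],
    "Technology": [f"TEC-{i}" for i in range(400)],
    "Marketing": [f"MKT-{i}" for i in range(200)],
    "HR": [f"HR-{i}" for i in range(100)],
}
sampling_fraction = 0.10

# Draw a simple random sample of 10% inside every stratum.
stratified_sample = {
    dept: rng.sample(members, k=round(len(members) * sampling_fraction))
    for dept, members in strata.items()
}

sizes = {dept: len(chosen) for dept, chosen in stratified_sample.items()}
print(sizes)  # {'Finance': 30, 'Technology': 40, 'Marketing': 20, 'HR': 10}
```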

4. Cluster Sampling

Divide into clusters (often geographically), randomly select clusters, then survey all members of selected clusters.

National customer survey:
1. Divide India into 500 districts (clusters)
2. Randomly select 20 districts
3. Survey every customer in those 20 districts

Pros: Practical when the population is geographically spread
Cons: Higher sampling error than SRS of same size (within-cluster similarity)

Multi-stage cluster sampling: After selecting clusters, randomly sample within them (not everyone).
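
The difference between one-stage and multi-stage cluster sampling is easiest to see side by side. In this sketch the district contents are randomly generated placeholders, and the choice of 10 customers per district in the second stage is an assumption, not a recommendation:

```python
import random

rng = random.Random(1)  # fixed seed, for reproducibility of this sketch only

# Hypothetical data: 500 districts, each holding between 50 and 150 customer IDs.
districts = {
    d: [f"D{d}-C{c}" for c in range(rng.randint(50, 150))]
    for d in range(1, 501)
}

# Stage 1: randomly select 20 districts (the clusters).
selected = rng.sample(sorted(districts), k=20)

# One-stage cluster sample: survey every customer in the selected districts.
one_stage = [cust for d in selected for cust in districts[d]]

# Multi-stage variant: randomly sample 10 customers inside each selected district.
two_stage = [cust for d in selected for cust in rng.sample(districts[d], k=10)]

print(len(selected), len(two_stage))  # 20 200
```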

5. Systematic vs Stratified — When to Use Which

Feature	Systematic	Stratified
Ordering matters?	Yes — spreads across ordered list	No — groups by characteristic
Subgroup representation	Not guaranteed	Guaranteed
Complexity	Low	Medium
Best for	Long lists with natural order	Heterogeneous population with identifiable subgroups

Non-Probability Sampling (Selection Probability Unknown)

These methods do not support valid statistical inference to the population, because selection probabilities are unknown. They are useful mainly for exploratory work.

Convenience Sampling

Select whoever is easiest to reach.

Surveying students in the cafeteria for opinions on campus policy
→ Misses students who don't eat there (for example, off-campus students)

Purposive (Judgement) Sampling

Researcher selects subjects based on their own judgement about who is "representative."

"I'll interview 10 executives who I think represent different management styles"
→ Subject to researcher bias; not generalisable

Snowball Sampling

Existing subjects recruit future subjects from their network.

Studying informal money lending — hard to find subjects
→ Interview one lender → ask them to refer others
→ Good for hard-to-reach populations; but sample is self-selected

Quota Sampling

Like stratified, but selection within strata is non-random.

"Get 30 Finance and 30 Technology employees"
→ Collector picks convenient employees in each group
→ Not random within strata → selection bias possible

Sample Size Considerations

The Margin of Error Formula (preview)

For a proportion with 95% confidence:

n = (z² × p × (1−p)) / E²

Where:
z = 1.96 (95% confidence)
p = estimated proportion (use 0.5 if unknown — maximises n)
E = desired margin of error

Example: Want ±5% margin of error, 95% confidence
n = (1.96² × 0.5 × 0.5) / 0.05²
n = (3.8416 × 0.25) / 0.0025
n = 0.9604 / 0.0025
n = 384.16 → about 384 people (round up to 385 to guarantee the margin)

Larger samples → smaller margin of error, but with diminishing returns (margins below are rounded; 95% confidence, p = 0.5):

n=100 → ±10% margin
n=400 → ±5% margin
n=1600 → ±2.5% margin
n=10000 → ±1% margin

Quadrupling the sample halves the margin of error.
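
Both directions of the formula can be written as small helper functions. This is a sketch under the same assumptions as the formula (a proportion, z = 1.96 by default); the function names are illustrative. The sample size is rounded up, so the ±5% case returns 385 because the unrounded value is 384.16:

```python
import math

def required_sample_size(margin, p=0.5, z=1.96):
    """Smallest whole n whose margin of error is at most `margin`."""
    return math.ceil(z**2 * p * (1 - p) / margin**2)

def margin_of_error(n, p=0.5, z=1.96):
    """Margin of error for a proportion estimated from n respondents."""
    return z * math.sqrt(p * (1 - p) / n)

print(required_sample_size(0.05))  # 385

# Quadrupling n halves the margin of error.
for n in (100, 400, 1600):
    print(n, round(margin_of_error(n), 4))
# 100 0.098
# 400 0.049
# 1600 0.0245
```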

Bias in Data Collection

Selection Bias

The sample systematically differs from the population.

Voluntary response bias: People with strong opinions are the most likely to respond
Survivorship bias: Only analysing companies that still exist (misses failures)
Non-response bias: Non-responders differ systematically from responders

Measurement Bias

The way data is collected distorts the true value.

Social desirability bias: People underreport embarrassing behaviour and overreport admirable behaviour
Leading questions: "Don't you agree that the new policy is fair?" → inflates agreement
Question order effects: Earlier questions prime responses to later ones
Interviewer bias: Interviewer's characteristics affect respondent answers

Response Bias Examples

Actual vs reported:
- Alcohol consumption: reported 40–60% lower than sales data implies
- Exercise frequency: reported 20–30% higher than fitness tracker data
- Income: often inflated or deflated based on perceived social desirability

Experimental Design Principles

When running experiments (A/B tests, clinical trials):

Randomisation

Randomly assign subjects to treatment and control groups. On average, this balances both known and unknown confounders across groups.

Without randomisation: 
Assign healthy patients to treatment → treatment looks better (not because of drug — because of patient health)

With randomisation:
Healthy and unhealthy patients are balanced across groups → fair comparison
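
One simple way to randomise is to shuffle the subject list and split it in half. The patient IDs below are placeholders, and the helper name is illustrative:

```python
import random

def randomise(subjects, rng):
    """Shuffle a copy of the subjects and split it into two equal-sized groups."""
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

rng = random.Random(2024)  # fixed seed, for reproducibility of this sketch only
patients = [f"patient-{i}" for i in range(1, 41)]  # hypothetical IDs

treatment, control = randomise(patients, rng)
print(len(treatment), len(control))  # 20 20
```

Because the split depends only on the shuffle, neither the researcher nor the patient's health status influences who lands in which group.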

Control Group

A group that receives no treatment (or the existing treatment), used as a baseline.

New drug study:
Treatment group: new drug
Control group: placebo (or existing standard drug)

Blinding

Single-blind: Participants don't know which group they're in (so expectations, including the placebo effect, affect both groups equally)
Double-blind: Neither participants nor researchers know group assignment (prevents researcher bias too)

Replication

Run the experiment on multiple subjects to detect real effects vs random variation. More subjects → more reliable estimates.

Practical Examples

Example 1: Customer Satisfaction Survey Design

Goal: Measure customer satisfaction for a fintech app with 100,000 users.

Sampling frame: All users who logged in within the past 90 days (80,000 users; the 20,000 inactive users are not covered)
Method: Stratified by usage frequency (Daily, Weekly, Monthly)
  Daily (20,000): sample 100
  Weekly (35,000): sample 175
  Monthly (25,000): sample 125
Total: 400 users sampled

Why stratified? Daily and monthly users likely have very different satisfaction levels.
  Stratification ensures every usage group is represented in proportion to its size (0.5% of each stratum).

Delivery: In-app survey after session ends
Non-response mitigation: Follow-up push notification after 48 hours
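
The stratum sample sizes above follow from splitting the total of 400 in proportion to stratum size. A minimal sketch, with an illustrative function name:

```python
def proportional_allocation(strata_sizes, total_sample):
    """Split total_sample across strata in proportion to each stratum's size."""
    frame_size = sum(strata_sizes.values())
    # Note: with awkward numbers, rounding can make the parts sum to slightly
    # more or less than total_sample; here the shares happen to be exact.
    return {
        name: round(total_sample * size / frame_size)
        for name, size in strata_sizes.items()
    }

allocation = proportional_allocation(
    {"Daily": 20_000, "Weekly": 35_000, "Monthly": 25_000}, total_sample=400
)
print(allocation)  # {'Daily': 100, 'Weekly': 175, 'Monthly': 125}
```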

Example 2: A/B Test Design

Goal: Test whether a new loan application flow increases completion rates.

Population: All new loan applicants (2,000/week)
Method: Simple random assignment
  Group A (control): current flow → 1,000 applicants/week
  Group B (treatment): new flow → 1,000 applicants/week
Duration: 4 weeks (8,000 applicants total)
Blinding: Applicants don't know they're in a test
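Randomisation: Assignment at the applicant level (50/50 coin flip), so a returning applicant always sees the same flow; group sizes will be approximately, not exactly, equal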

Success metric: Completion rate (completed application / started application)
Analysis: Two-proportion z-test (Chapter 13)

Example 3: Retail Outlet Audit

Goal: Estimate on-shelf availability across 5,000 retail outlets nationwide.

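Method: Stratified multi-stage sampling (regions are strata because every region is included; outlets are the randomly selected units within them)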
Stage 1: Stratify by region (North, South, East, West, Central)
Stage 2: Within each region, randomly select 20 outlets
Stage 3: Within each outlet, check 10 randomly selected SKUs

Total outlets: 100 (2% of 5,000)
Total SKU checks: 1,000
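
The audit plan could be generated along these lines. The outlet IDs, the equal split of 1,000 outlets per region, and the 200-item SKU catalogue are all assumptions made for illustration:

```python
import random

rng = random.Random(11)  # fixed seed, for reproducibility of this sketch only

regions = ["North", "South", "East", "West", "Central"]

# Hypothetical frame: 1,000 outlet IDs per region (5,000 in total).
outlets_by_region = {r: [f"{r}-{i:04d}" for i in range(1000)] for r in regions}
# Hypothetical catalogue of 200 SKUs.
sku_catalogue = [f"SKU-{i:03d}" for i in range(200)]

audit_plan = {}
for region, outlets in outlets_by_region.items():           # every region is covered
    for outlet in rng.sample(outlets, k=20):                 # 20 random outlets per region
        audit_plan[outlet] = rng.sample(sku_catalogue, k=10)  # 10 random SKUs per outlet

total_outlets = len(audit_plan)
total_checks = sum(len(skus) for skus in audit_plan.values())
print(total_outlets, total_checks)  # 100 1000
```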

Common Mistakes

1. Confusing sampling frame with population

If your CRM only has customers who registered online, any sample from it excludes offline customers — conclusions apply only to online customers.

2. Using a large sample to compensate for bias

A biased sample of 1 million is worse than an unbiased sample of 1,000. Size does not fix bias.
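
A small simulation makes the point concrete. Everything here is invented for illustration: a population of 200,000 with 55% support, and a frame that lists supporters only 30% of the time but everyone else 90% of the time:

```python
import random

rng = random.Random(0)  # fixed seed, for reproducibility of this sketch only

# Hypothetical population: 1 = supporter, 0 = non-supporter (55% support).
population = [1] * 110_000 + [0] * 90_000
true_share = sum(population) / len(population)

# Biased frame: supporters are far less likely to appear on the list.
frame = [v for v in population if rng.random() < (0.3 if v == 1 else 0.9)]

big_biased = rng.sample(frame, k=100_000)       # huge sample, biased frame
small_unbiased = rng.sample(population, k=1_000)  # small sample, full population

biased_estimate = sum(big_biased) / len(big_biased)
unbiased_estimate = sum(small_unbiased) / len(small_unbiased)

print(f"true share:       {true_share:.3f}")
print(f"biased, n=100000: {biased_estimate:.3f}")
print(f"unbiased, n=1000: {unbiased_estimate:.3f}")
```

The large sample lands very precisely on the wrong answer, because it estimates the share within the frame rather than within the population. The small random sample is noisier but centred on the truth.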

3. Non-random sampling in A/B tests

Wrong: Show Version A to morning visitors, Version B to afternoon visitors
→ Morning vs afternoon users might differ (commuters vs shoppers)
→ Any difference could be due to user type, not the design

Right: Randomly assign each visitor (50/50) regardless of time
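
In web experiments, a common way to get a stable 50/50 split that is unrelated to time of day is to hash a visitor identifier. This is one possible approach rather than the only correct one, and `visitor_id` and the experiment label are hypothetical names:

```python
import hashlib

def assign_variant(visitor_id, experiment="homepage-test"):
    """Return 'A' or 'B'; the same visitor ID always maps to the same variant."""
    key = f"{experiment}:{visitor_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # An even hash value goes to A, an odd one to B, giving a roughly 50/50 split.
    return "A" if int(digest, 16) % 2 == 0 else "B"

# Count how a batch of hypothetical visitors is split.
variants = [assign_variant(f"visitor-{i}") for i in range(10_000)]
share_a = variants.count("A") / len(variants)
print(f"share in A: {share_a:.2%}")
```

The assignment depends only on the ID, so morning and afternoon visitors are spread across both versions.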

4. Ignoring non-response

If 70% of the people you survey do not respond, the 30% who did may be systematically different: more engaged, more satisfied, or more dissatisfied.

Practice Exercises

1. A researcher wants to study income levels across India. They use a phone survey. Identify three sources of bias in this design.
2. Describe the sampling method you would use to:
   a) Survey 200 employees from a company of 2,000 (across 5 departments)
   b) Estimate defect rates across 500 factories nationwide
   c) Study drug use patterns among homeless youth
3. A bank wants to test two versions of a credit card offer email. Design an experiment including: sampling method, group assignment, control group, success metric, and duration.
4. You survey 1,000 hotel guests and get a 20% response rate (200 responses). What concerns does this raise? What could you do to address them?
5. Calculate the required sample size to estimate a customer satisfaction proportion with ±3% margin of error at 95% confidence, assuming p ≈ 0.5.

Summary

In this chapter you learned:

- Target population vs sampling frame: gaps cause undercoverage bias
- Probability sampling methods: Simple Random Sampling (SRS), Systematic, Stratified, Cluster; each supports valid inference to the population
- Non-probability methods: Convenience, purposive, snowball, quota; useful for exploration but not valid for inference
- Bias types: Selection bias (systematic exclusion, including non-response) and measurement bias (distorted data)
- Sample size: n ≈ 384 for ±5% margin at 95% confidence (p=0.5); quadruple n to halve the margin
- Experimental design: Randomisation, control group, blinding, replication
- Larger samples don't fix biased sampling; quality of design matters more than quantity

Next up: Descriptive Statistics — mean, median, mode, and how to summarise the centre of a distribution.