What Is Linear Regression?
Linear regression is the simplest and most widely used algorithm for regression — the task of predicting a continuous numeric target from one or more input features. Think house prices in ₹ lakhs, a customer's expected lifetime spend, tomorrow's temperature, or the delivery time of a Swiggy order. In each case the answer is a number on a continuous scale, not a category.
The idea is to fit a straight line (or, with several features, a flat plane or hyperplane) through the data so that the line captures the underlying trend. Once fitted, you feed in a new input and read off the predicted number.
Here is the intuition. Imagine Priya, an analyst at a real-estate firm in Pune, plots flat area (sq ft) on the x-axis and price (₹ lakhs) on the y-axis. The dots trend upward — bigger flats cost more. Linear regression draws the single straight line that sits "closest" to all those dots at once. That line is the model. For a new 900 sq ft flat, she moves up from 900 on the x-axis to the line and reads the predicted price.
Regression → predict a number (price, temperature, demand)
Classification → predict a category (spam / not-spam, churn / stay)
Linear regression is the natural first model to reach for: it is fast, it trains on almost any hardware, and — unlike many black-box models — its output is fully interpretable. You can look at the fitted coefficients and say exactly how much each feature moves the prediction. For classification problems (yes/no, category labels), you instead use Logistic Regression, covered in the next chapter.
This chapter is the machine-learning view of regression — model fitting, gradient descent, and the scikit-learn workflow. The deeper statistical treatment (hypothesis tests on coefficients, confidence and prediction intervals, formal assumption diagnostics) lives in the Linear Regression & Model Evaluation chapter of the Statistics tutorial.
Simple vs Multiple Linear Regression
There are two flavours, differing only in how many input features you use.
- Simple linear regression uses a single feature X to predict the target Y. Example: predict salary from years of experience alone.
- Multiple linear regression uses two or more features X1, X2, ..., Xn. Example: predict salary from experience and education level and city tier.
The model equation for the simple case is a straight line:
ŷ = w·x + b
where
ŷ = predicted target value ("y-hat")
x = the single input feature
w = weight / slope (how much ŷ changes per unit of x)
b = bias / intercept (value of ŷ when x = 0)
The multiple case generalises to a weighted sum of all features plus a bias:
ŷ = w1·x1 + w2·x2 + ... + wn·xn + b
In compact vector form:
ŷ = wᵀx + b (dot product of the weight vector w and feature vector x, plus bias b)
Statisticians write the same thing with betas — Y = β0 + β1·X1 + ... + βn·Xn + ε, where β0 is the intercept, the βi are coefficients, and ε is the irreducible error. The machine-learning w/b notation and the statistics β notation describe the identical model.
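As a minimal sketch of the vector form, the snippet below evaluates ŷ = wᵀx + b with NumPy, first for one example and then for several rows at once. The weights, bias, and feature values are hypothetical numbers chosen only for illustration; they are not fitted from any data.

```python
import numpy as np

# Hypothetical weights and bias, chosen only to illustrate the formula.
w = np.array([45.0, 60.0])   # one weight per feature
b = 320.0                    # bias / intercept

# One example with two features.
x = np.array([3.0, 2.0])
y_hat = w @ x + b            # dot product of w and x, plus bias
print(y_hat)                 # 575.0

# Several examples at once: each row of X is one feature vector.
X = np.array([[0.0, 0.0],
              [3.0, 2.0],
              [10.0, 1.0]])
y_hat_all = X @ w + b
print(y_hat_all)             # [320. 575. 830.]
```

The first row of X is all zeros, so its prediction is just the bias b.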
| Aspect | Simple Linear Regression | Multiple Linear Regression |
|---|---|---|
| Number of features | 1 | 2 or more |
| Model shape | A line in 2D | A plane / hyperplane in n dimensions |
| Equation | ŷ = w·x + b | ŷ = wᵀx + b |
| Main extra concern | Is the relationship linear? | Multicollinearity between features |
| Coefficient meaning | Marginal effect of x | Partial effect, holding other features fixed |
The Cost Function: Mean Squared Error
To fit the line we need a way to score how good any given line is. For each training example the residual (error) is the gap between the true value and the prediction:
residualᵢ = yᵢ − ŷᵢ
We square each residual (so positive and negative errors do not cancel, and large errors are penalised more), then average across all m training examples. That average is the Mean Squared Error (MSE) — the cost function linear regression minimises:
MSE(w, b) = (1/m) · Σ (yᵢ − ŷᵢ)²
          = (1/m) · Σ (yᵢ − (wᵀxᵢ + b))²
where Σ sums over the training examples i = 1, ..., m
Some texts use 1/(2m) instead of 1/m; the extra 1/2 just makes the calculus tidier and does not change where the minimum sits. A related metric, Root Mean Squared Error (RMSE = √MSE), is popular because it is back in the same units as the target (₹, minutes, degrees) and is therefore easy to communicate.
Lower MSE → the line sits closer to the data → better fit
MSE = 0 → the line passes through every point exactly (rare, often overfitting)
The whole job of "training" is to search for the values of w and b that make MSE as small as possible. There are two standard ways to do that search.
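To make the scoring concrete, here is a small sketch that computes MSE and RMSE by hand for two candidate lines. The target values and both sets of predictions are made up for illustration, and the helper name mse is an assumption of this sketch, not a library function.

```python
import math

def mse(y_true, y_pred):
    """Mean of the squared residuals (y - y_hat)."""
    m = len(y_true)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / m

# Made-up prices (in lakhs) and the predictions of two candidate lines.
y_true = [42.0, 68.0, 75.0]
line_a = [40.0, 70.0, 78.0]   # residuals: 2, -2, -3
line_b = [50.0, 60.0, 90.0]   # residuals: -8, 8, -15

mse_a = mse(y_true, line_a)   # (4 + 4 + 9) / 3
mse_b = mse(y_true, line_b)   # (64 + 64 + 225) / 3
rmse_a = math.sqrt(mse_a)     # back in the target's units

print(mse_a, mse_b, rmse_a)
```

Line A has the lower MSE, so under this cost function it is the better of the two candidates.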
How the Line Is Fit
Method 1: The Normal Equation (closed-form solution)
Because MSE is a smooth, bowl-shaped (convex) function of the weights, calculus gives an exact formula for the minimising weights in one shot — no iteration required. Stacking all features into a matrix X (with a column of 1s for the bias) and all targets into a vector y, the optimal weight vector is:
w = (Xᵀ X)⁻¹ Xᵀ y
This is the normal equation. scikit-learn's LinearRegression solves the same least-squares problem under the hood, using an efficient SVD-based solver rather than a literal matrix inverse.
- Pros: exact answer, no learning rate to tune, no iterations.
- Cons: computing (Xᵀ X)⁻¹ costs roughly O(n³) in the number of features n, so it becomes slow when you have very many features; it also struggles when features are highly collinear (Xᵀ X is near-singular).
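The closed form can be sketched in a few lines of NumPy. The data below is made up and generated without noise from y = 2·x1 + 3·x2 + 5, so the recovered weights are known in advance. The variable names are assumptions of this sketch, and solving the linear system with np.linalg.solve is one reasonable way to apply the formula without forming the inverse explicitly.

```python
import numpy as np

# Made-up data generated exactly from y = 2*x1 + 3*x2 + 5 (no noise).
features = np.array([[1.0, 2.0],
                     [2.0, 1.0],
                     [3.0, 4.0],
                     [4.0, 3.0],
                     [5.0, 5.0]])
y = 2.0 * features[:, 0] + 3.0 * features[:, 1] + 5.0

# Prepend a column of 1s so the first entry of the solution is the bias.
X = np.column_stack([np.ones(len(features)), features])

# Normal equation, solved as a linear system instead of inverting X.T @ X.
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver: same problem, better behaved when X.T @ X is near-singular.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta_normal)  # approximately [5. 2. 3.]  (bias, w1, w2)
print(theta_lstsq)   # approximately [5. 2. 3.]
```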
Method 2: Gradient Descent (iterative optimisation)
When you have thousands of features, or more rows than fit comfortably in memory, the closed form becomes too expensive. Gradient descent instead starts from a guess and repeatedly nudges the weights downhill on the MSE surface until it reaches the bottom.
The intuition: imagine standing on a foggy hillside and wanting the lowest point. You feel the slope under your feet and take a small step in the steepest downhill direction. Repeat until the ground is flat. The gradient is that slope; the learning rate is your step size.
The partial derivatives of MSE give the update rules. Each iteration:
For each feature j:
wⱼ ← wⱼ − α · (∂MSE/∂wⱼ)
b ← b − α · (∂MSE/∂b)
where the gradients are:
∂MSE/∂wⱼ = (−2/m) · Σ (yᵢ − ŷᵢ) · xᵢⱼ
∂MSE/∂b = (−2/m) · Σ (yᵢ − ŷᵢ)
and α (alpha) is the learning rate, typically 0 < α < 1
Choosing α matters. Too small and training crawls; too large and the steps overshoot the minimum and the cost can diverge to infinity. A common range to try is α between about 0.0001 and 0.3.
α too small → many tiny steps, very slow convergence
α just right → smooth, steady decrease in MSE each iteration
α too large → MSE bounces around or explodes (diverges)
Variants you will meet: Batch GD uses all rows per step, Stochastic GD (SGD) uses one random row per step (noisier but fast on huge data), and Mini-batch GD uses small batches — the practical default in modern ML. Because features on wildly different scales distort the gradient, gradient descent needs feature scaling (standardisation) to converge well; see the Feature Engineering & Scaling chapter.
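The update rules above can be sketched as a short batch gradient descent loop in NumPy. This is an illustrative sketch, not how any particular library implements it: the function name, the default α of 0.1, the iteration count, and the noise-free synthetic data are all assumptions made for the example. Features are standardised first, then the learned weights are mapped back to the original scale.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, n_iters=2000):
    """Minimise MSE with batch gradient descent; returns (w, b)."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(n_iters):
        residuals = y - (X @ w + b)              # y_i - y_hat_i for every row
        grad_w = (-2.0 / m) * (X.T @ residuals)  # dMSE/dw_j for every feature j
        grad_b = (-2.0 / m) * residuals.sum()    # dMSE/db
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b

# Made-up, noise-free data from y = 3*x1 - 2*x2 + 4.
rng = np.random.default_rng(0)
X_raw = rng.uniform(0, 100, size=(50, 2))
y = 3.0 * X_raw[:, 0] - 2.0 * X_raw[:, 1] + 4.0

# Standardise so both features have mean 0 and standard deviation 1.
mu, sigma = X_raw.mean(axis=0), X_raw.std(axis=0)
X_std = (X_raw - mu) / sigma

w_std, b_std = batch_gradient_descent(X_std, y)

# Map the weights back to the original feature scale.
w = w_std / sigma
b = b_std - (w_std * mu / sigma).sum()
print(w, b)  # close to [3, -2] and 4
```

Running the same loop on X_raw instead of X_std with this α is a quick way to see why scaling matters.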
| | Normal Equation | Gradient Descent |
|---|---|---|
| Style | Closed-form, one shot | Iterative |
| Learning rate | Not needed | Must tune α |
| Cost in features n | About O(n³) | About O(n) per step |
| Very many features | Slow | Scales well |
| Very many rows | Fine (fits in memory) | Excellent (works with mini-batches) |
| Needs feature scaling | No | Yes |
| Used by | LinearRegression | SGDRegressor, deep learning |
Interpreting Coefficients and Intercept
The great strength of linear regression is that the fitted numbers mean something concrete.
- The intercept b is the predicted target when every feature equals 0. Often this is a mathematical anchor rather than a meaningful scenario (a flat with 0 sq ft does not exist), so interpret it with care.
- Each coefficient wⱼ is the change in the predicted target for a one-unit increase in that feature, holding all other features constant. The "holding others constant" part is what makes it a partial effect in multiple regression.
Suppose a salary model (in ₹ thousands) fits to:
salary = 320 + 45·(years_experience) + 60·(education_level)
Reading the coefficients:
• Intercept 320 → a fresher (0 years, education_level 0) is anchored at ₹3.2 lakh
• w = 45 → each extra year of experience adds ₹45,000, holding education fixed
• w = 60 → each extra education level adds ₹60,000, holding experience fixed
Two cautions. First, the magnitude of a coefficient depends on the feature's units: a coefficient on "area in sq m" is about 10.76 times the coefficient on "area in sq ft" (1 sq m ≈ 10.76 sq ft), even though the model is identical. To compare feature importances fairly, standardise the features first, or inspect standardised coefficients. Second, a large coefficient is not the same as a statistically significant one; significance testing (t-tests, p-values on coefficients) is covered in the Statistics tutorial.
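The sketch below checks the "holding others constant" reading numerically, using the salary equation from the example above. The function name predict_salary and the per-sq-ft slope of 0.071 are hypothetical values introduced only for this illustration.

```python
def predict_salary(years_experience, education_level):
    """Salary in thousands of rupees, using the example equation from the text."""
    return 320 + 45 * years_experience + 60 * education_level

base = predict_salary(3, 1)            # 320 + 135 + 60 = 515
one_more_year = predict_salary(4, 1)   # experience +1, education fixed
one_more_level = predict_salary(3, 2)  # education +1, experience fixed

print(one_more_year - base)    # 45
print(one_more_level - base)   # 60

# Unit dependence: the same slope expressed per sq ft and per sq m.
SQFT_PER_SQM = 10.764          # 1 sq m is about 10.764 sq ft
coef_per_sqft = 0.071          # hypothetical slope, lakhs per sq ft
coef_per_sqm = coef_per_sqft * SQFT_PER_SQM
print(round(coef_per_sqm, 3))  # 0.764
```

The two area coefficients describe the same model; only the unit of the feature changed.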
R-Squared: How Good Is the Fit?
MSE tells you the error in the target's units, but it is hard to judge in the abstract — is an MSE of 40 good? The coefficient of determination, R² (R-squared), gives a unit-free score of how much of the target's variability the model explains.
R² = 1 − SS_res / SS_tot = 1 − Σ (yᵢ − ŷᵢ)² / Σ (yᵢ − ȳ)²
where
SS_res = sum of squared residuals (model's errors)
SS_tot = total sum of squares (squared deviations of y from its mean ȳ)
Interpretation:
R² = 1.0 → the model explains 100% of the variance (perfect fit)
R² = 0.85 → the model explains 85% of the variance in the target
R² = 0.0 → the model is no better than always predicting the mean ȳ
R² < 0 → the model is worse than predicting the mean (possible on held-out data when the fit generalises badly)
For multiple regression, prefer Adjusted R², which penalises adding features that do not genuinely help. Plain R² on the training data never decreases when you add features, so it can flatter a bloated model. What counts as a "good" R² is domain-dependent: physics experiments may demand R² above 0.99, while noisy social or marketing data may treat 0.30 as useful.
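As a worked sketch, the snippet below computes R² from its definition and then applies the standard adjustment 1 − (1 − R²)·(m − 1)/(m − n − 1) for m rows and n features. The data values, the helper names, and the choice of n = 2 are assumptions made for illustration.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))
    ss_tot = sum((y - y_mean) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, m, n):
    """Adjusted R^2 for m rows and n features (requires m > n + 1)."""
    return 1 - (1 - r2) * (m - 1) / (m - n - 1)

# Made-up targets and predictions.
y_true = [10.0, 20.0, 30.0, 40.0, 50.0]
y_pred = [12.0, 18.0, 33.0, 37.0, 50.0]

r2 = r_squared(y_true, y_pred)               # SS_res = 26, SS_tot = 1000
adj = adjusted_r_squared(r2, m=5, n=2)       # assumes the model used 2 features
baseline = r_squared(y_true, [30.0] * 5)     # always predicting the mean

print(round(r2, 3), round(adj, 3), baseline)  # 0.974 0.948 0.0
```

The adjusted value is lower than plain R² because the penalty grows with the number of features relative to the number of rows.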
Assumptions of Linear Regression
Linear regression's coefficients and error estimates are trustworthy only when a few assumptions roughly hold. A handy mnemonic is LINE:
- Linearity — the true relationship between features and target is linear. Curved patterns need transformed features (log, polynomial terms) or a different model.
- Independence — observations are independent of each other (a concern with time-series or clustered data).
- Normality — the residuals are approximately normally distributed (matters most for inference and prediction intervals).
- Equal variance (homoscedasticity) — the spread of residuals is roughly constant across the range of predictions, not fanning out.
You check these mainly by plotting residuals versus fitted values (should be a random cloud around zero) and a Q-Q plot of residuals. The deeper treatment — formal tests, diagnostics, and remedies — is in the Statistics tutorial's regression chapter. For prediction-focused ML work, mild violations are often tolerable, but severe ones bias your coefficients and inflate error.
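A residual plot is the usual tool, but the same idea can be sketched numerically. The snippet below builds hypothetical data whose noise deliberately grows with x, fits a line with np.polyfit, and compares the residual spread in the lower and upper halves of the fitted values. This is a crude stand-in for looking at the plot, not a formal test, and the data and the half-split heuristic are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(1, 10, 200)

# Hypothetical data whose noise grows with x (unequal variance on purpose).
y = 5.0 + 2.0 * x + rng.normal(0, 0.5 * x)

# Fit a straight line; np.polyfit returns [slope, intercept] for deg=1.
slope, intercept = np.polyfit(x, y, deg=1)
fitted = slope * x + intercept
residuals = y - fitted

# Compare residual spread in the lower and upper halves of the fitted values.
order = np.argsort(fitted)
half = len(order) // 2
spread_low = residuals[order[:half]].std()
spread_high = residuals[order[half:]].std()

print(round(residuals.mean(), 6))  # approximately zero for a least-squares fit with intercept
print(spread_low, spread_high)     # a much larger second value suggests a funnel shape
```

On a residual-versus-fitted plot, this pattern would appear as points fanning out to the right, which is a sign that the equal-variance assumption is violated.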
Full Example with scikit-learn
Let's predict flat prices in ₹ lakhs from area, number of bedrooms, and building age. This uses the modern scikit-learn workflow: train_test_split, fit, predict, and metric functions.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
# --- 1. Build a small illustrative dataset (values are made up for teaching) ---
data = pd.DataFrame({
    "area_sqft":  [650, 900, 1100, 1500, 800, 1250, 2000, 1750, 950, 1400],
    "bedrooms":   [1, 2, 2, 3, 1, 2, 4, 3, 2, 3],
    "age_years":  [10, 5, 8, 2, 15, 6, 1, 3, 12, 4],
    "price_lakh": [42, 68, 75, 115, 50, 92, 170, 135, 70, 108],
})
X = data[["area_sqft", "bedrooms", "age_years"]]
y = data["price_lakh"]
# --- 2. Hold out a test set so we measure generalisation, not memorisation ---
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
# --- 3. Fit the model ---
model = LinearRegression()
model.fit(X_train, y_train)
# --- 4. Inspect the learned coefficients and intercept ---
print("Intercept (b):", round(model.intercept_, 2))
for feature, coef in zip(X.columns, model.coef_):
    print(f" {feature:>10}: {round(coef, 3)}")
# --- 5. Predict on the held-out test set ---
y_pred = model.predict(X_test)
# --- 6. Evaluate ---
print("MAE :", round(mean_absolute_error(y_test, y_pred), 2))
print("RMSE:", round(np.sqrt(mean_squared_error(y_test, y_pred)), 2))
print("R2 :", round(r2_score(y_test, y_pred), 3))
# --- 7. Predict the price of a new flat ---
new_flat = pd.DataFrame({"area_sqft": [1000], "bedrooms": [2], "age_years": [7]})
print("Predicted price (lakh):", round(model.predict(new_flat)[0], 2))
The console output looks like this (numbers are illustrative — your exact values will vary with the split):
Intercept (b): 5.31
area_sqft: 0.071
bedrooms: 6.842
age_years: -0.913
MAE : 4.10
RMSE: 5.02
R2 : 0.981
Predicted price (lakh): 83.6
Reading the coefficients: each extra sq ft adds about ₹7,100 to the price, each extra bedroom about ₹6.84 lakh, and each additional year of age lowers the price by about ₹91,000 — all sensible signs. The R2 of 0.981 says the model explains roughly 98% of the price variance on the test set.
Using a Pipeline with Scaling
When you later switch to gradient-descent-based SGDRegressor or add regularisation, wrap scaling and the model in a Pipeline so the scaler is fit on training data only and applied consistently at prediction time:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor
pipe = make_pipeline(
    StandardScaler(),  # scale features (essential for gradient descent)
    SGDRegressor(max_iter=1000, tol=1e-3, random_state=42)
)
pipe.fit(X_train, y_train)
print("R2 (SGD pipeline):", round(pipe.score(X_test, y_test), 3))
When to Use It — and Its Limitations
Reach for linear regression when:
- The target is continuous and the relationship with features is roughly linear.
- You need an interpretable model — stakeholders want to know why the number is what it is.
- You want a fast, low-variance baseline before trying fancier models.
- You have a modest number of features relative to rows.
| Strengths | Limitations |
|---|---|
| Simple, fast, cheap to train | Assumes a linear feature-target relationship |
| Fully interpretable coefficients | Sensitive to outliers (squared error punishes them) |
| Needs little data to get going | Cannot capture complex non-linear patterns unaided |
| A strong, honest baseline | Hurt by multicollinearity among features |
| Extends cleanly to regularised forms | Underfits genuinely complex data |
When linearity breaks down, options include adding polynomial features, transforming variables, using regularised variants (Ridge, Lasso — see the Overfitting & Regularization chapter), or moving to tree-based regressors like Random Forests.
Common Mistakes
- Forgetting to scale features before gradient descent. LinearRegression (normal equation) is scale-invariant, but SGDRegressor and regularised models are not. Unscaled features make gradient descent converge slowly or not at all. Standardise inside a pipeline.
- Extrapolating far outside the training range. A model fit on flats of 650 to 2000 sq ft should not be trusted to price a 10000 sq ft mansion. The linear pattern may not hold there, and predictions become unreliable.
- Judging fit on the training set only. A high training R² can hide overfitting. Always report metrics on a held-out test set (see the Train-Test Split & Cross-Validation chapter).
- Comparing raw coefficient magnitudes across differently-scaled features. A coefficient on "area in sq ft" looks tiny next to "bedrooms", yet may matter more. Standardise features before comparing importances.
- Ignoring outliers. Because MSE squares errors, one extreme point can drag the whole line toward it. Inspect residuals and consider robust alternatives if outliers are genuine anomalies.
- Using linear regression for a categorical target. Predicting a yes/no or class label with linear regression is a modelling error; use Logistic Regression for classification instead.
Practice Exercises
- Simple fit by hand. For X = [1, 2, 3, 4, 5] and Y = [2, 4, 5, 4, 6], fit LinearRegression with scikit-learn, then print coef_ and intercept_. Predict Y at X = 6 and state whether this is interpolation or extrapolation.
- Interpret coefficients. A model fits sales = 12 + 3.5·(ad_spend) + 1.2·(store_size). Explain in one sentence each what the intercept, 3.5, and 1.2 mean. State the assumption implied by the phrase "holding other features constant".
- Gradient descent intuition. You train an SGDRegressor and the loss increases every iteration and eventually becomes nan. Which hyperparameter is almost certainly wrong, and in which direction should you change it? What preprocessing step might also be missing?
- Metrics. Given a model with SS_res = 120 and SS_tot = 800, compute R² by hand and interpret it. Would you prefer this over a model with R² = 0.60? What extra metric would you look at?
- Pipeline. Build a make_pipeline(StandardScaler(), LinearRegression()) on any dataset with features of very different scales, and confirm the test R² matches a plain LinearRegression (scaling should not change LinearRegression's predictions; explain why).
- Residual check. After fitting, plot residuals (y_test − y_pred) against y_pred. Describe what a healthy plot looks like and what a funnel shape would indicate about the LINE assumptions.
Summary
- Linear regression predicts a continuous target by fitting a line/hyperplane: ŷ = wᵀx + b.
- Simple regression uses one feature; multiple regression uses many, and each coefficient is a partial effect holding other features constant.
- The model is fit by minimising the Mean Squared Error, MSE = (1/m)·Σ(yᵢ − ŷᵢ)².
- Two fitting methods: the normal equation w = (XᵀX)⁻¹Xᵀy (exact, one shot; what LinearRegression uses) and gradient descent (iterative, scales to huge data, needs a learning rate α and feature scaling).
- The gradient-descent update is wⱼ ← wⱼ − α·(∂MSE/∂wⱼ); too-large α diverges, too-small α crawls.
- Coefficients and the intercept are directly interpretable; standardise features before comparing their magnitudes.
- R² (1 − SS_res/SS_tot) reports the fraction of target variance explained; use Adjusted R² with many features.
- The LINE assumptions (Linearity, Independence, Normality of residuals, Equal variance) should roughly hold; the deeper treatment is in the Statistics tutorial.
- In scikit-learn: train_test_split → LinearRegression().fit() → .coef_ / .intercept_ → .predict() → r2_score / mean_squared_error; wrap scaling in a Pipeline for gradient-descent variants.
- Watch for extrapolation, outliers, unscaled features with SGD, and never use it for categorical targets.
Linear regression is your interpretable, fast baseline for any numeric-prediction problem — master it and you understand the backbone of many more advanced models.
Next up: Logistic Regression — despite the name, it is a classification algorithm; you will see how it reuses the linear equation but squashes the output through a sigmoid to predict probabilities and class labels.