Feature Engineering & Scaling

What Is Feature Engineering?

Feature engineering is the process of transforming raw data into features — the numeric inputs a machine learning model actually learns from. It covers creating new features from what you already have, transforming existing ones so the model can use them better, and selecting the subset that carries real signal.

There is a well-worn saying in data science:

Applied machine learning is basically feature engineering. — Andrew Ng

The intuition: a model is only as good as what you feed it. Two data scientists can use the exact same algorithm on the exact same dataset and get wildly different results — the difference is almost always in the features. Think of it like cooking. The algorithm is your oven, but the features are your ingredients. A great oven cannot rescue rotten vegetables, and a mediocre oven does wonders with fresh, well-prepped produce.

Feature creation — build new columns (e.g. price_per_sqft from price and area)
Feature transformation — reshape existing columns (scaling, log transforms, binning)
Feature selection — keep the columns that help and drop the noise

Raw data                 Engineered features            Model
────────────             ────────────────────           ──────
transaction_date    →    day_of_week, is_weekend    →
amount, num_items   →    avg_item_value             →   better
address text        →    address_length, has_pincode →  predictions

In the previous chapter, Data Preprocessing & Cleaning, you handled missing values, outliers and encoding. This chapter assumes your data is already clean and turns it into features a model can actually exploit. Later chapters — Linear Regression, K-Nearest Neighbors, Support Vector Machines — will all lean heavily on the scaling techniques you learn here.

Creating Features from What You Have

The best features usually come from domain knowledge, not from an algorithm. A single new column that captures a real-world relationship often beats a fancy model. Here are the high-yield patterns.

Ratios and Aggregations

Raw columns in isolation are often less informative than their combinations. A house that costs ₹80,00,000 tells you little until you divide by area.

import pandas as pd

df = pd.DataFrame({
    "name":  ["Priya", "Rahul", "Anjali", "Vikram"],
    "price": [8000000, 12000000, 6000000, 9500000],   # ₹
    "area":  [1000, 1500, 800, 1100],                  # sq ft
    "bedrooms": [2, 3, 2, 3],
})

# Ratio feature: price per square foot (comparable across sizes)
df["price_per_sqft"] = df["price"] / df["area"]

# Ratio feature: area per bedroom (a "spaciousness" signal)
df["area_per_bedroom"] = df["area"] / df["bedrooms"]

print(df[["name", "price_per_sqft", "area_per_bedroom"]])

     name  price_per_sqft  area_per_bedroom
0   Priya     8000.000000        500.000000
1   Rahul     8000.000000        500.000000
2  Anjali     7500.000000        400.000000
3  Vikram     8636.363636        366.666667

Notice how Priya's and Rahul's homes cost the same per square foot despite very different absolute prices: the ratio surfaced a pattern the raw numbers hid. For grouped data, aggregations are just as powerful. For each customer, compute total_spend, avg_order_value, orders_last_30_days, or days_since_last_order.
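
The customer-level aggregations just mentioned can be sketched with a groupby. The orders table, its column names, and the "as of" date below are made up for illustration; adapt them to your own schema.

```python
import pandas as pd

# Hypothetical order log: one row per order (all values are illustrative)
orders = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "amount":      [500, 1500, 1000, 200, 400],
    "order_date":  pd.to_datetime(
        ["2026-05-01", "2026-06-20", "2026-07-01", "2026-04-15", "2026-06-28"]),
})
today = pd.Timestamp("2026-07-02")   # assumed reference date for recency features

# Flag orders placed in the 30 days before the reference date
recent = orders["order_date"] >= today - pd.Timedelta(days=30)

customer_features = orders.assign(is_recent=recent).groupby("customer_id").agg(
    total_spend=("amount", "sum"),
    avg_order_value=("amount", "mean"),
    orders_last_30_days=("is_recent", "sum"),
    last_order=("order_date", "max"),
)
customer_features["days_since_last_order"] = (
    today - customer_features["last_order"]).dt.days
customer_features = customer_features.drop(columns="last_order")

print(customer_features)
```

Each row of customer_features is now one customer, ready to be joined back onto a modelling table.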

Datetime Features

A raw timestamp like 2026-07-02 14:30:00 is nearly useless to a model as-is. Break it into parts that carry seasonality and behaviour.

df = pd.DataFrame({"order_time": pd.to_datetime(
    ["2026-07-02 14:30", "2026-07-05 21:15", "2026-07-06 09:00"])})

df["hour"]        = df["order_time"].dt.hour          # 0..23
df["day_of_week"] = df["order_time"].dt.dayofweek     # 0=Mon .. 6=Sun
df["is_weekend"]  = df["day_of_week"].isin([5, 6]).astype(int)
df["month"]       = df["order_time"].dt.month
df["is_month_end"]= df["order_time"].dt.is_month_end.astype(int)

print(df[["hour", "day_of_week", "is_weekend"]])

   hour  day_of_week  is_weekend
0    14            3           0
1    21            6           1
2     9            0           0

For cyclical features like hour or month, a common trick is to encode them as sine/cosine pairs so that hour 23 sits next to hour 0:

hour_sin = sin(2 * pi * hour / 24)
hour_cos = cos(2 * pi * hour / 24)
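
A runnable version of the formulas above, as a minimal sketch. The helper names encode_hour and circle_distance are hypothetical; they only exist to show that hour 23 ends up much closer to hour 0 than hour 12 does.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"hour": [0, 6, 12, 23]})

# Map the 24-hour clock onto a circle
angle = 2 * np.pi * df["hour"] / 24
df["hour_sin"] = np.sin(angle)
df["hour_cos"] = np.cos(angle)

def encode_hour(hour):
    """Return the (sin, cos) pair for a single hour."""
    a = 2 * np.pi * hour / 24
    return np.array([np.sin(a), np.cos(a)])

def circle_distance(h1, h2):
    """Euclidean distance between two hours in (sin, cos) space."""
    return float(np.linalg.norm(encode_hour(h1) - encode_hour(h2)))

print(df.round(3))
print("23 vs 0:", round(circle_distance(23, 0), 3))   # neighbours on the circle
print("12 vs 0:", round(circle_distance(12, 0), 3))   # opposite sides of the circle
```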

Text Length and Simple Text Features

You do not need full NLP to squeeze signal out of text. Cheap counts often predict surprisingly well — a very short product review or a spammy-looking subject line is informative.

df = pd.DataFrame({"review": [
    "Great product, works perfectly!",
    "Bad. Waste of money.",
    "The build quality is excellent and delivery was on time, highly recommend to everyone",
]})

df["char_count"]  = df["review"].str.len()
df["word_count"]  = df["review"].str.split().str.len()
df["excl_count"]  = df["review"].str.count("!")
df["avg_word_len"]= df["char_count"] / df["word_count"]

print(df[["char_count", "word_count", "excl_count"]])

   char_count  word_count  excl_count
0          31           4           1
1          20           4           0
2          85          14           0

Interaction and Polynomial Features

An interaction feature is the product of two features — it lets the model capture "the effect of A depends on the level of B". A polynomial feature raises a feature to a power, letting a linear model bend to fit curves.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = pd.DataFrame({"area": [1000, 1500, 800], "bedrooms": [2, 3, 2]})

# degree=2 creates: area, bedrooms, area^2, area*bedrooms, bedrooms^2
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

print(poly.get_feature_names_out())
print(X_poly)

['area' 'bedrooms' 'area^2' 'area bedrooms' 'bedrooms^2']
[[1.000e+03 2.000e+00 1.000e+06 2.000e+03 4.000e+00]
 [1.500e+03 3.000e+00 2.250e+06 4.500e+03 9.000e+00]
 [8.000e+02 2.000e+00 6.400e+05 1.600e+03 4.000e+00]]

Polynomial features are powerful but explode quickly: degree=2 on 100 features produces over 5,000 columns. Keep the degree low (2 or 3) and prefer creating a few hand-picked interactions you can justify.
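
To see the explosion for yourself, the sketch below counts the output columns for 100 input features. The input is a dummy array of zeros, since only the shape matters here.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# 100 dummy features; the values are irrelevant, only the column count matters
X_wide = np.zeros((1, 100))

counts = {}
for degree in (2, 3):
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    counts[degree] = poly.fit_transform(X_wide).shape[1]
    print(f"degree={degree}: {counts[degree]} columns")
```

Going from degree 2 to degree 3 multiplies the column count again, which is why a handful of justified interactions is usually the better trade.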

Feature Scaling: Putting Features on the Same Playing Field

Many features live on very different scales. In a housing dataset, area ranges from 800 to 5000 while bedrooms ranges from 1 to 5. To a distance-based model, area will completely dominate simply because its numbers are bigger — not because it is more important. Scaling rescales features to comparable ranges so no single feature bullies the rest.

StandardScaler (Z-Score Standardization)

StandardScaler transforms each feature to have mean 0 and standard deviation 1. This is the most common scaler and the default choice for most models.

z = (x - mean) / std

where:
mean = average of the feature (computed on TRAIN data)
std  = standard deviation of the feature (computed on TRAIN data)

from sklearn.preprocessing import StandardScaler

X = pd.DataFrame({"area": [1000, 1500, 800, 2000],
                  "bedrooms": [2, 3, 2, 4]})

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.round(2))
print("means:", X_scaled.mean(axis=0).round(2))
print("stds :", X_scaled.std(axis=0).round(2))

[[-0.7  -0.9 ]
 [ 0.38  0.3 ]
 [-1.13 -0.9 ]
 [ 1.45  1.51]]
means: [ 0. -0.]
stds : [1. 1.]

The result is unbounded — a value can be -3 or +5 — but it is centred and comparably spread. Use this when your features are roughly bell-shaped.
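
As a sanity check, the z-score formula can be applied by hand and compared with StandardScaler. One detail worth knowing: StandardScaler divides by the population standard deviation (ddof=0), whereas pandas' .std() defaults to ddof=1, so the sketch passes ddof=0 explicitly. The variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

X_demo = pd.DataFrame({"area": [1000, 1500, 800, 2000],
                       "bedrooms": [2, 3, 2, 4]})

# Manual z-score with the population standard deviation (ddof=0)
manual = (X_demo - X_demo.mean()) / X_demo.std(ddof=0)

sk = StandardScaler().fit_transform(X_demo)

print(manual.round(2))
print("matches StandardScaler:", np.allclose(manual.to_numpy(), sk))
```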

MinMaxScaler (Normalization)

MinMaxScaler squeezes every feature into a fixed range, usually [0, 1]. This is often called normalization.

x_scaled = (x - min) / (max - min)

Result lies in the range 0 to 1
(min and max are learned from TRAIN data)

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()          # default range is 0 to 1
X_scaled = scaler.fit_transform(X)

print(X_scaled.round(2))

[[0.17 0.  ]
 [0.58 0.5 ]
 [0.   0.  ]
 [1.   1.  ]]

Use MinMax when you need a bounded range — for example, pixel intensities for a neural network, or algorithms that assume inputs in [0, 1]. Its weakness: it is very sensitive to outliers, because a single huge value stretches the whole range and squashes everyone else near 0.
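
The outlier sensitivity is easy to demonstrate. In this sketch (made-up areas, with one extreme value added on purpose), the four ordinary values end up squeezed into a sliver near 0.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Illustrative data: four ordinary areas plus one extreme value
areas = pd.DataFrame({"area": [800, 1000, 1500, 2000, 50000]})

scaled = MinMaxScaler().fit_transform(areas).ravel()
print(scaled.round(3))   # the outlier maps to 1; the rest crowd near 0
```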

RobustScaler

RobustScaler centres on the median and scales by the interquartile range (IQR), which makes it resistant to outliers.

x_scaled = (x - median) / IQR

where IQR = Q3 - Q1  (the 75th minus the 25th percentile)

from sklearn.preprocessing import RobustScaler

# One extreme outlier in income (a millionaire in a middle-class dataset)
X = pd.DataFrame({"income": [30000, 45000, 50000, 55000, 2000000]})

print("Standard:", StandardScaler().fit_transform(X).ravel().round(2))
print("Robust  :", RobustScaler().fit_transform(X).ravel().round(2))

Standard: [-0.52 -0.5  -0.49 -0.49  2.  ]
Robust  : [-2.   -0.5   0.    0.5  195. ]

Notice how StandardScaler let the outlier compress the four normal incomes into a tight -0.52 to -0.49 band. RobustScaler keeps the four ordinary values nicely spread (-2 to 0.5) and isolates the outlier, which is exactly what you want when outliers are present but you cannot remove them.
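
To connect the formula to the output, here is the same income column scaled by hand with the median and IQR, then compared against RobustScaler. This is a sketch using NumPy's default percentile interpolation, which agrees with RobustScaler on this data.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

income = np.array([30000, 45000, 50000, 55000, 2000000], dtype=float)

median = np.median(income)
q1, q3 = np.percentile(income, [25, 75])
manual = (income - median) / (q3 - q1)      # (x - median) / IQR

sk = RobustScaler().fit_transform(income.reshape(-1, 1)).ravel()

print("median:", median, "IQR:", q3 - q1)
print("manual:", manual)
print("matches RobustScaler:", np.allclose(manual, sk))
```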

Choosing a Scaler

Scaler	Formula	Output range	Centred on	Outlier-robust?	When to use
`StandardScaler`	`(x - mean) / std`	unbounded	mean	No	Default; roughly Gaussian features; linear/logistic regression, SVM, KNN, PCA
`MinMaxScaler`	`(x - min) / (max - min)`	`[0, 1]`	none	No	Bounded inputs needed; neural nets; features with known bounds (e.g. pixel intensities)
`RobustScaler`	`(x - median) / IQR`	unbounded	median	Yes	Features with strong outliers you cannot drop
`MaxAbsScaler`	`x / max(abs(x))`	`[-1, 1]`	none (no shift)	No	Sparse matrices (preserves zeros); text TF-IDF

Which Models Need Scaling — and Which Do Not

This is one of the most misunderstood topics for beginners. The rule is about how the algorithm uses the numbers.

Scaling matters for distance-based, gradient-based, and variance-based models, because they compare or combine raw magnitudes:

Distance-based (KNN, K-Means, SVM with RBF kernel) compute distances like Euclidean distance. A feature measured in the thousands drowns out a feature measured in single digits.

Euclidean distance between two houses A and B:
d = sqrt((area_A - area_B)^2 + (bedrooms_A - bedrooms_B)^2)

If area differs by 500 and bedrooms by 2:
d = sqrt(500^2 + 2^2) = sqrt(250000 + 4) ~ 500.004

bedrooms is completely invisible until you scale.

Gradient-based (Linear Regression via gradient descent, Logistic Regression, Neural Networks) converge much faster on scaled data because the loss surface becomes rounder instead of a long thin valley.
PCA finds directions of maximum variance, so unscaled features with large variance falsely dominate — always scale before PCA.
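
The Euclidean distance arithmetic above can be reproduced in a few lines. The five houses are invented for illustration; the first two differ by 500 in area and 2 in bedrooms, as in the worked example. The sketch reports how much of the squared distance comes from area before and after standardizing.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative houses: columns are [area, bedrooms]
houses = np.array([
    [1000, 2],   # house A
    [1500, 4],   # house B: area differs by 500, bedrooms by 2
    [ 800, 1],
    [2000, 5],
    [1200, 3],
], dtype=float)

a, b = houses[0], houses[1]
raw_dist = np.linalg.norm(a - b)
area_share_raw = (a[0] - b[0]) ** 2 / raw_dist ** 2

scaled = StandardScaler().fit_transform(houses)
sa, sb = scaled[0], scaled[1]
scaled_dist = np.linalg.norm(sa - sb)
area_share_scaled = (sa[0] - sb[0]) ** 2 / scaled_dist ** 2

print(f"raw distance   : {raw_dist:.3f} (area share of squared distance: {area_share_raw:.4%})")
print(f"scaled distance: {scaled_dist:.3f} (area share of squared distance: {area_share_scaled:.1%})")
```

After scaling, bedrooms contributes a meaningful part of the distance instead of a rounding error.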

Scaling is NOT needed for tree-based models — Decision Trees, Random Forests, and gradient-boosted trees like XGBoost. A tree splits on thresholds ("is area greater than 1200?"), and the split logic is unchanged whether area is in square feet or scaled units. Monotonic transformations do not affect where the tree chooses to split.
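
A quick way to convince yourself: train the same tree on raw and on standardized features and compare predictions on unseen rows. The data below is synthetic and the labelling rule is arbitrary; the point of the sketch is only that both trees should partition the rows identically.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic housing-like data (illustrative only)
area = rng.uniform(800, 5000, size=300)
bedrooms = rng.integers(1, 6, size=300)
X_raw = np.column_stack([area, bedrooms])
y_all = (area / bedrooms > 900).astype(int)   # arbitrary rule to create labels

X_tr, X_new = X_raw[:200], X_raw[200:]
y_tr = y_all[:200]

scaler = StandardScaler().fit(X_tr)           # fitted on the training rows only

tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(
    scaler.transform(X_tr), y_tr)

same = np.array_equal(tree_raw.predict(X_new),
                      tree_scaled.predict(scaler.transform(X_new)))
print("identical predictions:", same)
```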

Model family	Examples	Needs scaling?
Distance-based	KNN, K-Means, SVM (RBF)	Yes
Gradient-based	Linear/Logistic Regression, Neural Nets	Yes (helps a lot)
Variance-based	PCA, LDA	Yes
Tree-based	Decision Tree, Random Forest, XGBoost	No
Probabilistic	Naive Bayes	Usually no

Binning and Discretization

Binning (or discretization) converts a continuous feature into discrete buckets. This can help capture non-linear effects, reduce the impact of small measurement noise, and make features more interpretable.

df = pd.DataFrame({"age": [22, 35, 41, 58, 67, 19, 45]})

# Custom bin edges with human-readable labels
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 25, 40, 60, 120],
    labels=["young", "adult", "middle_aged", "senior"],
)

# Equal-frequency bins: each bucket holds ~the same number of rows
df["age_quartile"] = pd.qcut(df["age"], q=4, labels=False)

print(df)

   age    age_group  age_quartile
0   22        young             0
1   35        adult             1
2   41  middle_aged             1
3   58  middle_aged             3
4   67       senior             3
5   19        young             0
6   45  middle_aged             2

pd.cut bins by value: pass explicit edges (as above) for custom ranges, or an integer to get that many equal-width bins.
pd.qcut creates equal-frequency bins so each bucket has a similar count.
For a pipeline, sklearn.preprocessing.KBinsDiscretizer does the same and can output ordinal or one-hot encoded bins.
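
A minimal KBinsDiscretizer sketch on the same ages, using four equal-width bins and ordinal output. The parameter choices here (n_bins=4, strategy="uniform") are just one reasonable configuration; strategy="quantile" gives equal-frequency bins instead.

```python
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

ages = pd.DataFrame({"age": [22, 35, 41, 58, 67, 19, 45]})

# Four equal-width bins between the min and max age, returned as codes 0..3
binner = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="uniform")
age_bin = binner.fit_transform(ages).ravel().astype(int)

print("bin edges:", binner.bin_edges_[0])
print("bin codes:", age_bin)
```

Because the bin edges are learned in fit, the same edges are reused on new data when the discretizer sits inside a Pipeline.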

Binning trades information for robustness — you lose the exact value but gain resistance to noise. Use it when the relationship is genuinely stepwise (age brackets for insurance, income slabs for tax).

Log Transforms for Skewed Features

Many real-world features — income, house price, page views, transaction amount — are right-skewed: a long tail of large values. Models that assume roughly symmetric inputs struggle with this. A log transform compresses the tail and pulls the distribution closer to symmetric.

For strictly positive values:
x_transformed = log(x)

When zeros are present, use log1p (log of 1 + x): a zero input maps to 0 instead of hitting the undefined log(0):
x_transformed = log(1 + x)

import numpy as np

income = pd.Series([25000, 40000, 60000, 90000, 2500000])   # heavy right tail

log_income = np.log1p(income)   # log(1 + x), safe for zeros

print("skewness raw :", round(income.skew(), 2))
print("skewness log :", round(log_income.skew(), 2))

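skewness raw : 2.23
skewness log : 1.87

The skewness dropped from 2.23 to 1.87. The log pulled the tail in, but with only five points and one value more than 25 times the next largest, the sample is still right-skewed; on larger, genuinely long-tailed columns the reduction is usually far more pronounced. To reverse the transform on a prediction, apply np.expm1 (the inverse of log1p). Other options for skew include the square root and the Box-Cox / Yeo-Johnson transforms (sklearn.preprocessing.PowerTransformer), which estimate the best power from the data (Box-Cox requires strictly positive values).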
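
The log1p / expm1 round trip can be checked directly. The page_views values below are illustrative and deliberately include a zero.

```python
import numpy as np

page_views = np.array([0, 3, 10, 250, 12000], dtype=float)   # includes a zero

transformed = np.log1p(page_views)   # log(1 + x); a zero stays at 0
restored = np.expm1(transformed)     # exp(y) - 1; undoes log1p

print(transformed.round(3))
print("round trip ok:", np.allclose(restored, page_views))
```

The same expm1 call is what you would apply to a model's predictions if the target was log1p-transformed before training.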

Feature Selection Basics

More features are not always better. Irrelevant or redundant features add noise, slow training, increase overfitting risk, and make models harder to explain. Feature selection keeps the columns that carry signal. Here are four practical starting points.

Variance Threshold

A feature that barely changes across rows carries almost no information. VarianceThreshold drops near-constant features.

from sklearn.feature_selection import VarianceThreshold

X = pd.DataFrame({
    "area":       [1000, 1500, 800, 2000],
    "has_water":  [1, 1, 1, 1],       # constant → zero variance, useless
    "bedrooms":   [2, 3, 2, 4],
})

selector = VarianceThreshold(threshold=0.0)   # drop zero-variance columns
X_reduced = selector.fit_transform(X)

print("kept columns:", X.columns[selector.get_support()].tolist())

kept columns: ['area', 'bedrooms']

Correlation Filtering

If two features are highly correlated (say |r| > 0.9), they carry almost the same information. Keeping both adds redundancy and can destabilise linear models (multicollinearity). Drop one of each highly correlated pair.

corr = X.corr().abs()
# Look at the upper triangle and flag pairs above the threshold
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [c for c in upper.columns if any(upper[c] > 0.9)]
print("drop candidates:", to_drop)

SelectKBest (Univariate Statistical Tests)

SelectKBest scores each feature against the target using a statistical test and keeps the top k. Use f_classif / mutual_info_classif for classification and f_regression for regression.

from sklearn.feature_selection import SelectKBest, f_classif

# X_train, y_train assumed prepared; keep the 10 best-scoring features
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X_train, y_train)

scores = pd.Series(selector.scores_, index=X_train.columns)
print(scores.sort_values(ascending=False).head())
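
If you want to run the idea end to end without a prepared training set, here is a self-contained sketch on synthetic data. make_classification stands in for real data, and the column names f0..f19 are invented.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for a training set: 20 features, 5 of them informative.
# shuffle=False places the informative columns first (f0..f4).
X_arr, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, n_redundant=0,
    shuffle=False, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(20)])

selector = SelectKBest(score_func=f_classif, k=5).fit(X_demo, y_demo)
kept = X_demo.columns[selector.get_support()].tolist()

print("kept:", kept)
```

Comparing the kept columns with f0..f4 gives a feel for how well a univariate test recovers the truly informative features.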

Model-Based Importance

Tree-based models expose a feature_importances_ attribute; linear models expose coefficients. Fit a quick model and let it rank the features. SelectFromModel automates keeping the important ones.

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)

importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head())

# Automatically keep features above the median importance
selector = SelectFromModel(rf, threshold="median", prefit=True)
X_selected = selector.transform(X_train)

Model-based importance captures non-linear relationships that univariate tests miss, but it can be biased toward high-cardinality features — cross-check with domain sense. You will meet these importances again in the Random Forests and Ensemble Methods chapters.

Putting It Together with a Pipeline

The single most important scaling rule: fit your scaler on the training data only, then apply the same fitted scaler to the test data. If you fit on the full dataset, information about the test set (its mean, min, max) leaks into training. This is data leakage, and it inflates your scores. A Pipeline makes this mistake hard to commit: calling fit learns the scaler only from the data you pass in, and during cross-validation the whole pipeline is re-fit on each training fold automatically.

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

pipe = Pipeline([
    ("scaler", StandardScaler()),           # fit on train, transform train + test
    ("model",  LogisticRegression()),
])

pipe.fit(X_train, y_train)                  # scaler.fit uses ONLY X_train
print("test accuracy:", pipe.score(X_test, y_test))
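
The per-fold re-fitting can be seen by handing the whole pipeline to cross_val_score. This is a sketch on synthetic data (make_classification stands in for X and y).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for X, y
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])

# For each of the 5 folds, the whole pipeline (scaler included) is re-fit
# on that fold's training part and scored on its held-out part.
scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print("fold accuracies:", scores.round(3))
print("mean accuracy  :", scores.mean().round(3))
```

Scaling X_demo once up front and then cross-validating only the model would let every fold's scaler see its own validation rows, which is the leak the pipeline avoids.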

The proper train/test split and cross-validation that make this trustworthy are the subject of the very next chapter.

Common Mistakes

1. Fitting the scaler on the whole dataset (data leakage)

WRONG:
X_scaled = scaler.fit_transform(X)              # scaler sees test rows too
X_train, X_test = train_test_split(X_scaled)    # leakage → optimistic scores

RIGHT:
X_train, X_test = train_test_split(X)
scaler.fit(X_train)                             # learn params from train only
X_train = scaler.transform(X_train)
X_test  = scaler.transform(X_test)              # reuse the SAME fitted scaler

2. Scaling tree-based models "just in case"

Scaling the inputs to a Random Forest or XGBoost model wastes effort and adds a moving part for no gain, because tree splits are invariant to monotonic scaling. Skip it.

3. Scaling the target variable by accident

Scale features (X), not the target (y), unless you deliberately intend to and remember to inverse-transform predictions back to the original units.
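
When you do want a scaled target, scikit-learn's TransformedTargetRegressor handles the inverse transform for you. The sketch below uses synthetic prices, and the choice of StandardScaler for the target is only an example.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.uniform(800, 3000, size=(50, 1))                  # area in sq ft (synthetic)
y_demo = 8000 * X_demo.ravel() + rng.normal(0, 1e5, size=50)   # price in ₹ (synthetic)

# The wrapper scales y before fitting and inverse-transforms every prediction
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    transformer=StandardScaler(),
)
model.fit(X_demo, y_demo)

pred = model.predict(np.array([[1000.0]]))
print(pred)   # already back in ₹, not in z-score units
```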

4. Applying MinMaxScaler on data with outliers

A single extreme value stretches the [0, 1] range and crushes every normal value near 0. Use RobustScaler, or handle the outlier first.

5. Building interaction/polynomial features blindly

PolynomialFeatures(degree=3) on many columns creates thousands of features, most of them noise. This causes overfitting and slow training. Keep the degree low and prefer a few justified interactions.

6. Encoding datetime as a raw integer

Feeding a Unix timestamp or 20260702 directly gives the model a meaningless huge number. Decompose it into day_of_week, hour, is_weekend, month instead.

Practice Exercises

Given an e-commerce table with order_amount and num_items, create an avg_item_value feature and explain why it may predict returns better than order_amount alone.
You have a signup_timestamp column. List five datetime features you would extract, and describe one that would help predict whether a user signs up during a promotion.
A dataset has salary ranging ₹20,000 to ₹5,00,000 with three billionaire outliers. Which scaler would you choose and why? Write the formula it uses.
You are training a KNN classifier and a Random Forest on the same features. For which model must you scale, and why does the other not need it?
A feature is_active equals 1 for 998 of 1000 rows. Which selection technique flags it, and what would happen to a model if you kept it?
Given a right-skewed page_views column containing some zeros, write the transform that reduces skew without failing on the zeros, and state how to reverse it for predictions.

Summary

In this chapter you learned:

Feature engineering = creating, transforming, and selecting features; it usually matters more than the choice of algorithm.
Create features from ratios (price_per_sqft), aggregations (avg_order_value), datetime parts (hour, is_weekend), text stats (word_count), and interaction/polynomial terms.
StandardScaler gives mean 0, std 1 via (x - mean) / std — the default for most models.
MinMaxScaler squeezes to [0, 1] via (x - min) / (max - min) — bounded but outlier-sensitive.
RobustScaler centres on the median and scales by the IQR via (x - median) / IQR — resistant to outliers.
Distance-based (KNN, SVM, K-Means), gradient-based (regression, neural nets), and variance-based (PCA) models need scaling; tree-based models (Decision Tree, Random Forest, XGBoost) do not.
Binning discretizes continuous features; log transforms (log1p / expm1) tame right skew.
Feature selection — variance threshold, correlation filtering, SelectKBest, and model-based importance — trims noise and reduces overfitting.
Always fit the scaler on training data only, then reuse it on the test set; a Pipeline enforces this and prevents data leakage.

Great features and honest scaling set you up for reliable models — but only if you evaluate them correctly.

Next up: Train-Test Split & Cross-Validation — how to split your data, avoid leakage, and use k-fold cross-validation to get an unbiased estimate of how your model will perform on data it has never seen.