What Is Feature Engineering?
Feature engineering is the process of transforming raw data into features — the numeric inputs a machine learning model actually learns from. It covers creating new features from what you already have, transforming existing ones so the model can use them better, and selecting the subset that carries real signal.
There is a well-worn saying in data science:
Applied machine learning is basically feature engineering. — Andrew Ng
The intuition: a model is only as good as what you feed it. Two data scientists can use the exact same algorithm on the exact same dataset and get wildly different results — the difference is almost always in the features. Think of it like cooking. The algorithm is your oven, but the features are your ingredients. A great oven cannot rescue rotten vegetables, and a mediocre oven does wonders with fresh, well-prepped produce.
- Feature creation: build new columns (e.g. price_per_sqft from price and area)
- Feature transformation: reshape existing columns (scaling, log transforms, binning)
- Feature selection: keep the columns that help and drop the noise
Raw data Engineered features Model
──────────── ──────────────────── ──────
transaction_date → day_of_week, is_weekend →
amount, num_items → avg_item_value → better
address text → address_length, has_pincode → predictions
In the previous chapter, Data Preprocessing & Cleaning, you handled missing values, outliers and encoding. This chapter assumes your data is already clean and turns it into features a model can actually exploit. Later chapters — Linear Regression, K-Nearest Neighbors, Support Vector Machines — will all lean heavily on the scaling techniques you learn here.
Creating Features from What You Have
The best features usually come from domain knowledge, not from an algorithm. A single new column that captures a real-world relationship often beats a fancy model. Here are the high-yield patterns.
Ratios and Aggregations
Raw columns in isolation are often less informative than their combinations. A house that costs ₹80,00,000 tells you little until you divide by area.
import pandas as pd
df = pd.DataFrame({
"name": ["Priya", "Rahul", "Anjali", "Vikram"],
"price": [8000000, 12000000, 6000000, 9500000], # ₹
"area": [1000, 1500, 800, 1100], # sq ft
"bedrooms": [2, 3, 2, 3],
})
# Ratio feature: price per square foot (comparable across sizes)
df["price_per_sqft"] = df["price"] / df["area"]
# Ratio feature: area per bedroom (a "spaciousness" signal)
df["area_per_bedroom"] = df["area"] / df["bedrooms"]
print(df[["name", "price_per_sqft", "area_per_bedroom"]])
name price_per_sqft area_per_bedroom
0 Priya 8000.00 500.000000
1 Rahul 8000.00 500.000000
2 Anjali 7500.00 400.000000
3 Vikram 8636.36 366.666667
Notice how Priya and Rahul cost the same per square foot despite very different absolute prices — the ratio surfaced a pattern the raw numbers hid. For grouped data, aggregations are just as powerful: for each customer, compute total_spend, avg_order_value, orders_last_30_days, or days_since_last_order.
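The grouped aggregations mentioned above can be sketched with a pandas groupby. This is a minimal illustration, not a fixed recipe: the orders table, its column names, and the snapshot date are all made up for the example.

```python
import pandas as pd

# Hypothetical order log: one row per order (values invented for illustration)
orders = pd.DataFrame({
    "customer": ["Priya", "Priya", "Rahul", "Rahul", "Rahul", "Anjali"],
    "amount": [1200, 800, 450, 300, 750, 2000],
    "order_date": pd.to_datetime([
        "2026-06-01", "2026-06-28", "2026-05-10",
        "2026-06-15", "2026-06-30", "2026-04-20",
    ]),
})

# Assumed reference date used to measure recency
snapshot = pd.Timestamp("2026-07-01")

# One row per customer, built with named aggregations
customer_features = orders.groupby("customer").agg(
    total_spend=("amount", "sum"),
    avg_order_value=("amount", "mean"),
    last_order=("order_date", "max"),
)
customer_features["days_since_last_order"] = (
    snapshot - customer_features["last_order"]
).dt.days

# Count orders placed in the 30 days before the snapshot (0 if none)
recent = orders[orders["order_date"] >= snapshot - pd.Timedelta(days=30)]
customer_features["orders_last_30_days"] = (
    recent.groupby("customer").size()
    .reindex(customer_features.index, fill_value=0)
)

print(customer_features.drop(columns="last_order"))
```

In a real project the snapshot date should be the moment you would have made the prediction, so that no future orders sneak into the features.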
Datetime Features
A raw timestamp like 2026-07-02 14:30:00 is nearly useless to a model as-is. Break it into parts that carry seasonality and behaviour.
df = pd.DataFrame({"order_time": pd.to_datetime(
["2026-07-02 14:30", "2026-07-05 21:15", "2026-07-06 09:00"])})
df["hour"] = df["order_time"].dt.hour # 0..23
df["day_of_week"] = df["order_time"].dt.dayofweek # 0=Mon .. 6=Sun
df["is_weekend"] = df["day_of_week"].isin([5, 6]).astype(int)
df["month"] = df["order_time"].dt.month
df["is_month_end"]= df["order_time"].dt.is_month_end.astype(int)
print(df[["hour", "day_of_week", "is_weekend"]])
hour day_of_week is_weekend
0 14 3 0
1 21 6 1
2 9 0 0
For cyclical features like hour or month, a common trick is to encode them as sine/cosine pairs so that hour 23 sits next to hour 0:
hour_sin = sin(2 * pi * hour / 24)
hour_cos = cos(2 * pi * hour / 24)
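To see why the sine/cosine pair helps, here is a small sketch that encodes hours as points on a circle and measures the distance between them. The helper names encode_hour and gap are invented for this example.

```python
import numpy as np

def encode_hour(hour):
    """Map an hour (0..23) to a point on the unit circle."""
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])

def gap(a, b):
    """Euclidean distance between two encoded hours."""
    return float(np.linalg.norm(encode_hour(a) - encode_hour(b)))

print(round(gap(23, 0), 3))   # neighbours on the clock: small distance
print(round(gap(6, 0), 3))    # a quarter of a day apart
print(round(gap(12, 0), 3))   # opposite sides of the clock: largest distance
```

With the raw integer, hour 23 and hour 0 are 23 units apart; after encoding they are closer to each other than hour 0 is to hour 6.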
Text Length and Simple Text Features
You do not need full NLP to squeeze signal out of text. Cheap counts often predict surprisingly well — a very short product review or a spammy-looking subject line is informative.
df = pd.DataFrame({"review": [
"Great product, works perfectly!",
"Bad. Waste of money.",
"The build quality is excellent and delivery was on time, highly recommend to everyone",
]})
df["char_count"] = df["review"].str.len()
df["word_count"] = df["review"].str.split().str.len()
df["excl_count"] = df["review"].str.count("!")
df["avg_word_len"]= df["char_count"] / df["word_count"]
print(df[["char_count", "word_count", "excl_count"]])
char_count word_count excl_count
0 31 4 1
1 20 4 0
2 85 14 0
Interaction and Polynomial Features
An interaction feature is the product of two features — it lets the model capture "the effect of A depends on the level of B". A polynomial feature raises a feature to a power, letting a linear model bend to fit curves.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
X = pd.DataFrame({"area": [1000, 1500, 800], "bedrooms": [2, 3, 2]})
# degree=2 creates: area, bedrooms, area^2, area*bedrooms, bedrooms^2
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
print(poly.get_feature_names_out())
print(X_poly)
['area' 'bedrooms' 'area^2' 'area bedrooms' 'bedrooms^2']
[[1.000e+03 2.000e+00 1.000e+06 2.000e+03 4.000e+00]
[1.500e+03 3.000e+00 2.250e+06 4.500e+03 9.000e+00]
[8.000e+02 2.000e+00 6.400e+05 1.600e+03 4.000e+00]]
Polynomial features are powerful but explode quickly: degree=2 on 100 features produces over 5,000 columns. Keep the degree low (2 or 3) and prefer creating a few hand-picked interactions you can justify.
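The column count is easy to check for yourself. The sketch below fits PolynomialFeatures on a dummy 100-column array just to read off the output width, then builds a single hand-picked interaction as the lighter alternative. The column name area_x_bedrooms is an illustrative choice.

```python
from math import comb

import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Count the output columns for degree=2 on 100 input features
poly = PolynomialFeatures(degree=2, include_bias=False)
poly.fit(np.zeros((1, 100)))          # values are irrelevant, only the width matters
n_poly = poly.n_output_features_
print(n_poly)

# Closed form: all monomials of degree <= 2 in 100 variables, minus the bias term
print(comb(100 + 2, 2) - 1)

# Alternative: one interaction you can justify, built by hand
X = pd.DataFrame({"area": [1000, 1500, 800], "bedrooms": [2, 3, 2]})
X["area_x_bedrooms"] = X["area"] * X["bedrooms"]
print(X)
```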
Feature Scaling: Putting Features on the Same Playing Field
Many features live on very different scales. In a housing dataset, area ranges from 800 to 5000 while bedrooms ranges from 1 to 5. To a distance-based model, area will completely dominate simply because its numbers are bigger — not because it is more important. Scaling rescales features to comparable ranges so no single feature bullies the rest.
StandardScaler (Z-Score Standardization)
StandardScaler transforms each feature to have mean 0 and standard deviation 1. This is the most common scaler and the default choice for most models.
z = (x - mean) / std
where:
mean = average of the feature (computed on TRAIN data)
std = standard deviation of the feature (computed on TRAIN data)
from sklearn.preprocessing import StandardScaler
X = pd.DataFrame({"area": [1000, 1500, 800, 2000],
"bedrooms": [2, 3, 2, 4]})
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled.round(2))
print("means:", X_scaled.mean(axis=0).round(2))
print("stds :", X_scaled.std(axis=0).round(2))
[[-0.7 -0.9 ]
[ 0.38 0.3 ]
[-1.13 -0.9 ]
[ 1.45 1.51]]
means: [ 0. -0.]
stds : [1. 1.]
The result is unbounded — a value can be -3 or +5 — but it is centred and comparably spread. Use this when your features are roughly bell-shaped.
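If you want to convince yourself that StandardScaler is just the z-score formula, you can reproduce it by hand. A minimal sketch on the same four houses:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: area, bedrooms
X = np.array([[1000, 2], [1500, 3], [800, 2], [2000, 4]], dtype=float)

mean = X.mean(axis=0)
std = X.std(axis=0)              # NumPy default ddof=0 (population std)
z_manual = (X - mean) / std

scaler = StandardScaler().fit(X)
z_sklearn = scaler.transform(X)

print("learned means :", scaler.mean_)
print("learned scales:", scaler.scale_.round(2))
print("manual == sklearn:", np.allclose(z_manual, z_sklearn))
```

One thing to watch: pandas computes .std() with ddof=1 by default, so a z-score built from a DataFrame's own .std() will differ slightly from StandardScaler on small samples.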
MinMaxScaler (Normalization)
MinMaxScaler squeezes every feature into a fixed range, usually [0, 1]. This is often called normalization.
x_scaled = (x - min) / (max - min)
Result lies in the range 0 to 1
(min and max are learned from TRAIN data)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler() # default range is 0 to 1
X_scaled = scaler.fit_transform(X)
print(X_scaled.round(2))
[[0.17 0. ]
[0.58 0.5 ]
[0. 0. ]
[1. 1. ]]
Use MinMax when you need a bounded range — for example, pixel intensities for a neural network, or algorithms that assume inputs in [0, 1]. Its weakness: it is very sensitive to outliers, because a single huge value stretches the whole range and squashes everyone else near 0.
RobustScaler
RobustScaler centres on the median and scales by the interquartile range (IQR), which makes it resistant to outliers.
x_scaled = (x - median) / IQR
where IQR = Q3 - Q1 (the 75th minus the 25th percentile)
from sklearn.preprocessing import RobustScaler
# One extreme outlier in income (a millionaire in a middle-class dataset)
X = pd.DataFrame({"income": [30000, 45000, 50000, 55000, 2000000]})
print("Standard:", StandardScaler().fit_transform(X).ravel().round(2))
print("Robust :", RobustScaler().fit_transform(X).ravel().round(2))
Standard: [-0.52 -0.5 -0.49 -0.49 2. ]
Robust : [-2. -0.5 0. 0.5 195. ]
Notice how StandardScaler let the outlier compress the four normal incomes into a tight -0.52 to -0.49 band. RobustScaler keeps the four ordinary values nicely spread (-2 to 0.5) and isolates the outlier, which is exactly what you want when outliers are present but you cannot remove them.
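The RobustScaler numbers can be reproduced directly from the median and the quartiles. A short sketch on the same income column:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

income = np.array([30000, 45000, 50000, 55000, 2000000], dtype=float)

median = np.median(income)
q1, q3 = np.percentile(income, [25, 75])   # 25th and 75th percentiles
iqr = q3 - q1
manual = (income - median) / iqr

sk = RobustScaler().fit_transform(income.reshape(-1, 1)).ravel()

print("median:", median, "IQR:", iqr)
print("manual :", manual)
print("sklearn:", sk)
```

Because the median and the IQR only look at the middle of the data, the millionaire changes neither of them; that is where the robustness comes from.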
Choosing a Scaler
| Scaler | Formula | Output range | Centred on | Outlier-robust? | When to use |
|---|---|---|---|---|---|
| StandardScaler | (x - mean) / std | unbounded | mean | No | Default; roughly Gaussian features; linear/logistic regression, SVM, KNN, PCA |
| MinMaxScaler | (x - min) / (max - min) | [0, 1] | none | No | Bounded inputs needed; neural nets; features with known bounds |
| RobustScaler | (x - median) / IQR | unbounded | median | Yes | Features with strong outliers you cannot drop |
| MaxAbsScaler | x / max(abs(x)) | [-1, 1] | none (no shift) | No | Sparse matrices (preserves zeros); text TF-IDF |
Which Models Need Scaling — and Which Do Not
This is one of the most misunderstood topics for beginners. The rule is about how the algorithm uses the numbers.
Scaling matters most for distance-based and gradient-based models, because they compare or combine raw magnitudes:
- Distance-based (KNN, K-Means, SVM with RBF kernel) compute distances like Euclidean distance. A feature measured in the thousands drowns out a feature measured in single digits.
Euclidean distance between two houses A and B:
d = sqrt((area_A - area_B)^2 + (bedrooms_A - bedrooms_B)^2)
If area differs by 500 and bedrooms by 2:
d = sqrt(500^2 + 2^2) = sqrt(250000 + 4) ~ 500.004
bedrooms is completely invisible until you scale.
- Gradient-based (Linear Regression via gradient descent, Logistic Regression, Neural Networks) converge much faster on scaled data because the loss surface becomes rounder instead of a long thin valley.
- PCA finds directions of maximum variance, so unscaled features with large variance falsely dominate — always scale before PCA.
Scaling is NOT needed for tree-based models — Decision Trees, Random Forests, and gradient-boosted trees like XGBoost. A tree splits on thresholds ("is area greater than 1200?"), and the split logic is unchanged whether area is in square feet or scaled units. Monotonic transformations do not affect where the tree chooses to split.
| Model family | Examples | Needs scaling? |
|---|---|---|
| Distance-based | KNN, K-Means, SVM (RBF) | Yes |
| Gradient-based | Linear/Logistic Regression, Neural Nets | Yes (helps a lot) |
| Variance-based | PCA, LDA | Yes |
| Tree-based | Decision Tree, Random Forest, XGBoost | No |
| Probabilistic | Naive Bayes | Usually no |
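The difference between the model families is easy to observe on a toy example. The sketch below trains a decision tree and a 1-nearest-neighbour classifier on raw and on standardized versions of the same made-up houses, then predicts one new house. The data, the labels, and the helper name predict are all invented for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Made-up houses: columns are [area, bedrooms]; label 1 = "premium"
X_train = np.array([[800, 2], [900, 2], [1000, 3], [1100, 2],
                    [1500, 3], [1600, 3], [1800, 4], [2000, 4]], dtype=float)
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X_new = np.array([[1250, 4]], dtype=float)   # smallish area, many bedrooms

scaler = StandardScaler().fit(X_train)       # fitted on training rows only
X_train_s = scaler.transform(X_train)
X_new_s = scaler.transform(X_new)

def predict(model, scaled):
    """Fit a fresh model on raw or scaled data and predict the new house."""
    Xtr, Xte = (X_train_s, X_new_s) if scaled else (X_train, X_new)
    return int(model.fit(Xtr, y_train).predict(Xte)[0])

tree_raw = predict(DecisionTreeClassifier(random_state=0), scaled=False)
tree_scaled = predict(DecisionTreeClassifier(random_state=0), scaled=True)
knn_raw = predict(KNeighborsClassifier(n_neighbors=1), scaled=False)
knn_scaled = predict(KNeighborsClassifier(n_neighbors=1), scaled=True)

print("tree:", tree_raw, tree_scaled)   # same answer with or without scaling
print("knn :", knn_raw, knn_scaled)     # answer changes once bedrooms can count
```

On raw data the nearest neighbour is decided almost entirely by area; after scaling, the bedroom count carries comparable weight and a different neighbour wins. The tree asks the same threshold question either way.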
Binning and Discretization
Binning (or discretization) converts a continuous feature into discrete buckets. This can help capture non-linear effects, reduce the impact of small measurement noise, and make features more interpretable.
df = pd.DataFrame({"age": [22, 35, 41, 58, 67, 19, 45]})
# Custom bin edges with human-readable labels
df["age_group"] = pd.cut(
df["age"],
bins=[0, 25, 40, 60, 120],
labels=["young", "adult", "middle_aged", "senior"],
)
# Equal-frequency bins: each bucket holds ~the same number of rows
df["age_quartile"] = pd.qcut(df["age"], q=4, labels=False)
print(df)
age age_group age_quartile
0 22 young 0
1 35 adult 1
2 41 middle_aged 1
3 58 middle_aged 3
4 67 senior 3
5 19 young 0
6 45 middle_aged 2
- pd.cut bins by value range: pass explicit edges (as above) for custom ranges, or an integer for equal-width bins.
- pd.qcut creates equal-frequency bins so each bucket has a similar count.
- For a pipeline, sklearn.preprocessing.KBinsDiscretizer does the same and can output ordinal or one-hot encoded bins.
Binning trades information for robustness — you lose the exact value but gain resistance to noise. Use it when the relationship is genuinely stepwise (age brackets for insurance, income slabs for tax).
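For the pipeline-friendly route, here is a minimal KBinsDiscretizer sketch on the same ages. It assumes equal-width bins (strategy="uniform") with ordinal output; switching to strategy="quantile" would give equal-frequency bins instead.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

ages = np.array([[22], [35], [41], [58], [67], [19], [45]], dtype=float)

# Four equal-width bins between the min and max age seen during fit
binner = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="uniform")
age_bin = binner.fit_transform(ages).ravel().astype(int)

print("edges:", binner.bin_edges_[0])
print("bins :", age_bin)
```

Like a scaler, the discretizer learns its edges during fit, so it should be fitted on training data and then reused on test data.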
Log Transforms for Skewed Features
Many real-world features — income, house price, page views, transaction amount — are right-skewed: a long tail of large values. Models that assume roughly symmetric inputs struggle with this. A log transform compresses the tail and pulls the distribution closer to symmetric.
For strictly positive values:
x_transformed = log(x)
When zeros are present, use log1p (log of 1 + x): log(0) is undefined, whereas log1p(0) = 0:
x_transformed = log(1 + x)
import numpy as np
income = pd.Series([25000, 40000, 60000, 90000, 2500000]) # heavy right tail
log_income = np.log1p(income) # log(1 + x), safe for zeros
print("skewness raw :", round(income.skew(), 2))
print("skewness log :", round(log_income.skew(), 2))
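skewness raw : 2.23
skewness log : 1.87
The skewness dropped from 2.23 to 1.87. The log did compress the tail (the largest income goes from about 28 times the next-largest value to about 1.3 times on the log scale), but with only five rows a single extreme value still dominates the statistic, so the result is not yet symmetric. On larger right-skewed samples the improvement is usually much clearer. To reverse the transform on a prediction, apply np.expm1 (the inverse of log1p). Other options for skew include the square root and the Box-Cox / Yeo-Johnson transforms (sklearn.preprocessing.PowerTransformer), which pick the best power automatically.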
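The round trip between log1p and expm1 can be checked on a column that contains zeros. The page_views values below are invented for the example.

```python
import numpy as np

page_views = np.array([0, 3, 12, 150, 98000], dtype=float)

log_views = np.log1p(page_views)    # log(1 + x): a zero stays at 0
restored = np.expm1(log_views)      # exp(x) - 1: undoes log1p

print(log_views.round(3))
print(restored.round(3))
```

The same pattern applies to a log-transformed target: train on log1p(y), then pass the model's predictions through expm1 to get back to the original units.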
Feature Selection Basics
More features are not always better. Irrelevant or redundant features add noise, slow training, increase overfitting risk, and make models harder to explain. Feature selection keeps the columns that carry signal. Here are four practical starting points.
Variance Threshold
A feature that barely changes across rows carries almost no information. VarianceThreshold drops near-constant features.
from sklearn.feature_selection import VarianceThreshold
X = pd.DataFrame({
"area": [1000, 1500, 800, 2000],
"has_water": [1, 1, 1, 1], # constant → zero variance, useless
"bedrooms": [2, 3, 2, 4],
})
selector = VarianceThreshold(threshold=0.0) # drop zero-variance columns
X_reduced = selector.fit_transform(X)
print("kept columns:", X.columns[selector.get_support()].tolist())
kept columns: ['area', 'bedrooms']
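A non-zero threshold catches near-constant columns too. For a 0/1 column where a fraction p of rows equal 1, the variance is p * (1 - p). The sketch below uses a made-up is_active flag that is 1 in 998 of 1000 rows; the threshold of 0.01 is an arbitrary illustrative choice.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

X = pd.DataFrame({
    "is_active": [1] * 998 + [0] * 2,      # almost constant
    "score": np.linspace(0, 1, 1000),      # varies across rows
})

p = 0.998
print("expected variance of is_active:", p * (1 - p))

selector = VarianceThreshold(threshold=0.01)
selector.fit(X)
kept = X.columns[selector.get_support()].tolist()
print("kept columns:", kept)
```

Variance depends on the units of the column, so a single threshold only makes sense when the features are on comparable scales, such as binary flags.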
Correlation Filtering
If two features are highly correlated (say |r| > 0.9), they carry almost the same information. Keeping both adds redundancy and can destabilise linear models (multicollinearity). Drop one of each highly correlated pair.
corr = X.corr().abs()
# Look at the upper triangle and flag pairs above the threshold
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [c for c in upper.columns if any(upper[c] > 0.9)]
print("drop candidates:", to_drop)
SelectKBest (Univariate Statistical Tests)
SelectKBest scores each feature against the target using a statistical test and keeps the top k. Use f_classif / mutual_info_classif for classification and f_regression for regression.
from sklearn.feature_selection import SelectKBest, f_classif
# X_train, y_train assumed prepared; keep the 10 best-scoring features
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X_train, y_train)
scores = pd.Series(selector.scores_, index=X_train.columns)
print(scores.sort_values(ascending=False).head())
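Since the snippet above assumes X_train and y_train already exist, here is a self-contained variant you can run as-is. It swaps in scikit-learn's bundled iris dataset purely for illustration and keeps the two best-scoring features.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Score every feature against the target on the training rows only
selector = SelectKBest(score_func=f_classif, k=2).fit(X_train, y_train)

scores = pd.Series(selector.scores_, index=X_train.columns)
selected = X_train.columns[selector.get_support()].tolist()

print(scores.sort_values(ascending=False).round(1))
print("selected:", selected)
```

The selector is fitted on the training split for the same reason a scaler is: scoring features against test labels would leak information.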
Model-Based Importance
Tree-based models expose a feature_importances_ attribute; linear models expose coefficients. Fit a quick model and let it rank the features. SelectFromModel automates keeping the important ones.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head())
# Automatically keep features above the median importance
selector = SelectFromModel(rf, threshold="median", prefit=True)
X_selected = selector.transform(X_train)
Model-based importance captures non-linear relationships that univariate tests miss, but it can be biased toward high-cardinality features — cross-check with domain sense. You will meet these importances again in the Random Forests and Ensemble Methods chapters.
Putting It Together with a Pipeline
The single most important scaling rule: fit your scaler on the training data only, then apply the same fitted scaler to the test data. If you fit on the full dataset, information about the test set (its mean, min, max) leaks into training. This is data leakage, and it makes your scores look better than they really are. A Pipeline guards against this: calling fit learns every step from the data you pass in, so when the pipeline is used inside cross-validation the scaler is re-fitted on each training fold automatically.
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# X and y are assumed to be your prepared feature matrix and labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
pipe = Pipeline([
("scaler", StandardScaler()), # fit on train, transform train + test
("model", LogisticRegression()),
])
pipe.fit(X_train, y_train) # scaler.fit uses ONLY X_train
print("test accuracy:", pipe.score(X_test, y_test))
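As a small preview of the per-fold behaviour, the sketch below passes a pipeline to cross_val_score. It uses scikit-learn's bundled breast cancer dataset as a stand-in, and max_iter=1000 is simply a generous setting chosen for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# For each of the 5 folds, the whole pipeline (scaler included) is fitted
# on the training part and scored on the held-out part
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.round(3))
print("mean accuracy:", scores.mean().round(3))
```

Scaling X once up front and then cross-validating only the model would let every fold's scaler see the held-out rows, which is the leakage the pipeline avoids.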
The proper train/test split and cross-validation that make this trustworthy are the subject of the very next chapter.
Common Mistakes
1. Fitting the scaler on the whole dataset (data leakage)
WRONG:
X_scaled = scaler.fit_transform(X) # scaler sees the test rows too
X_train, X_test = train_test_split(X_scaled) # leakage → optimistic scores
RIGHT:
X_train, X_test = train_test_split(X)
scaler.fit(X_train) # learn params from train only
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test) # reuse the SAME fitted scaler
2. Scaling tree-based models "just in case"
Scaling the inputs to a Random Forest or XGBoost model wastes effort and adds a moving part for no gain, because tree splits are invariant to monotonic rescaling of a feature. Skip it.
3. Scaling the target variable by accident
Scale features (X), not the target (y), unless you deliberately intend to and remember to inverse-transform predictions back to the original units.
4. Applying MinMaxScaler on data with outliers
A single extreme value stretches the [0, 1] range and crushes every normal value near 0. Use RobustScaler, or handle the outlier first.
5. Building interaction/polynomial features blindly
PolynomialFeatures(degree=3) on many columns creates thousands of features, most of them noise. This causes overfitting and slow training. Keep the degree low and prefer a few justified interactions.
6. Encoding datetime as a raw integer
Feeding a Unix timestamp or 20260702 directly gives the model a meaningless huge number. Decompose it into day_of_week, hour, is_weekend, month instead.
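For mistake 3, when you do want to transform the target on purpose, scikit-learn's TransformedTargetRegressor applies the transform during fit and the inverse during predict, so you cannot forget the second half. A minimal sketch with made-up house prices:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Made-up data: area in sq ft, price in rupees
X = np.array([[800], [1000], [1100], [1500], [2000]], dtype=float)
y = np.array([6000000, 8000000, 9500000, 12000000, 17000000], dtype=float)

model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,            # applied to y before fitting
    inverse_func=np.expm1,    # applied to predictions automatically
)
model.fit(X, y)

pred = model.predict(np.array([[1200.0]]))
print(pred.round(0))          # already back in rupees, not log-rupees
```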
Practice Exercises
- Given an e-commerce table with order_amount and num_items, create an avg_item_value feature and explain why it may predict returns better than order_amount alone.
- You have a signup_timestamp column. List five datetime features you would extract, and describe one that would help predict whether a user signs up during a promotion.
- A dataset has salary ranging from ₹20,000 to ₹5,00,000 with three billionaire outliers. Which scaler would you choose and why? Write the formula it uses.
- You are training a KNN classifier and a Random Forest on the same features. For which model must you scale, and why does the other not need it?
- A feature is_active equals 1 for 998 of 1000 rows. Which selection technique flags it, and what would happen to a model if you kept it?
- Given a right-skewed page_views column containing some zeros, write the transform that reduces skew without failing on the zeros, and state how to reverse it for predictions.
Summary
In this chapter you learned:
- Feature engineering = creating, transforming, and selecting features; it usually matters more than the choice of algorithm.
- Create features from ratios (price_per_sqft), aggregations (avg_order_value), datetime parts (hour, is_weekend), text stats (word_count), and interaction/polynomial terms.
- StandardScaler gives mean 0, std 1 via (x - mean) / std; it is the default for most models.
- MinMaxScaler squeezes to [0, 1] via (x - min) / (max - min); bounded but outlier-sensitive.
- RobustScaler centres on the median and scales by the IQR via (x - median) / IQR; resistant to outliers.
- Distance-based (KNN, SVM, K-Means), gradient-based (regression, neural nets), and variance-based (PCA) models need scaling; tree-based models (Decision Tree, Random Forest, XGBoost) do not.
- Binning discretizes continuous features; log transforms (log1p / expm1) tame right skew.
- Feature selection (variance threshold, correlation filtering, SelectKBest, and model-based importance) trims noise and reduces overfitting.
- Always fit the scaler on training data only, then reuse it on the test set; a Pipeline enforces this and prevents data leakage.
Great features and honest scaling set you up for reliable models — but only if you evaluate them correctly.
Next up: Train-Test Split & Cross-Validation — how to split your data, avoid leakage, and use k-fold cross-validation to get an unbiased estimate of how your model will perform on data it has never seen.