The Machine Learning Workflow

What Is the Machine Learning Workflow?

The machine learning workflow is the repeatable sequence of steps that takes you from a vague business question to a working, monitored model in production. It is the process that surrounds the algorithms — and in practice, choosing the right algorithm is often the smallest part of the job.

Think of it like building a house. The glamorous part is imagining the finished home (the model), but most of the real work is surveying the land, laying the foundation, checking the plumbing, and inspecting the structure afterwards. Skip any of those and the beautiful house collapses. In ML, the "algorithm" is the visible structure; data understanding, cleaning, and evaluation are the foundation that holds everything up.

Two ideas to hold onto from the very start:

The workflow is a loop, not a straight line. You will discover a data-quality problem during evaluation and jump back to cleaning. You will find a feature idea while tuning and revisit engineering. Iteration is the norm.
A common rule of thumb is that 70–80% of a real project is data work (collect, understand, clean, engineer). Model selection and tuning, the parts beginners obsess over, are a small slice of the whole.

The 12 stages (a loop, not a line):

 1. Frame the problem      →  2. Collect & understand data
 3. Exploratory analysis   →  4. Preprocess & clean
 5. Feature engineering    →  6. Split the data
 7. Choose a model         →  8. Train
 9. Evaluate               → 10. Tune hyperparameters
11. Deploy                 → 12. Monitor & retrain ──┐
        ▲                                             │
        └──────── feed learnings back into step 1 ────┘

This chapter walks each stage, explains why it matters, then ties them together with a concrete scikit-learn example and the Pipeline object.

Stage 1: Frame the Problem

Before touching data, be brutally clear about what you are predicting and why. A poorly framed problem wastes weeks.

Ask:

What is the target? Is it a number (regression), a category (classification), or a grouping (clustering)? See Types of Machine Learning for how this choice shapes everything downstream.
What decision will the prediction drive? A churn model that no one acts on has zero value.
How will success be measured? Define the metric before modelling — accuracy, RMSE, recall, or a business number like ₹ saved per month.
What is the baseline? "Always predict the majority class" or "use last month's value" is your bar to beat. If a simple rule already works, you may not need ML at all.

Example framing:
Business question: "Which customers will churn next month?"
→ Target:      churn (yes / no)  →  binary classification
→ Decision:    send a retention offer to likely-churners
→ Metric:      recall on the churn class (catch churners),
               subject to a limited offer budget
→ Baseline:    "flag anyone whose usage dropped last month"
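To make the baseline concrete, here is a minimal sketch that scores the rule "flag anyone whose usage dropped last month" on a handful of made-up customers. The records, the rule_baseline helper, and the 50% drop threshold are all illustrative assumptions; the point is only that a baseline is something you can measure with the same metric you will later apply to the model.

```python
# Toy records: (usage_last_month, usage_this_month, churned). All values are made up.
customers = [
    (120, 40, True),
    (80, 85, False),
    (60, 20, True),
    (200, 190, False),
    (50, 55, True),
    (90, 30, False),
    (70, 75, False),
    (110, 100, False),
]


def rule_baseline(prev_usage, curr_usage, drop_ratio=0.5):
    """Flag a customer whose usage fell below drop_ratio of last month's (assumed threshold)."""
    return curr_usage < drop_ratio * prev_usage


flags = [rule_baseline(prev, curr) for prev, curr, _ in customers]
actual = [churned for _, _, churned in customers]

# Count the outcomes needed for recall and precision on the churn class.
true_pos = sum(f and a for f, a in zip(flags, actual))
false_neg = sum((not f) and a for f, a in zip(flags, actual))
false_pos = sum(f and (not a) for f, a in zip(flags, actual))

baseline_recall = true_pos / (true_pos + false_neg)
baseline_precision = true_pos / (true_pos + false_pos)

print(f"Baseline recall:    {baseline_recall:.2f}")
print(f"Baseline precision: {baseline_precision:.2f}")
```

Whatever model you train later has to beat these two numbers to justify its extra complexity.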

Stage 2: Collect and Understand the Data

Now gather the raw material. Data may come from SQL databases, CSV exports, APIs, event logs, or third-party providers. Two questions matter most:

Is the data relevant to the target? Features must plausibly influence the outcome.
Will the data be available at prediction time? A feature you will not have when the model runs in production causes data leakage (more on this below).

Once collected, understand it before modelling. Inspect shape, types, and a few rows.

import pandas as pd

df = pd.read_csv("telecom_customers.csv")

print(df.shape)          # (7043, 12)  -> rows, columns
print(df.dtypes)         # int, float, object per column
print(df.head(3))
print(df["churn"].value_counts(normalize=True))

Expected output (illustrative):
(7043, 12)

churn
No     0.735
Yes    0.265
Name: proportion, dtype: float64

Those proportions already tell you the classes are imbalanced (about 27% churn), a fact that will shape your metric choice and possibly your sampling strategy.

Stage 3: Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is where you build intuition: distributions, relationships, and surprises. You are looking for patterns to exploit and problems to fix.

Typical EDA moves:

Summary statistics — df.describe() for numeric spread, df.describe(include="object") for categories.
Missing values — df.isnull().sum() to see gaps.
Distributions — histograms and boxplots reveal skew and outliers.
Relationships — correlation heatmaps and grouped means show which features track the target.

# Which numeric features differ between churners and non-churners?
print(df.groupby("churn")[["tenure", "monthly_charges"]].mean())

Expected output (illustrative):
        tenure  monthly_charges
churn
No       37.6            61.30
Yes      17.9            74.44

The pattern is immediate: churners have shorter tenure and higher monthly charges. That is a real signal — exactly the kind of insight EDA exists to surface. (For the statistical machinery behind correlation and distributions, see the Statistics series.)
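Two of the other EDA moves listed above, counting missing values and checking how features track the target, can be sketched on a tiny made-up frame. The toy values and the churn_flag helper column are assumptions for illustration, not rows from the telecom dataset.

```python
import numpy as np
import pandas as pd

# A tiny made-up frame with one gap in each numeric column.
toy = pd.DataFrame({
    "tenure": [1, 5, 40, 60, np.nan, 3],
    "monthly_charges": [80.0, 75.5, 40.0, 35.0, 90.0, np.nan],
    "churn": ["Yes", "Yes", "No", "No", "Yes", "Yes"],
})

# Missing values per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Correlation needs a numeric target, so encode churn as 0/1 first.
toy["churn_flag"] = (toy["churn"] == "Yes").astype(int)
corr_with_target = toy[["tenure", "monthly_charges"]].corrwith(toy["churn_flag"])
print(corr_with_target)
```

On this toy data the sign of each correlation matches the grouped means: tenure moves against churn, monthly charges move with it. With only six rows the magnitudes mean nothing; the sketch shows the calls, not a finding.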

Stage 4: Preprocess and Clean

Real data is messy. Preprocessing turns it into something an algorithm can actually consume. The common jobs:

Missing values — impute (fill with mean/median/mode) or drop.
Data types — convert dates, fix numbers stored as text.
Duplicates and errors — remove exact duplicates, fix impossible values (age of -4).
Outliers — cap, transform, or investigate.
Encoding categoricals — models need numbers, so text categories become one-hot or ordinal encodings.
Scaling numerics — put features on comparable ranges so no single feature dominates by magnitude alone.

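This stage is deep enough to fill two chapters of its own: the next one, Data Preprocessing & Cleaning, and the one after it, Feature Engineering & Scaling. Here we only note the outline. The critical rule that ties into the workflow is: fit your cleaning steps on the training data only, then apply the fitted steps to the test data. This is why, in code, the split (Stage 6) has to happen before any imputer, encoder, or scaler is fitted, even though cleaning is listed first. Doing otherwise leaks information about the test set into training (again, more below).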
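The "fit on train, apply to test" rule can be sketched without any library. This is an illustrative toy, assuming a single hypothetical tenure column in which None marks a missing value; in a real project scikit-learn's SimpleImputer would play the role of the impute helper.

```python
from statistics import median

# Hypothetical tenure values; None marks a missing entry.
train_tenure = [2, 5, None, 40, 12, None, 7]
test_tenure = [None, 30, 4]

# "Fit": learn the fill value from the training rows only.
fill_value = median(v for v in train_tenure if v is not None)


def impute(values, fill):
    """Apply an already-learned fill value; nothing is re-estimated here."""
    return [fill if v is None else v for v in values]


# "Transform": the same learned value is applied to both sets.
train_clean = impute(train_tenure, fill_value)
test_clean = impute(test_tenure, fill_value)

print(fill_value)   # median of the observed training values
print(test_clean)
```

The test rows never influence fill_value, which is the whole point: the test set stays a fair stand-in for data the model has not seen.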

Stage 5: Feature Engineering

Feature engineering is creating new inputs that expose the signal more clearly. It is where domain knowledge beats brute-force compute, and it is often the single highest-leverage stage.

Raw columns:            Engineered features:
signup_date        →    tenure_months = today - signup_date
total_spend, tenure →   avg_monthly_spend = total_spend / tenure
call_count         →    calls_per_week = call_count / weeks_active
plan (text)        →    is_premium_plan (0/1)

Good features often encode a ratio, a rate over time, or a domain flag that the raw column hides. A model rarely discovers avg_monthly_spend on its own from total_spend and tenure; you hand it that insight. The dedicated chapter Feature Engineering & Scaling covers techniques in depth.
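A few of the features in the table might be built with pandas as follows. This is a sketch under assumptions: the raw frame, its values, and snapshot_date are made up, and tenure is assumed to be measured in months. Two details are worth noticing. The date difference is taken against a fixed snapshot date rather than "today", so the feature means the same thing at training and prediction time. The ratio guards against division by zero for brand-new customers.

```python
import numpy as np
import pandas as pd

# Hypothetical raw columns, matching the names in the table above.
raw = pd.DataFrame({
    "signup_date": pd.to_datetime(["2023-01-15", "2024-03-01", "2024-06-10"]),
    "total_spend": [1200.0, 300.0, 0.0],
    "tenure": [18, 4, 0],   # assumed to be in months
    "plan": ["premium", "basic", "basic"],
})

# Assumed reference date: the moment the prediction is made.
snapshot_date = pd.Timestamp("2024-07-01")

features = pd.DataFrame({
    # Elapsed time since signup, in days.
    "tenure_days": (snapshot_date - raw["signup_date"]).dt.days,
    # Ratio feature; zero tenure becomes NaN instead of raising or producing inf.
    "avg_monthly_spend": raw["total_spend"] / raw["tenure"].replace(0, np.nan),
    # Domain flag derived from a text column.
    "is_premium_plan": (raw["plan"] == "premium").astype(int),
})

print(features)
```

The NaN left behind for zero-tenure customers is then handled by the imputation step, like any other missing value.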

Stage 6: Split the Data

You must estimate how the model performs on data it has never seen. So before training, carve the data into parts:

Training set — the model learns from this (commonly 70–80%).
Test set — held back untouched to estimate real-world performance (commonly 20–30%).
Often a validation set or cross-validation in between, for tuning.

from sklearn.model_selection import train_test_split

X = df.drop(columns=["churn"])
y = df["churn"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,     # reproducible split
    stratify=y           # keep the ~27% churn ratio in both splits
)

print(X_train.shape, X_test.shape)   # (5634, 11) (1409, 11)

stratify=y preserves the class balance in both splits, which matters for imbalanced targets. Splitting before any fitting is what keeps evaluation honest. The full theory of splitting, stratification, and k-fold cross-validation lives in Train-Test Split & Cross-Validation.
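The cross-validation option mentioned in the list above can be sketched with StratifiedKFold, which applies the same ratio-preserving idea to every fold. The 25/75 label array here is synthetic and chosen so the arithmetic is easy to check by eye.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels: 25 churners, 75 non-churners.
y_demo = np.array(["Yes"] * 25 + ["No"] * 75)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)   # placeholder feature

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_churn_rates = []
for train_idx, val_idx in skf.split(X_demo, y_demo):
    # Share of churners in this fold's validation part.
    fold_churn_rates.append(float(np.mean(y_demo[val_idx] == "Yes")))

print(fold_churn_rates)
```

Each validation fold holds 20 rows with the same share of churners as the full array, so no fold is accidentally starved of the minority class.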

Stage 7: Choose a Model

Match the model to the problem type, the data size, and your need for interpretability. Start simple; a well-tuned logistic regression is a stronger baseline than a badly-tuned neural network.

Problem type	Reasonable starting models	Chapter
Regression (predict a number)	Linear Regression, Random Forest	Linear Regression
Binary/multiclass classification	Logistic Regression, KNN, Random Forest	Logistic Regression
Clustering (no labels)	K-Means, DBSCAN	K-Means Clustering
High-dimensional / text	SVM, Naive Bayes	Support Vector Machines
Complex non-linear patterns	Gradient Boosting, Neural Networks	Ensemble Methods

A sensible rule of thumb: begin with an interpretable baseline (linear/logistic regression or a single decision tree), confirm the pipeline works end-to-end, then reach for more powerful models only if the baseline is not good enough.
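One way to act on that rule of thumb is to score a trivial model and a simple model side by side before trying anything heavier. The sketch below uses synthetic data from make_classification, so the numbers say nothing about the churn dataset. Balanced accuracy is used because a majority-class predictor always scores exactly 0.5 on it for a binary target, which gives an unambiguous floor.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced binary problem (roughly 73% / 27%).
X_demo, y_demo = make_classification(
    n_samples=1000,
    n_features=8,
    n_informative=4,
    n_clusters_per_class=1,
    weights=[0.73, 0.27],
    random_state=0,
)

candidates = {
    "majority_class": DummyClassifier(strategy="most_frequent"),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Mean 5-fold balanced accuracy for each candidate.
scores = {
    name: cross_val_score(est, X_demo, y_demo, cv=5, scoring="balanced_accuracy").mean()
    for name, est in candidates.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

If a more powerful model cannot clearly beat the interpretable baseline under the same cross-validation, the extra complexity is hard to justify.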

Stage 8: Train the Model

Training (also called fitting) is where the algorithm learns parameters from the training data. For supervised estimators in scikit-learn this is always the same call: .fit(X_train, y_train).

from sklearn.linear_model import LogisticRegression

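# Assumes X_train is already fully numeric (imputed and encoded in Stage 4);
# raw text columns or missing values would make .fit() raise an error.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)      # the model "learns" here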

Under the hood, logistic regression is searching for coefficients that minimise a loss function — but the interface is identical across almost every scikit-learn estimator, which is what makes the library so pleasant. Predicting is equally uniform:

y_pred = model.predict(X_test)
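To see what "searching for coefficients that minimise a loss function" means, here is a hand-rolled sketch of logistic regression on one synthetic feature, trained with plain gradient descent on the log loss. It is a teaching toy under assumed settings (learning rate 0.1, 500 steps, no regularisation), not how scikit-learn fits the model; its solvers are more sophisticated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic one-feature data: class 0 centred at -1.0, class 1 centred at 1.5.
x = np.concatenate([rng.normal(-1.0, 1.0, 100), rng.normal(1.5, 1.0, 100)])
y = np.concatenate([np.zeros(100), np.ones(100)])


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def log_loss(w, b):
    """Mean negative log-likelihood of the labels under weight w and bias b."""
    p = np.clip(sigmoid(w * x + b), 1e-12, 1 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


w, b, learning_rate = 0.0, 0.0, 0.1
initial_loss = log_loss(w, b)

for _ in range(500):
    p = sigmoid(w * x + b)
    # Gradients of the mean log loss with respect to w and b.
    w -= learning_rate * float(np.mean((p - y) * x))
    b -= learning_rate * float(np.mean(p - y))

final_loss = log_loss(w, b)
print(f"loss: {initial_loss:.3f} -> {final_loss:.3f}, w={w:.2f}, b={b:.2f}")
```

The learned w and b are the "parameters" in the sense used above; the learning rate and step count are hyperparameters, a distinction that Stage 10 picks up.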

Stage 9: Evaluate the Model

Now compare predictions on the test set against the truth using the metric you chose back in Stage 1. Never evaluate on the training set — that measures memorisation, not generalisation.

from sklearn.metrics import accuracy_score, recall_score, confusion_matrix

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Recall (churn):", recall_score(y_test, y_pred, pos_label="Yes"))
print(confusion_matrix(y_test, y_pred))

Expected output (illustrative):
Accuracy: 0.80
Recall (churn): 0.53
[[933  102]
 [176  198]]

Notice the trap: 80% accuracy sounds great, but recall on churners is only about 0.53 — the model misses nearly half of the customers it is supposed to catch. On imbalanced data, accuracy is misleading; a model that always predicts "No" would already score about 73%. This is exactly why you fixed the metric during framing. The full menu of metrics (precision, recall, F1, ROC-AUC, RMSE) is covered in Model Evaluation Metrics.
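The figures quoted above can be recomputed by hand from the illustrative confusion matrix, which is a useful habit when a headline number looks too good. The sketch assumes scikit-learn's layout for sorted labels ("No", "Yes"): rows are true classes, columns are predicted classes.

```python
# Cells of the illustrative confusion matrix shown above.
tn, fp = 933, 102   # true "No":  predicted No / predicted Yes
fn, tp = 176, 198   # true "Yes": predicted No / predicted Yes

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
recall = tp / (tp + fn)        # share of real churners that were caught
precision = tp / (tp + fp)     # share of flagged customers who really churn
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a model that always predicts "No".
majority_baseline_accuracy = (tn + fp) / total

print(f"accuracy={accuracy:.3f} recall={recall:.3f} "
      f"precision={precision:.3f} f1={f1:.3f}")
print(f"always-No accuracy={majority_baseline_accuracy:.3f}")
```

Seen this way, the model's accuracy is only about seven points above a model that does nothing, while recall shows how many churners are actually being missed.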

Stage 10: Tune Hyperparameters

Hyperparameters are the settings you choose before training (unlike parameters, which are learned during training). Examples: C in logistic regression (the inverse of the regularisation strength, so a smaller C means stronger regularisation), the number of neighbours k in KNN, or max_depth in a decision tree.

Tuning searches for the combination that performs best — evaluated via cross-validation on the training data, never on the test set.

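from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV

# The labels are the strings "Yes"/"No", so the built-in scoring="recall"
# (which assumes the positive class is 1) does not apply; name the positive class.
churn_recall = make_scorer(recall_score, pos_label="Yes")

param_grid = {"C": [0.01, 0.1, 1, 10]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=5,                  # 5-fold cross-validation
    scoring=churn_recall   # optimise for the metric we care about
)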
search.fit(X_train, y_train)

print(search.best_params_)      # e.g. {'C': 0.1}
print(round(search.best_score_, 3))

GridSearchCV tries every combination; RandomizedSearchCV samples a fixed number, which scales better when the grid is large. Crucially, tuning uses only the training data — the test set stays sealed until the very end.
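A minimal RandomizedSearchCV sketch, for comparison with the grid search above. It runs on synthetic data with 0/1 labels, so the built-in "recall" scorer applies directly. The log-uniform range for C and the budget of 10 samples are assumptions chosen for illustration.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic, imbalanced binary problem with 0/1 labels.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, weights=[0.73, 0.27], random_state=0
)

random_search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-2, 1e2)},  # sample C on a log scale
    n_iter=10,            # fixed budget: 10 sampled settings, however wide the range
    cv=5,
    scoring="recall",
    random_state=0,       # reproducible sampling
)
random_search.fit(X_demo, y_demo)

best_C = random_search.best_params_["C"]
print(best_C, round(random_search.best_score_, 3))
```

The cost is set by n_iter rather than by the size of the search space, which is why this approach scales better once several hyperparameters are tuned together.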

Stage 11: Deploy

A model that lives only in a notebook creates no value. Deployment puts it where predictions are actually consumed — behind a REST API, inside a batch job, or embedded in an app. The trained object is serialised (saved) and loaded by the serving code.

import joblib

# Save the fitted pipeline (see below) to disk
joblib.dump(best_pipeline, "churn_model.joblib")

# In the production service, at startup:
model = joblib.load("churn_model.joblib")
prediction = model.predict(new_customer_df)

The golden rule: the exact same preprocessing must run in production as in training. If you scaled and one-hot encoded during training but forget to in production, predictions are garbage. This is the single strongest argument for the Pipeline object introduced below — it bundles preprocessing and model into one saveable unit.
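A quick sanity check before shipping is a save-and-reload round trip: the restored object should give the same predictions as the one in memory. The sketch below uses a small synthetic pipeline and a temporary directory, both assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic problem, only to have something fitted to save.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])
pipe.fit(X_demo, y_demo)

# Round trip through disk, as the serving code would do at startup.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The restored pipeline carries its fitted scaler, so raw inputs can be passed as-is.
same_predictions = bool(np.array_equal(pipe.predict(X_demo), restored.predict(X_demo)))
print(same_predictions)
```

Two practical cautions: joblib files are pickle-based, so load only files you trust, and load them with the same scikit-learn version that produced them.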

Stage 12: Monitor and Retrain

Deployment is not the finish line. The world changes, and so does your data. Monitoring watches for:

Data drift — the input distribution shifts (a new customer segment, a pricing change).
Concept drift — the relationship between features and target changes (churn behaviour changes after a competitor launches).
Performance decay — accuracy or recall silently drops over months.

When metrics degrade past a threshold, you retrain on fresh data, which loops you right back to Stage 2 (or to Stage 1, if the problem itself has changed). This is the clearest proof that ML is a cycle: a model is a living system, not a one-off deliverable.
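Data drift on a single numeric feature can be watched with a simple statistic such as the Population Stability Index (PSI), which compares the binned distribution seen at training time with the one seen in production. This is one possible check among several, sketched under assumptions: the helper name, the ten quantile bins, and the synthetic monthly-charge samples are illustrative. Note that it detects a shift in the inputs only; concept drift has to be caught by tracking the metric itself on freshly labelled data.

```python
import numpy as np


def population_stability_index(reference, current, n_bins=10):
    """PSI between a reference sample and a current sample of one feature."""
    # Bin edges come from quantiles of the reference (training-time) data.
    interior_edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))[1:-1]
    ref_counts = np.bincount(np.searchsorted(interior_edges, reference), minlength=n_bins)
    cur_counts = np.bincount(np.searchsorted(interior_edges, current), minlength=n_bins)
    # Clip the fractions so an empty bin does not produce log(0).
    ref_frac = np.clip(ref_counts / len(reference), 1e-6, None)
    cur_frac = np.clip(cur_counts / len(current), 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))


rng = np.random.default_rng(42)
training_charges = rng.normal(65, 20, 5000)   # distribution at training time
same_regime = rng.normal(65, 20, 5000)        # production, nothing changed
after_price_rise = rng.normal(85, 20, 5000)   # production, after a pricing change

psi_stable = population_stability_index(training_charges, same_regime)
psi_drifted = population_stability_index(training_charges, after_price_rise)

print(f"stable: {psi_stable:.3f}  drifted: {psi_drifted:.3f}")
```

Alert thresholds are a judgment call; values around 0.1 and 0.25 are often quoted as "watch" and "act" levels, but treat them as a starting point to calibrate on your own data.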

Putting It Together: scikit-learn Pipelines

Doing steps 4–8 as separate, manual operations is error-prone. The classic mistake is data leakage — fitting a scaler or imputer on the whole dataset before splitting, so information from the test set leaks into training and inflates your scores.

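The Pipeline object chains preprocessing and the model into one estimator. When you call pipeline.fit(X_train, y_train), every step is fitted on the training data only; when you call pipeline.predict(X_test), the already-fitted transformers are merely applied. Preprocessing leakage of this kind becomes structurally impossible, and the whole thing saves as one unit for deployment. (A Pipeline cannot protect you from a leaky feature, such as a column that would not exist at prediction time; that check belongs in Stage 2.)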

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

numeric_features = ["tenure", "monthly_charges", "total_charges"]
categorical_features = ["contract", "internet_service", "payment_method"]

# Preprocessing for numeric columns: impute missing, then scale
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Preprocessing for categorical columns: impute, then one-hot encode
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# Route each column group to the right preprocessing
preprocess = ColumnTransformer([
    ("num", numeric_pipe, numeric_features),
    ("cat", categorical_pipe, categorical_features),
])

# The full pipeline: preprocessing + model, as ONE estimator
clf = Pipeline([
    ("prep", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

clf.fit(X_train, y_train)          # fits every step on train only
y_pred = clf.predict(X_test)       # applies fitted steps, then predicts

Now tuning, cross-validation, and saving all operate on the single clf object — and every fold is leak-free automatically:

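from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV

# Note: address a step's hyperparameter with "stepname__paramname"
param_grid = {"model__C": [0.1, 1, 10]}

# Labels are "Yes"/"No", so the scorer must name the positive class explicitly
churn_recall = make_scorer(recall_score, pos_label="Yes")
search = GridSearchCV(clf, param_grid, cv=5, scoring=churn_recall)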
search.fit(X_train, y_train)

best_pipeline = search.best_estimator_
print("Test recall:", recall_score(y_test, best_pipeline.predict(X_test), pos_label="Yes"))

The double-underscore syntax model__C reaches into the model step's C hyperparameter — the same convention works for any step in any nested pipeline.
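The same convention reaches through several levels of nesting. The sketch below rebuilds a smaller version of the pipeline (numeric columns only, to keep it short) and shows how to discover and set a parameter that sits three levels deep. Nothing is fitted, so no data is needed.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric_pipe, ["tenure", "monthly_charges"]),
])
demo_clf = Pipeline([
    ("prep", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

# get_params() lists every addressable name; nested ones contain "__".
tunable = [name for name in demo_clf.get_params() if "__" in name]

# Path: pipeline step "prep" -> transformer "num" -> step "impute" -> parameter "strategy".
nested_grid = {
    "prep__num__impute__strategy": ["mean", "median"],
    "model__C": [0.1, 1, 10],
}

# The same names work with set_params outside of a search.
demo_clf.set_params(prep__num__impute__strategy="mean")
print(demo_clf.get_params()["prep__num__impute__strategy"])
```

A dictionary like nested_grid can be passed to GridSearchCV as param_grid, which lets preprocessing choices be tuned with the same cross-validation as the model's own hyperparameters.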

The Workflow at a Glance

#	Stage	What you do	Key output
1	Frame the problem	Define target, decision, metric, baseline	A clear problem statement + success metric
2	Collect & understand data	Gather from sources; inspect shape/types	A raw dataset you trust the origin of
3	Exploratory analysis (EDA)	Distributions, relationships, surprises	Insights and a list of data issues
4	Preprocess & clean	Impute, dedupe, fix types, encode, scale	An analysis-ready dataset
5	Feature engineering	Build ratios, rates, flags, interactions	A stronger feature set
6	Split the data	Train/test (and validation) split	Held-out data for honest evaluation
7	Choose a model	Match model to problem/size/interpretability	A short list of candidate algorithms
8	Train	`.fit(X_train, y_train)`	A fitted model with learned parameters
9	Evaluate	Score on the test set with the chosen metric	An unbiased performance estimate
10	Tune hyperparameters	Grid/random search + cross-validation	The best hyperparameter settings
11	Deploy	Serialise, serve behind an API or batch job	A live model producing predictions
12	Monitor & retrain	Watch drift and decay; refresh on new data	Sustained performance over time

Common Mistakes

1. Data leakage from fitting before splitting

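WRONG:  X_scaled = scaler.fit_transform(X)              # sees the whole dataset, incl. test rows
        X_train, X_test = train_test_split(X_scaled)

RIGHT:  X_train, X_test = train_test_split(X)
        X_train_scaled = scaler.fit_transform(X_train)  # test set never touched during fitting
        X_test_scaled = scaler.transform(X_test)        # apply only; never refit on test data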

Best:   wrap everything in a Pipeline so this cannot happen by accident.

Leakage produces beautiful validation scores that collapse in production. It is one of the most common and most painful ML bugs.

2. Evaluating on the training data

Scoring on data the model already learned from measures memorisation, not generalisation. Always report the test-set (or cross-validated) score, never the training score.

3. Optimising the wrong metric

On imbalanced data, chasing accuracy rewards a lazy model that predicts the majority class. Pick a metric that reflects the business cost — recall for catching churners, precision for limiting false alarms — during framing, not after.

4. Skipping EDA and preprocessing

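Jumping straight to model.fit() on raw data fails in one of two ways. Missing values and unencoded text usually make scikit-learn raise an error outright; wild outliers and mis-typed columns are worse, because the model trains without complaint and is quietly distorted. Most performance gains come from data work, not from swapping algorithms.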

5. Tuning on the test set

Using the test set to pick hyperparameters turns it into a second training set, and your final number is no longer honest. Tune with cross-validation on the training data; unseal the test set exactly once, at the end.

6. Treating deployment as the end

A shipped model without monitoring decays quietly. Without drift alerts, you often discover the model broke only when the business results do.

Practice Exercises

Frame it. For a bank predicting loan default, write out the four framing elements: target, decision, metric, and baseline. Justify why accuracy might be the wrong metric here.
Split correctly. Given a dataset where only 4% of transactions are fraud, write the train_test_split call that keeps that ratio in both splits, and explain which argument does it.
Spot the leakage. A colleague fills missing values with the column mean computed over the entire dataset, then splits into train/test. Explain what has leaked and how a Pipeline fixes it.
Build a pipeline. Using ColumnTransformer and Pipeline, construct an estimator that median-imputes and scales two numeric columns, one-hot encodes one categorical column, and ends in a LogisticRegression.
Tune it. Write a GridSearchCV over model__C values [0.01, 0.1, 1, 10] with 5-fold CV optimising recall, using the pipeline from exercise 4. What does the model__ prefix refer to?
Close the loop. Three months after deployment, recall on the churn class drops from 0.55 to 0.38 while input distributions look unchanged. Name the likely phenomenon and describe the workflow steps you would revisit.

Summary

In this chapter you learned:

The ML workflow is a 12-stage loop: frame → collect → EDA → clean → engineer → split → choose → train → evaluate → tune → deploy → monitor — feeding back into framing.
Framing first — define the target, the decision, the metric, and a baseline before modelling; the metric drives every later choice.
Most of the work is data work (understand, clean, engineer); algorithm choice and tuning are a smaller slice than beginners expect.
Split before you fit, evaluate only on the held-out test set, and tune with cross-validation — never on the test set.
scikit-learn is uniform: .fit() to train, .predict() to infer, across nearly every estimator.
The Pipeline (with ColumnTransformer) chains preprocessing and model into one estimator, making preprocessing leakage structurally impossible and giving you a single saveable unit for deployment.
Address a nested hyperparameter with the step__param convention (e.g. model__C).
Deployment is not the end: monitor for data and concept drift, and retrain on fresh data — which loops you back to the beginning.

Master this loop and every later chapter — from Linear Regression to Neural Networks — simply slots into stages 7 through 10 of a process you already understand.

Next up: Data Preprocessing & Cleaning — the deep dive into stage 4, where you will handle missing values, outliers, encoding, and messy real-world data so your models get the clean input they need.