Chapter 20 of 20

Introduction to Neural Networks & Deep Learning

Bridge from classic ML to deep learning — the perceptron, layers, activation functions, forward pass, loss, backpropagation and gradient descent, why deep learning wins on images/text/audio, and a tiny illustrative Keras example.

Meritshot · 18 min read

What Is a Neural Network?

A neural network is a machine learning model built from many tiny, connected computing units called neurons, arranged in layers. Each neuron does something almost embarrassingly simple: it multiplies its inputs by some weights, adds them up, adds a bias, and passes the result through a small non-linear function. Stack enough of these simple units in enough layers and the network can approximate astonishingly complex relationships — recognising a face, transcribing Hindi speech, or flagging a fraudulent UPI transaction.

Everything you have learned in this series so far — Linear Regression, Logistic Regression, Decision Trees, SVMs, Random Forests, XGBoost — is what practitioners loosely call classic ML. Those models are superb on tabular data (rows and columns) and remain the right first choice for most business problems. Deep learning is the branch of ML that uses neural networks with many hidden layers ("deep" networks) and shines on unstructured data: images, text, audio, and video. This chapter is the bridge from one world to the other.

Intuitive analogy. Think of a large Indian bank deciding whether to approve a loan for a customer named Priya. A single clerk applying one rule ("income above ₹8 lakh → approve") is like linear regression — simple and transparent. Now imagine a hierarchy: junior analysts each extract one small signal (repayment history, spending pattern, employer stability), team leads combine those signals into mid-level judgements, and a committee combines those into a final decision. Each level builds richer features from the level below. A deep neural network is exactly that hierarchy — early layers learn simple patterns, later layers combine them into abstract concepts.

Goal: understand how a neuron computes, how layers stack, how the network learns its weights from data, and when reaching for deep learning is worth it versus sticking with the classic ML you already know.

Where deep learning dominates:
→ Image classification (is this X-ray normal or pneumonia?)
→ Object detection (find every vehicle in a CCTV frame)
→ Speech-to-text (transcribe a customer call in Hindi/English)
→ Machine translation and chatbots
→ Recommendation at scale (what to show next on a streaming app)

From Classic ML to the Neuron

The good news: you already understand a neuron. Logistic Regression is a single neuron with a sigmoid activation. A neuron takes inputs, forms a weighted sum plus a bias, then applies an activation function.

A single neuron:

z = (w1 * x1) + (w2 * x2) + ... + (wn * xn) + b
a = activation(z)

Where:
x1..xn  = inputs (features, or outputs from the previous layer)
w1..wn  = weights (learned)
b       = bias   (learned)
z       = the pre-activation (weighted sum + bias)
a       = the neuron's output after the activation function

In compact vector form, z = w . x + b, where w . x is the dot product of the weight vector and the input vector.
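As a minimal sketch of the formula above, here is one neuron in plain Python. The function name `neuron` and the example numbers are made up for illustration; they do not come from any library. With a sigmoid activation this is the same computation logistic regression performs.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b   # z = w . x + b
    return activation(z)

# Illustrative values (assumptions for this sketch)
x = [1.0, 2.0, 3.0]     # inputs
w = [0.2, -0.5, 0.1]    # weights
b = 0.5                 # bias

a = neuron(x, w, b, sigmoid)   # z = 0.2 - 1.0 + 0.3 + 0.5 = 0.0
print(a)                       # approximately 0.5, since sigmoid(0) = 0.5
```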

The Perceptron

The perceptron is the original neuron (Rosenblatt, 1958): a weighted sum passed through a hard step function that outputs 0 or 1. It can only separate data that is linearly separable — a single straight boundary. On its own it cannot learn something as simple as the XOR pattern. The breakthrough was realising that stacking layers of neurons with smooth non-linear activations removes that limitation entirely. A network with even one hidden layer of enough neurons is, in theory, a universal function approximator.

Perceptron (historical):
z = w . x + b
output = 1 if z >= 0 else 0     # hard step — not differentiable

Modern neuron:
z = w . x + b
output = activation(z)          # smooth, differentiable — can be trained by gradient descent

The reason we replaced the step function with smooth activations is training. To learn good weights we need to compute gradients (slopes), and a hard step has a slope of zero almost everywhere. Smooth activations give us usable gradients.
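The XOR limitation can be made concrete with a small sketch. The weights below are hand-picked for illustration, not learned, and the helper names are hypothetical. One perceptron is enough for AND, while XOR needs a hidden layer of two units feeding a third. The sketch only shows what the stacked network can represent; learning such weights from data is what needs the smooth activations described above.

```python
def step(z):
    """Hard step activation used by the classic perceptron."""
    return 1 if z >= 0 else 0

def perceptron(x1, x2, w1, w2, b):
    return step(w1 * x1 + w2 * x2 + b)

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

# AND is linearly separable: a single perceptron handles it.
and_out = [perceptron(x1, x2, 1, 1, -1.5) for x1, x2 in inputs]

# XOR is not linearly separable, but a hidden layer of two units fixes that:
# h1 fires for OR, h2 fires for AND, and the output fires for "OR but not AND".
def xor_net(x1, x2):
    h1 = perceptron(x1, x2, 1, 1, -0.5)      # OR
    h2 = perceptron(x1, x2, 1, 1, -1.5)      # AND
    return perceptron(h1, h2, 1, -1, -0.5)   # h1 and not h2

xor_out = [xor_net(x1, x2) for x1, x2 in inputs]
print(and_out)   # [0, 0, 0, 1]
print(xor_out)   # [0, 1, 1, 0]
```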

Layers and What "Deep" Means

Neurons are organised into layers, and layers are chained one after another.

Input layer   →   Hidden layer(s)   →   Output layer

Input layer:   one node per input feature. Does no computation — just holds the values.
Hidden layers: the workhorses. Each learns intermediate representations (features of features).
Output layer:  produces the final prediction (a probability, a class, a number).

A network with two or more hidden layers is called deep — that is the entire origin of the term deep learning. Modern networks can have dozens or hundreds of hidden layers.

The power of depth is hierarchical feature learning. On an image, the first hidden layer might learn to detect edges; the next combines edges into corners and textures; the next combines those into shapes like eyes or wheels; the final layers combine shapes into concepts like "face" or "car". Crucially, you do not hand-craft these features — the network discovers them from the raw pixels during training. This is the single biggest difference from classic ML, where you (the data scientist) do the feature engineering by hand, as covered in the Feature Engineering & Scaling chapter.

A fully-connected ("Dense") layer:
Every neuron in a layer receives input from EVERY neuron in the previous layer.
A layer with `m` inputs and `n` neurons has an (m x n) weight matrix W and a length-n bias vector b.
Its output for a batch: A = activation(X @ W + b)   # @ is matrix multiplication
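To make the shapes concrete, here is a minimal pure-Python sketch of one Dense layer. The helper name `dense` and the numbers are assumptions for illustration; a real framework would do the same arithmetic with optimised matrix routines.

```python
def relu(z):
    return max(0.0, z)

def dense(X, W, b, activation):
    """Fully-connected layer: A = activation(X @ W + b).

    X: batch of rows, shape (batch, m)
    W: weight matrix, shape (m, n)
    b: bias vector, length n
    """
    n = len(b)
    return [
        [activation(sum(row[i] * W[i][j] for i in range(len(row))) + b[j])
         for j in range(n)]
        for row in X
    ]

# Illustrative numbers: m = 3 inputs, n = 2 neurons, batch of 2 rows
X = [[1.0, 2.0, 3.0],
     [0.0, -1.0, 1.0]]
W = [[1.0, -1.0],
     [0.5, 0.0],
     [0.0, 2.0]]
b = [0.0, -1.0]

A = dense(X, W, b, relu)
print(A)   # [[2.0, 4.0], [0.0, 1.0]]
```

Each output row has one value per neuron, so a batch of shape (2, 3) becomes a batch of shape (2, 2).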

Activation Functions

The activation function is the small non-linear step applied after each neuron's weighted sum. Without it, stacking layers would be pointless: a chain of linear operations collapses into a single linear operation, so a 100-layer network with no activations would be no more expressive than plain linear regression. Non-linearity is what lets the network bend and fold decision boundaries.
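The collapse of stacked linear layers can be checked numerically. The sketch below uses made-up weights and leaves out biases for brevity (both are assumptions). It shows that two layers without activations give the same output as one layer whose weight matrix is the product of the two.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

X = [[1.0, 2.0]]                  # one sample, two features
W1 = [[1.0, 0.0, -1.0],
      [2.0, 1.0, 0.0]]            # layer 1: 2 -> 3, no activation
W2 = [[1.0], [-2.0], [3.0]]       # layer 2: 3 -> 1, no activation

two_layers = matmul(matmul(X, W1), W2)    # pass through both layers
one_layer = matmul(X, matmul(W1, W2))     # one equivalent weight matrix
print(two_layers, one_layer)              # [[-2.0]] [[-2.0]]
```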

Here are the four you must know. Keep the symbols in mind: e is Euler's number, x is the input to the function.

Sigmoid

sigmoid(x) = 1 / (1 + e^(-x))

Output range: (0, 1)
Shape: smooth S-curve.
Use: the OUTPUT neuron of a BINARY classifier (interpret output as P(class = 1)).
Downside: for large positive or negative x the slope is ~0 ("saturates"),
          which stalls learning — the "vanishing gradient" problem in deep stacks.

Tanh

tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))

Output range: (-1, 1)
Shape: S-curve like sigmoid but zero-centred.
Use: hidden layers in older/small networks; zero-centred output often trains
     a little better than sigmoid. Still saturates at the extremes.

ReLU (Rectified Linear Unit)

relu(x) = max(0, x)

Output range: [0, +infinity)
Shape: flat at 0 for negative x, straight line for positive x.
Use: the DEFAULT for hidden layers today. Cheap to compute, does not saturate
     for positive values, and trains fast.
Downside: "dying ReLU" — neurons stuck at 0 for all inputs stop learning.
          Variants like Leaky ReLU (`leaky_relu(x) = max(0.01*x, x)`) address this.

Softmax

softmax(x_i) = e^(x_i) / sum over j of e^(x_j)

Turns a vector of raw scores into a probability distribution:
every output is in (0, 1) and all outputs SUM to 1.
Use: the OUTPUT layer of a MULTI-CLASS classifier (e.g. classify a digit as 0–9).

Which Activation Where

| Activation | Output range | Typical use | Watch out for |
| --- | --- | --- | --- |
| sigmoid | (0, 1) | Output of a binary classifier | Saturates; vanishing gradients in deep hidden layers |
| tanh | (-1, 1) | Hidden layers (older/small nets) | Also saturates at the extremes |
| relu | [0, +inf) | Default for hidden layers | "Dying ReLU" neurons stuck at 0 |
| softmax | (0, 1), sums to 1 | Output of a multi-class classifier | Only for the final layer |

A safe modern recipe: ReLU in every hidden layer, and pick the output activation by the task — sigmoid for binary, softmax for multi-class, and no activation (linear) for regression.
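The four activations, plus Leaky ReLU, fit in a few lines of plain Python. This is a minimal sketch for single numbers and short lists, not how a framework implements them. Subtracting the maximum inside `softmax` is a common numerical-stability trick that leaves the result unchanged.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return math.tanh(x)

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    return max(slope * x, x)

def softmax(scores):
    m = max(scores)                              # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print(sigmoid(0.0))              # 0.5
print(relu(-3.0), relu(3.0))     # 0.0 3.0
probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))         # three probabilities that sum to (about) 1
```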

How a Network Learns: Forward Pass, Loss, Backprop

Training a neural network is a loop of three steps repeated over and over: make a prediction, measure how wrong it is, and nudge every weight in the direction that reduces the error. You do not need heavy calculus to grasp the intuition.

Step 1 — Forward Pass

Feed a batch of inputs into the input layer and let the values flow forward, layer by layer, until the output layer produces predictions. This is just repeated "weighted sum → activation" as we described above.

Forward pass (one hidden layer):
H = relu(X @ W1 + b1)        # hidden layer activations
Y_hat = softmax(H @ W2 + b2) # output predictions
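The two lines above can be run end to end with small made-up weights. The sketch below is pure Python; the helper names (`affine`, `relu_rows`, `softmax_rows`) and all numbers are assumptions for illustration.

```python
import math

def affine(X, W, b):
    """Compute X @ W + b for a batch X given as a list of rows."""
    return [[sum(x_i * W[i][j] for i, x_i in enumerate(row)) + b[j]
             for j in range(len(b))] for row in X]

def relu_rows(Z):
    return [[max(0.0, z) for z in row] for row in Z]

def softmax_rows(Z):
    out = []
    for row in Z:
        m = max(row)
        exps = [math.exp(z - m) for z in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

X = [[1.0, -1.0], [0.5, 2.0]]                     # 2 samples, 2 features
W1 = [[1.0, -1.0, 0.5], [0.0, 1.0, -0.5]]         # 2 inputs -> 3 hidden units
b1 = [0.0, 0.0, 0.1]
W2 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]       # 3 hidden units -> 2 classes
b2 = [0.0, 0.0]

H = relu_rows(affine(X, W1, b1))           # hidden layer activations
Y_hat = softmax_rows(affine(H, W2, b2))    # one probability row per sample
for row in Y_hat:
    print(row, sum(row))                   # each row sums to (about) 1
```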

Step 2 — Loss

A loss function scores how far the predictions are from the true labels. A smaller loss means a better model. The choice of loss depends on the task.

Regression:                Mean Squared Error
  MSE = (1/n) * sum of (y_true - y_pred)^2

Binary classification:     Binary Cross-Entropy
  BCE = -(1/n) * sum of [ y*log(p) + (1 - y)*log(1 - p) ]

Multi-class classification: Categorical Cross-Entropy
  CCE = -(1/n) * sum over samples and classes of [ y_true * log(y_pred) ]
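The three formulas translate almost directly into Python. In this minimal sketch the `eps` clipping is an added assumption that keeps `log` away from zero; the function names are illustrative, not a library API.

```python
import math

def mse(y_true, y_pred):
    n = len(y_true)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    n = len(y_true)
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)          # avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / n

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true rows are one-hot; y_pred rows are probability vectors
    n = len(y_true)
    total = 0.0
    for t_row, p_row in zip(y_true, y_pred):
        total += sum(t * math.log(max(p, eps)) for t, p in zip(t_row, p_row))
    return -total / n

print(mse([3.0, 5.0], [2.0, 7.0]))                       # (1 + 4) / 2 = 2.5
print(binary_cross_entropy([1, 0], [0.9, 0.2]))          # small: both guesses are good
print(categorical_cross_entropy([[0, 1, 0]], [[0.1, 0.8, 0.1]]))
```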

Step 3 — Backpropagation + Gradient Descent

This is the heart of learning. Gradient descent is an optimisation method: it computes the gradient — the slope of the loss with respect to every weight — and steps each weight a little way downhill to reduce the loss.

Weight update rule (gradient descent):
w_new = w_old - (learning_rate * gradient_of_loss_wrt_w)

learning_rate: a small number, typically `0.001 <= learning_rate <= 0.1`.
  Too large → the loss bounces around or explodes.
  Too small → training crawls and may never converge in reasonable time.
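The effect of the learning rate is easy to see on a toy problem. The sketch below minimises loss(w) = w**2, whose gradient is 2*w, with three different rates. The loss and the specific rates are assumptions chosen for this illustration; real networks usually become unstable at much smaller rates than this toy does.

```python
def gradient_descent(lr, steps=50, w=1.0):
    """Apply w_new = w_old - lr * gradient to loss(w) = w**2."""
    for _ in range(steps):
        grad = 2.0 * w
        w = w - lr * grad
    return w

for lr in (0.0001, 0.1, 1.1):
    print(lr, gradient_descent(lr))
# 0.0001 -> w is still close to 1.0 (training crawls)
# 0.1    -> w is very close to 0.0 (converged)
# 1.1    -> |w| has grown into the thousands (diverged)
```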

Backpropagation is simply the efficient algorithm that computes all those gradients. It applies the chain rule from the output layer backwards to the input layer, reusing intermediate results so it does not recompute anything. The name literally means "backward propagation of errors": the error at the output is propagated back through the network to tell each weight how much it contributed to the mistake.

The intuition without the calculus. Imagine standing on a foggy hillside (the loss surface), trying to reach the valley floor (minimum loss). You cannot see far, but you can feel the slope under your feet. You take a small step downhill, feel the new slope, step again, and repeat. The gradient is the slope you feel; the learning rate is how big a step you take; backprop is the mechanism that measures the slope in every direction at once.
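For readers who want to see the chain rule at work, here is a minimal sketch on the smallest possible network: one tanh hidden unit, one linear output, and a squared-error loss. The architecture and numbers are assumptions for illustration. The hand-derived gradients are compared with finite-difference estimates, a common way to sanity-check a backward pass.

```python
import math

def forward(params, x, y):
    w1, b1, w2, b2 = params
    h = math.tanh(w1 * x + b1)      # hidden unit
    y_hat = w2 * h + b2             # linear output
    loss = (y_hat - y) ** 2
    return loss, (h, y_hat)

def backward(params, x, y):
    """Chain rule applied from the loss back to each parameter."""
    w1, b1, w2, b2 = params
    _, (h, y_hat) = forward(params, x, y)
    d_yhat = 2.0 * (y_hat - y)       # dLoss/dy_hat
    d_w2 = d_yhat * h
    d_b2 = d_yhat
    d_h = d_yhat * w2                # error passed back to the hidden unit
    d_z1 = d_h * (1.0 - h ** 2)      # tanh'(z) = 1 - tanh(z)^2
    d_w1 = d_z1 * x
    d_b1 = d_z1
    return [d_w1, d_b1, d_w2, d_b2]

def numeric_grad(params, x, y, eps=1e-6):
    """Central finite differences, one parameter at a time."""
    grads = []
    for i in range(len(params)):
        up = list(params)
        up[i] += eps
        down = list(params)
        down[i] -= eps
        grads.append((forward(up, x, y)[0] - forward(down, x, y)[0]) / (2 * eps))
    return grads

params = [0.5, -0.2, 1.5, 0.1]       # w1, b1, w2, b2 (made up)
analytic = backward(params, x=0.8, y=1.0)
numeric = numeric_grad(params, x=0.8, y=1.0)
print(analytic)
print(numeric)                        # should agree to several decimal places
```

Note how `d_yhat` is computed once and reused for every earlier gradient; that reuse is what makes backpropagation efficient.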

The training loop (one "epoch" = one full pass over the data):

for each epoch:
    for each mini-batch of data:
        1. Forward pass  → compute predictions
        2. Compute loss  → how wrong are we?
        3. Backprop      → compute gradients for all weights
        4. Update weights → w = w - learning_rate * gradient

Repeat for many epochs until the validation loss stops improving.
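The four steps of the loop can be sketched from scratch for the simplest network there is: a single sigmoid neuron. The synthetic data, learning rate, and batch size below are assumptions chosen for illustration. The accuracy printed at the end is measured on the training data only to confirm that the loop learns; a real evaluation needs held-out data.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

rng = random.Random(0)
# Synthetic data: the label is 1 when x1 + x2 > 0
X = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(200)]
y = [1 if x1 + x2 > 0 else 0 for x1, x2 in X]

w = [0.0, 0.0]
b = 0.0
learning_rate = 0.5
batch_size = 16
epoch_losses = []

for epoch in range(20):
    order = list(range(len(X)))
    rng.shuffle(order)                       # new mini-batches every epoch
    total_loss = 0.0
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        grad_w = [0.0, 0.0]
        grad_b = 0.0
        for i in batch:
            # 1. Forward pass
            p = sigmoid(w[0] * X[i][0] + w[1] * X[i][1] + b)
            # 2. Loss (binary cross-entropy, clipped to avoid log(0))
            p_safe = min(max(p, 1e-12), 1 - 1e-12)
            total_loss += -(y[i] * math.log(p_safe) + (1 - y[i]) * math.log(1 - p_safe))
            # 3. Gradients: for sigmoid + cross-entropy, dLoss/dz = p - y
            err = p - y[i]
            grad_w[0] += err * X[i][0]
            grad_w[1] += err * X[i][1]
            grad_b += err
        # 4. Update with the batch-average gradient
        n = len(batch)
        w[0] -= learning_rate * grad_w[0] / n
        w[1] -= learning_rate * grad_w[1] / n
        b -= learning_rate * grad_b / n
    epoch_losses.append(total_loss / len(X))

accuracy = sum(
    (sigmoid(w[0] * x1 + w[1] * x2 + b) >= 0.5) == bool(t)
    for (x1, x2), t in zip(X, y)
) / len(X)
print(epoch_losses[0], epoch_losses[-1], accuracy)   # loss falls; accuracy is high
```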

Modern optimisers such as Adam are variants of gradient descent that adapt the step size for each weight and usually converge faster than plain gradient descent. Passing optimizer="adam" to model.compile in Keras selects this optimiser with its default settings.
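For the curious, the Adam update for a single weight can be sketched as follows. This is a simplified one-dimensional illustration, not the Keras implementation. The function name `adam_minimise`, the toy loss (w - 3)**2, and the hyperparameter values are assumptions for this sketch.

```python
import math

def adam_minimise(grad_fn, w, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    m = 0.0   # running average of gradients (momentum)
    v = 0.0   # running average of squared gradients (per-weight scale)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)          # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Toy problem: minimise (w - 3)**2, whose gradient is 2 * (w - 3)
w_final = adam_minimise(lambda w: 2.0 * (w - 3.0), w=0.0, lr=0.1, steps=500)
print(w_final)   # close to 3.0
```

Dividing by the square root of `v_hat` is what gives each weight its own effective step size.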

When to Use Deep Learning vs Classic ML

Deep learning is not automatically "better". It is a tool, and using it on the wrong problem wastes time and money. The honest rule of thumb:

  • Tabular data (rows and columns): reach for classic ML first. Gradient-boosted trees (covered in the Ensemble Methods chapter on bagging, boosting and XGBoost) usually match or beat neural networks on tabular data, train faster, and are easier to interpret.
  • Unstructured data (images, text, audio, video): deep learning is the clear winner. Here the network's ability to learn its own features from raw pixels or raw text is exactly what you need, and hand-crafted features rarely keep up.

Deep learning also comes with real costs: it is data-hungry (typically needs tens of thousands to millions of examples), compute-hungry (often needs a GPU), harder to tune, and harder to explain to a regulator or a business stakeholder.

| Aspect | Classic ML (trees, SVM, XGBoost) | Deep Learning (neural networks) |
| --- | --- | --- |
| Best data type | Tabular / structured | Images, text, audio, video |
| Data needed | Works well on hundreds to thousands of rows | Usually needs tens of thousands or more |
| Feature engineering | Manual, by the data scientist | Learned automatically by the network |
| Compute | Trains on a laptop CPU | Often needs a GPU; longer training |
| Interpretability | Higher (feature importances, coefficients) | Lower (a "black box"; needs extra tools) |
| Training time | Minutes | Minutes to days |
| Tuning effort | Moderate | High (architecture, learning rate, epochs, ...) |
| Typical winner on tabular | Frequently best | Rarely worth it |

Practical takeaway. If your problem is a spreadsheet, start with XGBoost. If your problem is a folder of photos, a pile of documents, or a set of audio files, start with deep learning.

Frameworks: TensorFlow/Keras and PyTorch

You almost never implement backpropagation by hand — frameworks do it for you (this is called automatic differentiation). The two dominant frameworks are:

  • TensorFlow / Keras — TensorFlow is Google's framework; Keras is its high-level, beginner-friendly API. A few readable lines get you a working model. Great for getting started and for production deployment.
  • PyTorch — Meta's framework, favoured in research for its flexible, "Pythonic" feel. It is now equally common in industry.

Both are excellent. For a first neural network, Keras is the gentlest on-ramp, so we will use it below.

A Tiny Keras Example (Illustrative)

Below is a minimal, end-to-end neural network for a binary classification problem — say, predicting whether a customer will churn. This code is illustrative — it shows the shape of a Keras program (build → compile → fit → evaluate). Numbers like accuracy are placeholders, not measured benchmarks.

Note the two things carried over from earlier chapters: we split the data (see Train-Test Split & Cross-Validation) and we scale the features (see Feature Engineering & Scaling). Neural networks are very sensitive to feature scale, so scaling is not optional here.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow import keras
from tensorflow.keras import layers

# --- Illustrative data: 20 features, binary target (0/1) ---
# In practice X, y would come from your own dataset (e.g. a churn table).
rng = np.random.default_rng(42)
X = rng.normal(size=(5000, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # a synthetic, learnable rule

# 1) Split into train / test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 2) Scale features — CRITICAL for neural networks
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)   # fit on TRAIN only
X_test = scaler.transform(X_test)         # reuse the same scaler on TEST

# 3) Build a small Sequential network
model = keras.Sequential([
    keras.Input(shape=(20,)),              # input layer: 20 features
    layers.Dense(32, activation="relu"),   # hidden layer 1
    layers.Dense(16, activation="relu"),   # hidden layer 2  -> "deep" (2+ hidden layers)
    layers.Dense(1, activation="sigmoid"), # output: P(class = 1)
])
model.summary()                            # prints the layer table shown in the output below

# 4) Compile: choose optimiser, loss, and metric
model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

# 5) Fit (train). validation_split holds back 20% of TRAIN to watch overfitting.
history = model.fit(
    X_train, y_train,
    epochs=20,
    batch_size=32,
    validation_split=0.2,
    verbose=0,
)

# 6) Evaluate on the untouched test set
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(f"Test accuracy: {test_acc:.3f}")   # illustrative output below

Illustrative output (the layer table is what a call to model.summary() prints):

Model: "sequential"
Layer (type)             Output Shape        Param #
=====================================================
dense (Dense)            (None, 32)          672
dense_1 (Dense)          (None, 16)          528
dense_2 (Dense)          (None, 1)           17
=====================================================
Total params: 1,217

Test accuracy: 0.951        # illustrative, not a measured benchmark
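The Param # column can be checked by hand: a Dense layer with m inputs and n neurons has m * n weights plus n biases. A short sketch (the helper name `dense_params` is hypothetical):

```python
def dense_params(n_inputs, n_neurons):
    """Weights (n_inputs * n_neurons) plus one bias per neuron."""
    return n_inputs * n_neurons + n_neurons

layer_sizes = [20, 32, 16, 1]   # input features, then the three Dense layers
counts = [dense_params(m, n) for m, n in zip(layer_sizes, layer_sizes[1:])]
print(counts, sum(counts))      # [672, 528, 17] 1217
```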

A few things to read off this example:

  • The Dense layers are fully-connected layers; the numbers (32, 16, 1) are the neuron counts.
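  • The last layer has one neuron with sigmoid because it is binary classification. For a 10-class problem you would use Dense(10, activation="softmax") with loss="categorical_crossentropy" when the labels are one-hot encoded, or loss="sparse_categorical_crossentropy" when the labels are plain integers such as 0 to 9.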
  • validation_split lets you watch the validation loss during training — if it starts rising while training loss keeps falling, the network is overfitting.
  • For a regression target, the output layer would be Dense(1) with no activation and loss="mse".

Common Mistakes

1. Reaching for deep learning on small tabular data

You have 800 rows in a spreadsheet and build a 5-layer network.
It overfits badly and underperforms a 10-line XGBoost model.
Fix: on tabular data with limited rows, use classic ML first.

2. Forgetting to scale (or standardise) the inputs

Feature A is in [0, 1], Feature B is in [0, 1,000,000].
The huge-scale feature dominates the gradients; training stalls or diverges.
Fix: StandardScaler / MinMaxScaler on the training set, reuse on test.
Fit the scaler on TRAIN only — never on the whole dataset (data leakage).
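What "fit on TRAIN only" means can be sketched without any library. The values are made up, and the helper names `fit_scaler` and `transform` are hypothetical. The mean and standard deviation are learned from the training column and then reused unchanged on the test column.

```python
import statistics

def fit_scaler(column):
    """Learn the mean and (population) standard deviation from TRAIN values only."""
    return statistics.fmean(column), statistics.pstdev(column)

def transform(column, mean, std):
    return [(v - mean) / std for v in column]

train_income = [200_000.0, 400_000.0, 600_000.0, 800_000.0]   # made-up values
test_income = [500_000.0, 1_000_000.0]

mean, std = fit_scaler(train_income)                # statistics come from TRAIN
train_scaled = transform(train_income, mean, std)
test_scaled = transform(test_income, mean, std)     # the same statistics reused on TEST

print(train_scaled)   # mean 0, standard deviation 1
print(test_scaled)    # scaled with TRAIN statistics, so not necessarily mean 0
```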

3. Ignoring overfitting

Training accuracy = 0.99 but validation accuracy = 0.71 → the network memorised
the training set instead of learning the pattern.
Fixes: get more data, add Dropout layers, use L2 regularisation, and stop
training early (EarlyStopping) when validation loss stops improving.
See the "Bias-Variance, Overfitting & Regularization" chapter.
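The idea behind early stopping can be sketched in a few lines. This is a simplified illustration of the logic, not the Keras EarlyStopping implementation. The function name, the `patience` value, and the validation losses are assumptions.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Made-up validation losses: improving at first, then rising as the model overfits
val_losses = [0.70, 0.55, 0.48, 0.45, 0.46, 0.47, 0.50, 0.55]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)   # 6 3
```

Training halts at epoch 6, and the weights worth keeping are those from epoch 3, where the validation loss was lowest.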

4. A learning rate that is wrong by orders of magnitude

Loss = NaN or wildly oscillating   → learning rate too HIGH.
Loss barely moves after many epochs → learning rate too LOW.
Fix: start around 0.001 (Adam's default) and adjust by factors of 10.

5. Using the wrong output activation or loss for the task

Multi-class problem with a `sigmoid` output and MSE loss → poor results.
Fix: match them —
  binary       -> sigmoid  + binary_crossentropy
  multi-class  -> softmax  + categorical_crossentropy (one-hot labels; sparse_categorical_crossentropy for integer labels)
  regression   -> linear (none) + mse

6. Evaluating on data the model has seen

Reporting accuracy on the training set (or on data used to tune) looks great
but lies. Always report on a held-out TEST set the model never touched.

Practice Exercises

  1. Write out, in the form z = w . x + b followed by a = activation(z), the computation of a single neuron with inputs x = [2, -1], weights w = [0.5, 1.5], bias b = -1, and a ReLU activation. What is a?

  2. A colleague uses sigmoid on the output of a network that must classify an image into one of 5 animal categories. Explain why this is wrong and state the correct output activation and loss function.

  3. Explain in two or three sentences, without calculus, what backpropagation does and how the learning rate controls training.

  4. You have a dataset of 1,200 rows of customer transaction features (tabular) and must predict fraud. Would you start with deep learning or classic ML? Justify your answer using the comparison table.

  5. In the Keras example, change the task to a 3-class classification problem. What would you change about the final Dense layer, its activation, and the loss argument in compile?

  6. A network reaches 98% training accuracy but only 74% validation accuracy. Name the problem and list three concrete techniques to reduce it.

Summary

In this chapter you learned:

  • A neuron computes z = w . x + b then a = activation(z); logistic regression is a single sigmoid neuron, so you already knew the building block.
  • Neurons are organised into input, hidden, and output layers; a network with two or more hidden layers is deep, and depth enables automatic hierarchical feature learning.
  • Activation functions add the non-linearity that makes depth useful: sigmoid (binary output), tanh (older hidden layers), relu (default hidden layer), and softmax (multi-class output).
  • Training is a loop: forward pass → loss → backpropagation (compute gradients via the chain rule) → gradient descent weight update w_new = w_old - learning_rate * gradient.
  • Deep learning wins on unstructured data (images, text, audio) but needs lots of data and compute; on tabular data, classic ML like XGBoost usually matches or beats it and is easier to interpret.
  • The dominant frameworks are TensorFlow/Keras and PyTorch; both provide automatic differentiation so you never hand-code backprop.
  • A Keras workflow is build (Sequential + Dense) → compile → fit → evaluate, and always split and scale your data first.
  • Common pitfalls: deep learning on tiny tabular data, unscaled inputs, ignoring overfitting, a badly chosen learning rate, mismatched output activation/loss, and evaluating on seen data.

This completes the Machine Learning tutorial series — from the Introduction to Machine Learning through preprocessing, feature engineering, every major supervised and unsupervised algorithm, model evaluation, regularization, ensembles, and now neural networks and deep learning. You now have an end-to-end foundation covering the full modern ML toolkit. Great next steps are to build a portfolio project on a real dataset, go deeper into a specialisation such as computer vision (CNNs) or natural language processing (transformers), and keep practising the workflow on problems that matter to you.