Fundamentals of Deep Learning — Interview Questions & Answers

50 essential deep learning interview questions covering neural networks, activation functions, backpropagation, CNNs, RNNs, and training techniques.

Meritshot · 19 min read

Neural Network Basics

1. What is deep learning?

Deep learning is a subfield of machine learning that uses artificial neural networks with multiple hidden layers to automatically learn hierarchical representations of data. Unlike traditional machine learning, which often relies on hand-crafted features, deep learning models learn features directly from raw inputs such as images, text, or audio. This capacity for automatic feature extraction has made deep learning the dominant approach for tasks like computer vision, speech recognition, and natural language processing.

2. What is an artificial neuron?

An artificial neuron, sometimes called a perceptron, is the basic computational unit of a neural network that loosely mimics a biological neuron. It computes a weighted sum of its inputs, adds a bias term, and passes the result through an activation function to produce an output. Mathematically this is expressed as output = f(w·x + b), where w are the weights, x are the inputs, b is the bias, and f is the activation function.
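
The formula output = f(w·x + b) can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation; the function names `neuron` and `sigmoid` and the example values are assumptions chosen for the demo.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, used here as the activation function f."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b, f=sigmoid):
    """Compute f(w·x + b) for a single neuron."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # weighted sum plus bias
    return f(z)

x = [1.0, 2.0]      # inputs (illustrative values)
w = [0.5, -0.25]    # weights (illustrative values)
b = 0.0             # bias
y = neuron(x, w, b)  # weighted sum is 0.5 - 0.5 + 0 = 0, so the sigmoid gives 0.5
```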

3. What is the difference between a shallow and a deep neural network?

A shallow neural network typically has only one hidden layer between the input and output layers, while a deep neural network has two or more hidden layers. The additional layers in a deep network allow it to learn increasingly abstract and hierarchical features, with early layers capturing simple patterns and later layers combining them into complex concepts. This depth is what gives deep learning its representational power but also makes it harder to train.

4. What are weights and biases in a neural network?

Weights are the learnable parameters that determine the strength and direction of the connection between two neurons, scaling how much an input influences the output. The bias is an additional learnable parameter added to the weighted sum that allows the activation function to be shifted, giving the model more flexibility to fit the data. During training, both weights and biases are adjusted through optimisation to minimise the loss function.

5. What is a fully connected (dense) layer?

A fully connected or dense layer is a layer in which every neuron is connected to every neuron in the previous layer, so each output is a weighted combination of all inputs. These layers are the most general type of layer and are commonly used in the final stages of a network for classification or regression. Because every connection has its own weight, dense layers can have a very large number of parameters, which increases computational and memory cost.

6. What is the role of the input, hidden, and output layers?

The input layer receives the raw features of the data and passes them into the network without performing computation. Hidden layers sit between the input and output and progressively transform the data into more useful representations through weighted sums and activation functions. The output layer produces the final prediction, with its size and activation function chosen to match the task, such as a single sigmoid unit for binary classification or a softmax layer for multi-class problems.

7. What is forward propagation?

Forward propagation is the process of passing input data through the network, layer by layer, to compute the final output or prediction. At each layer, the inputs are multiplied by weights, summed with a bias, and passed through an activation function before being fed to the next layer. The result of forward propagation is then compared with the true target using a loss function to measure how well the network is performing.

8. Why do neural networks need non-linearity?

Neural networks need non-linearity because stacking multiple linear layers without non-linear activations is mathematically equivalent to a single linear transformation, no matter how many layers are added. Non-linear activation functions allow the network to approximate complex, non-linear relationships between inputs and outputs. Without them, the model could only learn linear decision boundaries and would lose the expressive power that makes deep learning effective.
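
The collapse of stacked linear layers can be checked numerically. The sketch below uses two small hand-picked weight matrices (assumed values, biases omitted for brevity) and shows that applying them in sequence matches a single matrix, while inserting a ReLU between them does not.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu(v):
    return [max(0.0, vi) for vi in v]

W1 = [[1.0, 2.0], [0.0, -1.0]]   # first "layer" (illustrative values)
W2 = [[3.0, 1.0], [1.0, 1.0]]    # second "layer" (illustrative values)
x = [1.0, 1.0]

two_layers = matvec(W2, matvec(W1, x))        # two linear layers in sequence
one_layer = matvec(matmul(W2, W1), x)         # one equivalent linear layer
with_relu = matvec(W2, relu(matvec(W1, x)))   # non-linearity in between

print(two_layers, one_layer, with_relu)
```

Here the first two results agree, while the ReLU version differs because the negative intermediate value is zeroed out before the second layer.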

9. What is the universal approximation theorem?

The universal approximation theorem states that a feedforward neural network with a single hidden layer, a suitable non-linear activation function, and a sufficient number of neurons can approximate any continuous function on a compact (closed and bounded) domain to arbitrary accuracy. It provides theoretical justification for why neural networks are powerful function approximators. However, the theorem only guarantees that such a network exists: it does not say how many neurons are needed, that training will find the right weights, or that a single wide layer is more practical than a deeper, narrower architecture.

10. What is the difference between deep learning and traditional machine learning?

Traditional machine learning often requires manual feature engineering and works well on structured, tabular data with relatively small datasets. Deep learning automatically learns features from raw data and tends to excel on large, high-dimensional, unstructured data such as images, audio, and text. However, deep learning typically needs far more data and computational resources, and its models are less interpretable than simpler algorithms like decision trees or linear regression.

Activation & Loss Functions

11. What is an activation function?

An activation function is a mathematical function applied to the output of a neuron that introduces non-linearity into the network. It determines whether and how strongly a neuron should be activated based on its weighted input. Common activation functions include ReLU, sigmoid, tanh, and softmax, each suited to different layers and tasks within a network.

12. What is the ReLU activation function?

The Rectified Linear Unit (ReLU) is defined as f(x) = max(0, x), outputting the input directly if positive and zero otherwise. It is popular because it is computationally cheap, helps mitigate the vanishing gradient problem for positive inputs, and promotes sparse activations. Its main drawback is the dying ReLU problem: if a neuron's weighted input stays negative, it outputs zero and receives zero gradient, so it can become permanently inactive during training.
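
A minimal sketch of ReLU and its gradient follows. The helper names are assumptions, and the gradient at exactly zero is set to 0 by convention. The zero gradient for negative inputs is what makes a "dead" neuron stop updating.

```python
def relu(x):
    """f(x) = max(0, x)."""
    return max(0.0, x)

def relu_grad(x):
    """Derivative of ReLU; 0 is used at x == 0 by convention."""
    return 1.0 if x > 0 else 0.0

values = [-2.0, 0.0, 3.0]
outputs = [relu(v) for v in values]      # [0.0, 0.0, 3.0]
grads = [relu_grad(v) for v in values]   # [0.0, 0.0, 1.0]
```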

13. What is the sigmoid activation function?

The sigmoid function maps any real-valued input to a value between 0 and 1 using the formula f(x) = 1 / (1 + e^-x), producing an S-shaped curve. It is commonly used in the output layer for binary classification because its output can be interpreted as a probability. However, it suffers from vanishing gradients at extreme input values and outputs that are not zero-centred, which can slow down training in hidden layers.

14. What is the tanh activation function and how does it differ from sigmoid?

The tanh function maps inputs to a range between -1 and 1 and is defined as f(x) = (e^x - e^-x) / (e^x + e^-x). Unlike the sigmoid, tanh is zero-centred, which often makes optimisation easier because gradients are less biased in one direction. However, like the sigmoid, it still suffers from the vanishing gradient problem for very large or very small input values.
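
The two functions are closely related: tanh is a rescaled and shifted sigmoid, tanh(x) = 2·sigmoid(2x) − 1. The short check below (helper names are assumptions) verifies the identity numerically against Python's built-in `math.tanh`.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_from_sigmoid(x):
    # Identity: tanh(x) = 2 * sigmoid(2x) - 1
    return 2.0 * sigmoid(2.0 * x) - 1.0

for x in [-3.0, -0.5, 0.0, 0.5, 3.0]:
    assert abs(math.tanh(x) - tanh_from_sigmoid(x)) < 1e-12

# sigmoid is centred on 0.5, tanh on 0
print(sigmoid(0.0), math.tanh(0.0))
```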

15. What is the softmax function and where is it used?

The softmax function converts a vector of raw scores (logits) into a probability distribution where all values are positive and sum to one. It is used in the output layer of multi-class classification networks, with each output representing the predicted probability of a particular class. The class with the highest softmax value is typically chosen as the model's prediction.
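
A minimal softmax sketch in plain Python is shown below. Subtracting the maximum logit before exponentiating is a common numerical-stability trick that does not change the result; the example logits are assumed values.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to one."""
    m = max(logits)                            # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
predicted_class = probs.index(max(probs))      # index 0 has the largest logit
```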

16. What is the vanishing gradient problem?

The vanishing gradient problem occurs when gradients become extremely small as they are propagated backward through many layers, causing the early layers to learn very slowly or stop learning altogether. It is especially common with saturating activation functions like sigmoid and tanh, where derivatives approach zero. Solutions include using ReLU activations, batch normalisation, residual connections, and careful weight initialisation.
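
The effect can be illustrated with a rough upper bound. The sigmoid's derivative never exceeds 0.25, so a chain of n sigmoid layers scales the gradient by at most 0.25^n from the activations alone. The sketch below ignores the weights (treats them as 1) to isolate that effect, which is a simplifying assumption.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # peaks at 0.25 when x == 0

# Best case: every pre-activation is exactly 0, so each layer contributes 0.25.
depths = [1, 5, 10, 20]
upper_bounds = {n: sigmoid_grad(0.0) ** n for n in depths}

# In the saturated region the local derivative is already tiny.
saturated = sigmoid_grad(10.0)

print(upper_bounds, saturated)
```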

17. What is the exploding gradient problem?

The exploding gradient problem occurs when gradients grow exponentially large during backpropagation, causing unstable updates and weights that diverge to very large values or NaN. It is most common in deep networks and recurrent neural networks processing long sequences. Common remedies include gradient clipping, weight regularisation, and using more stable architectures such as LSTMs.

18. What is a loss function?

A loss function quantifies the difference between a model's predictions and the actual target values, producing a single number that the training process tries to minimise. It serves as the objective that guides how weights are updated during optimisation. The choice of loss function depends on the task, with regression and classification problems requiring different formulations.

19. What loss functions are used for regression and classification?

For regression tasks, common loss functions include Mean Squared Error (MSE) and Mean Absolute Error (MAE), which measure the average magnitude of prediction errors. For classification, cross-entropy loss is standard, with binary cross-entropy used for two-class problems and categorical cross-entropy for multi-class problems. The loss function should align with the output activation, such as pairing softmax outputs with categorical cross-entropy.

20. What is cross-entropy loss?

Cross-entropy loss measures the dissimilarity between the predicted probability distribution and the true distribution of class labels, heavily penalising confident but incorrect predictions. For classification, it encourages the model to assign high probability to the correct class. It is the most widely used loss for neural network classifiers and pairs naturally with sigmoid or softmax output activations.
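
For a single example with a one-hot label, cross-entropy reduces to the negative log of the probability assigned to the true class. The sketch below (function name, `eps` guard, and example probabilities are assumptions) shows how a confident wrong prediction is penalised far more than an unsure one.

```python
import math

def cross_entropy(probs, true_index, eps=1e-12):
    """Loss for one example: -log of the probability given to the true class."""
    return -math.log(max(probs[true_index], eps))  # eps guards against log(0)

confident_right = cross_entropy([0.90, 0.05, 0.05], 0)   # about 0.105
unsure = cross_entropy([0.40, 0.30, 0.30], 0)            # about 0.916
confident_wrong = cross_entropy([0.01, 0.98, 0.01], 0)   # about 4.605
```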

Training & Optimization

21. What is backpropagation?

Backpropagation is the algorithm used to compute gradients of the loss function with respect to each weight in the network by applying the chain rule of calculus. It works by propagating the error backward from the output layer to the input layer, calculating how much each parameter contributed to the total error. These gradients are then used by an optimiser to update the weights and reduce the loss.

22. What is gradient descent?

Gradient descent is an optimisation algorithm that iteratively adjusts model parameters in the direction that reduces the loss function. At each step, it computes the gradient of the loss with respect to the parameters and moves them in the opposite direction, scaled by the learning rate. By repeating this process, the model gradually converges toward a set of parameters that minimise the loss.
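
The update rule can be sketched on a toy one-parameter problem. The loss L(w) = (w − 3)^2, the starting point, and the learning rate are all assumptions picked so the behaviour is easy to follow.

```python
def loss(w):
    return (w - 3.0) ** 2          # toy loss with its minimum at w = 3

def grad(w):
    return 2.0 * (w - 3.0)         # derivative of the toy loss

def gradient_descent(w0, learning_rate=0.1, steps=100):
    w = w0
    history = [loss(w)]
    for _ in range(steps):
        w = w - learning_rate * grad(w)   # step against the gradient
        history.append(loss(w))
    return w, history

w_final, history = gradient_descent(w0=0.0)
print(w_final, history[0], history[-1])
```

With these settings the loss shrinks at every step and `w_final` ends up very close to 3. Re-running with a learning rate above 1.0 on this particular loss makes the iterates overshoot and diverge instead.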

23. What is the difference between batch, stochastic, and mini-batch gradient descent?

Batch gradient descent computes the gradient using the entire training dataset before each update, which is accurate but slow and memory-intensive. Stochastic gradient descent (SGD) updates parameters after every single training example, making it fast but noisy. Mini-batch gradient descent strikes a balance by updating after small batches of examples, combining computational efficiency with more stable convergence, and is the most commonly used approach in practice.

24. What is the learning rate and why is it important?

The learning rate is a hyperparameter that controls the size of the steps taken when updating weights during gradient descent. If it is too high, the model may overshoot the minimum and fail to converge or even diverge, while if it is too low, training becomes very slow and may get stuck in poor local minima. Choosing an appropriate learning rate, often with the help of schedules or adaptive methods, is one of the most critical decisions in training a neural network.

25. What are optimisers like Adam, RMSprop, and SGD with momentum?

Optimisers are algorithms that determine how weights are updated using computed gradients, often improving on plain gradient descent. SGD with momentum accumulates a moving average of past gradients to accelerate convergence and dampen oscillations. RMSprop adapts the learning rate per parameter based on recent gradient magnitudes, while Adam combines momentum and adaptive learning rates, making it a robust and popular default choice for many deep learning tasks.

26. What is an epoch, a batch, and an iteration?

An epoch is one complete pass of the entire training dataset through the network. A batch is the subset of training examples processed together before a single weight update, and the batch size controls how many examples that includes. An iteration is one update step, so the number of iterations per epoch equals the dataset size divided by the batch size, rounded up when the final batch is smaller than the rest (or rounded down if that partial batch is dropped).
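
The arithmetic is simple but easy to get off by one when the dataset size is not a multiple of the batch size. In this small sketch the `drop_last` flag is a hypothetical parameter name used to show both conventions.

```python
import math

def iterations_per_epoch(dataset_size, batch_size, drop_last=False):
    """Number of weight updates in one epoch."""
    if drop_last:
        return dataset_size // batch_size        # discard the partial final batch
    return math.ceil(dataset_size / batch_size)  # keep the partial final batch

print(iterations_per_epoch(1000, 32))                  # 32 (31 full batches + 1 of 8)
print(iterations_per_epoch(1000, 32, drop_last=True))  # 31
print(iterations_per_epoch(1024, 32))                  # 32
```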

27. What is weight initialisation and why does it matter?

Weight initialisation is the process of setting the starting values of a network's weights before training begins. Poor initialisation can lead to vanishing or exploding gradients, slow convergence, or neurons that fail to learn. Strategies such as Xavier (Glorot) initialisation for tanh and He initialisation for ReLU set weights to appropriate scales based on layer size, helping signals and gradients flow properly through the network.
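
As a rough sketch, the Gaussian variants of these schemes draw weights with standard deviation sqrt(2 / (fan_in + fan_out)) for Xavier and sqrt(2 / fan_in) for He. The helper names, layer sizes, and seed below are assumptions; uniform variants of both schemes also exist.

```python
import math
import random

def xavier_std(fan_in, fan_out):
    """Standard deviation for Xavier (Glorot) normal initialisation."""
    return math.sqrt(2.0 / (fan_in + fan_out))

def he_std(fan_in):
    """Standard deviation for He normal initialisation."""
    return math.sqrt(2.0 / fan_in)

def init_weights(fan_in, fan_out, std, seed=0):
    """Sample a fan_out x fan_in weight matrix from N(0, std^2)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

W = init_weights(256, 128, he_std(256))   # layer feeding a ReLU
print(xavier_std(256, 128), he_std(256))
```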

28. What is batch normalisation?

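Batch normalisation is a technique that normalises the inputs to each layer across a mini-batch so they have zero mean and unit variance, then applies learnable scale and shift parameters. It stabilises and accelerates training, allowing higher learning rates and reducing sensitivity to initialisation; it was originally motivated as reducing internal covariate shift, although later work attributes much of the benefit to a smoother optimisation landscape. At inference time, running averages of the mean and variance collected during training are used in place of batch statistics. It also provides a mild regularisation effect because of the noise introduced by batch statistics.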
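
A minimal sketch of the training-time computation for one feature across a mini-batch is shown below. The names `gamma`, `beta`, and `eps` follow common convention but are assumptions here; a full implementation would also track running statistics for use at inference.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one feature across a mini-batch, then scale and shift."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n   # biased (population) variance
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

out = batch_norm([1.0, 2.0, 3.0, 4.0])
out_mean = sum(out) / len(out)                                # approximately 0
out_var = sum((v - out_mean) ** 2 for v in out) / len(out)    # approximately 1
print(out, out_mean, out_var)
```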

29. What is a learning rate schedule?

A learning rate schedule adjusts the learning rate during training rather than keeping it fixed, typically reducing it over time as the model approaches a minimum. Common strategies include step decay, exponential decay, and cosine annealing, while warm-up schedules gradually increase the rate at the start of training. Scheduling helps the model make large progress early and fine-tune carefully later, improving both convergence speed and final accuracy.
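
The schedules named above can each be written as a small function of the epoch or step. The sketch below is illustrative only; the parameter names and default values are assumptions, not taken from any particular framework.

```python
import math

def step_decay(base_lr, epoch, drop=0.5, every=10):
    """Multiply the rate by `drop` once every `every` epochs."""
    return base_lr * (drop ** (epoch // every))

def cosine_annealing(base_lr, epoch, total_epochs, min_lr=0.0):
    """Smoothly decay from base_lr to min_lr along half a cosine wave."""
    cos = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cos)

def linear_warmup(base_lr, step, warmup_steps):
    """Ramp the rate up linearly over the first warmup_steps updates."""
    return base_lr * min(1.0, (step + 1) / warmup_steps)

for epoch in (0, 10, 25, 50, 100):
    print(epoch, step_decay(0.1, epoch), cosine_annealing(0.1, epoch, 100))
```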

30. What is gradient clipping?

Gradient clipping is a technique that limits the magnitude of gradients before the weight update to prevent the exploding gradient problem. In norm-based clipping, when the norm of the gradient exceeds a predefined threshold, the gradient is rescaled so its norm equals that threshold while preserving its direction; value-based clipping instead caps each component individually, which can change the direction. It is especially useful in training recurrent neural networks and other deep architectures prone to unstable gradients.
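
Norm-based clipping can be sketched as follows. The function name and threshold are assumptions; the gradient is represented as a flat list of numbers for simplicity.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale the gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)            # small enough, leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]  # same direction, norm == max_norm

clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)     # norm 5 is rescaled to norm 1
untouched = clip_by_norm([0.3, 0.4], max_norm=1.0)   # norm 0.5 is left as is
print(clipped, untouched)
```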

Convolutional Neural Networks (CNNs)

31. What is a convolutional neural network (CNN)?

A convolutional neural network is a type of deep network designed primarily for processing grid-like data such as images. It uses convolutional layers that apply learnable filters across the input to detect local patterns like edges, textures, and shapes. CNNs exploit spatial locality and weight sharing, making them far more efficient and effective than fully connected networks for image-related tasks.

32. What is a convolution operation in a CNN?

A convolution operation slides a small filter or kernel across the input image, computing the dot product between the filter weights and the local region it covers at each position. This produces a feature map that highlights where specific patterns appear in the input. By learning many filters, the network can detect a wide variety of features at different locations across the image.
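
The sliding dot product can be written out directly. This sketch assumes stride 1 and no padding, and, like most deep learning libraries, it does not flip the kernel (strictly speaking this is cross-correlation). The tiny image and the vertical-edge filter are made-up values.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image and take a dot product at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out

image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
vertical_edge = [[-1, 1],
                 [-1, 1]]
feature_map = conv2d(image, vertical_edge)   # [[0, 2, 0], [0, 2, 0]]
```

The feature map is non-zero only in the middle column, where the dark-to-bright vertical edge sits.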

33. What is a filter or kernel in a CNN?

A filter, also called a kernel, is a small matrix of learnable weights that is convolved with the input to detect a particular feature, such as a vertical edge or a corner. Each convolutional layer typically contains many filters, each producing its own feature map. The values within the filters are learned during training so the network can discover the most useful features for the task.

34. What is pooling and why is it used?

Pooling is a downsampling operation that reduces the spatial dimensions of feature maps, decreasing computation and, indirectly, the number of parameters in any later fully connected layers (pooling itself has no learnable parameters). Max pooling takes the maximum value within each region while average pooling takes the mean, with max pooling being the more common choice. Pooling also provides a degree of translation invariance, making the network more robust to small shifts in the input.
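
A minimal max-pooling sketch with a 2×2 window and stride 2 follows; the function name and the feature-map values are assumptions for the demo.

```python
def max_pool2d(fmap, size=2, stride=2):
    """Take the maximum over each size x size window of the feature map."""
    out = []
    for i in range(0, len(fmap) - size + 1, stride):
        row = []
        for j in range(0, len(fmap[0]) - size + 1, stride):
            row.append(max(fmap[i + a][j + b]
                           for a in range(size) for b in range(size)))
        out.append(row)
    return out

fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 2, 1, 6],
]
pooled = max_pool2d(fmap)   # [[4, 5], [3, 7]]
```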

35. What is stride and padding in convolution?

Stride is the number of pixels the filter moves at each step as it slides across the input, with a larger stride producing smaller output feature maps. Padding adds extra pixels, usually zeros, around the border of the input so that the spatial dimensions can be preserved or controlled. Together, stride and padding determine the size of the output feature map and how edge information is handled.
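
Along one spatial dimension, the output size follows the standard formula floor((n + 2p − k) / s) + 1, where n is the input size, k the kernel size, p the padding, and s the stride. A small helper (the name is an assumption) makes the effect easy to explore:

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Output length along one dimension: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1

print(conv_output_size(28, 5))                        # 24: no padding shrinks the map
print(conv_output_size(32, 3, stride=1, padding=1))   # 32: size preserved
print(conv_output_size(32, 3, stride=2, padding=1))   # 16: stride 2 halves it
print(conv_output_size(224, 7, stride=2, padding=3))  # 112
```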

36. Why are CNNs more efficient than fully connected networks for images?

CNNs use weight sharing, meaning the same filter is applied across the entire image, drastically reducing the number of parameters compared with connecting every pixel to every neuron. They also exploit local connectivity, since each neuron looks at only a small region of the input, which matches the spatial structure of images. This combination makes CNNs more parameter-efficient, faster to train, and better at generalising on visual data.

37. What is a feature map in a CNN?

A feature map is the output produced when a filter is convolved across the input, representing the presence and strength of a particular feature at each spatial location. Early layers produce feature maps detecting simple patterns like edges, while deeper layers produce feature maps representing more complex structures like object parts. The collection of feature maps at each layer forms the learned representation passed to subsequent layers.

38. What is transfer learning in the context of CNNs?

Transfer learning is the practice of reusing a model pre-trained on a large dataset, such as ImageNet, as the starting point for a new but related task. The pre-trained network has already learned general visual features, so only the later layers need to be retrained or fine-tuned on the new data. This approach dramatically reduces training time and data requirements while often improving performance, especially when the new dataset is small.

39. What are some well-known CNN architectures?

Notable CNN architectures include LeNet, one of the earliest designs for digit recognition, and AlexNet, which popularised deep CNNs after winning the 2012 ImageNet competition. VGG introduced very deep networks with small filters, ResNet introduced residual connections to train extremely deep models, and Inception used parallel filters of different sizes. Each architecture contributed innovations that influenced modern computer vision design.

40. What are residual connections and why are they useful?

Residual connections, introduced in ResNet, add the input of a layer directly to its output through a skip connection, allowing the layer to learn a residual function rather than a full transformation. They help gradients flow more easily through very deep networks, mitigating the vanishing gradient problem. This makes it possible to train networks with hundreds of layers without degradation in performance.

Recurrent Networks & Sequence Models

41. What is a recurrent neural network (RNN)?

A recurrent neural network is a type of network designed to process sequential data by maintaining a hidden state that carries information from previous time steps. At each step, the RNN combines the current input with the previous hidden state to produce an output and an updated state. This makes RNNs well suited to tasks such as language modelling, time-series forecasting, and speech recognition where order matters.
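
One common form of the update is h_t = tanh(w_x·x_t + w_h·h_{t−1} + b). The sketch below uses a scalar hidden state and hand-picked weights to keep it readable; both are simplifying assumptions, since real RNNs use vectors and weight matrices.

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, b):
    """Combine the current input with the previous hidden state."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def run_rnn(sequence, w_x=0.5, w_h=0.8, b=0.0, h0=0.0):
    """Apply the same step (shared weights) at every time step."""
    h = h0
    states = []
    for x_t in sequence:
        h = rnn_step(x_t, h, w_x, w_h, b)
        states.append(h)
    return states

states = run_rnn([1.0, 0.0, 0.0])
print(states)
```

The later states stay non-zero even though the later inputs are zero, because the hidden state carries the first input forward. Feeding the same values in a different order gives a different final state.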

42. Why do basic RNNs struggle with long-term dependencies?

Basic RNNs struggle with long-term dependencies because gradients tend to vanish or explode as they are propagated back through many time steps. This makes it difficult for the network to learn relationships between events that are far apart in a sequence. As a result, vanilla RNNs effectively remember only short-term context, which motivated the development of gated architectures like LSTMs and GRUs.

43. What is an LSTM and how does it work?

A Long Short-Term Memory (LSTM) network is a type of RNN that uses a cell state and three gates (input, forget, and output) to control the flow of information over time. The gates decide what to add, retain, or discard from the cell state, allowing the network to preserve relevant information across long sequences. Because the cell state is updated additively, this gating mechanism greatly mitigates the vanishing gradient problem, although it does not eliminate it, and enables LSTMs to capture far longer-term dependencies than vanilla RNNs.

44. What is a GRU and how does it differ from an LSTM?

A Gated Recurrent Unit (GRU) is a simplified variant of the LSTM that combines the forget and input gates into a single update gate and merges the cell state and hidden state. This makes GRUs computationally lighter and faster to train, with fewer parameters than LSTMs. In practice, GRUs and LSTMs often achieve comparable performance, with the best choice depending on the dataset and task.

45. What is a bidirectional RNN?

A bidirectional RNN processes a sequence in both forward and backward directions using two separate hidden layers, then combines their outputs at each time step. This allows the network to use both past and future context when making predictions, which is valuable for tasks like named entity recognition and machine translation. The trade-off is that the entire sequence must be available before processing, so it is unsuitable for real-time streaming tasks.

46. What is the attention mechanism?

The attention mechanism allows a model to dynamically focus on the most relevant parts of the input when producing each element of the output, rather than relying on a single fixed-length representation. It computes a set of weights that indicate how much each input position should contribute to the current output. Attention dramatically improved performance on sequence tasks and forms the foundation of the Transformer architecture.
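
One widely used way to compute these weights is scaled dot-product attention: score each input position by the dot product of a query with that position's key, normalise the scores with softmax, and take the weighted sum of the values. The sketch below handles a single query; the vectors are made-up values, and other scoring functions (such as additive attention) exist.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)                       # one weight per input position
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, context = attention([4.0, 0.0], keys, values)
print(weights, context)
```

The query aligns with the first and third keys, so those positions receive most of the weight and dominate the resulting context vector.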

47. What is a Transformer and why is it significant?

A Transformer is a sequence model that relies entirely on self-attention mechanisms rather than recurrence, allowing it to process all positions in a sequence in parallel. This parallelism makes Transformers far more efficient to train on large datasets and better at capturing long-range dependencies than RNNs. They underpin modern large language models such as BERT and GPT, making them one of the most influential architectures in deep learning.

48. What are embeddings in deep learning?

Embeddings are dense, low-dimensional vector representations of discrete items such as words, products, or categories, learned so that semantically similar items lie close together in the vector space. They allow neural networks to capture relationships and meaning that one-hot encodings cannot, while greatly reducing dimensionality. Word embeddings like Word2Vec and GloVe are classic examples widely used in natural language processing.

Regularization & Best Practices

49. What is dropout and how does it prevent overfitting?

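Dropout is a regularisation technique that randomly deactivates a fraction of neurons during each training step, forcing the network not to rely too heavily on any single neuron. This encourages the network to learn redundant, robust representations and can be viewed as approximately training an ensemble of sub-networks that share weights. At inference time all neurons are used; to keep expected activations consistent, outputs are either scaled down by the keep probability at inference or, in the more common inverted dropout, scaled up during training so that nothing changes at inference. The result is reduced overfitting and better generalisation.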
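
A minimal sketch of the "inverted" variant of dropout follows, in which surviving activations are scaled up by 1 / (1 − p) during training so that inference can use the activations unchanged. The function signature, the `rng` argument, and the seed are assumptions for the demo.

```python
import random

def dropout(activations, p=0.5, training=True, rng=random):
    """Inverted dropout: zero each unit with probability p, rescale the rest."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(activations)          # inference: use every neuron as is
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)                    # seeded for repeatability
acts = [1.0] * 10
train_out = dropout(acts, p=0.5, rng=rng)        # each entry is 0.0 or 2.0
eval_out = dropout(acts, p=0.5, training=False)  # identical to acts
print(train_out, eval_out)
```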

50. What are common techniques to prevent overfitting in deep learning?

Common techniques include adding dropout layers, applying L1 or L2 weight regularisation, and using data augmentation to artificially expand the training set. Early stopping halts training when validation performance stops improving, while batch normalisation and gathering more training data also help. Combining several of these methods, along with choosing an appropriately sized architecture, typically yields the best generalisation to unseen data.