Fundamentals of Machine Learning — Interview Questions & Answers

50 essential machine learning interview questions covering core concepts, supervised learning, unsupervised learning, model evaluation, and deep learning.

Meritshot · 19 min read

Machine Learning Basics

1. What is machine learning?

Machine learning is a subset of artificial intelligence that enables systems to learn patterns from data and improve their performance on tasks without being explicitly programmed. Instead of following hard-coded rules, a machine learning model identifies statistical patterns in training data and uses those patterns to make predictions or decisions on new, unseen data.

2. What are the three main types of machine learning?

The three main types are supervised learning, unsupervised learning, and reinforcement learning. In supervised learning, the model trains on labelled data where the correct output is provided. In unsupervised learning, the model finds hidden patterns in unlabelled data. In reinforcement learning, an agent learns by interacting with an environment and receiving rewards or penalties based on its actions.

3. What is the difference between a parameter and a hyperparameter?

A parameter is a value the model learns from the training data, such as the weights and biases in a neural network or the coefficients in a linear regression. A hyperparameter is a configuration setting that is chosen before training begins and controls the learning process itself, such as the learning rate, the number of trees in a random forest, or the maximum depth of a decision tree.

4. What is the bias-variance tradeoff?

Bias refers to the error introduced by approximating a complex real-world problem with a simplified model — high bias leads to underfitting. Variance refers to the model's sensitivity to fluctuations in the training data — high variance leads to overfitting. The bias-variance tradeoff means reducing one often increases the other, and the goal is to find a model complexity that minimises total error on unseen data.

5. What is overfitting and how do you prevent it?

Overfitting occurs when a model fits the training data too closely, including its noise and random fluctuations, and therefore generalises poorly to new data. It can be reduced by using more training data, applying regularisation techniques such as L1 or L2, using dropout in neural networks, pruning decision trees, or choosing a simpler model architecture. Cross-validation does not prevent overfitting by itself, but it helps detect it and select a model that generalises.

6. What is the difference between a generative and a discriminative model?

A discriminative model learns the boundary between classes directly, typically by modelling the conditional probability P(y|x) (logistic regression) or the decision function itself (SVMs). A generative model learns how the data itself is distributed, either the joint distribution P(x, y) as in Naive Bayes or the distribution of the inputs alone as in GANs, and can therefore generate new data samples. Discriminative models tend to perform better on classification tasks when enough labelled data is available, while generative models are useful for data generation and density estimation.

7. What is feature engineering?

Feature engineering is the process of using domain knowledge to create, transform, or select input variables that improve model performance. It includes techniques such as one-hot encoding for categorical variables, normalisation and standardisation for numerical variables, creating interaction terms, extracting date components, and applying log or polynomial transformations to skewed features.

8. What is the curse of dimensionality?

The curse of dimensionality refers to the phenomenon where the performance of machine learning algorithms degrades as the number of features increases, because the data becomes increasingly sparse in high-dimensional space. This makes distance-based algorithms unreliable and increases computational cost. It is addressed through dimensionality reduction techniques such as PCA, feature selection, or autoencoders.

9. What is the difference between bagging and boosting?

Bagging (Bootstrap Aggregating) trains multiple models independently, and therefore in parallel if desired, on bootstrap samples of the training data (random samples drawn with replacement) and combines their predictions by averaging or majority vote to reduce variance. Random Forest is the most popular example. Boosting trains models sequentially, where each model focuses on correcting the errors of the previous ones, which mainly reduces bias. Examples include AdaBoost, Gradient Boosting, and XGBoost.

10. What is cross-validation and why is it used?

Cross-validation is a technique for evaluating model performance by splitting the data into multiple folds, training the model on some folds and validating on the remaining fold, and repeating this process for all folds. K-fold cross-validation is the most common approach. It provides a more reliable estimate of model generalisation than a single train-test split, especially on small datasets, and helps in hyperparameter tuning.
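The fold rotation described above can be sketched in a few lines of plain Python. This is only an illustration of how the indices are split: the function name k_fold_indices is hypothetical, no shuffling is performed, and in practice a library routine would normally be used instead.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) once for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


# Every sample lands in exactly one validation fold.
folds = list(k_fold_indices(10, 3))
for train, val in folds:
    print(len(train), "train /", len(val), "validation")
```

A model would be fitted on each train split and scored on the matching validation split, and the k scores averaged.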

Supervised Learning

11. What is linear regression and when is it used?

Linear regression models the relationship between a continuous dependent variable and one or more independent variables by fitting a straight line (or a hyperplane when there are several predictors) that minimises the sum of squared residuals. It is used when the relationship between variables is approximately linear, for example predicting house prices from square footage or forecasting sales from advertising spend. Its standard assumptions include linearity, independence of errors, homoscedasticity, and little or no multicollinearity among the predictors.

12. What is logistic regression and how does it differ from linear regression?

Logistic regression is a classification algorithm that models the probability of a binary outcome using the sigmoid function to map predictions to values between 0 and 1. Unlike linear regression, which predicts a continuous value, logistic regression predicts the probability of class membership and applies a threshold (typically 0.5) to assign class labels. It is used for binary classification tasks such as spam detection or disease prediction.
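As a minimal sketch of the prediction step only (not of training), the following applies the sigmoid to a linear score and thresholds the result. The weights, bias, and feature values are made-up numbers chosen for illustration.

```python
import math


def sigmoid(z):
    """Map any real number to (0, 1), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, features):
    """Probability of the positive class for one example."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def predict_label(weights, bias, features, threshold=0.5):
    """Turn the probability into a class label using a threshold."""
    return 1 if predict_proba(weights, bias, features) >= threshold else 0


# Hypothetical, already-trained coefficients.
weights, bias = [1.5, -2.0], 0.25
print(predict_proba(weights, bias, [2.0, 0.5]), predict_label(weights, bias, [2.0, 0.5]))
```

Lowering or raising the threshold trades recall against precision without retraining the model.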

13. How does a decision tree work?

A decision tree recursively splits the dataset into subsets based on the feature that best separates the classes, using criteria such as Gini impurity or information gain (entropy). Each internal node represents a decision on a feature, each branch represents the outcome of that decision, and each leaf node represents a class label or a predicted value. Trees are interpretable but prone to overfitting without pruning.
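The splitting criteria mentioned above can be computed directly from class counts. The sketch below is illustrative: the helper names and the toy spam/ham labels are assumptions, and a real tree would evaluate this gain for every candidate feature and threshold.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


parent = ["spam"] * 4 + ["ham"] * 4
left, right = ["spam"] * 4, ["ham"] * 4  # a split that separates the classes perfectly
print(gini(parent), entropy(parent), information_gain(parent, left, right))
```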

14. What is a Support Vector Machine (SVM)?

A Support Vector Machine finds the hyperplane that maximally separates two classes in feature space by maximising the margin — the distance between the hyperplane and the nearest data points from each class, called support vectors. For non-linearly separable data, SVMs use the kernel trick to map data into a higher-dimensional space where a linear separator exists. Common kernels include linear, polynomial, and RBF.

15. What is the difference between hard and soft margin SVM?

Hard margin SVM requires perfect linear separability and allows no misclassifications, making it sensitive to outliers and noise. Soft margin SVM introduces slack variables that permit some margin violations and misclassifications, with the penalty on them controlled by the regularisation parameter C. A large C penalises violations heavily, allowing fewer misclassifications on the training data but risking overfitting, while a small C tolerates more violations in exchange for a wider margin and often better generalisation. Soft margin SVM is used in most real-world applications.

16. What is Naive Bayes and why is it called "naive"?

Naive Bayes is a probabilistic classification algorithm based on Bayes' theorem that assumes all features are conditionally independent given the class label. It is called "naive" because this independence assumption is rarely true in practice — features are usually correlated. Despite this simplification, Naive Bayes performs surprisingly well on text classification tasks such as spam filtering and sentiment analysis due to its speed and simplicity.

17. What is the difference between Random Forest and Gradient Boosting?

Random Forest builds many decision trees in parallel, each trained on a different bootstrap sample with random feature selection, and aggregates their predictions to reduce variance. Gradient Boosting builds trees sequentially, where each tree corrects the residual errors of the previous ensemble, reducing bias. Random Forest is faster to train and more robust to outliers, while Gradient Boosting typically achieves higher accuracy with proper tuning.

18. What is regularisation and what are L1 and L2?

Regularisation adds a penalty term to the loss function to discourage large coefficients and prevent overfitting. L1 regularisation (Lasso) adds the sum of absolute values of coefficients, which can shrink some coefficients to exactly zero and thus performs feature selection. L2 regularisation (Ridge) adds the sum of squared coefficients, which shrinks all coefficients towards zero but rarely to exactly zero. ElasticNet combines both L1 and L2 penalties.
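Why L1 produces exact zeros while L2 only shrinks can be seen in a simplified one-coefficient setting. The sketch assumes each coefficient is penalised independently around its unpenalised value w_ols, which holds only in special cases such as orthonormal features; the function names and numbers are hypothetical.

```python
def lasso_shrink(w_ols, lam):
    """Minimiser of 0.5 * (w - w_ols)**2 + lam * abs(w), i.e. soft-thresholding."""
    if w_ols > lam:
        return w_ols - lam
    if w_ols < -lam:
        return w_ols + lam
    return 0.0  # small coefficients are set to exactly zero


def ridge_shrink(w_ols, lam):
    """Minimiser of 0.5 * (w - w_ols)**2 + 0.5 * lam * w**2."""
    return w_ols / (1.0 + lam)  # proportional shrinkage, never exactly zero


unpenalised = [3.0, -0.4, 0.2, -2.5]  # hypothetical unpenalised coefficients
lam = 0.5
lasso_coefs = [lasso_shrink(w, lam) for w in unpenalised]
ridge_coefs = [ridge_shrink(w, lam) for w in unpenalised]
print(lasso_coefs)
print(ridge_coefs)
```

The two small coefficients vanish under the L1 rule, which is the feature-selection effect, while the L2 rule keeps all four non-zero.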

19. What is gradient descent?

Gradient descent is an optimisation algorithm that iteratively updates model parameters in the direction of the negative gradient of the loss function to minimise the error. In batch gradient descent, the gradient is computed over the entire dataset. Stochastic gradient descent (SGD) updates parameters after each training example, while mini-batch gradient descent uses small batches. The learning rate controls the step size of each update.
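A minimal sketch of batch gradient descent, assuming a one-feature linear model with mean squared error as the loss. The data, learning rate, and epoch count are illustrative choices, not recommendations.

```python
def batch_gradient_descent(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = w * x + b by minimising mean squared error over the full dataset."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of MSE with respect to w and b, computed on the whole batch.
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        # Step against the gradient; the learning rate sets the step size.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
w, b = batch_gradient_descent(xs, ys)
print(round(w, 4), round(b, 4))
```

Stochastic and mini-batch variants use the same update but compute the gradient on one example or a small subset per step.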

20. What is the difference between classification and regression?

Classification predicts a discrete class label, such as whether an email is spam or not spam, or which category a product belongs to. Regression predicts a continuous numerical value, such as a house price or a temperature reading. The key difference is in the output — categorical versus continuous — and this determines the appropriate algorithms, loss functions, and evaluation metrics to use.

Unsupervised Learning & Clustering

21. What is K-Means clustering?

K-Means partitions data into K clusters by iteratively assigning each data point to the nearest centroid and then recalculating the centroid as the mean of all points in the cluster. The algorithm converges when assignments no longer change. The choice of K is a hyperparameter and can be determined using the elbow method or silhouette score. K-Means assumes spherical clusters and is sensitive to outliers and initialisation.
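The assign-then-update loop can be sketched as follows. This is a bare-bones illustration that takes the initial centroids as an argument rather than choosing them, and the toy points are made up; it needs Python 3.8 or later for math.dist.

```python
import math


def k_means(points, initial_centroids, max_iters=100):
    """Alternate assignment and update steps until assignments stop changing."""
    centroids = list(initial_centroids)
    assignments = None
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignments


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignments = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids, assignments)
```

Because the result depends on the starting centroids, implementations usually run several initialisations and keep the best one.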

22. What is the elbow method?

The elbow method is used to select the optimal number of clusters K for K-Means by plotting the within-cluster sum of squares (WCSS) against different values of K. As K increases, WCSS decreases. The "elbow" point is where the rate of decrease slows significantly, suggesting that adding more clusters beyond that point yields diminishing returns. It is a heuristic and the elbow may not always be clearly visible.

23. What is hierarchical clustering?

Hierarchical clustering builds a tree of clusters called a dendrogram without requiring a pre-specified number of clusters. Agglomerative clustering starts with each point as its own cluster and merges the two closest clusters iteratively. Divisive clustering starts with all points in one cluster and splits them. The linkage criterion (single, complete, average, or Ward) determines how cluster distances are measured. The dendrogram can be cut at any level to obtain a desired number of clusters.

24. What is Principal Component Analysis (PCA)?

PCA is a dimensionality reduction technique that transforms the original features into a smaller set of uncorrelated components called principal components. Each principal component is a linear combination of the original features, ordered by the amount of variance they explain. PCA is used for visualisation, noise reduction, and addressing the curse of dimensionality, but the resulting components lose interpretability.

25. What is the difference between PCA and t-SNE?

PCA is a linear dimensionality reduction technique that preserves global variance and is deterministic and fast. t-SNE (t-distributed Stochastic Neighbour Embedding) is a non-linear technique that preserves local structure and is excellent for visualising high-dimensional data in 2D or 3D, but is computationally expensive, non-deterministic, and should not be used for tasks requiring preserved global distances. PCA is used for preprocessing while t-SNE is used primarily for visualisation.

26. What is an anomaly detection algorithm?

Anomaly detection identifies data points that deviate significantly from the expected pattern. Common approaches include Isolation Forest (which isolates anomalies by randomly selecting features and split values), One-Class SVM (which learns a boundary around normal data), and statistical methods using z-scores or IQR to flag outliers. Autoencoders can also detect anomalies by measuring reconstruction error — anomalies have high reconstruction error.

27. What is DBSCAN?

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups together points that are closely packed (high density) and marks points in low-density regions as outliers. Unlike K-Means, it does not require specifying the number of clusters and can find arbitrarily shaped clusters. It requires two parameters: epsilon (the radius of a neighbourhood) and min_samples (the minimum number of points to form a dense region). It handles noise well but struggles with clusters of varying density.

28. What is an autoencoder?

An autoencoder is a neural network trained to compress input data into a low-dimensional latent representation (encoding) and then reconstruct the original input from that representation (decoding). The encoder maps input to a bottleneck, and the decoder reconstructs from it. Autoencoders are used for dimensionality reduction, anomaly detection, denoising, and feature learning. Variational autoencoders (VAEs) add a probabilistic element enabling data generation.

29. What is the silhouette score?

The silhouette score measures how similar a data point is to its own cluster compared to other clusters. It ranges from -1 to 1, where a score close to 1 indicates the point is well within its cluster, a score near 0 indicates the point is on the boundary between clusters, and a negative score indicates the point may have been assigned to the wrong cluster. The average silhouette score over all points is used to evaluate clustering quality.

30. What is association rule mining?

Association rule mining discovers interesting relationships or patterns between variables in large datasets, commonly used in market basket analysis. The Apriori algorithm generates rules of the form "if A then B" and measures their quality using support (frequency of the itemset), confidence (likelihood that B occurs given A), and lift (how much more likely B is given A compared to random chance). A lift greater than 1 indicates a positive association.
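The three rule-quality measures can be computed directly from a list of transactions. The sketch below covers only the metrics, not Apriori's candidate generation and pruning, and the basket contents are invented for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


def lift(transactions, antecedent, consequent):
    """Confidence divided by the baseline frequency of the consequent."""
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)


transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
    {"bread", "butter", "jam"},
]
rule_confidence = confidence(transactions, {"bread"}, {"butter"})
rule_lift = lift(transactions, {"bread"}, {"butter"})
print(rule_confidence, rule_lift)
```

Here the rule "if bread then butter" has confidence of about 0.75 and lift of about 1.25, so butter is more likely in baskets that contain bread than in baskets overall.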

Model Evaluation

31. What is a confusion matrix?

A confusion matrix is a table that summarises the performance of a classification model by showing the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). From these values, key metrics such as accuracy, precision, recall, and F1 score can be derived. It is particularly useful for understanding which specific classes the model is confusing with each other.

32. What is the difference between precision and recall?

Precision is the proportion of positive predictions that are actually correct: TP / (TP + FP). It answers "of all the items predicted positive, how many were actually positive?" Recall (sensitivity) is the proportion of actual positives that the model correctly identified: TP / (TP + FN). It answers "of all the actual positives, how many did the model catch?" There is typically a tradeoff: increasing one often decreases the other.

33. What is the F1 score?

The F1 score is the harmonic mean of precision and recall, calculated as 2 × (precision × recall) / (precision + recall). It provides a single metric that balances both precision and recall and is particularly useful when dealing with imbalanced datasets where accuracy alone can be misleading. The F1 score ranges from 0 to 1, with 1 being perfect precision and recall.
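A short sketch showing how precision, recall, and F1 follow from confusion-matrix counts. The counts are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not the only possible one.

```python
def precision(tp, fp):
    """TP / (TP + FP); defined as 0.0 when nothing was predicted positive."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """TP / (TP + FN); defined as 0.0 when there are no actual positives."""
    return tp / (tp + fn) if tp + fn else 0.0


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0


tp, fp, fn = 40, 10, 20  # hypothetical counts
print(precision(tp, fp), recall(tp, fn), f1_score(tp, fp, fn))
```

The harmonic mean stays close to the smaller of the two values, so a model cannot reach a high F1 by maximising only one of them.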

34. What is the ROC curve and AUC?

The ROC (Receiver Operating Characteristic) curve plots the true positive rate (recall) against the false positive rate at various classification thresholds. AUC (Area Under the Curve) summarises the ROC curve into a single number between 0 and 1. An AUC of 0.5 indicates a model no better than random chance, while an AUC of 1.0 indicates perfect discrimination. AUC is threshold-independent and useful for comparing classifiers, although on heavily imbalanced datasets the precision-recall curve is often more informative.
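AUC can also be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way by comparing every pair; it is quadratic in the number of examples and meant only to show the idea, and the labels and scores are made up.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```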

35. What is the difference between accuracy and balanced accuracy?

Accuracy is the proportion of all predictions that are correct: (TP + TN) / (TP + TN + FP + FN). It can be misleading on imbalanced datasets, because a model that always predicts the majority class still achieves high accuracy. Balanced accuracy is the average of the recall obtained on each class, giving every class equal weight regardless of its frequency. It is a better metric when class distributions are skewed.
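The gap between the two metrics is easiest to see on a deliberately lopsided example. In this sketch the dataset (95 negatives, 5 positives) and the always-negative "model" are invented for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    recalls = []
    for cls in set(y_true):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # always predicts the majority class
print(accuracy(y_true, y_pred), balanced_accuracy(y_true, y_pred))
```

Accuracy comes out at 0.95 even though no positive is ever found, while balanced accuracy is 0.5, the same as guessing.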

36. What is mean squared error (MSE) and root mean squared error (RMSE)?

MSE is the average of the squared differences between predicted and actual values, penalising larger errors more heavily. RMSE is the square root of MSE and is in the same units as the target variable, making it more interpretable. Mean Absolute Error (MAE) is the average of the absolute differences and is more robust to outliers than RMSE. R-squared measures the proportion of variance in the target explained by the model.
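The four regression metrics can be written out in a few lines. The target and prediction values below are made up for illustration.

```python
import math


def mse(y_true, y_pred):
    """Mean of squared errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of MSE, in the same units as the target."""
    return math.sqrt(mse(y_true, y_pred))


def mae(y_true, y_pred):
    """Mean of absolute errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 minus residual sum of squares over total sum of squares."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 7.0]
print(mse(y_true, y_pred), rmse(y_true, y_pred), mae(y_true, y_pred), r_squared(y_true, y_pred))
```

The single error of 2 contributes 4 of the 5 units of squared error, which shows how MSE and RMSE weight large errors more heavily than MAE does.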

37. What is a learning curve?

A learning curve plots model performance (such as accuracy or loss) on both the training set and validation set as a function of the number of training examples or training iterations. It helps diagnose underfitting (both curves are poor), overfitting (training curve is good but validation curve is poor), or a good fit (both curves converge at a satisfactory level). It guides decisions about collecting more data or changing model complexity.

38. What is stratified sampling and when is it important?

Stratified sampling ensures that each class is proportionally represented in both the training and test sets by splitting within each class separately. It is particularly important for imbalanced classification problems where a simple random split might result in the minority class being underrepresented or absent in the test set. It is standard practice in classification tasks and is implemented with stratify=y in scikit-learn's train_test_split.
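To show what splitting within each class means, here is a hand-rolled sketch in plain Python. It is not scikit-learn's implementation: the function name stratified_split, the rounding rule, and the choice to keep at least one test sample per class are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) with class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within the class only
        # Assumption: every class contributes at least one test sample.
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = [0] * 90 + [1] * 10  # 10% minority class
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx), sum(labels[i] for i in test_idx))
```

The minority class makes up 10% of both the training and the test indices, which a purely random split of this size would not guarantee.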

39. What is early stopping?

Early stopping is a regularisation technique used during model training that monitors performance on a validation set and stops training when performance stops improving (or starts degrading), preventing overfitting. It saves the model weights from the best-performing epoch. It is commonly used in neural networks and gradient boosting algorithms and is controlled by a patience parameter that determines how many epochs without improvement are tolerated before stopping.
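The patience logic can be sketched independently of any framework. In this illustration the validation losses are supplied as a precomputed list of made-up numbers, whereas a real training loop would compute one per epoch and save the model weights whenever a new best is reached.

```python
def early_stopping(val_losses, patience=3):
    """Return (best_epoch, stopped_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: a real loop would checkpoint the weights here.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.49]
result = early_stopping(val_losses, patience=3)
print(result)  # (3, 6): best at epoch 3, training halted at epoch 6
```

Training stops before the final value of 0.49 is ever reached, which illustrates the tradeoff in choosing the patience: too small a value can halt on a temporary plateau.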

40. How do you handle class imbalance?

Class imbalance is handled through several strategies. Resampling changes the training distribution by oversampling the minority class (for example with SMOTE), undersampling the majority class, or combining both. Class weighting adjusts the class weight parameter in algorithms like logistic regression and random forest to penalise misclassification of the minority class more heavily. Finally, evaluation should rely on appropriate metrics such as AUC-ROC, F1 score, or balanced accuracy instead of raw accuracy.

Deep Learning & Advanced

41. What is a neural network?

A neural network is a computational model inspired by the structure of the human brain, consisting of layers of interconnected nodes (neurons). Each connection has a weight, and each neuron applies an activation function to the weighted sum of its inputs. The network learns by adjusting weights using backpropagation and gradient descent to minimise a loss function. Deep neural networks have multiple hidden layers enabling them to learn hierarchical representations.

42. What is backpropagation?

Backpropagation is the algorithm used to train neural networks by computing the gradient of the loss function with respect to each weight using the chain rule of calculus. Starting from the output layer and propagating backward through the network, it calculates how much each weight contributes to the overall error. These gradients are then used by gradient descent to update the weights in the direction that reduces the loss.

43. What are activation functions and why are they needed?

Activation functions introduce non-linearity into neural networks, enabling them to learn complex patterns beyond simple linear relationships. Without activation functions, a deep network would behave identically to a single-layer linear model. Common activation functions include ReLU (Rectified Linear Unit), which returns max(0, x) and is the most widely used; Sigmoid, which outputs values between 0 and 1; Tanh, which outputs between -1 and 1; and Softmax, which is used in the output layer for multi-class classification.
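The four functions named above are short enough to write out. This is a plain-Python sketch operating on scalars and lists, whereas frameworks apply the same formulas to whole tensors; the softmax subtracts the maximum logit first, a common trick for numerical stability.

```python
import math


def relu(x):
    """max(0, x)."""
    return max(0.0, x)


def sigmoid(x):
    """Output in (0, 1), written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def tanh(x):
    """Output in (-1, 1)."""
    return math.tanh(x)


def softmax(logits):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(logits)  # subtracting the max does not change the result
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])
print(relu(-2.0), sigmoid(0.0), tanh(0.0), probs)
```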

44. What is a Convolutional Neural Network (CNN)?

A CNN is a deep learning architecture designed for processing grid-structured data such as images. It uses convolutional layers that apply learned filters across the input to detect local patterns like edges and textures, pooling layers that reduce spatial dimensions, and fully connected layers for classification. CNNs achieve state-of-the-art performance on image classification, object detection, and segmentation tasks due to their ability to exploit spatial hierarchies.

45. What is a Recurrent Neural Network (RNN) and what problem does the LSTM solve?

An RNN processes sequential data by maintaining a hidden state that captures information from previous time steps, making it suitable for tasks like language modelling and time series prediction. However, vanilla RNNs suffer from the vanishing gradient problem, where gradients become extremely small during backpropagation through time, making it difficult to learn long-range dependencies. LSTMs (Long Short-Term Memory) largely mitigate this with a separate cell state and gating mechanisms (input, forget, and output gates) that regulate information flow.

46. What is transfer learning?

Transfer learning is the technique of taking a model pre-trained on a large dataset (such as ImageNet for images or large text corpora for NLP) and fine-tuning it on a smaller, task-specific dataset. The pre-trained model's learned features serve as a starting point, dramatically reducing the amount of labelled data and compute required. It is standard practice in computer vision (VGG, ResNet) and NLP (BERT, GPT) applications.

47. What is a Transformer architecture?

A Transformer is a deep learning architecture based entirely on the self-attention mechanism, introduced in the paper "Attention Is All You Need." It processes all input tokens in parallel rather than sequentially, making it highly parallelisable. Self-attention allows each token to attend to all other tokens in the sequence, capturing long-range dependencies effectively. Transformers form the basis of modern large language models such as GPT and BERT.
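The core computation, scaled dot-product attention, can be sketched in plain Python. This is a simplified illustration: it omits the learned query, key, and value projections, multiple heads, masking, and positional encodings that a real Transformer layer uses, and the token vectors are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """For each query, return a weighted average of the value vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs


# Self-attention: queries, keys, and values all come from the same tokens.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = scaled_dot_product_attention(tokens, tokens, tokens)
print(attended)
```

Each output row depends on every input token, and the rows can be computed independently of one another, which is what allows the parallel processing described above.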

48. What is dropout and how does it work?

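Dropout is a regularisation technique for neural networks that randomly deactivates a proportion of neurons (sets their outputs to 0) during each training pass, with the proportion controlled by the dropout rate. This prevents neurons from co-adapting too closely and forces the network to learn redundant representations, reducing overfitting. During inference, all neurons are active. To keep expected activations consistent between training and inference, the original formulation multiplies outputs at inference by the keep probability (1 minus the dropout rate), while most modern frameworks use inverted dropout, which instead scales the surviving activations up by 1 / (1 - rate) during training and leaves inference unscaled.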
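A minimal sketch of dropout on a list of activations. It uses the inverted-dropout variant, in which the rescaling happens during training so that inference needs no adjustment; the function signature and the example values are assumptions made for illustration.

```python
import random


def dropout(activations, rate, training, rng=random):
    """Inverted dropout: zero each activation with probability `rate` in training."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)  # inference: everything passes through unchanged
    keep_prob = 1.0 - rate
    # Survivors are scaled by 1 / keep_prob so the expected value is preserved.
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(0)
train_out = dropout([1.0] * 1000, rate=0.5, training=True, rng=rng)
eval_out = dropout([1.0, 2.0], rate=0.5, training=False)
print(sum(train_out) / len(train_out), eval_out)
```

With a rate of 0.5 roughly half of the training outputs are 0 and the rest are 2.0, so their mean stays close to the original activation of 1.0.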

49. What is batch normalisation?

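Batch normalisation normalises the inputs to each layer across the mini-batch to have zero mean and unit variance, then applies learnable scale and shift parameters. It accelerates training by allowing higher learning rates, reduces sensitivity to weight initialisation, and has a mild regularising effect. It was originally motivated as a way to reduce internal covariate shift, the phenomenon where the distribution of layer inputs changes during training, although how much of its benefit comes from that is debated. At inference it uses running estimates of the mean and variance collected during training rather than batch statistics. It can be placed before or after the activation function; the original formulation places it before.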
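The training-time computation for a single feature can be sketched as follows. The names gamma and beta for the learnable scale and shift follow common usage, the epsilon value and the batch are illustrative, and the running statistics used at inference are left out.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one feature across a mini-batch, then scale and shift."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n  # biased (population) variance
    # eps guards against division by zero when the batch variance is tiny.
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


batch = [1.0, 2.0, 3.0, 4.0]
normalised = batch_norm(batch)
shifted = batch_norm(batch, gamma=2.0, beta=5.0)
print(normalised)
print(shifted)
```

With gamma = 1 and beta = 0 the output has mean 0 and variance close to 1; other values let the network undo or adjust the normalisation where that helps.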

50. What is the difference between XGBoost, LightGBM, and CatBoost?

All three are gradient boosting implementations but differ in their approach. XGBoost grows trees level-wise (breadth-first) by default and has traditionally required categorical variables to be encoded explicitly. LightGBM uses leaf-wise growth, which usually makes it faster and more memory-efficient, especially on large datasets, at the cost of a higher overfitting risk on small ones. CatBoost handles categorical features natively and uses ordered boosting, which makes it particularly effective on datasets with many categorical variables. As a rule of thumb, LightGBM is often the fastest on large datasets, while CatBoost often performs best on categorical-heavy data.