Fundamentals of Natural Language Processing — Interview Questions & Answers | Meritshot Interview Guides

NLP Fundamentals

1. What is Natural Language Processing?

Natural Language Processing (NLP) is a branch of artificial intelligence concerned with enabling computers to understand, interpret, and generate human language. It sits at the intersection of linguistics, computer science, and machine learning. NLP tasks range from simple (spell checking, keyword search) to complex (machine translation, question answering, sentiment analysis, text generation). Modern NLP is dominated by large language models (LLMs) based on the Transformer architecture, which have dramatically improved performance across virtually all NLP benchmarks and enabled new capabilities like conversational AI, code generation, and document summarisation.

2. What is tokenisation?

Tokenisation is the process of splitting text into smaller units called tokens, which serve as the input to NLP models. Word tokenisation splits on whitespace and punctuation. Subword tokenisation (used by modern LLMs) breaks rare words into smaller pieces: "unhappiness" → ["un", "happiness"] or ["un", "##hap", "##pi", "##ness"]. Common algorithms include Byte Pair Encoding (BPE, used by GPT), WordPiece (BERT), and SentencePiece (used for multilingual models). Subword tokenisation balances vocabulary size (too large = memory inefficient; too small = too many tokens per sentence) and handles unseen words by decomposing them into known subwords.
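
As a rough illustration of how a subword tokeniser decomposes an unseen word, the sketch below applies greedy longest-match-first splitting in the style of WordPiece. The function name `wordpiece_tokenize` and the tiny `toy_vocab` are assumptions made for this example; a real tokeniser learns its vocabulary from a corpus and handles many more edge cases.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into known subwords."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no known subword covers this position
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary, chosen to reproduce the split shown above.
toy_vocab = {"un", "##hap", "##pi", "##ness", "happy"}
print(wordpiece_tokenize("unhappiness", toy_vocab))
```

With this vocabulary the word is split into "un", "##hap", "##pi", "##ness"; a word with no matching pieces falls back to the unknown token.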

3. What is text preprocessing and what steps does it involve?

Text preprocessing transforms raw text into a clean, standardised form for NLP models. Common steps include: lowercasing (normalise case), removing punctuation and special characters, removing stopwords (common words like "the", "is" with little semantic value — though modern deep learning models often keep them), stemming (reducing words to their root by removing suffixes: "running" → "run") or lemmatisation (reducing to the dictionary form: "ran" → "run"), expanding contractions ("don't" → "do not"), handling numbers and dates, and removing HTML tags or URLs. Deep learning models often require less manual preprocessing than traditional ML models, as they learn representations from raw or lightly cleaned text.
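
Several of the steps above can be chained in a few lines. This is a minimal sketch using only the standard library; the `CONTRACTIONS` and `STOPWORDS` tables are tiny hypothetical stand-ins for the fuller resources a real pipeline would use, and the step order is one reasonable choice rather than a fixed rule.

```python
import html
import re

# Tiny illustrative lookup tables (assumptions, not complete lists).
CONTRACTIONS = {"don't": "do not", "it's": "it is", "can't": "cannot"}
STOPWORDS = {"the", "is", "a", "an", "and", "it"}


def preprocess(text, remove_stopwords=True):
    text = html.unescape(text)                    # decode entities such as &amp;
    text = re.sub(r"<[^>]+>", " ", text)          # strip HTML tags
    text = re.sub(r"https?://\S+", " ", text)     # strip URLs
    text = text.lower()                           # normalise case
    for short, full in CONTRACTIONS.items():      # expand contractions
        text = text.replace(short, full)
    text = re.sub(r"[^a-z0-9\s]", " ", text)      # drop punctuation and symbols
    tokens = text.split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens


print(preprocess("<p>Don't miss the demo at https://example.com!</p>"))
```

Tags are stripped before URLs here so that a closing tag directly after a link is not swallowed by the URL pattern.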

4. What is the difference between stemming and lemmatisation?

Stemming is a rule-based process that removes suffixes to produce a word stem, which may not be a valid word: "running" → "run", "happiness" → "happi". It is fast and simple but imprecise. Lemmatisation uses a dictionary and morphological analysis to return the valid base form (lemma) of a word: "running" → "run", "better" → "good", "are" → "be". Lemmatisation requires part-of-speech information (knowing whether "running" is a verb or adjective) for correct results. Lemmatisation is more accurate but slower. For classical ML pipelines, lemmatisation is generally preferred; for deep learning, neither is usually needed as the model learns from raw tokens.
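
The contrast can be sketched with two toy functions. `naive_stem` is a deliberately crude suffix stripper (an assumption for illustration, not the Porter algorithm), and `LEMMAS` is a hypothetical four-entry dictionary standing in for a real morphological lexicon.

```python
SUFFIXES = ("ness", "ing", "ed", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical lemma dictionary keyed by (word, part of speech).
LEMMAS = {
    ("running", "VERB"): "run",
    ("ran", "VERB"): "run",
    ("better", "ADJ"): "good",
    ("are", "VERB"): "be",
}


def lemmatize(word, pos):
    """Look up the dictionary form; fall back to the word itself."""
    return LEMMAS.get((word, pos), word)


for w in ("running", "happiness", "better"):
    print(w, "->", naive_stem(w))
print(lemmatize("better", "ADJ"), lemmatize("running", "VERB"), lemmatize("running", "NOUN"))
```

The crude stemmer turns "running" into "runn" and leaves "better" untouched, while the dictionary lookup maps "better" to "good" only because it is told the part of speech.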

5. What are stopwords and should you always remove them?

Stopwords are high-frequency function words that carry little semantic meaning in isolation, such as "the", "a", "is", "in", and "and". Traditional NLP and bag-of-words models remove them to reduce noise and dimensionality. However, removing stopwords is not always appropriate. In sentiment analysis, words like "not" and "but" carry semantic value; in question answering, function words are part of the question's structure; and transformer models rely on the full sentence as context. Modern deep learning models trained on full text (BERT, GPT) generally do not require stopword removal as they learn to weigh the importance of each token from context.

6. What is TF-IDF?

TF-IDF (Term Frequency–Inverse Document Frequency) is a numerical statistic that reflects how important a word is to a document in a corpus. TF (Term Frequency) measures how often a word appears in a document — words that appear frequently are likely important. IDF (Inverse Document Frequency) down-weights words that appear across many documents (common words), amplifying words unique to specific documents. TF-IDF = TF × log(N/df), where N is total documents and df is documents containing the term. TF-IDF vectors are used for document similarity, information retrieval (keyword search), and as features in traditional text classifiers. It is interpretable but ignores word order and semantics.
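
The formula can be computed directly. The sketch below assumes TF is the relative frequency of a term within its document and uses the unsmoothed natural-log IDF given above; library implementations often add smoothing, so their numbers will differ. The function name `tf_idf` and the three-document corpus are made up for the example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenised document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the stock market fell".split(),
]
scores = tf_idf(docs)
print(scores[0])
```

Because "the" occurs in every document its IDF is log(3/3) = 0, so its weight vanishes, while "mat" (one document) outweighs "cat" (two documents).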

7. What is the bag-of-words model?

Bag-of-words (BoW) represents a document as an unordered set of word counts, ignoring grammar and word order. Each document is a vector where each dimension corresponds to a vocabulary word and the value is the word's count (or TF-IDF weight). A vocabulary of 10,000 words produces 10,000-dimensional vectors (sparse). BoW is simple, fast, and surprisingly effective for classification tasks like spam detection. Limitations: does not capture word order ("dog bites man" vs "man bites dog" are identical), cannot handle synonymy (different words with same meaning) or polysemy (same word with different meanings), and produces very high-dimensional, sparse representations.
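
A minimal sketch of the word-order limitation, assuming whitespace tokenisation and a fixed three-word vocabulary (both simplifications chosen for the example):

```python
from collections import Counter


def bow_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


vocabulary = ["bites", "dog", "man"]
a = bow_vector("dog bites man", vocabulary)
b = bow_vector("man bites dog", vocabulary)
print(a, b, a == b)
```

Both sentences map to the same count vector, which is why a pure bag-of-words model cannot tell them apart.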

8. What are n-grams?

N-grams are contiguous sequences of n items (words, characters, tokens) from a text. Unigrams (n=1) are individual words. Bigrams (n=2) are word pairs: "machine learning", "data science". Trigrams (n=3) are three-word sequences. Character n-grams are used for language identification, morphological analysis, and handling out-of-vocabulary words. In classical NLP, n-gram language models predict the next word from the previous n-1 words using frequency statistics. Larger n captures more context but requires more data and memory. N-grams extend bag-of-words to capture some local word order information while remaining interpretable and computationally efficient.
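
Extracting n-grams is a sliding window over the sequence. The helper name `ngrams` is an assumption for this sketch; it works on any list, so the same function yields word or character n-grams.

```python
def ngrams(items, n):
    """Return all contiguous n-item windows as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


words = "dog bites man".split()
print(ngrams(words, 1))          # unigrams
print(ngrams(words, 2))          # bigrams
print(ngrams(list("data"), 3))   # character trigrams

# Unlike unigram counts, bigrams separate the two word orders.
print(set(ngrams(words, 2)) == set(ngrams("man bites dog".split(), 2)))
```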

9. What is named entity recognition (NER)?

Named Entity Recognition is an NLP task that identifies and classifies named entities in text into predefined categories such as persons (PER), organisations (ORG), locations (LOC), dates, monetary values, and more. For example: "Apple (ORG) was founded by Steve Jobs (PER) in Cupertino (LOC) in 1976 (DATE)." NER is used in information extraction, question answering, knowledge graph construction, and compliance monitoring. Modern NER uses transformer models fine-tuned on annotated datasets. spaCy, Hugging Face Transformers, and AWS Comprehend provide pre-trained NER models. Custom NER requires annotated training data in IOB (Inside-Outside-Beginning) tagging format.
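
To make the IOB format concrete, the sketch below converts a tagged token sequence back into entity spans. The function name `iob_to_spans` is hypothetical, and the tags are written by hand for the example rather than produced by a model.

```python
def iob_to_spans(tokens, tags):
    """Group B-/I- tagged tokens into (entity text, label) pairs."""
    spans, current_tokens, current_label = [], [], None
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_label
        )
        if starts_new:
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-"):
            current_tokens.append(token)          # continue the open entity
        else:                                     # "O": outside any entity
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans


tokens = ["Apple", "was", "founded", "by", "Steve", "Jobs", "in", "Cupertino"]
tags = ["B-ORG", "O", "O", "O", "B-PER", "I-PER", "O", "B-LOC"]
print(iob_to_spans(tokens, tags))
```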

10. What is part-of-speech (POS) tagging?

POS tagging assigns grammatical labels (noun, verb, adjective, adverb, pronoun, preposition, etc.) to each token in a sentence. For example: "The (DT) quick (JJ) brown (JJ) fox (NN) jumps (VBZ) over (IN) the (DT) lazy (JJ) dog (NN)." POS tags provide syntactic structure that enables downstream NLP tasks: named entity recognition, parsing, word sense disambiguation, and feature engineering. Traditional approaches used Hidden Markov Models (HMMs) and Conditional Random Fields (CRFs). Modern approaches use transformer models fine-tuned for sequence labelling. spaCy provides fast, accurate POS tagging in multiple languages as part of its NLP pipeline.

Embeddings & Representations

11. What are word embeddings?

Word embeddings are dense, low-dimensional vector representations of words that capture semantic and syntactic relationships. Unlike one-hot encoding (sparse, no semantic information), embeddings place semantically similar words close together in vector space: the vectors for "king" and "queen" are closer than those for "king" and "banana." Embeddings are learned by training a shallow neural network to predict context words (Word2Vec) or by fitting vectors to global word co-occurrence statistics (GloVe). The famous example: vector("king") − vector("man") + vector("woman") ≈ vector("queen"). Embedding dimensionality is typically 100-300 for word-level embeddings. Learned embeddings, static or contextual, form the input layer of virtually all modern neural NLP systems.
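
The analogy arithmetic can be demonstrated with hand-made vectors. The four dimensions below (royalty, male, female, fruit) and every value in `vectors` are invented for illustration; learned embeddings are not this cleanly interpretable, and the analogy only holds approximately in practice.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hand-made toy vectors: [royalty, male, female, fruit]. Not learned embeddings.
vectors = {
    "king":   [1.0, 1.0, 0.0, 0.0],
    "queen":  [1.0, 0.0, 1.0, 0.0],
    "man":    [0.0, 1.0, 0.0, 0.0],
    "woman":  [0.0, 0.0, 1.0, 0.0],
    "banana": [0.0, 0.0, 0.0, 1.0],
}


def analogy(a, b, c):
    """Return the word closest to vector(a) - vector(b) + vector(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


print(analogy("king", "man", "woman"))
print(cosine(vectors["king"], vectors["queen"]), cosine(vectors["king"], vectors["banana"]))
```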

12. What is Word2Vec and how does it work?

Word2Vec learns word embeddings from large text corpora using two architectures. CBOW (Continuous Bag of Words) predicts a target word from its surrounding context words. Skip-gram predicts the surrounding context words from a target word. Both can be trained with hierarchical softmax or, more commonly, negative sampling: for each observed (target, context) pair, a few random negative pairs are sampled and the model is trained to distinguish real pairs from sampled ones. Word2Vec embeddings capture analogical relationships (king:queen::man:woman), syntactic properties (good:better::bad:worse), and semantic similarity. Limitations: a single static vector per word (no disambiguation of polysemy), no understanding of morphology, and no sentence-level meaning.
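
The training data for skip-gram is simply every (target, context) pair within a window. The sketch below generates those positive pairs only; the function name `skipgram_pairs` is an assumption, and the model itself, the negative sampling step, and frequent-word subsampling are left out.

```python
def skipgram_pairs(tokens, window=2):
    """List every (target, context) pair within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                      # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


tokens = "the quick brown fox".split()
for pair in skipgram_pairs(tokens, window=1):
    print(pair)
```

CBOW uses the same windows but groups all context words of a position into one input and predicts the target from them.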

13. What is the difference between Word2Vec and GloVe?

Word2Vec learns embeddings from local context windows (predicting context words within a fixed window). GloVe (Global Vectors) learns from a global word co-occurrence matrix across the entire corpus, factorising the co-occurrence count matrix. GloVe explicitly encodes the ratio of word co-occurrence probabilities, which the authors argue better captures meaning. In practice, both produce embeddings of similar quality for most tasks. GloVe embeddings are often preferred when pre-trained vectors are used directly (without fine-tuning) because they are available as static, downloadable files in various dimensions trained on large corpora (Wikipedia, Common Crawl).

14. What are contextual embeddings?

Unlike Word2Vec (one static vector per word), contextual embeddings produce different vector representations for the same word depending on its context — solving polysemy. "bank" in "river bank" and "bank account" get different embeddings. ELMo (Embeddings from Language Models) was the first major contextual embedding model, using bidirectional LSTMs. BERT and its derivatives are the standard today — they produce contextual embeddings by encoding the full sentence simultaneously through Transformer layers. Contextual embeddings dramatically improved performance on all NLP benchmarks and are the foundation of modern NLP systems. Fine-tuning pre-trained contextual embeddings on task-specific data is standard practice.

15. What is sentence embedding?

A sentence embedding is a fixed-length vector representation of an entire sentence that captures its semantic meaning. Applications include semantic search, sentence similarity, clustering, and zero-shot classification. Methods: BERT's [CLS] token representation (simple but not optimal), mean-pooling of token embeddings (averaging all token vectors), and Sentence-BERT (SBERT) — BERT fine-tuned with siamese networks on sentence pairs to produce semantically meaningful sentence embeddings optimised for similarity tasks. OpenAI's text-embedding models produce high-quality sentence embeddings. Cosine similarity between sentence embeddings measures semantic similarity — the foundation of Retrieval-Augmented Generation (RAG) and semantic search systems.

Transformers & Language Models

16. What is the Transformer architecture?

The Transformer (introduced in "Attention Is All You Need," 2017) is a deep learning architecture that processes all tokens in parallel using self-attention, replacing sequential RNNs. Components: input embeddings + positional encoding, multi-head self-attention (each token attends to all other tokens), feed-forward layers, layer normalisation, and residual connections. The encoder processes input sequences; the decoder generates output sequences attending to both self and encoder. Self-attention computes attention scores: Attention(Q, K, V) = softmax(QK^T / √d_k) × V where Q, K, V are learned projections of the input. Transformers are the basis of BERT, GPT, T5, and all modern LLMs.
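
The attention formula can be traced by hand with plain Python lists. This is a single-head sketch that takes Q, K, and V as given (the learned projection matrices, masking, and multi-head splitting are omitted), and the small matrices are arbitrary example values.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for row-major lists of vectors."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Scaled dot product of this query with every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return outputs, weights


Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out, w = attention(Q, K, V)
print(w)
print(out)
```

Each row of weights sums to 1, so every output row is a convex combination of the rows of V, pulled towards the value whose key best matches the query.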

17. What is the difference between BERT and GPT?

BERT (Bidirectional Encoder Representations from Transformers) uses only the Transformer encoder and is trained with Masked Language Modelling (predict masked tokens) and Next Sentence Prediction. It is bidirectional — each token attends to all other tokens simultaneously. BERT is excellent for understanding tasks: classification, NER, question answering. GPT (Generative Pre-trained Transformer) uses only the Transformer decoder with causal (unidirectional) self-attention — each token only attends to previous tokens. GPT is trained with autoregressive language modelling (predict the next token). GPT is excellent for generation tasks: text completion, summarisation, dialogue. T5 and BART use both encoder and decoder for seq2seq tasks.

18. What is pre-training and fine-tuning?

Pre-training trains a large model on a massive general-purpose corpus (books, web pages, code) with a self-supervised objective (masked language modelling, next token prediction) — no task-specific labels needed. This teaches the model language understanding, world knowledge, and reasoning. Fine-tuning adapts the pre-trained model to a specific downstream task using a smaller, labelled dataset by continuing training with a task-specific head (classifier, span extractor) at a lower learning rate. Fine-tuning typically achieves high performance with far less labelled data than training from scratch. Parameter-efficient fine-tuning methods (LoRA, Prefix Tuning) update only a small subset of parameters, reducing computational cost.

19. What is attention mechanism in NLP?

Attention allows a model to focus on relevant parts of the input when producing each output. In the original encoder-decoder attention, the decoder queries the encoder's hidden states to determine which input tokens are most relevant for generating each output token, which relieves the information bottleneck of fixed-size context vectors in RNNs. Self-attention (Transformer) allows each token in a sequence to attend to all other tokens, capturing long-range dependencies. Multi-head attention runs multiple attention operations in parallel, each of which can learn to attend to different aspects (syntax, semantics, coreference). Attention weights can be inspected to see which tokens the model weighted most heavily, although they give only a partial explanation of the model's behaviour.

20. What is a large language model (LLM)?

A Large Language Model is a Transformer-based model trained on massive text corpora (hundreds of billions to trillions of tokens) with billions to hundreds of billions of parameters. LLMs exhibit emergent capabilities at scale: in-context learning (performing tasks from a few examples in the prompt without weight updates), instruction following, chain-of-thought reasoning, and code generation. Prominent LLMs include GPT-4 (OpenAI), Claude (Anthropic), Gemini (Google), and Llama 3 (Meta). LLMs are deployed via APIs and fine-tuned for specific domains. Understanding their capabilities, limitations (hallucination, knowledge cutoff, context limits), and safe usage patterns is essential for modern NLP practitioners.

21. What is prompt engineering?

Prompt engineering is the practice of designing input prompts to guide LLM behaviour without modifying model weights. Techniques include: zero-shot prompting (no examples, just the instruction), few-shot prompting (providing 2-5 examples of the desired input-output pattern), chain-of-thought prompting (instructing the model to "think step by step" to improve reasoning), role prompting ("you are an expert data scientist"), and output formatting instructions ("respond in JSON with keys 'category' and 'confidence'"). Prompt engineering is a critical skill because LLM output quality is highly sensitive to prompt formulation. System prompts (instructions given before the conversation) set persistent behaviour across a session.

22. What is Retrieval-Augmented Generation (RAG)?

RAG is an architecture that augments an LLM's response with relevant information retrieved from an external knowledge base, overcoming knowledge cutoff and hallucination limitations. Workflow: (1) Index — chunk documents, embed with a sentence embedding model, store in a vector database (Pinecone, Weaviate, pgvector); (2) Retrieve — embed the user query, find the top-k most similar document chunks via cosine similarity; (3) Generate — include retrieved chunks in the LLM prompt as context, instructing the model to answer based on the provided information. RAG enables LLMs to answer questions about private data, recent events, and domain-specific knowledge without expensive fine-tuning.
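
The three-step workflow can be sketched end to end without any external service. Everything here is a stand-in: `embed` uses word counts instead of a sentence embedding model, a Python list replaces the vector database, the three `chunks` are invented, and the final LLM call is left out so that only the assembled prompt is shown.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in for a sentence embedding model: a sparse word-count vector."""
    return Counter(text.lower().split())


def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def build_index(chunks):
    """Step 1 (Index): store each chunk alongside its embedding."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(query, index, k=2):
    """Step 2 (Retrieve): rank chunks by similarity to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query, retrieved):
    """Step 3 (Generate): place the retrieved chunks in the prompt as context."""
    context = "\n".join(f"- {c}" for c in retrieved)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"


chunks = [
    "refunds are processed within five business days",
    "our office is closed on public holidays",
    "shipping is free for orders over fifty dollars",
]
index = build_index(chunks)
top = retrieve("how long do refunds take", index, k=1)
prompt = build_prompt("how long do refunds take", top)
print(prompt)
```

Swapping `embed` for a real embedding model and the list for a vector store changes the quality of retrieval but not the shape of the pipeline.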

23. What is vector similarity search?

Vector similarity search finds the most similar items (documents, images, entities) in a high-dimensional vector space by computing distance metrics. Cosine similarity measures the angle between vectors and is the most common metric for sentence embeddings (range -1 to 1; 1 = identical direction). Euclidean distance measures straight-line distance. Approximate Nearest Neighbour (ANN) indexes such as HNSW, available in libraries like FAISS, find the top-k most similar vectors efficiently across millions of embeddings without comparing the query against every stored vector. Vector databases (Pinecone, Weaviate, Milvus, Qdrant) store embeddings and serve ANN queries with metadata filtering. Vector search is the retrieval layer of most modern RAG and semantic search systems.
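
For reference, this is the exact brute-force search that ANN indexes approximate: score every stored vector against the query and keep the best k. The `store` dictionary and its two-dimensional vectors are made-up example data, and `top_k` is a hypothetical helper name.

```python
import heapq
import math


def normalise(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def top_k(query, vectors, k=2):
    """Exact search: cosine similarity against every stored vector."""
    q = normalise(query)
    scored = (
        (sum(a * b for a, b in zip(q, normalise(v))), key)
        for key, v in vectors.items()
    )
    return [(key, score) for score, key in heapq.nlargest(k, scored)]


store = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.8, 0.6],
    "doc_c": [0.0, 1.0],
    "doc_d": [-1.0, 0.0],
}
results = top_k([1.0, 0.1], store, k=2)
print(results)
```

The cost grows linearly with the number of stored vectors, which is what motivates approximate indexes at scale.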

24. What is fine-tuning an LLM?

Fine-tuning adapts a pre-trained LLM to a specific task or domain. Supervised fine-tuning (SFT) trains on labelled instruction-response pairs. RLHF (Reinforcement Learning from Human Feedback) further aligns the model with human preferences using a reward model trained on human rankings. Parameter-efficient fine-tuning techniques are used to make fine-tuning affordable: LoRA (Low-Rank Adaptation) injects trainable rank-decomposition matrices into transformer layers, updating less than 1% of parameters while achieving near full fine-tune performance. QLoRA combines LoRA with 4-bit quantisation for GPU-memory-efficient fine-tuning. Hugging Face PEFT library implements these methods.
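
The parameter saving from LoRA follows from simple arithmetic: a frozen d_out × d_in weight matrix W is adapted as W + BA, where B is d_out × r and A is r × d_in. The layer size (4096 × 4096) and rank (8) below are example values chosen for illustration, not figures from any particular model.

```python
def lora_param_counts(d_out, d_in, rank):
    """Compare trainable parameters: full matrix vs. low-rank update."""
    full = d_out * d_in              # frozen weight matrix W
    lora = rank * (d_out + d_in)     # B is d_out x rank, A is rank x d_in
    return full, lora, lora / full


full, lora, fraction = lora_param_counts(4096, 4096, 8)
print(f"full: {full:,}  lora: {lora:,}  fraction: {fraction:.4%}")
```

For this layer the low-rank update trains about 0.39% as many parameters as the full matrix, and the fraction shrinks further as the layer grows or the rank drops.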

25. What is the context window of an LLM?

The context window is the maximum number of tokens an LLM can process at once, counting both the input (prompt + retrieved context) and the generated output. Longer contexts can hold more information but are computationally more expensive. GPT-3 had a 2,048-token window. GPT-4 Turbo supports 128K tokens; Claude 3 supports 200K tokens; Gemini 1.5 Pro supports 1 million tokens. Challenges with long contexts include the "lost in the middle" problem (models pay less attention to information in the middle of long contexts) and higher inference cost. Chunking, RAG, and summarisation are strategies for handling content that exceeds the context window.

NLP Applications

26. What is sentiment analysis?

Sentiment analysis (opinion mining) classifies text according to the expressed sentiment — typically positive, negative, or neutral. Aspect-based sentiment analysis identifies sentiment towards specific aspects of a product (e.g., "The camera is great [positive] but the battery life is poor [negative]"). Applications: product review analysis, social media monitoring, customer feedback processing, and brand reputation management. Approaches range from rule-based (VADER for social media text), to classical ML (SVM with TF-IDF features), to fine-tuned transformer models (BERT variants) that achieve state-of-the-art accuracy. Pre-trained sentiment models are available via Hugging Face.

27. What is text classification?

Text classification assigns a predefined category to a text document. Applications include spam detection, topic categorisation, intent classification (chatbots), language identification, and document routing. Classical approach: TF-IDF features + logistic regression or SVM. Deep learning approach: fine-tuned BERT or DistilBERT with a classification head. Zero-shot classification uses NLI (Natural Language Inference) models to classify text without task-specific training data. Few-shot learning (in-context learning with GPT-4) classifies with only a few examples in the prompt. Multi-label classification assigns multiple categories to a single document — for example, a news article tagged as both "Politics" and "Economy".

28. What is text summarisation?

Text summarisation condenses long documents into shorter summaries. Extractive summarisation selects and combines the most important sentences from the original text. Abstractive summarisation generates new text that captures the key information — like how a human would summarise. Seq2seq models (T5, BART, PEGASUS) are standard for abstractive summarisation. Evaluation metrics: ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures n-gram overlap between the generated and reference summaries. LLMs (GPT-4, Claude) are excellent zero-shot summarisers for various document types. Long-document summarisation uses hierarchical approaches or map-reduce patterns (summarise chunks, then summarise summaries).

29. What is machine translation?

Machine translation automatically translates text from one language to another. Statistical Machine Translation (SMT, pre-2015) aligned words and phrases using statistical models. Neural Machine Translation (NMT, seq2seq with attention) dramatically improved quality. Transformer-based models (MarianMT, M2M-100, NLLB) are the modern standard. Google Translate and DeepL use large-scale NMT systems. Evaluation metrics include BLEU (Bilingual Evaluation Understudy), which measures n-gram precision against human reference translations, and COMET, a neural metric better correlated with human judgement. Low-resource language translation remains challenging due to limited parallel training data.

30. What is question answering?

Question answering (QA) systems return direct answers to natural language questions. Extractive QA (reading comprehension) identifies the answer span within a given passage — used in models fine-tuned on SQuAD. Open-domain QA retrieves relevant passages from a knowledge source, then extracts or generates an answer. Generative QA (used in ChatGPT, Claude) generates free-form answers from the model's knowledge. Evaluation uses Exact Match (EM) and F1 score comparing predicted vs. ground truth answers. Knowledge graph QA queries structured databases. Modern LLMs perform impressively on QA tasks zero-shot but can hallucinate — RAG mitigates this by grounding responses in retrieved evidence.

31. What is text generation and how is it evaluated?

Text generation produces coherent, contextually appropriate text — used in content creation, dialogue systems, code generation, and summarisation. Autoregressive generation samples from the model's predicted probability distribution over the vocabulary one token at a time. Decoding strategies: greedy (highest probability token, repetitive), beam search (maintain top-k sequences, deterministic), temperature sampling (control randomness), top-k sampling (sample from top-k tokens), and nucleus sampling / top-p (sample from smallest set of tokens comprising probability mass p). Evaluation: BLEU, ROUGE for reference-based tasks; human evaluation for open-ended generation; perplexity measures how surprised the model is by held-out text.
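
The sampling strategies differ only in how they reshape the next-token distribution before drawing from it. The sketch below works on a hypothetical four-token vocabulary with made-up `logits`; beam search is omitted because it tracks whole sequences rather than a single step.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the peak."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def renormalise(probs, keep):
    """Zero out tokens not in `keep` and rescale the rest to sum to 1."""
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalise(probs, order[:k])


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return renormalise(probs, keep)


def sample(probs, rng):
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax(logits)
greedy = max(range(len(logits)), key=lambda i: logits[i])
rng = random.Random(0)  # seeded so repeated runs draw the same token
print(greedy, top_k_filter(probs, 2), top_p_filter(probs, 0.9), sample(probs, rng))
```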

32. What is semantic search vs. keyword search?

Keyword search matches documents containing the exact query terms — fast and interpretable but misses synonyms and paraphrases ("automobile" won't match "car"). Semantic search converts queries and documents to embedding vectors and retrieves the most semantically similar results using cosine similarity — finds relevant results even when exact terms don't match. Semantic search is powered by dense retrieval models (DPR, Sentence-BERT, OpenAI embeddings). Hybrid search combines both: keyword search (BM25) catches exact matches while semantic search finds conceptually similar content. Re-ranking with a cross-encoder model (slower but more accurate) improves precision after initial retrieval.

33. What is zero-shot and few-shot learning in NLP?

Zero-shot learning performs a task the model was never explicitly trained on, relying on general language understanding. For example, GPT-4 performs sentiment analysis when instructed: "Classify the following review as positive, neutral, or negative." Few-shot learning (in-context learning) includes 2-10 examples in the prompt to guide the model's output format and behaviour without updating weights. Few-shot prompting can match fine-tuned model performance on many tasks. These capabilities emerge from pre-training on diverse data and scale — smaller models rarely exhibit strong zero-shot generalisation. Zero/few-shot evaluation is used to benchmark LLM capabilities without task-specific fine-tuning.

34. What is the BLEU score?

BLEU (Bilingual Evaluation Understudy) is an automated metric for evaluating machine translation and text generation quality. It measures the precision of n-gram overlap (n=1,2,3,4) between the generated text and one or more human reference translations, with a brevity penalty for outputs shorter than the reference. BLEU scores range from 0 to 1 (often expressed as 0-100). Limitations: it does not measure recall, penalises valid paraphrases that use different words, has low correlation with human judgements for abstractive tasks, and cannot evaluate semantic correctness. Despite limitations, BLEU is still the standard metric for MT benchmarks due to its reproducibility and ease of computation.
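
A simplified sentence-level version shows how the pieces fit together. This sketch assumes a single reference, uniform weights over n = 1 to 4, and no smoothing, so it returns 0 whenever any n-gram order has no match; standard toolkits compute BLEU at corpus level and offer smoothing options.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Geometric mean of clipped n-gram precisions times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if total == 0 or overlap == 0:
            return 0.0
        precisions.append(overlap / total)
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * geo_mean


reference = "the cat sat on the mat".split()
print(bleu(reference, reference))
print(bleu("the cat sat on the".split(), reference))
print(bleu("a dog ran away quickly".split(), reference))
```

The truncated candidate has perfect n-gram precision, so its score falls below 1 purely because of the brevity penalty.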

35. What is topic modelling?

Topic modelling is an unsupervised NLP technique for discovering latent thematic structure in a collection of documents. Latent Dirichlet Allocation (LDA) is the most common algorithm: it models each document as a mixture of topics and each topic as a probability distribution over words. For example, a corpus of news articles might reveal a handful of topics, each characterised by its highest-probability words. Applications: content recommendation, document organisation, trend detection. BERTopic uses BERT embeddings and clustering, and often finds more coherent topics than LDA. Topic models require choosing the number of topics (hyperparameter) and interpreting the resulting word distributions manually.

Advanced NLP

36. What is the difference between NLU and NLG?

NLU (Natural Language Understanding) is the ability of a system to comprehend and interpret text — extracting meaning, intent, entities, sentiment, and relationships from language. Tasks: intent classification, NER, relation extraction, sentiment analysis, reading comprehension. NLG (Natural Language Generation) is the ability to produce grammatically correct, contextually appropriate text — translating structured data or internal representations into human language. Tasks: text summarisation, machine translation, dialogue response, data-to-text generation, report writing. Modern LLMs combine both: they understand the input (NLU) and produce appropriate text responses (NLG) in a unified autoregressive framework.

37. What is coreference resolution?

Coreference resolution identifies all expressions in a text that refer to the same real-world entity. For example, in "Alice said she would attend the meeting. She was the first to arrive," coreference resolution links "Alice" and both occurrences of "she" (three mentions) to the same person entity. It is a key challenge in NLP because correctly resolving coreferences is needed for accurate information extraction, question answering, and summarisation. Neural approaches that score pairs of candidate spans, often built on encoders such as SpanBERT, have achieved strong results. Coreference resolution is particularly important for documents with many pronoun references to the same entities.

38. What is dependency parsing?

Dependency parsing analyses the grammatical structure of a sentence by identifying dependencies between words, producing a directed tree where each word (except the root) depends on exactly one head word. Dependency labels indicate the type of relationship: subject (nsubj), object (obj), modifier (amod), etc. For "The cat sat on the mat", written as dependent → head (label): cat → sat (nsubj), on → sat (prep), mat → on (pobj). Label names and attachments vary by annotation scheme; prep and pobj are the older Stanford-style labels, whereas Universal Dependencies attaches "mat" to "sat" as obl and treats "on" as a case marker. Dependency parsing is used in information extraction, relation detection, and machine translation. spaCy provides fast, accurate dependency parsing. Universal Dependencies is a cross-linguistic annotation scheme enabling consistent dependency parsing across 100+ languages.

39. What are hallucinations in LLMs and how do you mitigate them?

Hallucinations are confident, plausible-sounding but factually incorrect or unsupported outputs generated by LLMs. They occur because LLMs generate based on statistical patterns rather than verified knowledge retrieval. Types: factual hallucinations (wrong facts), faithfulness hallucinations (contradicting the provided context), and attributional hallucinations (inventing citations). Mitigation strategies: RAG (ground responses in retrieved documents), chain-of-thought prompting (encourage step-by-step reasoning), self-consistency (sample multiple answers and vote), temperature reduction (lower randomness), fine-tuning on high-quality factual data, and post-generation fact-checking with external APIs or search. Hallucinations remain an active research problem — no perfect solution exists.

40. What is instruction tuning?

Instruction tuning is fine-tuning a pre-trained LLM on a dataset of (instruction, response) pairs to make the model better at following natural language instructions. Pre-trained LLMs (trained only with next-token prediction) do not naturally follow instructions — they continue text. Instruction tuning teaches the model to interpret tasks described in natural language and respond helpfully. Examples include FLAN-T5 (fine-tuned on 1,836 NLP tasks framed as instructions) and the InstructGPT/ChatGPT pipeline (supervised instruction fine-tuning followed by RLHF). Instruction-tuned models can perform new tasks zero-shot by describing them in the prompt.

41. What is text chunking and why does it matter for RAG?

Text chunking splits documents into smaller, overlapping segments before embedding for RAG. Chunk size affects retrieval quality: chunks too small lack context; chunks too large are less semantically focused and exceed embedding model limits. Common strategies: fixed-size chunking (by character or token count), sentence-based chunking, paragraph-based chunking, and semantic chunking (split at topic boundaries using embedding similarity). Overlap between consecutive chunks (e.g., 20% overlap) ensures context at boundaries is not lost. Recursive chunking (try paragraph → sentence → fixed size) adapts to document structure. Optimal chunk size varies by document type and is typically tuned empirically.
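
Fixed-size chunking with overlap reduces to a sliding window whose step is the chunk size minus the overlap. The function name `chunk_tokens` and the sizes below are assumptions for the sketch; integers stand in for tokens so the overlap is easy to see.

```python
def chunk_tokens(tokens, chunk_size, overlap):
    """Split a token list into windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


tokens = list(range(10))
for chunk in chunk_tokens(tokens, chunk_size=5, overlap=1):  # 20% overlap
    print(chunk)
```

Each chunk repeats the last token of the previous one, so a sentence that straddles a boundary is at least partly present in both.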

42. What is the difference between encoder-only, decoder-only, and encoder-decoder models?

Encoder-only models (BERT, RoBERTa) process the full input bidirectionally, producing contextual embeddings for each token. Best for: classification, NER, question answering, sentence embeddings. Decoder-only models (GPT family, Llama, Mistral) process tokens left-to-right (autoregressive) with causal attention. Best for: text generation, dialogue, code generation. Encoder-decoder models (T5, BART, MarianMT) encode the input and generate the output sequentially. Best for: translation, summarisation, seq2seq tasks with a clear input-output structure. Modern LLMs are predominantly decoder-only due to their unified generation capability and strong emergent reasoning at scale.

43. What is cross-lingual NLP?

Cross-lingual NLP enables models to work across multiple languages, either by processing multiple languages simultaneously or transferring knowledge from resource-rich languages (English) to low-resource ones. Multilingual BERT (mBERT) and XLM-RoBERTa are pre-trained on 100+ languages jointly — they share a vocabulary and model weights across languages. Zero-shot cross-lingual transfer fine-tunes a model on English tasks and evaluates it on other languages with no additional training data. Multilingual sentence embeddings (LaBSE, SONAR) produce language-agnostic embeddings for semantic search across languages. This capability is critical for global applications serving non-English speakers.

44. What are knowledge graphs and how do they relate to NLP?

A knowledge graph is a structured representation of entities (nodes) and their relationships (edges), stored as triples such as (Steve Jobs, founded, Apple). NLP populates knowledge graphs through relation extraction (identifying entity relationships in text), NER (identifying entities to populate nodes), and entity linking (resolving text mentions to knowledge base entries). Knowledge graphs power question answering, recommendation systems, and entity disambiguation. Wikidata and DBpedia are large public knowledge graphs; Freebase, since discontinued, was another widely used one. Graph neural networks (GNNs) can reason over knowledge graphs, which helps on tasks that need structured factual knowledge where an LLM alone may hallucinate.
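
To make the triple representation concrete, here is a minimal in-memory sketch. The TripleStore class, the two_hop helper, and the headquartered_in relation name are all hypothetical; real knowledge graphs use dedicated stores and query languages such as SPARQL.

```python
from collections import defaultdict


class TripleStore:
    """Toy knowledge graph holding (subject, relation, object) triples."""

    def __init__(self):
        self._by_subject = defaultdict(list)

    def add(self, subject, relation, obj):
        self._by_subject[subject].append((relation, obj))

    def objects(self, subject, relation):
        """All objects o such that (subject, relation, o) is in the graph."""
        return [o for r, o in self._by_subject.get(subject, []) if r == relation]

    def subjects(self, relation, obj):
        """All subjects s such that (s, relation, obj) is in the graph."""
        return [s for s, edges in self._by_subject.items() if (relation, obj) in edges]


def two_hop(kg, subject, first_relation, second_relation):
    """Follow two edges in a row, e.g. person -> company -> city."""
    return [
        end
        for middle in kg.objects(subject, first_relation)
        for end in kg.objects(middle, second_relation)
    ]


kg = TripleStore()
kg.add("Steve Jobs", "founded", "Apple")
kg.add("Steve Wozniak", "founded", "Apple")
kg.add("Apple", "headquartered_in", "Cupertino")

print(kg.subjects("founded", "Apple"))
print(two_hop(kg, "Steve Jobs", "founded", "headquartered_in"))
```

The two-hop lookup is the kind of multi-step question ("where is the company Steve Jobs founded based?") that a graph answers by traversal rather than by recall.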

45. What is the difference between NLP evaluation metrics: accuracy, F1, and ROUGE?

Accuracy is the proportion of correct predictions, and is appropriate when classes are balanced. F1 score is the harmonic mean of precision and recall, and is preferred for classification tasks with class imbalance (NER, intent classification). Macro-F1 averages per-class F1 with equal weight; weighted-F1 weights each class by its support (its number of true instances). ROUGE (ROUGE-1, ROUGE-2, ROUGE-L) measures n-gram and longest-common-subsequence overlap between generated and reference text; it is recall-oriented and used mainly for summarisation. BLEU measures n-gram precision with a brevity penalty and is the traditional metric for machine translation. BERTScore compares BERT embeddings of candidate and reference tokens to measure semantic similarity, and often correlates better with human judgement than ROUGE or BLEU. Perplexity, the exponential of the average negative log-likelihood per token on held-out text, evaluates language models (lower is better).
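
The sketch below computes three of these metrics by hand to show what each one measures. It is a simplified illustration: the function names are hypothetical, the ROUGE-1 variant reports recall only, and whitespace splitting stands in for proper tokenisation. Established evaluation libraries should be preferred in practice.

```python
import math
from collections import Counter


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def rouge1_recall(candidate, reference):
    """Fraction of reference unigrams that also appear in the candidate (clipped counts)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # per-word minimum of the two counts
    total = sum(ref.values())
    return overlap / total if total else 0.0


def perplexity(token_probs):
    """exp of the average negative log-probability the model gave each true token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


print(f1_score(tp=8, fp=2, fn=2))
print(rouge1_recall("the cat sat", "the cat sat on the mat"))
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

A model that assigns probability 0.25 to every correct token has perplexity 4, which can be read as being as uncertain as a uniform choice among four options.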

Practical NLP

46. What is Hugging Face and why is it important?

Hugging Face is a platform and open-source library that has become the central hub for NLP (and broader ML) models and datasets. The transformers library provides a unified API for thousands of pre-trained models: AutoTokenizer.from_pretrained('bert-base-uncased'), AutoModelForSequenceClassification.from_pretrained('bert-base-uncased'). The Hugging Face Hub hosts 500,000+ models and 100,000+ datasets. The datasets library provides standardised access to NLP benchmarks. peft implements LoRA and other parameter-efficient fine-tuning methods. accelerate simplifies distributed training. Hugging Face has democratised state-of-the-art NLP — downloading a BERT model and fine-tuning it takes fewer than 20 lines of code.

47. What is spaCy and what are its main features?

spaCy is an industrial-strength, production-focused NLP library for Python. It provides tokenisation, POS tagging, dependency parsing, NER, lemmatisation, sentence segmentation, and word vectors in a unified, highly optimised pipeline (static word vectors ship with the medium and large models such as en_core_web_md, not with en_core_web_sm). nlp = spacy.load('en_core_web_sm'); doc = nlp("Apple is a company") processes text through the full pipeline in a single call. spaCy is generally much faster than NLTK for comparable tasks, although the gap depends on the pipeline components and hardware. Custom components are registered with the @Language.component decorator and added to the pipeline with nlp.add_pipe. spaCy v3 introduced transformer-based pipelines via spacy-transformers. Its Matcher and PhraseMatcher enable rule-based text pattern matching. spaCy is a widely used choice for production NLP pipelines.

48. What is text data augmentation?

Text data augmentation artificially increases training data size by creating modified versions of existing labelled examples. Techniques include: synonym replacement (replace words with WordNet synonyms), random insertion, deletion, or swapping of words (together with synonym replacement, these four operations make up EDA, Easy Data Augmentation), back-translation (translate to another language and back to generate paraphrases), paraphrasing with LLMs (generate semantically equivalent rewrites), and masked language model substitution (mask tokens and let BERT fill them in). Augmentation helps with class imbalance and low-resource scenarios. Care must be taken to preserve the label: augmentation that changes the meaning of an example invalidates its label.
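
Two of the simpler operations, random swap and random deletion, can be sketched in a few lines. This is an illustration under stated assumptions rather than a reference implementation of EDA: the function names and parameters are hypothetical, and a seeded random.Random is passed in so results are reproducible.

```python
import random


def random_swap(words, n_swaps, rng):
    """Swap two randomly chosen positions, n_swaps times."""
    words = list(words)  # copy so the caller's list is not modified
    if len(words) < 2:
        return words
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


def random_deletion(words, p, rng):
    """Drop each word independently with probability p, keeping at least one word."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]


rng = random.Random(0)  # fixed seed for reproducibility
sentence = "the film was not as good as the book".split()

swapped = random_swap(sentence, n_swaps=2, rng=rng)
deleted = random_deletion(sentence, p=0.3, rng=rng)
print(" ".join(swapped))
print(" ".join(deleted))
```

The example sentence shows the labelling risk: if random deletion happens to drop "not", the sentiment flips while the label stays the same, so augmented data is worth spot-checking.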

49. What are vector databases and which are commonly used?

Vector databases store high-dimensional embedding vectors with metadata and enable efficient approximate nearest neighbour (ANN) search at scale. They are the retrieval layer of RAG systems, semantic search engines, and recommendation systems. Popular options: Pinecone (fully managed, easy to use), Weaviate (open-source, supports hybrid search, GraphQL API), Qdrant (open-source, fast, Rust-based), Milvus (open-source, enterprise-grade), Chroma (lightweight, local-first for prototyping), and pgvector (Postgres extension for vector storage and search). Choice depends on scale, hosting preference (managed vs. self-hosted), query latency requirements, and the need for metadata filtering alongside vector search.
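
For intuition, the sketch below performs the exact (brute-force) version of the search that a vector database approximates: it scores every stored vector by cosine similarity and optionally applies a metadata filter first. The names cosine_similarity and top_k, the tuple layout of the index, and the toy three-dimensional vectors are all assumptions for illustration. Real systems use ANN indexes such as HNSW to avoid scanning every vector.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, index, k=2, where=None):
    """Return the ids of the k most similar vectors.

    index: list of (doc_id, vector, metadata) tuples.
    where: optional dict of metadata fields that must match exactly.
    """
    scored = [
        (cosine_similarity(query, vec), doc_id)
        for doc_id, vec, meta in index
        if where is None or all(meta.get(key) == val for key, val in where.items())
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


index = [
    ("doc-a", [1.0, 0.0, 0.0], {"lang": "en"}),
    ("doc-b", [0.9, 0.1, 0.0], {"lang": "de"}),
    ("doc-c", [0.0, 1.0, 0.0], {"lang": "en"}),
]
query = [1.0, 0.0, 0.0]

print(top_k(query, index, k=2))
print(top_k(query, index, k=2, where={"lang": "en"}))
```

The brute-force scan costs time linear in the number of stored vectors, which is why dedicated indexes matter once a collection grows into the millions.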

50. What is the difference between NLP, NLU, and LLM-based AI assistants?

NLP is the broad technical field of processing and analysing human language, encompassing all techniques from rule-based systems to deep learning. Natural Language Understanding (NLU) is the subset of NLP focused specifically on understanding (comprehending meaning, intent, entities, relationships) as opposed to generation. An LLM-based AI assistant (ChatGPT, Claude, Gemini) is an application built on top of a large language model that integrates understanding (NLU) and natural language generation (NLG) in a conversational interface, augmented with tools (search, code execution, APIs), memory, and safety and alignment mechanisms (RLHF, Constitutional AI). For many tasks, modern AI assistants have largely replaced separate NLU/NLG pipeline architectures with a unified LLM-based approach.