Vector Embeddings Explained: Why Your Search Results Are Only as Good as Your Embedding Model

You built the RAG pipeline. You indexed 50,000 documents. You wired up the vector database. You tested it on ten representative queries and got clean, accurate results. You shipped it.

Three weeks later, users are reporting that the system returns completely irrelevant results for certain question types. You check the logs. The retrieval is finding documents — they're just the wrong documents. The system is confidently retrieving content that shares surface-level vocabulary with the query but has nothing to do with what the user actually needs.

The LLM isn't the problem. The vector database isn't the problem. The chunking strategy isn't the problem.

The embedding model is the problem — and specifically, the mismatch between what your embedding model was trained to represent and what your users are actually asking.

This is the embedding model selection problem that most practitioners encounter after building their first production RAG system. Almost nobody encounters it before. This article is the explanation that should have come first.

What Embedding Models Are Actually Doing (And Why the Common Explanation Is Incomplete)

The standard explanation of embeddings goes like this: texts with similar meanings are mapped to nearby points in high-dimensional space, so semantic search finds semantically similar content.

That explanation is technically correct and practically misleading, because it skips the part that actually determines whether your system works: what kind of similarity the model was trained to capture.

Every embedding model is trained on a specific task with a specific definition of "similar." Some models are trained on symmetric similarity — "cat" should be near "feline" because they mean the same thing. Other models are trained on asymmetric similarity — a short question should be near a long document that answers it, even if they share no vocabulary.

These are fundamentally different representations. Using the wrong one for your use case produces a retrieval failure that looks like success: the similarity scores are still high, but they measure the wrong kind of similarity.

The scenario that makes this concrete:

A legal tech company builds document search using text-embedding-ada-002. Their users search with short phrases like "indemnification clause risk" and expect to find contract sections containing indemnification language, even when those sections use different phrasing.

In this scenario the embedding model behaves as though it were optimized for symmetric similarity: finding texts that say similar things. When a user queries "indemnification clause risk," the model returns documents that discuss indemnification clause risk (e.g., legal commentary, academic articles) rather than documents that contain indemnification clause language in a contract.

The problem isn't vector search. It's that the model is optimizing for the wrong relationship between query and document.

The fix is switching to a model trained on asymmetric query-document relevance — specifically designed to match short queries to long, relevant documents. Retrieval quality improves substantially on the same corpus with the same chunking strategy.

The Embedding Model Landscape: What's Available and What Each Is Actually For

The embedding model landscape in 2025 has three tiers: general-purpose proprietary models, task-optimized open-source models, and domain-specialized fine-tuned models.

Tier 1 — General-Purpose Proprietary Models:

OpenAI's text-embedding-3-small and text-embedding-3-large, Cohere's embed-v3, and Google's text-embedding-004 sit here. These are trained on broad web-scale data with general-purpose contrastive objectives.

They work well for: general document search, mixed-domain knowledge bases, applications where you can't control document types.

They underperform for: highly specialized domains (legal, medical, scientific), code search, very short text matching, applications requiring precise technical term differentiation.

The non-obvious limitation of general-purpose models:

These models were trained on general web text, which means domain-specific terminology that appears rarely in web text is poorly represented. A medical AI assistant built on a general embedding model will treat "myocardial infarction," "MI," and "heart attack" as related, but it will not place them as close together as a medical-domain model would. For clinical queries, this distinction matters, and it produces retrieval failures on exactly the high-stakes queries where failure is most costly.

Tier 2 — Task-Optimized Open-Source Models:

The MTEB (Massive Text Embedding Benchmark) leaderboard tracks the best-performing open-source models across specific tasks. Key models to understand:

E5-large-v2 and E5-mistral-7b-instruct: Microsoft's E5 family, trained specifically for retrieval tasks with asymmetric query-document objectives. Strong performance on information retrieval benchmarks.

BGE-large-en-v1.5 and BGE-M3: BAAI's models, with BGE-M3 supporting multilingual, multi-functionality (dense, sparse, and multi-vector), and multi-granularity retrieval. Strong cross-lingual performance.

GTE-large: Alibaba's General Text Embeddings, strong general-purpose performance with good efficiency profile.

nomic-embed-text: Nomic AI's open-source model with competitive performance and full open weights.
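The asymmetric objective mentioned for the E5 family is visible in how inputs are prepared: the E5 v2 models expect a "query: " prefix on queries and a "passage: " prefix on documents. The sketch below only formats strings and does not load a model. The helper name format_for_e5 and the sample contract sentence are made up for illustration, and other model families use different conventions, so check the model card before relying on this.

```python
# Minimal sketch: role prefixes for E5-style asymmetric retrieval.
# format_for_e5 is a hypothetical helper, not part of any library.

E5_PREFIXES = {"query": "query: ", "passage": "passage: "}


def format_for_e5(text: str, role: str) -> str:
    """Prepend the role prefix an E5 v2 model expects before encoding."""
    if role not in E5_PREFIXES:
        raise ValueError(f"role must be one of {sorted(E5_PREFIXES)}")
    return E5_PREFIXES[role] + text.strip()


# A short user query and a (made-up) contract sentence it should match.
query_input = format_for_e5("indemnification clause risk", "query")
passage_input = format_for_e5(
    "The Supplier shall indemnify the Customer against third-party claims.",
    "passage",
)

print(query_input)
print(passage_input)
```

Encoding a query with the passage prefix (or with no prefix) usually still produces a vector, which is part of why this mistake is easy to miss.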

The MTEB leaderboard trap:

Teams often select the highest-ranking model on the MTEB leaderboard without checking which tasks the leaderboard averages across. A model that ranks first on average MTEB score may rank third on the specific retrieval task you're building. Always filter the leaderboard to your specific task category (retrieval, classification, semantic textual similarity) before selecting.
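The effect is easy to see with a toy table. The sketch below ranks three placeholder models by average score and then by retrieval score alone. The model names and every number are invented for illustration and are not real MTEB results.

```python
# Minimal sketch: ranking by average score vs. ranking by one task.
# All names and scores are hypothetical placeholders, not leaderboard data.

scores = {
    "model_a": {"retrieval": 0.52, "classification": 0.80, "sts": 0.86},
    "model_b": {"retrieval": 0.58, "classification": 0.72, "sts": 0.80},
    "model_c": {"retrieval": 0.55, "classification": 0.75, "sts": 0.83},
}


def rank_models(scores, task=None):
    """Rank model names best-first by one task, or by the mean over tasks."""
    def key(model):
        per_task = scores[model]
        if task is not None:
            return per_task[task]
        return sum(per_task.values()) / len(per_task)

    return sorted(scores, key=key, reverse=True)


print("by average:  ", rank_models(scores))
print("by retrieval:", rank_models(scores, task="retrieval"))
```

With these placeholder numbers, the model that leads on average comes last on retrieval, which is the pattern the paragraph above warns about.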

Tier 3 — Domain-Specialized Models:

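For specialized domains, fine-tuned or purpose-built models often outperform general-purpose models by significant margins. Note that several of the checkpoints below are BERT-style encoders pretrained on domain text rather than retrieval-tuned embedding models, so they usually need pooling and contrastive fine-tuning before they work well for search: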

medbert-base-uncased and BiomedBERT for medical/clinical text
legal-bert-base-uncased for legal document processing
codebert-base and unixcoder for code search
FinBERT for financial text

As a rough rule of thumb: if more than 60% of your queries or documents are in a specific domain, and that domain has specialized vocabulary that general web text doesn't capture well, domain-specialized models typically outperform general-purpose models on retrieval tasks within that domain, often by something like 10-20%. Treat that range as a heuristic and verify it on your own evaluation set.

Why Retrieval Fails in Ways That Look Like It's Working

This is the failure mode that's hardest to catch: retrieval that returns high similarity scores on genuinely irrelevant documents.

It happens because similarity scores are relative, not absolute. Nearest-neighbor search always returns the closest documents in the model's learned vector space, whether or not anything in the index is actually relevant. Many embedding models also produce cosine similarities in a narrow, high band, so a weakly relevant document can still receive a score that looks high. A top-ranked document is the closest thing in the space, not necessarily a relevant one.

The scenario that every practitioner recognizes too late:

An e-commerce company builds semantic product search. Users search for "comfortable running shoes for flat feet" and expect to see running shoes with arch support features.

The embedding model returns high similarity scores for: generic running shoe product pages (correct), articles about running shoe selection (wrong — editorial content, not products), and hiking boot product pages (partially wrong — comfortable shoes, not running shoes).

The similarity scores for all three categories are high — 0.87, 0.82, 0.79 — because the model is measuring general semantic relatedness. But only the first category is actually what the user wants.

The three failure signatures and their causes:

False precision — high scores on wrong documents. Cause: corpus contamination (multiple document types in the same index) or wrong similarity type for the task.

False negatives — correct documents that should rank high but don't. Cause: vocabulary mismatch between query language and document language.

Score compression — all documents receive similar scores, making ranking meaningless. Cause: embedding model represents all documents in a compressed region of the space — often happens with domain-specialized queries on general-purpose models.
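A quick way to look for score compression is to measure how far apart the top-k scores are for each query. The sketch below computes cosine similarity in plain Python and flags a result list whose top-k spread falls under a threshold. The 0.05 default is an arbitrary illustrative value, not a recommended setting, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def score_spread(scores, k=10):
    """Gap between the best and the k-th best similarity score."""
    if not scores:
        raise ValueError("scores must not be empty")
    top = sorted(scores, reverse=True)[:k]
    return top[0] - top[-1]


def looks_compressed(scores, k=10, min_spread=0.05):
    """True when the top-k scores are too close together to rank reliably.

    min_spread is an assumed, illustrative threshold; calibrate it per model.
    """
    return score_spread(scores, k) < min_spread


# The three scores from the e-commerce example above.
example = [0.87, 0.82, 0.79]
print(round(score_spread(example), 2), looks_compressed(example))
```

Because absolute scores differ between models, it is usually more informative to track the spread per query over time than to compare raw values across models.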

The Embedding Evaluation Framework You Need Before Going to Production

The most consistent error in production RAG systems is evaluating end-to-end answer quality without first evaluating retrieval quality independently. When the system produces a wrong answer, you can't tell whether retrieval found the wrong document or the LLM generated wrong content from the right document.

Evaluating the embedding model specifically requires a framework built around three questions:

Question 1: Is the model finding the right documents? (Recall)

Recall@k measures: for a set of queries with known relevant documents, what fraction of relevant documents appear in the top-k retrieved results?

The evaluation process:

Create a golden dataset of 100-200 queries with manually identified relevant documents
Run retrieval for each query
Check whether the relevant document appears in the top-k results
Calculate recall@k for k=1, k=5, k=10

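With one relevant document per query, a recall@10 of 0.85 means the relevant document appears in the top 10 results for 85% of queries. With several relevant documents per query, it means that on average 85% of them appear in the top 10.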
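The process above can be sketched in a few lines. This version assumes the golden dataset is a dict from query ID to a set of relevant document IDs, and that retrieval output is a dict from query ID to a ranked list of document IDs. Both layouts and all IDs are illustrative assumptions.

```python
def recall_at_k(retrieved, relevant, k):
    """Mean over queries of (relevant docs found in top-k) / (relevant docs).

    retrieved: dict of query_id -> ranked list of doc ids (best first)
    relevant:  dict of query_id -> set of doc ids judged relevant
    Queries with no labeled relevant documents are skipped.
    """
    per_query = []
    for query_id, relevant_ids in relevant.items():
        if not relevant_ids:
            continue
        top_k = set(retrieved.get(query_id, [])[:k])
        per_query.append(len(top_k & relevant_ids) / len(relevant_ids))
    return sum(per_query) / len(per_query) if per_query else 0.0


# Tiny made-up golden set: q1 has one relevant doc, q2 has two.
retrieved = {"q1": ["d3", "d1", "d7"], "q2": ["d9", "d4", "d2"]}
relevant = {"q1": {"d1"}, "q2": {"d2", "d5"}}

for k in (1, 3):
    print(f"recall@{k} = {recall_at_k(retrieved, relevant, k):.2f}")
```

Reporting the metric at several values of k shows whether relevant documents are missing entirely or simply ranked low.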

Question 2: Are the right documents ranking highest? (Precision and NDCG)

Finding the right document in the top 10 is not the same as finding it at position 1. NDCG (Normalized Discounted Cumulative Gain) measures ranking quality: each relevant document's contribution is discounted by the logarithm of its rank, so correct documents in lower positions earn less credit.

High recall with low NDCG means the model finds relevant documents but buries them under irrelevant ones. This is a re-ranking opportunity: the correct documents are retrieved, they just need to be moved to the top.
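A minimal sketch of NDCG@k with binary relevance (a document is either relevant or not) shows how position changes the score. Graded relevance labels would need a different gain function, and the document IDs are made up.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain: gain at rank r is divided by log2(r + 1)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """NDCG@k with binary relevance: 1.0 means all relevant docs rank first."""
    gains = [1.0 if doc_id in relevant_ids else 0.0 for doc_id in ranked_ids]
    # Best possible ranking: every relevant document at the top of the list.
    ideal_gains = [1.0] * min(len(relevant_ids), k)
    ideal = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


relevant_ids = {"d1"}
print(ndcg_at_k(["d1", "d2", "d3"], relevant_ids, k=3))  # relevant doc first
print(ndcg_at_k(["d2", "d3", "d1"], relevant_ids, k=3))  # relevant doc third
```

Both rankings have the same recall@3, but the second scores half as well on NDCG because the relevant document sits at position 3.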

Question 3: Does retrieval quality translate to answer quality? (End-to-End Faithfulness)

After establishing baseline retrieval metrics, run the full RAG pipeline and measure faithfulness: does the generated answer reflect the retrieved context? A system with high recall but low faithfulness has a generation problem, not a retrieval problem.

The golden dataset construction trap:

Most teams build golden datasets using the same document types and query styles that appear in the product specification. A dataset built this way mirrors the cases the system was designed around, so it overstates retrieval quality and misses the kinds of queries real users send. The solution is adversarial golden dataset construction: include queries that are phrased differently from the document content (testing vocabulary mismatch), queries that span multiple documents, queries at the edge of the domain, and queries that have near-miss documents that look relevant but aren't.

Chunking Strategy and Embedding Models: Why They're Inseparable

Chunking strategy and embedding model selection are typically treated as independent decisions. They're not. The chunking strategy you use determines the characteristics of the text your embedding model receives, and every embedding model has an implicit optimal input length that it was trained to handle.

The context window mismatch problem:

Most general-purpose embedding models were trained on sentence-to-paragraph length text. Their effective context window is typically 256-512 tokens for older models and 512-8192 tokens for newer models.

If you chunk your documents into 2,000-token chunks but use an older embedding model with an effective context window of 512 tokens, the input is typically truncated, so each embedding reflects only the first 512 tokens of its chunk. You've done the computational work of processing a 2,000-token chunk but gotten the embedding quality of a 512-token one, and the remaining 1,488 tokens are invisible to retrieval.
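The arithmetic of the mismatch can be sketched with a list of integers standing in for token IDs. A real pipeline would count tokens with the embedding model's own tokenizer. The function name split_to_fit and the windowing approach are illustrative assumptions, not a prescribed fix.

```python
def split_to_fit(tokens, max_tokens, overlap=0):
    """Split a token list into windows no longer than max_tokens.

    Consecutive windows share `overlap` tokens. The loop stops as soon as a
    window reaches the end, so no window is fully contained in the previous one.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")
    windows = []
    start = 0
    step = max_tokens - overlap
    while start < len(tokens):
        windows.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break
        start += step
    return windows


chunk = list(range(2000))   # stand-in for a 2,000-token chunk
model_limit = 512           # assumed model input limit

dropped = max(0, len(chunk) - model_limit)   # tokens lost to truncation
windows = split_to_fit(chunk, model_limit)

print(f"truncation would drop {dropped} tokens")
print(f"re-chunking gives {len(windows)} windows, last has {len(windows[-1])} tokens")
```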

The chunk-to-query length mismatch problem:

Asymmetric retrieval models are trained to match short queries to longer documents. If your queries are short phrases and your chunks are long paragraphs, this works well.

If your queries are long — "What are the specific provisions in Article 5 regarding indemnification for third-party claims in commercial software licenses?" — and your chunks are also long paragraphs, you've created a symmetric-length matching scenario using an asymmetric model, which often degrades performance.

The three chunking strategies and when each works:

Fixed-size chunking with 256-512 token chunks is appropriate when: documents are homogeneous, query length is short (under 20 tokens), and you're using a standard retrieval-optimized model.

Semantic chunking that splits at sentence or paragraph boundaries is appropriate when: documents have variable information density, chunks need to preserve complete ideas, and factual accuracy of retrieved content matters.

Hierarchical chunking that creates both small chunks (for specific retrieval) and large parent chunks (for context) is appropriate when: specific facts need to be retrieved but surrounding context is needed for accurate answer generation.
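As an illustration of the hierarchical strategy, the sketch below splits a token list into large parent chunks and then splits each parent into small child chunks that remember their parent. The idea is that children are embedded and searched while the parent is what gets passed to the LLM. The sizes, the dict layout, and the use of integers as stand-in tokens are all assumptions for the example.

```python
def hierarchical_chunks(tokens, parent_size, child_size):
    """Return (parents, children); each child records its parent's index."""
    if parent_size <= 0 or child_size <= 0:
        raise ValueError("chunk sizes must be positive")
    parents = [
        tokens[i:i + parent_size] for i in range(0, len(tokens), parent_size)
    ]
    children = []
    for parent_index, parent in enumerate(parents):
        for j in range(0, len(parent), child_size):
            children.append(
                {"parent": parent_index, "tokens": parent[j:j + child_size]}
            )
    return parents, children


document = list(range(1000))  # stand-in for a 1,000-token document
parents, children = hierarchical_chunks(document, parent_size=512, child_size=128)

# Suppose retrieval matched this child; hand its whole parent to the LLM.
hit = children[5]
context = parents[hit["parent"]]
print(len(parents), len(children), hit["parent"], len(context))
```

Splitting on fixed token counts keeps the example short; in practice the same parent and child bookkeeping can sit on top of sentence or paragraph boundaries.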

Fine-Tuning Embedding Models: When It's Worth the Investment

For most applications, selecting the right pre-trained model from the landscape above is sufficient. But for high-stakes applications in specialized domains with proprietary relevance signals, fine-tuning an embedding model on domain-specific data can produce substantial gains.

When fine-tuning embedding models is worth it:

Your domain has specialized terminology that general models represent poorly
You have proprietary relevance judgments that differ from general web relevance
You have at least 1,000 query-document positive pairs with reliable relevance labels
The baseline retrieval quality with the best pre-trained model is below your quality threshold

What fine-tuning embedding models actually requires:

Unlike fine-tuning generative LLMs, embedding model fine-tuning uses contrastive learning: you provide (query, positive document) pairs and optionally (query, negative document) pairs. The model learns to place queries close to their relevant documents in embedding space.

The most critical input is the negative examples. Easy negatives teach the model very little. Hard negatives — documents that look relevant but aren't, based on your specific relevance criteria — are what drive the most quality improvement.
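One common way to obtain hard negatives is to take documents that the current retriever ranks highly but that are not labeled relevant. This is an assumption about method; the approach and the helper name below are illustrative. The main risk is that an unlabeled but genuinely relevant document gets mined as a negative, so a manual spot check is worth the time.

```python
def mine_hard_negatives(ranked_ids, positive_ids, n=3, skip_top=0):
    """Pick the n highest-ranked documents that are not labeled positive.

    ranked_ids:   doc ids from the current retriever, best first
    positive_ids: set of doc ids labeled relevant for this query
    skip_top:     optionally skip the very top ranks, where unlabeled
                  true positives are most likely to appear
    """
    candidates = [d for d in ranked_ids[skip_top:] if d not in positive_ids]
    return candidates[:n]


def build_triples(query, ranked_ids, positive_ids, n=3):
    """Expand one query into (query, positive, negative) training triples."""
    negatives = mine_hard_negatives(ranked_ids, positive_ids, n=n)
    return [
        (query, pos, neg) for pos in sorted(positive_ids) for neg in negatives
    ]


ranked = ["d4", "d1", "d9", "d2", "d7"]  # made-up retriever output
positives = {"d1"}

print(mine_hard_negatives(ranked, positives, n=2))
print(build_triples("indemnification clause risk", ranked, positives, n=2))
```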

The fine-tuning risk nobody mentions:

Fine-tuned embedding models can overfit to the specific query-document patterns in the training data. A model fine-tuned on your current query distribution can degrade when that distribution shifts. Monitoring retrieval quality in production is more important after fine-tuning than with pre-trained models.

The practical alternative to fine-tuning:

Before investing in embedding model fine-tuning, evaluate whether a cross-encoder re-ranker achieves the same quality improvement at lower cost. Re-rankers don't change the embedding model — they reorder the top-k retrieved results based on a more expensive but accurate relevance model. For many cases where the right document is in the top-20 but not in the top-5, a re-ranker resolves the problem without any embedding model changes.

The evaluation that determines whether to re-rank or fine-tune is Recall@20: if the correct document exists in the top-20, a re-ranker can surface it without any model changes. Only when the correct document doesn't appear in the top-20 at all does embedding model fine-tuning address the actual failure.
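That decision rule can be applied per query over a golden dataset. The sketch below labels each query as already fine, fixable by re-ranking, or a true retrieval miss. The label strings, the cutoffs of 5 and 20, and the data layout are illustrative assumptions.

```python
from collections import Counter


def triage_query(ranked_ids, relevant_ids, top_n=5, candidate_n=20):
    """Classify one query by where its relevant documents were retrieved."""
    if set(ranked_ids[:top_n]) & relevant_ids:
        return "ok"          # already in the top results
    if set(ranked_ids[:candidate_n]) & relevant_ids:
        return "rerank"      # retrieved, but ranked too low
    return "embedding"       # never retrieved; a re-ranker cannot help


def triage_all(results, golden, top_n=5, candidate_n=20):
    """Count triage labels over a golden set of query_id -> relevant ids."""
    return Counter(
        triage_query(results.get(query_id, []), relevant_ids, top_n, candidate_n)
        for query_id, relevant_ids in golden.items()
    )


filler = [f"x{i}" for i in range(30)]  # irrelevant placeholder documents
results = {
    "q1": ["d1"] + filler,                       # relevant doc at rank 1
    "q2": filler[:10] + ["d2"] + filler[10:],    # relevant doc at rank 11
    "q3": filler,                                # relevant doc never retrieved
}
golden = {"q1": {"d1"}, "q2": {"d2"}, "q3": {"d3"}}

print(triage_all(results, golden))
```

A large "rerank" count points toward adding a cross-encoder; a large "embedding" count suggests the embedding model itself is missing the documents.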

Hybrid Search: When Embeddings Alone Are Insufficient

Vector similarity search works well for semantic queries — queries about concepts, themes, ideas. It fails predictably on exact-match queries: specific product codes, regulatory references, proper names, technical identifiers, version numbers.

The failure scenario:

A developer builds search for an internal documentation system. Engineers search for "timeout error in Redis connection pool" — a mix of a technical error type and a specific product name. The embedding model encodes the semantic meaning well, but the specific identifier "Redis" may be weighted less than broader concepts in the embedding space.

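Keyword search (BM25) would have found the exact Redis documentation immediately based on the literal token "Redis." Dense vector search has no explicit mechanism for rewarding exact token matches; it ranks by overall semantic closeness, so a page about connection timeouts in a different system can outrank the Redis page.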

How hybrid search works:

Parallel retrieval: Execute both BM25 search and vector similarity search independently on the same query.

Score fusion: Combine the ranked lists using Reciprocal Rank Fusion (RRF), a simple, effective method that combines rankings without requiring score normalization. Each document's fused score is the sum of 1/(k + rank) over the retrieval methods that returned it, where rank is the document's 1-based position in that method's list and k is a smoothing constant, typically 60. Documents that rank well in both retrieval methods receive the highest combined scores.
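The fusion step is short enough to sketch in full. The version below takes any number of ranked lists of document IDs and returns a single fused ranking. The document IDs are made up, and breaking ties alphabetically is an arbitrary choice made so the output is deterministic.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids with RRF: score = sum of 1 / (k + rank).

    rankings: iterable of lists, each ordered best first
    k:        smoothing constant (60 is the commonly used default)
    Returns doc ids ordered by fused score, best first.
    """
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; ties are broken by doc id for determinism.
    return sorted(fused, key=lambda doc_id: (-fused[doc_id], doc_id))


# Made-up results for "timeout error in Redis connection pool".
bm25_results = ["redis_pool", "timeouts", "http_retry"]
vector_results = ["timeouts", "conn_errors", "redis_pool"]

print(reciprocal_rank_fusion([bm25_results, vector_results]))
```

Documents that appear in both lists rise to the top, while a document that only one method returned is kept but ranked below them. Because only ranks are used, BM25 scores and cosine similarities never need to be put on the same scale.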

When hybrid search is non-optional:

Queries contain specific product names, model numbers, or identifiers
Domain uses technical acronyms that may not be well-represented in embedding space
User queries are short (under 5 tokens) where semantic context is sparse
Documents contain both structured data (IDs, codes) and unstructured prose

When pure vector search is sufficient:

Queries are conceptual and don't contain specific identifiers
Documents are purely prose with no structural identifiers
User intent is always broadly semantic

Closing: From Embedding Selection to Production Reliability

The embedding model is not one component among many in a RAG system — it is the component that determines the ceiling of every other component's contribution. The LLM cannot generate correct answers from incorrectly retrieved context. Re-ranking cannot surface documents that were never retrieved. Chunking strategy cannot compensate for a model that measures the wrong kind of similarity.

Getting the embedding model selection right — matching the similarity type the model was trained to capture to the retrieval task your users actually have — is the highest-leverage decision in RAG system design. It is also the decision most often made by default rather than by diagnosis.

At Meritshot, the AI Engineering curriculum covers embedding model selection, retrieval evaluation, hybrid search architecture, and production monitoring as an integrated system — because the practitioners who build reliable RAG systems understand how these components interact, not just how each one works in isolation.
