The default mental model in RAG pipelines goes something like this: embed everything with a state-of-the-art model, store it in a vector DB, query with the same model. Done. Semantic similarity will find what is relevant. It is a fine mental model for natural language. It degrades quietly on technical corpora, and the failure mode is subtle enough that most teams ship it to production before they notice.
We ran 3,200 queries against a 620K-chunk technical corpus: API docs, SDK references, error-code registries, and internal engineering runbooks. The kind of content that reads fine to a human but is statistically sparse in any embedding model's training data. BM25 won two of four query categories outright, by margins large enough to give any RAG engineer pause. The crossover is the story.
The retrieval assumption everyone makes
Dense embeddings learn to represent semantic neighborhoods: words and phrases that appear in similar contexts cluster together. That is exactly what you want when a user asks "how do I restart the service" and your docs say "to reinitialize the daemon." The embedding bridges the lexical gap, and the retrieval feels magical.
The failure happens when a user asks about a specific model number, an error code, a product SKU, a person's name, or a rare acronym. Embedding models struggle with these for one of two reasons: the terms are either underrepresented in the training distribution, or their surface form matters more than their semantic neighborhood. ERR_CONN_RESET_7421 does not have a semantic neighborhood. It has an exact string that either appears in the document or does not.
BM25, a 1994 probabilistic retrieval function, handles this case by design. It is, at its core, a weighted term-frequency model. It rewards exact matches on rare terms more than it rewards approximate matches on common ones. On technical content, this is often what you want, and the 32-year-old algorithm beats the transformer.
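To make the "weighted term-frequency model" concrete, here is a minimal sketch of Okapi BM25 scoring. It assumes whitespace tokenization, the commonly used defaults `k1=1.2` and `b=0.75`, and the non-negative IDF variant used by several search engines; the three toy documents are hypothetical and not drawn from our corpus.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Score every tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for doc in docs_tokens for term in set(doc))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue  # only exact term matches contribute
            # Rare terms get a large IDF, common terms a small one.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturates via k1 and is normalized by document length via b.
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

# Hypothetical toy corpus: only the second document contains the exact error code.
docs = [
    "restart the service after a connection error".split(),
    "connection error ERR_CONN_RESET_7421 means the peer reset the socket".split(),
    "generic connection error troubleshooting for the service".split(),
]
query = "ERR_CONN_RESET_7421 connection error".split()
scores = bm25_scores(query, docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

All three documents match "connection" and "error", but those terms appear everywhere and carry almost no weight. The single exact match on the rare token is what separates the second document from the rest.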
What we ran, and why the query taxonomy matters
Same methodology as the vector DB shootout: recall@5 against exact-search ground truth, hand-classified queries, no cherry-picking. Queries were classified blind: classifiers did not know which retrieval method would be tested. The corpus has a high density of rare named entities (model identifiers, version strings, internal codenames, error codes), exactly the kind of content that exposes the dense-only failure mode.
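For readers who want to reproduce the metric, recall@5 can be sketched as follows. This assumes recall is the fraction of ground-truth relevant chunks that appear in the top k results, averaged over queries; the function names are illustrative, not the harness we used.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each query needs at least one relevant id")
    hits = len(set(retrieved_ids[:k]) & relevant)
    return hits / len(relevant)

def mean_recall_at_k(results, k=5):
    """Average recall@k over (retrieved_ids, relevant_ids) pairs, one pair per query."""
    results = list(results)
    return sum(recall_at_k(ret, rel, k) for ret, rel in results) / len(results)

# Hypothetical example: "f" is relevant but ranked sixth, so it falls outside the top 5.
print(recall_at_k(["a", "b", "c", "d", "e", "f"], ["a", "f"]))
```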
| query type | share | BM25 | Dense | Hybrid | winner |
|---|---|---|---|---|---|
| named entity lookup | 28% | 0.961 ★ | 0.712 | 0.948 | BM25 |
| conceptual / semantic | 34% | 0.724 | 0.951 | 0.963 ★ | Hybrid |
| error code / identifier | 21% | 0.974 ★ | 0.681 | 0.952 | BM25 |
| mixed / ambiguous | 17% | 0.803 | 0.861 | 0.931 ★ | Hybrid |
The split is clean. BM25 wins named entity lookup and error codes outright, beating dense by roughly 25 and 29 points of recall@5 respectively. Dense beats BM25 by about 23 points on conceptual queries, which confirms the embeddings are working as designed. Hybrid wins the conceptual and mixed categories and stays within about two points of BM25 on the other two, so it is never far from the category winner. The real story is the gap between dense and BM25 on the sparse-entity categories: 0.712 vs 0.961 on named entity lookup is not a tuning problem, it is a structural one.
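One way to read the table as a whole is to weight each category's recall by its share of queries. The sketch below does that arithmetic; it assumes the "share" column is a fair proxy for traffic mix, and the blended figures are derived from the table rather than measured separately.

```python
# Values copied from the table above: share of queries and recall@5 per method.
table = {
    "named entity lookup":     {"share": 0.28, "bm25": 0.961, "dense": 0.712, "hybrid": 0.948},
    "conceptual / semantic":   {"share": 0.34, "bm25": 0.724, "dense": 0.951, "hybrid": 0.963},
    "error code / identifier": {"share": 0.21, "bm25": 0.974, "dense": 0.681, "hybrid": 0.952},
    "mixed / ambiguous":       {"share": 0.17, "bm25": 0.803, "dense": 0.861, "hybrid": 0.931},
}

def blended_recall(method):
    """Share-weighted average recall for one retrieval method."""
    return sum(row["share"] * row[method] for row in table.values())

blended = {m: blended_recall(m) for m in ("bm25", "dense", "hybrid")}
for method, value in blended.items():
    print(f"{method}: {value:.3f}")
```

Under that assumption the blend comes out to roughly 0.856 for BM25, 0.812 for dense, and 0.951 for hybrid. On this query mix, dense-only is the weakest of the three.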
Why embeddings fail on rare named entities
This is worth understanding mechanically, because it changes how you think about corpus design and query routing. Embedding models are typically built on encoders pretrained on large text corpora with a language-modeling objective. Terms that appear rarely in that training distribution get weak, high-variance representations. When the model encounters them at inference time, it maps them to whatever semantic neighborhood its training data associated them with, which is often wrong or diffuse.
| term | type | dense score | failure mode |
|---|---|---|---|
| ERR_CONN_RESET_7421 | error code · 3× in corpus | 0.58 | maps to generic "connection error" cluster |
| libvorbis-1.3.7 | version string | 0.51 | conflates with 1.3.5 and 1.3.6: version specificity lost |
| Thornton-Vance protocol | internal codename | 0.43 | no external training signal, lands near generic "protocol" |
| authentication flow | common concept | 0.91 | placed correctly near OAuth, token exchange, login |
BM25's term-frequency / inverse-document-frequency weighting is well suited to exactly the terms where dense retrieval fails. IDF gives higher weight to rarer terms: the opposite of what a dense embedding model does implicitly. A query containing ERR_CONN_RESET_7421 scores that token very highly precisely because it appears in only 3 documents. The signal is concentrated, not diffused.
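The size of that effect is easy to work out. The sketch below uses the non-negative BM25 IDF variant, the 620K corpus size, and a document frequency of 3 for the error code; the document frequency of 200,000 for a common term is a made-up figure for contrast.

```python
import math

def bm25_idf(doc_freq, n_docs):
    """Non-negative BM25 IDF variant: ln((N - n + 0.5) / (n + 0.5) + 1)."""
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)

N_CHUNKS = 620_000                         # corpus size from the article
rare_idf = bm25_idf(3, N_CHUNKS)           # term found in only 3 chunks
common_idf = bm25_idf(200_000, N_CHUNKS)   # hypothetical common term

print(f"rare: {rare_idf:.2f}  common: {common_idf:.2f}  ratio: {rare_idf / common_idf:.1f}")
```

With these inputs the rare term's IDF lands near 12 and the common term's near 1.1, so a single exact match on the error code outweighs matches on several common words.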
Hybrid retrieval with RRF
Hybrid retrieval runs both methods and fuses the ranked lists. The canonical fusion approach is Reciprocal Rank Fusion: simple, parameter-light, and surprisingly effective. Each document's RRF score is the sum of its reciprocal ranks from each retrieval method.
$$\mathrm{RRF}(d) = \sum_{i} \frac{1}{k + \mathrm{rank}_i(d)}$$
Here $k = 60$ is a constant that dampens the impact of very high ranks, and $\mathrm{rank}_i(d)$ is the rank of document $d$ in result list $i$. The sum runs over each retrieval method (BM25, dense, optionally others). Documents not appearing in a list get rank = ∞, which contributes 0.
RRF's k=60 default is not magic. It was empirically derived in the original 2009 paper and has held up surprisingly well across domains. The constant prevents a document ranked #1 in one list from dominating over a document ranked #2 in both lists. It is a soft cap on single-list dominance.
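A minimal sketch of RRF over ranked lists of document ids, following the formula above. The function name and the toy rankings are illustrative; ranks start at 1, and a document missing from a list simply contributes nothing for that list.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of doc ids; returns (doc_id, score) pairs, best first."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties on doc id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical rankings: "b" is #2 in both lists, "a" and "d" are #1 in one list each.
bm25_ranking = ["a", "b", "c"]
dense_ranking = ["d", "b", "e"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

Here "b" scores 2/62 while "a" and "d" each score 1/61, so the document that both methods agree on comes out on top. This is the single-list dominance cap in action.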
The alternative to RRF is a weighted linear combination: α · dense_score + (1-α) · bm25_score. This requires normalizing scores across methods, a non-trivial problem since BM25 scores and cosine similarity scores live on different scales. RRF sidesteps this entirely by operating on ranks, not scores. For most use cases it is the right default.
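For comparison, here is a sketch of the weighted linear combination. It assumes min-max normalization within each result list, which is only one of several possible choices, and it treats a document missing from a list as scoring 0 there. The scores in the example are invented.

```python
def min_max(scores):
    """Rescale a {doc_id: score} dict to the [0, 1] range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}  # all equal: treat as top score
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def weighted_fusion(dense_scores, bm25_scores, alpha):
    """alpha * dense + (1 - alpha) * bm25 on normalized scores, best first."""
    dense_n, bm25_n = min_max(dense_scores), min_max(bm25_scores)
    doc_ids = set(dense_n) | set(bm25_n)
    fused = {
        d: alpha * dense_n.get(d, 0.0) + (1 - alpha) * bm25_n.get(d, 0.0)
        for d in doc_ids
    }
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical scores: cosine similarities for dense, raw BM25 scores for lexical.
dense = {"a": 0.82, "b": 0.60, "c": 0.40}
bm25 = {"b": 14.2, "d": 9.1, "a": 2.3}
bm25_heavy = weighted_fusion(dense, bm25, alpha=0.1)
dense_heavy = weighted_fusion(dense, bm25, alpha=0.9)
print(bm25_heavy[0][0], dense_heavy[0][0])
```

The top result flips from "b" to "a" as α moves from 0.1 to 0.9, which is the sensitivity that makes α worth tuning and also what makes a badly chosen α costly.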
Where weighted combination wins is when you have strong prior knowledge about the query distribution. If you know 80% of your queries are named-entity lookups, you can tune α toward BM25 and outperform fixed RRF. We built a query classifier to do this dynamically, more on that below.
The heatmap makes the crossover visible. For named entity and error code queries, the recall curve slopes steeply downward as α increases. Even a small push toward dense costs accuracy. For semantic queries, the opposite. The optimal fixed α is a compromise that satisfies neither query type fully. This is why query-adaptive routing is worth building.
Query routing: the two-stage classifier
Adaptive α sounds expensive. In practice, the signal for routing is often lexical: you can classify most queries correctly with a simple heuristic before reaching for a model. We built a two-stage classifier.
Stage 1, pattern matching. If the query contains any token matching a regex for known entity patterns (error codes, version strings, model identifiers, UUIDs, alphanumeric codes above a length threshold), route to BM25-dominant (α = 0.1). This catches roughly 82% of the sparse-entity query category with no model inference required.
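Stage 1 can be sketched with a handful of regular expressions. The patterns below are illustrative assumptions, not our production rules: a real deployment would derive them from the identifiers that actually occur in its corpus. Only the α = 0.1 value comes from the text.

```python
import re

BM25_DOMINANT_ALPHA = 0.1  # value quoted in the text

# Hypothetical entity patterns; tune these against your own corpus.
ENTITY_PATTERNS = [
    # Upper-case codes joined by underscores, e.g. ERR_CONN_RESET_7421
    re.compile(r"\b[A-Z][A-Z0-9]*(?:_[A-Z0-9]+)+\b"),
    # Dotted version strings with at least three parts, e.g. 1.3.7
    re.compile(r"\b\d+(?:\.\d+){2,}\b"),
    # UUIDs
    re.compile(r"\b[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}\b"),
    # Alphanumeric codes of 8+ characters containing both a letter and a digit
    re.compile(r"\b(?=[A-Za-z0-9]*[A-Za-z])(?=[A-Za-z0-9]*\d)[A-Za-z0-9]{8,}\b"),
]

def stage1_alpha(query):
    """Return the BM25-dominant alpha if any entity pattern matches, else None."""
    for pattern in ENTITY_PATTERNS:
        if pattern.search(query):
            return BM25_DOMINANT_ALPHA
    return None  # no match: let the next stage decide

for q in ["what does ERR_CONN_RESET_7421 mean",
          "changelog for libvorbis-1.3.7",
          "how do I restart the service"]:
    print(q, "->", stage1_alpha(q))
```

Returning `None` rather than a default α keeps the stage honest: it only answers when it has a lexical signal, and everything else falls through.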
Stage 2, entropy heuristic. Compute the token entropy of the query against the corpus vocabulary. Low-entropy queries (tokens that appear frequently in the index) go dense-dominant (α = 0.8). High-entropy queries (tokens that are rare or absent from the index) go BM25-dominant. This handles the cases pattern matching misses, at the cost of a fast vocabulary lookup.
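One possible reading of the entropy heuristic is the mean surprisal of the query tokens under the index's document frequencies. The sketch below takes that interpretation. The smoothing, both thresholds, the document-frequency counts, and the reuse of α = 0.1 for the BM25-dominant branch are all assumptions; only α = 0.8 for dense-dominant comes from the text.

```python
import math

DENSE_DOMINANT_ALPHA = 0.8  # value quoted in the text
BM25_DOMINANT_ALPHA = 0.1   # assumed: reuses the stage 1 value

def mean_surprisal(query_tokens, doc_freq, n_docs):
    """Average -log2 of the smoothed fraction of documents containing each token."""
    if not query_tokens:
        raise ValueError("query must contain at least one token")
    total_bits = 0.0
    for token in query_tokens:
        # Add-one smoothing keeps tokens absent from the index at a finite surprisal.
        p = (doc_freq.get(token, 0) + 1) / (n_docs + 2)
        total_bits += -math.log2(p)
    return total_bits / len(query_tokens)

def stage2_alpha(query_tokens, doc_freq, n_docs, low_bits=4.0, high_bits=8.0):
    """Common vocabulary -> dense-dominant, rare vocabulary -> BM25-dominant."""
    bits = mean_surprisal(query_tokens, doc_freq, n_docs)
    if bits <= low_bits:
        return DENSE_DOMINANT_ALPHA
    if bits >= high_bits:
        return BM25_DOMINANT_ALPHA
    return None  # ambiguous middle band: defer to a later stage

# Hypothetical document frequencies for a 620K-chunk index.
N_DOCS = 620_000
DOC_FREQ = {"restart": 41_000, "the": 600_000, "service": 120_000,
            "thornton-vance": 12, "handshake": 9_000, "protocol": 30_000}

print(stage2_alpha(["restart", "the", "service"], DOC_FREQ, N_DOCS))
print(stage2_alpha(["thornton-vance", "handshake"], DOC_FREQ, N_DOCS))
print(stage2_alpha(["protocol", "handshake"], DOC_FREQ, N_DOCS))
```

The only runtime cost is a dictionary lookup per token, which is why this stage stays cheap. The thresholds are the part that needs calibration against labeled queries.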
The full model classifier, a fine-tuned binary head on a small encoder, only gets invoked for the ambiguous middle: roughly 15% of traffic. End-to-end routing overhead: under 3ms p99. Recall improvement over fixed RRF: +4.2 points on mixed traffic. Worth it.
What to actually build
The choice between retrieval methods is a function of your corpus and query distribution, not your embedding model's benchmark ranking. Here is the practical decision tree.
k=60). Measure your query distribution. Add the two-stage router once you have confirmed the distribution is stable enough to tune against. Our current default for new deployments.
One operational note: BM25 indices are lightweight, fast to build, and trivially inspectable. If you are debugging a recall failure and your retrieval stack is dense-only, you are operating blind. You cannot see why a document ranked where it did without probing the embedding space. BM25 score decomposition is human-readable. That matters in production when something goes wrong at 2am.