Why your RAG accuracy plateaus at 70% — and the four-tier retrieval architecture that breaks past it

The default vector-search-plus-LLM design is a great prototype and a poor production system. After shipping production RAG across legal, healthcare, and financial-services corpora, the pattern that consistently clears 90%+ answer accuracy at sub-second latency is not a bigger model — it is four retrieval tiers working together.

The 70% plateau is a retrieval problem, not a model problem

Almost every stalled RAG project arrives with the same story. The prototype was magic. It answered the demo questions, the stakeholders nodded, the budget got approved. Then it met the real document corpus and answer quality settled at roughly 70% correct: good enough to be impressive, not good enough to trust. The instinct at that point is to reach for a larger language model, a longer context window, or a better prompt.

That instinct is almost always wrong. When a RAG system plateaus near 70%, the model is rarely the bottleneck. The model can only answer from what it is handed, and a single-stage vector search hands it the wrong passages a third of the time. No amount of prompt engineering rescues an answer built on retrieved chunks that never contained the fact in the first place. The plateau is a retrieval ceiling, and you raise it by fixing retrieval.

You cannot prompt your way out of a retrieval miss. If the right passage is not in the context window, the smartest model on the planet will confidently answer from the wrong one.

The good news is that the ceiling is structural, which means it is fixable with architecture rather than luck. The four-tier pattern below is the version that has held up across very different corpora — dense legal contracts, sprawling healthcare protocols, and tightly access-controlled financial documents. It is the same backbone behind the enterprise RAG solutions we put into production.

The four failure modes of naive RAG

Before the fix, name the disease. Single-stage vector retrieval fails in four distinct ways, and most plateaued systems are suffering from all four at once.

  1. Lexical blind spots. Embeddings capture meaning, not exact tokens. Ask for “Form 10-K filing deadline” and a pure semantic search may rank a passage about annual reporting cadence above the one that literally states the date. Identifiers, codes, SKUs, statute numbers, and proper nouns are where dense vectors quietly miss.
  2. Chunk-boundary amnesia. A fact split across two chunks — the condition in one, the exception in the next — is retrievable as neither. Fixed-size chunking is the single most common cause of “the answer was in the document but the system never found it.”
  3. No scoping. Without metadata filtering, retrieval searches the entire corpus every time. A reader in one business unit gets passages from another; a question about the current policy surfaces last year’s superseded version; a user without clearance sees a chunk they should never have been shown.
  4. Single-stage ranking. A bi-encoder scores the query and every chunk independently and hopes their vectors land close. It is fast and shallow. It has no mechanism to judge whether a specific passage actually answers this specific question — it only knows they are broadly about the same topic.

The four-tier retrieval architecture

Each tier targets one of the four failure modes. They run as a pipeline: the first two tiers cast a wide, high-recall net in parallel; the third tier scopes that net to what the user is allowed and intended to see; the fourth tier does the slow, precise judgement that a fast first pass cannot.

Tier 1: Dense vector retrieval (recall for meaning)

The familiar tier, done deliberately. Two decisions dominate its quality. First, embedding selection: a general-purpose embedding model underperforms on domain corpora with heavy jargon — a model fine-tuned or chosen for legal or clinical text routinely lifts recall by double digits over an off-the-shelf default. Second, chunking strategy: replace fixed-size splits with structure-aware chunking that respects section and clause boundaries, and add modest overlap so a fact never falls into a seam. This tier alone does not need to be precise; it needs to be high-recall. Precision is tier four’s job. Your vector database development choices — index type, dimensionality, metadata co-location — are set here.
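As a rough illustration of structure-aware chunking with overlap, the sketch below splits on blank-line paragraph boundaries, packs paragraphs into chunks, and carries the last paragraph of each chunk into the next. The function name, the character budget, and the use of paragraphs as the structural unit are all assumptions for the example; a real corpus would more likely split on section or clause markers.

```python
import re
from typing import List


def chunk_by_structure(text: str, max_chars: int = 800,
                       overlap_paragraphs: int = 1) -> List[str]:
    """Pack whole paragraphs into chunks of roughly max_chars.

    Paragraphs are never cut in half, and the trailing paragraph(s) of each
    chunk are repeated at the start of the next one so that a condition and
    its exception are likely to share at least one chunk.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: List[str] = []
    current: List[str] = []
    for para in paragraphs:
        if current and len("\n\n".join(current + [para])) > max_chars:
            chunks.append("\n\n".join(current))
            # Carry the tail of the finished chunk forward as overlap.
            carried = current[-overlap_paragraphs:] if overlap_paragraphs > 0 else []
            current = carried + [para]
        else:
            current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


sample = (
    "Clause 1. The fee is due in 30 days.\n\n"
    "Exception: not if the invoice is disputed.\n\n"
    "Clause 2. Late fees accrue monthly."
)
for chunk in chunk_by_structure(sample, max_chars=90):
    print(repr(chunk))
```

In this toy input the exception paragraph lands in both chunks, which is the point of the overlap: neither the rule nor its exception is stranded on the far side of a seam.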

Tier 2: Sparse keyword retrieval (recall for tokens)

Run a classic lexical search (BM25 or equivalent) in parallel with the dense tier. This is the cure for lexical blind spots: it retrieves on the exact tokens dense vectors smooth over — the statute number, the part code, the rare proper noun. The two tiers are complementary by construction. Dense retrieval wins on paraphrase and concept; sparse retrieval wins on exact match. Hybrid retrieval is not a nice-to-have; on corpora full of identifiers it is the difference between finding the clause and inventing it.
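To make the sparse tier concrete, here is a minimal in-memory BM25 scorer in plain Python. It is a teaching sketch, not a substitute for a search engine: the class name, the tokenizer, the parameter defaults (k1 = 1.5, b = 0.75), and the sample documents are assumptions chosen for the example.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    # Keep hyphenated identifiers such as "zx-4471" as a single token.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


class BM25:
    """Okapi BM25 over a small dict of {doc_id: text}."""

    def __init__(self, docs: Dict[str, str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.tf = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
        self.length = {doc_id: sum(c.values()) for doc_id, c in self.tf.items()}
        self.avg_len = (sum(self.length.values()) / len(docs)) if docs else 1.0
        self.avg_len = self.avg_len or 1.0  # guard against all-empty documents
        doc_freq = Counter(term for c in self.tf.values() for term in c)
        n = len(docs)
        self.idf = {t: math.log(1 + (n - d + 0.5) / (d + 0.5))
                    for t, d in doc_freq.items()}

    def search(self, query: str, top_k: int = 50) -> List[str]:
        terms = tokenize(query)
        scores: Dict[str, float] = {}
        for doc_id, counts in self.tf.items():
            norm = self.k1 * (1 - self.b + self.b * self.length[doc_id] / self.avg_len)
            score = sum(
                self.idf[t] * counts[t] * (self.k1 + 1) / (counts[t] + norm)
                for t in terms if t in counts
            )
            if score > 0:
                scores[doc_id] = score
        return sorted(scores, key=scores.get, reverse=True)[:top_k]


docs = {  # fictional documents for illustration
    "cadence": "Torque values are reviewed on an annual cadence by engineering.",
    "spec": "Part code ZX-4471 ships with a torque spec of 12 Nm.",
}
print(BM25(docs).search("ZX-4471 torque spec"))
```

The passage that literally contains the part code ranks first because the rare token carries a high inverse document frequency, which is exactly the signal a dense embedding tends to smooth away.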

Tier 3: Metadata filtering (scope and authorization)

Before anything is ranked, constrain the candidate set by metadata: business unit, document type, effective date, jurisdiction, and — non-negotiable in regulated settings — the user’s authorization scope. This tier is where RAG stops being a demo and becomes deployable. It prevents the system from retrieving a superseded policy, leaking a cross-tenant document, or surfacing a passage the asker has no clearance to read. In financial services and healthcare, scoping is not an optimization; it is the line between a usable system and a compliance incident.

Tier 3: metadata pre-filter (illustrative)
# Illustrative pseudocode: index.search and the filter helpers (is_in,
# current_as_of, subset_of) are placeholders for your vector store's own API.
candidates = index.search(
    query_vector,
    filter={
        "doc_type":   is_in(allowed_types),
        "effective":  current_as_of(today),
        "auth_scope": subset_of(user.clearances),   # hard gate
    },
    top_k=50,                                       # wide; the reranker narrows later
)
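The snippet above leaves the filter semantics to the vector store. One way to read them, written as plain Python so the logic can be tested in isolation, is sketched below. The Chunk fields and the interpretation of subset_of (the user must hold every clearance the chunk requires) and current_as_of (today falls inside the effective window) are assumptions, not a definition taken from any particular product.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet, Iterable, List, Optional, Set


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    doc_type: str
    effective_from: date
    effective_to: Optional[date]   # None means the document is still in force
    auth_scope: FrozenSet[str]     # clearances required to read this chunk


def in_scope(chunk: Chunk, allowed_types: Set[str],
             clearances: Set[str], today: date) -> bool:
    """True only if the chunk is the right type, currently effective,
    and fully covered by the user's clearances (the hard gate)."""
    is_current = chunk.effective_from <= today and (
        chunk.effective_to is None or today <= chunk.effective_to
    )
    is_authorized = chunk.auth_scope <= clearances  # set subset test
    return chunk.doc_type in allowed_types and is_current and is_authorized


def scope(chunks: Iterable[Chunk], allowed_types: Set[str],
          clearances: Set[str], today: date) -> List[Chunk]:
    return [c for c in chunks if in_scope(c, allowed_types, clearances, today)]


corpus = [
    Chunk("policy-2023", "policy", date(2023, 1, 1), date(2023, 12, 31), frozenset()),
    Chunk("policy-2024", "policy", date(2024, 1, 1), None, frozenset()),
    Chunk("audit-2024", "policy", date(2024, 1, 1), None, frozenset({"audit"})),
]
visible = scope(corpus, {"policy"}, clearances={"staff"}, today=date(2024, 6, 1))
print([c.chunk_id for c in visible])
```

Failing closed is the design choice worth copying: a chunk with any required clearance the user lacks is dropped before ranking, so no later stage can surface it by accident.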

Tier 4: Cross-encoder reranking (precision)

Here is where the last and hardest of the missing 20 points live. Tiers 1–2 return perhaps fifty candidates with decent recall and mediocre ordering. A cross-encoder reranker then reads the query and each candidate together, not as two independent vectors but as a joined pair, and scores how well that passage answers that question. It is far slower per candidate than a bi-encoder, which is exactly why it runs last, on a short list, rather than across the whole corpus. Keep the top three to five reranked passages and discard the rest. This tier is the highest-leverage precision change in the pipeline; it is the difference between “topically relevant” and “actually answers the question.”
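The shape of the reranking step is simple enough to sketch without committing to a model. In the example below the pair scorer is injected, and toy_overlap_scorer is only a stand-in so the code runs anywhere; it is not a cross-encoder. In practice you would pass a function backed by a real model, for example one that wraps the predict method of the CrossEncoder class in sentence-transformers. All names here are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# Scores one (query, passage) pair jointly; higher means a better answer.
PairScorer = Callable[[str, str], float]


def rerank(query: str, candidates: Sequence[Tuple[str, str]],
           score_pair: PairScorer, keep: int = 5) -> List[Tuple[str, float]]:
    """Score every (doc_id, text) candidate against the query and keep the best."""
    scored = [(doc_id, score_pair(query, text)) for doc_id, text in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:keep]


def toy_overlap_scorer(query: str, passage: str) -> float:
    """Stand-in scorer: fraction of query tokens found in the passage.
    Replace with a real cross-encoder in any serious system."""
    query_tokens = set(query.lower().split())
    passage_tokens = set(passage.lower().split())
    return len(query_tokens & passage_tokens) / len(query_tokens) if query_tokens else 0.0


candidates = [  # fictional passages for illustration
    ("c1", "reports are prepared on a yearly schedule"),
    ("c2", "the filing deadline is the first business day of march"),
    ("c3", "a filing must be signed by two officers"),
]
print(rerank("filing deadline", candidates, toy_overlap_scorer, keep=2))
```

Because the scorer sees the query and the passage together, the cost grows with the number of candidates, which is why the list handed to this function should already be short.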

Fusing the tiers without re-introducing noise

Tiers 1 and 2 produce two ranked lists. Naively concatenating them re-introduces the noise you are trying to remove. The robust, embarrassingly simple answer is Reciprocal Rank Fusion (RRF): score each document by summing 1 / (k + rank) over every list it appears in, where rank is its position in that list and k is a smoothing constant (60 is the conventional default). A passage that ranks well in both lists rises to the top, a passage that ranks well in only one still surfaces, and a passage that ranks poorly in both sinks. RRF works on ranks alone, so it needs no score normalization between fundamentally different scoring scales, which is precisely why it is hard to get wrong.

Reciprocal Rank Fusion: the whole idea in a few lines
from collections import defaultdict

def rrf(dense, sparse, k=60):
    """Fuse two ranked lists of doc ids; k dampens the weight of top ranks."""
    score = defaultdict(float)
    for ranked in (dense, sparse):
        for rank, doc_id in enumerate(ranked, start=1):  # ranks are 1-based
            score[doc_id] += 1.0 / (k + rank)
    return sorted(score, key=score.get, reverse=True)

# feed the fused top-50 into the Tier-4 reranker

The fused list goes to the reranker; the reranker’s top few go to the model. That ordering — wide recall, scoped, then precisely reranked — is the whole architecture in one sentence.
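Put together, the ordering can be sketched as a single function. Everything here is an assumption made for illustration: the retrievers, the authorization check, and the reranker are injected callables, the two recall tiers run in parallel on a thread pool, and scoping is applied as a filter after fusion for simplicity, whereas a real vector store would usually push that filter into the index query itself.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List

Retriever = Callable[[str], List[str]]             # query -> ranked doc ids
Reranker = Callable[[str, List[str]], List[str]]   # query, ids -> reordered ids


def fuse(rankings: Iterable[List[str]], k: int = 60) -> List[str]:
    """Reciprocal Rank Fusion over any number of ranked lists."""
    score = defaultdict(float)
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            score[doc_id] += 1.0 / (k + rank)
    return sorted(score, key=score.get, reverse=True)


def retrieve(query: str, dense: Retriever, sparse: Retriever,
             is_allowed: Callable[[str], bool], rerank: Reranker,
             fuse_top: int = 50, keep: int = 5) -> List[str]:
    # Tiers 1 and 2: wide recall, run concurrently.
    with ThreadPoolExecutor(max_workers=2) as pool:
        dense_future = pool.submit(dense, query)
        sparse_future = pool.submit(sparse, query)
        rankings = [dense_future.result(), sparse_future.result()]
    # Tier 3: drop anything out of scope before it can be ranked further.
    scoped = [doc_id for doc_id in fuse(rankings) if is_allowed(doc_id)][:fuse_top]
    # Tier 4: slow, precise ordering on the short list only.
    return rerank(query, scoped)[:keep]


result = retrieve(
    "example question",
    dense=lambda q: ["a", "b", "c"],       # toy retrievers returning fixed lists
    sparse=lambda q: ["c", "a", "d"],
    is_allowed=lambda doc_id: doc_id != "a",
    rerank=lambda q, ids: list(ids),       # identity reranker for the demo
    keep=2,
)
print(result)
```

Keeping each tier behind a plain callable makes it straightforward to swap one out, or to leave it out entirely, when measuring what each stage contributes.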

The numbers: what each tier actually buys you

Exact figures vary by corpus, but the shape of the improvement is remarkably consistent across the production systems we have measured. The pattern below is representative, not a guarantee — your mileage depends on your documents.

Accuracy & latency by configuration

Configuration                      Answer accuracy                Added latency
Tier 1 only (naive vector RAG)     ~68–72%                        baseline
+ Tier 2 (hybrid dense + sparse)   ~78–82%                        negligible (parallel)
+ Tier 3 (metadata scoping)        ~82–85% + zero scope leaks     negligible (pre-filter)
+ Tier 4 (cross-encoder rerank)    ~90–93%                        +80–200 ms on a short list

Read the table as a story: hybrid retrieval recovers the lexical misses, scoping makes it safe and removes a class of wrong-document errors, and the reranker converts recall into precision. The reranker is the only tier that adds meaningful latency, and it adds it where it is affordable — on fifty candidates, not fifty thousand. Sub-second end-to-end remains comfortably achievable.

Production realities nobody demos

A pipeline that benchmarks well on a static eval set can still degrade in production. Three realities decide whether the architecture survives contact with real traffic.

  1. A real evaluation harness. A handful of demo questions is not an eval set. You need a labelled set of representative queries with known-correct passages, scored on retrieval recall and answer faithfulness, and re-run on every change. Without it you are tuning blind, and every “improvement” is a guess.
  2. Retrieval drift. Corpora change: new documents arrive, old ones are superseded, embedding models are upgraded. Accuracy that was 92% at launch decays silently as the index ages. Monitoring retrieval quality over time, not just at deployment, is the unglamorous work that keeps the number up. This is core RAG pipeline development, not an afterthought.
  3. Latency and cost budgets. Each tier has a price. The reranker costs milliseconds and compute; sparse and dense indexes cost storage and memory. Decide the budget up front and engineer to it, rather than discovering at scale that the architecture is correct but unaffordable.
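The first two items above can share one small piece of code: a retrieval metric computed over a labelled query set, re-run on every change and on a schedule, and compared against a stored baseline. The sketch below uses hit rate at k (did any known-correct passage make the top k) as a simple stand-in for recall; the function names, the tolerance, and the toy data are assumptions, and answer faithfulness would need a separate check.

```python
from typing import Callable, Dict, List, Set


def hit_rate_at_k(retrieve: Callable[[str], List[str]],
                  labelled: Dict[str, Set[str]], k: int = 5) -> float:
    """Fraction of labelled queries whose top-k results contain at least
    one known-correct passage id."""
    if not labelled:
        return 0.0
    hits = sum(
        1 for query, relevant in labelled.items()
        if relevant & set(retrieve(query)[:k])
    )
    return hits / len(labelled)


def has_regressed(current: float, baseline: float, tolerance: float = 0.02) -> bool:
    """Flag a drop larger than the tolerance, e.g. after a re-index
    or an embedding model upgrade."""
    return current < baseline - tolerance


# Toy labelled set and a fake retriever, for illustration only.
labelled = {"q1": {"p1"}, "q2": {"p9"}}
fake_results = {"q1": ["p3", "p1"], "q2": ["p2", "p4"]}
score = hit_rate_at_k(lambda q: fake_results[q], labelled, k=2)
print(score, has_regressed(score, baseline=0.9))
```

Running the same check nightly against the live index turns silent drift into a visible alert, which is far cheaper than discovering the decay through user complaints.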

When you should NOT build all four tiers

The honest part. Four tiers is the right answer for a large, heterogeneous, access-controlled corpus where wrong answers are expensive. It is over-engineering for a small, clean, public knowledge base where a well-chunked dense index already clears your accuracy bar. Build the tier your failure mode demands, not the most impressive diagram.

  • Small, uniform corpus, no lexical identifiers, no access control → Tier 1, done well, may be enough.
  • Lots of codes, IDs, or proper nouns → add Tier 2. This is usually the highest-ROI single addition.
  • Multi-tenant, regulated, or version-sensitive → Tier 3 is mandatory, regardless of accuracy.
  • Accuracy still short and answers are “topically right, factually wrong” → add Tier 4. Almost always the missing piece at the 70% plateau.

If you are weighing graph RAG development or agentic RAG development on top of this, treat them as additions to a working four-tier base — not substitutes for it. A knowledge graph does not fix a retrieval pipeline that never had a reranker.
