You gave your AI access to all your docs. It still makes things up. Here's why naive RAG fails and how to fix it.
Why Naive RAG Fails
- Bad chunking—splitting on raw character count severs sentences and scatters related facts across chunks.
- Wrong embedding model—trained on the wrong domain; retrieval quality suffers.
- Low retrieval precision—wrong docs in the top-k.
- No re-ranking—the best chunk isn't first; the LLM focuses on the wrong one.
- Poor prompt engineering—the model ignores or underweights the retrieved context.
- No "I don't know"—when retrieval score is low, the model answers anyway and hallucinates.
Real-world scenario: A support RAG was chunking by 512 characters with no overlap. Questions about "return policy for international orders" often pulled a chunk that had "return" and "policy" but not "international," so the model filled in from its training and gave a wrong answer. After switching to section-based chunking (one chunk per FAQ or policy section) and adding hybrid search (keyword + semantic), retrieval precision on held-out questions improved from 54% to 87%, and end-to-end accuracy followed.
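Section-based chunking like the fix above can be sketched in a few lines. This is a minimal illustration, not the client's actual code: the `## ` heading pattern and the `chunk_by_section` name are assumptions—adapt the split pattern to however your docs mark sections.

```python
import re

def chunk_by_section(doc: str) -> list[str]:
    """Split a document into one chunk per section, so a heading
    and its body stay together. Assumes markdown-style '## '
    headings; adjust the pattern for your docs' structure."""
    parts = re.split(r"(?m)^(?=## )", doc)
    return [p.strip() for p in parts if p.strip()]

faq = (
    "## Return policy\n"
    "Domestic returns are accepted within 30 days.\n\n"
    "## Return policy for international orders\n"
    "International orders can be returned within 60 days.\n"
)
chunks = chunk_by_section(faq)
# "international" now stays in the same chunk as "return policy".
```

Because each chunk keeps its heading, a query about "international orders" retrieves the chunk that actually contains the word "international" instead of a truncated fragment.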
"The biggest win was confidence filtering: when the retrieval score was below 0.7, the system said 'I don't have a confident answer' and escalated instead of hallucinating."
The Anatomy of Good RAG
- Chunking: Semantic chunking with overlap so meaning isn't split—e.g. by section or paragraph, not by 500 characters.
- Embeddings: Domain-specific or fine-tuned for your content, so "force majeure" and "act of God" land near each other in legal docs.
- Vector DB: pgvector, Pinecone, or Qdrant with proper indexing and filters (e.g. by doc type or date).
- Hybrid search: Semantic + BM25 keyword search so you don't miss exact terms like product codes or IDs.
- Re-ranking: A cross-encoder over the top-k to put the best chunk first; the first chunk often gets the most weight from the LLM.
- Prompt: A clear template that says "answer only from the context; if it's not there, say so."
- Confidence: If the retrieval score is below threshold, don't answer—escalate or say "I don't know."
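The hybrid-search idea—blending a semantic score with a keyword score—can be sketched as below. Everything here is illustrative: `keyword_score` is a crude term-overlap stand-in for BM25 (use a real implementation such as rank_bm25 or your search engine in production), and the 0.5 blend weight is an assumption you'd tune.

```python
from collections import Counter

def keyword_score(query: str, doc: str) -> float:
    """Fraction of query terms that appear in the doc—a crude
    stand-in for BM25, but enough to catch exact terms like SKUs."""
    q_terms = Counter(query.lower().split())
    d_terms = Counter(doc.lower().split())
    hits = sum(1 for t in q_terms if d_terms[t] > 0)
    return hits / max(len(q_terms), 1)

def hybrid_score(semantic: float, keyword: float, alpha: float = 0.5) -> float:
    """Blend a semantic similarity and a keyword score, both in [0, 1].
    alpha weights the semantic side; tune it on held-out queries."""
    return alpha * semantic + (1 - alpha) * keyword
```

Pure semantic search can rank "SKU-12345" below paraphrases of it; the keyword term guarantees exact matches still score well.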
Pipeline order matters: chunk → embed → index → at query time: embed query → retrieve top-k → re-rank → build prompt with top chunks → generate. Skipping re-ranking or confidence checks is where many teams lose accuracy.
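The query-time flow above can be sketched as one function. All the component names here—`index`, `embed`, `rerank`, `llm`—are placeholders for your own stack (your vector DB client, embedding model, cross-encoder, and LLM call), not a specific library's API.

```python
def answer_query(query, index, embed, rerank, llm, top_k=5):
    """Query-time flow: embed query -> retrieve top-k -> re-rank ->
    build prompt -> generate. Components are injected as callables."""
    q_vec = embed(query)
    candidates = index.search(q_vec, top_k)  # list of (chunk, score)
    # Re-rank so the best chunk leads the context.
    ranked = sorted(candidates, key=lambda c: rerank(query, c[0]), reverse=True)
    context = "\n\n".join(chunk for chunk, _ in ranked)
    prompt = (
        "Answer only from the context below. If the answer is not "
        f"in the context, say so.\n\nContext:\n{context}\n\nQuestion: {query}"
    )
    return llm(prompt)
```

Keeping the steps in this order—and in one place—makes it obvious if a stage like re-ranking has been quietly skipped.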
Real Example: Legal RAG 60% → 92%
Fixing chunking, switching to a legal-tuned embedding, adding hybrid search and re-ranking, and a strict "don't guess" prompt took accuracy from 60% to 92%. The biggest win was confidence filtering: when the retrieval score was below 0.7, the system said "I don't have a confident answer" and escalated instead of hallucinating. We'll cover code for semantic chunking, hybrid search, and confidence-based answering in the full article.
Concretely, the client had been using a general-purpose embedding and fixed 400-token chunks. We switched to a legal-domain embedding, chunked by "section" (clause or numbered paragraph), and added a cross-encoder re-ranker over the top 10 retrieved chunks. The prompt was updated to: "Answer only using the following context. If the context does not contain enough information, respond with 'I don't have a confident answer' and do not guess." Evals on 200 held-out questions showed precision at the 0.7 threshold went from 61% to 91%; recall improved because we were no longer counting wrong answers as "answers."
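The re-ranking step can be sketched as a simple sort over (query, chunk) scores. `score_fn` here is a stand-in for a real cross-encoder (for instance, sentence-transformers' CrossEncoder scoring query–chunk pairs); the function and parameter names are illustrative.

```python
def rerank_top_k(query: str, chunks: list[str], score_fn, keep: int = 3) -> list[str]:
    """Re-order retrieved chunks by a pairwise (query, chunk) score
    and keep only the strongest few for the prompt. `score_fn` stands
    in for a cross-encoder, which is slower but far more precise than
    the bi-encoder used for retrieval."""
    ranked = sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)
    return ranked[:keep]
```

The retriever casts a wide net (top 10); the re-ranker then decides which handful actually gets the LLM's attention.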
Pipeline and Confidence
Your retrieval step should return a score (e.g. cosine similarity or a cross-encoder score). Define a threshold below which you don't call the LLM at all—instead, respond with "I don't have a confident answer" or route the question to a human. Example logic:
CONFIDENCE_THRESHOLD = 0.7  # tune on a held-out set

def answer(query):
    chunks, scores = retrieve(query, top_k=5)
    if max(scores) < CONFIDENCE_THRESHOLD:
        # Weak retrieval: don't call the LLM—escalate instead.
        return "I don't have a confident answer. Please contact support."
    context = format_chunks(chunks)
    return llm.generate(system_prompt + context + query)
Tune the threshold on a held-out set: too high and you'll say "I don't know" too often; too low and you'll hallucinate. We've seen 0.65–0.75 work well for many document Q&A systems.
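One way to run that tuning sweep, assuming you've logged (max retrieval score, answer was correct) pairs for a held-out set. The "must still answer at least half the questions" constraint is an illustrative policy choice, not a rule—swap in whatever refusal rate your product can tolerate.

```python
def tune_threshold(examples, thresholds):
    """Sweep candidate thresholds over held-out data.
    `examples` is a list of (max_retrieval_score, answer_was_correct)
    pairs. Among thresholds that still answer at least half the
    questions, pick the one with the best accuracy on what's answered."""
    best = None
    for t in thresholds:
        answered = [ok for score, ok in examples if score >= t]
        if len(answered) < len(examples) / 2:
            continue  # refuses too often at this threshold—skip it
        accuracy = sum(answered) / len(answered)
        if best is None or accuracy > best[1]:
            best = (t, accuracy)
    return best  # (threshold, accuracy), or None if nothing qualified
```

Raising the threshold trades answer coverage for accuracy; logging both numbers per candidate makes that trade-off explicit instead of vibes-based.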
When RAG Isn't Enough
For highly specialized, consistent tasks, fine-tuning may beat RAG. Always measure: precision, recall, and F1 on a held-out set, plus real user feedback. Run evals on "known good" questions and on adversarial cases (questions that shouldn't be answerable from your docs). If the model answers those anyway, tighten your confidence threshold and your prompt. Many systems use both: RAG for up-to-date knowledge, and a fine-tuned or prompted model for tone and structure.
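A minimal sketch of those eval metrics, assuming each held-out question is logged as a (answered, correct, answerable) record—adversarial questions are simply records with answerable set to False, so answering them hurts precision.

```python
def rag_eval(records):
    """Compute precision/recall/F1 over eval records.
    Each record is (answered, correct, answerable):
      precision = correct answers / questions answered
      recall    = correct answers / questions answerable from the docs
    Answering an unanswerable (adversarial) question lowers precision."""
    answered = sum(1 for a, c, g in records if a)
    correct = sum(1 for a, c, g in records if a and c)
    answerable = sum(1 for a, c, g in records if g)
    precision = correct / answered if answered else 0.0
    recall = correct / answerable if answerable else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Run this on both the "known good" set and the adversarial set; a system that looks great on the first but answers half of the second needs a tighter threshold, not a better model.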
What to Do Next
Audit your pipeline: chunking strategy, embedding model, whether you have hybrid search and re-ranking, and whether you gate on retrieval confidence. Schedule a RAG system audit and we can pinpoint where your pipeline is leaking accuracy. For production RAG and AI systems, our AI Agent Development practice includes RAG design, evals, and confidence-based answering so your system stays reliable at scale.
