A'sTechware — AI & Platform Engineering

Custom Software & AI for Operations

Lost in the Middle: Why Large Context Windows Don't Solve Everything

A'sTechware AI & Platform Engineering
Feb 2025 · 10 min read

You have a 128K context window. Your AI still misses key info buried in the middle. Here's why and what to do.

The "Lost in the Middle" Problem

Research shows LLMs perform worse on information in the middle of the context: attention favors the start and end, and the effect worsens as context length grows. In experiments, accuracy can be around 95% for facts near the start or end of the context and drop to around 60% for facts in the middle. Dumping a 100K-token document into the prompt and asking about a fact buried in the middle is a recipe for missed answers.
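One way to see this effect in your own stack is a "needle at depth" probe: place the same key fact at different positions in otherwise-uniform filler, ask the same question each time, and compare accuracy by position. The sketch below builds the probe contexts; the filler text and needle are illustrative placeholders, and sending each probe to a model is left to your API of choice.

```python
# Minimal sketch of a "needle at depth" harness for probing position
# sensitivity. Filler and needle strings are hypothetical placeholders.

def build_context(filler_chunks, needle, depth):
    """Insert `needle` at fractional `depth` (0.0 = start, 1.0 = end)
    of a list of filler chunks and return the assembled context."""
    pos = round(depth * len(filler_chunks))
    chunks = filler_chunks[:pos] + [needle] + filler_chunks[pos:]
    return "\n\n".join(chunks)

filler = [f"Background paragraph {i}: routine operational detail."
          for i in range(50)]
needle = "KEY FACT: the contract renewal deadline is 2025-03-31."

# Build three probes with the needle at the start, middle, and end.
# Each would be sent to the model with the same question; comparing
# accuracy across depths reveals the U-shaped performance curve.
probes = {d: build_context(filler, needle, d) for d in (0.0, 0.5, 1.0)}
```

Sweeping depth in finer steps (0.0, 0.1, ..., 1.0) gives a fuller picture of where your particular model starts losing the needle.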

Real-world scenario: A team built an "ask anything about this 80-page contract" feature by sending the full PDF text (about 90K tokens) to a long-context model. For questions about clauses in the first or last 20 pages, answers were usually correct. For clauses in the middle 40 pages, the model often missed the clause or conflated it with a similar one. After switching to RAG—chunk by section, retrieve relevant sections, and send only those (3–5 chunks, ~4K tokens)—accuracy on the same question set went from 62% to 89%. The lesson: long context is not a substitute for retrieval when the critical information is a small fraction of the total.
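The retrieval step in that scenario can be sketched in a few lines. This is a deliberately minimal stand-in: it chunks on blank lines and scores by keyword overlap, where a production system would chunk on document structure and score with embeddings. The contract text and question are illustrative.

```python
import re

# Minimal retrieval sketch: chunk by section, score by keyword overlap,
# send only the top chunks. A real system would use embedding similarity.

def chunk_by_section(text):
    """Split on blank lines; each section becomes one chunk."""
    return [c.strip() for c in text.split("\n\n") if c.strip()]

def words(s):
    """Lowercase alphanumeric tokens, punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))

def retrieve(chunks, question, k=3):
    """Rank chunks by word overlap with the question; return top-k."""
    q = words(question)
    ranked = sorted(chunks, key=lambda c: len(q & words(c)), reverse=True)
    return ranked[:k]

doc = (
    "Section 1. Payment terms: net 30 from invoice date.\n\n"
    "Section 2. Termination: either party may terminate "
    "with 60 days notice.\n\n"
    "Section 3. Liability is capped at fees paid in the prior 12 months."
)
top = retrieve(chunk_by_section(doc),
               "How many days notice for termination?", k=1)
```

The point is the shape of the pipeline: only the retrieved chunk reaches the model, so the answer-bearing text is never buried in a 90K-token middle.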

"Put critical info at start and end; use structure so the model can locate sections; add explicit instructions to 'search the entire context.'"

When Long Context Helps

Long context shines for code generation (full file in context), single-document analysis, and in-order meeting transcripts. When you need the whole thing in one call, long context is the right tool: editing or summarizing one long document, answering questions about a single codebase file, or following a conversation thread from start to end. The key is that the relevant information isn't scattered—it's one coherent input where order and full visibility matter.

When RAG Is Better

RAG wins for large knowledge bases (too big for any context window), precise retrieval of specific facts, cost-sensitive workloads (context tokens are expensive), and latency-sensitive paths (smaller context means faster responses). RAG retrieves only what's relevant and keeps the context short. You avoid the lost-in-the-middle effect because you send only a few retrieved chunks, and you can re-rank so the best chunk comes first (and is optionally repeated at the end for emphasis).
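The re-ranking trick at the end of that paragraph is simple to implement: order chunks best-first, then append the top chunk again so it also sits in the high-attention zone at the end. A minimal sketch, with illustrative retriever scores rather than real model output:

```python
# Order retrieved chunks best-first; optionally repeat the top chunk
# at the end so it occupies both high-attention zones of the context.

def order_for_context(scored_chunks, repeat_best_at_end=True):
    """scored_chunks: list of (score, text) pairs from a retriever.
    Returns chunk texts best-first, with the top chunk optionally
    appended again at the end."""
    ranked = [text for _, text in
              sorted(scored_chunks, key=lambda p: p[0], reverse=True)]
    if repeat_best_at_end and len(ranked) > 1:
        ranked.append(ranked[0])
    return ranked

chunks = [(0.42, "clause B"), (0.91, "clause A"), (0.17, "clause C")]
ctx = order_for_context(chunks)
# ctx → ["clause A", "clause B", "clause C", "clause A"]
```

Repeating the best chunk costs a few hundred extra tokens; whether that trade is worth it depends on your accuracy and cost targets, so measure on your own eval set.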

The Hybrid Approach

Use RAG to fetch the relevant chunks, then feed those chunks into a long(er) context, re-ranked so the most relevant comes first (and optionally last). Best practices: put critical information at the start and end, use structure (e.g. XML tags) so the model can locate sections, and add an explicit instruction to "search the entire context" so the model doesn't default to the first paragraph. If you must use long context, avoid putting the answer-critical passage in the middle—retrieve it and place it near the start, and consider repeating the key snippet at the end.
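Those best practices can be combined into one prompt builder: the question stated at the start and repeated at the end, each retrieved chunk wrapped in XML-style tags, and an explicit search instruction up front. A sketch of the layout—tag names and wording are illustrative choices, not a prescribed format:

```python
# Hybrid prompt layout: question at both high-attention positions,
# retrieved chunks wrapped in tags, explicit "search all" instruction.
# Tag names and instruction wording are illustrative.

def build_prompt(question, chunks):
    parts = [f"Question: {question}", ""]
    parts.append("Search ALL documents below before answering; "
                 "the answer may be in any of them.")
    for i, chunk in enumerate(chunks, 1):
        parts.append(f'<doc id="{i}">\n{chunk}\n</doc>')
    parts.append(f"Reminder of the question: {question}")
    return "\n".join(parts)

prompt = build_prompt(
    "What is the notice period?",
    ["Termination: 60 days notice.", "Payment: net 30."],
)
```

Numbered `doc` tags also make it easy to ask the model to cite which document it answered from, which is a cheap way to audit whether retrieval surfaced the right chunk.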

What to Do Next

Audit your context usage: are you sending huge docs and asking about facts in the middle? If so, consider RAG or a hybrid. For systems that need to reason over large corpora, our AI Agent Development practice includes RAG design and context strategy. Schedule a context strategy consultation and we'll recommend the right mix of long context and retrieval for your use case.

A'sTechware

A'sTechware designs and builds production-grade AI automations and custom platforms so businesses can run faster without adding headcount. We focus on systems that survive production: governance, human-in-the-loop, and complete audit trails.
