Why Your RAG System Keeps Hallucinating: The Hidden Cost of Skipping Fundamentals
The chunk size forum thread is 47 replies long. Everyone's arguing about 512 vs 1024 tokens, hybrid search vs pure vector, whether reranking actually helps. You're sitting there with a LangChain pipeline that's "production-ready" according to the tutorial, and your retrieval accuracy is 34%. You don't know why. You picked the chunk size because a blog post said so. You don't actually know what semantic chunking does to your embeddings. That's not a tooling problem. That's a fundamentals problem.
I spent three weeks rebuilding a RAG pipeline from scratch last quarter, not because I needed another RAG pipeline, but because I inherited one that was quietly failing in production. The previous team had followed every "best practice" in the documentation. The vector DB was properly indexed. The LLM calls were cached. The retrieval was... a black box. And when the business users started complaining about hallucinations on specific query types, nobody could explain why.
This is the RAG tax nobody warns you about: Layered Abstraction Debt. When you stack libraries without understanding the layers beneath them, you're not just borrowing convenience; you're borrowing blindness. The system works until it doesn't, and then you can't debug it because you never built the mental model.
What the Research Actually Shows
The Qiita post by atsushi11o7 that sparked this investigation takes a refreshingly hands-on approach: implement everything from scratch. The author walks through hybrid search (combining dense vector embeddings with sparse BM25 keyword matching), then extends into Agentic RAG patterns where the retrieval system can decide what to query. This is the kind of deep-dive content that Japanese dev communities excel at: concrete implementation over hand-waving.
The key insight from the implementation: hybrid search isn't just "add vector + add keyword search." It's a weighted combination problem with specific failure modes at each extreme. Pure semantic search can miss exact matches (a query for "error code 403" may return nothing relevant). Pure keyword search misses conceptual similarity ("authentication failure" doesn't match "unauthorized access"). Hybrid search means you have to understand both well enough to weight them correctly, and that weight changes based on your domain.
The Three Layers You're Abstracting Away
Here's where it gets uncomfortable. Every RAG library abstracts three layers that directly control your output quality:
Chunking Strategy - Your chunk size isn't just a performance parameter; it's a semantic decision. At 512 tokens per chunk, each embedding summarizes a few paragraphs' worth of text, and your retrieval returns passages of that size. But if your questions require synthesizing across several passages, you're retrieving the wrong unit. The Qiita implementation uses semantic chunking with overlap, which adds another dimension: how much context bleeds between chunks? Too much overlap = redundant retrieval. Too little = broken cross-reference chains.
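To make the overlap trade-off concrete, here is a minimal sketch of fixed-size chunking with overlap. It is not the semantic chunking used in the Qiita implementation; it is the simpler sliding-window baseline, and the function name `chunk_with_overlap` and its default values are assumptions chosen for illustration.

```python
def chunk_with_overlap(tokens, chunk_size=512, overlap=64):
    """Split a token list into fixed-size windows that share `overlap` tokens.

    Hypothetical helper: a plain sliding window, not semantic chunking.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Whitespace splitting stands in for a real tokenizer here.
words = "one two three four five six seven eight nine ten".split()
for chunk in chunk_with_overlap(words, chunk_size=4, overlap=1):
    print(chunk)
```

Raising `overlap` repeats more text across neighbouring chunks (more redundant retrieval); setting it to 0 cuts every boundary cleanly, which is where cross-references tend to break.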
Embedding Model Selection - Dense embeddings (from OpenAI, Cohere, or open-source models) capture semantic meaning but can blur surface-level distinctions. "Error: connection refused" and "Error: connection timeout" might land almost on top of each other in embedding space while requiring completely different debugging steps. BM25 captures keyword overlap but misses meaning entirely. The real question: what does "relevance" mean for your specific domain? Legal contracts? Code documentation? Customer support tickets? Each has a different answer, and you can't answer it without understanding what your embeddings actually capture.
Precision vs. Recall Trade-off - Top-k vector similarity search leans toward recall: finding everything potentially relevant. But production RAG needs precision: returning only what answers the question. The gap between those two goals is where hybrid search lives, and it's where most "RAG is broken" complaints actually live.
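The two goals above can be measured separately. The sketch below assumes you have, for each test query, a ranked list of retrieved chunk IDs and a hand-labeled set of relevant IDs; the function names and the example IDs are illustrative, not taken from the Qiita post.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k slots filled with relevant results."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of all relevant results that appear in the top k."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # nothing to find, so report zero rather than divide by zero
    hits = len(set(retrieved_ids[:k]) & relevant)
    return hits / len(relevant)


retrieved = ["a", "b", "c", "d", "e"]  # ranked output of the retriever
relevant = {"a", "d", "z"}             # hand-labeled ground truth
print(precision_at_k(retrieved, relevant, 1))  # 1.0
print(precision_at_k(retrieved, relevant, 5))  # 0.4
print(recall_at_k(retrieved, relevant, 5))
```

Note that this version of precision divides by k even when fewer than k results come back; some evaluation setups divide by the number of results actually returned instead.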
    # Simplified hybrid scoring (from the Qiita implementation)
    def hybrid_score(dense_score, sparse_score, alpha=0.5):
        """Blend one document's dense and sparse scores into a single score.

        Assumes both scores are already normalized to a comparable range.
        """
        # alpha controls dense vs. sparse weighting:
        #   alpha=1.0: pure semantic search
        #   alpha=0.0: pure keyword search
        #   alpha=0.5: equal weight
        return alpha * dense_score + (1 - alpha) * sparse_score
The "correct" alpha isn't in the documentation. It's domain-dependent, and you'll only find it by experimenting with your actual queries, which requires understanding what the score actually measures.
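One way to run that experiment is a small alpha sweep over labeled queries. The sketch below is an assumption-laden illustration, not the Qiita author's code: it min-max normalizes each score list first (raw BM25 scores and cosine similarities live on different scales), and it measures top-1 accuracy against a single known-correct chunk per query. All function names and the sample scores are hypothetical.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} dict to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every listed document as a full match.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_rank(dense, sparse, alpha):
    """Return doc IDs ordered by the alpha-weighted blend of both signals."""
    dense_n, sparse_n = min_max_normalize(dense), min_max_normalize(sparse)
    fused = {
        doc_id: alpha * dense_n.get(doc_id, 0.0)
        + (1 - alpha) * sparse_n.get(doc_id, 0.0)
        for doc_id in set(dense_n) | set(sparse_n)
    }
    # Sort by score descending, then by ID so ties are deterministic.
    return sorted(fused, key=lambda doc_id: (-fused[doc_id], doc_id))


def sweep_alpha(labeled_queries, alphas=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """labeled_queries: list of (dense_scores, sparse_scores, relevant_id)."""
    if not labeled_queries:
        return {}
    results = {}
    for alpha in alphas:
        hits = 0
        for dense, sparse, relevant_id in labeled_queries:
            ranking = hybrid_rank(dense, sparse, alpha)
            if ranking and ranking[0] == relevant_id:
                hits += 1
        results[alpha] = hits / len(labeled_queries)  # top-1 accuracy
    return results


queries = [
    # Exact-match query: the keyword signal points at the right chunk.
    ({"a": 0.9, "b": 0.2}, {"a": 1.0, "b": 8.0}, "b"),
    # Conceptual query: only the dense signal separates the chunks.
    ({"c": 0.8, "d": 0.3}, {"c": 2.0, "d": 2.0}, "c"),
]
for alpha, accuracy in sweep_alpha(queries).items():
    print(alpha, accuracy)
```

Two queries prove nothing, of course; the point is the shape of the loop. With a few hundred real queries from your logs, the accuracy curve across alpha values is what tells you how much your domain leans on exact matches.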
Agentic RAG: When Retrieval Gets Complicated
The more advanced pattern in the Qiita post covers Agentic RAG, where the retrieval system itself can make decisions. Should I search for this? Should I refine the query? Should I retrieve from multiple sources? This is where abstraction becomes genuinely dangerous. Agentic RAG means your retrieval system has branching logic. If you don't understand how base retrieval works, you can't predict how the agent will fail. You'll spend weeks debugging "why is the agent ignoring my documents" when the actual issue is query classification thresholds.
The honest warning: if you can't implement a basic RAG pipeline from scratch (embedding generation, vector storage, similarity search, result reranking), you're not ready for Agentic RAG. The complexity compounds, and the failure modes multiply.
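As a rough illustration of that branching logic, here is a minimal retrieve-or-refine loop. It is a sketch under stated assumptions, not the pattern from the Qiita post: `retrieve` and `rewrite` are hypothetical callables you would supply (a retriever returning scored chunks, and a query rewriter, typically an LLM call), and `min_score` is exactly the kind of threshold that silently decides whether documents get used.

```python
from typing import Callable, List, Tuple

Hit = Tuple[str, float]  # (chunk_text, relevance score assumed to be in [0, 1])


def agentic_retrieve(
    query: str,
    retrieve: Callable[[str], List[Hit]],
    rewrite: Callable[[str, int], str],
    min_score: float = 0.6,
    max_attempts: int = 3,
):
    """Retry retrieval with rewritten queries until the top hit clears min_score.

    Returns (hits, trace). The trace records every query tried and its best
    score, which is what you need when asking why documents were ignored.
    """
    trace = []
    current = query
    for attempt in range(1, max_attempts + 1):
        hits = retrieve(current)
        top_score = max((score for _, score in hits), default=0.0)
        trace.append((current, top_score))
        if top_score >= min_score:
            return hits, trace  # confident enough: hand the chunks to the LLM
        current = rewrite(current, attempt)  # otherwise refine and try again
    return [], trace  # gave up: the caller should answer without context


# Toy stand-ins for a real retriever and a real query rewriter.
index = {
    "auth failure": [("doc about 401 responses", 0.4)],
    "unauthorized access 401": [("doc about 401 responses", 0.9)],
}
hits, trace = agentic_retrieve(
    "auth failure",
    retrieve=lambda q: index.get(q, []),
    rewrite=lambda q, attempt: "unauthorized access 401",
)
print(trace)
```

Returning the trace alongside the hits is a deliberate choice: when the loop gives up, the recorded scores show whether the retriever found nothing or the threshold rejected something usable.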
The Skeptical Take
Here's where I'll push back on the "build from scratch" evangelism: it's not always the right call. For stable, well-understood domains where you have clean data and clear query patterns, libraries are fine. The debt only compounds when the system fails in unexpected ways. The real skill is knowing when to reach for a library and when to build.
My rule of thumb: use libraries for infrastructure (vector DB operations, caching, API handling), build the retrieval logic yourself. That way you understand what's actually happening when results come back wrong, but you don't reinvent distributed systems management.
The Survival Checklist
Before your next RAG project:
- Benchmark your chunking strategy - Run the same queries against 3 different chunk sizes. Measure precision at top-1 and top-5. The "right" answer depends on your data, not a blog post.
- Test embedding model selection with your actual data - An older model like OpenAI's text-embedding-ada-002 might outperform a newer or larger embedding model for your domain. Run the comparison with real queries, not synthetic test cases.
- Log your retrieval failures for 2 weeks - Every time a user says "this doesn't answer my question," log the query, the retrieved chunks, and the expected answer. After 2 weeks, you'll have a pattern of what your hybrid search can't handle.
- Implement BM25 once, even if you use a library afterward - Understanding the keyword matching baseline makes you a better evaluator of vector search performance. You'll stop being impressed by "semantic similarity" and start asking "compared to what?"
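For the last item, a from-scratch BM25 fits in about thirty lines. The sketch below is one common formulation (the variant with a non-negative IDF term, as used by Lucene-style engines), with the usual default parameters k1=1.5 and b=0.75. The function name, the pre-tokenized input format, and the sample corpus are assumptions for illustration; repeated query terms are counted once to keep it simple.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document in corpus_tokens against query_tokens."""
    n_docs = len(corpus_tokens)
    if n_docs == 0:
        return []
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for doc in corpus_tokens:
        doc_freq.update(set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in set(query_tokens):
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue  # term absent from this document contributes nothing
            df = doc_freq[term]
            # Rare terms get a larger weight; the +1 keeps the IDF non-negative.
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Term frequency saturates via k1; b penalizes long documents.
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


corpus = [
    "error code 403 forbidden".split(),
    "authentication failure on login".split(),
    "connection refused error".split(),
]
print(bm25_scores("error 403".split(), corpus))
```

The second document scores exactly zero for this query: no shared tokens, no credit. That hard cutoff is the baseline behavior to keep in mind when judging what a dense retriever adds.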
The RAG pipeline you can't debug isn't production-ready. It's production-deferred. Libraries buy you time to market, but they don't buy you understanding, and understanding is what makes the difference between a system that works in demos and a system that works in the real world.
What's your take? What's the RAG retrieval failure that cost you the most debugging time? And was it a chunking issue, an embedding mismatch, or something else entirely? I'd love to hear what nobody warned you about before you built your first pipeline.
Based on implementation research from the Japanese developer community (Qiita). The "implement from scratch" approach reflects a deeper Japanese engineering tradition of understanding layers before abstracting them.