Why My RAG App Kept Hallucinating (and How I Fixed It)
A few months ago I was demoing my RAG-powered support bot to a colleague, feeling pretty confident about it. Then it confidently told her our refund policy was “30 days, no questions asked.” Our actual policy is 14 days, with conditions. The bot didn’t hedge. It didn’t say “I’m not sure.” It just made it up and said it with the same calm tone it uses for everything else.
That demo stung. RAG was supposed to fix hallucinations, not just relocate them. Here’s what I learned debugging it, roughly in the order I learned it.
1. My chunks were too big, and too dumb
I was splitting documents by character count, 1000 chars with slight overlap. It felt efficient. It wasn’t. A single chunk often contained unrelated sections. For example, the end of a “Shipping Policy” and the start of a “Returns Policy” could sit together in the same block. So when the retriever saw a query about returns, it would grab that chunk and the model would blend both sections into one confident but wrong answer.
Fix: I switched to semantic chunking based on headings and paragraphs instead of raw character limits. More work upfront, but it stopped feeding the model Frankenstein context.
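Here’s a minimal sketch of heading-aware chunking, not my exact pipeline. It assumes the source documents use Markdown-style “# Heading” lines; the function names and the 1000-character cap are illustrative choices.

```python
import re

# Assumption: headings look like "# Title" through "###### Title".
HEADING = re.compile(r"^#{1,6}\s+(.*)$")


def split_sections(text: str) -> list[tuple[str, str]]:
    """Return (heading, body) pairs; a section never spans two headings."""
    sections = []
    heading, lines = "(untitled)", []
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            # A new heading closes the previous section, if it had any content.
            if any(l.strip() for l in lines):
                sections.append((heading, "\n".join(lines).strip()))
            heading, lines = match.group(1).strip(), []
        else:
            lines.append(line)
    if any(l.strip() for l in lines):
        sections.append((heading, "\n".join(lines).strip()))
    return sections


def chunk_sections(text: str, max_chars: int = 1000) -> list[dict]:
    """Pack whole paragraphs into chunks, without crossing a heading boundary."""
    chunks = []
    for heading, body in split_sections(text):
        buf = ""
        for para in re.split(r"\n\s*\n", body):
            para = para.strip()
            if not para:
                continue
            # Start a new chunk rather than splitting a paragraph mid-way.
            # A single paragraph longer than max_chars stays as one big chunk.
            if buf and len(buf) + len(para) + 2 > max_chars:
                chunks.append({"heading": heading, "text": buf})
                buf = ""
            buf = f"{buf}\n\n{para}" if buf else para
        if buf:
            chunks.append({"heading": heading, "text": buf})
    return chunks
```

Keeping the heading on each chunk also means you can prepend it to the chunk text before embedding, so a “Returns Policy” paragraph carries its section name with it.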
2. I trusted top-k similarity way too much
My retriever was pulling the top 3 chunks by cosine similarity and passing them straight into the prompt. The problem: “similar” is not the same as “relevant.” A chunk can be semantically close to the query and still not contain the answer. The model doesn’t know that; it treats everything in context as true.
Fix: I added a reranking step using a cross-encoder and started logging retrieval scores properly. That alone made it obvious when the system had no real answer but was still trying to act confident.
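The shape of that step, as a rough sketch: the reranker takes the retrieved chunks, scores each (query, chunk) pair, re-sorts, and logs every score. The names here are hypothetical, and toy_overlap_scorer is only a stand-in so the example runs without a model. If you use the sentence-transformers library, the predict method of its CrossEncoder class accepts a list of (query, passage) pairs and could be passed in as score_pairs instead.

```python
import logging
from typing import Callable, Sequence

logger = logging.getLogger("retrieval")

# Hypothetical type: any function that scores a batch of (query, chunk) pairs.
PairScorer = Callable[[list[tuple[str, str]]], Sequence[float]]


def rerank(query: str, chunks: Sequence[str], score_pairs: PairScorer,
           top_n: int = 3) -> list[tuple[str, float]]:
    """Re-score retrieved chunks against the query and keep the best top_n."""
    pairs = [(query, chunk) for chunk in chunks]
    scores = [float(s) for s in score_pairs(pairs)]
    ranked = sorted(zip(chunks, scores), key=lambda cs: cs[1], reverse=True)
    for chunk, score in ranked:
        # Logging every score is what makes "no real answer" cases visible.
        logger.info("rerank score=%.3f chunk=%r", score, chunk[:60])
    return ranked[:top_n]


def toy_overlap_scorer(pairs: list[tuple[str, str]]) -> list[float]:
    """Placeholder scorer (NOT a cross-encoder): share of query words in the chunk."""
    scores = []
    for query, chunk in pairs:
        q_words = set(query.lower().split())
        c_words = set(chunk.lower().split())
        scores.append(len(q_words & c_words) / max(len(q_words), 1))
    return scores
```

The scorer is injected rather than hard-coded so the same function works with a real cross-encoder in production and a cheap fake in tests.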
3. I never told the model it was allowed to say “I don’t know”
My prompt was basically: “Use the context to answer the question.” That’s it. No instruction on what to do when the context is insufficient. So the model did what LLMs do when under-specified: it filled the gaps with something plausible.
Fix: I explicitly added: “If the answer is not clearly present in the context, say you don’t know.” Hallucinations dropped immediately. It was almost embarrassing how effective one sentence was.
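For illustration, this is roughly how the prompt looks with that instruction in place. The template layout, the numbered context markers, and build_prompt are my assumptions for this sketch; only the two instruction sentences come from the prompt described above.

```python
PROMPT_TEMPLATE = """Use the context to answer the question.
If the answer is not clearly present in the context, say you don't know.

Context:
{context}

Question: {question}
Answer:"""


def build_prompt(question: str, chunks: list[str]) -> str:
    """Fill the template with numbered context chunks and the user question."""
    # Numbering the chunks is optional; it makes it easier to see in logs
    # which chunk an answer leaned on.
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return PROMPT_TEMPLATE.format(context=context, question=question)
```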
4. No retrieval, no answer (I wasn’t enforcing it)
Even with better prompting, the model would still sometimes answer from general knowledge when retrieval quality was weak. I wasn’t actually gating anything. I was just hoping the prompt would enforce behavior.
Fix: I added a real threshold. If the top retrieval score is below a cutoff, the system doesn’t proceed normally. It returns a fallback instead of letting the model improvise. No relevant context → no forced answer.
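A minimal sketch of that gate, under a few assumptions: retrieval hands back (chunk, score) pairs, generate is whatever function calls the model, and the 0.5 cutoff is a placeholder. The right cutoff depends entirely on your scorer’s scale, so it needs tuning against real queries.

```python
from typing import Callable, Sequence

# Hypothetical fallback wording.
FALLBACK = "I couldn't find that in our docs, so I'd rather not guess."


def answer_or_fallback(
    question: str,
    scored_chunks: Sequence[tuple[str, float]],
    generate: Callable[[str, list[str]], str],
    min_score: float = 0.5,  # placeholder cutoff, tune for your scorer
) -> str:
    """Call the model only when the best retrieval score clears the cutoff."""
    top_score = max((score for _, score in scored_chunks), default=float("-inf"))
    if top_score < min_score:
        # No relevant context: skip the model entirely instead of
        # hoping the prompt keeps it from improvising.
        return FALLBACK
    return generate(question, [chunk for chunk, _ in scored_chunks])
```

The point of doing this in code rather than in the prompt is that the model never gets a chance to answer when retrieval came back empty-handed.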
5. I wasn’t testing the cases that actually break systems
All my testing was on “happy path” questions, things I already knew the documents covered well. I wasn’t testing:
- ambiguous queries
- missing information cases
- partially covered topics
- multi-part questions
And that’s exactly where hallucinations show up.
Fix: I built a small evaluation set of “trap questions”, cases where the correct answer is not in the indexed documents, and started running it regularly against changes. That exposed weaknesses immediately.
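A trap-question check can be very small. In this sketch, bot is any function that maps a question to an answer string, and the refusal markers are hypothetical phrases you’d swap for whatever your bot actually says when it declines. Substring matching is crude, but it is enough to catch a regression.

```python
from typing import Callable, Iterable

# Assumption: the bot declines using one of these phrases.
REFUSAL_MARKERS = ("i don't know", "i'm not sure", "couldn't find")


def is_refusal(answer: str) -> bool:
    """True if the answer contains one of the known refusal phrases."""
    # Normalize curly apostrophes so "don’t" and "don't" both match.
    text = answer.lower().replace("\u2019", "'")
    return any(marker in text for marker in REFUSAL_MARKERS)


def run_trap_eval(bot: Callable[[str], str],
                  trap_questions: Iterable[str]) -> list[tuple[str, str]]:
    """Return (question, answer) pairs where the bot answered instead of declining."""
    failures = []
    for question in trap_questions:
        answer = bot(question)
        if not is_refusal(answer):
            failures.append((question, answer))
    return failures
```

An empty list from run_trap_eval means every trap question was declined; anything else is a hallucination candidate worth reading by hand.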
Where it stands now
It’s not perfect. RAG doesn’t eliminate hallucinations; it makes them more controllable if you pay attention to how the system is built. How you chunk. How you retrieve. How you decide what not to answer.
The bot still doesn’t know our refund policy is 14 days. But now, when it’s unsure, it actually says so. And honestly, that’s the part that made it usable.