Building a Document Q&A Bot: Why Embeddings Are Trickier Than They Look
I spent a weekend building a Q&A bot for my team's internal docs. It sounded easy: dump PDFs into a vector database, query with embeddings, get answers. Three days later, I had a working prototype - and a healthy respect for all the hidden traps.
The Problem
Our team had 200+ pages of configuration guides scattered across Confluence, Google Docs, and a few dusty PDFs. Every week someone asked "How do we set up the OAuth flow again?" or "What's the default timeout?" I figured a semantic search bot could answer these instantly.
I started simple. Use OpenAI embeddings, store them in Pinecone, then use GPT-4 to generate answers from retrieved chunks. Classic RAG (Retrieval-Augmented Generation).
What I Tried That Didn't Work
First attempt: naive chunking. I split every document into 500-character chunks with 50-character overlap. Straight into Pinecone. The first query returned garbage - chunks that mentioned "OAuth" but were actually about something else, or chunks too short to contain the answer.
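A minimal sketch of that kind of fixed-size splitter, assuming plain strings as input. The function name `chunk_fixed` and the sample text are made up for illustration; this is not the exact code from the prototype.

```python
def chunk_fixed(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# Hypothetical sample document, repeated to get a few chunks' worth of text.
doc = "To configure OAuth, register the client first. " * 30
chunks = chunk_fixed(doc)
# The windows ignore sentence boundaries, so chunks tend to start and end mid-sentence.
print(len(chunks), repr(chunks[1][:40]))
```

The character windows know nothing about sentences or headings, which is one plausible reason a chunk can mention "OAuth" without containing the actual instructions.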
Second attempt: bigger chunks, 2000 characters each, with no overlap. Queries matched better, but answers from GPT were often incomplete because the relevant sentence was split across two chunks.
Third attempt: using only the first 3 chunks. I tried retrieving the top 3 chunks and concatenating them. Sometimes that worked, but often the best chunk was rank 4 or 5. And concatenating introduced noise that confused the model.
What Eventually Worked
I landed on a hybrid approach that balances precision and context length:
- Chunk by paragraphs instead of fixed character counts. Preserves natural boundaries.
- Embed with a dense retriever (text-embedding-ada-002) but also add a simple keyword index for exact matches.
- Retrieve 10 chunks, then rerank using a lightweight cross-encoder to pick the 3 most relevant.
- Feed those 3 chunks as context to the generative model with a strict instruction: answer only from the context, say "I don't know" if irrelevant.
Here's the core pipeline in Python - I'm using sentence-transformers for the cross-encoder and OpenAI for embeddings + generation, but the technique is service-agnostic:
# Note: openai.Embedding and openai.ChatCompletion belong to the pre-1.0 openai SDK.
import openai
from sentence_transformers import CrossEncoder
import numpy as np

# Step 1: chunk your documents into paragraphs
# (Assume we have a list of strings called paragraphs)

# Step 2: embed all paragraphs using OpenAI
response = openai.Embedding.create(
    input=paragraphs,
    model="text-embedding-ada-002"
)
embeddings = np.array([d["embedding"] for d in response["data"]])

# Step 3: search function with reranking
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def answer_query(query, top_k=10, rerank_top=3):
    # Embed the query
    q_emb = openai.Embedding.create(
        input=[query],
        model="text-embedding-ada-002"
    )["data"][0]["embedding"]

    # Cosine similarity between the query and every paragraph (brute force)
    scores = np.dot(embeddings, q_emb) / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(q_emb)
    )
    # Indices of the top_k highest-scoring paragraphs, best first
    top_indices = np.argsort(scores)[-top_k:][::-1]
    candidates = [paragraphs[i] for i in top_indices]

    # Rerank the candidates with the cross-encoder and keep the best rerank_top
    rerank_scores = reranker.predict([(query, c) for c in candidates])
    best_idx = np.argsort(rerank_scores)[-rerank_top:][::-1]
    context = "\n\n---\n\n".join([candidates[i] for i in best_idx])

    # Generate answer
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Answer based on context. Say 'I don't know' if not found."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}
        ]
    )
    return response["choices"][0]["message"]["content"]
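The pipeline assumes a `paragraphs` list already exists. One possible way to build it is sketched below, assuming the documents have been exported to plain text with blank lines between paragraphs. The helper name `split_paragraphs`, the `min_chars` threshold, and the sample text (including its timeout value) are my own illustrative choices, not part of the original setup.

```python
import re


def split_paragraphs(text: str, min_chars: int = 40) -> list[str]:
    """Split on blank lines; merge very short blocks (e.g. headings) into the block that follows."""
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]
    paragraphs: list[str] = []
    carry = ""
    for block in blocks:
        merged = f"{carry}\n{block}" if carry else block
        if len(merged) < min_chars:
            carry = merged  # too short to stand alone; attach it to the next block
        else:
            paragraphs.append(merged)
            carry = ""
    if carry:
        paragraphs.append(carry)  # a trailing short block is kept rather than dropped
    return paragraphs


# Made-up sample text for illustration only.
sample = (
    "OAuth setup\n\n"
    "Register the client in the admin console, then copy the client ID.\n\n"
    "Timeouts\n\n"
    "The default timeout is 30 seconds unless overridden in config."
)
paragraphs = split_paragraphs(sample)
print(len(paragraphs))
```

Merging headings into the paragraph that follows keeps a label like "Timeouts" attached to the text it describes, which may help both the embedding and the keyword match.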
Lessons Learned & Trade-offs
- Chunking strategy matters more than I expected. Paragraph-level chunks work best for narrative docs, but code snippets or tables need different handling. I ended up splitting tables into individual rows.
- In my setup, reranking added ~100ms of latency but cut the hallucination rate roughly in half. Worth it.
- Cost adds up. Embedding 200 pages cost ~$0.20, but every query uses both embedding + generation. For high traffic, either cache common queries or use a cheaper embedding model.
- The cross-encoder model is small - I run it locally, no API calls needed. That saved money but increased memory usage (~300MB).
- Exact keyword matching helped for queries like "default timeout" where a number is critical. Pure semantic search sometimes retrieved paragraphs about "time" instead of "timeout".
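To illustrate the keyword side of the hybrid, here is a minimal sketch of a whole-word inverted index plus a merge step that puts exact hits ahead of dense-only results. All names (`build_keyword_index`, `keyword_hits`, `merge_candidates`) and the sample paragraphs are hypothetical, and the merge rule is just one simple option.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase whole-word tokens, so 'timeout' and 'time' stay distinct."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def build_keyword_index(paragraphs: list[str]) -> dict[str, set[int]]:
    """Map each token to the set of paragraph ids that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for i, paragraph in enumerate(paragraphs):
        for token in tokenize(paragraph):
            index[token].add(i)
    return index


def keyword_hits(query: str, index: dict[str, set[int]]) -> set[int]:
    """Paragraph ids that contain every query token as a whole word."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    hits = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        hits &= index.get(token, set())
    return hits


def merge_candidates(dense_ranked: list[int], exact_hits: set[int], top_k: int = 10) -> list[int]:
    """Exact keyword hits first (in dense order where available), then the remaining dense results."""
    first = [i for i in dense_ranked if i in exact_hits]
    first += sorted(exact_hits - set(dense_ranked))
    rest = [i for i in dense_ranked if i not in exact_hits]
    return (first + rest)[:top_k]


# Made-up paragraphs; settings.yaml is a placeholder file name.
paragraphs = [
    "It takes some time for the cache to warm up.",
    "The default timeout is configured in settings.yaml.",
    "OAuth tokens expire after a fixed time window.",
]
index = build_keyword_index(paragraphs)
hits = keyword_hits("default timeout", index)
# Pretend the dense retriever ranked the 'time' paragraphs above the 'timeout' one.
print(merge_candidates([0, 2, 1], hits, top_k=3))
```

The merged list can then go to the cross-encoder as the candidate set, so the reranker still makes the final call.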
When NOT to Use This Approach
- If your documents are mostly code snippets, consider a code-optimized embedding model (e.g., code-search-ada-code-001) and structured chunking by function.
- If you need real-time answers (under 200ms), skip the generative model and just return the top chunk directly with source citations.
- If your dataset is smaller than 50 pages, plain keyword search (BM25) often works better - no embedding costs, no latency.
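For a sense of how little code a keyword baseline needs, here is a small pure-Python sketch of Okapi BM25 scoring. The class name, the `k1`/`b` defaults, the smoothed IDF variant, and the sample documents are assumptions on my part; a maintained library would be a more sensible choice in practice.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9_]+", text.lower())


class BM25:
    def __init__(self, docs: list[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.tfs = [Counter(tokenize(d)) for d in docs]      # term frequencies per document
        self.lengths = [sum(tf.values()) for tf in self.tfs]  # document lengths in tokens
        self.avg_len = sum(self.lengths) / len(docs) if docs else 0.0
        doc_freq = Counter(term for tf in self.tfs for term in tf)
        n = len(docs)
        # Smoothed IDF that stays non-negative even for very common terms
        self.idf = {t: math.log((n - c + 0.5) / (c + 0.5) + 1) for t, c in doc_freq.items()}

    def score(self, query: str, i: int) -> float:
        tf, length = self.tfs[i], self.lengths[i]
        if self.avg_len:
            norm = self.k1 * (1 - self.b + self.b * length / self.avg_len)
        else:
            norm = self.k1
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f:
                total += self.idf[term] * f * (self.k1 + 1) / (f + norm)
        return total

    def top_k(self, query: str, k: int = 3) -> list[int]:
        """Ids of the k best-scoring documents, skipping documents with no matching term."""
        scored = [(self.score(query, i), i) for i in range(len(self.tfs))]
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
        return [i for _, i in ranked[:k]]


# Made-up documents for illustration.
docs = [
    "It takes some time for the cache to warm up.",
    "The default timeout is configured in settings.yaml.",
    "OAuth tokens expire after a fixed time window.",
]
bm25 = BM25(docs)
print(bm25.top_k("default timeout"))
```

Nothing here calls an API, so indexing and querying cost only local CPU time.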
What I'd Do Differently Next Time
I'd start with a simpler baseline first - just BM25 with a few regex rules - and only add embeddings if recall is insufficient. I'd also write more unit tests for edge cases: empty queries, multi-step questions, documents with conflicting information.
Also, I should have investigated services that handle this out of the box. For instance, InterWest Info AI offers a document Q&A API that hides most of this complexity. If I were building for production today, I'd evaluate whether their managed solution reduces maintenance overhead. But for learning, building it myself was invaluable.
Final Thoughts
RAG is powerful but fragile. Every piece - chunking, retrieval, reranking, generation - can fail silently. You'll spend 20% of time on the model and 80% on data preprocessing and evaluation. That's normal. Don't give up after the first garbage output.
I'm still tuning my system. What's your setup for document Q&A? Any clever chunking tricks I should try?