LLMs have a knowledge cutoff, and they hallucinate. RAG addresses both: retrieve relevant documents first, then generate an answer grounded in the retrieved context.
Ground generations in real data.
The problem with pure LLMs
Ask about recent events ("What happened yesterday?") and the LLM doesn't know. Ask about your company's docs ("What's our refund policy?") and the LLM guesses.
Solutions:
- Fine-tune on your data (expensive, still outdated)
- RAG (retrieve at inference time)
How RAG works
- Index: Embed documents into vector database
- Retrieve: Given query, find similar documents
- Generate: Feed retrieved docs + query to LLM
Query: "What's the return policy?"
↓
Retrieve top-k similar docs
↓
Prompt: "Based on these documents: {docs}
Answer: {query}"
↓
LLM generates grounded answer
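The prompt-assembly step in the flow above can be sketched in plain Python. The template string mirrors the diagram; the exact format is an assumption, not a fixed standard:

```python
def build_rag_prompt(query: str, docs: list[str]) -> str:
    """Join retrieved documents and the user query into one grounded prompt."""
    context = "\n\n".join(f"[{i + 1}] {d}" for i, d in enumerate(docs))
    return f"Based on these documents:\n{context}\n\nAnswer: {query}"

prompt = build_rag_prompt(
    "What's the return policy?",
    ["Returns accepted within 30 days.", "Refunds go to the original payment method."],
)
```

Numbering the documents makes it easy to ask the model to cite which chunk supported its answer.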
Building the index
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Load and split documents
docs = DirectoryLoader("company_docs/").load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(docs)

# Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings)
Chunking matters. Too small: lose context. Too large: dilute relevance.
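To see the overlap mechanic concretely, here is a naive fixed-size chunker, a minimal sketch only; real splitters like the one above also respect sentence and paragraph boundaries:

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Naive fixed-size chunking: each chunk repeats the last `overlap`
    characters of the previous one so context isn't cut mid-thought."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

text = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(text, size=1000, overlap=200)
```

With 2500 characters, size 1000, and overlap 200, this yields three chunks whose edges share 200 characters.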
Retrieval
# Simple similarity search
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
relevant_docs = retriever.get_relevant_documents(query)
Retrieval quality is critical. Bad retrieval → bad generation.
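Under the hood, similarity search ranks document embeddings by cosine similarity to the query embedding. A pure-Python sketch with toy 2-d vectors (real embeddings have hundreds of dimensions):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product over the product of vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 2) -> list[int]:
    """Return indices of the k document vectors most similar to the query."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]

doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
result = top_k([1.0, 0.0], doc_vecs, k=2)  # vectors 0 and 2 point the same way
```

A vector store does exactly this, but with an approximate nearest-neighbor index so it scales past brute force.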
Generation
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
qa_chain = RetrievalQA.from_chain_type(
llm=OpenAI(),
chain_type="stuff", # put all docs in prompt
retriever=retriever
)
answer = qa_chain.run("What's our return policy?")
Improving retrieval
Hybrid search: Combine semantic (embeddings) with keyword (BM25)
from langchain.retrievers import BM25Retriever, EnsembleRetriever
keyword_retriever = BM25Retriever.from_documents(chunks)
semantic_retriever = vectorstore.as_retriever()
ensemble = EnsembleRetriever(
retrievers=[keyword_retriever, semantic_retriever],
weights=[0.5, 0.5]
)
Reranking: Retrieve many, rerank with cross-encoder
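The retrieve-many-then-rerank pattern can be sketched with a pluggable scoring function. The word-overlap scorer below is a toy stand-in for a real cross-encoder (e.g. a sentence-transformers model), which scores each (query, document) pair jointly:

```python
def rerank(query: str, docs: list[str], score_fn, top_n: int = 3) -> list[str]:
    """Score every (query, doc) pair and keep the top_n highest-scoring docs."""
    return sorted(docs, key=lambda d: score_fn(query, d), reverse=True)[:top_n]

def overlap_score(query: str, doc: str) -> int:
    # toy stand-in for a cross-encoder: count shared words
    return len(set(query.lower().split()) & set(doc.lower().split()))

candidates = ["refund policy details", "shipping times", "policy on returns and refund"]
best = rerank("refund policy", candidates, overlap_score, top_n=1)
```

Typical setups retrieve 20-50 candidates cheaply, then rerank down to the handful that actually enter the prompt.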
Query expansion: Rephrase query multiple ways, combine results
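One common way to combine results from several query phrasings is reciprocal rank fusion (RRF), which rewards documents that rank well in any list. A minimal sketch:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: each list contributes 1/(k + rank) per doc.
    k=60 is the conventional damping constant from the RRF literature."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# results for two rephrasings of the same question
fused = rrf_fuse([["a", "b", "c"], ["b", "c", "a"]])
```

Here "b" wins because it ranks high in both lists, even though "a" topped one of them.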
Chunking strategies
- Fixed size: Simple, might split sentences
- Sentence-based: Respect boundaries
- Recursive: Split hierarchically by headers, paragraphs, sentences
- Semantic: Group by topic similarity
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
separators=["\n\n", "\n", ". ", " ", ""]
)
Evaluation
- Retrieval metrics: Precision@k, Recall@k, MRR
- Generation metrics: Faithfulness, relevance, answer correctness
Tools: RAGAS, TruLens, custom evals
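The retrieval metrics are simple enough to compute by hand; a self-contained sketch (the `d1`-style IDs are illustrative):

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved docs that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant docs found in the top k."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant doc, 0 if none retrieved."""
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["d3", "d1", "d7"]
relevant = {"d1", "d2"}
p = precision_at_k(retrieved, relevant, k=3)  # one hit in three results
r = recall_at_k(retrieved, relevant, k=3)     # one of two relevant docs found
m = mrr(retrieved, relevant)                  # first hit at rank 2
```

Generation-side metrics (faithfulness, relevance) usually need an LLM judge, which is what RAGAS and TruLens provide.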