Every demo RAG app works. You chunk some documents, embed them, retrieve the top-k, and generate an answer. Then you ship it to production and discover that "works" and "works reliably" are very different things.
After building RAG systems for procurement tender matching (BandiFinder), inventory demand forecasting (Pellemoda), compliance monitoring (Holding Morelli), and a customer support chatbot (H-Farm), here's what I've learned about building retrieval pipelines that survive contact with real users and real data.
The Two Approaches
Vector RAG embeds your documents as vectors and retrieves by semantic similarity. It's the default approach — well-tooled, fast, and works for most use cases.
GraphRAG builds a knowledge graph from your documents and traverses relationships to answer queries. It's more complex but excels when answers require connecting information across multiple documents.
The right choice depends on your data and query patterns. Let me break down both.
Vector RAG: The Foundation
The Retrieval Pipeline
A standard Vector RAG pipeline:
```
Sources → Document Loaders → Text Splitters → Embeddings → Vector Store
                                                                ↓
User Query → Query Embedding → Similarity Search → Retrieved Docs → LLM → Answer
```
Each component is modular — you can swap loaders, splitters, embeddings, or vector stores independently. This modularity is why Vector RAG is the default: it's composable and debuggable.
Chunking: Where Most Pipelines Fail
The number one mistake in production RAG is bad chunking. Your retrieval quality is bounded by your chunk quality — no amount of prompt engineering compensates for chunks that split a concept mid-sentence.
What I've learned:
1. Use RecursiveCharacterTextSplitter for general text:
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=500,
    chunk_overlap=100,
)
chunks = splitter.split_documents(docs)
```

It tries to split on paragraphs first, then sentences, then words — preserving semantic boundaries. The `from_tiktoken_encoder` variant counts tokens accurately instead of characters.
2. Chunk size matters more than you think:
- Too small (100-200 tokens): Loses context, retrieves fragments
- Too large (1000+ tokens): Dilutes the relevant signal with noise
- Sweet spot for most use cases: 400-600 tokens with 50-100 overlap
3. Domain-specific splitters outperform generic ones:
For BandiFinder's procurement tenders, I built a custom splitter that splits on tender section headers (requirements, eligibility, deadlines) rather than character counts. Retrieval precision improved ~35% because each chunk maps to a single concept.
For Pellemoda's inventory data, the "documents" are structured product records — I split by product rather than by character count, embedding each product as a single chunk with all its attributes.
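The exact splitter is domain-specific, but the core idea fits in a few lines. A minimal sketch, assuming the documents use recognizable section headers (the section names below are hypothetical; match them to your documents' actual headings):

```python
import re

# Hypothetical section headers for a tender document; a real splitter
# would use the corpus's actual heading conventions.
SECTION_PATTERN = re.compile(
    r"^(Requirements|Eligibility|Deadlines|Budget)\s*:?\s*$",
    re.MULTILINE | re.IGNORECASE,
)

def split_by_sections(text: str) -> list[str]:
    """Split a document at section headers so each chunk covers one concept."""
    chunks = []
    last = 0
    for match in SECTION_PATTERN.finditer(text):
        # Close the previous chunk at the start of each new section header
        if match.start() > last:
            chunks.append(text[last:match.start()].strip())
        last = match.start()
    chunks.append(text[last:].strip())
    return [c for c in chunks if c]
```

Each resulting chunk starts at a header and runs to the next one, so "requirements" and "deadlines" never share an embedding.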
Embedding Model Selection
| Model | Dimensions | Speed | Quality | Cost |
|---|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | Fast | Good | Low |
| OpenAI text-embedding-3-large | 3072 | Medium | Best | Medium |
| Cohere embed-v3 | 1024 | Fast | Great | Low |
| Open-source (BGE, E5) | Varies | Self-hosted | Good | Free |
For most production use cases, text-embedding-3-small is the right default — it's fast, cheap, and good enough. Switch to large only when you've confirmed that retrieval quality (not generation quality) is the bottleneck.
Vector Store Choice
- Pinecone: Managed, scales easily, good for teams that don't want to manage infrastructure. Used this for BandiFinder.
- PostgreSQL + pgvector: Great if you already have Postgres. Single infrastructure to manage. Used this for Pellemoda.
- Supabase (pgvector under the hood): Easiest setup if you're already on Supabase. Used for the H-Farm chatbot.
Don't over-optimize your vector store choice early. The retrieval quality difference between stores is minimal — your chunking and embedding strategy matters 10x more.
GraphRAG: When Relationships Matter
Vector RAG retrieves by similarity — "find chunks that look like the query." This fails when the answer requires connecting information across multiple documents.
Example from BandiFinder: "Which open tenders require ISO 27001 certification AND have a deadline after March 2026?"
Vector RAG would retrieve chunks about ISO 27001 and chunks about deadlines separately, but struggle to find tenders where both conditions are true. GraphRAG solves this by modeling entities and relationships.
How GraphRAG Works
- Entity extraction: Parse documents to identify entities (companies, certifications, deadlines, requirements)
- Relationship extraction: Identify how entities relate ("Tender X requires certification Y", "Tender X has deadline Z")
- Knowledge graph construction: Store as nodes and edges in a graph database
- Graph traversal for retrieval: Answer queries by traversing relationships, not by similarity
```python
# Simplified GraphRAG pipeline
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# Step 1: Extract entities and relationships from each document
extraction_prompt = ChatPromptTemplate.from_template("""
Extract entities and relationships from this tender document.

Document: {document}

Return as JSON:
{{
  "entities": [
    {{"name": "...", "type": "tender|requirement|certification|deadline", "properties": {{}}}}
  ],
  "relationships": [
    {{"source": "...", "target": "...", "type": "requires|has_deadline|issued_by"}}
  ]
}}
""")

llm = ChatOpenAI(model="gpt-4o-mini")

# Step 2: Build knowledge graph
for doc in documents:
    result = llm.invoke(extraction_prompt.format(document=doc.page_content))
    graph_db.add_entities_and_relationships(result)

# Step 3: Query via graph traversal
query = "Tenders requiring ISO 27001 with deadline after March 2026"
# Translates to: MATCH (t:Tender)-[:REQUIRES]->(c:Certification {name: 'ISO 27001'})
#                WHERE t.deadline > '2026-03-01' RETURN t
results = graph_db.query(cypher_query)
```

When to Use GraphRAG vs Vector RAG
| Scenario | Vector RAG | GraphRAG | Winner |
|---|---|---|---|
| "What does document X say about topic Y?" | Great | Overkill | Vector |
| "Which entities satisfy conditions A AND B?" | Poor | Great | Graph |
| FAQ / documentation chatbot | Great | Unnecessary | Vector |
| Multi-hop reasoning ("A relates to B which affects C") | Poor | Great | Graph |
| Real-time, low-latency queries | Fast | Slower | Vector |
| Data with clear entity-relationship structure | Okay | Excellent | Graph |
| Unstructured text (blogs, articles) | Great | Expensive to build | Vector |
My rule of thumb: Start with Vector RAG. Add GraphRAG only when you can point to specific query patterns that fail with vector similarity.
For BandiFinder, I use both — Vector RAG for free-text search ("tenders about renewable energy") and GraphRAG for structured queries ("tenders requiring specific certifications in Lombardy with budget over €500K"). The agent decides which retrieval path to use based on the query.
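In production the routing decision is made by an LLM agent with tool-calling, but the idea can be sketched with a toy heuristic: queries that stack explicit constraints lean toward the graph path, free-text queries toward vector similarity. Everything below is an illustrative assumption, not the actual BandiFinder router:

```python
def route_query(query: str) -> str:
    """Toy router: count constraint-like markers in the query.
    Two or more suggests a structured, multi-condition query (graph path);
    fewer suggests a free-text topical query (vector path)."""
    structured_markers = (" and ", " with ", ">", "<", "after", "before", "over", "under")
    q = query.lower()
    hits = sum(marker in q for marker in structured_markers)
    return "graph" if hits >= 2 else "vector"
```

A real router replaces the keyword count with a model call, but the shape is the same: classify first, then dispatch to the cheaper retrieval path that can actually answer the query.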
RAG Architectures for Production
LangChain defines three RAG architectures. Understanding which to use is critical for production:
1. 2-Step RAG (Simple, Predictable)
Retrieve first, then generate. Always.
User Question → Retrieve Documents → Generate Answer → Return
```python
from langchain_core.tools import tool

@tool
def search_knowledge_base(query: str) -> str:
    """Search and return relevant documents."""
    docs = retriever.invoke(query)
    return "\n\n".join(doc.page_content for doc in docs)

# Always retrieves, then generates
docs = search_knowledge_base.invoke({"query": user_question})
answer = llm.invoke(f"Answer based on this context:\n{docs}\n\nQuestion: {user_question}")
```

Use when: Query always needs retrieval (documentation bots, FAQ systems). Latency is predictable — you know exactly how many LLM calls happen.
I used this for the H-Farm chatbot — every question is about admissions policies, so retrieval always applies.
2. Agentic RAG (Flexible, Intelligent)
An LLM agent decides whether to retrieve, what to retrieve, and when it has enough context. It can call retrieval tools multiple times or skip retrieval entirely.
```python
from langgraph.graph import StateGraph, MessagesState, START, END
from langgraph.prebuilt import ToolNode, tools_condition

def generate_query_or_respond(state: MessagesState):
    """Agent decides: retrieve or respond directly."""
    response = (
        model
        .bind_tools([retriever_tool])
        .invoke(state["messages"])
    )
    return {"messages": [response]}

workflow = StateGraph(MessagesState)
workflow.add_node(generate_query_or_respond)
workflow.add_node("retrieve", ToolNode([retriever_tool]))
workflow.add_node(generate_answer)

workflow.add_edge(START, "generate_query_or_respond")
workflow.add_conditional_edges(
    "generate_query_or_respond",
    tools_condition,  # Did the agent call a tool?
    {"tools": "retrieve", END: END},
)
workflow.add_edge("retrieve", "generate_answer")
workflow.add_edge("generate_answer", END)

graph = workflow.compile()
```

Use when: Not every query needs retrieval, or queries need multiple retrieval steps. BandiFinder's chat agent uses this — sometimes the user asks about their profile (no retrieval needed), sometimes about tenders (vector retrieval), sometimes complex queries (graph retrieval).
3. Hybrid RAG with Self-Correction (Production-Grade)
This is what you actually want in production. It adds document grading and query rewriting to the pipeline — if retrieved documents aren't relevant, the system rewrites the query and tries again.
```python
from typing import Literal

from pydantic import BaseModel, Field
from langchain_core.messages import HumanMessage

class GradeDocuments(BaseModel):
    """Binary relevance score for retrieved documents."""
    binary_score: str = Field(
        description="'yes' if relevant, 'no' if not"
    )

def grade_documents(state: MessagesState) -> Literal["generate_answer", "rewrite_question"]:
    """Grade retrieved docs. If irrelevant, rewrite the query."""
    question = state["messages"][0].content
    context = state["messages"][-1].content
    score = (
        grader_model
        .with_structured_output(GradeDocuments)
        .invoke([{"role": "user", "content": f"Is this relevant?\nQuestion: {question}\nContext: {context}"}])
    )
    if score.binary_score == "yes":
        return "generate_answer"
    return "rewrite_question"

def rewrite_question(state: MessagesState):
    """Rewrite the query for better retrieval."""
    question = state["messages"][0].content
    response = model.invoke([{
        "role": "user",
        "content": f"Rewrite this question for better search results:\n{question}"
    }])
    return {"messages": [HumanMessage(content=response.content)]}

# Wire it all together
workflow = StateGraph(MessagesState)
workflow.add_node(generate_query_or_respond)
workflow.add_node("retrieve", ToolNode([retriever_tool]))
workflow.add_node(rewrite_question)
workflow.add_node(generate_answer)

workflow.add_edge(START, "generate_query_or_respond")
workflow.add_conditional_edges("generate_query_or_respond", tools_condition, {"tools": "retrieve", END: END})
workflow.add_conditional_edges("retrieve", grade_documents)  # Grade → generate or rewrite
workflow.add_edge("generate_answer", END)
workflow.add_edge("rewrite_question", "generate_query_or_respond")  # Loop back

graph = workflow.compile()
```

The flow: Query → Retrieve → Grade → (relevant? Generate) / (irrelevant? Rewrite → Re-retrieve → Grade again)
This is what I run in production for BandiFinder and Pellemoda. The self-correction loop catches ~20% of queries that would have returned poor results with a single retrieval pass.
Production Lessons
1. Evaluate Retrieval Separately from Generation
Most teams only evaluate the final answer. But if retrieval returns irrelevant chunks, no prompt will save you. Use LangSmith to evaluate retrieval precision and recall independently:
- Retrieval precision: Of the chunks retrieved, what % were relevant?
- Retrieval recall: Of all relevant chunks, what % were retrieved?
Build an evaluation set of 50-100 (query, expected_chunks) pairs. Run it on every pipeline change.
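The metrics themselves are a few lines of set arithmetic. A minimal sketch, assuming your retriever can be wrapped as a callable mapping a query to a list of chunk IDs (an assumption — adapt to whatever your retriever actually returns):

```python
def retrieval_metrics(retrieved_ids: list[str], relevant_ids: set[str]) -> dict:
    """Precision and recall for one query's retrieved chunks."""
    retrieved = set(retrieved_ids)
    hits = retrieved & relevant_ids
    return {
        "precision": len(hits) / len(retrieved) if retrieved else 0.0,
        "recall": len(hits) / len(relevant_ids) if relevant_ids else 0.0,
    }

def evaluate(eval_set, retriever) -> dict:
    """Average precision/recall over (query, expected_chunk_ids) pairs.
    `retriever` is a callable: query -> list of chunk IDs."""
    scores = [retrieval_metrics(retriever(q), set(expected)) for q, expected in eval_set]
    n = len(scores)
    return {
        "precision": sum(s["precision"] for s in scores) / n,
        "recall": sum(s["recall"] for s in scores) / n,
    }
```

Run this on every pipeline change; a chunking tweak that moves recall from 0.7 to 0.85 matters far more than any prompt revision downstream.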
2. Metadata Filtering > More Embeddings
Before adding more documents to your vector store, add metadata filters. For BandiFinder:
```python
# Instead of searching all 50K tender chunks:
results = vectorstore.similarity_search(query, k=5)

# Filter by region and status first, then search within that subset:
results = vectorstore.similarity_search(
    query,
    k=5,
    filter={"region": "lombardy", "status": "open"},
)
```

This is faster, cheaper, and dramatically more precise.
3. Chunk Overlap Prevents "Split Brain"
Without overlap, a concept split across two chunks may never be fully retrieved. 50-100 token overlap ensures continuity. This is especially important for legal documents (tenders, compliance regulations) where a single clause can span multiple paragraphs.
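Overlap is just a sliding window whose step is `chunk_size - overlap`. A minimal token-level sketch (real splitters operate on text with a tokenizer, but the window arithmetic is the same):

```python
def chunk_tokens(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Sliding-window chunking: each chunk repeats the last `overlap`
    tokens of the previous one, so a clause split at a boundary still
    appears whole in at least one chunk."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]
```

With `size=4, overlap=2`, consecutive chunks share their boundary tokens, which is exactly the continuity that keeps a multi-paragraph clause retrievable.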
4. Cache Embeddings Aggressively
Embedding the same document twice is wasted money. Track document hashes and only re-embed on change. LangChain's indexing API handles this, but even a simple hash check saves significant cost at scale.
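A hash check needs no framework. A minimal in-memory sketch (`embed_fn` stands in for your actual embedding call, which is an assumption here; a production version would persist the hash-to-vector map alongside the vector store):

```python
import hashlib

class EmbeddingCache:
    """Skip re-embedding unchanged documents by keying on a content hash."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.store: dict[str, list[float]] = {}  # sha256 hex -> embedding

    def embed(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self.store:
            # Only pay for the embedding call on a cache miss
            self.store[key] = self.embed_fn(text)
        return self.store[key]
```

Any document whose content hasn't changed hits the cache and costs nothing; only edited or new documents trigger an embedding call.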
5. GraphRAG is an Investment
Building a knowledge graph requires entity extraction (LLM calls per document), relationship validation, and graph database infrastructure. Budget 3-5x the development time of Vector RAG. Only invest when you have clear multi-hop query patterns that vector similarity can't solve.
Related Posts
- Building AI Agents with LangGraph: From Prototype to Production — durable execution, multi-agent orchestration, and the patterns that make agents reliable
- MCP (Model Context Protocol): Connecting AI Agents to Real Tools — standardizing how agents connect to databases, APIs, and external tools
Building a RAG system and want production-grade architecture from day one? I've shipped retrieval pipelines across procurement, inventory, compliance, and customer support. Let's talk.