Every demo RAG app works. You chunk some documents, embed them, retrieve the top-k, and generate an answer. Then you ship it to production and discover that "works" and "works reliably" are very different things.
After building RAG systems for procurement tender matching (BandiFinder), inventory demand forecasting (Pellemoda), compliance monitoring (Holding Morelli), and a customer support chatbot (H-Farm), here's what I've learned about building retrieval pipelines that survive contact with real users and real data.
The Two Approaches
Vector RAG embeds your documents as vectors and retrieves by semantic similarity. It's the default approach — well-tooled, fast, and works for most use cases.
GraphRAG builds a knowledge graph from your documents and traverses relationships to answer queries. It's more complex but excels when answers require connecting information across multiple documents.
The right choice depends on your data and query patterns. Let me break down both.
Vector RAG: The Foundation
The Retrieval Pipeline
A standard Vector RAG pipeline:
```
Sources → Document Loaders → Text Splitters → Embeddings → Vector Store
                                                                ↓
User Query → Query Embedding → Similarity Search → Retrieved Docs → LLM → Answer
```
Each component is modular — you can swap loaders, splitters, embeddings, or vector stores independently. This modularity is why Vector RAG is the default: it's composable and debuggable.
Chunking: Where Most Pipelines Fail
The number one mistake in production RAG is bad chunking. Your retrieval quality is bounded by your chunk quality — no amount of prompt engineering compensates for chunks that split a concept mid-sentence.
What I've learned:
1. Use RecursiveCharacterTextSplitter for general text:
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=500,
    chunk_overlap=100,
)
chunks = splitter.split_documents(docs)
```

It tries to split on paragraphs first, then sentences, then words — preserving semantic boundaries. The `from_tiktoken_encoder` variant counts tokens accurately instead of characters.
2. Chunk size matters more than you think:
- Too small (100-200 tokens): Loses context, retrieves fragments
- Too large (1000+ tokens): Dilutes the relevant signal with noise
- Sweet spot for most use cases: 400-600 tokens with 50-100 overlap
3. Domain-specific splitters outperform generic ones:
For BandiFinder's procurement tenders, I built a custom splitter that splits on tender section headers (requirements, eligibility, deadlines) rather than character counts. Retrieval precision improved ~35% because each chunk maps to a single concept.
For Pellemoda's inventory data, the "documents" are structured product records — I split by product rather than by character count, embedding each product as a single chunk with all its attributes.
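The exact splitter is domain-specific, but the core idea fits in a few lines. A minimal sketch, assuming the documents use recognizable section headers (the section names below are hypothetical; match them to your documents' actual headings):

```python
import re

# Hypothetical section headers for a tender document; a real splitter
# would use the corpus's actual heading conventions.
SECTION_PATTERN = re.compile(
    r"^(Requirements|Eligibility|Deadlines|Budget)\s*:?\s*$",
    re.MULTILINE | re.IGNORECASE,
)

def split_by_sections(text: str) -> list[str]:
    """Split a document at section headers so each chunk covers one concept."""
    chunks = []
    last = 0
    for match in SECTION_PATTERN.finditer(text):
        # Close the previous chunk at the start of each new section header
        if match.start() > last:
            chunks.append(text[last:match.start()].strip())
        last = match.start()
    chunks.append(text[last:].strip())
    return [c for c in chunks if c]
```

Each resulting chunk starts at a header and runs to the next one, so "requirements" and "deadlines" never share an embedding.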
Embedding Model Selection
| Model | Dimensions | Speed | Quality | Cost |
|---|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | Fast | Good | Low |
| OpenAI text-embedding-3-large | 3072 | Medium | Best | Medium |
| Cohere embed-v3 | 1024 | Fast | Great | Low |
| Open-source (BGE, E5) | Varies | Self-hosted | Good | Free |
For most production use cases, text-embedding-3-small is the right default — it's fast, cheap, and good enough. Switch to large only when you've confirmed that retrieval quality (not generation quality) is the bottleneck.
Vector Store Choice
- Pinecone: Managed, scales easily, good for teams that don't want to manage infrastructure. Used this for BandiFinder.
- PostgreSQL + pgvector: Great if you already have Postgres. Single infrastructure to manage. Used this for Pellemoda.
- Supabase (pgvector under the hood): Easiest setup if you're already on Supabase. Used for the H-Farm chatbot.
Don't over-optimize your vector store choice early. The retrieval quality difference between stores is minimal — your chunking and embedding strategy matters 10x more.
GraphRAG: When Relationships Matter
Vector RAG retrieves by similarity — "find chunks that look like the query." This fails when the answer requires connecting information across multiple documents.
Example from BandiFinder: "Which open tenders require ISO 27001 certification AND have a deadline after March 2026?"
Vector RAG would retrieve chunks about ISO 27001 and chunks about deadlines separately, but struggle to find tenders where both conditions are true. GraphRAG solves this by modeling entities and relationships.
How GraphRAG Works
- Entity extraction: Parse documents to identify entities (companies, certifications, deadlines, requirements)
- Relationship extraction: Identify how entities relate ("Tender X requires certification Y", "Tender X has deadline Z")
- Knowledge graph construction: Store as nodes and edges in a graph database
- Graph traversal for retrieval: Answer queries by traversing relationships, not by similarity
```python
# Simplified GraphRAG pipeline
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# Step 1: Extract entities and relationships from each document
extraction_prompt = ChatPromptTemplate.from_template("""
Extract entities and relationships from this tender document.

Document: {document}

Return as JSON:
{{
  "entities": [
    {{"name": "...", "type": "tender|requirement|certification|deadline", "properties": {{}}}}
  ],
  "relationships": [
    {{"source": "...", "target": "...", "type": "requires|has_deadline|issued_by"}}
  ]
}}
""")

llm = ChatOpenAI(model="gpt-4o-mini")

# Step 2: Build knowledge graph
for doc in documents:
    result = llm.invoke(extraction_prompt.format(document=doc.page_content))
    graph_db.add_entities_and_relationships(result)

# Step 3: Query via graph traversal
query = "Tenders requiring ISO 27001 with deadline after March 2026"
# Translates to: MATCH (t:Tender)-[:REQUIRES]->(c:Certification {name: 'ISO 27001'})
#                WHERE t.deadline > '2026-03-01' RETURN t
results = graph_db.query(cypher_query)
```

When to Use GraphRAG vs Vector RAG
| Scenario | Vector RAG | GraphRAG | Winner |
|---|---|---|---|
| "What does document X say about topic Y?" | Great | Overkill | Vector |
| "Which entities satisfy conditions A AND B?" | Poor | Great | Graph |
| FAQ / documentation chatbot | Great | Unnecessary | Vector |
| Multi-hop reasoning ("A relates to B which affects C") | Poor | Great | Graph |
| Real-time, low-latency queries | Fast | Slower | Vector |
| Data with clear entity-relationship structure | Okay | Excellent | Graph |
| Unstructured text (blogs, articles) | Great | Expensive to build | Vector |
My rule of thumb: Start with Vector RAG. Add GraphRAG only when you can point to specific query patterns that fail with vector similarity.
For BandiFinder, I use both — Vector RAG for free-text search ("tenders about renewable energy") and GraphRAG for structured queries ("tenders requiring specific certifications in Lombardy with budget over €500K"). The agent decides which retrieval path to use based on the query.
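In production the routing decision is made by an LLM agent with tool-calling, but the idea can be sketched with a toy heuristic: queries that stack explicit constraints lean toward the graph path, free-text queries toward vector similarity. Everything below is an illustrative assumption, not the actual BandiFinder router:

```python
def route_query(query: str) -> str:
    """Toy router: count constraint-like markers in the query.
    Two or more suggests a structured, multi-condition query (graph path);
    fewer suggests a free-text topical query (vector path)."""
    structured_markers = (" and ", " with ", ">", "<", "after", "before", "over", "under")
    q = query.lower()
    hits = sum(marker in q for marker in structured_markers)
    return "graph" if hits >= 2 else "vector"
```

A real router replaces the keyword count with a model call, but the shape is the same: classify first, then dispatch to the cheaper retrieval path that can actually answer the query.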
RAG Architectures for Production
LangChain defines three RAG architectures. Understanding which to use is critical for production:
1. 2-Step RAG (Simple, Predictable)
Retrieve first, then generate. Always.
User Question → Retrieve Documents → Generate Answer → Return
```python
from langchain_core.tools import tool

@tool
def search_knowledge_base(query: str) -> str:
    """Search and return relevant documents."""
    docs = retriever.invoke(query)
    return "\n\n".join(doc.page_content for doc in docs)

# Always retrieves, then generates
docs = search_knowledge_base.invoke({"query": user_question})
answer = llm.invoke(f"Answer based on this context:\n{docs}\n\nQuestion: {user_question}")
```

Use when: Query always needs retrieval (documentation bots, FAQ systems). Latency is predictable — you know exactly how many LLM calls happen.
I used this for the H-Farm chatbot — every question is about admissions policies, so retrieval always applies.
2. Agentic RAG (Flexible, Intelligent)
An LLM agent decides whether to retrieve, what to retrieve, and when it has enough context. It can call retrieval tools multiple times or skip retrieval entirely.
```python
from langgraph.graph import StateGraph, MessagesState, START, END
from langgraph.prebuilt import ToolNode, tools_condition

def generate_query_or_respond(state: MessagesState):
    """Agent decides: retrieve or respond directly."""
    response = (
        model
        .bind_tools([retriever_tool])
        .invoke(state["messages"])
    )
    return {"messages": [response]}

workflow = StateGraph(MessagesState)
workflow.add_node(generate_query_or_respond)
workflow.add_node("retrieve", ToolNode([retriever_tool]))
workflow.add_node(generate_answer)

workflow.add_edge(START, "generate_query_or_respond")
workflow.add_conditional_edges(
    "generate_query_or_respond",
    tools_condition,  # Did the agent call a tool?
    {"tools": "retrieve", END: END},
)
workflow.add_edge("retrieve", "generate_answer")
workflow.add_edge("generate_answer", END)

graph = workflow.compile()
```

Use when: Not every query needs retrieval, or queries need multiple retrieval steps. BandiFinder's chat agent uses this — sometimes the user asks about their profile (no retrieval needed), sometimes about tenders (vector retrieval), sometimes complex queries (graph retrieval).
3. Hybrid RAG with Self-Correction (Production-Grade)
This is what you actually want in production. It adds document grading and query rewriting to the pipeline — if retrieved documents aren't relevant, the system rewrites the query and tries again.
```python
from typing import Literal

from pydantic import BaseModel, Field
from langchain_core.messages import HumanMessage

class GradeDocuments(BaseModel):
    """Binary relevance score for retrieved documents."""
    binary_score: str = Field(
        description="'yes' if relevant, 'no' if not"
    )

def grade_documents(state: MessagesState) -> Literal["generate_answer", "rewrite_question"]:
    """Grade retrieved docs. If irrelevant, rewrite the query."""
    question = state["messages"][0].content
    context = state["messages"][-1].content
    score = (
        grader_model
        .with_structured_output(GradeDocuments)
        .invoke([{"role": "user", "content": f"Is this relevant?\nQuestion: {question}\nContext: {context}"}])
    )
    if score.binary_score == "yes":
        return "generate_answer"
    return "rewrite_question"

def rewrite_question(state: MessagesState):
    """Rewrite the query for better retrieval."""
    question = state["messages"][0].content
    response = model.invoke([{
        "role": "user",
        "content": f"Rewrite this question for better search results:\n{question}"
    }])
    return {"messages": [HumanMessage(content=response.content)]}

# Wire it all together
workflow = StateGraph(MessagesState)
workflow.add_node(generate_query_or_respond)
workflow.add_node("retrieve", ToolNode([retriever_tool]))
workflow.add_node(rewrite_question)
workflow.add_node(generate_answer)

workflow.add_edge(START, "generate_query_or_respond")
workflow.add_conditional_edges("generate_query_or_respond", tools_condition, {"tools": "retrieve", END: END})
workflow.add_conditional_edges("retrieve", grade_documents)  # Grade → generate or rewrite
workflow.add_edge("generate_answer", END)
workflow.add_edge("rewrite_question", "generate_query_or_respond")  # Loop back

graph = workflow.compile()
```

The flow: Query → Retrieve → Grade → (relevant? Generate) / (irrelevant? Rewrite → Re-retrieve → Grade again)
This is what I run in production for BandiFinder and Pellemoda. The self-correction loop catches ~20% of queries that would have returned poor results with a single retrieval pass.
Production Lessons
1. Evaluate Retrieval Separately from Generation
Most teams only evaluate the final answer. But if retrieval returns irrelevant chunks, no prompt will save you. Use LangSmith to evaluate retrieval precision and recall independently:
- Retrieval precision: Of the chunks retrieved, what % were relevant?
- Retrieval recall: Of all relevant chunks, what % were retrieved?
Build an evaluation set of 50-100 (query, expected_chunks) pairs. Run it on every pipeline change.
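The metrics themselves are a few lines of set arithmetic. A minimal sketch, assuming your retriever can be wrapped as a callable mapping a query to a list of chunk IDs (an assumption — adapt to whatever your retriever actually returns):

```python
def retrieval_metrics(retrieved_ids: list[str], relevant_ids: set[str]) -> dict:
    """Precision and recall for one query's retrieved chunks."""
    retrieved = set(retrieved_ids)
    hits = retrieved & relevant_ids
    return {
        "precision": len(hits) / len(retrieved) if retrieved else 0.0,
        "recall": len(hits) / len(relevant_ids) if relevant_ids else 0.0,
    }

def evaluate(eval_set, retriever) -> dict:
    """Average precision/recall over (query, expected_chunk_ids) pairs.
    `retriever` is a callable: query -> list of chunk IDs."""
    scores = [retrieval_metrics(retriever(q), set(expected)) for q, expected in eval_set]
    n = len(scores)
    return {
        "precision": sum(s["precision"] for s in scores) / n,
        "recall": sum(s["recall"] for s in scores) / n,
    }
```

Run this on every pipeline change; a chunking tweak that moves recall from 0.7 to 0.85 matters far more than any prompt revision downstream.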
2. Metadata Filtering > More Embeddings
Before adding more documents to your vector store, add metadata filters. For BandiFinder:
```python
# Instead of searching all 50K tender chunks:
results = vectorstore.similarity_search(query, k=5)

# Filter by region and status first, then search within that subset:
results = vectorstore.similarity_search(
    query,
    k=5,
    filter={"region": "lombardy", "status": "open"},
)
```

This is faster, cheaper, and dramatically more precise.
3. Chunk Overlap Prevents "Split Brain"
Without overlap, a concept split across two chunks may never be fully retrieved. 50-100 token overlap ensures continuity. This is especially important for legal documents (tenders, compliance regulations) where a single clause can span multiple paragraphs.
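Overlap is just a sliding window whose step is `chunk_size - overlap`. A minimal token-level sketch (real splitters operate on text with a tokenizer, but the window arithmetic is the same):

```python
def chunk_tokens(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Sliding-window chunking: each chunk repeats the last `overlap`
    tokens of the previous one, so a clause split at a boundary still
    appears whole in at least one chunk."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]
```

With `size=4, overlap=2`, consecutive chunks share their boundary tokens, which is exactly the continuity that keeps a multi-paragraph clause retrievable.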
4. Cache Embeddings Aggressively
Embedding the same document twice is wasted money. Track document hashes and only re-embed on change. LangChain's indexing API handles this, but even a simple hash check saves significant cost at scale.
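A hash check needs no framework. A minimal in-memory sketch (`embed_fn` stands in for your actual embedding call, which is an assumption here; a production version would persist the hash-to-vector map alongside the vector store):

```python
import hashlib

class EmbeddingCache:
    """Skip re-embedding unchanged documents by keying on a content hash."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.store: dict[str, list[float]] = {}  # sha256 hex -> embedding

    def embed(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self.store:
            # Only pay for the embedding call on a cache miss
            self.store[key] = self.embed_fn(text)
        return self.store[key]
```

Any document whose content hasn't changed hits the cache and costs nothing; only edited or new documents trigger an embedding call.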
5. GraphRAG is an Investment
Building a knowledge graph requires entity extraction (LLM calls per document), relationship validation, and graph database infrastructure. Budget 3-5x the development time of Vector RAG. Only invest when you have clear multi-hop query patterns that vector similarity can't solve.
Related Posts
- Building AI Agents with LangGraph: From Prototype to Production — durable execution, multi-agent orchestration, and the patterns that make agents reliable
- MCP (Model Context Protocol): Connecting AI Agents to Real Tools — standardizing how agents connect to databases, APIs, and external tools
Building a RAG system and want production-grade architecture from day one? I've shipped retrieval pipelines across procurement, inventory, compliance, and customer support. Let's talk.