Building AI Agents with LangGraph: From Prototype to Production

Abhishek Chauhan · 6 min read

Most AI agent tutorials stop at "hello world" — a single LLM call with a tool. Production agents are a different beast. After shipping agents for procurement matching (BandiFinder), inventory forecasting (Pellemoda), and compliance monitoring (Holding Morelli), here's what I've learned about building agents that actually work.

Why LangGraph

LangGraph is a low-level orchestration framework for building stateful, long-running agents. Unlike linear chains, real agents need cycles — an agent evaluates a result, decides it's insufficient, and loops back. LangGraph gives you a state machine where nodes are functions and edges are conditional transitions.

The core benefit that matters in production is the graph structure itself. Here's what it looks like for a representative deal-analysis agent:

from langgraph.graph import StateGraph, START, END
 
graph = StateGraph(AgentState)
graph.add_node("analyze", analyze_deal)
graph.add_node("score_risk", score_risk)
graph.add_node("decide", decide_next_action)
graph.add_node("escalate", escalate_to_human)
 
graph.add_edge(START, "analyze")
graph.add_edge("analyze", "score_risk")
graph.add_conditional_edges(
    "score_risk",
    route_by_risk_level,
    {"low": END, "medium": "decide", "high": "escalate"}
)

Key insight: model your agent as a graph, not a prompt. Each node does one thing well. Edges encode your business logic. This makes agents debuggable, testable, and auditable.
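The snippet above leaves `AgentState` and `route_by_risk_level` undefined. A minimal sketch of what they might look like — the 0.3/0.7 thresholds are illustrative assumptions, not values from the original graph:

```python
from typing import TypedDict

class AgentState(TypedDict, total=False):
    deal_id: str
    deal_data: dict
    risk_score: float

def route_by_risk_level(state: AgentState) -> str:
    """Map the numeric risk score to the branch labels used in
    add_conditional_edges. Thresholds here are illustrative."""
    score = state["risk_score"]
    if score < 0.3:
        return "low"
    if score < 0.7:
        return "medium"
    return "high"
```

The router returns a plain string; LangGraph looks that string up in the mapping you pass to `add_conditional_edges` to pick the next node.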

Pattern 1: Durable Execution with Checkpointers

This is LangGraph's killer feature and the one most tutorials skip. When you compile a graph with a checkpointer, every step is persisted. If your agent crashes mid-execution — LLM timeout, network failure, process restart — it resumes from the last checkpoint, not from scratch.

from langgraph.checkpoint.postgres import PostgresSaver
 
# from_conn_string returns a context manager; call setup() once to create tables
with PostgresSaver.from_conn_string(DATABASE_URL) as checkpointer:
    checkpointer.setup()
    graph = builder.compile(checkpointer=checkpointer)
 
    # Every invocation is tied to a thread
    config = {"configurable": {"thread_id": "deal-analysis-123"}}
    result = graph.invoke({"deal_id": "D-456"}, config)

LangGraph v1 supports three durability modes — pick based on your tolerance for data loss vs. performance:

Mode     | Behavior                                    | Use when
"sync"   | Persists before each step                   | You can't afford to lose any work
"async"  | Persists in background while next step runs | Good balance of speed and safety
"exit"   | Persists only when graph exits              | Maximum performance, okay with replay on crash

# For critical financial operations, use sync.
# stream() is lazy — iterate over it to drive execution
for chunk in graph.stream({"deal_id": "D-456"}, config, durability="sync"):
    ...

For RevAgent's risk scoring agents that affect real deal pipelines, I use "sync". For the weekly CRO brief generator, "exit" is fine — if it crashes, just re-run it.

Pattern 2: The @task Decorator for Side Effects

A subtle but critical production pattern. When a workflow resumes from a checkpoint, LangGraph re-runs the interrupted node from its beginning. If that node makes API calls or writes to a database, those side effects will repeat on replay.

The @task decorator solves this — it caches the result so replayed tasks return the cached value instead of re-executing:

from langgraph.func import task
 
@task
def call_crm_api(deal_id: str) -> dict:
    """Fetch deal data from HubSpot. Cached on replay."""
    return hubspot.deals.get(deal_id)
 
@task
def send_slack_alert(channel: str, message: str):
    """Send alert. Won't re-send on workflow replay."""
    slack.chat_postMessage(channel=channel, text=message)
 
def evaluate_deal(state: AgentState):
    deal = call_crm_api(state["deal_id"]).result()
    risk = score_risk(deal)
 
    if risk > 0.7:
        send_slack_alert("#deal-alerts", f"High risk: {deal['name']}").result()
 
    return {"risk_score": risk, "deal_data": deal}

Rule of thumb: wrap every external API call, database write, or non-deterministic operation in @task. This makes your agents idempotent and safe for durable execution.
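To build intuition for why this makes replay safe, here's a toy model of the mechanism — not LangGraph's actual implementation, which persists results in the checkpointer rather than an in-memory dict:

```python
import json
from functools import wraps

_task_cache: dict = {}

def replay_safe(fn):
    """Toy version of @task: memoize by function name + arguments, so a
    'replayed' call returns the stored result instead of re-executing."""
    @wraps(fn)
    def wrapper(*args):
        key = (fn.__name__, json.dumps(args, sort_keys=True, default=str))
        if key not in _task_cache:
            _task_cache[key] = fn(*args)
        return _task_cache[key]
    return wrapper

calls = 0

@replay_safe
def flaky_api_call(deal_id: str) -> dict:
    global calls
    calls += 1  # count real executions
    return {"deal_id": deal_id, "fetched": True}

flaky_api_call("D-456")
flaky_api_call("D-456")  # replay: served from cache, no second API hit
```

The real @task goes further: results survive process restarts because they live alongside the checkpoint, which is what makes crash recovery safe.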

Pattern 3: Human-in-the-Loop with interrupt()

LangGraph's interrupt() function is the official way to pause execution and wait for human input. Unlike building custom approval queues, interrupt() saves the full graph state and resumes exactly where it paused — even days later.

from langgraph.types import interrupt, Command
 
def execute_action(state: AgentState):
    action = state["proposed_action"]
 
    if action["safety_score"] < 0.8:
        # Pause and surface the action for human review
        human_decision = interrupt({
            "action": action,
            "reason": "Safety score below threshold",
            "deal": state["deal_name"],
        })
 
        if human_decision == "reject":
            return {"status": "rejected_by_human"}
 
    # Execute the action (auto or after approval)
    return perform_action(action)

The caller sees the interrupt payload in the stream and resumes with Command:

# Resume after human approves
graph.invoke(
    Command(resume="approve"),
    config  # Same thread_id — picks up where it paused
)

In RevAgent, every autonomous action (CRM field update, email draft, deal escalation) goes through this pattern. Teams configure their safety threshold — below it, interrupt() kicks in and the action waits in a review queue.
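In recent LangGraph versions, a paused graph surfaces its pending interrupts under an `__interrupt__` key in the state returned by invoke(). A sketch of a queue handler built on that — `handle_result` and the queue are illustrative glue code, not part of LangGraph:

```python
def handle_result(result: dict, review_queue: list) -> str:
    """Route a graph.invoke() result: queue pending interrupts for human
    review, otherwise report the terminal status."""
    if "__interrupt__" in result:
        # Each pending interrupt carries the payload that was passed to
        # interrupt() inside the node (here assumed unwrapped to a dict)
        review_queue.append(result["__interrupt__"][0])
        return "pending_review"
    return result.get("status", "completed")
```

Once a reviewer acts on the queued payload, you resume with Command(resume=...) on the same thread_id as shown above.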

Pattern 4: Multi-Agent Orchestration

For RevAgent, I built 7 specialized agents — Risk, Forecast, Hygiene, Follow-up, Escalation, CRO Brief, and Chat. Each is a separate LangGraph graph. An orchestrator chains them:

async def orchestrate_post_risk(deal_id: str, risk_score: float):
    """After risk evaluation, chain downstream agents."""
    if risk_score >= 0.7:
        await run_agent("escalation", deal_id)
    await run_agent("hygiene", deal_id)
    await run_agent("follow_up", deal_id)

Don't build one mega-agent. Build small, focused agents with clear contracts (input state → output state) and orchestrate them. Each agent is independently testable, deployable, and has its own checkpointer thread.
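run_agent above is glue code, not a LangGraph primitive. One way to implement it, assuming a registry of compiled graphs (the AGENTS dict and thread naming scheme are assumptions of this sketch):

```python
# Hypothetical registry of compiled LangGraph graphs, keyed by agent name
AGENTS: dict = {}

async def run_agent(name: str, deal_id: str) -> dict:
    """Invoke one agent on its own namespaced checkpointer thread."""
    graph = AGENTS[name]
    # Namespacing by agent + deal keeps each agent's history independent,
    # so a crash in one agent never corrupts another's state
    config = {"configurable": {"thread_id": f"{name}:{deal_id}"}}
    return await graph.ainvoke({"deal_id": deal_id}, config)
```

The per-agent thread_id is the key design choice: it gives each agent its own durable history while the orchestrator stays stateless.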

For BandiFinder, the orchestration is simpler — a crawl agent feeds into a matching agent which feeds into an alert agent. Each runs on its own schedule with its own persistence.

Pattern 5: Hybrid Architecture (LLM + Rule-Based Fallback)

LLM APIs go down. Rate limits hit. Costs spike. Every agent I ship has a deterministic fallback:

@task
async def score_deal_risk(deal: Deal) -> RiskScore:
    try:
        return await llm_risk_agent.invoke(deal)
    except (RateLimitError, TimeoutError):
        return rule_based_risk_score(deal)

The rule-based fallback uses weighted heuristics — not as nuanced as GPT-4, but it keeps the system running. In RevAgent, this is backed by a process-wide circuit breaker (3 consecutive failures → 60s cooldown → fallback mode).
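The circuit breaker itself is only a few dozen lines. A minimal sketch with the same parameters (3 consecutive failures, 60s cooldown); the half-open behavior after cooldown is a simplifying assumption:

```python
import time

class CircuitBreaker:
    """Trip after N consecutive failures; stay open for a cooldown window."""

    def __init__(self, max_failures: int = 3, cooldown_s: float = 60.0):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when breaker tripped

    @property
    def open(self) -> bool:
        if self.opened_at is None:
            return False
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            # Cooldown elapsed: half-open, allow a trial call through
            self.opened_at = None
            self.failures = 0
            return False
        return True

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None
```

The calling code checks `breaker.open` before attempting the LLM path and routes straight to the rule-based fallback while the breaker is tripped.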

Note: wrapping this in @task means if the LLM call succeeds, the result is cached. On workflow replay, it won't re-call the LLM.

Pattern 6: Self-Improving Agents

Static agents degrade as data distributions shift. For BandiFinder's tender matching, I implemented a continuous learning loop:

  1. Agent scores tender relevance (0-1)
  2. User accepts/rejects the match
  3. Weekly job recalibrates scoring weights using outcome data
  4. Brier score tracks calibration quality over time

from statistics import mean

def recalibrate(outcomes: list[Outcome]):
    predicted = [o.predicted_score for o in outcomes]
    actual = [1.0 if o.accepted else 0.0 for o in outcomes]
    # Brier score: mean squared error of probabilistic predictions
    # (lower = better calibrated)
    brier = mean((p - a) ** 2 for p, a in zip(predicted, actual))
 
    # Adjust signal weights based on predictive power
    for signal in RISK_SIGNALS:
        signal.weight *= signal.correlation_with_outcomes(outcomes)
 
    return brier  # tracked over time to catch calibration drift

This runs as a scheduled LangGraph workflow with durability="sync" — if it crashes mid-recalibration, it resumes without corrupting the model weights.

Production Checklist

Before shipping any agent:

  1. Checkpointer backed by durable storage (Postgres, not the in-memory saver)
  2. Every external side effect wrapped in @task
  3. Durability mode chosen deliberately for each workflow
  4. interrupt() gating any autonomous action above your risk threshold
  5. Rule-based fallback and circuit breaker for LLM outages
  6. Evaluation set and calibration metrics in place before launch

What I'd Do Differently

  1. Start with the graph, not the prompt. Draw the state machine first. Prompts are implementation details.
  2. Use durable execution from day one. Don't bolt on persistence later — design around checkpointers and @task from the start.
  3. Build evaluation sets early. You can't improve what you can't measure. Use LangSmith's evaluation tools.
  4. Don't over-agent. Some problems are better solved with a SQL query than an LLM call. Use agents where reasoning and flexibility actually matter.
  5. Deploy with LangSmith. It handles assistants, threads, runs, auth, webhooks, and cron out of the box — don't rebuild that infrastructure.

If you're building AI agents for your business and want production-grade architecture from day one, get in touch.