Over the past year I've integrated LLMs into two production systems — a document intelligence platform and an approval workflow tool. Both went through the usual arc: exciting prototype, humbling production issues, gradual stabilization. Here's what survived that process.
The gateway pattern (LiteLLM)
The first and most important decision is whether your application code calls an LLM provider directly or goes through a gateway. We chose a gateway (LiteLLM) and I'd make the same call again.
Direct calls create provider lock-in at the code level. Every API call becomes a commitment to that provider's request format, rate limit behavior, and error codes. When you want to swap from GPT-4 to Claude or add a fallback provider, you're rewriting call sites.
LiteLLM gives you an OpenAI-compatible API that routes to any provider. Your application code is unchanged whether you're hitting OpenAI, Anthropic, or a self-hosted model. The fallback configuration is a YAML file, not a code change.
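For concreteness, the application side looks roughly like this. This is a sketch, not our production code: it assumes a LiteLLM proxy running at a placeholder URL with a model alias defined in its routing config, and uses the standard OpenAI Node client.

```typescript
import OpenAI from "openai";

// The standard OpenAI client, pointed at the LiteLLM proxy instead of api.openai.com.
// The baseURL, key, and model alias below are placeholders for your own deployment.
const client = new OpenAI({
  baseURL: "http://localhost:4000",
  apiKey: process.env.LITELLM_API_KEY ?? "",
});

// Swapping "claude-3-5-sonnet" for "gpt-4o" or a self-hosted alias changes nothing
// at this call site; only the proxy's routing and fallback config knows the difference.
const completion = await client.chat.completions.create({
  model: "claude-3-5-sonnet",
  messages: [{ role: "user", content: "Classify this document: ..." }],
});

console.log(completion.choices[0].message.content);
```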
What the gateway doesn't solve
Provider routing is solved. Prompt management is not. We still had a prompt version control problem — when we updated a system prompt, there was no easy way to know which version was responsible for which outputs. We solved this with Langfuse tags on trace records, but it required discipline to apply consistently.
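A minimal version of that tagging, assuming the Langfuse JS client and a hypothetical version constant that gets bumped on every prompt edit:

```typescript
import Langfuse from "langfuse";

const langfuse = new Langfuse();

// Hypothetical constant, bumped whenever the system prompt changes.
const PROMPT_VERSION = "extraction-v2.3";

// Tagging every trace with the prompt version makes "which prompt produced
// this output?" a filter in the Langfuse UI rather than an archaeology project.
const trace = langfuse.trace({
  name: "document-extraction",
  tags: [PROMPT_VERSION],
  metadata: { promptVersion: PROMPT_VERSION },
});
```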
Structured output
The biggest reliability improvement in our extraction pipeline was moving from free-form text output to structured JSON via the provider's structured output mode (or response_format: { type: "json_object" } for older models).
Before: parse the LLM's text response, hope the field names matched, write defensive code for every variant.
After: define the output schema, validate against it, throw if it doesn't match.
The LLM still makes mistakes, but they're schema violations, not format surprises. Schema violations are easier to detect, log, and route to a fallback.
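As a sketch of what "validate against it, throw if it doesn't match" can look like (Zod is one schema library that fits here, and the field names are illustrative, not our actual schema):

```typescript
import { z } from "zod";

// Illustrative target schema for a small invoice extraction.
const InvoiceSchema = z.object({
  invoice_number: z.string(),
  total_amount: z.number(),
  due_date: z.string().nullable(),
});

function parseExtraction(raw: string): z.infer<typeof InvoiceSchema> {
  // Both JSON.parse failures and schema violations surface as thrown errors,
  // which the caller can log and route to a fallback or a review queue.
  return InvoiceSchema.parse(JSON.parse(raw));
}
```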
Function calling vs. JSON mode vs. structured outputs
These are three different mechanisms for getting structured output, and they're not interchangeable:
- JSON mode: The output is valid JSON, but the shape is not guaranteed; field names and types can vary between calls.
- Function calling: The model decides whether to call a function and fills in the parameters. Good for agentic patterns.
- Structured outputs (OpenAI): The model is constrained to produce output matching a JSON Schema. Highest reliability for extraction tasks.
For extraction pipelines where you need specific fields reliably, use structured outputs if available, and fall back to JSON mode with an explicit field list in the prompt.
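Here's a hedged example of what the structured-outputs request shape looks like with the OpenAI Node SDK's JSON Schema response format; the model name, field names, and documentText are placeholders:

```typescript
import OpenAI from "openai";

const client = new OpenAI();
declare const documentText: string; // placeholder for the document being processed

// strict: true constrains decoding to the declared schema, so the response either
// matches these fields exactly or the request fails, instead of drifting silently.
const response = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [
    { role: "system", content: "Extract the listed fields from the document." },
    { role: "user", content: documentText },
  ],
  response_format: {
    type: "json_schema",
    json_schema: {
      name: "invoice_fields",
      strict: true,
      schema: {
        type: "object",
        properties: {
          invoice_number: { type: "string" },
          total_amount: { type: "number" },
          due_date: { type: ["string", "null"] },
        },
        required: ["invoice_number", "total_amount", "due_date"],
        additionalProperties: false,
      },
    },
  },
});
```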
Confidence gating
One of the more useful patterns we built was a confidence gate on extracted fields. The LLM was asked to return both the extracted value and a confidence score (0–1) for each field. Fields below a threshold were flagged for human review rather than auto-accepted.
The implementation:
```typescript
type ExtractedField = {
  value: string | null;
  confidence: number; // 0–1
  needs_review: boolean;
};
```

The LLM doesn't always calibrate confidence correctly — it tends to be overconfident on fields it's seen many times and underconfident on unusual formatting. But directionally, it's reliable enough to route high-confidence extractions away from the review queue.
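The gate itself is small. The threshold below is illustrative, not a recommendation:

```typescript
// Illustrative threshold; tune it against your own review-queue data.
const REVIEW_THRESHOLD = 0.8;

function gateField(value: string | null, confidence: number): ExtractedField {
  return {
    value,
    confidence,
    // Missing values and low-confidence values both go to the human review queue.
    needs_review: value === null || confidence < REVIEW_THRESHOLD,
  };
}
```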
Observability with Langfuse
Every LLM call in both systems is traced with Langfuse. A trace captures: the prompt, the model, the response, the latency, the token count, and any custom metadata (user ID, document ID, workflow stage).
The value isn't just debugging — it's also cost attribution. Knowing which features consume the most tokens tells you where to optimize prompt length first.
```typescript
import Langfuse from "langfuse";

const langfuse = new Langfuse();

const trace = langfuse.trace({ name: "document-extraction", userId: docId });
const span = trace.span({ name: "field-extraction" });

const response = await litellm.completion({ ... });

span.end({
  output: response.choices[0].message,
  usage: response.usage,
});
```

The observability overhead is minimal. The debugging payoff is large.
What failed
Chaining prompts without validation between steps. In an early version of the extraction pipeline, we'd run three sequential LLM calls: classify → extract → summarize. If the first step produced malformed output, the error propagated silently through steps 2 and 3. We fixed this by adding schema validation and error logging between every step.
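A sketch of that shape, with Zod schemas and stand-in step functions (the names and schemas are illustrative, not the real pipeline):

```typescript
import { z } from "zod";

// Stand-ins for the real step outputs; the actual schemas were richer.
const ClassificationSchema = z.object({ doc_type: z.string() });
const ExtractionSchema = z.object({
  fields: z.record(z.string(), z.string().nullable()),
});

declare function classify(text: string): Promise<unknown>;
declare function extract(
  text: string,
  c: z.infer<typeof ClassificationSchema>
): Promise<unknown>;
declare function summarize(f: z.infer<typeof ExtractionSchema>): Promise<string>;

async function runPipeline(documentText: string): Promise<string> {
  // Each step's output is validated before it feeds the next one. A schema
  // violation throws here and gets logged, instead of propagating silently.
  const classification = ClassificationSchema.parse(await classify(documentText));
  const fields = ExtractionSchema.parse(await extract(documentText, classification));
  return summarize(fields);
}
```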
Letting the model decide the field schema. We tried letting the model infer which fields to extract from the document. The output was inconsistent — different documents produced different schemas. The fix was defining the target schema explicitly in the system prompt and treating any deviation as an error.
Long system prompts without versioning. Prompts got edited in place. When the model started producing different output, we couldn't tell if it was a prompt change, a model update, or a data shift. Langfuse tags partially solved this, but explicit version strings in the system prompt (# System v2.3) would have been simpler.
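What that lighter-weight versioning could look like (the version number and prompt text are placeholders):

```typescript
// Keep the version in the prompt text itself, then reuse the same string as
// trace metadata so output changes can be correlated with prompt changes.
const SYSTEM_PROMPT_VERSION = "2.3";

const SYSTEM_PROMPT = `# System v${SYSTEM_PROMPT_VERSION}
You are a document extraction assistant. Return only the fields listed below.
...`;
```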
The honest summary
LLM integrations in production are harder than the demos suggest and easier than the horror stories imply. The failure modes are real but manageable. The key is treating LLM calls like any other I/O: validate inputs, validate outputs, log everything, have a fallback path.