In classic backend systems we are used to determinism: code either works or crashes with a clear stack trace. In LLM systems we deal with "soft failures": the system responds quickly, logs no errors, and still produces hallucinations or irrelevant context.
As an engineer with a background in high-load and distributed systems, I like to view an LLM system as a conveyor belt with measurable efficiency at each stage. For this I use the Observability Pyramid, where each layer protects the next.
1. System Layer: Telemetry and SRE Basics
Without this layer, the others make no sense. If you don't meet SLAs for availability and speed, response accuracy doesn't matter.
Key Metrics:
- TTFT (Time to First Token): the main metric for UX
- TPOT (Time Per Output Token): generation stability
- Tokens/Sec & Input/Output Ratio: critical for capacity planning and understanding KV-cache load
Engineering Approach: Monitor inference engines (vLLM/TGI) via Prometheus/Grafana and OpenTelemetry (OpenLLMetry).
For details on profiling the engine and finding bottlenecks — see my article:
LLM Engine Telemetry: How to profile models
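To make TTFT and TPOT concrete, below is a minimal client-side sketch that times a streaming request against an OpenAI-compatible endpoint (vLLM serves one by default). The base URL, model id, and the one-chunk-per-token approximation are assumptions for illustration only; the numbers you actually alert on should come from the engine's own Prometheus exporter.

```python
# Rough client-side measurement of TTFT and TPOT via a streaming request.
# Assumes a local vLLM server with the OpenAI-compatible API; the base_url,
# api_key, and model id below are placeholders for your deployment.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

start = time.perf_counter()
first_token_at = None
token_count = 0

stream = client.chat.completions.create(
    model="my-model",  # placeholder: whatever model your server is serving
    messages=[{"role": "user", "content": "Summarize our refund policy."}],
    stream=True,
)

for chunk in stream:
    if not chunk.choices or not chunk.choices[0].delta.content:
        continue
    if first_token_at is None:
        first_token_at = time.perf_counter()  # first content chunk -> TTFT
    token_count += 1  # rough proxy: one streamed chunk ~ one token

end = time.perf_counter()
ttft = first_token_at - start
tpot = (end - first_token_at) / max(token_count - 1, 1)
print(f"TTFT: {ttft:.3f}s | TPOT: {tpot * 1000:.1f} ms/token | tokens: {token_count}")
```

This kind of timing is fine for smoke tests and load scripts; for dashboards and alerts, scrape the engine's /metrics endpoint instead.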
2. Retrieval Layer: Data Hygiene (RAG Triad)
Most hallucinations stem from poor retrieval. RAG evaluation should be decomposed into three components:
A. Context Precision
How relevant are the retrieved chunks? Noise distracts the model and wastes tokens.
Tools: RAGAS, DeepEval.
B. Context Recall
Does the retrieved set contain the factual answer?
Practice: You need a "golden standard" — a labeled dataset. I use Meta CRAG because it simulates real-world chaos and dynamically changing data.
See my guide on local CRAG evaluation here.
C. Faithfulness
Is the answer derived from the context or hallucinated?
A judge model checks every claim in the response against the provided source.
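To show how the triad maps to code, here is a sketch that scores a single hand-made sample with RAGAS. Metric and column names follow the classic RAGAS API and can shift between versions; the question, contexts, and ground truth are fabricated placeholders, and the metrics call an LLM judge under the hood, so judge credentials are required.

```python
# Sketch: scoring the RAG triad (precision, recall, faithfulness) with RAGAS
# on one fabricated sample. Column names may differ across RAGAS versions.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_precision, context_recall, faithfulness

sample = {
    "question": ["What is the refund window?"],
    "answer": ["Refunds are accepted within 30 days of purchase."],
    "contexts": [[
        "Our policy allows refunds within 30 days of purchase.",
        "Shipping usually takes 3-5 business days.",  # noise chunk hurting precision
    ]],
    "ground_truth": ["Customers can request a refund within 30 days."],
}

result = evaluate(
    Dataset.from_dict(sample),
    metrics=[context_precision, context_recall, faithfulness],
)
print(result)  # per-metric scores between 0 and 1
```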
3. Semantic Layer: LLM-as-a-Judge at Scale
This level checks logic. The main challenge is balancing evaluation quality with cost/speed.
Engineering Best Practices:
- CI/CD Gating: Run the full suite on a reference dataset. If Faithfulness drops below 0.8, block the deployment (tune the threshold for your domain); a minimal gating sketch follows this list.
- Production Sampling: In high-load systems, evaluating 100% of traffic via GPT-4o is financial suicide. Use sampling (1–5%) and add judge caching (GPTCache, LangChain cache, or vLLM prefix caching). This is especially effective when users ask similar questions: the same prompt+context may come up for evaluation many times, but you pay for the judge only once.
- Specialized Judges: Instead of "naked" small models (which often struggle with logic), use Prometheus-2 or Flow-Judge. They are trained specifically for evaluation tasks, comparable in quality to GPT-4, and can be hosted locally.
- Out-of-band Eval: In production, evaluation always runs asynchronously to avoid increasing main request latency.
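For the CI/CD gating item above, here is a minimal pytest-style sketch using DeepEval's Faithfulness metric with the 0.8 threshold from the list. The golden-set entries are placeholders, and the metric itself calls a judge model, so the CI runner needs access to whichever judge you configure.

```python
# Sketch: CI gate on Faithfulness with DeepEval, intended to run under pytest
# in the deployment pipeline. The golden set below is a fabricated placeholder
# for your own 50-100 critical cases.
import pytest
from deepeval import assert_test
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

GOLDEN_SET = [
    {
        "input": "What is the refund window?",
        "actual_output": "Refunds are accepted within 30 days of purchase.",
        "retrieval_context": ["Our policy allows refunds within 30 days of purchase."],
    },
]

@pytest.mark.parametrize("case", GOLDEN_SET)
def test_faithfulness_gate(case):
    metric = FaithfulnessMetric(threshold=0.8)  # below 0.8 -> the test fails
    test_case = LLMTestCase(
        input=case["input"],
        actual_output=case["actual_output"],
        retrieval_context=case["retrieval_context"],
    )
    assert_test(test_case, [metric])  # raises on failure and blocks the deploy
```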
Diagnostic Map: What to Fix?
| Metric | If It Drops, the Problem Is Likely In | Action Plan |
|---|---|---|
| Context Recall | Embeddings / Indexing | Switch embedding model, implement Hybrid Search (Vector + Keyword) |
| Context Precision | Chunking / Noise | Add a Reranker (Cross-Encoder; see the sketch below the table), revise Chunking Strategy |
| Faithfulness | Temperature / Context | Lower Temperature, strengthen system prompt, check chunk integrity |
| TTFT (Latency) | Hardware / Load | Check Cache Hit Rate, enable quantization or PagedAttention |
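Since the Context Precision row points to a reranker, here is a minimal cross-encoder sketch with sentence-transformers. The checkpoint name is a common public model used for illustration, not a recommendation specific to this article.

```python
# Sketch: re-ranking retrieved chunks with a cross-encoder before they reach
# the prompt. The checkpoint is a widely used public example; swap in
# whatever fits your latency budget.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], top_k: int = 5) -> list[str]:
    # Score each (query, chunk) pair and keep the top_k most relevant chunks.
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```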
Implementation Plan (Checklist)
- Instrument (Day 0): Set up export of metrics and traces (vLLM + OpenTelemetry).
- Golden Set: Collect 50–100 critical cases. Use Meta CRAG structure as reference (details in my article Build Your Own Spaceport: Local RAG Evaluation with Meta CRAG).
- Automate: Integrate DeepEval/RAGAS into GitHub Actions.
- Sampling & Feedback: Set up collection of logs and user feedback (thumbs up/down) for gray-zone analysis in Arize Phoenix or LangSmith; a sampling sketch follows this checklist.
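To connect the sampling and feedback item to code, below is a sketch of out-of-band judging: sample a small share of traffic on the hot path, push it to an async queue, and cache verdicts by a prompt+context hash so repeats are paid for once. evaluate_with_judge() and the in-memory queue/cache are placeholders; in a real deployment the queue would be Kafka/SQS and the cache would be Redis or GPTCache.

```python
# Sketch: sampled, asynchronous LLM-as-a-judge evaluation with a simple cache.
# Everything here is in-memory for illustration; swap in a real queue and
# cache for production, and ship results to Phoenix/LangSmith/Grafana.
import asyncio
import hashlib
import random

SAMPLE_RATE = 0.02  # judge roughly 2% of production traffic
_judge_cache: dict[str, float] = {}
eval_queue: asyncio.Queue = asyncio.Queue()

def cache_key(prompt: str, context: str) -> str:
    # Keyed on prompt+context per the caching note above; include the answer
    # in the key if your answers vary a lot for identical inputs.
    return hashlib.sha256(f"{prompt}\n---\n{context}".encode()).hexdigest()

async def maybe_enqueue_for_eval(prompt: str, context: str, answer: str) -> None:
    """Called on the hot path; never blocks the user-facing response."""
    if random.random() < SAMPLE_RATE:
        await eval_queue.put((prompt, context, answer))

async def eval_worker() -> None:
    """Runs out-of-band; failures here never add latency to main requests."""
    while True:
        prompt, context, answer = await eval_queue.get()
        key = cache_key(prompt, context)
        if key not in _judge_cache:
            _judge_cache[key] = await evaluate_with_judge(prompt, context, answer)
        # ship _judge_cache[key] to your observability backend here
        eval_queue.task_done()

async def evaluate_with_judge(prompt: str, context: str, answer: str) -> float:
    # Placeholder: call your judge (Prometheus-2, GPT-4o, ...) and return a score.
    return 1.0
```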
Conclusion
For an experienced engineer, an LLM system is just another probabilistic node in a distributed architecture. Our job is to surround it with sensors so its behavior becomes predictable — like the trajectory of a rocket on a verified orbit.


Top comments (9)
I especially liked the layered approach. In your opinion, is it viable to run automated evaluations (LLM-as-a-judge) continuously in production, or does it add too much overhead in terms of cost and latency?
Great point! It really depends on your infra. If you have idle GPUs, using pricey APIs for every check is overkill. It's better to build a "Golden Dataset" and fine-tune a smaller model (like Llama 3), or use a specialized LLM as a dedicated judge. Also, throwing GPTCache into the mix is a game-changer: if a similar response has been judged before, just pull it from the cache and save the tokens. I think this link will be useful for you.
Yes, it's really expensive. I don't think many companies can afford it.
Yes, you're right. I tried to describe a solution in the comment above.
Great article on applying observability principles to LLM systems, love the conveyor belt analogy for making probabilistic outputs more predictable. One thing I'd add to the Semantic Layer: for specialized judges in production, combining Prometheus-2 with a few-shot prompting strategy using domain-specific examples has worked wonders in my experience with financial LLMs. It reduces bias in narrow niches without needing full fine-tuning. How do you handle judge drift over time as your dataset evolves?
Thanks, great insight! Handling judge drift is a continuous process. We tackle it by "auditing the auditor": running periodic blind tests where humans re-evaluate the judge's scores.
We do the same "auditing the auditor" practice, and additionally we feed the human disagreements back in to periodically refresh the few-shot examples or even lightly fine-tune the judge (Prometheus-2). This keeps the cost low while maintaining quality.
Curious: have you noticed drift happening more after certain types of changes (e.g. system prompt updates, a new chunking strategy, or model version bumps)?
Regarding Prometheus-2: it's an excellent model, but out of the box it can still be a bit of a pain in narrow niches (medicine, law, etc.).
True. In complex niches, the "judge" is just a tool, not a silver bullet. The secret lies in Custom Rubrics and providing Ground Truth answers. If you give Prometheus-2 a clear scoring scale and a few domain-specific examples, it hits the mark much more consistently.