AI Reliability Engineering
TL;DR
AI systems are not reliable by default. They are probabilistic engines embedded inside software systems, and their behavior is fundamentally non-deterministic. If we want AI systems to behave reliably, the reliability must come from the engineering around the model, not from the model itself.
Right now the industry is repeating a mistake we have already made before. Instead of treating AI systems like production infrastructure, we are treating them like magical components that will somehow handle complexity on their own. Spoiler alert: they will not. If anything, AI systems demand more operational discipline, not less.
If you are building agentic systems today, whether you realize it or not, you are doing reliability engineering.
In other words: you are an SRE now.
We Already Solved This Problem Once
Several years ago I wrote a talk and article titled Introduction to Data Reliability Engineering (available on Medium). At the time, data teams were dealing with a common operational failure mode: pipelines broke constantly, dashboards were unreliable, and nobody really owned the stability of the system. Data infrastructure was treated like a collection of scripts rather than a production system.
The solution turned out to be simple, albeit culturally difficult. Data pipelines had to be treated like services. Once that shift happened, teams began applying the same operational discipline used in traditional infrastructure: observability, service level objectives, incident response, and operational ownership.
Today we are watching the exact same pattern emerge in AI systems. Agent frameworks are proliferating quickly, and organizations are rushing to build AI workflows, but in the process, reliability thinking has largely disappeared. Engineers talk about prompts and models and tools, but very few discussions focus on operational stability.
Instead, we hear a different kind of assumption: the belief that the model will somehow resolve complexity on its own.
That assumption is dangerously misleading.
Guardrails Have Hidden the Real Problem
One reason this misunderstanding exists is that modern frontier model APIs hide a significant amount of the underlying instability of language models. Most hosted LLM services include multiple layers of guardrails embedded directly in the API layer. These can include structured output enforcement, moderation filters, schema validation, retry logic, and tool-calling constraints.
Because of these layers, many engineers building AI systems rarely encounter the raw behavior of a language model. The system quietly absorbs many failures before they ever reach the application layer.
As a result, something subtle has happened: people have forgotten how language models actually behave.
Language models hallucinate. They do so frequently, not just in edge cases but as a natural consequence of probabilistic generation. That does not make them useless, but it does mean they cannot be treated as deterministic software components.
Reliable AI systems are therefore not built by pretending hallucination does not exist. They are built by designing systems that expect hallucination and manage it.
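As a sketch of that mindset, the wrapper below treats any output that fails schema validation as an error and retries. The `flaky_model` stub is a hypothetical stand-in for a real LLM client, hardcoded to fail once and then succeed:

```python
import json

# Hypothetical stand-in for a raw model call: returns hallucinated,
# unparseable text on the first attempt and valid JSON on the second.
# A real system would call an actual LLM client here.
def flaky_model(prompt, attempt):
    if attempt == 0:
        return "Sure! Here is the data: {city: Paris"
    return '{"city": "Paris"}'

def call_with_validation(prompt, max_attempts=3):
    """Expect hallucination: validate every output and retry on failure."""
    for attempt in range(max_attempts):
        raw = flaky_model(prompt, attempt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # plausible-looking text, but structurally a failure
        if isinstance(data, dict) and "city" in data:
            return data
    raise RuntimeError("model output never passed validation")
```

The point is not the specific schema check but the posture: the surrounding code assumes the model will sometimes produce invalid output and handles it as a routine event.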
Frontier APIs made LLMs dramatically easier to use. At the same time, they made it easier to forget what those systems actually are.
Agent Systems Are Distributed Systems
Another challenge appears as soon as AI systems begin to perform real work. Modern agent architectures rarely consist of a single prompt and response. Instead, a single user request may trigger retrieval steps, tool calls, reasoning loops, validation passes, or background tasks. Once a workflow reaches that level of complexity, it begins to resemble something very familiar to systems engineers: a distributed system.
Agentic systems are distributed systems with a probabilistic component, which introduces an interesting cultural shift. For the last decade, cloud platforms have absorbed much of the complexity involved in building distributed infrastructure. Managed services handle scaling, orchestration, queueing, and failover automatically. That abstraction has been enormously valuable for developer productivity.
However, AI systems reintroduce distributed behavior into the application layer itself. Instead of the infrastructure handling orchestration, the workflow logic now lives inside the software the engineer writes. This means engineers must once again think about system behavior under failure conditions. Latency amplification, cascading failures, and inconsistent state are no longer theoretical concerns; they are common patterns in agent workflows.
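A minimal illustration of that shift: give each workflow step its own deadline, so a hung tool call degrades gracefully instead of stalling the whole agent. The step functions, timeouts, and fallback behavior here are illustrative stand-ins, not a real agent framework:

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

# Hypothetical agent steps.
def retrieve(query):
    return ["doc1", "doc2"]

def slow_tool(doc):
    time.sleep(0.5)  # simulates a hung external API
    return "never reached"

def run_step(fn, arg, timeout_s):
    """Run one workflow step with a deadline, so a stalled tool call
    fails fast instead of blocking the whole agent loop."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(fn, arg).result(timeout=timeout_s)
    except TimeoutError:
        return None  # degrade: skip this step, keep the workflow alive
    finally:
        pool.shutdown(wait=False)

docs = run_step(retrieve, "query", timeout_s=1.0)
result = run_step(slow_tool, docs[0], timeout_s=0.1)  # times out, returns None
```

This is ordinary distributed-systems hygiene, but in agent architectures it now lives in application code rather than in a managed platform.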
The Reliability Framework Still Applies
Fortunately, the industry already has a well-developed framework for thinking about system reliability. Google’s Site Reliability Engineering book (Beyer et al.) describes what are commonly known as the Four Golden Signals of system health: latency, traffic, errors, and saturation.
These signals translate surprisingly well to AI systems, although their interpretation changes slightly.
Latency in an AI system is not just the time required to produce a response. It may include model inference time, time to first token, streaming latency, tool execution time, and the completion time of the overall workflow. Traffic is similarly more complex because a single user request can expand into multiple model calls as an agent moves through its reasoning process.
Errors also look different. Agents rarely return explicit error messages. Instead they often return responses that appear valid but are structurally incorrect or incomplete. From a reliability perspective, these outcomes must still be treated as failures.
Finally, saturation occurs not only in infrastructure resources such as CPU or GPU capacity but also in the context window of the model itself. Excessive context can increase latency, reduce accuracy, and destabilize workflows. This intersection between reliability and context design is something I discussed previously in Why I Hate the Term Context Engineering.
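One way to make the four signals concrete is a small per-request accounting object. Everything here is an illustrative sketch: the 8192-token window, the recorded numbers, and the field names are assumptions, not any particular platform's API:

```python
from dataclasses import dataclass, field

@dataclass
class RequestSignals:
    """Golden-signal accounting for a single agent request."""
    latencies_ms: list = field(default_factory=list)  # latency per model call
    model_calls: int = 0           # traffic: one request fans out into many calls
    invalid_outputs: int = 0       # errors: includes schema/validation failures
    context_tokens: int = 0        # saturation input
    context_window: int = 8192     # assumed model limit

    def record_call(self, latency_ms, tokens, valid):
        self.model_calls += 1
        self.latencies_ms.append(latency_ms)
        self.context_tokens = tokens
        if not valid:
            self.invalid_outputs += 1

    @property
    def saturation(self):
        # How full the context window is, not just CPU/GPU load.
        return self.context_tokens / self.context_window

sig = RequestSignals()
sig.record_call(420, tokens=3000, valid=True)
sig.record_call(610, tokens=6500, valid=False)  # parsed fine, wrong structure
```

Note that the second call is counted as an error even though the model returned a response: from a reliability perspective, a structurally wrong answer is a failure.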
Systems That Never Break Are Fragile
Reliability engineering has long emphasized an important philosophical shift: reliable systems are not systems that never fail.
Charity Majors, co-author of Database Reliability Engineering, often emphasizes this point in her work. Systems designed to avoid all failure are often fragile because they are never tested under stress; when such a system finally goes down, it takes all hands on deck to get it live again. True reliability comes from designing systems that can break safely, with a small blast radius, and recover quickly.
This mindset is especially important for AI systems because probabilistic components will inevitably behave unpredictably. The goal is not eliminating that behavior, but rather designing systems that remain stable despite it.
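A classic pattern for failing in a controlled way is a circuit breaker. The sketch below, with a deliberately broken model stub, stops calling a failing component after a threshold and serves a fallback instead, limiting blast radius. The class and threshold are illustrative, not a production implementation:

```python
class CircuitBreaker:
    """After `threshold` consecutive failures, stop calling the
    component and fail fast with a fallback value."""
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    def call(self, fn, fallback):
        if self.failures >= self.threshold:
            return fallback  # circuit open: don't amplify the outage
        try:
            result = fn()
            self.failures = 0  # success resets the breaker
            return result
        except Exception:
            self.failures += 1
            return fallback

breaker = CircuitBreaker(threshold=2)

def broken_model():
    raise RuntimeError("upstream model unavailable")

# Every call degrades to the fallback; after two failures the breaker
# opens and the broken model is no longer called at all.
answers = [breaker.call(broken_model, fallback="unavailable") for _ in range(5)]
```

The system still "fails," but it fails in a bounded, predictable way rather than hammering an unhealthy dependency.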
Toil and Operational Work
Another concept from reliability engineering that applies directly to AI systems is toil, which Google's SRE book defines as manual, repetitive work that can be automated and that scales with the growth of the service.
AI workflows introduce new kinds of toil. Engineers frequently find themselves debugging agent outputs, rewriting prompts repeatedly, re-running workflows, or manually correcting incorrect responses. These activities consume engineering time without improving the underlying system.
The reliability approach to this problem is the same one used in infrastructure operations: automate repetitive operational work wherever possible.
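As one concrete example of automating toil: a common manual chore is fixing model outputs that wrap JSON in markdown fences before a parser can read them. A small repair pass handles it automatically. `strip_fences` and `parse_model_json` are hypothetical helpers written for this sketch, not a real library API:

```python
import json

def strip_fences(text):
    """Models often wrap JSON in markdown fences; engineers were fixing
    this by hand. Automate the repair instead."""
    text = text.strip()
    if text.startswith("```"):
        text = text.split("\n", 1)[1]    # drop the opening fence line
        text = text.rsplit("```", 1)[0]  # drop the closing fence
    return text.strip()

def parse_model_json(raw):
    """Try the raw output, then the repaired output; give up cleanly."""
    for candidate in (raw, strip_fences(raw)):
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return None
```

Each automated repair like this removes a class of manual correction entirely, which is exactly the toil-reduction loop from infrastructure operations.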
Why Homelabs Matter
One reason I enjoy running a homelab is that it forces me to confront these system realities directly. When systems run locally, the abstractions that cloud platforms provide disappear. You are forced to think about queues, networking, inference latency, resource limits, and failure modes.
If something breaks in a homelab, there is no platform team to escalate to. You must understand what the system is actually doing.
This mindset turns out to be exactly the mindset required to build reliable AI systems. Agent workflows behave much more like distributed systems than traditional web applications. When they fail, the root cause is rarely the model itself; it is usually the surrounding system architecture.
Predictability Versus Reliability
Self-hosting models also exposes deeper layers of model behavior. When running models locally, engineers can observe elements such as logits, token probabilities, sampling behavior, and generation drift.
This visibility does not automatically make a system reliable. However, it does make the system more predictable. Predictability is the first step toward reliability engineering. Engineers cannot build stable systems around components they treat as complete black boxes.
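For instance, with access to token probabilities, an engineer can quantify how predictable each sampling step is. The sketch below computes Shannon entropy over a step's next-token distribution; the probability values are made up for illustration:

```python
import math

def token_entropy(probs):
    """Shannon entropy (in bits) of one sampling step's next-token
    distribution. High entropy means many plausible next tokens, i.e.
    a less predictable generation step."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

confident = token_entropy([0.97, 0.02, 0.01])        # near-deterministic step
uncertain = token_entropy([0.25, 0.25, 0.25, 0.25])  # coin-flip territory
```

A signal like this does not make the model reliable, but it turns "the model feels flaky" into a measurable property an engineer can monitor and act on.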
A Pattern We Have Seen Before
Interestingly, this pattern is not new. I wrote about a similar dynamic in an earlier article, Optimizing Into Chaos: Why AI Agents Fail. In that post I argued that agent systems tend to optimize aggressively toward goals without strong boundaries. When the surrounding system lacks guardrails, those optimization loops naturally drift into unstable behavior.
The lesson is simple but important: the model is not the system. The system is everything around the model.
The AI Reliability Stack
A useful way to think about this architecture is to separate the model from the reliability layers surrounding it.
User Request
↓
Agent Workflow / Orchestrator
↓
Tool Layer (APIs, Databases, Services)
↓
LLM (Probabilistic Engine)
↓
Inference Infrastructure
Wrapped around all of this are the reliability layers:
Observability
Validation
Retries
SLOs
Error Budgets
Context Management
The model provides reasoning ability, but reliability emerges from the surrounding system design.
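For example, the SLO and error-budget layer reduces to simple arithmetic. The numbers below are illustrative, not a recommended target:

```python
def error_budget(slo, total_requests, failed_requests):
    """With a 99% success SLO, 1% of requests form the error budget.
    Returns the budget in requests and the fraction consumed."""
    budget = (1 - slo) * total_requests
    consumed = failed_requests / budget if budget else float("inf")
    return budget, consumed

# 10,000 requests under a 99% SLO: a 100-request budget, half consumed.
budget, consumed = error_budget(slo=0.99, total_requests=10_000, failed_requests=50)
```

When the budget is only half spent, the team has headroom to keep shipping prompt and workflow changes; when it is exhausted, the same arithmetic says to slow down and invest in stability.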
The Engineer’s Job Has Changed
AI has not removed engineering complexity. Instead, it has shifted where that complexity lives.
The role of the engineer is no longer just writing deterministic code. It now involves designing systems that contain probabilistic components and ensuring those systems behave predictably under failure conditions.
That responsibility includes architecture, workflow design, validation layers, retries, observability, and operational boundaries. Engineers must understand how their systems behave under stress because the model will not solve those problems automatically.
The Intelligence Is Still the Engineer
Language models are extremely powerful tools, but they are not intelligent systems in the traditional sense. They are statistical engines that generate tokens based on probability distributions learned during training.
The intelligence of the system does not come from the model.
It comes from the engineer designing the system around it.


