From pilot to production: the 7 layers for engineering generative AI agents that actually work

Deploying AI agents at scale requires critical extensions to the operational stack. Unlike traditional software, agents are non-deterministic, can take autonomous actions, and face novel attack vectors like prompt injection, which OWASP ranked the #1 security risk for LLM applications in 2025.

The stakes are real. Air Canada lost a lawsuit over a hallucinated discount policy. NYC's chatbot gave illegal business advice. With 79% of organisations now adopting AI agents and the market projected to reach $103-199 billion by 2034, getting this infrastructure right separates successful deployments from the 40%+ of agentic AI projects Gartner predicts will be cancelled by 2027.

In fact, most companies that have tried to run AI agents in production have failed. If you have built a successful proof of concept (POC) or prototype for a business application or process automation using an LLM, congratulations: you've done better than most! But the difficult part is only just beginning. Now you need to run it at scale in your organisation while avoiding the fate of the 95% of failed GenAI projects, or the 50%+ shelved because of complex infrastructure.

Organisations implementing production agents need seven interconnected infrastructure layers, on top of existing engineering good practices:

  1. Observability
  2. Evaluation methodology
  3. Guardrails and safety
  4. Security
  5. Cost management
  6. Unified interfaces
  7. Orchestration

Over the course of 8 articles (including this one) I will walk through each of these layers, providing actionable advice, examples of tools and guidance on the tradeoffs involved. But first, a quick detour into what doesn’t change for companies implementing agents.

Some things never change

The first thing to know is that engineering with LLMs is still engineering; the same principles and techniques apply.

You still need a Continuous Integration/Continuous Delivery (CI/CD) pipeline to ensure that code is properly reviewed and tested before it is deployed. You still need environments (e.g., dev, staging, production). You still need change management, even for prompts. Especially for prompts. A one-word change to a system prompt can alter behaviour as much as a hundred lines of code. Treat prompt changes with the same rigour you'd treat code changes.
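One lightweight way to put prompts under change management is to treat the reviewed prompt text as a versioned artifact and gate CI on it. The sketch below is a hypothetical example, not a standard tool: the prompt text, checksum and function names are all illustrative assumptions.

```python
import hashlib

# Hypothetical example: prompts live in version control alongside code, and
# CI fails if a prompt changes without its approval record being updated.
# The prompt text and checksum workflow here are illustrative assumptions.

SYSTEM_PROMPT = "You are a refund assistant. Only offer refunds listed in policy."

# Checksum recorded at the last human review; CI recomputes and compares.
APPROVED_CHECKSUM = hashlib.sha256(SYSTEM_PROMPT.encode()).hexdigest()

def prompt_change_approved(prompt: str, approved_checksum: str) -> bool:
    """Return True only if the prompt matches its last-reviewed checksum."""
    return hashlib.sha256(prompt.encode()).hexdigest() == approved_checksum

assert prompt_change_approved(SYSTEM_PROMPT, APPROVED_CHECKSUM)
# A one-word edit invalidates the approval, forcing a fresh review:
assert not prompt_change_approved(SYSTEM_PROMPT + " Be generous.", APPROVED_CHECKSUM)
```

In practice you would pair the checksum gate with regression evals, so a prompt change also has to pass the same quality bar as a code change.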

An AI agent is just another component in a system. It takes inputs, does work, and produces outputs. The only difference from what you're used to is that its outputs aren't deterministic. But that's not as novel as you might think; any sufficiently complex system already behaves in ways you can't fully predict. If you've ever run a large distributed system or called a third-party API, you know the feeling: you think you understand it, until one day it does something you never expected and you have to spend a week figuring out why!

Behind the scenes, the agent part is simple to understand. An AI agent is an LLM with tools running in a loop. If you haven't written one yourself, I'd encourage you to try it. It's instructive in the same way that writing a web server from scratch is instructive: it demystifies the thing, and then you can make better decisions about which framework to use or whether to use one at all.
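To make "an LLM with tools running in a loop" concrete, here is a minimal sketch. The `call_llm` and `run_tool` functions are stubs standing in for a real model call and real tools; their shapes are assumptions, since every SDK differs, but the loop itself is the essential part.

```python
# A minimal agent loop: ask the model, run any tool it requests, feed the
# result back, and stop when the model produces a final answer.

def call_llm(messages):
    """Stub standing in for a real model call (an assumption for this sketch)."""
    last = messages[-1]["content"]
    if "tool_result" in last:
        return {"type": "answer", "content": f"Done: {last}"}
    return {"type": "tool_call", "tool": "search", "args": {"query": last}}

def run_tool(name, args):
    """Stub tool executor."""
    return f"tool_result: {name}({args})"

def agent(task: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):            # the loop
        reply = call_llm(messages)
        if reply["type"] == "answer":     # the model decided it is finished
            return reply["content"]
        result = run_tool(reply["tool"], reply["args"])
        messages.append({"role": "tool", "content": result})
    raise RuntimeError("agent exceeded max_steps")

print(agent("find flights to Lisbon"))
```

Writing even this much yourself makes framework documentation much easier to evaluate: every agent framework is, underneath, some variation of this loop.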

You’ll soon realise the complex part isn't getting the AI to do what you want; it's everything else around it: the same important infrastructure work that makes any system reliable at scale. Tests. Monitoring. Access controls. Incident response. Change management. The stuff that separates a demo from a product.

That might make agentic engineering sound more boring than you imagined. But boring means well-understood. Boring means you can look at how mature engineering teams run reliable systems and do the same thing. Boring means you can avoid making the common mistakes that businesses make when implementing agents in production.

The mistakes to avoid

There's a pattern here that we've seen before. When the web first appeared, companies treated it as something magical and separate. They created "internet divisions" staffed by people who knew HTML, as if the web required a fundamentally different kind of expertise. Eventually everyone realised that a web application is just an application.

AI agents are in that same awkward adolescent phase. Companies are setting up "AI teams" and treating AI as if it exists outside the normal rules of engineering. But the companies that will do best are the ones that figure out soonest that an AI agent is just another service. One with unusual properties, yes, but still a service. One that needs tests, monitoring, security, documentation, and all the other boring (and I mean that in the most loving way) things that make software work.

A proof of concept is the first 10% of the work. The remaining 90% is the ‘boring’ part of making it reliable. This is true of all software, but you may be surprised at how many businesses forget this fact when AI is involved. Demos end up so impressive that they create a kind of illusion, as if being impressive and being production-ready were the same thing. 

Running AI in production is just engineering. Do it carefully, and it works. Skip the boring parts because the demo was exciting, and it won’t.

What is actually different

That said, some things are bound to change when one of your components has a mind of its own. Deploying AI agents at scale requires critical extensions to the operational stack across the seven aforementioned layers (each of which will be covered in a subsequent article).

Observability

When a traditional service misbehaves, you can usually trace the problem to a specific input, a specific code path, a specific bug. When an AI agent misbehaves, the "bug" might be that it interpreted an ambiguous instruction differently than you expected. You need richer logs: what happened, what the model saw, what it decided, and why.
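One way to capture "what the model saw, what it decided, and why" is structured trace logging, with one record per agent step. The field names below are assumptions for illustration, not a standard schema (OpenTelemetry's GenAI semantic conventions are one real option if you want a standard).

```python
import json
import time

# A sketch of structured agent tracing: every step records what the model
# saw, what it decided, and why. Field names are illustrative assumptions.

def log_step(trace, step_type, **fields):
    trace.append({"ts": time.time(), "type": step_type, **fields})

trace = []
log_step(trace, "model_input", messages=["Refund order #123?"], model="example-model")
log_step(trace, "model_decision", action="call_tool", tool="lookup_order",
         rationale="need order status before deciding on refund")
log_step(trace, "tool_result", tool="lookup_order", output={"status": "delivered"})
log_step(trace, "final_answer", content="Refund approved per policy")

# Exported as JSON lines, each trace can be replayed when the agent misbehaves.
print("\n".join(json.dumps(entry) for entry in trace))
```

The payoff comes during incident review: instead of guessing why the agent acted, you replay the exact context and decision chain that led to the output.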

Evaluation

With deterministic code, you write tests: given this input, expect that output. With an AI agent, you need something more like a scoring jury. Did the agent complete the task? Was the answer relevant? Was it correct? You can even use other LLMs to judge the output, which sounds circular but works surprisingly well. Think of evaluation as your testing framework, except the assertions are fuzzy.
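A fuzzy-assertion eval harness can look like the sketch below. The `judge` function is a stub: in a real system it would prompt a second model to score the answer against a rubric. The rubric, threshold, and test cases are all assumptions for illustration.

```python
# A sketch of an evaluation harness with fuzzy assertions. `judge` is a stub
# standing in for an LLM-as-judge call; threshold and cases are assumptions.

def judge(question: str, answer: str, criterion: str) -> float:
    """Stub judge: a real implementation prompts a model for a 0-1 score."""
    return 1.0 if any(w in answer.lower() for w in question.lower().split()) else 0.0

def evaluate(cases, threshold=0.7):
    scores = [judge(c["question"], c["answer"], "relevance") for c in cases]
    mean = sum(scores) / len(scores)
    return mean, mean >= threshold  # pass/fail gate, usable in CI

cases = [
    {"question": "What is our refund window?",
     "answer": "The refund window is 30 days."},
    {"question": "Reset my password",
     "answer": "Use the reset link in account settings."},
]
mean, passed = evaluate(cases)
```

The structure mirrors a test suite: fixed inputs, a scoring function instead of exact-match asserts, and a threshold that fails the build when quality regresses.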

Guardrails and safety

The most powerful agents work best when you let them run freely: give them tools, let them loop until they've solved the problem. That freedom also makes their failures more spectacular. If a traditional program goes wrong, it usually does something narrowly wrong. If an agent goes wrong, it might do something creative wrong, which is much worse. You need sandboxes. You need guardrails. You need to think carefully about what tools the agent can access and what blast radius a mistake could have.
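A basic blast-radius control is to enforce a tool allowlist outside the model, with destructive actions gated behind human approval. The tool names and policy shape below are illustrative assumptions; the point is that the check lives in your code, not in the prompt.

```python
# A sketch of a tool-level guardrail: the agent can only call allowlisted
# tools, and destructive actions require explicit approval. All tool names
# and the policy shape are illustrative assumptions.

ALLOWED_TOOLS = {"search_docs", "create_ticket"}
NEEDS_APPROVAL = {"delete_record", "send_payment"}

def execute_tool(name, args, approved=False):
    if name in NEEDS_APPROVAL and not approved:
        return {"status": "blocked", "reason": f"{name} requires human approval"}
    if name not in ALLOWED_TOOLS | NEEDS_APPROVAL:
        return {"status": "blocked", "reason": f"{name} is not allowlisted"}
    return {"status": "ok", "result": f"{name} executed"}

assert execute_tool("search_docs", {})["status"] == "ok"
assert execute_tool("delete_record", {})["status"] == "blocked"  # no approval
assert execute_tool("rm_rf", {})["status"] == "blocked"          # not allowlisted
```

Because the gate sits between the model and the side effect, a "creatively wrong" agent can still only fail within the blast radius you chose for it.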

Security

Security for AI agents is different because the attack surface includes language itself. Traditional systems are exploited through code paths; agents can be exploited through instructions. Prompt injection turns untrusted input into adversarial control. A document, webpage, or user message can persuade the model to ignore its original constraints. That means your threat model must assume all external content is hostile. You need scoped credentials, least-privilege access to tools, and hard boundaries around side effects. Treat tool execution as a security boundary. If the agent’s reasoning is compromised, the permissions model should still prevent serious damage.
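Treating tool execution as a security boundary can be sketched as a scope check that runs regardless of what the model "decided". The scope names and tool-to-scope mapping below are assumptions for illustration.

```python
from dataclasses import dataclass, field

# A sketch of least-privilege tool execution: permissions are enforced
# outside the model, so even a prompt-injected agent cannot exceed its
# scopes. Scope names and the mapping are illustrative assumptions.

@dataclass
class AgentCredentials:
    scopes: set = field(default_factory=set)

TOOL_REQUIRED_SCOPE = {
    "read_order": "orders:read",
    "issue_refund": "payments:write",
}

def execute(tool: str, creds: AgentCredentials) -> str:
    required = TOOL_REQUIRED_SCOPE[tool]
    if required not in creds.scopes:
        # Enforced regardless of the model's reasoning: the boundary is the
        # permission check, not the prompt.
        raise PermissionError(f"{tool} needs scope {required}")
    return f"{tool}: ok"

support_agent = AgentCredentials(scopes={"orders:read"})
assert execute("read_order", support_agent) == "read_order: ok"
try:
    execute("issue_refund", support_agent)
    raise AssertionError("should have been blocked")
except PermissionError:
    pass  # injected instructions still cannot mint new scopes
```

If a hostile document persuades this agent to attempt a refund, the worst case is a `PermissionError`, which is exactly the property you want from the permissions model.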

Cost management

LLM costs don’t scale like traditional compute. They scale with reasoning depth: tokens, retries, tool loops, and context growth. A subtle prompt change can double usage, and an agent stuck in a loop can easily burn thousands of dollars before you realise what’s happened. You are probably already doing cost management, but with agents it becomes a first-class engineering concern. You need per-request attribution, token-level visibility, and dashboards by feature or customer. Set hard ceilings: max tokens, max iterations, circuit breakers when usage spikes. Evaluation and cost must be linked: sometimes a slightly worse model is economically superior. Unlike demos, production systems need the flexibility to optimise across quality, latency, and price, not just performance.
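Hard ceilings can be as simple as a per-request tracker that raises when a token budget or iteration cap is hit. The limits and token counts below are illustrative assumptions; real numbers depend on your models and margins.

```python
# A sketch of per-request cost ceilings: a token budget, an iteration cap,
# and a circuit breaker. All limits here are illustrative assumptions.

class BudgetExceeded(Exception):
    pass

class CostTracker:
    def __init__(self, max_tokens: int = 10_000, max_iterations: int = 8):
        self.tokens = 0
        self.iterations = 0
        self.max_tokens = max_tokens
        self.max_iterations = max_iterations

    def record(self, prompt_tokens: int, completion_tokens: int) -> None:
        self.tokens += prompt_tokens + completion_tokens
        self.iterations += 1
        if self.tokens > self.max_tokens:
            raise BudgetExceeded(f"token ceiling hit at {self.tokens}")
        if self.iterations > self.max_iterations:
            raise BudgetExceeded("iteration ceiling hit: possible loop")

tracker = CostTracker(max_tokens=3_000, max_iterations=3)
tracker.record(800, 200)      # step 1: 1,000 tokens so far
tracker.record(900, 300)      # step 2: 2,200 tokens so far
try:
    tracker.record(900, 300)  # step 3 crosses the 3,000-token ceiling
except BudgetExceeded:
    pass                      # circuit breaker trips before costs run away
```

Tagging each tracker with a feature or customer ID gives you the per-request attribution the dashboards need.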

Unified interfaces

Model capabilities evolve faster than any other dependency in your stack. If you tightly couple your application to one provider’s SDK, every upgrade comes with the risk of extra rewrite cost or the risk of being left behind. A unified interface layer abstracts model providers behind your own internal contract. Your application talks to that layer, not directly to vendors. It handles retries, fallbacks, logging, parameter normalisation, and model selection. This makes A/B testing, benchmarking, and switching providers far cheaper. The teams that move fastest are the ones who can swap models without touching product logic.
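The internal contract can be sketched as a small gateway that product code depends on, with providers plugged in behind it. The provider classes below are stubs, not real SDK calls; retries, logging and parameter normalisation would live in the gateway too.

```python
from abc import ABC, abstractmethod

# A sketch of a unified model interface: the application talks to one
# internal contract, and providers plug in behind it. Provider classes are
# stubs (assumptions), not real vendor SDKs.

class ModelProvider(ABC):
    @abstractmethod
    def complete(self, prompt: str) -> str: ...

class ProviderA(ModelProvider):
    def complete(self, prompt: str) -> str:
        return f"A:{prompt}"

class ProviderB(ModelProvider):
    def complete(self, prompt: str) -> str:
        return f"B:{prompt}"

class ModelGateway:
    """The only thing product code imports. Fallback shown here; retries,
    logging and model selection belong in this layer as well."""

    def __init__(self, primary: ModelProvider, fallback: ModelProvider):
        self.primary, self.fallback = primary, fallback

    def complete(self, prompt: str) -> str:
        try:
            return self.primary.complete(prompt)
        except Exception:
            return self.fallback.complete(prompt)

gateway = ModelGateway(primary=ProviderA(), fallback=ProviderB())
assert gateway.complete("hello") == "A:hello"
```

Swapping providers now means changing one constructor argument, which is what makes A/B tests and benchmarking cheap.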

Orchestration

A production agent participates in a workflow: it calls tools, retries failures, hands off tasks, waits for human approval, and maintains state across steps. Orchestration is the layer that makes that complexity explicit. You need durable state, idempotent side effects, timeouts, retries, and visibility into intermediate decisions. Long-running tasks must survive restarts. Failures must resume safely. Agents start to look like distributed systems more than chatbots.
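Durable state and safe resumption can be sketched as checkpointing each step's result, so a restarted run skips completed work instead of redoing (and re-billing) it. The step names and JSON-file storage are assumptions; production systems typically use a workflow engine or database for this.

```python
import json
import os
import tempfile

# A sketch of durable workflow state: each step's result is checkpointed,
# so a restarted run resumes instead of re-executing completed steps. Step
# names and the JSON-file store are illustrative assumptions.

def run_workflow(steps, state_path):
    state = {}
    if os.path.exists(state_path):          # resume from a prior interrupted run
        with open(state_path) as f:
            state = json.load(f)
    for name, fn in steps:
        if name in state:
            continue                        # idempotent: skip completed steps
        state[name] = fn()
        with open(state_path, "w") as f:
            json.dump(state, f)             # checkpoint after every step
    return state

path = os.path.join(tempfile.mkdtemp(), "state.json")
steps = [("fetch", lambda: "data"), ("summarise", lambda: "summary")]
first = run_workflow(steps, path)
second = run_workflow(steps, path)          # simulated restart: nothing re-runs
assert first == second == {"fetch": "data", "summarise": "summary"}
```

Add timeouts, retries with backoff, and a human-approval step type, and this starts to resemble the distributed-systems shape that production agents take.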

Next steps

AI agents don’t replace engineering fundamentals; they amplify them. The companies that succeed won’t be the ones with the flashiest demos, but the ones who apply disciplined, boring, production-grade practices to systems that happen to be non-deterministic.

What has changed is the behaviour of the component you’re deploying. When part of your system reasons probabilistically and acts autonomously, you need new layers to make that behaviour observable, measurable, constrained, and governable.

If you’re deciding where to begin, follow the hierarchy above. Start with observability and evaluation; without them, you’re flying blind, even in development. Then implement guardrails and security before exposing agents to real users or data. Finally, cost management, unified interfaces, and orchestration turn your working system into a scalable one.

Over the coming articles, I’ll break down each of these layers in depth, with practical guidance and trade-offs from real-world deployments, beginning next week with observability, the foundation everything else depends on.