Beyond Prompts: Building AI Systems That Actually Work
By Sam Davies
Hello, you!
We’re back for a special Sponsored edition of AWSCQ.
For this edition we’ve got Sam Davies - Head of Technology for Thoughtworks in Europe.
Before we hand over the reins, just a wee reminder that we’ve our first AWS Community Summit: Birmingham this June. Grab your tickets at the comsum site now.
Okidoki - over to Sam!
Introduction
Over the last twelve months, change in AI has continued to accelerate at a blistering pace. New models and tools appear almost weekly. But if we can now create prototypes and generate code and features quicker than ever, can we govern, test, and release them at the same pace? Broadly, the answer is “no”.
To prompt is not enough
The focus of the last 12 months has been on context engineering: the practice of designing, structuring, and delivering precise, relevant information to an LLM’s input window at the right time to improve the model’s accuracy and performance. That’s exactly what is needed to make models effective, but it’s just one context. It doesn’t cover governing, releasing, maintaining, and operating the systems produced, and we rarely provide context about any of those.
In most systems today, it is humans who hold these contexts, along with residual memory from previous experiences and interactions. Those humans have also been responsible for evolving the ways of working, conventions, tooling, workflows and engineering practices that have brought about repeatability (albeit through a very large human context window).
Getting agentic workflows to run smoothly across a development team is genuinely hard, and the problems tend to cluster around engineering practices and the following key areas:
Trust and control boundaries
Teams struggle to agree on how much autonomy agents should have. Different organisations have different risk tolerances, rates of change and regulations. Without clear policies on what agents can do without human approval, you end up with either too much friction (constant confirmations) or too little oversight (agents taking actions nobody sanctioned).
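One way to make such a policy explicit is to write it down as data rather than leave it in people’s heads. The sketch below is a minimal illustration, not a recommended standard: the action names and the three tiers are hypothetical, and the only design choice it argues for is that anything not listed falls back to human approval.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"                # agent may proceed on its own
    REQUIRE_APPROVAL = "approval"  # pause and ask a human first
    DENY = "deny"                  # never permitted for an agent


# Hypothetical policy table: these action names are illustrative only.
POLICY = {
    "read_repo": Decision.ALLOW,
    "open_pull_request": Decision.ALLOW,
    "merge_to_main": Decision.REQUIRE_APPROVAL,
    "deploy_production": Decision.REQUIRE_APPROVAL,
    "delete_database": Decision.DENY,
}


def decide(action: str) -> Decision:
    """Return the policy decision for an action.

    Unknown actions default to human approval, so a newly added
    capability is never silently autonomous.
    """
    return POLICY.get(action, Decision.REQUIRE_APPROVAL)
```

Because the table is plain data, it can be reviewed and versioned like any other change, and tightened or loosened to match your organisation’s risk tolerance.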
Context and handoff quality
Agents are only as good as the context they receive. When work passes between humans and agents, or between multiple agents, context degrades fast. Implicit knowledge that lives in someone’s head doesn’t make it into prompts, and agents make wrong assumptions as a result. How many of the systems we build are, in reality, “one-shot”? Not many.
Humans are slow and inconsistent at providing the inputs agents need: clear goals, the right documents, domain clarity, and decision-making.
Observability
It’s surprisingly hard to know what your agents actually did and why (and saving some of that information for the future could be really helpful). Without good observability, understanding failures and systems health can often feel like poking around in the dark. Teams often don’t invest in this until something goes wrong in a costly way, but it should be an engineering concern from the outset.
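As a rough sketch of what “what did the agent do, and why” could look like in practice, the example below records each agent action as a structured JSON line. The field names (agent, action, reason, outcome) are assumptions for illustration, and the in-memory list stands in for whatever log store you actually use.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class AgentEvent:
    agent: str    # which agent acted
    action: str   # what it did
    reason: str   # why it says it did it
    outcome: str  # what happened as a result
    timestamp: float = field(default_factory=time.time)


class AuditLog:
    """Keeps one JSON line per event; an in-memory stand-in for a real store."""

    def __init__(self) -> None:
        self._lines = []

    def record(self, event: AgentEvent) -> None:
        self._lines.append(json.dumps(asdict(event), sort_keys=True))

    def events_for(self, agent: str) -> list:
        """Return the recorded events for one agent, oldest first."""
        return [e for e in map(json.loads, self._lines) if e["agent"] == agent]
```

Capturing the stated reason alongside the action is the part that pays off later, both for debugging and as context for the next person or agent.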
Instruction drift
Agent instructions change over time, but unlike code, they often aren’t version-controlled or reviewed with the same rigour. One person tweaks a prompt or a settings file and the whole workflow behaves differently, and nobody knows why. Teams need the discipline to agree, share, and version the settings and instructions, and to maintain Architectural Decision Records (ADRs) for future teammates and agentic partners, so that key decisions aren’t undone and remade.
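A lightweight way to notice this kind of drift is to fingerprint every instruction asset at review time and compare against those fingerprints before a run. The sketch below assumes a hypothetical layout in which the reviewed fingerprints are kept in a dictionary keyed by asset name. In a real team that manifest would live in version control next to the prompts themselves.

```python
import hashlib


def fingerprint(text: str) -> str:
    """SHA-256 hex digest of an instruction asset's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def detect_drift(reviewed: dict, current: dict) -> list:
    """Return the names of assets that no longer match what was reviewed.

    reviewed maps asset name -> fingerprint recorded at review time.
    current maps asset name -> the text in use right now.
    Assets that were added or removed without review also count as drift.
    """
    drifted = []
    for name in sorted(set(reviewed) | set(current)):
        if name not in reviewed or name not in current:
            drifted.append(name)
        elif fingerprint(current[name]) != reviewed[name]:
            drifted.append(name)
    return drifted
```

A check like this doesn’t replace review, but it turns “nobody knows why” into a named file that changed.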
Handling the unexpected
It’s often hard for humans to think beyond the “happy path” and be explicit about all potential failures. Controlling an agent’s behaviour when it encounters an unexpected state needs a clearly thought-out escalation path or graceful degradation strategy, defined up front as input to the LLM.
Tool and permission sprawl
As teams give agents more tools (e.g., MCPs) or bring in external autonomous AI agents, the attack surface and the surface area for mistakes grow. Agents with broad permissions will eventually use them in ways you didn’t anticipate; currently, least-privilege thinking is rarely applied. The software supply chain for your application becomes more complex and less transparent to the engineers, and whilst that might be okay in a static world, this is no longer a static world.
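To illustrate least-privilege thinking applied to agents, the sketch below gives each agent an explicit allowlist of tools and refuses everything else. The agent names, tool names, and the registry of plain Python callables are all hypothetical. A real setup would enforce this at the tool server or gateway rather than in application code.

```python
class ToolNotPermitted(Exception):
    """Raised when an agent calls a tool it has not been granted."""


# Hypothetical grants: each agent gets only the tools its job needs.
GRANTS = {
    "docs_agent": frozenset({"search_docs", "read_file"}),
    "release_agent": frozenset({"read_file", "create_tag"}),
}


def call_tool(agent: str, tool: str, registry: dict, *args):
    """Invoke a tool on behalf of an agent, only if it is on the allowlist.

    An agent with no entry in GRANTS has no permissions at all.
    """
    if tool not in GRANTS.get(agent, frozenset()):
        raise ToolNotPermitted(f"{agent} is not granted {tool}")
    return registry[tool](*args)
```

Adding a tool then becomes a visible, reviewable change to the grants rather than a side effect of installing something new.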
Coordinating agents
Multi-agent systems introduce orchestration complexity: who delegates to whom, how conflicts get resolved, and how you avoid redundant or contradictory actions. This is still a largely unsolved design problem for most teams. There are lots of emerging strategies and tools to help cope with these, but we’ve transferred the problem of software development complexity from the code to somewhere else.
Evaluation
Working out whether your agentic workflow is actually performing well is hard. Unlike a model benchmark, agentic workflows have long-horizon success criteria that are hard to measure systematically; consequently, teams often rely on gut feel and spot checks.
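One modest step beyond spot checks is to record the outcome of each workflow run and compute pass rates against criteria the team has agreed explicitly. The sketch below is only an illustration: the fields on Run and the three checks (including the threshold of two human edits) are assumptions, and your own success criteria will differ.

```python
from dataclasses import dataclass


@dataclass
class Run:
    task: str
    tests_passed: bool  # did the resulting change pass the test suite?
    human_edits: int    # how many corrections a human made afterwards
    escalated: bool     # did the agent have to hand the task back?


# Hypothetical success criteria, each a yes/no check on a single run.
CHECKS = {
    "tests_pass": lambda r: r.tests_passed,
    "little_rework": lambda r: r.human_edits <= 2,
    "no_escalation": lambda r: not r.escalated,
}


def pass_rates(runs: list) -> dict:
    """Return the fraction of runs passing each check (empty if no runs)."""
    if not runs:
        return {}
    return {
        name: sum(1 for r in runs if check(r)) / len(runs)
        for name, check in CHECKS.items()
    }
```

Tracked over time, rates like these give a team something firmer than intuition to compare before and after a prompt, tool, or model change.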
The friction of having Humans in the loop (HITL)
The typical implementation of HITL has been as a workaround or a bolt-on to satisfy governance. The user experience of human-agent collaboration (approval interfaces, context surfacing, observability, regression, and easy overrides) is often similarly an afterthought. We need to put more upfront thought into how we use humans in the process, both to get the best out of them and to create a meaningful role that brings empathy and understanding of their value.
So what should you do?
The underlying thread across most of these is that agentic systems require operational discipline that teams haven’t had to develop before—somewhere between software engineering rigour and the kind of process design you’d apply to human workflows. Most teams are still figuring this out.
AI is going to amplify what is there: both the good and the bad. You need to double down on your engineering practices, as small changes, upgrades, dependencies, and new models can—and will—produce unexpected outcomes.
Teams need to manage any assets that inform the context as though they were source code (because, effectively, they are!).
Use cloud services where you can, and build strong guardrails that are independent of your model, so that you gain better verification, transparency, and confidence in both your work and the checks in place. Build up the wider, more permanent context of decisions made and “roads not travelled,” and tell stories, providing examples in your own domain that people can see implemented in the resultant application.
We recently ran an independent meetup of recognised technologists focused on what the future of software engineering will look like. One of the key themes that came up again and again was that the rigour has to go somewhere. As AI agents produce more and more code, the engineering discipline doesn’t disappear but instead moves elsewhere.
So for the sake of future you or me, focus on engineering discipline, transparency and repeatability for the next person who may need to understand what is there and how it was produced.
From our latest Looking Glass report: https://www.thoughtworks.com/insights/looking-glass
“The rise of AI reliability engineering. With concerns about the (mis)use of AI growing, more leaders will invest in ethical design frameworks and agentic orchestration to ensure resilience and trust. The most successful organisations will not just use AI to optimise decisions; they will reimagine how intelligence itself is operationalised within the enterprise and leverage exemplary AI governance as a differentiator.”
And that’s all for another AWSCQ!
Thanks to Sam at Thoughtworks for putting this together.
Thanks as always to our fabulous sponsors!
See you for the next one…