Stories by Thoughtworks on Medium

Key themes in Technology Radar Vol.34

Thoughtworks — Mon, 20 Apr 2026 07:42:28 GMT

Podcast host Ken Mugrage | Podcast guests Alessio Ferri and Jim Gumbley

Listen to the episode on the Thoughtworks Technology Podcast.

In April 2026 we published a new edition of the Thoughtworks Technology Radar — volume 34. Like many recent volumes, this one was dominated by AI. However, while editions over the last couple of years have illustrated the dizzying proliferation of AI-related technologies, vol.34 indicates a degree of evolution in the field, demonstrated by a focus on consistency, reliability and mitigating the collaborative and individual challenges of working with AI. This is reflected in the four themes identified for this Radar: the challenge of evaluating technology in an agentic world; retaining principles, relinquishing patterns; securing permission-hungry agents; putting coding agents on a leash.

On this special Technology Radar episode of the Technology Podcast, host Ken Mugrage is joined by Alessio Ferri and Jim Gumbley to discuss the key themes in Technology Radar Vol.34. They dive into topics including cognitive debt, harness engineering and the lethal trifecta. Listen to gain a deeper understanding not just of the latest Radar but, more importantly, of what AI-assisted and agentic software engineering really looks like today.

Read the latest volume of the Thoughtworks Technology Radar.

Macro trends in the tech industry | April 2026

Thoughtworks — Thu, 16 Apr 2026 12:35:38 GMT

By Richard Gall

The last few editions of the Technology Radar have captured relentless AI-accelerated change in the industry. However, while recent volumes have reflected the astounding energy of the field, from the proliferation of new tools to the almost monthly emergence of new terms and concepts, volume 34 is different: it highlights a level of maturity, a moving away from endless experimentation to a desire for repeatability and stability and something cognitively manageable.

However, this isn’t to say things are stabilizing. The macro trends in the tech industry reflected in volume 34 all speak to unresolved tensions: between reliability and AI’s unpredictability, between AI acceleration and developer experience, and between past and future practices.

Searching for consistency and reliability

Consistency and reliability have always been significant concerns in AI. However, in the early part of 2026 they appear to have shifted from one of many issues to one of the most critical, perhaps driven by increasing adoption and the step change in capabilities we witnessed at the end of 2025. The best evidence of this is the emergence of the term ‘harness engineering’ in recent months.

Harness engineering

Broadly speaking, harness engineering refers to the infrastructure, constraints and feedback loops that wrap around AI agents to improve their reliability. Part of this is an extension or evolution of spec-driven development (SDD); one of the ways in which we can harness agents is by using SDD frameworks such as OpenSpec and GitHub SpecKit to provide guardrails and structured workflows.

However, it also goes beyond this to consider the ways in which agents ‘learn’ and self-correct. In this edition we featured something called the ‘feedback flywheel’, which essentially adds a further step to the spec → plan → implement flow typical in SDD aimed at iteratively improving the coding agent. It’s worth flagging a number of techniques here including feedback sensors for coding agents to reduce the manual review burden and provide agentic systems with the capacity to improve themselves.

Sandboxing

This apparent desire for increased reliability arguably suggests a growing awareness of the many risks associated with AI and agent assistants in software engineering. However, while we welcome the expansion of risk-aware practices like sandboxing coding agents, demonstrated in blips on this Radar including Dev Containers and Sprites, it would be wrong to think there’s been an industry about-face. There’s certainly lots of high-risk experimentation happening, including agent coding swarm projects like Steve Yegge’s Gastown. While these are intriguing and may offer insight for the future of software engineering, as we note in this volume, these need to be approached with caution.

It’s also worth noting the importance of agent durability in the context of reliability. We’ve noticed that ignoring agent durability is a bit of an antipattern, with teams successfully developing agent workflows only to find they fail when deployed to production in complex distributed systems. Bringing durable computing approaches and tools such as Golem and Temporal to bear on these use cases can help minimize the risks of execution failures.

Rethinking developer experience and productivity

Many of the practices that grapple with AI reliability are closely related to the question of the role of the developer in the software development process: where should the humans have control? What needs to be reviewed? What needs to be iterated manually and what can be automated?

One of the things that’s becoming clear is that ‘agentic’ coding poses challenges for developer experience. This is something we’ve been thinking about a lot at Thoughtworks; indeed, even before we began putting this volume of the Radar together, the potential for AI workflows to degrade developer experiences, leading to a divergence between productivity and personal flow and satisfaction, was a significant topic of discussion at Martin Fowler’s Future of Software Development Retreat.

Measuring the right things

Undoubtedly some of the challenges are cultural, informed by misunderstandings of what AI can and cannot do. For instance, despite long-running discussion on this topic, we thought it was still important to caution against using coding throughput as a measure of productivity. As an alternative we suggest measuring collaboration quality with coding agents using metrics such as iteration cycles per task, post-merge rework and failed builds. This shifts the focus in a way that ensures developers are focusing on the right things and should ultimately lead to higher quality software being delivered.
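
To make these measures concrete, here is a minimal sketch of how they might be computed from per-task records. The `AgentTask` record shape, its field names and the sample data are assumptions made for illustration; they are not taken from the Radar or from any particular tool.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class AgentTask:
    """One task completed with a coding agent (hypothetical record shape)."""
    task_id: str
    iteration_cycles: int        # prompt, review, re-prompt rounds before merge
    builds_run: int              # CI builds triggered for the task
    builds_failed: int           # how many of those builds failed
    reworked_after_merge: bool   # True if follow-up fixes were needed post-merge


def collaboration_metrics(tasks: list[AgentTask]) -> dict[str, float]:
    """Summarize collaboration quality rather than raw coding throughput."""
    if not tasks:
        raise ValueError("need at least one task")
    total_builds = sum(t.builds_run for t in tasks)
    return {
        "avg_iteration_cycles": mean(t.iteration_cycles for t in tasks),
        "post_merge_rework_rate": sum(t.reworked_after_merge for t in tasks) / len(tasks),
        "failed_build_rate": (
            sum(t.builds_failed for t in tasks) / total_builds if total_builds else 0.0
        ),
    }


# Made-up sample data, purely for illustration.
tasks = [
    AgentTask("T-1", iteration_cycles=2, builds_run=3, builds_failed=1, reworked_after_merge=False),
    AgentTask("T-2", iteration_cycles=4, builds_run=5, builds_failed=1, reworked_after_merge=True),
]
print(collaboration_metrics(tasks))
```

None of these numbers rewards producing more code; each one goes up when the human and the agent are working poorly together, which is the shift in focus described above.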

MCP scepticism and the return to the command line

One of the major changes in the experience of developing with AI is the shift away from the Model Context Protocol (MCP). While it was hailed as a game-changer 12 months ago, in many instances it isn’t necessary, which is why we’ve cautioned against MCP by default. This isn’t to say it should never be used, but rather that there are often more appropriate approaches that avoid what Justin Poehnelt calls ‘the abstraction tax’.

Interestingly, it appears reservations around MCP have led to a return to the command line. One of the reasons for this is Agent Skills, an open standard that packages instructions, executable scripts and other associated resources to modularize and progressively disclose context to coding agents. This means that rather than interfacing with an MCP server, a given skill can be called from the command line. Alongside Agent Skills is the Claude Code plugin marketplace, which we’ve found significantly improves developer experience and collaboration; workflows and other resources can be easily synced with the CLI.

The importance of older and established practices

The return to the command line points to another trend we noticed during Radar discussions: persistence or re-emergence of older and established technologies and practices.

To a certain extent this is a corrective to the period of novelty and experimentation we’ve been living through; when things are constantly changing, ensuring there are robust and even familiar foundations takes on even greater importance. Measuring the right things is something we’ve already discussed; we wanted to emphasize how critical this is by placing the DORA metrics on this edition of the Radar. Yes, they’re almost as old as the Technology Radar itself (introduced more than a decade ago), but they have a vital role to play in helping teams ensure they’re focusing on what’s most critical even when practices and technology shapes are continually evolving. We even noted in the write up for that particular blip that using the DORA metrics effectively doesn’t require sophisticated and complex tracking and dashboards; if anything these can be a distraction, and, as we write, “simple mechanisms, such as check-ins during retrospectives” can be much more powerful.

Another established technique that we’ve featured in this volume of the Radar is zero trust architecture. It first appeared on the Radar in May 2020 and moved to Adopt in October 2021; half a decade later, we wanted to bring attention to it again. “Principles such as ‘never trust, always verify,’ along with identity-based security and least-privilege access, should be treated as foundational for any agent deployment,” we write.

We also talked a lot about testing, this time discussing mutation testing and building on last edition’s mention of fuzz testing by featuring WuppieFuzz. These techniques certainly aren’t new, but recent developments in AI are both lowering the barrier to entry and making it more important to test for a wider range of unpredictable behaviors.

Cognitive debt

What ties this all together is the issue of cognitive debt. Yes, AI can accelerate many parts of the software development process (far beyond just writing code), but in doing so it does two things: first, it creates greater distance between developers and the software they’re responsible for and, second, it increases the range of tasks and problems they may be working on.

We caution against codebase cognitive debt on this edition of the Radar, but it’s also important to think beyond day-to-day work to recognize how we may individually incur cognitive debt as professionals. If you offload everything to a coding assistant, what are you avoiding learning? And what might the impact be in the future? Of course, this is always a question of trade-offs; an important part of a developer’s skillset is knowing what to pay attention to. However, given the speed of AI-accelerated change, consistent self-reflection is important as a reminder of our agency.

For all the novelty in the industry at the moment, much of the most interesting work in this area is exploring exactly how we can manage cognitive debt, whether that’s at an individual, project or organization level. We’re excited to continue monitoring this work in future volumes and contributing to ideas and practices that help technologists everywhere.

The cognitive demands of AI novelty

Thoughtworks — Thu, 16 Apr 2026 12:32:41 GMT

Too young to blip?

By Alessio Ferri

The Technology Radar is an opinionated snapshot in time of technologies and techniques we’ve used with our clients. It offers insights across many dimensions, but one that’s particularly interesting is that it surfaces new technologies people find useful. No one can know or experience everything happening in software development, but, thanks to the Radar, we get to explore some of the most exciting or important things our colleagues are working with.

However, when putting this edition together we encountered a particularly extreme level of newness in ways we haven’t before. Forget ‘emerging technologies’; some of the things we discussed were barely a few weeks old. This presented a dilemma: yes they were interesting, but they were typically far too new for us to really be able to say much confidently about them in a publication like the Technology Radar.

While the Radar has always brought us face to face with the novel, there was something unique about what we found in this edition. Yes, we’ve had periods like the JavaScript framework boom in the 2010s, but this is different. What’s more, it also speaks to the nature of the AI-accelerated moment we’re working in right now.

AI-driven volume and velocity

Out of more than 300 proposed technologies and techniques put forward by Thoughtworkers, a significant proportion had a) only been around for a few weeks and b) very few GitHub contributors. Often, one of the contributors was a coding agent.

It’s hard to see this as anything other than evidence of AI-accelerated proliferation. The fact we had more proposals for this edition than we’d received from colleagues since 2023 indicates we’re entering an era distinct from the one we were in 12 months ago, one in which the barrier to creating software has reduced drastically thanks to the improvements in AI since the end of 2025.

A solo developer with an idea and a few free hours can now quickly produce something of seemingly high quality that meets open source standards, complete with well-documented READMEs, SBOMs, clean implementations, licensing, contribution models, documentation, CI badges, stars and a history of multiple releases.

Semantic diffusion

Parallel to the proliferation of software is the proliferation — and diffusion — of language. When new things are developed they require concepts and descriptions to communicate what the software does and why. Given the ease with which such content can now be created, we are faced not only with lots of similar kinds of software, but new words for subtly different things. Far from elucidating things, this often has the opposite effect: it adds to confusion.

To compound the complexity of the present moment, a lack of shared understanding means underlying practices are not only evolving quickly, but also in different ways that aren’t easily articulated. Indeed, even if we are building in good faith, without a language that has stabilized it’s challenging for practices to stabilize and mature.

Four implications for software developers

The implications for software developers are multifaceted, but there are a few key things to keep in mind.

Intensifying the challenge of evaluating software

The different ways language is used to name and describe technologies and practices make evaluating those technologies more challenging. Without a clear shared understanding of what’s being discussed, how can we assess what’s in front of us?

This is something we encountered when putting the Radar together — to be able to judge and evaluate we spent considerable time discussing and trying to clarify what was being referred to. There’s an obvious cognitive demand here that’s additional to the challenges of day-to-day software development.

Increasing cognitive debt

The second implication is cognitive debt. The rapid pace of change and endless novelty mean developers may lack understanding and appropriate mental models of not only the things they’re using but also what their colleagues are actually doing. In short, there’s a risk of a kind of organizational atomization.

Distinguishing between disposable and durable code

This cognitive burden is related to a third issue, which is about the importance of distinguishing between disposable and durable code. As Charity Majors explained in an article written last year, disposable code is that which is created for prototypes, scripts and experiments, while durable code is that on which long-running systems are built. The former tolerates a shallow understanding of the code and requires relatively little governance and maintenance because its lifecycle is very short; the latter, though, demands developer understanding, appropriate documentation and, of course, ongoing evolution and maintenance.

This isn’t to say we need to avoid using AI. It’s more that we need to recognize which mode we’re working in and be intentional about what kind of software we’re creating and, consequently, how far in the loop we as developers feel we need to be.

Developers who understand this distinction and are able to intentionally move between these two modes will undoubtedly have an edge. Those who don’t will accumulate cognitive debt for the durable software they build, and, as Majors notes, costs will be much higher.

As with technical debt, cognitive debt isn’t inherently bad; we may well reasonably choose to do something with low cognitive burden with the knowledge we’ll need to address the debt in the future. It’s really a question of awareness and intention.

Securing and governing software

Cognitive debt will inevitably weaken security posture. This is because developers lack the internal understanding or landscape knowledge to respond to incidents or perform effective threat modeling.

One threat in particular exacerbated by AI-accelerated cognitive debt is the software supply chain and the risk of malicious prompt injection. When we call on AI to develop software, it’s extremely easy to lose sight of the dependencies of our systems. There might well be vulnerabilities we’re unaware of — a vulnerability discovered at the end of March 2026 in AI gateway LiteLLM was found to steal credentials, compromising users’ applications.

At the heart of this is what Simon Willison calls the lethal trifecta: the ability for agents to access private data, exposure to untrusted content and the ability to communicate with external data and systems. In the context of coding agents, the risks are exacerbated when developers are managing significant cognitive load.
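
One way to keep the trifecta visible is to encode it as an explicit check on how an agent session is configured. The sketch below is illustrative only: `AgentCapabilities` and its field names are hypothetical and do not correspond to any real agent framework's configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentCapabilities:
    """Hypothetical description of what an agent session is allowed to do."""
    reads_private_data: bool          # e.g. source code, secrets, customer records
    sees_untrusted_content: bool      # e.g. web pages, issues, third-party READMEs
    can_communicate_externally: bool  # e.g. outbound HTTP, email, pushing to remotes


def has_lethal_trifecta(caps: AgentCapabilities) -> bool:
    """True only when all three risk factors are present at the same time."""
    return (
        caps.reads_private_data
        and caps.sees_untrusted_content
        and caps.can_communicate_externally
    )


# A sandboxed session with no outbound network breaks the trifecta.
sandboxed = AgentCapabilities(True, True, False)
# An unrestricted session combines all three factors.
unrestricted = AgentCapabilities(True, True, True)

print(has_lethal_trifecta(sandboxed), has_lethal_trifecta(unrestricted))
```

The point of the check is that removing any one of the three capabilities is enough to leave the dangerous combination, which is why sandboxing and restricted network access are such common mitigations.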

Ambiguity requires sharper judgment

It’s not clear whether this is the new normal; however, the current cognitive demand isn’t sustainable and will arguably undermine the possible gains AI can deliver. Consequently, we may see an explosion of new tools, evaluation frameworks and trust signals to help us assess a much larger volume of these technologies.

Uncertainty won’t resolve before teams need to make decisions; waiting for the ecosystem to settle is itself a choice, and an increasingly expensive one. What’s needed isn’t a higher tolerance for ambiguity, but sharper judgment when operating inside it. That means knowing which signals matter, distinguishing “too early to assess” from “too early to adopt”, and being willing to revisit previous decisions quickly.

Navigating the AI imperative: A strategic framework for AI enterprise adoption and risk management

Thoughtworks — Fri, 03 Apr 2026 09:19:04 GMT

By Sunit Parekh

Artificial Intelligence is no longer just a buzzword or a futuristic concept; it has become a de facto mandate for just about every organization. Boardrooms and executive teams across the globe are asking not if they should adopt AI, but how fast they can. However, the rush to deploy AI often overshadows a critical reality: that the true challenge lies not in the technology itself, but in how gracefully an enterprise can integrate it.

Successful AI adoption requires defining clear principles, robust guidelines and uncompromising guardrails. A one-size-fits-all approach to AI deployment is a recipe for operational friction and unmanaged risk. To navigate this landscape safely and effectively, organizations must first categorize their AI use cases into three distinct tiers, applying a tailored strategy and risk profile to each.

In the first of a series, I propose a practical framework for categorizing AI use — and managing the associated enterprise risks.

Category 1: Frontline AI — Revenue-generating and direct-to-customer

This category encompasses AI applications that sit squarely on the critical path to the customer and directly impact the top line. Examples include dynamically calculating insurance policy premiums, automated risk assessment and underwriting in lending, and customer-facing service chatbots handling live queries.

The strategy and risk profile

Risk level: Critical

Because these systems directly interface with customers and govern financial transactions, the stakes are exceptionally high. Errors here can lead to direct revenue loss, severe brand damage, regulatory penalties and increased susceptibility to fraud.

The right approach:

For AI use cases in this category, organizations cannot afford to cut corners. You need highly sophisticated, enterprise-grade AI solutions. This means the strategy must prioritize:

Strong risk controls: Rigorous testing for bias, accuracy and fairness before deployment.
Multi-model validation: Deploying and cross-referencing output from two or more distinct AI models (e.g., one proprietary, one open-source) to reduce reliance on a single point of failure and to verify results before customer interaction.
Explainability: In regulated industries like finance and insurance, the AI’s decision-making process must be transparent and auditable.
Continuous monitoring: Real-time dashboards to track model drift and performance degradation, ensuring the AI behaves exactly as intended under shifting market conditions.
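
To illustrate the multi-model validation control, the following is a minimal sketch that releases a result to the customer only when two independent models roughly agree. The function name, the 5% tolerance and the stand-in lambda "models" are all assumptions made for the example; a real deployment would call actual model endpoints and choose its own agreement rule.

```python
from typing import Callable, Optional

# A "model" here is any callable that maps an applicant record to a premium.
Model = Callable[[dict], float]


def cross_validated_quote(
    applicant: dict,
    primary: Model,
    secondary: Model,
    max_relative_gap: float = 0.05,  # assumed tolerance, not a recommended value
) -> Optional[float]:
    """Return the primary model's quote only if the second model roughly agrees.

    Returns None when the outputs diverge, signalling that the case should be
    escalated (for example to a human underwriter) instead of being shown to
    the customer.
    """
    a = primary(applicant)
    b = secondary(applicant)
    baseline = max(abs(a), abs(b))
    if baseline == 0:
        return a  # both models returned zero, so they agree
    gap = abs(a - b) / baseline
    return a if gap <= max_relative_gap else None


# Stand-in models, purely for illustration.
model_a = lambda applicant: 100.0 + applicant["age"]
model_b = lambda applicant: 102.0 + applicant["age"]
model_c = lambda applicant: 200.0

print(cross_validated_quote({"age": 30}, model_a, model_b))  # models agree
print(cross_validated_quote({"age": 30}, model_a, model_c))  # models diverge
```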

Category 2: Productivity AI — Business and operational assistant

The second category revolves around internal empowerment. Here, AI acts as a co-pilot for your workforce, augmenting their capabilities rather than acting autonomously on behalf of the company. Examples here include running complex analyses on massive internal datasets, synthesizing reports or deploying an internal chatbot to serve as a conversational employee knowledge hub.

The strategy and risk profile

Risk level: Moderate

While this use case is less risky than customer-facing AI, the danger here lies in hallucinations leading to poor internal decision-making.

The right approach:

The strategy for this tier should lean heavily on embedded solutions — such as Microsoft Copilot integrated into existing office suites or enterprise search tools.

Human-in-the-Loop (HITL): The golden rule for this category is that a human must always review the AI’s output before it is acted upon or published.
Safe adoption: Because the final decision rests with a human employee, organizations can adopt these tools relatively quickly provided they invest in basic AI literacy and training for their workforce on how to verify AI-generated insights.

Category 3: Supporting AI — Non-customer and non-business

This final category includes AI used for specialized, deeply internal or highly technical workflows that don’t directly touch the end customer or general business operations. Examples here include AI-assisted software development (e.g., GitHub Copilot / Claude Code generating code snippets, automating test scripts), IT infrastructure optimization and back-end data processing.

The strategy and risk profile

Risk level: Low to moderate

While generating bad code or optimizing a server poorly carries operational risk, these environments are already built to catch errors before they reach production.

The right approach:

This is where organizations should be highly experimental. You can afford to push the boundaries of AI capabilities here because of the inherent structure of modern engineering and IT workflows.

Multi-layered checkpoints: Similar to Category 2, there’s a human-in-the-loop (the developer reviewing the code).
Automated guardrails: Beyond human review, this tier benefits from rigorous automated safety nets. For instance, if an AI generates code, it must pass through automated code-scanning agents that check for security vulnerabilities, syntax errors and compliance with coding standards before it’s merged into the main product.
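
As a rough sketch of what an automated guardrail could look like, the snippet below gates a piece of generated Python on two checks: it must parse, and it must not call functions on a deny list. The `BANNED_CALLS` policy and the function name are hypothetical stand-ins; real pipelines would rely on dedicated security scanners and linters rather than this toy check.

```python
import ast

BANNED_CALLS = {"eval", "exec"}  # assumed stand-in for a real security policy


def guardrail_findings(source: str) -> list[str]:
    """Return reasons to block a merge; an empty list means the gate passes."""
    try:
        tree = ast.parse(source)
    except SyntaxError as err:
        return [f"syntax error on line {err.lineno}: {err.msg}"]

    findings = []
    for node in ast.walk(tree):
        # Flag direct calls such as eval(...) or exec(...).
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in BANNED_CALLS
        ):
            findings.append(f"banned call {node.func.id}() on line {node.lineno}")
    return findings


print(guardrail_findings("total = 1 + 1\n"))
print(guardrail_findings("result = eval(user_input)\n"))
```

Because the gate is automated, it runs on every change regardless of how carefully the human reviewer looked at it, which is what makes the two layers complementary.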

Conclusion

The AI mandate is clear, but reckless adoption is not the answer. By categorizing AI initiatives into direct-to-customer, business assistant, and back-office operational tiers, enterprises can deploy their resources smartly.

Treat high-risk revenue drivers with the utmost caution and enterprise-grade scrutiny. Empower your general workforce with embedded, human-supervised AI assistants. Finally, unleash your technical teams to experiment rapidly within the safety of automated, heavily fortified guardrails. This tiered strategy ensures that your enterprise not only adopts AI gracefully but leverages it as a sustainable, secure competitive advantage.

Originally published at https://www.thoughtworks.com.

How to perform a structured evaluation of AI conversational solutions

Thoughtworks — Fri, 03 Apr 2026 06:54:43 GMT

By Ashwin Mattur, Rajgokul R M, Sharanya S, Zichuan Xiong and Anushrav Vatsa

Knowledge-driven conversational solutions are essential in modern digital workflows. They allow large language models to interact with curated knowledge bases for a diverse range of tasks including customer support, search and analytics. However, evaluating them poses some challenges: they’re typically composed of multiple interdependent components, draw on continuously evolving sources of data and feature opaque, black-box mechanisms.

In this post, we detail how we evaluated an enterprise AI conversational system powered by AWS Bedrock Agents and Amazon Kendra, using Weights & Biases (W&B) Weave.

Before this, our evaluation relied on ad hoc manual testing. Team members submitted queries to the system, manually inspected the generated responses, and subjectively judged quality. This approach was inconsistent across evaluators, couldn’t scale as the knowledge base grew, lacked reproducibility and provided no way to isolate whether failures originated from retrieval, prompting, or generation.

By replacing this fragmented process with a structured, metrics-driven evaluation framework, we achieved significant improvements in system performance, accuracy and stakeholder alignment.

The challenges of evaluating AI systems

Although traditional metrics like accuracy or F1 score are applicable to algorithmically-generated text data, they’re insufficient for assessing a complex conversational system. This isn’t because we can’t calculate precision/recall for text, but because binary classifications miss the multiple dimensions of retrieval systems. Our evaluation identified several critical challenges:

Black-box evaluation limitations. Bedrock Agents concealed retrieval mechanisms, which made it difficult to directly calculate retrieval metrics.
Inconsistent source attribution. Generated responses were often missing citations linked to knowledge base documents.
An evolving knowledge base. As the knowledge base expanded from object storage to enterprise search, the evaluation criteria needed to evolve.
Multi-stakeholder alignment. Business users, technical teams and compliance officers all required different evaluation perspectives.

How can we define system accuracy holistically?

In most established forms of machine learning, accuracy is typically a single measure of correct predictions. However, for knowledge-driven conversational solutions, accuracy is multi-dimensional; it needs to be assessed across three interdependent components:

A retrieval component (RAG): Did the system retrieve relevant and correct documents based on the input query?
A prompt engineering component: Did the system effectively guide the LLM to use retrieved context?
A language model component (the LLM): Did the LLM generate a factually correct and coherent answer?

All three need to work in harmony for accuracy. Perfect retrieval with poor prompting, for example, still produces irrelevant answers, while strong prompting with bad retrieval leads to hallucinations.

A structured evaluation framework

To address these challenges, we implemented Weights & Biases Weave as a unified evaluation platform. The system under evaluation is a customer-facing conversational AI assistant: users submit natural language queries, Amazon Kendra retrieves relevant documents from a knowledge base and AWS Bedrock Agents generates a response using those documents as context.

Because failures can occur at any stage of this pipeline — wrong documents retrieved, poor prompt construction, or unfaithful generation — a single accuracy score isn’t sufficient. Our framework consolidated assessment across five critical dimensions: retrieval quality, answer faithfulness, answer relevance, context precision and system performance. Each dimension targets a specific stage of the pipeline, enabling us to pinpoint exactly where quality degrades.

The components of the solution included:

A unified evaluation project: Using W&B Weave’s project structure, we centralized all test cases, metrics and production data in a single workspace — ensuring every team member worked from the same source of truth.
A structured test dataset: We curated a set of representative queries, each paired with ground truth answers and expected source documents, to measure system performance consistently across iterations.
Multi-dimensional scoring: More than 25 granular metrics were implemented, including precision@k, recall@k, semantic similarity and hallucination detection — each targeting a specific stage of the RAG pipeline.
Traceability: Each evaluation run captured full traces of prompts, retrieved documents and generated responses. This made it possible to debug failures at any stage.
Visualization: Custom dashboards provided actionable insights for both technical teams (who needed component-level diagnostics) and business stakeholders (who needed system-level quality trends).
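
Two of the metrics named above, precision@k and recall@k, have standard definitions that are easy to state in code. The sketch below uses plain Python rather than any W&B Weave API, and the document IDs are made up; it assumes each test case lists the expected source documents, as described for the structured test dataset.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    hits = sum(doc in relevant for doc in retrieved[:k])
    return hits / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


# Made-up example: ranked retrieval results and the expected sources.
retrieved = ["doc-3", "doc-7", "doc-1", "doc-9"]
relevant = {"doc-1", "doc-3", "doc-5"}

print(precision_at_k(retrieved, relevant, k=3))
print(recall_at_k(retrieved, relevant, k=3))
```

In this example two of the top three results are relevant and two of the three expected documents were found, so both scores come out at two thirds; in general the two values differ, and tracking both shows whether retrieval is noisy, incomplete or both.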

Implementing the evaluation process

We developed a systematic analysis and improvement workflow, which enabled consistent progress tracking and targeted enhancements.

Our initial evaluation runs provided a strong starting point across multiple dimensions:

The overall RAG score was 0.8626, providing a reliable benchmark for system performance.
Answer similarity began at 0.4444, reflecting early alignment with ground truth answers.
Topic coverage was 0.6667, showing partial but consistent coverage of expected topics.
Response confidence started at 0.85, representing a solid foundation for reliable responses.

Figure 1: W&B dashboard showing initial baseline metrics across all evaluation dimensions.

Component-level evaluation insights

To gain a detailed understanding of system performance, we evaluated individual components — document retrieval, prompt engineering and language model generation — using targeted metrics.

Document retrieval metrics

The retrieval component (Amazon Kendra) is responsible for finding relevant documents from the knowledge base. If it fails, returning irrelevant or incomplete documents, the LLM cannot produce a correct answer regardless of prompt quality.

We measured:

Retrieval precision: The percentage of relevant documents among retrieved ones improved from 0.65 to 0.82 (+26%).
Retrieval recall: Relevant documents successfully retrieved increased from 0.58 to 0.79 (+36%).
Context relevance: Semantic alignment between query and retrieved documents rose from 0.72 to 0.88 (+22%).
Information coverage: The completeness of retrieved information grew from 0.56 to 0.91 (+63%).
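
The percentages in parentheses are relative changes against the starting value, not absolute point differences. For example, retrieval precision rose by 0.17 in absolute terms, which is about 26% of the 0.65 baseline. A short sketch of the arithmetic, using the before and after values from the list above (the function name is our own):

```python
def relative_change(before: float, after: float) -> float:
    """Percentage change relative to the starting value."""
    if before == 0:
        raise ValueError("relative change is undefined for a zero baseline")
    return (after - before) / before * 100


# Before and after values as reported in the list above.
retrieval_metrics = {
    "retrieval precision": (0.65, 0.82),
    "retrieval recall": (0.58, 0.79),
    "context relevance": (0.72, 0.88),
    "information coverage": (0.56, 0.91),
}

for name, (before, after) in retrieval_metrics.items():
    change = relative_change(before, after)
    print(f"{name}: {before:.2f} -> {after:.2f} ({change:+.1f}%)")
```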

Prompt engineering metrics

The prompt engineering component determines how effectively retrieved context is presented to the LLM. Poor prompts can cause the model to ignore relevant context or produce badly formatted responses.

Instruction following: Adherence to prompt instructions improved from 0.79 to 0.93 (+18%).
Context utilization: Effective use of retrieved context increased from 0.61 to 0.84 (+38%).
Format adherence: Consistency with requested response format improved from 0.88 to 0.97 (+10%).
Prompt robustness: Stability across input variations rose from 0.70 to 0.89 (+27%).

Language model metrics

The generation component is where the LLM produces the final response. We measured whether outputs were factually grounded in the retrieved documents and complete.

Factual accuracy: The correctness of generated responses increased from 0.73 to 0.88 (+21%). We used a threshold-based embedding similarity approach (responses with >0.8 similarity to ground truth deemed correct) combined with rule-based factual verification against knowledge sources.
Hallucination rate: Unsupported content generation decreased significantly from 0.18 to 0.04 (–78%).
Answer coherence: Logical flow and readability improved from 0.85 to 0.92 (+8%).
Response completeness: Coverage of all query parts rose from 0.67 to 0.91 (+36%).
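
The threshold-based check described under factual accuracy can be sketched as follows. This assumes cosine similarity as the similarity measure and uses tiny hand-made vectors in place of real embeddings; only the 0.8 threshold comes from the text, and the rule-based verification step is not shown.

```python
import math

SIMILARITY_THRESHOLD = 0.8  # threshold stated in the text


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_correct(response_vec, truth_vec, threshold=SIMILARITY_THRESHOLD):
    """A response counts as correct when its similarity exceeds the threshold."""
    return cosine_similarity(response_vec, truth_vec) > threshold


def factual_accuracy(pairs, threshold=SIMILARITY_THRESHOLD):
    """Share of (response, ground truth) vector pairs deemed correct."""
    if not pairs:
        return 0.0
    return sum(is_correct(r, t, threshold) for r, t in pairs) / len(pairs)


# Toy vectors standing in for embeddings (assumption, for illustration only).
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # identical direction, similarity 1.0
    ([3.0, 1.0], [1.0, 0.0]),  # close, similarity about 0.95
    ([1.0, 1.0], [1.0, 0.0]),  # similarity about 0.71, below threshold
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal, similarity 0.0
]
print(factual_accuracy(pairs))
```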

Source attribution metrics

Source attribution is critical for enterprise trust. Users need to verify where answers come from. We measured whether responses properly cited the knowledge base documents they drew from.

Citation presence: The percentage of claims with supporting citations increased from 45% to 87% (+93%).
Citation accuracy: The correctness of source references improved from 62% to 91% (+47%).
Attribution completeness: The percentage of retrieved documents correctly cited rose from 39% to 82% (+110%).
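
Because these metrics are themselves percentages, it helps to be explicit that the bracketed figures are relative changes rather than percentage-point differences. A short worked check, using only the numbers quoted above:

```python
def relative_change_pct(before, after):
    """Relative change from before to after, expressed in percent."""
    return (after - before) / before * 100


# (before, after) values in percent, taken from the list above.
metrics = {
    "Citation presence": (45, 87),
    "Citation accuracy": (62, 91),
    "Attribution completeness": (39, 82),
}

for name, (before, after) in metrics.items():
    points = after - before
    relative = relative_change_pct(before, after)
    print(f"{name}: +{points} points, {relative:+.0f}% relative")
```

Citation presence, for example, rose by 42 percentage points, which is a 93% relative improvement on the 45% baseline.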

Integrated system metrics

The component-level metrics above helped us pinpoint where in the pipeline problems occurred. For example, low retrieval recall meant the LLM never saw the right documents, while low context utilization pointed to a prompting issue. However, we also needed to measure how the system performed end-to-end from the user’s perspective, since users don’t distinguish between retrieval failures and generation failures — they simply see a bad answer.

We therefore tracked four system-wide metrics that capture overall answer quality:

Key improvement highlights

To be clear, the evaluation framework didn’t improve the system by itself; it provided the diagnostic visibility needed to identify specific weaknesses. We then followed an iterative cycle: evaluate the system using W&B Weave, identify the lowest-performing metrics, implement a targeted fix and then re-evaluate to measure impact.

Below are the specific interventions we made and their results.

Semantic alignment

Our initial evaluation used token-based matching to measure how closely system responses matched ground truth answers. However, this penalized semantically correct responses that used different wording — creating false negatives that obscured the system’s true performance. We replaced this with embedding-based similarity, which compares meaning rather than exact words.

This was important because once our scoring accurately recognized semantically equivalent answers, we were able to trust the metrics to guide real improvements rather than chasing false negatives.

Example:

Before: The system answered “The product is available” for a query where the ground truth was “The product is in stock.” Token-based matching scored this poorly despite identical meaning.
After: Embedding-based similarity correctly scored this as a strong match. This recalibrated our baseline and let us focus optimization on areas with genuinely low scores.
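
To see why token-based matching undervalues this pair, here is a minimal sketch using Jaccard overlap of word sets. The article does not name the token metric it used, so Jaccard is an assumption chosen for simplicity.

```python
import re


def tokens(text):
    """Lowercase word tokens, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def token_jaccard(response, truth):
    """Share of distinct words that the two texts have in common."""
    a, b = tokens(response), tokens(truth)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


score = token_jaccard("The product is available", "The product is in stock.")
print(f"{score:.2f}")
```

Only the filler words ("the", "product", "is") overlap, so the score is 0.50 even though the answers mean the same thing. The words that carry the meaning ("available" versus "in stock") contribute nothing, which is the gap an embedding-based comparison is meant to close.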

Topic coverage

Our evaluation revealed that the system only covered 66.7% of expected topics in its responses. There were two root causes: (1) user queries used different terminology than the knowledge base documents, which caused retrieval misses, and (2) some documents weren’t properly indexed by Kendra.

We addressed the first with query expansion — automatically augmenting user queries with synonyms and related terms so that retrieval could match documents even when wording differed. For the second, we performed a knowledge base indexing audit, verifying that all relevant documents were indexed and discoverable by Kendra. Together, these interventions improved topic coverage from 0.6667 to 1.0 (+50%).

Example:

Previously, a query about “return policy for defective electronics” failed to retrieve relevant documents because the knowledge base used the term “warranty claims for faulty devices.” After query expansion, the system matched the correct policy documents covering all scenarios.
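
The article does not describe how query expansion was implemented. One simple way to picture it, assuming a hand-maintained synonym map (the `SYNONYMS` table and `expand_query` helper are hypothetical), is to append knowledge-base terminology to the user's query before it is sent to the retriever:

```python
# Hypothetical mapping from user wording to knowledge-base wording.
SYNONYMS = {
    "return policy": ["warranty claims"],
    "defective": ["faulty"],
    "electronics": ["devices"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Append related terms for every known phrase found in the query."""
    lowered = query.lower()
    expansions = []
    for term, alternatives in synonyms.items():
        if term in lowered:
            for alt in alternatives:
                # Skip terms already present to avoid duplicates.
                if alt not in lowered and alt not in expansions:
                    expansions.append(alt)
    if not expansions:
        return query
    return query + " " + " ".join(expansions)


expanded = expand_query("return policy for defective electronics")
print(expanded)
```

A production system might generate the related terms with a thesaurus or a language model instead of a static table; the principle of widening the query so it overlaps with the documents' vocabulary stays the same.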

Response confidence

Improvements in source attribution and confidence scoring mechanisms increased response confidence by 12%.

Example:

Now, responses include citation links to knowledge base documents along with confidence scores, such as “Product specs validated with KB Doc ID 12345, Confidence: 0.92.”

Factual accuracy

We implemented rule-based factual verification that extracts structured claims from responses and validates them against our knowledge base.

Example:

A response stating “Supports up to 16 GB RAM” was validated by matching the triple (Product X, supports, 16 GB RAM) to the knowledge base, so unsupported claims could be flagged as hallucinations.
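
A minimal sketch of this kind of triple check is shown below. The regular expression, the in-memory `KNOWLEDGE_BASE` set and the helper names are all assumptions made for illustration; the article does not describe the actual extraction rules or storage.

```python
import re

# Hypothetical knowledge base of (subject, predicate, object) triples.
KNOWLEDGE_BASE = {("product x", "supports", "16 gb ram")}

# Hypothetical extraction rule for one claim shape: "supports up to <value>".
CLAIM_PATTERN = re.compile(r"supports up to (?P<value>[\w .]+)", re.IGNORECASE)


def extract_triples(product, response):
    """Turn matching claims in the response into normalized triples."""
    triples = []
    for match in CLAIM_PATTERN.finditer(response):
        value = match.group("value").strip().rstrip(".").lower()
        triples.append((product.lower(), "supports", value))
    return triples


def verify(product, response, kb=KNOWLEDGE_BASE):
    """Return each extracted triple with a flag saying whether the KB backs it."""
    return [(t, t in kb) for t in extract_triples(product, response)]


print(verify("Product X", "Supports up to 16 GB RAM"))
print(verify("Product X", "Supports up to 32 GB RAM"))
```

The first response is confirmed by the knowledge base; the second yields a triple with no match, which is the signal for an unsupported claim.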

Overall system optimization

Holistic adjustments to retrieval and generation configurations resulted in a 9% performance improvement.

Example:

By fine-tuning retrieval thresholds, a query for “latest software updates” now retrieves up-to-date documents, improving relevance.

These targeted interventions allowed us to systematically address bottlenecks, improving the quality and reliability of the RAG-based system without manual guesswork.

Integrating with AWS and Weights & Biases

To streamline evaluation of the retrieval system powered by AWS Bedrock Agents and Amazon Kendra, we integrated the evaluation pipeline with Weights & Biases (W&B). This allowed us to automatically track evaluation configurations, test inputs, outputs, and granular metrics in a consistent, transparent, and reproducible way.

A key practical advantage is that this integration required minimal code changes to our existing AWS pipeline. By “patching” the Bedrock client, W&B Weave automatically captures every LLM call — including the prompt, model parameters and response — as a traced event.

The following snippet shows the core pattern:
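
As a library-agnostic illustration of what patching a client means, the sketch below wraps a client's call method so that every invocation is recorded with its prompt, parameters and response. This is not W&B Weave's API or the Bedrock SDK: `FakeModelClient`, `patch_for_tracing` and `TRACE_LOG` are hypothetical names, and the real integration should be taken from the Weave documentation.

```python
import time

TRACE_LOG = []  # stand-in for a tracing backend


class FakeModelClient:
    """Hypothetical stand-in for an LLM client; makes no network calls."""

    def invoke(self, prompt, temperature=0.0):
        return f"echo: {prompt}"


def patch_for_tracing(client, log=TRACE_LOG):
    """Replace client.invoke with a wrapper that records each call."""
    original = client.invoke

    def traced_invoke(prompt, **params):
        started = time.time()
        response = original(prompt, **params)
        log.append(
            {
                "prompt": prompt,
                "params": params,
                "response": response,
                "latency_s": time.time() - started,
            }
        )
        return response

    client.invoke = traced_invoke
    return client


client = patch_for_tracing(FakeModelClient())
answer = client.invoke("What is the return policy?", temperature=0.2)
print(answer, len(TRACE_LOG))
```

The calling code is unchanged apart from the one-line patch, which is why this style of integration needs so few modifications to an existing pipeline.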

This automatic tracing eliminated manual logging and ensured every evaluation run was fully reproducible — any team member could inspect what happened at each step of the pipeline for any past evaluation.

This integration provided several important benefits, including:

Faster deployment and reduced setup time.
Standardized metrics which enabled consistent evaluation across test cases.
Full traceability of evaluation runs for auditing.
Cross-team visibility: Interactive W&B dashboards offered clear insights for both technical and business stakeholders.

Conclusion

Our approach transformed the evaluation of a complex RAG-based AI system from ad hoc, manual inspection into a systematic, metrics-driven, and reproducible process.

By analyzing retrieval, prompt engineering and language model components in a unified framework, we gained clear visibility into system performance and identified actionable improvement areas.

Integrating structured evaluation practices with W&B Weave enabled faster iteration cycles, traceable decision-making, and alignment across technical and business teams. This approach ensured more reliable and complete responses, reduced the need for manual review, and strengthened stakeholder confidence in deploying enterprise AI solutions.

Reimagining API modernization with deterministic AI-assisted engineering

Thoughtworks — Fri, 27 Mar 2026 11:53:12 GMT

By Aditya Sharma and Mahesh Kharade

APIs are the backbone of enterprise platforms, yet many continue to operate on unsupported frameworks alongside modern services. This mismatch introduces inconsistent standards, security risks and growing technical debt, slowing delivery and increasing operational friction.

In large enterprises, APIs evolve over a decade or more, accumulating undocumented behavior, implicit contracts, and hidden dependencies. As a result, modernization becomes less about rewriting code and more about rediscovering intent.

Most modernization effort is spent understanding existing behavior rather than implementing the new solution. However, AI-driven migration offers a new path forward by accelerating API uplift activities such as dependency discovery, instruction-guided controller migration, and the transformation of legacy unit tests into modern test suites.

The challenge

Recently, a Thoughtworks team helped a client modernize an enterprise platform. Aware of the challenges of understanding the existing system, we developed a migration framework that uses AI to orchestrate and accelerate legacy API modernization.

The client’s platform, which powers a B2B retail app, is supported by 25+ backend APIs across multiple domains, such as invoices, operations and payments. Each contains anything from 100 to more than 1,200 controllers and handles critical operational processes. Built more than a decade ago on .NET Framework 4, the system has evolved to meet business demands; this has resulted in increased architectural complexity and technical debt.

While the platform remains essential for day-to-day operations, its legacy foundations now create challenges for maintainability, security and modernization. Addressing these challenges requires a thoughtful approach that reduces risk while maintaining the stability of critical business workflows. The challenges include:

High system complexity: 25+ APIs across domains with large controller footprints (100–1,200+) create tightly coupled services and difficult-to-manage codebases.
Aging technology stack: The system is built on .NET Framework 4, which is now outdated and increasingly difficult to maintain or extend.
Accumulated technical debt: Over 11 years of incremental development has introduced inconsistent patterns and complex dependencies.
Security and compliance risks: Legacy framework vulnerabilities in .NET Framework 4 have introduced security risks and compliance issues.
Operational risk: APIs support critical workflows. That means any changes could be a commercial risk without careful planning.
Engineering productivity constraints: Large controllers and fragmented logic increase development effort, regression risk and time required for system analysis.

Why traditional API uplifts struggle to scale

Traditional uplift is constrained not by coding effort but by uncertainty about the existing system. Missing documentation, and knowledge that lives only in the minds of those who worked on the system (who may no longer be at the organization), are real and familiar challenges to people who do this kind of work.

Indeed, when working with large, business-critical API ecosystems, the majority of the effort shifts from coding to understanding, validating and coordinating changes safely. In practice, this creates several systemic bottlenecks.

The discovery tax

Before any migration work can begin, engineers must first understand the existing API landscape — its endpoints, consumers, downstream systems and contracts. This discovery phase can be manual and time-intensive, especially in legacy systems where documentation is incomplete or outdated.

Fear-driven development

Each change introduces risk, especially when APIs support existing core business, financial or operational workflows. Engineers will often proceed cautiously, replicating legacy patterns to avoid breaking anything.

The test coverage gap

Legacy systems often lack reliable automated validation, forcing long manual regression cycles. This can significantly extend release timelines and introduce operational friction.

The practical limits of traditional migration approaches

In practice, traditional API uplift approaches struggle to deliver modernization at the speed required by the business. Taking the environment we were working with, timelines looked something like this:

The average migration velocity would be approximately two controllers per sprint per developer.
Validating regressions would take around four weeks per release cycle.
The estimated timeline for full migration of the 25+ APIs would be around 10 years.

Clearly, modernization is slow and time-consuming. Such timelines understandably make large-scale transformation difficult to sustain. This is because prolonged modernization cycles tie up engineering capacity, extend exposure to legacy security risks, increase operational costs, and delay the adoption of modern architectures, ultimately limiting the organization’s ability to innovate and respond quickly to evolving business needs.

From AI as a generator to a guided agent

To streamline API modernization, we introduced a semi-automated, instruction-driven migration framework powered by Copilot. The approach focuses on accelerating understanding of existing behavior and enabling confident transformation through structured guidance.

Rather than acting as a free-form generator, Copilot operates as a guided migration agent governed by defined rulebooks. This turns modernization into a consistent and repeatable engineering workflow. The breakthrough was not simply using Copilot; it was constraining it, which ultimately transformed our approach.

Traditional modernization relies on manual investigation and cautious changes due to limited system visibility. AI-assisted engineering gives clarity about how the legacy system is constructed, enabling teams to evolve APIs confidently and at scale. The deterministic, instruction-driven migration framework we introduced is described below.

The deterministic AI-driven modernization framework

Instruction files: Defining migration rules

To form the foundations of the modernization initiative, we created version-controlled instruction files (YAML/JSON). These captured the rules and patterns required to migrate the legacy APIs consistently.

The instruction set defined modernization rules such as:

Deprecated-to-modern API mappings
Namespace and dependency replacements
Build and project configuration conventions
Test validation rules
Reusable remediation and fix patterns
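
To make the idea of a rule file concrete, here is a hypothetical JSON instruction file and a small script that applies its mappings mechanically. In the article the rules guide Copilot instead of driving a script, and the keys, mappings and helper names below are illustrative placeholders, not the rules used on the engagement.

```python
import json
import re

# Hypothetical instruction file content (placeholder keys and mappings).
INSTRUCTIONS_JSON = """
{
  "version": 1,
  "namespace_replacements": {"System.Web.Http": "Microsoft.AspNetCore.Mvc"},
  "api_mappings": {"ApiController": "ControllerBase"}
}
"""


def load_rules(text):
    """Parse the instruction file and flatten it into ordered (old, new) pairs."""
    data = json.loads(text)
    rules = {}
    rules.update(data.get("namespace_replacements", {}))
    rules.update(data.get("api_mappings", {}))
    # Sort so the same file always produces the same sequence of edits.
    return sorted(rules.items())


def apply_rules(source, rules):
    """Replace whole identifiers only, so 'ApiControllerHelper' is left alone."""
    for old, new in rules:
        pattern = r"(?<![\w.])" + re.escape(old) + r"(?![\w.])"
        source = re.sub(pattern, lambda match, new=new: new, source)
    return source


legacy = "using System.Web.Http;\npublic class InvoicesController : ApiController { }"
migrated = apply_rules(legacy, load_rules(INSTRUCTIONS_JSON))
print(migrated)
```

Keeping such a file under version control means every change to a migration rule is reviewable, and the same input always yields the same transformation.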

The evolution of the instruction set

Initially, controller migrations required substantial manual interpretation and experimentation. As engineers migrated controllers, recurring migration patterns began to emerge. These patterns were progressively encoded into the instruction files, allowing Copilot to apply them deterministically during future migrations.

Independent controllers. These are controllers with minimal dependencies that could be migrated directly.
Dependent controllers (low complexity). Controllers with limited dependencies that require targeted adjustments.
Dependent controllers (high complexity). Controllers with multiple downstream dependencies requiring more structured transformation logic.
Supporting library migration. Any controllers that depended on shared internal libraries and utilities.

Controller-level parallelization

To scale the modernization effort, the migration process was structured at the controller level. This meant multiple controllers could be processed independently and in parallel.

Each controller followed a standardized transformation workflow:

By isolating modernization work at the controller level, teams could progress incrementally without requiring large-scale API rewrites.
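
Because each controller is an independent unit of work, the workflow lends itself to simple fan-out. The sketch below assumes a hypothetical `migrate_controller` function standing in for the per-controller workflow and runs it across a worker pool; the controller names are invented.

```python
from concurrent.futures import ThreadPoolExecutor


def migrate_controller(name):
    """Hypothetical placeholder for the per-controller migration workflow."""
    return {"controller": name, "status": "migrated"}


def migrate_all(controllers, max_workers=4):
    """Process controllers in parallel; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(migrate_controller, controllers))


results = migrate_all(["InvoicesController", "PaymentsController", "OrdersController"])
for result in results:
    print(result["controller"], result["status"])
```

In practice the parallelism was across people and controllers, not threads, but the property that matters is the same: no controller's migration has to wait for another's.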

Multi-layer quality validation

While AI significantly reduced the mechanical effort of migration, maintaining deterministic system behavior remained a critical requirement.

To ensure modernization did not introduce regressions, a multi-layer validation strategy was implemented.

Just-enough unit tests. Instead of attempting to maximize test coverage, the framework focused on targeted correctness validation.
API comparison tests. Automated comparison tests validated behavioral parity between legacy endpoints and migrated APIs.
Consumer regression tests. Full web application regression tests were executed against the migrated backend to verify contract compatibility and ensure downstream consumers continued to function correctly.
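
An API comparison test boils down to sending the same request to both implementations and diffing the responses. As a minimal sketch, assuming JSON responses that have already been parsed into Python values (the `diff_responses` helper and the sample payloads are hypothetical), a strict structural comparison could look like this:

```python
def diff_responses(legacy, migrated, path=""):
    """Return human-readable differences between two JSON-like values."""
    here = path or "$"
    if type(legacy) is not type(migrated):
        return [f"{here}: type {type(legacy).__name__} != {type(migrated).__name__}"]
    if isinstance(legacy, dict):
        diffs = []
        for key in sorted(set(legacy) | set(migrated)):
            child = f"{path}.{key}" if path else key
            if key not in legacy:
                diffs.append(f"{child}: only in migrated")
            elif key not in migrated:
                diffs.append(f"{child}: only in legacy")
            else:
                diffs.extend(diff_responses(legacy[key], migrated[key], child))
        return diffs
    if isinstance(legacy, list):
        if len(legacy) != len(migrated):
            return [f"{here}: length {len(legacy)} != {len(migrated)}"]
        diffs = []
        for index, (a, b) in enumerate(zip(legacy, migrated)):
            diffs.extend(diff_responses(a, b, f"{path}[{index}]"))
        return diffs
    return [] if legacy == migrated else [f"{here}: {legacy!r} != {migrated!r}"]


legacy_body = {"id": 1, "total": 10.0, "lines": [{"sku": "A"}]}
migrated_body = {"id": 1, "total": 10.5, "lines": [{"sku": "A"}]}
print(diff_responses(legacy_body, migrated_body))
```

An empty list means the migrated endpoint matched the legacy one for that request; any entry pinpoints the field where behavior diverged.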

The impact of AI-assisted migration

The effectiveness of the AI-assisted modernization framework was evident in the accelerated migration outcomes:

370 controllers successfully migrated within just 3 months of development effort.
Migration velocity increased by more than 300% compared to the traditional baseline approach.
Developer productivity improved significantly, increasing from 6.7 to 27.5 controllers migrated per developer per month.
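
The velocity and productivity figures are consistent with each other, as a quick calculation on the per-developer numbers shows:

```python
before = 6.7   # controllers per developer per month (baseline, from the text)
after = 27.5   # controllers per developer per month (AI-assisted, from the text)

speedup = after / before
increase_pct = (after - before) / before * 100
print(f"{speedup:.1f}x throughput, {increase_pct:+.0f}% change")
```

Going from 6.7 to 27.5 controllers is roughly a 4.1x throughput, which is an increase of about 310%.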

These results demonstrate how AI-assisted modernization can dramatically accelerate large-scale code migrations while improving developer productivity and delivery predictability.

Impact and business value

The structured AI-assisted modernization approach transformed API migration from a slow, manual effort into a scalable and predictable engineering process. By combining deterministic transformation patterns, evolving instruction sets, and AI-assisted analysis, the framework enabled teams to modernize complex APIs with greater speed and confidence.

As a result, the initiative delivered several measurable benefits:

Teams maintained consistent throughput during controller modernization without sacrificing quality.
Instruction-driven patterns standardized migration approaches, ensuring consistent outcomes regardless of individual experience levels.
AI-assisted analysis and fix-loop acceleration significantly reduced repetitive debugging and manual fixes.
Structured controller-level workflows enabled reliable planning and execution of migration milestones.
Increased productivity and reduced regression effort shortened overall modernization timelines.
Modernized APIs with consistent contracts enabled faster integration and more predictable evolution of the platform.

Together, these outcomes demonstrate how AI-assisted engineering can turn modernization from a long-running technical burden into a structured and repeatable transformation capability.

What we learned about AI in legacy modernization

This work taught us a number of important things about using AI in legacy modernization:

AI works best when constrained by deterministic rulebooks.
Institutional knowledge must be codified and made explicit.
The more we migrate, the more patterns we uncover — these patterns are then incorporated as instructions.
Testing maturity is a prerequisite for AI-assisted transformation.
AI reduces mechanical effort; human oversight ensures architectural integrity.

This shifts modernization from a risky rewrite exercise to a scalable engineering discipline. There are, moreover, some important implications for engineering leaders:

For CTOs, modernization needs to be understood as a systems problem. That means it’s vital to invest in visibility and validation before scaling AI.
For engineering leaders, it’s essential to create version-controlled instruction libraries before expanding AI adoption.
For delivery leaders, remember to measure baseline cycle time before introducing AI to quantify impact.

Building the modernization muscle in the AI era

AI doesn’t eliminate modernization complexity, but when structured properly, it transforms modernization from fear-driven refactoring into confident evolution.

The real innovation isn’t Copilot itself — it’s the deterministic modernization framework that governs the tool.

Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.

Originally published at https://www.thoughtworks.com.

Making sense of the SaaSpocalypse

Thoughtworks — Mon, 23 Mar 2026 11:33:53 GMT

Have rumors of the death of SaaS been greatly exaggerated?

By Natalie Drucker

The term ‘SaaSpocalypse’ has been circulating across technology and business worlds in recent months, supposedly marking the death of the software-as-a-service market thanks to AI.

But what’s the reality? It’s clear that the evolution of AI capabilities is transforming the way businesses think about purchasing software. From my seat leading Thoughtworks’ go-to-market (GTM) stack, this isn’t a theoretical debate. It’s showing up in practical decisions about what stays in the stack, what becomes harder to justify, and where AI is genuinely changing the economics and expectations around software. The views in this piece come from working through those questions directly, including the good, the bad, and the uncomfortable parts.

What are the forces driving the SaaSpocalypse?

The SaaSpocalypse is what happens when a model that made real sense in the pre-AI era starts to strain. That model was simple: standardize workflows in a shared product, spread build and maintenance costs across customers and charge per seat for access to the UI.

That was rational when custom software was expensive and took longer to build. AI changes that equation, but not just because it makes software cheaper. More importantly, it raises expectations. The shift is from one-size-fits-most software to systems of intelligence that understand your strategy, context, and ways of working.

That distinction matters. Much of the current industry conversation is being pulled toward cost, and the term is gaining traction because several forces have collided at once. Capital markets love a label, and “SaaSpocalypse” captures the fear that AI makes parts of SaaS vulnerable. AI-native founders and implementation consultancies benefit from framing this as a rupture, so many are doing exactly that. At the same time, enterprise leaders are increasingly frustrated with software stacks that cost millions yet still fail to deliver usable intelligence without spreadsheets, workarounds, and manual effort.

But treating cost reduction as the main prize risks driving the wrong decisions. In our own AI4GTM work in Thoughtworks marketing, cost is the third driver, not the first. The real opportunity is to create systems that embed the processes and intelligence of your best people, move faster than human teams can on their own, and only then reduce the cost to serve. Quality can’t be compromised; speed and productivity are the goal, with cost reduction downstream of any changes you implement.

So yes, there is an economic thread here, including growing pressure on SaaS pricing models built around seats rather than usage, outputs, or outcomes. But the deeper issue is not cost alone. It is that AI is changing what buyers should expect from SaaS in the first place.

Has the death of SaaS been greatly exaggerated?

It’s not unreasonable to question whether SaaS really is dead or dying. As with many things in this industry, the answer is ‘it depends’. What is worth bearing in mind though is something my colleague Martin Fowler said 15 years ago: some software is utility and some is strategic, and you shouldn’t run them the same way.

In the pre-AI world, “buy the package” was rational for utility systems because software was expensive and slow to build. AI changes the economics by making software creation dramatically cheaper and faster, so the build vs buy boundary shifts. And there’s a second twist: systems we treated as utility, like CRM, can become strategic in the AI era because the competitive edge is no longer the system of record, it’s the customer intelligence you can extract and act on from it. That’s why startups and mid-sized companies can choose to classify their GTM systems as strategic, and go the custom route without paying the traditional SaaS rent.

However, enterprises are different. With the tech available today, organizations can easily replace point solutions that have a narrow feature set. We did exactly that at Thoughtworks marketing, eliminating three such SaaS platforms in 2025 and replacing them with bespoke AI workflows. This removed vendor complexity from our stack, and the lower price was a bonus. The inflection point is when businesses choose to abandon CRM-class systems that are used by hundreds or thousands of employees, with deep feature sprawl, uneven user behavior, support expectations, plus security and privacy obligations that SaaS quietly absorbs. Our first attempt at ripping out a rock system in Thoughtworks marketing and replacing it with an AI-native solution has taught us some important lessons. If you take on such a challenge, you need to move away from the classic, pre-AI agile product/IT team delivery model; otherwise it’s impossible to keep up and build the full feature set, even when using vibe coding tools like Lovable and coding assistants like Claude Code to expedite the development process.

Making replacing SaaS viable

We’ve all seen entrepreneurs using tools like Base44 to generate a CRM for personal use in a few hours. We know that this doesn’t hold up in an enterprise setting. The issue is not whether AI can help generate software. It is whether you can deliver enterprise-grade systems fast enough, with the right quality, governance, and cost profile, to make replacement a serious option. To make “replace big SaaS” viable you need a new software development lifecycle.

That is exactly why, in January 2026, Thoughtworks launched AI/works™, our agentic development platform. We are now using it internally across sales and marketing as we work to replace a core SaaS platform in our GTM stack. The goal is to reach the full feature set and workflows the business requires, while continuously regenerating application components as business requirements and regulations change, with less human intervention. In that model, humans shift from manually chasing every feature to acting as architects and overseers of an AI-native development process. With such an approach, the economics of replacing SaaS in the enterprise can make sense.

But until that model matures, the more immediate trend is not mass SaaS extinction. It is how to get more value from the SaaS and the data your organization already has. This has been a major focus for us in Thoughtworks marketing. Before touching the core stack, we focused on improving value by building the intelligence layer above the SaaS. The aim was to help our GTM teams get better signals, better recommendations, and better timing than the pre-AI model allowed. To do this, we established an AI marketing applications team and introduced forward-deployed AI engineering graduates to work directly with marketers, sitting alongside them in the business. Their role is to bring together data, context, and business logic to generate more useful intelligence than any single SaaS platform could provide on its own. To act on that intelligence, we introduced GTM Engineers focused on workflow automation using low-code tooling such as n8n, helping turn insight into execution faster. We also built a close working relationship with internal IT, which provides infrastructure, guardrails, agent-building protocols, and other horizontal capabilities.

Once organisations prove they can generate better intelligence and faster action on top of the existing stack, they are in a much stronger position to challenge legacy pricing, reduce vendor sprawl, and decide more selectively which platforms still earn their place.

Stay adaptive, stay hands-on so you can test things yourself and be bold enough to change strategy when the evidence changes, not when the noise does.

Natalie Drucker

Director of AI & Data Strategy — Global Marketing

Can AI agents really do what SaaS vendors have been doing for years?

AI agents are good when they have the right data foundations. If that’s in place, you can stop treating SaaS as the place work happens and start treating it as a set of data sources, not destinations.

Clay is a good example of SaaS as a data source. We’re not partnering with them for “a nicer screen”; we partner for their ability to enrich our data to enable our go-to-market. We have two GTM engineers on an approximately 150-person marketing team who are responsible for Clay configurations. The broader marketing organisation consumes intel from Clay via our five super agents that cover our critical GTM capabilities and workflows, rather than having to go directly into Clay. In a world where that data can be fetched on demand, we’re able to combine it with intelligence from other systems in a governed way.

That’s why our first move at Thoughtworks marketing was to rewire the stack for agents. In practice this involves making critical structured data queryable and governed, reorganizing unstructured data so it can be retrieved reliably, and pulling business logic out of SaaS workflows so agents can orchestrate it without being trapped in the Salesforce worldview.

In terms of the data we pull from SaaS into our super agents, we onboard regularly used data into BigQuery and pull opportunistic data on demand by invoking governed tools and sources, including direct SaaS API calls and Model Context Protocol (MCP), depending on the context. This means that our teams no longer need to go to five different tools to get an answer.

Once this foundation exists, agents can do what SaaS UIs have been trying (without success) to do for years: answer questions across your entire data ecosystem in context, route work and trigger actions across systems. Without it, you just get shiny chat over messy data.

A challenge that we are facing right now is on the experience layer. Our super agents have a bespoke experience layer, and as a marketing leader, this is not something I want to worry about. However, tools like Glean, which want to be the single chat interface that users find so compelling, are invariably expensive and offer limited control. You’re also more likely to run into hallucinations when the logic isn’t truly yours. Gemini Enterprise, which fits the Google stack we are deeply invested in, is admittedly cheaper but not as feature complete. That’s why, for now, we’re building a bespoke super agent experience layer while actively watching the market for a well-priced option that can reduce the need for us to run our own.

Thoughtworks’ GTM technology stack architecture

Is the demise of SaaS deserved?

If the SaaSpocalypse is at all accurate, then it’s deserved for those parts of the SaaS market that have been overcharging for many years. The real villain in the old model is the combination of platform rent and licence economics: you pay a platform tax when everything has to be built on, integrated through or licensed around a dominant ecosystem like Salesforce, and you then pay high per-seat fees that don’t reflect usage or value.

Treat SaaS as data sources, not destinations

In the AI era, as per my earlier point, you treat your SaaS as data sources rather than destinations, and value extraction happens at our super agent layer, which brings together the data and capabilities of multiple systems. That supports my point that SaaS pricing has to move toward consumption, output and outcomes.

While cost is not the main driver of the program, at Thoughtworks marketing we effectively began shifting SaaS contracts to cheaper, AI-friendly alternatives and pushing existing vendors we want to keep in our stack into aggressive price reductions. This works because vendors know parts of their feature set are increasingly replicable, and because breaking away from expensive platform ecosystems (Salesforce is the obvious one) changes the math. In our case, we’re seeing reductions of 50%+ in SaaS vendor contracts and moving from licence-heavy seat models to consumption-based pricing that fits B2B volumes. Broadly, the industry is experimenting more with consumption and outcome-style pricing, even if seats don’t disappear overnight. We recently replaced a rock system in our stack with a modern version of that SaaS at 30% of the incumbent’s cost. Suddenly the whole GTM SaaS economics start to make sense again.

So, where SaaS is actually properly priced for the AI era and earns its keep on reliability, security, compliance and support, you should feel fine about it. That’s still a good trade.

Could the SaaSpocalypse be a warning to the AI market?

A warning is deserved, but it’s not “AI is fake”; it’s “personal productivity is outpacing enterprise profit.” We see significant adoption at the individual level, with Microsoft’s Work Trend Index reporting that 75% of global knowledge workers are using generative AI, but enterprise returns are still uneven. An S&P Global survey found 46% of companies said no single enterprise objective produced a strong positive impact from genAI initiatives, and only 19% reported strong positive impact across most objectives.

The key point is that AI will not help you if you have a bad strategy. If your teams become more productive, but work on the wrong thing in the wrong way, AI and automation will not impact the bottom line. The key is to amplify what’s working, the processes of your best employees, with AI. Those are the companies that win.

How will this play out in the months and years to come?

In six months, a year, or three years, any prediction can be made to look silly because the pace of change is brutal. The way we handle it at Thoughtworks marketing is to treat this as a continuous sensing problem: my AI and data team in marketing monitors the market daily; we then test what’s showing up as the next big thing, and we review what it means for our strategy and roadmap. The real skill is knowing when something is a genuine milestone that changes the industry versus more hype you need to ignore.

Ecosystem lock-in

The other reality is ecosystem lock-in. Your path depends on the stack you’re on and the partners you’re aligned to, because it’s not easy to change a company-wide platform direction. Thoughtworks is on the Google stack, so we stay very close to their AI innovations and use that as our baseline for what’s possible; luckily it’s one of the strongest AI ecosystems out there right now.

My advice is simple: stay adaptive, stay hands-on so you can test things yourself and be bold enough to change strategy when the evidence changes, not when the noise does.

A new class of SaaS products

I’m beginning to see a new class of SaaS that looks very different to what’s been the norm over the last 15 years. The winning products won’t be “another UI with features”; they’ll be systems designed to be accessed by agents, built to expose data and actions safely into an enterprise’s broader intelligence layer. In that world, SaaS is less a destination and more a set of transaction rails and governed capabilities that agents can compose, while the user experience consolidates into fewer surfaces.

We’ll also see new economics. The old seat-based licence model makes less sense when agents are doing the work and human logins become optional, so the products that win will price around consumption, output and outcomes. They’ll earn trust by reducing operational risk, not just adding features. That’s also where you’ll see differentiation: vendors that can ingest and structure unstructured data, enforce custom business logic and integrate cleanly into your data architecture will thrive.

The “new SaaS” isn’t dead. It just has to be priced and engineered for the agentic, data-centric operating model.

Technology Radar Vol. 34 webinar series

Thoughtworks — Fri, 20 Mar 2026 10:37:56 GMT

Be the first to explore the newest edition of the Thoughtworks Technology Radar in one of two special preview webinars.

The sessions will feature Thoughtworks technologists who helped put the Radar together. They’ll provide a unique insight into what they think is important and interesting in this volume.

Explore the state of AI-assisted and agentic coding in 2026.
Find out what new techniques and technologies are defining software development.
Learn what we believe the industry needs to begin trialling and where it should exercise caution.

The webinars are designed to be interactive. You’ll have the opportunity to ask us questions directly and better understand how we see the future of software.

The sessions are free to attend. There are two options — choose the one that works best for you.

Wednesday, April 8

Western session with Alessio Ferri, Cecilia Geraldo and Ken Mugrage

Thursday, April 9

Eastern session with May Xu, Selvakumar Natesan and Ni Wang

Don’t miss the chance to get a deeper perspective and understanding of the Thoughtworks Technology Radar.

You can register here.

Beyond vibe coding: The five building blocks of AI-native engineering

Thoughtworks — Fri, 20 Mar 2026 10:33:32 GMT

By Sunit Parekh

In 2026, the software engineering landscape has moved beyond “vibe coding”. Throwing raw prompts at a chat interface and hoping for a usable result does not work in enterprise software development. To build production-grade, industrial-scale software today, developers need to adopt a structured approach that treats AI as a sophisticated engineering stack.

To build software effectively you should be orchestrating. You pick an agent to do the work, a model to ‘think’, a methodology like BMAD™ to follow, a spec to define the goal, and context to set the guidelines and guardrails.

Whether you’re modernizing a legacy mainframe or building a greenfield cloud-native application, mastering these five core building blocks of the new engineering stack is essential to achieving professional excellence.

1. Choose your agent: The hands

The “agent” is the autonomous execution layer. It acts as an active participant in the development workflow, significantly surpassing basic reactive assistants.

Core competencies and functionality:

Navigating and analyzing the file system. The agent interrogates the project’s directory, analyzes architecture and understands component interdependencies.
Executing terminal commands. It executes terminal commands to install dependencies (npm, pip), run build scripts, manage source control (git), and perform diagnostics. It directly controls the environment.
Automated testing and verification. It initiates test suites (unit, integration) to validate code changes and uses the resulting data as iterative feedback.
Autonomous multi-file editing and refactoring. The agent implements complex changes across multiple files cohesively (e.g., refactoring class identifiers or updating API signatures) without direct human intervention.
Supervised autonomy. All operations are under human supervision; the agent works autonomously (on things such as bug resolution or implementing minor features), but its actions are submitted to the developer for formal review and final authorization (e.g., via a pull request).
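The combination of test feedback and supervised autonomy described above can be sketched as a small loop. This is a minimal illustration rather than how any particular agent is implemented: `run_supervised`, `CheckResult`, `ReviewRequest` and the two stand-in functions are hypothetical names introduced here, and a real agent would call a model and a real test runner instead.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class CheckResult:
    passed: bool
    log: str


@dataclass
class ReviewRequest:
    """What the agent hands to a human. Nothing is merged automatically."""
    change: str
    attempts: int
    tests_passed: bool
    logs: List[str] = field(default_factory=list)


def run_supervised(
    task: str,
    propose_change: Callable[[str], str],
    run_tests: Callable[[str], CheckResult],
    max_attempts: int = 3,
) -> ReviewRequest:
    """Propose a change, test it, and feed failures back until tests pass
    or the attempt budget runs out. Always ends in a review request."""
    prompt = task
    change = ""
    logs: List[str] = []
    for attempt in range(1, max_attempts + 1):
        change = propose_change(prompt)
        result = run_tests(change)
        logs.append(result.log)
        if result.passed:
            return ReviewRequest(change, attempt, True, logs)
        # Test output becomes the feedback for the next iteration.
        prompt = f"{task}\n\nThe previous attempt failed:\n{result.log}"
    return ReviewRequest(change, max_attempts, False, logs)


# Stand-ins for a real agent and test runner (hypothetical behaviour).
def fake_agent(prompt: str) -> str:
    return "patch-v2" if "failed" in prompt else "patch-v1"


def fake_tests(change: str) -> CheckResult:
    return CheckResult(change == "patch-v2", f"ran tests against {change}")


request = run_supervised("Fix the date parser", fake_agent, fake_tests)
```

Whether the tests pass or not, the function returns a `ReviewRequest` and never merges anything itself, which is the point of supervised autonomy: the human decides what happens next.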

There are a number of popular agents available for software development, including:

Claude Code (Most popular): Anthropic’s agentic coding tool, which runs in the terminal. Works only with Claude models. (See Claude Code Tutorials and Guides for more information)
OpenCode (Open source): A privacy-first, terminal-native agent ideal for local models and sensitive codebases. OpenCode works with many models, including self-hosted ones.
Cline: An open-source favorite for VS Code that offers granular control over tool-calling and file permissions.
Antigravity / Cursor / Windsurf: Specialized IDEs that treat the agent as a first-class citizen rather than a plugin.

2. Choosing the model: The brain

The foundational architecture of AI-driven systems in software development relies on a critical division of labor: the agent manages the execution of tasks and actions, while the model serves as the repository and processor of knowledge.

By 2026, the market has undergone a significant bifurcation, moving away from a ‘one-size-fits-all’ large general-purpose model. Instead, the industry is now characterized by highly specialized models, each meticulously optimized for a distinct set of cognitive tasks essential to the software development lifecycle. This specialization leads to superior performance, efficiency, and context-awareness in their respective domains.

This landscape includes, but is not limited to:

Code generation models, optimized for syntactical correctness, idiomatic adherence to specific programming languages, and complex logical structure generation, moving beyond mere boilerplate.
Architectural reasoning models, focused on evaluating high-level design patterns, microservice communication, scalability, and security implications, serving as a ‘digital architect’ assistant.
Test and quality assurance models, which specialize in generating comprehensive test cases (unit, integration, end-to-end), identifying potential edge cases, and predicting failure points based on code changes.
Documentation and knowledge synthesis models, which are excellent at ingesting existing codebases, technical specifications, and historical tickets to automatically generate up-to-date documentation, tutorials, and context-aware summaries for onboarding new developers.
Security and vulnerability analysis models, trained specifically to recognise common and novel security flaws (e.g., OWASP Top 10, logic vulnerabilities) during the coding and review process, often operating as a mandatory pre-commit hook.

The successful software agent of the future must, then, be adept at orchestrating these specialized models, calling upon the most appropriate model (the knowledge source) for the specific action it needs to execute.
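As a rough sketch of that orchestration, an agent can keep a routing table from task type to model and fall back to a general model when no specialist is registered. The model identifiers below are placeholders rather than real model names, and `TaskKind`, `pick_model` and `plan` are hypothetical names used only for illustration.

```python
from enum import Enum
from typing import Dict, List, Tuple


class TaskKind(Enum):
    CODE_GENERATION = "code generation"
    ARCHITECTURE = "architectural reasoning"
    TESTING = "test and quality assurance"
    DOCUMENTATION = "documentation"
    SECURITY = "security analysis"


# Placeholder identifiers, not real model names.
SPECIALISTS: Dict[TaskKind, str] = {
    TaskKind.CODE_GENERATION: "codegen-model",
    TaskKind.ARCHITECTURE: "architecture-model",
    TaskKind.TESTING: "testing-model",
    TaskKind.DOCUMENTATION: "docs-model",
    TaskKind.SECURITY: "security-model",
}
FALLBACK = "general-model"


def pick_model(kind: TaskKind, specialists: Dict[TaskKind, str] = SPECIALISTS) -> str:
    """Return the specialist registered for this task kind, else the fallback."""
    return specialists.get(kind, FALLBACK)


def plan(steps: List[Tuple[str, TaskKind]]) -> List[Tuple[str, str]]:
    """Pair each step description with the model that should handle it."""
    return [(description, pick_model(kind)) for description, kind in steps]


routed = plan([
    ("Design the service boundary", TaskKind.ARCHITECTURE),
    ("Implement the endpoint", TaskKind.CODE_GENERATION),
    ("Scan the diff for injection flaws", TaskKind.SECURITY),
])
```

Keeping the table as plain data means that swapping a model is a configuration change rather than a code change.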

3. Choosing a methodology: The playbook

To successfully integrate AI into the software development lifecycle, it is crucial to adopt a disciplined methodology that counters the inherent risks of autonomous agents. A major challenge is “agent thrashing,” a phenomenon where an AI becomes trapped in an infinite or lengthy loop of self-correction, often fixing one generated error only to introduce a new one, leading to wasted compute resources and time.
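One simple guard against thrashing is to cap the number of attempts and to stop early when the same failure keeps coming back. The sketch below assumes failures arrive as text logs. `ThrashGuard` and `fingerprint` are hypothetical names, and the normalisation (lower-casing and masking digits) is a deliberately crude assumption about what counts as "the same error".

```python
import hashlib
import re
from typing import Dict


def fingerprint(error_log: str) -> str:
    """Normalise a log so the same failure at a different line number
    maps to the same key."""
    normalised = re.sub(r"\d+", "N", error_log.strip().lower())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()[:12]


class ThrashGuard:
    """Tells the agent loop when to stop and escalate to a human."""

    def __init__(self, max_attempts: int = 5, max_repeats: int = 2) -> None:
        self.max_attempts = max_attempts
        self.max_repeats = max_repeats
        self.attempts = 0
        self.seen: Dict[str, int] = {}

    def record_failure(self, error_log: str) -> bool:
        """Record one failed attempt. Returns True if the loop should stop."""
        self.attempts += 1
        key = fingerprint(error_log)
        self.seen[key] = self.seen.get(key, 0) + 1
        out_of_budget = self.attempts >= self.max_attempts
        going_in_circles = self.seen[key] >= self.max_repeats
        return out_of_budget or going_in_circles
```

An agent loop would call `record_failure` after each failed test run and hand the task back to a developer as soon as it returns `True`, rather than spending more compute on another round of self-correction.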

To prevent this in professional, enterprise-grade development, we must shift the paradigm away from informal, open-ended generative interactions, often dubbed “chat-oriented programming” or “vibe coding.” This unsystematic approach relies too heavily on conversational prompts and lacks the rigor needed for production-ready code.

Instead, the focus must be on establishing a continuous, high-integrity flow where AI-driven development is firmly anchored in established engineering discipline and best practices. This involves:

Structured prompts and context (AI as the engineer). This requires detailed, structured inputs that define the scope, expected output, architectural constraints and quality metrics, rather than vague requests. The AI is assigned the specific role of a software engineer responsible for delivering code against clear specifications.
Integration with CI/CD (AI as the committer). This involves embedding the AI within the continuous integration and continuous delivery pipeline, where its outputs are immediately subjected to automated testing, linting, security scans and code reviews, ensuring rapid feedback and adherence to standards. The AI acts as a committer; its work is instantly validated by the system.
Test-driven AI (TDA) (AI as quality assurance). Mandating AI agents generate code alongside, or even based on, comprehensive unit and integration tests, making test coverage a prerequisite for successful code generation. The AI takes on the role of a QA specialist, ensuring functional correctness before delivery.
Version control and audit trails (AI as the documentarian). Ensuring every AI-generated contribution is committed to a version control system with clear commit messages and traceability, allowing human developers to audit and roll back changes. The AI serves as the documentarian, providing clear, auditable logs of its work.
Human oversight and vetting (human as the architect/reviewer). Implementing mandatory human review gates for AI-generated code, especially for critical sections, to ensure non-functional requirements (like performance, security and maintainability) are met. Human developers take the crucial role of the lead architect and code reviewer, maintaining overall system integrity and adherence to strategic goals.

By systematically applying these engineering best practices, organizations can harness AI to accelerate development while maintaining the quality, stability and control essential for enterprise applications.
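The practices above can be pictured as an ordered series of gates that every AI-generated change must clear, with human review as the final gate and an audit log kept along the way. The following is a toy sketch under that assumption: `Gate` and `evaluate` are hypothetical names, and the lambda checks stand in for real linters, scanners and review tooling.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Gate:
    name: str
    check: Callable[[str], bool]
    blocking: bool = True


def evaluate(change: str, gates: List[Gate]) -> Tuple[bool, List[str]]:
    """Run gates in order and stop at the first blocking failure.
    Returns (accepted, audit_log)."""
    audit: List[str] = []
    for gate in gates:
        ok = gate.check(change)
        audit.append(f"{gate.name}: {'pass' if ok else 'fail'}")
        if not ok and gate.blocking:
            return False, audit
    return True, audit


# Toy checks standing in for real tools (assumptions for illustration).
GOOD_CHANGE = "def add(a, b):\n    return a + b\n"
approved_by_human = {GOOD_CHANGE}

gates = [
    Gate("lint", lambda change: "\t" not in change),
    Gate("secret-scan", lambda change: "password=" not in change.lower()),
    Gate("human-review", lambda change: change in approved_by_human),
]

accepted, audit_log = evaluate(GOOD_CHANGE, gates)
rejected, rejection_log = evaluate('PASSWORD="hunter2"', gates)
```

Placing the cheap automated gates first means a reviewer only sees changes that have already passed them.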

One such playbook is BMAD Method, a methodology for Agile AI-driven development that simulates a multi-role software team through role-based agent orchestration. It uses a specialized loop of “plan-analysis-design-architect-dev-test” personas to ensure code is not just generated, but validated against architectural constraints and unit tests before human review. It focuses on reducing “hallucination drift” by requiring cross-agent consensus on system design before implementation begins.

Similarly, Thoughtworks AI/works™ supports legacy modernisation, starting from reverse engineering and requirements enhancement through to spec-to-code generation, with a 3–3–3 delivery model: from concept in three days, to functional prototype in three weeks, to production-ready MVP in three months.

4. Prompt using specs: The what

In the rapidly evolving landscape of agentic software development, the “Spec to Code” pipeline represents the critical bridge between human intent and autonomous execution. As AI agents become increasingly capable of writing, testing, and deploying software, the bottleneck of development has shifted from raw coding to the precise articulation of requirements.

Ultimately, the effectiveness of an autonomous coding agent is directly proportional to the quality of its input specification. Therefore, mastering the “Spec to Code” translation is no longer just an efficiency hack, but the foundational skill required to successfully navigate the future of AI-driven engineering.

Examples of such toolkits include:

SpecKit: Developed by GitHub, SpecKit is an open-source toolkit that brings structure to AI-assisted coding through Spec-Driven Development. Using a simple CLI, it guides developers and AI agents through a rigorous five-step pipeline of writing specifications — Constitution, Specify, Plan, Tasks and Implement — turning high-level requirements into production-ready code and eliminating chaotic “vibe coding.” (See SpecKit: Master the Art of Spec-Driven Prompting.)
OpenSpec: Developed by Fission-AI, OpenSpec is a lightweight, open-source toolkit that brings spec-driven discipline to AI coding without heavy bureaucracy. Using plain markdown and native slash commands, it guides AI agents through a fast three-step workflow: Proposal, Apply and Archive. It is especially powerful for safely modifying existing “brownfield” projects.
BMAD Quick Flow: The BMAD Quick Flow is a streamlined, three-step AI development framework optimized for rapid feature delivery. It transitions from raw requirements to a technical specification (quick-spec), immediate coding (quick-dev), and optional validation (code-review). It’s perfect for fast prototyping.
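Independently of any of these toolkits, the underlying idea is that a spec is structured data that can be checked before it reaches an agent. The sketch below is not the SpecKit, OpenSpec or BMAD format. `Spec`, its fields and the "too short" heuristic are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Spec:
    goal: str
    acceptance_criteria: List[str] = field(default_factory=list)
    constraints: List[str] = field(default_factory=list)
    out_of_scope: List[str] = field(default_factory=list)

    def problems(self) -> List[str]:
        """Cheap checks that catch the vaguest specs before any code is generated."""
        issues: List[str] = []
        if len(self.goal.split()) < 5:
            issues.append("goal is too short to be unambiguous")
        if not self.acceptance_criteria:
            issues.append("no acceptance criteria")
        return issues

    def to_prompt(self) -> str:
        """Render the spec as a plain-text brief for an agent."""
        def section(title: str, items: List[str]) -> str:
            lines = [f"{title}:"] + [f"- {item}" for item in (items or ["(none)"])]
            return "\n".join(lines)

        return "\n\n".join([
            f"Goal: {self.goal}",
            section("Acceptance criteria", self.acceptance_criteria),
            section("Constraints", self.constraints),
            section("Out of scope", self.out_of_scope),
        ])


vague = Spec("Fix login")
precise = Spec(
    goal="Lock an account for 15 minutes after five failed login attempts",
    acceptance_criteria=["Sixth attempt within the window is rejected"],
    constraints=["No changes to the public API"],
)
```

Here `vague.problems()` reports both issues, while `precise.problems()` returns an empty list and `precise.to_prompt()` yields a brief with explicit sections for the agent to work against.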

Read more about spec-driven development here.

5. Providing context: The how

The final layer is providing the how through what is today being called context engineering. This is the strategic curation of institutional knowledge and design principles provided to AI assistants to enforce enterprise standards. Rather than accepting generic code, developers inject a rich context containing specific design patterns, architectural blueprints and strict security guardrails.

By embedding structural guidelines and security protocols — such as OWASP mandates, authentication requirements and specific microservice architectures — directly into the AI’s workspace, context engineering acts as a foundational constraint. It guides the AI on how to build software that is not just functional, but inherently secure, scalable and aligned with organizational architecture.

Rules and instructions. Using AGENTS.md or .cursorrules files to provide persistent instructions like “Always use Tailwind CSS” or “Follow the Hexagonal Architecture pattern.”
Security guardrails. Integrating automated security policies and “never-allow” rules (such as security_auditor.skill) to prevent the introduction of common vulnerabilities (OWASP Top 10), secret leakage or insecure dependency patterns.
Design systems and architecture. High-level architecture guidelines to ensure the AI-generated output adheres to your brand and system design.
Thoughtworks AI/works™ Context Integration. Advanced capabilities for automated context harvesting from enterprise codebases, ensuring models understand intricate system dependencies and domain-specific logic. (See the AI/works™ Technical Guide)
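Two of the items above, persistent instruction files and “never-allow” rules, are easy to picture in code. The sketch below assumes instructions live in files named AGENTS.md or .cursorrules at the project root, and uses three illustrative regular expressions as rules. It is not a complete security policy and not how any specific tool loads its context. The function names are hypothetical.

```python
import re
from pathlib import Path
from typing import Dict, Iterable, List

INSTRUCTION_FILES = ("AGENTS.md", ".cursorrules")


def load_instructions(root: Path, names: Iterable[str] = INSTRUCTION_FILES) -> str:
    """Concatenate whichever instruction files exist under root,
    labelling each section with its file name."""
    parts: List[str] = []
    for name in names:
        path = root / name
        if path.is_file():
            text = path.read_text(encoding="utf-8").strip()
            parts.append(f"# From {name}\n{text}")
    return "\n\n".join(parts)


# Illustrative "never-allow" rules only; a real policy would be far broader.
NEVER_ALLOW: Dict[str, re.Pattern] = {
    "hard-coded secret": re.compile(
        r"(?i)(api[_-]?key|password|secret)\s*=\s*['\"][^'\"]+['\"]"
    ),
    "shell injection risk": re.compile(r"subprocess\.\w+\(.*shell\s*=\s*True"),
    "disabled TLS verification": re.compile(r"verify\s*=\s*False"),
}


def violations(code: str) -> List[str]:
    """Return the names of every rule that the generated code breaks."""
    return [name for name, pattern in NEVER_ALLOW.items() if pattern.search(code)]
```

The output of `load_instructions` would be prepended to the agent’s context, while `violations` would run on generated code before it is accepted. Pattern matching of this kind only catches the obvious cases, so it complements dedicated scanners rather than replacing them.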

The new engineering stack

In essence, software development with AI shifts from mere vibe coding to thoughtful orchestration. Success lies in the deliberate combination of the right agent, the most suitable model, a proven methodology like BMAD™, a precise spec, and well-defined context. This deliberate composition is the key to building effective software in the age of AI.

Originally published at https://www.thoughtworks.com.

Last week in AI | 16 March

Thoughtworks — Wed, 18 Mar 2026 09:03:09 GMT

By Ben O’Mahony and Danilo Sato

It’s funny how even minor version updates can cause such a stir in the AI world. That was certainly the case with last week’s release of GPT-5.4, which brought significant improvements for enterprise and general knowledge work.

We discussed the implications on last week’s episode of This Week in AI — sorry we’re a little late this time, but we’re sure you’ll appreciate the chance to reflect on some of the key stories from the last week or so.

You can watch the entire session here.

Other things we discussed include:

Claude Code launching a /loop command for scheduled tasks, which allows users to run prompts automatically on a recurring schedule. An automated code review feature was also introduced, highlighting a growing need for “antagonistic agents” to stress test quality.
Google rolling out more native Gemini integration across Workspace apps like Docs, Sheets and Slides.

Alongside these launches and releases were also some important stories that flag continuing challenges around security and stability in an AI-assistance context:

McKinsey’s internal AI platform “Lilli” was compromised in two hours by autonomous offensive agents. The attack exploited a development version of the API documentation, which listed unauthenticated endpoints, granting read and write access to the entire production database, including the critical system prompts that define the agents’ behavior.
High-blast-radius outages at Amazon and persistent availability issues at other services such as Claude suggest that instability is a trade-off for the current speed of innovation in AI engineering (although Amazon last week denied that AI coding was to blame).
An analysis — published on X — showed that LLMs often write “plausible code” instead of functionally correct or performant code. This underscores the necessity of automated performance tests and explicit instructions, as implicit knowledge kills AI effectiveness.

Finally, we also discussed a shift back towards the command line. In a security context, there’s an emerging concept of a CLI vault, proposed as a way to route agent access through a secure gateway, which should prevent them from accessing environment variables and API keys. We also discussed the perils of the cognitive load of managing a swarm of agents, and agreed that what really matters is focusing on effectiveness and optimizing workflow bottlenecks. We don’t need to just keep agents busy.
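The CLI vault concept itself isn’t specified here, but the underlying idea of keeping environment variables and API keys out of an agent’s reach can be illustrated with an allowlisted environment for any command the agent runs. This is a sketch of that one idea only, not of the proposal discussed. `SAFE_VARS`, `scrubbed_env` and `run_for_agent` are hypothetical names, and the allowlist is an assumption that suits a Unix-like system.

```python
import os
import subprocess
from typing import Dict, List, Mapping, Optional, Sequence

# Assumed allowlist: variables a command typically needs, none holding secrets.
SAFE_VARS = ("PATH", "HOME", "LANG", "TMPDIR")


def scrubbed_env(
    source: Optional[Mapping[str, str]] = None,
    allow: Sequence[str] = SAFE_VARS,
) -> Dict[str, str]:
    """Copy only allowlisted variables, so API keys never reach the child process."""
    if source is None:
        source = os.environ
    return {key: value for key, value in source.items() if key in allow}


def run_for_agent(argv: List[str], timeout: float = 30.0) -> subprocess.CompletedProcess:
    """Run a command on the agent's behalf with the scrubbed environment."""
    return subprocess.run(
        argv,
        env=scrubbed_env(),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,
    )
```

For example, `run_for_agent(["git", "status"])` runs with only the allowlisted variables, so a token exported in the developer’s shell is not visible to the command. An allowlist is used rather than a blocklist because secret variable names are hard to enumerate in advance.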