Thoughtworks consultants recently described an experiment that applied generative AI to a legacy system with no available source code.
The article, shared on Martin Fowler’s blog, highlighted a pilot where a five-person team analyzed the system’s database, UI, and binaries in parallel.
InfoQ reached out to the authors, Thiyagu Palanisamy and Chandirasekar Thiagarajan, who explained that during the two-week pilot the team used Gemini 2.5 Pro to analyze a thin slice of an enormous legacy system. The output of that analysis was a functional specification: a “blueprint” of the black-box system that domain experts were able to validate.
AI proved most effective in decoding assembly code, summarizing binary functions, and mapping database changes, while also easing schema discovery.
AI made a significant difference in reverse engineering the ASM code. Traditional approaches would have taken months to decode the logic specified in ASM and also to identify system functions versus business functionality.
The exercise demonstrated how AI can accelerate reverse engineering, providing insights into legacy systems at a pace difficult to achieve through manual methods alone.
Enterprises often rely on critical systems that have become opaque after many years of use. Documentation is incomplete, source code may be missing, and institutional knowledge erodes over time.
The article frames this as the “black box” problem: the system works, but its internal rules are hidden. The goal is not to regenerate code but to reconstruct a “blueprint” of functional intent that can inform modernization with lower risk.
The pilot combined several techniques. One strand focused on connecting dots across data sources by correlating what could be observed in the UI, database schema, and runtime behavior. Another applied change data capture to trace how specific user actions triggered mutations in the database.
Change Data Capture Methodology (source: martinfowler.com)
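The article does not detail the tooling behind this step, but the idea can be illustrated with a minimal snapshot-diff sketch in Python: capture candidate tables before a user action, capture them again afterwards, and report which tables changed. The table names, the pyodbc connection string, and the diff-based approximation of change data capture are all assumptions for illustration, not the team's actual setup.

```python
# Minimal sketch: correlate a single UI action with the tables it mutates
# by diffing row-level snapshots taken before and after the action.
# Table names and the connection string are hypothetical.
import hashlib
import pyodbc

CANDIDATE_TABLES = ["Orders", "OrderLines", "AuditLog"]  # hypothetical names

def snapshot(conn, table):
    """Return {row_hash: row} for every row currently in the table."""
    rows = conn.cursor().execute(f"SELECT * FROM {table}").fetchall()
    return {hashlib.md5(repr(tuple(r)).encode()).hexdigest(): r for r in rows}

def diff(before, after):
    """Rows that appeared or disappeared between two snapshots."""
    added = [after[k] for k in after.keys() - before.keys()]
    removed = [before[k] for k in before.keys() - after.keys()]
    return added, removed

conn = pyodbc.connect("DSN=legacy_db")  # placeholder connection
before = {t: snapshot(conn, t) for t in CANDIDATE_TABLES}

input("Perform the user action in the UI, then press Enter...")

after = {t: snapshot(conn, t) for t in CANDIDATE_TABLES}
for table in CANDIDATE_TABLES:
    added, removed = diff(before[table], after[table])
    if added or removed:
        print(f"{table}: +{len(added)} / -{len(removed)} rows changed")
```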
From there, the team attempted server logic inference by linking database activity with binary calls. This extended into what they describe as AI-assisted binary archaeology, where decompilation tools and large language models helped summarize functions and propose candidate responsibilities.
The process was iterative, involving steps such as finding relevant functions, building subtrees, validating entry points, and assembling specifications from fragments into coherent functionality.
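As a rough illustration of that loop, not the team's actual tooling, the sketch below walks the call subtree under a validated entry point and asks a model for a one-line candidate responsibility per function. The call-graph format and the summarize_function placeholder are assumptions; the article names Gemini 2.5 Pro but does not describe the integration, so the model call is stubbed out.

```python
# Sketch of the iterative "find functions, build subtrees, summarize" loop.
# The call graph and function names are hypothetical decompiler output.
from collections import deque

# name -> (decompiled body, callees)
call_graph = {
    "HandleSubmitOrder": ("...decompiled body...", ["ValidateOrder", "WriteOrder"]),
    "ValidateOrder":     ("...decompiled body...", []),
    "WriteOrder":        ("...decompiled body...", ["AuditLog"]),
    "AuditLog":          ("...decompiled body...", []),
}

def summarize_function(name: str, body: str) -> str:
    """Placeholder for an LLM call that drafts a candidate responsibility."""
    return f"<one-line summary of {name} proposed by the model>"

def summarize_subtree(entry_point: str) -> dict[str, str]:
    """Walk the call subtree under a validated entry point, summarizing each function."""
    summaries, queue, seen = {}, deque([entry_point]), set()
    while queue:
        fn = queue.popleft()
        if fn in seen:
            continue
        seen.add(fn)
        body, callees = call_graph[fn]
        summaries[fn] = summarize_function(fn, body)  # human review follows
        queue.extend(callees)
    return summaries

for fn, summary in summarize_subtree("HandleSubmitOrder").items():
    print(f"{fn}: {summary}")
```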
At each stage, AI provided speed by generating summaries, highlighting relationships, or drafting candidate rules, while humans validated the results across perspectives.
Inferred Logic Spec (source: martinfowler.com)
When domain experts reviewed the output, they confirmed it captured behavior accurately enough to serve as a reliable reference point. The authors told InfoQ they had high confidence the approach could scale across the broader system, provided continuity of the core team and its accumulated context.
The authors also noted that the same techniques have since been applied to other client engagements, providing significant acceleration in building context with or without access to source code.
The experiment also revealed challenges. While AI accelerated many steps, the models were not always reliable, with risks of hallucinations, false positives, and gaps in coverage. Each hypothesis needed confirmation from other evidence before being accepted.
Validation was critical. Cross-checks between data sources and domain expert reviews ensured that draft specifications were accurate, keeping speed from undermining trust.
The pilot illustrates both promise and limits for architects considering AI-assisted reverse engineering. The approach showed that AI can help correlate evidence across UI, databases, and binaries, producing draft specifications that domain experts could validate.
Off the back of the team’s encouraging results, InfoQ spoke with the authors, Thiyagu and Chandirasekar, to learn more about the setup of the pilot and their reflections on the technique’s potential.
InfoQ: How long did the pilot take, and how many people were involved?
About 2 weeks. We had about 5 folks involved in parallel, focusing on extracting context from 3 different areas: DB, application UI, and binaries. We analysed one of the 24 business domains, including 650 tables, 1,200 stored procedures, 350 user screens, and 45 compiled DLLs.
InfoQ: Beyond the thin slice pilot, what confidence did the team have that the approach would scale across the full system?
Pretty high as long as we have the people with context preserved as a core team. We had the knowledge and techniques pinned down, and familiarity with the problem and domain, after the experiment with the thin slice.
InfoQ: Has this method since been applied to other client engagements?
Yes, this has been applied to similar engagements where we need to build context of the legacy systems with or without the source code. This approach has provided us with significant acceleration.
InfoQ: How did the AI-generated specification provide tangible value to the client?
We walked them through the detailed specifications of the thin slice, which gave them confidence to take up this initiative. Using the same approach, we also identified high-level capabilities of the overall system, which helped them to build a much deeper understanding of the overall system than before.
InfoQ: Were there specific moments where the AI made a material difference versus traditional reverse engineering?
AI made a significant difference in reverse engineering the ASM code. Traditional approaches would have taken months to decode the logic specified in ASM and also to identify system functions versus business functionality.
InfoQ: What were the most significant pitfalls?
One key pitfall we observed is that AI performs best at a detailed level. When asked to process very large amounts of context, it tends to hallucinate. We also saw instances of positive reinforcement bias, where the model generated overly optimistic or false-positive outputs. Our takeaway is to use AI for fine-grained analysis and build the broader context outside the model, where we can validate and synthesize insights.
InfoQ: How did the team handle validation: what governance or review practices ensured that the AI’s output was trustworthy?
We handled validation by breaking the work into smaller steps and adding detailed lineage at each stage. This allowed us to cross-check and confirm every output before incorporating it into a larger context block. By validating incrementally, we ensured that the overall result remained trustworthy and consistent.
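One way to picture that incremental, lineage-first validation is sketched below. The Fragment structure, its field names, and the two-sources-plus-review rule are illustrative assumptions, not a description of the team's tooling.

```python
# Illustrative only: attach lineage to each AI-drafted fragment so it can be
# cross-checked before being merged into a larger context block.
from dataclasses import dataclass, field

@dataclass
class Fragment:
    claim: str                          # e.g. a drafted business rule
    sources: list[str]                  # evidence: table, screen, or binary function
    validated_by: list[str] = field(default_factory=list)

    @property
    def trusted(self) -> bool:
        # Require at least two independent evidence sources and one review.
        return len(self.sources) >= 2 and len(self.validated_by) >= 1

context_block: list[Fragment] = []

candidate = Fragment(
    claim="Orders above a threshold require supervisor approval",
    sources=["screen:OrderEntry", "proc:usp_ValidateOrder"],
)
candidate.validated_by.append("domain-expert-review")

if candidate.trusted:
    context_block.append(candidate)     # only validated fragments enter the spec
```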
InfoQ: How do you see this approach evolving in the next few years — are there toolchains or practices you’d like to see emerge?
We anticipate a new generation of toolchains that make context ingestion and consolidation almost effortless, with MCP-server-style wrappers seamlessly orchestrating existing reverse engineering tools. Beyond that, we envision AI becoming a native capability within these tools, enabling near-real-time insights as engineers explore complex systems. Perhaps most transformative will be collaborative context building, where multiple stakeholders can co-create, validate, and evolve system blueprints in real time, dramatically reducing the cycle time from discovery to decision-making.
InfoQ: What advice would you give to someone in a similar position who wants to try this on their own legacy estate?
Pick a manageable slice, experiment, and let the learnings inspire the next step toward modernizing your legacy estate.