Key Takeaways
- Focus modernization efforts on conceptualizing software, not producing code, since conceptualizing is the bottleneck in the development lifecycle.
- Use AI tools to retrieve the conceptual design of legacy software to reduce the toil of lengthy up-front design.
- Most commercial AI tools are focused on the “accidental complexity” of the development phase, where code generation has been largely commodified.
- Use static analysis to systematically identify code and database context that can be used effectively by large language models.
- Use the summarization capabilities of large language models to draft business requirements for legacy code.
In No Silver Bullet, Fred Brooks argues that achieving an order of magnitude gain in software development productivity will only occur if the essential complexity of software engineering is addressed. For Brooks, the essential complexity of software engineering is conceptualizing software’s “interlocking pieces”. This is in contrast to the relatively trivial task of representing the chosen concept in an implementation.
Today’s widely adopted AI-enabled tools for software development, like Copilot, aider, and cline, readily produce representations when given a natural language description of a concept. However, if Brooks was correct, these tools address only the accidental complexity of software engineering, not the essential complexity of “specifying, designing, and testing [the conceptual construct]”.
In this article, we share our experiences and insights on how large language models (LLMs) helped us uncover and enhance the conceptual constructs behind software. We discuss how these approaches address the inherent complexity of software engineering and improve the likelihood of success in large, complex software modernization projects.
Legacy modernization and the concept recovery problem
Nowhere is the difficulty of properly conceptualizing software more apparent than in legacy systems. By definition, legacy systems are in use despite being outdated, making them business critical yet less familiar to most engineers.
These systems are particularly susceptible to conceptual drift between the business and the software.
As the software becomes difficult to change, businesses may choose to tolerate conceptual drift or compensate for it through their operations. When the difficulty of modifying the software poses a significant enough business risk, a legacy modernization effort is undertaken.
Legacy modernization efforts showcase the problem of concept recovery. In these circumstances, recovering a software system’s underlying concept is the labor-intensive bottleneck step to any change.
Without it, the business risks a failed modernization or losing customers that depend on unknown or under-considered functionality.
Consider an e-commerce company that initially sells only to retail customers. Later, the company creates an independent business unit for wholesale customers, subject to its own discounting and billing rules. Conceptual drift could occur if the software for the new wholesale business unit depends directly on the retail business’s implementation. Features for the new wholesale business would be implemented in terms of the retail business’s rules, failing to model the concept of independence between the two business units.
Modernizing vertical slices of complex software after substantial conceptual drift is difficult because the concepts tend to be vast, nuanced, and hidden. Fortunately, techniques leveraging large language models provide significant assistance in concept recovery, providing a new accelerator for modernization efforts.
AI interventions for software modernization
The goal of any software development effort is the reduction of cycle time to deploy code while maintaining a tolerable error rate. We set out to develop AI interventions at multiple points in the modernization process to reduce cycle time, even for complex systems. Here we focus primarily on design phase interventions, which are largely unaddressed by commercial AI tools.
Using all interventions together in a fully automated manner provides a way of rapidly prototyping and exploring solutions, such as variations on system architecture. An earnest modernization effort will require human experts to adapt and verify AI-generated outputs, but we found that the reduction in toil from using AI de-risks and systematizes the effort.
Design phase interventions
The goal of a software modernization’s design phase is to validate the approach well enough to start planning and development while minimizing the rework that missed information could cause. Traditionally, substantial lead time is spent in the design phase inspecting legacy source code, producing a target architecture, and collecting business requirements. These activities are time-intensive, mutually interdependent, and usually the bottleneck step in modernization.
While exploring how to use LLMs for concept recovery, we encountered three challenges to effectively serving teams performing legacy modernizations: which context was needed and how to obtain it, how to organize context so humans and LLMs can both make use of it, and how to support iterative improvement of requirements documents. The solutions we describe below fit together in a pipeline that can address these challenges.
Code tracing identifies relevant context for LLMs
What is a trace?
To design tools for gathering useful code context for modernization tasks, we observed how engineers use modern code editors and static analysis tools to explore unfamiliar codebases. Engineers use keyword search, regex, and source tree naming conventions to orient themselves around the proper entrypoint to a codebase for a particular task. Software diagramming tools help visualize the structure of the software, and editor functionality such as go-to-definition and find-all-references help the reader navigate through related portions of code.
Typically, entrypoint identification was a low-effort task for architects and engineers, who are practiced at understanding the high-level structure of software. However, following references through the codebase to assess functionality and structure was labor-intensive. For this reason, we focused on building a way to trace code for the purpose of supplying it to an LLM.
A trace describes the systematic traversal of the abstract syntax tree (AST), resulting in a tree structure that defines the code context relevant to a modernization task. To perform a trace, we choose a starting point and stopping criterion that reflect the rewrite strategy. For example, a rewrite strategy for a form-driven application could be to “freeze” the form and database layers but rewrite all business logic in between. A valuable trace would start with a form class and terminate in nodes with database access, limited by a pre-defined reference depth to control the volume of code returned.
We assessed various tools to parse code and perform a trace. Compiler toolchains provide language-specific APIs, while tools like tree-sitter and ANTLR can parse a wide variety of programming languages with a consistent API. For our work, we built on top of the Roslyn compiler’s API because it returns detailed type information for VB.NET and C# applications. Our syntax tree walker stored details like the current symbol’s type, its source code, and its relationships to other symbols.
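To make the traversal concrete, here is a minimal, language-agnostic sketch in Python. Our implementation relied on Roslyn’s semantic model; the `load_node`, `resolve_references`, and `touches_database` helpers below are hypothetical stand-ins for that machinery, and the node shape is simplified.

```python
from dataclasses import dataclass, field

# Simplified node produced by a syntax walker. In our implementation this
# information came from Roslyn's semantic model; the helper functions passed
# into trace() are hypothetical stand-ins for that machinery.
@dataclass
class TraceNode:
    symbol: str              # fully qualified name of the class or method
    symbol_type: str         # e.g. "form", "class", "method"
    source: str              # raw source code for the symbol
    children: list["TraceNode"] = field(default_factory=list)

def trace(symbol, depth, max_depth, load_node, resolve_references, touches_database):
    """Depth-limited traversal from a starting symbol (e.g. a form class).

    Traversal stops at nodes that access the database (the "frozen" layer in
    the example rewrite strategy) or when the reference depth limit is reached.
    """
    node = load_node(symbol)                      # build a TraceNode for this symbol
    if depth >= max_depth or touches_database(node):
        return node                               # stopping criterion reached
    for referenced in resolve_references(node):   # symbols this node refers to
        node.children.append(
            trace(referenced, depth + 1, max_depth,
                  load_node, resolve_references, touches_database))
    return node

# Usage: start at the form class and keep the trace to a manageable depth.
# root = trace("Orders.OrderEntryForm", 0, max_depth=4, ...)
```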
Collecting code context
We experimented with the format of the code context provided to the LLM. An AST includes details about every symbol in the code, which can lead to very granular data such as the value of an assignment statement. Based on prior research on LLM code summarization, we expected that LLMs would produce better summaries of code if provided the raw source code. So we compared markdown-formatted code context to raw ASTs. We used H3 headings to label the name of each method or class in our markdown-formatted code context. Meanwhile, our AST format resembled a nested ASCII tree structure. We found that LLM summaries were more complete when we used the markdown formatting. ASTs generally produced more technical responses, which were often overly detailed to the detriment of overall coherence.
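A minimal sketch of the markdown formatting step, reusing the simplified node shape from the trace sketch above; the exact layout of our context files differed in detail.

```python
def to_markdown_context(node, lines=None):
    """Flatten a trace into markdown code context: one H3 heading per
    class or method, followed by that symbol's raw source code."""
    if lines is None:
        lines = []
    lines.append(f"### {node.symbol} ({node.symbol_type})")
    lines.append("")
    lines.append(node.source)
    lines.append("")                 # blank line separates samples
    for child in node.children:
        to_markdown_context(child, lines)
    return "\n".join(lines)
```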
Collecting database context
While walking the syntax tree, we also kept track of database dependencies by parsing the source code for SQL syntax using ANTLR. If SQL was found, we parsed the names of tables and stored procedures and stored them on the node. After the trace, we compared the collected table and stored procedure names to those from a SQL schema dump of the application’s database. This allowed us to produce a markdown-formatted database context file describing the portion of the schema touched by the traced code. Similar to our code context, each table or stored procedure received an H3 label followed by the relevant schema or SQL.
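The sketch below illustrates the idea with regular expressions in place of the ANTLR grammar we actually used; a real SQL parser handles quoting, aliases, and dialect differences that this simplification ignores.

```python
import re

# Simplified stand-in for ANTLR-based SQL parsing: find table and stored
# procedure names referenced in SQL strings embedded in the source code.
TABLE_RE = re.compile(r"\b(?:FROM|JOIN|INTO|UPDATE)\s+([\w.\[\]]+)", re.IGNORECASE)
PROC_RE = re.compile(r"\bEXEC(?:UTE)?\s+([\w.\[\]]+)", re.IGNORECASE)

def collect_db_references(sql_text):
    """Return the table and stored procedure names referenced in a SQL string."""
    return set(TABLE_RE.findall(sql_text)), set(PROC_RE.findall(sql_text))

def database_context(traced_tables, traced_procs, schema_dump):
    """Emit markdown database context: an H3 heading per object that appears
    both in the trace and in the schema dump, followed by its definition.

    schema_dump is assumed to be a dict of {object_name: DDL text} parsed
    from the application's schema dump.
    """
    lines = []
    for name in sorted(traced_tables | traced_procs):
        if name in schema_dump:
            lines.append(f"### {name}")
            lines.append("")
            lines.append(schema_dump[name])
            lines.append("")
    return "\n".join(lines)
```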
Benefits of tracing
The systematic nature of a trace for identifying code and database context yielded productivity gains over manually jumping through references and summarizing code. Architects we worked with spent a week or more jumping through code, getting caught in cycles, and struggling to build up knowledge of the codebase. By comparison, a trace can be performed in minutes, and the systematic, rule-based process gives a structured and repeatable way to analyze code.
Making context useful for humans and LLMs
Visualizing a trace
Architects and engineers asked for ways to visualize a trace’s code and database context so they could better understand their architecture and design. We did this by creating export functionality for our traces, which serializes code and database context into PlantUML-formatted class, sequence, and entity-relationship diagrams. We experimented with prompting an LLM to produce the PlantUML directly from code and database context. However, the results were unreliable. Even with moderate contexts of 50k tokens, LLMs lost details and failed to consistently follow PlantUML syntax.
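A minimal sketch of the class-diagram export, again assuming the simplified node shape from the trace sketch; our exporter also emitted sequence and entity-relationship diagrams, which are omitted here.

```python
def to_plantuml_class_diagram(root):
    """Serialize the reference relationships in a trace as a PlantUML class diagram."""
    lines, declared = ["@startuml"], set()

    def visit(node):
        if node.symbol not in declared:        # declare each class once
            declared.add(node.symbol)
            lines.append(f'class "{node.symbol}"')
        for child in node.children:
            visit(child)
            # explicit reference edge from caller to callee
            lines.append(f'"{node.symbol}" --> "{child.symbol}"')

    visit(root)
    lines.append("@enduml")
    return "\n".join(lines)
```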
We also found that the PlantUML markup was useful LLM context in its own right. The class diagram’s markup grounded LLMs in the structural relationships among code symbols. Instead of counting on an LLM to infer the relationship between two snippets of code, the explicit references within the PlantUML increased the reliability of responses. Similarly, the entity relationship diagram summarized the portion of the storage layer that the application depends on, which led to more technical responses. As a result, we included the PlantUML markup for the class and entity relationship diagrams within the code and database context that we sent to LLMs.
Recovering business requirements
To address the central problem of concept recovery, we prompted LLMs to produce a business requirements document (BRD) using our collected code and database context. Following prompt engineering best practices, we devised a single prompt that was useful to both technical and non-technical users. The outputs included a general overview, functional requirements, a user interface description, and references. These are summarized in the table below, followed by a sketch of such a prompt.
| Output component | Relevance |
|---|---|
| General overview | A summary of the purpose of the code and its relevance to the larger application, including a short background description of the application’s purpose |
| Functional requirements | A description of business rules, calculations, and data handling |
| User interface | A description of the user interface elements involved and their place within the user journey |
| References | A listing of the classes, tables, and stored procedures involved |
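A prompt along these lines might look like the following sketch; it simply reflects the four output components above, and the exact wording will vary by project.

```python
# Illustrative template only; the placeholders are filled with the markdown
# contexts produced by the earlier sketches.
BRD_PROMPT = """You are assisting with a legacy software modernization.
Using only the code context and database context provided below, draft a
business requirements document with these sections:

1. General overview: the purpose of this code and its role in the application.
2. Functional requirements: business rules, calculations, and data handling.
3. User interface: the UI elements involved and their place in the user journey.
4. References: the classes, tables, and stored procedures involved.

Write for both technical and non-technical readers. If something cannot be
determined from the context, say so rather than guessing.

## Code context
{code_context}

## Database context
{database_context}
"""

# code_md and db_md are the markdown contexts from the earlier sketches.
prompt = BRD_PROMPT.format(code_context=code_md, database_context=db_md)
```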
As we spoke with engineers who frequently work on software modernization tasks, we learned that each modernization effort likely needs customized outputs. For example, some modernization efforts require exact functional duplication from a source system to the target. Perhaps the goal is retiring aging infrastructure or out-of-support languages and frameworks. In these cases, extracting functional test cases is particularly valuable. Other modernizations include affordances for new functionality or behavioral modifications. In these cases, the BRD may be used to understand the feasible paths towards the desired state, so functional test cases would be less valuable.
Because the size of our context (often >150k tokens) was greater than many frontier models’ context limits, we devised an iterative prompting strategy. First, we estimated the token count of our context. Then, we chunked the context into manageable sizes, usually around 50k tokens, before iteratively prompting for a summary of each chunk. To estimate tokens, we used tiktoken, though a simple rule of thumb such as 1 token per 4 characters is also effective for this purpose. One key was to make sure that our chunking did not split up samples of code or database context. We relied on detecting the markdown separators in our code context files to prevent this.
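A sketch of the estimation and chunking step; the tiktoken encoding name is an assumption, and the splitting logic is simplified to break only on the H3 separators described above.

```python
import tiktoken

ENCODING = tiktoken.get_encoding("cl100k_base")    # encoding choice is an assumption

def estimate_tokens(text):
    return len(ENCODING.encode(text))              # ~1 token per 4 characters also works

def chunk_context(markdown_context, max_tokens=50_000):
    """Split markdown context into chunks under max_tokens, breaking only on
    H3 separators so a code or schema sample is never split across chunks."""
    samples = ["### " + s for s in markdown_context.split("### ") if s.strip()]
    chunks, current = [], ""
    for sample in samples:
        if current and estimate_tokens(current + sample) > max_tokens:
            chunks.append(current)
            current = ""
        current += sample
    if current:
        chunks.append(current)
    return chunks
```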
Once all chunks were processed, we synthesized the outputs into a single response using a dedicated synthesis prompt. This prompt included chain-of-thought reasoning, a checklist for the required sections, and a request to check the completeness of each section. If a one-off prompt is possible with your context, we recommend starting with that approach. With advancements in long-context models (see the Gemini Flash and Llama 4 models) and reasoning capabilities, iterative prompting may eventually become unnecessary.
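A sketch of the chunk-then-synthesize loop, where `complete` is a hypothetical stand-in for whichever LLM client is in use (it takes a prompt string and returns the model’s text response).

```python
def draft_brd(chunks, complete):
    """Summarize each chunk, then synthesize the partial summaries into one BRD."""
    partial_summaries = []
    for i, chunk in enumerate(chunks, start=1):
        partial_summaries.append(complete(
            f"Summarize the business-relevant behavior in this code and "
            f"database context (part {i} of {len(chunks)}):\n\n{chunk}"))

    synthesis_prompt = (
        "Think step by step. Combine the partial summaries below into a single "
        "business requirements document with these sections: general overview, "
        "functional requirements, user interface, references. Before answering, "
        "check that every section is present and complete.\n\n"
        + "\n\n---\n\n".join(partial_summaries))
    return complete(synthesis_prompt)
```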
AI code chat enables deeper inquiry
We received feedback that an initial BRD tends to spawn follow-up questions. Often, the answers to those follow-up questions are valuable additions or updates to the initial BRD. We experimented with retrieval augmented generation (RAG) for these tasks. However, RAG pipelines can require considerable tuning to achieve high search relevance, and our users had divergent query patterns, so we favored solutions that did not depend on retrieval. For example, engineers and architects were concerned with code references, whereas business analysts and product managers were interested in semantic meaning. Optimizing for both sets of users was difficult enough that we focused on addressing a few specific follow-up tasks our users asked for.
Task 1: Combining two BRDs
To combine the BRDs of two traces, we employed the same iterative prompting technique described above to manage large context windows. Typically, users wanted an existing BRD updated with newer context, so we used the synthesis step to fold details from the newer trace into the original BRD. This process can be expensive: its cost scales linearly with the amount of code and database context pulled in by a trace, plus the additional cost of the synthesis step. However, it focuses the LLM’s attention on each chunk, which best preserves the details from the two traces.
Task 2: Finding other relevant context
To find relevant source code that may fall outside the initial trace, we experimented with chunking all of the code in a repository down to the method level and using CodeBERT to create embeddings, which we stored in a vector database for retrieval. This formed a naive RAG that users could search for code snippets relevant to a process or behavior of interest. Returning the top results to the user helped suggest entrypoints into the code, but this implementation suffered from a few drawbacks. First, results were returned without any surrounding context, making it hard for users to gauge their relevance. Second, overall search relevance was low. We found that this functionality was not reliable enough to implement a truly automated RAG in which code snippets are retrieved, augmented, and submitted to an LLM on the user’s behalf.
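A sketch of this naive retrieval index, using the microsoft/codebert-base checkpoint from Hugging Face, mean pooling (one of several reasonable pooling choices, assumed here), and in-memory cosine similarity in place of a dedicated vector database.

```python
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

def embed(text):
    """Mean-pooled CodeBERT embedding (pooling strategy is an assumption)."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state        # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()

def build_index(method_snippets):
    """method_snippets: list of (identifier, source) pairs chunked to method level."""
    return [(name, embed(src)) for name, src in method_snippets]

def search(index, query, top_k=5):
    """Return the method snippets most similar to a natural language query."""
    q = embed(query)
    scores = [(name, float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))))
              for name, v in index]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_k]
```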
Task 3: Targeted inquiries on the current trace
To service targeted inquiries of the code in a trace, we experimented with iterative queries against the whole context (see Task 1) and indexing the code in a vector database (see Task 2). The iterative queries approach yielded more exhaustive responses but was costlier than using vector database retrievals. We anticipate that advances in long context models will further improve the practicality of submitting large contexts as a single prompt.
Conclusion
Combining rigorous, systematic approaches like static analysis with AI summarization allows for new, customized approaches to modernization. The ability to automate and inspect the outputs of static analysis helps human experts perform their jobs with confidence, while LLM summarization greatly reduces the toil of preparing lengthy documentation. These advantages bring collaborators from different backgrounds into the process of modernizing software. Only once software teams understand the concepts behind legacy software can they effectively rewrite it–with or without AI tools.