Every software engineer knows that working with massive codebases is challenging, especially when debugging a production bug or adding a feature that touches multiple classes or files. Combing through hundreds of files to figure out where to make a change can be overwhelming, even for developers with years of experience on the project.
With the recent surge of LLMs, new use-cases have emerged: LLMs are being used for everything from fixing bugs and adding features to analyzing code for vulnerabilities or checking that all APIs have proper authentication. However, applying LLMs to large codebases presents a significant hurdle. The entire codebase often won't fit into the model's context window, and feeding it too much data at once can lead to inaccurate results. Text is sequential, but code is not: its complex class relationships, conditional statements, and exceptions make it difficult for models to interpret correctly. Incomplete context can lead to inaccurate responses, which in this case could mean code that fails to fully address the issue you are trying to solve.
Breaking Down Code the Right Way
Why does chunking work for text but not for code?
When you are working with large text documents, the go-to approach is chunking: breaking everything into smaller pieces so that each chunk fits within the context window limit. This works well for regular text because it is structured linearly. If a chunk gets cut off mid-sentence, the problem is easy to fix by trimming at the end of a line or a paragraph so the model can still understand it. A summary of the text is not affected by this approach.
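As a point of reference, here is a minimal sketch of the kind of paragraph-aware text chunker described above. The size limit and the choice to split on blank lines are illustrative assumptions, not the behavior of any particular library.

import re

def chunk_text(text: str, max_chars: int = 2000) -> list[str]:
    """Split text on paragraph boundaries, packing paragraphs into chunks."""
    paragraphs = re.split(r"\n\s*\n", text)  # split on blank lines
    chunks, current = [], ""
    for para in paragraphs:
        # Start a new chunk if adding this paragraph would exceed the limit.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks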
This type of chunking simply does not work for code. Code is not a book with a sequence of pages: there is no inherent reading order across files, and even the methods within a single file need not appear in the order they are used. Chunking it with a text-style strategy will not preserve the essence of what the code does.
Code is completely different. It is not just text; it is a formal language with rigid rules about what makes sense and what does not. Chunking it with standard text-splitting methods can result in functions that are cut in half, class definitions that are separated from their methods, and import statements floating around without context.
This creates a cascade of problems. When your embedding model tries to understand these broken fragments, it is working with garbage data. The vector representations it creates are largely meaningless because they are based on syntactically invalid code snippets. An agent built on top of them gives you responses that sound confident but are not entirely accurate, because it is working with an incomplete understanding of your code. Summarizing previous chunks to retain context only aggravates the underlying problem.
The Better Way Is to Follow Code Structure for Chunking
To feed code to an LLM effectively, you have to work with the code's natural structure, not against it. This means using Abstract Syntax Trees (ASTs) or Concrete Syntax Trees (CSTs) to guide your chunking strategy.
Let us start with ASTs. An AST is basically what your compiler builds when it first looks at your code—a tree structure that represents the organization of everything you’ve written. Functions, classes, loops, conditionals—they all become nodes in this tree with clear relationships to each other.
So instead of cutting code at arbitrary character boundaries, you split along these natural divisions. A function becomes one chunk. A class with all its methods becomes another chunk. Each piece is syntactically complete and semantically meaningful.
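For illustration, here is a minimal sketch of structure-aware chunking using Python's built-in ast module. The chunk format and field names are assumptions made for this example, not a standard schema.

import ast

def chunk_python_source(source: str, path: str) -> list[dict]:
    """Split a Python file into one chunk per top-level function or class."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "file": path,
                "name": node.name,
                "kind": type(node).__name__,
                # ast.get_source_segment returns the exact source text for this node.
                "code": ast.get_source_segment(source, node),
            })
    return chunks

Each chunk produced this way is a complete, parseable unit, and it already carries the metadata discussed below.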
This approach gives you several huge advantages:
First, every chunk can actually be parsed and understood by the model. No more feeding it broken code fragments that would never compile.
Second, you get rich metadata for free. When you extract a function through AST parsing, you automatically know its name, parameters, return type, and which class or module it belongs to. This metadata becomes incredibly useful for building sophisticated retrieval systems.
Third, you can implement smarter search strategies. Instead of just doing semantic similarity matching, you can filter for “all methods in the PaymentProcessor class” or “functions that return authentication tokens.”
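To make the metadata and filtering points concrete, here is a small sketch building on the hypothetical chunk dictionaries above; the parent_class field and the filter itself are assumptions for this example, not part of any particular framework.

import ast

def function_metadata(node: ast.FunctionDef, parent_class: str | None = None) -> dict:
    """Pull signature details straight from the AST node."""
    return {
        "name": node.name,
        "parameters": [arg.arg for arg in node.args.args],
        "returns": ast.unparse(node.returns) if node.returns else None,
        "docstring": ast.get_docstring(node),
        "parent_class": parent_class,
    }

# Filtering on metadata instead of (or on top of) semantic similarity, e.g.
# "all methods in the PaymentProcessor class":
# payment_methods = [c for c in chunks if c.get("parent_class") == "PaymentProcessor"]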
Below are tooling options:
- Python has the built-in AST module that handles everything you need
- C# developers can use Roslyn, Microsoft's compiler platform
- Java code can work with libraries like JAST
- For polyglot projects, tree-sitter is amazing: it can parse dozens of languages consistently
CSTs and Tree-sitter
Tree-sitter is a powerful parsing library that generates concrete syntax trees (CSTs) for various programming languages. CSTs are similar to ASTs with one key difference: ASTs capture only the meaning and structure necessary for compilation, omitting non-semantic tokens, while CSTs represent the exact source code, preserving every token including comments. Since we are building context for an LLM rather than compiling, that extra detail is valuable. Tree-sitter also parses incrementally: when a small code change is made, it does not re-parse the entire file but efficiently updates the affected parts of the tree. While this is less critical for one-time chunking, it is valuable for real-time code analysis tools that feed context to LLMs as code is being edited.
Let's look at a code sample with its AST and CST representations to understand the differences better:
var x = 1 + 2;
An AST captures the core meaning, omitting syntactic sugar like the equals sign and semicolon:
VariableDeclaration (kind: 'var')
└── VariableDeclarator
    ├── Identifier (name: 'x')
    └── BinaryExpression (operator: '+')
        ├── NumericLiteral (value: 1)
        └── NumericLiteral (value: 2)
A CST, however, creates a literal map of the code, including every token:
VariableDeclarationStatement
├── Keyword ('var')
├── VariableDeclaration
│   ├── Identifier ('x')
│   └── Assignment
│       ├── Operator ('=')
│       └── AdditiveExpression
│           ├── NumericLiteral ('1')
│           ├── Operator ('+')
│           └── NumericLiteral ('2')
└── Punctuation (';')
This complete, lossless representation is exactly what we need for deep code understanding.
Tree-sitter is widely used for syntax highlighting and advanced code analysis in editors such as VS Code and Atom. It is a universal parsing framework: it provides the tooling to create a parser for virtually any language.
Tree-sitter is well suited to LLM workflows: its parsers are generated as C code, which makes them extremely fast, and it gracefully handles the syntax errors that are common during development, producing a partial CST even when the code does not parse cleanly so that partial analysis is still possible. It has bindings for many languages, including Python, JavaScript, Rust, and Go, which makes it easy to integrate and provides a consistent way to traverse code regardless of the language. This simplifies the development of polyglot analysis tools for LLM context.
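A minimal sketch of using the Python bindings to pull function-level chunks out of a file might look like the following. This assumes the py-tree-sitter and tree-sitter-python packages are installed, and the file path is made up; the exact setup call has changed between versions of the bindings, so treat this as illustrative rather than canonical.

import tree_sitter_python as tspython
from tree_sitter import Language, Parser

# Setup API differs slightly across py-tree-sitter versions.
PY_LANGUAGE = Language(tspython.language())
parser = Parser(PY_LANGUAGE)

source = open("app/main.py", "rb").read()  # hypothetical file
tree = parser.parse(source)

def function_chunks(node, chunks=None):
    """Walk the CST and collect the source text of every function definition."""
    if chunks is None:
        chunks = []
    if node.type == "function_definition":
        chunks.append(source[node.start_byte:node.end_byte].decode("utf8"))
    for child in node.children:
        function_chunks(child, chunks)
    return chunks

print(function_chunks(tree.root_node))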
Practical Application with LLMs:
Beyond chunking, Tree-sitter is a good fit for LLM pipelines because of the rich metadata it exposes, such as the comments and documentation attached to functions and variables; that same metadata also helps refine semantic search. For code generation, Tree-sitter can help verify that generated code is syntactically valid and that modifications are applied to the correct locations in the tree before being serialized back to text.
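For instance, a quick validity check on model-generated code could be sketched like this, reusing the parser from the previous snippet; the generated snippet is made up for illustration.

generated = b"def refund(order_id):\n    return process_refund(order_id"  # missing ')'

tree = parser.parse(generated)
# Tree-sitter still returns a tree, but flags the broken region with ERROR nodes.
if tree.root_node.has_error:
    print("Generated code has syntax errors; ask the model to retry.")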
Using Structured Trees for LLM Context
The fundamental challenge of using LLMs on large codebases is not just the size of the code, but the lack of context when the model is forced to look at only one piece of code at a time. Naive chunking fails because it destroys the very structure that gives code its meaning, leaving the LLM with a confusing and incomplete picture. The two-step process below parses the code so it can be stored and retrieved effectively.
The Concrete Syntax Tree (CST) created with Tree-sitter can be used to understand an entire repository: create intelligent, searchable documents from the CST nodes it generates, then implement a strategy to retrieve them.
Step 1: Create Enriched, Searchable Chunks
Instead of just using the source text from the CST, combine the code with the rich metadata extracted from the tree. This creates a meaningful document for the LLM to embed and search.
For example, consider a delete_user function. Its chunk would not be just the code itself; its searchable document would look like this:
---
File: /app/main.py
Function: delete_user
Endpoint_Route: /admin/delete_user
HTTP_Methods: [POST]
---
Code:
def delete_user():
    # VULNERABILITY! This critical endpoint is missing an authentication check.
    user_id = request.form.get('user_id')
    # ... logic to delete the user from the database ...
    print(f"DELETED USER {user_id} WITHOUT AUTHENTICATION!")
    return jsonify({"status": "success"})
This enriched format gives the embedding model crucial context about the code's location, purpose, and signature, leading to significantly more accurate retrieval. The system can answer queries such as "Find all chunks where Endpoint_Route contains '/admin/' and the code does not contain a call to check_auth()." Such a query instantly flags the vulnerable function, which would be nearly impossible to do reliably with the code text alone.
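Assembling such a document from an extracted chunk can be as simple as the following sketch; the dictionary fields mirror the example above and are assumptions, not a fixed schema.

def to_searchable_document(chunk: dict) -> str:
    """Combine extracted metadata with the source text into one embeddable document."""
    header = "\n".join(
        f"{key}: {value}"
        for key, value in chunk.items()
        if key != "code" and value is not None
    )
    return f"---\n{header}\n---\nCode:\n{chunk['code']}"

doc = to_searchable_document({
    "File": "/app/main.py",
    "Function": "delete_user",
    "Endpoint_Route": "/admin/delete_user",
    "HTTP_Methods": ["POST"],
    "code": "def delete_user():\n    ...",
})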
Step 2: Implement a Retrieval Strategy
With the codebase converted into these enriched, structured chunks, you can now store them in a specialized database to be retrieved as needed. The two leading strategies are:
- Vector Search for Semantic Retrieval: This is the widely used approach known as Retrieval-Augmented Generation (RAG). Each enriched chunk is converted into a vector embedding and stored in a vector database such as Pinecone. When a user asks a question, the system finds the code chunks that are semantically most similar to the query and feeds them to the LLM. This is excellent for finding what code is relevant to a task (see the sketch after this list).
- Graph Databases for Architectural Analysis: Code is inherently a graph of relationships (a function calls another, a class inherits_from another). By storing CST nodes in a graph database, you can traverse these connections. This allows you to answer complex architectural questions that semantic search can't handle, such as, "What downstream services will be affected if I change this function?" or "Show me all API endpoints that use this deprecated library."
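Here is a minimal sketch of the vector-search path using a generic embedding function and an in-memory index. The embed() helper and the cosine-similarity index are stand-ins for whatever embedding model and vector database (Pinecone, pgvector, etc.) you actually use.

import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: call your embedding model of choice here."""
    raise NotImplementedError

class CodeIndex:
    def __init__(self, documents: list[str]):
        self.documents = documents
        self.vectors = np.stack([embed(doc) for doc in documents])

    def search(self, query: str, top_k: int = 5) -> list[str]:
        """Return the chunks whose embeddings are closest to the query (cosine similarity)."""
        q = embed(query)
        sims = self.vectors @ q / (
            np.linalg.norm(self.vectors, axis=1) * np.linalg.norm(q)
        )
        best = np.argsort(-sims)[:top_k]
        return [self.documents[i] for i in best]

# context = CodeIndex(enriched_docs).search("endpoints under /admin/ missing check_auth")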
Conclusion
By using these strategies, the LLM gets exactly the context it needs to get the job done, without being overwhelmed by the entire codebase. Smart parsing combined with smart retrieval is the key to making LLMs understand and work with real-world code.
Disclaimer: The opinions expressed here are my own and do not reflect the views of my employer.