Since its launch in 2021, GitHub Copilot has reshaped software development, with surveys showing widespread adoption and perceived improvements in code quality. A study from GitHub and Accenture reported an 80% adoption rate, but success was measured by developers accepting Copilot’s suggestions rather than objective efficiency gains.
Developers across the spectrum report a personal productivity boost, but the benefits to enterprise applications may be overstated: an analysis by Uplevel found minimal efficiency improvements alongside a 41% rise in bugs, raising concerns that go well beyond reducing manual coding time. Many of the perceived efficiency gains from Copilot are short-term, and that poses a much larger problem than the inconvenience of manually writing boilerplate code. What developers are learning is that AI code generation is still very much in the R&D phase.
Even when AI-powered code generation saves developers time, that time is often reclaimed during code reviews, ongoing maintenance and incident response. That is why AI applied after code generation, particularly in review, is a powerful tool for offsetting this overhead while improving the long-term process for teams.
Let’s explore where and why AI Code Gen is falling short and what’s required to help developers successfully leverage AI-generated code and reviews.
The key lies in retrospection
Copilot’s output is constrained by the context it has. As a result, AI-generated code often fails to execute effectively in multi-repo, multi-language projects. While AI copilots and agents can accelerate code generation, overall productivity is still hindered by other critical development processes: code review, testing, integration, building and deployment.
Copilot now considers not just the document you’re working on but also other open tabs in your IDE. However, this remains far short of the full context required to handle systems spanning multiple repositories, cloud environments and possibly even different runtimes; the context window is simply too limited.
One major limitation is the lack of downstream impact data in the code generation process. In building my AI code review agent, which I’m testing with OpenTelemetry, I’m betting on CI/CD logs and tracing to enhance visibility and help agents better understand nuanced implementation details.
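To make this concrete, here is a minimal sketch, in Python, of what instrumenting a CI test step with OpenTelemetry could look like so that downstream results can later be correlated with the commit under review. It assumes the opentelemetry-sdk package and a GitHub Actions-style GITHUB_SHA variable; the span and attribute names are illustrative conventions, not an API from Copilot, Baz or any other vendor.

```python
# Hedged sketch: wrap a CI test run in an OpenTelemetry span tagged with the
# commit SHA, so a review agent could later correlate a code change with its
# downstream test behavior. Attribute names here are invented conventions.
import os
import subprocess

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# A real pipeline would export to an OTLP collector; stdout keeps the sketch self-contained.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("ci.pipeline")


def run_tests(commit_sha: str) -> int:
    """Run the test suite inside a span tagged with the commit being reviewed."""
    with tracer.start_as_current_span("ci.test_run") as span:
        span.set_attribute("vcs.commit_sha", commit_sha)
        result = subprocess.run(["pytest", "-q"])
        span.set_attribute("ci.exit_code", result.returncode)
        return result.returncode


if __name__ == "__main__":
    raise SystemExit(run_tests(os.environ.get("GITHUB_SHA", "unknown")))
```

Even a thin layer like this leaves a trail from a specific commit to what actually happened in the pipeline, which is the kind of signal an agent needs to reason about downstream impact.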
Without this type of granular contextual awareness, AI coding agents won’t be able to correctly predict how new code will fully integrate with existing systems, often producing suggestions that misalign with broader project requirements.
The context window for Copilot is constrained to what is directly in front of or behind the cursor and, potentially, other open documents in the IDE. When generating code, Copilot relies primarily on its large language model’s training on general programming patterns, not specific project conventions. While this makes the tool flexible, it often overlooks critical project-specific elements like naming conventions, architectural patterns, or dependencies between components spread across multiple repositories.
Unless your AI tool is highly customized and deeply integrated with your project (a resource-intensive endeavor for most teams), it cannot retain knowledge about the project’s history, evolution, or previous commits. This can lead to inconsistent or misaligned code suggestions, which are costly to fix later.
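As a purely hypothetical illustration (the repositories, function names and URL below are invented), consider a shared helper that encodes a team convention in one repository. An assistant that only sees the open file cannot know it exists and suggests a near-duplicate under a different name, which is exactly the kind of drift a reviewer then has to catch by hand.

```python
import requests  # only needed by the duplicated version below

# In a shared library repo: the team's established way to load users,
# carrying conventions (naming, retries, audit logging) reviewers expect.
def fetch_user_record(user_id: str) -> dict:
    """Canonical helper other services are expected to reuse."""
    ...


# In a service repo, what an assistant seeing only the open file might suggest:
# a near-duplicate that bypasses the shared helper and its conventions.
def get_user(user_id):
    return requests.get(f"https://api.internal/users/{user_id}").json()
```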
When AI-generated code ends up in code review
I mentioned that time saved in code generation is now often time spent in review. Suggestions that fail to align with project architecture risk introducing inconsistencies, which reviewers must manually uncover and resolve. Code that initially appears functional often results in technical debt or hidden bugs, increasing the workload for reviewers, or worse, causing production failures that must be fixed under the pressure of downtime.
Just as AI assistants lack the ability to account for multi-repository dependencies, code review tools often fail to provide a project-wide view. These limitations increase workloads for reviewers, who lack efficient ways to identify changes, spot dependencies, or assess the broader impacts of code modifications.
The rapid adoption of AI tools has outpaced the development of frameworks to ensure code quality. Until more sophisticated tooling becomes standard, the rise in technical debt, production issues, and bugs will likely continue. This places additional pressure on development teams, who must balance faster deployments with intensified quality control.
A new generation of AI code reviewers
The good news? A new generation of AI code assistants is emerging – but engineers should understand the pros and cons to get the most out of them.
Baz AI Code Review
Baz focuses on automated pull request (PR) reviews with AI-driven suggestions and provides real-time feedback on code quality and best practices. By leveraging specialized models and embeddings, it generates review suggestions that cover API impact and deep downstream analysis. It integrates with GitHub and offers a standalone experience with copilot-style chat functionality. It’s a strong platform for complex, multi-repo, multi-language codebases.
At this time, Baz is wholly focused on the code review cycle, so it offers limited tooling for the IDE or for code generation earlier in the development cycle. Full disclosure: this is my AI code review product, which was just released in January.
CodeRabbit
CodeRabbit is also hyper-focused on AI PR reviews, offering code explanations and improvement suggestions in areas like readability, security and efficiency, and it is free for open source projects. It is particularly useful for small and medium-sized teams looking to streamline their review processes. However, it has limited customization for advanced AI review criteria and is not as comprehensive as others when it comes to code search and analysis. Developers have shared feedback that its AI-generated suggestions can sometimes be redundant or misaligned with a team’s coding conventions.
Graphite
Graphite is designed to enhance developer workflows by enabling fast, incremental PRs with stacked diffs, helping maintain a cleaner Git history. It also includes AI-assisted code change summaries, making it easier for teams to review updates efficiently. While Graphite is excellent for workflow management, its primary focus has not been deep AI-driven code analysis, and it requires adoption of the platform, which brings a learning curve for teams unfamiliar with stacked diffs.
Sourcegraph
Sourcegraph is known for its powerful code search and intelligence tooling, particularly well suited to large codebases. In recent announcements, the company has discussed how Cody, its coding agent, allows for deep analysis across repositories and historical code trends, making it a valuable resource for developers who need advanced search capabilities. It also features AI-powered autocomplete and code explanations. Its setup and indexing can create overhead for larger organizations, and while it excels at code exploration, it is less focused on automated PR reviews.
Bottom line: AI Code Review Needs Codebase Context
Tracing and observability are productivity multipliers for code reviews, enabling developers to better understand complex, multi-repo, and multi-language environments. Cross-repo and cross-language visibility should be the baseline for large-scale projects – non-negotiable for today’s distributed applications. Tools that prioritize these capabilities will redefine code generation and review workflows, allowing AI to produce truly context-aware code tailored to modern software environments.
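To close the loop on the earlier CI sketch, here is a small, hedged example of the consumption side: a script that reads spans exported to a JSON file (a simple assumed schema, not any vendor’s format) and groups failing CI steps by commit, the kind of signal a context-aware review tool could surface next to a pull request.

```python
# Hedged sketch of consuming exported CI spans: group failing steps by commit
# so downstream impact can be shown alongside the change that caused it.
# The file name and span schema are assumptions, not a specific tool's format.
import json
from collections import defaultdict


def failing_steps_by_commit(path: str) -> dict:
    """Map each commit SHA to the CI span names that recorded a non-zero exit code."""
    failures = defaultdict(list)
    with open(path) as f:
        for span in json.load(f):
            attrs = span.get("attributes", {})
            if attrs.get("ci.exit_code", 0) != 0:
                failures[attrs.get("vcs.commit_sha", "unknown")].append(span["name"])
    return dict(failures)


if __name__ == "__main__":
    for sha, steps in failing_steps_by_commit("spans.json").items():
        print(f"{sha}: failing CI steps -> {', '.join(steps)}")
```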