Key Takeaways
- Apply structured goal-setting cycles to AI coding sessions: Set clear, observable success criteria for each session using plan-do-check-act principles and adjust course based on results.
- Use structured task-level planning with AI: Have the agent analyze the codebase and break large features into small, testable chunks that can be completed in short iterations to prevent scope creep.
- Apply a red-green unit test cycle to AI code generation: Have the agent write failing tests first, then production code to make them pass, creating a structured feedback loop that reduces regressions and unintended consequences.
- Establish validation checkpoints: Pause for “completion analysis” reviews, asking the agent to check outcomes against the plan before moving to the next iteration.
- Implement daily micro-retrospectives: After each coding session, spend five to ten minutes with the AI agent analyzing what worked and how to improve your prompts and interactions.
AI code generation tools promise faster development, but often create quality issues, integration problems, and delivery delays. In this article, I describe a structured Plan-Do-Check-Act (PDCA) framework for human-AI collaboration that I’ve been refining over the last six months after working with agents in an unstructured process for over a year before that. Using this PDCA cycle, I believe I can better maintain code quality while leveraging AI capabilities. Through working agreements, structured prompts, and continuous retrospection, I use this practice to assert my accountability over the code I commit while guiding AI to produce tested, maintainable software.
Code Generation is Not Achieving Its Potential
The rapid adoption of AI code generation is increasing output but has not yet regularly produced measurable improvements in delivery and outcomes. Google’s DORA State of DevOps 2024 Report concluded that every twenty-five percent increase in AI adoption correlates with a 7.2 percent decrease in delivery stability. This gap is potentially due to increased batch sizes exceeding organizations’ abilities to define, review, test, deploy, and maintain the output.
More troubling are indications of quality issues. GitClear’s 2024 analysis of 211 million lines of code (sign up required for download) reveals a tenfold increase in duplicated code blocks, with duplicated code exceeding moved code for the first time in their surveys. Besides ballooning the amount of code to maintain, cloned code has a seventeen percent defect rate (Do Code Clones Matter? Wagner, et al. 2017), and 18.42 percent of those bugs are propagated into other copies (An empirical study on bug propagation through code cloning, Mondal, et al. 2019).
Why a Structured Plan-Do-Check-Act (PDCA) Cycle?
The industry is not achieving productivity gains and quality improvements because both AI tools and their use need to evolve. Engineers need repeatable practices that leverage their experience to guide agents in making test-verified changes while taking advantage of existing code patterns. Using agents this way means introducing structured prompting techniques.
Structured prompting outperforms ad-hoc methods by one to seventy-four percent, depending on approach and task complexity (A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications, Sahoo et al., 2024).
PDCA provides structure through a proven software engineering practice that incorporates continuous improvement and iterative delivery, principles underlying agile practices I have pursued for twenty years.
A controlled study, PDCA process application in the continuous improvement of software quality (Ning et al., 2010), found that PDCA reduced software defects by sixty-one percent.
An Outline of the PDCA Framework
What follows is a structured PDCA cycle I am assembling for my interactions with coding agents. I use this process individually, inside the project management process my team uses to plan, track, and accept work. The Plan and Check steps produce artifacts that I add to the Jira Stories we use to track work. This practice creates transparency and explainability with low overhead.
This PDCA cycle works best for coding tasks of one to three hours, but I also use it to break larger scopes of work into units of this size. This length of effort aligns with both my attention span and the context sizes of the models I use.
The framework consists of a working agreement and structured prompts that lead the human/AI collaboration through the steps of the cycle. Each step builds upon prior ones, and the Act step (retrospection) builds continuous improvement into the cycle.
- Working Agreements
These agreements are commitments the developer makes to engage and guide the agent according to a standard of quality. This step should take one minute for a human read-through.
- Planning Analysis
Instruct the agent to do a project-wide analysis of the business objective, existing patterns in the code, and alternative approaches to solving the problem. Allow two to ten minutes of human/AI exchange for this step, which produces an artifact for project tracking.
- Planning Task Breakdown
Instruct the agent to lay out the steps it will follow to achieve the business objective. This sets a path for the agent to follow so the output stays focused on the immediate objectives. This step should take about two minutes.
- Do
Implement the plan in a test-driven iterative cycle. Provide implementation guidelines that form strict guardrails for how the agent builds code, focused on desired behavior and verified by unit tests. Instruct the agent to make its reasoning visible to the developer, providing access points to intervene and direct the agent. The duration of this step depends on the scope of work, which is best kept under three hours.
- Check
Ask the agent to verify the code implementation, internal documentation, and README text against the initial objectives and implementation guidelines. This step helps the developer review the work and surfaces data for the retrospective. It should take five minutes of human/AI interaction and provides an artifact for project tracking.
- Act
Conduct a retrospective to continuously improve by learning from the session and refining the prompts and human/AI interactions. This step should take two to ten minutes of human/AI interaction.
The specific prompts included in this framework have been developed and refined through actual coding sessions using the retrospective process described. They reflect a specific set of quality concerns and have evolved through interactions with Anthropic’s model family. They are a starting point for others to adapt to their own development priorities and preferred AI tools. I have shared my current working agreements and PDCA prompts in a git repository.
Working Agreements: Human Accountability in AI Collaboration
Working agreements are a well-established practice for helping teams improve consistency and sustain code quality, concerns that apply even when an individual engineer works solo on a shared codebase. I have adapted this team-based approach for human-AI collaboration, using structured agreements to anchor human responsibility in the interaction.
Developed over two years of working with different generative AI code tools and evolving models, my working agreements declare the minimal set of norms I consider essential for maintaining code quality with AI assistance. My intention is to create small batch sizes, coherent commits, and isolated pull requests by directing the agent to make changes with less coupling, better cohesion, and reduced code duplication.
The agreements include principles (test-driven development (TDD), incremental change, and respect for established architecture) and example intervention questions: “Where’s the failing test first?” or “You’re fixing multiple things; can we focus on one failing test?” These agreements reinforce habits I believe I need in order to stay accountable for the code the AI produces.
The following is an excerpt from my analysis prompt that enforces repo-wide examination:
Required Deliverables BEFORE Analysis:
- Identify [two to three] existing implementations that follow similar patterns
- Document the established architectural layers (which namespaces, which interfaces)
- Map the integration touch points (which existing methods will need modification)
- List the abstractions already available (FileProvider, interfaces, base classes)
Plan
Planning is composed of two activities: high-level analysis and detailed planning.
High-level Analysis: Solve a Business Problem
Up-front analysis forces clarity on the explicit business problem and technical approach before code generation begins. This practice combats AI’s tendency to implement without sufficient context, which results in failed implementation attempts, code duplication, and unnecessary regressions. Through iterative cycles, and at the agent’s own suggestion, I have expanded the analysis prompt to explicitly enforce project-wide code searches for similar implementations and for integration and configuration patterns.
My analysis prompt mandates codebase searches to identify similar code patterns, system dependencies, and existing data structures. It asks for alternative approaches to solve the business problem. The prompt constrains output to human-readable analysis focused on “what” and “why” rather than implementation details.
I often ask clarifying questions and supply additional context before moving on to the next part of the “Plan” step. I include the analysis response in my Jira story to document my approach.
Here’s the introduction to my planning prompt:
Planning Phase. Based on our analysis, provide a coherent plan incorporating our refinements that is optimized for your use as context for the implementation.
Execution Context. This plan will be implemented in steps following TDD discipline with human supervision. Each step tagged for optimal model selection within the same thread context.
Detailed Planning: Create Observable and Testable Increments
Once I have agreed upon the approach, the detailed planning prompt asks the agent to prepare an initial execution plan. This plan breaks work into a set of atomic, testable checklist items with clear stop/go criteria and transparency requirements so I can follow what the agent is doing.
Large Language Models (LLMs) struggle to maintain a coherent direction over extended interactions, particularly in large codebases with established patterns that require architectural consistency. Detailed planning provides a road map and a contract between human and AI, fostering more engaged and accountable coding sessions.
My prompt enforces TDD discipline by requiring failing tests before any code changes and limits attempts to three iterations before stopping to ask for help. It mandates numbered implementation steps with explicit acceptance criteria and process checkpoints.
My interactions with the agent encourage it to proceed step-by-step through the plan. If the work is complex enough, reality will force me to diverge from the plan steps. For example, I may have to address regression test failures, or realize I misunderstood or overlooked something, or learn something during the course of the work that changes the approach. At that point, I often ask the agent to re-plan from where we are or work through the side task and then ask the agent to return to the plan from the step it last finished.
Do: Test-Driven Implementation with Human Oversight
The implementation prompt enforces TDD discipline while allowing related functionality to be grouped into batches of parallel changes and verified by tests simultaneously. The red-green-refactor discipline addresses AI’s tendency to create overly complex scenarios or skip test-first entirely. Batching reduces inference costs while accommodating AI’s strength at producing complete blocks of working code rather than the just-enough changes of true red-green test-driving. Research shows structured TDD with AI achieves better success rates than unstructured coding approaches but requires substantial human guidance throughout the process (LLM4TDD: Best Practices for Test Driven Development Using Large Language Models, Piya & Sullivan, 2023).
My implementation prompt includes checklists that both the agent and I can track, emphasizing behavioral test failures over syntax errors and real components over mocks. The step-by-step process may use more tokens upfront than an unstructured approach, but it allows for more active human oversight and produces isolated, coherent commits. I follow the agent’s reasoning and intervene when I see reasoning errors I can correct, context gaps I can supplement, or context drift. Signs of context drift include going off on tangents, duplicating code, or ignoring established patterns.
Here is an example of TDD rules in my implementation prompt:
TDD Implementation
- ❌ DON’T test interfaces – test concrete implementations
- ❌ DON’T use compilation errors as RED phase – use behavioral failures
- ✅ DO create stub implementations that compile but fail behaviorally
- ✅ DO use real components over mocks when possible
Compilation errors are not a valid red. A red test occurs when an invocation does not meet the expectation. This implies the project compiles and the method stubs exist, but the behavior is not yet implemented.
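To make the distinction concrete, here is a minimal C# sketch with hypothetical names (xUnit assumed as the test framework): the stub compiles, so the RED phase comes from a failing behavioral assertion rather than a build error.

```csharp
using Xunit;

// Hypothetical production stub: it compiles, but the behavior is deliberately unimplemented.
public class PathNormalizer
{
    public string Normalize(string path)
    {
        // Returning a placeholder keeps the build green while the test below stays red.
        return string.Empty;
    }
}

// Behavioral RED test: it fails on the assertion, not on compilation.
public class PathNormalizerTests
{
    [Fact]
    public void Normalize_CollapsesRedundantSeparators()
    {
        var normalizer = new PathNormalizer();

        var result = normalizer.Normalize("src//app///Main.cs");

        // RED: the stub returns string.Empty, so this behavioral expectation fails.
        Assert.Equal("src/app/Main.cs", result);
    }
}
```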
Check: Completion Analysis
The completion analysis asks the agent to review both the chat session transcript and generated code to confirm the changes produce the intended output and to flag deviations from the original plan and implementation guidelines. This review creates an explicit definition of done that goes beyond functional testing to include process adherence and architectural consistency.
Specifically, the review confirms that all tests passed and that complex output was verified through end-to-end testing where necessary. The agent reviews the new code for accurate internal documentation and good test coverage. It then audits the session to see whether we addressed all to-do items from the original plan and whether we consistently followed the test-first approach. These results are summarized narratively and as a checklist, along with any outstanding items and a conclusion on whether the work is ready to close.
This output speeds the human code review and provides an artifact the developer can review, correct, supplement, and add to work tracking systems. The results also provide data for the final step, the retrospective.
Here is my completeness prompt:
Completeness Check
Review our original goal outcome and plan against our execution.
Verification:
All tests passing
Manual testing completed (if needed)
Documentation updated
No regressions introduced
No TODO implementations remaining created by this test driving
Process Audit:
Testing approach was followed consistently
TDD discipline maintained (if chosen)
Test coverage is adequate and appropriate
No untested implementation was committed
Simple test scenarios were effective
Status: [Complete/Needs work]
Outstanding items: [any remaining tasks]
Ready to close: [Yes/No with reasoning]
Act: Retrospect for Continuous Improvement
The retrospective step analyzes the session to highlight collaboration patterns, identify successful human interventions, and suggest improvements to prompts and the developer’s use of the tools. Continuous improvement through retrospection mitigates AI’s inconsistent performance by systematically identifying which human interventions and prompt patterns yield better results.
The retrospective prompt asks the agent to summarize what occurred, flag wasted effort or wrong paths, present things that could have gone better, and suggest the one most valuable change I can make next time to improve the coding session. I focus on what I can change in the prompt language, process, and my behavior, because those are the only levers I can control to improve results.
Here is an example of evaluation points in my retrospection prompt:
Critical Moments Analysis:
- What were the 2-3 moments where our approach most impacted success/failure?
- What specific decisions or interventions were game-changers?
Technical & Process Insights:
- What patterns in our collaboration most impacted effectiveness?
- What would have accelerated progress?
- What process elements worked well vs. need improvement?
Measuring Success
Continuous improvement extends beyond individual PDCA cycles and benefits from independent quality measures. GitHub’s API provides a hook for creating an early warning system. I have GitHub Actions that measure five proxies for quality (a minimal sketch of the computation follows the list):
- Large commit percentage: commits containing over one hundred lines changed, targeting less than twenty percent
- Sprawling commit percentage: commits touching more than five files, targeting less than ten percent
- Test-first discipline rate: percentage of commits modifying both test and production files, targeting greater than fifty percent
- Average files changed per commit: targeting less than five files
- Average lines changed per commit: targeting less than one hundred lines
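As a rough illustration (and not the action I have published), the C# sketch below computes these five proxies from git log --numstat. The “path contains test” heuristic for classifying test files and the “@” record separator are simplifying assumptions made for brevity.

```csharp
// A minimal sketch, not the published action: shell out to `git log --numstat`
// and compute the five commit-size proxies listed above.
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

class CommitSizeMetrics
{
    record FileChange(int Lines, string Path);

    static void Main()
    {
        var psi = new ProcessStartInfo("git", "log --numstat --pretty=format:@%H")
        {
            RedirectStandardOutput = true
        };
        using var git = Process.Start(psi)!;
        string log = git.StandardOutput.ReadToEnd();
        git.WaitForExit();

        // Each commit block starts with "@<hash>"; numstat rows are "<added>\t<deleted>\t<path>".
        List<List<FileChange>> commits = log
            .Split('@', StringSplitOptions.RemoveEmptyEntries)
            .Select(block => block
                .Split('\n', StringSplitOptions.RemoveEmptyEntries)
                .Skip(1) // drop the hash line
                .Select(row => row.Split('\t'))
                .Where(cols => cols.Length == 3)
                .Select(cols => new FileChange(
                    (int.TryParse(cols[0], out var added) ? added : 0) +
                    (int.TryParse(cols[1], out var deleted) ? deleted : 0), // binary files report "-"
                    cols[2]))
                .ToList())
            .Where(files => files.Count > 0)
            .ToList();

        if (commits.Count == 0)
        {
            Console.WriteLine("No commits with file changes found.");
            return;
        }

        double Pct(Func<List<FileChange>, bool> predicate) =>
            100.0 * commits.Count(predicate) / commits.Count;
        bool IsTest(string path) => path.Contains("test", StringComparison.OrdinalIgnoreCase);

        double largePct = Pct(c => c.Sum(f => f.Lines) > 100);
        double sprawlPct = Pct(c => c.Count > 5);
        double testFirstPct = Pct(c => c.Any(f => IsTest(f.Path)) && c.Any(f => !IsTest(f.Path)));
        double avgFiles = commits.Average(c => c.Count);
        double avgLines = commits.Average(c => c.Sum(f => f.Lines));

        Console.WriteLine($"Large commits (>100 lines changed): {largePct:F1}% (target <20%)");
        Console.WriteLine($"Sprawling commits (>5 files): {sprawlPct:F1}% (target <10%)");
        Console.WriteLine($"Test-first discipline rate: {testFirstPct:F1}% (target >50%)");
        Console.WriteLine($"Average files changed per commit: {avgFiles:F1} (target <5)");
        Console.WriteLine($"Average lines changed per commit: {avgLines:F1} (target <100)");
    }
}
```

In practice, something like this would run from a pull-request trigger or a scheduled workflow, as described below.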
My GitHub Actions are available on GitHub; I run an action on pull requests and, every thirty days, on the full repository.
Illustration: Partial output of the PR analysis GitHub Action
For enterprises that are interested in more sophisticated metrics, there are commercial solutions like GitClear, DX, and LinearB. I do not have direct experience with any of them, but am trialing GitClear’s free tier to evaluate the results.
Experimental Results
To compare PDCA with an unstructured approach, I implemented the same story twice in Cursor with Anthropic models, once with each approach. I collected both quantitative and qualitative measures: token consumption, lines of code, subjective developer experience, and a code quality evaluation. The story I used required complex system interaction:
The overall goal is to enable @Tracer.cs to accept an entry point as class, method, or file and based on a @TracerOption.cs configured in the settings json rather than run the Roslyn based code path trace to check the kuzu database to determine if the containing dll has been analyzed, then retrieve the subgraph as a @ScoredOutputNodeGraph.cs from kuzu using the existing @KuzuDepedencyGraphReader.cs and @DatabaseDependencyGraphBuilder.cs and have the resulting map be functionally equivalent to the one that would be created through the method and class based trace.
Token Usage by Activity: Unstructured
| Activity | Tokens Used |
| --- | --- |
| Code | 264,767 |
| Troubleshooting | 1,221,217 |
| Grand Total | 1,485,984 |
Token Usage by Activity: PDCA
| Activity | Tokens Used |
| --- | --- |
| Analysis | 106,587 |
| Detailed Plan | 20,068 |
| Do | 1,191,521 |
| Check | 6,079 |
| Act | 7,383 |
| Grand Total | 1,331,638 |
The results demonstrate a trade-off between upfront planning costs and troubleshooting efficiency. In the unstructured session, eighty percent of the tokens were expended after the agent declared the task complete. This additional work entailed troubleshooting the implementation: debugging failures, resolving incomplete implementations, and correcting assumptions about existing code patterns. While I would not characterize this level of troubleshooting as typical, it is not unusual when working with complex integrations.
Code Output Metrics
| Metric | Unstructured | PDCA |
| --- | --- | --- |
| Lines of Production Code | 534 | 350 |
| Lines of Test Code | 759 | 984 |
| Number of Methods Implemented | 16 | 9 |
| Number of Classes Created | 1 | 1 |
| Number of Files Modified/Created | 5 | 14 |
The PDCA method resulted in fewer lines of production code, more comprehensive test coverage, and more atomic commits. The higher file count under PDCA reflects its emphasis on smaller, focused changes in discrete code and test files rather than broad modifications. While both approaches achieved working solutions, the unstructured approach required more extensive debugging after initial implementation, whereas PDCA’s test-driven increments caught issues earlier.
Qualitatively, PDCA creates a better developer experience. With PDCA, human interaction occurs throughout planning and coding, whereas in an unstructured approach, interaction is stacked at the end and focused mainly on reviewing and troubleshooting.
I realize these results are based on a single experiment. I present them not as proof, but as directional data supporting my interest in continuing to evolve this approach in my own practice.
Areas for Further Development
My agreements, prompt templates, and measures of success are relatively new and evolving, as are the capabilities of the AI tools themselves. Here are my current focus areas for refinement and experimental learning:
Matching process formality to task complexity
The PDCA framework’s structured approach provides value but needs to be calibrated to match the complexity and risk of the work being performed. The planning prompt requires significant token use, and I am looking to experiment with lighter-weight analysis and planning steps for well-isolated changes, such as implementing an interface where concrete examples already exist. In this case, and potentially others, the existing code provides sufficient context and patterns for the AI to follow without extensive upfront analysis.
Early in the evolution of agile approaches, Alistair Cockburn proposed the Crystal approach (see the link to the slideshow overview), where the level of process rigor should scale with project criticality and team size. This approach suggests developing less formal versions of the planning and implementation prompts for lower-complexity scenarios, while still performing pattern analysis, enforcing sufficient transparency, and retrospecting.
Complex changes involving architectural decisions, cross-system integration, or novel problem domains benefit from a more structured PDCA cycle. The up-front investment in analysis and detailed planning prevents the compound costs of rework, regression, and technical debt that emerge when AI tools operate without sufficient context.
Including a model selection strategy
Structured analysis and planning open up an opportunity to optimize execution cost by switching models based on task complexity. I have started including a complexity assessment in analysis and planning, asking the AI to estimate implementation difficulty, pattern clarity, and scope for each proposed solution. I then ask it to recommend which model within my selected family to use at each point in the execution (e.g., Sonnet or Haiku within Anthropic’s Claude family). The criteria by which it forms a recommendation are useful, but the recommendations are not founded on empirical evidence, and the models’ actual behaviors are more recent than their training data. For now, I am waiting for Anthropic to release an updated small model and will experiment with smaller models from other families.
Initial analysis and planning phases require more capable models to handle ambiguous requirements and architectural reasoning. However, implementation phases following clear specifications may work effectively with less expensive models, particularly when the codebase contains strong patterns and the changes are well-scoped. The power of addressing this in a human-in-the-loop process is that the human can proactively downgrade models and quickly intervene when they struggle.
Conclusion
Research shows that AI code generation isn’t achieving its productivity potential due to quality degradation and integration challenges. The PDCA framework closes this gap by applying structure to human-AI collaboration that better maintains code quality while leveraging AI capabilities.
The framework delivers on the five key practices: structured goal-setting through analysis and planning phases, task-level planning that produces atomic commits, red-green test cycles that catch issues early, validation checkpoints for completeness, and micro-retrospectives for continuous improvement. The experimental results suggest the core trade-off: PDCA requires more upfront planning investment but reduces troubleshooting and maintenance while improving the developer experience.
Organizations adopting AI code generation need systematic practices that scale but allow for individual preferences. The PDCA framework provides structure while remaining adaptable to different contexts. As AI capabilities evolve rapidly, disciplined approaches to human-AI collaboration are essential for sustainable software development.
AI Disclosure
In accordance with InfoQ’s AI policy for contributors, I used generative AI tools (Claude) as a support tool while maintaining human expertise as the driver of this content. Specifically, I used Claude for brainstorming and ideation to develop my initial outline, feedback and devil’s advocate review to identify gaps in my arguments, and drafting assistance to improve clarity, flow, and grammar across multiple revisions. The prompt examples included in this article are co-authored with Claude and represent actual prompts from my work. I personally retrieved and reviewed all research sources, authored the analysis and framework, and made all final content decisions. I take full responsibility for the accuracy, originality, and quality of all content, having rigorously verified any AI-generated suggestions before adoption.