Key Takeaways
- Combining DevOps Research and Assessment (DORA) metrics with Process Behavior Charts (PBCs) allows engineering teams to distinguish normal process variation from real signals, turning delivery metrics into a reliable decision-making tool.
- DORA metrics become powerful instruments for validating hypotheses, such as the impact of pair programming, team scaling, and tooling changes, rather than simple reporting KPIs.
- PBCs expose delivery degradation early by visualizing unusual spikes caused by broken tooling, unstable environments, onboarding challenges, or human factors.
- Long-term DORA data reveals systemic performance plateaus and shifts, allowing organizations to connect improvements to architectural, cultural, and process changes.
- DORA metrics describe only the delivery part of the value stream, so pairing them with product metrics and well-being indicators provides a more complete understanding of both performance and impact.
I live in Amsterdam, a compact city crisscrossed with canals and full of small bridges that occasionally rise to let barges glide back and forth. It’s unbelievably beautiful, so walking is a real pleasure. My office is two kilometers away, and on foot the trip takes about twenty-five minutes, sometimes a bit more, sometimes less.
Like most locals, I eventually switched to a bike, and the commute dropped to around 6.5 minutes. Later, I found a shorter route without bridges or traffic lights, and the trip shrank to about 4.8 minutes.
If you plot these trips over time, you get a picture like this:
Figure 1. Each flat section, or plateau, represents a stable process (walking, biking, new route). Real improvements appear as clear, sudden drops to a new, faster level.
Each flat section represents a stable process (walking, biking, new route), with its own average and some natural variation around it.
Even when nothing changes, no two trips take exactly the same amount of time. That variation is just noise: wind, red lights, or waiting for a bridge.
Only when I made a real change, buying a bike or finding a new route, did the process shift to a new performance level.
But what if, instead of buying a bike, I had bought new shoes? Surely, that wouldn’t change much.
Software delivery behaves the same way.
Every team has a stable process shaped by its own rules, tools, and habits. If we apply random or uncoordinated improvements not aimed at the real bottleneck, they rarely lead to visible progress. Over time, this creates frustration and demotivation.
A sustainable improvement culture relies on three things: outcome-based metrics to measure real progress, a focus on bottlenecks instead of random tweaks, and the ability to assess results and learn from them.
What Good Looks Like
That’s where DevOps Research and Assessment (DORA) metrics come in. They’re based on solid research and have a consistent, predictive correlation with desirable outcomes, both team well-being and organizational performance, as shown in the book Accelerate and in the DORA research program. We can use DORA metrics to describe our software development process and draw conclusions that lead to meaningful results.
The DORA framework consists of several key metrics. Among them, Change Lead Time (CLT) shows how quickly a team can deliver a change, and Deployment Frequency (DF) shows how often the team actually delivers. While important, DF is often more volatile, influenced by team size, vacations, and the type of work being done. Finally, the stability metrics (change failure rate and time to restore service) and reliability SLOs serve as a counterbalance. We want to go faster, yes, but never at the cost of stability.
Figure 2. High-level illustration of desired DORA metric trends. As teams improve their delivery process, Change Lead Time typically decreases, while Deployment Frequency increases.
When we track our software development process with DORA metrics, we want CLT to go down and DF to rise steadily over time.
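To make these definitions concrete, here is a minimal sketch of how the two metrics could be computed from raw delivery data. The record shape (one commit timestamp and one deployment timestamp per shipped change) and the use of a weekly median are illustrative assumptions, not a prescription from the DORA research.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

# Hypothetical delivery records: each shipped change carries the timestamp of its
# first commit and the timestamp of the deployment that put it into production.
changes = [
    {"committed_at": datetime(2025, 2, 3, 9, 0), "deployed_at": datetime(2025, 2, 3, 16, 30)},
    {"committed_at": datetime(2025, 2, 4, 10, 0), "deployed_at": datetime(2025, 2, 5, 11, 0)},
]

def weekly_dora(changes):
    """Aggregate Change Lead Time (median, in hours) and Deployment Frequency per ISO week."""
    lead_times = defaultdict(list)  # (year, week) -> lead times in hours
    for change in changes:
        week = change["deployed_at"].isocalendar()[:2]
        hours = (change["deployed_at"] - change["committed_at"]).total_seconds() / 3600
        lead_times[week].append(hours)
    return {
        week: {"clt_hours": round(median(times), 1), "df": len(times)}
        for week, times in sorted(lead_times.items())
    }

print(weekly_dora(changes))  # {(2025, 6): {'clt_hours': 16.2, 'df': 2}}
```

Whatever tooling produces the raw records, the important part is that both metrics come from the same stream of deployed changes, which keeps them cheap to collect and consistent across teams.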
How Process Behavior Charts Enhance the Use of DORA Metrics
The good news is that statistical process control offers a number of useful tools to help us make sense of the data. One of them is the Process Behavior Chart (PBC), introduced and popularized by Donald Wheeler.
Let’s zoom into the biking part of my commute process.
Figure 3. A Process Behavior Chart of the biking process. The spike at trip 24 is a special cause, an exceptional event (getting a fine) that is outside the normal variation.
The PBC shows the mean together with upper and lower control limits. All points within those limits represent common-cause variation: normal fluctuations such as traffic or wind. Any point outside them signals a special cause, an exceptional event that is not part of the usual process.
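The limits themselves are simple to compute. Below is a minimal sketch of the XmR-style calculation, using Wheeler’s standard constant of 2.66 applied to the average moving range; the trip times are illustrative numbers rather than my actual commute data.

```python
def pbc_limits(values):
    """Center line and natural process limits of an XmR chart (Wheeler)."""
    mean = sum(values) / len(values)
    # Moving range: absolute difference between consecutive points.
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    avg_mr = sum(moving_ranges) / len(moving_ranges)
    # 2.66 is the standard XmR constant (3 / d2, where d2 = 1.128 for subgroups of two).
    upper = mean + 2.66 * avg_mr
    lower = max(mean - 2.66 * avg_mr, 0)  # a commute (or a lead time) cannot be negative
    return mean, lower, upper

# Illustrative biking times in minutes; the 13.5 stands in for the trip with the police stop.
trips = [6.4, 6.7, 6.2, 6.9, 6.5, 6.3, 6.8, 6.6, 6.4, 6.7, 13.5, 6.5, 6.6]
mean, lower, upper = pbc_limits(trips)
outside = [i for i, v in enumerate(trips) if v < lower or v > upper]
print(f"mean={mean:.1f}, limits=({lower:.1f}, {upper:.1f}), special causes at {outside}")
```

One practical caveat: a single large spike also inflates the average moving range, so more robust variants (for example, limits based on the median moving range) are sometimes preferred when outliers are frequent.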
I like to get to work early in the morning, so in winter it is still dark. One day, I was fined by the police for riding without lights, a clear special cause, an unexpected event outside my usual routine. Having lived in the Netherlands for only a short time, I didn’t yet think of cycling as something serious, so the police check came as a surprise. Still, in terms of my regular commute, this didn’t change much, because the likelihood of it happening again felt low and the officer handled everything very quickly (though it was financially painful). I eventually fixed the lights, but not because of my commute process, rather for safety, once I realized how intense Amsterdam bike traffic can be and how risky it is to ride in the dark in camouflage mode.
Notice that addressing a special cause doesn’t improve the process itself; it simply prevents degradation or instability. For sustainable improvement, we must focus on what is considered normal in the current process: the common causes.
Beyond spotting special causes, PBCs are also useful for detecting shifts, moments when the entire system moves to a new performance level. In the commute example above, these shifts appear as clear drops in the average commute time whenever a real improvement is introduced, such as buying a bike or finding a shorter route. Technically, a shift occurs when several consecutive points fall above or below the previous mean, signaling that the process has fundamentally changed. Unlike a one-off special cause, a shift indicates a new “normal”, often resulting from a deliberate improvement such as automation or a change in workflow.
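A minimal sketch of such a run test is shown below, assuming the commonly used convention of eight consecutive points on the same side of the center line; both the run length and the commute numbers are illustrative choices, not part of any DORA definition.

```python
def detect_shift(values, center, run_length=8):
    """Return the index where a run of `run_length` consecutive points on the
    same side of the center line begins, or None if no such run exists."""
    run_side, run_count = None, 0
    for i, value in enumerate(values):
        side = "above" if value > center else "below" if value < center else None
        if side is not None and side == run_side:
            run_count += 1
        else:
            run_side, run_count = side, 1 if side else 0
        if run_count == run_length:
            return i - run_length + 1  # first point of the run
    return None

# Walking commutes followed by biking commutes (minutes, illustrative).
commutes = [25, 26, 24, 27, 25, 26, 25, 24, 26, 25,
            6.5, 6.4, 6.7, 6.3, 6.6, 6.5, 6.8, 6.4]
walking_mean = sum(commutes[:10]) / 10
print(detect_shift(commutes, walking_mean))  # -> 10: the process settled at a new level
```

Once a shift is confirmed, the limits should be recomputed from the points after it, because the old limits describe a process that no longer exists.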
As the saying often attributed to Einstein goes, we cannot expect different results by doing the same thing. That’s exactly the point: real progress, like cutting a commute from twenty-five to six minutes, comes only from challenging the system itself.
A note on interpretation: A Process Behavior Chart does not prove a causal relationship; it only shows that the system has shifted in a statistically meaningful way. The chart tells us when the change happened, but not why. The explanatory link comes from context: If we know that no other major factors changed at that time, the observed shift becomes strong supporting evidence for the hypothesis. A fully rigorous proof would require A/B-style experimentation, which is often impractical in real-world software development. In practice, PBCs offer a pragmatic balance: not scientific proof, but solid, data-backed confidence that a change is working as intended.
Now, let’s put this framework into action. The following real-world examples show how DORA metrics can be applied in both tactical and strategic contexts.
Case Study 1: Detecting Real Issues vs. Normal Noise in Change Lead Time
Imagine a team noticing a rise in CLT around mid-February 2025 (see Figure 4).
Figure 4. Process Behavior Chart for Change Lead Time, weekly aggregation. The chart shows a period of instability caused by several factors: broken tooling, a new joiner, and personal issues.
Is it a sign of trouble? The PBC says no; everything is still within normal variation. After a quick look at PR data, the team finds the explanation: Most of the work that month touched the legacy monolith with manual deployments and plenty of friction. Since there was already a long-term plan to move away from it, no immediate action was needed.
Then, in late March, the chart lights up again, this time for real. The PBC shows a clear signal that something unusual is happening. After some digging, the team discovers the culprit: deployment tooling issues that slowed down releases across the board. Once the platform team fixed them, delivery speed bounced back.
But the story didn’t end there. In late April and early May, a few more spikes appeared, this time for human reasons. One reason was a new teammate who wasn’t yet comfortable with the team’s way of working in small, incremental changes. At the same time, another teammate was going through personal issues that affected their performance and communication. Together, these factors disrupted the team’s usual flow and showed up in the PBC. After some focused support and adjustments, the process stabilized again. By August, the data showed the system back to its normal rhythm.
Flaky tools, unstable environments, or personal situations can build up slowly and go unnoticed for weeks. In this case, the PBC made those patterns visible and gave the team a chance to act before things got worse.
Some might argue that a PBC for CLT is a lagging indicator; it’s based on changes already deployed to production. That’s true. But the more often the team deploys, the more useful this tool becomes, helping them spot problems faster and act on evidence, not assumptions.
Case Study 2: Validating a Bold Process Change: Pair Programming
Beyond detecting issues, PBCs help confirm major process changes with data instead of opinions.
One of the core ideas of continuous improvement is reducing PR size, which brings many benefits: quicker feedback cycles, easier diagnostics during outages, and a smoother development flow. But working in small batches is not trivial; it requires skills like making safe, incremental refactorings, running changes in parallel, and using feature flags effectively. The team had been investing in these practices for some time to improve performance.
At some point, however, the team realized that the overhead of asynchronous code reviews made it difficult to reduce PR sizes further. While we often say that smaller PRs are easier to review, that’s only partially true. A meaningful review still requires understanding the context of the change, and too many PRs force reviewers to switch context repeatedly. As a result, the overall process becomes slower and less efficient.
This effect was described by Stepanovic, who also highlights a key trade-off: “With the async way of working, we’re forced to make a trade-off between losing quality (big PRs) and losing throughput (small PRs)”. Not wanting to sacrifice either quality or throughput, the team decided to experiment with pair programming, despite a common concern that it would reduce throughput. The results proved otherwise: CLT dropped by a factor of three, while DF increased by approximately twenty percent. A single dip in late December was due to a temporary release freeze.
Figure 5. Process Behavior Charts for Deployment Frequency and Change Lead Time for a team. The upper chart shows a shift to a more performant level after the introduction of pair programming. The lower chart shows that throughput not only did not degrade but actually improved.
The experiment confirmed the hypothesis and showed that the initial concern was unfounded. Without this data, the discussion would have remained subjective: “it feels faster” versus “it feels slower”. PBCs turned intuition into evidence.
Pair programming is a powerful practice supported by well-known experts such as Kent Beck and many others in the XP community. Yet, in my experience, it remains surprisingly underused in real-world teams. One reason is that its ROI is hard to communicate: Pairing can easily be perceived as “two developers doing the work of one”. Without data, this intuition is difficult to challenge. In our case, PBCs helped make the impact visible and provided the justification the team needed.
Case Study 3: Measuring the Impact of Team Scaling on Throughput
Let’s look at another team.
Figure 6. Process Behavior Charts for Deployment Frequency and Change Lead Time for a team. The upper chart shows a sixty percent shift in throughput detected after adding new team members.
We added new team members in April, and DF immediately rose by about sixty percent while CLT stayed the same. That’s a great sign. Our system was mature enough to onboard new people without slowing down. A result well worth celebrating.
Three-Year Longitudinal Analysis: Performance Plateaus and Strategic Shifts
Sustainable improvement is rarely linear. It depends on a series of strategic bets whose effects emerge over time. Some succeed, others fail, and external factors, from tooling changes to team turnover, often introduce temporary setbacks.
By zooming out, we can see the bigger story: what worked, what didn’t, and how team composition or strategic choices shaped performance. Over such a long time span, the charts no longer represent a single stable process, but rather a sequence of different system states. Here, the value of a PBC-style analysis is not strict statistical control, but making long-term shifts and performance plateaus visible.
Figure 7. Process Behavior Chart-style time series illustrating long-term shifts and performance plateaus in Deployment Frequency and Change Lead Time for a team over almost three years.
Even though the major constraint for the team was the legacy components, we saw meaningful improvements only when the migrations were paired with process and cultural changes. When teams move mechanically from monolith to microservices or micro-frontends, they often replicate the same heavyweight processes, and the result is little improvement, or even degradation.
Typical examples include deployment procedures that require multi-stage manual verification even when changes are trivial and fully owned by a single team; migrations that still depend on a slow, mandatory cross-team review process; or platform policies that enforce end-to-end testing on every deployment, effectively destroying loose coupling. Many of these decisions are made with good intentions, but some reflect cultural issues around trust and control – and their impact often remains invisible until you look at the metrics.
When we look at almost three years of DORA data in Figure 7, we can distinguish three stable performance levels, each corresponding to a distinct system state and separated by major process changes:
- CLT ≈ 25h, DF ≈ 5/week (legacy migration done, no process changes).
- CLT ≈ 10h, DF ≈ 10/week (deployment automation introduced).
- CLT ≈ 3h, DF ≈ 15/week (pair programming introduced).
Each plateau represents a new, more efficient system. The first major shift came from deployment automation, moving to a model where every merged change goes straight to production with no manual steps or approvals, built on earlier investments in legacy migration, testing, and refactoring. The second followed the adoption of pair programming, which accelerated collaboration and reduced review delays.
Changes in team composition (onboarding, departures, extended absences) also introduced noticeable variation and sometimes temporary degradation, highlighting how sensitive delivery performance is to team stability.
Over these three years, CLT improved roughly tenfold and DF tripled. This improvement was driven by internal optimizations. Could it have been even better? Possibly. Earlier adoption of pair programming might have delivered these gains sooner.
It’s worth noting that the team members were aware of the metrics but not incentivized by them. DORA metrics were never part of personal objectives, so there was little motivation to “game” the numbers. Instead, we discussed each initiative in terms of how it could unlock better ways of working or improve outcomes. Metrics served as a thermometer, a way to observe change, not a target to optimize. This helped keep the focus on product improvements rather than chasing specific values.
Connecting to the Outcomes
The strength of DORA metrics is that they’re easy to collect and consistent across teams. This makes it possible to compare performance objectively and track improvement over time. According to DORA research, these metrics have a predictive relationship with broader outcomes such as organizational performance and team well-being. In other words, teams that score higher on DORA metrics are statistically more likely to achieve better business results and report higher satisfaction. But that correlation becomes reliably visible only when looking at a large number of teams. For a small sample, we can’t claim causation with confidence; improvement in metrics doesn’t automatically prove improvement in outcomes.
Still, the early signs are encouraging. Among the few teams that have adopted this approach, we already see a clear rise in the number of major initiatives delivered and higher well-being reported in internal surveys. As more teams join in, we’ll be able to validate whether these local improvements consistently translate into broader organizational outcomes.
Looking back at the commute example from the beginning of the article, I eventually returned to walking. I realized that I had been optimizing for a narrow metric, travel time, even though it was already good enough, while overall satisfaction was the outcome that mattered more to me. In hindsight, some of the improvements were not strictly necessary; an earlier investment in good walking shoes might have delivered more value than optimizing for speed.
The same principle applies to software delivery. Delivery performance is only one part of a broader value-creation system. While delivery is often the initial bottleneck worth addressing, once improvements are made the constraint typically shifts elsewhere. Teams may discover that the real limitation is no longer how fast they can ship, but whether they are building the right product, how quickly they learn from users, how changes are adopted in production, or how sustainable the pace is for the people doing the work.
How to Get Started
- Measure the baseline. Track your DORA metrics and collect three to six months of data to understand normal variation (see the sketch after this list).
- Explore the changes. Analyze individual PRs to spot bottlenecks rather than symptoms.
- Select experiments. Choose initiatives that balance expected impact and effort, and define clear indicators of success.
- Implement and observe. Roll out changes and watch for statistically meaningful shifts, not isolated data points.
- Reflect and repeat. Review outcomes with the team, keep what works, and design the next experiment. Improvement is a loop, not a project.
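The first and fourth steps can be tied together in a few lines of code. Here is a minimal sketch that freezes natural process limits from a baseline period and then judges new weekly values against them; the weekly lead times are invented for illustration, and 2.66 is the standard XmR constant mentioned earlier.

```python
def frozen_limits(baseline):
    """Freeze the center line and natural process limits from a baseline period."""
    mean = sum(baseline) / len(baseline)
    avg_mr = sum(abs(b - a) for a, b in zip(baseline, baseline[1:])) / (len(baseline) - 1)
    return mean, max(mean - 2.66 * avg_mr, 0), mean + 2.66 * avg_mr

# Hypothetical weekly Change Lead Time in hours: roughly six months of baseline data,
# followed by the weeks observed after an improvement was rolled out.
baseline_clt = [26, 24, 27, 25, 23, 28, 26, 25, 24, 27, 26, 25,
                24, 26, 27, 25, 26, 24, 25, 27, 26, 25, 24, 26]
new_weeks = [25, 24, 11, 10, 12, 9, 11, 10]

center, lower, upper = frozen_limits(baseline_clt)
for offset, value in enumerate(new_weeks, start=1):
    status = "signal" if value < lower or value > upper else "noise"
    print(f"week +{offset}: {value}h -> {status} (limits {lower:.1f}-{upper:.1f})")
```

The key detail is that the limits come from the baseline only; judging new points against limits that already include them would blur the very signal we are looking for.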
Conclusion
We’ve all been in meetings where “it feels slower” is countered by “it feels faster”, and the discussion goes nowhere. Many capable teams keep working hard, yet the data show flat performance. Some are rewarded for random success; others are blamed for outcomes beyond their control.
As a quote often attributed to W. Edwards Deming says, “Without data, you’re just another person with an opinion”. DORA metrics provide the data to move beyond opinion, and Process Behavior Charts show how to interpret it. Together, they turn intuition into evidence and progress into something visible.
This is how teams stop debating noise and start improving deliberately. Don’t just have an opinion; be the one with the answer.
References
- Forsgren, N., Humble, J., & Kim, G. (2018). Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations. IT Revolution Press.
- The DORA Team. (Annual). State of DevOps Report. Google Cloud.
- Goldratt, E. M., & Cox, J. (1992). The Goal: A Process of Ongoing Improvement. North River Press.
- Stepanović, D. (2022, November 08). From Async Code Reviews to Co-Creation Patterns. InfoQ.
- Wheeler, D. J. (2000). Understanding Variation: The Key to Managing Chaos. SPC Press.
- Gancarz, R. (2024). Booking.com Doubles Delivery Performance Using DORA Metrics and Micro Frontends. InfoQ News.
All images and charts in this article are original visualizations created by the author based on anonymized real-world team data.
