Key Takeaways
- Decoupling at the technical, organizational, and semantic levels enabled us to evolve away from a tightly coupled legacy architecture without rewriting everything at once.
- Change Data Capture allowed us to build a near real-time system of reference, eliminating the need for direct synchronous access to mainframes for most applications.
- By using GraphQL instead of REST, we removed the need for dozens of Backend-for-Frontend layers and improved performance, flexibility, and maintainability.
- Aligning teams with domain boundaries using Team Topologies reduced cognitive load, streamlined delivery, and gave teams clear ownership of systems and outcomes.
- Through incremental rollout, automation, and hybrid architecture, we delivered value continuously, replaced the legacy system safely, and avoided the pitfalls of big-bang re-platforming.
Introduction: “Where Did Half the Room Go?”
This article is based on our talk at QCon San Francisco in November 2024. In mid-2024, our team was in a PI planning session with about 50 or 60 people, including our primary business stakeholders. In the middle of that session, half the room got up and left: the billing mainframe, which drives the customer web portal and many other applications, had gone down, causing outages.
The half that remained was primarily our team. We were about halfway through an incremental, multi-year program to replace that customer web portal. Thanks to edge routing, the application dynamically routed users to new pages as we developed and released them. Those pages were still being served, even though their system of record was down, thanks to our cloud-based streaming architecture.
The customer self-service portal provides crucial functionality to customers such as viewing bills, making payments, starting or stopping utility services, and monitoring energy usage patterns. The system we were replacing failed to deliver a reliable experience, but the new system had just proven itself to be resilient. This is the story of how we transformed our architecture through a comprehensive decoupling strategy spanning technical systems, organizational structures, and semantic understanding.
This Is a Story About Decoupling
At its heart, our transformation journey is about breaking dependencies at multiple levels. Many enterprises face similar challenges with legacy systems: tightly coupled architectures that are difficult to scale, change, or maintain. For us at National Grid, the solution came through four complementary paradigms that worked together to enable different forms of decoupling:
- Domain-Driven Design (DDD): A software development methodology focusing on modeling complex systems around business domains. DDD gave us tools to reshape legacy mainframe data into business-friendly concepts that were easier to understand and evolve. By creating a common language between technical and business stakeholders, DDD enabled semantic decoupling from legacy constraints.
- Team Topologies: An organizational design approach that structures teams around a small set of team types and interaction patterns. For us, this meant aligning teams to business value streams rather than technical layers, creating stream-aligned teams that could own entire business capabilities and complicated subsystem teams that could handle the mainframe integration complexity. This organizational decoupling reduced coordination overhead and accelerated delivery.
- Event-Driven Architecture: A system design pattern using asynchronous events as the primary means of communication between components. Unlike synchronous APIs with direct request-response patterns, events provide a stronger abstraction that allows for multiple or even no responses. This architectural style shifted our system from strong consistency to eventual consistency, a deliberate tradeoff that enabled technical decoupling between front-end and back-end systems.
- Change Data Capture (CDC): A technique that identifies and captures changes in a database, allowing those changes to be propagated to other systems. For us, CDC formed the foundation of our new architecture, enabling the creation of a “system-of-reference” that could serve applications without direct mainframe dependency.
These four paradigms, while powerful individually, created an even stronger solution when applied together. Domain-Driven Design shaped how we modeled the business. Team Topologies aligned our organizational structure with those domains. Event-Driven Architecture provided the technical patterns to decouple components. Change Data Capture created the bridge from legacy systems to modern, domain-aligned data stores.
Where We Started: The Unified Web Portal 1.0
National Grid is a regional utility provider delivering both gas and electric service. Over years of mergers and acquisitions, we had accumulated multiple systems-of-record for billing and customer information. The resulting customer experience was fragmented – customers with both gas and electric service often needed to use different portals for each utility type.
The first version of the Unified Web Portal (UWP 1.0) attempted to solve this fragmentation by consolidating data from multiple mainframes into a single customer experience. The technical approach was straightforward:
- Extract data from multiple mainframe systems using ETL (Extract, Transform, Load) processes
- Transform this data into a unified data model
- Load it into a SQL database
- Finally, move it into a SaaS platform with its own datastore
However, this approach had significant limitations:
- Our ETL processes ran in batches only a few times per day, creating data freshness issues
- Combining data from multiple sources led to data quality problems
- For critical operations where real-time data was essential, applications had to make synchronous calls directly to the mainframe, bypassing the unified model
ETLs are perfectly fine for analytical data. In our case, however, we were dealing with operational data, and the batch nature of ETL was ill-suited to giving customers the most recent information about their accounts. Worse still, the synchronous connection to the mainframe meant an inelastic resource was serving an inherently elastic system: a website.
Why That Didn’t Work: The Emergent Problems
The UWP 1.0 architecture created multiple interconnected problems that compounded over time:
Technical Problems
- Data Currency Issues: With ETL batch processes running only a few times daily, customers would see outdated information. A payment made in the morning might not appear on their account until evening, creating confusion and support calls.
- Synchronous Mainframe Dependencies: For critical operations requiring current data, applications needed direct mainframe access. This created a scenario where a highly elastic web application was tightly coupled to an inelastic mainframe system.
- Backend-for-Frontend Proliferation: To handle different use cases with current data, we created dozens of Backend-for-Frontend (BFF) APIs, each tightly coupled to specific frontend needs and mainframe endpoints. These multiplied over time, creating a maintenance nightmare.
- Distributed Transaction Complexity: Our API integration platform used synchronous calls to achieve distributed transactions across multiple systems, increasing coupling and failure points.
Organizational Problems
That technical architecture was shaped by the organizational structure of the business’ technology groups (Conway’s Law):
- Mainframe Architects managing the core legacy systems
- SQL Database Administrators handling the intermediate datastore
- Integration Engineers creating and maintaining the API platform and BFFs
- Web Developers building the frontend experience using a SaaS web platform
This led to a change and release process that demanded heavy coordination to ship even minor features, with siloed deliverables spread across multiple teams that had different priorities and release cycles.
Cascading Failures
The most dramatic consequence of this architecture was the potential for cascading failures due to inelasticity.
When traffic spikes hit our web portal, those requests would flow through to the mainframe. Unlike cloud systems, mainframes can’t elastically scale to handle sudden load increases. This created a bottleneck that could overload the mainframe, causing connection timeouts. As timeouts increased, the mainframe would crash, leading to complete service outages with a large blast radius: hundreds of other applications that depend on the mainframe would also be impacted.
This is a textbook illustration of the problem with synchronous connections to mainframes: when an inelastic datastore can be overwhelmed by a highly elastic source of load like the web, the datastore can fail, and that failure can take every consuming application down with it.
Setting Our New Goals: The Path Forward
Understanding these challenges, we had clear objectives for our architectural transformation:
Technical Goals
- Decoupling: Separate the frontend and web services from direct mainframe dependencies
- Reducing Dependencies: Eliminate brittle point-to-point integrations, especially the proliferation of BFFs
- Empowered Engineering: Create a structure where teams could own and deliver end-to-end slices of functionality
Business Goals
- Reduced Call Center Volumes: Decrease support calls resulting from portal outages or stale data
- Lower Licensing Costs: Eliminate expensive third-party middleware and integration platforms
- Improved Customer Satisfaction: Provide a more reliable experience with fresher data
Our plan required a fundamental rethinking of both the architecture and team structure – not just technical changes but also a shift in how people collaborated and what they owned.
Our New Architecture: A Bird’s Eye View
Our new architecture represented a complete departure from the previous approach:
- On-premises: The mainframe remained as the system-of-record with CDC configured to capture changes
- Cloud: A modern, cloud-native architecture with several key components:
- Streaming CDC events into Event Hubs (Kafka)
- Background services that processed events, both upstream and downstream
- Document databases (Cosmos DB for MongoDB) storing the processed domain entities
- Public APIs (GraphQL and REST) exposing these entities through an API gateway
- Web, mobile, and other applications consuming these APIs
The specific technology here isn’t what matters; it’s the strategies and patterns. You could replace Azure with AWS. We used widely adopted standards wherever we could, further increasing decoupling and reducing lock-in.
This architecture created a clear separation between our legacy systems and modern applications, with change data capture serving as the engine and event-driven architecture as the highway.
Step One: Change Data Capture – Creating a System-of-Reference
Change Data Capture became the foundation of our new architecture. Instead of batch ETLs running a few times daily, CDC streamed data changes from the mainframes in near real-time. This created what we called a “system-of-reference” – not the authoritative source of truth (the mainframe remains “system-of-record”), but a continuously updated reflection of it. The system of reference is not a proxy of the system of record, which is why our website was still live when the mainframe went down.
Additionally, this system of reference was not constrained by the schema, structure or semantics of the mainframe. We modeled this system using domain-driven design, allowing the API to evolve at its own pace.
Our data flow worked like this:
- Mainframe data changes were captured through CDC
- These changes were published to Kafka topics
- Data processors formed an anti-corruption layer, transforming mainframe concepts to domain concepts
- The transformed data populated domain-specific entity databases
- APIs exposed these entities to applications
This approach allowed us to decouple the frontend experience from the constraints of the mainframe. Applications could now access data in business-friendly formats without being coupled to mainframe semantics.
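To make the anti-corruption layer concrete, here is a minimal sketch of a data processor, assuming a Kafka-compatible broker (Event Hubs exposes a Kafka endpoint) and a MongoDB-compatible store (as Cosmos DB for MongoDB does). The topic name, mainframe field names, and domain shape are illustrative, not our production schemas.

```typescript
// Sketch of a CDC data processor acting as an anti-corruption layer.
// Topic, field, and collection names are illustrative only.
import { Kafka } from "kafkajs";
import { MongoClient } from "mongodb";

// Raw CDC record as it might arrive from a mainframe billing table.
interface MainframeBillingRow {
  ACCT_NO: string;
  CUST_NM: string;
  BAL_AMT: string;   // decimal carried as a string, e.g. "0001234.56"
  STAT_CD: string;   // cryptic status code, e.g. "A", "C"
}

// Business-friendly domain entity for the system-of-reference.
interface BillingAccount {
  accountId: string;
  customerName: string;
  currentBalance: number;
  status: "Active" | "Closed" | "Unknown";
}

// Translate mainframe semantics into domain semantics (the anti-corruption step).
function toDomain(row: MainframeBillingRow): BillingAccount {
  return {
    accountId: row.ACCT_NO.trim(),
    customerName: row.CUST_NM.trim(),
    currentBalance: Number(row.BAL_AMT),
    status: row.STAT_CD === "A" ? "Active" : row.STAT_CD === "C" ? "Closed" : "Unknown",
  };
}

async function main(): Promise<void> {
  const kafka = new Kafka({ clientId: "billing-processor", brokers: ["localhost:9092"] });
  const consumer = kafka.consumer({ groupId: "billing-account-processor" });
  const mongo = await new MongoClient("mongodb://localhost:27017").connect();
  const accounts = mongo.db("billing").collection<BillingAccount>("billingAccounts");

  await consumer.connect();
  await consumer.subscribe({ topics: ["cdc.billing.account"] });

  await consumer.run({
    eachMessage: async ({ message }) => {
      if (!message.value) return;
      const row: MainframeBillingRow = JSON.parse(message.value.toString());
      const entity = toDomain(row);
      // Idempotent upsert keeps the system-of-reference continuously updated.
      await accounts.updateOne(
        { accountId: entity.accountId },
        { $set: entity },
        { upsert: true },
      );
    },
  });
}

main().catch(console.error);
```

The key point is that nothing downstream of this processor ever sees ACCT_NO or STAT_CD; consumers only see the domain vocabulary.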
Scaling Our System-of-Reference
Our system needed to handle massive throughput – processing several million events daily. This required a horizontal scaling strategy with multiple components:
We implemented:
- Kafka topics with multiple partitions to distribute event processing and streaming
- Kubernetes for horizontal scaling of data processors and APIs
- Document databases optimized for the API usage patterns (both internal and public)
We can scale the topics via partitions to accommodate whatever change volume we see, and the same goes for the data processors. We have proven this works at the scale we require. Not only was the application rolled out incrementally, but our user base was also scaled up incrementally: we started with dozens of users being routed to the new experience, then hundreds, then thousands, then hundreds of thousands, and eventually millions. This occurred over a year and a half, with releases every two weeks, and we never once had a scaling problem.
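As a small illustration of that partitioning lever (topic name and counts are hypothetical), provisioning a topic with more partitions is what lets a consumer group of data processors spread the load:

```typescript
// Minimal sketch: provisioning a CDC topic with enough partitions to fan out
// processing across many consumer instances. Names and counts are illustrative.
import { Kafka } from "kafkajs";

async function provisionTopic(): Promise<void> {
  const kafka = new Kafka({ clientId: "platform-admin", brokers: ["localhost:9092"] });
  const admin = kafka.admin();
  await admin.connect();

  // Each partition is consumed by at most one instance in a consumer group, so
  // 32 partitions allows up to 32 data-processor pods to share the stream.
  await admin.createTopics({
    topics: [{ topic: "cdc.billing.account", numPartitions: 32, replicationFactor: 3 }],
  });

  await admin.disconnect();
}

provisionTopic().catch(console.error);
```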
Step Two: Domain-Driven Design with GraphQL
With Change Data Capture providing the data pipeline, we needed to model this data effectively: the API had to be broad enough to power many applications without being tailored to the specifics of the customer portal. We turned to Domain-Driven Design (DDD), starting with event-storming workshops, to identify:
- Bounded Contexts: solution spaces that tie together related data (Entities) and operations (Commands)
- Entities: objects defined by continuity and identity (e.g., Billing Account, Payment)
- Commands: operations that change state (e.g., Submit Payment)
- Events: things that happen as a result of commands (e.g., Payment Submitted) or when our reference data changes (via CDC)
Importantly, we modeled these domains based on business needs first, independent of mainframe constraints. Only after establishing the model did we map it back to the mainframe data.
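For illustration, these building blocks translate naturally into simple types; the entity, command, and event below are simplified examples rather than our full domain model.

```typescript
// Illustrative shapes from an event-storming session (examples only).

// Entity: defined by continuity and identity, not by its attributes.
interface BillingAccount {
  accountId: string;          // identity
  customerName: string;
  currentBalance: number;
}

// Command: an intent to change state within the Payments bounded context.
interface SubmitPayment {
  kind: "SubmitPayment";
  accountId: string;
  amount: number;
  paymentMethodId: string;
}

// Event: a fact that something happened, emitted when the command succeeds
// or when reference data changes via CDC.
interface PaymentSubmitted {
  kind: "PaymentSubmitted";
  accountId: string;
  amount: number;
  occurredAt: string;         // ISO-8601 timestamp
}
```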
Our next challenge was to create a productized API that could be scalable and maintainable.
GraphQL: Avoiding Backend-for-Frontend Proliferation
To prevent recreating the dozens of Backend-for-Frontend (BFF) APIs that UWP 1.0 used, we chose GraphQL. Each bounded context acted as a node in a GraphQL graph, and like all graphs, the edges defined the relationships between those nodes.
We used schema stitching to compose these domain nodes into a supergraph. This allowed applications to write queries that traversed multiple domains without knowing the underlying complexity.
This approach offered us several advantages:
- Preventing over-fetching: Frontend developers could request only the specific fields they needed
- Eliminating under-fetching: Complex nested queries could fetch related data across domains in a single request
- Flexible traversal: Queries could start from any node in the graph, allowing for query optimization
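Here is a minimal sketch of the stitching idea using the open-source @graphql-tools packages (our actual gateway, types, and resolvers differ):

```typescript
// Sketch of composing two bounded-context schemas into a supergraph with
// schema stitching. Type and field names are illustrative.
import { makeExecutableSchema } from "@graphql-tools/schema";
import { stitchSchemas } from "@graphql-tools/stitch";

const billingSchema = makeExecutableSchema({
  typeDefs: /* GraphQL */ `
    type BillingAccount {
      accountId: ID!
      currentBalance: Float!
    }
    type Query {
      billingAccount(accountId: ID!): BillingAccount
    }
  `,
});

const paymentsSchema = makeExecutableSchema({
  typeDefs: /* GraphQL */ `
    type Payment {
      paymentId: ID!
      accountId: ID!
      amount: Float!
    }
    type Query {
      paymentsForAccount(accountId: ID!): [Payment!]!
    }
  `,
});

// The gateway exposes one graph; a single query can now traverse both domains.
export const supergraph = stitchSchemas({
  subschemas: [{ schema: billingSchema }, { schema: paymentsSchema }],
});
```

A client can then request an account’s balance and its recent payments in one query against the supergraph, fetching only the fields it needs.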
Team Topologies – Organizational Design is Architecture
We know that organizational structure has a major impact on architecture. Delivering software successfully is as much about scaling teams as it is about scaling hardware and software. Without the ability to scale teams, you are limited in how fast you can deliver.
With Team Topologies, we restructured from highly interdependent, technology-centric teams to autonomous, empowered, domain-oriented teams. Team Topologies suggests several ‘shapes’ for teams; we used the following:
- Stream-aligned teams: Focused on specific business functionality like Payments, Move Service, or Billing. These teams owned entire features end-to-end within the scope of their bounded contexts.
- Enablement teams: Provided technical foundations like observability frameworks and DevOps tooling to support other teams.
- Complicated subsystem teams: Focused on complex technical areas like mainframe integration. Our Enterprise Integration Team handled the complexity of data streaming out of the mainframe, as well as state-changing operations to the mainframe. Stream-aligned teams interact with integration services asynchronously, allowing them to focus on customer value.
This structure established team boundaries as system boundaries, giving each team full ownership.
Event-Driven Architecture – Beyond CDC
While Change Data Capture formed the foundation of our system-of-reference, we implemented additional event-driven patterns for communication within and between bounded contexts.
Internal Domain Events
Even within our system-of-reference, bounded contexts need to communicate changes. We implemented the Outbox Pattern – another form of change data capture – to publish events when data changed within a bounded context.
Some example usage scenarios include:
- Performance optimization: Pre-compute and cache frequently accessed data
- Derived value calculation: Pre-calculate complex values that would otherwise require mainframe logic
- Cross-domain consistency: Keep related data in sync across bounded contexts
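A sketch of one common shape of the Outbox Pattern follows: the domain change and the outgoing event are written in a single transaction, and a relay publishes pending outbox entries to Kafka. A CDC feed over the outbox collection could replace the polling relay; all names here are illustrative.

```typescript
// Sketch of the Outbox Pattern with a polling relay. Collection, topic, and
// event names are illustrative.
import { MongoClient } from "mongodb";
import { Kafka } from "kafkajs";

const mongo = new MongoClient("mongodb://localhost:27017");
const kafka = new Kafka({ clientId: "payments-outbox", brokers: ["localhost:9092"] });

// 1. Write the entity and the event atomically.
// (Multi-document transactions require a replica set or a compatible managed service.)
async function recordPayment(accountId: string, amount: number): Promise<void> {
  await mongo.connect();
  const db = mongo.db("payments");
  const session = mongo.startSession();
  try {
    await session.withTransaction(async () => {
      await db.collection("payments").insertOne({ accountId, amount, createdAt: new Date() }, { session });
      await db.collection("outbox").insertOne(
        { type: "PaymentRecorded", payload: { accountId, amount }, publishedAt: null },
        { session },
      );
    });
  } finally {
    await session.endSession();
  }
}

// 2. A background relay drains unpublished outbox rows to Kafka.
async function relayOutbox(): Promise<void> {
  await mongo.connect();
  const producer = kafka.producer();
  await producer.connect();
  const outbox = mongo.db("payments").collection("outbox");
  const pending = await outbox.find({ publishedAt: null }).limit(100).toArray();
  for (const row of pending) {
    await producer.send({
      topic: "payments.domain-events",
      messages: [{ key: String(row.payload.accountId), value: JSON.stringify(row) }],
    });
    await outbox.updateOne({ _id: row._id }, { $set: { publishedAt: new Date() } });
  }
  await producer.disconnect();
}

recordPayment("acct-1", 125.5).then(relayOutbox).catch(console.error);
```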
The Anti-Corruption Saga
For operations like payment submissions that required changes to the mainframe, we implemented the Parallel Saga pattern (asynchronous communication, eventual consistency, and orchestration). However, that alone was not quite enough: our CDC-based system-of-reference needed a way to synchronize with these delegated requests. To make this work, we needed two data collections, one updated only by CDC and one tracking the state machine of the requests, plus an API that could aggregate both.
How it works:
- A user initiates an action, such as making a payment, which marks the initial state of a state machine managed by an orchestrator.
- The orchestrator transitions to the “Requested” state and emits an event for that state transition.
- Another component listens for that event and, upon receiving it, converts the domain event into the corresponding mainframe operation.
- The mainframe processes the request and returns a success/failure response, and our component wraps that response in a “state change reaction” event.
- The orchestrator listens for state change reactions and makes another state transition, to “Succeeded” or “Failed”.
- In parallel, CDC captures the resulting data change from the request.
- The API uses an aggregate to reconcile both signals (state machine state and CDC data) to present a consistent view.
This pattern handled the reality that two signals would return from the mainframe – the user-initiated operation response and the CDC-captured data change – potentially in any order. It’s a race condition. The aggregate reconciles those two things through the two data stores and presents a unified state for that request to an application.
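The reconciliation step can be sketched as follows, with hypothetical collection shapes and state names. The point is that the view presented to applications is a pure function of the two signals, whichever arrives first.

```typescript
// Sketch of how the API reconciles the two signals: the request state machine
// (owned by the saga orchestrator) and the CDC-backed payment record.
// Collection shapes, states, and fields are illustrative.

type RequestState = "Requested" | "Succeeded" | "Failed";

interface PaymentRequest {        // updated only by the orchestrator
  requestId: string;
  accountId: string;
  amount: number;
  state: RequestState;
}

interface PaymentRecord {         // updated only by CDC from the mainframe
  confirmationNumber: string;
  accountId: string;
  amount: number;
  postedAt: string;
}

type PaymentView =
  | { status: "Pending"; requestId: string }
  | { status: "Posted"; requestId: string; confirmationNumber: string }
  | { status: "Failed"; requestId: string };

// The two signals can arrive in either order; the aggregate presents one view.
function reconcile(request: PaymentRequest, cdcRecord: PaymentRecord | null): PaymentView {
  if (request.state === "Failed") {
    return { status: "Failed", requestId: request.requestId };
  }
  if (cdcRecord !== null) {
    // CDC has confirmed the mainframe applied the change, regardless of whether
    // the orchestrator has already observed the success response.
    return { status: "Posted", requestId: request.requestId, confirmationNumber: cdcRecord.confirmationNumber };
  }
  // The request was accepted (or already succeeded) but the change has not yet
  // flowed back through CDC.
  return { status: "Pending", requestId: request.requestId };
}
```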
Composable Workflows
The great thing about state-machine-based workflows is that they’re inherently composable. A state change can trigger a fully independent workflow, and the completion of a workflow can trigger a state change. Each sub-workflow could be tested and deployed independently while still supporting composite operations. For example, one user flow on the web portal lets a user make a payment while simultaneously storing bank account information: the “Make a Payment and Add Bank Account” workflow simply runs “Add Bank Account” followed by “Make a Payment”.
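As a sketch (the function names and signatures are hypothetical stand-ins for saga orchestrations), composition is little more than sequencing the existing workflows:

```typescript
// Sketch of workflow composition. Each sub-workflow would, in the real system,
// kick off its own state machine and await its terminal state.

interface AddBankAccountResult { succeeded: boolean; paymentMethodId?: string }
interface PaymentResult { succeeded: boolean }

async function addBankAccount(customerId: string, routingNumber: string, accountNumber: string): Promise<AddBankAccountResult> {
  return { succeeded: true, paymentMethodId: "pm-123" }; // stubbed for the sketch
}

async function makePayment(billingAccountId: string, amount: number, paymentMethodId: string): Promise<PaymentResult> {
  return { succeeded: true }; // stubbed for the sketch
}

// "Make a Payment and Add Bank Account" is just the two workflows in sequence.
export async function makePaymentWithNewBankAccount(
  customerId: string,
  billingAccountId: string,
  amount: number,
  routingNumber: string,
  accountNumber: string,
): Promise<PaymentResult> {
  const added = await addBankAccount(customerId, routingNumber, accountNumber);
  if (!added.succeeded || !added.paymentMethodId) {
    return { succeeded: false };
  }
  return makePayment(billingAccountId, amount, added.paymentMethodId);
}
```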
Challenges and Tradeoffs We Faced
Our transformation wasn’t without difficulties. We encountered several challenges:
Event-Driven Complexity
Event-driven architecture is hard. People don’t understand it. It takes a paradigm shift most of the time. Moving from request-response thinking to event-based patterns required significant education and mindset changes within our teams.
Observability Requirements
With asynchronous processing and eventual consistency, traditional debugging approaches became insufficient. We had to implement comprehensive observability from the start, with correlated logs, metrics, and traces across services and workflows.
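One simple ingredient of that observability is propagating a correlation ID across asynchronous hops so that logs, metrics, and traces can be joined per unit of work. The sketch below shows the idea with Kafka message headers; the header and topic names are illustrative, not our actual conventions.

```typescript
// Sketch: carrying a correlation ID in Kafka headers so log lines from
// different services and workflows can be correlated.
import { randomUUID } from "node:crypto";
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "payments-api", brokers: ["localhost:9092"] });

export async function publishWithCorrelation(topic: string, value: object, correlationId?: string): Promise<void> {
  const producer = kafka.producer();
  await producer.connect();
  await producer.send({
    topic,
    messages: [{
      value: JSON.stringify(value),
      headers: { "x-correlation-id": correlationId ?? randomUUID() },
    }],
  });
  await producer.disconnect();
}

export async function consumeWithCorrelation(topic: string): Promise<void> {
  const consumer = kafka.consumer({ groupId: "payments-worker" });
  await consumer.connect();
  await consumer.subscribe({ topics: [topic] });
  await consumer.run({
    eachMessage: async ({ topic, message }) => {
      const correlationId = message.headers?.["x-correlation-id"]?.toString() ?? "unknown";
      // Every log line for this unit of work carries the same correlation ID.
      console.log(JSON.stringify({ correlationId, event: "message.received", topic }));
    },
  });
}
```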
Mainframe Batch Processes
Large mainframe batch jobs (like monthly billing) could flood our CDC pipeline with events, causing data latency. Worse still, if an orchestrator happened to be waiting on one of those statuses to change, it was now backed up behind the lag created by the batch. This was an early bug we identified, and we soon realized we needed to handle our CDC and Saga topics differently (hence the Anti-Corruption Saga).
GraphQL Schema Management
With schema stitching, we ended up with a shared supergraph schema that all nodes referenced. It is not fun to maintain, and it couples the nodes in the graph more tightly to each other. We came up with strategies to work around that, such as versioning specific iterations of the supergraph, but that was a band-aid.
An alternative would have been federated composition, which composes the schema at runtime rather than at build time. The federated approach uses dependency inversion: instead of a node having knowledge of its outgoing relationships, nodes define *what they extend*. This breaks the potentially inexhaustible chain of transitive dependencies. It does require a supergraph gateway to perform that runtime composition, which is why we shied away from starting with federation. However, having experienced the challenges of maintaining a single graph schema, it may have been the better choice.
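For comparison, here is a sketch of what the federated style looks like using the open-source @apollo/subgraph package (illustrative names, not code from our system): the Payments subgraph declares that it extends BillingAccount, and the gateway composes the supergraph at runtime.

```typescript
// Sketch of a federated Payments subgraph. In federation, this node only
// declares what it extends; it has no knowledge of the full supergraph.
import gql from "graphql-tag";
import { buildSubgraphSchema } from "@apollo/subgraph";

const typeDefs = gql`
  type Payment @key(fields: "paymentId") {
    paymentId: ID!
    amount: Float!
  }

  # BillingAccount is owned by the Billing subgraph; Payments contributes a
  # field to it. The gateway resolves the complete type at query time.
  extend type BillingAccount @key(fields: "accountId") {
    accountId: ID! @external
    payments: [Payment!]!
  }
`;

// A real subgraph would also supply resolvers, including __resolveReference.
export const paymentsSubgraph = buildSubgraphSchema({ typeDefs });
```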
Cross-Cutting Concerns
While team boundaries aligned with domains, some concerns cut across all teams – authentication, observability, and error handling, for instance. We also chose a multi-repo approach, rather than a monorepo. This made things a bit difficult, but we addressed it through inner sourcing and automation:
- Private libraries provide foundational capabilities like API patterns, network communication, persistence, service hosting, distributed tracing, and more
- Those libraries are published as semantically versioned packages by a weekly “patch Tuesday” job
- We use Dependabot to automate dependency updates in the consuming services’ repositories
Our Release Strategy – All aboard the Release Train!
Organizationally, we follow an Agile Release Train process. As mentioned, we also have autonomous teams that own multiple repos. Each of those repos has its own CI/CD pipelines and produces an artifact (a Docker image). To coordinate releasing those artifacts in a single release train, we created an automated release-train pipeline.
The “train station” in this analogy is simply another repository: our deployment repository. Our Kubernetes manifests live there, describing the shape of our cluster. When a CI/CD job ran on one of the application repos, after the Docker image was published, that same job would update a Kustomize file in the deployment repo, incorporating the latest change into the full system structure. The Kustomize file changes would then be promoted to higher environments: first one by one to QA, as stories became ready for QA, and then all at once, when “the train leaves the station” (i.e., at the end of a sprint/start of UAT).
The final piece of that puzzle is feature flags. Feature flags enabled us to release security updates and bug fixes to services, but hold back feature work in progress. Because of all this, we were able to use trunk-based development across the board, avoiding potentially messy and painful merge scenarios.
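A feature flag gate can be as simple as the sketch below (our actual flag tooling differs; the environment-variable source here is illustrative): finished code rides the train, while unfinished features stay dark until the flag is flipped.

```typescript
// Sketch of a feature-flag gate supporting trunk-based development. The flag
// source is a plain environment variable; a flag service works the same way.

function isEnabled(flag: string): boolean {
  return (process.env[`FEATURE_${flag.toUpperCase()}`] ?? "false") === "true";
}

export function getPaymentRoutes(): string[] {
  const routes = ["/payments", "/payments/history"];
  if (isEnabled("scheduled_payments")) {
    // In-progress feature: merged to trunk and deployed, but dark until flipped.
    routes.push("/payments/schedule");
  }
  return routes;
}
```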
Overall, this greatly reduced the coordination cost of releases across 5 teams and 30+ repositories, as well as the support needed to release, which in turn increased our velocity, stability, and predictability.
The “Hybrid Architecture”
To avoid a risky “big bang” cutover, we implemented a gradual transition from UWP 1.0 to UWP 2.0:
- Edge routing to direct traffic between old and new implementations (Hybrid Router)
- Context awareness between systems for a seamless user experience
This approach allowed us to incrementally replace features while maintaining a consistent user experience, reducing risk, and enabling continuous delivery of value.
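The routing decision at the edge can be sketched like this (paths and origins are hypothetical; the real Hybrid Router also carries user context between the two portals for a seamless experience):

```typescript
// Sketch of the edge-routing idea behind the hybrid architecture: requests for
// pages that have been rebuilt go to UWP 2.0, everything else falls through to
// the legacy portal. Paths and origins are illustrative.

const MIGRATED_PREFIXES = ["/billing", "/payments", "/usage"];

const NEW_PORTAL_ORIGIN = "https://portal-v2.example.com";
const LEGACY_PORTAL_ORIGIN = "https://portal-v1.example.com";

export function resolveOrigin(path: string): string {
  const migrated = MIGRATED_PREFIXES.some((prefix) => path.startsWith(prefix));
  return migrated ? NEW_PORTAL_ORIGIN : LEGACY_PORTAL_ORIGIN;
}

// Example: resolveOrigin("/payments/history") -> new portal;
//          resolveOrigin("/outage-map")       -> legacy portal.
```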
Putting all that together, this is the big picture:
Conclusion – Building a Platform That Can Evolve
Looking back on our transformation, we were able to achieve our technical and business goals:
- We decoupled the frontend from the mainframe
- We eliminated brittle point-to-point integrations
- We empowered teams to own their domains end-to-end
- We reduced call center volumes
- We lowered licensing costs
- We improved customer satisfaction through stability
Here’s what some of that looks like from a product and performance metrics perspective:
Plus, we built a foundation that can evolve over time, serve diverse applications, and does not rely on a single centralized system or technology. These patterns can facilitate future modernization efforts, including potential mainframe replacement using the Strangler Fig pattern.
By thoughtfully applying Change Data Capture, Domain-Driven Design, Event-Driven Architecture, and Team Topologies, we created not just a better portal, but an innovative way of building software with legacy systems – one that embraces decoupling at the technical, organizational, and semantic levels, unlocking speed, scale, and sanity.