Monzo Bank recently revealed Monzo Stand-in, an independent backup system on GCP that ensures essential banking services remain operational during application and AWS infrastructure outages. Unlike traditional replicated backups, it is a minimal stand-alone system that supports only key operations, and its cost-effective design means it runs at about 1% of the operational cost of the primary deployment.
The Stand-in operates as a fully independent backup system, running separately from Monzo’s Primary Platform to ensure continued service during outages. It shares no code components with the Primary Platform and has its own cloud vendor, infrastructure components, payment processing, and data synchronisation mechanisms, reducing reliance on shared elements.
High-level architecture of Monzo Stand-in (source)
By running entirely separate software from the Primary Platform, Monzo Stand-in minimises the chance that a single bug or process failure could impact both systems. Unlike conventional disaster recovery solutions that focus on hardware redundancy, Monzo prioritises software independence, ensuring each platform can operate autonomously.
Furthermore, traditional backup deployments often rely on replicated systems that mirror the primary platform in real time, requiring strong consistency and synchronous data replication. While this approach ensures an up-to-date backup, it also introduces dependencies that can limit availability during certain failures.
In contrast, Monzo Stand-in follows an eventual consistency model to maximise availability. Instead of requiring immediate synchronisation with the Primary Platform, it asynchronously updates essential data, ensuring operations can continue even during outages. Transactions are recorded as independent “advice,” later reconciled when the Primary Platform is restored, reducing dependencies and failure risks.
Data synchronisation in Monzo Stand-In (source)
Monzo Stand-in solely supports a minimal subset of Monzo’s core functionalities, prioritising critical operations like card payments, bank transfers, and balance checks while omitting non-essential features. This streamlined approach reduces complexity and significantly lowers its total cost of ownership, as Stand-in only incurs about 1% of the Primary Platform’s operating expenses.
The Monzo App integrates with Stand-in, automatically detecting failovers and switching to a simplified interface that maintains key banking capabilities, ensuring a consistent user experience.
Monzo App experience during failover (source)
Monzo is a UK-based digital bank. Founded in 2015, it has grown rapidly, offering millions of customers current accounts, savings tools, and financial insights. Monzo operates primarily through its app, leveraging modern cloud-based infrastructure to provide seamless banking services.
InfoQ spoke about the Monzo Stand-in with Daniel Chatfield, a Distinguished Engineer at Monzo.
InfoQ: In the article, you mention that Monzo Stand-in is tested in production. Can you share more details about the testing strategies and failure scenarios you simulate and how you ensure Monzo Stand-in remains reliable over time?
Daniel Chatfield: Regular unit tests and acceptance tests are supplemented by several production testing practices.
- Shadow testing – a portion of payments are continuously run against stand-in in shadow mode. This allows us to compare the decisions between the primary platform and stand-in and detect unexpected differences.
- Load testing – the shadow testing proportion is set to 100% over our peak time each day to validate that we can handle peak load. We can also perform ad-hoc load tests that go beyond 100% (each payment is replayed multiple times). We’ve load tested up to 5x peak load.
- Direct testing – shadow mode still involves the payment message initially coming into AWS and then being replayed to Google Cloud. This leaves the part of the stand-in system that connects directly to payment schemes via our data centres untested. An automated system tests this regularly by enabling stand-in to directly connect to payment schemes and process payments for a short period before disabling itself.
- End-to-end customer testing – The final puzzle piece is the end-to-end integration with our mobile application. The best way to be confident this will work when needed is to exercise it “for real” regularly. To do this, we have a system that selects a section of customers each day and enrols them in a scheduled test. If the customer opens their mobile app during that period, they will see the simplified stand-in experience and an explanation of why we do this testing. The customer can opt out of the testing and return to the full experience, but every day, customers who don’t opt out initiate payments that test the system end to end. Once a customer has been enrolled in this testing, they won’t be enrolled again for another 5 years.
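To make the shadow and load testing practices above concrete, here is a minimal Python sketch. It is an illustration only, not Monzo's code: the `Payment` type, the `shadow_run` function, the percentage-based sampling, and the `replays` parameter are all hypothetical names chosen for this example. The idea it shows is that the primary platform's decision is the only one acted on, while a sampled share of payments is replayed against stand-in and any disagreement is reported.

```python
import zlib
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Payment:
    payment_id: str
    amount_pence: int


# A decider returns True to approve a payment and False to decline it.
Decider = Callable[[Payment], bool]


def in_shadow_sample(payment: Payment, percent: int) -> bool:
    """Deterministically place a payment in one of 100 buckets (assumed scheme)."""
    bucket = zlib.crc32(payment.payment_id.encode("utf-8")) % 100
    return bucket < percent


def shadow_run(
    payments: Iterable[Payment],
    primary: Decider,
    stand_in: Decider,
    percent: int,
    replays: int = 1,
) -> List[str]:
    """Return the IDs of payments where stand-in disagreed with the primary.

    percent=100 corresponds to the daily peak-time load test; replays > 1
    models an ad-hoc test that goes beyond 100% of real traffic.
    """
    mismatches: List[str] = []
    for payment in payments:
        authoritative = primary(payment)  # only this decision is ever acted on
        if not in_shadow_sample(payment, percent):
            continue
        for _ in range(replays):
            if stand_in(payment) != authoritative:
                mismatches.append(payment.payment_id)
                break  # report each payment at most once
    return mismatches
```

For example, calling `shadow_run(payments, primary, stand_in, percent=100, replays=5)` would replay every payment five times against stand-in, in the spirit of the 5x peak load test described above.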
InfoQ: Given that Monzo Stand-in relies on an eventually consistent model, how do you reconcile discrepancies between the Primary Platform and the Stand-in after a major outage? Are there specific cases where reconciliation becomes particularly complex?
Daniel: Stand-in doesn’t directly modify any of the data synced from the primary platform. So, for example, if someone’s balance in stand-in is recorded as £100 and they do a £10 transaction, we don’t change the balance to £90. Instead, we record that Stand-in has authorised a £10 transaction. Then, their current balance is derived at runtime by summing £100 and the -£10. This provides a clear separation between the state that comes from the primary platform and the state created within Stand-in, and the state is only synced in one direction.
Then, when the primary platform is syncing this “advice”, it applies the delta to the primary platform. So, in the case of that £10 transaction, it applies a £10 transaction onto the account rather than setting the balance to £90. In exceptional circumstances, this can result in an account going negative if a transaction was processed on the primary platform just before Stand-in was activated and wasn’t synced to Stand-in before another payment was processed in Stand-in. Keeping the sync latencies very close to real-time makes this risk very low in practice.
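The model Chatfield describes can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Monzo's implementation: the `Advice` and `StandIn` types, the `reconcile` function, and the rule that stand-in declines a payment exceeding the derived balance are all hypothetical. What it shows is the one-way flow of state: stand-in never edits the synced balance, it only appends advice, and the primary later applies each advice as a delta.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Advice:
    account_id: str
    delta_pence: int  # negative for money leaving the account


@dataclass
class StandIn:
    # Snapshot synced from the primary platform; never modified here.
    synced_balances: Dict[str, int]
    # State created within stand-in, kept separate from the synced state.
    advice_log: List[Advice] = field(default_factory=list)

    def balance(self, account_id: str) -> int:
        """Derive the current balance at runtime: synced balance plus advice."""
        pending = sum(
            a.delta_pence for a in self.advice_log if a.account_id == account_id
        )
        return self.synced_balances[account_id] + pending

    def authorise(self, account_id: str, amount_pence: int) -> bool:
        """Record an authorised payment as advice (decline rule is assumed)."""
        if amount_pence > self.balance(account_id):
            return False
        self.advice_log.append(Advice(account_id, -amount_pence))
        return True


def reconcile(primary_balances: Dict[str, int], advice_log: List[Advice]) -> None:
    """Apply each advice to the primary as a delta, never as an absolute balance."""
    for advice in advice_log:
        primary_balances[advice.account_id] += advice.delta_pence
```

With a synced balance of £100 (10,000 pence), authorising a £10 payment leaves `synced_balances` untouched while `balance()` returns 9,000. If the primary had meanwhile processed a £95 payment that was never synced, its own balance would be 500 pence, and `reconcile` would take it to -500: the rare negative-balance case mentioned above.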
InfoQ: You mentioned that Monzo Stand-in runs at about 1% of the cost of the Primary Platform. What architectural choices or optimisations were made to keep costs low while ensuring resilience and functionality?
Daniel: Our primary platform uses a microservices architecture designed to allow many independent teams to ship lots of changes regularly without clashing with each other. In contrast, we expect stand-in to be much more stable – since it is only intended to support payment processing in the most basic way possible, it doesn’t need frequent changes. Since introducing Stand-in, we’ve made thousands of changes to the primary platform, but only a handful of changes have been made to Stand-in. As a result, stand-in runs a smaller number of “larger” services. For example, there is a single system in Stand-in for card processing compared to a dozen or so independent systems in the primary platform.
Another contributing factor to the low cost was choosing a managed database where we pay per operation. This makes stand-in more expensive when it’s fully enabled but cheaper when it’s just syncing the state from the primary platform. Given that we expect stand-in to be disabled most of the time, this works out cheaper overall.
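The trade-off behind per-operation pricing can be worked through with a small Python sketch. Every figure and name here is hypothetical (the price per million operations, the sync and active operation rates, and the fixed provisioned cost are invented for illustration and are not Monzo's numbers). It only demonstrates the shape of the argument: a per-operation bill is small while stand-in is merely syncing, and larger than a fixed-capacity bill when stand-in is fully enabled.

```python
HOURS_PER_MONTH = 730  # approximate hours in a month


def per_operation_cost(
    price_per_million_ops: float,
    sync_ops_per_hour: int,
    active_ops_per_hour: int,
    active_hours: int,
) -> float:
    """Monthly cost when billed per database operation.

    Sync traffic runs all month; payment-processing traffic is added only
    for the hours in which stand-in is actually enabled.
    """
    total_ops = (
        sync_ops_per_hour * HOURS_PER_MONTH + active_ops_per_hour * active_hours
    )
    return total_ops / 1_000_000 * price_per_million_ops


# Hypothetical inputs, chosen only to illustrate the comparison.
PROVISIONED_MONTHLY = 10_000.0  # fixed cost of capacity sized for full load
idle = per_operation_cost(1.0, 1_000_000, 50_000_000, active_hours=0)
always_on = per_operation_cost(1.0, 1_000_000, 50_000_000, active_hours=HOURS_PER_MONTH)

print(f"per-op, never enabled:  {idle:>10.2f}")
print(f"per-op, always enabled: {always_on:>10.2f}")
print(f"provisioned (fixed):    {PROVISIONED_MONTHLY:>10.2f}")
```

Under these assumed inputs, the per-operation model is far cheaper than fixed capacity when stand-in is disabled and several times more expensive if it were enabled all month, which matches the reasoning that a mostly disabled system works out cheaper overall.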
InfoQ: Running Monzo Stand-in on GCP while the Primary Platform is on AWS introduces a multi-cloud architecture. What challenges did you face regarding interoperability, networking, and cloud-provider-specific limitations when implementing this strategy?
Daniel: Our platform is already built in a way that minimises reliance on cloud services that don’t have close equivalents in other clouds. There was a bunch of “glue code” that had to be different, e.g., in both AWS and GCP, we used managed Kubernetes clusters, but the services provided weren’t identical. Our primary platform uses AWS Keyspaces as its primary database, so we had to think carefully about the choice of database in GCP. To make this decision more reversible, we invested in building tooling such that the choice of database is abstracted from the application code.
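One common way to keep a database choice reversible is to have application code depend on a narrow storage interface rather than on a vendor client. The Python sketch below illustrates that general pattern only; it is not Monzo's tooling, and `KeyValueStore`, `InMemoryStore`, and `CardStatusService` are hypothetical names. A real backend for a managed database would implement the same two methods.

```python
from typing import Dict, Optional, Protocol


class KeyValueStore(Protocol):
    """The only storage surface that application code is allowed to see."""

    def get(self, key: str) -> Optional[bytes]: ...

    def put(self, key: str, value: bytes) -> None: ...


class InMemoryStore:
    """Stand-in backend for this sketch; a cloud database adapter would
    expose the same get/put methods."""

    def __init__(self) -> None:
        self._rows: Dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        return self._rows.get(key)

    def put(self, key: str, value: bytes) -> None:
        self._rows[key] = value


class CardStatusService:
    """Application logic written against the interface, not a vendor SDK."""

    def __init__(self, store: KeyValueStore) -> None:
        self._store = store

    def freeze(self, card_id: str) -> None:
        self._store.put(f"card:{card_id}:status", b"frozen")

    def is_frozen(self, card_id: str) -> bool:
        return self._store.get(f"card:{card_id}:status") == b"frozen"
```

Swapping databases then means writing a new class that satisfies `KeyValueStore` and passing it to `CardStatusService`, with no change to the service itself.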