The CrowdStrike outage disrupted flight operations in the summer of 2024. A year later, Google Cloud took a nosedive, and with it went Gmail, Discord, and Spotify. If there was any doubt before, these events proved it: no tech company is too big to fail – and when one does, the fallout is massive. As businesses move rapidly into the cloud and rely heavily on code-generation tools, keeping control over changes in code and IT infrastructure, and preventing system-wide breakdowns, is no longer optional. It’s essential. According to many experts, in this new technological reality, attention to service management practices can protect businesses from IT failures. ITSM and ESM systems that implement these practices are becoming central tools for managing business operations and ensuring their resilience and maturity. These systems make it possible to centrally manage changes in software and infrastructure, monitor processes, and automatically detect potential risks before they escalate into major incidents. This makes them a vital link in protecting businesses from large-scale disruptions.
Dmitry Malygin is a software engineer, architect, and technical leader with over 12 years of experience. He has been named among his country’s 40 most influential digital experts and was recognized by the expert jury of the national Digital Leaders Award for his outstanding contribution to the development of enterprise digital platforms. He authored and led the development of Sphere, an enterprise-grade ITSM/ESM platform deployed across leading national banks, including one with an international presence, that collectively serve over 30 million customers. His advanced solutions form the backbone of automated service and process management in distributed, mission-critical infrastructures.
In this interview, Dmitry discusses the role of ITSM/ESM platforms in reducing the risks of large-scale IT outages amid the rise of automation and intelligent tools. He shares the key technical challenges he faced while developing the system, the solutions that helped address complex engineering problems, and the nuances of adapting a digital product for international markets.
Dmitry, IT systems of the largest tech companies are built using advanced DevOps practices, automation, and monitoring tools. In your view, as an architect and developer, why do they remain so vulnerable to failures?
Modern systems like Google Cloud Platform, Spotify, or Snapchat are built as complex, multi-component, and distributed solutions, which significantly complicates their development, maintenance, and evolution. In ecosystems with thousands of microservices, points of failure may arise at various levels – from network and database to third-party integrations. A single failure at the periphery can trigger a cascading error throughout the entire system. The human factor also plays a significant role. Failures often occur at the intersection of different teams’ responsibilities, especially when there is no end-to-end, formalized change management process, whether for code or infrastructure.
The widespread use of automation and code generation tools adds further complexity. Developers’ blind trust in code generators sometimes leads to critical bugs slipping into the production environment unnoticed, despite formally well-established CI/CD processes. This issue becomes particularly acute under the pressure of tight deadlines. Therefore, the development and maintenance of such systems require advanced mechanisms for control, verification, and validation to be integrated at every stage of the lifecycle, but without compromising development speed. System reliability isn’t only about avoiding failures – it’s equally about how fast the system can bounce back and limit the damage when something goes wrong.
Can shortcuts taken in the early stages come back to cause issues after the system launches? From what you’ve seen, how much do things like rushed timelines or unclear requirements shape architecture decisions and affect the final product?
Tight deadlines, heavy workloads, and uncertainty in business requirements undoubtedly take a toll on the quality of delivered code, even when the release process itself is well established. Primarily, this is a matter of project management. While prioritizing tasks, managing the backlog, and testing thoroughly are all critical, real-world conditions are rarely perfect, and trade-offs are often unavoidable. On one project for a Fortune Global 500 client, where I was the technical lead, we faced a strict deadline to launch an MVP (Minimum Viable Product). Architecturally, it was obvious that an asynchronous communication model between components was the right fit. But the deadline was tight and the business wanted it out fast, so we laid out the risks and went ahead with a synchronous integration, even knowing it wasn’t ideal. As a result, some time after launch, we discovered that the system couldn’t handle peak loads, which led to temporary unavailability.
A post-incident analysis showed that the initial compromise in the integration was precisely the point of failure. After the incident, I initiated the creation of a high-priority architectural risk register, where we documented technical debt items with the potential to cause major incidents. We eventually resolved the issue, but from that point on, the business had a clear picture of the consequences of deviating from the original engineering plans. Compromises are inevitable; my role as lead engineer was to keep a proper balance between business objectives and sound engineering practice.
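As a rough illustration of the asynchronous model Dmitry refers to (the class, queue size, and method names below are invented for the example, not the project’s actual design), requests can be buffered and delivered to the downstream system off the request path, so a traffic spike degrades latency rather than availability:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative sketch only: buffer requests instead of calling the downstream
// system synchronously, so spikes are absorbed by the queue.
public class AsyncIntegrationSketch {

    // Bounded queue: back-pressure instead of unbounded memory growth.
    private final BlockingQueue<String> outbox = new ArrayBlockingQueue<>(10_000);
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    public AsyncIntegrationSketch() {
        worker.submit(this::drainLoop);
    }

    // Called on the request path: enqueue and return immediately.
    // Returns false when the queue is full, so the caller can degrade gracefully.
    public boolean submit(String payload) {
        return outbox.offer(payload);
    }

    private void drainLoop() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                String payload = outbox.take();   // blocks until work arrives
                deliverToDownstream(payload);     // slow call happens off the request path
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void deliverToDownstream(String payload) {
        // Placeholder for the real integration call (e.g. a message broker or REST client).
        System.out.println("delivered: " + payload);
    }
}
```

The bounded queue is the point of the sketch: when it fills up, callers are told to back off rather than the whole system becoming unavailable under peak load.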
In your experience, which measures that could significantly reduce the likelihood of errors in code and IT infrastructure, and minimize the risk of failures, are most often underutilized?
From my experience, there are several measures with strong potential to improve delivery control, code quality, and IT infrastructure stability that are often not fully leveraged. First, integration testing and comprehensive system testing. Teams write unit tests and keep the test coverage numbers high, but when it comes to testing how different parts of the system actually talk to each other, or how they handle external APIs, that’s where the gaps show up. Second, monitoring is often limited to basic metrics, which reduces its effectiveness. Many setups still underuse predictive response scenarios, despite their potential to significantly improve system resilience. Additionally, I often observe teams facing difficulties with implementing Infrastructure as Code (IaC) in practice, primarily because the code responsible for provisioning and configuring infrastructure is rarely tested with the same rigor as application code.
This can result, for example, in servers being launched with incorrect security settings or misconfigured databases. I would also highlight ITSM processes, which are often implemented in a formal and isolated manner, without close integration with development and automation. When integrated into the DevOps pipeline, even a single process like Change Management can significantly improve the balance between control and delivery speed. Such integration enables the use of Change Requests – a formal mechanism for controlling changes to code, infrastructure, and configurations that helps ensure compliance and supports risk assessment.
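As a hedged sketch of such a gate (the ITSM endpoint, its response format, and the environment variable are assumptions for illustration, not a real product API), a pipeline step can query the service management system and fail the build when no approved Change Request exists:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical CI gate: block the deployment unless an approved Change Request exists.
// The ITSM endpoint and its JSON response are assumptions for illustration.
public class ChangeRequestGate {

    public static void main(String[] args) throws Exception {
        String changeRequestId = args.length > 0 ? args[0] : "CR-0000";
        String itsmBaseUrl = System.getenv().getOrDefault("ITSM_BASE_URL", "https://itsm.example.com");

        HttpRequest request = HttpRequest.newBuilder(
                URI.create(itsmBaseUrl + "/api/change-requests/" + changeRequestId))
            .header("Accept", "application/json")
            .GET()
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());

        // Naive status check; a real gate would parse the JSON and verify approver, window, etc.
        boolean approved = response.statusCode() == 200
                && response.body().contains("\"status\":\"APPROVED\"");

        if (!approved) {
            System.err.println("No approved Change Request " + changeRequestId + "; blocking deployment.");
            System.exit(1);   // non-zero exit code fails the pipeline step
        }
        System.out.println("Change Request " + changeRequestId + " approved; deployment may proceed.");
    }
}
```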
With hands-on leadership in building a service management platform, you’ve gained deep technical insight into what it takes to keep large IT systems stable and resilient. How does your system work behind the scenes to steer changes safely, dodge risks in delivery, and keep everything running without hiccups?
The platform I designed is based on the ITIL framework and built around ITSM (IT Service Management) principles, offering a systematic approach to managing IT services. Additionally, it extends to an ESM (Enterprise Service Management) model, allowing these principles to be applied across all business units, not just IT. The platform is built with modules that help lower the risk of release-related issues and keep services running reliably. One of the most important pieces is the Change Management module – it’s what keeps every change to the code or infrastructure under control, from start to finish. It’s possible to link the system to a CI/CD pipeline so that critical deployments get blocked automatically unless there’s an approved Change Request. Another key piece is the Incident Management module, which catches problems, keeps track of them, figures out their type, and helps to get them resolved fast to avoid downtime.
Through integrations with monitoring systems like Prometheus and observability tools like Datadog and Grafana, incidents are created automatically when anomalous metrics appear, triggering predefined runbook actions to fix the issues. One of the platform’s key components is the CMDB (Configuration Management Database), which stores information about configuration items (CIs) and how they relate to each other. This database is used to build a dependency graph, making it possible to analyze the potential impact of changes and block releases that pose a high risk. By bringing these modules together on a single platform, it becomes possible to significantly improve release stability, reduce the number of change-related incidents, lower mean time to recovery (MTTR), and ensure coordinated response across engineering teams during incidents.
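The dependency-graph idea can be sketched roughly as follows; the data model and CI names are purely illustrative, and a real CMDB holds far richer relationship types:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative CMDB impact analysis: breadth-first traversal over "X is depended on by Y"
// edges to find every configuration item (CI) affected by a change.
public class ImpactAnalysisSketch {

    // Edge map: CI -> CIs that depend on it (toy data, not a real CMDB schema).
    static final Map<String, List<String>> DEPENDENTS = Map.of(
            "postgres-cluster", List.of("billing-service", "reporting-service"),
            "billing-service", List.of("customer-portal"),
            "reporting-service", List.of(),
            "customer-portal", List.of());

    static final Set<String> BUSINESS_CRITICAL = Set.of("customer-portal");

    static Set<String> affectedBy(String changedCi) {
        Set<String> affected = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>(List.of(changedCi));
        while (!queue.isEmpty()) {
            String ci = queue.poll();
            for (String dependent : DEPENDENTS.getOrDefault(ci, List.of())) {
                if (affected.add(dependent)) {
                    queue.add(dependent);   // keep walking up the dependency chain
                }
            }
        }
        return affected;
    }

    public static void main(String[] args) {
        Set<String> affected = affectedBy("postgres-cluster");
        boolean highRisk = affected.stream().anyMatch(BUSINESS_CRITICAL::contains);
        System.out.println("Affected CIs: " + affected + (highRisk ? " (HIGH RISK: block release)" : ""));
    }
}
```

Traversing the "depended on by" edges from the changed item yields its blast radius; a hit on a business-critical CI is enough to flag the release for additional review or block it outright.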
However, as noted in Forrester’s report “The Future of ITSM and ESM”, service management platforms, like any complex systems, are becoming increasingly difficult and costly to customize. Based on your experience, what challenges did your enterprise clients face most often, and how did those influence your architectural decisions?
When creating the platform, my team and I conducted in-depth interviews with enterprise clients to understand their core frustrations. One CTO told us bluntly: “I don’t want to restructure my organization’s business processes just to install a service management platform”. That stuck with me. It became clear that customization – a system’s ability to adapt to existing processes – was one of the most critical requirements. We also saw scalability issues in companies with over 10,000 employees – in one case, the system couldn’t handle concurrent requests from multiple business units because it was built as a monolith and lacked proper horizontal scaling. As a result, even routine ticket processing slowed to a crawl during peak hours. Another common pain point was integration.
For example, one client needed real-time synchronization with both HR and CRM systems, but their previous platform required custom scripts for it. We knew we had to address that through open APIs and built-in connectors. We factored these insights into the architecture: we ultimately chose a low-code approach to make the system both flexible and accessible, and added predictive analytics to help automate repetitive processes.
When you look back, what were the trickiest technical choices or architectural hurdles that really tested you? Were there any strategic technical decisions you made that later proved critical to the system’s performance or resilience?
One of the biggest challenges I faced was coping with the increasing load on the modules that manage changes and releases. For business reasons, they were initially developed as a monolith. Early on, I recognized that this design would become a bottleneck as user demand scaled. I led the initiative to split the modules into separate microservices, aligning two development teams under a tight delivery timeline. The refactoring allowed the modules to be developed independently and scaled according to their load profiles, eliminating the choke points created by the monolithic design. Additionally, to keep the system under control at high load, I championed an observability-first approach and led the rollout of an advanced monitoring stack based on Prometheus, Grafana, and ELK. This helped us catch regressions early and cut bug-fix time by a factor of three, and the strategy was later adopted company-wide as a standard.
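A minimal sketch of the kind of instrumentation this involves, using the Micrometer API that is commonly scraped by Prometheus and visualized in Grafana (the metric and tag names here are illustrative, and micrometer-core must be on the classpath):

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

// Illustrative instrumentation with Micrometer; in production a Prometheus-backed
// registry would replace the simple one, and Grafana would chart the scraped series.
public class ReleaseMetrics {

    private final Counter failedDeployments;
    private final Timer changeLeadTime;

    public ReleaseMetrics(MeterRegistry registry) {
        this.failedDeployments = Counter.builder("deployments.failed")
                .tag("module", "change-management")   // illustrative metric and tag names
                .register(registry);
        this.changeLeadTime = Timer.builder("change.lead.time")
                .register(registry);
    }

    public void recordDeployment(Runnable deployment) {
        try {
            changeLeadTime.record(deployment);        // time the whole deployment step
        } catch (RuntimeException e) {
            failedDeployments.increment();            // count failures for alerting rules
            throw e;
        }
    }

    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();
        ReleaseMetrics metrics = new ReleaseMetrics(registry);
        metrics.recordDeployment(() -> System.out.println("deploying..."));
        System.out.println(registry.getMeters().size() + " meters registered");
    }
}
```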
Another challenging task was designing and developing an intelligent pattern-matching engine that powered shift-left automation and reduced the first-line support workload by 20%. I was responsible for architecting its core algorithms, overseeing the implementation, and ensuring seamless integration with the other modules.
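A toy version of the pattern-matching idea (the patterns, resolutions, and class names are invented for illustration and are not the engine’s actual catalogue) might look like this:

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;

// Toy shift-left matcher: map a ticket description to a known automated resolution
// so the request never reaches first-line support.
public class ShiftLeftMatcher {

    record KnownIssue(Pattern pattern, String automatedResolution) {}

    private static final List<KnownIssue> CATALOGUE = List.of(
            new KnownIssue(Pattern.compile("(?i)password (expired|reset|forgot)"),
                           "self-service-password-reset"),
            new KnownIssue(Pattern.compile("(?i)access (request|denied)"),
                           "automated-access-provisioning"),
            new KnownIssue(Pattern.compile("(?i)(vpn|service) (down|not responding)"),
                           "restart-runbook"));

    static Optional<String> match(String ticketText) {
        return CATALOGUE.stream()
                .filter(issue -> issue.pattern().matcher(ticketText).find())
                .map(KnownIssue::automatedResolution)
                .findFirst();
    }

    public static void main(String[] args) {
        System.out.println(match("My password expired and I cannot log in")
                .orElse("route-to-first-line"));   // prints: self-service-password-reset
    }
}
```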
What measurable improvements and competitive advantages did your platform achieve during pilot projects or real-world implementations? Can you highlight which of your architectural or technical decisions most directly contributed to these outcomes?
One of the most significant accomplishments was improving several key ITSM/ESM metrics, above all a 30% drop in MTTR (Mean Time to Resolve), which spans the full incident process – from spotting the issue to resolving and closing it. A low MTTR is especially critical for sectors such as finance, healthcare, and transportation. This improvement was made possible by my solution, which implements the “shift-left” approach: an intelligent self-service portal enabling users to resolve common issues themselves, from password resets to access requests and service restarts. As a result, the number of requests handled by first-line support was reduced by approximately 20%, allowing teams to redirect resources to more complex and critical incidents. In addition, we achieved a 20% reduction in AFRT (Average First Response Time) compared to the legacy system.
This metric reflects how quickly a support team begins interacting with users after receiving a request, and it directly influences the perceived quality of service. The improvement was driven by several technical solutions I designed and developed: automated ticket classification using predictive analytics, intelligent routing algorithms that prioritized and directed requests to the appropriate support tier in real time, and streamlined notification workflows that minimized delays between ticket creation and agent engagement. Taken together, these improvements enhanced the technical performance of support services and increased the overall transparency, predictability, and trust in IT services.
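As a simplified illustration of the routing step (fixed rules stand in here for the predictive models described above; the categories and tiers are invented), each classified ticket is dispatched to a support tier without manual triage:

```java
import java.util.Map;

// Simplified routing sketch: in the real platform, classification is driven by
// predictive analytics; here fixed rules stand in for the model's output.
public class TicketRouter {

    enum Tier { SELF_SERVICE, FIRST_LINE, SECOND_LINE, MAJOR_INCIDENT }

    record Ticket(String category, int priority) {}   // priority: 1 = highest

    private static final Map<String, Tier> CATEGORY_DEFAULTS = Map.of(
            "password", Tier.SELF_SERVICE,
            "access", Tier.FIRST_LINE,
            "infrastructure", Tier.SECOND_LINE);

    static Tier route(Ticket ticket) {
        if (ticket.priority() == 1) {
            return Tier.MAJOR_INCIDENT;   // bypass the queues and page the on-call team
        }
        return CATEGORY_DEFAULTS.getOrDefault(ticket.category(), Tier.FIRST_LINE);
    }

    public static void main(String[] args) {
        System.out.println(route(new Ticket("infrastructure", 2)));  // SECOND_LINE
        System.out.println(route(new Ticket("database", 1)));        // MAJOR_INCIDENT
    }
}
```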
In big projects with many development teams, keeping everyone aligned and maintaining quality can be a real challenge. As a technical leader, how did you approach team organization, communication, and alignment? Did anything you brought to the table make a real, noticeable change?
To coordinate the work of more than 10 development teams with 100 engineers and maintain a steady pace of delivery with high product quality, I introduced an architectural and organizational model based on three core principles. First, teams were structured around the system’s modular architecture, following Conway’s Law: each team owned a distinct functional module aligned with a specific business domain. I focused on establishing clear API contracts and promoted a shift toward asynchronous communication, which helped us scale more effectively and eased inter-team dependencies. Second, I put together a dedicated platform team to own the shared components and tooling, so product teams could focus on real business needs, avoid reinventing the wheel, and ship faster. Third, I introduced an architectural council and regular design review sessions.
These helped maintain a unified technical vision across all teams while still supporting local autonomy. As a result, teams were able to scale independently while adhering to a coherent architectural framework, leading to faster delivery cycles, better cross-team alignment, and improved product reliability.
As code generation tools become more common, reduced developer oversight can increase the risk of serious issues slipping through. While the outages at Google and CrowdStrike were probably not caused by AI-generated code, they served as powerful reminders of how fragile complex systems can be when safeguards fail. How does your platform address these kinds of risks and help ensure operational resilience?
The service management approach integrated into our platform enforces strict oversight of all changes, including those to code and dependencies, ensuring stability and compliance even when code generation tools are part of the development process. This is especially relevant given studies from Stanford and Veracode indicating that auto-generated code tends to contain more vulnerabilities and errors.
The platform can be deployed both on the software provider’s infrastructure and on the client’s premises. It includes a Change Management module that governs the change process and reduces the risks associated with deploying unstable code. When auto-generated code enters the CI/CD pipeline, the system can be configured to automatically create a Change Request (CR), which triggers a Change Advisory Board (CAB) review to make the final decision on delivery. That decision takes into account review checklists, the results of SAST/DAST scans, the presence of “hallucinated” dependencies, and other factors, and a rollback plan is a mandatory part of the review. If an error does occur, an incident is automatically created and logged within the platform, immediately alerting the technical support team. In this way, the ITSM/ESM platform enables a controlled, secure delivery process and helps prevent disruptions like those experienced by CrowdStrike and Google Cloud.
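One of those checks, the screening for “hallucinated” dependencies, can be sketched as follows; the file names and line format are assumptions, and a production check would query the organization’s artifact registry rather than a flat file:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative check for "hallucinated" dependencies: every coordinate declared in the
// change must exist in an approved registry export. File names and formats are assumptions.
public class DependencyCheck {

    public static void main(String[] args) throws Exception {
        // One "group:artifact:version" coordinate per line in both files (assumed format).
        Set<String> approved = Files.readAllLines(Path.of("approved-artifacts.txt"))
                .stream().map(String::trim).collect(Collectors.toSet());
        List<String> declared = Files.readAllLines(Path.of("declared-dependencies.txt"));

        List<String> unknown = declared.stream()
                .map(String::trim)
                .filter(dep -> !dep.isEmpty() && !approved.contains(dep))
                .collect(Collectors.toList());

        if (!unknown.isEmpty()) {
            System.err.println("Unknown dependencies (possible hallucinations): " + unknown);
            System.exit(1);   // fail the pipeline step; the CR stays unapproved
        }
        System.out.println("All declared dependencies are present in the approved registry.");
    }
}
```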
Back in 2024, you stood out among 1,500 candidates and took home the Digital Leaders Award in two nominations – one for Contribution to the Development of Digitalization in Russia and the other for Online Assistants. What do you think made you rise above the rest? Which part of your work really clicked with the jury?
The award was granted following a competitive selection process conducted by an independent panel of industry experts. After the initial screening stage, participants’ submissions were scored based on multiple criteria, including the level of innovation, technical depth, overall impact, business value, and other relevant indicators. My application received the highest overall score among the participants and was recognized as the best in two separate nominations. The expert panel highlighted the broad impact of my ITSM/ESM solution on improving the efficiency of business operations in large organizations. The jury specifically emphasized my contributions as a developer, system architect, and team leader – my proposed technical approaches formed the solution that proved effective in projects involving major banks. Key factors in the evaluation included the use of asynchronous microservice architecture, predictive analytics, flexible business logic configuration, and omnichannel user interactions.
These features allowed the platform to demonstrate high scalability and adaptability to a wide range of organizations, from large enterprises to government institutions. My work on a digital platform for drivers was also highlighted. The system serves a broad user base, simplifying access to public services in everyday life and standing as a successful example of digital transformation in the industry. I believe the deciding factor was the combination of deep technical expertise, my architectural leadership, effective team management skills, and the real-world impact of my solutions on the digital sector.
In addition to your hands-on engineering work, you actively share your findings through scientific and expert publications, including those based on your projects. Which of these studies sparked the greatest interest among professionals in the field?
Every tricky engineering challenge leaves you with something new, and I’ve always felt that sharing those insights is how I can contribute to the field. Some of my research, which you can find on Google Scholar, comes straight from hands-on engineering experience. It’s picked up attention because it deals with the real problems teams hit when scaling systems and offers fixes that have actually worked in practice. One such article, “Methodology for evaluating JVM frameworks for the development of ITSM/ESM automation platforms based on autoscaling performance”, presents a structured approach to comparing Java frameworks such as Spring Boot, Micronaut, and Quarkus. The methodology was developed during the creation of a high-load enterprise service management platform and enabled teams to make technology choices based on measurable indicators like cold start time and resource consumption.
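A simplified version of such a cold-start measurement might look like the harness below; the jar path and health-check URL are placeholders, and a real comparison would repeat the run many times per framework and also sample memory and CPU:

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Toy cold-start measurement: launch a packaged service and time how long its
// health endpoint takes to answer. The jar path and URL are placeholders.
public class ColdStartBenchmark {

    public static void main(String[] args) throws Exception {
        String jar = args.length > 0 ? args[0] : "build/app.jar";      // placeholder path
        String healthUrl = "http://localhost:8080/health";             // placeholder endpoint

        long start = System.nanoTime();
        Process service = new ProcessBuilder("java", "-jar", jar).inheritIO().start();

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest probe = HttpRequest.newBuilder(URI.create(healthUrl)).GET().build();
        long deadline = start + 60_000_000_000L;                       // give up after 60 s

        try {
            while (true) {
                try {
                    if (client.send(probe, HttpResponse.BodyHandlers.discarding()).statusCode() == 200) {
                        break;                                         // service answered: it is up
                    }
                } catch (IOException e) {
                    // not listening yet; fall through and retry
                }
                if (System.nanoTime() > deadline) {
                    throw new IllegalStateException("service did not start within 60 s");
                }
                Thread.sleep(50);
            }
            System.out.println("Cold start: " + (System.nanoTime() - start) / 1_000_000 + " ms");
        } finally {
            service.destroy();                                         // always stop the launched process
        }
    }
}
```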
After publication, I was contacted directly by professionals from other companies who were interested in applying this methodology in their own projects. Another example is the article “Performance issues of relational databases in distributed architectures and strategies for resolution”. It provides a classification of typical performance bottlenecks in microservice-based systems and proposes optimization strategies that proved effective in real-world deployments. These works combine theoretical insights with engineering solutions that were validated through real projects and provide practical value to teams working with high-load distributed systems.
You led the effort to ready the platform for an international rollout, ensuring it was well adapted for the Asian market. What technical and product aspects did you focus on? What were the key challenges and outcomes?
The ITSM/ESM platform plays a critical role in the operational management of an organization. It is a technically complex B2B product that must comply with strict regulatory requirements, especially when used in financial or government sectors. Adapting it for international markets involves far more than simply translating the user interface. It requires adaptation to local infrastructure, adherence to security standards, legal compliance, UX localization, and integration with local systems. For example, in China, most B2B solutions, particularly in regulated industries, are required to store data locally. If the platform is publicly accessible, ICP (Internet Content Provider) registration is typically required. In government and public institutions, on-premise deployment is often the only option because of strict requirements around data control and compliance. Additionally, local data protection laws (such as PIPL, APPI, and others) must be strictly followed, particularly when handling sensitive information and logs.
Business logic and workflows often have to reflect the way things are managed in different regions. In Japan and South Korea, for instance, many enterprise and public sector clients expect support for hierarchical SLAs and multi-level approval processes, reflecting structured decision-making practices. Localizing the platform is a complex engineering challenge that demands a deep understanding of regional specifics and a highly flexible architecture.
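As a hedged illustration of what hierarchical SLAs can mean in practice (the unit names and durations below are invented), an organizational unit inherits its parent’s SLA unless it defines its own override, so regional structures map onto configuration rather than code changes:

```java
import java.time.Duration;
import java.util.Map;
import java.util.Optional;

// Illustrative hierarchical SLA lookup: a unit inherits its parent's SLA unless it
// defines its own override. Unit names and durations are invented for the example.
public class HierarchicalSla {

    // child unit -> parent unit in the organizational hierarchy
    private static final Map<String, String> PARENT = Map.of(
            "tokyo-branch", "japan-region",
            "osaka-branch", "japan-region",
            "japan-region", "apac");

    // SLA overrides defined at particular levels of the hierarchy
    private static final Map<String, Duration> SLA = Map.of(
            "apac", Duration.ofHours(8),
            "japan-region", Duration.ofHours(4),
            "tokyo-branch", Duration.ofHours(2));

    static Optional<Duration> resolve(String unit) {
        for (String current = unit; current != null; current = PARENT.get(current)) {
            Duration sla = SLA.get(current);
            if (sla != null) {
                return Optional.of(sla);   // the nearest level with an explicit SLA wins
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(resolve("osaka-branch"));  // Optional[PT4H], inherited from japan-region
        System.out.println(resolve("tokyo-branch"));  // Optional[PT2H], its own override
    }
}
```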
Throughout your career, you’ve taken on increasingly complex technical roles. How do you ensure that your engineering approaches remain effective across different teams and projects? Have any of your peers or managers commented on specific qualities that they value in your work?
Having progressed from an applied engineer to an architect and technical leader, I have worked on complex, high-impact projects for global companies such as Deutsche Bank. These roles exposed me to rigorous engineering environments, advanced technical practices, and gave me the autonomy to make critical decisions, learn from failure, and develop a deep understanding of system design at scale.
Over time, I identified universal architectural principles that help systems remain reliable and scalable. I consistently approach architecture with a full-lifecycle mindset – anticipating how systems will evolve as business needs grow, where bottlenecks may appear in one to two years, and what long-term risks may arise. Colleagues and managers have consistently commended my ability to rapidly analyze and create complex architectures, identify structural weaknesses, and deliver strategic solutions, which was reflected in multiple formal recommendations. As a result, I am often entrusted with defining technical strategy and architecture, even on projects where I am not the designated lead. Beyond delivery, I actively mentor junior engineers and contribute to cultivating engineering culture across teams – a responsibility I see as integral to long-term technical excellence.