A growing body of recent research and industry commentary suggests that a shift is underway in how organisations approach site reliability engineering. Rather than handing the pager to a machine, teams are designing multi-agent AI systems that work alongside on-call engineers, narrowing the search space and automating the tedious steps of incident investigation while leaving judgment calls to humans.
In a deep-dive blog post on multi-agent incident response, Ar Hakboian, co-founder of OpsWorker, which offers an agentic AI co-worker as a service, argues that the real value of AI in incident management lies in orchestration. Hakboian describes a pattern in which specialised agents (one for logs, one for metrics, one for runbooks, and so on) are coordinated by a supervisor layer that decides who works on what and in what order. The aim, the author explains, is to reduce the cognitive load on the engineer by proposing hypotheses, drafting queries, and curating relevant context, rather than replacing the human entirely.
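To make that division of labour concrete, the following minimal Python sketch shows a supervisor sequencing specialist agents and collecting their findings into a briefing for the on-call engineer. It is not OpsWorker's implementation: the agent classes, the Finding record and the fixed plan are hypothetical stand-ins for what would normally be LLM-driven components.

# A minimal sketch of the supervisor pattern described above. Not OpsWorker's
# implementation: agent classes, the Finding record and the fixed plan are
# hypothetical stand-ins for LLM-driven parts.
from dataclasses import dataclass, field


@dataclass
class Finding:
    agent: str
    summary: str
    evidence: list[str] = field(default_factory=list)


class MetricsAgent:
    name = "metrics"

    def investigate(self, incident: dict) -> Finding:
        # A real agent would draft dashboard and metric queries here.
        return Finding(self.name, f"pulled golden-signal dashboards for {incident['service']}")


class LogAgent:
    name = "logs"

    def investigate(self, incident: dict) -> Finding:
        return Finding(self.name, f"drafted log queries for {incident['service']}")


class RunbookAgent:
    name = "runbooks"

    def investigate(self, incident: dict) -> Finding:
        return Finding(self.name, f"retrieved runbook sections matching '{incident['symptom']}'")


class Supervisor:
    """Decides which specialist works next and collects findings into a
    briefing for the on-call engineer, who keeps the final judgment call."""

    def __init__(self, agents):
        self.agents = {agent.name: agent for agent in agents}

    def plan(self, incident: dict) -> list[str]:
        # In practice the supervisor would be an LLM; a fixed order stands in.
        return ["metrics", "logs", "runbooks"]

    def handle(self, incident: dict) -> list[Finding]:
        return [self.agents[name].investigate(incident) for name in self.plan(incident)]


incident = {"service": "checkout-api", "symptom": "elevated 5xx"}
for finding in Supervisor([MetricsAgent(), LogAgent(), RunbookAgent()]).handle(incident):
    print(finding.agent, "->", finding.summary)  # surfaced to the human, not auto-applied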
The blog post frames this approach succinctly, noting that AI agents should propose hypotheses, queries and remediation options while humans stay in the loop for judgment and approval. This framing aligns closely with a recent academic paper by Zefang Liu published on arXiv, which uses the Backdoors and Breaches tabletop framework to study how teams of large language model agents coordinate during simulated cyber incidents.
Liu’s experiments compared centralised, decentralised and hybrid team structures and found that homogeneous centralised and hybrid structures achieved the highest success rates, while decentralised teams of domain specialists struggled to reach consensus without a leader. Rather than showing that cooperating agents are inherently unhelpful, the findings suggest that autonomous agents working together without clear leadership add confusion and do not solve problems faster. The implication for SRE is that a supervisor or orchestrator layer matters as much as the agents it directs. Notably, mixed teams of domain specialists sometimes fared worse than homogeneous teams of generalists even under a supervisor, seemingly because the specialists disagreed on priorities and could not converge on a single course of action.
The OpsWorker blog post indirectly addresses this by emphasising explicit role design and structured hand-offs, where each agent has a clear set of tools and responsibilities to reduce the risk of deadlock.
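One way to read that advice is as a contract per agent: a role that names its responsibility and the only tools it may call, plus a structured hand-off record that tells the receiving agent exactly which question it is answering. The sketch below is an interpretation under those assumptions, not code from the post; the role names, tool identifiers and Handoff fields are invented for illustration.

# A sketch of explicit role design and structured hand-offs, in the spirit of
# the post. Role names, tool identifiers and the Handoff fields are
# illustrative assumptions, not taken verbatim from OpsWorker.
from dataclasses import dataclass


@dataclass(frozen=True)
class Role:
    name: str
    responsibility: str
    allowed_tools: frozenset  # calls outside this set are rejected outright


@dataclass
class Handoff:
    from_role: str
    to_role: str
    question: str   # what the receiving agent is being asked to answer
    context: dict   # only the evidence relevant to that question


ROLES = {
    "metrics": Role("metrics", "confirm scope and timing of the regression",
                    frozenset({"query_prometheus"})),
    "logs": Role("logs", "identify the first failing component",
                 frozenset({"search_logs"})),
    "runbooks": Role("runbooks", "match symptoms to documented procedures",
                     frozenset({"search_knowledge_base"})),
}


def check_tool_call(role: Role, tool: str) -> None:
    """Fail loudly if an agent tries a tool outside its declared role."""
    if tool not in role.allowed_tools:
        raise PermissionError(f"{role.name} agent may not call {tool}")


# The metrics agent hands a narrowed question to the logs agent.
handoff = Handoff(
    from_role="metrics", to_role="logs",
    question="what changed in checkout-api around 14:02 UTC?",
    context={"window": "13:55-14:10", "service": "checkout-api"},
)
check_tool_call(ROLES[handoff.to_role], "search_logs")          # permitted
# check_tool_call(ROLES[handoff.to_role], "restart_deployment") # would raise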
The experiment validates technical feasibility but reveals the productionization gap is substantial. The agents are excellent technical investigators but lack the safety controls, reliability engineering, and operational maturity required for production incident response.
– Ar Hakboian
Cloud consultancy EverOps has recently written a post on how LLMs are transforming SRE work without replacing engineers, which supports the augmentation-over-replacement view. The firm reports that only a small minority of surveyed SRE professionals believe AI will replace their jobs within two years, while a clear majority see it as a tool that makes their work easier. The piece notes that practical use cases centre on log ingestion and anomaly detection, triage automation, alert clustering, and retrieval-based access to internal knowledge repositories. EverOps also highlights the gap between promise and performance, citing a ClickHouse experiment that tested several advanced language models on real root-cause analysis scenarios and found that autonomous analysis fell short of human-guided investigation.
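As a toy illustration of one item on that list, the snippet below clusters alerts that share a service and alert name and fire close together in time, so an engineer sees one grouped incident instead of a page storm. The grouping key and the five-minute window are assumptions made for the example, not anything EverOps prescribes.

# Toy alert clustering: group alerts by (service, alert name) when they fire
# within a short window of each other. Field names and the window size are
# assumptions for illustration only.
from datetime import datetime, timedelta


def cluster_alerts(alerts: list[dict], window: timedelta = timedelta(minutes=5)) -> list[list[dict]]:
    """Return lists of alerts that likely belong to the same incident."""
    open_clusters: dict[tuple, list[dict]] = {}
    finished: list[list[dict]] = []
    for alert in sorted(alerts, key=lambda a: a["fired_at"]):
        key = (alert["service"], alert["name"])
        current = open_clusters.get(key)
        if current and alert["fired_at"] - current[-1]["fired_at"] <= window:
            current.append(alert)          # close enough in time: same cluster
        else:
            if current:
                finished.append(current)   # gap too large: close the old cluster
            open_clusters[key] = [alert]
    return finished + list(open_clusters.values())


alerts = [
    {"service": "checkout-api", "name": "HighErrorRate", "fired_at": datetime(2025, 1, 1, 14, 2)},
    {"service": "checkout-api", "name": "HighErrorRate", "fired_at": datetime(2025, 1, 1, 14, 3)},
    {"service": "payments", "name": "HighLatency", "fired_at": datetime(2025, 1, 1, 14, 4)},
]
for cluster in cluster_alerts(alerts):
    print(cluster[0]["service"], cluster[0]["name"], "->", len(cluster), "alert(s)")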
The OpsWorker blog post shares that caution, emphasising evaluation and safety. It makes a series of recommendations, such as testing multi-agent setups against realistic incidents and granting agents the minimum necessary privileges. Hakboian suggests rolling out agentic techniques gradually, starting with read-only access and moving to controlled agentic actions only after their work has been carefully validated. He also argues for guardrails and careful tooling integration rather than time spent on clever prompts in an incident context. Hakboian consistently calls for human oversight, and he highlights the risk of hallucination when agents interact with tools.
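That staged rollout can be pictured as a gate in front of every tool call: read-only tools run immediately, while anything that would change the system is blocked in the first phase and, in later phases, queued for human approval. The tool names and the ALLOW_WRITES flag below are hypothetical, used only to illustrate the gating pattern rather than any specific OpsWorker mechanism.

# Hedged sketch of staged, least-privilege tool access: phase one is
# read-only, and write actions always require a named human approver.
# Tool names and the ALLOW_WRITES flag are hypothetical.
from enum import Enum
from typing import Optional


class ToolKind(Enum):
    READ = "read"     # queries, dashboards, runbook search
    WRITE = "write"   # restarts, rollbacks, configuration changes


TOOL_REGISTRY = {
    "search_logs": ToolKind.READ,
    "query_metrics": ToolKind.READ,
    "restart_deployment": ToolKind.WRITE,
    "rollback_release": ToolKind.WRITE,
}

# Phase one of the rollout: no writes at all. Later phases flip this to True
# only after the agents' read-only work has been validated on real incidents.
ALLOW_WRITES = False


def execute_tool(name: str, approved_by: Optional[str] = None) -> str:
    kind = TOOL_REGISTRY.get(name)
    if kind is None:
        raise ValueError(f"unknown tool: {name}")
    if kind is ToolKind.WRITE:
        if not ALLOW_WRITES:
            return f"BLOCKED: {name} is a write action and the rollout phase is read-only"
        if approved_by is None:
            return f"PENDING: {name} queued for human approval"
    return f"OK: {name} executed"


print(execute_tool("query_metrics"))       # OK: runs immediately
print(execute_tool("restart_deployment"))  # BLOCKED in the read-only phase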
Amazon Web Services has published a detailed example of a multi-agent SRE assistant built on its Bedrock platform. The architecture mirrors the pattern in the OpsWorker blog post almost directly: a supervisor coordinates four specialised agents for metrics, logs, topology and runbooks, all wired into a synthetic Kubernetes backend. The AWS piece is vendor-focused, tied to Bedrock and built with the LangGraph framework, but it shares a workflow-first mindset with the OpsWorker blog post.
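The same supervisor loop can be expressed with LangGraph, the open-source framework the AWS example builds on. The rough sketch below is not the AWS sample's code: the state shape, node names and hard-coded routing stand in for the Bedrock-backed agents and are assumptions made purely for illustration.

# Unofficial sketch of a supervisor-plus-specialists graph in LangGraph.
# Not the AWS sample: state shape, node names and routing are assumptions.
from typing import TypedDict
from langgraph.graph import StateGraph, START, END


class IncidentState(TypedDict):
    incident: str
    findings: list
    next_agent: str


def supervisor(state: IncidentState) -> IncidentState:
    # A real supervisor would ask an LLM what to do next; here the routing
    # is hard-coded: metrics first, then logs, then stop.
    if not state["findings"]:
        state["next_agent"] = "metrics"
    elif len(state["findings"]) == 1:
        state["next_agent"] = "logs"
    else:
        state["next_agent"] = "done"
    return state


def metrics_agent(state: IncidentState) -> IncidentState:
    state["findings"].append("metrics: error rate spiked at 14:02")
    return state


def logs_agent(state: IncidentState) -> IncidentState:
    state["findings"].append("logs: OOMKilled events on checkout-api")
    return state


graph = StateGraph(IncidentState)
graph.add_node("supervisor", supervisor)
graph.add_node("metrics", metrics_agent)
graph.add_node("logs", logs_agent)
graph.add_edge(START, "supervisor")
graph.add_conditional_edges(
    "supervisor",
    lambda state: state["next_agent"],
    {"metrics": "metrics", "logs": "logs", "done": END},
)
graph.add_edge("metrics", "supervisor")  # specialists report back to the supervisor
graph.add_edge("logs", "supervisor")
app = graph.compile()

result = app.invoke({"incident": "elevated 5xx on checkout-api", "findings": [], "next_agent": ""})
print(result["findings"])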
Taken together, these sources suggest that agentic SRE is maturing quickly, but that organisations are using these systems to augment rather than replace staff. The OpsWorker blog post offers a thoughtful, detailed methodology for teams looking to integrate AI agents into their incident workflows while keeping human engineers in control.
