A study by ClickHouse found that large language models (LLMs) cannot yet replace Site Reliability Engineers (SREs) for tasks such as finding the root causes of incidents, although the technology is making notable progress towards that goal.
The study, conducted by Lionel Palacin and Al Brown, tested five leading models against real-world observability data to determine whether AI could autonomously identify production issues. The results suggest that whilst LLMs show great promise as assistive tools, they fall short of completely replacing human engineers.
“Autonomous RCA is not there yet,” the authors explained. “The promise of using LLMs to find production issues faster and at lower cost fell short in our evaluation, and even GPT-5 did not outperform the others.”
The research team tested Claude Sonnet 4, OpenAI o3, OpenAI GPT-4.1, and Gemini 2.5 Pro against four datasets containing distinct anomalies from the OpenTelemetry demo application. Each model was given access to observability data and asked to identify root causes using a simple prompt: “You’re an Observability agent and have access to OpenTelemetry data from a demo application. Users have reported issues using the application, can you identify what is the issue, the root cause and suggest potential solutions?”
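To give a feel for that kind of harness, the snippet below is a minimal sketch, assuming an OpenAI-compatible Python SDK and using the prompt quoted in the study; the actual evaluation additionally gave each model tool access to query the observability data, which is not shown here, and the model name is an illustrative stand-in.

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # The prompt quoted in the ClickHouse study.
    PROMPT = (
        "You're an Observability agent and have access to OpenTelemetry data "
        "from a demo application. Users have reported issues using the "
        "application, can you identify what is the issue, the root cause and "
        "suggest potential solutions?"
    )

    response = client.chat.completions.create(
        model="gpt-4.1",  # illustrative choice; the study compared several models
        messages=[{"role": "user", "content": PROMPT}],
    )
    print(response.choices[0].message.content)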
The results were mixed. Each model identified some issues, but none found root causes consistently without human guidance. In scenarios involving payment failures linked to specific user loyalty levels, both Claude Sonnet 4 and OpenAI o3 identified the problem after the initial prompt. With more complex issues, such as cache and product catalogue errors, the models needed human intervention to reach the right answer.
“This reflects a common pattern: the model tends to lock onto a single line of reasoning and doesn’t explore other possibilities,” the researchers noted when describing Claude Sonnet 4’s performance on cache-related issues.
Performance also varied by scenario. Gemini 2.5 Pro, for example, excelled at identifying a specific product catalogue issue but struggled with cache-related problems, and it hallucinated and doubled down on incorrect information. “It then began to formulate an imaginary cause (for which it had no evidence), and began trying to prove its case,” the authors observed of Gemini’s tendency to construct unfounded theories.
Cost and efficiency varied dramatically between models and scenarios. Token usage ranged from thousands to millions, making cost prediction difficult. Investigation times spanned from just over a minute to 45 minutes, whilst costs per investigation ranged from $0.10 to nearly $6.
When OpenAI released GPT-5 during the study period, the researchers tested it against the same scenarios. Despite being the newest model, GPT-5 performed similarly to existing models, essentially matching OpenAI o3’s results whilst using fewer tokens.
The testing approach had limitations. The datasets were relatively simple, covering hour-long periods of telemetry with injected anomalies that were easier to detect than real production problems, and the team did not refine their prompts with content enrichment or other techniques that might have improved performance. The study did, however, find that LLMs excelled at writing root cause analysis reports, with all models producing strong initial drafts. “We found the results to be consistently strong across different models and anomaly types,” the researchers reported.
The researchers concluded that the current optimal approach combines human expertise with AI assistance rather than full automation. They recommend using LLMs to “summarise noisy logs and traces, draft status updates and post-mortem sections, suggest an investigation plan to follow, and review investigation data and validate findings” whilst keeping engineers in control of the process.
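As a rough illustration of that assistive pattern, the sketch below (hypothetical, assuming an OpenAI-compatible Python SDK and an illustrative model name) asks a model to summarise noisy log output for an on-call engineer, who then reviews the draft rather than acting on it automatically.

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def summarise_logs(raw_logs: str) -> str:
        """Draft a short summary of noisy logs for an engineer to review."""
        response = client.chat.completions.create(
            model="gpt-4.1",  # illustrative; any capable model could be used
            messages=[
                {
                    "role": "system",
                    "content": (
                        "You assist an on-call engineer. Summarise the key errors "
                        "and anomalies in the logs below as short bullet points. "
                        "Do not speculate beyond the evidence."
                    ),
                },
                {"role": "user", "content": raw_logs},
            ],
        )
        return response.choices[0].message.content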
A post by Varun Biswas on LinkedIn argues that AI-driven tools can take over a significant share of monitoring, analysis, and remediation tasks, while humans stay in the loop, especially for strategic decisions and oversight. The most repetitive, automatable tasks are being delegated to AI, while system design, escalation, and recovery remain human-led.
Another recent study, by Tomasz Szandała, evaluates the capability of GPT-4o, Gemini-1.5, and Mistral-small in conducting root cause analysis (RCA) for infrastructure incidents using chaos engineering scenarios. The paper tested the LLMs against eight failure scenarios induced in a controlled e-commerce environment and compared their performance to that of human Site Reliability Engineers.
The report found that in zero-shot settings the LLMs were moderately successful, with 44-58% accuracy, while human SREs performed significantly better at 62%. The study found that “LLMs achieved significantly lower results” compared to humans, with GPT-4o scoring 0.52, Gemini 0.58, and Mistral 0.44. Prompt engineering improved performance to 60-74% accuracy, though humans still did better at over 80%.
Where the ClickHouse study found that “even GPT-5 did not outperform the others” and required significant human guidance, Szandała’s research demonstrated more consistent, measurable improvements through structured prompting, suggesting that “prompt engineering emerged as the critical element for LLMs’ performance”.
“So can LLMs replace SREs right now? No. Can they shorten incidents and improve documentation when paired with a fast observability stack? Yes,” the authors of the ClickHouse report concluded. “The path forward is better context and better tools, with engineers in control.”