Report Finds LLMs Not Yet Ready to Replace SREs in Incident Management

By News Room
Published 27 September 2025 (last updated 6:40 PM)

A study by ClickHouse found that large language models (LLMs) cannot yet replace Site Reliability Engineers (SREs) for tasks such as finding the root causes of incidents, although the technology is advancing quickly in that direction.

The study, conducted by Lionel Palacin and Al Brown, tested five leading models against real-world observability data to determine whether AI could autonomously identify production issues. The results suggest that whilst LLMs show great promise as assistive tools, they fall short of completely replacing human engineers.

“Autonomous RCA is not there yet,” the authors explained. “The promise of using LLMs to find production issues faster and at lower cost fell short in our evaluation, and even GPT-5 did not outperform the others.”

The research team initially tested four models (Claude Sonnet 4, OpenAI o3, OpenAI GPT-4.1, and Gemini 2.5 Pro) against four datasets, each containing a distinct anomaly from the OpenTelemetry demo application; GPT-5 became the fifth model when OpenAI released it mid-study. Each model was given access to observability data and asked to identify root causes using a simple prompt: “You’re an Observability agent and have access to OpenTelemetry data from a demo application. Users have reported issues using the application, can you identify what is the issue, the root cause and suggest potential solutions?”
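
To make the setup concrete, here is a minimal sketch of how such an evaluation run could be assembled. The function name, message structure, and model identifiers are assumptions for illustration, not the authors' actual test harness; only the prompt text is taken from the article.

```python
# Hypothetical sketch of the study's per-model evaluation setup.
# The prompt is quoted from the article; everything else is assumed.

RCA_PROMPT = (
    "You're an Observability agent and have access to OpenTelemetry data "
    "from a demo application. Users have reported issues using the "
    "application, can you identify what is the issue, the root cause and "
    "suggest potential solutions?"
)

def build_rca_request(model: str, prompt: str = RCA_PROMPT) -> dict:
    """Assemble a chat-completion-style request for one investigation run."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

# One request per model under test (GPT-5 was added mid-study).
MODELS = ["claude-sonnet-4", "o3", "gpt-4.1", "gemini-2.5-pro"]
requests = [build_rca_request(m) for m in MODELS]
```

In a real harness, each request would also need tool definitions granting the model access to the observability backend, which is where most of the engineering effort lies.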

Results were mixed across the models: each identified some issues, but none consistently found root causes without human guidance. In scenarios involving payment failures linked to specific user loyalty levels, both Claude Sonnet 4 and OpenAI o3 managed to identify the problem after the initial prompt. However, with more complex issues like cache and product catalogue errors, the AI needed some degree of human intervention to get to the right answer.

“This reflects a common pattern: the model tends to lock onto a single line of reasoning and doesn’t explore other possibilities,” the researchers noted when describing Claude Sonnet 4’s performance on cache-related issues.

Using different scenarios also produced variations in performance. For example, Gemini 2.5 Pro excelled at identifying a specific product catalogue issue but struggled with cache-related problems. It also hallucinated and doubled down on incorrect information. “It then began to formulate an imaginary cause (for which it had no evidence), and began trying to prove its case,” the authors observed regarding Gemini’s tendency to create unfounded theories.

Cost and efficiency varied dramatically between models and scenarios. Token usage ranged from thousands to millions, making cost prediction difficult. Investigation times spanned from just over a minute to 45 minutes, whilst costs per investigation ranged from $0.10 to nearly $6.
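
The wide cost range follows directly from that token variance. The arithmetic below illustrates it; the per-million-token prices are placeholder values chosen to reproduce the article's reported spread, not any vendor's actual pricing.

```python
# Illustrative cost arithmetic only. The prices below are placeholder
# assumptions, not real model pricing.

def investigation_cost(input_tokens: int, output_tokens: int,
                       in_price_per_m: float, out_price_per_m: float) -> float:
    """Cost in dollars for one investigation, given per-million-token prices."""
    return ((input_tokens / 1_000_000) * in_price_per_m
            + (output_tokens / 1_000_000) * out_price_per_m)

# A light run: ~50k tokens in, 5k out, at $1 / $4 per million tokens.
cheap = investigation_cost(50_000, 5_000, 1.0, 4.0)        # ≈ $0.07
# A heavy run: 2M tokens in, 100k out, at $2.50 / $10 per million tokens.
heavy = investigation_cost(2_000_000, 100_000, 2.5, 10.0)  # ≈ $6.00
```

With token usage spanning three orders of magnitude between runs, per-incident cost is effectively unpredictable in advance, which matters when deciding whether to put such an agent in an on-call loop.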

When OpenAI released GPT-5 during the study period, the researchers tested it against the same scenarios. Despite being the newest model, GPT-5 performed similarly to existing models, essentially matching OpenAI o3’s results whilst using fewer tokens.

The testing approach had limitations. The datasets were relatively simple: hour-long windows of telemetry data with injected anomalies that were easier to detect than real production problems. The team also did not fine-tune their prompts with content enrichment or other techniques that might have improved performance. The study did, however, find that LLMs excelled at writing root cause analysis reports, with all models producing strong initial drafts. “We found the results to be consistently strong across different models and anomaly types,” the researchers reported.

The researchers concluded that the current optimal approach combines human expertise with AI assistance rather than full automation. They recommend using LLMs to “summarise noisy logs and traces, draft status updates and post-mortem sections, suggest an investigation plan to follow, and review investigation data and validate findings” whilst keeping engineers in control of the process.

A post by Varun Biswas on LinkedIn argues that AI-driven tools can take over a significant share of monitoring, analysis, and remediation tasks, while humans stay in the loop, especially for strategic decisions and oversight. The most repetitive, automatable tasks are being delegated to AI, while system design, escalation, and recovery remain human-led.

Another recent study, by Tomasz Szandała, evaluates the capability of GPT-4o, Gemini-1.5, and Mistral-small in conducting root cause analysis (RCA) for infrastructure incidents using chaos engineering scenarios. The paper tested the LLMs on eight failure scenarios created in a controlled e-commerce environment and compared their performance to that of human Site Reliability Engineers.

The Szandała Report

This report found that in zero-shot settings the LLMs were moderately successful, with 44-58% accuracy, while human SREs performed significantly better at 62%. The study found that “LLMs achieved significantly lower results” compared to humans, with GPT-4o achieving 0.52, Gemini 0.58, and Mistral 0.44 accuracy. However, prompt engineering did improve performance to 60-74% accuracy, though humans still did better at over 80%.

The ClickHouse study found that “even GPT-5 did not outperform the others” and required significant human guidance, whereas this study showed measurable improvement through prompt engineering techniques. Szandała’s research demonstrated more consistent improvements through structured prompting, suggesting that “prompt engineering emerged as the critical element for LLMs’ performance”.

“So can LLMs replace SREs right now? No. Can they shorten incidents and improve documentation when paired with a fast observability stack? Yes,” the authors of the ClickHouse report concluded. “The path forward is better context and better tools, with engineers in control.”
