How to Make On-Call Sustainable | HackerNoon

News Room
Last updated: 2026/04/06 at 8:36 PM
News Room Published 6 April 2026

What Healthy On-Call Looks Like

Incident response gets most of the attention, but teams rarely measure the human cost of on-call with the same rigor they apply to incidents.

Phone charged. Laptop by my bed. The distinctive PagerDuty alarm goes off in the middle of the night, and the process begins. I sit up, open my screen to a burst of blinding light, and try to figure out: is this transient, or is something actually broken? I pull up PagerDuty, Grafana, New Relic, and whatever else I need to piece together what’s happening. Sometimes the runbook helps. Sometimes it’s vague enough to become another problem.

I felt that cost most clearly one Thanksgiving week years ago. Repeated system failures kept me stuck in my room while the rest of my family gathered. I couldn’t shop, cook, or really take part in the holiday, and I missed Thanksgiving dinner altogether, which for my family is one of the few times each year when everyone is together. The Monday after my rotation, I was expected to return to business as usual: join meetings, pick feature work back up, and handle follow-up work like postmortems and runbook updates. There was little recognition of the cost. That kind of stress adds up quickly, and over time it becomes burnout.

Repeated incidents should be easy to recognize and resolve

Runbooks are not useful just because they exist. They are useful if someone on the team can open one at 2:37 a.m. and know the right next step. On-call rotations include people with different specialties, different contexts, and different levels of experience, and the person writing the runbook will almost always know more than the person relying on it later.

That means the documentation has to work for the person under pressure, not just the expert who wrote it. It also has to stay current. If it no longer reflects how the system behaves or fails to cover a recurring incident pattern, it should be updated.

Repeated incidents should also show up as a pattern, not be rediscovered one page at a time. If the same class of failure keeps returning, that should be obvious in review, reporting, and prioritization. Otherwise, teams end up paying for the same operational weakness again and again.

And teams need enough room to fix what keeps causing pain. If engineers are expected to carry their normal feature load while also handling incidents, tech debt, noisy alerts, and follow-up work, the same problems tend to linger. There has to be space outside of feature delivery to improve reliability, reduce alert noise, and deal with the underlying issues that keep interrupting people.
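One way to make repeated incidents show up as a pattern is to count them by failure class during review. The sketch below is a minimal illustration, not a description of any particular tool: the `Incident` record, the `failure_class` label, and the `min_count` threshold are all hypothetical names chosen for this example, and it assumes someone tags each incident with a class after the fact.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Incident:
    day: date
    failure_class: str  # hypothetical label assigned during incident review


def recurring_failure_classes(incidents, min_count=3):
    """Return (failure_class, count) pairs seen at least min_count times,
    ordered from most to least frequent."""
    counts = Counter(incident.failure_class for incident in incidents)
    return [(cls, n) for cls, n in counts.most_common() if n >= min_count]


# Illustrative data only; these incidents are made up for the example.
incidents = [
    Incident(date(2026, 3, 2), "disk-full"),
    Incident(date(2026, 3, 9), "cert-expiry"),
    Incident(date(2026, 3, 11), "disk-full"),
    Incident(date(2026, 3, 20), "disk-full"),
]
recurring = recurring_failure_classes(incidents)
```

A list like this is only a starting point for prioritization. The threshold and the time window it covers are judgment calls for each team.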

Incidents don’t end when services recover

The cost of a tough on-call often shows up the next day, when someone is still expected to join a standup, sit through meetings, hit deadlines, and do the follow-up work, like updating the runbook or writing the postmortem.

Operationally, the incident may be over. For the person who handled it, it often is not.

That gap shows up in the data, too. Catchpoint’s 2025 SRE [Report](https://www.catchpoint.com/learn/sre-report-2025) found that 14% of respondents reported being more stressed after incidents than during them. It also found that support tends to drop off once the incident is over: 55% reported high support during incidents versus 44% after. That lines up with what many teams already know from experience. Support is visible while the incident is active. Recovery is much easier to ignore.

If a team wants on-call to be sustainable, it has to account for that aftereffect. Recovery time, follow-up capacity, and the ability to return to normal work all matter just as much as whether the incident was resolved.

Tooling should reduce work, not create more

This matters even more now that teams are bringing AI into operations workflows. Done well, AI can help responders triage faster, surface relevant context, summarize what changed, and suggest next steps so people spend less time hunting through tabs and more time actually resolving the issue.

That is the standard AI should be held to in incident response. The goal is not just to generate output. It is to reduce ambiguity, shorten the path to the next decision, and help people move with more confidence when time and attention are limited.

We are already seeing why that matters. In March 2026, Fortune reported that Amazon moved humans further back into the loop after a series of retail-site incidents tied to inaccurate advice from an AI agent using an old wiki. The lesson is not that AI has no place in operations. It is that operational AI has to be trustworthy, relevant, and useful in the moments when people need it most.

Historically, teams have measured the health of their systems better than the health of the people operating them. Part of that was a tooling problem. That excuse is getting weaker. With better internal data plumbing, AI agents, and more mature open-source tools for tracking on-call health, teams can get a much clearer picture of who is being interrupted, what keeps repeating, and whether recovery is actually happening after a bad night.

That feels like a much better goal than just trying to page less.
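As a rough illustration of what "who is being interrupted" could look like in data, here is a minimal sketch that tallies pages per responder and how many landed outside working hours. Everything in it is an assumption made for the example: the `Page` record, the 9-to-18 weekday window, and the idea that timestamps are already in each responder's local time. It does not reflect the export format of any specific paging tool.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Page:
    responder: str
    at: datetime  # assumed to be the responder's local time


def is_off_hours(ts, start_hour=9, end_hour=18):
    """True on weekends, or on weekdays outside start_hour..end_hour."""
    return ts.weekday() >= 5 or not (start_hour <= ts.hour < end_hour)


def page_load(pages):
    """Map each responder to total pages and off-hours pages."""
    load = defaultdict(lambda: {"total": 0, "off_hours": 0})
    for page in pages:
        load[page.responder]["total"] += 1
        if is_off_hours(page.at):
            load[page.responder]["off_hours"] += 1
    return dict(load)


# Illustrative data only; names and timestamps are made up.
pages = [
    Page("ana", datetime(2026, 4, 6, 2, 37)),   # Monday, middle of the night
    Page("ana", datetime(2026, 4, 6, 14, 0)),   # Monday afternoon
    Page("ben", datetime(2026, 4, 11, 12, 0)),  # Saturday
]
load = page_load(pages)
```

Counts like these do not measure recovery on their own, but they make it harder for the burden to concentrate on the same people unnoticed.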

Takeaway

Healthy on-call is about whether the people carrying the system can keep doing it without getting ground down in the process.

If the same failures keep repeating, if recovery is invisible, and if the burden quietly concentrates on the same people, the system is not healthy no matter how clean the incident metrics look. The teams that get this right are not the ones with perfect uptime. They are the ones that make the operational burden visible, create space to fix what keeps breaking, and treat recovery as part of the job instead of a private cost paid by whoever had the pager.
