How To Make On-Call Sustainable | HackerNoon

What Healthy On-Call Looks Like

Incident response gets most of the attention, but teams rarely measure the human cost of on-call with the same rigor they apply to incidents.

Phone charged. Laptop by my bed. The distinctive PagerDuty alarm goes off in the middle of the night, and the process begins. I sit up, open my screen to a burst of blinding light, and try to figure out: is this transient, or is something actually broken? I pull up PagerDuty, Grafana, New Relic, and whatever else I need to piece together what’s happening. Sometimes the runbook helps. Sometimes it’s vague enough to become another problem.

I felt that cost most clearly one Thanksgiving week years ago. Repeated system failures kept me stuck in my room while the rest of my family gathered. I couldn’t shop, cook, or really take part in the holiday, and I missed Thanksgiving dinner altogether, which for my family is one of the few times each year when everyone is together. The Monday after my rotation, I was expected to return to business as usual: join meetings, pick feature work back up, and handle follow-up work like postmortems and runbook updates. There was little recognition of the cost. That kind of stress adds up quickly, and over time it becomes burnout.

Repeated incidents should be easy to recognize and resolve

Runbooks are not useful just because they exist. They are useful if someone on the team can open one at 2:37 a.m. and know the right next step. On-call rotations include people with different specialties, different contexts, and different levels of experience, and the person writing the runbook will almost always know more than the person relying on it later.

That means the documentation has to work for the person under pressure, not just the expert who wrote it. It also has to stay current. If it no longer reflects how the system behaves or fails to cover a recurring incident pattern, it should be updated.

Repeated incidents should also show up as a pattern, not be rediscovered one page at a time. If the same class of failure keeps returning, that should be obvious in review, reporting, and prioritization. Otherwise, teams end up paying for the same operational weakness again and again.
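One way to make that pattern visible is to count incidents by service and failure class rather than reading them one at a time. The sketch below is a minimal illustration, not a description of any particular tool: the `Incident` record, the `failure_class` label (assumed to be assigned by a human during review), and the sample data are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Incident:
    opened_at: datetime
    service: str
    failure_class: str  # hypothetical label, assumed to be assigned during review


def recurring_failures(incidents, min_count=2):
    """Return ((service, failure_class), count) pairs seen at least
    min_count times, most frequent first."""
    counts = Counter((i.service, i.failure_class) for i in incidents)
    return [(key, n) for key, n in counts.most_common() if n >= min_count]


# Made-up sample data for illustration only.
incidents = [
    Incident(datetime(2025, 11, 24, 2, 37), "checkout", "db-connection-exhaustion"),
    Incident(datetime(2025, 11, 25, 3, 5), "checkout", "db-connection-exhaustion"),
    Incident(datetime(2025, 11, 26, 1, 50), "search", "cache-stampede"),
    Incident(datetime(2025, 11, 27, 4, 12), "checkout", "db-connection-exhaustion"),
]

repeats = recurring_failures(incidents)
for (service, failure_class), n in repeats:
    print(f"{service}: {failure_class} occurred {n} times")
```

A report like this is only as good as the labels behind it, so the real work is agreeing on failure classes in review. Once that exists, the count gives prioritization something concrete to point at.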

And teams need enough room to fix what keeps causing pain. If engineers are expected to carry their normal feature load while also handling incidents, tech debt, noisy alerts, and follow-up work, the same problems tend to linger. There has to be space outside of feature delivery to improve reliability, reduce alert noise, and deal with the underlying issues that keep interrupting people.

Incidents don’t end when services recover

The cost of a tough on-call shift often shows up the next day, when someone is still expected to join a standup, sit through meetings, hit deadlines, and do the follow-up work, like updating the runbook or writing the postmortem.

Operationally, the incident may be over. For the person who handled it, it often is not.

That gap shows up in the data, too. Catchpoint’s 2025 SRE [Report](https://www.catchpoint.com/learn/sre-report-2025) found that 14% of respondents reported being more stressed after incidents than during them. It also found that support tends to drop off once the incident is over: 55% reported high support during incidents versus 44% after. That lines up with what many teams already know from experience. Support is visible while the incident is active. Recovery is much easier to ignore.

If a team wants on-call to be sustainable, it has to account for that aftereffect. Recovery time, follow-up capacity, and the ability to return to normal work all matter just as much as whether the incident was resolved.

Tooling should reduce work, not create more

That principle matters even more now that teams are bringing AI into operations workflows. Done well, AI can help responders triage faster, surface relevant context, summarize what changed, and suggest next steps so people spend less time hunting through tabs and more time actually resolving the issue.

That is the standard AI should be held to in incident response. The goal is not just to generate output. It is to reduce ambiguity, shorten the path to the next decision, and help people move with more confidence when time and attention are limited.

We are already seeing why that matters. In March 2026, Fortune reported that Amazon moved humans further back into the loop after a series of retail-site incidents tied to inaccurate advice from an AI agent using an old wiki. The lesson is not that AI has no place in operations. It is that operational AI has to be trustworthy, relevant, and useful in the moments when people need it most.

Historically, teams have measured the health of their systems better than the health of the people operating them. Part of that was a tooling problem. That excuse is getting weaker. With better internal data plumbing, AI agents, and more mature open-source tools for tracking on-call health, teams can get a much clearer picture of who is being interrupted, what keeps repeating, and whether recovery is actually happening after a bad night.
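As a rough sketch of what that picture could look like, the snippet below summarizes pages per responder, how many landed during sleep hours, and each person's share of the total. Everything in it is an assumption made for illustration: the `Page` record, the 22:00 to 06:59 night window, and the sample data. It does not track recovery time, which would need extra data such as time off taken after a bad night.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Page:
    responder: str
    fired_at: datetime  # assumed to be in the responder's local time


# Assumed definition of sleep hours: 22:00 through 06:59.
NIGHT_START, NIGHT_END = 22, 7


def is_night(ts):
    return ts.hour >= NIGHT_START or ts.hour < NIGHT_END


def load_by_responder(pages):
    """Summarize total pages, night pages, and share of all pages per responder."""
    summary = defaultdict(lambda: {"pages": 0, "night_pages": 0})
    for p in pages:
        summary[p.responder]["pages"] += 1
        if is_night(p.fired_at):
            summary[p.responder]["night_pages"] += 1
    total = len(pages)
    return {
        name: {**stats, "share": stats["pages"] / total}
        for name, stats in summary.items()
    }


# Made-up sample data for illustration only.
pages = [
    Page("ana", datetime(2025, 11, 24, 2, 37)),
    Page("ana", datetime(2025, 11, 25, 3, 10)),
    Page("ana", datetime(2025, 11, 25, 14, 0)),
    Page("ben", datetime(2025, 11, 26, 10, 0)),
]

load = load_by_responder(pages)
for name, stats in load.items():
    print(name, stats)
```

Even a summary this small makes concentration visible: in the sample data, one responder carries three quarters of the pages and all of the night pages.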

Making that picture visible feels like a much better goal than just trying to page less.

Takeaway

Healthy on-call is about whether the people carrying the system can keep doing it without getting ground down in the process.

If the same failures keep repeating, if recovery is invisible, and if the burden quietly concentrates on the same people, the system is not healthy no matter how clean the incident metrics look. The teams that get this right are not the ones with perfect uptime. They are the ones that make the operational burden visible, create space to fix what keeps breaking, and treat recovery as part of the job instead of a private cost paid by whoever had the pager.