A recent article by Google Cloud SREs describes how they use the AI-powered Gemini CLI internally to resolve real-world outages. By integrating intelligent reasoning directly into terminal-based operational tools, the approach aims to improve reliability in critical infrastructure operations and reduce incident response time.
According to the authors, the Gemini CLI, built on Gemini 3, can assist the team at every stage of an outage, from classification and initial mitigation to root-cause analysis and automated postmortem generation. This helps reduce Mean Time to Mitigation (MTTM) and minimize user impact while keeping SREs in control for safety and validation. Riccardo Carlesso, developer advocate at Google, and Ramón Medrano Llamas, software engineer at Google, outlined their end goal:
We obsess over MTTM. Unlike Mean Time to Repair (MTTR), which focuses on the full fix, MTTM is about speed: how fast can we stop the pain? In this space, SREs typically have a 5-minute Service Level Objective (SLO) just to acknowledge a page, and extreme pressure to mitigate shortly after.
The authors explain that a typical incident goes through four standard phases (paging, mitigation, root cause, and postmortem) and that the AI-powered Gemini CLI can help at each of them to keep MTTM low. Using a fictitious incident, they demonstrate a full incident lifecycle driven entirely from the terminal. Starting with paging and the initial investigation, they explain:
This is a perfect task for an LLM: classify the symptoms and select a mitigation playbook. A mitigation playbook is an instruction created dynamically for an agent to be able to execute a production mutation safely. These playbooks can include the command to run, but also instructions to verify that the change is effectively addressing the problem, or to rollback the change.
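The article does not show what such a playbook looks like, but based on that description (a command to run, a verification step, and a rollback), a minimal sketch might read as follows; the service name, commands, and thresholds are purely illustrative assumptions, not taken from the original post:

```
Mitigation playbook (illustrative sketch; names, commands, and thresholds are hypothetical)

Goal: relieve elevated 5xx errors on checkout-service.

Command to run:
  kubectl -n prod scale deployment/checkout-service --replicas=20

Verification:
  Within 5 minutes, confirm the 5xx rate drops below 1% and p99 latency
  returns below 300 ms before declaring the incident mitigated.

Rollback:
  If the error rate does not improve, restore the previous replica count
  (kubectl -n prod scale deployment/checkout-service --replicas=8) and
  escalate to the incident commander.
```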
A human-in-the-loop is currently required to verify the proposed mitigations; as agent capabilities mature and agentic safety systems advance, the authors expect this dependency to decrease. Execution requires explicit safety checks, since an action that is safe in one context may be unsafe in another. The CLI approach enforces layered safety controls, so the agent supports operators as a copilot rather than acting autonomously. Wen-Tsung Chang, senior infrastructure engineer at Houzz, agrees on the importance of keeping a human in the loop:
No matter what stage we are at right now, we should always stay accountable and never give up on critical thinking.
The focus then shifts to identifying the root cause and defining a long-term fix. With infrastructure health confirmed, the issue is isolated to the application logic, and the agent is directed to the relevant source code.
The last step is the postmortem. While compiling timelines, logs, and actions is often tedious, the Gemini CLI can simplify the process through a custom command that scrapes the conversation history, metrics, and logs from the incident, populates a CSV timeline, generates a Markdown document, and suggests action items to prevent recurrence. Carlesso and Medrano Llamas note that while their example used some Google-internal tools, the pattern is universal. They conclude:
Perhaps the most exciting part is what happens next. That Postmortem we just generated? It becomes training data. By feeding past Postmortems back into Gemini, we create a virtuous loop of self-improvement: the output of today’s investigation becomes the input for tomorrow’s solution.
A similar workflow can be built outside Google using the Gemini CLI, MCP servers that connect it to tools like Grafana, Prometheus, and PagerDuty, and custom slash commands that define reusable prompts for recurring incident-response tasks.
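The Gemini CLI defines custom slash commands as TOML files, for example under a project's .gemini/commands/ directory, while MCP servers are declared under the mcpServers key of the CLI's settings.json. A minimal sketch of a postmortem-style command along the lines described above might look like this, with the command name, file path, and prompt text being illustrative assumptions rather than anything taken from the article:

```toml
# .gemini/commands/postmortem.toml -- hypothetical command name and prompt
description = "Draft a postmortem from the current incident session"

prompt = """
Review the conversation history for incident {{args}}.
Build a timeline of the key events as CSV (timestamp, actor, action, impact),
then draft a Markdown postmortem covering summary, impact, root cause,
mitigation, and suggested action items to prevent recurrence.
"""
```

Invoking /postmortem with an incident identifier would then expand the stored prompt, letting the agent pull context from the session and any connected MCP tools.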
