Building An Autonomous SRE Incident Response System Using AWS Strands Agents SDK

Follow this guide to learn how to automate CloudWatch alerts, Kubernetes remediation, and incident reporting using multi-agent AI workflows with the AWS Strands Agents SDK.

The SRE Incident Response Agent is a multi-agent sample that ships with the AWS Strands Agents SDK. It automatically discovers active CloudWatch alarms, performs AI-powered root cause analysis using Claude Sonnet 4 on Amazon Bedrock, proposes Kubernetes or Helm remediations, and posts a structured incident report to Slack.

This guide covers everything you need to clone the repo and run it yourself.

Prerequisites

Before you begin, make sure the following are in place:

Python 3.11+ installed on your machine
AWS credentials configured (aws configure or an active IAM role)
Amazon Bedrock access enabled for Claude Sonnet 4 in your target region
kubectl and helm v3 installed (required only if you plan to run live remediations; dry-run mode works without them)
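If you want to confirm the local prerequisites before cloning, a small preflight script can help. The sketch below is not part of the sample; `preflight`, `MIN_PYTHON`, and `OPTIONAL_CLIS` are hypothetical names. It checks only the Python version and whether kubectl and helm are on your PATH. It does not verify AWS credentials, Bedrock model access, or the helm major version.

```python
import shutil
import sys

MIN_PYTHON = (3, 11)                 # from the prerequisites list
OPTIONAL_CLIS = ("kubectl", "helm")  # only needed for live remediation


def preflight(version_info=sys.version_info, which=shutil.which):
    """Return a list of human-readable findings; an empty list means all good."""
    findings = []
    if tuple(version_info[:2]) < MIN_PYTHON:
        findings.append(
            f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required, "
            f"found {version_info[0]}.{version_info[1]}"
        )
    for cli in OPTIONAL_CLIS:
        # shutil.which returns None when the executable is not on PATH.
        if which(cli) is None:
            findings.append(f"{cli} not on PATH (fine for dry-run mode)")
    return findings


for finding in preflight():
    print("WARN:", finding)
```

The `version_info` and `which` parameters exist only so the check can be exercised without changing your machine.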

Step 1: Clone the Repository

The sample lives inside the strands-agents/samples open source repository. Clone it and navigate to the SRE agent directory:

git clone https://github.com/strands-agents/samples.git
cd samples/02-samples/sre-incident-response-agent

The directory contains the following files:

sre-incident-response-agent/
├── sre_agent.py          # Main agent: 4 agents + 8 tools
├── test_sre_agent.py     # Pytest unit tests (12 tests, mocked AWS)
├── requirements.txt
├── .env.example
└── README.md

Step 2: Create a Virtual Environment and Install Dependencies

python -m venv .venv
source .venv/bin/activate    # Windows: .venv\Scripts\activate
pip install -r requirements.txt

The requirements.txt sets minimum versions for the core dependencies:

strands-agents>=0.1.0
strands-agents-tools>=0.1.0
boto3>=1.38.0
botocore>=1.38.0

Step 3: Configure Environment Variables

Copy .env.example to .env and fill in your values:

cp .env.example .env

Open .env and set the following:

# AWS region where your CloudWatch alarms live
AWS_REGION=us-east-1

# Amazon Bedrock model ID (Claude Sonnet 4 is the default)
BEDROCK_MODEL_ID=us.anthropic.claude-sonnet-4-20250514-v1:0

# DRY_RUN=true means kubectl/helm commands are printed, not executed.
# Set to false only when you are ready for live remediations.
DRY_RUN=true

# Optional: post the incident report to Slack.
# Leave blank to print to stdout instead.
SLACK_WEBHOOK_URL=
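To make the meaning of these four variables concrete, here is a minimal sketch of how a script could read them. It is an illustration, not the sample's actual loading code: `Settings` and `load_settings` are hypothetical names, and the sketch assumes the values are already in the process environment (how sre_agent.py loads the .env file is not shown in this guide). It treats any DRY_RUN value other than an explicit "false" as dry-run, which is a design choice of this sketch.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Settings:
    aws_region: str
    bedrock_model_id: str
    dry_run: bool
    slack_webhook_url: Optional[str]


def load_settings(env=os.environ) -> Settings:
    """Read the four variables listed in .env.example, with defaults."""
    # Anything other than an explicit "false" keeps dry-run on, so a typo
    # such as "flase" never enables live remediation by accident.
    dry_run = env.get("DRY_RUN", "true").strip().lower() != "false"
    return Settings(
        aws_region=env.get("AWS_REGION", "us-east-1"),
        bedrock_model_id=env.get(
            "BEDROCK_MODEL_ID", "us.anthropic.claude-sonnet-4-20250514-v1:0"
        ),
        dry_run=dry_run,
        # A blank webhook means "print the report to stdout instead".
        slack_webhook_url=env.get("SLACK_WEBHOOK_URL", "").strip() or None,
    )
```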

Step 4: Grant IAM Permissions

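The agent needs read-only access to CloudWatch alarms, metric statistics, and log events. No write permissions to CloudWatch are required. Attach the following policy to the IAM role or user running the agent. Note that this policy covers CloudWatch only; the same identity also needs permission to invoke the Bedrock model from the prerequisites (typically bedrock:InvokeModel and bedrock:InvokeModelWithResponseStream):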

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "cloudwatch:DescribeAlarms",
      "cloudwatch:GetMetricStatistics",
      "logs:FilterLogEvents",
      "logs:DescribeLogGroups"
    ],
    "Resource": "*"
  }]
}
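If you adapt this policy later, a quick check can flag actions that no longer look read-only. The sketch below is a naming heuristic, not an IAM guarantee: it assumes that read-only actions start with Describe, Get, List, or Filter, and `write_actions` and `READ_ONLY_VERBS` are hypothetical names introduced here.

```python
import json

POLICY = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "cloudwatch:DescribeAlarms",
      "cloudwatch:GetMetricStatistics",
      "logs:FilterLogEvents",
      "logs:DescribeLogGroups"
    ],
    "Resource": "*"
  }]
}
""")

# Assumption: these verb prefixes indicate a read-only action.
READ_ONLY_VERBS = ("Describe", "Get", "List", "Filter")


def write_actions(policy):
    """Return every allowed action whose verb is not in READ_ONLY_VERBS."""
    flagged = []
    statements = policy["Statement"]
    if isinstance(statements, dict):  # IAM allows a single statement object
        statements = [statements]
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # IAM allows a single action string
            actions = [actions]
        for action in actions:
            verb = action.split(":", 1)[-1]
            if not verb.startswith(READ_ONLY_VERBS):
                flagged.append(action)
    return flagged


print(write_actions(POLICY))  # prints [] for the policy above
```

A wildcard such as "*" or "cloudwatch:*" is flagged as well, because it does not start with one of the listed verbs.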

Step 5: Run the Agent

There are two ways to trigger the agent.

Option A: Automatic Alarm Discovery

Let the agent discover all active CloudWatch alarms on its own. This is the recommended mode for a real on-call scenario:

python sre_agent.py
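To give a feel for what alarm discovery involves, here is a minimal sketch of listing alarms that are currently in the ALARM state. It is not the sample's own tool code; `list_active_alarms` is a hypothetical name. It uses the boto3 CloudWatch `describe_alarms` paginator, which the `cloudwatch:DescribeAlarms` permission from Step 4 allows.

```python
def list_active_alarms(cloudwatch):
    """Return (name, metric, threshold) for each metric alarm in ALARM state.

    `cloudwatch` is a boto3 CloudWatch client, for example:
        import boto3
        cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
    """
    alarms = []
    paginator = cloudwatch.get_paginator("describe_alarms")
    # StateValue filters server-side, so OK and INSUFFICIENT_DATA alarms
    # are never returned.
    for page in paginator.paginate(StateValue="ALARM"):
        for alarm in page.get("MetricAlarms", []):
            # Alarms built on metric math may have no single MetricName.
            alarms.append(
                (alarm["AlarmName"], alarm.get("MetricName"), alarm.get("Threshold"))
            )
    return alarms
```

Passing the client in as a parameter keeps the function easy to exercise with a fake client, which is the same idea the sample's mocked test suite relies on.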

Option B: Targeted Investigation

Pass a natural-language description of the triggering event. The agent will focus its investigation on the service and symptom you describe:

python sre_agent.py "High CPU alarm fired on ECS service my-api in prod namespace"
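The two options differ only in whether a trigger string is passed on the command line. A minimal sketch of that dispatch, assuming a hypothetical `resolve_trigger` helper and a placeholder default prompt (the wording the sample actually uses for automatic discovery is not shown in this guide), could look like this:

```python
import sys

# Hypothetical default prompt used when no trigger is given (Option A).
DEFAULT_TRIGGER = "Discover all active CloudWatch alarms and investigate them."


def resolve_trigger(argv):
    """Join any CLI arguments into one trigger string, else use the default."""
    text = " ".join(argv[1:]).strip()
    return text or DEFAULT_TRIGGER


print("Trigger:", resolve_trigger(sys.argv))
```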

Example Output

Running the targeted trigger above produces output similar to the following:

Starting SRE Incident Response
   Trigger: High CPU alarm fired on ECS service my-api in prod namespace

[cloudwatch_agent] Fetching active alarms...
  Found alarm: my-api-HighCPU (CPUUtilization > 85% for 5m)
  Metric stats: avg 91.3%, max 97.8% over last 30 min
  Log events: 14 OOMKilled events in /ecs/my-api

[rca_agent] Performing root cause analysis...
  Root cause: Memory leak causing CPU spike as GC thrashes
  Severity: P2 - single service, <5% of users affected
  Recommended fix: Rolling restart to clear heap; monitor for recurrence

[remediation_agent] Applying remediation...
  [DRY-RUN] kubectl rollout restart deployment/my-api -n prod

================================================================
*[P2] SRE Incident Report - 2025-10-14 09:31 UTC*

What happened: CloudWatch alarm my-api-HighCPU fired at 09:18 UTC.
CPU reached 97.8% (threshold 85%). 14 OOMKilled events in 15 min.

Root cause: Memory leak in application heap leading to aggressive GC,
causing CPU saturation. Likely introduced in the last deployment.

Remediation: Rolling restart of deployment/my-api in namespace prod
initiated (dry-run). All pods will be replaced with fresh instances.

Follow-up:
  - Monitor CPUUtilization for next 30 min
  - Review recent commits for memory allocation changes
  - Consider setting memory limits in the Helm chart
================================================================
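The report is delivered either to Slack or to stdout, depending on whether SLACK_WEBHOOK_URL is set. The sketch below shows one way to implement that fallback with the standard library only. It is an illustration rather than the sample's code: `publish_report` is a hypothetical name, and it relies on the fact that Slack incoming webhooks accept a JSON body with a `text` field.

```python
import json
import urllib.request


def publish_report(report, webhook_url=None, opener=urllib.request.urlopen):
    """Post `report` to Slack if a webhook URL is set, else print it.

    Returns "slack" or "stdout" so callers can log where the report went.
    """
    if not webhook_url:
        print(report)
        return "stdout"
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps({"text": report}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError on a 4xx/5xx response.
    with opener(request, timeout=10) as response:
        response.read()
    return "slack"
```

The `opener` parameter is there so the function can be exercised without making a network call.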

Running the Tests (No AWS Credentials Required)

The sample ships with 12 pytest unit tests that mock boto3 entirely. You can run the full test suite in any environment, including CI, without any AWS credentials:

pip install pytest pytest-mock
pytest test_sre_agent.py -v

# Expected: 12 passed
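If you have not written tests against mocked AWS clients before, the pattern looks roughly like the sketch below. It is not taken from test_sre_agent.py; `count_alarms_in_alarm_state` is a hypothetical function under test, and the example uses `unittest.mock.MagicMock` from the standard library in place of a real boto3 client.

```python
from unittest.mock import MagicMock


def count_alarms_in_alarm_state(cloudwatch):
    """Hypothetical function under test: counts metric alarms in ALARM state."""
    response = cloudwatch.describe_alarms(StateValue="ALARM")
    return len(response["MetricAlarms"])


def test_count_alarms_uses_mocked_client():
    # The mock stands in for boto3.client("cloudwatch"); no credentials needed.
    fake_cloudwatch = MagicMock()
    fake_cloudwatch.describe_alarms.return_value = {
        "MetricAlarms": [{"AlarmName": "my-api-HighCPU"}]
    }

    assert count_alarms_in_alarm_state(fake_cloudwatch) == 1
    fake_cloudwatch.describe_alarms.assert_called_once_with(StateValue="ALARM")
```

Because the client is a mock, the test asserts both the return value and the exact arguments the code sent to AWS.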

Enabling Live Remediation

Once you have validated the agent’s behaviour in dry-run mode and are satisfied with the decisions it makes, you can enable live kubectl and helm execution by setting DRY_RUN=false in your .env file:

DRY_RUN=false
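Conceptually, the DRY_RUN flag gates a single decision: print the command or execute it. The following sketch shows one way such a guard can be written. It is an assumption about the general pattern, not the sample's implementation, and `run_command` is a hypothetical name.

```python
import shlex
import subprocess


def run_command(argv, dry_run=True, runner=subprocess.run):
    """Print the command in dry-run mode; execute it otherwise."""
    rendered = shlex.join(argv)
    if dry_run:
        print(f"[DRY-RUN] {rendered}")
        return None
    # check=True raises CalledProcessError on a non-zero exit code, so a
    # failed remediation is never silently ignored.
    return runner(argv, check=True, capture_output=True, text=True)


# Prints: [DRY-RUN] kubectl rollout restart deployment/my-api -n prod
run_command(["kubectl", "rollout", "restart", "deployment/my-api", "-n", "prod"])
```

Passing the command as a list rather than a shell string avoids shell interpolation of anything the model proposes, which matters once dry-run is switched off.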

Conclusion

In under five minutes of setup, the AWS Strands Agents SDK gives you a working multi-agent incident response loop: alarm discovery, AI-powered root cause analysis, Kubernetes remediation, and a structured incident report, all driven by a single python sre_agent.py command. Because dry-run is the default, kubectl and helm commands are only printed, so you can point the agent at a real environment and evaluate its reasoning without changing anything in the cluster.

From here, the natural next steps are connecting a Slack webhook for team notifications, adding a PagerDuty tool for incident tracking, or extending the RCA agent with a vector store of past postmortems. All of that is a tool definition away.

I hope you found this article helpful and that it will inspire you to explore AWS Strands Agents SDK and AI agents more deeply.

Building an Autonomous SRE Incident Response System Using AWS Strands Agents SDK | HackerNoon
