
Building an Autonomous SRE Incident Response System Using AWS Strands Agents SDK | HackerNoon

News Room
Last updated: 2026/03/19 at 2:03 PM
News Room Published 19 March 2026

Follow this guide to learn how to automate CloudWatch alerts, Kubernetes remediation, and incident reporting using multi-agent AI workflows with the AWS Strands Agents SDK.

The SRE Incident Response Agent is a multi-agent sample that ships with the AWS Strands Agents SDK. It automatically discovers active CloudWatch alarms, performs AI-powered root cause analysis using Claude Sonnet 4 on Amazon Bedrock, proposes Kubernetes or Helm remediations, and posts a structured incident report to Slack.

This guide covers everything you need to clone the repo and run it yourself.


Prerequisites

Before you begin, make sure the following are in place:

  • Python 3.11+ installed on your machine
  • AWS credentials configured (aws configure or an active IAM role)
  • Amazon Bedrock access enabled for Claude Sonnet 4 in your target region
  • kubectl and helm v3 installed — only required if you plan to run live remediations. Dry-run mode works without them.
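The local parts of this checklist can be verified with a short script. This is a minimal sketch, not part of the sample: check_prerequisites and live_remediation_ready are hypothetical helper names, and the script does not check AWS credentials or Bedrock model access, which have to be confirmed separately.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 11)):
    """Report which local prerequisites are present (hypothetical helper)."""
    return {
        "python": sys.version_info[:2] >= min_python,
        # kubectl and helm only matter for live remediation, not dry-run mode.
        "kubectl": shutil.which("kubectl") is not None,
        "helm": shutil.which("helm") is not None,
    }


def live_remediation_ready(status):
    """Live remediation needs all three; dry-run only needs Python."""
    return status["python"] and status["kubectl"] and status["helm"]


for name, ok in check_prerequisites().items():
    print(f"{name:8s} {'ok' if ok else 'missing'}")
```

A missing kubectl or helm entry is not a blocker as long as DRY_RUN stays enabled.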

Step 1: Clone the Repository

The sample lives inside the strands-agents/samples open source repository. Clone it and navigate to the SRE agent directory:

git clone https://github.com/strands-agents/samples.git
cd samples/02-samples/sre-incident-response-agent

The directory contains the following files:

sre-incident-response-agent/
├── sre_agent.py           # Main agent: 4 agents + 8 tools
├── test_sre_agent.py      # Pytest unit tests (12 tests, mocked AWS)
├── requirements.txt
├── .env.example
└── README.md

Step 2: Create a Virtual Environment and Install Dependencies

python -m venv .venv
source .venv/bin/activate    # Windows: .venv\Scripts\activate
pip install -r requirements.txt

The requirements.txt sets minimum versions for the core dependencies:

strands-agents>=0.1.0
strands-agents-tools>=0.1.0
boto3>=1.38.0
botocore>=1.38.0
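After installing, you may want to confirm that the environment meets these minimums. The following is a rough sketch using only the standard library; parse_minimums, installed_version, and unmet are hypothetical helpers, and the parser only understands simple name>=x.y.z lines like the ones above, not the full requirements syntax.

```python
from importlib import metadata

REQUIREMENTS = """\
strands-agents>=0.1.0
strands-agents-tools>=0.1.0
boto3>=1.38.0
botocore>=1.38.0
"""


def parse_minimums(text):
    """Map package name -> minimum version tuple for 'name>=x.y.z' lines."""
    minimums = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, version = line.partition(">=")
        minimums[name.strip()] = tuple(int(p) for p in version.strip().split("."))
    return minimums


def installed_version(name):
    """Return the installed version as a tuple of leading integers, or None."""
    try:
        raw = metadata.version(name)
    except metadata.PackageNotFoundError:
        return None
    parts = []
    for piece in raw.split("."):
        if not piece.isdigit():
            break  # stop at suffixes such as 'rc1' or 'dev0'
        parts.append(int(piece))
    return tuple(parts)


def unmet(minimums):
    """List packages that are missing or older than their minimum."""
    problems = []
    for name, minimum in minimums.items():
        found = installed_version(name)
        if found is None:
            problems.append(name)
            continue
        width = max(len(found), len(minimum))
        # Pad with zeros so that 1.38 and 1.38.0 compare as equal.
        padded_found = found + (0,) * (width - len(found))
        padded_min = minimum + (0,) * (width - len(minimum))
        if padded_found < padded_min:
            problems.append(name)
    return problems
```

Calling unmet(parse_minimums(REQUIREMENTS)) inside the activated virtual environment should return an empty list once pip install has succeeded.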

Step 3: Configure Environment Variables

Copy .env.example to .env and fill in your values:

cp .env.example .env

Open .env and set the following:

# AWS region where your CloudWatch alarms live
AWS_REGION=us-east-1

# Amazon Bedrock model ID (Claude Sonnet 4 is the default)
BEDROCK_MODEL_ID=us.anthropic.claude-sonnet-4-20250514-v1:0

# DRY_RUN=true means kubectl/helm commands are printed, not executed.
# Set to false only when you are ready for live remediations.
DRY_RUN=true

# Optional: post the incident report to Slack.
# Leave blank to print to stdout instead.
SLACK_WEBHOOK_URL=
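To make the meaning of these variables concrete, here is one way a script could read them. This is a sketch under assumptions, not the sample's actual code: Settings and load_settings are hypothetical names, the defaults simply mirror the values shown above, and treating every value other than an explicit "false" as dry-run is a fail-safe choice made for this example.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Settings:
    aws_region: str
    bedrock_model_id: str
    dry_run: bool
    slack_webhook_url: Optional[str]


def load_settings(env=None):
    """Build Settings from a mapping of environment variables (hypothetical helper)."""
    env = os.environ if env is None else env
    # Assumption: only an explicit "false" disables dry-run; typos stay safe.
    dry_run = env.get("DRY_RUN", "true").strip().lower() != "false"
    # A blank webhook means the report is printed to stdout instead.
    webhook = env.get("SLACK_WEBHOOK_URL", "").strip() or None
    return Settings(
        aws_region=env.get("AWS_REGION", "us-east-1"),
        bedrock_model_id=env.get(
            "BEDROCK_MODEL_ID", "us.anthropic.claude-sonnet-4-20250514-v1:0"
        ),
        dry_run=dry_run,
        slack_webhook_url=webhook,
    )
```

Passing a plain dict instead of os.environ keeps the function easy to exercise in tests.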

Step 4: Grant IAM Permissions

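The agent needs read-only access to CloudWatch alarms, metric statistics, and log events. No write permissions to CloudWatch are required. The same identity also needs permission to invoke the Bedrock model (typically bedrock:InvokeModel and bedrock:InvokeModelWithResponseStream), which is not part of the policy shown here. Attach the following policy to the IAM role or user running the agent: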

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "cloudwatch:DescribeAlarms",
      "cloudwatch:GetMetricStatistics",
      "logs:FilterLogEvents",
      "logs:DescribeLogGroups"
    ],
    "Resource": "*"
  }]
}
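If you adapt this policy, a quick sanity check can catch accidental write actions before you attach it. The sketch below is a heuristic only: non_read_only_actions is a hypothetical helper, and it relies on the naming convention that read actions start with Describe, Get, List, or Filter, which is not an authoritative classification of IAM actions.

```python
import json

# Heuristic: operation-name prefixes treated as read-only in this sketch.
READ_ONLY_PREFIXES = ("Describe", "Get", "List", "Filter")

POLICY = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "cloudwatch:DescribeAlarms",
      "cloudwatch:GetMetricStatistics",
      "logs:FilterLogEvents",
      "logs:DescribeLogGroups"
    ],
    "Resource": "*"
  }]
}
""")


def non_read_only_actions(policy):
    """Return allowed actions whose names do not look read-only."""
    offenders = []
    statements = policy["Statement"]
    if isinstance(statements, dict):  # IAM permits a single statement object
        statements = [statements]
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        for action in actions:
            _, _, operation = action.partition(":")
            if not operation.startswith(READ_ONLY_PREFIXES):
                offenders.append(action)
    return offenders


print(non_read_only_actions(POLICY))
```

Wildcards such as cloudwatch:* are flagged as well, because the operation part does not match any of the read-only prefixes.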

Step 5: Run the Agent

There are two ways to trigger the agent.

Option A: Automatic Alarm Discovery

Let the agent discover all active CloudWatch alarms on its own. This is the recommended mode for a real on-call scenario:

python sre_agent.py

Option B: Targeted Investigation

Pass a natural-language description of the triggering event. The agent will focus its investigation on the service and symptom you describe:

python sre_agent.py "High CPU alarm fired on ECS service my-api in prod namespace"
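The difference between the two options comes down to whether a trigger string is present on the command line. How sre_agent.py parses its arguments is not shown in this guide, so the following is only an illustrative sketch; resolve_trigger and describe_mode are hypothetical names.

```python
import sys


def resolve_trigger(argv):
    """Return the trigger text, or None to request automatic alarm discovery."""
    text = " ".join(argv[1:]).strip()
    return text or None


def describe_mode(trigger):
    """Summarise which of the two modes a given trigger selects."""
    if trigger is None:
        return "auto-discovery: inspect all active CloudWatch alarms"
    return f"targeted: {trigger}"


print(describe_mode(resolve_trigger(sys.argv)))
```

Joining all remaining arguments means the description still works if the quotes around it are forgotten.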

Example Output

Running the targeted trigger above produces output similar to the following:

Starting SRE Incident Response
   Trigger: High CPU alarm fired on ECS service my-api in prod namespace

[cloudwatch_agent] Fetching active alarms...
  Found alarm: my-api-HighCPU (CPUUtilization > 85% for 5m)
  Metric stats: avg 91.3%, max 97.8% over last 30 min
  Log events: 14 OOMKilled events in /ecs/my-api

[rca_agent] Performing root cause analysis...
  Root cause: Memory leak causing CPU spike as GC thrashes
  Severity: P2 - single service, <5% of users affected
  Recommended fix: Rolling restart to clear heap; monitor for recurrence

[remediation_agent] Applying remediation...
  [DRY-RUN] kubectl rollout restart deployment/my-api -n prod

================================================================
*[P2] SRE Incident Report - 2025-10-14 09:31 UTC*

What happened: CloudWatch alarm my-api-HighCPU fired at 09:18 UTC.
CPU reached 97.8% (threshold 85%). 14 OOMKilled events in 15 min.

Root cause: Memory leak in application heap leading to aggressive GC,
causing CPU saturation. Likely introduced in the last deployment.

Remediation: Rolling restart of deployment/my-api in namespace prod
initiated (dry-run). All pods will be replaced with fresh instances.

Follow-up:
  - Monitor CPUUtilization for next 30 min
  - Review recent commits for memory allocation changes
  - Consider setting memory limits in the Helm chart
================================================================
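The final report goes to Slack when SLACK_WEBHOOK_URL is set and to stdout otherwise. A minimal sketch of that delivery step, using only the standard library, might look like the following. build_slack_request and deliver_report are hypothetical names and the sample's own implementation may differ; the {"text": ...} JSON body is the basic payload format accepted by Slack incoming webhooks.

```python
import json
import urllib.request


def build_slack_request(webhook_url, report_text):
    """Prepare a POST request carrying the report as a Slack webhook payload."""
    payload = json.dumps({"text": report_text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def deliver_report(report_text, webhook_url=None, timeout=10):
    """Post to Slack if a webhook is configured, otherwise print to stdout."""
    if not webhook_url:
        print(report_text)
        return "stdout"
    request = build_slack_request(webhook_url, report_text)
    # urlopen raises on HTTP errors, so reaching the return means success.
    with urllib.request.urlopen(request, timeout=timeout):
        pass
    return "slack"
```

Keeping request construction separate from sending makes the payload easy to inspect without any network access.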

Running the Tests (No AWS Credentials Required)

The sample ships with 12 pytest unit tests that mock boto3 entirely. You can run the full test suite in any environment, including CI, without any AWS credentials:

pip install pytest pytest-mock
pytest test_sre_agent.py -v

# Expected: 12 passed
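To illustrate why no credentials are needed, here is the general shape of a test that replaces the CloudWatch client with a mock. This is not one of the 12 tests from the sample: active_alarm_names is a hypothetical function written for this example, and it ignores pagination for brevity. The describe_alarms call with StateValue="ALARM" and the MetricAlarms response key are part of the boto3 CloudWatch client.

```python
from unittest.mock import MagicMock


def active_alarm_names(cloudwatch_client):
    """Return the names of metric alarms currently in the ALARM state."""
    response = cloudwatch_client.describe_alarms(StateValue="ALARM")
    return [alarm["AlarmName"] for alarm in response.get("MetricAlarms", [])]


def test_active_alarm_names_uses_mocked_client():
    # The mock stands in for boto3.client("cloudwatch"); no AWS call is made.
    fake_client = MagicMock()
    fake_client.describe_alarms.return_value = {
        "MetricAlarms": [{"AlarmName": "my-api-HighCPU"}]
    }

    assert active_alarm_names(fake_client) == ["my-api-HighCPU"]
    fake_client.describe_alarms.assert_called_once_with(StateValue="ALARM")
```

Because the client is passed in as an argument, the function never needs to import boto3 or look up credentials during the test.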

Enabling Live Remediation

Once you have validated the agent’s behaviour in dry-run mode and are satisfied with the decisions it makes, you can enable live kubectl and helm execution by setting DRY_RUN=false in your .env file:

DRY_RUN=false
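The effect of this flag can be pictured as a gate in front of every kubectl or helm call. The sketch below shows one possible shape for such a gate; run_remediation and ALLOWED_BINARIES are hypothetical, and the allow-list is an extra safeguard added for this example rather than something the guide describes.

```python
import shlex
import subprocess

# Assumption for this sketch: only these binaries may ever be executed.
ALLOWED_BINARIES = {"kubectl", "helm"}


def run_remediation(command, dry_run=True):
    """Print the command in dry-run mode, execute it otherwise."""
    argv = shlex.split(command)
    if not argv or argv[0] not in ALLOWED_BINARIES:
        raise ValueError(f"refusing to run non-remediation command: {command!r}")
    if dry_run:
        line = f"[DRY-RUN] {command}"
        print(line)
        return line
    # Passing a list and no shell avoids shell interpretation of the arguments.
    completed = subprocess.run(argv, capture_output=True, text=True, check=True)
    return completed.stdout


run_remediation("kubectl rollout restart deployment/my-api -n prod")
```

With the default dry_run=True, the call above only prints the line seen in the example output and never touches the cluster.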

Conclusion

With only a few minutes of setup, the AWS Strands Agents SDK gives you a working multi-agent incident response loop: alarm discovery, AI-powered root cause analysis, Kubernetes remediation, and a structured incident report, all driven by a single python sre_agent.py command. Because dry-run is the default, kubectl and helm commands are printed rather than executed, so you can evaluate the agent's reasoning against a real environment without changing cluster state.

From here, the natural next steps are connecting a Slack webhook for team notifications, adding a PagerDuty tool for incident tracking, or extending the RCA agent with a vector store of past postmortems. All of that is a tool definition away.

I hope you found this article helpful and that it will inspire you to explore AWS Strands Agents SDK and AI agents more deeply.

