DevOps Modernization: AI Agents, Intelligent Observability and Automation

News Room | Published 19 February 2026, last updated 8:54 AM

Transcript

Renato Losio: In this session, we’re going to chat about DevOps modernization. We’re going to talk about agents. We’re going to talk about so-called intelligent observability and automation.

Just a couple of words about today’s topic. Of course, nothing surprising here, AI is changing DevOps and is changing the way teams are moving beyond reactive monitoring towards predictive automated delivery and operations. What does that mean? How can teams actually implement predictive incident detection, intelligent rollout, and AI-driven remediation? Also, how can we accelerate delivery? Those are all topics that today’s panelists hopefully are going to cover.

My name is Renato Losio. I'm by day a Cloud Architect, mostly working with cloud technology, AWS stuff. I'm an editor here at InfoQ. I'm joined by four different experts coming from different industries, different backgrounds. They will discuss how agents and generative models are being integrated not only into the pipeline, but also into platform engineering, feature management, and observability. I'd like to allow each one of them to introduce themselves and share their journey in DevOps modernization.

Patrick Debois: My name is Patrick Debois. You might know me from two things in the industry, which is the DevOps Handbook and DevOpsDays. We sparked a little bit of the DevOps thing. I’m now like two and a half years into more coding with AI. I left a little bit of the ops side, but I’m still keeping an eye on what is happening in Ops because there’s obviously a close relation to that. I currently work as Product DevRel at Tessl where we’re trying to solve some of the delivery problems of context.

Mallika Rao: I'm Mallika. I have led platform and infra teams at Netflix, Twitter, Walmart in the past, mostly with a background in search personalization and recommendation systems. I came into DevOps through a very large-scale distributed systems lens rather than tooling-first automation. My team sits at that intersection of innovation, reliability, migrations, and even developer velocity. I'm really interested in how AI helps teams reason faster about complex systems, not just automate tasks, because at this intersection of reliability, migration, and innovation, small mistakes have massive customer impact.

Olalekan Elesin: My name is Olalekan. I am a VP of engineering at HRS Group, leading multiple teams and also recently transitioned into platform engineering, responsible for the production reliability, and also setting the standards when it comes to engineering. At the same time, in my free time, I’m an AWS Machine Learning Hero, which then leads to usage of AI and maybe setting the standards on the use of AI with the teams and also across the teams that I work with in HRS. Maybe for me also, I have a small background in product management. In case you’d see me talking about product management, don’t be angry. It’s just the fact that we need to wear product hats a lot of times as engineers.

Martin Reynolds: I’m Martin. I’m Field CTO at Harness, where I’ve been here for a couple of years now. Prior to that, I’ve had 30 years of building, delivering software, and the last 12 or so of that before I joined Harness was running DevOps, migrating products into cloud, and ultimately building out platform engineering teams. Now I’m much more focused on helping teams modernize their software delivery, doing that through a variety of ways. I don’t want to steal the thunder of the whole thing, but a lot of what we’re going to talk about today is what I spend most of my time doing.

Where Human Attention Gets Wasted – AI and Modernization

Renato Losio: I've actually thought a lot about today's topic, because I'm actually managing some production environments myself. I put my SRE hat on, sometimes. Actually, I've been thinking there's a lot going on, of course, in terms of AI and modernization of the entire space. One thing I've always looked at is: if I think about today, where does human attention get wasted? What's the part where, at the moment, we are spending most of our energy?

Patrick Debois: We had this idea within DevOps that we would automate a lot of things, and by doing so, it becomes a repeatable thing. Like you said, part of that is also reducing the cognitive load of doing certain things and not having to waste our time on that. I still see a lot of, there's a failure in production and let's dig in and do a lot of things. I think AI can help us surface and reach more into the depth of sources, like where it's supposed to happen. A lot of that toil can go away. Another thing that I often see, maybe as another example: in bigger outages, there is obviously the triage. There is summarizing things, but there's also a communication burden to the outside if you have a large impact. All of these things can actually be improved with the assistance of AI.

Mallika Rao: I completely agree with Patrick. I do feel like human attention is mostly wasted on triage without context especially. Engineers spend enormous amounts of time answering questions like, is this signal real? Is it new? Is it customer impacting? That waste isn’t just investigation. It’s that uncertainty around what is happening. I feel like AI is most valuable when it collapses ambiguity early so humans can actually focus on decision-making and not just that data wrangling. It’s less about the fact that alerts are noisy. They could be at that stage, but it’s also that ambiguous uncertainty at that stage is the real tax that teams pay.

The Core Role of AI in DevOps

Renato Losio: I feel that pressure myself with alerts and failures, and it's something that tends to build up over time. Thinking about how I built everything up over time, I went from a few monitoring emails per day to a few hundred, then a few thousand, and then tried to find a way out of it, hoping that AI is going to sort everything out for me. Actually, I wonder: what's the first real problem AI should help with in DevOps today?

Martin Reynolds: I think there’s a couple. It’s going to be a little bit different for every organization. I think echoing some of what we just heard, turning those raw signals that you get into something contextual that’s meaningful and understandable. I think, though, probably the easiest thing, and I say this from experience, is just eliminating some of that low-level toil that often happens around some of the things that we do when we’re delivering software. It doesn’t matter how good your pipelines are, you still get failures, and trawling through logs and things like that is painful.

AI is great at doing those kinds of things and can absolutely dig into what the failure is and probably tell you how to remediate that failure rather than you trawling through all those logs to find the problem. I think also when you have, as mentioned, that summarization of changes that’s happened when you have an outage. Knowing all the things that have changed in that environment, because almost always when there’s a failure in the system, it’s due to a change. Knowing what’s changed and when and how is critical to making resolving issues and outages easier.

Certainly, the approach I took previously that worked really well was just to pick a couple of things that really mattered and actually created a lot of noise, and take those away. Whether that's finding something that people are just spending a lot of time doing, sometimes it's just the glue between systems, those scripts that glue system A to system B. Removing those and allowing AI to do that integration, I think, is where to start. Picking those high-priority, high-noise things and removing them is probably a great place to dig in.

Renato Losio: What's your experience in that sense? What do you think is the first real problem AI should solve, and what has it been for you?

Olalekan Elesin: I think it’s a two-part question, but I would answer what it is for me and what I think AI should solve or the other way around. I think in general, because when I coach the senior, or let’s say staff engineers and also the engineers that work with me on my teams, I usually start with, think of AI as a junior engineer. What would you not want to do today that you would say to a junior engineer, either DevOps or not, to do, so that you can spend time on high value work? Usually this is how I start my thinking. Let’s say there’s a failure in production, whatever that is, or a failure somewhere. What would I ask the junior engineer to do? I remember when I started, my coach back then, or my line manager back then would tell me, this is where to start your investigation. Go through the logs, sieve through XYZ to understand what you need to debug or how you need to debug. Over time, I learned debugging is an art.

Then you need to think about it from AI as well. This is what I ask AI to do is go sieve through the logs in this particular way to understand where to look in the application code. Imagine if a human being has to sieve through at least thousands of lines of logs, depending on how connected the systems are? AI is really good at that. This is what I think when I talk about how I would use AI is that I would assume it’s a junior engineer. I give it the logs in a file, then it figures out what might be wrong, then tells me which part in the application needs to be fixed and what I need to do to fix it. Then say, yes, it’s AI, why not? Let me just ask it to get the repository and ask it to fix it and create a pull request. This is how I would do it. I think “DevOps engineers”, or cloud operations engineers are well positioned in this AI transformation because they know how to go up and down the stack, and AI can be a force multiplier for them. This is how I think about how to use it.
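
A minimal sketch of the kind of "junior engineer" triage loop Olalekan describes: hand the model a log file and ask it for a likely root cause, where to look in the code, and a proposed next step. It assumes an OpenAI-compatible endpoint via the openai Python SDK; the model name, prompt wording, and file paths are illustrative placeholders, not a reference implementation.

```python
# Hypothetical triage helper: give the model production logs and ask it to act
# like a junior engineer on call. Assumes the `openai` Python SDK and an
# OpenAI-compatible endpoint; model name and paths are placeholders.
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # reads credentials for the configured endpoint from the environment


def triage_failure(log_path: str, service_name: str) -> str:
    """Ask the model to find the likely root cause and point at where to fix it."""
    logs = Path(log_path).read_text()[-20_000:]  # keep the tail to stay within the context window

    prompt = (
        f"You are a junior engineer on call for the service '{service_name}'.\n"
        "Read the production logs below, identify the most likely root cause,\n"
        "name the module or code path that probably needs a fix, and propose a\n"
        "concrete next step, including a draft commit message for a fix PR.\n\n"
        f"--- LOGS ---\n{logs}"
    )

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print(triage_failure("incident-logs.txt", "checkout-service"))
```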

AI Capabilities in SRE

Renato Losio: When discussing AI, stakeholders often think about just generative AI. That's all they're thinking about at the moment. How can we clarify what we do in reliability engineering, SRE, and better communicate the capabilities of AI?

Martin Reynolds: I think there’s a lot there. I’ll give you an example. One of the first things we harnessed with AI was more ML/AI, and we have something called continuous verification. What that essentially does is it learns what good looks like for your application when it’s deployed. It learns that, you get these errors in the log, but you get those every time you deploy. That’s not something you need to worry about. This is what the metrics always look like when you deploy. It doesn’t alert you when those things happen.

You don't have to have anybody watching it when you do that deployment. What it does instead is stop you having to have people babysit the deployments, and just tells you when there's something different from what a normal deploy looks like. This is like, you don't have to pay your engineers to work at night after you've done a deployment because you think it's scary. They can just sit there and the system will tell them. I think that's true across the whole thing. You can use it to actually solve real-world problems. I think especially when you're talking to leadership, you want to talk about outcomes rather than specific technical functionality. Did you know we're generating loads more code, but we can't get it out because we're not generating enough tests? Actually, we can use AI to help us fill the gaps in our testing.

The outcome will be that everything we deliver will be more reliable and we’ll actually get more of that newly generated code from our more productive engineers actually out to production safely and reliably. That conversation up then becomes less about I can do these cool things, to, here’s an outcome that happens because we’re leveraging AI, whether that’s generative AI or just straight AI, machine learning processing of data.
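
To make the continuous-verification idea above a bit more concrete, here is a toy sketch of the underlying pattern: learn which error signatures are "normal" for a deploy from past releases, and only flag signatures that have never been seen before. This is an illustration of the general idea only, not how any particular product implements it; the log format and the normalize heuristic are made up.

```python
# Toy continuous-verification pattern: build a baseline of error signatures from
# previous deploys and flag only the errors that are new in the current deploy.
from collections import Counter


def normalize(line: str) -> str:
    """Crude signature: strip digits so timestamps and IDs don't look like new errors."""
    return "".join(ch for ch in line if not ch.isdigit())


def known_signatures(past_deploy_logs: list[list[str]]) -> set[str]:
    """Collect error signatures seen during previous, acceptable deploys."""
    seen: Counter[str] = Counter()
    for deploy in past_deploy_logs:
        for line in deploy:
            if "ERROR" in line:
                seen[normalize(line)] += 1
    return set(seen)


def novel_errors(current_deploy: list[str], baseline: set[str]) -> list[str]:
    """Errors in this deploy whose signature never appeared in the baseline."""
    return [l for l in current_deploy if "ERROR" in l and normalize(l) not in baseline]


baseline = known_signatures([
    ["2026-02-01 ERROR cache warmup miss for key 42"],
    ["2026-02-08 ERROR cache warmup miss for key 99"],
])
print(novel_errors(
    [
        "2026-02-19 ERROR cache warmup miss for key 7",  # known signature, ignored
        "2026-02-19 ERROR connection pool exhausted",    # never seen before, flagged
    ],
    baseline,
))
```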

Patrick Debois: The machine learning AI, which sometimes used to be referred to as AIOps, was the predictive thing. It was the time series. It was the trends, as Martin explained. What we were lacking, if we were seeing a trend, was automation that could explain it. That's where the GenAI comes in. It can explain it, can triage it, can look for context, can look at our runbooks and things like that. A lot of people expected GenAI to also understand the time series. I don't think that is the application of it. If you need more contextual explanation, that's where you put the GenAI in there.

Everything like we discussed, the triage, the filtering, the explanation, is more on those pieces as well. Maybe one last thing that I wanted to add: we think a lot about triaging, and Olalekan has kind of explained that we might ask it to fix things because it now understands our code, and that's also part of the context that we give it, so it's not on the predictive side. What I've seen used more and more is that we also use it for hypothesis building. Like, give me three options of how you would solve this. Give me three options of why it will fail. That's another way you could leverage it for your SRE solutions, in a way that helps you through the whole process of finding the solution, creating options and going from there. That's really powerful, and that didn't exist before with the more machine learning aspects of it.

Martin Reynolds: I love that, because that's a great description of the evolution. Going back to that original example, the one thing was finding the signal from the noise. That's what machine learning AI was really good at. Now you can find the signal from the noise and actually get something meaningful out of it that says, I've looked at the signal from the noise. Actually, I can see from all the context that I have that this is where you should probably look, and here's why I think you should probably go here to fix this thing. I think that's the evolution.

The first one pointed you to say there is a problem, and now you can say there's a problem and here are some ideas of, given everything I know, how you might solve that problem. What you're doing there is you are saving so much time. That is the difference between a one-day outage and a 30-minute outage or a 5-minute outage, because you're being helped along the way.

The other stuff that Patrick was talking about there, if you’ve got your AI and it’s listening to that Slack conversation that spun up for the war room because you have an outage and it’s summarizing everything. If you come in after the call’s been going for 10 minutes, you don’t miss anything because it’s already summarizing it and it’s making hypotheses along the way based on the conversation that’s happening and all the data that it has and all the context that it has. I think it is an exciting revolution. As somebody who in the past has been in those rooms sat there for hours trying to solve a problem, working out whether rollback is the better solution and whether rollback was even going to work, I find it genuinely exciting the evolution that’s happening. There’s been so much noise about the code generation, but there’s so much more in the software delivery lifecycle that can be addressed.

Where to Start, with AI

Renato Losio: Actually, that raised two questions on my side. The first one is, I love the idea, but if I think from a team, from a personal perspective, maybe you have just started or you have just that predictive part based on more traditional ML or whatever, where should I start? How can I do it incrementally before going full automation? Because it looks like an amazing thing, but it's like, ok, I have my environment now with some alerts, something running there, my team that is on-call and whatever else. What should be the first step? Where does AI actually save time and reduce the amount of pain before really talking about that full automation that is giving me all the advice and everything else? Where should you start?

Mallika Rao: For teams starting out, I really do feel summarization and correlation and not just automation like Renato pointed out, is a great starting point. Because in my head, if we are owners of reliability and so much about operations is an organizational problem to solve, and not just a technical problem. I do feel like if the goal is to have better impact assessment, before just the prediction and the remediation, we need to have a very good view into who is affected, who are our customers, how badly and how bad is it when it’s compared to the baseline.

If that’s our goal, I do feel like if AI or LLM workflows or agentic workflows, if they can write a first draft of incident timelines. Like what happened, correlate the logs to the deploys, or maybe even just summarize what changed in the last 24 hours, that would build a lot of trust with stakeholder teams and with partner teams because a lot of this is also not just a handoff. It’s a continued partnership around, how do we solve for incidents? What happens after? How do we recover from it? Then, how do we prevent these kinds of things? I feel like teams build faster trust when AI explains what it’s seen in the past before it’s allowed to actually act. Building a path for that and having that journey is a better organizational framework to solve for.

Olalekan Elesin: I agree with you. I would just add to what you’re saying. For transparency reasons, get your observability basics correct. If you’re in a large enterprise and you have multiple downstream systems talking to one another, get your tracing correct. Be able to trace across downstream systems because then it’s a lot easier to then tell the AI that system A, which connects to system B, follow through with this trace ID to identify where the errors are coming from, from the upstream to the downstream. Then to the point, it’s the summarization of what might happen when you want to triage based on the trace. This then is the next step, when you’re working with large distributed systems. For me, one example is, how would I do it? This is the question Renato, you asked. I would go through what I would do as a junior engineer. I receive an alert. I go into the log. I check the log, try to understand it, and then try to figure out in the codebase what I need to do. I model the AI, not create a model, but model the workflow with the AI to think through like this. This is how I would think about it.
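
As a toy illustration of the "follow the trace ID across systems" step Olalekan describes, here is a small, self-contained sketch that stitches log lines from several services into one ordered timeline for a single trace. The log format is hypothetical; in practice this would come from your tracing backend rather than hand-built records.

```python
# Toy example: correlate log lines from multiple services by trace ID so the whole
# request path can be handed to an AI (or a human) as one ordered timeline.
from dataclasses import dataclass


@dataclass
class LogLine:
    timestamp: str   # ISO 8601, so lexicographic sort equals chronological sort
    service: str
    trace_id: str
    message: str


def timeline_for_trace(lines: list[LogLine], trace_id: str) -> str:
    """Return the ordered, cross-service timeline for one trace ID."""
    matching = sorted(
        (line for line in lines if line.trace_id == trace_id),
        key=lambda line: line.timestamp,
    )
    return "\n".join(f"{l.timestamp} [{l.service}] {l.message}" for l in matching)


logs = [
    LogLine("2026-02-19T03:02:11Z", "api-gateway", "abc123", "POST /checkout received"),
    LogLine("2026-02-19T03:02:12Z", "payments", "abc123", "upstream timeout after 5000ms"),
    LogLine("2026-02-19T03:02:12Z", "payments", "zzz999", "unrelated request ok"),
    LogLine("2026-02-19T03:02:13Z", "api-gateway", "abc123", "returned 502 to client"),
]

print(timeline_for_trace(logs, "abc123"))
```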

Then if this was taking me 15 hours, hypothetically, now I can get it done in 5 minutes because AI can do it. Then I think, yes, let’s figure out how to build this as a workflow or an agentic AI workflow. On a smaller note, I was working on one of my private open-source projects and something was failing in production. What I would do is go into CloudWatch, copy the log. I got an alert through CloudWatch alerts, we went into the logs directly, just copied where it was failing, went into the codebase and pasted it in there, and said, to Patrick’s point, hypothesis or what are the other ways to fix this? Put it in there and then immediately it fixed it, created the pull request and asked the co-collaborator to resolve. This was in less than 15 minutes. This would have taken at least 5 hours if not more. It’s looking at how I would do it today, and I would hand that process over to AI but still stay in the steering wheel to check if it is doing the right thing.

Trust and Maturity of AI

Renato Losio: Actually, if I understand well what you are saying, it's basically about building trust with the AI recommendations, not really letting them take action. It's the junior engineer giving me the advice, and then I react in a much faster way somehow. When should you give them the power to take action, and what can really give you confidence to let the AI run alone?

Patrick Debois: It’s a little bit of a trust maturity thing. The first thing is, I can read it. You give read access, they give you feedback and you don’t do anything. What is common in automation is that you then find like a workflow that you can do a reliable thing, like react to something or a known state that you’re actually acting upon. The GenAI thing might not be deterministic but the action it takes is actually deterministic. You know, given certain criteria, it cannot do something wrong. That builds up that confidence to do it. We were talking about building the observability and the metrics. Then making that workflow predictable is one thing.

I think we can also do a very good job at providing the agent to write context, what it needs to do something. Developers do this in their coding agents. They write their CLAUDE.md, they give it context and more context and more context how to do things. Within Ops, we have already a lot of those things in our runbooks, how to do certain things, or SLA documents, and we need to put in the work. We can’t just say blindly to the agent, here’s our logs, here’s our process, figure it out. I know we’re talking about the junior, but we have to onboard the junior agent with all the context that it needs to make that decision. That’s also part of our job, I think, to make this trustworthy. One that we actually provided the context that it needs to do instead of it going crazy and trying things out, going in all directions and then losing our trust, basically.

Martin Reynolds: You’re right in the same space as me. It’s funny because I think there’s two things. It’s almost like we’re forgetting that if you go back to traditional DevOps automation, there was this process of saying, I’m going to do this automation, and like you say, it’s deterministic, but I’m still going to have somebody in that loop who’s going to say it’s ok. Then after you’ve said ok a thousand times, that is just a waste of time and then it just becomes fully automated and you take the human out of the loop for that. I do not think that it is any different with AI. You build trust in a process to the point that you don’t need the human in the loop anymore for that particular workflow or process because it’s done it a hundred times or a thousand times and you know what it looks like and what the outcome is going to look like and you can say it’s ok.

The thing I really wanted to pick up on is the context thing. You get much better results out of AI agents when they have a full understanding of your estate. Whether that is when your artifacts were built, how they were tested, what your service dependencies look like, what your documentation looks like, when things were deployed, how they were deployed. What are the open-source dependencies that they have? What is your appetite for risk in the organization? What are the standards that you're trying to adhere to? Actually, the interesting thing is that to get the most out of AI, you still need the good fundamentals in place to give to it. Think about onboarding a new engineer before AI existed: if there wasn't documentation for all those things and some of that knowledge was tribal, you still had to get all of that stuff and those good, solid foundations in place, of, this is how we work, this is what our estate looks like, with at least reasonably up-to-date documentation. That is how you onboarded junior developers.

Then you gave them tasks inside of that, as mentioned previously. You have to have that foundation to give to the AI, so that it's reading all those metrics, and looking at the services, and maybe looking at the outage, and the logs, and what's been deployed where, and how those things depend on each other. Having all of that context available to it means that you get much smarter responses, and it ultimately accelerates that delivery that you're going to get and the time to value, but you have to have some of those fundamentals in place first. Having GenAI and AI agents doesn't take away the need to put the fundamentals in place.

Mallika Rao: You just touched upon something so interesting Martin, and I do feel like that trust is earned through explanation, and not just accuracy alone. I was smiling before Renato, because we learned this the hard way in one of the systems that we built, where it was a very large distributed system, and we used AI driven analysis to support canary rollouts, and the model would flag elevated risk based on subtle metric shifts across regions. We were deploying it across regions. Early on we noticed something interesting, the AI consistently missed failures that only showed up in our shadow canaries. It was very interesting from a deployment traffic shape perspective, where based on the traffic shape, and I think even the percentages, the downstream dependencies didn’t fully mirror production. The issue wasn’t the model in this case.

The important point is that it was an assumption mismatch. It was reasoning correctly over incomplete reality, and to Martin's point, that goes back to the context. How rich is its context? The trust breakthrough did not come from better models, it came from changing the operating model itself. We made AI recommendations visible before we took any actions. We required it to cite signals, and restricted the automation that we were doing behind the scenes. Only when engineers were able to explain what the AI was doing did trust start building. That was a lesson learned from the trenches.

Martin Reynolds: I do love that, and I do agree that having the explanation of why it’s doing it is super critical. Even if you’re looking at something like, let’s take, there’s a security vulnerability that’s been identified, and it says, I know how to fix this in this code. It’s great that it can do that, but you want it to create the PR with the explanation of what it’s doing to fix that security vulnerability in the code. You don’t want just to say, I’m fixing CVE, blah, blah, blah. You want to say, I’m fixing CVE whatever, and this is what I’ve done to fix it, and this is why. Then when somebody comes to review that PR, they can say, ok, that makes sense, the code looks correct, great, approve, and then it goes into the flow.

Patrick Debois: I actually raise you on that one, Martin. You’re doing it in the CI/CD loop, but what if we take all the context that we’ve written in our runbooks and all our knowledge that we have operationally, and we actually plug that into the coding agent of the developer itself, then we’re actually almost shifting left with all the operational knowledge, and we keep that up to date and prevent it even earlier. That context and sharing that context across the org. That’s the power that I see that we didn’t have before maybe with DevOps, and it was also hard to get the incentive for people to do context beyond their own piece, but now it’s useful for yourself, and the side benefit, it’s useful for everybody. Like, how many docs went out of date, how many runbooks were not working, and now it’s like, because we’re working on this all together, it stays in a much better shape.

Martin Reynolds: I think you're a hundred percent right. One of the examples I often give is that, if I'm an engineer and I'm writing some code and I need to know something, you might have some internal developer portal, or wiki, or whatever it is with all that information. What I really want to do is just be able to type into my agent that's sat in my IDE with me and say, can you just get me the documentation for the API for our authorization service and tell me how I get a new token or how I'd refresh my token, for example? I'm picking a random innocuous thing, but being able to do that in the environment and get the response, and not having to context switch out to go to wherever that documentation is, find it, look it up, and then context switch back, is a real accelerator. It might be a one-minute accelerator, but one minute, 100 times, across 10,000 developers in an enterprise is a massive time-saving.

Olalekan Elesin: I think the context topic is good, but I also assume that we’re overlooking the fact that these models have become really powerful. I think in the beginning of last year, we usually had maybe 5,000-word context or maybe 20,000-word context. Now the models are so powerful that some of them have 1 million tokens that can maintain context. A lot of topics shifted from RAG into context engineering.

Usually, how I think about it is that if I'm going to triage a repository that I don't know anything about, how would I start with it? How do I get an understanding? In a practical sense, the first prompt is: get a detailed understanding of this repository, as much as you can. This is the prompt I type in my IDE to the LLM. It goes as far as possible to get context of what the repository is. That first context is the code. If you are skilled enough as a DevOps engineer and SRE, you can also build your own MCPs, Model Context Protocol servers, which can connect to where documentation lies.

Then, in the local context from the project itself, it can get context from Confluence or whichever documentation that you have in there. That also helps the AI to build context. Then it has better understanding of what it needs to do to resolve. If you’re also skilled enough, depending on the observability tool you use, you can also have an MCP that connects your observability tool. That also gets information about the time series of what happened or the series of events that led to the issue and the outage. My ultimate view is that in the AI transformation that we’re in right now, I believe strongly that DevOps engineers are in the best position because they know how to make things work in the cloud.
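
As a rough sketch of the MCP approach Olalekan describes, here is a tiny server that exposes a documentation-lookup tool an IDE agent could call. It assumes the official MCP Python SDK and its FastMCP helper; the documentation "backend" is a stubbed dictionary standing in for Confluence, and the tool and server names are made up for illustration.

```python
# Illustrative-only MCP server exposing a documentation lookup tool to a coding agent.
# Assumes the MCP Python SDK (`pip install mcp`); the docs source is a stub, not Confluence.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("internal-docs")

# Stand-in for a real documentation system such as Confluence.
FAKE_DOCS = {
    "auth-service": "Tokens are issued by POST /v1/token and expire after 15 minutes.",
    "payments": "Retries are idempotent; always send the X-Idempotency-Key header.",
}


@mcp.tool()
def lookup_runbook(service: str) -> str:
    """Return the operational notes / runbook excerpt for a given service."""
    return FAKE_DOCS.get(service, f"No documentation found for '{service}'.")


if __name__ == "__main__":
    # Serves over stdio, so an IDE agent configured with this server can call the tool.
    mcp.run()
```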

Decisions That Should Never Be Automated

Renato Losio: So far you have basically killed one of my dreams. I always thought that whatever problem I had, one day GenAI and whatever in DevOps would have solved it. It's very clear to me now that if I have a poor legacy bunch of bash scripts to manage my environment, there's not going to be AI solving that for me. I still need to document it and action it properly to get it working. Going back to our problem, I was thinking, so far I've bought the idea that, yes, AI can help me at least, or it can help me sort out that outage at 3 a.m. The question I really have now is, which decisions should never be automated? What should you never let AI do to manage that outage?

Mallika Rao: For these 3 a.m. incidents, breaks, I do think automation is great at speed, but humans are still responsible for creating that meaning. That’s around how I think about this problem. At 3 a.m., machines should gather and be able to correlate, reason about that overall context, and humans should be able to decide under uncertainty. Those two things are very separate when an incident strikes. In a high RPS system, we actually saw incidents where latency slowly crept up around our systems, blast radius was very big, and abstractions were questioned across multiple services.

The AI agent could quickly correlate some config shape or a partial rollout or a subtle downstream queue buildup in minutes, not hours. This is what Ola was highlighting as well. That was hugely valuable, but what the machine couldn’t decide was the intent or what was the goal, what was the mission of the service? Do we roll back and risk losing a critical fix? Do we shed the load of a particular region and protect it? Or do we accept some short-term fixes so that we can have more time to recover from the incident? These are the value-laden decisions I feel that need to have some human in the loop to handle that uncertainty and still do right by our SLOs, SLAs, so that trust in the system is still retained. How do we encode customer promises, business priorities, that risk tolerance? I think it’s somewhere a blend of how much do we use AI to automate and have that human in the loop for decision-making on the high leverage points. That balance is where I’m hedging on.

Patrick Debois: To me, it's a risk game. Forget AI for a moment. We were just doing plain DevOps automation with a pipeline and reacting to some of the monitoring that came around, even not predictable monitoring, but known states. Would you have the system reboot automatically, or would you restore the database automatically? It all boils down to: how well-defined is your procedure? How predictable is it? How many of the states are known states that you can handle? If you know exactly what the known state is, that helps you, and you're able to automate that, with or without AI. Imagine you don't trust the AI to make the right call, then you have to think about it making the wrong call. What is the impact of that? Do we have a mitigation on rolling back or not when actually stronger things happen? We cannot prevent everything. We cannot react to everything. We'll have to find a middle ground of saying, this is a known state, we're reacting like that.

Then, if it goes wrong, we just need to make sure we have a safe system. I do acknowledge that in many environments, that is a hard thing to do. You might have to re-architect to allow that failure to happen, to take that risk. That's why a lot of people will still watch, babysitting the alerts, but it is about you putting in the work to say, this is maybe like a separate service. This is maybe the impact we can roll back. We have a mitigation for that, or anything to mitigate a risk, that says, I'm ok with taking the risk because I have actually mitigated the impact of that. It could be data. It could be customer failure. It could be various things, but that's how I think about it. A lot of people talk about the non-determinism, and that's great, but you can have non-determinism because that's production, basically; end users are non-deterministic in what they do. We actually design for the same thing to happen if you're doing your architecture right.

Martin Reynolds: If I was summarizing from both of you guys, and I think my view is aligned with it, the machine can genuinely help: it can correlate, it can summarize, it can suggest known runbooks, and potentially, I think especially if you've built resiliency testing into your release process, if it knows how to act, maybe a scale-up, maybe a rollback, maybe flipping a flag, in known safe actions, then I think it can do those things.

Then equally on the other side, I feel like the human is there to make those decisions that have an impact on the customer or the business tradeoff, the SLAs, the SLOs, the risks associated with that. I think they also own anything that might be an irreversible change, maybe a data migration or a destructive operation; the human needs to be the person that's ultimately saying, we need to escalate this up and let people know that this is happening. They're that decision point for that. Again, I think, ultimately, the human is there to interpret the nuanced business context that isn't necessarily codified into any runbook, or something that the AI can just read and act on. I think there's that split between, here's a bunch of things the machine can do and can do really well, and here's where the human needs to be there, because the machine doesn't necessarily understand, can't make those decisions, and can't be accountable for some of those decisions.

Crafting an AI Strategy for DevOps Workflows (Stakeholder Buy-In)

Renato Losio: What would be the best approach to creating an AI strategy so that stakeholders who are sceptical about the adoption of AI for DevOps workflows can buy into the promise? What would be the key pillars to have in place before adoption can begin?

Olalekan Elesin: The answer to the question is inside it: how do you craft an AI strategy for DevOps workflows? I would really take that and put it in our internal AI tool and ask it that question. This is my starting point. Give it the context of my organization to see, what do I need to do to convince my stakeholders? I do this a lot in crafting anything, any documentation, and say, what do we need to do? This is the way we started.

The next point, in editing that information or aligning it to the organizational context, is that you know the stakeholders really well; you've been in conversations with them multiple times. What do they really care about? For us, where I work today, it's working backwards from the customer perspective. What value are we trying to create for the customer? Once you have that nailed, provide that context to the AI as well, to refine your prompt and to refine the output. Once you get to what it looks like, talk to the first and most important stakeholder. This is how I drive organizational change, because what you're talking about is not an AI strategy in itself, it's, how do you drive organizational change around leveraging AI for DevOps workflows? Once you get the buy-in, talk to them individually and then talk to them collectively. Maybe in the discussion, you have a simple proof of concept, which you can also build with AI. You can say, today, one of our most recurring incidents is a null pointer exception in a system that nobody knows about that was built 7,000 years ago. That's a hypothetical example.

Then you show how you use AI for, let's say, summarization after an alert comes in. Show how you use AI to generate whatever it is, and then map out the workflow steps. Then show in the discussion the real value through a proof of concept. Make it clear to the stakeholders that AI will take our MTTR, mean time to restore, from X days or X hours to Y minutes. This is proven in this proof of concept, hypothetically. This is a more convincing discussion than, yes, we have an AI strategy, and then nobody can really bring it to reality, so nobody can connect to it.

Mallika Rao: I just want to add to what Ola was mentioning, especially around those skeptical stakeholders. I've found it very useful to anchor it in outcomes rather than capabilities at that point in time, because having that clarity changes the conversation from why to how. Instead of starting with, we're building an AI strategy for DevOps, let's say, maybe start with something grounded and concrete, like, the main customer problem is reducing mean time to understand incidents, or we have bad customer minutes, or unsafe rollbacks. Having a strategy around it, or a way to think about it, is a good starting point, because that grounds the conversation and drives that alignment. I typically have seen skeptics buy in when AI is tied to a metric they already care about. Starting with that makes some progress.

Patrick Debois: I like the way that you framed it: if they feel a pain, then you can come with a solution. Basically, that's what you're saying. If they don't feel a pain, and you're just saying, here's a hammer, they wouldn't care. You can't sell that. It was the same thing in the early days of DevOps. Why would I do DevOps? You don't have this problem. I don't care. It's hard to sell if that is not a problem. Today, maybe you don't feel the pain in your company, and you might not care about downtime, and you're still selling to your customers. In two years, your competitors will, and they might give a better service, and they might do something. Some domains are maybe more immune to this. It's kind of like, tap into a pain, because then they'll listen to you. That's my learning over all the years of DevOps. Don't come with the solution, but actually solve the problem that they have.

Olalekan Elesin: Just to add on the strategy. If you're thinking about strategy at all, read a blog post on Good Strategy Bad Strategy by Richard Rumelt. It's a good one. It talks about the diagnosis and the guiding policies: really understanding the pain that exists, not inventing a pain that you think is your pain, but finding the pain that is relevant to the customer, relevant to the business. What are the guiding policies that you want to put in place? What are the coherent actions? Back it up with a proof of concept. To Mallika's point, find the most important stakeholder. Who feels this pain the most in the business? Then start from that person.

GenAI in SRE, and Incident Management

Renato Losio: I read an interesting article from the team at Google that explained how they use generative AI internally for their SRE work. Of course, what they use is pretty obvious because they're Google. They basically describe the entire process and define four different areas. One was the one we covered earlier, the 3 a.m. paging, then the mitigation phase, then the analysis and finding the root cause of the problem, and then all the post-mortem with the incident report. They cover how AI can help in all four. Do you all agree that you need AI in all of those? Do you actually only need it in certain ones? Do you use it everywhere? What's your feeling about that?

Martin Reynolds: Across all the four, I would say, yes, basically. You can use it. Paging, yes. Agents can help deduplicate, they can route intelligently, they can include rich context with that alert. I think they can help in mitigation. They can give well-defined, reversible, policy-governed actions that are safe to autopilot for known scenarios. They can absolutely help you come up with your root cause analysis. They can look at all of that data, look at the timeline synthesis and create those hypotheses of what is the likely root cause. Absolutely, they can help in the post-mortem phase too. They can draft the initial incident report. They can pull the relevant metrics and context, and propose follow-up action items for people to actually do. My red line is that agents can assist in all of those phases, but accountability and the final decisions still need to stay with humans. You can't make the agent accountable; you're accountable.

Mallika Rao: I like that as well. I also agree with agents participating in all four phases, but maybe not owning all four of those phases, because participation and owning are two different things. For example, paging and mitigation benefit enormously from speed, velocity, how quickly can we come to a point of alignment. Root cause and post-mortems require more narrative, accountability, to Martin’s point, and just that learning that comes from deep rich context. AI should accelerate that learning, but I don’t think it should absolve teams of that understanding, because that’s the opportunity for the teams also to build that maturity over time.

Martin Reynolds: Absolutely, 100%.

Olalekan Elesin: Most people are not working at Google. How do you start? I think one practical experience I learned is when we have post-mortems or when I run post-mortems, I always try to record them, and then I take the transcript and give it to our internal AI tool to convert that into the post-mortem documentation. It can do a whole lot of good, because then it saves me a lot of time in trying to synthesize the time series of how events happened. I think this is a great way to start. Secondly is also, sometimes it’s not only on the call, it’s also the chat correspondence as well. You can also take that and put it as well into the AI or the LLM that you’re using within your organization, so that it’s not leaking data out, to generate the sequence of time series of events as it happened and how it was resolved. For me, I found this very useful, saving me hours of time. I translated this also to my colleagues to save time in post-mortem documentation.
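
A minimal sketch of that post-mortem drafting step: feed the recorded call transcript and the incident channel chat export to a model and ask for a first-draft document. It makes the same assumption as the earlier triage sketch, an OpenAI-compatible endpoint via the openai Python SDK; the file names, model name, and prompt wording are illustrative only.

```python
# Sketch: turn a post-mortem call transcript plus the incident chat export into a
# first-draft post-mortem document. Assumes the `openai` SDK pointed at an internally
# approved endpoint so incident data stays inside the organization; paths are placeholders.
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # reads credentials and base URL from the environment


def draft_postmortem(transcript_path: str, chat_export_path: str) -> str:
    transcript = Path(transcript_path).read_text()
    chat_log = Path(chat_export_path).read_text()

    prompt = (
        "Using the call transcript and the incident channel chat below, draft a "
        "post-mortem with: a timeline of events, how it was detected, the mitigation "
        "steps, the suspected root cause, and proposed follow-up actions. Mark anything "
        "uncertain as 'to be confirmed' rather than guessing.\n\n"
        f"--- CALL TRANSCRIPT ---\n{transcript}\n\n"
        f"--- INCIDENT CHAT ---\n{chat_log}"
    )

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print(draft_postmortem("postmortem-call.txt", "incident-channel.txt"))
```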

Short-Term Action Items

Renato Losio: I’d like each one of you to give some advice. I’m a DevOps engineer. I joined the roundtable. I have some experience with AI, but I haven’t actually done anything in terms of using agents in my DevOps deployment, production, whatever. I want to start tomorrow. I want to spend a few hours doing something. What’s the first step I can take? What’s some advice? It can be a book. It can be an article. It can be an agent. It can be a test. It can be, do a transcript of the last post-mortem. Something that can be done in the short term, not like move everything tomorrow, but something concrete that could be done in the short term.

Martin Reynolds: If I'm sat there and I'm a DevOps engineer or a platform engineer, I would find the thing that causes me the most frustration and see if I can solve for that. I would find my personal frustration and solve for that, and then say, ok, it can help me here and this is how; now how do I codify that and make it work at scale, reliably and with trust? In terms of something to read or do, on that side of the question, and as a blatant plug, we have a DevOps Modernization virtual summit. We have Anthropic there, United Airlines, Google, AWS. We have Matthew Skelton, co-author of "Team Topologies", at that. I would recommend you go see our DevOps Modernization summit.

Mallika Rao: I’ll keep this one really simple and hopefully actionable. Just pick one painful operational workflow that happened recently and try to make it explainable. For example, it can be something like what happened in the last incident. Why did we see this SLA breach? What happened in our system? Start from that question and then use AI to auto-generate timeline summaries. How did the deploys happen? Tag signals as you are going through that workflow, and don’t automate actions yet. If we can start with automating that understanding and then automate the actions, maybe that’s a good place to start. Very simple and hopefully actionable to start tomorrow.

Olalekan Elesin: Download the logs from your CI/CD pipeline. Download the logs, put it in a markdown file. Download the logs from your production environment where you observed the failure, put it in the markdown file. Fire up your IDE that has the LLM in there and ask it to go through both files and tell you why the incident happened, what happened, and then get insights from it. That’s all.

Patrick Debois: I'm going to give some odd advice. Use one of the chat solutions like Claude and so on, maybe not in coding, not in your IDE, and start by learning how to ask it questions and correct it. They're really powerful in the way that they learn from what you want, how you want it. The more time you spend, the better results you get. You take those learnings and actually put them in the context while you're doing your job. That's been proven quite powerful for me, at least.

 
