
Tiger Teams, Evals and Agents: The New AI Engineering Playbook

News Room
Last updated: 2026/04/10 at 1:37 PM
News Room Published 10 April 2026

Transcript

Shane Hastie: Good day, folks. This is Shane Hastie for the InfoQ Engineering Culture Podcast. Today I’m sitting down with Sam Bhagwat. Sam, welcome. Thanks for taking the time to talk to us.

Sam Bhagwat: Thanks for having me, Shane.

Shane Hastie: My normal starting point on these conversations is, who’s Sam?

Introduction and background [00:37]

Sam Bhagwat: Well, I guess a bit about myself. Early on in my career, I worked as an engineer for a few different Silicon Valley-type startups. Then I was the co-founder of a framework and a company called Gatsby, a React web framework that became quite popular in the late 2010s. I’ve been doing open source JavaScript for 10 years now, and I’m currently the co-founder and CEO of a framework called Mastra, which is an open source JavaScript/TypeScript framework for building AI agents.

Why open source? [01:07]

Shane Hastie: Let’s dig into that open source background a little bit. Why open source?

Sam Bhagwat: In my case, it was sort of happenstance. I was working with my best friend, this is 10 years ago, and we saw React emerging. We were very confident this was the right paradigm for web development. It was very controversial, going back to 2015, but we really felt like it was the future. So we were working on this framework around that, and it started to take off. We kind of turned to each other and said, “Okay, how do we do this full-time?” We figured that story out, and then we spent the next several years building out the framework and the company.

Shane Hastie: What’s different? What’s special about an open source environment?

Sam Bhagwat: When you work in open source, you collaborate with people all over the world. Our top contributor at Gatsby was a brilliant engineer in Poland. We had people in completely different life circumstances. You don’t know if the person that shows up in your GitHub issues is an engineer in a similar life stage as you. Maybe it’s a director of engineering at some large company, or maybe it’s a college student in India. It could be any of those cases.

The other thing I would say is watching people tinker around with the things that you were building and following your lead. “Oh, you used this to do a thing I didn’t imagine. You built another layer on top of my thing, and this tool that’s interoperable”. This kind of permissionless ecosystem just allows so many different things to emerge and integrate with each other. It’s amazing. I mean, it’s been a lot of fun.

Keeping open source communities healthy [02:40]

Shane Hastie: This is the Engineering Culture Podcast. How do you keep that open source community culture, I want to say, generative and productive? Because we certainly do see instances where it breaks down quite nastily.

Sam Bhagwat: I think open source communities evolve over time. In the beginning, it’s a lot of people tinkering around with things. If your project gains traction, people start bringing it into their work environments. The first people you get that don’t like your thing are usually the people who inherited a project that someone else built with your thing, and they’re like, “I don’t like how this does it”. I think you have to have a light touch with things. Sometimes people aren’t really asking questions. You started out with, “Oh, well, if you’re here, it’s because you’re excited. And if you don’t like it, that’s okay. We can’t make everybody happy”.

And then some people have a choice in the technology they use, and some people don’t. So you start off doing one thing in an opinionated way, and you have to be more flexible over time as to what you want to let people do with your thing, and also just receptive to their feedback in terms of your product development. I think most people who start building an open source product just scratch their own itch. We had to learn that we could not continue down that path of development forever; we had to adapt. We had to just be more open to what people wanted to do with the thing that we started.

Shane Hastie: That sounds like letting go.

Balancing open source spirit with commercial reality [04:08]

Sam Bhagwat: It’s letting go, and it requires the emotional maturity to decouple your identity from a particular way of doing things that you are actually genuinely very curious about and invested in. I have a slightly unique point of view as the founder of a commercial open source company. And now this is the second time where we’ve started an open source project and there’s a business to be built as well.

There are people in open source that are open source purists and have a very difficult time working in a company that has any sort of commercial mission. There are also commercial type people that are just very, I win, you lose kind of people. And these people have a hard time working in open source type companies because there’s a certain magnanimity where, no, we don’t want to charge for this. We never want to charge for this. This part of the product should always be free, should always be open source. Everyone should use it.

And those type of people, if you bring them into your company, they will do their best to squash that sort of spirit. So there’s a bit of you have to find the open source people, but people that aren’t too anti-commercial. And you have to find the commercial people who are savvy, but they’re like, you win, I win kind of people rather than too much in the other direction.

Shane Hastie: So finding that middle ground. The reason we got together was not just about the open source stuff, of course; you’ve got some thoughts on AI engineering and AI engineering teams. How is that different, if it is different, from traditional engineering on the one hand, and in that open source space on the other?

The emerging field of AI engineering [05:39]

Sam Bhagwat: So I wrote a couple of books to help people get into the AI engineering field and get started. There are a lot of people, full-stack developers, data engineers, data science type folks, trying to pivot their careers and get into AI engineering. My perspective now, as someone in my mid-30s who has seen different technical waves, is that it’s somewhat similar to DevOps or data engineering in the past: these are new domains that emerged inside larger organizations, maybe the Googles of the world, and then got diffused to the rest of the industry.

And then there’s a moment and a period of time where if folks want to transition into them, it’s kind of easier because there’s this very unmet need that companies are wanting to build these types of applications or to do this kind of engineering, but there’s not that many people that have three years of experience. And so if you’re able to get on the right project or do the right kinds of things, you can actually end up developing expertise and moving into a new domain that might be interesting or professionally advantageous for you.

Shane Hastie: So what’s different?

Sam Bhagwat: Everything is happening faster this time. The metric that I look at the most is how fast growth is happening in AI projects versus previous kinds of projects. What used to be three or four months of project growth is now happening in one month. And I think that’s maybe a similar track for how quickly these technologies are being adopted, diffusing from companies like Google to the rest of the industry.

AI-augmented open source development [07:00]

Shane Hastie: And what about merging the two? The application of generative AI in the coding of open source, is that happening and how’s it going on there?

Sam Bhagwat: I mean, we are sort of obsessed with this. The lifecycle of an open source maintainer is maybe you get some bug reports in Discord, and you’re trying to get more information to triage that, and then trying to distill that down into a GitHub issue. Then maybe you make some PR to fix that, and then you need to review that fix and get it merged in. And then maybe a week later you’re aggregating all the changes and putting them into a changelog. We’re heavy Claude Code users, and internally we’re using Composer to run multiple agents at the same time. My co-founder was thinking about getting a new computer so we can run more parallel coding agents. But we have also built agents for every step of that lifecycle.

We built an agent that takes a bug report in a Discord thread and summarizes it in a GitHub issue. We’ve built agents to try to write repro issues, because many times you don’t get very detailed reproduction instructions, so they try to create reproductions given less than ideal information. We’ve built agents to generate changelogs. And we have multiple third-party agents that are commenting on PRs and judging their quality. I mean, it’s a lot of fun. You just feel like you put on this superpower suit and you can get more done. There are infinite numbers of issues and infinite amounts of things you can do, and you can just do more, faster, now.

What an AI-augmented engineering team looks like [08:27]

Shane Hastie: So what does an AI augmented engineering team look like?

Sam Bhagwat: We have this channel called Kindergarten in our Slack, and my co-founder, Abi, named it Kindergarten because he’s like, “Look, we’re really beginners at this stuff”. We just drop links about how to do things. We’re a remote team, but we pair a lot, because it helps us diffuse our individual understanding across the broader team: how to use these tools better, how to notice when your agent is going off the rails. It matters whether you notice Composer going off the rails in one second versus 10 or 30 seconds, just in terms of your ability to be in the zone and stay in the zone. It’s a fun time to be an engineer.

The basics of AI engineering: agents, workflows and evals [09:09]

Shane Hastie: Kindergarten, we all need to learn new things. We are going right back to the basics. What are those basics?

Sam Bhagwat: I think agents and workflows are the two fundamentals of AI engineering. An agent is an LLM running in a loop that can call tools and has memory. A workflow is a structured graph where an LLM can be a decider node, and the nodes can also call tools and have memory. Memory we can think of as a structured compression of a queue of messages. There are different ways of compressing that: working memory, semantic memory, observational memory. But fundamentally, that’s what it is.
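Stripped to its essentials, that "LLM in a loop" definition can be sketched in a few lines of TypeScript. This is a toy illustration with a stubbed model, not Mastra's actual API; all the names here (`runAgent`, `fakeModel`, the `TOOL:` convention) are invented for the example.

```typescript
// A minimal sketch of "an LLM in a loop that can call tools and has memory".
// The model is a stub; none of these names are Mastra's real API.

type Message = { role: "user" | "assistant" | "tool"; content: string };
type Tool = (input: string) => string;

// Stub standing in for a real model call: it either requests a tool
// call (encoded here as "TOOL:name:arg") or produces a final answer.
function fakeModel(memory: Message[], tools: Record<string, Tool>): string {
  const last = memory[memory.length - 1];
  if (last.role === "user" && "add" in tools) return "TOOL:add:2+3";
  return `Final answer after ${memory.length} messages`;
}

function runAgent(userInput: string, tools: Record<string, Tool>): string {
  // Memory: the queue of messages the model sees on every turn.
  const memory: Message[] = [{ role: "user", content: userInput }];
  for (let step = 0; step < 5; step++) {
    const out = fakeModel(memory, tools);
    if (out.startsWith("TOOL:")) {
      const [, name, arg] = out.split(":");
      // Tool result re-enters memory for the next turn of the loop.
      memory.push({ role: "tool", content: tools[name](arg) });
    } else {
      return out; // model produced a final answer; exit the loop
    }
  }
  return "step limit reached";
}

const tools: Record<string, Tool> = {
  add: (expr) => String(expr.split("+").map(Number).reduce((a, b) => a + b, 0)),
};
```

In a real agent, `fakeModel` would be a call to a model API and the tools would be real functions; the loop shape, and memory as a message queue that tool results flow back into, is the part that carries over.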

The third fundamental is evals. I like the fact that it’s a distinct word, even though we could just call them statistical tests, because that’s what they are. And there are different kinds. You could write more unit-type tests, or more integration or end-to-end-type tests; you can come in at different layers of the stack. But tracing and evals are, let’s say, 10X as important in AI engineering as in normal engineering, because of the non-determinism of agentic applications. You can’t expect a single fixed output anymore. You could have multiple successes that have different response bodies, and that’s not the case when you’re building traditional software applications.

Writing effective evals for agentic applications [10:21]

Shane Hastie: Let’s dig into those evals. As you say, the code that’s generated is non-deterministic. How do we make sure that there aren’t hidden quality defects in what is being generated?

Sam Bhagwat: There are some out-of-the-box evals that you can usually install in a variety of different environments. For example, prompt accuracy, fairness and unbiasedness, toxicity of a response, or accuracy in tool calling. These are somewhat generic things. But where you really start getting high amounts of value for your particular use case is when you are able to write evals that are unique to your business, based on data that your organization has that others don’t have.
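As a sketch of what a business-specific eval can look like, here is a toy rule-based scorer that checks whether an answer covers facts a subject matter expert flagged as required. The names (`factualCoverage`, `EvalCase`) are invented for illustration; in practice this slot is often filled by an LLM-as-judge rather than string matching.

```typescript
// Toy domain-specific eval: score an answer against facts that a subject
// matter expert says any correct answer must mention. Rule-based for
// illustration; production evals often use an LLM as the judge instead.

type EvalCase = { question: string; mustMention: string[] };

// Returns a score in [0, 1]: the fraction of required facts the answer covers.
function factualCoverage(answer: string, c: EvalCase): number {
  const text = answer.toLowerCase();
  const hits = c.mustMention.filter((fact) => text.includes(fact.toLowerCase()));
  return hits.length / c.mustMention.length;
}
```

The point is the shape, not the matching logic: the eval cases encode knowledge only your organization has, and the score is a statistic you track across a whole dataset rather than a pass/fail on one run.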

Because if you think about it, the models themselves have evals. There are these legal evals and medical evals and all these different datasets and benchmarks that GPT-5.2 and Claude 4.5 Opus and all these models are being trained on and evaled against. But the things that are important when building an application, and that the model providers are not going to cover, are the things that are unique to your organization’s area of core competence and the data that your organization has.

Shane Hastie: Can we dig into one of those? A concrete example, what does this look like, feel like for the developer sitting there using Claude Code inside the IDE?

Sam Bhagwat: I do think it’s important to distinguish between agentic coding, the vibe coding you do in Claude Code or whatever, and building agentic applications. Those are two different kinds of development. We ourselves are obviously vibe coding with Cursor or in Claude Code, et cetera, but the folks we tend to interact with more are the people building the agentic-type applications.

Building agents inside SaaS applications [12:19]

Shane Hastie: So can we take an example of one of those?

Sam Bhagwat: Sure. I think one of the modal use cases we see right now is building an agent as an interface within your SaaS application. In some ways, the web is one client for my SaaS app. Maybe I have mobile clients as well, across iOS and Android, and maybe desktop too. And in some ways, your agent is just another client for your APIs.

For example, we’ve seen an HR SaaS platform building this. They watched a lot of their users trying to answer questions with their data: their users would export CSVs and then paste them into ChatGPT. And they were like, “Well, there are two problems here. One, this is maybe not optimal from a privacy standpoint”. But also, the consumer chat tools probably don’t have a lot of the context on your organization that you have embedded inside your application.

So this is a team that built an agent inside their SaaS application that can generate reports or answer HR policy type questions by merging salary data and some other documents. It’s the modal use case of people building agents. And something that’s really interesting is these customer-facing agents that have access to organizational data and can interact with users to surface information that maybe isn’t clear or obvious or easy to get using the basic functionality that exists in the SaaS app.

Creating domain-specific evals with subject matter experts [13:58]

Shane Hastie: And how would I build a good evaluator, particularly with, let’s take that HR advice, where we are really bound by very, very strict legal rules.

Sam Bhagwat: Typically, the way that we see teams doing it is they’ll bring in a subject matter expert. And so they’ll ask the subject matter expert, “Can you give us a list of questions that would be reasonably comprehensive of the domain?” And then it’s really just a process of gathering a lot of human created data. Okay, so here are the different questions that people might ask. Here are the other inputs, here’s the relevant PDF. Here’s five different sample sets of employee salary and information data, and then five different answers depending on their salary data or whatever it may be.

And I think that goes back to what we were talking about: the things that you want to write evals on are the things that are very unique to your organization. If you’re building HR software, for example, maybe you’re in some jurisdiction that has a particular set of payroll rules and termination severance payment rules and onboarding rules and employee fairness rules. These may not be publicly present in the data that the models are being trained on, and company-specific policies certainly won’t be, so you want to create these kinds of comprehensive datasets.

Typically, these projects have two phases. The first phase is: can we get a prototype working that you can chat with and that will give you answers? And then around there is where you start assessing the accuracy of the agent. Basically, okay, this agent has 80% accuracy or 85% accuracy; we need it to be 95% accurate or 99% accurate, or however you want to score it. Then you have to figure out, well, what are the modes of failure that the agent is running into? Maybe it answers this class of questions reasonably well, but it really struggles with this other class of questions.

This is kind of like an analytical exercise, and this is obviously often where you might bring in more of a PM type to help stare at a lot of data and help classify the modes of failure. And then you start tweaking the prompts and the context that you’re feeding into the agent and systematically burn down your sources of inaccuracy until you’re able to score highly enough with your dataset that you’ve collected.
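That analytical exercise can be made concrete: run the golden dataset through your scorer, compute overall accuracy, and bucket the failures by question category so the struggling class of questions becomes visible. A hypothetical sketch; the shape and names are illustrative, not any particular tool's format.

```typescript
// Sketch of the failure-mode analysis: given scored results from a golden
// dataset, compute overall accuracy and count failures per question
// category to reveal which class of questions the agent struggles with.

type ScoredCase = { category: string; question: string; passed: boolean };

function summarize(results: ScoredCase[]) {
  const accuracy = results.filter((r) => r.passed).length / results.length;
  const failuresByCategory: Record<string, number> = {};
  for (const r of results) {
    if (!r.passed) {
      failuresByCategory[r.category] = (failuresByCategory[r.category] ?? 0) + 1;
    }
  }
  return { accuracy, failuresByCategory };
}
```

The burn-down loop is then: tweak prompts and context, rerun `summarize` over the dataset, and watch the worst category shrink until accuracy clears your threshold.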

You have to understand what the risk is for your organization of giving incorrect answers; sometimes that’s higher and sometimes that’s lower, so you may have different thresholds of tolerance. When you are able to increase accuracy past your threshold of tolerance, then it’s typically very much a staged rollout. We see a lot of use of feature flagging to bring it to maybe a first group of beta testers, and then to 1%, to 5%, to 10%, to 50% of users. And these typically don’t roll out over days; it might be over weeks as you gain confidence and roll it out to wider groups of people.
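The staged rollout described here is commonly implemented by deterministically bucketing each user into [0, 100) and comparing against the current rollout percentage, so a user's bucket stays stable as the percentage widens. A minimal sketch; the hash and names are illustrative, not any specific feature-flag product.

```typescript
// Percentage-based rollout: hash the user id into a stable bucket in
// [0, 100); the agent is enabled when the bucket falls under the current
// rollout percentage. Widening 1% -> 5% -> 50% keeps earlier users enabled.

function bucket(userId: string): number {
  let h = 0;
  for (const ch of userId) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return h % 100;
}

function agentEnabled(userId: string, rolloutPercent: number): boolean {
  return bucket(userId) < rolloutPercent;
}
```

Because the bucket is a pure function of the user id, widening the rollout only ever adds users; nobody flips back and forth between the agent and the old UI between sessions.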

Marrying software engineering and data science mindsets [17:00]

Shane Hastie: Some of this sounds like a fairly straightforward typical analysis exercise that we have done in software engineering for decades. But some of it is quite different. These are skillsets that I’m going to say your traditional engineer doesn’t have.

Sam Bhagwat: It’s interesting, because a lot of times in organizations you have these two groups of folks. You might have data scientists who are more comfortable with this sort of statistical uncertainty, but they’re not experienced in building production software; they might build prototypes in some Jupyter Notebooks or whatever. And then you have software engineers who are thoughtful about how to build and iterate something that’s scalable and can be deployed into production, but who aren’t typically trained in thinking about things in statistical terms.

Some of the interesting challenges are in being able to marry those two frames of mind. I think we now have language around P99 and P95 in terms of response latency, where we know that you want to optimize not just the median response time but also the long-tail response time, so that a very large fraction of our users have good experiences.

And I think we’re developing some of these terminologies for what the equivalent of P95 or P99 is in AI engineering, but it’s a very new field.
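For readers newer to the latency vocabulary: P95 is the value below which 95% of samples fall, and a simple way to compute it is the nearest-rank method, sorting the samples and indexing by rank. A quick sketch:

```typescript
// Nearest-rank percentile: sort the latency samples and take the value at
// rank ceil(p/100 * n). P50 is the median; P95 and P99 are the long tail.

function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}
```

The AI-engineering analogue Sam is gesturing at would apply the same idea to eval scores across a dataset rather than to milliseconds: not just the typical answer, but the worst few percent of answers.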

Cross-functional Tiger Teams for AI projects [18:16]

Shane Hastie: So a lot is emerging, a lot is evolving. What does this mean, coming back to the culture and the team? If I’ve got these engineering-focused folks and the data scientist type folks, how do we get them to work effectively together?

Sam Bhagwat: What we’ve seen teams have success with is finding folks to work on a project who are able to gather information from different types of people. There are a few different team archetypes here. Even in a large engineering organization of 150 or 200 people, we’ve seen the CTO act as basically the lead for this kind of project and write a decent amount of the code. Is that typical of most projects? No, but this is a high-risk, high-value project, so the CTO is really hands-on, in the driver’s seat, someone with the veteran experience to wear multiple hats.

Sometimes we see a project handoff, where someone will start with prototyping, and when the prototype gets to a certain point, the organization may say, “Okay, we really want to put this into production”. Then, now that a couple of folks or a small team has gotten it through the prototype phase, they think about what skillsets they need to bring into the Tiger Team.

I think this Tiger Team concept is very important, because you do need to pull in people cross-functionally, and it is not going to map onto your existing org structure. So the organizations that we see struggling with this are the ones that are more command and control. They have a harder time making cross-functional Tiger Teams for specific projects.

Embrace discomfort and lean into AI engineering [20:04]

Shane Hastie: What’s the one piece of advice that you would give the listeners about embracing this AI engineering approach?

Sam Bhagwat: I’m 37. And I think when you get out of your 20s into your 30s, and your later 30s and beyond, there can be a sense that when you see new things, you react with default skepticism rather than default enthusiasm. We’re engineers; we’re naturally skeptical people. Where that can be challenging is if we lead with our skepticism, because to be good in a new field, you need to be okay with being uncomfortable and okay with being kind of bad at this new thing that you’re doing.

You’re going to have a sense of taste about it, like, gosh, I’m not very good at this, and you’re going to be upset at yourself. But you have to stick with it and be okay with that period of discomfort, and not just reject it because it’s new and it’s weird and it’s different from the thing that you’ve done before, or because your CEO keeps shouting about it.

There are a lot of reasons why you could choose to be skeptical. But I think, if you want it, there’s a lot of opportunity in being the person who is able to build a new kind of technology, to be an early adopter and a pioneer in your field or your community or your organization, and to figure out how the different pieces fit together.

For me, there are two sort of magical experiences that I’ve had. One was the first time I had a working program running, and I was like, “This is so cool. I’m making the computer do this”. The second is vibe coding in AI engineering, watching what the LLM is doing and being part of this co-creation process. So I try to encourage folks to lean into that raw energy and enthusiasm for this cool thing that we all get to do.

Shane Hastie: Sam, thanks very much for taking the time to talk to us today. If people want to continue the conversation, where would they find you?

Sam Bhagwat: You can find me on LinkedIn. You can also find me on Twitter, X.

Shane Hastie: Well, I’ll make sure we include those links. Thanks so much, Sam.

Sam Bhagwat: Thanks for having me, Shane. It’s been an absolute pleasure.
