10 Reasons Your Multi-Agent Workflows Fail and What You Can Do About It

News Room | Published 14 August 2025

Transcript

Dibia: To get started, let's consider a few scenarios. In my line of work, we constantly think about how we can build systems that help us with increasingly complex tasks. Let's consider three of those tasks. Imagine you could tell a system, "download my email attachments," and expect it to extract a bunch of information, put some of it into GitHub Issues, put some into a custom lead tracking tool, then maybe load the rest of the data into Excel. The system will go look at your email, extract all that data, and connect to all these external systems.

Another example of a task that we'd term complex is a software engineering task. Here we're asking the system to build an Android app that can help users view and purchase stocks. Yet another example, which is my favorite, is filing taxes. Imagine that you could actually build a system that could explore the task of filing your taxes, break it down into a bunch of steps, and then, when needed, reach out to you proactively for help, get feedback, and make progress.

One thing we can see is that in these scenarios, the idea is that we provide some high-level description, and the agent figures out all the intermediate steps independently to act and accomplish the task on our behalf. In addition to that, the bonus piece here is that, let's say, they have the ability to learn, understand our preferences, model our context, and become even more helpful to us. In this future, agents become the frontier of computing, and they transform how we interact with the digital world. While each of these examples cuts across multiple different verticals, there are a few things that are common across all three of them. The first is that they're tedious and repetitive. They're important. They're tasks that nobody wants to do, but everybody really needs to do. In many cases, they involve a bunch of steps, and in some cases, the steps need to be proactive.

If you think of all these three things, there are three insights that I think any business-minded person will take away. The first is that, if we succeed at building these systems, then we can save a lot of people time. In business, when we save people time, what happens? We create value and they give us money. The second insight that we could extract from all of this is that if we build a single system that could accomplish all of these different tasks, then we will end up with a unified digital interface. As opposed to the current situation where for every different type of task, people go to different apps, we could end up with a situation where we have a single unified digital interface, and essentially the benefit is that we have potentially zero switching costs for users.

Then, finally, if steps one and two go well, then we have something of a disruptive tool on our hands. This observation is not limited to me; the entire industry is observing and reacting to it. In this article by Forbes, Andrew Ng is quoted as saying that AI agentic workflows will drive massive progress in the next year, and perhaps they might even have more profound effects than the next generation of foundation models. In this other article by Bill Gates, I think Bill makes a couple of important points. He talks about the fact that today, if we want to achieve different tasks, we need to go to different apps. To send an email, we go to Gmail or Outlook. To order food, we go to Uber.

The idea of agents creates this single everything app where we eliminate this switching cost, and it becomes possible to accomplish multiple different tasks with a singular interface. Then there's a recent article based on an interview with Sam Altman, where Sam characterizes these AI agents as competent assistants that might help with a lot of tasks; essentially, he terms this the killer app for AI. In addition to that, I was also curious how the investment space is reacting. I went to Y Combinator, which is a popular startup incubation organization, and pulled up all the data over the last two years. One of the things we see is that over the last two years, there's been a 469% increase in the number of startups that explicitly mention AI agents in their company description. There's just a massive amount of interest.

Across all of these observations, it's clear that there's universal agreement that the future is agentic. However, that's not all there is to it. In a recent survey published by LangChain, they talked to developers and product managers from, I think, 1,300 different organizations, and they asked them two questions.

On one hand, almost 85% of them mentioned that they planned to roll out agents in production. As of today, only about 50% of them had any type of agentic workflow in production. In addition to that, they also asked, what are the key sources of challenges that limit you from going to production? Interestingly, the first thing they mentioned is performance quality: not cost, not safety concerns, not latency, not anything else; performance quality is the key driver that prevents them from going to production. We can see that anecdotal evidence also supports the idea that autonomous AI agents have a last-mile problem, very similar to self-driving cars. It's really easy to hack out a simple prototype and get to the first 95%. It turns out that the last 5% is just as hard as the first 95%.

Background

Given all of these, there’s a bit of a paradox and a bunch of questions. These might be questions that a lot of you have here. What are multi-agent systems? How do we build them? What factors drive reliability and other issues? Then, finally, what can you do about it? Should I invest in the multi-agent space? How do I navigate this wave? That’s what my talk is all about. In part 1, I’m going to define what multi-agent systems are, how you can build them with the AutoGen framework. In part 2, I’ll cover 10 concrete reasons why these things fail in production. In the last part, conclusions. What can you do about it? What are some next steps going forward?

My name is Victor Dibia. I'm a Principal Research Software Engineer at Microsoft Research. I work with a group called the Human AI Experiences Group. We care about scenarios where a human works in tandem with an AI model to solve problems. About three years ago, when I joined the group, one of the first things we worked on was GitHub Copilot. We spent a lot of time working with the core GitHub team to understand how developers could work in tandem with these models.

Back then, it was the first set of AI models from OpenAI, the Codex set of models. We worked with that team to understand how we could take an AI model, integrate it right into the IDE, build the right developer experience, study and understand the offline and online metrics, and ship a product that is widely used today. Previously, I was a machine learning engineer at Cloudera. Also, I was a research staff member at IBM Research. Currently, I spend a lot of my time working on the AutoGen open-source project. How many people here have used AutoGen? The project is about a year old. About a year ago, a couple of colleagues at Microsoft Research, with collaborators from Penn State University and a big open-source community, came up with an initial framework to simplify the process of building multi-agent apps.

About a month ago, we released a new preview API. If you haven’t tried that, I really encourage you to try it out. It’s event-driven. It’s asynchronous by design. It provides both a low-level API that lets you express your agents in whatever language you’re interested in, and a high-level API with a bunch of presets. In addition to that, I’m also the lead developer for AutoGen Studio, which is a no-code or low-code tool to let you prototype, test, debug these multi-agent applications.

What Are Multi-Agent Systems?

Going into the first part of the talk, what are multi-agent systems? We'll start by thinking through what agents are. The simplest definition of an agent you could think of is: each time you give an LLM access to tools and the ability to act, you get an agent. In practice, it's usually just a little bit more than that. You want your agent to be able to reason: take a problem, explore it, decompose it into a set of steps. You want it to act, preferably using some tools. You want it to adapt: given the context, given changes in the environment, you want it to make new decisions.

Then, finally, you want it to communicate, either with humans or other agents. Implementation-wise, the planning is driven by an LLM. Adaptation comes in part from memory: how the agents learn from previous interactions, from explicit and implicit feedback. Action is pretty much driven by tools. All of this is in some way driven by a generative AI model. The communication piece comes together when a group of these agents gets a task and follows some communication pattern towards solving it. By communication pattern or orchestration, what I mean is: when a task comes in, how do we decide which agent takes the first step? How do we define how control flows across agents? How do we define when the task is complete or ends? For the rest of the presentation, when I mention a group of agents, I actually mean a team of agents or a workflow.

Right now, we have a simple working definition of what a multi-agent system is. What does it look like to express this as code? I'll show some examples using the AutoGen framework. The new, redesigned version of AutoGen has two APIs. The first is the AutoGen Core API. It's by design meant to be unopinionated. The primary definition of an agent is any entity that can respond to a message event. The idea is that once you define an agent, as long as you can send a message to that agent, the API gives you a structure to respond to that message, without any other conditions or configurations.
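To make that concrete, here is a minimal sketch of a Core-style agent. This assumes import paths from a recent autogen-core release (they shifted during the preview), and runtime registration is omitted for brevity:

```python
from dataclasses import dataclass

from autogen_core import MessageContext, RoutedAgent, message_handler


@dataclass
class Greeting:
    content: str


class EchoAgent(RoutedAgent):
    """An agent is just an entity that can respond to a message event."""

    def __init__(self) -> None:
        super().__init__(description="Echoes any greeting it receives")

    @message_handler
    async def on_greeting(self, message: Greeting, ctx: MessageContext) -> Greeting:
        # The runtime routes Greeting messages here; no other configuration needed.
        return Greeting(content=f"Echo: {message.content}")
```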

However, the API that I would recommend people start with is the AgentChat API, which is a high-level API. It has a bunch of presets for agents: what happens when these agents receive messages, and how they can use things like an LLM or a tool. It also has things like teams: how do we get groups of these agents to collaborate, and how do we define termination conditions that mark a task as completed? I'll show some examples. In the code we're looking at here, we're using the AssistantAgent preset in AutoGen. What the AssistantAgent preset does is help you define what an agent does when it receives a message. In this case, we give it access to an LLM, a large language model, which is the OpenAI client with GPT-4o mini. Once we've done that, we can test that agent using the run method. In this case, we ask it, what is the height of the Eiffel Tower? We get a result that looks like this.
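A sketch of that setup, assuming the AgentChat API as of a recent 0.4 release (import paths vary across preview versions):

```python
import asyncio

from autogen_agentchat.agents import AssistantAgent
from autogen_ext.models.openai import OpenAIChatCompletionClient

# Give the agent access to an LLM: the OpenAI client with GPT-4o mini.
model_client = OpenAIChatCompletionClient(model="gpt-4o-mini")
assistant = AssistantAgent(name="assistant", model_client=model_client)


async def main() -> None:
    # Test the agent with the run method.
    result = await assistant.run(task="What is the height of the Eiffel Tower?")
    print(result.messages[-1].content)


asyncio.run(main())
```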

This is a typical interaction with an LLM. However, if we change the task slightly to, what is the weather in San Francisco? we get a result that looks like this. I'm sure a lot of you are familiar with interacting with an LLM through ChatGPT, or Gemini, or anything like that. It says something like, I'm unable to provide real-time data, including weather updates; you can check a reliable weather service, all that stuff. This is not particularly satisfying, nor is it helpful. How do we solve this? Here we see that the LLM is unable to act, to go out into the real world and look up weather data. The simple solution is to give it access to a tool. To define a tool in AutoGen, the process is relatively straightforward. We define a tool as a Python function. In AutoGen, it could be any Python function, or it could be a LangChain tool.

Next, in addition to giving the agent access to a model client, we also give it a list of tools that it can use. If you've used any of the modern LLM APIs, most of them are fine-tuned to perform really well at correctly calling tools given the task that has been provided. In this case, we repeat the task, what is the weather in San Francisco? We call the run method, and then we get a response.
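Continuing the sketch above, a tool is a plain Python function; the weather lookup below is a stub standing in for a real API call:

```python
async def get_weather(city: str) -> str:
    """Return the current weather for a city (stubbed for illustration)."""
    return f"The weather in {city} is 18°C and partly cloudy."


weather_agent = AssistantAgent(
    name="weather_agent",
    model_client=model_client,  # the client defined in the previous sketch
    tools=[get_weather],
)

# Inside the same async context as before: the model now calls the tool
# instead of apologizing about real-time data.
result = await weather_agent.run(task="What is the weather in San Francisco?")
```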

So far, we’ve shown how we could define just a single agent, give that single agent access to a tool, and we can call the run method to get it to take a single step. However, what if we wanted to get this agent to take more than a single step? Or if we wanted this agent to collaborate with a bunch of other agents, how would we do that? Within the AutoGen AgentChat API, we have the concept of teams. Here, what we see is we have a preset called the RoundRobinGroupChat preset. What it does is that it takes in a list of agents.

Essentially, what it does is coordinate the flow of information across each of these agents in a round-robin, sequential manner. If there's just a single agent in that list of participants, the same agent keeps getting called until some termination or stop condition is met. So we have a list of participants, and we have an orchestration pattern where agents get called in round-robin fashion. The last interesting thing about this preset is the termination_condition. Here, we're using a text termination condition: whenever any of the agents emits the string "TERMINATE", the entire interaction ends.
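A sketch of that team; note that in some preview builds the termination classes lived under autogen_agentchat.task rather than autogen_agentchat.conditions:

```python
from autogen_agentchat.conditions import TextMentionTermination
from autogen_agentchat.teams import RoundRobinGroupChat

# End the interaction whenever any agent emits the string "TERMINATE".
termination = TextMentionTermination("TERMINATE")

team = RoundRobinGroupChat(
    [weather_agent],  # a single participant keeps getting called in turn
    termination_condition=termination,
)

result = await team.run(task="What is the weather in San Francisco?")
```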

That was a simpler team. What could a more complex interaction look like? Here, we're showing a separate preset called SelectorGroupChat. As opposed to the round-robin preset, in order to select the next agent that speaks at each turn, we take the state of the task and the descriptions of all the agents provided, and we get an LLM to predict which of the agents in this list is most likely to advance the task. Here, we have a simple team called book_team. There are three agents. The first is a planner agent. I haven't shown the code for that, but the goal is, given a task from a user (the user might say something like, create a 1 page book with 2 images about the wonders of the Amazon Rainforest), this initial agent will take that task, come up with a plan, and come up with text content for the book.

The second is an image generator agent. It has access to a function that lets it generate images. The third is a book_generator_agent that can take some text and some images and compile them into a PDF that's saved to disk. The last bit that's different compared to the round-robin group chat is the selector prompt, which guides this team preset on how the next agent in the group should be selected until the task is terminated.
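Wiring that up might look like the sketch below. The selector prompt paraphrases the idea described rather than reproducing the exact prompt, and the three participant agents are assumed to be AssistantAgents defined as in the earlier sketches:

```python
from autogen_agentchat.teams import SelectorGroupChat

# {roles} and {history} are placeholders the team fills in at each turn.
selector_prompt = (
    "You are coordinating a book-writing team. The following roles are available:\n"
    "{roles}\n\nGiven the conversation so far:\n{history}\n\n"
    "Select the role most likely to advance the task."
)

book_team = SelectorGroupChat(
    [planner_agent, image_generator_agent, book_generator_agent],
    model_client=model_client,          # the LLM that predicts the next speaker
    selector_prompt=selector_prompt,
    termination_condition=termination,  # as defined in the earlier sketch
)

result = await book_team.run(
    task="Create a 1 page book with 2 images about the wonders of the Amazon Rainforest."
)
```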

The visualization I'm showing here is a sample from AutoGen Studio. Within AutoGen Studio, when you run a workflow like that, as the agents exchange messages, we have an interactive graph that shows the transition of messages between agents. In this case, the user says, generate a book. The message flows to the planner agent. The planner agent generates a plan. Next, the image generator agent is called by the team preset, and it generates a set of images. You can see that little loop there; it generated two images. Then, all of that data flows down to the book_generator_agent. That's called twice; I think the first time it's called, some error occurs. All of that gets into the history. It recovers from that, and then it generates the book. Then, in this case, it returns to the user for some confirmation, and the task is completed. At the end of that, you get a PDF that looks just like this.

So far, I breezed through all you need to create a basic agent. In reality, the configuration space for multi-agent systems is actually exponential. What do I mean by that? First, you need to make some decisions as to how you orchestrate these agents. There are multiple options you could select here. You could decide, when I get a task, do I come up with an explicit plan, a set of steps, and sequentially execute them? Or do I explore an implicit approach where I take a single step, look at the side effects of that step, then take another step iteratively until the task is done?

Another set of decisions you need to make is, do I let the developer define the properties of all the agents before the task starts, or, might I be in a dynamic space where it might be useful to automatically generate the agent definitions given the task or the properties of the task? In addition to that, what tools should my agents have access to? Do I give them access to general purpose tools like a code interpreter where the agent can generate any arbitrary code and execute it, which has its own side effects, or do I give it access to a set of specific tools, let’s say, call the weather, generate some images, generate a PDF file? Similar situation with memory. How do we define what to learn, what information to retain, when to learn, when to index information?

Then, termination. This looks simple, but it's actually very complex in practice. Do I define my task termination based on some budget, let's say a timeout? Do I define termination based on resource consumption, let's say number of tokens or dollar cost? Do I have some external tool monitoring the progress of the task that gives a signal to terminate it?
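These budget-style conditions compose directly in AgentChat. A sketch, assuming condition classes from a recent autogen-agentchat release (availability may differ in the preview versions):

```python
from autogen_agentchat.conditions import (
    MaxMessageTermination,
    TimeoutTermination,
    TokenUsageTermination,
)

# Conditions compose with |: the run stops when ANY budget is exhausted.
budget = (
    MaxMessageTermination(max_messages=30)
    | TimeoutTermination(timeout_seconds=120)
    | TokenUsageTermination(max_total_token=20_000)
)
```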

Then, finally, which is a really important piece, and I'll show some examples of that later: how do I intelligently define when to delegate to users or not? There's a bit of a spectrum here. If you reach out to the user all the time, then the tool is not that helpful. If you explore a fully autonomous setup, then you can have side effects where the agent takes actions that the user might not necessarily agree with. Improper configuration can lead to mistakes and errors and, overall, degrade performance.

Failure Modes – 10 Reasons Current Multi-Agent Workflows Fail

This brings me to the second part of the talk: 10 common reasons why agents fail. Over the last year, building AutoGen, I've talked with dozens of startup founders and hobbyist developers trying to prototype multi-agent applications. In the next 10 points, in no particular order, I'll explore a set of examples where people tend to make very simple mistakes that cause their multi-agent platforms, teams, or workflows to fail. The first one is, your agent lacks detailed instructions. Underneath most agents (of course, not all agents have an LLM, but most of them do) is an LLM used to generate the next step or action. Given that an LLM is usually driven by a prompt, a lot of people forget that a lot of time needs to be spent very carefully tuning that prompt.

In fact, a good agent has lengthy, detailed instructions covering how to respond, what tools to use, and what behaviors to avoid. For example, in the book_generator_agent that I talked about, the system message says things like: you are a book compilation specialist; your role is to collect story sections and images and format them into a PDF. Important: use the actual image files. You're wondering, how do you arrive at something this strange and this weird? It's mostly by trial and error. This is a very strange way to build software. In practice, if you just don't spend enough time tuning the behaviors and the structure of your system message, then you get agents that behave in very odd ways.
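A sketch of what such instructions look like in code. The prompt below paraphrases the instructions described in the talk rather than reproducing them, and generate_pdf is a hypothetical tool like the one sketched under reason four:

```python
book_generator_agent = AssistantAgent(
    name="book_generator",
    model_client=model_client,
    tools=[generate_pdf],  # hypothetical PDF tool, sketched later
    system_message=(
        "You are a book compilation specialist. Your role is to collect "
        "story sections and images and format them into a PDF.\n"
        "- Call generate_pdf exactly once, after all sections and images exist.\n"
        "- Important: use the actual image file paths produced earlier in the "
        "conversation; never invent file names.\n"
        "- When the PDF has been saved, reply with TERMINATE."
    ),
)
```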

The second thing is, stop using small models. I know a lot of people are really excited about the LLaMA-7B models and the 5 billion, 12 billion, or 13 billion parameter models. In practice, small models have limited instruction-following capabilities. Like we saw previously, your agent could have pages of instructions to guide its behavior. In our experience, these small models just don't do well out of the box. One concrete example we saw a lot in the last year: we had a code executor agent. It worked well when the other agents that participated in the group chat generated code with a specific structure: the code was encased in backticks followed by a language tag. We saw that GPT-4 and the other larger models like Claude did a really good job of generating that structure. The small models just couldn't; they just wouldn't work for this type of workflow.
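To see why this matters, consider the convention such an executor relies on: code must arrive fenced in triple backticks with a language tag. The extraction logic below is illustrative, not AutoGen's actual implementation; when a small model drops the fence or the tag, nothing is extracted and the workflow stalls:

```python
import re

# Matches blocks fenced by triple backticks with a language tag; the `{3}`
# quantifier avoids writing literal backtick fences inside this snippet.
FENCED_BLOCK = re.compile(r"`{3}(\w+)\n(.*?)`{3}", re.DOTALL)


def extract_code_blocks(message: str) -> list[tuple[str, str]]:
    """Return (language, code) pairs for each well-formed fenced block."""
    return FENCED_BLOCK.findall(message)
```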

The third thing is, your agent instructions don't match your LLM. A really common mistake people make is that they try a single-agent workflow with GPT-4 or GPT-4o, and the next thing they want to do is switch it out for, say, the November version of the same model. What we've seen is that sometimes there are fundamental changes to the model; even the same model across multiple versions can lead to severe regressions. Simply changing models and expecting similar behavior is often a mistake: it bites across versions of the same model, and even more severely across different model families. If you make a switch from something like GPT-4 to the Gemini models or the Anthropic models, you might need to tune the rest of your agentic stack's parameters to fit the behaviors of those models.

Four, your agents lack good tools. In theory, tools dictate the action space of your models. Whatever action your models can take is limited by the tools you give them. There are lots of configuration decisions to make here. Do you take all your tools and stuff them into the prompt? Do you have some sophisticated retrieval pipeline that, given the state of the task, selects only a subset of those tools? Do you give your agents access to general-purpose tools, like the ability to use a computer interface, drive a web browser, or execute arbitrary code, or do you just give them access to specific functions? These are configurations that have a large effect on the reliability of agents.

Just as an example, I talked about how an agent in the previous workflow could generate and save PDFs. The only reason that works is because there's an actual Python function designed very carefully to take in a structured set of inputs, a list of images and text, and combine them into a PDF, and that function gets executed to generate the book.
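A hypothetical version of such a tool, using the fpdf2 package (this is not the actual function from the talk). The point is the structured signature: the agent only fills in arguments, while the careful layout logic stays in plain, testable code:

```python
from fpdf import FPDF  # assumes the fpdf2 package is installed


def generate_pdf(title: str, sections: list[str], image_paths: list[str],
                 out_path: str = "book.pdf") -> str:
    """Combine text sections and images into a PDF book saved to disk."""
    pdf = FPDF()
    pdf.add_page()
    pdf.set_font("Helvetica", style="B", size=24)
    pdf.multi_cell(0, 12, title)
    for text, image in zip(sections, image_paths):
        pdf.add_page()
        pdf.image(image, w=170)  # scale each image to roughly page width
        pdf.set_font("Helvetica", size=12)
        pdf.multi_cell(0, 8, text)
    pdf.output(out_path)
    return out_path
```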

Number five, your agents don't know when to stop. Fun fact, I call this the banana problem. If you have a 3-year-old at home, you could ask them to spell banana. They might start out really well, B-A-N-A-N-A, but then they just keep going, B-A-N-A-N-A-N-A-N-A, all the way. They just don't know when to stop. The strange thing is, even agents backed by sophisticated models like GPT-4 sometimes suffer from this. In part, it's because LLMs are autoregressive models; they will keep generating text conditioned on everything they've seen so far. It can also be really hard to define the right termination condition, because the termination condition really depends on the system message of the agent.

For example, let’s reflect on the text message based terminate condition. That will only work if your agents are primed to output that string, terminate, once the task is done. In the example here, this is just a screenshot from AutoGen Studio, we ask an agent team, what is the height of the Eiffel Tower? It does answer the question in the first step, but it just keeps going and generates about 8 or 10 more messages after that, just repeating itself ad infinitum.

Number six, you have the wrong multi-agent pattern. This space is really early. It's not clear that any specific multi-agent pattern is the best right now, but some patterns are better than others. In some cases, it makes sense to have a round-robin group chat. In some cases, it makes sense to have an LLM decide in the moment which agent should act next. This can have real effects on the quality of your agents' behavior.

Number seven, your agents are not learning. Most agents today are the equivalent of a goldfish. Every time you run them, they explore a path, and across every run, they do the same thing over and over again. It can be quite frustrating to observe. A good memory implementation should let an agent learn from explicit feedback, learn from implicit feedback, and then intelligently recall when to reuse the things it has learned.
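As a sketch of the idea (hypothetical helpers, not an AutoGen API; a real system would use embeddings rather than keyword overlap):

```python
import json
from pathlib import Path

MEMORY_PATH = Path("agent_memory.json")


def record_lesson(task: str, lesson: str) -> None:
    """Persist explicit feedback, e.g. 'always save PDFs to the output dir'."""
    notes = json.loads(MEMORY_PATH.read_text()) if MEMORY_PATH.exists() else []
    notes.append({"task": task, "lesson": lesson})
    MEMORY_PATH.write_text(json.dumps(notes, indent=2))


def recall_lessons(task: str, limit: int = 3) -> list[str]:
    """Naive keyword recall; prepend the hits to the agent's system message."""
    if not MEMORY_PATH.exists():
        return []
    words = set(task.lower().split())
    notes = json.loads(MEMORY_PATH.read_text())
    hits = [n["lesson"] for n in notes if words & set(n["task"].lower().split())]
    return hits[:limit]
```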

Oftentimes, if you’re building agents that would explore long-running tasks, they will benefit from the ability to plan, to occasionally review the plans, to abandon wrong or incorrect trajectories, to reset their states to ensure that their mistakes previously don’t affect their future runs. Then, to make progress until the task is done or some final termination condition is met. To illustrate some of this, I’ll share some findings from some research paper that I and my colleagues wrote recently.

Recently, we wrote a paper called Magentic-One, a generalist agent team that solves complex tasks. We benchmarked it on a dataset called GAIA. Here's an example of the kind of problem these agents have to solve to make progress on that set. The task reads: of the cities within the United States where U.S. presidents were born, which two are the farthest apart, from the westernmost to the easternmost going east, giving the city names only? Then, finally, it says, give them to me in alphabetical order in a comma-separated list. To solve a problem like this, you need agents that can search the web, retrieve birthplaces, retrieve city coordinates, find the westernmost and easternmost points, and write some code to compute distances.

Then, return the final results, sort them in alphabetical order, and finalize the answer. In practice, when agents explore a long-running task like this, with a lot of steps, they fail somewhere along the line and never get the task done. Our solution was to build a team of five agents: an agent called an orchestrator that comes up with a plan (I'll talk a bit more about what that does), a coder agent that writes code, a computer terminal that can execute code, a web surfer that primarily surfs web pages, and a file surfer that can interact with files. The orchestrator uses the idea of a two-ledger system. The first ledger comes up with a plan and with educated guesses.

Then, a second ledger works in a fast loop. At each step, it asks: is the task complete? Are we in a stalled state? Who is the next speaker? Based on that, it assigns the next step to the agent that's meant to take an action, keeps track of whether there's a stalled state, and does things like resetting the entire state of all the agents, until the task is complete. If you're interested in these sorts of systems, I encourage you to take a look at that paper. One of the insights from that whole process is that we found the use of a ledger yields approximately a 31% increase in task performance. We have an agent team that generalizes across at least four to five different task types.

Number nine, you don’t have any evals for your tasks. It’s a frequent thing that we see occur. Without evals, you have no way of inspecting how all of the changes you make to these exponential set of configurations impacts your task performance. Then, number 10, your agents don’t know when to delegate to humans. To illustrate this, I’ll show three tasks that an agent might perform for you. Imagine that you had an agent that was deployed. The agent came and said, one day, I made a call to fetch the weather. Now I know it’s cloudy in San Francisco. It’s not too bad. Then, the next day the agent says, I deleted two videos. It’s not the end of the world.

Then, the third day, the agent comes and says, I transferred 3 Bitcoins to Victor. Now that is a problem. To the agent, each of these tasks might carry about the same amount of risk, cost, or irreversibility. To the human, that is not the case. Frequently, you need to build in mechanisms to cost, or at least predict, the risk of an action the agent is about to take, and intelligently delegate to a human when that risk is high. Then, the final bonus point is, you probably don't need an agent or an autonomous multi-agent system. You'd be surprised how often this is my response to people who are interested in this space.
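To make the delegation point concrete, here is a hypothetical risk gate (the tool names and policy are illustrative; AgentChat's UserProxyAgent is one built-in way to route such approvals to a person). The point is that cost and irreversibility are scored in code rather than left to the LLM's judgment:

```python
HIGH_RISK_TOOLS = {"transfer_funds", "delete_files", "send_email"}


def approve_or_delegate(tool_name: str, args: dict) -> bool:
    """Return True to proceed autonomously; otherwise ask a human first."""
    if tool_name in HIGH_RISK_TOOLS:
        answer = input(f"Agent wants to call {tool_name}({args}). Allow? [y/N] ")
        return answer.strip().lower() == "y"
    return True  # low-risk, reversible actions proceed without interruption
```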

What You Can Do

What can you do? Number one, know when a multi-agent approach is the right thing to do. Autonomy comes with its own set of reliability issues. The moment you start giving agents autonomy to explore the problem space, the surface of errors, the points where failures can occur, increases pretty significantly. Like any other tool, a multi-agent system should be selected when it is the right tool for the job. In my experience, the big circle here is the overall set of tasks that most companies need to address as part of their innovation or product development process.

The small dot is a much smaller subset: the actual tasks that benefit from a multi-agent approach. Sometimes, just due to the excitement in the space, a lot of people conflate these two things, and you might try to explore a multi-agent approach where it's really not the right fit. How do I know if my task benefits from a multi-agent approach? I have a little framework that I always ask people to go through. Does the task benefit from planning? Is the task such that you could break it down into a bunch of steps such that accomplishing each of them takes you from state A, the problem state, to state B, the solved state?

The second is, does the task benefit from diverse expertise? The value here is that in a multi-agent design, it makes sense to map each agent to a specific expertise, some sort of domain-driven design. Say, in the software engineering field, one agent focuses on UI, another on backend API development, another on things like integration testing. In a situation like that, it might make sense to explore a multi-agent approach. Does your task require processing extensive context? In the case of writing code, you might need to consult a bunch of documentation, reach out to humans, and get feedback. Does your task exist in a dynamic environment? There are some tasks you just can't solve ahead of time because you don't know the solution.

For example, if you had a task that required manipulating a web page: every time you visit the page, it's usually very different. Sometimes some buttons don't exist, and every time you click something, it changes the state of the environment. In that case, you really need an approach with the ability to react to the dynamic environment and explore new types of plans. If the answer to all of these is no, you most likely don't need a multi-agent approach.

Two, eval-driven design. I talked about this already. Next, a constrained, tool-focused implementation. In my opinion, most production agents today mostly use the LLM to encapsulate battle-tested business tools. The idea here is that you spend all of your engineering effort building reliable tools, and to build reliable agents, you get the LLM to intelligently orchestrate or call those tools as needed. The opposite of that is giving the agent a lot of autonomy; instead, you constrain it to a set of tools that make sense for your business problem.
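Returning to point two, even a tiny harness beats none. A minimal sketch, where the cases and substring checks stand in for a real benchmark with task-specific scoring:

```python
EVAL_CASES = [
    {"task": "What is the weather in San Francisco?", "must_contain": "San Francisco"},
    {"task": "Create a 1 page book about the Amazon Rainforest.", "must_contain": ".pdf"},
]


async def run_evals(team) -> float:
    """Score a team on fixed cases; rerun after every configuration change."""
    passed = 0
    for case in EVAL_CASES:
        await team.reset()  # fresh state per case, so runs don't contaminate each other
        result = await team.run(task=case["task"])
        passed += int(case["must_contain"] in result.messages[-1].content)
    return passed / len(EVAL_CASES)
```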

The third is investing in observability and debugging tools. Because the space of configurations is really large, sometimes your best bet is to run these agents a few times and then retroactively inspect what they have done. The example here, again, is AutoGen Studio showing the loop of agent activities. Then, fourth (this is more in the research domain), my bet is that the agents that get us to a reliable state will combine both soft and hard logic: reasoning provided by LLMs, but controlled by logic programs.

I talked at the beginning about a bunch of questions. One of them is, should you invest in a multi-agent system? My thought process here is, first, invest in your benchmark, your evaluation, and let that help you decide. If you have a strong baseline and a multi-agent approach, just check whether the multi-agent approach is actually helping you make progress on the metrics you care about. Another thing to note is that the models are getting better, and a lot of the reliability issues we see today are moving up the stack; the models are being fine-tuned to behave fundamentally better when used within these agentic systems. The third is, does your business have disruption exposure here? Are you in a field like finance, software engineering, or back-office work, some of the use cases I showed earlier, where systems like these would disrupt you? All of this can inform your decision.

Next Steps

Next steps. We covered a few things: what multi-agent systems and agents are, orchestration and how to implement them using AutoGen, 10 common reasons why they fail, and some insights and steps to take, things like evaluation-driven design and a constrained, tool-focused approach. What we did not cover: the user experience for multi-agent systems; interface agents, that is, agents that act by controlling interfaces (you might have seen things like Anthropic's computer use tool), how to design those, and their unique failure modes. We also haven't talked about optimizing multi-agent systems, say, fine-tuning small models to work better in multi-agent systems; responsible AI considerations; multi-agent patterns; or a deep dive into use cases.

If any of these things are of interest to you, I’m putting together my thoughts on the topic in a book, “Multi-Agent Systems with AutoGen”. The book is published by Manning. It’s in early preview right now. I think the first three chapters are available. I think you can download and read some of them right now.

Questions and Answers

Participant 1: Actually, maybe there is a further aspect that we didn't cover. We've been playing around a lot with AutoGen in the company: prototypes, and also something in production, but for internal functionality. What we've been wondering about is horizontal scalability once you face a large user base. How is it going to scale? What are the best practices? Should you spin up more containers? Are there any considerations?

Dibia: Yes. Over the last year, we’ve been speaking with companies like yours, and the scalability story has been something that we just didn’t have until recently. I mentioned earlier that we have a new API, version 0.4. That thing is completely rewritten from scratch. First of all, it’s asynchronous, and so there’s a stronger path to actually have multi-threaded versions of this stuff. We also have the concept of a runtime. We have a single-threaded runtime and a distributed runtime. All of this can be abstracted using tools like Ray and a few other things. Look at the new 0.4 preview API. Our current documentation has some ideas on how you can scale these things horizontally across very large clusters.

Participant 1: I don't know whether this aspect is covered, but if you scale horizontally, a request may hit any of the various instances of the same agent, and they should share the same knowledge. This is something which is important.

Dibia: Our new preview API has a scalability story there.

Participant 2: Could you please provide a couple examples of real-world problems that you think multi-agent systems would be better at compared to single, and that humans wouldn’t necessarily want to tackle?

Dibia: Earlier, I started with, I think, three examples. The first was something like, scan my email, look for emails from clients that have attachments, download those attachments. Extract the data in those attachments. Put that into an Excel file. Put some of that into our custom lead generation tool, maybe open some GitHub Issues. I think that example falls into a class of issues that we refer to as back-office tasks. Those things don’t require a lot of skill, they just require a lot of time, and they require touching multiple interfaces and tools. It’s very hard to write an API or some software program that would do stuff like that, because now you need things to move across multiple systems.

The way you would use a multi-agent system here is that you would build something like this, and it would be able to do things like open up your email. If needed, it will control your desktop, spin up the custom tool that has no API, type in text, and put in all of that data. Then navigate to GitHub, open up issues, and things like that. That's one example.

Another example that’s pretty common is the software engineering example. You’ve probably heard of systems like Devin the AI software engineer. I wouldn’t say we are there yet, but I think we will see very strong progress over the next year. Then, finally, very long-running tasks that could take weeks to complete. The North Star for something like that is filing taxes. How do you even start out with things like that? Lots of emails to be sent, lots of requirements to be gathered. You want a system that can pause, is asynchronous, knows like, here’s the state, here’s what progress looks like, here’s what the next steps look like. I would say those classes of problems, it’s very hard to write software for that right now. It’s a small slice, but it’s just something that we can’t write software for right now.

Participant 3: I am an AutoGen user myself, and from personal experience and from reading the Reddit community, LangChain and AutoGen are definitely the top contenders. What's your thought on your product versus your competitors?

Dibia: I think the space is still so early that we really can’t tell. It really reminds me of eight years ago when we were all converging on deep learning frameworks. There was Caffe2. Then there was TensorFlow, and all of us learned TensorFlow. Then there was PyTorch, and PyTorch just showed us a completely new developer experience, and it seems to be what we need now. We are very cognizant of that. We all just need to keep iterating really fast until we find the thing that works. As an example of that, right now we are deep in the weeds, completely rewriting AutoGen. It’s a direct response to the fact that the first version just didn’t get a lot of things right, and we need to do better. I don’t know which one will win, but we’ll see.

 
