Transcript
Joseph: We are here to share some of our key learnings in developing customer facing LLM powered applications that we deploy across Deutsche Telekom’s European footprint. Multi-agent architecture and systems design is a construct we started betting on pretty early in this journey. It has since evolved into a fully homegrown set of tooling, a framework, and a full-fledged platform, all fully open source, which now accelerates the development of AI agents at Deutsche Telekom. We’ll walk you through the journey that we undertook and the problem space where we are deploying these AI agents in customer facing use cases. We’ll also give you a deep dive into our framework and tooling, along with some code and some cool demos that we have in store for you.
I am Arun Joseph. I lead the engineering and architecture for Deutsche Telekom’s Central AI program, referred to as AICC, the AI Competence Center, with the goal of deploying AI across Deutsche Telekom’s European footprint. My background is primarily engineering; I come from a distributed systems engineering background. I’ve built world class teams across the U.S., Canada, and now Germany, as well as scalable platforms like IoT platforms. Patrick Whelan is a core member of our team and a lead engineer of AICC and the platform, and he has contributed a great deal to open source. A lot of the components you will see are from Pat.
Whelan: It’s been a year now since Arun recruited me for this project. When I started out I had very basic LLM knowledge, and I thought everything would be different. It turns out, a lot has pretty much stayed the same. It’s been very much a year full of learnings, and this presentation is very much a product of that. It’s also worth noting that this is very much from the perspective of an engineer, and of how we, as engineers, took on this concept of LLMs.
Frag Magenta 1BOT: Problem Space
Joseph: Let’s dive into the problem space in which we are deploying this technology at Deutsche Telekom. There is a central program for customer sales and service automation referred to as Frag Magenta; it’s called the Frag Magenta 1BOT program. The task is simple: how do you deploy GenAI across our European footprint, which is around 10 countries, and across all of the channels through which customers reach us, which are the chat channel, the voice channel, and also autonomous use cases where this might come in?
As you would expect, these European countries require different languages as well. Especially at the time, when we built RAG-based chatbots, this is not something that can really scale unless you have a platform to solve these hard challenges. How do you build a prompt flow, or a use case, that requires a different approach in the voice channel as opposed to the chat channel? You cannot send links, for example, in the voice channel. Essentially, this is where we started off.
Inception
It is important to understand the background behind some of the decisions that we made along this journey. To attack the problem space, we started back last year, somewhere around June, when a small pizza team was formed to look into the emerging GenAI scope. We were primarily looking into RAG-based systems to see whether such a target could be achieved back then. This is an image inspired by the movie Inception. It’s a movie about dream hacking. There are these cool architects in the movie, dream architects, and their job is the coolest job in the world: they create dream worlds and inject them into a dreamer so that they can influence and guide them towards a specific goal. When we started looking at LLMs back last year around this time, this is exactly how we felt as engineers and architects.
On one hand, you have this powerful construct which has emerged, but on the other side, it is completely non-deterministic. How do you build applications where a stream of tokens or strings can control program flow? Classical computing is not built for that. How do you build it? What kind of paradigms can we bring in? At that point in time, LangChain was the primary framework available for building LLM RAG applications. OpenAI had just released the tool calling functionality in the OpenAI APIs. LangChain4j was a port which was also emerging, but there was nothing particularly mature in the JVM ecosystem. It wasn’t really about the JVM ecosystem, though; the approach of building functions on top of a prompt was not particularly appealing if you really wanted to build something scalable.
Also, as Deutsche Telekom, we had huge investments in the JVM stack. A lot of our transactional systems were on the JVM stack. We have SDKs and client libraries already built on the JVM stack, which allow data pulls, as well as observability platforms. What skillsets you require to build these applications was a question. Is it an AI engineer? Does it require data scientists? Certainly, most models were not production ready. I remember having conversations with some of the major model providers, and none of them advised putting it in front of customers: you always need to have a human in the loop, and the technology was still emerging. If you look at the problem space with this background, it was pretty clear we could not take a rudimentary approach to building something and expect it to work for all these countries with different business processes, APIs, and specifications.
Multi-Agent Systems Inspiration
This also provided an opportunity for us. This is what Pat was referring to. I looked at it, and it was pretty clear there was nothing from a framework standpoint or a design standpoint that existed to attack this. It was also pretty clear that models would only get better; they were not going to get any worse. What constructs can you build today, assuming the models are going to get better, that will stand the test of time in building a platform that allows democratization of agents? That’s how I started looking into open-source contributors within Deutsche Telekom, and we brought a team together to look at it as a foundational platform that needed to be built.
Minsky has always been an inspiring figure. This is from his 1986 set of essays, in which he talked about agents and the mind, the mind being a construction of agents. I wanted to highlight one point here. OpenAI’s recent o1 release, or how that model is trained, is not what we are referring to here. We are referring to the programming constructs that are required if you want to build the next generation of applications at scale: different specialists for different processes collaborating with each other. What is the communication pattern? How do you manage the lifecycle of such entities? These were the questions we had to answer.
Our Agent Platform Journey Map
We set out on a journey wherein we decided we would have to build the next Heroku. I remember exactly telling Pat, we have a chance to build the next Heroku. This is how I started recruiting people, at a point when RAG was all there was. Back in September, one year ago now, we started releasing our first use cases, which was an FAQ RAG bot on LangChain. Today, what we have is a fully open-source multi-agent platform, which we will talk about, that provides the constructs to manage the entire lifecycle of agents: inter-agent communication, discovery, advanced routing capabilities, and all that. It’s not been an easy ride. We’re not paid to build frameworks and tooling; we are hired to solve business problems.
With that in mind, it was clear that the approach of rudimentary prompt abstractions with functions on top was not going to scale if you wanted to build this platform. How many developers would you need to hire if you took this approach and then went across all those countries? We have around 100 million customers in Europe alone, and they reach us through all these channels. We knew that voice models were going to emerge, so it was pretty clear we needed something fundamental. We decided to bet on that curve. We started looking at building the stack with one principle in mind: how can you bring in the greatest hits of classical computing and bake them into a platform? We started creating a completely ground-up framework back then, and we ported the whole RAG pipeline, the RAG agent or RAG construct we had released earlier, onto the new stack. It had two layers.
One layer we referred to as the kernel, because we were looking at operating system constructs, and we decided that not every developer needs to handle these constructs, so let’s create a library out of them. Then we have another layer, which at that point in time was the IA platform, or the Intelligent Agents platform, where developers were developing customer facing use cases. This went by the code name LMOS, which stands for Language Models Operating System. We had a modulith back then. We chose Kotlin because we knew that, at that point in time, we had huge investments in the JVM stack. We also knew that we had to democratize this, and there was huge potential in the DSLs that Kotlin brings. There are also Kotlin’s concurrency constructs: what is the nature of the applications that we see? The APIs are going to be the same OpenAI APIs; they might get enhanced, but you need advanced concurrency constructs. That’s why we went with a Kotlin-first approach back then.
Then, in February, the first tool calling agent was released. This was the billing agent, one API, and Pat was the one who released it. You can ask the Frag Magenta chatbot, what’s my bill? It should return the answer. This was a simple call, but built entirely on the new stack; we were not even using LangChain4j or Spring AI at that point in time. Then, as we started scaling our teams, we realized we had to reduce the entry barrier. There was still a lot of code that had to be written. That is when the DSL started to emerge, which brought the barrier down and democratized development. It’s called LMOS ARC, the agents reactor, as we call it.
By July this year, we realized that it’s not only frameworks and platforms that are going to accelerate this; we needed to change, essentially, the lifecycle of developing these applications. Because it’s a continual iteration process and prompts are so fragile and brittle, and because there are data scientists, engineers, and evaluation teams involved, the traditional development lifecycle needed to change. We ran an initiative called F9, derived from SpaceX’s Falcon 9. We started developing agents and brought the development time of a particular agent down to 12 days. In that one month, we released almost 12 use cases. Now we are at a place where we have a multi-agent platform which is completely cloud native. This is what we will talk about now.
Stats (Frag Magenta 1BOT, and Agent Computing Platform)
Here are some of the numbers we have today. We have started replacing some of the use cases in Frag Magenta with the LLM powered agents. So far, more than a million questions have been answered by the use cases where we have deployed this, with an 89% acceptable answer rate. That is more than 300,000 human-agent conversations deflected, with a risk rate under 2%. Not only that, we were able to benchmark what we built against some of the LLM powered vendor products. We did A/B testing in production, and agent handovers were around 38% better in comparison to the vendor products for the same use cases that we tried. Going back to the Inception analogy, one of the things the dream architects did was create worlds which are constrained, so that the dreamer cannot wander into an infinite, open-ended world.
That is exactly the construct that we wanted to perfect and bring down into the platform. They used to create these closed-loop, Penrose-steps-like constructs, and that is what we wanted to bake right into the platform, so that the regular use case developers need not worry about it. Let’s look at some of the numbers for what this platform has done. Consider the development time of an agent which represents a domain entity, like billing or contracts; these are the top-level domains for which we develop agents. When we started, it was 2 months, and now it has been brought down to 10 days. This involves a lot of discovery of the business processes, API integration, and everything.
That is for a simple agent with a direct API. Then there are the business use cases: once you build an agent, you can enhance it with new use cases. Say you release a billing agent; you can enhance it with a new feature or use case, like now it can answer or resolve billing queries. This is the usual development lifecycle, not building agents every day. It used to take weeks, and it has now been brought down to 2.5 days. Earlier we used to release only once per month. As most of you might know, the brittleness or fragility of these kinds of systems means you cannot release fast; especially for a company with a brand like Deutsche Telekom, it can be jailbroken if you don’t do the necessary tests.
We brought that down to two releases per week in production. There are risky answers and a lot of goof-ups as well; the latest one was someone who jailbroke a bot and turned it into a [inaudible 00:15:26] bot or something. The thing is, we need to design for failure. Earlier, we used to rework the whole build, but right now we have the necessary constructs in the platform which allow us to intervene and deploy a fix within hours. That, in essence, is what the platform stands for: what we refer to as the agent computing platform, which we will talk about here.
Anatomy of Multi-Agent Architecture
Whelan: Let me get you started by giving you an overview of our multi-agent architecture. It’s quite simple to explain. We have a single chatbot facing our customer, our user, and behind that we have a collection of agents, each agent focusing on a single business domain and running as a separate, isolated microservice. In front of that, we have an agent router that routes each incoming request to one of those agents. This means that, during a conversation, multiple agents can come into play. At the bottom here we have the agent platform, which is where we integrate services for the agents, such as the customer API and the search API. The search API is where all our RAG pipelines reside, so the agents themselves don’t really have to do much of the RAG work, which obviously simplifies the overall architecture.
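As a rough, hypothetical sketch of that flow, the shape is a router sitting in front of isolated agent services; the interfaces below are illustrative assumptions, not the actual LMOS or ARC APIs.

```kotlin
// Hypothetical sketch of the routing layer described above; names are illustrative.
data class ChatRequest(val conversationId: String, val message: String)
data class ChatResponse(val text: String)

// Each business domain (billing, contracts, ...) runs as its own isolated service.
interface AgentClient {
    val domain: String                      // e.g. "billing"
    suspend fun handle(request: ChatRequest): ChatResponse
}

// The router picks one agent per incoming request, so several agents
// can participate over the course of a single conversation.
interface AgentRouter {
    suspend fun route(request: ChatRequest, candidates: List<AgentClient>): AgentClient
}

suspend fun dispatch(
    request: ChatRequest,
    router: AgentRouter,
    agents: List<AgentClient>,
): ChatResponse {
    val agent = router.route(request, agents)   // e.g. intent-based selection
    return agent.handle(request)
}
```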
There were two key factors in our choosing this kind of design; there are a lot of pros and cons. The first one is that we needed to scale up the number of teams working on the application. We had a very ambitious roadmap, and the only way we were going to achieve it was with multiple teams working on the application in parallel. This is a great design for that. Then we have this prompt Jenga. Basically, LLM prompts can be fragile, and whenever you make a change, no matter how small, you are at risk of breaking the entire prompt. With this multi-prompt agent design, the worst case is you break a single agent, as opposed to having the entire chatbot collapse, kind of like Jenga. This is definitely something we struggled with quite a bit at the beginning.
The Evolution of the Agent Framework
That’s the top-level design. Let’s go one level deeper and take a look at the actual code. What I have here on the left is one of our first billing agents. We had a very traditional approach here. We had a billing agent class and an agent interface. We had an LLM executor to call the LLM. We had a prompt repository to pull out prompts. We mixed the whole thing up in this execute method, and as you can see, there’s a lot happening in there. Although this was a good start, we did identify key areas that we simply had to improve. The top one was this high knowledge barrier. If you wanted to develop the chatbot, you basically had to be a Spring Boot developer. A lot of our teammates were data scientists, more familiar with Python, so this was a little tricky for them.
Even if you were a good Spring Boot developer, there was a lot of boilerplate code you needed to learn before you could actually become a productive member of the team. We were also missing some design patterns, and the whole thing was very much coupled to Spring Boot. We love Spring Boot, for sure, but we were building some really cool stuff, and we wanted to share it, not only with other teams, but as Arun pointed out, with the entire world. This gave birth to ARC. ARC is a Kotlin DSL designed specifically to help us build LLM powered agents quickly and concisely, combining the simplicity of a low-code solution with the power of an enterprise framework. I know it sounds really fancy, but this started off as something really simple and really basic, and it has grown into our secret sauce when it comes to achieving that breakneck speed that Arun goes on about all the time.
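For contrast, a rough illustration of the pre-ARC shape described above might look like the following; this is not the original code, just a sketch of the hand-wired boilerplate pattern.

```kotlin
// Rough illustration only: an agent interface, an LLM executor and a prompt
// repository, all mixed together inside one execute method.
interface Agent {
    fun execute(conversation: List<String>): String
}

interface LlmExecutor { fun complete(systemPrompt: String, messages: List<String>): String }
interface PromptRepository { fun get(name: String): String }
interface BillingApi { fun getBills(customerId: String): List<String> }

class BillingAgent(
    private val llm: LlmExecutor,
    private val prompts: PromptRepository,
    private val billingApi: BillingApi,
    private val customerId: String,
) : Agent {
    override fun execute(conversation: List<String>): String {
        // Prompt loading, data fetching and the LLM call are all handled in one place.
        val prompt = prompts.get("billing-agent")
            .replace("{{bills}}", billingApi.getBills(customerId).joinToString())
        return llm.complete(prompt, conversation)
    }
}
```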
Demo – ARC Billing Agent
Let’s go through a demo. We’re now going to look at our billing agent. We’ve simplified it for the purpose of this demo, but what I show you is stuff that we actually have in production, and it should be relevant no matter what framework you use. This is it. This is our ARC DSL. Basically, we start off by defining some metadata, like the name and the description. Then we define what model we want to use. We’re currently transitioning to 4o. Unfortunately, every model behaves differently, so it’s a big achievement to migrate to a newer model, and the models don’t always behave better; sometimes we actually see a degradation in our performance. That’s also quite interesting. Here in the settings, we always set the temperature to 0 and use a static seed.
This makes the LLM results far more reproducible. It also reduces the overall hallucinations of the bot. Then we have some input and output filters and tooling, and we’ll take a look at those. First, let’s take a look at the heart of an agent, the system prompt. We start off by giving the agent a role, some context, a goal, an identity. Then we start with some instructions. We like to keep our instructions short and concise. There’s one instruction here I would like to highlight, which I always have in all my prompts, and that is that we tell the LLM to answer in a concise and short way. Combining this with the settings we had up there really reduces the surplus information that the LLM gives.
At the beginning, we had the LLM giving perfect answers, and then following up with something like, and if you have any further questions, call this number. Obviously, the number was wrong. The combination of these settings and this single line in the prompt really reduces that surplus information. Then down here, you can see we’re adding the customer profile, which gives extra context to the LLM. It also highlights the fact that this entire prompt is generated on each request, meaning we can customize it, tailor it for each customer, each NatCo, or each channel, which is a very powerful feature that we rely on heavily.
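To make that concrete, here is a minimal sketch of such an agent definition in the spirit of the ARC DSL, based only on what is described here; the exact property names, the settings block, and the getCustomerProfile() helper are illustrative assumptions rather than the verbatim lmos-arc API.

```kotlin
// Illustrative sketch only: property names, the settings block and getCustomerProfile()
// are assumptions, not the exact lmos-arc syntax.
agent {
    name = "billing-agent"
    description = "Handles billing questions such as bills and open amounts."
    model { "GPT-4o" }
    settings {
        temperature = 0.0  // together with a static seed: more reproducible, fewer hallucinations
        seed = 42
    }
    prompt {
        val profile = getCustomerProfile()  // assumed helper; the prompt is rebuilt on every request
        """
        You are a friendly assistant helping Deutsche Telekom customers with billing questions.

        # Instructions
        - Answer in a concise and short way.
        - Only use the knowledge provided below.

        # Customer Profile
        $profile
        """
    }
}
```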
Now we come to the knowledge block. Here we’re basically listing the use cases that the LLM agent is meant to handle, together with the solution. We also have some steps here, which is how we do a little bit of dialog design, dialog flow; I’ll demonstrate that. As you can see, the knowledge we’re injecting here isn’t actually that much. Obviously, in production we have a lot more knowledge, but we’re talking about maybe one or two pages. With modern LLMs that have a context window of 100,000 characters, we don’t need RAG pipelines for the majority of our agents, which greatly simplifies the overall development. Let’s take a look at the input and output filters. These constructs allow us to validate and augment the input and output of an agent.
We have, for example, this CustomerRequestAgentDetector. If a customer asks specifically for a human agent, this filter triggers, and that handover process is then started. We also have a HackingDetector. Like any other software, LLMs can be hacked, and with this filter we can detect that; it will throw an exception, and the agent will no longer be executed. Both these filters, in turn, use LLMs themselves to decide whether they need to be triggered. Then, once the output has been generated, we clean it up a bit. We often see these backticks and backtick-JSON markers. This happens because we’re feeding the LLM a mixture of Markdown and JSON in the system prompt, and it often leaks into the output.
We can remove these by simply putting a minus and that text. Then we want to detect whether the LLM is fabricating any information. Here, we can use regular expressions within this filter to extract all the links and then verify that they are actually valid links that we expect the LLM to be outputting. Finally, we have this UnresolvedDetector. As soon as the LLM says it cannot answer a question, this filter is triggered, and we can fall back to another agent, which in most cases is the FAQ agent. That agent in turn holds our RAG pipelines and should hopefully be able to answer any question that the billing agent itself cannot.
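Here is a minimal sketch of how those filters could be expressed, again in the spirit of the ARC DSL. The detector names come from the talk; the LinkValidator, the fallback parameter, and the surrounding syntax are illustrative assumptions.

```kotlin
// Detector names are from the talk; LinkValidator, the fallback parameter and the
// exact DSL syntax are assumptions for illustration.
filterInput {
    +CustomerRequestAgentDetector()              // hand over when a human agent is requested
    +HackingDetector()                           // abort on detected prompt attacks
}
filterOutput {
    -"```json"                                   // strip Markdown/JSON artifacts
    -"```"
    +LinkValidator()                             // regex-extract links and verify they are expected
    +UnresolvedDetector(fallback = "faq-agent")  // fall back to the FAQ/RAG agent if unresolved
}
```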
Next are LLM tools. LLM tools are a great way to extend the functionality of our agents. As you can see here, we have a lot of billing related functions like get_bills and get_open_amount, but we also have get_contracts. This is a great way for our agents to share functionality with each other. Usually, another team has already built these functions for you, but if you need to build one yourself, don’t worry, we have a DSL for that as well. As you can see here, we have a function with a name, get_contracts. We give it a description, which is very important, because this is how the LLM determines whether the function needs to be called. What is unique to us is this isSensitive field.
As soon as the customer pulls personalized data, we mark the entire conversation as sensitive and apply higher security constructs to that conversation. This is obviously very important to us. Then within the body, we can simply get the contracts; as you can see here, it’s a little bit of magic. We don’t have to provide any user access token; all of that happens in the background. Then we generate the result. Because the result of this function is fed straight back into the LLM, it’s very important for us to anonymize any personal data. Here we have this magical function, anonymizeIBAN, which anonymizes that data so that the LLM never sees the real customer data. Again, it’s a little bit of magic, because just before the customer gets the answer, this is deanonymized, so that the customer sees their own data. That’s functions.
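A minimal sketch of such a function definition, in the spirit of the DSL described here; get_contracts, the description, isSensitive, and anonymizeIBAN are mentioned in the talk, while the getContracts() helper, the contract fields, and the exact parameter list are assumptions.

```kotlin
// get_contracts, description, isSensitive and anonymizeIBAN are from the talk;
// getContracts(), the contract fields and the exact signature are assumptions.
function(
    name = "get_contracts",
    description = "Returns the customer's contracts.",  // how the LLM decides whether to call it
    isSensitive = true,  // personalized data: the whole conversation is marked sensitive
) {
    val contracts = getContracts()  // user access token is handled in the background
    // The result goes straight back into the LLM, so personal data is anonymized here
    // and deanonymized again just before the customer sees the answer.
    contracts.joinToString("\n") { contract ->
        "Contract ${contract.id}, IBAN: ${anonymizeIBAN(contract.iban)}"
    }
}
```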
I think it’s time now to look at it in action. Let me see if this is working. Let’s ask, how can I pay my bill? You see this? It’s asking us a question: whether we’re talking about mobile or fixed line. Say, mobile. I’m really happy this works; LLMs are unpredictable, so this is great. As you can see, we have actually implemented a slight dialog flow. We’ve triggered the LLM to execute this step before showing the information. This is important because, if we go back to the system prompt, you can see that we are giving the LLM two options, two IBANs, and the LLM naturally wants to give the customer all the data it has. Without this step that we’ve defined up here, the LLM will simply return this massive chunk of text to the customer. We want to avoid that. These steps are a very powerful mechanism allowing us to simplify the overall response for the customer. I think that’s it.
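For illustration, a use-case entry with steps in the knowledge block could look roughly like the following; the wording and the placeholder IBANs are made up, not production content.

```kotlin
// Illustrative only: wording and IBANs are placeholders, not production knowledge.
val knowledge = """
    # Use Case: How can I pay my bill?
    ## Steps
     1. First ask whether the customer means their mobile or their fixed-line contract.
     2. Only then show the matching payment details.
    ## Solution
     - Mobile: transfer the open amount to IBAN DE00 0000 0000 0000 0000 01.
     - Fixed line: transfer the open amount to IBAN DE00 0000 0000 0000 0000 02.
    """.trimIndent()
```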
This is the entire agent. Once we’ve done this, once we’ve done the testing, we just basically package this as a Docker image and upload it into our Docker registry.
Joseph: What Pat shied away from saying is that it’s just two files. It’s pretty simple. Why did we do this? We wanted access for our developers, who already know the ecosystem. They would have built the APIs for contracts and billing, and they are familiar with the JVM ecosystem. These are two scripting files, Kotlin scripts, so they can be handed to the developer, and they can be given to the data scientists, along with the view. It comes with the whole shebang for testing.
One Agent is no Agent
We’ll do a quick preview of the LMOS ecosystem. Because, like I said, the plan is not to have one agent; we needed to provide the constructs for managing the entire lifecycle of agents. One agent is no agent. This comes from the actor model. We used to discuss this quite a lot when we started: how do you design the society of agents? Should it be the actor approach? Should there be a supervisor? In essence, where we ended up was: don’t reinvent the wheel, but provide enough constructs to allow extensibility for different patterns. Take the billing agent: from a developer’s point of view, what they usually do is just develop the business functionality and push it as a Docker image. We’ll change that into Helm charts in a bit. That is not enough if you want this agent to join the system.
For example, the Frag Magenta bot is composed of multiple agents. You need discoverability. You need version management, especially for multiple channels. Then there’s dynamic routing between agents: which agents need to be picked for a particular intent? It can be a multi-intent query as well. Not only that, the problem space was huge: multiple countries, multiple business processes. How do you manage the lifecycle when everything can go wrong with one change in one prompt? All those learnings from building microservices and distributed systems still apply. That means we needed to bring an enterprise grade platform to run these agents.
LMOS Multi-Agent Platform
This is the LMOS multi-agent platform. The idea is that, just like Heroku, the developer only does the docker push, or the git push heroku master. Similarly, we wanted to get to a place where you just push your agent to LMOS, and everything else is taken care of by this platform. What we actually built is a custom control plane, called the LMOS control plane, on top of existing constructs from Kubernetes and Istio. What it allows is that agents are now first-class citizens in the fabric, in the ecosystem, as a custom resource, and so is the idea of channels. A channel is the construct with which we group agents to form a system, for example, Frag Magenta. We needed agent traffic management.
For example, for Hungary, what traffic do you need to migrate to this particular agent? There is tenant and channel management. Agent release is also a continuous iteration process; you cannot just develop an agent, push it to production, and believe that everything is going to work well. You need all those capabilities. Then we also have a module called LMOS RUNTIME, which bootstraps the system with all the agents required for a particular system.
We’ll show a quick walkthrough with a simple agent. For example, there is a weather agent which is supposed to work only for Germany and Austria. We have introduced custom channels: it needs to be available only for the web and app channels. Then we declare the capabilities: what does this agent provide as capabilities? This is super important, because it’s not only the traditional routing based on weights and canaries that matters. Multi-agent systems now require intent-based routing, which you cannot really configure by hand, and which is what the LMOS router does.
Essentially, it bootstraps even the router, based on the capabilities an agent advertises once it’s pushed into the ecosystem. We did not want to build this as a closed platform where you can only run your ARC agents, or agents on the JVM or in Kotlin. We were also keeping a watch on the rest of the ecosystem catching up, or even moving faster. You can bring your own Python, LangChain, LlamaIndex, whatever agent. The idea is that it can all coexist in this platform if it follows the specifications and the runtime specifications that we are coming up with. You can bring a non-ARC agent, wrap it into the fabric, deploy it, and even the routing is taken care of.
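As a purely illustrative sketch of that idea, capability-based selection might be modeled along these lines; the types and the resolveAgent function below are hypothetical, not the actual LMOS router implementation.

```kotlin
// Hypothetical illustration of capability-based routing; not the LMOS router API.
data class Capability(val name: String, val description: String)

data class AgentRegistration(
    val agentName: String,
    val channels: Set<String>,           // e.g. "web", "app"
    val tenants: Set<String>,            // e.g. "DE", "AT"
    val capabilities: List<Capability>,  // advertised when the agent is installed
)

// Narrow candidates by tenant and channel, then match the query against the
// advertised capabilities (in practice this matching can itself be LLM- or vector-based).
fun resolveAgent(
    query: String,
    tenant: String,
    channel: String,
    registry: List<AgentRegistration>,
    matches: (String, Capability) -> Boolean,
): AgentRegistration? =
    registry
        .filter { tenant in it.tenants && channel in it.channels }
        .firstOrNull { reg -> reg.capabilities.any { matches(query, it) } }
```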
We will show a quick demo of a multi-agent system. It is composed of two agents, a weather agent and a news summarization agent. We will start by asking a question to summarize a link. The system should not answer, because that agent is not available in the system right now; there is only one agent. Let’s assume Pat has developed a news agent, deployed it, and just did the LMOS push. Right now it’s packaged as Helm charts, and it’s just installed. As you can see, there’s a custom resource, and you can manage the entire lifecycle with the very familiar tooling that you already know, which is Kubernetes. Now we apply a channel.
For example, for the UI that we’ve shown you, assume that this should be made available only for Germany and for one channel. The agent should be available only for that channel, and this should not require additional routing configuration: the agent advertises that it can handle news summary use cases. The router is automatically bootstrapped, dynamically discovers the agent, and picks the right agent for the traffic on this particular channel. Of course, it’s a work in progress. The idea is not to have one strategy. If you look at all the projects that are there, the LMOS control plane, the LMOS router, the LMOS runtime, these are all different modules that provide extensibility hooks, so that you can come up with your own routing strategies if need be.
Takeaways
Whelan: When I started this project a year ago, as I said, I thought everything would change. I started burning my Kotlin books. I thought I was going to be training LLMs, fine-tuning LLMs, but really not much has changed. At its core, our job is still very much about data processing and integrating APIs, and an LLM is just another API to integrate. At least, nothing has changed yet. That said, we do see a new breed of engineer coming out. I’m an engineer, and I’ve spent 500 hours prompt engineering and prompt refining. What we see is this term being coined: LLM engineer. Though a lot has stayed the same, and we’re still using a lot of the same technologies and the same tech stack, the set of capabilities that we want from our developers is definitely growing in this new age of LLMs.
Joseph: Especially if you’re an enterprise, we have seen this: there are many initiatives within Deutsche Telekom, and we often see everyone trying to solve the same problems twice or three times over. The key part is that you need to figure out a way in which this can be platformified, like building your own Heroku, so that these hard concerns are handled by the platform and it allows democratization of building agents.
You need not look for AI engineers, per se, to build use cases; what you need is a core platform team that can build this. Choose what works best for your ecosystem. This has been quite a journey, going against the usual advice: use this framework, use that framework, why would you want to build it from scratch, and all that. So far, we’ve managed to pull it off. We were also pretty sure that if this was going to continue, it needed to be open sourced, because the open-source ecosystem thrives on ideas and not just frameworks, and we wanted to bring all those contributions back into the ecosystem.
Summary
Just to summarize the vision that we had when we started this journey: we did not want to just create use cases. We saw an opportunity: if we could create the next computing platform from the ground up, what would its layers look like, much like the network architecture or the traditional computing layers we are already familiar with? At the bottom-most layer, we have the foundational computing abstractions: prompt optimization, memory management, how to deal with LLMs, the low-level constructs. The layer above is the single-agent abstractions layer: how do you build single agents, and what tooling and frameworks can we bring in to allow this? On top of that is the agent lifecycle layer: whether an agent is built with Claude, or one of the Lang frameworks, or whatever it is, you need to manage its lifecycle, and that is different from traditional microservices.
It brings in additional requirements around shared memory and conversations, the need for continuous iteration, and the need to release only to specific channels to test things out, because no one knows. The last one is the multi-agent collaboration layer, which is where we can build the society of agents. If you have these abstractions, it allows a thriving set of agents that can be open and sovereign, so that we don’t end up in a closed ecosystem of agents provided by whatever monopolies might emerge in this space. We designed LMOS to absorb each of these layers. This is the vision. Of course, we are building use cases, but this has been the construct in our minds since we started this journey. All of those layers and modules are now open sourced, and it’s an invitation for you to join us in our GitHub org and help define the foundations of agentic computing.
Questions and Answers
Participant 1: I would be interested in the QA process for those agents. How do you approach it? Do you have some automation there? Do you run this with other LLMs? Is there a human in the loop, something like that?
Joseph: The key part is that there are a lot of automation requirements. In Deutsche Telekom, we needed human annotators to start with, because there is no technique yet by which an automated pipeline can be fully trusted to figure out hallucinations or risky answers. We started out with human annotators. Slowly, we are building the layer which restricts the perimeter of the risky questions that might come up.
For example, if somebody has flagged a question, or questions of that nature, it can go into the list of test cases that is executed against a new release of that agent. It’s a continual iteration process. Testing is a really hard problem. That’s also the reason why we need all those guardrails absorbed somewhere, so that the developer need not worry about all of that, most likely. There is also the need to reduce the blast radius and release only to maybe 1% or 2% of the customers and get feedback. These are the constructs that we rely on. A fully automated solution for LLM guardrailing is not there yet. If you define the perimeter of an agent to be small, it also allows much better testing.
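A hypothetical sketch of that loop, purely for illustration (none of these types are claimed to exist in LMOS): flagged questions are replayed against a new agent release and judged before rollout.

```kotlin
// Hypothetical regression loop: previously flagged questions are replayed against
// a new agent release and judged for risk before the release is rolled out further.
data class FlaggedCase(val question: String, val reasonFlagged: String)

interface AgentUnderTest { suspend fun answer(question: String): String }
interface AnswerJudge { suspend fun isRisky(question: String, answer: String): Boolean }

suspend fun regressionRun(
    cases: List<FlaggedCase>,
    agent: AgentUnderTest,
    judge: AnswerJudge,   // human annotators and/or an LLM-based check
): List<FlaggedCase> =
    cases.filter { case ->
        val answer = agent.answer(case.question)
        judge.isRisky(case.question, answer)   // any hits block or shrink the rollout
    }
```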
Whelan: Testing is awful. It’s very tricky. That’s especially why we wanted to have these isolated microservices so we can really limit the damage, because often when we do break something we don’t realize until it’s too late. Unfortunately, it’s not a problem that we’re going to solve, I think, anytime soon, and we still need human agents in the middle.
Participant 2: Basically, as far as I understood, there is a chatbot at the end that is available to the user, and underneath there is a set of agents. Do you have active agents that can actually do things? Not like in this example, where they provide information in some form or fetch projections from the system, like contracts, but agents that really do things, that make changes in the system, maybe irreversible ones, or something like that?
Joseph: Yes. If you want to take actions, from a simplicity standpoint it’s essentially API calls. To limit the perimeter, for example, updating the IBAN was a use case we built that is awaiting the PSA process, because you need to get approval from privacy and security. It really works. Essentially, the construct of an agent that we wanted to bring in includes the ability to take actions autonomously; that is a place you can get to. Also, for multiple channels, since you mentioned the chatbot, the idea is: what is the right way to split an agent so you don’t replicate the whole thing again for different channels? What is the right slicing? There could be features built in which allow it to be plugged into the voice channel as well. For example, the billing agent: not only are we deploying it for chat, we are also now using the same constructs for the voice channel, which should potentially also take actions like asking for customer authentication and initiating actions.
Participant 3: I’m quite interested in the response delay. I saw you have hierarchical agent execution, and also, within the agent, as we saw in the billing agent example, you have two filters, like the hacking filter. Do they invoke GPT in sequential order or in parallel? If it is sequential, how do you minimize or optimize the delay?
Whelan: We have two ways we can do it. Often we execute the LLM calls sequentially, but in some cases we also run them in parallel, where it’s possible. For the main agent logic, for the system prompt and everything, we use a bigger model, like 4o. For these simpler filters, we usually use smaller models, like 4o mini, or even 3.5, which execute a lot faster. Overall, this is something that can take a few seconds, and we’re looking very much forward to models becoming faster.
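For instance, independent filter checks backed by smaller models can run concurrently with Kotlin coroutines; a minimal sketch, assuming the check functions wrap LLM calls (the names below are illustrative, not the ARC implementation).

```kotlin
import kotlinx.coroutines.async
import kotlinx.coroutines.coroutineScope

// Minimal sketch: run independent filter checks concurrently on smaller models
// (e.g. 4o mini) while the main agent prompt runs on the larger model afterwards.
enum class InputDecision { PROCEED, HAND_OVER_TO_HUMAN, BLOCK }

suspend fun checkInput(
    userMessage: String,
    wantsHumanAgent: suspend (String) -> Boolean,   // assumed to wrap an LLM call
    isHackingAttempt: suspend (String) -> Boolean,  // assumed to wrap an LLM call
): InputDecision = coroutineScope {
    val humanAgent = async { wantsHumanAgent(userMessage) }   // runs concurrently
    val hacking = async { isHackingAttempt(userMessage) }     // runs concurrently
    when {
        hacking.await() -> InputDecision.BLOCK
        humanAgent.await() -> InputDecision.HAND_OVER_TO_HUMAN
        else -> InputDecision.PROCEED
    }
}
```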
Joseph: What you saw here was the ARC construct for building agents, which allows quick prototyping. We are now releasing it as a way for developers to work. What is in production also has an elementary construct called the LMOS kernel, which we built, and which is not based on this kind of simple prototyping construct; it essentially looks like a step chain. For example, for an utterance that comes in, you first want to check whether it contains any PII data. You need to remove the PII data, which requires named entity recognition to be triggered, using a custom model that we run internally and have fine-tuned for German.
Then the next step could be to also check whether it contains an injection prompt: is it safe to answer? All of that could potentially be triggered within that loop, in parallel as well. There are two constructs; we have only shown one here, the one which allows this democratization element, but we are still working out how to balance programmability, which brings in these kinds of capabilities. We might be able to extend the DSL; the ARC DSL is fully extensible. You can come up with new constructs like repeat or in-parallel around a couple of function calls, and it can execute in parallel. That’s also the beauty of the DSL we are coming up with.
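As a purely hypothetical illustration of such a step chain (not the actual LMOS kernel API), each step could inspect or transform the utterance before the agent answers:

```kotlin
// Hypothetical step-chain sketch: names and types are illustrative only.
fun interface Step {
    suspend fun apply(utterance: String): String
}

class StepChain(private val steps: List<Step>) {
    // Run each step in order, passing the (possibly transformed) utterance along.
    suspend fun run(utterance: String): String =
        steps.fold(utterance) { current, step -> step.apply(current) }
}

// Example wiring: strip PII via a NER model, then verify the text is safe to answer.
fun buildPipeline(removePii: Step, rejectInjections: Step): StepChain =
    StepChain(listOf(removePii, rejectInjections))
```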
Participant 4: You built a chat and voice bot, and it seems like it was a lot of work. You had to get into agents, you had to get into LLMs, you had to build a framework, and you also dealt with issues that you only have with LLMs, like hallucination. Why did you not pick a chatbot or voice bot system off the shelf? Why did you decide to build your own system?
Joseph: Essentially, if you check Frag Magenta today, it’s not completely built on this. We already had Frag Magenta before we started this team. It is based on a vendor product and follows a pre-designed dialog tree, which used to be the previous approach. It’s not like we built this yesterday; we already had a bot. The solution rates, however, were low, because with traditional dialog tree-based approaches you can never anticipate what the customer might ask. It had a custom DSL, which looks like a YAML file, where you say, if the customer asks this, do this, do that. That’s where this came in. When LLMs came in, we decided, should we not try a different approach? There was a huge architectural discussion, and POCs were created.
Should we go with fluid flows, especially in a company like Deutsche Telekom? If you leave everything open for the LLMs, you never know what brand issues you might end up with, versus the predictability of the dialog tree. This was a key point in our design. I showed this number, 38% better than vendor products. We came up with a design that we think, at least, is the right course of action: a mix between the dialog tree and a completely fluid flow where you don’t guardrail at all. This is the programmability that we are bringing in, which allows a dialog design that combines both and has shown better results. That 38% was, in fact, a comparison: the vendor product also came out with LLMs, but there the LLM was used as a slot-filling machine, and ours was performing better. We are migrating most of the use cases to this new architecture.