Elena Samuylova On Large Language Model (LLM) Based Application Evaluation And LLM As A Judge

Transcript

Introductions [00:27]

Srini Penchikala: Hi everyone, my name is Srini Penchikala. I am the lead editor for AI, ML and data engineering community at infoq.com website and I’m also a podcast host. Thank you for tuning into this podcast. In today’s episode, I will be speaking with Elena Samuylova, co-founder and CEO at Evidently AI, the company behind the tools for evaluating, testing and monitoring the AI powered applications.

Elena will discuss the topic of how to evaluate large language model based applications, LLM based applications, as well as applications leveraging AI agent technologies. So with a lot of different language models being released by major technology companies almost every day, it is very important to evaluate and test LLM powered applications. So we are going to focus on that. We are going to hear from Elena on the best practices and any other resources we want to go to learn about LLM evaluations. Hi Elena. Thank you for joining me today. Can you introduce yourself and tell our listeners about your career and what areas have you been focusing on recently?

Elena Samuylova: Hello everyone and thank you so much for the lovely introduction. So indeed, I’m CEO and co-founder of a company called Evidently AI, but before that I was actually doing applied machine learning and AI stuff for almost a decade. I think the meaning of this word changed over this time, but the idea remains working with different non-deterministic predictive systems. So originally I was working in the department called Yandex Data Factory of a large search engine called Yandex where the idea was to apply machine learning technologies anywhere else outside consumer internet. And that was my first introduction to the topic. So imagine coming to different companies from healthcare to industrial manufacturing and trying to figure out what to do with that data. So that was my first introduction to this idea and I worked with a lot of them and back then we actually used the word “big data” mostly and then “predictive analytics”.

And then we started talking about machine learning and suddenly, I think it was around 2015 or 2016, again, we started talking about AI. And then I left the company and co-founded my first startup called Mechanica AI and it was focused on industrial manufacturing. That was somehow one of the topics that I picked up. And we did a lot of deployments of real machine learning based systems in industrial environments. So imagine like steel making plant, oil and gas or something like this. And that brought me to the topic of maintenance, safety and reliability of the systems because when you deploy it in these environments, that is actually quite scary. Many things can go wrong.

And eventually that also led us to founding Evidently. When we started the company, we originally focused on machine learning model monitoring, specifically helping to figure out how to keep track of all these systems. And now with LLM craze, of course LLM became part of the picture. And we’ve been focusing again on the reliability and safety of these systems, but a bit more even on the beginning of the cycle when you have to think about safety and testing even before you have to monitor them. So you can see all my career was pretty much around AI systems, the topics of reliability, but in different worlds. I understand well both consumer internet and very real world applications. So I hope it sets the scene where I’m coming from.

LLM Evaluation Terminology [03:32]

Srini Penchikala: Thank you. It definitely does. Safety and reliability are even more important now with all the explosion of the models and the solutions. So Elena, before we start on the LLM evaluation process and methods and everything, for our listeners who are new to this topic, can you tell us what is the difference between LLM system, probably most of us know about it, and evaluating LLM models?

Elena Samuylova: So every time there is a new model released, just like you mentioned right now in the beginning, like a new provider’s model or maybe a new open source model, people have to somehow present why this model is better on how it deals compared to all the other models. So we have some ways to compare models in an abstract way, and this is usually done through benchmarks. So some standardized tests, maybe they’re focused on mathematics, maybe on coding, maybe on general comprehension and so on. And when we release the model, we definitely attach performance on these benchmarks, but this is made without connection to a specific use case. At the same time, if you’re a builder inside a company or a startup founder or something, you are working on a specific application. So you’re building a chatbot or an AI agent or maybe a summarization tool and you need to evaluate this system.

And while this system uses a model on the back end, this model is just one of the components. You also add a prompt on top of it, you also connect it to different parts of the system, maybe like RAG application, databases and so on. I imagine a lot of our listeners are software architects so they understand well how many little pieces come together to create a complete system and you need to evaluate this system on the use case that you’re solving. So this is different from evaluating models. This is done just like maybe fore initial choice. It’s important for you to understand which model to compare, but then when you’re working with the system, you need to focus on evaluating this system, testing it on every release and so on. So that’s two different parts of the medal and of course we’ll touch both, but probably most people who are listening to us eventually will need to evaluate systems, not models because they’re probably not training their own models.

LLM as a Judge [05:24]

Srini Penchikala: And also there is a term called LLM as a judge, can you please explain what this is and how does it help with the evaluation efforts?

Elena Samuylova: That’s one of my favorite topics. I think it’s a bit controversial, but that’s probably what makes it fun. So the idea behind LLM as a judge is to use LLM to evaluate the outputs of your LLM system. I know this sounds a bit recursive at first and it’s almost like cheating. How can we trust one model to take the output of another and treat it fairly? But the idea is that you’re not just asking the model to redo its job. Even side note, sometimes actually this can be useful, but you are trying to use the model as a classifier to evaluate specific qualities of the output. So for example, let’s say your chatbot generated a response, you might want to understand if this response is safe or is it polite or is it comprehensive enough?

You can ask another LLM to just read this output and pretty much assign this label. So this is an evaluation technique that can be applied in many scenarios when you’re testing, when you’re running online evaluations and so on. And the main idea is yes, you’re using LLMs pretty much to do what otherwise the human would have done doing some sort of manual labeling, approving the response, commenting on it. But there are a lot of tricks and issues with how exactly to implement it because of course it’s not a silver bullet. I’m happy to dive into more as we discuss that throughout the conversation.

Srini Penchikala: Looking forward to that. So are there some LLMs that are specifically designed to be evaluators or any LLM can be a judge?

Elena Samuylova: There are actually a few and people do release them, like framing that these are judgment LLMs. But in reality, I would say that you can use any LLM and it should be done with consideration of the task because when you say just general purpose judgment, it’s almost the same as general purpose LLM. You need to define what you’re judging.

There are few things, like for example, imagine judging a sentiment. You don’t even need an LLM for that. You can actually have a smaller machine learning model. Or maybe detecting toxicity. This is a very narrow task. And then at the same time, imagine another task, which would be “how comprehensive this response is” or “how well worded it is”. This would require some different capabilities. So depending on what you’re judging, you might need to use different LLMs. So our general approach is to first identify what is the criteria, and that’s actually the hard part. And then you can choose the suitable LLM for that. But to directly answer your questions, yes, there are special LLMs, but I would definitely recommend just to use the general purpose ones.

LLM Based Application Evaluation Process [07:44]

Srini Penchikala: Makes sense. Okay. And now that we have some idea on the definitions, we can jump into the more details. So I think first question is what does the typical LLM evaluation process look like from your experience? What are the different steps in the process and also what’s involved? What are the best practices in evaluating LLM models?

Elena Samuylova: I think we can stick to the systems here. So we are evaluating the systems. So probably the first step would be when you’re trying to design something like maybe a chatbot, I will keep using this example, but you can always mentally replace it with some other system like summarization tool, an agent and so on. The first step that you need to do is to, well, try it out, create some first attempt of it and understand how well it’s doing its job. That’s where I usually first come up with a problem of evaluation. So you just enter some test inputs, you look at the results and you either like them or not. And in the beginning you can do it very iteratively and the community actually calls it “vibe checks”, which means that you’re just vibing it. You look at the results, you like them or not, but this is not very systematic or fair or structured.

So the next step is typically to implement some sort of automated scoring that will allow you to compare the outputs automatically during the experimentation. This is the first step in development. So you may be trying different models, you may be trying different prompts, you may be trying different RAG designs and any need a way to evaluate the results of your experiments. Maybe you tried three different vector databases or chunking strategies, so you need to somehow compare which one is doing better. So this is the experimental evaluation and at this point you already need to design either LLM as a judge or come up with some other metric that would allow you to do this automatically. Sometimes it’s easier like say you’re generating code, you can have some sort of test data set where you just check if the code tests will be passed after you generate this code.

So it depends on the use case, but this is the first step. However, that’s not where it ends, so just where it starts. Then when you move to pre-production phase, you typically need to expand your testing. So maybe you experimented on a smaller subset of data, but before you deploy, before you roll it out to your better users, you need to cover some more scenarios. And here it can come from just running a few more tests to actually implementing a whole comprehensive testing strategy, which can take quite some time. Because imagine you have a medical or healthcare or educational application or maybe legal application. So you need to take into account all the risks that are associated with this and test it on a much wider, broader range of applications. So this can come in the form of stress testing or even if it’s a customer-facing application, you can do red teaming.

In this case, red teaming is also for people who are coming from a security background. It’s a very well-known concept, but here we’re talking about testing the LLM systems maybe for prompt injections and some other risks. And that’s even before we deploy it. So then we have the online ongoing monitoring where you need to evaluate how the system behaves in production and regression testing because every time you change something you need to redeploy it, you make some changes to the prompt, you actually need to verify that nothing broke. And that’s actually, I can share from experience where a lot of people start thinking about this. Because somehow they just go through the first steps very roughly and very quickly, but then they deploy to production and either they notice issues or they want to change something and they already have users and start really caring about these changes.

Imagine you already have a workflow, like agentic workflow that people are depending on. Now you plan to change your prompt. You don’t know how exactly the system behavior will change because LLMs are very non-deterministic and unpredictable in this case. So yes, so at least at this regression testing, monitoring, stress testing and experimental evaluations, these at the very least the four workflows where you come with evaluations.

Srini Penchikala: It looks like the testing, monitoring, evaluation, they all go hand in hand. So you want to not wait too late in the process to do these evaluations.

Elena Samuylova: I think it’s like a continuous cycle where one fits into another, but you do have to start somewhere. Maybe start with monitoring and then you figure it out.

Custom “LLM as a Judge” Solutions [11:35]

Srini Penchikala: Regarding the LLM as judge, I think we can create our own custom LLM judges. Can you talk about how to design and tune a custom LLM judge program?

Elena Samuylova: I think in a way the whole point of LLM judges is that you can implement custom criteria because it’s like, I think a very good analogy here would be first imagine that you have your problem, you’re looking at the outputs. When you are trying to understand that they’re good and bad, you have some mental criteria that you’re applying. So maybe you’re looking at tone, maybe you’re looking at correctness, maybe you’re looking at the length, maybe it’s something else much more nuanced. So the LLM did not consider something when formulating the response. So when you look through this and when you try to do manual labeling, you’re actually already applying some sort of criteria. And now imagine that you want to hand over this task to another human. Maybe you would want to hand it over to me and say, “Hey Elena, could you please review these outputs now for me?”

You would need to explain to me what are the criteria to look at, because imagine I’m not familiar with the topic. So that’s how you would do with manual labeling and actually LLM as a judge is the way to achieve exactly that. So you could say, “Hey, when I was reading these responses, I was trying to understand if the tone is easily comprehensible for a person who is new to the topic. So if you use too complicated words, I don’t like it”. Then we can actually implement a judge that would measure this. So they would assign the label, like the tone is comprehensible or easy or not. The trick is: how do you trust the judge? So okay, we write a prompt, but are we going to like the results or not? And there it becomes a bit meta, but you actually need an evaluation system for your own evaluator.

And in this case we can say, okay, now let me go and label the responses first. Then I create the judge and then I see if the judge’s labels match mine. This can be done with a few examples, maybe 30, 50, a hundred depending on the complexity of the use case. And when you trust the judge, you can see that actually it can successfully replicate the labels. Then you can scale it and you can start applying it on real data. You might need to intervene a little bit, but it’s much easier than many other tasks because it’s purely classification. And in this case, that’s why I think it’s so important to think about this as a classification task. Some people come and ask like, “Hey, can I label something like goodness of tone on a scale from zero to a hundred?” I know it does sound like attractive at first, but if you actually look at it and you try to do this yourself, how do you define what’s a 60 and what’s a 70 or what’s a 20?

So there is no criteria that you can actually relay to another person. So the LLM will not be consistent. So you should also try to formulate your criteria. Even a scale one to five is usually not suitable. Maybe you can come up with individual classes, but that’s very important part how you do define it. So when you want to define this judge, you pretty much start with being the judge yourself, then writing the prompt. And in this case, we are using pretty much prompt engineering techniques to create this judge. Then you evaluate it and once it’s doing a good job, you can apply it in practice. If you don’t like the results, you try again or maybe understand that you need to break down your criteria into several or something like this.

Srini Penchikala: And once you get it all working and start using a custom LLM judge in production, how much reusable it is for other use cases? Can we reuse for other cases or are we talking about one-off solution for every application?

Elena Samuylova: It does depend. I mean, I would say between companies for example, probably zero. Inside one company, actually quite a lot because let’s say that you create a lot of systems where you communicate with the customers. So one can be in marketing, another can be in HR, somewhere else, you can have multiple applications and then you create a judge that can verify if the tone of the response matches your corporate guidelines. And this can be very specific to the company. You can use certain words, you can use certain tone and so on. And then this judge can actually be reused across a lot of different LLM applications or something about, for example, all the negative stuff like safety, toxicity and so on. This can probably encapsulate certain rules which again, can be shared across the company.

But when it comes to specific quality dimensions, they’re usually very specific to the application and they’re also very specific to the error types that you observe. Because from my experience, it’s very hard to come up with useful criteria until you actually see the specifics of the data. When you see the data, you see the exact errors that you want to fix and that serves you with the criteria to implement. But this can be different issues depending on the model used. For example, different models behave differently, some of them are more complex, some of them are less. So it really varies a lot.

RAG Based Application Evaluation [15:47]

Srini Penchikala: That’s good. So these language models, they’re more useful for the companies once they start using this retrieval augmented generation, RAG techniques, where they can input their own company data and make the models richer in terms of business domain and business knowledge. So where does the LLM evaluation fall into when we need to evaluate the RAG-based solutions?

Elena Samuylova: Just also maybe to comment a bit for the listeners. So when you talk about RAG, we talk about feeding additional context to the LLM before it generates the response. One usual way is to actually feed some sort of documents that you find that help answer the questions. So that LLM pretty much becomes like: it formulates the answer, but it uses not the knowledge that the LLM has but the document that was fed to it. In many cases, this is actually documents, but sometimes it can actually be like Agentic RAG. For example, it can maybe do some queries, maybe query some databases, pull some information and so on. So it’s all about giving the right context to the LLM to answer it so that it does not hallucinate or doesn’t just rely on its own built-in knowledge because you should not rely on the built-in knowledge of the LLM.

Context Engineering vs Prompt Engineering [16:54]

I think there is now a term that is getting traction, which is called “context engineering” instead of “prompt engineering” because the whole problem is to supply the right context. Well RAG is one way of how you do this and I think inside the enterprises when I talk to companies, almost always they use RAG for all kinds of applications just because otherwise there is no way you can trust the LLM responses. But now that we set the scene: to evaluate a RAG system, we actually need to relate two parts of it. The first part of RAG is where you do the search and search is actually a very well-known problem. It’s been there for a couple of decades, at least in the internet and even more so in theory, we call it information retrieval and that’s a very well-known problem. All you need to do is to find the documents or chunks of the documents that satisfy the query. We typically use the term “relevant”, but in this case “relevant” meaning that the document contains the answer to the query.

And then there’s a second part of RAG, which the is the G, which is the generation. The point of this generation is to take the document that was found and to formulate the answer. And if you evaluate RAG, you actually need to evaluate these two things separately because the problem can come either from the search or from the generation. So to give you a specific example, sometimes you just don’t have the answer.

So the person comes, asks for something, you search inside your document and you don’t find anything. And then if you pass to the LLM like nothing or some documents that only can contain some partial information, there are high chances that the LLM hallucinates. So you need to test for the quality of search separately. And the good thing that this is a very well solved problem, so you don’t need to use LLM judges here even, but you do need to create a golden data set of responses, basically correct answers to known queries and then see if you can find them.

And then you can separately evaluate the generation. And in this case, you typically look for things like “does it hallucinate compared to the context”, or “does it contain a link”, or “does it fit inside the expected length” or something like this. So here you can evaluate some other parameters. I think this presence of links, not having hallucinations and finding the correct context is basically the triad. There are a few more specific things that you can apply on top, but the general idea, if anyone is building RAG, I would recommend that first break it down into two separate pieces because in this case it also helps with debugging. We are not just doing this because we know you want to have more metrics. It’s about figuring out where the problem comes from. So at the end you get a bad generation, you need to understand why it happened – because you couldn’t find the information or because you did find the information, but for some reason the LLM somehow processed the response wrong. And then you probably need to go and fix the prompt.

Role of Synthetic Data in LLM Systems Evaluation [19:29]

Srini Penchikala: Divide and conquer. So those are the two main components of RAG. I want to switch gears a little bit, Elena. So for some of the use cases, just manufacturing or other healthcare type of domains, it is not easy to have a lot of data. So the teams basically generate some synthetic data to test their models. So what is the role of synthetic data in this LLM systems evaluation process?

Elena Samuylova: The first problem I think is not even in the medical domain or somewhere, it’s always the data. It’s always the data, whatever we talk about. And the first issue is like say, “Hey, I’m going to go and build this chatbot. How do I test it? I need some test data”. So in an ideal world, I would go and take real user queries and then run tests on them. But to get real user queries, you need to deploy the system in front of your users. And in some lower risk domains you can actually do that. You can just put it in front of them, start collecting the inputs and then use it for testing, evaluation and development. But yes, in higher risk domains or in general, if it’s like a core product that you can’t just release in one day, you don’t have access to the users, you need to start somewhere and you need this data also for testing purposes.

So one way to approach this is to generate synthetic data. So you can for example, come and say, “Hey, I want to generate hundreds of different queries where people ask about my company pricing, different terms and so on”. So anything that you expect the users to do, and you can present it from different user standpoints, and you can use LLMs for that. So that’s the cool part. So you can use LLM to basically fill in the gaps here. For RAG, that’s actually even more applicable because if you’re building a RAG system, this implies that you already have some documents. So you have a context and documents that you’ll be searching in and you can use the same documents to generate the golden data set. In this case, the golden data set is the data set that contains the queries and the expected answers from this system.

And you can also use LLMs again to basically sample parts of the documents, then generate the questions that are answerable from these documents and use it to get this test data set. So this is a very handy technique. It’s not the only one and it should be applied with, I would say, moderation and human review because the tests are only as good as the data that you test them on. So you have to add your own product thinking. For example, you as a product owner or someone who is the main expert can bring in the ideas of how the user should behave or at least what’s the expectation of who the user should be, what’s their level of knowledge, what their needs and so on. And you should take this into account when generating the synthetic data. Because the failure mode here is to generate the data that is completely not representative or not realistic, then be very happy that the LLM works well on these artificial queries and then spectacularly fail in production.

Skillsets Required for LLM Application Evaluation [22:03]

Srini Penchikala: Thank you. I want to talk about the people’s side of the equation. So what kind of skill sets, you mentioned about context engineering versus prompt engineering, so what kind of additional new skill sets that team members need to acquire to best leverage the LLM evaluation techniques?

Elena Samuylova: I find it very curious how it splits between different teams because depending on the type of problem you are solving, someone has to be the product owner and understand the domain. Say if you’re building a coding assistant, often the same engineers who are building the system like writing the prompts and so on, they understand the problem well because they’re essentially building the product for themselves. So they themselves are engineers. But then if you imagine a legal advisor or medical advisor or something like this, the engineer who is implementing the system or RAG, writing the prompts, they themselves probably don’t have enough understanding and knowledge in the domain and they have to bring someone in. So in this case, we actually need to build some sort of collaborative system, but I think that someone always has to be the bridge. Back in the day, in classic analytics, we used the term “analytics translator”.

So the person who was supposed to be the bridge between the business team and the data team. I do feel that here with LLMs we also need someone like this. The critical part in the beginning is that someone wears this hat. So you say basically “I will own the evaluations and I will try to understand what the domain actually is, and if I don’t, I will find the people to ask”. Because the negative scenario that can happen is that no one ever looks at the data. Just engineers just writing the prompt, some tests pass, but no one ever looks at the data or tries to make sense of it. So you have to be, I was speaking to someone who actually brought up the point, very high tolerance to this difficult work of just looking at the data. It’s not very glamorous, but you have to understand the data, get some intuition about different failure modes that you observe before you actually implement automated evaluations.

And I think for people who are coming from very engineering backgrounds, this analytical part is a little bit hard. At the same time, some teams who are mostly built from data scientists, they might struggle with the engineering, but they’re actually happy looking at the data. So it does depend on the team composition. And in general, LLM is so new that you can see people from very different backgrounds. They were doing front-end engineering and now they’re writing prompts or they come from DevOps background or classic machine learning and so on. So it does really vary on the team, but my prediction is that we will have a role which will be maybe akin to product analytics. It was not there originally. So there was just an understanding that someone has to look and interpret product metrics and ask the right questions. So probably we’re going to have a role around LLM evaluations on the larger teams at some point.

Srini Penchikala: For now, do you see them being owned by quality assurance testing teams?

Elena Samuylova: In a few cases, yes, but I think mostly it’s the same people who are writing the prompts and they can be engineers and sometimes this is the product person or whoever is wearing the product person hat. So sometimes it’s the founders, sometimes it’s the head of engineering, but someone who can look at the data. I think QA teams, unfortunately I just haven’t seen that many. There is not yet a proper QA for LLM systems, especially QA always is an afterthought. I think everyone knows it. No one likes to write tests and you don’t have a proper existing flow for that.

Srini Penchikala: Yes, definitely. I agree with you. I think the evaluation process, since it needs to be looked into more carefully from the beginning of the project, it may become its own mini project with its own requirements, the evaluation criteria, and then obviously a solution.

Elena Samuylova: Maybe when you come to a more mature product and you have every release, you have a more structured QA system, you just need to manually review something. This will take this shape of having a proper QA team. But yes, like you said, you’re perfectly right. When it starts from the very beginning, it’s not the QA’s job, it’s the product builder job.

Limitations of LLM Application Evaluation [25:48]

Srini Penchikala: Makes sense. Okay, so definitely, I know it sounds very powerful, but what are the limitations or gotchas of LLM system evaluation tools? Now, what should our listeners be aware of if they need to work on something like this in their own systems, what should they be watching out for?

Elena Samuylova: And I say this as a tool builder, the tool is only half of the problem. So another half is the process, and I think most of the teams, they need to build the process first and then to understand it, and that’s something that we’re lacking. But in terms of specific gotchas, I think the major flaw that I see is when people actually do come and they look for built-in metrics that will help in their use case, they just think like, okay, I’m building a chatbot, so someone has to give me these three metrics that will work out of the box that will tell me that this chatbot is doing great. But the reality is that you need to actually define the metrics. So when it comes with this data analysis and everything that we spoke about, and this is a limitation and the gotcha.

So you have to put in this work. And another part is that of course not all things are solvable. So if you’re building a very complex problem, you for example try to evaluate the medical safety of the responses and you don’t actually have the right context to get it, there is no way how you can delegate it to the LLM. So there is no amount of effort we’ll solve this with 100% accuracy. So what you need to do in this case is maybe rethink your product design and so on. So maybe breaking down the smaller components or bringing “human in the loop” for some final review. So not all problems can be magically solved just because you’re bringing the evaluation tool.

LLM Application Evaluation Metrics and Benchmarks [27:14]

Srini Penchikala: Makes sense. You mentioned metrics, can you talk about some benchmarks that are available with the standardized tests for LLM system evaluations? How do we integrate these into AI applications?

Elena Samuylova: So there are a lot of benchmarks that are usually the ones that you use to compare different models. So there are safety benchmarks, coding benchmarks, mathematical benchmarks, like general benchmarks. I think there is now a lot of work around releasing domain specific benchmarks, maybe medical or financial or legal and so on. And this is usually a manual job done by some research lab or some company that sponsors this project, which is incredibly interesting. And I think here the main value for the builders is to use these benchmarks when they’re selecting models because I think it’s very overwhelming. So you just go out there and think “which model should I be building with if I’m trying to solve this task, if I’m trying to build a coding assistant or something like this, which model should I use?”

And in this case, you can go, look up at these benchmarks, maybe use some of them internally to try to figure out which model is better as initial screening and understanding. And in some scenarios you can reuse these benchmarks for your application, but only in a narrow sense. When you’re building a coding tool, yes, some coding benchmarks will work for you. Or maybe when you’re doing adversarial testing, there are some safety benchmarks which will be applicable, like the general purpose type of queries that you don’t want your system to answer. But when it comes to actually evaluating your system well, you need to have your own benchmark. That’s the evaluation data set you create and so on.

Srini Penchikala: Yes, makes sense. So we cannot have a podcast on the AI topic without talking about AI agents.

Elena Samuylova: I was waiting for that.

Evaluating Agentic AI Applications [28:48]

Srini Penchikala: So I know we see a lot of AI agent-based solutions coming out lately. So how do we evaluate AI agents and also you mentioned agentic workflows. If somebody is working on an application either using AI agents or agent orchestration, is it different in terms of evaluating that system versus a pure LLM system?

Elena Samuylova: I think there is an overlap, and probably the difficult question in the beginning is to define what an agent is because when you bring it into the conversation, sometimes people have different perceptions. So I like this term “workflow” because in many cases the agent is actually not required. You don’t need the LLM to decide what to do. Maybe you can have a deterministic choice between a number of routes and bringing LLM only in specific parts of the process. And in this case, it’s much easier to test this system because you’ll be testing this individual component. So maybe the LLM, like you first do some sort of routing and then the LLM has a defined number of choices. It’s pretty much the same thing that we discussed just at the larger scale. So you can have specific unit tests or specific datasets that evaluate different capabilities of your agent.

And I think I love this topic a lot, but I haven’t seen a lot of implementations yet. You can also do synthetic testing when, it’s like end-to-end testing. So for example, imagine that I have a booking agent. I know that example that always comes up. And if I were to test it manually, I would maybe just come and say, “Hey, I want to go and buy a ticket to London”. And then midway I would say, “No, actually I didn’t mean that it’s London, Ontario”. or something like this. So you just try and go through all these workflows. You can actually try it then and replay this conversation using another agent. So this will be like a tester agent that follows a very defined scenario that interacts with your agents, and then you can run an evaluation on the final output. Or you can also go through the whole transcript and also evaluate, for example, did your agent behave empathetically or did your agent interact logically not break down somewhere in the middle and so on.

So this is a more complex testing, which requires, as you can imagine, a very thorough design. So you need to first build this agent and so on. But I do think that’s where we are all going. And to be honest, we as humans and software developers, we still need jobs to do. So we cannot delegate everything to agents. So probably this testing and evaluation and orchestrations of all these agents who will be doing some of our work for us, that’s actually our job. Be the oversight, be the designer. One way I like to think about all these evals, it’s basically your product specification. Because through your tests, through your evaluators, you actually express what you expect the system to do, where the boundaries are. So what is good and what is bad. And in this case, this is extremely useful and this is your job to own. You can’t just to LLM: “figure out everything including how my product should behave”.

Role of Software Development in the age of AI [31:27]

Srini Penchikala: What’s your advice for the application developers who may have the fear or concern of being replaced by AI programs. So like you said, we can generate not only code snippets, but the entire applications nowadays. So what does it mean for the human developers who may be concerned with what’s my future as an IT professional?

Elena Samuylova: My opinion that code writing is not the bottleneck and that there are developers in large companies that almost never write code. All of their job is actually specification, understanding which technologies to use and so on, higher level decisions. So in this case, it’s more of a junior engineer’s job, so to say, to actually write the code. So there are all the other ways around it. But my honest advice is that I think everyone who’s working in IT knows that our job is changing every few years even without AI. New technologies, new stack, new roles. So I think we all should be prepared for change from the very beginning. If you’re not comfortable with that, we probably should not go into IT either way. So we should always be ready to reinvent ourselves. And I personally, I think that’s fun. So it’s very curious to learn your roles. I’m sure we’ll figure out new jobs to do when some of our older jobs are taken. Taking meeting notes and sending follow-ups – hell yes, please take this job from me. I don’t want to do that.

Srini Penchikala: Exactly. Those things should have been automated a long time ago.

Elena Samuylova: Exactly right. It’s important what we say on the call, but sending the follow-up – perfectly.

Srini Penchikala: So I think as humans, we’ve been spending too much time on machine tasks, so now it’s time to go back to human tasks.

Elena Samuylova: Absolutely. Yes. Maybe we finally can work fewer hours, but just do more directive type of job making hard decisions, not the implementation part.

Srini Penchikala: Go to the office in the morning, start your agents, and then you take off.

Elena Samuylova: And maybe we review it in the evening. So just go back to the office, approve, approve, approve or implement a new test to fix the error mode you saw and then go back to your life.

Srini Penchikala: Yes, that’s definitely, I think that’s how we should be anyway. So we should be focusing on more important parts of the lifecycle.

Elena Samuylova: And defining what’s important.

Learning Resources [33:27]

Srini Penchikala: There you go. Just like defining what’s an agent. So yes, thanks, Elena. A lot of good stuff. So where can our listeners get more information on these topics? Do you have any online resources you recommend? Books or anything like articles?

Elena Samuylova: I’ve actually been very actively creating content myself and with my co-founder, that is part of what we do also as an open source company, so you can check out, we actually created two free open courses on LLM evaluations. One is no code, high level, and another one with code applications. And we also have a bunch of guides and tutorials. So I’ll be happy to share it with the listeners of the podcast.

Srini Penchikala: Good. Thank you. Do you have any additional comments before we wrap up today’s discussion?

Elena Samuylova: I think everyone who is listening to this, it might feel a bit overwhelming. So all these things, I would say: “start somewhere”. Just write a couple of tests for your LLM system if you don’t have any. And then you can always build from there.

Srini Penchikala: And also, there are a lot of tools that you don’t have to have subscriptions to these commercial LLM tools. There are a lot of open source options.

Elena Samuylova: Absolutely. Including ours. Yes. So Evidently is open source.

Srini Penchikala: Not small, locally. Yes.

Elena Samuylova: Just write it in Python and ask an LLM if you need help.

Srini Penchikala: Yes, there you go. Exactly. Cool. Yes, thanks very much for joining this podcast. It’s been great to discuss one of the important topics, LLM evaluations. Quality assurance is one of my favorite topics, and so anything new that is definitely going at a faster pace, like LLM advancements, we definitely need these checks and balances, which is what the evaluation is. So thanks for your time. And to our listeners, thank you for checking out this podcast.

If you’d like to learn more about AI and ML topics, check out the AI ML and data engineering community page on infoq.com website. I also encourage you to listen to the recent podcasts, especially the AI ML trends report. We publish the trends reports once a year, as well as the trends reports on other topics like architecture, culture, and DevOps. This will help you stay up to date on what the InfoQ editors, who are also practitioners like you, think about what’s coming up in AI ML and other areas. Thank you, Elena.

Elena Samuylova: Thanks for having me.

.
From this page you also have access to our recorded show notes. They all have clickable links that will take you directly to that part of the audio.

Elena Samuylova on Large Language Model (LLM) Based Application Evaluation and LLM as a Judge

Transcript

Introductions [00:27]

LLM Evaluation Terminology [03:32]

LLM as a Judge [05:24]

LLM Based Application Evaluation Process [07:44]

Custom “LLM as a Judge” Solutions [11:35]

RAG Based Application Evaluation [15:47]

Context Engineering vs Prompt Engineering [16:54]

Role of Synthetic Data in LLM Systems Evaluation [19:29]

Skillsets Required for LLM Application Evaluation [22:03]

Limitations of LLM Application Evaluation [25:48]

LLM Application Evaluation Metrics and Benchmarks [27:14]

Evaluating Agentic AI Applications [28:48]

Role of Software Development in the age of AI [31:27]

Learning Resources [33:27]

Leave a Reply Cancel reply

Stay Connected

Latest News

The Best Ereaders We’ve Tested for 2025

Chinese Cybercrime Group Runs Global SEO Fraud Ring Using Compromised IIS Servers

SwitchBot’s new safety tracker can discreetly trigger a fake phone call

Best soundbars in 2025 for every budget reviewed | Stuff

World of Software is your one-stop website for the latest tech news and updates, follow us now to get the news that matters to you.

Quick Link

Topics

Sign Up for Our Newsletter

Transcript

Introductions [00:27]

LLM Evaluation Terminology [03:32]

LLM as a Judge [05:24]

LLM Based Application Evaluation Process [07:44]

Custom “LLM as a Judge” Solutions [11:35]

RAG Based Application Evaluation [15:47]

Context Engineering vs Prompt Engineering [16:54]

Role of Synthetic Data in LLM Systems Evaluation [19:29]

Skillsets Required for LLM Application Evaluation [22:03]

Limitations of LLM Application Evaluation [25:48]

LLM Application Evaluation Metrics and Benchmarks [27:14]

Evaluating Agentic AI Applications [28:48]

Role of Software Development in the age of AI [31:27]

Learning Resources [33:27]

Sign Up For Daily Newsletter

Be keep up! Get the latest breaking news delivered straight to your inbox.

Leave a Reply Cancel reply

Stay Connected

Latest News