Key Takeaways
- Each problem in the AI space has unique challenges. Once you’ve been serving production traffic, you’ll find edge cases and scenarios you want to measure.
- Consider models as systems: LLMs are part of broader systems. Their performance and reliability require careful observability, guardrails, and alignment with user and business objectives.
- Build metrics that alert you to user issues, and make sure you have a cleanup process to phase out outdated metrics.
- Focus on business direction. Build metrics that align with your current goals and the lessons learned along the way.
- Don’t overcomplicate it. Adopt a crawl, walk, run methodology to incrementally develop metrics, infrastructure, and system maturity.
Denys Linkov presented the talk “A Framework for Building Micro Metrics for LLM System Evaluation” at QCon San Francisco. This article is based on that talk, which starts by explaining the challenges of LLM accuracy and then shows how to create, track, and revise micro LLM metrics to improve LLM systems.
Have you ever changed a system prompt and ended up causing issues in production? You run all the tests, and hopefully, you have evaluations before changing your models, and everything passes. Then, things are going well until someone pings you in the Discord server saying everything is broken.
One real scenario that led to the idea of micro metrics happened at Voiceflow, an AI agent platform, when I changed the system prompts that govern how we interact with models.
Someone was prompting a model in German while having a conversation with their end user. By the fifth turn of the conversation, the model suddenly responded in English. The customer was mad, wondering why their chatbot switched to English after speaking German the entire time. They were confused, and so were we.
Building LLM platforms, or any kind of platform, is challenging.
What Makes a Good LLM Response?
When you’re building an LLM application, what makes a good LLM response? It’s a pretty philosophical question because it’s hard to get people to agree on what good means.
LLM responses are attractive but can be misleading: they sound convincing even when they’re wrong. Not only do people often disagree on what counts as good, they sometimes don’t even read responses closely. To evaluate responses, you might use regex or exact matches, cosine similarity against golden datasets, LLMs as judges, or traditional data science metrics.
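To make the simplest of these concrete, here is a minimal sketch of exact-match and regex checks against a tiny golden dataset; the dataset contents and helper names are made up for illustration.

```python
import re

# Hypothetical golden dataset: prompt -> expected answer (illustrative only)
golden = {
    "What is the capital of France?": "Paris",
    "What year did the Berlin Wall fall?": "1989",
}

def exact_match(response: str, expected: str) -> bool:
    # Normalize whitespace and case before comparing
    return response.strip().lower() == expected.strip().lower()

def regex_match(response: str, pattern: str) -> bool:
    # Pass if the expected pattern appears anywhere in the response
    return re.search(pattern, response, flags=re.IGNORECASE) is not None

response = "The capital of France is Paris."
print(exact_match(response, golden["What is the capital of France?"]))  # False: extra words
print(regex_match(response, r"\bParis\b"))                              # True: pattern found
```

Exact match is strict and brittle, which is exactly why teams reach for the fuzzier options discussed next.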
The Flaws of One Metric
Let’s start off with some lessons I’ve learned. The first is the flaw of a single metric. Take semantic similarity, which powers RAG by searching for similar phrases. Here’s an example: I’m comparing the phrase “I like to eat potatoes” with three phrases using three models – OpenAI’s latest and two high-ranking open-source models. Can you guess which phrase they matched most closely?
The options are:
- I am a potato
- I am a human
- I am hungry
Figure 1: The challenges of semantic similarity
All three models chose “I am a potato”. This creates an odd dynamic. Saying “I like to eat potatoes” and matching it to “I am a potato” highlights the flaws of relying on cosine or semantic similarity alone. Realistically, “I am hungry” or even “I am a human” makes more sense. Metrics don’t work all the time.
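If you want to reproduce this kind of comparison yourself, here is a minimal sketch using the sentence-transformers library; the model name is just an example, and your scores will not match the three models discussed above.

```python
# A minimal embedding-similarity sketch; the model name is an example choice.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "I like to eat potatoes"
candidates = ["I am a potato", "I am a human", "I am hungry"]

query_emb = model.encode(query, convert_to_tensor=True)
candidate_embs = model.encode(candidates, convert_to_tensor=True)

# Cosine similarity between the query and each candidate phrase
scores = util.cos_sim(query_emb, candidate_embs)[0]
for phrase, score in zip(candidates, scores):
    print(f"{phrase}: {score.item():.3f}")
```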
LLM as a Judge
Let’s talk about the challenges of using LLMs as judges, a common practice, especially with GPT-4. Many use LLMs to evaluate responses when they don’t want to review them manually. However, these models have biases. For example, a 2023 paper found GPT-4 aligns poorly with human judgment on short prompts but performs better with longer ones. We’ve seen this bias through several different studies. This is an interesting concept: these models are trained to mimic certain human tendencies, and certain preferences emerge from that training.
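For reference, an LLM-as-judge setup can be as simple as the following sketch, which assumes the OpenAI Python SDK, an example model name, and a made-up one-to-five rubric; it is not a recommended configuration, and the biases described above still apply.

```python
# A minimal LLM-as-judge sketch; model name and rubric are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge(question: str, answer: str) -> str:
    prompt = (
        "You are grading an answer for correctness and relevance.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Reply with a single score from 1 (poor) to 5 (excellent) "
        "and one sentence of justification."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

print(judge("What causes tides?", "The gravitational pull of the Moon and the Sun."))
```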
But what about humans? Are we reliable judges? Let’s look at standardized exams. There was some research done almost 20 years ago on the SAT essay, where the researcher found that if you simply looked at the length of the essay, it correlated very well with how examiners scored it. This reveals a similar bias in human judgment: we prioritize superficial metrics like length.
What does it mean to be good? Would you rather watch a YouTube video about cats or about LLMs? Consider another example: baby cat videos vs. a Karpathy lecture. The cat videos got 36 million views compared to the lecture’s 4 million. We say, “Cats are better than LLMs. Obviously, we should serve only cat content to people”. Social media might think so, but it shows the limitations of metrics like views or accuracy. These measures, on their own, are flawed. You could probably reach that conclusion on your own.
If we talk about how we give instructions to people, we generally give pretty specific instructions for some tasks, but vaguer instructions for others. For example, when I worked at McDonald’s (a true character-building experience), the instructions for cooking chicken nuggets were extremely specific. The manual detailed the exact cooking time, and beepers would go off if you didn’t lift the nuggets on time. But then there were tasks like “mop the floor”, where the instructions were vague. If you hadn’t mopped before, you’d either ask a follow-up question or risk making a mess. There are things in the middle.
Figure 2: Instruction specificity at McDonald’s
These examples highlight the ambiguity of instructions in human contexts. Some are precise, some are vague, and many fall somewhere in between.
When doing performance reviews, it is important to give specific feedback. This is something we have probably heard in many engineering talks about managing good teams. At McDonald’s, for example, performance reviews included questions like, “How many swirls are you putting on the ice cream cone?” I always got in trouble for too many swirls, which is probably why the machine kept breaking. It is what it is.
The feedback was often specific, but sometimes it was not. Metrics for LLMs can feel similar, not because LLMs are human, but because the framework for giving feedback works the same way. Vague feedback like “You’re doing great” in a performance review is not helpful. What are you supposed to do with that? It is the same for LLMs. If someone says, “There was a hallucination”, I will be like, “Great. What am I supposed to do with this information?”
Models as Systems
Let’s talk about models as systems. If you’ve done observability work (e.g. writing metrics, traces, and logs), you already know the importance of monitoring. The same applies to large language model systems. You can’t just deploy a model, close your eyes, and run away. That approach guarantees a bad day when you get paged. Observability involves three main types of events: logs, metrics and traces.
- Logs: What happened?
- Metrics: How much of it happened?
- Traces: Why did it happen?
These range from less granular (metrics) to highly detailed (traces). For LLMs, metrics can focus on areas like model degradation and content moderation. For model degradation, metrics like latency can quickly identify issues with a provider or inference point. Scoring model responses takes longer (seconds or even minutes). Offline tasks like selecting the best model might take weeks or even months in an enterprise setting.
For content moderation, metrics need to work in real time. If you’re facing a spam attack, a batch job next week won’t help. You need to figure out your metric’s purpose, how much latency there will be, and how you define an action going forward.
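As a rough illustration of a real-time degradation metric, the following sketch wraps an arbitrary model call with a latency measurement and a threshold warning; the generate_fn parameter and the threshold value are placeholders, not production settings.

```python
import time
import logging

logger = logging.getLogger("llm_metrics")
LATENCY_THRESHOLD_S = 5.0  # example threshold; tune per provider and use case

def timed_call(generate_fn, prompt: str) -> str:
    """Wrap any model call and emit a latency metric; generate_fn is a placeholder."""
    start = time.perf_counter()
    response = generate_fn(prompt)
    latency = time.perf_counter() - start

    # In production this would go to your metrics backend; here we just log it
    logger.info("llm.latency_seconds=%.3f", latency)
    if latency > LATENCY_THRESHOLD_S:
        logger.warning("llm.latency_slow prompt_len=%d", len(prompt))
    return response
```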
Let’s dig deeper into metrics by dividing applications into real-time and async categories:
- Real-time metrics are crucial for detecting immediate issues like model degradation, events timing out, or the model returning garbage.
- Async metrics are for tasks like model selection, which might involve running evaluations or even philosophical debates.
- Guardrails can function in both real-time and async modes, depending on the situation.
Figure 3: Real-time versus async metrics and guardrails
You can define a million metrics, but at the end of the day, your metrics should help drive the business or technical decisions for the next three months. To put it in mathematical terms, as an analogy, metrics should give you both magnitude and direction.
Build Metrics That Alert You to User Issues
It’s important to build metrics that alert you to user issues, whether immediate or long-term. Whether you’re building an internal or external product or business, if your product doesn’t work, users are going to leave.
Going back to the earlier example of the LLM responding in the wrong language: a user flagged the issue, and we verified it before it affected enterprise customers. While the problem was really hard to reproduce, we found one instance in the logs showing the wrong response. To address it, we added a guardrail: a monitor that checked the response language in milliseconds and retried if it detected a mismatch. This online approach worked better than storing the issue for later action.
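A guardrail like this can be sketched in a few lines. The version below uses the langdetect package purely as an example, and the generate function and retry count are assumptions rather than what we shipped.

```python
# Rough sketch of a language-check guardrail with retries; generate() is a
# placeholder for your model call, and max_retries is an arbitrary choice.
from langdetect import detect

def generate_with_language_guardrail(generate, prompt: str,
                                     expected_lang: str, max_retries: int = 2) -> str:
    for _ in range(max_retries + 1):
        response = generate(prompt)
        if detect(response) == expected_lang:
            return response
        # Language mismatch: retry with a fresh generation instead of
        # returning the wrong-language answer to the end user
    return response  # all attempts mismatched; return the last one or escalate
```

The check itself runs in milliseconds, which is what makes it viable as an online guardrail rather than an offline batch job.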
When deciding whether to handle metrics online or offline (synchronous vs. asynchronous), it depends on the use case. For example, in content moderation, you might flag or ignore inappropriate input in real time. Always evaluate how a metric impacts your business by thinking through the scenario and weighing the outcomes.
When building a product, whether internal or external, you want to get your customers’ trust. First, you need a product that works, at least sometimes. Then, you want to delight your customers, classic sales 101. Customer trust is like an island. As long as your product works and people are buying it, you’re on solid ground.
Figure 4: The island of customer trust
When things break, like a model responding in the wrong language, you lose trust. Your customers are angry because their customers are complaining. There are steps you can take: offer refunds to address the inconvenience, implement a fix like auto-retries, and write a root cause analysis (RCA) to explain what went wrong and how it’s been resolved. Whether or not this gets you back to the island of customer trust depends on your customer, but the goal is always to rebuild trust by ensuring your product works as intended.
The more complex the systems you build, the more complex the observability. A highly intricate LLM pipeline with all the latest techniques, like RAG, is harder to debug and monitor. Breaking RAG into two components (retrieval and generation) can simplify things. For retrieval, focus on providing the right context: ensure the information is relevant and avoids unnecessary details that could harm the generation process, balancing precision and recall if rankings are known. For generation, metrics might include correct formatting, accurate answers, and avoiding extraneous information. You can also refine this further with measures like accuracy, proper length, correct persona, or even specific rules like ensuring the LLM doesn’t say “delve”. RAG’s multiple components mean different metrics for different parts.
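For the retrieval half, precision and recall at k are straightforward to compute once you have labels for which documents are relevant to each query; the following is a minimal sketch with invented document IDs.

```python
# Minimal retrieval metrics sketch; document IDs and labels are invented.
def precision_recall_at_k(retrieved_ids, relevant_ids, k=5):
    top_k = retrieved_ids[:k]
    hits = len(set(top_k) & set(relevant_ids))
    precision = hits / k if k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

retrieved = ["doc7", "doc2", "doc9", "doc1", "doc4"]
relevant = ["doc2", "doc4", "doc8"]
print(precision_recall_at_k(retrieved, relevant, k=5))  # (0.4, 0.666...)
```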
Focus on Business Metrics
By now, you’ve likely thought of some metrics for your use case, but at the end of the day, they need to drive business value. For example, what’s the cost of a not-safe-for-work response from your LLM? Every business is different: the answer depends on who you’re selling to and on the context. Your business team needs to determine these costs. Similarly, if you’re building a legal LLM and it provides bad advice (like suggesting someone “go sue your neighbor”), that’s a serious issue. You need to put these mistakes in dollar terms and decide how much to invest in metrics, how much extra to pay for safeguards, or how much latency to tolerate for online checks. For example, what’s the cost of a bad translation in our earlier scenario?
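The arithmetic behind that decision is simple. Here is a back-of-the-envelope sketch in which every number is an invented assumption, used only to show the shape of the calculation.

```python
# Back-of-the-envelope cost model; every number is an invented assumption,
# not a benchmark. The point is the shape: expected loss vs. guardrail cost.
monthly_requests = 100_000
bad_response_rate = 0.02            # 2% of responses have the failure you care about
cost_per_bad_response = 5.00        # estimated dollar impact (refunds, support, churn)
guardrail_cost_per_request = 0.001  # extra inference/latency cost of an online check
guardrail_catch_rate = 0.8          # fraction of bad responses the guardrail prevents

expected_loss = monthly_requests * bad_response_rate * cost_per_bad_response
guardrail_cost = monthly_requests * guardrail_cost_per_request
avoided_loss = expected_loss * guardrail_catch_rate

print(f"Expected monthly loss without guardrail: ${expected_loss:,.0f}")
print(f"Guardrail cost: ${guardrail_cost:,.0f}, avoided loss: ${avoided_loss:,.0f}")
```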
The reason we build metrics and use LLMs, at the end of the day, is to save human time. All the automation and all the fancy applications being developed are about saving time. Unless, of course, you’re building a social media app, where the goal is to keep people engaged for as long as possible. You might think, “I don’t fully know the business. This is not my job. I’m a developer. I write code”. First, no. Understand your business. Know what you’re building and the problem you’re solving. Second, it’s fair to expect your business team to do most of the work here; after all, that’s their job.
The business team should define use cases, explain how features integrate with the product, measure ROI, and choose the right model. While you should be part of these conversations, metrics aren’t just a technical responsibility. In the world of LLMs, where these models are embedded in a variety of products, the business team must take the time to define what metrics make sense for your product.
Make sure your metrics align with your current goals and the lessons learned along the way. When launching an LLM application, you will inevitably learn many things. Make sure you have a cleanup process to phase out outdated metrics.
Crawl, Walk, Run
Finally, here are some more actionable tips. Don’t jump into the deep end right away. Start with a crawl, walk, run approach. This applies to metrics as well. Begin by thoroughly understanding your use cases and ensuring your technical teams are aligned. That’s generally how I think about measuring any kind of LLM maturity and LLM metric maturity.
Figure 5: The LLM metric maturity using a Crawl, Walk, Run methodology
Starting with the crawl stage, there are different prerequisites before implementing these metrics. You need to know what you’re building and why you’re building it. You want to have datasets for evaluations. If you don’t, take the time to create some. You’ll also need basic scoring criteria and logging in place. This lets you track your system, understand what’s happening, and determine what’s generally right or wrong. Some example metrics to start with might include moderation or an accuracy metric based on your evaluation datasets. Again, these metrics may not be perfect, but they’re a great place to start.
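A crawl-stage accuracy metric can be as simple as the following sketch, which scores a model against a small golden dataset and logs the result; the dataset format, generate function, and substring-based scoring are placeholders to be replaced as your evaluation matures.

```python
# Crawl-stage sketch: accuracy over a golden dataset, with basic logging.
# The dataset path, generate(), and is_correct() are placeholders.
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("crawl_eval")

def is_correct(response: str, expected: str) -> bool:
    # Start simple: substring match; swap in stricter scoring later
    return expected.lower() in response.lower()

def run_eval(generate, dataset_path: str) -> float:
    with open(dataset_path) as f:
        examples = json.load(f)  # expected format: [{"prompt": ..., "expected": ...}, ...]
    correct = sum(is_correct(generate(ex["prompt"]), ex["expected"]) for ex in examples)
    accuracy = correct / len(examples)
    logger.info("eval.accuracy=%.3f n=%d", accuracy, len(examples))
    return accuracy
```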
At the walk stage, you should have a solid understanding of your system’s challenges, including where the weaknesses lie, what’s working, and what’s not. By now, you should have a clear hypothesis about how to address these issues or at least how to investigate them further. There should be a feedback loop in place to test your hypothesis, gather feedback (whether through logs or user data), and tackle these concerns. Ideally, you’ve already made attempts at using basic metrics, and now it’s time to get more specific. For instance, you might add retrieval metrics such as normalized discounted cumulative gain (NDCG), answer consistency to optimize settings like temperature and assess tradeoffs, or language detection. These more specific metrics require a bit more infrastructure to implement effectively.
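As an example of a walk-stage retrieval metric, here is a small NDCG@k sketch; the relevance grades are assumed to come from your own labels, and the ideal ranking is computed over the retrieved list for simplicity.

```python
# Small NDCG@k sketch; relevance grades (higher = more relevant) are assumed
# to come from your own labels. Ideal DCG is computed over the retrieved
# list here, a simplification of the full definition.
import math

def dcg(relevances):
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg_at_k(retrieved_relevances, k=5):
    top_k = retrieved_relevances[:k]
    ideal = sorted(retrieved_relevances, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(top_k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example: the second-most-relevant document was ranked first
print(ndcg_at_k([2, 3, 0, 1], k=4))  # ~0.91
```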
At the run stage, you should be up on stage talking about the cool things you’re doing. You have a lot of good automation built in-house, such as auto prompt tuning, and your metrics are aligned with specific goals. You likely have a lot of high-quality data, which can be used for fine-tuning, although that remains a business decision. At this level, your metrics are whatever you want them to be. You understand your system and product, and you can figure out what the micro metrics are.
Summary
We covered five key lessons. Single metrics can be flawed. Hopefully, my potato examples made that clear. Models aren’t just standalone LLMs; they’re part of broader systems, especially as complexity grows with features like RAG, tool use, or other integrations. It’s crucial to build metrics that alert you to user issues, focusing on what impacts your product and aligns with your business goals.
When improving products with LLMs, keep it simple and follow the crawl, walk, run methodology. The worst thing you can do is overload yourself with dashboards full of 20 metrics that don’t drive action. Don’t overcomplicate it.