Key Takeaways
- Each problem in the AI space has unique challenges. Once you’ve been serving production traffic, you’ll find edge cases and scenarios you want to measure.
- Consider models as systems: LLMs are part of broader systems. Their performance and reliability require careful observability, guardrails, and alignment with user and business objectives.
- Build metrics that alert you to user issues, and make sure you have a cleanup process to phase out outdated metrics.
- Focus on business direction. Build metrics that align with your current goals and the lessons learned along the way.
- Don’t overcomplicate it. Adopt a crawl, walk, run methodology to incrementally develop metrics, infrastructure, and system maturity.
Denys Linkov presented the talk “A Framework for Building Micro Metrics for LLM System Evaluation” at QCon San Francisco. This article is based on that talk, which starts by explaining the challenges of LLM accuracy and then shows how to create, track, and revise micro LLM metrics to improve LLM systems.
Have you ever changed a system prompt and ended up causing issues in production? You run all the tests, and hopefully, you have evaluations before changing your models, and everything passes. Then, things are going well until someone pings you in the Discord server saying everything is broken.
One real scenario that led to the idea of micro metrics happened at Voiceflow, an AI agent platform, when I changed the system prompts that govern how we interact with models.
Someone was prompting a model in German while having a conversation with their end user. By the fifth turn of the conversation, the model suddenly responded in English. The customer was mad, wondering why their chatbot switched to English after speaking German the entire time. They were confused, and so were we.
Building LLM platforms, or any kind of platform, is challenging.
What Makes a Good LLM Response?
When you’re building an LLM application, what makes a good LLM response? It’s a pretty philosophical question because it’s hard to get people to agree on what good means.
LLM responses are attractive but can be misleading: they sound convincing even when they’re wrong. Not only do people often disagree on what counts as good, they sometimes don’t even read responses closely. To evaluate responses, you might use regex or exact matches, cosine similarity against golden datasets, LLMs as judges, or traditional data science metrics.
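To make the simplest of these concrete, here is a minimal sketch of exact-match and regex checks against a tiny golden dataset; the dataset contents and helper names are made up for illustration.

```python
import re

# Hypothetical golden dataset: prompt -> expected answer (illustrative only)
golden = {
    "What is the capital of France?": "Paris",
    "What year did the Berlin Wall fall?": "1989",
}

def exact_match(response: str, expected: str) -> bool:
    # Normalize whitespace and case before comparing
    return response.strip().lower() == expected.strip().lower()

def regex_match(response: str, pattern: str) -> bool:
    # Pass if the expected pattern appears anywhere in the response
    return re.search(pattern, response, flags=re.IGNORECASE) is not None

response = "The capital of France is Paris."
print(exact_match(response, golden["What is the capital of France?"]))  # False: extra words
print(regex_match(response, r"\bParis\b"))                              # True: pattern found
```

Exact match is strict and brittle, which is exactly why teams reach for the fuzzier options discussed next.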
The Flaws of One Metric
Let’s start off with some lessons I’ve learned. The first is the flaw of a single metric. Take semantic similarity, which powers RAG by searching for similar phrases. Here’s an example: I’m comparing the phrase “I like to eat potatoes” with three phrases using three models – OpenAI’s latest and two high-ranking open-source models. Can you guess which phrase they matched most closely?
The options are:
- I am a potato
- I am a human
- I am hungry
Figure 1: The challenges of semantic similarity
All three models chose “I am a potato”. This creates an odd dynamic. Saying “I like to eat potatoes” and matching it to “I am a potato” highlights the flaws of relying on cosine or semantic similarity alone. Realistically, “I am hungry” or even “I am a human” makes more sense. Metrics don’t work all the time.
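If you want to reproduce this kind of comparison yourself, here is a minimal sketch using the sentence-transformers library; the model name is just an example, and your scores will not match the three models discussed above.

```python
# A minimal embedding-similarity sketch; the model name is an example choice.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "I like to eat potatoes"
candidates = ["I am a potato", "I am a human", "I am hungry"]

query_emb = model.encode(query, convert_to_tensor=True)
candidate_embs = model.encode(candidates, convert_to_tensor=True)

# Cosine similarity between the query and each candidate phrase
scores = util.cos_sim(query_emb, candidate_embs)[0]
for phrase, score in zip(candidates, scores):
    print(f"{phrase}: {score.item():.3f}")
```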
LLM as a Judge
Let’s talk about the challenges of using LLMs as judges, a common practice, especially with GPT-4. Many use LLMs to evaluate responses when they don’t want to review them manually. However, these models have biases. For example, a 2023 paper found GPT-4 aligns poorly with human judgment on short prompts but performs better with longer ones. We’ve seen this bias through several different studies. This is an interesting concept: these models are trained to mimic certain human tendencies, and certain preferences emerge from that training.
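For reference, an LLM-as-judge setup can be as simple as the following sketch, which assumes the OpenAI Python SDK, an example model name, and a made-up one-to-five rubric; it is not a recommended configuration, and the biases described above still apply.

```python
# A minimal LLM-as-judge sketch; model name and rubric are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge(question: str, answer: str) -> str:
    prompt = (
        "You are grading an answer for correctness and relevance.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Reply with a single score from 1 (poor) to 5 (excellent) "
        "and one sentence of justification."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

print(judge("What causes tides?", "The gravitational pull of the Moon and the Sun."))
```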
But what about humans? Are we reliable judges? Let’s look at standardized exams. There was some research done almost 20 years ago on the SAT essay, where the researcher found that if you simply looked at the length of the essay, it correlated very well with how examiners scored it. This reveals a similar bias in human judgment: we prioritize superficial metrics like length.
What does it mean to be good? Would you rather watch a YouTube video about cats or about LLMs? Consider another example: baby cat videos vs. a Karpathy lecture. The cat videos got 36 million views compared to the lecture’s 4 million. We say, “Cats are better than LLMs. Obviously, we should serve only cat content to people”. Social media might think so, but it shows the limitations of metrics like views or accuracy. These measures, on their own, are flawed. You could probably reach that conclusion on your own.
If we talk about how we give instructions to people, we generally give pretty specific instructions for some tasks, but vaguer instructions for others. For example, when I worked at McDonald’s (a true character-building experience), the instructions for cooking chicken nuggets were extremely specific. The manual detailed the exact cooking time, and beepers would go off if you didn’t lift the nuggets on time. But then there were tasks like “mop the floor”, where the instructions were vague. If you hadn’t mopped before, you’d either ask a follow-up question or risk making a mess. There are things in the middle.
Figure 2: Instruction specificity at McDonald’s
These examples highlight the ambiguity of instructions in human contexts. Some are precise, some are vague, and many fall somewhere in between.
When doing performance reviews, it is important to give specific feedback. This is something we have probably heard in many engineering talks about managing good teams. At McDonald’s, for example, performance reviews included questions like, “How many swirls are you putting on the ice cream cone?” I always got in trouble for too many swirls, which is probably why the machine kept breaking. It is what it is.
The feedback was often specific, but sometimes it was not. Metrics for LLMs can feel similar, not because LLMs are human, but because the framework for giving feedback works the same way. Vague feedback like “You’re doing great” in a performance review is not helpful. What are you supposed to do with that? It is the same for LLMs. If someone says, “There was a hallucination”, I will be like, “Great. What am I supposed to do with this information?”
Models as Systems
Let’s talk about models as systems. If you’ve done observability work (e.g. writing metrics, traces, and logs), you already know the importance of monitoring. The same applies to large language model systems. You can’t just deploy a model, close your eyes, and run away. That approach guarantees a bad day when you get paged. Observability involves three main types of events: logs, metrics and traces.
- Logs: What happened?
- Metrics: How much of it happened?
- Traces: Why did it happen?
These range from less granular (metrics) to highly detailed (traces). For LLMs, metrics can focus on areas like model degradation and content moderation. For model degradation, metrics like latency can quickly identify issues with a provider or inference point. Scoring model responses takes longer (seconds or even minutes). Offline tasks like selecting the best model might take weeks or even months in an enterprise setting.
For content moderation, metrics need to work in real time. If you’re facing a spam attack, a batch job next week won’t help. You need to figure out your metric’s purpose, how much latency there will be, and how you define an action going forward.
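As a rough illustration of a real-time degradation metric, the following sketch wraps an arbitrary model call with a latency measurement and a threshold warning; the generate_fn parameter and the threshold value are placeholders, not production settings.

```python
import time
import logging

logger = logging.getLogger("llm_metrics")
LATENCY_THRESHOLD_S = 5.0  # example threshold; tune per provider and use case

def timed_call(generate_fn, prompt: str) -> str:
    """Wrap any model call and emit a latency metric; generate_fn is a placeholder."""
    start = time.perf_counter()
    response = generate_fn(prompt)
    latency = time.perf_counter() - start

    # In production this would go to your metrics backend; here we just log it
    logger.info("llm.latency_seconds=%.3f", latency)
    if latency > LATENCY_THRESHOLD_S:
        logger.warning("llm.latency_slow prompt_len=%d", len(prompt))
    return response
```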
Let’s dig deeper into metrics by dividing applications into real-time and async categories:
- Real-time metrics are crucial for detecting immediate issues like model degradation, events timing out, or the model returning garbage.
- Async metrics are for tasks like model selection, which might involve running evaluations or even philosophical debates.
- Guardrails can function in both real-time and async modes, depending on the situation.
Figure 3: Real-time versus async metrics and guardrails
You can define a million metrics, but at the end of the day, your metrics should help drive the business or technical decisions for the next three months. To put it in mathematical terms, as an analogy, metrics should give you both magnitude and direction.
Build Metrics That Alert You to User Issues
It’s important to build metrics that alert you to user issues, whether immediate or long-term. Whether you’re building an internal or external product or business, if your product doesn’t work, users are going to leave.
Going back to the earlier example of the LLM responding in the wrong language: a user flagged the issue, and we verified it before it affected enterprise customers. While the problem was really hard to reproduce, we found one instance in the logs showing the wrong response. To address it, we added a guardrail: a monitor that checked the response language in milliseconds and retried if it detected a mismatch. This online approach worked better than storing the issue for later action.
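A guardrail like this can be sketched in a few lines. The version below uses the langdetect package purely as an example, and the generate function and retry count are assumptions rather than what we shipped.

```python
# Rough sketch of a language-check guardrail with retries; generate() is a
# placeholder for your model call, and max_retries is an arbitrary choice.
from langdetect import detect

def generate_with_language_guardrail(generate, prompt: str,
                                     expected_lang: str, max_retries: int = 2) -> str:
    for _ in range(max_retries + 1):
        response = generate(prompt)
        if detect(response) == expected_lang:
            return response
        # Language mismatch: retry with a fresh generation instead of
        # returning the wrong-language answer to the end user
    return response  # all attempts mismatched; return the last one or escalate
```

The check itself runs in milliseconds, which is what makes it viable as an online guardrail rather than an offline batch job.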
When deciding whether to handle metrics online or offline (synchronous vs. asynchronous), it depends on the use case. For example, in content moderation, you might flag or ignore inappropriate input in real time. Always evaluate how a metric impacts your business by thinking through the scenario and weighing the outcomes.
When building a product, whether internal or external, you want to get your customers’ trust. First, you need a product that works, at least sometimes. Then, you want to delight your customers, classic sales 101. Customer trust is like an island. As long as your product works and people are buying it, you’re on solid ground.
Figure 4: The island of customer trust
When things break, like a model responding in the wrong language, you lose trust. Your customers are angry because their customers are complaining. There are steps you can take: offer refunds to address the inconvenience, implement a fix like auto-retries, and write a root cause analysis (RCA) to explain what went wrong and how it’s been resolved. Whether or not this gets you back to the island of customer trust depends on your customer, but the goal is always to rebuild trust by ensuring your product works as intended.
The more complex the systems you build, the more complex the observability. A highly intricate LLM pipeline with all the latest techniques, like RAG, is harder to debug and monitor. Breaking RAG into two components (retrieval and generation) can simplify things. For retrieval, focus on providing the right context: ensure the information is relevant and avoids unnecessary details that could harm the generation process, balancing precision and recall if rankings are known. For generation, metrics might include correct formatting, accurate answers, and avoiding extraneous information. You can also refine this further with measures like accuracy, proper length, correct persona, or even specific rules like ensuring the LLM doesn’t say “delve”. RAG’s multiple components mean different metrics for different parts.
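For the retrieval half, precision and recall at k are straightforward to compute once you have labels for which documents are relevant to each query; the following is a minimal sketch with invented document IDs.

```python
# Minimal retrieval metrics sketch; document IDs and labels are invented.
def precision_recall_at_k(retrieved_ids, relevant_ids, k=5):
    top_k = retrieved_ids[:k]
    hits = len(set(top_k) & set(relevant_ids))
    precision = hits / k if k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

retrieved = ["doc7", "doc2", "doc9", "doc1", "doc4"]
relevant = ["doc2", "doc4", "doc8"]
print(precision_recall_at_k(retrieved, relevant, k=5))  # (0.4, 0.666...)
```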
Focus on Business Metrics
By now, you’ve likely thought of some metrics for your use case, but at the end of the day, they need to drive business value. For example, what’s the cost of a not-safe-for-work response from your LLM? Every business is different: the answer depends on who you’re selling to and on the context. Your business team needs to determine these costs. Similarly, if you’re building a legal LLM and it provides bad advice (like suggesting someone “go sue your neighbor”), that’s a serious issue. You need to put these mistakes in dollar terms and decide how much to invest in metrics, how much extra to pay for safeguards, or how much latency to tolerate for online checks. For example, what’s the cost of a bad translation in our earlier scenario?
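The arithmetic behind that decision is simple. Here is a back-of-the-envelope sketch in which every number is an invented assumption, used only to show the shape of the calculation.

```python
# Back-of-the-envelope cost model; every number is an invented assumption,
# not a benchmark. The point is the shape: expected loss vs. guardrail cost.
monthly_requests = 100_000
bad_response_rate = 0.02            # 2% of responses have the failure you care about
cost_per_bad_response = 5.00        # estimated dollar impact (refunds, support, churn)
guardrail_cost_per_request = 0.001  # extra inference/latency cost of an online check
guardrail_catch_rate = 0.8          # fraction of bad responses the guardrail prevents

expected_loss = monthly_requests * bad_response_rate * cost_per_bad_response
guardrail_cost = monthly_requests * guardrail_cost_per_request
avoided_loss = expected_loss * guardrail_catch_rate

print(f"Expected monthly loss without guardrail: ${expected_loss:,.0f}")
print(f"Guardrail cost: ${guardrail_cost:,.0f}, avoided loss: ${avoided_loss:,.0f}")
```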
The reason we build metrics and use LLMs, at the end of the day, is to save human time. All the automation and all the fancy applications being developed are about saving time. Unless, of course, you’re building a social media app, where the goal is to keep people engaged for as long as possible. You might think, “I don’t fully know the business. This is not my job. I’m a developer. I write code”. First, no. Understand your business. Know what you’re building and the problem you’re solving. Second, it’s fair to expect your business team to do most of the work here; after all, that’s their job.
The business team should define use cases, explain how features integrate with the product, measure ROI, and choose the right model. While you should be part of these conversations, metrics aren’t just a technical responsibility. In the world of LLMs, where these models are embedded in a variety of products, the business team must take the time to define what metrics make sense for your product.
Make sure your metrics align with your current goals and the lessons learned along the way. When launching an LLM application, you will inevitably learn many things. Make sure you have a cleanup process to phase out outdated metrics.
Crawl, Walk, Run
Finally, here are some more actionable tips. Don’t jump into the deep end right away. Start with a crawl, walk, run approach. This applies to metrics as well. Begin by thoroughly understanding your use cases and ensuring your technical teams are aligned. That’s generally how I think about measuring any kind of LLM maturity and LLM metric maturity.
Figure 5: The LLM metric maturity using a Crawl, Walk, Run methodology
Starting with the crawl stage, there are different prerequisites before implementing these metrics. You need to know what you’re building and why you’re building it. You want to have datasets for evaluations. If you don’t, take the time to create some. You’ll also need basic scoring criteria and logging in place. This lets you track your system, understand what’s happening, and determine what’s generally right or wrong. Some example metrics to start with might include moderation or an accuracy metric based on your evaluation datasets. Again, these metrics may not be perfect, but they’re a great place to start.
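A crawl-stage accuracy metric can be as simple as the following sketch, which scores a model against a small golden dataset and logs the result; the dataset format, generate function, and substring-based scoring are placeholders to be replaced as your evaluation matures.

```python
# Crawl-stage sketch: accuracy over a golden dataset, with basic logging.
# The dataset path, generate(), and is_correct() are placeholders.
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("crawl_eval")

def is_correct(response: str, expected: str) -> bool:
    # Start simple: substring match; swap in stricter scoring later
    return expected.lower() in response.lower()

def run_eval(generate, dataset_path: str) -> float:
    with open(dataset_path) as f:
        examples = json.load(f)  # expected format: [{"prompt": ..., "expected": ...}, ...]
    correct = sum(is_correct(generate(ex["prompt"]), ex["expected"]) for ex in examples)
    accuracy = correct / len(examples)
    logger.info("eval.accuracy=%.3f n=%d", accuracy, len(examples))
    return accuracy
```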
At the walk stage, you should have a solid understanding of your system’s challenges, including where the weaknesses lie, what’s working, and what’s not. By now, you should have a clear hypothesis about how to address these issues or at least how to investigate them further. There should be a feedback loop in place to test your hypothesis, gather feedback (whether through logs or user data), and tackle these concerns. Ideally, you’ve already made attempts at using basic metrics, and now it’s time to get more specific. For instance, you might add retrieval metrics such as normalized discounted cumulative gain (NDCG), answer consistency to optimize settings like temperature and assess tradeoffs, or language detection. These more specific metrics require a bit more infrastructure to implement effectively.
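As an example of a walk-stage retrieval metric, here is a small NDCG@k sketch; the relevance grades are assumed to come from your own labels, and the ideal ranking is computed over the retrieved list for simplicity.

```python
# Small NDCG@k sketch; relevance grades (higher = more relevant) are assumed
# to come from your own labels. Ideal DCG is computed over the retrieved
# list here, a simplification of the full definition.
import math

def dcg(relevances):
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg_at_k(retrieved_relevances, k=5):
    top_k = retrieved_relevances[:k]
    ideal = sorted(retrieved_relevances, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(top_k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example: the second-most-relevant document was ranked first
print(ndcg_at_k([2, 3, 0, 1], k=4))  # ~0.91
```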
At the run stage, you should be up on stage talking about the cool things you’re doing. You have a lot of good automation built in-house, such as auto prompt tuning, and your metrics are aligned with specific goals. You likely have a lot of high-quality data, which can be used for fine-tuning, although that remains a business decision. At this level, your metrics are whatever you want them to be. You understand your system and product, and you can figure out what the micro metrics are.
Summary
We covered five key lessons. Single metrics can be flawed. Hopefully, my potato examples made that clear. Models aren’t just standalone LLMs; they’re part of broader systems, especially as complexity grows with features like RAG, tool use, or other integrations. It’s crucial to build metrics that alert you to user issues, focusing on what impacts your product and aligns with your business goals.
When improving products with LLMs, keep it simple and follow the crawl, walk, run methodology. The worst thing you can do is overload yourself with dashboards full of 20 metrics that don’t drive action. Don’t overcomplicate it.