If You Can’t Test It, Don’t Deploy It: The New Rule of AI Development?

News Room – Published 3 November 2025, last updated 7:31 AM

Transcript

Olimpiu Pop: Hello everybody. I’m Olimpiu Pop, an InfoQ editor, and I have in front of me Magdalena. She has extensive experience in AI, both professionally and academically. So I’ll not spend too much time talking about it, and I’ll ask Magdalena to introduce herself.

Magdalena Picariello: Hi Olimpiu, and thank you for having me here. I'm someone on a mission to make AI more accessible, and I try to do it in all possible ways. One way is to help companies implement AI projects that really matter and optimise processes. The other way is to educate the next generation of employees and AI developers on how to build AI solutions that actually work and solve our problems. That is the academic part I wanted to mention. I want to bring those two worlds together, hopefully towards a future where AI enables us rather than annoys us.

Olimpiu Pop: Great, thank you. I think that's quite important these days, because it feels like everybody is moving towards Idiocracy – people say, "We have AI to do everything. We don't have to do anything", and then we just drop it. On the other hand, Europe has often struggled to bring academia and the commercial side together, and it's essential to involve people from both. So let's rewind a bit.

Why AI Demands a New Engineering Mindset [02:03]

It has been almost three years since Pandora's box was opened. And now everybody in the IT sector – and in any company – wants to do more with AI, even if what they really mean is generative AI. I had a conversation with someone else who has been involved in generative AI for an extended period, Jade Abbott, and she talked about the need to change your mindset: if you think like an engineer – a software engineer, or any other engineer – you'll think in binary. The values are either 0 or 1; there is no gradient between them.

But on the other hand, she said – and I totally agreed with her – that in the world of machine learning there is a gradient. Some things can be 80% true, or less, or more, and we need to envision something that continuously evolves, not just strictly true or false. What's your experience with it?

Magdalena Picariello: Yes, I can totally relate. The key problem with GenAI and LLMs is that there is often no ground truth, and software engineers are so used to binary outputs like the ones you mentioned. It either works or it doesn't. And if it doesn't, I can go back to my code to see why: which part doesn't work, what the problem is with the input and output. You can really trace it back to a particular point. Now with GenAI, when you don't have the ground truth, in the very best case you have human preferences – this sounds better, I prefer this answer – but those might not even be uniform across all human cultures, or within a company. The problem is that you cannot pinpoint the issue to a specific place in the code. It is not working – but systemically, why is it not working? It's a black box we're operating with; you only see the input and output, and you cannot really debug it.

And that's what your colleague meant, right? Not only is there a spectrum, but you're also not able to identify why. And this is really challenging for many people working with GenAI.

Olimpiu Pop: One of the reasons I came to your area of interest – your points on validation, machine learning, and AI, so to speak – is that for the last two years I have been looking for something that would allow me to do what we call testing. Validation is the more appropriate word.

Magdalena Picariello: Evals. Very often, we talk about evals.

Olimpiu Pop: At some point, I think last year, I was speaking at a conference about the dangers of relying too much on generative AI – the fact that, as you said, you don't get the same answer if you ask the same question twice, over and over again. And the conclusion I drew at that point was to create a space around your system: evaluate the output of the system and then create ranges. So, do you have any battle stories or war stories? You said you're supporting somebody – that would help us understand how you see things. Or shall we just go into the more techie stuff? What do you prefer?

Magdalena Picariello: So imagine this: 900,000 Swiss francs saved every year, 10,000 employee hours saved annually, 34% productivity boost. And all this came from a GenAI model we implemented, but actually, it came down to just three words. But the interesting part is that you get this result – almost 1 million in savings – thanks to three words, but that’s not how it starts. Actually, for us, it began with a struggle.

We were implementing a chatbot for a customer and had reached 60% accuracy. We were drowning in hallucinations. This was about two years back, at the very beginning of the GenAI era, so things were much worse then. We were trying dozens of different system prompts, and nothing worked. We would fix one thing, but then it would break another component – it was all interconnected, and you didn't understand how or why. Time was short, the budget was tight and getting tighter, and we weren't getting anywhere: more prompts, fixing this, breaking that. I think we were stuck in this mode until we realised we were solving the wrong problem. We didn't need better prompts; we didn't need more prompts. We needed a system to find the prompts that work. And that's what testing is – that's what evals are about.

It's not about finding the perfect combination of your data and the instructions to generate the output you want. Rather, it's about having a system that allows you to iterate quickly through your ideas – but also through model versions that are evolving so quickly, and through data that keeps growing on your side or on the customer's side – and being able to combine it all and very quickly see what works and what doesn't.

Stop Building Features. Start Delivering Outcomes [07:50]

Olimpiu Pop: What's the recipe for success? How should you approach it as a – I don't know – solution architect, or a company, or a non-technical person? Because obviously, nowadays with AI, everybody can do everything, and we're just watching the screen.

Magdalena Picariello: For people coming from a software background, there is a beneficial concept: what you would call test-driven development, and what I, in this context, would call data-driven development. Instead of asking yourself what the best model is and what data to feed into the system, start with the user perspective. What does the user expect from the application, from the chatbot? Then translate that back into test cases you can automate, so the testing scales. One concept that is super helpful for this is a coverage matrix. In your coverage matrix, you want to understand the business impact of the different types of questions or problems people will bring to the GenAI system. It's a matrix because it lets you capture multiple dimensions.

Let's say you're working on a customer-facing chatbot: you might have two segments – new customers and returning customers – but you'll also have different question types. You may have billing questions, general product questions, or technical implementation details that the customer needs. In the coverage matrix (I think we can put a link to an example below), you show the percentages of questions from new and returning customers across billing, technical, and generic topics. This lets you see the distribution – the spectrum of what is essential to the end user. This is the first stage of understanding what you're working with. Ideally, you get it from an existing application or its logs; if you don't have those, try to get it from customer interviews. And then what I would add to this is business importance.

So one thing is the frequency, but the other is the value that solving a specific query generates. Solving problems for new customers about the product is very valuable because it gets them in and keeps them, and it may be more important than solving a billing issue for a customer who has been with you for two years. The idea is that you have the distribution of problems on one hand and the business importance on the other, and then you multiply them together to see which test cases you should build first.
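
To make that concrete, here is a minimal sketch of the prioritisation in Python. The segments, question types, frequencies, and value weights are hypothetical, invented purely for illustration; the point is simply multiplying observed frequency by business importance to rank which test cases to build first.

```python
# Minimal coverage-matrix sketch: frequency x business value -> test priority.
# All numbers below are made up for illustration.

# Share of traffic per (segment, question type), e.g. taken from logs.
frequency = {
    ("new", "product"): 0.25,
    ("new", "billing"): 0.05,
    ("returning", "product"): 0.20,
    ("returning", "billing"): 0.30,
    ("returning", "technical"): 0.20,
}

# Relative business importance of handling each cell well.
value = {
    ("new", "product"): 10,  # wins new customers and keeps them
    ("new", "billing"): 4,
    ("returning", "product"): 5,
    ("returning", "billing"): 3,
    ("returning", "technical"): 6,
}

# Priority = frequency * business value; build test cases top-down.
ranked = sorted(frequency, key=lambda cell: frequency[cell] * value[cell], reverse=True)
for segment, qtype in ranked:
    score = frequency[(segment, qtype)] * value[(segment, qtype)]
    print(f"{segment:9s} {qtype:9s} priority={score:.2f}")
```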

Olimpiu Pop: So let me break it down from my software-engineer mindset. You mentioned test-driven development, and that's usually what drives improved architecture for a system. Then you said we can start with what users already did – extracting it either from the logs or from interviews – and build from there. The picture that was drawn is this: you plot the kinds of questions you get on a graph, and I'd expect things to cluster. You have the common cases, but you also have the outliers – the people doing something else entirely. Which category do they fit in? Do we initially look at the whole spectrum and then cluster it into categories – someone asking, as you said, about pricing – or is that something that happens in the next iteration?

Magdalena Picariello: So the first question is whether you are even able to identify it, because some of these things you only see in production, and there is nothing you can do about them beforehand. But let's say you have decent output and you're able to identify it. What I would claim is that even if it's 1 in 10,000, if it has high business value and high business impact for you, you should take it into account. That's why you look not only at the proportion – 1 in 10,000 – but also at the value of solving the problem for that particular outlier. Let me give you an example.

Imagine building a chatbot for a wine fair: people come and want to buy some wine. Most of them will be amateurs, bringing home one to six bottles. But 1 in 1,000 may be a big restaurant chain that suddenly wants to place an order for 1,000 bottles a month. This is someone who behaves very differently, as a one-off, but who has more value than hundreds of other users.

Your AI Accuracy Doesn’t Matter. Your Business Impact Does [12:59]

Olimpiu Pop: So what I hear you saying is: focus on the business metric that matters rather than treating every request the same. If you can spot that one-off and look at it differently, you can provide all the value your business needs – and that's one crucial aspect you have to bear in mind.

Magdalena Picariello: Exactly. Ideally, you want to quantify the value of solving each specific case in terms of your business KPIs. In the wine example, it's just revenue: they care how many bottles they sell, basically. Then I would prioritise test cases based on expected revenue and solve the cases that generate the most revenue first.

Olimpiu Pop: That's something that will make a lot of people on the financial side very happy – a good ROI story. Great. Okay, we have written our tests. What do we do next?

Magdalena Picariello: If you have written your tests, I congratulate you, because getting that kind of coverage is easily six weeks of work. Once you have your tests and implement your pipeline – your agent or chatbot – you start running combinations of your system prompts with some data formatting, and honestly, you just see what works and what doesn't. You test ideas. In the anecdote I mentioned, we saw such a huge boost because we added three words to every system prompt. To find those three words, I think we tried hundreds of prompts. And you can only try hundreds of them if you have well-designed, fully automated test cases.
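
As a rough illustration of that workflow, here is a minimal Python sketch of iterating candidate system prompts over one automated test set. `run_model` and the scoring function are hypothetical stand-ins (the real call would hit your LLM stack); the point is the shape of the loop: many prompt variants, one fixed test suite, comparable scores.

```python
# Sketch: try many system prompts against one automated test suite.

test_cases = [
    # (user question, expected answer)
    ("How do I reset my password?", "Use the 'Forgot password' link on the login page."),
    ("What does the Pro plan cost?", "The Pro plan costs 20 francs per month."),
]

def run_model(system_prompt: str, question: str) -> str:
    """Placeholder for the actual LLM call (API client, local model, ...)."""
    return "The Pro plan costs 20 francs per month."  # canned answer for the demo

def score(answer: str, expected: str) -> float:
    """Toy metric: word overlap with the expected answer, in [0, 1]."""
    a, e = set(answer.lower().split()), set(expected.lower().split())
    return len(a & e) / len(e) if e else 0.0

def evaluate_prompt(system_prompt: str) -> float:
    """Average score of one candidate system prompt over all test cases."""
    return sum(score(run_model(system_prompt, q), exp) for q, exp in test_cases) / len(test_cases)

candidate_prompts = [
    "You are a helpful support assistant. Answer concisely.",
    "You are a helpful support assistant. Answer concisely. Cite your sources.",
]

# With automated tests, trying hundreds of prompts is just a loop.
for prompt in sorted(candidate_prompts, key=evaluate_prompt, reverse=True):
    print(f"{evaluate_prompt(prompt):.2f}  {prompt}")
```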

Olimpiu Pop: I have to ask this, because it keeps going over and over in my head. Do you write the tests, or generate them? Because generating everything is very popular now: you just ask GenAI to create the tests for you, and you don't do anything.

Magdalena Picariello: You can generate the tests, but you still need a human to convert whatever the user told you into a numeric value, basically – and there's no way the AI can do that for you.

Olimpiu Pop: So what I hear you saying, loud and clear, is: keep the human in the loop. Make sure somebody validates the generated tests, because in the end you are responsible for what you use, regardless of the tool you used to write them.

Magdalena Picariello: Yes, that sounds right.

Olimpiu Pop: For a long time, everybody was focusing on the first L, the large one. Then we had a turnaround and said, "We can put an S in front of the LLM", and we had small and large language models – more or less an oxymoron. Then everybody said you should use something in the cloud because it's impossible to run these models on local machines – and now it's increasingly possible to do it on smaller machines. A lot has happened, especially this year: DeepSeek showed that there are alternatives to the large GPUs. All of this gives us a lot of options. I feel like I'm in a restaurant and don't know what to order – everything will satisfy my body's need for calories, but which is the best option for me?

Magdalena Picariello: I think this is a widespread problem for us geeks, nerds, techies: there is so much out there. This is the new star, we should try the latest model – and, oh my god, did you see the Gemini context window? It's so big, and Flash is so fast – but wait, OpenAI has the fifth version of GPT, right? Shall we try it all? But that starts with the wrong question, because at the end of the day you're solving a problem for someone, and that problem is very rarely about which model sits behind the application. It's more like: am I getting the answers? Are the answers precise enough? Are there too many hallucinations? Is that within a reasonable range? Maybe latency is the issue. So I think if you encode everything that matters for your users in the tests, you'll actually see which tests perform poorly.

Here we go back to your question about the spectrum. You will have a lot of tests that are not pass-or-fail but have some probability attached to them, and it's up to you to interpret them. Once you have these tests, you can quickly pinpoint the issues. Let's say latency is an issue for you – then okay, you either spin up a bigger cluster in the cloud or you try a smaller model. The smaller model is cheaper, so you want to try it out. Once you have a very well-tested application, you basically just switch the model version in your code, run your tests, and that's all you need to see whether it solves your problem.

Olimpiu Pop: That means you should abstract the model away, so the application itself is the central part and the model is just one of the moving parts. You have a nice interface, and then you just play around and see how things work – or don't – within the frame you had in mind, right?

Magdalena Picariello: Yes. And does this solve the problem that you were trying to tackle?
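
A minimal sketch of that abstraction in Python, with hypothetical class names: the application depends only on a narrow interface, so swapping model versions means changing one constructor and re-running the same eval suite.

```python
# Sketch: hide the model behind a small interface so versions are swappable.
from typing import Protocol

class ChatModel(Protocol):
    def complete(self, system_prompt: str, user_message: str) -> str: ...

class BigCloudModel:
    """Stand-in for a large hosted model (hypothetical)."""
    def complete(self, system_prompt: str, user_message: str) -> str:
        return "answer from the big model"

class SmallLocalModel:
    """Stand-in for a cheaper, faster local model (hypothetical)."""
    def complete(self, system_prompt: str, user_message: str) -> str:
        return "answer from the small model"

def answer_user(model: ChatModel, question: str) -> str:
    """The application only knows the interface, not the vendor."""
    return model.complete("You are a support assistant.", question)

# Swapping the model is one line; then re-run the same eval suite to
# check quality, latency, and cost against your thresholds.
model: ChatModel = SmallLocalModel()  # was: BigCloudModel()
print(answer_user(model, "What does the Pro plan cost?"))
```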

You Can't Sneak a Peek Into a Black Box. But You Can Observe Human Reactions [18:26]

Olimpiu Pop: We discussed verification and testing in a very abstract way. Are there any tools you can recommend? Or is everybody building their own tool and playing with it?

Magdalena Picariello: There are tools, and it's tricky because the universe is evolving so quickly, but I'll give you some names today. I really like DeepEval; I really like Opik. Some people love Evidently AI, MLflow, Langfuse, and then there is something from OpenAI that was published a few months back. One piece of feedback I keep hearing from my colleagues and other developers is that it's difficult to choose a tool because everything evolves so quickly: whatever your needs are today, they might not be the same tomorrow, and the tooling evolves too. We can put a link to the playbook where I compare the tooling by capability – experiment tracking, evaluation framework, custom metrics, human evaluation, prompt management, observability and tracing, and so on. But just bear in mind: whatever is there is valid as of today, and in six months it will be very different.
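
For flavour, here is roughly what a single graded test might look like in DeepEval, one of the tools just mentioned. Treat it as a sketch: these libraries change quickly, and the exact class names and signatures may differ in the version you install.

```python
# Sketch of a DeepEval-style eval; API details may vary between versions.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What does the Pro plan cost?",
    actual_output="The Pro plan costs 20 francs per month.",
)

# A graded metric returns a score in [0, 1] with a pass threshold --
# the "spectrum, not pass/fail" idea from the conversation.
metric = AnswerRelevancyMetric(threshold=0.7)

evaluate(test_cases=[test_case], metrics=[metric])
```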

Olimpiu Pop: Yes, we'll treat it like Schrödinger's cat – it's either in the box or it's not, and we all know how that goes. You mentioned a couple of interesting points. You mentioned evaluation, and that's what we touched upon. But then there is the other point: observability. That's becoming increasingly important as people move more into deploying these systems – you need to see what the customer is doing. Observability is obviously a practice that has gained a lot of momentum lately, but as you said, AI is a black box. How do you observe a black box?

Magdalena Picariello: That's an excellent question. At the end of the day, you don't observe the black box; you observe the users in front of it. You want to see how people interact with your application. So all observability metrics are really about understanding whether you're solving users' problems and – in a more salesy context – keeping them engaged. And this you can see from the logs: how do the conversations go? Where do they stall? When people stop asking questions, is it because the problem is solved, or because they asked the same thing five times in five different ways and got nowhere?
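
As a toy example of that kind of log analysis, here is a Python sketch that flags conversations where a user keeps rephrasing the same question – a hypothetical frustration signal based on word overlap between consecutive messages.

```python
# Sketch: flag conversations where consecutive user messages look like
# rephrasings of the same question (a crude "going nowhere" signal).

def similarity(a: str, b: str) -> float:
    """Jaccard overlap between the word sets of two messages."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def looks_stuck(user_messages: list[str], threshold: float = 0.4) -> bool:
    """True if two or more consecutive messages are near-duplicates."""
    repeats = sum(
        1
        for prev, cur in zip(user_messages, user_messages[1:])
        if similarity(prev, cur) >= threshold
    )
    return repeats >= 2

conversation = [
    "How do I export my invoices?",
    "Where can I export my invoices?",
    "Can I export my invoices somehow?",
]
print(looks_stuck(conversation))  # True -> the bot is probably not helping
```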

Olimpiu Pop: That's a good answer: don't observe the car; observe how we interact with it and how it interacts back. Great. What else should I have asked you that I didn't?

Magdalena Picariello: That's a good one. One challenge for developers is transforming business KPIs into code. We're very used to getting tickets that are very well specified – and if they're not precise enough, we blame the project manager or somebody else. But here you have a big challenge: looking at the business side and then transforming it into something mathematical. And this is where many people will find the greatest value if they actually try this process out: try to capture the business in the code, in the formulas, in the metrics behind your evals.
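
A tiny sketch of what "capturing the business in code" can mean, with hypothetical numbers: weighting each test case by the revenue tied to the query it represents, so the headline eval score reflects a business KPI rather than raw accuracy.

```python
# Sketch: weight eval results by business value so the headline score is
# "share of revenue protected", not raw accuracy. All numbers are made up.

test_results = [
    # (test case, passed?, monthly revenue at stake)
    ("new-customer product question", True, 5_000.0),
    ("bulk-order outlier", False, 12_000.0),
    ("returning-customer billing", True, 800.0),
]

total = sum(revenue for _, _, revenue in test_results)
protected = sum(revenue for _, passed, revenue in test_results if passed)
accuracy = sum(passed for _, passed, _ in test_results) / len(test_results)

print(f"raw accuracy:           {accuracy:.0%}")           # 67%
print(f"revenue-weighted score: {protected / total:.0%}")  # 33%
# 67% of tests pass, but only 33% of the revenue is protected: the one
# failing outlier matters far more than the raw pass rate suggests.
```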

Olimpiu Pop: That’s very important, and I’ve been preaching it for a long time. Understanding the customer and solving the customer’s problem – not the problem you’d like the customer to have. Thank you for saying that. Thank you for all the insights, Magdalena, and let’s see where this takes us.

Magdalena Picariello: Thank you so much, Olimpiu, for having me here.

