Transcript
Srini Penchikala: Hi, everyone. My name is Srini Penchikala. I am the lead editor for the AI, ML, and data engineering community at the InfoQ website and a podcast host.
In this episode, I’ll be speaking with Apoorva Joshi, senior AI developer advocate at MongoDB. We will discuss how to develop software applications that use large language models, or LLMs, and how to evaluate these applications. We’ll also talk about how to improve the performance of these apps, with specific recommendations on techniques that can help make these applications run faster.
Hi, Apoorva. Thank you for joining me today. Can you introduce yourself and tell our listeners about your career and what areas you have been focusing on recently?
Apoorva Joshi: Sure, yes. Thanks for having me here, Srini. It’s my first time on the InfoQ Podcast, so I’m really excited to be here. I’m Apoorva. I’m a senior AI developer advocate here at MongoDB. I like to think of myself as a data scientist turned developer advocate. For the past six years or so of my career, I was a data scientist working at the intersection of cybersecurity and machine learning, applying all kinds of machine learning techniques to problems such as malware detection, phishing detection, and business email compromise in the cybersecurity space.
Then about a year or so ago, I switched tracks a little bit and moved into my first role as a developer advocate. I thought it was a pretty natural transition because even in my role as a data scientist, I used to really enjoy writing about my work and sharing it with the community at conferences, webinars, that kind of thing. In this role, I think I get to do both the things that I enjoy. I’m still kind of a data scientist, but I also tend to write and talk a bit more about my work.
Another interesting dimension to my work now is also that I get to talk to a lot of customers, which is something I always wanted to do more of. Especially in the gen AI era, it’s been really interesting to talk to customers across the board, and just hear about the kind of things they’re building, what challenges they typically run into. It’s a really good experience for me to offer them my expertise, but also learn from them about the latest techniques and such.
Srini Penchikala: Thank you. Definitely, with your background as a data scientist and a machine learning engineer, and obviously as a developer advocate working with customers, you bring the right mix of skills and expertise that the community really needs at this time, because there is so much value in generative AI technologies, but there’s also a lot of hype.
Apoorva Joshi: Yes.
Srini Penchikala: I want this podcast to be about what our listeners should be hyped about in AI, not all about the hype out there.
Let me first start by setting the context for this discussion with a quick background on large language models. Large language models, or LLMs, have been the foundation of gen AI applications. They play a critical role in developing those apps. We are seeing LLMs being used pretty much everywhere in various business and technology use cases, not only for end users and customers, but also for software engineers in terms of code generation, and for DevOps engineers. We can go on with so many different use cases that are helping the software development lifecycle.
I was talking to a friend, and they are using AI agents to automatically upgrade the software on different systems in their company, and automatically send JIRA tickets if there are issues. Agents are doing all this. They’re able to cut the work for these upgrades down from days and weeks; the patching process is down to minutes and hours. Definitely, the sky is the limit there, right?
Apoorva Joshi: Yes.
Current State of LLMs [04:18]
Srini Penchikala: What do you see? What’s the current state of LLMs, what are you seeing in the industry, how are they being used, and what use cases are they being applied to today?
Apoorva Joshi: I think there are two slightly different questions here. One is what the current state of LLMs is, and the other is how they are being applied.
To your first point, I’ve been really excited to see the shift from purely text generation models to models that generate other modalities, such as image, audio, and video. It’s been really impressive to see how the quality of these models has improved in the past year alone. There are finally benchmarks, and we are actually starting to see applications in the wild that use some of these other modalities. Yes, really exciting times ahead as these models become more prevalent and find their place in more mainstream applications.
Then coming to how LLMs are being applied today, like you said, agents are the hot thing right now. 2025 is also being touted as the year of AI agents. I’m definitely seeing that shift in my work as well. Over the past year, we’ve seen our enterprise customers move from basic RAG early or mid last year to building more advanced RAG applications using slightly more advanced techniques, such as hybrid search and parent document retrieval, all of this to improve the context being passed to LLMs for generation.
Then now, we are also seeing folks further move on to agents, so frequently hearing things like self-querying retrieval, human in the loop agents, multi-agent architectures, and stuff like that.
Srini Penchikala: Yes. You’ve been publishing and advocating about all of these topics, especially LLM-based applications, which are the focus of this podcast. We’re not going to get too much into the language models themselves.
Apoorva Joshi: Yes.
Srini Penchikala: But we’ll be talking about how those models are used in applications and how we can optimize those applications. This is for all the software developers out there.
LLM-based Application Development Lifecycle [06:16]
Yes, you’ve been publishing and advocating about how to evaluate and improve LLM application performance. Before we get into the performance side of the discussion, can you talk about the different steps involved in a typical LLM-based application? Different applications and different organizations may differ in terms of the number of steps.
Apoorva Joshi: Sure. Yes. Thinking of the most common elements, data is the first obvious big one, because LLMs work on some tasks out of the box, but most organizations want them to work on their own data or domain-specific use cases, in industries like healthcare and legal. You need something a bit more than just a powerful language model, and that’s where data becomes an important piece.
Then once you have data and you want language models to use that data to inform their responses, that’s where retrieval becomes a huge thing. Which is why things have progressed from just simple vector search or semantic search to some of these more advanced techniques, like again, hybrid search, parent document retrieval, self-querying, knowledge graphs. There’s just so much on that front as well. Then the LLM is a big piece of it if you’re building LLM-based applications.
I think one piece that a lot of companies often tend to miss is the monitoring aspect. When you put your LLM applications into production, you want to be able to know if there are regressions or performance degradations, if your application is not performing the way it should. Monitoring is the other pillar of building LLM applications.
Srini Penchikala: Sounds good. Once developers start work on these applications, I think the first thing they should probably do is evaluate the application.
Apoorva Joshi: Yes.
Evaluation of LLM-based Applications [08:02]
Srini Penchikala: What is the scope? What are the benchmarks? Because the metrics and service level agreements (SLAs) and response times can be different for different applications. Can you talk about evaluation of LLM-based applications, like what developers should be looking for? Are there any metrics that they should be focusing on?
Apoorva Joshi: Yes. I think anything with respect to LLMs is such a vast area because they’ve just opened up the floodgates for being used across multiple different domains and tasks. Evaluation is no different.
If you think of traditional ML models, like classification or regression models, you had very quantifiable metrics that applied to any use case. For classification, you would have accuracy, precision, recall. Or if you were building a regression model, you had mean squared error, that kind of thing. But with LLMs, all that’s out the window. Now the responses from these models are natural language, or an image, or some other generated modality. The metrics, when it comes to LLMs, are hard to quantify.
For example, if the model is generating a piece of text for a Q&A-based application, then metrics like how coherent the response is, how factual it is, or how relevant the information provided in the response is become more important, and these are unfortunately pretty hard to quantify.
There are two techniques that I’m seeing in the space broadly. One is this concept of LLM as a judge. The premise there is that because LLMs are good at identifying patterns and interpreting natural language, they can also be used as an evaluation mechanism for natural language responses.
The idea there is to prompt an LLM on how you want it to go about evaluating responses for your specific task and dataset, and then use the LLM to generate some sort of scoring paradigm on your data. I’ve also seen organizations that have more advanced data science teams actually putting the time and effort into creating fine-tuned models for evaluation. But yes, that’s typically reserved for teams that have the right expertise and knowledge to build a fine-tuned model, because that’s a bit more involved than prompting.
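To make the LLM-as-a-judge idea concrete, here is a minimal sketch of prompting a judge model to score an answer for faithfulness on a 1-to-5 scale. It assumes the OpenAI Python client with an API key in the environment, and the model name, prompt, and scoring criterion are placeholders for illustration, not a specific setup discussed in the episode.

```python
# Minimal LLM-as-a-judge sketch: score a response for faithfulness on a 1-5 scale.
# Assumes the OpenAI Python client and OPENAI_API_KEY set in the environment;
# any chat-completion-capable model could be swapped in for the placeholder below.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are an evaluator. Given a question, the retrieved context,
and a model's answer, rate how faithful the answer is to the context on a scale
of 1 (contradicts or invents facts) to 5 (fully grounded). Reply with only the number.

Question: {question}
Context: {context}
Answer: {answer}"""

def judge_faithfulness(question: str, context: str, answer: str) -> int:
    """Ask a judge LLM to score an answer; returns an integer score from 1 to 5."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, context=context, answer=answer)}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())

if __name__ == "__main__":
    score = judge_faithfulness(
        question="What is vector search?",
        answer="It lets you query data by semantic similarity using embeddings.",
        context="Vector search retrieves documents by comparing embedding similarity.",
    )
    print("Faithfulness score:", score)
```

In practice, you would run a judge like this over a sample of question, context, and answer triples, aggregate the scores, and spot-check them against human judgment before trusting the judge for your task.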
Domain-specific Language Models [10:31]
Srini Penchikala: Yes. You mentioned domain-specific models. Do you see, and I think this is one of my predictions, the industry starting to move towards domain-specific language models? Like healthcare would have its own healthcare LLM, and the insurance industry would have its own insurance language model.
Apoorva Joshi: I think that’s my prediction, too. Coming from this domain, when I was in cybersecurity, I used to do a lot of that. This was back when BERT was considered a large language model. A lot of my work was on fine-tuning those language models on cybersecurity-specific data. I think that’s going to start happening more and more.
I already see signals for that happening. Let’s take the example of natural language to query, which is a pretty common thing that folks are trying to do. I’ve seen that usually, with prompting or even something like RAG, you can achieve about, I would say, 90 to 95 percent accuracy or recall on slightly complicated tasks. But there’s a small set of tasks that just aren’t possible by only providing the LLM with the right information to generate responses.
For some of those cases, and more importantly for domain-specific use cases, I think we are going to pretty quickly move towards a world where there’s smaller specialized models, and then maybe an agent that’s orchestrating and helping facilitate the communication between all of them.
LLM Based Application Performance Improvements [12:02]
Srini Penchikala: Yes, definitely. I think it’s a very interesting time, not only with these domain-specific models taking shape, but also with the RAG techniques now, where you can take these base models and apply your own data on top of them. Plus, agents are taking care of a lot of these activities, automation-type tasks, on their own. Definitely, that’s really good. Thanks, Apoorva, for that.
Regarding the application performance itself, what are the high-level considerations and strategies that teams should be looking at before they jump into optimizing or over-optimizing? What are the performance concerns that you see teams running into, and what areas should they be focusing on?
Apoorva Joshi: Most times, I see teams asking about three things: accuracy, latency, and cost. When I say accuracy, what I really mean is performance on metrics that apply to a particular business use case. It might not be accuracy; it might be, I don’t know, factualness or relevance. But you get the drift. Because there are so many different use cases, it really comes down to first determining what your business cares about, and then coming up with metrics that resonate with that use case.
For example, if you’re building a Q&A chatbot, your evaluation parameters would be mainly faithfulness and relevance. But say you’re building a content moderation chatbot, then you care more about recall on toxicity and bias, for example. I think that’s the first big step.
Improvements here, again, depend on what you end up finding the gaps of the model to be. Say you’re evaluating a RAG system: you would want to evaluate the different components of the system itself first, in addition to the overall evaluation of the system. When you think of RAG, there are two components, retrieval and generation. You want to evaluate the retrieval performance separately to see if your gap lies in the retrieval strategy itself, or whether you need a different embedding model. Then you evaluate the generation to see what the gaps on the generation front are, and what improvements you need to make there.
I think work backwards. Evaluate as many different components of the system as possible to identify the gaps, and then work backwards from there to try out a few different techniques to improve the performance on the accuracy side. Guardrails are an important one, to make sure that the LLM is appropriately responding or not responding to sensitive or off-topic questions.
In agentic applications, I’ve seen folks also implement things like self-reflection and critiquing loops to have the LLM reflect and improve upon its own response. Or even human in the loop workflows, too. Get human feedback and incorporate that as a strategy to improve the response.
Maybe I’ll stop there to see if you have any follow-ups.
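As a concrete illustration of evaluating the retrieval component separately from generation, here is a small sketch that computes recall@k for a retriever over a hand-labeled set of questions and relevant chunk IDs. The `retrieve` callable, document IDs, and labeled examples are hypothetical stand-ins for whatever search and corpus you actually use.

```python
# Minimal sketch of evaluating the retrieval step on its own, separate from generation.
# `retrieve` is a stand-in for whatever search you use (vector, hybrid, etc.);
# the labeled examples map each question to the IDs of the chunks that should be returned.
from typing import Callable

def recall_at_k(retrieve: Callable[[str, int], list[str]],
                labeled: list[dict], k: int = 5) -> float:
    """Fraction of relevant chunk IDs that appear in the top-k retrieved results."""
    hits, total = 0, 0
    for example in labeled:
        retrieved = set(retrieve(example["question"], k))
        relevant = set(example["relevant_ids"])
        hits += len(retrieved & relevant)
        total += len(relevant)
    return hits / total if total else 0.0

# Toy usage with a hard-coded retriever; replace with your real retrieval call.
labeled_set = [{"question": "What is hybrid search?", "relevant_ids": ["doc-12", "doc-31"]}]
fake_retrieve = lambda question, k: ["doc-12", "doc-7", "doc-31"]
print(f"recall@5: {recall_at_k(fake_retrieve, labeled_set):.2f}")
```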
Choosing Right Embedding Model [15:02]
Srini Penchikala: Yes. No, that’s great. I think the follow-up is basically we can jump into some of those specific areas of the process. One of the steps is choosing the right embedding model. Some of these tools come with … I was trying out the Spring AI framework the other day. It comes with a default embedding model. What do you see there? Are there any specific criteria we should be using to pick one embedding model for one use case versus a different one for a different use case?
Apoorva Joshi: My general rule of thumb would be to find a few candidate models and evaluate them for your specific use case and dataset. For text data, my recommendation would be to start from something like the Massive Text Embedding Benchmark, or MTEB, on Hugging Face. It’s essentially a leaderboard that shows you how different proprietary and open source embedding models perform on different tasks, such as retrieval, classification, and clustering. It also shows you the model size and dimensions.
Yes. I would say choose a few and evaluate for performance and, say, latency if that’s a concern for you. There are similar ones for multi-modal models as well. Until recently, we didn’t have good benchmarks for multi-modal, but now we have things like MME, which is a pretty good start.
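As a rough sketch of that shortlist-and-evaluate approach, the snippet below compares two candidate open source embedding models on a toy query and document set using sentence-transformers, reporting a similarity ranking, encoding latency, and embedding dimensions. The model names are merely examples of the kind of candidates you might pull from the MTEB leaderboard, not recommendations from the conversation.

```python
# Rough sketch for comparing a few candidate embedding models on your own data.
# Uses sentence-transformers; the model names are example candidates only and
# can be swapped for whatever you shortlist from the MTEB leaderboard.
import time
from sentence_transformers import SentenceTransformer, util

candidates = ["all-MiniLM-L6-v2", "BAAI/bge-small-en-v1.5"]
queries = ["How do I rotate my API keys?"]
documents = [
    "To rotate API keys, go to the security settings and generate a new key.",
    "Our refund policy allows returns within 30 days of purchase.",
]

for name in candidates:
    model = SentenceTransformer(name)
    start = time.perf_counter()
    query_emb = model.encode(queries, normalize_embeddings=True)
    doc_emb = model.encode(documents, normalize_embeddings=True)
    latency = time.perf_counter() - start
    scores = util.cos_sim(query_emb, doc_emb)  # similarity of each query to each document
    print(f"{name}: top doc index={int(scores[0].argmax())}, "
          f"encode latency={latency:.3f}s, dims={model.get_sentence_embedding_dimension()}")
```

On real data, you would replace the toy set with queries and labeled relevant documents from your own use case and compute retrieval metrics such as recall@k, alongside latency and cost.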
Srini Penchikala: Yes. Could we talk real quick about the benchmarks? When we are switching these different components of the LLM application, what standard benchmarks can we look at or run to get the results and compare?
Apoorva Joshi: I think benchmarks apply to the models themselves more than anything else. Which is why, when you’re looking to choose models for your specific use case, you take them with a grain of salt, because of the tasks that are involved in a benchmark. If you look at the MMLU benchmark, it’s mostly a bunch of academic and professional examinations, but that might not necessarily be the task that you are evaluating for. I think benchmarks mostly apply to LLMs, but LLM applications are slightly different.
Srini Penchikala: You mentioned observability, or monitoring, earlier. If you can build it into the application right from the beginning, it will definitely help pinpoint any performance problems or latencies.
Apoorva Joshi: Exactly.
Data Chunking Strategies [17:18]
Srini Penchikala: Another technique is how the data is divided, or chunked, into smaller segments. You published an article on this. Can you talk about this a little bit more, and tell us about some of the chunking strategies for implementing LLM apps?
Apoorva Joshi: Sure, yes. As per my disclaimer from before, with LLMs the answer starts with “it depends”, and then you pick and choose. I think that’s the rule of thumb for anything when it comes to LLMs: pick and choose a few options, evaluate on your dataset and use case, and go from there.
Similarly for chunking, it depends on your specific data and use case. For most text, I typically suggest starting with the technique called recursive splitting with token overlap, with, say, a 200-ish token size for chunks. This has the effect of keeping paragraphs together, with some overlap at the chunk boundaries. This, combined with techniques such as parent document or contextual retrieval, could potentially work well if you’re working with mostly text data. Semantic chunking is another fascinating one for text, where you try to align the chunk boundaries with the semantic boundaries of your text.
Then there’s semi-structured data, which is data containing a combination of text, images, tables. For that, I’ve seen folks retrieve the text and non-textual components using specialized tools. There’s one called Unstructured that I particularly like. It supports a bunch of different formats and has different specialized models for extracting components present in different types of data. Yes, I would use a tool like that.
Then once you have those different components, you chunk the text as you normally would. There are then two ways to approach the non-textual components: you either summarize the images and tables to get everything into the text domain, or use multi-modal embedding models to embed the non-text elements as is.
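Here is a minimal sketch of the recursive splitting with token overlap described above, using LangChain’s RecursiveCharacterTextSplitter with tiktoken-based token counting and roughly 200-token chunks. The input file name is a placeholder, and the exact import path can vary between LangChain versions.

```python
# Sketch of recursive splitting with token overlap, using LangChain's
# RecursiveCharacterTextSplitter with tiktoken-based lengths.
# Requires `pip install langchain-text-splitters tiktoken`.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # tokenizer used to count tokens
    chunk_size=200,               # ~200-token chunks, as suggested above
    chunk_overlap=30,             # overlap at chunk boundaries to preserve context
)

text = open("my_document.txt").read()  # placeholder input document
chunks = splitter.split_text(text)
print(f"{len(chunks)} chunks; first chunk:\n{chunks[0][:300]}")
```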
Srini Penchikala: Yes, definitely. Because if we take the documents and if we chunk them into too small of segments, the context may be lost.
Apoorva Joshi: Exactly.
Srini Penchikala: If you provide a prompt, the response might not be exactly what you were looking for.
Apoorva Joshi: Right.
RAG Application Improvements [19:40]
Srini Penchikala: What are the other strategies, especially if you’re using a RAG-based application, which is probably the norm these days for all the companies … They’re all taking some kind of foundation model, ingesting their company data, and incorporating it on top. What other strategies are you seeing in RAG applications in terms of the retrieval or generation steps?
Apoorva Joshi: There are a lot of them coming out every single day, but I can talk about the ones I have personally experimented with. The first one would be hybrid search. This is where you combine the results from multiple different searches. It’s commonly a combination of full-text and vector search, but it doesn’t have to be that; it could be vector and graph-based. The general concept is that you’re combining results from multiple different searches to get the benefits of both.
This is useful in, say, ecommerce applications, where users might search for something very specific or include keywords in their natural language queries. For example, “I’m looking for size seven red Nike running shoes”. It’s a natural language query, but it has certain specific points of focus, or keywords, in it. An embedding model might not capture all of these details. This is where combining it with something like a full-text search might make sense.
Then there’s parent document retrieval. This is where you embed and store small chunks at ingest time, but fetch the full source document or larger chunks at retrieval time. This has the effect of providing more complete context to the LLM while generating responses. This might be useful in cases such as legal case prep or scientific research documentation chatbots, where the context surrounding the user’s question can result in more rounded responses.
Finally, there’s graph RAG, which I’ve been hearing about a lot lately. This is where you structure and store your data as a knowledge graph, where the nodes can be individual documents or chunks, and the edges capture which nodes are related and what the relationship between them is. This is particularly common in specialized domains such as healthcare, finance, and legal, or anywhere multi-hop reasoning, root cause analysis, or causal inference is required.
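As one example of how hybrid search is often implemented, the sketch below merges a full-text result list and a vector-search result list with reciprocal rank fusion (RRF). The two result lists and document IDs are hypothetical; in practice they would come from your actual full-text and vector (or graph) queries, and many databases also offer hybrid search natively.

```python
# A generic reciprocal rank fusion (RRF) sketch, one common way to merge a
# full-text result list with a vector-search result list for hybrid search.
# `text_results` and `vector_results` are stand-ins for your two ranked searches.
from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one, using 1/(k + rank) scores."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

text_results = ["shoe-42", "shoe-17", "shoe-99"]    # e.g. keyword match on "size seven red Nike"
vector_results = ["shoe-17", "shoe-42", "shoe-08"]  # e.g. semantic match on "running shoes"
print(reciprocal_rank_fusion([text_results, vector_results]))
```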
Srini Penchikala: Yes, definitely. Graph RAG has been getting a lot of attention lately. The power of the knowledge graph in RAG.
Apoorva Joshi: But that’s the thing. Going back to what you said earlier on, what’s the hype versus what people should be hyped about. I think a lot of organizations have a hard time balancing that too, because they want to be at the bleeding-edge of building these applications. But then sometimes, it might just be overkill to use the hottest technique.
Srini Penchikala: Where should development teams decide, “Hey, we started with an LLM-based application in mind, but my requirements are not a good fit”? What are those, I don’t want to call them limitations, but what are the boundaries where you say, “For now, let’s just go with the standard solution rather than bringing an LLM in to make it more complex”?
Apoorva Joshi: This is not just an LLM thing. Even having spent six years as a data scientist, a lot of times … ML in general, for the past decade or so, it’s just been a buzzword. Sometimes people just want to use it for the sake of using it. That’s where I think you need to bring a data scientist or an expert into the room and be like, “Hey, this is my use case”, and have them evaluate whether or not you even need to use machine learning, or in this case gen AI for it.
Going from traditional to gen AI, there’s now more of a preference for generative AI as well. I think at this point, the decision is, “Can I use a small language model, or just use XGBoost and get away with it? Or do I really need a RAG use case?”
But I think in general, if you want to reason and answer questions using natural language on a repository of text, then I agree, some sort of generative AI use case is important. But say you’re basically just trying to do classification, or just doing something like anomaly detection or regression, then just because an LLM can do it doesn’t mean you should, because it might not be the most efficient thing at the end of the day.
Srini Penchikala: The traditional ML solutions are still relevant, right?
Apoorva Joshi: Yes. For some things, yes.
I do want to say the beauty of LLMs is that they’ve made machine learning approachable to everyone. It’s not limited to data scientists anymore. A software engineer or a PM, someone who’s not technical, can just use these models without having to fine-tune them or worry about the weights of the model. Yes, I think that results in these pros and cons, in a sense.
Srini Penchikala: Yes, you’re right. Definitely, these LLM models and the applications that use them have brought their value to the masses. Now everybody can use ChatGPT or Copilot and get value out of it.
Apoorva Joshi: Yes.
Frameworks and Tools for LLM applications [25:03]
Srini Penchikala: Can you recommend any open source tools and frameworks for our audience to try out LLM applications if they want to learn about them before actually starting to use them?
Apoorva Joshi: Sure, yes. I’m trying to think what the easiest stack would be. If you’re looking at strictly open source, and you don’t want to put down a credit card just to experiment and build a prototype, then I think you need three things. You first need a model of some sort, whether it’s an embedding model or an LLM.
For that, I would say use something like Hugging Face. Pretty easy to get up and running with their APIs. You don’t have to pay for it. Or if you want to go a bit deeper and try out something local, then Ollama has support for a whole bunch of open source models. I like LangGraph for orchestration. It’s something LangChain came up with a while ago. A lot of people think it’s an agent orchestration framework only, but I have personally used it for just building control flows. I think you could even build a RAG application by using LangGraph. It just gives you low-level control on the flow of your LLM application.
For vector databases, if you’re looking for something that’s really quick, open source, and easy to start with, then you could start with something like Chroma or FAISS for experimentation. But of course, when you move from a prototype to putting something in production, you would want to consider enterprise-grade databases such as my employer’s.
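To illustrate the kind of low-level control flow LangGraph provides, here is a minimal two-node retrieve-then-generate graph over a typed state. The retrieval and generation functions are stubs so the example runs without API keys or a vector database; in a real application they would call your vector store (Chroma, FAISS, etc.) and an LLM.

```python
# Minimal LangGraph sketch: a two-node retrieve -> generate pipeline over typed state.
# The node functions are stubs standing in for real vector-store and LLM calls.
# Requires `pip install langgraph`.
from typing import TypedDict
from langgraph.graph import StateGraph, END

class RAGState(TypedDict):
    question: str
    context: list[str]
    answer: str

def retrieve(state: RAGState) -> dict:
    # Stand-in for a vector or hybrid search call; returns a partial state update.
    return {"context": [f"Docs related to: {state['question']}"]}

def generate(state: RAGState) -> dict:
    # Stand-in for an LLM call that uses the retrieved context.
    return {"answer": f"Answer based on {len(state['context'])} retrieved chunk(s)."}

graph = StateGraph(RAGState)
graph.add_node("retrieve", retrieve)
graph.add_node("generate", generate)
graph.set_entry_point("retrieve")
graph.add_edge("retrieve", "generate")
graph.add_edge("generate", END)

app = graph.compile()
print(app.invoke({"question": "What is parent document retrieval?", "context": [], "answer": ""}))
```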
Srini Penchikala: Yes, definitely. For local use, just to get started, even Postgres has a vector flavor called pgvector.
Apoorva Joshi: Right.
Srini Penchikala: Then there’s Qdrant and others. Yes.
Apoorva Joshi: Yes.
Srini Penchikala: Do you have any metrics, or benchmarks, or resources that teams can use to look at, “Hey, I just want to see what are the top 10 or top five LLMs before I even start work on this?”
Apoorva Joshi: There’s one similar to, what’s the one you were mentioning?
Srini Penchikala: The one I mentioned is Open LLM Leaderboard.
Apoorva Joshi: There’s a similar one on Hugging Face that I occasionally look at. It’s called the LMSYS Chatbot Arena. It’s basically a crowdsourced evaluation of different proprietary and open source LLMs. I think that’s a better thing to look at than just performance on benchmarks, because benchmarks can have data contamination.
Sometimes vendors will actually train their models on benchmark data, so certain models could end up looking better on certain tasks than they actually are. Which is why leaderboards such as the one you mentioned and LMSYS are good, because it’s actually people trying these models on real-world prompts and tasks.
Srini Penchikala: Just like anything else, teams should try it out first and then see if it works for their use case and their requirements, right?
Apoorva Joshi: Yes.
Online Resources [27:58]
Srini Penchikala: Other than that, any other additional resources on LLM application performance improvements and evaluation? Any online articles or publications?
Apoorva Joshi: I follow a couple of people and read their blogs. There’s this person called Eugene Yan. He’s an applied scientist at Amazon. He has a blog, and he’s written extensively about evals and continues to do extensive research in that area. There’s also a group of people in the machine learning community who wrote a white paper titled What We Learned from a Year of Building with LLMs. It’s technical practitioners who wrote that white paper based on their experience building with LLMs over the past year. Yes, I generally follow a mix of researchers and practitioners in the community.
Srini Penchikala: Yes, I think that’s a really good discussion. Do you have any additional comments before we wrap up today’s discussion?
Apoorva Joshi: Yes. Our discussion made me realize just how important evaluation is when building any software application, but LLM applications specifically, because while LLMs have made ML accessible and usable in so many different domains, what you really need on a day-to-day basis is for the model or application to perform on the use case or task you need. I think evaluating for what you’re building is key.
Srini Penchikala: Also, another key is that your LLM mileage may vary. It all depends on what you’re trying to do, and what the constraints and benchmarks are that you’re working towards.
Apoorva Joshi: Exactly.
Srini Penchikala: Apoorva, thank you so much for joining this podcast. It’s been great to discuss one of the very important topics in the AI space: how to evaluate LLM applications, how to measure their performance, and how to improve it. These are practical topics that everybody is interested in, not just another Hello World application or ChatGPT tutorial.
Apoorva Joshi: Yes.
Srini Penchikala: Thank you for listening to this podcast. If you’d like to learn more about AI and ML topics, check out the AI, ML, and data engineering community page on the infoq.com website. I also encourage you to listen to the recent podcasts, especially the 2024 AI and ML Trends Report we published last year, and also the 2024 Software Trends Report that we published just after the new year. Thank you very much. Thanks for your time. Thanks, Apoorva.
Apoorva Joshi: Yes. Thank you so much for having me.