The Data Backbone of LLM Systems

Published 10 September 2025

Transcript

Paul Iusztin: I want to start with a little story of how I started to dig into the world of LLMs, and AI engineering, and so on. It all began in 2022 with the release of ChatGPT, so probably most of us remember that moment. It was everything about hype. All the people started to dig into AI. It was a beautiful mess. Previously, I had 6 years of experience in ML engineering, and I knew I wanted to get into this field. As with classic ML engineering questions, I started to ask myself, what LLMs should I use? Should I go with open source? Should I use those APIs that are so easy to call? Should I fine-tune the models? Shouldn’t I fine-tune them? How can I gather the data? How can I deploy these behemoths with hundreds of billions of parameters, which are really hard to deploy, and use, and so on?

Two years into the domain, after playing around with the tech, reading papers, and building stuff, I realized that there’s a huge change in how we actually build AI systems. Basically, with the rise of foundational models, we no longer need to fine-tune models or train models at all to start building AI applications. I started to ask myself questions such as, what tool should I use? This was super confusing in the beginning. As an example, everybody told me that I should not use LangChain, but at the same time everyone was using LangChain, so it was super confusing. After digging more, I realized that the key question that you should ask yourself is actually prompt engineering versus RAG versus fine-tuning. Slowly, I realized that RAG is king, and that in AI engineering almost every problem is a RAG problem, which in the end boils down to software engineering, data engineering, and information retrieval. No ML engineering in most applications.

I’m Paul Iusztin. My goal was to make the learning journey into the LLM and AI world easier. Together with Maxime Labonne, I authored the LLM Engineer’s Handbook, which is a guide, a framework, taking you from data collection to fine-tuning to deploying models, and so on, with the goal of connecting the dots in these systems. I also founded Decoding ML, which is a channel where we create free content like articles, posts, and free courses on AI engineering for production. Our North Star is a set of five free open-source courses on GitHub on production-ready AI engineering that have gathered over 100,000 GitHub stars. I’m currently working on vertical AI agents, and everything is backed up by 8 years of experience in AI, deep learning, and MLOps.

The Data Backbone of LLM Systems

I want to start by laying down some foundations on the three core layers that any AI application has. We can divide it into the application layer, the model layer, and the infrastructure layer. If you want to visualize this, it looks like this. The infrastructure layer sits at the bottom of everything, because we obviously need to run our application and our training jobs on some infrastructure. Afterwards, we have the model layer. At the top, we have the application layer.

To give you an example, when we start using APIs such as the ones provided by OpenAI or Anthropic, we actually touch only the application layer. The model layer and the infrastructure layer are delegated to the API provider, who had to train those models, host them, optimize them, and so on. Most of AI engineering is actually done at the application layer. On the other side of things, the data flows from another perspective. We have the fine-tuning layer, the data layer, and the inference layer.

This is another perspective on how we can see AI applications. Now let’s see how we can connect this together. In the model layer, we have the data layer and the fine-tuning layer. In the application layer, we have the data layer and the inference layer. As you can see, only the data layer overlaps between the two. The model layer is more specific to certain applications, while the application layer, which consists of the data layer and the inference layer, is present almost everywhere. The thing is that LLM system design is complex. If you don’t carefully design it from the beginning with a thoughtful process, you will end up with this beautiful mess.

The Feature Training Inference (FTI) Architecture

The question is, can we find a mind map that can ease up the process of designing this LLM system in the first place? I really like this quote by Leonardo da Vinci, that simplicity is the ultimate sophistication. The key is that we should always strive for simplicity in our software and in the designs that we adopt. That’s why I’m really in love with this pattern that I always preach, which is called the feature training inference architecture, where basically we have the feature pipeline, the training pipeline, and the inference pipeline. It is not part of the original design, but we also have to attach the observability pipeline to detect issues as fast as possible and to understand how to solve them. The feature pipeline takes raw data as input and outputs features and labels stored in the feature store, which we’ll use to train models.

One thing that I want to highlight is that this does not apply only to LLM systems. It’s a pattern that is very powerful for any AI system, but especially in the LLM world it can help us navigate more easily when we design a system. The feature pipeline spits out the features and labels, stored into the feature store. Then, we have the training pipeline, which takes the features and labels from the feature store and outputs a fine-tuned or trained model into the model registry.

Then we have the inference pipeline, which takes as input the prepared features and the trained model, and spits out predictions. This inference pipeline is basically your inference service, which can be a batch service, a real-time API, or whatever makes sense for your application. The predictions will be consumed by your application. The last piece of the puzzle is the observability pipeline, which usually takes as input your raw data, your predictions, and other internal state, with the goal of understanding and easily debugging what’s inside the system, and also computing metrics, alarms, and so on.

The beautiful thing about this architecture is its interfaces. Now we have clear interfaces that define how each component should communicate with the others. What I want to highlight is that each of these four pipelines does not necessarily have to be a single pipeline. Each of them can actually contain multiple smaller pipelines. For example, the training pipeline usually has a fine-tuning pipeline and an evaluation pipeline. The feature pipeline usually consists of multiple feature pipelines, some data validation, and so on. The idea is that using this pattern, you can very easily zoom in and zoom out, delegate technologies, delegate people to different components, and easily build up something more complex.
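
To make the interfaces concrete, here is a minimal Python sketch of the feature / training / inference contracts. The class and method names are illustrative placeholders, not the API of any specific framework discussed in the talk.

```python
from abc import ABC, abstractmethod
from typing import Any


class FeaturePipeline(ABC):
    """Raw data in, features (and labels) out, persisted to a feature store."""

    @abstractmethod
    def run(self, raw_data: Any) -> None:
        """Clean/transform raw data and write features to the feature store."""


class TrainingPipeline(ABC):
    """Features in, trained model out, persisted to a model registry."""

    @abstractmethod
    def run(self, feature_store_uri: str) -> str:
        """Train on stored features and return a model registry reference."""


class InferencePipeline(ABC):
    """Features plus a registered model in, predictions out."""

    @abstractmethod
    def predict(self, features: Any, model_uri: str) -> Any:
        """Load the registered model and produce predictions for the app."""
```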

Retrieval-Augmented Generation (RAG)

Now let’s talk about RAG, which stands for Retrieval-Augmented Generation. Every LLM application has RAG at its core, so it is really important to understand how it works. This is what the naive RAG architecture looks like. We have two core components here. We have the RAG ingestion pipeline and the part that interacts with the user, which is usually called the inference part. Let’s dig deeper into each. On the RAG ingestion part, we usually ingest some data sources. We clean them, chunk them, embed them into vectors using some embedding model, and store everything into a Vector DB. Usually, we store the vectors that we just created along with some metadata. This metadata is actually what we will use as context for our models, to anchor them into some specific data to avoid hallucinations and other similar problems. On the inference side, we have the user, who usually enters a query or something similar. We embed this query with the same embedding model that we used during ingestion.

Then we query the Vector DB, more exactly the vector index of the Vector DB, for semantically similar chunks, which we later use together with the prompt template and the user input to create the final prompt that we feed to the LLM to generate the final answer shown to the user. The key to this design is that the RAG ingestion pipeline is offloaded as an offline process, while the RAG inference pipeline lives in your actual application that interfaces with the user.
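
As a rough illustration of this flow, here is a minimal, self-contained sketch of the ingestion and inference sides. The embedding function and the in-memory vector store are placeholders standing in for a real embedding model and Vector DB.

```python
import numpy as np

# Placeholder embedding function -- in practice this wraps a real embedding
# model (a sentence-transformers model, an API-based embedder, etc.).
def embed(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    vec = rng.standard_normal(384)
    return vec / np.linalg.norm(vec)

# --- Ingestion (offline): clean -> chunk -> embed -> store with metadata ---
def chunk(document: str, size: int = 500) -> list[str]:
    return [document[i:i + size] for i in range(0, len(document), size)]

vector_db: list[dict] = []  # stand-in for a real Vector DB
for doc in ["raw document one ...", "raw document two ..."]:
    for piece in chunk(doc.strip()):
        vector_db.append({"embedding": embed(piece), "metadata": {"text": piece}})

# --- Inference (online): embed query -> similarity search -> build prompt ---
def retrieve(query: str, k: int = 3) -> list[str]:
    q = embed(query)  # same embedding model as during ingestion
    scored = sorted(vector_db, key=lambda r: float(q @ r["embedding"]), reverse=True)
    return [r["metadata"]["text"] for r in scored[:k]]

def build_prompt(query: str) -> str:
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("What does document one say?"))  # the prompt sent to the LLM
```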

Let’s talk a little about advanced RAG. We have three phases where we can actually optimize this RAG system. We have the pre-retrieval phase, where we usually want to play, on the ingestion pipeline side, with the cleaning and chunking steps, especially the chunking step, because we don’t want to blindly chunk our data. The key here is that each chunk should contain one entity. It’s similar to how a record in your database represents a single entity. We want to do the same thing here with the chunking process, but it’s a lot harder because data is noisy, we work with various types of documents, and in reality it’s really hard to generalize this part.

The second reason chunking is important is that these embedding models have a limited context window size. By making our documents smaller, we are certain that we can use them with the embedding model. Still on the pre-retrieval part, what we can also optimize are the queries. You want to be certain that your query contains enough signal so that during retrieval you retrieve the right stuff. On the retrieval part, you mostly want to play around with the embedding model. You want to try out different embedding models. You may even want to fine-tune them on your domain, the specific language from your company, and so on.

The last part that we can optimize is post-retrieval. We retrieved our chunks of interest and now we want to further optimize them. What we can do here is things such as summarizing them further if we want to reduce cost and latency, or, most importantly, re-ranking them, because there’s this well-known issue with LLMs, often called the lost-in-the-middle problem, where the LLM is biased toward answering questions based on the first part of the prompt and the last part of the prompt. What sits in the middle is usually lost, so we need to take this into consideration when we build up our prompt.
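
To make the post-retrieval ideas concrete, here is a hedged sketch of a re-ranking step plus an ordering trick that counters the lost-in-the-middle bias. The keyword-overlap scorer is a placeholder for a real cross-encoder or re-ranking model.

```python
def rerank(query: str, chunks: list[str]) -> list[str]:
    """Placeholder re-ranker: in practice use a cross-encoder or reranking API.
    Here we just score by naive keyword overlap with the query."""
    def score(chunk: str) -> float:
        q_terms = set(query.lower().split())
        c_terms = set(chunk.lower().split())
        return len(q_terms & c_terms) / (len(q_terms) or 1)
    return sorted(chunks, key=score, reverse=True)


def order_for_prompt(ranked_chunks: list[str]) -> list[str]:
    """Counter the lost-in-the-middle bias: alternate the best chunks between
    the beginning and the end of the context block."""
    front, back = [], []
    for i, chunk in enumerate(ranked_chunks):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


retrieved = ["chunk about pricing", "chunk about the query topic", "chunk about history"]
context = "\n".join(order_for_prompt(rerank("query topic", retrieved)))
```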

A few things about vector databases. I like to bucketize them into two big buckets. There’s this all-in-one category where we usually use the databases that we all know: MongoDB, PostgreSQL, Redis. What they’ve done is just add a vector index on top of what they already have. This is very powerful if you already work with this technology or you want to have a single database in your system. Many people actually adopt just PostgreSQL, because it’s easier to manage a single database.

You also have another family of dedicated vector databases like Qdrant, Chroma, and Pinecone, which of course offer you this powerful vector index, but on top of it they also offer all kinds of goodies like embedding models out of the box, re-ranking models out of the box, and so on. They’re more like a database plus part of your RAG problem solved out of the box, as a service. The thing is that you have something like 40 or 50 vector databases available out there on the market. There’s this really cool table hosted by Superlinked, called the Vector DB Comparison Table, where they compare all the vector databases on the market and constantly update it across 20 to 30 important features that you can weigh for your use case.
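
As a quick illustration of the all-in-one bucket, here is a hedged sketch of querying a pgvector index in PostgreSQL from Python. The connection details, table layout, and vector size are made up, and it assumes the pgvector extension is available.

```python
import psycopg2  # assumes a PostgreSQL instance with the pgvector extension

conn = psycopg2.connect("dbname=rag user=app password=secret host=localhost")
cur = conn.cursor()

# One-time setup: enable pgvector and create a chunks table with a vector column.
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        id        bigserial PRIMARY KEY,
        content   text,
        metadata  jsonb,
        embedding vector(384)
    );
""")
conn.commit()

# Retrieval: cosine distance (<=>) between the query embedding and stored chunks.
query_embedding = [0.01] * 384  # placeholder; produced by the same embedding model
embedding_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
cur.execute(
    """
    SELECT content, metadata
    FROM chunks
    ORDER BY embedding <=> %s::vector
    LIMIT 5;
    """,
    (embedding_literal,),
)
top_chunks = cur.fetchall()
```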

LLM System Architecture

Let’s merge everything together into an LLM system architecture and make this even more interesting. Usually, an LLM system architecture at a very high level looks like this. We have our feature pipelines, which in our use case are the ones in dark blue: the RAG feature pipeline and the dataset generation pipeline. Next, we have the training pipeline in orange. We have our inference pipelines in yellow, which usually consist of some services that host your fine-tuned LLMs and the agentic layer where you actually build your agents, workflows, tools, and so on. You of course want to hook a UI to your application, plus the observability layer on top of your application. I deliberately left the data engineering pipelines, in light blue, for the end, because in reality we should also take care of these.

Depending on the size of your team, you will either implement these data engineering pipelines yourself as an AI or ML engineer, or you have a dedicated data team that takes care of them. In my experience working with startups, I usually have to build everything you see here, but as the organization grows you usually work on more isolated components. The thing is that the data engineering pipelines are what feed data into your AI system.

Let’s now dig deeper into this architecture. We have the model layer. Let’s see how the data flows in this model layer. As I said at the beginning, it consists of the data layer and the fine-tuning layer. In the data layer, we usually have the data engineering phase, into which I don’t want to go too deep. Further on, we have the feature engineering and dataset generation step that takes in data from the data warehouse, where we usually expect structured, clean, normalized data. In the LLM world, this is usually stored as Markdown, because the internet is full of Markdown and LLMs really like it. At this phase we want to automatically generate a fine-tuning dataset. Yes, we can automatically generate it using techniques such as distillation, but the key here is that we should always anchor it in our concrete data. For a very simple example, say we want to fine-tune an LLM to summarize company-specific documents.

In this use case, we want to create input and output tuples where the input is the full document and the output is the summarized document. We can use a more powerful LLM to summarize these documents and transfer this knowledge to smaller language models, also known as SLMs. We store this generated dataset into a data registry. Then we kick off the training pipeline, which loads the dataset from the data registry. The data registry has, as its core features, things such as versioning, like semantic versioning or time-travel versioning, and also lineage. You can, for example, attach all kinds of metadata to every version bump of your dataset and explain where this data came from, what techniques you used to generate this dataset, and so on.
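
Here is a hedged sketch of that distillation step: a stronger teacher model summarizes our concrete documents, and the resulting (input, output) tuples are written as a JSONL dataset version to push to the data registry. The model name and the load_documents_from_warehouse helper are hypothetical.

```python
import json
from openai import OpenAI  # any chat-completion API would work here

client = OpenAI()  # assumes OPENAI_API_KEY is set; the model below is illustrative
TEACHER_MODEL = "gpt-4o-mini"

def summarize(document: str) -> str:
    """Use a stronger 'teacher' model to produce the target output."""
    response = client.chat.completions.create(
        model=TEACHER_MODEL,
        messages=[
            {"role": "system", "content": "Summarize the company document concisely."},
            {"role": "user", "content": document},
        ],
    )
    return response.choices[0].message.content

# Build (input, output) tuples anchored in our concrete documents, then dump
# them as JSONL -- one dataset version to register in the data registry.
documents = load_documents_from_warehouse()  # hypothetical helper
with open("summarization_dataset_v1.jsonl", "w") as f:
    for doc in documents:
        record = {"input": doc, "output": summarize(doc)}
        f.write(json.dumps(record) + "\n")
```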

Then you kick off your fine-tuning pipeline, which usually needs big clusters of compute to fine-tune these models, and you store the fine-tuned model into a model registry. The model registry, I think, is very similar to a Docker registry but specialized for models. At deploy time you take the LLM from the model registry and deploy it to a specialized microservice for LLMs. I will dig deeper into this service part a little bit later. The thing is that these systems should be optimized for high throughput, because during training you work with lots of data and you want your training to run as fast as possible, since this means more experimentation and a reduction in cost. You also want to make this system reproducible. That’s why we added the data warehouse, the data registry, and the model registry: so we can always understand how the model was trained, how the dataset was generated, where it came from, and so on.

Experimentation, I think this is super important because AI systems are very experimental. They’re complex, you can’t just plan everything in advance. You usually re-iterate, re-iterate, and re-iterate, fail and re-iterate until you just see that the system works. That’s why you need to have experimentation as a first-class citizen.

Let’s see how the data flows in the application layer, the other part of the system where we have again the data layer and the inference layer. Here in the data layer we mostly focus on the RAG ingestion pipeline, which is the exact pipeline that I explained in the RAG architecture, which basically takes in data from the data warehouse and cleans it, chunks it, embeds it, and loads it into your vector database. What’s more interesting is on the inference side where we have the actual agentic implementation.

As you can see, the agentic part of your system is just a very small part of the whole ecosystem. Here is where you build your agent, your tools, and so on. What’s interesting is that in the diagram you can see that it uses our fine-tuned model, but in reality you have a choice: you can use your fine-tuned model, a hosted open-source model without fine-tuning it, or just an API, which is the easiest way to start. This is not a perfect recipe; it just gives you all the elements and you can pick and choose what makes more sense for you. Lastly, we have the observability pipeline, which takes prompt traces from your system as input. Instead of having normal logs, you usually want to group each user request into a prompt trace where you can dig into what happened inside the agent.

This system is mostly optimized for low latency because it interacts with the user. Of course, we want to serve the request as fast as possible. Also, at this phase of the system we play a lot with RAG, so we have to optimize the ingestion and the retrieval phases, with a highlight on the retrieval phase. Actually, in RAG and LLM problems, the retrieval phase is the most complex part, because it’s super hard to retrieve the right chunks from your data. It’s a very complex problem. Lastly, observability. You should treat observability as a first-class citizen and add it as early as possible into your system. In my opinion, from day one, just hook it in there. You can use specialized tools that make this quite easy to integrate.

One last thing that I want to highlight about this architecture: as you remember, I said that the feature training inference architecture has a feature store, but we don’t see any feature store here. Actually, we’ve built a logical feature store, which means that we just borrowed ideas from the standard feature store and indirectly implemented them, because in reality you don’t need a feature store for all your problems, especially in the LLM world. Feature stores are super powerful in real-time, highly transactional streaming problems, but LLM workloads are usually not like that. The most important parts of the feature store in the LLM world are the online and offline stores. The online store is optimized for low latency, which in reality is just your normal database, already optimized for low latency. The offline store is the data registry, which offers things such as high throughput, data versioning, lineage, and shareability of your datasets across multiple training pipelines, and so on.

Use Case – Second Brain AI Assistant

Now I want to take another look at everything that I explained, take a use case which I call the Second Brain AI assistant, and slowly build up a real-world scenario while using all these components in a real system. Let’s first understand what the Second Brain AI assistant is. The Second Brain part is actually your digital footprint: things such as your notes, your Notion, your Google Drive, your emails, your Slack communication, and so on.

For the AI assistant part, you basically hook an LLM, a chatbot, to this, and instead of searching like a caveman, you use an LLM to answer questions, and you put RAG on top of everything to retrieve what is of interest and synthesize everything into an interesting and useful answer. I want to begin with a really naive example. I saw that most tutorials look like this: we have the RAG ingestion, the RAG retrieval, and basically the whole logic in a Jupyter Notebook, with everything running in a single process. We have our user query, then we start our RAG ingestion step, which usually triggers some data collection step. We ingest this into a vector database, then we do our RAG retrieval phase, build up the prompt, and generate the answer using an LLM.

The thing is that this RAG ingestion phase and the data collection phase are very slow and costly, and you don’t want them anywhere in your inference phase. To optimize this, the logical thing to do is to move these two pieces into two separate processes. We have our data collection step, where we collect data from Notion, and the RAG ingestion step as an offline pipeline, which can run on a schedule, be triggered by an event, or run manually if the data doesn’t change that often. You usually want to further optimize it with things such as rewriting part of your system in Rust, or moving whatever can be parallelized onto CUDA kernels to make it fast if you have GPUs available. Just remember that these systems are usually written in Python.

Python is not that great for optimizing stuff, and that’s why leveraging these technologies can really speed things up for you as a Python developer. Now, on the inference pipeline side, we have the same thing. We have the user query. We query the vector database, retrieve the relevant chunks, pass them to an LLM, and get the answer. Before adding more complexity to this, I want to show you the easiest way to deploy it. We offload the offline pipeline, the RAG ingestion pipeline, to a simple worker plus a cron job. Every cloud has some worker service available for you, so it’s really easy to deploy. Usually, the RAG inference phase can be deployed as an API server using something like FastAPI, or something else. Of course, you can deploy it differently, but this is the most common option.
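
As a minimal sketch of that inference API, assuming FastAPI: the retrieval and generation helpers below are hypothetical placeholders for your vector DB client and LLM call.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Second Brain AI Assistant")

class Query(BaseModel):
    question: str

@app.post("/ask")
def ask(query: Query) -> dict:
    # Both helpers are placeholders for the retrieval and generation steps
    # described above; swap in your vector DB client and LLM of choice.
    chunks = retrieve_relevant_chunks(query.question)  # hypothetical helper
    context = "\n".join(chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {query.question}"
    answer = generate_answer(prompt)                   # hypothetical helper
    return {"answer": answer, "sources": chunks}

# Run locally with: uvicorn app:app --reload
```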

Let’s start to add more complexity and make the system more reproducible. The first phase is to make it easier to experiment with the system. AI systems love experimentation. Usually, when you develop an AI system you have to re-iterate. That’s why it’s important to have a working end-to-end system as fast as possible, and then quickly re-iterate over it. The thing is that in RAG systems you usually have this data collection phase that collects data from external data sources, and you don’t have control over them. You never know when the outside world will change. That’s why a simple solution is just to collect this data and make a snapshot of it in a data lake, where we don’t really care about how clean or structured the data is. We just dump it in there and that’s it. Now, in the RAG ingestion phase, we just read that snapshot, and we no longer have to run the data collection phase when we experiment with our RAG system.

The next phase is adding fine-tuning, if we want it. The thing about fine-tuning is that we don’t always need it, or maybe we think we need it and we actually don’t, because it’s hard and it doesn’t always work out. That’s why it’s important to keep it very decoupled. As you can see, the only piece that connects the fine-tuning component to the rest of the system is that it reads from the same data lake that we use anyway. We have our dataset generation phase that reads from the data lake, where we use distillation to create our dataset. Then we pass the generated dataset to the training pipeline and save the fine-tuned model to a model registry.

Now we want to deploy this. Usually, we deploy LLMs on specialized microservices that use technologies such as vLLM or TGI from Hugging Face. These are open-source tools which, along with hosting your LLM behind an API, offer you tons of goodies for inference optimization, because LLMs are big and usually you want to run them on as little compute as possible. You want things such as quantization, KV caching, dynamic batching, and so on. New methods keep coming in, and these two frameworks usually adopt what’s working. You don’t have to implement them yourself; you just have to configure some parameters that do this for you. Again, I want to insist on making things reproducible.
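
Here is a hedged sketch of what serving a fine-tuned model with vLLM and calling it through its OpenAI-compatible API could look like; the model name and flags are illustrative, not the exact setup from the talk.

```python
# Serving the fine-tuned model with vLLM (model name and flags are illustrative):
#   vllm serve my-org/my-finetuned-llm --quantization awq --max-model-len 8192
# vLLM exposes an OpenAI-compatible API, so the agentic layer can call it like this:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="my-org/my-finetuned-llm",  # must match the served model name
    messages=[{"role": "user", "content": "Summarize this document: ..."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```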

A big part of building AI systems is making them reproducible, and that’s not easy, because AI systems are by design non-deterministic. That’s why you need all these snapshots, all this versioning, all this storing, so you have control and understand how everything was generated. That’s why we need the data registry, where we store the datasets generated by the dataset generation pipeline, and the training pipelines always read from the data registry. This data registry can take many forms. You can maybe just use an S3 bucket dedicated to this with versioning enabled, or you can use something more open source like the Hugging Face data registry, or DVC, or the data catalog from Databricks. There are many options out there in the industry, but this is more about processes and how you design your system than about tooling. What’s also good about this is that you can now use this data to fine-tune multiple models in multiple training pipelines across your organization.

The last phase of reproducibility is that you also want to add a data warehouse between your feature pipelines and your data engineering world. Of course, this is optional; I’m adding the good practices from the industry here, but the reality is that at work everything will be a lot noisier, with more limited resources, so you have to pick and choose. For example, you can just drop the data warehouse and keep everything in a data lake.

The key thing is the ETL pipeline. For example, in our Second Brain use case, we gather many links from Notion, because we have that long list of articles and books that we want to keep reading, and we can hook it in like this. In the ETL pipeline, we can trigger some crawlers that crawl all these links, get the actual articles, books, videos, and so forth, transform them into usable formats like Markdown, and store them in the data warehouse. Again, the highlight is that we want to depend on the outside world as little as possible.

The last aspect of the offline pipelines that I want to highlight is tooling. So far, I recommended deploying these offline pipelines as simple workers plus cron jobs, but that’s not actually good practice, and you want to use specialized software that does this for you. The two tools that I usually recommend are Metaflow and ZenML. They’re open source. Basically, they will help you manage your pipelines, monitor them, and trigger them from a nice UI. You can also attach metadata or, for example, version the outputs of each step into dataset artifacts, which you can later store in your data lake, and have multiple snapshots of your dataset in a more granular way.

The most beautiful part is that they also take care of deploying all these pipelines on your cloud. Personally, I would start with this tooling from day one and just build with it, without focusing on anything else, because you usually need to structure your code to work with these tools, so it’s just 10 times easier to start with them from day zero.
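
As a small sketch of what such an offline pipeline could look like with Metaflow (the collection, snapshot, and ingestion helpers are hypothetical placeholders):

```python
from metaflow import FlowSpec, step


class RagIngestionFlow(FlowSpec):
    """A minimal offline pipeline: collect -> snapshot -> chunk/embed -> load."""

    @step
    def start(self):
        self.raw_docs = collect_from_notion()      # hypothetical helper
        self.next(self.snapshot)

    @step
    def snapshot(self):
        # Artifacts assigned to self are versioned by Metaflow per run.
        self.snapshot_uri = save_to_data_lake(self.raw_docs)  # hypothetical helper
        self.next(self.ingest)

    @step
    def ingest(self):
        chunks = chunk_and_embed(self.raw_docs)    # hypothetical helper
        load_into_vector_db(chunks)                # hypothetical helper
        self.next(self.end)

    @step
    def end(self):
        print(f"Ingested snapshot {self.snapshot_uri}")


if __name__ == "__main__":
    RagIngestionFlow()  # run with: python rag_ingestion_flow.py run
```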

Now let’s move on to the online environment of the system. Here we don’t have anything special; I just cleared up the board. What I want to highlight are the vector index and the model registry, which actually connect the two worlds, the offline pipeline world and the online environment. Without these two components, the online and offline environments don’t communicate with each other at all.

The first thing that we usually want to add to the inference pipeline is an agentic layer. As you can see, before, we had the RAG retrieval as one single step, and now we transform it into a tool. This allows the LLM to start asking itself questions such as, do I have enough context to answer the user’s question? If the answer is yes, it generates the final answer. If the answer is no, as in the reasoning part, it will use the retrieval tool to gather more context. It also needs to rewrite the query, because if it keeps calling the retrieval tool with the same query, it will keep getting the same answer. You need to rewrite the query, remove some noise, add more context, and so on.

The next time the agent uses the retrieval tool, it will get different context. You also want to add a maximum number of steps here, because maybe you just don’t have that specific data in your vector database and you will never retrieve the right context to answer the question; in that case, you want to max out after, for example, five steps.
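
Here is a minimal plain-Python sketch of that agent loop, with query rewriting and a step cap; the llm_* functions and the retrieval tool are hypothetical placeholders for real LLM calls and your retriever.

```python
MAX_STEPS = 5

def run_agent(user_query: str) -> str:
    """Plain-Python agent loop: decide, optionally retrieve with a rewritten
    query, and stop after MAX_STEPS if the context never becomes sufficient."""
    context: list[str] = []
    query = user_query
    for _ in range(MAX_STEPS):
        if llm_has_enough_context(user_query, context):      # hypothetical LLM call
            return llm_generate_answer(user_query, context)  # hypothetical LLM call
        # Rewrite the query before re-using the retrieval tool, otherwise the
        # same query would keep returning the same chunks.
        query = llm_rewrite_query(query, context)            # hypothetical LLM call
        context.extend(retrieval_tool(query))                # hypothetical tool
    return "I could not find enough information to answer that."
```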

Frameworks for Building Agentic Apps

I want to talk a little bit about frameworks for building agentic apps. This is just my personal labeling of things, how I like to label these tools. I label them as integration tools versus orchestration tools. This is just a personal preference. The thing is that we have two big families of LLM frameworks. The first family contained tools such as LangChain. The second iteration of these tools is tools like LangGraph or LlamaIndex Workflows. The big difference between them is that LangChain is mostly about integrating your RAG application with various models and various databases, and integrating some specific RAG algorithms. The thing is that it’s super rigid and you can’t customize it. It’s just good to quickly prototype something, see if it’s worth your time, and move on.

Then we have the LangGraph framework, which is in my opinion the actual orchestration tool, which helps you manage your agentic flows, your workflows. These tools are mostly oriented around orchestrating the various steps in your agents: hooking up tools, keeping the state of the conversation, storing the conversation. They’re more focused on actually letting you build stuff and delegating the integration to your side, which you usually have to do anyway. For example, at one of my clients, using things such as LangChain or LlamaIndex (not LlamaIndex Workflows) just wasn’t possible. As a simple example, they enforce their data structure in your database, and it just doesn’t work for most of the use cases out there.

Let’s move back to our Second Brain use case. The thing is that, after we add this agentic layer, we can really easily start hooking up different tools, like a web search tool or whatever we have in mind; it just doesn’t matter. We just define the tool, hook it to the agent, and that’s it. The thing is that we want to be really careful about write operations. They’re really dangerous. A very basic example is an email tool. We would like to live in the ideal world where we just hook this system to our Jira, to our boards, to whatever, and let it answer all the questions out there and not care about email anymore. The thing is that even if the system generates the best answers while we develop it, it’s non-deterministic. We can never rely entirely on it. It can just go wild and start rambling about whatever in these emails, and you don’t want that. That’s why, when we define write operations, we actually want to add a human in the loop.

For example, in this email use case, we want the agent to generate a draft, and then a human will review the draft and push the final button before actually making that modification. Be careful about write operations. There is that joke about vibe coding, where you vibe-code your application and at some point Cursor or another tool just deletes your whole codebase without storing it, and you tell yourself that you wanted to do something different anyway.
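
A minimal sketch of gating a write operation behind a human, assuming a hypothetical email client and LLM drafting helper:

```python
def draft_email(request: str) -> str:
    return llm_generate_draft(request)       # hypothetical LLM call

def send_email_with_approval(request: str) -> None:
    """Write operations are gated by a human: the agent only drafts,
    a person reviews and pushes the final button."""
    draft = draft_email(request)
    print("--- DRAFT ---\n" + draft)
    if input("Send this email? [y/N] ").strip().lower() == "y":
        email_client.send(draft)             # hypothetical email client
    else:
        print("Draft discarded; nothing was sent.")
```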

Most of the time you don’t want to do something different and you don’t want to lose your work. Maybe you’ve heard about MCP servers; they’re quite popular nowadays. Using this pattern, you can really easily hook an MCP server to your agent, because in my opinion they’re just a standardized way of defining tools and connecting external services as tools to your agent. They’re really nothing fancy; it’s just a way to standardize how these things work.

One last piece of the puzzle is memory. Agents need memory. It sounds fancy, but it isn’t. We have two types of memory, short-term memory and long-term memory. As for short-term memory, another way to look at it is as working memory, where we keep the state of the agent: for example, all the messages, all the context that we retrieve from the databases, and other metadata that the agent needs to work with. We usually keep this in RAM. We also want to save this state, because it lets us use different instances of the agent across multiple conversations, or, when you close the application, you can reopen it, load the state, and continue your conversation.

For long-term memory, we have two types. We have semantic long-term memory, which is actually just the vector database or whatever other database you work with. It can be a SQL database, a graph database, or whatever makes sense for your use case. We have procedural long-term memory, which is actually a fancy term for how you define your agent and its logic in code. There’s also a third long-term memory type, which is more hidden, called episodic long-term memory. It’s when we run RAG on top of our short-term memory. Basically, we take our state, run RAG on top of it, and put it in our vector database, and now the agent can query our past conversations and remember things about them. The key here is that it’s narrowed down to specific aspects of previous conversations that make sense to it.
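
Here is a rough sketch of these memory types in plain Python: a serializable short-term state, plus an episodic archive step that indexes past messages into the vector DB. The embed and vector_db_upsert helpers are hypothetical.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class AgentState:
    """Short-term / working memory: kept in RAM, snapshotted to disk so a
    conversation can be resumed by another agent instance."""
    messages: list = field(default_factory=list)
    retrieved_context: list = field(default_factory=list)

    def save(self, path: str) -> None:
        with open(path, "w") as f:
            json.dump(asdict(self), f)

    @classmethod
    def load(cls, path: str) -> "AgentState":
        with open(path) as f:
            return cls(**json.load(f))

# Episodic long-term memory: index past states into the vector DB so the
# agent can later run RAG over its own previous conversations.
def archive_episode(state: AgentState) -> None:
    for message in state.messages:
        vector_db_upsert(embed(message), {"text": message})  # hypothetical helpers
```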

In the same system we can also add guardrails, like input and output guardrails. These can improve performance and security. For example, for the input guardrails, we want to do things such as masking sensitive credentials. This is actually done in simple Python code with regex. We also want to check for prompt injections or anything else that doesn’t comply with your application. At the output guardrail, we usually want to check for performance issues like hallucinations, but also for moderation issues, things that will not comply with your application. One last thing about this is that these guardrails can take many forms. They can be implemented directly in Python. For moderation issues, all the API providers like OpenAI, Anthropic, and so on actually provide these types of moderation models, which are just classifiers that classify your input as racist content and so on. Or you can train some simple classifiers that run really fast to classify specific inputs on things that matter to your use case.
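
A minimal sketch of such guardrails, assuming illustrative regex patterns and a hypothetical moderation classifier hook:

```python
import re

# Input guardrail: mask obvious credentials before the prompt reaches the LLM.
# The patterns below are illustrative, not an exhaustive security filter.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),                   # API-key-like tokens
    re.compile(r"\b\d{13,19}\b"),                         # card-number-like digits
    re.compile(r"password\s*[:=]\s*\S+", re.IGNORECASE),  # inline passwords
]

def mask_sensitive_input(text: str) -> str:
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text

# Output guardrail: delegate moderation to a provider's classifier or a small
# custom classifier; here it is just a placeholder hook.
def check_output(answer: str) -> bool:
    return moderation_classifier(answer) == "ok"          # hypothetical classifier
```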

The last piece of the puzzle is the observability pipeline. I added it at the end just because I did not want to clutter the graph, but, actually, I would have added it at the very beginning. Keep that in mind. We usually implement the observability pipeline with tools such as Opik and Langfuse, which are open source, or we can take a more out-of-the-box approach with LangSmith, which comes with the LangGraph suite but is not open source, so you’re stuck in the Lang ecosystem. These tools provide two key capabilities: LLM evaluation and prompt monitoring. For LLM evaluation, we usually first have to define an evaluation dataset. This evaluation dataset is usually built while you develop the application. When you develop the application, you usually test it on specific prompts, on specific inputs and outputs, and you expect it to behave in some way.

The key is just to store these prompts somewhere, and later on you can put them into your evaluation dataset and automate this whole process. Also, once you have these use cases and edge cases, you can use LLMs to create more use cases inspired by them. If you have prompt monitoring, as you detect all kinds of new edge cases you can just save them, put them in your evaluation dataset, and continually evaluate your system on it. You can also see them as regression tests: your system fails on some specific inputs and outputs, and you save those as samples in your evaluation dataset and test your system against them. For evaluation, you usually use LLMs as judges, because testing these systems is hard; remember that these are generative AI systems, so we don’t expect a very rigid answer.

The LLM can answer in very many ways and still be correct, so we can’t really use classic metrics to test this. That’s why we can use LLMs to look at things at a higher level and compute metrics such as moderation, hallucination, and answer relevance: is your answer relevant to your question, is your answer relevant to the context provided, or does your answer cover the whole context provided by RAG? Basically, how I like to see it intuitively is that you have this triangle of your question, your answer, and the context from RAG, and you want to see that all three match and make sense together.
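
Here is a hedged sketch of an LLM-as-a-judge call scoring answer relevance and groundedness; the judge model name is illustrative, and in practice you would parse the output defensively or use structured outputs.

```python
import json
from openai import OpenAI

client = OpenAI()  # the judge model name below is illustrative

JUDGE_PROMPT = """You are an evaluator. Given a question, the retrieved context,
and an answer, rate answer_relevance and groundedness from 1 to 5.
Return only JSON, for example: {{"answer_relevance": 4, "groundedness": 5}}

Question: {question}
Context: {context}
Answer: {answer}"""

def judge(question: str, context: str, answer: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, context=context, answer=answer)}],
    )
    # Sketch only: assumes the judge returns clean JSON.
    return json.loads(response.choices[0].message.content)
```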

For prompt monitoring, you can’t really use standard logging tools, because inside an LLM application you have many complex steps, and that’s why they’re called prompt traces. During observability, you want to understand that an error happened, but you also want to understand how to fix it. Logging your whole trace helps you dig into the agent and understand what tools it called, what the input was, what the output was, or how it processed your queries. You can also add things such as latency per step.

At the top you can see the latency for the whole query, and as you dig into the prompt trace you can see the sub-steps and how long they take. Using this, you can also start computing things such as the time to generate the first token, which is usually the longest step because the LLM first has to process your input before starting to output anything, or the average number of tokens it generates per second, and all kinds of token-related latency metrics. You can also start counting tokens per step, which helps you compute costs, further reduce those costs, optimize your system, and understand what’s not going well.
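
A small sketch of computing time-to-first-token and tokens-per-second from any token stream; llm_stream is a hypothetical streaming generator.

```python
import time

def measure_generation(stream):
    """Compute time-to-first-token and tokens/second from a token stream.
    `stream` is assumed to be an iterator that yields generated tokens."""
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _token in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        n_tokens += 1
    total = time.perf_counter() - start
    return {
        "time_to_first_token_s": (first_token_at - start) if first_token_at else None,
        "tokens_per_second": n_tokens / total if total > 0 else 0.0,
        "total_tokens": n_tokens,  # feeds directly into per-step cost estimates
    }

metrics = measure_generation(llm_stream("What did I save about RAG?"))  # hypothetical stream
```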

Conclusion

We’ve ended our use case. To conclude, I want to highlight that the Second Brain AI assistant now has a scalable RAG layer, because the offline ingestion and the online query are completely decoupled. It has a clear separation between the data processing, the fine-tuning, and the inference parts, because we use the feature training inference architecture. It is built with experimentation in mind, because we made these data lake, data warehouse, and data registry snapshots, so we can very easily start working from various snapshots of the data, focus on what matters, and not re-compute everything from scratch. It supports low latency for inference and high throughput for fine-tuning, which maybe sounds obvious when you see it like this, but I’ve seen so many times that everything becomes a mess and it’s hard to differentiate these aspects. It’s observable and secure, because we added monitoring, evaluation, and guardrails.

Resources

I also want to mention another use case, called the LLM Twin, which I adopted in the book. The book is interesting because, instead of trying to touch multiple smaller examples, we take this one example through the whole book and dig into all the details. Basically, we touch everything from data collection to feature pipelines to training pipelines to the logical feature store and the inference pipeline, and show how everything connects together into a production-ready AI application. For more related content, you can also read the free courses on Decoding ML. This Second Brain use case is actually inspired from there, and you can dig into the code and articles, which go a lot deeper into everything that I presented. I also write many other free articles on the same topics, and you can follow me on LinkedIn for weekly content on similar topics.

 
