Key Takeaways
- A hybrid of vector and term-based search is the most effective strategy for RAG pipelines that answer user questions about documentation. Both vector databases and Lucene-based search engines support this, but tuning the underlying algorithm is critical for optimal results.
- When the domain is complex enough, and the questions are sufficiently sophisticated and nuanced, similarity (which is what you get out of a document search) is not the same thing as relevance (which is what the LLM needs to answer the question).
- Chunking refers to the process of breaking down content into smaller units when indexing documents for a database. The database search can miss similar content if the chunks are too large or too small. The basis for chunking should differ depending on the knowledge domain and the type of content and media used to deliver it.
- Not all types of content should be indexed the same way. Strategies vary for indexing diagrams, graphs, sample code, tabular data, and various kinds of prose.
- Despite getting bigger with each new LLM release, the context window remains a crucial consideration. Including only the most relevant search results in a RAG prompt ensures the highest quality response.
Having just completed a development effort to build a RAG pipeline for software architects, I decided to collect some notes from my experience and share them here. I hope that you can benefit from what I have learned.
Project Solution Architecture
Let’s briefly describe the project from which most of these lessons were learned. It is a cloud-based B2B solution that provides software architecture support services to technology companies that maintain their own software. You give it access to your source code and documentation.
After that, you can ask about the current architecture or how to enhance it. You can ask for just a high-level pitch or a detailed plan. You can also ask about reducing accidental complexity in general. Figure 1 below is a component diagram of the entire RAG pipeline for this project, but this article will focus on the lessons learned from the documentation chunking, indexing, and searching parts.
Figure 1. Component diagram for the software architecture support project where these lessons were learned
RAG Revisited
To properly set the context for the rest of the article, let’s begin with the basics of what RAG is. I will be brief here and invite you to read this May 2025 InfoQ article Beyond the Gang of Four: Practical Design Patterns for Modern AI Systems for more details on the subject.
Figure 2 illustrates the basic flow of data for a document focused RAG pipeline. Before the user can ask any questions, documents are chunked into fragments then saved to a database. The user’s question is collected and undergoes some preprocessing before being combined with system instructions and database search results in a template to construct the final prompt. The RAG pipeline then submits the prompt to the LLM. The response is then collected and undergoes some post-processing before it is surfaced to the end user as an answer to their question.
Figure 2. Typical data flow in a document-focused RAG pipeline
Let’s review the context window, which is the maximum amount of text, in tokens, that the LLM can consider or remember in any single session. If your prompt exceeds the context window, the LLM won’t remember the system instructions or the question, which are typically located at the beginning of the prompt. Successive versions of LLMs have ever-increasing context windows, but this remains a consideration to be aware of.
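As a concrete illustration, you can count tokens before submitting a prompt. This is a minimal sketch assuming an OpenAI-style tokenizer via tiktoken; the window size and the headroom reserved for the answer are illustrative numbers, not recommendations.

```python
import tiktoken

# Assumes an OpenAI-style tokenizer; other providers expose their own token counters.
encoding = tiktoken.get_encoding("cl100k_base")

def fits_in_context(prompt: str, context_window: int = 128_000, reserve_for_answer: int = 4_000) -> bool:
    # Leave headroom for the model's answer tokens.
    return len(encoding.encode(prompt)) <= context_window - reserve_for_answer
```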
Why bother with a RAG pipeline at all? Why not have the user ask their question directly to the LLM without any additional prompt engineering or supporting database search? If we consider a use case from the above-mentioned project (a user asking how to enhance an existing software system), several limitations of LLMs become obvious. Perhaps the vendor trained the LLM prior to the release of the documentation, or the documentation was not part of the training data. Even if the documentation was used to train the LLM, there may be conflicting information that requires additional context. Internal details about proprietary software are almost never released for public consumption. When the LLM knows either too little or too much, the likelihood of AI hallucination increases, and we must take steps to mitigate it through a curated process.
LLMs and RAG are currently a very hot topic, and there is a lot of activity in the tooling landscape that supports this. Here is the high-level anatomy of this type of tooling. It is usually an open-source framework that supports a plugin architecture. There is typically an associated hub with shareable assets, which could include models, training data, and prompts. Not only does the tool vendor contribute to this hub, but also members of the community. There is also an enterprise upsell that you may or may not feel compelled to purchase.
Lessons Learned
Enough with the basics. Let’s get on with what we ran into and what we did about it.
There are many vendor tools in this space, too numerous to evaluate, much less cover here. The oldest and most well-known tool is langchain, a general-purpose framework that focuses on standardized component interfaces, orchestration, and observability. I decided to pass on langchain for the following reasons. I have been burned by its backward-incompatible changes in a previous project. Every LLM provider offers its own API, and they are all very similar, so it is easy to swap one out for another. Connecting the output from one step to the input of another via a template is easy to do with standard Python, as sketched below.
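Here is roughly what that plain-Python templating looks like. The template text and function name are illustrative, not the project's actual code.

```python
PROMPT_TEMPLATE = """{system_instructions}

Relevant documentation excerpts:
{context}

Question: {question}
"""

def build_prompt(system_instructions: str, chunks: list[str], question: str) -> str:
    # Join the retrieved document fragments with a visible separator.
    context = "\n\n---\n\n".join(chunks)
    return PROMPT_TEMPLATE.format(
        system_instructions=system_instructions,
        context=context,
        question=question,
    )
```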
Aside from the LLM itself, there are two LLM focused tools that we did end up using. Remember question pre-processing and answer post-processing in the RAG pipeline description above? That processing is mainly in the form of safety vetting, and the guardrails product is an excellent tool for that. There are lots of checkers to choose from on their hub. Some require an API key but many do not. In the rest of this article, you will see references to sentence transformers, POS (Parts of Speech) tagging, summarization, and sentiment analysis. Huggingface is an excellent resource for that functionality.
Document Ingestion
Before you can search a database for document excerpts to consider in answering the question, you must first save those excerpts into the database. Breaking up the overall documentation into individual document fragments and storing them in the database as separate rows is known as chunking.
Chunking granularity indicates the size of each separately saved document fragment. A chunk could range from a single sentence to multiple pages. If the chunks are too large, then the similarity of a small portion of a chunk with the question may get washed out of the results. Large chunks cause another problem: what eventually gets fed into the LLM will contain a significant amount of irrelevant content, and we will see later how that compromises answer quality. If we make the chunks too small, we may overlook relevant data in the search and, therefore, not include that data in the prompt.
I wish that there was a one-size-fits-all granularity for chunking, but that is not the case. While the character length of each document fragment should fall within a predetermined range, that should not be the guiding criterion. Chunking granularity depends on several factors, including knowledge domain, content type, and media type. For example, descriptive prose on a detailed domain that serves experts might require a larger chunking granularity than persuasive prose on a simplified domain intended for a more general audience. In many types of prose, each paragraph captures a separate idea, which might be a reasonable basis for chunking (see the sketch below). What about screenplays? Each paragraph is just the lines for the next character that speaks. If you chunk at that level, then you might miss relevant results for questions about the interaction between two characters. Other types of content where per-paragraph chunking may not serve as well include sample code, poetry, and tabular data.
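Here is a minimal sketch of paragraph-based chunking with a character-length guardrail. The thresholds and the blank-line paragraph delimiter are illustrative assumptions, not values from the project.

```python
def chunk_paragraphs(text: str, min_chars: int = 200, max_chars: int = 1200) -> list[str]:
    """Chunk prose at paragraph boundaries while keeping chunks within a size range."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            # Keep growing the chunk (or accept a single oversized paragraph as-is).
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    # Fold a too-small trailing chunk into its predecessor.
    if len(chunks) > 1 and len(chunks[-1]) < min_chars:
        chunks[-2] += "\n\n" + chunks.pop()
    return chunks
```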
The choice of delivery medium for the original documentation can also have a profound effect on the chunking strategy. Here is an example: files written in the Portable Document Format (PDF) group content by page. What if a paragraph gets split across two pages? It may not look any different than two pages where a paragraph ends at the bottom of the first page and a new paragraph begins at the top of the second page. How do you programmatically determine when two separate blocks of prose that span pages belong to the same paragraph?
For our project, we decided to ignore this problem and just live with the consequences. Some PDFs are internally organized such that each page is just an image of the text, graphs, tables, you name it. There are strategies for extracting that content, such as asking the LLM to transcribe the page or running OCR, but it isn’t easy. Fortunately for us, those types of PDFs are more likely intended for public-facing marketing collateral than internal-facing software architecture documentation. For our project, we coded up a custom PDF importer using pypdf because we needed more control over the chunking (see the sketch below). If you want to go with something more off-the-shelf, then I hear that pymupdf4llm + tesseract is a good choice for that.
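As an illustration, here is roughly what a pypdf-based importer looks like when you accept page-boundary paragraph splits. The file name is hypothetical, and it reuses the chunk_paragraphs helper sketched earlier.

```python
from pypdf import PdfReader

reader = PdfReader("architecture-overview.pdf")  # hypothetical file name

chunks: list[str] = []
for page in reader.pages:
    page_text = page.extract_text() or ""
    # Paragraphs split across a page boundary stay split; we accepted that trade-off.
    chunks.extend(chunk_paragraphs(page_text))
```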
What about long paragraphs? According to Grammarly, the average paragraph size is somewhere between 100 and 200 words. That is a reasonable chunking size, but not all writers follow that rule. For our project, we decided to ignore this problem since this issue does not occur much in the world of software architecture documentation.
Indexing web pages is another example of a media-specific chunking strategy. You need to filter out all the navigation and boilerplate content from each page. Examples include headers, footers, navigation menus, and breadcrumbs. You will likely want to follow links that take you to another area of the documentation to be indexed, but not outside of that area. Examples include other domains, about us, site map, and shopping cart. For our project, we coded up a custom web scraper with the Beautiful Soup library because we needed more control over the chunking.
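Here is a minimal sketch of that kind of scraper, assuming a hypothetical documentation URL and that boilerplate lives in standard HTML elements; real sites usually need site-specific selectors.

```python
import requests
from bs4 import BeautifulSoup

url = "https://docs.example.com/architecture/overview"  # hypothetical documentation URL
soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")

# Strip boilerplate elements before chunking.
for tag in soup.find_all(["header", "footer", "nav", "aside", "script", "style"]):
    tag.decompose()

# Chunk at the paragraph level.
chunks = [p.get_text(" ", strip=True) for p in soup.find_all("p") if p.get_text(strip=True)]

# Only follow links that stay inside the documentation area.
next_links = [a["href"] for a in soup.find_all("a", href=True) if a["href"].startswith("/architecture/")]
```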
Different knowledge domains have varying reliance on graphical content. In some types of content, the graphics are superfluous. In others, a narrative description of what is relevant in each image may already accompany it, but not always. How can you collect that for a document search? You can embed images into databases for vector search (more on this later), but what gets embedded may not be relevant to the types of questions asked. A cartoon of a boy and his dog will most likely be embedded accurately. A graph of political election results could be represented as red and blue circles of varying sizes, which may not be particularly useful. You may need to generate and save custom, domain-specific text summaries for specific content types, including diagrams, charts, sample code, and tabular data.
For the project that I was working on, we would periodically scrape the websites containing the documentation. This was mostly architectural documentation, but we also scraped user and administrator guides because they can include useful information and usually live on the same website. We also imported PDF files containing educational collateral. Both HTML and PDF assets were usually hosted on an internal CMS. We would use heuristics to break up each page into blocks of prose, sample code, diagrams, and configuration. We would chunk at the paragraph level for the prose and ask the LLM to generate a summary for the other types of content.
Document Search
Now that you have indexed all of the document fragments into your database, what is the best way to search for them in the context of the original question? For the project that I was working on, I found the hybrid search approach to be the most effective. Hybrid search casts a wider net, so you are more likely to collect enough relevant results. It involves multiple searches, typically one or two vector searches and a term search. The results are merged and reranked, usually with the Reciprocal Rank Fusion (RRF) algorithm (sketched below). The query for the vector search is either the question itself or a normalized version of it. We haven’t done this yet, but you should consider correcting grammatical errors and summarizing the question if it is too long or meanders.
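RRF itself is only a few lines: each document's score is the sum of 1/(k + rank) across the result lists it appears in. Here is a minimal sketch; k = 60 is the constant commonly used in the literature, and the input format is an assumption.

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document ids (e.g. vector hits and term hits) with RRF."""
    scores: dict[str, float] = {}
    for hits in result_lists:
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```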
The query for the term search could be the question, or you could extract the most relevant terms from the question. For many domains, those would be the proper nouns, and you can use POS tagging to identify them. That was our approach, and we used the huggingface plugin for flair to accomplish POS tagging. This approach does not work for “What does it do?” style questions that lack any proper nouns. We ended up treating that type of question as invalid and would direct the user to a page on how to write good questions.
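A sketch of that extraction with flair follows. The model name comes from the flair page on Huggingface, the sample question is made up, and the exact label-access calls can vary slightly between flair versions.

```python
from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load("flair/pos-english")

def extract_proper_nouns(question: str) -> list[str]:
    sentence = Sentence(question)
    tagger.predict(sentence)
    # NNP and NNPS are the Penn Treebank tags for singular and plural proper nouns.
    return [
        label.data_point.text
        for label in sentence.get_labels("pos")
        if label.value in ("NNP", "NNPS")
    ]

print(extract_proper_nouns("How does the OrderService talk to the Stripe PaymentGateway?"))
```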
We have already covered the process of breaking down your documentation into text fragments and then saving each fragment to the database. The process of converting that text into a searchable vector is known as embedding, and there are different types. Use dense vectors for what is typically known as vector search. Use a sentence transformer to embed text into a dense vector.
The most popular transformers are derived from the Bidirectional Encoder Representations from Transformers (BERT) algorithm. Two popular sentence transformers to consider are all-distilroberta-v1 and nli-mpnet-base-v2. Check out this leaderboard if you’re more interested in the latest developments in embedding models. Embedding images requires a vision transformer such as CLIP or Swin. Vector databases use sparse vectors to accommodate term-based search. There is a BM25-based embedder for this, but I recommend a SParse Lexical AnD Expansion Model (SPLADE) embedder, which is also based on BERT. You should use the same embedding algorithm for indexing document fragments as you use for embedding the question at search time. Different record fields in the database can be populated with different embedding functions, but always use the same embedding function for the same field in all records of the database.
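Here is a minimal dense-embedding sketch with the sentence-transformers library and one of the models named above; the sample chunks and question are made up.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-distilroberta-v1")

chunks = [
    "The OrderService publishes domain events to a message broker.",
    "Payments are delegated to an external gateway over REST.",
]
# Normalizing makes cosine similarity equivalent to a simple dot product later on.
chunk_vectors = model.encode(chunks, normalize_embeddings=True)
question_vector = model.encode("How are payments processed?", normalize_embeddings=True)
```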
I am writing this in 2025, so the most prominent databases to consider using are the vector-only databases, such as Qdrant, Pinecone, and Milvus, as well as the Lucene-based databases, including Elasticsearch, Solr, and OpenSearch. The Lucene-based databases provide term-based search using an inverted index data structure. The ranking algorithm is BM25, which scores results by normalized term frequency (TF) and inverse document frequency (IDF). Lucene also supports vector search algorithms. You can use the community edition of Elasticsearch to do this, but you have to run the sentence transformer in your own code.
The more premium licenses of Elasticsearch give you the ability to configure an ML pipeline that runs within the Elasticsearch process space and does the embedding for you automatically. All of these databases support hybrid search, but we ended up querying the database separately and implementing the merge and reranking of the results in our own code (see the sketch below). We felt that we needed more control over that step because it is so critical to the quality of the responses.
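For illustration, here is roughly what those separate queries look like against an Elasticsearch 8.x index. The index name, field names, and connection details are assumptions, and question_vector comes from the embedding sketch above.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # connection details are illustrative
# question is the user's (preprocessed) question string.

# Term-based (BM25) search over the chunk text.
term_hits = es.search(index="doc-chunks", query={"match": {"text": question}}, size=20)

# Vector (kNN) search over a dense_vector field populated at indexing time.
vector_hits = es.search(
    index="doc-chunks",
    knn={"field": "embedding", "query_vector": question_vector.tolist(), "k": 20, "num_candidates": 200},
)

term_ids = [hit["_id"] for hit in term_hits["hits"]["hits"]]
vector_ids = [hit["_id"] for hit in vector_hits["hits"]["hits"]]
merged_ids = reciprocal_rank_fusion([vector_ids, term_ids])  # RRF helper from earlier
```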
The actual algorithm for vector search is known as kNN, or k nearest neighbors. There is a faster version of kNN known as aNN, or approximate nearest neighbors. The aNN approach is usually implemented with the Hierarchical Navigable Small Worlds (HNSW) algorithm. Think of HNSW as hierarchical proximity graphs layered over probabilistic skip lists. An alternative for aNN is FAISS, Meta's similarity search library, but most databases don't support it yet.
How near the neighbors are is determined by what is known as the distance function. For dense vectors, the most popular choice is cosine similarity. There are alternatives, such as Euclidean distance and Manhattan distance. Use a weighted dot product for sparse vectors. The choice of distance function rarely changes but is loosely based on the choice of embedding function. Cosine similarity does not take the magnitude of the two vectors into account, whereas Euclidean distance does. Manhattan distance sums the absolute differences of the components, so it behaves more like counting the discrete changes required to transform one vector into the other. We used cosine similarity.
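For concreteness, here is a small sketch of the three distance functions with NumPy; the example vectors are arbitrary.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Ignores magnitude: only the angle between the vectors matters.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    # Straight-line distance, sensitive to magnitude.
    return float(np.linalg.norm(a - b))

def manhattan_distance(a: np.ndarray, b: np.ndarray) -> float:
    # Sum of absolute component-wise differences.
    return float(np.sum(np.abs(a - b)))

a, b = np.array([0.1, 0.7, 0.2]), np.array([0.2, 0.6, 0.3])
print(cosine_similarity(a, b), euclidean_distance(a, b), manhattan_distance(a, b))
```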
Document Retrieval and Reranking
Given the nature of the project, where the questions are not simple and the answers require deeper reasoning, we quickly learned something that may not be obvious to RAG pipeline newcomers. The results from a database search provide document fragments that are similar to the question. What the LLM needs to answer the question are relevant document fragments. Relevance is the likelihood that the document fragment will help answer the question. For most non-trivial use cases, similarity is not equivalent to relevance. Why not include all the search results in the prompt and let the LLM determine what is relevant, silently ignoring the rest? Even when a prompt fits within the context window, responses tend to degrade as the token count approaches the limit. Those answers will sound more buzzwordy and less articulate, providing less value to your end user. This is known as context rot, and measurable degradation can occur once the input grows past even 1% of the context window.
You must figure out a way to rerank the search results based on relevance and include only the most relevant results in the prompt. Perhaps you can find some domain-specific heuristics for estimating relevance. For our purposes, results that included enough proper nouns from the question (see POS tagging above) were considered relevant. If you are in a contentious and polarizing domain, you should also include sentiment analysis alongside proper nouns, although this was not necessary for our project. If we didn’t get enough relevant results this way, we would fall back to the worst-case solution, which is to loop through all the search results and prompt the LLM to calculate the relevance of each one. The system instructions could appear as follows:
Study the following question and data, then return the relevance or likelihood that the supplied data could be used to answer the question. This relevance would be expressed as a floating-point number between 0.0 and 1.0, where 0.0 indicates that the data isn’t useful at all, 1.0 means that the data is useful with complete certainty, and 0.5 suggests that the data is just as likely to be useful as not – output as JSON in the specified schema. The response MUST be a valid JSON object and NOTHING else.
That approach can be expensive and time-consuming, making it less suitable for your purposes if scalability and costs are considerations.
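Here is a minimal sketch of that fallback loop, assuming an OpenAI-style client; the provider, model name, relevance threshold, and the "relevance" JSON key are all illustrative assumptions rather than what the project used.

```python
import json
from openai import OpenAI  # provider and client choice are assumptions

client = OpenAI()

RELEVANCE_INSTRUCTIONS = "..."  # the system instructions quoted above

def relevance_score(question: str, chunk: str) -> float:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[
            {"role": "system", "content": RELEVANCE_INSTRUCTIONS},
            {"role": "user", "content": f"Question:\n{question}\n\nData:\n{chunk}"},
        ],
        response_format={"type": "json_object"},
    )
    # Assumes the schema asks for a top-level "relevance" field.
    return float(json.loads(response.choices[0].message.content)["relevance"])

# candidate_chunks: the merged results from the hybrid search.
relevant_chunks = [c for c in candidate_chunks if relevance_score(question, c) >= 0.7]
```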
Figure 3. Flowchart for combining POS tagging, LLM relevance, and summarization.
If limiting by relevance is not possible, consider using summaries of the search results in the prompt instead. Summaries might sufficiently shorten the prompt, but they take some time to generate and can also dilute the relevance of the data, which could decrease the value of the answer. Depending on the size of the results, we ended up doing a mixture of both relevance filtering and result summarization (see Figure 3 above). We used BART (the large-sized model fine-tuned on CNN Daily Mail), released by Meta on Huggingface, to generate this kind of summarization.
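A minimal sketch of that summarization step with the Huggingface transformers pipeline follows; the length limits are illustrative and should be tuned to your prompt budget.

```python
from transformers import pipeline

# facebook/bart-large-cnn is BART large fine-tuned on CNN/Daily Mail.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def summarize_chunk(chunk: str) -> str:
    result = summarizer(chunk, max_length=130, min_length=30, do_sample=False)
    return result[0]["summary_text"]
```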
Prompting with the Results
You may have to adjust the system instructions based on the search results. Suppose you have a healthy number of relevant search results. In that case, your system instructions can say something like, “You are an expert tasked to answer the following question based exclusively on the provided data”. But if you get none or very few relevant results, then maybe the system instructions are more like, “You are an expert tasked to answer the following question based on your already existing knowledge of the subject”.
Prompting with no results doesn’t sound very RAG-like, but I can assure you from firsthand experience that the LLM can get quite snarky when you instruct it to use the data and then fail to provide any data to use.
Conclusions
From document ingestion to searching, retrieval, reranking, and folding the results into a template-based prompt for the LLM, I hope you find these notes useful in your own journey with RAG pipeline construction. After reading this article, you should realize that there are plenty of knobs to turn and techniques to try if you find the LLM’s answers in your own pipeline to be lackluster.