Key Takeaways
- Modular Retrieval Augmented Generation (RAG) applications enhance accuracy and relevancy by assigning ownership to dedicated domain experts.
- Metadata should be leveraged to intelligently route queries to the most appropriate RAG application, whether through auto-selection, manual choice, or comprehensive search.
- Domain experts must own both content curation and system prompt engineering to ensure technical accuracy in specialized areas.
- Technical diagrams should be converted into textual representations to enrich RAG systems with architectural knowledge.
- RAG capabilities should be built into complete tools that integrate with existing workflows, not offered as standalone AI interfaces.
Understanding our context
As a leading banking tech vendor with over 30 years in the industry, we have developed an extensive proprietary codebase and expanded through strategic acquisitions. Over the decades, we’ve positioned ourselves as innovators, yet the rapid pace of innovation has brought challenges in maintaining consistent and up-to-date documentation across our vast product lineup.
While some areas of our codebase have solid, well-managed documentation, others are unclear or outdated, making it tough for our sales engineers and client architects to find needed information. Additionally, our domain experts possess deep knowledge in specific areas, but their expertise is often isolated and hard to access systematically.
Previously, we tried tackling this issue with knowledge-sharing initiatives and comprehensive training programs, but the outcomes were underwhelming because both our documentation and our expertise remained fragmented. We also tried a fact-finding tool built on a static database of pre-defined questions and answers. The big challenge with such tools is the lack of context: an answer to a specific question within a specific context usually cannot be reused if the question or the context (or both!) is slightly different. To remove these obstacles to accessing accurate technical information for our sales engineers and client architects, I decided a few months ago to explore Retrieval Augmented Generation (RAG) as an assistive tool, developed by our domain experts, to aid the team’s fact-finding process.
RAG is an emerging AI technique that bridges retrieval and generative models to enhance fact-finding. By combining a smart search engine with AI-generated responses, RAG systems access vast data sources to deliver accurate and efficient answers. This integration handles complex queries, provides real-time updates, and supports multiple languages. Currently, LLM-based technologies like RAG still struggle with accuracy and a tendency to “hallucinate”, i.e., to generate incorrect or made-up information. While advancements are being made to enhance these systems, they still fall short of human reliability. Interestingly, our human consultants currently hallucinate less than the AI models, which is both amusing and telling.
To maintain this human edge and uphold the highest standards of accuracy, we strictly limit the AI to a consultative role. This means our applications will serve as tools to assist consultants in finding information, but it’s crucial that consultants always filter, validate, and modify the AI-generated outputs to guarantee correctness and integrity in their work. Despite significant market enthusiasm today for autonomous agents, we recognize that AI technology has not yet matured sufficiently to be used for business-critical tasks.
Ownership of knowledge
To address these challenges, we are redistributing ownership of RAG implementation across our sales engineering team. By recognizing the complexity of our product lineup and the difficulty in maintaining consistent documentation, we have identified that a centralized approach would not suffice. Instead, we are opting to assign dedicated domain owners – experts embedded within specific teams – to oversee the integration and fine-tuning of RAG systems for their respective areas.
These domain owners are tasked with ensuring that RAG aligns seamlessly with their team’s unique needs, while also maintaining accuracy in its responses. Each domain expert uses their own AI infrastructure to refine search parameters and system prompts, tailoring them to the nuances of their domain.
For instance, one owner may focus on how a specific product works from every perspective, including functional architecture, application architecture, and deployment, whereas another owner building a RAG application for SaaS spends more time ensuring the application accurately conveys our SLAs and default service setup. Beyond technical adjustments, these owners are responsible for curating high-quality documentation and training materials, ensuring that their teams can leverage RAG effectively.
System prompt engineering represents a core responsibility of our domain owners. They meticulously craft, test, and refine these prompts to ensure the RAG system correctly interprets queries within their specific domain context. Through iterative testing, owners develop prompts that guide the LLM to retrieve the most relevant information and format responses appropriately for their team’s use cases.
This specialized prompt engineering requires deep domain knowledge and significant experimentation, as subtle wording changes in a prompt can dramatically improve accuracy. By maintaining control over system prompts, owners effectively “tune” the RAG application to their domain’s terminology, priorities, and common question patterns, creating a more reliable and contextually-aware information retrieval system.
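To make this concrete, here is a short, hypothetical example of the kind of domain-scoped system prompt an owner might maintain; the wording and rules below are illustrative, not our production prompt.

```python
# Hypothetical example only -- not our production prompt.
# Each domain owner iterates on a prompt along these lines for their own RAG application.
PAYMENTS_SYSTEM_PROMPT = """
You are an assistant for the Payments domain, answering questions from
sales engineers and client architects.

Rules:
- Answer ONLY from the retrieved context passages supplied with the question.
- Do not extrapolate and do not fall back on general industry knowledge.
- If the context does not contain the answer, reply:
  "I cannot comment on that based on the available documentation."
- Use our product names and terminology exactly as they appear in the context.
- Keep answers concise and highlight the business benefit where relevant.
"""
```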
This approach leverages the deep expertise of our domain experts while integrating the efficiency of AI-driven solutions. By distributing ownership, we are creating a system where human insight and machine learning work in tandem to address the unique challenges of each product line. This collaboration not only enhances the accuracy of RAG outputs but also ensures that our teams remain well-equipped to handle the evolving demands of modern banking technology.
Using metadata
There is a trend nowadays to simply incorporate all available knowledge into an LLM with a large context window and minimize RAG. Some researchers claim this approach to be superior. My experiments as well as some online research indicate this is not always the case. LLM accuracy decreases significantly when the context window is populated with approximately half a million tokens before querying (note this number will change over time). There is a lot of research you can refer to in this area (e.g., the articles Long Context vs. RAG for LLMs and Retrieval Augmented Generation or Long-Context LLMs?).
We attempted to input a large volume of documents from all our RAG assistants into an LLM with several hundred thousand tokens in the context window and the results were not as good as the ones provided via RAG. Furthermore, the cost greatly increases (even though LLMs get continuously cheaper, cost will always be a function of the number of tokens used, at least in the foreseeable future). Therefore, we decided to move to a metadata-driven approach.
Our metadata is currently generated using a specific approach. First, we feed the RAG documents into a standard LLM and ask it to “Summarize the documents, focus on describing what type of information the documents include”. Our domain expert then reviews and edits the output to better reflect the content of the documents. The metadata also includes a few keywords commonly associated with the domain, such as frequently used three-letter acronyms, internal project names, or older names previously used for the same or an equivalent component.
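A minimal sketch of how such a metadata record could be drafted at design time is shown below; the helper name, the prompt wiring, and the resulting structure are illustrative assumptions, and the domain expert always reviews and edits the generated summary before it is used.

```python
import os

from langchain_openai import AzureChatOpenAI  # assumes Azure OpenAI credentials in env vars

SUMMARY_INSTRUCTION = (
    "Summarize the documents, focus on describing what type of "
    "information the documents include."
)

def draft_metadata(domain_name, documents, keywords):
    """Draft the metadata record for one RAG application; a domain expert edits the result."""
    llm = AzureChatOpenAI(
        deployment_name=os.getenv("AZURE_OPENAI_DEPLOYMENT"),
        temperature=0
    )
    draft_summary = llm.invoke(f"{SUMMARY_INSTRUCTION}\n\n" + "\n\n".join(documents)).content
    return {
        "domain": domain_name,     # e.g. "Payments" (illustrative)
        "summary": draft_summary,  # reviewed and edited by the domain owner
        "keywords": keywords,      # acronyms, project names, legacy component names
    }
```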
Metadata serves as the backbone for organizing and contextualizing information within RAG applications, allowing users to understand the scope and context under which the system provides responses. This structured approach is particularly vital in our banking tech domain, where managing diverse product lines and services requires precise information retrieval.
Addressing the challenges of scaling RAG
In addressing the challenges of scaling RAG systems, we have explored three primary approaches. First, allowing users to search across all applications simultaneously offers simplicity but becomes ineffective as the number of applications grows. The lack of specificity leads to accuracy issues, making this approach less viable for larger, more complex systems. Second, leveraging metadata to auto-select the most relevant RAG application based on user queries introduces a more targeted and efficient method.
By feeding the query into an LLM that matches it against available metadata, we can pinpoint the most suitable assistant. This approach enhances precision, especially when combined with manual selection rules. For instance, if a user’s question includes a specific product name, the system directs the query to the dedicated RAG application for that product. Third, enabling users to manually select the assistant provides flexibility, which is particularly useful when users are familiar with the specific assistance they need. This method complements the auto-classification process and is especially effective in scenarios where multiple applications might be relevant.
Looking ahead, we aim to enhance this system by allowing the auto-classifier to identify the top two or three matched assistants. Users would then have the option to search across these selected applications, significantly improving efficiency as the number of RAG applications expands. This improvement will be particularly beneficial in managing our diverse product portfolio, ensuring that users receive accurate and relevant information tailored to their needs.
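As a sketch of that direction, the snippet below asks the LLM to rank the assistants against their stored metadata and return the best matches; the prompt wording and the JSON contract are illustrative assumptions rather than our current implementation.

```python
import json
import os

from langchain_openai import AzureChatOpenAI

def top_matching_assistants(question, metadata_by_assistant, k=3):
    """Illustrative: let the LLM pick the k assistants whose metadata best matches the question."""
    catalog = "\n".join(
        f"- {name}: {meta['summary']} (keywords: {', '.join(meta['keywords'])})"
        for name, meta in metadata_by_assistant.items()
    )
    prompt = (
        f"Given the following RAG assistants and their metadata, return a JSON list of "
        f"the {k} assistant names that best match the question, best match first.\n\n"
        f"{catalog}\n\nQuestion: {question}\nJSON list:"
    )
    llm = AzureChatOpenAI(deployment_name=os.getenv("AZURE_OPENAI_DEPLOYMENT"), temperature=0)
    # In practice the output should be validated; the model may not always return clean JSON.
    return json.loads(llm.invoke(prompt).content)[:k]
```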
The approaches outlined above provide greater flexibility to the user for combining and selecting appropriate resources. A short summary of how we have structured the application is shown in the diagram below.
Figure 1: RAG-powered application with distributed ownership of RAG applications
From Design to Implementation
Having established our metadata strategy and distributed ownership model, we needed to translate these concepts into a functioning application that our teams could use daily. This required moving beyond theoretical RAG concepts to build a complete tool that integrates into our existing workflows while implementing our unique approach to knowledge retrieval.
Implementation Approach: Building a Complete Application
RAG Implementation
Our RAG implementation goes beyond being a simple AI query system. It is a comprehensive application designed to enhance the user experience while maintaining high information accuracy. We’ve built a standard web application using Flask with authentication, error handling, and a responsive interface. This foundation allows us to focus on integrating specialized RAG capabilities rather than reinventing basic application infrastructure.
The foundation of our application is a distributed RAG system where domain owners manage specialized knowledge models. Each model operates independently with its own vector store containing domain-specific documentation. This approach enables fine-grained control over information retrieval while maintaining the distributed ownership model described earlier.
It may be worthwhile to elaborate here on how we currently define our knowledge model. There are essentially four main components we use today:
- The vector store with our curated documents.
- The metadata, described in an earlier section, which helps define the domain each RAG application covers.
- The system prompt, which helps format the response, align it with our branding guidelines, and emphasize the business benefits of different components. Most importantly, the system prompt is used to prevent hallucinations and ensure that the RAG application only provides responses based on the supplied material without extrapolating or incorporating industry-standard knowledge.
- UML diagrams describing our solution components. We currently use class diagrams to represent solution components and have recently started exploring the use of sequence diagrams to help the RAG application understand flows. In addition to including the UML files directly in the vector store, we also generate a textual description of the UML content (again using an LLM at design time).
Figure 2: Each RAG application has its own evolving knowledge model
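As a rough illustration of how these four parts hang together (the class and field names below are ours for this sketch, not actual code in our application):

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeModel:
    """Illustrative bundle of the four components each domain owner maintains."""
    vector_store_path: str   # curated documents embedded into the domain's vector store
    metadata: dict           # summary and keywords describing the domain's scope
    system_prompt: str       # formatting, branding, and anti-hallucination rules
    uml_sources: list = field(default_factory=list)  # UML files plus their generated textual descriptions
```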
At the application’s core, we’ve implemented three query paths that align with our metadata strategy:
1. Auto-selection mode utilizes an intelligent question classifier that routes queries to the most appropriate knowledge model based on content analysis. This classifier combines rule-based pattern matching for obvious cases with an LLM-based classification system for more nuanced queries.
```python
import os
import re

# AzureChatOpenAI comes from the langchain-openai package
# (older LangChain releases expose it under langchain.chat_models).
from langchain_openai import AzureChatOpenAI

def classify_question(question):
    # First, apply rule-based pattern matching for common topics
    if any(re.search(r'\b' + re.escape(term) + r'\b', question, re.IGNORECASE)
           for term in ["payment", "payments", "payment hub"]):
        return "Payments"
    # For more complex questions, use LLM classification
    llm = AzureChatOpenAI(
        deployment_name=os.getenv("AZURE_OPENAI_DEPLOYMENT"),
        temperature=0
    )
    # system_prompt and definitions_prompt (defined elsewhere) hold the classifier
    # instructions and the category definitions derived from our metadata.
    combined_prompt = f"{system_prompt}\n{definitions_prompt}\n\nQuestion: '{question}'\nCategory:"
    response = llm.invoke([{"role": "user", "content": combined_prompt}])
    return extract_category(response)  # helper (defined elsewhere) that extracts the category name
```
This classifier function first applies pattern matching to quickly identify common topics, then falls back to an LLM for more nuanced classification, saving processing time while maintaining accuracy for complex queries.
2. Manual selection mode allows users to explicitly choose which domain’s knowledge model to query, providing control when they know exactly which area holds their answer.
3. Search-all mode performs a comprehensive search across all knowledge models using an ensemble retriever, enabling broader information retrieval when the domain is uncertain.
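A minimal sketch of the search-all path, assuming LangChain's EnsembleRetriever over one vector store per domain (the store layout and equal weighting are illustrative):

```python
from langchain.retrievers import EnsembleRetriever

def build_search_all_retriever(vector_stores, k=4):
    """Combine one retriever per domain so a single query reaches every knowledge model."""
    retrievers = [store.as_retriever(search_kwargs={"k": k}) for store in vector_stores.values()]
    # Equal weights: no domain is privileged when the user does not know where the answer lives.
    weights = [1.0 / len(retrievers)] * len(retrievers)
    return EnsembleRetriever(retrievers=retrievers, weights=weights)
```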
Our API endpoint handles all three modes, routing queries appropriately:
```python
from flask import request, jsonify

# app, auth, component_mapping, RAG_MODEL_FUNCTIONS and get_all_model_response
# are defined elsewhere in the application.

@app.route('/api/auto-classify', methods=['POST'])
@auth.login_required
def auto_classify():
    data = request.json
    query = data['query']
    mode = data.get('mode', 'auto')
    try:
        if mode == 'auto':
            # Use the classifier to determine the appropriate model
            category = classify_question(query)
            component = component_mapping.get(category)
            response = RAG_MODEL_FUNCTIONS[component](query)
            formatted_response = response.get('result', str(response))
        elif mode == 'search_all':
            # Query all models using ensemble retriever
            response = get_all_model_response(query)
            formatted_response = response.get('result', str(response))
        else:
            # Manual selection (component specified in request)
            component = data['component']
            response = RAG_MODEL_FUNCTIONS[component](query)
            formatted_response = response.get('result', str(response))
        return jsonify({
            'component': component if mode != 'search_all' else 'AllModels',
            'query': query,
            'response': formatted_response
        })
    except Exception as e:
        return jsonify({'error': f"Error processing query: {str(e)}"}), 500
```
This endpoint implements our metadata-driven approach by dynamically routing queries based on the selected mode, demonstrating how we integrate classification, manual selection, and comprehensive search capabilities.
UML Integration for Enhanced Knowledge Representation
To enhance the user experience beyond simple text responses, we’ve integrated UML diagram visualization capabilities. Users can explore component relationships through interactive diagrams that provide architectural context for their queries. More importantly, we’re leveraging these diagrams as a rich knowledge source for our RAG applications.
Our UML files include descriptive text attributes that make them valuable inputs for LLMs. For example, the following snippet is from our Integration Layer diagram:
```
package "Integration Layer" {
    class PubSubBroker {
        + Type: Kafka / Azure Event Hub
        + Purpose: Event-driven communication
    }
    class MQBroker {
        + Type: ActiveMQ
        + Purpose: Event-driven communication
    }
    class APIGateway {
        + Type: API Gateway
        + Purpose: Routes and manages API requests
    }
    class SFTP {
        + Type: SFTP client/server
        + Purpose: File exchanges
    }
    class GIT {
        + Type: Source Control Management
        + Purpose: Source Control, entry point to DevOps
    }
}
```
This structured representation includes key attributes like component types and purposes, which we parse and convert into descriptive text documents. These documents are then ingested into our RAG knowledge base alongside the original UML files. Our conversion process preserves the hierarchical relationships and technical details, translating graphical connections into textual descriptions of system interactions.
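The exact conversion pipeline is internal, but a simplified sketch of the idea, assuming PlantUML-style class blocks like the snippet above, could look like this (the regular expressions and output wording are illustrative):

```python
import re

def uml_classes_to_text(plantuml_source):
    """Turn PlantUML class blocks with Type/Purpose attributes into sentences for the vector store."""
    sentences = []
    for name, body in re.findall(r'class\s+(\w+)\s*\{(.*?)\}', plantuml_source, re.DOTALL):
        attrs = dict(re.findall(r'\+\s*(\w+):\s*(.+)', body))
        sentences.append(
            f"{name} is a component of type {attrs.get('Type', 'unknown')}; "
            f"its purpose is {attrs.get('Purpose', 'not documented')}."
        )
    return "\n".join(sentences)
```

Running this over the Integration Layer snippet yields lines such as "APIGateway is a component of type API Gateway; its purpose is Routes and manages API requests.", which are then ingested alongside the original UML files.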
We’ve configured our system prompts to assign higher confidence to answers derived from these UML-based knowledge sources, particularly for architectural and integration questions. When a user asks about system components, the RAG application can provide answers grounded in precise documentation rather than generalized knowledge.
We’re currently using both class and sequence diagrams in our knowledge base. Class diagrams provide structural information about system components and their relationships, while sequence diagrams offer insights into process flows and interaction patterns. This combination gives our RAG applications a comprehensive understanding of both static architecture and dynamic behaviors.
As we continue evolving our approach, we’re exploring ways to enrich our UML files with additional descriptive text, such as adding detailed comments for each component, that can be processed by LLMs without affecting the visual representation of the diagrams. This would allow domain experts to embed deeper knowledge directly into architectural documentation, making the information accessible both visually for humans and textually for our RAG applications.
Application Architecture and Management
The application manages authentication, logging, and error handling to ensure security and reliability. We’ve implemented standard web application practices including input validation, path traversal prevention, and secure credential management. This infrastructure allows domain experts to focus on knowledge curation rather than application security concerns.
Our backend architecture employs a singleton pattern for the RAG application manager, ensuring efficient resource use while maintaining separation between different vector stores. This design allows us to scale horizontally as more domain owners create specialized knowledge models without increasing memory footprint.
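A minimal sketch of that pattern, with illustrative names rather than our actual classes:

```python
class RAGApplicationManager:
    """Illustrative singleton holding one lazily loaded vector store per domain."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._stores = {}  # domain name -> vector store, loaded on first use
        return cls._instance

    def get_store(self, domain, loader):
        # loader is a callable that opens the domain's vector store on first access
        if domain not in self._stores:
            self._stores[domain] = loader()
        return self._stores[domain]
```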
System prompts are managed through a centralized configuration system that domain owners can customize, giving them control over how the LLM interprets retrieved information. This approach means domain experts can tune both the knowledge sources and the response style without modifying code.
The web interface provides an intuitive dashboard where users can:
- Submit queries and select their preferred search mode
- Visualize technical architecture through UML diagrams
- Combine multiple components into comprehensive architectural views
- Access both RAG-based knowledge and structured documentation seamlessly
By building a complete software application rather than just an AI interface, we’ve created a tool that integrates into our teams’ workflows while maintaining the distributed ownership model that ensures information accuracy and relevance.
Performance Evaluation
Methodology
To perform an initial evaluation of our approach, we wanted to measure how the models perform against our accuracy requirements.
We implemented a two-pronged evaluation approach:
- Classifier-Based Routing. Questions were processed by an LLM-based classifier that determined which specialized model should handle the query.
- Comprehensive Coverage. The same questions were simultaneously processed by all available models.
For each question, we captured both the classifier’s routing decision and the respective answers from each approach. A human expert then evaluated the outputs across four key metrics to assess system performance.
Evaluation Metrics
Our evaluation framework employs four complementary metrics:
| Metric | Description | Formula |
|---|---|---|
| Classifier Accuracy (81.7%) | Measures how accurately the classifier routes questions to the appropriate model. | Correctly classified questions / Total questions |
| Response Precision (Classified Model) (97.4%) | Evaluates the quality of answers from correctly classified questions only. | Good answers from correctly classified questions / Total correctly classified questions |
| Response Precision (All Models) (83.8%) | Assesses the quality of answers when using all models together. | Good answers from all models / Total questions |
| Expert-Guided Answer Recovery (63.4%) | Determines whether a human expert can find a satisfactory answer in either approach. | Questions with useful answers in either approach / Total questions |
Here is a short description of how we came up with these metrics:
In our evaluation, we considered appropriate “I cannot comment on that” responses as correct when the model genuinely lacked relevant information, valuing honesty over hallucination. As described in the beginning, we prefer our human experts to locate the correct information every time instead of risking inaccuracies. In this context, when the model says “I don’t know”, it is actually a good answer – we must absolutely avoid providing incorrect information. We prefer the model to acknowledge uncertainty and defer to a human for investigation.
Therefore, in our response precision metric, such cases are counted as positive outcomes. However, it’s important to note that an “I don’t know” response creates manual work for the consultant, who then needs to find the answer through traditional means. We can reduce the frequency of these cases and improve this metric by fine-tuning the system prompt and enhancing the metadata.
The expert-guided answer recovery metric counts the questions answered correctly via either the classified model or all models, and can be used as-is. This is essentially the level of automation we achieve. We can increase it by adding more knowledge to each RAG application.
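For clarity on the bookkeeping (the numbers themselves come from our expert review), here is a small sketch of how the four ratios can be computed from the labelled evaluation records; the record field names are illustrative assumptions.

```python
def compute_metrics(records):
    """records: one dict per evaluated question, with boolean labels assigned by the human expert."""
    total = len(records)
    correctly_classified = [r for r in records if r["classified_correctly"]]
    return {
        # Correctly classified questions / Total questions
        "classifier_accuracy": len(correctly_classified) / total,
        # Good answers from correctly classified questions / Total correctly classified questions
        "precision_classified_model":
            sum(r["classified_answer_good"] for r in correctly_classified) / len(correctly_classified),
        # Good answers from all models / Total questions
        "precision_all_models": sum(r["all_models_answer_good"] for r in records) / total,
        # Questions with a useful answer in either approach / Total questions
        "expert_guided_recovery": sum(r["useful_answer_in_either"] for r in records) / total,
    }
```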
Key Findings
The analysis reveals several important insights:
- Classification Quality Matters. While the classifier correctly routes questions 81.7% of the time, the disparity between overall Response Precision (88.5% when including misclassifications) and Response Precision for correct classifications only (97.4%) demonstrates the significant impact of classification errors on answer quality.
- Specialized Models Outperform Comprehensive Queries. When correctly classified, the targeted model approach (97.4%) substantially outperforms querying all models simultaneously (83.8%), suggesting that specialized knowledge retrieval yields higher quality answers than broader but less focused approaches.
- Human Oversight Remains Valuable. The Expert-Guided Answer Recovery metric (63.4%) indicates that human experts can often extract useful information from either approach, highlighting the continued importance of human judgment in complex question-answering systems.
Figure 3: Performance Evaluation Metrics
Conclusion
Our findings suggest that investing in classifier accuracy yields substantial returns in answer quality. When classification is correct, specialized models provide remarkably precise answers (97.4%), significantly outperforming the all-models approach. However, the gap between Response Precision metrics underscores the need for continued improvement in classification systems.
For organizations implementing RAG systems with multiple specialized models, these results suggest a hybrid approach: use classification for routing questions when confidence is high, but maintain the ability to query all models in cases where classification confidence is low or when initial answers are unsatisfactory.
Summary and Looking Forward
Our RAG implementation represents a unique approach to knowledge management through its distributed ownership model, metadata-driven query routing, and UML-enhanced knowledge base. By combining these innovations into a complete application, we’ve created a system that respects the complexity and diversity of our product knowledge while making it accessible to our teams.
Early results are promising, with sales engineers reporting faster access to accurate information and domain experts appreciating the control they maintain over their knowledge areas. We’re actively enhancing the system daily, improving UML parsing capabilities, refining classification accuracy, and expanding the knowledge base. As we continue to refine this tool, we remain committed to balancing AI assistance with human expertise to provide the highest quality information to our teams and ultimately to our customers.