Key Takeaways
- Multimodal retrieval-augmented generation (RAG) enhances AI retrieval by integrating text, images, and structured data for deeper contextual understanding.
- A typical multimodal RAG pipeline consists of three primary components: data indexer, retrieval engine, and large language model (LLM).
- Multimodal RAG has practical applications in healthcare, social media, and enterprise search, enabling richer insights into those business domains.
- Multimodal data presents unique challenges; approaches to tackle them include unified embeddings, grounding modalities, and dedicated datastores with reranking.
- A healthcare example application showcases a prototype for medical diagnosis assistance that retrieves relevant past patient cases to aid the doctor’s decision-making.
Why Multimodal RAG?
Unimodal RAG has served us well in domains where information is neatly structured or exists solely as text. However, real-world data is rarely so cooperative. Think about analyzing a medical report that combines a textual diagnosis, image scans, and tabular lab results. Or consider your social media feed as you scroll through it: it is almost always a combination of images, videos, and text. A traditional RAG system, limited to processing one modality at a time, would fail to provide the nuanced understanding needed to extract actionable insights from such a complex dataset.
Enter Multimodal RAG, a transformative leap forward. By integrating multiple modalities like text, images, and even audio, this approach allows systems to:
- Handle complexity by extracting and fusing knowledge from diverse data types
- Enhance accuracy by providing richer, context-aware outputs
- Expand application scope by unlocking use cases in business domains like healthcare, education, and enterprise document analysis
The next generation of information retrieval and knowledge-based systems demands this multimodal capability to stay relevant and impactful. However, there are several challenges with this novel technique:
- Cross-modal Understanding. Aligning textual and visual information in a meaningful way.
- Data Fusion. Combining outputs from diverse retrieval methods without losing context.
- Scalability. Indexing and retrieving multimodal data efficiently at scale.
This complexity underscores the need for innovative techniques, which we’ll explore in this article. We will focus specifically on text and images, a common combination in industries like healthcare and education. Below, we briefly discuss possible use cases in two such sectors:
- Educational applications. RAG has significant potential for revolutionizing education by enabling more engaging, accessible, and personalized learning experiences. Multimodal RAG can create dynamic textbooks that integrate text, images, videos, and interactive diagrams. Students can ask questions about specific elements within these materials, and the system will provide contextually relevant answers drawn from various modalities.
- Enterprise search. Multimodal RAG is poised to significantly enhance enterprise search by enabling more comprehensive and contextually rich information retrieval. Employees can search for information using a combination of text, images, or even audio. For example, they could search for “product assembly instructions” and receive results that include text documents, instructional videos, and diagrams. This is crucial for enterprises with diverse data formats, such as marketing materials, product catalogs, and technical documentation.
Core Components of Multimodal RAG
Figure 1: Core components of Multimodal RAG
A robust multimodal RAG pipeline consists of three primary components:
- Data Indexing. Preparing multimodal data for efficient retrieval. Typically this involves creating embeddings for various modalities in the input data.
- Retrieval. Fetching the most relevant information using vector similarity or other mechanisms.
- Large Language Model (LLM). Generating coherent and insightful outputs from the retrieved data.
Let’s explore how these components come together to handle multimodal data effectively.
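Before diving into each component, here is a minimal, hedged sketch of how the three stages could be wired together. The function names (index_documents, retrieve, generate_answer) and the in-memory index are hypothetical placeholders for illustration, not part of any specific library.

import numpy as np

# A minimal sketch of the three-stage multimodal RAG flow.
# index_documents, retrieve, and generate_answer are hypothetical placeholder names.

def index_documents(documents, embed_fn):
    """Data indexing: embed each document (text, image, or both) and keep the vectors."""
    return [(doc, embed_fn(doc)) for doc in documents]

def retrieve(query_embedding, index, top_k=3):
    """Retrieval: rank indexed documents by cosine similarity to the query embedding."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    ranked = sorted(index, key=lambda item: cosine(query_embedding, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

def generate_answer(llm, query, retrieved_docs):
    """Generation: pass the query plus the retrieved context to a (multimodal) LLM."""
    context = "\n".join(str(doc) for doc in retrieved_docs)
    return llm(f"Context:\n{context}\n\nQuestion: {query}")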
Handling Multiple Modalities
Multimodal data presents unique challenges and opportunities. Here are three different approaches to tackle them:
- Unified Embedding Space. Embed text, images, and other modalities into a shared vector space.
- This approach enables vector similarity comparisons (such as cosine similarity) in the same embedding space, regardless of the modality.
- Tooling. Use models like CLIP (for image-text embeddings) and hybrid embedding solutions for other data types. A minimal sketch of this approach appears right after this list.
- Grounding Modalities. Convert non-text modalities into text. For example, we can generate textual descriptions of images using vision-language models like BLIP or LLaVA; a short captioning sketch also follows this list.
- This approach simplifies downstream data processing as all modalities converge to a primary text representation.
- Separate Datastores and Reranking. Store each modality in dedicated databases (e.g., a Postgres JSON database for structured data, a vector database for embeddings, and blob storage for images).
- Retrieve relevant data from each modality independently and rerank based on relevance scores using a multimodal LLM.
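For illustration, here is a minimal sketch of the unified embedding space approach (the first option above), using a CLIP model from the Hugging Face transformers library to embed an image and a text snippet into the same vector space and compare them with cosine similarity. The checkpoint name openai/clip-vit-base-patch32 and the file name are just assumptions for this sketch.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a general-purpose CLIP checkpoint; its image and text encoders share one vector space.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("chest_xray.png")  # any local image file (hypothetical path)
text = "chest X-ray showing right lower lobe consolidation"

with torch.no_grad():
    image_emb = model.get_image_features(**processor(images=image, return_tensors="pt"))
    text_emb = model.get_text_features(**processor(text=[text], return_tensors="pt", padding=True))

# Cosine similarity between the two modalities in the shared embedding space.
similarity = torch.nn.functional.cosine_similarity(image_emb, text_emb)
print(similarity.item())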
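And here is a similarly minimal sketch of the grounding approach (the second option above): converting an image into text with a BLIP captioning model so that everything downstream can be handled by a standard text-only RAG pipeline. The checkpoint Salesforce/blip-image-captioning-base is one commonly used public model; the file name is again a placeholder.

from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load a BLIP image-captioning model to "ground" images as text.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("chest_xray.png")  # any local image file (hypothetical path)

# Generate a caption; this text can then be embedded and indexed like any other document.
inputs = processor(images=image, return_tensors="pt")
caption_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(caption_ids[0], skip_special_tokens=True)
print(caption)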
Practical Example: Multimodal RAG for Healthcare
To illustrate the power of Multimodal RAG, let’s build a prototype for medical diagnosis assistance. For instance, given a patient’s X-ray image and medical history, the system retrieves relevant past cases to aid the doctor’s decision-making.
Here’s a simplified example of how we can approach this using readily available tools and models, using different embedding spaces (and datastores) for text and images.
Figure 2: Architecture diagram of sample application
Following are the main tools and libraries used in the sample application:
- CLIP Model. For image and text embeddings. The specific CLIP model used is clip-vit-large-patch14 from OpenAI.
- Sentence Transformers. For generating text embeddings. The specific model mentioned is all-mpnet-base-v2.
- SBert Summarizer. For summarizing relevant patient records. The specific model used is paraphrase-MiniLM-L6-v2.
- PyTorch. The code uses PyTorch for tensor operations and GPU acceleration.
- PIL (Pillow). The Python Imaging Library is used for opening and processing images.
- Requests. This library is used to download images from URLs.
1. Data Preparation
- Medical History (Text). Assume the patient’s medical history is in a text format, including details like age, past diagnoses, symptoms, medications, etc.
- X-ray Image. You’ll need a chest X-ray image in a standard format (e.g., PNG, JPG).
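As a small illustrative sketch (the file names are hypothetical), the inputs for one patient could be loaded like this:

from PIL import Image

# Load the patient's medical history from a plain-text file (hypothetical path).
with open("patient_123_history.txt", encoding="utf-8") as f:
    medical_history = f.read()

# Load the chest X-ray in a standard image format (hypothetical path).
xray_image = Image.open("patient_123_xray.png")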
2. Multimodal Embedding
- Text Embedding. We can use a pre-trained sentence transformer like all-mpnet-base-v2 from the Sentence Transformers library (sentence-transformers) to generate an embedding vector representing the patient’s medical history.
- Image Embedding. The get_image_embedding() function uses OpenAI’s clip-vit-large-patch14 model to get the image embedding. This model is designed for general visual understanding and can serve as a starting point for medical images.
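The inline code later in this article scores the image against the records directly with CLIP, but the two embedding steps described above could be written as standalone helpers roughly as follows. This is a hedged sketch: get_text_embedding is an illustrative name, and the checkpoints are the ones mentioned above.

import torch
from PIL import Image
from sentence_transformers import SentenceTransformer
from transformers import CLIPModel, CLIPProcessor

# Text embeddings for the medical history, using Sentence Transformers.
_text_model = SentenceTransformer("all-mpnet-base-v2")

def get_text_embedding(text: str) -> torch.Tensor:
    """Embed a patient's medical history (or any text) into a dense vector."""
    return torch.tensor(_text_model.encode(text))

# Image embeddings for the X-ray, using a CLIP checkpoint.
_clip_model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
_clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def get_image_embedding(image_path: str) -> torch.Tensor:
    """Embed an X-ray image into CLIP's visual embedding space."""
    image = Image.open(image_path)
    inputs = _clip_processor(images=image, return_tensors="pt")
    with torch.no_grad():
        return _clip_model.get_image_features(**inputs)[0]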
3. Retrieval and Ranking (Simplified)
In this example, we’ll simulate a basic retrieval mechanism. In a real-world implementation, a vector index or database such as Faiss or Pinecone would store and efficiently search through embeddings of a large collection of medical records and images.
- Compute the cosine similarity between the input X-ray embedding and a set of precomputed X-ray embeddings from a sample database.
- Retrieve the top k most similar cases (e.g., k = 3).
- Fetch the corresponding medical history records associated with the top k matches.
This approach enables efficient retrieval of relevant cases, aiding in diagnosis and decision-making.
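As a hedged sketch of what this retrieval step could look like, the snippet below builds a Faiss index over precomputed X-ray embeddings and returns the medical histories of the top-k nearest neighbors. The variable names (xray_embeddings, medical_histories) are illustrative, and the faiss-cpu package is assumed to be installed.

import faiss
import numpy as np

def retrieve_similar_cases(query_embedding, xray_embeddings, medical_histories, k=3):
    """Return the medical histories of the k X-rays most similar to the query image."""
    # Normalize so that inner product equals cosine similarity.
    db = np.asarray(xray_embeddings, dtype="float32")
    query = np.asarray([query_embedding], dtype="float32")
    faiss.normalize_L2(db)
    faiss.normalize_L2(query)

    # Build a flat inner-product index over the precomputed X-ray embeddings.
    index = faiss.IndexFlatIP(db.shape[1])
    index.add(db)

    # Search for the top-k most similar cases and fetch their associated records.
    scores, indices = index.search(query, k)
    return [(medical_histories[i], float(s)) for i, s in zip(indices[0], scores[0])]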
4. Fusion and Generation
- Input to LLM. Combine the following information as input to a large language model (LLM) like GPT-4V (which can handle both text and images) or LLaVA:
- The patient’s medical history text.
- The input X-ray image.
- The retrieved similar medical history texts.
- Prompt Engineering. Craft a prompt that instructs the LLM to analyze the provided information and answer questions or suggest possible diagnoses.
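As an illustrative, hedged sketch, the snippet below assembles such a prompt and sends the medical history, retrieved cases, and the X-ray image to a multimodal chat model through the OpenAI Python client. The model name, the prompt wording, and the encode_image helper are assumptions for this sketch, not part of the article’s repository.

import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def encode_image(image_path: str) -> str:
    """Base64-encode a local image so it can be sent inline to the API."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def suggest_diagnosis(history: str, retrieved_cases: list[str], xray_path: str) -> str:
    """Fuse the patient's history, similar cases, and X-ray into one multimodal prompt."""
    prompt = (
        "You are assisting a physician. Given the patient's medical history, "
        "similar past cases, and the attached chest X-ray, list possible "
        "diagnoses and the evidence supporting each.\n\n"
        f"Medical history:\n{history}\n\n"
        "Similar past cases:\n" + "\n".join(f"- {c}" for c in retrieved_cases)
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # any multimodal chat model; an assumption for this sketch
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encode_image(xray_path)}"}},
            ],
        }],
    )
    return response.choices[0].message.content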
Code and environment configuration (Python dependencies, etc.) can be found in this GitHub repository. We show the main parts of the code inline here, along with explanations.
First, we set up some test data for our application: textual summaries of patient records that we will use for RAG. In a production application, these textual descriptions would be retrieved from a database, e.g., Postgres.
def get_patient_health_records():
"""Retrieves a list of simulated patient health records."""
return [
"Patient presents with a persistent dry cough, no fever. Chest X-ray reveals mild hyperinflation. Suspect possible allergies.",
"Patient reports shortness of breath and chest tightness, especially after exertion. Auscultation reveals wheezing. Diagnosed with asthma exacerbation.",
"Patient presents with high fever, chills, and productive cough (green phlegm). Chest X-ray shows consolidation in the right lower lobe. Diagnosed with pneumonia.",
"Patient reports chest pain, radiating to the left arm. EKG shows no acute changes. Suspect musculoskeletal pain.",
"Patient presents with hemoptysis (coughing up blood). Chest CT scan reveals a small pulmonary nodule. Further investigation needed.",
"Patient reports pain and swelling in the right knee after a fall. X-ray shows no fracture. Suspect soft tissue injury.",
"Patient presents with chronic ankle pain and instability. MRI reveals ligament tear. Recommend physical therapy.",
"Patient reports burning sensation and numbness in the feet. Suspect peripheral neuropathy. Further neurological evaluation recommended.",
"Patient presents with a painful bunion on the left foot. Recommend conservative management initially.",
"Patient reports calf pain and swelling. Doppler ultrasound reveals deep vein thrombosis (DVT). Requires anticoagulation.",
"Patient reports shoulder pain and limited range of motion. MRI reveals rotator cuff tear. Recommend arthroscopic surgery.",
"Patient presents with elbow pain after overuse. Diagnosed with lateral epicondylitis (tennis elbow).",
"Patient reports numbness and tingling in the fingers. Suspect carpal tunnel syndrome. Nerve conduction studies recommended.",
"Patient presents with a wrist fracture after a fall. Requires casting.",
"Patient reports muscle weakness in the arm. Neurological examination reveals possible nerve impingement.",
"Patient reports severe headache and stiff neck. Lumbar puncture performed, ruling out meningitis.",
"Patient presents with abdominal pain and vomiting. CT scan reveals appendicitis. Requires surgery.",
"Patient reports fatigue and weight loss. Further blood tests and imaging studies needed to determine the cause.",
"Patient presents with a skin rash. Biopsy performed to determine the diagnosis.",
"Patient reports anxiety and insomnia. Recommend cognitive behavioral therapy and lifestyle changes.",
"Patient presents with high fever, chills, and productive cough (green phlegm). Chest X-ray shows consolidation in the right lower lobe. Diagnosed with pneumonia.",
"Patient presents with a persistent cough, shortness of breath, and wheezing. Oxygen saturation is 90% on room air. Chest X-ray shows hyperinflation and flattened diaphragm. Diagnosed with COPD exacerbation.",
"Patient reports sharp chest pain that worsens with deep breathing. Auscultation reveals pleural rub. Diagnosed with pleurisy.",
"Patient presents with fever, cough, and night sweats. Chest X-ray shows a cavitary lesion in the upper lobe. Sputum culture positive for Mycobacterium tuberculosis. Diagnosed with pulmonary tuberculosis.",
"Patient reports sudden onset of shortness of breath and chest pain. CT pulmonary angiogram reveals a pulmonary embolism. Started on anticoagulation therapy.",
"Patient presents with a history of asthma. Currently experiencing difficulty breathing and using accessory muscles. Peak expiratory flow rate is significantly reduced. Diagnosed with acute severe asthma."
]
The following function runs the core of the multimodal RAG pipeline.
import requests
import torch
from PIL import Image
from pprint import pprint
from summarizer.sbert import SBertSummarizer
from transformers import CLIPModel, CLIPProcessor


def run_multimodal_rag(url, probability_threshold, max_values):
    """
    Runs multimodal Retrieval Augmented Generation (RAG) on an X-ray image and patient records.

    Args:
        url: URL of the X-ray image.
        probability_threshold: Minimum probability for relevance.
        max_values: Maximum number of relevant records to retrieve.

    Returns:
        A summary of the relevant patient records, or None if the image cannot be loaded.
    """
    # Load the CLIP model and processor, moving the model to GPU if available.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    patient_records = get_patient_health_records()

    # Download and open the X-ray image.
    try:
        image = Image.open(requests.get(url=url, stream=True).raw)
    except Exception as e:
        print(f"Error loading image: {e}")
        return None

    # Encode the X-ray image and all patient records in a single batch.
    inputs = processor(text=patient_records, images=image, return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        outputs = model(**inputs)

    # logits_per_image holds the image-text similarity scores; softmax turns them
    # into a probability distribution over the candidate patient records.
    logits_per_image = outputs.logits_per_image
    probs = logits_per_image.softmax(dim=1)
    pprint(probs, width=10)

    # Keep the top-k records whose probability exceeds the threshold.
    top_values, top_indices = torch.topk(probs, max_values)
    top_values = top_values.tolist()[0]
    top_indices = top_indices.tolist()[0]
    relevant_records = [
        patient_records[index]
        for value, index in zip(top_values, top_indices)
        if value > probability_threshold
    ]
    print("Relevant Patient Records:", relevant_records)

    # Summarize the retrieved records into a short passage for the clinician.
    summarizer = SBertSummarizer('paraphrase-MiniLM-L6-v2')
    summary = summarizer(" ".join(relevant_records), num_sentences=5)
    return summary


if __name__ == '__main__':
    # The URL to download the sample image from.
    image_url = "https://healthimaging.com/sites/default/files/styles/top_stories/public/assets/articles/4996132.jpg.webp?itok=sR1hg4KS"
    # Values to be tuned per application needs.
    probability_threshold = 0.1
    max_values = 5
    # Run multimodal RAG and print the output for inspection.
    print(run_multimodal_rag(image_url, probability_threshold, max_values))
In this example, we provide the system with an image of a chest X-ray and leverage our RAG application to search for relevant records and summarize them. An example input chest X-ray image is shown below:
Figure 3: Sample image used as input to the RAG application – image source
The output from the multimodal RAG run using the X-ray image (Figure 3 above) over the existing patient records is as follows:
Patient presents with high fever, chills, and productive cough (green phlegm). Chest X-ray shows consolidation in the right lower lobe. Sputum culture positive for Mycobacterium tuberculosis. Recommend cognitive behavioral therapy and lifestyle changes. Patient presents with high fever, chills, and productive cough (green phlegm).
Future Enhancements
To make this system production-ready, several improvements are necessary:
- Fine-Tuning. Train CLIP and Sentence Transformers on medical datasets for domain-specific accuracy.
- Error Handling. Implement robust exception handling for API requests and image processing.
- Data Security. Ensure HIPAA/GDPR compliance when dealing with patient data.
- Scalability. Optimize retrieval using distributed vector databases (e.g., Weaviate, Milvus).
Conclusion
Multimodal RAG represents a paradigm shift in AI-driven retrieval, bridging the gap between text, images, videos, and other data of differing modalities. By combining unified embeddings, grounding techniques, and reranking strategies, we unlock new capabilities in healthcare, education, and enterprise search.
Our prototype demonstrates how multimodal AI can enhance medical diagnosis by retrieving and analyzing patient history alongside medical imagery. As AI research progresses, optimizing these techniques will be key to deploying scalable, real-world multimodal applications.