Key Takeaways
- Multimodal retrieval-augmented generation (RAG) enhances AI retrieval by integrating text, images, and structured data for deeper contextual understanding.
- A typical multimodal RAG pipeline consists of three primary components: data indexer, retrieval engine, and large language model (LLM).
- Multimodal RAG has practical applications in healthcare, social media, and enterprise search, enabling richer insights into those business domains.
- Multimodal data presents unique challenges; approaches to tackle them include unified embeddings, grounding modalities, and dedicated datastores with reranking.
- A healthcare example application showcases a prototype for medical diagnosis assistance that retrieves relevant past patient cases to aid the doctor’s decision-making.
Why Multimodal RAG?
Unimodal RAG has served us well in domains where information is neatly structured or exists solely as text. However, real-world data is rarely so cooperative. Think about analyzing a medical report that combines a textual diagnosis, image scans, and tabular lab results. Or consider your social media feed as you scroll through it: it is almost always a combination of images, videos, and text. A traditional RAG system, limited to processing one modality at a time, would fail to provide the nuanced understanding needed to extract actionable insights from such a complex dataset.
Enter Multimodal RAG, a transformative leap forward. By integrating multiple modalities like text, images, and even audio, this approach allows systems to:
- Handle complexity by extracting and fusing knowledge from diverse data types
- Enhance accuracy by providing richer, context-aware outputs
- Expand application scope by unlocking use cases in business domains like healthcare, education, and enterprise document analysis
The next generation of information retrieval and knowledge-based systems demands this multimodal capability to stay relevant and impactful. However, there are several challenges with this novel technique:
- Cross-modal Understanding. Aligning textual and visual information in a meaningful way.
- Data Fusion. Combining outputs from diverse retrieval methods without losing context.
- Scalability. Indexing and retrieving multimodal data efficiently at scale.
This complexity underscores the need for innovative techniques, which we’ll explore in this article. We will focus specifically on text and images, a common combination in industries like healthcare and education. Below, we briefly discuss possible use cases in two such sectors:
- Educational applications. RAG has significant potential for revolutionizing education by enabling more engaging, accessible, and personalized learning experiences. Multimodal RAG can create dynamic textbooks that integrate text, images, videos, and interactive diagrams. Students can ask questions about specific elements within these materials, and the system will provide contextually relevant answers drawn from various modalities.
- Enterprise search. Multimodal RAG is poised to significantly enhance enterprise search by enabling more comprehensive and contextually rich information retrieval. Employees can search for information using a combination of text, images, or even audio. For example, they could search for “product assembly instructions” and receive results that include text documents, instructional videos, and diagrams. This is crucial for enterprises with diverse data formats, such as marketing materials, product catalogs, and technical documentation.
Core Components of Multimodal RAG
Figure 1: Core components of Multimodal RAG
A robust multimodal RAG pipeline consists of three primary components:
- Data Indexing. Preparing multimodal data for efficient retrieval. Typically this involves creating embeddings for various modalities in the input data.
- Retrieval. Fetching the most relevant information using vector similarity or other mechanisms.
- Large Language Model (LLM). Generating coherent and insightful outputs from the retrieved data.
Let’s explore how these components come together to handle multimodal data effectively.
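Before diving into each component, here is a minimal, hedged sketch of how the three stages could be wired together. The function names (index_documents, retrieve, generate_answer) and the in-memory index are hypothetical placeholders for illustration, not part of any specific library.

import numpy as np

# A minimal sketch of the three-stage multimodal RAG flow.
# index_documents, retrieve, and generate_answer are hypothetical placeholder names.

def index_documents(documents, embed_fn):
    """Data indexing: embed each document (text, image, or both) and keep the vectors."""
    return [(doc, embed_fn(doc)) for doc in documents]

def retrieve(query_embedding, index, top_k=3):
    """Retrieval: rank indexed documents by cosine similarity to the query embedding."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    ranked = sorted(index, key=lambda item: cosine(query_embedding, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

def generate_answer(llm, query, retrieved_docs):
    """Generation: pass the query plus the retrieved context to a (multimodal) LLM."""
    context = "\n".join(str(doc) for doc in retrieved_docs)
    return llm(f"Context:\n{context}\n\nQuestion: {query}")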
Handling Multiple Modalities
Multimodal data presents unique challenges and opportunities. Here are three different approaches to tackle them:
- Unified Embedding Space. Embed text, images, and other modalities into a shared vector space.
- This approach enables vector similarity comparisons (such as cosine similarity) in the same embedding space, regardless of the modality.
- Tooling. Use models like CLIP (for image-text embeddings) and hybrid embedding solutions for other data types. A minimal sketch of this approach appears right after this list.
- Grounding Modalities. Convert non-text modalities into text. For example, we can generate textual descriptions of images using vision-language models like BLIP or LLaVA; a short captioning sketch also follows this list.
- This approach simplifies downstream data processing as all modalities converge to a primary text representation.
- Separate Datastores and Reranking. Store each modality in dedicated databases (e.g., a Postgres JSON database for structured data, a vector database for embeddings, and blob storage for images).
- Retrieve relevant data from each modality independently and rerank based on relevance scores using a multimodal LLM.
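For illustration, here is a minimal sketch of the unified embedding space approach (the first option above), using a CLIP model from the Hugging Face transformers library to embed an image and a text snippet into the same vector space and compare them with cosine similarity. The checkpoint name openai/clip-vit-base-patch32 and the file name are just assumptions for this sketch.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a general-purpose CLIP checkpoint; its image and text encoders share one vector space.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("chest_xray.png")  # any local image file (hypothetical path)
text = "chest X-ray showing right lower lobe consolidation"

with torch.no_grad():
    image_emb = model.get_image_features(**processor(images=image, return_tensors="pt"))
    text_emb = model.get_text_features(**processor(text=[text], return_tensors="pt", padding=True))

# Cosine similarity between the two modalities in the shared embedding space.
similarity = torch.nn.functional.cosine_similarity(image_emb, text_emb)
print(similarity.item())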
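And here is a similarly minimal sketch of the grounding approach (the second option above): converting an image into text with a BLIP captioning model so that everything downstream can be handled by a standard text-only RAG pipeline. The checkpoint Salesforce/blip-image-captioning-base is one commonly used public model; the file name is again a placeholder.

from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load a BLIP image-captioning model to "ground" images as text.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("chest_xray.png")  # any local image file (hypothetical path)

# Generate a caption; this text can then be embedded and indexed like any other document.
inputs = processor(images=image, return_tensors="pt")
caption_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(caption_ids[0], skip_special_tokens=True)
print(caption)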
Practical Example: Multimodal RAG for Healthcare
To illustrate the power of Multimodal RAG, let’s build a prototype for medical diagnosis assistance. For instance, given a patient’s X-ray image and medical history, the system retrieves relevant past cases to aid the doctor’s decision-making.
Here’s a simplified example of how we can approach this using readily available tools and models, using different embedding spaces (and datastores) for text and images.
Figure 2: Architecture diagram of sample application
Following are the main tools and libraries used in the sample application:
- CLIP Model. For image and text embeddings. The specific CLIP model used is clip-vit-large-patch14 from OpenAI.
- Sentence Transformers. For generating text embeddings. The specific model mentioned is all-mpnet-base-v2.
- SBert Summarizer. For summarizing relevant patient records. The specific model used is paraphrase-MiniLM-L6-v2.
- PyTorch. The code uses PyTorch for tensor operations and GPU acceleration.
- PIL (Pillow). The Python Imaging Library is used for opening and processing images.
- Requests. This library is used to download images from URLs.
1. Data Preparation
- Medical History (Text). Assume the patient’s medical history is in a text format, including details like age, past diagnoses, symptoms, medications, etc.
- X-ray Image. You’ll need a chest X-ray image in a standard format (e.g., PNG, JPG).
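As a small illustrative sketch (the file names are hypothetical), the inputs for one patient could be loaded like this:

from PIL import Image

# Load the patient's medical history from a plain-text file (hypothetical path).
with open("patient_123_history.txt", encoding="utf-8") as f:
    medical_history = f.read()

# Load the chest X-ray in a standard image format (hypothetical path).
xray_image = Image.open("patient_123_xray.png")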
2. Multimodal Embedding
- Text Embedding. We can use a pre-trained sentence transformer like all-mpnet-base-v2 from the Sentence Transformers library (sentence-transformers) to generate an embedding vector representing the patient’s medical history.
- Image Embedding. The get_image_embedding() function uses OpenAI’s clip-vit-large-patch14 model to get the image embedding. This model is designed for general visual understanding and can serve as a starting point for medical images.
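The inline code later in this article scores the image against the records directly with CLIP, but the two embedding steps described above could be written as standalone helpers roughly as follows. This is a hedged sketch: get_text_embedding is an illustrative name, and the checkpoints are the ones mentioned above.

import torch
from PIL import Image
from sentence_transformers import SentenceTransformer
from transformers import CLIPModel, CLIPProcessor

# Text embeddings for the medical history, using Sentence Transformers.
_text_model = SentenceTransformer("all-mpnet-base-v2")

def get_text_embedding(text: str) -> torch.Tensor:
    """Embed a patient's medical history (or any text) into a dense vector."""
    return torch.tensor(_text_model.encode(text))

# Image embeddings for the X-ray, using a CLIP checkpoint.
_clip_model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
_clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def get_image_embedding(image_path: str) -> torch.Tensor:
    """Embed an X-ray image into CLIP's visual embedding space."""
    image = Image.open(image_path)
    inputs = _clip_processor(images=image, return_tensors="pt")
    with torch.no_grad():
        return _clip_model.get_image_features(**inputs)[0]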
3. Retrieval and Ranking (Simplified)
In this example, we’ll simulate a basic retrieval mechanism. In a real-world implementation, a vector index or database such as Faiss or Pinecone would store and efficiently search through embeddings of a large collection of medical records and images.
- Compute the cosine similarity between the input X-ray embedding and a set of precomputed X-ray embeddings from a sample database.
- Retrieve the top k most similar cases (e.g., k = 3).
- Fetch the corresponding medical history records associated with the top k matches.
This approach enables efficient retrieval of relevant cases, aiding in diagnosis and decision-making.
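As a hedged sketch of what this retrieval step could look like, the snippet below builds a Faiss index over precomputed X-ray embeddings and returns the medical histories of the top-k nearest neighbors. The variable names (xray_embeddings, medical_histories) are illustrative, and the faiss-cpu package is assumed to be installed.

import faiss
import numpy as np

def retrieve_similar_cases(query_embedding, xray_embeddings, medical_histories, k=3):
    """Return the medical histories of the k X-rays most similar to the query image."""
    # Normalize so that inner product equals cosine similarity.
    db = np.asarray(xray_embeddings, dtype="float32")
    query = np.asarray([query_embedding], dtype="float32")
    faiss.normalize_L2(db)
    faiss.normalize_L2(query)

    # Build a flat inner-product index over the precomputed X-ray embeddings.
    index = faiss.IndexFlatIP(db.shape[1])
    index.add(db)

    # Search for the top-k most similar cases and fetch their associated records.
    scores, indices = index.search(query, k)
    return [(medical_histories[i], float(s)) for i, s in zip(indices[0], scores[0])]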
4. Fusion and Generation
- Input to LLM. Combine the following information as input to a large language model (LLM) like GPT-4V (which can handle both text and images) or LLaVA:
- The patient’s medical history text.
- The input X-ray image.
- The retrieved similar medical history texts.
- Prompt Engineering. Craft a prompt that instructs the LLM to analyze the provided information and answer questions or suggest possible diagnoses.
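As an illustrative, hedged sketch, the snippet below assembles such a prompt and sends the medical history, retrieved cases, and the X-ray image to a multimodal chat model through the OpenAI Python client. The model name, the prompt wording, and the encode_image helper are assumptions for this sketch, not part of the article’s repository.

import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def encode_image(image_path: str) -> str:
    """Base64-encode a local image so it can be sent inline to the API."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def suggest_diagnosis(history: str, retrieved_cases: list[str], xray_path: str) -> str:
    """Fuse the patient's history, similar cases, and X-ray into one multimodal prompt."""
    prompt = (
        "You are assisting a physician. Given the patient's medical history, "
        "similar past cases, and the attached chest X-ray, list possible "
        "diagnoses and the evidence supporting each.\n\n"
        f"Medical history:\n{history}\n\n"
        "Similar past cases:\n" + "\n".join(f"- {c}" for c in retrieved_cases)
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # any multimodal chat model; an assumption for this sketch
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encode_image(xray_path)}"}},
            ],
        }],
    )
    return response.choices[0].message.content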
Code and environment configuration (Python dependencies, etc.) can be found in this GitHub repository. We show the main parts of the code inline here, along with explanations.
First, we set up some test data for our application: textual summaries of patient records that we will use for RAG. In a production application, these textual descriptions would be retrieved from a database, e.g., Postgres.
def get_patient_health_records():
"""Retrieves a list of simulated patient health records."""
return [
"Patient presents with a persistent dry cough, no fever. Chest X-ray reveals mild hyperinflation. Suspect possible allergies.",
"Patient reports shortness of breath and chest tightness, especially after exertion. Auscultation reveals wheezing. Diagnosed with asthma exacerbation.",
"Patient presents with high fever, chills, and productive cough (green phlegm). Chest X-ray shows consolidation in the right lower lobe. Diagnosed with pneumonia.",
"Patient reports chest pain, radiating to the left arm. EKG shows no acute changes. Suspect musculoskeletal pain.",
"Patient presents with hemoptysis (coughing up blood). Chest CT scan reveals a small pulmonary nodule. Further investigation needed.",
"Patient reports pain and swelling in the right knee after a fall. X-ray shows no fracture. Suspect soft tissue injury.",
"Patient presents with chronic ankle pain and instability. MRI reveals ligament tear. Recommend physical therapy.",
"Patient reports burning sensation and numbness in the feet. Suspect peripheral neuropathy. Further neurological evaluation recommended.",
"Patient presents with a painful bunion on the left foot. Recommend conservative management initially.",
"Patient reports calf pain and swelling. Doppler ultrasound reveals deep vein thrombosis (DVT). Requires anticoagulation.",
"Patient reports shoulder pain and limited range of motion. MRI reveals rotator cuff tear. Recommend arthroscopic surgery.",
"Patient presents with elbow pain after overuse. Diagnosed with lateral epicondylitis (tennis elbow).",
"Patient reports numbness and tingling in the fingers. Suspect carpal tunnel syndrome. Nerve conduction studies recommended.",
"Patient presents with a wrist fracture after a fall. Requires casting.",
"Patient reports muscle weakness in the arm. Neurological examination reveals possible nerve impingement.",
"Patient reports severe headache and stiff neck. Lumbar puncture performed, ruling out meningitis.",
"Patient presents with abdominal pain and vomiting. CT scan reveals appendicitis. Requires surgery.",
"Patient reports fatigue and weight loss. Further blood tests and imaging studies needed to determine the cause.",
"Patient presents with a skin rash. Biopsy performed to determine the diagnosis.",
"Patient reports anxiety and insomnia. Recommend cognitive behavioral therapy and lifestyle changes.",
"Patient presents with high fever, chills, and productive cough (green phlegm). Chest X-ray shows consolidation in the right lower lobe. Diagnosed with pneumonia.",
"Patient presents with a persistent cough, shortness of breath, and wheezing. Oxygen saturation is 90% on room air. Chest X-ray shows hyperinflation and flattened diaphragm. Diagnosed with COPD exacerbation.",
"Patient reports sharp chest pain that worsens with deep breathing. Auscultation reveals pleural rub. Diagnosed with pleurisy.",
"Patient presents with fever, cough, and night sweats. Chest X-ray shows a cavitary lesion in the upper lobe. Sputum culture positive for Mycobacterium tuberculosis. Diagnosed with pulmonary tuberculosis.",
"Patient reports sudden onset of shortness of breath and chest pain. CT pulmonary angiogram reveals a pulmonary embolism. Started on anticoagulation therapy.",
"Patient presents with a history of asthma. Currently experiencing difficulty breathing and using accessory muscles. Peak expiratory flow rate is significantly reduced. Diagnosed with acute severe asthma."
]
The following function runs the core of the multimodal RAG pipeline.
import requests
import torch
from PIL import Image
from pprint import pprint
from summarizer.sbert import SBertSummarizer
from transformers import CLIPModel, CLIPProcessor


def run_multimodal_rag(url, probability_threshold, max_values):
    """
    Runs multimodal Retrieval Augmented Generation (RAG) on an X-ray image and patient records.

    Args:
        url: URL of the X-ray image.
        probability_threshold: Minimum probability for relevance.
        max_values: Maximum number of relevant records to retrieve.

    Returns:
        A summary of the relevant patient records, or None if the image cannot be loaded.
    """
    # Load the CLIP model and processor, moving the model to GPU if available.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    patient_records = get_patient_health_records()

    # Download and open the X-ray image.
    try:
        image = Image.open(requests.get(url=url, stream=True).raw)
    except Exception as e:
        print(f"Error loading image: {e}")
        return None

    # Encode the X-ray image and all patient records in a single batch.
    inputs = processor(text=patient_records, images=image, return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        outputs = model(**inputs)

    # logits_per_image holds the image-text similarity scores; softmax turns them
    # into a probability distribution over the candidate patient records.
    logits_per_image = outputs.logits_per_image
    probs = logits_per_image.softmax(dim=1)
    pprint(probs, width=10)

    # Keep the top-k records whose probability exceeds the threshold.
    top_values, top_indices = torch.topk(probs, max_values)
    top_values = top_values.tolist()[0]
    top_indices = top_indices.tolist()[0]
    relevant_records = [
        patient_records[index]
        for value, index in zip(top_values, top_indices)
        if value > probability_threshold
    ]
    print("Relevant Patient Records:", relevant_records)

    # Summarize the retrieved records into a short passage for the clinician.
    summarizer = SBertSummarizer('paraphrase-MiniLM-L6-v2')
    summary = summarizer(" ".join(relevant_records), num_sentences=5)
    return summary


if __name__ == '__main__':
    # The URL to download the sample image from.
    image_url = "https://healthimaging.com/sites/default/files/styles/top_stories/public/assets/articles/4996132.jpg.webp?itok=sR1hg4KS"
    # Values to be tuned per application needs.
    probability_threshold = 0.1
    max_values = 5
    # Run multimodal RAG and print the output for inspection.
    print(run_multimodal_rag(image_url, probability_threshold, max_values))
In this example, we provide the system with an image of a chest X-ray and leverage our RAG application to search for relevant records and summarize them. An example input chest X-ray image is shown below:
Figure 3: Sample image used as input to the RAG application – image source
The output from the multimodal RAG run using the X-ray image (Figure 3 above) over the existing patient records is as follows:
Patient presents with high fever, chills, and productive cough (green phlegm). Chest X-ray shows consolidation in the right lower lobe. Sputum culture positive for Mycobacterium tuberculosis. Recommend cognitive behavioral therapy and lifestyle changes. Patient presents with high fever, chills, and productive cough (green phlegm).
Future Enhancements
To make this system production-ready, several improvements are necessary:
- Fine-Tuning. Train CLIP and Sentence Transformers on medical datasets for domain-specific accuracy.
- Error Handling. Implement robust exception handling for API requests and image processing.
- Data Security. Ensure HIPAA/GDPR compliance when dealing with patient data.
- Scalability. Optimize retrieval using distributed vector databases (e.g., Weaviate, Milvus).
Conclusion
Multimodal RAG represents a paradigm shift in AI-driven retrieval, bridging the gap between text, images, videos, and other data of differing modalities. By combining unified embeddings, grounding techniques, and reranking strategies, we unlock new capabilities in healthcare, education, and enterprise search.
Our prototype demonstrates how multimodal AI can enhance medical diagnosis by retrieving and analyzing patient history alongside medical imagery. As AI research progresses, optimizing these techniques will be key to deploying scalable, real-world multimodal applications.