World of Software

MIVPG: Multi-Instance Visual Prompt Generator for MLLMs | HackerNoon

News Room
Last updated: 2025/11/11 at 9:47 AM
News Room Published 11 November 2025

Table of Links

Abstract and 1 Introduction

  2. Related Work

    2.1. Multimodal Learning

    2.2. Multiple Instance Learning

  3. Methodology

    3.1. Preliminaries and Notations

    3.2. Relations between Attention-based VPG and MIL

    3.3. MIVPG for Multiple Visual Inputs

    3.4. Unveiling Instance Correlation in MIVPG for Enhanced Multi-instance Scenarios

  4. Experiments and 4.1. General Setup

    4.2. Scenario 1: Samples with Single Image

    4.3. Scenario 2: Samples with Multiple Images, with Each Image as a General Embedding

    4.4. Scenario 3: Samples with Multiple Images, with Each Image Having Multiple Patches to be Considered and 4.5. Case Study

  5. Conclusion and References

Supplementary Material

A. Detailed Architecture of QFormer

B. Proof of Proposition

C. More Experiments

Abstract

Multimodal Large Language Models (MLLMs) have achieved SOTA performance in various visual language tasks by fusing visual representations with LLMs through visual adapters. In this paper, we first establish that adapters using query-based Transformers, such as Q-former, are simplified Multi-instance Learning methods that do not consider instance heterogeneity/correlation. We then propose a general component termed Multi-instance Visual Prompt Generator (MIVPG) to incorporate enriched visual representations into LLMs by taking advantage of instance correlation between images or patches for the same sample. Quantitative evaluation on three public vision-language (VL) datasets from different scenarios shows that the proposed MIVPG improves on Q-former in the main VL tasks.

1. Introduction

In recent years, with the disruptive changes brought to the Machine Learning community by Large Language Models (LLMs)[4, 29–31], an increasing number of researchers have been exploring the application of LLMs in the realm of multimodality, giving rise to Multimodal Large Language Models (MLLMs)[2, 21, 22, 24, 48]. One of the most common forms of multimodality involves the combination of images and text. Just as humans excel in using both images and text to perform tasks, the fusion of images and text in multimodal applications finds wide real-world use, such as in Image Captioning[13, 32, 43, 44] and Visual Question Answering (VQA)[3, 11, 25, 38]. Leveraging the formidable generalization capabilities of large models, MLLMs have achieved state-of-the-art (SOTA) performance in various few-shot and fine-tuning tasks.

In contemporary MLLMs, images are integrated through a critical component that imparts visual understanding to LLMs by transforming images into visual tokens. We term such components Visual Prompt Generators (VPGs) in this paper. SOTA MLLMs, such as BLIP2[22], Flamingo[2], and MiniGPT-4[48], utilize attention-based VPGs with learnable query embeddings. These embeddings engage in cross-attention with visual embeddings, extracting visual information for LLM input. In this work, we introduce a novel approach, the Multi-instance Visual Prompt Generator (MIVPG), designed to handle diverse visual inputs. Drawing inspiration from Multiple Instance Learning (MIL), MIVPG treats images or patches of a sample as a set of instances, forming a "bag." Unlike traditional machine learning tasks, MIL performs predictions at the bag level rather than the instance level, employing permutation-invariant functions to aggregate instances. MIVPG extends this concept by considering correlations and relationships across visual representations, facilitating signal pooling from different dimensions. Additionally, we establish that the commonly used QFormer[22, 48] is a limited MIL module, which motivates the introduction of MIVPG. We showcase MIVPG's enhanced performance across three distinct scenarios: common natural images, gigapixel-sized pathological images, and e-commerce products with multiple images.
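To make the link between query-based cross-attention and bag-level MIL aggregation concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names (`cross_attention_pool`, `softmax`, `dot`) and the toy sizes are assumptions made for illustration, and the sketch uses a single head with no learned key/value projections. It shows that each query pools a whole bag of instance embeddings into one token, and that the result does not depend on the order of the instances.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cross_attention_pool(queries, instances):
    """Pool a bag of instance embeddings into one token per query.

    Hypothetical simplification: single head, no key/value projections.
    Each query scores every instance, and the softmax weights form a
    weighted sum over the bag, which is a permutation-invariant aggregation.
    """
    dim = len(instances[0])
    scale = math.sqrt(dim)
    tokens = []
    for q in queries:
        weights = softmax([dot(q, x) / scale for x in instances])
        pooled = [sum(w * x[d] for w, x in zip(weights, instances))
                  for d in range(dim)]
        tokens.append(pooled)
    return tokens


random.seed(0)
dim, num_queries, bag_size = 4, 2, 5  # toy sizes, chosen arbitrarily

# Stand-ins for learnable query embeddings and for image/patch embeddings.
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_queries)]
bag = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(bag_size)]

tokens = cross_attention_pool(queries, bag)

# Reordering the instances in the bag leaves the pooled tokens unchanged
# (up to floating-point rounding).
shuffled = bag[:]
random.shuffle(shuffled)
tokens_shuffled = cross_attention_pool(queries, shuffled)

max_diff = max(abs(a - b)
               for t, s in zip(tokens, tokens_shuffled)
               for a, b in zip(t, s))
print(len(tokens), len(tokens[0]), max_diff < 1e-9)
```

In this simplified form each instance contributes to the output only through its own attention weight, so relationships between instances are not modeled. That is the kind of limitation the paragraph above attributes to QFormer-style pooling.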

In summary, our contributions in this paper can be outlined as follows:

• We introduce a general and flexible component, MIVPG, to incorporate enriched visual representations and their relationships into open-source LLMs.

• We establish that the commonly used QFormer is a simplified case of MIVPG with limited capability and conduct experiments to show the superiority of our component over the QFormer.

• We evaluate the MIVPG on three public datasets from distinct scenarios and show that it supports visual representation aggregation along different dimensions: the image dimension for e-commerce data and the patch dimension for whole slide images (WSI). MIVPG outperforms the QFormer by a significant margin on all datasets, which demonstrates the effectiveness and generalizability of the proposed component.

This paper is available on arXiv under the CC BY 4.0 Deed (Attribution 4.0 International) license.

Authors:

(1) Wenliang Zhong, The University of Texas at Arlington;

(2) Wenyi Wu, Amazon;

(3) Qi Li, Amazon;

(4) Rob Barton, Amazon;

(5) Boxin Du, Amazon;

(6) Shioulin Sam, Amazon;

(7) Karim Bouyarmane, Amazon;

(8) Ismail Tutar, Amazon;

(9) Junzhou Huang, The University of Texas at Arlington.
