Table of Links
Abstract and 1 Introduction
2 Related Work
2.1. Multimodal Learning
2.2. Multiple Instance Learning
3 Methodology
3.1. Preliminaries and Notations
3.2. Relations between Attention-based VPG and MIL
3.3. MIVPG for Multiple Visual Inputs
3.4. Unveiling Instance Correlation in MIVPG for Enhanced Multi-instance Scenarios
4 Experiments and 4.1. General Setup
4.2. Scenario 1: Samples with Single Image
4.3. Scenario 2: Samples with Multiple Images, with Each Image as a General Embedding
4.4. Scenario 3: Samples with Multiple Images, with Each Image Having Multiple Patches to be Considered and 4.5. Case Study
5 Conclusion and References
Supplementary Material
A. Detailed Architecture of QFormer
B. Proof of Proposition
C. More Experiments
3.3. MIVPG for Multiple Visual Inputs
When a sample comprises multiple images, MIL feature aggregation must be considered from two perspectives. At the image level, each image can be treated as a "bag" and each patch within the image as an "instance." At the sample level, each sample can likewise be regarded as a "bag," with each image within the sample as an "instance." When a sample contains only a single image, we can focus on the image-level perspective, since the sample-level bag then holds just one instance. In the general case, however, MIL-based feature aggregation should be applied hierarchically across both levels. Without loss of generality, we now consider the input of the MIVPG to be a bag B containing multiple instances. The cross-attention can then be expressed as Attention(Q = q, K = B, V = B).
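To make the expression Attention(Q = q, K = B, V = B) concrete, the following is a minimal sketch of attention pooling over a single bag. It is not the authors' implementation: it assumes identity projections (an actual Q-Former layer would use learned query, key, and value projections and multiple heads), the function names `softmax` and `cross_attention` are hypothetical, and the numbers are made-up toy values. The same function applies at either level of the hierarchy, whether the bag holds patch embeddings of one image or image embeddings of one sample.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, bag):
    """Attention(Q=queries, K=bag, V=bag) with identity projections (an assumption).

    queries: list of query vectors, each of length d
    bag:     list of instance embeddings, each of length d
    Returns (pooled, weights): one pooled vector and one weight row per query.
    """
    d = len(bag[0])
    pooled, weights = [], []
    for q in queries:
        # Scaled dot-product score between this query and every instance.
        scores = [sum(a * b for a, b in zip(q, inst)) / math.sqrt(d) for inst in bag]
        w = softmax(scores)
        # Weighted sum of the instances (the values are the bag itself).
        pooled.append([sum(wj * inst[k] for wj, inst in zip(w, bag)) for k in range(d)])
        weights.append(w)
    return pooled, weights


# Toy data (made up): two queries and a bag of three 4-dimensional instances.
q = [[0.5, -0.1, 0.3, 0.0], [0.0, 0.2, -0.4, 0.1]]
B = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 0.0], [2.0, 1.0, 0.0, 1.0]]

pooled, weights = cross_attention(q, B)

# A bag with a single instance gets weight 1, so pooling returns that instance.
single_pooled, single_weights = cross_attention(q, [B[0]])
```

Each row of `weights` sums to one over the instances, so every query output is a convex combination of the bag's instances, and reordering the instances in `B` leaves `pooled` unchanged up to floating-point rounding. The single-instance call illustrates why the sample-level view adds nothing when a sample contains only one image.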
:::info
Authors:
(1) Wenliang Zhong, The University of Texas at Arlington;
(2) Wenyi Wu, Amazon;
(3) Qi Li, Amazon;
(4) Rob Barton, Amazon;
(5) Boxin Du, Amazon;
(6) Shioulin Sam, Amazon;
(7) Karim Bouyarmane, Amazon;
(8) Ismail Tutar, Amazon;
(9) Junzhou Huang, The University of Texas at Arlington.
:::
:::info
This paper is available on arXiv under the CC BY 4.0 Deed (Attribution 4.0 International) license.
:::
