
MIVPG on E-commerce: Multi-Image/Multi-Patch Aggregation for Captioning | HackerNoon

News Room · Published 19 November 2025 (last updated 3:06 AM)

Table of Links

Abstract and 1. Introduction

2. Related Work
    2.1. Multimodal Learning
    2.2. Multiple Instance Learning

3. Methodology
    3.1. Preliminaries and Notations
    3.2. Relations between Attention-based VPG and MIL
    3.3. MIVPG for Multiple Visual Inputs
    3.4. Unveiling Instance Correlation in MIVPG for Enhanced Multi-instance Scenarios

4. Experiments and 4.1. General Setup
    4.2. Scenario 1: Samples with Single Image
    4.3. Scenario 2: Samples with Multiple Images, with Each Image as a General Embedding
    4.4. Scenario 3: Samples with Multiple Images, with Each Image Having Multiple Patches to be Considered and 4.5. Case Study

5. Conclusion and References

Supplementary Material

A. Detailed Architecture of QFormer
B. Proof of Proposition
C. More Experiments

4.4. Scenario 3: Samples with Multiple Images, with Each Image Having Multiple Patches to be Considered

Third, we assess the method in a comprehensive setting where each sample contains multiple images and multiple patches must be considered within each image. Specifically, we use the Amazon Berkeley Objects (ABO) [7] dataset, which consists of e-commerce product samples. Each product is accompanied by multiple images illustrating its characteristics from different perspectives; typically, one of these images provides an overview of the product and is displayed first on an e-commerce website. We refrain from using the product title as the caption due to its limited descriptiveness. Instead, we rely on manually annotated captions [27] from the same dataset as reference captions to assess the quality of our generated captions. There are 6410 product samples, which we randomly partition into train, validation, and test subsets in a 70%/10%/20% ratio. Each product comprises between 2 and 21 images. In this dataset, we must consider both image details and multiple images simultaneously; consequently, we apply MIL along both the image and patch dimensions. Specifically, we employ AB-MIL (Equation 5) to generate an image-level embedding for each image. Each image is then treated as an instance and passed to the QFormer for sample-level MIL. Since the number of patches per sample is much smaller than for WSIs, we do not apply PPEG in this setting.
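The patch-level aggregation step can be sketched as attention-based MIL pooling in the style of AB-MIL: each image's patch embeddings form a bag, and a learned attention distribution over the patches produces one image-level embedding. This is a minimal NumPy sketch, not the paper's implementation; the dimensions and the ungated attention form are assumptions.

```python
import numpy as np

def ab_mil_pool(H, V, w):
    """Attention-based MIL pooling over one bag of instances.

    H: (K, D) instance embeddings (e.g. the patch features of one image)
    V: (L, D) and w: (L,) are learned attention parameters.
    Returns the (D,) bag embedding and the (K,) attention weights.
    """
    scores = w @ np.tanh(V @ H.T)      # (K,) unnormalized attention scores
    a = np.exp(scores - scores.max())
    a /= a.sum()                       # softmax over the K instances
    return a @ H, a                    # attention-weighted sum of instances

rng = np.random.default_rng(0)
K, D, L = 16, 32, 64                   # patches per image, embed dim, attn dim (illustrative)
H = rng.normal(size=(K, D))
V = rng.normal(size=(L, D)) * 0.1
w = rng.normal(size=L) * 0.1

img_embedding, attn = ab_mil_pool(H, V, w)
# One such embedding is produced per image; the N image-level embeddings
# are then treated as instances of a sample-level bag fed to the QFormer.
```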

We primarily compare the proposed method against BLIP2 under different settings: 1) Zero-Shot: directly feed the overview image of a sample to BLIP2 without fine-tuning; 2) Single-Image: fine-tune the original BLIP2 on the overview image of each product; 3) Patch-Concatenation: fine-tune the original BLIP2 on multiple images, with the patches of all images concatenated into one sequence. We repeat the experiments three times with different random seeds and report the mean and standard deviation. As shown in Table 2, results on the ABO dataset demonstrate that our method outperforms the use of a single image or of images with concatenated patches on most metrics, underscoring the efficacy of applying MIL along different dimensions. It is worth noting that fine-tuning BLIP2 with a single image already achieves respectable performance, indicating that the overview image carries general information about a sample. Additionally, while fine-tuning BLIP2 with multiple concatenated image patches performs well in terms of ROUGE, the concatenation incurs a complexity of O(RNP). In contrast, the proposed method, applied along the two dimensions separately, has a complexity of only O(P + RN), ensuring computational efficiency.

Figure 6. Example from ABO. Top left: a sample consisting of six images. Top right: attention weights of patches within the different images, rendered with the COLORMAP JET, where lighter colors indicate higher weights. Bottom: attention weights among the different images, shown as 12-head attention maps; in each map, a row gives the weights one query assigns to the images.
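The complexity gap between the two designs can be made concrete with a few illustrative numbers, where R is the number of learnable queries, N the number of images, and P the patch tokens per image (the specific values below are assumptions, not the paper's configuration):

```python
# Counting cross-attention interactions under the two stated complexities.
R = 32       # learnable QFormer queries (illustrative)
N = 6        # images in one product sample
P = 257      # patch tokens per image, e.g. a ViT grid plus CLS (illustrative)

concat_cost = R * N * P    # all patches concatenated into one sequence: O(RNP)
mivpg_cost = P + R * N     # per-image patch pooling, then image-level MIL: O(P + RN)

print(concat_cost, mivpg_cost)
# The gap widens as N (more images) or P (higher resolution) grows.
```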

Unlike the ground-truth captions, which describe the details of a product, zero-shot BLIP2 tends to provide less detailed information. This discrepancy can be attributed to the model's pretraining, where it is predominantly tasked with describing an overview of an image, with less emphasis on details. Nonetheless, when we input multiple images for a single item, the model showcases its capacity to discern which aspects should be emphasized. This capability arises from the common patterns across different images that collectively describe the item, affirming the effectiveness of utilizing multiple visual inputs.

A visualization example can be seen in Figure 6, featuring an ottoman seat depicted in six images. We present both patch-level and image-level attention weights. In the patch-level attention weights, the model emphasizes the edges and legs of the seat, leading to an output that recognizes the rectangular shape and four legs. The image-level attention weights show maps for all twelve heads; each row in a map represents a query, and each column represents an image. Notably, different heads and queries exhibit varying attention patterns across images. Overall, the first, fourth, and fifth images attract the most attention.
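The image-level attention maps in Figure 6 can be reproduced in spirit by splitting the query and image embeddings into heads and computing scaled dot-product attention per head, giving one (queries × images) map per head. This is a hedged sketch with assumed dimensions, not the QFormer's actual weight extraction:

```python
import numpy as np

def image_attention_maps(Q, K_img, n_heads):
    """Per-head attention of queries over images.

    Q: (n_queries, D) query embeddings; K_img: (n_images, D) image embeddings.
    Returns maps of shape (n_heads, n_queries, n_images); each row of a map
    is one query's distribution over the images, as in Figure 6 (bottom).
    """
    n_q, d = Q.shape
    n_img = K_img.shape[0]
    d_h = d // n_heads
    Qh = Q.reshape(n_q, n_heads, d_h).transpose(1, 0, 2)        # (H, R, d_h)
    Kh = K_img.reshape(n_img, n_heads, d_h).transpose(1, 0, 2)  # (H, N, d_h)
    logits = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_h)          # (H, R, N)
    logits -= logits.max(axis=-1, keepdims=True)
    w = np.exp(logits)
    return w / w.sum(axis=-1, keepdims=True)                    # softmax over images

rng = np.random.default_rng(1)
maps = image_attention_maps(rng.normal(size=(32, 96)),  # 32 queries (illustrative)
                            rng.normal(size=(6, 96)),   # 6 images, as in Figure 6
                            n_heads=12)
# maps[h, r] is query r's attention over the six images in head h.
```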

Table 3. Ablation results on the effectiveness of CSA.

4.5. Case Study

To assess the impact of instance correlation, we conduct additional ablation studies comparing self-attention (SA) and correlated self-attention (CSA); see Table 3 for the results. On PatchGastricADC22, self-attention and correlated self-attention among instances yield similar performance. On ABO, however, correlated self-attention outperforms self-attention. We posit that this discrepancy arises because images of e-commerce products typically do not exhibit explicit correlations. In the correlated self-attention mechanism, images are first aggregated into query embeddings, which may help reduce the impact of irrelevant information. Due to space constraints, we defer additional case studies to Supplementary C.2 and more visualization results to Supplementary C.3.
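The contrast between the two ablated mechanisms can be sketched as follows: in plain SA the instance embeddings attend to each other directly, whereas in the CSA idea described above the instances are first aggregated into query embeddings and correlation is computed in that compressed space. This is an illustrative NumPy sketch only; the paper's exact CSA formulation is given in Section 3.4, and the query matrix Q and all dimensions here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """Plain SA: every instance attends directly to every other instance."""
    d = X.shape[1]
    return softmax(X @ X.T / np.sqrt(d)) @ X

def correlated_self_attention(X, Q):
    """Sketch of the CSA idea: aggregate instances into the query
    embeddings first, then let instances attend to those aggregates,
    which can dampen irrelevant instances."""
    d = X.shape[1]
    agg = softmax(Q @ X.T / np.sqrt(d)) @ X        # instances -> query aggregates
    return softmax(X @ agg.T / np.sqrt(d)) @ agg   # instances attend to aggregates

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 32))   # 6 image-level instance embeddings (illustrative)
Q = rng.normal(size=(4, 32))   # 4 learnable query embeddings (hypothetical count)
out_sa = self_attention(X)
out_csa = correlated_self_attention(X, Q)
# Both produce updated instance embeddings of the same shape as X.
```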

:::info
Authors:

(1) Wenliang Zhong, The University of Texas at Arlington ([email protected]);

(2) Wenyi Wu, Amazon ([email protected]);

(3) Qi Li, Amazon ([email protected]);

(4) Rob Barton, Amazon ([email protected]);

(5) Boxin Du, Amazon ([email protected]);

(6) Shioulin Sam, Amazon ([email protected]);

(7) Karim Bouyarmane, Amazon ([email protected]);

(8) Ismail Tutar, Amazon ([email protected]);

(9) Junzhou Huang, The University of Texas at Arlington ([email protected]).

:::


:::info
This paper is available on arxiv under CC by 4.0 Deed (Attribution 4.0 International) license.

:::
