Teaching AI to See and Speak: Inside the OW‑VISCap Approach | HackerNoon

News Room · Published 4 November 2025 · Last updated 2025/11/04, 8:37 AM

Table of Links

Abstract and 1. Introduction

  2. Related Work

    2.1 Open-world Video Instance Segmentation

    2.2 Dense Video Object Captioning and 2.3 Contrastive Loss for Object Queries

    2.4 Generalized Video Understanding and 2.5 Closed-World Video Instance Segmentation

  3. Approach

    3.1 Overview

    3.2 Open-World Object Queries

    3.3 Captioning Head

    3.4 Inter-Query Contrastive Loss and 3.5 Training

  4. Experiments and 4.1 Datasets and Evaluation Metrics

    4.2 Main Results

    4.3 Ablation Studies and 4.4 Qualitative Results

  5. Conclusion, Acknowledgements, and References

Supplementary Material

A. Additional Analysis

B. Implementation Details

C. Limitations

3 Approach

Given a video, our goal is to jointly detect, segment and caption object instances present in the video. Importantly, note that object instance categories may not be part of the training set (e.g., the parachutes shown in Fig. 3 (top row)), placing our goal in an open-world setting. To achieve this goal, a given video is first broken into short clips, each consisting of T frames. Each clip is processed using our approach OW-VISCap. We discuss merging of the results of each clip in Sec. 4.
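To make the clip-level processing concrete, below is a minimal Python sketch of splitting a video into consecutive clips of T frames; the helper names (`split_into_clips`, `process_clip`, `merge_clip_outputs`) are hypothetical and not part of the paper.

```python
from typing import Any, List, Sequence

def split_into_clips(frames: Sequence[Any], clip_length: int) -> List[Sequence[Any]]:
    """Split a video (a sequence of frames) into consecutive clips of T frames.

    The last clip may be shorter if the video length is not a multiple of T.
    """
    return [frames[i:i + clip_length] for i in range(0, len(frames), clip_length)]

# Hypothetical usage: each clip is processed independently, and the per-clip
# results are merged afterwards (Sec. 4).
# clips = split_into_clips(video_frames, clip_length=T)
# per_clip_outputs = [process_clip(clip) for clip in clips]
# results = merge_clip_outputs(per_clip_outputs)
```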

We provide an overview of how OW-VISCap processes each clip in Sec. 3.1. We then discuss our contributions: (a) the introduction of open-world object queries in Sec. 3.2, (b) the use of masked attention for object-centric captioning in Sec. 3.3, and (c) the use of an inter-query contrastive loss that ensures the object queries differ from each other in Sec. 3.4. In Sec. 3.5, we discuss the final training objective.

3.1 Overview

Both open- and closed-world object queries are processed by three heads: a specifically designed captioning head, which yields an object-centric caption; a classification head, which yields a category label; and a detection head, which yields either a segmentation mask or a bounding box.
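As an illustration (not the authors' implementation), the per-query heads could look roughly like the following PyTorch sketch; the Mask2Former-style mask head, the feature dimensions, and the class count are assumptions:

```python
import torch
import torch.nn as nn

class PerQueryHeads(nn.Module):
    """Illustrative heads applied to each object query (closed- or open-world)."""

    def __init__(self, query_dim: int = 256, num_classes: int = 80):
        super().__init__()
        # Classification head: category label per query (+1 for "no object").
        self.class_head = nn.Linear(query_dim, num_classes + 1)
        # Detection head: an embedding that is dotted with per-pixel features
        # to produce a segmentation mask (a Mask2Former-style assumption).
        self.mask_embed = nn.Linear(query_dim, query_dim)

    def forward(self, queries: torch.Tensor, pixel_features: torch.Tensor):
        # queries: (num_queries, query_dim); pixel_features: (query_dim, H, W)
        class_logits = self.class_head(queries)                      # (Q, C+1)
        mask_logits = torch.einsum("qc,chw->qhw",
                                   self.mask_embed(queries), pixel_features)
        return class_logits, mask_logits

# A captioning head (Sec. 3.3) would additionally attend to image features,
# restricted to each object's region, and decode an object-centric caption.
```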

We introduce an inter-query contrastive loss that encourages the object queries to differ from each other. We provide details in Sec. 3.4. For closed-world objects, this loss helps remove highly overlapping false positives. For open-world objects, it helps in the discovery of new objects.
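The exact formulation appears in Sec. 3.4; purely as an illustration of the idea of pushing query embeddings apart, a pairwise repulsion term over normalized queries might look like the sketch below (the cosine-similarity form and temperature are assumptions, not the paper's loss):

```python
import torch
import torch.nn.functional as F

def inter_query_repulsion(queries: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Illustrative penalty that grows when distinct object queries are similar.

    queries: (num_queries, dim). Queries are L2-normalized, pairwise cosine
    similarities are computed, and only off-diagonal entries are penalized.
    """
    q = F.normalize(queries, dim=-1)
    sim = q @ q.t() / temperature                                   # (Q, Q)
    off_diag = ~torch.eye(q.shape[0], dtype=torch.bool, device=q.device)
    return sim[off_diag].exp().mean()        # large when queries overlap
```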

Finally, we provide the full training objective in Sec. 3.5.

3.2 Open-World Object Queries

We first match the ground truth objects with the open-world predictions by minimizing a matching cost using the Hungarian algorithm [34]. The optimal matching is then used to calculate the final open-world loss.
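Hungarian matching of this kind is typically implemented with `scipy.optimize.linear_sum_assignment`; a minimal sketch follows, where the entries of the cost matrix (e.g., weighted classification and mask costs) stand in for the paper's matching cost:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(cost_matrix: np.ndarray):
    """Match predictions (rows) to ground-truth objects (columns).

    cost_matrix[i, j] is the cost of assigning prediction i to ground truth j;
    the exact cost terms are defined by the method, not here. Returns the
    index pairs that minimize the total assignment cost.
    """
    pred_idx, gt_idx = linear_sum_assignment(cost_matrix)
    return list(zip(pred_idx.tolist(), gt_idx.tolist()))

# Hypothetical usage:
# matches = hungarian_match(cost)  # cost: (num_predictions, num_ground_truths)
# The matched pairs are then used to compute the final open-world loss.
```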

3.3 Captioning Head
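As noted in the overview, this head uses masked attention so that each caption attends to its object's region. A minimal sketch of masked cross-attention is shown below; the function signature, shapes, and scaling are illustrative assumptions, not the paper's exact head:

```python
import torch
import torch.nn.functional as F

def masked_cross_attention(query, keys, values, region_mask):
    """Cross-attention in which a query attends only to pixels inside its object mask.

    query: (dim,); keys/values: (num_pixels, dim); region_mask: (num_pixels,) bool.
    Positions outside the object's region receive -inf scores before the softmax.
    """
    scores = keys @ query / keys.shape[-1] ** 0.5          # (num_pixels,)
    scores = scores.masked_fill(~region_mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ values                                 # (dim,) attended feature
```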

3.4 Inter-Query Contrastive Loss

3.5 Training

Our total training loss combines the detection and classification losses for closed- and open-world queries (Sec. 3.2), the captioning loss (Sec. 3.3), and the inter-query contrastive loss (Sec. 3.4).
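A plausible form of this combination is sketched below in LaTeX; the weighting coefficients are hypothetical, as the exact formula is not reproduced in this excerpt:

```latex
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{cw}} + \lambda_{\text{ow}}\,\mathcal{L}_{\text{ow}} + \lambda_{\text{cap}}\,\mathcal{L}_{\text{cap}} + \lambda_{\text{con}}\,\mathcal{L}_{\text{con}}
```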

Table 1: Open-world tracking accuracy (OWTA) on the BURST validation and test sets for all, common (comm.) and uncommon (unc.) categories of objects. Onl. refers to online frame-by-frame processing. The best scores are highlighted in bold font, and the second-best scores are underlined.

Table 2: Dense video object captioning results on the VidSTG [57] dataset. Off. indicates offline methods and onl. refers to online methods.

:::info
Authors:

(1) Anwesa Choudhuri, University of Illinois at Urbana-Champaign ([email protected]);

(2) Girish Chowdhary, University of Illinois at Urbana-Champaign ([email protected]);

(3) Alexander G. Schwing, University of Illinois at Urbana-Champaign ([email protected]).

:::


:::info
This paper is available on arXiv under a CC BY 4.0 Deed (Attribution 4.0 International) license.

:::
