A Comparative Review of Open‑World, Closed‑World, and Captioning Methods for Video Segmentation | HackerNoon

News Room
Last updated: 2025/11/04 at 9:01 AM
News Room Published 4 November 2025

Table of Links

Abstract and 1. Introduction

  2. Related Work

    2.1 Open-world Video Instance Segmentation

    2.2 Dense Video Object Captioning and 2.3 Contrastive Loss for Object Queries

    2.4 Generalized Video Understanding and 2.5 Closed-World Video Instance Segmentation

  3. Approach

    3.1 Overview

    3.2 Open-World Object Queries

    3.3 Captioning Head

    3.4 Inter-Query Contrastive Loss and 3.5 Training

  4. Experiments and 4.1 Datasets and Evaluation Metrics

    4.2 Main Results

    4.3 Ablation Studies and 4.4 Qualitative Results

  5. Conclusion, Acknowledgements, and References

Supplementary Material

A. Additional Analysis

B. Implementation Details

C. Limitations

2 Related Work

2.1 Open-world Video Instance Segmentation

Prompt-less methods. Several works have explored open-world video instance segmentation without requiring initial prompts [2, 10, 28, 39, 44]. These methods discover new objects based on an objectness score. Some works [2, 28] rely on classic region-based object proposals. However, recent query-based methods [10, 39] have been shown to outperform them for object detection and segmentation. OW-VISFormer [39] proposes a feature enrichment mechanism and a spatio-temporal objectness module to distinguish foreground objects from the background. Unlike our work, it operates on a single type of object query. Also orthogonal to our work is DEVA [10], which develops a class-agnostic temporal propagation approach to track objects detected by a segmentation network. UVO [44] and BURST [2] provide novel datasets and region-proposal-based baselines for open-world video instance segmentation. In this work, we evaluate our approach on the BURST [2] dataset because it is more diverse and allows common and uncommon classes to be evaluated separately.

Prompt-based methods. Prompt-based methods rely on prompts, i.e., prior knowledge, to segment objects in videos. Video object segmentation (VOS) [11, 17, 19, 20, 35, 40, 43, 51, 56] is a task in which the ground-truth segmentations of objects (belonging to any category) in the first frame are available and act as prompts to track the objects throughout the video. CLIPSeg [31] uses words as prompts: it projects the words onto the image space following CLIP [38] and calculates a similarity matrix to obtain a segmentation mask for the given word prompt. Prompt-based methods work when one has some prior knowledge about what to look for or where to look. However, in a true open-world setting, such prior knowledge may not be available. We operate in such a setting.
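As a toy illustration of the word-prompt idea described above, the sketch below scores each image patch against a text embedding with cosine similarity and thresholds the result into a mask. It is not CLIPSeg's actual architecture: the embeddings, the `similarity_mask` helper, and the `threshold` value are all hypothetical, and a real system would obtain the embeddings from trained encoders that share one embedding space.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def similarity_mask(patch_embeddings, text_embedding, threshold=0.5):
    """Mark every patch whose embedding is close to the text embedding.

    patch_embeddings is a grid (list of rows) of vectors, one per image patch.
    The threshold is an illustrative assumption, not a published value.
    """
    return [
        [cosine(patch, text_embedding) >= threshold for patch in row]
        for row in patch_embeddings
    ]


# Made-up 2x2 grid of 2-D patch embeddings and a made-up text embedding.
patches = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [0.0, 1.0]],
]
mask = similarity_mask(patches, text_embedding=[1.0, 0.0])
print(mask)  # [[True, False], [True, False]]
```

The sketch also shows the limitation noted above: without a word to embed, there is nothing to compare the patches against.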

In this work, we combine the advantages of both prompt-based and prompt-less methods. Notably, we do not require additional prompts from the ground truth or from separate networks. Instead, we use equally spaced points distributed over the video frames and encode them as prompts to discover new objects. Note also that all the aforementioned methods, prompt-less or prompt-based, assign only a one-word label to each detected object. In contrast, we are interested in generating a rich object-centric caption for each object.
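The equally spaced point prompts mentioned above can be pictured with a minimal sketch. The text only states that the points are equally spaced over the frames; placing them at the centers of a square grid of cells, as well as the function name `grid_point_prompts` and its `points_per_side` parameter, are assumptions made for illustration. How the points are subsequently encoded as prompts is not shown here.

```python
def grid_point_prompts(height, width, points_per_side):
    """Return (x, y) pixel coordinates of an evenly spaced grid of points.

    The frame is split into points_per_side x points_per_side cells and one
    point is placed at each cell center, so no point lies on the border.
    """
    if points_per_side < 1:
        raise ValueError("points_per_side must be >= 1")
    step_x = width / points_per_side
    step_y = height / points_per_side
    return [
        ((col + 0.5) * step_x, (row + 0.5) * step_y)
        for row in range(points_per_side)
        for col in range(points_per_side)
    ]


points = grid_point_prompts(height=480, width=640, points_per_side=4)
print(len(points))  # 16
print(points[0])    # (80.0, 60.0)
```

Because the grid depends only on the frame size, the same set of points can be reused for every frame without any ground-truth annotation or auxiliary network.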

2.2 Dense Video Object Captioning

Dense video object captioning involves detecting, tracking, and captioning trajectories of objects in a video. For this task, DVOC-DS [58] extends the image-based GRiT [47] and trains it with a mixture of disjoint tasks. In addition, the existing video grounding datasets VidSTG [57] and VLN [41] are repurposed for this task. However, DVOC-DS [58] cannot caption multiple action segments within a single object trajectory because it produces a single caption for the entire trajectory. In addition, similar to many other video models [1, 45, 48], DVOC-DS [58] struggles with very long videos and only processes up to 200 frames. The method is further constrained to known object categories, and it is unclear how it extends to an open-world setting.

Unlike DVOC-DS [58], we use open-world object queries to tackle open-world video instance segmentation. In addition, we process video frames sequentially using a short temporal context. Hence, our method can process long videos and handle multiple action segments within a single object trajectory. Finally, we leverage masked attention for dense video object captioning, which has not been explored before.
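One common way to process frames sequentially with a short temporal context is to slide a fixed-length window over the video. The sketch below is only an assumption about what such a schedule could look like: the `sliding_windows` helper, the window length, and the stride are hypothetical and are not taken from the paper.

```python
def sliding_windows(num_frames, window, stride):
    """Return half-open (start, end) frame ranges that cover the whole video.

    Each range spans at most `window` frames. A final range is appended when
    the regular stride would leave trailing frames uncovered.
    """
    if num_frames < 1 or window < 1 or stride < 1:
        raise ValueError("num_frames, window and stride must be >= 1")
    last_start = max(num_frames - window, 0)
    windows = [
        (start, min(start + window, num_frames))
        for start in range(0, last_start + 1, stride)
    ]
    if windows[-1][1] < num_frames:
        windows.append((num_frames - window, num_frames))
    return windows


print(sliding_windows(num_frames=10, window=4, stride=2))
# [(0, 4), (2, 6), (4, 8), (6, 10)]
```

With a schedule like this, the working set is bounded by the window length rather than by the video length, and each window can in principle receive its own caption, which is what allows several action segments within one trajectory.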

2.3 Contrastive Loss for Object Queries

Many recent approaches have used a contrastive loss to improve video instance segmentation. OW-VISFormer [39] uses a contrastive loss in the open-world setting: it ensures that assigned foreground objects are similar to each other while being different from the background objects. IDOL [50] works in the closed-world setting and uses an inter-frame contrastive loss to ensure that object queries belonging to the same object across frames are similar, while object queries of different instances across frames differ. In contrast, in this work, for both the closed- and open-world settings, we use a contrastive loss to ensure that no two object queries in the foreground are similar to each other, even in the same frame.
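To make the idea of pushing foreground queries apart concrete, here is a minimal sketch of a repulsion-style penalty over pairwise cosine similarities. It is not the loss defined in the paper (that is given in Section 3.4); the `inter_query_repulsion` function, its log-sum-exp form, and the `temperature` value are assumptions chosen only to show the intended behavior: near-duplicate queries are penalized more than dissimilar ones.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def inter_query_repulsion(queries, temperature=0.1):
    """Hypothetical penalty that grows when foreground queries resemble each other.

    For each query i it computes log(1 + sum over j != i of exp(sim_ij / T))
    and returns the mean over all queries.
    """
    n = len(queries)
    if n < 2:
        return 0.0
    total = 0.0
    for i in range(n):
        others = sum(
            math.exp(cosine(queries[i], queries[j]) / temperature)
            for j in range(n)
            if j != i
        )
        total += math.log1p(others)
    return total / n


distinct = [[1.0, 0.0], [0.0, 1.0]]    # two orthogonal queries
duplicates = [[1.0, 0.0], [1.0, 0.0]]  # two identical queries
print(inter_query_repulsion(distinct) < inter_query_repulsion(duplicates))  # True
```

Minimizing a penalty of this kind discourages two queries from describing the same object, whether they come from the same frame or from different frames.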

2.4 Generalized Video Understanding

Recently, there has been progress in unifying different video-related tasks. TubeFormer [23], Unicorn [53], and CAROQ [13] unify different video segmentation tasks in the closed world. DVOC-DS [58] unifies the tasks of detecting closed-world objects in videos and captioning those objects. In this work, we explore the task of detecting or segmenting both closed- and open-world objects in videos, and captioning these objects.

Video understanding starts from strong generalized image understanding. Several works generalize multiple image-related tasks. Some methods [8, 9, 21] combine different image segmentation methods and provide a baseline for many different video understanding tasks [7, 13, 18, 50]. X-Decoder [59] unifies different image segmentation tasks along with the task of referring image segmentation. SAM [24] introduces a vision foundation model that primarily performs prompt-based open-world image segmentation and can be used for many downstream tasks. Different from these works, we develop a generalized method for videos that tackles segmentation and object-centric captioning for both open- and closed-world objects.

2.5 Closed-World Video Instance Segmentation

Closed-world video instance segmentation involves simultaneously segmenting and tracking objects from a fixed category set in a video. Some works [3–5, 12, 16, 32, 36, 42, 52, 54, 55] rely on classical region-based proposals. Recent works [7, 13, 18, 33, 46, 49, 50] rely on query-based proposals and perform significantly better at discovering closed-world objects. In contrast to these works, we explore query-based proposals for both the closed- and open-world settings.

:::info
Authors:

(1) Anwesa Choudhuri, University of Illinois at Urbana-Champaign;

(2) Girish Chowdhary, University of Illinois at Urbana-Champaign;

(3) Alexander G. Schwing, University of Illinois at Urbana-Champaign.

:::


:::info
This paper is available on arXiv under the CC BY 4.0 Deed (Attribution 4.0 International) license.

:::
