MS MARCO Web Search: Powering Next-Gen Information Access & Neural Indexers | HackerNoon

News Room · Published 27 June 2025

Table of Links

Abstract and 1 Introduction

2 Background and Related work

2.1 Web Scale Information Retrieval

2.2 Existing Datasets

3 MS MARCO Web Search Dataset and 3.1 Document Preparation

3.2 Query Selection and Labeling

3.3 Dataset Analysis

3.4 New Challenges Raised by MS MARCO Web Search

4 Benchmark Results and 4.1 Environment Setup

4.2 Baseline Methods

4.3 Evaluation Metrics

4.4 Evaluation of Embedding Models and 4.5 Evaluation of ANN Algorithms

4.6 Evaluation of End-to-end Performance

5 Potential Biases and Limitations

6 Future Work and Conclusions, and References

ABSTRACT

Recent breakthroughs in large models have highlighted the critical significance of data scale, labels, and modalities. In this paper, we introduce MS MARCO Web Search, the first large-scale information-rich web dataset, featuring millions of real clicked query-document labels. The dataset closely mimics real-world web document and query distributions, provides rich information for a wide range of downstream tasks, and encourages research in areas such as generic end-to-end neural indexer models, generic embedding models, and next-generation information access systems with large language models. MS MARCO Web Search offers a retrieval benchmark with three web retrieval challenge tasks that demand innovations in both machine learning and information retrieval system research. As the first dataset that meets the large, real, and rich data requirements, MS MARCO Web Search paves the way for future advancements in AI and systems research. The MS MARCO Web Search dataset is available at: https://github.com/microsoft/MSMARCO-Web-Search.

1 INTRODUCTION

Recently, the large language model (LLM), a breakthrough in the field of artificial intelligence, has provided a novel way for people to access information through interactive communication. Although it has become an indispensable tool for tasks such as content creation, semantic understanding, and conversational AI, it still exhibits certain limitations. One such limitation is the model’s tendency to produce hallucinated or fabricated content, as it generates responses based on patterns observed in the training data rather than verifying factual accuracy. Furthermore, it struggles with real-time knowledge updates, as it can only provide information available up to the time of its last training, which makes it less reliable for retrieving the latest, dynamic information. Therefore, integrating an external, up-to-date knowledge base with large language models is of paramount importance to enhance their performance and reliability. This combination not only mitigates the limitations of hallucination and stale knowledge but also broadens the model’s applicability across domains, making it more versatile and valuable. Consequently, information retrieval systems, like the Bing search engine [32], continue to play a vital role in new LLM-based information systems, such as WebGPT [34] and the new Bing [33].

For modern information retrieval systems, the core is a large semantic understanding model, such as a neural indexer model [51] or a dual embedding model [16, 20, 21, 38–40, 45, 46, 54], which can capture users’ intents as well as the rich meanings of a document, with better tolerance for out-of-vocabulary words, spelling errors, and synonymous expressions. Training a high-quality large semantic understanding model requires a vast amount of data to achieve sufficient knowledge coverage. The larger the dataset, the better the model is likely to perform, as it can learn more complex and sophisticated patterns and correlations.
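
To make the dual embedding (bi-encoder) idea concrete, the sketch below encodes a query and a few documents independently and ranks the documents by inner product. It is a minimal illustration that assumes the open-source sentence-transformers library and a small public encoder; it does not reproduce the embedding models benchmarked in the paper.

# Minimal dual-encoder (bi-encoder) retrieval sketch.
# Assumptions: the open-source sentence-transformers library and a small public
# encoder are used purely for illustration; they are not the models from the paper.
from sentence_transformers import SentenceTransformer
import numpy as np

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

documents = [
    "MS MARCO Web Search is a large-scale web retrieval dataset.",
    "ClueWeb22 contains about 10 billion high-quality web pages.",
    "Neural indexers map queries directly to document identifiers.",
]
query = "large scale web retrieval dataset"

# Queries and documents are encoded independently (the "dual" in dual embedding),
# so document embeddings can be precomputed and placed in an ANN index offline.
doc_vecs = encoder.encode(documents, normalize_embeddings=True)
query_vec = encoder.encode([query], normalize_embeddings=True)[0]

# With L2-normalized vectors, the inner product equals cosine similarity.
scores = doc_vecs @ query_vec
for rank, idx in enumerate(np.argsort(-scores), start=1):
    print(f"{rank}. score={scores[idx]:.3f}  {documents[idx]}")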

High-quality human-labeled data is as important as data scale. Recent research, such as InstructGPT [36] and LLAMA-2 [50], has demonstrated the crucial role of labeled data in training large foundation models. These models rely on large volumes of training data to learn generalizable features, while human-labeled data enables them to learn the specific tasks they are designed for. The same applies to large semantic understanding models.

Moreover, information-rich data is also crucial for training large semantic understanding models effectively. Multi-modal datasets can help models understand complex relationships between different types of data and transfer knowledge between them. For example, pairing images and text in a multi-modal dataset can help models learn image concepts together with their corresponding text descriptions, providing a more holistic representation of the data.

These emerging large, real, and rich data requirements motivate us to create the new MS MARCO Web Search dataset, the first large-scale information-rich web dataset with millions of real clicked query-document labels. MS MARCO Web Search incorporates the largest open web document dataset, ClueWeb22 [37], as its document corpus. ClueWeb22 includes about 10 billion high-quality web pages, sufficiently large to serve as representative web-scale data. It also contains rich information from the web pages, such as the visual representation rendered by web browsers, raw HTML structure, clean text, semantic annotations, and language and topic tags labeled by industry document understanding systems. MS MARCO Web Search further contains 10 million unique queries in 93 languages, with millions of relevant labeled query-document pairs collected from the search log of the Microsoft Bing search engine to serve as the query set. This large collection of multilingual, information-rich, real web documents, queries, and labeled query-document pairs enables a wide range of downstream tasks and encourages several new research directions that previous datasets cannot support well, such as generic end-to-end neural indexer models, generic embedding models, and next-generation information access systems with large language models. As the first large, real, and rich web dataset, MS MARCO Web Search will serve as a critical data foundation for future AI and systems research.
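
As a small illustration of how clicked query-document labels of this kind can be consumed for training or evaluation, the sketch below groups positive document ids by query. The tab-separated (query_id, doc_id, label) layout and the file name are assumptions made for illustration only; consult the dataset repository for the actual release format.

# Hypothetical sketch of loading clicked query-document labels.
# Assumption: a tab-separated file of (query_id, doc_id, label) triples;
# the real file names and schema are defined by the dataset release, not here.
import csv
from collections import defaultdict

def load_qrels(path: str) -> dict:
    """Group positive (clicked) doc ids by query id."""
    qrels = defaultdict(set)
    with open(path, encoding="utf-8") as f:
        for query_id, doc_id, label in csv.reader(f, delimiter="\t"):
            if int(label) > 0:
                qrels[query_id].add(doc_id)
    return qrels

# Example usage with a placeholder path.
# qrels = load_qrels("qrels_train.tsv")
# print(len(qrels), "queries with at least one clicked document")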

MS MARCO Web Search offers a retrieval benchmark that implements several state-of-the-art embedding models, retrieval algorithms, and retrieval systems originally developed on existing datasets. We compare their result quality and system performance on the new MS MARCO Web Search dataset as benchmark baselines for web-scale information retrieval. The experimental results demonstrate that embedding models, retrieval algorithms, and retrieval systems are all critical components in web information retrieval. Interestingly, improving only one component in isolation may degrade end-to-end retrieval quality and system performance. We hope that this retrieval benchmark can facilitate future innovations in data-centric techniques, embedding models, retrieval algorithms, and retrieval systems to maximize end-to-end performance.

Table 1: Comparison of MS MARCO Web Search (with ClueWeb22) and existing datasets
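
For reference, retrieval result quality in benchmarks of this kind is typically scored with metrics such as Recall@K and MRR; the paper's exact metric definitions appear in its Section 4.3. The sketch below is a generic illustration over hypothetical run and label dictionaries, not the benchmark's own evaluation code.

# Generic sketch of scoring a retrieval run against relevance labels.
# Assumptions: 'run' maps each query id to a ranked list of doc ids and 'qrels'
# maps each query id to its set of clicked (relevant) doc ids; the names and
# toy data are illustrative only.
from typing import Dict, List, Set

def recall_at_k(run: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int) -> float:
    # Fraction of a query's relevant documents found in the top-k, averaged over queries.
    per_query = []
    for qid, relevant in qrels.items():
        retrieved = set(run.get(qid, [])[:k])
        per_query.append(len(retrieved & relevant) / len(relevant))
    return sum(per_query) / len(per_query)

def mrr_at_k(run: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int) -> float:
    # Reciprocal rank of the first relevant document in the top-k, averaged over queries.
    per_query = []
    for qid, relevant in qrels.items():
        rr = 0.0
        for rank, doc_id in enumerate(run.get(qid, [])[:k], start=1):
            if doc_id in relevant:
                rr = 1.0 / rank
                break
        per_query.append(rr)
    return sum(per_query) / len(per_query)

# Toy usage with made-up ids.
run = {"q1": ["d3", "d1", "d7"], "q2": ["d9", "d2"]}
qrels = {"q1": {"d1"}, "q2": {"d5"}}
print(recall_at_k(run, qrels, k=3), mrr_at_k(run, qrels, k=3))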

Authors:

(1) Qi Chen, Microsoft, Beijing, China;

(2) Xiubo Geng, Microsoft, Beijing, China;

(3) Corby Rosset, Microsoft, Redmond, United States;

(4) Carolyn Buractaon, Microsoft, Redmond, United States;

(5) Jingwen Lu, Microsoft, Redmond, United States;

(6) Tao Shen, University of Technology Sydney, Sydney, Australia and the work was done at Microsoft;

(7) Kun Zhou, Microsoft, Beijing, China;

(8) Chenyan Xiong, Carnegie Mellon University, Pittsburgh, United States and the work was done at Microsoft;

(9) Yeyun Gong, Microsoft, Beijing, China;

(10) Paul Bennett, Spotify, New York, United States and the work was done at Microsoft;

(11) Nick Craswell, Microsoft, Redmond, United States;

(12) Xing Xie, Microsoft, Beijing, China;

(13) Fan Yang, Microsoft, Beijing, China;

(14) Bryan Tower, Microsoft, Redmond, United States;

(15) Nikhil Rao, Microsoft, Mountain View, United States;

(16) Anlei Dong, Microsoft, Mountain View, United States;

(17) Wenqi Jiang, ETH Zürich, Zürich, Switzerland;

(18) Zheng Liu, Microsoft, Beijing, China;

(19) Mingqin Li, Microsoft, Redmond, United States;

(20) Chuanjie Liu, Microsoft, Beijing, China;

(21) Zengzhong Li, Microsoft, Redmond, United States;

(22) Rangan Majumder, Microsoft, Redmond, United States;

(23) Jennifer Neville, Microsoft, Redmond, United States;

(24) Andy Oakley, Microsoft, Redmond, United States;

(25) Knut Magne Risvik, Microsoft, Oslo, Norway;

(26) Harsha Vardhan Simhadri, Microsoft, Bengaluru, India;

(27) Manik Varma, Microsoft, Bengaluru, India;

(28) Yujing Wang, Microsoft, Beijing, China;

(29) Linjun Yang, Microsoft, Redmond, United States;

(30) Mao Yang, Microsoft, Beijing, China;

(31) Ce Zhang, ETH Zürich, Zürich, Switzerland and the work was done at Microsoft.
