Deep Dive into MS MARCO Web Search: Unpacking Dataset Characteristics | HackerNoon

News Room | Published 29 June 2025

We have constructed the dataset at two scales: Set-100M and Set-10B. Table 2 gives detailed statistics for both. Example files from MS MARCO Web Search Set-100M are shown in Figure 2.

3.3.1 Language Distribution Analysis. MS MARCO Web Search is a multi-lingual dataset whose queries and documents both come from a commercial web search engine. We analyze the 20 most popular of the 93 query languages and 207 document languages in the 100M dataset; the 10B dataset has a similar distribution. Figure 3 summarizes the document language distribution in the train and test document sets. We can see that both train and test document sets are aligned with the original ClueWeb22 document distribution. Figure 4 summarizes the query language distribution in the train, dev, and test query sets. The query language distribution in the web scenario is highly skewed, which may lead to model bias. This encourages research on data-centric techniques for training data optimization.
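As a rough illustration, the per-language shares behind such a distribution plot can be tallied with a standard counter. The `detect_language` callable here is a hypothetical stand-in for whatever language identifier is used; it is not part of the dataset or the paper's pipeline:

```python
from collections import Counter

def language_distribution(texts, detect_language):
    """Tally language codes and return the top languages by share.

    `detect_language` is any callable mapping a string to a language
    code (e.g. "en", "zh"); it is a placeholder, not a real API.
    """
    counts = Counter(detect_language(t) for t in texts)
    total = sum(counts.values())
    # Report the 20 most frequent languages as fractions of the corpus.
    return [(lang, n / total) for lang, n in counts.most_common(20)]

# Toy usage with a trivial "detector":
docs = ["hello world", "bonjour", "hello again"]
top = language_distribution(docs, lambda t: "fr" if "bonjour" in t else "en")
```

The same tally applied separately to queries and documents would yield the kind of skewed distribution Figures 3 and 4 depict.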

3.3.2 Data Skew Analysis. We analyze the query-document label distribution in the training data. Figure 5(a) shows documents and the number of relevant queries associated with them. From the figure, we can see that few documents carry multiple labels: only 7.77% of documents have any labeled relevant query, and only 0.46% have more than one. Figure 5(b) summarizes the queries and their relevant documents; only 1.4% of queries have multiple relevant documents. This highly skewed nature of the dataset is consistent with what is observed when training models for web-scale information retrieval. Our intention is to preserve this skew so that models trained on this dataset remain applicable to real-world scenarios.
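Assuming the training labels are available as (query_id, doc_id) pairs, the document-side skew statistics above can be reproduced with a simple per-document count (the function name and signature are illustrative):

```python
from collections import Counter

def label_skew(pairs, num_docs):
    """Fractions of documents with >=1 and with >1 labeled relevant queries.

    `pairs` is an iterable of (query_id, doc_id) relevance labels;
    `num_docs` is the total corpus size (labeled or not).
    """
    labels_per_doc = Counter(d for _, d in pairs)
    labeled = sum(1 for n in labels_per_doc.values() if n >= 1)
    multi = sum(1 for n in labels_per_doc.values() if n > 1)
    return labeled / num_docs, multi / num_docs

# Toy example: 3 labels over a 10-document corpus.
frac_labeled, frac_multi = label_skew([(1, "d1"), (2, "d1"), (3, "d2")], 10)
```

Swapping the counted field from documents to queries gives the query-side statistic (the 1.4% of queries with multiple relevant documents) in the same way.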

3.3.3 Test-Train Overlap Analysis. As introduced in [30], large test-train overlap exists in some popular open-domain QA datasets, which causes many popular open-domain models to simply memorize queries seen during training; they subsequently perform worse on novel queries. The work in [56] observes this phenomenon in the MS MARCO dataset. To better evaluate model generalizability, we minimize the overlap between the train and test sets by splitting the query-document pairs by time. This means the test query-document pairs have no time overlap with the train query-document pairs, which introduces a large portion of novel queries. This can be verified in Table 3. We summarize the test query-document pairs into four categories:

• Q∈Train, D∈Train: Both query and document have appeared in the train set,

• Q∉Train, D∈Train: Query has not been seen in the train set, but the relevant document has been seen in the train set,

• Q∈Train, D∉Train: Query has been seen in the train set, but the document is a new web page that has not been seen in the train set,

• Q∉Train, D∉Train: Both query and document are novel content which have never been seen in the train set.

Table 3 shows that 82% of test query-document pairs are novel content that has not been seen in the train set. The MS MARCO Web Search dataset can therefore offer effective assessments of both model memorization and generalizability by dividing the test set into these four categories for a more detailed comparison.
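The four-way categorization above can be sketched directly, given the sets of queries and documents seen at training time. This is a minimal illustration under that assumption, not the paper's actual analysis code:

```python
def categorize_test_pairs(test_pairs, train_queries, train_docs):
    """Bucket test (query, doc) pairs into the four overlap categories.

    `train_queries` and `train_docs` are the sets of queries and
    documents that appear in the train split; returns counts per category.
    """
    counts = {"Q_in_D_in": 0, "Q_out_D_in": 0, "Q_in_D_out": 0, "Q_out_D_out": 0}
    for q, d in test_pairs:
        q_in = q in train_queries
        d_in = d in train_docs
        if q_in and d_in:
            counts["Q_in_D_in"] += 1      # Q∈Train, D∈Train
        elif not q_in and d_in:
            counts["Q_out_D_in"] += 1     # Q∉Train, D∈Train
        elif q_in:
            counts["Q_in_D_out"] += 1     # Q∈Train, D∉Train
        else:
            counts["Q_out_D_out"] += 1    # Q∉Train, D∉Train
    return counts

# Toy example: one test pair per category.
counts = categorize_test_pairs(
    [("q1", "d1"), ("q2", "d1"), ("q1", "d9"), ("q9", "d9")],
    train_queries={"q1"},
    train_docs={"d1"},
)
```

On the real dataset, the `Q_out_D_out` bucket would account for the reported 82% of test pairs.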

Authors:

(1) Qi Chen, Microsoft, Beijing, China;

(2) Xiubo Geng, Microsoft, Beijing, China;

(3) Corby Rosset, Microsoft, Redmond, United States;

(4) Carolyn Buractaon, Microsoft, Redmond, United States;

(5) Jingwen Lu, Microsoft, Redmond, United States;

(6) Tao Shen, University of Technology Sydney, Sydney, Australia and the work was done at Microsoft;

(7) Kun Zhou, Microsoft, Beijing, China;

(8) Chenyan Xiong, Carnegie Mellon University, Pittsburgh, United States and the work was done at Microsoft;

(9) Yeyun Gong, Microsoft, Beijing, China;

(10) Paul Bennett, Spotify, New York, United States and the work was done at Microsoft;

(11) Nick Craswell, Microsoft, Redmond, United States;

(12) Xing Xie, Microsoft, Beijing, China;

(13) Fan Yang, Microsoft, Beijing, China;

(14) Bryan Tower, Microsoft, Redmond, United States;

(15) Nikhil Rao, Microsoft, Mountain View, United States;

(16) Anlei Dong, Microsoft, Mountain View, United States;

(17) Wenqi Jiang, ETH Zürich, Zürich, Switzerland;

(18) Zheng Liu, Microsoft, Beijing, China;

(19) Mingqin Li, Microsoft, Redmond, United States;

(20) Chuanjie Liu, Microsoft, Beijing, China;

(21) Zengzhong Li, Microsoft, Redmond, United States;

(22) Rangan Majumder, Microsoft, Redmond, United States;

(23) Jennifer Neville, Microsoft, Redmond, United States;

(24) Andy Oakley, Microsoft, Redmond, United States;

(25) Knut Magne Risvik, Microsoft, Oslo, Norway;

(26) Harsha Vardhan Simhadri, Microsoft, Bengaluru, India;

(27) Manik Varma, Microsoft, Bengaluru, India;

(28) Yujing Wang, Microsoft, Beijing, China;

(29) Linjun Yang, Microsoft, Redmond, United States;

(30) Mao Yang, Microsoft, Beijing, China;

(31) Ce Zhang, ETH Zürich, Zürich, Switzerland and the work was done at Microsoft.
