We have constructed the dataset at two scales: Set-100M and Set-10B. Table 2 gives the detailed statistics of the datasets. Example files of MS MARCO Web Search Set-100M are shown in Figure 2.
3.3.1 Language Distribution Analysis. MS MARCO Web Search is a multi-lingual dataset whose queries and documents both come from a commercial web search engine. We analyze the 20 most popular of the 93 query languages and 207 document languages in the 100M dataset; the 10B dataset has a similar distribution. Figure 3 summarizes the document language distribution in the train and test document sets. We can see that both train and test document sets are aligned with the original ClueWeb22 document distribution. Figure 4 summarizes the query language distribution in the train, dev, and test query sets. From the distribution, we can see that the query language distribution in the web scenario is highly skewed, which may lead to model bias. This encourages research on data-centric techniques for training data optimization.
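Per-language counts of the kind summarized in Figures 3 and 4 can be reproduced with off-the-shelf language identification. Below is a minimal sketch, assuming a hypothetical tab-separated query file (`queries_train.tsv` with `qid<TAB>text` rows) and the public fastText `lid.176.bin` model; neither the file name nor the use of fastText is prescribed by the dataset itself (the actual file layout is shown in Figure 2).

```python
import fasttext                      # pip install fasttext
from collections import Counter

# Assumed layout: one query per line, "qid<TAB>text" (see Figure 2 for the real format).
QUERY_FILE = "queries_train.tsv"                 # hypothetical path
LID_MODEL = fasttext.load_model("lid.176.bin")   # public fastText language-ID model

def language_distribution(path, top_k=20):
    """Count the predicted language of each query and return the top_k shares."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            _, text = line.rstrip("\n").split("\t", 1)
            # predict() returns labels such as '__label__en' with probabilities
            labels, _ = LID_MODEL.predict(text.replace("\n", " "))
            counts[labels[0].replace("__label__", "")] += 1
    total = sum(counts.values())
    return [(lang, n / total) for lang, n in counts.most_common(top_k)]

for lang, share in language_distribution(QUERY_FILE):
    print(f"{lang}\t{share:.2%}")
```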
3.3.2 Data Skew Analysis. We analyze the query-document label distribution in the training data. Figure 5(a) shows the distribution of documents by the number of relevant queries associated with them. From the figure, we can see that labels are sparse: only 7.77% of the documents have any labeled relevant query, and only 0.46% have more than one. Figure 5(b) summarizes the queries and their relevant documents. From the figure, we can see that only 1.4% of queries have multiple relevant documents. This highly skewed nature of the dataset is consistent with what is observed while training models for web-scale information retrieval. Our intention is to keep this skew so that models trained on this dataset remain applicable to real-world scenarios.
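These skew statistics can be recomputed from the training labels in a single pass. The sketch below assumes a hypothetical `qrels_train.tsv` file with one `qid<TAB>docid` positive label per line and an approximate corpus size of 100M documents; both are illustrative assumptions, not the dataset's documented format.

```python
from collections import Counter

# Assumed layout: one positive label per line, "qid<TAB>docid".
QRELS_FILE = "qrels_train.tsv"   # hypothetical path
TOTAL_DOCS = 100_000_000         # approximate Set-100M document count (assumption)

labels_per_doc = Counter()
labels_per_query = Counter()
with open(QRELS_FILE, encoding="utf-8") as f:
    for line in f:
        qid, docid = line.rstrip("\n").split("\t")[:2]
        labels_per_doc[docid] += 1
        labels_per_query[qid] += 1

docs_with_label = len(labels_per_doc)
docs_multi = sum(1 for n in labels_per_doc.values() if n > 1)
queries_multi = sum(1 for n in labels_per_query.values() if n > 1)

print(f"docs with >=1 labeled query : {docs_with_label / TOTAL_DOCS:.2%}")
print(f"docs with >1 labeled query  : {docs_multi / TOTAL_DOCS:.2%}")
print(f"queries with >1 relevant doc: {queries_multi / len(labels_per_query):.2%}")
```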
3.3.3 Test-Train Overlap Analysis. As reported in [30], there is large test-train overlap in some popular open-domain QA datasets, which causes many popular open-domain models to simply memorize queries seen during training and then perform worse on novel queries. The work in [56] observes the same phenomenon in the MS MARCO dataset. To better evaluate model generalizability, we minimize the overlap between the train and test sets by splitting the query-document pairs into train and test sets by time. This means the test query-document pairs have no time overlap with the train query-document pairs, which introduces a large portion of novel queries. This can be verified in Table 3. We summarize the test query-document pairs into four categories:
• Q∈Train, D∈Train: Both query and document have appeared in the train set,
• Q∉Train, D∈Train: Query has not been seen in the train set, but the relevant document has been seen in the train set,
• Q∈Train, D∉Train: Query has been seen in the train set, but the document is a new web page that has not been seen in the train set,
• Q∉Train, D∉Train: Both query and document are novel content that has never been seen in the train set.
We can see from Table 3 that 82% of query-document pairs in the test set are novel content that has not been seen in the train set. Therefore, the MS MARCO Web Search dataset can offer effective assessments of both model memorization and generalizability by dividing the test set into these four categories for a more detailed comparison.
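The four-way split can be computed directly from the train and test label files. Below is a minimal sketch, assuming hypothetical tab-separated files (`train_pairs.tsv` and `test_pairs.tsv` with `qid<TAB>query_text<TAB>docid` rows); the actual file layout follows Figure 2, and query matching here uses simple lower-cased string equality, which is an assumption rather than the paper's exact matching rule.

```python
from collections import Counter

def load_pairs(path):
    """Read (query_text, docid) pairs from an assumed qid<TAB>query_text<TAB>docid file."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            qid, qtext, docid = line.rstrip("\n").split("\t")[:3]
            pairs.append((qtext.strip().lower(), docid))
    return pairs

train_pairs = load_pairs("train_pairs.tsv")   # hypothetical paths
test_pairs = load_pairs("test_pairs.tsv")

train_queries = {q for q, _ in train_pairs}
train_docs = {d for _, d in train_pairs}

# Assign each test pair to one of the four overlap categories.
buckets = Counter()
for q, d in test_pairs:
    q_seen = "Q∈Train" if q in train_queries else "Q∉Train"
    d_seen = "D∈Train" if d in train_docs else "D∉Train"
    buckets[f"{q_seen}, {d_seen}"] += 1

for bucket, n in buckets.items():
    print(f"{bucket}: {n / len(test_pairs):.2%}")
```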
Authors:
(1) Qi Chen, Microsoft, Beijing, China;
(2) Xiubo Geng, Microsoft, Beijing, China;
(3) Corby Rosset, Microsoft, Redmond, United States;
(4) Carolyn Buractaon, Microsoft, Redmond, United States;
(5) Jingwen Lu, Microsoft, Redmond, United States;
(6) Tao Shen, University of Technology Sydney, Sydney, Australia and the work was done at Microsoft;
(7) Kun Zhou, Microsoft, Beijing, China;
(8) Chenyan Xiong, Carnegie Mellon University, Pittsburgh, United States and the work was done at Microsoft;
(9) Yeyun Gong, Microsoft, Beijing, China;
(10) Paul Bennett, Spotify, New York, United States and the work was done at Microsoft;
(11) Nick Craswell, Microsoft, Redmond, United States;
(12) Xing Xie, Microsoft, Beijing, China;
(13) Fan Yang, Microsoft, Beijing, China;
(14) Bryan Tower, Microsoft, Redmond, United States;
(15) Nikhil Rao, Microsoft, Mountain View, United States;
(16) Anlei Dong, Microsoft, Mountain View, United States;
(17) Wenqi Jiang, ETH Zürich, Zürich, Switzerland;
(18) Zheng Liu, Microsoft, Beijing, China;
(19) Mingqin Li, Microsoft, Redmond, United States;
(20) Chuanjie Liu, Microsoft, Beijing, China;
(21) Zengzhong Li, Microsoft, Redmond, United States;
(22) Rangan Majumder, Microsoft, Redmond, United States;
(23) Jennifer Neville, Microsoft, Redmond, United States;
(24) Andy Oakley, Microsoft, Redmond, United States;
(25) Knut Magne Risvik, Microsoft, Oslo, Norway;
(26) Harsha Vardhan Simhadri, Microsoft, Bengaluru, India;
(27) Manik Varma, Microsoft, Bengaluru, India;
(28) Yujing Wang, Microsoft, Beijing, China;
(29) Linjun Yang, Microsoft, Redmond, United States;
(30) Mao Yang, Microsoft, Beijing, China;
(31) Ce Zhang, ETH Zürich, Zürich, Switzerland and the work was done at Microsoft.