Table of Links
Abstract and 1 Introduction
2 Background and Related work
2.1 Web Scale Information Retrieval
2.2 Existing Datasets
3 MS MARCO Web Search Dataset and 3.1 Document Preparation
3.2 Query Selection and Labeling
3.3 Dataset Analysis
3.4 New Challenges Raised by MS MARCO Web Search
4 Benchmark Results and 4.1 Environment Setup
4.2 Baseline Methods
4.3 Evaluation Metrics
4.4 Evaluation of Embedding Models and 4.5 Evaluation of ANN Algorithms
4.6 Evaluation of End-to-end Performance
5 Potential Biases and Limitations
6 Future Work and Conclusions, and References
ABSTRACT
Recent breakthroughs in large models have highlighted the critical importance of data scale, labels, and modalities. In this paper, we introduce MS MARCO Web Search, the first large-scale information-rich web dataset, featuring millions of real clicked query-document labels. This dataset closely mimics real-world web document and query distributions, provides rich information for various downstream tasks, and encourages research in areas such as generic end-to-end neural indexer models, generic embedding models, and next-generation information access systems with large language models. MS MARCO Web Search offers a retrieval benchmark with three web retrieval challenge tasks that demand innovations in both the machine learning and information retrieval system research domains. As the first dataset to meet the large, real, and rich data requirements, MS MARCO Web Search paves the way for future advancements in AI and systems research. The MS MARCO Web Search dataset is available at: https://github.com/microsoft/MSMARCO-Web-Search.
1 INTRODUCTION
Recently, the large language model (LLM), a breakthrough in the field of artificial intelligence, has provided a novel way for people to access information through interactive communication. Although it has become an indispensable tool for tasks such as content creation, semantic understanding, and conversational AI, it still exhibits certain limitations. One is the model's tendency to produce hallucinated or fabricated content, since it generates responses based on patterns observed in the training data rather than verifying factual accuracy. Another is that it struggles with real-time knowledge updates, as it can only provide information available up to the time of its last training, which makes it less reliable for retrieving the latest, dynamic information. Integrating an external, up-to-date knowledge base with large language models is therefore of paramount importance for enhancing their performance and reliability. This combination not only mitigates the limitations of hallucination and stale knowledge but also broadens the model's applicability across domains, making it more versatile and valuable. Consequently, information retrieval systems, like the Bing search engine [32], continue to play a vital role in the new LLM-based information systems, such as WebGPT [34] and the new Bing [33].
The core of a modern information retrieval system is a large semantic understanding model, such as a neural indexer model [51] or a dual embedding model [16, 20, 21, 38–40, 45, 46, 54], which captures users' intents as well as the rich meaning of a document, with better tolerance for out-of-vocabulary words, spelling errors, and synonymous expressions. Training a high-quality large semantic understanding model requires a vast amount of data to achieve sufficient knowledge coverage: the larger the dataset, the better the model is likely to perform, as it can learn more complex and sophisticated patterns and correlations.
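In a dual embedding model, relevance reduces to nearest-neighbor search in a shared vector space: queries and documents are encoded separately, and documents are ranked by vector similarity. The sketch below illustrates only that retrieval step; the random vectors are placeholders standing in for the output of some trained encoder, which is outside the scope of this example.

```python
import numpy as np

def normalize(v):
    """Scale vectors to unit length so inner product equals cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Placeholder embeddings standing in for a trained dual encoder's output.
rng = np.random.default_rng(0)
doc_embeddings = normalize(rng.normal(size=(1000, 64)))  # one row per document

def retrieve(query_embedding, doc_embeddings, k=10):
    """Exact dense retrieval: score every document by inner product
    against the query and return the top-k indices and scores."""
    scores = doc_embeddings @ normalize(query_embedding)
    topk = np.argsort(-scores)[:k]
    return topk, scores[topk]

query = normalize(rng.normal(size=64))
ids, scores = retrieve(query, doc_embeddings, k=5)
```

This brute-force scoring is exact but touches every document vector; at the billion-document scale of ClueWeb22 it is replaced by the approximate nearest neighbor (ANN) indexes benchmarked later in the paper.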
High-quality human-labeled data is as important as data scale. Recent research, such as InstructGPT [36] and LLAMA-2 [50], has demonstrated the crucial role of labeled data in training large foundation models. These models rely on large volumes of training data to learn generalizable features, while human-labeled data enables them to learn the specific tasks they are designed for. The same applies to large semantic understanding models.
Moreover, information-rich data is also crucial for training large semantic understanding models effectively. Multi-modal datasets can help models understand complex relationships between different types of data and transfer knowledge between them. For example, pairing images with text in a multi-modal dataset can help models learn image concepts and their corresponding text descriptions, providing a more holistic representation of the data.
These emerging large, real, and rich data requirements motivate us to create the new MS MARCO Web Search dataset, the first large-scale information-rich web dataset with millions of real clicked query-document labels. MS MARCO Web Search incorporates the largest open web document dataset, ClueWeb22 [37], as its document corpus. ClueWeb22 includes about 10 billion high-quality web pages, sufficiently large to serve as representative web-scale data. It also contains rich information from the web pages, such as the visual representation rendered by web browsers, raw HTML structure, clean text, semantic annotations, and language and topic tags labeled by industry document understanding systems. MS MARCO Web Search further contains 10 million unique queries in 93 languages, with millions of relevant labeled query-document pairs collected from the search log of the Microsoft Bing search engine to serve as the query set. This large collection of multi-lingual, information-rich real web documents, queries, and labeled query-document pairs enables various downstream tasks and encourages several new research directions that previous datasets cannot support well, such as generic end-to-end neural indexer models, generic embedding models, and next-generation information access systems with large language models. As the first large, real, and rich web dataset, MS MARCO Web Search will serve as a critical data foundation for future AI and systems research.
MS MARCO Web Search offers a retrieval benchmark that implements several state-of-the-art embedding models, retrieval algorithms, and retrieval systems originally developed on existing datasets. We compare their result quality and system performance on the new MS MARCO Web Search dataset as benchmark baselines for web-scale information retrieval. The experimental results demonstrate that embedding models, retrieval algorithms, and retrieval systems are all critical components of web information retrieval. Interestingly, improving only one component in isolation can even degrade end-to-end retrieval quality and system performance. We hope this retrieval benchmark can facilitate future innovations in data-centric techniques, embedding models, retrieval algorithms, and retrieval systems that maximize end-to-end performance.
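The retrieval-algorithm component of the benchmark concerns approximate nearest neighbor (ANN) search, which trades a small amount of recall for a large reduction in the number of distance computations. As a toy illustration of this trade-off (not any specific system evaluated in the paper), the sketch below builds a crude inverted-file (IVF) style index with a simple k-means partition and probes only a few clusters per query; all data here is synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
docs = rng.normal(size=(5000, 32)).astype(np.float32)

# --- Build: partition the vectors into clusters with a crude k-means ---
n_clusters = 50
centroids = docs[rng.choice(len(docs), n_clusters, replace=False)]
for _ in range(10):
    assign = np.argmin(((docs[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
    for c in range(n_clusters):
        members = docs[assign == c]
        if len(members):
            centroids[c] = members.mean(axis=0)
inverted_lists = [np.where(assign == c)[0] for c in range(n_clusters)]

def ann_search(query, k=10, n_probe=5):
    """Scan only the n_probe clusters whose centroids are closest to the
    query, instead of all documents; return top-k ids and distances."""
    near = np.argsort(((centroids - query) ** 2).sum(-1))[:n_probe]
    cand = np.concatenate([inverted_lists[c] for c in near])
    dists = ((docs[cand] - query) ** 2).sum(-1)
    order = np.argsort(dists)[:k]
    return cand[order], dists[order]

q = rng.normal(size=32).astype(np.float32)
ann_ids, ann_dists = ann_search(q, k=10, n_probe=5)

# Compare against exhaustive search to measure the recall sacrificed.
exact_ids = np.argsort(((docs - q) ** 2).sum(-1))[:10]
recall_at_10 = len(set(ann_ids) & set(exact_ids)) / 10
```

Raising `n_probe` moves the index back toward exhaustive search (higher recall, higher cost), which is exactly the kind of end-to-end quality/performance trade-off the benchmark is designed to expose.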
Authors:
(1) Qi Chen, Microsoft, Beijing, China;
(2) Xiubo Geng, Microsoft, Beijing, China;
(3) Corby Rosset, Microsoft, Redmond, United States;
(4) Carolyn Buractaon, Microsoft, Redmond, United States;
(5) Jingwen Lu, Microsoft, Redmond, United States;
(6) Tao Shen, University of Technology Sydney, Sydney, Australia and the work was done at Microsoft;
(7) Kun Zhou, Microsoft, Beijing, China;
(8) Chenyan Xiong, Carnegie Mellon University, Pittsburgh, United States and the work was done at Microsoft;
(9) Yeyun Gong, Microsoft, Beijing, China;
(10) Paul Bennett, Spotify, New York, United States and the work was done at Microsoft;
(11) Nick Craswell, Microsoft, Redmond, United States;
(12) Xing Xie, Microsoft, Beijing, China;
(13) Fan Yang, Microsoft, Beijing, China;
(14) Bryan Tower, Microsoft, Redmond, United States;
(15) Nikhil Rao, Microsoft, Mountain View, United States;
(16) Anlei Dong, Microsoft, Mountain View, United States;
(17) Wenqi Jiang, ETH Zürich, Zürich, Switzerland;
(18) Zheng Liu, Microsoft, Beijing, China;
(19) Mingqin Li, Microsoft, Redmond, United States;
(20) Chuanjie Liu, Microsoft, Beijing, China;
(21) Zengzhong Li, Microsoft, Redmond, United States;
(22) Rangan Majumder, Microsoft, Redmond, United States;
(23) Jennifer Neville, Microsoft, Redmond, United States;
(24) Andy Oakley, Microsoft, Redmond, United States;
(25) Knut Magne Risvik, Microsoft, Oslo, Norway;
(26) Harsha Vardhan Simhadri, Microsoft, Bengaluru, India;
(27) Manik Varma, Microsoft, Bengaluru, India;
(28) Yujing Wang, Microsoft, Beijing, China;
(29) Linjun Yang, Microsoft, Redmond, United States;
(30) Mao Yang, Microsoft, Beijing, China;
(31) Ce Zhang, ETH Zürich, Zürich, Switzerland and the work was done at Microsoft.