Scaling Human Judgment: How Dropbox Uses LLMs To Improve Labeling For RAG Systems

To improve the relevance of responses produced by Dropbox Dash, Dropbox engineers began using LLMs to augment human labeling, which plays a crucial role in identifying the documents that should be used to generate the responses. Their approach offers useful insights for any system built on retrieval-augmented generation (RAG).

As Dropbox principal engineer Dmitriy Meyerzon explains, document retrieval quality is the bottleneck in RAG systems that select relevant content from large document repositories before passing it to an LLM:

Because there are millions (and, in very large enterprises, billions) of documents in the enterprise search index, Dash can pass along only a small subset of the retrieved documents to the LLM. This makes the quality of search ranking—and the labeled relevance data used to train it—critical to the quality of the final answer.

The implication is that the quality of the search ranking model has a direct impact on generated answers. Dash uses a ranking model trained with supervised learning techniques, where query-document pairs are labeled according to how well each document satisfies a given query. The main challenge of this approach lies in producing a large volume of high-quality relevance labels.

To address the limitations of purely human labeling, which is expensive, slow, and inconsistent, Dropbox introduced a complementary approach in which an LLM generates relevance judgments at scale. This method is cheaper, more consistent, and can easily scale to large document sets. However, LLMs are not perfect evaluators, so their judgments must be assessed before being used for training.

In practice, using LLMs for relevance evaluation requires a structured process that combines automation with human oversight.

This approach, called “human-calibrated LLM labeling”, is straightforward: humans label a small, high-quality dataset, which is later used to calibrate the LLM evaluator. The LLM then generates hundreds of thousands or even millions of labels, amplifying human effort by roughly 100×. Importantly, LLMs do not replace the ranking system, as using them directly for query-time ranking would be too slow and limited by context.
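The calibrate-then-amplify flow can be sketched in a few lines of Python. This is only an illustration, not Dropbox's implementation: the judges here are plain functions standing in for LLM calls, and the names `agreement`, `calibrate`, `label_at_scale`, the 0 to 3 grading scale, and the sample data are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]             # (query, document text)
Judge = Callable[[str, str], int]  # returns a relevance grade (assumed scale: 0-3)


def agreement(judge: Judge, labeled: Dict[Pair, int]) -> float:
    """Fraction of human-labeled pairs where the judge gives the same grade."""
    hits = sum(1 for (q, d), grade in labeled.items() if judge(q, d) == grade)
    return hits / len(labeled)


def calibrate(judges: Dict[str, Judge], labeled: Dict[Pair, int]) -> str:
    """Pick the candidate judge that best matches the human labels."""
    return max(judges, key=lambda name: agreement(judges[name], labeled))


def label_at_scale(judge: Judge, pairs: List[Pair]) -> Dict[Pair, int]:
    """Apply the calibrated judge to unlabeled query-document pairs."""
    return {pair: judge(*pair) for pair in pairs}


# Hypothetical stand-ins for LLM judge configurations (e.g. prompt variants).
def keyword_judge(query: str, doc: str) -> int:
    return 3 if query.lower() in doc.lower() else 0


def always_relevant_judge(query: str, doc: str) -> int:
    return 3


# Small, high-quality human-labeled calibration set (made-up data).
human_labels = {
    ("expense policy", "Expense policy for 2024 travel"): 3,
    ("expense policy", "Team offsite photos"): 0,
    ("onboarding", "Onboarding checklist for new hires"): 3,
}

candidates = {"keyword": keyword_judge, "always_relevant": always_relevant_judge}
best = calibrate(candidates, human_labels)

# The chosen judge then labels a much larger pool of unlabeled pairs.
unlabeled = [
    ("onboarding", "Quarterly revenue report"),
    ("expense policy", "Updated expense policy FAQ"),
]
machine_labels = label_at_scale(candidates[best], unlabeled)
```

In a real system each judge would wrap an LLM call, and the calibration step might tune prompts or instructions rather than choose among fixed functions. The structure stays the same: a small human-labeled set decides which evaluator is trusted to produce labels in bulk.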

The evaluation step involves comparing LLM-generated relevance ratings with human judgments on a test subset of query-document pairs not included in the training set. Evaluation also focuses on the hardest mistakes: cases where LLM judgments disagree with user behavior, such as users clicking documents the LLM rated low or skipping documents the LLM rated high. These cases produce the strongest learning signal.
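As a rough sketch of this evaluation step, the following Python computes agreement between LLM and human grades on a held-out set and then pulls out the cases where LLM grades conflict with click behavior. The 0 to 3 grading scale, the thresholds `low_max` and `high_min`, the `Interaction` record, and all sample data are assumptions for illustration; the article does not specify which metrics or thresholds Dropbox uses.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Interaction:
    query: str
    doc_id: str
    llm_grade: int  # LLM relevance grade (assumed scale: 0 = irrelevant, 3 = highly relevant)
    clicked: bool   # whether the user clicked the document when it was shown


def exact_agreement(pairs: List[Tuple[int, int]]) -> float:
    """Share of held-out items where the LLM grade equals the human grade."""
    return sum(1 for llm, human in pairs if llm == human) / len(pairs)


def mean_abs_error(pairs: List[Tuple[int, int]]) -> float:
    """Average distance between LLM and human grades on the held-out set."""
    return sum(abs(llm - human) for llm, human in pairs) / len(pairs)


def hard_cases(
    log: List[Interaction], low_max: int = 1, high_min: int = 2
) -> List[Interaction]:
    """Interactions where the LLM grade and user behavior disagree.

    Clicked but graded low, or skipped but graded high.
    """
    return [
        item
        for item in log
        if (item.clicked and item.llm_grade <= low_max)
        or (not item.clicked and item.llm_grade >= high_min)
    ]


# (llm_grade, human_grade) for held-out query-document pairs (made-up data).
held_out = [(3, 3), (2, 3), (0, 0), (1, 0)]
print(exact_agreement(held_out))
print(mean_abs_error(held_out))

# Made-up interaction log.
log = [
    Interaction("diet sprite", "doc-a", llm_grade=0, clicked=True),
    Interaction("diet sprite", "doc-b", llm_grade=3, clicked=False),
    Interaction("expense policy", "doc-c", llm_grade=3, clicked=True),
    Interaction("expense policy", "doc-d", llm_grade=0, clicked=False),
]
disagreements = hard_cases(log)
print([item.doc_id for item in disagreements])
```

Only the two “diet sprite” interactions are flagged here, since the LLM grade and the click signal point in opposite directions. Such cases could then be routed to human review or used to refine the evaluator.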

One important consideration is that context is often critical for judging relevance. For example, “diet sprite” at Dropbox refers to an internal performance tool rather than a beverage. To address this, LLMs are allowed to run additional searches, look up context, and understand internal terminology, which dramatically improves labeling accuracy.

Based on their experience with Dropbox Dash, Meyerzon says that this approach enables LLMs to consistently amplify human judgment at scale, proving an effective way to improve RAG systems.