World of Software
Scaling Human Judgment: How Dropbox Uses LLMs to Improve Labeling for RAG Systems

News Room
Last updated: 2026/03/07 at 2:06 PM
News Room Published 7 March 2026

To improve the relevance of responses produced by Dropbox Dash, Dropbox engineers began using LLMs to augment human labelling, which plays a crucial role in identifying the documents that should be used to generate the responses. Their approach offers useful insights for any system built on retrieval-augmented generation (RAG).

As Dropbox principal engineer Dmitriy Meyerzon explains, document retrieval quality is the bottleneck in RAG systems that select relevant content from large document repositories before passing it to an LLM:

Because there are millions (and, in very large enterprises, billions) of documents in the enterprise search index, Dash can pass along only a small subset of the retrieved documents to the LLM. This makes the quality of search ranking—and the labeled relevance data used to train it—critical to the quality of the final answer.

The implication is that the quality of the search ranking model has a direct impact on generated answers. Dash uses a ranking model trained with supervised learning techniques where query-document pairs are labelled according to how well each document satisfies a given query. The main challenge of this approach lies in producing a large volume of high-quality relevance labels.

To address the limitations of purely human judge-based labelling, which is expensive, slow, and inconsistent, Dropbox introduced a complementary approach in which an LLM generates relevance judgments at scale. This method is cheaper, more consistent, and can easily scale to large document sets. However, LLMs are not perfect evaluators, so their judgments must be assessed before being used for training.

In practice, using LLMs for relevance evaluation requires a structured process that combines automation with human oversight.

This approach, called “human-calibrated LLM labeling”, is straightforward: humans label a small, high-quality dataset, which is later used to calibrate the LLM evaluator. The LLM then generates hundreds of thousands or even millions of labels, amplifying human effort by roughly 100×. Importantly, LLMs do not replace the ranking system, as using them directly for query-time ranking would be too slow and limited by context.
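The calibrate-then-amplify loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the article does not say how Dropbox calibrates its evaluator, so the sketch assumes calibration means choosing, among candidate judges (for example, prompt variants), the one that agrees most often with the human labels. The function names, the 0-3 grading scale, and the two stand-in judges are all hypothetical; a real judge would call an LLM.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]              # (query, document text)
Judge = Callable[[str, str], int]   # returns a relevance grade (assumed 0-3 scale)

def agreement(judge: Judge, labeled: Dict[Pair, int]) -> float:
    """Fraction of human-labeled pairs where the judge gives the same grade."""
    hits = sum(1 for (q, d), grade in labeled.items() if judge(q, d) == grade)
    return hits / len(labeled)

def calibrate(candidates: Dict[str, Judge], labeled: Dict[Pair, int]) -> str:
    """Return the name of the candidate judge that best matches the human labels."""
    return max(candidates, key=lambda name: agreement(candidates[name], labeled))

def label_at_scale(judge: Judge, pairs: List[Pair]) -> Dict[Pair, int]:
    """Apply the calibrated judge to unlabeled query-document pairs."""
    return {pair: judge(*pair) for pair in pairs}

# Hypothetical stand-ins for LLM calls, so the sketch runs offline.
def keyword_judge(query: str, doc: str) -> int:
    return 3 if query.lower() in doc.lower() else 0

def always_relevant(query: str, doc: str) -> int:
    return 3

# Small, high-quality human-labeled set (invented examples).
human_labels: Dict[Pair, int] = {
    ("expense policy", "Expense policy for 2025 travel"): 3,
    ("expense policy", "Lunch menu for Friday"): 0,
    ("onboarding", "Onboarding checklist for new hires"): 3,
}

candidates: Dict[str, Judge] = {"keyword": keyword_judge, "always": always_relevant}
best = calibrate(candidates, human_labels)

# The calibrated judge then labels a much larger unlabeled pool.
unlabeled: List[Pair] = [
    ("onboarding", "Quarterly onboarding survey"),
    ("onboarding", "Parking map"),
]
machine_labels = label_at_scale(candidates[best], unlabeled)
```

In this toy setup the human set has three pairs and the machine-labeled pool has two; in the scenario the article describes, the second pool would be orders of magnitude larger than the first, and the resulting labels would feed the ranking model's training rather than replace it.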

The evaluation step involves comparing LLM-generated relevance ratings with human judgments on a test subset of query-document pairs not included in the training set. Evaluation also focuses on the hardest mistakes, where LLM judgments disagree with user behavior, such as users clicking documents the LLM rated low or skipping documents the LLM rated high. These disagreements produce the strongest learning signal.
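As a rough illustration of this evaluation step, the sketch below computes two simple agreement measures on a held-out set and then flags the two kinds of hard mistakes the article mentions. The metric choices, the 0-3 grade scale, the thresholds, and the `Impression` record are assumptions made for the example; the article does not specify which metrics or log format Dropbox uses.

```python
from dataclasses import dataclass
from typing import List, Tuple

def exact_agreement(llm: List[int], human: List[int]) -> float:
    """Share of held-out pairs where LLM and human grades match exactly."""
    return sum(a == b for a, b in zip(llm, human)) / len(human)

def mean_abs_error(llm: List[int], human: List[int]) -> float:
    """Average distance between LLM and human grades."""
    return sum(abs(a - b) for a, b in zip(llm, human)) / len(human)

@dataclass
class Impression:
    """One document shown to a user for a query (hypothetical log record)."""
    query: str
    doc_id: str
    llm_grade: int   # assumed scale: 0 (irrelevant) to 3 (highly relevant)
    clicked: bool

def hard_cases(impressions: List[Impression],
               low: int = 0, high: int = 3) -> List[Tuple[str, str]]:
    """Flag impressions where the LLM grade contradicts user behavior."""
    flagged = []
    for imp in impressions:
        if imp.clicked and imp.llm_grade <= low:
            flagged.append((imp.doc_id, "clicked_but_rated_low"))
        elif not imp.clicked and imp.llm_grade >= high:
            flagged.append((imp.doc_id, "skipped_but_rated_high"))
    return flagged

# Held-out grades for four query-document pairs (invented values).
held_out_llm = [3, 2, 0, 1]
held_out_human = [3, 1, 0, 1]
agree = exact_agreement(held_out_llm, held_out_human)
mae = mean_abs_error(held_out_llm, held_out_human)

# Invented click log for the disagreement check.
log = [
    Impression("q1", "a", llm_grade=0, clicked=True),
    Impression("q1", "b", llm_grade=3, clicked=False),
    Impression("q1", "c", llm_grade=3, clicked=True),
    Impression("q2", "d", llm_grade=0, clicked=False),
]
flagged = hard_cases(log)
```

Only documents "a" and "b" are flagged here, because they are the ones where the grade and the click point in opposite directions. Flagged pairs like these would be natural candidates for human review.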

One important consideration is that context is often critical for judging relevance. For example, “diet sprite” at Dropbox refers to an internal performance tool rather than a beverage. To address this, LLMs are allowed to run additional searches, look up context, and understand internal terminology, which dramatically improves labeling accuracy.
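One simple way to picture the context lookup is a judge prompt that is enriched with definitions of internal terms before the LLM grades a document. The sketch below is a simplification under stated assumptions: the article describes the LLM running additional searches, whereas this example uses a static glossary dictionary as a stand-in for that search, and the function names and prompt wording are hypothetical.

```python
from typing import Dict, List

def lookup_context(query: str, glossary: Dict[str, str]) -> List[str]:
    """Return glossary notes for any internal term that appears in the query."""
    q = query.lower()
    return [f"{term}: {meaning}" for term, meaning in glossary.items() if term in q]

def build_judge_prompt(query: str, document: str, glossary: Dict[str, str]) -> str:
    """Assemble a relevance-grading prompt that includes looked-up context."""
    notes = lookup_context(query, glossary)
    context = "\n".join(notes) if notes else "(none found)"
    return (
        "Rate how well the document answers the query on a 0-3 scale.\n"
        f"Internal terminology:\n{context}\n"
        f"Query: {query}\n"
        f"Document: {document}\n"
    )

# Glossary entry based on the article's example; the wording is illustrative.
glossary = {"diet sprite": "internal performance tool (not a beverage)"}

prompt = build_judge_prompt(
    "diet sprite latency dashboard",
    "How to read latency reports from the performance tool",
    glossary,
)
plain_prompt = build_judge_prompt("holiday schedule", "Office closures list", glossary)
```

With the glossary note in the prompt, a judge has a reason to treat a document about performance tooling as relevant to the "diet sprite" query, which it would otherwise be likely to grade as off-topic.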

Based on the team's experience with Dropbox Dash, Meyerzon says that this approach enables LLMs to consistently amplify human judgment at scale, proving an effective way to improve RAG systems.
