Apple researchers ran an A/B test to measure how AI-generated relevance labels would affect App Store search rankings and app downloads. Here’s what they found.
AI-generated relevance labels slightly improved App Store search conversions
In a new study titled Scaling Search Relevance: Augmenting App Store Ranking with LLM-Generated Judgments, a group of Apple researchers explored whether LLMs could help improve App Store search results by generating the relevance labels used to train the ranking system.
As the study explains, relevance is obviously key to helping users find the apps they’re looking for. And while there are many signals that can contribute to search ranking, the researchers focused on two main ones:
- Behavioral relevance, which reflects how users interact with results, such as whether they tap on or download an app.
- Textual relevance, which measures how well an app’s metadata (like its name, description, and keywords) semantically matches a user’s search query.
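To make the distinction concrete, here's a toy illustration (not Apple's actual method, which relies on an LLM judge): a crude textual relevance score computed as token overlap between a query and an app's metadata. Everything here is a simplified stand-in.

```python
# Toy illustration only: textual relevance as Jaccard token overlap
# between the search query and the app's metadata text.
def textual_relevance(query: str, metadata: str) -> float:
    """Return the Jaccard overlap of query tokens and metadata tokens."""
    q = set(query.lower().split())
    m = set(metadata.lower().split())
    if not q or not m:
        return 0.0
    return len(q & m) / len(q | m)

# A query that matches the metadata scores higher than one that doesn't.
matching = textual_relevance("photo editor", "Photo Editor Pro: filters and collage")
unrelated = textual_relevance("photo editor", "Weather forecast and radar maps")
```

Real systems use far richer semantic matching than token overlap, which is exactly why the study turns to an LLM for these judgments.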
In the study, the researchers say that while there is plenty of available data regarding behavioral relevance (since that can be easily measured), the same isn’t true for textual relevance:
While behavioral relevance labels are abundant, textual relevance labels generated by human judges are much rarer. This creates a fundamental problem: high-quality textual relevance labels are scarce and expensive to produce, creating a scalability bottleneck and leaving the textual relevance objective under-powered in multi-objective training.
To tackle this problem, the researchers fine-tuned a 3-billion-parameter LLM on existing human judgments so it could learn to assign relevance labels to apps based on a user’s search query and the app’s metadata.
Next, they generated millions of new relevance labels with that model, and retrained the App Store ranking system using both the original data, and the LLM-generated labels.
Once that was done, they ran an offline evaluation, followed by a worldwide A/B test on live App Store traffic:
“(…) the LLM-augmented model demonstrated a statistically significant +0.24% increase in our primary metric, conversion rate, defined as the proportion of search sessions with at least one app download. While this number may appear small, it is considered a significant improvement for a mature industrial ranker. This gain was observed in 89% of storefronts.”
In other words, search sessions ranked by the LLM-augmented model ended with at least one app download 0.24% more often than sessions ranked by the traditional model.
And while 0.24% is obviously a very small increase, it adds up quickly when we consider that most estimates peg total App Store downloads in 2025 at around 38 billion. In practice, that could translate to tens of millions of additional downloads from App Store searches, which developers would surely appreciate.
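As a back-of-the-envelope check (these assumptions are mine, not figures from the study), applying a 0.24% lift to all 38 billion annual downloads gives an upper bound; only the search-driven share of downloads would actually benefit:

```python
# Rough ceiling on the impact of a +0.24% lift; assumes the estimated
# 38 billion 2025 downloads, and ignores that only search-driven
# downloads are affected, so the true figure is lower.
total_downloads = 38e9   # estimated 2025 App Store downloads
lift = 0.0024            # +0.24% conversion-rate improvement
upper_bound = total_downloads * lift
print(f"{upper_bound / 1e6:.1f} million")  # prints "91.2 million"
```

Even if search accounts for only a fraction of those downloads, the lift still lands comfortably in the tens of millions.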
To read the full study, follow this link.