Meet Yambda: One of the world’s largest open datasets for RecSys.
Recommender algorithms help people discover the right products, movies, music, and more. They’re the backbone of services ranging from online stores to streaming platforms. The advancement of these algorithms directly depends on research, which in turn requires high-quality, large-scale datasets.
However, most open-source datasets are either small or outdated, as companies that accumulate terabytes of data rarely make them publicly available due to privacy concerns. Today, we’re releasing Yambda, one of the world’s largest recommendation datasets.
This dataset features 4.79 billion anonymized user interactions, compiled from 10 months of user activity.
We chose the Music service because it’s the largest subscription-based streaming service in Russia, with an average monthly audience of 28 million users.
A significant portion of the dataset includes aggregated listens, likes, and dislikes, as well as track attributes sourced from the personalized recommendation system My Vibe. All user and track data is anonymized: the dataset contains only numeric identifiers, ensuring user privacy.
Releasing large open datasets like Yambda helps solve several problems. Access to high-quality, large-scale data opens new avenues for scientific research and engages young researchers keen to apply machine learning to real-world challenges.
I’m Alexander Ploshkin, and I lead personalization quality development at Yandex.
In this article, I’ll explain what the dataset consists of, how we collected it, and how you can use it to evaluate new recommender algorithms.
Let’s begin!
Why do large-scale open datasets matter?
Recommender systems have been experiencing a true renaissance in recent years.
Tech companies are increasingly adopting transformer-based models, inspired by the success of large language models (LLMs) in other domains.
What we’ve learned from computer vision and natural language processing is that data volume is crucial to how well these methods work: transformers aren’t very effective on small datasets but become almost indispensable once training data scales to billions of tokens.
Truly large-scale open datasets are a rarity in the recommender systems domain.
Well-known datasets like LFM-1B, LFM-2B, and the Music Listening Histories Dataset (27B) have become unavailable over time due to licensing restrictions.
Currently, the record for the number of user interactions is held by Criteo’s advertising dataset, with approximately 4 billion events. This creates a challenge for researchers: most don’t have access to web-scale services, meaning they can’t test algorithms under conditions that resemble real-world deployments.
Popular datasets like MovieLens, Steam, or the Netflix Prize contain, at best, tens of millions of interactions and typically focus on explicit feedback, such as ratings and reviews.
Meanwhile, production recommender systems work with much more diverse and nuanced signals: clicks, likes, full listens, views, purchases, and so on.
There’s another critical issue: the lack of temporal dynamics. Many datasets don’t allow for an honest chronological split between training and test sets, which is crucial for evaluating algorithms that aim to predict the future, not just explain the past.
To address these challenges and support the development of new algorithms in recommender systems, we’re releasing Yambda.
This dataset is currently the largest open resource for user interactions in the recommendation domain.
What’s inside Yambda?
The dataset includes interactions from 1 million users with over 9 million music tracks from the Music service, totaling 4.79 billion events.
First, to be clear: all events are anonymized.
The dataset uses only numeric identifiers for users, tracks, albums, and artists. This is to ensure privacy and protect user data.
The dataset includes key implicit and explicit user actions:
- Listen: The user listened to a music track.
- Like: The user liked a track (“thumbs up”).
- Unlike: The user removed a like.
- Dislike: The user disliked a track (“thumbs down”).
- Undislike: The user removed a dislike.
To make the dataset more accessible, we’ve also released two smaller samples containing 480 million and 48 million events.
Summary statistics for these subsets are provided in the table below:
The data is stored in Apache Parquet format, which is natively supported by Python data analysis libraries such as Pandas and Polars. For ease of use, the dataset is fully replicated in two formats:
- Flat: Each row represents a single interaction between a user and a track.
- Sequential: Each row contains the complete interaction history of a single user.
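If you want to get a feel for the data, here is a minimal loading sketch using Polars. The file paths below are hypothetical placeholders; substitute whichever subset and format you actually download.

```python
import polars as pl

# Hypothetical paths; point these at the subset and format you downloaded.
flat = pl.scan_parquet("yambda/flat/listens.parquet")              # one row per interaction
sequential = pl.scan_parquet("yambda/sequential/listens.parquet")  # one row per user history

# Lazy scanning lets Polars read only what you need, which matters
# at the scale of the full 4.79-billion-event dataset.
print(flat.select(pl.len()).collect())        # total number of interactions
print(sequential.select(pl.len()).collect())  # total number of users
```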
The dataset structure is as follows:
A key feature of Yambda is the is_organic flag, which is included with every event. This flag helps differentiate between user actions that happened naturally and those prompted by recommendations.
If is_organic = 0, the event was triggered by a recommendation, such as a track played in a personalized music stream or a recommended playlist. All other events are considered organic, meaning the user discovered the content on their own.
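As a quick illustration, here is a hedged sketch of splitting the data by this flag with Polars. The is_organic column and its 0/1 semantics come from the description above; the file path is a hypothetical placeholder.

```python
import polars as pl

# Hypothetical path; substitute the file you downloaded.
events = pl.scan_parquet("yambda/flat/listens.parquet")

# Split events into recommendation-driven and organic subsets
# (the flag is stored as 0/1 per the description above).
recommended = events.filter(pl.col("is_organic") == 0)
organic = events.filter(pl.col("is_organic") == 1)

# Share of listens that came from recommendations.
share = events.select(
    (pl.col("is_organic") == 0).mean().alias("recommended_share")
).collect()
print(share)
```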
The table below provides statistics on recommendation-driven events:
User interaction history is key to creating personalized recommendations. It captures both long-term preferences and momentary interests that may shift with context.
To help you better understand the data structure, here are some quick statistics on our dataset:
The above charts reveal that user history length follows a heavy-tailed distribution.
This means while most users have relatively few interactions, a small but significant group has very long interaction histories.
This is especially important to account for when building recommendation models: you need to avoid overfitting to highly active users while maintaining quality for the long tail of less engaged users.
In contrast, the distribution across tracks tells a very different story.
This chart clearly shows the imbalance between the highly popular tracks and a large volume of niche content: over 90% of tracks received fewer than 100 plays during the entire data collection period.
Despite this, recommender systems must engage with the entire catalog to surface even low-popularity tracks that align well with individual user preferences.
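If you would like to reproduce these kinds of statistics yourself, here is a minimal sketch with Polars. The uid and item_id column names and the file path are assumptions made for illustration.

```python
import polars as pl

events = pl.scan_parquet("yambda/flat/listens.parquet")  # hypothetical path

# Interaction count per user: the history-length distribution discussed above.
user_hist = events.group_by("uid").agg(pl.len().alias("n_events")).collect()
print(user_hist["n_events"].describe())

# Play count per track: expect a strong popularity skew.
track_plays = events.group_by("item_id").agg(pl.len().alias("n_plays")).collect()
low_pop = (track_plays["n_plays"] < 100).mean()
print(f"Share of tracks with fewer than 100 plays: {low_pop:.1%}")
```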
Using Yambda to evaluate algorithmic performance
Academic studies of recommender algorithm quality often use the leave-one-out (LOO) scheme, where a single action per user is held out for testing and the rest are used for training.
This method, however, comes with two serious drawbacks:
- Temporal inconsistency: Test events can include actions that happened before those in the training set.
- Equal weighting of users: Inactive users affect the evaluation metrics just as much as active ones, which can skew the results.
To bring evaluation conditions closer to real-world recommender system scenarios, we propose an alternative: global temporal split.
This simple method picks a point in time T: all events up to T go into the training set, and everything after T is held out for testing.
This ensures the model trains on historical data and is tested against future data, mimicking a true production environment. The diagram below illustrates this:
For our evaluation, we reserved one day of data as the holdout set for two main reasons:
- Even a single day’s worth of data provides enough volume to reliably assess algorithm performance.
- Models in real-world production have different update characteristics: some require frequently refreshed statistics (for example, popularity-based recommendations), others are fine-tuned or retrained periodically (boosting, matrix factorization, two-tower models), and some depend on continuously updated user interaction histories (recurrent and transformer-based models).
From our viewpoint, a one-day window is the optimal evaluation period to keep models static while still capturing short-term trends.
The drawback of this approach is that it doesn’t account for longer-term patterns, such as weekly shifts in music listening behavior. We suggest leaving those aspects for future research.
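To make the split concrete, here is a minimal sketch of a global temporal split with a one-day holdout. The timestamp column name and its unit (seconds) are assumptions made for illustration, as is the file path.

```python
import polars as pl

events = pl.scan_parquet("yambda/flat/listens.parquet")  # hypothetical path
DAY = 24 * 60 * 60  # one day in seconds (assuming Unix timestamps in seconds)

# Choose T so that exactly the last day of data is held out for evaluation.
t_max = events.select(pl.col("timestamp").max()).collect().item()
split_point = t_max - DAY

train = events.filter(pl.col("timestamp") <= split_point)
test = events.filter(pl.col("timestamp") > split_point)

# The model sees only the past; the last day plays the role of the "future".
print(train.select(pl.len()).collect().item(), test.select(pl.len()).collect().item())
```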
Baselines
We evaluated several popular recommender algorithms on Yambda to establish baselines for future research and comparison.
The algorithms we tested include: MostPop, DecayPop, ItemKNN, iALS, BPR, SANSA, and SASRec.
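As an example of the simplest of these baselines, here is a hedged sketch of MostPop: count plays in the training period and recommend the same top-K tracks to every user. The item_id column name and the path to the train split are assumptions for illustration.

```python
import polars as pl

# Hypothetical path to the training portion produced by a global temporal split.
train = pl.scan_parquet("yambda/train.parquet")

K = 100
# MostPop: rank tracks by play count in the training period and
# recommend the same top-K list to every user.
most_pop = (
    train.group_by("item_id")
    .agg(pl.len().alias("n_plays"))
    .sort("n_plays", descending=True)
    .head(K)
    .collect()
)
top_k_items = most_pop["item_id"].to_list()
```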
For evaluation, we used the following metrics:
- NDCG@k (Normalized Discounted Cumulative Gain), which measures the quality of ranking in recommendations.
- Recall@k, which assesses the algorithm’s ability to retrieve relevant recommendations from the total pool.
- Coverage@k, which indicates how broadly the recommendation catalog is represented.
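For reference, here is a compact sketch of these three metrics under binary relevance (an item counts as relevant if the user interacted with it during the holdout day). It is a simplified illustration, not the exact evaluation code we used.

```python
from math import log2

def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG@k for a single user."""
    dcg = sum(1.0 / log2(i + 2) for i, item in enumerate(recommended[:k]) if item in relevant)
    ideal = sum(1.0 / log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0

def recall_at_k(recommended, relevant, k):
    """Fraction of the user's relevant items that appear in the top-k list."""
    if not relevant:
        return 0.0
    return len(set(recommended[:k]) & set(relevant)) / len(relevant)

def coverage_at_k(all_recommendations, catalog_size, k):
    """Share of the full catalog that appears in at least one user's top-k list."""
    shown = {item for recs in all_recommendations for item in recs[:k]}
    return len(shown) / catalog_size

# Toy usage with made-up IDs: the holdout set is what the user interacted with after the split.
recs = [10, 42, 7, 3, 99]
holdout = {42, 3, 500}
print(ndcg_at_k(recs, holdout, k=5), recall_at_k(recs, holdout, k=5))
print(coverage_at_k([recs], catalog_size=9_000_000, k=5))
```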
Results are provided in tables, and the code is available on
Conclusion
Yambda can be valuable for research into recommendation algorithms on large-scale data, where both performance and the ability to model behavioral dynamics are crucial.
The dataset is available in three versions: the full set with 4.79 billion events, and smaller subsets with 480 million and 48 million events.
Developers and researchers can choose the version that best fits their project and computational resources. Both the dataset and the evaluation code are available on
We hope this dataset proves useful in your experiments and research!
Thanks for reading!