Yandex Releases Massive Dataset to Help AI Understand What You Really Like | HackerNoon

News Room · Published 11 July 2025 · Last updated 11 July 2025 at 5:08 PM

Meet Yambda: One of the world’s largest open datasets for RecSys.

Recommender algorithms help people discover the right products, movies, music, and more. They’re the backbone of services ranging from online stores to streaming platforms. The advancement of these algorithms directly depends on research, which in turn requires high-quality, large-scale datasets.

However, most open-source datasets are either small or outdated, as companies that accumulate terabytes of data rarely make them publicly available due to privacy concerns. Today, we’re releasing Yambda, one of the world’s largest recommendation datasets.

This dataset features 4.79 billion anonymized user interactions, compiled from 10 months of user activity on the Music streaming service.

We chose the Music service because it's the largest subscription-based streaming service in Russia, with an average monthly audience of 28 million users.

A significant portion of the dataset includes aggregated listens, likes, and dislikes, as well as track attributes sourced from the personalized recommendation system My Vibe. All user and track data is anonymized: the dataset contains only numeric identifiers, ensuring user privacy.

Releasing large open datasets like Yambda helps solve several problems. Access to high-quality, large-scale data opens new avenues for scientific research and engages young researchers keen to apply machine learning to real-world challenges.

I’m Alexander Ploshkin, and I lead personalization quality development at Yandex.

In this article, I’ll explain what the dataset consists of, how we collected it, and how you can use it to evaluate new recommender algorithms.

Let’s begin!

Why do large-scale open datasets matter?

Recommender systems have experienced a true renaissance in recent years.

Tech companies are increasingly adopting transformer-based models, inspired by the success of large language models (LLMs) in other domains.

What we’ve learned from computer vision and natural language processing is that data volume is crucial for how well these methods work: transformers aren’t very effective on small datasets but become almost essential once they scale to billions of tokens.

Truly large-scale open datasets are a rarity in the recommender systems domain.

Well-known datasets like LFM-1B, LFM-2B, and the Music Listening Histories Dataset (27B) have become unavailable over time due to licensing restrictions.

Source: https://www.cp.jku.at/datasets/LFM-1b/

Currently, the record for the number of user interactions is held by Criteo’s advertising dataset, with approximately 4 billion events. This creates a challenge for researchers: most don’t have access to web-scale services, meaning they can’t test algorithms under conditions that resemble real-world deployments.

Popular datasets like MovieLens, Steam, or the Netflix Prize contain, at best, tens of millions of interactions and typically focus on explicit feedback, such as ratings and reviews.

Meanwhile, production recommender systems work with much more diverse and nuanced signals: clicks, likes, full listens, views, purchases, and so on.

There’s another critical issue: the lack of temporal dynamics. Many datasets don’t allow for an honest chronological split between training and test sets, which is crucial for evaluating algorithms that aim to predict the future, not just explain the past.

To address these challenges and support the development of new algorithms in recommender systems, we’re releasing Yambda.

This dataset is currently the largest open resource for user interactions in the recommendation domain.

What’s inside Yambda?

The dataset includes interactions from 1 million users with over 9 million music tracks from the Music service, totaling 4.79 billion events.

First, to be clear: all events are anonymized.

The dataset uses only numeric identifiers for users, tracks, albums, and artists. This is to ensure privacy and protect user data.

The dataset includes key implicit and explicit user actions:

  • Listen: The user listened to a music track.
  • Like: The user liked a track (“thumbs up”).
  • Unlike: The user removed a like.
  • Dislike: The user disliked a track (“thumbs down”).
  • Undislike: The user removed a dislike.

To make the dataset more accessible, we've also released smaller samples containing 480 million and 48 million events.

Summary statistics for these subsets are provided in the table below:

Dataset summary (Image by Author)

The data is stored in Apache Parquet format, which is natively supported by Python data analysis libraries such as Pandas and Polars. For ease of use, the dataset is fully replicated in two formats (a short loading sketch follows the list):

  • Flat: Each row represents a single interaction between a user and a track.
  • Sequential: Each row contains the complete interaction history of a single user.
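As an illustration of working with the two layouts in Polars, here is a minimal sketch. The file names are hypothetical, and the column names (uid) are assumptions based on the structure table below; check the dataset card for the exact names.

```python
import polars as pl

# Flat layout: one row per user-track interaction.
# File and column names are illustrative, not the dataset's exact names.
listens = pl.scan_parquet("listens.parquet")

# Count events per user lazily, without loading the full file into memory.
per_user = (
    listens
    .group_by("uid")
    .agg(pl.len().alias("n_events"))
    .collect()
)

# Sequential layout: one row per user, with the full history stored as lists.
sequential = pl.read_parquet("sequential_listens.parquet")
print(per_user.head())
print(sequential.head())
```

The lazy scan_parquet API matters here: at 4.79 billion rows, the full flat file won't fit in memory on most machines, so aggregations should be pushed down before collecting.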

The dataset structure is as follows:

File names and columns (Image by Author)

A key feature of Yambda is the is_organic flag, which is included with every event. This flag helps differentiate between user actions that happened naturally and those prompted by recommendations.

If is_organic = 0, the event was triggered by a recommendation, for example, a personalized music stream or a recommended playlist. All other events are considered organic, meaning the user discovered the content on their own.
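For instance, the share of recommendation-driven events can be computed directly from this flag (a sketch; the file name is illustrative):

```python
import polars as pl

events = pl.scan_parquet("listens.parquet")  # illustrative file name

# Share of events triggered by recommendations (is_organic == 0).
rec_share = (
    events
    .select((pl.col("is_organic") == 0).mean().alias("rec_share"))
    .collect()
)
print(rec_share)
```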

The table below provides statistics on recommendation-driven events:

Statistics based on event types (Image by Author)

User interaction history is key to creating personalized recommendations. It captures both long-term preferences and momentary interests that may shift with context.

To help you better understand the data structure, here are some quick statistics on our dataset:

Event charts against user frequency (Image by Author)

The above charts reveal that user history length follows a heavy-tailed distribution.

This means that while most users have relatively few interactions, a small but significant group has very long interaction histories.

This is especially important to account for when building recommendation models: they should avoid overfitting to highly active users while maintaining quality for the much larger group of less engaged users.
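To inspect this distribution yourself, per-user history lengths and their quantiles can be computed from the flat files (a sketch; file and column names are assumed as before):

```python
import polars as pl

listens = pl.scan_parquet("listens.parquet")  # illustrative file name

# Per-user history lengths; a heavy tail shows up as a large gap
# between the median user and the most active percentiles.
lengths = (
    listens
    .group_by("uid")
    .agg(pl.len().alias("history_len"))
    .collect()
)
print(lengths["history_len"].quantile(0.50))  # typical user
print(lengths["history_len"].quantile(0.99))  # highly active users
```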

In contrast, the distribution across tracks tells a very different story.

Distribution across tracks in Yambda (Image by Author)

This chart clearly shows the imbalance between the highly popular tracks and a large volume of niche content: over 90% of tracks received fewer than 100 plays during the entire data collection period.

Despite this, recommender systems must engage with the entire catalog to surface even low-popularity tracks that align well with individual user preferences.

Using Yambda to evaluate algorithmic performance

Academic studies on recommender algorithm quality often use the Leave-One-Out (LOO) scheme, where a single user action is held back for testing and the rest are used for training.
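In code, a leave-one-out split typically holds back each user's most recent interaction (a sketch with the same assumed file and column names):

```python
import polars as pl

events = pl.read_parquet("listens.parquet")  # illustrative file name

# Leave-one-out: each user's most recent event becomes the test example.
events = events.sort("timestamp")
test = events.group_by("uid").tail(1)
# Anti-join removes the held-out rows from training
# (ties at identical timestamps are ignored for simplicity).
train = events.join(test, on=["uid", "timestamp"], how="anti")
```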

This method, however, comes with two serious drawbacks:

  • Temporal inconsistency: Test events can include actions that happened before those in the training set.
  • Equal weighting of users: Inactive users affect the evaluation metrics just as much as active ones, which can skew the results.

To bring evaluation conditions closer to real-world recommender system scenarios, we propose an alternative: global temporal split.

This simple method selects a point in time T and excludes all subsequent events from the training set.

This ensures the model trains on historical data and is tested against future data, mimicking a true production environment. The diagram below illustrates this:

Global temporal split (Image by Author)

For our evaluation, we reserved one day of data as the holdout set for two main reasons:

  1. Even a single day’s worth of data provides enough volume to reliably assess algorithm performance.
  2. Models in real-world production have different characteristics: some require frequent stat updates (for example, popularity-based recommendations), others are fine-tuned or retrained periodically (boosting, matrix factorization, two-tower models), and some depend on continuously updated user interaction histories (recurrent and transformer-based models).

From our viewpoint, a one-day window is the optimal evaluation period to keep models static while still capturing short-term trends.

The drawback of this approach is that it doesn’t account for longer-term patterns, such as weekly shifts in music listening behavior. We suggest leaving those aspects for future research.
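A global temporal split with a one-day holdout is straightforward to express (a sketch; the timestamp unit is assumed to be seconds, and the file name remains illustrative):

```python
import polars as pl

events = pl.read_parquet("listens.parquet")  # illustrative file name

# Global temporal split with a one-day holdout: train on everything
# up to T, test on the final day (assuming Unix timestamps in seconds).
ONE_DAY = 24 * 60 * 60
t_split = events["timestamp"].max() - ONE_DAY

train = events.filter(pl.col("timestamp") <= t_split)
test = events.filter(pl.col("timestamp") > t_split)
```

Unlike leave-one-out, every test event here is guaranteed to occur after every training event, which is exactly the temporal consistency the article argues for.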

Baselines

We evaluated several popular recommender algorithms on Yambda to establish baselines for future research and comparison.

The algorithms we tested include: MostPop, DecayPop, ItemKNN, iALS, BPR, SANSA, and SASRec.

For evaluation, we used the following metrics (a minimal sketch of each follows the list):

  • NDCG@k (Normalized Discounted Cumulative Gain), which measures the quality of ranking in recommendations.
  • Recall@k, which assesses the algorithm’s ability to retrieve relevant recommendations from the total pool.
  • Coverage@k, which indicates how broadly the recommendation catalog is represented.
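The official evaluation code is on Hugging Face; purely as an illustration, here is a minimal Python sketch of these three metrics for binary relevance (not the authors' exact implementation):

```python
import numpy as np

def ndcg_at_k(recommended: list, relevant: set, k: int) -> float:
    """Binary-relevance NDCG@k for one user: DCG over the top-k list,
    normalized by the best achievable DCG."""
    gains = [1.0 if item in relevant else 0.0 for item in recommended[:k]]
    dcg = sum(g / np.log2(i + 2) for i, g in enumerate(gains))
    ideal = sum(1.0 / np.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0

def recall_at_k(recommended: list, relevant: set, k: int) -> float:
    """Fraction of a user's relevant items retrieved in the top k."""
    if not relevant:
        return 0.0
    return len(set(recommended[:k]) & relevant) / len(relevant)

def coverage_at_k(all_recommendations: list, catalog_size: int, k: int) -> float:
    """Share of the catalog that appears in any user's top-k list."""
    shown = {item for recs in all_recommendations for item in recs[:k]}
    return len(shown) / catalog_size
```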

Results are provided in the tables below, and the code is available on Hugging Face.

Baseline results for Yambda (Image by Author)

Baseline results for Yambda (Image by Author)

Conclusion

Yambda can be valuable for research into recommendation algorithms on large-scale data, where both performance and the ability to model behavioral dynamics are crucial.

The dataset is available in three versions: a full set with 4.79 billion events, and smaller subsets with 480 million and 48 million events.

Developers and researchers can choose the version that best fits their project and computational resources. Both the dataset and the evaluation code are available on Hugging Face.

We hope this dataset proves useful in your experiments and research!

Thanks for reading!
