
Recommender systems are everywhere, from playlists to product pages, yet the research behind them is often starved of scale. Unlike large language models that thrive on massive datasets, recommender algorithms are usually evaluated on small, outdated benchmarks. 

Global technology company Yandex hopes to narrow that gap between research and production with the release of Yambda-5B, the largest open dataset of anonymised user interactions currently available for recommendation tasks.

Yambda-5B contains 4.79 billion anonymised user interactions, including listens, likes, and dislikes, gathered from Yandex Music over a ten-month period. With it comes metadata, audio embeddings, timestamped logs, and a subtle but crucial flag marking whether a user found a track organically or through algorithmic suggestions.

From Netflix to Yambda: Why Researchers Needed This

Yambda is not just larger in volume; it is structurally designed to reflect modern usage. 

Many existing benchmarks fail to simulate the complexities of real-world environments. The classic Netflix Prize dataset includes fewer than 18,000 items and only date-level timestamps, making it ill-suited for any temporal or sequential modelling. 

Spotify’s Million Playlist dataset, while popular, represents only a small fraction of the scale that commercial systems require. Meanwhile, Criteo’s terabyte-scale logs are plagued by missing documentation and inconsistent identifiers, making reproducibility a challenge.

In contrast, Yambda includes high-fidelity audio embeddings derived from convolutional neural networks, providing content-level features rarely found in public datasets. It captures five kinds of user interactions—listens, likes, dislikes, unlikes, and undislikes—allowing both implicit and explicit feedback to be studied in tandem. 

Each event is timestamped with five-second precision, and user actions are categorised using an “is_organic” flag to distinguish organic discovery from recommendation-driven activity. All user and track identities are anonymised using numeric identifiers to ensure compliance with privacy standards.
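The "is_organic" flag makes it straightforward to separate organic discovery from recommendation-driven plays. A minimal sketch of that split, using made-up field names and values (the article describes these fields, but the dataset's exact schema may differ):

```python
# Toy event log; field names and values are illustrative assumptions,
# not Yambda's actual schema.
events = [
    {"uid": 1, "item_id": 10, "timestamp": 100, "is_organic": True},
    {"uid": 1, "item_id": 11, "timestamp": 105, "is_organic": False},
    {"uid": 2, "item_id": 10, "timestamp": 110, "is_organic": True},
]

# Split plays by how the user reached the track.
organic = [e for e in events if e["is_organic"]]
algorithmic = [e for e in events if not e["is_organic"]]

# Share of plays driven by the recommender, a common diagnostic.
algo_share = len(algorithmic) / len(events)
```

Separating the two streams matters because a model trained only on recommendation-driven plays risks learning the previous recommender's biases rather than genuine user preferences.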

“Recommender systems are inherently tied to sensitive data. Companies can only publish recommender system datasets publicly after exhaustive anonymisation, a resource-intensive process that has slowed open innovation,” said Nikolai Savushkin, Head of Recommender Systems at Yandex. 

The research paper states that the dataset is released in both flat and sequential formats, making it accessible to teams working on batch inference or real-time modelling. The research team highlights that Yambda was created to support experiments conducted “under conditions that closely mirror real-world use”. 

Yambda includes users with long interaction histories, with a median of over 3,000 listens per user, creating an ideal testing ground for sequential and context-aware models. “By releasing Yambda-5B to the community, we aim to provide a readily accessible, industrial-scale resource to advance research, foster innovation, and promote reproducible results in recommender systems,” wrote the researchers.

Training Like It’s the Real World

The company mentions that the dataset’s importance lies not only in what it provides, but also in how it evaluates recommender algorithms. 

The dataset employs a Global Temporal Split (GTS) protocol, which partitions the data into training and test sets based on time rather than user interaction patterns. This preserves causal consistency and avoids the common pitfall of training models on future information. 

In the Yambda benchmark, training data spans 300 days, followed by a 30-minute buffer and then a one-day test window. The buffer was added deliberately. “A 30-minute gap between training and test sets was introduced to exclude interactions used neither for training nor evaluation. This mimics the latency between model training and deployment in industrial systems,” the research paper explained. 
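The split described above can be sketched in a few lines. The toy timestamps below are invented for illustration; only the window sizes (300 days of training, a 30-minute buffer, a one-day test window) come from the paper:

```python
# Toy event log: (user_id, item_id, unix_timestamp) tuples.
# The concrete values are illustrative, not from the dataset.
events = [
    (1, 10, 100),
    (1, 11, 5_000_000),
    (2, 10, 9_000_000),
    (2, 12, 25_919_000),
    (3, 13, 25_921_000),   # lands in the 30-minute buffer below
    (3, 14, 25_925_000),   # lands in the one-day test window
]

DAY = 86_400
TRAIN_DAYS, BUFFER_SEC, TEST_DAYS = 300, 30 * 60, 1

t0 = min(ts for _, _, ts in events)
train_end = t0 + TRAIN_DAYS * DAY        # end of the 300-day training window
test_start = train_end + BUFFER_SEC      # 30-minute deployment-latency gap
test_end = test_start + TEST_DAYS * DAY  # one-day test window

# Time-based partition: no event after train_end leaks into training,
# and buffer events are used for neither training nor evaluation.
train = [e for e in events if e[2] < train_end]
test = [e for e in events if test_start <= e[2] < test_end]
```

Because the cut is global in time rather than per-user, a model never sees any interaction, from any user, that occurred after the training cutoff.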

Yambda comes with a robust benchmarking suite that includes both traditional and modern algorithms. These range from simple popularity-based approaches like MostPop and DecayPop to matrix factorisation methods such as iALS and BPR. Notably, it also includes SASRec, a Transformer-based sequential model that excels at capturing long-range dependencies in user behaviour. 
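The simplest of these baselines, MostPop, ranks items by raw interaction count and serves the same top-k list to every user. A minimal sketch, with placeholder (user, item) pairs:

```python
from collections import Counter

# MostPop baseline sketch: count plays per item in the training log
# and recommend the k most-played items to everyone.
# The (user, item) pairs are made-up placeholders.
train = [(1, "a"), (1, "b"), (2, "a"), (2, "c"), (3, "a"), (3, "b")]

counts = Counter(item for _, item in train)

def most_pop(k):
    """Return the k globally most-played items, most popular first."""
    return [item for item, _ in counts.most_common(k)]

top2 = most_pop(2)  # "a" has 3 plays, "b" has 2
```

Non-personalised baselines like this are deceptively hard to beat on head-heavy music catalogues, which is exactly why benchmark suites include them alongside models such as SASRec.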

The research paper explains that traditional collaborative-filtering methods such as matrix factorisation show reduced effectiveness in settings that require immediate processing of new interactions. This underscores the importance of sequence-aware models like Transformers. The evaluation framework used in this study highlights this performance difference and points toward future research directions.

A Dataset With Long-Term Implications

Yambda’s utility extends far beyond the ranking metrics. Its combination of behavioural sequences, rich metadata, and content-level audio embeddings enables the exploration of cross-modal learning and hybrid recommendation architectures. 

Researchers can now study how audio characteristics influence user preferences or how graph neural networks might model the relationships between artists, albums, and tracks.

Yandex sees Yambda as a catalyst not only for academic progress but also for industry adoption. “Yambda empowers researchers to test innovative hypotheses and helps businesses build smarter recommender systems. Ultimately, users benefit by finding the perfect song, product, or service effortlessly,” added Savushkin. 

All three dataset sizes — 50M, 500M, and 5B events — are available on Hugging Face in Apache Parquet format. This democratises access to web-scale training data without the overhead of legal agreements or platform access.

“When industry leaders share hard-won tools and data, a rising tide lifts all boats: researchers gain real-world benchmarks, startups access resources once reserved for tech giants, and users everywhere enjoy greater personalisation,” said Savushkin. 

Yambda isn’t Yandex’s first open-source project in AI. The company has previously released several popular tools embraced by the machine learning community. These include Perforator, which identifies and evaluates code inefficiencies across entire codebases; AQLM, an advanced quantization algorithm for extreme compression of large language models; YaFSDP, a sharded data parallelism framework optimized for transformer-based architectures; and CatBoost, a high-performance gradient boosting library for decision trees.

The post Yandex’s Yambda Gives AI What Spotify and Netflix Didn’t appeared first on Analytics India Magazine.