Beyond Win Rates: How Spotify Quantifies Learning In Product Experiments

Spotify has introduced the Experiments with Learning (EwL) metric on top of its Confidence experimentation platform to measure how many tests deliver decision-ready insights, not just how many “win.” EwL captures both the quantity and quality of learning across product teams, helping them make faster, smarter product decisions at scale.

A successful experiment under this framework is both valid (correctly implemented, with healthy traffic splits and no sample mismatches) and decision‑ready. The outcome must definitively support one of three actions: ship, abort, or iterate. This metric redefines experimentation success as learning that informs decisions—even when the result isn’t positive.

Confidence, Spotify’s experimentation platform, has enabled hundreds of teams to run experiments concurrently. The company’s focus has since shifted from increasing test velocity to optimising test quality and business impact.

Experiment Desirability Bias. Source: Spotify Engineering website

An EwL must satisfy two conditions:

Valid: All systems, metrics, and sample checks worked as intended.

Decision‑ready: Results clearly indicate one of three next steps:
- Ship: A metric improves without regressions.
- Abort: A regression is detected.
- Neutral but Powered: No significant effect is observed, but the experiment had enough statistical power to detect one if it existed.

Experiments classified as “no learning” fail one or more of these standards. They are separated into three types: invalid (failed health checks or setup errors), unpowered (neutral results with insufficient data on any key metric), and aborted early (tests stopped mid-run, with experimenter feedback collected for analysis).
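The classification described above can be sketched in a few lines of Python. This is a minimal illustration, not Spotify's implementation: the field names, the `classify` function, and the order in which the checks run are all assumptions made for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    SHIP = "ship"
    ABORT = "abort"
    NEUTRAL_POWERED = "neutral but powered"
    INVALID = "invalid"
    UNPOWERED = "unpowered"
    ABORTED_EARLY = "aborted early"


# The three outcomes the article counts as learning.
LEARNING_OUTCOMES = {Outcome.SHIP, Outcome.ABORT, Outcome.NEUTRAL_POWERED}


@dataclass
class ExperimentResult:
    # Hypothetical fields, named for this illustration only.
    passed_health_checks: bool   # valid setup, healthy traffic split
    stopped_early: bool          # test was stopped mid-run
    regression_detected: bool    # some metric got significantly worse
    improvement_detected: bool   # some metric got significantly better
    sufficiently_powered: bool   # enough data on every key metric


def classify(result: ExperimentResult) -> Outcome:
    """Map one experiment to an outcome (check order is an assumption)."""
    if not result.passed_health_checks:
        return Outcome.INVALID
    if result.stopped_early:
        return Outcome.ABORTED_EARLY
    if result.regression_detected:
        return Outcome.ABORT
    if result.improvement_detected:
        return Outcome.SHIP
    if result.sufficiently_powered:
        return Outcome.NEUTRAL_POWERED
    return Outcome.UNPOWERED


def is_learning(outcome: Outcome) -> bool:
    """True when the outcome counts as an Experiment with Learning."""
    return outcome in LEARNING_OUTCOMES
```

A regression is checked before an improvement here so that a test with both is treated as an abort, matching the "improves without regressions" wording for ship.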

While traditional A/B testing frameworks emphasise win rates, data shows that learning is a stronger indicator of experimentation health. Across Spotify R&D, the learning rate averages 64%, while the win rate is roughly 12%.

Win rate vs. learning rate for experiments run. Source: Spotify Engineering website

The gap highlights that most value emerges from identifying what doesn’t work or detecting regressions early—especially crucial in a mature product with hundreds of millions of users. Many experiments aim not to boost engagement directly, but to mitigate the risk of performance regressions caused by backend, infrastructure, or UX changes.
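To make the two rates concrete, here is a small Python sketch that computes both from a list of outcome labels. It assumes that a "win" corresponds to a ship outcome and that the learning rate is the share of ship, abort, and neutral-but-powered outcomes. The label strings and the per-category counts are invented for illustration; only the 64% and 12% totals come from the figures quoted above.

```python
from collections import Counter

# Labels counted as learning (hypothetical label strings).
LEARNING_LABELS = {"ship", "abort", "neutral_powered"}


def learning_and_win_rate(outcomes):
    """Return (learning_rate, win_rate) for a list of outcome labels."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no experiments to summarise")
    learned = sum(counts[label] for label in LEARNING_LABELS)
    return learned / total, counts["ship"] / total


# Invented breakdown of 100 experiments; only the totals match the article.
sample = (
    ["ship"] * 12
    + ["abort"] * 20
    + ["neutral_powered"] * 32
    + ["invalid"] * 10
    + ["unpowered"] * 18
    + ["aborted_early"] * 8
)

learning_rate, win_rate = learning_and_win_rate(sample)
print(f"learning rate {learning_rate:.0%}, win rate {win_rate:.0%}")
# prints: learning rate 64%, win rate 12%
```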

In 2018, the number of active experimenters increased from about 40 teams to nearly 300. This growth required investments in both technology—SDKs, analytical tooling, and a simplified UI—and in company culture, through training, documentation, and best practices.

Major app surfaces see dense experimentation: the mobile home screen alone hosted 520 experiments across 58 teams in one year. Because testing bandwidth is finite, EwL helps allocate experimentation capacity most effectively.

The EwL rate acts as a strategic signal:

A stable learning rate with a declining win rate indicates strong experiment quality but diminishing product returns—suggesting the need for bolder innovation bets.

A high learning rate paired with low business returns can reveal misallocated test capacity, prompting reprioritisation of surface areas or initiatives.

Operationally, Confidence uses EwL insights to channel bandwidth toward product areas generating the most actionable learning while reducing low-yield experimentation elsewhere.

EwL results also guide continuous platform enhancement. When learning rates drop, diagnostic signals often reveal underpowered tests, weak integrations, or configuration friction. Spotify’s platform team responds by refining:

Sample size calculators for better planning.

Health check tooling to detect invalid setups early.

Documentation and API integrations across stacks where invalid rates are high.

Organisationally, changes such as adding experiment reviewers and adjusting access controls have measurably improved EwL rates, raising both practice quality and confidence in outcomes.

To preserve the metric’s integrity, three key guardrails are monitored:

Win rate – ensuring teams still achieve positive results.

Experiment volume – keeping throughput high to maintain learning velocity.

Precision – ensuring effect sizes remain statistically reliable.

For example, raising minimum detectable effect sizes might artificially raise EwL by categorising more tests as “powered neutrals,” but would undermine precision, because only larger effects could then be detected reliably. Such trade-offs are balanced to avoid optimising EwL at the expense of innovation speed.
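The link between the minimum detectable effect (MDE) and the amount of data a test needs before a neutral result counts as powered can be sketched with the textbook sample-size formula for a two-sided, two-sample z-test. This is a generic approximation, not the calculation used by Confidence. The function name, the equal group sizes, and the known standard deviation `sigma` are assumptions for this example.

```python
from math import ceil
from statistics import NormalDist


def required_n_per_group(mde, sigma, alpha=0.05, power=0.8):
    """Approximate users per group needed to detect an absolute effect `mde`.

    Assumes a two-sided z-test, equal group sizes, and a known
    standard deviation `sigma` (all simplifications for illustration).
    """
    if mde <= 0 or sigma <= 0:
        raise ValueError("mde and sigma must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # desired power
    return ceil(2 * ((z_alpha + z_beta) * sigma / mde) ** 2)


# Required sample size falls with the square of the MDE, so a larger MDE
# lets a test qualify as "powered" with far fewer users.
for mde in (0.05, 0.1, 0.2):
    print(mde, required_n_per_group(mde, sigma=1.0))
```

Doubling the MDE cuts the required sample to roughly a quarter, which is why the MDE setting interacts so directly with how many neutral results count as powered.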

Experiment Outcome at Spotify. Source: Spotify Engineering website

At Spotify, experimentation is treated as a driver of insight, not just of shipping velocity. The EwL metric, with a learning rate of about 64% against a win rate of roughly 12%, reinforces the principle that avoiding adverse outcomes and discovering neutral results add as much business value as traditional wins.

Some “no learning” remains healthy—indicating experimentation that moves fast enough to sustain innovation. The key is balance: fast iteration, rigorous design, and continuous learning from every outcome.