Netflix engineers Vidhya Arvind and Shawn Liu presented their architecture for a centralized data-deletion platform at QCon San Francisco, addressing a critical yet rarely discussed system design challenge. The platform manages deletion across heterogeneous data stores while balancing durability, availability, and correctness. So far, it has processed 76.8 billion row deletions across 1,300 datasets with zero data loss incidents.
Data deletion in distributed systems presents challenges that extend far beyond simple database operations. Engineers face a fundamental dilemma: the fear of accidentally destroying critical information keeps teams cautious, yet failing to delete data can expose them to legal risk under regulations like GDPR, increase storage costs, and erode customer trust. “Deletion can’t be an afterthought,” the presenters emphasized. A major driver of Netflix’s platform is managing test data generated by frequent end-to-end tests run in production. These tests verify system functionality but leave substantial “garbage” data throughout the system.
The complexity deepens when data spans multiple storage engines with different deletion characteristics. Cassandra relies on background compaction, with associated CPU costs and potential spikes; Elasticsearch depends on eventual segment merging, which has a high resource impact; and Redis uses lazy or active expiration. Even efficient deletes can trigger background resource spikes that impact system stability. The platform also addresses data resurrection, where deleted data reappears due to misconfiguration, extended node downtime, or synchronization issues, which the presenters called “the ghost in the machine.”
The Hidden Cost of Deletion
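A minimal sketch of how a platform might abstract over these per-engine differences, assuming a hypothetical adapter interface (the class and method names below are illustrative, not Netflix’s actual API):

```python
from abc import ABC, abstractmethod

class DeletionAdapter(ABC):
    """Hypothetical per-engine adapter; each store pays for deletes differently."""

    @abstractmethod
    def delete_rows(self, dataset: str, keys: list[str]) -> int:
        """Issue deletes for the given keys and return how many were accepted."""

class CassandraAdapter(DeletionAdapter):
    def delete_rows(self, dataset: str, keys: list[str]) -> int:
        # Deletes only write tombstones; space is reclaimed later by background
        # compaction, which costs CPU and can spike under bulk deletions.
        return len(keys)  # placeholder for issuing CQL DELETE statements

class ElasticsearchAdapter(DeletionAdapter):
    def delete_rows(self, dataset: str, keys: list[str]) -> int:
        # Documents are only soft-deleted until segment merging rewrites the
        # affected segments, an I/O- and CPU-heavy background operation.
        return len(keys)  # placeholder for a delete-by-query request

class RedisAdapter(DeletionAdapter):
    def delete_rows(self, dataset: str, keys: list[str]) -> int:
        # Keys expire lazily on access or via periodic active expiration,
        # so memory is not necessarily freed at the moment of deletion.
        return len(keys)  # placeholder for DEL/UNLINK commands
```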
Netflix’s solution centers on three foundational pillars. Durability ensures a deletion eventually takes effect on every copy of the data propagated across distributed systems. Availability keeps systems operational by treating delete operations as low-priority requests and using asynchronous processing to prioritize live traffic. Correctness ensures accurate deletions, even in the presence of race conditions.
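As an illustration of the availability pillar, the sketch below (an assumption about the mechanism, not Netflix’s implementation) uses a two-level priority queue so delete work only runs behind live traffic:

```python
import itertools
import queue
import threading

# Two priority levels: live traffic always sorts ahead of delete batches.
LIVE, DELETE = 0, 1
_seq = itertools.count()              # FIFO tie-breaker within a priority level
work: queue.PriorityQueue = queue.PriorityQueue()

def submit(priority: int, payload: dict) -> None:
    work.put((priority, next(_seq), payload))

def worker() -> None:
    while True:
        priority, _, payload = work.get()
        # Placeholder dispatch; a real system would route to the request
        # path or to the asynchronous delete service.
        label = "live request" if priority == LIVE else "delete batch"
        print(f"handling {label}: {payload}")
        work.task_done()

# The delete batch is enqueued first, but the live request is served first.
submit(DELETE, {"dataset": "e2e_test_garbage", "rows": 1_000})
submit(LIVE, {"member": "abc123", "action": "play"})
threading.Thread(target=worker, daemon=True).start()
work.join()
```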
The platform architecture integrates several components. A control plane triggers workflows, audit jobs identify deletable data across systems, validation jobs verify marked data, and a delete service coordinates removal operations. Journal and recovery services maintain deletion history with timestamps, enabling data recovery within 30 days while preserving data integrity.
The overall architecture of Netflix’s Data Deletion Platform
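The following sketch illustrates how those components might fit together for a single dataset, using an in-memory store and journal purely for illustration (the names and the audit predicate are assumptions, not Netflix’s implementation):

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

RECOVERY_WINDOW = timedelta(days=30)   # deleted data stays recoverable for 30 days

@dataclass
class JournalEntry:
    dataset: str
    key: str
    payload: dict            # snapshot of the deleted row
    deleted_at: datetime

@dataclass
class DeletionWorkflow:
    journal: list = field(default_factory=list)

    def run(self, dataset: str, store: dict) -> int:
        # Audit job: flag rows that look deletable (predicate is illustrative).
        candidates = [k for k, row in store.items() if row.get("expired")]
        # Validation job: re-verify every flagged row before anything is removed.
        validated = [k for k in candidates if store[k].get("expired")]
        # Delete service: journal each row with a timestamp, then remove it.
        now = datetime.now(timezone.utc)
        for key in validated:
            self.journal.append(JournalEntry(dataset, key, store[key], now))
            del store[key]
        return len(validated)

    def recover(self, dataset: str, key: str, store: dict) -> bool:
        # Recovery service: restore a row if it was journaled within the window.
        cutoff = datetime.now(timezone.utc) - RECOVERY_WINDOW
        for entry in reversed(self.journal):
            if (entry.dataset, entry.key) == (dataset, key) and entry.deleted_at >= cutoff:
                store[key] = entry.payload
                return True
        return False

# Example: delete an expired test row, then recover it from the journal.
rows = {"r1": {"expired": True, "v": 1}, "r2": {"expired": False, "v": 2}}
wf = DeletionWorkflow()
assert wf.run("playback_tests", rows) == 1 and "r1" not in rows
assert wf.recover("playback_tests", "r1", rows) and rows["r1"]["v"] == 1
```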
To maintain resilience during bulk deletions, Netflix implemented multiple safeguards. Backpressure mechanisms use resource-utilization metrics to set the deletion speed, slowing operations as database load increases. Rate limiting starts at a low request rate and ramps up gradually as capacity allows, using compaction metrics to throttle operations. Exponential backoff prevents hammering the cluster during failures.
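A simplified sketch of these safeguards, assuming a normalized load signal derived from metrics such as CPU utilization or compaction backlog (the thresholds and multipliers below are illustrative):

```python
import random
import time

class AdaptiveDeleteThrottle:
    """Load-aware pacing for bulk deletes; all parameters are assumptions."""

    def __init__(self, min_rps: float = 10.0, max_rps: float = 500.0):
        self.min_rps = min_rps
        self.max_rps = max_rps
        self.current_rps = min_rps          # start slow, ramp up only when safe

    def adjust(self, load: float) -> None:
        # `load` is a normalized 0..1 signal, e.g. CPU or pending compactions.
        if load > 0.8:
            self.current_rps = max(self.min_rps, self.current_rps * 0.5)   # backpressure
        elif load < 0.5:
            self.current_rps = min(self.max_rps, self.current_rps * 1.2)   # gentle ramp-up

    def pace(self) -> None:
        time.sleep(1.0 / self.current_rps)  # spread deletes evenly over time

def with_backoff(operation, max_attempts: int = 5):
    # Exponential backoff with jitter so retries never hammer a struggling cluster.
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(2 ** attempt + random.random())
```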
Comprehensive monitoring tracks deletion health through key metrics, including deletable record counts, maximum retention overrun, and the ratio of successful to failed deletions. A centralized dashboard provides visibility, helping teams trust the platform to handle their data correctly. The outcomes demonstrate effectiveness: 1,300 datasets under management, zero data loss incidents, 76.8 billion total rows deleted, 125 audit configurations enabled, and daily deletion counts exceeding 3 million.
Netflix Data Deletion Platform: outcomes and daily row deletion count
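A small sketch of how such health metrics could be rolled up for a dashboard; the field names are assumptions based on the metrics described above, and the numbers are illustrative:

```python
from dataclasses import dataclass

@dataclass
class DeletionHealth:
    deletable_records: int             # rows flagged by audits but not yet removed
    max_retention_overrun_days: float  # how far the oldest deletable row exceeds retention
    succeeded: int                     # deletions completed successfully
    failed: int                        # deletions that errored and need retry

    @property
    def success_ratio(self) -> float:
        total = self.succeeded + self.failed
        return self.succeeded / total if total else 1.0

# Example reading that a dashboard could alert on.
health = DeletionHealth(deletable_records=42_000, max_retention_overrun_days=3.5,
                        succeeded=3_000_000, failed=1_200)
assert health.success_ratio > 0.99
```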
Netflix’s key recommendations include continuously auditing for deletion failures, building centralized platforms rather than scattered solutions, deeply understanding storage engine specifics, and aggressively applying resilience techniques such as spread TTL, resource-utilization monitoring, rate limiting, and prioritized load shedding. Most importantly, organizations must build trust through rigorous validation, centralized visibility, and a demonstrated record of reliable data handling.
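Of these techniques, spread TTL is the easiest to illustrate: jittering expirations keeps rows written together from expiring together and triggering a deletion spike. The retention values in the sketch below are assumptions, not Netflix’s settings:

```python
import random

BASE_TTL_SECONDS = 7 * 24 * 3600    # assumed one-week retention for test data
SPREAD_SECONDS = 24 * 3600          # spread expirations across an extra day

def spread_ttl(base: int = BASE_TTL_SECONDS, spread: int = SPREAD_SECONDS) -> int:
    """Return a jittered TTL so co-written rows do not all expire at once."""
    return base + random.randint(0, spread)

# The jittered value is passed wherever the write path sets a TTL,
# e.g. a Cassandra `USING TTL ?` clause or a Redis EXPIRE call.
print(spread_ttl())
```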
The platform treats deletion as a first-class architectural concern that warrants dedicated infrastructure, rather than an operational afterthought. The presenters shared how the system emerged from a traumatic production incident in which a misplaced command during a late-night deployment triggered cascading data loss, creating immense stress and guilt among engineers. “The core resolve was to ensure such a crisis never recurs,” the presenters noted.
