Jun He gave a talk at QCon SF 2024 titled Efficient Incremental Processing with Netflix Maestro and Apache Iceberg. He showed how Netflix used the system to reduce processing time and cost while improving data freshness.
He, Staff Software Engineer in the Big Data Orchestration team at Netflix, began with a discussion of how Netflix uses their data to inform many decisions and features, from recommending movies for users to watch to deciding to terminate production of a show. These insights are sourced from many data pipelines and ML workflows.
With this volume of data and processing come three challenges: data accuracy, data freshness, and cost efficiency. He used the scenario of late-arriving data events to illustrate all three. Because Netflix sources events asynchronously from user devices such as laptops and mobile phones, these events may be ingested or processed much later than the time when the event occurred. Until the event is processed, the data is inaccurate and “stale,” and if the system is designed such that fixing the data requires re-processing large numbers of events, that drives cost inefficiencies.
Incremental processing can address all these challenges, but it has two requirements: capturing incremental state changes and tracking whether a change has been processed. He then began to describe Netflix’s solution to this problem: Incremental Processing Support (IPS). This system is built using two major components: Apache Iceberg and Maestro.
Iceberg is a “high-performance format for huge analytic tables.” For Netflix, using Iceberg simplifies a lot of data management tasks; in particular, it supports change capture with needing to read the data. Overall, Netflix manages “more than one million tables” using Iceberg, with “hundreds of thousands of workflows” operating on the data.
To manage and orchestrate those workflows, Netflix developed Maestro, a “general-purpose workflow orchestrator.” Netflix created Maestro instead of adopting another workflow orchestrator such as Airflow, in part because of their large scale needs. Maestro also integrates well with other Netflix tools such as Metaflow.
The typical change capture interface in IPS is a table, which matches the schema of the “main” table but contains only changed data; this would be used in a SQL query that performs a JOIN with the change capture table to only process changed data. However, not all existing workflows could use this interface, so the team developed two additional primitives: IpCapture and IpCommit. These could be added at the beginning and end, respectively, of an existing workflow, with no other modification required. He walked through several examples of how workflows could adopt IPS with minimal modification and pointed out that some workflows reduced their processing costs to as little as 10% of their original cost.
After launching IPS, the Netflix team noticed three common usage patterns emerge: appending incrementally processed data to a target table; using changed data as a “row-level filter” to reduce transformations; and using the range parameters of the changed data in business logic. When He took questions from the audience after his talk, one question was about the cost efficiency of the different usage patterns. He noted that the first two patterns usually had more improved costs compared to the third.