At QCon SF 2024, David Berg and Romain Cledat gave a talk about how Netflix uses Metaflow, an open-source framework, to support a variety of ML systems. The pair gave an overview of Metaflow’s design principles and illustrated several of Netflix’s use cases, including media processing, content demand modeling, and meta-models for explaining models.
Berg and Cledat, both Senior Software Engineers at Netflix, began with several design principles for Metaflow. The goal is to accelerate ML model development in Python by minimizing the developer’s cognitive load. The Metaflow team identified several effects that they wished to minimize: the House of Cards effect, where the underlying layers of a framework are “shaky” instead of a solid foundation; the Puzzle effect, where the composable modules have unique or unintuitive interfaces; and the Waterbed effect, where the system has a fixed amount of complexity that “pops up” in one spot when pushed down elsewhere.
Cledat gave an overview of the project’s history. Metaflow began in 2017 as an internal project at Netflix; in 2019, it was open-sourced, although Netflix continued to maintain its own internal version. In 2021, a group of former Netflix employees founded a startup, Outerbounds, to maintain and support the open-source project. That same year, Netflix’s internal version and the open-source version were refactored to share a common “core.”
The key idea of Metaflow is to express computation as a directed acyclic graph (DAG) of steps. Everything is expressed using Python code that “any Python developer would be comfortable coding” instead of using a DSL. The DAG can be executed locally on a developer’s machine or in a production cluster without modification, and each execution of the flow, or “run,” can be tagged and persisted for collaboration.
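For readers unfamiliar with the framework, the sketch below shows what a minimal flow looks like; the class and step names are illustrative, not taken from the talk. Steps are plain Python methods decorated with @step, and the DAG edges are declared with self.next; the same file can be run locally with `python hello_flow.py run` or deployed unchanged to a cluster.

```python
# A minimal Metaflow flow: each @step method is a node in the DAG,
# and self.next() declares the edges. HelloFlow is a hypothetical name.
from metaflow import FlowSpec, step


class HelloFlow(FlowSpec):

    @step
    def start(self):
        # Values assigned to self are persisted as artifacts of the run,
        # which is what makes runs taggable and inspectable later.
        self.message = "hello from Metaflow"
        self.next(self.end)

    @step
    def end(self):
        print(self.message)


if __name__ == "__main__":
    HelloFlow()
```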
Berg gave several examples of ML tasks that Netflix developers have tackled with Metaflow. Content demand modeling tries to predict user demand for a video “across the entire life cycle of the content.” This involves multiple data sources and models, and it leverages Metaflow’s ability to orchestrate multiple flow DAGs; in particular, it uses a feature where flows can signal other flows, for example when one flow completes, as sketched below.
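One way to express this kind of flow-to-flow signaling in open-source Metaflow is the @trigger_on_finish decorator, which takes effect when flows are deployed to a production orchestrator such as Argo Workflows; the flow names here are hypothetical, and the talk did not specify which mechanism Netflix uses internally.

```python
# Sketch: a downstream flow that starts automatically whenever an
# upstream flow completes. DemandFeaturesFlow is a hypothetical name.
from metaflow import FlowSpec, step, trigger_on_finish


@trigger_on_finish(flow="DemandFeaturesFlow")
class DemandModelFlow(FlowSpec):

    @step
    def start(self):
        # Runs whenever a deployed DemandFeaturesFlow run finishes.
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    DemandModelFlow()
```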
Another use case is meta-modeling, which trains a model to explain the results of other models. This relies on Metaflow’s ability to support reproducible environments. Metaflow packages all the dependencies needed to run a flow so that developers can perform repeatable experiments. When training a meta-model, this may require loading several environments, as the meta-model may have different dependencies from the explained model.
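The sketch below shows one way to get this kind of environment isolation with Metaflow’s @conda step decorator; the library versions are arbitrary examples. Running the flow with `python meta_flow.py --environment=conda run` lets each step resolve its own packaged environment.

```python
# Sketch: per-step dependency isolation. Each @conda decorator pins the
# libraries for that step only, so the meta-model can be trained with
# different dependencies than the model it explains.
from metaflow import FlowSpec, conda, step


class MetaModelFlow(FlowSpec):

    @conda(libraries={"scikit-learn": "1.2.2"})
    @step
    def start(self):
        # Environment matching the explained model's dependencies.
        self.next(self.train_meta_model)

    @conda(libraries={"xgboost": "1.7.6"})
    @step
    def train_meta_model(self):
        # A separate environment for the meta-model.
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    MetaModelFlow()
```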
The presenters concluded their talk by answering questions from the audience. Track host Thomas Betts asked the first question: he noted that the code for a flow DAG can carry annotations specifying the compute resources needed to execute it, yet the presenters had said the same DAG could also run on a single machine, and he wondered whether those annotations were ignored in that case. The presenters confirmed that they were.
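The behavior Betts asked about corresponds to Metaflow’s @resources decorator: the requests are honored by a remote scheduler (for example, when running with `--with batch` or `--with kubernetes`) and simply ignored for local runs. A minimal sketch:

```python
# Sketch: @resources declares what a step needs on a cluster; a plain
# local run executes the same DAG and ignores these annotations.
from metaflow import FlowSpec, resources, step


class ResourceDemoFlow(FlowSpec):

    @resources(cpu=8, memory=32000)  # 8 CPUs, ~32 GB when run remotely
    @step
    def start(self):
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    ResourceDemoFlow()
```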
Another attendee asked how these resource specifications were tuned, especially in the case of over-provisioning. Berg said that the framework can surface “hints” about resource use. He also said there was some research underway on auto-tuning the resources, but that not everything could be abstracted away.