Netflix has introduced a significant enhancement to its Metaflow machine learning infrastructure: a new Config object that brings powerful configuration management to ML workflows. This addition addresses a common challenge faced by Netflix’s teams, which manage thousands of unique Metaflow flows across diverse ML and AI use cases.
Netflix Metaflow is an open-source data science framework designed to simplify building and managing data-intensive workflows. It allows users to define workflows as directed graphs, making them easy to visualize and iterate on. Metaflow automatically handles scaling, versioning, and deployment of workflows, which are critical aspects of machine learning and data engineering projects. It provides built-in support for tasks like data storage, parameter management, and computation execution, both locally and in the cloud.
The new Config feature represents a fundamental shift in how ML workflows can be configured and managed at Netflix. While Metaflow has always excelled at providing infrastructure for data access, compute resources and workflow orchestration, teams previously lacked a unified way to configure flow behaviour, particularly for decorators and deployment settings.
The Config object joins Metaflow’s existing constructs of artifacts and Parameters, but with a crucial difference in timing. While artifacts are persisted at the end of each task and parameters are resolved at the start of a run, configs are resolved during flow deployment. This timing difference makes configs particularly powerful for setting up deployment-specific configurations.
Configs can be specified using human-readable TOML files, making it easy to manage different aspects of a flow:
```toml
[schedule]
cron = "0 * * * *"

[model]
optimizer = "adam"
learning_rate = 0.5

[resources]
cpu = 1
```
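As a minimal sketch of how such a file might be wired into a flow (the file name myconfig.toml, the tomllib-based parser, and the exact decorator wiring are illustrative assumptions rather than code from the announcement), a Config can be loaded at deployment time and referenced both in step code and in decorators:

```python
# Minimal sketch of using the TOML file above with Metaflow's Config object.
# Assumes Metaflow >= 2.13 and Python >= 3.11 (for tomllib); the file name
# "myconfig.toml" is illustrative.
import tomllib

from metaflow import Config, FlowSpec, resources, step


class ConfiguredTrainingFlow(FlowSpec):
    # Resolved at deployment time, unlike Parameters (resolved at the start
    # of a run) or artifacts (persisted at the end of each task).
    config = Config("config", default="myconfig.toml", parser=tomllib.loads)

    # Because configs are known at deploy time, their values can drive decorators.
    @resources(cpu=config.resources.cpu)
    @step
    def start(self):
        # Inside steps, config values are available as attributes.
        print("optimizer:", self.config.model.optimizer)
        print("learning rate:", self.config.model.learning_rate)
        self.next(self.end)

    @step
    def end(self):
        print("done")


if __name__ == "__main__":
    ConfiguredTrainingFlow()
```

A different configuration file can then be supplied on the command line when the flow is deployed or run, which is what allows variations to be swapped in without touching the flow code.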
Netflix’s internal tool, Metaboost, showcases how powerful this configuration system can be. Metaboost is a unified interface for managing ETL workflows, ML pipelines, and data warehouse tables. The new Config feature allows teams to create different experimental configurations while maintaining the core flow structure.
For example, ML practitioners can easily create variations of their models by simply swapping configuration files, enabling rapid experimentation with different features, hyperparameters, or target metrics. This capability has proven particularly valuable for Netflix’s Content ML team, which works with hundreds of data columns and multiple metrics.
The new Config system offers several advantages:
- Flexible Runtime Configuration: Parameters and Configs can be mixed to balance fixed deployments and runtime configurability.
- Enhanced Validation: Custom parsers can validate configurations, including integration with popular tools like Pydantic (see the sketch after this list).
- Advanced Configuration Management: Support for configuration managers like OmegaConf and Hydra enables sophisticated configuration hierarchies.
- On-the-Fly Configuration Generation: Users can retrieve Configs from an external service or derive them by analyzing the execution environment, such as the current Git branch, to include additional context during runs.
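To make the validation point concrete, here is a minimal sketch of a custom parser that runs the parsed TOML through a Pydantic model before deployment; the ModelSettings schema and validated_toml_parser helper are hypothetical names invented for this illustration, not part of Metaflow:

```python
# Sketch of a custom Config parser that validates values with Pydantic.
# ModelSettings and validated_toml_parser are illustrative names only.
import tomllib

from metaflow import Config, FlowSpec, step
from pydantic import BaseModel, Field


class ModelSettings(BaseModel):
    optimizer: str = "adam"
    learning_rate: float = Field(gt=0.0, le=1.0)


def validated_toml_parser(text: str) -> dict:
    """Parse TOML text and fail early if the [model] section is invalid."""
    raw = tomllib.loads(text)
    ModelSettings(**raw.get("model", {}))  # raises pydantic.ValidationError on bad values
    return raw


class ValidatedFlow(FlowSpec):
    config = Config("config", default="myconfig.toml", parser=validated_toml_parser)

    @step
    def start(self):
        print("training with lr =", self.config.model.learning_rate)
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    ValidatedFlow()
```

Because configs are resolved at deployment time, an out-of-range learning rate would be rejected before any compute is scheduled rather than partway through a run.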
This enhancement represents a significant step forward in Metaflow’s evolution as a machine learning infrastructure platform. By providing a more structured way to manage configurations, Netflix has made it easier for teams to maintain and scale their ML workflows while adhering to their specific development practices and business goals.
The feature is now available in Metaflow 2.13, and users can start implementing it in their workflows immediately.
Several tools similar to Netflix Metaflow are designed to help data scientists and engineers manage workflows, orchestrate pipelines, and build scalable machine learning or data-driven systems. Each of these tools caters to slightly different needs and priorities, but they all aim to simplify complex workflows and scale data operations. Here are some noteworthy examples:
- Apache Airflow: A widely used, open-source platform for orchestrating workflows. It allows users to define tasks and their dependencies as Directed Acyclic Graphs (DAGs). While Metaflow focuses on data science pipelines, Airflow is more general-purpose and excels in managing workflows across different domains.
- Luigi (Spotify): An open-source Python framework designed to build complex pipelines. Like Metaflow, Luigi handles dependencies, workflow orchestration, and task management, but it is less focused on machine learning-specific needs.
- Kubeflow: A machine learning toolkit for Kubernetes. It specializes in managing ML workflows and deploying models in production, making it a strong choice for Kubernetes-based environments.
- MLflow: An open-source platform that manages the ML lifecycle, including experiment tracking, reproducibility, deployment, and monitoring. MLflow has strong support for model versioning and deployment but lacks the broader workflow orchestration capabilities of Metaflow.
- Argo Workflows: A Kubernetes-native workflow engine designed to run complex workflows on containerized infrastructure. It’s ideal for teams already using Kubernetes and looking for a lightweight solution.
While these tools overlap in some functionality, Metaflow stands out with its simplicity, scalability, and built-in support for machine learning workflows, making it particularly attractive to data science teams.