A common problem associated with data science projects is version control (not of the project code, but of the data). The DVC (Data Version Control) tool can be used to attach version descriptors to data sets. These can be checked into Git like the rest of the code to keep versions of data and code consistent.
DVC can track almost any type of data set, as long as it can be represented as a file. It doesn’t matter whether the data is stored locally or in a remote storage service. The basic concept: you use a “pipeline” to describe how data and models are managed and used.
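To make the idea of a version descriptor concrete, here is a minimal sketch in plain Python. It is not DVC’s implementation or file format: the function names (`write_pointer`, `matches_pointer`), the `.pointer.json` suffix, and the choice of SHA-256 are assumptions made purely for illustration. The point is only that a small text file containing a content hash can be committed to Git in place of the large data file, so that a given commit pins a given version of the data.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large data sets do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(data_path):
    """Write a small descriptor next to the data file (hypothetical format)."""
    data_path = Path(data_path)
    pointer = {
        "path": data_path.name,
        "sha256": file_sha256(data_path),
        "size": data_path.stat().st_size,
    }
    pointer_path = data_path.with_name(data_path.name + ".pointer.json")
    pointer_path.write_text(json.dumps(pointer, indent=2) + "\n")
    return pointer_path


def matches_pointer(pointer_path):
    """Return True if the data file still has the content the descriptor recorded."""
    pointer_path = Path(pointer_path)
    pointer = json.loads(pointer_path.read_text())
    data_path = pointer_path.with_name(pointer["path"])
    return data_path.exists() and file_sha256(data_path) == pointer["sha256"]
```

In this sketch, only the small descriptor file would go into Git. If someone later changes the data file without updating the descriptor, `matches_pointer` returns `False`, which is the kind of mismatch between code and data that versioning is meant to expose.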
However, DVC can do more than just version data along with code. For example, the tool can also act as:
- a fast data cache for remotely hosted data,
- a way to track experiments performed on the data, and
- a registry or catalog for machine learning models built with the data.
Visual Studio Code users can integrate DVC workflows into their editor via the corresponding extension.
Because it is expensive and time-consuming to create clean, correctly labeled data, high-quality data sets for machine learning purposes are in short supply. Sometimes data scientists have no choice but to work with raw data or inconsistent information. The Cleanlab tool was developed for this scenario.
This Python data tool uses machine learning models to analyze data sets that are unlabeled or poorly labeled. In other words, you train a model on the original data set, use Cleanlab to find out what needs to be improved in that data set, and then retrain the model on the automatically cleaned and adjusted data set.
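The idea of using a model’s own predictions to flag doubtful labels can be sketched in a few lines of plain Python. This is a simplified illustration, not Cleanlab’s API or its exact algorithm: the function name `find_suspect_labels` and the thresholding rule (the average confidence per class) are assumptions chosen for this example. The function takes the given labels and the class probabilities predicted by any classifier, and returns the indices of examples where the model is confident about a class other than the given label.

```python
def find_suspect_labels(labels, pred_probs):
    """Return indices of examples whose given label looks doubtful.

    labels:     list of integer class labels, one per example
    pred_probs: list of per-class probability lists, one per example
    """
    n_classes = len(pred_probs[0])

    # Per-class threshold: the model's average confidence in class k,
    # measured over the examples that are currently labeled k.
    thresholds = []
    for k in range(n_classes):
        probs_k = [p[k] for p, y in zip(pred_probs, labels) if y == k]
        thresholds.append(sum(probs_k) / len(probs_k) if probs_k else 1.0)

    suspects = []
    for index, (given, probs) in enumerate(zip(labels, pred_probs)):
        # Classes the model is "confident" about for this example.
        confident = [k for k in range(n_classes) if probs[k] >= thresholds[k]]
        if confident:
            best = max(confident, key=lambda k: probs[k])
            if best != given:
                suspects.append(index)
    return suspects
```

The flagged examples would then be reviewed, relabeled, or removed before the model is retrained. Because the function only needs labels and predicted probabilities, it does not care which framework produced them.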
Cleanlab works independently of specific models and frameworks. So it doesn’t matter whether you use PyTorch, OpenAI, Scikit-learn or TensorFlow: Cleanlab works with any classifier. The tool also has specific workflows for common tasks such as:
- token classification,
- multi-labeling,
- regression,
- image segmentation, or even
- object and outlier detection.
The best way to get an idea of how the process works and what results can be expected is to look at the various examples available for the tool.
Data science workflows are difficult to set up, and setting them up in a consistent and predictable way is even more difficult. Snakemake was developed to automate this process and set up data analysis workflows so that everyone involved gets the same results. As a rule of thumb, the more moving parts your data science workflow contains, the greater the likelihood that you will benefit from automating it with Snakemake.
Snakemake workflows are similar to GNU Make workflows: they define the steps of the workflow with rules. These determine what each step takes as input, what it produces as output, and which commands must be executed. The workflow rules can be multithreaded, and configuration data can be imported via JSON or YAML files. You can also define functions in your workflows to transform the data used in the rules, and log the actions taken at each step.
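The Make-style idea behind such rules can be sketched in plain Python. This is not Snakemake’s syntax or API: the `Rule` class and the `is_stale` and `build` functions are hypothetical names introduced for illustration, and comparing file modification times is an assumption borrowed from how Make-like tools commonly decide what to rerun. Each rule names its input files, its output files, and an action; the runner works backwards from the requested target and logs every step it executes.

```python
import os
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    """One workflow step: input files, output files, and the action between them."""
    name: str
    inputs: List[str]
    outputs: List[str]
    action: Callable[[List[str], List[str]], None]


def is_stale(rule):
    """A rule must run if an output is missing or older than any of its inputs."""
    if not all(os.path.exists(path) for path in rule.outputs):
        return True
    if not rule.inputs:
        return False
    newest_input = max(os.path.getmtime(path) for path in rule.inputs)
    oldest_output = min(os.path.getmtime(path) for path in rule.outputs)
    return newest_input > oldest_output


def build(target, rules, log):
    """Bring `target` up to date, running the rules it depends on first."""
    producer = next((rule for rule in rules if target in rule.outputs), None)
    if producer is None:
        # No rule makes this file, so it has to be a source file that already exists.
        if not os.path.exists(target):
            raise FileNotFoundError(f"no rule produces {target!r} and it does not exist")
        return
    for dependency in producer.inputs:
        build(dependency, rules, log)
    if is_stale(producer):
        producer.action(producer.inputs, producer.outputs)
        log.append(producer.name)  # record each executed step
```

Calling `build` a second time with unchanged inputs runs nothing, which is what makes rule-based workflows repeatable: the same inputs and rules lead to the same sequence of steps for everyone who runs them.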
Snakemake jobs are also portable: they can be deployed in Kubernetes-managed environments as well as in certain cloud environments. In addition:
- workflows can be “frozen” to use a specific set of packages,
- unit tests can be automatically generated and saved for successfully executed workflows, and
- workflows can be stored as a tarball for long-term archiving.
(fm)
This article originally appeared at our sister publication Infoworld.com.
