When designing the MLOps stack for our project, we needed a solution that allowed for a high degree of customization and the flexibility to evolve as our experimentation dictated. We considered large platforms that encompassed many functions, but found them limiting in some key areas. Ultimately we decided on an approach that uses separate specialized tools for labeling, data versioning, and continuous integration. This article documents our experience building this custom MLOps approach.
The classic problem with using Jupyter for development is that moving from prototype to production requires copying and pasting code from a notebook into a Python module. NBDEV automates the transition between notebook and module, which lets the Jupyter notebook be an official part of a production pipeline. NBDEV allows the developer to state which module a notebook should create, which notebook cells to push to the module, and which notebook cells are tests. A key capability of NBDEV is its approach to testing within the notebooks, and the NBDEV template even provides a base Github Action to implement testing in the CI/CD framework. The resulting Python module requires no editing by the developer and can easily be integrated into other notebooks or the project at large using Python's built-in import functionality.
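As a rough illustration of the cell roles described above, here is what the contents of three notebook cells might look like. The directive comments follow nbdev 2 syntax; older nbdev releases use a slightly different comment form, so check the version you have installed. The module name labels and the function normalize_label are hypothetical and not taken from our project.

```python
# --- Cell 1: name the module this notebook should generate (hypothetical name).
#| default_exp labels

# --- Cell 2: marked for export, so it is written to the generated module.
#| export
def normalize_label(raw: str) -> str:
    """Lowercase a label and strip surrounding whitespace."""
    return raw.strip().lower()

# --- Cell 3: no export directive, so it stays in the notebook and acts as a test.
assert normalize_label("  Cat ") == "cat"
```

Because the directives are ordinary comments, the cells remain plain Python and can be run interactively while prototyping.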
The files used in machine learning pipelines are often large archives of binary or compressed files, which are impractical or cost-prohibitive to store in existing version control solutions like git. DVC solves data versioning by representing each large dataset as a hash of its file contents, which lets DVC track changes. It works similarly to git (e.g. dvc push). When you run dvc add on your dataset, the dataset gets added to the .gitignore and is tracked for changes by dvc.

CML is a project that provides functionality for publishing model artifacts from Github Actions workflows into comments attached to Github Issues, Pull Requests, etc. That is important because it helps us start to fill in the gaps in Pull Requests by accounting for training data changes and the resulting model accuracy and effectiveness.
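The content-hash idea behind DVC's change tracking can be sketched in a few lines. This is not DVC's implementation, only a minimal illustration that assumes an MD5 digest of the file contents; the file name images.tar and its contents are placeholders.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives never load fully into memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_changed(path: Path, recorded_hash: str) -> bool:
    """Compare the current content hash with the one recorded earlier."""
    return file_md5(path) != recorded_hash


with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "images.tar"      # hypothetical dataset file
    dataset.write_bytes(b"version 1 of the data")
    recorded = file_md5(dataset)            # the small value kept under version control
    unchanged = has_changed(dataset, recorded)
    dataset.write_bytes(b"version 2 of the data")
    changed = has_changed(dataset, recorded)
```

Only the short hash needs to live in git; the large file itself can sit in separate storage, which is what keeps the repository small.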
We want automated code testing, including building models in the automated testing pipeline. Github Actions competes with CircleCI, Travis, and Jenkins in automating testing around code pushes, commits, pull requests, etc. Since we’re already using Github to host our repos, we avoid another third-party app by using Actions. In this project we need to use Github self-hosted runners to run jobs on an on-prem GPU cluster.
Label Studio is a solution for labeling data. It works well and is flexible enough to run in a variety of environments. We cover how we use Label Studio in a separate deep dive.
The setup is designed to deploy models faster. That means more data scientists working harmoniously in parallel, transparency in the repository, and faster onboarding for new people. The goal is to standardize the types of activities that data scientists need to do in a project and provide clear instructions for them.
The following is a list of tasks we want to streamline with this system design:
Below is a description of the pipeline for each task.
This pipeline implements automated testing feedback for each pull request, covering syntax checks as well as unit, regression and integration tests. The outcome of this process is a functionally tested Docker image pushed to our private repository. This process maximizes the likelihood that the latest, best code is in a fully tested image available in the repository for downstream tasks. Here’s how the developer lifecycle works in the context of a new feature:
Label Studio currently lacks event hooks that would enable updates whenever the stored label data changes, so we take a cron-triggered approach, updating the dataset every hour. Additionally, while the Label Studio training dataset is small enough, the updates can be done as part of the training pipeline as well. We can also trigger the data pipeline refresh on demand using the Github Actions interface.
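A scheduled refresh ultimately boils down to one authenticated export call. The sketch below is an assumption-laden illustration rather than our actual export script: the endpoint path and the Token authorization header reflect the Label Studio REST API as we understand it and should be verified against your installed version, and the host, project id and token values are placeholders.

```python
import urllib.parse
import urllib.request


def build_export_request(host: str, project_id: int, token: str) -> urllib.request.Request:
    """Build the HTTP request for a JSON export of one project (assumed endpoint)."""
    query = urllib.parse.urlencode({"exportType": "JSON"})
    url = f"{host.rstrip('/')}/api/projects/{project_id}/export?{query}"
    return urllib.request.Request(url, headers={"Authorization": f"Token {token}"})


def export_labels(host: str, project_id: int, token: str, out_fp: str) -> None:
    """Download the export and write it to out_fp; needs a reachable server."""
    request = build_export_request(host, project_id, token)
    with urllib.request.urlopen(request, timeout=60) as response:
        payload = response.read()
    with open(out_fp, "wb") as fh:
        fh.write(payload)


# Placeholder values; building the request does not touch the network.
request = build_export_request("http://localhost:8080", 1, "example-token")
```

Keeping the request construction separate from the download makes the URL logic easy to test without a running Label Studio instance.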
The modeling pipeline integrates model training into the CI/CD pipeline for the repository. Each pull request is evaluated not only against the syntax, unit, integration and regression tests configured on the codebase, but also receives feedback that includes an evaluation of the newly trained model.
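One way to produce that model feedback is to have the training job write a small Markdown report, which a CML step in the workflow can then post as a pull request comment. The following is a minimal sketch; the metric names and values are made-up placeholders, and render_report is a hypothetical helper, not part of CML.

```python
def render_report(baseline: dict, candidate: dict) -> str:
    """Render a Markdown table comparing candidate metrics with a baseline."""
    lines = [
        "## Model evaluation",
        "",
        "| Metric | Baseline | This PR | Change |",
        "|---|---|---|---|",
    ]
    for name in sorted(candidate):
        new = candidate[name]
        old = baseline.get(name)
        if old is None:
            # Metric did not exist on the baseline, so no delta can be shown.
            lines.append(f"| {name} | n/a | {new:.4f} | n/a |")
        else:
            lines.append(f"| {name} | {old:.4f} | {new:.4f} | {new - old:+.4f} |")
    return "\n".join(lines) + "\n"


# Placeholder numbers purely for illustration.
report = render_report({"accuracy": 0.9}, {"accuracy": 0.92, "f1": 0.88})
```

Writing the report as plain Markdown keeps the training code independent of the CI system; only the workflow step needs to know how the comment gets published.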
The benchmarking pipeline forms an “official submission” process to ensure all modeling activities are measured against the metrics of the project.
Here is the DAG definition file that is used by DVC. It captures the workflow steps and their inputs, and allows for reproducibility across users and machines.
    stages:
      labelstudio_export_trad:
        cmd: python pipelines/1_labelstudio_export.py --config_fp pipelines/traditional_pipeline.yaml --ls_token *** --proj_root "."
        params:
        - pipelines/traditional_pipeline.yaml:
          - src.host
          - src.out_fp
          - src.proj_id
      dataset_create_trad:
        cmd: python pipelines/2_labelstudio_todataset.py --config_fp pipelines/create_traditional.yaml --proj_root "."
        deps:
        - data/raw_labels/traditional.json
        params:
        - pipelines/create_traditional.yaml:
          - dataset.bmdata_fp
          - dataset.labels_map
          - dataset.out_fp
          - dataset.rawdata_dir
      train_model_trad:
        cmd: python pipelines/3_train_model.py --config_fp pipelines/model_params.yaml --proj_root "."
        deps:
        - data/traditional_labeling
        params:
        - pipelines/model_params.yaml:
          - dataloader.bs
          - dataloader.size
          - dataloader.train_fp
          - dataloader.valid_fp
          - learner.backbone
          - learner.data_dir
          - learner.in_checkpoint
          - learner.metrics
          - learner.n_out
          - learner.wandb_project_name
          - train.cycles
      labelstudio_export_bench:
        cmd: python pipelines/1_labelstudio_export.py --config_fp pipelines/benchmark_pipeline.yaml --ls_token *** --proj_root "."
        params:
        - pipelines/benchmark_pipeline.yaml:
          - src.host
          - src.out_fp
          - src.proj_id
      dataset_create_bench:
        cmd: python pipelines/2_labelstudio_todataset.py --config_fp pipelines/create_benchmark.yaml --proj_root "."
        deps:
        - data/raw_labels/benchmark.json
        params:
        - pipelines/create_benchmark.yaml:
          - dataset.bmdata_fp
          - dataset.labels_map
          - dataset.out_fp
          - dataset.rawdata_dir
      eval_model_trad:
        cmd: python pipelines/4_eval_model.py --config_fp pipelines/bench_eval.yaml --proj_root "."
        deps:
        - data/models/best-model.pth
        params:
        - pipelines/bench_eval.yaml:
          - eval.bench_fp
          - eval.label_config
          - eval.metrics_fp
          - eval.model_conf
          - eval.overlay_dir
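The params entries in each stage are what let DVC notice that a configuration value has changed and that the stage should be re-run. The sketch below illustrates that idea only; it is not how DVC is implemented. The dotted keys are borrowed from the train_model_trad stage, while the values and the helper names lookup and changed_params are hypothetical.

```python
from typing import Any


def lookup(config: dict, dotted_key: str) -> Any:
    """Resolve a dotted key such as 'dataloader.bs' inside a nested dict."""
    value: Any = config
    for part in dotted_key.split("."):
        value = value[part]
    return value


def changed_params(config: dict, tracked: list, snapshot: dict) -> list:
    """Return the tracked keys whose current value differs from the snapshot."""
    return [key for key in tracked if lookup(config, key) != snapshot.get(key)]


tracked = ["dataloader.bs", "dataloader.size", "train.cycles"]
# Placeholder values; the real ones live in pipelines/model_params.yaml.
config = {"dataloader": {"bs": 32, "size": 224}, "train": {"cycles": 5}}

# Record the values used by the last successful run.
snapshot = {key: lookup(config, key) for key in tracked}

# Someone edits the batch size; only that key is reported as stale.
config["dataloader"]["bs"] = 64
stale = changed_params(config, tracked, snapshot)
```

A stage with no stale parameters and no changed dependencies can be skipped, which is what makes repeated pipeline runs cheap.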
Ultimately, it took one week to complete the implementation of these tools for managing our code with Github Actions, Iterative.ai tools (DVC & CML) and NBDEV. This provides us with the following capabilities:
Aaron Soellinger has formerly worked as a data scientist and software engineer solving problems in finance, predictive maintenance and sports. He currently works as a machine learning systems consultant with Hoplabs working on a multi-camera computer vision application.
Will Kunz is a back end software developer, bringing a can-do attitude and dogged determination to challenges. It doesn’t matter if it’s tracking down an elusive bug or adapting quickly to a new technology. If there’s a solution, Will wants to find it.
Original. Reposted with permission.