Submitted by angkhandelwal749 t3_10lxwgd in MachineLearning

Versioning and collaboration on code is a reasonably solved problem for software engineers thanks to Git and GitHub, since the task at hand mostly involves maintaining different versions of plain code. ML engineers, on the other hand, face the much larger task of maintaining versions of not just code but also hyperparameters, data, models, data lineage, and labels, and storing all of this on GitHub currently does not let you track the changes to each of these well. What software or open-source tools are currently used for this? Is there space for a new company to be built here?
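To make the tracking problem concrete, here is a minimal sketch using only the Python standard library. It is not how any particular tool works; the function name `fingerprint` and all values are hypothetical, chosen only to illustrate that a run is identified by its data, hyperparameters, and code version together rather than by code alone.

```python
import hashlib
import json


def fingerprint(data: bytes, hyperparams: dict, code_version: str) -> str:
    """Return a short, deterministic ID for one (data, hyperparameters, code) combination."""
    record = {
        "data_sha256": hashlib.sha256(data).hexdigest(),
        "hyperparams": hyperparams,
        "code_version": code_version,
    }
    # sort_keys makes the serialization independent of dict insertion order
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


# Hypothetical values, for illustration only
run_a = fingerprint(b"x,y\n1,2\n", {"lr": 0.01, "epochs": 5}, "abc1234")
run_b = fingerprint(b"x,y\n1,2\n", {"epochs": 5, "lr": 0.01}, "abc1234")
run_c = fingerprint(b"x,y\n1,3\n", {"lr": 0.01, "epochs": 5}, "abc1234")

print(run_a == run_b)  # True: same inputs, key order does not matter
print(run_a == run_c)  # False: one changed label yields a new ID
```

A Git commit hash on its own only covers the `code_version` field here, which is roughly the gap the question is about.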

18

Comments


Delicious-View-8688 t1_j60s6lt wrote

git for versioning code

dvc for versioning data (and other ML things)

mlflow for managing ML pipelines (overlaps with some parts of dvc)

conda for environment management (yes, it can be slow...)

13

ikkeweer t1_j60ych9 wrote

Try mamba if you struggle with conda being slow; it's a drop-in replacement.

11

Delicious-View-8688 t1_j60ykt5 wrote

Yes, it's true. I have not tried using mamba with mlflow - maybe it integrates, maybe it doesn't. MLflow docs at the time of my reading indicated it works with conda or docker only.

2

angkhandelwal749 OP t1_j62uk80 wrote

Is there no platform which does it all? Would WandB cut it?

1

metric_logger t1_j61xa7a wrote

Comet.ml does everything you listed! Free for individuals!

2

Vivid-Ad6077 t1_j62n4k1 wrote

https://wandb.ai/site - Weights & Biases does everything you listed, from versioning code, datasets, and models to visualizing experiments, managing hyperparameters, and even running hyperparameter search. It can be used to fully reproduce and recreate the entire state of your ML workflow. It's free for individuals and academics.

2

conv3d t1_j62o8yn wrote

I can’t believe nobody has mentioned MLflow

2

curiousshortguy t1_j617zzd wrote

The keyword you want is MLOps, the counterpart to DevOps, where GitHub plays the role of code storage. Within MLOps, look for data and model management and versioning. Quite a number of companies offer various aspects of that; see for example this random infographic: https://adataanalyst.com/wp-content/uploads/2021/05/Infra-Tooling3.png

1

angkhandelwal749 OP t1_j62urqo wrote

>https://adataanalyst.com/wp-content/uploads/2021/05/Infra-Tooling3.png

Understood! Thanks so much for that. I also wanted to understand the core thinking process of an ML engineer: what do they prioritise when choosing a tool? User experience or service? Lots of features, or just a few quality features done well?

1

Dry-Tomatillo449 t1_j6mnkh3 wrote

GitLab is an alternative to GitHub for hosting ML projects and code, with an open-source Community Edition and a free tier. It's used by many organizations for software development, data analysis, and machine learning. It offers a wide range of features, including an integrated CI/CD pipeline, version control, issue tracking, and project management. GitLab also supports Jupyter Notebooks and data science projects.

1