Hi r/machinelearning!

A few months ago I quit my job to join my partners to make training open-source models much faster and easier for engineers.

We're building Rubbrband. It's a web app that takes any ML repo off of GitHub, and gives you a Terminal and Jupyter Notebook in browser with dependencies and GPUs automatically set up.

Why did we build this?

My co-founders and I have been working on this because, as researchers, we found the dependency setup process super tedious and draining.

What's included?

- Automatic dependency setup for any GitHub Python repo

- Integrated Terminal and Notebooks

- A server with an Nvidia GPU

- Code explanations for functions

- Our pricing is simple at $75/month for 3 repos running at a time. First week is free.

I'd love to get your feedback on:

Does the value we provide resonate with you? Would you try it out?
Does dependency and environment setup take up a large chunk of your time?

We're currently working on acquiring more GPUs to onboard more users, but if you'd like access to the product please let me know.

Thank you very much in advance!

Comments

JackBlemming t1_j3997nh wrote on January 6, 2023 at 10:30 PM

#1,298,672

Couple thoughts:

1. Setting up an environment is typically harder than cloning the repo and running pip install on the requirements.txt file. Many Python packages require Linux packages to have been installed beforehand. Your service should ideally take care of this for me. Some obvious examples are opencv, cuda/gpu drivers, mysqlclients etc.
2. Dataset management is the most annoying part of machine learning for me, not setting up environments, which is typically a Dockerfile or docker-compose file, and maybe one shell script to bootstrap everything. By dataset management I mean letting my models access the dataset quickly, updating the dataset, etc. Ideally your service should make it easy to upload data to your dataset and then make it accessible to the training code. This is assuming you want to allow people to train models on the service.
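
To make the first point concrete, here is a minimal sketch of mapping pip requirements to the system packages they tend to need. The `SYSTEM_DEPS` table and the `apt_packages_for` helper are hypothetical names made up for illustration, and the apt package names are assumptions for a Debian/Ubuntu image, not a vetted list or anything the service is known to use.

```python
import re

# Hypothetical, hand-maintained mapping from pip package names to the
# Debian/Ubuntu packages they typically need. Entries are illustrative assumptions.
SYSTEM_DEPS = {
    "opencv-python": ["libgl1", "libglib2.0-0"],
    "mysqlclient": ["default-libmysqlclient-dev", "build-essential"],
}


def apt_packages_for(requirements_text):
    """Return a sorted list of apt packages implied by a requirements.txt body."""
    needed = set()
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options like -r
            continue
        # The package name is everything before a version specifier, extras, or marker.
        name = re.split(r"[<>=!~\[;\s]", line, maxsplit=1)[0].lower()
        needed.update(SYSTEM_DEPS.get(name, []))
    return sorted(needed)


reqs = "numpy>=1.21\nopencv-python==4.7.0.68  # vision\nmysqlclient\n"
print("apt-get install -y " + " ".join(apt_packages_for(reqs)))
```

A lookup table like this only covers packages someone has already catalogued, so obscure or older repos would still need a manual fallback.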

jrmylee OP t1_j39ara0 wrote on January 6, 2023 at 10:40 PM

#1,298,733

Replying to JackBlemming (#1,298,672)

1. Great point, we have this covered. We intelligently install apt dependencies alongside pip dependencies. CUDA drivers are also all installed properly.
2. This makes sense. If I understand you correctly, the difficult part is uploading/managing the dataset on the server easily, plus writing data loaders to feed into the model?

JackBlemming t1_j39baqq wrote on January 6, 2023 at 10:44 PM

#1,298,752

Replying to jrmylee (#1,298,733)

Per 2. yes, exactly right. Some of my datasets are millions of images with metadata. As you can imagine, uploading and consuming data of this magnitude is slow and tedious, as is integrating it with the remote machine actually running the training script.

jrmylee OP t1_j39dpr0 wrote on January 6, 2023 at 10:59 PM

#1,298,826

Replying to JackBlemming (#1,298,752)

Got it, appreciate the feedback!

SpecialistEstimate12 t1_j39dsz4 wrote on January 6, 2023 at 11:00 PM

#1,298,830

Sent you a msg

RuairiSpain t1_j39q68d wrote on January 7, 2023 at 12:24 AM

#1,299,239

I like the idea. I work for a large enterprise on their ML platform team, providing similar services internally to all DEV, ML and analytics teams. I think there is a business in it. It is a competitive space, but the acquisition potential is great (to be bought over and merged into a larger org).

I suggest you check out https://www.gitpod.io, which does more general provisioning of GitOps clusters/Pods in their managed Kubernetes clusters. It's not specifically ML, but we've looked at it for POC ML projects that want basic hosting.

Also check out: https://github.com/ml-tooling/ml-workspace, it's a nice open source project with lots of packages ready to use.

And JupyterLab's offering, they'll be your main competition on pricing.

You are going to have a headache with Python version compatibility between your base dependencies, the ones used on GitHub, and the ones needed by Jupyter Notebooks. Same with CUDA drivers; I suggest you lock down the AWS node instance types, so it's less confusing for end users.

If you are turning it into a business, I'd recommend you have a tiered approach based on the size of the ML project. Simple POC ML projects with a tiny dataset are a good starting point for most people. But then people want data ingest, cleaning, ETL to Big Data and enterprise sources; this gets complex fast (and is where most teams waste time and money). Either keep your focus on POCs, and grow it as an ML hosting company for SMEs; or embrace the ETL side and simplify data ingest for larger enterprise companies. The second option is more of a consulting business but you can charge high fees.

ML ETL space: https://www.ycombinator.com/companies/dagworks-inc

https://www.gathr.one/plans-pricing/

https://www.snowflake.com/en/

Of these 3 ETL companies, I've played with Snowflake and like what they do and their direction. Especially like they acquired https://streamlit.io/ which is a fun way to deploy Python apps without dealing with infrastructure and devOps tasks.

My final comment: include data ingest and ETL in your story to customers. ML training and deploying training pipelines is not where DS people spend their time; 80% is spent on data collection, reshaping and validation.

FYI, I think you'll burn through $75 very quickly for an Nvidia GPU. I presume you are running these at on-demand and not spot prices. That monthly price seems generous for an average ML training pipeline.
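
As a rough back-of-the-envelope check on that last point, the sketch below computes how many GPU hours a flat subscription pays for. The hourly rates are placeholder assumptions, not quoted prices from any provider, and `gpu_hours_covered` is a hypothetical helper; plug in real on-demand or spot rates to get a meaningful number.

```python
def gpu_hours_covered(monthly_price, hourly_rate):
    """Hours of GPU time a flat monthly price pays for at a given hourly rate."""
    if hourly_rate <= 0:
        raise ValueError("hourly_rate must be positive")
    return monthly_price / hourly_rate


MONTHLY_PRICE = 75.0  # from the post: $75/month

# Placeholder rates in $/hour (assumptions for illustration only).
for label, rate in [("hypothetical on-demand", 1.00), ("hypothetical spot", 0.40)]:
    hours = gpu_hours_covered(MONTHLY_PRICE, rate)
    print(f"{label}: {hours:.1f} GPU hours/month")
```

Since a month has roughly 720 hours, any rate above about $0.10/hour means the subscription cannot cover a GPU that is left running around the clock.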

jrmylee OP t1_j39rh76 wrote on January 7, 2023 at 12:34 AM

#1,299,291

Replying to RuairiSpain (#1,299,239)

Got it that makes a lot of sense. We'll definitely be focusing on POC projects. For me, I mainly wanted a better, faster version of Google Colab. It's difficult to compete with their offering due to their free tier, but we think solving the problems of Colab is still worthwhile.

I'm wondering, would you or anyone you know be willing to give this a spin? It would really help us to know if

a. product works on a variety of repos

b. UI is fully functional and easy to use

HoneyEatingPunkKid t1_j39yqnj wrote on January 7, 2023 at 1:26 AM

#1,299,507

Cloud

RuairiSpain t1_j39zy50 wrote on January 7, 2023 at 1:34 AM

#1,299,539

Replying to jrmylee (#1,299,291)

Sure, DM me and I'll send my email address. I don't have much time to spend on it, but will give it a spin.

Have you looked at Hugging Face? They have an elegant way to package the sample dataset with notebooks, and their documents are easy to digest.

jrmylee OP t1_j3a7ngf wrote on January 7, 2023 at 2:32 AM

#1,299,779

Replying to RuairiSpain (#1,299,539)

dm sent!

montcarl t1_j3a96mi wrote on January 7, 2023 at 2:43 AM

#1,299,840

It's a good idea that already has a lot of options: Google colab, codespaces, binder, sagemaker, kaggle notebooks, etc

muffdivemcgruff t1_j3afnk1 wrote on January 7, 2023 at 3:34 AM

#1,300,080

Are you running on Kubernetes? If not, then better go do that. If you need help just ask.

jrmylee OP t1_j3aglv4 wrote on January 7, 2023 at 3:41 AM

#1,300,127

Replying to muffdivemcgruff (#1,300,080)

We use terraform!

muffdivemcgruff t1_j3ap9i2 wrote on January 7, 2023 at 4:52 AM

#1,300,445

Replying to jrmylee (#1,300,127)

You poor soul.

brucebay t1_j3bb7t2 wrote on January 7, 2023 at 8:58 AM

#1,301,104

This seems to be a very ambitious project, as there are several ML projects that have very obscure dependencies that don't work out of the box. This is especially true for older repos. I would personally be very interested at a reasonable price level (comparable to vast ai or runpod) to check out some repos without hassling with setup.

But I'm just a hobbyist. In a professional environment, I don't know if I would be interested in an automated ML setup for a long term development/production solution. My company uses H2O, DataRobot, some IBM solution, (and another one but I forget the name). They have some attractive features for everyday data analysts, but mostly limit the advanced users. In a corporate environment, your solution seems to fit between an expert developer who does all the work, and an AutoML solution that does most of the work.

I think it is a great idea for rapid experimentation for middle-to-high-end development. So I suspect your target audience for those features is going to be either educational institutions, or any kind of research centers, be it military or commercial. I hope it will have enough interest to support you financially. Good luck with your company.

ps: if you can find a way to let users download the environment where the target repo runs (or provide a tool to mirror it) for local development (maybe at an extra cost), it would be a very useful feature for most people. I would even pay for such stand-alone software.

_Arsenie_Boca_ t1_j3bbl7h wrote on January 7, 2023 at 9:03 AM

#1,301,114

For me personally, it would be very important that I am not tied to Jupyter Notebooks. Ideally integrate vscode and automatically load settings from .vscode directory in the repo
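
For what "automatically load settings from .vscode" could look like, here is a minimal sketch. The `load_workspace_settings` helper is a hypothetical name, and it assumes the settings file is strict JSON; VS Code itself tolerates comments and trailing commas in `settings.json`, which this sketch does not handle.

```python
import json
import tempfile
from pathlib import Path


def load_workspace_settings(repo_root):
    """Read <repo_root>/.vscode/settings.json, returning {} if absent or unparsable."""
    settings_file = Path(repo_root) / ".vscode" / "settings.json"
    if not settings_file.is_file():
        return {}
    try:
        return json.loads(settings_file.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        # Files with comments or trailing commas fail strict JSON parsing.
        return {}


# Demo on a throwaway directory standing in for a cloned repo.
with tempfile.TemporaryDirectory() as repo:
    vscode_dir = Path(repo) / ".vscode"
    vscode_dir.mkdir()
    (vscode_dir / "settings.json").write_text('{"editor.tabSize": 2}', encoding="utf-8")
    print(load_workspace_settings(repo))
```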

bernhard-lehner t1_j3bddsb wrote on January 7, 2023 at 9:27 AM

#1,301,156

You might want to consider a different name, as there is a library already there with a highly similar name: https://pypi.org/project/rubberband/

chief167 t1_j3borij wrote on January 7, 2023 at 11:58 AM

#1,301,469

First thought: decide for yourself who your target audience is

If you hope to sell this to companies, or even start-ups, be prepared for a lot of questions around data governance, security, ....

Second: do you have an idea how many users you need to break even and how the infrastructure needs to scale to cope with that? GPUs aren't cheap of course, neither is electricity or cloud providers

AGI_69 t1_j3bw8mp wrote on January 7, 2023 at 1:21 PM

#1,301,751

Error happened while submitting your request. Please try again later.

Aggravating-Act-1092 t1_j3bzclk wrote on January 7, 2023 at 1:51 PM

#1,301,882

I think it’s interesting but for a hobbyist the pricing is too high. I would say some kind of tiered access would allow you to casually try it before committing.

I would like to try it, and I can afford $75/m, but it’s too much for something casual which I might forget about. Codec and MidJourney I both signed up to straight away.

GFrings t1_j3cc42r wrote on January 7, 2023 at 3:32 PM

#1,302,364

I'm sort of confused as to what this buys me as a developer. Sure, I can run the model with one click maybe. But that doesn't seem to get me any closer to my typical goal, which is to have a module that I can drop right into my code base and use the model to solve one sub-problem of a much larger system. I can see using this as sort of a fast way to demo a model maybe, but most repos are clean enough that it takes maybe 30 minutes to reproduce the environment and run the author's model myself.

There are already a lot of open source tools that solve the other problem, by the way. One is pytorch-liberator which can extract all the code and dependencies from a module and package in a nice portable format for integration elsewhere.

As a general tip to you and your sales team, when you go to market with something like this you should have some value propositions lined up already instead of asking us whether we think it's valuable. Most folks will tend to assume not unless you can help them see what makes this useful.

fakesoicansayshit t1_j3d8bg3 wrote on January 7, 2023 at 7:06 PM

#1,303,737

All I really need is a storage unit that doesn't make me move large GB files up and down like Colab (takes forever, has to be done every time), and that lets me use an A100 on the fly when needed (instead of confusing compute units bs) without having to switch runtimes (which makes you move data again).

jrmylee OP t1_j3dl18u wrote on January 7, 2023 at 8:29 PM

#1,304,214

Replying to _Arsenie_Boca_ (#1,301,114)

Gotcha. We have a VSCode editor built in, but haven't implemented a .vscode integration yet so we'll add that in.

Do you also use Github Copilot?

jrmylee OP t1_j3dlb06 wrote on January 7, 2023 at 8:31 PM

#1,304,230

Replying to montcarl (#1,299,840)

Yeah definitely true. I've used all of these except sagemaker, and I felt the solutions weren't perfect for my workflows. I guess we're figuring out if other people feel the same way!

jrmylee OP t1_j3dll1u wrote on January 7, 2023 at 8:32 PM

#1,304,238

Replying to Aggravating-Act-1092 (#1,301,882)

OK got it, that makes sense. We're actually trying to figure out how to do tiered access; possibly a free tier with CPUs only might make sense.

I'll also DM you with a link to try the app!

jrmylee OP t1_j3dnlv6 wrote on January 7, 2023 at 8:46 PM

#1,304,326

Replying to brucebay (#1,301,104)

>This seems to be a very ambitious project, as there are several ML projects that have very obscure dependencies that don't work out of the box. This is especially true for older repos. I would personally be very interested at a reasonable price level (comparable to vast ai or runpod) to check out some repos without hassling with setup.

Yeah that makes sense, appreciate the feedback! We're hoping it works out as well haha

You mentioned users between expert and non-technical folks, and we think that this is the intended audience for our app. Most of this is due to the fact that we're building this for ourselves (as recent ML grad students), and it made sense to us to solve a problem we're familiar with.

I also DM'd you a link to the app, if you have time to check it out would appreciate your feedback.

_Arsenie_Boca_ t1_j3dntcr wrote on January 7, 2023 at 8:47 PM

#1,304,337

Replying to jrmylee (#1,304,214)

Awesome. I don't, but there is a VSCode extension, so that would be integrated already. Or do you have any special integration of Copilot?

the__itis t1_j3dogz7 wrote on January 7, 2023 at 8:51 PM

#1,304,356

Replying to JackBlemming (#1,298,672)

Also if not 100% native, compiling libraries for architectures is a big complication.

jrmylee OP t1_j3dp0tc wrote on January 7, 2023 at 8:55 PM

#1,304,378

Replying to _Arsenie_Boca_ (#1,304,337)

Yeah, we don't currently have extensions implemented (not sure it's possible in a 3rd party web-app actually).

I've been using Copilot a ton so was curious if you were also using it

_Arsenie_Boca_ t1_j3dpfxv wrote on January 7, 2023 at 8:58 PM

#1,304,397

Replying to jrmylee (#1,304,378)

Ah ok, I didn't know that was an issue. Extensions are really important so you should definitely look into that

jrmylee OP t1_j3dylup wrote on January 7, 2023 at 9:59 PM

#1,304,720

Replying to AGI_69 (#1,301,751)

How did you run into this error?

AGI_69 t1_j3e0uou wrote on January 7, 2023 at 10:14 PM

#1,304,804

Replying to jrmylee (#1,304,720)

Tried to join the waitlist. Works now

The_Rational_Player t1_j3ewfe3 wrote on January 8, 2023 at 1:57 AM

#1,306,075

Really feel like this is a redundancy based on what's available out there to date.

i_ikhatri t1_j3xlleh wrote on January 11, 2023 at 7:35 PM

#1,334,805

Replying to jrmylee (#1,298,826)

Just to add onto this feedback (because I think /u/JackBlemming is 100% correct) you would probably benefit from storing some of the most popular datasets (ImageNet, MS COCO, whatever is relevant to the fields you're targeting) somewhere in the cloud where you can provide fast read access (or fast copies) to any number of training workers that get spun up.

Research datasets tend to be fairly standardized so I think you could get a high amount of coverage by just having a few common datasets available. I only gave computer vision examples because that's what I'm most familiar with but if you get a few CV datasets, a few NLP ones etc. you should be able to provide a killer UX.

Bonus points if you're somehow able to configure the repos to read from the centralized datastore properly automatically (though this is probably difficult/impossible).
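
A minimal sketch of the "read from a centralized datastore" idea, assuming a hypothetical shared mount with pre-staged datasets and a per-worker fallback directory. The `resolve_dataset` helper, the directory layout, and the dataset name are illustrative assumptions, not part of any existing service.

```python
import tempfile
from pathlib import Path


def resolve_dataset(name, shared_root, local_root):
    """Prefer a pre-staged copy on the shared datastore; else fall back to a local dir.

    Returns (path, is_shared) so the caller knows whether a download is still needed.
    """
    shared = Path(shared_root) / name
    if shared.is_dir():
        return shared, True
    local = Path(local_root) / name
    local.mkdir(parents=True, exist_ok=True)  # caller downloads into this directory
    return local, False


# Demo with throwaway directories standing in for the shared mount and worker disk.
with tempfile.TemporaryDirectory() as tmp:
    shared_root = Path(tmp) / "shared"
    (shared_root / "coco").mkdir(parents=True)
    path, is_shared = resolve_dataset("coco", shared_root, Path(tmp) / "local")
    print(path.name, is_shared)
```

Training code would then point its data loader at the returned path, and only trigger a download when `is_shared` is false.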