Submitted by I_will_delete_myself t3_115z9hc in MachineLearning

I really like training in the cloud for some reason; it just feels satisfying. Here are a couple of things I wish I had known beforehand to get started.

  1. Use a spot instance unless you absolutely need to guarantee it won't be interrupted. Your wallet will thank you later. (A minimal checkpoint-and-resume sketch for surviving interruptions follows this list.)
  2. Make sure the Nvidia drivers are installed, and don't experiment with operating systems. You are paying by the hour.
  3. Use something like tmux to keep the sessions in your terminal alive, so you don't have to start from scratch if you disconnect from the VM (while the VM itself keeps running). That way you can just close the terminal and not bother with it until the job is done.
  4. Debug on your local machine, on CPU if you don't have CUDA. You can debug the model on a CPU perfectly fine.
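Since spot instances can be reclaimed at any time (tip 1), the usual mitigation is to checkpoint periodically and resume from the latest checkpoint on restart. A minimal sketch, assuming PyTorch; the model, training step, and checkpoint path are placeholders, not anything specific from this thread:

```python
# Minimal checkpoint-and-resume pattern for interruptible (spot) training.
# "checkpoint.pt" is a placeholder; in practice it should live on persistent
# storage (e.g. an attached volume or object store) that survives the instance.
import os

import torch
import torch.nn as nn

CKPT = "checkpoint.pt"

model = nn.Linear(128, 10)                      # stand-in for your real model
opt = torch.optim.SGD(model.parameters(), lr=0.1)
start_epoch = 0

if os.path.exists(CKPT):                        # resume after an interruption
    state = torch.load(CKPT)
    model.load_state_dict(state["model"])
    opt.load_state_dict(state["optimizer"])
    start_epoch = state["epoch"] + 1

for epoch in range(start_epoch, 100):
    loss = model(torch.randn(4, 128)).sum()     # placeholder training step
    opt.zero_grad()
    loss.backward()
    opt.step()
    torch.save({"model": model.state_dict(),    # overwrite the checkpoint each epoch
                "optimizer": opt.state_dict(),
                "epoch": epoch}, CKPT)
```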

Now what about you all?

89

Comments

Demortus t1_j94q9zd wrote

Running Linux on your desktop/laptop makes it significantly easier to run projects on the cloud. Namely, you will already be familiar with all the dependencies needed to run your project and how to install them on the remote machine. Moreover, you will not need to make many, if any, changes to your scripts to get them to work.

45

Mefaso t1_j95hjkm wrote

>Running Linux on your desktop/laptop makes it significantly easier to run projects on the cloud

Just as a note, this can easily be done in a Docker container on Windows as well.

14

Lifaux t1_j9572rl wrote

Alternatively, you can always use WSL2 if you don't want to dual boot.

11

I_will_delete_myself OP t1_j94qohm wrote

I agree. It also helps with deploying an API for your model. systemd is also useful for keeping things running if the server gets reset for whatever reason.

7

royalemate357 t1_j94ax4h wrote

Depending on what scale you're working at, egress fees / data transfer fees can be something to look out for. Be aware of them whenever you are moving data around or data is leaving the cloud (e.g. when you download a model checkpoint).

21

I_will_delete_myself OP t1_j94c8d1 wrote

This is something most cloud services use to lock you in to their services and discourage migrations to another vendor.

9

Appropriate_Ant_4629 t1_j97gjhy wrote

>egress fees / data transfer fees

On the bright side, ingress is often free.

It costs surprisingly little to stream live video ***into*** the cloud and spew back tiny embedding vectors from models running there.

5

Lifaux t1_j957q53 wrote

If you're having to debug code, VSCode has really good integrations for running on your remote server. Unless you're already very familiar with vim, it's going to be quicker to set this up.

Ensure you've got rsync experience - no one wants to include venv when pulling your changes back from the remote side.

Run the image you plan to use remotely in Docker locally first. Check that your code works; you don't want to be messing around with fixes while your GPUs sit idle.

If you're running compiled code, check the CPU architecture. I wasted a day debugging a fault that came from compiling starspace on a build server with a different architecture from our remote server.

Tmux is a godsend.

19

skippy_nk t1_j95aurm wrote

The discovery of tmux was one of my greatest achievements of early 2022.

5

Tgs91 t1_j954pam wrote

If you work in a job where you're frequently asked to apply your code in different cloud environments (AWS, Azure, Google Cloud, local machines, etc.), then it's good to dev/test code locally and have a mix of Windows and Mac on your team. If your tests pass on both Mac and Windows, then they'll probably also pass on just about any Linux-based environment in a cloud service. Dev locally, train on the cloud with minimal debugging, because you pay by the hour.
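As a small illustration of the kind of code that ports cleanly across those platforms (not from the thread, just an assumed example): hard-coded path separators are one of the usual culprits, and pathlib avoids them.

```python
# Illustrative only: OS-agnostic path handling, so tests that pass on Windows
# and macOS keep passing on a Linux cloud image. "data/train" is a made-up path.
from pathlib import Path

data_dir = Path("data") / "train"            # no hard-coded "/" or "\\" separators
data_dir.mkdir(parents=True, exist_ok=True)
print((data_dir / "labels.csv").resolve())   # resolves correctly on any OS
```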

3

RideOrDieRemember t1_j957mmn wrote

Is there a trick to spot instances on AWS? In the past when I tried to get a spot GPU instance, it was never available.

3

Mefaso t1_j95hl4n wrote

Maybe try different regions?

1

dancingnightly t1_j95wa9s wrote

Try multiple regions and zones. There are peaks and troughs in availability; most notably, the weekend is a good time to grab one. There are some sites that help you do this, and scripts online that use the AWS CLI to check for you.
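If you'd rather script it yourself, something like the following is a rough starting point. It's just a sketch using boto3 (the regions and the p3.2xlarge instance type are example values, not a recommendation), and it reports recent spot prices per zone rather than guaranteed capacity:

```python
# Sketch: list recent spot prices for a GPU instance type across a few regions.
# Requires boto3 and configured AWS credentials; regions/instance type are examples.
from datetime import datetime, timedelta, timezone

import boto3

REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]
INSTANCE_TYPE = "p3.2xlarge"

for region in REGIONS:
    ec2 = boto3.client("ec2", region_name=region)
    resp = ec2.describe_spot_price_history(
        InstanceTypes=[INSTANCE_TYPE],
        ProductDescriptions=["Linux/UNIX"],
        StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    )
    for item in resp["SpotPriceHistory"]:
        print(region, item["AvailabilityZone"], item["SpotPrice"])
```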

1

__lawless t1_j95ixov wrote

Use code-server (VS Code in the browser); it is amazing.

3

No_Goat277 t1_j98oklc wrote

What is the total cost of the cloud vs. running your own servers on-prem? I need to start a project with 2/4 RTX cards to train my Stable Diffusion model.

2

I_will_delete_myself OP t1_j98p0vg wrote

I've been running an A100 the entire weekend and so far it's only cost me under 20 bucks. If you only need it for around an hour, it would probably cost you between 1 and 3 dollars.

I would recommend planning a budget before you get started, and it will almost always be cheaper on a yearly basis. Try Colab first and see whether you will need it for longer than 12 hours.

6

No_Goat277 t1_j98pvwn wrote

Thank you. I have a scientific team, and our PhD is requesting a GPU for SD training. Our other team is using Midjourney, but there is no API for it, so they're happy but we can't move forward due to the lack of an API.

2

milleeeee t1_j9cbrxg wrote

Azure has cheap A100 spot instances: only $1 per hour per A100. Up until now I have always gotten my instances immediately, and I have only been kicked out twice in over 100 training runs (each run lasts a couple of hours). So I am very happy with it at the moment and would highly recommend it.

3

DeepDeeperRIPgradien t1_j9eo5uw wrote

Can you recommend a tutorial or something that explains the steps to move from (e.g. PyTorch) training on your own machine to training that model in the cloud (e.g. AWS)? What type of instance to choose, how/where to store data, making sure the Nvidia/CUDA stuff is working properly, etc.?

1

I_will_delete_myself OP t1_j9fodao wrote

>Can you recommend a tutorial or something that explains the steps to move from (e.g. PyTorch) training on your own machine to training that model in the cloud (e.g. AWS)?

Same as running on your own machine.

>What type of instance to choose, how/where to store data, making sure the Nvidia/CUDA stuff is working properly, etc.?

Just look up an EC2 instance or VM that has the GPU you want and there you go. nvidia-smi is the command that should tell you which GPU you have; the drivers are working if it lists your GPU. I would also suggest checking in code that CUDA is actually being used (a quick check is sketched below).

I prefer a plain EC2 instance or VM because it's normally cheaper, but you have to do your own research on pricing. The cloud is a competitive market, so there is always someone ready to offer an A100 at a cheaper price. I heard Lambda Cloud was super cheap for on-demand.
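A minimal sanity check, assuming PyTorch (swap in the equivalent for your framework); run it on the instance once the drivers are installed:

```python
# Confirm the GPU is visible from the framework, not just from nvidia-smi.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("CUDA version PyTorch was built with:", torch.version.cuda)
```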

1

gamerx88 t1_j9evm62 wrote

How do you utilize a spot instance for training? How do you automatically resume training from a checkpoint? Or are you referring to something like SageMaker's managed spot training?

1