Submitted by KlausMich t3_zql367 in MachineLearning

Hi everyone!

I want to use a server to train my ML models continuously without keeping my PC on 24/7. I am currently running fairly simple deep learning models that would take about a week to train on my computer.

So far, the best starting point I have found is the AWS t2.micro instance. I've seen that Google Cloud and Nvidia also offer options.

Could you please guide me through this or suggest which option would be better? I am not an expert, and this is my first time doing it.

0

Comments

Deep-Station-1746 t1_j0ylg2a wrote

Go for vast.ai if you don't have a huge budget. You could rent a 24 GB VRAM instance for $0.40/hr.

4

ggf31416 t1_j0ypnpp wrote

Training a large model on CPU alone is madness: it will take forever and waste a lot of electricity. You need a GPU with CUDA, or an equivalent solution that is fully supported by your framework. See e.g. this benchmark.
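
A quick way to check whether your framework actually sees a CUDA GPU is sketched below. It assumes PyTorch is the framework in use (the thread does not say which one is); `pick_device` is a hypothetical helper name, and it falls back to "cpu" if PyTorch is not installed.

```python
def pick_device() -> str:
    """Return "cuda" if PyTorch can see a CUDA GPU, otherwise "cpu"."""
    try:
        import torch  # assumption: PyTorch is the framework in use
    except ImportError:
        # No PyTorch installed, so there is nothing to run on a GPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(f"Training would run on: {pick_device()}")
```

If this prints "cpu" on a machine or instance you expected to have a GPU, the driver, the CUDA build of the framework, or the instance type is probably the problem.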

A t2.micro instance may be free during the free trial, but it is useless for anything resource-intensive. You are much better off just using Google Colab or Kaggle notebooks.

If you have to train models very often (like every day) and the 24 GB of an RTX 3090, or better an RTX 4090, is enough, a dedicated computer is the most cost-effective option in the long run. If you can't afford an RTX 3090 and 12 GB is enough, a 3060 with 12 GB will do (for ML we usually want as much VRAM as possible; raw computing power is often not the bottleneck).
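
To get a rough feel for whether 12 GB or 24 GB is "enough", here is a back-of-the-envelope sketch. It assumes plain fp32 training with the Adam optimizer, which needs about 16 bytes per parameter (4 for the weight, 4 for the gradient, 8 for Adam's two moment estimates). It deliberately ignores activations, which depend on batch size and architecture and are often the larger share, so treat the result as a lower bound. The function name is hypothetical.

```python
def training_memory_gib(num_params: int, bytes_per_param: int = 16) -> float:
    """Lower-bound estimate of training memory in GiB.

    Assumes fp32 weights (4 B) + gradients (4 B) + Adam moments (8 B)
    per parameter. Activations are NOT included.
    """
    return num_params * bytes_per_param / 1024**3


if __name__ == "__main__":
    # Illustrative model sizes, not taken from the thread.
    for params in (10_000_000, 100_000_000, 1_000_000_000):
        print(f"{params:>13,d} params: at least {training_memory_gib(params):.2f} GiB")
```

Mixed precision, smaller optimizers, or gradient checkpointing change these numbers, so this is only a first sanity check.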

Vast.ai is a cost-effective way of renting computing power for non-constant use, and much cheaper than AWS or GCP. Beware, though, that because of how it works, the instance is not fully secure against attacks from the host, so you can't use it with sensitive data.

Any good CUDA GPU will be able to train on a small dataset in less than a day, so take that into account when deciding between purchasing a GPU and cloud computing.

7

120pi t1_j10c3gq wrote

t2.micro instances are great for lightweight applications but do not have GPUs attached. Look for the g4-, g5-, and p-series instances and use one of the Deep Learning AMIs (DL AMIs). A single-GPU server should work for starters; you can then simply upgrade the instance if you need parallel compute.

Something else to consider: you're only charged for the wall time these servers are up, so do as much of your development locally as you can. Then spin up the instance, pull your repo onto the EC2 instance, train, export the model and checkpoints to S3, and shut the instance down (this can all be done in a script).
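
As a rough sketch of what such a script could look like, the Python below only builds and (by default) prints the commands. It assumes the AWS CLI is installed and configured locally, that the instance already exists and has the repo cloned, and that the instance itself has the AWS CLI and permission to write to the bucket. The instance ID, SSH host, paths, training command, and bucket are all placeholders, and `build_plan` / `run_plan` are hypothetical helper names.

```python
import shlex
import subprocess


def build_plan(instance_id, host, repo_dir, train_cmd, checkpoint_dir, s3_uri):
    """Return the list of commands for one start/train/upload/stop cycle."""
    # Everything in `remote` runs on the EC2 instance over SSH.
    remote = (
        f"cd {shlex.quote(repo_dir)} && git pull && {train_cmd} && "
        f"aws s3 cp --recursive {shlex.quote(checkpoint_dir)} {shlex.quote(s3_uri)}"
    )
    return [
        ["aws", "ec2", "start-instances", "--instance-ids", instance_id],
        ["aws", "ec2", "wait", "instance-running", "--instance-ids", instance_id],
        ["ssh", host, remote],
        ["aws", "ec2", "stop-instances", "--instance-ids", instance_id],
    ]


def _run(cmd, dry_run):
    if dry_run:
        print(shlex.join(cmd))  # show the command instead of executing it
    else:
        subprocess.run(cmd, check=True)


def run_plan(plan, dry_run=True):
    """Run every step; always attempt the final stop so billing ends."""
    *steps, stop = plan
    try:
        for cmd in steps:
            _run(cmd, dry_run)
    finally:
        _run(stop, dry_run)


if __name__ == "__main__":
    plan = build_plan(
        instance_id="i-0123456789abcdef0",     # placeholder
        host="ubuntu@ec2-host.example.com",    # placeholder
        repo_dir="my-project",                 # placeholder, relative to home
        train_cmd="python train.py",           # placeholder
        checkpoint_dir="checkpoints",          # placeholder
        s3_uri="s3://my-bucket/run-1/",        # placeholder
    )
    run_plan(plan, dry_run=True)
```

The `try`/`finally` is the important design choice: the stop command is issued even if training fails, so a crashed run does not leave a GPU instance billing overnight. Note that the public hostname can change between stops and starts unless you attach a fixed address.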

I bring this up because the cost may seem high for a 24-hour usage cycle, but the A10 (g5) and above are really powerful and may actually work out cheaper due to reduced training times.

Breaking even on a 4090 would take around 6-9 months at 8h/day compared with using a single-GPU g5. Personally, I'm grabbing a 4070 Ti when it comes out, since I'm not so concerned with training times for personal projects (12 GB is "ok") and I don't want to upgrade my PSU.
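
The break-even estimate is simple arithmetic: purchase price divided by what the same hours would cost in the cloud. The sketch below uses placeholder prices (a $1,600 card and $1.00/hr for the cloud GPU), which are assumptions for illustration rather than quotes. It ignores electricity, the rest of the PC, and resale value.

```python
def break_even_months(gpu_price_usd, cloud_rate_usd_per_hr, hours_per_day,
                      days_per_month=30.0):
    """Months of cloud usage whose cost equals the GPU purchase price."""
    monthly_cloud_cost = cloud_rate_usd_per_hr * hours_per_day * days_per_month
    return gpu_price_usd / monthly_cloud_cost


if __name__ == "__main__":
    # Placeholder prices; plug in current prices for your region and card.
    months = break_even_months(gpu_price_usd=1600, cloud_rate_usd_per_hr=1.00,
                               hours_per_day=8)
    print(f"Break-even after about {months:.1f} months")
```

With lighter usage the break-even point moves out proportionally: halving the hours per day doubles the number of months.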

2