Hey /r/machinelearning,

Long-time reader, first time posting non-anonymously. I've been training models on various cloud services, but as an individual user I find it stressful to worry about shutting down instances when training fails or stops. Crashes, bad code, etc. can cause GPU utilization to drop without the program ever "finishing", and that idle time can cost a lot of money if you don't catch it quickly.

Thus, I built this tiny lil tool to help. It watches the GPU utilization of your instance and performs an action if it drops too low for too long. For example, it can shut down the instance if GPU usage stays under 30% for 5 minutes.
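
For anyone curious how a "too low for too long" check might work, here is a minimal sketch of the idea. It is not the actual gpu_sentinel code: the IdleWatchdog class, its parameters, and the injectable `now` argument are all hypothetical names made up for illustration, and reading the real GPU utilization is left out.

```python
import time
from typing import Callable, Optional


class IdleWatchdog:
    """Calls `action` once if utilization stays below `threshold`
    for at least `duration` seconds. Hypothetical sketch, not gpu_sentinel."""

    def __init__(self, threshold: float, duration: float,
                 action: Callable[[], None]) -> None:
        self.threshold = threshold    # e.g. 0.30 for 30%
        self.duration = duration      # e.g. 300 for 5 minutes
        self.action = action          # e.g. a function that shuts down the VM
        self._low_since: Optional[float] = None
        self.fired = False

    def tick(self, utilization: float, now: Optional[float] = None) -> None:
        # `now` can be passed in for testing; otherwise use a monotonic clock.
        now = time.monotonic() if now is None else now
        if utilization >= self.threshold:
            # Utilization recovered, so forget the current low streak.
            self._low_since = None
            return
        if self._low_since is None:
            # First low reading of a new streak.
            self._low_since = now
        if not self.fired and now - self._low_since >= self.duration:
            self.fired = True
            self.action()
```

You would call tick() on a timer with whatever utilization reading you have. Resetting the streak on any healthy reading means short dips (checkpointing, evaluation) don't trigger the action.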

It's easy to install and use: just run pip install gpu_sentinel

If this is useful, please leave comments here or on the GitHub page: https://github.com/moonshinelabs-ai/gpu_sentinel

I'm hoping it helps save some other folks money!

Comments

MuonManLaserJab t1_j4o2mlx wrote on January 17, 2023 at 1:53 AM

I have a little script called gpu_Speed that blows up my laptop if it drops below 50 ~~mph~~ % GPU utilization

nateharada OP t1_j4ou7qz wrote on January 17, 2023 at 5:23 AM

./popquiz_hotshot.sh

whatwasigoingtodo t1_j4px3so wrote on January 17, 2023 at 1:09 PM

shoot the hostage

scaredoftheinternet t1_j4mkrqk wrote on January 16, 2023 at 7:53 PM

Wow this is actually really cool, thanks for sharing.

Zealousideal_Low1287 t1_j4n2ahm wrote on January 16, 2023 at 9:43 PM

Looks nice. I probably wouldn’t use it for shutting down or anything, but a notification on failure might be useful!

nateharada OP t1_j4ne979 wrote on January 16, 2023 at 11:00 PM

Nice! Right now you can use the end_process trigger, which just makes the process return 0 when the trigger is hit, but it should be fairly straightforward to expose the API a little more. That would let you do something like this in your script:

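import time

from gpu_sentinel import Sentinel, get_gpu_usage

# Placeholder callback: replace with your own alert or cleanup logic.
# (Assumes kill_fn is called with no arguments.)
def my_callback_fn():
    print("GPU utilization stayed low, trigger fired")

sentinel = Sentinel(
    arm_duration=10,
    arm_threshold=0.7,
    kill_duration=60,
    kill_threshold=0.7,
    kill_fn=my_callback_fn,
)
# Poll GPU usage once per second and feed each reading to the sentinel.
while True:
    gpu_usage = get_gpu_usage(device_ids=[0, 1, 2, 3])
    sentinel.tick(gpu_usage)
    time.sleep(1)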

Is that something that would be useful? You define the callback function yourself, so you could have it trigger an alert, etc.

Zealousideal_Low1287 t1_j4neybb wrote on January 16, 2023 at 11:05 PM

Yeah, that’s something which would be useful indeed. Don’t worry yourself though, I can put in a PR.

nateharada OP t1_j4ngy65 wrote on January 16, 2023 at 11:18 PM

It's actually almost entirely ready now, I just need to alter a few things. I'll go ahead and push it soon! Need to do some final tests.

EDIT: The above code should work! See the README on GitHub for a complete example.

MrAcurite t1_j4t9ch1 wrote on January 18, 2023 at 2:26 AM

At work, we've got this thing that will notify you if a cloud instance has been running for 24 hours. However, it does this by messaging your work email, and you can't configure it to go to a personal device or anything. Meaning, if you set a job to run at the end of the week, you can come back on Monday to over a thousand dollars of cloud charges and like fifty angry emails about it.

extracompute t1_j4tnogh wrote on January 18, 2023 at 4:14 AM

Ha. computeX has automated notifs built in to avoid problems like these.

What's the biggest bill you've ever come back to on Monday AM?

Fit_Schedule5951 t1_j4obl4w wrote on January 17, 2023 at 2:55 AM

Nice. One extension that could be beneficial is handling the case where your process hangs: it's still holding full GPU memory but not training. This happened to me recently while training models with fairseq. (I'm not sure how you can catch these conditions.)

nateharada OP t1_j4otocf wrote on January 17, 2023 at 5:18 AM

This tool actually doesn't look at memory right now, just actual computation. Usually, loading your model eats up basically the max memory until training is done, even if compute usage is very low.

If your training is hanging and still burning GPU cycles, that'd be harder to detect, I think.

bay_der t1_j4papbd wrote on January 17, 2023 at 8:35 AM

One way I have figured out is to put a watch on the log file.
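
A rough sketch of what a log-file watch could look like, assuming your training script appends to a log file regularly so that a stale modification time means it has hung. The function names, the path, and the timeouts here are made up for illustration.

```python
import os
import time
from typing import Optional


def log_is_stale(path: str, max_idle_seconds: float,
                 now: Optional[float] = None) -> bool:
    """Return True if `path` was last modified more than
    `max_idle_seconds` ago. `now` can be passed in for testing."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(path) > max_idle_seconds


def wait_until_stale(path: str, max_idle_seconds: float,
                     poll_seconds: float = 30.0) -> None:
    """Block until the log file stops being updated."""
    while not log_is_stale(path, max_idle_seconds):
        time.sleep(poll_seconds)


# Example (hypothetical path and timeout):
#   wait_until_stale("train.log", max_idle_seconds=600)
#   ...then send an alert or shut the instance down.
```

This catches hangs that keep the GPU busy, but only if the training loop logs often enough that a quiet period longer than the timeout really means something is wrong.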

junetwentyfirst2020 t1_j4n6amt wrote on January 16, 2023 at 10:08 PM

👍 very cool

lorenzo1384 t1_j4r47aw wrote on January 17, 2023 at 6:05 PM

Can I try this on Colab?

nateharada OP t1_j4tojyh wrote on January 18, 2023 at 4:22 AM

Yeah, it should work if you use the API (and if you have a GPU in your Colab). I don't think it'll work with TPUs just yet.

lorenzo1384 t1_j4trj9v wrote on January 18, 2023 at 4:48 AM

Thanks, and yes, I do have a premium GPU. I'm paying for all the proofs of concept I do, so this will be helpful.

ndemir t1_j4s5774 wrote on January 17, 2023 at 9:52 PM

good idea.

Kinwwizl t1_j4slam2 wrote on January 17, 2023 at 11:38 PM

That's one of the reasons GCP is nice for ML training workloads: you can kill the VM after training finishes by calling poweroff at the end of the bash training script.
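
The same pattern can be sketched in Python instead of bash, as below. This is only an illustration: run_then is a hypothetical helper, train.py is a placeholder script name, and whether poweroff needs sudo depends on how your VM is set up.

```python
import subprocess
import sys
from typing import Sequence


def run_then(train_cmd: Sequence[str], after_cmd: Sequence[str]) -> int:
    """Run `train_cmd`, then always run `after_cmd` (even if training
    crashed or was interrupted). Returns the training exit code."""
    try:
        return subprocess.run(train_cmd).returncode
    finally:
        # Runs on success, on a non-zero exit, and on exceptions such as
        # Ctrl-C, so the VM is not left idling after a failed run.
        subprocess.run(after_cmd)


# Example (hypothetical script name; assumes passwordless sudo on the VM):
#   run_then([sys.executable, "train.py"], ["sudo", "poweroff"])
```

Running the shutdown command in a finally block covers crashes as well as clean exits, though it won't help if the training process hangs without exiting.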
