Submitted by nateharada t3_10do40p in MachineLearning
Zealousideal_Low1287 t1_j4n2ahm wrote
Looks nice. I probably wouldn’t use it for shutting down or anything, but a notification on failure might be useful!
nateharada OP t1_j4ne979 wrote
Nice! Right now you can use the end_process trigger to just return 0 when the trigger is hit from the process, but it should be fairly straightforward to externalize the API a little bit more. This would let you do something like this in your script:
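    import time

    from gpu_sentinel import Sentinel, get_gpu_usage

    # my_callback_fn is a function you define yourself (e.g. to send an alert).
    sentinel = Sentinel(
        arm_duration=10,
        arm_threshold=0.7,
        kill_duration=60,
        kill_threshold=0.7,
        kill_fn=my_callback_fn,
    )

    # Poll GPU usage once per second and feed it to the sentinel.
    while True:
        gpu_usage = get_gpu_usage(device_ids=[0, 1, 2, 3])
        sentinel.tick(gpu_usage)
        time.sleep(1)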
Is that something that would be useful? You can define the callback function yourself so maybe you trigger an alert, etc.
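To make the arm/kill idea concrete, here is a rough sketch of how such a trigger could behave. This is not the gpu_sentinel implementation: ToySentinel, its clock parameter, and the exact semantics are assumptions for illustration. It assumes usage is a single float between 0 and 1, that the sentinel arms once usage stays at or above arm_threshold for arm_duration seconds, and that it calls kill_fn (with no arguments) once usage then stays below kill_threshold for kill_duration seconds.

```python
import time


class ToySentinel:
    """Hypothetical stand-in for illustration; NOT the real gpu_sentinel.Sentinel."""

    def __init__(self, arm_duration, arm_threshold, kill_duration,
                 kill_threshold, kill_fn, clock=time.monotonic):
        self.arm_duration = arm_duration
        self.arm_threshold = arm_threshold
        self.kill_duration = kill_duration
        self.kill_threshold = kill_threshold
        self.kill_fn = kill_fn
        self.clock = clock          # injectable so the logic can be tested
        self.armed = False
        self.fired = False
        self._streak_start = None   # when the current qualifying streak began

    def tick(self, gpu_usage):
        """Feed one usage sample (assumed to be a float in [0, 1])."""
        if self.fired:
            return
        now = self.clock()
        if not self.armed:
            # Waiting for sustained high usage before arming.
            in_streak = gpu_usage >= self.arm_threshold
            needed = self.arm_duration
        else:
            # Armed: waiting for sustained low usage before firing.
            in_streak = gpu_usage < self.kill_threshold
            needed = self.kill_duration
        if not in_streak:
            self._streak_start = None
            return
        if self._streak_start is None:
            self._streak_start = now
        if now - self._streak_start >= needed:
            self._streak_start = None
            if not self.armed:
                self.armed = True
            else:
                self.fired = True
                self.kill_fn()
```

With a callback like `lambda: print("GPU idle, sending alert")`, the arming step means the callback cannot fire during startup, before the job has ever reached high utilization.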
Zealousideal_Low1287 t1_j4neybb wrote
Yeah, that’s something which would be useful indeed. Don’t worry yourself though, I can put in a PR.
nateharada OP t1_j4ngy65 wrote
It's actually almost entirely ready now, I just need to alter a few things. I'll go ahead and push it soon! Need to do some final tests.
EDIT: The above code should work! See the README on GitHub for a complete example.
MrAcurite t1_j4t9ch1 wrote
At work, we've got this thing that will notify you if a cloud instance has been running for 24 hours. However, it does this by messaging your work email, and you can't configure it to go to a personal device or anything. Meaning, if you set a job to run at the end of the week, you can come back on Monday to over a thousand dollars of cloud charges and like fifty angry emails about it.
extracompute t1_j4tnogh wrote
Ha. computeX has automated notifs built in to avoid problems like these.
What's the biggest bill you've ever come back to on Monday AM?