Submitted by not_particulary t3_11dk7nt in MachineLearning

I've loved using wandb because my workflow runs on a university-provided Slurm cluster that doesn't allow any internet access on the compute nodes, and it's annoying to keep doing 2FA just to evaluate results. Its offline mode lets me sync runs with a little script on the login node and see everything in an online dashboard.
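For anyone wanting to reproduce the login-node sync step, here is a minimal sketch, not my exact script. It assumes offline runs land in directories named `offline-run-*` under a `wandb/` folder and that the `wandb` CLI is on the login node's PATH. The ledger file (`record_file`) is a hypothetical helper kept by this script so that each run is uploaded only once; wandb itself does not read it.

```python
import subprocess
from pathlib import Path


def find_offline_runs(wandb_dir):
    """Return offline run directories under wandb_dir, sorted by name."""
    return sorted(p for p in Path(wandb_dir).glob("offline-run-*") if p.is_dir())


def sync_new_runs(wandb_dir, record_file):
    """Sync each offline run once, remembering finished ones in record_file.

    record_file is a hypothetical plain-text ledger (one directory name per
    line) maintained by this script only.
    """
    record = Path(record_file)
    done = set(record.read_text().split()) if record.exists() else set()
    for run_dir in find_offline_runs(wandb_dir):
        if run_dir.name in done:
            continue
        # `wandb sync <path>` uploads a single offline run directory.
        result = subprocess.run(["wandb", "sync", str(run_dir)])
        if result.returncode == 0:
            with record.open("a") as fh:
                fh.write(run_dir.name + "\n")
```

Calling `sync_new_runs("wandb", "synced_runs.txt")` from the login node would upload whatever is new. Note that it marks a run as done after its first successful sync, so it suits finished jobs rather than ones still training.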

However, the software is super unstable. I've been losing jobs randomly to a mystery `Killed` error. It has also piled up runs and insisted on syncing all of them again, so I have to go in and manually delete old runs that have long been saved in the dashboard. And it already took me forever to figure out how to keep it from logging a separate run for each GPU I use to train. Is there anything that does the same thing but is just more mature, so that I don't have to spend all my time squashing bugs related to data logging? I'd rather just focus on training these models, honestly.
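On the one-run-per-GPU issue, the usual pattern is to start the tracker on a single process only. A rough sketch, assuming the launcher exports `RANK` (as torchrun does) or `SLURM_PROCID` (as srun does); `is_main_process` and `init_tracking` are hypothetical helper names, not part of wandb:

```python
import os


def is_main_process(env=None):
    """True only for the process with global rank 0.

    Assumes the launcher exports RANK (torchrun) or SLURM_PROCID (srun);
    with neither set, the process is treated as a single-process job.
    """
    env = os.environ if env is None else env
    rank = env.get("RANK", env.get("SLURM_PROCID", "0"))
    return int(rank) == 0


def init_tracking(project, env=None):
    """Start an offline wandb run on rank 0 only; other ranks get None."""
    if not is_main_process(env):
        return None
    import wandb  # imported lazily so non-logging ranks never start it
    return wandb.init(project=project, mode="offline")
```

The training loop then logs only when the returned run is not `None`, so an 8-GPU job shows up as one run instead of eight.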

9

Comments


Jean-Porte t1_ja9ejvo wrote

You can increase some timeout parameter, it helps

But I agree. I don't even understand why they don't log things locally on failure instead of KILLING A ONE-WEEK JOB ON A HIGH-END GPU SERVER (MORE THAN $100 WORTH OF COMPUTE TIME).

10

Jean-Porte t1_ja9iik5 wrote

>Yeah, but it's super iffy. My exact script works most of the time, so idk even what to fix. That's why I just want to use something else, the software is obviously not stable

Do `export WANDB__SERVICE_WAIT=300`

I don't have that problem anymore
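If editing the job script's shell environment is awkward, the same setting can probably be applied from Python, as long as it happens before wandb starts. This is a sketch under that assumption; `configure_wandb_env` is a hypothetical helper, and `WANDB_MODE=offline` is added only because the compute nodes in question have no internet.

```python
import os


def configure_wandb_env(env=None, service_wait_s=300):
    """Set wandb-related variables without overriding values already exported."""
    env = os.environ if env is None else env
    # Timeout (seconds) for the wandb background service, as suggested above.
    env.setdefault("WANDB__SERVICE_WAIT", str(service_wait_s))
    # Keep runs local on nodes without internet access.
    env.setdefault("WANDB_MODE", "offline")
    return env
```

Call `configure_wandb_env()` at the top of the training script, before `import wandb`. Because it uses `setdefault`, anything exported in the Slurm script still wins.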

5

neu_jose t1_jaau1rk wrote

`Killed` can mean you're running out of memory, i.e., you have a memory leak and the OS killed your job.
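One cheap way to check this is to print the process's peak memory every epoch and see whether it keeps climbing. A minimal sketch using only the standard library (Unix only); `peak_rss_mib` is a hypothetical helper name:

```python
import resource
import sys


def peak_rss_mib():
    """Peak resident memory of this process in MiB.

    ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor
```

Something like `print(f"epoch {epoch}: peak RSS {peak_rss_mib():.0f} MiB")` at the end of each epoch is enough. If the number grows steadily toward the job's memory limit, the kill is likely an out-of-memory one rather than a logging bug. This covers host memory only, not GPU memory.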

3

pyonsu2 t1_jabzq81 wrote

What are potential alternatives?

MLflow, TensorBoard?

3

SatoshiNotMe t1_jacas00 wrote

I never liked wandb's aggressive, forced annual subscription pricing. I've been a happy user of ClearML for a year now. I only use their hosted service for experiment tracking; I don't have my own server.

No specific experience with long running jobs etc.

2

Riatekk t1_jaciqmg wrote

I have already used Comet.ml

2

ichiichisan t1_jacz676 wrote

Neptune.ai. I personally find wandb unbearable. Also, by default they collect all your training code, which I find extremely shady.

2