Submitted by not_particulary t3_11dk7nt in MachineLearning
I've loved using wandb because my workflow runs on a university-provided slurm cluster that doesn't allow any internet access on the compute nodes, and it's annoying to have to keep doing 2FA just to evaluate results. Its offline mode lets me sync runs with a little script on the login node and see everything in an online dashboard.
However, the software is super unstable. I've been losing jobs randomly to a mystery `Killed` error. It has also piled up runs and insisted on syncing all of them again, so I have to go in and manually delete old runs that have long been saved in the dashboard. And it already took me forever to figure out how to keep it from logging a separate run for each GPU I use to train. Is there anything that does the same thing, but is just more mature, so that I don't have to spend all my time squashing bugs related to data logging? I'd rather just focus on training these models, honestly.
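For anyone hitting the same one-run-per-GPU problem, here is a minimal sketch of the usual workaround, not necessarily what I ended up with: only the rank-0 process starts a run. It assumes the launcher exports `RANK` (as torchrun does) or `SLURM_PROCID` (as srun does); `is_main_process` and `init_tracking` are made-up helper names.

```python
import os


def is_main_process(env=None):
    """Return True only for the global rank-0 process."""
    env = os.environ if env is None else env
    # Assumption: torchrun exports RANK and srun exports SLURM_PROCID.
    # With neither set, treat the job as a single process (rank 0).
    rank = env.get("RANK", env.get("SLURM_PROCID", "0"))
    return int(rank) == 0


def init_tracking(project, env=None):
    """Start one offline wandb run on rank 0; return None on every other rank."""
    if not is_main_process(env):
        return None
    import wandb  # imported lazily so the other ranks never load it

    # mode="offline" writes the run to local disk so it can be synced later.
    return wandb.init(project=project, mode="offline")
```

Every rank calls `init_tracking(...)`, but only the one that gets a run object back should log to it.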
Jean-Porte t1_ja9ejvo wrote
You can increase some timeout parameter; it helps.
But I agree, I don't even understand why they don't log things locally when failing instead of KILLING A ONE WEEK JOB ON A HIGH END GPU SERVER (MORE THAN $100 WORTH OF COMPUTE TIME)
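A rough sketch of what "log locally when failing" could look like on the user side: wrap whatever logging call you use so that an exception gets written to a local JSONL file instead of taking the job down. `SafeLogger` and the file path are made up for illustration, and this only catches Python exceptions raised inside the logging call; it can't do anything about a process that gets killed from outside.

```python
import json
import time


class SafeLogger:
    """Wrap a logging callable; on failure, append the metrics to a local file."""

    def __init__(self, log_fn, fallback_path):
        self.log_fn = log_fn                # e.g. the tracker's log function
        self.fallback_path = fallback_path  # hypothetical local JSONL file
        self.failures = 0

    def log(self, metrics):
        try:
            self.log_fn(metrics)
        except Exception as exc:
            # Keep training alive: record the metrics and the error locally.
            self.failures += 1
            record = {"time": time.time(), "error": repr(exc), "metrics": metrics}
            with open(self.fallback_path, "a") as f:
                f.write(json.dumps(record) + "\n")
```

In the training loop you would then call `logger.log({"loss": loss})` instead of calling the tracker directly.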