Submitted by not_particulary t3_11dk7nt in MachineLearning
Jean-Porte t1_ja9ejvo wrote
You can increase the timeout parameter; it helps.
But I agree. I don't even understand why they don't log things locally on failure instead of KILLING A ONE-WEEK JOB ON A HIGH-END GPU SERVER (MORE THAN $100 WORTH OF COMPUTE TIME).
not_particulary OP t1_ja9g6o1 wrote
Yeah but it's super iffy. My exact script works most of the time, so idk even what to fix. That's why I just want to use something else, the software is obviously not stable.
Jean-Porte t1_ja9iik5 wrote
>Yeah but it's super iffy. My exact script works most of the time, so idk even what to fix. That's why I just want to use something else, the software is obviously not stable.
Run `export WANDB__SERVICE_WAIT=300` in the shell before starting the job.
I haven't had that problem since.
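If exporting the variable in the shell is awkward (for example, the job is started by a scheduler), the same setting can probably be applied from inside the training script. This is only a rough sketch: `configure_wandb_timeout` is a made-up helper name, and it assumes the variable has to be in the environment before `wandb` is imported and initialized.

```python
import os


def configure_wandb_timeout(seconds: int = 300) -> str:
    """Set WANDB__SERVICE_WAIT unless the shell already exported it.

    Hypothetical helper for illustration, not part of wandb itself.
    Returns the value that ends up in the environment.
    """
    # setdefault keeps a value exported in the shell and only fills in
    # the variable when it is missing.
    os.environ.setdefault("WANDB__SERVICE_WAIT", str(seconds))
    return os.environ["WANDB__SERVICE_WAIT"]


# Call this at the very top of the training script.
configure_wandb_timeout(300)

# Assumption: import wandb and call wandb.init() only after this point,
# so the library sees the longer wait when it starts up.
```

Using `setdefault` means a value set with `export` on the command line still wins, so the script default never overrides a per-run choice.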