Submitted by Still-Barracuda5245 t3_ysa5ti in deeplearning
arhetorical t1_iw16x4q wrote
It looks like a lot, but there's nothing especially weird in there. If you spend some time tuning your model, you'll probably end up with something like that too.
Adam - standard.
Linear warmup and decay - warming the learning rate up and then decaying it is very common. The exact shape varies, but cosine decay is often used.
Decreasing the update frequency - probably something you'd come up with after inspecting the training curve and trying to get a little more performance out of it.
Clipping the gradients - pretty common solution for "why isn't my model training properly". Maybe a bit hacky, but if it works, it works. (Rough sketch of how all four pieces fit together below.)
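For concreteness, here's a minimal PyTorch sketch of how those pieces often fit together. All the numbers (base lr, warmup steps, clip norm, accumulation factor) are placeholders, and I'm reading "decreasing the update frequency" as stepping the optimizer only every few batches via gradient accumulation - the original setup may mean something different.

```python
import math
import torch

# Stand-in model and fake data just so the sketch runs end to end.
model = torch.nn.Linear(128, 10)
loader = [(torch.randn(32, 128), torch.randint(0, 10, (32,))) for _ in range(64)]

optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)  # Adam, standard

# Linear warmup then cosine decay, counted in optimizer updates (placeholders).
warmup_steps, total_steps = 4, 16

def lr_lambda(step):
    if step < warmup_steps:
        return step / warmup_steps                      # linear warmup to the base lr
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay toward 0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

accum = 4  # "decreasing the update frequency": one optimizer step per 4 batches

for step, (x, y) in enumerate(loader):
    loss = torch.nn.functional.cross_entropy(model(x), y)
    (loss / accum).backward()                           # accumulate scaled gradients
    if (step + 1) % accum == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip gradients
        optimizer.step()
        optimizer.zero_grad()
        scheduler.step()
```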
The numbers themselves are usually just a matter of hand tuning and/or hyperparameter search.
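The search doesn't have to be fancy either - a toy random-search loop like the one below is often enough. `train_and_eval` here is a hypothetical function that trains one config and returns a validation score; the search ranges are made up.

```python
import random

def random_search(train_and_eval, n_trials=20):
    """Try n_trials random configs and keep the best one.

    train_and_eval is a hypothetical callable: config dict -> validation score.
    """
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = {
            "lr": 10 ** random.uniform(-5, -3),               # log-uniform lr
            "warmup_steps": random.choice([500, 1000, 2000]),
            "clip_norm": random.choice([0.5, 1.0, 5.0]),
        }
        score = train_and_eval(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```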