
arhetorical t1_iw16x4q wrote

It looks like a lot but there's nothing especially weird in there. If you spend some time tuning your model you'll probably end up with something like that too.

Adam - standard.

Linear warmup and decay - very common. The exact shape of the decay varies, but cosine decay is often used.
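In case it helps, here's a minimal sketch of that schedule in plain Python (the step counts and base learning rate are made-up numbers, not anyone's actual config):

```python
import math

def lr_at_step(step, base_lr=3e-4, warmup_steps=1000, total_steps=10000):
    # Linear warmup: ramp from 0 up to base_lr over warmup_steps.
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    # Cosine decay: smoothly anneal from base_lr down to 0.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Most frameworks ship this as a built-in scheduler, so in practice you'd just configure it rather than write it yourself.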

Decreasing the update frequency - probably something you'd come up with after inspecting the training curve and trying to get a little more performance out of it.

Clipping the gradients - pretty common solution for "why isn't my model training properly". Maybe a bit hacky but if it works, it works.
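The usual variant is clipping by global norm: if the combined L2 norm of all gradients exceeds a threshold, scale them all down together. A toy version with flat Python floats (the threshold of 1.0 is just an illustrative value):

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    # Compute the L2 norm over all gradient values together.
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return grads
    # Rescale uniformly so the global norm equals max_norm,
    # preserving the direction of the update.
    scale = max_norm / total_norm
    return [g * scale for g in grads]
```

Same idea as e.g. `torch.nn.utils.clip_grad_norm_`, just without the tensors.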

The numbers themselves are usually just a matter of hand tuning and/or hyperparameter search.
