trajo123 t1_j13gu3z wrote
Reply to comment by techni_24 in [D] Techniques to optimize a model when the loss over the training dataset has a Power Law type curve. by Dartagnjan
Reducing the batch size to 1 frees up memory, which can let you fit a bigger model and reach a lower loss on the training set. Note that when batch_size is set to 1, accumulate_grad_batches effectively takes over the role of batch_size, since it controls how many samples contribute to each optimizer step.
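A minimal sketch of what that looks like, assuming `accumulate_grad_batches` refers to the PyTorch Lightning Trainer flag (`MyModel` and `my_dataset` are hypothetical placeholders for your LightningModule and Dataset):

    import pytorch_lightning as pl
    from torch.utils.data import DataLoader

    # One sample per forward/backward pass keeps peak memory low.
    train_loader = DataLoader(my_dataset, batch_size=1, shuffle=True)

    trainer = pl.Trainer(
        # Gradients from 32 consecutive samples are accumulated before each
        # optimizer step, so 32 plays the role an actual batch_size of 32 would.
        accumulate_grad_batches=32,
        max_epochs=10,
    )
    trainer.fit(MyModel(), train_loader)

The trade-off is wall-clock time: accumulating over many size-1 batches gives the same effective batch size but loses the throughput of batched GPU computation.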