Desperate-Whereas50 t1_iqwzlgc wrote

I am not a transformer expert, so maybe this is a stupid question, but is this also true for transformer-based architectures? For example, BERT uses 12/24 transformer blocks. That doesn't sound as deep as, for example, a ResNet-256.
