_Arsenie_Boca_ t1_iusvc0e wrote
Parameter sharing across layers would achieve just that. In the ALBERT paper, the authors show that repeating a single layer multiple times actually leads to performance similar to having separate parameter matrices for each layer. I haven't heard a lot about this technique, but I assume that is because people mostly care about speed, which it does not improve (though it is a good match for your use case).
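To make the idea concrete, here is a rough sketch in plain Python. It is not ALBERT's actual implementation: `DenseLayer`, `build_stack`, `forward` and `count_unique_params` are made-up names, and a toy tanh layer stands in for a full transformer block. It only illustrates that reusing one layer object at every depth cuts the number of stored parameters while the forward pass still runs the same number of layer applications.

```python
import math
import random


class DenseLayer:
    """Toy layer y = tanh(W x + b); a stand-in for a transformer block."""

    def __init__(self, dim, rng):
        self.weight = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]
        self.bias = [0.0] * dim

    def __call__(self, x):
        return [
            math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(self.weight, self.bias)
        ]

    def num_params(self):
        return sum(len(row) for row in self.weight) + len(self.bias)


def build_stack(dim, depth, share, seed=0):
    """Return a list of `depth` layers; with share=True they are all the same object."""
    rng = random.Random(seed)
    if share:
        layer = DenseLayer(dim, rng)
        return [layer] * depth  # one set of weights, reused at every depth
    return [DenseLayer(dim, rng) for _ in range(depth)]


def forward(stack, x):
    # Same amount of compute either way: one layer application per depth step.
    for layer in stack:
        x = layer(x)
    return x


def count_unique_params(stack):
    # Count each distinct layer object once, so shared layers are not double counted.
    unique = {id(layer): layer for layer in stack}
    return sum(layer.num_params() for layer in unique.values())


shared = build_stack(dim=8, depth=12, share=True)
separate = build_stack(dim=8, depth=12, share=False)

print(count_unique_params(shared))    # 72  (8*8 weights + 8 biases, stored once)
print(count_unique_params(separate))  # 864 (72 parameters * 12 layers)
print(len(forward(shared, [0.5] * 8)), len(forward(separate, [0.5] * 8)))  # 8 8
```

Both stacks loop over 12 layers in `forward`, so sharing saves memory but not compute, which is the trade-off described above.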