Submitted by madmax_br5 t3_10mbct5 in MachineLearning
suflaj t1_j63bf1q wrote
Reply to comment by madmax_br5 in [D] Moving away from Unicode for more equal token representation across global languages? by madmax_br5
Well, for starters, it would probably perform worse due to so many redundant features, and it would be much slower.
Remember that the embedding layer carries a lot of overhead, since it is a V × d matrix. For a vocabulary of 250k entries and an embedding dimension of 768, for example, that is about 192M parameters just for the embedding layer. Maybe you can save some space with a sparse embedder, but find me a free implementation of sparse layers that works as well as dense ones. Beyond that, those 192M parameters take up, before any compression techniques, about 768 MB in float32. And that's just the weights in memory; the gradient, unless sparsified, will be another 768 MB per batch.
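For a rough sense of the numbers, here's a minimal PyTorch sketch. The 250k vocabulary and 768-dim embedding are the hypothetical sizes from above, and float32 storage is assumed:

```python
import torch.nn as nn

# Hypothetical sizes from the comment above: 250k vocabulary, 768-dim embeddings.
vocab_size = 250_000
embed_dim = 768

# Dense embedding table: a vocab_size x embed_dim weight matrix.
embedding = nn.Embedding(vocab_size, embed_dim)

num_params = vocab_size * embed_dim      # 192,000,000 parameters
fp32_megabytes = num_params * 4 / 1e6    # ~768 MB at 4 bytes per float32
print(f"{num_params / 1e6:.0f}M params, ~{fp32_megabytes:.0f} MB in fp32")

# A dense gradient for this layer is the same size again on every backward pass.
# Sparse gradients (sparse=True) only populate rows for tokens seen in the batch,
# but only a few optimizers (e.g. SGD, Adagrad, SparseAdam) support them.
sparse_embedding = nn.Embedding(vocab_size, embed_dim, sparse=True)
```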
This is without mentioning that you would likely need to increase the embedding dimension to account for a vocabulary roughly 8 times bigger.
madmax_br5 OP t1_j63mi7f wrote
Thank you, this is very helpful!