Submitted by alex_lite_21 t3_10kjhhb in MachineLearning
HateRedditCantQuitit t1_j5r5f69 wrote
Reply to comment by [deleted] in [D] are two linear layers better than one? by alex_lite_21
You can represent any `m x n` matrix as the product of an `m x k` matrix and a `k x n` matrix, as long as `k >= min(m, n)`. If `k` is smaller than that, the product is rank-constrained, so you're basically adding regularization.
Imagine you have some optimal `M` in `Y = M X`. If `A` and `B` are the right shapes (big enough in the `k` dimension), their product `A B` can represent that `M` exactly. If `k` is too small, they can't. But if the optimal `M` doesn't actually need a zillion degrees of freedom (i.e. it's low rank), then a small `k` bakes that restriction into the model, which is a form of regularization.
Look up linear bottlenecks.
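A minimal numpy sketch of the rank argument (shapes and seed are arbitrary, not from the comment): with `k >= min(m, n)` an SVD gives an exact factorization `M = A B`, while a smaller `k` forces `A B` to be low rank and can only approximate a full-rank `M`.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 6

# A random M is full rank (rank min(m, n)) almost surely.
M = rng.standard_normal((m, n))
U, s, Vt = np.linalg.svd(M, full_matrices=False)  # U: (m, r), s: (r,), Vt: (r, n), r = min(m, n)

# Exact factorization with k = min(m, n): A is m x k, B is k x n.
A = U * s   # scale columns of U by the singular values
B = Vt
assert np.allclose(A @ B, M)

# With k < min(m, n), rank(A @ B) <= k, so a full-rank M is out of reach.
# Truncating the SVD gives the best such rank-k approximation (Eckart-Young).
k = 3
A_small = (U * s)[:, :k]   # m x k
B_small = Vt[:k, :]        # k x n
approx = A_small @ B_small
print(np.linalg.matrix_rank(approx))       # 3: the bottleneck caps the rank
print(np.linalg.norm(M - approx) > 1e-8)   # True: can't represent M exactly
```

The second half is exactly the "small `k` = regularization" point: the bottleneck restricts the hypothesis class to rank-`k` maps.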