HateRedditCantQuitit t1_j5r5f69 wrote on January 24, 2023 at 11:55 PM

Reply to comment by [deleted] in [D] are two linear layers better than one? by alex_lite_21

You can represent any `m x n` matrix as the product of some `m x k` matrix and a `k x n` matrix, so long as k >= min(m, n). If k is less than that, the product can have rank at most k, so you're basically adding regularization.
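
If you want to check this numerically, here's a minimal NumPy sketch. The sizes, the seed, and the use of a thin SVD to build the factors are my own choices for illustration; any other exact factorization would do just as well.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
m, n = 6, 4                     # illustrative sizes
M = rng.standard_normal((m, n))

# Case k = min(m, n): the thin SVD gives an exact factorization M = A @ B.
k = min(m, n)
U, s, Vt = np.linalg.svd(M, full_matrices=False)  # U: m x k, s: (k,), Vt: k x n
A = U * s   # m x k, column j of U scaled by singular value s[j]
B = Vt      # k x n
exact = np.allclose(A @ B, M)

# Case k < min(m, n): any (m x k) @ (k x n) product has rank at most k.
k_small = 2
A_small = rng.standard_normal((m, k_small))
B_small = rng.standard_normal((k_small, n))
rank_small = np.linalg.matrix_rank(A_small @ B_small)
rank_full = np.linalg.matrix_rank(M)

print(exact, rank_small, rank_full)
```

The point of the second case is that no choice of `A_small` and `B_small` can get the rank above `k_small`, so a full-rank `M` is out of reach.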

Imagine you have some optimal M in Y = M X, and you parameterize it as M = A B, with A being `m x k` and B being `k x n`. If A and B are big enough in the k dimension, they can represent that M. If k is too small for that M (smaller than its rank), they can't learn it exactly, only a low-rank approximation. If the optimal M doesn't actually need a zillion degrees of freedom, then having a small k bakes that restriction into the model, which acts as regularization.
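
To make the "doesn't need a zillion degrees of freedom" case concrete, here's a rough sketch comparing a small-k fit on a low-rank M versus a full-rank M. The helper `best_rank_k` is a hypothetical name of mine; it uses a truncated SVD, which gives the best rank-k fit in Frobenius norm (Eckart-Young), as a stand-in for whatever training would find.

```python
import numpy as np

def best_rank_k(M, k):
    """Return (A, B), A: m x k and B: k x n, minimizing ||M - A @ B||_F via truncated SVD."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] * s[:k], Vt[:k, :]

rng = np.random.default_rng(0)  # arbitrary seed
m, n, k = 8, 6, 2               # illustrative sizes

# An "optimal M" that really is low rank (rank 2 by construction).
M_low = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
# An "optimal M" that is full rank (rank 6 almost surely for Gaussian entries).
M_full = rng.standard_normal((m, n))

A, B = best_rank_k(M_low, k)
err_low = np.linalg.norm(M_low - A @ B)    # essentially zero: k = 2 is enough

A, B = best_rank_k(M_full, k)
err_full = np.linalg.norm(M_full - A @ B)  # clearly nonzero: k = 2 is too small

print(err_low, err_full)
```

So a small k costs nothing when the target is low rank, and forces an approximation when it isn't.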

Look up linear bottlenecks.