Submitted by bjergerk1ng t3_11542tv in MachineLearning
When designing neural network architectures, it is common to think about "information flow": how information is propagated, where the "information bottlenecks" are, and so on. Another example is that some people use "information loss" to explain why transformers work better than RNNs.
It seems like most papers discuss this in a rather hand-wavy way. Has any work been done on formalising such ideas to better guide our understanding of various model architectures? What are the core ideas?
filipposML t1_j90m8x0 wrote
Maybe you are interested in Tishby's rate-distortion framing, i.e. the information bottleneck. E.g. in this paper they analyse how the mutual information in the hidden layers behaves as a neural network is trained to convergence.
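For context, the information bottleneck casts a layer T as a compressed representation of the input X that should stay predictive of the target Y: roughly, minimise I(X;T) − β·I(T;Y) over the encoder. The "information plane" analyses track estimates of I(X;T) and I(T;Y) per layer over training. Below is a minimal sketch (plain NumPy, not the estimator from any particular paper) of a binning-based estimate of I(T;Y); note that binning estimators are quite sensitive to the number of bins and the sample size, which is exactly why this line of work has been debated.

```python
# Minimal sketch: binning-based estimate of the mutual information I(T; Y)
# between a hidden layer T and discrete labels Y. Illustrative only; the
# bin count, layer sizes and data here are arbitrary assumptions.
import numpy as np

def entropy_from_counts(counts):
    """Shannon entropy (in bits) of the empirical distribution given raw counts."""
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(hidden, labels, n_bins=10):
    """Estimate I(T; Y) by discretizing each hidden unit into n_bins equal-width bins.

    hidden: (n_samples, n_units) array of activations T
    labels: (n_samples,) array of integer class labels Y
    """
    lo, hi = hidden.min(), hidden.max()
    binned = np.digitize(hidden, np.linspace(lo, hi, n_bins))
    # Treat each binned activation vector as one discrete symbol of T.
    t_ids = np.unique(binned, axis=0, return_inverse=True)[1].ravel()
    y_ids = np.unique(labels, return_inverse=True)[1]

    h_t = entropy_from_counts(np.bincount(t_ids))
    h_y = entropy_from_counts(np.bincount(y_ids))
    joint = np.bincount(t_ids * (y_ids.max() + 1) + y_ids)
    h_ty = entropy_from_counts(joint)
    return h_t + h_y - h_ty  # I(T;Y) = H(T) + H(Y) - H(T,Y)

# Toy usage: a random untrained "layer" applied to random inputs and labels.
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 10))
w = rng.normal(size=(10, 3))
t = np.tanh(x @ w)                 # stand-in for a layer's activations
y = rng.integers(0, 2, size=1000)  # stand-in for class labels
print(f"I(T;Y) estimate: {mutual_information(t, y):.3f} bits")
```

Repeating this for each layer at checkpoints during training is what produces the information-plane trajectories discussed in that line of work.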