
AlmightySnoo t1_j3d6wpo wrote

Haven't watched yet, but does he address the criticism from e.g. *The Principles of Deep Learning Theory* regarding the infinite-width limit?

59

hattulanHuumeparoni t1_j3dth0z wrote

Is there a summary of that criticism somewhere? I wouldn't want to read a full book.

12

AlmightySnoo t1_j3fa93g wrote

Excerpt from pages 8 and 9:

>Unfortunately, the formal infinite-width limit, n → ∞, leads to a poor model of deep neural networks: not only is infinite width an unphysical property for a network to possess, but the resulting trained distribution also leads to a mismatch between theoretical description and practical observation for networks of more than one layer. In particular, it’s empirically known that the distribution over such trained networks does depend on the properties of the learning algorithm used to train them. Additionally, we will show in detail that such infinite-width networks cannot learn representations of their inputs: for any input x, its transformations in the hidden layers will remain unchanged from initialization, leading to random representations and thus severely restricting the class of functions that such networks are capable of learning. Since nontrivial representation learning is an empirically demonstrated essential property of multilayer networks, this really underscores the breakdown of the correspondence between theory and reality in this strict infinite-width limit.
>
>From the theoretical perspective, the problem with this limit is the washing out of the fine details at each neuron due to the consideration of an infinite number of incoming signals. In particular, such an infinite accumulation completely eliminates the subtle correlations between neurons that get amplified over the course of training for representation learning.
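
To make the frozen-representation point concrete, here is a toy experiment (my own sketch, not from the book or the video, and it assumes PyTorch with arbitrary toy data, widths, and learning rate): train a one-hidden-layer ReLU net in the NTK parametrization with full-batch gradient descent and track how far the hidden features of a fixed input move from their values at initialization. As the width grows, the relative change should shrink (roughly like 1/√width), which is the "representations stay at initialization" behaviour the excerpt describes.

```python
import torch

torch.manual_seed(0)
d, n_train, steps, lr = 16, 64, 200, 0.1
X = torch.randn(n_train, d)           # toy regression data
y = torch.randn(n_train, 1)
x_probe = torch.randn(1, d)           # fixed input whose hidden features we track

for n in [128, 512, 2048, 8192]:      # hidden widths
    W = torch.randn(d, n, requires_grad=True)    # O(1) entries at initialization
    V = torch.randn(n, 1, requires_grad=True)

    def forward(inp):
        h = torch.relu(inp @ W / d ** 0.5)       # NTK scaling of the preactivations
        return h @ V / n ** 0.5, h               # NTK scaling of the output layer

    with torch.no_grad():
        _, h0 = forward(x_probe)                 # hidden features at initialization

    for _ in range(steps):                       # full-batch gradient descent
        pred, _ = forward(X)
        loss = ((pred - y) ** 2).mean()
        loss.backward()
        with torch.no_grad():
            for p in (W, V):
                p -= lr * p.grad
                p.grad = None

    with torch.no_grad():
        _, h1 = forward(x_probe)
        rel = ((h1 - h0).norm() / h0.norm()).item()
    print(f"width {n:5d}: relative change of hidden features after training = {rel:.4f}")
```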

30

rikkajounin t1_j3g9eki wrote

I’m only marginally familiar with Greg’s work (skimmed some papers and listened to his talks), but I believe both criticisms are addressed.

  1. Tensor Programs considers discrete-time (stochastic) learning algorithms stopped after T steps, rather than the continuous-time gradient flow run to convergence used in the standard neural tangent kernel literature. Hence I think the infinite-width limit does depend on the algorithm, and also on the order of the minibatches.

  2. They identify infinite-width limits where representation learning happens and ones where it doesn't. The behaviour changes depending on how the weight distributions of the input, output, and middle layers, and the learning rate, are scaled with width. In particular, they propose using the limit in which representations (they call them features) are maximally learned; in the neural tangent kernel limit, by contrast, the representations stay fixed at initialization. A rough numerical sketch of this contrast is below.
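
Here is one way to see the contrast numerically (my own toy PyTorch sketch, not Greg's code and not his general "abc" parametrization, just the simplest one-hidden-layer special case with arbitrary hyperparameters): keep the hidden-layer initialization fixed and change only how the output multiplier and the learning rate scale with width n. With NTK-style scaling (output × 1/√n, learning rate O(1)) the hidden features of a fixed input barely move, and move less as n grows; with a mean-field-style scaling (output × 1/n, learning rate O(n)), which is one way to get a feature-learning limit, they keep moving by an O(1) amount.

```python
import torch

torch.manual_seed(0)
d, n_train, steps = 16, 64, 200
X = torch.randn(n_train, d)          # toy regression data
y = torch.randn(n_train, 1)
x_probe = torch.randn(1, d)          # fixed input whose hidden features we track

def feature_change(n, out_scale, lr):
    """Train a one-hidden-layer ReLU net with full-batch gradient descent and
    return the relative movement of the probe input's hidden features."""
    W = torch.randn(d, n, requires_grad=True)
    V = torch.randn(n, 1, requires_grad=True)
    hidden = lambda inp: torch.relu(inp @ W / d ** 0.5)   # same init scaling in both cases
    with torch.no_grad():
        h0 = hidden(x_probe)
    for _ in range(steps):
        loss = ((hidden(X) @ V * out_scale - y) ** 2).mean()
        loss.backward()
        with torch.no_grad():
            for p in (W, V):
                p -= lr * p.grad
                p.grad = None
    with torch.no_grad():
        return ((hidden(x_probe) - h0).norm() / h0.norm()).item()

for n in [256, 1024, 4096]:
    kernel_like = feature_change(n, out_scale=n ** -0.5, lr=0.1)       # NTK-style scaling
    feat_learn = feature_change(n, out_scale=1.0 / n, lr=0.1 * n)      # mean-field-style scaling
    print(f"width {n:5d}:  NTK-style {kernel_like:.3f}   feature-learning-style {feat_learn:.3f}")
```

The particular numbers don't matter; the point is only that the choice of how the output multiplier and learning rate scale with n decides which limit you land in.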

8

IamTimNguyen OP t1_j3himnu wrote

Having spoken to Greg (who may or may not be chiming in), it appears that the authors of PDLT were considering only one kind of infinite-width limit (as evidenced by your use of the word "the"). Greg, however, considers a general family of them. The NTK limit indeed has no feature learning, whereas Greg analyzes entire families of limits, some of which do have feature learning, including one with maximal feature learning. So there is no contradiction with past works.

6

eyeofthephysics t1_j4f2w85 wrote


Hi Tim, just to add on to your comment: Sho Yaida (one of the co-authors of PDLT) also wrote a paper on the various infinite-width limits of neural nets, https://arxiv.org/abs/2210.04909. He constructs a family of infinite-width limits and shows that in some of them there is representation learning (and he also finds agreement with Greg's existing work).

1