Submitted by IamTimNguyen t3_105v7el in MachineLearning
rikkajounin t1_j3g9eki wrote
Reply to comment by AlmightySnoo in [R] Greg Yang's work on a rigorous mathematical theory for neural networks by IamTimNguyen
I’m only marginally familiar with Greg’s work (I’ve skimmed some papers and listened to his talks), but I believe that both criticisms are addressed.
- Tensor Programs consider discrete-time (stochastic) learning algorithms stopped after T steps, rather than continuous-time gradient flow run until convergence (the latter is what the standard neural tangent kernel literature uses). Hence I think the infinite-width limit depends on the algorithm and also on the order of the minibatches.
- They identify infinite-width limits where representation learning happens and limits where it doesn’t. The behaviour changes depending on how the parameters of the weight distributions of the input, output, and middle layers, and the learning rate, are scaled with width. In particular, they propose using a limit in which the representations (they call them features) are maximally learned. In contrast, in the neural tangent kernel limit the representation stays fixed.
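To make the second point concrete, here is a rough toy sketch of how the width scaling decides whether hidden features move. It is not the Tensor Programs construction: it assumes a two-layer tanh network with scalar input, and compares the standard NTK-style scaling (output multiplier 1/sqrt(width), width-independent learning rate) with a mean-field-style scaling (output multiplier 1/width, learning rate proportional to width), which as far as I understand is the kind of limit where features are learned. The function name `feature_change` and all of its parameters are made up for this illustration.

```python
import math
import random


def feature_change(width, scheme, lr=1.0, x=1.0, y=5.0, seed=0):
    """RMS change of the hidden pre-activations after one gradient step.

    Toy model (an assumption for illustration): f(x) = out_scale * sum_i v_i * tanh(w_i * x),
    trained on a single example (x, y) with loss 0.5 * (f - y)^2.
    The target y is chosen far from the initial output so the error stays O(1).
    """
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(width)]  # input-layer weights
    v = [rng.gauss(0.0, 1.0) for _ in range(width)]  # output-layer weights

    # Output multiplier and effective step size for each scheme (assumed toy choices).
    if scheme == "ntk":
        out_scale, step = 1.0 / math.sqrt(width), lr
    elif scheme == "mean_field":
        out_scale, step = 1.0 / width, lr * width
    else:
        raise ValueError(f"unknown scheme: {scheme}")

    h = [wi * x for wi in w]  # hidden pre-activations ("features") at initialization
    f = out_scale * sum(vi * math.tanh(hi) for vi, hi in zip(v, h))
    err = f - y  # derivative of the loss with respect to the output f

    # Gradient of the loss with respect to each input weight w_i.
    grad_w = [err * out_scale * vi * (1.0 - math.tanh(hi) ** 2) * x
              for vi, hi in zip(v, h)]

    # Only the input weights determine the hidden features, so only they are updated here.
    new_h = [(wi - step * gi) * x for wi, gi in zip(w, grad_w)]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(new_h, h)) / width)


if __name__ == "__main__":
    for n in (100, 10_000):
        print(n, feature_change(n, "ntk"), feature_change(n, "mean_field"))
```

Under these assumptions, the NTK-style number should shrink roughly like 1/sqrt(width) as the network gets wider, while the mean-field-style number should stay of roughly the same size, which matches the fixed-representation versus feature-learning distinction above.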