IamTimNguyen OP t1_j3himnu wrote
Reply to comment by AlmightySnoo in [R] Greg Yang's work on a rigorous mathematical theory for neural networks by IamTimNguyen
Having spoken to Greg (who may or may not be chiming in), it appears that the authors of PDLT were only considering one kind of infinite-width limit (as evidenced by your use of the word "the"), whereas Greg considers a general family of them. The NTK limit indeed has no feature learning, but Greg analyzes entire families of limits, some of which do have feature learning, including one with maximal feature learning. So there is no contradiction with past works.
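For concreteness, here is a rough sketch of the abc parametrization that indexes this family of infinite-width limits (my paraphrase of the Tensor Programs papers, not a quote from the episode; in the talk the "a" scaling is absorbed into "c"):

```latex
% abc parametrization of a width-n network (sketch):
%   weights:        W^l = n^{-a_l} w^l
%   initialization: entries of w^l drawn i.i.d. from N(0, n^{-2 b_l})
%   learning rate:  eta = eta_0 * n^{-c}
% Different choices of (a_l, b_l, c) give different infinite-width
% limits: the NTK limit is one such choice (kernel regime, no feature
% learning), while muP is the choice with maximal feature learning.
\[
  W^l = n^{-a_l} w^l, \qquad
  w^l_{ij} \sim \mathcal{N}\!\left(0,\; n^{-2 b_l}\right), \qquad
  \eta = \eta_0\, n^{-c}.
\]
```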
IamTimNguyen OP t1_j3cybcu wrote
Part I. Introduction
00:00:00 : Biography
00:02:36 : Harvard hiatus 1: Becoming a DJ
00:07:40 : I really want to make AGI happen (back in 2012)
00:09:00 : Harvard math applicants and culture
00:17:33 : Harvard hiatus 2: Math autodidact
00:21:51 : Friendship with Shing-Tung Yau
00:24:06 : Landing a job at Microsoft Research: Two Fields Medalists are all you need
00:26:13 : Technical intro: The Big Picture
00:28:12 : Whiteboard outline
Part II. Classical Probability Theory
00:37:03 : Law of Large Numbers
00:45:23 : Tensor Programs Preview
00:47:25 : Central Limit Theorem
00:56:55 : Proof of CLT: Moment method
01:02:00 : Moment method explicit computations
Part III. Random Matrix Theory
01:12:45 : Setup
01:16:55 : Moment method for RMT
01:21:21 : Wigner semicircle law
Part IV. Tensor Programs
01:31:04 : Segue using RMT
01:44:22 : TP punchline for RMT
01:46:22 : The Master Theorem (the key result of TP)
01:55:02 : Corollary: Reproof of RMT results
01:56:52 : General definition of a tensor program
Part V. Neural Networks and Machine Learning
02:09:09 : Feed forward neural network (3 layers) example
02:19:16 : Neural network Gaussian Process
02:23:59 : Many large N limits for neural networks
02:27:24 : abc parametrizations (Note: "a" is absorbed into "c" here): variance and learning rate scalings
02:36:54 : Geometry of space of abc parametrizations
02:39:50 : Kernel regime
02:41:35 : Neural tangent kernel
02:43:40 : (No) feature learning
02:48:42 : Maximal feature learning
02:52:33 : Current problems with deep learning
02:55:01 : Hyperparameter transfer (muP)
03:00:31 : Wrap up
Submitted by IamTimNguyen t3_105v7el in MachineLearning
IamTimNguyen OP t1_j3hj6ef wrote
Reply to comment by cdsmith in [R] Greg Yang's work on a rigorous mathematical theory for neural networks by IamTimNguyen
Great question, and you're right, we did not cover this (alas, we could not cover everything even with 3 hours). You can unroll NN training as a sequence of gradient updates. Each update is a nonlinear function of the weights at initialization (e.g. the first update is w -> w - grad_w(L), where w is randomly initialized). Unrolling the entire training procedure therefore yields a large composition of such nonlinear functions of the initial weights. The Master Theorem, from a bird's eye view, is precisely the tool for handling such a computation graph (all such unrolls are themselves tensor programs). This is how Greg's work covers NN training.
Note: This is just a cartoon picture, of course. The updated weights are highly correlated in the unrolled computation graph (weight updates in a given layer depend on weights from all layers), so one has to analyze such a graph carefully.
Update: Actually, Greg did discuss this unrolling of the computation graph for NN training. https://www.youtube.com/watch?v=1aXOXHA7Jcw&t=8540s
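To make the unrolling picture concrete, here is a minimal numpy sketch (my own toy illustration, not code from the episode or from Greg's papers) in which a few SGD steps turn the trained weights into an explicit nonlinear function of the random initialization:

```python
# Toy sketch of "unrolling" SGD: the trained weights are a nonlinear
# function of the randomly initialized weights, obtained by composing
# the update rule. Tensor Programs analyze this kind of computation
# graph as the width n grows.
import numpy as np

rng = np.random.default_rng(0)
n = 512                                   # width
x = rng.standard_normal(n) / np.sqrt(n)   # a fixed input
y = 1.0                                   # a fixed target
W0 = rng.standard_normal(n)               # random initialization

def loss_grad(w):
    # gradient of the squared loss for f(x) = tanh(w . x), a one-neuron "network"
    pre = w @ x
    f = np.tanh(pre)
    dfdw = (1.0 - np.tanh(pre) ** 2) * x
    return (f - y) * dfdw

def unroll(w_init, lr=0.1, steps=3):
    # w_t = w_{t-1} - lr * grad(w_{t-1}); each step is a nonlinear
    # function of w_init, so the result is a composition of such maps.
    w = w_init
    for _ in range(steps):
        w = w - lr * loss_grad(w)
    return w

W3 = unroll(W0)  # W3 is a highly correlated nonlinear function of W0
```

The interesting question is how statistics of W3 (and of the network's outputs) behave as n grows, and that is exactly the kind of large-N limit the Master Theorem controls.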