
IamTimNguyen OP t1_j3hj6ef wrote

Great question, and you're right, we did not cover this (alas, we could not cover everything even with 3 hours). You can unroll NN training as a sequence of gradient updates. Each gradient update applies a nonlinear correction to the weights at initialization (e.g. the first update is w -> w - grad_w(L), where w is randomly initialized). Unrolling the entire training procedure therefore yields a large composition of such nonlinear functions of the weights at initialization. The Master Theorem, from a bird's eye view, is precisely the tool for handling such a computation graph (all such unrolls are themselves tensor programs). This is how Greg's work covers NN training.
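(For the curious, here is a rough paraphrase of the Master Theorem as I understand it from the Tensor Programs papers, with the technical hypotheses omitted: coordinate-wise averages of the vectors produced by a tensor program converge to deterministic expectations.)

```latex
% Rough paraphrase (technical hypotheses omitted): if h^1, ..., h^M in R^n are
% the vectors produced by a tensor program and \psi is a suitably regular
% test function, then almost surely as n -> infinity
\[
  \frac{1}{n}\sum_{\alpha=1}^{n} \psi\!\left(h^1_\alpha,\dots,h^M_\alpha\right)
  \;\longrightarrow\;
  \mathbb{E}\,\psi\!\left(Z^{h^1},\dots,Z^{h^M}\right),
\]
% where the random variables Z^{h^i} are constructed recursively from the
% structure of the program.
```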

Note: This unrolling description is of course just a cartoon picture. The updated weights become highly correlated in the unrolled computation graph (weight updates in a given layer depend on the weights of all layers), and one has to analyze such a graph carefully.

Update: Actually, Greg did discuss this unrolling of the computation graph for NN training. https://www.youtube.com/watch?v=1aXOXHA7Jcw&t=8540s
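To make the unrolling concrete, here is a minimal sketch of my own (a toy one-hidden-layer network with squared loss; this is not code from the talk): after all the update steps, the trained weights are a single nested nonlinear function of the random initialization, which is the kind of computation graph that can be written as a tensor program.

```python
import numpy as np

# Minimal sketch (toy example, not code from the talk): unrolled SGD on a
# one-hidden-layer network f(x) = v . relu(W x). After all update steps, the
# trained weights are one big nonlinear function of the random initialization
# -- exactly the kind of computation graph a tensor program expresses.

def grad_loss(W, v, x, y):
    """Gradients of the squared loss 0.5 * (v . relu(W x) - y)^2."""
    h = np.maximum(W @ x, 0.0)               # hidden layer after relu
    err = v @ h - y                          # prediction error (scalar)
    dv = err * h                             # dL/dv
    dW = err * np.outer(v * (h > 0), x)      # dL/dW via backprop through relu
    return dW, dv

def unroll_training(W, v, data, lr=0.1):
    """Compose one gradient update per example, starting from the random init."""
    for x, y in data:
        dW, dv = grad_loss(W, v, x, y)
        W, v = W - lr * dW, v - lr * dv      # w -> w - lr * grad_w(L)
    return W, v                              # a nonlinear function of the init

rng = np.random.default_rng(0)
n, d = 256, 4                                # width n (TP studies n -> infinity)
W0 = rng.normal(size=(n, d)) / np.sqrt(d)    # random initialization
v0 = rng.normal(size=n) / np.sqrt(n)
data = [(rng.normal(size=d), rng.normal()) for _ in range(10)]
W_T, v_T = unroll_training(W0, v0, data)
```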


IamTimNguyen OP t1_j3himnu wrote

Having spoken to Greg (who may or may not be chiming in), it appears that the authors of PDLT consider only one kind of infinite-width limit (as suggested by your use of the word "the"), whereas Greg considers a general family of them. The NTK limit indeed has no feature learning, but Greg analyzes entire families of limits, some of which do have feature learning, including in particular one with maximal feature learning. So there is no contradiction with respect to past works.
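(For context, here is my paraphrase of the abc-parametrization setup from Greg's papers, which the timestamps below also cover; the exact conventions, e.g. how the exponents are attached to each layer, may differ slightly from how it is presented in the talk.)

```latex
% Sketch of the abc parametrization (my paraphrase of the Tensor Programs
% papers; conventions may differ slightly from the talk). For a layer of
% width n, the weights and learning rate are scaled as
\[
  W^{\ell} = n^{-a_{\ell}}\, w^{\ell}, \qquad
  w^{\ell}_{ij} \sim \mathcal{N}\!\left(0,\; n^{-2 b_{\ell}}\right), \qquad
  \text{learning rate } \eta\, n^{-c}.
\]
% Different choices of the exponents (a_\ell, b_\ell, c) give different
% infinite-width limits: the NTK limit sits in the kernel regime (no feature
% learning), while the maximal-update parametrization ("muP") gives maximal
% feature learning.
```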


IamTimNguyen OP t1_j3cybcu wrote

Part I. Introduction

00:00:00 : Biography

00:02:36 : Harvard hiatus 1: Becoming a DJ

00:07:40 : I really want to make AGI happen (back in 2012)

00:09:00 : Harvard math applicants and culture

00:17:33 : Harvard hiatus 2: Math autodidact

00:21:51 : Friendship with Shing-Tung Yau

00:24:06 : Landing a job at Microsoft Research: Two Fields Medalists are all you need

00:26:13 : Technical intro: The Big Picture

00:28:12 : Whiteboard outline

Part II. Classical Probability Theory

00:37:03 : Law of Large Numbers

00:45:23 : Tensor Programs Preview

00:47:25 : Central Limit Theorem

00:56:55 : Proof of CLT: Moment method

01:02:00 : Moment method explicit computations

Part III. Random Matrix Theory

01:12:45 : Setup

01:16:55 : Moment method for RMT

01:21:21 : Wigner semicircle law

Part IV. Tensor Programs

01:31:04 : Segue using RMT

01:44:22 : TP punchline for RMT

01:46:22 : The Master Theorem (the key result of TP)

01:55:02 : Corollary: Reproof of RMT results

01:56:52 : General definition of a tensor program

Part V. Neural Networks and Machine Learning

02:09:09 : Feed forward neural network (3 layers) example

02:19:16 : Neural network Gaussian Process

02:23:59 : Many large N limits for neural networks

02:27:24 : abc parametrizations (Note: "a" is absorbed into "c" here): variance and learning rate scalings

02:36:54 : Geometry of space of abc parametrizations

02:39:50 : Kernel regime

02:41:35 : Neural tangent kernel

02:43:40 : (No) feature learning

02:48:42 : Maximal feature learning

02:52:33 : Current problems with deep learning

02:55:01 : Hyperparameter transfer (muP)

03:00:31 : Wrap up
