
IamTimNguyen OP t1_j3hj6ef wrote

Great question, and you're right that we did not cover this (alas, we could not cover everything even in 3 hours). You can unroll NN training as a sequence of gradient updates. Each update adds a nonlinear function of the weights to the weights at initialization (e.g. the first update is w -> w - grad_w(L), where w is randomly initialized and the learning rate is suppressed). Unrolling the entire training run gives a large composition of such nonlinear functions of the initial weights. The Master Theorem, from a bird's-eye view, is precisely the tool for handling such a computation graph (all such unrolled graphs are themselves tensor programs). This is how Greg's work covers NN training.

Note: This is just a cartoon picture, of course. The updated weights are highly correlated in the unrolled computation graph (the weight updates in a given layer depend on the weights from all layers), so one has to do a careful analysis of such a graph.
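To make the unrolling concrete, here is a minimal sketch in plain Python. It is only a toy illustration, not anything from Greg's papers: the two-weight "network" f(x) = v * tanh(w * x), the squared loss, the learning rate `lr`, and all function names are assumptions made up for this example. It shows that a few gradient steps form one composite function of the initial weights (w0, v0), and that each layer's update depends on the other layer's weight.

```python
import math

def grads(w, v, x, y):
    """Gradients of L = 0.5 * (v * tanh(w * x) - y)**2 (hypothetical toy model)."""
    h = math.tanh(w * x)
    err = v * h - y
    grad_w = err * v * (1.0 - h * h) * x  # depends on v: the layers are coupled
    grad_v = err * h                      # depends on w through h
    return grad_w, grad_v

def loss(w, v, x, y):
    """Squared loss of the toy model on a single example (x, y)."""
    return 0.5 * (v * math.tanh(w * x) - y) ** 2

def unrolled_training(w0, v0, x, y, lr=0.1, steps=3):
    """Apply `steps` gradient updates starting from (w0, v0).

    The return value is a single nonlinear function of the initial weights:
    the composition of `steps` copies of the map (w, v) -> (w, v) - lr * grad.
    """
    w, v = w0, v0
    for _ in range(steps):
        gw, gv = grads(w, v, x, y)
        w, v = w - lr * gw, v - lr * gv
    return w, v

if __name__ == "__main__":
    w0, v0 = 0.5, -0.3  # stand-ins for a random initialization
    wT, vT = unrolled_training(w0, v0, x=1.0, y=1.0)
    print("trained weights:", wT, vT)
    print("loss before/after:", loss(w0, v0, 1.0, 1.0), loss(wT, vT, 1.0, 1.0))
```

In the real setting w and v are large random matrices rather than scalars, and the point of the Master Theorem is to describe the behavior of this kind of unrolled graph as the width grows.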

Update: Actually, Greg did discuss this unrolling of the computation graph for NN training. https://www.youtube.com/watch?v=1aXOXHA7Jcw&t=8540s
