danman966

danman966 t1_ixgzwfh wrote

Back propagation is essentially applying the chain rule a bunch of times. Since Neural nets and other functions are just applying basic functions loads of times on top of a variable x, to get some output z, e.g. z = f(g(h(x))), then the derivative of z with respect to the parameters of f, g, and h, is going to be the chain rule applied three times. Since pytorch/tensorflow store all derivatives of their functions, e.g. activation functions or linear layers in a neural network, it is easy for the software to compute each gradient.

We need the gradient of course because that is how we update our parameter values, with gradient descent or something similar.

1