Submitted by cruddybanana1102 t3_zyclre in MachineLearning
Saw this tweet claiming that, with some "quirky tricks", Nesterov's method can be obtained as a special case of PID control. I did a Google search, but it returned nothing relevant.
Is this a well-known result in optimisation that I'm not aware of? Or have I just not looked hard enough? If someone can point me to relevant references, that would be great.
TheNovicePhilomath t1_j25wla1 wrote
I don't think this is a standard result, or at least I haven't encountered it. After some digging, this paper seems to have a good explanation of the similarities between Nesterov and PID (section 3).
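To make the resemblance a bit more concrete, here is a minimal sketch of one common way to see it. This is not necessarily the construction used in the tweet or the paper. Eliminating the auxiliary sequence from Nesterov's update gives a single recursion in the look-ahead point `y`: `y_next = y + beta*(y - y_prev) - alpha*g - alpha*beta*(g - g_prev)`. If the gradient `g` is treated as the error signal, then loosely speaking `-alpha*g` acts on the current error, the momentum term carries an exponentially weighted sum of past updates (a leaky integral), and the last term acts on the change in error (derivative-like). The quadratic objective, the values of `A`, `b`, `alpha`, `beta`, and the function names below are all illustrative assumptions.

```python
import numpy as np

def grad(y, A, b):
    # Gradient of the toy quadratic f(y) = 0.5 * y^T A y - b^T y (assumed objective).
    return A @ y - b

def nesterov_standard(A, b, x0, alpha, beta, steps):
    # Textbook two-sequence form: look ahead to y, then take a gradient step from y.
    x_prev, x = x0.copy(), x0.copy()
    ys = []
    for _ in range(steps):
        y = x + beta * (x - x_prev)
        ys.append(y)
        x_prev, x = x, y - alpha * grad(y, A, b)
    return np.array(ys)

def nesterov_pid_like(A, b, x0, alpha, beta, steps):
    # Same iterates written as one recursion in y with an explicit
    # gradient-difference ("derivative-like") term.
    y_prev, y = x0.copy(), x0.copy()
    g_prev = np.zeros_like(x0)  # convention: no previous gradient at the first step
    ys = []
    for _ in range(steps):
        ys.append(y)
        g = grad(y, A, b)
        y_next = (y
                  + beta * (y - y_prev)          # momentum: leaky sum of past updates
                  - alpha * g                    # acts on the current gradient
                  - alpha * beta * (g - g_prev)) # acts on the change in gradient
        y_prev, y, g_prev = y, y_next, g
    return np.array(ys)

# Hypothetical test problem and hyperparameters.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x0 = np.zeros(2)

ys_standard = nesterov_standard(A, b, x0, alpha=0.1, beta=0.9, steps=50)
ys_pid_like = nesterov_pid_like(A, b, x0, alpha=0.1, beta=0.9, steps=50)

# Largest elementwise gap between the two trajectories (should be rounding error only).
max_gap = np.max(np.abs(ys_standard - ys_pid_like))
print(max_gap)
```

The two functions generate the same sequence of look-ahead points, so the rewrite is just algebra. Whether the three terms map cleanly onto P, I and D gains depends on which signal you call the controller output, which is presumably where the "quirky tricks" come in.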
Also, the idea behind the paper linked in the Twitter thread just blew my mind. So obvious, yet beautiful: a Kalman filter used as an optimiser to estimate network parameters from noisy loss measurements. Great stuff.