
master3243 t1_irdi8o7 wrote

In the paper, in Appendix A.4 where the loss and gradients are derived,

I don't see how this step is true (eq. 14): https://i.imgur.com/ZuN2RC2.png

The RHS seems to equal (2 * alpha_t) * LHS.

I'm also unsure how this step in the same equation follows: https://i.imgur.com/DHixElF.png
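Without the paper's actual eq. 14 terms in front of me, here's a generic way to sanity-check a suspected stray factor like 2 * alpha_t symbolically with sympy. The `lhs`/`rhs` expressions below are placeholder stand-ins, not the paper's real terms; you'd substitute the actual sides of eq. 14:

```python
import sympy as sp

alpha_t, x = sp.symbols('alpha_t x', positive=True)

# Placeholder stand-ins for the two sides of the derivation step;
# replace with the actual terms from eq. 14 to run the real check.
lhs = alpha_t * x**2
rhs = 2 * alpha_t**2 * x**2

# If the ratio simplifies to 2*alpha_t, the RHS carries an extra
# factor of (2 * alpha_t) relative to the LHS.
print(sp.simplify(rhs / lhs))  # -> 2*alpha_t
```

A quick symbolic check like this won't tell you which side of the derivation is wrong, but it does confirm whether the two sides actually differ by the suspected factor.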

9

dkangx t1_irdmnsp wrote

Well, someone’s gonna fire it up and test it out, and we’ll see if it’s real

2

master3243 t1_irdoq0o wrote

Empirical results don't necessarily prove theoretical results. In fact, most deep learning research (mine included) is trying out different things based on intuition and past experience of what has worked, until you have something that achieves really good results.

Then you attempt to show formally and theoretically why the thing you did is mathematically justified.

And often enough, once you start working through the formal math, you get ideas for further improvements or different paths to take with your model, so it's a back-and-forth.

However, someone could just as easily get good results with a certain architecture/loss and then fail to justify it formally, skip certain steps, or make an invalid jump from one step to the next, resulting in theoretical work that is wrong but works great empirically.

17