YouAgainShmidhoobuh t1_jd2ojml wrote
Reply to comment by Xotchkass in [D] Simple Questions Thread by AutoModerator
If you mean the context/sequence length, it's 2048 (https://github.com/facebookresearch/llama/pull/127).
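For reference, that limit is just a model hyperparameter in the reference implementation. A rough sketch of the kind of config it lives in (Python; the field names and the non-2048 values here are illustrative from memory, so check model.py in the repo for the exact definition):

```python
from dataclasses import dataclass

@dataclass
class ModelArgs:
    # illustrative hyperparameters; the actual defaults depend on model size
    dim: int = 4096
    n_layers: int = 32
    n_heads: int = 32
    max_seq_len: int = 2048  # the context/sequence length being asked about
```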
YouAgainShmidhoobuh t1_jd2n2v5 wrote
Reply to [D]: Vanishing Gradients and Resnets by Blutorangensaft
ResNets do not tackle the vanishing gradient problem; they address the degradation problem. The authors specifically note that vanishing gradients were already largely dealt with, by BatchNorm in particular. So removing BatchNorm from the equation will most likely bring vanishing gradients back.
I am assuming you are taking a WGAN-GP approach, since that would explain the gradient penalty violation. In that case, use LayerNorm instead of BatchNorm, as indicated here: https://github.com/LynnHo/DCGAN-LSGAN-WGAN-GP-DRAGAN-Tensorflow-2/issues/3
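A minimal sketch of what that swap looks like in a critic's residual block (PyTorch; the layer sizes and block layout are illustrative, not taken from the linked issue). The point is that the gradient penalty is a per-sample constraint, and BatchNorm mixes statistics across the batch, so a per-sample normalization such as LayerNorm is used instead:

```python
import torch.nn as nn

class CriticResBlock(nn.Module):
    """Residual block for a WGAN-GP critic using per-sample normalization."""
    def __init__(self, channels, spatial_size):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # LayerNorm over (C, H, W) instead of BatchNorm2d: the gradient penalty
        # is defined per sample, and BatchNorm couples samples within a batch.
        self.norm1 = nn.LayerNorm([channels, spatial_size, spatial_size])
        self.norm2 = nn.LayerNorm([channels, spatial_size, spatial_size])
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):
        h = self.act(self.norm1(self.conv1(x)))
        h = self.norm2(self.conv2(h))
        return self.act(x + h)  # the skip connection keeps gradients flowing
```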
YouAgainShmidhoobuh t1_jc9a44k wrote
Reply to comment by respeckKnuckles in [D] On research directions being "out of date" by redlow0992
Not so sure about this. It seems like a tempting argument, but the GPT-4 report gives no details at all about the model architecture or training approach, so there is no way to make a fair comparison of any kind.
YouAgainShmidhoobuh t1_iyc24s6 wrote
Transformers gain the most when you compare the size of the training corpus against log-likelihood performance. It is also in the regime of large datasets and long sequence lengths that transformers really stand out.
YouAgainShmidhoobuh t1_is9mo5a wrote
Reply to comment by evanthebouncy in [P] a minimalist guide to program synthesis by evanthebouncy
Prolog as program synthesis is one I've not heard of yet, but it does make sense.
YouAgainShmidhoobuh t1_jd2qmh1 wrote
Reply to comment by darthstargazer in [D] Simple Questions Thread by AutoModerator
Not entirely the same thing. VAEs offer approximate likelihood estimation, but not exact. The difference here is key: VAEs do not optimize the log-likelihood directly, but do so through the evidence lower bound, an approximation. Flow-based methods are exact: we go from an easy, tractable distribution to a more complex one, guaranteeing at each step that the learned distribution is actually a legitimate distribution through the change of variables theorem.
Of course, both (try to) learn some probability distribution over the training data, and that is how they differ from GAN approaches, which do not directly learn a probability distribution.
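A minimal sketch of the exact-likelihood side of that contrast (PyTorch, with a single toy affine flow layer; everything here is illustrative and not taken from the linked paper). The flow yields log p(x) exactly via the change-of-variables formula, whereas a VAE would only give you the ELBO, a lower bound on the same quantity:

```python
import torch

# One invertible, elementwise affine map; real flows stack many such layers.
log_scale = torch.randn(2, requires_grad=True)
shift = torch.randn(2, requires_grad=True)

def flow(x):
    """Map data to the base space and return log |det Jacobian| of the map."""
    z = (x - shift) * torch.exp(-log_scale)
    log_det = -log_scale.sum()  # Jacobian of an elementwise affine map is diagonal
    return z, log_det

def exact_log_likelihood(x):
    z, log_det = flow(x)
    base = torch.distributions.Normal(0.0, 1.0)
    # change of variables: log p_X(x) = log p_Z(f(x)) + log |det df/dx|
    return base.log_prob(z).sum(-1) + log_det

x = torch.randn(5, 2)
print(exact_log_likelihood(x))  # exact log-densities that can be maximized directly
```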
For more insight you might want to look at https://openreview.net/pdf?id=HklKEUUY_E