crude2refined t1_iy9lbuv wrote
Reply to [R] The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable - LessWrong by visarga
To be fair, the SVD of the weight matrices of any network architecture trained on a dataset will exhibit such properties. See, for example, the emergence of "V1-like" features in MLPs, CNNs, etc., when training on image datasets.
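(For readers who want to poke at this themselves: the top singular directions of a trained weight matrix can be extracted without any linear-algebra library via power iteration on WᵀW. A minimal pure-Python sketch, illustration only; in practice you would use numpy.linalg.svd or torch.linalg.svd on the actual checkpoint weights.)

```python
import math

def top_singular_vector(W, iters=200):
    """Power iteration on W^T W: returns the largest singular value of W
    and the corresponding top right singular vector (up to sign)."""
    m, n = len(W), len(W[0])
    v = [1.0 / math.sqrt(n)] * n  # arbitrary unit starting vector
    for _ in range(iters):
        # u = W v
        u = [sum(W[i][j] * v[j] for j in range(n)) for i in range(m)]
        # w = W^T u  (one step of power iteration on W^T W)
        w = [sum(W[i][j] * u[i] for i in range(m)) for j in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # sigma = ||W v|| for the converged right singular vector v
    sigma = math.sqrt(sum(
        sum(W[i][j] * v[j] for j in range(n)) ** 2 for i in range(m)))
    return sigma, v
```

Inspecting the recovered singular vectors of the first-layer weights is exactly how the V1-like (Gabor-style) features show up for image-trained models.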
crude2refined t1_jcwdk7j wrote
Reply to [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
In Google Colab, I'm not able to reproduce the benefits of PyTorch 2 vs. 1 with scaled_dot_product_attention. Is there anything I'm missing? Please see the attached image: https://imgur.com/72FKcp1
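(For context on what is being benchmarked: torch.nn.functional.scaled_dot_product_attention fuses the computation softmax(QKᵀ/√d)V into one kernel. A minimal pure-Python reference of that math, with no masking or dropout and no PyTorch dependency, illustration only:)

```python
import math

def scaled_dot_product_attention(Q, K, V):
    """Reference softmax(Q K^T / sqrt(d)) V over 2-D lists of floats:
    the same math PyTorch 2's fused kernel computes (no mask, no dropout)."""
    d = len(Q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in Q:
        # scaled dot products of this query against every key
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        # numerically stable softmax over the scores
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # attention-weighted sum of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

The speedup in PyTorch 2 comes from the fused backend (e.g. FlashAttention) chosen at runtime, not from different math, so whether a benefit appears depends on the GPU, dtype, and shapes used in the benchmark.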