crude2refined t1_iy9lbuv wrote
Reply to [R] The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable - LessWrong by visarga
To be fair, the SVD of the weight matrices of any network architecture trained on a dataset will exhibit such properties. See, for example, the emergence of "V1-like" features in MLPs, CNNs, etc., when training on image datasets.
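(For readers who want to poke at this themselves: the top singular directions of a trained weight matrix can be extracted without any linear-algebra library via power iteration on WᵀW. A minimal pure-Python sketch, illustration only; in practice you would use numpy.linalg.svd or torch.linalg.svd on the actual checkpoint weights.)

```python
import math

def top_singular_vector(W, iters=200):
    """Power iteration on W^T W: returns the largest singular value of W
    and the corresponding top right singular vector (up to sign)."""
    m, n = len(W), len(W[0])
    v = [1.0 / math.sqrt(n)] * n  # arbitrary unit starting vector
    for _ in range(iters):
        # u = W v
        u = [sum(W[i][j] * v[j] for j in range(n)) for i in range(m)]
        # w = W^T u  (one step of power iteration on W^T W)
        w = [sum(W[i][j] * u[i] for i in range(m)) for j in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # sigma = ||W v|| for the converged right singular vector v
    sigma = math.sqrt(sum(
        sum(W[i][j] * v[j] for j in range(n)) ** 2 for i in range(m)))
    return sigma, v
```

Inspecting the recovered singular vectors of the first-layer weights is exactly how the V1-like (Gabor-style) features show up for image-trained models.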
crude2refined t1_jcwdk7j wrote
Reply to [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
In Google Colab, I'm not able to reproduce the benefits of PyTorch 2 vs. 1 with scaled_dot_product_attention. Is there anything I'm missing? Please see the attached image: https://imgur.com/72FKcp1
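(For context on what is being benchmarked: torch.nn.functional.scaled_dot_product_attention fuses the computation softmax(QKᵀ/√d)V into one kernel. A minimal pure-Python reference of that math, with no masking or dropout and no PyTorch dependency, illustration only:)

```python
import math

def scaled_dot_product_attention(Q, K, V):
    """Reference softmax(Q K^T / sqrt(d)) V over 2-D lists of floats:
    the same math PyTorch 2's fused kernel computes (no mask, no dropout)."""
    d = len(Q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in Q:
        # scaled dot products of this query against every key
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        # numerically stable softmax over the scores
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # attention-weighted sum of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

The speedup in PyTorch 2 comes from the fused backend (e.g. FlashAttention) chosen at runtime, not from different math, so whether a benefit appears depends on the GPU, dtype, and shapes used in the benchmark.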