Submitted by Not-Banksy t3_126a1dm in singularity
ActuatorMaterial2846 t1_je8fqgw wrote
Reply to comment by Not-Banksy in When people refer to “training” an AI, what does that actually mean? by Not-Banksy
No worries. I'll also point out that the magic behind all this is the transformer architecture in particular. This is the real engine behind LLMs and other models.
FlyingCockAndBalls t1_je8h4g3 wrote
what is so special about the transformer architecture?
ActuatorMaterial2846 t1_je8ik1t wrote
It's actually quite technical, but essentially, the transformer architecture helps each part of the sentence “talk” to all the other parts at the same time. This way, each part can understand what the whole sentence is about and what it means.
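To make the "every part talks to every other part" idea a bit more concrete, here is a minimal sketch of self-attention in plain Python. It is a toy under loud assumptions: the function names (`softmax`, `self_attention`) and the tiny 2-number "token" vectors are made up for illustration, and the queries, keys and values are just the raw inputs, whereas real transformers use learned projections and many attention heads.

```python
import math

def softmax(xs):
    # Turn raw scores into positive weights that sum to 1.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    # Toy assumption: queries, keys and values are all the raw input vectors.
    d = len(vectors[0])
    outputs, all_weights = [], []
    for q in vectors:
        # Score this token against every token in the sentence, itself included.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)
        # New representation: a weighted mix of all the tokens.
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(d)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three made-up "tokens", each a 2-number vector.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
```

Each row of `weights` says how much one token "listens" to every other token, and nothing in the loop over `vectors` depends on a previous iteration, which is why the whole thing can be computed at the same time.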
Here is the paper that imo changed the world 6 years ago and is the reason for the current state of AI.
https://arxiv.org/abs/1706.03762
If it goes over your head (it did for me), ask Bing or ChatGPT to summarise it for you. It helped me get my head around this stuff, as I'm in no way an expert nor do I study this field.
turnip_burrito t1_je8i45w wrote
"Attention mechanism" makes it good at predicting new words from past ones.
The paper that introduced the transformer, which is built around the attention mechanism, is called Attention Is All You Need.
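A rough sketch of the "predicting new words from past ones" part: when a model is trained to predict the next word, each position is only allowed to attend to itself and to earlier positions. The function name `causal_attention_weights` and the example vectors are hypothetical, and as a simplification the raw vectors stand in for the learned queries and keys a real model would use.

```python
import math

def causal_attention_weights(vectors):
    # Row i holds the attention weights of position i over all positions.
    # Positions after i get weight 0, so no token can peek at the future.
    d = len(vectors[0])
    rows = []
    for i, q in enumerate(vectors):
        past = vectors[: i + 1]  # this position and everything before it
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in past]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        rows.append([e / total for e in exps] + [0.0] * (len(vectors) - i - 1))
    return rows

# Three made-up token vectors.
rows = causal_attention_weights([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

The first token can only look at itself, the second at the first two, and so on, which matches the order in which text is generated.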
Zermelane t1_je8lss0 wrote
Better parallelism in training, and a more direct way to reference past information, than in RNNs (recurrent neural networks), which seemed like the "obvious" way to process text before transformers came along.
These days we have RNN architectures that can achieve transformer-like training parallelism, the most interesting-looking one being RWKV. They are still badly disadvantaged when needing information directly from the past, for instance to repeat a name that's been mentioned before, but they have other advantages, and their performance gets close enough to transformers that it could be just a question of scaling exponents which architecture ends up winning out.
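The difference between "carrying the past along" and "looking it up directly" can be sketched with a toy recurrent network. Big caveats: this is a one-number vanilla RNN, not RWKV or any modern architecture, and the weights `w_h` and `w_x` are made-up values rather than anything trained. It only illustrates why a plain RNN has to process tokens one after another, and how a signal from the first token can fade by the end of the sequence.

```python
import math

def rnn_final_state(inputs, w_h=0.5, w_x=1.0):
    # The state must be updated one token at a time: step t needs step t-1.
    h = 0.0
    for x in inputs:
        h = math.tanh(w_h * h + w_x * x)
    return h

def first_token_influence(length):
    # How much does the final state change if only the FIRST input changes?
    with_signal = [1.0] + [0.0] * (length - 1)
    without_signal = [0.0] * length
    return abs(rnn_final_state(with_signal) - rnn_final_state(without_signal))

short_gap = first_token_influence(2)   # signal seen one step ago
long_gap = first_token_influence(10)   # signal seen nine steps ago
```

In this toy, `long_gap` comes out much smaller than `short_gap`: the first token's trace shrinks every step it is carried forward. Attention avoids this by letting a later position put weight directly on the earlier token, however far back it is.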