
Not-Banksy OP t1_je8ez6o wrote

Thanks for the explanation, much appreciated!

14

ActuatorMaterial2846 t1_je8fqgw wrote

No worries. I'll also point out that the magic behind all this is largely the transformer architecture. This is the real engine behind LLMs and other models.

10

FlyingCockAndBalls t1_je8h4g3 wrote

what is so special about the transformer architecture?

6

ActuatorMaterial2846 t1_je8ik1t wrote

It's actually quite technical, but essentially, the transformer architecture helps each part of the sentence “talk” to all the other parts at the same time. This way, each part can understand what the whole sentence is about and what it means.
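
Roughly, in toy numpy code (just a sketch of the idea, not how a real model is written; real transformers use learned projection matrices and many attention heads):

```python
# Minimal self-attention sketch: every token looks at every other token and
# takes a weighted average of them. Toy numbers only.
import numpy as np

def self_attention(X):
    d = X.shape[-1]
    Q, K, V = X, X, X                      # real models compute these with learned weight matrices
    scores = Q @ K.T / np.sqrt(d)          # how much each token "cares" about each other token
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
    return weights @ V                     # each token's new vector mixes in all the others

tokens = np.random.randn(4, 8)             # 4 "words", each an 8-dimensional vector
print(self_attention(tokens).shape)        # (4, 8): same shape, but every word has "talked" to the rest
```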

Here is the paper that imo changed the world 6 years ago and is the reason for the current state of AI.

https://arxiv.org/abs/1706.03762

If it goes over your head (it did for me), ask bing or chatgpt to summarise it for you. It helped me get my head around this stuff, as I'm in no way an expert nor do I study this field.

11

turnip_burrito t1_je8i45w wrote

"Attention mechanism" makes it good at predicting new words from past ones.

The paper that introduced the attention mechanism is called "Attention Is All You Need."
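
A toy sketch of the piece that makes it good at predicting new words from past ones: a "causal" mask so each position can only attend to earlier positions, never future ones (numbers are made up):

```python
import numpy as np

T = 5                                                # sequence length
scores = np.random.randn(T, T)                       # raw attention scores between positions

future = np.triu(np.ones((T, T), dtype=bool), k=1)   # True above the diagonal = future positions
scores[future] = -np.inf                             # forbid looking ahead

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)       # softmax row by row
print(np.round(weights, 2))                          # row i only puts weight on positions 0..i
```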

9

Zermelane t1_je8lss0 wrote

Better parallelism in training, and a more direct way to reference past information, than in RNNs (recurrent neural networks), which seemed like the "obvious" way to process text before transformers came along.

These days we have RNN architectures that can achieve transformer-like training parallelism, the most interesting-looking one being RWKV. They are still at a real disadvantage when they need information directly from the past, for instance to repeat a name that's been mentioned before, but they have other advantages, and their performance gets close enough to transformers that which architecture ends up winning out could come down to scaling exponents.
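
A toy way to see the parallelism point (illustrative numpy, nothing like production code): the RNN loop has to run step by step because each step needs the previous hidden state, while attention is one batched matrix product over the whole sequence:

```python
import numpy as np

T, d = 6, 4
x = np.random.randn(T, d)            # a sequence of 6 token vectors
W = np.random.randn(d, d) * 0.1

# RNN: T sequential steps, each depending on the last hidden state
h = np.zeros(d)
for t in range(T):
    h = np.tanh(x[t] + h @ W)

# Attention: all position pairs handled at once, so it parallelises on a GPU
scores = x @ x.T / np.sqrt(d)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
out = weights @ x
```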

3

abudabu t1_je9ixnd wrote

The GPUs aren’t actually connected together physically. The transformer architecture is entirely in software. The software uses GPUs to do matrix calculations efficiently.

Specifically, the transformer architecture is a bunch of large matrices connected together with arithmetic operations. The training process shows it a sequence of words and sees if it correctly predicts the next word. It figures out how “wrong” the prediction is and updates the matrices so that the prediction will be slightly more right next time. This is a very high level description of “back propagation”.
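
A very rough sketch of that loop in PyTorch, with a toy model standing in for the transformer (the real thing is this at enormous scale):

```python
import torch
import torch.nn as nn

vocab_size, dim = 100, 32
model = nn.Sequential(nn.Embedding(vocab_size, dim), nn.Linear(dim, vocab_size))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab_size, (1, 16))    # a toy "sentence" of token ids
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # the target at each position is the next token

optimizer.zero_grad()
logits = model(inputs)                                                # the model's guesses for every position
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))   # how "wrong" it was
loss.backward()                                                       # back propagation
optimizer.step()                                                      # nudge the matrices to be slightly more right
```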

Using text to automatically train the network is called self-supervised learning. It’s great because no human input is required, just lots of text.
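
A sketch of why no human input is needed: the text itself supplies the answers, since the "label" at every position is simply the next word (toy whitespace tokenizer):

```python
text = "the cat sat on the mat"
words = text.split()

pairs = [(words[:i], words[i]) for i in range(1, len(words))]
for context, target in pairs:
    print(context, "->", target)
# ['the'] -> cat
# ['the', 'cat'] -> sat
# ... the training data writes its own labels
```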

There are many other forms of training. ChatGPT works because it was also trained with reinforcement learning from human feedback (RLHF), where humans rank a set of answers. It's basically the same underlying process as above, but the answers generated by the network are used to train the network, and the ranking is used to prefer the better answers. When we're giving up and down votes, OpenAI is probably using that for RLHF.
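
The ranking part boils down to something like this pairwise loss (the two scores are placeholders; in practice they come from a separate reward model network):

```python
import torch
import torch.nn.functional as F

reward_preferred = torch.tensor([1.3], requires_grad=True)   # score for the answer humans ranked higher
reward_rejected = torch.tensor([0.4], requires_grad=True)    # score for the answer they ranked lower

# Push the preferred answer's score above the rejected one's
loss = -F.logsigmoid(reward_preferred - reward_rejected).mean()
loss.backward()   # in a real setup these gradients update the reward model's weights
```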

Another approach is to use humans to create examples. OpenAI hired people in Africa to have conversations where one person played the role of the chatbot. This kind of training helped the network understand chat-style interactions.

Since it’s a next word predictor, the chat data has special tokens in the text which represent “user” and “chatbot” roles. So maybe that helps you imagine it better as a very fancy autocomplete.
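
For illustration only (these exact tokens are made up; each model uses its own), the whole chat just becomes one long string for the model to keep autocompleting:

```python
conversation = (
    "<|user|> What's the capital of France? <|end|>\n"
    "<|chatbot|> The capital of France is Paris. <|end|>\n"
    "<|user|> And its population? <|end|>\n"
    "<|chatbot|>"          # the model's job: predict the words that come next
)
print(conversation)
```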

6