Submitted by Singularian2501 t3_ya5ofj in MachineLearning

Paper: https://arxiv.org/abs/2210.11416

Github: https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints

Abstract:

>Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). For instance, Flan-PaLM 540B instruction-finetuned on 1.8K tasks outperforms PaLM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.
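
The "phrased as instructions" and chain-of-thought setups the abstract mentions come down to how the input text is written. An illustrative sketch (these strings are made up for illustration, not the released Flan templates):

```python
# Illustrative prompt styles only -- not the exact templates from the Flan
# collection, just the general shape of each setup named in the abstract.

task = ("The cafeteria had 23 apples. They used 20 and bought 6 more. "
        "How many apples do they have?")

# Zero-shot instruction: describe the task in natural language, no examples.
zero_shot = f"Answer the following question.\n\nQ: {task}\nA:"

# Few-shot: prepend one or more worked input/output pairs.
few_shot = (
    "Q: Roger has 5 balls and buys 2 more. How many balls does he have?\nA: 7\n\n"
    f"Q: {task}\nA:"
)

# Chain-of-thought: ask the model to reason step by step before answering.
chain_of_thought = f"Q: {task}\nA: Let's think step by step."
```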


Comments

cygn t1_itbsjr4 wrote

Is it possible to run it on consumer-grade GPUs (e.g., a 3090) with 24 GB of RAM?

LetterRip t1_itchnjl wrote

I assume you mean 24 GB of VRAM? DeepSpeed, with enough CPU RAM and offloading to disk as needed, might let you run it. Note that 540B parameters is more than 2 TB in float32. Even at 8 bits per parameter you are looking at roughly 540 GB, while consumer hardware typically tops out at 128 GB of RAM, so the vast majority of the model would have to be mapped to disk. The size can probably be reduced a lot with quantization and compression, but you will either have to do that work yourself or wait until someone else does.
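
A quick back-of-envelope sketch of those weight-only numbers (activations, KV cache, etc. come on top); the 11B row is included because that is the largest Flan-T5 checkpoint released:

```python
# Rough weight-storage math: parameter count x bytes per parameter.
# Weights only -- activations, KV cache, and optimizer state are extra.

def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    return num_params * bytes_per_param / 1e9

models = [("Flan-PaLM 540B", 540e9), ("Flan-T5-XXL 11B", 11e9)]
dtypes = [("float32", 4), ("float16/bfloat16", 2), ("int8", 1)]

for name, params in models:
    for dtype, nbytes in dtypes:
        print(f"{name:>16} @ {dtype:<17}: ~{weight_memory_gb(params, nbytes):,.0f} GB")
```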

farmingvillein t1_itefjav wrote

> Note that 540B parameters is more than 2 TB in float32

They only provide checkpoints up to the 11B model, however (unless I'm reading things wrong), so this is a moot point at the moment.
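
The released 11B Flan-T5-XXL should fit on a single 24 GB card with 8-bit weights, though. A minimal sketch, assuming the checkpoint is mirrored on the Hugging Face Hub as google/flan-t5-xxl and that accelerate and bitsandbytes are installed (the exact from_pretrained flags depend on your transformers version):

```python
# Hedged sketch: running the 11B Flan-T5-XXL on one 24 GB GPU with int8 weights.
# Assumes the T5X checkpoint is mirrored on the HF Hub as "google/flan-t5-xxl"
# and that accelerate + bitsandbytes are installed.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-xxl"  # 11B params -> ~11 GB of int8 weights

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    device_map="auto",  # let accelerate place layers across GPU/CPU
    load_in_8bit=True,  # bitsandbytes int8 quantization of the weights
)

prompt = "Answer the following question. Who wrote 'The Old Man and the Sea'?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```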

rehrev t1_it9f0a4 wrote

Who actually tries to predict the SOTA level in the future, especially for language modeling?
