Submitted by Seankala t3_119onf8 in MachineLearning
The ELECTRA paper introduces a small version that has around 15M parameters. MobileBERT and TinyBERT also have around the same number of parameters.
Are there any other language models out there that are smaller? Would it be possible to further distill large models into smaller variants?
adt t1_j9neq5w wrote
There should be quite a few models smaller than 15M params. What's your use case? A lot of the 2022-2023 optimizations (int8 quantization, etc.) mean that you can squish even fairly large models onto modern GPUs now.
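If the goal is just to make an existing checkpoint fit, int8 loading looks roughly like this. A minimal sketch assuming `transformers` + `bitsandbytes` on a CUDA GPU; the checkpoint name is only a placeholder, not something from this thread:

```python
# Rough sketch of int8 loading via bitsandbytes -- checkpoint name is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"  # swap in whatever checkpoint you actually use

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",   # place layers on the available GPU(s)
    load_in_8bit=True,   # quantize weights to int8 with bitsandbytes
)

inputs = tokenizer("Tiny language models are", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```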
DeepMind's Gato, designed to fit onto a standard GPU, was bigger than I thought, with a starting size of 79M params.
Have you seen the BERT compression paper that squeezes the model down to about 7MB? It lists some 1.2M-6.2M param models:
https://arxiv.org/pdf/1909.11687.pdf
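On the "further distill" part of your question: the generic recipe is soft-target distillation from a frozen teacher. A minimal sketch of plain Hinton-style logit distillation (not the exact method of the paper above, which also shrinks the vocabulary and shares projections); the teacher/student models and training loop are assumed to exist:

```python
# Generic logit-distillation loss: the student mimics the teacher's softened outputs.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft term: KL between temperature-scaled teacher and student distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard term: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```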
My table shows...
https://docs.google.com/spreadsheets/d/1O5KVQW1Hx5ZAkcg8AIRjbQLQzx2wVaLl0SqUu-ir9Fs/edit#gid=1158069878
*looks at table*
Smallest there seems to be Microsoft Pact at ~30M params, so ignore my table for this one. Transformers are supposed to be wide and deep, I suppose, so it makes sense...
Many of the text-to-image models also rely on relatively small language models.
Also check HF (Hugging Face); they now have around 130,000 models of different sizes (as of Feb/2023):
https://huggingface.co/models
Includes a tiny-gpt2: https://huggingface.co/sshleifer/tiny-gpt2
And t5-efficient-tiny ('has 15.58 million parameters and thus requires ca. 62.32 MB of memory in full precision (fp32) or 31.16 MB of memory in half precision (fp16 or bf16).'):
https://huggingface.co/google/t5-efficient-tiny
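If you want to sanity-check those numbers yourself, counting parameters after loading is a one-liner. A sketch assuming `transformers` is installed; the fp32 footprint is roughly params × 4 bytes:

```python
# Pull the two tiny checkpoints mentioned above and report their sizes.
from transformers import AutoModel

for name in ["sshleifer/tiny-gpt2", "google/t5-efficient-tiny"]:
    model = AutoModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.2f}M params, ~{n_params * 4 / 1e6:.1f} MB in fp32")
```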
Edit: I thought of Anthropic's toy models, but they were not really LLMs. They did train a 10M model during scaling research (paper), but the model hasn't been released.