kittenkrazy t1_jc5sesx wrote
Reply to comment by generatorman_ai in [R] Stanford-Alpaca 7B model (an instruction tuned version of LLaMA) performs as well as text-davinci-003 by dojoteef
- Used accelerate fp16 mixed precision with DeepSpeed ZeRO 2 (a rough config sketch is at the end of this list)
- No xformers. No 8-bit Adam on this run, although I did test it and it works; no gradient checkpointing on this run either, but that also works.
- With a sequence length of 2048 I used a batch size of 1 per GPU across 8 GPUs, with gradient accumulation of 4. This was on A6000s, so 48 gigs of VRAM per card. Currently training a LoRA on the 30B with the base model loaded in 8-bit, and I can only fit a batch size of 1 with a sequence length of 350 (see the 8-bit LoRA sketch after this list). Once this one trains I'm going to try setting up a run with the model split across the cards so I can crank up the sequence length. I'll also be training the PPO phase, so having enough VRAM will be a requirement for that lol.
- If you check out the trlx repo, they have some examples, including one showing how they trained SFT and PPO on the HH dataset. So it's basically that, but with LLaMA. https://github.com/CarperAI/trlx/blob/main/examples/hh/sft_hh.py
- Just the HH dataset directly. From the results it seems like it might be enough, but I might also try instruction tuning first and then running the whole process from that base. I'll also be running the reinforcement learning with a LoRA, using this as an example: https://github.com/lvwerra/trl/tree/main/examples/sentiment/scripts/gpt-neox-20b_peft
- I'm also thinking that sharing LoRA weights instead of the full model might be a way around the license issue?
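
To make the first bullet concrete, here's a minimal sketch (not the exact training script) of fp16 mixed precision with DeepSpeed ZeRO-2 through accelerate. The model name, optimizer settings, and dummy data are placeholders for illustration, and you'd launch it with `accelerate launch`:

```python
# Minimal sketch: fp16 mixed precision + DeepSpeed ZeRO-2 via accelerate.
# Placeholder model/data, not the actual LLaMA 7B run described above.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer
from accelerate import Accelerator, DeepSpeedPlugin

model_name = "EleutherAI/pythia-160m"  # placeholder; the run above used LLaMA 7B
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Dummy HH-style samples; labels == input_ids for causal LM fine-tuning.
texts = ["Human: Hello\n\nAssistant: Hi, how can I help?"] * 8
enc = tokenizer(texts, padding=True, return_tensors="pt")
dataset = [{"input_ids": i, "attention_mask": m, "labels": i}
           for i, m in zip(enc["input_ids"], enc["attention_mask"])]
dataloader = DataLoader(dataset, batch_size=1)  # batch size 1 per GPU, as above

accelerator = Accelerator(
    mixed_precision="fp16",
    deepspeed_plugin=DeepSpeedPlugin(zero_stage=2, gradient_accumulation_steps=4),
)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

model.train()
for batch in dataloader:
    loss = model(**batch).loss
    accelerator.backward(loss)  # DeepSpeed handles loss scaling/accumulation here
    optimizer.step()
    optimizer.zero_grad()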
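
And the 30B LoRA-on-top-of-an-8-bit-base setup looks roughly like this with peft. The rank, target modules, and model path are illustrative rather than the exact run settings, and depending on your peft version the int8 helper may instead be named `prepare_model_for_kbit_training`:

```python
# Sketch of training a LoRA over a base model loaded in 8-bit (bitsandbytes).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

base = AutoModelForCausalLM.from_pretrained(
    "path/to/llama-30b-hf",  # placeholder path to whatever LLaMA weights you have
    load_in_8bit=True,       # int8 weights so the frozen base model fits in VRAM
    device_map="auto",
)
base = prepare_model_for_int8_training(base)  # casts norms, enables input grads

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections, the usual LoRA targets
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the small LoRA matrices are trainable
# ...train as usual; saving stores just the adapter, not the base weights:
# model.save_pretrained("llama-30b-hh-lora")
```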
generatorman_ai t1_jc5u7w2 wrote
Wow, 392 gigs for batch size 1? This is for 7B? That is an order of magnitude more than I was expecting. Sounds like even with full memory optimizations, we're far away from the 16 GB goal.
Good idea on the LoRA - since it's a completely separate set of weights, I don't see how it could come under the license. In fact, LoRAs do work on weights different from the base model they were trained on (e.g. LoRAs trained on base Stable Diffusion work when applied to heavily fine-tuned SD models), so it's not even necessarily tied to the LLaMA weights.
kittenkrazy t1_jc5v4is wrote
Training a LoRA should be significantly cheaper, especially combined with DeepSpeed CPU offloading and loading the model in 8-bit. It can probably be made to train on consumer cards.
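
For reference, the CPU-offload variant is just a tweak to the accelerate DeepSpeed plugin from the earlier sketch; values here are illustrative, not a tested recipe:

```python
from accelerate import Accelerator, DeepSpeedPlugin

# ZeRO-2 with optimizer state pushed to CPU RAM, leaving more GPU memory for
# activations and the 8-bit base model.
plugin = DeepSpeedPlugin(
    zero_stage=2,
    offload_optimizer_device="cpu",
    gradient_accumulation_steps=4,
)
accelerator = Accelerator(mixed_precision="fp16", deepspeed_plugin=plugin)
```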
And yup, completely separate, unless you decide to merge them into the base model weights for faster inference, training another LoRA on top, etc.
Hopefully people will share LoRAs for all sorts of plug-and-play personalities and fine-tuned abilities, and it'll be like Stable Diffusion but with personal assistants.
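
To illustrate the plug-and-play idea, a minimal sketch of loading a shared adapter onto base weights you already have, and optionally folding it in. Paths are hypothetical, and `merge_and_unload` assumes a recent peft version:

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("path/to/llama-7b")        # your own base weights
model = PeftModel.from_pretrained(base, "someone/assistant-lora")      # hypothetical shared adapter

# Keeping the adapter separate is what makes sharing license-friendly; merging
# folds the LoRA into the base weights for faster inference, after which the
# result is a plain transformers model again.
merged = model.merge_and_unload()
merged.save_pretrained("llama-7b-merged")
```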
generatorman_ai t1_jc5vc5r wrote
Probably I'm misinterpreting - you mean a batch size of 1 per GPU across 8 GPUs, so it's actually 48 GB with no optimizations (except fp16). That sounds more reasonable, though probably still several gigabytes too large for the 16 GB target, even with common optimizations.