generatorman_ai t1_jc5w4m9 wrote
Reply to comment by dojoteef in [R] Stanford-Alpaca 7B model (an instruction tuned version of LLaMA) performs as well as text-davinci-003 by dojoteef
The general problem of generative NPCs seems like a subset of robotics rather than pure language models, so that still seems some way off (but Google made some progress with PaLM-E).
LLMs and Disco Elysium sounds like the coolest paper ever! I would love to follow you on Twitter to get notified when you release the preprint.
generatorman_ai t1_jc5vsbw wrote
Reply to comment by extopico in [R] Stanford-Alpaca 7B model (an instruction tuned version of LLaMA) performs as well as text-davinci-003 by dojoteef
T5 is below the zero-shot phase transition crossed by GPT-3 175B (and presumably by LLaMA 7B). Modern models with instruction and human-feedback (RLHF) finetuning will not need further task-specific finetuning for most purposes.
generatorman_ai t1_jc5vc5r wrote
Reply to comment by generatorman_ai in [R] Stanford-Alpaca 7B model (an instruction tuned version of LLaMA) performs as well as text-davinci-003 by dojoteef
I'm probably misinterpreting - you mean you used a batch size of 1 per GPU across 8 GPUs, so it's actually ~49 GB per GPU with no optimizations (except fp16). That sounds more reasonable, though probably still several gigs too large for 16 GB even with common optimizations.
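For a rough sanity check, here's a back-of-the-envelope estimate using the commonly cited ~16 bytes/param for full fine-tuning with mixed-precision Adam (the breakdown below is an assumption, not a figure from this thread, and it excludes activations):

```python
# Assumed per-parameter cost for mixed-precision Adam fine-tuning:
#   fp16 weights (2 B) + fp16 grads (2 B)
#   + fp32 master weights (4 B) + fp32 Adam m and v states (8 B) = 16 B/param
params = 7e9  # LLaMA 7B
total_gb = params * 16 / 1024**3
print(f"~{total_gb:.0f} GB before activations")  # ~104 GB
```

Spread over 8 GPUs with sharded optimizer states (ZeRO-style), that lands in the same ballpark as ~49 GB per GPU once activations are included.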
generatorman_ai t1_jc5u7w2 wrote
Reply to comment by kittenkrazy in [R] Stanford-Alpaca 7B model (an instruction tuned version of LLaMA) performs as well as text-davinci-003 by dojoteef
Wow, 392 gigs for batch size 1? This is for 7B? That is an order of magnitude more than I was expecting. Sounds like even with full memory optimizations, we're far away from the 16 GB goal.
Good idea on the LoRA - since it's a completely separate set of weights, I don't see how it could come under the license. In fact LoRAs do work on weights different from the base model they were trained from (e.g. LoRAs trained on base Stable Diffusion work when applied to heavily fine-tuned SD models), so it's not even necessarily tied to the LLaMA weights.
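As a toy illustration of why a LoRA is portable (a numpy sketch; the hidden size and rank here are made-up values): the adapter is just a low-rank pair of matrices stored separately, added onto any base weight of matching shape.

```python
import numpy as np

d, r = 16, 2  # toy hidden size and LoRA rank
rng = np.random.default_rng(0)

# Two different "base" weight matrices of the same shape
W_base = rng.normal(size=(d, d))
W_finetuned = rng.normal(size=(d, d))

# The LoRA adapter is the low-rank pair (A, B), stored separately
A = rng.normal(size=(d, r))
B = np.zeros((r, d))  # B starts at zero, so the adapter is initially a no-op

def apply_lora(W, A, B, scale=1.0):
    """Merge a LoRA adapter into a weight matrix of matching shape."""
    return W + scale * (A @ B)

# The same adapter applies to any weights with compatible dimensions
W1 = apply_lora(W_base, A, B)
W2 = apply_lora(W_finetuned, A, B)
assert W1.shape == W2.shape == (d, d)
```

Whether the result is any *good* on a different base model is an empirical question, but mechanically nothing ties the adapter to the original weights.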
generatorman_ai t1_jc5q5z0 wrote
Reply to comment by kittenkrazy in [R] Stanford-Alpaca 7B model (an instruction tuned version of LLaMA) performs as well as text-davinci-003 by dojoteef
That's great, it's been hard to find people who are actually fine-tuning LLaMA. Would you mind sharing your experience for the benefit of the open-source community?
- Did you train the full-precision weights?
- Did you use memory optimizations like xformers, 8-bit Adam (from bitsandbytes), gradient checkpointing etc.?
- How much VRAM does it take for a batch size of 1?
- hh seems to be a preference dataset for RLHF rather than a text corpus - how did you use it as a fine-tuning dataset?
- Did you first do instruction fine-tuning (using something like FLAN or Self-Instruct) or just the hh directly?
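For anyone following along, here's a toy sketch of what the gradient checkpointing question refers to (plain PyTorch with a made-up stand-in model, not the poster's actual setup; the 8-bit Adam line is left as a comment since it needs bitsandbytes and a CUDA GPU):

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

# Toy stand-in for a transformer layer stack
layers = torch.nn.Sequential(*[torch.nn.Linear(64, 64) for _ in range(8)])
x = torch.randn(4, 64, requires_grad=True)

# Gradient checkpointing: keep activations only at segment boundaries and
# recompute the rest during backward, trading compute for memory
out = checkpoint_sequential(layers, 4, x)
out.sum().backward()
assert layers[0].weight.grad is not None

# 8-bit Adam from bitsandbytes (requires CUDA), for reference:
# import bitsandbytes as bnb
# optimizer = bnb.optim.Adam8bit(layers.parameters(), lr=1e-4)
```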
generatorman_ai t1_jceddn2 wrote
Reply to comment by kittenkrazy in [R] Stanford-Alpaca 7B model (an instruction tuned version of LLaMA) performs as well as text-davinci-003 by dojoteef
Found this: https://github.com/tloen/alpaca-lora