Submitted by austintackaberry t3_120usfk in MachineLearning
light24bulbs t1_jdntdbb wrote
Reply to comment by baffo32 in [R] Hello Dolly: Democratizing the magic of ChatGPT with open models by austintackaberry
I'm not hoping to do instruction tuning; I want to do additional pre-training.
baffo32 t1_jdo24su wrote
It is the same thing. The alpaca data is just further pretraining data consisting of instructions and responses. Doing this is called finetuning.
baffo32 t1_jdrhj77 wrote
I was still confused by your response. I'm thinking that if you wanted a model to behave as if it had been given different pretraining data, you would probably first finetune on the different bulk data, and then after this finetune on the target task, such as instruction following.
Instruction following is indeed just predicting the next word: on data where the next word obeys the instructions preceding it.
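To make that point concrete: an instruction/response pair is flattened into one token stream, and the training targets are just the inputs shifted by one position. A minimal sketch in plain Python with a toy whitespace tokenizer (all names here are illustrative, not from any specific library):

```python
def make_causal_lm_example(instruction, response, tokenize=str.split):
    # Flatten the instruction/response pair into one token stream;
    # the model simply learns to predict each next token.
    tokens = tokenize(instruction) + tokenize(response)
    inputs = tokens[:-1]   # what the model sees
    targets = tokens[1:]   # what it must predict (shifted by one)
    return inputs, targets

inputs, targets = make_causal_lm_example("Translate to French: cat", "chat")
# Each target token is the "next word" for the corresponding input prefix.
```

The response tokens are not treated specially at all; they are just the next words the model is trained to predict.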
light24bulbs t1_jdrm9kh wrote
That's the part I wasn't getting. I assumed the fine tuning involved a different process. I see now that it is in fact just more training data, often templated into a document in such a way that it's framed clearly for the LLM.
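The Alpaca-style template is a good example of that framing: each record is flattened into plain text before training. A sketch (the field names mirror the public Alpaca format, but treat the exact wording as illustrative):

```python
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{response}"
)

def to_document(record):
    # Once templated, the record is just more text
    # for ordinary next-token training.
    return ALPACA_TEMPLATE.format(**record)

doc = to_document({"instruction": "Name a color.", "response": "Blue"})
```

After templating there is nothing instruction-specific left in the training loop itself.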
The confusing thing is that most of the LLM-as-a-service companies, OpenAI included, will ONLY take data in the question-and-answer format, as if that's the only data you'd want to use to fine-tune.
What if i want to feed a book in so we can talk about the book? A set of legal documents? Documentation of my project? Transcriptions of TV shows?
There are so many use cases for training on top of an already pre-trained LLM that aren't just question answering.
I'm into training llama now. I simply took some training code I found, removed the JSON parsing and question-answer templating stuff, and done.
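Stripping out the Q&A templating and training on bulk text roughly amounts to chunking one long document into fixed-length token blocks. A hedged sketch of that idea (hypothetical helper with a toy tokenizer, not code from any repo mentioned here):

```python
def chunk_text(text, block_size=512, tokenize=str.split):
    # Split one long document (a book, transcripts, docs)
    # into fixed-size token blocks for further causal-LM training.
    tokens = tokenize(text)
    return [tokens[i:i + block_size]
            for i in range(0, len(tokens), block_size)]

blocks = chunk_text("word " * 1000, block_size=512)
# 1000 tokens -> two blocks: one of 512 tokens, one of 488.
```

Each block then feeds the same next-token objective that the Q&A records would have.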
nemorocksharder t1_jdz8kt5 wrote
What you're describing is exactly what I have been looking to do too, and am really surprised I'm not hearing more about it. Have you found any useful approaches to essentially adding to the LLM's Corpus with target material/text? or anyone else trying to do this?
light24bulbs t1_jdzzeh4 wrote
Yes, I'm into it now. Code like this can be adapted to load bulk data instead of q&a.
I suspect some of the training parameters need to be adjusted a bit to prevent overfitting, and obviously the data loading and templating need to be removed.
https://github.com/lxe/llama-tune Or for a cooler approach where you make a Lora layer https://github.com/serp-ai/LLaMA-8bit-LoRA
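The LoRA approach in that second repo comes down to adding a trainable low-rank update on top of frozen base weights. A NumPy sketch of the core idea (shapes and scaling follow the LoRA paper; the variable names are mine, and real implementations operate on specific attention projections):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    # Frozen base weight W plus trainable low-rank update B @ A.
    # B starts at zero, so training begins exactly at the base
    # model's behavior and only the small A/B matrices are learned.
    r = A.shape[0]
    return x @ W.T + (x @ A.T @ B.T) * (alpha / r)

d, r = 8, 2
rng = np.random.default_rng(0)
x = rng.normal(size=(1, d))
W = rng.normal(size=(d, d))        # frozen
A = rng.normal(size=(r, d))        # trainable, random init
B = np.zeros((d, r))               # trainable, zero init: no-op at start
y = lora_forward(x, W, A, B)
```

Because only A and B (rank r, far smaller than d) receive gradients, the adapter is cheap to train and store compared to finetuning the full weight matrix.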