Submitted by nashcaps2724 t3_117l2vf in deeplearning

Hi all, let me lay out my problem…

Imagine there are two corpora, Corpus A (~100,000 documents) and Corpus B (~20,000,000 documents).

Individuals create reports for corpus A based on the information in corpus B.

My idea was to pretrain a GPT on Corpus A, then fine-tune it to take documents from Corpus B as input and output text in the style of Corpus A (essentially a mix of text generation and summarization).

Is this something folks think is even feasible? Should I be pretraining the GPT on both corpora or just Corpus A? I've thought about both fine-tuning an OpenAI GPT and training from scratch.

Any advice would be welcome!


Comments


pitrucha t1_j9ccuun wrote

Proceed as with summarization. You can fine-tune a GPT-family model through OpenAI, or get yourself a T5 and train it. Training from scratch will be a waste of resources.


nashcaps2724 OP t1_j9cdelh wrote

Just to make sure I’m following, steps here could be…

  1. Fine-tune GPT-2 or GPT-3 on Corpus A
  2. Take the fine-tuned GPT* and continue as a summarization task?

pitrucha t1_j9cdugi wrote

More like: given item x from Corpus B, predict item y from Corpus A.
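A minimal sketch of what that pairing could look like as seq2seq training records, if you go the T5 route mentioned above (the field names and the `summarize:` task prefix are just the usual T5 convention, not anything specified in this thread):

```python
# Sketch: frame the task as (Corpus B document -> Corpus A report) pairs
# in a T5-style text-to-text format. Field names here are hypothetical.

def make_pair(doc_b: str, report_a: str) -> dict:
    """Build one seq2seq training record from a B-document and its A-report."""
    return {
        "input": "summarize: " + doc_b.strip(),  # task prefix, T5 convention
        "target": report_a.strip(),
    }

pair = make_pair("Quarterly filings show revenue rose 12%...",
                 "Revenue grew strongly this quarter.")
```

From here, any standard seq2seq fine-tuning loop over these records should work.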


hayAbhay t1_j9duoda wrote

Create a Corpus C like this

<source text from Corpus B> <human-generated report from Corpus A> . . .

Make sure you add some unique tokens marking the start and end of each example and the input and output within it.
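A small sketch of that formatting step (the marker strings below are arbitrary placeholders; the comment only says to pick unique tokens, so choose ones that never occur in your data):

```python
# Sketch: assemble one "Corpus C" training example with unique marker
# tokens delimiting the example and separating input from output.
# The marker strings are hypothetical choices, not from the thread.

BOS, SEP, EOS = "<|example|>", "<|report|>", "<|end|>"

def format_example(source_b: str, report_a: str) -> str:
    """Corpus B source text, then the human-written Corpus A report,
    wrapped in start/separator/end markers."""
    return f"{BOS}{source_b.strip()}{SEP}{report_a.strip()}{EOS}"

example = format_example("Full text of a Corpus B document...",
                         "The matching Corpus A report.")
```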

Then, take any pretrained LLM (tuning GPT-3 is trivial with ~10-20 lines of code).
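For the GPT-3 route, the bulk of those lines is really just data prep. A sketch of the data-prep side, assuming the 2023-era OpenAI fine-tuning flow where training data is a JSONL file of prompt/completion pairs (the separator and stop strings below are arbitrary choices):

```python
# Sketch: write Corpus C as JSONL prompt/completion pairs for an
# OpenAI-style fine-tune. Separator ("###") and stop marker ("END")
# are hypothetical; any unique strings work.
import json

def to_jsonl_lines(pairs):
    """pairs: iterable of (corpus_b_doc, corpus_a_report) tuples."""
    for doc_b, report_a in pairs:
        yield json.dumps({
            "prompt": doc_b.strip() + "\n\n###\n\n",       # input separator
            "completion": " " + report_a.strip() + " END",  # stop marker
        })

lines = list(to_jsonl_lines([("Some Corpus B text", "Its Corpus A report")]))
# Written to a train.jsonl file, this could then be submitted to the
# legacy fine-tunes endpoint, e.g.:
#   openai api fine_tunes.create -t train.jsonl -m davinci
```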

For inference, give the tuned model the input and let it complete the output. You can pass the "end" marker token as a stop sequence so generation terminates cleanly.
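If the generation API doesn't truncate at the stop sequence for you, cutting at the marker is simple string work. A sketch (the marker string here is a stand-in for whatever end token was used during training):

```python
# Sketch: post-process a raw completion by cutting at the end marker.
# " END" is a hypothetical marker; use whatever your training data used.
END_MARKER = " END"

def truncate_at_marker(generated: str, marker: str = END_MARKER) -> str:
    """Keep only the text before the first occurrence of the end marker."""
    head, _sep, _tail = generated.partition(marker)
    return head.strip()

out = truncate_at_marker("A generated Corpus A style report END and junk")
```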

[Source: trained/tuned several language models including gpt3]


nail_nail t1_j9gkc5w wrote

That means paying for each summarization API call, forever, right? Is there an alternative model one can tune on a couple of high-end NVIDIA cards, like GPT-NeoX?


hayAbhay t1_j9i9nfv wrote

If you have the hardware, and if you have a lot of those input-output examples, you can use alternative smaller models in the GPT family.

It should work reasonably well, especially if the variance in the input-output mapping isn't too high (a lot depends on your dataset here).

There are definitely tradeoffs here in terms of model development, inference, and maintenance. If the expected costs aren't too high, I'd strongly recommend GPT-3 as a base.
