Submitted by nashcaps2724 t3_117l2vf in deeplearning
Hi all, let me lay out my problem…
Imagine there are two corpora: Corpus A (~100,000 documents) and Corpus B (~20,000,000 documents).
Individuals create reports for corpus A based on the information in corpus B.
My idea was to pretrain a GPT on Corpus A, then fine-tune it to take documents from Corpus B as input and output text in the style of Corpus A (essentially a mix of text generation and summarization).
Is this something folks think is even feasible? Should I be pretraining the GPT on both corpora or just Corpus A? I've considered both fine-tuning an OpenAI GPT and training one from scratch.
Any advice would be welcome!
pitrucha t1_j9ccuun wrote
Proceed as with summarization. You can train a GPT-family model via OpenAI, or get yourself a T5 and train it. Training from scratch would be a waste of resources.
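To make the "proceed as with summarization" suggestion concrete, here is a minimal sketch of how the data could be framed for a seq2seq model like T5: each training example pairs a Corpus B document (input) with its Corpus A report (target), with a task prefix on the input as T5 conventionally expects. The prefix string, field names, and the commented-out Hugging Face setup are illustrative assumptions, not a prescribed pipeline; it assumes you can align which Corpus B documents produced which Corpus A reports.

```python
def make_example(source_doc: str, report: str, prefix: str = "summarize: "):
    """Format one (Corpus B document, Corpus A report) pair the way a
    T5-style seq2seq model expects: a task prefix on the input text,
    with the report as the generation target."""
    return {"input_text": prefix + source_doc, "target_text": report}


# One training example from a hypothetical aligned pair:
example = make_example(
    "Full text of a Corpus B source document ...",
    "Report written in the Corpus A style ...",
)

# From here, a standard fine-tuning setup (requires `pip install transformers`)
# would tokenize these pairs and train with Seq2SeqTrainer, e.g.:
#
#   from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
#   tokenizer = AutoTokenizer.from_pretrained("t5-base")
#   model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
#   # ... tokenize input_text/target_text, then train with Seq2SeqTrainer ...
print(example["input_text"])
```

The advantage of this framing over training a GPT from scratch is that a pretrained encoder-decoder already knows the language; the fine-tuning data only has to teach it the Corpus B → Corpus A mapping and house style.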