Submitted by JohnyWalkerRed t3_123oovw in MachineLearning
I love seeing all this great progress in making LLMs more accessible to everyone, but all of the new efficient models (Dolly, Alpaca, etc.) depend on the Alpaca dataset, which was generated from a GPT-3 davinci model and is restricted to non-commercial use. Are there efforts in the community to replicate this dataset so it can be used commercially? This seems to me to be the “secret sauce”: a good-quality instruction dataset you can use to “unlock” the potential of smaller models.
KungFuScubaMaster t1_jdvmcyw wrote
Just adding, I'm also very interested in this!