Submitted by Singularian2501 t3_yijfkw in MachineLearning
ServiceNow and Hugging Face have released a 3.1TB dataset of permissively licensed code in 30 programming languages. This is about 4x larger than the dataset used to train GPT-3 (though obviously ‘code only’), and 3x the size of CodeParrot, the next largest released code dataset.
Paper: https://drive.google.com/file/d/17J-0KXTDzY9Esp-JqXYHIcy--i_7G5Bb/view
Hugging Face: https://huggingface.co/datasets/bigcode/the-stack
Twitter: https://twitter.com/BigCodeProject/status/1585631176353796097
Download The Stack: https://hf.co/BigCode
Source: https://twitter.com/BigCodeProject/status/1585631176353796097
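For anyone who wants to look at the data without pulling all 3.1TB, here is a minimal sketch using the Hugging Face `datasets` library in streaming mode. It rests on a few assumptions that are not stated in the post: that the repo is organised into per-language folders named `data/<language>`, that each record stores the file text under a `content` field, and that you have already accepted the dataset terms and logged in on the Hub. `language_data_dir` and `preview_stack` are hypothetical helper names made up for this example.

```python
from itertools import islice


def language_data_dir(language: str) -> str:
    """Build the per-language subdirectory name, e.g. 'data/python'.

    Assumption: the Hub repo groups files as data/<language>.
    """
    return f"data/{language.strip().lower()}"


def preview_stack(language: str = "python", n: int = 3) -> list:
    """Stream the first n records for one language instead of downloading everything."""
    # Imported inside the function so the helper above works
    # even when the `datasets` library is not installed.
    from datasets import load_dataset

    ds = load_dataset(
        "bigcode/the-stack",                   # dataset id from the Hugging Face link
        data_dir=language_data_dir(language),  # assumed folder layout
        split="train",
        streaming=True,                        # iterate lazily, no full download
    )
    # A streaming dataset is an iterable, so islice takes just the first n records.
    return list(islice(ds, n))
```

Calling `preview_stack("python", n=3)` should then return three records as dictionaries; if the `content` assumption holds, `record["content"]` is the source text of one file. Streaming is used here because a lazy iterator keeps disk and memory use small while you decide which languages are worth downloading in full.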
boyetosekuji t1_iuj0v1n wrote
Great news. How much would it cost to train a model on this?