ServiceNow and Hugging Face have released a 3.1TB dataset of permissively licensed code in 30 programming languages. This is about 4x larger than the dataset used to train GPT-3 (though obviously ‘code only’), and 3x the size of CodeParrot, the next largest released code dataset.
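
As a rough sanity check on those ratios, here is a minimal sketch that works backwards from the post's own figures. It assumes "4x larger" means "4 times the size" and treats both multipliers as exact, which they almost certainly are not. The function name `implied_size_tb` is made up for illustration, and the results are estimates, not published sizes.

```python
# Back-of-the-envelope check of the size ratios quoted in the post.
# The only inputs are the post's own figures; the multipliers are
# approximate ("about 4x", "3x"), so the outputs are rough estimates.

STACK_TB = 3.1            # size of The Stack, as stated in the post
GPT3_RATIO = 4.0          # "about 4x larger than the dataset used to train GPT-3"
CODEPARROT_RATIO = 3.0    # "3x the size of CodeParrot"


def implied_size_tb(total_tb: float, ratio: float) -> float:
    """Return the size a dataset would have if `total_tb` is `ratio` times its size."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return total_tb / ratio


gpt3_tb = implied_size_tb(STACK_TB, GPT3_RATIO)
codeparrot_tb = implied_size_tb(STACK_TB, CODEPARROT_RATIO)

print(f"Implied GPT-3 training data: ~{gpt3_tb:.2f} TB")
print(f"Implied CodeParrot size:     ~{codeparrot_tb:.2f} TB")
```

Under these assumptions, the implied sizes come out to a bit under 1 TB for the GPT-3 training data and roughly 1 TB for CodeParrot.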

Download The Stack: https://hf.co/BigCode

Source: https://twitter.com/BigCodeProject/status/1585631176353796097

Comments

boyetosekuji t1_iuj0v1n wrote on October 31, 2022 at 6:19 PM

#395,204

great news, how much would it cost to train?

whyvitamins t1_iujar88 wrote on October 31, 2022 at 7:25 PM

#398,407

Replying to boyetosekuji (#395,204)

yes

master3243 t1_iujfxwi wrote on October 31, 2022 at 8:00 PM

#400,401

Replying to boyetosekuji (#395,204)

very many and very much

MostlyRocketScience t1_iujpd9l wrote on October 31, 2022 at 9:04 PM

#404,071

I'm excited for open source code generation models, so I won't have to pay GitHub every month. And if this is a bigger dataset and permissively licensed, this means there will be no chance that it will generate copyrighted code.

nomadiclizard t1_iujxwax wrote on October 31, 2022 at 10:07 PM

#407,847

I'm curious which 'permissive' licenses have terms permitting the use of the code as training data in machine learning algorithms. Are we assuming licenses which allow code to be modified/redistributed, also include this right?

What if a commercial for-profit company trains on a lot of copyleft code, then commercialises the result and refuses to release the model? Is that ethical?

elcomet t1_iujync7 wrote on October 31, 2022 at 10:13 PM

#408,199

Replying to nomadiclizard (#407,847)

> What if a commercial for-profit company trains on a lot of copyleft code, then commercialises the result and refuses to release the model? Is that ethical?

I would assume this is the same as with licences that allow the code to be used in commercial software.

Bulky_Highlight_3352 t1_iujz3ty wrote on October 31, 2022 at 10:16 PM

#408,423

impressive, bash automation, here I come

I_draw_boxes t1_iuk27ck wrote on October 31, 2022 at 10:41 PM

#409,907

Replying to nomadiclizard (#407,847)

Permissive licenses basically allow the user to do anything they want with the code save sue the author.

>What if a commercial for-profit company trains on a lot of copyleft code, then commercialises the result and refuses to release the model?

That probably isn't legal, but copyleft licenses are not permissive licenses and are not included in this dataset for that reason.

pm_me_your_ensembles t1_iuk6950 wrote on October 31, 2022 at 11:13 PM

#412,117

Replying to boyetosekuji (#395,204)

If you have to ask :D

sitmo t1_iukbw82 wrote on October 31, 2022 at 11:58 PM

#415,782

As an open-source code writer, this feels like an abuse of my contributions: they are monetizing my code, building a brand out of other people's content, and will cash in big time with a stock IPO in the near future.

In order to take back control, I decided to change my naive flower-power-everybody-happy MIT-licensed projects to the more protective GPLv3.