
Zermelane t1_itf46c7 wrote

You're not going to stop the global maritime shipping industry by peeing in the ocean.

The datasets that large language models are trained on are already full of absolute junk. My favorite example is from *Deduplicating Training Data Makes Language Models Better*: a sentence repeated more than 60,000 times in a version of the Common Crawl used for training some significant models at Google:

> by combining fantastic ideas, interesting arrangements, and follow the current trends in the field of that make you more inspired and give artistic touches. We’d be honored if you can apply some or all of these design in your wedding. believe me, brilliant ideas would be perfect if it can be applied in real and make the people around you amazed!
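For a sense of how a repeat like that gets caught, here's a toy Python sketch that just counts exact duplicate lines. The paper's actual methods are fancier (suffix arrays for exact substrings, MinHash for near-duplicates), and the filename here is made up, so treat this as an illustration, not their pipeline:

```python
from collections import Counter

def count_repeated_lines(path, min_count=2):
    """Toy dedup check: count exact duplicate lines in a text file."""
    counts = Counter()
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            line = line.strip()
            if line:
                counts[line] += 1
    # Most-repeated first; a 60,000x sentence would top this list
    return [(text, n) for text, n in counts.most_common() if n >= min_count]

# Hypothetical usage on a local crawl shard:
# for text, n in count_repeated_lines("cc_shard.txt")[:10]:
#     print(n, text[:80])
```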

... not to mention, I've heard stories of training instabilities caused by entire batches consisting of backslashes, or of bee emojis. The former I can at least explain: backslash escapes grow exponentially if you keep re-escaping them (see the sketch below). But the bee emojis, I don't know, someone just wanted to put a lot of bee emojis online, and they ended up messing with someone's language model training.
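Here's what I mean by the blowup: every escaping pass doubles the backslashes, so a few accidental re-encoding rounds turn one backslash into a wall of them.

```python
# Each escaping pass turns every "\" into "\\", doubling the count:
# 1 -> 2 -> 4 -> 8 -> 16 -> ...
s = "\\"  # one literal backslash
for i in range(5):
    print(f"pass {i}: {len(s)} backslash(es)")
    s = s.replace("\\", "\\\\")  # what e.g. repeated JSON-encoding does
```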
