
starstruckmon t1_j6izowe wrote

I would be very surprised. Technically speaking (as per benchmarks), they have one of the best text-to-image generators right now, yet the practical output is far below what we have in quality because of the limited dataset.

It would probably be even worse for text: Wikipedia, Reddit, all the code forums like Stack Overflow, documentation and manuals, the vast majority of scientific papers. They'd be leaving so much out.

4

visarga t1_j6jut51 wrote

No, AI doesn't work that way. You just feed it text in any language, all of them together, and it figures out an inter-language representation. So you can ask in Chinese about something it learned in English.
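As a rough illustration of that shared representation, here's a minimal sketch (assuming the sentence-transformers package and its pretrained multilingual model "paraphrase-multilingual-MiniLM-L12-v2", neither of which is specific to this thread): an English sentence and its Chinese translation land close together in the same embedding space, while an unrelated English sentence does not.

```python
# Minimal sketch of a shared cross-lingual embedding space.
# Assumes the sentence-transformers package and its pretrained
# multilingual model "paraphrase-multilingual-MiniLM-L12-v2";
# any multilingual encoder would illustrate the same point.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

english = "The cat is sleeping on the sofa."
chinese = "猫正在沙发上睡觉。"  # the same sentence in Chinese
unrelated = "The stock market fell sharply today."

emb = model.encode([english, chinese, unrelated])

# The English sentence and its Chinese translation should score much
# higher than the unrelated English sentence, because the model maps
# both languages into one shared representation space.
print("EN vs ZH (translation):", util.cos_sim(emb[0], emb[1]).item())
print("EN vs EN (unrelated):  ", util.cos_sim(emb[0], emb[2]).item())
```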

But there's also plenty of Chinese text. GLM-130B has been trained on over 400 billion text tokens (200B each for Chinese and English). GPT-3 was trained on 300B tokens, mostly English.

4

starstruckmon t1_j6jw3kl wrote

It seems like you're talking about a model trained on both languages. However, there are two issues with this. Firstly, the Chinese generally prefer to train models solely on Chinese data, or with only a limited amount of English data included. Secondly, multilingual models currently perform significantly worse than models trained on a single language.

1

_Just7_ t1_j6kv8fs wrote

Hate to be that guy, but is there a source for single-language models being better? I thought more data = better modelling. Why would it perform worse if you also include the Spanish and Chinese parts of the internet?

2

starstruckmon t1_j6kygds wrote

I can't really speculate on that topic. It's currently an active area of research.

To be honest, this problem is so widely known that I hadn't considered finding sources to support the claim. Here is the best authoritative source I could quickly find:

https://arxiv.org/abs/2012.15613

It may seem counterintuitive to link to a paper that supposedly fixes this issue, but a paper like that is the most likely place to find the problem discussed. Also, if you read it carefully, you'll see that while the authors managed to reduce the gap, it still persists.

1