Submitted by Buck-Nasty t3_10ozflx in singularity
visarga t1_j6jut51 wrote
Reply to comment by starstruckmon in Chinese Search Giant Baidu to Launch ChatGPT-Style Bot by Buck-Nasty
No, AI doesn't work that way. You feed it text in any language, all of it together, and it learns an inter-language representation. So you can ask in Chinese about something it learned in English.
But there's also plenty of Chinese text. GLM-130B was trained on over 400 billion text tokens (200B each for Chinese and English). GPT-3 was trained on about 300B tokens, mostly English.
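A minimal sketch of the shared-representation idea, using the off-the-shelf sentence-transformers library and a multilingual embedding model (the model name and example sentences are my own choices for illustration, not anything from GLM-130B or Baidu's system): semantically equivalent English and Chinese sentences land close together in one embedding space, which is what lets knowledge cross languages.

```python
# Toy illustration: multilingual encoders map text from different
# languages into a single shared vector space.
from sentence_transformers import SentenceTransformer, util

# An off-the-shelf multilingual model, chosen here only for illustration.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

en = "Water boils at 100 degrees Celsius at sea level."  # English fact
zh = "水在海平面上的沸点是一百摄氏度。"                     # same fact in Chinese
other = "The stock market fell sharply on Tuesday."       # unrelated English sentence

emb = model.encode([en, zh, other])

# The English/Chinese pair should score much higher than the unrelated pair,
# i.e. the representation of meaning is largely language-agnostic.
print("en vs zh:    ", util.cos_sim(emb[0], emb[1]).item())
print("en vs other: ", util.cos_sim(emb[0], emb[2]).item())
```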
starstruckmon t1_j6jw3kl wrote
It seems like you're talking about a model trained on both languages. However, there are two issues with this. Firstly, the Chinese generally prefer to train models solely on Chinese data, or with only a limited amount of English included. Secondly, multilingual models currently perform significantly worse than models trained on a single language.
_Just7_ t1_j6kv8fs wrote
Hate to be that guy, but do you have a source on single-language models being better? I thought more data = better models. Why would it perform worse if you also include the Spanish and Chinese parts of the internet?
starstruckmon t1_j6kygds wrote
I can't really speculate on that topic. It's currently an active area of research.
To be honest, this problem is so widely known that I hadn't considered finding sources to support the claim. Here is the best authoritative source I could quickly find:
https://arxiv.org/abs/2012.15613
It may seem counter-intuitive to link to a paper that supposedly fixes this issue, but a paper that tries to close the gap is naturally the most likely place to find it documented. Also, if you read it carefully, you'll see that while the authors managed to narrow the gap, it still persists.