TehDing t1_je08lpg wrote on March 28, 2023 at 2:10 PM

Reply to comment by sebzim4500 in [P] two copies of gpt-3.5 (one playing as the oracle, and another as the guesser) performs poorly on the game of 20 Questions (68/1823). by evanthebouncy

You can ask GPT to spell a word, or provide the words as individual "S P A C E D" characters, and it will still do poorly; it has nothing to do with tokenization. GPT is capable of spelling, and it can even identify that it is not playing well if you ask whether something is a good guess, but it continues to give poor answers.

In terms of 'solving' a game like this 20 Questions example: there are only 12,000 valid words to guess from, or at worst 26^5 possible answers, which still makes this a smaller problem than the blog experiment (or, at worst, on par with it).
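
For a sense of scale, here is a quick sketch of the two counts mentioned above. The 12,000 figure is taken from the comment as-is; the variable names are just for illustration.

```python
import math

# Figure quoted in the comment: valid words to guess from.
valid_words = 12_000

# Worst case: every 5-letter string over a 26-letter alphabet.
all_strings = 26 ** 5  # 11,881,376

# Yes/no questions needed to single out one item by perfect halving.
bits_valid = math.ceil(math.log2(valid_words))
bits_all = math.ceil(math.log2(all_strings))

print(f"valid words: {valid_words:,} (~{bits_valid} halvings)")
print(f"all 5-letter strings: {all_strings:,} (~{bits_all} halvings)")
```

So the unrestricted space is roughly a thousand times larger than the word list, but both are small by brute-force standards.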

Want an easier game? It sucks at Hangman too. It'll guess letters by frequency, but not well enough to put a word together. Even guessing on the basis of common n-grams would probably be a good enough strategy.
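
To make "guessing by frequency" concrete, here is a minimal sketch of a Hangman guesser. It is not the n-gram approach mentioned above; it uses the simpler idea of filtering a word list by the current pattern and picking the most common remaining letter. The function name `next_guess` and the word list are hypothetical, chosen only for illustration.

```python
from collections import Counter

def next_guess(pattern, guessed, words):
    """Return the unguessed letter found in the most candidate words.

    pattern: current board, e.g. "a___e", with "_" for hidden letters
    guessed: set of letters already tried (hits and misses)
    words:   candidate word list (an assumption; any dictionary would do)
    """
    revealed = set(pattern) - {"_"}
    wrong = guessed - revealed
    candidates = [
        w for w in words
        if len(w) == len(pattern)
        and not (set(w) & wrong)  # no known-wrong letters
        and all(
            # shown letters must match; hidden slots cannot hold a revealed letter
            (c == p) if p != "_" else (c not in revealed)
            for p, c in zip(pattern, w)
        )
    ]
    # Count each letter once per word so repeated letters do not dominate.
    counts = Counter(ch for w in candidates for ch in set(w) if ch not in guessed)
    if not counts:
        return None
    # Most frequent letter; ties broken alphabetically for determinism.
    return min(counts, key=lambda ch: (-counts[ch], ch))
```

For example, with the toy list `["apple", "ample", "angle", "maple"]`, the board `"a___e"`, and `{"a", "e"}` already guessed, three words remain and all of them contain `l`, so that is the next guess.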

My experience is that LLMs are poor at novel reasoning. This makes sense: RLHF isn't giving these things a consciousness. Maybe with tweaks/tools we'll actually see some "thinking", but for now (this may change next week at the rate things are going) they're not very good at games in general as a result (another example: I haven't tried it with GPT-4, but GPT-3 cheats at chess).

sebzim4500 t1_je0c899 wrote on March 28, 2023 at 2:35 PM

> You can ask GPT to spell a word, or provide the words as individual "S P A C E D" characters, and it will still do poorly; it has nothing to do with tokenization. GPT is capable of spelling, and it can even identify that it is not playing well if you ask whether something is a good guess, but it continues to give poor answers.

Yeah, because 99.99% of the time when it sees words they are not written that way. It's true that the model can just about figure out how to break a word up into characters, but it has to work hard at that and seemingly doesn't have many layers left for completing the actual task.

I would expect that a model trained with single-character tokens would do far better at these word games (Wordle, Hangman, etc.), at the cost of being worse at almost everything else.
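
As a toy illustration of what "single-character tokens" means here (this is a hypothetical vocabulary, not any real model's tokenizer): each letter gets its own id, so position i in the token sequence is always letter i of the word.

```python
import string

# Hypothetical vocabulary: one token id per lowercase letter.
VOCAB = {ch: i for i, ch in enumerate(string.ascii_lowercase)}
INV_VOCAB = {i: ch for ch, i in VOCAB.items()}

def encode_chars(word):
    """Encode a word as one token id per letter."""
    return [VOCAB[ch] for ch in word.lower()]

def decode_chars(ids):
    """Invert encode_chars."""
    return "".join(INV_VOCAB[i] for i in ids)

ids = encode_chars("crane")
print(ids, decode_chars(ids))
```

With this scheme, a question like "what is the third letter?" is just an index into the sequence, whereas a subword tokenizer may pack several letters into one opaque id.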