
FpRhGf t1_j9tg77o wrote

A flaw is that the tokens in LLMs are word-based, not character-based. It sees every word as an entirely different thing instead of as a combination of the same 26 letters.

This means it's unable to give outputs that rely on knowledge of the spelling of the word itself. It can't write you a story that doesn't contain the letter “e”, write a poem with a specific number of syllables, create new words, write in Pig Latin, break up words in arbitrary ways, or make wordplays that involve the letters rather than the meaning, etc.

There are a lot of things I want it to do that it can't do because of this limitation.

15

YobaiYamete t1_j9ugnw3 wrote

Yep, this is what causes all the posts about the AI cheating like a mofo at hangman as well. It's funny to see, but it's an actual problem.

There's also the issue that LLMs are shockingly vulnerable to gaslighting. Social engineering has always been the best method of "hacking", and with AI it's more relevant than ever.

Gaslighting the piss out of the AI to get it to give up all its secret info is hilariously easy.

8

Additional-Escape498 t1_j9w2ix6 wrote

LLM tokenization uses wordpieces (subword units), not whole words or characters. This has been standard since the original “Attention Is All You Need” paper that introduced the transformer architecture in 2017. Vocabulary size is typically between 32k and 50k depending on the implementation; GPT-2 uses ~50k. The vocabularies include each individual ASCII character plus commonly used combinations of characters. Documentation: https://huggingface.co/docs/transformers/tokenizer_summary

https://huggingface.co/course/chapter6/6?fw=pt
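You can see this directly, by the way. A minimal sketch, assuming you have the Hugging Face `transformers` package installed: GPT-2's tokenizer splits rarer words into subword pieces rather than treating every word as atomic.

```python
# Minimal sketch (assumes: pip install transformers).
# Shows GPT-2's ~50k-entry vocabulary and its subword splitting.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

print(tokenizer.vocab_size)  # 50257 for GPT-2

# Common words are usually single tokens; rarer words get split
# into smaller pieces (the exact split depends on the learned merges).
for word in ["the", "hamburger", "antidisestablishmentarianism"]:
    print(word, "->", tokenizer.tokenize(word))
```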

4

FpRhGf t1_j9xtnne wrote

Thanks! Well, it's better than I thought. It still doesn't fix the limitations on the outputs I listed, but at least it's more flexible than I presumed.

2

Additional-Escape498 t1_j9yorbo wrote

You’re definitely right that it can’t do those things, but I don’t think it’s because of the tokenization. The wordpieces do contain individual characters, so a model could in principle do that with the wordpiece tokenization it uses. The issue is that the things you’re asking for (like writing a story in Pig Latin) require reasoning, and LLMs are essentially mapping inputs onto a manifold. LLMs can’t really do much reasoning or logic and can’t do basic arithmetic. I wrote an article about the limitations of transformers if you’re interested: https://taboo.substack.com/p/geometric-intuition-for-why-chatgpt
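To illustrate the point about characters being representable (again a sketch, assuming the Hugging Face `transformers` package): every single ASCII letter is its own entry in GPT-2's vocabulary, so character-level output is expressible even if the model rarely produces it reliably.

```python
# Sketch (assumes: pip install transformers): individual letters
# exist as standalone tokens in GPT-2's vocabulary, so tokenization
# alone does not make character-level tasks impossible.
from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")

for ch in "abcxyz":
    ids = tok.encode(ch)
    print(ch, "->", ids)  # each letter encodes to a single token id
```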

1

Representative_Pop_8 t1_j9wp2cc wrote

I think ChatGPT has finer-grained data. In fact, I taught Spanish Pig Latin (geringoso) to ChatGPT, and it did learn it after about a dozen prompts, even though it insisted that it didn't know it and couldn't learn it.

I had to ask it to role-play as a person who knew the Pig Latin I'd taught it. The funny thing is that it ranted about not being able to do the translation, and said that if I wanted to know, I could apply the rules myself! But in the next paragraph it said something like “but the person would have said...”, followed by a pretty decent translation.
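For context, the geringoso (jeringonza) rule mentioned above repeats each vowel with a “p” inserted in between, so “hola” becomes “hopolapa”. A minimal sketch of the transformation (the function name and accent handling are my own choices):

```python
# Minimal sketch of the geringoso/jeringonza rule: each vowel v
# becomes v + "p" + v, e.g. "hola" -> "hopolapa".
VOWELS = "aeiouáéíóúAEIOUÁÉÍÓÚ"

def geringoso(text: str) -> str:
    out = []
    for ch in text:
        if ch in VOWELS:
            out.append(ch + "p" + ch.lower())
        else:
            out.append(ch)
    return "".join(out)

print(geringoso("hola"))  # hopolapa
```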

1

FpRhGf t1_j9xuv93 wrote

I think understanding and generating are different things. I remember seeing an article on this sub a few days ago saying that LLMs could translate languages they weren't trained on, so I'm not too surprised that it could translate geringoso.

However, it can't generate it. When I tried to get it to write in Pig Latin, the sentences were incoherent and contained words that aren't real words. But at least they all ended with “ay”, and the output was better than my initial approach.

My initial approach was to get ChatGPT to move the first letter of each word to the end (Pig Latin without the “ay”) to see if that was a viable way of avoiding the filter, as shown in the sketch below. It completely failed: it gave me sentences where every word was a typo, like “Myh hubmurger js veyr dliecsoius” instead of “Ym amburgerh si eryv eliciousd”. On top of that, the filters could still detect the content despite all those typos, so it was a failed experiment for me.
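For reference, the transformation I was asking for is simple to state in code (a sketch; the function name is mine):

```python
# Sketch of the transformation described above: move each word's
# first letter to the end ("Pig Latin without the 'ay'").
def rotate_first_letter(sentence: str) -> str:
    return " ".join(w[1:] + w[0] if len(w) > 1 else w
                    for w in sentence.split())

print(rotate_first_letter("My hamburger is very delicious"))
# -> "yM amburgerh si eryv eliciousd"
```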

1