When people point out the blatant holes like "2+2=Duck" or "Strawberry has 6 Rs", you can tell they try to quickly patch these cases out.
But how are they patching them out? "When someone asks this question, this is the right answer"? By adding the correct answer to the model? These would be bodges. The former is worse than the latter.
My non-expert stab at what's happening...
There's an article describing the tokenization problem Veritable mentioned. The solution is to train it so that when people ask this kind of question, it solves it with a simpler method. But that's just a guess as to why it's having the problem in the first place. You'd think that asking a question like that would call for a solution using simple counting. But what might be happening is that it is making a mistake interpreting what you really want, and its solution may not be as bad as it appears.
How many R's are there in 'strawberry'?
Easy enough to count, but why are you even asking the question? Particularly if you typed the word yourself. Testing an AI's ability to count is probably not what it's expecting you to be doing. It's not expecting you to be happy with a pedantic answer. It probably predicts that what you are really asking for is a spelling check. Yes, the berry part has two R's, not one.
The tokenization is only really useful for pronunciation, for rhyming, and for figuring out the meaning of novel words or misspelled words. But when the user provides the word, and spells it correctly? It's hard to imagine why tokenization would be done on the INPUT.
Where I'm coming from: I created a tokenization scheme for a rhyming dictionary, and tokenized 40,000-ish words. To find rhymes, you don't tokenize the input. That's already done in the dictionary. You look the input up in the dictionary and return all the results that have the same tokenization. You'd do the same thing for text-to-speech, and in fact I started with a dictionary intended for text-to-speech (and found it so full of noise, from the perspective of rhyming, and errors in the tokenization, and the tokenization scheme was crap to begin with, that I couldn't use it). You'd do the reverse for speech-to-text.
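Here's roughly what that lookup looks like as a Python sketch. The entries and token symbols below are made up for illustration (they are not my actual scheme or data), and `build_index` and `find_rhymes` are just names I picked for the example.

```python
from collections import defaultdict

# Hypothetical, hand-written entries: word -> precomputed rhyme tokens.
# The symbols are invented for this example, not a real tokenization scheme.
RHYME_TOKENS = {
    "berry": ("EH", "R", "IY"),
    "cherry": ("EH", "R", "IY"),
    "merry": ("EH", "R", "IY"),
    "giggling": ("IH", "G", "L", "IH", "NG"),
    "wiggling": ("IH", "G", "L", "IH", "NG"),
}


def build_index(rhyme_tokens):
    """Group words by their tokenization so lookups don't scan the whole dictionary."""
    index = defaultdict(list)
    for word, tokens in rhyme_tokens.items():
        index[tokens].append(word)
    return index


INDEX = build_index(RHYME_TOKENS)


def find_rhymes(word):
    """Look the input up (no tokenizing it) and return other words with the same tokens."""
    key = word.lower()
    tokens = RHYME_TOKENS.get(key)
    if tokens is None:
        return []  # word not in the dictionary; this sketch has no fallback
    return sorted(w for w in INDEX[tokens] if w != key)


print(find_rhymes("berry"))
```

The point is that the input is only ever used as a lookup key. All the tokenizing happened ahead of time, when the dictionary was built.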
An AI would have to know how to do all of these things, but it would also need to know WHEN to do all of these things, and when it wasn't called for. If I asked it how the word is pronounced, or to write me a lyric to rhyme, it might look up the tokenization. But check my spelling? Tokenization not required.
Any solution that solves the pedantic question has to be careful not to short-circuit the other types of questions that might be asked. You might ask your voice assistant if strawberry has one or two R's. It would be confusing if it told you there were three R's. A FULL answer might be, "there are three R's in strawberry, two of them in the berry part." But another thing it's trying to do is correctly interpret your intent and give you a concise answer.
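To make the "full answer" concrete, here's a rough Python sketch. The function name `count_letter` and the hand-picked split into "straw" and "berry" are my own illustration, not a claim about what any assistant does internally.

```python
def count_letter(word, letter):
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())


word = "strawberry"
parts = ["straw", "berry"]  # assumed split, chosen by hand for this example

total = count_letter(word, "r")
per_part = {part: count_letter(part, "r") for part in parts}

print(f"There are {total} R's in {word}.")
for part, n in per_part.items():
    print(f"  {n} in the '{part}' part")
```

The counting itself is trivial. The hard part is deciding whether the asker wants the total, the count in the "berry" part, or both.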
The tokenization article says that the word 'giggling' has the same problem. Of course it does. He says it correctly answers the question if you type it out with spaces between the letters: G I G G L I N G. I suspect it is easier to determine intent (the pedantic answer is desired) from this.
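For comparison, the article's explanation (as I understand it) can be pictured with a toy Python sketch. The subword split below is hypothetical, since real tokenizers may carve the word up differently, and `count_units` is just a name for this example.

```python
# Hypothetical subword split of "giggling"; a real tokenizer may differ.
subword_view = ["gig", "gling"]

# Typed with spaces, every letter arrives as its own unit.
letter_view = "G I G G L I N G".split()


def count_units(units, letter):
    """Count the units that are exactly the given letter (case-insensitive)."""
    return sum(1 for unit in units if unit.lower() == letter.lower())


print(count_units(subword_view, "g"))
print(count_units(letter_view, "g"))
```

With the chunked view there is no unit that is just "g", so a count over units finds nothing, while the spaced-out view makes each letter visible. Either way, the spaced version also makes the intent hard to misread.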
Also, and they already have this problem a bit, the more data is "generated", the flatter the bumps and nuance will get. So all the edge cases start "vanishing" because people just rely on the LLM, which just generates the "statistically most likely" response.
To keep its own mistakes from ruining the model, it should probably exclude things that are AI-generated. For now. That's easy to do when AI-generated results are labeled as such. But what happens when I publish AI-generated garbage and pretend I created it?
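A minimal Python sketch of that kind of filter, assuming each record carries a hypothetical `ai_generated` label (the field name and the sample records are invented for the example):

```python
# Invented sample records; "ai_generated" is an assumed, self-reported label.
records = [
    {"text": "Hand-written post about rhyming dictionaries.", "ai_generated": False},
    {"text": "Chatbot summary, honestly labeled.", "ai_generated": True},
    {"text": "Chatbot output passed off as my own work.", "ai_generated": False},
]


def keep_for_training(items):
    """Drop anything labeled as AI-generated; unlabeled items are kept."""
    return [item for item in items if not item.get("ai_generated", False)]


kept = keep_for_training(records)
for item in kept:
    print(item["text"])
```

The third record sails straight through, because the filter can only trust the label it's given.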