No, this literally is the explanation. The model understands the concept of "Strawberry". It can output that concept (and that itself is very complicated) in English as Strawberry, in Persian as توت فرنگی, and so on.
But the model does not know how many Rs there are in Strawberry, or how many ت there are in توت فرنگی.
The model ISN'T outputting the letters individually. Binary models (as I mentioned) do; transformers don't.
The model output is more like Strawberry
A token can be a letter, part of a word, any single lexeme, any word, or even multiple words ("let be").
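To make that concrete, here's a rough sketch of a tokenizer. The vocabulary, the IDs, and the way "Strawberry" gets split are all made up for illustration, and real tokenizers usually use learned merges (e.g. BPE) rather than this greedy longest-match, so treat it as a toy, not how any particular model does it.

```python
# Hypothetical toy vocabulary: pieces and IDs are invented for illustration.
VOCAB = {"Str": 0, "aw": 1, "berry": 2, "let be": 3}

def tokenize(text):
    """Greedy longest-match split of text into token IDs (toy stand-in for a real tokenizer)."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest remaining piece first, then shrink it.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids

strawberry_ids = tokenize("Strawberry")  # three token IDs, no letters
let_be_ids = tokenize("let be")          # two words, one token ID
```

What the model gets is the list of IDs. Nothing in that list says how many Rs were in the original string.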
Okay, I did a shit job demonstrating the time axis. The model doesn't know the underlying letters of the previous tokens, and this process goes forward in time.
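Maybe a sketch of the time axis helps. This is a toy loop, assuming a made-up lookup table (`NEXT_ID`) in place of an actual model, and made-up token IDs. The only point is that each step sees the IDs produced so far and appends one more.

```python
# Hypothetical token IDs: 0 = "Str", 1 = "aw", 2 = "berry".
# NEXT_ID is a made-up stand-in for a model: last token ID -> next token ID.
NEXT_ID = {0: 1, 1: 2}
END = 2  # assume generation stops after "berry" in this toy setup

def generate(prompt_ids, max_steps=10):
    """Append one token ID per time step; the context only ever holds IDs."""
    context = list(prompt_ids)
    for _ in range(max_steps):
        last = context[-1]
        if last == END or last not in NEXT_ID:
            break
        context.append(NEXT_ID[last])  # one step forward in time
    return context

generated = generate([0])
```

At no step does the loop look inside a token. The context is just a growing list of integers.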