Vigge93

joined 2 years ago
Vigge93@lemmy.world 7 points 2 days ago

That's why you always have the same number of digits. 01 < 05 < 10
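To illustrate the point, here is a minimal Python sketch. The list `names` is just made-up sample data; the idea is that plain string sorting compares character by character, and padding to a fixed width makes that order agree with numeric order.

```python
# Made-up sample values, stored as strings (e.g. file name prefixes).
names = ["10", "5", "1"]

# Plain string comparison goes character by character, so "10" lands before "5".
unpadded = sorted(names)

# Left-pad with zeros to a fixed width of 2, then sort the padded strings.
padded = sorted(n.zfill(2) for n in names)

print(unpadded)  # ['1', '10', '5']
print(padded)    # ['01', '05', '10']
```

The width has to cover the largest number you expect; with three-digit values you would pad to 3 instead.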

Vigge93@lemmy.world 2 points 5 days ago

That's where you get into the nuances of tokenization. It's not a simple lookup table, and the AI does not have access to the original definitions of the tokens. Also, tokens do not map 1:1 onto words, and a single word might be broken into several tokens. For example, "There's" might be broken into "There" + "'s", and "strawberry" might be broken into "straw" + "berry".

The reason we often simplify it as token = word is that this holds for most common words.
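A rough sketch of how a word can end up as several tokens. This is a toy greedy longest-match splitter, not how any particular model's tokenizer actually works (real ones typically learn their pieces from data), and the `VOCAB` set and `tokenize` function are hypothetical, picked only to reproduce the splits mentioned above.

```python
# Hypothetical toy vocabulary; a real tokenizer learns its pieces from data.
VOCAB = {"There", "'s", "straw", "berry", "the", "cat"}

def tokenize(word):
    """Split a word into vocabulary pieces using greedy longest match."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first, then shrink it.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # Nothing in the vocabulary matches: fall back to one character.
            tokens.append(word[i])
            i += 1
    return tokens

print(tokenize("strawberry"))  # ['straw', 'berry']
print(tokenize("There's"))     # ['There', "'s"]
print(tokenize("cat"))         # ['cat']
```

Common words like "cat" come out as a single token, which is why the token = word simplification mostly works.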

Vigge93@lemmy.world 6 points 5 days ago

Each word gets converted to a number before it is processed, so asking "how many r are there in strawberry" could be converted to "how many 7 are there in 13", for example.

(Very simplified)
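In code, the simplified picture looks something like this. The `TOKEN_IDS` table and `encode` function are made up for illustration; the IDs 7 and 13 are just the ones from the example above, not values from any real tokenizer.

```python
# Hypothetical word-to-ID table; the numbers are arbitrary.
TOKEN_IDS = {
    "how": 1, "many": 2, "r": 7, "are": 3,
    "there": 4, "in": 5, "strawberry": 13,
}

def encode(text):
    """Replace each whitespace-separated word with its ID."""
    return [TOKEN_IDS[word] for word in text.split()]

ids = encode("how many r are there in strawberry")
print(ids)  # [1, 2, 7, 3, 4, 5, 13]
```

All that gets processed is the list of numbers, and nothing about the number 13 says which letters were in the original word.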