
Tokenizers aren't considered the "sexy" part of LLMs, but where others see boring, I see opportunity. Papers like xVal [1] point toward specialization strategies in tokenization. Spelling and letter tasks are another problem that could benefit from innovation in tokenization.

LLMs are notoriously bad at counting letters in words or performing simple oulipos of letter omission. GPT-4o, for example, writes a small Python program and executes it in order to count letter instances. We all know that tokenization effectively erases knowledge about letters in prompts and directly hurts performance at these tasks, yet we haven't found a way to solve it.

1. https://ar5iv.labs.arxiv.org/html/2310.02989



This was ages ago, in the pre-transformer era, and I can't find the link anymore. But once upon a time I read a great paper that demonstrated that most of the performance differences being reported among popular embedding models of the time were better explained by text cleaning and tokenization than they were by the embedding model itself.

In other words, if you train a model using word2vec's preprocessing and GloVe's algorithm, the result looks more like a "standard-issue" word2vec model than a "standard-issue" GloVe model.


Yes, those models were sensitive to preprocessing, far more so than transformers.

However, Word2Vec and GloVe were fundamentally different; when used as designed, GloVe worked better pretty much uniformly.


Tokenizers face an odd compute issue.

Since they're part of the pre-processing pipeline, you can't quickly test them for effectiveness; you have to restart a pretraining run to measure the downstream impact.

Separately,

As much as an attention module can do universal nonlinear transformations... I wonder if it makes sense to add specific modules for some math primitives as well. I remember that the executor paper [1] (a slight precursor to the "Attention Is All You Need" paper) created self-contained modules for operations like less-than, count, and sum, and then explicitly orchestrated them in the decoder.

I'm surprised we haven't seen such solutions produce SOTA results from the math-AI or code-AI research communities.

[1] https://arxiv.org/abs/1705.03633


What's the issue with character-level tokenization? (I assume this would be much better at count-the-letter tasks.) The article mentions it as an option but doesn't talk about why subword tokenization is preferred by most of the big LLMs out there.


Using subwords makes your sequences shorter, which makes them cost less.
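
To put a rough number on that (a sketch using OpenAI's tiktoken library with the cl100k_base vocabulary; the exact ratio depends on the text and the tokenizer):

  import tiktoken

  enc = tiktoken.get_encoding("cl100k_base")
  text = "Tokenizers aren't considered the sexy part of LLMs."

  n_bytes = len(text.encode("utf-8"))   # byte-level sequence length
  n_tokens = len(enc.encode(text))      # subword sequence length

  # Subword sequences are typically around 4x shorter for English text,
  # which matters because attention cost grows with sequence length.
  print(n_bytes, n_tokens, n_bytes / n_tokens)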

Besides that, for alphabetic languages, there exists almost no relation between form and meaning. I.e.: “ring” and “wing” differ by one letter but have no real common meaning. By picking the character or byte as your choice of representation, the model basically has to learn to distinguish ring and wing in context. This is a lot of work!

So, while working on the character or byte level saves you some embeddings and thus makes your model smaller, it puts all of the work of distinguishing similar sequences with divergent meanings on the model itself, which means you need a larger model.

By having subwords, a part of this distinguishing work already has been done by the vocabulary itself. As the article points out, this sometimes fails.


> Besides that, for alphabetic languages, there exists almost no relation between form and meaning.

Also true for abugida-based languages, e.g. சரம் (saram = string) vs. மரம் (maram = tree), and many more. I think your intention with specifying "alphabetic languages" was to say "non-logographic languages", right?


I'll do you one more and say "non-Chinese languages". Written Japanese - including the kanji portion of the script - has the same characteristic.

And even in Chinese it's a fairly weak relationship. A large portion of the meanings of individual characters comes from sound loan. For example, the 英 in 英雄 means "hero", in 英语 means "England", and in 精英 means "flower". The relationship there is simple homophony.

On the other hand, one thing you do get with written Chinese is that "1 character = 1 morpheme" very nearly works. So mechanistically breaking a text into a sequence of morphemes can be done pretty reliably without the aid of a semantic model or exhaustive hard-coded mapping. I think that for many other languages you can't even get close using only syntactic analysis.


> I'll do you one more and say "non-Chinese languages". Written Japanese - including the kanji portion of the script - has the same characteristic.

Written Japanese is much more ideographic than written Chinese. Japanese spelling is determined, such as it is, by semantics. Chinese spelling is determined by sound. Thus, 女的, 娘们, and 妮子, all meaning 'girl' or 'woman', have no spelling in common because they are different words, while Japanese uses 女 for "jo" and "onna" despite a total lack of any relationship between those words.


I was trying to say “at least for alphabetic languages”. I don’t like to say things about languages I can’t speak or write. So, no, it wasn’t my intention to say “non-logographic languages”


I suspect that the holy grail here is figuring out how to break the input into a sequence of morphemes and non-morpheme lexical units.


What do you mean by non-morpheme lexical units? Syntactic particles, units too small to be morphemes? Lexical items that contain multiple morphemes?

In either case, isn't this something we already do well?


Punctuation, for example.

And no, at least for the languages with which I'm familiar, SOTA tokenizers tend to only capture the easy cases.

For example, the GPT-4 tokenizer breaks the first sentence of your post like so:

  What/ do/ you/ mean/ by/ non/-m/orp/heme/ lexical/ units/?
Notice how "morpheme" gets broken into three tokens, and none of them matches "morpheme"'s two morphemes. "Lexical" and "units" are each a single token, when they have three and two morphemes respectively.

Or in French, the word "cafetière" gets chopped willy-nilly into "c/afet/ière". The canonical breakdown is "cafe/t/ière".


Has anyone tried to combine a token embedding with some representation of the characters in the (sub)word? For example, use a 512-dimensional vector to represent a token and reserve the last 12 values to spell out the word.


I'm not following - spell out the word how? Like put the actual bytes as numerical input to the transformer layer?


Yes


Actually, adding the numerical value directly is, I think, not the right way to do it, because of what happens when you matmul those values. Usually the right way would be to have low-dimensional, learnable character embeddings in addition to the token embeddings.

The problem with pure numerical values representing a category as input to a neural network layer is that exact byte values are very hard for the transformer to memorize, especially when numbers that are close together often don't share much meaning. Categories are usually encoded somehow, like a one-hot encoding or, more recently, a learned embedding, so that the different categories can be easily distinguished (different categories are close to orthogonal).

My prediction would be that using the numerical value directly would not work at all, and that learnable embeddings would work, but you would have to reserve that part of the token embedding for each token, which would hurt performance a lot on non-spelling tasks relative to just letting the whole embedding represent the token however the model sees fit.
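
Something like this, maybe (a minimal PyTorch sketch of the "reserve part of the embedding width for learnable character embeddings" idea; the module and all sizes are made up for illustration):

  import torch
  import torch.nn as nn

  class TokenPlusCharEmbedding(nn.Module):
      # Hypothetical: shrink the token embedding and concatenate learnable
      # per-character embeddings of the token's spelling (padded/truncated
      # to max_chars), so the total width stays at d_model.
      def __init__(self, vocab_size, d_model=512, max_chars=12, d_char=4, n_bytes=256):
          super().__init__()
          self.tok = nn.Embedding(vocab_size, d_model - max_chars * d_char)
          self.chr = nn.Embedding(n_bytes, d_char)  # one small table, shared across positions

      def forward(self, token_ids, char_ids):
          # token_ids: (batch, seq); char_ids: (batch, seq, max_chars) byte values
          t = self.tok(token_ids)
          c = self.chr(char_ids).flatten(-2)  # (batch, seq, max_chars * d_char)
          return torch.cat([t, c], dim=-1)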

But, IDK! It would be easy enough to try! You could do it on a small toy model, then try a small learnable, reusable character embedding, and write a blog post. I'd be happy to coach / offer some GPU time / answer any other questions you have while building it.


Which tasks would you test the expected improvements on from this addition?


Maybe make a metric of "count letter $L in word $word". If you want to game it, you can choose words such that they are all tokenized into 1-2 tokens and each token contains multiple instances of letter $L.

And then use something like HellaSwag to measure how much you've lost on general text completion compared to a vanilla LLM with the same embedding size all dedicated to just the token.
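
A sketch of the kind of letter-counting harness I mean (hypothetical; `model_answer` stands in for whatever model call you're evaluating):

  import random, string

  def letter_count_accuracy(model_answer, words, n_trials=200):
      # Hypothetical eval: ask "how many times does letter L appear in word W"
      # and score exact-match accuracy against the true count.
      correct = 0
      for _ in range(n_trials):
          word = random.choice(words)
          letter = random.choice(string.ascii_lowercase)
          truth = word.lower().count(letter)
          prompt = f"How many times does the letter '{letter}' appear in the word '{word}'? Answer with a number."
          if model_answer(prompt).strip() == str(truth):
              correct += 1
      return correct / n_trials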


Which model would you recommend to try it on? Would you train it from scratch, or finetune an existing one?


You would have to train the new model from scratch, since it would be all new token embeddings with whatever character-encoding scheme you come up with. It would probably make sense to train a vanilla GPT from scratch with the same total embedding size as your control. I would start with https://github.com/karpathy/nanoGPT as a baseline, since you can train a toy (GPT-2 sized) LLM in a couple of days on an A100, which are pretty easy to come by.


Not that I know of, but encoding orthography in a fixed-width vector usually carries the assumption that words with the same prefix are more similar. So there's an alignment problem. You usually solve this with dynamic programming, but that doesn't work inside a vector.

For example, “parent” and “parents” are aligned (they share letters in the same position), but “skew” and “askew” share no letters in the same position.
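
A quick illustration of the mismatch (just a sketch; the difflib ratio stands in for a proper dynamic-programming alignment):

  import difflib

  def positional_overlap(a, b):
      # Fraction of positions where the two words share the same letter.
      return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

  print(positional_overlap("parent", "parents"))  # ~0.86: the prefixes line up
  print(positional_overlap("skew", "askew"))      # 0.0: a one-letter shift breaks everything

  # An alignment-based similarity recovers it, but that computation
  # doesn't reduce to comparing fixed positions of a vector.
  print(difflib.SequenceMatcher(None, "skew", "askew").ratio())  # ~0.89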


The other 500 values in the skew/askew vectors will be similar, though. The 12 character values don't need to be aligned; their function is to provide spelling. Adding such info will probably help the LLM answer questions requiring character-level knowledge (e.g. counting 'r's in 'strawberry').


Well, fastText uses character n-grams to compute embeddings for out-of-vocabulary words. This is pre-transformer work, BTW.


IIRC, overlapping n-gram vectors are summed to form the token embedding. Doesn't that effectively destroy any character-level representation of the token? Doesn't really make sense to me.


It works because they use really large n-gram sizes, up to 6, so most character-level information is in these subwords.
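
Roughly, fastText extracts all character n-grams between a minimum and maximum length (3 to 6 by default), with word-boundary markers, and sums their learned vectors; a sketch of the extraction step:

  def char_ngrams(word, n_min=3, n_max=6):
      # fastText-style: wrap the word in boundary markers, then take all
      # n-grams with n_min <= n <= n_max; the whole wrapped word is kept too.
      w = f"<{word}>"
      grams = {w}
      for n in range(n_min, n_max + 1):
          grams.update(w[i:i + n] for i in range(len(w) - n + 1))
      return grams

  # The word vector is then the sum of the learned vectors of these n-grams.
  print(sorted(char_ngrams("because")))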


Let’s say we want to use 6-grams and build an embedding vector for the word “because”: we add integer vectors for “becaus” and “ecause”, right? For example: [1,2,3,4,5,6] + [2,3,4,5,6,2] = [3,5,7,9,11,8]. Obviously we cannot use this resulting numerical vector to spell the input word. Pretty much all character level info is lost.


Tokens are on average four characters, and the number of residual streams (and therefore the RAM) the LLM allocates to a given sequence is proportional to the number of units of input. The FLOPs are proportional to their square in the attention calculation.

You can hypothetically try to ameliorate this by other means, but if you just naively drop from tokenization to character- or byte-level models, this is what goes wrong.


4x seq length expansion doesn’t sound that bad.


I mean, it's not completely fatal, but it means an approximately 16x increase in runtime cost, if I'm not mistaken. That's probably not worth it just to solve letter counting in most applications.


It is not necessarily 16x if you, e.g., also decrease model width by a factor of 4 or so, but yeah, naively the RAM and FLOPs scale up as n^2.
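
Back-of-envelope version of that tradeoff (a sketch; real models have other cost terms, e.g. the MLP blocks, so the exact factors vary):

  def attention_cost(seq_len, d_model):
      # Dominant attention terms (QK^T and AV) are each O(seq_len^2 * d_model).
      return seq_len ** 2 * d_model

  base = attention_cost(2048, 4096)                  # subword tokens
  bytes_naive = attention_cost(2048 * 4, 4096)       # ~4x longer byte-level sequence
  bytes_thin = attention_cost(2048 * 4, 4096 // 4)   # same, but model width cut 4x

  print(bytes_naive / base)  # 16x
  print(bytes_thin / base)   # 4x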


I think it has to do both with performance (smaller tokens mean more tokens per sentence read and more runs per sentence generated) and with how embeddings work. You need a token for "dog" and a token for "puppy" to represent the relationship between the two as a dimension in latent space.


Context-length compute and memory scale as N^2. Smaller tokens mean worse scaling, up to a point.


I wrote a whole paper about this exact topic! (Syntactic, phonetic, and related constraints)

https://aclanthology.org/2022.cai-1.2/


> but where others see boring, I see opportunity

I feel this way about embeddings

This line of thought seems related to the old wisdom of finding innovative solutions by mucking around in the layer below whatever the "tools of the trade" are for your domain


> LLMs are notoriously bad at counting letters in words or performing simple oulipos of letter omission.

If it were so simple, why hasn’t this already been dealt with?

Multimodal VQA models also have had a hard time generalizing counting. Counting is not as simple as changing the tokenizer.


There is a paper about this: https://arxiv.org/pdf/2405.18719


I'm saying the oulipo rule is simple, not that the task is simple given current tokenization methods.


Should the number 23 be tokenized as one token or two tokens?


It doesn’t matter. The challenge with counting doesn’t have to do with tokenization. Why this got into the zeitgeist, I don’t know.


No LLM struggles with two-digit arithmetic. 100-digit addition is possible with the use of state-of-the-art position encodings. Counting is not bottlenecked by arithmetic at all.

When you ask an LLM to count the number of "r"s in the word "strawberry", the LLM will output an essentially random number. If you ask it to separate the letters into S t r a w b e r r y, then each letter is tokenized independently and the attention mechanism is capable of performing the task.

What you are doing is essentially denying that the problem exists.
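
You can see the difference the model faces directly (a sketch using OpenAI's tiktoken library with the cl100k_base vocabulary; the exact splits depend on the tokenizer):

  import tiktoken

  enc = tiktoken.get_encoding("cl100k_base")

  for text in ["strawberry", "S t r a w b e r r y"]:
      pieces = [enc.decode([i]) for i in enc.encode(text)]
      # The plain word is packed into a few multi-character tokens, while the
      # spaced-out version yields roughly one token per letter.
      print(text, "->", pieces)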


GPT-4o correctly answered

"How many letters "r" are in the word Frurirpoprar"

And it didn't use program execution (at least it didn't show the icon, and the answer was very fast, so it's unlikely it generated and executed a program to count).


I wouldn't consider that something that's going to work generally. That word may tokenize to one token per character and the model may have seen relevant data, or it may be relatively close to some other word for which it's seen data that gives the answer.


Why would you confidently claim something so wrong? It's exactly the opposite: it's mostly due to tokenization. Show NeurIPS papers that give evidence of the opposite, because I can square up with NeurIPS papers to substantiate that it is tokenization that causes these issues.


Tokenization absolutely screws with math and counting.


Two. That's the reality.

You interpret the token sequence by constructing a parse tree, but that doesn't require you to forget that the tokens exist.


If you use standard BPE, you likely won't tokenize every number by its digits; it depends on the data set used to create the tokenizer.

The point is, you have a choice. You can do the tokenization however you like. The reason 23 is interesting is that there is a case to be made that a model will more readily understand that 23 is related to Jordan if it's one token, and will find it harder if it's two tokens. The opposite is true for math problems.
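
For example, one concrete knob is digit handling: you can force every digit into its own token before BPE runs (an illustrative sketch, not any particular tokenizer's actual rule):

  import re

  def pretokenize_digits(text):
      # Hypothetical pre-tokenization: each digit becomes its own unit, so "23"
      # is always two tokens and arithmetic always sees individual digits.
      return re.findall(r"\d|[^\d]+", text)

  print(pretokenize_digits("Jordan wore 23"))  # ['Jordan wore ', '2', '3']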

The reality is whatever we want to make it. It's likely that current schemes are... suboptimal. In practice it would be great if every token were geometrically well spaced after embedding and preserved semantic information, among other things. The "other things" have taken precedence thus far.


We already solved that with binary representation ;-)


And decoders.


> oulipos

?



