They are. Nobody in the AI research community expected that scaling up models by several orders of magnitude would make them that good. Even the original selling point of the Transformer paper was "can be much bigger and faster than RNNs, even if not as good." Of course, everyone was wrong, and it's all about the data and scale.
Btw, tokens are nothing new in modern models. They've been used since 2015/2016 [0]

[0] https://arxiv.org/abs/1508.07909
Which goes to show you one of the best strategies in ML isn't necessarily thinking about learning theory; it's thinking about what our fastest computers are most capable of doing.
But this does run out of steam (or rather… data) eventually. The largest LLM was IBM's Tangora model back in the 1980s: 8 trillion params.
We’ll go through the cycle again with a new paradigm.
There was a similar argument for not improving the efficiency of software in general - why bother to improve the design if upgrading the underlying hardware can double performance without changing the software?
It's not entirely invalid, but it does encourage a rather blasé attitude towards actually understanding the problem that you're dealing with.
"It just goes to show the best strategy in assembling wood together isn't learning to use a screw driver, its about pounding it in with the hardest and most capable hammer."
I'll acknowledge Large Hammer Models have done surprisingly well, but true General Artificial Fastening is obviously not just Hammer models scaled up. There are things a screwdriver can do that a hammer is hopeless at, such as taking apart your eyeglasses (and, more importantly, putting them back together). This isn't something that will be solved once the tradesmen have a bit more training. It's fundamental to the way hammers work. We can only do so much by hammering in all our screws.
> The largest LLM was IBM's Tangora model back in the 1980s: 8 trillion params.
As far as I can tell, this comes from a typo in a 1992 paper? https://aclanthology.org/J92-4003.pdf Does it mean the space of possible inputs is 8 trillion?
It's an n-gram model, so it can easily have as many parameters as you want to store. Skimming it, I think the '8 trillion parameters' here means '8 trillion trainable n-gram parameters, but it's very sparse, and most of them are simply defined by omission to be a small constant like epsilon because they aren't represented in the training data at all and so suffer from the 0-count problem'. (Old problem - even Turing and Good spent a lot of time discussing how to handle 0s in contingency tables because that happens a lot in cryptography!)
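To make the "defined by omission" point concrete, here's a toy Python sketch (the dict-based storage and the epsilon floor are illustrative assumptions, not Tangora's actual smoothing scheme): only trigrams that actually appear in training get stored counts, and everything else silently falls back to a tiny constant.

```python
from collections import defaultdict

# Toy trigram "model": only observed trigrams are stored explicitly.
# EPSILON is a made-up floor standing in for whatever smoothing a real
# system would use to deal with the 0-count problem.
EPSILON = 1e-12

trigram_counts = defaultdict(int)
context_totals = defaultdict(int)

def train(tokens):
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        trigram_counts[(a, b, c)] += 1
        context_totals[(a, b)] += 1

def prob(a, b, c):
    # Seen trigrams get a maximum-likelihood estimate; unseen ones are
    # "defined by omission" to the epsilon floor.
    total = context_totals.get((a, b), 0)
    if total == 0 or (a, b, c) not in trigram_counts:
        return EPSILON
    return trigram_counts[(a, b, c)] / total

train("the cat sat on the mat".split())
print(prob("the", "cat", "sat"))   # seen in training: 1.0
print(prob("the", "cat", "flew"))  # never seen: EPSILON
```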
I think the 8 trillion parameter figure is accurate: Tangora is an N-gram model with a vocab size of 20,000 words and N = 3.
Parameters for an N-gram model = V^(N-1) * (V-1)
Plugging in V=20,000 words and N = 3 for Tangora, you'd get 7.9996E12.
Most of the parameters are likely zero or close to it because many 3-grams are possible but unlikely to occur. (However, the aggregate probability of all these rare 3-grams is substantial, and thus they have to be included.)
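A quick sanity check of the arithmetic above (just a sketch, using the V and N values claimed for Tangora):

```python
# Free parameters of an N-gram model: one distribution over V words for
# each of the V^(N-1) contexts, with V-1 free probabilities per
# distribution (they sum to 1).
V = 20_000  # claimed vocabulary size
N = 3       # trigram model

free_params = V ** (N - 1) * (V - 1)
print(f"{free_params:,}")    # 7,999,600,000,000
print(f"{free_params:.4e}")  # 7.9996e+12, i.e. roughly 8 trillion
```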