They are. Nobody in the AI research community expected that scaling up models by several orders of magnitude would make them that good. Even the original selling point of the Transformer paper was "can be much bigger and faster than RNNs, even if not as good." Of course, everyone was wrong, and it's all about the data and scale.
Btw, tokens are nothing new in modern models. They've been used since 2015/2016 [0]

[0] https://arxiv.org/abs/1508.07909
Which goes to show you one of the best strategies in ML isn't necessarily thinking about learning theory; it's thinking about what our fastest computers are most capable of doing.
But this does run out of steam (or rather… data) eventually. The largest LLM was IBM's Tangora model back in the 1980s: 8 trillion params.
We’ll go through the cycle again with a new paradigm.
There was a similar argument for not improving the efficiency of software in general - why bother to improve the design if upgrading the underlying hardware can double performance without changing the software?
It's not entirely invalid, but it does encourage a rather blasé attitude towards actually understanding the problem that you're dealing with.
"It just goes to show the best strategy in assembling wood together isn't learning to use a screw driver, its about pounding it in with the hardest and most capable hammer."
I'll acknowledge Large Hammer Models have done surprisingly well, but true General Artificial Fastening is obviously not just Hammer models scaled up. There are things a screwdriver can do that a hammer is hopeless at, such as taking apart your eyeglasses (and, more importantly, putting them back together). This isn't something that will be solved once the tradesmen have a bit more training. It's fundamental to the way hammers work. We can only do so much by hammering in all our screws.
> The largest LLM was IBM's Tangora model back in the 1980s: 8 trillion params.
As far as I can tell, this comes from a typo in a 1992 paper? https://aclanthology.org/J92-4003.pdf Does it mean the space of possible inputs is 8 trillion?
It's an n-gram model, so it can easily have as many parameters as you want to store. Skimming it, I think the '8 trillion parameters' here means '8 trillion trainable n-gram parameters, but it's very sparse, and most of them are simply defined by omission to be a small constant like epsilon because they aren't represented in the training data at all and so suffer from the 0-count problem'. (Old problem - even Turing and Good spent a lot of time discussing how to handle 0s in contingency tables because that happens a lot in cryptography!)
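To make the "defined by omission" point concrete, here's a toy Python sketch (the dict-based storage and the epsilon floor are illustrative assumptions, not Tangora's actual smoothing scheme): only trigrams that actually appear in training get stored counts, and everything else silently falls back to a tiny constant.

```python
from collections import defaultdict

# Toy trigram "model": only observed trigrams are stored explicitly.
# EPSILON is a made-up floor standing in for whatever smoothing a real
# system would use to deal with the 0-count problem.
EPSILON = 1e-12

trigram_counts = defaultdict(int)
context_totals = defaultdict(int)

def train(tokens):
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        trigram_counts[(a, b, c)] += 1
        context_totals[(a, b)] += 1

def prob(a, b, c):
    # Seen trigrams get a maximum-likelihood estimate; unseen ones are
    # "defined by omission" to the epsilon floor.
    total = context_totals.get((a, b), 0)
    if total == 0 or (a, b, c) not in trigram_counts:
        return EPSILON
    return trigram_counts[(a, b, c)] / total

train("the cat sat on the mat".split())
print(prob("the", "cat", "sat"))   # seen in training: 1.0
print(prob("the", "cat", "flew"))  # never seen: EPSILON
```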
I think the 8 trillion parameter figure is accurate: Tangora is an N-gram model with a vocab size of 20,000 words and N = 3.
Parameters for an N-gram model = V^(N-1) * (V-1)
Plugging in V=20,000 words and N = 3 for Tangora, you'd get 7.9996E12.
Most of the parameters are likely zero or close to it because many 3-grams are possible but unlikely to occur. (However, the aggregate probability of all these rare 3-grams is substantial, and thus they have to be included.)
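A quick sanity check of the arithmetic above (just a sketch, using the V and N values claimed for Tangora):

```python
# Free parameters of an N-gram model: one distribution over V words for
# each of the V^(N-1) contexts, with V-1 free probabilities per
# distribution (they sum to 1).
V = 20_000  # claimed vocabulary size
N = 3       # trigram model

free_params = V ** (N - 1) * (V - 1)
print(f"{free_params:,}")    # 7,999,600,000,000
print(f"{free_params:.4e}")  # 7.9996e+12, i.e. roughly 8 trillion
```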