
I tried this with GPT-4 for NYC, from my address on the Upper West Side of Manhattan to the Brooklyn Botanic Garden. It got the whole thing pretty much correct. I wouldn’t use it as directions, since it sometimes got left and right turns mixed up, stuff like that, but overall amazing.


That's wild.

I don't understand how that's even possible with a "next token predictor" unless there's some weird emergence, or maybe I'm overcomplicating things?

How does it know what street or neighbourhood it should traverse next at each step without a pathfinding algo? Maybe there are some bus routes in the data it leans on?


> How does it know what street or neighbourhood it should traverse next at each step without a pathfinding algo?

Because transformers are 'AI-complete'. Much is made of (decoder-only) transformers being next-token predictors, which misses the truth that large transformers can "think" before they speak: there are many layers in between input and output. By a certain layer of a certain token, such as the last input token of the prompt, they can form a primitive high-level plan (e.g. go from A to B via approximate midpoint C) and then refer back to it on every following token while expanding it with details (A to C via D). Their working memory grows with the number of input+output tokens, and with each additional layer they can elaborate details of an earlier representation such as that 'plan'.

However, the number of sequential steps in any internal computation (not 'saved' as an output token) is limited by the number of layers. This limit can be worked around with chain-of-thought, which is why I call them AI-complete.
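
A toy of that depth limit, purely my own illustration (the names and numbers are made up, not anything measured on a real model): a fixed internal "depth" bounds how many sequential steps fit in one forward pass, but emitting intermediate results as tokens and resuming from them removes the bound.

    # Toy illustration (hypothetical): LAYERS sequential steps fit per token;
    # chain-of-thought re-feeds emitted intermediate state to go further.
    LAYERS = 4  # stand-in for the number of transformer layers

    def forward_pass(state, steps_remaining):
        """Do at most LAYERS sequential steps, then 'emit' the result."""
        steps = min(LAYERS, steps_remaining)
        for _ in range(steps):
            state += 1                     # stand-in for one sequential computation
        return state, steps_remaining - steps

    def chain_of_thought(initial_state, total_steps):
        """Keep feeding the emitted intermediate result back in until done."""
        state, remaining = initial_state, total_steps
        while remaining > 0:
            state, remaining = forward_pass(state, remaining)
        return state

    # 10 sequential steps exceed the 4-layer depth, but re-feeding emitted
    # intermediate results completes the computation anyway.
    print(chain_of_thought(0, 10))  # -> 10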

I write this all hypothetically, not based on mechanistic interpretability experiments.


I like your interpretation, but how would they refer back to a plan if it isn’t stored in the input/output? Wouldn’t this be lost/recalculated with each token?


The internal state at layer M of token N is available at every following token > N and layer > M via attention heads. It's transformed by a matrix, but it's a very direct lookup mechanism. The state after the final attention layer is not addressable in this way, but it immediately becomes the output token, which is of course accessible.

Note also that sequential computations such as loops translate nicely to parallel ones, e.g. k layers can search the paths of length k in a graph, if each token represents one node. But since each token can only look backwards, unless you're searching a DAG you'd also have to feed in the graph multiple times so the nodes can see each other. Hmm... that might be a useful LLM prompting technique.
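
A loose sketch of the "k layers can search paths of length k" point (my own analogy in plain matrix arithmetic, not a claim about actual attention weights): each fully parallel update extends every known path by one edge, so after k updates node 0 can see everything within k hops.

    import numpy as np

    # adjacency matrix of a tiny directed graph: 0 -> 1 -> 2 -> 3
    A = np.array([[0, 1, 0, 0],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1],
                  [0, 0, 0, 0]])

    reachable = np.eye(4, dtype=int)      # "layer 0": each node sees only itself
    for layer in range(1, 4):             # each layer extends every path by one edge
        reachable = ((reachable + reachable @ A) > 0).astype(int)

    # node 0 now "sees" node 3 (a path of length 3), even though every
    # update step was computed in parallel across all nodes
    print(reachable[0])                   # -> [1 1 1 1]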


But is this lookup mechanism available from one token prediction to the next? I’ve heard conflicting things, with others saying that transformers are stateless and therefore don’t share this information across prediction steps. I might be misunderstanding something fundamental.


Yes, attention (in transformer decoders) looks backwards to the internal state at previous tokens. (In transformer encoders like BERT it can also look forwards.) When they said "stateless" I think they meant that you can recompute the state from the tokens, so the state can be discarded at any time: the internal state is entirely deterministic; it's only the selection of output tokens that involves random sampling.

Another critical feature of transformers is that you can compute the state at layer N for all tokens in parallel, because it depends only on layer N-1 for the current and all previous tokens, not on layer N for the previous token as in LSTMs or typical RNNs. The whole point of the transformer architecture is to allow that parallel compute, at the cost of directly depending on every previous token rather than just the last.

So if you wished, you could implement a transformer by recomputing everything on every token. That would be incredibly inefficient. However, if you're continuing a conversation with an LLM, the server likely would recompute all the state for all tokens on each new user input, because the alternative is to store all that state in memory until the user gets back to you a minute later. If you have too many simultaneous users you won't have enough VRAM for that. (In some cases moving it out of VRAM temporarily might be practical.)
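
A minimal sketch of that "stateless but recomputable" point (my own toy stand-in, not any real inference stack): because the per-token state is a deterministic function of the tokens, a server can freely choose between caching it and recomputing it from the token history.

    def layer_state(tokens):
        """Deterministic stand-in for per-token internal state (a running prefix sum)."""
        return [sum(tokens[: i + 1]) for i in range(len(tokens))]

    history = [3, 1, 4]

    # Option 1: keep the state around between user turns (costs memory, e.g. VRAM).
    cached = layer_state(history)

    # Option 2: discard it and recompute from the full token history when the user returns.
    recomputed = layer_state(history)

    # Same tokens, same state: caching vs recomputing is a speed/memory trade-off,
    # never a correctness one, because the state is deterministic given the tokens.
    assert cached == recomputed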


One thing you could think about is the very simple idea of “I’m walking north on Park Avenue; how do I get into the park?”

The answer is always ‘turn left’ no matter where you are on Park Avenue. These kinds of heuristics would allow you to build more general pathfinding based on ‘next direction’ prediction.

OpenAI may well have built a lot of synthetic direction data from some maps system like Google Maps, which would then heavily train this ‘next direction’ prediction system. Google Maps builds a list of smaller direction steps to follow to achieve the larger navigation goal.
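
A rough sketch of that purely local rule (my own toy; the function and compass encoding are invented for illustration and say nothing about how any model was actually trained): heading north with the park on your west side, the rule answers 'turn left' no matter which cross street you're at.

    # Hypothetical local "next direction" rule, in the spirit of the Park Avenue example.
    def next_direction(heading, target_side):
        """Return the turn that points you toward the target's side of the street."""
        compass = ["north", "east", "south", "west"]
        left_of = {h: compass[(i - 1) % 4] for i, h in enumerate(compass)}
        if left_of[heading] == target_side:
            return "turn left"
        if left_of[target_side] == heading:
            return "turn right"
        return "continue straight" if heading == target_side else "turn around"

    # Anywhere on the avenue, the local rule gives the same answer:
    print(next_direction("north", "west"))   # -> turn left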


A couple hundred billion floating point numbers is enough to store quite a few things.

Also, the algorithms you learn in CS are for scaling problems to arbitrary sizes, but you don't strictly need those algos to handle problems of a small size. In a sense, you could say the "next token predictor" can simulate some very crude algorithms, e.g. at every token, greedily find the next location by looking at the current location, and output the neighboring location that's in the direction of the destination.

The next token predictor is a built-in for loop, and if you have a bunch of stored data on roughly where the current location is, its neighboring locations, and the relative direction of the destination... then you've got a crude algo that kinda works.
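
A crude sketch of that built-in loop (toy neighborhoods and coordinates invented by me purely for illustration, not anything GPT-4 actually does): each "token" is just whichever neighbor looks closest to the destination.

    # toy map: location -> position, and its adjacent locations
    coords = {"A": (0, 0), "B": (1, 0), "C": (1, 1), "D": (2, 1), "E": (2, 2)}
    neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "E"], "E": ["D"]}

    def next_token(current, destination):
        """Greedily emit the neighbor that is closest to the destination."""
        dx, dy = coords[destination]
        return min(neighbors[current],
                   key=lambda n: (coords[n][0] - dx) ** 2 + (coords[n][1] - dy) ** 2)

    # the "for loop": keep predicting the next location until we arrive
    # (greedy steps can get stuck on a less convenient map, hence "kinda works")
    route, current = ["A"], "A"
    while current != "E":
        current = next_token(current, "E")
        route.append(current)
    print(" -> ".join(route))   # A -> B -> C -> D -> E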

PS: but yeah despite the above, I still think the emergence is "magic".


There must be some point-A-to-point-C training data that it had to learn to complete via tons of point-A-to-point-B data, which it then learned to generalize from?


> I don't understand how that's even possible with a "next token predictor"

It isn't. It failed.

> I wouldn’t use it as directions, since it sometimes got left and right turns mixed up, stuff like that, but overall amazing.


Maybe the embedding vectors contain coordinates; vectors are really good at coordinates.
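
Speculative toy to make that concrete (entirely my own illustration; the embed function and its layout are invented, and real embeddings would at best encode coordinates approximately): if coordinates sit in some linear directions of a place embedding, rough bearings fall out of vector arithmetic.

    import numpy as np

    def embed(lat, lon, dims=16):
        """Pretend place embedding: coordinates in the first two dims, zeros elsewhere."""
        e = np.zeros(dims)
        e[0], e[1] = lat, lon
        return e

    uws = embed(40.787, -73.975)   # Upper West Side (approximate)
    bbg = embed(40.667, -73.963)   # Brooklyn Botanic Garden (approximate)

    delta = (bbg - uws)[:2]        # a linear "readout" of the displacement
    print("head", "south" if delta[0] < 0 else "north",
          "and", "east" if delta[1] > 0 else "west")   # -> head south and east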


So it got the whole thing pretty much correct... Except for stuff like sending you in the opposite direction?



