
> While I understand the core concept of 'just' picking the next word based on statistics

That's just the mechanism it uses to generate output - which is not the same as the way it internally chooses what to say.

I think it's unfortunate that the name LLM (large language model) has stuck for these predictive models, since IMO it's very misleading. The name dates from when this line of research was born out of much simpler systems that really were just language models, and it has sadly stuck. The "predict next word" framing is also misleading, especially when combined with the false notion that these are just language models. What is true is that:

1) These models are trained by being given feedback on their "predict next word" performance

2) These models generate output a word at a time, and those words are a selection from a variety of predictions about how their input might be continued, in light of the material they saw during training and what they have learnt from it
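Point (2) can be sketched as a sampling loop. This is purely a toy - the lookup table stands in for a deep network, and real models condition on the whole context rather than just the last word - but the generate-one-word-at-a-time structure is the same:

```python
import random

# Toy "model": maps the last token to a distribution over next tokens.
# In a real LLM these probabilities come from a network conditioned on
# the ENTIRE context, not a table keyed on one word.
NEXT = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"ran": 0.8, "sat": 0.2},
    "sat": {"down": 1.0},
    "ran": {"away": 1.0},
}

def next_token_probs(context):
    # Fall back to end-of-sequence when the toy table has no continuation.
    return NEXT.get(context[-1], {"<eos>": 1.0})

def generate(prompt, max_tokens=5, seed=0):
    rng = random.Random(seed)
    tokens = prompt.split()
    for _ in range(max_tokens):
        probs = next_token_probs(tokens)
        # Sample one word from the predicted distribution and append it;
        # the extended sequence becomes the input for the next step.
        token = rng.choices(list(probs), weights=list(probs.values()))[0]
        if token == "<eos>":
            break
        tokens.append(token)
    return " ".join(tokens)

print(generate("the"))
```

The key thing the toy does share with the real system: each output word is sampled from a distribution of predicted continuations, then fed back in as input.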

What is NOT true is that these models are operating just at the level of language and are generating output purely based on language level statistics. As Ilya Sutskever (one of the OpenAI founders) has said, these models have used their training data and predict-next-word feedback (a horrible way to have to learn!!!) to build an internal "world model" of the processes generating the data they are operating on. "world model" is jargon, but what it essentially means is that these models have gained some level of understanding of how the world (seen through the lens of language) operates.

So, what really appears to be happening (although I don't think anyone knows in any level of detail) when these models are fed a prompt and tasked with providing a continuation (i.e. a "reply" in the context of ChatGPT) is that the input is consumed and, per the internal "world model", a high-level internal representation of the input is built. This starts at the level of language, presumably, but includes a model of the entities being discussed, the relations between them, related knowledge that is recalled, etc, etc. This internal model of what is being discussed persists (and is updated) throughout the conversation and as output is being generated. The output is generated word by word, but not as a statistical continuation of the prompt; rather, it is a statistically likely continuation of texts seen during training when the model had similar internal states (i.e. a similar model of what was being discussed).

You may have heard of "think step by step" or "chain of thought" prompting, which are ways to enable these models to perform better on complex tasks where the distance from problem statement (question) to solution (answer) is too great for the model to cover in a "single step". What is going on here is that these models, unlike us, are not (yet) designed to iteratively work on a problem and explore it; instead they are limited to a fixed number of processing steps (corresponding to the number of internal levels - repeated transformer blocks - between input and output). For simple problems where a good response can be conceived and generated within that limited number of steps, the models work well; otherwise you can tell the model to "think step by step", which allows it to overcome this limitation by taking multiple baby steps and evolving its internal model of the dialogue.
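Here's a toy way to see why emitting intermediate steps buys extra compute (this is an analogy, not how a transformer actually works). Suppose each "forward pass" can only do a fixed small amount of work - here, reducing a single (a+b) sub-expression. One pass can't finish a nested problem, but feeding each partial result back in as context can:

```python
import re

def one_pass(expr):
    # One "forward pass" with fixed compute: reduce exactly ONE
    # innermost (digit+digit) pair, leaving the rest untouched.
    return re.sub(r"\((\d+)\+(\d+)\)",
                  lambda m: str(int(m.group(1)) + int(m.group(2))),
                  expr, count=1)

def answer_in_one_step(expr):
    # "Direct answer" regime: the model gets exactly one pass.
    return one_pass(expr)

def answer_step_by_step(expr, max_steps=20):
    # "Chain of thought" regime: each emitted intermediate result
    # re-enters the context, so total compute grows with the number
    # of steps instead of being capped at one pass.
    steps = [expr]
    while not expr.isdigit() and len(steps) <= max_steps:
        expr = one_pass(expr)
        steps.append(expr)
    return steps

print(answer_in_one_step("((1+2)+(3+4))"))   # one pass: still unfinished
print(answer_step_by_step("((1+2)+(3+4))"))  # reaches the final answer
```

The single pass gets stuck partway through the nested expression, while the step-by-step loop converges - which is roughly the intuition behind chain-of-thought prompting.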

Most of what I see written about ChatGPT, or these predictive models in general, seems to be garbage. Everyone has an opinion and wants to express it regardless of whether they have any knowledge of, or even experience with, the models themselves. I was a bit shocked to see an interview the other day with Karl Friston (a highly intelligent theoretical neuroscientist) in which he happily pontificated about ChatGPT and offered opinions about it while admitting that he had never even used it!

The unfortunate "language model" name, and the associated assumption that "predict next word" is all these models could be doing IF (falsely) they lacked the capacity to learn anything more than language, seem largely to blame.



> ...the input is consumed and per the internal "world model" a high level internal representation of the input is built...

This is the aspect of ChatGPT I'm trying to understand. Can you point to any resources on this?


No - I'm not sure anyone outside of OpenAI knows, and maybe they only have a rough understanding themselves.

We don't even know the exact architecture of GPT-4 - is it just a Transformer, or does it have more to it? The head of OpenAI, Sam Altman, was interviewed by Lex Fridman yesterday (you can find it on YouTube) and he mentioned that, paraphrasing, "OpenAI is all about performance of the model, even if that involves hacks ...".

While Sutskever describes GPT-4 as having learnt this "world model", Sam Altman instead describes it as having learnt a non-specific "something" from the training data. It seems they may still be trying to figure out much of how it is working themselves, although Altman also said that "it took a lot of understanding to build GPT-4", so apparently it's more than just a scaling up of earlier models.

Note too that my description of its internal state being maintained/updated through the conversation is likely (without knowing the exact architecture) to be more functional than literal. If it is just a plain Transformer, then its internal state is calculated from scratch for each word it is asked to generate, but evidently there is a great deal of continuity between the internal state when the input is, say, prompt words 1-100 and when it is words 2-101. So (assuming they haven't added any architectural modification to remember anything of prior state) the internal state isn't really "updated" as such, but rather regenerated into an updated form.
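The "regenerated, not updated" point can be sketched with a stateless forward function (a toy, assuming a plain decoder-only Transformer with no extra memory; the embedding and averaging below are invented stand-ins for the real learned computation). The "state" is a pure function of the visible context, recomputed from scratch each call, yet states for overlapping contexts come out highly similar:

```python
def embed(token):
    # Deterministic toy embedding; a real model learns these vectors.
    return [((sum(map(ord, token)) * (i + 3)) % 97) / 97.0 for i in range(4)]

def forward(context):
    # Pure function of the context - nothing survives between calls,
    # which is the "calculated from scratch for each word" property.
    # (A real transformer does vastly more than average embeddings.)
    vecs = [embed(t) for t in context]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

tokens = ["the", "cat", "sat", "on", "the", "mat"]

# States for words 1-5 and 2-6 are computed completely independently...
state_a = forward(tokens[0:5])
state_b = forward(tokens[1:6])

# ...yet stay close, because the two contexts share 4 of their 5 tokens.
difference = sum(abs(a - b) for a, b in zip(state_a, state_b))
print("state for words 1-5:", state_a)
print("state for words 2-6:", state_b)
print("total difference:", difference)
```

Running forward() twice on the same context gives bit-identical results - there is no hidden memory to update - yet the overlap between successive contexts supplies the continuity, which is the "regenerated into an updated form" idea.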

Lots of questions, not so many answers, unfortunately!



