
I'd never seen an LLM summarized like this before, and I really like it:

    # look up an embedding vector for each input token
    hidden_state = self.embeddings(input_tokens)

    # pass the representation through the stack of layers
    for layer in self.layers:
        hidden_state = layer(hidden_state)

    # project back to vocabulary size: one logit per token
    return transform_into_logits(hidden_state)


I don't follow. Isn't this the flow for practically every neural network, i.e. you index the sampled inputs from the embedding matrix, forward them through every hidden layer, and finally transform to the dimensions of your token vocabulary so the result can be interpreted as logits (unnormalized log-probabilities)?
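To make that last step concrete, here's a minimal sketch of the projection in PyTorch (my own illustration with made-up sizes, not from the article):

    import torch

    hidden_state = torch.randn(1, 4096)        # (batch, hidden_dim), hypothetical sizes
    lm_head = torch.nn.Linear(4096, 32000)     # hidden_dim -> vocab_size
    logits = lm_head(hidden_state)             # unnormalized log-probabilities
    probs = torch.softmax(logits, dim=-1)      # distribution over the next token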


Yes, but I've never seen it expressed so clearly as pseudocode before.


This is not specific to LLMs, so it's not really informative about how LLMs work. The same flow applies to CNNs, LSTMs, MLPs, or really any layered data-processing program.


Not really. An LSTM, for example, requires a recurrent element: you update the hidden state and pass it through the same layer again as you work through the sequence. In fact the pseudocode shows very nicely how much simpler transformers are. And an MLP is already a component of the transformer architecture.
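For comparison, a rough sketch (my own, with made-up sizes) of the LSTM-style recurrence I mean, where the same cell is applied at every timestep and its state is fed back in:

    import torch

    cell = torch.nn.LSTMCell(input_size=256, hidden_size=256)
    inputs = torch.randn(10, 1, 256)           # (seq_len, batch, input_size), made-up sizes
    h = torch.zeros(1, 256)                    # hidden state
    c = torch.zeros(1, 256)                    # cell state
    for x_t in inputs:                         # recurrence over the sequence
        h, c = cell(x_t, (h, c))               # same layer, updated state at each step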


No? You could perfectly well plug in an RNN or a bidirectional RNN as the layer. This is the pseudocode for applying a stack of layers; it doesn't really matter what those layers are: transformer blocks, RNNs, convolutions, dilated convolutions, etc. The recurrence happens within a layer, not between layers.
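For example (a hedged sketch, not from the article; the RNNLayer wrapper and sizes are made up), the recurrence can live entirely inside a layer's forward pass, so the same outer stacking loop applies unchanged:

    import torch

    class RNNLayer(torch.nn.Module):           # hypothetical wrapper
        def __init__(self, dim):
            super().__init__()
            self.rnn = torch.nn.GRU(dim, dim, batch_first=True)

        def forward(self, hidden_state):       # (batch, seq_len, dim)
            out, _ = self.rnn(hidden_state)    # the recurrence happens in here
            return out

    layers = torch.nn.ModuleList(RNNLayer(256) for _ in range(4))
    hidden_state = torch.randn(2, 10, 256)     # made-up batch/seq/dim
    for layer in layers:                       # same stacking loop as the pseudocode
        hidden_state = layer(hidden_state)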


Exactly. Nothing prevents the list of layers from being the same layer repeated or entirely different layers.


Isn't this the typical representation we used back when working with LSTMs?


No, because LSTMs are recurrent. You couldn't use the same algorithm outlined here. Instead you'd have to iteratively pass elements of the sequence through the same layer over and over.


You are confused. The recurrence is within a layer, not between layers. The algorithm shown is for applying a stack of layers, but it doesn't really matter what the layers are. You can do the same (and people have been doing the same) with RNNs, convolutional networks, etc.

In reality it would typically be more complex for decoders, because you want to pass along a cache (such as a key-value cache in a transformer), add residual connections, etc.
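Something like this, roughly (a simplified sketch with invented names; real implementations cache the projected keys/values rather than raw layer inputs as done here):

    import torch

    class DecoderLayer(torch.nn.Module):       # hypothetical, for illustration
        def __init__(self, dim, n_heads):
            super().__init__()
            self.attn = torch.nn.MultiheadAttention(dim, n_heads, batch_first=True)
            self.mlp = torch.nn.Sequential(
                torch.nn.Linear(dim, 4 * dim),
                torch.nn.GELU(),
                torch.nn.Linear(4 * dim, dim),
            )

        def forward(self, x, cache=None):
            # attend over previously cached positions plus the new token
            kv = x if cache is None else torch.cat([cache, x], dim=1)
            attn_out, _ = self.attn(x, kv, kv)
            x = x + attn_out                   # residual around attention
            x = x + self.mlp(x)                # residual around the MLP
            return x, kv                       # kv doubles as the updated cache

    layer = DecoderLayer(256, 8)
    x = torch.randn(1, 1, 256)                 # one new token, made-up sizes
    out, cache = layer(x, cache=None)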



