
I'd never seen an LLM summarized like this before, and I really like it:

    # look up an embedding vector for each input token
    hidden_state = self.embeddings(input_tokens)

    # pass the representation through the stack of layers
    for layer in self.layers:
        hidden_state = layer(hidden_state)

    # project back to vocabulary size: one logit per token
    return transform_into_logits(hidden_state)


I don't follow. Isn't this the flow for practically every neural network, i.e. you index the sampled inputs from the embedding matrix, forward them through every hidden layer, and finally transform to the dimensions of your token vocabulary so the result can be interpreted as logits (unnormalized log-probabilities)?
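To make that last step concrete, here's a minimal sketch of the projection in PyTorch (my own illustration with made-up sizes, not from the article):

    import torch

    hidden_state = torch.randn(1, 4096)        # (batch, hidden_dim), hypothetical sizes
    lm_head = torch.nn.Linear(4096, 32000)     # hidden_dim -> vocab_size
    logits = lm_head(hidden_state)             # unnormalized log-probabilities
    probs = torch.softmax(logits, dim=-1)      # distribution over the next token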


Yes, but I've never seen it expressed so clearly as pseudocode before.


This is not specific to LLMs, so it's not really informative about how LLMs work. The same flow applies to CNNs, LSTMs, MLPs, or really any layered data-processing program.


Not really. An LSTM, for example, requires a recurrent element: you update the hidden state and pass it through the same layer again as you work through the sequence. In fact the pseudocode shows very nicely how much simpler transformers are. And an MLP is already a component of the transformer architecture.
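For comparison, a rough sketch (my own, with made-up sizes) of the LSTM-style recurrence I mean, where the same cell is applied at every timestep and its state is fed back in:

    import torch

    cell = torch.nn.LSTMCell(input_size=256, hidden_size=256)
    inputs = torch.randn(10, 1, 256)           # (seq_len, batch, input_size), made-up sizes
    h = torch.zeros(1, 256)                    # hidden state
    c = torch.zeros(1, 256)                    # cell state
    for x_t in inputs:                         # recurrence over the sequence
        h, c = cell(x_t, (h, c))               # same layer, updated state at each step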


No? You could perfectly well plug in an RNN or a bidirectional RNN as the layer. This is the pseudocode for applying a stack of layers; it doesn't really matter what those layers are: transformer blocks, RNNs, convolutions, dilated convolutions, etc. The recurrence happens within a layer, not between layers.
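For example (a hedged sketch, not from the article; the RNNLayer wrapper and sizes are made up), the recurrence can live entirely inside a layer's forward pass, so the same outer stacking loop applies unchanged:

    import torch

    class RNNLayer(torch.nn.Module):           # hypothetical wrapper
        def __init__(self, dim):
            super().__init__()
            self.rnn = torch.nn.GRU(dim, dim, batch_first=True)

        def forward(self, hidden_state):       # (batch, seq_len, dim)
            out, _ = self.rnn(hidden_state)    # the recurrence happens in here
            return out

    layers = torch.nn.ModuleList(RNNLayer(256) for _ in range(4))
    hidden_state = torch.randn(2, 10, 256)     # made-up batch/seq/dim
    for layer in layers:                       # same stacking loop as the pseudocode
        hidden_state = layer(hidden_state)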


Exactly. Nothing prevents the list of layers from being the same layer repeated or entirely different layers.


Isn't this the typical representation we used back when working with LSTMs?


No, because LSTMs are recurrent. You couldn't use the same algorithm outlined here. Instead you'd have to iteratively pass elements of the sequence through the same layer over and over.


You are confused. The recurrence is within a layer, not between layers. The algorithm shown is for applying a stack of layers, but it doesn't really matter what the layers are. You can do the same (and people have been doing the same) with RNNs, convolutional networks, etc.

In reality it would typically be more complex for decoders, because you want to pass along a cache (such as a key-value cache in a transformer), add residual connections, etc.
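Something like this, roughly (a simplified sketch with invented names; real implementations cache the projected keys/values rather than raw layer inputs as done here):

    import torch

    class DecoderLayer(torch.nn.Module):       # hypothetical, for illustration
        def __init__(self, dim, n_heads):
            super().__init__()
            self.attn = torch.nn.MultiheadAttention(dim, n_heads, batch_first=True)
            self.mlp = torch.nn.Sequential(
                torch.nn.Linear(dim, 4 * dim),
                torch.nn.GELU(),
                torch.nn.Linear(4 * dim, dim),
            )

        def forward(self, x, cache=None):
            # attend over previously cached positions plus the new token
            kv = x if cache is None else torch.cat([cache, x], dim=1)
            attn_out, _ = self.attn(x, kv, kv)
            x = x + attn_out                   # residual around attention
            x = x + self.mlp(x)                # residual around the MLP
            return x, kv                       # kv doubles as the updated cache

    layer = DecoderLayer(256, 8)
    x = torch.randn(1, 1, 256)                 # one new token, made-up sizes
    out, cache = layer(x, cache=None)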



