I don't follow. Isn't this the flow for practically every neural network, i.e. you index the sampled inputs from the embedding matrix, forward them through every hidden layer, and then finally transform to the dimensions of your tokens so the output can be interpreted as log-counts?
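Roughly, I'm picturing something like this (a minimal PyTorch sketch of my reading, with made-up sizes and plain linear layers standing in for whatever the hidden layers actually are):

    import torch
    import torch.nn as nn

    vocab_size, d_model, n_layers = 1000, 64, 4

    embedding = nn.Embedding(vocab_size, d_model)      # embedding matrix
    hidden = nn.ModuleList([
        nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU())
        for _ in range(n_layers)
    ])
    unembed = nn.Linear(d_model, vocab_size)           # back to token dimensions

    tokens = torch.randint(0, vocab_size, (8,))        # sampled input token ids
    x = embedding(tokens)                              # index into the embedding matrix
    for layer in hidden:                               # forward through every hidden layer
        x = layer(x)
    logits = unembed(x)                                # interpreted as log-counts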
Not really. An LSTM, for example, would require a recurrent element where you update the hidden state and then pass it through the same layer again as you generate the output sequence. In fact, the pseudocode shows very nicely how much simpler transformers are. And an MLP is already a component of the transformer architecture.
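To make the contrast concrete, here is the kind of loop I mean for an LSTM decoder (a rough sketch I'm improvising: a single LSTM cell, with greedy argmax just so there is something to feed back in):

    import torch
    import torch.nn as nn

    vocab_size, d_model, steps = 1000, 64, 10
    embedding = nn.Embedding(vocab_size, d_model)
    cell = nn.LSTMCell(d_model, d_model)
    unembed = nn.Linear(d_model, vocab_size)

    token = torch.zeros(1, dtype=torch.long)           # some start token
    h = torch.zeros(1, d_model)                        # hidden state
    c = torch.zeros(1, d_model)                        # cell state
    for _ in range(steps):
        h, c = cell(embedding(token), (h, c))          # update the hidden state
        logits = unembed(h)
        token = logits.argmax(dim=-1)                  # feed the output back through the same cell

A transformer's forward pass has no such loop carrying a hidden state across positions; the whole sequence goes through each layer in one shot.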
No? You could perfectly well plug in an RNN or a bidirectional RNN as a layer. This is the pseudocode for applying multiple layers; it does not really matter what those layers are: transformer, RNN, convolution, dilated convolution, etc. The recurrence happens within a layer, not between layers.
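Something like this is all the layer loop is, as far as I can tell (a toy sketch with two interchangeable layer types, sizes made up):

    import torch
    import torch.nn as nn

    d_model = 64

    class RNNBlock(nn.Module):
        # the recurrence lives inside the layer: the GRU scans along the sequence
        def __init__(self, d):
            super().__init__()
            self.rnn = nn.GRU(d, d, batch_first=True)
        def forward(self, x):
            out, _ = self.rnn(x)
            return out

    layers = nn.ModuleList([
        RNNBlock(d_model),                                              # recurrent layer
        nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), # attention layer
    ])

    x = torch.randn(2, 16, d_model)     # (batch, sequence, features)
    for layer in layers:                # the outer loop never cares what the layer is
        x = layer(x)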