My bad on the 6 D V estimate; you are correct that if they do a dense decoding (rather than a hierarchical one, as Google used to do in the old days), the cost is exactly 6 D V. I cannot edit the GP comment, so I will absorb the shame of my careless words there. I was put off by the subtitle and initial title of this HN post, though the current title is more appropriate and correct.
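
For anyone following along, here is a back-of-the-envelope sketch of where that figure comes from. I am assuming D is the hidden width and V is the vocabulary size, and that the count is per token; the function name and the example sizes are made up for illustration.

```python
def dense_decoding_flops_per_token(d_model: int, vocab_size: int) -> int:
    """Approximate training FLOPs per token for a dense D x V output projection.

    Assumes D = hidden width and V = vocabulary size (my reading of the notation).
    """
    # Forward: one D x V matmul, counting a multiply and an add per weight.
    forward = 2 * d_model * vocab_size
    # Backward: gradient w.r.t. the input plus gradient w.r.t. the weights,
    # each roughly the same cost as the forward matmul.
    backward = 4 * d_model * vocab_size
    return forward + backward


# Hypothetical sizes, for illustration only.
flops = dense_decoding_flops_per_token(d_model=1024, vocab_size=50_000)
print(f"{flops:,} FLOPs per token for the output layer")
```

A hierarchical softmax would replace the single D x V matmul with a much smaller set of products per token, which is why the estimate differs between the two.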

Even for a small model, one could use DDP or FSDP/FSDP2 without slowdowns on a fast interconnect, which certainly adds to the cost. But if you want to reproduce all the work at the cheapest price point, you only need to parallelize to the minimal level needed to fit in memory (or rather, the level that maximizes MFU), so everything below 2B parameters runs on a single H100 or a single node.
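
A rough sketch of the "fits in memory" arithmetic, under assumptions that are mine and not from the paper: mixed-precision Adam at about 16 bytes per parameter, the 80 GB H100 variant, and an arbitrary half of the memory reserved for activations. Real headroom depends heavily on batch size, sequence length, and checkpointing.

```python
H100_MEMORY_BYTES = 80 * 10**9  # assumes the 80 GB H100 variant

# Assumed mixed-precision Adam footprint per parameter:
# bf16 weights (2) + bf16 grads (2) + fp32 master weights (4) + two fp32 Adam moments (4 + 4).
BYTES_PER_PARAM = 16


def training_state_bytes(n_params: int) -> int:
    """Bytes for weights, gradients, and optimizer state (no activations)."""
    return n_params * BYTES_PER_PARAM


def fits_on_single_gpu(n_params: int, activation_fraction: float = 0.5) -> bool:
    """True if the training state fits after reserving a share for activations.

    activation_fraction is a placeholder guess, not a measured value.
    """
    budget = H100_MEMORY_BYTES * (1 - activation_fraction)
    return training_state_bytes(n_params) <= budget


for n_params in (500_000_000, 2_000_000_000, 3_000_000_000):
    gb = training_state_bytes(n_params) / 10**9
    print(f"{n_params / 1e9:.1f}B params: {gb:.0f} GB state, fits={fits_on_single_gpu(n_params)}")
```

Under these assumptions a 2B model needs about 32 GB of state, which leaves room on one card; past that point, sharding the optimizer state starts to pay for itself.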