
tokens are on average four characters, and the number of residual streams (and therefore the RAM) the LLM allocates to a given sequence is proportional to the number of units of input. the FLOPs in the attention calculation are proportional to the square of that number.

you can hypothetically try to ameliorate this by other means, but if you just naively drop from tokenization to character- or byte-level models, this is what goes wrong
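to make that concrete, here's a rough back-of-the-envelope sketch (my own toy numbers, nothing measured): the 4-chars-per-token average is the figure quoted above, d_model and the function names are illustrative placeholders, and the cost counted is only the QK^T score matrix, not the whole forward pass.

    # assumption: ~4 characters per token on average, as stated above
    CHARS_PER_TOKEN = 4

    def attention_cost(seq_len, d_model):
        # FLOPs for the QK^T score matrix alone: ~ seq_len^2 * d_model
        return seq_len ** 2 * d_model

    def residual_stream_memory(seq_len, d_model):
        # activations kept per position: ~ seq_len * d_model values
        return seq_len * d_model

    text_chars = 4096
    d_model = 1024  # hypothetical model width

    token_len = text_chars // CHARS_PER_TOKEN  # token-level sequence length
    byte_len = text_chars                      # one position per character/byte

    print(attention_cost(byte_len, d_model) /
          attention_cost(token_len, d_model))            # 16.0
    print(residual_stream_memory(byte_len, d_model) /
          residual_stream_memory(token_len, d_model))    # 4.0

so the memory tied to the residual streams goes up ~4x while the attention-score FLOPs go up ~16x, which is the tradeoff being discussed below.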



4x seq length expansion doesn’t sound that bad.


I mean, it's not completely fatal, but it means roughly a 16x increase in runtime cost, if I'm not mistaken. That's probably not worth paying just to solve letter counting in most applications.


it is not necessarily 16x if you also, e.g., decrease model width by a factor of 4 or so, but yeah, naively the RAM and FLOPs in attention scale up as n^2
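a hedged sketch of that tradeoff (my own toy split of per-layer cost into an n*d^2 projection/MLP-like term and an n^2*d attention-score term; constants dropped, numbers purely illustrative): cutting the width by 4x shrinks the projection term 4x while the attention term still grows 4x, so the combined cost lands well short of 16x.

    def per_layer_flops(n, d):
        # crude per-layer split: linear layers scale with width squared,
        # attention scores scale with sequence length squared
        mlp_like = n * d * d
        attn_scores = n * n * d
        return mlp_like, attn_scores

    base = per_layer_flops(1024, 4096)        # token-level baseline
    same_width = per_layer_flops(4096, 4096)  # 4x longer sequence, same width
    narrower = per_layer_flops(4096, 1024)    # 4x longer sequence, 4x narrower

    for name, (mlp, attn) in [("token baseline", base),
                              ("bytes, same width", same_width),
                              ("bytes, width / 4", narrower)]:
        print(f"{name:20s} mlp-like {mlp:.3e}  attn scores {attn:.3e}"
              f"  total {mlp + attn:.3e}")

in this toy accounting the narrower byte-level model ends up with roughly the same per-layer total as the token-level baseline, though the attention term itself is still 4x larger.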



