I'll be honest: I've never actually considered tokens per second as something to focus on for my projects. I'm much more concerned with the quality of the output than the quantity.
Is 500 tok/s on Gemma 7B a game-changer, or is this more just an advertisement for mystic.ai?
Improvements in inference speed would also show up on those bigger models that only partially fit into GPU VRAM. In some cases, the improvement on the GPU side alone is enough to turn what you would previously consider a too-slow-to-be-usable higher-quality model into a fast, usable one.
There's a percentage of users who do care about token generation speed because they chain multiple API calls. The performance is all thanks to TensorRT-LLM; Mystic takes care of the engineering of getting a scalable endpoint out of it, i.e., not having to manage your own infra.
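To make the chaining point concrete, here's a minimal back-of-the-envelope sketch (not Mystic's API; the call counts, token counts, and speeds are illustrative assumptions) showing how per-call generation time compounds across a sequential pipeline, so throughput gains translate directly into end-to-end latency:

```python
# Hypothetical numbers: why tok/s matters when LLM calls are chained sequentially.

def chain_latency_s(calls: int, tokens_per_call: int, tok_per_s: float) -> float:
    """End-to-end generation time for a sequential chain of LLM calls."""
    return calls * tokens_per_call / tok_per_s

if __name__ == "__main__":
    # e.g. a 5-step agent pipeline generating ~400 tokens per step
    for speed in (50, 150, 500):  # assumed endpoint speeds in tok/s
        total = chain_latency_s(calls=5, tokens_per_call=400, tok_per_s=speed)
        print(f"{speed:>4} tok/s -> {total:5.1f} s end-to-end")
```

At 50 tok/s the chain takes 40 s; at 500 tok/s it's 4 s, which is the difference between an interactive workflow and one users won't wait for.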