I'll be honest: I've never actually considered tokens per second as something to focus on for my projects. I'm much more concerned with the quality of the output than the quantity.
Is 500 tok/s on Gemma 7B a game-changer, or is this more just an advertisement for mystic.ai?
Improvements in inference speed would also show up on those bigger models that only partially fit into GPU VRAM. In some cases, the improvement on the GPU side alone is enough to turn what you would previously consider a too-slow-to-be-usable higher-quality model into a fast, usable one.
There's a percentage of users who do care about token generation speed because they chain multiple API calls. The performance is all thanks to TensorRT-LLM; Mystic takes care of the engineering of getting a scalable endpoint out of it, i.e., not having to manage your own infra.
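To make the chaining point concrete, here's a minimal back-of-the-envelope sketch (not Mystic's API; the call counts, token counts, and speeds are illustrative assumptions) showing how per-call generation time compounds across a sequential pipeline, so throughput gains translate directly into end-to-end latency:

```python
# Hypothetical numbers: why tok/s matters when LLM calls are chained sequentially.

def chain_latency_s(calls: int, tokens_per_call: int, tok_per_s: float) -> float:
    """End-to-end generation time for a sequential chain of LLM calls."""
    return calls * tokens_per_call / tok_per_s

if __name__ == "__main__":
    # e.g. a 5-step agent pipeline generating ~400 tokens per step
    for speed in (50, 150, 500):  # assumed endpoint speeds in tok/s
        total = chain_latency_s(calls=5, tokens_per_call=400, tok_per_s=speed)
        print(f"{speed:>4} tok/s -> {total:5.1f} s end-to-end")
```

At 50 tok/s the chain takes 40 s; at 500 tok/s it's 4 s, which is the difference between an interactive workflow and one users won't wait for.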