Quantization, Floating Points and TurboQuant

darshanmakwana412.github.io

A lot of effort goes into making LLM inference cheaper and faster. Quantization is the standard way to do this: we shrink the model by representing its parameters with fewer bits, so they take up less memory and move faster through the memory hierarchy. The progression from 32-bit to mixed precision, then to 16-bit, 8-bit, and 4-bit formats has been one of the most impactful practical developments in LLM inference.

Floating Point Formats
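To make the "fewer bits, less memory" trade-off concrete, here is a minimal sketch that stores one value as a 32-bit and as a 16-bit float, then compares storage size and rounding error. It uses only Python's standard `struct` module; the helper `roundtrip` and the sample `weight` are illustrative names, not taken from the original post.

```python
import struct

def roundtrip(value, fmt):
    """Pack a float into the given binary format and read it back.

    Returns (bytes_used, restored_value).
    """
    packed = struct.pack(fmt, value)
    (restored,) = struct.unpack(fmt, packed)
    return len(packed), restored

weight = 0.1234567  # hypothetical model parameter

# "<f" is IEEE 754 single precision, "<e" is IEEE 754 half precision.
size32, value32 = roundtrip(weight, "<f")
size16, value16 = roundtrip(weight, "<e")

err32 = abs(weight - value32)
err16 = abs(weight - value16)

print(f"fp32: {size32} bytes, abs error {err32:.2e}")
print(f"fp16: {size16} bytes, abs error {err16:.2e}")
```

Halving the bit width halves the storage per parameter, at the cost of a larger rounding error. That is the basic trade that every lower-precision format makes.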
