Elevator Pitch
- Optimize quantized LLMs by treating memory as a fit constraint first, then selecting per-tensor datatypes to get the best real on-device experience: the tradeoff between tokens per second (TPS) and output quality.
Key Takeaways
- Memory is a budget to meet; once the model fits, smaller files aren't automatically better. Optimize the TPS/quality tradeoff instead (see the sketch after this list).
- On CPUs, lower bitlengths tend to predictably improve TPS once the model fits; on GPUs, “fewer bits” can be slower due to kernel and decode-path effects.
- Across tested devices (Pi 5, Intel i7, RTX 5090/4080), ShapeLearn-chosen ByteShape quantizations generally land on a better TPS/quality curve than Unsloth and MagicQuant.
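The takeaways above reduce to a simple selection rule: treat memory as a constraint to satisfy, not a score to minimize, then trade off TPS against quality among the variants that fit. Below is a minimal sketch of that rule; the `Candidate` record, `pick_variant` function, and `quality_floor` parameter are hypothetical names for illustration, not anything from the article.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str        # a quant variant, e.g. a per-tensor datatype mix
    size_gb: float   # memory footprint on the target device
    tps: float       # measured tokens per second on that device
    quality: float   # any quality score, higher is better

def pick_variant(candidates, mem_budget_gb, quality_floor):
    """Memory is a budget to meet: keep every variant that fits,
    then maximize TPS subject to a quality floor, rather than
    preferring the smallest file."""
    fits = [c for c in candidates if c.size_gb <= mem_budget_gb]
    acceptable = [c for c in fits if c.quality >= quality_floor]
    if not acceptable:
        return None  # nothing meets both the budget and the quality bar
    return max(acceptable, key=lambda c: c.tps)
```

Note the design choice: a smaller file that fits is never preferred over a larger file that also fits but decodes faster at acceptable quality, which is exactly the "once it fits, smaller isn't automatically better" point.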
Most Memorable Quotes
- “Bottom line: treat memory as a budget to meet, then optimize what matters most: TPS and quality.”
- “Different quantization formats can trigger different kernels and overheads, and on some GPUs, going lower-bit can even get slower, despite using less memory.”
- “If your system can't run a 30B model smoothly, don't blame the model or the silicon. Blame the datatypes.”
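The second quote is an empirical claim, and the way to check it on your own hardware is to time decode throughput for two files of the same model at different bitlengths. A minimal sketch follows, assuming GGUF files and the llama-cpp-python runtime; the article does not specify a runtime here, and the file paths and prompt are placeholders.

```python
import time
from llama_cpp import Llama  # assumed runtime; any backend that reports token counts works

def measure_tps(model_path, prompt, max_tokens=128, n_gpu_layers=-1):
    """Return generated tokens per second for one quantized model file.
    n_gpu_layers=-1 offloads all layers to the GPU; use 0 for CPU-only."""
    llm = Llama(model_path=model_path, n_gpu_layers=n_gpu_layers, verbose=False)
    start = time.perf_counter()
    out = llm(prompt, max_tokens=max_tokens)
    elapsed = time.perf_counter() - start
    return out["usage"]["completion_tokens"] / elapsed

# Hypothetical comparison (placeholder paths, not files from the article):
# tps_q4 = measure_tps("model-q4.gguf", "Explain KV caching in one paragraph.")
# tps_q2 = measure_tps("model-q2.gguf", "Explain KV caching in one paragraph.")
# On some GPUs the lower-bit file can come out slower despite using less memory.
```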
Source URL • Original: 2760 words • Summary: 170 words