Elevator Pitch
- Optimize quantized LLMs by treating memory as a fit constraint first, then selecting per-tensor datatypes to get the best real on-device experience: the tradeoff between tokens per second (TPS) and output quality.
Key Takeaways
- Memory is a budget to meet; once the model fits, smaller files aren't automatically better. Optimize the TPS/quality tradeoff instead (see the sketch after this list).
- On CPUs, lower bitlengths tend to predictably improve TPS once the model fits; on GPUs, “fewer bits” can be slower due to kernel and decode-path effects.
- Across tested devices (Pi 5, Intel i7, RTX 5090/4080), ShapeLearn-chosen ByteShape quantizations generally land on a better TPS/quality curve than Unsloth and MagicQuant.
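The takeaways above reduce to a simple selection rule: treat memory as a constraint to satisfy, not a score to minimize, then trade off TPS against quality among the variants that fit. Below is a minimal sketch of that rule; the `Candidate` record, `pick_variant` function, and `quality_floor` parameter are hypothetical names for illustration, not anything from the article.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str        # a quant variant, e.g. a per-tensor datatype mix
    size_gb: float   # memory footprint on the target device
    tps: float       # measured tokens per second on that device
    quality: float   # any quality score, higher is better

def pick_variant(candidates, mem_budget_gb, quality_floor):
    """Memory is a budget to meet: keep every variant that fits,
    then maximize TPS subject to a quality floor, rather than
    preferring the smallest file."""
    fits = [c for c in candidates if c.size_gb <= mem_budget_gb]
    acceptable = [c for c in fits if c.quality >= quality_floor]
    if not acceptable:
        return None  # nothing meets both the budget and the quality bar
    return max(acceptable, key=lambda c: c.tps)
```

Note the design choice: a smaller file that fits is never preferred over a larger file that also fits but decodes faster at acceptable quality, which is exactly the "once it fits, smaller isn't automatically better" point.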
Most Memorable Quotes
- “Bottom line: treat memory as a budget to meet, then optimize what matters most: TPS and quality.”
- “Different quantization formats can trigger different kernels and overheads, and on some GPUs, going lower-bit can even get slower, despite using less memory.”
- “If your system can't run a 30B model smoothly, don't blame the model or the silicon. Blame the datatypes.”
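The second quote is an empirical claim, and the way to check it on your own hardware is to time decode throughput for two files of the same model at different bitlengths. A minimal sketch follows, assuming GGUF files and the llama-cpp-python runtime; the article does not specify a runtime here, and the file paths and prompt are placeholders.

```python
import time
from llama_cpp import Llama  # assumed runtime; any backend that reports token counts works

def measure_tps(model_path, prompt, max_tokens=128, n_gpu_layers=-1):
    """Return generated tokens per second for one quantized model file.
    n_gpu_layers=-1 offloads all layers to the GPU; use 0 for CPU-only."""
    llm = Llama(model_path=model_path, n_gpu_layers=n_gpu_layers, verbose=False)
    start = time.perf_counter()
    out = llm(prompt, max_tokens=max_tokens)
    elapsed = time.perf_counter() - start
    return out["usage"]["completion_tokens"] / elapsed

# Hypothetical comparison (placeholder paths, not files from the article):
# tps_q4 = measure_tps("model-q4.gguf", "Explain KV caching in one paragraph.")
# tps_q2 = measure_tps("model-q2.gguf", "Explain KV caching in one paragraph.")
# On some GPUs the lower-bit file can come out slower despite using less memory.
```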
Source URL • Original: 2760 words • Summary: 170 words