Elevator Pitch

  • Optimize quantized LLMs by treating memory as a fit constraint first, then selecting per-tensor datatypes that give the best on-device tradeoff between tokens-per-second (TPS) and output quality.

Key Takeaways

  • Memory is a budget to meet; once the model fits, smaller files aren’t automatically better—optimize the TPS/quality tradeoff instead (see the sketch after this list).
  • On CPUs, lower bitlengths tend to predictably improve TPS once the model fits; on GPUs, “fewer bits” can be slower due to kernel and decode-path effects.
  • Across tested devices (Pi 5, Intel i7, RTX 5090/4080), ShapeLearn-chosen ByteShape quantizations generally land on a better TPS/quality curve than Unsloth and MagicQuant.
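
A minimal sketch of that selection logic, assuming hypothetical candidate names and made-up numbers (this is not ShapeLearn’s actual algorithm): drop everything that doesn’t fit the memory budget, then pick the highest-TPS quantization subject to a quality floor rather than simply the smallest file.

```python
# Sketch only: illustrative candidates, not real benchmark data.
from dataclasses import dataclass

@dataclass
class Quant:
    name: str        # e.g. a per-tensor datatype mix
    size_gb: float   # file / memory footprint
    tps: float       # measured tokens-per-second on the target device
    quality: float   # measured quality score (higher is better)

def pick(candidates: list[Quant], budget_gb: float, min_quality: float) -> Quant:
    """Memory is a hard fit constraint; among the fitting candidates,
    maximize TPS subject to a quality floor."""
    fitting = [q for q in candidates if q.size_gb <= budget_gb]
    if not fitting:
        raise ValueError("no candidate fits the memory budget")
    # If nothing clears the quality floor, fall back to whatever fits.
    acceptable = [q for q in fitting if q.quality >= min_quality] or fitting
    return max(acceptable, key=lambda q: q.tps)

# Hypothetical candidates for a single device:
candidates = [
    Quant("q3_mix", 11.2, 38.0, 0.88),
    Quant("q4_mix", 14.5, 35.5, 0.94),
    Quant("q5_mix", 17.9, 31.0, 0.96),
]
print(pick(candidates, budget_gb=16.0, min_quality=0.90).name)  # -> q4_mix
```

Note that the smallest file (q3_mix) is not chosen: once the model fits, the decision is driven by the TPS/quality curve, not by size alone.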

Most Memorable Quotes

  • “Bottom line: treat memory as a budget to meet, then optimize what matters most: TPS and quality.”
  • “Different quantization formats can trigger different kernels and overheads, and on some GPUs, going lower-bit can even get slower, despite using less memory.”
  • “If your system can’t run a 30B model smoothly, don’t blame the model or the silicon. Blame the datatypes.”
