Elevator Pitch
- Taalas boosts LLM inference to ~17,000 tokens/sec by using a fixed-function ASIC that physically hardwires a quantized Llama model’s weights, eliminating most GPU-style memory traffic.
Key Takeaways
- GPUs repeatedly shuttle weights/activations between compute and VRAM across layers, creating a bandwidth/energy “memory wall.”
- Taalas “engraves” the model’s layers into silicon so data streams through on-chip transistors layer-by-layer instead of round-tripping to external memory.
- The chip avoids external DRAM/HBM entirely but keeps some on-chip SRAM for the KV cache and LoRA adapters; fabrication turnaround is shortened by customizing only "the top two layers/masks" of the chip per model.
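The memory-wall takeaway can be made concrete with a rough calculation. The sketch below uses assumed numbers (not from the source): an 8B-parameter model at ~4 bits/weight (between the 3- and 6-bit quantization mentioned) and ~3.3 TB/s of HBM bandwidth, roughly one modern GPU. If every weight must be streamed from HBM once per generated token, bandwidth alone caps decode speed far below 17,000 tokens/sec, which is why hardwiring the weights into silicon changes the picture.

```python
# Back-of-envelope: why weight traffic bounds GPU decode speed.
# All constants below are illustrative assumptions, not figures from the source.

params = 8e9                 # Llama 3.1 8B parameter count
bits_per_weight = 4          # assumed average, between the 3/6-bit quant mentioned
hbm_bandwidth = 3.3e12       # bytes/sec, roughly one modern GPU's HBM

# Decoding one token requires reading every weight once from memory.
bytes_per_token = params * bits_per_weight / 8   # = 4 GB of weight traffic per token

# If decode is purely bandwidth-bound, tokens/sec = bandwidth / bytes moved per token.
gpu_ceiling = hbm_bandwidth / bytes_per_token
print(f"bandwidth-bound ceiling: {gpu_ceiling:.0f} tokens/sec")  # ~825 tokens/sec
```

Under these assumptions a single GPU tops out near ~825 tokens/sec on weight traffic alone, an order of magnitude below the 17,000 tokens/sec claimed; with weights etched into the datapath, that per-token memory round-trip disappears.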
Most Memorable Quotes
- “A startup called Taalas recently released an ASIC chip running Llama 3.1 8B (3/6-bit quant) at an inference rate of 17,000 tokens per second.”
- “They just engraved the 32 layers of Llama 3.1 sequentially on a chip. Essentially, the model's weights are physical transistors etched into the silicon.”
- “It took them two months to develop the chip for Llama 3.1 8B.”
Source URL
- Original: 732 words
- Summary: 166 words