Elevator Pitch
- Taalas boosts LLM inference to ~17,000 tokens/sec by using a fixed-function ASIC that physically hardwires a quantized Llama model’s weights, eliminating most GPU-style memory traffic.
Key Takeaways
- GPUs repeatedly shuttle weights/activations between compute and VRAM across layers, creating a bandwidth/energy “memory wall.”
- Taalas “engraves” the model’s layers into silicon so data streams through on-chip transistors layer-by-layer instead of round-tripping to external memory.
- The chip avoids external DRAM/HBM entirely but keeps some on-chip SRAM for the KV cache and LoRA adapters; fabrication turnaround is shortened by customizing only "the top two layers/masks" of the chip per model.
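The memory-wall takeaway can be made concrete with a rough calculation. The sketch below uses assumed numbers (not from the source): an 8B-parameter model at ~4 bits/weight (between the 3- and 6-bit quantization mentioned) and ~3.3 TB/s of HBM bandwidth, roughly one modern GPU. If every weight must be streamed from HBM once per generated token, bandwidth alone caps decode speed far below 17,000 tokens/sec, which is why hardwiring the weights into silicon changes the picture.

```python
# Back-of-envelope: why weight traffic bounds GPU decode speed.
# All constants below are illustrative assumptions, not figures from the source.

params = 8e9                 # Llama 3.1 8B parameter count
bits_per_weight = 4          # assumed average, between the 3/6-bit quant mentioned
hbm_bandwidth = 3.3e12       # bytes/sec, roughly one modern GPU's HBM

# Decoding one token requires reading every weight once from memory.
bytes_per_token = params * bits_per_weight / 8   # = 4 GB of weight traffic per token

# If decode is purely bandwidth-bound, tokens/sec = bandwidth / bytes moved per token.
gpu_ceiling = hbm_bandwidth / bytes_per_token
print(f"bandwidth-bound ceiling: {gpu_ceiling:.0f} tokens/sec")  # ~825 tokens/sec
```

Under these assumptions a single GPU tops out near ~825 tokens/sec on weight traffic alone, an order of magnitude below the 17,000 tokens/sec claimed; with weights etched into the datapath, that per-token memory round-trip disappears.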
Most Memorable Quotes
- “A startup called Taalas recently released an ASIC chip running Llama 3.1 8B (3/6-bit quant) at an inference rate of 17,000 tokens per second.”
- “They just engraved the 32 layers of Llama 3.1 sequentially on a chip. Essentially, the model's weights are physical transistors etched into the silicon.”
- “It took them two months to develop the chip for Llama 3.1 8B.”
Source URL
- Original: 732 words
- Summary: 166 words