Elevator Pitch

  • Taalas boosts LLM inference to ~17,000 tokens/sec by using a fixed-function ASIC that physically hardwires a quantized Llama model’s weights, eliminating most GPU-style memory traffic.

Key Takeaways

  • GPUs repeatedly shuttle weights/activations between compute and VRAM across layers, creating a bandwidth/energy “memory wall.”
  • Taalas “engraves” the model’s layers into silicon so data streams through on-chip transistors layer-by-layer instead of round-tripping to external memory.
  • The chip needs no external DRAM/HBM, though it keeps some on-chip SRAM for the KV cache and LoRA adapters; mapping a new model onto silicon is fast because only “the top two layers/masks” of the chip are customized per model.
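
The memory-wall claim above can be sanity-checked with back-of-envelope arithmetic: at batch size 1, a GPU must stream every weight from HBM for each generated token, so memory bandwidth caps the token rate regardless of compute. The figures below (average bits per weight, HBM bandwidth) are illustrative assumptions, not numbers from the article:

```python
# Bandwidth-bound ceiling for single-stream decoding on a GPU
# (illustrative assumptions, not figures from the article).

PARAMS = 8e9             # Llama 3.1 8B parameter count
BITS_PER_WEIGHT = 4.5    # assumed average for a mixed 3/6-bit quantization
HBM_BANDWIDTH = 3.35e12  # bytes/sec, roughly an H100's HBM3 (assumption)

weight_bytes = PARAMS * BITS_PER_WEIGHT / 8          # bytes read per token
seconds_per_token = weight_bytes / HBM_BANDWIDTH     # bandwidth-bound floor
gpu_ceiling = 1 / seconds_per_token                  # tokens/sec upper bound

print(f"~{gpu_ceiling:,.0f} tokens/sec bandwidth ceiling vs ~17,000 claimed")
```

Under these assumptions the GPU's single-stream ceiling lands in the high hundreds of tokens/sec, roughly 20x below the quoted 17,000, which is the gap that hardwiring the weights into transistors is meant to close.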

Most Memorable Quotes

  • “A startup called Taalas, recently released an ASIC chip running Llama 3.1 8B (3/6 bit quant) at an inference rate of 17,000 tokens per seconds.”
  • “They just engraved the 32 layers of Llama 3.1 sequentially on a chip. Essentially, the model's weights are physical transistors etched into the silicon.”
  • “It took them two months, to develop chip for Llama 3.1 8B.”

Original: 732 words · Summary: 166 words