Taalas’ silicon Llama achieves 17K tokens/sec per user, nearly 10X faster than the current state of the art, while costing 20X less to build, and consuming 10X less power.

☆ Yσɠƚԋσʂ ☆@lemmygrad.ml · 3 days ago

Taalas’ silicon Llama achieves 17K tokens/sec per user, nearly 10X faster than the current state of the art, while costing 20X less to build, and consuming 10X less power.

KnilAdlez [none/use name]@hexbear.net · edit-2 3 days ago

So, like crypto, AI has fucked over the gpu market and will now go to ASICs

☆ Yσɠƚԋσʂ ☆@lemmygrad.ml · 3 days ago

I mean, ASICs are an obvious approach for running neural networks.

peeonyou [he/him]@hexbear.net · 3 days ago

I thought that was the point of TPUs?

☆ Yσɠƚԋσʂ ☆@lemmygrad.ml · 3 days ago

TPUs are general-purpose accelerators for any neural network. They still have to fetch instructions and manage a programmable memory hierarchy, because they have to provide a general-purpose platform for anything from a CNN to a Transformer.

What Taalas is doing amounts to hard-wiring the actual weights and structure of a specific model directly into the silicon. They call these Hardcore Models because the model is expressed directly in the topology of the chip. Their approach eliminates the overhead of fetching instructions and of moving data between a separate processor and memory. They bypass the memory wall that plagues GPUs and TPUs by merging storage and computation at the transistor level, and that’s why they see such a massive leap in tokens per second and power efficiency. The trade-off is that you sacrifice the ability to run any model in exchange for running one specific model at near-instantaneous speeds and at a fraction of the cost.
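
To put rough numbers on the memory wall, here’s a back-of-envelope sketch. It assumes single-user decoding where every weight has to be read from memory once per generated token, so the token rate is capped at memory bandwidth divided by model size. The parameter count (8 billion), the 2 bytes per weight, and the 1 TB/s bandwidth are illustrative assumptions of mine, not figures from the article; only the 17K tokens/sec number comes from the post.

```python
def bandwidth_bound_tokens_per_sec(params: float,
                                   bytes_per_param: float,
                                   bandwidth_bytes_per_sec: float) -> float:
    """Upper bound on single-user decode rate if all weights are
    streamed from memory once per token (ignores caches, batching, KV cache)."""
    model_bytes = params * bytes_per_param
    return bandwidth_bytes_per_sec / model_bytes


def bandwidth_needed(tokens_per_sec: float,
                     params: float,
                     bytes_per_param: float) -> float:
    """Memory bandwidth (bytes/s) needed to hit a target rate under the same model."""
    return tokens_per_sec * params * bytes_per_param


# Assumed, illustrative values (not from the article):
PARAMS = 8e9            # hypothetical 8B-parameter model
BYTES_PER_PARAM = 2     # hypothetical 16-bit weights
BANDWIDTH = 1e12        # hypothetical 1 TB/s memory bandwidth

ceiling = bandwidth_bound_tokens_per_sec(PARAMS, BYTES_PER_PARAM, BANDWIDTH)
needed = bandwidth_needed(17_000, PARAMS, BYTES_PER_PARAM)

print(f"bandwidth-bound ceiling: {ceiling:.1f} tokens/s")
print(f"bandwidth needed for 17K tokens/s: {needed / 1e12:.0f} TB/s")
```

Under those assumptions a weight-streaming design tops out at a few dozen tokens per second per user, and hitting 17K would take hundreds of TB/s of bandwidth. If the weights never have to move because they are part of the circuit, that bound simply stops applying.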

bobs_guns@lemmygrad.ml · 3 days ago

I guess the only problem is that if the model gets superseded by a better one, the chip becomes useless.

☆ Yσɠƚԋσʂ ☆@lemmygrad.ml · 3 days ago

It depends on how the chip is designed. With something like an FPGA or memristors, you could reconfigure the chip itself to support different network topologies and weights. But even a chip that’s not configurable is still pretty useful. If you can make a chip that runs a full DeepSeek, it can do a ton of very useful tasks right now, even without any future upgrades. So it’s not like outdated chips become useless: if you set one up for whatever task you need and it does the job, you just keep using it for that.

invalidusernamelol [he/him]@hexbear.net · 3 days ago

This is just the reification of the technology. Same as CPU architectures. You’ll have chips designed with a specific instruction set and then you’ll be sold one with the new instruction set.

peeonyou [he/him]@hexbear.net · 3 days ago

wouldn’t an fpga be a better idea? at least when better models came out you could reprogram it no?

☆ Yσɠƚԋσʂ ☆@lemmygrad.ml · 3 days ago

In terms of flexibility for sure, but I’m guessing they’re just going for raw optimisation here to see how fast they can make it go.

The path to ubiquitous AI | Taalas