AI hardware startup Cerebras Systems claims to have built the fastest AI inference solution in the world, capable of processing thousands of tokens in a matter of seconds for mammoth AI models.
Powered by the startup’s Wafer Scale Engine (WSE) chips, the Cerebras inference solution can run large-scale AI models at super-fast speeds while ensuring responses are accurate.
The startup claims the solution can run Meta’s Llama 3.1, the largest open source AI model in the world, some 20 times faster than traditional GPU-based hyperscale cloud solutions for just one-fifth the price.
“With record-breaking performance, industry-leading pricing, and open API access, Cerebras Inference sets a new standard for open large language model development and deployment,” James Wang, director of product marketing at Cerebras, wrote in a blog post.
“As the only solution capable of delivering both high-speed training and inference, Cerebras opens entirely new capabilities for AI.”
Cerebras contends that traditional hyperscale cloud processing of AI workloads is slow due to the sizable amounts of memory bandwidth required by AI models as they generate each word sequentially.
For example, the 70 billion parameter version of Llama 3.1 requires 140GB of memory — that's two bytes for each parameter.
GPUs traditionally have around 200MB of on-chip memory, meaning they can’t store such large volumes of information. Instead, traditional systems have to move the entire model from external memory onto the chip for every output token they generate.
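For a rough sense of the bottleneck the company is describing, the back-of-the-envelope sketch below estimates how fast a single accelerator could generate tokens if the full 140GB of weights had to be streamed from memory for every token; the GPU memory bandwidth figure used here is an illustrative assumption rather than a number cited in the article.

```python
# Back-of-the-envelope estimate of the memory-bandwidth bottleneck described above.
# The model size figures come from the article; the GPU bandwidth value is an
# illustrative assumption, not a number reported by Cerebras.
PARAMS = 70e9                              # Llama 3.1 70B parameters
BYTES_PER_PARAM = 2                        # 16-bit weights, as cited above
MODEL_BYTES = PARAMS * BYTES_PER_PARAM     # roughly 140GB of weights

ASSUMED_GPU_BANDWIDTH = 3.35e12            # assumed ~3.35 TB/s of off-chip bandwidth

# If the full set of weights must be read from memory for every output token,
# memory bandwidth caps generation speed regardless of raw compute:
max_tokens_per_second = ASSUMED_GPU_BANDWIDTH / MODEL_BYTES

print(f"Model weights: {MODEL_BYTES / 1e9:.0f}GB")
print(f"Theoretical ceiling: ~{max_tokens_per_second:.0f} tokens/sec per model copy")
```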
Cerebras’ answer to the problem? Bigger chips: The Inference solution uses the startup’s WSE hardware, the largest chips ever produced.
The current, third generation of Cerebras’ WSE chips is 57 times larger than Nvidia’s flagship H100 and contains 44GB of on-chip static random-access memory (SRAM).
Cerebras uses WSE-3s to process AI workloads without the need for external memory, which the startup claims gives the chip as much as 7,000 times the memory bandwidth of the H100.
“It is the only AI chip with both petabyte-scale compute and petabyte-scale memory bandwidth, making it a near ideal design for high-speed inference,” Wang wrote.
Cerebras Inference is available via an application programming interface, enabling enterprises to run their models on the system.
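The article does not include sample code, but a minimal sketch of what calling a hosted chat-completions API of this kind might look like is shown below; the base URL, model identifier, and environment variable are illustrative assumptions, and the endpoint is assumed to follow the common OpenAI-style request format rather than documented Cerebras values.

```python
# Minimal sketch of calling a hosted inference API over HTTP.
# The endpoint URL, model name, and env var below are assumptions for
# illustration; consult Cerebras' own API documentation for real values.
import os
import requests

API_KEY = os.environ["CEREBRAS_API_KEY"]          # hypothetical environment variable
BASE_URL = "https://api.cerebras.ai/v1"           # assumed base URL

response = requests.post(
    f"{BASE_URL}/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "llama3.1-70b",                  # assumed model identifier
        "messages": [{"role": "user", "content": "Summarize wafer-scale inference."}],
        "max_tokens": 200,
    },
    timeout=30,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```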
Upon launch, users are provided with one million free tokens daily. Larger deployments are priced at $0.60 per million tokens, with throughput of 450 tokens per second per user, significantly less than Microsoft Azure's pricing of around $2.90 per million tokens at roughly 20 tokens per second per user.
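Taken at face value, those figures make for a simple comparison on a hypothetical workload; the sketch below uses the prices and throughputs quoted above, with the workload size chosen arbitrarily for illustration.

```python
# Simple comparison using the per-million-token prices and per-user throughput
# quoted above; the 100-million-token workload size is an arbitrary assumption.
WORKLOAD_TOKENS = 100_000_000

offerings = {
    "Cerebras Inference": {"usd_per_million_tokens": 0.60, "tokens_per_second": 450},
    "Microsoft Azure":    {"usd_per_million_tokens": 2.90, "tokens_per_second": 20},
}

for name, offer in offerings.items():
    cost = WORKLOAD_TOKENS / 1e6 * offer["usd_per_million_tokens"]
    hours = WORKLOAD_TOKENS / offer["tokens_per_second"] / 3600  # single user stream
    print(f"{name}: ${cost:,.2f}, roughly {hours:,.0f} hours at one stream")
```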