SambaNova Systems has launched an AI inference service that it claims delivers the fastest inference speeds in the world.
The SambaNova Cloud service enables businesses to run Meta’s Llama 3.1 model at speeds of 461 tokens per second (t/s) for the 70 billion parameter version and 132 t/s for the mammoth 405 billion parameter version.
SambaNova said its new Cloud service would enable developers to build and run AI models at unrivalled speeds with low latency, exceeding inference speeds of systems run by vendors like OpenAI, Anthropic, and Google.
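For a sense of what those figures mean in practice, here is a small illustrative calculation; the throughput numbers come from SambaNova’s claims above, while the response length is a hypothetical example:

```python
# Illustrative only: wall-clock generation time implied by the claimed
# throughputs for Llama 3.1 on SambaNova Cloud (figures quoted above).
claimed_speeds = {
    "Llama 3.1 70B": 461,   # tokens per second (claimed)
    "Llama 3.1 405B": 132,  # tokens per second (claimed)
}

output_tokens = 1_000  # a hypothetical response length

for model, tps in claimed_speeds.items():
    seconds = output_tokens / tps
    print(f"{model}: {output_tokens} tokens in ~{seconds:.1f} s at {tps} t/s")
```

At the claimed rates, a 1,000-token response would take roughly 2.2 seconds on the 70B model and about 7.6 seconds on the 405B model.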
“SambaNova Cloud is the fastest Application Programming Interface (API) service for developers. We deliver world record speed and in full 16-bit precision — all enabled by the world’s fastest AI chip,” said Rodrigo Liang, CEO of SambaNova Systems. “SambaNova Cloud is bringing the most accurate open source models to the vast developer community at speeds they have never experienced before.”
Based in Palo Alto, California, SambaNova is backed by investors like SoftBank, BlackRock, SK Telecom and the venture capital arms of chipmakers like Intel and Samsung.
The company is another vendor looking to challenge the likes of Nvidia with specially designed hardware for running AI models.
Its SN40L chips are designed to be cheaper and simpler to use than Nvidia hardware such as the H100. With a proprietary dataflow design and a purpose-built three-tier memory architecture, the SN40L is built to run AI models at higher speeds.
SambaNova Cloud is similar to services from rivals such as Groq and Cerebras; however, SambaNova says its hardware is optimized to the point where the service can run on a single rack of just eight trays of SN40L chips, reducing the infrastructure footprint required to run it.
Users can also switch between models quickly, automate their workflows using chains of prompts, and import existing fine-tuned models to run on the platform.
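As a rough sketch of what such a prompt chain might look like, the snippet below chains two steps and switches models between them. The `generate` helper and both model identifiers are placeholders for illustration, not SambaNova’s actual API:

```python
# A minimal sketch of a two-step prompt chain with a model switch between
# steps. `generate` stands in for whatever inference call the platform
# exposes; it and the model names below are placeholders, not a real API.
def generate(model: str, prompt: str) -> str:
    raise NotImplementedError("replace with a real inference call")

def summarize_then_translate(document: str) -> str:
    # Step 1: use the smaller, faster model for the intermediate step.
    summary = generate("llama-3.1-70b", f"Summarize this document:\n{document}")
    # Step 2: hand the intermediate result to the larger model.
    return generate("llama-3.1-405b", f"Translate to French:\n{summary}")
```

The appeal of fast token generation in this pattern is that each step must finish before the next can begin, so per-step latency compounds across the chain.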
“Competitors are not offering the 405B model to developers today because of their inefficient chips. Providers running on Nvidia GPUs are reducing the precision of this model, hurting its accuracy, and running it at unusably slow speeds,” Liang said. “Only SambaNova is running 405B — the best open-source model created — at full precision and at 132 tokens per second.”
The service appears to have caught the eye of Andrew Ng, a machine learning pioneer who co-founded Google Brain and who described SambaNova Cloud as an “impressive technical achievement.”
"Agentic workflows are delivering excellent results for many applications. Because they need to process a large number of tokens to generate the final result, fast token generation is critical,” Ng said.
“The best open weights model today is Llama 3.1 405B, and SambaNova is the only provider running this model at 16-bit precision and at over 100 tokens/second. This impressive technical achievement opens up exciting capabilities for developers building using large language models.”
Developers can use SambaNova Cloud for free via the platform’s API to build their own generative AI applications.
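If the service exposes an OpenAI-compatible endpoint, as many inference clouds do, a first call might look like the sketch below. The base URL and model identifier here are assumptions rather than confirmed details, so check the platform’s documentation:

```python
# A hedged sketch of calling an OpenAI-compatible chat endpoint.
# The base URL and model identifier below are assumptions; consult
# SambaNova's documentation for the real values.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_SAMBANOVA_API_KEY",        # issued by the platform
    base_url="https://api.sambanova.ai/v1",  # assumed endpoint
)

response = client.chat.completions.create(
    model="Meta-Llama-3.1-405B-Instruct",    # assumed model name
    messages=[{"role": "user", "content": "Hello, world."}],
)
print(response.choices[0].message.content)
```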
SambaNova has also launched an enterprise tier, offering business customers higher rate limits to power their AI workloads at production scale.