The company tackled inference on the Llama 3.1 405B foundation model and simply crushed it. And for the crowds at SC24 this week in Atlanta, the company also announced it is 700 times faster than Frontier, the world's fastest supercomputer, on a molecular dynamics simulation.
“There is no supercomputer on earth, regardless of size, that can achieve this performance,” said Andrew Feldman, Co-Founder and CEO of the AI startup. As a result, scientists can now accomplish in a single day what previously took two years of GPU-based supercomputer simulations.
Cerebras Inferencing Llama 3.1 405B
When Cerebras announced its record-breaking performance on the 70-billion-parameter Llama 3.1, it was quite a surprise; Cerebras had previously focused its Wafer Scale Engine (WSE) on the more difficult training part of the AI workflow. The memory on a CS-3 is fast on-chip SRAM instead of the larger (and roughly 10x slower) High Bandwidth Memory used in data center GPUs. Consequently, the Cerebras CS-3 provides 7,000x more memory bandwidth than the Nvidia H100, addressing generative AI's fundamental technical challenge: memory bandwidth.
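To see why memory bandwidth is the binding constraint, consider a back-of-the-envelope roofline estimate: generating each token requires streaming essentially all model weights through the compute units, so peak tokens per second per device is bounded by memory bandwidth divided by model size in bytes. The sketch below uses illustrative public figures, not vendor-verified specifications:

```python
# Back-of-the-envelope roofline: tokens/s <= memory_bandwidth / bytes_per_token.
# Assumes every weight is read once per generated token (dense decode, no batching tricks).

MODEL_PARAMS = 405e9          # Llama 3.1 405B
BYTES_PER_PARAM = 2           # fp16/bf16 weights
bytes_per_token = MODEL_PARAMS * BYTES_PER_PARAM   # ~810 GB streamed per token

H100_HBM_BW = 3.35e12         # ~3.35 TB/s HBM3 (published H100 SXM figure)
WSE3_SRAM_BW = 21e15          # ~21 PB/s aggregate on-chip SRAM (Cerebras figure)

for name, bw in [("H100", H100_HBM_BW), ("WSE-3", WSE3_SRAM_BW)]:
    print(f"{name}: upper bound of {bw / bytes_per_token:,.1f} tokens/s per device")
```

Real deployments shard the model across many devices and batch requests, but the per-device bound of roughly 4 tokens per second for an HBM part makes it clear why GPU-based services cluster well under 100 tokens per second on a 405B model.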
And the latest result is just stupendous. Look at the chart above for performance over time, and the one below to compare the competitive landscape for Llama 3.1 405B. The entire industry occupies the upper left quadrant of the chart, with output speeds below 100 tokens per second for the Meta Llama 3.1 405B model. Cerebras produced some 970 tokens per second, all at roughly the same price as GPU and custom ASIC services like SambaNova: $6 per million input tokens and $12 per million output tokens.
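At those rates, per-request cost is simple arithmetic. A quick sketch, using a hypothetical request size for illustration:

```python
# Cost of one request at the quoted rates ($ per million tokens).
PRICE_IN, PRICE_OUT = 6.00, 12.00   # $/1M input tokens, $/1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens / 1e6 * PRICE_IN + output_tokens / 1e6 * PRICE_OUT

# A hypothetical 1,000-token prompt with a 500-token answer:
print(f"${request_cost(1_000, 500):.4f}")   # $0.0120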
Compared to the competition, using 1,000 input tokens, Cerebras embarrassed GPUs, which all produced less than 100 tokens per second. Only SambaNova even came “close,” at 164 tokens per second. Now, as you know, there is no free lunch; a single CS-3 is estimated to cost between $2 million and $3 million, though the company does not publicly disclose the exact price. But the performance, latency, and throughput amortize that cost over a massive number of users.
To put it into perspective, Cerebras ran the 405B model nearly twice as fast as the fastest GPU cloud ran the 1B model. Twice the speed on a model that is two orders of magnitude larger.
As one should expect, the Cerebras CS-3 also delivered excellent latency (time to first token), at barely over half the time of the Google Vertex service, and one-sixth the time required by competitors SambaNova and Amazon.
Cerebras is quick to note that this is just the first step. It has increased the performance of Llama 70B from 400 tokens per second to 2,200 tokens per second in just over three months. And while Blackwell will increase inference performance fourfold over Hopper, it will still not come close to the performance of Cerebras.
But who needs this level of performance?
OK, so nobody can read anywhere close to 1,000 tokens per second, which translates into about 500 words. But computers certainly can and do. And inference is undergoing a transformation, from answering simple queries to serving as a component in agentic AI and multi-query AI that deliver better results.
“By running the largest models at instant speed, Cerebras enables real-time responses from the world's leading open frontier model,” noted Mr. Feldman. “This opens up powerful new use cases, including reasoning and multi-agent collaboration, across the AI landscape.” OpenAI's o1 may demand as much as 10 times the compute of GPT-4o, and agentic AI coupled with chain-of-thought reasoning requires over 100 times the performance available on today's fastest GPUs.
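The multiplier is easy to see: an agentic pipeline chains many sequential model calls, so end-to-end latency scales directly with per-call token throughput. A minimal sketch, with assumed (illustrative) values for time-to-first-token, step count, and throughput:

```python
# End-to-end latency of a sequential agent pipeline:
#   total = number_of_calls * (time_to_first_token + tokens_out / throughput)
# All numbers below are illustrative assumptions, not measured figures.

def pipeline_latency(n_calls: int, tokens_per_call: int,
                     ttft_s: float, tokens_per_s: float) -> float:
    return n_calls * (ttft_s + tokens_per_call / tokens_per_s)

N_CALLS, TOKENS = 20, 500     # hypothetical 20-step agent, 500 output tokens per step
print(f"GPU cloud  (~70 tok/s): {pipeline_latency(N_CALLS, TOKENS, 0.5, 70):6.1f} s")
print(f"Cerebras  (~970 tok/s): {pipeline_latency(N_CALLS, TOKENS, 0.2, 970):6.1f} s")
```

At roughly 150 seconds versus 14 seconds for the same 20-step chain, interactive agentic workflows only become practical at the higher throughput.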
Cerebras and Molecular Dynamics Simulation
Since this week is Supercomputing '24, Cerebras also announced an amazing scientific accomplishment. Cerebras was able to deliver 1.2 million simulation steps per second, a new world record. That's 700 times faster than Frontier, the world's fastest supercomputer. It means that scientists can now perform two years' worth of GPU-based simulations in a single day on a single Cerebras system. And this benchmark was run on the older CS-2 WSE!
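The “two years in a day” framing follows directly from the 700x speedup. A quick sanity check of the arithmetic:

```python
# If one day of Cerebras simulation covers 700 days of GPU-supercomputer work:
speedup = 700
gpu_days_per_cerebras_day = 1 * speedup
print(f"{gpu_days_per_cerebras_day / 365:.1f} years")   # 1.9 years, i.e. about two years
```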
Conclusions
Instead of scaling AI training to produce more accurate answers, chain-of-thought reasoning explores different avenues and provides better answers. This “think before answering” approach delivers dramatically better performance on demanding tasks like math, science, and code generation, fundamentally boosting the intelligence of AI models without requiring additional training. By running over 70x faster than other solutions, Cerebras Inference allows AI models to “think” far longer and return more accurate results. As agentic AI becomes available and eventually widespread, the demands on inference hardware will increase by another 10-fold.
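To make the “think longer” idea concrete, here is a minimal sketch of one common test-time technique, self-consistency sampling: generate several independent reasoning chains and take a majority vote on the final answer. The generate() function is a hypothetical stand-in for any inference API; faster inference simply lets you afford more chains in the same wall-clock budget.

```python
import collections

def generate(prompt: str, temperature: float = 0.8) -> str:
    """Hypothetical stand-in for an LLM inference call that returns
    a chain-of-thought ending in a line like 'Answer: <value>'."""
    raise NotImplementedError("wire this to your inference endpoint")

def self_consistency(prompt: str, n_chains: int = 8) -> str:
    # Sample several independent reasoning chains...
    answers = []
    for _ in range(n_chains):
        chain = generate(prompt)
        # ...and keep only each chain's final answer.
        answers.append(chain.rsplit("Answer:", 1)[-1].strip())
    # Majority vote: the most common final answer wins.
    return collections.Counter(answers).most_common(1)[0][0]
```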
Nothing even comes close to Cerebras in these emerging advancements in AI.