
Cerebras hopes its planned IPO will accelerate its race with Nvidia and other chip startups to deliver the fastest generative AI

Welcome to Eye on AI! In this edition… Governor Newsom to veto SB 1047; ByteDance plans new AI model based on Huawei chips; Microsoft announces AI models will improve Windows search; and the U.S. Department of Commerce establishes a new rule that eases restrictions on AI chip shipments to the Middle East.

Cerebras needs speed. In a bid to take on Nvidia, the AI chip startup is racing toward an initial public offering, announcing yesterday that it had filed to go public. At the same time, the company is in a tight contest with fellow AI chip startups Groq and SambaNova for the title of “fastest generative AI.” All three are pushing their highly specialized hardware and software to let AI models generate responses at speeds that outpace even Nvidia’s GPUs.

Here’s what this means: When you ask an AI assistant a question, it has to sift through all the knowledge encoded in its AI model to find an answer quickly. In industry jargon, this process is called “inference.” But large language models don’t actually work with whole words during inference. When you ask a question or give a chatbot a prompt, the AI breaks it down into smaller pieces called “tokens” – which can represent a word or part of a word – in order to process the input and generate a response.
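To make the token idea concrete, here is a minimal sketch using OpenAI’s open-source tiktoken tokenizer. It illustrates the general concept only; it is not the tokenizer any of these startups uses, and exact token boundaries vary from model to model.

    # A rough illustration of tokenization using OpenAI's open-source tiktoken
    # library (pip install tiktoken). Token boundaries differ by model; this is
    # a sketch of the concept, not any particular vendor's tokenizer.
    import tiktoken

    encoding = tiktoken.get_encoding("cl100k_base")

    prompt = "Why does inference speed matter for generative AI?"
    token_ids = encoding.encode(prompt)

    print(f"{len(token_ids)} tokens")
    for token_id in token_ids:
        # Each ID maps back to a word or a fragment of a word.
        print(token_id, repr(encoding.decode([token_id])))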

The pressure to get faster and faster

So what does “ultrafast” inference mean? If you’ve tried chatbots like OpenAI’s ChatGPT, Anthropic’s Claude, or Google’s Gemini, you’ve probably found that responses arrive at a perfectly reasonable pace. In fact, you may be impressed by how quickly they spit out answers to your questions. But in February 2024, demos of a Groq-powered chatbot running a Mistral model delivered answers far faster than a human could read them, and the demos went viral. That setup generated 500 tokens per second, enough for nearly instantaneous responses. By April, Groq was delivering an even faster 800 tokens per second, and by May, SambaNova boasted that it had broken the 1,000-tokens-per-second barrier.

Today, Cerebras, SambaNova, and Groq all serve more than 1,000 tokens per second, and the “token wars” have picked up considerable momentum. In late August, Cerebras claimed to run the “world’s fastest AI inference” at 1,800 tokens per second, and last week the company said it had broken that record, becoming “the first hardware of any kind” to exceed 2,000 tokens per second on one of Meta’s Llama models.
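To put those throughput figures in perspective, here is a back-of-the-envelope calculation. The 300-token answer length is an assumption chosen for illustration, not a number from any of the vendors.

    # Rough math, not a vendor benchmark: how long a ~300-token answer takes
    # to generate at the throughput figures cited above.
    answer_tokens = 300  # assumed length of a typical chatbot answer
    for tokens_per_second in (500, 800, 1_000, 1_800, 2_000):
        seconds = answer_tokens / tokens_per_second
        print(f"{tokens_per_second:>5} tok/s -> {seconds:.2f} s per answer")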

How fast is fast enough?

Which led me to the question: why would anyone need generative AI to produce results this quickly? How fast is fast enough?

According to Cerebras CEO Andrew Feldman, generative AI speed is essential because search results will increasingly rely on generative AI, as will newer capabilities such as streaming video. Those are two areas where latency – the delay between an action and a response – is especially irritating.

“No one will build a business on an app that makes you sit and wait,” he told Fortune.

In addition, AI models are quickly being put to work in applications far more complex than chat. A rapidly growing area of interest is agentic workflows, in which a user’s question or requested action is not a single query to a single model. Instead, it kicks off multiple queries to multiple models, which may also take actions such as searching the web or a database.

“Then performance really matters,” Feldman said, explaining that speeds that feel adequate for a single query can quickly become painfully slow across a chain of them.
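Here is a hedged sketch of that compounding effect; the step sizes and call counts are purely illustrative assumptions, not measurements from Cerebras, Groq, or SambaNova.

    # If one user request fans out into several sequential model calls,
    # per-call generation time adds up. All numbers here are assumptions.
    steps = [400, 250, 600, 300]  # tokens generated at each step of a hypothetical agent pipeline

    for tokens_per_second in (100, 1_000, 2_000):
        total_seconds = sum(step / tokens_per_second for step in steps)
        print(f"{tokens_per_second:>5} tok/s -> {total_seconds:.2f} s end to end "
              f"for {len(steps)} chained calls")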

Unlocking AI’s potential with speed

The bottom line is that speed matters because faster inference unlocks more potential in applications built on AI, Mark Heaps, chief technology evangelist at Groq, told Fortune. That is especially true for data-intensive applications in areas such as financial trading, traffic monitoring, and cybersecurity: “You need real-time insight, a form of instantaneous intelligence that keeps pace with the moment,” he said. “The race to increase speed…will deliver better quality, accuracy and the potential for greater return on investment.”

He also noted that artificial intelligence models still don’t have as many neural connections as the human brain. “As models become more sophisticated, larger, or are overlaid with more agents using smaller models, maintaining application usability will require greater speed,” he explained, adding that this kind of problem has always existed. “Why do we need cars to go over 50 miles per hour? Was it so we could go fast? Or did producing an engine that could go 100 miles per hour allow it to pull more weight at 50 miles per hour?”

Rodrigo Liang, CEO and co-founder of SambaNova, agreed. Inference speed, he told Fortune, “is where the rubber hits the road – where all the training and model building is leveraged to deliver real business value.” That is especially true now that the AI industry is shifting more of its focus from training AI models to deploying them in production. “The world is looking for the most efficient way to produce tokens that can serve an ever-increasing number of users,” he said. “The speed allows us to serve many customers at the same time.”

Sharon Goldman
[email protected]

This story was originally published on Fortune.com