Groq, founded by former Google TPU engineers, has developed an LPU capable of lightning-fast output generation. The leading Nvidia GPUs used for AI inferencing in ChatGPT top out at around 30 to 60 tokens per second. Groq LPUs deliver roughly 10x that performance at one-tenth the latency, while consuming minimal energy compared to Nvidia GPUs.
If you’ve been using ChatGPT, particularly with the GPT-4 model, you’ve probably noticed how sluggish its responses can be. Voice assistants built on large language models, such as ChatGPT’s Voice Chat feature or the newly launched Gemini AI that replaced Google Assistant on Android phones, fare even worse because of the high latency of LLMs. That may be about to change, thanks to Groq’s potent new LPU (Language Processing Unit) inference engine.
Groq has quickly drawn worldwide attention. Note that this is not Elon Musk’s Grok, the AI model available on X (formerly Twitter). Groq’s LPU inference engine can generate up to 500 tokens per second when running a 7B model, and around 250 tokens per second on a 70B model. That stands in stark contrast to OpenAI’s ChatGPT, which runs on Nvidia GPUs and manages roughly 30 to 60 tokens per second.
Groq is Built by Ex-Google TPU Engineers
Groq is not an AI chatbot but an AI inference chip, and the company positions itself as a competitor to industry giants like Nvidia in the AI hardware sector. Groq was co-founded in 2016 by Jonathan Ross, who previously played a pivotal role at Google in establishing the team that developed Google’s first TPU (Tensor Processing Unit) chip for machine learning.
Many employees later left Google’s TPU team to join Ross at Groq and build hardware for next-generation computing.
What is Groq’s LPU?
Groq’s remarkable speed advantage over established players like Nvidia stems from its fundamentally different approach to development.
CEO Jonathan Ross explains that Groq took a unique path: it built the software stack and compiler first, and only then designed the silicon. This software-first approach was chosen to guarantee performance determinism, which is critical for fast, accurate, and predictable results in AI inferencing.
Groq’s LPU is essentially an ASIC (application-specific integrated circuit) and is fabricated on a 14nm node. Unlike general-purpose chips that handle many kinds of complex tasks, Groq’s LPU is custom-designed for one job: processing sequences of data in large language models. CPUs and GPUs offer broader capabilities, but that generality often comes with performance delays and increased latency.
Because the tailored compiler understands exactly how the chip’s instruction cycle works, it can place every instruction where it belongs ahead of time, cutting latency significantly. Each Groq LPU chip also carries 230MB of on-die SRAM, which further boosts performance and reduces latency while maintaining superior efficiency.
As for training AI models, Groq chips are purpose-built for AI inferencing and lack the high-bandwidth memory (HBM) required for training and fine-tuning. Groq argues that HBM introduces non-determinism into the overall system, which drives up latency. Either way, Groq LPUs are not suitable for training AI models.
We’ve Tested Groq’s LPU Inference Engine
Feel free to visit Groq’s website to witness the exceptional performance firsthand, no account or subscription necessary. Currently, the website offers access to two AI models: Llama 70B and Mixtral-8x7B. To assess Groq’s LPU performance, we executed several prompts using the Mixtral-8x7B-32K model, renowned as one of the top-tier open-source models available.
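If you’d rather script the test than use the web demo, Groq also offers developer API access (more on that below). Here is a minimal sketch in Python using Groq’s official client; the package name (groq), the model ID (mixtral-8x7b-32768), and the OpenAI-style chat-completions interface reflect Groq’s public documentation at the time of writing, so verify them against the current docs before relying on this.

```python
# Minimal sketch: timing a completion from Groq's LPU-backed API.
# Assumes `pip install groq` and a GROQ_API_KEY environment variable.
import os
import time

from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

start = time.perf_counter()
response = client.chat.completions.create(
    model="mixtral-8x7b-32768",  # Mixtral-8x7B with a 32K context window
    messages=[{"role": "user", "content": "Explain what an LPU is."}],
)
elapsed = time.perf_counter() - start

tokens = response.usage.completion_tokens
print(response.choices[0].message.content)
print(f"{tokens} tokens in {elapsed:.2f}s = {tokens / elapsed:.0f} tokens/sec")
```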
Groq’s LPU showcased impressive performance, generating outputs at a remarkable speed of 527 tokens per second. Specifically, it took merely 1.57 seconds to produce 868 tokens (equivalent to 3846 characters) on a 7B model. Even with a 70B model, its speed remained impressive at 275 tokens per second, surpassing competitors by a significant margin.
In our effort to compare Groq’s AI accelerator performance, we conducted a similar test on ChatGPT (GPT-3.5, utilizing a 175B model), and we manually computed the performance metrics. ChatGPT, leveraging Nvidia’s state-of-the-art Tensor-core GPUs, generated output at a rate of 61 tokens per second. Specifically, it took 9 seconds to produce 557 tokens (equivalent to 3090 characters).
To provide a comprehensive comparison, we conducted a similar test on the free version of Gemini (powered by Gemini Pro), which operates on Google’s Cloud TPU v5e accelerator. While Google has not disclosed the model size of the Gemini Pro model, its speed was determined to be 56 tokens per second. Specifically, it took 15 seconds to generate 845 tokens (equivalent to 4428 characters).
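The throughput figures above come down to simple division: tokens generated over wall-clock time. Here is a small sketch of the arithmetic we did by hand, using our measured numbers:

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """Throughput is tokens generated divided by wall-clock generation time."""
    return tokens / seconds

# ChatGPT (GPT-3.5): 557 tokens in 9 seconds -> ~61 tokens/sec
print(f"ChatGPT: {tokens_per_second(557, 9):.1f} tokens/sec")
# Gemini Pro: 845 tokens in 15 seconds -> ~56 tokens/sec
print(f"Gemini:  {tokens_per_second(845, 15):.1f} tokens/sec")
```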
As for other service providers, the ray-project team ran its extensive LLMPerf benchmark across LLM inference providers and likewise concluded that Groq significantly outperformed the rest.
Although we haven’t conducted specific tests, it’s worth noting that Groq LPUs are compatible not only with language models but also with diffusion models. According to the demo, they can generate various styles of images at 1024px resolution in under a second, which is indeed remarkable.
Groq vs Nvidia: What Does Groq Say?
According to a report by Groq, its LPUs are scalable and can be linked together over optical interconnect in configurations of up to 264 chips. They can scale further using switches, though that adds latency. CEO Jonathan Ross has said the company is working on clusters that scale across 4,128 chips, slated for release in 2025 and built around next-generation chips fabricated on Samsung’s 4nm process node.
In a benchmark test Groq ran with 576 LPUs on the 70B Llama 2 model, AI inferencing completed in one-tenth of the time required by a cluster of Nvidia H100 GPUs.
Moreover, while Nvidia GPUs consumed 10 to 30 joules for each token generated in a response, Groq needed only 1 to 3 joules. In summary, Groq says its LPUs deliver a 10x speedup for AI inferencing tasks at just one-tenth the cost of Nvidia GPUs.
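Reading those figures as joules per generated token, the gap compounds quickly over a full response. Here is a back-of-the-envelope sketch, where the 500-token response length is our own arbitrary example rather than a figure from Groq:

```python
# Back-of-the-envelope energy comparison using Groq's reported ranges:
# Nvidia GPUs: 10-30 joules per token; Groq LPUs: 1-3 joules per token.
RESPONSE_TOKENS = 500  # arbitrary example length, not a figure from Groq

gpu_low, gpu_high = 10 * RESPONSE_TOKENS, 30 * RESPONSE_TOKENS  # 5,000-15,000 J
lpu_low, lpu_high = 1 * RESPONSE_TOKENS, 3 * RESPONSE_TOKENS    # 500-1,500 J

print(f"GPU cluster: {gpu_low:,}-{gpu_high:,} J per response")
print(f"Groq LPU:    {lpu_low:,}-{lpu_high:,} J per response")
print(f"Roughly {gpu_low // lpu_low}x less energy on the LPU")  # ~10x
```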
What Does It Mean For End Users?
Overall, the emergence of LPUs represents a thrilling advancement in the AI landscape. With LPUs, users can anticipate instantaneous interactions with AI systems. The substantial reduction in inference time opens up possibilities for users to seamlessly engage with multimodal systems, whether through voice commands, image input, or image generation.
Groq has already made API access available to developers, signaling the potential for even greater performance improvements in AI models in the near future. What are your thoughts on the development of LPUs in the AI hardware sector? Share your opinions in the comment section below.