According to Decrypt, chipmaker Nvidia announced on Monday that its Spectrum-X networking technology is powering startup xAI’s Colossus supercomputer, now recognized as the largest AI training cluster in the world. Located in Memphis, Tennessee, Colossus is the training ground for the third generation of Grok, xAI’s suite of large language models that powers chatbot features for X Premium subscribers. The supercomputer was built in just 122 days and began training its first models 19 days after installation. Tech billionaire Elon Musk’s xAI plans to double the system’s capacity to 200,000 GPUs, Nvidia said.
Colossus is a massive interconnected system of GPUs built to process huge datasets. When Grok models are trained, they must analyze enormous volumes of text, images, and other data to improve their responses. The supercomputer connects 100,000 Nvidia Hopper GPUs using a unified Remote Direct Memory Access (RDMA) network. The cluster handles complex training tasks by splitting the workload across many GPUs and processing it in parallel, while RDMA lets data move directly between the memory of different nodes, bypassing the operating system and delivering the low latency and high throughput that large-scale AI training demands.
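To give a sense of what that parallel training pattern looks like in practice, here is a minimal data-parallel sketch using PyTorch’s DistributedDataParallel. This is purely illustrative and is not xAI’s actual training code; the model and data are toy stand-ins. The NCCL backend used here is the standard choice on Nvidia clusters and can ride RDMA transports such as InfiniBand or RoCE when the fabric supports them.

```python
# Minimal sketch of multi-GPU data-parallel training, the pattern a
# cluster like Colossus scales to tens of thousands of GPUs.
# Illustrative only; not xAI's actual training code.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    # The NCCL backend moves tensors GPU-to-GPU over RDMA-capable
    # fabrics when available, without staging through the OS.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # toy model
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):
        x = torch.randn(32, 1024, device=local_rank)  # toy batch
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()  # gradients are all-reduced across every GPU here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with, for example, `torchrun --nproc_per_node=8 train.py`, each process trains on its own GPU while the backward pass synchronizes gradients across the whole cluster; it is that all-reduce traffic that makes the interconnect the bottleneck at Colossus scale.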
Traditional Ethernet networks often suffer from congestion and packet loss that limit data throughput to around 60%; Spectrum-X, by contrast, achieves 95% throughput without latency degradation, according to Nvidia. This matters because standard networks can bog down when tens of thousands of GPUs exchange data at once, and smoother communication enables Grok to be trained faster and more accurately, which is essential for building AI models that respond effectively to human interactions. Despite the significant technological advancements, Nvidia’s stock saw a slight dip, trading at $141 as of Monday, with the company’s market cap at $3.45 trillion.
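As a back-of-the-envelope illustration of what those throughput percentages mean in practice, the sketch below computes effective bandwidth per port. The 400 Gb/s line rate is an assumption chosen for illustration, not a figure from the article; only the 60% and 95% efficiency numbers come from Nvidia’s claims.

```python
# Effective per-port bandwidth at the throughput figures Nvidia cites.
# LINK_GBPS is an assumed line rate for illustration, not from the article.
LINK_GBPS = 400

for name, efficiency in [("traditional Ethernet", 0.60), ("Spectrum-X", 0.95)]:
    effective = LINK_GBPS * efficiency
    print(f"{name}: {effective:.0f} Gb/s effective per {LINK_GBPS} Gb/s port")
```

Under that assumption, each port delivers roughly 240 Gb/s on a congested traditional Ethernet fabric versus about 380 Gb/s on Spectrum-X, a gap that compounds across the thousands of links in a cluster the size of Colossus.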