In a battle for supremacy in artificial intelligence (AI) and high-performance computing (HPC), Nvidia is firing back at AMD, asserting that its H100 GPU is twice as fast as AMD’s MI300X for inference workloads.
At the recent launch of the Instinct MI300X, AMD confidently declared its GPU’s superiority over Nvidia’s H100 in AI and HPC applications. Nvidia has now countered, contending that AMD’s comparison is flawed: according to Nvidia, AMD did not use optimized software on the DGX H100 machine it benchmarked against the Instinct MI300X-based server.
Nvidia underscores the significance of optimized software, emphasizing the need for a robust parallel computing framework, a versatile suite of tools, refined algorithms, and top-notch hardware to achieve optimal AI performance. Without these crucial components, Nvidia argues, performance is likely to fall short of expectations.
The tech giant highlights its TensorRT-LLM, equipped with advanced kernel optimizations tailored for the Hopper architecture. Nvidia deems this fine-tuning as pivotal, enabling accelerated FP8 operations on H100 GPUs without compromising the precision of inferences. This optimization was showcased through the Llama 2 70B model, demonstrating the H100’s prowess.
Backing up its assertions with concrete data, Nvidia presented performance metrics for a single DGX H100 server equipped with eight H100 GPUs running the Llama 2 70B model. The results were impressive: the DGX H100 completed a single inference task in just 1.7 seconds at a batch size of one, outpacing AMD’s eight-way MI300X machine, which clocked in at 2.5 seconds according to figures AMD itself published.
Nvidia acknowledges the common industry practice of balancing response time and efficiency, particularly in cloud services. It points out that cloud services often adopt a standard response time for specific tasks, such as 2.0 seconds, 2.3 seconds, or 2.5 seconds. This approach enables servers to handle multiple inference requests in larger batches, thereby enhancing the server’s overall inferences per second. Nvidia notes that this method aligns with industry benchmarks like MLPerf.
The company highlights that even slight compromises in response time can significantly impact the number of inferences a server can manage simultaneously. For instance, with a predetermined response time of 2.5 seconds, an eight-way DGX H100 server can perform over five Llama 2 70B inferences every second, a remarkable improvement compared to processing less than one inference per second under a batch-one setting.
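The trade-off described above can be sketched with simple arithmetic: throughput is batch size divided by latency, so relaxing the per-request response time to allow larger batches raises aggregate inferences per second. The 1.7 s batch-one latency and the 2.5 s budget come from the article; the batch size of 14 in the second call is a hypothetical value chosen only to illustrate how a batch completing within the budget would exceed five inferences per second.

```python
def throughput(batch_size: int, latency_s: float) -> float:
    """Completed inferences per second for one batched request."""
    return batch_size / latency_s

# Batch-one case from the article: one inference in 1.7 s means the
# server completes fewer than one inference per second.
print(throughput(1, 1.7))   # ~0.59 inferences/s

# With a fixed 2.5 s response-time budget, requests can be batched.
# Assuming (hypothetically) a batch of 14 still finishes in 2.5 s,
# aggregate throughput exceeds five inferences per second.
print(throughput(14, 2.5))  # 5.6 inferences/s
```

This is the same accounting MLPerf-style latency-bounded benchmarks use: the per-request latency gets worse, but the server-level throughput improves.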
While Nvidia presents compelling data and arguments to support its claims, it notably does not provide performance numbers for AMD’s Instinct MI300X in the same setup. The competitive landscape between Nvidia and AMD continues to intensify as both companies strive for dominance in the rapidly evolving field of AI and HPC. The battleground is marked by claims, counterclaims, and a relentless pursuit of innovation to secure a leading position in the race for computational supremacy.