Gaudi 3 Competes On Inference Efficiency

Author: Ayush Jain

In the datacenter AI market, NVIDIA has taken several leaps forward—first with Ampere, then Hopper, and now Blackwell. Other AI accelerator vendors have attempted to keep pace, with some coming close, such as Intel’s Gaudi 2, which offered performance comparable to NVIDIA’s A100. To gain a firmer foothold in the AI accelerator market, Intel is back with its next-generation Gaudi accelerator, the Gaudi 3. In its maximum configuration, the chip delivers twice the FP8 performance of its predecessor while pushing the open accelerator module (OAM) power consumption to a relatively modest 900 W. Gaudi 3 is expected to ship in Q3 2024, and Intel expects it to generate $500 million in sales this year.

Utilizing a heterogeneous architecture, Gaudi 3 features two compute dies, connected by a die-to-die interconnect and fabricated on a 5 nm process. Between the two dies, the new accelerator quadruples the number of matrix multiplication engines (MMEs) and contains 2.7 times the core count of Gaudi 2.

Gaudi 3 contains 96 MB of on-die SRAM and increases the amount of high-bandwidth memory (HBM) by 33% to 128 GB, while providing 1.5x more HBM bandwidth. The increased compute and memory capacity also raises power to a 900 W TDP in the standard mezzanine form, 50% more than Gaudi 2. The networking subsystem is upgraded to twenty-four 200-GbE network interface controller (NIC) ports offering an aggregate bandwidth of 4.8 Tb/s (600 GB/s) in each direction, optimized for training large networks. The NIC ports in the accelerator module provide scale-out support, connecting multiple Intel Gaudi 3 accelerators in a server.
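The aggregate networking figure follows directly from the port count and per-port speed. A quick back-of-the-envelope check (note that GbE rates are in gigabits per second, so dividing by 8 converts to bytes):

```python
# Sanity check of the Gaudi 3 NIC aggregate bandwidth per direction.
ports = 24
port_speed_gbps = 200  # each 200-GbE port carries 200 gigabits/s

aggregate_gbps = ports * port_speed_gbps  # total gigabits/s per direction
aggregate_gb_per_s = aggregate_gbps / 8   # convert bits to bytes

print(f"{aggregate_gbps / 1000} Tb/s = {aggregate_gb_per_s} GB/s per direction")
```

This yields 4.8 Tb/s, or 600 GB/s, in each direction.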

The company offers Gaudi 3 in a custom OAM module or a standard PCIe card; the latter delivers similar peak performance for all supported datatypes at 600 W. OAM modules can scale to as many as 8,192 accelerators across a 1,024-node cluster using standard Ethernet switches. MLPerf projected benchmarks show Gaudi 3 OAM clusters achieving 25%-40% faster time-to-train than H100s on large-scale pretraining of large language models (LLMs). The Gaudi 3 accelerator will reach the market about the same time as NVIDIA’s Blackwell GPUs.
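The cluster-scale numbers are consistent with the standard OAM baseboard configuration of eight modules per node, a minimal sketch of the arithmetic (the eight-modules-per-node figure is an assumption based on typical OAM server designs, not stated in the text above):

```python
# Cluster-scale arithmetic: 8,192 OAM modules at 8 modules per node.
modules_total = 8192
modules_per_node = 8  # assumption: typical OAM baseboard holds 8 modules

nodes = modules_total // modules_per_node
print(f"{modules_total} modules -> {nodes} nodes")
```

This gives the 1,024-node cluster size quoted for the maximum Ethernet scale-out configuration.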
