The task of designing, building and connecting a server system that can run today’s artificial intelligence workloads is daunting.
That’s mainly because there are a lot of moving parts. Assembling and connecting them all correctly is not only complicated, but also time-consuming.
Supermicro and AMD are here to help. They’ve recently co-published a Validated Design document that explains how to build an AI cluster. The PDF also tells you how to acquire a pre-built, AMD-powered Supermicro AI cluster, with all elements connected, configured and burned in before shipping.
Full-Stack for GenAI
Supermicro and AMD are offering a fully validated, full-stack solution for today’s Generative AI workloads. The system’s scale can be easily adjusted from as few as 16 nodes to as many as 1,024—and points in between.
This Supermicro solution is based on three AMD elements: the AMD Instinct MI325X GPU, AMD Pensando Pollara 400 AI network interface card (NIC), and AMD EPYC CPU.
These three AMD parts are all integrated with Supermicro’s optimized servers. The integration also covers network cabling and switching.
The new Validated Design document helps potential buyers understand the joint AMD-Supermicro solution’s key elements. To shorten your implementation time, the document also provides an organized plan from start to finish.
Under the Cover
This comprehensive report—22 pages plus a lengthy appendix—goes into a lot of technical detail. That includes the traffic characteristics of AI training, impact of large “elephant” flows on the network fabric, and dynamic load balancing. Here’s a summary:
- Foundations of AI Fabrics: Remote Direct Memory Access (RDMA), PCIe switching, Ethernet, IP and Border Gateway Protocol (BGP).
- Validated Design Equipment and Configuration: Server options that optimize RDMA traffic with minimal distance, latency and silicon between the RDMA-capable NIC (RNIC) and accelerator.
- Scaling Out the Accelerators with an Optimized Ethernet Fabric: Components and configurations including the AMD Pensando Pollara 400 Ethernet NIC and Supermicro’s own SSE-T8196 Ethernet switch.
- Design of the Scale Unit—Scaling Out the Cluster: Designs are included for both air-cooled and liquid-cooled setups.
- Resource Management and Adding Locality into Work Placement: Covering the Simple Linux Utility for Resource Management (SLURM) and topology optimization including the concept of rails.
- Supermicro Validated AMD Instinct MI325 Design: Shows how you can scale the validated design all the way to 8,000 AMD MI325X GPUs in a cluster.
- Storage Network Validated Design: Multiple alternatives are offered.
- Importance of Automation: Human errors are, well, human. Automation can help with tasks including the production of detailed architectural drawings, output of cabling maps, and management of device firmware.
- How to Minimize Deployment Time: Supermicro’s Rack Scale Solution Stack offers a fully integrated, end-to-end solution. Because the system is pre-validated, it also eases the complexity of multi-vendor integration.
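To make the scaling figures and the idea of rails a bit more concrete, here is a minimal Python sketch. It is not taken from the Validated Design document. It assumes 8 GPUs per node and one NIC per GPU (neither is stated in this article), and it assumes the common rail convention in which the NIC paired with GPU i on every node connects to rail i. The names `GPUS_PER_NODE`, `Link`, `total_gpus` and `rail_cabling` are hypothetical.

```python
from dataclasses import dataclass

# Assumption for illustration only: 8 GPUs per node, one NIC per GPU.
GPUS_PER_NODE = 8


@dataclass(frozen=True)
class Link:
    """One row of a simple cabling map: which GPU's NIC goes to which rail."""
    node: int
    gpu: int
    rail: int


def total_gpus(nodes: int, gpus_per_node: int = GPUS_PER_NODE) -> int:
    """Return the GPU count for a cluster of the given node count."""
    return nodes * gpus_per_node


def rail_cabling(nodes: int, gpus_per_node: int = GPUS_PER_NODE) -> list[Link]:
    """Build a rail-aligned cabling map: GPU i on every node lands on rail i."""
    return [
        Link(node=n, gpu=g, rail=g)
        for n in range(nodes)
        for g in range(gpus_per_node)
    ]


if __name__ == "__main__":
    # Smallest and largest node counts mentioned in the article.
    for nodes in (16, 1024):
        print(f"{nodes} nodes -> {total_gpus(nodes)} GPUs")

    # First few rows of the cabling map for a two-node example.
    for link in rail_cabling(2)[:3]:
        print(f"node {link.node} gpu {link.gpu} -> rail {link.rail}")
```

If the 8-GPU assumption holds, 1,024 nodes works out to 8,192 GPUs, which is in line with the roughly 8,000 GPUs cited above. Generating a map like this from code rather than by hand is one small example of the kind of automation the report recommends for cabling.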
Total Rack Solution
Looking to minimize implementation times? Supermicro offers a total rack scale solution that’s fully integrated and end-to-end.
This frees the user from having to integrate and validate a multi-vendor solution. Basically, Supermicro does it for you.
By leveraging industry-leading energy efficiency, liquid and air-cooled designs, and global logistics capabilities, Supermicro delivers a cost-effective and future-proof solution designed to meet the most demanding IT requirements.
The benefits to the customer include reduced operational overhead, a single point of accountability, streamlined procurement and deployment, and maximum return on investment.
For onsite deployment, Supermicro provides a turnkey, fully optimized rack solution that is ready to run. This helps organizations maximize efficiency, lower costs and ensure long-term reliability. It includes a dedicated on-site project manager.
Do More:
- Download the Validated Design PDF: Building a Tenant-Aware AMD Instinct MI325X GPU Cluster with an Integrated Supermicro Solution