Run HPC workloads faster with this AMD-Supermicro combo
Compared with a CPU-only system, a Supermicro system with AMD CPUs and GPUs delivered performance gains of up to 14x. In benchmark tests, some HPC workloads that previously required months of simulation time were completed in just days.
Are you or your customers looking to boost high-performance computing (HPC) workloads with greater performance, scalability and energy efficiency?
If so, then Supermicro and AMD have what you’re looking for. Together, they’ve demonstrated dramatic improvements in HPC workloads using a Supermicro system powered by AMD Instinct MI355X GPUs.
Compared with a CPU-only system, this AMD-Supermicro setup delivered performance gains of up to 14x. In benchmark tests, some HPC workloads that previously required months of simulation time were completed in just days.
That’s because HPC workloads — especially those that are compute-intensive — can benefit from GPU accelerators.
The massive parallelism of GPUs enables faster iteration and higher resolution modeling. And accelerators such as the AMD Instinct MI355X transform traditional CPU-bound HPC clusters into GPU-accelerated supercomputing platforms.
Booming Benchmarks
The improvements can be dramatic for both performance/watt and time-to-solution. But how dramatic? That's what AMD and Supermicro set out to discover.
To find out, they ran benchmarks on a liquid-cooled Supermicro 4U server powered by dual AMD EPYC CPUs and eight Instinct MI355X GPUs.
The AMD Instinct MI355X is a compelling solution for HPC applications, mainly because of its massive memory capacity (288GB of dedicated GPU memory), high double-precision (FP64) performance, and its support for the open-source AMD ROCm software ecosystem. These features enable it to handle extensive and complex scientific modeling, simulations and data-intensive tasks efficiently and at scale.
The benchmarks were generated using Chroma (software for lattice quantum field theory), Gromacs (software for molecular dynamics) and NAMD (molecular dynamics simulations). Here are the test workloads:
Chroma QUDA BICGSTAB Clover Solver: Fast lattice quantum chromodynamics
Gromacs ADH-Dodec: High-throughput simulation of small and medium biomolecular systems
Gromacs Cellulose-NVE: Biomolecular simulation of crystal structures
Gromacs STMV Virus: Biomolecular simulation of plant virus
NAMD large-scale MD: Molecular dynamics simulation with 1 million steps
Compared with the CPU-only system, results were delivered anywhere from 6x to 14x faster, depending on the benchmark.
Are your customers looking for that kind of HPC advantage? Then tell them about this Supermicro and AMD solution.
Need to cool AI hardware safely? Check out the new solution from AMD, Supermicro & Metrum AI
The solution employs AI agents to monitor liquid-cooling systems, identifying and remediating problems quickly.
Liquid cooling is great for controlling the temperature of hard-working AI servers, but the technology also has its risks. Even minor disruptions or fluctuations in a cooling system can quickly lead to massive hardware failures.
A solution to this challenge has been developed by AMD and Supermicro working with Metrum AI Inc., an Austin, Texas-based provider of industry-specific AI agents and AI evaluation products.
Their solution integrates Supermicro infrastructure, AMD computational power and ROCm software, and Metrum AI’s orchestration to deliver fast decisions that ensure safety at scale.
This solution enables multiple AI agents to collaboratively monitor signals, diagnose issues, predict failures, and coordinate corrective actions. The agents are embedded directly into a server's cooling control plane.
Essentially, this creates a data center that is adaptive, resilient, and self-optimizing. The solution should also support the massive compute intensity of next-generation AI workloads while proactively managing their thermal and physical risks.
And unlike traditional monitoring tools, this solution can actually predict and then prevent catastrophic hydraulic failures before they occur, and do so faster than human intervention ever could.
Power Features
To design these multi-agent systems, the team used AMD ROCm. This open-source software delivers important benefits that include flexibility, optimized libraries and seamless integration with AMD Instinct GPUs.
Another feature that made the solution possible is the massive memory reservoir of the AMD Instinct GPUs. For example, the AMD Instinct MI355X GPU has a dedicated memory of 288 GB. This lets large-scale reasoning models operate fully in-memory.
The structural foundation of this platform is the Supermicro 8U server (model AS-8126GS-TNMR), powered by dual AMD EPYC 9005 Series CPUs and supporting up to eight AMD Instinct MI325X or MI350X GPUs.
Unlike standard servers, these systems are engineered with direct-to-chip cooling headers that expose flow, temperature and pressure data directly through Redfish interfaces. (Redfish is a standard designed to deliver simple and secure management for converged systems, hybrid IT and software-defined data centers.) This empowers the agents to monitor and adjust cooling performance in real time.
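Here's a rough illustration of what that looks like in practice: a small Python sketch that polls a chassis's Redfish thermal resource. The BMC address, credentials and alert threshold are invented for illustration; real sensor names and schemas vary by system.

```python
import requests

# Illustrative Redfish polling sketch. The BMC address, credentials and
# threshold below are assumptions, not Supermicro's actual configuration.
BMC = "https://10.0.0.42"      # assumed BMC address
AUTH = ("monitor", "secret")   # assumed read-only credentials

def read_cooling_telemetry(chassis_id: str = "1") -> dict:
    """Fetch thermal readings for one chassis via the Redfish REST API."""
    url = f"{BMC}/redfish/v1/Chassis/{chassis_id}/Thermal"
    resp = requests.get(url, auth=AUTH, verify=False, timeout=5)
    resp.raise_for_status()
    data = resp.json()
    # Collapse the sensor list into {sensor name: reading in Celsius}.
    return {t["Name"]: t["ReadingCelsius"] for t in data.get("Temperatures", [])}

if __name__ == "__main__":
    for name, celsius in read_cooling_telemetry().items():
        if celsius > 45:   # assumed alert threshold
            print(f"ALERT: {name} at {celsius} C")
```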
This combination of technologies creates what's known as a Unified Computational Fabric. There, the AMD EPYC processors feed continuous Redfish data directly to the AMD Instinct GPUs via PCIe 5.0, eliminating I/O bottlenecks.
This synergy lets the platform quickly sustain real-time adaptive control loops across dozens of racks. It's a capability that conventional air-cooled, CPU-based infrastructures can't deliver.
Smart Racks
The autonomous cooling system was built on a distributed multi-agent architecture designed specifically for liquid-cooled AI environments. Unlike conventional systems, where monitoring is either centralized or based on human intervention, the solution places intelligence directly at the rack level.
In this setup, lightweight agents continuously monitor telemetry, interpret changes in flow and pressure, and coordinate rapid remediation actions across the data center. This creates a resilient, high-resolution control fabric that can respond to thermal events in milliseconds.
At the base of the stack, AMD ROCm supplies the core libraries, tools, compilers and runtimes for GPU-accelerated compute on AMD Instinct GPUs. Kubernetes orchestration and the AMD GPU Operator enable containerized deployment, GPU scheduling, and lifecycle management at multi-rack scale. (Kubernetes is an open-source system for automating the deployment, scaling and management of containerized applications.)
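To make that concrete, here's a minimal sketch of scheduling a GPU-backed workload the way the GPU Operator enables. The container image and namespace are hypothetical; the "amd.com/gpu" resource name is the one AMD's Kubernetes device plugin exposes.

```python
# Sketch: request one AMD GPU for a containerized agent on Kubernetes.
# The image name is hypothetical; "amd.com/gpu" is the resource name
# exposed by AMD's device plugin.
from kubernetes import client, config

config.load_kube_config()  # use the local kubeconfig

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="cooling-agent"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="agent",
                image="example.com/cooling-agent:latest",  # hypothetical image
                resources=client.V1ResourceRequirements(
                    limits={"amd.com/gpu": "1"}  # schedule onto one Instinct GPU
                ),
            )
        ],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```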
Above this layer, the AMD Enterprise AI Suite delivers higher-level services. The suite is a full stack of enterprise-ready AI software. Its services include solution blueprints, an AI workbench, and a resource manager for unified model deployment, optimization and infrastructure governance.
Metrum AI extends these platform components into a specialized multiagent architecture. It supports real-time telemetry ingestion, large-model reasoning and autonomous cooling control.
Test Results: Fast Yet Stable
All that sounds good in theory, but does it really work?
To find out, Metrum AI tested the solution along two dimensions: telemetry ingestion throughput and large-model inference stability.
When monitoring a full deployment of 200 racks (1,000 servers), the system successfully processed more than 13,000 Redfish telemetry endpoints per minute. Simultaneously, it maintained over 8,000 tokens/second of multiagent large-model reasoning.
This demonstrated that as the infrastructure grew in complexity, the centralized coordination architecture did not become a bottleneck. The test also showed that every agent received real-time, high-resolution sensor context, regardless of facility size.
Across all benchmarks, the integrated solution demonstrated stable, real-time, end-to-end autonomous operation under data-center scale load.
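As a back-of-the-envelope check, those published figures imply a modest per-server polling rate, computed here from only the numbers above:

```python
# Back-of-the-envelope check on the published test figures.
racks, servers = 200, 1_000
endpoints_per_min = 13_000

per_server_per_min = endpoints_per_min / servers   # ~13 reads per server
per_rack_per_min = endpoints_per_min / racks       # ~65 reads per rack
print(f"{per_server_per_min:.0f} reads/server/min; "
      f"{per_rack_per_min:.0f} reads/rack/min")
```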
So do you have customers who are eager to try liquid cooling, but concerned about the risks? If so, tell them about this new AI-powered solution from AMD, Supermicro and Metrum AI.
Delivering an AI return on investment can be challenging. A new IDC white paper offers a solution: match infrastructure to the use case.
Companies can build a strong return on investment (ROI) for their AI projects—but only if they understand how to leverage different infrastructure solutions for different AI use cases. In other words, they need to know how to do purpose-fitting.
That’s the case argued in a new IDC white paper sponsored by AMD and Supermicro.
The paper’s two co-authors are Peter Rutten, research VP in IDC’s worldwide infrastructure research group and global research lead of the firm’s performance-intensive computing practice; and Madhumitha Sathish, research manager for high-performance computing at IDC and lead of the firm’s AI infrastructure research.
Rutten and Sathish find not all is well in the world of AI. In a survey conducted by IDC this past September, fewer than half of companies worldwide said their AI-related projects have delivered any measurable business outcomes. And only about one in 10 companies (11.4%) said they’re obtaining measurable business results from more than 75% of their AI projects.
What’s blocking AI progress? According to the IDC survey, these are the top reasons:
Competition for resources: cited by 34% of survey respondents
Resistance to process change: cited by 30%
Difficulty quantifying AI’s ROI: 28%
Regulatory uncertainty: also 28% (multiple responses were allowed)
“Cost continues to be a major hurdle,” the authors write.
And the biggest cost? Over 60% of companies surveyed by IDC said it's the specialized infrastructure needed to develop and deploy AI.
Four Questions
It doesn’t have to be this way, the IDC authors argue. Instead, AI-using organizations can build a strong ROI for their projects with purpose-fitting.
To do this, managers should ask (and answer) four important questions:
Who decides what your relevant AI use case is? A separate IDC survey finds that fewer than one in three organizations involve IT during an AI initiative's conceptual stage.
What kind of AI model do you need? There are many, including machine learning, GenAI, agentic AI, deep neural network, etc. Not all require major capital expenditures.
How will you obtain this AI model? Each approach involves trade-offs. For example, most businesses fine-tune or customize an existing commercial model. But this approach involves both licensing costs and training costs.
Have you considered the biggest factors that impact AI infrastructure needs? These factors include AI model types, number of parameters, volume of training data, query response times, and query size.
By taking these factors into account, the authors say, enterprises can develop AI options that match their AI use case, creating a purpose-built infrastructure solution.
Spectrum Choices
To contain AI infrastructure costs, the IDC authors recommend that managers develop what they call a "spectrum of options" based on seven factors: complexity, parameter count, data volume, model accuracy, time to value, query response latency, and query size.
When these factors are low or small, an AI project is in the blue zone, which implies lower costs. As these factors become higher or larger, the project moves into the green and red zones, which imply higher costs, as shown in the IDC chart below.
Hardware system requirements can vary by spectrum, too.
Blue zone projects, those with the lowest infrastructure costs, can be run on CPU-based, air-cooled systems, or even a PC or workstation.
Green zone projects, those with intermediate infrastructure costs, can run on systems powered by CPUs with built-in accelerators and lighter co-processors.
And red zone projects, those with the highest infrastructure costs, require rack-scale systems with high-end CPUs, GPUs and liquid cooling.
But wait, there’s more. The IDC authors point to several additional considerations:
Is there more than one AI use case in development? Typically, there are. If that’s the case, then that will need to be built into the needs projection.
How rapidly will the AI use case evolve over time? For example, if the number of users is projected to grow substantially, then the accounting must consider new infrastructure that will be required.
How often will the AI model require generational updates? Many models are constantly being improved, expanded and retrained, and these updates will have infrastructure impacts.
Better Together
The IDC authors say AI-using companies would do well to consider AMD-powered Supermicro systems. The two suppliers work with a vast ecosystem of partners to offer alternatives and options.
AMD and Supermicro demystify complexity, helping companies plan their AI projects faster and better. And they offer reliable, high-performance platforms that support AI workloads across a wide range of deployment scales.
“AMD and Supermicro,” the IDC authors write, “have developed some of the most versatile, powerful and well-tailored solutions available today.”
Discover how AI factories work—and how your clients might benefit from building an AI factory of their own.
How can you tell that the AI Era is here? One way is by noticing that large enterprises are increasingly focused on mass producing AI models.
It’s no longer enough to have a decent set of working AI models to power Spotify’s suggestion engine or Accenture’s Big Data analytics.
To keep up with—and surpass—the Joneses, Spotify and Accenture will need dedicated systems that work every day to create, evaluate and iterate their AI models.
These systems are called AI factories. Somewhat like a factory that creates physical widgets, an AI factory churns out new and updated AI models. This continual AI production process helps enterprises react quickly to market demands and competition.
Make no mistake: The development of AI factories represents a turning point in the evolution of AI-powered business.
No. 2 with a Bullet
This theory is supported by some of IT’s top thinkers. They include Tom Davenport, a professor, speaker and author; and Randy Bean, a corporate advisor.
Davenport and Bean co-wrote an article that appeared earlier this month in the MIT Sloan Management Review: "Five trends in AI and data science for 2026." In their article, the authors place AI factories in the No. 2 spot. AI factories, they say, will be adopted by users and "all-in" AI adopters that include consumer products makers, banks and software companies.
As Davenport and Bean explain, an AI factory combines technology platforms, methods, data and previously developed algorithms to make building AI systems easy and fast. The authors’ all-important message: Watch this space.
How AI Factories Work
To fully understand the concept of an AI factory, it can help to think of the traditional smoke-belching, brick-and-mortar factories it’s named for.
Of course, there are some differences. A physical factory takes in raw materials, uses machines to process them, and produces physical products.
By contrast, an AI factory takes in data (such as text, audio, images and logs), runs that data through massive compute engines, and outputs AI models for recommendations, predictions, automation and generative content.
Another difference: Unlike the static products that emerge from traditional factories, the products of AI factories are virtual. They learn and grow as new data, infrastructure and techniques become available. In this way, AI factories help their organizations keep up with rapid changes and market shifts.
For instance, a new AI model produced by an enterprise’s AI factory can be continuously retrained as new data becomes available. While each new iteration deployed in the field busily suggests which Netflix movie to watch next, a newer version is constantly being developed in the background. When the new suggestion engine is ready, Netflix can seamlessly slide it into place.
Why Your Clients Probably Need an AI Factory
It’s good to understand the abstract benefits of an AI factory. But your clients will also want to know how building one can translate into business results.
Here’s the bottom line. An AI factory can:
Dramatically reduce the cost of business intelligence. Once an AI factory is built and a given AI model is trained, that model can run continuously, serving millions of decisions, predictions, etc., for a fraction of its initial cost. In other words, the cost per additional decision rapidly collapses toward zero. (A toy cost model follows this list.)
Help organizations maintain a decisive competitive advantage. This happens on two levels. First, maintaining a constant production stream of AI models and iterations helps your clients meet market demands as quickly as possible. And second, having that ability to react faster to customer needs and economic conditions can help create and sustain an advantage over competitors.
Turn data into capital. Many organizations are ill-equipped to analyze and monetize all the data they collect. All that piled-up data can seem like an albatross around their neck. But by building an AI factory, the organization can harness that otherwise squandered data and put it to work.
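To see why the cost per decision collapses, consider a toy amortization model. All dollar figures below are invented for illustration; they are not IDC or vendor numbers.

```python
# Toy amortization model: cost per decision falls toward the marginal
# serving cost as volume grows. All dollar figures are invented.
def cost_per_decision(capex: float, opex_per_year: float,
                      decisions_per_year: float, years: int = 3) -> float:
    total_cost = capex + opex_per_year * years
    return total_cost / (decisions_per_year * years)

for volume in (1e6, 1e8, 1e10):
    c = cost_per_decision(capex=2_000_000, opex_per_year=500_000,
                          decisions_per_year=volume)
    print(f"{volume:.0e} decisions/yr -> ${c:.6f} per decision")
```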
Further, companies that don’t build an AI factory could find themselves at a competitive disadvantage. Davenport and Bean, in their Sloan Management Review article, say companies that lack an AI factory will find building AI at scale both expensive and time-consuming.
Stumbling Blocks? A Couple
Building an AI factory isn’t always easy. Enterprises can run into serious roadblocks.
For one, siloed, inconsistent or low-trust data can make for a messy AI production process. As programmers say, “garbage in, garbage out.” In other words, if the data is messy, the analysis will be, too.
Talent bottlenecks can also wreak havoc on the virtual factory floor. There are only so many data scientists to go around, and they're in high demand. Finding the right employees is a key component here, even in an age of super-smart robots.
Another trap your clients need to watch out for is bureaucratic hold-ups. Legal, compliance and trust issues can cause AI projects to grind to a halt.
The AI Factory Future
Like everything else in the fast-moving AI world, AI factories are changing. In the near future, AI factories will likely focus on the immediacy of real-time, always-on learning.
As AI factories shift to nearly continuous adaptation, enterprises will use their AI model updates to keep pace with rapidly changing market conditions and customer demands.
Another likely future is inferencing at the edge. For “edge,” think vehicles, devices and brick-and-mortar factories. Organizations that move inferencing closer to where data is created can lower system latency (that is, increase speed) and reduce cloud costs.
Another factor that could make a big impact on AI factories is new software and hardware integrations. A recent Supermicro webinar on AI factories and related technology showed how enterprises can benefit from integrating software platforms such as Supermicro’s SuperCloud Composer (SCC) and Power Asset Orchestrator (PAO).
Supermicro says this potent combination gives operators total visibility into their AI factories. It can also optimize everything from GPU telemetry to real-time grid pricing.
Overall, it’s safe to assume that when these and other updates are deployed, AI factories will quickly become part of the common AI infrastructure. In so doing, they’ll touch nearly every aspect of our daily lives.
ROCm, part of AMD’s portfolio since 2016, translates code written by human programmers into instruction sets that AMD GPUs and CPUs can understand and execute.
Now AMD has purpose-built ROCm 7.0 for GenAI, large-scale AI training, and AI inferencing. Essentially, ROCm now offers the tools and runtime to make the most complex GPU workloads run efficiently.
The full ROCm 7.0 stack contains multiple components. These include drivers, the Heterogeneous-compute Interface for Portability (HIP), math and AI libraries, compilers and system-management tools.
One of the two new servers, Supermicro model number AS-4126GS-NMR-LCC, is a 4U liquid-cooled system. This server can handle up to eight GPUs, the user's choice of AMD Instinct MI325X or MI355X.
The other server, Supermicro model number AS-8126GS-TNMR, is a larger 8U system that's air-cooled. It also offers a choice of AMD GPUs: either the AMD Instinct MI325X or AMD Instinct MI350X.
Both servers feature PCIe 5.0 connectivity; memory capacities of up to 2.3TB; support for AMD’s ROCm open-source software; and support for AMD Infinity Fabric Link connections for GPUs.
In June, Supermicro CEO Charles Liang said the new servers “strengthen and expand our industry-leading AI solutions—and give customers greater choice and better performance as they design and build the next generation of data centers.”
EPYCs for SMBs
In May, AMD introduced a CPU series designed specifically for small and medium businesses.
The processors, known as the AMD EPYC 4005 Series, bring a full suite of enterprise-level features and performance. But they’re designed for on-prem SMBs and cloud service providers who need cost-effective solutions in a 3U form factor.
“We’re delivering the right balance of performance, simplicity, and affordability,” says Derek Dicker, AMD’s corporate VP of enterprise and HPC.
That balance includes the same AMD ‘Zen 5’ core architecture behind the AMD EPYC 9005 Series processors used in data centers run by large enterprises.
The AMD EPYC 4005 Series CPUs for SMBs come in a single-socket package. Depending on model, they offer anywhere from 6 to 16 cores and boost clock speeds of up to 5.7 GHz.
One model of the AMD EPYC 4005 line also includes integrated AMD 3D V-Cache technology for a larger 128MB L3 cache and lower latency.
These MicroBlade servers are intended for small and midsize cloud service providers.
Each blade is powered by a single AMD EPYC 4005 CPU. When 20 blades are combined in the system’s 6U form factor, the system offers 3.3x higher density than a traditional 1U server. It also reduces cabling by up to 95%, saves up to 70% space, and lowers energy costs by up to 30%.
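Those density numbers are easy to sanity-check from the figures given:

```python
# Sanity check: 20 single-socket blades in 6U vs. one server per 1U.
blades, chassis_height_u = 20, 6
blade_density = blades / chassis_height_u    # ~3.33 servers per rack unit
traditional_density = 1 / 1                  # one 1U server per rack unit
print(f"{blade_density / traditional_density:.1f}x higher density")  # ~3.3x
```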
This MicroBlade system with an AMD EPYC 4005 processor is also available as a motherboard (model number BH4SRG) for use in Supermicro A+ servers.
~~~~~~~~~
Happy holidays from all of us at Performance Intensive Computing, and best wishes for the new year! We look forward to serving you in 2026.
Each server is powered by the customer’s choice of dual AMD EPYC 9004 or 9005 Series CPUs with up to 384 cores and 768 threads. The system also features a total of eight AMD Instinct MI355X onboard OAM GPU accelerator modules, which are air-cooled. (OAM is short for OCP Accelerator Module, an industry-standard form factor for AI hardware.) In addition, these accelerated GPU servers offer up to 6TB of DDR5 system memory.
While the systems are air-cooled with up to 19 heavy-duty fans, there's no penalty in terms of cooling capacity. In fact, AMD has boosted the GPU's thermal design power (TDP)—the maximum amount of heat a server's cooling system can handle—from 1000W to 1400W.
Also, compared with the company's air-cooled 8U server based on AMD Instinct MI350X GPUs, the 10U server offers up to a double-digit performance increase, according to Supermicro. For end users, that means faster data processing.
More Per Rack
The bigger picture: Supermicro's new 10U option lets customers unlock higher performance per rack, with a choice of 10U air cooling or 4U liquid cooling, both powered by the latest AMD EPYC processors.
Supermicro’s GPU solutions are designed to offer maximum performance for AI and inference at scale. And they’re intended for use by both cloud service providers and enterprises.
Are your customers looking for a GPU-powered server that’s air cooled? Tell them about these new Supermicro 10U servers. And let them know that these systems are ready to ship now.
Tech Explainer: What’s liquid cooling? And why might your data center need it now?
Liquid cooling offers big efficiency gains over traditional air cooling. And while there are upfront costs, for data centers running high-performance AI and HPC servers, the savings can be substantial. Learn how it works.
Increasingly resource-intensive AI workloads are creating more demand for advanced data center cooling systems. Today, the most efficient and cost-effective method is liquid cooling.
A liquid-cooled PC or server relies on a liquid rather than air to remove heat from vital components that include CPUs, GPUs and AI accelerators. The heat produced by these components is transferred to a liquid. Then the liquid carries away the heat to where it can be safely dissipated.
Most computers don’t require liquid cooling. That’s because general-use consumer and business machines don’t generate enough heat to justify liquid cooling’s higher upfront costs and additional maintenance.
However, high-performance systems designed for tasks such as gaming, scientific research and AI can often operate better, longer and more efficiently when equipped with liquid cooling.
How Liquid Cooling Works
For the actual coolant, most liquid systems use either water or dielectric fluids. Before water is added to a liquid cooler, it’s demineralized to prevent corrosion and build-up. And to prevent freezing and bacterial growth, the water may also be mixed with a combination of glycol, corrosion inhibitors and biocides.
Thus treated, the coolant is pushed through the system by an electric pump. A single liquid-cooled PC or server will need to include its own pump. But for enterprise data center racks containing multiple servers, the liquid is pumped by what’s known as an in-rack cooling distribution unit (CDU). Then the liquid is distributed to each server via a coolant distribution manifold (CDM).
As the liquid flows through the system, it's channeled into cold plates that are mounted atop the system's CPUs, GPUs, DIMMs, PCIe switches and other heat-producing components. Each cold plate has microchannels through which the liquid flows, absorbing and carrying away each component's thermal energy.
The next step is to safely dissipate the collected heat. To accomplish this, the liquid is pumped back through the CDU, which sends the now-hot liquid to a mechanism that removes the heat. This is typically done using chillers, cooling towers or heat exchangers.
Finally, the cooled liquid is sent back to the systems’ heat-producing components to begin the process again.
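Under the hood, this loop is simple sensible-heat physics: the heat carried away equals the coolant's mass flow rate times its specific heat times the temperature rise. Here's a quick sketch with illustrative values, not vendor specs:

```python
# Sensible-heat sketch: Q = m_dot * c_p * delta_T.
# Flow rate and temperatures are illustrative, not vendor specs.
flow_lpm = 10.0                  # coolant flow, liters per minute (assumed)
m_dot = flow_lpm / 60 * 1.0      # kg/s (water: ~1 kg per liter)
c_p = 4186.0                     # specific heat of water, J/(kg*K)
t_in, t_out = 30.0, 45.0         # supply/return temperatures, C (assumed)

q_watts = m_dot * c_p * (t_out - t_in)
print(f"Heat removed: {q_watts/1000:.1f} kW")   # ~10.5 kW for this loop
```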
Liquid Pros & Cons
The most compelling aspect of liquid cooling is its efficiency. Water moves heat up to 25 times better than air while using less energy to do it. In comparison with traditional air cooling, liquid cooling can reduce cooling energy costs by up to 40%.
But there’s more to the efficiency of liquid cooling than just cutting costs. Liquid cooling also enables IT managers to move servers closer together, packing in more power and storage per square foot. Given the high cost of data center real estate, and the fullness of many data centers, that’s an important benefit.
In addition, liquid cooling can better handle the latest high-powered processing components. For instance, Supermicro says its DLC-2 next-generation direct liquid-cooling solutions, introduced in May, can accommodate warmer liquid inflow temperatures while also enhancing AI performance per watt.
But liquid cooling systems have their downsides, too. For one, higher upfront costs can present a barrier to entry. Sure, data center operators will realize a lower total cost of ownership (TCO) over the long run. But when deploying a liquid-cooled data center, they must still contend with initial capital expense (CapEx) outlays—and justifying those costs to the CFO.
For another, IT managers might think twice about the additional complexity and risks of a liquid cooling solution. More components and variables mean more things that can go wrong. Data center insurance premiums may rise too, since a liquid cooling system can always spring a leak.
Driving Demand: AI
All that said, the market for liquid cooling systems is primed for serious growth.
As AI workloads become increasingly resource-intensive, IT managers are deploying more powerful servers to keep up with demand. These high-performance machines produce more heat than previous generations. And that creates increased demand for efficient, cost-effective cooling solutions.
How much demand? This year, the data center liquid cooling market is projected to drive global sales of $2.84 billion, according to researcher MarketsandMarkets.
Looking ahead, the industry watcher expects the global liquid cooling market to reach $21.14 billion by 2032. That rise would represent a compound annual growth rate (CAGR) of 33% over the projected period.
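Those two figures are internally consistent, as a quick check of the math shows:

```python
# CAGR check on the market figures above: 2025 -> 2032 is 7 growth periods.
start, end, years = 2.84, 21.14, 7
cagr = (end / start) ** (1 / years) - 1
print(f"CAGR ~ {cagr:.1%}")   # ~33.2%, matching the cited 33%
```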
Coming Soon: Immersive Cooling
In the near future, AI workloads will likely become even more demanding. This means data centers will need to deploy—and cool—ultra-dense AI server clusters that produce tremendous amounts of heat.
To deal with this extra heat, IT managers may need the next step in data center cooling: immersion.
With immersion cooling, an entire rack of servers is submerged horizontally in a tank filled with what’s known as dielectric fluid. This is a non-conductive liquid that ensures the server’s hardware can operate while submerged, and without short-circuiting.
Immersion cooling is being developed along two paths. The most common variety is called single-phase, and it operates similarly to an aquarium’s water filter. As pumps circulate the dielectric fluid around the servers, the fluid is heated by the server’s components. Then it’s cooled by an external heat exchanger.
The other type of immersion cooling is known as two-phase. Here, the system uses a dielectric fluid engineered to have a relatively low boiling point of around 50°C (122°F). As this fluid is heated by the immersed server, it boils, creating a vapor that rises to condensers installed at the top of the tank. There the vapor condenses back into cooler liquid, which drips back down into the tank.
This natural convection means there’s no need for electric pumps. It’s a glimpse of a smarter, more efficient liquid future, coming soon to a data center near you.
Where & when: San Jose, California; Oct. 13-16, 2025
Who it’s for: This event, sponsored by the Open Compute Project (OCP), is for anyone interested in redesigning open source hardware to support the changing demands on compute infrastructure. This year’s theme: “Leading the future of AI.”
Who will be there: Speakers this year include Vik Malyala, senior VP of technology and AI at Supermicro; Mark Papermaster, CTO of AMD; Johnson Eung, staff growth product manager in AI at Supermicro; Shane Corban, senior director of technical product management at AMD; and Morris Ruan, director of product management at Supermicro.
Fun facts: AMD is a Diamond sponsor, and Supermicro is an Emerald sponsor.
Who it’s for: Developers of artificial intelligence applications and systems. Workshop topics will include developing multi-model, multi-agent systems; generating videos using open source tools; and developing optimized kernels.
Who will be there: Speakers will include executives from the University of California, Berkeley; Red Hat AI; Google DeepMind; and OpenAI. Also speaking will be execs from Ollama, an open source platform for AI models; Unsloth AI, an open source AI startup; vLLM, a library for large language model (LLM) inference and serving; and SGLang, an LLM framework.
Fun facts:
Supermicro is a conference sponsor.
During the conference, winners of the AMD Developer Challenge will be announced. The grand prize winner will take home $100,000.
AMD, PyTorch and Unsloth AI are co-sponsoring a virtual hackathon, the Synthetic Data AI Agents Challenge, on Oct. 18-20. The first-prize winners will receive $3,000 plus 1,200 hours of GPU credits.
Who it’s for: Anyone interested in the convergence of AI innovation and scalable infrastructure. This event is being hosted by Ignite, a go-to-market provider for the technology industry.
Who will be there: The speaker lineup is still TBA, but is promised to include enterprise technology leaders, AI and machine learning engineers, cloud and data center architects, venture capital investors, and infrastructure vendors.
Fun facts:
This is a hybrid event. You can attend either live or online.
Where & when: St. Louis, Missouri; Nov. 16-21, 2025
Who it’s for: The global supercomputing community, including those working in high performance computing (HPC), networking, storage and analysis. This year’s theme: “HPC ignites.”
Who will be there: The speaker lineup will feature nearly a dozen AMD executives, including Rob Curtis, a Fellow in Data Center Platform Engineering; Shelby Lockhart, a software system engineer; and Nuwan Jayasena, a Fellow in AMD Research. They and other speakers will appear in panels, paper presentations, workshops, tutorials and more.
Fun facts: SC25 will feature a series of noncommercial “Birds of a Feather” sessions that allow attendees to openly discuss topics of mutual interest.
Looking for business benefits from GenAI? Supermicro, AMD & PioVation have your solution
Struggling to deliver business benefits from Generative AI? Supermicro, AMD and PioVation have a new solution that not only works out of the box, but is also highly scalable.
Experimenting with Generative AI can be fun, but CEOs and corporate boards aren’t interested in fun. They want to see real business results—things like an enhanced customer experience, more innovative products, streamlined operations and lower TCO. And they want to see them now.
Getting GenAI to deliver these kinds of business results isn't easy. A recent report from MIT finds that despite nearly $40 billion worth of enterprise investment in GenAI, 95% of organizations are getting "zero return."
That estimate is based on solid numbers. The MIT researchers reviewed over 300 AI projects, conducted interviews with more than 50 organizations, and surveyed some 150 senior leaders.
The latest forecasts aren’t much cheerier. Research firm Gartner this summer predicted that by the end of this year, nearly a third of all GenAI projects (30%) will be abandoned after the proof-of-concept stage. Gartner says the projects will be cut due to poor data quality, inadequate risk controls, escalating costs and unclear business value.
“After last year’s hype, executives are impatient to see returns on GenAI investments,” says Gartner analyst Rita Sallam. “Yet organizations are struggling to prove and realize value.”
That’s About to Change
Supermicro, AMD and startup PioVation have partnered to jointly develop a GenAI solution that offers a pre-validated, turnkey infrastructure for deploying large language models (LLMs). The benefits include lower deployment overhead, enhanced observability, and ensured control of sovereign data.
Partner PioVation is a developer of AI platforms for enterprises, government agencies, and small and midsize businesses. Its products can be run either on-premises or in PioVation’s cloud in Munich, Germany. The company, founded in 2024 by former AMD executive Mazda Sabony, has formed partnerships with several companies, including AMD and Supermicro.
The GenAI solution being offered by the three companies has been designed to scale all the way from compact on-prem clusters up to large-scale multi-tenant cloud environments. And its architecture integrates Supermicro rack-level systems, AMD Instinct GPUs, and PioVation's agentic AI platform, PioSphere. The result, the companies say, is out-of-the-box agentic AI at any scale.
Full Stack
The Supermicro-AMD-PioVation offering is a full-stack solution. An autonomous microservice chains LLM prompts, invokes domain-specific tools, and integrates seamlessly with your existing systems via REST (an architectural style for distributed hypermedia systems), gRPC (a remote procedure call framework), or event streams running on the pre-validated Supermicro server powered by AMD Instinct GPUs.
Another feature is the solution’s Model Context Protocol (MCP). It lets agents interact with external tools in a way that’s both modular and composable. The MCP also governs how tools are registered, discovered, invoked and composed dynamically at runtime. This includes input/output serialization, maintaining execution context, and enforcing consistency across tool chains. MCP also enables context-aware tool usage, making every agent interoperable, auditable and enterprise-ready from the start.
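To illustrate the register/discover/invoke pattern described above, here's a minimal sketch. The names and structure are invented for illustration; this is not PioVation's actual MCP implementation.

```python
# Minimal tool-registry sketch of the register/discover/invoke pattern.
# Names and structure are illustrative, not PioVation's actual API.
from typing import Any, Callable

class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, Callable[..., Any]] = {}

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        """Make a tool discoverable by agents at runtime."""
        self._tools[name] = fn

    def discover(self) -> list[str]:
        """List the tools currently available to agents."""
        return sorted(self._tools)

    def invoke(self, name: str, **kwargs: Any) -> Any:
        """Invoke a registered tool; failing loudly keeps chains consistent."""
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return self._tools[name](**kwargs)

registry = ToolRegistry()
registry.register("lookup_order",
                  lambda order_id: {"id": order_id, "status": "shipped"})
print(registry.discover())                        # ['lookup_order']
print(registry.invoke("lookup_order", order_id="A-42"))
```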
The solution is available in three topologies, each designed for different operational scales and use cases:
MiniStack: For SMBs, pilots, research and the edge.
EdgeCluster: For regulated sites, branches and other locations where high availability is required.
Cloud Deployment: For cloud service providers (CSPs), enterprises and AI providers.
All three versions include a unified agent dashboard, role-based access control, and policy enforcement.
Business Benefits
The three partners haven’t forgotten about the need for GenAI to deliver real business results that can keep CEOs and corporate boards happy. To that end, the solution offers benefits that include:
Turnkey deployment: PioSphere’s Cloud OS has been prevalidated on the Supermicro platform powered by AMD GPUs.
Unified operations stack: A tightly integrated environment eliminates fragmented AI tooling.
No-code agent development: A PioVation feature known as AgentStudio lets nontechnical users design, deploy and iterate AI agents using a no-code interface.
Sovereign data control: Built-in controls support national and regional compliance frameworks, including Europe’s GDPR and the United States’ HIPAA.
Multi-tenant scalability: An organization can create separate, secure environments for different business units or clients, yet they’ll all share a common infrastructure footprint.
Integrated LLM operations and agent life-cycle management: Users can integrate any LLM published on the Hugging Face or Kaggle communities with one-click connectors (see the sketch after this list). Other built-in features include RAG (retrieval augmented generation) pipelines and full agent life-cycle tools.
Intelligent autoscaling: During workload spikes, the solution's dynamic autoscaling maintains efficient resource utilization, cost control and seamless performance.
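To give a flavor of what the LLM integration wraps, here's a minimal sketch using the open-source Hugging Face transformers library. The model name is just an example; PioSphere's one-click connectors abstract steps like this away.

```python
# Minimal sketch of loading a published LLM with Hugging Face's
# open-source transformers library. The model name is an example.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")   # example model
result = generator("Our Q3 infrastructure priorities are",
                   max_new_tokens=40)
print(result[0]["generated_text"])
```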
Put it all together, and you have a solution that goes far beyond mere experimentation. The three partners—Supermicro, AMD and PioVation—are serious about helping your GenAI projects deliver serious benefits for the business.
Deploy GenAI with confidence: Validated Server Designs from Supermicro and AMD
Learn about the new Validated Design for AI clusters from Supermicro and AMD. It can save you time, reduce complexity and improve your ROI.
The task of designing, building and connecting a server system that can run today’s artificial intelligence workloads is daunting.
Mainly because there are a lot of moving parts. Assembling and connecting them all correctly is not only complicated, but also time-consuming.
Supermicro and AMD are here to help. They've recently co-published a Validated Design document that explains how to build an AI cluster. The PDF also tells you how you can acquire an AMD-powered Supermicro AI cluster pre-built, with all elements connected, configured and burned in before shipping.
Full-Stack for GenAI
Supermicro and AMD are offering a fully validated, full-stack solution for today’s Generative AI workloads. The system’s scale can be easily adjusted from as few as 16 nodes to as many as 1,024—and points in between.
The solution's three core AMD parts (EPYC CPUs, Instinct GPUs and Pensando networking) are all integrated with Supermicro's optimized servers. That includes network cabling and switching.
The new Validated Design document is designed to help potential buyers understand the joint AMD-Supermicro solution’s key elements. To shorten your implementation time, the document also provides an organized plan from start to finish.
Under the Cover
This comprehensive report—22 pages plus a lengthy appendix—goes into a lot of technical detail. That includes the traffic characteristics of AI training, impact of large “elephant” flows on the network fabric, and dynamic load balancing. Here’s a summary:
Foundations of AI Fabrics: Remote Direct Memory Access (RDMA), PCIe switching, Ethernet, IP and Border Gateway Protocol (BGP).
Validated Design Equipment and Configuration: Server options that optimize RDMA traffic with minimal distance, latency and silicon between the RDMA-capable NIC (RNIC) and accelerator.
Scaling Out the Accelerators with an Optimized Ethernet Fabric: Components and configurations including the AMD Pensando Pollara 400 Ethernet NIC and Supermicro’s own SSE-T8196 Ethernet switch.
Design of the Scale Unit—Scaling Out the Cluster: Designs are included for both air-cooled and liquid-cooled setups.
Resource Management and Adding Locality into Work Placement: Covering the Simple Linux Utility for Resource Management (SLURM) and topology optimization, including the concept of rails (a small example follows this list).
Supermicro Validated AMD Instinct MI325 Design: Shows how you can scale the validated design all the way to 8,000 AMD MI325X GPUs in a cluster.
Storage Network Validated Design: Multiple alternatives are offered.
Importance of Automation: Human errors are, well, human. Automation can help with tasks including the production of detailed architectural drawings, output of cabling maps, and management of device firmware.
How to Minimize Deployment Time: Supermicro’s Rack Scale Solution Stack offers a fully integrated, end-to-end solution. And by offering a system that’s pre-validated, this also eases the complexity of multi-vendor integration.
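To ground the SLURM topology idea from the resource-management section, here's a small sketch that emits a tree-style topology.conf, the file SLURM's topology plugin reads so it can place jobs with switch locality in mind. All switch and node names are invented.

```python
# Sketch: emit a tree-style SLURM topology.conf so the scheduler can
# place jobs with leaf-switch locality in mind. Names are invented.
leaf_switches = {
    f"leaf{i}": f"node[{i*16 + 1:03d}-{(i + 1)*16:03d}]" for i in range(4)
}

lines = [f"SwitchName={sw} Nodes={nodes}" for sw, nodes in leaf_switches.items()]
lines.append("SwitchName=spine Switches=leaf[0-3]")   # spine ties leaves together

print("\n".join(lines))
# SwitchName=leaf0 Nodes=node[001-016]
# ...
# SwitchName=spine Switches=leaf[0-3]
```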
Total Rack Solution
Looking to minimize implementation times? Supermicro offers a total rack scale solution that’s fully integrated and end-to-end.
This frees the user from having to integrate and validate a multi-vendor solution. Basically, Supermicro does it for you.
By leveraging industry-leading energy efficiency, liquid and air-cooled designs, and global logistics capabilities, Supermicro delivers a cost-effective and future-proof solution designed to meet the most demanding IT requirements.
The benefits to the customer include reduced operational overhead, a single point of accountability, streamlined procurement and deployment, and maximum return on investment.
For onsite deployment, Supermicro provides a turnkey, fully optimized rack solution that is ready to run. This helps organizations maximize efficiency, lower costs and ensure long-term reliability. It includes a dedicated on-site project manager.