
Need to cool AI hardware safely? Check out the new solution from AMD, Supermicro & Metrum AI

The solution employs AI agents to monitor liquid-cooling systems, identifying and remediating problems quickly.

 

February 19, 2026 | Author: Peter Krass

Liquid cooling is great for controlling the temperature of hard-working AI servers, but the technology also has its risks. Even minor disruptions or fluctuations in a cooling system can quickly lead to massive hardware failures.

A solution to this challenge has been developed by AMD and Supermicro working with Metrum AI Inc., an Austin, Texas-based provider of industry-specific AI agents and AI evaluation products.

Their solution integrates Supermicro infrastructure, AMD computational power and ROCm software, and Metrum AI’s orchestration to deliver fast decisions that ensure safety at scale.

This solution enables multiple AI agents to collaboratively monitor signals, diagnose issues, predict failures, and coordinate corrective actions. The agents are embedded directly into a server’s cooling control plane.

Essentially, this creates a data center that is adaptive, resilient, and self-optimizing. The solution should also support the massive compute intensity of next-generation AI workloads while proactively managing their thermal and physical risks.

And unlike traditional monitoring tools, this solution can predict and then prevent catastrophic hydraulic failures before they occur, and it can do so faster than human intervention would allow.

Power Features

To design these multi-agent systems, the team used AMD ROCm. This open-source software delivers important benefits that include flexibility, optimized libraries and seamless integration with AMD Instinct GPUs.

Another feature that made the solution possible is the massive memory reservoir of the AMD Instinct GPUs. For example, the AMD Instinct MI355X GPU has 288 GB of dedicated memory. This lets large-scale reasoning models operate fully in-memory.

The structural foundation of this platform is the Supermicro 8U server (model AS -8126GS-TNMR) powered by dual AMD EPYC 9005 Series CPUs and supporting up to eight AMD Instinct MI325X or MI350X GPUs.

Unlike standard servers, these systems are engineered with direct-to-chip cooling headers that expose flow, temperature and pressure data directly through Redfish interfaces. (Redfish is a standard designed to deliver simple and secure management for converged systems, hybrid IT and software-defined data centers.) This empowers the agents to monitor and adjust cooling performance in real time.
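To make the Redfish point concrete, here is a minimal Python sketch of how an agent might read such telemetry. It is not the vendors' implementation. The property names (Id, ReadingType, Reading, ReadingUnits) are modeled on the DMTF Redfish Sensor schema, but the sensor IDs, sample values and operating limits are all hypothetical and would need to be checked against what a given server actually exposes.

```python
from dataclasses import dataclass

# Hypothetical payloads shaped like Redfish Sensor resources.
# Sensor IDs, values and reading types are invented for illustration.
SAMPLE_SENSORS = [
    {"Id": "CoolantFlow", "ReadingType": "LiquidFlowLPM",
     "Reading": 11.8, "ReadingUnits": "L/min"},
    {"Id": "CoolantInletTemp", "ReadingType": "Temperature",
     "Reading": 31.5, "ReadingUnits": "Cel"},
    {"Id": "CoolantPressure", "ReadingType": "PressurekPa",
     "Reading": 182.0, "ReadingUnits": "kPa"},
]

# Hypothetical safe operating ranges (low, high) per sensor ID.
LIMITS = {
    "CoolantFlow": (10.0, 20.0),
    "CoolantInletTemp": (15.0, 45.0),
    "CoolantPressure": (100.0, 300.0),
}


@dataclass(frozen=True)
class Reading:
    sensor_id: str
    kind: str
    value: float
    units: str


def parse_sensor(resource: dict) -> Reading:
    """Convert one Sensor-style JSON object into a typed reading."""
    return Reading(
        sensor_id=resource["Id"],
        kind=resource["ReadingType"],
        value=float(resource["Reading"]),
        units=resource.get("ReadingUnits", ""),
    )


def out_of_range(readings, limits):
    """Return the IDs of sensors whose value falls outside its limits."""
    alerts = []
    for r in readings:
        low, high = limits.get(r.sensor_id, (float("-inf"), float("inf")))
        if not low <= r.value <= high:
            alerts.append(r.sensor_id)
    return alerts


readings = [parse_sensor(s) for s in SAMPLE_SENSORS]
print(out_of_range(readings, LIMITS))
```

In a real deployment the payloads would come from authenticated HTTPS requests to the server's Redfish service rather than from an in-memory list.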

The combination of these technologies creates what’s known as a Unified Computational Fabric. There, the AMD EPYC processors feed continuous Redfish data directly into the AMD Instinct GPUs via PCIe 5.0, eliminating I/O bottlenecks.

This synergy lets the platform sustain fast, real-time adaptive control loops across dozens of racks. It’s a capability that conventional air-cooled and CPU-based infrastructures can’t deliver.

Smart Racks

The autonomous cooling system was built on a distributed multi-agent architecture designed specifically for liquid-cooled AI environments. Unlike conventional systems, where monitoring is either centralized or based on human intervention, the solution places intelligence directly at the rack level.

In this setup, lightweight agents continuously monitor telemetry, interpret changes in flow and pressure, and coordinate rapid remediation actions across the data center. This creates a resilient, high-resolution control fabric that can respond to thermal events in milliseconds.
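As a rough illustration of the rack-level idea, the Python sketch below shows one way a lightweight agent could flag a sudden drop in coolant flow against a rolling baseline. The article does not describe Metrum AI's agents at this level of detail, so the class name, window size, 20% threshold and the "remediate" signal are all assumptions.

```python
from collections import deque
from statistics import mean


class FlowAgent:
    """Hypothetical rack-level agent that watches coolant flow (L/min)."""

    def __init__(self, window: int = 5, drop_fraction: float = 0.2):
        # Rolling window of recent healthy readings used as the baseline.
        self.history = deque(maxlen=window)
        # Fractional drop below the baseline that triggers remediation.
        self.drop_fraction = drop_fraction

    def observe(self, flow_lpm: float) -> str:
        """Return "remediate" on a sharp drop, otherwise "ok"."""
        if len(self.history) == self.history.maxlen:
            baseline = mean(self.history)
            if flow_lpm < baseline * (1 - self.drop_fraction):
                # Keep the anomalous sample out of the baseline.
                return "remediate"
        self.history.append(flow_lpm)
        return "ok"


agent = FlowAgent(window=3, drop_fraction=0.2)
for sample in (12.0, 12.1, 11.9, 9.0, 11.8):
    print(sample, agent.observe(sample))
```

The agent stays silent until its window fills, so the first few readings only establish the baseline. A production system would also need to coordinate with neighboring agents before acting, which this sketch leaves out.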

At the base of the stack, AMD ROCm supplies the core libraries, tools, compilers and runtimes for GPU-accelerated compute on AMD Instinct GPUs. Kubernetes orchestration and the AMD GPU Operator, in turn, enable containerized deployment, GPU scheduling and lifecycle management at multirack scale. (Kubernetes is an open-source system for automating the deployment, scaling and management of containerized applications.)

Above this layer, the AMD Enterprise AI Suite delivers higher-level services. The suite is a full stack of enterprise-ready AI services, including solution blueprints, an AI workbench, and a resource manager for unified model deployment, optimization and infrastructure governance.

Metrum AI extends these platform components into a specialized multi-agent architecture. It supports real-time telemetry ingestion, large-model reasoning and autonomous cooling control.

Test Results: Fast Yet Stable

All that sounds good in theory, but does it really work?

To find out, Metrum AI tested the solution along two dimensions: telemetry ingestion throughput and large-model inference stability.

When monitoring a full deployment of 200 racks (1,000 servers), the system successfully processed more than 13,000 Redfish telemetry endpoints per minute. Simultaneously, it maintained over 8,000 tokens/second of multi-agent large-model reasoning.
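To put those figures in perspective, the short Python calculation below breaks them down per rack and per server. The inputs are the reported lower bounds ("more than" and "over"), so the results are floors rather than measurements. The per-rack and per-server numbers are simple averages that assume load is spread evenly, which the article does not state.

```python
# Reported test figures, treated as lower bounds.
RACKS = 200
SERVERS = 1_000
ENDPOINTS_PER_MINUTE = 13_000
TOKENS_PER_SECOND = 8_000

# Derived averages, assuming an even spread across the deployment.
servers_per_rack = SERVERS / RACKS
endpoints_per_second = ENDPOINTS_PER_MINUTE / 60
endpoints_per_rack_per_minute = ENDPOINTS_PER_MINUTE / RACKS
endpoints_per_server_per_minute = ENDPOINTS_PER_MINUTE / SERVERS
tokens_per_rack_per_second = TOKENS_PER_SECOND / RACKS

print(f"Servers per rack:                {servers_per_rack:.0f}")
print(f"Endpoints per second:            {endpoints_per_second:.1f}")
print(f"Endpoints per rack per minute:   {endpoints_per_rack_per_minute:.0f}")
print(f"Endpoints per server per minute: {endpoints_per_server_per_minute:.0f}")
print(f"Tokens per rack per second:      {tokens_per_rack_per_second:.0f}")
```

On these assumptions, each rack averages about one endpoint read per second and roughly 40 tokens per second of reasoning capacity.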

This demonstrated that as the infrastructure grew more complex, the centralized coordination architecture did not become a bottleneck. The test also showed that every agent received real-time, high-resolution sensor context, regardless of facility size.

Across all benchmarks, the integrated solution demonstrated stable, real-time, end-to-end autonomous operation under data-center-scale load.

So do you have customers who are eager to try liquid cooling, but concerned about the risks? If so, tell them about this new AI-powered solution from AMD, Supermicro and Metrum AI.
