Run HPC workloads faster with this AMD-Supermicro combo

Compared with a CPU-only system, a Supermicro system with AMD CPUs and GPUs delivered double-digit performance gains. In benchmark tests, some HPC workloads that previously required months of simulation time were now completed in just days.

Are you or your customers looking to boost high-performance computing (HPC) workloads with greater performance, scalability and energy efficiency?

If so, then Supermicro and AMD have what you’re looking for. Together, they’ve demonstrated dramatic improvements in HPC workloads using a Supermicro system powered by AMD Instinct MI355X GPUs.

Compared with a CPU-only system, this AMD-Supermicro setup delivered double-digit performance gains. In benchmark tests, some HPC workloads that previously required months of simulation time were now completed in just days.

That’s because HPC workloads — especially those that are compute-intensive — can benefit from GPU accelerators.

The massive parallelism of GPUs enables faster iteration and higher resolution modeling. And accelerators such as the AMD Instinct MI355X transform traditional CPU-bound HPC clusters into GPU-accelerated supercomputing platforms.

Booming Benchmarks

The improvements can be dramatic for both performance/watt and time-to-solution. But how dramatic? That’s what AMD and Supermicro set out to discover.

To find out, they ran benchmarks on a liquid-cooled Supermicro 4U server powered by dual AMD EPYC CPUs and eight Instinct MI355X GPUs.

The AMD Instinct MI355X is a compelling solution for HPC applications, mainly because of its massive memory capacity (288GB of HBM3E per GPU), high double-precision (FP64) performance, and its support for the open-source AMD ROCm software ecosystem. These features enable it to handle extensive and complex scientific modeling, simulations and data-intensive tasks efficiently and at scale.

The benchmarks were generated using Chroma (software for quantum lattice field theory), Gromacs (software for molecular dynamics) and NAMD (molecular dynamics simulations). Here were the test workloads:

  • Chroma QUDA BICGSTAB Clover Solver: Fast lattice quantum chromodynamics
  • Gromacs ADH-Dodec: High-throughput benchmark for small and medium biomolecular systems
  • Gromacs Cellulose-NVE: Biomolecular simulation of crystal structures
  • Gromacs STMV Virus: Biomolecular simulation of plant virus
  • NAMD large-scale MD: Molecular dynamics simulation with 1 million steps

Compared with a CPU-only system, results were delivered anywhere from 6x to 14x faster, depending on the benchmark.

Are your customers looking for that kind of HPC advantage? Then tell them about this Supermicro and AMD solution.

To supercharge AI clusters, check out a newly validated solution from AMD, Supermicro & Mirantis

Validating Supermicro hardware with Mirantis k0rdent AI represents a shift from building clusters to composing them.

Full-stack AI infrastructure solutions are having a moment. And why not. Organizations choose these solutions to speed GPU operations, ensure efficient GPU utilization, and enforce security and compliance at scale.

One such solution is k0rdent AI, a turnkey, production-ready “super control plane” for managing complex AI environments. It automates provisioning, life-cycle management, and orchestration of infrastructure and core services.

The company behind k0rdent is Mirantis Inc. It’s privately held and based in Campbell, Calif. Founded in 2011, Mirantis today has over 800 employees.

Importantly, Mirantis is also a contributor to Kubernetes, the open-source system for automating the deployment, scaling, and management of containerized applications. Containerization is a software-deployment process that creates a single software package, known as a container, that can run on all types of devices and operating systems.

Mirantis helps organizations achieve digital self-determination by giving them complete control over their strategic infrastructure. The company’s customers include such well-known brands as Adobe, DocuSign and PayPal.

Could Supermicro benefit from the solution’s capabilities? To find out, Supermicro recently validated its modular server architecture with k0rdent.

Testing, Testing

For the validation, Supermicro used two of its own systems:

  • A Supermicro 8U GPU server (model AS-8126GS-TNMR) powered by dual AMD EPYC 9005 CPUs and up to eight AMD Instinct MI325X GPUs.
  • A Supermicro 2U Big Twin server (model AS-2124BT-HNTR) powered by dual AMD EPYC 7003 processors.

Validation began at the physical level, where the k0rdent bare-metal operator acts as a bridge between the Kubernetes API and the Supermicro servers. This delivered automated BIOS configuration, firmware updates, RAID orchestration, and deployment of a hardened host OS.

Next, the testing team deployed the AMD GPU Operator via the k0rdent catalog. GPU Operator simplifies the deployment and management of AMD Instinct GPUs in Kubernetes clusters, enabling seamless configuration and operation of GPU-accelerated workloads.

The AMD Network Operator was deployed, too. It's a control component that enables GPU-to-GPU communications in an AI cluster, managing AMD NICs in Kubernetes clusters.
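
Neither operator’s internals are shown here, but a quick way to confirm such a deployment worked is to ask Kubernetes what each node can schedule. This is only a minimal sketch using the official kubernetes Python client; it assumes the AMD device plugin’s default resource name, amd.com/gpu, and a reachable kubeconfig:

```python
# Illustrative check that GPU-Operator-managed nodes expose GPUs to the
# Kubernetes scheduler. Assumes the AMD device plugin advertises GPUs
# under the "amd.com/gpu" resource name (the plugin's default).
from kubernetes import client, config

def report_amd_gpus() -> None:
    config.load_kube_config()  # or config.load_incluster_config() inside a pod
    v1 = client.CoreV1Api()
    for node in v1.list_node().items:
        allocatable = node.status.allocatable or {}
        gpus = allocatable.get("amd.com/gpu", "0")
        print(f"{node.metadata.name}: {gpus} AMD GPU(s) allocatable")

if __name__ == "__main__":
    report_amd_gpus()
```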

Here was the test configuration:

  • Scope: Single GPU unit performance

The testers used a custom PyTorch script to measure raw compute throughput across different precisions. (PyTorch is an open-source deep learning library.)
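
The script itself wasn’t published, so the following is only a minimal sketch of that kind of precision sweep: timed large matrix multiplies, with TFLOPS reported per dtype. The matrix size, iteration count and dtypes are illustrative assumptions.

```python
# Sketch of a raw-throughput test across precisions: time big matmuls
# and convert elapsed time to TFLOPS (2*n^3 FLOPs per n x n matmul).
import time
import torch

def matmul_tflops(dtype: torch.dtype, n: int = 8192, iters: int = 20) -> float:
    # On ROCm builds of PyTorch, device "cuda" maps to the AMD GPU via HIP.
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return (2 * n**3 * iters) / elapsed / 1e12

for dt in (torch.float32, torch.bfloat16, torch.float16):
    print(dt, f"{matmul_tflops(dt):.1f} TFLOPS")
```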

Results Delivered

The validation successfully demonstrated the automated provisioning of production-grade Kubernetes clusters on Supermicro bare-metal hardware using k0rdent’s declarative orchestration engine and the Bare Metal Operator (BMO).

k0rdent managed the entire lifecycle of the Supermicro nodes, from out-of-band discovery via BMC/IPMI (Baseboard Management Controller/Intelligent Platform Management Interface) and hardware introspection all the way to automated OS imaging and Kubernetes bootstrapping.

This eliminated manual configuration and hypervisor overhead. It also provided a high-performance, consistent, and repeatable deployment model that adheres to Cluster API (CAPI) standards.

As Supermicro explains, the validation confirms that k0rdent effectively bridges the gap between physical server management and cloud-native agility. That makes it an ideal solution for resource-intensive workloads requiring direct hardware access and deterministic performance on Supermicro infrastructure.

Conclusions

Validating Supermicro hardware with Mirantis k0rdent AI represents a shift from building clusters to composing them.

Enterprises can run their entire portfolios—from legacy apps to cutting-edge LLMs—on a single, unified, bare-metal platform with automatic deployment and comprehensive platform management from the bare metal up.

If you have customers eager to eliminate human error and inconsistencies from the AI deployment and management processes, tell them to check out this solution.

Tech Explainer: What’s a Neocloud?

This cloud variant has arisen to meet the needs of AI developers. Find out how it differs from hyperscalers—and why your customers might want to jump on board.

A new kind of technology demands a new kind of cloud.

Sure, it’s easy to take cloud computing for granted. After all, it’s been years since “the cloud” became part of our lives and everyday vernacular.

Over the years, clouds ranging from the simple (think Dropbox) to the fabulously complex (think multicloud ecosystems) have been powerful enough to handle whatever we’ve thrown their way.

But now our widespread adoption of AI demands a new kind of cloud.

To the rescue: Behold the neocloud!

Neoclouds offer AI workload-specific functionality as a service. And to save enterprises and SMBs considerable time and money, neoclouds offer platforms designed for the rapid development and launch of the latest AI creations.

A neocloud isn’t your typical “run anything” platform. Instead, it’s optimized to run a narrow selection of highly specialized AI-centric tasks. These include AI/ML inference and training, data analytics and media rendering.

Neoclouds vs. Traditional Clouds

To better understand how neoclouds fit into the grand scheme of modern cloud architecture, it helps to compare and contrast them with their forebear, the hyperscaler.

Hyperscalers such as Amazon Web Services (AWS), Microsoft Azure and Google Cloud also offer cloud-based services. They simply offer a much larger and less AI-specific selection.

The seemingly endless array of services these hyperscalers offer makes them ideal for developers who prize flexibility and versatility. Hyperscalers let developers combine multiple managed services to simultaneously harness the power of distributed databases, machine-learning pipelines and other components of a highly customized platform.

By contrast, neoclouds are tuned for specific workloads. They offer a narrower focus and so-called “opinionated architecture” designed to make autonomous architectural decisions. That level of specificity and autonomy changes the nature of the development process from DIY to plug & play.

More-Specific Hardware, Too

To fully compare neocloud apples with hyperscaler oranges, you also need to look under the hood. The tech behind the latest cloud type makes a huge difference.

For both hyperscalers and neoclouds, we’re talking about some of the most advanced tech ever. But here again, it’s the neocloud’s laser-like focus on AI that makes it an invaluable development tool.

It’s for that reason that popping the top off an AI server like Supermicro’s 8U server (model AS-8126GS-TNMR) will treat you to a view of truly cutting-edge CPUs, GPUs and networking gear. That gear includes a pair of server-focused AMD EPYC 9005 Series processors with as many as 384 cores and up to 6TB of DDR5 memory.

For brute-force AI processing, the Supermicro A+ server also offers room for eight onboard AMD Instinct MI350X GPUs banded together via AMD Infinity Fabric Link.

Supermicro’s behemoth is also equipped with AMD ROCm. Pronounced “rock-em,” it’s a software stack designed to translate the code written by programmers into sets of instructions that AMD GPUs can understand and execute perfectly.

The Neocloud Sales Pitch, Condensed

The what and how of neoclouds are important. But if your customers are considering investing in neocloud, they’ll surely want to know about the why, as well.

So why would you want to engage a neocloud for AI development? There are four main reasons:

1. Neoclouds cut admin work, letting you concentrate instead on production.

A new eBook from Supermicro and AMD, The Smartest Path to Scalable AI, cites neoclouds for their “frictionless dev-to-prod motion.”

That’s tech business-speak for a system that handles the nitty-gritty details, getting out of your way so you can get to work. That includes one-command access to optimized hardware and preconfigured environments.

Bottom line: Less admin, more development, and faster time-to-market.

2. A neocloud delivers instant gratification, not endless development integration.

“Day 0 readiness” is the catchphrase that sums up this one. And not just for any single aspect of the neocloud platform, but for the whole stack. That includes hardware, software, and the managed offerings wrapped around them, collectively referred to as services.

Bottom line: Large models and agents start running efficiently from the get-go.

3. A neocloud is always up-to-date with the latest, greatest silicon.

The last thing you want to contend with is outdated infrastructure. That may fly when you’re building a last-decade file-storage app. But creating tomorrow’s brilliant new AI requires cutting-edge tech. The problem is, that tech gets expensive. The solution? Rent, don’t buy.

Bottom line: Access to all the cool toys, with no down payment.

4. It’s already got wheels; you don’t have to reinvent them.

Neoclouds come well stocked with what are known as specialized microservices. These are pre-built, workload-specific building blocks that developers can stand on to bypass the mundanities of production and get to the good stuff.

Examples of wheels you won’t have to reinvent include distributed training orchestration, streaming ingestion services, and GPU render farms.

Bottom line: Neoclouds do the boring due diligence, and let developers get all the glory.

The Future’s Future

Neoclouds are already the future. They’re coming online now, and revealing themselves to be the greatest thing for developers since sliced bread.

But tech moves fast these days. There’s always someone thinking about the next step.

When it comes to the next step for neoclouds, that’s likely to involve deeper specialization, more compelling economics, and consolidation.

That makes sense in terms of the big picture. As both enterprises and SMBs adopt neoclouds, they’ll create more demand. That demand, in turn, should help fund expansion.

Eventually, we may see a new level of specificity. For example, one neocloud could offer low-latency SaaS production inferencing, while another may focus on analytics that cater to medical research.

What happens after that is hard to predict. But one easy-to-believe theory foretells a time in which neoclouds plug into hyperscalers. With that kind of power, imagine what tomorrow’s developers will be able to do!

Need to cool AI hardware with safety? Check out the new solution from AMD, Supermicro & Metrum AI

The solution employs AI agents to monitor liquid-cooling systems, identifying and remediating problems quickly.

Liquid cooling is great for controlling the temperature of hard-working AI servers, but the technology also has its risks. Even minor disruptions or fluctuations in a cooling system can quickly lead to massive hardware failures.

A solution to this challenge has been developed by AMD and Supermicro working with Metrum AI Inc., an Austin, Texas-based provider of industry-specific AI agents and AI evaluation products.

Their solution integrates Supermicro infrastructure, AMD computational power and ROCm software, and Metrum AI’s orchestration to deliver fast decisions that ensure safety at scale.

This solution enables multiple AI agents to collaboratively monitor signals, diagnose issues, predict failures, and coordinate corrective actions. The agents are embedded directly into a server’s cooling control plane.

Essentially, this creates a data center that is adaptive, resilient, and self-optimizing. The solution should also support the massive compute intensity of next-generation AI workloads while proactively managing their thermal and physical risks.

And unlike traditional monitoring tools, this solution can actually predict and then prevent catastrophic hydraulic failures—before they occur. And do so faster than would be possible with traditional human intervention.

Power Features

To design these multi-agent systems, the team used AMD ROCm. This open-source software delivers important benefits that include flexibility, optimized libraries and seamless integration with AMD Instinct GPUs.

Another feature that made the solution possible is the massive memory reservoir of the AMD Instinct GPUs. For example, the AMD Instinct MI355X GPU has a dedicated memory of 288 GB. This lets large-scale reasoning models operate fully in-memory.

The structural foundation of this platform is the Supermicro 8U server (model AS-8126GS-TNMR) powered by dual AMD EPYC 9005 Series CPUs and supporting up to eight AMD Instinct MI325X or MI350X GPUs.

Unlike standard servers, these systems are engineered with direct-to-chip cooling headers that expose flow, temperature and pressure data directly through Redfish interfaces. (Redfish is a standard designed to deliver simple and secure management for converged systems, hybrid IT and software-defined data centers.) This empowers the agents to monitor and adjust cooling performance in real time.
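
To make “exposing data through Redfish interfaces” concrete, here’s a hedged sketch that polls a chassis Thermal resource over Redfish’s standard HTTP/JSON interface. The BMC address and credentials are placeholders, and the exact paths for coolant flow and pressure sensors vary by platform and firmware:

```python
# Sketch of telemetry polling over Redfish. The classic Thermal resource
# shown here is widely implemented; liquid-cooling sensors may live under
# vendor-specific or ThermalSubsystem paths instead.
import requests

BMC = "https://10.0.0.42"    # hypothetical BMC address
AUTH = ("admin", "password")  # use real credentials or session auth in practice

def read_temperatures(chassis_id: str = "1") -> dict[str, float]:
    url = f"{BMC}/redfish/v1/Chassis/{chassis_id}/Thermal"
    # verify=False only because this is a sketch; validate certs in production.
    resp = requests.get(url, auth=AUTH, verify=False, timeout=5)
    resp.raise_for_status()
    return {t["Name"]: t["ReadingCelsius"]
            for t in resp.json().get("Temperatures", [])}

print(read_temperatures())
```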

The combination of specific technologies creates what’s known as a Unified Computational Fabric. There, the AMD EPYC processors feed continuous Redfish data directly into the AMD Instinct GPUs via PCIe 5.0, eliminating I/O bottlenecks.

This synergy enables the platform to sustain real-time adaptive control loops across dozens of racks, and to do so quickly. It’s a capability that conventional air-cooled and CPU-based infrastructures can’t deliver.

Smart Racks

The autonomous cooling system was built on a distributed multi-agent architecture designed specifically for liquid-cooled AI environments. Unlike conventional systems, where monitoring is either centralized or based on human intervention, the solution places intelligence directly at the rack level.

In this setup, lightweight agents continuously monitor telemetry, interpret changes in flow and pressure, and coordinate rapid remediation actions across the data center. This creates a resilient, high-resolution control fabric that can respond to thermal events in milliseconds.
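
Metrum AI’s agent code isn’t public, so the following is only an illustrative sketch of the rack-level pattern described here: a lightweight loop that watches coolant telemetry and escalates when readings drift out of band. The thresholds and the remediation hook are invented:

```python
# Illustrative rack-level agent loop (not Metrum AI's implementation):
# watch coolant flow and pressure, escalate when either leaves its band.
import time

FLOW_MIN_LPM = 4.0        # hypothetical safe coolant flow, liters/minute
PRESSURE_MAX_KPA = 350.0  # hypothetical loop pressure ceiling

def remediate(reason: str) -> None:
    # A real system would throttle GPUs, raise pump speed, or alert;
    # here we just log the decision.
    print(f"REMEDIATE: {reason}")

def agent_loop(read_telemetry, interval_s: float = 0.5) -> None:
    while True:
        sample = read_telemetry()  # e.g. {"flow_lpm": 5.2, "pressure_kpa": 300.0}
        if sample["flow_lpm"] < FLOW_MIN_LPM:
            remediate(f"low coolant flow: {sample['flow_lpm']} L/min")
        if sample["pressure_kpa"] > PRESSURE_MAX_KPA:
            remediate(f"over-pressure: {sample['pressure_kpa']} kPa")
        time.sleep(interval_s)

# Example wiring with a stubbed sensor read:
# agent_loop(lambda: {"flow_lpm": 3.1, "pressure_kpa": 310.0})
```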

At the base of the stack, AMD ROCm supplies the core libraries, tools, compilers and runtimes for GPU-accelerated compute on AMD Instinct GPUs. And Kubernetes orchestration and the AMD GPU Operator enable containerized deployment, GPU scheduling, and lifecycle management at a multirack scale. (Kubernetes is an open-source system for automating the deployment, scaling and management of containerized applications.)

Above this layer, the AMD Enterprise AI Suite delivers higher-level services. The suite is a full stack of enterprise-ready AI software. The services it delivers include solution blueprints, an AI workbench, and a resource manager for unified model deployment, optimization and infrastructure governance.

Metrum AI extends these platform components into a specialized multiagent architecture. It supports real-time telemetry ingestion, large-model reasoning and autonomous cooling control.

Test Results: Fast Yet Stable

All that sounds good in theory, but does it really work?

To find out, the solution was tested by Metrum AI along two dimensions: telemetry ingestion throughput and large-model inference stability.

When monitoring a full deployment of 200 racks (1,000 servers), the system successfully processed more than 13,000 Redfish telemetry endpoints per minute. Simultaneously, it maintained over 8,000 tokens/second of multiagent large-model reasoning.
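
Some quick arithmetic puts those aggregate figures in per-rack and per-server terms:

```python
# Back-of-the-envelope on the published figures.
endpoints_per_min = 13_000
racks, servers = 200, 1_000

print(endpoints_per_min / racks)    # 65 endpoint reads per rack per minute
print(endpoints_per_min / servers)  # 13 endpoint reads per server per minute
print(8_000 * 60)                   # 480,000 reasoning tokens per minute, cluster-wide
```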

This demonstrated that as the infrastructure grew in complexity, the centralized coordination architecture did not become a bottleneck. The test also showed that every agent received real-time, high-resolution sensor context, regardless of facility size.

Across all benchmarks, the integrated solution demonstrated stable, real-time, end-to-end autonomous operation under data-center scale load.

So do you have customers who are eager to try liquid cooling, but concerned about the risks? If so, tell them about this new AI-powered solution from AMD, Supermicro and Metrum AI.

Looking for AI's ROI? Try purpose-fitting

Delivering an AI return on investment can be challenging. A new IDC white paper offers a solution: leverage infrastructure to the use case.

Companies can build a strong return on investment (ROI) for their AI projects—but only if they understand how to leverage different infrastructure solutions for different AI use cases. In other words, they need to know how to do purpose-fitting.

That’s the case argued in a new IDC white paper sponsored by AMD and Supermicro.

The paper’s two co-authors are Peter Rutten, research VP in IDC’s worldwide infrastructure research group and global research lead of the firm’s performance-intensive computing practice; and Madhumitha Sathish, research manager for high-performance computing at IDC and lead of the firm’s AI infrastructure research.

Rutten and Sathish find not all is well in the world of AI. In a survey conducted by IDC this past September, fewer than half of companies worldwide said their AI-related projects have delivered any measurable business outcomes. And only about one in 10 companies (11.4%) said they’re obtaining measurable business results from more than 75% of their AI projects.

What’s blocking AI progress? According to the IDC survey, these are the top reasons:

  • Competition for resources: cited by 34% of survey respondents
  • Resistance to process change: cited by 30%
  • Difficulty quantifying AI’s ROI: 28%
  • Regulatory uncertainty: also 28% (multiple responses were allowed)

“Cost continues to be a major hurdle,” the authors write.

And the biggest cost? Over 60% of companies surveyed by IDC said it’s the specialized infrastructure needed to develop and deploy AI.

Four Questions

It doesn’t have to be this way, the IDC authors argue. Instead, AI-using organizations can build a strong ROI for their projects with purpose-fitting.

To do this, managers should ask (and answer) four important questions:

  • Who decides what your relevant AI use case is? A separate IDC survey finds that fewer than one in three organizations involve IT during an AI initiative’s conceptual stage.
  • What kind of AI model do you need? There are many, including machine learning, GenAI, agentic AI, deep neural network, etc. Not all require major capital expenditures.
  • How will you obtain this AI model? Each approach involves trade-offs. For example, most businesses fine-tune or customize an existing commercial model. But this approach involves both licensing costs and training costs.
  • Have you considered the biggest factors that impact AI infrastructure needs? These factors include AI model types, number of parameters, volume of training data, query response times, and query size.

By taking these factors into account, the authors say, enterprises can develop AI options that match their AI use case, creating a purpose-built infrastructure solution.

Spectrum Choices

To contain AI infrastructure costs, the IDC authors recommend that managers develop what they call a “spectrum of options” based on seven factors: complexity, parameter count, data volume, model accuracy, time to value, query response latency, and query size.

When these factors are low or small, an AI project is in the blue zone, which implies lower costs. As these factors become higher or larger, the project moves into the green and red zones, which imply higher costs.

Hardware system requirements can vary by spectrum, too.

Blue zone projects, those with the lowest infrastructure costs, can be run on CPU-based, air-cooled systems, or even a PC or workstation.

Green zone projects, those with intermediate infrastructure costs, can run on systems powered by CPUs with built-in accelerators and lighter co-processors.

And red zone projects, those with the highest infrastructure costs, require rack-scale systems with high-end CPUs, GPUs and liquid cooling.
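
The white paper presents the spectrum qualitatively rather than as a formula. Purely as an illustration, here is one way the idea could be encoded, with invented scores and cutoffs:

```python
# Purely illustrative encoding of IDC's "spectrum" idea: rate each factor
# 1 (low) to 3 (high), then map the average onto blue/green/red.
FACTORS = ["complexity", "parameter_count", "data_volume", "model_accuracy",
           "time_to_value", "query_latency", "query_size"]

def zone(scores: dict[str, int]) -> str:
    avg = sum(scores[f] for f in FACTORS) / len(FACTORS)
    if avg < 1.7:
        return "blue (CPU-based, air-cooled, even a workstation)"
    if avg < 2.4:
        return "green (CPUs with built-in accelerators)"
    return "red (rack-scale CPU+GPU systems with liquid cooling)"

print(zone({f: 1 for f in FACTORS}))  # -> blue
print(zone({f: 3 for f in FACTORS}))  # -> red
```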

But wait, there’s more. The IDC authors point to several additional considerations:

  • Is there more than one AI use case in development? Typically, there are. If that’s the case, then that will need to be built into the needs projection.
  • How rapidly will the AI use case evolve over time? For example, if the number of users is projected to grow substantially, then the accounting must consider new infrastructure that will be required.
  • How often will the AI model require generational updates? Many models are constantly being improved, expanded and retrained, and these updates will deliver infrastructure impacts.

Better Together

The IDC authors say AI-using companies would do well to consider AMD-powered Supermicro systems. The two suppliers work with a vast ecosystem of partners to offer alternatives and options.

AMD and Supermicro demystify complexity, helping companies plan their AI projects faster and better. And they offer reliable, high-performance platforms that support AI workloads across a wide range of deployment scales.

“AMD and Supermicro,” the IDC authors write, “have developed some of the most versatile, powerful and well-tailored solutions available today.”

Tech Explainer: What’s an AI Factory?

Discover how AI factories work—and how your clients might benefit from building an AI factory of their own.

How can you tell that the AI Era is here? One way is by noticing that large enterprises are increasingly focused on mass producing AI models.

It’s no longer enough to have a decent set of working AI models to power Spotify’s suggestion engine or Accenture’s Big Data analytics.

To keep up with—and surpass—the Joneses, Spotify and Accenture will need dedicated systems that work every day to create, evaluate and iterate their AI models.

These systems are called AI factories. Somewhat like a factory that creates physical widgets, an AI factory churns out new and updated AI models. This continual AI production process helps enterprises react quickly to market demands and competition.

Make no mistake: The development of AI factories represents a turning point in the evolution of AI-powered business.

No. 2 with a Bullet

This theory is supported by some of IT’s top thinkers. They include Tom Davenport, a professor, speaker and author; and Randy Bean, a corporate advisor.

Davenport and Bean co-wrote an article that appeared earlier this month in the Sloan Management Review: Five trends in AI and data science for 2026. In their article, the authors place AI factories in the No. 2 spot. AI factories, they say, will be adopted by a range of users, including “all-in” AI adopters such as consumer products makers, banks and software companies.

As Davenport and Bean explain, an AI factory combines technology platforms, methods, data and previously developed algorithms to make building AI systems easy and fast. The authors’ all-important message: Watch this space.

How AI Factories Work

To fully understand the concept of an AI factory, it can help to think of the traditional smoke-belching, brick-and-mortar factories it’s named for.

Of course, there are some differences. A physical factory takes in raw materials, uses machines to process them, and produces physical products.

By contrast, an AI factory takes in data (such as text, audio, images and logs), runs that data through massive compute engines, and outputs AI models for recommendations, predictions, automation and generative content.

Another difference: Unlike the static products that emerge from traditional factories, the products of AI factories are virtual. They learn and grow as new data, infrastructure and techniques become available. In this way, AI factories help their organizations keep up with rapid changes and market shifts.

For instance, a new AI model produced by an enterprise’s AI factory can be continuously retrained as new data becomes available. While each new iteration deployed in the field busily suggests which Netflix movie to watch next, a newer version is constantly being developed in the background. When the new suggestion engine is ready, Netflix can seamlessly slide it into place.

Why Your Clients Probably Need an AI Factory

It’s good to understand the abstract benefits of an AI factory. But your clients will also want to know how building one can translate into business results.

Here’s the bottom line. An AI factory can:

  • Dramatically reduce the cost of business intelligence. Once an AI factory is built and a given AI model is trained, that model can run continuously, serving millions of decisions, predictions, etc., for a fraction of its initial cost. In other words, the cost per additional decision rapidly collapses toward zero. (For a toy illustration, see the sketch after this list.)
  • Help organizations maintain a decisive competitive advantage. This happens on two levels. First, maintaining a constant production stream of AI models and iterations helps your clients meet market demands as quickly as possible. And second, having that ability to react faster to customer needs and economic conditions can help create and sustain an advantage over competitors.
  • Turn data into capital. Many organizations are ill-equipped to analyze and monetize all the data they collect. All that piled-up data can seem like an albatross around their neck. But by building an AI factory, the organization can harness that otherwise squandered data and put it to work.
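
Here’s the toy amortization promised in the first bullet above: a fixed build-and-training cost spread over a growing number of served decisions. Every number below is invented; the shape of the curve is the point.

```python
# Toy amortization: one-time cost spread over served decisions.
build_cost = 2_000_000       # hypothetical one-time factory + training spend
cost_per_inference = 0.0004  # hypothetical marginal serving cost

for decisions in (1e5, 1e7, 1e9):
    total = build_cost + cost_per_inference * decisions
    print(f"{decisions:>13,.0f} decisions -> ${total / decisions:.4f} each")
# Per-decision cost falls from ~$20 toward the marginal serving cost.
```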

Further, companies that don’t build an AI factory could find themselves at a competitive disadvantage. Davenport and Bean, in their Sloan Management Review article, say companies that lack an AI factory will find building AI at scale both expensive and time-consuming.

Stumbling Blocks? A Couple

Building an AI factory isn’t always easy. Enterprises can run into serious roadblocks.

For one, siloed, inconsistent or low-trust data can make for a messy AI production process. As programmers say, “garbage in, garbage out.” In other words, if the data is messy, the analysis will be, too.

Talent bottlenecks can also wreak havoc on the virtual factory floor. There are only so many data scientists to go around, and they’re in high demand. Finding the right employees is a key component here—even in an age of super-smart robots.

Another trap your clients need to watch out for is bureaucratic hold-ups. Legal, compliance and trust issues can cause AI projects to grind to a halt.

The AI Factory Future

Like everything else in the fast-moving AI world, AI factories are changing. In the near future, AI factories will likely focus on the immediacy of real-time, always-on learning.

As AI factories shift to nearly continuous adaptation, enterprises will use their AI model updates to keep pace with rapidly changing market conditions and customer demands.

Another likely future is inferencing at the edge. For “edge,” think vehicles, devices and brick-and-mortar factories. Organizations that move inferencing closer to where data is created can lower system latency (that is, increase speed) and reduce cloud costs.

Another factor that could make a big impact on AI factories is new software and hardware integrations. A recent Supermicro webinar on AI factories and related technology showed how enterprises can benefit from integrating software platforms such as Supermicro’s SuperCloud Composer (SCC) and Power Asset Orchestrator (PAO).

Supermicro says this potent combination allows operators to gain total visibility into AI Factories. It can also optimize everything from GPU telemetry to real-time grid pricing.

Overall, it’s safe to assume that when these and other updates are deployed, AI factories will quickly become part of the common AI infrastructure. In so doing, they’ll touch nearly every aspect of our daily lives.

2025: Look Back at the Year’s Top Advances

Catch up on 2025’s highlights: ROCm 7.0, liquid-cooled AI servers, server processors for SMBs, and a MicroBlade server that’s highly efficient.

2025 was a year to remember. But in case you’ve forgotten, here are some of the year’s top advances.

ROCm for the AI Era

This past fall, AMD introduced version 7.0 of its ROCm software stack. This latest edition features capabilities designed especially for AI.

ROCm, part of AMD’s portfolio since 2016, translates code written by human programmers into instruction sets that AMD GPUs and CPUs can understand and execute.

Now AMD has purpose-built ROCm 7.0 for GenAI, large-scale AI training, and AI inferencing. Essentially, ROCm now offers the tools and runtime to make the most complex GPU workloads run efficiently.

The full ROCm 7.0 stack contains multiple components. These include drivers, a Heterogeneous Interface for Portability (HIP), math and AI libraries, compilers and system-management tools.

Liquid-Cooled AI Servers

Supermicro introduced two rackmount AI servers in June, both of them powered by AMD Instinct MI350 Series GPUs and dual AMD EPYC 9005 CPUs.

One of the two new servers, Supermicro model number AS-4126GS-NMR-LCC, is a 4U liquid-cooled system. This server can handle up to eight GPUs, the user’s choice of AMD’s Instinct MI325X or MI355X.

The other server, Supermicro model number AS-8126GS-TNMR, is a larger 8U system that’s air-cooled. It offers a choice of AMD GPUs, either the AMD Instinct MI325X or the AMD Instinct MI350X.

Both servers feature PCIe 5.0 connectivity; memory capacities of up to 2.3TB; support for AMD’s ROCm open-source software; and support for AMD Infinity Fabric Link connections for GPUs.

In June, Supermicro CEO Charles Liang said the new servers “strengthen and expand our industry-leading AI solutions—and give customers greater choice and better performance as they design and build the next generation of data centers.”

EPYCs for SMBs

In May, AMD introduced a CPU series designed specifically for small and medium businesses.

The processors, known as the AMD EPYC 4005 Series, bring a full suite of enterprise-level features and performance. But they’re designed for on-prem SMBs and cloud service providers who need cost-effective solutions in a 3U form factor.

“We’re delivering the right balance of performance, simplicity, and affordability,” says Derek Dicker, AMD’s corporate VP of enterprise and HPC. 

That balance includes the same AMD ‘Zen 5’ core architecture behind the AMD EPYC 9005 Series processors used in data centers run by large enterprises.

The AMD EPYC 4005 Series CPUs for SMBs come in a single-socket package. Depending on model, they offer anywhere from 6 to 16 cores and boost clock speeds of up to 5.7 GHz.

One model of the AMD EPYC 4005 line also includes integrated AMD 3D V-Cache technology for a larger 128MB L3 cache and lower latency.

MicroBlades for CSPs

The AMD EPYC 4005 Series processors made a star appearance in November, when Supermicro introduced a 6U, 20-node MicroBlade server (model number MBA-315R-1G) powered by the new CPUs.

These servers are intended for small and midsize cloud service providers.

Each blade is powered by a single AMD EPYC 4005 CPU. When 20 blades are combined in the system’s 6U form factor, the system offers 3.3x higher density than a traditional 1U server. It also reduces cabling by up to 95%, saves up to 70% space, and lowers energy costs by up to 30%.
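
Those density figures follow from simple arithmetic:

```python
# The density claim, worked out: 20 nodes in 6U versus one node per 1U.
nodes, height_u = 20, 6
microblade_density = nodes / height_u  # ~3.33 nodes per rack unit
traditional_density = 1 / 1            # 1 node per rack unit
print(round(microblade_density / traditional_density, 1))  # -> 3.3
```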

This MicroBlade system with an AMD EPYC 4005 processor is also available as a motherboard (model number BH4SRG) for use in Supermicro A+ servers.

~~~~~~~~~

Happy holidays from all of us at Performance Intensive Computing, and best wishes for the new year! We look forward to serving you in 2026.

~~~~~~~~~~

Check out Supermicro’s new AMD GPU-powered server—it’s air-cooled

Supermicro’s new 10U server is powered by AMD’s EPYC CPUs and Instinct MI355X GPUs. And it’s kept cool by nearly 20 fans.

What do you do if you need GPU power for AI and other compute-intensive workloads, but lack the infrastructure for liquid cooling?

Supermicro has the answer. The company just introduced a 10U server powered by AMD Instinct MI355X GPUs that’s air-cooled.

The new server, showcased at the recent SC25 conference in St. Louis, is Supermicro model AS-A126GS-TNMR.

Each server is powered by the customer’s choice of dual AMD EPYC 9004 or 9005 Series CPUs with up to 384 cores and 768 threads. The system also features a total of eight AMD Instinct MI355X onboard OAM GPU accelerator modules, which are air-cooled. (OAM is short for OCP Accelerator Module, an industry-standard form factor for AI hardware.) In addition, these accelerated GPU servers offer up to 6TB of DDR5 system memory.

While the systems are air-cooled with up to 19 heavy-duty fans, there’s no penalty in terms of cooling capacity. In fact, AMD has boosted the GPU’s thermal design power (TDP)—the maximum amount of heat a server’s cooling system can handle—from 1000W to 1400W.

Also, compared with the company’s air-cooled 8U server based on AMD Instinct MI350X GPUs, the 10U server offers a double-digit performance gain, according to Supermicro. For end users, that means faster data processing.

More Per Rack

The bigger picture: Supermicro’s new 10U option lets customers unlock higher performance per rack, with a choice of 10U air cooling or 4U liquid cooling, both powered by the latest AMD EPYC processors.

Supermicro’s GPU solutions are designed to offer maximum performance for AI and inference at scale. And they’re intended for use by both cloud service providers and enterprises.

Are your customers looking for a GPU-powered server that’s air cooled? Tell them about these new Supermicro 10U servers. And let them know that these systems are ready to ship now.

Tech Explainer: What’s new in AMD ROCm 7?

Learn how the AMD ROCm software stack has been updated for the era of AI.

While GPUs have become the digital engines of our increasingly AI-powered lives, controlling them accurately and efficiently can be tricky.

That’s why, in 2016, AMD created ROCm. Pronounced rock-em, it’s a software stack designed to translate the code written by programmers into sets of instructions that AMD GPUs can understand and execute perfectly.

If the GPUs in today’s cutting-edge AI servers are the orchestra, then ROCm is the sheet music being played.

AMD introduced the latest version, ROCm 7.0, earlier this fall. Version 7.0 is designed for the new world of AI.

How ROCm works

ROCm is a platform created by AMD to run programs on its GPUs, including its AI-focused Instinct MI350 Series accelerators. AMD calls the latest version, ROCm 7.0, an AI-ready powerhouse designed for performance, efficiency and productivity.

Providing that kind of capability takes far more than a single piece of software. ROCm is actually an expansive collection of tools, drivers and libraries.

What’s in the collection? The full ROCm stack contains:

  • Drivers that enable a computer’s operating system to communicate with any installed AMD GPUs.
  • The Heterogeneous Interface for Portability (HIP), a coding system for users to create and run custom GPU programs.
  • Math and AI libraries, including specialized tools for deep-learning operations, fast math routines, matrix multiplication and tensor ops. These AI building blocks are pre-built to help developers accelerate production.
  • Compilers that turn code into GPU instructions.
  • System-management tools that developers can use to debug applications and optimize GPU performance.

Help Me, GPU

The latest version of ROCm is purpose-built for generative AI and large-scale AI inferencing and training. While developers rely on GPUs for parallel processing, performing many tasks at once, GPUs are general-purpose processors. To achieve the best performance for AI workloads, developers need a software bridge that turns their high-level code into GPU-optimized instructions. That bridge is ROCm.

ROCm lets developers run AI frameworks that include PyTorch effectively on AMD GPUs. ROCm converts application code into instructions designed for the hardware. In this way, ROCm helps organizations improve performance, scale workloads across multiple GPUs, and meet increasing demand without sacrificing reliability.
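
Concretely, here’s what that bridge looks like from the developer’s side. On ROCm builds of PyTorch, the familiar torch.cuda API is backed by HIP, so CUDA-style code targets AMD GPUs unchanged. This minimal sketch assumes a ROCm build of PyTorch and a visible AMD GPU:

```python
# Minimal "PyTorch on AMD" check-and-run sketch.
import torch

print(torch.version.hip)          # set on ROCm builds, None on CUDA builds
print(torch.cuda.is_available())  # True when an AMD GPU is visible

x = torch.randn(4096, 4096, device="cuda")  # lands in GPU memory via HIP
y = torch.relu(x @ x)                        # kernels dispatched through ROCm
print(y.device)
```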
 
For demanding AI workloads such as those using Mixture of Experts (MoE) models, ROCm is essential for execution. MoE models activate only a small group of expert networks for each input, resulting in sparse workloads that are efficient, but hard to schedule. ROCm ensures that GPUs can perform these sparse operations at scale, maintaining high throughput and accuracy across clusters.
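
To make that sparsity concrete, here is a toy top-2-of-8 router in PyTorch. The dimensions and expert count are arbitrary, and production MoE kernels are far more optimized than this explicit loop:

```python
# Toy Mixture-of-Experts routing: each token activates only 2 of 8 experts,
# so most expert weights sit idle for any given input.
import torch
import torch.nn.functional as F

tokens, d_model, n_experts, k = 16, 64, 8, 2
x = torch.randn(tokens, d_model)
router = torch.nn.Linear(d_model, n_experts)
experts = torch.nn.ModuleList(
    torch.nn.Linear(d_model, d_model) for _ in range(n_experts))

# Per-token top-2 expert probabilities and indices (real routers
# typically renormalize the selected weights).
weights, chosen = F.softmax(router(x), dim=-1).topk(k, dim=-1)

out = torch.zeros_like(x)
for i in range(k):
    for e in range(n_experts):
        mask = chosen[:, i] == e  # tokens routed to expert e in slot i
        if mask.any():
            out[mask] += weights[mask, i, None] * experts[e](x[mask])
print(out.shape)  # torch.Size([16, 64])
```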
 
In other words, ROCm provides the tools and runtime to make even the most complex GPU workloads run efficiently. It connects AI developers with the hardware that supports their applications.
 
That’s important. While increased demand is what every enterprise wants, it still brings challenges that leave little room for mistakes.
 
Open Source Power

But wait, there's more. AMD ROCm has another clever trick up its sleeve: open-source integration.

By using popular open-source frameworks, ROCm lets enterprises and developers run large-scale inference workloads more efficiently. This open-source approach also empowers those same organizations and developers to break free of proprietary software and vendor-locked ecosystems.

Free from those dependencies, users can scale AI clusters by deploying commodity components instead of being locked into a single vendor’s hardware. Ultimately, that can lead to lower hardware and licensing costs.

This approach also empowers users to customize their AI operations. In this way, AI systems can be developed to better suit the unique requirements of an organization’s applications, environments and end users.

Another Layer

While ROCm serves the larger market, the recent release of AMD’s new Enterprise AI Suite shows the company’s commitment to developing tools specifically for enterprise-class organizations.

AMD says the new suite can take enterprises from bare-metal servers to enterprise-ready AI software in mere minutes.

To accomplish this, the suite provides four additional components: solution blueprints, inference microservices, AI Workbench, and a dedicated resource manager.

These tools are designed to help enterprises better scale their AI workloads, predict costs and capacity, and accelerate time-to-production.

Always Be Developing

Along with these product releases, AMD is being perfectly clear about its focus on AI development. At the company’s recent Financial Analyst Day, AMD CEO Lisa Su explained that over the last five years, the cost of AMD’s AI-related investments and acquisitions has topped $100 billion. That includes building up a staff of some 25,000 engineers.

Looking ahead, Su told financial analysts that AMD’s data-center AI business is on track to draw revenue in the “tens of billions of dollars” by 2027. She also said that over the next three to five years, AMD expects its data-center AI revenue to enjoy a compound annual growth rate (CAGR) of over 80%.

AMD’s roadmap points to updates that will focus on further boosts to performance, productivity and scalability. The company may accomplish these gains by offering more streamlined build and packaging systems, more optimized training and inferencing, and broader hardware support. It’s also reasonable to expect improved virtualization and multi-tenant support.

That said, if you want your speculation about future AI-centric ROCm improvements to be as accurate as possible, your best bet may be to ask an AI chatbot…powered by Supermicro and AMD, of course.

Tech Explainer: What’s liquid cooling? And why might your data center need it now?

Liquid cooling offers big efficiency gains over traditional air. And while there are upfront costs, for data centers with high-performance AI and HPC servers, the savings can be substantial. Learn how it works.

Increasingly resource-intensive AI workloads are creating more demand for advanced data center cooling systems. Today, the most efficient and cost-effective method is liquid cooling.

A liquid-cooled PC or server relies on a liquid rather than air to remove heat from vital components that include CPUs, GPUs and AI accelerators. The heat produced by these components is transferred to a liquid. Then the liquid carries away the heat to where it can be safely dissipated.

Most computers don’t require liquid cooling. That’s because general-use consumer and business machines don’t generate enough heat to justify liquid cooling’s higher upfront costs and additional maintenance.

However, high-performance systems designed for tasks such as gaming, scientific research and AI can often operate better, longer and more efficiently when equipped with liquid cooling.

How Liquid Cooling Works

For the actual coolant, most liquid systems use either water or dielectric fluids. Before water is added to a liquid cooler, it’s demineralized to prevent corrosion and build-up. And to prevent freezing and bacterial growth, the water may also be mixed with a combination of glycol, corrosion inhibitors and biocides.

Thus treated, the coolant is pushed through the system by an electric pump. A single liquid-cooled PC or server will need to include its own pump. But for enterprise data center racks containing multiple servers, the liquid is pumped by what’s known as an in-rack coolant distribution unit (CDU). Then the liquid is distributed to each server via a coolant distribution manifold (CDM).

As the liquid flows through the system, it’s channeled into cold plates that are mounted atop the system’s CPUs, GPUs, DIMM modules, PCIe switches and other heat-producing components. Each cold plate has microchannels through which the liquid flows, absorbing and carrying away each component’s thermal energy.

The next step is to safely dissipate the collected heat. To accomplish this, the liquid is pumped back through the CDU, which sends the now-hot liquid to a mechanism that removes the heat. This is typically done using chillers, cooling towers or heat exchangers.

Finally, the cooled liquid is sent back to the systems’ heat-producing components to begin the process again.
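
To put rough numbers on that loop: the heat a coolant stream carries away equals its mass flow times its specific heat times its temperature rise. A worked example with invented figures:

```python
# Rough sizing for one liquid-cooled server (flow and delta-T are assumed):
# heat removed Q = mass_flow * specific_heat * temperature_rise.
cp_water = 4186   # J/(kg*K), specific heat of water
flow_lpm = 8.0    # coolant flow in liters per minute (assumed)
delta_t = 15.0    # coolant outlet minus inlet temperature, in kelvin (assumed)

mass_flow = flow_lpm / 60 * 1.0  # kg/s (1 liter of water is ~1 kg)
q_watts = mass_flow * cp_water * delta_t
print(f"{q_watts:.0f} W")        # ~8,372 W, enough for a multi-GPU node
```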

 

Liquid Pros & Cons

The most compelling aspect of liquid cooling is its efficiency. Water moves heat up to 25 times better than air while using less energy to do it. Compared with traditional air cooling, liquid cooling can reduce cooling energy costs by up to 40%.

But there’s more to the efficiency of liquid cooling than just cutting costs. Liquid cooling also enables IT managers to move servers closer together, packing in more power and storage per square foot. Given the high cost of data center real estate, and the fullness of many data centers, that’s an important benefit.

In addition, liquid cooling can better handle the latest high-powered processing components. For instance, Supermicro says its DLC-2 next-generation Direct Liquid-Cooling solutions, introduced in May, can accommodate warmer liquid inflow temperatures while also improving AI performance per watt.

But liquid cooling systems have their downsides, too. For one, higher upfront costs can present a barrier for entry. Sure, data center operators will realize a lower total cost of ownership (TCO) over the long run. But when deploying a liquid-cooled data center, they must still contend with initial capital expense (CapEx) outlays—and justifying those costs to the CFO.

For another, IT managers might think twice about the additional complexity and risks of a liquid cooling solution. More components and variables mean more things that can go wrong. Data center insurance premiums may rise too, since a liquid cooling system can always spring a leak.

Driving Demand: AI

All that said, the market for liquid cooling systems is primed for serious growth.

As AI workloads become increasingly resource-intensive, IT managers are deploying more powerful servers to keep up with demand. These high-performance machines produce more heat than previous generations. And that creates increased demand for efficient, cost-effective cooling solutions.

How much demand? This year, the data center liquid cooling market is projected to generate global sales of $2.84 billion, according to MarketsandMarkets.

Looking ahead, the industry watcher expects the global liquid cooling market to reach $21.14 billion by 2032. If that happens, the rise will represent a compound annual growth rate (CAGR) of 33% over the forecast period.

Coming Soon: Immersive Cooling

In the near future, AI workloads will likely become even more demanding. This means data centers will need to deploy—and cool—ultra-dense AI server clusters that produce tremendous amounts of heat.

To deal with this extra heat, IT managers may need the next step in data center cooling: immersion.

With immersion cooling, an entire rack of servers is submerged horizontally in a tank filled with what’s known as dielectric fluid. This is a non-conductive liquid that ensures the server’s hardware can operate while submerged, and without short-circuiting.

Immersion cooling is being developed along two paths. The most common variety is called single-phase, and it operates similarly to an aquarium’s water filter. As pumps circulate the dielectric fluid around the servers, the fluid is heated by the server’s components. Then it’s cooled by an external heat exchanger.

The other type of immersion cooling is known as two-phase. Here, the system uses a dielectric fluid engineered to have a relatively low boiling point—around 50 C / 122 F. As the fluid is heated by the immersed server, it boils, creating a vapor that rises to condensers installed at the top of the tank. There, the vapor condenses back into liquid, which then drips back down into the tank.

This natural convection means there’s no need for electric pumps. It’s a glimpse of a smarter, more efficient liquid future, coming soon to a data center near you.
