Run HPC workloads faster with this AMD-Supermicro combo
Compared with a CPU-only system, a Supermicro system with AMD CPUs and GPUs delivered performance gains of up to 14x. In benchmark tests, some HPC workloads that previously required months of simulation time were completed in just days.
Are you or your customers looking to boost high-performance computing (HPC) workloads with greater performance, scalability and energy efficiency?
If so, then Supermicro and AMD have what you’re looking for. Together, they’ve demonstrated dramatic improvements in HPC workloads using a Supermicro system powered by AMD Instinct MI355X GPUs.
Compared with a CPU-only system, this AMD-Supermicro setup delivered performance gains of up to 14x. In benchmark tests, some HPC workloads that previously required months of simulation time were completed in just days.
That’s because HPC workloads — especially those that are compute-intensive — can benefit from GPU accelerators.
The massive parallelism of GPUs enables faster iteration and higher resolution modeling. And accelerators such as the AMD Instinct MI355X transform traditional CPU-bound HPC clusters into GPU-accelerated supercomputing platforms.
Booming Benchmarks
The improvements can be dramatic for both performance/watt and time-to-solution. But how dramatic? That's what AMD and Supermicro set out to discover.
To find out, they ran benchmarks on a liquid-cooled Supermicro 4U server powered by dual AMD EPYC CPUs and eight Instinct MI355X GPUs.
The AMD Instinct MI355X is a compelling solution for HPC applications, mainly because of its massive memory capacity (288GB of dedicated GPU memory), high double-precision (FP64) performance, and its support for the open-source AMD ROCm software ecosystem. These features enable it to handle extensive and complex scientific modeling, simulations and data-intensive tasks efficiently and at scale.
The benchmarks were generated using Chroma (software for lattice quantum field theory), Gromacs (software for molecular dynamics) and NAMD (molecular dynamics simulations). Here are the test workloads:
Chroma QUDA BICGSTAB Clover Solver: Fast lattice quantum chromodynamics
Gromacs ADH-Dodec: High-throughput simulation of small and medium biomolecular systems
Gromacs Cellulose-NVE: Biomolecular simulation of crystal structures
Gromacs STMV Virus: Biomolecular simulation of plant virus
NAMD large-scale MD: Molecular dynamics simulation with 1 million steps
Compared with the CPU-only system, results were delivered anywhere from 6x to 14x faster, depending on the benchmark.
Are your customers looking for that kind of HPC advantage? Then tell them about this Supermicro and AMD solution.
Need to cool AI hardware safely? Check out the new solution from AMD, Supermicro & Metrum AI
The solution employs AI agents to monitor liquid-cooling systems, identifying and remediating problems quickly.
Liquid cooling is great for controlling the temperature of hard-working AI servers, but the technology also has its risks. Even minor disruptions or fluctuations in a cooling system can quickly lead to massive hardware failures.
A solution to this challenge has been developed by AMD and Supermicro working with Metrum AI Inc., an Austin, Texas-based provider of industry-specific AI agents and AI evaluation products.
Their solution integrates Supermicro infrastructure, AMD computational power and ROCm software, and Metrum AI’s orchestration to deliver fast decisions that ensure safety at scale.
This solution enables multiple AI agents to collaboratively monitor signals, diagnose issues, predict failures, and coordinate corrective actions. The agents are embedded directly into a server's cooling control plane.
Essentially, this creates a data center that is adaptive, resilient, and self-optimizing. The solution should also support the massive compute intensity of next-generation AI workloads while proactively managing their thermal and physical risks.
And unlike traditional monitoring tools, this solution can actually predict and then prevent catastrophic hydraulic failures before they occur, and do so faster than human intervention ever could.
Power Features
To design these multi-agent systems, the team used AMD ROCm. This open-source software delivers important benefits that include flexibility, optimized libraries and seamless integration with AMD Instinct GPUs.
Another feature that made the solution possible is the massive memory reservoir of the AMD Instinct GPUs. For example, the AMD Instinct MI355X GPU has a dedicated memory of 288 GB. This lets large-scale reasoning models operate fully in-memory.
The structural foundation of this platform is the Supermicro 8U server (model AS-8126GS-TNMR), powered by dual AMD EPYC 9005 Series CPUs and supporting up to eight AMD Instinct MI325X or MI350X GPUs.
Unlike standard servers, these systems are engineered with direct-to-chip cooling headers that expose flow, temperature and pressure data directly through Redfish interfaces. (Redfish is a standard designed to deliver simple and secure management for converged systems, hybrid IT and software-defined data centers.) This empowers the agents to monitor and adjust cooling performance in real time.
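Here's a rough illustration of what that looks like in practice: a small Python sketch that polls a chassis's Redfish thermal resource. The BMC address, credentials and alert threshold are invented for illustration; real sensor names and schemas vary by system.

```python
import requests

# Illustrative Redfish polling sketch. The BMC address, credentials and
# threshold below are assumptions, not Supermicro's actual configuration.
BMC = "https://10.0.0.42"      # assumed BMC address
AUTH = ("monitor", "secret")   # assumed read-only credentials

def read_cooling_telemetry(chassis_id: str = "1") -> dict:
    """Fetch thermal readings for one chassis via the Redfish REST API."""
    url = f"{BMC}/redfish/v1/Chassis/{chassis_id}/Thermal"
    resp = requests.get(url, auth=AUTH, verify=False, timeout=5)
    resp.raise_for_status()
    data = resp.json()
    # Collapse the sensor list into {sensor name: reading in Celsius}.
    return {t["Name"]: t["ReadingCelsius"] for t in data.get("Temperatures", [])}

if __name__ == "__main__":
    for name, celsius in read_cooling_telemetry().items():
        if celsius > 45:   # assumed alert threshold
            print(f"ALERT: {name} at {celsius} C")
```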
This combination of technologies creates what's known as a Unified Computational Fabric. There, the AMD EPYC processors feed continuous Redfish data directly to the AMD Instinct GPUs via PCIe 5.0, eliminating I/O bottlenecks.
This synergy lets the platform quickly sustain real-time adaptive control loops across dozens of racks. It's a capability that conventional air-cooled, CPU-based infrastructures can't deliver.
Smart Racks
The autonomous cooling system was built on a distributed multi-agent architecture designed specifically for liquid-cooled AI environments. Unlike conventional systems, where monitoring is either centralized or based on human intervention, the solution places intelligence directly at the rack level.
In this setup, lightweight agents continuously monitor telemetry, interpret changes in flow and pressure, and coordinate rapid remediation actions across the data center. This creates a resilient, high-resolution control fabric that can respond to thermal events in milliseconds.
At the base of the stack, AMD ROCm supplies the core libraries, tools, compilers and runtimes for GPU-accelerated compute on AMD Instinct GPUs. Kubernetes orchestration and the AMD GPU Operator enable containerized deployment, GPU scheduling, and lifecycle management at multi-rack scale. (Kubernetes is an open-source system for automating the deployment, scaling and management of containerized applications.)
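To make that concrete, here's a minimal sketch of scheduling a GPU-backed workload the way the GPU Operator enables. The container image and namespace are hypothetical; the "amd.com/gpu" resource name is the one AMD's Kubernetes device plugin exposes.

```python
# Sketch: request one AMD GPU for a containerized agent on Kubernetes.
# The image name is hypothetical; "amd.com/gpu" is the resource name
# exposed by AMD's device plugin.
from kubernetes import client, config

config.load_kube_config()  # use the local kubeconfig

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="cooling-agent"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="agent",
                image="example.com/cooling-agent:latest",  # hypothetical image
                resources=client.V1ResourceRequirements(
                    limits={"amd.com/gpu": "1"}  # schedule onto one Instinct GPU
                ),
            )
        ],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```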
Above this layer, the AMD Enterprise AI Suite delivers higher-level services. The suite is a full stack of enterprise-ready AI software. Its services include solution blueprints, an AI workbench, and a resource manager for unified model deployment, optimization and infrastructure governance.
Metrum AI extends these platform components into a specialized multiagent architecture. It supports real-time telemetry ingestion, large-model reasoning and autonomous cooling control.
Test Results: Fast Yet Stable
All that sounds good in theory, but does it really work?
To find out, Metrum AI tested the solution along two dimensions: telemetry ingestion throughput and large-model inference stability.
When monitoring a full deployment of 200 racks (1,000 servers), the system successfully processed more than 13,000 Redfish telemetry endpoints per minute. Simultaneously, it maintained over 8,000 tokens/second of multiagent large-model reasoning.
This demonstrated that as the infrastructure grew in complexity, the centralized coordination architecture did not become a bottleneck. The test also showed that every agent received real-time, high-resolution sensor context, regardless of facility size.
Across all benchmarks, the integrated solution demonstrated stable, real-time, end-to-end autonomous operation under data-center scale load.
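As a back-of-the-envelope check, those published figures imply a modest per-server polling rate, computed here from only the numbers above:

```python
# Back-of-the-envelope check on the published test figures.
racks, servers = 200, 1_000
endpoints_per_min = 13_000

per_server_per_min = endpoints_per_min / servers   # ~13 reads per server
per_rack_per_min = endpoints_per_min / racks       # ~65 reads per rack
print(f"{per_server_per_min:.0f} reads/server/min; "
      f"{per_rack_per_min:.0f} reads/rack/min")
```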
So do you have customers who are eager to try liquid cooling, but concerned about the risks? If so, tell them about this new AI-powered solution from AMD, Supermicro and Metrum AI.
Delivering an AI return on investment can be challenging. A new IDC white paper offers a solution: match infrastructure to the use case.
Companies can build a strong return on investment (ROI) for their AI projects—but only if they understand how to leverage different infrastructure solutions for different AI use cases. In other words, they need to know how to do purpose-fitting.
That’s the case argued in a new IDC white paper sponsored by AMD and Supermicro.
The paper’s two co-authors are Peter Rutten, research VP in IDC’s worldwide infrastructure research group and global research lead of the firm’s performance-intensive computing practice; and Madhumitha Sathish, research manager for high-performance computing at IDC and lead of the firm’s AI infrastructure research.
Rutten and Sathish find not all is well in the world of AI. In a survey conducted by IDC this past September, fewer than half of companies worldwide said their AI-related projects have delivered any measurable business outcomes. And only about one in 10 companies (11.4%) said they’re obtaining measurable business results from more than 75% of their AI projects.
What’s blocking AI progress? According to the IDC survey, these are the top reasons:
Competition for resources: cited by 34% of survey respondents
Resistance to process change: cited by 30%
Difficulty quantifying AI’s ROI: 28%
Regulatory uncertainty: also 28% (multiple responses were allowed)
“Cost continues to be a major hurdle,” the authors write.
And the biggest cost? Over 60% of companies surveyed by IDC said it's the specialized infrastructure needed to develop and deploy AI.
Four Questions
It doesn’t have to be this way, the IDC authors argue. Instead, AI-using organizations can build a strong ROI for their projects with purpose-fitting.
To do this, managers should ask (and answer) four important questions:
Who decides what your relevant AI use case is? A separate IDC survey finds that fewer than one in three organizations involve IT during an AI initiative's conceptual stage.
What kind of AI model do you need? There are many, including machine learning, GenAI, agentic AI, deep neural network, etc. Not all require major capital expenditures.
How will you obtain this AI model? Each approach involves trade-offs. For example, most businesses fine-tune or customize an existing commercial model. But this approach involves both licensing costs and training costs.
Have you considered the biggest factors that impact AI infrastructure needs? These factors include AI model types, number of parameters, volume of training data, query response times, and query size.
By taking these factors into account, the authors say, enterprises can develop AI options that match their AI use case, creating a purpose-built infrastructure solution.
Spectrum Choices
To contain AI infrastructure costs, the IDC authors recommend that managers develop what they call a "spectrum of options" based on seven factors: complexity, parameter count, data volume, model accuracy, time to value, query response latency, and query size.
When these factors are low or small, an AI project is in the blue zone, which implies lower costs. As these factors become higher or larger, the project moves into the green and red zones, which imply higher costs, as shown in the IDC chart below.
Hardware system requirements can vary by spectrum, too.
Blue zone projects, those with the lowest infrastructure costs, can be run on CPU-based, air-cooled systems, or even a PC or workstation.
Green zone projects, those with intermediate infrastructure costs, can run on systems powered by CPUs with built-in accelerators and lighter co-processors.
And red zone projects, those with the highest infrastructure costs, require rack-scale systems with high-end CPUs, GPUs and liquid cooling.
But wait, there’s more. The IDC authors point to several additional considerations:
Is there more than one AI use case in development? Typically, there are. If that’s the case, then that will need to be built into the needs projection.
How rapidly will the AI use case evolve over time? For example, if the number of users is projected to grow substantially, then the accounting must consider new infrastructure that will be required.
How often will the AI model require generational updates? Many models are constantly being improved, expanded and retrained, and these updates will have infrastructure impacts.
Better Together
The IDC authors say AI-using companies would do well to consider AMD-powered Supermicro systems. The two suppliers work with a vast ecosystem of partners to offer alternatives and options.
AMD and Supermicro demystify complexity, helping companies plan their AI projects faster and better. And they offer reliable, high-performance platforms that support AI workloads across a wide range of deployment scales.
“AMD and Supermicro,” the IDC authors write, “have developed some of the most versatile, powerful and well-tailored solutions available today.”
Discover how AI factories work—and how your clients might benefit from building an AI factory of their own.
How can you tell that the AI Era is here? One way is by noticing that large enterprises are increasingly focused on mass producing AI models.
It’s no longer enough to have a decent set of working AI models to power Spotify’s suggestion engine or Accenture’s Big Data analytics.
To keep up with—and surpass—the Joneses, Spotify and Accenture will need dedicated systems that work every day to create, evaluate and iterate their AI models.
These systems are called AI factories. Somewhat like a factory that creates physical widgets, an AI factory churns out new and updated AI models. This continual AI production process helps enterprises react quickly to market demands and competition.
Make no mistake: The development of AI factories represents a turning point in the evolution of AI-powered business.
No. 2 with a Bullet
This theory is supported by some of IT’s top thinkers. They include Tom Davenport, a professor, speaker and author; and Randy Bean, a corporate advisor.
Davenport and Bean co-wrote an article that appeared earlier this month in the MIT Sloan Management Review: "Five trends in AI and data science for 2026." In their article, the authors place AI factories in the No. 2 spot. AI factories, they say, will be adopted by users and "all-in" AI adopters that include consumer products makers, banks and software companies.
As Davenport and Bean explain, an AI factory combines technology platforms, methods, data and previously developed algorithms to make building AI systems easy and fast. The authors’ all-important message: Watch this space.
How AI Factories Work
To fully understand the concept of an AI factory, it can help to think of the traditional smoke-belching, brick-and-mortar factories it’s named for.
Of course, there are some differences. A physical factory takes in raw materials, uses machines to process them, and produces physical products.
By contrast, an AI factory takes in data (such as text, audio, images and logs), runs that data through massive compute engines, and outputs AI models for recommendations, predictions, automation and generative content.
Another difference: Unlike the static products that emerge from traditional factories, the products of AI factories are virtual. They learn and grow as new data, infrastructure and techniques become available. In this way, AI factories help their organizations keep up with rapid changes and market shifts.
For instance, a new AI model produced by an enterprise’s AI factory can be continuously retrained as new data becomes available. While each new iteration deployed in the field busily suggests which Netflix movie to watch next, a newer version is constantly being developed in the background. When the new suggestion engine is ready, Netflix can seamlessly slide it into place.
Why Your Clients Probably Need an AI Factory
It’s good to understand the abstract benefits of an AI factory. But your clients will also want to know how building one can translate into business results.
Here’s the bottom line. An AI factory can:
Dramatically reduce the cost of business intelligence. Once an AI factory is built and a given AI model is trained, that model can run continuously, serving millions of decisions, predictions, etc., for a fraction of its initial cost. In other words, the cost per additional decision rapidly collapses toward zero. (A toy cost model follows this list.)
Help organizations maintain a decisive competitive advantage. This happens on two levels. First, maintaining a constant production stream of AI models and iterations helps your clients meet market demands as quickly as possible. And second, having that ability to react faster to customer needs and economic conditions can help create and sustain an advantage over competitors.
Turn data into capital. Many organizations are ill-equipped to analyze and monetize all the data they collect. All that piled-up data can seem like an albatross around their neck. But by building an AI factory, the organization can harness that otherwise squandered data and put it to work.
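To see why the cost per decision collapses, consider a toy amortization model. All dollar figures below are invented for illustration; they are not IDC or vendor numbers.

```python
# Toy amortization model: cost per decision falls toward the marginal
# serving cost as volume grows. All dollar figures are invented.
def cost_per_decision(capex: float, opex_per_year: float,
                      decisions_per_year: float, years: int = 3) -> float:
    total_cost = capex + opex_per_year * years
    return total_cost / (decisions_per_year * years)

for volume in (1e6, 1e8, 1e10):
    c = cost_per_decision(capex=2_000_000, opex_per_year=500_000,
                          decisions_per_year=volume)
    print(f"{volume:.0e} decisions/yr -> ${c:.6f} per decision")
```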
Further, companies that don’t build an AI factory could find themselves at a competitive disadvantage. Davenport and Bean, in their Sloan Management Review article, say companies that lack an AI factory will find building AI at scale both expensive and time-consuming.
Stumbling Blocks? A Couple
Building an AI factory isn’t always easy. Enterprises can run into serious roadblocks.
For one, siloed, inconsistent or low-trust data can make for a messy AI production process. As programmers say, “garbage in, garbage out.” In other words, if the data is messy, the analysis will be, too.
Talent bottlenecks can also wreak havoc on the virtual factory floor. There are only so many data scientists to go around, and they're in high demand. Finding the right employees is a key component here, even in an age of super-smart robots.
Another trap your clients need to watch out for is bureaucratic hold-ups. Legal, compliance and trust issues can cause AI projects to grind to a halt.
The AI Factory Future
Like everything else in the fast-moving AI world, AI factories are changing. In the near future, AI factories will likely focus on the immediacy of real-time, always-on learning.
As AI factories shift to nearly continuous adaptation, enterprises will use their AI model updates to keep pace with rapidly changing market conditions and customer demands.
Another likely future is inferencing at the edge. For “edge,” think vehicles, devices and brick-and-mortar factories. Organizations that move inferencing closer to where data is created can lower system latency (that is, increase speed) and reduce cloud costs.
Another factor that could make a big impact on AI factories is new software and hardware integrations. A recent Supermicro webinar on AI factories and related technology showed how enterprises can benefit from integrating software platforms such as Supermicro’s SuperCloud Composer (SCC) and Power Asset Orchestrator (PAO).
Supermicro says this potent combination gives operators total visibility into their AI factories. It can also optimize everything from GPU telemetry to real-time grid pricing.
Overall, it’s safe to assume that when these and other updates are deployed, AI factories will quickly become part of the common AI infrastructure. In so doing, they’ll touch nearly every aspect of our daily lives.
ROCm, part of AMD’s portfolio since 2016, translates code written by human programmers into instruction sets that AMD GPUs and CPUs can understand and execute.
Now AMD has purpose-built ROCm 7.0 for GenAI, large-scale AI training, and AI inferencing. Essentially, ROCm now offers the tools and runtime to make the most complex GPU workloads run efficiently.
The full ROCm 7.0 stack contains multiple components. These include drivers, the Heterogeneous-compute Interface for Portability (HIP), math and AI libraries, compilers and system-management tools.
One of the two new servers, Supermicro model number AS-4126GS-NMR-LCC, is a 4U liquid-cooled system. This server can handle up to eight GPUs, the user's choice of AMD Instinct MI325X or MI355X.
The other server, Supermicro model number AS-8126GS-TNMR, is a larger 8U system that's air-cooled. It also offers a choice of AMD GPUs: either the AMD Instinct MI325X or AMD Instinct MI350X.
Both servers feature PCIe 5.0 connectivity; memory capacities of up to 2.3TB; support for AMD’s ROCm open-source software; and support for AMD Infinity Fabric Link connections for GPUs.
In June, Supermicro CEO Charles Liang said the new servers “strengthen and expand our industry-leading AI solutions—and give customers greater choice and better performance as they design and build the next generation of data centers.”
EPYCs for SMBs
In May, AMD introduced a CPU series designed specifically for small and medium businesses.
The processors, known as the AMD EPYC 4005 Series, bring a full suite of enterprise-level features and performance. But they’re designed for on-prem SMBs and cloud service providers who need cost-effective solutions in a 3U form factor.
“We’re delivering the right balance of performance, simplicity, and affordability,” says Derek Dicker, AMD’s corporate VP of enterprise and HPC.
That balance includes the same AMD ‘Zen 5’ core architecture behind the AMD EPYC 9005 Series processors used in data centers run by large enterprises.
The AMD EPYC 4005 Series CPUs for SMBs come in a single-socket package. Depending on model, they offer anywhere from 6 to 16 cores and boost clock speeds of up to 5.7 GHz.
One model of the AMD EPYC 4005 line also includes integrated AMD 3D V-Cache technology for a larger 128MB L3 cache and lower latency.
These MicroBlade servers are intended for small and midsize cloud service providers.
Each blade is powered by a single AMD EPYC 4005 CPU. When 20 blades are combined in the system’s 6U form factor, the system offers 3.3x higher density than a traditional 1U server. It also reduces cabling by up to 95%, saves up to 70% space, and lowers energy costs by up to 30%.
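Those density numbers are easy to sanity-check from the figures given:

```python
# Sanity check: 20 single-socket blades in 6U vs. one server per 1U.
blades, chassis_height_u = 20, 6
blade_density = blades / chassis_height_u    # ~3.33 servers per rack unit
traditional_density = 1 / 1                  # one 1U server per rack unit
print(f"{blade_density / traditional_density:.1f}x higher density")  # ~3.3x
```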
This MicroBlade system with an AMD EPYC 4005 processor is also available as a motherboard (model number BH4SRG) for use in Supermicro A+ servers.
~~~~~~~~~
Happy holidays from all of us at Performance Intensive Computing, and best wishes for the new year! We look forward to serving you in 2026.
Each server is powered by the customer’s choice of dual AMD EPYC 9004 or 9005 Series CPUs with up to 384 cores and 768 threads. The system also features a total of eight AMD Instinct MI355X onboard OAM GPU accelerator modules, which are air-cooled. (OAM is short for OCP Accelerator Module, an industry-standard form factor for AI hardware.) In addition, these accelerated GPU servers offer up to 6TB of DDR5 system memory.
While the systems are air-cooled with up to 19 heavy-duty fans, there's no penalty in terms of cooling capacity. In fact, AMD has boosted the GPU's thermal design power (TDP)—the maximum amount of heat a server's cooling system can handle—from 1000W to 1400W.
Also, compared with the company's air-cooled 8U server based on AMD Instinct MI350X GPUs, the 10U server offers up to a double-digit performance increase, according to Supermicro. For end users, that means faster data processing.
More Per Rack
The bigger picture: Supermicro's new 10U option lets customers unlock higher performance per rack, with a choice of 10U air cooling or 4U liquid cooling, both powered by the latest AMD EPYC processors.
Supermicro’s GPU solutions are designed to offer maximum performance for AI and inference at scale. And they’re intended for use by both cloud service providers and enterprises.
Are your customers looking for a GPU-powered server that’s air cooled? Tell them about these new Supermicro 10U servers. And let them know that these systems are ready to ship now.
Tech Explainer: What’s liquid cooling? And why might your data center need it now?
Liquid cooling offers big efficiency gains over traditional air cooling. And while there are upfront costs, for data centers running high-performance AI and HPC servers, the savings can be substantial. Learn how it works.
Increasingly resource-intensive AI workloads are creating more demand for advanced data center cooling systems. Today, the most efficient and cost-effective method is liquid cooling.
A liquid-cooled PC or server relies on a liquid rather than air to remove heat from vital components that include CPUs, GPUs and AI accelerators. The heat produced by these components is transferred to a liquid. Then the liquid carries away the heat to where it can be safely dissipated.
Most computers don’t require liquid cooling. That’s because general-use consumer and business machines don’t generate enough heat to justify liquid cooling’s higher upfront costs and additional maintenance.
However, high-performance systems designed for tasks such as gaming, scientific research and AI can often operate better, longer and more efficiently when equipped with liquid cooling.
How Liquid Cooling Works
For the actual coolant, most liquid systems use either water or dielectric fluids. Before water is added to a liquid cooler, it’s demineralized to prevent corrosion and build-up. And to prevent freezing and bacterial growth, the water may also be mixed with a combination of glycol, corrosion inhibitors and biocides.
Thus treated, the coolant is pushed through the system by an electric pump. A single liquid-cooled PC or server will need to include its own pump. But for enterprise data center racks containing multiple servers, the liquid is pumped by what’s known as an in-rack cooling distribution unit (CDU). Then the liquid is distributed to each server via a coolant distribution manifold (CDM).
As the liquid flows through the system, it's channeled into cold plates that are mounted atop the system's CPUs, GPUs, DIMMs, PCIe switches and other heat-producing components. Each cold plate has microchannels through which the liquid flows, absorbing and carrying away each component's thermal energy.
The next step is to safely dissipate the collected heat. To accomplish this, the liquid is pumped back through the CDU, which sends the now-hot liquid to a mechanism that removes the heat. This is typically done using chillers, cooling towers or heat exchangers.
Finally, the cooled liquid is sent back to the systems’ heat-producing components to begin the process again.
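Under the hood, this loop is simple sensible-heat physics: the heat carried away equals the coolant's mass flow rate times its specific heat times the temperature rise. Here's a quick sketch with illustrative values, not vendor specs:

```python
# Sensible-heat sketch: Q = m_dot * c_p * delta_T.
# Flow rate and temperatures are illustrative, not vendor specs.
flow_lpm = 10.0                  # coolant flow, liters per minute (assumed)
m_dot = flow_lpm / 60 * 1.0      # kg/s (water: ~1 kg per liter)
c_p = 4186.0                     # specific heat of water, J/(kg*K)
t_in, t_out = 30.0, 45.0         # supply/return temperatures, C (assumed)

q_watts = m_dot * c_p * (t_out - t_in)
print(f"Heat removed: {q_watts/1000:.1f} kW")   # ~10.5 kW for this loop
```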
Liquid Pros & Cons
The most compelling aspect of liquid cooling is its efficiency. Water moves heat up to 25 times better than air while using less energy to do it. In comparison with traditional air cooling, liquid cooling can reduce cooling energy costs by up to 40%.
But there’s more to the efficiency of liquid cooling than just cutting costs. Liquid cooling also enables IT managers to move servers closer together, packing in more power and storage per square foot. Given the high cost of data center real estate, and the fullness of many data centers, that’s an important benefit.
In addition, liquid cooling can better handle the latest high-powered processing components. For instance, Supermicro says its DLC-2 next-generation direct liquid-cooling solutions, introduced in May, can accommodate warmer liquid inflow temperatures while also enhancing AI performance per watt.
But liquid cooling systems have their downsides, too. For one, higher upfront costs can present a barrier to entry. Sure, data center operators will realize a lower total cost of ownership (TCO) over the long run. But when deploying a liquid-cooled data center, they must still contend with initial capital expense (CapEx) outlays—and justifying those costs to the CFO.
For another, IT managers might think twice about the additional complexity and risks of a liquid cooling solution. More components and variables mean more things that can go wrong. Data center insurance premiums may rise too, since a liquid cooling system can always spring a leak.
Driving Demand: AI
All that said, the market for liquid cooling systems is primed for serious growth.
As AI workloads become increasingly resource-intensive, IT managers are deploying more powerful servers to keep up with demand. These high-performance machines produce more heat than previous generations. And that creates increased demand for efficient, cost-effective cooling solutions.
How much demand? This year, the data center liquid cooling market is projected to drive global sales of $2.84 billion, according to researcher MarketsandMarkets.
Looking ahead, the industry watcher expects the global liquid cooling market to reach $21.14 billion by 2032. That rise would represent a compound annual growth rate (CAGR) of 33% over the projected period.
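Those two figures are internally consistent, as a quick check of the math shows:

```python
# CAGR check on the market figures above: 2025 -> 2032 is 7 growth periods.
start, end, years = 2.84, 21.14, 7
cagr = (end / start) ** (1 / years) - 1
print(f"CAGR ~ {cagr:.1%}")   # ~33.2%, matching the cited 33%
```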
Coming Soon: Immersive Cooling
In the near future, AI workloads will likely become even more demanding. This means data centers will need to deploy—and cool—ultra-dense AI server clusters that produce tremendous amounts of heat.
To deal with this extra heat, IT managers may need the next step in data center cooling: immersion.
With immersion cooling, an entire rack of servers is submerged horizontally in a tank filled with what’s known as dielectric fluid. This is a non-conductive liquid that ensures the server’s hardware can operate while submerged, and without short-circuiting.
Immersion cooling is being developed along two paths. The most common variety is called single-phase, and it operates similarly to an aquarium’s water filter. As pumps circulate the dielectric fluid around the servers, the fluid is heated by the server’s components. Then it’s cooled by an external heat exchanger.
The other type of immersion cooling is known as two-phase. Here, the system uses a dielectric fluid engineered to have a relatively low boiling point of around 50°C (122°F). As this fluid is heated by the immersed server, it boils, creating a vapor that rises to condensers installed at the top of the tank. There the vapor condenses back into cooler liquid, which drips back down into the tank.
This natural convection means there’s no need for electric pumps. It’s a glimpse of a smarter, more efficient liquid future, coming soon to a data center near you.
Where & when: San Jose, California; Oct. 13-16, 2025
Who it’s for: This event, sponsored by the Open Compute Project (OCP), is for anyone interested in redesigning open source hardware to support the changing demands on compute infrastructure. This year’s theme: “Leading the future of AI.”
Who will be there: Speakers this year include Vik Malyala, senior VP of technology and AI at Supermicro; Mark Papermaster, CTO of AMD; Johnson Eung, staff growth product manager in AI at Supermicro; Shane Corban, senior director of technical product management at AMD; and Morris Ruan, director of product management at Supermicro.
Fun facts: AMD is a Diamond sponsor, and Supermicro is an Emerald sponsor.
Who it’s for: Developers of artificial intelligence applications and systems. Workshop topics will include developing multi-model, multi-agent systems; generating videos using open source tools; and developing optimized kernels.
Who will be there: Speakers will include executives from the University of California, Berkeley; Red Hat AI; Google DeepMind; and OpenAI. Also speaking will be execs from Ollama, an open source platform for AI models; Unsloth AI, an open source AI startup; vLLM, a library for large language model (LLM) inference and serving; and SGLang, an LLM framework.
Fun facts:
Supermicro is a conference sponsor.
During the conference, winners of the AMD Developer Challenge will be announced. The grand prize winner will take home $100,000.
AMD, PyTorch and Unsloth AI are co-sponsoring a virtual hackathon, the Synthetic Data AI Agents Challenge, on Oct. 18-20. The first-prize winners will receive $3,000 plus 1,200 hours of GPU credits.
Who it’s for: Anyone interested in the convergence of AI innovation and scalable infrastructure. This event is being hosted by Ignite, a go-to-market provider for the technology industry.
Who will be there: The speaker lineup is still TBA, but is promised to include enterprise technology leaders, AI and machine learning engineers, cloud and data center architects, venture capital investors, and infrastructure vendors.
Fun facts:
This is a hybrid event. You can attend either live or online.
Where & when: St. Louis, Missouri; Nov. 16-21, 2025
Who it’s for: The global supercomputing community, including those working in high performance computing (HPC), networking, storage and analysis. This year’s theme: “HPC ignites.”
Who will be there: The speaker lineup will feature nearly a dozen AMD executives, including Rob Curtis, a Fellow in Data Center Platform Engineering; Shelby Lockhart, a software system engineer; and Nuwan Jayasena, a Fellow in AMD Research. They and other speakers will appear in panels, paper presentations, workshops, tutorials and more.
Fun facts: SC25 will feature a series of noncommercial “Birds of a Feather” sessions that allow attendees to openly discuss topics of mutual interest.
Looking for business benefits from GenAI? Supermicro, AMD & PioVation have your solution
Struggling to deliver business benefits from Generative AI? Supermicro, AMD and PioVation have a new solution that not only works out of the box, but is also highly scalable.
Experimenting with Generative AI can be fun, but CEOs and corporate boards aren’t interested in fun. They want to see real business results—things like an enhanced customer experience, more innovative products, streamlined operations and lower TCO. And they want to see them now.
Getting GenAI to deliver these kinds of business results isn't easy. A recent report from MIT finds that despite nearly $40 billion worth of enterprise investment in GenAI, 95% of organizations are getting "zero return."
That estimate is based on solid numbers. The MIT researchers reviewed over 300 AI projects, conducted interviews with more than 50 organizations, and surveyed some 150 senior leaders.
The latest forecasts aren’t much cheerier. Research firm Gartner this summer predicted that by the end of this year, nearly a third of all GenAI projects (30%) will be abandoned after the proof-of-concept stage. Gartner says the projects will be cut due to poor data quality, inadequate risk controls, escalating costs and unclear business value.
“After last year’s hype, executives are impatient to see returns on GenAI investments,” says Gartner analyst Rita Sallam. “Yet organizations are struggling to prove and realize value.”
That’s About to Change
Supermicro, AMD and startup PioVation have partnered to jointly develop a GenAI solution that offers a pre-validated, turnkey infrastructure for deploying large language models (LLMs). The benefits include lower deployment overhead, enhanced observability, and ensured control of sovereign data.
Partner PioVation is a developer of AI platforms for enterprises, government agencies, and small and midsize businesses. Its products can be run either on-premises or in PioVation’s cloud in Munich, Germany. The company, founded in 2024 by former AMD executive Mazda Sabony, has formed partnerships with several companies, including AMD and Supermicro.
The GenAI solution being offered by the three companies has been designed to scale all the way from compact on-prem clusters up to large-scale multi-tenant cloud environments. And its architecture integrates Supermicro rack-level systems, AMD Instinct GPUs, and PioVation's agentic AI platform, PioSphere. The result, the companies say, is out-of-the-box agentic AI at any scale.
Full Stack
The Supermicro-AMD-PioVation offering is a full-stack solution. An autonomous microservice chains LLM prompts, invokes domain-specific tools, and integrates seamlessly with your existing systems via REST (an architectural style for distributed hypermedia systems), gRPC (a remote procedure call framework), or event streams running on the pre-validated Supermicro server powered by AMD Instinct GPUs.
Another feature is the solution’s Model Context Protocol (MCP). It lets agents interact with external tools in a way that’s both modular and composable. The MCP also governs how tools are registered, discovered, invoked and composed dynamically at runtime. This includes input/output serialization, maintaining execution context, and enforcing consistency across tool chains. MCP also enables context-aware tool usage, making every agent interoperable, auditable and enterprise-ready from the start.
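To illustrate the register/discover/invoke pattern described above, here's a minimal sketch. The names and structure are invented for illustration; this is not PioVation's actual MCP implementation.

```python
# Minimal tool-registry sketch of the register/discover/invoke pattern.
# Names and structure are illustrative, not PioVation's actual API.
from typing import Any, Callable

class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, Callable[..., Any]] = {}

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        """Make a tool discoverable by agents at runtime."""
        self._tools[name] = fn

    def discover(self) -> list[str]:
        """List the tools currently available to agents."""
        return sorted(self._tools)

    def invoke(self, name: str, **kwargs: Any) -> Any:
        """Invoke a registered tool; failing loudly keeps chains consistent."""
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return self._tools[name](**kwargs)

registry = ToolRegistry()
registry.register("lookup_order",
                  lambda order_id: {"id": order_id, "status": "shipped"})
print(registry.discover())                        # ['lookup_order']
print(registry.invoke("lookup_order", order_id="A-42"))
```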
The solution is available in three topologies, each designed for different operational scales and use cases:
MiniStack: For SMBs, pilots, research and the edge.
EdgeCluster: For regulated sites, branches and other locations where high availability is required.
Cloud Deployment: For cloud service providers (CSPs), enterprises and AI providers.
All three versions include a unified agent dashboard, role-based access control, and policy enforcement.
Business Benefits
The three partners haven’t forgotten about the need for GenAI to deliver real business results that can keep CEOs and corporate boards happy. To that end, the solution offers benefits that include:
Turnkey deployment: PioSphere’s Cloud OS has been prevalidated on the Supermicro platform powered by AMD GPUs.
Unified operations stack: A tightly integrated environment eliminates fragmented AI tooling.
No-code agent development: A PioVation feature known as AgentStudio lets nontechnical users design, deploy and iterate AI agents using a no-code interface.
Sovereign data control: Built-in controls support national and regional compliance frameworks, including Europe’s GDPR and the United States’ HIPAA.
Multi-tenant scalability: An organization can create separate, secure environments for different business units or clients, yet they’ll all share a common infrastructure footprint.
Integrated LLM operations and agent life-cycle management: Users can integrate any LLM published on the Hugging Face or Kaggle communities with one-click connectors (see the sketch after this list). Other built-in features include RAG (retrieval augmented generation) pipelines and full agent life-cycle tools.
Intelligent autoscaling: During workload spikes, the solution's dynamic autoscaling maintains efficient resource utilization, cost control and seamless performance.
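To give a flavor of what the LLM integration wraps, here's a minimal sketch using the open-source Hugging Face transformers library. The model name is just an example; PioSphere's one-click connectors abstract steps like this away.

```python
# Minimal sketch of loading a published LLM with Hugging Face's
# open-source transformers library. The model name is an example.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")   # example model
result = generator("Our Q3 infrastructure priorities are",
                   max_new_tokens=40)
print(result[0]["generated_text"])
```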
Put it all together, and you have a solution that goes far beyond mere experimentation. The three partners—Supermicro, AMD and PioVation—are serious about helping your GenAI projects deliver serious benefits for the business.
Deploy GenAI with confidence: Validated Server Designs from Supermicro and AMD
Learn about the new Validated Design for AI clusters from Supermicro and AMD. It can save you time, reduce complexity and improve your ROI.
The task of designing, building and connecting a server system that can run today’s artificial intelligence workloads is daunting.
Mainly because there are a lot of moving parts. Assembling and connecting them all correctly is not only complicated, but also time-consuming.
Supermicro and AMD are here to help. They've recently co-published a Validated Design document that explains how to build an AI cluster. The PDF also tells you how you can acquire an AMD-powered Supermicro AI cluster pre-built, with all elements connected, configured and burned in before shipping.
Full-Stack for GenAI
Supermicro and AMD are offering a fully validated, full-stack solution for today’s Generative AI workloads. The system’s scale can be easily adjusted from as few as 16 nodes to as many as 1,024—and points in between.
The solution's three core AMD parts (EPYC CPUs, Instinct GPUs and Pensando networking) are all integrated with Supermicro's optimized servers. That includes network cabling and switching.
The new Validated Design document is designed to help potential buyers understand the joint AMD-Supermicro solution’s key elements. To shorten your implementation time, the document also provides an organized plan from start to finish.
Under the Cover
This comprehensive report—22 pages plus a lengthy appendix—goes into a lot of technical detail. That includes the traffic characteristics of AI training, impact of large “elephant” flows on the network fabric, and dynamic load balancing. Here’s a summary:
Foundations of AI Fabrics: Remote Direct Memory Access (RDMA), PCIe switching, Ethernet, IP and Border Gateway Protocol (BGP).
Validated Design Equipment and Configuration: Server options that optimize RDMA traffic with minimal distance, latency and silicon between the RDMA-capable NIC (RNIC) and accelerator.
Scaling Out the Accelerators with an Optimized Ethernet Fabric: Components and configurations including the AMD Pensando Pollara 400 Ethernet NIC and Supermicro’s own SSE-T8196 Ethernet switch.
Design of the Scale Unit—Scaling Out the Cluster: Designs are included for both air-cooled and liquid-cooled setups.
Resource Management and Adding Locality into Work Placement: Covering the Simple Linux Utility for Resource Management (SLURM) and topology optimization, including the concept of rails (a small example follows this list).
Supermicro Validated AMD Instinct MI325 Design: Shows how you can scale the validated design all the way to 8,000 AMD MI325X GPUs in a cluster.
Storage Network Validated Design: Multiple alternatives are offered.
Importance of Automation: Human errors are, well, human. Automation can help with tasks including the production of detailed architectural drawings, output of cabling maps, and management of device firmware.
How to Minimize Deployment Time: Supermicro’s Rack Scale Solution Stack offers a fully integrated, end-to-end solution. And by offering a system that’s pre-validated, this also eases the complexity of multi-vendor integration.
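To ground the SLURM topology idea from the resource-management section, here's a small sketch that emits a tree-style topology.conf, the file SLURM's topology plugin reads so it can place jobs with switch locality in mind. All switch and node names are invented.

```python
# Sketch: emit a tree-style SLURM topology.conf so the scheduler can
# place jobs with leaf-switch locality in mind. Names are invented.
leaf_switches = {
    f"leaf{i}": f"node[{i*16 + 1:03d}-{(i + 1)*16:03d}]" for i in range(4)
}

lines = [f"SwitchName={sw} Nodes={nodes}" for sw, nodes in leaf_switches.items()]
lines.append("SwitchName=spine Switches=leaf[0-3]")   # spine ties leaves together

print("\n".join(lines))
# SwitchName=leaf0 Nodes=node[001-016]
# ...
# SwitchName=spine Switches=leaf[0-3]
```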
Total Rack Solution
Looking to minimize implementation times? Supermicro offers a total rack scale solution that’s fully integrated and end-to-end.
This frees the user from having to integrate and validate a multi-vendor solution. Basically, Supermicro does it for you.
By leveraging industry-leading energy efficiency, liquid and air-cooled designs, and global logistics capabilities, Supermicro delivers a cost-effective and future-proof solution designed to meet the most demanding IT requirements.
The benefits to the customer include reduced operational overhead, a single point of accountability, streamlined procurement and deployment, and maximum return on investment.
For onsite deployment, Supermicro provides a turnkey, fully optimized rack solution that is ready to run. This helps organizations maximize efficiency, lower costs and ensure long-term reliability. It includes a dedicated on-site project manager.