While GPUs have become the digital engines of our increasingly AI-powered lives, controlling them accurately and efficiently can be tricky.
That’s why, in 2016, AMD created ROCm. Pronounced rock-em, it’s a software stack designed to translate AI programmers’ code into instructions that AMD GPUs can understand and execute efficiently.
If the GPUs in today’s cutting-edge AI servers are the orchestra, then ROCm is the sheet music being played.
AMD introduced the latest version, ROCm 7.0, earlier this fall. Version 7.0 is designed for the new world of AI.
How ROCm works
ROCm is a platform created by AMD to run programs on its GPUs, including the AI-focused Instinct MI350 Series accelerators. AMD calls the latest version, ROCm 7.0, an AI-ready powerhouse designed for performance, efficiency and productivity.
Providing that kind of capability takes far more than a single piece of software. ROCm is actually an expansive collection of tools, drivers and libraries.
What’s in the collection? The full ROCm stack contains the following (a short code sketch after the list shows how these layers appear to a developer):
- Drivers that enable a computer’s operating system to communicate with any installed AMD GPUs.
- The Heterogeneous-computing Interface for Portability (HIP), a C++ runtime API and kernel language that developers use to create and run custom GPU programs.
- Math and AI libraries that provide pre-built building blocks such as deep-learning operations, fast math routines, matrix multiplication and tensor ops, so developers can reach production faster.
- Compilers that turn code into GPU instructions.
- System-management tools that developers can use to debug applications and optimize GPU performance.
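To make those layers concrete, here’s a minimal sketch (assuming a ROCm build of PyTorch, which exposes HIP through the familiar torch.cuda namespace) of how a developer might confirm the stack is in place:

```python
# Minimal sketch: checking that the ROCm stack is visible from PyTorch.
# Assumes a ROCm build of PyTorch; on such builds, the torch.cuda API is backed by HIP.
import torch

print("PyTorch version:", torch.__version__)
print("HIP/ROCm build:", torch.version.hip)        # None on CUDA- or CPU-only builds

if torch.cuda.is_available():                      # True when the driver sees an AMD GPU
    for i in range(torch.cuda.device_count()):
        print(f"GPU {i}:", torch.cuda.get_device_name(i))
else:
    print("No GPU visible; check the driver and ROCm installation.")
```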
Help Me, GPU
The latest version of ROCm is purpose-built for generative AI and large-scale AI inferencing and training. If an AI developer is using industry-standard software like PyTorch to train an AI model, they can rely on ROCm to act as a bridge between their high-level framework and the GPU(s) that will power the final AI application.
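To see what that bridge looks like in practice, here’s a hedged sketch: an entirely standard PyTorch training step, written against the generic "cuda" device, runs unchanged on a ROCm system because ROCm implements the same device target underneath.

```python
# Hedged sketch: a standard PyTorch training step with made-up sizes and data.
# On a ROCm build of PyTorch, "cuda" transparently targets the AMD GPU via HIP;
# nothing here is ROCm-specific.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

model = nn.Linear(128, 10).to(device)          # model weights live on the GPU
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 128, device=device)        # a synthetic batch for illustration
y = torch.randint(0, 10, (32,), device=device)

optimizer.zero_grad()
loss = loss_fn(model(x), y)                    # forward pass dispatches to GPU math libraries
loss.backward()                                # gradients are computed on the GPU
optimizer.step()
print("loss:", loss.item())
```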
Developers need that bridge because the GPUs they rely on for parallel processing (performing many tasks at once) are general-purpose devices built for an enormous marketplace. Without a translator, the GPU has no way of knowing how best to serve a particular framework, its developers and its end users.
In other words, ROCm bridges the gap between humans and hardware. It tells each GPU how to execute an application’s commands in the most effective and efficient way possible. This, in turn, helps organizations boost their performance and scale up their workloads as demand increases.
That’s important, because while increased demand is what every enterprise wants, it also brings challenges that leave very little room for mistakes.
ROCm helps developers meet those challenges. It does so by providing the capability to scale across clusters and serve models at massive throughput.
One way it does this is by optimizing for modern reasoning models and for a model architecture known as Mixture of Experts. Called MoE for short, this design splits a model into many specialized “expert” subnetworks and, for each input, activates only the few experts best suited to it, boosting efficiency without sacrificing accuracy. At scale, those experts can be distributed across multiple GPUs.
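Here’s a minimal sketch of the MoE idea in plain PyTorch. It is illustrative only, with hypothetical sizes and a deliberately naive routing loop, not AMD’s implementation: a small gating network scores the experts, and only the top-scoring ones actually run for each input.

```python
# Illustrative Mixture-of-Experts sketch (hypothetical sizes, not AMD's code).
# A gating network scores all experts per input; only the top-k experts execute.
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, dim=64, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.gate = nn.Linear(dim, num_experts)    # learns which experts suit an input
        self.top_k = top_k

    def forward(self, x):                          # x: (batch, dim)
        scores = self.gate(x).softmax(dim=-1)      # (batch, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for b in range(x.size(0)):                 # naive per-sample routing loop
            for k in range(self.top_k):
                expert = self.experts[idx[b, k].item()]
                out[b] += weights[b, k] * expert(x[b])
        return out

moe = TinyMoE()
print(moe(torch.randn(4, 64)).shape)               # torch.Size([4, 64])
```

In production deployments the experts are typically sharded across many GPUs, which is where ROCm’s cluster-scale communication and serving support earns its keep.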
Open Source Power
AMD ROCm has another clever trick up its sleeve: open-source integration.
By using popular open-source frameworks, ROCm lets enterprises and developers run large-scale inference workloads more efficiently. This open-source approach also empowers those same organizations and developers to break free of proprietary software and vendor-locked ecosystems.
Free from those dependencies, users can scale AI clusters by deploying commodity components instead of being locked into a single vendor’s hardware. Ultimately, that can lead to lower hardware and licensing costs.
This approach also empowers users to customize their AI operations. In this way, AI systems can be developed to better suit the unique requirements of an organization’s applications, environments and end users.
Another layer
While ROCm serves the larger market, the recent release of AMD’s new Enterprise AI Suite shows the company’s commitment to developing tools specifically for enterprise-class organizations.
AMD says the new suite can take enterprises from bare metal to enterprise-ready AI software in mere minutes.
To accomplish this, the suite provides four additional components: solution blueprints, inference microservices, AI Workbench, and a dedicated resource manager.
These tools are designed to help enterprises better scale their AI workloads, predict costs and capacity, and accelerate time-to-production.
Always Be Developing
Along with these product releases, AMD is being perfectly clear about its focus on AI development. At the company’s recent Financial Analyst Day, AMD CEO Lisa Su explained that over the last five years, the cost of AMD’s AI-related investments and acquisitions has topped $100 billion. That includes building up a staff of some 25,000 engineers.
Looking ahead, Su told financial analysts that AMD’s data-center AI business is on track to draw revenue in the “tens of billions of dollars” by 2027. She also said that over the next three to five years, AMD expects its data-center AI revenue to enjoy a compound annual growth rate (CAGR) of over 80%.
AMD’s roadmap points to updates that will likely focus on further boosting performance, productivity and scalability. The company might accomplish that by offering more streamlined build and packaging systems, more optimized training and inferencing, and broader hardware support.
It’s also reasonable to expect improved virtualization and multi-tenant support.
That said, if you want your speculation about future AI-centric ROCm improvements to be as accurate as possible, your best bet may be to ask an AI chatbot…powered by Supermicro and AMD, of course.
Do More:
- Visit the AMD ROCm 7 site
- Learn more about AMD ROCm 7.0
- Check out the new AMD Enterprise AI Suite

