As organizations implement AI for business, many are discovering that deploying and operating large language models (LLMs) at scale isn’t easy.
They’re finding that the hardware demands are intense, and so are the performance and cost trade-offs. And as AI workloads increasingly demand multi-node GPU clusters, orchestration and tuning become complex.
To address these challenges, Supermicro and MangoBoost Inc. are working together to deliver an optimized end-to-end GenAI stack. They’ve combined Supermicro’s robust AMD Instinct GPU server portfolio with MangoBoost’s LLMBoost software.
Meet MangoBoost
If you’re unfamiliar with MangoBoost, the company offers programmable solutions that improve data-center application performance while lowering CPU overhead. MangoBoost was founded three years ago; today it operates in the United States, Canada and South Korea.
MangoBoost’s core product is a data processing unit (DPU). It ensures full compatibility with general-purpose GPUs, accelerators and storage devices, enabling cost-efficient and standardized AI infrastructures.
MangoBoost also offers a ready-to-deploy, full-stack AI inference server. Known as Mango LLMBoost, it’s available from the Big Three cloud providers—AWS, Microsoft Azure and Google Cloud.
LLMBoost helps organizations accelerate both the training and the deployment of LLMs at scale. Why is this so challenging? Because once a model is ready for inference, developers face what’s known as a “productization tax.”
Integrating the machine-learning processing pipeline into the rest of the application often requires additional time and engineering effort. And this can lead to delays.
Mango LLMBoost addresses these challenges by providing an easy-to-use container. This lets LLM experts optimize their models, then select suitable GPUs on demand.
MangoBoost’s inference engine uses three forms of GPU parallelism, allowing GPUs to balance their compute, memory and network-resource usage. In addition, the software’s intelligent job scheduling optimizes cluster-wide GPU resources, ensuring that the load is balanced evenly across GPU nodes.
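To make the general idea concrete, here’s a toy NumPy sketch of one common scheme, tensor parallelism, in which a layer’s weight matrix is sharded across devices. This illustrates the technique in the abstract; it is not MangoBoost’s implementation, and the arrays here merely stand in for GPUs.

```python
# Toy sketch of tensor parallelism: a linear layer's weight matrix is
# split column-wise across "devices" (plain NumPy arrays standing in
# for GPUs). Each device computes a partial result from the same
# input, and the shards are concatenated at the end.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 512))      # a batch of activations
W = rng.standard_normal((512, 1024))   # full weight matrix

num_devices = 4
shards = np.split(W, num_devices, axis=1)  # one column shard per device

# Each "device" multiplies the shared input by its own shard.
partial_outputs = [x @ shard for shard in shards]

# Concatenating the partial outputs reproduces the single-device result.
y_parallel = np.concatenate(partial_outputs, axis=1)
y_single = x @ W
assert np.allclose(y_parallel, y_single)
```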
LLMBoost also ensures the effective use of low-latency GPU caches and high-bandwidth memory through quantization, which reduces the data footprint without lowering accuracy.
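As a rough illustration of the mechanics (not LLMBoost’s specific scheme), here’s a minimal sketch of symmetric INT8 quantization, which maps floating-point weights to 8-bit integers with a per-tensor scale:

```python
# Minimal sketch of symmetric INT8 quantization: weights are mapped to
# 8-bit integers via a per-tensor scale, cutting the memory footprint
# of FP32 weights by 4x (or FP16 by 2x).
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal(4096).astype(np.float32)

# Choose a scale so the largest magnitude maps to the INT8 range.
scale = np.abs(weights).max() / 127.0
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# Dequantize for use in computation; the error stays small relative
# to the weight magnitudes.
deq = q.astype(np.float32) * scale
print("bytes:", weights.nbytes, "->", q.nbytes)
print("max abs error:", np.abs(weights - deq).max())
```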
Complementing Hardware
MangoBoost’s LLMBoost software complements the powerful hardware with a full-stack, production-ready AI MLOps platform. It includes:
- Plug-and-play deployment: Pre-built Docker images and an intuitive command-line interface (CLI) help developers launch LLM workloads quickly.
- OpenAI-compatible API: Lets developers integrate LLM endpoints with minimal code changes (see the client sketch after this list).
- Kubernetes-native orchestration: Provides automated deployment and management of autoscaling, load balancing and job scheduling for seamless operation across both single- and multi-node clusters.
- Full-stack performance auto-tuning: Unlike conventional auto-tuners that handle model hyper-parameters only, LLMBoost optimizes every layer, from the inference and training back-ends to network configurations and GPU runtime parameters. This ensures maximum hardware utilization without requiring any manual tuning.
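Because the endpoint speaks the OpenAI API, existing client code needs little more than a new base URL. Here’s a minimal sketch using the official openai Python package; the base URL, API key and model name are placeholders for whatever a deployed LLMBoost endpoint actually exposes.

```python
# Minimal client sketch for an OpenAI-compatible endpoint, using the
# official openai Python package (v1+). The base_url, api_key and
# model name below are hypothetical placeholders, not values
# documented by MangoBoost.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # hypothetical local endpoint
    api_key="not-needed-for-local",       # placeholder credential
)

response = client.chat.completions.create(
    model="llama-2-70b",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize LoRA in one sentence."}],
)
print(response.choices[0].message.content)
```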
Proof of Performance
An optimized end-to-end GenAI stack from Supermicro and MangoBoost sounds good. But how does the combined solution actually perform?
To find out, Supermicro, AMD and MangoBoost recently tested their combined solution using real-world GenAI workloads. Here are the results:
- LLMBoost reduced training time by 40% for two-node training, down to 13.3 minutes on a dual-node AMD Instinct MI325X configuration. The workload was training Llama 2 70B, an LLM with 70 billion parameters, with LoRA (low-rank adaptation), a parameter-efficient fine-tuning method.
- LLMBoost achieved 1.96X higher throughput for multi-node inference on Supermicro AMD servers, reaching over 61,000 tokens/sec. on a dual-node AMD Instinct MI325X configuration.
- In-house LLM inference with Llama 4 Maverick and Scout models achieved near-linear scaling on AMD Instinct MI325X nodes. (Maverick is designed for fast responses at low cost; Scout, for long-document analysis.) This shows that Supermicro systems are ready for real-time GenAI deployment.
- Load balancing: The researchers ran LLaVA, an image-captioning model, on three setups. The heterogeneous dual-node configuration (eight AMD Instinct MI300X GPUs and eight AMD Instinct MI325X GPUs) achieved 96% of the combined throughput of the individual single-node runs. This demonstrates minimal overhead and high efficiency.
Are your customers looking for a turnkey GenAI cluster solution that’s high-performance, flexible and easy to operate? Then tell them that Supermicro, AMD and MangoBoost have their solution—and the proof that it works.
Do More:
- Read the solution brief: Supermicro GPU servers with MangoBoost LLMBoost AI enterprise software: Unlocking the full potential of AMD Instinct GPUs
- Check out MangoBoost
- Dig deeper on Mango LLMBoost