Pioneering Multinode Heterogeneous Inference: MangoBoost and AMD, in Collaboration with OEM Partners, Sets Record Llama2-70B Performance in MLPerf Inference v5.1

September 09, 2025

by MangoBoost


Key Highlights

  • Record-breaking throughput: Highest MLPerf inference performance for Llama2-70B, achieving 169K tok/s in the closed division and 648K tok/s in the open division.
  • First-ever heterogeneous GPU deployment: Demonstrated near-linear performance scaling across multi-architecture clusters combining AMD MI300X and MI325X GPUs.
  • Multi-node MI355X performance: First and only third-party submission showcasing AMD Instinct™ MI355X GPUs in multi-node deployments.
  • Broad collaboration: Collaborated with AMD, Dell, and Supermicro to validate LLMBoost™ across diverse hardware platforms from various server vendors.
  • Beyond MLPerf Performance Leadership: LLMBoost™ delivers breakthrough results across diverse workloads — up to 186× faster than Ollama and 4× faster than vLLM on the Llama4-Scout MoE model, and up to 43.5× faster than vLLM on Qwen2.5 vision-text workloads with multi-image prompts.
  • Ease of deployment: One-line deployment supporting over 50 popular open-source models.
  • Interested in trying our LLMBoost software on AMD GPU servers? Register for our virtual demo.

    Pushing the Boundaries of LLM Inference

    MangoBoost is proud to announce groundbreaking results in the MLPerf Inference v5.1 benchmark. This round showcases how LLMBoost™, our enterprise-grade GenAI platform, unlocks the full potential of heterogeneous and multi-node GPU clusters.

    By combining software intelligence, optimized ROCm™ integration, and deep collaborations with AMD, Dell, and Supermicro, we continue to push the frontiers of performance, scalability, and cost efficiency for large-scale LLM inference.


    Industry Recognition

    The scale and innovation of MangoBoost’s submission have been recognized across the industry:

    David Kanter – Founder & Head of MLPerf, MLCommons: “I am thrilled to see MangoBoost’s creativity in pushing the limits of LLM serving with the first-ever heterogeneous results for MLPerf Inference. This submission integrates multiple generations of AMD GPUs with MangoBoost's LLMBoost platform to deliver impressive performance for Llama2-70B at 160K tokens/s, underscoring the power of software and system architecture in inference serving.”

    Meena Arunachalam – Fellow, AMD: “We are thrilled with our MLPerf Inference v5.1 co-submission with MangoBoost. Together, we set a new performance record for Llama2-70B using AMD Instinct™ MI355X GPUs, scaling to 4-node and 8-node configurations. MangoBoost also delivered the first heterogeneous multi-node submission showcasing the combined power of AMD Instinct MI300X and MI325X GPUs, along with strong multi-node results on Dell and Supermicro servers. With nine high-performance MLPerf inference submissions, MangoBoost’s LLMBoost GenAI, powered by AMD ROCm™, provides seamless scalability and easy deployment for enterprise AI workloads.”

    Frank Han – Distinguished Member of Technical Staff, Dell: “Dell's collaboration with MangoBoost is built on a foundation that began with MLPerf Training v5.0. In our official MLPerf Inference v5.1 co-submission (Dell_MangoBoost), MangoBoost’s LLMBoost software demonstrated impressive linear scaling across multi-node Dell PowerEdge server configurations with AMD accelerators.”


    LLMBoost™: GenAI Software for Any GPU, Any Scale

    LLMBoost™ delivers a turn-key, high-performance inference stack with broad compatibility and advanced optimizations:


    Core Features

  • One-line Deployment: Launch models instantly using Docker with OpenAI-compatible REST APIs.
  • Hybrid Flexibility: Ready for cloud, on-premise, or hybrid deployments for maximum control, security, and performance.
  • Cost Efficiency: Achieves up to 99.5% lower cost per million tokens vs. Ollama and 74.1% vs. vLLM while delivering superior performance (see the cost arithmetic sketched after this list).
  • Seamless Multi-Node Scaling: Supports tensor, pipeline, and data parallelism for efficient distributed inference on homogeneous or heterogeneous GPU clusters.
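
    To make the cost claim concrete, the arithmetic below shows how a cost-per-million-tokens figure is typically derived from an hourly server rate and sustained throughput. This is a minimal sketch; the rate and throughput are illustrative placeholders, not measured LLMBoost, vLLM, or Ollama numbers.

```python
# Back-of-the-envelope cost-per-million-tokens arithmetic.
# The hourly rate and throughput are hypothetical placeholders,
# not measured results from any framework.

def cost_per_million_tokens(hourly_cost_usd: float, throughput_tok_s: float) -> float:
    """USD cost to generate one million tokens at a sustained throughput."""
    tokens_per_hour = throughput_tok_s * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000

# Example: an 8-GPU node billed at $30/hour sustaining 20,000 tok/s.
print(f"${cost_per_million_tokens(30.0, 20_000):.3f} per 1M tokens")  # ~$0.417
```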
    Advanced Capabilities

  • Deep Metrics and Observability: Tracks over 500 built-in metrics across the AI stack for real-time visibility into GPU utilization, memory usage, latency, throughput, and queue states. Exports metrics in standard formats for Prometheus, Grafana, and Datadog integration (a scraping sketch follows this list).
  • Full-Stack Auto-Tuning (Patent Pending): Dynamically adjusts runtime configurations and model deployment strategies based on model architecture, hardware topology, and live traffic patterns to maximize performance.
  • Intelligent Orchestration: Supports dynamic scaling, load balancing, straggler mitigation, and automatic failover to ensure high availability and consistent latency during heavy workloads.
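
    As one example of how these metrics can be consumed, the sketch below scrapes a Prometheus text-format endpoint and filters GPU- and latency-related series. The URL and metric-name fragments are assumptions for illustration; consult the LLMBoost documentation for the actual exporter address and metric names.

```python
# Minimal sketch of scraping a Prometheus text-format metrics endpoint.
# The address below is a hypothetical placeholder, not a documented
# LLMBoost endpoint.
import requests

METRICS_URL = "http://localhost:9090/metrics"  # placeholder exporter address

resp = requests.get(METRICS_URL, timeout=5)
resp.raise_for_status()

for line in resp.text.splitlines():
    if line.startswith("#"):  # skip HELP/TYPE comment lines
        continue
    if "gpu" in line or "latency" in line:  # crude filter on series names
        print(line)
```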
    Deep Collaborations with Industry Leaders

    MangoBoost’s success in MLPerf Inference v5.1 is built on deep, strategic collaborations with industry leaders such as AMD, Dell, and Supermicro (listed alphabetically), alongside validation on Gigabyte platforms. These partnerships ensure LLMBoost™ is optimized, trusted, and deployed across a wide range of hardware ecosystems to unlock the full performance potential of AMD Instinct™ GPUs.

    Why These Collaborations Matter

    Advancing the frontier of LLM inference serving requires more than just software innovation — it depends on tight hardware-software co-design and validation across diverse platforms. By working directly with leading technology partners, MangoBoost ensures that LLMBoost™ is not only optimized for the latest AMD Instinct™ GPUs, but also proven on a variety of server architectures from major vendors. These collaborations demonstrate that enterprises can trust LLMBoost™ to deliver consistent, high-performance results on their infrastructure of choice, whether in single-node systems, multi-node clusters, or heterogeneous GPU deployments.


    MangoBoost Collaboration with AMD:
    Through co-engineering efforts with AMD, MangoBoost gained early access to next-generation AMD Instinct™ GPUs such as the MI355X, enabling us to tightly integrate LLMBoost™ with the ROCm™ software stack. This collaboration resulted in the first-ever third-party MLPerf submission on AMD’s flagship MI355X, achieving 648K tok/s on 64 GPUs and delivering record-setting performance in the open division with 8-node MI355X clusters. Read more on AMD’s perspective in their blog.

    MangoBoost Collaboration with Dell:
    In collaboration with Dell Technologies, MangoBoost validated LLMBoost™ on Dell PowerEdge servers, demonstrating reliable performance across enterprise-scale deployments. This collaboration included validation on multi-node MI300X clusters with near-linear scalability and further confirmed LLMBoost™’s enterprise-ready performance on Dell PowerEdge XE9680. Learn more from Dell’s blog.

    MangoBoost Collaboration with Supermicro:
    Working with Supermicro, MangoBoost validated LLMBoost™ across a wide range of single-node, multi-node, and heterogeneous GPU deployments. This included the first-ever heterogeneous GPU submission, combining MI300X nodes (Supermicro AS-8125GS-TNMR2) with MI325X nodes (Supermicro AS-8126GS-TNMR) and achieving near-linear scalability. The results demonstrated how LLMBoost™ can seamlessly orchestrate workloads across different GPU generations while maintaining consistent performance and efficiency.


    Highlights from MLPerf v5.1 Results

    The figure above summarizes the throughput (tokens/s) achieved by MangoBoost and our partners across homogeneous and heterogeneous configurations in MLPerf Inference v5.1.

  • Broad Coverage of Configurations: Ranging from single-node systems to 8-node clusters, demonstrating LLMBoost™’s scalability across diverse hardware platforms.
  • Linear Scaling Achieved: Consistent near-linear performance gains were observed as the number of nodes increased, even in heterogeneous setups combining MI300X and MI325X GPUs.
  • Record Performance with MI355X: The multi-node MI355X deployments — including 32× MI355X and 64× MI355X configurations — delivered record-breaking throughput, validating LLMBoost™’s optimizations with AMD’s latest GPUs.
  • Cross-Vendor Optimization: Submissions included deployments on Dell, Gigabyte, and Supermicro servers (listed alphabetically), ensuring consistent performance across hardware ecosystems.
    Below are some of the key highlights from this round:

    1. Record-Breaking Llama2-70B Results

    As a result of deep collaboration with AMD and full integration of the optimized ROCm software stack, our submission delivered 169K tokens/s throughput in the closed division, outperforming the next-best NVIDIA result by 35% and delivering, on average, 307% higher performance than all other submissions on the primary metric.

    In the open division, our joint submission with AMD scaled an 8-node MI355X cluster to an unprecedented 648K tokens/s, highlighting the scalability and efficiency of LLMBoost™ on the latest GPU architectures.

    2. First-Ever Heterogeneous GPU Deployment

    For the first time in MLPerf history, MangoBoost submitted results using heterogeneous GPU configurations, combining MI300X and MI325X GPUs. This innovative setup achieved 169K tokens/s with near-perfect scaling, proving LLMBoost™’s ability to efficiently orchestrate workloads across multiple GPU generations.

    This capability gives customers the flexibility to mix and match hardware, allowing them to integrate newer GPUs into their infrastructure without sacrificing performance, while optimizing for cost-efficiency.

    3. Near-Perfect Multi-Node Scaling

    MangoBoost’s submissions in MLPerf Inference v5.1 demonstrate exceptional scalability of LLMBoost™ across a wide range of multi-node configurations, from homogeneous clusters to large-scale heterogeneous deployments.

    With intelligent load balancing, optimized communication libraries, and full-stack auto-tuning, LLMBoost™ delivers up to 97% scaling efficiency, with an average of ~94% across all deployments; the arithmetic behind this metric is sketched below.
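
    For readers who want to reproduce the scalability math, the sketch below shows the conventional calculation: measured multi-node throughput divided by ideal linear scaling from a single-node baseline. The throughput figures are illustrative placeholders, not MLPerf-measured results.

```python
# Scaling efficiency = measured throughput / (single-node throughput * nodes).
# The numbers below are illustrative placeholders, not MLPerf results.

def scaling_efficiency(single_node_tok_s: float, n_nodes: int, measured_tok_s: float) -> float:
    ideal = single_node_tok_s * n_nodes
    return measured_tok_s / ideal

# Example: a 4-node cluster sustaining 3.88x the single-node throughput.
print(f"{scaling_efficiency(25_000, 4, 97_000):.1%}")  # 97.0%
```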


    Versatility Across Models and Modalities

    LLMBoost™ is engineered to deliver best-in-class performance across a wide range of workloads — from text-only deployments to highly complex vision-text (multi-modal) applications. Its ability to scale across different models and workloads ensures enterprise-grade reliability and efficiency for any GenAI deployment.

    1. Text-Only Models: Superior Throughput and Cost Efficiency

    For text-only LLM deployments, LLMBoost™ consistently outperforms competing solutions like vLLM and Ollama across multiple model families.

  • On the Llama4-Scout MoE model, LLMBoost™ achieves up to 186× faster performance than Ollama and 4× faster than vLLM, setting a new benchmark for throughput and cost efficiency.
  • These gains are powered by patent-pending optimizations, including full-stack autotuning and load-aware parallelization, ensuring peak utilization of GPU resources while maintaining predictable latency for production workloads.
  • This unmatched performance translates into significant cost savings, enabling enterprises to scale inference workloads at a fraction of the cost compared to existing frameworks.

    2. High-Performance Vision-Text Multi-Modal Deployments

    LLMBoost™ extends its performance and scalability leadership to multi-modal inference, powering complex vision-text applications such as image-grounded LLMs.

  • Throughput Leadership: On the Qwen2.5 vision-text model, LLMBoost™ delivers 4.7× faster throughput with single-image prompts and an astounding 43.5× improvement with multi-image prompts compared to vLLM.
  • Low Latency at Scale: Beyond throughput, time-to-first-token (TTFT) and end-to-end request latency remain low even as the number of images per prompt increases from 1 to 10, ensuring real-time responsiveness for interactive applications (see the TTFT measurement sketch below).
  • Scalability Under Load: As query volume increases, LLMBoost™ maintains stable throughput and low latency, demonstrating its ability to handle large batch sizes and high concurrency in multi-modal serving environments.
    Together, these results showcase LLMBoost™ as the preferred choice for vision-text workloads, offering a high-performance, scalable, and cost-efficient solution for enterprises deploying multi-modal AI systems at scale.
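
    For context on how TTFT can be measured against an OpenAI-compatible endpoint such as the one LLMBoost™ exposes, here is a minimal sketch using the openai Python client with streaming. The base URL, API key, and model id are placeholders; substitute whatever your deployment actually serves.

```python
# Sketch: measuring time-to-first-token (TTFT) and end-to-end latency
# against an OpenAI-compatible streaming endpoint. Base URL, key, and
# model id are placeholders, not documented LLMBoost values.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

start = time.perf_counter()
ttft = None
stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-7B-Instruct",  # example vision-text model id
    messages=[{"role": "user", "content": "Describe this scene in one sentence."}],
    stream=True,
)
for chunk in stream:
    if ttft is None and chunk.choices and chunk.choices[0].delta.content:
        ttft = time.perf_counter() - start  # first generated token arrived
end_to_end = time.perf_counter() - start

print(f"TTFT: {ttft:.3f}s, end-to-end: {end_to_end:.3f}s")
```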

    Easy Deployment

    Getting started with LLMBoost™ is as simple as selecting the model you want on Hugging Face and then running one command; a quick model-id sanity check is sketched below.
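
    As a quick sanity check before deployment, the model id can be verified against the Hugging Face Hub. This sketch uses the huggingface_hub client; the model id is an example, and the LLMBoost launch command itself is documented by MangoBoost rather than reproduced here.

```python
# Verify a Hugging Face model id exists before pointing a deployment at it.
# The model id is an example; any Hub repo id works.
from huggingface_hub import model_info

info = model_info("Qwen/Qwen2.5-7B-Instruct")
print(info.id, info.pipeline_tag)  # e.g. "Qwen/Qwen2.5-7B-Instruct text-generation"
```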

    Beyond Software: MangoBoost DPU Solutions

    Alongside LLMBoost™, MangoBoost accelerates infrastructure with DPU-powered hardware solutions:

  • Mango GPUBoost™: RDMA acceleration for multi-node training and inference.
  • Mango NetworkBoost™: TCP/IP stack offloading for optimized CPU utilization.
  • Mango StorageBoost™: High-performance NVMe stack for AI workloads.
    Try LLMBoost™ Today

    To experience the record-setting performance behind our MLPerf v5.1 results, register for our virtual demo.
