MangoBoost Demonstrates Record-Breaking Performance in AI Training Storage Solution with MLPerf® Storage v2.0
August 04, 2025
by MangoBoost
Key Takeaway:
In this MLPerf Storage v2.0 round, MangoBoost is proud to announce how our Mango StorageBoost™ addresses the scalability and performance challenges of AI training storage with its patent-pending DPU technology. Mango StorageBoost™ offloads infrastructure and data-processing tasks from the host to datacenter-deployable FPGA PCIe cards. By deploying our best-in-class NVMe/TCP Initiator on the host and NVMe/TCP Target on the storage server, we further push the performance of 3D-UNet training on both A100 and H100 configurations, making it the leading Ethernet-based storage system for AI training.
DPU-accelerated NVMe/TCP submission for MLPerf™ Storage Solutions
Modern high-performance computing applications, including large language models, are growing in complexity and size. This trend places significant demands not only on compute and memory, but also on storage systems, where I/O bottlenecks can leave GPUs underutilized. As models increase in size, the need for high-performance, high-capacity storage becomes even more critical. While technologies like NVMe/TCP enable remote block devices for ML systems and help meet capacity demands, these remote storage solutions often suffer performance degradation due to the CPU overhead of the NVMe/TCP software stack.
To address this performance bottleneck, the DPU-accelerated Mango StorageBoost™ Initiator and Mango StorageBoost™ Target fully offload NVMe/TCP to hardware, delivering line-rate performance and allowing remote storage to operate at speeds similar to a local high-performance NVMe SSD.
In this round, we set up our MLPerf Storage 2.0 training evaluation by connecting a host equipped with a Mango StorageBoost™ Initiator DPU to a storage server equipped with a Mango StorageBoost™ Target DPU through an Ethernet switch, with a 400G network link between the host and the storage server, as shown in Figure 1. The host then emulates the demand of ML workloads by running multiple copies of emulated A100 and H100 3D-UNet traces against the storage system, with the goal of finding the maximum number of GPUs the storage system can serve.
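As a rough illustration of this scaling methodology, the Python sketch below adds emulated GPUs until the run can no longer sustain the accelerator-utilization (AU) requirement that MLPerf Storage places on 3D-UNet training (treated here as 90%, per the published rules). The `simulated_au` model and its throughput and demand figures are placeholders for illustration only; they are not our measured results, and this sketch is not the MLCommons benchmark harness itself.

```python
# Illustrative sketch of the scaling methodology, not the MLCommons harness.
# All throughput/demand figures are placeholders, not measured results.

AU_THRESHOLD = 0.90  # AU floor assumed for 3D-UNet per the MLPerf Storage rules


def simulated_au(num_gpus: int,
                 storage_gb_s: float = 45.0,       # placeholder deliverable read throughput
                 per_gpu_demand_gb_s: float = 2.7  # placeholder per-GPU 3D-UNet read demand
                 ) -> float:
    """Toy stand-in for a real benchmark run: AU stays at 1.0 while the
    storage path keeps up, then degrades once aggregate demand exceeds it."""
    return min(1.0, storage_gb_s / (num_gpus * per_gpu_demand_gb_s))


def max_supported_gpus() -> int:
    """Add emulated GPUs until AU would fall below the threshold."""
    n = 1
    while simulated_au(n + 1) >= AU_THRESHOLD:
        n += 1
    return n


if __name__ == "__main__":
    print(f"max emulated GPUs sustaining {AU_THRESHOLD:.0%} AU: {max_supported_gpus()}")
```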
Figure 1: System configuration with MangoBoost DPUs
MangoBoost shows the best-performing Ethernet-based system result in MLPerf Storage v2.0 and provides near-local SSD performance.
Measuring the performance efficiency of a remote storage system targeting ML workloads can be done through two key metrics: the number of GPUs the storage system can support normalized by the number of clients, and the number of GPUs normalized by the network bandwidth. Figure 2 shows GPU scalability per client relative to other submitters in the Fabric-attached Block Storage category (both NVMe/TCP and NVMe/RDMA). These results show how MangoBoost's DPU solution delivers the best performance, 6.2x higher for 3D-UNet A100 and 1.25x-7.5x higher for 3D-UNet H100 than other submitters, and how offloading NVMe/TCP to hardware allows more GPUs to be connected to the storage system.
Figure 2: GPU scalability vs. other Fabric-based solutions (both NVMe/TCP and NVMe/RDMA)
Figure 3 shows performance measured as the number of GPUs that can be connected, normalized by the network bandwidth, for both NVMe/RDMA and NVMe/TCP submitters. This result highlights how efficiently our solution utilizes the available network bandwidth: MangoBoost solutions provide the best performance, 1.57x higher for 3D-UNet A100 and 1.27x-2.05x higher for 3D-UNet H100.
Figure 3: Plots showing bandwidth-normalized throughput (throughput per 400G of bandwidth) of Mango StorageBoost vs. other Fabric-attached solutions
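For clarity, the short sketch below spells out how these two normalizations are computed; the example inputs are hypothetical and do not correspond to any submitted result.

```python
# Minimal sketch of the two normalizations discussed above.
# Example inputs are hypothetical, not taken from any submission.


def gpus_per_client(supported_gpus: int, num_clients: int) -> float:
    """GPU scalability per client (the metric behind Figure 2)."""
    return supported_gpus / num_clients


def gpus_per_400g(supported_gpus: int, network_gbps: float) -> float:
    """GPU count normalized to 400 Gb/s of network bandwidth (Figure 3)."""
    return supported_gpus / (network_gbps / 400.0)


if __name__ == "__main__":
    # Hypothetical submission: 30 emulated GPUs served from 1 client over one 400G link.
    print(gpus_per_client(30, 1))    # 30.0 GPUs per client
    print(gpus_per_400g(30, 400.0))  # 30.0 GPUs per 400G of bandwidth
```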
Figure 4 provides a further performance breakdown across two of our MLPerf Storage submissions (MLPerf Storage 1.0 and MLPerf Storage 2.0), showing that the DPUs achieve near line-rate performance regardless of GPU architecture (A100 or H100).
Figure 4: Performance improvement of our submissions from MLPerf Storage 1.0 to MLPerf Storage 2.0 showing that our storage solutions achieve line-rate performance
The combination of the Mango StorageBoost™ Initiator DPU on the host and the Mango StorageBoost™ Target DPU on the storage server also allows ML workloads to operate on remote block storage at performance near that of local storage. This is highlighted in Figure 5 through the performance comparison of MangoBoost solutions against local SSD baselines. Our system deployed four Solidigm D7-PS1030 SSDs on the target, and we normalized the score by the number of SSDs to compare with other non-Fabric-attached submissions that used a single SSD.
Figure 5: Plots vs. other non-Fabric attached submitters (*Solidigm D7-PS1010 is submitted by Quanta Cloud Technology)
Industry-leading DPU storage solutions, delivering higher performance and lower TCO
While MangoBoost is the only submitter focusing on DPU solutions, this section compares Mango StorageBoost DPUs against the closest apples-to-apples DPU solution available on the market, the NVIDIA BlueField-3 400G DPU, using the same MLPerf Storage 2.0 methodology but with the MangoBoost DPUs replaced by BlueField-3, as shown in Figure 6.
Figure 6: System configuration with BF3 DPUs
To highlight our performance and cost-efficiency, we reran the MLPerf Storage 2.0 3D-UNet benchmarks using the same methodology as the official submission. Figure 7 compares the scores of BlueField-3 running NVMe/TCP and NVMe/RDMA with the host CPU handling the storage stack against MangoBoost's fully offloaded NVMe/TCP. The results demonstrate that Mango StorageBoost significantly outperforms BlueField-3 NVMe/TCP and even exceeds BlueField-3 NVMe/RDMA.
Figure 7: Performance vs. BF3 (MLPerf Storage v2.0 Training, 3D-UNet)
Additionally, we ran the Llama3-8B checkpointing workload on the same setup (this result is not included in v2.0; we plan to submit checkpointing results in the next round). Figure 8 compares checkpointing performance across different storage configurations. The setup used only one client and did not fully saturate the available bandwidth; because the I/O pattern of MLPerf Storage v2.0 checkpointing is not optimized for throughput, shorter storage latency is crucial for achieving higher throughput on a single client. The results demonstrate that Mango StorageBoost significantly outperformed BlueField-3 (NVMe/TCP) and achieved near-local-SSD performance.
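To illustrate why latency dominates in this single-client scenario, the sketch below times a synthetic sequence of fsync'ed checkpoint-shard writes; the mount path and shard sizes are placeholders, and this is not the MLPerf Storage checkpointing workload itself.

```python
import os
import time

# Placeholders: a mount backed by the remote block device, and synthetic shard sizes.
CHECKPOINT_PATH = "/mnt/remote_nvme/llama3_8b.ckpt"
SHARD_SIZE = 256 * 1024 * 1024  # 256 MiB per shard (placeholder)
NUM_SHARDS = 8


def write_checkpoint() -> float:
    """Write shards sequentially, fsync after each, and return GiB/s."""
    payload = os.urandom(SHARD_SIZE)
    start = time.perf_counter()
    with open(CHECKPOINT_PATH, "wb") as f:
        for _ in range(NUM_SHARDS):
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # durability barrier: storage-path latency shows up here
    elapsed = time.perf_counter() - start
    return (SHARD_SIZE * NUM_SHARDS) / elapsed / 2**30


if __name__ == "__main__":
    print(f"checkpoint write throughput: {write_checkpoint():.2f} GiB/s")
```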
Figure 8: Performance vs. BF3 (MLPerf Storage v2.0 Checkpointing, Llama3-8B)
This performance advantage directly translates into TCO reduction. Figure 9 shows the TCO reduction of the MangoBoost DPU compared to BlueField-3; as more storage is deployed, the TCO gap widens. For more details, please see our Mango StorageBoost technical whitepaper, which provides further analysis of throughput and cost breakdown.
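As a purely illustrative sketch of why the gap widens with scale, the toy model below multiplies a per-node saving (for example, fewer CPU cores provisioned for the storage stack) by the number of storage nodes; every figure in it is a placeholder, not the pricing behind Figure 9.

```python
# Toy capex-only model; every number is a placeholder, not real pricing.


def storage_capex(num_nodes: int, base_node_cost: float,
                  dpu_cost: float, storage_stack_cpu_cost: float) -> float:
    """Cost of a storage tier: per-node hardware plus the DPU and the CPU
    resources that must be provisioned for the NVMe/TCP storage stack."""
    return num_nodes * (base_node_cost + dpu_cost + storage_stack_cpu_cost)


if __name__ == "__main__":
    for nodes in (4, 16, 64):
        # Placeholder assumption: full offload lets each node provision fewer
        # CPU cores for the storage stack, so that term is smaller per node.
        host_stack = storage_capex(nodes, 20_000, 3_000, 6_000)
        full_offload = storage_capex(nodes, 20_000, 3_000, 2_000)
        print(f"{nodes:3d} nodes -> placeholder saving: {host_stack - full_offload:,.0f}")
```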
Figure 9: TCO analysis demonstrating our significant cost advantages over BF3
Mango StorageBoost™ Traditional Storage, JBoF, JBoD and EBoF Solutions.
Mango StorageBoost™ is a high-performance, vendor-agnostic, easy-to-use solution that allows AI systems running on any host and GPUs to scale their ML workloads.
Solution 1: Mango StorageBoost™- NVMe/TCP Initiator (NTI).
Mango StorageBoost™- NVMe/TCP Initiator (NTI) is a high-performance NVMe-oF initiator solution that unlocks the full potential of storage disaggregation. By offloading the entire NVMe/TCP stack into hardware, NTI delivers unmatched full-duplex line-rate performance with zero CPU consumption.
Mango StorageBoost™- NVMe/TCP Initiator (NTI) integrates seamlessly into NVMe/TCP initiator servers. By exposing the DPU as a standard NVMe PCIe device, existing storage systems can leverage NTI without any software modifications. Furthermore, NTI can connect to any NVMe/TCP target server over a standard TCP/IP network.
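Because the device appears to the host as an ordinary NVMe PCIe controller, standard tooling sees it like any local drive. The sketch below simply walks the standard Linux sysfs NVMe hierarchy to make that point; attribute availability varies by kernel version.

```python
from pathlib import Path


def list_nvme_controllers() -> None:
    """Print model and transport for every NVMe controller the host sees."""
    for ctrl in sorted(Path("/sys/class/nvme").glob("nvme*")):
        model = (ctrl / "model").read_text().strip() if (ctrl / "model").exists() else "?"
        transport = (ctrl / "transport").read_text().strip() if (ctrl / "transport").exists() else "?"
        # A DPU-backed remote namespace still reports "pcie" here, because the
        # NVMe/TCP termination happens on the card rather than in the host kernel.
        print(f"{ctrl.name}: model={model!r} transport={transport}")


if __name__ == "__main__":
    list_nvme_controllers()
```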
Solution 2: Mango StorageBoost™- NVMe/TCP Target (NTT).
Mango StorageBoost™- NVMe/TCP Target (NTT) is a groundbreaking NVMe-oF solution that powers the storage system with outstanding performance. It enables disaggregated storage servers to be connected to the network through the standard TCP/IP protocol stack. By fully offloading and accelerating both TCP/IP and NVMe-oF processing, it entirely eliminates CPU intervention from the I/O path.
On top of Mango NetworkBoost™ - TCP/IP, NTT decapsulates TCP packets into NVMe-oF commands and translates them into standard NVMe commands. By utilizing PCIe peer-to-peer technology, NTT sends those commands directly to the peer NVMe SSDs without any CPU consumption. Additionally, its modular and flexible architecture enables customer-driven storage features such as data reduction and reliability enhancements.
Solution 3: Mango StorageBoost™- GPU Storage Boost (GSB).
Mango StorageBoost™- GPU Storage Boost (GSB) is an essential software solution designed to dramatically enhance the efficiency of GPU data movement, enabling unprecedented performance capabilities. By establishing a direct data path for DMA (Direct Memory Access) transfers between GPU memory and storage, GSB eliminates the need for bounce buffers in host memory and bypasses CPU involvement entirely. It facilitates peer-to-peer communication within GPU systems, regardless of whether the storage is local or remote. Furthermore, GSB offers nearly identical syntax and semantics to traditional POSIX file APIs, making integration straightforward. This allows for efficient, filesystem-based data management with minimal adaptation, enabling users to harness the full power of their GPU systems without sacrificing ease of use.
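The sketch below is a hypothetical illustration of that POSIX-like call shape, not the actual GSB API: a plain open/pread/close sequence, which a GSB-style interface preserves while directing the DMA at GPU memory instead of a host bounce buffer. The file path and read size are placeholders.

```python
import os

# Placeholders: a training-sample path and read size for illustration only.
SAMPLE_PATH = "/mnt/dataset/case_00000_x.npy"
READ_SIZE = 4 * 1024 * 1024  # 4 MiB


def posix_style_read(path: str, offset: int, length: int) -> bytes:
    """Plain POSIX pread into host memory; a GSB-style API keeps this call
    shape while the destination becomes a DMA-registered GPU buffer."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, length, offset)
    finally:
        os.close(fd)


if __name__ == "__main__":
    data = posix_style_read(SAMPLE_PATH, 0, READ_SIZE)
    print(f"read {len(data)} bytes")
```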
When combined with Mango StorageBoost™- NVMe/TCP Initiator (NTI), GSB enables highly efficient GPU data movement, even over remote storage systems based on the TCP/IP transport layer, without any CPU involvement. Typically, handling TCP/IP protocols and the NVMe/TCP stack requires CPU intervention and memory copying, which introduces latency and reduces overall efficiency. NTI addresses these challenges by offloading these tasks to dedicated hardware, allowing the host to leverage the DMA engine as if it were directly connected to an NVMe PCIe device. NTI effectively removes the performance and efficiency constraints associated with CPU resources during remote storage access, while GSB eliminates similar limitations for GPU data movement. By utilizing both solutions together, users can unlock unparalleled synergies, dramatically improving data flow efficiency and reducing system bottlenecks.
With the three solutions combined, the MangoBoost DPU can be used not only for traditional remote block storage, but also to manage headless storage networks, including Just a Bunch of Flash (JBoF), Just a Bunch of Disks (JBoD), and Ethernet Bunch of Flash (EBoF), to further lower the TCO of high-performance storage systems.
Interested in how our DPU solutions can deliver higher performance and lower your TCO? Try them today by contacting us at contact@mangoboost.io!