resources
blog
MangoBoost's Turn-Key Storage Solution for AI Training with MLPerf® Storage v1.0
September 30, 2024
by MangoBoost
Try it now. If you are interested in MangoBoost AI solutions, please contact us at contact@mangoboost.io.
[Fig. 1 Overview of AI Infrastructure with MangoBoost’s StorageBoost Technologies ]
Modern large language models and other high-performance computing applications have grown dramatically in both size and computational complexity. This growth puts additional pressure not only on compute and memory but also on the storage systems that feed these large-scale applications. MLPerf™ Storage was created to benchmark modern storage systems and show how well they support AI/ML workloads.
To address these scalability and performance challenges, MangoBoost's patent-pending DPU technology offloads infrastructure and data-processing tasks from the host onto datacenter-deployable FPGA PCIe cards. With our best-in-class NVMe/TCP Initiators deployed on AI servers, connected over standard Ethernet to NVMe/TCP Targets deployed on storage servers, we deliver the best-performing Ethernet-based storage system across all evaluated benchmarks on top-of-the-line NVIDIA H100 and A100 GPUs, as measured and published in the MLPerf™ Storage v1.0 results.
Fig. 1 shows an overview of an AI infrastructure using MangoBoost's DPU-accelerated storage. MangoBoost DPUs are deployed on both GPU servers and storage servers to accelerate the feeding of training data to the GPUs. We introduce three turn-key solutions for accelerating storage for AI training: Mango StorageBoost™ NVMe/TCP Initiator (NTI), Mango StorageBoost™ NVMe/TCP Target (NTT), and Mango StorageBoost™ GPU Storage Boost (GSB).
In the following sections, we showcase MangoBoost's MLPerf™ Storage v1.0 submission and then describe in more detail how these turn-key solutions are applied to our systems.
[Fig. 2: MLPerf™ Storage Overview]
Storage systems play a crucial role in AI training. A bottleneck in the storage system results in under-utilization of the GPUs: they sit idle while waiting for the I/O that reads training data from storage to complete. GPUs are an expensive resource, so keeping GPU utilization high is essential to minimizing the completion time of AI model training.
The MLPerf™ Storage benchmark is currently the best environment for evaluating storage performance for AI workloads. Fig. 2 shows an overview of MLPerf™ Storage. One of its notable features is "GPU emulation": from the perspective of storage-system evaluation, the GPU is simply an external processor that ingests data fetched from storage. Since the GPU's real computation is not necessary, MLPerf™ Storage emulates it with a "sleep" of the corresponding duration.
Additionally, MLPerf™ Storage provides many parameters that are used for emulating a variety of AI training workloads. Users can easily generate datasets with different size distributions and simulate various I/O patterns.
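To make the idea concrete, here is a minimal Python sketch of sleep-based GPU emulation: the storage reads are real, the compute step is replaced by a sleep, and the resulting utilization reflects how well the storage keeps the emulated GPU busy. This is not the actual DLIO/MLPerf™ Storage code; the paths, batch count, compute time, and the simplified utilization formula are all illustrative assumptions.

```python
import os
import time

# Illustrative placeholders, not MLPerf Storage's actual configuration.
DATA_DIR = "/mnt/training_data"
NUM_BATCHES = 100
EMULATED_COMPUTE_S = 0.05   # per-batch "GPU" time, replaced by a sleep


def run_emulated_epoch() -> float:
    io_time = 0.0
    compute_time = 0.0
    for i in range(NUM_BATCHES):
        path = os.path.join(DATA_DIR, f"batch_{i}.bin")
        start = time.perf_counter()
        with open(path, "rb") as f:     # the storage I/O is performed for real
            f.read()
        io_time += time.perf_counter() - start
        time.sleep(EMULATED_COMPUTE_S)  # the GPU computation is emulated
        compute_time += EMULATED_COMPUTE_S
    # Simplified utilization metric: fraction of wall time the emulated GPU
    # spends "computing" rather than waiting on storage.
    return compute_time / (compute_time + io_time)


if __name__ == "__main__":
    print(f"accelerator utilization: {run_emulated_epoch():.1%}")
```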
[Fig. 3: MLPerf™ Storage Evaluation Setup]
Fig. 3 presents our evaluation setup for MLPerf™ Storage. We utilize a single GPU server that accesses training data on remote SSDs over NVMe/TCP. The NVMe/TCP processing is fully offloaded to NTI to save CPU cycles and maximize training throughput on the GPU server. The storage servers deploy NTT as the counterpart to NTI, offloading the entire data path onto the DPU and using only a single CPU core to handle control-path operations. NTT can saturate a 200GbE line with a single card while significantly reducing the CPU utilization, and hence the power, spent on storage processing.
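As a rough sense of what saturating a 200GbE line means for training-data delivery, the back-of-envelope calculation below converts the link rate into usable storage bandwidth. The 5% protocol-overhead figure is an assumption for illustration, not a measured number.

```python
# Back-of-envelope bandwidth of a single 200GbE port (overhead figure is assumed).
LINE_RATE_GBPS = 200        # 200GbE link speed in gigabits per second
OVERHEAD = 0.05             # assumed Ethernet/TCP/NVMe-TCP framing overhead

raw_bytes_per_sec = LINE_RATE_GBPS * 1e9 / 8
usable_gb_per_sec = raw_bytes_per_sec * (1 - OVERHEAD) / 1e9
print(f"~{usable_gb_per_sec:.1f} GB/s of usable bandwidth per 200GbE card")  # ~23.8 GB/s
```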
To showcase the wide compatibility with standard servers and SSDs, we demonstrate our solution with two different types of storage servers: each is powered by either an Intel or an AMD CPU and equipped with SSD models from Samsung and Kioxia. Since both storage servers show almost identical performance, for simplicity we report results from only one of them in this blog.
In the following section, we showcase our MLPerf™ Storage v1.0 submission result, highlighting our top-ranked Ethernet category scores.
[Fig. 4: Relative Performance to Other MLPerf™ Storage Submissions - Ethernet Single-host Result]
Fig. 4 shows a direct comparison of our MLPerf™ Storage score against other Ethernet-based single-host submissions. The results show that the MangoBoost DPU-accelerated NVMe/TCP storage significantly outperforms other systems in AI training workloads. On the ResNet50 and UNET3D workloads, our system delivers speedups ranging from 1.2x to 4.7x, with an average of 2.7x. The CosmoFlow result is omitted because Nutanix and Hammerspace did not submit results for this workload.
[Fig. 5: Relative Performance to Other MLPerf™ Storage Submissions - Ethernet Multi-host Result]
Fig. 5 shows comparisons against Ethernet-based multi-host submissions. To make them comparable with our single-host result, we normalized competitors' numbers by the number of hosts used in their evaluations. As the figure shows, our system outperforms the others on the ResNet50 and UNET3D workloads by 1.4x to 7.8x, with an average of 4.3x.
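For clarity, the normalization behind Fig. 5 is simply a division by host count. The snippet below shows the calculation; the numeric values are made-up placeholders, not actual submission results.

```python
# Normalize a multi-host submission to a per-host figure so it can be compared
# against a single-host result. The example values are hypothetical.
def per_host_throughput(total_mib_per_s: float, num_hosts: int) -> float:
    return total_mib_per_s / num_hosts


print(per_host_throughput(40_000.0, 4))  # hypothetical 4-host total -> 10000.0 MiB/s per host
```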
[Fig. 6: Relative Performance to the Baseline]
Finally, we demonstrate the superior performance of our NVMe/TCP solution compared to systems using software NVMe/TCP, while achieving performance on par with local NVMe. Local NVMe represents an ideal system with minimal storage-access overhead, and software NVMe/TCP is the existing, readily available solution (albeit a slow one). Our DPU solution is the first to unlock the full power of NVMe/TCP.
Fig. 6 shows a performance comparison of our solution against the "Local NVMe" and "SW NVMe/TCP" baselines. Compared to SW NVMe/TCP, we show 2.5x and 1.5x performance improvements in UNET3D and ResNet50, respectively. SW NVMe/TCP is severely constrained by limited compute resources, since many CPU cycles are spent processing the TCP/IP stack for remote NVMe access.
We also achieved the same scores as "Local NVMe" in UNET3D and ResNet50. This is a significant achievement: closing the performance gap completely is not possible without the state-of-the-art full hardware acceleration of NVMe/TCP provided by our DPU. It also shows that the MangoBoost DPU enables true storage disaggregation with high-performance, datacenter-grade SSDs.
NVMe/TCP is widely recognized for its scalability and ease of deployment compared to RDMA-based protocols like RoCEv2, which require specialized network hardware such as RDMA network cards and switches [3, 4, 5, 6]. This makes NVMe/TCP a more accessible and flexible solution for many organizations. However, despite these advantages, NVMe/TCP has traditionally faced performance challenges due to higher overheads, making it less efficient for high-performance workloads like AI and machine learning.
MangoBoost addresses these limitations with a fully hardware-accelerated NVMe/TCP solution on both initiator and target-side servers. By leveraging our Data Processing Units (DPUs), MangoBoost is the first to overcome the inherent performance bottlenecks of NVMe/TCP, significantly enhancing both throughput and efficiency. This makes our NVMe/TCP solution an ideal choice for AI-driven applications, where high-performance storage is critical.
Our DPU solutions are designed to deploy seamlessly into existing infrastructure. From a hardware perspective, MangoBoost's DPU installs into any standard PCIe slot and connects via standard Ethernet QSFP ports, just like a typical network interface card (NIC). On the software side, the DPU is fully compatible with the NVMe/PCIe and NVMe/TCP protocols, requiring no modifications to the user's software environment.
Additionally, we offer a GPU Storage Boost solution. We co-presented this solution with AMD at SDC 2024 [7], showing end-to-end training workloads running on AMD MI300X GPUs accelerated by MangoBoost DPUs and dramatically improving overall system performance. This solution further enables peer-to-peer direct data communication between GPUs and remote storage (on top of NTI). It was not included in our submission, however, because the MLPerf Storage setup only emulates the GPU, so there is no real GPU hardware to communicate with peer-to-peer in this setup.
Mango StorageBoost™ - NVMe/TCP Initiator (NTI) is a high-performance NVMe-oF initiator solution that unlocks the full potential of storage disaggregation. By offloading the entire NVMe/TCP stack into hardware, NTI delivers unmatched full-duplex line-rate performance while significantly reducing CPU consumption.
NTI integrates seamlessly into any initiator server that connects over standard Ethernet, including AI GPU servers. Because the DPU is exposed as a standard NVMe PCIe device, existing storage systems can leverage NTI without any software modifications. Furthermore, NTI can connect to any NVMe/TCP target server over a standard TCP/IP network.
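Because the remote namespace shows up as an ordinary NVMe block device, unmodified applications access it exactly as they would a local SSD. The short sketch below illustrates this point only; the device name is an assumption that depends on the system, and reading a raw block device requires appropriate privileges.

```python
# Illustrative only: the remote NVMe/TCP namespace exposed by NTI appears as a
# regular NVMe block device, so no special API is needed on the host.
DEVICE = "/dev/nvme0n1"   # assumed device name; the actual name is system-dependent

with open(DEVICE, "rb") as dev:
    header = dev.read(4096)             # read the first 4 KiB like any ordinary file
print(f"read {len(header)} bytes from {DEVICE}")
```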
Mango StorageBoost™ - NVMe/TCP Target (NTT) is a groundbreaking NVMe-oF target solution that powers the storage system with outstanding performance. It enables disaggregated storage servers to be connected to the network through the standard TCP/IP protocol stack. By fully offloading and accelerating both the TCP/IP and NVMe-oF processing, it entirely eliminates CPU intervention from the I/O path.
NTT decapsulates NVMe-oF commands from the incoming TCP packets and translates them into standard NVMe commands. Using PCIe peer-to-peer technology, NTT delivers those commands directly to the attached NVMe SSDs without any CPU involvement. Additionally, its modular and flexible architecture enables more advanced, customer-driven storage features such as data reduction and reliability.
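As a conceptual illustration of that translation step, the sketch below is a software model of what the DPU performs in hardware; the structure and field names are illustrative assumptions, not MangoBoost's implementation.

```python
from dataclasses import dataclass


@dataclass
class NvmeOfReadCapsule:
    """Simplified view of an NVMe-oF read command carried inside NVMe/TCP."""
    namespace_id: int
    start_lba: int
    num_blocks: int


def capsule_to_nvme_command(capsule: NvmeOfReadCapsule) -> dict:
    """Translate the capsule into a standard NVMe read command.

    In NTT this translation happens in hardware, and the resulting command is
    submitted to the SSD over PCIe peer-to-peer, so the host CPU never touches it.
    """
    return {
        "opcode": 0x02,                     # NVMe Read
        "nsid": capsule.namespace_id,
        "slba": capsule.start_lba,
        "nlb": capsule.num_blocks - 1,      # NVMe encodes the block count as 0-based
    }


print(capsule_to_nvme_command(NvmeOfReadCapsule(namespace_id=1, start_lba=0, num_blocks=8)))
```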
Mango StorageBoost™ - GPU Storage Boost (GSB) is an essential software solution designed to dramatically enhance the efficiency of GPU data movement, enabling unprecedented performance capabilities. By establishing a direct data path for Direct Memory Access (DMA) transfers between GPU memory and storage, GSB eliminates the need for bounce buffers in host memory and bypasses CPU involvement entirely. It facilitates peer-to-peer communication within GPU systems, regardless of whether the storage is local or remote. Furthermore, GSB offers nearly identical syntax and semantics to traditional POSIX file APIs, making integration straightforward. This allows efficient, filesystem-based data management with minimal adaptation, enabling users to harness the full power of their GPU systems without sacrificing ease of use.
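To illustrate the POSIX-like claim, the sketch below contrasts a conventional read that stages data in host memory with what a GSB-style call might look like. The gsb module, its function names, and its arguments are purely hypothetical assumptions shown for shape only, not MangoBoost's published API.

```python
# Conventional path: the file read lands in a host (CPU) buffer, which the
# application must then copy into GPU memory.
with open("/mnt/dataset/shard-0000.bin", "rb") as f:   # hypothetical dataset path
    host_buf = f.read(1 << 20)                         # 1 MiB staged in host memory
# ... application copies host_buf into GPU memory ...

# GSB-style path (hypothetical API): the call mirrors the POSIX read above, but
# the data is DMA-ed directly into GPU memory, skipping the host bounce buffer.
# import gsb
# with gsb.open("/mnt/dataset/shard-0000.bin", "rb") as f:
#     f.read_into(gpu_buffer, 1 << 20, offset=0)
```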
When combined with Mango StorageBoost™ - NVMe/TCP Initiator (NTI), GSB enables highly efficient GPU data movement even to and from remote storage systems based on the TCP/IP transport layer, without any CPU involvement. Typically, handling the TCP/IP protocol and the NVMe/TCP stack requires CPU intervention and memory copying, which introduces latency and reduces overall efficiency. NTI addresses these challenges by offloading these tasks to dedicated hardware, allowing the host to use its DMA engine as if it were talking directly to a local NVMe PCIe device. NTI thus removes the performance and efficiency constraints associated with CPU resources during remote storage access, while GSB eliminates the analogous limitations for GPU data movement. Used together, the two solutions unlock unparalleled synergies, dramatically improving data-flow efficiency and reducing system bottlenecks.
If you are interested in NTI, NTT, or GPU Storage Boost, please email contact@mangoboost.io.
References
[1] https://mlcommons.org/2024/09/mlperf-storage-v1-0-benchmark-results/
[2] https://mlcommons.org/2023/06/introducing-the-mlperf-storage-benchmark-suite/
[3] Answering Your Questions: NVMe™/TCP: What You Need to Know About the Specification, https://nvmexpress.org/answering-your-questions-nvme-tcp-what-you-need-to-know-about-the-specification-webcast-qa/
[4] Pavilion compares RoCE and TCP NVMe over Fabrics performance, https://blocksandfiles.com/2018/08/16/pavilion-compares-roce-and-tcp-nvme-over-fabrics-performance/
[5] SRNIC: A Scalable Architecture for RDMA NICs, NSDI’23, https://www.usenix.org/conference/nsdi23/presentation/wang-zilong
[6] RDMA over Ethernet for Distributed Training at Meta Scale, SIGCOMM’24, https://dl.acm.org/doi/abs/10.1145/3651890.3672233
[7] Accelerating GPU Server Access to Network-Attached Disaggregated Storage using Data Processing Unit (DPU), https://www.sniadeveloper.org/events/agenda/session/666
Disclaimer
The performance claims in this document are based on our internal cluster environment. Actual performance may vary depending on the server configuration. Software and workloads used in performance tests may have been optimized for performance only on MangoBoost products. Performance results are based on testing as of the dates shown in the configurations and may not reflect all publicly available updates. Results based on pre-production systems and components, as well as results that have been estimated or simulated using the MangoBoost reference platform, are provided for informational purposes only. Results may vary based on future changes to any systems, components, specifications, or configurations. Statements in this document that refer to future plans or expectations are forward-looking statements. These statements are based on current expectations and involve many risks and uncertainties that could cause actual results to differ materially from those expressed or implied in such statements. MangoBoost does not guarantee any specific outcome. Nothing contained herein is, or shall be relied upon as, a promise, representation, or warranty as to the future performance of MangoBoost or any MangoBoost product. The information contained herein shall not be deemed to expand in any way the scope or effect of any representations or warranties contained in the definitive agreement for MangoBoost products.
The information contained herein may not be reproduced in whole or in part without prior written consent of MangoBoost. The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. MangoBoost assumes no obligation to update or otherwise correct or revise this information and MangoBoost reserves the right to make changes to the content hereof from time to time without any notice. Nothing contained herein is intended by MangoBoost, nor should it be relied upon, as a promise or a representation as to the future.
MANGOBOOST MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
© 2024 MangoBoost, Inc. All rights reserved.