
MangoBoost Demonstrates a Turn-Key LLM Inference Serving Solution on AMD Instinct™ MI300X GPUs with the MLPerf® Inference v4.1 Llama Benchmark

August 28, 2024

by MangoBoost


Key Points

  • New MLPerf® result on AMD Instinct™ MI300X GPUs. Our first MLPerf Inference v4.1 results on AMD MI300X GPUs achieve 23.5K tokens per second on the Llama-70B benchmark! [1]
  • The productization challenge: Going from the MLPerf inference benchmark to production deployment is no easy task. It demands significant effort, including bring-up, co-optimization, and validation of the full-stack AI inference serving software.
  • Solution 1: Mango LLMBoost™ — a ready-to-deploy, full-stack AI inference serving container. It includes MangoBoost’s inference serving software, and comes pre-packaged with everything needed to run optimized Large Language Models on any GPU, including the AMD MI300X.
    - Turn-key: Easily reproduce valid FP8 MLPerf Inference v4.1 Llama2-70B results on AMD MI300X systems. [6]
    - Production ready: The Mango LLMBoost™ package includes optimized server software for LLM inference, along with various industry-standard APIs for easy integration into your application.
    - High performance: MangoBoost demonstrates cutting-edge performance on AMD MI300X GPUs with various LLMs, including Llama-70B and Llama Guard-7B.
  • Solution 2: Mango WebBoost™ — a hardware-accelerated frontend gateway web server designed for GPU-accelerated AI inference clusters, delivering unparalleled throughput and ultra-low latency.
    - DPU accelerated: Mango WebBoost utilizes the MangoBoost Hardware Data Processing Unit (DPU) solution on the AMD Alveo™ U45N FPGA card to offload complete TCP/IP networking tasks with stateful processing, ensuring consistently low average and tail latency for reliable performance.
    - Nginx case study: Our solution has been successfully integrated and demonstrated with Nginx, one of the most widely used web servers. It can also be seamlessly integrated into other web servers without requiring modification.
  • Try it now! If you're interested in the above solutions, please contact us at contact@mangoboost.io

    Introduction

    Recent publications [1] and news [2] regarding MLPerf Inference v4.1 have included the first published results of systems with AMD’s MI300X GPUs on the Llama-70B benchmark. Our solution delivers state-of-the-art performance, computing 23.5K tokens per second using 8 MI300X GPUs on our custom servers [3].

    Beyond the MLPerf benchmark, deploying the MI300X in a production-grade AI inference serving solution requires additional capabilities, including support for diverse web API endpoints, multi-model serving, and integration hooks for cluster management. Additionally, a production inference serving cluster requires a front-end web server to efficiently manage multiple back-end GPU servers.

    MangoBoost introduces two turn-key solutions to address these challenges.

  • Mango LLMBoost — a ready-to-deploy, full-stack AI inference serving container. It includes MangoBoost’s inference serving software, and comes pre-packaged with everything needed to run optimized Large Language Models on any GPU, including the AMD MI300X.
  • Mango WebBoost — a hardware-accelerated frontend gateway web server designed for GPU-accelerated AI inference clusters, delivering unparalleled throughput and ultra-low latency.
    We detail these solutions in the following sections and report case studies that quantify their benefits.

    Solution 1:
    LLMBoost — Ready-to-deploy LLM inference software demonstrating industry-leading performance on AMD MI300X GPUs.

    MangoBoost’s LLMBoost AI inference server software is a fully optimized, turn-key solution, ready for deployment as a containerized software package. It delivers exceptional LLM inference performance on AI servers equipped with any GPU, including AMD’s MI300X accelerators. Purpose-built for the next generation of resource-hungry LLMs and the GPUs required to run them, LLMBoost leverages advanced parallelism strategies to accelerate a wide range of models — from Llama-70B to Deepseek-R1.

    Fig. 1: LLMBoost container - turn-key solution for LLM-serving based on vLLM

    Figure 1 highlights two key areas of optimization within LLMBoost: the inference engine and the model deployment. The inference engine leverages advanced GPU kernels to enable multidimensional parallelism, complemented by tuning of the ROCm software stack for expensive matrix multiplication (GEMM) kernels. Alongside these kernel optimizations, the inference engine utilizes advanced memory management techniques to reduce the cost of data transfer and eliminate memory fragmentation within the GPU. Additionally, LLMBoost offers automatic deployment optimization, in which it finds the optimal GPU deployment configuration to maximize performance without the need for manual user tuning.
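
    The automatic deployment optimization can be pictured as a search over parallelism layouts followed by a short calibration run for each candidate. The sketch below is a minimal illustration of that idea under assumed interfaces; the configuration space and the measure_tokens_per_sec callback are hypothetical, not LLMBoost's actual tuner.

```python
# Minimal sketch of deployment auto-tuning on one 8-GPU MI300X node:
# enumerate parallelism layouts and keep the one with the best measured
# throughput. Illustrative assumptions only.

NUM_GPUS = 8

def candidate_configs(num_gpus: int = NUM_GPUS):
    """Yield (tensor-parallel, data-parallel) splits that use every GPU."""
    for tp in (1, 2, 4, 8):
        if num_gpus % tp == 0:
            yield {"tensor_parallel": tp, "data_parallel": num_gpus // tp}

def autotune(measure_tokens_per_sec):
    """Run a short calibration for each candidate and return the fastest one."""
    return max(candidate_configs(), key=measure_tokens_per_sec)
```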

    Our inference server offers multiple endpoint options to ensure seamless integration with existing solutions. Synchronous and asynchronous REST APIs provide flexible request management, while WebSockets enable an efficient interface for streaming applications. Advanced dynamic scheduling algorithms intelligently allocate incoming requests to the most suitable GPU, maximizing efficiency across diverse deployment scenarios. Customers can fine-tune their deployments to prioritize throughput, latency, or a balanced hybrid approach, delivering a fully customizable inference server. This ensures consistently state-of-the-art performance on cutting-edge GPU hardware, meeting the diverse needs of a wide range of applications.
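
    As a concrete illustration of how a client might talk to such an endpoint, the snippet below sends a single completion request over the synchronous REST API. The URL, route, and JSON fields are assumptions made for illustration; the actual LLMBoost API may differ.

```python
import requests

# Hypothetical endpoint and payload; the exact route and JSON schema of the
# LLMBoost REST API may differ from this OpenAI-style example.
SERVER = "http://localhost:8000"

resp = requests.post(
    f"{SERVER}/v1/completions",
    json={
        "model": "llama2-70b",
        "prompt": "Summarize the MLPerf Inference v4.1 Llama2-70B benchmark.",
        "max_tokens": 128,
        "stream": False,  # streaming clients would use the WebSocket interface instead
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```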

    Our containers also expose standard Prometheus endpoints for performance monitoring, enabling seamless integration with existing Kubernetes deployments and supporting Horizontal Pod Autoscaling. Additionally, a single container can serve multiple models, offering users a high degree of flexibility to tailor the inference server to their specific requirements.
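
    Because the metrics follow the standard Prometheus text format, they can be scraped by any Prometheus-compatible collector or inspected directly. The snippet below is a minimal example of reading the endpoint by hand; the port and metric names are placeholders, not the documented LLMBoost metric schema.

```python
import requests

# Fetch the Prometheus text-format metrics exposed by the serving container.
# Port and metric names are illustrative placeholders.
metrics = requests.get("http://localhost:9090/metrics", timeout=5).text

for line in metrics.splitlines():
    # Lines look like: metric_name{label="value"} 42.0
    if line.startswith(("llm_requests_in_flight", "llm_generated_tokens_total")):
        print(line)
```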

    Fig. 2: Normalized throughput (tokens per second) comparison between vLLM and LLMBoost on a server with 8 AMD MI300X GPUs. LLMBoost’s advanced parallelism strategies deliver ~6x higher throughput than vLLM.

    Case Study: MLPerf Inference v4.1 Llama2-70B and Llama Guard-7B. We compared LLMBoost’s performance to vLLM, a popular open-source inference engine for LLMs [4][5]. For this evaluation, we selected two prominent LLMs: Llama2-70B and Llama Guard-7B. The tests were conducted on a high-performance Supermicro server featuring 8 AMD MI300X GPUs. Detailed server configurations are provided in the table below.

    GPU-Compute Server Configuration

    For the Llama2-70B model, we strictly followed the MLPerf Inference v4.1 Closed Division rules to ensure the validity and integrity of the results on the LLMBoost configuration with all 8 GPUs active. This setup established a consistent and rigorous foundation for comparison. To maintain fairness, identical conditions were applied to all subsequent experiments.

    In the MLPerf Inference v4.1 Llama-70B test case, we achieved a performance of 22.8K tokens per second, which is within 3% of AMD's published MLPerf result of 23.5K tokens per second. It's worth noting that our system uses a different CPU than AMD's official MLPerf submission, which may explain the slight performance difference. As the MLPerf Inference Benchmark does not yet provide validation rules for Llama Guard, we applied the same configuration used for Llama2-70B to maintain consistency in our testing methodology.

    As illustrated in Figure 2, LLMBoost achieved 5.2x to 6.0x higher throughput over vLLM, driven by three key features: (1) Multi-dimensional parallelism, (2) dynamic scheduling across the 8 different GPUs, and (3) a lightweight, streamlined interface. This significant enhancement highlights LLMBoost's capability to accelerate inference workloads, establishing it as an essential tool for optimizing large-scale AI deployments.
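
    To make the dynamic-scheduling idea concrete, the sketch below shows one simple policy: route each incoming request to the GPU worker with the fewest outstanding requests. This is an illustrative toy under assumed behavior, not LLMBoost's actual algorithm.

```python
import heapq

class LeastLoadedScheduler:
    """Toy dispatcher: send each request to the GPU worker with the fewest
    outstanding requests. Illustrative only; a production scheduler would
    also consider batching, sequence lengths, and latency targets."""

    def __init__(self, num_workers: int = 8):
        # Min-heap of (outstanding_requests, worker_id).
        self._load = [(0, w) for w in range(num_workers)]
        heapq.heapify(self._load)

    def submit(self) -> int:
        load, worker = heapq.heappop(self._load)
        heapq.heappush(self._load, (load + 1, worker))
        return worker  # caller forwards the request to this GPU

    def complete(self, worker: int) -> None:
        # Decrement the finished worker's outstanding-request count.
        self._load = [(l - 1 if w == worker else l, w) for l, w in self._load]
        heapq.heapify(self._load)
```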

    Solution 2:
    WebBoost — Accelerating the Nginx Web Server with MangoBoost's Stateful TCP/IP Hardware Acceleration on the AMD U45N FPGA Card

    WebBoost is a web-serving container designed to run Nginx on front-end servers equipped with hardware acceleration powered by MangoBoost's DPU (Data Processing Unit) solution. The solution supports multiple FPGA cards; in this post, we demonstrate it on the AMD Alveo™ U45N. The LLMBoost and WebBoost containers are engineered to work seamlessly together, enabling coordinated operations to stream input data to GPUs and return generated output tokens to clients.

    Figure 3 illustrates the network processing layers between incoming LLM requests and the GPU of an LLM server. Clients submit their LLM requests via a REST API, which are then sent to the Nginx front-end gateway. Nginx collects incoming requests and distributes them to back-end GPU computation containers.

    On a front-end server equipped with WebBoost, MangoBoost’s DPU solution provides stateful hardware acceleration of the TCP/IP networking stack, significantly enhancing the performance of web server software like Nginx. WebBoost delivers consistently low average and tail latency while boosting throughput—key factors for optimizing LLM inference performance. Furthermore, WebBoost seamlessly integrates with existing systems by hooking into standard POSIX socket API calls, requiring no modifications to user applications. This makes it a reliable, high-performance, and easy-to-use solution for modern AI deployments.
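
    Because the interception happens at the standard POSIX socket layer, application code does not need to change to benefit. The minimal echo server below is an ordinary sockets program of the kind that would run unmodified on either the Linux kernel stack or the offloaded stack; it is purely illustrative and not part of WebBoost.

```python
import socket

# An ordinary TCP echo server using only standard socket calls.
# Code like this needs no changes when the TCP/IP stack beneath it is
# offloaded, since WebBoost hooks in at the POSIX socket API level.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("0.0.0.0", 8080))
srv.listen()

conn, _addr = srv.accept()
with conn:
    while data := conn.recv(4096):
        conn.sendall(data)  # same send/recv calls regardless of the stack underneath
srv.close()
```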

    Fig. 3: End-to-End Inference Serving Service

    Case Study with WebBoost. To quantify the performance benefits of MangoBoost’s WebBoost when integrated with Nginx, we conducted a series of benchmarks comparing our solution against a setup using Linux’s standard in-kernel TCP software stack. The kernel-stack baseline runs on an NVIDIA® ConnectX®-6 network card, which is widely used in typical data center environments.

    We used the wrk benchmarking tool to generate network requests to the Nginx web server. wrk is designed for measuring HTTP performance, making it well suited to evaluating the throughput and latency of our solution. In this evaluation, wrk sends HTTP POST requests whose bodies contain sentences typical of what users submit to an LLM inference service, and Nginx responds with sentences in the HTTP responses. The configuration details of our server hardware and software are in the table below.

    Web-Serving Front-End Server Configuration
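
    For readers who want a feel for the workload wrk generates in this study, the snippet below is a rough single-threaded Python approximation of the same POST pattern. The URL and payload are placeholders, and a simple loop like this will not drive the server as hard as wrk's multi-threaded event loop; it only illustrates the request shape and how percentile latency can be derived.

```python
import time
import requests

# Rough stand-in for the wrk POST workload: send LLM-style sentences to the
# Nginx front end and record per-request latency. URL and payload are
# illustrative placeholders.
URL = "http://nginx-frontend/v1/completions"
BODY = {"prompt": "A typical sentence a user might send to the inference service."}

session = requests.Session()
latencies = []
for _ in range(1000):
    start = time.perf_counter()
    session.post(URL, json=BODY, timeout=10)
    latencies.append(time.perf_counter() - start)

latencies.sort()
print("p50 latency:", latencies[len(latencies) // 2])
print("p90 latency:", latencies[int(len(latencies) * 0.9)])
```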

    The evaluation results are summarized in the graphs in Fig. 4.

  • Throughput: WebBoost significantly outperforms the Linux TCP stack, achieving 2x higher throughput: Nginx handles twice as many requests per second with WebBoost as with the standard kernel stack.
  • Latency: WebBoost also reduces the 50th and 90th percentile latencies to 0.53x and 0.54x of the Linux baseline, respectively.

    Fig. 4: Throughput and Latency comparison between Linux and WebBoost

    These results clearly illustrate the efficiency gains from offloading TCP packet processing to our DPU. By handling TCP connections more effectively, the WebBoost solution not only increases throughput but also significantly reduces the time it takes to process requests, leading to a more responsive and scalable Nginx deployment.

    Nginx is a cornerstone of modern web infrastructure, and optimizing its performance is crucial for maintaining the efficiency and reliability of web services. By offloading TCP packet processing to a DPU, we can significantly enhance Nginx’s capabilities, allowing it to handle more traffic with lower latency. Our solution represents a leap forward in web server performance, setting a new standard for what is possible in high-demand environments.

    Try our solutions today!

    If you are interested in LLMBoost and WebBoost, please email contact@mangoboost.io.

    References
    [1] MLPerf Inference: Datacenter Benchmark Suite Results
    [2] Unveiling MLPerf® Results on AMD Instinct™ MI300X Accelerators
    [3] Smith, Alan, et al. "Realizing the AMD Exascale Heterogeneous Processor Vision: Industry Product." 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA). IEEE, 2024.
    [4] vLLM in AMD MLPerf Inference v4.1 submission
    [5] AMD ROCm vLLM
    [6] Result not verified by MLCommons Association.

    Disclaimer
    The performance claims in this document are based on the internal cluster environment. Actual performance may vary depending on the server configuration. Software and workloads used in performance tests may have been optimized for performance only on MangoBoost products. Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. Results that are based on pre-production systems and components, as well as results that have been estimated or simulated using the MangoBoost reference platform, are provided for informational purposes only. Results may vary based on future changes to any systems, components, specifications, or configurations. Statements in this document that refer to future plans or expectations are forward-looking statements. These statements are based on current expectations and involve many risks and uncertainties that could cause actual results to differ materially from those expressed or implied in such statements. MangoBoost does not guarantee any specific outcome. Nothing contained herein is, or shall be relied upon as, a promise or representation or warranty as to future performance of MangoBoost or any MangoBoost product. The information contained herein shall not be deemed to expand in any way the scope or effect of any representations or warranties contained in the definitive agreement for MangoBoost products.

    The information contained herein may not be reproduced in whole or in part without prior written consent of MangoBoost. The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. MangoBoost assumes no obligation to update or otherwise correct or revise this information and MangoBoost reserves the right to make changes to the content hereof from time to time without any notice. Nothing contained herein is intended by MangoBoost, nor should it be relied upon, as a promise or a representation as to the future.

    MANGOBOOST MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

    © 2024 MangoBoost, Inc. All rights reserved.
