GPU POD Solutions


AI-ready supercomputing infrastructure solution for all workloads at scale

Scalable AI incorporates the best of compute, networking, storage, power, and cooling to deliver the fastest application performance and meet the demands of evolving AI workloads.

If you would like to discuss building a GPU POD solution for your data center, please contact us for more information.

Turnkey GPU Pod
AMAX GPU POD

Providing the computational power to train deep learning models

The AMAX GPU POD with NVIDIA H100 GPUs is an artificial intelligence (AI) supercomputing infrastructure, providing the computational power necessary to train today’s state-of-the-art deep learning (DL) models and to fuel innovation well into the future. The AMAX GPU POD delivers groundbreaking performance and is designed to solve the world’s most challenging computational problems.

This GPU POD reference architecture is the result of co-design between data scientists, application performance engineers, and system architects to build a system capable of supporting the widest range of deep learning workloads.

NVIDIA H100 Tensor Core GPU

The NVIDIA® H100 Tensor Core GPU provides unprecedented acceleration to power the world’s most elastic data centers for AI, data analytics, and high-performance computing (HPC) applications. As the most effective end-to-end AI and HPC data center platform, it enables researchers to deliver real-world results and deploy solutions into production at scale.

AMAX AceleMax DGS-428A

Each AceleMax DGS-428A system with flexible configuration supports up to eight NVIDIA A100 Tensor Core GPUs, powered by dual-socket AMD EPYC™ 7003 series processors in a 4U form factor.

The AceleMax DGS-428A features up to 11 PCIe 4.0 slots and up to 160 PCIe lanes for compute, graphics, storage, and networking expansion. PCIe 4.0 provides a transfer rate of up to 16 GT/s per lane – double the bandwidth of PCIe 3.0 – and delivers lower power consumption, better lane scalability, and backward compatibility.

AceleMax DGS-428A with NVIDIA A100 GPUs
Mellanox QM8700 Network Switch for GPU POD

NVIDIA InfiniBand Network

NVIDIA provides the world’s smartest switches, enabling in-network computing through the Co-Design Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)™ technology. The QM8700 series has the highest fabric performance available in the market, with up to 16Tb/s of non-blocking bandwidth and sub-130ns port-to-port latency.

For this reference architecture, the StorMax® storage connects to the AceleMax DGS-428A systems over two NVIDIA HDR InfiniBand networks (for high availability), providing the most efficient scaling of GPU workloads and datasets. Built with the NVIDIA Quantum InfiniBand switch device, the QM8700 series provides forty ports of 200Gb/s full bidirectional bandwidth.
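As a quick arithmetic check (illustrative only), the QM8700’s aggregate bandwidth follows directly from its port count and per-port rate:

```python
# Illustrative check of the QM8700 aggregate bandwidth figures.
PORTS = 40        # QM8700: forty HDR InfiniBand ports
PORT_GBPS = 200   # 200Gb/s per port, per direction

per_direction_tbps = PORTS * PORT_GBPS / 1000  # 8 Tb/s one way
bidirectional_tbps = 2 * per_direction_tbps    # 16 Tb/s, the quoted non-blocking figure
print(per_direction_tbps, bidirectional_tbps)
```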

StorMax A-1110NV
StorMax A-2440

AMAX StorMax® Storage Systems

AMAX, together with Excelero, delivers StorMax® all-flash and hybrid-flash storage solutions, featuring 200Gb/s NVMe over Fabrics on InfiniBand with NVIDIA® ConnectX-6 adapters. StorMax platforms combine high performance, strong security, and a flexible architecture with unmatched price-performance that accelerates AI computing, database, big data analytics, cloud, web 2.0, and video processing workloads.

StorMax A-1110NV (1U) and StorMax A-2440 (2U) offer two ports of 200Gb/s InfiniBand and Ethernet connectivity, sub-600-nanosecond latency, and 215 million messages per second. Both systems deliver low-latency distributed block storage for web-scale applications, enabling shared NVMe across any network and supporting any local or distributed file system. These StorMax® solutions feature an intelligent management layer that abstracts the underlying hardware with CPU offload, creates logical volumes with redundancy, and provides centralized, intelligent management and monitoring.

All applications benefit from the ultra-low latency, extremely high throughput and high IOPs of a local NVMe device with the convenience of centralized storage while avoiding proprietary hardware lock-in and reducing the overall TCO.

GPU POD Reference Architecture

Designed for any dataset size, the GPU POD is offered in three deployment options, each enabling training at vastly improved performance.

SMALL REFERENCE ARCHITECTURE: 61.44 TB Raw

AMAX GPU POD with NVIDIA A100 GPUs

GPU Server:

  • 1x AceleMax DGS-428A
  • 4x NVIDIA A100 GPUs
  • 5x NVIDIA ConnectX-6 VPI HDR/200GbE dual-port adapters

Networking:

  • 1x NVIDIA QM8700 Switch

Performance   Reads     Writes
Bandwidth     20 GB/s   7.5 GB/s
IOPS          5M        340K
Latency       95µs      21µs

High-Performance Storage:

  • 1x StorMax® A-1110NV
  • 1x 2nd or 3rd Gen AMD EPYC™ Processor
  • 128GB RAM (8x 16GB) DDR4-3200 DIMMs
  • 2x NVIDIA ConnectX-6 VPI HDR/200GbE dual-port adapters
  • 4x Kioxia CM6-R 15.36TB NVMe

MEDIUM REFERENCE ARCHITECTURE: 245.76 TB Raw

Medium GPU POD Reference Architecture

GPU Server:

  • 2x AceleMax DGS-428A, each with:
  • 4x NVIDIA A100 GPUs
  • 5x NVIDIA ConnectX-6 VPI HDR/200GbE dual-port adapters

Networking:

  • 2x NVIDIA QM8700 Switches

Performance   Reads     Writes
Bandwidth     40 GB/s   15 GB/s
IOPS          10M       680K
Latency       95µs      21µs

High-Performance Storage:

  • 2x StorMax® A-1110NV
  • 1x 2nd or 3rd Gen AMD EPYC™ Processor
  • 128GB RAM (8x 16GB) DDR4-3200 DIMMs
  • 2x NVIDIA ConnectX-6 VPI HDR/200GbE dual-port adapters
  • 4x Kioxia CM6-R 15.36TB NVMe

LARGE REFERENCE ARCHITECTURE: 368.64 TB Raw

Large GPU POD Reference Architecture

Performance   Reads      Writes
Bandwidth     160 GB/s   46 GB/s
IOPS          30M        2M
Latency       95µs       21µs

GPU Server:

  • 4x AceleMax DGS-428A, each with:
  • 4x NVIDIA A100 GPUs
  • 6x NVIDIA ConnectX-6 VPI HDR/200GbE dual-port adapters

Networking:

  • 2x NVIDIA QM8700 Switches

High-Performance Storage:

  • 1x StorMax® A-2440 (2U4N), which includes:
  • 1x 2nd or 3rd Gen AMD EPYC™ Processor
  • 128GB RAM (8x 16GB) DDR4-3200 DIMMs
  • 2x NVIDIA ConnectX-6 VPI HDR/200GbE dual-port adapters
  • 24x Kioxia CM6-R 15.36TB NVMe
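The “raw” capacity labels on the small and large tiers follow directly from the number of 15.36TB drives; a minimal arithmetic sketch (illustrative only):

```python
# Illustrative raw-capacity arithmetic for the reference tiers.
DRIVE_TB = 15.36  # Kioxia CM6-R capacity per drive

small_tb = 4 * DRIVE_TB   # small tier: 4 drives in one StorMax A-1110NV
large_tb = 24 * DRIVE_TB  # large tier: 24 drives in one StorMax A-2440
print(round(small_tb, 2), round(large_tb, 2))
```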