Aug 15, 2024 7 min read

How to Set Up the First NVIDIA MGX GH200 Server - AceleMax® DGS-AR211

See our hands-on experience setting up and testing the AceleMax® DGS-AR211, powered by the NVIDIA GH200 Grace Hopper Superchip and built on the latest NVIDIA MGX™ reference design.

Technician working on server hardware with code displayed on a laptop in a data center.

Hardware Overview

The NVIDIA MGX™ architecture sets a new standard for modular server design, providing scalable, adaptable solutions for AI and high-performance computing. It allows system manufacturers to meet evolving technological demands efficiently while ensuring compatibility with NVIDIA's range of products. Incorporating the NVIDIA® Grace™ Hopper™ Superchip strengthens the platform further by integrating a powerful CPU and GPU on a single module, optimizing data movement and boosting performance for complex applications.

🛠️
Contact AMAX for expert deployment of NVIDIA’s MGX architecture with the NVIDIA Grace Hopper Superchip.

The AceleMax® DGS-AR211 is the first server to introduce the NVIDIA® Grace™ Hopper™ Superchip in conjunction with NVIDIA MGX™ architecture. This 2U server offers features particularly appealing to the AI/ML community.

The AceleMax® DGS-AR211

Overview of the AceleMax® DGS-AR211 with the top cover removed. Underneath the massive heatsink is the NVIDIA Grace Hopper Superchip board.

Key Features

  • Grace Hopper Superchip
    • Includes an H100 Tensor Core GPU with 96GB of HBM3 memory and a Grace CPU.
    • Note: A typical H100 GPU (PCIe or SXM5) has 80GB of HBM3 memory.
  • NVIDIA Grace CPU
    • Features 72 Arm Neoverse V2 cores.
    • Note: This is not a typical x86 CPU, which means the GPU drivers must be ARM64-based.
    • The Arm Neoverse V2 CPU is designed for leadership performance in cloud computing, high-performance computing (HPC), and machine learning (ML). Neoverse V2 delivers up to twice the performance of Neoverse V1 and is the first V-series CPU with Armv9 performance, power, and security enhancements. See Arm's Neoverse V2 documentation for more information.
  • 480GB LPDDR5X System Memory
    • LPDDR5X (Low Power Double Data Rate 5X) supports data rates up to 8.533 Gbps, a 33% boost over the previous generation, LPDDR5, which tops out at 6.4 Gbps. See the LPDDR5X specification for more information.
  • Additional PCIe Slots
    • 3x FHFL PCIe 5.0 x16 slots in the back via riser
  • Max TDP
    • 1000W with air cooling
Hardware information can be viewed with the "lspci" command, part of the pciutils package included in most Linux distributions.
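As a minimal sketch of that command, the filter below picks NVIDIA devices out of lspci output; the sample line is an assumed stand-in for what a live GH200 system would report.

```shell
# Filter a PCI device listing for NVIDIA hardware.
# On the live system, run:  lspci | grep -i nvidia
# The sample line below is assumed output, used here for illustration only.
sample="0009:01:00.0 3D controller: NVIDIA Corporation Device (GH200)"
echo "$sample" | grep -i nvidia
```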

System Access

The system comes with login protection for the following access:

  1. BMC
  2. BIOS

For assistance with obtaining login credentials, please contact your server manufacturer or provider.

OS Installation

NVIDIA Base OS

NVIDIA provides the Ubuntu-based NVIDIA Base OS for the system. NVIDIA Base OS provides a stable, fully qualified operating system for running AI, machine learning, and analytics applications on the DGX and MGX platforms.

For expert guidance on installation and support for the NVIDIA Base OS, our AMAX specialists are here to assist. For detailed installation instructions, please contact your server manufacturer or provider.

System Access

There is a default login included for the NVIDIA Base OS. For assistance with obtaining login credentials, please contact your server manufacturer or provider.

GPU Driver Update

ARM64 NVIDIA GPU Driver

When you log in to the system for the first time and run a command such as nvidia-smi, you may see "No devices were found." This is expected: the GPU driver must be updated manually.

Steps to Update the ARM64 NVIDIA GPU Driver:

  1. Download the latest NVIDIA GPU driver for the ARM64 (aarch64) architecture from the NVIDIA driver downloads page.
  2. Unload default NVIDIA modules.
    • sudo rmmod nvidia-uvm
    • sudo rmmod nvidia-drm
    • sudo rmmod nvidia-modeset
    • sudo rmmod nvidia
  3. Verify the modules are unloaded before you proceed.
Output of the "sudo lsmod | grep -i nvidia" command.
  4. Uninstall the outdated default NVIDIA GPU driver.
    • sudo apt remove nvidia-driver-535-server
    • sudo apt autoremove
  5. Execute the run file with the -m=kernel-open option.
    • sudo ./NVIDIA-Linux-aarch64-535.104.05.run -m=kernel-open
During installation you may see the following messages:
Warning: "nvidia-installer was forced to guess the X library path '/usr/lib' and X module path '/usr/lib/xorg/modules'."
Warning: "This NVIDIA driver package includes Vulkan components, but no Vulkan ICD loader was detected on this system."
Success: "Installation of the NVIDIA Accelerated Graphics Driver for Linux-aarch64 (version: 535.104.05) is now complete."

6. Reboot the system.

    • sudo nvidia-persistenced (starts the NVIDIA persistence daemon)
    • sudo reboot

7. Verify after the reboot.

    • sudo nvidia-smi -q
Output of "sudo nvidia-smi -q" showing the GH200 480GB product name.

==============NVSMI LOG==============
Timestamp                                 : Thu Jan 25 02:59:59 2024
Driver Version                            : 535.104.05
CUDA Version                              : 12.2

Attached GPUs                             : 1
GPU 00000009:01:00.0
    Product Name                          : GH200 480GB
    Product Brand                         : NVIDIA
    Product Architecture                  : Hopper
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    Addressing Mode                       : ATS
    MIG Mode
        Current                           : Disabled
        Pending                           : Disabled
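The verification step above can also be scripted. The sketch below pulls the driver version and product name out of nvidia-smi -q output; the heredoc mirrors the sample log, and on a live system you would capture the real output instead.

```shell
# Extract key fields from `nvidia-smi -q` output.
# The heredoc mirrors the sample log above; on a live system use:
#   q="$(sudo nvidia-smi -q)"
q="$(cat <<'EOF'
Driver Version                            : 535.104.05
CUDA Version                              : 12.2
    Product Name                          : GH200 480GB
EOF
)"
echo "$q" | awk -F' : ' '/Driver Version/ {print "Driver:", $2}'
echo "$q" | awk -F' : ' '/Product Name/  {print "GPU:", $2}'
```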

CUDA Runtime Setup

NVIDIA NGC PyTorch Container

Using the PyTorch container (v23.12-py3, the latest at the time of writing) from NVIDIA NGC is recommended. The container has built-in ARM64 and x86 support and can be pulled without a login from the NVIDIA NGC catalog.

How to Pull a PyTorch Container Supporting ARM

  • sudo docker pull nvcr.io/nvidia/pytorch:23.12-py3
Detailed information of the PyTorch 23.12 container on NVIDIA NGC Catalog.

Prerequisite Packages

The container requires Docker and the NVIDIA Container Toolkit installed on the host; see Potential Error Messages below for what happens if either is missing.

Run the Container and Gain Terminal Access

  • sudo docker run --gpus all -p 8080:8501 --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -it --rm -v /path/to/your/code:/workspace/code nvcr.io/nvidia/pytorch:23.12-py3
    • -p 8080:8501 is optional; use it if you need to run an application inside the container and reach it from your laptop, for example. Docker's -p takes host:container order, so this maps container port 8501 (Streamlit's default) to port 8080 on the system, letting you browse to http://GH200-IP-Address:8080
    • -v /path/to/your/code:/workspace/code maps your local path so it is accessible at /workspace/code inside the PyTorch container.
  • sudo docker ps
    • Lists running containers so you can find the container ID (or name) of the PyTorch container.
  • sudo docker exec -it [container_id_or_name] /bin/bash
    • Gives you terminal access to the container, where you can install third-party libraries such as transformers.
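The run options above can be collected into a small launcher sketch so the port mapping and code path are easy to adjust. IMAGE comes from the steps above; CODE_DIR, HOST_PORT, and APP_PORT are placeholders to set for your environment.

```shell
# Assemble the `docker run` command from the options discussed above.
# CODE_DIR is a placeholder - point it at your own project directory.
IMAGE="nvcr.io/nvidia/pytorch:23.12-py3"
CODE_DIR="/path/to/your/code"
HOST_PORT=8080   # port you browse to on the GH200 system
APP_PORT=8501    # Streamlit's default port inside the container
CMD="sudo docker run --gpus all -p ${HOST_PORT}:${APP_PORT} --ipc=host \
--ulimit memlock=-1 --ulimit stack=67108864 -it --rm \
-v ${CODE_DIR}:/workspace/code ${IMAGE}"
echo "$CMD"   # review the command, then run it with: eval "$CMD"
```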

Potential Error Messages

If you don’t have the NVIDIA Container Toolkit or CUDA installed correctly, you may see the following error message. (Note: OCI here stands for the Open Container Initiative, whose container-runtime specification Docker's runc implements; it is unrelated to Oracle Cloud Infrastructure.)

$ sudo docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/pytorch:23.12-py3
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
ERRO[0000] error waiting for container:
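A quick first diagnostic for this failure is to check whether the library the message complains about, libnvidia-ml.so, is visible to the dynamic loader. The helper below is a sketch that assumes a standard Linux ldconfig cache:

```shell
# Report whether a shared library is visible to the dynamic loader.
# libnvidia-ml.so is the library the error above fails to load; it is
# installed by the ARM64 GPU driver, so "missing" points at the driver setup.
check_lib() {
  if ldconfig -p 2>/dev/null | grep -q "$1"; then
    echo "$1: found"
  else
    echo "$1: missing - re-run the ARM64 driver install, then reinstall the NVIDIA Container Toolkit"
  fi
}
check_lib libnvidia-ml.so
```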

Deploy NVIDIA MGX with AMAX

If you're planning to upgrade to the latest NVIDIA MGX architecture and Grace Hopper Superchip but are concerned about the complexities of installation and system integration, AMAX is ready to assist. Our expertise extends beyond basic installation to encompass comprehensive data center solutions, including layout design, cluster architecture, network topology design, system bring-up, and performance tuning.

AceleMax® DGS-AR211

  • Powered by the NVIDIA® Grace™ Hopper™ Superchip, offering coherent memory between CPU and GPU through the NVLink®-C2C interconnect
  • Optimized for memory-intensive inference and high-performance computing (HPC) applications
  • Supports up to 480GB of LPDDR5X memory and 96GB of HBM3 GPU memory
  • Includes a 1+1 high-efficiency, hot-plug 2000W power supply unit with 80 Plus Titanium certification
Download Datasheet

We also specialize in both liquid and air-cooled facility retrofits and provide co-location services for site hosting. Partnering with AMAX gives you access to top-tier infrastructure solutions vital for advanced AI deployments. Contact us today to ensure your hardware is expertly integrated and optimized to meet your specific operational needs.

🚀
Contact Us to empower your operations with AMAX’s advanced AI and HPC solutions.