HPC Student Cluster Competition

Atlanta, GA · November 2024

Cluster diagram

Per Node

Nodes
CPU 2× Intel Xeon Platinum 8580
Memory 32× 64 GB DDR5-5600
GPU 4× Nvidia H100 NVL
Network 4× Nvidia ConnectX-7 NIC

Over the summer leading up to the competition, I ran HPL on a single H100 GPU, using both the Nvidia optimized benchmarks container and my own pre-optimized HPL parameter set, to mirror the environment we planned to use in competition.

Running the GPU at 100% of TDP proved to be the least power-efficient setting. I expected efficiency to peak around 75–80% of TDP, but was surprised to find that it continued to increase linearly all the way down to 50% of TDP.
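A power-limit sweep like the one described above could be scripted roughly as follows. This is a minimal sketch, not the script used for these runs: the helper names are hypothetical, the 400 W figure is the per-GPU number assumed in this write-up's power estimate, and efficiency is taken to mean HPL GFLOPS per average watt. The only external command referenced is `nvidia-smi -i <gpu> -pl <watts>`, which needs root and is bounded by the GPU's minimum and maximum enforceable limits.

```python
# Sketch of a power-limit sweep. Helper names are hypothetical, and the
# 400 W TDP is an assumed per-GPU figure, not a measured value.

def power_limit_watts(tdp_watts: float, fraction: float) -> int:
    """Return the power cap in whole watts for a given fraction of TDP."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return round(tdp_watts * fraction)


def set_limit_command(gpu_index: int, watts: int) -> list[str]:
    """Build the argument list for `nvidia-smi -i <gpu> -pl <watts>`."""
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]


def efficiency(gflops: float, avg_watts: float) -> float:
    """Power efficiency in GFLOPS per watt."""
    if avg_watts <= 0:
        raise ValueError("avg_watts must be positive")
    return gflops / avg_watts


if __name__ == "__main__":
    # Print the commands for a sweep from 100% down to 50% of the assumed TDP.
    for pct in (100, 90, 80, 70, 60, 50):
        watts = power_limit_watts(400, pct / 100)
        print(f"{pct}% TDP:", " ".join(set_limit_command(0, watts)))
```

After each cap is applied, the HPL result and the average power draw for that run would be passed to `efficiency` so the settings can be compared on the same scale.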

A rough estimate placed our cluster at 350 W × 2 sockets × 3 nodes = 2,100 W for the CPUs and 400 W × 4 GPUs × 3 nodes = 4,800 W for the GPUs, or about 6,900 W in total, ignoring memory and other components. Based on the applications for this year's competition, we had room to dynamically allocate resources per workload while staying under the power cap. We chose a configuration with iDRAC chips, which let us power nodes on and off remotely without physically touching the cluster, a competition requirement after the benchmarking phase.
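The arithmetic behind the estimate above can be written out as a short sketch. The per-component wattages and counts are the ones quoted in the estimate. The function names are hypothetical, and the power cap is left as a parameter because its value is not stated here.

```python
# Back-of-the-envelope power budget using the figures from the estimate.
# Memory, NICs, fans, and other components are ignored, as in the text.

def component_watts(watts_each: int, per_node: int, nodes: int) -> int:
    """Total draw for one component type across the whole cluster."""
    return watts_each * per_node * nodes


def headroom(cap_watts: int, used_watts: int) -> int:
    """Watts left under a power cap (negative means over budget)."""
    return cap_watts - used_watts


NODES = 3
cpu_watts = component_watts(350, 2, NODES)   # 350 W x 2 sockets x 3 nodes
gpu_watts = component_watts(400, 4, NODES)   # 400 W x 4 GPUs x 3 nodes
total_watts = cpu_watts + gpu_watts

if __name__ == "__main__":
    print(f"CPU: {cpu_watts} W, GPU: {gpu_watts} W, total: {total_watts} W")
```

Calling `headroom` with the competition's cap and `total_watts` shows how far the worst-case draw sits from the limit, which is the number that per-workload power capping has to make up.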