Cluster Configuration

3x Nodes

- 2x Intel Xeon Platinum 8580

- 32x 64GB DDR5-5600 Memory

- 4x Nvidia H100 NVL

- 4x Nvidia ConnectX-7 NIC



Competition Photos


Team Photos


Shipping Cluster


Cluster Build


H100 Power Performance Results

     Over the summer leading up to the competition, I ran a number of tests benchmarking the per-watt performance of H100 GPUs. This was particularly important in designing this year's cluster, as the results ended up being a key factor in the decision to over-spec our machine.

     I ran the HPL benchmark on a single H100 GPU, using the Nvidia-optimized benchmarks container together with my own pre-optimized HPL parameter set. This was intentional, to mirror the environment that we planned to use in the competition.

     My hypothesis that running at 100% TDP would be inefficient was correct: it turned out to be the least power-efficient setting I tested. I further hypothesized that efficiency would keep improving as the power limit was lowered to around the 75-80% TDP mark, but was surprised to find that it continued to increase linearly all the way down to only 50% of the TDP.
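
     For readers who want to try a similar sweep, here is a minimal sketch of how the power caps could be generated. It assumes the caps are applied with nvidia-smi's power-limit option (-pl), which this post does not confirm was the tool used, and it assumes a 400W full-power figure, taken from the per-GPU number in the estimate later in this section. The function name and the list of percentages are illustrative only.

```python
# Sketch only: builds the command strings for a power-cap sweep on one GPU.
# Assumption: caps are set with `nvidia-smi -pl <watts>` (usually needs root).
GPU_TDP_WATTS = 400  # assumed full-power figure per GPU


def power_limit_commands(gpu_index=0, percents=(100, 90, 80, 70, 60, 50)):
    """Return one nvidia-smi command per TDP percentage in the sweep."""
    commands = []
    for pct in percents:
        watts = GPU_TDP_WATTS * pct // 100  # integer watts for the cap
        commands.append(f"nvidia-smi -i {gpu_index} -pl {watts}")
    return commands
```

     Each command would be run before a benchmark pass at that cap, so that performance and measured power can be compared per setting.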

     The results here played a role in choosing our final cluster configuration. A rough estimate puts our cluster at 350W * 2 sockets * 3 nodes = 2100W in CPU power and 400W * 4 GPUs * 3 nodes = 4800W in GPU power, for a total of 6900W, ignoring all other components (the large amount of DDR5 memory, for example, uses a significant amount of power just to keep running).
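
     The same back-of-the-envelope arithmetic can be written as a small function, which makes it easy to rerun with different part counts. This is only a sketch: the function name is made up for illustration, and like the estimate above it ignores memory, NICs, fans, and power supply losses.

```python
def estimate_cluster_power(nodes=3, cpus_per_node=2, cpu_watts=350,
                           gpus_per_node=4, gpu_watts=400):
    """Rough CPU + GPU power estimate in watts, ignoring all other parts."""
    cpu_total = cpu_watts * cpus_per_node * nodes
    gpu_total = gpu_watts * gpus_per_node * nodes
    return {"cpu": cpu_total, "gpu": gpu_total, "total": cpu_total + gpu_total}
```

     With the default arguments this reproduces the 2100W, 4800W, and 6900W figures above.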

     Based on the applications for this year's competition, we have room to dynamically allocate resources as needed for the different applications while staying below the power limit. For HPL, for example, we may end up using only 2 nodes, while for ICON we may use all 3 with the GPU TDP limited. We chose a configuration equipped with iDRAC chips, which gives us the ability to power nodes completely off and on as needed without physically touching the cluster (a requirement for the competition after the benchmarking phase has concluded).
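
     To make the trade-off between node count and GPU power cap concrete, here is a hedged sketch that estimates CPU plus GPU draw for a given configuration. It reuses the 350W and 400W per-part figures from the estimate above and assumes that GPU draw scales directly with the cap while CPUs stay at full power, which is a simplification. The function name and the 50% cap in the usage note are illustrative, not our actual competition settings.

```python
def config_power(nodes, gpu_tdp_fraction=1.0, cpus_per_node=2, cpu_watts=350,
                 gpus_per_node=4, gpu_watts=400):
    """Estimated CPU + GPU watts for `nodes` powered-on nodes.

    Assumes GPU draw equals the cap (gpu_watts * gpu_tdp_fraction) and that
    powered-off nodes draw nothing; other components are ignored.
    """
    cpu_total = cpu_watts * cpus_per_node * nodes
    gpu_total = gpu_watts * gpu_tdp_fraction * gpus_per_node * nodes
    return cpu_total + gpu_total
```

     Under these assumptions, config_power(2) gives 4600W for two nodes at full GPU power, and config_power(3, gpu_tdp_fraction=0.5) gives 4500W for all three nodes with GPUs capped at half power.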