SPARTA Benchmarks

This page gives SPARTA performance on two benchmark problems, run on different machines, both in serial and parallel. Input files and sample output files for these benchmark tests are provided in the bench directory of the SPARTA distribution. See the bench/README file for details.


Free molecular flow in a box

This benchmark is for particles advecting in free molecular flow (no collisions) on a regular grid overlaying a 3d closed box with reflective boundaries. The size of the grid was varied; the particle count is always 10x the number of grid cells. Particles were initialized with a thermal temperature (no streaming velocity) so they move in random directions. Since there is very little computation to do, this is a good stress test of the communication capabilities of SPARTA and the machines it is run on.

The input script for this problem is bench/in.free in the SPARTA distribution.

This plot shows timing results in particle moves/sec/node, for runs of different sizes on varying node counts of two different machines. Problems as small as 1M grid cells (10M particles) and as large as 10B grid cells (100B particles) were run.
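As a rough sketch of how the moves/sec/node metric is derived, the calculation below uses made-up run parameters (the wall-clock time and node count are hypothetical, not actual benchmark data): total particle moves are the particle count times the number of timesteps, normalized by run time and node count.

```python
# Hypothetical run parameters (not actual benchmark results):
grid_cells = 1_000_000        # 1M grid cells
particles = 10 * grid_cells   # 10 particles per cell, as in the benchmark setup
timesteps = 100               # number of advection timesteps
nodes = 4                     # node count for this hypothetical run
wall_time = 25.0              # total run time in seconds (made-up value)

# Each particle is moved once per timestep.
total_moves = particles * timesteps
moves_per_sec_per_node = total_moves / (wall_time * nodes)
print(f"{moves_per_sec_per_node:.3e} particle moves/sec/node")
# prints "1.000e+07 particle moves/sec/node"
```

Normalizing by node count (rather than reporting aggregate moves/sec) is what makes a horizontal line in the plot correspond to perfect scaling.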

Chama is an Intel cluster with Infiniband, described below. Each node of chama has dual 8-core Intel Sandy Bridge CPUs. These tests were run on all 16 cores of each node, i.e. with 16 MPI tasks/node. Up to 1024 nodes were used (16K MPI tasks). Mira is an IBM BG/Q machine at Argonne National Laboratory, also described below. It has 16 cores per node. These tests were run with 4 MPI tasks/core, for a total of 64 MPI tasks/node. Up to 8K nodes were used (512K MPI tasks).

The plot shows that a Chama node is about 2x faster than a BG/Q node.

Each individual curve in the plot is a strong scaling test, where the same size problem is run on more and more nodes. Perfect scalability would be a horizontal line. The curves show some initial super-linear speed-up as the particle count/node decreases, due to cache effects, then a slow-down as more nodes are added, due to too few particles/node and increased communication costs.

Jumping from curve-to-curve as node count increases is a weak scaling test, since the problem size is increasing with node count. Again a horizontal line would represent perfect weak scaling.
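The strong scaling behavior described above can be quantified as a parallel efficiency. The sketch below uses invented timings (not data from these benchmarks) for one fixed-size problem run on increasing node counts; efficiency is speedup relative to the smallest run, divided by the node-count ratio, so values above 1.0 indicate the super-linear cache effect and values below 1.0 indicate communication overhead.

```python
# Hypothetical strong-scaling series: the same fixed-size problem on
# increasing node counts (times in seconds, made-up values):
nodes = [1, 2, 4, 8]
times = [100.0, 48.0, 26.0, 15.0]

# Strong scaling efficiency relative to the 1-node baseline.
for n, t in zip(nodes, times):
    speedup = times[0] / t
    efficiency = speedup / (n / nodes[0])
    print(f"{n:2d} nodes: speedup {speedup:.2f}x, efficiency {efficiency:.2f}")
```

With these invented numbers the 2-node efficiency exceeds 1.0 (super-linear, as the cache effect produces) and then falls below 1.0 at higher node counts. A weak scaling efficiency would be computed the same way, but comparing runs whose problem size grows in proportion to the node count.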



Collisional flow in a box

This benchmark is for particles undergoing collisional flow. Everything about the problem is the same as the free molecular flow problem described above, except that collisions were enabled, which requires extra computation, as well as particle sorting each timestep to identify particles in the same grid cell.

The input script for this problem is bench/in.collide in the SPARTA distribution.

As above, this plot shows timing results in particle moves/sec/node, for runs of different sizes on varying node counts. Data for the same two machines is shown: chama (Intel cluster with Infiniband at Sandia) and mira (IBM BG/Q at ANL). Comparing these timings to the free molecular flow plot in the previous section shows the cost of collisions (and sorting) slows performance by a factor of about 2.5x. Cache effects (super-linear speed-up) are smaller due to the increased computational costs.

For collisional flow, problems as small as 1M grid cells (10M particles) and as large as 1B grid cells (10B particles) were run.

The discussion above regarding strong and weak scaling also applies to this plot. For any individual curve, a horizontal line would represent perfect strong scaling; jumping from curve-to-curve as node count increases is again a weak scaling test.



Machine characteristics

This section lists characteristics of machines used in the benchmarking along with options used in compiling SPARTA. The communication parameters are for bandwidth and latency at the MPI level, i.e. what a program like SPARTA sees.

Desktop = Dell Precision T7500 desktop workstation running Red Hat Linux

Chama = Intel cluster with Infiniband

Mira = IBM BG/Q