Details on the HPL Linpack benchmark run on the amd32 nodes (32 cores/node, 34 nodes, 1088 cores in total)
The HPL Linpack benchmark version was 1.0. It was compiled against OpenMPI version 1.4.2 and linked with the GotoBLAS2 library. The benchmark was run using the OpenMPI 1.4.4rc2 mpirun launcher, since this version has the --loadbalance option, which distributes ranks uniformly across all nodes. The best Linpack performance recorded was 6.26 TFlop/second (6.264e+03 Gflops, at N = 483072).
The benchmark was launched from the job head node as follows:
$> /usr/local/openmpi/bin/mpirun -np 1088 --loadbalance ./xhpl
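As a rough illustration of what a uniform distribution means for this launch, the sketch below spreads 1088 ranks over 34 nodes as evenly as possible. The function name ranks_per_node is hypothetical, and this is not Open MPI's actual mapping code, only the arithmetic behind the rank counts.

```python
def ranks_per_node(total_ranks: int, nodes: int) -> list[int]:
    """Spread total_ranks over nodes as evenly as possible (illustrative only)."""
    base, extra = divmod(total_ranks, nodes)
    # The first `extra` nodes take one additional rank each.
    return [base + 1 if i < extra else base for i in range(nodes)]


counts = ranks_per_node(1088, 34)      # -np 1088 on 34 amd32 nodes
print(min(counts), max(counts))        # every node gets the same number of ranks
```

With 1088 ranks on 34 nodes the division is exact, so each 32-core node runs one rank per core.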
Contents of the HPL.dat file follow:
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
8            device out (6=stdout,7=stderr,file)
4            # of problems sizes (N)
10000 20000 40000 483072  Ns
1            # of NBs
128          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
32           Ps
34           Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
##### This line (no. 32) is ignored (it serves as a separator). ######
0                               Number of additional problem sizes for PTRANS
1200 10000 30000                values of N
0                               number of additional blocking sizes for PTRANS
40 9 8 13 13 20 16 32 64        values of NB
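The input values above can be cross-checked with a little arithmetic. The sketch below confirms that the P x Q grid matches the 1088 ranks started by mpirun, that the largest N is a multiple of NB, and estimates the storage for the N x N matrix of 8-byte doubles. The variable names are illustrative, and the memory figure is only an estimate that ignores the right-hand side and HPL workspace.

```python
N, NB = 483072, 128              # largest problem size and block size from HPL.dat
P, Q = 32, 34                    # process grid from HPL.dat
NODES, CORES_PER_NODE = 34, 32   # amd32 partition as described at the top

ranks = P * Q
assert ranks == NODES * CORES_PER_NODE == 1088   # grid uses every core exactly once

blocks, remainder = divmod(N, NB)                # number of NB-wide block columns
matrix_bytes = 8 * N * N                         # dense matrix of doubles
per_node_gib = matrix_bytes / NODES / 2**30
per_rank_gib = matrix_bytes / ranks / 2**30

print(f"ranks = {ranks}, block columns = {blocks}, remainder = {remainder}")
print(f"matrix per node ~ {per_node_gib:.1f} GiB, per rank ~ {per_rank_gib:.2f} GiB")
```

Under these assumptions the largest problem occupies roughly 51 GiB of matrix data per node.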
Contents of the HPL.out file follow:
============================================================================
HPLinpack 1.0a  --  High-Performance Linpack benchmark  --  January 20, 2004
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Labs.,  UTK
============================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 10000 20000 40000 483072
NB : 128
PMAP : Row-major process mapping
P : 32
Q : 34
PFACT : Right
NBMIN : 4
NDIV : 2
RFACT : Crout
BCAST : 1ringM
DEPTH : 1
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words
----------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual checks will be computed:
1) ||Ax-b||_oo / ( eps * ||A||_1 * N )
2) ||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 )
3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR11C2R4       10000   128    32    34               2.03          3.278e+02
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0324856 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0076871 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0017182 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR11C2R4       20000   128    32    34               6.35          8.397e+02
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0058409 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0056532 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0010774 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR11C2R4       40000   128    32    34              23.14          1.844e+03
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0042686 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0039030 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0007664 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR11C2R4      483072   128    32    34           11997.38          6.264e+03
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0015761 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0011265 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0001925 ...... PASSED
============================================================================
Finished 4 tests with the following results:
4 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
----------------------------------------------------------------------------
End of Tests.
============================================================================
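The Gflops column can be cross-checked against the N and Time columns. The sketch below parses the four result rows reported above and recomputes the rate, assuming the usual LU operation count of 2/3 N^3 + 3/2 N^2 floating-point operations. The helper names (hpl_gflops, parse_rows) are illustrative and not part of HPL; small differences are expected because Time is printed to only two decimals.

```python
# Result rows copied from the HPL.out listing above.
ROWS = """\
WR11C2R4       10000   128    32    34               2.03          3.278e+02
WR11C2R4       20000   128    32    34               6.35          8.397e+02
WR11C2R4       40000   128    32    34              23.14          1.844e+03
WR11C2R4      483072   128    32    34           11997.38          6.264e+03
"""


def hpl_gflops(n: int, seconds: float) -> float:
    """Assumed operation count (2/3 N^3 + 3/2 N^2) over wall time, in Gflops."""
    return (2.0 / 3.0 * n**3 + 1.5 * n**2) / seconds / 1e9


def parse_rows(text: str) -> list[tuple[str, int, float, float]]:
    """Return (variant, N, time, reported Gflops) for each result row."""
    results = []
    for line in text.splitlines():
        fields = line.split()
        # A result row has 7 fields and starts with the encoded variant (W...).
        if len(fields) == 7 and fields[0].startswith("W"):
            tv, n, _nb, _p, _q, seconds, gflops = fields
            results.append((tv, int(n), float(seconds), float(gflops)))
    return results


for tv, n, seconds, reported in parse_rows(ROWS):
    recomputed = hpl_gflops(n, seconds)
    print(f"{tv}  N={n:>6}  reported={reported:8.1f}  recomputed={recomputed:8.1f}")
```

For the largest run the recomputed value agrees with the reported 6.264e+03 Gflops to the printed precision; the short runs differ slightly because rounding a 2-second time to two decimals matters more.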
Last Updated: 11/22/2011