
CoSP2 Proxy Application: Sparse Linear Algebra for Electronic Structure Calculations

CoSP2 is a proxy application representing a sparse linear algebra parallel algorithm for calculating the density matrix in electronic structure theory. The algorithm is based on a recursive second-order Fermi-operator expansion method (SP2) and is tailored for density-functional-based tight-binding calculations of non-metallic systems. The SP2 algorithm is part of the Los Alamos Transferable Tight-binding for Energetics (LATTE) code and expands the Fermi operator in a recursive series of generalized matrix-matrix multiplications. On shared memory the computational cost scales linearly with system size, which is achieved by using the ELLPACK-R (ELL) sparse matrix data format and OpenMP multi-threading. The CoSP2 implementation uses hierarchical parallelism: MPI between nodes and OpenMP within a node.

We achieve efficient shared-memory parallelism for generalized sparse matrix-matrix multiplications by using the ELL data structure. The ELL format stores a sparse matrix in a regular manner (see Fig. 1a). Each sparse matrix is described by three arrays (a minimal C sketch follows the list):

  1. a 2-dimensional array containing the numerical values,
  2. a 2-dimensional array containing the column indices, and
  3. a vector containing the number of non-zero entries per row.
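
To make the layout concrete, here is a minimal C sketch of an ELL container. The type, field, and function names are illustrative assumptions, not CoSP2's actual data structures.

#include <stdlib.h>

/* Illustrative ELLPACK-R (ELL) container; names are not CoSP2's actual API. */
typedef struct {
    int     nrows;    /* matrix dimension (square matrices)              */
    int     maxnnz;   /* fixed storage width: max non-zeros kept per row */
    double *value;    /* nrows x maxnnz non-zero values, row-major       */
    int    *colidx;   /* nrows x maxnnz column indices, row-major        */
    int    *nnz;      /* actual number of non-zeros stored in each row   */
} EllMatrix;

/* j-th stored non-zero of row i and its column index. */
#define ELL_VAL(A, i, j)  ((A)->value[(size_t)(i) * (A)->maxnnz + (j)])
#define ELL_COL(A, i, j)  ((A)->colidx[(size_t)(i) * (A)->maxnnz + (j)])

static EllMatrix *ell_alloc(int nrows, int maxnnz)
{
    EllMatrix *A = malloc(sizeof *A);
    A->nrows  = nrows;
    A->maxnnz = maxnnz;
    A->value  = calloc((size_t)nrows * maxnnz, sizeof *A->value);
    A->colidx = calloc((size_t)nrows * maxnnz, sizeof *A->colidx);
    A->nnz    = calloc(nrows, sizeof *A->nnz);
    return A;
}

Because every row occupies the same fixed width, the storage is regular and each row can be processed independently, which is what makes threading over rows straightforward.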

[Figure 1a: ELL matrix format]
Fig. 1a. The sparse matrix ELLPACK-R (ELL) representation consists of a 2-D array of the non-zero values in each row, a 2-D array of the column indices, and a vector of the number of non-zeros in each row.
[Figure 1b: ELL matrix-matrix multiplication]
Fig. 1b. Sparse matrix-matrix multiplication is performed in a row-wise fashion. Each non-zero value in a row is multiplied by the non-zero values of the row corresponding to its column index. The products are accumulated in a temporary row buffer according to their column indices. The buffered row values are then thresholded and added to the resulting X² matrix in ELL representation.

The row-wise data storage makes a parallel implementation of sparse matrix-matrix multiplication fairly straightforward. Fig. 1b illustrates the calculation of the sparse matrix-matrix product X², which is the operation governing the cost of the SP2 algorithm. The matrix elements of each row i of X would be multiplied with the corresponding column vector elements; however, because X is symmetric, we instead multiply with the corresponding row vector elements and accumulate the products into a temporary row buffer for row i. The temporary row buffer holds the new row i of X², which is thresholded and stored in the compressed ELL format. On a distributed/shared-memory architecture this scheme can be parallelized over the rows.
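
As an illustration of this row-wise scheme, the following hedged C sketch computes X² with OpenMP threading over rows, using the illustrative EllMatrix layout above. The function name, the per-thread scratch buffers, and the threshold handling are assumptions, not CoSP2's exact code.

#include <stdlib.h>

/* Sketch of computing X2 = X * X row by row with OpenMP, using the
 * illustrative EllMatrix layout above; not CoSP2's actual code. A full
 * implementation would also check that nnz stays below X2->maxnnz. */
void ell_square(const EllMatrix *X, EllMatrix *X2, double threshold)
{
    #pragma omp parallel
    {
        /* per-thread dense scratch row, list of touched columns, and markers */
        double *rowbuf  = calloc(X->nrows, sizeof *rowbuf);
        int    *touched = malloc(X->nrows * sizeof *touched);
        char   *seen    = calloc(X->nrows, 1);

        #pragma omp for schedule(dynamic)
        for (int i = 0; i < X->nrows; i++) {
            int ntouched = 0;

            /* accumulate row i of X*X into the dense scratch row;
               because X is symmetric, row k stands in for column k */
            for (int jp = 0; jp < X->nnz[i]; jp++) {
                int    k = ELL_COL(X, i, jp);
                double a = ELL_VAL(X, i, jp);
                for (int kp = 0; kp < X->nnz[k]; kp++) {
                    int j = ELL_COL(X, k, kp);
                    if (!seen[j]) { seen[j] = 1; touched[ntouched++] = j; }
                    rowbuf[j] += a * ELL_VAL(X, k, kp);
                }
            }

            /* threshold and compress the scratch row into ELL storage */
            int nnz = 0;
            for (int t = 0; t < ntouched; t++) {
                int j = touched[t];
                if (rowbuf[j] > threshold || rowbuf[j] < -threshold) {
                    ELL_VAL(X2, i, nnz) = rowbuf[j];
                    ELL_COL(X2, i, nnz) = j;
                    nnz++;
                }
                rowbuf[j] = 0.0;   /* reset scratch for the next row */
                seen[j]   = 0;
            }
            X2->nnz[i] = nnz;
        }

        free(rowbuf);
        free(touched);
        free(seen);
    }
}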

Pseudocode for the second-order spectral projection (SP2) algorithm is shown below.

Estimate εmax and εmin
X = (εmax·I − H) / (εmax − εmin)
TraceX = Tr[X]
BreakLoop = 0
i = 0
while BreakLoop = 0 do
    i = i + 1
    Exchange halo information between MPI ranks
    TraceXold = TraceX
    Xtmp = X²
    TraceXtmp = Tr[Xtmp]
    if |TraceXtmp − Nocc| − |2·TraceX − TraceXtmp − Nocc| > IdemTol then
        X = 2·X − Xtmp
        TraceX = 2·TraceX − TraceXtmp
    else
        X = Xtmp
        TraceX = TraceXtmp
    end if
    IdemErr_i = |TraceX − TraceXold|
    if IdemErr_(i−2) ≤ IdemErr_i and i > imin then
        BreakLoop = 1
    end if
    Exchange halo chunks of rows across MPI ranks
end while
P = X
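
For orientation, here is a compact C sketch of how this loop might be driven with the ELL kernels sketched earlier. The helpers ell_trace, ell_axpby, and ell_copy are hypothetical (they are not taken from CoSP2), and the MPI halo exchange is reduced to a placeholder comment.

#include <math.h>

/* Hypothetical helpers assumed here (not CoSP2's API):
 *   double ell_trace(const EllMatrix *A);                   sum of diagonal
 *   void   ell_axpby(EllMatrix *X, double a,
 *                    const EllMatrix *Y, double b);         X = a*X + b*Y
 *   void   ell_copy(const EllMatrix *src, EllMatrix *dst);  dst = src
 */
void sp2_loop(EllMatrix *X, EllMatrix *Xtmp, double nocc,
              double idemTol, double threshold, int imin, int imax)
{
    double traceX = ell_trace(X);
    double idemErr = 0.0, idemErr1 = 0.0, idemErr2 = 0.0; /* i, i-1, i-2 */
    int i = 0, breakLoop = 0;

    while (!breakLoop && i < imax) {
        i++;
        /* ... exchange halo rows between MPI ranks here ... */
        double traceXold = traceX;

        ell_square(X, Xtmp, threshold);              /* Xtmp = X^2       */
        double traceXtmp = ell_trace(Xtmp);

        if (fabs(traceXtmp - nocc)
              - fabs(2.0 * traceX - traceXtmp - nocc) > idemTol) {
            ell_axpby(X, 2.0, Xtmp, -1.0);           /* X = 2X - Xtmp    */
            traceX = 2.0 * traceX - traceXtmp;
        } else {
            ell_copy(Xtmp, X);                       /* X = Xtmp         */
            traceX = traceXtmp;
        }

        idemErr2 = idemErr1;
        idemErr1 = idemErr;
        idemErr  = fabs(traceX - traceXold);

        /* stop once the idempotency error is no longer decreasing */
        if (i > imin && idemErr2 <= idemErr)
            breakLoop = 1;
    }
    /* on exit X holds the converged density matrix: P = X */
}
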
The work decomposition across MPI ranks consists of splitting up the Hamiltonian matrix into chunks of rows, where each chunk is assigned to a different rank. Access to halo rows is needed during the computation, requiring an exchange of relevant halo chunks per iteration of the SP2 algorithm.
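
A hedged sketch of such an exchange is shown below. It assumes each rank owns a contiguous chunk of rows, that the matrix dimension divides evenly by the number of ranks, and that a rank's own chunk is already stored in place in the full arrays, so an in-place all-gather can fill in the remaining rows; CoSP2's actual exchange may be restricted to only the halo rows a rank needs.

#include <mpi.h>

/* Share every rank's chunk of ELL rows with all other ranks. */
void exchange_row_chunks(EllMatrix *X)
{
    int nranks;
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    int chunk = X->nrows / nranks;          /* rows owned by each rank */

    /* MPI_IN_PLACE: each rank's own chunk is already in its slot of the
       receive buffers; the all-gather fills in everyone else's rows. */
    MPI_Allgather(MPI_IN_PLACE, 0, MPI_INT,
                  X->nnz, chunk, MPI_INT, MPI_COMM_WORLD);
    MPI_Allgather(MPI_IN_PLACE, 0, MPI_INT,
                  X->colidx, chunk * X->maxnnz, MPI_INT, MPI_COMM_WORLD);
    MPI_Allgather(MPI_IN_PLACE, 0, MPI_DOUBLE,
                  X->value, chunk * X->maxnnz, MPI_DOUBLE, MPI_COMM_WORLD);
}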

[Figure 2: Performance timings for the SP2 algorithm]
Fig. 2. Performance timings for the SP2 algorithm using a polyethylene chain of 1024 molecules are shown for OpenMP (1-16 threads), MPI (single node, 1-16 ranks, 1 thread), and MPI+OpenMP (multi-node, 16 ranks, 4 threads).

Performance timings are shown for a polyethylene chain of 1024 molecules (6144 atoms, 12,288 x 12,288 matrix). Fig. 2 shows timings for 1) OpenMP only with 1-16 threads on a single node; 2) MPI with 1 OpenMP thread, running on 1-16 ranks of one node; and 3) MPI with 4 OpenMP threads, running on 1-16 ranks (as separate nodes). We see similar performance for OpenMP and MPI running on a single node. Multi-node MPI+OpenMP shows improved performance due to hierarchical parallelism.

Background Papers

  1. M. J. Cawkwell, E. J. Sanville, S. M. Mniszewski, A. M. N. Niklasson, "Computing the density matrix in electronic structure theory on graphics processing units," J. Chem. Theory Comput. 8, 4094–4101, 2012.

  2. E. H. Rubensson, A. M. N. Niklasson, "Interior eigenvalues from density matrix expansions in quantum mechanical molecular dynamics," SIAM J. Sci. Comput. 36(2), B147–B170, 2014.

  3. P. Souvatzis, A. M. N. Niklasson, "First principles molecular dynamics without self-consistent field optimization," J. Chem. Phys. 140, 044117, 2014.

  4. M. J. Cawkwell, A. M. N. Niklasson, "Energy conserving, linear scaling Born-Oppenheimer molecular dynamics," J. Chem. Phys. 137, 134105, 2012.

  5. A. M. N. Niklasson, P. Steneteg, N. Bock, "Extended Lagrangian free energy molecular dynamics," J. Chem. Phys. 135, 164111, 2011.

  6. A. M. N. Niklasson, "Extended Born-Oppenheimer molecular dynamics," Phys. Rev. Lett. 100, 123004, 2008.

  7. A. M. N. Niklasson, "Expansion algorithm for the density matrix," Phys. Rev. B 66, 155115, 2002.

Most of our code is released as open source. Visit the ExMatEx GitHub site.