
CoSP2 Proxy Application: Sparse Linear Algebra for Electronic Structure Calculations

CoSP2 is a proxy application representing a sparse linear algebra parallel algorithm for calculating the density matrix in electronic structure theory. The algorithm is based on a recursive second-order Fermi-operator expansion method (SP2) and is tailored for density-functional-based tight-binding calculations of non-metallic systems. The SP2 algorithm is part of the Los Alamos Transferable Tight-binding for Energetics (LATTE) code and expands the Fermi operator in a recursive series of generalized matrix-matrix multiplications. On shared memory the computational cost scales linearly with system size, which is achieved by using the ELLPACK-R (ELL) sparse matrix data format and OpenMP multi-threading. The CoSP2 implementation uses hierarchical parallelism: MPI between nodes and OpenMP within a node.

We achieve efficient shared-memory parallelism for generalized sparse matrix-matrix multiplications by using the ELL data structure. The ELL format stores a sparse matrix in a regular manner (see Fig. 1a). Each sparse matrix is described by three arrays (a minimal C sketch follows the list):

  1. a 2-dimensional array containing the numerical values,
  2. a 2-dimensional array containing the column indices, and
  3. a vector containing the number of non-zero entries per row.
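
To make the layout concrete, here is a minimal C sketch of an ELL container. The type, field, and function names are illustrative assumptions, not CoSP2's actual data structures.

#include <stdlib.h>

/* Illustrative ELLPACK-R (ELL) container; names are not CoSP2's actual API. */
typedef struct {
    int     nrows;    /* matrix dimension (square matrices)              */
    int     maxnnz;   /* fixed storage width: max non-zeros kept per row */
    double *value;    /* nrows x maxnnz non-zero values, row-major       */
    int    *colidx;   /* nrows x maxnnz column indices, row-major        */
    int    *nnz;      /* actual number of non-zeros stored in each row   */
} EllMatrix;

/* j-th stored non-zero of row i and its column index. */
#define ELL_VAL(A, i, j)  ((A)->value[(size_t)(i) * (A)->maxnnz + (j)])
#define ELL_COL(A, i, j)  ((A)->colidx[(size_t)(i) * (A)->maxnnz + (j)])

static EllMatrix *ell_alloc(int nrows, int maxnnz)
{
    EllMatrix *A = malloc(sizeof *A);
    A->nrows  = nrows;
    A->maxnnz = maxnnz;
    A->value  = calloc((size_t)nrows * maxnnz, sizeof *A->value);
    A->colidx = calloc((size_t)nrows * maxnnz, sizeof *A->colidx);
    A->nnz    = calloc(nrows, sizeof *A->nnz);
    return A;
}

Because every row occupies the same fixed width, the storage is regular and each row can be processed independently, which is what makes threading over rows straightforward.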

[Figure 1a: ELL matrix format]
Fig. 1a. The sparse matrix ELLPACK-R (ELL) representation consists of a 2-D array of the non-zero values in each row, a 2-D array of the column indices, and a vector of the number of non-zeros in each row.
[Figure 1b: ELL matrix-matrix multiplication]
Fig. 1b. Sparse matrix-matrix multiplication is performed in a row-wise fashion. Each non-zero value in a row is multiplied by the non-zero values of the row corresponding to its column index. The products are accumulated in a temporary row buffer according to their column indices. The buffered row values are then thresholded and added to the resulting X² matrix in ELL representation.

The row-wise data storage makes a parallel implementation of sparse matrix-matrix multiplication fairly straightforward. Fig. 1b illustrates the calculation of the sparse matrix-matrix product X², which is the operation governing the cost of the SP2 algorithm. The matrix elements of each row i of X would be multiplied with the corresponding column vector elements; however, because X is symmetric, we instead multiply with the corresponding row vector elements and accumulate the products into a temporary row buffer for row i. The temporary row buffer holds the new row i of X², which is thresholded and stored in the compressed ELL format. On a distributed/shared-memory architecture this scheme can be parallelized over the rows.
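
As an illustration of this row-wise scheme, the following hedged C sketch computes X² with OpenMP threading over rows, using the illustrative EllMatrix layout above. The function name, the per-thread scratch buffers, and the threshold handling are assumptions, not CoSP2's exact code.

#include <stdlib.h>

/* Sketch of computing X2 = X * X row by row with OpenMP, using the
 * illustrative EllMatrix layout above; not CoSP2's actual code. A full
 * implementation would also check that nnz stays below X2->maxnnz. */
void ell_square(const EllMatrix *X, EllMatrix *X2, double threshold)
{
    #pragma omp parallel
    {
        /* per-thread dense scratch row, list of touched columns, and markers */
        double *rowbuf  = calloc(X->nrows, sizeof *rowbuf);
        int    *touched = malloc(X->nrows * sizeof *touched);
        char   *seen    = calloc(X->nrows, 1);

        #pragma omp for schedule(dynamic)
        for (int i = 0; i < X->nrows; i++) {
            int ntouched = 0;

            /* accumulate row i of X*X into the dense scratch row;
               because X is symmetric, row k stands in for column k */
            for (int jp = 0; jp < X->nnz[i]; jp++) {
                int    k = ELL_COL(X, i, jp);
                double a = ELL_VAL(X, i, jp);
                for (int kp = 0; kp < X->nnz[k]; kp++) {
                    int j = ELL_COL(X, k, kp);
                    if (!seen[j]) { seen[j] = 1; touched[ntouched++] = j; }
                    rowbuf[j] += a * ELL_VAL(X, k, kp);
                }
            }

            /* threshold and compress the scratch row into ELL storage */
            int nnz = 0;
            for (int t = 0; t < ntouched; t++) {
                int j = touched[t];
                if (rowbuf[j] > threshold || rowbuf[j] < -threshold) {
                    ELL_VAL(X2, i, nnz) = rowbuf[j];
                    ELL_COL(X2, i, nnz) = j;
                    nnz++;
                }
                rowbuf[j] = 0.0;   /* reset scratch for the next row */
                seen[j]   = 0;
            }
            X2->nnz[i] = nnz;
        }

        free(rowbuf);
        free(touched);
        free(seen);
    }
}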

Pseudocode for the second-order spectral projection (SP2) algorithm is shown below.

Estimate εmax and εmin
X = (εmax·I − H) / (εmax − εmin)
TraceX = Tr[X]
BreakLoop = 0
i = 0
while BreakLoop = 0 do
    i = i + 1
    Exchange halo information between MPI ranks
    TraceXold = TraceX
    Xtmp = X²
    TraceXtmp = Tr[Xtmp]
    if |TraceXtmp − Nocc| − |2·TraceX − TraceXtmp − Nocc| > IdemTol then
        X = 2·X − Xtmp
        TraceX = 2·TraceX − TraceXtmp
    else
        X = Xtmp
        TraceX = TraceXtmp
    end if
    IdemErr_i = |TraceX − TraceXold|
    if IdemErr_(i−2) ≤ IdemErr_i and i > imin then
        BreakLoop = 1
    end if
    Exchange halo chunks of rows across MPI ranks
end while
P = X
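
For orientation, here is a compact C sketch of how this loop might be driven with the ELL kernels sketched earlier. The helpers ell_trace, ell_axpby, and ell_copy are hypothetical (they are not taken from CoSP2), and the MPI halo exchange is reduced to a placeholder comment.

#include <math.h>

/* Hypothetical helpers assumed here (not CoSP2's API):
 *   double ell_trace(const EllMatrix *A);                   sum of diagonal
 *   void   ell_axpby(EllMatrix *X, double a,
 *                    const EllMatrix *Y, double b);         X = a*X + b*Y
 *   void   ell_copy(const EllMatrix *src, EllMatrix *dst);  dst = src
 */
void sp2_loop(EllMatrix *X, EllMatrix *Xtmp, double nocc,
              double idemTol, double threshold, int imin, int imax)
{
    double traceX = ell_trace(X);
    double idemErr = 0.0, idemErr1 = 0.0, idemErr2 = 0.0; /* i, i-1, i-2 */
    int i = 0, breakLoop = 0;

    while (!breakLoop && i < imax) {
        i++;
        /* ... exchange halo rows between MPI ranks here ... */
        double traceXold = traceX;

        ell_square(X, Xtmp, threshold);              /* Xtmp = X^2       */
        double traceXtmp = ell_trace(Xtmp);

        if (fabs(traceXtmp - nocc)
              - fabs(2.0 * traceX - traceXtmp - nocc) > idemTol) {
            ell_axpby(X, 2.0, Xtmp, -1.0);           /* X = 2X - Xtmp    */
            traceX = 2.0 * traceX - traceXtmp;
        } else {
            ell_copy(Xtmp, X);                       /* X = Xtmp         */
            traceX = traceXtmp;
        }

        idemErr2 = idemErr1;
        idemErr1 = idemErr;
        idemErr  = fabs(traceX - traceXold);

        /* stop once the idempotency error is no longer decreasing */
        if (i > imin && idemErr2 <= idemErr)
            breakLoop = 1;
    }
    /* on exit X holds the converged density matrix: P = X */
}
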
The work decomposition across MPI ranks consists of splitting up the Hamiltonian matrix into chunks of rows, where each chunk is assigned to a different rank. Access to halo rows is needed during the computation, requiring an exchange of relevant halo chunks per iteration of the SP2 algorithm.
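
A hedged sketch of such an exchange is shown below. It assumes each rank owns a contiguous chunk of rows, that the matrix dimension divides evenly by the number of ranks, and that a rank's own chunk is already stored in place in the full arrays, so an in-place all-gather can fill in the remaining rows; CoSP2's actual exchange may be restricted to only the halo rows a rank needs.

#include <mpi.h>

/* Share every rank's chunk of ELL rows with all other ranks. */
void exchange_row_chunks(EllMatrix *X)
{
    int nranks;
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    int chunk = X->nrows / nranks;          /* rows owned by each rank */

    /* MPI_IN_PLACE: each rank's own chunk is already in its slot of the
       receive buffers; the all-gather fills in everyone else's rows. */
    MPI_Allgather(MPI_IN_PLACE, 0, MPI_INT,
                  X->nnz, chunk, MPI_INT, MPI_COMM_WORLD);
    MPI_Allgather(MPI_IN_PLACE, 0, MPI_INT,
                  X->colidx, chunk * X->maxnnz, MPI_INT, MPI_COMM_WORLD);
    MPI_Allgather(MPI_IN_PLACE, 0, MPI_DOUBLE,
                  X->value, chunk * X->maxnnz, MPI_DOUBLE, MPI_COMM_WORLD);
}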

[Figure 2: Performance timings for the SP2 algorithm]
Fig. 2. Performance timings for the SP2 algorithm using a polyethylene chain of 1024 molecules are shown for OpenMP (1-16 threads), MPI (single node, 1-16 ranks, 1 thread), and MPI+OpenMP (multi-node, 16 ranks, 4 threads).

Performance timings are shown for a polyethylene chain of 1024 molecules (6144 atoms, 12,288 x 12,288 matrix). Fig. 2 shows timings for 1) OpenMP only with 1-16 threads on a single node; 2) MPI with 1 OpenMP thread, running on 1-16 ranks of one node; and 3) MPI with 4 OpenMP threads, running on 1-16 ranks (as separate nodes). We see similar performance for OpenMP and MPI running on a single node. Multi-node MPI+OpenMP shows improved performance due to hierarchical parallelism.

Background Papers

  1. M. J. Cawkwell, E. J. Sanville, S. M. Mniszewski, A. M. N. Niklasson, "Computing the density matrix in electronic structure theory on graphics processing units," J. Chem. Theory Comput. 8, 4094–4101, 2012.

  2. E. H. Rubensson, A. M. N. Niklasson, "Interior eigenvalues from density matrix expansions in quantum mechanical molecular dynamics," SIAM J. Sci. Comput. 36(2), B147–B170, 2014.

  3. P. Souvatzis, A. M. N. Niklasson, "First principles molecular dynamics without self-consistent field optimization," J. Chem. Phys. 140, 044117, 2014.

  4. M. J. Cawkwell, A. M. N. Niklasson, "Energy conserving, linear scaling Born-Oppenheimer molecular dynamics," J. Chem. Phys. 137, 134105, 2012.

  5. A. M. N. Niklasson, P. Steneteg, N. Bock, "Extended Lagrangian free energy molecular dynamics," J. Chem. Phys. 135, 164111, 2011.

  6. A. M. N. Niklasson, "Extended Born-Oppenheimer molecular dynamics," Phys. Rev. Lett. 100, 123004, 2008.

  7. A. M. N. Niklasson, "Expansion algorithm for the density matrix," Phys. Rev. B 66, 155115, 2002.

Most of our code is released as open source. Visit the ExMatEx GitHub site.