The Hamiltonian consists of the sum of the kinetic and potential operators. Within the pseudopotential approximation [1], the momentum-space formalism [2], in which plane-waves are used as the basis set for the eigenfunctions, has become widespread. In momentum-space, the kinetic operator is diagonal and hence trivially applied to a trial eigenvector. However, the operation of the potential, a multiplication in real-space, is an expensive convolution in momentum-space. The fast Fourier transform (FFT) [3] enables the trial eigenvector to be transformed cheaply between real- and momentum-space so that the potential may also be applied efficiently. The FFT is thus crucial to the viability of the momentum-space formalism. Application of the Hamiltonian matrix to a trial eigenvector thus involves two FFTs, and in most applications it is the FFT which dominates the computational effort.
The implementation of the momentum-space method on parallel computers involves the distribution of plane-waves across the nodes [4]. To perform an FFT, it is thus necessary for all of the nodes to communicate with each other. Even in the largest electronic structure calculations which can be performed with current parallel computers, the 3D FFT grid is relatively small (typically ), as compared with those found in other applications. The amount of data sent between nodes is thus small, so that the latency cost of the communication can be significant, and this prevents the method from scaling well up to very large numbers of nodes.
In this paper, we present a new method for performing FFTs on parallel computers which minimises the latency cost and thus offers the prospect of scaling plane-wave electronic structure calculations up to a larger number of nodes than is currently possible. This method will be most significant in enabling electronic structure calculations to be performed on relatively inexpensive parallel computers consisting of a large number of networked workstations, which are growing in popularity but which also suffer from high latencies.
The rest of the paper is organised as follows: in Sec. 2 we outline those aspects of the derivation of the FFT which are relevant to our description of the new parallel method. In Sec. 3 we briefly describe the current implementation and in Sec. 4 we turn to a description of the new method. The costs of each method and the conditions under which we expect them to be most efficient are compared in Sec. 5. In Sec. 6 we present results for implementations of both methods. We discuss the issue of load balancing in Sec. 7 and in Sec. 8 we draw our conclusions.