The distribution of data in the new method is illustrated in
Fig. 2, again for a
grid and 4
nodes. This distribution is motivated by the Danielson-Lanczos lemma.
Each node now deals with a set of
-planes. For simplicity,
consider first applying the Danielson-Lanczos lemma recursively to a
1D-FFT. In the first step, the data is partitioned into even and odd
elements. In the second step, each of these two sets of data is
partitioned into even and odd elements again. Thus one obtains four
sets of data: even-even, even-odd, odd-even and odd-odd elemets. In
the case of four nodes, we would therefore assign one of these sets to
each node, to obtain the pattern shown in Fig. 2. Since
the three 1D-FFTs in the
-,
- and
-directions in
Eq. 3 commute, this distribution is quite flexible, and
an alternative is shown in Fig. 3 in which the
Danielson-Lanczos lemma has been applied in both the
- and
-directions.
Thus, instead of distributing data as rods as in the traditional
method, each node deals with a set of data which is picked uniformly
from the full FFT grid. In the case of a distribution across
nodes, each node thus deals with one of the sets of data which would
result after
applications of the Danielson-Lanczos
lemma. Hence, in order to perform the FFT from momentum-space to
real-space, each node first performs a local 3D-FFT on its data. It
then remains for the nodes to communicate in order to combine this
data to complete the full FFT (i.e. to perform the multiplication by a
phase factor and addition shown in the last line of Eq. 2). When complete, each node possesses a single contiguous
block of real-space data (right of Figs. 2 and
3). Again, the DFT from real- to momentum-space is
performed by reversing the above operations.
Another advantage of the new method is that the Hamiltonian
matrix is now effectively blocked. The FFT permits the application of
the Hamiltonian on state vectors in operations. The new
method distributes the data such that each block of the matrix can be
applied in
. This opens up the
possibility of applying iterative diagonalisation algorithms for block
matrices to the electronic structure problem to further decrease the
amount of communication.