In the following analysis, we assume that the cost of communicating a packet of data from one node to another consists of two parts. The first part is simply the time taken to transmit the data between the nodes and depends upon the bandwidth $B$ of the connection and the size of the data packet. The second part is a fixed overhead, the latency $t_{\mathrm{lat}}$, which is independent of the size of the data packet. We define $t_{\mathrm{lat}}$ and $B$ for the situation in which a node is sending a data packet to another node and simultaneously receiving a data packet of the same size from another node.
In the traditional method, each of the two communication phases
generally involves each node communicating with every other node. This
can be achieved in $P-1$ phases in which all of the $P$ nodes
simultaneously send a data packet to one node and receive a packet
from another (usually different) node. The total latency cost for the
traditional parallel DFT is thus
$2(P-1)\,t_{\mathrm{lat}}$. We define $N$
to be the total number of data elements in the full FFT
grid. In the traditional method, each data packet has a size of $Ns/P^2$, where $s$ is the size of a single data element (typically 16 bytes
for a double-precision complex data type). The total communication
time involved in the traditional method,
$T_{\mathrm{trad}}$, is
thus:
\begin{equation}
T_{\mathrm{trad}} = 2(P-1)\left( t_{\mathrm{lat}} + \frac{Ns}{P^2 B} \right) .
\end{equation}
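This cost model is straightforward to evaluate numerically. The sketch below is purely illustrative: the hardware parameters (a latency of 160 microseconds and a bandwidth of 12.5 Mbytes/s, roughly 100 Mbit ethernet) are hypothetical values, not measurements from the text.

```python
# Illustrative cost model for the traditional method. Sending one packet
# of m bytes is assumed to cost t_lat + m / B seconds; the default
# parameter values below are hypothetical, not measured.

def t_traditional(P, N, s=16.0, t_lat=160e-6, B=12.5e6):
    """Total communication time of the traditional method:
    2*(P-1) phases, each exchanging packets of N*s/P**2 bytes."""
    packet = N * s / P**2                      # bytes per packet
    return 2 * (P - 1) * (t_lat + packet / B)  # seconds

# Example: a 64^3 FFT grid distributed over 16 nodes.
print(t_traditional(P=16, N=64**3))
```

Note how the packet size shrinks as $P^{-2}$: for large node counts the fixed latency term quickly dominates each of the $2(P-1)$ exchanges.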
In the new method, there are $\log_2 P$ communication phases in
which each node communicates with just one other node. The total
latency cost for this method is thus
$t_{\mathrm{lat}} \log_2 P$. However, in
this method, the size of the data packets exchanged is $Ns/2P$, a
factor of $P/2$ larger than in the traditional method. The total
communication time for the new method,
$T_{\mathrm{new}}$, is then:
\begin{equation}
T_{\mathrm{new}} = \log_2 P \left( t_{\mathrm{lat}} + \frac{Ns}{2PB} \right) .
\end{equation}
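The new method's cost can be sketched in the same way. As before, the default hardware parameters are hypothetical placeholders chosen only for illustration.

```python
import math

# Illustrative cost model for the new method: log2(P) phases, each node
# exchanging one packet of N*s/(2*P) bytes with a single partner.
# Default parameter values are hypothetical, not measured.

def t_new(P, N, s=16.0, t_lat=160e-6, B=12.5e6):
    """Total communication time of the new method."""
    phases = math.log2(P)           # number of pairwise-exchange phases
    packet = N * s / (2 * P)        # bytes per packet
    return phases * (t_lat + packet / B)

# Same example as before: a 64^3 FFT grid on 16 nodes.
print(t_new(P=16, N=64**3))
```

Here the latency term grows only as $\log_2 P$, while the per-packet transmission term shrinks only as $P^{-1}$, which is the trade-off discussed below.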
The new method has a lower latency cost, due to the smaller number of data packets sent, but a higher transmission cost due to the larger size of those data packets. We therefore expect the new method to be advantageous in the limit of a large number of processors $P$. In this limit, many processors need to communicate with each other, but the packets they exchange are very small, so the latency cost dominates. Since the latency cost of the new method increases only logarithmically with $P$ instead of linearly, it should scale to a larger number of processors.
The ``cross-over'' point at which the new method performs more efficiently than the traditional method depends upon the hardware (and the low-level software libraries which interface to it), parametrised by $t_{\mathrm{lat}}$ and $B$, and the size of the FFT grid $N$. The smaller the FFT grid, the more competitive the new method will be. It is for this reason that the method may be most useful in electronic structure calculations, in which the FFT grid sizes are relatively small.
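The dependence of the cross-over on latency, bandwidth, and grid size can be illustrated numerically. In the sketch below, both parameter sets are hypothetical, chosen only to show how the winner flips between a latency-dominated regime (small grid, high latency) and a transmission-dominated one (large grid, low latency).

```python
import math

# Hypothetical parameters for two regimes; all values are illustrative.
S = 16.0  # bytes per double-precision complex element

def t_traditional(P, N, t_lat, B):
    # 2*(P-1) phases, packets of N*S/P**2 bytes each.
    return 2 * (P - 1) * (t_lat + N * S / (P**2 * B))

def t_new(P, N, t_lat, B):
    # log2(P) phases, packets of N*S/(2*P) bytes each.
    return math.log2(P) * (t_lat + N * S / (2 * P * B))

P = 64

# High-latency cluster, small FFT grid: latency dominates, new method wins.
cluster = dict(t_lat=160e-6, B=12.5e6)
print(t_new(P, 32**3, **cluster) < t_traditional(P, 32**3, **cluster))   # True

# Low-latency machine, large FFT grid: transmission dominates, traditional wins.
fast = dict(t_lat=5e-6, B=12.5e6)
print(t_new(P, 128**3, **fast) < t_traditional(P, 128**3, **fast))       # False
```

This matches the qualitative argument in the text: shrinking the grid (or raising the latency) pushes the cross-over towards smaller node counts, favouring the new method.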
The product $t_{\mathrm{lat}} B$ defines the packet size which costs as much in latency as in transmission to send. On a cluster of PCs connected by 100 Mbit ethernet this product is of the order of 2 Kbytes. We measured a very similar value for a 64-node SGI Origin 2000 supercomputer. In both cases, the generic Message-Passing Interface (MPI) was used. However, use of lower-level vendor-specific libraries on the Origin would reduce the latency and hence the product. We therefore anticipate that the new method may be more suitable for implementation on clusters of workstations than on supercomputers designed and built to run programs in parallel.
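The quoted product of roughly 2 Kbytes can be checked against the bandwidth of 100 Mbit ethernet. The latency figure below is back-solved from that product (100 Mbit/s is 12.5 Mbytes/s), so it is an implied value rather than an independent measurement.

```python
# 100 Mbit ethernet carries 12.5e6 bytes/s. A ~2 Kbyte latency-bandwidth
# product then implies a latency of about 160 microseconds; this latency
# is inferred from the quoted product, not measured independently.
B = 12.5e6       # bytes/s
t_lat = 160e-6   # s (implied by the ~2 Kbyte product)

# Packet size for which latency cost equals transmission cost:
m_star = t_lat * B
print(m_star)    # 2000.0 bytes, i.e. about 2 Kbytes

# For a packet of exactly this size the two cost contributions match.
assert abs(m_star / B - t_lat) < 1e-12
```

Packets much smaller than $t_{\mathrm{lat}} B$ are latency-bound, packets much larger are bandwidth-bound, which is why lowering the latency (e.g. via vendor-specific libraries) shifts the balance back towards the traditional method.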