International
Tables for
Crystallography
Volume B
Reciprocal space
Edited by U. Shmueli

International Tables for Crystallography (2006). Vol. B. ch. 1.3, p. 58   | 1 | 2 |

Section 1.3.3.3.3.2. Computer architecture considerations

G. Bricognea

aMRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, England, and LURE, Bâtiment 209D, Université Paris-Sud, 91405 Orsay, France

1.3.3.3.3.2. Computer architecture considerations

| top | pdf |

To obtain a truly useful measure of the computational complexity of a DFT algorithm, its arithmetic operation count must be tempered by computer architecture considerations. Three main types of trade-offs must be borne in mind:

  • (i) reductions in floating-point (f.p.) arithmetic count are obtained by reindexing, hence at the cost of an increase in integer arithmetic on addresses, although some shortcuts may be found (Uhrich, 1969[link]; Burrus & Eschenbacher, 1981[link]);

  • (ii) reduction in the f.p. multiplication count usually leads to a large increase in the f.p. addition count (Morris, 1978[link]);

  • (iii) nesting can increase execution speed, but causes a loss of modularity and hence complicates program development (Silverman, 1977[link]; Kolba & Parks, 1977[link]).

Many of the mathematical developments above took place in the context of single-processor serial computers, where f.p. addition is substantially cheaper than f.p. multiplication but where integer address arithmetic has to compete with f.p. arithmetic for processor cycles. As a result, the alternatives to the Cooley–Tukey algorithm hardly ever led to particularly favourable trade-offs, thus creating the impression that there was little to gain by switching to more exotic algorithms.

The advent of new machine architectures with vector and/or parallel processing features has greatly altered this picture (Pease, 1968[link]; Korn & Lambiotte, 1979[link]; Fornberg, 1981[link]; Swartzrauber, 1984[link]):

  • (i) pipelining equalizes the cost of f.p. addition and f.p. multiplication, and the ideal `blend' of the two types of operations depends solely on the number of adder and multiplier units available in each machine;

  • (ii) integer address arithmetic is delegated to specialized arithmetic and logical units (ALUs) operating concurrently with the f.p. units, so that complex reindexing schemes may be used without loss of overall efficiency.

Another major consideration is that of data flow [see e.g. Nawab & McClellan (1979)[link]]. Serial machines only have few registers and few paths connecting them, and allow little or no overlap between computation and data movement. New architectures, on the other hand, comprise banks of vector registers (or `cache memory') besides the usual internal registers, and dedicated ALUs can service data transfers between several of them simultaneously and concurrently with computation.

In this new context, the devices described in Sections 1.3.3.2[link] and 1.3.3.3[link] for altering the balance between the various types of arithmetic operations, and reshaping the data flow during the computation, are invaluable. The field of machine-dependent DFT algorithm design is thriving on them [see e.g. Temperton (1983a[link],b[link],c[link], 1985[link]); Agarwal & Cooley (1986[link], 1987[link])].

References

First citationAgarwal, R. C. & Cooley, J. W. (1986). Fourier transform and convolution subroutines for the IBM 3090 Vector facility. IBM J. Res. Dev. 30, 145–162.Google Scholar
First citationAgarwal, R. C. & Cooley, J. W. (1987). Vectorized mixed radix discrete Fourier transform algorithms. Proc. IEEE, 75, 1283–1292.Google Scholar
First citationBurrus, C. S. & Eschenbacher, P. W. (1981). An in-place, in-order prime factor FFT algorithm. IEEE Trans. Acoust. Speech Signal Process. 29, 806–817.Google Scholar
First citationFornberg, A. (1981). A vector implementation of the fast Fourier transform algorithm. Math. Comput. 36, 189–191.Google Scholar
First citationKolba, D. P. & Parks, T. W. (1977). A prime factor FFT algorithm using high-speed convolution. IEEE Trans. Acoust. Speech Signal Process. 25, 281–294.Google Scholar
First citationKorn, D. G. & Lambiotte, J. J. Jr (1979). Computing the fast Fourier transform on a vector computer. Math. Comput. 33, 977–992.Google Scholar
First citationMorris, R. L. (1978). A comparative study of time efficient FFT and WFTA programs for general purpose computers. IEEE Trans. Acoust. Speech Signal Process. 26, 141–150.Google Scholar
First citationNawab, H. & McClellan, J. H. (1979). Bounds on the minimum number of data transfers in WFTA and FFT programs. IEEE Trans. Acoust. Speech Signal Process. 27, 393–398.Google Scholar
First citationPease, M. C. (1968). An adaptation of the fast Fourier transform for parallel processing. J. Assoc. Comput. Mach. 15, 252–264.Google Scholar
First citationSilverman, H. F. (1977). An introduction to programming the Winograd Fourier transform algorithm (WFTA). IEEE Trans. Acoust. Speech Signal Process. 25, 152–165.Google Scholar
First citationSwartzrauber, P. N. (1984). FFT algorithms for vector computers. Parallel Comput. 1, 45–63.Google Scholar
First citationTemperton, C. (1983a). Self-sorting mixed-radix fast Fourier transforms. J. Comput. Phys. 52, 1–23.Google Scholar
First citationTemperton, C. (1983b). A note on the prime factor FFT algorithm. J. Comput. Phys. 52, 198–204.Google Scholar
First citationTemperton, C. (1983c). Fast mixed-radix real Fourier transforms. J. Comput. Phys. 52, 340–350.Google Scholar
First citationTemperton, C. (1985). Implementation of a self-sorting in place prime factor FFT algorithm. J. Comput. Phys. 58, 283–299.Google Scholar
First citationUhrich, M. L. (1969). Fast Fourier transforms without sorting. IEEE Trans. AU, 17, 170–172.Google Scholar








































to end of page
to top of page