International
Tables for Crystallography Volume B Reciprocal space Edited by U. Shmueli © International Union of Crystallography 2006 
International Tables for Crystallography (2006). Vol. B. ch. 1.3, p. 58

To obtain a truly useful measure of the computational complexity of a DFT algorithm, its arithmetic operation count must be tempered by computer architecture considerations. Three main types of tradeoffs must be borne in mind:
Many of the mathematical developments above took place in the context of singleprocessor serial computers, where f.p. addition is substantially cheaper than f.p. multiplication but where integer address arithmetic has to compete with f.p. arithmetic for processor cycles. As a result, the alternatives to the Cooley–Tukey algorithm hardly ever led to particularly favourable tradeoffs, thus creating the impression that there was little to gain by switching to more exotic algorithms.
The advent of new machine architectures with vector and/or parallel processing features has greatly altered this picture (Pease, 1968; Korn & Lambiotte, 1979; Fornberg, 1981; Swartzrauber, 1984):
Another major consideration is that of data flow [see e.g. Nawab & McClellan (1979)]. Serial machines only have few registers and few paths connecting them, and allow little or no overlap between computation and data movement. New architectures, on the other hand, comprise banks of vector registers (or `cache memory') besides the usual internal registers, and dedicated ALUs can service data transfers between several of them simultaneously and concurrently with computation.
In this new context, the devices described in Sections 1.3.3.2 and 1.3.3.3 for altering the balance between the various types of arithmetic operations, and reshaping the data flow during the computation, are invaluable. The field of machinedependent DFT algorithm design is thriving on them [see e.g. Temperton (1983a,b,c, 1985); Agarwal & Cooley (1986, 1987)].
References
Agarwal, R. C. & Cooley, J. W. (1986). Fourier transform and convolution subroutines for the IBM 3090 Vector facility. IBM J. Res. Dev. 30, 145–162.Google ScholarAgarwal, R. C. & Cooley, J. W. (1987). Vectorized mixed radix discrete Fourier transform algorithms. Proc. IEEE, 75, 1283–1292.Google Scholar
Burrus, C. S. & Eschenbacher, P. W. (1981). An inplace, inorder prime factor FFT algorithm. IEEE Trans. Acoust. Speech Signal Process. 29, 806–817.Google Scholar
Fornberg, A. (1981). A vector implementation of the fast Fourier transform algorithm. Math. Comput. 36, 189–191.Google Scholar
Kolba, D. P. & Parks, T. W. (1977). A prime factor FFT algorithm using highspeed convolution. IEEE Trans. Acoust. Speech Signal Process. 25, 281–294.Google Scholar
Korn, D. G. & Lambiotte, J. J. Jr (1979). Computing the fast Fourier transform on a vector computer. Math. Comput. 33, 977–992.Google Scholar
Morris, R. L. (1978). A comparative study of time efficient FFT and WFTA programs for general purpose computers. IEEE Trans. Acoust. Speech Signal Process. 26, 141–150.Google Scholar
Nawab, H. & McClellan, J. H. (1979). Bounds on the minimum number of data transfers in WFTA and FFT programs. IEEE Trans. Acoust. Speech Signal Process. 27, 393–398.Google Scholar
Pease, M. C. (1968). An adaptation of the fast Fourier transform for parallel processing. J. Assoc. Comput. Mach. 15, 252–264.Google Scholar
Silverman, H. F. (1977). An introduction to programming the Winograd Fourier transform algorithm (WFTA). IEEE Trans. Acoust. Speech Signal Process. 25, 152–165.Google Scholar
Swartzrauber, P. N. (1984). FFT algorithms for vector computers. Parallel Comput. 1, 45–63.Google Scholar
Temperton, C. (1983a). Selfsorting mixedradix fast Fourier transforms. J. Comput. Phys. 52, 1–23.Google Scholar
Temperton, C. (1983b). A note on the prime factor FFT algorithm. J. Comput. Phys. 52, 198–204.Google Scholar
Temperton, C. (1983c). Fast mixedradix real Fourier transforms. J. Comput. Phys. 52, 340–350.Google Scholar
Temperton, C. (1985). Implementation of a selfsorting in place prime factor FFT algorithm. J. Comput. Phys. 58, 283–299.Google Scholar
Uhrich, M. L. (1969). Fast Fourier transforms without sorting. IEEE Trans. AU, 17, 170–172.Google Scholar