Numerical computation of the discrete Fourier transform

Bricogne, G.

doi:10.1107/97809553602060000551

International
Tables for
Crystallography
Volume B
Reciprocal space
Edited by U. Shmueli

pdf | chapter contents | chapter index | related articles

International Tables for Crystallography (2006). Vol. B. ch. 1.3, pp. 49-58 | 1 | 2 |

Section 1.3.3. Numerical computation of the discrete Fourier transform

G. Bricogne^a

^a MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, England, and LURE, Bâtiment 209D, Université Paris-Sud, 91405 Orsay, France

1.3.3. Numerical computation of the discrete Fourier transform

| top | pdf |

1.3.3.1. Introduction

| top | pdf |

The Fourier transformation's most remarkable property is undoubtedly that of turning convolution into multiplication. As distribution theory has shown, other valuable properties – such as the shift property, the conversion of differentiation into multiplication by monomials, and the duality between periodicity and sampling – are special instances of the convolution theorem.

This property is exploited in many areas of applied mathematics and engineering (Campbell & Foster, 1948; Sneddon, 1951; Champeney, 1973; Bracewell, 1986). For example, the passing of a signal through a linear filter, which results in its being convolved with the response of the filter to a δ-function `impulse', may be modelled as a multiplication of the signal's transform by the transform of the impulse response (also called transfer function). Similarly, the solution of systems of partial differential equations may be turned by Fourier transformation into a division problem for distributions. In both cases, the formulations obtained after Fourier transformation are considerably simpler than the initial ones, and lend themselves to constructive solution techniques.

Whenever the functions to which the Fourier transform is applied are band-limited, or can be well approximated by band-limited functions, the discrete Fourier transform (DFT) provides a means of constructing explicit numerical solutions to the problems at hand. A great variety of investigations in physics, engineering and applied mathematics thus lead to DFT calculations, to such a degree that, at the time of writing, about 50% of all supercomputer CPU time is alleged to be spent calculating DFTs.

The straightforward use of the defining formulae for the DFT leads to calculations of size $[N^{2}]$ for N sample points, which become unfeasible for any but the smallest problems. Much ingenuity has therefore been exerted on the design and implementation of faster algorithms for calculating the DFT (McClellan & Rader, 1979; Nussbaumer, 1981; Blahut, 1985; Brigham, 1988). The most famous is that of Cooley & Tukey (1965) which heralded the age of digital signal processing. However, it had been preceded by the prime factor algorithm of Good (1958, 1960), which has lately been the basis of many new developments. Recent historical research (Goldstine, 1977, pp. 249–253; Heideman et al., 1984) has shown that Gauss essentially knew the Cooley–Tukey algorithm as early as 1805 (before Fourier's 1807 work on harmonic analysis!); while it has long been clear that Dirichlet knew of the basis of the prime factor algorithm and used it extensively in his theory of multiplicative characters [see e.g. Chapter I of Ayoub (1963), and Chapters 6 and 8 of Apostol (1976)]. Thus the computation of the DFT, far from being a purely technical and rather narrow piece of specialized numerical analysis, turns out to have very rich connections with such central areas of pure mathematics as number theory (algebraic and analytic), the representation theory of certain Lie groups and coding theory – to list only a few. The interested reader may consult Auslander & Tolimieri (1979); Auslander, Feig & Winograd (1982, 1984); Auslander & Tolimieri (1985); Tolimieri (1985).

One-dimensional algorithms are examined first. The Sande mixed-radix version of the Cooley–Tukey algorithm only calls upon the additive structure of congruence classes of integers. The prime factor algorithm of Good begins to exploit some of their multiplicative structure, and the use of relatively prime factors leads to a stronger factorization than that of Sande. Fuller use of the multiplicative structure, via the group of units, leads to the Rader algorithm; and the factorization of short convolutions then yields the Winograd algorithms.

Multidimensional algorithms are at first built as tensor products of one-dimensional elements. The problem of factoring the DFT in several dimensions simultaneously is then examined. The section ends with a survey of attempts at formalizing the interplay between algorithm structure and computer architecture for the purpose of automating the design of optimal DFT code.

It was originally intended to incorporate into this section a survey of all the basic notions and results of abstract algebra which are called upon in the course of these developments, but time limitations have made this impossible. This material, however, is adequately covered by the first chapter of Tolimieri et al. (1989) in a form tailored for the same purposes. Similarly, the inclusion of numerous detailed examples of the algorithms described here has had to be postponed to a later edition, but an abundant supply of such examples may be found in the signal processing literature, for instance in the books by McClellan & Rader (1979), Blahut (1985), and Tolimieri et al. (1989).

1.3.3.2. One-dimensional algorithms

| top | pdf |

Throughout this section we will denote by [e(t)] the expression $[\exp (2 \pi it)]$ , $[t \in {\bb R}]$ . The mapping $[t \;\longmapsto\; e(t)]$ has the following properties: $[\eqalign{e(t_{1} + t_{2}) &= e(t_{1}) e(t_{2}) \cr e(-t) &= \overline{e(t)} = [e(t)]^{-1} \cr e(t) &= 1 \Leftrightarrow t \in {\bb Z}.}]$ Thus e defines an isomorphism between the additive group $[{\bb R} /{\bb Z}]$ (the reals modulo the integers) and the multiplicative group of complex numbers of modulus 1. It follows that the mapping $[\ell \;\longmapsto\; e(\ell/N)]$ , where $[\ell \in {\bb Z}]$ and N is a positive integer, defines an isomorphism between the one-dimensional residual lattice $[{\bb Z}/N {\bb Z}]$ and the multiplicative group of Nth roots of unity.

The DFT on N points then relates vectors X and $[{\bf X}^{*}]$ in W and $[W^{*}]$ through the linear transformations: $[\eqalign{&F(N): \quad X(k) = {1 \over N} {\sum\limits_{k^{*} \in {\bb Z}/N {\bb Z}}} X^{*} (k^{*}) e(-k^{*} k/N) \cr &\bar{F}(N): \quad X^{*} (k^{*}) = {\sum\limits_{k \in {\bb Z}/N {\bb Z}}} X(k) e(k^{*} k/N).}]$

1.3.3.2.1. The Cooley–Tukey algorithm

| top | pdf |

The presentation of Gentleman & Sande (1966) will be followed first [see also Cochran et al. (1967)]. It will then be reinterpreted in geometric terms which will prepare the way for the treatment of multidimensional transforms in Section 1.3.3.3.

Suppose that the number of sample points N is composite, say $[N = N_{1} N_{2}]$ . We may write k to the base $[N_{1}]$ and $[k^{*}]$ to the base $[N_{2}]$ as follows: $[\eqalign{k &= k_{1} + N_{1} k_{2} \quad\; k_{1} \in {\bb Z}/N_{1} {\bb Z}, \quad k_{2} \in {\bb Z}/N_{2} {\bb Z} \cr k^{*} &= k_{2}^{*} + k_{1}^{*} N_{2} \quad\;k_{1}^{*} \in {\bb Z}/N_{1} {\bb Z}, \quad k_{2}^{*} \in {\bb Z}/N_{2} {\bb Z}.}]$ The defining relation for $[\bar{F}(N)]$ may then be written: $[\eqalign{X^{*} (k_{2}^{*} + k_{1}^{*} N_{2}) &= {\sum\limits_{k_{1} \in {\bb Z}/N_{1} {\bb Z}}}\; {\sum\limits_{k_{2} \in {\bb Z}/N_{2} {\bb Z}}} X (k_{1} + N_{1} k_{2}) \cr &\quad \times e \left[{(k_{2}^{*} + k_{1}^{*} N_{2}) (k_{1} + N_{1} k_{2}) \over N_{1} N_{2}}\right].}]$ The argument of [e[.]] may be expanded as $[{k_{2}^{*} k_{1} \over N} + {k_{1}^{*} k_{1} \over N_{1}} + {k_{2}^{*} k_{2} \over N_{2}} + k_{1}^{*} k_{2},]$ and the last summand, being an integer, may be dropped: $[\eqalign{&X^{*} (k_{2}^{*} + k_{1}^{*} N_{2})\cr &\quad = {\sum\limits_{k_{1}}} \left\{e \left({k_{2}^{*} k_{1} \over N}\right) \left[{\sum\limits_{k_{2}}}\; X (k_{1} + N_{1} k_{2}) e \left({k_{2}^{*} k_{2} \over N_{2}}\right)\right]\right\}\cr &\qquad \times e \left({k_{1}^{*} k_{1} \over N_{1}}\right).}]$ This computation may be decomposed into five stages, as follows:

(i) form the $[N_{1}]$ vectors $[{\bf Y}_{k_{1}}]$ of length $[N_{2}]$ by the prescription $[Y_{k_{1}} (k_{2}) = X (k_{1} + N_{1} k_{2}),\quad k_{1} \in {\bb Z}/N_{1} {\bb Z},\quad k_{2} \in {\bb Z}/N_{2} {\bb Z}\hbox{;}]$
(ii) calculate the $[N_{1}]$ transforms $[{\bf Y}_{k_{1}}^{*}]$ on $[N_{2}]$ points: $[{\bf Y}_{k_{1}}^{*} = \bar{F} (N_{2}) [{\bf Y}_{k_{1}}],\quad k_{1} \in {\bb Z}/N_{1} {\bb Z}\hbox{;}]$
(iii) form the $[N_{2}]$ vectors $[{\bf Z}_{k_{2}^{*}}]$ of length $[N_{1}]$ by the prescription $[{\bf Z}_{k_{2}^{*}} (k_{1}) = e \left({k_{2}^{*} k_{1} \over N}\right) Y_{k_{1}}^{*} (k_{2}^{*}),\quad k_{1} \in {\bb Z}/N_{1} {\bb Z},\quad k_{2}^{*} \in {\bb Z}/N_{2} {\bb Z}\hbox{;}]$
(iv) calculate the $[N_{2}]$ transforms $[{\bf Z}_{k_{2}^{*}}^{*}]$ on $[N_{1}]$ points: $[{\bf Z}_{k_{2}^{*}}^{*} = \bar{F} (N_{1}) [{\bf Z}_{k_{2}^{*}}],\quad k_{2}^{*} \in {\bb Z}/N_{2} {\bb Z}\hbox{;}]$
(v) collect $[X^{*} (k_{2}^{*} + k_{1}^{*} N_{2})]$ as $[Z_{k_{2}^{*}}^{*} (k_{1}^{*})]$ .

If the intermediate transforms in stages (ii) and (iv) are performed in place, i.e. with the results overwriting the data, then at stage (v) the result $[X^{*} (k_{2}^{*} + k_{1}^{*} N_{2})]$ will be found at address $[k_{1}^{*} + N_{1} k_{2}^{*}]$ . This phenomenon is called scrambling by `digit reversal', and stage (v) is accordingly known as unscrambling.

The initial N-point transform $[\bar{F} (N)]$ has thus been performed as $[N_{1}]$ transforms $[\bar{F} (N_{2})]$ on $[N_{2}]$ points, followed by $[N_{2}]$ transforms $[\bar{F} (N_{1})]$ on $[N_{1}]$ points, thereby reducing the arithmetic cost from $[(N_{1} N_{2})^{2}]$ to $[N_{1} N_{2} (N_{1} + N_{2})]$ . The phase shifts applied at stage (iii) are traditionally called `twiddle factors', and the transposition between $[k_{1}]$ and $[k_{2}^{*}]$ can be performed by the fast recursive technique of Eklundh (1972). Clearly, this procedure can be applied recursively if $[N_{1}]$ and $[N_{2}]$ are themselves composite, leading to an overall arithmetic cost of order N log N if N has no large prime factors.

The Cooley–Tukey factorization may also be derived from a geometric rather than arithmetic argument. The decomposition $[k = k_{1} + N_{1} k_{2}]$ is associated to a geometric partition of the residual lattice $[{\bb Z}/N {\bb Z}]$ into $[N_{1}]$ copies of $[{\bb Z}/N_{2} {\bb Z}]$ , each translated by $[k_{1} \in {\bb Z}/N_{1} {\bb Z}]$ and `blown up' by a factor $[N_{1}]$ . This partition in turn induces a (direct sum) decomposition of X as $[{\bf X} = {\textstyle\sum\limits_{k_{1}}}\; {\bf X}_{k_{1}},]$ where $[\eqalign{X_{k_{1}} (k) &= X (k)\quad \hbox{if } k \equiv k_{1} \hbox{ mod } N_{1},\cr &= 0{\hbox to 23pt{}} \hbox{otherwise}.}]$

According to (i), $[{\bf X}_{k_{1}}]$ is related to $[{\bf Y}_{k_{1}}]$ by decimation by $[N_{1}]$ and offset by $[k_{1}]$ . By Section 1.3.2.7.2, $[\bar{F} (N) [{\bf X}_{k_{1}}]]$ is related to $[\bar{F} (N_{2}) [{\bf Y}_{k_{1}}]]$ by periodization by $[N_{2}]$ and phase shift by $[e (k^{*} k_{1}/N)]$ , so that $[X^{*} (k^{*}) = {\sum\limits_{k_{1}}} e \left({k^{*} k_{1} \over N}\right) Y_{k_{1}}^{*} (k_{2}^{*}),]$ the periodization by $[N_{2}]$ being reflected by the fact that $[Y_{k_{1}}^{*}]$ does not depend on $[k_{1}^{*}]$ . Writing $[k^{*} = k_{2}^{*} + k_{1}^{*} N_{2}]$ and expanding $[k^{*} k_{1}]$ shows that the phase shift contains both the twiddle factor $[e (k_{2}^{*} k_{1}/N)]$ and the kernel $[e (k_{1}^{*} k_{1}/N_{1})]$ of $[\bar{F} (N_{1})]$ . The Cooley–Tukey algorithm is thus naturally associated to the coset decomposition of a lattice modulo a sublattice (Section 1.3.2.7.2).

It is readily seen that essentially the same factorization can be obtained for [F(N)] , up to the complex conjugation of the twiddle factors. The normalizing constant [1/N] arises from the normalizing constants $[1/N_{1}]$ and $[1/N_{2}]$ in $[F (N_{1})]$ and $[F (N_{2})]$ , respectively.

Factors of 2 are particularly simple to deal with and give rise to a characteristic computational structure called a `butterfly loop'. If [N = 2M] , then two options exist:

(a) using $[N_{1} = 2]$ and $[N_{2} = M]$ leads to collecting the even-numbered coordinates of X into $[{\bf Y}_{0}]$ and the odd-numbered coordinates into $[{\bf Y}_{1}]$ $[\eqalign{Y_{0} (k_{2}) &= X (2k_{2}),\quad \qquad k_{2} = 0, \ldots, M - 1,\cr Y_{1} (k_{2}) &= X (2k_{2} + 1),\quad \;k_{2} = 0, \ldots, M - 1,}]$ and writing: $[\eqalign{X^{*} (k_{2}^{*}) = \;&Y_{0}^{*} (k_{2}^{*}) + e (k_{2}^{*}/N) Y_{1}^{*} (k_{2}^{*}),\cr \quad &k_{2}^{*} = 0, \ldots, M - 1\hbox{;}\cr X^{*} (k_{2}^{*} + M) =\; &Y_{0}^{*} (k_{2}^{*}) - e (k_{2}^{*}/N) Y_{1}^{*} (k_{2}^{*}),\cr &k_{2}^{*} = 0, \ldots, M - 1.}]$ This is the original version of Cooley & Tukey, and the process of formation of $[{\bf Y}_{0}]$ and $[{\bf Y}_{1}]$ is referred to as `decimation in time ' (i.e. decimation along the data index k).
(b) using $[N_{1} = M]$ and $[N_{2} = 2]$ leads to forming $[\eqalign{Z_{0} (k_{1}) &= X (k_{1}) + X (k_{1} + M),\;\;\quad \quad \quad \qquad k_{1} = 0, \ldots, M - 1,\cr Z_{1} (k_{1}) &= [X (k_{1}) - X (k_{1} + M)] e \left({k_{1} \over N}\right),{\hbox to 20pt{}} k_{1} = 0, \ldots, M - 1,}]$ then obtaining separately the even-numbered and odd-numbered components of $[{\bf X}^{*}]$ by transforming $[{\bf Z}_{0}]$ and $[{\bf Z}_{1}]$ : $[\eqalign{X^{*} (2k_{1}^{*}) &= Z_{0}^{*} (k_{1}^{*}),\quad k_{1}^{*} = 0, \ldots, M - 1\hbox{;}\cr X^{*} (2k_{1}^{*} + 1) &= Z_{1}^{*} (k_{1}^{*}),\quad k_{1}^{*} = 0, \ldots, M - 1.}]$ This version is due to Sande (Gentleman & Sande, 1966), and the process of separately obtaining even-numbered and odd-numbered results has led to its being referred to as `decimation in frequency' (i.e. decimation along the result index $[k^{*}]$ ).

By repeated factoring of the number N of sample points, the calculation of [F(N)] and $[\bar{F} (N)]$ can be reduced to a succession of stages, the smallest of which operate on single prime factors of N. The reader is referred to Gentleman & Sande (1966) for a particularly lucid analysis of the programming considerations which help implement this factorization efficiently; see also Singleton (1969). Powers of two are often grouped together into factors of 4 or 8, which are advantageous in that they require fewer complex multiplications than the repeated use of factors of 2. In this approach, large prime factors P are detrimental, since they require a full $[P^{2}]$ -size computation according to the defining formula.

1.3.3.2.2. The Good (or prime factor) algorithm

| top | pdf |

1.3.3.2.2.1. Ring structure on $[{\bb Z}/N{\bb Z}]$

| top | pdf |

The set $[{\bb Z}/N {\bb Z}]$ of congruence classes of integers modulo an integer N [see e.g. Apostol (1976), Chapter 5] inherits from $[{\bb Z}]$ not only the additive structure used in deriving the Cooley–Tukey factorization, but also a multiplicative structure in which the product of two congruence classes mod N is uniquely defined as the class of the ordinary product (in $[{\bb Z}]$ ) of representatives of each class. The multiplication can be distributed over addition in the usual way, endowing $[{\bb Z}/N {\bb Z}]$ with the structure of a commutative ring.

If N is composite, the ring $[{\bb Z}/N {\bb Z}]$ has zero divisors. For example, let $[N = N_{1} N_{2}]$ , let $[n_{1} \equiv N_{1}]$ mod N, and let $[n_{2} \equiv N_{2}]$ mod N: then $[n_{1} n_{2} \equiv 0]$ mod N. In the general case, a product of non-zero elements will be zero whenever these elements collect together all the factors of N. These circumstances give rise to a fundamental theorem in the theory of commutative rings, the Chinese Remainder Theorem (CRT), which will now be stated and proved [see Apostol (1976), Chapter 5; Schroeder (1986), Chapter 16].

1.3.3.2.2.2. The Chinese remainder theorem

| top | pdf |

Let $[N = N_{1} N_{2} \ldots N_{d}]$ be factored into a product of pairwise coprime integers, so that g.c.d. $[(N_{i}, N_{j}) = 1]$ for $[i \neq j]$ . Then the system of congruence equations $[{\ell} \equiv {\ell}_{j} \hbox{ mod } N_{j},\qquad j = 1, \ldots, d,]$ has a unique solution $[\ell]$ mod N. In other words, each $[\ell \in {\bb Z}/N {\bb Z}]$ is associated in a one-to-one fashion to the d-tuple $[(\ell_{1}, \ell_{2}, \ldots, \ell_{d})]$ of its residue classes in $[{\bb Z}/N_{1} {\bb Z}, {\bb Z}/N_{2} {\bb Z}, \ldots, {\bb Z}/N_{d} {\bb Z}]$ .

The proof of the CRT goes as follows. Let $[Q_{j} = {N \over N_{j}} = {\prod\limits_{i \neq j}}\; N_{i}.]$ Since g.c.d. $[(N_{j}, Q_{j}) = 1]$ there exist integers $[n_{j}]$ and $[q_{j}]$ such that $[n_{j} N_{j} + q_{j} Q_{j} = 1,\qquad j = 1, \ldots, d,]$ then the integer $[{\ell} = {\textstyle\sum\limits_{i = 1}^{d}}\; {\ell}_{i} q_{i} Q_{i} \hbox{ mod } N]$ is the solution. Indeed, $[{\ell} \equiv {\ell}_{j} q_{j} Q_{j} \hbox{ mod } N_{j}]$ because all terms with $[i \neq j]$ contain $[N_{j}]$ as a factor; and $[q_{j} Q_{j} \equiv 1 \hbox{ mod } N_{j}]$ by the defining relation for $[q_{j}]$ .

It may be noted that $[\eqalign{(q_{i} Q_{i}) (q_{j} Q_{j}) &\equiv 0{\phantom{Q_j}}\quad\quad\hbox{ mod } N \hbox{ for } i \neq j,\cr (q_{j} Q_{j})^{2} &\equiv q_{j} Q_{j}\quad\quad\hbox{mod } N, \;\;j = 1, \ldots, d,}]$ so that the $[q_{j} Q_{j}]$ are mutually orthogonal idempotents in the ring $[{\bb Z}/N {\bb Z}]$ , with properties formally similar to those of mutually orthogonal projectors onto subspaces in linear algebra. The analogy is exact, since by virtue of the CRT the ring $[{\bb Z}/N {\bb Z}]$ may be considered as the direct product $[{\bb Z}/N_{1} {\bb Z} \times {\bb Z}/N_{2} {\bb Z} \times \ldots \times {\bb Z}/N_{d} {\bb Z}]$ via the two mutually inverse mappings:

(i) $[{\ell} \;\longmapsto\; (\ell_{1}, \ell_{2}, \ldots, \ell_{d})]$ by $[\ell \equiv \ell_{j}]$ mod $[N_{j}]$ for each j;
(ii) $[(\ell_{1}, \ell_{2}, \ldots, \ell_{d}) \;\longmapsto\; \ell \hbox { by } \ell = {\textstyle\sum_{i = 1}^{d}} \ell_{i} q_{i} Q_{i}\hbox{ mod } N]$ .

The mapping defined by (ii) is sometimes called the `CRT reconstruction' of $[\ell]$ from the $[\ell_{j}]$ .

These two mappings have the property of sending sums to sums and products to products, i.e: $[\displaylines{\quad (\hbox{i})\hfill {\ell} + {\ell}' \;\longmapsto\; ({\ell}_{1} + {\ell}'_{1}, {\ell}_{2} + {\ell}'_{2}, \ldots, {\ell}_{d} + {\ell}'_{d}) \hfill\cr \hfill {\ell} {\ell}' \;\longmapsto\; ({\ell}_{1} {\ell}'_{1}, {\ell}_{2} {\ell}'_{2}, \ldots, {\ell}_{d} {\ell}'_{d}) \quad\;\;\; \phantom{(\hbox{i})} \hfill\cr \quad (\hbox{ii}) \hfill ({\ell}_{1} + {\ell}'_{1}, {\ell}_{2} + {\ell}'_{2}, \ldots, {\ell}_{d} + {\ell}'_{d}) \;\longmapsto\; {\ell} + {\ell}' \;\;\hfill\cr \hfill ({\ell}_{1} {\ell}'_{1}, {\ell}_{2} {\ell}'_{2}, \ldots, {\ell}_{d} {\ell}'_{d}) \;\longmapsto\; {\ell} {\ell}' \quad \phantom{(\hbox{i})} \;\;\;\;\hfill}]$ (the last proof requires using the properties of the idempotents $[q_{j} Q_{j}]$ ). This may be described formally by stating that the CRT establishes a ring isomorphism: $[{\bb Z}/N {\bb Z} \cong ({\bb Z}/N_{1} {\bb Z}) \times \ldots \times ({\bb Z}/N_{d} {\bb Z}).]$

1.3.3.2.2.3. The prime factor algorithm

| top | pdf |

The CRT will now be used to factor the N-point DFT into a tensor product of d transforms, the jth of length $[N_{j}]$ .

Let the indices k and $[k^{*}]$ be subjected to the following mappings:

(i) $[k \;\longmapsto\; (k_{1}, k_{2}, \ldots, k_{d}), k_{j} \in {\bb Z}/N_{j} {\bb Z}]$ , by $[k_{j} \equiv k]$ mod $[N_{j}]$ for each j, with reconstruction formula $[k = {\textstyle\sum\limits_{i = 1}^{d}} \;k_{i} q_{i} Q_{i} \hbox{ mod } N\hbox{;}]$
(ii) $[k^{*} \;\longmapsto\; (k_{1}^{*}, k_{2}^{*}, \ldots, k_{d}^{*}), k_{j}^{*} \in {\bb Z}/N_{j} {\bb Z}]$ , by $[k_{j}^{*} \equiv q_{j} k^{*}]$ mod $[N_{j}]$ for each j, with reconstruction formula $[k^{*} = {\textstyle\sum\limits_{i = 1}^{d}} \;k_{i}^{*} Q_{i} \hbox{ mod } N.]$

Then $[\eqalign{k^{*} k &= \left({\textstyle\sum\limits_{i = 1}^{d}}\; k_{i}^{*} Q_{i}\right) \left({\textstyle\sum\limits_{j = 1}^{d}} \;k_{j} q_{j} Q_{j}\right) \hbox{ mod } N\cr &= {\textstyle\sum\limits_{i, \, j = 1}^{d}} k_{i}^{*} k_{j} Q_{i} q_{j} Q_{j} \hbox{ mod } N.}]$ Cross terms with $[i \neq j]$ vanish since they contain all the factors of N, hence $[\eqalign{k^{*} k &= {\textstyle\sum\limits_{j = 1}^{d}}\; q_{j} Q_{j}^{2} k_{j}^{*} k_{j} \hbox{ mod } N\cr &= {\textstyle\sum\limits_{j = 1}^{d}} (1 - n_{j} N_{j}) Q_{j} k_{j}^{*} k_{j} \hbox{ mod } N.}]$ Dividing by N, which may be written as $[N_{j} Q_{j}]$ for each j, yields $[\eqalign{{k^{*} k \over N} &= {\sum\limits_{j = 1}^{d}} (1 - n_{j} N_{j}) {Q_{j} \over N_{j} Q_{j}} k_{j}^{*} k_{j} \hbox{ mod } 1\cr &= {\sum\limits_{j = 1}^{d}} \left({1 \over N_{j}} - n_{j}\right) k_{j}^{*} k_{j} \hbox{ mod } 1,}]$ and hence $[{k^{*} k \over N} \equiv {\sum\limits_{j = 1}^{d}} {k_{j}^{*} k_{j} \over N_{j}} \hbox{ mod } 1.]$ Therefore, by the multiplicative property of [e(.)] , $[e \left({k^{*} k \over N}\right) \equiv \bigotimes\limits_{j = 1}^{d} e \left({k_{j}^{*} k_{j} \over N_{j}}\right).]$

Let $[{\bf X} \in L ({\bb Z}/N {\bb Z})]$ be described by a one-dimensional array [X(k)] indexed by k. The index mapping (i) turns X into an element of $[L ({\bb Z}/N_{1} {\bb Z} \times \ldots \times {\bb Z}/N_{d} {\bb Z})]$ described by a d-dimensional array $[X (k_{1}, \ldots, k_{d})]$ ; the latter may be transformed by $[\bar{F} (N_{1}) \bigotimes \ldots \bigotimes \bar{F} (N_{d})]$ into a new array $[X^{*} (k_{1}^{*}, k_{2}^{*}, \ldots, k_{d}^{*})]$ . Finally, the one-dimensional array of results $[X^{*} (k^{*})]$ will be obtained by reconstructing $[k^{*}]$ according to (ii).

The prime factor algorithm, like the Cooley–Tukey algorithm, reindexes a 1D transform to turn it into d separate transforms, but the use of coprime factors and CRT index mapping leads to the further gain that no twiddle factors need to be applied between the successive transforms (see Good, 1971). This makes up for the cost of the added complexity of the CRT index mapping.

The natural factorization of N for the prime factor algorithm is thus its factorization into prime powers: $[\bar{F}(N)]$ is then the tensor product of separate transforms (one for each prime power factor $[N_{j} = p_{j}^{\nu_{j}}]$ ) whose results can be reassembled without twiddle factors. The separate factors $[p_{j}]$ within each $[N_{j}]$ must then be dealt with by another algorithm (e.g. Cooley–Tukey, which does require twiddle factors). Thus, the DFT on a prime number of points remains undecomposable.

1.3.3.2.3. The Rader algorithm

| top | pdf |

The previous two algorithms essentially reduce the calculation of the DFT on N points for N composite to the calculation of smaller DFTs on prime numbers of points, the latter remaining irreducible. However, Rader (1968) showed that the p-point DFT for p an odd prime can itself be factored by invoking some extra arithmetic structure present in $[{\bb Z} / p {\bb Z}]$ .

1.3.3.2.3.1. N an odd prime

| top | pdf |

The ring $[{\bb Z} / p {\bb Z} = \{0,1,2,\ldots,p - 1\}]$ has the property that its [p - 1] non-zero elements, called units, form a multiplicative group [U(p)] . In particular, all units $[r \in U(p)]$ have a unique multiplicative inverse in $[{\bb Z} / p {\bb Z}]$ , i.e. a unit $[s \in U(p)]$ such that $[rs \equiv 1\hbox { mod } p]$ . This endows $[{\bb Z} / p {\bb Z}]$ with the structure of a finite field.

Furthermore, [U(p)] is a cyclic group, i.e. consists of the successive powers $[g^{m}\hbox{ mod } p]$ of a generator g called a primitive root mod p (such a g may not be unique, but it always exists). For instance, for [p = 7] , $[U(7) = \{1,2,3,4,5,6\}]$ is generated by [g = 3] , whose successive powers mod 7 are: $[g^{0} = 1, \quad g^{1} = 3, \quad g^{2} = 2, \quad g^{3} = 6, \quad g^{4} = 4, \quad g^{5} = 5]$ [see Apostol (1976), Chapter 10].

The basis of Rader's algorithm is to bring to light a hidden regularity in the matrix [F(p)] by permuting the basis vectors $[{\bf u}_{k}]$ and $[{\bf v}_{k^{*}}]$ of $[L({\bb Z} / p {\bb Z})]$ as follows: $[\eqalign{{\bf u}'_{0} &= {\bf u}_{0} \cr {\bf u}'_{m} &= {\bf u}_{k} {\hbox to 12pt{}}\hbox{with } k = g^{m}, {\hbox to 15pt{}} m = 1, \ldots, p - 1\hbox{;} \cr {\bf v}'_{0} &= {\bf v}_{0} \cr {\bf v}'_{m^{*}} &= {\bf v}_{k^{*}} \quad \hbox{with } k^{*} = g^{m^{*}}, \quad m^{*} = 1, \ldots, p - 1\hbox{;}}]$ where g is a primitive root mod p.

With respect to these new bases, the matrix representing $[\bar{F}(p)]$ will have the following elements: $[\eqalign{\hbox{element } (0,0) &= 1 \cr \hbox{element } (0, m + 1) &= 1 \quad \hbox{for all } m = 0, \ldots p - 2, \cr \hbox{element } (m^{*} + 1,0) &= 1 \quad \hbox{for all } m^{*} = 0, \ldots, p - 2, \cr \hbox{element } (m^{*} + 1, m + 1) &= e \left({k^{*}k \over p}\right) \cr &= e(g^{(m^{*} + m)/p}) \cr &\qquad \quad \hbox{for all } m^{*} = 0, \ldots, p - 2.}]$ Thus the `core' $[\bar{C}(p)]$ of matrix $[\bar{F}(p)]$ , of size $[(p - 1) \times (p - 1)]$ , formed by the elements with two non-zero indices, has a so-called skew-circulant structure because element $[(m^{*}, m)]$ depends only on $[m^{*} + m]$ . Simplification may now occur because multiplication by $[\bar{C}(p)]$ is closely related to a cyclic convolution. Introducing the notation $[C(m) = e(g^{m/p})]$ we may write the relation $[{\bf Y}^{*} = \bar{F}(p){\bf Y}]$ in the permuted bases as $[\eqalign{Y^{*} (0) &= {\textstyle\sum\limits_{k}} Y(k) \cr Y^{*} (m^{*} + 1) &= Y(0) + {\textstyle\sum\limits_{m = 0}^{p - 2}} C(m^{*} + m) Y(m + 1) \cr &= Y(0) + {\textstyle\sum\limits_{m = 0}^{p - 2}} C(m^{*} - m) Z(m) \cr &= Y(0) + ({\bf C} * {\bf Z}) (m^{*}), \quad m^{*} = 0, \ldots, p - 2,}]$ where Z is defined by [Z(m) = Y(p - m - 2)] , $[m = 0, \ldots, p - 2]$ .

Thus $[{\bf Y}^{*}]$ may be obtained by cyclic convolution of C and Z, which may for instance be calculated by $[{\bf C} * {\bf Z} = F(p - 1) [\bar{F}(p - 1) [{\bf C}] \times \bar{F} (p - 1) [{\bf Z}]],]$ where × denotes the component-wise multiplication of vectors. Since p is odd, [p - 1] is always divisible by 2 and may even be highly composite. In that case, factoring $[\bar{F} (p - 1)]$ by means of the Cooley–Tukey or Good methods leads to an algorithm of complexity p log p rather than $[p^{2}]$ for $[\bar{F}(p)]$ . An added bonus is that, because $[g^{(p-1) / 2} = -1]$ , the elements of $[\bar{F} (p - 1) [{\bf C}]]$ can be shown to be either purely real or purely imaginary, which halves the number of real multiplications involved.

1.3.3.2.3.2. N a power of an odd prime

| top | pdf |

This idea was extended by Winograd (1976, 1978) to the treatment of prime powers $[N = p^{\nu}]$ , using the cyclic structure of the multiplicative group of units $[U(p^{\nu})]$ . The latter consists of all those elements of $[{\bb Z} / p^{\nu} {\bb Z}]$ which are not divisible by p, and thus has $[q_{\nu} = p^{\nu - 1} (p - 1)]$ elements. It is cyclic, and there exist primitive roots g modulo $[p^{\nu}]$ such that $[U(p^{\nu}) = \{1, g, g^{2}, g^{3}, \ldots, g^{q_{\nu} - 1}\}.]$ The $[p^{\nu - 1}]$ elements divisible by p, which are divisors of zero, have to be treated separately just as 0 had to be treated separately for [N = p] .

When $[k^{*} \not\in U(p^{\nu})]$ , then $[k^{*} = pk_{1}^{*}]$ with $[k_{1}^{*} \in {\bb Z} / p^{\nu - 1} {\bb Z}]$ . The results $[X^{*} (pk_{1}^{*})]$ are p-decimated, hence can be obtained via the $[p^{\nu - 1}]$ -point DFT of the $[p^{\nu - 1}]$ -periodized data Y: $[X^{*} (pk_{1}^{*}) = \bar{F} (p^{\nu - 1}) [{\bf Y}] (k_{1}^{*})]$ with $[Y(k_{1}) = {\textstyle\sum\limits_{k_{2} \in {\bb Z} / p {\bb Z}}} X(k_{1} + p^{\nu - 1} k_{2}).]$

When $[k^{*} \in U(p^{\nu})]$ , then we may write $[X^{*} (k^{*}) = X_{0}^{*} (k^{*}) + X_{1}^{*} (k^{*}),]$ where $[{\bf X}_{0}^{*}]$ contains the contributions from $[k\; \notin\; U(p^{\nu})]$ and $[{\bf X}_{1}^{*}]$ those from $[k \in U(p^{\nu})]$ . By a converse of the previous calculation, $[{\bf X}_{0}^{*}]$ arises from p-decimated data Z, hence is the $[p^{\nu - 1}]$ -periodization of the $[p^{\nu - 1}]$ -point DFT of these data: $[X_{0}^{*} (p^{\nu - 1} k_{1}^{*} + k_{2}^{*}) = \bar{F} (p^{\nu - 1}) [{\bf Z}] (k_{2}^{*})]$ with $[Z(k_{2}) = X(pk_{2}), \qquad k_{2} \in {\bb Z} / p^{\nu - 1} {\bb Z}]$ (the $[p^{\nu - 1}]$ -periodicity follows implicity from the fact that the transform on the right-hand side is independent of $[k_{1}^{*} \in {\bb Z} / p {\bb Z}]$ ).

Finally, the contribution $[X_{1}^{*}]$ from all $[k \in U(p^{\nu})]$ may be calculated by reindexing by the powers of a primitive root g modulo $[p^{\nu}]$ , i.e. by writing $[X_{1}^{*} (g^{m^{*}}) = {\textstyle\sum\limits_{m = 0}^{q_{\nu} - 1}} X(g^{m}) e(g^{(m^{*} + m) / p^{\nu}})]$ then carrying out the multiplication by the skew-circulant matrix core as a convolution.

Thus the DFT of size $[p^{\nu}]$ may be reduced to two DFTs of size $[p^{\nu - 1}]$ (dealing, respectively, with p-decimated results and p-decimated data) and a convolution of size $[q_{\nu} = p^{\nu - 1} (p - 1)]$ . The latter may be `diagonalized' into a multiplication by purely real or purely imaginary numbers (because $[g^{(q_{\nu} / 2)} = -1]$ ) by two DFTs, whose factoring in turn leads to DFTs of size $[p^{\nu - 1}]$ and [p - 1] . This method, applied recursively, allows the complete decomposition of the DFT on $[p^{\nu}]$ points into arbitrarily small DFTs.

1.3.3.2.3.3. N a power of 2

| top | pdf |

When $[N = 2^{\nu}]$ , the same method can be applied, except for a slight modification in the calculation of $[{\bf X}_{1}^{*}]$ . There is no primitive root modulo $[2^{\nu}]$ for $[\nu \gt 2]$ : the group $[U(2^{\nu})]$ is the direct product of two cyclic groups, the first (of order 2) generated by −1, the second (of order [N/4] ) generated by 3 or 5. One then uses a representation $[\eqalign{k &= (-1)^{m_{1}} 5^{m_{2}} \cr k^{*} &= (-1)^{m_{1}^{*}} 5^{m_{2}^{*}}}]$ and the reindexed core matrix gives rise to a two-dimensional convolution. The latter may be carried out by means of two 2D DFTs on $[2 \times (N/4)]$ points.

1.3.3.2.4. The Winograd algorithms

| top | pdf |

The cyclic convolutions generated by Rader's multiplicative reindexing may be evaluated more economically than through DFTs if they are re-examined within a new algebraic setting, namely the theory of congruence classes of polynomials [see, for instance, Blahut (1985), Chapter 2; Schroeder (1986), Chapter 24].

The set, denoted $[{\bb K}[X]]$ , of polynomials in one variable with coefficients in a given field $[{\bb K}]$ has many of the formal properties of the set $[{\bb Z}]$ of rational integers: it is a ring with no zero divisors and has a Euclidean algorithm on which a theory of divisibility can be built.

Given a polynomial [P(z)] , then for every [W(z)] there exist unique polynomials [Q(z)] and [R(z)] such that [W(z) = P(z) Q(z) + R(z)] and $[\hbox{degree } (R) \;\lt\; \hbox{degree } (P).]$ [R(z)] is called the residue of [H(z)] modulo [P(z)] . Two polynomials $[H_{1}(z)]$ and $[H_{2}(z)]$ having the same residue modulo [P(z)] are said to be congruent modulo [P(z)] , which is denoted by $[H_{1}(z) \equiv H_{2}(z) \hbox{ mod } P(z).]$

If $[H(z) \equiv 0\hbox{ mod } P(z),\; H(z)]$ is said to be divisible by [P(z)] . If [H(z)] only has divisors of degree zero in $[{\bb K}[X]]$ , it is said to be irreducible over $[{\bb K}]$ (this notion depends on $[{\bb K}]$ ). Irreducible polynomials play in $[{\bb K}[X]]$ a role analogous to that of prime numbers in $[{\bb Z}]$ , and any polynomial over $[{\bb K}]$ has an essentially unique factorization as a product of irreducible polynomials.

There exists a Chinese remainder theorem (CRT) for polynomials . Let $[P(z) = P_{1}(z) \ldots P_{d}(z)]$ be factored into a product of pairwise coprime polynomials [i.e. $[P_{i}(z)]$ and $[P_{j}(z)]$ have no common factor for $[i \neq j]$ ]. Then the system of congruence equations $[H(z) \equiv H_{j}(z) \hbox{ mod } P_{j}(z), \quad j = 1, \ldots, d,]$ has a unique solution [H(z)] modulo [P(z)] . This solution may be constructed by a procedure similar to that used for integers. Let $[Q_{j}(z) = P(z) / P_{j}(z) = {\textstyle\prod\limits_{i \neq j}} \;P_{i}(z).]$ Then $[P_{j}]$ and $[Q_{j}]$ are coprime, and the Euclidean algorithm may be used to obtain polynomials $[p_{j}(z)]$ and $[q_{j}(z)]$ such that $[p_{j}(z) P_{j}(z) + q_{j}(z) Q_{j}(z) = 1.]$ With $[S_{i}(z) = q_{i}(z) Q_{i}(z)]$ , the polynomial $[H(z) = {\textstyle\sum\limits_{i = 1}^{d}} \;S_{i}(z) H_{i}(z) \hbox{ mod } P(z)]$ is easily shown to be the desired solution.

As with integers, it can be shown that the 1:1 correspondence between [H(z)] and $[H_{j}(z)]$ sends sums to sums and products to products, i.e. establishes a ring isomorphism: $[{\bb K}[X] \hbox{ mod } P \cong ({\bb K}[X] \hbox{ mod } P_{1}) \times \ldots \times ({\bb K}[X] \hbox{ mod } P_{d}).]$

These results will now be applied to the efficient calculation of cyclic convolutions. Let $[{\bf U} = (u_{0}, u_{1}, \ldots, u_{N - 1})]$ and $[{\bf V} = (v_{0}, v_{1}, \ldots, v_{N - 1})]$ be two vectors of length N, and let $[{\bf W} = (w_{0}, w_{1}, \ldots, w_{N - 1})]$ be obtained by cyclic convolution of U and V: $[w_{n} = {\textstyle\sum\limits_{m = 0}^{N - 1}} u_{m} v_{n - m}, \quad n = 0, \ldots, N - 1.]$ The very simple but crucial result is that this cyclic convolution may be carried out by polynomial multiplication modulo $[(z^{N} - 1)]$ : if $[\eqalign{U(z) &= {\textstyle\sum\limits_{l = 0}^{N - 1}} u_{l} z^{l} \cr V(z) &= {\textstyle\sum\limits_{m = 0}^{N - 1}} v_{m} z^{m} \cr W(z) &= {\textstyle\sum\limits_{n = 0}^{N - 1}} w_{n} z^{n}}]$ then the above relation is equivalent to $[W(z) \equiv U(z) V(z) \hbox{ mod } (z^{N} - 1).]$ Now the polynomial $[z^{N} - 1]$ can be factored over the field of rational numbers into irreducible factors called cyclotomic polynomials: if d is the number of divisors of N, including 1 and N, then $[z^{N} - 1 = {\textstyle\prod\limits_{i = 1}^{d}} P_{i}(z),]$ where the cyclotomics $[P_{i}(z)]$ are well known (Nussbaumer, 1981; Schroeder, 1986, Chapter 22). We may now invoke the CRT, and exploit the ring isomorphism it establishes to simplify the calculation of [W(z)] from [U(z)] and [V(z)] as follows:

(i) compute the d residual polynomials $[\eqalign{U_{i}(z) &\equiv U(z) \hbox{ mod } P_{i}(z), \quad i = 1, \ldots, d,\cr V_{i}(z) &\equiv V(z) \hbox{ mod } P_{i}(z), \quad i = 1, \ldots, d\hbox{;}}]$
(ii) compute the d polynomial products $[W_{i}(z) \equiv U_{i}(z) V_{i}(z) \hbox{ mod } P_{i}(z), \quad i = 1, \ldots, d\hbox{;}]$
(iii) use the CRT reconstruction formula just proved to recover from the $[W_{i} (z)]$ : $[W (z) \equiv {\textstyle\sum\limits_{i = 1}^{d}} S_{i} (z) W_{i} (z) \hbox{ mod } (z^{N} - 1).]$

When N is not too large, i.e. for `short cyclic convolutions', the $[P_{i} (z)]$ are very simple, with coefficients 0 or ±1, so that (i) only involves a small number of additions. Furthermore, special techniques have been developed to multiply general polynomials modulo cyclotomic polynomials, thus helping keep the number of multiplications in (ii) and (iii) to a minimum. As a result, cyclic convolutions can be calculated rapidly when N is sufficiently composite.

It will be recalled that Rader's multiplicative indexing often gives rise to cyclic convolutions of length [p - 1] for p an odd prime. Since [p - 1] is highly composite for all $[p \leq 50]$ other than 23 and 47, these cyclic convolutions can be performed more efficiently by the above procedure than by DFT.

These combined algorithms are due to Winograd (1977, 1978, 1980), and are known collectively as `Winograd small FFT algorithms'. Winograd also showed that they can be thought of as bringing the DFT matrix F to the following `normal form': $[{\bf F} = {\bf CBA},]$ where

A is an integer matrix with entries 0, $[\pm 1]$ , defining the `pre-additions',
B is a diagonal matrix of multiplications,
C is a matrix with entries 0, $[\pm 1]$ , $[\pm i]$ , defining the `post-additions'.

The elements on the diagonal of B can be shown to be either real or pure imaginary, by the same argument as in Section 1.3.3.2.3.1. Matrices A and C may be rectangular rather than square, so that intermediate results may require extra storage space.

1.3.3.3. Multidimensional algorithms

| top | pdf |

From an algorithmic point of view, the distinction between one-dimensional (1D) and multidimensional DFTs is somewhat blurred by the fact that some factoring techniques turn a 1D transform into a multidimensional one. The distinction made here, however, is a practical one and is based on the dimensionality of the indexing sets for data and results. This section will therefore be concerned with the problem of factoring the DFT when the indexing sets for the input data and output results are multidimensional.

1.3.3.3.1. The method of successive one-dimensional transforms

| top | pdf |

The DFT was defined in Section 1.3.2.7.4 in an n-dimensional setting and it was shown that when the decimation matrix N is diagonal, say $[{\bf N} = \hbox{diag} (N^{(1)}, N^{(2)}, \ldots, N^{(n)})]$ , then $[\bar{F} (N)]$ has a tensor product structure: $[\bar{F} ({\bf N}) = \bar{F} (N^{(1)}) \otimes \bar{F} (N^{(2)}) \otimes \ldots \otimes \bar{F} (N^{(n)}).]$ This may be rewritten as follows: $[\eqalign{\bar{F} ({\bf N}) &= [\bar{F} (N^{(1)}) \otimes I_{N^{(2)}} \otimes \ldots \otimes I_{N^{(n)}}] \cr &\quad \times [I_{N^{(1)}} \otimes \bar{F} (N^{(2)}) \otimes \ldots \otimes I_{{N}^{(n)}}] \cr &\quad \times \ldots \cr &\quad \times [I_{N^{(1)}} \otimes I_{N^{(2)}} \otimes \ldots \otimes \bar{F} (N^{(n)}],}]$ where the I's are identity matrices and × denotes ordinary matrix multiplication. The matrix within each bracket represents a one-dimensional DFT along one of the n dimensions, the other dimensions being left untransformed. As these matrices commute, the order in which the successive 1D DFTs are performed is immaterial.

This is the most straightforward method for building an n-dimensional algorithm from existing 1D algorithms. It is known in crystallography under the name of `Beevers–Lipson factorization' (Section 1.3.4.3.1), and in signal processing as the `row–column method'.

1.3.3.3.2. Multidimensional factorization

| top | pdf |

Substantial reductions in the arithmetic cost, as well as gains in flexibility, can be obtained if the factoring of the DFT is carried out in several dimensions simultaneously. The presentation given here is a generalization of that of Mersereau & Speake (1981), using the abstract setting established independently by Auslander, Tolimieri & Winograd (1982).

Let us return to the general n-dimensional setting of Section 1.3.2.7.4, where the DFT was defined for an arbitrary decimation matrix N by the formulae (where $[|{\bf N}|]$ denotes $[| \hbox{det }{\bf N}|]$ ): $[\eqalign{F ({\bf N})&:\quad X ({\bf k}) \;\;\;= {1 \over |{\bf N}|} {\sum\limits_{{\bf k}^{*}}} \;X^{*} ({\bf k}^{*}) e[-{\bf k}^{*} \cdot ({\bf N}^{-1} {\bf k})] \cr \bar{F} ({\bf N})&:\quad X^{*} ({\bf k}^{*}) = \phantom{{1 \over |{\bf N}|}} {\sum\limits_{{\bf k}}} \;X ({\bf k}) e[{\bf k}^{*} \cdot ({\bf N}^{-1} {\bf k})]}]$ with $[{\bf k} \in {\bb Z}^{n} / {\bf N} {\bb Z}^{n},\quad {\bf k}^{*} \in {\bb Z}^{n} / {\bf N}^{T} {\bb Z}^{n}.]$

1.3.3.3.2.1. Multidimensional Cooley–Tukey factorization

| top | pdf |

Let us now assume that this decimation can be factored into d successive decimations, i.e. that $[{\bf N} = {\bf N}_{1} {\bf N}_{2} \ldots {\bf N}_{d-1} {\bf N}_{d}]$ and hence $[{\bf N}^{T} = {\bf N}_{d}^{T} {\bf N}_{d - 1}^{T} \ldots {\bf N}_{2}^{T} {\bf N}_{1}^{T}.]$ Then the coset decomposition formulae corresponding to these successive decimations (Section 1.3.2.7.1) can be combined as follows: $[\eqalign{{\bb Z}^{n} &= \bigcup_{{\bf k}_{1}}\; ({\bf k}_{1} + {\bf N}_{1} {\bb Z}^{n}) \cr &= \bigcup_{{\bf k}_{1}}\; \left\{{\bf k}_{1} + {\bf N}_{1} \left[\bigcup_{{\bf k}_{2}}\; ({\bf k}_{2} + {\bf N}_{2} {\bb Z}^{n})\right]\right\} \cr &= \ldots \cr &= \bigcup_{{\bf k}_{1}} \ldots \bigcup_{{\bf k}_{d}}\; ({\bf k}_{1} + {\bf N}_{1} {\bf k}_{2} + \ldots + {\bf N}_{1} {\bf N}_{2} \times \ldots \times {\bf N}_{d - 1} {\bf k}_{d} + {\bf N} {\bb Z}^{n})}]$ with $[{\bf k}_{j} \in {\bb Z}^{n} / {\bf N}_{j} {\bb Z}^{n}]$ . Therefore, any $[{\bf k} \in {\bb Z} / {\bf N} {\bb Z}^{n}]$ may be written uniquely as $[{\bf k} = {\bf k}_{1} + {\bf N}_{1} {\bf k}_{2} + \ldots + {\bf N}_{1} {\bf N}_{2} \times \ldots \times {\bf N}_{d - 1} {\bf k}_{d}.]$ Similarly: $[\eqalign{{\bb Z}^{n} &= \bigcup_{{\bf k}_{d}^{*}}\; ({\bf k}_{d}^{*} + {\bf N}_{d}^{T} {\bb Z}^{n}) \cr &= \ldots \cr &= \bigcup_{{\bf k}_{d}^{*}} \ldots \bigcup_{{\bf k}_{1}^{*}} \;({\bf k}_{d}^{*} + {\bf N}_{d}^{T} {\bf k}_{d - 1}^{*} + \ldots + {\bf N}_{d}^{T} \times \ldots \times {\bf N}_{2}^{T} {\bf k}_{1}^{*} \cr &\quad + {\bf N}^{T} {\bb Z}^{n})}]$ so that any $[{\bf k}^{*} \in {\bb Z}^{n} / {\bf N}^{T} {\bb Z}^{n}]$ may be written uniquely as $[{\bf k}^{*} = {\bf k}_{d}^{*} + {\bf N}_{d}^{T} {\bf k}_{d - 1}^{*} + \ldots + {\bf N}_{d}^{T} \times \ldots \times {\bf N}_{2}^{T} {\bf k}_{1}^{*}]$ with $[{\bf k}_{j}^{*} \in {\bb Z}^{n} / {\bf N}_{j}^{T} {\bb Z}^{n}]$ . These decompositions are the vector analogues of the multi-radix number representation systems used in the Cooley–Tukey factorization.

We may then write the definition of $[\bar{F} ({\bf N})]$ with [d = 2] factors as $[\eqalign{X^{*} ({\bf k}_{2}^{*} + {\bf N}_{2}^{T} {\bf k}_{1}^{*}) &= {\textstyle\sum\limits_{{\bf k}_{1}}} {\textstyle\sum\limits_{{\bf k}_{2}}}\; X ({\bf k}_{1} + {\bf N}_{1} {\bf k}_{2}) \cr &\quad \times e[({\bf k}_{2}^{*T} + {\bf k}_{1}^{*T}{\bf N}_2) {\bf N}_{2}^{-1} {\bf N}_{1}^{-1} ({\bf k}_{1} + {\bf N}_{1} {\bf k}_{2})].}]$ The argument of e(–) may be expanded as $[{\bf k}_{2}^{*} \cdot ({\bf N}^{-1} {\bf k}_{1}) + {\bf k}_{1}^{*} \cdot ({\bf N}_{1}^{-1} {\bf k}_{1}) + {\bf k}_{2}^{*} \cdot ({\bf N}_{2}^{-1} {\bf k}_{2}) + {\bf k}_{1}^{*} \cdot {\bf k}_{2}.]$ The first summand may be recognized as a twiddle factor, the second and third as the kernels of $[\bar{F} ({\bf N}_{1})]$ and $[\bar{F} ({\bf N}_{2})]$ , respectively, while the fourth is an integer which may be dropped. We are thus led to a `vector-radix' version of the Cooley–Tukey algorithm , in which the successive decimations may be introduced in all n dimensions simultaneously by general integer matrices. The computation may be decomposed into five stages analogous to those of the one-dimensional algorithm of Section 1.3.3.2.1:

(i) form the $[|{\bf N}_{1}|]$ vectors $[{\bf Y}_{{\bf k}_{1}}]$ of shape $[{\bf N}_{2}]$ by $[Y_{{\bf k}_{1}} ({\bf k}_{2}) = X ({\bf k}_{1} + {\bf N}_{1} {\bf k}_{2}),\quad {\bf k}_{1} \in {\bb Z}^{n} / {\bf N}_{1} {\bb Z}^{n},\quad {\bf k}_{2} \in {\bb Z}^{n} / {\bf N}_{2} {\bb Z}^{n}\hbox{;}]$
(ii) calculate the $[|{\bf N}_{1}|]$ transforms $[{\bf Y}_{{\bf k}_{1}}^{*}]$ on $[|{\bf N}_{2}|]$ points: $[Y_{{\bf k}_{1}}^{*} ({\bf k}_{2}^{*}) = {\textstyle\sum\limits_{{\bf k}_{2}}}\; e[{\bf k}_{2}^{*} \cdot ({\bf N}_{2}^{-1} {\bf k}_{2})] Y_{{\bf k}_{1}} ({\bf k}_{2}),\quad {\bf k}_{1} \in {\bb Z}^{n} / {\bf N}_{1} {\bb Z}^{n}\hbox{;}]$
(iii) form the $[|{\bf N}_{2}|]$ vectors $[{\bf Z}_{{\bf k}_{2}^{*}}]$ of shape $[{\bf N}_{1}]$ by $[\displaylines{Z_{{\bf k}_{2}^{*}} ({\bf k}_{1}) = e[{\bf k}_{2}^{*} \cdot ({\bf N}^{-1} {\bf k}_{1})] Y_{{\bf k}_{1}}^{*} ({\bf k}_{2}^{*}),\quad {\bf k}_{1} \in {\bb Z}^{n} / {\bf N}_{1} {\bb Z}^{n},\cr {\bf k}_{2}^{*} \in {\bb Z}^{n} / {\bf N}_{2}^{T} {\bb Z}^{n}\hbox{;}}]$
(iv) calculate the $[|{\bf N}_{2}|]$ transforms $[{\bf Z}_{{\bf k}_{2}^{*}}^{*}]$ on $[|{\bf N}_{1}|]$ points: $[Z_{{\bf k}_{2}^{*}}^{*} ({\bf k}_{1}^{*}) = {\textstyle\sum\limits_{{\bf k}_{1}}}\; e[{\bf k}_{1}^{*} \cdot ({\bf N}_{1}^{-1} {\bf k}_{1})] Z_{{\bf k}_{2}^{*}} ({\bf k}_{1}),\quad {\bf k}_{2}^{*} \in {\bb Z}^{n} / {\bf N}_{2}^{T} {\bb Z}^{n}\hbox{;}]$
(v) collect $[X^{*} ({\bf k}_{2}^{*} + {\bf N}_{2}^{T} {\bf k}_{1}^{*})]$ as $[Z_{{\bf k}_{2}^{*}}^{*} ({\bf k}_{1}^{*})]$ .

The initial $[|{\bf N}|]$ -point transform $[\bar{F} ({\bf N})]$ can thus be performed as $[|{\bf N}_{1}|]$ transforms $[\bar{F} ({\bf N}_{2})]$ on $[|{\bf N}_{2}|]$ points, followed by $[|{\bf N}_{2}|]$ transforms $[\bar{F} ({\bf N}_{1})]$ on $[|{\bf N}_{1}|]$ points. This process can be applied successively to all d factors. The same decomposition applies to $[F ({\bf N})]$ , up to the complex conjugation of twiddle factors, the normalization factor $[1 / |{\bf N}|]$ being obtained as the product of the factors $[1 / |{\bf N}_{j}|]$ in the successive partial transforms $[F ({\bf N}_{j})]$ .

The geometric interpretation of this factorization in terms of partial transforms on translates of sublattices applies in full to this n-dimensional setting; in particular, the twiddle factors are seen to be related to the residual translations which place the sublattices in register within the big lattice. If the intermediate transforms are performed in place, then the quantity $[X^{*} ({\bf k}_{d}^{*} + {\bf N}_{d}^{T} {\bf k}_{d - 1}^{*} + \ldots + {\bf N}_{d}^{T} {\bf N}_{d - 1}^{T} \times \ldots \times {\bf N}_{2}^{T} {\bf k}_{1}^{*})]$ will eventually be found at location $[{\bf k}_{1}^{*} + {\bf N}_{1} {\bf k}_{2}^{*} + \ldots + {\bf N}_{1} {\bf N}_{2} \times \ldots \times {\bf N}_{d - 1} {\bf k}_{d}^{*},]$ so that the final results will have to be unscrambled by a process which may be called `coset reversal', the vector equivalent of digit reversal.

Factoring by 2 in all n dimensions simultaneously, i.e. taking $[{\bf N} = 2{\bf M}]$ , leads to `n-dimensional butterflies'. Decimation in time corresponds to the choice $[{\bf N}_{1} = 2{\bf I}, {\bf N}_{2} = {\bf M}]$ , so that $[{\bf k}_{1} \in {\bb Z}^{n} / 2{\bb Z}^{n}]$ is an n-dimensional parity class; the calculation then proceeds by $[\displaylines{Y_{{\bf k}_{1}} ({\bf k}_{2}) = X ({\bf k}_{1} + 2{\bf k}_{2}),\quad{\bf k}_{1} \in {\bb Z}^{n} / 2{\bb Z}^{n},\quad {\bf k}_{2} \in {\bb Z}^{n} / {\bf M}{\bb Z}^{n}, \cr Y_{{\bf k}_{1}}^{*} = \bar{F} ({\bf M}) [{\bf Y}_{{\bf k}_{1}}],\quad{\bf k}_{1} \in {\bb Z}^{n} / 2{\bb Z}^{n}\hbox{;} \cr \eqalign{X^{*} ({\bf k}_{2}^{*} + {\bf M}^{T} {\bf k}_{1}^{*}) &= {\textstyle\sum\limits_{{\bf k}_{1} \in {\bb Z}^{n} / 2{\bb Z}^{n}}} (-1)^{{\bf k}_{1}^{*} \cdot {\bf k}_{1}} \cr &\quad \times e[{\bf k}_{2}^{*} \cdot ({\bf N}^{-1} {\bf k}_{1})] Y_{{\bf k}_{1}}^{*} ({\bf k}_{2}^{*}).}\cr}]$ Decimation in frequency corresponds to the choice $[{\bf N}_{1} = {\bf M}]$ , $[{\bf N}_{2} = 2{\bf I}]$ , so that $[{\bf k}_{2} \in {\bb Z}^{n} / 2{\bb Z}^{n}]$ labels `octant' blocks of shape M; the calculation then proceeds through the following steps: $[\eqalign{Z_{{\bf k}_{2}^{*}} ({\bf k}_{1}) &= \left[{\textstyle\sum\limits_{{\bf k}_{2} \in {\bb Z}^{n} / 2{\bb Z}^{n}}} (-1)^{{\bf k}_{2}^{*} \cdot {\bf k}_{2}} X ({\bf k}_{1} + {\bf M}{\bf k}_{2})\right] \cr &\quad \times e[{\bf k}_{2}^{*} \cdot ({\bf N}^{-1} {\bf k}_{1})], \cr {\bf Z}_{{\bf k}_{2}^{*}}^{*} &= \bar{F} ({\bf M}) [{\bf Z}_{{\bf k}_{2}^{*}}], \cr X^{*} ({\bf k}_{2}^{*} + 2{\bf k}_{1}^{*}) &= Z_{{\bf k}_{2}^{*}}^{*} ({\bf k}_{1}^{*}),}]$ i.e. the $[2^{n}]$ parity classes of results, corresponding to the different $[{\bf k}_{2}^{*} \in {\bb Z}^{n} / 2{\bb Z}^{n}]$ , are obtained separately. When the dimension n is 2 and the decimating matrix is diagonal, this analysis reduces to the `vector radix FFT' algorithms proposed by Rivard (1977) and Harris et al. (1977). These lead to substantial reductions in the number M of multiplications compared to the row–column method: M is reduced to [3M/4] by simultaneous $[2 \times 2]$ factoring, and to [15M/32] by simultaneous $[4 \times 4]$ factoring.

The use of a non-diagonal decimating matrix may bring savings in computing time if the spectrum of the band-limited function under study is of such a shape as to pack more compactly in a non-rectangular than in a rectangular lattice (Mersereau, 1979). If, for instance, the support K of the spectrum Φ is contained in a sphere, then a decimation matrix producing a close packing of these spheres will yield an aliasing-free DFT algorithm with fewer sample points than the standard algorithm using a rectangular lattice.

1.3.3.3.2.2. Multidimensional prime factor algorithm

| top | pdf |

Suppose that the decimation matrix N is diagonal $[{\bf N} = \hbox{diag } (N^{(1)}, N^{(2)}, \ldots, N^{(n)})]$ and let each diagonal element be written in terms of its prime factors: $[N^{(i)} = {\textstyle\prod\limits_{j = 1}^{m}} \;p_{j}^{\nu (i, \, \;j)},]$ where m is the total number of distinct prime factors present in the $[N^{(i)}]$ .

The CRT may be used to turn each 1D transform along dimension i $[(i = 1, \ldots, n)]$ into a multidimensional transform with a separate `pseudo-dimension' for each distinct prime factor of $[N^{(i)}]$ ; the number $[\mu_{i}]$ , of these pseudo-dimensions is equal to the cardinality of the set: $[\{\;j \in \{1, \ldots, m\} | \nu (i,j) \gt 0 \hbox{ for some } i\}.]$ The full n-dimensional transform thus becomes μ-dimensional, with $[\mu = {\textstyle\sum_{i = 1}^{n}} \mu_{i}]$ .

We may now permute the μ pseudo-dimensions so as to bring into contiguous position those corresponding to the same prime factor $[p_{j}]$ ; the m resulting groups of pseudo-dimensions are said to define `p-primary' blocks. The initial transform is now written as a tensor product of m p-primary transforms, where transform j is on $[p_{j}^{\nu (1, \, \;j)} \times p_{j}^{\nu (2, \, j)} \times \ldots \times p_{j}^{\nu (n, \, j)}]$ points [by convention, dimension i is not transformed if $[\nu (i,j) = 0]$ ]. These p-primary transforms may be computed, for instance, by multidimensional Cooley–Tukey factorization (Section 1.3.3.3.1), which is faster than the straightforward row–column method. The final results may then be obtained by reversing all the permutations used.

The extra gain with respect to the multidimensional Cooley–Tukey method is that there are no twiddle factors between p-primary pieces corresponding to different primes p.

The case where N is not diagonal has been examined by Guessoum & Mersereau (1986).

1.3.3.3.2.3. Nesting of Winograd small FFTs

| top | pdf |

Suppose that the CRT has been used as above to map an n-dimensional DFT to a μ-dimensional DFT. For each $[\kappa = 1, \ldots, \mu]$ [κ runs over those pairs (i, j) such that $[\nu (i,j) \gt 0]$ ], the Rader/Winograd procedure may be applied to put the matrix of the κth 1D DFT in the CBA normal form of a Winograd small FFT. The full DFT matrix may then be written, up to permutation of data and results, as $[\bigotimes_{\kappa = 1}^{\mu}({\bf C}_{\kappa} {\bf B}_{\kappa} {\bf A}_{\kappa}).]$

A well known property of the tensor product of matrices allows this to be rewritten as $[\left(\bigotimes_{\gamma = 1}^{\mu} {\bf C}_{\gamma}\right) \times \left(\bigotimes_{\beta = 1}^{\mu} {\bf B}_{\beta}\right) \times \left(\bigotimes_{\alpha = 1}^{\mu} {\bf A}_{\alpha}\right)]$ and thus to form a matrix in which the combined pre-addition, multiplication and post-addition matrices have been precomputed. This procedure, called nesting, can be shown to afford a reduction of the arithmetic operation count compared to the row–column method (Morris, 1978).

Clearly, the nesting rearrangement need not be applied to all μ dimensions, but can be restricted to any desired subset of them.

1.3.3.3.2.4. The Nussbaumer–Quandalle algorithm

| top | pdf |

Nussbaumer's approach views the DFT as the evaluation of certain polynomials constructed from the data (as in Section 1.3.3.2.4). For instance, putting $[\omega = e(1/N)]$ , the 1D N-point DFT $[X^{*}(k^{*}) = {\textstyle\sum\limits_{k = 0}^{N - 1}} X(k) \omega^{k^{*}k}]$ may be written $[X^{*}(k^{*}) = Q(\omega^{k^{*}}),]$ where the polynomial Q is defined by $[Q(z) = {\textstyle\sum\limits_{k = 0}^{N - 1}} X(k)z^{k}.]$

Let us consider (Nussbaumer & Quandalle, 1979) a 2D transform of size $[N \times N]$ : $[X^{*}(k_{1}^{*}, k_{2}^{*}) = {\textstyle\sum\limits_{k_{1} = 0}^{N - 1}}\; {\textstyle\sum\limits_{k_{2} = 0}^{N - 1}} X(k_{1}, k_{2}) \omega^{k_{1}^{*} k_{1} + k_{2}^{*} k_{2}}.]$ By introduction of the polynomials $[\eqalign{Q_{k_{2}}(z) &= {\textstyle\sum\limits_{k_{1}}}\; X (k_{1}, k_{2})z^{k_{1}} \cr R_{k_{2}^{*}}(z) &= {\textstyle\sum\limits_{k_{2}}}\; \omega^{k_{2}^{*} k_{2}} Q_{k_{2}}(z),}]$ this may be rewritten: $[X^{*}(k_{1}^{*}, k_{2}^{*}) = R_{k_{2}^{*}} (\omega^{k_{1}^{*}}) = {\textstyle\sum\limits_{k_{2}}}\; \omega^{k_{2}^{*} k_{2}} Q_{k_{2}} (\omega^{k_{1}^{*}}).]$

Let us now suppose that $[k_{1}^{*}]$ is coprime to N. Then $[k_{1}^{*}]$ has a unique inverse modulo N (denoted by $[1/k_{1}^{*}]$ ), so that multiplication by $[k_{1}^{*}]$ simply permutes the elements of $[{\bb Z}/N {\bb Z}]$ and hence $[{\textstyle\sum\limits_{k_{2} = 0}^{N - 1}} f(k_{2}) = {\textstyle\sum\limits_{k_{2} = 0}^{N - 1}} f(k_{1}^{*} k_{2})]$ for any function f over $[{\bb Z}/N {\bb Z}]$ . We may thus write: $[\eqalign{X^{*}(k_{1}^{*}, k_{2}^{*}) &= {\textstyle\sum\limits_{k_{2}}} \;\omega^{k_{1}^{*} k_{2}^{*} k_{2}} Q_{k_{1}^{*} k_{2}} (\omega^{k_{1}^{*}}) \cr &= S_{k_{1}^{*} k_{2}} (\omega^{k_{1}^{*}})}]$ where $[S_{k^{*}}(z) = {\textstyle\sum\limits_{k_{2}}}\; z^{k^{*} k_{2}} Q_{k_{2}}(z).]$ Since only the value of polynomial $[S_{k^{*}}(z)]$ at $[z = \omega^{k_{1}^{*}}]$ is involved in the result, the computation of $[S_{k^{*}}]$ may be carried out modulo the unique cyclotomic polynomial [P(z)] such that $[P(\omega^{k_{1}^{*}}) = 0]$ . Thus, if we define: $[T_{k^{*}}(z) = {\textstyle\sum\limits_{k_{2}}} \;z^{k^{*} k_{2}} Q_{k_{2}}(z) \hbox{ mod } P(z)]$ we may write: $[X^{*}(k_{1}^{*}, k_{2}^{*}) = T_{k_{1}^{*} k_{2}^{*}} (\omega^{k_{1}^{*}})]$ or equivalently $[X^{*} \left(k_{1}^{*}, {k_{2}^{*} \over k_{1}^{*}}\right) = T_{k_{2}^{*}} (\omega^{k_{1}^{*}}).]$

For N an odd prime p, all non-zero values of $[k_{1}^{*}]$ are coprime with p so that the $[p \times p]$ -point DFT may be calculated as follows:

(1) form the polynomials $[T_{k_{2}^{*}}(z) = {\textstyle\sum\limits_{k_{1}}} {\textstyle\sum\limits_{k_{2}}} \;X(k_{1}, k_{2})z^{k_{1} + k_{2}^{*} k_{2}} \hbox{ mod } P(z)]$ for $[k_{2}^{*} = 0, \ldots, p - 1]$ ;
(2) evaluate $[T_{k_{2}^{*}} (\omega^{k_{1}^{*}})]$ for $[k_{1}^{*} = 0, \ldots, p - 1]$ ;
(3) put $[X^{*}(k_{1}^{*}, k_{2}^{*}/k_{1}^{*}) = T_{k_{2}^{*}} (\omega^{k_{1}^{*}})]$ ;
(4) calculate the terms for $[k_{1}^{*} = 0]$ separately by $[X^{*}(0, k_{2}^{*}) = {\textstyle\sum\limits_{k_{2}}} \left[{\textstyle\sum\limits_{k_{1}}} \;X(k_{1}, k_{2})\right] \omega^{k_{2}^{*} k_{2}}.]$

Step (1) is a set of p `polynomial transforms' involving no multiplications; step (2) consists of p DFTs on p points each since if $[T_{k_{2}^{*}}(z) = {\textstyle\sum\limits_{k_{1}}}\; Y_{k_{2}^{*}}(k_{1})z^{k_{1}}]$ then $[T_{k_{2}^{*}} (\omega^{k_{1}^{*}}) = {\textstyle\sum\limits_{k_{1}}} \;Y_{k_{2}^{*}}(k_{1}) \omega^{k_{1}^{*} k_{1}} = Y_{k_{2}^{*}}^{*}(k_{1}^{*})\hbox{;}]$ step (3) is a permutation; and step (4) is a p-point DFT. Thus the 2D DFT on $[p \times p]$ points, which takes 2p p-point DFTs by the row–column method, involves only [(p + 1)] p-point DFTs; the other DFTs have been replaced by polynomial transforms involving only additions.

This procedure can be extended to n dimensions, and reduces the number of 1D p-point DFTs from $[np^{n - 1}]$ for the row–column method to $[(p^{n} -1)/(p - 1)]$ , at the cost of introducing extra additions in the polynomial transforms.

A similar algorithm has been formulated by Auslander et al. (1983) in terms of Galois theory.

1.3.3.3.3. Global algorithm design

| top | pdf |

1.3.3.3.3.1. From local pieces to global algorithms

| top | pdf |

The mathematical analysis of the structure of DFT computations has brought to light a broad variety of possibilities for reducing or reshaping their arithmetic complexity. All of them are `analytic' in that they break down large transforms into a succession of smaller ones.

These results may now be considered from the converse `synthetic' viewpoint as providing a list of procedures for assembling them:

(i) the building blocks are one-dimensional p-point algorithms for p a small prime;
(ii) the low-level connectors are the multiplicative reindexing methods of Rader and Winograd, or the polynomial transform reindexing method of Nussbaumer and Quandalle, which allow the construction of efficient algorithms for larger primes p, for prime powers $[p^{\nu}]$ , and for p-primary pieces of shape $[p^{\nu} \times \ldots \times p^{\nu}]$ ;
(iii) the high-level connectors are the additive reindexing scheme of Cooley–Tukey, the Chinese remainder theorem reindexing, and the tensor product construction;
(iv) nesting may be viewed as the `glue' which seals all elements.

The simplest DFT may then be carried out into a global algorithm in many different ways. The diagrams in Fig. 1.3.3.1 illustrate a few of the options available to compute a 400-point DFT. They may differ greatly in their arithmetic operation counts.

Figure 1.3.3.1 | top | pdf |

A few global algorithms for computing a 400-point DFT. CT: Cooley–Tukey factorization. PF: prime factor (or Good) factorization. W: Winograd algorithm.

1.3.3.3.3.2. Computer architecture considerations

| top | pdf |

To obtain a truly useful measure of the computational complexity of a DFT algorithm, its arithmetic operation count must be tempered by computer architecture considerations. Three main types of trade-offs must be borne in mind:

(i) reductions in floating-point (f.p.) arithmetic count are obtained by reindexing, hence at the cost of an increase in integer arithmetic on addresses, although some shortcuts may be found (Uhrich, 1969; Burrus & Eschenbacher, 1981);
(ii) reduction in the f.p. multiplication count usually leads to a large increase in the f.p. addition count (Morris, 1978);
(iii) nesting can increase execution speed, but causes a loss of modularity and hence complicates program development (Silverman, 1977; Kolba & Parks, 1977).

Many of the mathematical developments above took place in the context of single-processor serial computers, where f.p. addition is substantially cheaper than f.p. multiplication but where integer address arithmetic has to compete with f.p. arithmetic for processor cycles. As a result, the alternatives to the Cooley–Tukey algorithm hardly ever led to particularly favourable trade-offs, thus creating the impression that there was little to gain by switching to more exotic algorithms.

The advent of new machine architectures with vector and/or parallel processing features has greatly altered this picture (Pease, 1968; Korn & Lambiotte, 1979; Fornberg, 1981; Swartzrauber, 1984):

(i) pipelining equalizes the cost of f.p. addition and f.p. multiplication, and the ideal `blend' of the two types of operations depends solely on the number of adder and multiplier units available in each machine;
(ii) integer address arithmetic is delegated to specialized arithmetic and logical units (ALUs) operating concurrently with the f.p. units, so that complex reindexing schemes may be used without loss of overall efficiency.

Another major consideration is that of data flow [see e.g. Nawab & McClellan (1979)]. Serial machines only have few registers and few paths connecting them, and allow little or no overlap between computation and data movement. New architectures, on the other hand, comprise banks of vector registers (or `cache memory') besides the usual internal registers, and dedicated ALUs can service data transfers between several of them simultaneously and concurrently with computation.

In this new context, the devices described in Sections 1.3.3.2 and 1.3.3.3 for altering the balance between the various types of arithmetic operations, and reshaping the data flow during the computation, are invaluable. The field of machine-dependent DFT algorithm design is thriving on them [see e.g. Temperton (1983a,b,c, 1985); Agarwal & Cooley (1986, 1987)].

1.3.3.3.3.3. The Johnson–Burrus family of algorithms

| top | pdf |

In order to explore systematically all possible algorithms for carrying out a given DFT computation, and to pick the one best suited to a given machine, attempts have been made to develop:

(i) a high-level notation of describing all the ingredients of a DFT computation, including data permutation and data flow;
(ii) a formal calculus capable of operating on these descriptions so as to represent all possible reorganizations of the computation;
(iii) an automatic procedure for evaluating the performance of a given algorithm on a specific architecture.

Task (i) can be accomplished by systematic use of a tensor product notation to represent the various stages into which the DFT can be factored (reindexing, small transforms on subsets of indices, twiddle factors, digit-reversal permutations).

Task (ii) may for instance use the Winograd CBA normal form for each small transform, then apply the rules governing the rearrangement of tensor product $[\bigotimes]$ and ordinary product × operations on matrices. The matching of these rearrangements to the architecture of a vector and/or parallel computer can be formalized algebraically [see e.g. Chapter 2 of Tolimieri et al. (1989)].

Task (iii) is a complex search which requires techniques such as dynamic programming (Bellman, 1958).

Johnson & Burrus (1983) have proposed and tested such a scheme to identify the optimal trade-offs between prime factor nesting and Winograd nesting of small Winograd transforms. In step (ii), they further decomposed the pre-addition matrix A and post-addition matrix C into several factors, so that the number of design options available becomes very large: the N-point DFT when N has four factors can be calculated in over 10¹² distinct ways.

This large family of nested algorithms contains the prime factor algorithm and the Winograd algorithms as particular cases, but usually achieves greater efficiency than either by reducing the f.p. multiplication count while keeping the number of f.p. additions small.

There is little doubt that this systematic approach will be extended so as to incorporate all available methods of restructuring the DFT.

References

Agarwal, R. C. & Cooley, J. W. (1986). Fourier transform and convolution subroutines for the IBM 3090 Vector facility. IBM J. Res. Dev. 30, 145–162.Google Scholar

Agarwal, R. C. & Cooley, J. W. (1987). Vectorized mixed radix discrete Fourier transform algorithms. Proc. IEEE, 75, 1283–1292.Google Scholar

Apostol, T. M. (1976). Introduction to analytic number theory. New York: Springer-Verlag.Google Scholar

Auslander, L., Feig, E. & Winograd, S. (1982). New algorithms for evaluating the 3-dimensional discrete Fourier transform. In Computational crystallography, edited by D. Sayre, pp. 451–461. New York: Oxford University Press.Google Scholar

Auslander, L., Feig, E. & Winograd, S. (1983). New algorithms for the multidimensional discrete Fourier transform. IEEE Trans. Acoust. Speech Signal Process. 31, 388–403.Google Scholar

Auslander, L., Feig, E. & Winograd, S. (1984). Abelian semi-simple algebras and algorithms for the discrete Fourier transform. Adv. Appl. Math. 5, 31–55.Google Scholar

Auslander, L. & Tolimieri, R. (1979). Is computing with the finite Fourier transform pure or applied mathematics? Bull. Am. Math. Soc. 1, 847–897.Google Scholar

Auslander, L. & Tolimieri, R. (1985). Ring structure and the Fourier transform. Math. Intelligencer, 7, 49–54.Google Scholar

Auslander, L., Tolimieri, R. & Winograd, S. (1982). Hecke's theorem in quadratic reciprocity, finite nilpotent groups and the Cooley–Tukey algorithm. Adv. Math. 43, 122–172.Google Scholar

Ayoub, R. (1963). An introduction to the analytic theory of numbers. Providence, RI: American Mathematical Society.Google Scholar

Bellman, R. (1958). Dynamic programming and stochastic control processes. Inf. Control, 1, 228–239.Google Scholar

Blahut, R. E. (1985). Fast algorithms for digital signal processing. Reading: Addison-Wesley.Google Scholar

Bracewell, R. N. (1986). The Fourier transform and its applications, 2nd ed., revised. New York: McGraw-Hill.Google Scholar

Brigham, E. O. (1988). The fast Fourier transform and its applications. Englewood Cliffs: Prentice-Hall.Google Scholar

Burrus, C. S. & Eschenbacher, P. W. (1981). An in-place, in-order prime factor FFT algorithm. IEEE Trans. Acoust. Speech Signal Process. 29, 806–817.Google Scholar

Campbell, G. A. & Foster, R. M. (1948). Fourier integrals for practical applications. Princeton: Van Nostrand.Google Scholar

Champeney, D. C. (1973). Fourier transforms and their physical applications. New York: Academic Press.Google Scholar

Cochran, W. T., Cooley, J. W., Favin, D. L., Helms, H. D., Kaenel, R. A., Lang, W. W., Maling, G. C., Nelson, D. E., Rader, C. M. & Welch, P. D. (1967). What is the fast Fourier transform? IEEE Trans. Audio, 15, 45–55.Google Scholar

Cooley, J. W. & Tukey, J. W. (1965). An algorithm for the machine calculation of complex Fourier series. Math. Comput. 19, 297–301.Google Scholar

Eklundh, J. O. (1972). A fast computer method for matrix transposing. IEEE Trans. C-21, 801–803.Google Scholar

Fornberg, A. (1981). A vector implementation of the fast Fourier transform algorithm. Math. Comput. 36, 189–191.Google Scholar

Gentleman, W. M. & Sande, G. (1966). Fast Fourier transforms – for fun and profit. In AFIPS Proc. 1966 Fall Joint Computer Conference, pp. 563–578. Washington, DC: Spartan Books.Google Scholar

Goldstine, H. H. (1977). A history of numerical analysis from the 16th through the 19th century. New York: Springer-Verlag.Google Scholar

Good, I. J. (1958). The interaction algorithm and practical Fourier analysis. J. R. Stat. Soc. B20, 361–372.Google Scholar

Good, I. J. (1960). Addendum [to Good (1958)]. J. R. Stat. Soc. B22, 372–375.Google Scholar

Good, I. J. (1971). The relationship between two fast Fourier transforms. IEEE Trans. C-20, 310–317.Google Scholar

Guessoum, A. & Mersereau, R. M. (1986). Fast algorithms for the multidimensional discrete Fourier transform. IEEE Trans. Acoust. Speech Signal Process. 34, 937–943.Google Scholar

Harris, D. B., McClellan, J. H., Chan, D. S. K. & Schuessler, H. W. (1977). Vector radix fast Fourier transform. Rec. 1977 IEEE Internat. Conf. Acoust. Speech Signal Process. pp. 548–551.Google Scholar

Heideman, M. T., Johnson, D. H. & Burrus, C. S. (1984). Gauss and the history of the fast Fourier transform. IEEE Acoust. Speech Signal Process. Mag. October 1984.Google Scholar

Johnson, H. W. & Burrus, C. S. (1983). The design of optimal DFT algorithms using dynamic programming. IEEE Trans. Acoust. Speech Signal Process. 31, 378–387.Google Scholar

Kolba, D. P. & Parks, T. W. (1977). A prime factor FFT algorithm using high-speed convolution. IEEE Trans. Acoust. Speech Signal Process. 25, 281–294.Google Scholar

Korn, D. G. & Lambiotte, J. J. Jr (1979). Computing the fast Fourier transform on a vector computer. Math. Comput. 33, 977–992.Google Scholar

McClellan, J. H. & Rader, C. M. (1979). Number theory in digital signal processing. Englewood Cliffs: Prentice Hall.Google Scholar

Mersereau, R. & Speake, T. C. (1981). A unified treatment of Cooley–Tukey algorithms for the evaluation of the multidimensional DFT. IEEE Trans. Acoust. Speech Signal Process. 29, 1011–1018.Google Scholar

Mersereau, R. M. (1979). The processing of hexagonally sampled two-dimensional signals. Proc. IEEE, 67, 930–949.Google Scholar

Morris, R. L. (1978). A comparative study of time efficient FFT and WFTA programs for general purpose computers. IEEE Trans. Acoust. Speech Signal Process. 26, 141–150.Google Scholar

Nawab, H. & McClellan, J. H. (1979). Bounds on the minimum number of data transfers in WFTA and FFT programs. IEEE Trans. Acoust. Speech Signal Process. 27, 393–398.Google Scholar

Nussbaumer, H. J. (1981). Fast Fourier transform and convolution algorithms. Berlin: Springer-Verlag.Google Scholar

Nussbaumer, H. J. & Quandalle, P. (1979). Fast computation of discrete Fourier transforms using polynomial transforms. IEEE Trans. Acoust. Speech Signal Process. 27, 169–181.Google Scholar

Pease, M. C. (1968). An adaptation of the fast Fourier transform for parallel processing. J. Assoc. Comput. Mach. 15, 252–264.Google Scholar

Rader, C. M. (1968). Discrete Fourier transforms when the number of data samples is prime. Proc. IEEE, 56, 1107–1108.Google Scholar

Rivard, G. E. (1977). Direct fast Fourier transform of bivariate functions. IEEE Trans. Acoust. Speech Signal Process. 25, 250–252.Google Scholar

Schroeder, M. R. (1986). Number theory in science and communication, 2nd ed. Berlin: Springer-Verlag.Google Scholar

Silverman, H. F. (1977). An introduction to programming the Winograd Fourier transform algorithm (WFTA). IEEE Trans. Acoust. Speech Signal Process. 25, 152–165.Google Scholar

Singleton, R. C. (1969). An algorithm for computing the mixed radix fast Fourier transform. IEEE Trans. AU, 17, 93–103.Google Scholar

Sneddon, I. N. (1951). Fourier transforms. New York: McGraw-Hill.Google Scholar

Swartzrauber, P. N. (1984). FFT algorithms for vector computers. Parallel Comput. 1, 45–63.Google Scholar

Temperton, C. (1983a). Self-sorting mixed-radix fast Fourier transforms. J. Comput. Phys. 52, 1–23.Google Scholar

Temperton, C. (1983b). A note on the prime factor FFT algorithm. J. Comput. Phys. 52, 198–204.Google Scholar

Temperton, C. (1983c). Fast mixed-radix real Fourier transforms. J. Comput. Phys. 52, 340–350.Google Scholar

Temperton, C. (1985). Implementation of a self-sorting in place prime factor FFT algorithm. J. Comput. Phys. 58, 283–299.Google Scholar

Tolimieri, R. (1985). The algebra of the finite Fourier transform and coding theory. Trans. Am. Math. Soc. 287, 253–273.Google Scholar

Tolimieri, R., An, M. & Lu, C. (1989). Algorithms for discrete Fourier transform and convolution. New York: Springer-Verlag.Google Scholar

Uhrich, M. L. (1969). Fast Fourier transforms without sorting. IEEE Trans. AU, 17, 170–172.Google Scholar

Winograd, S. (1976). On computing the discrete Fourier transform. Proc. Natl Acad. Sci. USA, 73, 1005–1006.Google Scholar

Winograd, S. (1977). Some bilinear forms whose multiplicative complexity depends on the field of constants. Math. Syst. Theor. 10, 169–180.Google Scholar

Winograd, S. (1978). On computing the discrete Fourier transform. Math. Comput. 32, 175–199.Google Scholar

Winograd, S. (1980). Arithmetic complexity of computations. CBMS-NST Regional Conf. Series Appl. Math, Publ. No. 33. Philadelphia: SIAM Publications.Google Scholar

International Tables for Crystallography (2006). Vol. B. ch. 1.3, pp. 49-58