International Tables for Crystallography
Volume C
Mathematical, physical and chemical tables
Edited by E. Prince

International Tables for Crystallography (2006). Vol. C. ch. 8.2, pp. 691-692

Section 8.2.3. Entropy maximization

E. Prince (a) and D. M. Collins (b)

(a) NIST Center for Neutron Research, National Institute of Standards and Technology, Gaithersburg, MD 20899, USA, and (b) Laboratory for the Structure of Matter, Code 6030, Naval Research Laboratory, Washington, DC 20375-5341, USA

Introduction

Entropy maximization, like least squares, is of interest primarily as a framework within which to find or adjust parameters of a model. Rationalization of the name 'entropy maximization' by analogy to thermodynamics is controversial, but there is a formal proof (Shore & Johnson, 1980; Johnson & Shore, 1983) supporting entropy maximization as the unique method of inference that satisfies basic consistency requirements (Livesey & Skilling, 1985). The proof consists of discovering the consequences of four consistency axioms, which may be stated informally as follows:

  • (1) the result of the inference should be unique;

  • (2) the result of the inference should be invariant to any transformations of coordinate system;

  • (3) it should not matter whether independent information is accounted for independently or jointly;

  • (4) it should not matter whether independent subsystems are treated separately in conditional problems or collected and treated jointly.

The term 'entropy' is used in this chapter as a name only, the name for variation functions that include the form [\varphi \ln \varphi], where [\varphi] may represent probability or, more generally, a positive proportion. Any positive measure, either observed or derived, of the relative apportionment of a characteristic quantity among observations can serve as the proportion.

The method of entropy maximization may be formulated as follows: given a set of n observations, [y_i], that are measurements of quantities that can be described by model functions, [M_i({\bf x})], where [{\bf x}] is a vector of parameters, find the prior, positive proportions, [\mu_i=f(y_i)], and the values of the parameters for which the positive proportions [\varphi_i=f[M_i({\bf x})]] make the sum

[S=-\sum_{i=1}^n\varphi_i^{\prime}\ln(\varphi_i^{\prime}/\mu_i^{\prime}), \eqno(8.2.3.1)]

where [\varphi_i^{\prime}=\varphi_i\big/\sum\varphi_j] and [\mu_i^{\prime}=\mu_i\big/\sum\mu_j], a maximum. S is called the Shannon–Jaynes entropy. For some applications (Collins, 1982), it is desirable to include in the variation function additional terms or restraints that give S the form

[S=-\sum_{i=1}^n\varphi_i^{\prime}\ln(\varphi_i^{\prime}/\mu_i^{\prime})+\lambda_1\xi_1({\bf x},{\bf y})+\lambda_2\xi_2({\bf x},{\bf y})+\ldots, \eqno(8.2.3.2)]

where the [\lambda]s are undetermined multipliers, but we shall discuss here only applications where [\lambda_i=0] for all i, and an unrestrained entropy is maximized. A necessary condition for S to be a maximum is for the gradient to vanish. Using

[{\partial S \over \partial x_j}=\sum_{i=1}^n\left({\partial S \over \partial\varphi_i}\right)\left({\partial\varphi_i \over \partial x_j}\right) \eqno(8.2.3.3)]

and

[{\partial S \over \partial\varphi_i}=\sum_{k=1}^n\left({\partial S \over \partial\varphi_k^{\prime}}\right)\left({\partial\varphi_k^{\prime} \over \partial\varphi_i}\right), \eqno(8.2.3.4)]

straightforward algebraic manipulation gives equations of the form

[\sum_{i=1}^n\left\{{\partial\varphi_i \over \partial x_j}-\varphi_i^{\prime}\left(\sum_{k=1}^n{\partial\varphi_k \over \partial x_j}\right)\right\}\ln\left({\varphi_i^{\prime} \over \mu_i^{\prime}}\right)=0. \eqno(8.2.3.5)]

It should be noted that, although the entropy function should, in principle, have a unique stationary point corresponding to the global maximum, there are occasional circumstances, particularly with restrained problems where the undetermined multipliers are not all zero, where it may be necessary to verify that a stationary solution actually maximizes entropy.

Some examples
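Before the crystallographic examples, a small numerical sketch may help fix the notation of the introduction. It evaluates the Shannon–Jaynes entropy for given positive proportions and the left-hand side of the vanishing-gradient condition, which differs from the gradient of S only by the factor [-1\big/\sum\varphi_i]. The function names, the use of NumPy and the array layout (one row of derivatives per observation, one column per parameter) are assumptions made for illustration and are not part of the original text.

```python
import numpy as np

def shannon_jaynes_entropy(phi, mu):
    """S = -sum_i phi'_i ln(phi'_i / mu'_i), with phi and mu normalized to unit sum."""
    phi = np.asarray(phi, dtype=float)
    mu = np.asarray(mu, dtype=float)
    p = phi / phi.sum()   # phi'_i
    m = mu / mu.sum()     # mu'_i
    return -np.sum(p * np.log(p / m))

def stationarity_lhs(phi, mu, dphi_dx):
    """Left-hand side of the vanishing-gradient condition, one value per parameter.

    dphi_dx[i, j] holds d(phi_i)/d(x_j) (hypothetical layout chosen for this sketch).
    """
    phi = np.asarray(phi, dtype=float)
    mu = np.asarray(mu, dtype=float)
    dphi_dx = np.asarray(dphi_dx, dtype=float)
    p = phi / phi.sum()
    m = mu / mu.sum()
    # braces term: d(phi_i)/d(x_j) - phi'_i * sum_k d(phi_k)/d(x_j)
    braces = dphi_dx - np.outer(p, dphi_dx.sum(axis=0))
    return braces.T @ np.log(p / m)
```

When the proportions [\varphi_i] are proportional to the priors [\mu_i], the function returns S = 0, which is the largest value S can take; any other apportionment gives a negative value.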


For an example of the application of the maximum-entropy method, consider (Collins, 1984) a collection of diffraction intensities in which various subsets have been measured under different conditions, such as on different films or with different crystals. All systematic corrections have been made, but it is necessary to put the different subsets onto a common scale. Assume that every subset has measurements in common with some other subset, and that no collection of subsets is isolated from the others. Let the measurement of intensity [I_h] in subset i be [J_{hi}], and let the scale factor that puts intensity [I_h] on the scale of subset i be [k_i]. The entropy of equation (8.2.3.1) becomes

[S=-\sum_{h=1}^n\sum_{i=1}^m(k_iI_h)^{\prime}\ln\left[{(k_iI_h)^{\prime} \over J_{hi}^{\prime}}\right], \eqno(8.2.3.6)]

where the term is zero if [I_h] does not appear in subset i. Because [k_i] and [I_h] are parameters of the model, the stationarity conditions (8.2.3.5) become

[\sum_{i=1}^mk_i\ln\left[{(k_iI_h)^{\prime} \over J_{hi}^{\prime}}\right]-\sum_{h=1}^n\sum_{i=1}^m(k_iI_h)^{\prime}\left(\sum_{l=1}^mk_l\right)\ln\left[{(k_iI_h)^{\prime} \over J_{hi}^{\prime}}\right]=0 \eqno(8.2.3.7)]

and

[\sum_{h=1}^nI_h\ln\left[{(k_iI_h)^{\prime} \over J_{hi}^{\prime}}\right]-\sum_{h=1}^n\sum_{i=1}^m(k_iI_h)^{\prime}\left(\sum_{l=1}^nI_l\right)\ln\left[{(k_iI_h)^{\prime} \over J_{hi}^{\prime}}\right]=0. \eqno(8.2.3.8)]

These simplify to

[\ln I_h=Q-\sum_{i=1}^mk_i^{\prime}\ln(k_i/J_{hi}) \eqno(8.2.3.9)]

and

[\ln k_i=Q-\sum_{h=1}^nI_h^{\prime}\ln(I_h/J_{hi}), \eqno(8.2.3.10)]

where

[Q=\sum_{h=1}^n\sum_{i=1}^m(k_iI_h)^{\prime}\ln[(k_iI_h)/J_{hi}]. \eqno(8.2.3.11)]

Equations (8.2.3.9), (8.2.3.10) and (8.2.3.11) may be solved iteratively, starting with the approximations [k_i=\sum_{h=1}^nJ_{hi}] and Q = 0.

The standard uncertainties of scale factors and intensities are not used in the solution of equations (8.2.3.9), (8.2.3.10) and (8.2.3.11), and must be computed separately. They may be estimated on a fractional basis from the variances of estimated population means [\langle J_{hi}/I_h\rangle] for a scale factor and [\langle J_{hi}/k_i\rangle] for an intensity, respectively. The maximum-entropy scale factors and scaled intensities are relative, and either set may be multiplied by an arbitrary, positive constant without affecting the solution.
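The iterative solution of the three coupled equations for [\ln I_h], [\ln k_i] and Q can be sketched as follows. This is a minimal illustration under stated assumptions, not the published implementation: it assumes that every intensity has been measured in every subset (so that no term is missing), that all [J_{hi}] are positive, and that a fixed number of sweeps suffices. The function name `maxent_scale` and the array layout `J[h, i]` are hypothetical.

```python
import numpy as np

def maxent_scale(J, n_iter=50):
    """Iterate the equations for ln I_h, ln k_i and Q on a complete table J[h, i] > 0."""
    J = np.asarray(J, dtype=float)
    k = J.sum(axis=0)   # starting approximation k_i = sum_h J_hi
    Q = 0.0             # starting approximation Q = 0
    I = np.ones(J.shape[0])
    for _ in range(n_iter):
        k_norm = k / k.sum()                           # k'_i
        # ln I_h = Q - sum_i k'_i ln(k_i / J_hi)
        I = np.exp(Q - np.log(k[None, :] / J) @ k_norm)
        I_norm = I / I.sum()                           # I'_h
        # ln k_i = Q - sum_h I'_h ln(I_h / J_hi)
        k = np.exp(Q - I_norm @ np.log(I[:, None] / J))
        model = np.outer(I, k)                         # k_i I_h
        model_norm = model / model.sum()               # (k_i I_h)'
        # Q = sum_h sum_i (k_i I_h)' ln[(k_i I_h) / J_hi]
        Q = np.sum(model_norm * np.log(model / J))
    return I, k, Q
```

Because the scale factors and intensities are relative, only ratios such as `k / k[0]` and products such as `np.outer(I, k)` are meaningful when comparing the result with other scalings.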

For another example, consider the maximum-entropy fit of a linear function to a set of independently distributed variables. Let [y_i] represent an observation drawn from a population with mean [a_0+a_1x_i] and finite variance [\sigma_i^2]; we wish to find the maximum-entropy estimate of [a_0] and [a_1]. Assume that the mismatch between the observation and the model is normally distributed, so that its probability density is the positive proportion

[\varphi_i=\varphi(\Delta_i)=(2\pi\sigma_i^2)^{-1/2}\exp(-\Delta_i^2/2\sigma_i^2), \eqno(8.2.3.12)]

where [\Delta_i=y_i-(a_0+a_1x_i)]. The prior proportion is given by

[\mu_i=\varphi(0)=(2\pi\sigma_i^2)^{-1/2}. \eqno(8.2.3.13)]

Letting [A_\varphi=1\big/\sum\varphi_i], the stationarity conditions (8.2.3.5) become

[\sum_{i=1}^n\left[\varphi_i\Delta_i/\sigma_i^2-A_\varphi\,\varphi_i\left(\sum_{j=1}^n\varphi_j\Delta_j/\sigma_j^2\right)\right]\Delta_i^2/\sigma_i^2=0 \eqno(8.2.3.14)]

and

[\sum_{i=1}^n\left[\varphi_i\Delta_i\,x_i/\sigma_i^2-A_\varphi\,\varphi_i\left(\sum_{j=1}^n\varphi_j\Delta_jx_j/\sigma_j^2\right)\right]\Delta_i^2/\sigma_i^2=0, \eqno(8.2.3.15)]

which simplify to

[\pmatrix{\sum_{i=1}^nw_i & \sum_{i=1}^nw_ix_i \cr \sum_{i=1}^nw_ix_i & \sum_{i=1}^nw_ix_i^2}\pmatrix{a_0 \cr a_1}=\pmatrix{\sum_{i=1}^nw_i\left(y_i-\sigma_i^2A_\varphi\sum_{j=1}^n\varphi_j\Delta_j/\sigma_j^2\right) \cr \sum_{i=1}^nw_i\left(y_ix_i-\sigma_i^2A_\varphi\sum_{j=1}^n\varphi_j\Delta_jx_j/\sigma_j^2\right)}, \eqno(8.2.3.16)]

where [w_i] may be interpreted as a weight and is given by [w_i=\varphi_i\Delta_i^2/\sigma_i^4]. Equations (8.2.3.16) may be solved iteratively, starting with the approximations that the sums over j on the right-hand side are zero and [w_i=1.0] for all i, that is, using the solutions to the corresponding, unweighted least-squares problem. Resetting [w_i] after each iteration by only half the indicated amount defeats a tendency towards oscillation. Approximate standard uncertainties for the parameters, [a_0] and [a_1], may be computed by conventional means after setting to zero the sums over j on the right-hand side of equations (8.2.3.16). (See, however, a discussion of computing variance–covariance matrices in Section 8.1.2.) Note that [w_i] is small for both small and large values of [|\Delta_i|]. Thus, in contrast to the robust/resistant methods (Section 8.2.2), which de-emphasize only the large differences, this method down-weights both the small and the large differences and adjusts the parameters on the basis of the moderate-size mismatches between model and data. The procedure used in this two-dimensional, linear model can be extended to linear models, and linear approximations to nonlinear models, in any number of dimensions using methods discussed in Chapter 8.1.
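The iteration described for the straight-line fit can be sketched as follows. The sketch rests on assumptions that go beyond the text: the phrase 'half the indicated amount' is read as moving each weight halfway from its current value to [\varphi_i\Delta_i^2/\sigma_i^4], a fixed number of iterations replaces a convergence test, and the function names are hypothetical. Convergence is not guaranteed by this sketch and should be checked for any real data set.

```python
import numpy as np

def residual_terms(a, x, y, sigma):
    """Return w_i = phi_i Delta_i^2 / sigma_i^4 and the two A_phi-scaled sums over j."""
    delta = y - (a[0] + a[1] * x)
    phi = np.exp(-delta**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
    w = phi * delta**2 / sigma**4
    A_phi = 1.0 / phi.sum()
    c0 = A_phi * np.sum(phi * delta / sigma**2)       # A_phi sum_j phi_j Delta_j / sigma_j^2
    c1 = A_phi * np.sum(phi * delta * x / sigma**2)   # A_phi sum_j phi_j Delta_j x_j / sigma_j^2
    return w, c0, c1

def normal_equations(w, x, y, sigma, c0, c1):
    """Build the 2x2 matrix and right-hand side of the weighted linear system."""
    M = np.array([[w.sum(), (w * x).sum()],
                  [(w * x).sum(), (w * x**2).sum()]])
    rhs = np.array([np.sum(w * (y - sigma**2 * c0)),
                    np.sum(w * (y * x - sigma**2 * c1))])
    return M, rhs

def maxent_line_fit(x, y, sigma, n_iter=20):
    """Iteratively estimate (a0, a1); n_iter=0 returns the unweighted least-squares start."""
    x, y, sigma = (np.asarray(v, dtype=float) for v in (x, y, sigma))
    w = np.ones_like(y)                                # starting weights w_i = 1.0
    M, rhs = normal_equations(w, x, y, sigma, 0.0, 0.0)
    a = np.linalg.solve(M, rhs)                        # sums over j set to zero
    for _ in range(n_iter):
        w_indicated, c0, c1 = residual_terms(a, x, y, sigma)
        w = 0.5 * (w + w_indicated)                    # half-step damping (assumed reading)
        M, rhs = normal_equations(w, x, y, sigma, c0, c1)
        a = np.linalg.solve(M, rhs)
    return a
```

Keeping the construction of the linear system in its own function makes it possible to reuse it, with the sums over j set to zero, when approximate standard uncertainties are computed by conventional means.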

The maximum-entropy method has been described (Jaynes, 1979) as being 'maximally noncommittal with respect to all other matters; it is as uniform (by the criterion of the Shannon information measure) as it can be without violating the given constraint[s]'. Least squares, because it gives minimum-variance estimates of the parameters of a model, and therefore of all functions of the model including the predicted values of any additional data points, might be similarly described as 'maximally committal' with regard to the collection of more data. Least squares and maximum entropy can therefore be viewed as the extremes of a range of methods, classified according to the degree of a priori confidence in the correctness of the model, with the robust/resistant methods lying somewhere in between (although generally closer to least squares). Maximum-entropy methods can be used when it is desirable to avoid prejudice in favour of a model because of doubt as to the model's correctness.


References

Collins, D. M. (1982). Electron density images from imperfect data by iterative entropy maximization. Nature (London), 298, 49–51.
Collins, D. M. (1984). Scaling by entropy maximization. Acta Cryst. A40, 705–708.
Jaynes, E. T. (1979). Where do we stand on maximum entropy? The maximum entropy formalism, edited by R. D. Levine & M. Tribus, pp. 44–49. Cambridge, MA: Massachusetts Institute of Technology.
Johnson, R. W. & Shore, J. E. (1983). Comments on and correction to 'Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy'. IEEE Trans. Inf. Theory, IT-29, 942–943.
Livesey, A. K. & Skilling, J. (1985). Maximum entropy theory. Acta Cryst. A41, 113–122.
Shore, J. E. & Johnson, R. W. (1980). Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy. IEEE Trans. Inf. Theory, IT-26, 26–37.
