International Tables for Crystallography (2006). Vol. F: Crystallography of biological macromolecules, edited by M. G. Rossmann and E. Arnold, ch. 16.2, pp. 346-348.

Section 16.2.2. The maximum-entropy principle in a general context

G. Bricognea*

aLaboratory of Molecular Biology, Medical Research Council, Cambridge CB2 2QH, England
Correspondence e-mail: gb10@mrc-lmb.cam.ac.uk

16.2.2. The maximum-entropy principle in a general context

16.2.2.1. Sources of random symbols and the notion of source entropy

Statistical communication theory uses as its basic modelling device a discrete source of random symbols, which at discrete times [t = 1, 2, \ldots], randomly emits a `symbol' taken out of a finite alphabet [{\cal A} = \{ s_{i} | i = 1, \ldots, n\}]. Sequences of such randomly produced symbols are called `messages'.

An important numerical quantity associated with such a discrete source is its entropy per symbol H, which gives a measure of the amount of uncertainty involved in the choice of a symbol. Suppose that successive symbols are independent and that symbol i has probability [q_{i}]. Then the general requirements that H should be a continuous function of the [q_{i}], should increase with increasing uncertainty, and should be additive for independent sources of uncertainty, suffice to define H uniquely as [H (q_{1}, \ldots, q_{n}) = -k \textstyle\sum\limits_{i=1}^{n}\displaystyle q_{i}\log q_{i}, \eqno(16.2.2.1)] where k is an arbitrary positive constant [Shannon & Weaver (1949), Appendix 2] whose value depends on the unit of entropy chosen. In the following we use a unit such that [k = 1].
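As a purely illustrative aside (not part of the original text), the following Python sketch evaluates equation (16.2.2.1) with [k = 1], i.e. with natural logarithms, for a hypothetical four-symbol alphabet; the probabilities are arbitrary and serve only to confirm that the uniform distribution attains the largest possible value, [\log n].

import numpy as np

def entropy(q):
    # Entropy per symbol, equation (16.2.2.1) with k = 1 (natural logarithms).
    q = np.asarray(q, dtype=float)
    q = q[q > 0]                            # terms with q_i = 0 contribute nothing
    return -np.sum(q * np.log(q))

uniform = np.full(4, 0.25)                  # hypothetical 4-symbol alphabet
biased = np.array([0.7, 0.1, 0.1, 0.1])     # arbitrary non-uniform symbol usage
print(entropy(uniform), np.log(4))          # equal: H_max = log n
print(entropy(biased))                      # strictly smaller than log 4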

These definitions may be extended to the case where the alphabet [{\cal A}] is a continuous space endowed with a uniform measure μ: in this case the entropy per symbol is defined as [H(q) = - \textstyle\int\limits_{{\cal A}}\displaystyle q({\bf s}) \log q({\bf s})\; \hbox{d}\mu ({\bf s}), \eqno(16.2.2.2)] where q is the probability density of the distribution of symbols with respect to measure μ.
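The continuous form (16.2.2.2) can be estimated in the same spirit by simple quadrature. The sketch below (again illustrative only, with an arbitrarily chosen density) takes [{\cal A} = [0, 1]] with the uniform measure; the result is negative, i.e. below [H_{\max} = \log \mu ({\cal A}) = 0].

import numpy as np

x = np.linspace(1e-6, 1.0 - 1e-6, 10001)    # grid on A = [0, 1]
dx = x[1] - x[0]
q = 6.0 * x * (1.0 - x)                     # an arbitrary smooth density (Beta(2, 2))
H = -np.sum(q * np.log(q)) * dx             # H(q) = -∫ q log q dμ, equation (16.2.2.2)
print(H)                                    # about -0.125, below H_max = log μ(A) = 0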

16.2.2.2. The meaning of entropy: Shannon's theorems

Two important theorems [Shannon & Weaver (1949), Appendix 3] provide a more intuitive grasp of the meaning and importance of entropy:

  • (1) H is approximately the logarithm of the reciprocal probability of a typical long message, divided by the number of symbols in the message; and

  • (2) H gives the rate of growth, with increasing message length, of the logarithm of the number of reasonably probable messages, regardless of the precise meaning given to the criterion of being `reasonably probable'.

The entropy H of a source is thus a direct measure of the strength of the restrictions placed on the permissible messages by the distribution of probabilities over the symbols, lower entropy being synonymous with greater restrictions. In the two cases above, the maximum values of the entropy [H_{\max} = \log n] and [H_{\max} = \log \mu ({\cal A})] are reached when all the symbols are equally probable, i.e. when q is a uniform probability distribution over the symbols. When this distribution is not uniform, the usage of the different symbols is biased away from this maximum freedom, and the entropy of the source is lower; by Shannon's theorem (2), the number of `reasonably probable' messages of a given length emanating from the source decreases accordingly.

The quantity that measures most directly the strength of the restrictions introduced by the non-uniformity of q is the difference [H(q) - H_{\max}], since the proportion of N-atom random structures which remain `reasonably probable' in the ensemble of the corresponding source is [\exp \{N[H(q) - H_{\max}]\}]. This difference may be written (using continuous rather than discrete distributions) [H(q) - H_{\max} = - \textstyle\int\limits_{{\cal A}}\displaystyle q({\bf s}) \log [q({\bf s})/m({\bf s})] \; \hbox{d}\mu ({\bf s}), \eqno(16.2.2.3)] where m(s) is the uniform distribution which is such that [H(m) = H_{\max} = \ \log \mu ({\cal A})].
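As a concrete illustration (a hypothetical discrete case with uniform m, not taken from the text), the difference [H(q) - H_{\max}] can be computed directly as the relative entropy of q with respect to m; it vanishes when [q = m], is negative otherwise, and [\exp \{N[H(q) - H_{\max}]\}] then gives the surviving fraction of `reasonably probable' messages of length N.

import numpy as np

def relative_entropy(q, m):
    # S = -sum_i q_i log(q_i / m_i), the discrete analogue of equation (16.2.2.3).
    q, m = np.asarray(q, dtype=float), np.asarray(m, dtype=float)
    mask = q > 0
    return -np.sum(q[mask] * np.log(q[mask] / m[mask]))

m = np.full(4, 0.25)                  # uniform distribution, H(m) = H_max = log 4
q = np.array([0.7, 0.1, 0.1, 0.1])    # arbitrary biased usage of the symbols
S = relative_entropy(q, m)            # equals H(q) - H_max, here negative
print(S, np.exp(100 * S))             # surviving fraction for N = 100 symbols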

16.2.2.3. Jaynes' maximum-entropy principle

From the fundamental theorems just stated, which may be recognized as Gibbs' argument in a different guise, Jaynes' own maximum-entropy argument proceeds with striking lucidity and constructive simplicity, along the following lines:

  • (1) experimental observation of, or `data acquisition' on, a given system enables us to progress from an initial state of uncertainty to a state of lesser uncertainty about that system;

  • (2) uncertainty reflects the existence of numerous possibilities of accounting for the available data, viewed as constraints, in terms of a physical model of the internal degrees of freedom of the system;

  • (3) new data, viewed as new constraints, reduce the range of these possibilities;

  • (4) conversely, any step in our treatment of the data that would further reduce that range of possibilities amounts to applying extra constraints (even if we do not know what they are) which are not warranted by the available data;

  • (5) hence Jaynes's rule: `The probability assignment over the range of possibilities [i.e. our picture of residual uncertainty] shall be the one with maximum entropy consistent with the available data, so as to remain maximally non-committal with respect to the missing data'.

The only requirement for this analysis to be applicable is that the `ranges of possibilities' to which it refers should be representable (or well approximated) by ensembles of abstract messages emanating from a random source. The entropy to be maximized is then the entropy per symbol of that source.

The final form of the maximum-entropy criterion is thus that q(s) should be chosen so as to maximize, under the constraints expressing the knowledge of newly acquired data, its entropy [{\cal S}_{m}(q) = - \textstyle\int\limits_{{\cal A}}\displaystyle q({\bf s}) \log [q({\bf s})/m({\bf s})] \; \hbox{d}\mu ({\bf s}) \eqno(16.2.2.4)] relative to the `prior prejudice' m(s) which maximizes H in the absence of these data.

16.2.2.4. Jaynes' maximum-entropy formalism

Jaynes (1957) solved the problem of explicitly determining such maximum-entropy distributions in the case of general linear constraints, using an analytical apparatus first exploited by Gibbs in statistical mechanics.

The maximum-entropy distribution [q^{\rm ME}({\bf s})], under the prior prejudice m(s), satisfying the linear constraint equations [{\cal C}_{j}(q) \equiv \textstyle\int\limits_{\cal A}\displaystyle q({\bf s}) C_{j}({\bf s})\; {\rm d}\mu ({\bf s}) = c_{j}\quad (\;j=1, 2, \ldots, M), \eqno(16.2.2.5)] where the [{\cal C}_{j}(q)] are linear constraint functionals defined by given constraint functions [C_{j}({\bf s})], and the [c_{j}] are given constraint values, is obtained by maximizing with respect to q the relative entropy defined by equation (16.2.2.4). An extra constraint is the normalization condition [{\cal C}_{0}(q) \equiv \textstyle\int\limits_{\cal A}\displaystyle q({\bf s})\ 1\ {\rm d}\mu ({\bf s}) = 1, \eqno(16.2.2.6)] to which it is convenient to give the label [j = 0], so that it can be handled together with the others by putting [C_{0}({\bf s}) = 1], [c_{0} = 1].

By a standard variational argument, this constrained maximization is equivalent to the unconstrained maximization of the functional [{\cal S}_{m}(q) + \textstyle\sum\limits_{j=0}^{M}\displaystyle \lambda_{j} {\cal C}_{j}(q), \eqno(16.2.2.7)] where the [\lambda_{j}] are Lagrange multipliers whose values may be determined from the constraints. This new variational problem is readily solved: if q(s) is varied to [q({\bf s})+\delta q({\bf s})], the resulting variations in the functionals [{\cal S}_{m}] and [{\cal C}_{j}] will be [\eqalign{\delta {\cal S}_{m} &= \textstyle\int\limits_{\cal A} \displaystyle\{-1 -\log \left[q({\bf s})/m({\bf s})\right]\}\; \delta q({\bf s})\; \hbox{d}\mu ({\bf s}) \quad\hbox{ and } \cr\noalign{\vskip5pt} \delta {\cal C}_{j} &= \textstyle\int\limits_{\cal A}\displaystyle\{C_{j}({\bf s})\} \;\delta q({\bf s}) \;\hbox{d}\mu ({\bf s}),} \eqno(16.2.2.8)] respectively. If the variation of the functional (16.2.2.7) is to vanish for arbitrary variations [\delta q({\bf s})], the integrand in the expression for that variation from (16.2.2.8) must vanish identically. Therefore the maximum-entropy density distribution [q^{\rm ME}({\bf s})] satisfies the relation [-1 -\log \left[q({\bf s})/m({\bf s})\right] + \textstyle\sum\limits_{j=0}^{M}\displaystyle \lambda_{j} C_{j}({\bf s}) = 0 \eqno(16.2.2.9)] and hence [q^{\rm ME}({\bf s}) = m({\bf s}) \exp (\lambda_{0}-1) \exp \left[\textstyle\sum\limits_{j=1}^{M}\displaystyle \lambda_{j} C_{j}({\bf s})\right]. \eqno(16.2.2.10)]

It is convenient now to separate the multiplier [\lambda_{0}] associated with the normalization constraint by putting [\lambda_{0}-1 = -\log Z, \eqno(16.2.2.11)] where Z is a function of the other multipliers [\lambda_{1}, \ldots , \lambda_{M}]. The final expression for [q^{\rm ME}({\bf s})] is thus [q^{\rm ME}({\bf s}) = {m({\bf s}) \over Z(\lambda_{1},\ldots,\lambda_{M})} \exp \left[\sum_{j=1}^{M} \lambda_{j} C_{j}({\bf s}) \right]. \eqno(\hbox{ME1})] The values of Z and of [\lambda_{1}, \ldots , \lambda_{M}] may now be determined by solving the initial constraint equations. The normalization condition demands that [Z(\lambda_{1},\ldots,\lambda_{M}) = \textstyle\int\limits_{\cal A}\displaystyle m({\bf s}) \exp \left[\textstyle\sum\limits_{j=1}^{M}\displaystyle \lambda_{j} C_{j}({\bf s}) \right] \;\hbox{d}\mu ({\bf s}). \eqno(\hbox{ME2})] The generic constraint equations (16.2.2.5) determine [\lambda_{1}, \ldots , \lambda_{M}] by the conditions that [\textstyle\int_{\cal A}\displaystyle{[m({\bf s})/Z]} \exp \left[\textstyle\sum\limits_{k=1}^{M}\displaystyle \lambda_{k} C_{k}({\bf s}) \right] C_{j}({\bf s})\; \hbox{d}\mu ({\bf s}) = c_{j} \eqno(16.2.2.12)] for [j=1, 2, \ldots , M]. But, by Leibniz's rule of differentiation under the integral sign, these equations may be written in the compact form [{\partial (\log Z) \over \partial\lambda_{j}} = c_{j} \quad (\;j=1, 2, \ldots, M). \eqno(\hbox{ME3})] Equations (ME1), (ME2) and (ME3) constitute the maximum-entropy equations.
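The Python sketch below, which is not part of the original text, solves equations (ME1), (ME2) and (ME3) numerically for a deliberately simple hypothetical case: a six-symbol alphabet [s = 1, \ldots, 6], a uniform prior prejudice m, and a single linear constraint [C_{1}({\bf s}) = s] with prescribed value [c_{1} = 4.5]. The multiplier [\lambda_{1}] is obtained from (ME3) by one-dimensional root finding (SciPy's brentq), after which (ME1) and (ME2) give [q^{\rm ME}] explicitly; with several constraints the same step would require a multidimensional solver for the gradient conditions (ME3).

import numpy as np
from scipy.optimize import brentq

s = np.arange(1, 7, dtype=float)          # alphabet {1, ..., 6}
m = np.full(6, 1.0 / 6.0)                 # uniform prior prejudice
c1 = 4.5                                  # prescribed constraint value

def log_Z(lam):
    # Partition function of equation (ME2) for a single multiplier lambda_1.
    return np.log(np.sum(m * np.exp(lam * s)))

def dlog_Z(lam, h=1.0e-6):
    # Numerical derivative of log Z; equation (ME3) requires it to equal c1.
    return (log_Z(lam + h) - log_Z(lam - h)) / (2.0 * h)

# Solve d(log Z)/d(lambda_1) = c1 for the multiplier lambda_1.
lam1 = brentq(lambda lam: dlog_Z(lam) - c1, -10.0, 10.0)

# Equation (ME1): the maximum-entropy distribution itself.
q_ME = m * np.exp(lam1 * s) / np.exp(log_Z(lam1))

print(lam1)                               # the Lagrange multiplier lambda_1
print(q_ME)                               # exponentially modulated distribution
print(np.sum(q_ME * s))                   # reproduces the constraint value 4.5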

The maximal value attained by the entropy is readily found: [\eqalign{{\cal S}_{m}(q^{\rm ME}) &= -\textstyle\int\limits_{\cal A}\displaystyle q^{\rm ME}({\bf s}) \log \left[q^{\rm ME}({\bf s})/m({\bf s})\right]\; \hbox{d}\mu ({\bf s})\cr\noalign{\vskip5pt} &= -\textstyle\int\limits_{\cal A}\displaystyle q^{\rm ME}({\bf s}) \left[ -\log Z + \textstyle\sum\limits_{j=1}^{M}\displaystyle \lambda_{j} C_{j}({\bf s})\right] \;\hbox{d}\mu ({\bf s}),}] i.e. using the constraint equations [{\cal S}_{m}(q^{\rm ME}) = \log Z - \textstyle\sum\limits_{j=1}^{M}\displaystyle \lambda_{j} c_{j}. \eqno(16.2.2.13)] The latter expression may be rewritten, by means of equations (ME3), as [{\cal S}_{m}(q^{\rm ME}) = \log Z - \sum_{j=1}^{M} \lambda_{j} {\partial(\log Z) \over \partial \lambda_{j}}, \eqno(16.2.2.14)] which shows that, in their dependence on the λ's, the entropy and log Z are related by Legendre duality.
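Continuing the hypothetical six-symbol sketch above (it reuses q_ME, m, lam1, c1 and log_Z defined there), equation (16.2.2.13) can be checked numerically: the relative entropy evaluated directly from (16.2.2.4) coincides with [\log Z - \textstyle\sum_{j}\lambda_{j}c_{j}].

S_direct = -np.sum(q_ME * np.log(q_ME / m))    # relative entropy, equation (16.2.2.4)
S_legendre = log_Z(lam1) - lam1 * c1           # log Z - sum_j lambda_j c_j, (16.2.2.13)
print(S_direct, S_legendre)                    # the two values agree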

Jaynes' theory relates this maximal value of the entropy to the prior probability [{\cal P}({\bf c})] of the vector c of simultaneous constraint values, i.e. to the size of the sub-ensemble of messages of length N that fulfil the constraints embodied in (16.2.2.5), relative to the size of the ensemble of messages of the same length when the source operates with the symbol probability distribution given by the prior prejudice m. Indeed, it is a straightforward consequence of Shannon's second theorem (Section 16.2.2.2) as expressed in equation (16.2.2.3) that [{\cal P}^{\rm ME}({\bf c}) \propto \exp({\cal S}), \eqno(16.2.2.15)] where [{\cal S} = \log Z^{N} - \lambda \cdot {\bf c} = N {\cal S}_{m}(q^{\rm ME}) \eqno(16.2.2.16)] is the total entropy for N symbols.

References

Jaynes, E. T. (1957). Information theory and statistical mechanics. Phys. Rev. 106, 620–630.
Shannon, C. E. & Weaver, W. (1949). The mathematical theory of communication. Urbana: University of Illinois Press.







































