
In probability theory and, in particular, information theory, the mutual information, or transinformation, of two random variables is a quantity that measures the mutual dependence of the two variables. The most common unit of measurement of mutual information is the bit, in which case the logarithms below should be taken to the base 2.

Intuitively, mutual information measures the information that X and Y share: how much knowing one of the variables reduces uncertainty about the other. If X and Y are independent, then X contains no information about Y and vice versa, so their mutual information is zero. If X and Y are identical, then all information conveyed by X is shared with Y: knowing X determines Y and vice versa, so the mutual information equals the information conveyed by X (or Y) alone, namely the entropy of X. In a specific sense (see below), mutual information quantifies how far the joint distribution of X and Y is from the product of their marginal distributions.

Formally, the mutual information of two discrete random variables X and Y can be defined as:

I(X;Y) = \sum_{y \in Y} \sum_{x \in X} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)},

where p(x,y) is the joint probability distribution function of X and Y, and p(x) and p(y) are the marginal probability distribution functions of X and Y respectively.
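The double sum can be evaluated directly from a table of joint probabilities. The following is a minimal sketch, not a reference implementation: the function name `mutual_information`, the dictionary representation of p(x,y), and the two example distributions are assumptions made for illustration.

```python
import math

def mutual_information(joint, base=2.0):
    """Mutual information of a discrete joint distribution.

    `joint` maps (x, y) pairs to probabilities p(x, y) that sum to 1.
    With base=2 the result is in bits.
    """
    # Marginals p(x) and p(y), obtained by summing the joint table.
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p

    mi = 0.0
    for (x, y), p in joint.items():
        if p > 0.0:  # terms with p(x, y) = 0 contribute nothing to the sum
            mi += p * math.log(p / (px[x] * py[y]), base)
    return mi

# Two independent fair bits: p(x, y) = 1/4 for every pair.
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
# Two identical fair bits: all probability mass lies on x == y.
identical = {(0, 0): 0.5, (1, 1): 0.5}

print(mutual_information(independent))
print(mutual_information(identical))
```

The two example tables correspond to the extreme cases described in the introduction: the independent pair should give zero, and the identical pair should give the entropy of one fair bit, that is, one bit.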

In the continuous case, we replace summation by a definite double integral:

I(X;Y) = \int_Y \int_X p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} \, dx \, dy,

where p(x,y) is now the joint probability density function of X and Y, and p(x) and p(y) are the marginal probability density functions of X and Y respectively.
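As a worked instance of the continuous definition, consider a standard textbook case that is not discussed in this article: X and Y jointly Gaussian with correlation coefficient \rho, where |\rho| < 1 (the symbol \rho is introduced here only for this example). The means and variances drop out, and the double integral evaluates in closed form:

```latex
I(X;Y) = -\frac{1}{2} \log\left(1 - \rho^2\right).
```

For \rho = 0 this gives I(X;Y) = 0, as expected, since uncorrelated jointly Gaussian variables are independent. As |\rho| approaches 1 the value grows without bound.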

Mutual information is a measure of dependence in the following sense: I(X;Y) = 0 if and only if X and Y are independent random variables. This is easy to see in one direction: if X and Y are independent, then p(x,y) = p(x) p(y), so the logarithm in every term of the sum (or integral) above vanishes:

\log \frac{p(x,y)}{p(x)\,p(y)} = \log 1 = 0.

Moreover, mutual information is nonnegative (i.e. I(X;Y) ≥ 0; see below) and symmetric (i.e. I(X;Y) = I(Y;X)).

Several generalizations of mutual information to more than two random variables have been proposed, but no widely agreed-upon definition has yet emerged.

Relation to other quantities


Mutual information can be equivalently expressed as

\begin{align}
I(X;Y) &= H(X) - H(X|Y) \\
&= H(Y) - H(Y|X) \\
&= H(X) + H(Y) - H(X,Y)
\end{align}

where H(X) and H(Y) are entropies, H(X|Y) and H(Y|X) are conditional entropies, and H(X,Y) is the joint entropy of X and Y. Since H(X) ≥ H(X|Y), this characterization is consistent with the nonnegativity property stated above.
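These identities can be checked numerically. The sketch below is illustrative only: the joint table `joint` is a made-up example, and the helper names `entropy` and `marginal` are assumptions. Each conditional entropy is computed from its own definition rather than from the chain rule, so the agreement of the four values is a genuine check.

```python
import math

def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0.0)

def marginal(joint, axis):
    """Marginal over coordinate `axis` of the (x, y) keys (0 for X, 1 for Y)."""
    out = {}
    for pair, p in joint.items():
        out[pair[axis]] = out.get(pair[axis], 0.0) + p
    return out

# Hypothetical joint distribution of two dependent binary variables.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

px, py = marginal(joint, 0), marginal(joint, 1)
h_x, h_y, h_xy = entropy(px), entropy(py), entropy(joint)

# Conditional entropies from their definitions:
# H(X|Y) = -sum p(x,y) log p(x|y), with p(x|y) = p(x,y) / p(y).
h_x_given_y = -sum(p * math.log2(p / py[y])
                   for (x, y), p in joint.items() if p > 0.0)
h_y_given_x = -sum(p * math.log2(p / px[x])
                   for (x, y), p in joint.items() if p > 0.0)

# Direct evaluation of the defining double sum.
mi_direct = sum(p * math.log2(p / (px[x] * py[y]))
                for (x, y), p in joint.items() if p > 0.0)

# All four values should agree up to floating-point rounding.
print(mi_direct, h_x - h_x_given_y, h_y - h_y_given_x, h_x + h_y - h_xy)
```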

Note that H(X|X) = 0 and therefore H(X) = I(X;X). This is the reason why entropy is often called self-information. Since the conditional entropy H(X|Y) of a discrete variable is nonnegative, I(X;Y) = H(X) - H(X|Y) ≤ H(X) = I(X;X), and one can formulate the basic principle that a variable contains at least as much information about itself as any other variable can provide.

Mutual information can also be expressed as the Kullback-Leibler divergence of the product p(x) p(y) of the marginal distributions of X and Y from their joint distribution p(x,y):

I(X;Y) = D_{\mathrm{KL}}(p(x,y)\|p(x)p(y)).

Furthermore, let p(x|y) = p(x, y) / p(y). Then

\begin{align}
I(X;Y) &= \sum_y p(y) \sum_x p(x|y) \log \frac{p(x|y)}{p(x)} \\
&= \sum_y p(y) \; D_{\mathrm{KL}}(p(x|y)\|p(x)) \\
&= \mathbb{E}_Y\{D_{\mathrm{KL}}(p(x|y)\|p(x))\}.
\end{align}

Mutual information can thus also be understood as the expectation of the Kullback-Leibler divergence of the univariate distribution p(x) of X from the conditional distribution p(x|y) of X given Y: the more the distributions p(x|y) and p(x) differ, the greater the information gain.
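The two Kullback-Leibler forms can be compared on a small example. This is a sketch under stated assumptions: the joint table is invented for illustration, and `kl_divergence` is a hypothetical helper that expects both distributions as dictionaries over the same outcomes.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in bits; p and q are dicts over the same outcomes."""
    return sum(pi * math.log2(pi / q[k]) for k, pi in p.items() if pi > 0.0)

# Hypothetical joint distribution of two dependent binary variables.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# Marginals p(x) and p(y).
px, py = {}, {}
for (x, y), p in joint.items():
    px[x] = px.get(x, 0.0) + p
    py[y] = py.get(y, 0.0) + p

# Conditional distributions p(x|y) = p(x, y) / p(y), one per value of y.
cond = {y: {x: joint.get((x, y), 0.0) / py[y] for x in px} for y in py}

# Expected divergence of p(x|y) from p(x), weighted by p(y).
mi_expected = sum(py[y] * kl_divergence(cond[y], px) for y in py)

# The same quantity as one divergence between the joint distribution
# and the product of the marginals.
product = {(x, y): px[x] * py[y] for x in px for y in py}
mi_joint = kl_divergence(joint, product)

print(mi_expected, mi_joint)
```

Both printed values should coincide up to rounding, since the two expressions are algebraically the same sum.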

Applications of mutual information


In many applications, one wants to maximize mutual information (thus increasing dependencies), which is often equivalent to minimizing conditional entropy. Examples include:

  • Mutual information is used in medical imaging for image registration. Given a reference image (for example, a brain scan) and a second image that needs to be put into the same coordinate system as the reference image, the second image is deformed until the mutual information between it and the reference image is maximized.
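The registration idea can be illustrated with a deliberately simplified toy, which should not be read as how clinical registration software works. The assumptions are: the "images" are one-dimensional lists of quantized intensities, the only allowed deformation is a circular shift, the second image uses a different intensity mapping (`remap`, standing in for a different imaging modality), and mutual information is estimated from the joint histogram of paired samples. All names and data are hypothetical.

```python
import math
from collections import Counter

def mutual_information(a, b):
    """Histogram estimate of mutual information (bits) from paired samples."""
    n = len(a)
    joint = Counter(zip(a, b))
    ca, cb = Counter(a), Counter(b)
    # p(x,y) / (p(x) p(y)) = (c/n) / ((ca/n) (cb/n)) = c * n / (ca * cb)
    return sum((c / n) * math.log2(c * n / (ca[x] * cb[y]))
               for (x, y), c in joint.items())

def roll(seq, k):
    """Circularly shift a list k positions to the right (k may be negative)."""
    k %= len(seq)
    return seq[-k:] + seq[:-k]

# Reference scanline: a background region followed by three smaller regions.
reference = [0] * 8 + [1] * 4 + [2] * 2 + [3] * 2

# Second image: same structure, different intensity mapping, shifted by 5.
remap = {0: 3, 1: 0, 2: 1, 3: 2}
true_shift = 5
moving = roll([remap[v] for v in reference], true_shift)

# Try every candidate shift, undo it, and score the alignment.
scores = {s: mutual_information(reference, roll(moving, -s))
          for s in range(len(reference))}
best_shift = max(scores, key=scores.get)

print(best_shift, scores[best_shift])
```

Because mutual information depends only on how consistently intensities co-occur, not on their numeric values, the score peaks at the correct shift even though the two scanlines use different intensity scales. This is one reason the criterion is attractive for aligning images from different modalities.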



This article is licensed under the GNU Free Documentation License. It uses material from the "Mutual information".
