Entropy is an important concept in many fields, and one field where it is extensively used is QKG. This chapter gives a general overview of entropy and presents a generalization of the chain rule of entropy as needed in future chapters. Alternative explanations of most of the contents can be found in many other places, along with many other useful bounds and relations. The introductory chapters of [6] are highly recommended.
The function $p \log p$, where $p$ is a probability, occurs frequently in connection with entropies. $0 \log 0$ is normally undefined, but since $\lim_{p \to 0^+} p \log p = 0$ is well-defined we extend the function to zero by continuity.
Another convention we will follow is to let $\log$ mean the logarithm base 2. We can choose any base, but using base 2 consistently means that everything will be expressed in bits, and people tend to be familiar with bits.
Entropy is a measure of uncertainty regarding a discrete random variable. For many purposes, the Shannon entropy is the only measure needed. The Shannon entropy of a random variable $X$ with probability distribution $P_X$ is defined by
\[ H(X) = -\sum_x P_X(x) \log P_X(x). \tag{4.1} \]
The Shannon entropy is a fundamental measure in information theory. It was introduced by Claude E. Shannon, now considered the father of information theory, in [7]. Much can be said about its properties, its uniqueness, and its relation to the thermodynamical entropy in physics, but we will only scratch the surface here. One way of understanding it better is to rewrite the definition as
\[ H(X) = \sum_x P_X(x) \log \frac{1}{P_X(x)} = \operatorname{E}\left[ \log \frac{1}{P_X(X)} \right]. \tag{4.2} \]
Now it is clear that the Shannon entropy is the expectation value of $\log\frac{1}{p}$, where $p$ is the probability assigned to the measured value of the random variable. $\log\frac{1}{p}$ can be interpreted as the length, in bits, needed for a message communicating a measurement result that had probability $p$, which makes the Shannon entropy a measure of the expected message length needed to communicate the measured value of a random variable.
The Shannon entropy of a uniform random variable $X$ with $n$ possible values is
\[ H(X) = \sum_x \frac{1}{n} \log n = \log n. \tag{4.3} \]
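As a concrete illustration (a small Python sketch added here, not taken from any implementation in the thesis; the function name shannon_entropy is just a convenient label), the definition can be evaluated directly, and the uniform case indeed gives $\log n$ bits:

from math import log2

def shannon_entropy(probs):
    # Shannon entropy in bits; 0 log 0 is extended to 0 by continuity,
    # which is why zero probabilities are simply skipped.
    return -sum(p * log2(p) for p in probs if p > 0)

print(shannon_entropy([1/8] * 8))            # uniform over 8 values: 3.0 bits
print(shannon_entropy([0.5, 0.25, 0.25]))    # non-uniform: 1.5 bits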
Without qualifiers, the word entropy and a non-subscripted $H$ normally refer only to Shannon entropy. However, when dealing with QKG, as well as most other parts of cryptography, this measure is not sufficient. The goal of QKG is to produce a key that is known to both Alice and Bob but to Eve is a random variable with high uncertainty. $\log\frac{1}{p}$ is a measure of the uncertainty of a value assigned probability $p$ and is therefore a measure of the security of that particular value of the key. Shannon entropy measures the expectation value of that security. The dangers of focusing on Shannon entropy alone are highlighted by the following theorem:
There exist random variables with arbitrarily high Shannon entropy for which one particular value nevertheless has probability $1/2$. To see this, let $X$ take one value with probability $1/2$ and each of $2^{2n}$ other values with probability $2^{-(2n+1)}$. A single guess then reveals the value of $X$ with probability $1/2$, while
\[ H(X) = \frac{1}{2}\log 2 + 2^{2n} \cdot 2^{-(2n+1)} \log 2^{2n+1} = \frac{1}{2} + \frac{2n+1}{2} = n + 1, \tag{4.4} \]
which grows without bound with $n$ and completes the proof.
Good average security is not good enough, and Shannon entropy alone is clearly not a sufficient measure of the quality of a key.
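To make the danger concrete, here is a small numerical sketch (added for illustration; the particular distribution, with one value of probability 1/2 and 2^20 equally improbable values, is just one convenient choice along the lines of the proof above):

from math import log2

def shannon_entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# One key value has probability 1/2; 2**20 other values share the remaining 1/2.
n = 2**20
probs = [0.5] + [0.5 / n] * n

print(shannon_entropy(probs))   # 11.0 bits of Shannon entropy
print(max(probs))               # yet Eve's single best guess succeeds with probability 0.5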
Another measure, more closely related to the difficulty of guessing the value of a random variable, was introduced by Massey in [8]. He did not name it, but in [6] it is called guessing entropy. Note, however, that while most other entropies have the unit bits, the guessing entropy is measured in number of guesses. Without loss of generality we can assume that the values of $X$ are sorted with decreasing probability, $P_X(x_1) \ge P_X(x_2) \ge \dots$, in which case the guessing entropy of $X$ is defined as
\[ G(X) = \sum_i i \, P_X(x_i), \tag{4.6} \]
the expected number of guesses needed to find the value of $X$ when the most probable values are guessed first. The random variable constructed in the proof above has a guessing entropy that grows exponentially with $n$, even though the very first guess is correct half of the time.
Again we see that good average security is not good enough, and guessing entropy alone is not a sufficient measure of the quality of a key.
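The same point can be checked numerically for guessing entropy; guessing_entropy below is a hypothetical helper, added for illustration, that simply evaluates Massey's sum with the probabilities sorted in decreasing order:

def guessing_entropy(probs):
    # Expected number of guesses with the optimal strategy:
    # try the values in order of decreasing probability.
    ordered = sorted(probs, reverse=True)
    return sum(i * p for i, p in enumerate(ordered, start=1))

# The same distribution as in the Shannon entropy sketch above.
n = 2**20
probs = [0.5] + [0.5 / n] * n

print(guessing_entropy(probs))  # roughly 260000 guesses on average
print(max(probs))               # although the first guess alone succeeds half of the time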
A useful generalization of Shannon entropy is the Rényi entropy, which to every real number $\alpha \ge 0$ assigns an entropy measure $H_\alpha$, pronounced the Rényi entropy of order $\alpha$ and defined for $\alpha \ne 1$ as
\[ H_\alpha(X) = \frac{1}{1-\alpha} \log \sum_x P_X(x)^\alpha, \tag{4.7} \]
with $H_1$ defined by continuity as the limit $\alpha \to 1$. Rényi entropy is, just like Shannon entropy, measured in units of bits.
An important property of Rényi entropy is that $H_\alpha(X) \ge H_\beta(X)$ for $\alpha < \beta$, for all $X$, with equality if and only if $X$ is a uniform random variable. In other words, $H_\alpha(X)$ is a constant function of $\alpha$ if $X$ is uniform and strictly decreasing if not. A full proof is given in [6] and follows quite naturally by writing $H_\alpha(X)$ in analogy with (4.2) as $\frac{1}{1-\alpha} \log \operatorname{E}\bigl[P_X(X)^{\alpha-1}\bigr]$ and using Jensen's inequality.
Some of these measures have quite natural interpretations. Rényi entropies with higher parameter $\alpha$ depend more on the probabilities of the more probable values and less on the more improbable ones. $H_0(X)$ is the logarithm of the number of values of $X$ that have non-zero probability. Any two random variables with different probability distributions but the same number of values with non-zero probability will have the same Rényi entropy of order 0. $H_1(X)$ is the Shannon entropy, in which the actual probabilities are quite important. $H_2(X)$ is often called collision entropy, or just Rényi entropy, and is the negative logarithm of the probability that two independent random variables with the same probability distribution take the same value. More probable values are much more likely to collide and are therefore more visible in the collision entropy than in the Shannon entropy. $H_\infty(X) = -\log \max_x P_X(x)$ is called min-entropy and is a function of the highest probability only.
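A small sketch (again added for illustration only; renyi_entropy is not a function from the thesis) makes the special orders tangible and also shows the monotonicity in $\alpha$ discussed above:

from math import log2, inf

def renyi_entropy(probs, alpha):
    # Rényi entropy of order alpha, in bits, for a probability vector.
    probs = [p for p in probs if p > 0]
    if alpha == 0:
        return log2(len(probs))                  # log of the support size
    if alpha == 1:
        return -sum(p * log2(p) for p in probs)  # Shannon entropy (the limit alpha -> 1)
    if alpha == inf:
        return -log2(max(probs))                 # min-entropy
    return log2(sum(p**alpha for p in probs)) / (1 - alpha)

probs = [0.5, 0.25, 0.125, 0.125]
for alpha in (0, 0.5, 1, 2, inf):
    print(alpha, renyi_entropy(probs, alpha))
# The printed values decrease as alpha grows; for a uniform
# distribution they would all equal log2 of the number of values.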
The shape of $H_\alpha(X)$ as a function of $\alpha$ for eight different random variables is shown in figure 4.1.
The conditional Shannon entropy of $X$ given $Y$ is defined as the expectation value of the Shannon entropy of $X$ given the value of $Y$,
\[ H(X|Y) = \operatorname{E}_Y\bigl[ H(X|Y=y) \bigr] = \sum_y P_Y(y) \, H(X|Y=y) \tag{4.8} \]
\[ \phantom{H(X|Y)} = -\sum_{x,y} P_{XY}(x,y) \log P_{X|Y}(x|y). \tag{4.9} \]
Rényi entropies of order other than 1 are not expectation values, so things are not quite as simple. In fact, according to [6] there is not even agreement on a standard definition of conditional Rényi entropy. However, the dominant definition seems to be the same as (4.8) with both $H$:s replaced by $H_\alpha$. That definition will be used here, but the expectation value will always be written out explicitly. We have seen that averaging security can be dangerous, and it is nice not to hide away something potentially dangerous in the notation.
With conditional Shannon entropy comes the chain rule of Shannon entropy, equation (4.10b) below:
\[ H(XY) = H(X) + H(Y|X) \tag{4.10a} \]
\[ \phantom{H(XY)} = H(Y) + H(X|Y). \tag{4.10b} \]
One way to define conditional Rényi entropy is to choose it so that the same relation still holds when the Shannon entropies are replaced with Rényi entropies. However, the relation does not hold for the expectation value based definition chosen above, so that is clearly a different conditional entropy. Fortunately, this does not stop us from generalizing the chain rule in other ways to something that is useful with general Rényi entropies:
(4.11)
For other values of $\alpha$ we have instead:
(4.12)
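As a numerical illustration of the preceding paragraphs (an added sketch with a made-up joint distribution, not an example from the thesis), the ordinary chain rule (4.10b) holds exactly for Shannon entropy but fails when the $H$:s are replaced by the expectation value based conditional Rényi entropy, here of order 2:

from math import log2

def renyi_entropy(probs, alpha):
    # alpha = 1 gives the Shannon entropy.
    probs = [p for p in probs if p > 0]
    if alpha == 1:
        return -sum(p * log2(p) for p in probs)
    return log2(sum(p**alpha for p in probs)) / (1 - alpha)

# A made-up joint distribution P_XY(x, y) over two bits.
joint = {(0, 0): 0.3, (0, 1): 0.3, (1, 0): 0.3, (1, 1): 0.1}
p_y = {y: sum(p for (x, yy), p in joint.items() if yy == y) for y in (0, 1)}
cond_x = {y: [joint[(x, y)] / p_y[y] for x in (0, 1)] for y in (0, 1)}

for alpha in (1, 2):
    lhs = renyi_entropy(list(joint.values()), alpha)
    rhs = renyi_entropy(list(p_y.values()), alpha) + \
          sum(p_y[y] * renyi_entropy(cond_x[y], alpha) for y in (0, 1))
    print(alpha, lhs, rhs)
# For alpha = 1 the two numbers agree; for alpha = 2 they do not.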
When we learn something new about a random variable, its Shannon entropy will, on average, decrease or stay the same. Note that this is only true on average. Consider an $X$ that takes one value with probability $0.99$ and each of $100$ other values with probability $0.0001$, and a $Y$ that is dependent on $X$ such that $Y = 1$ if $X$ takes one of the improbable values and $Y = 0$ if not. Learning that $Y$ is 1 will increase the Shannon entropy of $X$ from 0.15 to 6.64, but learning that $Y$ is 0 will decrease it to exactly 0. On average, the Shannon entropy will decrease to 0.0664. On average, the Shannon entropy will always decrease or stay the same, for all $X$ and $Y$.
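A quick check of the figures above, assuming the distribution as reconstructed here (one value of $X$ with probability 0.99 and 100 improbable values with probability 0.0001 each, with $Y$ flagging the improbable values):

from math import log2

def shannon_entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

p_rare, n_rare = 0.0001, 100
x_dist = [0.99] + [p_rare] * n_rare

print(shannon_entropy(x_dist))                 # about 0.15 bits before Y is known
print(shannon_entropy([1 / n_rare] * n_rare))  # 6.64 bits if Y = 1
print(shannon_entropy([1.0]))                  # 0 bits if Y = 0
print(n_rare * p_rare * shannon_entropy([1 / n_rare] * n_rare))  # about 0.066 bits on average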
However, that is not true in general for other Rényi entropies. For example, consider the $X$ and $Y$ defined in (3.3a). If $Y$ turns out to be 1 the entropy of $X$ reduces to exactly 0, and if not it becomes exactly 1. On average, it will increase.
Side information that increases entropy on average like this was first mentioned in [9] and is called spoiling knowledge.
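The joint distribution (3.3a) from the previous chapter is not reproduced in this chapter, but the effect itself is easy to demonstrate with any suitable distribution. The following sketch (a made-up example, added purely for illustration) shows spoiling knowledge for the min-entropy $H_\infty$:

from math import log2

def min_entropy(probs):
    return -log2(max(p for p in probs if p > 0))

# A made-up joint distribution P_XY(x, y) that exhibits spoiling knowledge.
joint = {(0, 0): 0.3, (0, 1): 0.3, (1, 0): 0.3, (1, 1): 0.1}
p_x = {x: joint[(x, 0)] + joint[(x, 1)] for x in (0, 1)}
p_y = {y: joint[(0, y)] + joint[(1, y)] for y in (0, 1)}

before = min_entropy(p_x.values())
after = sum(p_y[y] * min_entropy([joint[(x, y)] / p_y[y] for x in (0, 1)])
            for y in (0, 1))
print(before, after)  # about 0.74 bits before, 0.77 bits on average after learning Y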
The Concise Oxford English Dictionary [10] describes holism as the theory that certain wholes are greater than the sum of their parts. In a way, random variables normally behave in a holistic way. To specify both $X$ and $Y$ the whole probability vector for $XY$ is needed, and the size of that vector is the size of $X$'s vector multiplied by the size of $Y$'s. This is one reason why Shannon chose to primarily use a logarithmic scale in [7]. With a logarithmic scale the multiplications can be treated as sums and the whole is just the sum of the parts. Quoting Shannon: "One feels, for example, that two punched cards should have twice the capacity of one for information storage, and two identical channels twice the capacity of one for transmitting information."
It should come as no surprise that an important property of Shannon entropy is that the total entropy of a system is never greater than the sum of the entropies of its parts,
\[ H(XY) \le H(X) + H(Y). \]
On the other hand, what might come as a surprise is that this is not true in general for other Rényi entropies. There exist $X$, $Y$ and orders $\alpha$ such that
\[ H_\alpha(XY) > H_\alpha(X) + H_\alpha(Y). \]
For example, consider the $X$ and $Y$ defined in (3.3a) again: the Rényi entropy of the pair $XY$ is strictly greater than the sum of the Rényi entropies of $X$ and $Y$. This is a case where the whole actually is greater than the sum of the parts.
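With the same made-up joint distribution as in the spoiling knowledge sketch above (again only a stand-in for (3.3a), which is not reproduced here), the min-entropy of the whole exceeds the sum of the min-entropies of the parts:

from math import log2

def min_entropy(probs):
    return -log2(max(p for p in probs if p > 0))

joint = {(0, 0): 0.3, (0, 1): 0.3, (1, 0): 0.3, (1, 1): 0.1}
p_x = [0.6, 0.4]  # marginal distribution of X
p_y = [0.6, 0.4]  # marginal distribution of Y

print(min_entropy(joint.values()))          # H_inf(XY) is about 1.74 bits ...
print(min_entropy(p_x) + min_entropy(p_y))  # ... while H_inf(X) + H_inf(Y) is about 1.47 bits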