The entropy of a document is a measure of its information content, expressed in bits per character.
Clicking the menu item above calculates the entropy value of the current document.
Information content of a source
From the point of view of information theory, the data in the current window can be viewed as a message source. To calculate the information content, one examines the probability distribution of this source. It is assumed here that the individual messages (characters in the document / file) are stochastically independent of each other and that each is transmitted by the source with a fixed probability.
The information content of a message M[i] is defined by

    information content(M[i]) := log(1/p[i]) = -log(p[i])

where p[i] is the probability that the message M[i] is transmitted by the message source and log denotes logarithms to base 2 (as indeed it does elsewhere in this document).
This means that the information content depends exclusively on the probability distribution with which the source generates the messages; the semantic content of the message does not enter into the calculation. Because a rare message carries more information than a common one, the reciprocal of the probability is used in the definition.
Moreover, the information content of two messages chosen independently of one another is equal to the sum of the information contents of the individual messages. This additivity is exactly what the logarithm provides: the probability of both messages occurring together is the product of the individual probabilities, and -log(p[i] * p[j]) = -log(p[i]) - log(p[j]).
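As an illustration, here is a minimal Python sketch (not part of the program; the function name is chosen here for the example) which evaluates the definition and checks the additivity numerically:

    import math

    def information_content(p: float) -> float:
        """Information content, in bits, of a message with probability p."""
        return -math.log2(p)

    # A rare message carries more information than a common one.
    print(information_content(1 / 8))   # 3.0 bits
    print(information_content(0.99))    # ~0.014 bits

    # Additivity: two independent messages occur with probability p * q,
    # and -log(p * q) = -log(p) - log(q).
    p, q = 1 / 8, 1 / 4
    print(information_content(p * q))                        # 5.0 bits
    print(information_content(p) + information_content(q))   # 5.0 bits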
Entropy
With the aid of the information content of the individual messages, the average amount of information which a source with a specified distribution delivers can be calculated. To calculate this mean, the individual messages are weighted with the probabilities of their occurrence.
    Entropy(p[1], p[2], ..., p[r]) := -[p[1] * log(p[1]) + p[2] * log(p[2]) + ... + p[r] * log(p[r])]
The entropy of a source thus characterizes its distribution. It measures the average amount of information which one can obtain by observing the source or, equivalently, the uncertainty which remains about the generated messages when the source cannot be observed.
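A minimal Python sketch of this calculation (an illustration of the formula above, not the program's own code; it estimates each p[i] from the character frequencies in the document) could look like this:

    import math
    from collections import Counter

    def entropy_bits_per_char(text: str) -> float:
        """Average information content of the characters in text, in bit/char."""
        counts = Counter(text)   # how often each character occurs
        total = len(text)
        return -sum((n / total) * math.log2(n / total) for n in counts.values())

    print(entropy_bits_per_char("ABAB"))      # 1.0 bit/char
    print(entropy_bits_per_char("ABCDABCD"))  # 2.0 bit/char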
Simple description of entropy
Entropy expresses uncertainty as the number of Yes/No questions which have to be answered in order to identify a message or a character. For example, with eight equally probable characters, three Yes/No questions suffice to identify a character, matching an entropy of log(8) = 3 bit/char. If a character has a very high probability of occurrence, then its information content is low. This would be the case, for example, with a business partner who regularly replies "Yes": such a reply permits no conclusions to be drawn as to understanding or attention. Replies which occur very seldom have a high information content.
Extreme values of entropy
For documents which contain only upper case letters, the entropy lies between 0 bit/char (in a document in which only a single distinct character occurs) and log(26) bit/char = 4.700440 bit/char (in a document in which all 26 letters occur equally often).
For documents which can contain every character of the character set (byte values 0 to 255), the entropy lies between 0 bit/char (in a document in which only a single distinct character occurs) and log(256) bit/char = 8 bit/char (in a document in which all 256 characters occur equally often).
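These extreme values can be checked with the entropy_bits_per_char sketch above (a hypothetical usage example, not output of the program):

    import string

    print(entropy_bits_per_char("A" * 1000))                            # 0.0 bit/char (may print as -0.0)
    print(entropy_bits_per_char(string.ascii_uppercase))                # ~4.700440 bit/char = log(26)
    print(entropy_bits_per_char(bytes(range(256)).decode("latin-1")))   # 8.0 bit/char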