Entropy from first principles

Before reading this article you should read about logarithms. I also assume you know what a probability distribution is.

Definition of information

Suppose you have some alphabet $\mathcal{X}=\{a,b,c,\ldots\}$ of symbols. The number of symbols $|\mathcal{X}|$ does not have to be 26 like in English; we are just borrowing the analogy. Here are some examples: \[ \mathcal{X}_{binary} = \{ 0, 1 \} \] \[ \mathcal{X}_{vote} = \{ yes, no, abstain \} \] \[ \mathcal{X}_{outlinks} = \{ set\ of\ outlinks\ on\ a\ webpage \} \] \[ \mathcal{X}_{2^{nd} cup} = \{ dark-roast, mild-roast \} \]
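
To make this concrete, here is a minimal sketch (in Python, with names that are mine and purely illustrative) of a few of these alphabets represented as plain sets:

```python
# An alphabet is just a finite set of symbols.
X_binary = {"0", "1"}
X_vote = {"yes", "no", "abstain"}
X_second_cup = {"dark-roast", "mild-roast"}

# |X| is simply the number of symbols in each alphabet.
print(len(X_binary), len(X_vote), len(X_second_cup))  # 2 3 2
```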

Now suppose that you have some source $S$ that produces one symbol $x$ from the alphabet each second. For example, a coffee-addicted graduate student will order either a dark roast or a mild roast coffee each morning.

Suppose that you have statistics about the probability of each symbol being produced by the source $S$. This is represented by a probability mass function $p(x)=Prob\{S=x\}$. If I order dark roast 90% of the time, then $p(dark-roast)=0.9$. The probabilities have to add up to $1$, so $p(mild-roast)$ must be equal to $0.1$.
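
Concretely, a probability mass function like this is just a table of symbol-to-probability pairs. A minimal sketch in Python, using the coffee numbers above (the variable names are mine, for illustration only):

```python
import random

# p(x) = Prob{S = x} for the second-cup source.
p = {"dark-roast": 0.9, "mild-roast": 0.1}

# The probabilities of all symbols must add up to 1.
assert abs(sum(p.values()) - 1.0) < 1e-12

# Simulate one output of the source S: one coffee order drawn according to p(x).
order = random.choices(list(p), weights=list(p.values()), k=1)[0]
print(order)  # "dark-roast" about 90% of the time
```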

Information theory can tell you how much information is conveyed by the source $S$ each time it produces some symbol $x$. The function that we use the most in information theory is called Shannon entropy. The entropy of a probability distribution $p(x)$ is \[ H(p(x)) = - \sum_{x\in \mathcal{X}} p(x) \log p(x). \]
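
As a sanity check, here is that formula as a small Python function (log base 2 is just a convention that measures entropy in bits; the helper name is mine):

```python
from math import log2

def entropy(p):
    """Shannon entropy H(p) = -sum_x p(x) * log2 p(x), in bits."""
    return -sum(px * log2(px) for px in p.values() if px > 0)

coffee = {"dark-roast": 0.9, "mild-roast": 0.1}
fair_coin = {"heads": 0.5, "tails": 0.5}

print(entropy(coffee))     # ~0.469 bits: a very predictable source
print(entropy(fair_coin))  # 1.0 bit: maximum uncertainty over two symbols
```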

You see the log in there? Why is there a log?

The log is there because of the assumptions we make about entropy. Entropy is not information; rather, it is a measure of how much uncertainty we have about the outcomes of $S$. If $S$ always outputs the same symbol, then the entropy of $S$ is zero.
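
To see that degenerate case, use the usual convention $0 \log 0 = 0$ for the symbols that never occur; the sum then collapses to a single term: \[ H = -\, 1 \cdot \log 1 = 0. \]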

Assumptions:

  1. The source $S$ produces independent outcomes.
  2. The uncertainty we have about two outputs of $S$ is twice the amount of uncertainty of one output (see the numerical check below).
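
Assuming the entropy formula from above, the second assumption can be checked numerically: two independent coffee orders have exactly twice the entropy of a single order. A small sketch (the helper function is hypothetical, not part of the article):

```python
from math import log2

def entropy(p):
    return -sum(px * log2(px) for px in p.values() if px > 0)

p = {"dark-roast": 0.9, "mild-roast": 0.1}

# Independence (assumption 1): p(x1, x2) = p(x1) * p(x2).
joint = {(x1, x2): p[x1] * p[x2] for x1 in p for x2 in p}

print(entropy(p))      # ~0.469 bits for one order
print(entropy(joint))  # ~0.938 bits: exactly twice the single-order entropy
```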

How much information do I gain when I see the symbol $x$? Well, let's say that there is some function $f(p_X(x))$ that tells you how surprising it is to see $x$. It will turn out that this function is just $f(p_X(x)) = -\log p_X(x)$ (the minus sign makes the surprise non-negative, since $p_X(x) \le 1$), but for now let's treat it as an unknown function.

It won't be unknown for very long though. The first assumption tells us that the joint probability factorizes, $p_{X_1X_2}(x_1,x_2) = p_{X_1}(x_1)\,p_{X_2}(x_2)$, and the second assumption then implies the following: \[ f(p_{X_1X_2}(x_1,x_2)) = f(p_{X_1}(x_1)) + f(p_{X_2}(x_2)). \]

Do you know of any functions that turn products into sums? There is essentially only one: the log, so $f$ must be a logarithm, and we pick $-\log$ so that rare, surprising symbols carry more information. If $f$ tells us how much information there is in one symbol $x$, then how much information is there, on average, in each output of $S$? We have to calculate the expectation value, i.e. take the sum over all symbols and weight each term by how likely it is to occur, $p(x)$.
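
Writing that expectation value out in symbols, with $f(p(x)) = -\log p(x)$, gives back exactly the formula we started from: \[ \mathbb{E}\big[f(p(X))\big] = \sum_{x \in \mathcal{X}} p(x)\,\big(-\log p(x)\big) = -\sum_{x \in \mathcal{X}} p(x) \log p(x) = H(p(x)). \]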

This is how you get the entropy function from first principles.

Applications

  1. The outlinks distribution could be a good way to learn information about website visitors. If all your visitors click on the same route (as the modern call-to-action design meme encourages), then your server logs will contain very little information about what visitors were really interested in. A homepage is like a question: if you ask the right question, you should get many kinds of answers (see the sketch below).
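
As a rough sketch of what that might look like in practice (the link names and click counts below are made up for illustration), you could estimate the outlink distribution from server logs and compute its entropy:

```python
from math import log2

def entropy(p):
    return -sum(px * log2(px) for px in p.values() if px > 0)

# Hypothetical click counts per outlink, tallied from server logs.
clicks = {"/signup": 950, "/pricing": 20, "/blog": 20, "/about": 10}

total = sum(clicks.values())
p = {link: count / total for link, count in clicks.items()}

# Low entropy: almost everyone follows the same call-to-action link, so the
# logs carry little information about what visitors were really interested in.
print(entropy(p))  # ~0.36 bits, far below the maximum of log2(4) = 2 bits
```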