The page you are reading is part of a draft (v2.0) of the "No bullshit guide to math and physics."

The text has since gone through many edits and is now available in print and electronic formats. The current edition of the book is v4.0, which is a substantial improvement over the draft version in terms of content and language (I hired a professional editor).

I'm leaving the old wiki content up for the time being, but I highly encourage you to check out the finished book. An extended preview is available here (PDF, 106 pages, 5MB).


Entropy from first principles

Before reading this article you should read the page on logarithms. I also assume you know what a probability distribution is.

Definition of information

Suppose you have some alphabet $\mathcal{X}=\{a,b,c,\ldots\}$ of symbols. The number of symbols $|\mathcal{X}|$ does not have to be 26 as in English; we are just using the alphabet analogy. Here are some examples: \[ \mathcal{X}_{binary} = \{ 0, 1 \} \] \[ \mathcal{X}_{vote} = \{ yes, no, abstain \} \] \[ \mathcal{X}_{outlinks} = \{ set\ of\ outlinks\ on\ a\ webpage \} \] \[ \mathcal{X}_{2^{nd} cup} = \{ dark-roast, mild-roast \} \]

Now suppose that you have some source $S$ that produces one symbol $x$ from the alphabet each second. For example, a coffee-addicted graduate student will order either a dark roast or a mild roast coffee each morning.

Suppose that you have statistics about the probability of each symbol being produced by the source $S$. This is represented by a probability mass function $p(x)=Prob\{S=x\}$. If I order dark roast 90% of the time, then $p(dark-roast)=0.9$. The probabilities have to add up to $1$, so $p(mild-roast)$ must equal $0.1$.

Information theory can tell you how much information is conveyed by the source $S$ each time it produces some symbol $x$. The function that we use the most in information theory is called Shannon entropy. The entropy of a probability distribution $p(x)$ is \[ H(p(x)) = - \sum_{x\in \mathcal{X}} p(x) \log p(x). \]
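For example, if we take the logarithm in base $2$ (a common convention, which measures entropy in bits), the coffee distribution above has entropy \[ H = -0.9\log_2 0.9 - 0.1\log_2 0.1 \approx 0.47 \text{ bits}. \]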

You see the log in there? Why is there a log?

The log is there because we made some assumptions about entropy. Entropy is not information itself; rather, it measures how much uncertainty we have about the outcomes of $S$. If $S$ always outputs the same symbol, then the entropy of $S$ is zero.
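You can check this directly from the formula: if $p(x_0)=1$ for one symbol $x_0$ and $p(x)=0$ for all the others, then, using the standard convention $0\log 0 = 0$, \[ H = -1\log 1 - \sum_{x\neq x_0} 0 \log 0 = 0. \]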

Assumptions:

  1. The source $S$ produces independent outcomes.
  2. The uncertainty we have about two outputs of $S$ is twice the amount of uncertainty of one output.

How much information do I gain when I see the symbol $x$? Well, let's say that there is some function $f(p_X(x))$ that tells you how surprising it is to see $x$. It will turn out that this function is just $f(p_X(x)) = -\log p_X(x)$ (the minus sign makes rarer symbols more surprising), but for now let's treat it as an unknown function.

It won't be unknown for very long though. The second assumption we made implies the following: \[ f(p_{X_1X_2}(x_1,x_2)) = f(p_{X_1}(x_1)) + f(p_{X_2}(x_2)). \] By the first assumption the two outputs are independent, so the joint probability factors as $p_{X_1X_2}(x_1,x_2) = p_{X_1}(x_1)\,p_{X_2}(x_2)$, which means $f$ must turn products into sums.

Do you know of any functions that turn products into sums? It must be the log. Choosing the minus sign so that rarer symbols carry more information gives $f(p_X(x)) = -\log p_X(x)$. If $f$ tells us how much information there is in one symbol $x$, then how much information is there in each instance of $S$ on average? We have to calculate the expectation value, i.e., take the sum over all symbols and weight each term by how likely it is to occur, $p(x)$.
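Writing this expectation value out explicitly (a step the text sketches only in words) recovers the entropy formula from above: \[ H(p(x)) = \sum_{x\in\mathcal{X}} p(x)\, f(p_X(x)) = -\sum_{x\in\mathcal{X}} p(x)\log p(x). \]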

This is how you get the entropy function from first principles.
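Here is a minimal computational sketch of the entropy formula in Python (not part of the original text; it assumes base-2 logarithms so the answer comes out in bits, and the function name entropy is my own choice):

  # Minimal sketch: Shannon entropy of a probability distribution, in bits.
  from math import log2

  def entropy(probs):
      # H(p) = sum over x of -p(x)*log2(p(x)); terms with p(x) == 0 contribute nothing.
      return sum(-p * log2(p) for p in probs if p > 0)

  print(entropy([0.9, 0.1]))    # coffee example from the text: ~0.47 bits
  print(entropy([1.0]))         # source that always outputs the same symbol: 0.0 bits
  print(entropy([0.5, 0.5]))    # fair binary source: 1.0 bit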

Applications

  1. The distribution of outlinks clicked on a webpage could be a good way to learn about website visitors. If all your visitors click on the same route (as is popular with the modern call-to-action meme), then your server logs will contain very little information about what visitors were really interested in. A homepage is like a question: if you ask the right question, you should get many kinds of answers. The example below puts some numbers on this.
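
For instance (the numbers here are my own, purely for illustration): suppose a homepage has four outlinks. If visitors click on them uniformly, the click distribution has entropy $\log_2 4 = 2$ bits per visit. If instead $97\%$ of visitors click the same link and the remaining $3\%$ split evenly among the other three, the entropy drops to \[ H = -0.97\log_2 0.97 - 3\,(0.01\log_2 0.01) \approx 0.24 \text{ bits}, \] so each logged visit tells you much less about what visitors actually care about.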
 