Before reading this article you should first read about logarithms. I also assume you know what a probability distribution is.
Suppose you have some alphabet $\mathcal{X}=\{a, b, c, \ldots\}$ of symbols. The number of symbols $|\mathcal{X}|$ does not have to be 26 as in English; the Latin alphabet is just an analogy. Here are some examples: \[ \mathcal{X}_{binary} = \{ 0, 1 \} \] \[ \mathcal{X}_{vote} = \{ yes, no, abstain \} \] \[ \mathcal{X}_{outlinks} = \{ set\ of\ outlinks\ on\ a\ webpage \} \] \[ \mathcal{X}_{2^{nd} cup} = \{ dark-roast, mild-roast \} \]
Now suppose that you have some source $S$ that produces one instance $x$ from the alphabet each second. For example, a coffee-addicted graduate student will order either a dark roast or a mild roast coffee each morning.
Suppose that you have statistics about the probability of each symbol being produced by the source $S$. This is represented by a probability mass function $p(x)=Prob\{S=x\}$. If I order dark roast 90% of the time then $p(dark-roast)=0.9$. The probabilities have to add up to $1$, so $p(mild-roast)$ must be equal to $0.1$.
Information theory can tell you how much information is conveyed by the source $S$ each time it produces some symbol $x$. The function that we use the most in information theory is called Shannon entropy. The entropy of a probability distribution $p(x)$ is \[ H(p(x)) = - \sum_{x\in \mathcal{X}} p(x) \log p(x). \]
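To make this concrete, here is a minimal sketch in Python (my own illustration, not from any particular library) that computes the entropy of a probability mass function using base-2 logarithms, so the answer comes out in bits:

```python
import math

def entropy(p):
    """Shannon entropy of a probability mass function, in bits.

    p is a dict mapping each symbol to its probability; the values are
    assumed to sum to 1. Symbols with probability 0 contribute nothing,
    following the convention 0 * log 0 = 0.
    """
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

# The coffee-addicted graduate student from the example above.
p_coffee = {"dark-roast": 0.9, "mild-roast": 0.1}
print(entropy(p_coffee))  # ~0.469 bits per cup

# A fair coin is as uncertain as a two-symbol source can get.
p_fair = {"0": 0.5, "1": 0.5}
print(entropy(p_fair))    # exactly 1 bit
```

The choice of logarithm base only rescales the result: base 2 gives bits, the natural log gives nats.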
You see the log in there? Why is there a log?
The log is there because we made some assumptions about entropy. Entropy is not information; rather, it is how much uncertainty we have about the outcomes of $S$. If $S$ always outputs the same symbol, then the entropy of $S$ would be zero.
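As a quick sanity check, suppose $S$ always outputs some symbol $x_0$, so $p(x_0)=1$ and every other symbol has probability zero. Using the standard convention that $0 \log 0 = 0$, the formula gives \[ H = -\, 1 \cdot \log 1 \;-\; \sum_{x \neq x_0} 0 \cdot \log 0 \;=\; 0, \] which matches the intuition that there is no uncertainty at all.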
Assumptions:
1. The amount of uncertainty of one output $x$ depends only on its probability $p(x)$.
2. The amount of uncertainty of two independent outputs is the sum of their individual uncertainties.
How much information do I gain when I see the symbol $x$? Well, let's say that there is some function $f(p_X(x))$ that tells you how surprising it is to see $x$. It will turn out that this function is just $f(p_X(x)) = -\log p_X(x)$, but for now let's assume that it is an unknown function.
It won't be unknown for very long though. The second assumption we made implies the following for two independent outputs $x_1$ and $x_2$, whose joint probability is the product $p_{X_1}(x_1)\,p_{X_2}(x_2)$: \[ f(p_{X_1X_2}(x_1,x_2)) = f(p_{X_1}(x_1)\,p_{X_2}(x_2)) = f(p_{X_1}(x_1)) + f(p_{X_2}(x_2)). \]
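Here is a minimal numerical check of that additivity in Python, using the coffee probabilities from earlier as a stand-in for two independent outputs:

```python
import math

def surprise(p):
    """Surprise (self-information) of an outcome with probability p, in bits."""
    return -math.log2(p)

# Two independent mornings: one dark roast, one mild roast.
p_dark, p_mild = 0.9, 0.1
p_joint = p_dark * p_mild  # joint probability of independent events

# The surprise of the joint outcome equals the sum of the individual surprises.
print(surprise(p_joint))                    # ~3.474 bits
print(surprise(p_dark) + surprise(p_mild))  # ~3.474 bits
```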
Do you know of any functions that turn products into sums? The logarithm does exactly that, so $f$ must be a log. If $f$ tells us how much information there is in one symbol $x$, then how much information is there in each instance of $S$? We have to calculate the expectation value, i.e. take the sum over all symbols and weight each term by how likely it is to occur, $p(x)$.
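For the coffee example this expectation works out, in bits, to \[ H = 0.9 \cdot (-\log_2 0.9) + 0.1 \cdot (-\log_2 0.1) \approx 0.9 \cdot 0.152 + 0.1 \cdot 3.322 \approx 0.47, \] which matches the code sketch above.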
This is how you get the entropy function from first principles.
Think back to the outlinks alphabet $\mathcal{X}_{outlinks}$. If all your visitors click on the same route (as is popular with the modern call-to-action trend), then your server logs will contain very little information about what visitors were really interested in. A homepage is like a question: if you ask the right question, you should get many kinds of answers.