Big data and R

Yesterday, I went to a Montreal meetup about R. The event was attended by quite a few people and the good people of Bolidea offered us beer and pizza. The talk was by the Wajam team and discussed how they make use of R for business analytics and system monitoring.

Instead of simply checking basic metrics like clicks and the number of searches and API calls, they combine all this data into a “health indicator” value which is much better at predicting when intervention is required. Basically, dashboards are good, but dashboards that can run machine learning algorithms are better.
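The talk didn’t go into how the indicator is computed, but the general idea of collapsing several raw metrics into a single anomaly score can be sketched in a few lines of Python. All metric names, weights, and the scoring formula below are my own invention, not Wajam’s:

```python
# Hypothetical sketch: normalize each metric against its historical
# mean/std and take a weighted sum of the absolute deviations.
# A score near 0 is normal; more negative means more anomalous.

def zscore(value, mean, std):
    """How many standard deviations `value` is from its historical mean."""
    return (value - mean) / std if std else 0.0

def health_indicator(metrics, history, weights):
    """metrics: current values; history: (mean, std) per metric."""
    score = 0.0
    for name, value in metrics.items():
        mean, std = history[name]
        score -= weights[name] * abs(zscore(value, mean, std))
    return score

current = {"clicks": 900, "searches": 4000, "api_calls": 21000}
history = {"clicks": (1000, 100), "searches": (4200, 300), "api_calls": (20000, 2000)}
weights = {"clicks": 0.5, "searches": 0.3, "api_calls": 0.2}

print(round(health_indicator(current, history, weights), 3))  # prints -0.8
```

A real system would estimate the historical means and standard deviations from the data in MySQL and alert when the score drops below some threshold.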

Their workflow centers around MySQL as the main data store. Web servers et al. send logging information to Scribe for aggregation and all the data is then written to MySQL. The main stats/ML stuff they use for business intelligence is written in R. R pulls the data from MySQL and then produces report graphs and alerts. All this is automated through cronjobs. They said they are not going for “realtime” but they have some jobs running every minute, which is near-realtime.

It was all very cool stuff to hear about, but I was hoping to see some R code during the presentation or a demo.
Nevertheless, after the talk an interesting, more technical discussion followed.

Some of the things mentioned were:

  • Pig: an SQL-like query language which converts your queries into MapReduce jobs to run on HDFS (the Hadoop Distributed File System). Apparently it is very good. Listening to the guys talk about it made it sound like handling 50TB of data is just as easy as handling 1GB on your computer…
  • There was a discussion about running R in parallel but I am not sure about which packages they were talking about. The other day I saw a new project on HN also… so interesting things are happening on that front. Using such tools one could run “exploratory analyses” on the whole dataset instead of a subset which fits on your machine.
  • There is no central package management repository. The makers of R want to preserve the spirit of “scientific publication” and don’t want to become software developers. In keeping with this spirit, when creating an R package you have to include a documentation tex file: think of it as publishing a paper with some code attached.
    The process for approval to CRAN takes time, so some people post their stuff on GitHub instead.
  • Speaking of documentation, they talked about some literate-programming-like tools: Sweave, roxygen, and knitr.
    This is kind of cool — especially with the markdown support.
    I imagine this could be very useful for writing stats tutorials.
    Hey what about a minireference for stats?
  • Using Shiny it should be possible to make a nice web app that teaches the basic concepts of stats in a very short time. Of course you could make it available in print too, but an interactive version would be much better, I think. The model: sales from the book, with the web tutorial (say 30% of the material) available for free.
  • Speaking of books. One of the members of the audience said that there is an opportunity for writing a book on R.

    The old me would be like “hey, I can learn R and then write a minireference for R book”, but I know better now. Focus on math and phys! Don’t spread your energy too thin. How could you teach people a subject you only just learned? Textbooks should be written by people who are at least two levels more advanced than the intended audience. You should know X, know what X is used for, and also know what the stuff X is used for is used for. The reason is that people who can see two hops ahead in the graph of knowledge will have better answers to offer for the “why do I need to know this?” question.

  • Finally, something related to the big-data thread of the discussion, which I heard about this morning on Hacker News: Drake is a way to automate handling large datasets using the “makefile” interface. There were links to and discussion of other projects on HN. You need to install Clojure to run Drake.

Ok! Link offload complete. Now I can finally step away from the computer and stretch a bit. You, my dear reader, should do the same. Go grab a glass of water or something and work some stretches in along the way.

Wow it is 2PM already! HN I want my morning back!!!

Hacker News launch

Two weeks ago, I posted the book on Hacker News. There was a tremendous amount of interest on the first day (20k visits in one day!)
and plenty of good (i.e., critical) feedback. With this post, I want to take a moment and record my impressions from surfing the Hacker News wave.

Conversion rates

1. Roughly 33000 people showed up on the “product” page.
2. Of these, 7000 clicked on at least one of the modals (engagement).
3. About 1761 of them clicked “Buy Book” and went to the print-on-demand site (lulu.com).
4. Of these, 264 ordered the book.

The engagement rate is 7000/33000 ≈ 21%.
The percentage of engaged visitors who clicked “Buy Book” is 25% (= 1761/7000).
The final-step conversion rate is 15% (= 264/1761).
Overall, we have 264/33000 = 0.8% conversion from visitor to client.
Is this good or bad?

Perhaps the more interesting metric is the conversion rate
of engaged visitors (clicked at least on one modal) to client,
which is 3.75%.
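A quick script to re-derive the funnel numbers (the exact ratios come out slightly different from the rounded figures above):

```python
# Funnel numbers from the Hacker News launch.
visitors = 33000     # landed on the product page
engaged = 7000       # clicked at least one modal
buy_clicks = 1761    # clicked "Buy Book" -> lulu.com
orders = 264         # actually bought the print book

engagement_rate = engaged / visitors      # ~21%
buy_click_rate = buy_clicks / engaged     # ~25%
final_conversion = orders / buy_clicks    # ~15%

overall = orders / visitors               # visitor -> client
engaged_conversion = orders / engaged     # engaged visitor -> client

print(f"overall: {overall:.2%}, engaged: {engaged_conversion:.2%}")
# prints: overall: 0.80%, engaged: 3.77%
```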

A back-of-the-envelope calculation tells me that my expected earnings
per engaged visitor are about 50 cents. I feel confident that I will be
able to profitably buy some education-related keywords.

TODO: try Mixpanel (GA is a PITA: give me the full path of the referral URL, please!), invest in and test AdWords.

Book product

The book — as a product — works, even if there are still things to improve:

TODO: fix typos, add math exercises, add physics exercises.

PDF product

Some of the engaged visitors also clicked through to the PDF: 19% (= 847/4500).
Then there is another factor of 15% (= (50+37+19+7+7+3+3)/847, i.e., one week of PDF sales divided by one week of clicks through to Gumroad).
Thus 19% × 15% ≈ 2.8% of engaged visitors went on to buy the PDF.

Overall this means that 6.55% = 3.75% + 2.8% of my engaged visitors go on to become clients.
Now that is cool!
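The same check for the PDF funnel, using the numbers quoted above:

```python
engaged = 4500                             # engaged visitors counted for the PDF funnel
pdf_clicks = 847                           # clicked through to the PDF page
pdf_sales = 50 + 37 + 19 + 7 + 7 + 3 + 3   # one week of Gumroad sales

click_rate = pdf_clicks / engaged          # ~19%
sale_rate = pdf_sales / pdf_clicks         # ~15%

print(f"{click_rate * sale_rate:.1%}")     # prints: 2.8%
```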

Similar stories

Tools used

New landing page

If you visit minireference.com you will now see a new design which conforms to the standard “book product webpage” format. I am very pleased with the result, which was an attempt to mimic other good book product pages.

The design process took me about three weeks. Most of the time was spent on copy editing. The ability to “put stuff on the page” that you have with HTML + CSS is much more powerful than with LaTeX. And with webfonts becoming the norm now, one can make very beautiful sites very quickly.

Check it out: minireference.com

The web we still have

The facebookification of the Internet brings with it a stupidification of the content that people produce and share.
The old web was about blog posts (long, thought-out pieces of writing) which automatically form links to each other (through trackbacks) so that a conversation can emerge without the need for a centralized service.

Trackbacks are awesome! For example, I can make this post appear on Quora if I embed some javascript (their embed code) which will ping the Quora server:
Read Quote of Ivan Savov’s answer to Machine Learning: Does Topic Modeling need a training stage when using Gibbs sampling? And why does it work? on Quora

We need to cherish this kind of distributed technology, because it is the way out of the walled gardens. Trackbacks are living proof that you can have social without central.

LDA, BTW, is short for Latent Dirichlet Allocation, a powerful way to classify documents according to the topics they contain.

Strang lectures on linear algebra

Professor Gilbert Strang’s video lectures on Linear Algebra have been recommended to me several times. I am very impressed with the first lecture: he presents all the important problems and concepts of LA right away, in a completely matter-of-fact way.

The lecture presents the problem of solving n equations in n unknowns in three different ways: the row picture, the column picture and the matrix picture.

In the row picture, each equation represents a line in the xy plane. When “solving” these equations simultaneously, we are looking for the point (x,y) which lies on both lines. In the case of the two lines he has on the board (2x-y=0 and -x+2y=3) the solution is the point x=1, y=2.

The second way to look at the system of equations is to think of the column of x coefficients as one vector and the column of y coefficients as another vector. In the column picture, solving the system of equations means finding the linear combination of the columns (i.e., $x$ times the first column plus $y$ times the second column) that gives us the vector on the right-hand side.

If students start off with this picture, they will be much less mystified (as I was) by the time they start to learn about the column space of matrices.

As a side benefit of this initial brush with linear algebra in the “column picture”, Prof. Strang is also able to present an intuitive picture of the product of a matrix and a vector. He says “Ax is a combination of the columns of A.” This way of explaining the matrix product is much more intuitive than the standard row-times-column dot-product approach. Who has seen these dot products? What? Why? WTF?
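The two-equation example from the lecture makes all three pictures easy to check numerically; here is a quick NumPy version:

```python
import numpy as np

# The system  2x - y = 0,  -x + 2y = 3  in the matrix picture: A @ v = b.
A = np.array([[ 2.0, -1.0],
              [-1.0,  2.0]])
b = np.array([0.0, 3.0])

v = np.linalg.solve(A, b)   # row picture: intersection of the two lines
print(v)                    # prints: [1. 2.]  i.e. x = 1, y = 2

# Column picture: Av is x times the first column plus y times the second.
x, y = v
assert np.allclose(x * A[:, 0] + y * A[:, 1], b)
```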

I will definitely include the “column picture” in the introductory chapter on linear algebra in the book. In fact, I have been wondering for some time how to explain what the matrix product Ax means. I want to talk about A as the linear transformation $T_A$ so that I can talk about the parallels between $x$, $f:R \to R$, $f^{-1}$ and $\vec{v}$, $A$, $A^{-1}$. Now I know how to fix the intro section!

Clearly he is the master of the subject. It is funny that what started as a procrastination activity (watching a youtube video I just wanted to link to) led to an elegant solution to a long-standing problem which was blocking my writing. Sometimes watching can be productive 😉  Thank you, Prof. Strang!

Target revenue

I did a little calculation regarding what kind of sales figures I would need to reach the 100k income range (which is my current standard for “success” in a technical field). If I can make deals with 100 universities and ship 100 copies of the book to each of them, then I am done.
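The post leaves the arithmetic implicit; with an assumed profit of roughly $10 per copy (my number, not stated anywhere above) it works out like this:

```python
universities = 100
copies_per_university = 100
profit_per_copy = 10    # assumed margin in dollars; adjust to the real number

profit = universities * copies_per_university * profit_per_copy
print(profit)           # prints: 100000
```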

I think it is totally doable with the MATH and PHYSICS title alone within the next couple of years. So fuck the job world. I am doing my own thing!

Showing off with python

2:57AM on a Monday. I have to be up at 8AM. The faster I get the job done the more sleep I get. Sounds like the kind of thing to motivate a person.

TASK: Parse an access.log file and produce page visit trace for each visitor. Ex:

11.22.33.90 on Monday at 3pm   (Montreal, Firefox 4, on Mac OS X):
  /contents          (stayed for 3 secs)
  /derivatives       (stayed for 2m20sec)
  /contents          (6 secs)
  /derivative_rules  (1min)
  /derivative_formulas  (2min)
  end

I had already found some access.log parsing code and set up a processing pipeline the last time I wanted to work on this. Here is what we have so far.

3:45AM. Here is the plan. All the log entries are in a list called entries, which I will now sort and split by IP.
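Assuming each parsed entry is a dict with at least `ip`, `time`, and `path` fields (the real field names depend on the access.log parser used), the sort-and-split step can be sketched with `itertools.groupby`:

```python
from datetime import datetime
from itertools import groupby
from operator import itemgetter

# Toy stand-in for the parsed access.log entries.
entries = [
    {"ip": "11.22.33.90", "time": datetime(2013, 1, 7, 15, 0, 0), "path": "/contents"},
    {"ip": "11.22.33.90", "time": datetime(2013, 1, 7, 15, 0, 3), "path": "/derivatives"},
    {"ip": "99.88.77.66", "time": datetime(2013, 1, 7, 15, 1, 0), "path": "/contents"},
]

# Sort by (ip, time) so groupby sees each visitor's hits consecutively.
entries.sort(key=itemgetter("ip", "time"))

for ip, hits in groupby(entries, key=itemgetter("ip")):
    hits = list(hits)
    print(f"{ip} starting at {hits[0]['time']}:")
    # Time spent on a page = gap until the next request from the same IP.
    for prev, cur in zip(hits, hits[1:]):
        stayed = (cur["time"] - prev["time"]).total_seconds()
        print(f"  {prev['path']}  (stayed for {stayed:.0f} secs)")
    print(f"  {hits[-1]['path']}")
    print("  end")
```

The last page of each visit has no “stayed for” duration, since there is no next request to measure against.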

4:15AM. Done. Though I have to cleanup the output some more.

Available on lulu.com

We are proud to announce that the Concise MATH & PHYSICS Minireference is now available on the lulu.com book store. After five years of low-intensity work and two years of high-intensity work, the book has reached sufficient quality in writing, content and narrative flow that we are ready to show it to the world.

Freshman-level math and physics in 300 pages.

We at Minireference Co. are here to fix the textbook industry.

December launch

I have been promoting and selling the book at McGill for the past two weeks and I have received a lot of good feedback from students. There is no point in just thinking about business ideas: you have to go out and talk to clients. In just two weeks, I now have a title (thanks to my friend Adriano), a product line (I made a mechanics-only version too) and a good idea of which pitches work and which do not.

Product

NO BULLSHIT guide to MATH & PHYSICS. In just 300 pages, this book covers Precalculus, Mechanics, Calculus I (derivatives) and Calculus II (integrals). All the material is explained in a clear conversational tone. 100% Math and Physics, No filler.

We sold out today. Let’s see what happens tomorrow.