Big data and R

Yesterday, I went to a Montreal meetup about R. The event was well attended, and the good people of Bolidea offered us beer and pizza. The talk was given by the Wajam team, who discussed how they use R for business analytics and system monitoring.

Instead of simply checking basic data like clicks, number of searches, and API calls, they combine all this data into a “health indicator” value which is much more accurate at predicting when intervention is required. Basically, dashboards are good, but dashboards that can run machine learning algorithms are better.
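
I am guessing at the details here, but a “health indicator” of the kind they described could be as simple as a weighted combination of standardized metrics. The function below is a minimal sketch of that idea, with made-up names and numbers, not Wajam’s actual code.

    # z-score each metric against its recent history, then combine into one number
    health_score <- function(current, history, weights = NULL) {
      # current: named vector of the latest values (clicks, searches, API calls, ...)
      # history: matrix with one column per metric, one row per past time window
      z <- (current - colMeans(history)) / apply(history, 2, sd)
      if (is.null(weights)) weights <- rep(1 / length(z), length(z))
      sum(weights * z)  # a large |score| means "something unusual is going on"
    }

    # toy usage with made-up traffic counts
    past <- cbind(clicks = rpois(60, 1000), searches = rpois(60, 500), api = rpois(60, 2000))
    now  <- c(clicks = 400, searches = 480, api = 1900)  # clicks just dropped
    health_score(now, past)

In practice, the machine learning part would presumably be a model trained on past incidents rather than fixed weights, but the point is the same: one number to watch instead of ten graphs.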

Their workflow centers around MySQL as the main data store. Web servers et al. send logging information to Scribe for aggregation, and all the data is then written to MySQL. The main stats/ML code they use for business intelligence is written in R: it pulls the data from MySQL and then produces report graphs and alerts. All of this is automated through cron jobs. They said they are not going for “realtime”, but they have some jobs running every minute, which is near-realtime.
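
To make that concrete, here is a hedged sketch of what such a cron-driven R job might look like; the host, database, table, and column names are all hypothetical, not details from the talk.

    library(DBI)
    library(RMySQL)

    # pull one week of search counts from the MySQL store
    con <- dbConnect(MySQL(), host = "db.example.com", dbname = "logs",
                     user = "stats", password = Sys.getenv("DB_PASS"))
    searches <- dbGetQuery(con, "
      SELECT DATE(ts) AS day, COUNT(*) AS n
      FROM searches
      WHERE ts > NOW() - INTERVAL 7 DAY
      GROUP BY day
      ORDER BY day")
    dbDisconnect(con)

    # report graph
    png("searches_last_week.png")
    plot(as.Date(searches$day), searches$n, type = "b",
         xlab = "day", ylab = "searches per day")
    dev.off()

    # crude alert: yell if today's count drops below half of the weekly average
    if (tail(searches$n, 1) < 0.5 * mean(searches$n)) {
      system("echo 'search volume dropped' | mail -s 'ALERT: searches' ops@example.com")
    }

A script like this would then be scheduled from a crontab entry, e.g. running "Rscript monitor.R" every minute for the near-realtime jobs.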

It was all very cool stuff to hear about, but I was hoping to see some R code or a live demo during the presentation.
Nevertheless, an interesting and more technical discussion followed the talk.

Some of the things mentioned were:

  • Pig: an SQL-like query language which converts your queries into MapReduce jobs to run on HDFS (the Hadoop Distributed File System). Apparently it is very good. Listening to the guys talk about it made it sound like handling 50TB of data is just as easy as handling 1GB on your computer…
  • There was a discussion about running R in parallel, but I am not sure which packages they were talking about (I sketch one option using base R’s parallel package after this list). The other day I saw a new project on HN also… so interesting things are happening on that front. Using such tools one could run “exploratory analyses” on the whole dataset instead of on a subset which fits on your machine.
  • There is no central package-management repository in the usual software-engineering sense. The makers of R want to preserve the spirit of “scientific publication” and don’t want to become software developers. In this spirit, when creating an R package you have to include documentation in a LaTeX-like format: think of it as publishing a paper with some code attached.
    The approval process for CRAN takes time, so some people just post their stuff on GitHub.
  • Speaking of documentation, they talked about some literate-programming tools: Sweave, roxygen, and knitr.
    This is kind of cool — especially with the markdown support (see the minimal knitr example after this list).
    I imagine this could be very useful for writing stats tutorials.
    Hey, what about a minireference for stats?
  • Using Shiny it should be possible to make a nice web app that teaches the basic concepts of stats in a very short time (a toy sketch follows after this list). Of course you could make it available in print also, but an interactive version would be much better, I think. Sales would come from the book, with the web tutorial (say, 30% of the material) available for free.
  • Speaking of books. One of the members of the audience said that there is an opportunity for writing a book on R.

    The old me would be like “hey, I can learn R and then write a minireference for R book”, but I know better now. Focus on math and physics! Don’t spread your energy too thin. How could you teach people a subject you have only just learned yourself? Textbooks should be written by people who are at least two levels more advanced than the intended audience: you should know X, know what X is used for, and also know what the stuff X is used for is used for. The reason is that people who can see two hops ahead in the graph of knowledge will have better answers to offer for the “why do I need to know this?” question.

  • Finally, something related to the big-data thread of the discussion, which I heard about this morning on Hacker News: Drake is a way to automate handling large datasets using a makefile-like interface. There were links to and discussion of other projects on HN. You need to install Clojure to run Drake.
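
Since I am not sure which parallel packages came up, here is a sketch using base R’s parallel package (shipped with R since 2.14), which is enough for an embarrassingly parallel exploratory pass over many files; the paths and the “value” column are made up for illustration.

    library(parallel)

    files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)

    summarise_chunk <- function(path) {
      chunk <- read.csv(path)
      c(rows = nrow(chunk), mean_value = mean(chunk$value, na.rm = TRUE))
    }

    # one forked worker per core; on Windows, use makeCluster() + parLapply() instead
    results <- mclapply(files, summarise_chunk, mc.cores = detectCores())
    do.call(rbind, results)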
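
For the knitr bullet above, this is roughly what a minimal R Markdown file looks like; running knitr::knit2html("report.Rmd") turns it into a standalone HTML page with the code executed and the plot embedded (the numbers are made up).

    A one-chunk literate report
    ===========================

    The code below is executed when the document is knit,
    and the resulting plot is embedded in the output.

    ```{r visits, fig.width=6}
    visits <- c(1200, 1500, 900, 33000, 8000)  # made-up daily visit counts
    plot(visits, type = "b", ylab = "visits per day")
    ```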
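
And for the Shiny idea, a toy version of an interactive stats lesson: a slider for the sample size and a live plot of the sampling distribution of the mean. This is just a sketch of the concept, not anything shown at the talk.

    library(shiny)

    ui <- fluidPage(
      titlePanel("Sampling distribution of the mean"),
      sliderInput("n", "Sample size", min = 5, max = 500, value = 30),
      plotOutput("hist")
    )

    server <- function(input, output) {
      output$hist <- renderPlot({
        means <- replicate(1000, mean(rnorm(input$n)))
        hist(means, breaks = 30, main = paste("n =", input$n), xlab = "sample mean")
      })
    }

    shinyApp(ui = ui, server = server)  # source this at the R console to launch it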

Ok! Link offload complete. Now I can finally step away from the computer and stretch a bit. You, my dear reader, should do the same. Go grab a glass of water or something and work some stretches in along the way.

Wow it is 2PM already! HN I want my morning back!!!

Hacker News launch

Two weeks ago, I posted the book on Hacker News. There was a tremendous amount of interest on the first day (20k visits in one day!)
and plenty of good (i.e., critical) feedback. With this post, I want to take a moment and record my impressions from surfing the Hacker News wave.

Conversion rates

1. Roughly 33000 people showed up on the “product” page.
2. Of these, 7000 clicked on at least one of the modals (engagement).
3. About 1761 of them clicked on the “Buy Book” button and went
to the print-on-demand site (lulu.com).
4. Of these, 264 ordered the book.

The engagement rate is 7000/33000 ≈ 21%.
The percentage of engaged visitors who clicked “Buy Book” is 25% (= 1761/7000).
The final-step conversion rate is 15% (= 264/1761).
Overall that gives 0.21 × 0.25 × 0.15 ≈ 0.8% conversion from visitor to client (= 264/33000).
Is this good or bad?

Perhaps the more interesting metric is the conversion rate
of engaged visitors (those who clicked on at least one modal) to clients,
which is 3.75%.
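
For the record, the same funnel re-derived in a few lines of R:

    visitors   <- 33000
    engaged    <- 7000
    buy_clicks <- 1761
    orders     <- 264

    engaged / visitors     # 0.21   engagement rate
    buy_clicks / engaged   # 0.25   engaged -> "Buy Book" click
    orders / buy_clicks    # 0.15   click -> order
    orders / visitors      # 0.008  visitor -> client overall
    orders / engaged       # 0.038  engaged visitor -> client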

A back-of-the-envelope calculation tells me that my expected earnings
per engaged visitor are about 50 cents. I feel confident that I will be
able to buy some education-related keywords for less than that.

TODO: try Mixpanel (GA is a PITA: full path of the referral URL, plz!), invest in and test AdWords.

Book product

The book — as a product — works, even if there are still some rough edges.

TODO: fix typos, add math exercises, add physics exercises.

PDF product

Some of the engaged visitors also clicked through to the PDF: 19% (= 847/4500).
Then there is another factor of 15% = (50+37+19+7+7+3+3)/847, i.e., one week of PDF sales divided by one week of clicks through to Gumroad.
Thus about 2.8% of engaged visitors went on to buy the PDF.

Overall this means that 6.55% = 3.75% + 2.8% of my engaged visitors go on to become clients.
Now that is cool!
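
Same arithmetic for the PDF side, using the 847 clicks and the week of daily sales listed above:

    pdf_clicks <- 847
    pdf_sales  <- c(50, 37, 19, 7, 7, 3, 3)  # one week of daily PDF sales

    pdf_clicks / 4500            # 0.19   engaged -> Gumroad click
    sum(pdf_sales) / pdf_clicks  # 0.15   Gumroad click -> PDF sale
    sum(pdf_sales) / 4500        # 0.028  engaged visitor -> PDF client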

Similar stories

Tools used