Big data and R

Jan 25 2013

Yesterday, I went to a Montreal meetup about R. The event was attended by quite a few people and the good people of Bolidea offered us beer and pizza. The talk was by the Wajam team and discussed how they make use of R for business analytics and system monitoring.

Instead of simply checking basic data like clicks, number of searches and API calls, they combine all this data into a “health indicator” value which is much more accurate at predicting when intervention is required. Basically, dashboards are good but dashboards that can run machine learning algorithms are better.
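
To make the “health indicator” idea concrete, here is a minimal sketch in Python (not R, and not Wajam's actual method, which was not shown at the talk). It assumes the indicator is simply the average absolute z-score of each metric against its recent history; the metric names, the numbers and the threshold are all made up for illustration.

```python
from statistics import mean, stdev

def z_score(history, current):
    """How many standard deviations `current` sits from the historical mean."""
    sigma = stdev(history)
    if sigma == 0:
        return 0.0
    return (current - mean(history)) / sigma

def health_indicator(history_by_metric, current_by_metric):
    """Average absolute z-score across metrics; larger means more unusual."""
    scores = [
        abs(z_score(history_by_metric[name], current_by_metric[name]))
        for name in history_by_metric
    ]
    return mean(scores)

# Made-up numbers, purely for illustration.
history = {
    "clicks": [100, 110, 90, 105, 95],
    "searches": [50, 55, 45, 52, 48],
    "api_calls": [1000, 1100, 900, 1050, 950],
}
normal = {"clicks": 102, "searches": 51, "api_calls": 1010}
outage = {"clicks": 20, "searches": 10, "api_calls": 150}

ALERT_THRESHOLD = 3.0  # hypothetical cut-off, not from the talk

for label, current in [("normal", normal), ("outage", outage)]:
    score = health_indicator(history, current)
    status = "ALERT" if score > ALERT_THRESHOLD else "ok"
    print(f"{label}: {score:.2f} {status}")
```

A single combined score like this is easier to alert on than three separate charts, though a real system would presumably use something smarter than a plain average.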

Their workflow centers around MySQL as the main data store. Web servers et al. send logging information to Scribe for aggregation and all the data is then written to MySQL. The main stats/ML stuff they use for business intelligence is written in R. R pulls the data from MySQL and then produces report graphs and alerts. All this is automated through cron jobs. They said they are not going for “realtime” but they have some jobs running every minute, which is near-realtime.
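
The “pull from the database, compute, alert” loop might look roughly like the sketch below. This is only an illustration under several assumptions: it is written in Python rather than R, it uses an in-memory SQLite database as a stand-in for MySQL, and the `events` table, the event names and the minimum counts are all hypothetical.

```python
import sqlite3

def load_counts(conn):
    """Return {event_type: count} from the (hypothetical) `events` table."""
    rows = conn.execute(
        "SELECT event_type, COUNT(*) FROM events GROUP BY event_type"
    ).fetchall()
    return dict(rows)

def check(counts, minimums):
    """Return alert messages for every event type below its minimum count."""
    alerts = []
    for event_type, minimum in sorted(minimums.items()):
        seen = counts.get(event_type, 0)
        if seen < minimum:
            alerts.append(f"{event_type}: {seen} events, expected at least {minimum}")
    return alerts

# In-memory database standing in for the real data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_type TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?)",
    [("click",), ("click",), ("click",), ("search",)],
)

counts = load_counts(conn)
alerts = check(counts, {"click": 2, "search": 3, "api_call": 1})
for message in alerts:
    print(message)
```

Saved as a script (say `check_health.py`, a made-up name), a cron entry such as `* * * * * python3 check_health.py` would run it every minute, which matches the near-realtime cadence described above.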

It was all very cool stuff to hear about, but I was hoping to see some R code or a demo during the presentation. Nevertheless, an interesting and more technical discussion followed the talk.

Some of the things mentioned were:

  • Pig: an SQL-like query language which converts your queries into MapReduce jobs that run on Hadoop, over data stored in HDFS (the Hadoop Distributed File System). Apparently it is very good. Listening to the guys talk about it made it sound like handling 50TB of data is just as easy as handling 1GB on your computer…
  • There was a discussion about running R in parallel but I am not sure which packages they were talking about. The other day I also saw a new project on HN… so interesting things are happening on that front. Using such tools one could run “exploratory analyses” on the whole dataset instead of a subset which fits on your machine.
  • There is no central package management repository. The makers of R want to preserve the spirit of “scientific publication” and don’t want to become software developers. In this spirit, when creating an R package you have to include a LaTeX-like documentation file: think “I am publishing a paper with some code attached”.
    The process for approval to CRAN takes time so some people post their stuff on GitHub.
  • Speaking of documentation, they talked about some literate-programming-like tools: Sweave, roxygen and knitr.
    This is kind of cool, especially with the markdown support.
    I imagine this could be very useful for writing stats tutorials.
    Hey what about a minireference for stats?
  • Using Shiny it should be possible to make a nice web-app that teaches the basic concepts of stats in a very short time. Of course you could make it available in print also, but an interactive version would be much better I think. Revenue would come from sales of the book, with the web tutorial (say 30% of the material) available for free.
  • Speaking of books, one of the members of the audience said that there is an opportunity for writing a book on R.

    The old me would have said “hey, I can learn about R and then write a minireference for R”, but I know better now. Focus on math and phys! Don’t spread your energy too much. How could you teach people if you just learned the subject? Textbooks should be written by people who are at least two levels more advanced than the intended audience. You should know X, know what X is used for and also know what the stuff X is used for is used for. The reason is that people who can see two hops ahead in the graph of knowledge will have better answers to offer for the “why do I need to know this?” question.

  • Finally, something related to the big data thread of the discussion, which I read about this morning on Hacker News: Drake is a way to automate handling large datasets using a “makefile”-style interface. There were links and discussion of other projects on HN. You need to install Clojure to run Drake.
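
For readers who have not used make, the “makefile” idea mentioned in the last item boils down to this: a data-processing step is re-run only when its output is missing or older than its inputs. The sketch below shows that rule in plain Python. It is not Drake's syntax or API; the function names, file names and the toy line-counting step are invented for illustration.

```python
import os
import tempfile

def needs_rebuild(target, sources):
    """True if `target` is missing or older than any of its `sources`."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_time for src in sources)

def run_step(target, sources, action):
    """Run `action` only when `target` is out of date; report whether it ran."""
    if needs_rebuild(target, sources):
        action(target, sources)
        return True
    return False

def count_lines(target, sources):
    """Toy step: write the total number of lines in `sources` to `target`."""
    total = 0
    for src in sources:
        with open(src) as handle:
            total += sum(1 for _ in handle)
    with open(target, "w") as out:
        out.write(str(total))

# Throwaway files standing in for a raw log and a derived summary.
workdir = tempfile.mkdtemp()
raw = os.path.join(workdir, "raw.log")
summary = os.path.join(workdir, "summary.txt")
with open(raw, "w") as handle:
    handle.write("a\nb\nc\n")

first_run = run_step(summary, [raw], count_lines)   # target missing, so it runs
second_run = run_step(summary, [raw], count_lines)  # target up to date, skipped
print(first_run, second_run)
```

With large datasets the skipped steps are where the time savings come from, since an expensive aggregation is not repeated unless its inputs have changed.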

Ok! Link offload complete. Now I can finally step away from the computer and stretch a bit. You, my dear reader, should do the same. Go grab a glass of water or something and work some stretches in along the way.

Wow it is 2PM already! HN I want my morning back!!!

6 responses so far

  1. > Textbooks should be written by people who are at least two levels more advanced than the intended audience. You should know X, know what X is used for and also know what the stuff X is used for is used for. The reason is that people who can see two hops ahead in the graph of knowledge will have better answers to offer for the “why do I need to know this?” question.

    They may be able to answer asked questions better, but they can’t ask the questions themselves – this is the curse of expertise: you forget what confusions you had as a newbie, what mistakes you made, what assumptions and false beliefs you brought with you.

    Of course, in practice this is not always such a curse. Since textbooks are often written by active teachers who have taught hundreds or thousands of students, that experience ameliorates to some extent this ‘inferential distance’ (http://lesswrong.com/lw/kg/expecting_short_inferential_distances/), but you say nothing about that…

  2. [...] http://minireference.com/blog/big-data-and-r/ [...]

  3. HDFS = Hadoop Distributed File System

  4. Thx. Corrected.

  5. Hi Gwern, I know what you mean. The last thing we want in a teacher is some super specialized prof. who assumes the students already know the material which he is supposed to be teaching.

    Perhaps a better statement would be “… written by people who are //exactly// two levels more advanced…” so that we get the “lookahead” benefits in the graph of knowledge of an expert in the field as well as proximity with the student’s way of thinking.

    In terms of actual skills for “transferring the knowledge”, nothing beats an experienced teacher.

  6. [...] Big data and R [...]