Last week I attended the NIPS conference and it felt like a grappa shot: intense, but good for brain function. There are so many advances in research, industry is shipping ML in products, and GPUs make previously-impossible things possible. Definitely an exciting time to be in this field.
The opening talk was about deep learning. In fact, a lot of the conference was about deep learning. Non-conformist as I am, I tried not to focus too much on that. All the new applications and interest from industry are great, but I don’t think the research is that revolutionary. I read the review paper Deep learning (paywalled) and I’m going to limit myself to that level of understanding for now. With 4,000 people in one place and 500+ posters to look at, it’s hard enough keeping track of just the topic-modelling topics covered!
I attended the Bayesian Nonparametrics workshop, which was a who’s who of the community. I figured that was my only chance to be in a community where I’d understand more than every second word said. The morning started with a very interesting “theory” talk by Peter Orbanz. I’m sure he’ll post the slides at some point, but in the meantime I found a 100pp PDF of lecture notes by him: Notes on Bayesian Nonparametrics. There’s also a video of a workshop from 4 years ago. This guy knows his stuff, and knows how to explain it too.
Another excellent talk was by Mike Hughes on scalable variational inference that adapts the number of clusters. This looked like a good way to manage fragmentation (too many topics), and it finally starts to show BNP’s killer app: automatically learning the right number of topics for a given corpus.
Around lunch time I caught part of the talk by David Blei, which covered the papers Black Box Variational Inference and Hierarchical Variational Models. Very interesting general-purpose methods. I should look into some source code, to see if I can understand things a bit better.
In the afternoon, Amr Ahmed gave an interesting talk about large-scale LDA and efficient LDA sampling using the alias method. First, for data parallelism, the workload is split across thousands of machines, with each machine keeping a sparse replica of the word-in-topic counts that syncs asynchronously with the shared global state. If the global topic model knows about $K$ different topics, a local node $x$ needs to know only about the $k_x$ topics that actually occur in the documents it processes; since $k_x \ll K$, this allows pushing $K$ much higher. Very neat.

Another interesting trick they use is alias sampling, which preprocesses any n-dimensional multinomial distribution so that samples can be drawn from it efficiently. It doesn’t make sense if you want just one sample, but if you’re taking many samples then the upfront cost of building the “alias table” is amortized. It feels like a 3rd generation of parallel LDA ideas is starting to come up.
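To make the amortization argument concrete, here’s a minimal Python sketch of the alias method (Vose’s construction; the function names are mine, not from the talk). Building the table costs O(n) once; after that, every single draw from the multinomial is O(1), no matter how skewed the distribution:

```python
import random

def build_alias_table(probs):
    """Preprocess a multinomial (list of probabilities summing to 1)
    into (prob, alias) tables in O(n) time."""
    n = len(probs)
    scaled = [p * n for p in probs]          # rescale so the average is 1
    prob = [0.0] * n
    alias = [0] * n
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s] = scaled[s]                  # column s keeps mass scaled[s]...
        alias[s] = l                         # ...and donates the rest to l
        scaled[l] -= 1.0 - scaled[s]         # l gave away (1 - scaled[s])
        (small if scaled[l] < 1.0 else large).append(l)
    for leftover in large + small:           # numerical leftovers: full columns
        prob[leftover] = 1.0
    return prob, alias

def alias_draw(prob, alias):
    """Draw one sample in O(1): pick a uniform column, flip a biased coin."""
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]
```

One build followed by many `alias_draw` calls is exactly the access pattern of a Gibbs sampler that reuses the same word-topic distribution across many tokens, which is why the preprocessing pays for itself.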