Generating ePub from LaTeX

I want to tell you about my journey to produce the ePub files for the No Bullshit Guide textbooks. This has been an epic battle with lots of technological obstacles, but I got it working in the end, and the results are beautiful:

In this blog post, I want to share what I’ve learned about generating ePub and Mobi files from LaTeX source files that contain lots of math equations. I feel this ought to be recorded somewhere for the benefit of other STEM authors and publishers who have LaTeX manuscripts and want to convert them to .epub and .mobi formats. Read on to watch the “How it’s made” episode about math eBooks.

The end-to-end book production pipeline looks like this:

Figure 1: The eBook production pipeline described in this blog post. Each box represents a different markup format and the arrows indicate the software used to convert between formats. The hard step is to produce clean .html+MathJax format from the .tex source. The generation of the other formats is standard.

The state of open educational resources in 2017

I spent the last couple of weeks exploring the open educational resources (OER) landscape and wanted to write down my thoughts and observations about the field. The promise of an OER “revolution” that will put quality learning material into the hands of every student has been around for several decades, but we have yet to see OER displace the established publishers. Why is it that “open content” hasn’t taken off more, and what can we do to make things happen in the coming decade?

Git for authors

Using version control is very useful for storing text documents like papers and books. It’s amazing how easy it is to track changes to documents and to communicate these changes to other authors. In my career as a researcher, I’ve had the chance to initiate many colleagues to the use of mercurial and git for storing paper manuscripts. Also, when working on my math books, I’ve had the fortune to work with an editor who understands version control and made her edits directly in the books’ source repo. This blog post is a brainstorming session on what a git user interface tailored to authors’ needs could look like.

Linear algebra concept maps

I spent the last week drawing. More specifically, drawing in concept space. Drawing concept maps for the linear algebra book.

Without going into too much detail, the context is that the old concept map was overloaded with information, so I decided to redo it. I had to split the concept map across three pages, because there’s a lot of stuff to cover. Check it out.

Math basics and how they relate to geometric and computational aspects of linear algebra

The skills from high school math you need to “import” to your study of linear algebra are geometry, functions, and the tricks for solving systems of equations (e.g. the values $x$ and $y$ that simultaneously satisfy the equations $x+y=3$ and $3x+y=5$ are $x=1$ and $y=2$.)

The first thing you’ll learn in linear algebra is the Gauss–Jordan elimination procedure, which is a systematic approach for solving systems of $n$ equations with $n$ unknowns. You’ll also learn how to compute matrix products, matrix determinants, and matrix inverses. This is all part of Chapter 3 in the book.
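If you want to try Gauss–Jordan elimination on a computer, here is a minimal sketch using SymPy’s `rref()` method (this is an illustration, not code from the book). We apply it to the augmented matrix of the example system $x+y=3$, $3x+y=5$:

```python
from sympy import Matrix

# augmented matrix for the system  x + y = 3,  3x + y = 5
A = Matrix([[1, 1, 3],
            [3, 1, 5]])

# Gauss-Jordan elimination: reduce A to reduced row echelon form (RREF)
rref_A, pivot_cols = A.rref()

# the last column of the RREF contains the solution: x = 1, y = 2
solution = rref_A[:, -1]
```

The RREF of the augmented matrix is $\left[\begin{smallmatrix}1 & 0 & 1\\ 0 & 1 & 2\end{smallmatrix}\right]$, which you can read off directly as $x=1$, $y=2$, matching the solution stated above.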

In Chapter 4, we’ll learn about vector spaces and subspaces. Specifically, we’ll discuss points in $\mathbb{R}^3$, lines in $\mathbb{R}^3$, planes in $\mathbb{R}^3$, and $\mathbb{R}^3$ itself. The basic computational skills you picked up in Chapter 3 can be used to solve interesting geometric problems in vector spaces of any number of dimensions $\mathbb{R}^n$.

Linear transformations and theoretical topics

The concept of a linear transformation $T:\mathbb{R}^n \to \mathbb{R}^m$ is the extension of the idea of a function of a real variable $f:\mathbb{R} \to \mathbb{R}$. Linear transformations are linear functions that take $n$-vectors as inputs and produce $m$-vectors as outputs.
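The defining property that makes a transformation $T$ linear is that it preserves linear combinations:

$$T(\alpha \vec{u} + \beta \vec{v}) = \alpha\, T(\vec{u}) + \beta\, T(\vec{v}), \qquad \text{for all } \vec{u}, \vec{v} \in \mathbb{R}^n \text{ and } \alpha, \beta \in \mathbb{R}.$$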

Understanding linear transformations is synonymous with understanding linear algebra. There are many properties of a linear transformation that we might want to study. The practical side of linear transformations is their nature as a vector-upgrade to your existing skill set of modelling the world with functions. You’ll also learn how to study, categorize, and understand linear transformations using new theoretical tools like eigenvalues and eigenvectors.

Matrices and applications

Another fundamental idea in linear algebra is the equivalence between linear transformations $T:\mathbb{R}^n \to \mathbb{R}^m$ and matrices $M \in \mathbb{R}^{m\times n}$. Specifically, the abstract idea of a linear transformation $T:\mathbb{R}^n \to \mathbb{R}^m$, when we fix a particular choice of basis $B_i$ for the input space and $B_o$ for the output space of $T$, can be represented as a matrix of coefficients $_{B_o}[M_T]_{B_i} \in \mathbb{R}^{m\times n}$. The precise mathematical term for this equivalence is isomorphism. The isomorphism between linear transformations and their matrix representations means we can characterize the properties of a linear transformation by analyzing its matrix representation.
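As a small illustration (the matrix below is a made-up example, not one from the book), here is how applying a linear transformation $T:\mathbb{R}^3 \to \mathbb{R}^2$ reduces to a matrix-vector product in Python with NumPy:

```python
import numpy as np

# a hypothetical 2x3 matrix representing T: R^3 -> R^2
# (with respect to the standard bases of the input and output spaces)
M = np.array([[1, 0,  2],
              [0, 1, -1]])

def T(v):
    """Apply the linear transformation as a matrix-vector product."""
    return M @ v

v = np.array([1, 2, 3])
w = T(v)          # w == [7, -1]

# linearity check: T(a*u + b*v) == a*T(u) + b*T(v)
u = np.array([0, 1, 1])
a, b = 2, -3
lin_ok = np.array_equal(T(a*u + b*v), a*T(u) + b*T(v))
```

The `lin_ok` check at the end verifies the defining property of linearity on one pair of vectors, which is exactly why the matrix representation captures everything about $T$.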

Chapter 7 in the book contains a collection of short “applications essays” that describe how linear algebra is applied to various domains of science and business. Chapter 8 is a mini-intro to probability theory and Chapter 9 is an intro course on quantum mechanics. All the applications are completely optional, but I guarantee you’ll enjoy reading them. The power of linear algebra made manifest.

If you’re a seasoned blog reader and you just finished reading this post, I know what you’re feeling: a moment of anxiety washes over you. Is a popup asking you to sign up going to appear from somewhere? Is there going to be a call to action of some sort?

Nope.

Sometime in mid-December, I set out to create problem sets for the book. My friend Nizar Kezzo offered to help me write the exercises for Chapter 2 and Chapter 4, and I made a plan to modernize the calculus questions a bit, quickly write a few more, and be done in a couple of weeks.

That was four months ago! Clearly, I was optimistic (read: unrealistic) about my productivity. Nizar did his part right on schedule, but it took me forever to write nice questions for the other chapters and to proofread everything. After all, if the book is no bullshit, the problem sets must also be no bullshit. I’m quite happy with the results!

noBS problem sets: letter format or 2up format.

Please, if you find any typos or mistakes in the problem sets, drop me a line so I can fix them before v4.1 goes to print.

Tools

In addition to working on the problem sets, I made some updates to the main text. I also developed some scripts to use in combination with latexdiff to filter out only the pages with changes. This automation saved me a lot of time: instead of paging through 400pp of text, I only had to review the subset of pages that had changes in them.
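My actual scripts aren’t shown here, but the core idea can be sketched in Python: compare the text of each page in the old and new PDFs and keep only the pages that differ. The function below operates on lists of per-page text strings (in practice you’d extract these with a PDF library; the toy data is made up for illustration):

```python
def changed_page_indices(old_pages, new_pages):
    """Return the indices of pages whose text differs between two versions.

    old_pages, new_pages: lists of per-page text strings
    (e.g. extracted from the v4.0 and v4.1 PDFs).
    """
    n = min(len(old_pages), len(new_pages))
    changed = [i for i in range(n) if old_pages[i] != new_pages[i]]
    # pages that exist in only one of the versions count as changed too
    changed.extend(range(n, max(len(old_pages), len(new_pages))))
    return changed

# toy example: page 1 was edited and page 3 is new
old = ["intro", "derivatives", "integrals"]
new = ["intro", "derivatives (edited)", "integrals", "new appendix"]
changed_page_indices(old, new)   # [1, 3]
```

With the changed page numbers in hand, assembling a review PDF of just those pages is a one-liner with most PDF toolkits.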

If you would like to see the changes made to the book from v4.0 to v4.1 beta, check out noBSdiff_v4.0_v4.1beta.pdf.

Future

Today I handed over the problems to my editor, and once she has taken a look at them, I’ll merge the problems into the book and release v4.1. The coming months will be focused on the business side. I know I keep saying that, but now I think the book is solid and complete, so I will be much more confident when dealing with distributors and bookstores. Let’s scale this!

Ghetto CRM

Say you want to extract the names and emails from all the messages under a given tag in your Gmail account. In my case, it’s the 60 readers who took part in the “free PDF if you buy the print version” offer. I’d like to send them an update.

I started clicking around in Gmail and compiling the list by hand, but Gmail’s UI is NOT designed for this: you can’t select the text of the email field because a popup shows up, and yada yada…. If you’re reading this, you probably got here because you have the same problem, so I don’t need to explain.

Yes, this is horribly repetitive, and yes, it can be automated using Python:

import imaplib
import email
import email.utils
import getpass

m = imaplib.IMAP4_SSL('imap.gmail.com', 993)
m.login('your.name@gmail.com', getpass.getpass())  # put your address here

# see the tags (i.e. IMAP mailboxes) using
#   m.list()

# select the desired tag, then find all the messages in it
m.select('your-tag-name')                          # put your tag name here
typ, data = m.search(None, 'ALL')

# build a list of people (from both the From and To headers)
people = []
for i in range(1, len(data[0].split(' ')) + 1):
    typ, msg_data = m.fetch(str(i), '(RFC822)')
    for response_part in msg_data:
        if isinstance(response_part, tuple):
            msg = email.message_from_string(response_part[1])
            name1, addr1 = email.utils.parseaddr(msg['from'])
            name2, addr2 = email.utils.parseaddr(msg['to'])
            d1 = {"name": name1, "email": addr1}
            d2 = {"name": name2, "email": addr2}
            people.extend([d1, d2])
            # uncomment below to see wat-a-gwaan-on
            #for header in ['subject', 'to', 'from']:
            #    print "-"*70
            #    print header, ":", msg[header]

# lots of people, duplicate entries
len(people)

# filter uniques by email address
# awesome trick by gnibbler
# via http://stackoverflow.com/questions/11092511/python-list-of-unique-dictionaries
people = {d['email']: d for d in people}.values()

# just uniques
len(people)

# print as comma-separated values for import into a mailing list
for p in people:
    print '%s <%s>,' % (p['name'], p['email'])

# ciao!
m.close()
m.logout()



Getting started with ML in python

Next week I’m interviewing for a Data Scientist position so I figured I better brush up my machine learning skills. I found some neat youtube tutorials [1,2] on using scikit-learn so I thought this would be a good place to start.

From experience, I was expecting that setting up the dev environment with numpy, scipy, ipython notebook, etc., would take me half a day (compiling and debugging things that don’t work out of the box), but I was pleasantly surprised when a few pip commands later I had a fully functional environment. I’ve pasted the sequence of commands below in case you want to learn yourself some ML too.

Create a virtualenv

The first step is to create an isolated virtualenv for the project. Think of this as “basic python hygiene”: we want to isolate the python libraries used to follow the tutorial from the system-wide python libraries. (For most people this is just “best practices,” but in my case my system-wide site-packages contains outdated versions and/or half-broken dependencies because of the dysfunctional relationship between fink, macports, and homebrew that plays out on my computer.) To set up a virtualenv in a given directory and activate it, proceed as follows:

$ cd ~/Projects/MLpractice
$ virtualenv pyML
$ . pyML/bin/activate      # . is the same as source

Install prerequisites

Next we’ll install all the prerequisite packages and scikit-learn. Note that the command line starts with (pyML), which indicates that pip will install these packages in the pyML virtualenv and not system-wide.

(pyML)$ which python
(pyML)$ which pip

(pyML)$ pip install numpy
(pyML)$ pip install pyzmq
(pyML)$ pip install ipython[all]
(pyML)$ pip install scipy
(pyML)$ pip install pyparsing
$ brew update
$ brew install freetype
$ brew link --force freetype
$ brew install libpng
$ brew link --force libpng
$ brew install libagg
(pyML)$ pip install matplotlib
(pyML)$ pip install psutil
(pyML)$ pip install scikit-learn

Done

Now everything is ready and set up for us. We can clone the repositories with the example code and start the ipython notebook as follows.

$ git clone git@github.com:jakevdp/sklearn_scipy2013.git
$ git clone git@github.com:ogrisel/parallel_ml_tutorial.git
(pyML)$ cd sklearn_scipy2013/notebooks/
(pyML)$ ipython notebook --pylab inline

Your default browser should open showing you iPython notebooks for the first tutorial.
Let the learning begin—both for machine and human alike!

A scriptable future for the Web and home servers

I’m organizing papers today, and I keep finding dev notes and plans for my big “home server” idea: being able to run all your “cloud services” on your own hardware, with all the data protection this entails. But what is easy to imagine can be difficult to bring to reality. There are a lot of technological aspects to figure out (dyndns, mail, www, filesharing, apps?), but there is also the general public’s lack of interest in privacy matters.

The freedom of computing and the Internet is a question that depends on technology but also on public relations. I recently came up with a plan for one possible way to get FOSS into homes. PR is indicated in brackets.

• Phase 0: Develop FOSS clones for most popular cloud software. [100% done]
• Phase 1: Non-tech-savvy users learn to deploy “own server” in the cloud based on a FOSS software stack. [2015]
(Run your own Google with just one click! Customize and automate everything. Don’t let anyone tell you what to do on the Internet.)
• Phase 2: Non-tech-savvy users move their existing “own servers” to run on their “home server.” [2020]
(The Internet is distributed; be the Internet. Who got ur logs? Protect your privacy and that of your family and friends. Political discussion is not a crime. Unlimited storage—just add USB drives to the RAID. )

I think the two-step process for the home server is much more likely, even realistic. Both phases involve transitions to better features. The transition to Phase 1 will be interesting for power users, but if everything is scripted, then even non-tech users could “run their own” thing. For it to happen, we need to get to “same thing as … but with more ….”  Only after we have a mature system of own apps can we then move to Phase 2 where we say: “same thing as own, but at home.”

I’m a big believer in humanity and our ability to learn, adapt, and advance, so I think we will be able to “domesticate” the power of computing as we previously domesticated fire and electricity.

Big data and R

Yesterday, I went to a Montreal meetup about R. The event was well attended, and the good people of Bolidea offered us beer and pizza. The talk was given by the Wajam team and discussed how they make use of R for business analytics and system monitoring.

Instead of simply checking basic data like clicks and the number of searches and API calls, they combine all this data into a “health indicator” value, which is much more accurate at predicting when intervention is required. Basically, dashboards are good, but dashboards that can run machine learning algorithms are better.

Their workflow centers around MySQL as the main data store. Web servers et al. send logging information to Scribe for aggregation and all the data is then written to MySQL. The main stats/ML stuff they use for business intelligence is written in R. R pulls the data from MySQL and then produces report graphs and alerts. All this is automated through cronjobs. They said they are not going for “realtime” but they have some jobs running every minute, which is near-realtime.

It was all very cool stuff to hear about, but I was hoping to see some R code or a demo during the presentation. Nevertheless, after the talk an interesting discussion followed, which got more technical.

Some of the things mentioned were:

• Pig: an SQL-like query language that converts your queries into MapReduce jobs to run on HDFS (the Hadoop Distributed File System). Apparently it is very good. Listening to the guys talk about it made it sound like handling 50TB of data is just as easy as handling 1GB on your own computer…
• There was a discussion about running R in parallel, but I am not sure which packages they were talking about. The other day I saw a new project on HN on this front too… so interesting things are happening there. Using such tools, one could run “exploratory analyses” on the whole dataset instead of on a subset that fits on your machine.
• There is no central package management repository. The makers of R want to preserve the spirit of “scientific publication” and don’t want to become software developers. In this spirit, when creating an R package you have to include a documentation tex file: think of it as publishing a paper with some code attached. The approval process for CRAN takes time, so some people post their stuff on github instead.
• Speaking of documentation, they talked about some literate-programming-like tools: Sweave, roxygen, and knitr. This is kind of cool, especially with the markdown support. I imagine this could be very useful for writing stats tutorials. Hey, what about a minireference for stats?
• Using Shiny, it should be possible to make a nice web app that teaches the basic concepts of stats in a very short time. Of course, you could make it available in print also, but an interactive version would be much better, I think. Sales would come from the book, with a web tutorial (say, 30% of the material) available for free.
• Speaking of books. One of the members of the audience said that there is an opportunity for writing a book on R.

The old me would be like “hey, I can learn about R and then write a minireference for R book,” but I know better now. Focus on math and phys! Don’t spread your energy too thin. How could you teach people a subject you just learned yourself? Textbooks should be written by people who are at least two levels more advanced than the intended audience: you should know X, know what X is used for, and also know what the stuff X is used for is used for. The reason is that people who can see two hops ahead in the graph of knowledge will have better answers to offer for the “why do I need to know this?” question.

• Finally, something related to the big data thread of the discussion, which I heard about this morning on Hacker News: Drake is a way to automate handling large datasets using the “makefile” interface. There were links to and discussion of other projects on HN. You need to install Clojure to run Drake.

Ok! Link offload complete. Now I can finally step away from the computer and stretch a bit. You, my dear reader, should do the same. Go grab a glass of water or something and work some stretches in along the way.

Wow it is 2PM already! HN I want my morning back!!!

Showing off with python

2:57AM on a Monday. I have to be up at 8AM. The faster I get the job done the more sleep I get. Sounds like the kind of thing to motivate a person.

TASK: Parse an access.log file and produce page visit trace for each visitor. Ex:

11.22.33.90 on Monday at 3pm   (Montreal, Firefox 4, on Mac OS X):
/contents          (stayed for 3 secs)
/derivatives       (stayed for 2m20sec)
/contents          (6 secs)
/derivative_rules  (1min)
/derivative_formulas  (2min)
end

I had already found some access.log parsing code and set up a processing pipeline the last time I wanted to work on this. Here is what we have so far.

3:45AM. Here is the plan. All the log entries are in a list called entries, which I will now sort and split by IP.
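Assuming each parsed entry is a dict with ip, timestamp, and path keys (my real field names may differ; the sample data below is made up), the sort-and-split step looks something like this:

```python
from itertools import groupby
from operator import itemgetter

# hypothetical parsed log entries (in arbitrary order)
entries = [
    {"ip": "11.22.33.90", "timestamp": 103, "path": "/derivatives"},
    {"ip": "99.88.77.66", "timestamp": 101, "path": "/contents"},
    {"ip": "11.22.33.90", "timestamp": 100, "path": "/contents"},
]

# sort by IP then by time, so each visitor's requests are contiguous and ordered
entries.sort(key=itemgetter("ip", "timestamp"))

# split into one visit trace per IP (groupby relies on the sorted order above)
traces = {ip: [e["path"] for e in group]
          for ip, group in groupby(entries, key=itemgetter("ip"))}
# traces["11.22.33.90"] == ["/contents", "/derivatives"]
```

From each per-IP trace, the "stayed for N secs" figures fall out as differences between consecutive timestamps.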

4:15AM. Done. Though I still have to clean up the output some more.