Ghetto CRM

Say you want to extract the names and emails from all the messages under a given tag in your Gmail. In my case, it’s the 60 readers who took part in the “free PDF if you buy the print version” offer. I’d like to send them an update.

I started clicking around in Gmail and compiling the list, but Gmail’s UI is NOT designed for this: you can’t select the text of the email field because a popup shows up, and yada yada… If you’re reading this, you probably got to this post because you have the same problem, so I don’t need to explain.

Yes, this is horribly repetitive, and yes, it can be automated using Python:

import imaplib
import email
from email.utils import parseaddr
import getpass


user = input("Enter your Gmail username: ")
# note: Gmail may require an app-specific password for IMAP logins
pwd = getpass.getpass("Enter your password: ")

m = imaplib.IMAP4_SSL('imap.gmail.com', 993)
m.login(user, pwd)

# see IMAP client
# m
# see tags (i.e. mailboxes) using
# m.list()


# select the desired tag
m.select('miniref/lulureaders', readonly=True)
typ, data = m.search(None, 'ALL')


# build a list of people (from both FROM and TO headers)
people = []
for num in data[0].split():      # message numbers returned by the search
    typ, msg_data = m.fetch(num, '(RFC822)')
    for response_part in msg_data:
        if isinstance(response_part, tuple):
            # the raw message arrives as bytes
            msg = email.message_from_bytes(response_part[1])
            name1, addr1 = parseaddr(msg.get('to', ''))
            name2, addr2 = parseaddr(msg.get('from', ''))
            d1 = {"name": name1, "email": addr1}
            d2 = {"name": name2, "email": addr2}
            # skip headers that were missing or did not parse
            people.extend(d for d in (d1, d2) if d["email"])
            # uncomment below to see wat-a-gwaan-on
            #for header in ['subject', 'to', 'from']:
            #    print('%-8s: %s' % (header.upper(), msg[header]))
            #print("-" * 70)

# lots of people, duplicate entries
print(len(people))

# filter uniq
# awesome trick by gnibbler
# via http://stackoverflow.com/questions/11092511/python-list-of-unique-dictionaries
people = list({d['email']: d for d in people}.values())     # uniq by email

# just uniques
print(len(people))

# print as comma separated values for import into mailing list
for reader in people:
    print(reader['email'] + ", " + reader['name'])

# ciao!
m.close()
m.logout()
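
One limitation of the script above: parseaddr handles a single address, so a message sent to several people is not fully captured, and a name that contains a comma will break the comma separated output. Here is a minimal sketch of a variation that addresses both, assuming you want every recipient and a properly quoted CSV. The helper names collect_people and write_csv, the sample headers, and the choice to lowercase addresses when deduplicating are all made up for illustration; getaddresses and csv.writer come from the standard library.

```python
import csv
import io
from email.utils import getaddresses


def collect_people(header_values):
    """Return unique {"name", "email"} dicts from raw To/From header strings."""
    people = {}
    for name, addr in getaddresses(header_values):
        if not addr:                      # skip headers that failed to parse
            continue
        key = addr.lower()                # assumption: addresses compare case-insensitively
        # keep the first entry seen, unless it had no name and this one does
        if key not in people or (name and not people[key]["name"]):
            people[key] = {"name": name, "email": key}
    return list(people.values())


def write_csv(people, out):
    """Write email,name rows; csv.writer quotes names that contain commas."""
    writer = csv.writer(out)
    for person in people:
        writer.writerow([person["email"], person["name"]])


# made-up sample headers standing in for msg['to'] and msg['from']
headers = [
    "Alice <alice@example.com>, Bob <bob@example.com>",
    '"Smith, Carol" <carol@example.com>',
    "ALICE@example.com",
]
people = collect_people(headers)
buf = io.StringIO()
write_csv(people, buf)
print(buf.getvalue())
```

In the IMAP loop you would pass in the To and From headers of each message instead of the hard-coded sample list.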


A nice question for coding interviews

I was discussing mortgage calculations with a friend today and realized this calculation would make an excellent interview question.
The problem is simple enough, but still requires some thought…

Writeup: interest-rate-calculations-using-recursion (PDF)

Source: interest-rate-calculations-using-recursion.js

If there is extra time, the candidate can be asked to write a solve function that solves for the payment P given the other values, e.g., solve for P in Zr(25*12, 315000, 0.005, P) = 0.
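
The linked source isn't reproduced here, so what follows is only a sketch of what a candidate's answer might look like, written in Python rather than JavaScript. It assumes Zr(n, Z0, r, P) is the balance left after n monthly payments of P on a loan of Z0 at monthly rate r, defined recursively by Zr(n) = (1+r)·Zr(n-1) - P. The function names and the use of bisection are illustrative assumptions, not taken from the writeup.

```python
def balance(n, principal, rate, payment):
    """Balance owed after n payments, using the assumed recursive definition."""
    if n == 0:
        return principal
    return balance(n - 1, principal, rate, payment) * (1 + rate) - payment


def solve_payment(n, principal, rate, tol=1e-9):
    """Find the payment that brings the balance to zero after n periods (bisection)."""
    lo = 0.0
    hi = principal * (1 + rate)       # this payment clears the loan in one period
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if balance(n, principal, rate, mid) > 0:
            lo = mid                  # payment too small, some debt remains
        else:
            hi = mid                  # payment large enough (or too large)
    return (lo + hi) / 2


P = solve_payment(25 * 12, 315000, 0.005)
print("monthly payment: %.2f" % P)
print("balance after 300 payments: %.6f" % balance(25 * 12, 315000, 0.005, P))
```

A stronger candidate might notice that the recursion is a geometric series and derive the closed form P = Z0·r·(1+r)^n / ((1+r)^n - 1), which makes a handy check on the numerical answer.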

Getting started with ML in python

Next week I’m interviewing for a Data Scientist position, so I figured I’d better brush up on my machine learning skills. I found some neat YouTube tutorials [1,2] on using scikit-learn, so I thought this would be a good place to start.

From experience, I was expecting that setting up the dev-environment with numpy, scipy, ipython notebook, etc., would take me half a day (compiling and debugging things that don’t work out of the box), but I was pleasantly surprised when a few pip commands later I had a fully functional environment. I’ve pasted the sequence of commands below in case you want to learn yourself some ML too.

Create a virtualenv

The first part is to create an isolated virtualenv for the project. Think of this as “basic python hygiene”: we want to isolate the python libraries used to follow the tutorial from my system-wide python library. (For most people this is just “best practices,” but in my case my system-wide site-packages contains outdated versions and/or half-broken dependencies because of the dysfunctional relationship between fink, macports, and homebrew that plays out on my computer.) To set up a virtualenv in a given directory and activate it, proceed as follows:


$ cd ~/Projects/MLpractice
$ virtualenv pyML
$ . pyML/bin/activate # . is the same as source

Install prerequisites

Next we’ll install all the prerequisite packages and scikit-learn. Note that the command line starts with (pyML) which indicates that pip will install these packages in the pyML virtualenv and not system-wide.


(pyML)$ which python
(pyML)$ which pip

(pyML)$ pip install numpy
(pyML)$ pip install pyzmq
(pyML)$ pip install "ipython[all]"
(pyML)$ pip install scipy
(pyML)$ pip install pyparsing

$ brew update
$ brew install freetype
$ brew link --force freetype
$ brew install libpng
$ brew link --force libpng
$ brew install libagg
(pyML)$ pip install matplotlib
(pyML)$ pip install psutil

(pyML)$ pip install scikit-learn
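
Before moving on, it can be reassuring to confirm that the packages actually import from inside the virtualenv. Here is a small sketch that uses only the standard library. The helper name check_packages is made up for illustration; zmq and sklearn are the import names of the pyzmq and scikit-learn packages.

```python
import importlib
import sys


def check_packages(names):
    """Map each module name to its version string, or None if the import fails."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
        else:
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


if __name__ == "__main__":
    # should point inside pyML when the virtualenv is active
    print("python prefix:", sys.prefix)
    wanted = ["numpy", "scipy", "zmq", "IPython", "pyparsing",
              "matplotlib", "psutil", "sklearn"]
    for name, version in check_packages(wanted).items():
        print("%-12s %s" % (name, version if version else "NOT INSTALLED"))
```

Save it as a file and run it with the virtualenv's python; any line that says NOT INSTALLED points at the pip command to retry.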

Done

Now everything is set up and ready for us.
We can clone the repositories with the example code and start the ipython notebook as follows.


$ git clone git@github.com:jakevdp/sklearn_scipy2013.git
$ git clone git@github.com:ogrisel/parallel_ml_tutorial.git
(pyML)$ cd sklearn_scipy2013/notebooks/
(pyML)$ ipython notebook --pylab inline

Your default browser should open, showing you the IPython notebooks for the first tutorial.
Let the learning begin—both for machine and human alike!

No bullshit guide to linear algebra

I’m happy to announce the No bullshit guide to linear algebra (student edition) is ready: gum.co/noBSLA. The core chapters (the stuff that shows up on exams) are done. If you have a linear algebra exam coming up, we’ve got what you need.

For the price of a case of beer, you could have an understanding of linear algebra.

Now if you’re a cheapo like me, you’ll say “why the hell do I need to give you money, when there are free books out there?” I understand you. Perhaps you’d like this free tutorial: LA. See also MECH. I hope these short tutorials will convince you that synthesis of information (i.e. choosing the order of the concepts and an appropriate level of detail) is possible and desirable. Synthesis helps with understanding. If a subject can be summarized in just a few pages, then a full textbook on the subject shouldn’t be bigger than a couple hundred pages, including prerequisites. I call this “information distillation.”

The 1000pp+ textbooks are a scam. Don’t be duped. Get the No bullshit guide to linear algebra. It’s 1/10th the price, 1/2 the size, and 3 times better than a mainstream textbook.  In the news: [HN1], [HN2]. The price is 50% OFF until April 1st.

 

BTW, this is the second book in the “No bullshit” series. The No bullshit guide to math and physics is the first. It covers high school math, mechanics, differential calculus, and integral calculus in 383 pages. You should definitely check it out if you’re taking one of these classes.

Opportunity costs

Recently, after conversations with friends who work in industry, I’ve been questioning my “career strategy” of pursuing a textbook publishing startup. Generally speaking, the employability of a new graduate is at its peak at graduation. Industry accepts young CS graduates and tells them “Here is 70k, write this code for us” and after a few years they could be pulling in 130+k, which is prof-level income. Regardless of one’s future goals in life, a little injection of cash for a person in their thirties sounds like a good thing to have. In general, working is a good career move.

Using the language of economics, there are opportunity costs to doing the startup thing. First, there are the short-term financial losses of not having a San Francisco software developer salary right now. Second, and perhaps more importantly, I may be sabotaging my career options should I ever decide to go to industry. Recruiters will ask “what did you do for the past two years?” So doing the startup thing (i.e. not doing the corporate thing) has multiple opportunity costs.

Though such thoughts do turn around in my head, I remained and remain undeterred. I just realized why—this is the inspiration for this post. There are opportunity costs with the corporate career too. This knowledge that I have fresh in my mind after teaching undergraduate math and physics for the last ten years will soon be forgotten. Certainly after two years in industry, I would not remember half the things I can recall off the top of my head right now.

So this is why, now I know, I subconsciously chose this path. We godda do this now and we’ll code later, si besoin.

Aside: I just previewed the latest linear algebra draft and it looks awesome! I’ve been slogging through the corrections during the past couple of weeks (actually months!) and I was feeling low on energy, but now that I see how close we are to the finished product I’m getting all enthusiastic again.

Thoughts and strategy for scaling distribution

I just received news from York University that the book sold out and they need replenishment. The McGill bookstore already sold out twice and I had to replenish their supplies. So in-store sales are working. I’m counting this as validation. Now let’s scale things!

I’ll have to equip the website with an “order a box of 10” option and make a deal with a fulfillment centre so they will take care of the shipping for me. How do I get into WorldCat? Who will print the book in large quantities? Lightning Source?

I’ve been busy working so much on the Linear Algebra book and preparing exercises that I lost track of the business side of things. I’m going to finish up LA, because it is so close to being done, but I’m vetoing any work on the Electricity and Magnetism title—we’ll pick that up in October and it will be ready for January 2015.

Okay Ivansky. Put on the business hat and get things done!

A scriptable future for the Web and home servers

I’m organizing papers today, and I keep finding dev-notes and plans for my big “home server” idea about being able to run all your “cloud services” on your own hardware with all the data protection this entails. But what is easy to imagine can be difficult to bring to reality. There are a lot of technological aspects to figure out (dyndns, mail, www, filesharing, apps?), but there is also the general public’s lack of interest in privacy matters.

The freedom of computing and the Internet is a question that depends on technology but also on public relations. I recently came up with a plan for one possible way to get FOSS into homes. The PR pitch for each phase is given in parentheses.

  • Phase 0: Develop FOSS clones for most popular cloud software. [100% done]
  • Phase 1: Non-tech-savvy users learn to deploy “own server” in the cloud based on a FOSS software stack. [2015]
    (Run your own Google with just one click! Customize and automate everything. Don’t let anyone tell you what to do on the Internet.)
  • Phase 2: Non-tech-savvy users move their existing “own servers” to run on their “home server.” [2020]
    (The Internet is distributed; be the Internet. Who got ur logs? Protect your privacy and that of your family and friends. Political discussion is not a crime. Unlimited storage—just add USB drives to the RAID.)

I think the two-step process for the home server is much more likely, even realistic. Both phases involve transitions to better features. The transition to Phase 1 will be interesting for power users, but if everything is scripted, then even non-tech users could “run their own” thing. For it to happen, we need to get to “same thing as … but with more ….”  Only after we have a mature system of own apps can we then move to Phase 2 where we say: “same thing as own, but at home.”

I’m a big believer in humanity and our ability to learn, adapt, and advance, so I think we will be able to “domesticate” the power of computing as we previously domesticated fire and electricity.

Linear algebra tutorial in four pages

I just pushed an update to the Linear algebra explained in four pages tutorial.

Anyone who has an exam with lots of $A\vec{x}=\vec{b}$ stuff on it coming up should check it out because it covers: vector operations, matrix operations, linear transformations (matrix-vector product, fundamental vector spaces, matrix representation), solving systems of linear equations (the RREF stuff), matrix inverse, eigenvalues.
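
To give a flavour of “the RREF stuff,” here is a small sketch of Gauss-Jordan elimination in plain Python. It is not taken from the tutorial: the function name rref and the example system are chosen for illustration, and exact fractions are used to sidestep rounding errors.

```python
from fractions import Fraction


def rref(matrix):
    """Return the reduced row echelon form of a matrix given as a list of rows."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    n_rows, n_cols = len(rows), len(rows[0])
    pivot_row = 0
    for col in range(n_cols):
        if pivot_row == n_rows:
            break
        # find a row at or below pivot_row with a nonzero entry in this column
        pivot = next((r for r in range(pivot_row, n_rows) if rows[r][col] != 0), None)
        if pivot is None:
            continue                      # no pivot in this column
        rows[pivot_row], rows[pivot] = rows[pivot], rows[pivot_row]
        lead = rows[pivot_row][col]
        rows[pivot_row] = [x / lead for x in rows[pivot_row]]   # scale pivot to 1
        for r in range(n_rows):
            if r != pivot_row:
                factor = rows[r][col]     # clear the rest of the column
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[pivot_row])]
        pivot_row += 1
    return rows


# augmented matrix [A | b] for the system  x + 2y = 5,  3x + 9y = 21
for row in rref([[1, 2, 5], [3, 9, 21]]):
    print([str(x) for x in row])
# the last column of the result gives x = 1, y = 2
```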

UPDATE: I found another excellent tutorial which I think you should also read, especially if you are a visual person. A Geometric Review of Linear Algebra by Prof. Eero P. Simoncelli  (discuss on HN if you are a procrastinating person). If you’re a studious person, you’ll also go to en.wikibooks.org/wiki/Linear_Algebra and practice solving problems. To become a powerful person, don’t look at the solution until you’ve attempted the problem (with pen and paper) for at least five minutes.

UPDATE: I found some other useful short tutorials, like this short review of linear algebra from Stanford and another one from Boulder which both cover interesting details and bring a new perspective.

Open book writing and typo workflow

Open is better than closed because when you work in the open the whole world can help you (or at least the portion of the world that cares about what you are doing). For books in particular, readers can be tremendously helpful by submitting typo fixes to the book. But how can users submit typos? Surely there is something better than email…

Today I saw a very interesting workflow for reader contributions on the website of the Advanced R programming book by Hadley Wickham. The book is being developed on github using the Jekyll static site generator. Each page has an “Edit this” link on the right side:

 

The url for that button is:

https://github.com/hadley/adv-r/edit/master/index.rmd

Clicking on that takes you to github and a special prompt to create a fork:

Next you can make the change:

 

And finally the UI offers to create a pull request:

This is still a complicated process for the reader (three steps, one feature branch, one pull request), but from the author’s side this is awesome! You just write and then manage incoming pull requests that improve your content.

Anyone writing their blog posts in the open on github should consider adding the /edit/ links.
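
Generating these links takes one line per page. Here is a sketch, assuming the URL pattern seen in the Advanced R link (https://github.com/USER/REPO/edit/BRANCH/PATH) holds for any repository; the function name edit_url is made up for illustration.

```python
from urllib.parse import quote


def edit_url(user, repo, path, branch="master"):
    """Build the github "edit this file" URL for a file in a repository."""
    return "https://github.com/%s/%s/edit/%s/%s" % (
        user, repo, branch, quote(path.lstrip("/")))


# reproduces the link from the Advanced R site
print(edit_url("hadley", "adv-r", "index.rmd"))
```

A static site generator template could call something like this with each page's source path to render the “Edit this” link.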

 

Weekend project: find a way to automate this workflow process so readers don’t need to have github accounts. Maybe I could create a “shared” github account “ivans-readers,” allow for login-less-editing to happen on my own server and then see the pull requests coming from ivans-readers on the main repo.