But what about learners who are not familiar with Python? Should we abandon non-tech learners and say they can’t learn statistics because they don’t know how to use Python? Naaaah, we ain’t having none of that! Instead, my plan is to bring non-technical learners up to speed on Python by teaching them the Python basics that they need to use for statistics. Anyone can learn Python, it’s really not a big deal. I hope to convince you of this fact in this blog post, which is intended as a Python crash-course for the absolute beginner.
I know the prospect of learning Python might seem like a daunting task, but I assure you that it is totally worth it because Python will provide you with all the tools you need to handle the mathematical and procedural complexity inherent to statistics. Using Python will make your statistics learning journey much more intuitive (because of the interactivity) and practical (because of the useful tools). All the math equations we need to describe statistical concepts will become a lot easier to understand once you can express math concepts as Python code you can “play” with. Basically, what I’m saying is that if you accept Python in your heart, you’ll benefit from much reduced suffering on your journey of learning statistics.
This blog post is the first part in a three-part series about why learning PYTHON+STATS is the most efficient and painless way to learn statistics. In this post (PART 1 of 3), we’ll introduce the JupyterLab computational platform, show basic examples of Python commands, and introduce some useful libraries for doing data visualizations (Seaborn) and data manipulations (Pandas). The key point I want to convince you of is that you don’t need to become a programmer to benefit from all the power of the Python ecosystem; you just need to learn how to use Python as a calculator.
One of the great things about learning Python is the interactivity. In an interactive coding environment, you can write some commands, run the commands, then immediately see the outputs. The immediate link between Python commands and their results makes it easy to try different inputs and make quick changes to the code based on the results you get.
A Jupyter notebook is a complete statistical computing environment that allows you to run Python code interactively, generate plots, run simulations, and do probability calculations, all in the same place. All the code examples in this blog post were copy-pasted from the notebook shown in Figure 1.
Figure 1: Screenshot of a Jupyter notebook running in JupyterLab Desktop application under macOS.
I’ve highlighted a few regions in the screenshot so that I can draw your attention to them and explain the “stack” of software that produced this screenshot. Region (1) is JupyterLab Desktop, which is an application you can download and install locally to run JupyterLab on your computer; region (2) is the JupyterLab software we use for running Jupyter notebooks; region (3) is a particular notebook called python_for_stats.ipynb (a document) that is currently opened for editing interactively; region (4) is an individual code cell that is part of the notebook.
Each cell in a notebook acts like an “input prompt” for the Python calculator. For example, the code cell (4) in the screenshot shows the input expression 2.1 + 3.4 and the output of this expression, 5.5, displayed below it. This is how all code cells work in a notebook: you type in some Python commands in the code cell, then press the play button (see (5) in the screenshot) in the menu bar or use the keyboard shortcut SHIFT+ENTER, then you’ll see the result of your Python commands shown on the next line.
Since Python is a popular programming language used in many domains (education, software, data science, machine learning, etc.), people have developed multiple options for running Python code. You can run the code in the notebook python_for_stats.ipynb interactively by installing JupyterLab Desktop on your computer, by opening the notebook in the cloud via the JupyterLab binder link or the colab link, or by enabling the Live Code feature on the web version of this post.
I invite you to use one of these methods to open the notebook python_for_stats.ipynb and try running the code examples interactively while reading the rest of the blog post.
When learning a new skill, you’ll make a lot of mistakes. If you receive instant feedback about your actions, this makes it much easier to learn the new skill. Imagine having a private tutor looking over your shoulder, and telling you when you make mistakes. That would be nice, right?
Running Python interactively in a notebook environment offers precisely this experience. Whenever you input some Python command that contains an error, Python will tell you right away by printing an “error message.” The error message tells you what problem occurred and where it occurred (highlighted in yellow). For example, if you try to divide a number by zero, Python will respond with a big red message that tells you this is not possible to do. Like, mathematically impossible. What does dividing a quantity into 0 equal parts mean? If you try 3/0
on a calculator, it will also complain and show you an error message.
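Here is roughly what this looks like in a notebook (the full red traceback Python prints is longer, but the last line is the part that matters):

3/0
ZeroDivisionError: division by zero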
Reading a Python error message gives you hints about how to fix the input so it doesn’t cause an error. In this case there is no fix—just don’t be dividing by zero! This was just a simple example I used to show what an error looks like. Yes I know the red background is very intimidating, as if chastising you for having done something wrong! Once you get used to the “panic” coloring though, you’ll start to appreciate the helpful information that is provided in these error messages. This is just your friendly Python tutor telling you something in your input needs a fix.
Receiving error messages whenever you make mistakes is one of the main reasons why learning Python is so easy. You don’t need to read a lengthy book about Python programming, you can get started by opening a Jupyter notebook and trying some commands to see what happens. The interactive feedback of notebook environments helps many people get into a learning loop that motivates them to learn more and more.
Python is the #1 most popular language according to the TIOBE programming language popularity index. Did this happen by chance, or is there something special about Python? Why do people like Python so much? Let’s find out.
Python is like the best calculator ever! You can use Python for any arithmetic calculation you might want to do. Each cell in a Jupyter notebook is like the input prompt of a calculator, so if you wanted to add two numbers, you would do it like this:
In[1]: 2.1 + 3.4
Out[1]: 5.5
All the sections in fixed-width font in this blog post contain Python code. Don’t worry if you haven’t seen code before, I’ll make sure to explain what each code block does in the surrounding text so you’ll know what’s going on. In the above code example, the first line In[1]
shows the input expression we asked Python to compute, and the line Out[1]
shows the output of this expression—the result that Python computed for us. This Python code example is equivalent to entering the number 2.1
on a calculator, followed by pressing the +
button, then entering 3.4
, and finally pressing the =
button. Just like a calculator, Python computed the value of the input expression 2.1+3.4
and printed the output 5.5
on the next line.
Instead of working with numbers directly, we can store numbers into variables and specify the arithmetic operation in terms of these variables.
In[2]: num1 = 2.1
In[3]: num2 = 3.4
In[4]: num1 + num2
Out[4]: 5.5
The effect of the input line In[2]
is to store the number 2.1
into the new variable named num1
. This is called an assignment statement in Python, and it is read from right-to-left: “store 2.1
into num1
”. Assignment statements don’t have any outputs, so this is why there is no Out[2]
to show. The input In[3]
stores 3.4
into the variable num2
, and again there is no output to display, since this is another assignment statement. The input In[4]
asks Python to compute the value of the expression num1 + num2
, that is, the sum of the variables num1
and num2
. The result is displayed on the line Out[4]
.
Note the computation on line In[4]
makes use of the variables num1
and num2
that were defined on the previous lines In[2]
and In[3]
. Python is a calculator that remembers previous commands. When Python evaluates the input line In[n]
, the evaluation is done in the context that includes information from all previous inputs: In[1]
, In[2]
, …, In[n-1]
. This is a key paradigm you need to get used to when reading Python code. Every line of Python code runs in a “context” that includes all the commands that preceded it.
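For example, if you refer to a name that was never defined in the current context (say a hypothetical variable num3), the lookup fails and Python shows an error message:

num1 + num3
NameError: name 'num3' is not defined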
I know this sounds complicated, and you might be wondering why I’m inflicting this knowledge upon you in such an early stage of this introduction to Python. You have to trust me on this one—you need to know how to “run” a line of Python code and simulate in your head what it’s going to do. Let’s look at the input line In[4]
as an example. I’ll narrate what’s going on from Python’s point of view:
I’ve been asked to evaluate the expression num1 + num2, which is the sum of two variables. The first thing I’m going to do is look up what the name num1 corresponds to in the current context, and I’ll see that num1 is a variable we defined earlier when we stored the value 2.1 into it. Similarly, I will look up the name num2 and find it has the value 3.4 because of the assignment statement we made on line In[3]. After these two lookups, I know that num1 + num2 refers to the sum of two numbers, and I’ll compute this sum, showing the result on the output line Out[4].
Does this make sense? Can you put yourself in Python’s shoes and see how Python interprets the commands it received and runs them?
Let’s look at another code example that shows how we can compute the average of the two numbers num1
and num2
. The average of two numbers is the sum of these numbers divided by two:
In[5]: (num1 + num2) / 2
Out[5]: 2.75
Using math notation, the average of the numbers $x_1$ and $x_2$ is expressed as $(x_1 + x_2)/2$, which is very similar. Note we used the parentheses (
and )
to tell Python to compute the sum of the numbers first, before dividing by two. I’m showing this example to illustrate the fact that the syntax of the Python expression (num1+num2)/2
is very similar to standard math notation $(x_1 + x_2)/2$. If you know the math notation for plus, minus, times, divide, exponent, parentheses, etc., then you already know how the Python operators +, -, *, /, ** (exponent), (, and ) work.
Remember that the best way to learn Python is to “play” with Python interactively. If you’re a Python beginner, I highly recommend you follow along with all the examples in this blog post in an interactive prompt where you can type in commands and see what comes out. As a reminder, you can run this notebook by clicking on the JupyterLab binder link, the colab link, or by enabling the Live Code feature from the rocket menu on the web version.
In order to make it easy for you to copy-paste code blocks from this blog post into a notebook environment, from now on, I won’t show the input indicators In[n]:
and instead show only the input commands. This will allow you to copy-paste commands directly into a Python prompt (Live Code webpage or notebook) and run them for yourself to see what happens. Here is a repeat of the input-output code from the previous code block using the new format.
(num1 + num2) / 2
# 2.75
The input on the first line is the same as the input line In[5]
we saw above. The second line shows the output prefixed with the #
-character (a.k.a. hash-sign), which is used to denote comments in Python. This means the line # 2.75
will be ignored by the Python input parser. The text # 2.75
is only there for reference—it tells human readers what the expected output is supposed to be. Try copy-pasting these lines into a code cell inside your choice of a Python interactive coding environment. Run the code cell to check that you get the expected output 2.75 as indicated in the comment.
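While we’re at it, we can use the same input-plus-comment format to check the other arithmetic operators mentioned earlier. Here is a small cell you can paste into a notebook; run each line and compare against the expected output shown in the comment:

7 + 2        # 9
7 - 2        # 5
7 * 2        # 14
7 / 2        # 3.5
7 ** 2       # 49 (7 to the power of 2)
(7 + 2) / 2  # 4.5 (parentheses force the sum to happen first)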
Python is a “high level” programming language, which means it provides convenient data structures and functions for many common calculations. For example, suppose you want to compute the average grade from a list that contains the grades of the students in a class. We’ll assume the class has just four students, to keep things simple.
The math expression for computing the average grade for the list of grades $[g_1,g_2,g_3,g_4]$ is $(g_1 + g_2 + g_3 + g_4) / 4 = (\sum_{i=1}^{i=4} g_i ) / 4$. The symbol $\Sigma$ is the Greek letter Sigma, which we use to denote summations. The expression $\sum_{i=1}^{i=4} g_i$ corresponds to the sum of the variables $g_i$ between $i=1$ and $i=4$. This is just fancy math notation for describing the sum $g_1 + g_2 + g_3 + g_4$.
The code example below defines the list grades
that contains four numbers, then uses the Python builtin functions sum
and len
(length) to compute the average of these numbers (the average grade for this group of four students).
grades = [80, 90, 70, 60]
avg = sum(grades) / len(grades)
avg
# 75.0
The first line of the above code block defines a list of four numbers [80,90,70,60]
and stores this list of values in the variable named grades
. The second line is an assignment statement, where the right hand side computes the sum of the grades divided by the length of the list, which is the formula for computing the average. The result of the sum-divided-by-length calculation is stored into a new variable called avg
(short for average). Finally, the third line prints the contents of the variable avg
to the screen.
Dear readers, I know all of these descriptions may feel like too much information (TMI), but I assure you learning how to use Python as a calculator is totally worth it. It’s really not that bad, and you’re already ahead of the game. If you were able to follow what’s going on in the above code examples, then you already know one of the most complicated parts of Python syntax: the assignment statement.
Python reading exercise: revisit all the lines of code above where the symbol =
appears, and read them out loud, remembering you must read them from right-to-left: “compute <right hand side> then store the result into <left hand side>”. That’s what the assignment statement does: it evaluates some expression (whatever appears to the right of the =) and stores the result into the place described by the left-hand side (usually a variable name).
There are two more Python concepts I would like to introduce you to: for-loops and functions. After that we’re done, I promise!
The for-loop is a programming concept that allows us to repeat some operation or calculation multiple times. The most common use case of a for
-loop is to perform some calculation for each element in a list. For example, we can use a for
loop to compute the average of the numbers in the list grades = [80,90,70,60]
. Normally, we would compute the average using the expression avg = sum(grades)/len(grades)
, but suppose we don’t have access to the function sum
for some reason, and so we must compute the sum using a for-loop.
We want to compute the sum 80 + 90 + 70 + 60, which we can also write as 0 + 80 + 90 + 70 + 60, since starting with 0 doesn’t change anything. We can break up this summation into a four-step process, where we add the individual numbers one by one to a running total, as illustrated in Figure 2.
Figure 2: Step-by-step calculation of the sum of four numbers. In each step we add one new number to the current total. This procedure takes four steps: one step for each number in the list.
Note the operation we perform in each step is the same—we add the current grade to the partial sum of the grades from the previous step.
Here is the Python code that computes the average of the grades in the list grades
using a for
-loop:
total = 0
for grade in grades:
    total = total + grade
avg = total / len(grades)
avg
# 75.0
The first line defines the variable named total
(initially containing 0
), which we’ll use to store the intermediate values of the sum at different steps. Next, the for loop tells Python to go through the list grades
and for each grade
in the list, perform the command total = total + grade
. Note this statement is indented relative to the other code to indicate it is “inside” the for loop, which means it will be repeated for each element in the list. On the next line we divide the total by the length of the list to obtain the average, which we then display on the fifth line.
A for-loop describes a certain action or actions that you want Python to repeat. In this case, we have a list of grades $[g_1, g_2, g_3, g_4]$, and we want Python to repeat the action “add $g_i$ to the current total” four times, once for each number $g_i$ in the list. The variable total
takes on different values while the code runs. Before the start of the for-loop, the variable total contains 0, which we’ll refer to as total0
. Then Python starts the for-loop and runs the code block total = total + grade
four times, once for each grade in the list grades, which has the effect of changing the value stored in the variable total four times. We’ll refer to the different values of the variable total as total1, total2, total3, and total4. See Figure 3 for the intermediate values of the variable total during the four steps of the for loop.
Figure 3: Mathematical description and computational description of the procedure for computing the average in four steps. The label totalk
describes the value stored in the variable total
after the code inside the for loop has run k
times.
The key new idea here is that the same line of code total = total + grade
corresponds to four different calculations. Visit this page to see a demonstration of how the variable total
changes during the four steps of the for loop. At the end of the for loop, the variable total
contains 300 which is the sum of the grades, and calculating total/len(grades)
gives us the average grade, which gives the same result as the expression sum(grades)/len(grades)
that we used earlier.
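You can also watch the variable total change without leaving your notebook: just add a print call inside the loop. This is the same for-loop as above with one extra line, and the printed values match the four steps shown in Figure 3:

total = 0
for grade in grades:
    total = total + grade
    print("added", grade, "total is now", total)
# added 80 total is now 80
# added 90 total is now 170
# added 70 total is now 240
# added 60 total is now 300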
If this is the first time you’re seeing a for-loop in your life, the syntax might look a little weird, but don’t worry you’ll get used to it. I want you to know about for-loops because we can use for-loops to generate random samples and gain hands-on experience with statistics concepts like sampling distributions, confidence intervals, and hypothesis testing. We’ll talk about these in PART 2 of the series.
Python is a useful calculator because of the numerous functions it comes with, which are like different calculator buttons you can press to perform computations.
We’ve already seen the function sum
that computes the sum of a list of elements. This is one of the builtin Python functions that are available for you to use. You can think of sum
as a calculator button that works on lists of numbers. Readers who are familiar with spreadsheet software might have already seen the spreadsheet function SUM(<RANGE>), which computes the sum for a <RANGE> of cells in the spreadsheet.
Here are some other examples of Python built-in functions:
- len: computes the length of a list
- range(0,n): creates a list of numbers [0, 1, 2, 3, ..., n-1]
- min/max: finds the smallest/largest value in a list
- print: prints some value to the screen
- help: shows helpful documentation

Learning Python is all about getting to know the functions that are available to you.
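Here is a small sandbox cell you can paste into a notebook to try some of these builtin functions on the list of grades we defined earlier:

grades = [80, 90, 70, 60]   # same list of grades as before
len(grades)                 # 4
min(grades)                 # 60
max(grades)                 # 90
sum(grades)                 # 300
list(range(0, 4))           # [0, 1, 2, 3]
print("grades =", grades)   # prints: grades = [80, 90, 70, 60]
# help(sum)                 # uncomment to see the documentation for sum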
Python also makes it easy to define new functions, which is like adding “custom” buttons to the calculator. For example, suppose we want to define a new button mean
for computing the mean (average) of a list of values. Recall the math notation for calculating the average value for the list $[x_1, x_2, x_3, \ldots, x_n]$ is $(\sum_{i=1}^{i=n} x_i)/n$.
To define a Python function mean
we can use the following code:
def mean(values):
    total = 0
    for value in values:
        total = total + value
    avg = total / len(values)
    return avg
Let’s go through this code example line by line. To define a new Python function, we start with the keyword def
followed by the function’s name, we then specify what variables the function expects as input in parentheses. In this case the function mean
takes a single input called values
. We end the line with the symbol :
, which tells us the “body” of the function is about to start. The function body is an indented code block that specifies all the calculations that the function is supposed to perform. The last line in the function body is a return
statement that tells us the output of the function. Note we’ve seen the calculations in the body of the function previously, when we computed the average of the list of grades grades = [80,90,70,60]
using a for
-loop.
The function mean
encapsulates the steps of the procedure for calculating the mean and makes it available to us as a “button” we can press whenever we need to compute the mean of some list of numbers.
To call a Python function, we write its name followed by the input argument specified in parentheses. To call the function mean
on the list of values grades
, we use the following code:
mean(grades)
# 75.0
When Python sees the expression mean(grades)
it recognizes it as a function call. It then looks for the definition of the function mean and finds the def
statement we showed in the previous code block, which tells us what steps need to be performed on the input.
The function mean
can compute the average for any list of values, but when we call mean(grades)
we are supplying the specific list grades
as the input values
. In other words, calling a function mean
with the input grades
is equivalent to the assignment statement values = grades
, then running the code instructions in the body of the function. The return statement on the last line of the function’s body tells us the function output value is avg
. After the function call is done, the expression mean(grades) gets replaced with the return value of the function, which is the number 75.0. Click here to see the visualization of how this code runs.
Calling a function is like delegating some work to a contractor. You want to know the mean of the numbers in the list grades, so you give this task to the function mean
, which does the calculation for you, and returns to you the final value.
Okay that’s it! If you were able to understand the Python code examples above, then you already know the three Python syntax primitives that you need for 90% of statistics calculations. If you want to learn more about Python, take a look at the Python tutorial (Appendix C), which is a beginner-friendly introduction to Python syntax, data types, and builtin functions.
But wait there’s more…. I haven’t even shown you the best parts yet!
Python is a popular language because of the numerous functions it comes with. You can “import” various Python modules and functions to do specialized calculations like numerical computing (NumPy), scientific computing (SciPy), data management (Pandas), data visualization (Seaborn), and statistical modelling (statsmodels).
We’ll now briefly mention a few of the Python libraries that are useful when learning statistics:
- scipy.stats contains all the probability models we use in statistics. We’ll show some examples of probability models in PART 2.
- SymPy functions like solve, simplify, expand, and factor can do the math for you. See the tutorial for more info and try some commands in the SymPy live shell.

So yes, Python is just like a calculator, but because it comes with all the above libraries, it’s like the best calculator ever! We’ll now show some examples that highlight the usefulness of the Seaborn library for data visualizations.
The first part of any statistical analysis is to look at the data. Descriptive statistics is a general term we use to refer to all the visualizations and calculations we do based on a sample in order to describe its properties. Statistical plots give us visual summaries for data samples. Examples of such plots include scatter plots, bar plots, box plots, etc. We can easily create all these visual summaries using the Python library Seaborn, which makes statistical plotting very straightforward.
Let’s look at some examples of data visualizations based on a sample of prices, which we will represent as a list of nine values (prices collected at different locations).
prices = [11.8, 10, 11, 8.6, 8.3, 9.4, 8, 6.8, 8.5]
Next we’ll import the module seaborn
under the alias sns
.
import seaborn as sns
This is called an “import statement” in Python, and its effect is to make all the functions in the seaborn module available by calling them as sns.<functionname>
. We use the alias sns
instead of the full module name seaborn
to save a few keystrokes.
Statistical plots can be generated by specifying prices
as the x-input to one of the seaborn plot-functions. We can use sns.stripplot
to generate a strip plot, sns.histplot
to generate a histogram, or sns.boxplot
to generate a box-plot, as shown below:
sns.stripplot(x=prices, jitter=0)   # See Figure 4 (a)
sns.histplot(x=prices)              # See Figure 4 (b)
sns.boxplot(x=prices)               # See Figure 4 (c)
Figure 4. Three common types of statistics visualizations.
Figure 4 (a) shows the “raw data,” with each value in the prices list represented as a separate point, whose $x$-location within the graph corresponds to the price. The histogram in Figure 4 (b) shows what happens when we count the number of observations in separate “bins.” The higher the bar, the more observations fall in that bin. Finally, the box-plot in Figure 4 (c) shows the location of the quartiles of the prices
data (to be defined shortly).
Data visualization is an essential part of any statistical analysis, and the Seaborn library is a best-in-class tool for doing statistical visualizations: box plots, histograms, scatter plots, violinplots, etc. Anything you might want to do, there is probably a plot function for it in Seaborn. As a first step to learning Seaborn, you can check out the Seaborn tutorial (Appendix E), which explains the essential Seaborn plotting functions that we use in the book.
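For example, if you want to try one of the other plot types mentioned above on the same data, a violin plot of the prices is a single function call away (this assumes the prices list and the sns alias defined earlier in this post are still available in your notebook):

sns.violinplot(x=prices)    # violin plot of the prices data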
Data is the fuel for statistics. All statistical calculations start with data collection and processing. The Python library Pandas provides lots of helpful functions for data manipulations. Understanding what Pandas can do is best done through examples, so we’ll now show some real-world code examples of using Pandas. We don’t have time to explain all the details, so we’ll use the show-don’t-tell approach in this section.
We’ll use the dataset epriceswide.csv, which contains 18 electricity prices collected from charging stations in the East End and the West End of a city (nine prices from each end). Figure 5 shows the data as it appears when viewed online, when opened with a simple text editor, and when viewed using spreadsheet software like LibreOffice Calc.
Figure 5: The dataset epriceswide.csv viewed online, in a text editor, and in spreadsheet software.
We call this data format CSV, which is short for comma-separated values. CSV files are very common in both industry and academia, since they are a very simple data format that can be opened with many software tools.
We can use the function read_csv
from the Pandas module to load the data from a CSV file, which can be a local file or a URL (an internet location). Pandas provides other functions like read_excel
and read_sql_table
for loading data from other data sources.
Let’s look at the commands we need to load the data file epriceswide.csv from a web URL.
import pandas as pd
epriceswide = pd.read_csv("https://nobsstats.com/datasets/epriceswide.csv")
epriceswide
#    East  West
# 0   7.7  11.8
# 1   5.9  10.0
# 2   7.0  11.0
# 3   4.8   8.6
# 4   6.3   8.3
# 5   6.3   9.4
# 6   5.5   8.0
# 7   5.4   6.8
# 8   6.5   8.5
The first line import pandas as pd
imported the pandas library under the alias pd
. This is similar to how we imported seaborn
as sns
. The shorter alias will save us some typing. Next we call the function pd.read_csv
to load the data, and store the result in a variable called epriceswide
, which we then print on the next line. The variable epriceswide
contains a Pandas data frame, which is similar to a spreadsheet (a table of values whose rows are numbered and whose columns have names).
The data frame epriceswide
contains two columns of prices from the West End and the East End of a city.
I’m interested only in the prices in the West End of the city (the second column). To extract the prices from the West End, we use the following Pandas code.
pricesW = epriceswide["West"]
pricesW
# 0    11.8
# 1    10.0
# 2    11.0
# 3     8.6
# 4     8.3
# 5     9.4
# 6     8.0
# 7     6.8
# 8     8.5
The square brackets syntax has the effect of selecting the data from one of the columns. The variable pricesW
corresponds to a Pandas series object, which is similar to a list, but has many useful functions “attached” to it.
We can easily compute summary statistics like the mean, the median, the variance, the standard deviation, etc., from any Pandas data frame or series by calling a function of the same name: mean
, median
, var
, std
, etc. These numerical summaries are called descriptive statistics, and they are very useful for understanding the characteristics of the dataset.
For example, to see the number of observations in the series pricesW
(the sample size), we use the function .count()
.
pricesW.count()
# 9
Note the syntax is pricesW.count()
and not count(pricesW)
. This is because the function count()
is “attached” to the pricesW
object. Functions attached to objects are called methods. The syntax for calling a method is a little weird, obj.fun()
, but the effect is the same as fun(obj)
.
To compute the sample mean $\overline{\mathbf{x}} = \frac{1}{n} \sum_{i=1}^{n} x_i$, we call the .mean()
method on pricesW
.
pricesW.mean()
# 9.155555555555557
To compute the median (50th percentile), we call the .median()
method:
pricesW.median()
# 8.6
To compute the sample standard deviation $s_{\mathbf{x}} = \sqrt{ \frac{1}{n-1}\sum\nolimits_{i=1}^n \left(x_i - \overline{\mathbf{x}} \right)^2 }$, we call the .std()
method:
pricesW.std()
# 1.5621388471508475
Note the standard deviation is a complicated math formula involving subtraction, squaring, summation, divisions, and square root, but we didn’t have to do any of these math operations, since the predefined Pandas function std
did the standard deviation calculation for us. Thanks Pandas!
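If you’re curious to double-check what .std() computed, here is a minimal sketch that redoes the sample standard deviation calculation step by step, using only the basic Python operations we saw earlier. The result should match the value above, up to floating-point rounding in the last digits:

n = pricesW.count()                                # number of observations, 9
xbar = pricesW.mean()                              # sample mean, 9.1555...
sum_sq_devs = sum((x - xbar)**2 for x in pricesW)  # sum of squared deviations from the mean
sx = (sum_sq_devs / (n - 1)) ** 0.5                # divide by n-1, then take the square root
sx
# 1.5621388471508...  (same value as pricesW.std())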
The method .describe()
computes all numerical descriptive statistics for the sample in one shot:
pricesW.describe()
# count     9.000000
# mean      9.155556
# std       1.562139
# min       6.800000
# 25%       8.300000
# 50%       8.600000
# 75%      10.000000
# max      11.800000
This result includes the count, the mean, the standard deviation, as well as a “five-number summary” of the data, which consists of the minimum value (0th percentile), the first quartile (25th percentile), the median (50th percentile), the third quartile (75th percentile), and the maximum value (100th percentile).
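If you need just one of these percentiles, you can also compute it directly using the Pandas .quantile() method, where the percentile is specified as a fraction between 0 and 1:

pricesW.quantile(0.25)    # 8.3   (first quartile)
pricesW.quantile(0.5)     # 8.6   (median)
pricesW.quantile(0.75)    # 10.0  (third quartile)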
Another important class of transformations is data cleaning: making sure data values are expressed in a standardized format, correcting input errors, and otherwise restructuring the data to make it suitable for analysis. Using various filtering and selection functions within Pandas allows you to do most data cleaning operations without too much trouble.
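We won’t do a full data cleaning walkthrough here, but as a small taste of the filtering and selection operations Pandas offers, here is a minimal sketch (using the epriceswide data frame we loaded above) that keeps only the rows where the West End price is greater than 9:

epriceswide[epriceswide["West"] > 9]
#    East  West
# 0   7.7  11.8
# 1   5.9  10.0
# 2   7.0  11.0
# 5   6.3   9.4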
Data loading, data transformation, and data cleaning with Pandas are useful skills to develop if you want to apply statistics to real-world situations. To learn more about Pandas, I’ll refer you to the notebook for Section 1.3, which is a condensed crash course on Pandas methods for descriptive statistics, and the Pandas tutorial (Appendix D), which contains info about data manipulation techniques.
After all this talk about learning Python, I want to remind you that you don’t need to become a Python programmer! As long as you know how to use Python as a calculator, then you’re in business. Specifically, here are the skills I believe everyone can pick up:
Even if you’re an absolute beginner in Python and don’t want to learn Python, you can still benefit from the code examples in the notebooks as a spectator, just running the code and seeing what happens. You’ll also be able to solve some of the exercises by simply copy-pasting commands.
Once you gain some experience with basic Python commands and learn the common functions of the Pandas and Seaborn libraries, you’ll be able to apply these skills to load new datasets and generate your own visualizations. Learning Pandas and Seaborn will allow you to tackle real-world datasets, not just the ones prepared for this book.
Your knowledge of basic Python will be very helpful for understanding probability and statistics. Here are some examples of activities that will be available to you:
- exploring scipy.stats probability models (these are the LEGOs of the XXIst century)

We’ll talk more about these in PART 2 of the blog post series.
I want to emphasize that all these calculations are accessible in “spectator mode” where you run the notebooks to see what happens, without any expectation of writing new code. As you gain more Python experience over time, you’ll be able to modify the code examples and adapt them to solving new problems.
I hope the code examples in this blog post convinced you that learning Python is not that complicated. I also hope you take my word for it that knowing how to use Python as a calculator will be very helpful when learning statistics.
In the next post (PART 2 of 3) we’ll talk about specific probability and statistics concepts that become easier to understand when you can “play” with them using the Python calculator. Indeed, Python makes working with fancy, complicated, mathy topics like random variables and their probability distributions as easy as playing with LEGOs. We can also use Python to run probability simulations, which are very helpful for understanding difficult concepts like sampling distributions.
In the final post (PART 3 of 3) I’ll describe how the PYTHON+STATS curriculum is packaged and delivered in the upcoming textbook No Bullshit Guide to Statistics. I’ve been working on this book for the past 5+ years, continuously optimizing the structure of the book to make it accessible for the average reader. I want every adult learner who wants to learn statistics to have a concise guide to the material, including explanations for how things work “under the hood” in terms of probability calculations.
Here are some suggestions for further reading:
This question is bound to come up, so I’ll go ahead and answer it preemptively. The R ecosystem (R, RStudio, RMarkdown, Quarto, CRAN packages, tidyverse, etc.) is also a very good platform for learning statistics. Some might even say R is better than Python, since it is a programming language specially designed for statistics. Certainly, if your goal is to learn advanced statistical techniques, then learning R would be a good choice given the numerous packages available in R. For learning basic statistics, however, R and Python are equally good.
I chose Python for the book because it is an easy language to learn, with minimal syntax rules, and good naming conventions. Python is a general purpose language, which means the Python skills you develop for statistical calculations can be used for all kinds of other tasks beyond statistics including, web programming, data science, and machine learning. Learning Python will make you part of a community of millions of Pythonistas (people who use Python), which makes it very likely you’ll be able to find help whenever you get stuck on a problem, and find lots of learning resources for beginners.
All this being said, I still think R is very useful to know, and I plan to release R notebooks to accompany the book at some point. A multi-language approach would be great for learning, since you see the same concept explained twice, which is good for understanding the underlying statistical principles.
I’ll summarize the results of the survey below (140+ respondents) and comment on some of the readers’ suggestions and advice. The survey is still open in case you want to add your feedback, or feel free to send me an email directly. My email is ivan at this domain.
The goal of this survey was to clarify three product aspects that I was uncertain about. Worded as direct questions to the reader, these are:
Getting the product wrong along any of these three dimensions will be a big problem, and since I don’t like guessing, the survey seemed like a good idea. Luckily readers were very generous with their time: more than 140 readers responded to the survey. Read all about their responses below.
Let’s start by looking at the uncertainty-reducing multiple choice questions first. I’ll show the percentage breakdown for the answers to each question, and comment on the implications for the book.
The top response confirms my assumption about why readers want to learn statistics: 65% for work. That’s great news, because the focus I’ve chosen for the book is on STATS 101 topics that are most relevant for real-world, practical data analysis scenarios. The fact that 50% of people want to learn stats for fun was quite surprising. People want to learn stats for fun? Seriously?! I guess I had underestimated people’s baseline curiosity. This is good news… I’ll keep the utilitarian angle on the whole thing (stats is useful), but I now feel more comfortable talking about the “inherent” benefits of learning statistics too (e.g. knowledge buzz). I’ll do my best to cover the statistical procedures needed for publishing papers to please the “stats for research” people (38%). The fact that “for school” clocks in at only 9% is due to the sampling bias of this survey: most readers I keep in touch with through the mailing lists are adult learners (i.e. not students). Even if the “for school” answer is small, I still plan to cover all the STATS 101 topics because undergrads are an important audience for the book.
The leading response (55%) to focus on practical statistics is consistent with people wanting to learn statistics for work. My choice to focus on permutation tests and bootstrap estimation will help with this, since these techniques are more widely applicable than analytical approximations. The second-place response (32%) in favour of “full details” coverage of STATS 101 topics is also encouraging, since I plan to include all the standard topics (t-test and other analytical approximations). I’m happy to see there is little traction (10% combined) for an abridged or watered-down curriculum.
The majority of people (52%) want to see a full probability theory course in the book, which is what I was planning to do. Coming in second place with 20% is the review-without-details option, and the use-the-computer option is at 18%. Very few people want to simplify (4%) or skip (2%) the probability theory math.
I’m very relieved to see that a majority of respondents are OK with code (combined 73% expert and intermediate). We have to be mindful of the bias of the survey toward a technical audience, but overall I’m interpreting this as a green light to continue with the current approach that includes code examples as an additional mode of explanation for probability concepts and statistical procedures.
The third and fourth place responses are 8% beginners and 8% willing to learn, which is something I can work with, since the most advanced coding concept in the book will be a for loop. I’m optimistic that a short intro-to-Python tutorial in the appendix can bring the average “office worker” into the realm of basic Python proficiency, so they can at least read and run Python code, if not write it.
Very few (4%) are against the idea of coding in general, although in the free-form responses, several readers (5+) said they are experienced coders that would still prefer fewer code examples. One reader said “I’m an expert coder, but prefer to understand the math.” I’m fully in agreement: this is not going to be a “coding book,” the code examples are there to support the understanding. In other words, I’m going for a code-in-addition-to-math approach, and not code-instead-of-math approach.
I got a lot of very thoughtful and detailed responses to this open-ended question. In order to better process all this feedback, I grouped the responses into themes, extracted the most useful quotes, and highlighted in bold some key phrases. The arrow symbol → denotes my replies to certain suggestions.
Several people described the problems they see with current statistics textbooks and the courses they took. Essentially, you can take a stats course, but that doesn’t guarantee you’ll understand anything about stats by the end of it:
These observations resonate with my own experience learning statistics as a student. I got through the course for sure, but I had no idea about why the procedures worked, and had very little practical data munging skills.
Respondents were unanimous in their desire for a new book that actually explains concepts deeply and avoids the problems with traditional statistics teaching. Here is what they wrote:
Some people talked explicitly about the need to “go into the math details” in order to achieve better understanding:
One reader even talked about falling in love with the material: “When textbooks try to limit the theory content to only what is needed for some practical examples, there is few opportunities to fall in love, and thus, get creative with the newfound tools.”
The specific suggestions for statistics topics to cover were very much in line with my research on the STATS 101 curriculum:
I’m 100% aligned with these suggestions, and they are all included in the book outline.
Respondents had lots of practical tips and suggestions for what makes a good learning experience:
This list describes all the tricks I’ve used in the previous books. I’m very glad to see people appreciate the need to “connect concepts” to reach a better understanding, which is what I’ve been working on recently by making a smooth connection between the PROB and STATS chapters. One respondent said they want a “conversation with a t-test like the one with sin and cos in the MATH&PHYS book,” which made me laugh. For sure, we’ll need to include some jokes interspersed with the serious stats stuff to lighten the mood. Yes learning statistics is strategically important, but we can also have fun along the way!
Lots of respondents emphasized the need for examples, exercises, and problems to be part of the book—not as optional add-ons but as a core part of the learning experience. Here are some suggestions in their words:
For sure there will be lots of examples and practical applications. In the PROB chapter, I plan to show a mix of math derivations and code simulations, while in the STATS chapter there will be exercises to complete using a pen-and-paper approach or a computer.
Respondents were generally on board with the idea of using code examples to teach statistical concepts:
One reader said that “statistics and probability are extra tricky because intuition doesn’t work as well as in other fields” and continued to suggest “it would be better to approach from a computational point of view.” I’m very much in agreement with this, and I plan to show many of the procedurally complex stats recipes through unambiguous code examples.
Respondents mentioned several specific statistics applications that they are interested in:
For sure the book will have lots of applications. I’m already collecting examples of statistics applied in different domains (education, medicine, psychology, software engineering). I’ll also make sure to cover all the foundational ideas of statistics that are used for more advanced studies like machine learning.
Of course I can’t accept all the suggestions, since that would make for a very long book. Here are some elements of feedback that I’m not going to include, and the reasons why.
Easy mode. One reader suggested to “move code and math concepts to an appendix” while keeping the main text as a high-level, plain-language discussion. → It’s not possible to understand statistics without the probability math details, and my plan is to use the Python code examples to support the conceptual understanding, so it’s not possible to put the code and math in an appendix.
Assuming prior knowledge. Another respondent said I should assume readers have read the MATH&PHYS and LINEAR ALGEBRA books and use concepts from calculus and linear algebra. → Ideally, all readers of the stats book would have read MATH&PHYS and LINEAR ALGEBRA books, but I still plan to provide some material on calculus and linear algebra to make the book self-contained. We don’t need much: just a little bit of calculus to describe probability calculations with continuous distributions, and some linear algebra to explain the linear model fitting.
Advanced topics. There were various suggestions to cover advanced topics like Monte Carlo methods, hierarchical models, machine learning, graphical models, and other state-of-the-art applied techniques. → This was planned in the early draft of the book, which included a whole chapter on machine learning, but I’ve since reconsidered the scope. Statistics is already big enough as it is, so trying to cover advanced topics in the same book will take us too far afield. Instead, I’m saving all the draft material for a followup book on machine learning.
My conclusion from the survey is that there are no blockers to the current approach. Full steam ahead with the vision for the book! Feedback from readers validates the need to explain things in detail (i.e. real probability theory, not some watered-down version). There is also strong support for including lots of hands-on, practical tasks, which I will do in examples and in the problem sets. I’ll also have some extra tutorials and missions in the form of Jupyter notebooks.
My other conclusion is that including code examples will be OK, so no change in plans on that front. Statistics is such a complicated topic that we need to unleash the triple-combo: equations, visualizations, and code examples to make progress. The intro-to-Python tutorial should be sufficient to handle the exceptional case of a reader that has never seen Python before. One respondent said that “coding is getting at the level of reading and writing in terms of expected literacy” so it’s a good thing to force readers to level-up at basic code literacy.
Now it’s time to start working on the main text. I plan to finalize and release new sections as they become ready every month, and circulate them to test readers on the mailing list. If you’re interested in receiving free chapter previews, you can sign up to the stats mailing list here to get notified when new material is released. Expect a mix of jupyter notebooks and PDF previews in the coming months, culminating with the final book draft in Spring 2022.
In this blog post, I want to look at the mechanics that make learning loops work and think about ways they could be used by teachers, private tutors, and publishers to build learning experiences in which learners have more agency and control over their learning. We’ll also look at the related phenomenon of game mechanics that exists in certain “addictive” computer games. Figure 1 contains a visual summary of the ideas we’ll discuss in this blog post. The two main questions we’re interested in are: “What can teachers within the formal educational system learn from autodidacts?” and “What can autodidacts learn from the gaming industry about staying motivated?”
In the second part of the blog post we’ll think about the role of teachers and educational resources in supporting and reinforcing learning loops. I’m writing this mostly as a self-reflection and welcome comments by other educators, content creators, and learning experience designers interested in this phenomenon.
Figure 1: The main question I’m interested in thinking about is how to introduce aspects of self-directed learning into the formal educational system, in order to give students more agency over their learning process.
Let’s start with some examples to illustrate the concept. I’ll give three examples of learning loops from my personal experience.
EX1. As a youth, I always had good grades in math, but despite this, taking exams was a stressful experience. All this changed one summer back in 2004, during which the only source of entertainment I had was a calculus book. I started solving the problems in the back of the book out of boredom. Each problem that I solved gave me a little dose of “achievement buzz,” which motivated me to solve the next problem, then another one, and before I knew it I had gone through all the end-of-chapter problems in the book, including problems with a double-star difficulty rating. Since I had the answers to each problem in the back of the book, I was getting constant feedback about my performance. Each time I solved a problem correctly, I felt a little boost in confidence. By the end of this summer, I had developed a completely new attitude toward solving math problems: a general curiosity in the sense of “Let’s see what they are asking?” and there was zero grams of math anxiety left in me.
EX2. Another example of a learning loop from my life is how I learned to use the UNIX command line. As a kid, I had read a crime novel in which the protagonist (FBI special agent) used UNIX to look up some important data as part of her work, which gave me the idea that UNIX is cool. Later in life, there was a UNIX command line seminar offered at my university, which helped me get started with basic commands like ls, cd, cat, etc. I also had ssh access to a UNIX server on campus to practice the commands I was learning. Later on, I started installing GNU/Linux on old laptops, which required developing further command line skills, gradually learning about more and more programs, services, and config files. This non-curricular hobby turned out to be very useful later in life since every tech job I’ve had involves command line and scripting in one way or another. I’m still learning about servers and command line tools to this day, continuously benefitting from the instant feedback when commands fail, and from easily accessible docs, manpages, blog posts, and discussion forums.
EX3. The third example of a learning loop is how I got into web programming. I attribute my success in web programming almost entirely to the Django web framework, which has excellent documentation. Before starting with Django, I had tried to learn Flash, C, PHP, and JavaScript, but I never managed to get off the ground with any of these. Then one day I made an attempt at coding a project using Django and fell in love with it. I was productive from the very first day: I was able to create a simple view function and test it out, then added templates, and slowly built the project via small, incremental steps. Over time I understood the request-response cycle, learned to handle HTML form submissions via POST requests, login sessions, and all other features needed for a full website. The source of feedback was the manual testing process: navigating the webpages as I build them out and manually testing each new feature.
Here are some common aspects of all three examples:
Note also what is missing from these learning experiences:
I find it fascinating that three of the deepest skills I have developed in life are in no way related to my formal education at university. It’s as if the formal education system is only good at achieving a basic level of competency, and achieving true depth with any skill requires active participation of the learner: learners need to get into a learning loop and pursue the skill on their own.
In the rest of the blog post, I’ll think on this idea further, and flesh out the implications for educators in general and specifically for content producers like Minireference Co.
Let’s take a broader view to see where learning loops fit as part of the overall learning process. Let’s analyze the user-journey of a self-taught programmer, let’s call her Aisha, as she progresses through different levels of experience, from absolute beginner to domain expert.
Aisha hears about programming as a career choice either by meeting someone who works in tech or talking to a friend who is studying to become a developer or data scientist. Aisha could also be exposed to programming as part of some course.
This initial exposure is one of the most crucial moments in the overall learning journey, due to the emotional dimension. Does the person who is introducing Aisha to programming give the impression that programming is something cool? Do they look like they’re having fun when programming? Do they look confident and happy with what they are doing?
A successful exposure will draw Aisha’s attention to programming, raise her interest, and build a positive association, making her think of programming as something worth looking into. Ideally, the person through which Aisha gets her first exposure is a woman, which would help her see herself in that profession. This is why it’s so important to support (or at least not interfere with) women in tech—even if the field continues to be gender-imbalanced, the presence of women is essential to give role model examples to the next generation of tech women. It’s a very simple message: women can be tech bosses too, no hipster-beards required!
This is where the actual journey begins. Aisha faces a number of barriers to enter the field. Does she have a working laptop? Does she have the computer literacy required to set up the software development environment? Does she have the language skills to follow an introductory tutorial (in case English is not Aisha’s first language)?
Assuming she passes these starting hurdles, the initial efforts invested in learning to program (from tutorials, books, videos, apps, etc.) will feel like energy sinking into a black hole. Nothing makes sense at first. She’s just learning a bunch of disorganized facts and concepts, relying on memorization and without understanding why they are important. The primary skill that Aisha must cultivate at this stage is persistence in the face of endless errors. This is a long and painful stage where stepping outside the tutorial means you end up with error messages. At this stage, having access to good tutorials[ex1,ex2,ex3] and helper tools that print friendly error messages (see explainer video) is very helpful.
Ideally, this initial “uphill” step shouldn’t be necessary if Aisha is learning in a beginner-friendly environment that allows her to figure things out on her own and get into a learning loop right away (see next level). I’ve included this “initial investment” level as a prerequisite for the “learning loop,” because for many people learning coding doesn’t come easy at first, so it’s important to recognize this difficulty (and plan for it).
Assuming Aisha survives the initial “information gathering” step, she will eventually start to feel more comfortable with coding, spend less time stuck, and finally start to productively use her programming skills. Aisha is no longer a beginner, so the range of learning resources accessible to her has increased: lots of textbooks, online lecture notes, video tutorial series, school projects, and personal projects become available once you know the basics.
The success, duration, and productivity of Aisha’s learning loop will depend on the type of tasks/projects she picks. Personal projects are particularly good at this stage because they allow the integration of ideas: the disorganized facts turn into experience. In the ideal case she can find numerous challenges that are at an appropriate level and is also able to find additional resources for just-in-time learning of new concepts. Each time she completes one level/task/assignment, there is immediate positive feedback, which in turn motivates her to do the next level/task/assignment, thus creating the learning loop.
Persistence in the face of errors continues to play a major role at this level—not quitting even when being stuck. A real-world mentor that can help her get unstuck, choose projects/tasks of appropriate difficulty, fill-in knowledge gaps, and provide feedback on her performance would be very useful to have.
Thanks to her efforts, Aisha will gain enough experience to get a job, which is the next level.
The work-experience loop is similar to the self-teaching loop, but now Aisha is learning on the job. She’s fighting bugs, finding workarounds for problems, and taking on medium-size projects. Educational projects are replaced by real-world business objectives, which means the tasks become much more concrete and well-defined. This is where Aisha will develop auxiliary collaboration skills like: version control (git), bug tracking, personal time management, project management, and team communications.
Initially each task is going to be an opportunity to develop new skills (learning on the job for the win!). Over time Aisha starts to see reusable patterns for solving problems. Throughout this work experience, Aisha will build her confidence and skills for solving more and more problems using the “I have done this before” feeling. The work-experience level can last for many years, and some people transition to the next level, which is mastery and specialization.
At this level Aisha knows how to plan, scope, and manage big projects. She doesn’t need supervision or technical management anymore, since she’s capable of planning work on her own and getting things done. She knows how to deal with very hard bugs, and becomes comfortable reading other people’s source code (e.g. looking under the hood at the source code of libraries) and understanding problems from first principles.
With enough work experience and debugging, she builds an inventory of common solutions in her head and no longer needs to google things. In fact she probably knows the domain better than 90% of people out there on the internet.
I’ve written out this whole user journey in order to highlight the importance of the learning loop in the overall career trajectory. Once Aisha gets a job and starts learning on the job, she’s all set to advance and continuously develop new skills (unless she makes particularly bad choices of employers). The learning loop (Level 3) is the key step that brings her up to employable level, and hence it’s worth thinking about how we can make life easier for Aisha to get there—not as a lucky coincidence of circumstances, but as a repeatable, reliable, and enjoyable process.
Of course we must not forget the prerequisite levels that come before the learning loop: the exposure to coding as something cool and worth considering as a career (Level 1), and the initial time investment needed to reach basic proficiency (Level 2). Both these levels have the potential for learners to drop out, so we must think about them in parallel with the focus on the learning loop.
Before we get excited about learning loops, let’s check whether self-directed learning is an exception that only applies to the domain of computer programming, or only to people with a specific background.
I think self-directed learning loops will work for all STEM fields: basic math, physics, chemistry, biology—all fields where exact answers exist. These are necessary to give learners the immediate feedback and “achievement buzz” of getting the correct answer which is required to keep them going. Within the STEM fields, computer science is the easiest case; it doesn’t require anything more than a computer of some sort. These days, thanks to tools like jupyterlite, basthon, and thebe you can learn to code without leaving the browser! Other theory topics will be equally easy to learn. However, learning advanced biology, chemistry, and other subjects that require a “wet lab” would not be practical for at-home autodidacts. Still, there is a lot of learning that can happen before the need for specialized equipment becomes the limiting factor.
Artistic fields like writing, painting, singing, and playing an instrument are also amenable to learning loops, at least at the beginner level. Unlike STEM subjects, there is no “right” answer for a given artistic performance, but the inner critic is enough to keep learners motivated. If you’re learning to play the guitar and you like the sounds you’re producing when playing your favourite songs, then you’ll play more and more. The greatest guitarist of all time learned to play on his own, without taking music lessons. Reading literature also leads to strong self-reinforcing loops. I know many people who get addicted to reading: they read one novel, enjoy the story, then read more and more. That’s why we have bookstores: they are the SQDCs of reading.
It’s not all good though. Developing advanced-level skills is difficult using self-directed learning loops. The more advanced learners’ skills get, the higher the need for fine-grained feedback. We’re not talking about getting the right answer anymore, but about polishing and improving specific performance aspects. This kind of improvement only happens when you have detailed feedback from an expert teacher or mentor. For an example from the math field, it’s easy to know when you have the right answer to simple numeric questions, but much harder to know when you have all the right steps in a lengthy math proof.
In summary, the learning loops approach may not be fully general, but it’s pretty generalizable and applicable in many fields. The fact that we hear a lot about self-taught programmers and not other professions is not because programming is special somehow, but rather because it’s the current “hot” field and attracts a lot of attention. Generalizability test passed! There is no reason the same dynamics that make self-taught programming work can’t be applied to other fields.
The concept of a learning loop is very common in the gaming world. In fact, we could even say getting players into a “learning loop” is the primary mechanic behind most games. Players come into the game world and are presented with a background story and a clear mission (motivation). After an initial onboarding tutorial to explain the game controls, players enter a compelling game world with clear objectives and a path for advancement. Progress in the game requires some special skill, and the player faces progressively more difficult challenges.
Games contain a microcosm of reinforcement loops in which players can spend hours and hours engaged with game-world tasks. Some people get addicted to these game worlds and spend multiple days playing and having fun, completely immersed in the game world. Clearly, there is a lot to learn from the gaming industry about user experience, the power of storytelling, and motivating players to complete missions.
Zooming into the genre of educational games specifically, we find a rich history of projects that use game mechanics to teach mathematics, geography, history, business, physics, and other academic concepts. Indeed, the idea of using computers as learning devices has been around since the early days of computer hardware.
As an educational platform, computer games have some unique characteristics as compared to traditional methods of instruction (textbooks and lectures):
I hope you’ll agree with me that these characteristics of game-based learning are very interesting and hold the potential for making education fun and enjoyable. The problem with games is that they require lots of resources to create (programming, level design, educational expertise, knowledge of curriculum standards, artwork, music). It’s not like every teacher can sit down and create an edu-game for their students, but when the teacher has the skills, the results can be amazing. Perhaps we can make edu-games easier to create by providing “game templates” that teachers can fill in using their course materials? I’m also optimistic about the potential for “lightweight gamification,” where a teacher keeps using their traditional handouts and exercises, but renames them as “knowledge scrolls” and “missions” and embeds them in a game-like storyline.
Let’s think about the mechanism that makes learning loops possible, and look for ways we could make the learning more efficient. We’ve already identified the three core requirements for a learning loop:
A loop is formed because learners who complete a task successfully receive positive reinforcement that motivates them to start the next task. The positive reinforcement can be intrinsic (self evaluation, building confidence, personal satisfaction) or extrinsic (game points or grades awarded by a teacher). A crucial aspect for a learning loop to form is for the tasks to be of the right granularity (think one-hour-long tasks and not one-week-long tasks) so that learners get to experience some “achievement buzz” at regular intervals.
Figure 2: To enter the learning loop, learners need some initial motivation and a series of problems to solve. The learning happens as learners work toward a solution, and the feedback lets them know when they have found a good one. The feeling of success then motivates them to go around the loop another time.
These three core requirements for a learning loop are similar to the ingredients for optimal skill development identified by Anders Ericsson and Robert Pool in their book Peak: Secrets from the New Science of Expertise. The authors studied top performers in various fields in order to identify the secrets of their success. The ingredients for developing mastery at a given task are:
I don’t think I’m doing justice to the book with this point-form list, but I encourage you to look into it and read about this further by consulting this condensed book summary here. For me, the main takeaway from Peak is the importance of fast, accurate feedback. Without this clear “error signal” from an experienced coach, it is much harder for learners to fix the mistakes they are making, especially for more advanced skill levels.
Speaking of teachers…
Learning loops are primarily a learner-driven activity and not a teacher-led activity. Learners advance on their own without the need to be controlled or supervised by a teacher. However, teachers, coaches, and mentors still have a role to play in fostering and supporting students’ looping behaviour.
First, teachers and mentors are important to give the initial motivation for learning the subject. Why should learners care about learning subject X? Without this initial motivation, the new knowledge or skill will be perceived as inert and lifeless. If the teacher or mentor is excited about the subject X, then this excitement will rub off on learners. Basically, the learning process is not just about transferring skills. There is also the “romance” aspect, getting students to fall in love with the subject matter. This “romance” aspect is well described by Stefan Schindler in his paper The Tao of Teaching.
The teacher’s role is also essential for setting up the foundational knowledge. Recall that entering a learning loop requires some minimal level of competency in order to make the learner independent and able to complete tasks on their own. An introduction to the subject by an experienced tutor can be an excellent way to get up to speed.
Teachers also play a crucial role in modelling the right attitude: exemplifying a chill mindset and confidence in the face of difficult challenges, showing examples of breaking down complicated tasks systematically, and knowing when to go into the details and when it’s OK to cut corners. Such “soft skills” and domain conventions are difficult to pick up on one’s own, so it’s great when learners can see examples of this behaviour.
When a teacher is nearby, students have access to a mentor whom they can ask for help. This help can come in the form of detailed feedback on the student’s performance to improve their skill, or simply helping them get unstuck.
Note the teacher’s role described here is very different from the traditional role of the teacher as the unique source of knowledge and information, but still very valuable as a stop-gap (for learners who are behind) and accelerator (for learners who can move faster). A coach’s role is more of a “teacher on demand” and “learning loop support” service, rather than a principal educator taking centre stage.
Last but not least, a mentor can help select challenges of the appropriate difficulty level for learners. Optimal progress requires taking on challenges that are adjacent to their current skill level: what better way to do this than with expert teacher support?
Not everyone has access to an expert coach (teacher or private tutor) available to help 24/7. In the absence of a dedicated teacher that caters to each learner’s specific needs, the presence of a group coach can be a good replacement. Think of an online forum or chat group on subject X where learners can ask questions when they are stuck and receive recommendations for task challenges. A mentor doesn’t need to invest many hours of their time, but just pop-in once in a while to answer questions and help learners get unstuck.
Peers can also play a useful role in the learning process. The hypothetical discussion forum on subject X doesn’t need to be populated by experts—other learners can also be helpful to answer questions, share resources, and give a general feeling of solidarity in the face of difficult tasks. The self-organizing learning groups at the Recurse Center are a good example of this: there are no instructors and there is no fixed curriculum to follow, just a bunch of people getting together to learn and practice programming. Recent trends in online education are toward courses that leverage the power of learner groups and use cohorts of learners, instead of the traditional MOOCs where the assumption was that each student learns independently.
We finally get to the most important question when thinking about “learning loops” from the point of view of a textbook author and educational publisher. What types of learning resources are best for supporting and encouraging learning loops?
The optimal learning resources for learning loops might not be textbooks or lesson plans, which are the standard types of resources used in the traditional educational system. Instead what we need is a set of standalone resources that students can reach for when they need them, in the middle of a task. Accessing learning resources on-demand is called the “pull condition” by Nick Shackleton-Jones, see this video.
Let’s see what type of resources might be helpful in the next subsections.
Introductory tutorials are probably the best way to pick up the prerequisite skills, either before the learning loop starts, or in a just-in-time moment to fill in necessary background info. The goal of a tutorial is not to be exhaustive, but to be a short hands-on introduction that takes learners to the first “win” in the learning process. A tutorial doesn’t need to explain everything, but rather focus on the specific needs of beginners.
Recently, interactive notebooks have been all the rage and are rapidly being adopted by educators. For those who might not be familiar, Jupyter notebooks are collections of code examples embedded in a text narrative. Learners can run the code examples and easily modify them interactively. Getting-started tutorials presented in the form of Jupyter notebooks offer the perfect combination of hands-on practice and minimal, just-in-time explanations.
Textbooks can serve multiple purposes for learners: they can be a primary source of knowledge (async instruction), they can serve as a reference (just-in-time consultation), and when they contain exercises and problem sets, they can provide hands-on practice opportunities.
In order for a textbook to be a good reference, it must have a good table of contents and an index so that learners can navigate directly to the place in the book that is relevant for their current need.
Reference-oriented resources are perfect for the “pull condition” when a learner has a specific knowledge need. Simple, self-service resources like concept maps (e.g.: math, mathphys, LA), flowcharts, glossaries, checklists, FAQs, templates, quick start guides, infographics and other visualizations can be easier to consume than traditional textbooks and video lectures. Most people have a short attention span these days and will not read long walls of text, but if a learning resource provides a useful way to think about the material, then people will read. Basically, learners don’t have the time for any blah-blah, and just want the facts. Educators must be ready to give learners what they need.
Practice problems are by far the most useful part of any textbook or course. Most people erroneously believe that learning happens by listening to lectures and reading books. In reality the real value of a course or textbook is in the practice problem sets, which give learners the chance to apply the knowledge. In Eastern Europe it is common for students to buy a workbook (сборник) of problems and not a textbook.
Practice problems should be numerous, of varying difficulty levels, and also come with hints so that learners never get stuck for too long. We’re not talking about ANY problems here—the problems need to be interesting. We want the learners’ experience solving the problems to be fun, by using real-world scenarios, adding jokes, and generally choosing questions that feel relevant to the learner’s context. Problems should not be difficult for no reason: the goal is for learners to actually appreciate the effort of solving a problem, not to suffer needlessly.
Completing a problem often has an interesting component (applying insights) and a boring component (manual labour). A good problem is one which pre-fills the boring parts and leaves students to do only the interesting parts. This is what I call a “neat” task: a task that gives learners a good feeling—not only are they spared the boring component of the task, but they also feel their instructor really cares about them by putting in the effort to prepare well-structured questions. A good example of “neat” problems are the assignments for the Stanford CS231n course, which consist of elaborate pre-filled Jupyter notebooks with scaffolded classes and test code, allowing students to focus on implementing only the interesting parts.
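To illustrate what pre-filling the boring parts might look like, here is a tiny hypothetical example of a scaffolded Python exercise (the dataset, function name, and check are made up for illustration; real assignments are of course more elaborate):

# Provided by the instructor: toy data and an automatic check.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

def compute_variance(values):
    """TODO(learner): return the population variance of `values`."""
    mean = sum(values) / len(values)                       # (solution shown here)
    return sum((x - mean)**2 for x in values) / len(values)

# The pre-written check gives immediate feedback, i.e., the "achievement buzz."
assert abs(compute_variance(data) - 4.0) < 1e-9, "Not quite, try again!"
print("Correct! The variance is", compute_variance(data))

The learner only fills in the body of the function; the data loading and the automatic check that closes the feedback loop are already written for them.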
Projects are by far the best way to encourage learning loops. Projects need to have the right scope: complex enough to be meaningful tasks, but not so complex as to be overwhelming. Projects are an opportunity for learners to take end-to-end ownership of a complete task, including filling in knowledge gaps. In the education space, this is called project-based learning (PBL). PBL classes have a proven track record of improved learning outcomes. See this webinar about PBL to learn more, or browse these PBL curriculum samples. Some earlier research had cast doubt on the effectiveness of the discovery-based learning approach, but I don’t see a contradiction here. There is no doubt that scaffolding can be very useful: we don’t need to go 100% constructivist on this. Direct instruction seems like the perfect tool for learning the basic skills (Level 2) required to enter a learning loop (Level 3 learning).
Projects are important because they allow learners to integrate knowledge—all the disparate facts, laws, equations, and rules to memorize must be used and combined into a coherent whole in order to ship the project. Working on a project is also an opportunity for learners to develop teamwork and collaboration skills.
Is it possible to combine the best aspects of books, games, and project-based learning into a learning app? Imagine installing a learning app that starts with some tutorials to give beginners the minimum competencies in subject X, then guides them through a series of well-defined projects of increasing difficulty, just like a computer game. Such an app would be the holy grail of educational technology. This is what I’ve always dreamed of working on one day—a gamification layer on top of the No Bullshit Guide content that makes the learning process more fun. I had a lot of knowledge buzz learning all this stuff, and I know many more people would enjoy learning math and physics if the material were presented as a set of challenges.
I’m thinking the app could contain a mixture of problems (short missions) and projects (long missions), and possibly additional team challenges to get some social aspects going, e.g., meeting people from around the world who are working on the same mission as you. I recently started the noBSmath community chat room on Gitter to give readers a place to ask questions. We could easily scale this up to separate channels for different chapters, sections, and project-based missions. It would be nice to introduce some mechanism that encourages advanced learners to play the role of mentor for beginners.
I realize this idea is not well defined at all, but I think there is something that can be done here. Introducing learners to different subjects and then letting the dynamics of “knowledge buzz” take over seems like a fruitful way to get people into learning on their own.
I like the idea of “learning loops” a lot because they represent a potential solution to the main problem in the formal educational system: apathy, or the fact that students don’t care! I know from my personal experience and the experience of teacher friends, that the bottleneck in formal education is not the lack of content, teacher skills, or other resources, but the lack of students’ interest in learning the material that is “forced” upon them. When the material is presented in a disconnected and unmotivated way, students don’t care. I wrote this blog post because I think it’s important to think about how self-directed learning dynamics can make formal schooling work better. If we move away from an education where teachers “push” knowledge onto students and instead let students “pull” the knowledge they are interested in, then the educational system can be salvaged.
Acknowledgments. I want to thank Jonathan Herman and Kevin Ollivier for our discussions about education where these ideas originate. I also want to thank Jonathan, Kevin, Edith, and Julia for their constructive comments that improved this blog post.
[ An article about reforming the educational systems by Alfred N. Whitehead ]
https://minireference.com/blog/the-aims-of-education-according-to-whitehead/
[ Anders Ericsson: Dismantling the 10,000 Hour Rule ]
https://www.goodlifeproject.com/podcast/anders-ericsson/
[ An interview with the founder of Jump Math: math learning based on step-by-step exercises ]
https://www.cbc.ca/radio/thecurrent/about-jump-math-1.5426840
[ Sir Ken Robinson talks about changing the education paradigms ]
https://www.youtube.com/watch?v=zDZFcDGpL4U
[ Seymour Papert and Alan Kay discuss technology use in education ]
https://www.youtube.com/watch?v=0CKGsJRoKKs
[ Roman Kudryashov defines the components of a learning loop ]
https://romandesign.co/how-learning-works-components-systems-and-loops/
[ More on the just-in-time learning resources ]
https://educationoutrage.blogspot.com/2016/03/pragmatic-learning-its-not-fun.html
I’ve now been working on the No Bullshit Guide to Statistics for three years so I figured it’s about time for an update to let y’all know how it’s going. My goals with this blog post are to share with you the detailed book outline and chapter previews, and also ask for your help to validate certain assumptions about the readers’ background (math and programming skills) and their motivation to learn statistics. Please jump to the short survey before continuing with the rest of the blog post. It won’t take longer than 2 mins.
Blog post overview:
When I started working on this book, the first thing I did was to look at other textbooks and statistics courses to pin down the “standard” curriculum for an introductory course in statistics, the course usually called STATS 101. I had a lot of difficulty trying to figure out the scope of the book. There are ongoing debates about what should be taught, what methods should be used for research, and how results should be reported. The old-school frequentist stuff that teaches hypothesis testing based on a plug-the-numbers-get-the-answer approach currently dominates stats courses. New ideas based on computation-first statistics like resampling and simulations are also recommended but not widely adopted yet. I wrote a detailed blog post about the problems with the “classical” statistics curriculum, which I invite you to check out for more details.
Luckily, I wasn’t alone in this quest to figure out statistics. I managed to recruit Robyn Thiessen-Bock to help with the research and scoping of the book. Robyn and I had previously worked on another project, so I knew I could count on her. Together we came up with an initial outline for the book and iterated over several draft chapters. Thanks to Robyn’s contributions on research, writing, and review, we managed to push ahead on the most difficult part of the book: hypothesis testing—the dreaded recipe that makes no sense whatsoever. Following the startup principles of maximum uncertainty reduction, we decided to focus on this chapter first (see draft). We also thought about prerequisite topics like data management and probability theory that readers will need to know before they can learn statistics. See Figure 1 below for an overview of the main components of the book and the conceptual dependencies between them.
In the publishing world, authors pitch their book ideas by preparing a book proposal package to be reviewed by an acquisition editor at the publishing house. You can think of the book proposal as an elevator pitch for books. A book proposal must explain what the book is about, what the target audience is, and usually includes a book outline (table of contents) and a sample chapter. Since the Minireference Co. business model is direct-to-reader, readers are the only acquisition editor I have to consult with. Dear reader, I submit the following book proposal for your review. Please take a look and let me know what you think.
The No Bullshit Guide to Statistics includes prerequisite topics in probability theory and also covers some topics related to practical data processing and visualization. The goal is to provide readers with a self-contained three-in-one package DATA+PROB+STATS that is suitable both for independent learners and university students. The book will cover all the standard material from the STATS 101 curriculum, but place extra emphasis on the topics that are generally practically applicable and useful for future studies.
Figure 1: An overview of how the topics in the book (blue boxes) depend on each other and build towards the applications and use cases (shown in purple boxes).
The target audiences for the book are threefold:
In general it’s not good for a book to target multiple audiences, but I think it makes sense in this case. Whether you’re an undergrad student or an industry person who has been out of school for a very long time, the challenge you have is the same: to learn statistics, you need to get through layers and layers of modelling, approximations, hypothetical repetition scenarios, and complicated procedures. Learners’ strategies in the face of this challenge generally fall into two camps: the understand-from-first-principles camp and the memorize-shortcuts camp. This book is for the understand-from-first-principles camp.
It’s also worth noting who is not in the target audience: high school students and grad students. It won’t be possible to cover all the material I want to cover if I stick to simplified high-school-level explanations of probabilistic modelling. Similarly, advanced statistics topics like multilevel models, experimental design, and research methods that graduate students need to carry out research projects are out of scope. This doesn’t mean the book won’t be useful for these audiences: advanced high school students who want to study university-level stats and graduate students who need a review of the fundamentals will certainly benefit from the book, but we won’t make any special effort to cater to their needs.
So how are we going to make this happen? How can we make statistics interesting and understandable? There are thousands of teachers and hundreds of textbook authors that have tried this before, so what is different about this attempt? I’m glad you asked; here’s the plan:
Okay, enough talk, let’s see the book already! After three years, there better be something to show!
Based on all this planning, scoping, and research considerations, I finally present to thee the sequencing of stats topics I propose: Outline of the No Bullshit Guide to Statistics (a shared google doc, open for comments).
This is your chance to provide input on what material should be covered in the upcoming stats book. The above gdoc is the “draft playlist” of topics I have prepared for you, and I hope to evolve and iterate on this playlist, so your feedback will be much appreciated. What do you want to learn? Is there some stats concept or topic you’d be interested in, but you can’t find it in the outline? Comments in the gdoc please!
You might also want to take a look at the new concept map that shows how STATS ideas build on prerequisite concepts from DATA and PROB. I am very proud of this concept map since it manages to fit all the core ideas on a single page, which I was not able to achieve with the previous concept maps.
Every book proposal must include some sample chapters. I’ve prepared PDFs of four chapter drafts that you can use to get an idea of the writing tone and the level of mathyness you can expect in the final book:
Please ignore the chapter numbers and sectioning; the book is currently undergoing refactoring, so final chapter numbers and book structure will be different.
I’m not committing to any specific dates, but this is a P1 project (high priority) and I’m optimistic about making progress on this book by end of the year:
To receive updates about the stats book, sign up to this mailing list: https://confirmsubscription.com/h/t/A17516BF2FCB41B2
I’ll message you when the useful “chunks” of the book are ready to preview: chapters, concept maps, notebooks, etc.
The overall game plan for the book is in place, but I need your help to make the book accessible to all readers. The main thing I’m not sure about is how to calibrate the book along the following three dimensions:
Using pandas, numpy, scipy, and statsmodels allows us to hide a lot of complexity, and at the same time introduce readers to the Python tool set for data analysis (hands-on data science skills). However, readers who are not comfortable with programming concepts might not be able to follow these code-first explanations and prefer standard explanations using math equations and words, like in a traditional textbook. Can I show useful code examples, but also make the narrative understandable for non-tech readers who get intimidated by code blocks?
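As a hypothetical example of the kind of code-first explanation I have in mind (the numbers below are made up), computing a sample mean and a 95% confidence interval takes only a few lines:

import numpy as np
from scipy import stats

sample = np.array([4.2, 5.1, 3.9, 4.8, 5.5, 4.4, 4.9])   # made-up measurements
xbar = sample.mean()                                      # sample mean
sem = stats.sem(sample)                                   # standard error of the mean
ci = stats.t.interval(0.95, df=len(sample) - 1, loc=xbar, scale=sem)
print("mean =", round(xbar, 2), "95% CI =", ci)

Whether an explanation like this helps or intimidates depends entirely on the reader’s comfort with code, which is exactly what I’m trying to calibrate.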
The final book will aim for a balance between these options, but I’m not sure what will be most useful. I figured, the best thing I could do is to ask y’all what you would like the book to be and help me get this right. If you have five free minutes, please fill out this form: https://miniref.typeform.com/to/DhyRCp
You can also send me feedback and ideas by email at ivan@minireference.com. I’d be particularly interested to hear advice from people who use stats in their day-to-day job—what are the most useful parts of STATS 101? Let me know by email or leave your comments in the outline gdoc.
I’ve been looking into this question for the last three years and I finally have a plan for how we can improve things. I’ll start with a summary of the statistics curriculum—the set of topics students are supposed to learn in STATS 101. I’ll list all the topics of the “classical” curriculum based on analytical approximations like the t-test. This is the approach currently taught in most high schools and universities around the world.
The “classical” curriculum has a number of problems. The main problem is that it’s based on difficult-to-understand concepts, and these concepts are often presented as procedures to follow without understanding the details. The classical curriculum is also very narrow, since it covers only the slim subset of statistical analyses that can be described as math formulas to be used blindly by plugging in the numbers. At the end of the introductory stats course, students know a few “recipes” for statistical analysis they can apply if they ever run into one of the few scenarios where the recipe can be used (comparison of two proportions, comparison of two means, etc.). That’s nice, but in practice this leaves learners totally unprepared to solve stats problems that don’t fit the memorized templates, which is most of the problems they will need to solve in their day-to-day life. The current statistics curriculum is simply outdated (it was developed in times when the only computation available was simple algebraic formulas for computing test statistics and lookup tables for finding p-values). The focus on formulas and the use of analytical approximations in the classical curriculum limits learners’ development of adjacent skills like programming and data management. Clearly there is room for improvement here: we can’t let the next generation of scientists, engineers, and business folks grow up without basic data literacy.
Something must be done.
The something I have in mind is a new book, the No Bullshit Guide to Statistics, which is my best shot at improving the teaching of statistics. I’ve researched the tables of contents of dozens of statistics textbooks, read hundreds of stats-explainer blog posts, and watched hundreds of lectures on YouTube, all while trying to make sense of this subject. What is statistics? What are the core statistics topics that everyone should know? Read the rest of this wall of text to find out.
This is part 1 of the 2-part blog post series about the book production process, in the style of the “How it’s made” TV series. In this first “episode” of the series, we’ll talk about the statistics curriculum in general (what we need to teach and how best to teach it). The goal is to produce a complete “bill of materials” (BOM) for the set of statistics topics and concepts that should be covered in the introductory stats course. In the second episode, we’ll talk about the “packaging” problem: how to organize the bill of materials—the O(100) core topics and concepts of statistics—into an easy-to-follow sequence of learning experiences in book form (chapters, sections, subsections, etc.). The second post includes a book progress report and links to the live-updating book outline gdoc, draft chapter pdfs[1,2,3,4], and a form you can use to provide feedback and comments to influence what the final version of the book will look like, specifically, how many math equations and how many code examples I can use without making the book overwhelming for beginners.
Table of Contents:
Statistics is used in physics, chemistry, biology, engineering, business, marketing, education research, international development, epidemiology, medicine, finance, social sciences, and many other fields. Every university has some version of a STATS 101 introductory course, and sometimes departments have subject-specific “Statistics for X” courses. Statistics concepts are even supposed to be taught in middle schools and high schools! Clearly the educational establishment thinks that students should be learning statistics, and there are hundreds of textbooks out there that teach the statistics fundamentals.
Which topics and concepts belong in the first course on statistics? To answer this question I looked at the syllabus documents from several university courses [McG203, McG204, McEPIB, UIC, UWO, UNIPD, UH, SFSU, CEU, NAU, DAL, USU, UFL, McM, UBC, WKU], tables of contents of introductory statistics textbooks [OpenIntro, OSE, ER, Lane, OpenStax, Lavine], online courses [CC], and educational standards [CCSSM.HSS, AP STATS]. The list below includes all the statistics topics I encountered during this research. Topics shown in bold are deemed core material (covered in all courses), while topics shown in grey are less common and only covered in certain courses.
As you can see, the STATS 101 curriculum is pretty packed! Not every stats course covers all these topics and the focus in different courses varies, but overall this is a representative summary. Looking at the above list should give you a general idea of the multitude of ideas and concepts that will be presented to (or should we say inflicted upon) first-year university students.
Despite the long list of topics normally covered in the first statistics course, there are some important topics that are skipped or covered only superficially:
I recently found an excellent paper on stats pedagogy titled The Introductory Statistics Course: A Ptolemaic Curriculum by George Cobb, which has been instrumental for my understanding of statistics teaching:
I really like this paper because it puts the emphasis on a learner-centric perspective. What are the conceptual leaps we’re asking students to make when learning statistics? How many of the formulas shown to learners are explained properly? Are students getting the big picture or blindly memorizing procedures?
I encourage you to read the whole paper (especially the highlighted parts), but if you don’t have the time, here are the main takeaways:
The recommendations in Cobb’s paper were a big inspiration for how I plan to organize the stats book. I want to be part of the “new generation of adventurous authors” that lead students down a new, more exciting statistics curriculum.
My friend Patrick Mineault has been helping me with the book planning process, and the first thing he asked me to do was to come up with a ranked list of topics that I find most important for understanding statistics. Before thinking about the organization and sequencing of topics, it’s useful to start with a vision for what you want to put emphasis on. What is the message of the book? What is the teacher’s vision?
This is the list of “high priority” stats ideas according to me:
Linear models of the form y ~ mx + c, where m is the slope and c is the y-intercept of the best-fit line for the x–y relationship that exists in the data. If readers understand linear models by the end of the book, then I’m calling this mission accomplished. Using linear regression is both very practical and also touches on core ideas like predictions, cross-validation, and model selection, which are very useful for later studies in machine learning. Whatever time we spend on linear models will be time well spent.

Note this list doesn’t show topics in the order in which they will appear in the book, but rather from “most important” to “least important.” The rank of each topic gives you an idea of the “page weight” of the topics—topics higher in the list will have more pages dedicated to them.
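As a teaser of how this could look in code (hypothetical data; the column names x and y are placeholders), fitting such a linear model with the statsmodels formula API takes just a couple of lines:

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data where y is roughly 2*x + 1 plus a bit of noise.
df = pd.DataFrame({"x": [0, 1, 2, 3, 4, 5],
                   "y": [1.1, 2.9, 5.2, 6.8, 9.1, 11.0]})

model = smf.ols("y ~ x", data=df).fit()   # least-squares fit of the line y = mx + c
print(model.params)                       # intercept (c) and slope (m) of the best-fit line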
Here are some additional topics that I would like to weave into the narrative on introductory statistics:
For more about the upcoming book, see the second post, which includes a detailed book outline and PDF previews of several chapters[1,2,3,4], and the feedback form. The No Bullshit Guide to Statistics is still far from finished, but the research stage is done now and I can start writing and preparing stats notebooks with practice problems. You can expect a first draft will be ready by the end of the year, but you don’t need to wait that long to learn stats! In the next section you’ll find links to the best statistics learning resources that exist already so you have the option to start learning stats right now.
In my research and learning about statistics for the past three years, I found some excellent video lectures, visualizations, tutorials, books, and other learning resources. Below is a list of the best ones:
Using scipy, you can learn how to do probability estimation, which is an essential skill in all of statistics (and machine learning).

There is also tea-lang, which is a domain-specific language (DSL) for formulating statistical questions using the classical hypothesis testing framework. Watch this presentation by the tea-lang inventor to get started, then install the code and read the paper. This is really cool stuff. Instead of trying to find the right statistical test from an inventory of all possible stats tests or following a decision tree, why not let tea-lang automatically analyze the “statistical analysis spec” you give it and run all the statistical tests that apply?

In this blog post, I want to share what I’ve learned about generating ePub and Mobi files from LaTeX source files that contain lots of math equations. I feel this ought to be recorded somewhere for the benefit of other STEM authors and publishers who have LaTeX manuscripts and want to convert them to .epub and .mobi formats. Read on to watch the “How it’s made” episode about math eBooks.
The end-to-end book production pipeline looks like this:
Figure 1: The eBook production pipeline described in this blog post. Each box represents a different markup format and the arrows indicate the software used to convert between formats. The hard step is to produce clean .html+MathJax format from the .tex source. The generation of the other formats is standard.
Historically, the primary tool I used to produce the No Bullshit Guide textbooks has always been pdflatex, which is the most common way to produce PDFs in the LaTeX ecosystem. The problem is that the fixed-width layouts of PDF files are not a good fit for mobile screens and eReader devices with varying screen sizes. Modern publishing is all about reflowable formats like HTML and ePub. If we want more people to learn math, we have to make math textbooks readable on mobile.
The solution for ePub production that I came up with looks like this:
The conversion process is fairly complex and depends on tools written in several programming languages. Seriously, there are like 6+ different programming languages used in this pipeline: Python 3.x, Python 2.7 (because one library needs it), Perl, Ruby, JavaScript, and Bash. If this were a collect-them-all contest, I’d be winning big time! It’s a lot of dependencies to take on at once, but it had to be done. Besides, the Rube Goldberg machine is a well-recognized software design pattern.
I’m super proud of the scripts I created for automating this pipeline, but I want it to be clear the real credit for everything you’re about to read goes to the people who developed the tools that I used. I was standing on the shoulders of giants: Michael Hartl (softcover and polytexnic), Alvin Wan (TexSoup), and Kovid Goyal (calibre). These are the people who made all this possible through the open source software tools they maintain. They deserve the real credit—my contribution is simply to connect these tools to build an end-to-end pipeline using some Python scripts.
If you only have 10 more seconds of attention left, the only thing you need to know is to go install softcover (install instructions) and use it for any new projects: it’s the best way of converting math books (.tex source files) to .epub and .mobi. Start with the sample book produced with softcover new --latex mybook and extend the chapters while compiling regularly.
Let’s begin.
Let’s talk a little bit about eBook formats before we get into the technical details. It’s good to know a little bit about the final products we’re trying to produce.
The Kindle file formats (.azw, .azw3, .kf8, etc.) are a family of proprietary formats used on Amazon Kindle devices. There is no way to generate Kindle eBooks directly; instead, the recommended procedure for KDP distribution is to generate an ePub file and let KDP take care of the conversion to their proprietary formats. So for distribution on Kindle devices through KDP, the key is to generate a good-quality, standards-compliant ePub.
The .mobi format (short for Mobipocket) is another proprietary eBook file format that was very popular before the wide adoption of ePub. Early versions of the Amazon Kindle used mobi format internally, so the mobi format is still supported on Kindle devices.
It’s considered good practice for eBook publishers to offer .mobi files in addition to .epub for the benefit of readers who have a Kindle device, since Kindle devices don’t support ePub files natively. Luckily there is an excellent command-line tool for this (ebook-convert) that comes with every installation of Calibre.
Using ebook-convert is really simple. If you have a source file book.epub, and you want to convert it to .mobi format, you can run
ebook-convert book.epub book.mobi --mobi-file-type both
and you’ll end up with a file that is readable by all Kindle devices.
This is the most widely-supported format for eBooks and the main focus of our efforts here. The ePub file format is based on web technologies like HTML and CSS, and it is codified as an IDPF standard. An .epub eBook is a self-contained container of HTML, CSS, images, and metadata that is packaged as a .zip file. If you have an ePub file, you can change its extension from .epub to .zip and unzip it to look at the contents. You can also use an ePub editor like Sigil that lets you “view source” and modify ePub files directly without the need to unzip them first.
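If you prefer not to rename files by hand, the same inspection can be done with a few lines of Python, since an .epub really is just a zip archive (the file name below is a placeholder):

import zipfile

# List the HTML, CSS, images, and metadata files packed inside the ePub.
with zipfile.ZipFile("book.epub") as epub:
    for name in epub.namelist():
        print(name)   # e.g. mimetype, META-INF/container.xml, OEBPS/chapter1.html, ...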
The use of HTML for content is—in principle—a good thing. In practice, different readers have different levels of support for markup, styling, and media playback, and very few ePub readers support the possibility of running scripts. For this reason, we take a conservative approach and target the basic ePub v3 format, without taking advantage of modern features of the web platform like SVG images, audio playback, and interactive elements. I’m looking forward to exploring these advanced features of ePub3 in the future, but for maximum compatibility, I will avoid such niceties for now and assume the “client” is a bare-bones eReader device that only supports basic HTML, CSS, and images.
So how do you build an ePub? Theoretically you just prepare your book using HTML markup and you’re done. Well, not quite. When I said the ePub format is based on web technologies like HTML and CSS, it’s a bit misleading, since that covers only the content pages—the chapters of the book. Additionally, a standards-compliant .epub file must also specify the book metadata (content.opf) and structure (toc.ncx). There are several “ebook frameworks” like pandoc, Sphinx, gitbook, etc. that can be used to produce ePubs. If your book doesn’t contain math equations, I would recommend one of these well-established frameworks to “outsource” the complexity of generating the ePub. See this blog post for example.
However, in the case of the No Bullshit Guide textbooks, we’re starting with 1000+ pages of LaTeX source files containing A LOT of equations.
Statement of the problem: Generate a good-looking ePub file from LaTeX source files that include equations, figures, tables, and custom macros for exercises and problems.
The occurrence of the word “custom” should give you a hint that things will get interesting soon…
Let’s now talk about the software tools that perform the specific transformation from .tex source to .epub file. In order of importance these tools are softcover, calibre, TexSoup, and fab-classic. We start with the most important parts first: the tools that are broadly reusable in any context; we defer the technical details about the specific transformations we perform until the second half of this blog post.
Enter the first protagonist, the softcover project. Softcover is a framework for producing eBooks from markdown or LaTeX source files. You run a single command (softcover build) to build all kinds of book formats including PDF, ePub, and Mobi. It is a beautiful thing written in Ruby.
As soon as I learned about this project I knew it was the way forward. I had previously tried several other tools for generating ePub files from LaTeX, which worked well for short articles and blog posts, but failed on longer texts like the No Bullshit Guide textbooks. All the other approaches I tried before softcover were inferior in one way or another, and none of them supported all the markup needed for the No Bullshit Guide textbooks. Softcover supports 90% of the markup I need straight out of the box, and the missing pieces can easily be added.
So softcover is good; however, there was one big problem: the fact that I don’t know anything about the Ruby ecosystem. What is a gem? What is a bundle? What am I supposed to rake, exactly? It’s not an easy feeling being the “clueless beginner,” and it reminded me of the early days when I was learning to code and couldn’t even get the basic dev setup working, let alone program anything. But softcover is so good that—prior Ruby knowledge or not—I knew this was the tool for the job.
Restatement of the problem: learn enough Ruby to use softcover and customize the code to support LaTeX macros used in the books for figures, exercises, and problems.
The way softcover works internally is based on a subset of LaTeX called PolyTeX, which gets processed to produce HTML files, and from there ePubs. The softcover LaTeX format (PolyTeX) includes 90% of the LaTeX macros I use in the books (inline math, displayed equations, figures, tables, references, etc.), so all I have to do is pre-process the book source files to make them softcover-compatible. Sounds simple, right?
Enter the second protagonist in this story: the LaTeX processing library called TexSoup created by Alvin Wan. The name TexSoup is an analogy to the popular library for processing HTML markup called BeautifulSoup. In case you’re wondering about the etymology, the term “soup” refers to the messy markup of most web pages, which is usually a jumbled soup of tags that are difficult to parse by hand and require a soup-parsing library to help with this task.
I had previously attempted to make the books’ source files softcover-compatible several times using ad-hoc scripts based on sed, Python regular expressions, manual annotations, patches, Perl cleanup scripts, and several other hacky solutions. None of these prior attempts worked well, since parsing LaTeX with regexes is inherently a lost battle.
In contrast, the TexSoup library is capable of parsing LaTeX source files “logically” and allows for rich TeX manipulations, similar to what BeautifulSoup allows for HTML. You can find particular elements using the find and find_all methods, and make content manipulations programmatically using a civilized API. What more can one ask for than a Python library that does exactly what you need to do!
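To give a flavour of what this looks like in practice, here is a minimal sketch (the LaTeX fragment is made up for illustration; consult the TexSoup docs for the full API):

from TexSoup import TexSoup

# Parse a small LaTeX fragment and query it like a DOM tree.
soup = TexSoup(r"""
\section{Vectors}
A vector $\vec{v}$ has two components.
\begin{equation}
  \vec{v} = (v_x, v_y)
\end{equation}
""")

print(soup.find("section"))                  # the \section command and its argument
print(len(list(soup.find_all("equation"))))  # count the equation environments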
When I discovered the TexSoup library, I immediately recognized it as the right tool for the task of transforming the book source files to softcover-compatible format. The combination of TexSoup for transformations and Softcover for generating eBooks is the right way to do this. All I had to do was write some scripts to combine these two tools.
There is an unspoken strategic objective that motivates this entire effort. I want the eBook production pipeline to be fully automated: from .tex to .epub in one command. Having an automated pipeline is essential to support the process of continuous improvement (Kaizen textbooks), which is a central ethos of the Minireference Publishing Co. All the books are continuously updated based on readers’ feedback: typo fixes, polishing rough explanations, adding links to external learning resources, etc. If producing the ePub from the source files required any time-consuming manual steps, then the ePub files would be second-class citizens, since they would not be updated regularly like the print books, which is something we want to avoid.
The tool I used to automate the process of .tex source transformations and subsequent eBook generation is called Fabric (specifically fab-classic, which is a fork of the Fabric project that maintains backward compatibility with the original API). Any other automation library could be used to achieve the same results, but I chose Fabric because I have the most experience with this zero-magic Python scripting framework. If you know what a Makefile is, then you’ll easily recognize how fabfile.py works.
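For readers who haven’t seen Fabric before, a fabfile is just a Python file full of tasks you can run from the command line. A minimal sketch (the task names and the helper script are placeholders, not the actual ones in the repo) might look like this:

# fabfile.py -- minimal sketch of build automation with fab-classic
from fabric.api import local, task

@task
def transform():
    """Convert the LaTeX source files to softcover-compatible PolyTeX."""
    local("python transform_tex.py")   # hypothetical TexSoup-based script

@task
def epub():
    """Build the ePub from the transformed sources."""
    local("softcover build:epub")

Running fab transform epub from the command line would then execute both steps in sequence.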
So this is where I was on October 12: I had the tools ready and the plan figured out, but I still needed to do the work…
In the remainder of the blog post, we’ll give some more technical details about the TexSoup + Softcover automation scripts for transforming LaTeX source files to ePub. My goal is to write a complete “walkthrough” for the code in the sample-book repository.
¡Vamonos!
The term “data pipeline” is a fancy way to talk about a sequence of transformations you apply to some data: take this input format, transform it to another format, then output a third format. In this case, the pipeline takes LaTeX source files as inputs, transforms the LaTeX to XML, then does some further transformations to finally output the HTML files that get packaged into an ePub file. In the earlier parts of the blog post we introduced the nouns (file formats) and the verbs (software); now it’s time to put together a whole sentence.
The first step of the pipeline requires processing all the LaTeX macros and styles used in the Minireference books (see the tex header file), and transforming them to the softcover-compatible PolyTex format.
The No Bullshit Guide textbooks are mostly written in “standard” LaTeX. There is a common header file called 00.minirefrence.hdr.tex that defines the page setup, font selection, and a few custom macros like \eqdef for ≝ (read “is defined as”) and \sfT for the matrix transpose symbol.
The Minireference header file also defines some custom environments for the exercises and problem sets in the book, which are stored in separate files that get included in appropriate locations of the main text. Each problem consists of a question, an answer, and a solution. These three parts get treated differently: the question appears in the main text, while answers and solutions are sent to special files (e.g. answers_ch1.tex and solutions_ch1.tex respectively). These files get included in the Answers and Solutions appendix. The answers LaTeX package and these custom macros enable this.
The figures in the book consist of a mixture of concept maps, charts, plots, force diagrams, and other illustrations. As part of previous updates to the books (2019), I did a lot of work to replace my ugly hand-drawn diagrams with beautiful vector graphics generated with TikZ (many thanks to Robyn Thiessen-Bock, who led this effort). Each of these figures is created from a separate .tex source file based on the standalone document class, and each time you build the PDF version of a figure, a .png file with the same name is also generated (this will become important later on).
Overall the books are pretty much “vanilla” LaTeX with very few macros and customizations. That’s the whole point of LaTeX: the basic book document class is so powerful that you don’t need to do anything custom to produce beautifully typeset books.
The subset of the LaTeX syntax that is supported by the Softcover framework is called PolyTeX. PolyTeX covers 90% of the macros used in the No Bullshit Guide textbooks, so right out of the box, tex-compiling the books’ source files with Softcover gives a pretty decent result.
However 90% is not 100% and we have to do some transformations to the source files to make them work. The specific things I had to do to get to 100% are:
The \label must be placed inside the figure’s \caption in order for references to work right. No problem, this is a five-liner in TexSoup.

It took me about a week and a half of intense coding to put together all these conversion procedures required to transform No Bullshit Guide LaTeX to softcover-compatible PolyTeX.
The concept of an “ETL job” is a standard jargon term in the enterprise world, referring to the process of “ingesting” (extracting) data from an external system and “injecting” it into your company’s system (loading). The source data is rarely in the format expected by your internal systems, hence the need for the transformation step in the middle.
Extract-transform-load pipelines are a useful design pattern to manage the complexity of dealing with data in general. First you extract the source data in whatever format you can get, then you do one or more transformations, and finally you output the format expected for further processing. Here are the four steps of the pipeline I created:
That’s it. Half the work is already done—we just have to hand off to Softcover for the rest!
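To make the extract-transform-load idea concrete, here is a heavily simplified sketch of what such a driver could look like (the file paths and the transformation step are hypothetical; the real scripts in the sample-book repo do a lot more):

from TexSoup import TexSoup

def to_polytex(tex_source):
    """Return a softcover-compatible version of the given LaTeX source."""
    soup = TexSoup(tex_source)   # extract: parse the LaTeX into a tree
    # transform: move labels into captions, rewrite custom macros, etc. (omitted)
    return str(soup)             # load: serialize the tree back to .tex text

with open("chapter1.tex") as fin:                  # hypothetical input file
    polytex = to_polytex(fin.read())
with open("chapters/chapter1.tex", "w") as fout:   # hypothetical softcover chapters dir
    fout.write(polytex)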
This transformation is all taken care of by the softcover framework, which is based on the polytexnic library, which in turn uses the Tralics LaTeX-to-XML converter. The softcover pipeline for producing HTML looks like this:
Figure 2: The internal transformation steps used by Softcover.
Typical users of the softcover framework don’t need to understand all the inner workings. I’m writing about this only because I think it’s a marvelous feat of engineering. Big up Michael Hartl for inventing this!
The steps of this transformation are:
The result of Transformation 2 is a set of files in HTML+MathJax format that can be opened in any browser.
The HTML+MathJax format we obtained in the previous section is not suitable for use inside an ePub because it requires JavaScript to render the math equations. The workaround for this is to pre-render all the math equations as images. This sounds simple in theory, but is quite a complicated process in practice. The following quote summarizes the process very well:
The real challenge is producing EPUB and MOBI output. The trick is to (1) create a self-contained HTML page with embedded math, (2) include the amazing MathJax JavaScript library, configured to render math as SVG images, (3) hit the page with the headless PhantomJS browser to force MathJax to render the math (including any equation numbers) as SVGs, (4) extract self-contained SVGs from the rendered pages, and (5) use Inkscape to convert the SVGs to PNGs for inclusion in EPUB and MOBI books. Easy, right? In fact, no—it was excruciating and required excessive amounts of profanity to achieve. But it’s done, so ha. — Michael Hartl
It’s almost a pipeline within a pipeline! The steps described are all carefully chosen for a reason:
Some of the trickier parts of the process involved the “smart” vertical placement of inline math equations so the math font baseline matches the surrounding text font, choosing the right resolution for the SVG-to-PNG conversion (too small and the math looks pixelated, too large and the ePub files end up being 20MB+), and some accessibility considerations to make sure math images contain the LaTeX source as alt tags. It’s a pretty crazy little pipeline, but it works!
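For instance, the SVG-to-PNG step with its resolution trade-off might look roughly like this (assuming Inkscape 1.x is available on the PATH; the directory name and dpi value are made up for illustration):

    import subprocess
    from pathlib import Path

    DPI = 300  # too low looks pixelated, too high bloats the ePub file size
    for svg in Path("ebooks/images/equations").glob("*.svg"):
        subprocess.run([
            "inkscape", str(svg),
            "--export-type=png",
            f"--export-filename={svg.with_suffix('.png')}",
            f"--export-dpi={DPI}",
        ], check=True)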
I hope this blog post has been useful for people who are looking for ways to convert their math books from LaTeX to ePub. It’s not a simple process by any means, but we have to work on it so that math can be a first-class citizen in the new digital books economy. Seeing math equations as ugly images in ePubs published by big publishers made me very sad and discouraged—if big-name publishers can’t make decent-looking math ePubs, then what chance do I have?
This is why I’m so happy to have discovered this “civilized” way of producing .epub/.mobi from .tex files and why I wrote this long wall of text to share the info with y’all. There is a bigger lesson to learn here: never despair—sometimes an independent publisher with a team of one can end up ahead of the curve in eBook production technology. With access to FOSS (free and open source software) tools and libraries anything is possible!
Speaking of open source… the toolchain I built is very specific to my use cases, but I’m still open sourcing the sample-book repo where all these scripts were developed, in case other authors want to follow in my footsteps for their LaTeX to ePub conversion needs. This is the power of community and sharing. Even complicated tasks can be tackled when we work as a group, and people from around the world can collaborate on these projects for their own motivations and incentives. We’re literally building printing presses. That’s a beautiful thing.
For readers wondering “Where do I get them math books?” right now, here is a list of the different options:
In this blog post I’ll summarize the changes to the book and provide some more context about why these changes were needed. I’ve included links to PDF excerpts of all the new sections[1,2,3,4] and a detailed red-blue diff that shows all edits.
For an exhaustive list of all the changes between v2.1 and v2.2 of the book, see this file diff_noBSLA_v21_v22.pdf (102 pages). The text shown in red has been removed, while the text shown in blue has been added.
As you can see in the above file, there were a lot of edits and improvements to the book. Based on readers’ questions and issue reports, I know which parts of the book need clarifications, and I implement those fixes to the source. I call this process “Kaizen for textbooks” (改善) and I plan to continue making such improvements and release updates v2.3, v2.4, etc. This is the power of print-on-demand publishing: if I update the print source files today, readers who purchase the book tomorrow will get the latest version printed for them.
If you purchased the book from gumroad, use the link you received by email to access the latest files. If you have a print copy of the book, you can send me a proof of purchase (e.g. picture or receipt email) to combodeal@minireference.com and I’ll send you the latest eBooks. There is probably no point in purchasing a new print copy for yourself, but you should consider buying a copy for a friend or family member who is interested in linear algebra (holiday season coming up, you know).
New readers can get the book in print or digital form from the following links:
In case you’re not familiar with the No Bullshit Guide to Linear Algebra, it’s a book that covers all the topics of linear algebra and also includes a review of prerequisites from high school math. The high school math review makes the material accessible for readers who have been out of school for a long time.
The book also includes hundreds of pages on the applications of linear algebra like balancing chemical equations, circuits, computer graphics, Fourier transformations, cryptography, error correcting codes, and even a chapter on quantum mechanics. Yes, that’s right—when you learn linear algebra a lot of new doors will open for you.
Check out the amazon reviews for the book to see what others are saying and check out the extended book preview (152 pages). You can also download the concept maps from the book: linear_algebra_concepts.pdf (3 pages).
]]>In this blog post I’ll give an overview of the changes to the book with additional comments about the importance of each change. I’ve included links to PDF excerpts of all the new sections[1,2,3,4,5,6] as well as a detailed red-blue diff that shows all edits. The goal is for the PDFs linked from this blog post to contain the complete “patch” information for readers who have an older version of the book.
The book is now available as .epub and .mobi files. Making the book available in a “reflowable” format that works on all devices (mobile and eReaders) is something that I’ve been promising readers for a very long time. I’m very happy to finally be able to keep this promise after all these years.
Generating decent-looking ePub and Mobi files from LaTeX source full of math equations wasn’t easy, but after many coffees and late nights I managed to build a pipeline based on TexSoup, Softcover, and Calibre that works great. See this tech blog post for details about this new eBook generation pipeline.
The high school math chapter received some much-needed revisions to smooth over several parts that were not explained well:
Most of these fixes and clarifications are the result of comments and questions I’ve received from readers. Having an open communication channel with readers (email: ivan@minireference.com) has been tremendously useful since they tell me which parts of the book are confusing and in need of further clarifications. If there’s anything I’ve learned about startups over the years, it’s that listening to the users is a good thing.
The recent work on a French translation of the book was another source of feedback and improvements. Working closely with the translator (a math professor) was an amazing experience, as he “called bullshit” on several missing definitions, imprecise analogies, and hand-wavy explanations. I didn’t expect this, but it turns out the translation process is an excellent “test” for explanations: if an explanation is not 100% clear it won’t translate well. All “weak spots” in the original get magnified by the translation process and it makes it easy to spot the places in the original that need more work. I wrote a blog post about this “authoring hack” and others, see Multilingual authoring for the win.
The combination of fixes from years of user-testing by readers and the recent rigor checks by a real mathematician makes Chapter 1: Math fundamentals pretty solid now. Issue closed.
For a detailed red-blue diff of all the changes between v5.3 and v5.4, see this file diff_noBSmathphys_v53_v54.pdf (250 pages). The text shown in red has been removed, while the text shown in blue has been added. (Tech sidenote: this word-level diff was generated using latexdiff, which is an amazing tool that I recommend for anyone writing in LaTeX. More generally, anyone using git for text files can use the command git diff --color-words to get a similar word-level diff instead of the default line-level diff.)
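For anyone who wants to produce this kind of diff for their own project, the basic latexdiff workflow looks something like the following (the file names are illustrative):

    import subprocess

    # latexdiff writes a marked-up .tex file to stdout in which deletions
    # appear in red and additions in blue; compile it to get the diff PDF.
    with open("diff_v53_v54.tex", "w") as out:
        subprocess.run(["latexdiff", "main_v53.tex", "main_v54.tex"],
                       stdout=out, check=True)
    subprocess.run(["pdflatex", "diff_v53_v54.tex"], check=True)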
Many readers of the book are adult learners interested in (re)learning math and physics for personal interest and not students studying for a specific course or exam. I’ve been asked repeatedly to create an accompanying online course to provide some structure and accountability for independent learners’ journeys through the book. The book covers the material from three university-level courses, so it’s understandable that an external support structure would be useful!
I’m working on video tutorials, an offline self-directed course pack, email lessons, and maybe even a mobile app of some sort, but these projects will take time to research, develop, and ship. Rest assured the query SELECT * FROM EdTech WHERE license="free" AND good=TRUE; is continuously running and there are some really good matches you’ll be hearing about soon[K,M]. Expect good things in 2021. In the meantime, I have a “hack” for you that provides some of the benefits of an online course.
Gamify-it-yourself: There are 500 pages to read in the book, organized into five chapters, and each chapter is like a stage in a computer game. The concept maps in the front of the book show the concepts and topics you need to learn to complete each stage. Print the concept maps and post them somewhere visible. Use a pen or pencil (old-school EdTech) to represent your current state of knowledge about each term in the concept map using the following criteria:
Add a single dot (●) next to all concepts you’ve heard of, two dots (●●) next to concepts you think you know, and three dots (●●●) next to concepts you’ve used in exercises and problems.
Basically, you can gamify your learning process using a piece of paper and some metacognitive skills. Who better to reward your curiosity, learning, and practice than yourself? Trust me, there is no intelligent tutoring system or machine learning model out there that is anywhere close to what you can do.
The goal of the dot system is to give partial credit for all types of learning: heard about concept X, understood concept X, and know how to apply concept X. To get all the points you must not only read about and understand the concepts from all five chapters, but also practice using these concepts. It might take a month or two, but at any point in the “game” you can check your learner profile and know what is coming next, and how much is left.
If you purchased the book from gumroad, use the link you received by email to access the latest files. If you have a print copy of the book, you can send me a proof of purchase (e.g. picture or receipt email) to combodeal@minireference.com and I’ll send you the latest eBooks. There is probably no point in purchasing a new print copy for yourself, but you should consider buying a copy for a friend or family member who might benefit from having some math and physics in their life (holidays coming up).
New readers can get the book in print or digital form from the following links:
The No Bullshit Guide to Math & Physics is the ideal book for last-year high school students, first-year university students, and adult learners who want to learn calculus and mechanics. The book explains all the material required for first year science courses in an easy to understand, conversational tone. Don’t take my word for it though, check the amazon reviews to see what other readers have said, and check out the extended preview and sample chapter (170 pages) to see for yourself.
Note: if you’re interested in (re)learning high school math topics, but don’t want to go all the way to mechanics and calculus, then you should get the No Bullshit Guide to Mathematics (print or digital), which covers only the essential math fundamentals (Chapter 1 and Chapter 3 of the Math & Physics book). The green book is great for high school students and parents of such.
]]>Before we begin with the “How it’s made” episode, let me show you some examples of the final product. I have selected the best four “backports” — explanations that now exist in the English version thanks to the additions in the French version.
Continuez à lire si ça a l’air intéressant. Read on if you’re interested.
The No Bullshit Guide to Mathematics (a.k.a. the green book) is a short summary of all the essential topics from high school math intended for adult learners. Last year, by sheer luck and good fortune, I was introduced to Gerard Barbanson who offered to translate the book to French. Gerard is a professional mathematician, a native French speaker, and has also taught math in English for many years, which makes him the perfect translator. Gerard is leading the translation project and provides lots of useful feedback and improvements for the text.
Look out for a follow-up blog post and announcement about the release of the French translation (in a few months). This blog post is not about that, but about the benefits the translation effort has brought to the original English version.
While reviewing Gerard’s “first pass” of translation, I kept noticing spots where the explanations didn’t work well. My initial reaction was that this was a bug in the translation, but every time I looked into a passage, I realized the problem existed in the original English text, and the translation only magnified the problem and made it more noticeable. Examples of “weak spots” include paragraphs that are too conversational (i.e. no content), missing definitions, and explanations that are unclear or confusing.
I found this process to be extremely useful. Even though I’ve read and reread the English version many times, I never noticed these weak spots until now. The translation process highlighted the lack of clarity in certain specific parts and forced me to think of ways to fix these explanations. Essentially, if an explanation is good, it will “survive” the process of translation, but if it’s not 100% solid and clear, then it turns into “noise” at the end of the process.
We can think of translating explanations as a communication scenario, where the source language (English) is the transmitter, and the target language (French) is the receiver. The process of translation adds “noise” in the form of ambiguities, so the received signal is a degraded version of the original signal. The French translation will be good only if the original English explanation is really solid and clear. This puts additional pressure on the original English version to be extra clear and precise.
Another benefit that came out of the translation work has been the focus on the consistent use of terminology and notation. For the most part, mathematical concepts translate well between English and French, but sometimes French has more precise terminology available. For example I’ve adopted the precise terminology of source set and target set to refer to the sets that appear in the function “type signature,” which are distinct concepts from the function’s domain and image.
One of the core responsibilities of any math teacher is to use precise and consistent language to describe mathematics, including choosing the simplest terminology when the complicated terminology is unnecessary, but not shying away from the “real math” terms when they help illuminate the concepts. Working with Gerard to explicitly establish our conventions for the French version forced me to also be consistent in the English version as well.
I guess that’s not too surprising—using consistent terminology and notation is just best practices.
Perhaps the most surprising thing I noticed from the translation project is the amazing efficiency of developing English and French explanations in parallel, sometimes aided by Google Translate. This was most apparent in writing the new sections on polar coordinates and vectors. Normally writing a new section would take me days, going through several mediocre versions, rereading on paper, and slowly converging to a decent narrative. I noticed the new sections I added over the holidays converged to a “quality product” much faster. Here is the process I followed:
After a few cycles of going between the English and French version, I saw clear improvements from the initial English version and of course the French version was improving in tandem.
I guess the thing that makes me excited about these “authoring hacks” is the fact that they allow me to go one step deeper in the process of continuous improvement of the books. I’ve read and reread the text at least a dozen times, worked closely with my editor Sandy Gordon to iron out all the major flaws and acted on feedback from readers to fix confusing passages, but at some point I get tired and start to let things go. I say to myself things like “yeah this is not the clearest explanation, but it’s kind of OK as is.” This is partially out of laziness, but also because of the law of diminishing returns: sometimes rewriting makes things worse!
There is a famous quote that says:
“There is no great writing, only great rewriting.”
― Louis D. Brandeis
I know this is good advice, but it’s hard to adhere to it. After five editions of the math book, I find it difficult to motivate myself to rewrite things, even if I know there is still room for improvement. That’s why I’m always looking for hacks that can help with the process (see for example the text-to-speech proofreading hack). The translation work of the past few months gave me the impetus to do more productive rewriting without it feeling like a chore. Look out for the updated No Bullshit Guide to Mathematics v5.4 coming soon in both English and French. Sign up for the mailing list if you want to be notified.
]]>