Let’s talk about the problems with the teaching of statistics. Understanding statistics is essential for many fields of academic research, and also useful in industry. Why do first-year statistics courses suck so badly? Conceptual understanding of statistics ideas seems to improve only marginally after taking a STATS 101 course. Is this because statistics is a really difficult subject to teach, or are we teaching it wrong?

I’ve been looking into this question for the last three years and I finally have a plan for how we can improve things. I’ll start with a summary of the statistics curriculum—the set of topics students are supposed to learn in STATS 101. I’ll list all the topics of the “classical” curriculum based on analytical approximations like the t-test, which is the approach currently taught in most high schools and universities around the world.

The “classical” curriculum has a number of problems. The main problem is that it’s built on difficult-to-understand concepts, and these concepts are often presented as procedures to follow without explaining the details. The classical curriculum is also very narrow: it covers only the slim subset of statistical analyses that can be reduced to math formulas and applied blindly by plugging in the numbers. By the end of the introductory stats course, students know a few “recipes” for statistical analysis they can apply if they ever run into one of the few scenarios where a recipe applies (comparison of two proportions, comparison of two means, etc.). That’s nice, but in practice it leaves learners unprepared to solve the stats problems that don’t fit the memorized templates, which is most of the problems they will run into in their day-to-day work. The current statistics curriculum is simply outdated: it was developed at a time when the only computation available was simple algebraic formulas for computing test statistics and lookup tables for finding p-values. The focus on formulas and analytical approximations in the classical curriculum also limits learners’ development of adjacent skills like programming and data management. Clearly there is room for improvement here; we can’t let the next generation of scientists, engineers, and business folks grow up without basic data literacy.

Something must be done.

The something I have in mind is a new book, the No Bullshit Guide to Statistics, which is my best shot at improving the teaching of statistics. I’ve researched the tables of contents of dozens of statistics textbooks, read hundreds of stats-explainer blog posts, and watched hundreds of lectures on YouTube, all while trying to make sense of this subject. What is statistics? What are the core statistics topics that everyone should know? Read the rest of this wall of text to find out.

This is part 1 of a 2-part blog post series about the book production process, in the style of the “How it’s made” TV series. In this first “episode” of the series, we’ll talk about the statistics curriculum in general (what we need to teach and how to best teach it). The goal is to produce a complete “bill of materials” (BOM) for the set of statistics topics and concepts that should be covered in the introductory stats course. In the second episode, we’ll talk about the “packaging” problem: how to organize the bill of materials—the O(100) core topics and concepts of statistics—into an easy-to-follow sequence of learning experiences in book form (chapters, sections, subsections, etc.). The second post includes a book progress report and links to the live-updating book outline gdoc, draft chapter PDFs [1,2,3,4], and a form you can use to provide feedback and comments to influence what the final version of the book will look like, specifically, how many math equations and code examples I can use without making the book overwhelming for beginners.

 


 

Components of the statistics curriculum

Statistics is used in physics, chemistry, biology, engineering, business, marketing, education research, international development, epidemiology, medicine, finance, social sciences, and many other fields. Every university has some version of a STATS 101 introductory course, and departments sometimes offer subject-specific “Statistics for X” courses. Statistics concepts are even supposed to be taught in middle school and high school! Clearly the educational establishment thinks that students should be learning statistics, and there are hundreds of textbooks out there that teach the statistics fundamentals.

Which topics and concepts belong in the first course on statistics? To answer this question, I looked at the syllabus documents from several university courses [McG203, McG204, McEPIB, UIC, UWO, UNIPD, UH, SFSU, CEU, NAU, DAL, USU, UFL, McM, UBC, WKU], tables of contents of introductory statistics textbooks [OpenIntro, OSE, ER, Lane, OpenStax, Lavine], online courses [CC], and educational standards [CCSSM.HSS, AP STATS]. The list below includes all statistics topics I encountered during this research. Topics shown in bold are deemed core material (covered in all courses), while topics shown in grey are less common and only covered in certain courses.

  • Probability theory: randomness, probability models, probability axioms, probability rules, sample spaces, counting techniques, random variables, distribution and density functions, discrete distributions, continuous distributions, expectation, conditional probability, Bayes’ rule, independence, multivariate distributions, law of large numbers, central limit theorem, inequalities.
  • Data management: population vs. sample, types of data, data visualization, collecting data, random sampling, biases.
  • Descriptive statistics: measures of central tendency and variability, detecting outliers, correlation analysis, tables, visualization.
  • Statistical distributions: sampling distributions, normal, Student’s t, chi-square, Fisher’s F, estimation of mean and variance, standard error of the mean, using tables to look up p-values.
  • Null hypothesis significance testing (NHST) procedure: inference about proportions and means, Type I and Type II errors, p-values, power, confidence intervals, effect size, practical significance, analysis of one-sample, two-samples, and paired-samples scenarios, analysis of variance, goodness of fit, probability model assumptions, normality test, non-parametric tests, resampling methods.
  • Statistical test recipes: binomial test, chi-square goodness-of-fit, one-sample t-test, two independent samples t-test, Welch’s t-test, paired t-test, one-way ANOVA, chi-square test, Fisher’s exact test, Mann-Whitney test, Wilcoxon rank sum test, Kruskal-Wallis test.
  • Linear regression: ordinary least squares, extrapolation, confidence intervals, regression diagnostics, model checking, correlation vs. causation, multiple linear regression, logistic regression.
  • Research methods: statistical principles, experimental designs, random sampling, random assignment, causal effects, blinding, controls, randomized controlled trials, misuses of statistics, ethics and privacy.
  • Statistics software: statistical computing packages (Excel, Python, R, SAS, SPSS, STATA, jamovi), calculations, interpreting results, simulation tests, permutation tests, bootstrap estimates.

As you can see, the STATS 101 curriculum is pretty packed! Not every stats course covers all these topics and the focus in different courses varies, but overall this is a representative summary. Looking at the above list should give you a general idea of the multitude of ideas and concepts that will be presented to (or should we say inflicted upon) first-year university students.

 

Missing material

Despite the long list of topics normally covered in the first statistics course, there are some important topics that are skipped or covered only superficially:

  • Practical data munging.   Statisticians and data scientists spend a lot of their time extracting, transforming, and cleaning data, which are the essential steps before any statistical analysis can be performed. Students are insulated from the difficulties of “messy” data and given only “clean” data to work with. This decision to deal only with clean data is understandable given how loaded the curriculum is, but it can be counterproductive in the end. By working only with clean, unrealistic data, learners lose contact with the domain and may feel that the procedures they are learning are too “clinical” and far removed from reality.
    • opportunity: introduce practical exercises and projects (half-filled-in Jupyter notebooks) that invite readers to supplement their reading of the book with hands-on practice doing statistical analysis on real data (see the data-cleaning sketch below).
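
As a taste of what such hands-on practice could look like, here is a minimal data-cleaning sketch in Python using pandas. The file name, column names, and plausibility limits are all hypothetical, chosen purely for illustration.

```python
# Minimal data-cleaning sketch; file and column names are hypothetical.
import pandas as pd

raw = pd.read_csv("survey_raw.csv")                       # load the messy data
raw.columns = raw.columns.str.strip().str.lower()         # normalize column names
raw = raw.drop_duplicates()                               # remove duplicate rows
raw["age"] = pd.to_numeric(raw["age"], errors="coerce")   # bad entries become NaN
clean = raw.dropna(subset=["age", "score"])               # drop rows missing key variables
clean = clean[clean["age"].between(18, 99)]               # remove implausible values
print(clean.describe())                                   # sanity check before analysis
```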

 

  • Estimators.  The concept of an estimator (a function computed from a data sample) is rarely defined in introductory statistics courses, presumably because it is seen as too advanced. However, without a formal definition of this fundamental concept, statistical reasoning becomes much harder to understand. For example, the sample mean estimator is the function $g(\mathbf{x}) = \frac{1}{n}\sum_{i=1}^n x_i$. It can be applied to a particular sample $\mathbf{x}=\{x_1, x_2, \ldots, x_n \}$ of observed values, or to a hypothetical random sample $\mathbf{X}=\{ X_1, X_2, \ldots, X_n \}$ that consists of $n$ randomly selected items from the population. Thinking in terms of estimators is essential for understanding the relation between the sample mean $\overline{\mathbf{x}} = g(\mathbf{x})$ computed from the particular sample $\mathbf{x}$ and the sampling distribution of the mean $\overline{\mathbf{X}} = g(\mathbf{X})$ obtained from a theoretical random sample $\mathbf{X}$.
    • opportunity: provide more details about estimators to make stats calculations understandable from first principles (see the simulation sketch below). As the old saying goes: teach learners to compute one estimate and you feed their stats understanding for a day; teach learners about estimators in general and you feed their stats understanding for life.
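
To make the distinction concrete, here is a minimal simulation sketch. The population model (normal with mean 100 and standard deviation 15) and the sample size are made-up values; the point is that the same estimator $g$ is applied to one particular sample and to many hypothetical random samples to reveal its sampling distribution.

```python
import numpy as np

rng = np.random.default_rng(42)

def g(sample):
    """Sample-mean estimator: g(x) = (1/n) * sum(x_i)."""
    return np.mean(sample)

n = 30
x = rng.normal(loc=100, scale=15, size=n)      # one particular sample x
print("estimate from this sample:", g(x))

# Sampling distribution of the mean: apply g to many hypothetical random samples X
xbars = np.array([g(rng.normal(loc=100, scale=15, size=n)) for _ in range(10_000)])
print("mean of the sampling distribution:", xbars.mean())
print("standard error (simulated):       ", xbars.std(ddof=1))
print("standard error (sigma/sqrt(n)):   ", 15 / np.sqrt(n))
```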

 

  • Randomization methods.  Using randomization methods like permutation tests and bootstrap estimation is a much more intuitive way to introduce the sampling distributions used in hypothesis testing than traditional analytical approximation methods like the t-test. There are currently very few stats textbooks that use randomization methods [ISRS, IMS, CEKST, ISI], and this modern approach is not yet widely adopted.
    • opportunity: join the wave of modern stats teaching by introducing statistics concepts through hands-on resampling methods (see the permutation-test sketch below). More details about this in the next section.
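
Here is a minimal sketch of a two-sample permutation test for a difference in means, with Welch’s t-test shown alongside for comparison. The group measurements are made-up numbers, purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
groupA = np.array([72.1, 68.4, 75.0, 70.2, 69.8, 74.3, 71.5, 73.0])
groupB = np.array([66.9, 70.1, 64.5, 68.0, 65.3, 69.2, 67.4])

observed = groupA.mean() - groupB.mean()
pooled = np.concatenate([groupA, groupB])

P = 10_000
count = 0
for _ in range(P):
    rng.shuffle(pooled)                                   # reshuffle the group labels
    diff = pooled[:len(groupA)].mean() - pooled[len(groupA):].mean()
    if abs(diff) >= abs(observed):
        count += 1

print("permutation test p-value:", count / P)
print("Welch's t-test p-value:  ", stats.ttest_ind(groupA, groupB, equal_var=False).pvalue)
```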

 

  • Bayesian statistics.  This topic is largely considered out of scope for most introductory statistics courses, and the classical, frequentist approach is preferred. It’s not that Bayesian thinking is more complicated (arguably it’s simpler), but students are already exposed to so much material (descriptive statistics, hypothesis testing, pre-packaged statistical analysis recipes, etc.) that there is no room left to introduce another way of thinking (the original Laplace shit).
    • opportunity: take advantage of the conceptual framework built so far (DATA + PROB + ESTIMATORS + DECISIONS) to revisit the same statistical concepts on a Bayesian substrate.
    • opportunity: Bayesian reasoning is a key idea that students will need for advanced studies in statistics and machine learning, so even a superficial intro would be worth having (see the Beta-Binomial sketch below for a taste).
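
For a taste of how simple the Bayesian mechanics can be, here is a minimal sketch of Bayesian updating for a proportion using the Beta-Binomial conjugate pair. The prior and the observed counts are made-up numbers for illustration.

```python
from scipy import stats

a_prior, b_prior = 1, 1          # Beta(1,1) prior, i.e. uniform over [0, 1]
successes, trials = 13, 20       # hypothetical observed data

# Conjugate update: posterior is Beta(a_prior + successes, b_prior + failures)
posterior = stats.beta(a_prior + successes, b_prior + (trials - successes))

print("posterior mean:       ", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))
```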

 

  • Practical computer skills. Teaching randomization methods and Bayesian statistics requires some computer skills. Statistics educators have been trying to avoid this requirement by using various graphical stats software packages that let learners do statistical analysis using point-and-click interfaces, see jamovi for example. This is like giving learners pre-assembled toys to play with—they are very nice, but you have to read the user manual to learn how to use them. Life would be so much simpler for learners (and for teachers too!) if we instead used a general-purpose programming language like Python to carry out the calculations. Yes, some basic programming skills will be required, but anyone can learn Python in a week or two. Learning the basic functions and methods relevant for statistical analysis is like giving learners the LEGOs and letting them build the stats toys on their own.
    • opportunity: using Python code examples to explain stats concepts can be a very good teaching device, since code examples are concise yet describe the steps in calculations and procedures precisely (see the sketch after this list).
    • opportunity: prepare a GitHub repository with lots of stats exercises and problems in the form of partially-filled Jupyter notebooks that learners have to complete by writing Python code. Examples: descriptive statistics, permutation test, and t-test.
    • opportunity: embed an intro-to-Python tutorial in the text through just-in-time explanations of what is going on in the code throughout the book, supplemented by a focused intro-to-Python tutorial in an appendix. It’s not a programming book, but it is a programming book if you want it to be.
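
As a hint of what those “LEGO bricks” look like, here is a minimal sketch of descriptive statistics computed with numpy and scipy. The data values are made up for illustration.

```python
import numpy as np
from scipy import stats

data = np.array([4.3, 5.1, 3.8, 4.9, 5.6, 4.1, 4.7, 5.0, 4.4])

print("mean:              ", np.mean(data))
print("standard deviation:", np.std(data, ddof=1))         # sample std (n-1 denominator)
print("median:            ", np.median(data))
print("quartiles:         ", np.percentile(data, [25, 50, 75]))
print("std error of mean: ", stats.sem(data))               # s / sqrt(n)
```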

 

  • Math and calculus prerequisites.  Most first-year statistics courses assume learners have the necessary background knowledge of high school math concepts (equations, algebra, functions, set theory) and calculus (mostly summations and integrals). Students who lack this prerequisite knowledge will find some of the material very difficult to pick up—if you don’t understand the math, then stats will look very complicated! In particular, students in the social sciences who may have intentionally avoided courses that require math will face a moment of reckoning. Similarly, adult learners with “rusty” math skills may find the mathy parts difficult to follow.
    • opportunity: include relevant prerequisite materials from the noBSmathphys and noBSLA books (in condensed form) to make the stats book self-contained and accessible even for readers with no math background. Hey, it’s not plagiarism if you wrote it!

 

Alternative curriculum based on randomization

I recently found an excellent paper on stats pedagogy titled The Introductory Statistics Course: A Ptolemaic Curriculum by George Cobb, which has been instrumental in shaping my understanding of statistics teaching:

 

[Figure: thumbnail of the first page of Cobb’s paper]

I really like this paper because it puts the emphasis on a learner-centric perspective. What are the conceptual leaps we’re asking students to make when learning statistics? How many of the formulas shown to learners are explained properly? Are students getting the big picture or blindly memorizing procedures?

I encourage you to read the whole paper (especially the highlighted parts), but if you don’t have the time, here are the main takeaways:

  • The main “uphill” in the classical curriculum comes when sampling distributions are introduced: students are taught a formula for computing the t statistic and shown how to look up the corresponding p-value in a table. The average learner is lost as to the meaning of the procedures they are following. A concept as complex as the sampling distribution takes a while to absorb, and it’s not realistic to expect learners to grasp it right away when it is first introduced, even if we draw a nice picture.
  • We must take the t-test off its pedestal and replace it with more intuitive resampling methods (e.g. permutation test to compare means between two samples).
  • The correct term to describe the current “canon” of frequentist statistical test recipes taught to students is “analytical approximations.” This naming change was a big revelation for me! Behind most statistics formulas (confidence intervals, effect size, power), there is often some step in the calculation that requires normality assumptions or uses some sort of handwavy argument, so approximation is the correct way to describe how these formulas are derived. There are very few “exact” results in statistics. Statistics is messy, so we must make approximations. Sidenote: I love the derogatory connotation of a label containing “approximations,” since it’s a perfectly targeted insult for theoreticians.
  • Assumptions like population normality are pervasive in the traditional statistics curriculum, but the everything-is-normally-distributed paradigm is not representative of the real world. Normality is good and all, but it can only take you so far. Data methods (resampling) are distribution agnostic and much more widely applicable.
  • There is nothing wrong with showing students the classical statistical recipes like the t-tests, but these need to be presented as secondary characters to leave room for more important statistical reasoning principles to take center stage (estimation, confidence intervals, decision theory, interpretations).

The recommendations in Cobb’s paper were a big inspiration for how I plan to organize the stats book. I want to be part of the “new generation of adventurous authors” who lead students down a new, more exciting statistics curriculum.

 

Stack-ranked topic priorities for the upcoming book

My friend Patrick Mineault has been helping me with the book planning process, and the first thing he asked me to do was to come up with a ranked list of the topics that I find most important for understanding statistics. Before thinking about the organization and sequencing of topics, it’s useful to start with a vision for what you want to put emphasis on. What is the message of the book? What is the teacher’s vision?

This is the list of “high priority” stats ideas according to me:

  • Probability distributions (probabilistic modelling). This is by far the most important piece of “math technology” learners of statistics need to be able to take away from the book. We’ll not only introduce theory topics, but spend hundreds of pages doing probability calculations and using probabilistic models for statistics applications. I want readers to be black-belt-level good at probability by the end of the book.
  • Estimators and their sampling distributions. This is the cornerstone of statistics and I want students to understand it really well. Computing expectations of the form $\mathbb{E}_{\mathbf{X}}$, where $\mathbf{X}=\{ X_1, X_2, \ldots, X_n \}$ is a random sample consisting of $n$ identically distributed random variables $X_i \sim \textbf{model}(\theta)$ drawn from the model $\textbf{model}$ with parameters $\theta$, is the main probability idea students need to become acquainted with in order to do well in statistics.
  • Random sampling. The idea of random sampling (each member of the population has the same chance of being selected in the sample) is a crucial bridge that allows us to infer information about wider populations. Without careful attention to the data collection protocols, it is very easy to break the “random selection” assumptions (sampling bias, selection bias, …). Readers must be warned about the common traps, so they can watch out for them.
  • Linear regression. A scatter plot of $(x,y)$ values, representing some input variable $x$ and observations of an output variable $y$, is a great way to visualize simple relationships. A linear regression model for this data has the form $y \sim mx + c$, where $m$ is the slope and $c$ is the y-intercept of the best-fit line for the xy relationship in the data. If readers understand linear models by the end of the book, then I’m calling this mission accomplished. Using linear regression is both very practical and also touches core ideas like prediction, cross-validation, and model selection, which are very useful for later studies in machine learning (see the code sketch after this list). Whatever time we spend on linear models will be time well spent.
  • Confidence intervals.  All self-respecting statisticians provide not only point estimates (specific values) but also “error bars” that describe the variability of the estimate. Confidence intervals provide probabilistic guarantees: under the sampling distribution of the estimator computed from the random sample $\mathbf{X}$, the confidence interval will contain the unknown value 95% of the time. Calculating estimates and confidence intervals using different techniques is an important skill I would like all my students to have.
  • NHST procedure. The pinnacle of the old-school statistics curriculum is the Null Hypothesis Significance Testing procedure, which has six steps: 1) formulating hypotheses, 2) choosing error parameters $\alpha$ (max Type I error allowed) and $\beta$ (max Type II error allowed), 3) calculating the required sample size $n$ and collecting the data, 4) checking assumptions, 5) performing the statistical test, and 6) reporting the results using relevant measures of the strength of the evidence like p-values, confidence intervals, and effect size estimates. This is by far the least exciting part of the statistics curriculum, but we must do it to help out students who are taking a STATS 101 course.
  • p-values. The $p$-value, computed under the null hypothesis, is introduced as a secondary character—a basic sanity check used to confirm that the observed data is unlikely to have occurred by chance under the null hypothesis. We’ll warn readers about the numerous possible misinterpretations of p-values.
  • Permutation test. Readers’ first contact with statistical inference will be through various examples of permutation tests for comparing two groups: a permutation test for comparing two proportions, and a permutation test for comparing two means. In both cases the null hypothesis is “no difference between the groups,” and we can simulate the null distribution by reshuffling the group labels.
  • t-test and other analytical approximations. We’ll introduce readers to the analytical approximation formulas used to calculate p-values, and explain the historical importance of these results, including examples of the most important STATS 101 recipes.
  • Effect sizes. Effect size estimates were already included under the confidence intervals topic, but it’s worth highlighting them again since they represent much more useful information than a bare p-value. We have a vision for the future where science reports contain more effect size estimates and fewer p-values.
  • Bayesian stats. I’m categorizing this as a stretch goal for now, only to include if there is room in the book.

Note that this list doesn’t show topics in the order they will appear in the book, but rather from “most important” to “least important.” The rank of each topic gives you an idea of its “page weight”—topics higher in the list will have more pages dedicated to them.
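
Since linear regression is high on the priority list, here is a minimal sketch of fitting the model $y \sim mx + c$ using scipy.stats.linregress. The x and y values are made-up data, purely for illustration.

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8, 12.1, 13.9, 16.2])

result = stats.linregress(x, y)
print("slope m:      ", result.slope)
print("intercept c:  ", result.intercept)
print("R-squared:    ", result.rvalue**2)
print("p-value:      ", result.pvalue)           # for H0: slope = 0
print("slope std err:", result.stderr)

y_pred = result.slope * x + result.intercept      # predictions from the best-fit line
```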

 

Hidden curriculum

Here are some additional topics that I would like to weave into the narrative on introductory statistics:

  • statistics best practices (at the end of the day, following recipes is fine, but you have to know what you’re doing: no peeking, no p-hacking, etc.)
  • do honest work (no cheating or deception—science is already hard enough when results are honest)
  • statistics ethics (think about the real-world implications of your data collection, assignment to groups, and reporting of results)

 


 

For more about the upcoming book, see the second post, which includes a detailed book outline, PDF previews of several chapters [1,2,3,4], and the feedback form. The No Bullshit Guide to Statistics is still far from finished, but the research stage is done and I can start writing and preparing stats notebooks with practice problems. You can expect a first draft to be ready by the end of the year, but you don’t need to wait that long to learn stats! In the next section you’ll find links to the best statistics learning resources that already exist, so you can start learning stats right now.

 

Recommended stats learning resources

While researching and learning about statistics over the past three years, I found some excellent video lectures, visualizations, tutorials, books, and other learning resources. Below is a list of the best ones:

  • As a starting point, I recommend you take a look at the probability theory chapter draft. You don’t need to worry about particular formulas and details (you can fill that in later), but I want you to have an idea of all the probability theory tools that are at your disposal.
  • Instead of just reading about probability theory, it’s better if you can see probability theory. Head over to the Seeing Theory website and spend a few hours working through the visualizations of probability rules, probability distributions, frequentist inference, and linear regression. These are the core ideas you need to become friends with.
  • Next you can jump straight into the deep end by watching the talk Statistics for Hackers by Jake Vanderplas, which is the best TL;DR about statistical inference based on resampling methods.
  • For more details about the computation-first approach to statistics you can read the book Think Stats by Allen B. Downey, which goes through several statistical analysis examples using Python code, and doesn’t require any fancy math.
  • For even more details, check out the OpenIntro Introduction to Modern Statistics, which is a textbook covering all the topics of the statistics curriculum using modern randomization methods. This book is still in a draft, pre-release stage, but it’s already very good and worth looking into, especially if you’re taking a STATS 101 course right now. Speaking of OpenIntro, the same authors also published OpenIntro Statistics, which follows the “classical” statistics curriculum based on analytical approximations.
  • Another book that I found to be excellent is All of Statistics by Larry Wasserman. It’s more mathematically oriented, but for readers who are not afraid of math, there is hardly any better way to learn statistics than reading chapters 1 through 10 and chapter 13 in this book.
  • Readers who identify more with the show-me-the-code camp can piece together a complete statistics education by watching the talks on statistics from past PyCon and SciPy conferences and playing with the Jupyter notebooks provided for each talk. The SciPy 2015 tutorial Computational Statistics II by Chris Fonnesbeck and the associated notebooks [1,2,3,4,5] are a good starting point. Using standard functions from scipy, you can learn how to do probability estimation, which is an essential skill in all of statistics (and machine learning).
  • People who are interested in a “high level” API for performing statistical hypothesis testing should look at tea-lang, which is a domain-specific language (DSL) for formulating statistical questions using the classical hypothesis testing framework. Watch this presentation by the tea-lang inventor to get started, then install the code and read the paper. This is really cool stuff. Instead of trying to find the right statistical test from an inventory of all possible stats tests or following a decision tree, why not let tea-lang automatically analyze the “statistical analysis spec” you give it and run all the statistical tests that apply?
  • For more recommended stats learning resources, check out the shared google doc Links to STATS Learning Resources, in which I’m collecting links to books, courses, tutorials, visualizations, software tools, and other resources that will eventually be included as links in the book. Let’s face it, statistics is a huge field and no single book can teach you all the material. The best I can do in the No Bullshit Guide to Statistics is to provide a coherent, self-contained introduction to the basics, and point readers to other learning resources for further reading. I’m currently working through a backlog of O(1000) stats bookmarks and I’ll be adding the “best-of”-grade resources to this gdoc over time. I also created a mailing list for the book. Sign up to receive preview chapters and updates about the book by email: https://confirmsubscription.com/h/t/A17516BF2FCB41B2
