The links below will take you to preview PDFs of the latest edition of the book.
Click the
buttons to launch a JupyterLab instance in the cloud with the code examples from each section.
Part 1: Data and Probability Prerequisites
- Concept map:
visual representation of the connections between the concepts from data, probability, and statistics.
- Preface:
explains the motivation for the book, and describes the target audiences.
- Introduction:
defines statistical concepts and describes the three datasets that we'll use in the book.
-
Chapter 1 DATA.
Chapter 1 includes:
- Section 1.1 Introduction to data:
defines concepts like population, sample, and other stats terminology.
This section also introduces the concepts of random sampling and random assignment,
which are two key building blocks of statistical experiments.
- Section 1.2 Data in practice:
explains practical aspects of data manipulation using Pandas
and data visualization using Seaborn,
and discusses data pre-processing steps like data cleaning and outlier removal.
- Section 1.3 Descriptive statistics:
explains how to compute numerical summaries (mean, standard deviation, quartiles, etc.)
and generate data visualizations like
histograms,
box plots, bar plots, etc.
-
Chapter 2 PROBABILITY THEORY.
Chapter 2 includes:
- Section 2.1 Discrete random variables:
a complete introduction to probability theory including definitions, probability formulas,
and lots of examples of discrete
random variables like coin tosses, die rolls,
hard disk failures, etc.
- Section 2.2 Multiple random variables:
introduces the joint probability distribution $f_{XY}$ for a pair of random variables $(X,Y)$.
We also discuss the concept of independent random variables.
- Section 2.3 Inventory of discrete distributions.
The Python module
scipy.stats contains pre-defined probability models that you can use for modeling tasks.
These are the LEGOs for the 21st century.
- Section 2.4 Continuous random variables.
In this section,
we learn all the probability concepts for continuous random variables.
Section 2.4 is similar to Section 2.1, with every occurrence of the sum $\textrm{Pr}(a \leq X \leq b)=\sum_{x=a}^{x=b}f_X(x)$
replaced by the integral $\textrm{Pr}(a \leq X \leq b)=\int_{x=a}^{x=b}f_X(x)dx$.
This section includes an introduction to integrals $\int_{x=a}^{x=b} f(x)dx$,
which is the math machinery for calculating probabilities of continuous random variables.
Don't worry—it's not a big deal—an integral is just a fancy name for computing the area under the graph of $f(x)$ between $x=a$ and $x=b$.
- Section 2.5 Multiple continuous random variables.
We see what happens when we have multiple continuous random variables.
The section includes some nice 3D visualizations of conditional distributions.
- Section 2.6 Inventory of continuous distributions.
In this section we'll learn about all the continuous random variables.
- Section 2.7 Simulation and empirical distributions.
How can we use computers to generate observations from random variables?
In this section, we'll describe some practical techniques for generating observations from any probability distribution,
and develop some very useful math tools along the way.
- Section 2.8 Probability models for random samples.
What can we say about the characteristics of the random sample
$\mathbf{X} = (X_1,X_2,X_3,\ldots,X_n)$,
where each $X_i$ is an independent copy of the random variable $X$?
Understanding the properties of $\mathbf{X}$ is important for all the
statistics operations we'll be doing in the STATS chapter.
- Probability problems:
this is your time to shine!
Try solving these problems to see if you really understand the material in Chapter 2.
- End matter
- Appendix A Answers and solutions
- Appendix B Notation
- Appendix C Python tutorial:
a condensed tutorial on Python syntax and operations to get any reader up to speed.
- Appendix D Pandas tutorial:
introduction to Pandas functionality for data management.
- Appendix E Seaborn tutorial:
summary of Seaborn plotting functions for data visualization.
- Appendix F Calculus tutorial
- Bibliography
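
To give a flavour of the sum-versus-integral parallel between Sections 2.1 and 2.4, here is a minimal sketch using `scipy.stats`. The two models (a fair six-sided die and a standard normal) and the variable names are illustrative assumptions, not examples taken from the book.

```python
from scipy.stats import randint, norm
from scipy.integrate import quad

# Discrete case (assumed example): a fair six-sided die.
# randint(low, high) models the integers low, ..., high-1.
rvD = randint(1, 7)
a, b = 2, 4
# Pr(a <= X <= b) is the SUM of the probability mass function f_X
pr_discrete = sum(rvD.pmf(x) for x in range(a, b + 1))

# Continuous case (assumed example): a standard normal.
rvZ = norm(0, 1)
# Pr(-1 <= Z <= 1) is the INTEGRAL of the probability density function f_Z,
# i.e. the area under the graph of f_Z between -1 and 1.
pr_continuous, _abserr = quad(rvZ.pdf, -1, 1)

# Cross-check using the cumulative distribution function
pr_via_cdf = rvZ.cdf(1) - rvZ.cdf(-1)

print(pr_discrete)
print(pr_continuous, pr_via_cdf)
```

The numerical integral and the CDF difference should agree up to numerical error, which is one way to see that an integral is just the area under the curve.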
Part 2: Statistical Inference