Chapter 2 — Probability Theory and Random Variables

Companion material for Chapter 2. Covers probability spaces, random variables, distributions, expectations, and limit theorems.

§ 2.1 Concept of Probability

Probability Space

Source: https://probability4datascience.com/

Mathematical Heart of Probability

Probability is a measure of the size of a set of outcomes. Consistency: E_1 \subseteq E_2 \;\Rightarrow\; \mathbb{P}(E_1) \leq \mathbb{P}(E_2). Source: https://probability4datascience.com/

Conditional Probability

Example

What are possible values for the following probabilities: \begin{aligned} \mathbb{P}[\text{grass is wet} \mid \text{it rained}] \qquad & \mathbb{P}[\text{grass is wet}] \\ \mathbb{P}[\text{it rained} \mid \text{grass is wet}] \qquad & \mathbb{P}[\text{it rained}] \end{aligned}

Exercise

Consider a tetrahedral (4-sided) die. Let X be the first roll and Y be the second roll. Let B be the event that \min(X,Y)=2 and M be the event that \max(X,Y)=3. Find \mathbb{P}[M \mid B].
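
One way to check an answer to this exercise is brute-force enumeration of the sample space. The sketch below assumes a fair die and independent rolls, which the exercise does not state explicitly; the helper name `conditional_probability` is illustrative only.

```python
from fractions import Fraction
from itertools import product

# All 16 equally likely outcomes (X, Y) of two rolls of a 4-sided die.
outcomes = list(product(range(1, 5), repeat=2))

B = {o for o in outcomes if min(o) == 2}   # event min(X, Y) = 2
M = {o for o in outcomes if max(o) == 3}   # event max(X, Y) = 3

def conditional_probability(event, given):
    """P[event | given] for equally likely outcomes: |event & given| / |given|."""
    return Fraction(len(event & given), len(given))

p_M_given_B = conditional_probability(M, B)
print(p_M_given_B)
```

Conditioning on B shrinks the sample space to the outcomes in B; the conditional probability is then the share of those outcomes that also lie in M.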

Bayes’ Theorem

🎥 3Blue1Brown — Bayes’ theorem visualized. Directly follows the lecture derivation.

🎥 3Blue1Brown — A concise visual proof of Bayes’ theorem (4 min).

🎥 3Blue1Brown — The medical test paradox: conditional probability in practice, with Bayes factors.

Three Prisoners problem

The king says that he will pardon two prisoners and sentence one.

A friendly guard is allowed to tell prisoner A the name of one of the other two prisoners, B or C, who will be pardoned.

Should prisoner A ask the guard or not?
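
The problem can be worked out exactly with Bayes' rule. The following sketch rests on two assumptions that are not stated above: each prisoner is equally likely to be the one sentenced, and the guard chooses uniformly at random when both B and C are pardoned. The function name `posterior_A_sentenced` is hypothetical.

```python
from fractions import Fraction

prisoners = ("A", "B", "C")
prior = Fraction(1, 3)  # assumption: each prisoner is equally likely to be sentenced

# joint[(sentenced, named)] = P(sentenced prisoner, guard names `named` as pardoned)
joint = {}
for sentenced in prisoners:
    # The guard may only name B or C, and never the sentenced prisoner.
    choices = [p for p in ("B", "C") if p != sentenced]
    for named in choices:
        # Assumption: with two valid choices the guard picks uniformly.
        joint[(sentenced, named)] = prior * Fraction(1, len(choices))

def posterior_A_sentenced(named):
    """P(A is sentenced | guard names `named`) via Bayes' rule."""
    evidence = sum(p for (s, n), p in joint.items() if n == named)
    return joint[("A", named)] / evidence

print(posterior_A_sentenced("B"), posterior_A_sentenced("C"))
```

Under these assumptions both posteriors equal the prior 1/3, so the guard's answer carries no information about A's own fate. A different guard strategy would change the result.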

Excursion: Bayesian Reasoning (see Jaynes)

Aristotelian logic based on syllogism, e.g.,

A \Rightarrow B \quad (\text{``A implies B''})

does not allow inversion, i.e., B \nRightarrow A.

In Bayesian reasoning, however, we have

A \Rightarrow B \;\;\Rightarrow\;\; \mathbb{P}(B \mid A) = 1

Because of Bayes’ rule,

\mathbb{P}(A \mid B) = \frac{\mathbb{P}(B \mid A)\,\mathbb{P}(A)}{\mathbb{P}(B)} = \frac{\mathbb{P}(A)}{\mathbb{P}(B)} \ge \mathbb{P}(A)

“A becomes more likely when B is observed.”
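
A small numerical illustration of this statement, using a fair six-sided die as an assumed example (the events A and B below are chosen for illustration and do not come from the lecture):

```python
from fractions import Fraction

omega = set(range(1, 7))   # fair six-sided die (illustrative choice)
A = {6}                    # "the roll is a six"
B = {2, 4, 6}              # "the roll is even"; A is a subset of B, so A implies B

def prob(event):
    """Probability of an event under equally likely outcomes."""
    return Fraction(len(event), len(omega))

p_B_given_A = prob(A & B) / prob(A)   # equals 1 because A implies B
p_A_given_B = prob(A & B) / prob(B)   # equals P(A) / P(B)
print(prob(A), p_A_given_B)
```

Observing B raises the probability of A from 1/6 to 1/3, although B alone does not prove A.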

Excursion: Bertrand’s Paradox

🎥 Numberphile — Three reasonable but contradictory solutions to the same probability problem, illustrating that a probability space must be fully specified.

Practice Questions

Practice questions for this section →

§ 2.2 Random Variables, Distributions and Densities

Cumulative Distribution Function and PDF

→ Interactive Demo: Distributions

PDF & CDF explorer for 12 common distributions (Normal, Laplace, Rayleigh, Exponential, Cauchy, Binomial, Geometric, Poisson, Gamma, Erlang, Chi-square, Uniform) with adjustable parameters.
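
The relation between the two views can also be checked numerically: the CDF is the integral of the PDF. Below is a minimal sketch for the standard normal distribution using Python's `statistics.NormalDist`; the function `cdf_by_integration`, the lower integration limit, and the step count are illustrative choices.

```python
from statistics import NormalDist

dist = NormalDist(mu=0.0, sigma=1.0)

def cdf_by_integration(x, lower=-10.0, steps=20_000):
    """Approximate F(x) as the integral of the PDF from `lower` to x (midpoint rule)."""
    h = (x - lower) / steps
    return sum(dist.pdf(lower + (i + 0.5) * h) for i in range(steps)) * h

# Print x, the integrated PDF, and the library CDF side by side.
for x in (-1.0, 0.0, 1.5):
    print(x, round(cdf_by_integration(x), 6), round(dist.cdf(x), 6))
```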

Joint and Conditional Distributions

→ Interactive Demo: Conditional Distributions

Condition a 2D joint distribution on one variable and observe the resulting conditional PDF.
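
The same operation is easy to sketch in the discrete case: fix one variable, take the corresponding slice of the joint distribution, and renormalize by the marginal. The joint PMF below is a made-up table for illustration.

```python
from fractions import Fraction

# Hypothetical joint PMF p(x, y) on a 2 x 3 grid (values chosen for illustration).
joint = {
    (0, 0): Fraction(1, 10), (0, 1): Fraction(2, 10), (0, 2): Fraction(1, 10),
    (1, 0): Fraction(3, 10), (1, 1): Fraction(1, 10), (1, 2): Fraction(2, 10),
}

def conditional_pmf_y_given_x(x):
    """p(y | x) = p(x, y) / p(x), where p(x) is the marginal of X."""
    marginal_x = sum(p for (xi, _), p in joint.items() if xi == x)
    return {y: p / marginal_x for (xi, y), p in joint.items() if xi == x}

cond = conditional_pmf_y_given_x(1)
print(cond)
```

For continuous variables the sum over y becomes an integral, but the slice-and-renormalize idea is the same.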

Practice Questions

Practice questions for this section →

§ 2.3 Functions of Random Variables

Transformation of a Random Variable

The mapping Y = X^2 can be approximated by a piecewise linear mapping. Source: https://probability4datascience.com/

Probability distributions can be directly mapped. Source: https://probability4datascience.com/

→ Interactive Demo: Mapping of Random Variables

Apply a nonlinear transformation Y = g(X) to a chosen input distribution and watch the output PDF change in real time.

🎥 3Blue1Brown — Change of variables and the transformation formula for PDFs.
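
As a concrete check of the transformation formula, the sketch below takes Y = X^2 with X assumed to be standard normal (an illustrative choice). The density of Y sums the contributions of both roots x = \pm\sqrt{y}, each weighted by 1/|g'(x)| = 1/(2\sqrt{y}), and is compared with a numerical derivative of the CDF of Y.

```python
from math import sqrt
from statistics import NormalDist

X = NormalDist(0.0, 1.0)   # assumed input distribution

def pdf_y(y):
    """Density of Y = X^2: sum over both roots x = +/- sqrt(y) of f_X(x) / |g'(x)|."""
    root = sqrt(y)
    return (X.pdf(root) + X.pdf(-root)) / (2.0 * root)

def cdf_y(y):
    """P(Y <= y) = P(-sqrt(y) <= X <= sqrt(y))."""
    root = sqrt(y)
    return X.cdf(root) - X.cdf(-root)

# The derivative of the CDF should match the transformed density.
y, h = 1.3, 1e-5
numeric_derivative = (cdf_y(y + h) - cdf_y(y - h)) / (2 * h)
print(pdf_y(y), numeric_derivative)
```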

Mapping of Two Random Variables — Convolutions

🎥 3Blue1Brown — What is a convolution? Covers discrete and continuous cases, including adding random variables.

🎥 3Blue1Brown — Adding continuous random variables: two geometric interpretations of convolution in probability.

🎥 3Blue1Brown — Why Gaussian + Gaussian = Gaussian: a visual proof via convolution.
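
For discrete random variables the convolution can be written in a few lines. The sketch below assumes two independent fair six-sided dice and computes the PMF of their sum; the function name `convolve` is illustrative.

```python
from fractions import Fraction

def convolve(p, q):
    """Discrete convolution: PMF of X + Y for independent X ~ p and Y ~ q.

    PMFs are lists where index k holds P(value = k).
    """
    result = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            result[i + j] += pi * qj
    return result

# Fair six-sided die: index 0 is unused, indices 1..6 each have probability 1/6.
die = [Fraction(0)] + [Fraction(1, 6)] * 6
two_dice = convolve(die, die)
print(two_dice[7])   # probability that the two dice sum to 7
```

Every pair (i, j) with i + j = k contributes p[i] * q[j] to the probability of the sum k, which is exactly the convolution sum.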

Practice Questions

Practice questions for this section →

§ 2.4 Expectations

Expectation Operator

St. Petersburg Paradox

🎥 A game with infinite expected value but no rational bid — challenges the naive use of expectation as a decision criterion.
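
To see where the infinite expectation comes from, it helps to truncate the game. The sketch assumes one common version of the rules (not spelled out above): a fair coin is flipped until the first heads, and heads on flip k pays 2^k. The function name `truncated_expected_payoff` is hypothetical.

```python
from fractions import Fraction

def truncated_expected_payoff(max_flips):
    """Expected payoff if the game is cut off after `max_flips` flips.

    Assumed rules: first heads on flip k (probability 2**-k) pays 2**k;
    a run of `max_flips` tails pays nothing.
    """
    return sum(Fraction(1, 2**k) * 2**k for k in range(1, max_flips + 1))

for n in (10, 100, 1000):
    print(n, truncated_expected_payoff(n))
```

Each possible stopping time contributes exactly 1 to the expectation, so the truncated value grows linearly with the cutoff and has no finite limit.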

Correlation and Covariance

Independence vs. uncorrelatedness vs. orthogonality.
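
A standard counterexample separates these notions. In the sketch below, X is assumed uniform on {-1, 0, 1} and Y = X^2 (an illustrative choice): the pair is orthogonal (\mathbb{E}[XY] = 0) and uncorrelated (zero covariance), yet clearly not independent.

```python
from fractions import Fraction

# X is uniform on {-1, 0, 1} and Y = X**2 (a deterministic function of X).
pmf = {(x, x * x): Fraction(1, 3) for x in (-1, 0, 1)}

def expect(f):
    """Expectation of f(X, Y) under the joint PMF."""
    return sum(p * f(x, y) for (x, y), p in pmf.items())

e_xy = expect(lambda x, y: x * y)   # orthogonal if this is 0
cov = e_xy - expect(lambda x, y: x) * expect(lambda x, y: y)   # uncorrelated if 0

# Independence requires P(X=x, Y=y) = P(X=x) P(Y=y) for every pair (x, y).
p_x0 = sum(p for (x, _), p in pmf.items() if x == 0)
p_y0 = sum(p for (_, y), p in pmf.items() if y == 0)
independent_at_00 = pmf[(0, 0)] == p_x0 * p_y0
print(e_xy, cov, independent_at_00)
```

Independence implies uncorrelatedness, but this example shows that the converse fails.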

Practice Questions

Practice questions for this section →

§ 2.5 Special Distributions

Distribution Explorer

→ Interactive Demo: Distributions

Compare all major distributions in one place. Adjust parameters and switch between PDF and CDF view.

Binomial and Geometric Distributions

🎥 3Blue1Brown — Visualizing the binomial distribution and its Gaussian limit.

Excursion: Binomial Coefficient and Polynomials

Binomial coefficients arise when expanding powers of sums: (x+y)^n=\sum_{k=0}^n\binom{n}{k}x^k y^{\,n-k}.

This expansion can be understood combinatorially. Each term corresponds to choosing, for each of the n factors, either x or y. Thus, (x+y)^n enumerates all possible sequences of length n consisting of x and y.

For example, \begin{aligned} (x+y)^3 =\;& xxx + xxy + xyx + xyy \\ &+ yxx + yxy + yyx + yyy. \end{aligned}

Each term contains a certain number k of x’s and n-k of y’s. The binomial coefficient \binom{n}{k} counts how many such sequences exist, i.e., how many ways we can choose the positions of the x’s among the n factors.
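
The counting argument can be verified by enumerating the sequences directly. This is a minimal sketch for n = 3; the variable names are illustrative.

```python
from collections import Counter
from itertools import product
from math import comb

n = 3
# Every term of (x + y)**n picks "x" or "y" from each of the n factors.
sequences = ["".join(s) for s in product("xy", repeat=n)]

# Group the sequences by the number k of x's they contain.
counts = Counter(seq.count("x") for seq in sequences)

for k in range(n + 1):
    print(k, counts[k], comb(n, k))
```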

Excursion: ABRACADABRA theorem

The ABRACADABRA theorem provides a surprising application of geometric distributions and martingales:

🎥 Numberphile — Expected waiting time to see ABRACADABRA in a random sequence.
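
The martingale argument leads to a simple rule: every length k at which the pattern's prefix equals its suffix contributes (alphabet size)^k to the expected waiting time. The sketch below assumes letters drawn independently and uniformly from a 26-letter alphabet; the function name `expected_waiting_time` is hypothetical.

```python
def expected_waiting_time(pattern, alphabet_size):
    """Expected number of i.i.d. uniform letters until `pattern` first appears.

    Sums alphabet_size**k over every length k for which the first k letters
    of the pattern equal its last k letters (k = len(pattern) always counts).
    """
    return sum(
        alphabet_size ** k
        for k in range(1, len(pattern) + 1)
        if pattern[:k] == pattern[-k:]
    )

print(expected_waiting_time("ABRACADABRA", 26))   # 26**11 + 26**4 + 26
```

The self-overlaps "A" and "ABRA" are what push the waiting time above 26^11. For a fair coin the same rule gives 6 flips for "HH" but only 4 for "HT".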

Gaussian (Normal) Distribution

🎥 3Blue1Brown — Why \pi appears in the Gaussian PDF: the Herschel-Maxwell derivation.

Excursion: Tails of the Gaussian

For Gaussian-distributed variables, the likelihood of an event is determined by its deviation from the mean, measured in standard deviations. The probabilities of falling within 1, 2, and 3 standard deviations are approximately 68%, 95%, and 99.7%, respectively, as described by the 68-95-99.7 rule.
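
The percentages follow directly from the Gaussian CDF. A minimal check with Python's `statistics.NormalDist` (the helper `prob_within` is an illustrative name):

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

def prob_within(k):
    """P(|X - mu| <= k * sigma) for a Gaussian; independent of mu and sigma."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

# Print the probability in percent for 1, 2, and 3 standard deviations.
for k in (1, 2, 3):
    print(k, round(100 * prob_within(k), 2))
```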

Poisson Distribution

Photons arriving at an image sensor can be modeled as a Poisson process, where the arrival rate depends on the scene brightness.

Random fluctuations in the image are called shot noise. The stronger the signal (\lambda), the higher the variance. However, the signal-to-noise ratio (SNR) still increases:

\mathrm{SNR} = \frac{\mathbb{E}[Y]}{\sqrt{\mathrm{Var}(Y)}} = \frac{\lambda}{\sqrt{\lambda}} = \sqrt{\lambda}. Source: https://probability4datascience.com/
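
The square-root law can be reproduced numerically from the Poisson PMF. In this sketch the function names and the truncation point of the PMF are illustrative choices.

```python
from math import exp, sqrt

def poisson_pmf(lam, k_max):
    """P(Y = k) for k = 0..k_max, built iteratively to avoid large factorials."""
    pmf = [exp(-lam)]
    for k in range(1, k_max + 1):
        pmf.append(pmf[-1] * lam / k)
    return pmf

def snr(lam):
    """Mean divided by standard deviation, computed from the truncated PMF."""
    # Truncate far in the tail; the neglected probability mass is negligible.
    pmf = poisson_pmf(lam, k_max=int(lam + 20 * sqrt(lam) + 20))
    mean = sum(k * p for k, p in enumerate(pmf))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))
    return mean / sqrt(var)

for lam in (1, 4, 100):
    print(lam, snr(lam), sqrt(lam))
```

Increasing the photon rate by a factor of 100 improves the SNR only by a factor of 10.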

Exponential Distribution

In a Poisson process, the number of arrivals in a fixed time interval follows a Poisson distribution, while the inter-arrival times are exponentially distributed. Source: https://probability4datascience.com/
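
This connection can be illustrated by simulation: generate exponential inter-arrival times, count the arrivals per unit interval, and compare the sample mean and variance of the counts, which should both be close to the rate for a Poisson distribution. The rate, horizon, and seed below are arbitrary illustrative choices.

```python
import random

def count_arrivals(rate, horizon, rng):
    """Simulate one Poisson process on [0, horizon) and return arrivals per unit interval."""
    counts = [0] * horizon
    t = rng.expovariate(rate)          # first inter-arrival time ~ Exp(rate)
    while t < horizon:
        counts[int(t)] += 1
        t += rng.expovariate(rate)     # next inter-arrival time
    return counts

rng = random.Random(0)                 # fixed seed for reproducibility
rate = 3.0
counts = count_arrivals(rate, horizon=20_000, rng=rng)

mean = sum(counts) / len(counts)
var = sum((c - mean) ** 2 for c in counts) / len(counts)
print(mean, var)   # both should be close to `rate`
```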

Moments Overview

Practice Questions

Practice questions for this section →

§ 2.6 Limit Considerations

Jensen’s Inequality

Illustration of Jensen’s inequality. Source: https://probability4datascience.com/

Markov and Chebyshev Inequalities

Central Limit Theorem

→ Interactive Demo: Central Limit Theorem

Animated convolution: add IID random variables one by one and watch the normalized sum converge to a Gaussian — for Laplace, Uniform, Rayleigh, Exponential, or Gamma source distributions.

🎥 3Blue1Brown — Why sample means converge to a normal distribution: a visual derivation (31 min).
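
A Monte Carlo version of the same experiment is sketched below. It assumes Uniform(0, 1) source variables, 30 terms per sum, and a fixed seed (all illustrative choices) and compares the empirical CDF of the normalized sum with the standard normal CDF.

```python
import random
from math import sqrt
from statistics import NormalDist

def normalized_sum(n, rng):
    """Sum of n i.i.d. Uniform(0, 1) draws, shifted and scaled to mean 0, variance 1."""
    total = sum(rng.random() for _ in range(n))
    return (total - n * 0.5) / sqrt(n / 12.0)   # Uniform(0,1): mean 1/2, variance 1/12

rng = random.Random(0)
samples = [normalized_sum(30, rng) for _ in range(20_000)]

# Compare the empirical CDF with the standard normal CDF at a few points.
for z in (-1.0, 0.0, 1.0):
    empirical = sum(s <= z for s in samples) / len(samples)
    print(z, round(empirical, 3), round(NormalDist().cdf(z), 3))
```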

Practice Questions

Practice questions for this section →

§ 2.7 Jointly Distributed Random Variables

Excursion: The Library of Babel

The short story The Library of Babel by Jorge Luis Borges (1941) describes a library containing all possible books of length N, written using 25 basic characters (22 letters, the period, the comma, and the space).

This corresponds to the sample space of all sequences of length N over a finite alphabet \mathcal{A}, i.e., \Omega = \mathcal{A}^N, which underlies a multivariate (or product) distribution.

Source: https://medium.com/arch-201/the-library-of-babel-621aa9a924a5
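
To get a feeling for the size of this sample space, |\Omega| = 25^N, the sketch below counts the decimal digits of 25^N. The book length N = 410 pages × 40 lines × 80 characters is taken from Borges's story and is an assumption here, since the text above leaves N unspecified; the function name is illustrative.

```python
from math import floor, log10

ALPHABET_SIZE = 25                 # 22 letters, period, comma, space
# Assumption (taken from the story, not from the text above):
# 410 pages x 40 lines x 80 characters per book.
N = 410 * 40 * 80

def decimal_digits_of_power(base, exponent):
    """Number of decimal digits of base**exponent, without forming the huge integer."""
    return floor(exponent * log10(base)) + 1

print(N, decimal_digits_of_power(ALPHABET_SIZE, N))
```

With these numbers the count of distinct books has more than 1.8 million decimal digits.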

Excursion: ImageNet

One of the central datasets in modern deep learning is ImageNet. Images are typically resized to 224 \times 224 pixels with 3 color channels. When vectorized, this corresponds to \mathbf{x} \in \mathbb{R}^{150{,}528}.

Spatial Distance and Joint Distribution

Multivariate Gaussian Distribution

→ Interactive Demo: 2D Normal Distributions

Interactive 3D visualization of the bivariate Gaussian with adjustable means \mu_1, \mu_2, standard deviations \sigma_1, \sigma_2, and correlation coefficient \rho.
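
The role of the five parameters can also be seen by sampling. The sketch below builds a correlated pair from two independent standard normals, X_1 = \mu_1 + \sigma_1 Z_1 and X_2 = \mu_2 + \sigma_2(\rho Z_1 + \sqrt{1-\rho^2}\, Z_2), and checks the sample correlation. Parameter values, seed, and function names are illustrative.

```python
import random
from math import sqrt

def sample_bivariate_gaussian(mu1, mu2, sigma1, sigma2, rho, rng):
    """Draw (X1, X2) with the given means, standard deviations, and correlation rho."""
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)   # independent standard normals
    x1 = mu1 + sigma1 * z1
    x2 = mu2 + sigma2 * (rho * z1 + sqrt(1.0 - rho * rho) * z2)
    return x1, x2

def sample_correlation(pairs):
    """Sample correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    m1 = sum(x for x, _ in pairs) / n
    m2 = sum(y for _, y in pairs) / n
    cov = sum((x - m1) * (y - m2) for x, y in pairs) / n
    v1 = sum((x - m1) ** 2 for x, _ in pairs) / n
    v2 = sum((y - m2) ** 2 for _, y in pairs) / n
    return cov / sqrt(v1 * v2)

rng = random.Random(0)
pairs = [sample_bivariate_gaussian(1.0, -2.0, 2.0, 0.5, 0.8, rng) for _ in range(20_000)]
print(sample_correlation(pairs))   # should be close to rho = 0.8
```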

Visualizing Correlation

Joint PMF

Pairs of adjacent letters (for example, in English texts) can be used to characterize a language. The letter pair "in" is the most common.
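
An empirical joint PMF of letter pairs can be estimated by counting. The sketch below uses a single toy sentence rather than a real corpus, so its frequencies say nothing about English in general; the function name `bigram_pmf` is hypothetical.

```python
from collections import Counter

def bigram_pmf(text):
    """Empirical joint PMF of adjacent letter pairs (X = first letter, Y = second)."""
    # Lowercase the letters and turn everything else into word boundaries.
    cleaned = "".join(c.lower() if c.isalpha() else " " for c in text)
    pairs = Counter(a + b for word in cleaned.split() for a, b in zip(word, word[1:]))
    total = sum(pairs.values())
    return {pair: count / total for pair, count in pairs.items()}

sample = "The rain in Spain stays mainly in the plain."   # toy text, not a real corpus
pmf = bigram_pmf(sample)
print(sorted(pmf.items(), key=lambda kv: -kv[1])[:3])
```

Pairs are counted only within words, so the space between words is not treated as a symbol. Including it would be an equally valid modeling choice.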

Transformation of Gaussians

Source: https://probability4datascience.com/

Excursion: Eigenfaces

Excursion: Properties and Surprising Behavior

🎥 Strange conditional distributions — regression to the mean and related paradoxes in joint distributions.

🎥 “How We’re Fooled By Statistics” — regression to the mean in real-world data.

🎥 Is the “hot hand” real? - controversies about hidden assumptions in computing conditional probabilities.

Practice Questions

Practice questions for this section →