Chapter 2 — Probability Theory and Random Variables

Companion material for Chapter 2. Covers probability spaces, random variables, distributions, expectations, and limit theorems.


§ 2.1 Concept of Probability

Probability Space

Source: https://probability4datascience.com/
Note: Mathematical Heart of Probability

Probability is a measure of the size of a set of outcomes. Consistency: E_1 \subseteq E_2 \;\Rightarrow\; \mathbb{P}(E_1) \leq \mathbb{P}(E_2). Source: https://probability4datascience.com/

Conditional Probability

Tip: Example

What are possible values for the following probabilities: \begin{aligned} \mathbb{P}[\text{grass is wet} \mid \text{it rained}] \qquad & \mathbb{P}[\text{grass is wet}] \\ \mathbb{P}[\text{it rained} \mid \text{grass is wet}] \qquad & \mathbb{P}[\text{it rained}] \end{aligned}

Note: Exercise

Consider a tetrahedral (4-sided) die. Let X be the first roll and Y be the second roll. Let B be the event that \min(X,Y)=2 and M be the event that \max(X,Y)=3. Find \mathbb{P}[M \mid B].

Source: https://probability4datascience.com/
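One way to check a hand calculation for this exercise is to enumerate all outcomes. The sketch below assumes a fair die and independent rolls, so all 16 pairs (X, Y) are equally likely; the variable names are illustrative only.

```python
from fractions import Fraction
from itertools import product

# All 16 equally likely outcomes (X, Y) of two rolls of a 4-sided die.
outcomes = list(product(range(1, 5), repeat=2))

B = [(x, y) for x, y in outcomes if min(x, y) == 2]   # event min(X, Y) = 2
M_and_B = [(x, y) for x, y in B if max(x, y) == 3]    # outcomes in both M and B

# P[M | B] = P[M and B] / P[B]; with equally likely outcomes this is a ratio of counts.
p_M_given_B = Fraction(len(M_and_B), len(B))
print(p_M_given_B)
```

Conditioning on B shrinks the sample space to the outcomes in B, which is why only the two counts matter.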

Bayes’ Theorem

3Blue1Brown — Bayes’ theorem visualized. Directly follows the lecture derivation.

3Blue1Brown — A concise visual proof of Bayes’ theorem (4 min).

3Blue1Brown — The medical test paradox: conditional probability in practice, with Bayes factors.

Example: Binary Channel

Binary symmetric channel.

Law of Total Probability

Law of Total Probability.
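For the binary symmetric channel, the law of total probability gives the distribution of the received bit, and Bayes' theorem then gives the posterior of the transmitted bit. The following is a minimal sketch; the function name and the numerical values of the prior `p1` and the crossover probability `eps` are assumptions chosen for illustration.

```python
def bsc_posterior(p1, eps):
    """Return (P[Y=1], P[X=1 | Y=1]) for a binary symmetric channel.

    p1  : prior probability that a 1 is sent, P[X=1]
    eps : crossover probability, P[Y != X]
    """
    p0 = 1.0 - p1
    # Law of total probability over the two possible inputs.
    p_y1 = (1.0 - eps) * p1 + eps * p0
    # Bayes' theorem: P[X=1 | Y=1] = P[Y=1 | X=1] P[X=1] / P[Y=1].
    posterior = (1.0 - eps) * p1 / p_y1
    return p_y1, posterior

p_y1, posterior = bsc_posterior(p1=0.5, eps=0.1)
print(p_y1, posterior)
```

With equally likely inputs, the posterior of a correct bit equals 1 - eps; an unbalanced prior shifts it.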
Note: Three Prisoners Problem

The king announces that, of the three prisoners A, B, and C, he will pardon two and sentence one.

Source: https://probability4datascience.com/

A friendly guard is allowed to tell prisoner A the name of one of the other two prisoners, B or C, who will be pardoned.

Source: https://probability4datascience.com/

Should prisoner A ask the guard or not?

Source: https://probability4datascience.com/
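The problem can be worked through with exact fractions. The sketch below rests on two assumptions that the problem statement leaves open: each prisoner is equally likely to be the one sentenced, and the guard picks at random between B and C when both are pardoned.

```python
from fractions import Fraction

third, half = Fraction(1, 3), Fraction(1, 2)

# Joint probabilities P[who is sentenced, guard names B as pardoned].
joint_says_B = {
    "A": third * half,  # B and C both pardoned; guard picks one at random (assumption)
    "B": third * 0,     # the guard never names the sentenced prisoner
    "C": third * 1,     # B is the only pardoned prisoner among B and C
}

p_says_B = sum(joint_says_B.values())             # law of total probability
p_A_given_says_B = joint_says_B["A"] / p_says_B   # Bayes' theorem
print(p_A_given_says_B)
```

Under these assumptions the guard's answer leaves A's probability of being sentenced unchanged, while the probability for the prisoner who was not named rises.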

Excursion: Bayesian Reasoning (see Jaynes)

Aristotelian logic is based on the syllogism, e.g.,

A \Rightarrow B \quad (\text{``A implies B''})

does not allow inversion, i.e., B \nRightarrow A.

In Bayesian reasoning, however, we have

A \Rightarrow B \;\;\Rightarrow\;\; \mathbb{P}(B \mid A) = 1

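Because of Bayes’ rule, and since \mathbb{P}(B \mid A) = 1 and 0 < \mathbb{P}(B) \le 1,

\mathbb{P}(A \mid B) = \frac{\mathbb{P}(B \mid A)\,\mathbb{P}(A)}{\mathbb{P}(B)} = \frac{\mathbb{P}(A)}{\mathbb{P}(B)} \ge \mathbb{P}(A)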

“A becomes more likely when B is observed.”

Excursion: Bertrand’s Paradox

Numberphile — Three reasonable but contradictory solutions to the same probability problem, illustrating that a probability space must be fully specified.


§ 2.2 Random Variables, Distributions and Densities

Cumulative Distribution Function and PDF

PDF & CDF explorer for 12 common distributions (Normal, Laplace, Rayleigh, Exponential, Cauchy, Binomial, Geometric, Poisson, Gamma, Erlang, Chi-square, Uniform) with adjustable parameters.

Joint and Conditional Distributions

Condition a 2D joint distribution on one variable and observe the resulting conditional PDF.


§ 2.3 Functions of Random Variables

Transformation of a Random Variable

The mapping Y = X^2 can be approximated by a piecewise linear mapping. Source: https://probability4datascience.com/

Probability distributions can be directly mapped. Source: https://probability4datascience.com/

Apply a nonlinear transformation Y = g(X) to a chosen input distribution and watch the output PDF change in real time.
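As a concrete instance of the transformation Y = g(X), the sketch below takes g(x) = x^2 and assumes X is uniform on (0, 1), a choice made only for illustration. In that case the CDF of Y is \sqrt{y} and its PDF is 1/(2\sqrt{y}) on (0, 1), which a seeded Monte Carlo run can be compared against.

```python
import math
import random

random.seed(0)
n = 100_000
# Assumed input distribution: X ~ Uniform(0, 1); the transformation is Y = X^2.
ys = [random.random() ** 2 for _ in range(n)]

def cdf_Y(y):
    """Analytic CDF of Y = X^2: P[Y <= y] = P[X <= sqrt(y)] = sqrt(y) on [0, 1]."""
    return math.sqrt(y)

def pdf_Y(y):
    """Derivative of cdf_Y: f_Y(y) = f_X(sqrt(y)) / (2 sqrt(y)) = 1 / (2 sqrt(y))."""
    return 1.0 / (2.0 * math.sqrt(y))

for y in (0.04, 0.25, 0.64):
    empirical = sum(v <= y for v in ys) / n   # empirical CDF at y
    print(y, round(empirical, 3), cdf_Y(y))
```

The factor 1/(2\sqrt{y}) is the reciprocal of the slope of g at x = \sqrt{y}: where the mapping is flat, probability mass piles up.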

3Blue1Brown — Change of variables and the transformation formula for PDFs.

Mapping of Two Random Variables — Convolutions

3Blue1Brown — What is a convolution? Covers discrete and continuous cases, including adding random variables.

3Blue1Brown — Adding continuous random variables: two geometric interpretations of convolution in probability.

3Blue1Brown — Why Gaussian + Gaussian = Gaussian: a visual proof via convolution.


§ 2.4 Expectations

Expectation Operator

St. Petersburg Paradox

A game with infinite expected value but no rational bid — challenges the naive use of expectation as a decision criterion.
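The divergence can be made visible by truncating the game. The sketch below assumes the common formulation of the rules (a fair coin is tossed until the first head, and a first head on toss k pays 2^k); each possible round then contributes exactly 1 to the expectation.

```python
from fractions import Fraction

def truncated_expectation(max_tosses):
    """Expected payoff when the game is stopped after max_tosses tosses.

    Assumed rules: a fair coin is tossed until the first head; if that
    happens on toss k (probability 2**-k), the payoff is 2**k.
    """
    return sum(Fraction(1, 2**k) * 2**k for k in range(1, max_tosses + 1))

for n in (10, 100, 1000):
    print(n, truncated_expectation(n))
```

The truncated expectation grows linearly with the number of allowed tosses, so it has no finite limit.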

Correlation and Covariance

Independence vs. uncorrelatedness vs. orthogonality.
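The three notions can be separated with a small example. The sketch below assumes X is uniform on {-1, 0, 1} and Y = X^2 (an illustrative choice, not taken from the lecture): X and Y are uncorrelated and orthogonal, yet clearly dependent.

```python
from fractions import Fraction

# Assumed toy example: X uniform on {-1, 0, 1} and Y = X^2.
pmf = {(x, x * x): Fraction(1, 3) for x in (-1, 0, 1)}

def expect(f):
    """Expectation of f(X, Y) under the joint PMF."""
    return sum(p * f(x, y) for (x, y), p in pmf.items())

ex = expect(lambda x, y: x)
ey = expect(lambda x, y: y)
exy = expect(lambda x, y: x * y)   # E[XY]; zero means orthogonal
cov = exy - ex * ey                # Cov(X, Y); zero means uncorrelated

# Independence would require P[X=0, Y=0] = P[X=0] P[Y=0].
p_joint = pmf[(0, 0)]
p_product = Fraction(1, 3) * Fraction(1, 3)
print(cov, exy, p_joint == p_product)
```

Independence implies uncorrelatedness, but the converse fails here because Y is a deterministic function of X.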

§ 2.5 Special Distributions

Distribution Explorer

Compare all major distributions in one place. Adjust parameters and switch between PDF and CDF view.

Binomial and Geometric Distributions

Binomial PMF.

3Blue1Brown — Visualizing the binomial distribution and its Gaussian limit.

Excursion: Binomial Coefficient and Polynomials

Binomial coefficients arise when expanding powers of sums: (x+y)^n=\sum_{k=0}^n\binom{n}{k}x^k y^{\,n-k}.

This expansion can be understood combinatorially. Each term corresponds to choosing, for each of the n factors, either x or y. Thus, (x+y)^n enumerates all possible sequences of length n consisting of x and y.

For example, \begin{aligned} (x+y)^3 =\;& xxx + xxy + xyx + xyy \\ &+ yxx + yxy + yyx + yyy. \end{aligned}

Each term contains a certain number k of x’s and n-k of y’s. The binomial coefficient \binom{n}{k} counts how many such sequences exist, i.e., how many ways we can choose the positions of the x’s among the n factors.
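The counting argument can be checked by brute force: enumerate all sequences of length n over {x, y} and group them by the number of x's. A minimal sketch for n = 3:

```python
from collections import Counter
from itertools import product
from math import comb

n = 3
# Every term of (x + y)^n is a sequence of n choices, one per factor.
terms = ["".join(t) for t in product("xy", repeat=n)]

# Group the terms by the number k of x's they contain.
counts = Counter(term.count("x") for term in terms)

for k in range(n + 1):
    print(k, counts[k], comb(n, k))
```

The enumerated counts agree with `math.comb(n, k)` for every k, and the total number of terms is 2^n.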

Excursion: ABRACADABRA theorem

The ABRACADABRA theorem provides a surprising application of geometric distributions and martingales:

Numberphile — Expected waiting time to see ABRACADABRA in a random sequence.
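The expected waiting time has a closed form that is easy to evaluate. The sketch below assumes letters are drawn independently and uniformly from a 26-letter alphabet, and uses the known result of the martingale argument: sum 26^k over every length k at which the first k letters of the word equal its last k letters. The function name is hypothetical.

```python
def expected_waiting_time(word, alphabet_size=26):
    """Expected number of uniform random letters until `word` first appears.

    Sums alphabet_size**k over every k for which the first k letters of
    the word equal its last k letters (k = len(word) always qualifies).
    """
    return sum(
        alphabet_size**k
        for k in range(1, len(word) + 1)
        if word[:k] == word[-k:]
    )

print(expected_waiting_time("ABRACADABRA"))
```

For ABRACADABRA the matching lengths are 1 ("A"), 4 ("ABRA"), and 11 (the whole word), which gives 26^11 + 26^4 + 26. Self-overlapping words therefore take longer to appear than words of the same length without overlap.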

Gaussian (Normal) Distribution

Normal PDF and CDF.

3Blue1Brown — Why \pi appears in the Gaussian PDF: the Herschel-Maxwell derivation.

Excursion: Tails of the Gaussian

For Gaussian-distributed variables, the probability of an event is determined by its deviation from the mean, measured in standard deviations. The probabilities of falling within 1, 2, and 3 standard deviations of the mean are approximately 68%, 95%, and 99.7%, respectively (the 68-95-99.7 rule).

Source: https://probability4datascience.com/
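The 68-95-99.7 values follow from the Gaussian CDF, which can be written with the error function: the probability of falling within k standard deviations of the mean is erf(k / \sqrt{2}). A short sketch using only the standard library:

```python
import math

def prob_within(k):
    """P[|X - mu| <= k * sigma] for a Gaussian X, via the error function."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
```

The complement, 1 - prob_within(k), is the two-sided tail probability, which decays very quickly as k grows.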

Poisson Distribution

Photons arriving at an image sensor can be modeled as a Poisson process, where the arrival rate depends on the scene brightness.

Source: https://probability4datascience.com/

Random fluctuations in the image are called shot noise. For a Poisson-distributed count Y with rate \lambda, both the mean and the variance equal \lambda, so a stronger signal also has a higher variance. However, the signal-to-noise ratio (SNR) still increases:

\mathrm{SNR} = \frac{\mathbb{E}[Y]}{\sqrt{\mathrm{Var}(Y)}} = \frac{\lambda}{\sqrt{\lambda}} = \sqrt{\lambda}. Source: https://probability4datascience.com/
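The relation SNR = \sqrt{\lambda} can be verified numerically from the Poisson PMF. The sketch below sums the PMF up to a cutoff `k_max`, an assumed truncation that is ample for the rates shown; the function name is illustrative.

```python
import math

def poisson_snr(lam, k_max=2000):
    """Mean divided by standard deviation of a Poisson(lam) variable, from its PMF."""
    p = math.exp(-lam)              # P[Y = 0]
    mean = second_moment = 0.0
    for k in range(1, k_max + 1):
        p *= lam / k                # recursion P[Y = k] = P[Y = k - 1] * lam / k
        mean += k * p
        second_moment += k * k * p
    return mean / math.sqrt(second_moment - mean**2)

for lam in (1, 10, 100):
    print(lam, poisson_snr(lam), math.sqrt(lam))
```

A hundredfold increase in the photon rate improves the SNR only tenfold, which is why low-light images are noticeably noisier.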

Exponential Distribution

In a Poisson process, the number of arrivals in a fixed time interval follows a Poisson distribution, while the inter-arrival times are exponentially distributed. Source: https://probability4datascience.com/
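The link between the two distributions can be illustrated by simulation: generate exponential inter-arrival times and count how many arrivals fall into a unit interval. The sketch below is seeded for reproducibility; the rate `lam` and the number of intervals are assumed values.

```python
import random
import statistics

random.seed(1)
lam = 4.0             # assumed arrival rate (arrivals per unit time)
n_intervals = 20_000

def arrivals_in_unit_interval(rate):
    """Count arrivals in [0, 1) when inter-arrival times are Exponential(rate)."""
    t, count = random.expovariate(rate), 0
    while t < 1.0:
        count += 1
        t += random.expovariate(rate)
    return count

counts = [arrivals_in_unit_interval(lam) for _ in range(n_intervals)]
# For a Poisson(lam) count, mean and variance should both be close to lam.
print(statistics.mean(counts), statistics.variance(counts))
```

A sample mean and sample variance that are both close to `lam` are consistent with Poisson-distributed counts.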

Moments Overview

Source: https://probability4datascience.com/

§ 2.6 Limit Considerations

Jensen’s Inequality

Jensen inequality illustration. Source: https://probability4datascience.com/
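Jensen's inequality states that E[g(X)] \ge g(E[X]) for a convex function g. A small exact check, assuming X is a fair six-sided die and g(x) = x^2 (both chosen only for illustration):

```python
from fractions import Fraction

# Assumed example: X is a fair six-sided die and g(x) = x^2 (convex).
values = range(1, 7)
p = Fraction(1, 6)

e_x = sum(p * x for x in values)        # E[X]
e_gx = sum(p * x**2 for x in values)    # E[g(X)]
g_ex = e_x**2                           # g(E[X])

print(e_gx, g_ex, e_gx >= g_ex)
```

For g(x) = x^2 the gap E[X^2] - (E[X])^2 is exactly the variance of X, so the inequality is strict unless X is constant.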

Markov and Chebyshev Inequalities

Chebyshev inequality illustration.
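Both inequalities can be compared with an exact tail probability. The sketch below assumes X is exponentially distributed with rate 1, so that E[X] = 1, Var(X) = 1, and P[X \ge a] = e^{-a}; the helper `bounds` is hypothetical.

```python
import math

# Assumed example: X ~ Exponential(1), so E[X] = 1 and Var(X) = 1.
mean, var = 1.0, 1.0

def bounds(a):
    """Exact tail P[X >= a] with its Markov and Chebyshev bounds, for a >= 2."""
    exact = math.exp(-a)
    markov = mean / a                    # Markov: P[X >= a] <= E[X] / a
    # For a >= 2 the event |X - 1| >= a - 1 equals X >= a (X is nonnegative),
    # so Chebyshev with deviation c = a - 1 bounds the same tail.
    chebyshev = var / (a - mean) ** 2    # Chebyshev: P[|X - E[X]| >= c] <= Var(X) / c^2
    return exact, markov, chebyshev

for a in (2.0, 4.0, 8.0):
    print(a, bounds(a))
```

Both bounds hold but are loose for this distribution; they use only the mean and the variance and know nothing else about the shape of the PDF.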

Central Limit Theorem

Animated convolution: add IID random variables one by one and watch the normalized sum converge to a Gaussian — for Laplace, Uniform, Rayleigh, Exponential, or Gamma source distributions.

3Blue1Brown — Why sample means converge to a normal distribution: a visual derivation (31 min).

De Moivre–Laplace approximation.
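The De Moivre-Laplace approximation replaces the Binomial(n, p) PMF by a Gaussian density with mean np and variance np(1 - p). A minimal numerical comparison, with n = 100 and p = 0.5 as assumed example parameters:

```python
import math

def binomial_pmf(k, n, p):
    """P[K = k] for K ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def normal_pdf(x, mu, sigma):
    """Gaussian density with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

n, p = 100, 0.5                                  # assumed example parameters
mu, sigma = n * p, math.sqrt(n * p * (1 - p))

for k in (40, 50, 60):
    print(k, binomial_pmf(k, n, p), normal_pdf(k, mu, sigma))
```

The agreement is best near the mean and degrades in the tails and for p close to 0 or 1.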

§ 2.7 Jointly Distributed Random Variables

Excursion: The Library of Babel

The short story The Library of Babel by Jorge Luis Borges (1941) describes a library containing all possible books of length N, written using 25 basic characters (22 letters, the period, the comma, and the space).

This corresponds to the sample space of all sequences of length N over a finite alphabet \mathcal{A}, i.e., \Omega = \mathcal{A}^N, which underlies a multivariate (or product) distribution.

Source: https://medium.com/arch-201/the-library-of-babel-621aa9a924a5
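The size of this sample space, |\Omega| = 25^N, is far too large to print in full, but its order of magnitude is easy to compute. The sketch below assumes the book format given in Borges' story (410 pages of 40 lines with about 80 characters each), which is not stated above; the function name is illustrative.

```python
import math

def log10_num_books(n_chars, alphabet_size=25):
    """log10 of |Omega| = alphabet_size ** n_chars, the number of distinct books."""
    return n_chars * math.log10(alphabet_size)

# Assumption (from Borges' story, not stated above): 410 pages of
# 40 lines with about 80 characters each.
N = 410 * 40 * 80
print(N, log10_num_books(N))
```

The result is the approximate number of decimal digits of |\Omega|, which illustrates how quickly product sample spaces grow with N.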

Excursion: ImageNet

One of the central datasets in modern deep learning is ImageNet. Images are typically resized to 224 \times 224 pixels with 3 color channels. When vectorized, this corresponds to \mathbf{x} \in \mathbb{R}^{150{,}528}.

Source: https://probability4datascience.com/

Spatial Distance and Joint Distribution

Multivariate Gaussian Distribution

Interactive 3D visualization of the bivariate Gaussian with adjustable means \mu_1, \mu_2, standard deviations \sigma_1, \sigma_2, and correlation coefficient \rho.

Joint Gaussian density.
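The joint Gaussian density in two dimensions can be written directly in terms of the five parameters \mu_1, \mu_2, \sigma_1, \sigma_2, and \rho. A minimal sketch using only the standard library (the function name is illustrative):

```python
import math

def bivariate_gaussian_pdf(x1, x2, mu1, mu2, sigma1, sigma2, rho):
    """Joint density of a bivariate Gaussian; needs sigma1, sigma2 > 0 and |rho| < 1."""
    z1 = (x1 - mu1) / sigma1                     # standardized coordinates
    z2 = (x2 - mu2) / sigma2
    one_minus_rho2 = 1.0 - rho**2
    quad = (z1**2 - 2.0 * rho * z1 * z2 + z2**2) / one_minus_rho2
    norm = 2.0 * math.pi * sigma1 * sigma2 * math.sqrt(one_minus_rho2)
    return math.exp(-0.5 * quad) / norm

print(bivariate_gaussian_pdf(0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0))
print(bivariate_gaussian_pdf(1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.8))
```

For \rho = 0 the density factors into the product of two one-dimensional Gaussian densities; as |\rho| approaches 1 the mass concentrates along a line.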

Visualizing Correlation

Joint PMF

Pairs of adjacent letters, for example in English text, can be used to characterize a language. The letter pair “in” is the most common.
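An empirical joint PMF of letter pairs can be estimated by counting adjacent letters in a text. The sketch below uses a short hypothetical sample sentence; a meaningful estimate for a language needs a large corpus, and the helper name `bigram_pmf` is illustrative.

```python
from collections import Counter

def bigram_pmf(text):
    """Empirical joint PMF of adjacent letter pairs (X = first, Y = second letter)."""
    letters = [c for c in text.lower() if c.isalpha() or c == " "]
    pairs = [
        (a, b) for a, b in zip(letters, letters[1:])
        if a != " " and b != " "     # skip pairs that span a word boundary
    ]
    counts = Counter(pairs)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}

# Hypothetical sample text; a real estimate needs a large corpus.
sample = "the rain in spain stays mainly in the plain"
pmf = bigram_pmf(sample)
print(max(pmf, key=pmf.get))
```

Summing the joint PMF over the second letter gives the marginal PMF of the first letter, and dividing by that marginal gives the conditional PMF of the next letter.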

Transformation of Gaussians

Source: https://probability4datascience.com/

Excursion: Eigenfaces

Source: https://probability4datascience.com/

Excursion: Properties and Surprising Behavior

Strange conditional distributions — regression to the mean and related paradoxes in joint distributions.

“How We’re Fooled By Statistics” — regression to the mean in real-world data.

Is the “hot hand” real? Controversies about hidden assumptions in computing conditional probabilities.