1 Random variables and probability distributions
Code
# load additional packages
library(ggplot2)
1.1 Motivation
When analyzing data, we
- look for patterns,
- use a sample to make an inference about the whole,
with the tools provided by probability theory and mathematical statistics.
A good knowledge of these tools helps to minimize their unintentional misuse.
Example (cars)
First 10 observations from the mtcars dataset:
Code
mtcars |>
  tibble::rownames_to_column("model") |>
  head(10) |>
  pander::pander(split.tables = Inf)
model | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21 | 6 | 160 | 110 | 3.9 | 2.62 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21 | 6 | 160 | 110 | 3.9 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 | 4 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.44 | 17.02 | 0 | 0 | 3 | 2 |
Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.46 | 20.22 | 1 | 0 | 3 | 1 |
Duster 360 | 14.3 | 8 | 360 | 245 | 3.21 | 3.57 | 15.84 | 0 | 0 | 3 | 4 |
Merc 240D | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.19 | 20 | 1 | 0 | 4 | 2 |
Merc 230 | 22.8 | 4 | 140.8 | 95 | 3.92 | 3.15 | 22.9 | 1 | 0 | 4 | 2 |
Merc 280 | 19.2 | 6 | 167.6 | 123 | 3.92 | 3.44 | 18.3 | 1 | 0 | 4 | 4 |
Example (floods)
Code
read.csv("data/QmaxVolumeTime_anual(5180).csv") |>
setNames(c("Year", "Peak", "Volume", "Duration")) |>
::select(-Duration) |>
dplyr::pivot_longer(cols = c(Peak, Volume),
tidyrnames_to = "variable", values_to = "value") |>
ggplot() + aes(x = Year, y = value) +
geom_point() + geom_line(linetype = "dotted") +
facet_grid(rows = vars(variable), scales = "free_y") +
#theme(text = element_text(size = 15)) +
ylab("")
Example (system of RV)
1.2 Random variable
A random variable \(X\) is a function that maps a random event to a numeric or some other formal value. Mostly we deal with numerical (i.e., quantitative) random variables (RV); later we will see that the non-numerical (i.e., nominal, qualitative) ones are important as well.
Each value of an RV may occur with some probability. For example, if a person walking by is taken as the statistical unit (whom I observe), then his height is a random outcome that may be mapped by a numerical RV such as "height in centimeters", which formally attains values from \(\mathbb{R}^+\) (in reality just from an interval, say, (30, 250)), but it may also be mapped by a nominal RV with three values: "low", "normal", "tall". The probability that an accidental passer-by will be tall is intuitively smaller than the probability that he will be of normal height.
According to the range (image) of the RV, denoted \(ran(X)\), we classify a variable as either discrete or continuous.
The set of probabilities of all values of an RV is called its probability distribution, and it fully defines the RV. A distribution can be characterized by
- probability mass function (PMF, in case of discrete RV),
- probability density function (PDF, continuous RV) or
- cumulative distribution function (CDF, for any RV with ordered values).
PMF represents probability of certain discrete value \(x\), formally \[ p(x) = Pr(X=x), \] while CDF represents probability that \(X\) will not exceed certain value, i.e. \[ F(x) = Pr(X\leq x) = \begin{cases} \sum\limits_{i:\ x_i\leq x} p(x_i) & \text{for discrete } X \\ \int\limits_{-\infty}^x f(t)dt & \text{for continuous } X \end{cases} \] where \(f\) denotes PDF and \(\{x_i\}\) is a set of all values of \(X\).
If we know the probability distribution, we are able to calculate any characteristic of the RV. These are usually the first moments, such as the mean, variance, skewness and kurtosis, which are measures of location, variability, asymmetry and (in a sense) extremeness, respectively. In particular, the first two are given by the following formulas (a short numerical sketch follows the list),
- mean (expected value, first raw moment) \[ E[X] = \sum_i x_ip(x_i) \quad\text{or}\quad E[X] = \int_{ran(X)} tf(t)dt, \]
- variance (second central moment), generally defined as \[ var(X) = E[(X-E[X])^2] \] which – in terms of probability mass/density – means \[ \begin{split} var(X) &= \sum_i (x_i-E[X])^2p(x_i) \qquad\text{or} \\ var(X) &= \int_{ran(X)} (t-E[X])^2f(t)dt. \end{split} \]
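As a quick numerical sketch of these formulas, the moments of a discrete RV can be computed directly from its PMF; the fair six-sided die below is an assumed example, not part of the text above.
Code
x <- 1:6                    # values of a fair die
p <- rep(1/6, 6)            # constant PMF
EX <- sum(x * p)            # mean: E[X] = 3.5
VX <- sum((x - EX)^2 * p)   # variance: 35/12 ≈ 2.917
c("E[X]" = EX, "var(X)" = VX)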
Besides moments, any quantitative or discrete ordinal RV can also be characterized by quantiles \(q_p\), associated with probability \(p\) through the relation \[ Pr(X\leq q_p) = p. \] Well-known examples are the median (\(q_{0.5}\), the 50% quantile), the quartiles (\(q_0,q_{1/4},q_{1/2},q_{3/4},q_1\)), and their spread measure, the interquartile range (IQR = \(q_{3/4} - q_{1/4}\)).
The distribution of a nominal RV is usually represented by the mode, the value with the highest probability.
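In R, theoretical quantiles are provided by the q* family of functions; a minimal sketch using the standard normal distribution (chosen here only for illustration):
Code
q <- qnorm(c(0.25, 0.5, 0.75))  # quartiles of N(0,1); qnorm(0.5) = 0 is the median
iqr <- q[3] - q[1]              # interquartile range ≈ 1.349
pnorm(q)                        # verifies Pr(X ≤ q_p) = p: returns 0.25 0.50 0.75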
1.3 Distribution of discrete RV
1.3.1 Bernoulli
If a random variable \(X\) can attain only two values, say \(\{M, F\}\), then it follows the Bernoulli distribution with parameter \(p\). The parameter represents the probability that in a random trial one of the values (a chosen one) will be observed.
The probability mass function is defined by the relation
\[ p(x) = \begin{cases} p & \text{if } x=M \\ 1-p & \text{if } x=F \end{cases} \]
Code
data.frame(x = forcats::as_factor(c("M", "F")), PMF = c(2/3, 1/3)) |>
  ggplot() + aes(x = x, y = PMF) +
  geom_point() + geom_linerange(aes(ymin = 0, ymax = PMF)) +
  scale_y_continuous(breaks = c(0, 0.33, 0.67, 1), limits = c(0, 1))
1.3.2 Uniform
If a random variable \(X\) can attain more than two values, all of them with the same probability, then \(X\) follows the uniform distribution. The probability mass function is constant, i.e. \[ p(x) = \frac1{card(ran(X))} \] and there is no unique mode.
A well-known example is rolling a die, with \(X\) representing the outcome (symbol or number of dots on the uppermost side), as simulated below.
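A short simulation sketch (assuming a fair die): the relative frequencies of simulated rolls approach the constant PMF \(1/6\approx0.167\).
Code
set.seed(1)                                       # for reproducibility
rolls <- sample(1:6, size = 1e5, replace = TRUE)  # simulated die rolls
prop.table(table(rolls))                          # each frequency close to 1/6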
Code
n <- 6
dat <- data.frame(x = 1:n) |>
  dplyr::mutate(PMF = 1/n,
                CDF = cumsum(PMF))

dat |>
  ggplot() + aes(x = x, y = PMF) +
  geom_point() + geom_linerange(aes(ymin = 0, ymax = PMF)) +
  scale_x_continuous(breaks = 1:n) +
  scale_y_continuous(breaks = round(0:6/6, 2), limits = c(0, 1))

dat |>
  ggplot() + aes(x = x, y = CDF) +
  geom_point() + geom_step() +
  scale_x_continuous(breaks = 1:n) +
  geom_path(data = data.frame(
    x = c(0, rep(dat$x, each = 2), 7),
    CDF = c(0, 0, rep(dat$CDF, each = 2))
  ))
1.3.3 Binomial
If a Bernoulli trial is repeated \(n\) times and \(X\) is the number of desired outcomes (successes), then \(X\) follows the Binomial distribution with parameters \(n\) and \(p\). Formally we write \(X\sim B(n,p)\).
The probability mass function is defined by \[ p(x) = \binom{n}{x}p^x(1-p)^{n-x} \] for \(x\in\{0,1,\ldots,n\}\). Mean value and variance are also well-defined, \[ E[X] = np, \qquad var(X) = np(1-p). \]
Example
According to a bank's observations, 15% of customers who hold a payment card have asked at least once for the card to be blocked. Now, given a sample of 12 card holders, what is the probability that
1. at most
2. exactly
3. at least
one of them will use this bank service? Also show the PMF and CDF, and find the mean, median and mode.
Code
p <- 0.15
n <- 12
dat <- data.frame(x = 0:n) |>
  dplyr::mutate(PMF = dbinom(x, size = n, prob = p),
                CDF = cumsum(PMF))

dat |>
  ggplot() + aes(x = x, y = PMF) +
  geom_point() + geom_linerange(aes(ymin = 0, ymax = PMF)) +
  scale_x_continuous(breaks = 0:n)

dat |>
  ggplot() + aes(x = x, y = CDF) +
  geom_point() + geom_step() +
  scale_x_continuous(breaks = 0:n) +
  geom_path(data = data.frame(x = c(-1, 0, 0), CDF = c(0, 0, dat$CDF[1])))

with(dat,
  list(probability = c("Pr(X≤1)" = CDF[x == 1], "Pr(X=1)" = PMF[2], "Pr(X≥1)" = 1 - CDF[1]) |> round(3),
       location = c("E(X)" = sum(x * PMF), "Me(X)" = qbinom(0.5, n, p), "Mo(X)" = x[which.max(PMF)]))
) |>
  pander::pander()
probability:
Pr(X≤1) | Pr(X=1) | Pr(X≥1) |
---|---|---|
0.443 | 0.301 | 0.858 |
location:
E(X) | Me(X) | Mo(X) |
---|---|---|
1.8 | 2 | 1 |
1.3.4 Poisson
If the number of trials grows (\(n\to\infty\)) and the probability of a single success decreases (\(p\to0\)) such that the expected value remains the same (\(np=\lambda\)), then the Binomial distribution can be approximated by the Poisson distribution with parameter \(\lambda\) (rate). The PMF is defined by \[
p(x) = \frac{\lambda^xe^{-\lambda}}{x!},\qquad \lambda\in(0,\infty)
\] and its mean is equal to its variance, \(E[X]=var(X)=\lambda\).
Thus the Poisson distribution can be applied to systems with a large number of possible events, each of which is rare.
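The quality of this approximation can be inspected directly; a small sketch with illustrative values \(n=100\), \(p=0.05\) (so \(\lambda=5\)), not taken from the example below:
Code
n <- 100; p <- 0.05; lambda <- n * p
x <- 0:15
# binomial PMF next to its Poisson approximation with the same mean
round(cbind(x = x, binomial = dbinom(x, n, p), poisson = dpois(x, lambda)), 4)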
Example
There are 2000 components in a machine. The failure probability of a single component is 0.005. If the random variable \(X\) represents the number of broken components, what is the probability that more than 10 components will fail?
Code
p <- 0.005
n <- 2000
lambda <- n * p
dat <- data.frame(x = 0:(3 * lambda)) |>
  dplyr::mutate(PMF = dpois(x, lambda),
                CDF = cumsum(PMF))

dat |>
  ggplot() + aes(x = x, y = PMF) +
  geom_point() + geom_linerange(aes(ymin = 0, ymax = PMF))

dat |>
  ggplot() + aes(x = x, y = CDF) +
  geom_point() + geom_step() +
  geom_path(data = data.frame(x = c(-1, 0, 0), CDF = c(0, 0, dat$CDF[1])))

c("λ" = lambda, "Pr(X>10)" = 1 - ppois(10, lambda)) |> round(3)
λ Pr(X>10)
10.000 0.417
1.4 Distribution of continuous RV
1.4.1 Uniform
The discrete uniform distribution also has a counterpart among continuous distributions. Whenever \(X\) follows the uniform distribution with parameters (limits) \(a,b\), i.e., \(X\sim U(a,b)\), its probability density function is \[ f(x) = \frac1{b-a}, \qquad x\in[a,b], \] and zero otherwise. The mean value (also the median) is simply the average of the limits, \(E[X]=\frac{a+b}{2}\), while the variance is given by \(var(X)=\frac1{12}(b-a)^2\). Again, there is no unique mode.
This is, for example, the distribution of the waiting time for a bus whose inter-arrival period is \(b-a\).
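The moment formulas can also be verified by numerical integration; a minimal sketch, assuming \(U(0,10)\):
Code
a <- 0; b <- 10
EX <- integrate(function(t) t * dunif(t, a, b), a, b)$value           # (a+b)/2 = 5
VX <- integrate(function(t) (t - EX)^2 * dunif(t, a, b), a, b)$value  # (b-a)²/12 ≈ 8.33
c("E[X]" = EX, "var(X)" = VX)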
Example
Find the probability that we will wait between 2 and 6 minutes for our bus, if it departs in 10-minute intervals.
Code
b <- 10
a <- 0
ba <- b - a
dat <- data.frame(x = c(a - ba/5, a, a, b, b, b + ba/5)) |>
  dplyr::mutate(PDF = c(0, 0, dunif(c(a, b), a, b), 0, 0),
                CDF = punif(x, a, b),
                linegroup = rep(1:3, each = 2))
dat1 <- dat |>
  dplyr::slice(2:5) |>
  dplyr::mutate(linegroup = c(1, 1, 2, 2))
dat2 <- data.frame(x = c(2, 6)) |>
  dplyr::mutate(CDF = punif(x, a, b))

dat |>
  ggplot() + aes(x = x, y = PDF, group = linegroup) +
  geom_line() +
  geom_line(data = dat1, linetype = 2) +
  scale_y_continuous(breaks = range(dat$PDF)) +
  scale_x_continuous(breaks = seq(0, 10, 2)) +
  geom_rect(xmin = dat2$x[1], xmax = dat2$x[2],
            ymin = 0, ymax = 1/ba, alpha = 0.05)

dat |>
  ggplot() + aes(x = x, y = CDF) +
  geom_line() +
  scale_x_continuous(breaks = seq(0, 10, 2)) +
  scale_y_continuous(breaks = punif(seq(0, 10, 2), a, b)) +
  geom_linerange(aes(ymax = CDF), data = dat2, ymin = 0, linetype = 3) +
  geom_linerange(aes(xmax = x), data = dat2, xmin = dat$x[1], linetype = 3, orientation = "y")

c("E[X]" = mean(c(a, b)), "var(X)" = ba^2/12, "Pr(2<X≤6)" = diff(dat2$CDF)) |> round(1)
E[X] var(X) Pr(2<X≤6)
5.0 8.3 0.4
1.4.2 Normal (Gaussian)
Back in the Binomial distribution case, if we let \(n\) (the number of Bernoulli trials) go to infinity, the distribution of the random variable (which represented the number of successful trials) would approach the normal distribution with parameters \(\mu = np\) (mean) and \(\sigma^2 = np(1-p)\) (variance). The density function is given by \[ f(x) = \frac1{\sqrt{2\pi}\sigma}\exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\}, \] where \(\sigma\) is the so-called standard deviation.
A special case, standard normal distribution, is frequently used in statistics. It is the distribution of the standardized (centered and rescaled) random variable \(Z=\frac{X-\mu}{\sigma}\).
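A one-line numerical check of the standardization (the values of \(\mu\), \(\sigma\) and \(x\) below are assumed arbitrarily): probabilities computed from \(X\) and from \(Z\) coincide.
Code
mu <- 15; sigma <- 3.6; x <- 18
c(pnorm(x, mu, sigma),      # Pr(X ≤ x)
  pnorm((x - mu) / sigma))  # Pr(Z ≤ (x - μ)/σ), both ≈ 0.798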
Code
# parameters
p <- 0.15
n <- 100
mu <- n * p
sigma <- sqrt(n * p * (1 - p))
# data to plot
datB <- data.frame(x = 0:(n * p * 2.2)) |>
  dplyr::mutate(binomial = dbinom(x, size = n, prob = p))
datN <- data.frame(x = seq(0, n * p * 2.2, 0.1)) |>
  dplyr::mutate(normal = dnorm(x, mu, sigma))
# plots
ggplot() + aes(x = x) +
  geom_point(aes(y = binomial, color = "binomial"),
             data = datB, shape = 16) +
  geom_line(aes(y = normal, color = "normal"),
            data = datN, linetype = 1) +
  scale_color_manual(values = c(binomial = 1, normal = 2),
                     labels = c(binomial = "B(100,0.15)", normal = "N(15,3.6²)")) +
  guides(color = guide_legend(
    override.aes = list(shape = c(16, NA), linetype = c(NA, 1)),
    title = "distribution")) +
  theme(legend.position = c(0.8, 0.8)) +
  ylab("PMF/PDF")
# standardization
dat <- data.frame(x = seq(-3 * sigma, mu + 3 * sigma, 0.1)) |>
  dplyr::mutate("N(15,3.6²)" = dnorm(x, mu, sigma),
                "N(0,3.6²)" = dnorm(x, 0, sigma),
                "N(0,1)" = dnorm(x)) |>
  tidyr::pivot_longer(-x, names_to = "distribution", values_to = "PDF")
dat |>
  ggplot() + aes(x = x, y = PDF, color = distribution) +
  geom_line() +
  scale_color_manual(values = 4:2) +
  guides(color = guide_legend(reverse = TRUE)) +
  theme(legend.position = c(0.8, 0.8))
If two independent random variables \(X_1,X_2\) follow the normal distribution, i.e., \(X_1\sim N(\mu_1,\sigma_1^2)\) and \(X_2\sim N(\mu_2,\sigma_2^2)\), then their linear combination \(Y = aX_1 + bX_2 + c\) also follows the normal distribution, \(Y \sim N(a\mu_1+b\mu_2+c,\ a^2\sigma_1^2 + b^2\sigma_2^2)\).
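A Monte Carlo sketch of this property (the coefficients and parameters are chosen arbitrarily for illustration):
Code
set.seed(42)
x1 <- rnorm(1e5, mean = 1, sd = 2)  # X1 ~ N(1, 2²)
x2 <- rnorm(1e5, mean = 3, sd = 1)  # X2 ~ N(3, 1²), independent of X1
y <- 2 * x1 - x2 + 5                # a = 2, b = -1, c = 5
c(mean = mean(y), var = var(y))     # theory: E[Y] = 4, var(Y) = 4·4 + 1·1 = 17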
1.4.3 Log-normal
If \(X\sim N(\mu,\sigma^2)\), then the random variable \(Y=e^X\) follows the log-normal distribution, \(Y\sim LN(\mu,\sigma^2)\).
In nature, many random variables follow a distribution that is close to log-normal (i.e., positive-valued). Thus, by taking their logarithm, it is possible to turn their distribution into an approximately normal one, which makes available the vast number of statistical methods that require normality.
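A short simulation sketch of the log transformation: the logarithm of an \(LN(0,1)\) sample behaves like an \(N(0,1)\) sample.
Code
set.seed(1)
y <- rlnorm(1e5, meanlog = 0, sdlog = 1)  # Y ~ LN(0,1)
z <- log(y)                               # should be ~ N(0,1)
c(mean = mean(z), sd = sd(z))             # close to 0 and 1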
Code
dat <- data.frame(x = seq(-3, 5, 0.01)) |>
  dplyr::mutate("N(0,1)" = dnorm(x),
                "LN(0,1)" = dlnorm(x)) |>
  tidyr::pivot_longer(-x, names_to = "distribution", values_to = "PDF")
dat |>
  ggplot() + aes(x = x, y = PDF, color = distribution) +
  geom_line() +
  scale_color_manual(values = c(6, 4)) +
  guides(color = guide_legend(reverse = TRUE)) +
  theme(legend.position = c(0.8, 0.8))
1.4.4 \(\chi^2\) (chi-squared) distribution
If \(X_1,X_2,\ldots,X_n\) are independent and each follows the standard normal distribution \(N(0,1)\), then \(Y=X_1^2+X_2^2+\ldots+X_n^2\) follows the chi-square(d) distribution with (parameter) \(n\) degrees of freedom, i.e., \(Y\sim\chi^2(n)\). Its mean value is \(E[Y]=n\) and its variance \(var(Y)=2n\).
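The construction can be checked by simulation; a sketch for \(n=3\) (the choice of \(n\) is illustrative):
Code
set.seed(1)
n <- 3
y <- colSums(matrix(rnorm(n * 1e5), nrow = n)^2)  # 1e5 sums of n squared N(0,1) draws
c(mean = mean(y), var = var(y))                   # theory: E[Y] = 3, var(Y) = 6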
Code
dat <- data.frame(x = seq(-3, 5, 0.01)) |>
  dplyr::mutate("N(0,1)" = dnorm(x),
                "χ²(3)" = dchisq(x, 3),
                "χ²(4)" = dchisq(x, 4)) |>
  tidyr::pivot_longer(-x, names_to = "distribution", values_to = "PDF")
dat |>
  ggplot() + aes(x = x, y = PDF, color = distribution) +
  geom_line() +
  scale_color_manual(values = c(4, 3, 2)) +
  theme(legend.position = c(0.8, 0.75))
1.4.5 t-distribution
If \(X_1\sim N(0,1)\) and \(X_2\sim\chi^2(n)\) are independent, then \(Y=\frac{X_1}{\sqrt{X_2/n}}\) follows the Student's t-distribution with (parameter) \(n\) degrees of freedom, i.e., \(Y\sim t(n)\), for which the mean value is \(E[Y]=0\) (for \(n>1\)) and the variance \(var(Y)=\frac{n}{n-2}\) (for \(n>2\)).
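Again, a Monte Carlo sketch of the construction, assuming \(n=5\) degrees of freedom:
Code
set.seed(1)
n <- 5
x1 <- rnorm(1e5)                 # N(0,1)
x2 <- rchisq(1e5, df = n)        # χ²(n), independent of x1
y <- x1 / sqrt(x2 / n)           # should follow t(n)
c(mean = mean(y), var = var(y))  # theory: 0 and n/(n-2) ≈ 1.667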
Code
dat <- data.frame(x = seq(-3, 5, 0.01)) |>
  dplyr::mutate("N(0,1)" = dnorm(x),
                "t(2)" = dt(x, 2),
                "t(20)" = dt(x, 20)) |>
  tidyr::pivot_longer(-x, names_to = "distribution", values_to = "PDF")
dat |>
  ggplot() + aes(x = x, y = PDF, color = distribution) +
  geom_line() +
  scale_color_manual(values = c(4, 3, 2)) +
  theme(legend.position = c(0.8, 0.75))