Statistics
See also: Statistics.
Probability
- Random variable
- Discrete (categorical) and continuous variables.
- Conditional probability
- Bayes’ theorem: P(A|B) = P(A,B)/P(B) where P(A,B) = P(B|A)P(A)
- Marginal probability: over a subset of variables
- Expectation: \(\E_p f = \sum_x p(x) f(x)\) or \(\int p(x) f(x)\,dx\).
- Independence
- Correlation
- Pearson correlation coefficient \(r\)
- R-squared or coefficient of determination
- Entropy \(H(p) = -\E_p \log p = -\int p\log p = -\sum p\log p\)
- Cross entropy \(H(p, q) = -\E_p \log q\)
- Mutual information
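A minimal numpy sketch of the information quantities above, using made-up discrete distributions \(p\) and \(q\) and a made-up joint table for mutual information:

```python
import numpy as np

# Hypothetical discrete distributions over the same support.
p = np.array([0.5, 0.25, 0.25])
q = np.array([0.8, 0.1, 0.1])

entropy = -np.sum(p * np.log(p))         # H(p) = -E_p log p
cross_entropy = -np.sum(p * np.log(q))   # H(p, q) = -E_p log q

# Mutual information from a hypothetical joint distribution P(X, Y):
# I(X; Y) = sum_{x,y} P(x,y) log[ P(x,y) / (P(x) P(y)) ]
joint = np.array([[0.3, 0.1],
                  [0.1, 0.5]])
px = joint.sum(axis=1, keepdims=True)
py = joint.sum(axis=0, keepdims=True)
mutual_info = np.sum(joint * np.log(joint / (px * py)))

print(entropy, cross_entropy, mutual_info)
```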
Distributions
- Joint distribution between multiple variables.
- Summary statistics
- Measures of central tendency: mean, median, and mode.
- Percentile
- Outliers
- Z-score: number of standard deviations from the mean.
- Density curve
- Cumulative distribution function
- Variability
- Variance: \(E[(X-E[X])^2]\), second central moment
- Standard deviation
- Interquartile range (IQR)
- Skew
- Moments: \(E[X^n]\)
- Central limit theorem
- Law of large numbers
- Memorylessness: the remaining wait time does not depend on the time already elapsed (geometric, exponential)
- Heavy-tailed distribution
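A quick simulation sketch of the law of large numbers and the central limit theorem; the exponential population and sample sizes are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Law of large numbers: the sample mean converges to the population mean (1.0 here).
samples = rng.exponential(scale=1.0, size=100_000)
print(samples.mean())                  # close to 1.0

# Central limit theorem: means of n draws are approximately N(mean, var/n),
# even though the underlying exponential distribution is skewed.
n = 50
means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)
print(means.mean(), means.std())       # roughly 1.0 and sqrt(1/n) ≈ 0.14
```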
Distributions
- Discrete
- Binomial(n, p): number of successes
- Geometric(p): number of trials until the first success; maxent discrete distribution with specified mean
- Poisson(rate λ): number of events in a unit time, λ^k e^(-λ)/k!
- Continuous
- Normal(μ, σ): \(\frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac12\left(\frac{x-\mu}{\sigma}\right)^2\right]\). Maxent for given variance.
- 68% density within 1σ, 95% within 2σ, 99.7% within 3σ.
- Exponential(rate λ): time until the next event. \(\lambda e^{-\lambda x}\). Maxent for given mean 1/λ.
- Beta(α, β): mean α/(α+β); conjugate prior for the Bernoulli/binomial parameter
- Gamma(shape n, rate λ): time until the nth event = sum of n iid Exponential(λ)
- https://en.wikipedia.org/wiki/Template:Probability_distributions
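A simulation sketch of two facts from this list: the 68–95–99.7 rule for the normal, and the gamma as a sum of exponential waiting times (the shape and rate values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# 68-95-99.7 rule: fraction of standard normal draws within 1, 2, 3 standard deviations.
z = rng.normal(size=100_000)
print([np.mean(np.abs(z) < k) for k in (1, 2, 3)])   # ≈ [0.68, 0.95, 0.997]

# Gamma(shape=n, rate=λ) as the sum of n iid Exponential(rate=λ) waiting times.
n, lam = 5, 2.0
sums = rng.exponential(scale=1 / lam, size=(100_000, n)).sum(axis=1)
gamma = rng.gamma(shape=n, scale=1 / lam, size=100_000)
print(sums.mean(), gamma.mean())                     # both ≈ n/λ = 2.5
```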
Estimation
- Statistical inference: estimating parameters or testing hypotheses.
- Bias
- Variance of sample mean: s^2/n
- Biased (plug-in) sample variance: \(\frac{n-1}{n}\) times the unbiased estimator \(S^2\) below
- Sample variance, unbiased population variance estimator \(S^2 = \sum_i \frac{(x_i-\bar{X})^2}{n-1}\)
- If the population is normal, then \(\frac{(n-1) S^2}{\sigma^2} \sim \chi^2_{n-1}\)
- A statistical model represents the data generating process.
- Set of all possible distributions over the observations.
- The dimension of a model is its number of parameters. A parametric model has a finite dimension.
- A parameterized model is identifiable if distinct parameter values give rise to distinct distributions.
- Design matrix X: rows are observations, and columns are explanatory variables.
- We can add a constant column so we don’t need to specify an explicit bias parameter \(b\).
- Maximum Entropy (MaxEnt) finds the least informative distribution given constraints such as mean and variance.
- Bayes network
- Approximate Bayesian computation (ABC)
- Generalized likelihood uncertainty estimation (GLUE)
- James–Stein estimator achieves lower total squared error than the MLE when estimating a multivariate normal mean in three or more dimensions. Biased and nonlinear.
- German tank problem
- https://en.wikipedia.org/wiki/Calibration_(statistics)
Design of experiments (DOE)
- Simple random sample
- Stratified sample
- Observational study
- Observation error
- Case-control study: odds ratio or relative risk of people with vs. without a condition
- Matched pairs
- Survey bias
- Response bias including social-desirability bias
- https://en.wikipedia.org/wiki/Template:Social_surveys
- Randomized controlled trial
Maximum likelihood estimation (MLE) finds parameters \(\theta\) to maximize \(P(\text{data}\mid\theta)\).
- Maximum a posteriori (MAP) adds a prior.
- \(\displaystyle\hat{\theta} = \argmax_{\theta} \pi(\theta|X^n) = \argmax_{\theta} p(X^n|\theta) \pi(\theta)\).
- Better than MLE for small n. MLE uses a constant prior.
- Asymptotically normal: \(\sqrt{n}(\hat{\theta}_{mle} - \theta) \overset{d}{\rightarrow} N(0, I_1(\theta)^{-1})\)
- Equivariant: if \(\eta = g(\theta)\) then \(\hat{\eta} = g(\hat{\theta})\)
- Score: \(s(\theta) = \sum \nabla_{\theta} \log p(X_i; \theta)\)
- Fisher information: \(I_n(\theta) = -\E \nabla^2_\theta \log p(X^n; \theta) = \E[s(\theta)s(\theta)^T] = \text{Cov}(s)\), since \(\E s = 0\)
- Asymptotically efficient: Cramér-Rao \(\V \hat{\theta} \geq \frac{1}{nI_1(\theta)}\)
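A small sketch of MLE and its asymptotic normality for an exponential rate (true rate and sample size are arbitrary): the MLE is \(\hat\lambda = 1/\bar{X}\), and the per-observation Fisher information is \(I_1(\lambda) = 1/\lambda^2\), so by Cramér–Rao the MLE should scatter around \(\lambda\) with standard deviation roughly \(\lambda/\sqrt{n}\).

```python
import numpy as np

rng = np.random.default_rng(0)
lam_true, n = 2.0, 400

# Repeat the experiment many times to see the sampling distribution of the MLE.
data = rng.exponential(scale=1 / lam_true, size=(5_000, n))
lam_hat = 1.0 / data.mean(axis=1)    # closed-form MLE for the exponential rate

print(lam_hat.mean())                # ≈ 2.0 (small finite-sample bias)
print(lam_hat.std())                 # ≈ lam_true / sqrt(n) = 0.1
```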
Linear regression
- Linear predictor function: \(y = Xw\) for design matrix \(X\) and weights \(w\).
- Simple linear regression has one explanatory variable
- Ordinary least squares (OLS)
- independent observations and errors
- homoscedasticity: equal variance
- no multicollinearity
- Estimator: \((X^T X)^{-1} X^T y\).
- Computing the inverse directly is numerically unstable.
- Instead, solve the normal equations \((X^T X) \hat{w} = X^T y\) for the weights (see the sketch after this list).
- Residuals and residual plots
- Root mean square deviation (RMSD)
- Gauss–Markov theorem: OLS has the lowest sampling variance within the class of linear unbiased estimators. It only assumes errors are uncorrelated with equal variance; it does not require normal or iid errors.
- Weighted least squares (WLS)
- For observations with unequal variance, the weight of an observation should be inversely proportional to its variance.
- Generalized least squares (GLS)
- Ridge regression is linear regression with L2 regularization on the weights
- Min \(\|Xw-y\|_2^2 + \lambda \|w\|^2\)
- Estimator: \((X^T X + \lambda I)^{-1} X^T y\)
- Lasso (least absolute shrinkage and selection operator) is L1 regularization
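A minimal numpy sketch of the estimators above (synthetic data): build a design matrix with a constant column, solve the normal equations instead of forming the inverse, and add \(\lambda I\) for ridge.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=(n, 2))
y = 3.0 * x[:, 0] - 2.0 * x[:, 1] + 0.5 + rng.normal(scale=0.1, size=n)

# Design matrix with a constant column, so no separate bias parameter is needed.
X = np.column_stack([np.ones(n), x])

# OLS: solve (X^T X) w = X^T y rather than computing the inverse explicitly.
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge: (X^T X + λI) w = X^T y (for simplicity the intercept is penalized too).
lam = 0.1
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

print(w_ols)     # ≈ [0.5, 3.0, -2.0]
print(w_ridge)   # slightly shrunk toward zero
```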
Binary logistic regression
- Predict \(\hat{P}(y=1|x)\) using \(\hat{y} = s(xw+b)\) with weight \(w\), bias \(b\), and sigmoid activation.
- Sigmoid converts logits or log-odds \(z\) to probabilities.
- \(\displaystyle s(z) = \frac1{1+e^{-z}}\).
- Binary cross entropy loss aka negative log likelihood.
- \(\ell(\hat{y}, y) = -\E_p \log q = -y\ln \hat{y} - (1-y)\ln(1-\hat{y})\)
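A minimal sketch of the sigmoid, the binary cross entropy loss, and plain gradient descent on synthetic data (learning rate and iteration count are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=(n, 2))
y = (x[:, 0] - x[:, 1] + 0.2 * rng.normal(size=n) > 0).astype(float)

w, b, lr, eps = np.zeros(2), 0.0, 0.1, 1e-12
for _ in range(200):
    y_hat = sigmoid(x @ w + b)       # predicted P(y=1|x)
    loss = -np.mean(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))
    grad = y_hat - y                 # gradient of the loss w.r.t. the logits
    w -= lr * x.T @ grad / n
    b -= lr * grad.mean()

print(loss, w, b)
```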
Multinomial logistic regression
- Target \(y\) can take values from 0 to \(k-1\).
- Predict a multinomial distribution over categories: \(\hat{y}_k = \hat{P}(y=k|x) = \sigma(xW+b)_k\) with weight matrix \(W\) and bias vector \(b\)
- Softmax \(\displaystyle\sigma(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}\) normalizes scores \(z\) into a valid probability distribution \(p\).
- Multinomial cross entropy loss
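A sketch of a numerically stable softmax and the multinomial cross entropy for integer class labels (logits and labels are made up):

```python
import numpy as np

def softmax(z):
    # Subtracting the row max does not change the result but avoids overflow.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])
y = np.array([0, 2])                 # integer class labels in {0, ..., k-1}

probs = softmax(logits)
cross_entropy = -np.mean(np.log(probs[np.arange(len(y)), y]))
print(probs, cross_entropy)
```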
KL divergence or relative entropy from Q to P: \(\KL(p\|q) = \E_p \log\frac{p}{q}\).
- Nonnegative with equality if \(p=q\). Asymmetric.
- \(KL = -H(p) + H(p,q)\), the number of extra bits required to code samples from \(p\) using the optimal code for \(q\).
- Expected log-likelihood ratio \(\log \frac{p}{q}\) under \(x\sim p\).
- Minimizing \(\KL(\hat{p}\|q)\) over \(q\) = maximum likelihood: \(\hat{\E}_p \log q = \frac1n \sum_i \log q(X_i)\).
- \(\KL(p\|q)\) ensures that \(q>0\) where \(p>0\), so \(q\) averages across modes of \(p\). \(\KL(q\|p)\) ensures that \(q=0\) where \(p=0\), so \(q\) finds one mode of \(p\).
- For uniform \(q\), \(\KL(p\|q) = -\sum p\log q - H(p) = \log |X| - H(p)\)
- Other distances:
- The Jensen–Shannon divergence is a symmetrized KL: \(\mathrm{JSD}(p\|q) = \frac12 \KL(p\|m) + \frac12 \KL(q\|m)\) where \(m = \frac12(p+q)\).
- Wasserstein distance (earth mover distance)
- Total variation distance
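A small sketch of KL and Jensen–Shannon divergence for discrete distributions (the example distributions are arbitrary):

```python
import numpy as np

def kl(p, q):
    # KL(p || q) = sum_x p(x) log(p(x) / q(x)); assumes q > 0 wherever p > 0.
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def jsd(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.5, 0.25, 0.25])
q = np.array([0.25, 0.25, 0.5])
print(kl(p, q), kl(q, p), jsd(p, q))   # KL is asymmetric; JSD is symmetric
```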
Testing
- Test statistic: convert sample statistic to a score assuming H0
- t-test: location test with unknown population variance; independent samples, data approximately normal or n > 30 (see the sketch after this list)
- t-distribution: \(t = \frac{Z}{\sqrt{U/n}}\), where \(Z\sim N(0,1)\) and \(U\sim \chi^2_n\) are independent
- Student’s t-test: \(\frac{\bar{X}-\mu}{\sqrt{S^2/n}} \sim t_{n-1}\) because \(\bar{X}\) is independent of the residuals and thus independent of \(S^2\)
- Multivariate t-distribution
- ANOVA: check for statistically significant differences between groups.
- Compare within-group variance with between-group variance.
- Assume iid normal errors with homogeneous variance.
- Nonparametric testing
- Wilcoxon rank sum test
- KS test: compare empirical distributions
- Likelihood-ratio test or Wilks test
- Neyman–Pearson lemma: for simple hypotheses, the likelihood-ratio test based on \(\log \frac{p(x)}{q(x)}\) is the most powerful test at a given significance level for deciding whether a sample \(x\) was drawn from P or Q.
- Wald test
- Model selection and goodness of fit
- Akaike information criterion (AIC)
- Bayesian information criterion
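A quick sketch of the one-sample t-test with scipy on synthetic data (the null mean of 0 is just an example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.3, scale=1.0, size=40)

# H0: the population mean is 0.
result = stats.ttest_1samp(sample, popmean=0.0)
print(result.statistic, result.pvalue)

# The same statistic by hand: (x̄ - μ0) / sqrt(S² / n), compared to t_{n-1}.
t = (sample.mean() - 0.0) / np.sqrt(sample.var(ddof=1) / len(sample))
print(t)
```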
https://en.wikipedia.org/wiki/Bayesian_probability
- Dempster-Shafer theory of belief functions.
- transferable belief model (TBM)
- https://en.wikipedia.org/wiki/Probabilistic_programming
Curse of dimensionality
- Sketching
- Count–min sketch
- HyperLogLog
- Approximate counting algorithm
Graphical models
- Directed models or Bayes nets
- Undirected models: Markov random fields (MRF) and RBMs
- Problems
- Inference: given parameters \(\theta\), sample/compute marginals
- Sample \(p(x)\): easy for directed models, hard for undirected
- Sample \(p(z|x)\): easy in RBMs, hard in Bayes nets
- Hard due to an unknown normalizing constant/partition function
- Rejection sampling: to sample from \(f\), propose \(x \sim g\) and accept with probability \(\frac{f(x)}{M g(x)}\), where \(f \le Mg\); \(M\) is the expected number of proposals per accepted sample (see the sketch after this list).
- Learning: find parameters \(\theta\) to maximize \(\log p(x)\)
- MCMC methods
- Metropolis–Hastings (1970)
- Gibbs sampling
- Variational inference solves using optimization
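A minimal rejection-sampling sketch, as referenced above: draw from a Beta(2, 2) target using a uniform proposal, accepting each draw with probability \(f(x)/(M g(x))\); the target and bound \(M\) are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def target_density(x):
    # Beta(2, 2) density on [0, 1], used here as an illustrative target f.
    return 6.0 * x * (1.0 - x)

M = 1.5                               # bound with f(x) <= M * g(x) for g = Uniform(0, 1)
proposals = rng.uniform(size=100_000)
accept = rng.uniform(size=100_000) < target_density(proposals) / M
samples = proposals[accept]

print(accept.mean())                  # ≈ 1/M: M proposals per accepted sample on average
print(samples.mean(), samples.var())  # ≈ 0.5 and 0.05 for Beta(2, 2)
```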
Variational inference
- Have: graphical model \(z\to x\)
- Want: posterior \(p(z|x) = \frac{p(x|z)p(z)}{p(x)}\)
- Problem: evidence \(p(x) = \int p(x,z)\,dz\) is hard to compute
- Solution: find an approximation \(q(z)\) that minimizes \(\KL(q(z)\|p(z|x))\)
- Note that forward KL requires expectations w.r.t. p which is hard
- \(\KL(q\|p) = -H(q) - \E_q \log p(z,x) + \log p(x)\)
- Since \(\log p(x)\) is constant, minimizing KL = maximizing the ELBO \(H(q) + \E_q \log p(x,z)\) \(= \log p(x) - \KL(q(z) \| p(z|x))\) \(\le \log p(x)\).
- More generally, we can compute partition functions \(Z\) for \(p(x) = \frac1Z \exp(E(x))\).
- \(\KL(q\|p) = -H(q) - \E_q \log p = -H(q) - \E_q E(x) + \log Z\), so \(\log Z = H(q) + \E_q E(x) + \KL(q\|p) \ge H(q) + \E_q E(x)\); maximizing this variational (Gibbs) free energy over \(q\) recovers \(\log Z\).
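A toy numeric sketch of the ELBO for a made-up discrete model \(z\to x\): for any \(q(z)\), the ELBO is at most \(\log p(x)\), with equality when \(q(z) = p(z|x)\).

```python
import numpy as np

# Made-up discrete model: latent z in {0, 1}, with x already observed.
p_z = np.array([0.6, 0.4])            # prior p(z)
p_x_given_z = np.array([0.2, 0.9])    # likelihood p(x|z) at the observed x

p_xz = p_z * p_x_given_z              # joint p(x, z) at the observed x
log_px = np.log(p_xz.sum())           # evidence log p(x)
posterior = p_xz / p_xz.sum()         # exact posterior p(z|x)

def elbo(q):
    # ELBO = H(q) + E_q[log p(x, z)] = log p(x) - KL(q || p(z|x))
    return -np.sum(q * np.log(q)) + np.sum(q * np.log(p_xz))

print(log_px)
print(elbo(np.array([0.5, 0.5])))     # strictly below log p(x)
print(elbo(posterior))                # equals log p(x)
```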
- Control variables are held constant.
- Inverse distance weighting (IDW) for multivariate interpolation.
- Intransitive dice
- https://en.wikipedia.org/wiki/Regression_toward_the_mean
- https://en.wikipedia.org/wiki/Prevention_paradox
- https://en.wikipedia.org/wiki/Wisdom_of_the_crowd
- https://en.wikipedia.org/wiki/Aggregate_function
- https://en.wikipedia.org/wiki/Martingale_(betting_system)
- https://en.wikipedia.org/wiki/Oscar%27s_grind
https://en.m.wikipedia.org/wiki/Metropolis–Hastings_algorithm
https://en.m.wikipedia.org/wiki/Rejection_sampling#Adaptive_rejection_sampling
https://en.m.wikipedia.org/wiki/Inverse_transform_sampling
https://en.m.wikipedia.org/wiki/Box–Muller_transform
https://jotterbach.github.io/content/posts/tsne/2016-05-23-TSNE/
https://strathprints.strath.ac.uk/52372/1/Connor_etal_LNCS2013_Evaluation_Jensen_Shannon_distance_over_sparse_data.pdf
https://github.com/cran/entropy/blob/master/R/KL.plugin.R
https://github.com/cran/entropy/blob/master/R/entropy.empirical.R
https://www.stefanom.io/notes/2021/02/25/concept_drift.html
http://web.archive.org/web/20150121224302/https://www.tsc.uc3m.es/~fernando/bare_conf3.pdf
https://notesonai.com/Jensen%E2%80%93Shannon+Divergence
Variational inference
https://arxiv.org/abs/1606.05908
https://arxiv.org/abs/1312.6114
https://arxiv.org/abs/2108.13083