Statistics
See also: Statistics.
Probability
- Random variable
- Discrete (categorical) and continuous variables.
- Conditional probability
- Bayes’ theorem: P(A|B) = P(A,B)/P(B) where P(A,B) = P(B|A)P(A)
- Marginal probability: over a subset of variables
- Expectation: \(\E_p f = \sum_x p(x) f(x)\) or \(\int p(x) f(x)\,dx\).
- Independence
- Correlation
- Pearson correlation coefficient \(r\)
- R-squared or coefficient of determination
- Entropy \(H(p) = -\E_p \log p = -\int p\log p = -\sum p\log p\)
- Cross entropy \(H(p, q) = -\E_p \log q\)
- Mutual information
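A minimal numpy sketch of the information quantities above, using made-up discrete distributions \(p\) and \(q\) and a made-up joint table for mutual information:

```python
import numpy as np

# Hypothetical discrete distributions over the same support.
p = np.array([0.5, 0.25, 0.25])
q = np.array([0.8, 0.1, 0.1])

entropy = -np.sum(p * np.log(p))         # H(p) = -E_p log p
cross_entropy = -np.sum(p * np.log(q))   # H(p, q) = -E_p log q

# Mutual information from a hypothetical joint distribution P(X, Y):
# I(X; Y) = sum_{x,y} P(x,y) log[ P(x,y) / (P(x) P(y)) ]
joint = np.array([[0.3, 0.1],
                  [0.1, 0.5]])
px = joint.sum(axis=1, keepdims=True)
py = joint.sum(axis=0, keepdims=True)
mutual_info = np.sum(joint * np.log(joint / (px * py)))

print(entropy, cross_entropy, mutual_info)
```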
Distributions
- Joint distribution between multiple variables.
- Summary statistics
- Measures of central tendency: mean, median, and mode.
- Percentile
- Outliers
- Z-score: number of standard deviations from the mean.
- Density curve
- Cumulative distribution function
- Variability
- Variance: \(E[(X-E[X])^2]\), second central moment
- Standard deviation
- Interquartile range (IQR)
- Skew
- Moments: \(E[X^n]\)
- Central limit theorem
- Law of large numbers
- Memorylessness: the remaining wait time does not depend on the time already elapsed (geometric, exponential)
- Heavy-tailed distribution
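A quick simulation sketch of the law of large numbers and the central limit theorem; the exponential population and sample sizes are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Law of large numbers: the sample mean converges to the population mean (1.0 here).
samples = rng.exponential(scale=1.0, size=100_000)
print(samples.mean())                  # close to 1.0

# Central limit theorem: means of n draws are approximately N(mean, var/n),
# even though the underlying exponential distribution is skewed.
n = 50
means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)
print(means.mean(), means.std())       # roughly 1.0 and sqrt(1/n) ≈ 0.14
```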
Distributions
- Discrete
- Binomial(n, p): number of successes
- Geometric(p): number of trials until the first success; maxent discrete distribution with specified mean
- Poisson(rate λ): number of events in a unit time, λ^k e^(-λ)/k!
- Continuous
- Normal(μ, σ): \(\frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac12\left(\frac{x-\mu}{\sigma}\right)^2\right]\). Maxent for given variance.
- 68% density within 1σ, 95% within 2σ, 99.7% within 3σ.
- Exponential(rate λ): time until the next event. \(\lambda e^{-\lambda x}\). Maxent for given mean 1/λ.
- Beta(α, β): mean α/(α+β); conjugate prior for the Bernoulli/binomial parameter
- Gamma(shape n, rate λ): time until the nth event = sum of n iid Exponential(λ)
- https://en.wikipedia.org/wiki/Template:Probability_distributions
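A simulation sketch of two facts from this list: the 68–95–99.7 rule for the normal, and the gamma as a sum of exponential waiting times (the shape and rate values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# 68-95-99.7 rule: fraction of standard normal draws within 1, 2, 3 standard deviations.
z = rng.normal(size=100_000)
print([np.mean(np.abs(z) < k) for k in (1, 2, 3)])   # ≈ [0.68, 0.95, 0.997]

# Gamma(shape=n, rate=λ) as the sum of n iid Exponential(rate=λ) waiting times.
n, lam = 5, 2.0
sums = rng.exponential(scale=1 / lam, size=(100_000, n)).sum(axis=1)
gamma = rng.gamma(shape=n, scale=1 / lam, size=100_000)
print(sums.mean(), gamma.mean())                     # both ≈ n/λ = 2.5
```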
Estimation
- Statistical inference: estimating parameters or testing hypotheses.
- Bias
- Variance of sample mean: s^2/n
- Biased (plug-in) sample variance: \(\frac{n-1}{n}\) times the unbiased estimator \(S^2\) below
- Sample variance, unbiased population variance estimator \(S^2 = \sum_i \frac{(x_i-\bar{X})^2}{n-1}\)
- If the population is normal, then \(\frac{(n-1) S^2}{\sigma^2} \sim \chi^2_{n-1}\)
- A statistical model represents the data generating process.
- Set of all possible distributions over the observations.
- The dimension of a model is its number of parameters. A parametric model has a finite dimension.
- A parameterized model is identifiable if distinct parameter values give rise to distinct distributions.
- Design matrix X: rows are observations, and columns are explanatory variables.
- We can add a constant column so we don’t need to specify an explicit bias parameter \(b\).
- Maximum Entropy (MaxEnt) finds the least informative distribution given constraints such as mean and variance.
- Bayes network
- Approximate Bayesian computation (ABC)
- Generalized likelihood uncertainty estimation (GLUE)
- James–Stein estimator achieves lower total squared error than the MLE when estimating a multivariate normal mean in three or more dimensions. Biased and nonlinear.
- German tank problem
- https://en.wikipedia.org/wiki/Calibration_(statistics)
Design of experiments (DOE)
- Simple random sample
- Stratified sample
- Observational study
- Observation error
- Case-control study: odds ratio or relative risk of people with vs. without a condition
- Matched pairs
- Survey bias
- Response bias including social-desirability bias
- https://en.wikipedia.org/wiki/Template:Social_surveys
- Randomized controlled trial
Maximum likelihood estimation (MLE) finds parameters \(\theta\) to maximize \(P(\text{data}\mid\theta)\).
- Maximum a posteriori (MAP) adds a prior.
- \(\displaystyle\hat{\theta} = \argmax_{\theta} \pi(\theta|X^n) = \argmax_{\theta} p(X^n|\theta) \pi(\theta)\).
- Better than MLE for small n. MLE uses a constant prior.
- Asymptotically normal: \(\sqrt{n}(\hat{\theta}_{mle} - \theta) \overset{d}{\rightarrow} N(0, I_1(\theta)^{-1})\)
- Equivariant: if \(\eta = g(\theta)\) then \(\hat{\eta} = g(\hat{\theta})\)
- Score: \(s(\theta) = \sum \nabla_{\theta} \log p(X_i; \theta)\)
- Fisher information: \(I_n(\theta) = -\E \nabla^2_\theta \log p(X^n; \theta) = \E[s(\theta)s(\theta)^T] = \text{Cov}(s)\), since \(\E s = 0\)
- Asymptotically efficient: Cramér-Rao \(\V \hat{\theta} \geq \frac{1}{nI_1(\theta)}\)
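A small sketch of MLE and its asymptotic normality for an exponential rate (true rate and sample size are arbitrary): the MLE is \(\hat\lambda = 1/\bar{X}\), and the per-observation Fisher information is \(I_1(\lambda) = 1/\lambda^2\), so by Cramér–Rao the MLE should scatter around \(\lambda\) with standard deviation roughly \(\lambda/\sqrt{n}\).

```python
import numpy as np

rng = np.random.default_rng(0)
lam_true, n = 2.0, 400

# Repeat the experiment many times to see the sampling distribution of the MLE.
data = rng.exponential(scale=1 / lam_true, size=(5_000, n))
lam_hat = 1.0 / data.mean(axis=1)    # closed-form MLE for the exponential rate

print(lam_hat.mean())                # ≈ 2.0 (small finite-sample bias)
print(lam_hat.std())                 # ≈ lam_true / sqrt(n) = 0.1
```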
Linear regression
- Linear predictor function: \(y = Xw\) for design matrix \(X\) and weights \(w\).
- Simple linear regression has one explanatory variable
- Ordinary least squares (OLS)
- independent observations and errors
- homoscedasticity: equal variance
- no multicollinearity
- Estimator: \((X^T X)^{-1} X^T y\).
- Computing the inverse directly is numerically unstable.
- Instead, solve the normal equations \((X^T X) \hat{w} = X^T y\) for the weights (see the sketch after this list).
- Residuals and residual plots
- Root mean square deviation (RMSD)
- Gauss–Markov theorem: OLS has the lowest sampling variance within the class of linear unbiased estimators. It only assumes errors are uncorrelated with equal variance; it does not require normal or iid errors.
- Weighted least squares (WLS)
- For observations with unequal variance, the weight of an observation should be inversely proportional to its variance.
- Generalized least squares (GLS)
- Ridge regression is linear regression with L2 regularization on the weights
- Min \(\|Xw-y\|_2^2 + \lambda \|w\|^2\)
- Estimator: \((X^T X + \lambda I)^{-1} X^T y\)
- Lasso (least absolute shrinkage and selection operator) is L1 regularization
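A minimal numpy sketch of the estimators above (synthetic data): build a design matrix with a constant column, solve the normal equations instead of forming the inverse, and add \(\lambda I\) for ridge.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=(n, 2))
y = 3.0 * x[:, 0] - 2.0 * x[:, 1] + 0.5 + rng.normal(scale=0.1, size=n)

# Design matrix with a constant column, so no separate bias parameter is needed.
X = np.column_stack([np.ones(n), x])

# OLS: solve (X^T X) w = X^T y rather than computing the inverse explicitly.
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge: (X^T X + λI) w = X^T y (for simplicity the intercept is penalized too).
lam = 0.1
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

print(w_ols)     # ≈ [0.5, 3.0, -2.0]
print(w_ridge)   # slightly shrunk toward zero
```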
Binary logistic regression
- Predict \(\hat{P}(y=1|x)\) using \(\hat{y} = s(xw+b)\) with weight \(w\), bias \(b\), and sigmoid activation.
- Sigmoid converts logits or log-odds \(z\) to probabilities.
- \(\displaystyle s(z) = \frac1{1+e^{-z}}\).
- Binary cross entropy loss aka negative log likelihood.
- \(\ell(\hat{y}, y) = -\E_p \log q = -y\ln \hat{y} - (1-y)\ln(1-\hat{y})\)
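A minimal sketch of the sigmoid, the binary cross entropy loss, and plain gradient descent on synthetic data (learning rate and iteration count are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=(n, 2))
y = (x[:, 0] - x[:, 1] + 0.2 * rng.normal(size=n) > 0).astype(float)

w, b, lr, eps = np.zeros(2), 0.0, 0.1, 1e-12
for _ in range(200):
    y_hat = sigmoid(x @ w + b)       # predicted P(y=1|x)
    loss = -np.mean(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))
    grad = y_hat - y                 # gradient of the loss w.r.t. the logits
    w -= lr * x.T @ grad / n
    b -= lr * grad.mean()

print(loss, w, b)
```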
Multinomial logistic regression
- Target \(y\) can take values from 0 to \(k-1\).
- Predict a multinomial distribution over categories: \(\hat{y}_k = \hat{P}(y=k|x) = \sigma(xW+b)_k\) with weight matrix \(W\) and bias vector \(b\)
- Softmax \(\displaystyle\sigma(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}\) normalizes scores \(z\) into a valid probability distribution \(p\).
- Multinomial cross entropy loss
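A sketch of a numerically stable softmax and the multinomial cross entropy for integer class labels (logits and labels are made up):

```python
import numpy as np

def softmax(z):
    # Subtracting the row max does not change the result but avoids overflow.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])
y = np.array([0, 2])                 # integer class labels in {0, ..., k-1}

probs = softmax(logits)
cross_entropy = -np.mean(np.log(probs[np.arange(len(y)), y]))
print(probs, cross_entropy)
```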
KL divergence or relative entropy from Q to P: \(\KL(p\|q) = \E_p \log\frac{p}{q}\).
- Nonnegative with equality if \(p=q\). Asymmetric.
- \(KL = -H(p) + H(p,q)\), the number of extra bits required to code samples from \(p\) using the optimal code for \(q\).
- Expected log-likelihood ratio \(\log \frac{p}{q}\) under \(x\sim p\).
- Minimizing \(\KL(\hat{p}\|q)\) over \(q\) = maximum likelihood: \(\hat{\E}_p \log q = \frac1n \sum_i \log q(X_i)\).
- \(\KL(p\|q)\) ensures that \(q>0\) where \(p>0\), so \(q\) averages across modes of \(p\). \(\KL(q\|p)\) ensures that \(q=0\) where \(p=0\), so \(q\) finds one mode of \(p\).
- For uniform \(q\), \(\KL(p\|q) = -\sum p\log q - H(p) = \log |X| - H(p)\)
- Other distances:
- The Jensen–Shannon divergence is a symmetrized KL: \(\mathrm{JSD}(p\|q) = \frac12 \KL(p\|m) + \frac12 \KL(q\|m)\) where \(m = \frac12(p+q)\).
- Wasserstein distance (earth mover distance)
- Total variation distance
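A small sketch of KL and Jensen–Shannon divergence for discrete distributions (the example distributions are arbitrary):

```python
import numpy as np

def kl(p, q):
    # KL(p || q) = sum_x p(x) log(p(x) / q(x)); assumes q > 0 wherever p > 0.
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def jsd(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.5, 0.25, 0.25])
q = np.array([0.25, 0.25, 0.5])
print(kl(p, q), kl(q, p), jsd(p, q))   # KL is asymmetric; JSD is symmetric
```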
Testing
- Test statistic: convert sample statistic to a score assuming H0
- t-test: location test with unknown population variance; independent samples, data approximately normal or n > 30 (see the sketch after this list)
- t-distribution: \(t = \frac{Z}{\sqrt{U/n}}\), where \(Z\sim N(0,1)\) and \(U\sim \chi^2_n\) are independent
- Student’s t-test: \(\frac{\bar{X}-\mu}{\sqrt{S^2/n}} \sim t_{n-1}\) because \(\bar{X}\) is independent of the residuals and thus independent of \(S^2\)
- Multivariate t-distribution
- ANOVA: check for statistically significant differences between groups.
- Compare within-group variance with between-group variance.
- Assume iid normal errors with homogeneous variance.
- Nonparametric testing
- Wilcoxon rank sum test
- KS test: compare empirical distributions
- Likelihood-ratio test or Wilks test
- Neyman–Pearson lemma: for simple hypotheses, the likelihood-ratio test based on \(\log \frac{p(x)}{q(x)}\) is the most powerful test at a given significance level for deciding whether a sample \(x\) was drawn from P or Q.
- Wald test
- Model selection and goodness of fit
- Akaike information criterion (AIC)
- Bayesian information criterion
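A quick sketch of the one-sample t-test with scipy on synthetic data (the null mean of 0 is just an example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.3, scale=1.0, size=40)

# H0: the population mean is 0.
result = stats.ttest_1samp(sample, popmean=0.0)
print(result.statistic, result.pvalue)

# The same statistic by hand: (x̄ - μ0) / sqrt(S² / n), compared to t_{n-1}.
t = (sample.mean() - 0.0) / np.sqrt(sample.var(ddof=1) / len(sample))
print(t)
```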
https://en.wikipedia.org/wiki/Bayesian_probability
- Dempster-Shafer theory of belief functions.
- transferable belief model (TBM)
- https://en.wikipedia.org/wiki/Probabilistic_programming
Curse of dimensionality
- Sketching
- Count–min sketch
- HyperLogLog
- Approximate counting algorithm
Graphical models
- Directed models or Bayes nets
- Undirected models: Markov random fields (MRF) and RBMs
- Problems
- Inference: given parameters \(\theta\), sample/compute marginals
- Sample \(p(x)\): easy for directed models, hard for undirected
- Sample \(p(z|x)\): easy in RBMs, hard in Bayes nets
- Hard due to an unknown normalizing constant/partition function
- Rejection sampling: to sample from \(f\), propose \(x \sim g\) and accept with probability \(\frac{f(x)}{M g(x)}\), where \(f \le Mg\); \(M\) is the expected number of proposals per accepted sample (see the sketch after this list).
- Learning: find parameters \(\theta\) to maximize \(\log p(x)\)
- MCMC methods
- Metropolis–Hastings (1970)
- Gibbs sampling
- Variational inference solves using optimization
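A minimal rejection-sampling sketch, as referenced above: draw from a Beta(2, 2) target using a uniform proposal, accepting each draw with probability \(f(x)/(M g(x))\); the target and bound \(M\) are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def target_density(x):
    # Beta(2, 2) density on [0, 1], used here as an illustrative target f.
    return 6.0 * x * (1.0 - x)

M = 1.5                               # bound with f(x) <= M * g(x) for g = Uniform(0, 1)
proposals = rng.uniform(size=100_000)
accept = rng.uniform(size=100_000) < target_density(proposals) / M
samples = proposals[accept]

print(accept.mean())                  # ≈ 1/M: M proposals per accepted sample on average
print(samples.mean(), samples.var())  # ≈ 0.5 and 0.05 for Beta(2, 2)
```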
Variational inference
- Have: graphical model \(z\to x\)
- Want: posterior \(p(z|x) = \frac{p(x|z)p(z)}{p(x)}\)
- Problem: evidence \(p(x) = \int p(x,z)\,dz\) is hard to compute
- Solution: find an approximation \(q(z)\) that minimizes \(\KL(q(z)\|p(z|x))\)
- Note that forward KL requires expectations w.r.t. p which is hard
- \(\KL(q\|p) = -H(q) - \E_q \log p(z,x) + \log p(x)\)
- Since \(\log p(x)\) is constant, minimizing KL = maximizing the ELBO \(H(q) + \E_q \log p(x,z)\) \(= \log p(x) - \KL(q(z) \| p(z|x))\) \(\le \log p(x)\).
- More generally, we can compute partition functions \(Z\) for \(p(x) = \frac1Z \exp(E(x))\).
- \(\KL(q\|p) = -H(q) - \E_q \log p = -H(q) - \E_q E(x) + \log Z\), so \(\log Z = H(q) + \E_q E(x) + \KL(q\|p) \ge H(q) + \E_q E(x)\); maximizing this variational (Gibbs) free energy over \(q\) recovers \(\log Z\).
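A toy numeric sketch of the ELBO for a made-up discrete model \(z\to x\): for any \(q(z)\), the ELBO is at most \(\log p(x)\), with equality when \(q(z) = p(z|x)\).

```python
import numpy as np

# Made-up discrete model: latent z in {0, 1}, with x already observed.
p_z = np.array([0.6, 0.4])            # prior p(z)
p_x_given_z = np.array([0.2, 0.9])    # likelihood p(x|z) at the observed x

p_xz = p_z * p_x_given_z              # joint p(x, z) at the observed x
log_px = np.log(p_xz.sum())           # evidence log p(x)
posterior = p_xz / p_xz.sum()         # exact posterior p(z|x)

def elbo(q):
    # ELBO = H(q) + E_q[log p(x, z)] = log p(x) - KL(q || p(z|x))
    return -np.sum(q * np.log(q)) + np.sum(q * np.log(p_xz))

print(log_px)
print(elbo(np.array([0.5, 0.5])))     # strictly below log p(x)
print(elbo(posterior))                # equals log p(x)
```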
- Control variables are held constant.
- Inverse distance weighting (IDW) for multivariate interpolation.
- Intransitive dice
- https://en.wikipedia.org/wiki/Regression_toward_the_mean
- https://en.wikipedia.org/wiki/Prevention_paradox
- https://en.wikipedia.org/wiki/Wisdom_of_the_crowd
- https://en.wikipedia.org/wiki/Aggregate_function
- https://en.wikipedia.org/wiki/Martingale_(betting_system)
- https://en.wikipedia.org/wiki/Oscar%27s_grind
https://en.m.wikipedia.org/wiki/Metropolis–Hastings_algorithm
https://en.m.wikipedia.org/wiki/Rejection_sampling#Adaptive_rejection_sampling
https://en.m.wikipedia.org/wiki/Inverse_transform_sampling
https://en.m.wikipedia.org/wiki/Box–Muller_transform
https://jotterbach.github.io/content/posts/tsne/2016-05-23-TSNE/
https://strathprints.strath.ac.uk/52372/1/Connor_etal_LNCS2013_Evaluation_Jensen_Shannon_distance_over_sparse_data.pdf
https://github.com/cran/entropy/blob/master/R/KL.plugin.R
https://github.com/cran/entropy/blob/master/R/entropy.empirical.R
https://www.stefanom.io/notes/2021/02/25/concept_drift.html
http://web.archive.org/web/20150121224302/https://www.tsc.uc3m.es/~fernando/bare_conf3.pdf
https://notesonai.com/Jensen%E2%80%93Shannon+Divergence
Variational inference
https://arxiv.org/abs/1606.05908
https://arxiv.org/abs/1312.6114
https://arxiv.org/abs/2108.13083