Machine learning
See also: Deep learning.
Universal approximation (UA) caveats: no guarantee that we can learn the target network, can require exponentially many nodes, assumes a continuous/smooth target function, and only holds over a specific bounded range
Generative model p(x) vs. discriminative model p(y|x).
Reduce overfitting:
- Use test and validation sets
- Regularization
- Augment data
- Early stopping (see the sketch after this list)
- Select only the most important features
- Bottleneck layers for autoencoders
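A minimal PyTorch sketch of two of these, L2 regularization via weight decay plus early stopping on a validation set; the synthetic data, model, patience, and weight_decay values are all illustrative placeholders.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic binary classification data (placeholder for real train/val splits).
X, y = torch.randn(512, 20), torch.randint(0, 2, (512,))
train_loader = DataLoader(TensorDataset(X[:400], y[:400]), batch_size=32, shuffle=True)
val_loader = DataLoader(TensorDataset(X[400:], y[400:]), batch_size=64)

model = torch.nn.Linear(20, 2)
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)  # L2 penalty
loss_fn = torch.nn.CrossEntropyLoss()

best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(100):
    model.train()
    for xb, yb in train_loader:
        opt.zero_grad()
        loss_fn(model(xb), yb).backward()
        opt.step()
    model.eval()
    with torch.no_grad():
        val = sum(loss_fn(model(xb), yb).item() for xb, yb in val_loader)
    if val < best_val:          # track the best validation loss
        best_val, bad_epochs = val, 0
    else:                       # early stopping: quit when validation stops improving
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```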
Classification: logistic, k-NN (k-d tree), LDA/QDA and mixture models
- SVM: Support vector machine
- Linear decision boundary: \(\hat{y} = I(w^T x + b > 0)\).
- Maximizes the margin between the decision boundary and support vectors, the closest data points.
Decision tree:
- Splitting criteria: entropy, Gini index, misclassification error
- Can post-prune with a held-out pruning set
Metrics
Regression: linear/polynomial
ridge regression = l2 penalty (Gaussian prior), lasso = l1 (Laplacian prior)
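A quick scikit-learn sketch contrasting the two penalties; the data is synthetic with only three informative features, and the alpha values are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.0, 0.5]                 # only 3 informative features
y = X @ w_true + 0.1 * rng.normal(size=200)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks all coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X, y)   # L1: drives many coefficients exactly to zero
print("ridge nonzeros:", np.sum(np.abs(ridge.coef_) > 1e-6))
print("lasso nonzeros:", np.sum(np.abs(lasso.coef_) > 1e-6))
```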
Ensembles
- Bagging: draw instances with replacement
- Boosting: train later stages on incorrect samples from previous learners
- AdaBoost: weights misclassified samples and more accurate classifiers higher; maximizes the margin
Dimensionality reduction: autoencoders
Sequence tasks
- RNNs: handle arbitrary-length sequences, but long-range dependencies are difficult due to the long propagation path, the hidden-state memory bottleneck, and sequential processing (MLPs, in contrast, require lots of RAM)
- LSTM solves vanishing/exploding gradient
- CNN, causal CNN if we need autoregressive
- Attention
Ordering: conv, ReLU, BN, dropout
Regularization to improve generalization:
- Data augmentation
- L2 weight decay
- early stopping
- batch norm?
- smaller batch size = more noise
- dropout (its train/test variance shift clashes with BN, so apply dropout only after the last BN layer)
- Discretization
- Straight-through Estimator (Bengio 2013): backpropagate through the hard threshold function I(x>0) as if it had been the identity function. Simple but biased gradient estimator.
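A minimal PyTorch sketch of the straight-through estimator for the hard threshold I(x>0); the detach trick below is one common way to express it, and the input values are arbitrary.

```python
import torch

def hard_threshold_ste(x):
    # Forward: hard 0/1 threshold. Backward: gradient passes through as if the
    # threshold were the identity (the detach removes the zero-gradient step
    # function from the autograd graph).
    hard = (x > 0).float()
    return hard + x - x.detach()

x = torch.randn(5, requires_grad=True)
hard_threshold_ste(x).sum().backward()
print(x.grad)  # all ones: identity gradient despite the hard forward pass
```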
Debugging
- Plot each loss
- Check that the loss fn is correct. Softmax loss is -log p(correct class), which at initialization should be about log(num classes), e.g. ≈2.3 for 10 classes
- Visualize everything possible
Transformer: selects values with corresponding keys that match the query, values weighted by softmax of dot-products. Improves long-range dependencies and easier to parallelize compared to RNNs.
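A minimal sketch of single-head, unmasked scaled dot-product attention, the core operation described above; the batch, sequence length, and dimension are illustrative.

```python
import math
import torch

def attention(q, k, v):
    # q, k, v: (batch, seq, d). Each query attends to all keys; values are
    # combined with softmax weights of the scaled dot products.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])  # (batch, seq_q, seq_k)
    weights = scores.softmax(dim=-1)
    return weights @ v

q = k = v = torch.randn(2, 10, 64)   # self-attention: same sequence for q, k, v
out = attention(q, k, v)             # (2, 10, 64)
```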
Distributed training
- Data parallel: usually easier, communicating gradient updates can be a bottleneck
- Model parallel
https://en.wikipedia.org/wiki/Template:Recommender_systems
- Collective classification
Probabilistic graphical models
RL
https://en.wikipedia.org/wiki/Program_synthesis
Reinforcement learning
RL studies how to choose actions to maximize expected return. The usual setting is a Markov Decision Process, which consists of states, actions, transitions, and rewards and satisfies the Markov property.
Behavior cloning and GAIL
Q-learning. Use dynamic programming aka the Bellman equation: target \(r(s, a) + \gamma \max_{a'} Q(s', a')\). Use a replay buffer to decorrelate observations. When there is estimation error, bootstrapped Q estimates based on \(\max_{a'}\) are biased upward, so Double Q-learning uses a second Q net as the value target.
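A sketch of the standard vs. Double Q-learning targets for a minibatch from a replay buffer; the linear networks and the fake batch are placeholders, not a full DQN.

```python
import torch

n_actions, gamma = 4, 0.99
q_net = torch.nn.Linear(8, n_actions)        # online network (placeholder architecture)
target_net = torch.nn.Linear(8, n_actions)   # second / target network

# Fake replay-buffer minibatch: states, actions, rewards, next states, done flags.
s, a = torch.randn(32, 8), torch.randint(0, n_actions, (32,))
r, s2, done = torch.randn(32), torch.randn(32, 8), torch.zeros(32)

with torch.no_grad():
    # Standard target: max over the same net -> biased upward under estimation error.
    naive_target = r + gamma * (1 - done) * target_net(s2).max(dim=1).values
    # Double Q-learning: choose a' with the online net, evaluate it with the target net.
    a2 = q_net(s2).argmax(dim=1)
    double_target = r + gamma * (1 - done) * target_net(s2).gather(1, a2.unsqueeze(1)).squeeze(1)

q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
loss = torch.nn.functional.mse_loss(q_sa, double_target)
```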
TD vs MC. The MC target is the actual return, which is unbiased but high variance since it accumulates randomness from rewards and actions. The TD(0) target is r + gamma * V(s'), which is biased since it bootstraps. The TD(0) error or residual is target - V(s). n-step TD uses n rewards and then bootstraps from the value of the state n steps ahead. TD(lambda) uses exponentially weighted sums of n-step TD targets; lambda is the decay parameter, where lambda=1 recovers MC learning. Idea: automatically decrease lambda as the value function becomes more accurate.
SARSA. \(Q(s,a) = r+\gamma Q(s',a')\), on policy
ESARSA. \(Q(s,a) = r+\gamma\E_{a'} Q(s',a')\)
Q-learning. \(Q(s,a) = r+\gamma \max_{a'} Q(s',a')\)
Model-based learning
- Learn using some model of the environment dynamics.
- MuZero: learn using MCTS/UCB
- don't learn to predict observations; we want to map states to the same embedding if the transitions are the same
- learn encoder, dynamics, policy, and reward and n-step return predictors jointly end-to-end
- recurrent dynamics function
Policy gradient
On-policy since we need trajectories sampled from the current policy
REINFORCE estimates the gradient of expected return wrt policy params. Equivalent to return-weighted behavior cloning: maximize R log p(a). To reduce variance, we can replace the episode return with reward-to-go and subtract a baseline (see the sketch after this list).
Actor-Critic. Weight by Q value, advantage A = Q-V, TD error (V easier to learn than Q), or generalized advantage.
Natural Policy Gradient, TRPO (second order), PPO (first order).
IMPALA uses importance weighting to be more off-policy and sample efficient.
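Following up on the REINFORCE item above: a sketch of the gradient estimate with reward-to-go and a simple mean-return baseline. The policy network, the fake episode data, and the hyperparameters are placeholders.

```python
import torch

# Placeholder policy over 2 actions for 4-dim observations.
policy = torch.nn.Sequential(torch.nn.Linear(4, 32), torch.nn.Tanh(), torch.nn.Linear(32, 2))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

# One fake episode (on-policy: these would be sampled from the current policy).
obs = torch.randn(20, 4)
dist = torch.distributions.Categorical(logits=policy(obs))
actions = dist.sample()
rewards = torch.randn(20)

# Reward-to-go: G_t = sum_{k>=t} gamma^{k-t} r_k, computed backwards.
returns = torch.zeros(20)
running = 0.0
for t in reversed(range(20)):
    running = rewards[t] + gamma * running
    returns[t] = running
advantages = returns - returns.mean()   # subtract a baseline to reduce variance

# REINFORCE surrogate loss: -E[A_t * log pi(a_t|s_t)]; its gradient is the policy gradient estimate.
loss = -(advantages.detach() * dist.log_prob(actions)).mean()
opt.zero_grad(); loss.backward(); opt.step()
```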
Distributional RL
- Easier to learn: target is more stable and richer
- Soft Q-Learning uses Stein variational gradient descent.
- SAC: Soft Actor-Critic.
- Maximizing action entropy improves exploration, robustness, compositionality (mix of multiple different solutions).
- Equivalence Between Policy Gradients and Soft Q-Learning (Schulman 2018)
Bandits
Total regret \(\E \sum_t (v^* - r_t) = \sum_a \E[N_t(a)]\Delta_a\), with pull count \(N_t(a)\) and gap \(\Delta_a = v^* - q(a)\)
Optimistic greedy can still have linear regret. Need to reduce exploration over time to have sublinear regret.
UCB. \(P[q^*(a) \le \hat{Q}_t(a) + \hat{U}_t(a)] \ge 1-p\).
\(U_t(a) = \sqrt{\frac{-\log p}{2N_t(a)}}\) by Hoeffding’s or Gaussian tail bound.
UCB1 sets \(p=t^{-4} \Rightarrow U_t(a) = \sqrt{\frac{2\log t}{N_t(a)}}\).
Thompson sampling. Choose \(\pi(a) = P\)(\(a\) is optimal).
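A sketch of UCB1 on a toy Bernoulli bandit, using the bonus \(U_t(a)\) above; the arm probabilities and horizon are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
p_true = np.array([0.2, 0.5, 0.7])   # Bernoulli arm means (unknown to the agent)
counts, sums = np.zeros(3), np.zeros(3)

for t in range(1, 1001):
    if t <= 3:                                    # pull each arm once first
        a = t - 1
    else:
        q_hat = sums / counts
        bonus = np.sqrt(2 * np.log(t) / counts)   # UCB1 exploration bonus
        a = int(np.argmax(q_hat + bonus))
    r = rng.random() < p_true[a]
    counts[a] += 1
    sums[a] += r

print("pull counts:", counts)   # most pulls should concentrate on the best arm
```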
Gaussian processes. Optimization without gradients, but assumes larger correlations between closer \(x\) coordinates. \(O(N^3)\) in the number of observations, so limited to about 1000 points. Can use with UCB or Thompson sampling.
Kernel \(K(x,x') = \sigma^2 \exp\left(\frac{-1}{2l^2} (x-x')^2\right)\)
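A NumPy sketch of GP regression with this squared-exponential kernel: posterior mean and variance at test points, from which a UCB-style acquisition could be built. The toy data, noise level, and hyperparameters are illustrative.

```python
import numpy as np

def sqexp_kernel(x1, x2, sigma=1.0, length=1.0):
    # K(x, x') = sigma^2 exp(-(x - x')^2 / (2 l^2))
    return sigma**2 * np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / length**2)

# Noisy observations of an unknown 1-D function (toy data).
X = np.array([-2.0, -1.0, 0.5, 2.0])
y = np.sin(X)
noise = 1e-2

Xs = np.linspace(-3, 3, 50)                       # test points
K = sqexp_kernel(X, X) + noise * np.eye(len(X))   # the O(N^3) solve is the scaling bottleneck
Ks = sqexp_kernel(Xs, X)
Kss = sqexp_kernel(Xs, Xs)

mean = Ks @ np.linalg.solve(K, y)                 # posterior mean
cov = Kss - Ks @ np.linalg.solve(K, Ks.T)         # posterior covariance
std = np.sqrt(np.clip(np.diag(cov), 0, None))     # e.g. use mean + beta*std as a UCB acquisition
```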
ML system design
http://www.mlebook.com
Understand the goal, who will use it and how, entry points (user interaction?), and outputs
Scope: online/offline, mobile/web, authentication, analytics, integration
Scale: how many users/requests/data
Tradeoffs, bottlenecks, pre-generate expensive queries
Services, data
Define goal, context, and scale
- Online metrics? User ratings, CTR, revenue, retention.
- Diagram I/O between ML and overall product system.
- Talk with upstream (data owners) and downstream (stakeholders)
- Watch out for feedback loops where predictions can affect training data
- UI/UX: how do we collect input data? How do we collect user feedback?
- Amount of data, requests/second, latency
- How mission-critical/reliable/accurate?
Collect data
- Explore data
- Feature importance: partial dependence plots, SHAP values, dataschool video (reference)
- Potential issues
- Sampling bias? Label a simple random sample of posts
- Unbalanced data? Over/undersampling
- Batch or streaming (Kafka)
- Split dataset
- Versioning
Feature engineering
- Define the data matrix: what is a row (example), a column (feature), and the label
- Start simple and expand from there (complexity tradeoff)
- Data types
- User history (time series)
- Cluster users or generate user embeddings
- Text: bag of words, BERT embeddings (tokenize, maximum sentence length)
- Passive: demographics, device, location, page, time, etc
- Whitening for numeric features
- Impute missing values? Outliers? Typos?
- Maybe cache in feature store
Select Model (bias/variance tradeoff)
- Mean baseline
- Heuristic baseline
- Logistic classifier with cross entropy loss
- CNN or RNN
- Random forest/XGBoost
- Regularization
- Hyperparameter search
- Do we need to communicate model confidence?
Evaluation and monitoring
- Precision/recall tradeoff -> value model
- Do we have target values?
- A/B testing
- Bias: monitor subgroups. Do we cause disproportionate harm?
- Privacy
- Adversarial robustness
- Monitor for change in performance, usage, latency
- Distribution shift (median/min/max, KS test)? Train models continuously, send alerts
- Covariate shift: p(x) changes. Prior shift: p(y) changes. Concept shift: p(y|x) changes (same features may get different labels) or p(x|y) changes (same labels may get different features).
- Detection. Monitor model error if available. Unsup: statistical distance (low-dim only), novelty detection e.g. 1-class SVM, discriminative distance (classify old vs new data).
- Solutions. Remove drifting features, importance weighting by similarity to the test distribution
Deployment
- Where? statically on client (embedded in binary), downloaded on client, on server
- Caching
- Version metadata (libraries, datasets, hyperparams, scoring code), authors and timestamps
- Canary rollout and rollbacks
Architecture diagram: data lake -> data prep -> batch training -> model store
Recommendation system
- Collaborative filtering
- Nearest neighbor e.g. cosine similarity
- Factor user-product interactions into user profile x media profile (see the sketch after this list)
- Suffers from the cold-start problem and missing-not-at-random data
- Content-based filtering e.g. metadata
- Candidate generation (high recall) and ranking (high precision)
- or model distillation to maximize an approximate model first
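As referenced above, a sketch of collaborative filtering by factoring the user-item interaction matrix into user and item profiles, trained by SGD on observed entries only. The sizes, learning rate, regularization, and synthetic ratings are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 50, 40, 8
# Sparse observed ratings as (user, item, rating) triples.
obs = [(rng.integers(n_users), rng.integers(n_items), rng.integers(1, 6)) for _ in range(500)]

U = 0.1 * rng.normal(size=(n_users, k))   # user profiles
V = 0.1 * rng.normal(size=(n_items, k))   # item profiles
lr, reg = 0.01, 0.1

for epoch in range(20):
    for u, i, r in obs:
        u_old = U[u].copy()
        err = r - u_old @ V[i]
        # SGD step on squared error with L2 regularization.
        U[u] += lr * (err * V[i] - reg * u_old)
        V[i] += lr * (err * u_old - reg * V[i])

pred = U @ V.T   # predicted scores for every user-item pair (used to rank candidates)
```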
Theory
See also Statistics, Optimization, and Learning theory.
Math
Metric (distance): d(x,y) = 0 iff x = y, symmetric, triangle d(x,y) <= d(x,z) + d(z,y)
Norm: |x| >= 0 with 0 iff x = 0, absolutely homogeneous |ax| = |a||x|, triangle |x+y| <= |x|+|y|. Induces metric d(x,y) = |x-y|. A seminorm doesn't need the "0 iff x = 0" condition.
Orthogonality vs. independence
Sigmoid: 1/(1+e^(-x))
Positive definite (PD) matrix (numerical checks sketched after this list):
- x^TMx > 0 for all nonzero x
- symmetric and all eigenvalues are positive
- symmetric and all pivots are positive
- leading principal minors are all positive
- can be written as R^TR (R with indep columns)
- quadratic form is (strictly) convex
- (x,y) = y^TMx is an inner product
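Two quick NumPy checks of the equivalent characterizations above, using a matrix built as R^T R with independent columns (so it is PD by construction): all eigenvalues positive, and a Cholesky factorization that succeeds.

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.normal(size=(5, 3))     # independent columns (almost surely)
M = R.T @ R                     # symmetric and positive definite by construction

print(np.all(np.linalg.eigvalsh(M) > 0))   # symmetric with all eigenvalues positive
try:
    np.linalg.cholesky(M)                  # succeeds iff M is (numerically) positive definite
    print("PD")
except np.linalg.LinAlgError:
    print("not PD")
```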
Jacobian: matrix dfi/dxj derivative of a vector-valued function
Hessian: matrix df^2/dxidxj of second-order partial derivatives of a scalar-valued function, describes local curvature
Newton’s method: x2 = x - f(x)/f’(x) since 0 = f(x) + f’(x)(x2 - x)
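A minimal sketch of this root-finding update, using x^2 - 2 = 0 (i.e. sqrt(2)) as the example.

```python
def newton(f, fprime, x, steps=20, tol=1e-12):
    # Iterate x <- x - f(x)/f'(x) until the step is tiny.
    for _ in range(steps):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

print(newton(lambda x: x * x - 2, lambda x: 2 * x, 1.0))  # ~1.4142135623730951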
Chain rule, integration by parts
Logistic regression
- Binary cross entropy loss
- BCE loss on logits is more numerically stable
- \(\ell(z, y) = \ln(1+e^{-z}) + z(1-y)\).
- PyTorch: BCEWithLogitsLoss(z, y) and BCELoss(y_hat, y)
- Binary gradient update
- Loss: \(\displaystyle\frac{\partial \ell}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}\)
- Sigmoid: \(\displaystyle\frac{\partial \ell}{\partial z} = \frac{\partial \ell}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial z} = \frac{\partial \ell}{\partial \hat{y}} s(z)(1-s(z)) = \hat{y}-y\)
- Linear: \(\displaystyle\frac{\partial \ell}{\partial w} = \frac{\partial \ell}{\partial z}\frac{\partial z}{\partial w} = x^T \frac{\partial \ell}{\partial z}\)
- Multinomial cross entropy loss
- PyTorch: CrossEntropyLoss(z, y) and NLLLoss(log_y_hat, y) (less stable).
- Softmax Jacobian \(J_{ij} = \begin{cases} p_i(1-p_j), & i=j\\ -p_ip_j, & i\neq j \end{cases}\) (derivation).
- Note that we should avoid creating \(J\) explicitly since that's \(O(d^2)\) memory. Instead, reorder the computation:
- \(\displaystyle\frac{\partial \ell}{\partial z} = J \frac{\partial \ell}{\partial p} = p \circ \left(\frac{\partial \ell}{\partial p} - p^T \frac{\partial \ell}{\partial p}\right)\)
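A quick numerical check of this reordering against the explicit Jacobian, for a small d = 5 (the upstream gradient g is random).

```python
import torch

p = torch.softmax(torch.randn(5), dim=0)
g = torch.randn(5)                       # upstream gradient dl/dp

# Naive: build the d x d Jacobian explicitly (O(d^2) memory).
J = torch.diag(p) - torch.outer(p, p)
grad_naive = J @ g

# Reordered: dl/dz = p * (g - p^T g), only O(d) extra memory.
grad_fast = p * (g - p @ g)

print(torch.allclose(grad_naive, grad_fast))  # True
```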
Perceptron
\(\newcommand{\vvec}[2]{\begin{pmatrix}#1\\#2\end{pmatrix}}\)
Hinge loss penalizes margin violations \(y_i\hat{y}_i<1\): \(\max\{0, 1-y_i\hat{y}_i\}\), vs. the 0-1 loss, which is 0 if \(y_i\hat{y}_i>0\) else 1
\(y_i = \pm 1\), \(\hat{y_i} = w^Tx_i + b\)
SGD on hinge loss, \(\nabla l\vvec{w_t}{b_t} = -\vvec{y_ix_i}{y_i}\text{ or }\vvec00\)
Update rule: \(w_{t+1} \leftarrow w_t + y_ix_i\), \(b_{t+1} \leftarrow b_t + y_i\)
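A minimal sketch of the resulting mistake-driven training loop on separable toy data; the data-generating hyperplane is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + 2 * X[:, 1] > 0.5, 1, -1)   # labels in {-1, +1}, separable by construction

w, b = np.zeros(2), 0.0
for epoch in range(50):
    mistakes = 0
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:   # misclassified (or on the boundary)
            w += yi * xi             # w <- w + y_i x_i
            b += yi                  # b <- b + y_i
            mistakes += 1
    if mistakes == 0:                # converged: a full pass with no mistakes
        break
print(w, b, "epochs run:", epoch + 1)
```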
Perceptron convergence
\[T \leq \frac{(R^2+1)(1+{b^*}^2)}{\gamma^2}\]
\[(\gamma T)^2 \leq \left\langle\vvec{w^*}{b^*}, \vvec{w_{t+1}}{b_{t+1}}\right\rangle^2 \leq (1+{b^*}^2) \left\lVert \vvec{w_{t+1}}{b_{t+1}} \right\rVert^2 \leq (1+{b^*}^2)T(R^2+1)\]
Min increase in alignment: \(\left\langle\vvec{w^*}{b^*}, \vvec{w_t}{b_t} + \vvec{y_ix_i}{y_i}\right\rangle,\ y_i({w^*}^Tx_i+b^*)\geq \gamma\)
Cauchy-Schwarz: \(\lVert w^* \rVert = 1\)
Max increase in norm: \(\left\lVert \vvec{w_t}{b_t} + y_i\vvec{x_i}{1} \right\rVert\), \(y_i(w_t^Tx_i+b_t) \leq 0\), \(R = \max \lVert x_i \rVert\)
Variational autoencoder
In a VAE, latent \(z\) generates observation \(x\). We have an encoder \(q(z|x)\), decoder \(p(x|z)\), and prior \(p(z)\). We can maximize the ELBO or minimize the loss \(- \E_q \log p(x|z) + KL(q \| p(z))\) = NLL + KL loss on the latent.
We can rewrite the NLL as an expectation over a distribution not dependent on the encoder parameters \(\theta\).
Reparameterization trick: sample \(y\sim N(\mu, \Sigma)\) as \(y = \mu + \Sigma^{1/2} x\), \(x\sim N(0, I)\).
For a standard normal prior \(p(z) = N(0, I)\), \(KL(q\| p(z)) = KL[N(\mu, \Sigma) \| N(0, I)] = \frac12[ \mu^T\mu + \tr\Sigma - d - \log |\Sigma|]\).
For diagonal \(\Sigma\), \(KL = \frac{1}2 \sum_{k=1}^d (\mu_k^2 + \Sigma_{kk} - 1 - \log\Sigma_{kk})\).
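A PyTorch sketch of the reparameterized sampling and the diagonal-Gaussian KL term above; the encoder outputs here are random placeholders and the reconstruction NLL term is omitted.

```python
import torch

# Encoder outputs mu and log-variance of a diagonal Gaussian q(z|x); here random placeholders.
d = 8
mu, logvar = torch.randn(32, d), torch.randn(32, d)

# Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I), so gradients flow to mu, logvar.
eps = torch.randn_like(mu)
z = mu + torch.exp(0.5 * logvar) * eps

# KL(N(mu, diag(sigma^2)) || N(0, I)) = 1/2 sum_k (mu_k^2 + sigma_k^2 - 1 - log sigma_k^2)
kl = 0.5 * (mu.pow(2) + logvar.exp() - 1 - logvar).sum(dim=1)

# Total VAE loss = reconstruction NLL (from the decoder, omitted here) + KL.
loss = kl.mean()
```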