Machine learning
See also: Deep learning.
Universal approximation (UA) caveats: no guarantee that we can learn the target network, can require exponentially many nodes, assumes a continuous/smooth target function, and only holds over a specific bounded range
Generative model p(x) vs. discriminative model p(y|x).
Reduce overfitting:
- Use test and validation sets
- Regularization
- Augment data
- Early stopping (see the sketch after this list)
- Select only the most important features
- Bottleneck layers for autoencoders
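A minimal PyTorch sketch of two of these, L2 regularization via weight decay plus early stopping on a validation set; the synthetic data, model, patience, and weight_decay values are all illustrative placeholders.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic binary classification data (placeholder for real train/val splits).
X, y = torch.randn(512, 20), torch.randint(0, 2, (512,))
train_loader = DataLoader(TensorDataset(X[:400], y[:400]), batch_size=32, shuffle=True)
val_loader = DataLoader(TensorDataset(X[400:], y[400:]), batch_size=64)

model = torch.nn.Linear(20, 2)
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)  # L2 penalty
loss_fn = torch.nn.CrossEntropyLoss()

best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(100):
    model.train()
    for xb, yb in train_loader:
        opt.zero_grad()
        loss_fn(model(xb), yb).backward()
        opt.step()
    model.eval()
    with torch.no_grad():
        val = sum(loss_fn(model(xb), yb).item() for xb, yb in val_loader)
    if val < best_val:          # track the best validation loss
        best_val, bad_epochs = val, 0
    else:                       # early stopping: quit when validation stops improving
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```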
Classification: logistic, k-NN (k-d tree), LDA/QDA and mixture models
- SVM: Support vector machine
- Linear decision boundary: \(\hat{y} = I(w^T x + b > 0)\).
- Maximizes the margin between the decision boundary and support vectors, the closest data points.
Decision tree:
- Splitting criteria: entropy, Gini index, misclassification error
- Can post-prune with a held-out pruning set
Metrics
Regression: linear/polynomial
ridge regression = l2 penalty (Gaussian prior), lasso = l1 (Laplacian prior)
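A quick scikit-learn sketch contrasting the two penalties; the data is synthetic with only three informative features, and the alpha values are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.0, 0.5]                 # only 3 informative features
y = X @ w_true + 0.1 * rng.normal(size=200)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks all coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X, y)   # L1: drives many coefficients exactly to zero
print("ridge nonzeros:", np.sum(np.abs(ridge.coef_) > 1e-6))
print("lasso nonzeros:", np.sum(np.abs(lasso.coef_) > 1e-6))
```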
Ensembles
- Bagging: draw instances with replacement
- Boosting: train later stages on incorrect samples from previous learners
- AdaBoost: weights misclassified samples and more accurate classifiers higher; maximizes the margin
Dimensionality reduction: autoencoders
Sequence tasks
- RNNs: handle arbitrary-length sequences, but long-range dependencies are difficult due to the long propagation path, the hidden-state memory bottleneck, and sequential processing (MLPs, in contrast, require lots of RAM)
- LSTM solves vanishing/exploding gradient
- CNN, causal CNN if we need autoregressive
- Attention
Ordering: conv, ReLU, BN, dropout
Regularization to improve generalization:
- Data augmentation
- L2 weight decay
- early stopping
- batch norm?
- smaller batch size = more noise
- dropout (its train/test variance shift clashes with BN, so apply dropout only after the last BN layer)
- Discretization
- Straight-through Estimator (Bengio 2013): backpropagate through the hard threshold function I(x>0) as if it had been the identity function. Simple but biased gradient estimator.
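A minimal PyTorch sketch of the straight-through estimator for the hard threshold I(x>0); the detach trick below is one common way to express it, and the input values are arbitrary.

```python
import torch

def hard_threshold_ste(x):
    # Forward: hard 0/1 threshold. Backward: gradient passes through as if the
    # threshold were the identity (the detach removes the zero-gradient step
    # function from the autograd graph).
    hard = (x > 0).float()
    return hard + x - x.detach()

x = torch.randn(5, requires_grad=True)
hard_threshold_ste(x).sum().backward()
print(x.grad)  # all ones: identity gradient despite the hard forward pass
```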
Debugging
- Plot each loss
- Check that the loss fn is correct. Softmax loss is -log p(correct class), which at initialization should be about log(num classes), e.g. ≈2.3 for 10 classes
- Visualize everything possible
Transformer: selects values with corresponding keys that match the query, values weighted by softmax of dot-products. Improves long-range dependencies and easier to parallelize compared to RNNs.
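A minimal sketch of single-head, unmasked scaled dot-product attention, the core operation described above; the batch, sequence length, and dimension are illustrative.

```python
import math
import torch

def attention(q, k, v):
    # q, k, v: (batch, seq, d). Each query attends to all keys; values are
    # combined with softmax weights of the scaled dot products.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])  # (batch, seq_q, seq_k)
    weights = scores.softmax(dim=-1)
    return weights @ v

q = k = v = torch.randn(2, 10, 64)   # self-attention: same sequence for q, k, v
out = attention(q, k, v)             # (2, 10, 64)
```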
Distributed training
- Data parallel: usually easier, communicating gradient updates can be a bottleneck
- Model parallel
https://en.wikipedia.org/wiki/Template:Recommender_systems
- Collective classification
Probabilistic graphical models
RL
https://en.wikipedia.org/wiki/Program_synthesis
Reinforcement learning
RL studies how to choose actions to maximize expected return. The usual setting is a Markov Decision Process, which consists of states, actions, transitions, and rewards and satisfies the Markov property.
Behavior cloning and GAIL
Q-learning. Use dynamic programming aka the Bellman equation: target \(r(s, a) + \gamma \max_{a'} Q(s', a')\). Use a replay buffer to decorrelate observations. When there is estimation error, bootstrapped Q estimates based on \(\max_{a'}\) are biased upward, so Double Q-learning uses a second Q net as the value target.
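A sketch of the standard vs. Double Q-learning targets for a minibatch from a replay buffer; the linear networks and the fake batch are placeholders, not a full DQN.

```python
import torch

n_actions, gamma = 4, 0.99
q_net = torch.nn.Linear(8, n_actions)        # online network (placeholder architecture)
target_net = torch.nn.Linear(8, n_actions)   # second / target network

# Fake replay-buffer minibatch: states, actions, rewards, next states, done flags.
s, a = torch.randn(32, 8), torch.randint(0, n_actions, (32,))
r, s2, done = torch.randn(32), torch.randn(32, 8), torch.zeros(32)

with torch.no_grad():
    # Standard target: max over the same net -> biased upward under estimation error.
    naive_target = r + gamma * (1 - done) * target_net(s2).max(dim=1).values
    # Double Q-learning: choose a' with the online net, evaluate it with the target net.
    a2 = q_net(s2).argmax(dim=1)
    double_target = r + gamma * (1 - done) * target_net(s2).gather(1, a2.unsqueeze(1)).squeeze(1)

q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
loss = torch.nn.functional.mse_loss(q_sa, double_target)
```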
TD vs MC. The MC target is the actual return, which is unbiased but high variance since it accumulates randomness from rewards and actions. The TD(0) target is r + gamma * V(s'), which is biased since it bootstraps. The TD(0) error or residual is target - V(s). n-step TD uses n rewards and then bootstraps from the value of the state n steps ahead. TD(lambda) uses exponentially weighted sums of n-step TD targets; lambda is the decay parameter, where lambda=1 recovers MC learning. Idea: automatically decrease lambda as the value function becomes more accurate.
SARSA. \(Q(s,a) = r+\gamma Q(s',a')\), on policy
ESARSA. \(Q(s,a) = r+\gamma\E_{a'} Q(s',a')\)
Q-learning. \(Q(s,a) = r+\gamma \max_{a'} Q(s',a')\)
Model-based learning
- Learn using some model of the environment dynamics.
- MuZero: learn using MCTS/UCB
- don't learn to predict observations; we want to map states to the same embedding if the transitions are the same
- learn encoder, dynamics, policy, and reward and n-step return predictors jointly end-to-end
- recurrent dynamics function
Policy gradient
On-policy since we need trajectories sampled from the current policy
REINFORCE estimates the gradient of expected return wrt policy params. Equivalent to return-weighted behavior cloning: maximize R log p(a). To reduce variance, we can replace the episode return with reward-to-go and subtract a baseline (see the sketch after this list).
Actor-Critic. Weight by Q value, advantage A = Q-V, TD error (V easier to learn than Q), or generalized advantage.
Natural Policy Gradient, TRPO (second order), PPO (first order).
IMPALA uses importance weighting to be more off-policy and sample efficient.
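Following up on the REINFORCE item above: a sketch of the gradient estimate with reward-to-go and a simple mean-return baseline. The policy network, the fake episode data, and the hyperparameters are placeholders.

```python
import torch

# Placeholder policy over 2 actions for 4-dim observations.
policy = torch.nn.Sequential(torch.nn.Linear(4, 32), torch.nn.Tanh(), torch.nn.Linear(32, 2))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

# One fake episode (on-policy: these would be sampled from the current policy).
obs = torch.randn(20, 4)
dist = torch.distributions.Categorical(logits=policy(obs))
actions = dist.sample()
rewards = torch.randn(20)

# Reward-to-go: G_t = sum_{k>=t} gamma^{k-t} r_k, computed backwards.
returns = torch.zeros(20)
running = 0.0
for t in reversed(range(20)):
    running = rewards[t] + gamma * running
    returns[t] = running
advantages = returns - returns.mean()   # subtract a baseline to reduce variance

# REINFORCE surrogate loss: -E[A_t * log pi(a_t|s_t)]; its gradient is the policy gradient estimate.
loss = -(advantages.detach() * dist.log_prob(actions)).mean()
opt.zero_grad(); loss.backward(); opt.step()
```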
Distributional RL
- Easier to learn: target is more stable and richer
- Soft Q-Learning uses Stein variational gradient descent.
- SAC: Soft Actor-Critic.
- Maximizing action entropy improves exploration, robustness, compositionality (mix of multiple different solutions).
- Equivalence Between Policy Gradients and Soft Q-Learning (Schulman 2018)
Bandits
Total regret \(\E \sum_t (v^* - r_t) = \sum_a \E[N_t(a)]\Delta_a\), with pull count \(N_t(a)\) and gap \(\Delta_a = v^* - q(a)\)
Optimistic greedy can still have linear regret. Need to reduce exploration over time to have sublinear regret.
UCB. \(P[q^*(a) \le \hat{Q}_t(a) + \hat{U}_t(a)] \ge 1-p\).
\(U_t(a) = \sqrt{\frac{-\log p}{2N_t(a)}}\) by Hoeffding’s or Gaussian tail bound.
UCB1 sets \(p=t^{-4} \Rightarrow U_t(a) = \sqrt{\frac{2\log t}{N_t(a)}}\).
Thompson sampling. Choose \(\pi(a) = P\)(\(a\) is optimal).
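A sketch of UCB1 on a toy Bernoulli bandit, using the bonus \(U_t(a)\) above; the arm probabilities and horizon are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
p_true = np.array([0.2, 0.5, 0.7])   # Bernoulli arm means (unknown to the agent)
counts, sums = np.zeros(3), np.zeros(3)

for t in range(1, 1001):
    if t <= 3:                                    # pull each arm once first
        a = t - 1
    else:
        q_hat = sums / counts
        bonus = np.sqrt(2 * np.log(t) / counts)   # UCB1 exploration bonus
        a = int(np.argmax(q_hat + bonus))
    r = rng.random() < p_true[a]
    counts[a] += 1
    sums[a] += r

print("pull counts:", counts)   # most pulls should concentrate on the best arm
```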
Gaussian processes. Optimization without gradients, but assumes larger correlations between closer \(x\) coordinates. \(O(N^3)\) in the number of observations, so limited to about 1000 points. Can use with UCB or Thompson sampling.
Kernel \(K(x,x') = \sigma^2 \exp\left(\frac{-1}{2l^2} (x-x')^2\right)\)
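A NumPy sketch of GP regression with this squared-exponential kernel: posterior mean and variance at test points, from which a UCB-style acquisition could be built. The toy data, noise level, and hyperparameters are illustrative.

```python
import numpy as np

def sqexp_kernel(x1, x2, sigma=1.0, length=1.0):
    # K(x, x') = sigma^2 exp(-(x - x')^2 / (2 l^2))
    return sigma**2 * np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / length**2)

# Noisy observations of an unknown 1-D function (toy data).
X = np.array([-2.0, -1.0, 0.5, 2.0])
y = np.sin(X)
noise = 1e-2

Xs = np.linspace(-3, 3, 50)                       # test points
K = sqexp_kernel(X, X) + noise * np.eye(len(X))   # the O(N^3) solve is the scaling bottleneck
Ks = sqexp_kernel(Xs, X)
Kss = sqexp_kernel(Xs, Xs)

mean = Ks @ np.linalg.solve(K, y)                 # posterior mean
cov = Kss - Ks @ np.linalg.solve(K, Ks.T)         # posterior covariance
std = np.sqrt(np.clip(np.diag(cov), 0, None))     # e.g. use mean + beta*std as a UCB acquisition
```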
ML system design
http://www.mlebook.com
Understand the goal, who will use it and how, entry points (user interaction?), and outputs
Scope: online/offline, mobile/web, authentication, analytics, integration
Scale: how many users/requests/data
Tradeoffs, bottlenecks, pre-generate expensive queries
Services, data
Define goal, context, and scale
- Online metrics? User ratings, CTR, revenue, retention.
- Diagram I/O between ML and overall product system.
- Talk with upstream (data owners) and downstream (stakeholders)
- Watch out for feedback loops where predictions can affect training data
- UI/UX: how do we collect input data? How do we collect user feedback?
- Amount of data, requests/second, latency
- How mission-critical/reliable/accurate?
Collect data
- Explore data
- Feature importance: partial dependence plots, SHAP values, dataschool video (reference)
- Potential issues
- Sampling bias? Label a simple random sample of posts
- Unbalanced data? Over/undersampling
- Batch or streaming (Kafka)
- Split dataset
- Versioning
Feature engineering
- Define the data matrix: what is a row (example), a column (feature), and the label
- Start simple and expand from there (complexity tradeoff)
- Data types
- User history (time series)
- Cluster users or generate user embeddings
- Text: bag of words, BERT embeddings (tokenize, maximum sentence length)
- Passive: demographics, device, location, page, time, etc
- Whitening for numeric features
- Impute missing values? Outliers? Typos?
- Maybe cache in feature store
Select Model (bias/variance tradeoff)
- Mean baseline
- Heuristic baseline
- Logistic classifier with cross entropy loss
- CNN or RNN
- Random forest/XGBoost
- Regularization
- Hyperparameter search
- Do we need to communicate model confidence?
Evaluation and monitoring
- Precision/recall tradeoff -> value model
- Do we have target values?
- A/B testing
- Bias: monitor subgroups. Do we cause disproportionate harm?
- Privacy
- Adversarial robustness
- Monitor for change in performance, usage, latency
- Distribution shift (median/min/max, KS test)? Train models continuously, send alerts
- Covariate shift: p(x) changes. Prior shift: p(y) changes. Concept shift: p(y|x) changes (same features may get different labels) or p(x|y) changes (same labels may get different features).
- Detection. Monitor model error if available. Unsup: statistical distance (low-dim only), novelty detection e.g. 1-class SVM, discriminative distance (classify old vs new data).
- Solutions. Remove drifting features, importance weighting by similarity to the test distribution
Deployment
- Where? statically on client (embedded in binary), downloaded on client, on server
- Caching
- Version metadata (libraries, datasets, hyperparams, scoring code), authors and timestamps
- Canary rollout and rollbacks
Architecture diagram: data lake -> data prep -> batch training -> model store
Recommendation system
- Collaborative filtering
- Nearest neighbor e.g. cosine similarity
- Factor user-product interactions into user profile x media profile (see the sketch after this list)
- Suffers from the cold-start problem and missing-not-at-random data
- Content-based filtering e.g. metadata
- Candidate generation (high recall) and ranking (high precision)
- or model distillation to maximize an approximate model first
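As referenced above, a sketch of collaborative filtering by factoring the user-item interaction matrix into user and item profiles, trained by SGD on observed entries only. The sizes, learning rate, regularization, and synthetic ratings are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 50, 40, 8
# Sparse observed ratings as (user, item, rating) triples.
obs = [(rng.integers(n_users), rng.integers(n_items), rng.integers(1, 6)) for _ in range(500)]

U = 0.1 * rng.normal(size=(n_users, k))   # user profiles
V = 0.1 * rng.normal(size=(n_items, k))   # item profiles
lr, reg = 0.01, 0.1

for epoch in range(20):
    for u, i, r in obs:
        u_old = U[u].copy()
        err = r - u_old @ V[i]
        # SGD step on squared error with L2 regularization.
        U[u] += lr * (err * V[i] - reg * u_old)
        V[i] += lr * (err * u_old - reg * V[i])

pred = U @ V.T   # predicted scores for every user-item pair (used to rank candidates)
```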
Theory
See also Statistics, Optimization, and Learning theory.
Math
Metric (distance): d(x,y) = 0 iff x = y, symmetric, triangle d(x,y) <= d(x,z) + d(z,y)
Norm: |x| >= 0 with 0 iff x = 0, absolutely homogeneous |ax| = |a||x|, triangle |x+y| <= |x|+|y|. Induces metric d(x,y) = |x-y|. A seminorm doesn't need the "0 iff x = 0" condition.
Orthogonality vs. independence
Sigmoid: 1/(1+e^(-x))
Positive definite (PD) matrix (numerical checks sketched after this list):
- x^TMx > 0 for all nonzero x
- symmetric and all eigenvalues are positive
- symmetric and all pivots are positive
- leading principal minors are all positive
- can be written as R^TR (R with indep columns)
- quadratic form is (strictly) convex
- (x,y) = y^TMx is an inner product
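Two quick NumPy checks of the equivalent characterizations above, using a matrix built as R^T R with independent columns (so it is PD by construction): all eigenvalues positive, and a Cholesky factorization that succeeds.

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.normal(size=(5, 3))     # independent columns (almost surely)
M = R.T @ R                     # symmetric and positive definite by construction

print(np.all(np.linalg.eigvalsh(M) > 0))   # symmetric with all eigenvalues positive
try:
    np.linalg.cholesky(M)                  # succeeds iff M is (numerically) positive definite
    print("PD")
except np.linalg.LinAlgError:
    print("not PD")
```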
Jacobian: matrix dfi/dxj derivative of a vector-valued function
Hessian: matrix df^2/dxidxj of second-order partial derivatives of a scalar-valued function, describes local curvature
Newton’s method: x2 = x - f(x)/f’(x) since 0 = f(x) + f’(x)(x2 - x)
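A minimal sketch of this root-finding update, using x^2 - 2 = 0 (i.e. sqrt(2)) as the example.

```python
def newton(f, fprime, x, steps=20, tol=1e-12):
    # Iterate x <- x - f(x)/f'(x) until the step is tiny.
    for _ in range(steps):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

print(newton(lambda x: x * x - 2, lambda x: 2 * x, 1.0))  # ~1.4142135623730951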
Chain rule, integration by parts
Logistic regression
- Binary cross entropy loss
- BCE loss on logits is more numerically stable
- \(\ell(z, y) = \ln(1+e^{-z}) + z(1-y)\).
- PyTorch: BCEWithLogitsLoss(z, y) and BCELoss(y_hat, y)
- Binary gradient update
- Loss: \(\displaystyle\frac{\partial \ell}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}\)
- Sigmoid: \(\displaystyle\frac{\partial \ell}{\partial z} = \frac{\partial \ell}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial z} = \frac{\partial \ell}{\partial \hat{y}} s(z)(1-s(z)) = \hat{y}-y\)
- Linear: \(\displaystyle\frac{\partial \ell}{\partial w} = \frac{\partial \ell}{\partial z}\frac{\partial z}{\partial w} = x^T \frac{\partial \ell}{\partial z}\)
- Multinomial cross entropy loss
- PyTorch: CrossEntropyLoss(z, y) and NLLLoss(log_y_hat, y) (less stable).
- Softmax Jacobian \(J_{ij} = \begin{cases} p_i(1-p_j), & i=j\\ -p_ip_j, & i\neq j \end{cases}\) (derivation).
- Note that we should avoid creating \(J\) explicitly since that's \(O(d^2)\) memory. Instead, reorder the computation:
- \(\displaystyle\frac{\partial \ell}{\partial z} = J \frac{\partial \ell}{\partial p} = p \circ \left(\frac{\partial \ell}{\partial p} - p^T \frac{\partial \ell}{\partial p}\right)\)
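A quick numerical check of this reordering against the explicit Jacobian, for a small d = 5 (the upstream gradient g is random).

```python
import torch

p = torch.softmax(torch.randn(5), dim=0)
g = torch.randn(5)                       # upstream gradient dl/dp

# Naive: build the d x d Jacobian explicitly (O(d^2) memory).
J = torch.diag(p) - torch.outer(p, p)
grad_naive = J @ g

# Reordered: dl/dz = p * (g - p^T g), only O(d) extra memory.
grad_fast = p * (g - p @ g)

print(torch.allclose(grad_naive, grad_fast))  # True
```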
Perceptron
\(\newcommand{\vvec}[2]{\begin{pmatrix}#1\\#2\end{pmatrix}}\)
Hinge loss penalizes margin violations \(y_i\hat{y}_i<1\): \(\max\{0, 1-y_i\hat{y}_i\}\), vs. the 0-1 loss, which is 0 if \(y_i\hat{y}_i>0\) else 1
\(y_i = \pm 1\), \(\hat{y_i} = w^Tx_i + b\)
SGD on hinge loss, \(\nabla l\vvec{w_t}{b_t} = -\vvec{y_ix_i}{y_i}\text{ or }\vvec00\)
Update rule: \(w_{t+1} \leftarrow w_t + y_ix_i\), \(b_{t+1} \leftarrow b_t + y_i\)
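A minimal sketch of the resulting mistake-driven training loop on separable toy data; the data-generating hyperplane is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + 2 * X[:, 1] > 0.5, 1, -1)   # labels in {-1, +1}, separable by construction

w, b = np.zeros(2), 0.0
for epoch in range(50):
    mistakes = 0
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:   # misclassified (or on the boundary)
            w += yi * xi             # w <- w + y_i x_i
            b += yi                  # b <- b + y_i
            mistakes += 1
    if mistakes == 0:                # converged: a full pass with no mistakes
        break
print(w, b, "epochs run:", epoch + 1)
```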
Perceptron convergence
\[T \leq \frac{(R^2+1)(1+{b^*}^2)}{\gamma^2}\]
\[(\gamma T)^2 \leq \left\langle\vvec{w^*}{b^*}, \vvec{w_{t+1}}{b_{t+1}}\right\rangle^2 \leq (1+{b^*}^2) \left\lVert \vvec{w_{t+1}}{b_{t+1}} \right\rVert^2 \leq (1+{b^*}^2)T(R^2+1)\]
Min increase in alignment: \(\left\langle\vvec{w^*}{b^*}, \vvec{w_t}{b_t} + \vvec{y_ix_i}{y_i}\right\rangle,\ y_i({w^*}^Tx_i+b^*)\geq \gamma\)
Cauchy-Schwarz: \(\lVert w^* \rVert = 1\)
Max increase in norm: \(\left\lVert \vvec{w_t}{b_t} + y_i\vvec{x_i}{1} \right\rVert\), \(y_i(w_t^Tx_i+b_t) \leq 0\), \(R = \max \lVert x_i \rVert\)
Variational autoencoder
In a VAE, latent \(z\) generates observation \(x\). We have an encoder \(q(z|x)\), decoder \(p(x|z)\), and prior \(p(z)\). We can maximize the ELBO or minimize the loss \(- \E_q \log p(x|z) + KL(q \| p(z))\) = NLL + KL loss on the latent.
We can rewrite the NLL as an expectation over a distribution not dependent on the encoder parameters \(\theta\).
Reparameterization trick: sample \(y\sim N(\mu, \Sigma)\) as \(y = \mu + \Sigma^{1/2} x\), \(x\sim N(0, I)\).
For a standard normal prior \(p(z) = N(0, I)\), \(KL(q\| p(z)) = KL[N(\mu, \Sigma) \| N(0, I)] = \frac12[ \mu^T\mu + \tr\Sigma - d - \log |\Sigma|]\).
For diagonal \(\Sigma\), \(KL = \frac{1}2 \sum_{k=1}^d (\mu_k^2 + \Sigma_{kk} - 1 - \log\Sigma_{kk})\).
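A PyTorch sketch of the reparameterized sampling and the diagonal-Gaussian KL term above; the encoder outputs here are random placeholders and the reconstruction NLL term is omitted.

```python
import torch

# Encoder outputs mu and log-variance of a diagonal Gaussian q(z|x); here random placeholders.
d = 8
mu, logvar = torch.randn(32, d), torch.randn(32, d)

# Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I), so gradients flow to mu, logvar.
eps = torch.randn_like(mu)
z = mu + torch.exp(0.5 * logvar) * eps

# KL(N(mu, diag(sigma^2)) || N(0, I)) = 1/2 sum_k (mu_k^2 + sigma_k^2 - 1 - log sigma_k^2)
kl = 0.5 * (mu.pow(2) + logvar.exp() - 1 - logvar).sum(dim=1)

# Total VAE loss = reconstruction NLL (from the decoder, omitted here) + KL.
loss = kl.mean()
```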