Contents

Machine learning

See also: Deep learning.

Universal approximation caveats: no guarantee that we can actually learn the ideal ("golden") network, the approximation can require exponentially many nodes, and it assumes the target is a function with some smoothness on a specific (bounded) range.

Generative model p(x) vs. discriminative model p(y|x).

Reduce overfitting:

Classification: logistic, k-NN (k-d tree), LDA/QDA and mixture models

Decision tree:

Metrics

Regression: linear/polynomial
ridge regression = L2 penalty (Gaussian prior), lasso = L1 (Laplace prior)
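A minimal scikit-learn sketch of the L2 vs. L1 penalty on a toy dataset (the alpha values are arbitrary assumptions):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Toy data: y depends only on the first feature; the rest are noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: drives some coefficients to exactly 0

print(ridge.coef_)  # small but nonzero weights on the noise features
print(lasso.coef_)  # sparse: noise features zeroed out
```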

Ensembles

Dimensionality reduction: autoencoders

Sequence tasks

Ordering: conv, ReLU, BN, dropout
Regularization to improve generalization:

Debugging

Transformer: attention selects values whose keys match the query, weighting the values by a softmax over query-key dot products. Handles long-range dependencies better and is easier to parallelize than RNNs.
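A minimal numpy sketch of the attention step described above (single head, no masking; shapes are assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Q: (n_q, d), K: (n_k, d), V: (n_k, d_v)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # query-key dot products, scaled by sqrt(d)
    weights = softmax(scores, axis=-1)        # each query's weights over the keys sum to 1
    return weights @ V                        # weighted average of the values

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
print(attention(Q, K, V).shape)  # (4, 8)
```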

Distributed training

https://en.wikipedia.org/wiki/Template:Recommender_systems

Probabilistic graphical models

RL

https://en.wikipedia.org/wiki/Program_synthesis

Reinforcement learning

RL studies how to choose actions to maximize expected return. The usual setting is a Markov Decision Process, which consists of states, actions, transitions, and rewards and satisfies the Markov property.

Behavior cloning and GAIL

Q-learning. Use dynamic programming via the Bellman equation: \(r(s_i, a_i) + \gamma \max_{a'_i} Q(s'_i, a'_i)\). Use a replay buffer to decorrelate observations. When there is estimation error, bootstrapped Q estimates that take a max over actions are biased upward, so Double Q-learning selects the argmax action with one network and evaluates it with a second.

TD vs MC. The MC target is the actual return, which is unbiased but high variance since it accumulates randomness from rewards and actions. The TD(0) target is reward + gamma * V(s'), which is biased since it bootstraps. The TD(0) error (residual) is target - V(s). n-step TD accumulates n rewards and then bootstraps from the value of the state n steps ahead. TD(lambda) uses exponentially weighted sums of the n-step TD updates; lambda is the decay parameter, with lambda=1 recovering MC. Idea: automatically decrease lambda as the value function becomes more accurate.
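A minimal sketch contrasting the two targets for a tabular value function (assumes V is a dict such as defaultdict(float), MC episodes given as (state, reward) pairs; step sizes are placeholders):

```python
def mc_update(V, episode, alpha=0.1, gamma=0.99):
    # MC target: actual discounted return from each state to the end of the episode.
    G = 0.0
    for s, r in reversed(episode):
        G = r + gamma * G
        V[s] += alpha * (G - V[s])

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    # TD(0) target bootstraps from the current estimate V(s'): biased, lower variance.
    target = r + gamma * V[s_next]
    V[s] += alpha * (target - V[s])  # TD error = target - V(s)
```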

SARSA. \(Q(s,a) = r+\gamma Q(s',a')\), on-policy
ESARSA. \(Q(s,a) = r+\gamma\E_{a'} Q(s',a')\)
Q-learning. \(Q(s,a) = r+\gamma \max_{a'} Q(s',a')\), off-policy
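A tabular sketch of the SARSA and Q-learning targets above (expected SARSA would replace the sampled/max bootstrap with an expectation under the policy). Assumes Q is a dict of dicts and an epsilon-greedy behavior policy; hyperparameters are placeholders:

```python
import random

def eps_greedy(Q, s, actions, eps=0.1):
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[s][a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy: bootstrap from the action a_next actually taken by the behavior policy.
    Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    # Off-policy: bootstrap from the greedy (max) action regardless of what is taken next.
    Q[s][a] += alpha * (r + gamma * max(Q[s_next][a2] for a2 in actions) - Q[s][a])
```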

Model-based learning

Policy gradient

On-policy, since we need to sample trajectories from the current policy.

REINFORCE estimates the gradient of expected return wrt policy params. Equivalent to return-weighted behavior cloning: maximize R log p(a). To reduce variance, we can replace episode return with reward-to-go and subtract a baseline.
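A minimal PyTorch sketch of the REINFORCE surrogate loss with reward-to-go and a mean baseline (function and variable names are placeholders; log_probs is assumed to come from the policy with gradients attached):

```python
import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """log_probs: tensor of log pi(a_t|s_t) for one episode; rewards: list of floats."""
    # Reward-to-go: discounted return from each timestep onward.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)))
    # Baseline: subtracting the mean return reduces variance without adding bias.
    advantages = returns - returns.mean()
    # Minimizing this equals return-weighted behavior cloning: maximize sum A_t * log p(a_t).
    return -(log_probs * advantages).sum()
```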

Actor-Critic. Weight by Q value, advantage A = Q-V, TD error (V easier to learn than Q), or generalized advantage.

Natural Policy Gradient, TRPO (second order), PPO (first order).

IMPALA uses importance weighting (V-trace) to learn from slightly off-policy actor data, improving throughput and sample efficiency.

Distributional RL

Bandits

Total regret \(\E \sum_t (v^* - r_t) = \sum_a \E[N_t(a)]\Delta_a\), count \(N_t(a)\), gap \(\Delta_a = v^* - q(a)\)

Optimistic greedy can still have linear regret. Need to reduce exploration over time to have sublinear regret.

UCB. \(P[q^*(a) \le \hat{Q}_t(a) + \hat{U}_t(a)] \ge 1-p\).

\(U_t(a) = \sqrt{\frac{-\log p}{2N_t(a)}}\) by Hoeffding’s or Gaussian tail bound.

UCB1 sets \(p=t^{-4} \Rightarrow U_t(a) = \sqrt{\frac{2\log t}{N_t(a)}}\).

Thompson sampling. Choose \(\pi(a) = P\)(\(a\) is optimal).
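A minimal sketch of UCB1 action selection and Beta-Bernoulli Thompson sampling (assumes Bernoulli rewards; per-arm counts, empirical means, and success/failure tallies are tracked by the caller):

```python
import math
import numpy as np

def ucb1_action(counts, values, t):
    # Play each arm once first, then pick the arm maximizing Q_hat(a) + sqrt(2 log t / N(a)).
    for a, n in enumerate(counts):
        if n == 0:
            return a
    return max(range(len(counts)),
               key=lambda a: values[a] + math.sqrt(2 * math.log(t) / counts[a]))

def thompson_action(successes, failures, rng):
    # Sample each arm's mean from its Beta(1+s, 1+f) posterior and act greedily on the samples,
    # so pi(a) = P(a is optimal) under the posterior.
    samples = rng.beta(np.asarray(successes) + 1, np.asarray(failures) + 1)
    return int(np.argmax(samples))

rng = np.random.default_rng(0)
print(ucb1_action([3, 1, 2], [0.5, 0.9, 0.4], t=6))
print(thompson_action([2, 5, 1], [3, 1, 4], rng))
```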

Gaussian processes. Optimization without gradients, but they assume larger correlations between closer \(x\) values. Inference is \(O(N^3)\) in the number of observations, so limited to roughly 1000 points. Can be combined with UCB or Thompson sampling.

Kernel \(K(x,x') = \sigma^2 \exp\left(\frac{-1}{2l^2} (x-x')^2\right)\)
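A minimal numpy sketch of the squared-exponential kernel above and the resulting GP posterior mean/variance (1-D inputs; noise level and hyperparameters are assumptions):

```python
import numpy as np

def rbf_kernel(x1, x2, sigma=1.0, length=1.0):
    # k(x, x') = sigma^2 exp(-(x - x')^2 / (2 l^2)) for 1-D inputs.
    d = x1[:, None] - x2[None, :]
    return sigma**2 * np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(X_train, y_train, X_test, noise=1e-3):
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_s = rbf_kernel(X_test, X_train)
    K_ss = rbf_kernel(X_test, X_test)
    alpha = np.linalg.solve(K, y_train)            # the O(N^3) step that limits scaling
    mean = K_s @ alpha
    cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)
    return mean, np.diag(cov)

X = np.array([0.0, 1.0, 2.0]); y = np.sin(X)
mu, var = gp_posterior(X, y, np.linspace(0, 2, 5))
```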

ML system design

http://www.mlebook.com

  1. Understand the goal: who will use it and how? Entry points (user interaction?), outputs

  2. Scope: online/offline, mobile/web, authentication, analytics, integration

  3. Scale: how many users/requests/data

  4. Tradeoffs, bottlenecks, pre-generate expensive queries

  5. Services, data

  6. Define goal, context, and scale

    • Online metrics? User ratings, CTR, revenue, retention.
    • Diagram I/O between ML and overall product system.
      • Talk with upstream (data owners) and downstream (stakeholders)
      • Watch out for feedback loops where predictions can affect training data
      • UI/UX: how do we collect input data? How do we collect user feedback?
    • Amount of data, requests/second, latency
    • How mission-critical/reliable/accurate?
  7. Collect data

    • Explore data
      • Feature importance: partial dependence plots, SHAP values, dataschool video (reference)
    • Potential issues
      • Sampling bias? Label a simple random sample of posts
      • Unbalanced data? Over/undersampling
    • Batch or streaming (Kafka)
    • Split dataset
    • Versioning
  8. Feature engineering

    • Matrix row, column, and label
    • Start simple and expand from there (complexity tradeoff)
    • Data types
      • User history (time series)
      • Cluster users or generate user embeddings
      • Text: bag of words, BERT embeddings (tokenize, maximum sentence length)
      • Passive: demographics, device, location, page, time, etc
    • Whitening for numeric features
    • Impute missing values? Outliers? Typos?
    • Maybe cache in feature store
  9. Select Model (bias/variance tradeoff)

    • Mean baseline
    • Heuristic baseline
    • Logistic classifier with cross entropy loss
    • CNN or RNN
    • Random forest/XGBoost
    • Regularization
    • Hyperparameter search
    • Do we need to communicate model confidence?
  10. Evaluation and monitoring

    • Precision/recall tradeoff -> value model
    • Do we have target values?
    • A/B testing
    • Bias: monitor subgroups. Do we cause disproportionate harm?
    • Privacy
    • Adversarial robustness
    • Monitor for change in performance, usage, latency
    • Distribution shift (median/min/max, KS test)? Train models continuously, send alerts (see the drift-check sketch after this list)
      • Covariate shift: p(x) changes. Prior/label shift: p(y) changes. Concept shift: p(y|x) changes (the same features may get different labels) or p(x|y) changes (the same labels may get different features).
      • Detection. Monitor model error if labels are available. Unsupervised: statistical distance (low-dimensional only), novelty detection e.g. one-class SVM, discriminative distance (train a classifier to distinguish old vs. new data).
      • Solutions. Remove drifting features, importance weighting by similarity to the test distribution, retrain on recent data.
  11. Deployment

    • Where? statically on client (embedded in binary), downloaded on client, on server
    • Caching
    • Version metadata (libraries, datasets, hyperparams, scoring code), authors and timestamps
    • Canary rollout and rollbacks
      Architecture diagram: data lake -> data prep -> batch training -> model store
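A minimal sketch of the KS-test drift check mentioned in step 10, assuming scipy and a single numeric feature column compared between training data and recent production data (names and threshold are placeholders):

```python
import numpy as np
from scipy.stats import ks_2samp

def check_drift(train_col, prod_col, alpha=0.01):
    """Flag covariate shift in one numeric feature via a two-sample KS test."""
    stat, p_value = ks_2samp(train_col, prod_col)
    return {"ks_stat": stat, "p_value": p_value, "drift": p_value < alpha}

rng = np.random.default_rng(0)
train = rng.normal(0, 1, size=5000)
prod = rng.normal(0.3, 1, size=5000)   # shifted mean simulates drift
print(check_drift(train, prod))        # small p-value -> send an alert
```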

Recommendation system

Theory

See also Statistics, Optimization, and Learning theory.

Math
Metric (distance): 0 iff x = y, symmetric, triangle d(x,y) <= d(x,z) + d(z,y)
Norm: >= 0 with 0 iff x = 0, absolutely homogeneous |ax| = |a||x|, triangle |x+y| <= |x|+|y|. Induces metric d(x,y) = |x-y|. A seminorm doesn't need 0 only at x = 0.
orthogonal/independence
Sigmoid: 1/(1+e^(-x))
Positive definite (PD) matrix:

Logistic regression

Perceptron
\(\newcommand{\vvec}[2]{\begin{pmatrix}#1\\#2\end{pmatrix}}\)

Hinge loss penalizes margin \(y_i\hat{y}_i<1\): \(\max\{0, 1-y_i\hat{y}_i\}\), vs. the 0-1 loss: 0 if \(y_i\hat{y}_i>0\) else 1

\(y_i = \pm 1\), \(\hat{y_i} = w^Tx_i + b\)

SGD on hinge loss, \(\nabla l\vvec{w_t}{b_t} = -\vvec{y_ix_i}{y_i}\text{ or }\vvec00\)

Update rule: \(w_{t+1} \leftarrow w_t + y_ix_i\), \(b_{t+1} \leftarrow b_t + y_i\)

Perceptron convergence
\[T \leq \frac{(R^2+1)(1+{b^*}^2)}{\gamma^2}\]
\[(\gamma T)^2 \leq \left\langle\vvec{w^*}{b^*}, \vvec{w_{t+1}}{b_{t+1}}\right\rangle^2 \leq (1+{b^*}^2) \left\lVert \vvec{w_{t+1}}{b_{t+1}} \right\rVert^2 \leq (1+{b^*}^2)T(R^2+1)\]

Min increase in alignment: \(\left\langle\vvec{w^*}{b^*}, \vvec{w_t}{b_t} + \vvec{y_ix_i}{y_i}\right\rangle,\ y_i({w^*}^Tx_i+b^*)\geq \gamma\)

Cauchy-Schwarz: \(\lVert w^* \rVert = 1\)

Max increase in norm: \(\left\lVert \vvec{w_t}{b_t} + y_i\vvec{x_i}{1} \right\rVert\), \(y_i(w_t^Tx_i+b_t) \leq 0\), \(R = \max \lVert x_i \rVert\)
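A minimal numpy sketch of the update rule above (assumes labels \(y_i = \pm 1\); the mistake bound above guarantees the loop terminates on linearly separable data):

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """X: (n, d) features, y: (n,) labels in {-1, +1}."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i + b) <= 0:       # misclassified (or on the boundary)
                w, b = w + y_i * x_i, b + y_i  # update rule: w += y_i x_i, b += y_i
                mistakes += 1
        if mistakes == 0:                      # converged: every point has positive margin
            break
    return w, b
```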

Variational autoencoder
In a VAE, latent \(z\) generates observation \(x\). We have an encoder \(q(z|x)\), decoder \(p(x|z)\), and prior \(p(z)\). We can maximize the ELBO or minimize the loss \(- \E_q \log p(x|z) + KL(q \| p(z))\) = NLL + KL loss on the latent.

We can rewrite the NLL as an expectation over a distribution not dependent on the encoder parameters \(\theta\).
Reparameterization trick: sample \(z\sim N(\mu, \Sigma)\) as \(z = \mu + \Sigma^{1/2} \epsilon\), \(\epsilon\sim N(0, I)\).

For a Gaussian prior, \(KL(q\| p(z)) = KL[N(\mu, \Sigma) \| N(0, I)] = \frac12[ \lVert\mu\rVert^2 + \tr\Sigma -d - \log |\Sigma|]\).
For diagonal \(\Sigma\), \(KL = \frac12 \sum_{k=1}^d (\mu_k^2 + \Sigma_{kk} -1 -\log\Sigma_{kk})\).
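A minimal numpy sketch of the reparameterized sample and the diagonal-Gaussian KL term above (shapes and variable names are assumptions; a real VAE would compute these inside an autodiff framework and add the reconstruction NLL):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    # z = mu + sigma * eps with eps ~ N(0, I), so the sample is a deterministic
    # function of (mu, log_var) plus parameter-free noise.
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_diag_gaussian(mu, log_var):
    # KL[N(mu, diag(sigma^2)) || N(0, I)] = 1/2 sum_k (mu_k^2 + sigma_k^2 - 1 - log sigma_k^2)
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var, axis=-1)

rng = np.random.default_rng(0)
mu, log_var = np.zeros((2, 4)), np.zeros((2, 4))   # batch of 2, latent dim 4
z = reparameterize(mu, log_var, rng)
print(kl_diag_gaussian(mu, log_var))               # zeros: q already matches the prior
```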