Can one algorithm rule them all?
How to automate statistical computations
Alp Kucukelbir
COLUMBIA UNIVERSITY
Can one algorithm rule them all?
Not yet. (But some tools can help!)
Rajesh Ranganath Dustin Tran
Andrew Gelman David Blei
Machine Learning
data
machine
learning
hidden
patterns
We want to discover and explore hidden patterns
to study hard-to-see connections,
to predict future outcomes,
to explore causal relationships.
How taxis navigate the city of Porto [1.7m trips] (K et al., 2016).
How do we use machine learning?
[diagram: a machine learning expert combines data with a statistical model to extract hidden patterns, many months later]
Statistical Model
Make assumptions about data.
Capture uncertainties using probability.
Machine Learning Expert
aka a PhD student.
Machine learning should be
1. Easy to use 2. Scalable 3. Flexible.
[diagram: an automatic tool combines data with a statistical model to extract hidden patterns instantly; the model is then revised and the loop repeats]
“[Statistical] models are developed iteratively: we build a
model, use it to analyze data, assess how it succeeds and
fails, revise it, and repeat.” (Box, 1960; Blei, 2014)
What does this automatic tool need to do?
[diagram: data and a statistical model feed into inference, split into the maths and the algorithm, which produce hidden patterns]
Bayesian Model
  likelihood   p(X | θ)
  prior        p(θ)
  model        p(X, θ) = p(X | θ) p(θ)
The model describes a data generating process.
The latent variables θ capture hidden patterns.
Bayesian Inference
  posterior   p(θ | X) = p(X, θ) / ∫ p(X, θ) dθ
The posterior describes hidden patterns given data X.
It is typically intractable.
Approximating the Posterior
  Sampling: draw samples using MCMC.
  Variational: approximate using a simple function.
The computations depend heavily on the model!
Common Statistical Computations
Expectations
  E_{q(θ;φ)}[ log p(X, θ) ] = ∫ log p(X, θ) q(θ; φ) dθ
Gradients (of expectations)
  ∇_φ E_{q(θ;φ)}[ log p(X, θ) ]
Maximization (by following gradients)
  max_φ E_{q(θ;φ)}[ log p(X, θ) ]
Automating Expectations
Monte Carlo sampling
[figure: area under f(θ) on the interval [a, a+1], approximated by averaging function values at uniform draws]
  ∫_a^{a+1} f(θ) dθ ≈ (1/S) Σ_{s=1}^S f(θ^(s)),   where θ^(s) ∼ Uniform(a, a+1)
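A minimal NumPy sketch of this idea; the integrand (sin) and the interval are illustrative assumptions, chosen only because the exact answer is known.

import numpy as np

rng = np.random.default_rng(0)
S = 100_000
a = 0.0

# Draw S uniform samples on [a, a+1] and average the integrand.
theta = rng.uniform(a, a + 1, size=S)
estimate = np.mean(np.sin(theta))       # (1/S) * sum_s f(theta_s)
exact = np.cos(a) - np.cos(a + 1)       # closed-form integral of sin over [a, a+1]
print(estimate, exact)                  # agree to roughly 3 decimal places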
Automating Expectations
Monte Carlo sampling
  E_{q(θ;φ)}[ log p(X, θ) ] = ∫ log p(X, θ) q(θ; φ) dθ
                            ≈ (1/S) Σ_{s=1}^S log p(X, θ^(s)),   where θ^(s) ∼ q(θ; φ)
Monte Carlo Statistical Methods, Robert and Casella, 1999
Monte Carlo and Quasi-Monte Carlo Sampling, Lemieux, 2009
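The same recipe covers the variational expectation. A hedged sketch: q is a Gaussian with hypothetical parameters φ = (mu, sigma), and log_joint is only a stand-in for the model's log p(X, θ).

import numpy as np

rng = np.random.default_rng(0)
S = 100_000

# Hypothetical variational parameters and a placeholder log-joint;
# in practice log_joint would evaluate log p(X, theta) for the model.
mu, sigma = 0.0, 1.0
log_joint = lambda theta: -0.5 * theta**2

# E_{q(theta; phi)}[log p(X, theta)] ~ (1/S) * sum_s log p(X, theta_s),
# with theta_s drawn from q(theta; phi).
theta = rng.normal(mu, sigma, size=S)
estimate = np.mean(log_joint(theta))
print(estimate)   # about -0.5 for this placeholder, since E[theta^2] = 1 under q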
Automating Expectations
Probability Distributions
Stan, GSL (C++)
NumPy, SciPy, edward (Python)
built-in (R)
Distributions.jl (Julia)
Automating Gradients
Symbolic or Automatic Differentiation
Let f(x1,x2) = logx1 +x1x2 −sinx2. Compute ∂ f(2,5)/∂ x1.
From Baydin et al., Table 2: forward-mode AD for y = f(x1, x2) = ln(x1) + x1 x2 − sin(x2) at (x1, x2) = (2, 5), setting dx1 = 1 and dx2 = 0 to compute ∂y/∂x1. Each intermediate variable vi carries a derivative dvi = ∂vi/∂x1; applying the chain rule to every elementary operation in the evaluation trace generates the derivative trace.

Forward evaluation trace
  v-1 = x1         = 2
  v0  = x2         = 5
  v1  = ln v-1     = ln 2
  v2  = v-1 × v0   = 2 × 5
  v3  = sin v0     = sin 5
  v4  = v1 + v2    = 0.693 + 10
  v5  = v4 − v3    = 10.693 + 0.959
  y   = v5         = 11.652

Forward derivative trace (dvi = ∂vi/∂x1)
  dv-1 = dx1                      = 1
  dv0  = dx2                      = 0
  dv1  = dv-1 / v-1               = 1/2
  dv2  = dv-1 × v0 + dv0 × v-1    = 1 × 5 + 0 × 2
  dv3  = dv0 × cos v0             = 0 × cos 5
  dv4  = dv1 + dv2                = 0.5 + 5
  dv5  = dv4 − dv3                = 5.5 − 0
  dy   = dv5                      = 5.5

Evaluating the vi one by one together with their dvi values delivers the required derivative in the final variable: ∂y/∂x1 = 5.5.
Automatic differentiation in machine learning: a survey, Baydin et al., 2015
#include <stan/math.hpp>
#include <iostream>

int main() {
  using namespace std;

  // Independent variables are reverse-mode autodiff types.
  stan::math::var x1 = 2, x2 = 5;
  stan::math::var f;
  f = log(x1) + x1 * x2 - sin(x2);
  cout << "f(x1, x2) = " << f.val() << endl;

  // Propagate adjoints back to the inputs.
  f.grad();
  cout << "df/dx1 = " << x1.adj() << endl
       << "df/dx2 = " << x2.adj() << endl;
  return 0;
}
The Stan math library, Carpenter et al., 2015
Automating Gradients
Automatic Differentiation
Stan, Adept, CppAD (C++)
autograd, TensorFlow (Python)
radx (R)
http://www.juliadiff.org/ (Julia)
Symbolic Differentiation
SymbolicC++ (C++)
SymPy, Theano (Python)
Deriv, Ryacas (R)
http://www.juliadiff.org/ (Julia)
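As a cross-check of the worked example above, a small sketch using two of the Python tools listed here, autograd and SymPy, on f(x1, x2) = log x1 + x1 x2 − sin x2:

import autograd.numpy as np
from autograd import grad
import sympy as sp

# Automatic differentiation: build the derivative of f w.r.t. its first argument.
def f(x1, x2):
    return np.log(x1) + x1 * x2 - np.sin(x2)

df_dx1 = grad(f, 0)
print(df_dx1(2.0, 5.0))        # 1/2 + 5 = 5.5, matching the forward-mode trace

# Symbolic differentiation: manipulate the expression itself.
x1, x2 = sp.symbols('x1 x2')
expr = sp.log(x1) + x1 * x2 - sp.sin(x2)
print(sp.diff(expr, x1))                        # x2 + 1/x1
print(sp.diff(expr, x1).subs({x1: 2, x2: 5}))   # 11/2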
Stochastic Optimization
Follow noisy unbiased gradients.
[figure: Murphy (2012), Figure 8.8, illustration of the LMS algorithm: trajectory toward the least-squares solution (left); RSS vs. iteration, which does not decrease monotonically (right)]
Scale up by subsampling the data at each step.
Machine Learning: a Probabilistic Perspective, Murphy, 2012
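A minimal NumPy sketch of the idea on a synthetic least-squares problem; the data, minibatch size, and step-size schedule are all illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data y = X w_true + noise (illustrative only).
N, D = 10_000, 2
X = rng.normal(size=(N, D))
w_true = np.array([1.45, 0.92])
y = X @ w_true + 0.1 * rng.normal(size=N)

# Follow noisy, unbiased gradients computed on a random minibatch at each step.
w = np.zeros(D)
B = 32
for t in range(2_000):
    idx = rng.integers(0, N, size=B)                   # subsample the data
    g = 2.0 * X[idx].T @ (X[idx] @ w - y[idx]) / B     # unbiased minibatch gradient
    step = 0.05 / (1.0 + 0.01 * t)                     # decaying step size (Robbins-Monro)
    w -= step * g

print(w)   # close to w_true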
Stochastic Optimization
Generic Implementations
Vowpal Wabbit, sgd (C++)
Theano, TensorFlow (Python)
sgd (R)
SGDOptim.jl (Julia)
ADVI (Automatic Differentiation Variational Inference)
An easy-to-use, scalable, flexible algorithm
mc-stan.org
Stan is a probabilistic programming system.
1. Write the model in a simple language.
2. Provide data.
3. Run.
RStan, PyStan, Stan.jl, ...
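For instance, a hedged sketch of those three steps from Python, assuming the PyStan 2 interface; the model, data, and variable names are made up for illustration and exact arguments may differ across versions.

import numpy as np
import pystan

# 1. Write the model in a simple language (the Stan modeling language).
model_code = """
data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  y ~ normal(mu, sigma);
}
"""

# 2. Provide data.
data = {"N": 100, "y": np.random.randn(100).tolist()}

# 3. Run: compile, then fit with MCMC sampling or with ADVI.
model = pystan.StanModel(model_code=model_code)
mcmc_fit = model.sampling(data=data)   # NUTS/HMC sampling
advi_fit = model.vb(data=data)         # ADVI (variational inference)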
How taxis navigate the city of Porto [1.7m trips] (K et al., 2016).
Exploring Taxi Rides
Data: 1.7 million taxi rides
Write down a pPCA model. (∼minutes)
Use ADVI to infer subspace. (∼hours)
Project data into pPCA subspace. (∼minutes)
Write down a mixture model. (∼minutes)
Use ADVI to find patterns. (∼minutes)
Write down a supervised pPCA model. (∼minutes)
Repeat. (∼hours)
What would have taken us weeks → a single day.
Monte Carlo Statistical Methods, Robert and Casella, 1999
Monte Carlo and Quasi-Monte Carlo Sampling, Lemieux, 2009
Automatic differentiation in machine learning: a survey, Baydin et al., 2015
The Stan math library, Carpenter et al., 2015
Machine Learning: a Probabilistic Perspective, Murphy, 2012
Automatic differentiation variational inference, K et al., 2016
proditus.com mc-stan.org Thank you!
EXTRA SLIDES
Kullback-Leibler Divergence
  KL( q(θ) ‖ p(θ | X) ) = ∫ q(θ) log [ q(θ) / p(θ | X) ] dθ
                        = E_{q(θ)}[ log ( q(θ) / p(θ | X) ) ]
                        = E_{q(θ)}[ log q(θ) − log p(θ | X) ]
Related Objective Function
  L(φ) = log p(X) − KL( q(θ) ‖ p(θ | X) )
       = log p(X) − E_{q(θ)}[ log q(θ) − log p(θ | X) ]
       = log p(X) + E_{q(θ)}[ log p(X | θ) + log p(θ) − log p(X) ] − E_{q(θ)}[ log q(θ) ]
       = E_{q(θ)}[ log p(X, θ) ] − E_{q(θ)}[ log q(θ) ]
       = E_{q(θ;φ)}[ log p(X, θ) ]   (cross-entropy)   −   E_{q(θ;φ)}[ log q(θ; φ) ]   (entropy)
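A small numerical sanity check of the first KL identity, assuming q and p are both Gaussians so the divergence also has a closed form (a plain density p(θ) stands in for the posterior p(θ | X) purely for illustration).

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# q = N(0, 1), p = N(1, 2): hypothetical choices with a known KL divergence.
mu_q, s_q, mu_p, s_p = 0.0, 1.0, 1.0, 2.0

# Monte Carlo estimate of KL(q || p) = E_q[log q(theta) - log p(theta)].
theta = rng.normal(mu_q, s_q, size=200_000)
kl_mc = np.mean(norm.logpdf(theta, mu_q, s_q) - norm.logpdf(theta, mu_p, s_p))

# Closed form for two univariate Gaussians.
kl_exact = np.log(s_p / s_q) + (s_q**2 + (mu_q - mu_p)**2) / (2 * s_p**2) - 0.5
print(kl_mc, kl_exact)   # the two agree closely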