MACHINE LEARNING TODAY: CURRENT RESEARCH AND ADVANCES FROM AMLAB, UVA
DANIEL WORRALL
WHO ARE WE?
~30 researchers working under Max Welling and Joris Mooij
- 4 industrially funded ‘labs’
- Everyone works in deep learning
- We do fundamental research in machine learning
WHAT IS MACHINE LEARNING?
Mauna Loa is one of five volcanoes that form the Island of Hawaii in the U.S.
state of Hawaii in the Pacific Ocean. The largest subaerial volcano in both
mass and volume, Mauna Loa has historically been considered the largest
volcano on Earth, dwarfed only by Tamu Massif.
WHAT IS MACHINE LEARNING?
In machine learning we use past data to make predictions about the future.
$p(y^* \mid x^*, \mathcal{D}) = \mathcal{N}\left(y^* \mid \mu(x^*), \sigma^2(x^*)\right)$
Here $\mathcal{D}$ is the data, $x^*$ the test input, $y^*$ the test output, and $\mathcal{N}$ a Gaussian.
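To make this concrete, here is a hedged sketch of producing such a Gaussian predictive distribution with a Gaussian-process regressor. The classic demo uses the Mauna Loa CO2 record; the toy sine data and hyperparameters below are my own illustrative assumptions, not the slides' experiment.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Toy stand-in for "past data" D = {(x_i, y_i)}: a noisy sine wave.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=0.01)
gp.fit(X, y)

# p(y* | x*, D) = N(y* | mu(x*), sigma^2(x*)): mean and std at a test input x*.
x_star = np.array([[5.0]])
mu, sigma = gp.predict(x_star, return_std=True)
print(mu[0], sigma[0])
```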
WHAT IS MACHINE LEARNING?
Predictions are probability distributions.
Our main tool, conditional distributions:
$p(x \mid \theta)$, where $x$ is the data and $\theta$ the parameters/models/unknowns.
- How do we choose $p$? Symmetry constraints, domain choice, flexibility.
- How do we learn $\theta$? Approximations, computation, memory.
WHAT IS MACHINE LEARNING?
Some terminology:
- Probability: given parameters $\theta$, generate data $\{x_1, x_2, \ldots\}$ via $p(x \mid \theta)$.
- Statistics: given data $\{x_1, x_2, \ldots\}$, infer $\theta$.
- Machine learning: given data $\{x_1, x_2, \ldots\}$, predict unseen data $\{x^*\}$.
WHAT WE DO
- Variational methods
- Normalizing flows
- Graphs
- Symmetry
- Reinforcement learning
- Transfer learning
- Medical imaging
- Generative modelling
- Compression
- Low-precision neural networks
- Spiking neural networks
- Semi-supervised learning
VARIATIONAL METHODS
Approximate inference
The posterior is intractable:
$$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})} = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{\int p(\mathcal{D} \mid \theta)\, p(\theta)\, \mathrm{d}\theta}$$
So we approximate it with a tractable $q_\phi(\theta)$, chosen to minimise the KL divergence (a 'distance' between distributions):
$$\begin{aligned}
q_\phi(\theta) &= \arg\min_\phi D_{\mathrm{KL}}\left[ q_\phi(\theta) \,\|\, p(\theta \mid \mathcal{D}) \right] \\
&= \arg\min_\phi \int q_\phi(\theta) \log \frac{q_\phi(\theta)}{p(\theta \mid \mathcal{D})} \, \mathrm{d}\theta \\
&= \arg\min_\phi \int q_\phi(\theta) \log \frac{q_\phi(\theta)\, p(\mathcal{D})}{p(\mathcal{D} \mid \theta)\, p(\theta)} \, \mathrm{d}\theta \\
&= \arg\max_\phi \underbrace{\mathbb{E}_{q_\phi(\theta)}[\log p(\mathcal{D} \mid \theta)]}_{\text{log-likelihood}} - \underbrace{D_{\mathrm{KL}}\left[ q_\phi(\theta) \,\|\, p(\theta) \right]}_{\text{regulariser}}
\end{aligned}$$
The maximised quantity is the ELBO (evidence lower bound).
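As a minimal illustration of this objective (my addition, not from the slides), the sketch below Monte-Carlo estimates the ELBO for a toy conjugate Gaussian model with a Gaussian $q$; the model and helper names are assumptions for the example only.

```python
import numpy as np

# Toy model: theta ~ N(0, 1), x_i | theta ~ N(theta, 1),
# approximate posterior q(theta) = N(m, s^2).
rng = np.random.default_rng(0)
x = rng.normal(loc=1.5, scale=1.0, size=20)          # observed data D

def log_gauss(z, mu, var):
    return -0.5 * (np.log(2 * np.pi * var) + (z - mu) ** 2 / var)

def elbo(m, s, n_samples=5000):
    theta = m + s * rng.normal(size=n_samples)       # samples from q(theta)
    log_lik = np.array([log_gauss(x, t, 1.0).sum() for t in theta])
    # ELBO = E_q[log p(D | theta)] - KL[q(theta) || p(theta)], KL in closed form
    kl = 0.5 * (s ** 2 + m ** 2 - 1.0 - np.log(s ** 2))
    return log_lik.mean() - kl

# Exact posterior is N(m*, s*^2) with s*^2 = 1/(n + 1), m* = s*^2 * sum(x).
n = len(x)
s_star2 = 1.0 / (n + 1)
m_star = s_star2 * x.sum()
print(elbo(0.0, 1.0))                  # a poor q: low ELBO
print(elbo(m_star, np.sqrt(s_star2)))  # optimal q: ELBO approaches log p(D)
```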
VARIATIONAL METHODS
Approximate inference
If we give each $x$ a latent $z$, the same objective becomes the variational auto-encoder:
$$\arg\max_\phi \; \mathbb{E}_{p(x)}\left[ \mathbb{E}_{q_\phi(z \mid x)}[\log p(x \mid z)] - D_{\mathrm{KL}}\left[ q_\phi(z \mid x) \,\|\, p(z) \right] \right]$$
where the approximate posterior $q_\phi(z \mid x)$ (and typically the likelihood $p(x \mid z)$) is parameterised by a neural network.
Kingma and Welling (2013)
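Below is a hedged PyTorch sketch of this objective, not the reference implementation of Kingma and Welling (2013); the layer sizes, Bernoulli decoder, and fake data are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Minimal VAE sketch: Gaussian encoder q(z|x), Bernoulli decoder p(x|z)."""

    def __init__(self, x_dim=784, z_dim=20, h_dim=200):
        super().__init__()
        self.enc = nn.Linear(x_dim, h_dim)
        self.enc_mu = nn.Linear(h_dim, z_dim)      # q(z|x) mean
        self.enc_logvar = nn.Linear(h_dim, z_dim)  # q(z|x) log-variance
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterisation
        logits = self.dec(z)
        # E_q(z|x)[log p(x|z)] with a Bernoulli decoder
        rec = -F.binary_cross_entropy_with_logits(logits, x, reduction='sum')
        # KL[q(z|x) || p(z)] against a standard normal prior, in closed form
        kl = -0.5 * torch.sum(1 + logvar - mu ** 2 - logvar.exp())
        return -(rec - kl)                                        # negative ELBO (loss)

x = torch.rand(16, 784)          # fake batch of "images" in [0, 1]
loss = VAE()(x)
loss.backward()                  # train by minimising the negative ELBO
print(loss.item())
```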
VARIATIONAL METHODS
NORMALIZING FLOWS
What is a flexible probability distribution?
e.g. $p(x) = \mathcal{N}(x \mid \mu, \sigma^2)$
e.g. $p(x) = \sum_i \pi_i\, \mathcal{N}(x \mid \mu_i, \sigma_i^2)$
Implicitly define a distribution via a change of variables:
$$x = f_\theta(z), \quad z \sim p(z) \;\implies\; p(x) = p(z)\left| \det \frac{\partial z}{\partial x} \right| = p(z)\left| \det \frac{\partial f}{\partial z} \right|^{-1}$$
The determinant is rather expensive. Goal: design flexible $f$ with cheap determinants.
[Figure: target density vs. flow approximation]
Rezende & Mohamed (2016)
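A tiny numerical check of the change-of-variables formula (my addition): for a 1-D affine flow the pushforward density computed via the Jacobian matches the analytic Gaussian. The affine map and `flow_density` helper are assumptions for the example.

```python
import numpy as np
from scipy.stats import norm

# Affine flow x = f_theta(z) = a * z + b with base density z ~ N(0, 1):
#   p(x) = p(z) * |det dz/dx| = p(f^{-1}(x)) / |a|.
a, b = 2.0, 1.0

def flow_density(x):
    z = (x - b) / a                       # invert the flow, z = f^{-1}(x)
    return norm.pdf(z) / np.abs(a)        # base density times |det dz/dx|

# The pushforward of N(0, 1) through an affine map is N(b, a^2), so the two agree.
xs = np.linspace(-5, 7, 7)
print(np.allclose(flow_density(xs), norm.pdf(xs, loc=b, scale=np.abs(a))))  # -> True
```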
NORMALIZING FLOWS
$x = f_\theta(z), \quad z \sim p(z)$
Kingma & Dhariwal (2018)
NORMALIZING FLOWS: INVERTIBLE LAYERS
Typical layer: $y = g(Wx + b)$
Householder flow: volume preservation
$$z_t = \left( I - 2\,\frac{v_t v_t^\top}{\|v_t\|^2} \right) z_{t-1} = H_t z_{t-1}$$
where $v_t$ is predicted by a neural network.
Tomczak and Welling (2016)
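As a quick sanity check of the volume-preservation claim, here is a minimal numpy sketch (not the authors' code; `householder_step` is a hypothetical helper name) of one Householder step:

```python
import numpy as np

def householder_step(z, v):
    """One Householder flow step: z_t = (I - 2 v v^T / ||v||^2) z_{t-1}."""
    v = v / np.linalg.norm(v)          # normalise so H = I - 2 v v^T
    return z - 2.0 * v * (v @ z)       # apply H z without forming H explicitly

# The reflection H is orthogonal, so |det H| = 1 (volume-preserving) and the
# log-det-Jacobian contribution of the flow step is zero.
rng = np.random.default_rng(0)
v = rng.normal(size=4)
H = np.eye(4) - 2.0 * np.outer(v, v) / (v @ v)
print(np.abs(np.linalg.det(H)))                    # -> 1.0 (up to float error)

z = rng.normal(size=4)
print(np.allclose(householder_step(z, v), H @ z))  # -> True
```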
Sylvester normalising flows
$$z_t = z_{t-1} + A\, h(B z_{t-1} + b)$$
Using $\det(I + AB) = \det(I + BA)$:
$$\det \frac{\partial z_t}{\partial z_{t-1}} = \det\left( I + \operatorname{diag}\!\left(h'(B z_{t-1} + b)\right) B A \right)$$
van den Berg et al. (2018)
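A tiny numpy check (my addition) of the Sylvester determinant identity that makes this Jacobian cheap: an $M \times M$ determinant replaces a $D \times D$ one when $A$ is $D \times M$ and $B$ is $M \times D$.

```python
import numpy as np

# Numerical check of det(I + A B) = det(I + B A).
rng = np.random.default_rng(0)
D, M = 6, 2
A = rng.normal(size=(D, M))
B = rng.normal(size=(M, D))

lhs = np.linalg.det(np.eye(D) + A @ B)   # expensive: D x D determinant
rhs = np.linalg.det(np.eye(M) + B @ A)   # cheap: M x M determinant
print(np.isclose(lhs, rhs))              # -> True
```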
Emerging convolutions
Hoogeboom et al. (2019)
GRAPHS
A lot of data is graph-based: social networks, particle interactions, human
skeletal data, molecular structures, 3D graphics meshes
GRAPHS
[Figure: a graph convolution combines the graph structure (adjacency) with learned weights.]
Kipf and Welling (2017)
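For illustration, here is a hedged numpy sketch of a single graph-convolution layer in the style of Kipf and Welling (2017); the toy graph, feature sizes, and `gcn_layer` helper are assumptions, not the paper's code.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: H' = ReLU( D^{-1/2} (A + I) D^{-1/2} H W ).

    A: (N, N) adjacency, H: (N, F_in) node features, W: (F_in, F_out) weights.
    """
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    d = A_hat.sum(axis=1)                          # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))         # D^{-1/2}
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt       # symmetric normalisation
    return np.maximum(0.0, A_norm @ H @ W)         # propagate, then ReLU

# Toy example: 3-node chain graph, 2 input features, 4 output features.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
H = np.random.default_rng(0).normal(size=(3, 2))
W = np.random.default_rng(1).normal(size=(2, 4))
print(gcn_layer(A, H, W).shape)                    # -> (3, 4)
```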
GRAPHS: MOTION PREDICTION
Kipf et al. (2018)
SYMMETRY?
Symmetry is a property of functions/tasks, e.g. classification, disentangling (the cocktail party problem), signal discovery/detection. It is the set of input transformations leaving the function invariant:
$$f(I) = f(T_\theta[I])$$
where $f$ is the function/feature mapping, $I$ the image, and $T_\theta$ the transformation.
Notational aside:
- $T_\theta[I](x) = I(x - \theta)$, e.g. geometric translation
- $T_\theta[I](x) = I(R_\theta^{-1} x)$, e.g. geometric rotation
- $T[I] = (I - \mu)/\sigma$, e.g. pixel normalisation
EQUIVARIANCE
$$S_\theta[f](I) = f(T_\theta[I])$$
$S_\theta$ is the transformation in feature space. The mapping preserves the algebraic structure of the transformation: $S_\theta$ and $T_\theta$ are different representations of the same transformation.
https://github.com/vdumoulin/conv_arithmetic
Convolution (and correlation)
$$[I * W](x - \theta) = [T_\theta[I] * W](x)$$
When $S_\theta = \mathrm{Id}$ we recover invariance.
Convolutions ⟷ symmetry.
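A quick numerical check of this equivariance (my addition): circular convolution commutes with circular shifts, so transforming the input and then convolving equals convolving and then transforming the output.

```python
import numpy as np

rng = np.random.default_rng(0)
I = rng.normal(size=32)          # 1-D "image"
W = rng.normal(size=32)          # filter, same length for circular convolution

def circ_conv(a, b):
    """Circular convolution via the FFT."""
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

shift = 5
lhs = circ_conv(np.roll(I, shift), W)     # transform input, then convolve
rhs = np.roll(circ_conv(I, W), shift)     # convolve, then transform output
print(np.allclose(lhs, rhs))              # -> True
```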
GROUP EXAMPLES
Examples: translations, rotations, reflections, roto-translations, scalings*.
Non-example: occlusions.
*Current research direction: scalings are probably better modelled as semigroups, i.e. groups without the invertibility condition.
GROUP CONVOLUTIONS
“Convolution” examples:
- Translation: $[I * W](y) = \sum_{x \in \mathbb{Z}^2} I(x)\, W(x - y)$
- Rotation: $[I * W](\theta) = \sum_{x \in \mathbb{Z}^2} I(x)\, W(R_\theta^{-1} x)$
- Roto-translation: $[I * W](\theta, y) = \sum_{x \in \mathbb{Z}^2} I(x)\, W(R_\theta^{-1}(x - y))$
Group convolution: $[I * W](\theta) = \sum_{x \in \mathbb{Z}^2} I(x)\, T_\theta[W](x)$
Semigroup convolution: $[I * W](\theta) = \sum_{x \in \mathbb{Z}^2} T_\theta[I](x)\, W(x)$
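To ground the rotation case, here is a hedged numpy sketch of a lifting group convolution over the 4-fold rotation group: correlate the image with every rotated copy of the filter, giving an output indexed by (rotation, position). Square inputs, 90° rotations, and the `rotation_group_conv` name are assumptions for the example.

```python
import numpy as np

def rotation_group_conv(I, W):
    """Correlate I with each 90-degree rotation of W (T_theta[W])."""
    out = []
    for k in range(4):                                   # rotations by k * 90 degrees
        Wk = np.rot90(W, k)                              # rotated filter T_theta[W]
        H, F = I.shape[0], Wk.shape[0]
        resp = np.array([[np.sum(I[i:i+F, j:j+F] * Wk)   # 'valid' cross-correlation
                          for j in range(H - F + 1)]
                         for i in range(H - F + 1)])
        out.append(resp)
    return np.stack(out)                                 # shape (4, H-F+1, H-F+1)

rng = np.random.default_rng(0)
I = rng.normal(size=(8, 8))
W = rng.normal(size=(3, 3))
print(rotation_group_conv(I, W).shape)                   # -> (4, 6, 6)
```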
EXAMPLES
[Figure: DenseNet vs. rotation-equivariant DenseNet; input, mean prediction, and standard deviation shown for each.]