1
CNRS & Université Paris-Saclay
BALÁZS KÉGL
DEEP LEARNING	

A STORY OF HARDWARE, DATA, AND
TECHNIQUES&TRICKS
2
The bumpy 60-year history that
led to the current state of the
art
and an overview of what you
can do with it
3
DEEP LEARNING = THREE INTERTWINING STORIES
era                       techniques / tricks   hardware                    data
1957-69 (dawn)            perceptron            early mainframes            toy: linear, small images, XOR
1986-95 (golden age)      early NNs             workstations                MNIST
2006-   (deep learning)   deep NNs              GPU, TPU, Intel Xeon Phi    ImageNet
[Slide shows Figure 2 from Krizhevsky, Sutskever & Hinton (2012): the AlexNet CNN split across two GPUs that communicate only at certain layers; 150,528-dimensional input, remaining layers of 253,440–186,624–64,896–64,896–43,264–4096–4096–1000 neurons.]
• Classification problem y = f(x)
4
DATA-DRIVEN INFERENCE
[Illustration: two insect photos x mapped by f to labels y, ‘Stomorhina’ and ‘Scaeva’.]
• Classification problem y = f(x)
• No model to fit, but a large set of (x, y) pairs	

• The source is typically observation + human labeling
• And a loss function L(y, ypred)
5
DATA-DRIVEN INFERENCE
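In code, data-driven inference is exactly this: fit f on labeled (x, y) pairs, then score it with L(y, y_pred) on held-out data. A minimal sketch with scikit-learn; the digits dataset is a stand-in for the insect images, not the data on the slide.

```python
# Minimal sketch of data-driven inference: learn f from (x, y) pairs,
# evaluate with a loss L(y, y_pred). Dataset and model are stand-ins.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import zero_one_loss
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)                # labeled (x, y) pairs
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

f = LogisticRegression(max_iter=1000).fit(X_train, y_train)  # fit f
y_pred = f.predict(X_test)                         # y_pred = f(x)
print("L(y, y_pred) =", zero_one_loss(y_test, y_pred))  # misclassification rate
```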
6
THE PERCEPTRON (ROSENBLATT 1957)
Weights were encoded in potentiometers, and weight
updates during learning were performed by electric motors.
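The rule those motors were executing is a simple additive weight update. A sketch of the perceptron learning rule in NumPy (toy data and notation are mine, not Rosenblatt's):

```python
import numpy as np

def perceptron_train(X, y, epochs=10, lr=1.0):
    """Rosenblatt-style perceptron: labels in {-1, +1}, decision sign(w·x + b)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * (np.dot(w, x_i) + b) <= 0:   # misclassified example
                w += lr * y_i * x_i               # the "motor turn": nudge the weights
                b += lr * y_i
    return w, b

# Linearly separable toy problem (AND), which the perceptron can learn
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])
w, b = perceptron_train(X, y)
print(np.sign(X @ w + b))   # matches y on this separable toy set
```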
7
THE PERCEPTRON (ROSENBLATT 1957)
Based on Rosenblatt's
statements, The New York
Times reported the perceptron
to be "the embryo of an
electronic computer that [the
Navy] expects will be able to
walk, talk, see, write, reproduce
itself and be conscious of its
existence."
8
MINSKY-PAPERT (1969)
It is impossible to linearly
separate the XOR function
The first winter: 1969 - 86
Is it?
• We knew in the early seventies that the nonlinearity
problem could be overcome (Grossberg 1972)

• Several authors also suggested that error
gradients could be backpropagated through the chain
rule

• So why the winter? Floating-point multiplication was
expensive
9
MULTI-LAYER PERCEPTRON
The XOR net
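A sketch of why one hidden layer answers the objection: with two threshold units (weights hand-picked here for illustration, not learned), XOR becomes linearly separable in the hidden representation.

```python
import numpy as np

def step(z):
    return (z > 0).astype(float)            # threshold nonlinearity

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Hidden layer computes OR and AND; output computes OR AND NOT(AND) = XOR.
W1 = np.array([[1.0, 1.0],                  # columns: weights of the two hidden units
               [1.0, 1.0]])
b1 = np.array([-0.5, -1.5])                 # unit 1 fires on OR, unit 2 on AND
W2 = np.array([1.0, -1.0])                  # OR minus AND
b2 = -0.5

h = step(X @ W1 + b1)                       # hidden representation
print(step(h @ W2 + b2))                    # [0. 1. 1. 0.] == XOR
```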
10
BACK PROPAGATION
• Convolutional nets	

• The first algorithmic tricks: initialization, weight decay,
early stopping	

• Some limited understanding of the theory	

• First commercial success: AT&T check reader (Bottou,
LeCun, Burges, Nohl, Bengio, Haffner, 1996)
11
THE GOLDEN AGE (1986-95)
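Those tricks survive almost unchanged in today's libraries. A hedged sketch, using scikit-learn's MLPClassifier as a stand-in for a hand-written backprop net, just to show where initialization, weight decay, and early stopping plug in:

```python
# Sketch only: golden-age tricks as exposed by a modern library,
# not the original 1980s-90s code.
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)

net = MLPClassifier(
    hidden_layer_sizes=(64,),   # one hidden layer, as in the early MLPs
    alpha=1e-4,                 # weight decay = L2 penalty on the weights
    early_stopping=True,        # stop when a held-out validation score stalls
    validation_fraction=0.1,
    random_state=0,             # reproducible small random initialization
    max_iter=500,
).fit(X, y)

print("training accuracy:", net.score(X, y))
```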
12
THE AT&T CHECK READER
• Reading checks is more
than character
recognition	

• If all steps are
differentiable, the whole
system can be trained
end-to-end by backprop	

• Does it ring a bell?
(Google’s TensorFlow)
13
THE AT&T CHECK READER
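The "everything differentiable" idea in a few lines of automatic differentiation: one loss at the end of a chain of steps yields gradients for every step's parameters in a single backward pass. A toy sketch with TensorFlow's GradientTape (scalar stand-ins, not the check reader's actual modules):

```python
import tensorflow as tf

w1 = tf.Variable(0.5)    # parameter of step 1 (say, "segmentation")
w2 = tf.Variable(-1.0)   # parameter of step 2 (say, "recognition")
x, target = tf.constant(2.0), tf.constant(3.0)

with tf.GradientTape() as tape:
    h = tf.tanh(w1 * x)          # step 1
    y = w2 * h                   # step 2
    loss = (y - target) ** 2     # one loss at the very end

# One backward pass gives gradients for both steps' parameters.
print(tape.gradient(loss, [w1, w2]))
```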
• The mainstream narrative	

• Nonconvexity, local minima, lack of theoretical understanding - BS,
looking for your key where it's lit	

• The vanishing gradient - true	

• Lots of nuts and bolts - partially true, but that alone would not have
deterred people had the effort been worth it	

• Strong competitors with an order of magnitude less engineering:
the support vector machine, forests, boosting - true
14
THE SECOND WINTER (1996-2006-2012)
• The real story	

• We didn’t have the computational power, the architectures, or
large enough data to train deep nets	

• Random forests are way less high-maintenance and they are on par
with single-hidden-layer (shallow) nets, even today	

• We missed some of the tricks due to lack of critical mass in
research: trial and error is expensive	

• NNs didn’t disappear from industry: the check reader still processes
20M checks per day, today
15
THE SECOND WINTER (1996-2006-2012)
16
FIRST TAKE-HOME MESSAGE
Before you jump on the deep learning bandwagon:
scikit-learn forests + XGBoost get
>90% performance on >90% of
industrial problems (a cautious estimate)
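That baseline fits in a dozen lines. A sketch assuming the xgboost package is installed; the dataset here is a placeholder for whatever your industrial problem is.

```python
# Baseline first: random forest + gradient boosting, a few lines each.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)          # placeholder dataset

forest = RandomForestClassifier(n_estimators=300, random_state=0)
boosted = XGBClassifier(n_estimators=300, learning_rate=0.1)

for name, model in [("forest", forest), ("xgboost", boosted)]:
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {score:.3f} cross-validated accuracy")
```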
• NNs are back on the research agenda
17
2006: A NEW WAVE BEGINS
18
2009: IMAGENET
“We believe that a large-scale ontology of images is a critical
resource for developing advanced, large-scale content-based
image search and image understanding algorithms, as well as for
providing critical training and benchmarking data for such
algorithms.” (Fei-Fei Li et al., CVPR 2009)	

• 80K hierarchical categories	

• 80M images of size >100x100	

• labeled by 50K Amazon Mechanical Turk workers
19
2009: IMAGENET
20
GPUS (2004 - )
21
GPUS (2004 - )
• dropout, ReLU, max-pooling, data augmentation, batch normalization,
automatic differentiation, end-to-end training, lots of layers	

• Krizhevsky, Sutskever, Hinton (2012): 1.2M images, 60M parameters,
6 days training on two GPUs
22
TECHNIQUES & TRICKS
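A sketch of how those tricks stack up in a small Keras convnet. The layer sizes are illustrative, this is not AlexNet, and the RandomFlip augmentation layer assumes a recent TensorFlow/Keras.

```python
# Illustrative only: a tiny convnet wiring together the tricks on this slide
# (data augmentation, ReLU, max-pooling, batch normalization, dropout).
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.RandomFlip("horizontal"),           # simple data augmentation
    layers.Conv2D(32, 3, activation="relu"),   # ReLU nonlinearity
    layers.BatchNormalization(),               # batch normalization
    layers.MaxPooling2D(),                     # max-pooling
    layers.Conv2D(64, 3, activation="relu"),
    layers.BatchNormalization(),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dropout(0.5),                       # dropout regularization
    layers.Dense(10, activation="softmax"),
])

# Automatic differentiation and end-to-end training come for free from compile/fit.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```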
23
IMAGENET COMPETITIONS
24
SECOND TAKE-HOME MESSAGE
To make deep learning shine,
you need huge labeled data sets and time to train
• Imagenet (80M >100x100 color images, 80K classes)	

• Facebook (300M photos/day)	

• Google (300h of video/minute)
25
SECOND TAKE-HOME MESSAGE
To make deep learning shine,
you need huge labeled data sets and time to train
• Theano	

• TensorFlow	

• Keras	

• Caffe	

• Torch
26
TODAY: EASY-TO-USE LIBRARIES
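What "easy to use" means in practice: with any of these libraries, a complete training run is a handful of lines. A Keras sketch on MNIST (hyperparameters untuned):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Load MNIST and scale pixels to [0, 1]
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = keras.Sequential([
    layers.Flatten(input_shape=(28, 28)),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=2, validation_data=(x_test, y_test))
```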
27
TODAY: HARDWARE
Google TPU
28
COMMERCIAL APPLICATIONS
29
GOOGLE IMAGE SEARCH
30
FACE RECOGNITION/DETECTION	

A $6B MARKET IN 2020
31
SELF-DRIVING CARS
32
