CONFIDENTIAL
Quantization and Training of
Neural Networks for Efficient
Integer-Arithmetic-Only Inference
[Jacob et al. from Google 2017]
Ryo Takahashi
Motivation
Let's dig deeper into optimized arithmetic inside neural networks!
Approaches to CNN deployment on mobile platforms
● Approach 1: computation/memory-efficient network architectures
  ○ e.g. MobileNet [arXiv:1704.04861], SqueezeNet [arXiv:1602.07360]
● Approach 2: quantization (today's topic)
  ○ definition: quantize weights and activations from float into a lower bit-depth format
  ○ benefit: saves memory/power, speeds up inference
Existing works:
• Ternary weight networks [arXiv:1605.04711]
• Binary neural networks [arXiv:1602.02505]
  - these works can approximate conv. by bit-shifts/bit-counts
Issues:
• Their baseline architectures are over-parameterized
  - fat architectures (e.g. VGG) are easy to compress
  - it is still unclear whether their schemes apply to modern light-weight architectures (e.g. MobileNet)
  - they are verified only on classification tasks, which are tolerant of quantization errors, unlike regression
• NOT efficient on common hardware (e.g. CPU)
  - conv. based on bit-shifts/bit-counts pays off only on custom hardware (e.g. FPGA, ASIC)
Proposal: Integer-arithmetic-only quantization
● Goal: improve the latency-vs-accuracy tradeoff of MobileNets on common hardware
a) Integer-arithmetic-only inference
  - why are weights and activations converted to uint8 rather than int8?
  - why are biases kept at 32-bit?
b) Quantization-aware training
  - weights and activations are quantized during training, unlike post-training calibration
c) Evaluation on ImageNet classification and COCO object detection
OSS Contribution
● This work is included in Google's ML software stack:
  ○ TensorFlow (Model optimization)
  ○ TensorFlow Lite (Case studies)
  ○ Android NN
[Figure: quantization-induced accuracy drop vs. architecture size, from light-weight to fat — big accuracy drop on the light-weight side, small accuracy drop on the fat side; "this work ↓" marks the light-weight end]
Quantization scheme
● Equation: r = S (q − Z)
  ○ where:
    - r : real value
    - q : quantized value
    - S : scale (learned in training)
    - Z : zero-point (learned in training)
● Data structure in C++ (see the sketch below)
  ○ a struct QuantizedBuffer is created for each weight and activation tensor
  ○ each buffer has its own S and Z
  ○ e.g. QType = uint8
Why can we call this "integer-arithmetic-only" despite the float scale S?
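A minimal C++ sketch of this scheme: QuantizedBuffer is the struct named on the slide, but the field layout and the helper names (Quantize, Dequantize) are illustrative, not the paper's actual code.

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <vector>

    // One quantized tensor: uint8 storage plus its own scale S and zero-point Z,
    // so that real_value = S * (quantized_value - Z).
    struct QuantizedBuffer {
        std::vector<std::uint8_t> q;  // quantized values
        float S;                      // scale
        std::uint8_t Z;               // zero-point: the quantized value that represents real 0
    };

    // Quantize a real value: q = clamp(round(r / S) + Z, 0, 255).
    inline std::uint8_t Quantize(float r, float S, std::uint8_t Z) {
        int q = static_cast<int>(std::round(r / S)) + Z;
        return static_cast<std::uint8_t>(std::min(255, std::max(0, q)));
    }

    // Dequantize back to a real value: r = S * (q - Z).
    inline float Dequantize(std::uint8_t q, float S, std::uint8_t Z) {
        return S * (static_cast<int>(q) - static_cast<int>(Z));
    }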
Integer-arithmetic-only matrix multiplication
● Consider X3 = X1 · X2, where each Xα (α = 1, 2, 3) is an N×N real matrix with entries rα[i,j], quantized as rα[i,j] = Sα (qα[i,j] − Zα)
● Substituting the quantization equation, this can be rewritten as:
  q3[i,k] = Z3 + M · Σ_j (q1[i,j] − Z1)(q2[j,k] − Z2),   where M := S1 S2 / S3
● M is empirically in (0, 1), so it can be expressed as M = 2^(−n) · M0, where:
  ○ n is a non-negative integer
  ○ M0 is a fixed-point value of typedef int32_t q31_t; // Q-format
→ Conv. and affine layers get free from float arithmetic by approximating M with an int32_t fixed-point multiply plus a bit-shift
● Expanding the sum, the terms Σ_j q1[i,j] and Σ_j q2[j,k] take only N additions each and can be factored out of the calculation of each q3[i,k]
● The remaining 2N multiply-add operations per output element stay in the inner loop, accumulated in int32
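A hedged C++ sketch of this computation: the uint8 operands are combined in an int32 accumulator, and the float multiplier M = S1·S2/S3 is precomputed offline into the pair (M0, n). For clarity the zero-points are subtracted inside the inner loop rather than factoring out the row/column sums as described above, which is mathematically equivalent. The function names and the round-to-nearest shift are my own simplifications (assuming 0 < M < 1); the kernels actually used by the paper (e.g. gemmlowp) are considerably more careful and SIMD-optimized.

    #include <algorithm>
    #include <cmath>
    #include <cstdint>

    // Decompose M in (0,1) as M = 2^-n * M0, with M0 in [0.5, 1) stored as a Q31 fixed-point int32.
    void QuantizeMultiplier(double M, std::int32_t* M0, int* n) {
        *n = 0;
        while (M < 0.5) { M *= 2.0; ++(*n); }              // normalize the mantissa into [0.5, 1)
        std::int64_t m0 = static_cast<std::int64_t>(std::round(M * (1ll << 31)));
        if (m0 == (1ll << 31)) { m0 /= 2; --(*n); }        // handle rounding up to 1.0
        *M0 = static_cast<std::int32_t>(m0);
    }

    // q3[i,k] = Z3 + M * sum_j (q1[i,j] - Z1)(q2[j,k] - Z2), with the float M replaced by (M0, n).
    void QuantizedMatMul(const std::uint8_t* q1, const std::uint8_t* q2, std::uint8_t* q3,
                         int N, int Z1, int Z2, int Z3, std::int32_t M0, int n) {
        for (int i = 0; i < N; ++i) {
            for (int k = 0; k < N; ++k) {
                std::int32_t acc = 0;                      // int32 accumulator
                for (int j = 0; j < N; ++j)                // the 2N multiply-adds that stay in the inner loop
                    acc += (static_cast<std::int32_t>(q1[i * N + j]) - Z1) *
                           (static_cast<std::int32_t>(q2[j * N + k]) - Z2);
                // Fixed-point rescale: multiply by M0, then shift right by 31 + n with rounding.
                std::int64_t prod = static_cast<std::int64_t>(acc) * M0;
                std::int32_t scaled = static_cast<std::int32_t>((prod + (1ll << (30 + n))) >> (31 + n));
                std::int32_t out = scaled + Z3;
                q3[i * N + k] = static_cast<std::uint8_t>(std::min(255, std::max(0, out)));  // saturate to uint8
            }
        }
    }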
Implementation of a typical fused layer
(1) Accumulate products in int32
  • bias-vectors are quantized to int32, not uint8
    - reason: quantization errors in a bias-vector tend to become overall output errors, because each bias element is added to many output activations
(2) Scale the int32 accumulator down to uint8
  a. multiply by the fixed-point value M0
  b. shift right by n bits
  c. saturating cast to [0, 255]
(3) Apply the activation function
  • a mere uint8 clamp, because MobileNets use only ReLU and ReLU6
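These three steps can be condensed into a single output-stage helper; this is a simplified illustration reusing the (M0, n) multiplier from the previous slide, with the function name and the rounding being assumptions rather than TFLite's actual kernel code.

    #include <algorithm>
    #include <cstdint>

    // Fused output stage of a quantized conv/affine layer:
    // (1) the caller has accumulated the uint8*uint8 products into 'acc'; the int32 bias is added here;
    // (2) the int32 accumulator is scaled down to the uint8 output scale via M = 2^-n * M0;
    // (3) the activation (ReLU/ReLU6) is just a clamp on the quantized range [act_min, act_max].
    inline std::uint8_t RequantizeAndActivate(std::int32_t acc, std::int32_t bias_q,
                                              std::int32_t M0, int n, std::int32_t Z_out,
                                              std::uint8_t act_min, std::uint8_t act_max) {
        acc += bias_q;                                               // (1) int32 bias addition
        std::int64_t prod = static_cast<std::int64_t>(acc) * M0;     // (2a) fixed-point multiply
        std::int32_t scaled =
            static_cast<std::int32_t>((prod + (1ll << (30 + n))) >> (31 + n));  // (2b) n-bit shift
        std::int32_t out = scaled + Z_out;                           // move to the output zero-point
        out = std::min<std::int32_t>(act_max, std::max<std::int32_t>(act_min, out));  // (2c)+(3) clamp
        return static_cast<std::uint8_t>(out);
    }

For ReLU6, act_min is the quantized value of 0.0 (i.e. Z_out) and act_max the quantized value of 6.0, so the activation adds nothing beyond the clamp that the saturating cast already performs.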
Quantization-aware training
● Motivation: post-training quantization has difficulty handling:
  ○ large differences (> 100×) in weight ranges across output channels
  ○ outlier weight values
● Approach: simulate integer-quantization effects during training
  Step 1: create a floating-point graph as usual
  Step 2: insert fake-quantization operations, which round tensors down to the lower bit-depth while keeping them in float
  As training proceeds, the quantization ranges adapt; for activations, in addition to the simulated quantization, the ranges are aggregated via exponential moving averages (EMA)
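A small C++ sketch of what one fake-quantization op computes in the float graph, together with the EMA range tracking used for activations; the formulas follow the slide's description, while the function names and the decay constant are assumptions.

    #include <algorithm>
    #include <cmath>

    // Simulate uint8 quantization in float: snap r to one of 256 levels spanning
    // [range_min, range_max], then return the snapped value as a float.
    // Assumes range_min <= 0 <= range_max and range_max > range_min.
    inline float FakeQuantize(float r, float range_min, float range_max) {
        const float S = (range_max - range_min) / 255.0f;    // scale for 8 bits
        const float Z = std::round(-range_min / S);          // zero-point, so real 0 is exactly representable
        float q = std::round(r / S) + Z;                     // quantize
        q = std::min(255.0f, std::max(0.0f, q));             // saturate to [0, 255]
        return S * (q - Z);                                  // dequantize back to float
    }

    // Activation ranges are collected during training with an exponential moving average.
    inline void UpdateEmaRange(float batch_min, float batch_max,
                               float* ema_min, float* ema_max, float decay = 0.99f) {
        *ema_min = decay * (*ema_min) + (1.0f - decay) * batch_min;
        *ema_max = decay * (*ema_max) + (1.0f - decay) * batch_max;
    }

In the backward pass, gradients flow through the fake-quantization op as if it were the identity (a straight-through estimator), which lets the float weights keep training while "seeing" their quantized values in the forward pass.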
Experiments with MobileNets
● CPU: Snapdragon 835
  ○ architecture: ARM big.LITTLE [Cortex-A73 / Cortex-A53]
  ○ kernels optimized with ARM NEON
ImageNet:
  • on the LITTLE core, the accuracy gap in favor of the quantized models at 33 ms (30 FPS) is quite substantial (~10%)
COCO:
  • MobileNet SSD was used
  • up to a 50% reduction in inference time, with a minimal loss in accuracy (−1.8% relative)
  • 8-bit integer quantization deals well with this regression task
[Plots: latency-vs-accuracy curves on the big core and the LITTLE core]
Summary & My perspective
● Summary
  ○ integer quantization brings benefits on common hardware such as CPUs
  ○ quantization-aware training is crucial when quantizing modern light-weight architectures (e.g. MobileNets) and error-sensitive tasks such as regression
● My perspective
  ○ integer quantization is the most moderate and generic scheme so far
  ○ extreme quantization such as BNNs is achievable only where hardware and software are developed in parallel, e.g. at:
    - Google
    - NVIDIA
    - Apple
