A Scalable AI Accelerator ASIC Platform
K. Le, January 17, 2019
Presented for information only. No guarantee of accuracy or correctness.
Slide 2: Market moving toward a different type of AI chip
 New realization: cloud-based AI processing has many limitations
 Concentrated cloud-based AI processing consumes too much storage, power, and bandwidth
 Limited ability to support real-time requirements (automotive, robots, drones, etc.)
 Connectivity to the cloud is not always guaranteed (security, mobility, network coverage, etc.)
 Better user experience requires local AI processing
 The mega-trend is toward edge AI processing
 New AI chips are needed to enable “sensor-triggered actions” and “decentralized AI” at the edge
Slide 3: Required edge AI processing functions
 Audio
 speech recognition
 identification, security
 language processing/translation
 Video
 image recognition
 pattern/object/face recognition
 Environmental/physical condition
 pressure, tension, force, temperature, noise, heartbeat, humidity, etc.
Slide 4: A scalable accelerator ASIC platform for edge AI
Slide 5: High-level architecture
 Based on a scalable AI compute fabric
 Pipelined flow for fast learning and inferring
 Flexible architecture suitable for cloud, gateway, and edge AI
 Allows up-scaling to multi-chip solutions
[Block diagram: AI Compute Fabric carrying multiple data streams; input and output data buffers with memory & control; control processor; IO interfaces; ROM/SRAM; PLL & power management]
 AI compute fabric: multiple parallel data streams; scalable and partitionable; energy efficient; fast (up to 2–4 GHz)
 Control processor (Andes, ARM, MIPS, RISC-V, etc.): learning (calculations, algorithm execution, comparing, model updates); inferring (algorithm update, decision making)
 IO interface to multi-chip solutions
Slide 6: Detailed diagram
 At a 250 MHz fabric frequency, the maximum throughput of a 1-cluster AI accelerator is 4 giga byte-operations per second
 (2 bytes x 8 PEs) x 1 cluster x 0.25 GHz -> 4 giga byte-operations per second
 128-bit input and output data buffers allow scaling of the fabric frequency without throttling
 Control processor can be V-type, ARM Cortex, etc., operating at 100–250 MHz
 Two independent clock domains: fabric and control
 Power: <2 W on 14/16FF @ 250 MHz
 Die size: 15 to 20 mm²
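The peak-throughput figure above can be sanity-checked with the slide's own formula. This is only an arithmetic sketch; the assumption that each PE contributes 2 bytes of results per fabric cycle is taken directly from the "(2 bytes x 8 PEs)" term, and the function name is illustrative.

```python
def peak_byte_ops(bytes_per_pe: int, pes: int, clusters: int, freq_ghz: float) -> float:
    """Peak throughput in giga byte-operations per second,
    per the slide's formula: bytes/PE x PEs x clusters x frequency."""
    return bytes_per_pe * pes * clusters * freq_ghz

# 1-cluster accelerator at a 250 MHz (0.25 GHz) fabric clock:
print(peak_byte_ops(2, 8, 1, 0.25))  # -> 4.0 giga byte-ops/s
```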
[Block diagram: AI Compute Fabric (1 block of 8x8 PEs); 128-bit wide input and output data buffers; 32 Gbps interfaces (32 x 1 Gbps LVDS); control processor with 32 KB L1 / 0.5 MB L2 SRAM; IO interface control; PLLs; vertical and horizontal fabric flow control; power management; ROM/security; GPIOs/LVDS; JTAG/test; CPU LPDDR controller & PHY]
Slide 7: AI accelerator ASIC platform: multi-chip solutions
 Up-scalable SoC & system architecture
 Suitable for massive data processing
 Connectivity to server racks or cloud via network
[Diagram: four accelerator chips, each with an AI Compute Fabric plus input and output data buffers, memory & control, connected through a PCIe-4 switch & flow-control SoC; host IO over PCIe-4 to the cloud or to cards/racks]
Slide 8: Processing fabric
An example is the PipeRench coarse-grained reconfigurable architecture from Carnegie Mellon University.
Slide 9: Fabric: local cluster
 Local fabric architecture offers:
 8x8 local cluster configuration sufficient for most applications
 Byte-wide processing elements
 Easy scalability to 8 bytes per local cluster
 Predictable performance
 Ample routing resources
 Pipelined flow architecture
 Faster and more power-efficient than CPU/GPU architectures
 Might add local memory blocks for reverse machine learning applications
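A pipelined row of byte-wide PEs can be sketched in a few lines. This is an illustrative model only, not the actual PE design: the multiply-accumulate operation, the 8-bit wrap-around, and the function names are all assumptions chosen to show how a compute stream chains one PE's accumulator into the next.

```python
def pe(acc: int, x: int, w: int) -> int:
    """Hypothetical byte-wide PE: multiply-accumulate, wrapping to 8 bits."""
    return (acc + x * w) & 0xFF

def compute_stream(xs, ws):
    """One compute stream: a row of PEs fed in pipelined fashion, where
    each stage consumes the accumulator of the previous stage."""
    acc = 0
    for x, w in zip(xs, ws):
        acc = pe(acc, x, w)
    return acc

# 8 MAC stages of 1 * 2 each:
print(compute_stream([1] * 8, [2] * 8))  # -> 16
```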
[Diagram: an 8x8 array of byte-wide PEs with 8-bit H-Local and V-Local buses; multiple compute streams (A, B, ..., N), each with a local memory. Note: H- and V-bus widths to be optimized; expandable to 16-, 32-, 64-bit, etc. word widths.]
Slide 10: Fabric: global clusters
 Global fabric architecture offers:
 Easy scalability to any (X, Y) configuration to fit particular applications
 Pipelined flow architecture
 Higher performance and efficiency
[Diagram: an X x Y array of local clusters (8x8 PEs each) with 8/16-bit H-Global and V-Global buses; multiple compute streams. Note: H- and V-bus widths to be optimized.]
Slide 11: Fast parallel computational fabric
 Parallel computational tasks are mapped at compile time to multiple kernels executed concurrently inside the fabric
 On-chip HW task master to control, schedule, and monitor the kernels
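The compile-time kernel-to-cluster mapping can be sketched as a simple placement routine. This is a hedged illustration of the idea only: the greedy first-free policy, the kernel names, and the cluster-count requirements are assumptions, and the real task master is an on-chip hardware block rather than software.

```python
def assign_kernels(kernels, grid_w, grid_h):
    """Greedily place each kernel (requiring `clusters` 8x8-PE clusters)
    onto the next free clusters of a grid_w x grid_h fabric. Returns a
    mapping from kernel name to the cluster coordinates it occupies."""
    free = [(x, y) for y in range(grid_h) for x in range(grid_w)]
    placement = {}
    for name, clusters in kernels:
        if clusters > len(free):
            raise ValueError(f"fabric full, cannot place kernel {name}")
        placement[name] = [free.pop(0) for _ in range(clusters)]
    return placement

# Three concurrent kernels sharing a 4x4-cluster fabric:
plan = assign_kernels([("A", 4), ("B", 8), ("C", 2)], 4, 4)
print(plan)
```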
[Diagram: local clusters of 8x8 PEs running kernels A, B, and C concurrently across multiple compute streams, coordinated by an on-chip TASK MASTER.]
Slide 12: Processing element
An example is this PE from the University of Illinois.
Slide 13: Processing element
 Proposed by Lu Wan, Chen Dong, and Deming Chen of the University of Illinois at Urbana-Champaign (2012)
 Advantages:
 Complete
 High-performance bypass path
 Compatible with the fabric architecture
 Changes from the original:
 No on-the-fly fabric reconfiguration; done at compile time
 PE to be re-optimized (add barrel shifter?)
Slide 14: Extension: high-performance AI accelerator ASIC platform
Slide 15: AI accelerator ASIC: high-performance platform
 At a 1 GHz fabric frequency, the maximum throughput of a 4x4-cluster AI accelerator ASIC is 256 giga byte-operations per second
 (2 bytes x 8 PEs) x 4x4 clusters x 1 GHz -> 256 giga byte-operations per second
 At a 1 GHz fabric frequency, the maximum throughput of an 8x8-cluster AI accelerator ASIC is 1 TOPS
 The fabric can probably operate at up to 4 GHz in 14LPP -> 4 TOPS
 512-bit input and output data buffers allow scaling of the fabric frequency to over 1 GHz without significant throttling
 Control processor can be 32- or 64-bit (Andes N9 or V-type, ARM Cortex, etc.) operating at 300–500 MHz
 Two independent clock domains: fabric and control
 Power: 10–12 W on 14/16FF (7 W for PCIe-4, 2–3 W for fabric, 2 W for all others)
 Die size: should not exceed 50–60 mm²
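The multi-cluster scaling claims above follow from the same peak-throughput formula used on the 1-cluster slide. A quick arithmetic check, again assuming 2 bytes of results per PE per cycle as the slide's formula states:

```python
def peak_gops(bytes_per_pe: int, pes: int, clusters: int, freq_ghz: float) -> float:
    """Peak throughput in giga byte-operations per second:
    bytes/PE x PEs/cluster x clusters x frequency."""
    return bytes_per_pe * pes * clusters * freq_ghz

print(peak_gops(2, 8, 4 * 4, 1.0))  # 4x4 clusters at 1 GHz -> 256.0
print(peak_gops(2, 8, 8 * 8, 1.0))  # 8x8 clusters at 1 GHz -> 1024.0, ~1 TOPS
print(peak_gops(2, 8, 8 * 8, 4.0))  # 8x8 clusters at 4 GHz -> 4096.0, ~4 TOPS
```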
[Block diagram: 1 GHz AI Compute Fabric (4x4 blocks of 8x8 PEs); 512-bit wide input and output data buffers; 32 PCIe-4 SerDes lanes each way; 32- or 64-bit control processor with 64 KB L1 / 1 MB L2 SRAM; IO interface control; PLLs; vertical and horizontal fabric flow control; power management; ROM/security; GPIOs/LVDS; JTAG/test; CPU LPDDR controller & PHY plus a separate fabric LPDDR controller & PHY. This architecture utilizes a dedicated CPU for AI tasks in the SoC.]
Editor's Notes

  • #3: 160 zettabytes by 2025 from AI devices; 2.6 billion connected AI devices.
  • #7: CPU and memory contain the known agents, model representations, and algorithms for ML. Needs DDR (128-bit) for large data sets and complex models and algorithms.
  • #8: PCIe-4 switch and interface to rack or cloud.
  • #16: CPU and memory contain the known agents, model representations, and algorithms for ML. Needs DDR (128-bit) for large data sets and complex models and algorithms.