A Scalable AI Accelerator ASIC Platform
K. Le, January 17, 2019
Presented for information only. No guarantee of accuracy or correctness.
Slide 2: Market moving toward a different type of AI chip
 New realization: cloud-based AI processing has many limitations
 Concentrated cloud-based AI processing consumes too much storage, power, and bandwidth
 Limited ability to support real-time requirements (automotive, robots, drones, etc.)
 Connectivity to the cloud is not always guaranteed (security, mobility, network coverage, etc.)
 Better user experience requires local AI processing
 The mega-trend is toward edge AI processing
 New AI chips are needed to enable “sensor-triggered actions” and “decentralized AI” at the edge
Slide 3: Required edge AI processing functions
 Audio
 speech recognition
 identification, security
 language processing/translation
 Video
 image recognition
 pattern/object/face recognition
 Environmental/physical condition
 pressure, tension, force, temperature, noise, heartbeat, humidity, etc.
Slide 4: A scalable accelerator ASIC platform for edge AI
Slide 5: High-level architecture
 Based on a scalable AI compute fabric
 Pipelined flow for fast learning and inferring
 Flexible architecture suitable for cloud, gateway, and edge AI
 Allows up-scaling to multi-chip solutions
[Block diagram: AI Compute Fabric carrying multiple data streams; input and output data buffers with memory & control; control processor; IO interfaces; ROM/SRAM; PLL & power management]
 AI compute fabric: multiple parallel data streams; scalable and partitionable; energy efficient; fast (up to 2–4 GHz)
 Control processor (Andes, ARM, MIPS, RISC-V, etc.): learning (calculations, algorithm execution, comparing, model updates); inferring (algorithm update, decision making)
 IO interface to multi-chip solutions
Slide 6: Detailed diagram
 At a 250 MHz fabric frequency, the maximum throughput of a 1-cluster AI accelerator is 4 giga byte-operations per second
 (2 bytes x 8 PEs) x 1 cluster x 0.25 GHz -> 4 giga byte-operations per second
 128-bit input and output data buffers allow scaling of the fabric frequency without throttling
 Control processor can be V-type, ARM Cortex, etc., operating at 100–250 MHz
 Two independent clock domains: fabric and control
 Power: <2 W on 14/16FF @ 250 MHz
 Die size: 15 to 20 mm²
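The peak-throughput figure above can be sanity-checked with the slide's own formula. This is only an arithmetic sketch; the assumption that each PE contributes 2 bytes of results per fabric cycle is taken directly from the "(2 bytes x 8 PEs)" term, and the function name is illustrative.

```python
def peak_byte_ops(bytes_per_pe: int, pes: int, clusters: int, freq_ghz: float) -> float:
    """Peak throughput in giga byte-operations per second,
    per the slide's formula: bytes/PE x PEs x clusters x frequency."""
    return bytes_per_pe * pes * clusters * freq_ghz

# 1-cluster accelerator at a 250 MHz (0.25 GHz) fabric clock:
print(peak_byte_ops(2, 8, 1, 0.25))  # -> 4.0 giga byte-ops/s
```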
[Block diagram: AI Compute Fabric (1 block of 8x8 PEs); 128-bit wide input and output data buffers; 32 Gbps interfaces (32 x 1 Gbps LVDS); control processor with 32 KB L1 / 0.5 MB L2 SRAM; IO interface control; PLLs; vertical and horizontal fabric flow control; power management; ROM/security; GPIOs/LVDS; JTAG/test; CPU LPDDR controller & PHY]
Slide 7: AI accelerator ASIC platform: multi-chip solutions
 Up-scalable SoC & system architecture
 Suitable for massive data processing
 Connectivity to server racks or cloud via network
[Diagram: four accelerator chips, each with an AI Compute Fabric plus input and output data buffers, memory & control, connected through a PCIe-4 switch & flow-control SoC; host IO over PCIe-4 to the cloud or to cards/racks]
Slide 8: Processing fabric
An example is the PipeRench coarse-grained reconfigurable architecture from Carnegie Mellon University.
Slide 9: Fabric: local cluster
 Local fabric architecture offers:
 8x8 local cluster configuration sufficient for most applications
 Byte-wide processing elements
 Easy scalability to 8 bytes per local cluster
 Predictable performance
 Ample routing resources
 Pipelined flow architecture
 Faster and more power-efficient than CPU/GPU architectures
 Might add local memory blocks for reverse machine learning applications
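A pipelined row of byte-wide PEs can be sketched in a few lines. This is an illustrative model only, not the actual PE design: the multiply-accumulate operation, the 8-bit wrap-around, and the function names are all assumptions chosen to show how a compute stream chains one PE's accumulator into the next.

```python
def pe(acc: int, x: int, w: int) -> int:
    """Hypothetical byte-wide PE: multiply-accumulate, wrapping to 8 bits."""
    return (acc + x * w) & 0xFF

def compute_stream(xs, ws):
    """One compute stream: a row of PEs fed in pipelined fashion, where
    each stage consumes the accumulator of the previous stage."""
    acc = 0
    for x, w in zip(xs, ws):
        acc = pe(acc, x, w)
    return acc

# 8 MAC stages of 1 * 2 each:
print(compute_stream([1] * 8, [2] * 8))  # -> 16
```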
[Diagram: an 8x8 array of byte-wide PEs with 8-bit H-Local and V-Local buses; multiple compute streams (A, B, ..., N), each with a local memory. Note: H- and V-bus widths to be optimized; expandable to 16-, 32-, 64-bit, etc. word widths.]
Slide 10: Fabric: global clusters
 Global fabric architecture offers:
 Easy scalability to any (X, Y) configuration to fit particular applications
 Pipelined flow architecture
 Higher performance and efficiency
[Diagram: an X x Y array of local clusters (8x8 PEs each) with 8/16-bit H-Global and V-Global buses; multiple compute streams. Note: H- and V-bus widths to be optimized.]
Slide 11: Fast parallel computational fabric
 Parallel computational tasks are mapped at compile time to multiple kernels executed concurrently inside the fabric
 On-chip HW task master to control, schedule, and monitor the kernels
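The compile-time kernel-to-cluster mapping can be sketched as a simple placement routine. This is a hedged illustration of the idea only: the greedy first-free policy, the kernel names, and the cluster-count requirements are assumptions, and the real task master is an on-chip hardware block rather than software.

```python
def assign_kernels(kernels, grid_w, grid_h):
    """Greedily place each kernel (requiring `clusters` 8x8-PE clusters)
    onto the next free clusters of a grid_w x grid_h fabric. Returns a
    mapping from kernel name to the cluster coordinates it occupies."""
    free = [(x, y) for y in range(grid_h) for x in range(grid_w)]
    placement = {}
    for name, clusters in kernels:
        if clusters > len(free):
            raise ValueError(f"fabric full, cannot place kernel {name}")
        placement[name] = [free.pop(0) for _ in range(clusters)]
    return placement

# Three concurrent kernels sharing a 4x4-cluster fabric:
plan = assign_kernels([("A", 4), ("B", 8), ("C", 2)], 4, 4)
print(plan)
```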
[Diagram: local clusters of 8x8 PEs running kernels A, B, and C concurrently across multiple compute streams, coordinated by an on-chip TASK MASTER.]
Slide 12: Processing element
An example is this PE from the University of Illinois.
Slide 13: Processing element
 Proposed by Lu Wan, Chen Dong, and Deming Chen of the University of Illinois at Urbana-Champaign (2012)
 Advantages:
 Complete
 High-performance bypass path
 Compatible with the fabric architecture
 Changes from the original:
 No on-the-fly fabric reconfiguration; done at compile time
 PE to be re-optimized (add barrel shifter?)
Slide 14: Extension: high-performance AI accelerator ASIC platform
Slide 15: AI accelerator ASIC: high-performance platform
 At a 1 GHz fabric frequency, the maximum throughput of a 4x4-cluster AI accelerator ASIC is 256 giga byte-operations per second
 (2 bytes x 8 PEs) x 4x4 clusters x 1 GHz -> 256 giga byte-operations per second
 At a 1 GHz fabric frequency, the maximum throughput of an 8x8-cluster AI accelerator ASIC is 1 TOPS
 The fabric can probably operate at up to 4 GHz in 14LPP -> 4 TOPS
 512-bit input and output data buffers allow scaling of the fabric frequency to over 1 GHz without significant throttling
 Control processor can be 32- or 64-bit (Andes N9 or V-type, ARM Cortex, etc.) operating at 300–500 MHz
 Two independent clock domains: fabric and control
 Power: 10–12 W on 14/16FF (7 W for PCIe-4, 2–3 W for fabric, 2 W for all others)
 Die size: should not exceed 50–60 mm²
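The multi-cluster scaling claims above follow from the same peak-throughput formula used on the 1-cluster slide. A quick arithmetic check, again assuming 2 bytes of results per PE per cycle as the slide's formula states:

```python
def peak_gops(bytes_per_pe: int, pes: int, clusters: int, freq_ghz: float) -> float:
    """Peak throughput in giga byte-operations per second:
    bytes/PE x PEs/cluster x clusters x frequency."""
    return bytes_per_pe * pes * clusters * freq_ghz

print(peak_gops(2, 8, 4 * 4, 1.0))  # 4x4 clusters at 1 GHz -> 256.0
print(peak_gops(2, 8, 8 * 8, 1.0))  # 8x8 clusters at 1 GHz -> 1024.0, ~1 TOPS
print(peak_gops(2, 8, 8 * 8, 4.0))  # 8x8 clusters at 4 GHz -> 4096.0, ~4 TOPS
```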
[Block diagram: 1 GHz AI Compute Fabric (4x4 blocks of 8x8 PEs); 512-bit wide input and output data buffers; 32 PCIe-4 SerDes lanes each way; 32- or 64-bit control processor with 64 KB L1 / 1 MB L2 SRAM; IO interface control; PLLs; vertical and horizontal fabric flow control; power management; ROM/security; GPIOs/LVDS; JTAG/test; CPU LPDDR controller & PHY plus a separate fabric LPDDR controller & PHY. This architecture utilizes a dedicated CPU for AI tasks in the SoC.]
Editor's Notes

  • #3: 160 zettabytes by 2025 from AI devices; 2.6 billion connected AI devices.
  • #7: CPU and memory contain the known agents, model representations, and algorithms for ML. Needs DDR (128-bit) for large data sets and complex models and algorithms.
  • #8: PCIe-4 switch and interface to rack or cloud.
  • #16: CPU and memory contain the known agents, model representations, and algorithms for ML. Needs DDR (128-bit) for large data sets and complex models and algorithms.