Introduction to
Accelerators
CS4532 Concurrent Programming
Dilum Bandara
Dilum.Bandara@uom.lk
Some slides adapted from Prof. Sanath Jayasena, University of Moratuwa
Outline
 GPUs
 CUDA programming
 Xeon Phi
Graphics Processing Unit (GPU)
 A GPU is typically a separate expansion card installed in a
PCI Express slot
 Market leaders – NVIDIA, Intel, AMD (ATI)
 NVIDIA GPUs at UoM
 Intel MIC (Many Integrated Core)
(Pictured: GeForce GTX 480 & Tesla 2070)
Example Specifications

                               GTX 480         Tesla 2070      Tesla K80
Peak double-precision FP perf. 650 Gigaflops   515 Gigaflops   2.91 Teraflops
Peak single-precision FP perf. 1.3 Teraflops   1.03 Teraflops  8.74 Teraflops
CUDA cores                     480             448             4992
Frequency of CUDA cores        1.40 GHz        1.15 GHz        560/875 MHz
Memory size (GDDR5)            1536 MB         6 GB            24 GB
Memory bandwidth               177.4 GB/sec    150 GB/sec      480 GB/sec
ECC memory                     No              Yes             Yes
GPUs (Cont.)
 Originally designed to accelerate the large number of
computations performed in graphics rendering
 Offloaded numerically intensive computation from the CPU
 GPUs grew with the demand for high-performance graphics
 Eventually GPUs became more powerful than CPUs for many
computations
 Cost-power-performance advantage
GPU Basics
 Today's GPUs
 High-performance, many-core processors that can be
used to accelerate a wide range of applications
 GPUs have led the race for floating-point performance
since the start of the 21st century
 GPGPU
 GPUs are being used as parallel processors for
general-purpose computation
CPU vs. GPU Architecture
GPU devotes more transistors to computation
FLOPS & GFLOPS
 FLOPS = floating-point operations per second
 Example

                               CPU     GPU
Number of cores                4       448
FP operations per core/cycle   4       1
Clock speed (GHz)              2.5     1.15
Performance (GFLOPS)           40      515
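These figures come from the usual back-of-the-envelope peak-throughput
estimate (ignoring SIMD width & memory bandwidth):

Peak GFLOPS = cores × FP operations per core per cycle × clock (GHz)
CPU: 4 × 4 × 2.5 = 40 GFLOPS
GPU: 448 × 1 × 1.15 ≈ 515 GFLOPS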
CPU-GPU Performance Gap
Source - http://michaelgalloy.com
Simple Performance Test
 Matrix multiplication
 C = B*A
 CPU
 Intel Core i7
 cblas_dgemm(…) from BLAS library (see the sketch below)
 GPU
 Nvidia GTX 480
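A minimal sketch of the CPU side, assuming a CBLAS implementation
(e.g., OpenBLAS or Intel MKL) is linked; the matrix sizes are illustrative:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cblas.h>

int main(void)
{
    int m = 400, k = 400, n = 800;   /* C (m x n) = B (m x k) * A (k x n) */
    double *A = malloc((size_t)k * n * sizeof(double));
    double *B = malloc((size_t)m * k * sizeof(double));
    double *C = malloc((size_t)m * n * sizeof(double));
    for (int i = 0; i < k * n; i++) A[i] = 1.0;
    for (int i = 0; i < m * k; i++) B[i] = 2.0;

    clock_t start = clock();
    /* Row-major, no transposes: C = 1.0 * B * A + 0.0 * C */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0, B, k, A, n, 0.0, C, n);
    printf("CPU time: %.4f s\n", (double)(clock() - start) / CLOCKS_PER_SEC);

    free(A); free(B); free(C);
    return 0;
}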
Results

Dimensions (A, B, C)                      CPU time (s)  GPU time (s)  Speedup
[400,800], [400,400], [400,800]           0.17          0.00109       155
[800,1600], [800,800], [800,1600]         2.10          0.00846       258
[1200,2400], [1200,1200], [2400,2400]     6.65          0.02860       232
[1600,2400], [1600,1600], [2400,2400]     15.18         0.06739       225
[2000,4000], [4000,4000], [4000,4000]     29.44         0.13178       223
[2400,4800], [2400,4800], [4800,4800]     50.21         0.22703       221
Applications of GPGPU
 Computational Structural Mechanics
 Bio-Informatics and Life Sciences
 Computational Electromagnetics & Electrodynamics
 Computational Finance
 Computational Fluid Dynamics
 Data Mining, Analytics, & Databases
 Imaging & Computer Vision
 Medical Imaging
 Molecular Dynamics
 Numerical Analytics
 Weather, Atmospheric, Ocean Modeling & Space Sciences
Programming GPUs
 CUDA language for Nvidia GPU products
 Compute Unified Device Architecture
 Based on C
 nvcc compiler
 Lots of tools for analysis, debugging, profiling, …
 OpenCL – Open Computing Language
 Based on C
 Supports GPU & CPU programming
 Support for Java, Python, MATLAB, etc.
 Lots of active research
 e.g., automatic code generation for GPUs
Multithreaded SIMD Processor
Caution!
 GPU designed as a numeric computing engine
 Will not perform as well as CPUs on some tasks
 Most applications will use both CPUs & GPUs
 For some computations, the cost of transferring data
between CPU & GPU can be high
 SIMD-type data parallelism is key to benefiting from GPUs
 … and there must be enough of it (out of the total computation)
CUDA Architecture
 CUDA is NVIDIA's solution for accessing the GPU
 Can be seen as an extension to C/C++
CUDA Software Stack
CUDA Architecture (Cont.)
2 main parts
1. Host (CPU part)
• Single Program, Single Data
• Launches kernel on the GPU
2. Device (GPU part)
• Single Program, Multiple Data
• Runs the kernel
A function executed on the GPU (device) is called a "kernel"
CUDA Architecture (Cont.)
Grid Architecture

Grid
• A group of threads all running the same kernel
• Multiple grids can run at once

Block
• Grids are composed of blocks
• Each block is a logical unit containing a number of coordinating
threads & some amount of shared memory
Example Program

#include <cuda.h>
#include <stdio.h>

__global__ void kernel(void)
{ }

int main(void)
{
    kernel<<<1, 1>>>();    // launch: 1 block of 1 thread
    printf("Hello World!\n");
    return 0;
}

 "__global__" says the function is to be compiled to run on
the "device" (GPU), not the "host" (CPU)
 Angle brackets "<<<" & ">>>" pass launch parameters/arguments
to the runtime
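To build & run (assuming the CUDA toolkit is installed & the source is
saved as hello.cu, an illustrative file name):

nvcc hello.cu -o hello
./hello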
Thread Blocks
 Within host (CPU) code, call the kernel using <<< & >>>,
specifying the grid size (number of blocks) & block size
(number of threads per block)
Grids, Blocks & Threads
 Grid of size 6 (3x2 blocks)
 Each block has 12 threads (4x3)
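This layout can be requested with dim3 launch parameters. A minimal
sketch (the kernel name myKernel is hypothetical):

dim3 grid(3, 2);             // 3x2 = 6 blocks per grid
dim3 block(4, 3);            // 4x3 = 12 threads per block
myKernel<<<grid, block>>>(); // 6 blocks x 12 threads = 72 threads in total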
Thread IDs
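Each thread derives a unique ID from the CUDA built-in variables; the
standard 1D indexing pattern is:

__global__ void kernel(void)
{
    // Global thread ID = offset of this block + position within the block
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
}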
CUDA Device Memory Model
 Host & devices have separate memory spaces
 e.g., hardware cards with their own DRAM
 To execute a kernel on a device
 Need to allocate memory on the device
 Transfer data
 Host memory → device memory
 After device execution
 Transfer results
 Device memory → host memory
 Free device memory that is no longer needed
CUDA API – Memory Mgt.
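The core memory-management calls of the CUDA runtime API are declared
as follows:

cudaError_t cudaMalloc(void **devPtr, size_t size);         // allocate device memory
cudaError_t cudaMemcpy(void *dst, const void *src,
                       size_t count, cudaMemcpyKind kind);  // copy, e.g., cudaMemcpyHostToDevice
cudaError_t cudaFree(void *devPtr);                         // release device memory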
Memory Access

#include <stdio.h>

// Kernel used by this snippet (not shown on the original slide):
// adds two integers on the device
__global__ void add(int a, int b, int *c)
{
    *c = a + b;
}

int main(void)
{
    int c, *dev_c;
    cudaMalloc((void **) &dev_c, sizeof(int));  // allocate device memory
    add<<<1, 1>>>(2, 7, dev_c);                 // launch: 1 block of 1 thread
    cudaMemcpy(&c, dev_c, sizeof(int),
               cudaMemcpyDeviceToHost);         // copy result back to host
    printf("2 + 7 = %d\n", c);
    cudaFree(dev_c);                            // free device memory
    return 0;
}
Example

#include <stdio.h>
#define N 10

__global__ void add(int *a, int *b, int *c)
{
    int tID = blockIdx.x;   // one thread per block, so block index = element index
    if (tID < N)
    {
        c[tID] = a[tID] + b[tID];
    }
}

int main()
{
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    // Allocate device memory for the three vectors
    cudaMalloc((void **) &dev_a, N*sizeof(int));
    cudaMalloc((void **) &dev_b, N*sizeof(int));
    cudaMalloc((void **) &dev_c, N*sizeof(int));

    for (int i = 0; i < N; i++)
        a[i] = i, b[i] = 1;

    // Copy inputs: host memory -> device memory
    cudaMemcpy(dev_a, a, N*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N*sizeof(int), cudaMemcpyHostToDevice);

    // Launch N blocks of 1 thread each
    add<<<N, 1>>>(dev_a, dev_b, dev_c);

    // Copy result: device memory -> host memory
    cudaMemcpy(c, dev_c, N*sizeof(int), cudaMemcpyDeviceToHost);

    for (int i = 0; i < N; i++)
        printf("%d + %d = %d\n", a[i], b[i], c[i]);

    // Free device memory no longer needed
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return 0;
}
Intel Xeon Phi – Many CPUs

Source - www.pcgameshardware.de/Xeon-Phi-Hardware-256199/News/Intel-Xeon-Phi-Hardware-Informationen-1040924/
Intel Xeon Phi Family
Intel Xeon Phi (Cont.)
Source - www.altera.com/technology/system-design/articles/2012/multicore-many-core.html
Intel Xeon Phi (Cont.)
 More (& simpler) cores, many threads, & wider vector units
 Same programming model across host & device
 Linux on device
 Remote login
 Access to network file systems
 High compute density & energy efficiency
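Because the device runs Linux & shares the host programming model, work
can be offloaded with familiar directives. A minimal sketch, assuming the
Intel C/C++ compiler's offload pragmas for first-generation (Knights
Corner) coprocessors; the arrays x, y & length n are illustrative:

// Offload a parallel loop to the coprocessor, copying x in & y in/out
#pragma offload target(mic) in(x:length(n)) inout(y:length(n))
#pragma omp parallel for
for (int i = 0; i < n; i++)
    y[i] = 2.0 * x[i] + y[i];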