Introduction to
Accelerators
CS4532 Concurrent Programming
Dilum Bandara
Dilum.Bandara@uom.lk
Some slides adapted from Prof. Sanath Jayasena, University of Moratuwa
Outline
 GPUs
 CUDA programming
 Xeon Phi
Graphics Processing Unit (GPU)
 A GPU is typically a separate expansion card installed in a
PCI Express slot
 Market leaders – NVIDIA, Intel, AMD (ATI)
 NVIDIA GPUs at UoM
 Intel MIC (Many Integrated Core)
(Pictured: GeForce GTX 480 & Tesla 2070)
Example Specifications

                               GTX 480         Tesla 2070      Tesla K80
Peak double-precision FP perf. 650 Gigaflops   515 Gigaflops   2.91 Teraflops
Peak single-precision FP perf. 1.3 Teraflops   1.03 Teraflops  8.74 Teraflops
CUDA cores                     480             448             4992
Frequency of CUDA cores        1.40 GHz        1.15 GHz        560/875 MHz
Memory size (GDDR5)            1536 MB         6 GB            24 GB
Memory bandwidth               177.4 GB/sec    150 GB/sec      480 GB/sec
ECC memory                     No              Yes             Yes
GPUs (Cont.)
 Originally designed to accelerate the large number of
computations performed in graphics rendering
 Offloaded numerically intensive computation from the CPU
 GPUs grew with the demand for high-performance graphics
 Eventually GPUs became more powerful than CPUs for many
computations
 Cost-power-performance advantage
GPU Basics
 Today's GPUs
 High-performance, many-core processors that can be
used to accelerate a wide range of applications
 GPUs have led the race for floating-point performance
since the start of the 21st century
 GPGPU
 GPUs are being used as parallel processors for
general-purpose computation
CPU vs. GPU Architecture
GPU devotes more transistors to computation
FLOPS & GFLOPS
 FLOPS = floating-point operations per second
 Example

                               CPU     GPU
Number of cores                4       448
FP operations per core/cycle   4       1
Clock speed (GHz)              2.5     1.15
Performance (GFLOPS)           40      515
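These figures come from the usual back-of-the-envelope peak-throughput
estimate (ignoring SIMD width & memory bandwidth):

Peak GFLOPS = cores × FP operations per core per cycle × clock (GHz)
CPU: 4 × 4 × 2.5 = 40 GFLOPS
GPU: 448 × 1 × 1.15 ≈ 515 GFLOPS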
CPU-GPU Performance Gap
Source - http://michaelgalloy.com
Simple Performance Test
 Matrix multiplication
 C = B*A
 CPU
 Intel Core i7
 cblas_dgemm(…) from BLAS library (see the sketch below)
 GPU
 Nvidia GTX 480
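A minimal sketch of the CPU side, assuming a CBLAS implementation
(e.g., OpenBLAS or Intel MKL) is linked; the matrix sizes are illustrative:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cblas.h>

int main(void)
{
    int m = 400, k = 400, n = 800;   /* C (m x n) = B (m x k) * A (k x n) */
    double *A = malloc((size_t)k * n * sizeof(double));
    double *B = malloc((size_t)m * k * sizeof(double));
    double *C = malloc((size_t)m * n * sizeof(double));
    for (int i = 0; i < k * n; i++) A[i] = 1.0;
    for (int i = 0; i < m * k; i++) B[i] = 2.0;

    clock_t start = clock();
    /* Row-major, no transposes: C = 1.0 * B * A + 0.0 * C */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0, B, k, A, n, 0.0, C, n);
    printf("CPU time: %.4f s\n", (double)(clock() - start) / CLOCKS_PER_SEC);

    free(A); free(B); free(C);
    return 0;
}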
Results

Dimensions (A, B, C)                      CPU time (s)  GPU time (s)  Speedup
[400,800], [400,400], [400,800]           0.17          0.00109       155
[800,1600], [800,800], [800,1600]         2.10          0.00846       258
[1200,2400], [1200,1200], [2400,2400]     6.65          0.02860       232
[1600,2400], [1600,1600], [2400,2400]     15.18         0.06739       225
[2000,4000], [4000,4000], [4000,4000]     29.44         0.13178       223
[2400,4800], [2400,4800], [4800,4800]     50.21         0.22703       221
Applications of GPGPU
 Computational Structural Mechanics
 Bio-Informatics and Life Sciences
 Computational Electromagnetics & Electrodynamics
 Computational Finance
 Computational Fluid Dynamics
 Data Mining, Analytics, & Databases
 Imaging & Computer Vision
 Medical Imaging
 Molecular Dynamics
 Numerical Analytics
 Weather, Atmospheric, Ocean Modeling & Space Sciences
Programming GPUs
 CUDA language for Nvidia GPU products
 Compute Unified Device Architecture
 Based on C
 nvcc compiler
 Lots of tools for analysis, debugging, profiling, …
 OpenCL – Open Computing Language
 Based on C
 Supports GPU & CPU programming
 Support for Java, Python, MATLAB, etc.
 Lots of active research
 e.g., automatic code generation for GPUs
Multithreaded SIMD Processor
Caution!
 GPU designed as a numeric computing engine
 Will not perform as well as CPUs on some tasks
 Most applications will use both CPUs & GPUs
 For some computations, the cost of transferring data
between CPU & GPU can be high
 SIMD-type data parallelism is key to benefiting from GPUs
 … and there must be enough of it (out of the total computation)
CUDA Architecture
 CUDA is NVIDIA's solution for accessing the GPU
 Can be seen as an extension to C/C++
CUDA Software Stack
CUDA Architecture (Cont.)
2 main parts
1. Host (CPU part)
• Single Program, Single Data
• Launches kernel on the GPU
2. Device (GPU part)
• Single Program, Multiple Data
• Runs the kernel
A function executed on the GPU (device) is called a "kernel"
CUDA Architecture (Cont.)
Grid Architecture

Grid
• A group of threads all running the same kernel
• Multiple grids can run at once

Block
• Grids are composed of blocks
• Each block is a logical unit containing a number of coordinating
threads & some amount of shared memory
Example Program

#include <cuda.h>
#include <stdio.h>

__global__ void kernel(void)
{ }

int main(void)
{
    kernel<<<1, 1>>>();    // launch: 1 block of 1 thread
    printf("Hello World!\n");
    return 0;
}

 "__global__" says the function is to be compiled to run on
the "device" (GPU), not the "host" (CPU)
 Angle brackets "<<<" & ">>>" pass launch parameters/arguments
to the runtime
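To build & run (assuming the CUDA toolkit is installed & the source is
saved as hello.cu, an illustrative file name):

nvcc hello.cu -o hello
./hello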
Thread Blocks
 Within host (CPU) code, call the kernel using <<< & >>>,
specifying the grid size (number of blocks) & block size
(number of threads per block)
Grids, Blocks & Threads
 Grid of size 6 (3x2 blocks)
 Each block has 12 threads (4x3)
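This layout can be requested with dim3 launch parameters. A minimal
sketch (the kernel name myKernel is hypothetical):

dim3 grid(3, 2);             // 3x2 = 6 blocks per grid
dim3 block(4, 3);            // 4x3 = 12 threads per block
myKernel<<<grid, block>>>(); // 6 blocks x 12 threads = 72 threads in total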
Thread IDs
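Each thread derives a unique ID from the CUDA built-in variables; the
standard 1D indexing pattern is:

__global__ void kernel(void)
{
    // Global thread ID = offset of this block + position within the block
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
}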
CUDA Device Memory Model
 Host & devices have separate memory spaces
 e.g., hardware cards with their own DRAM
 To execute a kernel on a device
 Need to allocate memory on the device
 Transfer data
 Host memory → device memory
 After device execution
 Transfer results
 Device memory → host memory
 Free device memory that is no longer needed
CUDA API – Memory Mgt.
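The core memory-management calls of the CUDA runtime API are declared
as follows:

cudaError_t cudaMalloc(void **devPtr, size_t size);         // allocate device memory
cudaError_t cudaMemcpy(void *dst, const void *src,
                       size_t count, cudaMemcpyKind kind);  // copy, e.g., cudaMemcpyHostToDevice
cudaError_t cudaFree(void *devPtr);                         // release device memory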
Memory Access

#include <stdio.h>

// Kernel used by this snippet (not shown on the original slide):
// adds two integers on the device
__global__ void add(int a, int b, int *c)
{
    *c = a + b;
}

int main(void)
{
    int c, *dev_c;
    cudaMalloc((void **) &dev_c, sizeof(int));  // allocate device memory
    add<<<1, 1>>>(2, 7, dev_c);                 // launch: 1 block of 1 thread
    cudaMemcpy(&c, dev_c, sizeof(int),
               cudaMemcpyDeviceToHost);         // copy result back to host
    printf("2 + 7 = %d\n", c);
    cudaFree(dev_c);                            // free device memory
    return 0;
}
Example

#include <stdio.h>
#define N 10

__global__ void add(int *a, int *b, int *c)
{
    int tID = blockIdx.x;   // one thread per block, so block index = element index
    if (tID < N)
    {
        c[tID] = a[tID] + b[tID];
    }
}

int main()
{
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    // Allocate device memory for the three vectors
    cudaMalloc((void **) &dev_a, N*sizeof(int));
    cudaMalloc((void **) &dev_b, N*sizeof(int));
    cudaMalloc((void **) &dev_c, N*sizeof(int));

    for (int i = 0; i < N; i++)
        a[i] = i, b[i] = 1;

    // Copy inputs: host memory -> device memory
    cudaMemcpy(dev_a, a, N*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N*sizeof(int), cudaMemcpyHostToDevice);

    // Launch N blocks of 1 thread each
    add<<<N, 1>>>(dev_a, dev_b, dev_c);

    // Copy result: device memory -> host memory
    cudaMemcpy(c, dev_c, N*sizeof(int), cudaMemcpyDeviceToHost);

    for (int i = 0; i < N; i++)
        printf("%d + %d = %d\n", a[i], b[i], c[i]);

    // Free device memory no longer needed
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return 0;
}
Intel Xeon Phi – Many CPUs

Source - www.pcgameshardware.de/Xeon-Phi-Hardware-256199/News/Intel-Xeon-Phi-Hardware-Informationen-1040924/
Intel Xeon Phi Family
Intel Xeon Phi (Cont.)
Source - www.altera.com/technology/system-design/articles/2012/multicore-many-core.html
Intel Xeon Phi (Cont.)
 More (& simpler) cores, many threads, & wider vector units
 Same programming model across host & device
 Linux on device
 Remote login
 Access to network file systems
 High compute density & energy efficiency
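Because the device runs Linux & shares the host programming model, work
can be offloaded with familiar directives. A minimal sketch, assuming the
Intel C/C++ compiler's offload pragmas for first-generation (Knights
Corner) coprocessors; the arrays x, y & length n are illustrative:

// Offload a parallel loop to the coprocessor, copying x in & y in/out
#pragma offload target(mic) in(x:length(n)) inout(y:length(n))
#pragma omp parallel for
for (int i = 0; i < n; i++)
    y[i] = 2.0 * x[i] + y[i];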