Introduction to
CUDA Programming
Hemant Shukla
hshukla@lbl.gov
Trends

Scientific Data Deluge
LSST     0.5 PB/month
JGI      5 TB/yr *
LOFAR    500 GB/s
SKA      100 x LOFAR
* Jeff Broughton (NERSC) and JGI

Energy Efficiency
Exascale will need a 1000x performance enhancement with only 10x more energy consumption (Flops/watt)

Traditional sources of performance are flat-lining
Figure courtesy of Kunle Olukotun, Lance Hammond, Herb Sutter, and Burton Smith
Developments
Industry
Emergence of more cores on single chips
The number of cores per chip doubles every two years
Systems with millions of concurrent threads
Systems with inter and intra-chip parallelism
	
  
Architectural designs driven by reduction in Energy Consumption
New Parallel Programming models, languages, frameworks, …
Academia
Graphics Processing Units (GPUs) are adopted as co-processors for high performance computing
Architectural Differences
[Figure: CPU vs. GPU block diagrams - ALU, Cache, Control Logic, DRAM]

CPU                                     GPU
Less than 20 cores                      512 cores
1-2 threads per core                    10s to 100s of threads per core
Latency is hidden by a large cache      Latency is hidden by fast context switching
GPUs don’t run without CPUs
CPUs vs. GPUs
Silly debate… It’s all about Cores
Next phase of HPC has been touted as “Disruptive”
Future HPC is massively parallel and likely on hybrid architectures
Programming models may not resemble the current state
Embrace change and brace for impact
Write modular, adaptable and easily mutative applications
Build auto-code generators, auto-tuning tools, frameworks, libraries
Use this opportunity to learn how to efficiently program massively parallel
systems
Applications

X-ray computed tomography (Alain Bonissent et al.)
Total volume 560 x 560 x 960 pixels, 360 projections
Speed up = 110x

N-body with SCDM (K. Nitadori et al.)
4.5 giga-particles, R = 630 Mpc
2000x more volume than Kawai et al.

EoR with diesel powered radio interferometry (Lincoln Greenhill et al.)
512 antennas, correlated visibilities for 130,000 baseline pairs, each with 768
channels and 4 polarizations ~ 20 Tflops. Power budget 20 kW.

INTEL Core2 Quad 2.66GHz = 1121 ms
NVIDIA GPU C1060         = 103.4 ms
  
GPU
GPU H/W Example
NVIDIA FERMI

[Figure: Fermi chip layout - Streaming Multiprocessor (SM) with L1 cache / shared memory, and L2 cache]

16 Streaming Multiprocessors (SM)
512 CUDA cores (32/SM)
IEEE 754-2008 floating point (DP and SP)
6 GB GDDR5 DRAM (Global Memory)
ECC Memory support
Two DMA interfaces
L2 Cache 768 KB
Reconfigurable L1 Cache and Shared Memory (48 KB / 16 KB)
Load/Store address width 64 bits; can calculate addresses for 16 threads per clock
Programming Models
CUDA (Compute Unified Device Architecture)
OpenACC
OpenCL
Microsoft's DirectCompute
Third party wrappers are also available for Python, Perl, Fortran,
Java, Ruby, Lua, MATLAB and IDL, and Mathematica
Compilers from PGI, RCC, HMPP, Copperhead
CUDA
CUDA Device Driver
CUDA Toolkit (compiler, debugger, profiler, lib)
CUDA SDK (examples)
Windows, Mac OS, Linux

Parallel Computing Architecture
[Figure: CUDA software stack - Application (C/C++, FORTRAN, Java, Python, OpenCL, DX Compute)
-> CUDA Runtime and Device Driver -> nvcc C/C++ Compiler -> NVIDIA Assembly (GPU code) and
Host Assembly (CPU code) + Libraries -> NVIDIA CUDA Compatible GPU]
Libraries – FFT, Sparse Matrix, BLAS, RNG, CUSP, Thrust…
Dataflow
[Figure: Host (CPU) with host memory and Device (GPU) with device memory, connected by the PCIe bus]

1. Data is copied from the host memory to the device memory via the PCIe bus
2. The host launches a kernel on the device
3. The kernel is executed by multiple threads concurrently
4. The data within the device is accessed by threads through the memory hierarchy
5. The results are moved back to the device memory and are transferred back to the host via the PCIe bus
S/W Abstraction
Kernel is executed by threads, each processed by a CUDA core

Threads
Blocks
Grids

512-1024 threads per block
Maximum 8 blocks per SM
32 parallel threads are executed at the same time in a warp
One grid per kernel, with multiple concurrent kernels

[Figure: threads grouped into blocks, blocks scheduled onto an SM]
Memory Hierarchy
[Figure: local memory per thread, shared memory per block, and global/constant memory visible to all grids]

Private memory
    Visible only to the thread

Shared memory
    Visible to all the threads in a block

Global memory
    Visible to all the threads
    Visible to host
    Accessible to multiple kernels
    Data is stored in row major order

Registers

Constant memory (Read Only)
    Visible to all the threads in a block
CUDA API Examples
Which GPU do I have?
#include <stdio.h>

int main()
{
  int noOfDevices;

  /* get the number of devices */
  cudaGetDeviceCount (&noOfDevices);

  cudaDeviceProp prop;
  for (int i = 0; i < noOfDevices; i++)
  {
    /* get device properties */
    cudaGetDeviceProperties (&prop, i);
    printf ("Device Name:\t %s\n",          prop.name);
    printf ("Total global memory:\t %ld\n", prop.totalGlobalMem);
    printf ("No. of SMs:\t %d\n",           prop.multiProcessorCount);
    printf ("Shared memory / SM:\t %ld\n",  prop.sharedMemPerBlock);
    printf ("Registers / SM:\t %d\n",       prop.regsPerBlock);
  }
  return 1;
}

Use
cudaGetDeviceCount
cudaGetDeviceProperties
For more properties see struct cudaDeviceProp
For details see the CUDA Reference Manual

Compilation
> nvcc whatDevice.cu -o whatDevice

Output
Device Name:          Tesla C2050
Total global memory:  2817720320
No. of SMs:           14
Shared memory / SM:   49152
Registers / SM:       32768
Timing with CUDA Event API
int main ()
{
  cudaEvent_t start, stop;
  float time;

  cudaEventCreate (&start);
  cudaEventCreate (&stop);

  cudaEventRecord (start, 0);
  someKernel <<<grids, blocks, 0, 0>>> (...);
  cudaEventRecord (stop, 0);
  cudaEventSynchronize (stop);    // ensures kernel execution has completed

  cudaEventElapsedTime (&time, start, stop);

  cudaEventDestroy (start);
  cudaEventDestroy (stop);

  printf ("Elapsed time %f sec\n", time * 0.001);
  return 1;
}

CUDA Event API timers are,
- OS independent
- High resolution
- Useful for timing asynchronous calls

Standard CPU timers will not measure the timing information of the device.
Memory Allocations / Copies
int main ()
{
  ...
  float host_signal[N], host_result[N];
  float *device_signal, *device_result;

  // allocate memory on the device (GPU)
  cudaMalloc ((void**) &device_signal, N * sizeof(float));
  cudaMalloc ((void**) &device_result, N * sizeof(float));

  ... // get data for the host_signal array

  // copy the host_signal array to the device
  cudaMemcpy (device_signal, host_signal, N * sizeof(float), cudaMemcpyHostToDevice);

  someKernel <<<grid, block>>> (...);

  // copy the result back from the device to the host
  cudaMemcpy (host_result, device_result, N * sizeof(float), cudaMemcpyDeviceToHost);

  // display the results
  ...

  cudaFree (device_signal); cudaFree (device_result);
}

Host and device have separate physical memory
Cannot dereference host pointers on the device and vice versa
cudaError_t cudaMemcpyAsync (void *dst, const void *src, size_t count,
                             enum cudaMemcpyKind kind, cudaStream_t stream)

cudaMemcpyAsync() is asynchronous with respect to the host. The call may return before the copy
is complete. It only works on page-locked host memory and returns an error if a pointer to pageable
memory is passed as input.
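As a hedged sketch of how this might look in practice (the buffer names, the size N, and the single
stream are illustrative assumptions, not part of the slide's example), page-locked memory is allocated
with cudaMallocHost and the copy is issued on a stream:

// Hypothetical sketch: pinned host buffer + asynchronous copy on a stream
float *h_pinned, *d_buf;
cudaStream_t stream;

cudaStreamCreate (&stream);
cudaMallocHost ((void**) &h_pinned, N * sizeof(float));   // page-locked host memory
cudaMalloc     ((void**) &d_buf,    N * sizeof(float));

// returns immediately; the copy proceeds in the background on 'stream'
cudaMemcpyAsync (d_buf, h_pinned, N * sizeof(float), cudaMemcpyHostToDevice, stream);
someKernel <<<grid, block, 0, stream>>> (d_buf);

cudaStreamSynchronize (stream);   // wait for the copy and the kernel to finish

cudaFreeHost (h_pinned);
cudaFree (d_buf);
cudaStreamDestroy (stream);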
Basic Memory Methods
cudaError_t cudaMalloc (void **devPtr, size_t size)

Allocates size bytes of linear memory on the device and returns in *devPtr a pointer to the
allocated memory. In case of failure cudaMalloc() returns cudaErrorMemoryAllocation.

cudaError_t cudaMemcpy (void *dst, const void *src, size_t count, enum cudaMemcpyKind kind)

Copies count bytes from the memory area pointed to by src to the memory area pointed to by
dst. The argument kind is one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice,
cudaMemcpyDeviceToHost, or cudaMemcpyDeviceToDevice, and specifies the direction of the
copy. cudaMemcpy is a blocking call; cudaMemcpyAsync (above) is non-blocking.
See also, cudaMemset, cudaFree, ...
Kernel
The CUDA kernel is,
Run on device
Defined by __global__ qualifier and does not return anything
__global__ void someKernel ();
Executed asynchronously by the host with <<< >>> qualifier, for example,
someKernel <<<nGrid, nBlocks, sharedMemory, streams>>> (...)
someKernel <<<nGrid, nBlocks>>> (...)
The kernel launches a 1- or 2-D grid of 1-, 2- or 3-D blocks of threads
Each thread executes the same kernel in parallel (SIMT)
Threads within blocks can communicate via shared memory
Threads within blocks can be synchronized
Grids and blocks are of type struct dim3
Built-in variables gridDim, blockDim, threadIdx, blockIdx are used to
traverse across the device memory space with multi-dimensional indexing
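As an illustrative sketch (the kernel, the array, and the sizes below are assumptions, not from the
slides), a 2-D grid of 2-D blocks could be declared with dim3 and indexed with the built-in variables:

// Hypothetical 2-D launch: one thread per element of an NX x NY array
__global__ void scale2D (float *data, int nx, int ny, float factor)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < nx && y < ny)                       // guard threads past the array edge
        data[y * nx + x] *= factor;             // row-major indexing
}

dim3 block (16, 16);                            // 256 threads per block
dim3 grid ((NX + block.x - 1) / block.x,        // round up to cover all elements
           (NY + block.y - 1) / block.y);
scale2D <<<grid, block>>> (d_data, NX, NY, 2.0f);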
Grids, Blocks and Threads
<<< number of blocks in a grid, number of threads per block >>>

someKernel<<< 1, 1 >>> ();
  gridDim.x = 1
  blockDim.x = 1
  blockIdx.x = 0
  threadIdx.x = 0

dim3 blocks (2,1,1);
someKernel<<< blocks, 4 >>> ();
  gridDim.x = 2
  blockDim.x = 4
  blockIdx.x = 0,1
  threadIdx.x = 0,1,2,3,0,1,2,3

[Figure: a grid of two blocks - block (0,0) and block (1,0) - each with four threads]

Useful for multidimensional indexing and creating unique thread IDs
int index = threadIdx.x + blockDim.x * blockIdx.x;
Thread Indices
Array traversal
blockDim.x = 4
blockIdx.x = 0
threadIdx.x = 0, 1, 2, 3
Index = 0, 1, 2, 3
blockDim.x = 4
blockIdx.x = 1
threadIdx.x = 0, 1, 2, 3
Index = 4, 5, 6, 7
int index = threadIdx.x + blockDim.x * blockIdx.x;
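A minimal sketch of how this index is typically used to traverse an array (the kernel name and the
bounds check are illustrative assumptions, not from the slides):

// Hypothetical example: each thread handles one array element
__global__ void addOne (int *data, int n)
{
    int index = threadIdx.x + blockDim.x * blockIdx.x;
    if (index < n)            // blocks may overshoot the array length
        data[index] += 1;
}

// e.g. 8 elements with 4 threads per block -> 2 blocks
addOne <<<2, 4>>> (d_data, 8);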
Example - Inner Product
Matrix-multiplication

[Figure: A x B = C, each matrix N by N]

Each element of the product matrix C is generated by row-column multiplication and
reduction of matrices A and B. This operation is similar to the inner product of the
vector multiplication kind, also known as the vector dot product.

For N x N matrices the matrix-multiplication C = A x B is equivalent to
N² independent (hence parallel) inner products.
Example
c = Σi ai bi

[Figure: elementwise products ai x bi summed into the scalar c]

Serial representation

double c = 0.0;
for (int i = 0; i < SIZE; i++)
  c += a[i] * b[i];

Simple parallelization strategy
Multiplications are done in parallel
Summation is sequential
Example
CUDA Kernel

__global__ void innerProduct (int *a, int *b, int *c)
{
  int product[SIZE];
  int i = threadIdx.x;

  if (i < SIZE)
    product[i] = a[i] * b[i];
}

Called in the host code

__global__ void innerProduct (...)
{
  ...
}

int main ()
{
  ...
  innerProduct<<<grid, block>>> (...);
  ...
}
Example
__global__ void innerProduct (int *a, int *b, int *c)
{
  int product[SIZE];
  int i = threadIdx.x;

  if (i < SIZE)
    product[i] = a[i] * b[i];
}

The __global__ qualifier marks device-specific code that runs on the
device and is called by the host.

Other qualifiers are __device__, __host__, and the combination
__host__ __device__.

threadIdx is a built-in thread index. It has 3 components: x, y and z.

Each thread with a unique threadIdx.x runs the kernel code in parallel.
Example
Now we can sum all the products to get the scalar c

__global__ void innerProduct (int *a, int *b, int *c)
{
  int product[SIZE];
  int i = threadIdx.x;

  if (i < SIZE)
    product[i] = a[i] * b[i];

  int sum = 0;
  for (int k = 0; k < SIZE; k++)
    sum += product[k];
  *c = sum;
}

Unfortunately this won't work, for the following reasons:
- product[i] is local to each thread
- Threads are not visible to each other
Example
__global__ void innerProduct (int *a, int *b, int *c)
{
  __shared__ int product[SIZE];
  int i = threadIdx.x;

  if (i < SIZE)
    product[i] = a[i] * b[i];

  __syncthreads();

  if (threadIdx.x == 0)
  {
    int sum = 0;
    for (int k = 0; k < SIZE; k++)
      sum += product[k];
    *c = sum;
  }
}

First we make product[i] visible to all the threads by placing it in shared memory.

Next we make sure that all the threads are synchronized; in other words, each thread has
finished its workload before we move ahead. We do this by calling __syncthreads().

Finally we assign the summation to one thread (extremely inefficient reduction).

Aside: cudaThreadSynchronize() is used on the host side to synchronize host and device.
Example
__global__ void innerProduct (int *a, int *b, int *c)
{
  __shared__ int product[SIZE];
  int i = threadIdx.x;

  if (i < SIZE)
    product[i] = a[i] * b[i];

  __syncthreads();

  // Efficient reduction call
  *c = someEfficientLibrary_reduce (product);
}
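As a hedged alternative sketch (assuming Thrust, which ships with the CUDA Toolkit; the vector
names are illustrative), the whole dot product can also be handed to a library on the host side:

#include <thrust/device_vector.h>
#include <thrust/inner_product.h>

// Hypothetical host-side version: Thrust performs the multiply and the
// parallel reduction on the device internally.
thrust::device_vector<int> a (SIZE), b (SIZE);
// ... fill a and b ...
int c = thrust::inner_product (a.begin(), a.end(), b.begin(), 0);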
Performance Considerations
Memory Bandwidth
Memory bandwidth – the rate at which data is transferred – is a valuable
metric to gauge the performance of an application

Theoretical Bandwidth
Memory bandwidth (GB/s) = memory clock rate (Hz) × interface width (bytes) / 10^9

Real Bandwidth (Effective Bandwidth)
Bandwidth (GB/s) = [(bytes read + bytes written) / 10^9] / execution time

May also use profilers to estimate bandwidth and bottlenecks
If the real bandwidth is much lower than the theoretical one, the code may need review
Optimize on Real Bandwidth
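As a small illustrative sketch (not from the slides), the theoretical number can be computed from the
device properties queried earlier; the memoryClockRate and memoryBusWidth fields of cudaDeviceProp
are assumed to be available in your CUDA version, and the factor of 2 assumes double-data-rate memory:

cudaDeviceProp prop;
cudaGetDeviceProperties (&prop, 0);

// memoryClockRate is in kHz, memoryBusWidth is in bits
double theoreticalGBs = 2.0 * prop.memoryClockRate * 1e3
                        * (prop.memoryBusWidth / 8.0) / 1e9;
printf ("Theoretical bandwidth: %.1f GB/s\n", theoreticalGBs);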
Arithmetic Intensity
Memory access bandwidth of GPUs is limited compared to the peak compute throughput

High arithmetic intensity (arithmetic operations per memory access) algorithms
perform well on such architectures

Example
Fermi peak throughput for SP is 1 TFLOP/s and for DP is 0.5 TFLOP/s
Global memory (off-chip) bandwidth is 144 GB/s
At 4 bytes per single precision operand this delivers about 36 billion SP operands/s
(18 billion DP operands/s)
To obtain peak throughput therefore requires 1000/36 ~ 28 SP (14 DP) arithmetic
operations per operand loaded from global memory
Example revisited
__global__ void innerProduct (int *a, int *b, int *c)
{
  __shared__ int product[SIZE];
  int i = threadIdx.x;

  if (i < SIZE)
    product[i] = a[i] * b[i];

  __syncthreads();

  if (threadIdx.x == 0)
  {
    int sum = 0;
    for (int k = 0; k < SIZE; k++)
      sum += product[k];
    *c = sum;
  }
}

Contrast this with the inner product example, where for every 2 memory accesses
(the data ai and bi) only two operations (a multiply and an add) are performed.
That is a ratio of 1, as opposed to the ~28 required for peak throughput.

Room for algorithm improvement!

Aside: Not all performance will be peak performance
Optimization Strategies
Coalesce memory accesses (and use faster memories like shared memory)
Minimize data transfer over PCIe (~5 GB/s)
Overlap data transfers and computations with asynchronous calls
Use fast page-locked memory (pinned memory – host memory guaranteed to the device)
Threads in a block should be a multiple of 32 (the warp size); experiment with your device
Smaller thread blocks are better than large, many-thread blocks when resources are limited
Use fast libraries (cuBLAS, Thrust, CUSP, cuFFT, ...)
Use built-in (intrinsic) arithmetic instructions judiciously
Atomic Functions
Used to avoid race conditions resulting from thread synchronization and coordination
issues: multiple threads accessing the same address for read/write simultaneously.

Applicable to both shared memory and global memory.

Atomic methods in CUDA guarantee that the address is updated without interruption;
they are implemented using locks and serialization.

Atomic functions run faster on shared memory than on global memory.

Atomic functions should also be used judiciously, as they serialize the code; overuse
results in performance degradation.

Examples: atomicAdd, atomicMax, atomicXor, ...
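A minimal hedged sketch (the kernel name is illustrative, not from the slides): atomicAdd could
replace the single-thread summation used in the inner product example, at the cost of serializing
the concurrent updates:

// Hypothetical sum using atomics: every thread adds its own product directly
// into the single result location; the hardware serializes the updates.
__global__ void innerProductAtomic (int *a, int *b, int *c)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < SIZE)
        atomicAdd (c, a[i] * b[i]);   // *c must be zeroed before the launch
}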
CUDA Streams
A stream is a sequence of device operations executed in order

Example: Stream 1
Do memCopy -> Start timer -> Launch kernel -> Stop timer

cudaStream_t stream0, stream1;
cudaStreamCreate (&stream0);
cudaStreamCreate (&stream1);

cudaMemcpyAsync (..., stream0); someKernel<<<..., stream0>>>();
cudaMemcpyAsync (..., stream1); someKernel<<<..., stream1>>>();

cudaStreamSynchronize (stream0);

[Figure: N streams performing 3 tasks each - download Down(i), kernel Ker(i), upload Up(i) -
overlapped in time across streams]
Benchmarks

[Figure: relative performance of algorithms - Gflop/s vs. arithmetic intensity]
Courtesy - Sam Williams
References
CUDA
http://developer.nvidia.com/category/zone/cuda-zone
OpenCL
http://www.khronos.org/opencl/
GPGPU
http://www.gpucomputing.net/
Advanced topics from Jan 2011 ICCS Summer School
http://iccs.lbl.gov/workshops/tutorials.html
Conclusion
If you have parallel code you may benefit from GPUs
In some cases algorithms written on sequential machines may not migrate
efficiently and require reexamination and rewrite
If you have short-term goal(s) it may be worthwhile looking into CUDA etc
CUDA may provide better performance than OpenCL (it depends)
Most efficient codes optimally use the entire system and not just parts
Heterogeneous computing and parallel programming are here to stay
The number-two HPC machine in the world, the 2-PetaFlop/s Tianhe-1 in China, is a
heterogeneous cluster with 7k+ NVIDIA GPUs and 14k Intel CPUs
Algorithms
Lessons from ICCS Tutorials by Wen-Mei Hwu
Think Parallel
Promote fine grain parallelism
Consider minimal data movement
Exploit parallel memory access patterns
Data layout
Data Blocking/Tiling
Load Balance
Amdhal’s Argument
41	
  
Introduc+on	
  to	
  CUDA	
  Programming	
  -­‐	
  Hemant	
  Shukla	
  
Sequen+al	
  
Code	
  
Parallel	
  Code	
  
Sequen+al	
  
Code	
  
Sequen+al	
  
Code	
  
Sequen+al	
  
Code	
  
!me	
  t1	
  
!me	
  t2	
  
Code cannot run faster than time t2
If	
  X	
  is	
  the	
  serialized	
  part	
  of	
  the	
  code	
  then	
  speedup	
  cannot	
  be	
  greater	
  than	
  1/1-­‐X	
  	
  
no	
  maTer	
  how	
  many	
  cores	
  are	
  added.	
  
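Written out (a standard form of Amdahl's law consistent with the slide, with N the number of cores
and X the parallelizable fraction):

\[
S(N) = \frac{1}{(1 - X) + X/N}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - X}
\]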
Blocking
Also known as Tiling.

The basic idea is to move blocks/tiles of commonly used data from global memory into
shared memory or registers.

[Figure: Global Memory -> Shared Memory per Block (shared memory tiling: data blocks for
threads to share) -> Registers (register tiling: reuse computed results)]
Blocking / Tiling Technique
Focused access pattern
Identify the block/tile of global memory data to be accessed by threads
Load the data into the fast memory (shared memory, registers)
Have multiple threads use the data
Ensure barrier synchronization
Repeat (move to the next block, next iteration, etc.)
Make the most of one load of data into fast memory
Variables on Memory
CUDA Variable Type Qualifiers
__device__ __shared__ int SharedVar;
__device__ int GlobalVar;
__device__ __constant__ int ConstantVar;
Kernel variables without any qualifiers reside in a register, with an
exception for arrays, which reside in local memory
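A hedged sketch of how these qualifiers might be used together (the variable names, the kernel,
and the host-side copy are illustrative assumptions, not from the slides):

// File-scope device variables
__constant__ float coeff[4];        // read-only table in constant memory
__device__   int   globalCounter;   // variable in global memory

__global__ void applyCoeff (float *data, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;   // i lives in a register
    if (i < n)
        data[i] *= coeff[i % 4];                     // every thread reads the constant table
}

// Host side: constant memory is filled with cudaMemcpyToSymbol
float h_coeff[4] = {1.0f, 0.5f, 0.25f, 0.125f};
cudaMemcpyToSymbol (coeff, h_coeff, sizeof(h_coeff));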
Matrix Multiplication
Example

[Figure: A x B = C; element C(i,j) is the dot product of row i of A and column j of B,
summed over k; all matrices are WIDTH x WIDTH]
Matrix Multiplication...
CPU Version

void matrixMultiplication (float* A, float* B, float* C, int WIDTH)
{
  for (int i = 0; i < WIDTH; i++)
    for (int j = 0; j < WIDTH; j++)
    {
      float sum = 0;
      for (int k = 0; k < WIDTH; k++)
      {
        float a = A[i * WIDTH + k];
        float b = B[k * WIDTH + j];
        sum += a * b;
      }
      C[i * WIDTH + j] = sum;
    }
}
Matrix Multiplication...
GPU Version (Memory locations)

__global__ void matrixMultiplication (float* A, float* B, float* C, int WIDTH)
{
  int i = blockIdx.y * blockDim.y + threadIdx.y;
  int j = blockIdx.x * blockDim.x + threadIdx.x;

  // each thread computes one element of the product matrix C
  float sum = 0;
  for (int k = 0; k < WIDTH; k++)
    sum += A[i * WIDTH + k] * B[k * WIDTH + j];

  C[i * WIDTH + j] = sum;
}

(On the slide the kernel variables are color-coded by where they live: constant memory,
shared memory, global memory reads, and global memory writes.)
Matrix Multiplication...
Kernel analysis

Each multiply-add needs 2 floating point reads, 2 x 4 bytes = 8 bytes, and performs
2 floating point operations (a multiply and an add). Hence the ratio is
8 bytes / 2 FLOP = 4 bytes per FLOP.

Theoretical peak of Fermi is ~530 GFLOP/s
To achieve peak would require a bandwidth of 4 x 530 = 2120 GB/s
The actual bandwidth is 177 GB/s
With this bandwidth the kernel yields 177/4 = 44.25 GFLOP/s
About 12 times below peak performance
In practice it will be slower
Matrix Multiplication...
How to speed up?  BLOCKING

Load data into shared memory and reuse it.

Since the shared memory is small, it helps to partition the data into equal-sized blocks
that fit into the shared memory and reuse them.
Matrix Multiplication...
Block/Tile

Partial rows and columns are loaded into shared memory.
One row is reused to calculate two elements.
For a 16 x 16 tile width the global memory loads are reduced by a factor of 16.
Multiple blocks are executed in parallel.
Matrix Multiplication...
(Rows: threads; columns: time / tile phase)

Thread   Tile 1                                   Tile 2
T0,0     A0,0 -> A_S0,0,  B0,0 -> B_S0,0          A2,0 -> A_S0,0,  B0,2 -> B_S0,0
         C0,0 = A_S0,0*B_S0,0 + A_S1,0*B_S0,1     C0,0 = A_S0,0*B_S0,0 + A_S1,0*B_S0,1
T1,0     A0,0 -> A_S1,0,  B0,0 -> B_S1,0          A3,0 -> A_S1,0,  B1,2 -> B_S1,0
         C1,0 = A_S0,0*B_S1,0 + A_S1,0*B_S1,1     C1,0 = A_S0,0*B_S1,0 + A_S1,0*B_S1,1
T0,1     A0,1 -> A_S0,1,  B0,1 -> B_S0,1          A2,1 -> A_S0,1,  B0,3 -> B_S0,1
         C0,1 = A_S0,1*B_S0,0 + A_S1,1*B_S0,1     C0,1 = A_S0,1*B_S0,0 + A_S1,1*B_S0,1
T1,1     A1,1 -> A_S1,1,  B1,1 -> B_S1,1          A3,1 -> A_S1,1,  B1,3 -> B_S1,1
         C1,1 = A_S0,1*B_S1,0 + A_S1,1*B_S1,1     C1,1 = A_S0,1*B_S1,0 + A_S1,1*B_S1,1
Matrix Multiplication...
#define TILE_WIDTH 16   // tile size must be a compile-time constant to size the shared arrays

__global__ void matrixMultiplication (float* A, float* B, float* C, int WIDTH)
{
  __shared__ float A_S[TILE_WIDTH][TILE_WIDTH];
  __shared__ float B_S[TILE_WIDTH][TILE_WIDTH];

  int bx = blockIdx.x;  int by = blockIdx.y;
  int tx = threadIdx.x; int ty = threadIdx.y;

  // row and column of the C element to calculate
  int Row = by * TILE_WIDTH + ty;
  int Col = bx * TILE_WIDTH + tx;

  float sum = 0;

  // Loop over the A and B tiles required to compute the C element
  for (int m = 0; m < WIDTH / TILE_WIDTH; ++m) {

    // Collectively load the A and B tiles from global memory into shared memory
    A_S[ty][tx] = A[Row * WIDTH + (m * TILE_WIDTH + tx)];
    B_S[ty][tx] = B[(m * TILE_WIDTH + ty) * WIDTH + Col];
    __syncthreads();

    for (int k = 0; k < TILE_WIDTH; ++k)
      sum += A_S[ty][k] * B_S[k][tx];
    __syncthreads();
  }

  C[Row * WIDTH + Col] = sum;
}
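A usage sketch for the tiled kernel (the grid and block shapes follow from TILE_WIDTH above;
WIDTH is assumed to be a multiple of TILE_WIDTH, and d_A, d_B, d_C are device pointers
allocated as in the earlier memory-allocation example):

dim3 block (TILE_WIDTH, TILE_WIDTH);                 // one thread per C element in a tile
dim3 grid  (WIDTH / TILE_WIDTH, WIDTH / TILE_WIDTH);
matrixMultiplication <<<grid, block>>> (d_A, d_B, d_C, WIDTH);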
7-Point Stencil
Used for PDEs, Convolution etc.
7-Point Stencil …
Conceptually all points can be updated in parallel.
Each update performs a global sweep of the entire data set.
Memory bound.
The challenge is to parallelize without overusing memory bandwidth.
7-Point Stencil …
March along one axis, calculating values as you go.
Traversing the axis, 3 of the 7 input values lie along that axis.
Keep those three values in registers for the next iteration - this is called Register Tiling.
For the 7-point stencil, 2 of the inputs are then already in registers, so only 5 accesses are needed.
A combination of register and block tiling should give a 7x speed up.
In reality 4-5x, because halos have to be considered.
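A hedged sketch of the register-tiling part (names, sizes, and the simple coefficients are illustrative
assumptions, not the slide's code). Each thread owns one (x, y) column and marches along z, keeping the
three column values it needs in registers, so each step needs only 5 new global loads: 4 lateral
neighbours plus 1 new value ahead in z.

#define IDX(x, y, z) ((z) * NY * NX + (y) * NX + (x))

__global__ void stencil7 (const float *in, float *out, int NX, int NY, int NZ)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || x >= NX - 1 || y < 1 || y >= NY - 1) return;   // skip boundary columns

    float below  = in[IDX(x, y, 0)];   // z-1
    float center = in[IDX(x, y, 1)];   // z
    for (int z = 1; z < NZ - 1; z++)
    {
        float above = in[IDX(x, y, z + 1)];              // the only new column load
        out[IDX(x, y, z)] =
              0.4f * center
            + 0.1f * (below + above
                      + in[IDX(x - 1, y, z)] + in[IDX(x + 1, y, z)]
                      + in[IDX(x, y - 1, z)] + in[IDX(x, y + 1, z)]);
        below  = center;                                  // shift the register window
        center = above;
    }
}

Adding shared-memory tiling of the x-y plane on top of this is what the slide's combined
register + block tiling refers to.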
Questions?
Use case
GAMER
Hsi-Yu Schive, T. Chiueh, and Y. C. Tsai
Astrophysics adaptive mesh refinement (AMR) code with solvers for hydrodynamics and gravity
Parallelization achieved with OpenMP and MPI on multi-node multicore systems, and CUDA for accelerators (GPUs)
Decoupling of the AMR (CPU) and the solvers (GPU) lends itself to increased performance and ease of code development
Speed-ups of the order of 10-12x attained on single- and multi-GPU heterogeneous systems

Simulations

GAMER Framework
Hemant Shukla, Hsi-Yu Schive, Tak-Pong Woo, and T. Chiueh
Generalized the GAMER codebase into a multi-science framework
Use GAMER to deeply benchmark heterogeneous hardware, optimizations and algorithms in applications
Collect performance, memory access, power consumption and various other metrics for a broader user base
Develop codebases as ensembles of highly optimized existing and customizable components for HPC
Adaptive Mesh Refinement

Data stored in an octree data structure
Refinement with 2^l spatial resolution per level l
8^3 cells per patch
Identical spatial geometry (same kernel)
Uniform and individual time-steps

[Figure: 2D patch hierarchy - Hsi-Yu Schive et al., 2010]
Construct and Dataflow

GAMER Codebase: C++/CUDA, MPI, OpenMP
AMR, Framework, Libraries
Solvers: Poisson, Hydro, Custom, ...

[Figure: cluster of CPU+GPU nodes advancing through time steps]

The problem domain is covered with coarse patches on the CPUs
User-defined refinement, spatial averaging and flux correction are done on the CPUs
Concurrently, patches are transferred to the GPUs, processed by the solvers (one cell per thread), and returned
Solvers

Hydrodynamics PDE Solver
3D Euler equations solved with 5 separate schemes:
Second-order relaxing Total Variation Diminishing (TVD)
Weighted average flux
MUSCL-Hancock (MHM)
MUSCL-Hancock (VL)
Corner transport upwind (CTU)
Flux conservation is done using a Riemann solver
(4 types - exact solver, HLLE, HLLC, and Roe)

\[
\frac{\partial \rho}{\partial t} + \frac{\partial (\rho v_j)}{\partial x_j} = 0, \qquad
\frac{\partial (\rho v_i)}{\partial t} + \frac{\partial (\rho v_i v_j + P \delta_{ij})}{\partial x_j} = -\rho \frac{\partial \phi}{\partial x_i}, \qquad
\frac{\partial e}{\partial t} + \frac{\partial [(e + P) v_j]}{\partial x_j} = -\rho v_j \frac{\partial \phi}{\partial x_j}
\]

Poisson-Gravity Solver
\[
\nabla^2 \phi(\vec{x}) = 4\pi G \rho(\vec{x})
\]
The Laplacian operator is replaced by a seven-point finite difference operator
For root-level patches Green's functions are used via FFTW
For refined levels SOR is used

Recently implemented
Multigrid Poisson solver
Hilbert space-filling curve (load balancing)

Currently implementing
Fast Poisson solver with Dirichlet boundary conditions
GAMER Framework

Allows adding custom/new solvers to the codebase

New Solver implements
- The size of the computational stencil
- An optimized CPU version of the implementation
- An optimized GPU version of the implementation

New Solver inherits
- CUDA thread blocks and stream objects
- Async memcpy, concurrent execution, MPI and OpenMP optimization
Multi-Science

Cosmological Large-scale Structure
Gravitational potential, effective resolution 8192^3
\[
\nabla^2 \phi(\vec{x}) = 4\pi G a \,[\rho(\vec{x}) - \rho_b(\vec{x})]
\]

Bosonic Dark Matter
Schrodinger-Poisson equation; structure due to a dark matter model in the early universe
\[
i\hbar \frac{\partial \psi}{\partial t} = -\frac{\hbar^2}{2 a^2 m} \nabla^2 \psi + m V \psi
\]

Gravitational Lensing Potential
Lens equation and mass relationship
\[
\vec{u} = \vec{x} - \nabla \phi(\vec{x}), \qquad \nabla^2 \phi(\vec{x}) = \Sigma(\vec{x})/\Sigma_{cr}
\]
Kernel Analysis

[Figure: global memory access (read and write, GB/s) for the Gravity, Fluid and Poisson kernels;
maximum bandwidth 144 GB/s]

[Figure: instructions per byte for each kernel - Poisson 268.77 (compute bound), Fluid 4.02 and
Gravity 2.58 (memory bound); the compute/memory-bound boundary is at 3.57 instructions/byte]

SOR takes 20-30 iterations to converge
L1 cache hit rates during global memory accesses: 0.0%, 64.3%, 15.9%
Intensive use of shared memory
Results - Large scale Cosmological Simulations with GAMER
Hemant Shukla, Hsi-Yu Schive et al., SC 2011
Results

Bosonic Dark Matter Simulation
Hemant Shukla, Hsi-Yu Schive et al., SC 2011
Base level resolution 256^3, refined to level 7 (32,768^3 effective)

[Figure: wall-clock seconds per step (Gravity, Kinematic (Schrödinger's eqn.), Fix-up, Refinement,
MPI, Time-step) on 8 and 64 CPU cores, with and without GPUs; the GPU runs show 5.52x and 4.79x speed-ups]
New Results

Load balance with Hilbert space-filling curve

[Figure: wall-clock seconds per step (Gravity, Kinematic (Schrödinger's eqn.), Fix-up, Refinement,
MPI, Time-step) on 8 and 64 cores + GPU, unbalanced vs. balanced; load balancing gives a 3.03x improvement]