www.ipal.cnrs.fr
Patrick Jamet, François Regnoult, Agathe Valette
May 6th 2013
C for CUDA
Small introduction to GPU computing
Summary
‣ Introduction
‣ GPUs
‣ Hardware
‣ Software abstraction
- Grids, Blocks, Threads
‣ Kernels
‣ Memory
‣ Global, Constant, Texture, Shared, Local, Register
‣ Program Example
‣ Conclusion
PRESENTATION OF GPUS
General and hardware considerations
What are GPUs?
‣ Processors designed to handle graphic computations and
scene generation
- Optimized for parallel computation
‣ GPGPU: the use of GPU for general purpose computing
instead of graphic operations like shading, texture mapping,
etc.
Why use GPUs for general purposes?
‣ CPUs are suffering from:
- Performance growth slow-down
- Limits to exploiting instruction-level parallelism
- Power and thermal limitations
‣ GPUs are found in all PCs
‣ GPUs are energy efficient
- Performance per watt
Why use GPUs for general purposes?
‣ Modern GPUs provide extensive resources
- Massive parallelism and many processing cores
- Flexible and increased programmability
- High floating-point precision
- High arithmetic intensity
- High memory bandwidth
- Inherent parallelism
CPU architecture
How do CPUs and GPUs differ?
‣ Latency: the delay between a request and the first data returned
- e.g. the delay between a texture read request and the texture data coming back
‣ Throughput: amount of work/amount of time
‣ CPU: low-latency, low-throughput
‣ GPU: high-latency, high-throughput
- Processing millions of pixels in a single frame
- Little cache: more transistors dedicated to raw compute power
How do CPUs and GPUs differ?
Task Parallelism for CPU
‣ Multiple tasks map to multiple threads
‣ Tasks run different instructions
‣ 10s of heavyweight threads on 10s of cores
‣ Each thread is managed and scheduled explicitly
Data Parallelism for GPU
‣ SIMD model
‣ Same instruction applied to different data
‣ 10,000s of lightweight threads working on 100s of cores
‣ Threads are managed and scheduled by hardware
SOFTWARE ABSTRACTION
Grids, blocks and threads
Host and Device
‣ CUDA assumes a distinction between Host and Device
‣ Terminology
- Host: the CPU and its memory (host memory)
- Device: the GPU and its memory (device memory)
Threads, blocks and grid
‣ Threads are independent sequences of execution that run concurrently
‣ Threads are organized in blocks, which are organized in a grid
‣ Blocks and Threads can be addressed using 3D coordinates (see the sketch below)
‣ Threads in the same block share fast memory with each other
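‣ As an illustration of the 3D addressing above, here is a minimal sketch (our own example, not code from the slides) that launches a kernel over a 2D grid of 2D blocks and recovers each thread's global (x, y) coordinates from blockIdx, blockDim and threadIdx:
// Hypothetical example: one thread per element of a width x height image.
__global__ void fill2D(float *data, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // global column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // global row index
    if (x < width && y < height)
        data[y * width + x] = 1.0f;                 // one element per thread
}
// Host side: 16x16 blocks, and enough blocks to cover the whole image.
// dim3 block(16, 16);
// dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
// fill2D<<<grid, block>>>(d_data, width, height);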
Blocks
‣ The number of Threads in a Block is limited and depends on the graphics card
‣ Threads in a Block are divided into groups of 32 Threads called Warps
- Threads in the same Warp are executed in parallel
‣ Blocks give automatic scalability: the hardware can distribute them over however many multiprocessors the card has
Kernels
‣ The Kernel consists of the code each Thread executes
‣ Threads can be thought of as entities mapping to the elements of a certain data structure
‣ Kernels are launched by the Host, and can also be launched by other Kernels in recent CUDA versions (dynamic parallelism)
How to use kernels?
‣ A kernel can only be a void function
‣ The CUDA __global__ qualifier means the Kernel can be called from the Host (and, on recent hardware, from the Device), but it always runs on the Device
‣ Inside a Kernel, each Thread can read its thread and block position to compute a unique identifier (see the sketch below)
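‣ A minimal sketch of the bullets above (our own example, not code from the slides): a __global__ void kernel that turns its block and thread positions into one unique global index.
// Hypothetical example: each thread computes one element of c = a + b.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    // Unique identifier built from the block and thread positions
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // guard: the last block may be partially filled
        c[i] = a[i] + b[i];
}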
How to use kernels?
‣ Kernel call: kernel_name<<<number_of_blocks, threads_per_block>>>(arguments)
‣ If you want to call a normal function from your Kernel, you must declare it with the CUDA __device__ qualifier
‣ A __device__ function can only be called from Device code and is automatically defined as inline (see the sketch below)
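‣ A hedged sketch of the two points above (illustration only, not the slide's original code): a __device__ helper called from a kernel, and the <<<blocks, threads>>> launch from the Host.
// __device__ helper: callable only from Device code, inlined by the compiler.
__device__ float square(float x)
{
    return x * x;
}

__global__ void squareAll(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = square(data[i]);
}

// Host side launch: 256 threads per block, enough blocks to cover n elements.
// squareAll<<<(n + 255) / 256, 256>>>(d_data, n);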
MEMORY ORGANIZATION
Memory Management
Each thread can:
‣ Read/write per-thread registers
‣ Read/write per-thread local memory
‣ Read/write per-block shared memory
‣ Read/write per-grid global memory
‣ Read per-grid constant memory
‣ Read per-grid texture memory (see the sketch below for how these map to declarations)
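‣ To make the list above concrete, here is a hedged sketch (our own illustration, launch with 128 threads per block assumed) of how declarations map to those memory spaces in C for CUDA:
__constant__ float coeff[16];        // per-grid constant memory (read-only in kernels)
__device__   float scale = 2.0f;     // per-grid global memory, declared at file scope

__global__ void memorySpaces(float *global_in)   // pointer arguments refer to global memory
{
    int i = threadIdx.x;             // plain local variables usually live in registers
    __shared__ float tile[128];      // per-block shared memory

    tile[i] = global_in[i] * coeff[i % 16] * scale;
    __syncthreads();                 // shared memory is visible to the whole block
    global_in[i] = tile[i];
}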
Global Memory
‣ Host and Device global memory are separate entities
- Device pointers point to GPU memory
May not be dereferenced in Host code
- Host pointers point to CPU memory
May not be dereferenced in Device code
‣ Slowest memory
‣ Easy to use
‣ ~1.5 GB on a typical GPU
‣ C vs C for CUDA equivalents:
- int *h_T;  ↔  int *d_T;
- malloc()  ↔  cudaMalloc()
- free()  ↔  cudaFree()
- memcpy()  ↔  cudaMemcpy()
Global Memory example
‣ The original slide showed the same array allocation and copy written side by side in plain C and in C for CUDA (reconstructed below)
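‣ Since the slide's code did not survive extraction, here is a hedged reconstruction of what such an example typically looks like (array names and sizes are our own assumptions):
#include <stdlib.h>
#include <string.h>
#include <cuda_runtime.h>

#define N 1024

int main(void)
{
    size_t size = N * sizeof(int);

    /* C: allocate and fill an array in host memory */
    int *h_T = (int *)malloc(size);
    memset(h_T, 0, size);

    /* C for CUDA: allocate device memory and copy the data across */
    int *d_T;
    cudaMalloc((void **)&d_T, size);
    cudaMemcpy(d_T, h_T, size, cudaMemcpyHostToDevice);

    /* ... launch kernels on d_T ... */

    cudaMemcpy(h_T, d_T, size, cudaMemcpyDeviceToHost);
    cudaFree(d_T);
    free(h_T);
    return 0;
}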
Constant Memory
‣ Constant memory is a read-only memory located in the Global
memory and can be accessed by every thread
‣ Two reasons to use Constant memory:
- A single read can be broadcast to up to 15 other threads (a half-warp)
- Constant memory is cached on the GPU
‣ Drawback:
- The half-warp broadcast feature can degrade performance when the 16 threads read different addresses
How to use constant memory?
‣ The qualifier used to place a variable in constant memory is __constant__
‣ The variable must be declared outside of any function body, and cudaMemcpyToSymbol is used to copy values from the Host to the Device
‣ Constant Memory variables do not need to be passed as kernel arguments to be accessed inside the kernel (see the sketch below)
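‣ A minimal sketch of the above (our own illustration; the variable names are assumptions):
// File-scope declaration: lives in constant memory on the device.
__constant__ float c_filter[16];

__global__ void applyFilter(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= c_filter[i % 16];   // no kernel argument needed for c_filter
}

// Host side: copy values into the constant symbol before launching.
// float h_filter[16] = { /* ... */ };
// cudaMemcpyToSymbol(c_filter, h_filter, sizeof(h_filter));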
Texture memory
‣ Texture memory is located in the Global memory and can be
accessed by every thread
‣ Accessed through a dedicated read-only cache
‣ The texture cache includes hardware filtering, which can perform linear floating-point interpolation as part of the read
‣ The cache is optimised for spatial locality in the coordinate system of the texture, not in linear memory (see the sketch below)
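‣ A hedged sketch using the texture reference API that was current when this presentation was written (since deprecated in favour of texture objects); the names are our own:
// Texture reference bound to a linear array of floats (legacy API).
texture<float, cudaTextureType1D, cudaReadModeElementType> texIn;

__global__ void copyThroughTexture(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(texIn, i);   // read goes through the texture cache
}

// Host side:
// cudaBindTexture(NULL, texIn, d_in, n * sizeof(float));
// copyThroughTexture<<<(n + 255) / 256, 256>>>(d_out, n);
// cudaUnbindTexture(texIn);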
Shared Memory
‣ [16-64] KB of memory per block
‣ Extremely fast on-chip memory,
user managed
‣ Declare using __shared__,
allocated per block
‣ Data is not visible to threads in
other blocks
‣ Beware of bank conflicts!
‣ When to use? When threads would otherwise read the same global memory locations many times
Shared Memory - Example
‣ 1D stencil: each output element is the SUM of the input elements within a fixed radius around it
‣ How many times is each input element read? With a radius of 3, it is read 7 times (2 × radius + 1)
Shared Memory - Example
__global__ void stencil_1d(int *in, int *out)
{
    __shared__ int temp[BLOCK_SIZE];
    int lindex = threadIdx.x;

    // Read input elements into shared memory
    temp[lindex] = in[lindex];

    // Wait until every thread in the block has written its element
    __syncthreads();

    if (lindex >= RADIUS && lindex < BLOCK_SIZE - RADIUS)
    {
        // Sum the elements within RADIUS on each side
        int res = 0;
        for (int offset = -RADIUS; offset <= RADIUS; offset++)
            res += temp[lindex + offset];
        out[lindex] = res;
    }
}
Shared Memory - Problem
‣ Note the __syncthreads() barrier above: without it, a thread could read temp[] entries that its neighbours have not written yet, which is the classic shared memory race
PROGRAM EXAMPLE
1D stencil
Global Memory version
‣ The same 1D stencil can be written using global memory only, so that every neighbour access goes to slow global memory (a reconstruction is sketched below)
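‣ The original example slides were images and did not survive extraction; below is a hedged reconstruction of a complete global-memory 1D stencil program (sizes and names are our own assumptions):
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define RADIUS      3
#define BLOCK_SIZE  256
#define N           (BLOCK_SIZE * 64)

// Global-memory version: every neighbour access reads directly from global memory.
__global__ void stencil_1d_global(const int *in, int *out, int n)
{
    int gindex = blockIdx.x * blockDim.x + threadIdx.x;
    if (gindex >= RADIUS && gindex < n - RADIUS) {
        int res = 0;
        for (int offset = -RADIUS; offset <= RADIUS; offset++)
            res += in[gindex + offset];
        out[gindex] = res;
    }
}

int main(void)
{
    size_t size = N * sizeof(int);
    int *h_in  = (int *)malloc(size);
    int *h_out = (int *)malloc(size);
    for (int i = 0; i < N; i++) h_in[i] = 1;

    int *d_in, *d_out;
    cudaMalloc((void **)&d_in,  size);
    cudaMalloc((void **)&d_out, size);
    cudaMemcpy(d_in, h_in, size, cudaMemcpyHostToDevice);

    // One thread per output element
    stencil_1d_global<<<N / BLOCK_SIZE, BLOCK_SIZE>>>(d_in, d_out, N);

    cudaMemcpy(h_out, d_out, size, cudaMemcpyDeviceToHost);
    printf("out[%d] = %d (expected %d)\n", RADIUS, h_out[RADIUS], 2 * RADIUS + 1);

    cudaFree(d_in); cudaFree(d_out);
    free(h_in); free(h_out);
    return 0;
}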
CONCLUSION
Conclusion
‣ GPUs are designed for parallel computing
‣ CUDA’s software abstraction is adapted to the GPU
architecture with grids, blocks and threads
‣ The management of which functions access what type of
memory is very important
- Be careful of bank conflicts!
‣ Data transfer between host and device is slow (on the order of 5 GB/s for host-to-device and device-to-host transfers, against roughly 16 GB/s for device-to-device and host-to-host copies)
Resources
‣ We skipped some details; you can learn more from:
- CUDA programming guide
- CUDA Zone – tools, training, webinars and more
- http://developer.nvidia.com/cuda
‣ Install from
- https://developer.nvidia.com/category/zone/cuda-zone and
learn from provided examples