www.ipal.cnrs.fr
Patrick Jamet, François Regnoult, Agathe Valette
May 6th 2013
C for CUDA
Small introduction to GPU computing
Summary
‣ Introduction
‣ GPUs
‣ Hardware
‣ Software abstraction
- Grids, Blocks, Threads
‣ Kernels
‣ Memory
‣ Global, Constant, Texture, Shared, Local, Register
‣ Program Example
‣ Conclusion
PRESENTATION OF GPUS
General and hardware considerations
What are GPUs?
‣ Processors designed to handle graphic computations and
scene generation
- Optimized for parallel computation
‣ GPGPU: the use of GPU for general purpose computing
instead of graphic operations like shading, texture mapping,
etc.
Why use GPUs for general purposes?
‣ CPUs are suffering from:
- Performance growth slow-down
- Limits to exploiting instruction-level parallelism
- Power and thermal limitations
‣ GPUs are found in all PCs
‣ GPUs are energy efficient
- Performance per watt
Why use GPUs for general purposes?
‣ Modern GPUs provide extensive resources
- Massive parallelism and many processing cores
- Flexible and increased programmability
- High floating-point precision
- High arithmetic intensity
- High memory bandwidth
- Inherent parallelism
CPU architecture
How do CPUs and GPUs differ?
‣ Latency: the delay between a request and the first data returned
- e.g. the delay between a texture read request and the texture data coming back
‣ Throughput: amount of work/amount of time
‣ CPU: low-latency, low-throughput
‣ GPU: high-latency, high-throughput
- Processing millions of pixels in a single frame
- Little cache: more transistors dedicated to raw compute power
How do CPUs and GPUs differ?
Task Parallelism for CPU
‣ Multiple tasks map to multiple threads
‣ Tasks run different instructions
‣ 10s of heavyweight threads on 10s of cores
‣ Each thread is managed and scheduled explicitly
Data Parallelism for GPU
‣ SIMD model
‣ Same instruction applied to different data
‣ 10,000s of lightweight threads working on 100s of cores
‣ Threads are managed and scheduled by hardware
SOFTWARE ABSTRACTION
Grids, blocks and threads
Host and Device
‣ CUDA assumes a distinction between Host and Device
‣ Terminology
- Host: the CPU and its memory (host memory)
- Device: the GPU and its memory (device memory)
Threads, blocks and grid
‣ Threads are independent sequences of execution that run concurrently
‣ Threads are organized in blocks, which are organized in a grid
‣ Blocks and Threads can be addressed using 3D coordinates (see the sketch below)
‣ Threads in the same block share fast memory with each other
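‣ As an illustration of the 3D addressing above, here is a minimal sketch (our own example, not code from the slides) that launches a kernel over a 2D grid of 2D blocks and recovers each thread's global (x, y) coordinates from blockIdx, blockDim and threadIdx:
// Hypothetical example: one thread per element of a width x height image.
__global__ void fill2D(float *data, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // global column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // global row index
    if (x < width && y < height)
        data[y * width + x] = 1.0f;                 // one element per thread
}
// Host side: 16x16 blocks, and enough blocks to cover the whole image.
// dim3 block(16, 16);
// dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
// fill2D<<<grid, block>>>(d_data, width, height);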
Blocks
‣ The number of Threads in a Block is limited and depends on the graphics card
‣ Threads in a Block are divided into groups of 32 Threads called Warps
- Threads in the same Warp are executed in parallel
‣ Blocks give automatic scalability: the hardware can distribute them over however many multiprocessors the card has
Kernels
‣ The Kernel consists of the code each Thread executes
‣ Threads can be thought of as entities mapping to the elements of a certain data structure
‣ Kernels are launched by the Host, and can also be launched by other Kernels in recent CUDA versions (dynamic parallelism)
How to use kernels?
‣ A kernel can only be a void function
‣ The CUDA __global__ qualifier means the Kernel can be called from the Host (and, on recent hardware, from the Device), but it always runs on the Device
‣ Inside a Kernel, each Thread can read its thread and block position to compute a unique identifier (see the sketch below)
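‣ A minimal sketch of the bullets above (our own example, not code from the slides): a __global__ void kernel that turns its block and thread positions into one unique global index.
// Hypothetical example: each thread computes one element of c = a + b.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    // Unique identifier built from the block and thread positions
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // guard: the last block may be partially filled
        c[i] = a[i] + b[i];
}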
How to use kernels?
‣ Kernel call: kernel_name<<<number_of_blocks, threads_per_block>>>(arguments)
‣ If you want to call a normal function from your Kernel, you must declare it with the CUDA __device__ qualifier
‣ A __device__ function can only be called from Device code and is automatically defined as inline (see the sketch below)
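‣ A hedged sketch of the two points above (illustration only, not the slide's original code): a __device__ helper called from a kernel, and the <<<blocks, threads>>> launch from the Host.
// __device__ helper: callable only from Device code, inlined by the compiler.
__device__ float square(float x)
{
    return x * x;
}

__global__ void squareAll(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = square(data[i]);
}

// Host side launch: 256 threads per block, enough blocks to cover n elements.
// squareAll<<<(n + 255) / 256, 256>>>(d_data, n);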
MEMORY ORGANIZATION
Memory Management
Each thread can:
‣ Read/write per-thread registers
‣ Read/write per-thread local memory
‣ Read/write per-block shared memory
‣ Read/write per-grid global memory
‣ Read per-grid constant memory
‣ Read per-grid texture memory (see the sketch below for how these map to declarations)
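‣ To make the list above concrete, here is a hedged sketch (our own illustration, launch with 128 threads per block assumed) of how declarations map to those memory spaces in C for CUDA:
__constant__ float coeff[16];        // per-grid constant memory (read-only in kernels)
__device__   float scale = 2.0f;     // per-grid global memory, declared at file scope

__global__ void memorySpaces(float *global_in)   // pointer arguments refer to global memory
{
    int i = threadIdx.x;             // plain local variables usually live in registers
    __shared__ float tile[128];      // per-block shared memory

    tile[i] = global_in[i] * coeff[i % 16] * scale;
    __syncthreads();                 // shared memory is visible to the whole block
    global_in[i] = tile[i];
}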
Global Memory
‣ Host and Device global memory are separate entities
- Device pointers point to GPU memory
May not be dereferenced in Host code
- Host pointers point to CPU memory
May not be dereferenced in Device code
‣ Slowest memory
‣ Easy to use
‣ ~1.5 GB on a typical GPU
‣ C vs C for CUDA equivalents:
- int *h_T;  ↔  int *d_T;
- malloc()  ↔  cudaMalloc()
- free()  ↔  cudaFree()
- memcpy()  ↔  cudaMemcpy()
Global Memory example
‣ The original slide showed the same array allocation and copy written side by side in plain C and in C for CUDA (reconstructed below)
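‣ Since the slide's code did not survive extraction, here is a hedged reconstruction of what such an example typically looks like (array names and sizes are our own assumptions):
#include <stdlib.h>
#include <string.h>
#include <cuda_runtime.h>

#define N 1024

int main(void)
{
    size_t size = N * sizeof(int);

    /* C: allocate and fill an array in host memory */
    int *h_T = (int *)malloc(size);
    memset(h_T, 0, size);

    /* C for CUDA: allocate device memory and copy the data across */
    int *d_T;
    cudaMalloc((void **)&d_T, size);
    cudaMemcpy(d_T, h_T, size, cudaMemcpyHostToDevice);

    /* ... launch kernels on d_T ... */

    cudaMemcpy(h_T, d_T, size, cudaMemcpyDeviceToHost);
    cudaFree(d_T);
    free(h_T);
    return 0;
}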
Constant Memory
‣ Constant memory is a read-only memory located in the Global
memory and can be accessed by every thread
‣ Two reasons to use Constant memory:
- A single read can be broadcast to up to 15 other threads (a half-warp)
- Constant memory is cached on the GPU
‣ Drawback:
- The half-warp broadcast feature can degrade performance when the 16 threads read different addresses
How to use constant memory?
‣ The qualifier used to place a variable in constant memory is __constant__
‣ The variable must be declared outside of any function body, and cudaMemcpyToSymbol is used to copy values from the Host to the Device
‣ Constant Memory variables do not need to be passed as kernel arguments to be accessed inside the kernel (see the sketch below)
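‣ A minimal sketch of the above (our own illustration; the variable names are assumptions):
// File-scope declaration: lives in constant memory on the device.
__constant__ float c_filter[16];

__global__ void applyFilter(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= c_filter[i % 16];   // no kernel argument needed for c_filter
}

// Host side: copy values into the constant symbol before launching.
// float h_filter[16] = { /* ... */ };
// cudaMemcpyToSymbol(c_filter, h_filter, sizeof(h_filter));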
Texture memory
‣ Texture memory is located in the Global memory and can be
accessed by every thread
‣ Accessed through a dedicated read-only cache
‣ The texture cache includes hardware filtering, which can perform linear floating-point interpolation as part of the read
‣ The cache is optimised for spatial locality in the coordinate system of the texture, not in linear memory (see the sketch below)
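‣ A hedged sketch using the texture reference API that was current when this presentation was written (since deprecated in favour of texture objects); the names are our own:
// Texture reference bound to a linear array of floats (legacy API).
texture<float, cudaTextureType1D, cudaReadModeElementType> texIn;

__global__ void copyThroughTexture(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(texIn, i);   // read goes through the texture cache
}

// Host side:
// cudaBindTexture(NULL, texIn, d_in, n * sizeof(float));
// copyThroughTexture<<<(n + 255) / 256, 256>>>(d_out, n);
// cudaUnbindTexture(texIn);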
Shared Memory
‣ [16-64] KB of memory per block
‣ Extremely fast on-chip memory,
user managed
‣ Declare using __shared__,
allocated per block
‣ Data is not visible to threads in
other blocks
‣ Beware of bank conflicts!
‣ When to use? When threads would otherwise read the same global memory locations many times
Shared Memory - Example
‣ 1D stencil: each output element is the SUM of the input elements within a fixed radius around it
‣ How many times is each input element read? With a radius of 3, it is read 7 times (2 × radius + 1)
Shared Memory - Example
__global__ void stencil_1d(int *in, int *out)
{
    __shared__ int temp[BLOCK_SIZE];
    int lindex = threadIdx.x;

    // Read input elements into shared memory
    temp[lindex] = in[lindex];

    // Wait until every thread in the block has written its element
    __syncthreads();

    if (lindex >= RADIUS && lindex < BLOCK_SIZE - RADIUS)
    {
        // Sum the elements within RADIUS on each side
        int res = 0;
        for (int offset = -RADIUS; offset <= RADIUS; offset++)
            res += temp[lindex + offset];
        out[lindex] = res;
    }
}
Shared Memory - Problem
‣ Note the __syncthreads() barrier above: without it, a thread could read temp[] entries that its neighbours have not written yet, which is the classic shared memory race
PROGRAM EXAMPLE
1D stencil
Global Memory version
‣ The same 1D stencil can be written using global memory only, so that every neighbour access goes to slow global memory (a reconstruction is sketched below)
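‣ The original example slides were images and did not survive extraction; below is a hedged reconstruction of a complete global-memory 1D stencil program (sizes and names are our own assumptions):
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define RADIUS      3
#define BLOCK_SIZE  256
#define N           (BLOCK_SIZE * 64)

// Global-memory version: every neighbour access reads directly from global memory.
__global__ void stencil_1d_global(const int *in, int *out, int n)
{
    int gindex = blockIdx.x * blockDim.x + threadIdx.x;
    if (gindex >= RADIUS && gindex < n - RADIUS) {
        int res = 0;
        for (int offset = -RADIUS; offset <= RADIUS; offset++)
            res += in[gindex + offset];
        out[gindex] = res;
    }
}

int main(void)
{
    size_t size = N * sizeof(int);
    int *h_in  = (int *)malloc(size);
    int *h_out = (int *)malloc(size);
    for (int i = 0; i < N; i++) h_in[i] = 1;

    int *d_in, *d_out;
    cudaMalloc((void **)&d_in,  size);
    cudaMalloc((void **)&d_out, size);
    cudaMemcpy(d_in, h_in, size, cudaMemcpyHostToDevice);

    // One thread per output element
    stencil_1d_global<<<N / BLOCK_SIZE, BLOCK_SIZE>>>(d_in, d_out, N);

    cudaMemcpy(h_out, d_out, size, cudaMemcpyDeviceToHost);
    printf("out[%d] = %d (expected %d)\n", RADIUS, h_out[RADIUS], 2 * RADIUS + 1);

    cudaFree(d_in); cudaFree(d_out);
    free(h_in); free(h_out);
    return 0;
}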
CONCLUSION
Conclusion
‣ GPUs are designed for parallel computing
‣ CUDA’s software abstraction is adapted to the GPU
architecture with grids, blocks and threads
‣ The management of which functions access what type of
memory is very important
- Be careful of bank conflicts!
‣ Data transfer between host and device is slow (on the order of 5 GB/s for host-to-device and device-to-host transfers, against roughly 16 GB/s for device-to-device and host-to-host copies)
Resources
‣ We skipped some details; you can learn more from:
- CUDA programming guide
- CUDA Zone – tools, training, webinars and more
- http://developer.nvidia.com/cuda
‣ Install from
- https://developer.nvidia.com/category/zone/cuda-zone and
learn from provided examples