Lloyd Moore, President
Lloyd@CyberData-Robotics.com
www.CyberData-Robotics.com
CUDA Without a PhD: A Practical Quick Start
Agenda:
Introduction to GPU Architecture
What is CUDA?
CUDA Setup
Problem Definition
CPU Single Threaded Solution
GPU Massively Parallel Solution
Debugging CUDA Kernels
Resources
Q & A
Disclaimer
CUDA development can be a VERY deep topic. To get the maximum
performance from a GPU-based application, the problem needs to start
with a correct formulation, considering the specific hardware being
used, data access patterns, memory bandwidth, processor topology
and much more.
This talk IS NOT about all of that, as there are plenty of well done
presentations covering those topics already. If you want to get into the
deep details, NVIDIA has a great starting point here:
https://docs.nvidia.com/cuda/doc/index.html
This talk IS about how a developer can get a very simple start with
CUDA applications and see immediate benefits without having to
spend weeks learning all of the details. You won’t be able to fully
optimize an application, but you will be able to quickly convert
select processing patterns and see considerable speed-ups.
GPU Architecture
A modern CPU consists of a small number of complex processors that are mostly
independent of one another and perform work on generally independent tasks.
Each core’s computational pattern is generally referred to as SISD – Single Instruction,
Single Data.
A modern GPU consists of hundreds to thousands of very simple processors that
work together to perform the same operation on multiple pieces of data in parallel.
This computational pattern is generally referred to as SIMD – Single Instruction,
Multiple Data. (NVIDIA calls it SIMT – Single Instruction, Multiple Thread.)
The GPU “likes” to work in 32 bit floating point. 64 bit floating point is supported;
however, you take a time penalty for the additional precision. Of course, integer
math is also supported!
GPU Architecture
From this description, the GPU offers an effective speed up when the following conditions hold:
1. You have a VERY large set of data that needs to be processed
Think hundreds of MB or GB of data
2. The data format is “regular”
For example stored in arrays or vectors
3. The same (or very similar) operations need to be performed on each element
4. The operations to be performed on each data element are independent
5. The amount of work to be performed on each element is significant enough to
justify copying the data at least twice
Let’s look at that last point in more detail...
GPU Memory Architecture
A GPU typically contains a dedicated bank of memory, independent from the
normal CPU memory.
GPU memory is optimized for highly parallel access patterns.
Information to be processed by the GPU must be copied from the CPU memory,
called “host memory”, to the GPU memory, called “device memory”.
Results may be used on the GPU directly or copied back to the CPU / host
memory, depending on the application.
Due to the overhead of copying data between the two memories, the amount of
work to be done must be substantial enough to amortize the copy overhead.
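As a concrete illustration (not from the slides), here is a minimal sketch of this host-to-device round trip using the raw CUDA runtime API. The example later in this talk uses Thrust, which hides these calls behind simple assignments.

    // Minimal sketch: copy a buffer to the device, work on it, copy it back.
    #include <cuda_runtime.h>
    #include <vector>

    void copy_round_trip(std::vector<float>& host_data)
    {
        float* device_data = nullptr;
        size_t bytes = host_data.size() * sizeof(float);

        cudaMalloc(&device_data, bytes);                    // allocate device memory
        cudaMemcpy(device_data, host_data.data(), bytes,
                   cudaMemcpyHostToDevice);                 // host -> device copy
        // ... launch kernels that work on device_data here ...
        cudaMemcpy(host_data.data(), device_data, bytes,
                   cudaMemcpyDeviceToHost);                 // device -> host copy
        cudaFree(device_data);                              // release device memory
    }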
Note: “Unified Memory”, “Shared Memory” and “Texture Memory” also exist; we won’t
cover them here as each has specific uses and trade-offs.
What is NVIDIA CUDA?
NVIDIA CUDA is a framework and set of tools that allow application development
on NVIDIA GPU hardware.
Top level documentation is here: https://docs.nvidia.com/cuda/doc/index.html
Main Components:
NVIDIA Compiler: nvcc
CUDA API
Debugging and Profiling Tools: Nsight Compute
Math Libraries: cuBLAS, cuFFT, cuRAND, cuTENSOR, cuSPARSE, cuSOLVER,
nvJPEG, Thrust, and many others
Technologies: GPUDirect Storage – direct GPU-to-disk access
CUDA Setup - Requirements
CUDA can run in Windows and Linux environments on PCs (x86/64) and Jetson
(ARM) hardware.
For this exercise I’ll use the following configuration (Note: smaller systems WILL
also work fine – this is NOT a minimum recommended configuration):
CPU: AMD Ryzen 9 7950X, 16 core, 32 thread, 4.5 GHz
Motherboard: Asus ProArt X670E-Creator
RAM: 64GB DDR5 4800
GPU: Asus GeForce RTX 4080, 16GB RAM
GeForce Game Ready Driver Version 546.33
OS: Windows 11 Pro, 64 Bit, 22H2 22621.3007
Visual Studio Community Edition 2022
CUDA: 12.2
CUDA Setup – Tool Chain
For this talk we’ll focus on Visual Studio and Windows as it is the simplest to get
going.
CUDA supports many other configurations on both Windows and Linux including
operating through WSL2.
Install Microsoft Visual Studio Community 2022, 64 bit:
https://visualstudio.microsoft.com/vs/community/
Configure for at least C++ development
Install NVIDIA CUDA for Microsoft Visual Studio:
https://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-windows/index.html
You don’t need to install the Python tools unless you want them.
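To confirm the install worked, a tiny test program along these lines can be compiled with nvcc and run. This is just a hypothetical sanity check of my own, not part of the installer:

    // check_cuda.cu - compile with "nvcc check_cuda.cu" and run the result.
    #include <cuda_runtime.h>
    #include <cstdio>

    int main()
    {
        int count = 0;
        cudaError_t err = cudaGetDeviceCount(&count);   // ask the runtime for the device count
        if (err != cudaSuccess) {
            printf("CUDA error: %s\n", cudaGetErrorString(err));
            return 1;
        }
        printf("Found %d CUDA device(s)\n", count);
        return 0;
    }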
Sample Problem Definition
For this talk we’ll solve a fairly simple problem that demonstrates a typical design
pattern for solving many other problems, without distracting us with the complexity
of the problem itself.
Problem:
Compute the hypotenuse of a large quantity of triangles given the lengths of the
sides of the triangles, using the Pythagorean Theorem: h = sqrt( a^2 + b^2)
For this example we’ll create two vectors of random numbers for ‘a’ and ‘b’.
We’ll compute the results using a single threaded approach on the CPU and then
convert the code to use the GPU and compare the execution times.
Finally we’ll compare the results between the CPU and GPU and make sure they
match.
Note: For this example we’ll use 32 bit floating point.
Creating the Project in VS
Note: Full sample truncated to fit on this slide!
Creating the Project in VS
When adding files for CUDA use the “Add” → “Module…” option.
CUDA files are named *.cuh for header files and *.cu for C++ files.
CPU Single Threaded
To start, we’ll create a small C++ class called CudaWorker to hold all of our
data and algorithms:
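The actual class is shown on the slide. As a rough sketch of what it could look like (member names other than CudaWorker, CpuCompute() and GpuCompute() are my guesses), using the Thrust vectors described next:

    // Rough sketch of the CudaWorker class described above.
    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>

    class CudaWorker
    {
    public:
        explicit CudaWorker(size_t count);   // fill a & b with random data, copy to the device

        void CpuCompute();                   // single threaded reference implementation
        void GpuCompute();                   // massively parallel GPU implementation
        bool Verify() const;                 // compare the CPU and GPU results

    private:
        thrust::host_vector<float>   a_h, b_h;        // host ("CPU") copies of the input data
        thrust::host_vector<float>   c_cpu, c_gpu;    // results from the CPU and GPU paths
        thrust::device_vector<float> a_d, b_d, c_d;   // device ("GPU") copies of the data
    };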
CPU Single Threaded
I am using a library called Thrust. Thrust is a C++ template library for
CUDA that provides std::vector-like containers for both ‘host’ and ‘device’
memory.
CPU Single Threaded
The constructor simply fills the a & b vectors with random data; we also
prepare the ‘device’ vectors for the GPU at the same time:
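A sketch of what such a constructor could look like; the random number generation details here are my own choice rather than the slide’s exact code:

    // Fill the host vectors with random side lengths, then copy them to the device.
    #include <random>

    CudaWorker::CudaWorker(size_t count)
        : a_h(count), b_h(count), c_cpu(count), c_gpu(count), c_d(count)
    {
        std::mt19937 gen(42);
        std::uniform_real_distribution<float> dist(0.0f, 100.0f);

        for (size_t i = 0; i < count; ++i) {
            a_h[i] = dist(gen);
            b_h[i] = dist(gen);
        }

        a_d = a_h;   // host -> device copy is a simple assignment with Thrust
        b_d = b_h;
    }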
Note that with Thrust copying the vector from host memory to device
memory is a simple assignment!
CPU Single Threaded
CpuCompute() simply consists of a loop that calls a common math routine
for each set of data:
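A sketch of the routine and the loop being described; the slide’s actual code may differ slightly:

    // The "work" function, built by NVCC for both the CPU and the GPU.
    #include <cmath>

    __host__ __device__ float do_pythagorean(float a, float b)
    {
        return sqrtf(a * a + b * b);   // h = sqrt(a^2 + b^2), 32 bit floating point
    }

    // Single threaded CPU reference: one sequential loop over every element.
    void CudaWorker::CpuCompute()
    {
        for (size_t i = 0; i < a_h.size(); ++i) {
            c_cpu[i] = do_pythagorean(a_h[i], b_h[i]);
        }
    }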
Note that do_pythagorean() is tagged with “__device__” and “__host__”.
These attributes tell the NVCC compiler to build the function so it can run
on both the CPU and the GPU.
We will also see “__global__” which is an attribute flagging a CUDA Kernel –
we’ll talk about that more when we convert this code for the GPU.
CPU Single Threaded
And of course we need a main() to instantiate and call CudaWorker. This
also has the GPU code present...
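A sketch of such a main(); the std::chrono timing shown here is my assumption about how the measurement was made:

    // Time the CPU path, then the GPU path, then verify the results match.
    #include <chrono>
    #include <cstdio>

    int main()
    {
        const size_t count = 10'000'000;   // number of triangles; pick your own size
        Gpu::Instance();                   // initialize the GPU (singleton shown on the next slides)
        CudaWorker worker(count);

        auto t0 = std::chrono::high_resolution_clock::now();
        worker.CpuCompute();
        auto t1 = std::chrono::high_resolution_clock::now();
        worker.GpuCompute();
        auto t2 = std::chrono::high_resolution_clock::now();

        auto us = [](auto a, auto b) {
            return std::chrono::duration_cast<std::chrono::microseconds>(b - a).count();
        };
        printf("CPU: %lldus  GPU: %lldus  Match: %s\n",
               (long long)us(t0, t1), (long long)us(t1, t2),
               worker.Verify() ? "yes" : "no");
        return 0;
    }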
GPU Massively Parallel
Next we’ll convert this solution to run on the GPU. The first thing we need
to do is initialize the GPU. I have a singleton class called Gpu for this:
GPU Massively Parallel
The constructor initializes the GPU and prints out some of the GPU
parameters:
GPU Massively Parallel
The destructor “cleans up” the GPU with a cudaDeviceReset():
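A sketch of such a singleton covering both the constructor and the destructor; which properties get printed is illustrative, the slide may show different ones:

    // Singleton wrapper around GPU setup and teardown.
    #include <cuda_runtime.h>
    #include <cstdio>

    class Gpu
    {
    public:
        static Gpu& Instance() { static Gpu gpu; return gpu; }

    private:
        Gpu()
        {
            cudaSetDevice(0);                          // use the first GPU
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, 0);
            printf("GPU: %s, %d SMs, %zu MB\n",
                   prop.name, prop.multiProcessorCount,
                   prop.totalGlobalMem / (1024 * 1024));
        }

        ~Gpu() { cudaDeviceReset(); }                  // "clean up" the GPU on shutdown
    };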
GPU Massively Parallel
A block of code that runs on the GPU is called a “kernel”
An instance of the “kernel” is run on each core of the processor as an
independent thread.
From the developer’s point of view the “kernel” is just a function call,
however under the covers this function call will be instantiated on each
core in parallel, and have access to information uniquely identifying each
instance. This is called a “thread address”.
In traditional graphics processing each pixel displayed is assigned to a
thread. See https://www.shadertoy.com/ to play with this concept!
For the current problem we have vectors of data so we’ll simply assign
each element / index of the vector to a thread.
GPU Massively Parallel
GpuCompute() is the member function that configures and launches a
“kernel” on the GPU. At this point it is assumed the data has already been
copied to GPU memory; the CudaWorker constructor did that:
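A sketch of what GpuCompute() might look like. The kernel name pythagorean_kernel and the block size of 256 are my own choices; the kernel itself is shown a couple of slides further on:

    // Pick a launch geometry, launch the kernel, then check for errors.
    void CudaWorker::GpuCompute()
    {
        const int n       = static_cast<int>(a_d.size());
        const int threads = 256;                        // threads per block
        const int blocks  = (n + threads - 1) / threads;

        pythagorean_kernel<<<blocks, threads>>>(
            thrust::raw_pointer_cast(a_d.data()),       // raw device pointers for the kernel
            thrust::raw_pointer_cast(b_d.data()),
            thrust::raw_pointer_cast(c_d.data()),
            n);

        cudaError_t err = cudaGetLastError();           // did the launch itself fail?
        if (err == cudaSuccess)
            err = cudaDeviceSynchronize();              // wait for the GPU, catch runtime errors
        if (err != cudaSuccess)
            printf("CUDA error: %s\n", cudaGetErrorString(err));

        c_gpu = c_d;                                    // copy the results back to the host
    }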
cudaGetLastError() will return an error code if the “kernel” was not
launched successfully.
cudaDeviceSynchronize() will wait for the work on the GPU to be
completed and return an error code if anything went wrong.
GPU Massively Parallel
The GPU hardware only allows so many threads in a “block” so the work
must be partitioned into blocks.
Invoking a kernel looks just like a function call with some extra annotation:
The “<<<blocks, threads>>>” annotation is picked up by NVCC and
converted into a kernel invocation matching the given geometry
Under the covers CUDA maps the given geometry to the hardware
geometry and launches as many threads in parallel as the hardware
allows. If there are more threads than actual hardware, multiple launches
are serialized until all the work is done.
GPU Massively Parallel
Each GPU thread needs to do two things:
Identify the data elements that it is to work on
Perform the specified work on those data elements
Function annotations tell the compiler how to build and call the code:
__global__ : Runs on the GPU, called from either CPU or GPU
__device__ : Runs on the GPU, called from the GPU
__host__ : Runs on the CPU, called from the CPU
Annotations can be combined.
GPU Massively Parallel
Each kernel is invoked with variables for “thread addressing”:
threadIdx : Contains the “address” of current thread
blockIdx : Contains the “address” of the current block
blockDim : Contains the geometry of the block sizes
Currently we use only the X dimension; in reality these values have 3
dimensions, allowing for easy mapping to real-world 3D spaces.
For each dimension combine the threadIdx, blockIdx and blockDim as
shown to create a fully unique ID for the kernel invocation.
For this problem the data was set up such that the “thread address” directly
maps to the index of the data – this is very common and very simple!
There may be more “thread addresses” than data elements; mask the extras with an “if”.
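Putting the thread addressing together with the shared do_pythagorean() routine, a kernel for this problem looks roughly like this (the kernel name is my own):

    // Each thread computes its unique index and processes exactly one element.
    __global__ void pythagorean_kernel(const float* a, const float* b, float* c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique "thread address"
        if (i < n)                                       // mask off the extra threads
            c[i] = do_pythagorean(a[i], b[i]);
    }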
CPU vs. GPU Code
Key Conversion Points:
1. The sequential ‘for’ statement becomes a CUDA kernel invocation
2. The address calculations become a thread address calculation
3. The “work” ends up being done by exactly the same function!
Once you get familiar with this conversion technique you can generally apply
it in about 30 to 60 minutes! (Faster if you plan for it in advance!)
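In compressed form the conversion amounts to this schematic comparison (not the slide’s exact code):

    // CPU: a sequential loop over every element
    for (size_t i = 0; i < n; ++i)
        c_cpu[i] = do_pythagorean(a_h[i], b_h[i]);

    // GPU: the loop becomes a kernel launch; the loop index becomes a thread address,
    // and the "work" is done by exactly the same do_pythagorean() function.
    pythagorean_kernel<<<blocks, threads>>>(a, b, c, n);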
Verification
You don’t normally need to include a verification routine, but it is helpful:
Floating point results on the GPU can differ slightly from those on the CPU
Nice sanity check to convince yourself this really works
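A sketch of such a verification routine, comparing with a small relative tolerance since the two results can differ in the last bits:

    // Compare the CPU and GPU results element by element.
    #include <cmath>

    bool CudaWorker::Verify() const
    {
        for (size_t i = 0; i < c_cpu.size(); ++i) {
            if (std::fabs(c_cpu[i] - c_gpu[i]) > 1e-5f * std::fabs(c_cpu[i]))
                return false;
        }
        return true;
    }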
Results
Speed up: 107394us / 2047us = 52.46x (for this one case; clearly run more cases!)
This speed up DOES NOT include the copy overhead of the data to and from
the GPU. This will impact the results considerably; however, the “work” we are
doing is also pretty simple. It does include the kernel invocation, which is nontrivial.
This is a VERY unoptimized solution! With a full effort you can get 1000x
improvements for very well-formed and well-fitting cases.
Debugging CUDA Kernels
Debugging kernels can be a bit more challenging due to the following:
Typically there are THOUSANDS of instances running
Access to the GPU memory is more restricted
Simple guidelines to get started:
Place the “work” to be done in a function, as was done here
Debug the “work” on the CPU as you normally would
Once this is done all that is left is the data mapping
Reduce the kernel invocation to a single thread
function<<<1,1>>>(a, b, c);
Gets around thousands of invocations running
Also helpful to test with two invocations: <<<1,2>>>
In Visual Studio, printf() works just as you expect inside kernels
Combine with reducing the kernel invocations
Breakpoints and visual debugging techniques ARE available!
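For example, a debugging pass might shrink the launch and print from inside the kernel, something like this illustration (not code from the slides):

    // Debug variant: print what each thread computes, then launch only two threads.
    __global__ void pythagorean_kernel_debug(const float* a, const float* b, float* c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            c[i] = do_pythagorean(a[i], b[i]);
            printf("thread %d: a=%f b=%f c=%f\n", i, a[i], b[i], c[i]);   // device-side printf
        }
    }

    // While debugging: pythagorean_kernel_debug<<<1, 2>>>(a, b, c, n);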
Additional Resources
This talk has barely scratched the surface of what can be done. The goal was
to provide a simple, effective solution to a common problem; this is just the
beginning of the journey!
Official Documentation:
CUDA Main Docs: https://docs.nvidia.com/cuda/doc/index.html
CUDA Dev Tools: https://developer.nvidia.com/tools-overview
Thrust Library: https://developer.nvidia.com/thrust
Good Books:
Programming in Parallel with CUDA
Richard Ansorge; ISBN: 978-1108479530
Programming Massively Parallel Processors
Hwu, Kirk, El Hajj; ISBN: 978-0323912310
Shader Toy: https://www.shadertoy.com/
Open Discussion
&
Q & A
