This document provides an introduction to the CUDA parallel computing platform from NVIDIA. It discusses CUDA hardware capabilities including GPUDirect, Dynamic Parallelism, and Hyper-Q. It then outlines three main programming approaches for CUDA: using libraries, OpenACC directives, and programming languages. It provides examples of libraries like cuBLAS and cuRAND. For OpenACC, it shows how adding directives to existing Fortran or C code parallelizes loops. For languages, it lists supported options such as CUDA C/C++, CUDA Fortran, and Python with PyCUDA. The document aims to give developers maximum flexibility in choosing the best approach to accelerate their applications using CUDA and GPUs.
Overview of CUDA as a parallel computing platform, including programming languages, libraries for app acceleration, and tools for easier development.
Describes three methods for application acceleration: drop-in libraries, OpenACC directives, and programming languages for maximum flexibility and efficient acceleration.
Overview of CUDA C/C++ including kernel launches, memory management, and essential programming constructs for utilizing GPU computing effectively.
Detailed explanation of executing parallel operations on the GPU including memory management, thread configuration, and kernel launches (a minimal CUDA sketch follows this list).
Explains shared memory utilization, synchronization between threads, error reporting, and device management to optimize CUDA performance.
Introduces Thrust as a high-level C++ parallel algorithms library and discusses its API, productivity, and performance portability.
Illustrates the use of CUDA libraries for drop-in acceleration in applications through straightforward coding strategies and examples.
Illustrates how to implement OpenACC directives for examples like SAXPY, demonstrating performance gains and kernel execution (an OpenACC sketch follows this list).
Discusses advanced OpenACC constructs such as directives, parallel and loop constructs, and optimization techniques for maximizing performance.
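To make the contrast between these approaches concrete, here are minimal SAXPY sketches in CUDA C and OpenACC. These are illustrations written for this summary, not code from the deck; d_x and d_y are assumed to be device pointers allocated elsewhere.

// CUDA C version: one thread per element computes y = a*x + y
__global__ void saxpy(int n, float a, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}
// launch enough 256-thread blocks to cover all n elements
// saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);

// OpenACC version: a directive asks the compiler to parallelize the
// existing loop, leaving the serial code otherwise untouched
void saxpy_acc(int n, float a, float *x, float *y)
{
    #pragma acc parallel loop
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}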
Rapid Parallel C++ Development
• Resembles C++ STL
• High-level interface
  – Enhances developer productivity
  – Enables performance portability between GPUs and multicore CPUs
• Flexible
  – CUDA, OpenMP, and TBB backends
  – Extensible and customizable
  – Integrates with existing software
• Open source

// generate 32M random numbers on host
thrust::host_vector<int> h_vec(32 << 20);
thrust::generate(h_vec.begin(), h_vec.end(), rand);
// transfer data to device (GPU)
thrust::device_vector<int> d_vec = h_vec;
// sort data on device
thrust::sort(d_vec.begin(), d_vec.end());
// transfer data back to host
thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
http://developer.nvidia.com/thrust or http://thrust.googlecode.com
What is Thrust?
• High-Level Parallel Algorithms Library
• Parallel Analog of the C++ Standard Template Library (STL)
• Performance-Portable Abstraction Layer
• Productive way to program CUDA
Example
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>
int main(void)
{
// generate 32M random numbers on the host
thrust::host_vector<int> h_vec(32 << 20);
thrust::generate(h_vec.begin(), h_vec.end(), rand);
// transfer data to the device
thrust::device_vector<int> d_vec = h_vec;
// sort data on the device
thrust::sort(d_vec.begin(), d_vec.end());
// transfer data back to host
thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
return 0;
}
Easy to Use
• Distributed with CUDA Toolkit
• Header-only library
• Architecture agnostic
• Just compile and run!
$ nvcc -O2 -arch=sm_20 program.cu -o program
Productivity
• Containers
  – host_vector
  – device_vector
• Memory Management
  – Allocation
  – Transfers
• Algorithm Selection
  – Location is implicit
// allocate host vector with two elements
thrust::host_vector<int> h_vec(2);
// copy host data to device memory
thrust::device_vector<int> d_vec = h_vec;
// write device values from the host
d_vec[0] = 27;
d_vec[1] = 13;
// read device values from the host
int sum = d_vec[0] + d_vec[1];
// invoke algorithm on device
thrust::sort(d_vec.begin(), d_vec.end());
// memory automatically released
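The "location is implicit" point above is worth spelling out: Thrust picks the backend from the iterators you pass it. A small sketch of my own, assuming the same headers as the example above, shows the identical algorithm dispatched to the host instead:

// the same call on host_vector iterators runs on the CPU
thrust::host_vector<int> h_vec(2);
h_vec[0] = 27;
h_vec[1] = 13;
thrust::sort(h_vec.begin(), h_vec.end()); // host backend, no GPU involved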
Productivity
• Large set of algorithms
  – ~75 functions
  – ~125 variations
• Flexible
  – User-defined types
  – User-defined operators
Algorithm         Description
reduce            Sum of a sequence
find              First position of a value in a sequence
mismatch          First position where two sequences differ
inner_product     Dot product of two sequences
equal             Whether two sequences are equal
min_element       Position of the smallest value
count             Number of instances of a value
is_sorted         Whether sequence is in sorted order
transform_reduce  Sum of transformed sequence
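As a sketch of the "user-defined operators" point (my own illustration, not code from the slides), a custom functor can be plugged into transform_reduce to compute a sum of squares on the device:

#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>
#include <cstdio>

// user-defined operator: squares its argument, on host or device
struct square
{
    __host__ __device__ float operator()(float x) const { return x * x; }
};

int main(void)
{
    thrust::device_vector<float> d_vec(3);
    d_vec[0] = 1.0f; d_vec[1] = 2.0f; d_vec[2] = 3.0f;
    // transform each element with square, then reduce with plus: 1 + 4 + 9
    float result = thrust::transform_reduce(d_vec.begin(), d_vec.end(),
                                            square(), 0.0f,
                                            thrust::plus<float>());
    printf("sum of squares = %f\n", result); // prints 14
    return 0;
}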
Portability
• Support for CUDA, TBB and OpenMP
  – Just recompile:

$ nvcc -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP

GPU run:
$ time ./monte_carlo
pi is approximately 3.14159
real 0m6.190s
user 0m6.052s
sys 0m0.116s

Multicore CPU run (OpenMP backend):
$ time ./monte_carlo
pi is approximately 3.14159
real 1m26.217s
user 11m28.383s
sys 0m0.020s

Hardware shown on the original slide: NVIDIA GeForce GTX 280, GeForce GTX 580, Intel Core2 Quad Q6600, Intel Core i7 2600K.
Backend System Options
• Device Systems
  – THRUST_DEVICE_SYSTEM_CUDA
  – THRUST_DEVICE_SYSTEM_OMP
  – THRUST_DEVICE_SYSTEM_TBB
• Host Systems
  – THRUST_HOST_SYSTEM_CPP
  – THRUST_HOST_SYSTEM_OMP
  – THRUST_HOST_SYSTEM_TBB
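A sketch of how these macros are typically passed on the command line (program.cu is a placeholder file name; note that the OpenMP and TBB backends also need their runtimes enabled or linked):

# retarget device algorithms to OpenMP on the multicore CPU
$ nvcc -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP -Xcompiler -fopenmp program.cu -o program
# retarget host algorithms to TBB
$ nvcc -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_TBB program.cu -o program -ltbb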
Multiple Backend Systems
• Mix different backends freely within the same app
#include <thrust/system/omp/vector.h>
#include <thrust/system/cuda/vector.h>
#include <thrust/reduce.h>
#include <thrust/sort.h>

thrust::omp::vector<float> my_omp_vec(100);
thrust::cuda::vector<float> my_cuda_vec(100);
...
// reduce in parallel on the CPU
thrust::reduce(my_omp_vec.begin(), my_omp_vec.end());
// sort in parallel on the GPU
thrust::sort(my_cuda_vec.begin(), my_cuda_vec.end());
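A possible continuation of the fragment above (my own sketch, reusing its vectors and assuming <thrust/copy.h> is also included): Thrust algorithms can move data between the two backends, since copy understands both systems.

// copy the sorted GPU data into the OpenMP (CPU-side) vector
thrust::copy(my_cuda_vec.begin(), my_cuda_vec.end(), my_omp_vec.begin());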