Rob Gillen
Intro to GPGPU Programming with CUDA
CodeStock is proudly partnered with: RecruitWise and Staff with Excellence - www.recruitwise.jobs
  • Send instant feedback on this session via Twitter: send a direct message with the room number to @CodeStock
    • d codestock 411 This guy is Amazing!
  • For more information on sending feedback using Twitter while at CodeStock, please see the “CodeStock README” in your CodeStock guide.
Welcome!
  • Goals:
    • Overview of GPGPU with CUDA
    • “Vision casting” for how you can use GPUs to improve your application
  • Outline:
    • Why GPGPUs?
    • Applications
    • Tooling
    • Hands-on: matrix multiplication
  • Rating: http://spkr8.com/t/7714
CPU vs. GPU
  • The GPU devotes more transistors to data processing
NVIDIA Fermi
  • ~1.5 TFLOPS (SP) / ~800 GFLOPS (DP)
  • 230 GB/s DRAM bandwidth
Motivation
  • Floating-point operations per second (FLOPS) and memory bandwidth for the CPU and GPU
Example: Sparse Matrix-Vector
  • CPU results from “Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms”, Williams et al., Supercomputing 2007
Rayleigh-Bénard Results
  • Double precision
  • 384 x 384 x 192 grid (max that fits in 4 GB)
  • Vertical slice of temperature at y = 0
  • Transition from stratified (left) to turbulent (right)
  • Regime depends on Rayleigh number: Ra = gαΔT/κν
  • 8.5x speedup versus Fortran code running on an 8-core 2.5 GHz Xeon
G80 Characteristics
  • 367 GFLOPS peak performance (25-50 times that of current high-end microprocessors)
  • 265 GFLOPS sustained for apps such as VMD
  • Massively parallel: 128 cores, 90 W
  • Massively threaded: sustains 1000s of threads per app
  • 30-100 times speedup over high-end microprocessors on scientific and media applications: medical imaging, molecular dynamics
Supercomputer Comparison
Applications
  • Exciting applications in the future mass computing market have traditionally been considered “supercomputing applications”
    • Molecular dynamics simulation, video and audio coding and manipulation, 3D imaging and visualization, consumer game physics, and virtual reality products
    • These “super-apps” represent and model the physical, concurrent world
  • Various granularities of parallelism exist, but…
    • The programming model must not hinder parallel implementation
    • Data delivery needs careful management
*Not* for all applications
  • SPMD (Single Program, Multiple Data) workloads are the best fit (data parallel)
  • Operations need to be of sufficient size to overcome overhead
  • Think millions of operations.
Raytracing
NVIRT: CUDA Ray Tracing API
Tooling
  • VS 2010 C++ (Express is OK… sort of.)
  • NVIDIA CUDA-capable GPU
  • NVIDIA CUDA Toolkit (v4+)
  • NVIDIA CUDA Tools (v4+)
  • GPU Computing SDK
  • NVIDIA Parallel Nsight
Parallel Debugging
Parallel Analysis
VS Project Templates
Before we get too excited…
  • Host vs. device
  • Kernels
    • __global__, __device__, __host__
  • Thread/block control
    • <<<x, y>>>
  • Multi-dimensioned coordinate objects
  • Memory management/movement
  • Thread management: think 1000s or 1,000,000s
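To make the qualifiers and the <<<x, y>>> launch syntax concrete, here is a minimal sketch. It is not taken from the slides: the names square, squarePlusOne, squarePlusOneKernel, and runSquarePlusOne are hypothetical, and the 256-thread block size is an arbitrary assumption.

```cuda
#include <cuda_runtime.h>
#include <vector>

// __host__ __device__: compiled for both the CPU and the GPU.
__host__ __device__ inline float square(float x) { return x * x; }

// __device__: callable only from code that is already running on the GPU.
__device__ float squarePlusOne(float x) { return square(x) + 1.0f; }

// __global__: a kernel, launched from the host and executed by many threads.
__global__ void squarePlusOneKernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {                                    // the last block may be partly idle
        out[i] = squarePlusOne(in[i]);
    }
}

// Hypothetical host wrapper: copy in, launch, copy out.
std::vector<float> runSquarePlusOne(const std::vector<float>& input)
{
    int n = static_cast<int>(input.size());
    std::vector<float> output(n);
    if (n == 0) return output;

    size_t bytes = n * sizeof(float);
    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, input.data(), bytes, cudaMemcpyHostToDevice);

    int threadsPerBlock = 256;                                  // assumed block size
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;   // round up
    squarePlusOneKernel<<<blocks, threadsPerBlock>>>(dIn, dOut, n);

    // A blocking cudaMemcpy on the default stream runs after the kernel has finished.
    cudaMemcpy(output.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return output;
}
```

The first launch argument is the number of blocks and the second is the number of threads per block; either can also be a dim3 for 2D or 3D layouts. Error checking is left out here to keep the sketch short.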
Block IDs and Threads
  • Each thread uses IDs to decide what data to work on
    • Block ID: 1D or 2D
    • Thread ID: 1D, 2D, or 3D
  • Simplifies memory addressing when processing multidimensional data
    • Image processing
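As an illustration of how block and thread IDs map onto 2D data, the sketch below inverts an 8-bit grayscale image with one thread per pixel. The image format, the names pixelOffset, invertImage, and launchInvert, and the 16 x 16 block shape are assumptions made for this example, not part of the talk.

```cuda
#include <cuda_runtime.h>

// Row-major offset of pixel (x, y) in an image that is `width` pixels wide.
__host__ __device__ inline int pixelOffset(int x, int y, int width)
{
    return y * width + x;
}

// Each thread inverts one pixel of an 8-bit grayscale image.
__global__ void invertImage(unsigned char* pixels, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column handled by this thread
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row handled by this thread
    if (x < width && y < height) {                  // skip threads that fall off the edge
        int idx = pixelOffset(x, y, width);
        pixels[idx] = 255 - pixels[idx];
    }
}

// Hypothetical launch helper: 16 x 16 threads per block, enough blocks to cover the image.
// dPixels must point to device memory.
void launchInvert(unsigned char* dPixels, int width, int height)
{
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    invertImage<<<grid, block>>>(dPixels, width, height);
}
```

Because the IDs are already two-dimensional, the kernel needs no division or modulo to recover a pixel's row and column.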
CUDA Thread Block
  • All threads in a block execute the same kernel program (SPMD)
  • Programmer declares block:
    • Block size: 1 to 512 concurrent threads
    • Block shape: 1D, 2D, or 3D
    • Block dimensions in threads
  • Threads have thread ID numbers within the block
    • Thread program uses thread ID to select work and address shared data
  • Threads in the same block share data and synchronize while doing their share of the work
  • Threads in different blocks cannot cooperate
    • Each block can execute in any order relative to other blocks!
  • [Figure: CUDA Thread Block; thread IDs 0, 1, 2, 3, …, m all run the same thread program]
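The sharing and synchronization within a block can be sketched with a per-block sum. This example is an assumption-laden illustration rather than code from the talk: blockSum and blocksFor are hypothetical names, and the kernel assumes it is launched with exactly BLOCK_SIZE threads per block, where BLOCK_SIZE is a power of two.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // assumed block size (power of two, within the 512-thread limit)

// Each block sums up to BLOCK_SIZE inputs into one partial result.
// Must be launched with blockDim.x == BLOCK_SIZE.
__global__ void blockSum(const float* in, float* partial, int n)
{
    __shared__ float cache[BLOCK_SIZE];   // shared by the threads of this block only

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    cache[tid] = (i < n) ? in[i] : 0.0f;  // pad the ragged last block with zeros
    __syncthreads();                      // every load is visible before any read

    // Tree reduction: halve the number of active threads at each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) {
            cache[tid] += cache[tid + stride];
        }
        __syncthreads();                  // finish this step before starting the next
    }

    if (tid == 0) {
        partial[blockIdx.x] = cache[0];   // one result per block; blocks never cooperate
    }
}

// Number of blocks (and therefore partial sums) needed for n elements.
inline int blocksFor(int n) { return (n + BLOCK_SIZE - 1) / BLOCK_SIZE; }
```

A launch would look like blockSum<<<blocksFor(n), BLOCK_SIZE>>>(dIn, dPartial, n), with the partial sums then added on the host or by a second kernel launch, since threads in different blocks cannot synchronize with each other.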
Transparent Scalability
  • Hardware is free to assign blocks to any processor at any time
    • A kernel scales across any number of parallel processors
  • Each block can execute in any order relative to other blocks.
  • [Figure: a kernel grid of Block 0 through Block 7 scheduled over time on two different devices]
A Simple Running Example: Matrix Multiplication
  • A simple matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs
    • Leave shared memory usage until later
    • Local, register usage
    • Thread ID usage
    • Memory data transfer API between host and device
    • Assume square matrix for simplicity
Programming Model: Square Matrix Multiplication Example
  • P = M * N of size WIDTH x WIDTH
  • Without tiling:
    • One thread calculates one element of P
    • M and N are loaded WIDTH times from global memory
  • [Figure: matrices M, N, and P, each WIDTH x WIDTH]
Memory Layout of a Matrix in C
  • [Figure: a 4 x 4 matrix M with elements M0,0 through M3,3, shown as a 2D grid and as the linear array that stores it in memory]
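C stores a 2D array row by row, so element (row, col) of a matrix with `width` columns lives at offset row * width + col. The helper names rowMajorIndex and fillExample below are hypothetical and exist only to show that mapping on a 4 x 4 matrix.

```cuda
#include <cuda_runtime.h>

// Offset of element (row, col) in a matrix with `width` columns, stored row by row.
__host__ __device__ inline int rowMajorIndex(int row, int col, int width)
{
    return row * width + col;
}

// Fill a 4 x 4 matrix (16 floats) so that element (row, col) holds 10 * row + col,
// which makes the position of each element easy to read back from the flat array.
inline void fillExample(float* m)
{
    const int width = 4;
    for (int row = 0; row < width; ++row) {
        for (int col = 0; col < width; ++col) {
            m[rowMajorIndex(row, col, width)] = 10.0f * row + col;
        }
    }
}
```

The same index expression appears in both the CPU and GPU versions of the matrix multiplication, which is why the host and device can share one flat buffer layout.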
Simple Matrix Multiplication (CPU)
    // Reference implementation on the host: P = M * N, all Width x Width, row-major
    void MatrixMulOnHost(float* M, float* N, float* P, int Width)
    {
        for (int i = 0; i < Width; ++i) {
            for (int j = 0; j < Width; ++j) {
                float sum = 0;
                for (int k = 0; k < Width; ++k) {
                    float a = M[i * Width + k];
                    float b = N[k * Width + j];
                    sum += a * b;
                }
                P[i * Width + j] = sum;
            }
        }
    }
  • [Figure: row i of M and column j of N are combined over k to produce element (i, j) of P]
Simple Matrix Multiplication (GPU)
    void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
    {
        int size = Width * Width * sizeof(float);
        float *Md, *Nd, *Pd;  // each needs its own '*', otherwise Nd and Pd are plain floats
        …
        // 1. Allocate and load M, N to device memory
        cudaMalloc(&Md, size);
        cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);

        cudaMalloc(&Nd, size);
        cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

        // Allocate P on the device
        cudaMalloc(&Pd, size);
Simple Matrix Multiplication (GPU)
        // 2. Kernel invocation code, to be shown later
        …
        // 3. Read P from the device
        cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

        // Free device matrices
        cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
    }
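The slide code ignores return values to stay readable. Every CUDA runtime call used above returns a cudaError_t, so one possible way to check the allocate/copy/free sequence is sketched below. The helpers cudaOk and roundTrip are hypothetical names introduced for this example.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical helper: report a failed CUDA runtime call and return false.
inline bool cudaOk(cudaError_t err, const char* what)
{
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Copies `count` floats to the device and straight back; returns true on success.
bool roundTrip(const float* src, float* dst, int count)
{
    size_t bytes = static_cast<size_t>(count) * sizeof(float);
    float* dBuf = nullptr;
    if (!cudaOk(cudaMalloc(&dBuf, bytes), "cudaMalloc")) return false;

    // && short-circuits, so the copy back is skipped if the copy over failed.
    bool ok = cudaOk(cudaMemcpy(dBuf, src, bytes, cudaMemcpyHostToDevice), "copy to device")
           && cudaOk(cudaMemcpy(dst, dBuf, bytes, cudaMemcpyDeviceToHost), "copy to host");

    cudaFree(dBuf);  // release device memory on both the success and the failure path
    return ok;
}
```

Kernel launches do not return an error code directly; calling cudaGetLastError() right after the launch is the usual way to pick up launch failures.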
Kernel Function
    // Matrix multiplication kernel: per-thread code
    __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
    {
        // Pvalue is used to store the element of the matrix
        // that is computed by the thread
        float Pvalue = 0;
Kernel Function (contd.)
        for (int k = 0; k < Width; ++k) {
            float Melement = Md[threadIdx.y * Width + k];
            float Nelement = Nd[k * Width + threadIdx.x];
            Pvalue += Melement * Nelement;
        }
        Pd[threadIdx.y * Width + threadIdx.x] = Pvalue;
    }
  • [Figure: thread (tx, ty) walks row ty of Md and column tx of Nd over k to produce one element of Pd]
Kernel Function (full)
    // Matrix multiplication kernel: per-thread code
    __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
    {
        // Pvalue is used to store the element of the matrix
        // that is computed by the thread
        float Pvalue = 0;

        for (int k = 0; k < Width; ++k) {
            float Melement = Md[threadIdx.y * Width + k];
            float Nelement = Nd[k * Width + threadIdx.x];
            Pvalue += Melement * Nelement;
        }
        Pd[threadIdx.y * Width + threadIdx.x] = Pvalue;
    }
Kernel Invocation (Host Side)
    // Set up the execution configuration
    dim3 dimGrid(1, 1);
    dim3 dimBlock(Width, Width);

    // Launch the device computation threads!
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
Only One Thread Block Used
  • One block of threads computes matrix Pd
    • Each thread computes one element of Pd
  • Each thread
    • Loads a row of matrix Md
    • Loads a column of matrix Nd
    • Performs one multiply and one addition for each pair of Md and Nd elements
    • Compute to off-chip memory access ratio close to 1:1 (not very high)
  • Size of matrix limited by the number of threads allowed in a thread block
  • [Figure: Grid 1 with a single Block 1; Thread (2, 2) reads from Md and Nd to compute one element of Pd; matrices are WIDTH wide]
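The size limit is easy to quantify: a Width x Width block needs Width * Width threads, so with the 512-thread limit quoted earlier the largest matrix a single block can handle is 22 x 22 (22 * 22 = 484, while 23 * 23 = 529). The sketch below does that arithmetic and, as an optional extra, reads the actual limit from device 0. Both function names are hypothetical.

```cuda
#include <cuda_runtime.h>

// Largest Width such that a Width x Width block fits in maxThreadsPerBlock threads.
inline int maxSingleBlockWidth(int maxThreadsPerBlock)
{
    int width = 0;
    while ((width + 1) * (width + 1) <= maxThreadsPerBlock) {
        ++width;
    }
    return width;
}

// Same calculation using the limit reported by device 0.
// Returns 0 if the device properties cannot be queried.
inline int maxSingleBlockWidthOnDevice()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return 0;
    return maxSingleBlockWidth(prop.maxThreadsPerBlock);
}
```

Anything larger than this needs more than one block, which is what the multi-block version of the kernel addresses.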
Handling Arbitrary Sized Square Matrices
  • Have each 2D thread block compute a (TILE_WIDTH)^2 sub-matrix (tile) of the result matrix
    • Each block has (TILE_WIDTH)^2 threads
  • Generate a 2D grid of (WIDTH/TILE_WIDTH)^2 blocks
  • You still need to put a loop around the kernel call for cases where WIDTH/TILE_WIDTH is greater than the max grid size (64K)!
  • [Figure: Md, Nd, and Pd, each WIDTH x WIDTH; block indices bx, by and thread indices tx, ty select one TILE_WIDTH x TILE_WIDTH tile of Pd]
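A multi-block version of the kernel along these lines might look like the sketch below. It is an illustration rather than the speaker's code: the kernel name MatrixMulKernelTiled, the helpers tilesPerSide and launchTiled, and TILE_WIDTH = 16 are assumptions. The tile here only partitions the output; shared memory is still not used, and the outer loop for grids beyond the maximum size is not shown.

```cuda
#include <cuda_runtime.h>

#define TILE_WIDTH 16  // assumed tile size: 16 x 16 = 256 threads per block

// Each thread computes one element of Pd; the block index selects the tile.
// Must be launched with TILE_WIDTH x TILE_WIDTH threads per block.
__global__ void MatrixMulKernelTiled(const float* Md, const float* Nd, float* Pd, int Width)
{
    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;

    // Width need not be a multiple of TILE_WIDTH, so edge threads may have no work.
    if (row < Width && col < Width) {
        float Pvalue = 0;
        for (int k = 0; k < Width; ++k) {
            Pvalue += Md[row * Width + k] * Nd[k * Width + col];
        }
        Pd[row * Width + col] = Pvalue;
    }
}

// Blocks needed along one side of the grid, rounded up.
inline int tilesPerSide(int Width) { return (Width + TILE_WIDTH - 1) / TILE_WIDTH; }

// Hypothetical launch helper; Md, Nd, and Pd must be device pointers.
void launchTiled(const float* Md, const float* Nd, float* Pd, int Width)
{
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
    dim3 dimGrid(tilesPerSide(Width), tilesPerSide(Width));
    MatrixMulKernelTiled<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
}
```

Rounding the grid size up and guarding with the bounds check is one common way to handle widths that are not a multiple of the tile size; the slide's formula assumes WIDTH divides evenly.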
Small Example
  • TILE_WIDTH = 2
  • [Figure: a 4 x 4 result matrix Pd (elements Pd0,0 through Pd3,3) split into four 2 x 2 tiles computed by Block(0,0), Block(1,0), Block(0,1), and Block(1,1), shown with the Md and Nd elements they read]
Cleanup Topics
  • Memory management
    • Pinned memory (zero-transfer)
    • Portable pinned memory
  • Multi-GPU
  • Wrappers (Python, Java, .NET)
  • Kernels
    • Atomics
    • Thread synchronization (staged reductions)
  • NVCC
Questions?
  • rob@gillenfamily.net
  • @argodev
  • http://rob.gillenfamily.net
  • Rate: http://spkr8.com/t/7714


Editor's Notes

  • #10 Sparse linear algebra is interesting both because many science and engineering codes rely on it, and because it was traditionally assumed to be something that GPUs would not be good at (because of irregular data access patterns). We have shown that GPUs are in fact extremely good at sparse matrix-vector multiply (SpMV), which is the basic building block of sparse linear algebra. The code and an accompanying white paper are available on the CUDA forums and are also posted on research.nvidia.com. This is compared to an extremely well-studied, well-optimized SpMV implementation from a widely respected paper in Supercomputing 2007. That paper only reported double-precision results for CPUs; our single-precision results are even more impressive in comparison.
  • #11 Compared to highly optimized Fortran code from an oceanography researcher at UCLA
  • #16 Current implementation uses short-stack approach. Top elements of the stack are cached in registers.
  • #17 RTAPI enables implementation of many different raytracing flavors. Left to right, top to bottom: procedural materials, ambient occlusion, Whitted raytracer (thin shell glass and metallic spheres), path tracer (Cornell box), refractions, Cook-style distribution raytracing. Could also do non-rendering stuff, e.g. GIS (line of sight, say), physics (collision/proximity detection).