Intro to GPGPU Programming with CUDA

Rob Gillen

CodeStock is proudly partnered with: RecruitWise and Staff with Excellence - www.recruitwise.jobs
Send instant feedback on this session via Twitter: send a direct message with the room number to @CodeStock
Example: d codestock 411 This guy is Amazing!
For more information on sending feedback using Twitter while at CodeStock, please see the “CodeStock README” in your CodeStock guide.

Welcome!
Goals:
- Overview of GPGPU with CUDA
- “Vision casting” for how you can use GPUs to improve your application
Outline:
- Why GPGPUs?
- Applications
- Tooling
- Hands-On: Matrix Multiplication
Rating: http://spkr8.com/t/7714

CPU vs. GPU
- GPU devotes more transistors to data processing

NVIDIA Fermi
- ~1.5 TFLOPS (SP) / ~800 GFLOPS (DP)
- 230 GB/s DRAM bandwidth

Motivation
- Floating-point operations per second (FLOPS) and memory bandwidth for the CPU and GPU

Example: Sparse Matrix-Vector
- CPU results from “Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms”, Williams et al., Supercomputing 2007

Rayleigh-Bénard Results
- Double precision
- 384 x 384 x 192 grid (max that fits in 4 GB)
- Vertical slice of temperature at y=0
- Transition from stratified (left) to turbulent (right)
- Regime depends on Rayleigh number: Ra = gαΔT/κν
- 8.5x speedup versus Fortran code running on 8-core 2.5 GHz Xeon

G80 Characteristics
- 367 GFLOPS peak performance (25-50 times that of current high-end microprocessors)
- 265 GFLOPS sustained for apps such as VMD
- Massively parallel: 128 cores, 90 W
- Massively threaded: sustains 1000s of threads per app
- 30-100 times speedup over high-end microprocessors on scientific and media applications: medical imaging, molecular dynamics

Applications
- Exciting applications in the future mass computing market have traditionally been considered “supercomputing applications”
  - Molecular dynamics simulation, video and audio coding and manipulation, 3D imaging and visualization, consumer game physics, and virtual reality products
  - These “super-apps” represent and model the physical, concurrent world
- Various granularities of parallelism exist, but…
  - the programming model must not hinder parallel implementation
  - data delivery needs careful management

*Not* for all applications
- SPMD (Single Program, Multiple Data) workloads are best (data parallel)
- Operations need to be of sufficient size to overcome overhead
- Think millions of operations

Tooling
- VS 2010 C++ (Express is OK… sort of)
- NVIDIA CUDA-capable GPU
- NVIDIA CUDA Toolkit (v4+)
- NVIDIA CUDA Tools (v4+)
- GPU Computing SDK
- NVIDIA Parallel Nsight

Before we get too excited…
- Host vs. Device
- Kernels: __global__, __device__, __host__
- Thread/Block Control: <<<x, y>>>
- Multi-dimensioned coordinate objects
- Memory Management/Movement
- Thread Management - think 1000s or 1,000,000s

Block IDs and Threads
- Each thread uses IDs to decide what data to work on
  - Block ID: 1D or 2D
  - Thread ID: 1D, 2D, or 3D
- Simplifies memory addressing when processing multidimensional data
  - Image processing
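The way a thread turns its IDs into a data index can be checked on the CPU. The sketch below is plain C, not CUDA: the loop variables block_id and thread_id are stand-ins for CUDA's built-in blockIdx.x and threadIdx.x, block_dim stands in for blockDim.x, and both function names are hypothetical.

```c
/* Global element index for a 1D launch: block_id * block_dim + thread_id.
   In a real kernel these would be blockIdx.x, blockDim.x and threadIdx.x. */
int global_index(int block_id, int block_dim, int thread_id)
{
    return block_id * block_dim + thread_id;
}

/* CPU emulation of a 1D launch: every (block, thread) pair adds 1 to the
   element it owns. The bounds check mirrors the guard a kernel needs when
   n is not a multiple of block_dim. */
void add_one_emulated(float *data, int n, int num_blocks, int block_dim)
{
    for (int block_id = 0; block_id < num_blocks; ++block_id) {
        for (int thread_id = 0; thread_id < block_dim; ++thread_id) {
            int i = global_index(block_id, block_dim, thread_id);
            if (i < n) {
                data[i] += 1.0f;
            }
        }
    }
}
```

Calling add_one_emulated(data, 10, 3, 4) visits 12 (block, thread) pairs; the last two fall outside the array and are skipped by the guard, so each of the 10 elements is touched exactly once.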

CUDA Thread Block
- All threads in a block execute the same kernel program (SPMD)
- Programmer declares block:
  - Block size 1 to 512 concurrent threads
  - Block shape 1D, 2D, or 3D
  - Block dimensions in threads
- Threads have thread ID numbers within block
  - Thread program uses thread ID to select work and address shared data
- Threads in the same block share data and synchronize while doing their share of the work
- Threads in different blocks cannot cooperate
  - Each block can execute in any order relative to other blocks!
[Figure: a CUDA thread block; threads with IDs 0, 1, 2, 3, … m each run the thread program]
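The block-size limit and the 1D/2D/3D thread numbering can be sketched as two small helpers. This is plain C for illustration only: the function names are hypothetical, the 512 limit is the figure quoted on the slide, and the check ignores any separate per-dimension limits a given device may impose.

```c
/* Upper bound quoted on the slide for this hardware generation. */
#define MAX_THREADS_PER_BLOCK 512

/* Returns 1 if a block of dim_x * dim_y * dim_z threads fits the limit. */
int block_shape_is_valid(int dim_x, int dim_y, int dim_z)
{
    if (dim_x < 1 || dim_y < 1 || dim_z < 1) {
        return 0;
    }
    return dim_x * dim_y * dim_z <= MAX_THREADS_PER_BLOCK;
}

/* Linear thread ID within a block: x varies fastest, then y, then z. */
int linear_thread_id(int x, int y, int z, int dim_x, int dim_y)
{
    return x + y * dim_x + z * dim_x * dim_y;
}
```

Under this limit a 16 x 16 x 2 block is accepted (512 threads) while a 32 x 32 block is rejected (1024 threads).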

Transparent Scalability
- Hardware is free to assign blocks to any processor at any time
  - A kernel scales across any number of parallel processors
- Each block can execute in any order relative to other blocks
[Figure: a kernel grid of Block 0 through Block 7 mapped onto two different devices over time]
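Because blocks do not depend on one another, the result should not change with the order in which they run. A minimal CPU sketch in plain C, assuming a hypothetical per-element squaring "kernel"; run_block and run_grid_in_order are illustrative names, not CUDA API calls:

```c
/* Work of one emulated block: each "thread" squares the element it owns. */
void run_block(const float *in, float *out, int n, int block_id, int block_dim)
{
    for (int thread_id = 0; thread_id < block_dim; ++thread_id) {
        int i = block_id * block_dim + thread_id;
        if (i < n) {
            out[i] = in[i] * in[i];
        }
    }
}

/* Runs the blocks in whatever order the caller supplies, standing in for
   a scheduler that is free to pick any order. */
void run_grid_in_order(const float *in, float *out, int n, int block_dim,
                       const int *block_order, int num_blocks)
{
    for (int b = 0; b < num_blocks; ++b) {
        run_block(in, out, n, block_order[b], block_dim);
    }
}
```

Running the same four blocks in the orders {0, 1, 2, 3} and {3, 1, 0, 2} fills the output array with identical values, which is the property that lets the hardware schedule blocks freely.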

A Simple Running Example: Matrix Multiplication
- A simple matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs
  - Leave shared memory usage until later
  - Local, register usage
  - Thread ID usage
  - Memory data transfer API between host and device
  - Assume square matrix for simplicity

Programming Model: Square Matrix Multiplication Example
- P = M * N of size WIDTH x WIDTH
- Without tiling:
  - One thread calculates one element of P
  - M and N are loaded WIDTH times from global memory
[Figure: M, N, and P, each WIDTH x WIDTH]
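A quick count makes the cost of the untiled scheme concrete. This is a back-of-the-envelope derivation, assuming every read goes to global memory with no caching or reuse between threads:

```latex
\begin{align*}
\text{loads per thread} &= 2\,\mathrm{WIDTH} \\
\text{number of threads} &= \mathrm{WIDTH}^2 \\
\text{total global loads} &= 2\,\mathrm{WIDTH}^3 \\
\text{distinct elements of } M \text{ and } N &= 2\,\mathrm{WIDTH}^2 \\
\text{loads per element} &= \frac{2\,\mathrm{WIDTH}^3}{2\,\mathrm{WIDTH}^2} = \mathrm{WIDTH}
\end{align*}
```

Each element is therefore fetched WIDTH times, which is the redundancy that tiling is meant to reduce.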

Memory Layout of Matrix in C
[Figure: the 4 x 4 matrix M (elements M0,0 through M3,3) shown as a 2D grid and as the linear, row-major sequence in which C stores it]
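Row-major storage means a 2D element is found with a single multiply and add. A minimal sketch in plain C; the helper names row_major_offset and matrix_at are hypothetical and only illustrate the indexing used in the code that follows:

```c
/* Offset of element (row, col) in a row-major matrix with `width` columns. */
int row_major_offset(int row, int col, int width)
{
    return row * width + col;
}

/* Reads element (row, col) from a square matrix stored as a flat array. */
float matrix_at(const float *m, int row, int col, int width)
{
    return m[row_major_offset(row, col, width)];
}
```

For a 4 x 4 matrix, element (2, 1) lives at offset 2 * 4 + 1 = 9 in the flat array.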

Simple Matrix Multiplication (CPU)
    void MatrixMulOnHost(float* M, float* N, float* P, int Width)
    {
        for (int i = 0; i < Width; ++i) {          // row of P
            for (int j = 0; j < Width; ++j) {      // column of P
                float sum = 0;
                for (int k = 0; k < Width; ++k) {  // dot product of row i and column j
                    float a = M[i * Width + k];
                    float b = N[k * Width + j];
                    sum += a * b;
                }
                P[i * Width + j] = sum;
            }
        }
    }
[Figure: row i of M and column j of N combine into element (i, j) of P; all matrices are WIDTH x WIDTH]

Simple Matrix Multiplication (GPU)
    void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
    {
        int size = Width * Width * sizeof(float);
        float *Md, *Nd, *Pd;
        …
        // 1. Allocate and load M, N to device memory
        cudaMalloc(&Md, size);
        cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
        cudaMalloc(&Nd, size);
        cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
        // Allocate P on the device
        cudaMalloc(&Pd, size);

Simple Matrix Multiplication (GPU)
        // 2. Kernel invocation code - to be shown later
        …
        // 3. Read P from the device
        cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
        // Free device matrices
        cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
    }

Kernel Function
    // Matrix multiplication kernel - per thread code
    __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
    {
        // Pvalue is used to store the element of the matrix
        // that is computed by the thread
        float Pvalue = 0;

Kernel Function (contd.)
        for (int k = 0; k < Width; ++k)
        {
            float Melement = Md[threadIdx.y*Width+k];
            float Nelement = Nd[k*Width+threadIdx.x];
            Pvalue += Melement * Nelement;
        }
        Pd[threadIdx.y*Width+threadIdx.x] = Pvalue;
    }
[Figure: thread (tx, ty) reads row ty of Md and column tx of Nd to produce element (tx, ty) of Pd; all matrices are WIDTH x WIDTH]

Kernel Function (full)
    // Matrix multiplication kernel - per thread code
    __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
    {
        // Pvalue is used to store the element of the matrix
        // that is computed by the thread
        float Pvalue = 0;
        for (int k = 0; k < Width; ++k)
        {
            float Melement = Md[threadIdx.y*Width+k];
            float Nelement = Nd[k*Width+threadIdx.x];
            Pvalue += Melement * Nelement;
        }
        Pd[threadIdx.y*Width+threadIdx.x] = Pvalue;
    }

Kernel Invocation (Host Side)
    // Set up the execution configuration
    dim3 dimGrid(1, 1);
    dim3 dimBlock(Width, Width);
    // Launch the device computation threads!
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

Only One Thread Block Used
- One block of threads computes matrix Pd
  - Each thread computes one element of Pd
- Each thread
  - Loads a row of matrix Md
  - Loads a column of matrix Nd
  - Performs one multiply and one addition for each pair of Md and Nd elements
  - Compute to off-chip memory access ratio close to 1:1 (not very high)
- Size of matrix limited by the number of threads allowed in a thread block
[Figure: Grid 1 contains a single Block 1; Thread (2, 2) reads from Md and Nd to compute its element of the WIDTH-wide Pd]
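Both limits on this slide reduce to simple arithmetic. The sketch below is plain C with hypothetical helper names; it assumes the 512-threads-per-block limit quoted earlier and counts one Md load, one Nd load, one multiply, and one add per loop iteration.

```c
/* Largest Width such that a Width x Width block stays within the
   per-block thread limit (512 on the hardware described here). */
int max_width_single_block(int max_threads_per_block)
{
    int width = 0;
    while ((width + 1) * (width + 1) <= max_threads_per_block) {
        ++width;
    }
    return width;
}

/* Global-memory loads issued by one thread: one Md and one Nd element
   per loop iteration. */
int loads_per_thread(int width)
{
    return 2 * width;
}

/* Floating-point operations by one thread: one multiply and one add
   per loop iteration. */
int flops_per_thread(int width)
{
    return 2 * width;
}
```

With a limit of 512 threads the largest single-block matrix is 22 x 22 (484 threads; 23 x 23 would need 529), and loads and floating-point operations grow in lockstep, which is the 1:1 ratio noted above.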

Handling Arbitrary Sized Square Matrices
- Have each 2D thread block compute a (TILE_WIDTH)^2 sub-matrix (tile) of the result matrix
  - Each block has (TILE_WIDTH)^2 threads
- Generate a 2D grid of (WIDTH/TILE_WIDTH)^2 blocks
- You still need to put a loop around the kernel call for cases where WIDTH/TILE_WIDTH is greater than the max grid size (64K)!
[Figure: Pd divided into TILE_WIDTH x TILE_WIDTH tiles; block (bx, by) and thread (tx, ty) select one element, computed from Md and Nd; all matrices are WIDTH x WIDTH]
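The block-and-thread to element mapping for the tiled layout can be emulated on the CPU. This is a plain C sketch, not the kernel itself: the four loop variables by, bx, ty, and tx stand in for blockIdx.y, blockIdx.x, threadIdx.y, and threadIdx.x, the function name is hypothetical, and width is assumed to be a multiple of TILE_WIDTH.

```c
#define TILE_WIDTH 2

/* CPU emulation of a tiled launch: a grid of (width / TILE_WIDTH)^2 blocks,
   each with TILE_WIDTH^2 threads. Assumes width is a multiple of TILE_WIDTH. */
void matrix_mul_tiled_emulated(const float *M, const float *N, float *P, int width)
{
    int grid_dim = width / TILE_WIDTH;   /* blocks per grid dimension */

    for (int by = 0; by < grid_dim; ++by)
    for (int bx = 0; bx < grid_dim; ++bx)
    for (int ty = 0; ty < TILE_WIDTH; ++ty)
    for (int tx = 0; tx < TILE_WIDTH; ++tx) {
        /* Each (block, thread) pair owns exactly one element of P. */
        int row = by * TILE_WIDTH + ty;
        int col = bx * TILE_WIDTH + tx;

        float p_value = 0.0f;
        for (int k = 0; k < width; ++k) {
            p_value += M[row * width + k] * N[k * width + col];
        }
        P[row * width + col] = p_value;
    }
}
```

In a kernel, the body of the innermost loop would be the per-thread code, with row and col replacing the threadIdx-only indexing of the single-block version. Multiplying any 4 x 4 matrix by the identity with this routine returns the original matrix, a quick check that every element is covered once.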

Small Example
- TILE_WIDTH = 2
[Figure: a 4 x 4 Pd (elements Pd0,0 through Pd3,3) computed from Md and Nd by four 2 x 2 blocks: Block(0,0), Block(1,0), Block(0,1), and Block(1,1)]

Cleanup Topics
- Memory Management
- Pinned Memory (Zero-Transfer)
- Portable Pinned Memory
- Multi-GPU
- Wrappers (Python, Java, .NET)
- Kernels
- Atomics
- Thread Synchronization (staged reductions)
- NVCC

Questions?
- rob@gillenfamily.net
- @argodev
- http://rob.gillenfamily.net
- Rate: http://spkr8.com/t/7714
