Introduction to
CUDA Programming
Hemant Shukla
hshukla@lbl.gov
Trends

Scientific Data Deluge
LSST     0.5 PB/month
JGI      5 TB/yr *
LOFAR    500 GB/s
SKA      100 x LOFAR
* Jeff Broughton (NERSC) and JGI

Energy Efficiency
Exascale will need a 1000x performance enhancement with only 10x more energy consumption (Flops/watt)

Traditional sources of performance are flat-lining
Figure courtesy of Kunle Olukotun, Lance Hammond, Herb Sutter, and Burton Smith
Developments
Industry
Emergence of more cores on single chips
The number of cores per chip doubles every two years
Systems with millions of concurrent threads
Systems with inter and intra-chip parallelism
	
  
Architectural designs driven by reduction in Energy Consumption
New Parallel Programming models, languages, frameworks, …
Academia
Graphics Processing Units (GPUs) are adopted as co-processors for high performance computing
Architectural Differences
[Figure: CPU vs. GPU block diagrams - ALU, Cache, Control Logic, DRAM]

CPU                                     GPU
Less than 20 cores                      512 cores
1-2 threads per core                    10s to 100s of threads per core
Latency is hidden by a large cache      Latency is hidden by fast context switching
GPUs don’t run without CPUs
CPUs vs. GPUs
Silly debate… It’s all about Cores
Next phase of HPC has been touted as “Disruptive”
Future HPC is massively parallel and likely on hybrid architectures
Programming models may not resemble the current state
Embrace change and brace for impact
Write modular, adaptable and easily mutative applications
Build auto-code generators, auto-tuning tools, frameworks, libraries
Use this opportunity to learn how to efficiently program massively parallel
systems
Applications

X-ray computed tomography (Alain Bonissent et al.)
Total volume 560 x 560 x 960 pixels, 360 projections
Speed up = 110x

N-body with SCDM (K. Nitadori et al.)
4.5 giga-particles, R = 630 Mpc
2000x more volume than Kawai et al.

EoR with diesel powered radio interferometry (Lincoln Greenhill et al.)
512 antennas, correlated visibilities for 130,000 baseline pairs, each with 768
channels and 4 polarizations ~ 20 Tflops. Power budget 20 kW.

INTEL Core2 Quad 2.66GHz = 1121 ms
NVIDIA GPU C1060         = 103.4 ms
  
GPU
GPU H/W Example
NVIDIA FERMI

[Figure: Fermi chip layout - Streaming Multiprocessor (SM) with L1 cache / shared memory, and L2 cache]

16 Streaming Multiprocessors (SM)
512 CUDA cores (32/SM)
IEEE 754-2008 floating point (DP and SP)
6 GB GDDR5 DRAM (Global Memory)
ECC Memory support
Two DMA interfaces
L2 Cache 768 KB
Reconfigurable L1 Cache and Shared Memory (48 KB / 16 KB)
Load/Store address width 64 bits; can calculate addresses for 16 threads per clock
Programming Models
CUDA (Compute Unified Device Architecture)
OpenACC
OpenCL
Microsoft's DirectCompute
Third party wrappers are also available for Python, Perl, Fortran,
Java, Ruby, Lua, MATLAB and IDL, and Mathematica
Compilers from PGI, RCC, HMPP, Copperhead
CUDA
CUDA Device Driver
CUDA Toolkit (compiler, debugger, profiler, lib)
CUDA SDK (examples)
Windows, Mac OS, Linux

Parallel Computing Architecture
[Figure: CUDA software stack - Application (C/C++, FORTRAN, Java, Python, OpenCL, DX Compute)
-> CUDA Runtime and Device Driver -> nvcc C/C++ Compiler -> NVIDIA Assembly (GPU code) and
Host Assembly (CPU code) + Libraries -> NVIDIA CUDA Compatible GPU]
Libraries – FFT, Sparse Matrix, BLAS, RNG, CUSP, Thrust…
Dataflow
[Figure: Host (CPU) with host memory and Device (GPU) with device memory, connected by the PCIe bus]

1. Data is copied from the host memory to the device memory via the PCIe bus
2. The host launches a kernel on the device
3. The kernel is executed by multiple threads concurrently
4. The data within the device is accessed by threads through the memory hierarchy
5. The results are moved back to the device memory and are transferred back to the host via the PCIe bus
S/W Abstraction
Kernel is executed by threads, each processed by a CUDA core

Threads
Blocks
Grids

512-1024 threads per block
Maximum 8 blocks per SM
32 parallel threads are executed at the same time in a warp
One grid per kernel, with multiple concurrent kernels

[Figure: threads grouped into blocks, blocks scheduled onto an SM]
Memory Hierarchy
[Figure: local memory per thread, shared memory per block, and global/constant memory visible to all grids]

Private memory
    Visible only to the thread

Shared memory
    Visible to all the threads in a block

Global memory
    Visible to all the threads
    Visible to host
    Accessible to multiple kernels
    Data is stored in row major order

Registers

Constant memory (Read Only)
    Visible to all the threads in a block
CUDA API Examples
Which GPU do I have?
#include <stdio.h>

int main()
{
  int noOfDevices;

  /* get the number of devices */
  cudaGetDeviceCount (&noOfDevices);

  cudaDeviceProp prop;
  for (int i = 0; i < noOfDevices; i++)
  {
    /* get device properties */
    cudaGetDeviceProperties (&prop, i);
    printf ("Device Name:\t %s\n",          prop.name);
    printf ("Total global memory:\t %ld\n", prop.totalGlobalMem);
    printf ("No. of SMs:\t %d\n",           prop.multiProcessorCount);
    printf ("Shared memory / SM:\t %ld\n",  prop.sharedMemPerBlock);
    printf ("Registers / SM:\t %d\n",       prop.regsPerBlock);
  }
  return 1;
}

Use
cudaGetDeviceCount
cudaGetDeviceProperties
For more properties see struct cudaDeviceProp
For details see the CUDA Reference Manual

Compilation
> nvcc whatDevice.cu -o whatDevice

Output
Device Name:          Tesla C2050
Total global memory:  2817720320
No. of SMs:           14
Shared memory / SM:   49152
Registers / SM:       32768
Timing with CUDA Event API
int main ()
{
  cudaEvent_t start, stop;
  float time;

  cudaEventCreate (&start);
  cudaEventCreate (&stop);

  cudaEventRecord (start, 0);
  someKernel <<<grids, blocks, 0, 0>>> (...);
  cudaEventRecord (stop, 0);
  cudaEventSynchronize (stop);    // ensures kernel execution has completed

  cudaEventElapsedTime (&time, start, stop);

  cudaEventDestroy (start);
  cudaEventDestroy (stop);

  printf ("Elapsed time %f sec\n", time * 0.001);
  return 1;
}

CUDA Event API timers are,
- OS independent
- High resolution
- Useful for timing asynchronous calls

Standard CPU timers will not measure the timing information of the device.
Memory Allocations / Copies
int main ()
{
  ...
  float host_signal[N], host_result[N];
  float *device_signal, *device_result;

  // allocate memory on the device (GPU)
  cudaMalloc ((void**) &device_signal, N * sizeof(float));
  cudaMalloc ((void**) &device_result, N * sizeof(float));

  ... // get data for the host_signal array

  // copy the host_signal array to the device
  cudaMemcpy (device_signal, host_signal, N * sizeof(float), cudaMemcpyHostToDevice);

  someKernel <<<grid, block>>> (...);

  // copy the result back from the device to the host
  cudaMemcpy (host_result, device_result, N * sizeof(float), cudaMemcpyDeviceToHost);

  // display the results
  ...

  cudaFree (device_signal); cudaFree (device_result);
}

Host and device have separate physical memory
Cannot dereference host pointers on the device and vice versa
cudaError_t cudaMemcpyAsync (void *dst, const void *src, size_t count,
                             enum cudaMemcpyKind kind, cudaStream_t stream)

cudaMemcpyAsync() is asynchronous with respect to the host. The call may return before the copy
is complete. It only works on page-locked host memory and returns an error if a pointer to pageable
memory is passed as input.
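As a hedged sketch of how this might look in practice (the buffer names, the size N, and the single
stream are illustrative assumptions, not part of the slide's example), page-locked memory is allocated
with cudaMallocHost and the copy is issued on a stream:

// Hypothetical sketch: pinned host buffer + asynchronous copy on a stream
float *h_pinned, *d_buf;
cudaStream_t stream;

cudaStreamCreate (&stream);
cudaMallocHost ((void**) &h_pinned, N * sizeof(float));   // page-locked host memory
cudaMalloc     ((void**) &d_buf,    N * sizeof(float));

// returns immediately; the copy proceeds in the background on 'stream'
cudaMemcpyAsync (d_buf, h_pinned, N * sizeof(float), cudaMemcpyHostToDevice, stream);
someKernel <<<grid, block, 0, stream>>> (d_buf);

cudaStreamSynchronize (stream);   // wait for the copy and the kernel to finish

cudaFreeHost (h_pinned);
cudaFree (d_buf);
cudaStreamDestroy (stream);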
Basic Memory Methods
cudaError_t cudaMalloc (void **devPtr, size_t size)

Allocates size bytes of linear memory on the device and returns in *devPtr a pointer to the
allocated memory. In case of failure cudaMalloc() returns cudaErrorMemoryAllocation.

cudaError_t cudaMemcpy (void *dst, const void *src, size_t count, enum cudaMemcpyKind kind)

Copies count bytes from the memory area pointed to by src to the memory area pointed to by
dst. The argument kind is one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice,
cudaMemcpyDeviceToHost, or cudaMemcpyDeviceToDevice, and specifies the direction of the
copy. cudaMemcpy is a blocking call; cudaMemcpyAsync (above) is non-blocking.
See also, cudaMemset, cudaFree, ...
Kernel
The CUDA kernel is,
Run on device
Defined by __global__ qualifier and does not return anything
__global__ void someKernel ();
Executed asynchronously by the host with <<< >>> qualifier, for example,
someKernel <<<nGrid, nBlocks, sharedMemory, streams>>> (...)
someKernel <<<nGrid, nBlocks>>> (...)
The kernel launches a 1- or 2-D grid of 1-, 2- or 3-D blocks of threads
Each thread executes the same kernel in parallel (SIMT)
Threads within blocks can communicate via shared memory
Threads within blocks can be synchronized
Grids and blocks are of type struct dim3
Built-in variables gridDim, blockDim, threadIdx, blockIdx are used to
traverse across the device memory space with multi-dimensional indexing
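As an illustrative sketch (the kernel, the array, and the sizes below are assumptions, not from the
slides), a 2-D grid of 2-D blocks could be declared with dim3 and indexed with the built-in variables:

// Hypothetical 2-D launch: one thread per element of an NX x NY array
__global__ void scale2D (float *data, int nx, int ny, float factor)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < nx && y < ny)                       // guard threads past the array edge
        data[y * nx + x] *= factor;             // row-major indexing
}

dim3 block (16, 16);                            // 256 threads per block
dim3 grid ((NX + block.x - 1) / block.x,        // round up to cover all elements
           (NY + block.y - 1) / block.y);
scale2D <<<grid, block>>> (d_data, NX, NY, 2.0f);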
Grids, Blocks and Threads
<<< number of blocks in a grid, number of threads per block >>>

someKernel<<< 1, 1 >>> ();
  gridDim.x = 1
  blockDim.x = 1
  blockIdx.x = 0
  threadIdx.x = 0

dim3 blocks (2,1,1);
someKernel<<< blocks, 4 >>> ();
  gridDim.x = 2
  blockDim.x = 4
  blockIdx.x = 0,1
  threadIdx.x = 0,1,2,3,0,1,2,3

[Figure: a grid of two blocks - block (0,0) and block (1,0) - each with four threads]

Useful for multidimensional indexing and creating unique thread IDs
int index = threadIdx.x + blockDim.x * blockIdx.x;
Thread Indices
Array traversal
blockDim.x = 4
blockIdx.x = 0
threadIdx.x = 0, 1, 2, 3
Index = 0, 1, 2, 3
blockDim.x = 4
blockIdx.x = 1
threadIdx.x = 0, 1, 2, 3
Index = 4, 5, 6, 7
int index = threadIdx.x + blockDim.x * blockIdx.x;
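A minimal sketch of how this index is typically used to traverse an array (the kernel name and the
bounds check are illustrative assumptions, not from the slides):

// Hypothetical example: each thread handles one array element
__global__ void addOne (int *data, int n)
{
    int index = threadIdx.x + blockDim.x * blockIdx.x;
    if (index < n)            // blocks may overshoot the array length
        data[index] += 1;
}

// e.g. 8 elements with 4 threads per block -> 2 blocks
addOne <<<2, 4>>> (d_data, 8);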
Example - Inner Product
Matrix-multiplication

[Figure: A x B = C, each matrix N by N]

Each element of the product matrix C is generated by row-column multiplication and
reduction of matrices A and B. This operation is similar to the inner product of the
vector multiplication kind, also known as the vector dot product.

For N x N matrices the matrix-multiplication C = A x B is equivalent to
N² independent (hence parallel) inner products.
Example
c = Σi ai bi

[Figure: elementwise products ai x bi summed into the scalar c]

Serial representation

double c = 0.0;
for (int i = 0; i < SIZE; i++)
  c += a[i] * b[i];

Simple parallelization strategy
Multiplications are done in parallel
Summation is sequential
Example
CUDA Kernel

__global__ void innerProduct (int *a, int *b, int *c)
{
  int product[SIZE];
  int i = threadIdx.x;

  if (i < SIZE)
    product[i] = a[i] * b[i];
}

Called in the host code

__global__ void innerProduct (...)
{
  ...
}

int main ()
{
  ...
  innerProduct<<<grid, block>>> (...);
  ...
}
Example
__global__ void innerProduct (int *a, int *b, int *c)
{
  int product[SIZE];
  int i = threadIdx.x;

  if (i < SIZE)
    product[i] = a[i] * b[i];
}

The __global__ qualifier marks device-specific code that runs on the
device and is called by the host.

Other qualifiers are __device__, __host__, and the combination
__host__ __device__.

threadIdx is a built-in thread index. It has 3 components: x, y and z.

Each thread with a unique threadIdx.x runs the kernel code in parallel.
Example
Now we can sum all the products to get the scalar c

__global__ void innerProduct (int *a, int *b, int *c)
{
  int product[SIZE];
  int i = threadIdx.x;

  if (i < SIZE)
    product[i] = a[i] * b[i];

  int sum = 0;
  for (int k = 0; k < SIZE; k++)
    sum += product[k];
  *c = sum;
}

Unfortunately this won't work, for the following reasons:
- product[i] is local to each thread
- Threads are not visible to each other
Example
__global__ void innerProduct (int *a, int *b, int *c)
{
  __shared__ int product[SIZE];
  int i = threadIdx.x;

  if (i < SIZE)
    product[i] = a[i] * b[i];

  __syncthreads();

  if (threadIdx.x == 0)
  {
    int sum = 0;
    for (int k = 0; k < SIZE; k++)
      sum += product[k];
    *c = sum;
  }
}

First we make product[i] visible to all the threads by placing it in shared memory.

Next we make sure that all the threads are synchronized; in other words, each thread has
finished its workload before we move ahead. We do this by calling __syncthreads().

Finally we assign the summation to one thread (extremely inefficient reduction).

Aside: cudaThreadSynchronize() is used on the host side to synchronize host and device.
Example
__global__ void innerProduct (int *a, int *b, int *c)
{
  __shared__ int product[SIZE];
  int i = threadIdx.x;

  if (i < SIZE)
    product[i] = a[i] * b[i];

  __syncthreads();

  // Efficient reduction call
  *c = someEfficientLibrary_reduce (product);
}
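As a hedged alternative sketch (assuming Thrust, which ships with the CUDA Toolkit; the vector
names are illustrative), the whole dot product can also be handed to a library on the host side:

#include <thrust/device_vector.h>
#include <thrust/inner_product.h>

// Hypothetical host-side version: Thrust performs the multiply and the
// parallel reduction on the device internally.
thrust::device_vector<int> a (SIZE), b (SIZE);
// ... fill a and b ...
int c = thrust::inner_product (a.begin(), a.end(), b.begin(), 0);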
Performance Considerations
Memory Bandwidth
Memory bandwidth – the rate at which data is transferred – is a valuable
metric to gauge the performance of an application

Theoretical Bandwidth
Memory bandwidth (GB/s) = memory clock rate (Hz) × interface width (bytes) / 10^9

Real Bandwidth (Effective Bandwidth)
Bandwidth (GB/s) = [(bytes read + bytes written) / 10^9] / execution time

May also use profilers to estimate bandwidth and bottlenecks
If the real bandwidth is much lower than the theoretical one, the code may need review
Optimize on Real Bandwidth
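As a small illustrative sketch (not from the slides), the theoretical number can be computed from the
device properties queried earlier; the memoryClockRate and memoryBusWidth fields of cudaDeviceProp
are assumed to be available in your CUDA version, and the factor of 2 assumes double-data-rate memory:

cudaDeviceProp prop;
cudaGetDeviceProperties (&prop, 0);

// memoryClockRate is in kHz, memoryBusWidth is in bits
double theoreticalGBs = 2.0 * prop.memoryClockRate * 1e3
                        * (prop.memoryBusWidth / 8.0) / 1e9;
printf ("Theoretical bandwidth: %.1f GB/s\n", theoreticalGBs);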
Arithmetic Intensity
Memory access bandwidth of GPUs is limited compared to the peak compute throughput

High arithmetic intensity (arithmetic operations per memory access) algorithms
perform well on such architectures

Example
Fermi peak throughput for SP is 1 TFLOP/s and for DP is 0.5 TFLOP/s
Global memory (off-chip) bandwidth is 144 GB/s
At 4 bytes per single precision operand this delivers about 36 billion SP operands/s
(18 billion DP operands/s)
To obtain peak throughput therefore requires 1000/36 ~ 28 SP (14 DP) arithmetic
operations per operand loaded from global memory
Example revisited
__global__ void innerProduct (int *a, int *b, int *c)
{
  __shared__ int product[SIZE];
  int i = threadIdx.x;

  if (i < SIZE)
    product[i] = a[i] * b[i];

  __syncthreads();

  if (threadIdx.x == 0)
  {
    int sum = 0;
    for (int k = 0; k < SIZE; k++)
      sum += product[k];
    *c = sum;
  }
}

Contrast this with the inner product example, where for every 2 memory accesses
(the data ai and bi) only two operations (a multiply and an add) are performed.
That is a ratio of 1, as opposed to the ~28 required for peak throughput.

Room for algorithm improvement!

Aside: Not all performance will be peak performance
Optimization Strategies
Coalesce memory accesses (and use faster memories like shared memory)
Minimize data transfer over PCIe (~5 GB/s)
Overlap data transfers and computations with asynchronous calls
Use fast page-locked memory (pinned memory – host memory guaranteed to the device)
Threads in a block should be a multiple of 32 (the warp size); experiment with your device
Smaller thread blocks are better than large, many-thread blocks when resources are limited
Use fast libraries (cuBLAS, Thrust, CUSP, cuFFT, ...)
Use built-in (intrinsic) arithmetic instructions judiciously
Atomic Functions
Used to avoid race conditions resulting from thread synchronization and coordination
issues: multiple threads accessing the same address for read/write simultaneously.

Applicable to both shared memory and global memory.

Atomic methods in CUDA guarantee that the address is updated without interruption;
they are implemented using locks and serialization.

Atomic functions run faster on shared memory than on global memory.

Atomic functions should also be used judiciously, as they serialize the code; overuse
results in performance degradation.

Examples: atomicAdd, atomicMax, atomicXor, ...
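A minimal hedged sketch (the kernel name is illustrative, not from the slides): atomicAdd could
replace the single-thread summation used in the inner product example, at the cost of serializing
the concurrent updates:

// Hypothetical sum using atomics: every thread adds its own product directly
// into the single result location; the hardware serializes the updates.
__global__ void innerProductAtomic (int *a, int *b, int *c)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < SIZE)
        atomicAdd (c, a[i] * b[i]);   // *c must be zeroed before the launch
}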
CUDA Streams
A stream is a sequence of device operations executed in order

Example: Stream 1
Do memCopy -> Start timer -> Launch kernel -> Stop timer

cudaStream_t stream0, stream1;
cudaStreamCreate (&stream0);
cudaStreamCreate (&stream1);

cudaMemcpyAsync (..., stream0); someKernel<<<..., stream0>>>();
cudaMemcpyAsync (..., stream1); someKernel<<<..., stream1>>>();

cudaStreamSynchronize (stream0);

[Figure: N streams performing 3 tasks each - download Down(i), kernel Ker(i), upload Up(i) -
overlapped in time across streams]
Benchmarks

[Figure: relative performance of algorithms - Gflop/s vs. arithmetic intensity]
Courtesy - Sam Williams
References
CUDA
http://developer.nvidia.com/category/zone/cuda-zone
OpenCL
http://www.khronos.org/opencl/
GPGPU
http://www.gpucomputing.net/
Advanced topics from Jan 2011 ICCS Summer School
http://iccs.lbl.gov/workshops/tutorials.html
Conclusion
If you have parallel code you may benefit from GPUs
In some cases algorithms written on sequential machines may not migrate
efficiently and require reexamination and rewrite
If you have short-term goal(s) it may be worthwhile looking into CUDA etc
CUDA may provide better performance than OpenCL (it depends)
Most efficient codes optimally use the entire system and not just parts
Heterogeneous computing and parallel programming are here to stay
The number-two HPC machine in the world, the 2-PetaFlop/s Tianhe-1 in China, is a
heterogeneous cluster with 7k+ NVIDIA GPUs and 14k Intel CPUs
Algorithms
Lessons from ICCS Tutorials by Wen-Mei Hwu
Think Parallel
Promote fine grain parallelism
Consider minimal data movement
Exploit parallel memory access patterns
Data layout
Data Blocking/Tiling
Load Balance
Amdhal’s Argument
41	
  
Introduc+on	
  to	
  CUDA	
  Programming	
  -­‐	
  Hemant	
  Shukla	
  
Sequen+al	
  
Code	
  
Parallel	
  Code	
  
Sequen+al	
  
Code	
  
Sequen+al	
  
Code	
  
Sequen+al	
  
Code	
  
!me	
  t1	
  
!me	
  t2	
  
Code cannot run faster than time t2
If	
  X	
  is	
  the	
  serialized	
  part	
  of	
  the	
  code	
  then	
  speedup	
  cannot	
  be	
  greater	
  than	
  1/1-­‐X	
  	
  
no	
  maTer	
  how	
  many	
  cores	
  are	
  added.	
  
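Written out (a standard form of Amdahl's law consistent with the slide, with N the number of cores
and X the parallelizable fraction):

\[
S(N) = \frac{1}{(1 - X) + X/N}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - X}
\]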
Blocking
Also known as Tiling.

The basic idea is to move blocks/tiles of commonly used data from global memory into
shared memory or registers.

[Figure: Global Memory -> Shared Memory per Block (shared memory tiling: data blocks for
threads to share) -> Registers (register tiling: reuse computed results)]
Blocking / Tiling Technique
Focused access pattern
Identify the block/tile of global memory data to be accessed by threads
Load the data into the fast memory (shared memory, registers)
Have multiple threads use the data
Ensure barrier synchronization
Repeat (move to the next block, next iteration, etc.)
Make the most of one load of data into fast memory
Variables on Memory
CUDA Variable Type Qualifiers
__device__ __shared__ int SharedVar;
__device__ int GlobalVar;
__device__ __constant__ int ConstantVar;
Kernel variables without any qualifiers reside in a register, with an
exception for arrays, which reside in local memory
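A hedged sketch of how these qualifiers might be used together (the variable names, the kernel,
and the host-side copy are illustrative assumptions, not from the slides):

// File-scope device variables
__constant__ float coeff[4];        // read-only table in constant memory
__device__   int   globalCounter;   // variable in global memory

__global__ void applyCoeff (float *data, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;   // i lives in a register
    if (i < n)
        data[i] *= coeff[i % 4];                     // every thread reads the constant table
}

// Host side: constant memory is filled with cudaMemcpyToSymbol
float h_coeff[4] = {1.0f, 0.5f, 0.25f, 0.125f};
cudaMemcpyToSymbol (coeff, h_coeff, sizeof(h_coeff));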
Matrix Multiplication
Example

[Figure: A x B = C; element C(i,j) is the dot product of row i of A and column j of B,
summed over k; all matrices are WIDTH x WIDTH]
Matrix Multiplication...
CPU Version

void matrixMultiplication (float* A, float* B, float* C, int WIDTH)
{
  for (int i = 0; i < WIDTH; i++)
    for (int j = 0; j < WIDTH; j++)
    {
      float sum = 0;
      for (int k = 0; k < WIDTH; k++)
      {
        float a = A[i * WIDTH + k];
        float b = B[k * WIDTH + j];
        sum += a * b;
      }
      C[i * WIDTH + j] = sum;
    }
}
Matrix Multiplication...
GPU Version (Memory locations)

__global__ void matrixMultiplication (float* A, float* B, float* C, int WIDTH)
{
  int i = blockIdx.y * blockDim.y + threadIdx.y;
  int j = blockIdx.x * blockDim.x + threadIdx.x;

  // each thread computes one element of the product matrix C
  float sum = 0;
  for (int k = 0; k < WIDTH; k++)
    sum += A[i * WIDTH + k] * B[k * WIDTH + j];

  C[i * WIDTH + j] = sum;
}

(On the slide the kernel variables are color-coded by where they live: constant memory,
shared memory, global memory reads, and global memory writes.)
Matrix Multiplication...
Kernel analysis

Each multiply-add needs 2 floating point reads, 2 x 4 bytes = 8 bytes, and performs
2 floating point operations (a multiply and an add). Hence the ratio is
8 bytes / 2 FLOP = 4 bytes per FLOP.

Theoretical peak of Fermi is ~530 GFLOP/s
To achieve peak would require a bandwidth of 4 x 530 = 2120 GB/s
The actual bandwidth is 177 GB/s
With this bandwidth the kernel yields 177/4 = 44.25 GFLOP/s
About 12 times below peak performance
In practice it will be slower
Matrix Multiplication...
How to speed up?  BLOCKING

Load data into shared memory and reuse it.

Since the shared memory is small, it helps to partition the data into equal-sized blocks
that fit into the shared memory and reuse them.
Matrix Multiplication...
Block/Tile

Partial rows and columns are loaded into shared memory.
One row is reused to calculate two elements.
For a 16 x 16 tile width the global memory loads are reduced by a factor of 16.
Multiple blocks are executed in parallel.
Matrix Multiplication...
(Rows: threads; columns: time / tile phase)

Thread   Tile 1                                   Tile 2
T0,0     A0,0 -> A_S0,0,  B0,0 -> B_S0,0          A2,0 -> A_S0,0,  B0,2 -> B_S0,0
         C0,0 = A_S0,0*B_S0,0 + A_S1,0*B_S0,1     C0,0 = A_S0,0*B_S0,0 + A_S1,0*B_S0,1
T1,0     A0,0 -> A_S1,0,  B0,0 -> B_S1,0          A3,0 -> A_S1,0,  B1,2 -> B_S1,0
         C1,0 = A_S0,0*B_S1,0 + A_S1,0*B_S1,1     C1,0 = A_S0,0*B_S1,0 + A_S1,0*B_S1,1
T0,1     A0,1 -> A_S0,1,  B0,1 -> B_S0,1          A2,1 -> A_S0,1,  B0,3 -> B_S0,1
         C0,1 = A_S0,1*B_S0,0 + A_S1,1*B_S0,1     C0,1 = A_S0,1*B_S0,0 + A_S1,1*B_S0,1
T1,1     A1,1 -> A_S1,1,  B1,1 -> B_S1,1          A3,1 -> A_S1,1,  B1,3 -> B_S1,1
         C1,1 = A_S0,1*B_S1,0 + A_S1,1*B_S1,1     C1,1 = A_S0,1*B_S1,0 + A_S1,1*B_S1,1
Matrix Multiplication...
#define TILE_WIDTH 16   // tile size must be a compile-time constant to size the shared arrays

__global__ void matrixMultiplication (float* A, float* B, float* C, int WIDTH)
{
  __shared__ float A_S[TILE_WIDTH][TILE_WIDTH];
  __shared__ float B_S[TILE_WIDTH][TILE_WIDTH];

  int bx = blockIdx.x;  int by = blockIdx.y;
  int tx = threadIdx.x; int ty = threadIdx.y;

  // row and column of the C element to calculate
  int Row = by * TILE_WIDTH + ty;
  int Col = bx * TILE_WIDTH + tx;

  float sum = 0;

  // Loop over the A and B tiles required to compute the C element
  for (int m = 0; m < WIDTH / TILE_WIDTH; ++m) {

    // Collectively load the A and B tiles from global memory into shared memory
    A_S[ty][tx] = A[Row * WIDTH + (m * TILE_WIDTH + tx)];
    B_S[ty][tx] = B[(m * TILE_WIDTH + ty) * WIDTH + Col];
    __syncthreads();

    for (int k = 0; k < TILE_WIDTH; ++k)
      sum += A_S[ty][k] * B_S[k][tx];
    __syncthreads();
  }

  C[Row * WIDTH + Col] = sum;
}
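A usage sketch for the tiled kernel (the grid and block shapes follow from TILE_WIDTH above;
WIDTH is assumed to be a multiple of TILE_WIDTH, and d_A, d_B, d_C are device pointers
allocated as in the earlier memory-allocation example):

dim3 block (TILE_WIDTH, TILE_WIDTH);                 // one thread per C element in a tile
dim3 grid  (WIDTH / TILE_WIDTH, WIDTH / TILE_WIDTH);
matrixMultiplication <<<grid, block>>> (d_A, d_B, d_C, WIDTH);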
7-Point Stencil
Used for PDEs, Convolution etc.
7-Point Stencil …
Conceptually all points can be updated in parallel.
Each update performs a global sweep of the entire data set.
Memory bound.
The challenge is to parallelize without overusing memory bandwidth.
7-Point Stencil …
March along one axis, calculating values as you go.
Traversing the axis, 3 of the 7 input values lie along that axis.
Keep those three values in registers for the next iteration - this is called Register Tiling.
For the 7-point stencil, 2 of the inputs are then already in registers, so only 5 accesses are needed.
A combination of register and block tiling should give a 7x speed up.
In reality 4-5x, because halos have to be considered.
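A hedged sketch of the register-tiling part (names, sizes, and the simple coefficients are illustrative
assumptions, not the slide's code). Each thread owns one (x, y) column and marches along z, keeping the
three column values it needs in registers, so each step needs only 5 new global loads: 4 lateral
neighbours plus 1 new value ahead in z.

#define IDX(x, y, z) ((z) * NY * NX + (y) * NX + (x))

__global__ void stencil7 (const float *in, float *out, int NX, int NY, int NZ)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || x >= NX - 1 || y < 1 || y >= NY - 1) return;   // skip boundary columns

    float below  = in[IDX(x, y, 0)];   // z-1
    float center = in[IDX(x, y, 1)];   // z
    for (int z = 1; z < NZ - 1; z++)
    {
        float above = in[IDX(x, y, z + 1)];              // the only new column load
        out[IDX(x, y, z)] =
              0.4f * center
            + 0.1f * (below + above
                      + in[IDX(x - 1, y, z)] + in[IDX(x + 1, y, z)]
                      + in[IDX(x, y - 1, z)] + in[IDX(x, y + 1, z)]);
        below  = center;                                  // shift the register window
        center = above;
    }
}

Adding shared-memory tiling of the x-y plane on top of this is what the slide's combined
register + block tiling refers to.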
Questions?
Use case
GAMER
Hsi-Yu Schive, T. Chiueh, and Y. C. Tsai
Astrophysics adaptive mesh refinement (AMR) code with solvers for hydrodynamics and gravity
Parallelization achieved with OpenMP and MPI on multi-node multicore systems, and CUDA for accelerators (GPUs)
Decoupling of the AMR (CPU) and the solvers (GPU) lends itself to increased performance and ease of code development
Speed-ups of the order of 10-12x attained on single- and multi-GPU heterogeneous systems

Simulations

GAMER Framework
Hemant Shukla, Hsi-Yu Schive, Tak-Pong Woo, and T. Chiueh
Generalized the GAMER codebase into a multi-science framework
Use GAMER to deeply benchmark heterogeneous hardware, optimizations and algorithms in applications
Collect performance, memory access, power consumption and various other metrics for a broader user base
Develop codebases as ensembles of highly optimized existing and customizable components for HPC
Adaptive Mesh Refinement

Data stored in an octree data structure
Refinement with 2^l spatial resolution per level l
8^3 cells per patch
Identical spatial geometry (same kernel)
Uniform and individual time-steps

[Figure: 2D patch hierarchy - Hsi-Yu Schive et al., 2010]
Construct and Dataflow

GAMER Codebase: C++/CUDA, MPI, OpenMP
AMR, Framework, Libraries
Solvers: Poisson, Hydro, Custom, ...

[Figure: cluster of CPU+GPU nodes advancing through time steps]

The problem domain is covered with coarse patches on the CPUs
User-defined refinement, spatial averaging and flux correction are done on the CPUs
Concurrently, patches are transferred to the GPUs, processed by the solvers (one cell per thread), and returned
Solvers

Hydrodynamics PDE Solver
3D Euler equations solved with 5 separate schemes:
Second-order relaxing Total Variation Diminishing (TVD)
Weighted average flux
MUSCL-Hancock (MHM)
MUSCL-Hancock (VL)
Corner transport upwind (CTU)
Flux conservation is done using a Riemann solver
(4 types - exact solver, HLLE, HLLC, and Roe)

\[
\frac{\partial \rho}{\partial t} + \frac{\partial (\rho v_j)}{\partial x_j} = 0, \qquad
\frac{\partial (\rho v_i)}{\partial t} + \frac{\partial (\rho v_i v_j + P \delta_{ij})}{\partial x_j} = -\rho \frac{\partial \phi}{\partial x_i}, \qquad
\frac{\partial e}{\partial t} + \frac{\partial [(e + P) v_j]}{\partial x_j} = -\rho v_j \frac{\partial \phi}{\partial x_j}
\]

Poisson-Gravity Solver
\[
\nabla^2 \phi(\vec{x}) = 4\pi G \rho(\vec{x})
\]
The Laplacian operator is replaced by a seven-point finite difference operator
For root-level patches Green's functions are used via FFTW
For refined levels SOR is used

Recently implemented
Multigrid Poisson solver
Hilbert space-filling curve (load balancing)

Currently implementing
Fast Poisson solver with Dirichlet boundary conditions
GAMER Framework

Allows adding custom/new solvers to the codebase

New Solver implements
- The size of the computational stencil
- An optimized CPU version of the implementation
- An optimized GPU version of the implementation

New Solver inherits
- CUDA thread blocks and stream objects
- Async memcpy, concurrent execution, MPI and OpenMP optimization
Multi-Science

Cosmological Large-scale Structure
Gravitational potential, effective resolution 8192^3
\[
\nabla^2 \phi(\vec{x}) = 4\pi G a \,[\rho(\vec{x}) - \rho_b(\vec{x})]
\]

Bosonic Dark Matter
Schrodinger-Poisson equation; structure due to a dark matter model in the early universe
\[
i\hbar \frac{\partial \psi}{\partial t} = -\frac{\hbar^2}{2 a^2 m} \nabla^2 \psi + m V \psi
\]

Gravitational Lensing Potential
Lens equation and mass relationship
\[
\vec{u} = \vec{x} - \nabla \phi(\vec{x}), \qquad \nabla^2 \phi(\vec{x}) = \Sigma(\vec{x})/\Sigma_{cr}
\]
Kernel Analysis

[Figure: global memory access (read and write, GB/s) for the Gravity, Fluid and Poisson kernels;
maximum bandwidth 144 GB/s]

[Figure: instructions per byte for each kernel - Poisson 268.77 (compute bound), Fluid 4.02 and
Gravity 2.58 (memory bound); the compute/memory-bound boundary is at 3.57 instructions/byte]

SOR takes 20-30 iterations to converge
L1 cache hit rates during global memory accesses: 0.0%, 64.3%, 15.9%
Intensive use of shared memory
Results - Large scale Cosmological Simulations with GAMER
Hemant Shukla, Hsi-Yu Schive et al., SC 2011
Results

Bosonic Dark Matter Simulation
Hemant Shukla, Hsi-Yu Schive et al., SC 2011
Base level resolution 256^3, refined to level 7 (32,768^3 effective)

[Figure: wall-clock seconds per step (Gravity, Kinematic (Schrödinger's eqn.), Fix-up, Refinement,
MPI, Time-step) on 8 and 64 CPU cores, with and without GPUs; the GPU runs show 5.52x and 4.79x speed-ups]
New Results

Load balance with Hilbert space-filling curve

[Figure: wall-clock seconds per step (Gravity, Kinematic (Schrödinger's eqn.), Fix-up, Refinement,
MPI, Time-step) on 8 and 64 cores + GPU, unbalanced vs. balanced; load balancing gives a 3.03x improvement]