The document discusses parallel computing on the GPU. It outlines the goals of achieving high performance, energy efficiency, functionality, and scalability, then presents a tentative schedule covering introductions to GPU computing, CUDA, the threading and memory models, performance optimization, and floating-point considerations, and recommends textbooks and notes for further reading. It introduces key concepts such as parallelism, latency versus throughput, and bandwidth, explaining that CPUs are designed to minimize the latency of individual operations while GPUs are designed to maximize aggregate throughput across many parallel threads. Winning applications use both processors: the CPU runs the sequential parts of a program and the GPU runs the parallel parts.
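The CPU-sequential / GPU-parallel division of labor can be sketched with a minimal CUDA vector-addition program. This is a generic illustration under my own assumptions, not an example taken from the document: the host (CPU) code handles the sequential setup and data movement, while the kernel expresses the data-parallel work that runs across many GPU threads.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// GPU kernel: each thread computes one element (the parallel part).
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Sequential host code (the CPU part): allocation and initialization.
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[123] = %f\n", h_c[123]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```

The launch configuration shows the throughput orientation described above: the problem is decomposed into roughly a million independent threads, and the GPU hides memory latency by switching among them rather than by minimizing the latency of any single one.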