GPUs: Not Just for Graphics Anymore
David Ostrovsky | Couchbase
GPGPU refers to using a Graphics Processing Unit (GPU) to perform computation in applications traditionally handled by the CPU.
CPU vs. GPU Architecture
Embarrassingly Parallel Problems
• Image processing, graphics rendering
• Fractal images (e.g. Mandelbrot set)
• String matching
• Distributed queries, MapReduce
• Brute-force cryptographic attacks
• Bitcoin mining
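To make "embarrassingly parallel" concrete, here is a minimal C++ sketch of the Mandelbrot case from the list. The function names, the grid bounds, and the escape radius check are illustrative assumptions, not code from the talk. The point is that `mandelbrot_iterations` reads only its own input, so every pixel could be handed to a different GPU thread; the driver below simply runs them one after another.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Kernel: escape-time iteration count for one point c of the complex plane.
// It depends on nothing but c, so all pixels are independent work units.
int mandelbrot_iterations(std::complex<double> c, int max_iter) {
    assert(max_iter > 0);
    std::complex<double> z = 0.0;
    for (int i = 0; i < max_iter; ++i) {
        if (std::norm(z) > 4.0) return i;  // |z| > 2: the point escaped
        z = z * z + c;
    }
    return max_iter;                        // treated as inside the set
}

// Sequential driver over a width x height grid covering [-2,1] x [-1,1].
// On a GPU, the two loops disappear and each (x, y) becomes one thread.
std::vector<int> render(int width, int height, int max_iter) {
    assert(width > 1 && height > 1);
    std::vector<int> out(static_cast<std::size_t>(width) * height);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            std::complex<double> c(-2.0 + 3.0 * x / (width - 1),
                                   -1.0 + 2.0 * y / (height - 1));
            out[static_cast<std::size_t>(y) * width + x] =
                mandelbrot_iterations(c, max_iter);
        }
    }
    return out;
}
```

Calling `render(800, 600, 256)` would produce one iteration count per pixel, ready to be mapped to colors.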
Amdahl’s Law
The speedup of a program using multiple processors in parallel computing is limited by the sequential fraction of the program.
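The slide states the law in words only. Its usual formula is S(n) = 1 / ((1 - p) + p / n), where p is the fraction of run time that can be parallelized and n is the number of processors. A small C++ sketch, with function names chosen here purely for illustration:

```cpp
#include <cassert>

// Amdahl's law: speedup with n processors when a fraction p of the
// run time can be parallelized (0 <= p <= 1).
double amdahl_speedup(double p, int n) {
    assert(p >= 0.0 && p <= 1.0 && n >= 1);
    return 1.0 / ((1.0 - p) + p / n);
}

// Upper bound on the speedup as n grows without limit: 1 / (1 - p).
double amdahl_limit(double p) {
    assert(p >= 0.0 && p < 1.0);
    return 1.0 / (1.0 - p);
}
```

For example, a program that is 95% parallel can never run more than about 20 times faster, no matter how many GPU cores are thrown at it, because the remaining 5% still runs sequentially.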
GPGPU Concepts
• Texture: A common way to provide the read-only input data stream as a 2D grid.
• Frame Buffer: A write-only memory interface for output.
• Kernel: The operation to perform on each unit of data. Roughly similar to the body of a loop.
Parallelizing Your Code
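// in:   read-only input, plays the role of the Texture
// out:  write-only output, plays the role of the Frame Buffer
// func: applied to each element, plays the role of the Kernel
float func(float x);
void compute(const float in[10000], float out[10000])
{
    for (int i = 0; i < 10000; i++)
        out[i] = func(in[i]);
}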
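The loop above can be parallelized because no iteration depends on another. As a rough CPU-side analogy, assuming standard C++ threads rather than any GPU framework, the sketch below splits the index range into chunks and runs the kernel on each chunk in its own thread. `parallel_map` and `square_kernel` are hypothetical names used only for this illustration; a GPU does the same thing with thousands of much smaller work items.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical kernel standing in for func() on the slide.
float square_kernel(float x) { return x * x; }

// Apply kernel to every element of in, splitting the index range
// into contiguous chunks, one per worker thread.
std::vector<float> parallel_map(const std::vector<float>& in,
                                float (*kernel)(float),
                                unsigned workers) {
    assert(kernel != nullptr && workers > 0);
    std::vector<float> out(in.size());
    const std::size_t chunk = (in.size() + workers - 1) / workers;
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t begin = std::min(in.size(), w * chunk);
        const std::size_t end = std::min(in.size(), begin + chunk);
        if (begin == end) continue;          // nothing left for this worker
        pool.emplace_back([&in, &out, kernel, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                out[i] = kernel(in[i]);      // each thread writes only its own slice
        });
    }
    for (auto& t : pool) t.join();           // wait for all slices to finish
    return out;
}
```

No locking is needed because the input is only read and each output element is written by exactly one thread, which mirrors the read-only texture and write-only frame buffer split.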
GPGPU Frameworks
• OpenCL
  • Subset of C99
  • Implementations for Intel, AMD, and nVidia GPUs
• CUDA
  • C++ SDK, wrappers for other languages
  • Only supported on nVidia GPUs
• C++ AMP
  • Subset of C++
  • Microsoft implementation based on DirectX, integrated into Visual Studio
  • Supports most modern GPUs
Client Integration
• OpenCL
  • Vendor-specific SDKs, available from Intel, AMD, IBM, and nVidia
  • Wrappers for popular languages, including C#, Python, Java, etc.
  • Supports multiple vendor-specific debuggers
• C++ AMP
  • Native C++ projects, P/Invoke from .NET, WinRT component, any language that can interoperate with native libraries
  • Supports GPU debugging, profiling
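Using C++ AMP
Native DLL
#include <amp.h>
using namespace concurrency;

// Squares each element of arr in place on the accelerator.
extern "C" __declspec(dllexport) void __stdcall square_array(float* arr, int n)
{
    // Wrap the caller's buffer; data is copied to the accelerator on demand.
    array_view<float, 1> dataView(n, &arr[0]);
    parallel_for_each(dataView.extent, [=](index<1> idx) restrict(amp)
    {
        dataView[idx] = dataView[idx] * dataView[idx];
    });
    // Copy the results back into arr before returning.
    dataView.synchronize();
}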
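Using C++ AMP
Managed Code
// Requires "using System.Runtime.InteropServices;" and an unsafe context.
[DllImport("NativeAmpLibrary", CallingConvention = CallingConvention.StdCall)]
extern unsafe static void square_array(float* array, int length);

float[] arr = new[] { 1.0f, 2.0f, 3.0f, 4.0f };
// Pin the managed array so the GC cannot move it during the native call.
fixed (float* arrPt = &arr[0]) {
    square_array(arrPt, arr.Length);
}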
Using OpenCL
C# Project NuGet Package
Using OpenCL
OpenCL Code
Using Aparapi (OpenCL)
• Converts Java bytecode to OpenCL at runtime
• Syntax somewhat similar to C++ AMP
Aparapi Java Code
final float[] data = new float[size];
Kernel kernel = new Kernel() {
    @Override public void run() {
        int gid = getGlobalId();  // index of this work item
        data[gid] = data[gid] * data[gid];
    }
};
// One work item per element, so the range matches the array length.
kernel.execute(Range.create(data.length));
Demo Time!
Simple GPGPU Applications
Case Study 1: Edge Detection
Find all the points in the image where the brightness changes sharply.
Sobel Operator
Pixels can be checked in parallel
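As a minimal sketch of why edge detection parallelizes so well, the code below applies the standard 3x3 Sobel kernels to a single pixel of a grayscale image. The `Image` struct and function names are assumptions made for this example, and borders are simply left at zero. Each call to `sobel_at` reads only the pixel's neighbourhood and writes one output value, so every pixel can be processed independently.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Grayscale image stored row by row (hypothetical helper type).
struct Image {
    int width;
    int height;
    std::vector<float> pixels;
    float at(int x, int y) const {
        return pixels[static_cast<std::size_t>(y) * width + x];
    }
};

// Kernel: Sobel gradient magnitude at one interior pixel (x, y).
float sobel_at(const Image& img, int x, int y) {
    assert(x > 0 && y > 0 && x < img.width - 1 && y < img.height - 1);
    // Horizontal gradient: right column minus left column, middle row weighted 2.
    const float gx = -img.at(x - 1, y - 1) + img.at(x + 1, y - 1)
                     - 2 * img.at(x - 1, y) + 2 * img.at(x + 1, y)
                     - img.at(x - 1, y + 1) + img.at(x + 1, y + 1);
    // Vertical gradient: bottom row minus top row, middle column weighted 2.
    const float gy = -img.at(x - 1, y - 1) - 2 * img.at(x, y - 1) - img.at(x + 1, y - 1)
                     + img.at(x - 1, y + 1) + 2 * img.at(x, y + 1) + img.at(x + 1, y + 1);
    return std::sqrt(gx * gx + gy * gy);
}

// Sequential driver; on a GPU each (x, y) would be its own work item.
std::vector<float> sobel(const Image& img) {
    std::vector<float> out(img.pixels.size(), 0.0f);  // border pixels stay 0
    for (int y = 1; y < img.height - 1; ++y)
        for (int x = 1; x < img.width - 1; ++x)
            out[static_cast<std::size_t>(y) * img.width + x] = sobel_at(img, x, y);
    return out;
}
```

A large output value marks a sharp brightness change; thresholding the result gives a binary edge map.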
More Demo Time!
Processing a Video Stream
Case Study 2: Password Cracking
Passwords are commonly stored as hashes of the original plaintext:
"12345" = "5994471abb01112afcc18159f6cc74b4f511b99806da59b3caf5a9c173cacfc5"
Cracking a password by brute force requires repeatedly hashing guesses until a match is found, which can be parallelized effectively.
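The structure of such an attack can be sketched in a few lines of C++. Two loud assumptions: the hash below is 64-bit FNV-1a, used only as a stand-in so the example needs no crypto library (a real attack would target something like SHA-256), and all function names are hypothetical. What matters is that checking one guess never depends on another, so the word list can be cut into ranges and each range handed to a separate worker.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Stand-in hash (64-bit FNV-1a). NOT a password hash; it only keeps
// this sketch self-contained.
std::uint64_t toy_hash(const std::string& s) {
    std::uint64_t h = 0xcbf29ce484222325ULL;   // FNV offset basis
    for (unsigned char c : s) {
        h ^= c;
        h *= 0x100000001b3ULL;                 // FNV prime
    }
    return h;
}

// Unit of work: hash guesses [begin, end) and compare against target.
// Returns the index of the matching guess, or -1 if none matches.
long find_in_range(const std::vector<std::string>& guesses,
                   std::size_t begin, std::size_t end,
                   std::uint64_t target) {
    assert(begin <= end && end <= guesses.size());
    for (std::size_t i = begin; i < end; ++i)
        if (toy_hash(guesses[i]) == target) return static_cast<long>(i);
    return -1;
}

// Driver: walks the dictionary in chunks. The chunks run one after another
// here, but they share no state and could run concurrently.
long crack(const std::vector<std::string>& guesses,
           std::uint64_t target, std::size_t chunk) {
    assert(chunk > 0);
    for (std::size_t begin = 0; begin < guesses.size(); begin += chunk) {
        const std::size_t end = std::min(guesses.size(), begin + chunk);
        const long hit = find_in_range(guesses, begin, end, target);
        if (hit >= 0) return hit;
    }
    return -1;
}
```

With a small word list containing "12345", `crack` returns the position of that word when given `toy_hash("12345")` as the target, and -1 for a hash whose plaintext is not in the list.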
Even More Demos!
Cracking a Single Password Hash with a Dictionary Attack
Thank you!
@DavidOstrovsky
CodeHardBlog.azurewebsites.net
linkedin.com/in/davidostrovsky
davido@couchbase.com
David Ostrovsky | Couchbase

General Programming on the GPU - Confoo


Editor's Notes

  • #3 Particularly effective for Stream Processing – performing the same operation on multiple records in a stream in parallel
  • #5 Workloads that can be easily separated into parallel tasks. This is often the case when there is no dependency between the work units.
  • #6 Gene Myron Amdahl (November 16, 1922 to November 10, 2015) was an American computer architect and high-tech entrepreneur, chiefly known for his work on mainframe computers at IBM and later his own companies, especially Amdahl Corporation. He formulated Amdahl's law, which states a fundamental limitation of parallel computing.
  • #21 Fast hash algorithms like MD5, SHA-1 and SHA-2 are terrible for storing passwords. Use CPU-intensive algorithms like PBKDF2, bcrypt, or scrypt instead. They are expensive to calculate and have an adjustable work factor.