Programming the ANDC Cluster
Sudhang Shankar
Traditional Programming
Serial: one instruction at a time, one after the other, on a single CPU.
[Diagram: a problem decomposed into instructions t1..t6, executed in sequence on one CPU]
The Funky Ishtyle
Parallel: the problem is split into parts. Each part is represented as a sequence of instructions, and each sequence runs on a separate CPU.
[Diagram: a problem split into Sub-Problem1 and Sub-Problem2, each with instructions t1..t3, running on CPU1 and CPU2]
Why Parallelise?
Speed: "many hands make light work."
Precision/Scale: we can solve bigger problems, with greater accuracy.
Parallel Programming Models
There are several parallel programming models in common use:
Shared Memory
Threads
Message Passing
Data Parallel
Message Passing Model
The applications on the ANDC cluster currently use this model.
Tasks use their own local memory during computation.
Tasks exchange data through messages.
The MPI Standard
MPI: Message Passing Interface
A standard, with many implementations
Codifies "best practices" of the parallel-design community
Implementations: MPICH (Argonne National Laboratory), LAM/MPI, Open MPI
How MPI Works
Communicators define which collection of processes may communicate with each other.
Every process in a communicator has a unique rank.
The size of the communicator is the total number of processes in it.
MPI Primitives: Environment Setup
MPI_INIT: initialises the MPI execution environment
MPI_COMM_SIZE: determines the number of processes in the group associated with a communicator
MPI_COMM_RANK: determines the rank of the calling process within the communicator
MPI_FINALIZE: terminates the MPI execution environment
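A minimal C sketch (not taken from the cluster's code) showing how these four calls typically frame an MPI program:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int size, rank;

        MPI_Init(&argc, &argv);                  /* set up the MPI environment  */
        MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of processes   */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process's rank         */

        printf("Hello from rank %d of %d\n", rank, size);

        MPI_Finalize();                          /* tear the environment down   */
        return 0;
    }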
MPI Primitives: Message Passing
MPI_Send(buffer, count, type, dest, tag, comm)
MPI_Recv(buffer, count, type, source, tag, comm, status)
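A small illustrative fragment, meant to sit between MPI_Init and MPI_Finalize of the sketch above (the ranks and tag value are assumptions, not the cluster's actual code): rank 1 sends one double to rank 0.

    double value = 3.14;
    MPI_Status status;

    if (rank == 1) {
        /* send one MPI_DOUBLE to destination rank 0, tag 0 */
        MPI_Send(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 0) {
        /* receive one MPI_DOUBLE from source rank 1, tag 0 */
        MPI_Recv(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &status);
        printf("rank 0 received %f\n", value);
    }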
An Example Application
The Monte Carlo Pi Estimation Algorithm, a.k.a. "The Dartboard Algorithm"
Algorithm Description
Imagine you have a square "dartboard" with a circle inscribed in it.
Randomly throw N darts at the board.
Count the number of HITS (darts landing within the circle); darts landing outside the circle are FLOPS (misses).
Pi is the value obtained by multiplying the ratio of hits to total throws by 4.
Why: for a circle of radius r inscribed in a square,
    pi = A_c / r^2
    A_s = 4r^2, so r^2 = A_s / 4
    pi = 4 * A_c / A_s
The ratio hits/throws approximates A_c / A_s, the circle's area over the square's area.
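A minimal serial sketch of the dartboard routine. The name dboard mirrors the fragments used on the later MPI slides; this particular implementation is an assumption.

    #include <stdlib.h>

    /* Throw 'darts' random darts at the square [-1,1] x [-1,1] and return
       how many land inside the inscribed unit circle.
       Note: in a parallel run, each task should seed rand() differently,
       e.g. srand(taskid). */
    double dboard(int darts)
    {
        double hits = 0.0;
        for (int i = 0; i < darts; i++) {
            double x = 2.0 * rand() / (double)RAND_MAX - 1.0;
            double y = 2.0 * rand() / (double)RAND_MAX - 1.0;
            if (x * x + y * y <= 1.0)
                hits += 1.0;
        }
        return hits;
    }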
Parallel Version
Each worker throws an equal number of darts.
Each worker counts its HITS.
The master adds up all the individual HITS.
It then computes pi as: pi = 4.0 * HITS / N
To Make it Faster...
Increase the number of workers p while keeping N constant.
Each worker handles N/p throws, so the greater p is, the fewer throws each worker handles.
Fewer throws per worker => faster.
To Make it "Better"
Increase the total number of throws N.
This makes the estimate more accurate.
MPI Version
Each task runs the dartboard algorithm:
    homehits = dboard(DARTS);
Workers send their homehits to the master:
    if (taskid != MASTER)
        MPI_Send(&homehits, 1, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD);
The master receives the homehits values from the workers:
    for (i = 0; i < p; i++) {
        rc = MPI_Recv(&hitrecv, 1, MPI_DOUBLE, MPI_ANY_SOURCE, mtype,
                      MPI_COMM_WORLD, &status);
        totalhits = totalhits + hitrecv;
    }
The master then calculates pi as:
    pi = 4.0 * totalhits / N
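Putting the fragments above together, a compilable sketch might look like the following. Variable names follow the slides; the loop bound, the tag handling, and letting the master count its own throws are assumptions, not the cluster's actual code.

    #include <mpi.h>
    #include <stdio.h>

    #define MASTER 0
    #define DARTS  1000000              /* throws per task; illustrative value */

    double dboard(int darts);           /* the routine sketched earlier */

    int main(int argc, char *argv[])
    {
        int taskid, numtasks, i, mtype = 1;
        double homehits, hitrecv, totalhits, pi;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
        MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

        homehits = dboard(DARTS);       /* every task throws its own darts */

        if (taskid != MASTER) {
            MPI_Send(&homehits, 1, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD);
        } else {
            totalhits = homehits;       /* include the master's own throws */
            for (i = 1; i < numtasks; i++) {
                MPI_Recv(&hitrecv, 1, MPI_DOUBLE, MPI_ANY_SOURCE, mtype,
                         MPI_COMM_WORLD, &status);
                totalhits += hitrecv;
            }
            pi = 4.0 * totalhits / ((double)DARTS * numtasks);  /* N = DARTS * numtasks */
            printf("Estimated pi = %f\n", pi);
        }

        MPI_Finalize();
        return 0;
    }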
MapReduce
A framework for simplifying the development of parallel programs. Developed at Google.
FLOSS implementations:
Hadoop (Java), from Yahoo
Disco (Erlang), from Nokia
Dumbo, from Audioscrobbler
Many others (including a 36-line Ruby one!)
MapReduce
The MapReduce library requires the user to implement:
Map(): takes a function and a sequence of values, and applies the function to each value in the sequence
Reduce(): combines all the elements of a sequence using a binary operation
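A toy, purely sequential C illustration of map and reduce in this functional sense (not any MapReduce library's API): map applies a function to every element, reduce folds the results with a binary operation.

    #include <stdio.h>

    typedef double (*map_fn)(double);
    typedef double (*reduce_fn)(double, double);

    /* apply f to every element of in[], writing results to out[] */
    static void map(map_fn f, const double *in, double *out, int n)
    {
        for (int i = 0; i < n; i++)
            out[i] = f(in[i]);
    }

    /* combine all elements of in[] with the binary operation op */
    static double reduce(reduce_fn op, const double *in, int n, double init)
    {
        double acc = init;
        for (int i = 0; i < n; i++)
            acc = op(acc, in[i]);
        return acc;
    }

    static double square(double x)        { return x * x; }
    static double add(double a, double b) { return a + b; }

    int main(void)
    {
        double in[] = {1, 2, 3, 4}, out[4];
        map(square, in, out, 4);
        printf("sum of squares = %f\n", reduce(add, out, 4, 0.0));  /* 30.0 */
        return 0;
    }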
How it Works (oversimplified)
map() takes as input a set of <key, value> pairs and produces a set of <intermediate key, value> pairs.
This is all done in parallel, across many machines.
The parallelisation is done by the MapReduce library (the programmer doesn't have to think about it).
The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to reduce().
A Reduce() instance takes a set of <intermediate key, value> pairs and produces an output value for that key, like a "summary" value.
Pi Estimation in MapReduce
Here, map() is the dartboard algorithm, and each worker runs it.
Hits are represented as <1, no_of_hits> and flops (misses) as <0, no_of_flops>.
Thus each Map() instance returns two <boolean, count> pairs to the MapReduce library.
The library then groups all the <bool, count> pairs into two "sets", one for key 0 and one for key 1, and passes them to reduce().
Reduce() then adds up the counts for each key to produce a "grand total". Thus we know the total hits and total flops.
These are output as <key, grand_total> pairs to the master process.
The master then finds pi as: pi = 4 * hits / (hits + flops)
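A plain-C, single-process sketch of what the map and reduce phases compute for this problem. It is not any MapReduce framework's API, and the worker count and dart count are illustrative; in a real run each map call would execute on a separate machine and the grouping by key would be done by the library.

    #include <stdio.h>

    #define WORKERS 4
    #define DARTS_PER_WORKER 1000000   /* illustrative */

    double dboard(int darts);          /* the routine sketched earlier */

    /* "map": one worker's throws, emitted as two <key, count> pairs:
       key 1 -> hits, key 0 -> flops (misses). */
    static void map_throws(int darts, double counts[2])
    {
        double hits = dboard(darts);
        counts[1] = hits;
        counts[0] = darts - hits;
    }

    /* "reduce": sum all counts emitted under one key */
    static double reduce_sum(const double values[], int n)
    {
        double total = 0.0;
        for (int i = 0; i < n; i++)
            total += values[i];
        return total;
    }

    int main(void)
    {
        double hits_by_worker[WORKERS], flops_by_worker[WORKERS], counts[2];

        for (int w = 0; w < WORKERS; w++) {   /* each iteration stands in for one map task */
            map_throws(DARTS_PER_WORKER, counts);
            flops_by_worker[w] = counts[0];
            hits_by_worker[w]  = counts[1];
        }

        double hits  = reduce_sum(hits_by_worker, WORKERS);
        double flops = reduce_sum(flops_by_worker, WORKERS);
        printf("Estimated pi = %f\n", 4.0 * hits / (hits + flops));
        return 0;
    }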
Other Solutions
PVM
OpenMP
Linda
Occam
Parallel/Scientific Python
Problems/Limitations
Parallel slowdown: parallelising a program beyond a certain point causes it to run slower.
Amdahl's Law: parallel speedup is limited by the sequential fraction of the program.
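Amdahl's Law stated as a formula: if a fraction f of the program can be parallelised and p processors are used, the speedup is

\[ S(p) = \frac{1}{(1 - f) + \frac{f}{p}}, \qquad \lim_{p \to \infty} S(p) = \frac{1}{1 - f} \]

For example, if 90% of the program is parallelisable (f = 0.9), the speedup can never exceed 10x, no matter how many CPUs the cluster has.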
Applications
Compute clusters are used whenever we have:
lots and lots of data to process
too little time to work sequentially
Finance
Risk assessment: India's NSE uses a Linux cluster to monitor the risk of members.
If a broker crosses the VaR (Value at Risk) limit, the account is disabled.
VaR is calculated in real time using PRISM (Parallel Risk Management System), which uses MPI.
NSE's PRISM handles 500 trades/sec and can scale to 1000 trades/sec.
Molecular Dynamics
Given a collection of atoms, we'd like to calculate how they interact and move under realistic laboratory conditions.
The expensive part is determining the force on each atom, since it depends on the positions of all the other atoms in the system.
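A deliberately toy 1-D sketch, just to show why this is expensive: the force loop is O(N^2) because every atom feels every other atom. The force() function here is a made-up stand-in; real MD codes use 3-D vectors and realistic potentials (e.g. Lennard-Jones), plus cutoffs and neighbour lists.

    #include <stdio.h>

    #define NATOMS 4

    /* toy pairwise "force" between two 1-D positions (a simple spring-like pull) */
    static double force(double xi, double xj)
    {
        return xj - xi;
    }

    int main(void)
    {
        double x[NATOMS] = {0.0, 1.0, 2.5, 4.0};
        double f[NATOMS];

        /* the expensive O(N^2) part: every atom interacts with every other atom */
        for (int i = 0; i < NATOMS; i++) {
            f[i] = 0.0;
            for (int j = 0; j < NATOMS; j++)
                if (j != i)
                    f[i] += force(x[i], x[j]);
        }

        for (int i = 0; i < NATOMS; i++)
            printf("f[%d] = %f\n", i, f[i]);
        return 0;
    }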
Software:
GROMACS: helps scientists simulate the behavior of large molecules (like proteins, lipids, and even polymers)
PyMOL: a molecular graphics and modelling package which can also be used to generate animated sequences
[Image: ray-traced lysozyme structure created with PyMOL]
Other Distributed Problems
Rendering multiple frames of high-quality animation (e.g. Shrek)
Indexing the web (e.g. Google)
Data mining
Questions?
