Programming the ANDC Cluster
Sudhang Shankar
Traditional Programming
Serial: one instruction at a time, one after the other, on a single CPU.
[Diagram: a problem decomposed into instructions t1..t6, executed in sequence on one CPU]
The Funky Ishtyle
Parallel: the problem is split into parts. Each part is represented as a sequence of instructions, and each sequence runs on a separate CPU.
[Diagram: a problem split into Sub-Problem1 and Sub-Problem2, each with instructions t1..t3, running on CPU1 and CPU2]
Why Parallelise?
Speed: "many hands make light work."
Precision/Scale: we can solve bigger problems, with greater accuracy.
Parallel Programming Models
There are several parallel programming models in common use:
Shared Memory
Threads
Message Passing
Data Parallel
Message Passing Model
The applications on the ANDC cluster currently use this model.
Tasks use their own local memory during computation.
Tasks exchange data through messages.
The MPI Standard
MPI: Message Passing Interface
A standard, with many implementations
Codifies "best practices" of the parallel-design community
Implementations: MPICH (Argonne National Laboratory), LAM/MPI, Open MPI
How MPI Works
Communicators define which collection of processes may communicate with each other.
Every process in a communicator has a unique rank.
The size of the communicator is the total number of processes in it.
MPI Primitives: Environment Setup
MPI_INIT: initialises the MPI execution environment
MPI_COMM_SIZE: determines the number of processes in the group associated with a communicator
MPI_COMM_RANK: determines the rank of the calling process within the communicator
MPI_FINALIZE: terminates the MPI execution environment
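A minimal C sketch (not taken from the cluster's code) showing how these four calls typically frame an MPI program:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int size, rank;

        MPI_Init(&argc, &argv);                  /* set up the MPI environment  */
        MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of processes   */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process's rank         */

        printf("Hello from rank %d of %d\n", rank, size);

        MPI_Finalize();                          /* tear the environment down   */
        return 0;
    }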
MPI Primitives: Message Passing
MPI_Send(buffer, count, type, dest, tag, comm)
MPI_Recv(buffer, count, type, source, tag, comm, status)
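A small illustrative fragment, meant to sit between MPI_Init and MPI_Finalize of the sketch above (the ranks and tag value are assumptions, not the cluster's actual code): rank 1 sends one double to rank 0.

    double value = 3.14;
    MPI_Status status;

    if (rank == 1) {
        /* send one MPI_DOUBLE to destination rank 0, tag 0 */
        MPI_Send(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 0) {
        /* receive one MPI_DOUBLE from source rank 1, tag 0 */
        MPI_Recv(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &status);
        printf("rank 0 received %f\n", value);
    }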
An Example Application
The Monte Carlo Pi Estimation Algorithm, a.k.a. "The Dartboard Algorithm"
Algorithm Description
Imagine you have a square "dartboard" with a circle inscribed in it.
Randomly throw N darts at the board.
Count the number of HITS (darts landing within the circle); darts landing outside the circle are FLOPS (misses).
Pi is the value obtained by multiplying the ratio of hits to total throws by 4.
Why: for a circle of radius r inscribed in a square,
    pi = A_c / r^2
    A_s = 4r^2, so r^2 = A_s / 4
    pi = 4 * A_c / A_s
The ratio hits/throws approximates A_c / A_s, the circle's area over the square's area.
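A minimal serial sketch of the dartboard routine. The name dboard mirrors the fragments used on the later MPI slides; this particular implementation is an assumption.

    #include <stdlib.h>

    /* Throw 'darts' random darts at the square [-1,1] x [-1,1] and return
       how many land inside the inscribed unit circle.
       Note: in a parallel run, each task should seed rand() differently,
       e.g. srand(taskid). */
    double dboard(int darts)
    {
        double hits = 0.0;
        for (int i = 0; i < darts; i++) {
            double x = 2.0 * rand() / (double)RAND_MAX - 1.0;
            double y = 2.0 * rand() / (double)RAND_MAX - 1.0;
            if (x * x + y * y <= 1.0)
                hits += 1.0;
        }
        return hits;
    }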
Parallel Version
Each worker throws an equal number of darts.
Each worker counts its HITS.
The master adds up all the individual HITS.
It then computes pi as: pi = 4.0 * HITS / N
To Make it Faster...
Increase the number of workers p while keeping N constant.
Each worker handles N/p throws, so the greater p is, the fewer throws each worker handles.
Fewer throws per worker => faster.
To Make it "Better"
Increase the total number of throws N.
This makes the estimate more accurate.
MPI Version
Each task runs the dartboard algorithm:
    homehits = dboard(DARTS);
Workers send their homehits to the master:
    if (taskid != MASTER)
        MPI_Send(&homehits, 1, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD);
The master receives the homehits values from the workers:
    for (i = 0; i < p; i++) {
        rc = MPI_Recv(&hitrecv, 1, MPI_DOUBLE, MPI_ANY_SOURCE, mtype,
                      MPI_COMM_WORLD, &status);
        totalhits = totalhits + hitrecv;
    }
The master then calculates pi as:
    pi = 4.0 * totalhits / N
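Putting the fragments above together, a compilable sketch might look like the following. Variable names follow the slides; the loop bound, the tag handling, and letting the master count its own throws are assumptions, not the cluster's actual code.

    #include <mpi.h>
    #include <stdio.h>

    #define MASTER 0
    #define DARTS  1000000              /* throws per task; illustrative value */

    double dboard(int darts);           /* the routine sketched earlier */

    int main(int argc, char *argv[])
    {
        int taskid, numtasks, i, mtype = 1;
        double homehits, hitrecv, totalhits, pi;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
        MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

        homehits = dboard(DARTS);       /* every task throws its own darts */

        if (taskid != MASTER) {
            MPI_Send(&homehits, 1, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD);
        } else {
            totalhits = homehits;       /* include the master's own throws */
            for (i = 1; i < numtasks; i++) {
                MPI_Recv(&hitrecv, 1, MPI_DOUBLE, MPI_ANY_SOURCE, mtype,
                         MPI_COMM_WORLD, &status);
                totalhits += hitrecv;
            }
            pi = 4.0 * totalhits / ((double)DARTS * numtasks);  /* N = DARTS * numtasks */
            printf("Estimated pi = %f\n", pi);
        }

        MPI_Finalize();
        return 0;
    }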
MapReduce
A framework for simplifying the development of parallel programs. Developed at Google.
FLOSS implementations:
Hadoop (Java), from Yahoo
Disco (Erlang), from Nokia
Dumbo, from Audioscrobbler
Many others (including a 36-line Ruby one!)
MapReduce
The MapReduce library requires the user to implement:
Map(): takes a function and a sequence of values, and applies the function to each value in the sequence
Reduce(): combines all the elements of a sequence using a binary operation
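A toy, purely sequential C illustration of map and reduce in this functional sense (not any MapReduce library's API): map applies a function to every element, reduce folds the results with a binary operation.

    #include <stdio.h>

    typedef double (*map_fn)(double);
    typedef double (*reduce_fn)(double, double);

    /* apply f to every element of in[], writing results to out[] */
    static void map(map_fn f, const double *in, double *out, int n)
    {
        for (int i = 0; i < n; i++)
            out[i] = f(in[i]);
    }

    /* combine all elements of in[] with the binary operation op */
    static double reduce(reduce_fn op, const double *in, int n, double init)
    {
        double acc = init;
        for (int i = 0; i < n; i++)
            acc = op(acc, in[i]);
        return acc;
    }

    static double square(double x)        { return x * x; }
    static double add(double a, double b) { return a + b; }

    int main(void)
    {
        double in[] = {1, 2, 3, 4}, out[4];
        map(square, in, out, 4);
        printf("sum of squares = %f\n", reduce(add, out, 4, 0.0));  /* 30.0 */
        return 0;
    }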
How it Works (oversimplified)
map() takes as input a set of <key, value> pairs and produces a set of <intermediate key, value> pairs.
This is all done in parallel, across many machines.
The parallelisation is done by the MapReduce library (the programmer doesn't have to think about it).
The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to reduce().
A Reduce() instance takes a set of <intermediate key, value> pairs and produces an output value for that key, like a "summary" value.
Pi Estimation in MapReduce
Here, map() is the dartboard algorithm, and each worker runs it.
Hits are represented as <1, no_of_hits> and flops (misses) as <0, no_of_flops>.
Thus each Map() instance returns two <boolean, count> pairs to the MapReduce library.
The library then groups all the <bool, count> pairs into two "sets", one for key 0 and one for key 1, and passes them to reduce().
Reduce() then adds up the counts for each key to produce a "grand total". Thus we know the total hits and total flops.
These are output as <key, grand_total> pairs to the master process.
The master then finds pi as: pi = 4 * hits / (hits + flops)
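A plain-C, single-process sketch of what the map and reduce phases compute for this problem. It is not any MapReduce framework's API, and the worker count and dart count are illustrative; in a real run each map call would execute on a separate machine and the grouping by key would be done by the library.

    #include <stdio.h>

    #define WORKERS 4
    #define DARTS_PER_WORKER 1000000   /* illustrative */

    double dboard(int darts);          /* the routine sketched earlier */

    /* "map": one worker's throws, emitted as two <key, count> pairs:
       key 1 -> hits, key 0 -> flops (misses). */
    static void map_throws(int darts, double counts[2])
    {
        double hits = dboard(darts);
        counts[1] = hits;
        counts[0] = darts - hits;
    }

    /* "reduce": sum all counts emitted under one key */
    static double reduce_sum(const double values[], int n)
    {
        double total = 0.0;
        for (int i = 0; i < n; i++)
            total += values[i];
        return total;
    }

    int main(void)
    {
        double hits_by_worker[WORKERS], flops_by_worker[WORKERS], counts[2];

        for (int w = 0; w < WORKERS; w++) {   /* each iteration stands in for one map task */
            map_throws(DARTS_PER_WORKER, counts);
            flops_by_worker[w] = counts[0];
            hits_by_worker[w]  = counts[1];
        }

        double hits  = reduce_sum(hits_by_worker, WORKERS);
        double flops = reduce_sum(flops_by_worker, WORKERS);
        printf("Estimated pi = %f\n", 4.0 * hits / (hits + flops));
        return 0;
    }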
Other Solutions
PVM
OpenMP
Linda
Occam
Parallel/Scientific Python
Problems/Limitations
Parallel slowdown: parallelising a program beyond a certain point causes it to run slower.
Amdahl's Law: parallel speedup is limited by the sequential fraction of the program.
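Amdahl's Law stated as a formula: if a fraction f of the program can be parallelised and p processors are used, the speedup is

\[ S(p) = \frac{1}{(1 - f) + \frac{f}{p}}, \qquad \lim_{p \to \infty} S(p) = \frac{1}{1 - f} \]

For example, if 90% of the program is parallelisable (f = 0.9), the speedup can never exceed 10x, no matter how many CPUs the cluster has.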
Applications
Compute clusters are used whenever we have:
lots and lots of data to process
too little time to work sequentially
Finance
Risk assessment: India's NSE uses a Linux cluster to monitor the risk of members.
If a broker crosses the VaR (Value at Risk) limit, the account is disabled.
VaR is calculated in real time using PRISM (Parallel Risk Management System), which uses MPI.
NSE's PRISM handles 500 trades/sec and can scale to 1000 trades/sec.
Molecular Dynamics
Given a collection of atoms, we'd like to calculate how they interact and move under realistic laboratory conditions.
The expensive part is determining the force on each atom, since it depends on the positions of all the other atoms in the system.
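A deliberately toy 1-D sketch, just to show why this is expensive: the force loop is O(N^2) because every atom feels every other atom. The force() function here is a made-up stand-in; real MD codes use 3-D vectors and realistic potentials (e.g. Lennard-Jones), plus cutoffs and neighbour lists.

    #include <stdio.h>

    #define NATOMS 4

    /* toy pairwise "force" between two 1-D positions (a simple spring-like pull) */
    static double force(double xi, double xj)
    {
        return xj - xi;
    }

    int main(void)
    {
        double x[NATOMS] = {0.0, 1.0, 2.5, 4.0};
        double f[NATOMS];

        /* the expensive O(N^2) part: every atom interacts with every other atom */
        for (int i = 0; i < NATOMS; i++) {
            f[i] = 0.0;
            for (int j = 0; j < NATOMS; j++)
                if (j != i)
                    f[i] += force(x[i], x[j]);
        }

        for (int i = 0; i < NATOMS; i++)
            printf("f[%d] = %f\n", i, f[i]);
        return 0;
    }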
Software:
GROMACS: helps scientists simulate the behavior of large molecules (like proteins, lipids, and even polymers)
PyMOL: a molecular graphics and modelling package which can also be used to generate animated sequences
[Image: ray-traced lysozyme structure created with PyMOL]
Other Distributed Problems
Rendering multiple frames of high-quality animation (e.g. Shrek)
Indexing the web (e.g. Google)
Data mining
Questions?
