Algorithms 101 for Data Scientists
Presented by Chris Conlan and Janice McMahon
Bethesda Data Science Meetup
Sources of Sub-optimal code
• Every line of code in a program consumes resources and therefore has a
cost
• Mathematical operations, or steps in the program
• Memory operations, or data allocation and creation
• The rules of the programming language determine how the resources are
used
• Inefficient use of resources is the greatest source of “hidden” complexity; i.e.,
operations that are not part of the mathematics of the algorithm, but affect its
performance
• The way to avoid accidentally writing sub-optimal code is to understand
how an algorithm specified in a language results in a program that runs on
a computer
Problem #1: Unnecessary Operations
• Mathematical operations in a Python program are not the same as
mathematical operations in an equation
• Python does not know how to “reduce” your equation
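Before (A + B is recomputed twice per iteration):
    A = 5
    B = 6
    sum = 0
    for i in range(10):
        sum += A + B + i
        sum -= A + B - i
After (A + B is computed once, outside the loop):
    A = 5
    B = 6
    C = A + B
    sum = 0
    for i in range(10):
        sum += C + i
        sum -= C - i
The repeated A + B computations are redundant. Hoisting them out of the loop saves two additions per iteration: an O(n) reduction in ops!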
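A minimal timing sketch of this point follows. The function names `redundant` and `hoisted`, the loop size, and the use of `total` instead of `sum` (to avoid shadowing the built-in) are illustrative assumptions, not part of the slides. Absolute timings will vary by machine and interpreter.

```python
import timeit

A = 5
B = 6

def redundant(n):
    """Recompute A + B twice on every iteration."""
    total = 0
    for i in range(n):
        total += A + B + i
        total -= A + B - i
    return total

def hoisted(n):
    """Compute A + B once, before the loop starts."""
    c = A + B
    total = 0
    for i in range(n):
        total += c + i
        total -= c - i
    return total

if __name__ == "__main__":
    n = 100_000
    # Both versions must agree before comparing their speed.
    assert redundant(n) == hoisted(n)
    print("redundant:", timeit.timeit(lambda: redundant(n), number=5))
    print("hoisted:  ", timeit.timeit(lambda: hoisted(n), number=5))
```

Each iteration contributes (C + i) - (C - i) = 2i, so both functions return the same value; only the number of additions differs.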
Problem #2: Memory Allocation
• Python is dynamically typed and uses a private heap for all data
structures and objects
• Example: string concatenation
Repeated append:
    S = ""
    for i in range(10):
        S += "hello"
Because Python strings are immutable, each append operation can cause a new string to be created, with the old string copied to the new string and the new text added.
Single join:
    H = ["hello", "hello", … , "hello"]
    S = "".join(H)
Avoids extra memory copies and allocations; much faster for large strings.
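The two approaches can be compared with a short sketch. The function names `concat_loop` and `concat_join` and the sizes used are assumptions for illustration. CPython can sometimes extend a string in place when nothing else references it, so the measured gap may be smaller than the copying argument suggests.

```python
import timeit

def concat_loop(n):
    """Build the string by appending inside a loop."""
    s = ""
    for _ in range(n):
        s += "hello"
    return s

def concat_join(n):
    """Build the string with one join over a list of pieces."""
    pieces = ["hello"] * n
    return "".join(pieces)

if __name__ == "__main__":
    n = 10_000
    # The two strategies must produce identical strings.
    assert concat_loop(n) == concat_join(n)
    print("loop:", timeit.timeit(lambda: concat_loop(n), number=50))
    print("join:", timeit.timeit(lambda: concat_join(n), number=50))
```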
Interpreted vs. Compiled Languages
• Compiled languages solve these problems by translating a program as
a unit instead of a statement at a time
• Optimizes over the whole expression to produce efficient code
• Data types are statically determined and stored efficiently
Common subexpression elimination
• Redundant operations are found in the code via dataflow analysis
• Example code in C programming language:
    int A = 5;
    int B = 6;
    int sum = 0;
    for (int i = 0; i < 10; i++) {
        sum += A + B + i;
        sum -= A + B - i;
    }
• The compiler performs dataflow analysis and uses registers for intermediate values
• Data is given an explicit “integer” type; it is statically allocated as a number with no object overhead
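For contrast, the following sketch inspects what CPython generates for the same loop. It assumes CPython and module-level globals `A` and `B`; the helper `count_global_loads` is a hypothetical name introduced here. It counts how many bytecode instructions load `A`, which gives a rough indication of whether the repeated `A + B` was merged.

```python
import dis

A = 5
B = 6

def loop():
    total = 0
    for i in range(10):
        total += A + B + i
        total -= A + B - i
    return total

def count_global_loads(func, name):
    """Count bytecode instructions in func that load the global `name`."""
    return sum(
        1
        for ins in dis.get_instructions(func)
        if ins.opname == "LOAD_GLOBAL" and ins.argval == name
    )

if __name__ == "__main__":
    # Two loads of A means A + B is still evaluated twice per iteration.
    print("loads of A:", count_global_loads(loop, "A"))
```

If the bytecode compiler had eliminated the common subexpression, `A` would be loaded at most once in the loop body.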
Explicit memory allocation
• Dynamic memory allocation is explicit in code, exposing use of heap
• Example in C programming language:
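    /* needs <stdlib.h> and <string.h> */
    char *a = malloc((50 + 1) * sizeof(char));  /* 50 characters plus the terminating '\0' */
    for (int i = 0; i < 50; i += 5)
        strcpy(&a[i], "hello");
• Memory is allocated once at the beginning; the maximum size (including room for the string terminator) must be given in the allocation
• The string literal is copied directly into pre-allocated space; no allocation inside the loop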
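The same allocate-once pattern can be approximated in Python with a `bytearray`. This is a sketch under assumptions: the function name `fill_preallocated` and its parameters are hypothetical, and a `bytearray` holds bytes rather than text.

```python
def fill_preallocated(word=b"hello", count=10):
    """Allocate the whole buffer once, then copy each word into place."""
    size = len(word)
    buf = bytearray(size * count)  # single allocation up front
    for i in range(count):
        start = i * size
        # Same-length slice assignment overwrites in place without resizing.
        buf[start:start + size] = word
    return buf

if __name__ == "__main__":
    print(fill_preallocated().decode("ascii"))
```

Unlike the C version, no terminator byte is needed, because the buffer carries its own length.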
Compilation to the Architecture
• Under the hood, the program uses functional units and a
memory hierarchy to implement its operations
• Memory and operations have different latencies and bandwidths; the
mix of memory and computational operations determines the
optimal schedule on a particular hardware architecture
Vectorization
https://www.cs.utexas.edu/~pingali/CS380C/2016/lectures/david-vectorization.pdf
Example: Dot Product
• Example code in C programming language:
    float dot = 0;
    for (int i = 0; i < 10; i++)
        dot += A[i] * B[i];
An optimizing C compiler can vectorize this
computation, organizing it into
groups of parallel operations
Python version of dot product:
• Example code in classic Python:
    dot = 0
    for i in range(len(a)):
        dot += a[i] * b[i]
  The interpreter executes one multiply-add per loop iteration
• Example using NumPy:
    dot = numpy.dot(a, b)
  The NumPy library performs the vectorization internally
Interpreted languages often get the performance
improvements of compiled languages via libraries;
wherever possible, use them!
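A small sketch comparing the two dot-product versions is shown below. It assumes NumPy is installed; the function name `dot_loop` and the sample arrays are illustrative and not taken from the slides.

```python
import numpy as np

def dot_loop(a, b):
    """Dot product with an explicit Python loop: one multiply-add per iteration."""
    total = 0.0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total

# Sample inputs: 0, 1, ..., 9 and a vector of ones.
a = np.arange(10, dtype=np.float64)
b = np.ones(10, dtype=np.float64)

if __name__ == "__main__":
    print("loop: ", dot_loop(a, b))
    print("numpy:", np.dot(a, b))
```

Both calls compute the same sum; the difference is that `np.dot` runs the loop inside compiled library code.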
Memory Hierarchy
https://www.edn.com/memory-hierarchy-design-part-1-basics-of-memory-hierarchies/
• Memory closest to the processor
is fastest but most expensive
• Data moves through the
hierarchy in blocks
• Get better performance by
re-using data closer to the
processor
• Copies of data at different levels
must be consistent
Example: Matrix Multiplication
• Naïve C code:
    for (i = 0; i < n; i++)
        for (j = 0; j < m; j++)
            for (k = 0; k < l; k++)
                C[i][j] = C[i][j] + A[i][k] * B[k][j];
  Each operation involves a memory access
• Block algorithm (here i, j, and k index blocks rather than single elements):
    for (i = 0; i < n; i++)
        for (j = 0; j < m; j++) {
            // read block C(i,j) into fast memory
            for (k = 0; k < l; k++) {
                // read block A(i,k) into fast memory
                // read block B(k,j) into fast memory
                C(i,j) = C(i,j) + A(i,k) * B(k,j)
            }
            // write block C(i,j) to slow memory
        }
  Data is read and written in blocks, taking advantage of cache reuse to improve performance
• Python code:
    c = numpy.matmul(a, b)
  The NumPy library optimizes the algorithm implementation
https://sites.cs.ucsb.edu/~tyang/class/240a17/slides/Cache3.pdf
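The loop structure of the block algorithm can be sketched in plain Python. The function names, the block size `bs`, and the list-of-lists matrix representation are assumptions made for illustration. In pure Python the blocked version is not expected to run faster, since interpreter overhead dominates; the sketch only shows how the iteration order changes.

```python
def matmul_naive(A, B):
    """Triple loop over single elements."""
    n, l, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for k in range(l):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_blocked(A, B, bs=2):
    """Same arithmetic, visiting the matrices one bs-by-bs block at a time."""
    n, l, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, bs):          # block row of C
        for j0 in range(0, m, bs):      # block column of C
            for k0 in range(0, l, bs):  # block along the shared dimension
                for i in range(i0, min(i0 + bs, n)):
                    for j in range(j0, min(j0 + bs, m)):
                        acc = C[i][j]
                        for k in range(k0, min(k0 + bs, l)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C

if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6]]
    B = [[7, 8], [9, 10], [11, 12]]
    print(matmul_naive(A, B))
    print(matmul_blocked(A, B))
```

The `min(...)` bounds handle matrices whose dimensions are not multiples of the block size.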
Be your own Optimizer
• Count your operations – don’t do O(n^2) when the mathematics is only
O(n)
• Look at your loops – don’t put operations inside the loop body that
can be taken out
• Use packages like NumPy that improve object representation for
arrays and numerical objects
• Use packages like Cython that include some level of source analysis
• If desperate – use SWIG and call a C routine!!
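As one illustration of counting operations, consider running totals. This example is an assumption chosen for this sketch and does not appear in the slides. Re-summing each prefix costs O(n^2) additions, while carrying the previous total forward costs O(n).

```python
from itertools import accumulate

def running_totals_quadratic(xs):
    """Re-sum every prefix from scratch: about n^2 / 2 additions."""
    return [sum(xs[: i + 1]) for i in range(len(xs))]

def running_totals_linear(xs):
    """Carry the previous total forward: n - 1 additions."""
    return list(accumulate(xs))

if __name__ == "__main__":
    data = [1, 2, 3, 4]
    print(running_totals_quadratic(data))
    print(running_totals_linear(data))
```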

Algorithms 101 for Data Scientists (Part 2)
