Algorithms 101 for Data Scientists (Part 2)

Presented by Chris Conlan and Janice McMahon
Bethesda Data Science Meetup

Sources of Sub-optimal code
• Every line of code in a program consumes resources and therefore has a cost
  • Mathematical operations, or steps in the program
  • Memory operations, or data allocation and creation
• The rules of the programming language determine how the resources are used
  • Inefficient use of resources is the greatest source of “hidden” complexity; i.e., operations that are not part of the mathematics of the algorithm, but affect its performance
• The way to avoid accidentally writing sub-optimal code is to understand how an algorithm specified in a language results in a program that runs on a computer

Problem #1: Unnecessary Operations
• Mathematical operations in a Python program are not the same as mathematical operations in an equation
• Python does not know how to “reduce” your equation
Original:
    A = 5
    B = 6
    sum = 0
    for i in range(10):
        sum += A + B + i
        sum -= A + B - i
Optimized:
    A = 5
    B = 6
    C = A + B
    sum = 0
    for i in range(10):
        sum += C + i
        sum -= C - i
The two A + B additions inside the original loop are redundant; computing C once outside the loop gives an O(n) reduction in ops!
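One way to check that hoisting the common term does not change the result is to wrap both loops in functions and compare them. This is a minimal sketch: the function names `redundant_loop` and `hoisted_loop` and the parameter `n` are illustrative assumptions, not part of the original slides, and the timings it prints will vary by machine and Python version.

```python
import timeit


def redundant_loop(n, a=5, b=6):
    """Recompute a + b twice on every iteration."""
    total = 0
    for i in range(n):
        total += a + b + i
        total -= a + b - i
    return total


def hoisted_loop(n, a=5, b=6):
    """Compute a + b once, outside the loop."""
    c = a + b
    total = 0
    for i in range(n):
        total += c + i
        total -= c - i
    return total


if __name__ == "__main__":
    n = 100_000
    # Both versions must agree before their speed is compared.
    assert redundant_loop(n) == hoisted_loop(n)
    print("redundant:", timeit.timeit(lambda: redundant_loop(n), number=20))
    print("hoisted:  ", timeit.timeit(lambda: hoisted_loop(n), number=20))
```

The hoisted version saves two additions per iteration, so the saving grows linearly with `n`.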

Problem #2: Memory Allocation
• Python is dynamically typed and uses a private heap for all data structures and objects
• Example: string concatenation
Repeated concatenation:
    S = ""
    for i in range(10):
        S += "hello"
Each append operation causes a new string to be created, with the old string copied to the new string and the new text added
Using join:
    H = ["hello", "hello", … , "hello"]
    S = "".join(H)
Avoids extra memory copies and allocations; much faster for large strings
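The two approaches can be compared side by side with a small script. This is a sketch under assumptions: the function names `concat_in_loop` and `concat_with_join` are made up for illustration, and the measured gap depends on the interpreter (CPython can sometimes extend a string in place, which narrows the difference).

```python
import timeit


def concat_in_loop(n):
    """Build the string with repeated +=, which may copy it on each append."""
    s = ""
    for _ in range(n):
        s += "hello"
    return s


def concat_with_join(n):
    """Collect the pieces in a list and join them in a single pass."""
    pieces = ["hello"] * n
    return "".join(pieces)


if __name__ == "__main__":
    n = 100_000
    # Same output either way; only the allocation pattern differs.
    assert concat_in_loop(n) == concat_with_join(n)
    print("+= loop:", timeit.timeit(lambda: concat_in_loop(n), number=20))
    print("join:   ", timeit.timeit(lambda: concat_with_join(n), number=20))
```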

Interpreted vs. Compiled Languages
• Compiled languages solve these problems by translating a program as
a unit instead of a statement at a time
• Optimizes over the whole expression to produce efficient code
• Data types are statically determined and stored efficiently
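To see the statement-at-a-time behavior on the Python side, the standard `dis` module can be used to inspect the bytecode the interpreter will run. The sketch below counts arithmetic instructions in two hypothetical functions, `repeated` and `hoisted`; both names and the helper `count_binary_ops` are assumptions for illustration. Opcode names differ between CPython versions, so the helper matches on the `BINARY_` prefix rather than a specific name.

```python
import dis


def repeated(a, b, i):
    # a + b is written twice; the bytecode compiler keeps both additions.
    return (a + b + i) - (a + b - i)


def hoisted(a, b, i):
    # a + b is computed once by hand and reused.
    c = a + b
    return (c + i) - (c - i)


def count_binary_ops(func):
    """Count bytecode instructions whose name starts with BINARY_."""
    return sum(
        1
        for ins in dis.get_instructions(func)
        if ins.opname.startswith("BINARY_")
    )


if __name__ == "__main__":
    print("repeated:", count_binary_ops(repeated))
    print("hoisted: ", count_binary_ops(hoisted))
```

On CPython the count for `repeated` should come out higher than for `hoisted`, because the repeated `a + b` is not eliminated automatically.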

Common subexpression elimination
• Redundant operations are found in the code via dataflow analysis
• Example code in C programming language:
    int A = 5;
    int B = 6;
    int sum = 0;
    for (int i = 0; i < 10; i++) {
        sum += A + B + i;
        sum -= A + B - i;
    }
Compiler performs dataflow analysis and uses registers for intermediate values
Data is given explicit “integer” type; statically allocated as number with no object overhead

Explicit memory allocation
• Dynamic memory allocation is explicit in code, exposing use of heap
• Example in C programming language:
    #include <stdlib.h>
    #include <string.h>
    char *a = malloc(51 * sizeof(char));  /* 50 characters plus the terminating '\0' */
    for (int i = 0; i < 50; i += 5)
        strcpy(&a[i], "hello");
Memory is allocated once at the beginning; maximum size must be given in allocation
String literal is copied directly into pre-allocated space; no allocation inside the loop
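A rough Python analogue of the pre-allocation pattern uses a `bytearray` sized up front and filled by slice assignment. This is only a sketch: the function `fill_preallocated` and its parameters are hypothetical, and it works on bytes rather than `str` because Python strings are immutable.

```python
def fill_preallocated(word=b"hello", repeats=10):
    """Allocate the whole buffer once, then copy each word into its slot."""
    buf = bytearray(len(word) * repeats)  # single allocation up front
    for k in range(repeats):
        start = k * len(word)
        # Same-length slice assignment overwrites in place; the buffer is not resized.
        buf[start:start + len(word)] = word
    return bytes(buf)


if __name__ == "__main__":
    result = fill_preallocated()
    print(len(result), result[:10])
```

As in the C version, the maximum size has to be known before the loop starts.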

Compilation to the Architecture
• Under the hood, the program uses functional units and a memory hierarchy to implement its operations
• Memory and computational operations have different latencies and bandwidths; the mix of the two determines the optimal schedule on a particular hardware architecture

Vectorization
https://www.cs.utexas.edu/~pingali/CS380C/2016/lectures/david-vectorization.pdf

Example: Dot Product
• Example code in C programming language:
    float dot = 0;
    for (int i = 0; i < 10; i++)
        dot += A[i] * B[i];
An optimizing C compiler can vectorize this computation, organizing it into groups of parallel operations

Python version of dot product
• Example code in classic Python:
    dot = 0
    for i in range(len(a)):
        dot += a[i] * b[i]
Interpreter will execute one operation per loop iteration
• Example using NumPy:
    dot = numpy.dot(a, b)
The NumPy library performs the vectorization internally to the library
Interpreted languages often get the performance improvements of compiled languages via libraries; wherever possible, use them!
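The loop and library versions of the dot product can be checked against each other with a short script. This is a sketch with assumed names (`dot_loop`, and the sample arrays `a` and `b`); the timings it prints depend on the machine and the NumPy build.

```python
import timeit

import numpy as np


def dot_loop(a, b):
    """Dot product computed one multiply-add per loop iteration."""
    total = 0.0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total


if __name__ == "__main__":
    a = np.arange(100_000, dtype=np.float64)
    b = np.ones(100_000, dtype=np.float64)
    # The two versions should agree up to floating-point rounding.
    assert np.isclose(dot_loop(a, b), np.dot(a, b))
    print("loop: ", timeit.timeit(lambda: dot_loop(a, b), number=5))
    print("numpy:", timeit.timeit(lambda: np.dot(a, b), number=5))
```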

Memory Hierarchy
https://www.edn.com/memory-hierarchy-design-part-1-basics-of-memory-hierarchies/
• Memory closest to the processor is fastest but most expensive
• Data moves through the hierarchy in blocks
• Get better performance by re-using data closer to the processor
• Copies of data at different levels must be consistent
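The effect of block-wise data movement can be made visible from Python by looking at how a NumPy array is laid out. The sketch below is illustrative only: the helper names `sum_by_rows` and `sum_by_columns` are assumptions, and whether the column-wise walk is measurably slower depends on array size and hardware.

```python
import numpy as np


def sum_by_rows(m):
    """Walk the array row by row; each row is one contiguous run of memory."""
    total = 0.0
    for row in m:
        total += row.sum()
    return total


def sum_by_columns(m):
    """Walk the array column by column; consecutive elements are one row apart."""
    total = 0.0
    for j in range(m.shape[1]):
        total += m[:, j].sum()
    return total


if __name__ == "__main__":
    a = np.ones((1000, 1000), dtype=np.float64)  # row-major (C order) by default
    print(a.flags["C_CONTIGUOUS"])
    # Strides are the byte steps to the next row and to the next column.
    print(a.strides)
    assert sum_by_rows(a) == sum_by_columns(a)
```

Row-wise traversal touches neighboring addresses, so each block fetched into cache is used in full before the next one is needed.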

Example: Matrix Multiplication
• Naïve C code:
    for (i = 0; i < n; i++)
        for (j = 0; j < m; j++)
            for (k = 0; k < l; k++)
                C(i,j) = C(i,j) + A(i,k) * B(k,j);
Each operation involves a memory access
• Block Algorithm:
    for (i = 0; i < n; i++)
        for (j = 0; j < m; j++) {
            // read block C(i,j) into fast memory
            for (k = 0; k < l; k++) {
                // read block A(i,k) into fast memory
                // read block B(k,j) into fast memory
                C(i,j) = C(i,j) + A(i,k) * B(k,j);
            }
            // write block C(i,j) to slow memory
        }
Data is read and written in blocks, taking advantage of cache reuse to improve performance
• Python code:
    c = numpy.matmul(a, b)
NumPy library optimizes algorithm implementation
https://sites.cs.ucsb.edu/~tyang/class/240a17/slides/Cache3.pdf
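The block algorithm can be sketched in NumPy to show the access pattern. This is an illustration under assumptions: the function name `blocked_matmul` and the default tile size of 64 are arbitrary choices, and in Python this version is not expected to beat `numpy.matmul`, which already relies on optimized library code.

```python
import numpy as np


def blocked_matmul(a, b, block=64):
    """Multiply a (n x l) by b (l x m) one tile at a time."""
    n, l = a.shape
    l2, m = b.shape
    if l != l2:
        raise ValueError("inner dimensions must match")
    c = np.zeros((n, m), dtype=np.result_type(a, b))
    for i in range(0, n, block):
        for j in range(0, m, block):
            # Basic slicing returns a view, so updates land directly in c.
            c_tile = c[i:i + block, j:j + block]
            for k in range(0, l, block):
                a_tile = a[i:i + block, k:k + block]
                b_tile = b[k:k + block, j:j + block]
                c_tile += a_tile @ b_tile
    return c


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.random((200, 150))
    b = rng.random((150, 120))
    assert np.allclose(blocked_matmul(a, b), np.matmul(a, b))
```

Each tile of C stays in place while the matching tiles of A and B stream past it, which is the reuse the block algorithm is after.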

Be your own Optimizer
• Count your operations: don’t do O(n²) when the mathematics is only O(n)
• Look at your loops: don’t put operations inside the loop body that can be taken out
• Use packages like NumPy that improve object representation for arrays and numerical objects
• Use packages like Cython that include some level of source analysis
• If desperate, use SWIG and call a C routine!!
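As one example of counting operations, running totals can be computed in O(n²) by accident when the mathematics only needs O(n). The sketch below uses hypothetical function names; `itertools.accumulate` is from the Python standard library.

```python
from itertools import accumulate


def running_totals_quadratic(values):
    """Re-sum the whole prefix for every position: roughly n^2 / 2 additions."""
    return [sum(values[: i + 1]) for i in range(len(values))]


def running_totals_linear(values):
    """Carry the previous total forward: one addition per element."""
    return list(accumulate(values))


if __name__ == "__main__":
    data = list(range(1, 2001))
    # Same answer, very different operation counts.
    assert running_totals_quadratic(data) == running_totals_linear(data)
```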
