Seattle useR Group - R + Scala

Shouheng Yi
Data Scientist
shouhengyi@gmail.com
www.linkedin.com/in/shouhengyi
Seattle useR Group
05/05/2015

R is Hard to Scale
• Architectural Parallelism: most of R’s parallelism is
done at the CPU level using MPI
• Data Parallelism: data must be fully present in
RAM during an R session
• Why? R is built on top of C and Fortran

Debugging Deadlocks - Good Times

Scientists vs. Developers
• Scientists and researchers love R, because most of
their computing tasks are iterative/procedural
• Software engineers are less impressed, because
they need to develop concurrent, reactive and
robust applications

Why I Found Scala Useful
• Lives on JVM (most devs are comfortable with JVM)
• Great distributed frameworks - Akka, Slick, Spark, etc.
• Syntactic sugar (less typing) -> easier to debug -> rapid development
R
vec <- 1:100
sum <- 0
for(i in vec){
sum <- sum + i
}
Scala
val vec = 1 to 100
val sum = (0 /: vec)((a, b) => a + b)
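
A side note on the Scala line above: the /: operator is shorthand for foldLeft, and it has been deprecated since Scala 2.13. A minimal sketch of the same sum with the spelled-out forms (the object name SumDemo is hypothetical, used only for illustration):

```scala
object SumDemo {
  val vec = 1 to 100

  // Same fold as (0 /: vec)((a, b) => a + b), written with foldLeft
  val viaFold: Int = vec.foldLeft(0)((a, b) => a + b)

  // For a plain sum, the built-in method is shorter still
  val viaSum: Int = vec.sum

  def main(args: Array[String]): Unit = {
    println(viaFold) // 5050
    println(viaSum)  // 5050
  }
}
```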

Intro to Akka’s Actor Model
[Diagram: two actors, each with its own inbox]

Eventually…
Therefore the form of parallelism is not limited
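
To make the actor/inbox picture concrete, here is a toy sketch in plain Scala with no Akka dependency. ToyActor, its ! method and stopAndJoin are hypothetical names for illustration only, not Akka's API. The idea it shares with Akka is that each actor owns a private inbox and handles one message at a time, so the handler itself needs no locks.

```scala
import java.util.concurrent.LinkedBlockingQueue

// Toy actor (NOT Akka): one thread drains a private inbox, one message at a time.
final class ToyActor[A](handler: A => Unit) {
  // None is used as a "poison pill" that tells the actor to stop.
  private val inbox = new LinkedBlockingQueue[Option[A]]()

  private val thread = new Thread(new Runnable {
    def run(): Unit = {
      var running = true
      while (running) {
        inbox.take() match {
          case Some(msg) => handler(msg) // messages are handled sequentially
          case None      => running = false
        }
      }
    }
  })
  thread.start()

  // Fire-and-forget send: the message is queued and the caller moves on.
  def !(msg: A): Unit = inbox.put(Some(msg))

  // Let already-queued messages finish, then stop the actor's thread.
  def stopAndJoin(): Unit = { inbox.put(None); thread.join() }
}

object ToyActorDemo {
  def run(): Double = {
    var total = 0.0 // written only by the actor's thread, read after join()
    val summer = new ToyActor[Double](x => total += x)
    Seq(1.0, 2.0, 3.0).foreach(summer ! _)
    summer.stopAndJoin()
    total
  }
}
```

Calling ToyActorDemo.run() sends three numbers to one actor and reads the accumulated total once the actor has stopped.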

A Simple Task
• Step 1: read from a CSV file that has 100,000,000
double elements (~1.7 GB).
read.csv() freaked out on my MacBook Air. It had been running for 20+ hours
> vector <- read.csv("./vector.csv", quote = F, row.names = F)
• Step 2: calculate its sum
There are existing R packages like ff and bigmemory that address these
out-of-memory issues, but I want to demonstrate an alternative method that is
much more generic, robust and scalable
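
The core trick for Step 1 is to never hold the whole file in memory. A minimal sketch of that idea, assuming the file has one number per line and no header row (ChunkedSum and sumFile are hypothetical names, and this version sums in Scala rather than handing each chunk to R):

```scala
import scala.io.Source

object ChunkedSum {
  // Sum a file of one number per line while holding at most
  // chunkSize lines in memory at any time.
  def sumFile(path: String, chunkSize: Int): Double = {
    val source = Source.fromFile(path)
    try {
      source.getLines()
        .grouped(chunkSize)                      // lazy: one chunk at a time
        .map(chunk => chunk.map(_.toDouble).sum) // partial sum per chunk
        .sum                                     // combine the partial sums
    } finally source.close()
  }
}
```

Usage would look like ChunkedSum.sumFile("./vector.csv", 1000000). Each chunk is independent, which is what makes it possible to hand the chunks to separate workers.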

Rserve
> library(Rserve)
> Rserve()
Starting Rserve:
/Library/Frameworks/R.framework/Resources/bin/R CMD /Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rserve/libs//Rserve
R version 3.1.2 (2014-10-31) -- "Pumpkin Helmet"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin10.8.0 (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
Rserv started in daemon mode.

[Diagram: Producer and Worker, each with an inbox]
Producer handles: case ProcessData(sum: Double, isEnd: Boolean)
Producer sends: sender ! doWork(ind, size - 1)
Worker handles: case DoWork(ind: Int, size: Int)
Worker sends: sender ! processData(sum, isEnd)
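
The exchange in this diagram is pull-based: a worker reports a partial sum and only then receives its next chunk, so no worker is handed more than it can process. A minimal sketch of that pattern using only the standard library's Futures instead of Akka (PullSum and parallelSum are hypothetical names, and the chunks are plain in-memory sequences rather than lines sent to Rserve):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object PullSum {
  // Each worker keeps pulling the next chunk until none remain,
  // then returns the total of everything it processed.
  def parallelSum(chunks: Iterator[Seq[Double]], nworker: Int): Double = {
    // The iterator is shared, so access to it is serialized.
    def nextChunk(): Option[Seq[Double]] = chunks.synchronized {
      if (chunks.hasNext) Some(chunks.next()) else None
    }

    def work(): Double = {
      var partial = 0.0
      var chunk = nextChunk()
      while (chunk.isDefined) {
        partial += chunk.get.sum
        chunk = nextChunk()
      }
      partial
    }

    // Start nworker independent workers and add up their results.
    val workers = Seq.fill(nworker)(Future(work()))
    Await.result(Future.sequence(workers), 1.minute).sum
  }
}
```

For example, PullSum.parallelSum((1 to 100).map(_.toDouble).grouped(10), 4) distributes ten chunks over four workers. With real floating-point data the last digits of the result can vary between runs, because the order of additions depends on scheduling.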

Producer Class
import akka.actor.{Actor, ActorLogging, Props}
import akka.routing.RoundRobinRouter
import scala.io.Source
import ClusterMessageProtocol._

class Producer extends Actor with ActorLogging {
  // Inputs: lines per chunk and number of workers
  val (size, nworker) = (1000000, 10)
  // Counters and result holder
  var (ind, ncorpse, sum_total): (Int, Int, Double) = (0, 0, 0.0)
  // Create the router
  val workerRouter = context.actorOf(
    Props(new Worker(self, sum_total)).withRouter(RoundRobinRouter(nworker)),
    name = "workerRouter"
  )
  // Read the file lazily and chop it into chunks of `size` lines
  val iterator = Source.fromFile("./vector.csv").getLines().grouped(size)

  // What to do when it starts
  override def preStart() = println(s"Producer $self is alive")
  // What to do when it stops
  override def postStop() = println(s"Producer $self is dead. The sum is $sum_total")
  // Which messages to handle
  override def receive = {
    case ProcessData(sum) =>
      sum_total += sum
      if (iterator.hasNext) {
        // DoWork expects a List[String], so convert the chunk
        sender ! DoWork(iterator.next().toList)
      } else {
        // No chunks left: retire the worker that just reported
        ncorpse += 1
        context.stop(sender)
      }
      if (ncorpse == nworker) context.stop(self)
  }
}

Worker Class
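import akka.actor.{Actor, ActorLogging, ActorRef}
import org.rosuda.REngine.Rserve.RConnection
import ClusterMessageProtocol._

class Worker(master: ActorRef, sum_total: Double) extends Actor with ActorLogging {

  // Announce readiness by sending the initial (zero) sum to the producer
  override def preStart() = {
    println(s"Worker $self is alive!!!")
    master ! ProcessData(sum_total)
  }

  override def receive = {
    case DoWork(iter) =>
      // Rserve: ship the chunk to R and let R compute the sum
      val c: RConnection = new RConnection()
      val sum: Double =
        try {
          c.assign("x", iter.toArray)
          c.eval("sum(as.numeric(x))").asDouble()
        } finally c.close() // release the connection even if R fails

      // Report the partial sum, which also asks for more work
      println(s"$self => Partial Sum: $sum, Size: ${iter.length}")
      sender ! ProcessData(sum)
  }
}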

Main
import akka.actor.{ActorRef, ActorSystem, Props}

object Application extends App {
  // The App trait already supplies main; its body is the object body
  val system = ActorSystem("ClusterSystem")
  system.actorOf(Props[Producer], name = "producer")
}

object ClusterMessageProtocol {
  sealed trait Message

  // Producer side
  case class InitiateWorker(worker: ActorRef) extends Message
  case class ProcessData(sum: Double) extends Message

  // Worker side
  case class DoWork(iter: List[String]) extends Message
}

…
Worker Actor[akka://ClusterSystem/user/producer/workerRouter/$h#504275836] is alive!!!
Worker Actor[akka://ClusterSystem/user/producer/workerRouter/$e#1071584906] is alive!!!
Producer Actor[akka://ClusterSystem/user/producer#1272599354] is alive
Actor[akka://ClusterSystem/user/producer/workerRouter/$h#1269880699] => Partial Sum: -964.3282348781046, Size: 1000000
Actor[akka://ClusterSystem/user/producer/workerRouter/$f#500982456] => Partial Sum: -177.85266733478048, Size: 1000000
…
Actor[akka://ClusterSystem/user/producer/workerRouter/$e#1850062035] => Partial Sum: -547.8233029081448, Size: 1000000
Actor[akka://ClusterSystem/user/producer/workerRouter/$h#1269880699] => Partial Sum: -660.0674912837135, Size: 1000000
Producer Actor[akka://ClusterSystem/user/producer#1420020857] is dead. The sum is -13615.40143829277
> sum(vector)
[1] -13615.4

Applications
1. Optimization Problems
Evaluating objective function, simulation in parallel (Differential Evolution!)
2. Distributed Matrix Operations
Product, transpose, inverse of distributed matrices, quadratic
programming in large dimensional space
3. Real-time machine learning
Linear/logistic regression (see 2), Random Forest, Neural network
4. Statistical Inference
Bootstrap, sampling, log-likelihood estimation, Bayesian

Thank You!
Any Questions?
Email: shouhengyi@gmail.com
LinkedIn: www.linkedin.com/in/shouhengyi
Zhihu (知乎): 伊首衡
