Shouheng Yi
Data Scientist
shouhengyi@gmail.com
www.linkedin.com/in/shouhengyi
+
Seattle useR Group
05/05/2015
R is Hard to Scale
• Architectural parallelism: most of R's parallelism is
done at the CPU level using MPI
• Data parallelism: data must be fully present in
RAM during an R session
• Why? R sits on top of C and Fortran
Debugging Deadlocks - Good Times
Scientists vs. Developers
• Scientists and researchers love R, because most of
their computing tasks are iterative/procedural
• Software engineers are less impressed, because
they need to develop concurrent, reactive and
robust applications
To be exact: Akka + Rserve
Why I Found Scala Useful
• Lives on JVM (most devs are comfortable with JVM)
• Great distributed frameworks - Akka, Slick, Spark, etc.
• Syntactic sugar (less typing) -> easier to debug -> rapid development
R
vec <- 1:100
sum <- 0
for(i in vec){
sum <- sum + i
}
Scala
val vec = 1 to 100
val sum = (0 /: vec)((a, b) => a + b)
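The `/:` operator above is shorthand for `foldLeft`. A minimal sketch of a few equivalent spellings follows; the object and value names (`FoldSketch`, `viaFoldLeft`, and so on) are ours, not from the talk.

```scala
object FoldSketch {
  val vec = 1 to 100

  // `(0 /: vec)(f)` is shorthand for `vec.foldLeft(0)(f)`
  val viaFoldLeft: Int = vec.foldLeft(0)((acc, x) => acc + x)

  // Placeholder syntax shortens the lambda further
  val viaPlaceholder: Int = vec.foldLeft(0)(_ + _)

  // For plain addition the collection already provides `sum`
  val viaSum: Int = vec.sum

  def main(args: Array[String]): Unit =
    println(s"$viaFoldLeft $viaPlaceholder $viaSum")
}
```

All three values are the same fold; which one reads best is a matter of taste.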
Intro to Akka’s Actor Model
[Diagram: two actors, each with its own inbox]
Eventually…
Therefore the form of parallelism is not limited
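To make the actor-plus-inbox picture concrete, here is a rough sketch that uses only the JDK, not Akka. `MiniActor` is a hypothetical stand-in we made up for illustration: it has an inbox, private state, and a single thread that handles one message at a time. Real Akka actors share threads through a dispatcher, so treat this as the idea only.

```scala
import java.util.concurrent.{CountDownLatch, LinkedBlockingQueue}

// Hypothetical stand-in for an actor: an inbox plus one thread that
// handles messages strictly one at a time.
final class MiniActor[A](handler: A => Unit) {
  private val inbox = new LinkedBlockingQueue[Option[A]]()

  private val thread = new Thread(new Runnable {
    def run(): Unit = {
      var running = true
      while (running) {
        inbox.take() match {
          case Some(msg) => handler(msg)
          case None      => running = false // poison pill: stop the loop
        }
      }
    }
  })
  thread.start()

  // Fire-and-forget send, loosely mirroring Akka's `!`
  def !(msg: A): Unit = inbox.put(Some(msg))

  // Stop after all earlier messages are handled, then wait for the thread
  def stopAndJoin(): Unit = { inbox.put(None); thread.join() }
}

object MiniActorDemo {
  // Sends 1..n from several threads to one summing actor and returns the total
  def sumWithActor(n: Int, senders: Int): Long = {
    var total = 0L // touched only by the actor's thread while it runs
    val summer = new MiniActor[Int](x => total += x)
    val done = new CountDownLatch(senders)
    for (s <- 0 until senders) {
      new Thread(new Runnable {
        def run(): Unit = {
          // Each sender takes every `senders`-th number
          (1 to n).filter(_ % senders == s).foreach(summer ! _)
          done.countDown()
        }
      }).start()
    }
    done.await()
    summer.stopAndJoin()
    total
  }

  def main(args: Array[String]): Unit =
    println(sumWithActor(1000, 4))
}
```

Because the state is only ever touched by the thread that drains the inbox, the senders need no locks of their own, which is the property the actor model is selling.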
Code Dump!
A Simple Task
• Step 1: read a CSV file that has 100,000,000
double elements (~1.7 GB).
read.csv() freaked out on my MacBook Air. It had been running like this for 20+ hours
> vector <- read.csv("./vector.csv", quote = F, row.names = F)
• Step 2: calculate its sum
There are existing R packages such as ff and bigmemory that address these
out-of-memory issues, but I want to demonstrate an alternative method that is
more generic, robust and scalable
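Before any actors are involved, the core trick is to never hold the whole file in memory. A minimal single-threaded sketch, assuming one number per line; `ChunkedSum`, `sumFile` and `chunkSize` are names we picked for illustration, and the demo writes a tiny stand-in file rather than the talk's vector.csv.

```scala
import java.io.{File, PrintWriter}
import scala.io.Source

object ChunkedSum {
  // Sums a one-number-per-line file, holding at most `chunkSize` lines in memory
  def sumFile(path: String, chunkSize: Int): Double = {
    val source = Source.fromFile(path)
    try {
      source.getLines()
        .grouped(chunkSize)                           // lazily yields batches of lines
        .map(chunk => chunk.map(_.trim.toDouble).sum) // partial sum per batch
        .sum                                          // combine the partial sums
    } finally {
      source.close()
    }
  }

  def main(args: Array[String]): Unit = {
    // Small stand-in file (assumption: one value per line, no header)
    val tmp = File.createTempFile("vector", ".csv")
    val out = new PrintWriter(tmp)
    try {
      (1 to 10).foreach(i => out.println(i.toDouble))
    } finally {
      out.close()
    }
    println(sumFile(tmp.getPath, chunkSize = 3))
    tmp.delete()
  }
}
```

The actor version later in the deck keeps the same chunking but hands each batch to a worker instead of summing it in place.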
Rserve
> library(Rserve)
> Rserve()
Starting Rserve:
/Library/Frameworks/R.framework/Resources/bin/R CMD /Library/Frameworks/R.framework/
Versions/3.1/Resources/library/Rserve/libs//Rserve
R version 3.1.2 (2014-10-31) -- "Pumpkin Helmet"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin10.8.0 (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
Rserv started in daemon mode.
Producer
Inbox: case ProcessData(sum: Double, isEnd: Boolean)
Reply: sender ! DoWork(ind, size - 1)
Worker
Inbox: case DoWork(ind: Int, size: Int)
Reply: sender ! ProcessData(sum, isEnd)
Producer Class
import akka.actor.{Actor, ActorLogging, Props}
import akka.routing.RoundRobinRouter
import scala.io.Source
import ClusterMessageProtocol._

class Producer extends Actor with ActorLogging {
  // Some inputs: lines per chunk and number of workers
  var (size, nworker) = (1000000, 10)

  // Some counters and result holder
  var (ind, ncorpse, sum_total): (Int, Int, Double) = (0, 0, 0.0)

  // Create the router (RoundRobinRouter is the older Akka API;
  // later releases use RoundRobinPool instead)
  val workerRouter = context.actorOf(
    Props(new Worker(self, sum_total)).withRouter(RoundRobinRouter(nworker)),
    name = "workerRouter"
  )

  // Read the file lazily and chop it into pieces of `size` lines
  val iterator = Source.fromFile("./vector.csv").getLines.grouped(size)

  // What to do when it enters
  override def preStart() = println(s"Producer $self is alive")

  // What to do when it exits
  override def postStop() = println(s"Producer $self is dead. The sum is $sum_total")

  // What messages to receive
  override def receive = {
    case ProcessData(sum) =>
      sum_total += sum
      if (iterator.hasNext) {
        // More data: hand the next chunk to the worker that just reported
        sender ! DoWork(iterator.next.toList)
      } else {
        // No data left: retire this worker
        ncorpse += 1
        context.stop(sender)
      }
      // Stop once every worker has been retired
      if (ncorpse == nworker) context.stop(self)
  }
}
Worker Class
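import akka.actor.{Actor, ActorLogging, ActorRef}
import org.rosuda.REngine.Rserve.RConnection
import ClusterMessageProtocol._

class Worker(master: ActorRef, sum_total: Double) extends Actor with ActorLogging {

  // On start, report to the producer so that it sends the first chunk
  override def preStart() = {
    println(s"Worker $self is alive!!!")
    master ! ProcessData(sum_total)
  }

  override def receive = {
    case DoWork(iter) =>
      // Rserve: ship the chunk to R and let R compute the sum
      val c: RConnection = new RConnection()
      val sum: Double =
        try {
          c.assign("x", iter.toArray)
          c.eval("sum(as.numeric(x))").asDouble()
        } finally {
          c.close() // release the connection even if R fails
        }

      // Asking for more
      println(s"$self => Partial Sum: $sum, Size: ${iter.length}")
      sender ! ProcessData(sum)
  }
}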
Main
import akka.actor.{ActorRef, ActorSystem, Props}

object Application extends App {
  // The body of an App object runs as the program's main method
  val system = ActorSystem("ClusterSystem")
  system.actorOf(Props[Producer], name = "producer")
}

object ClusterMessageProtocol {
  sealed trait Message

  // Producer side
  case class InitiateWorker(worker: ActorRef) extends Message
  case class ProcessData(sum: Double) extends Message

  // Worker side
  case class DoWork(iter: List[String]) extends Message
}
…
Worker Actor[akka://ClusterSystem/user/producer/workerRouter/$h#504275836] is alive!!!
Worker Actor[akka://ClusterSystem/user/producer/workerRouter/$e#1071584906] is alive!!!
Producer Actor[akka://ClusterSystem/user/producer#1272599354] is alive
Actor[akka://ClusterSystem/user/producer/workerRouter/$h#1269880699] => Partial Sum: -964.3282348781046, Size: 1000000
Actor[akka://ClusterSystem/user/producer/workerRouter/$f#500982456] => Partial Sum: -177.85266733478048, Size: 1000000
…
Actor[akka://ClusterSystem/user/producer/workerRouter/$e#1850062035] => Partial Sum: -547.8233029081448, Size: 1000000
Actor[akka://ClusterSystem/user/producer/workerRouter/$h#1269880699] => Partial Sum: -660.0674912837135, Size: 1000000
Producer Actor[akka://ClusterSystem/user/producer#1420020857] is dead. The sum is -13615.40143829277
> sum(vector)
[1] -13615.4
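The log above is just partial sums being added up, and R prints its own total rounded to 7 significant digits. The same check can be sketched without Akka or Rserve by using plain Scala Futures; this swaps in a different mechanism than the talk's actors, and `PartialSums`, `parallelSum` and the random seed are our own choices. Floating-point addition is not associative, so a chunked total may differ from a sequential one in the trailing digits.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object PartialSums {
  // Splits `xs` into chunks, sums each chunk in a Future, then adds the partial sums
  def parallelSum(xs: Vector[Double], chunkSize: Int): Double = {
    val partials: List[Future[Double]] =
      xs.grouped(chunkSize).map(chunk => Future(chunk.sum)).toList
    Await.result(Future.sequence(partials), 1.minute).sum
  }

  def main(args: Array[String]): Unit = {
    val rng = new scala.util.Random(42) // arbitrary seed, not from the talk
    val xs = Vector.fill(100000)(rng.nextGaussian())
    val total = parallelSum(xs, chunkSize = 10000)
    // Compare against the sequential sum with a tolerance, not with ==
    println(math.abs(total - xs.sum) < 1e-6)
  }
}
```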
Applications
1. Optimization Problems
Evaluating objective functions and running simulations in parallel (Differential Evolution!)
2. Distributed Matrix Operations
Product, transpose and inverse of distributed matrices, quadratic
programming in high-dimensional spaces
3. Real-time Machine Learning
Linear/logistic regression (see 2), random forests, neural networks
4. Statistical Inference
Bootstrap, sampling, log-likelihood estimation, Bayesian methods
Thank You!
Any Questions?
Email: shouhengyi@gmail.com
LinkedIn: www.linkedin.com/in/shouhengyi
Zhihu: 伊首衡

Seattle useR Group - R + Scala
