Getting Started With Sparklyr
Bob Wakefield
Principal
bob@MassStreet.net
Twitter:
@BobLovesData
I’m hiring a Data Science
Intern
• Start date sometime in March
• Three projects
• Convert some old R to Python
• Help me update a model using Keras and
TensorFlow
• Create a dashboard using R
I’m hiring a Data Science
Intern
• Pay between $10 and $15 depending on
experience
• Prefer a student. Will take a working adult.
• Age is not a factor in pay.
• Send resumes to Bob@MassStreet.net
• Don’t message me over MeetUp.
Follow Me!
•Personal Twitter: @BobLovesData
•Company Twitter: @MassStreet
•Blog: DataDrivenPerspectives.com
•Website: www.MassStreet.net
•Facebook: @MassStreetAnalyticsLLC
What We’ll Cover
• We’re gonna hit stuff at 10,000 ft and the speed of
heat.
• This is getting turned into an online class
• Spark
• General information
• Sparklyr
• How to get everything installed locally
• Walk through some sample code
• Databricks
• How to do SparkR on a cluster
All Material Can Be Downloaded
from GitHub
MassStreetAnalytics/getting-started-with-sparklyr
What Is Spark
• Distributed, in-memory processing framework
• Rapidly replacing MapReduce as a means to crunch
data.
• Many ways to interact with Spark
• R, Java, Scala, Python
• SparkR, Sparklyr, H2O with Sparkling Water
•Several APIs
• RDDs, Dataset/DataFrame, Spark SQL,
Spark Streaming, Structured Streaming
What is Sparklyr?
•There are two R packages for Spark
•SparkR and Sparklyr
•Sparklyr is a product of the folks that
make RStudio
•That used to mean strings attached
•RStudio Server has been open sourced
What is Sparklyr?
•Sparklyr allows you to work with data on a
Spark cluster using dplyr.
•Uses the SparkSQL API
•Little bit weird. Not like working with
normal R.
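Here is a minimal sketch of what "a little bit weird" looks like in practice. It assumes sparklyr and dplyr are installed and that a local copy of Spark is available (for example via spark_install()). The table name "mtcars" and the variable names are just illustrative choices. The point is that the dplyr verbs do not run in R. They are translated to Spark SQL and nothing comes back until you call collect().

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is already installed locally.
sc <- spark_connect(master = "local")

# Copy a built-in data frame into Spark.
# cars_tbl is a reference to a Spark table, not a local data frame.
cars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# This builds a lazy query. No rows are pulled into R yet.
avg_mpg <- cars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  arrange(cyl)

# Look at the SQL that dplyr generated for Spark.
show_query(avg_mpg)

# Only now does the result come back as a regular R tibble.
result <- collect(avg_mpg)

spark_disconnect(sc)
```

After collect(), result behaves like any other local data frame, so normal R code works on it again.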
What is SparkSQL?
• The new way to interact with Spark.
• Another SQL on big data implementation.
• Allows you to write SQL statements against a Spark
cluster.
• You can use straight SQL or dplyr techniques
• All super easy with R.
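A rough side-by-side of the two styles, assuming a local Spark install and the sparklyr, dplyr, and DBI packages. The "mtcars" table is only a stand-in for real data. The sparklyr connection works as a DBI connection, so straight SQL goes through dbGetQuery(), while the dplyr version asks the same question with verbs.

```r
library(sparklyr)
library(dplyr)
library(DBI)

# Assumption: Spark is already installed locally.
sc <- spark_connect(master = "local")
copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# Option 1: straight SQL against the Spark table.
sql_result <- dbGetQuery(
  sc,
  "SELECT cyl, AVG(mpg) AS avg_mpg FROM mtcars GROUP BY cyl ORDER BY cyl"
)

# Option 2: the same question using dplyr verbs.
dplyr_result <- tbl(sc, "mtcars") %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  arrange(cyl) %>%
  collect()

spark_disconnect(sc)
```

Both results end up as local data frames with one row per cylinder count.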
Cluster Connection Methods
• In every case you need a copy of Spark
• Locally
• On a cluster
• Options for connecting to a cluster
• RStudio Server
• Mesos/YARN
• Spark Standalone
• Livy
• Just use Databricks
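The options above all go through the same spark_connect() call. Below is a sketch, not a tested cluster recipe. The local part should run on a laptop with Java available. The cluster lines are commented out because the hosts and ports are placeholders you would have to replace, and each one assumes the cluster side is already configured.

```r
library(sparklyr)

# Local mode needs a local copy of Spark. Download one if none is installed.
if (nrow(spark_installed_versions()) == 0) {
  spark_install()
}

sc <- spark_connect(master = "local")
connected <- spark_connection_is_open(sc)
spark_disconnect(sc)

# Cluster connections use the same function with a different master or method.
# <host> is a placeholder, and the ports shown are only the usual defaults.
# sc <- spark_connect(master = "yarn")                                  # YARN
# sc <- spark_connect(master = "spark://<host>:7077")                   # Spark Standalone
# sc <- spark_connect(master = "mesos://<host>:5050")                   # Mesos
# sc <- spark_connect(master = "http://<host>:8998", method = "livy")   # Livy
# sc <- spark_connect(method = "databricks")                            # from inside Databricks
```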
What is Databricks?
• Spark P/SaaS
• Basically a worry-free Spark cluster
• Two versions
• Commercial
• Community
• Community version
• Practice on a real cluster
• Limited on space
Examples
