Type Checking Scala Spark Datasets:
Dataset Transforms

John Nestor, 47 Degrees
www.47deg.com
Seattle Spark Meetup
September 22, 2016
47deg.com © Copyright 2016 47 Degrees
Outline
• Introduction
• Transforms
• Demos
• Implementation
• Getting the Code
Introduction
Spark Scala APIs
• RDD (pass closures)
• Functional programming model
• Types checked at compile time
• DataFrame (pass SQL)
• SQL programming model (can be optimized)
• Types checked at run time
• Dataset (pass closures or SQL)
• Combines best of RDDs and DataFrames
• Some (not all) types checked at compile time
Run-Time Scala Checking
• Field/column names
• Names are specified as strings
• Run-time error if no such field exists
• Field/column types
• Specified via casting to the expected type
• Run-time error if the value is not of the expected type
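The failure mode above can be illustrated without Spark at all. The sketch below is a plain-Scala analogy (a row as a `Map[String, Any]`, which is an assumption for illustration, not the DataFrame implementation): field names are strings and values are cast to an expected type, so both a misspelled name and a wrong type compile cleanly and only fail when run.

```scala
// Plain-Scala analogy of run-time column checking (not Spark's actual
// representation): fields looked up by string name, values cast on use.
object RunTimeCheckDemo extends App {
  val row: Map[String, Any] = Map("a" -> 3, "b" -> "foo")

  // Misspelled field name: compiles fine, fails only at run time.
  val badName = util.Try(row("nosuchfield"))
  println(badName.isFailure) // true: NoSuchElementException at run time

  // Wrong expected type: compiles fine, fails only at run time.
  val badCast = util.Try(row("b").asInstanceOf[Int])
  println(badCast.isFailure) // true: ClassCastException at run time
}
```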
Dataset Example
case class ABC(a: Int, b: String, c: String)
case class CA(c: String, a: Int)

val abc = ABC(3, "foo", "test")
val abc1 = ABC(5, "xxx", "alpha")
val abc3 = ABC(10, "aaa", "aaa")
val abcs = Seq(abc, abc1, abc3)
val ds = abcs.toDS()

/* Compile-time type checking,
   but must pass a closure and can't optimize */
val ds1 = ds.map(abc => CA(abc.b, abc.a * 2 + abc.a))

/* Can be query optimized,
   but run-time type and field-name checking */
val ds2 = ds.select($"b" as "c", ($"a" * 2 + $"a") as "a").as[CA]
Transforms
Goal
• Add strong typing to Scala Spark Datasets
• Check field names at compile time
• Check field types at compile time
• Each transform maps one or more Datasets to a new Dataset
• Dataset rows are compile-time types: Scala case classes
Transform Example
case class ABC(a: Int, b: String, c: String)
case class CA(c: String, a: Int)

val abc = ABC(3, "foo", "test")
val abc1 = ABC(5, "xxx", "alpha")
val abc3 = ABC(10, "aaa", "aaa")
val abcs = Seq(abc, abc1, abc3)
val ds = abcs.toDS()

/* Compile-time type checking,
   and can do query optimization */
val smap = SqlMap[ABC, CA]
  .act(cols => (cols.b, cols.a * 2 + cols.a))
val ds3 = smap(ds)


Current Transforms
• Filter
• Map
• Sort
• Join (combines 2 Datasets)
• Aggregate (sum, count, max)
Demos
Demo
• Dataset example
• map
• select
• Transform examples
• Map
• Sort
• Join
• Filter
• Aggregate
Implementation
Scala Macros
• Scala code executed at compile time
• Kinds
• Black box - single result type specified
• White box - result type computed
Transform Implementation
• case class Person(name: String, age: Int)

val p = Person("Sam", 30)
• Scala macro converts
• from: an arbitrary case class type (e.g. Person, the class of p)
• to: a meta structure that encodes field names and types
• case class PersonM(name: StringCol, age: IntCol)

val cols = PersonM(name = StringCol("name"), age = IntCol("age"))
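What the macro generates can be sketched by hand in plain Scala. The sketch below writes out the `PersonM` meta structure from the slide manually; in the library the whitebox macro produces this shape automatically, so the hand-written version here is only to make the idea concrete.

```scala
// Hand-written version of the meta structure the whitebox macro generates.
// Each case-class field becomes a typed column carrying its field name.
case class StringCol(name: String)
case class IntCol(name: String)

case class Person(name: String, age: Int)

// Same field names as Person, but column types instead of value types.
case class PersonM(name: StringCol, age: IntCol)

object MetaDemo extends App {
  // What the macro would produce for Person:
  val cols = PersonM(name = StringCol("name"), age = IntCol("age"))

  // Field access is now checked at compile time: cols.age compiles,
  // a misspelling like cols.agee would be a compile error.
  println(cols.age.name) // prints "age"
}
```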
Column Operations
• StrCol("A") === StrCol("B") => BoolCol("A === B")
• IntCol("A") + IntCol("B") => IntCol("A + B")
• IntCol("A").max => IntCol("A.max")
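These operations can be sketched as typed wrappers around SQL expression strings: each operator builds a new expression while the result type (`BoolCol`, `IntCol`) is tracked by the Scala compiler. The implementation below is illustrative only; the library's real column classes likely differ.

```scala
// Illustrative sketch of the column operations above: a typed column wraps
// a SQL expression string, and each operator returns a column whose Scala
// type records the SQL result type.
case class BoolCol(expr: String)
case class StrCol(expr: String) {
  def ===(other: StrCol): BoolCol = BoolCol(s"${expr} === ${other.expr}")
}
case class IntCol(expr: String) {
  def +(other: IntCol): IntCol = IntCol(s"${expr} + ${other.expr}")
  def max: IntCol = IntCol(s"${expr}.max")
}

object ColOpsDemo extends App {
  println(StrCol("A") === StrCol("B")) // BoolCol(A === B)
  println(IntCol("A") + IntCol("B"))   // IntCol(A + B)
  println(IntCol("A").max)             // IntCol(A.max)
  // Type errors are caught at compile time, e.g.
  // StrCol("A") + IntCol("B") does not compile.
}
```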
White Box Macro Restrictions
• Works fine in SBT and Eclipse
• Not supported in IntelliJ, but can still be used
• Reports type errors
• Does not show available completions
Getting the Code
Transforms Code
• https://github.com/nestorpersist/dataset-transform
• Code
• Documentation
• Examples
• "com.persist" % "dataset-transforms_2.11" % "0.0.5"
Questions