© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
Spark+AI SF
Making Nested Columns as First
Citizens in Apache Spark SQL
Cesar Delgado @hpcfarmer
DB Tsai @dbtsai
Siri
The world’s most popular intelligent assistant service powering
every iPhone, iPad, Mac, Apple TV, Apple Watch, and HomePod
Siri Open Source Team
• We are Spark, Hadoop, and HBase PMC members, committers, and contributors
• We are advocates for Open Source
• Pushing our internal changes back upstream
• Working with the communities to review pull requests and develop
new features and bug fixes
Siri Data
• Machine learning is used to personalize your experience
throughout your day
• We believe privacy is a fundamental human right
Siri Scale
• Large volume of requests; data centers all over the world
• Hadoop / YARN cluster has thousands of nodes
• HDFS holds hundreds of PB
• 100s of TB of raw event data per day
• More than 90% of jobs are Spark
• Less than 10% are legacy Pig and MapReduce jobs
Details about our data
• Deeply nested relational data with a couple of top-level columns
• More than 2,000 nested fields in total
• Stored in Parquet format partitioned by UTC day
• Most queries are only for a small subset of the data
An Example of a Hierarchically Organized Table
Real estate information can be naturally modeled by:
case class Address(houseNumber: Int,
streetAddress: String,
city: String,
state: String,
zipCode: String)
case class Facts(price: Int,
size: Int,
yearBuilt: Int)
case class School(name: String)
case class Home(address: Address,
facts: Facts,
schools: List[School])
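The model above can be exercised with plain Scala; the sample values below are purely illustrative:

```scala
// The case classes from the slide, plus a hypothetical sample record.
case class Address(houseNumber: Int, streetAddress: String, city: String,
                   state: String, zipCode: String)
case class Facts(price: Int, size: Int, yearBuilt: Int)
case class School(name: String)
case class Home(address: Address, facts: Facts, schools: List[School])

val home = Home(
  Address(1, "Infinite Loop", "Cupertino", "CA", "95014"),
  Facts(price = 2500000, size = 2000, yearBuilt = 1993),
  List(School("Example Elementary"))
)

// Nested fields are reached through the top-level column, e.g. address.city:
println(home.address.city)   // Cupertino
```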
root
|-- address: struct (nullable = true)
| |-- houseNumber: integer (nullable = true)
| |-- streetAddress: string (nullable = true)
| |-- city: string (nullable = true)
| |-- state: string (nullable = true)
| |-- zipCode: string (nullable = true)
|-- facts: struct (nullable = true)
| |-- price: integer (nullable = true)
| |-- size: integer (nullable = true)
| |-- yearBuilt: integer (nullable = true)
|-- schools: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
sql("select * from homes").printSchema()
Nested SQL Schema
sql("select address.city from homes where facts.price > 2000000")
.explain(true)
Find cities with houses worth more than 2M
== Physical Plan ==
*(1) Project [address#55.city AS city#75]
+- *(1) Filter (isnotnull(facts#56) && (facts#56.price > 2000000))
+- *(1) FileScan parquet [address#55,facts#56],
DataFilters: [isnotnull(facts#56), (facts#56.price > 2000000)],
Format: Parquet,
PushedFilters: [IsNotNull(facts)],
ReadSchema: struct<address:struct<houseNumber:int,streetAddress:string,
city:string,state:string,zipCode:strin…,
facts:struct<price:int…>>
• We only need two nested fields, address.city and facts.price
• But the entire address and facts structs are read
[SPARK-4502], [SPARK-25363] Parquet with Nested Columns
• Parquet is a columnar storage format designed with complex nested data
structures in mind
• It supports very efficient compression and encoding schemes
• As a columnar format, each nested column is stored separately, as if it were a
flattened table
• There is no easy way to cherry-pick a couple of nested columns in Spark
• Foundation: allow reading a subset of nested columns right after the Parquet
FileScan
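What nested-column pruning means can be sketched in plain Scala (this is not Spark's internal code; the type and function names here are illustrative): given a struct schema and the leaf paths a query touches, keep only those branches.

```scala
// A toy schema representation, not Spark's StructType.
sealed trait DType
case object IntType extends DType
case object StrType extends DType
case class StructType(fields: List[(String, DType)]) extends DType

// Keep only the fields of `schema` reachable from the requested dotted paths.
def prune(schema: StructType, paths: List[List[String]]): StructType =
  StructType(schema.fields.flatMap { case (name, tpe) =>
    // Sub-paths that continue under this field, e.g. List("city") under "address".
    val sub = paths.collect { case `name` :: rest => rest }
    if (sub.isEmpty) None                       // field never referenced: drop it
    else tpe match {
      case s: StructType if sub.exists(_.nonEmpty) =>
        Some(name -> prune(s, sub.filter(_.nonEmpty)))  // recurse into the struct
      case other => Some(name -> other)         // leaf (or whole struct) requested
    }
  })

val homes = StructType(List(
  "address" -> StructType(List("city" -> StrType, "state" -> StrType, "zipCode" -> StrType)),
  "facts"   -> StructType(List("price" -> IntType, "size" -> IntType))
))

// Requesting address.city and facts.price prunes the untouched sibling fields.
val pruned = prune(homes, List(List("address", "city"), List("facts", "price")))
```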
sql("select address.city from homes where facts.price > 2000000")
Find cities with houses worth more than 2M
== Physical Plan ==
*(1) Project [address#55.city AS city#77]
+- *(1) Filter (isnotnull(facts#56) && (facts#56.price > 2000000))
+- *(1) FileScan parquet [address#55,facts#56]
DataFilters: [isnotnull(facts#56), (facts#56.price > 2000000)],
Format: Parquet,
PushedFilters: [IsNotNull(facts)],
ReadSchema: struct<address:struct<city:string>,facts:struct<price:int>>
• Only two nested columns are read!
• With [SPARK-4502], [SPARK-25363]
Find cities with houses worth more than 2M
== Physical Plan ==
*(1) Project [address#55.city AS city#77]
+- *(1) Filter (isnotnull(facts#56) && (facts#56.price > 2000000))
+- *(1) FileScan parquet [address#55,facts#56]
DataFilters: [isnotnull(facts#56), (facts#56.price > 2000000)],
Format: Parquet,
PushedFilters: [IsNotNull(facts)],
ReadSchema: struct<address:struct<city:string>,facts:struct<price:int>>
• Parquet predicate pushdown does not work for nested fields in Spark
sql("select address.city from homes where facts.price > 2000000")
• With [SPARK-4502], [SPARK-25363]
Find cities with houses worth more than 2M
== Physical Plan ==
*(1) Project [address#55.city AS city#77]
+- *(1) Filter (isnotnull(facts#56) && (facts#56.price > 2000000))
+- *(1) FileScan parquet [address#55,facts#56]
DataFilters: [isnotnull(facts#56), (facts#56.price > 2000000)],
Format: Parquet,
PushedFilters: [IsNotNull(facts), GreaterThan(facts.price,2000000)],
ReadSchema: struct<address:struct<city:string>,facts:struct<price:int>>
• Predicate pushdown into Parquet for nested fields provides a significant
performance gain by eliminating non-matches earlier: less data is read,
and the cost of processing it is saved
sql("select address.city from homes where facts.price > 2000000")
• With [SPARK-25556]
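Why pushing `GreaterThan(facts.price, 2000000)` into the scan saves work can be sketched with a toy model of Parquet row-group statistics (these are not Spark or Parquet APIs; the names are illustrative): Parquet keeps min/max statistics per row group, so groups whose maximum price cannot match are skipped without decoding any rows.

```scala
// Illustrative stand-in for Parquet's per-row-group column statistics.
case class RowGroupStats(minPrice: Int, maxPrice: Int)

// GreaterThan(facts.price, threshold): a row group can only contain matches
// if its maximum price exceeds the threshold.
def mightMatch(stats: RowGroupStats, threshold: Int): Boolean =
  stats.maxPrice > threshold

val groups = List(
  RowGroupStats(100000, 900000),    // all homes cheap: skipped entirely
  RowGroupStats(500000, 3500000))   // may contain matches: must be read

val toRead = groups.filter(mightMatch(_, 2000000))
// Only the second row group is read.
```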
val areaUdf = udf{ (city: String, state: String, zipCode: String) =>
s"$city, $state $zipCode"
}
sql("select * from homes").repartition(1).select(
areaUdf(col("address.city"),
col("address.state"),
col("address.zipCode"))
).explain()
Applying a UDF after repartition
== Physical Plan ==
*(2) Project [UDF(address#58.city, address#58.state, address#58.zipCode) AS
UDF(address.city, address.state, address.zipCode)#70]
+- Exchange RoundRobinPartitioning(1)
+- *(1) Project [address#58]
+- *(1) FileScan parquet [address#58] Format: Parquet,
ReadSchema: struct<address:struct<houseNumber:int,streetAddress:string,
city:string,state:string,zipCode:string>>
Problems in Supporting Nested Structures in Spark
• Root-level columns are represented by Attribute, the base of
leaf named expressions
• To get a nested field out of a root-level column, a GetStructField
expression with an Attribute child has to be used
• All column-pruning logic operates at the Attribute level, so
either the entire root column is kept or it is pruned
• There is no easy way to cherry-pick a couple of nested columns in this model
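The expression shapes described above can be modeled in a few lines of plain Scala (a simplification, not Spark's actual Catalyst classes): pruning decisions made only at the Attribute level cannot see which nested field a GetStructField actually needs.

```scala
// Simplified stand-ins for Catalyst's Attribute and GetStructField.
sealed trait Expression
case class Attribute(name: String) extends Expression                 // root column
case class GetStructField(child: Expression, field: String) extends Expression

// address.city is a GetStructField over the root Attribute:
val addressCity = GetStructField(Attribute("address"), "city")

// Collecting referenced columns at the Attribute level discards the ".city"
// part, so the whole `address` struct gets requested from the scan:
def references(e: Expression): Set[String] = e match {
  case Attribute(n)         => Set(n)
  case GetStructField(c, _) => references(c)
}

val refs = references(addressCity)   // the entire struct is considered used
```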
[SPARK-25603] Generalize Nested Column Pruning
• [SPARK-4502] and [SPARK-25363] are the foundation for better support of
nested structures with Parquet in Spark
• If an operator such as Repartition, Sample, or Limit is applied after the
Parquet FileScan, nested column pruning will not work
• We address this by flattening the used nested fields into Aliases right after the
data is read
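The rewrite described above can be sketched in plain Scala (illustrative names, not Spark's implementation): each used nested field is turned into an Alias with a generated name right after the scan, so downstream operators like Repartition, Sample, or Limit only ever see flat columns.

```scala
// Illustrative stand-ins: a reference to one nested field, and its alias.
case class NestedRef(root: String, field: String)   // e.g. address.city
case class Alias(ref: NestedRef, name: String)      // e.g. _gen_alias_0

// Give each used nested field a flat, generated column name.
def flatten(refs: List[NestedRef]): List[Alias] =
  refs.zipWithIndex.map { case (r, i) => Alias(r, s"_gen_alias_$i") }

val used = List(NestedRef("address", "city"),
                NestedRef("address", "state"),
                NestedRef("address", "zipCode"))

val flat = flatten(used)
// Downstream operators see _gen_alias_0, _gen_alias_1, _gen_alias_2
// instead of the whole `address` struct.
```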
sql("select * from homes").repartition(1).select(
areaUdf(col("address.city"),
col("address.state"),
col("address.zipCode"))
).explain()
Applying a UDF after repartition
== Physical Plan ==
*(2) Project [UDF(_gen_alias_84#84, _gen_alias_85#85, _gen_alias_86#86) AS UDF(address.city,
address.state, address.zipCode)#64]
+- Exchange RoundRobinPartitioning(1)
+- *(1) Project [address#55.city AS _gen_alias_84#84, address#55.state AS _gen_alias_85#85,
address#55.zipCode AS _gen_alias_86#86]
+- *(1) FileScan parquet [address#55]
ReadSchema: struct<address:struct<city:string,state:string,zipCode:string>>
• Nested fields are replaced by Aliases with a flattened structure
• Only the three used nested fields are read from the Parquet files
Production Query - Finding a Needle in a Haystack
Spark 2.3.1
1.2 h, 7.1 TB read
Spark 2.4 with [SPARK-4502], [SPARK-25363], and [SPARK-25556]
3.3 min, 840 GB read (down from 1.2 h, 7.1 TB)
• 21x faster in wall-clock time
• 8x less data read
• More power efficient
Other work
• Enhance Dataset performance by analyzing JVM bytecode
and turning closures into Catalyst expressions
• Please check out our other presentation tomorrow at 11am for more
Conclusions
With engineering rigor and some optimizations,
Spark can run at very large scale at lightning speed:
• [SPARK-4502]
• [SPARK-25363]
• [SPARK-25556]
• [SPARK-25603]
Thank you