© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
Spark+AI SF
Making Nested Columns as First
Citizens in Apache Spark SQL
Cesar Delgado @hpcfarmer
DB Tsai @dbtsai
Siri
The world’s most popular intelligent assistant service powering
every iPhone, iPad, Mac, Apple TV, Apple Watch, and HomePod
Siri Open Source Team
• We are Spark, Hadoop, and HBase PMC members, committers, and contributors
• We are advocates for Open Source
• Pushing our internal changes back upstream
• Working with the communities to review pull requests and develop
new features and bug fixes
Siri Data
• Machine learning is used to personalize your experience
throughout your day
• We believe privacy is a fundamental human right
Siri Scale
• Large volume of requests; data centers all over the world
• Hadoop / YARN cluster has thousands of nodes
• HDFS holds hundreds of PB
• 100s of TB of raw event data per day
• More than 90% of jobs are Spark
• Less than 10% are legacy Pig and MapReduce jobs
Details about our data
• Deeply nested relational data with a couple of top-level columns
• More than 2,000 nested fields in total
• Stored in Parquet format partitioned by UTC day
• Most queries are only for a small subset of the data
An Example of a Hierarchically Organized Table
Real estate information can be naturally modeled by:
case class Address(houseNumber: Int,
streetAddress: String,
city: String,
state: String,
zipCode: String)
case class Facts(price: Int,
size: Int,
yearBuilt: Int)
case class School(name: String)
case class Home(address: Address,
facts: Facts,
schools: List[School])
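The model above can be exercised with plain Scala; the sample values below are purely illustrative:

```scala
// The case classes from the slide, plus a hypothetical sample record.
case class Address(houseNumber: Int, streetAddress: String, city: String,
                   state: String, zipCode: String)
case class Facts(price: Int, size: Int, yearBuilt: Int)
case class School(name: String)
case class Home(address: Address, facts: Facts, schools: List[School])

val home = Home(
  Address(1, "Infinite Loop", "Cupertino", "CA", "95014"),
  Facts(price = 2500000, size = 2000, yearBuilt = 1993),
  List(School("Example Elementary"))
)

// Nested fields are reached through the top-level column, e.g. address.city:
println(home.address.city)   // Cupertino
```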
root
|-- address: struct (nullable = true)
| |-- houseNumber: integer (nullable = true)
| |-- streetAddress: string (nullable = true)
| |-- city: string (nullable = true)
| |-- state: string (nullable = true)
| |-- zipCode: string (nullable = true)
|-- facts: struct (nullable = true)
| |-- price: integer (nullable = true)
| |-- size: integer (nullable = true)
| |-- yearBuilt: integer (nullable = true)
|-- schools: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
sql("select * from homes").printSchema()
Nested SQL Schema
sql("select address.city from homes where facts.price > 2000000")
.explain(true)
Find cities with houses worth more than 2M
== Physical Plan ==
*(1) Project [address#55.city AS city#75]
+- *(1) Filter (isnotnull(facts#56) && (facts#56.price > 2000000))
+- *(1) FileScan parquet [address#55,facts#56],
DataFilters: [isnotnull(facts#56), (facts#56.price > 2000000)],
Format: Parquet,
PushedFilters: [IsNotNull(facts)],
ReadSchema: struct<address:struct<houseNumber:int,streetAddress:string,
city:string,state:string,zipCode:strin…,
facts:struct<price:int…>>
• We only need two nested fields, address.city and facts.price
• But the entire address and facts structs are read
[SPARK-4502], [SPARK-25363] Parquet with Nested Columns
• Parquet is a columnar storage format designed with complex nested data
structures in mind
• It supports very efficient compression and encoding schemes
• As a columnar format, each nested column is stored separately, as if it were a
flattened table
• There is no easy way to cherry-pick a couple of nested columns in Spark
• Foundation: allow reading a subset of nested columns right after the Parquet
FileScan
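What nested-column pruning means can be sketched in plain Scala (this is not Spark's internal code; the type and function names here are illustrative): given a struct schema and the leaf paths a query touches, keep only those branches.

```scala
// A toy schema representation, not Spark's StructType.
sealed trait DType
case object IntType extends DType
case object StrType extends DType
case class StructType(fields: List[(String, DType)]) extends DType

// Keep only the fields of `schema` reachable from the requested dotted paths.
def prune(schema: StructType, paths: List[List[String]]): StructType =
  StructType(schema.fields.flatMap { case (name, tpe) =>
    // Sub-paths that continue under this field, e.g. List("city") under "address".
    val sub = paths.collect { case `name` :: rest => rest }
    if (sub.isEmpty) None                       // field never referenced: drop it
    else tpe match {
      case s: StructType if sub.exists(_.nonEmpty) =>
        Some(name -> prune(s, sub.filter(_.nonEmpty)))  // recurse into the struct
      case other => Some(name -> other)         // leaf (or whole struct) requested
    }
  })

val homes = StructType(List(
  "address" -> StructType(List("city" -> StrType, "state" -> StrType, "zipCode" -> StrType)),
  "facts"   -> StructType(List("price" -> IntType, "size" -> IntType))
))

// Requesting address.city and facts.price prunes the untouched sibling fields.
val pruned = prune(homes, List(List("address", "city"), List("facts", "price")))
```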
sql("select address.city from homes where facts.price > 2000000")
Find cities with houses worth more than 2M
== Physical Plan ==
*(1) Project [address#55.city AS city#77]
+- *(1) Filter (isnotnull(facts#56) && (facts#56.price > 2000000))
+- *(1) FileScan parquet [address#55,facts#56]
DataFilters: [isnotnull(facts#56), (facts#56.price > 2000000)],
Format: Parquet,
PushedFilters: [IsNotNull(facts)],
ReadSchema: struct<address:struct<city:string>,facts:struct<price:int>>
• Only two nested columns are read!
• With [SPARK-4502], [SPARK-25363]
Find cities with houses worth more than 2M
== Physical Plan ==
*(1) Project [address#55.city AS city#77]
+- *(1) Filter (isnotnull(facts#56) && (facts#56.price > 2000000))
+- *(1) FileScan parquet [address#55,facts#56]
DataFilters: [isnotnull(facts#56), (facts#56.price > 2000000)],
Format: Parquet,
PushedFilters: [IsNotNull(facts)],
ReadSchema: struct<address:struct<city:string>,facts:struct<price:int>>
• Parquet predicate pushdown does not work for nested fields in Spark
sql("select address.city from homes where facts.price > 2000000")
• With [SPARK-4502], [SPARK-25363]
Find cities with houses worth more than 2M
== Physical Plan ==
*(1) Project [address#55.city AS city#77]
+- *(1) Filter (isnotnull(facts#56) && (facts#56.price > 2000000))
+- *(1) FileScan parquet [address#55,facts#56]
DataFilters: [isnotnull(facts#56), (facts#56.price > 2000000)],
Format: Parquet,
PushedFilters: [IsNotNull(facts), GreaterThan(facts.price,2000000)],
ReadSchema: struct<address:struct<city:string>,facts:struct<price:int>>
• Predicate pushdown into Parquet for nested fields provides a significant
performance gain by eliminating non-matches earlier: less data is read,
and the cost of processing it is saved
sql("select address.city from homes where facts.price > 2000000")
• With [SPARK-25556]
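Why pushing `GreaterThan(facts.price, 2000000)` into the scan saves work can be sketched with a toy model of Parquet row-group statistics (these are not Spark or Parquet APIs; the names are illustrative): Parquet keeps min/max statistics per row group, so groups whose maximum price cannot match are skipped without decoding any rows.

```scala
// Illustrative stand-in for Parquet's per-row-group column statistics.
case class RowGroupStats(minPrice: Int, maxPrice: Int)

// GreaterThan(facts.price, threshold): a row group can only contain matches
// if its maximum price exceeds the threshold.
def mightMatch(stats: RowGroupStats, threshold: Int): Boolean =
  stats.maxPrice > threshold

val groups = List(
  RowGroupStats(100000, 900000),    // all homes cheap: skipped entirely
  RowGroupStats(500000, 3500000))   // may contain matches: must be read

val toRead = groups.filter(mightMatch(_, 2000000))
// Only the second row group is read.
```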
val areaUdf = udf{ (city: String, state: String, zipCode: String) =>
s"$city, $state $zipCode"
}
sql("select * from homes").repartition(1).select(
areaUdf(col("address.city"),
col("address.state"),
col("address.zipCode"))
).explain()
Applying a UDF after repartition
== Physical Plan ==
*(2) Project [UDF(address#58.city, address#58.state, address#58.zipCode) AS
UDF(address.city, address.state, address.zipCode)#70]
+- Exchange RoundRobinPartitioning(1)
+- *(1) Project [address#58]
+- *(1) FileScan parquet [address#58] Format: Parquet,
ReadSchema: struct<address:struct<houseNumber:int,streetAddress:string,
city:string,state:string,zipCode:string>>
Problems in Supporting Nested Structures in Spark
• Root-level columns are represented by Attribute, the base of
leaf named expressions
• To get a nested field out of a root-level column, a GetStructField
expression with an Attribute child has to be used
• All column-pruning logic operates at the Attribute level, so
either the entire root column is kept or it is pruned
• There is no easy way to cherry-pick a couple of nested columns in this model
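The expression shapes described above can be modeled in a few lines of plain Scala (a simplification, not Spark's actual Catalyst classes): pruning decisions made only at the Attribute level cannot see which nested field a GetStructField actually needs.

```scala
// Simplified stand-ins for Catalyst's Attribute and GetStructField.
sealed trait Expression
case class Attribute(name: String) extends Expression                 // root column
case class GetStructField(child: Expression, field: String) extends Expression

// address.city is a GetStructField over the root Attribute:
val addressCity = GetStructField(Attribute("address"), "city")

// Collecting referenced columns at the Attribute level discards the ".city"
// part, so the whole `address` struct gets requested from the scan:
def references(e: Expression): Set[String] = e match {
  case Attribute(n)         => Set(n)
  case GetStructField(c, _) => references(c)
}

val refs = references(addressCity)   // the entire struct is considered used
```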
[SPARK-25603] Generalize Nested Column Pruning
• [SPARK-4502] and [SPARK-25363] are the foundation for better support of
nested structures with Parquet in Spark
• If an operator such as Repartition, Sample, or Limit is applied after the
Parquet FileScan, nested column pruning will not work
• We address this by flattening the used nested fields into Aliases right after the
data is read
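The rewrite described above can be sketched in plain Scala (illustrative names, not Spark's implementation): each used nested field is turned into an Alias with a generated name right after the scan, so downstream operators like Repartition, Sample, or Limit only ever see flat columns.

```scala
// Illustrative stand-ins: a reference to one nested field, and its alias.
case class NestedRef(root: String, field: String)   // e.g. address.city
case class Alias(ref: NestedRef, name: String)      // e.g. _gen_alias_0

// Give each used nested field a flat, generated column name.
def flatten(refs: List[NestedRef]): List[Alias] =
  refs.zipWithIndex.map { case (r, i) => Alias(r, s"_gen_alias_$i") }

val used = List(NestedRef("address", "city"),
                NestedRef("address", "state"),
                NestedRef("address", "zipCode"))

val flat = flatten(used)
// Downstream operators see _gen_alias_0, _gen_alias_1, _gen_alias_2
// instead of the whole `address` struct.
```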
sql("select * from homes").repartition(1).select(
areaUdf(col("address.city"),
col("address.state"),
col("address.zipCode"))
).explain()
Applying a UDF after repartition
== Physical Plan ==
*(2) Project [UDF(_gen_alias_84#84, _gen_alias_85#85, _gen_alias_86#86) AS UDF(address.city,
address.state, address.zipCode)#64]
+- Exchange RoundRobinPartitioning(1)
+- *(1) Project [address#55.city AS _gen_alias_84#84, address#55.state AS _gen_alias_85#85,
address#55.zipCode AS _gen_alias_86#86]
+- *(1) FileScan parquet [address#55]
ReadSchema: struct<address:struct<city:string,state:string,zipCode:string>>
• Nested fields are replaced by Aliases with a flattened structure
• Only the three used nested fields are read from the Parquet files
Production Query - Finding a Needle in a Haystack
Spark 2.3.1
1.2 h, 7.1 TB read
Spark 2.4 with [SPARK-4502], [SPARK-25363], and [SPARK-25556]
3.3 min, 840 GB read (down from 1.2 h, 7.1 TB)
• 21x faster in wall-clock time
• 8x less data read
• More power efficient
Other work
• Enhance Dataset performance by analyzing JVM bytecode
and turning closures into Catalyst expressions
• Please check out our other presentation tomorrow at 11am for more
Conclusions
With engineering rigor and some optimizations,
Spark can run at very large scale at lightning speed:
• [SPARK-4502]
• [SPARK-25363]
• [SPARK-25556]
• [SPARK-25603]
Thank you