1
Connecting PyData to other Big Data
Landscapes using Arrow and Parquet
Uwe L. Korn, PyCon.DE 2017
2
About me
• Data Scientist & Architect at Blue Yonder (@BlueYonderTech)
• Apache {Arrow, Parquet} PMC
• Work in Python, Cython, C++11 and SQL
• Heavy Pandas user
xhochy
uwe@apache.org
3
Python is a good companion for a Data Scientist
…but there are other ecosystems out there.
4
Why do I care?
• Large set of files on a distributed filesystem
• Non-uniform schema
• Execute a query
• Only a subset is interesting
…not in Python
5
All are amazing but…
How to get my data out of Python and back in again?
…but there was no fast Parquet access 2 years ago.
Use Parquet!
6
A general problem
• Great interoperability inside ecosystems
• Often based on a common backend (e.g. NumPy)
• Poor integration with other systems
• CSV is your only resort
• "We need to talk!"
• A memory copy runs at about 10 GiB/s
• (De-)serialisation comes on top
7
Columnar Data
Image source: https://arrow.apache.org/img/simd.png ( https://arrow.apache.org/ )
8
Apache Parquet
9
About Parquet
1. Columnar on-disk storage format
2. Started in fall 2012 by Cloudera & Twitter
3. July 2013: 1.0 release
4. Top-level Apache project
5. Fall 2016: Python & C++ support
6. State-of-the-art format in the Hadoop ecosystem
• often used as the default I/O option
10
Why use Parquet?
1. Columnar format —> vectorized operations
2. Efficient encodings and compressions —> small size without the need for a fat CPU
3. Predicate push-down —> bring computation to the I/O layer
4. Language-independent format —> libs in Java / Scala / C++ / Python / …
Compression
1. Shrinks data size independent of its content
2. More CPU intensive than encoding
3. Encoding + compression performs better than compression alone, at less CPU cost
4. LZO, Snappy, GZIP, Brotli —> If in doubt: use Snappy (see the sketch below)
5. GZIP: 174 MiB (11 %), Snappy: 216 MiB (14 %)
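A minimal sketch of picking a compression codec with pyarrow; the DataFrame and file names here are just placeholders:

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame({"product": ["a", "b", "c"] * 1000, "price": range(3000)})
    table = pa.Table.from_pandas(df)

    # Same data, different codecs: Snappy is fast, GZIP is smaller but costs more CPU.
    pq.write_table(table, "data_snappy.parquet", compression="snappy")
    pq.write_table(table, "data_gzip.parquet", compression="gzip")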
Predicate pushdown
1. Only load the data that is used
• skip columns that are not needed
• skip (chunks of) rows that are not relevant
2. Saves I/O load as the data is not transferred
3. Saves CPU as the data is not decoded
Which products are sold in $? (see the sketch below)
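A hedged sketch of what predicate push-down looks like from pyarrow; the file name and column names for the "$" question are assumptions:

    import pyarrow.parquet as pq

    # Only read the columns needed to answer the question above;
    # all other columns are skipped at the I/O layer.
    table = pq.read_table("sales.parquet", columns=["product", "currency"])

    # Row groups carry min/max statistics, so chunks whose value range cannot
    # match a predicate can be skipped without decoding them.
    md = pq.ParquetFile("sales.parquet").metadata
    print(md.row_group(0).column(0).statistics)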
File Structure
• File > RowGroup > Column Chunk > Page
• Statistics (stored per Column Chunk)
Read & Write Parquet
14
https://arrow.apache.org/docs/python/parquet.html
Alternative Implementation: https://fastparquet.readthedocs.io/en/latest/
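A minimal round trip with pyarrow.parquet, following the linked documentation; the example DataFrame and file name are placeholders:

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

    # pandas -> Arrow table -> Parquet file on disk
    pq.write_table(pa.Table.from_pandas(df), "example.parquet")

    # Parquet file -> Arrow table -> pandas
    df_roundtrip = pq.read_table("example.parquet").to_pandas()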
Read & Write Parquet
15
Pandas 0.21 will bring
pd.read_parquet(…)
df.to_parquet(…)
http://pandas.pydata.org/pandas-docs/version/0.21/io.html#io-parquet
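With pandas 0.21 the round trip no longer needs explicit pyarrow calls; a short sketch (the file name is a placeholder, and engine="pyarrow" assumes pyarrow is installed):

    import pandas as pd

    df = pd.DataFrame({"id": [1, 2, 3]})
    df.to_parquet("example.parquet", engine="pyarrow")
    df2 = pd.read_parquet("example.parquet", engine="pyarrow")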
16
Save in one, load in another ecosystem
…but you always have to persist the intermediate data.
17
Zero-Copy DataFrames
18
2.57 s: converting 1 million longs (8 MiB) from Spark to PySpark
19
Apache Arrow
• Specification for in-memory columnar data layout
• No overhead for cross-system communication
• Designed for efficiency (exploits SIMD, cache locality, …)
• Exchange data without conversion between Python, C++, C (GLib), Ruby, Lua, R, JavaScript and the JVM (see the sketch below)
• This brought Parquet to Pandas without any Python code in parquet-cpp
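A small sketch of exchanging data between pandas and Arrow from Python; depending on the column types, the conversion back to pandas can reuse the Arrow buffers instead of copying:

    import pandas as pd
    import pyarrow as pa

    df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

    table = pa.Table.from_pandas(df)   # pandas -> Arrow columnar memory
    df_again = table.to_pandas()       # Arrow -> pandas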
20
Dissecting Arrow C++
• General zero-copy memory management
• jemalloc as the base allocator
• Columnar memory format & metadata
• Schema & DataType
• Columns & Table
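The building blocks listed above map directly onto pyarrow's Python API; a hedged sketch with made-up column names:

    import pyarrow as pa

    # Schema & DataType
    schema = pa.schema([("product", pa.string()), ("price", pa.float64())])

    # Columns & Table
    table = pa.Table.from_arrays(
        [pa.array(["a", "b"]), pa.array([1.0, 2.0])],
        schema=schema,
    )
    print(table.schema)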
21
Dissecting Arrow C++
• Structured data IPC (inter-process communication), see the sketch below
• used in Spark for JVM <-> Python
• future extensions include: gRPC backend, shared-memory communication, …
• Columnar in-memory analytics
• will be the backbone of Pandas 2.0
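A minimal sketch of the IPC format from Python, using today's pyarrow API rather than the 2017-era one: a table is written to an in-memory stream and read back without changing its columnar layout.

    import pyarrow as pa

    table = pa.table({"longs": list(range(10))})

    # Write the table into an in-memory Arrow IPC stream
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)

    # Read it back; the columnar buffers are reused, not re-parsed
    table_back = pa.ipc.open_stream(sink.getvalue()).read_all()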
22
0.05 s: converting 1 million longs from Spark to PySpark with Arrow (see the sketch below)
https://github.com/apache/spark/pull/15821#issuecomment-282175163
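In released Spark versions (2.3 and later) this path is switched on via a config flag; a sketch, with the tiny DataFrame standing in for real data:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # Enable Arrow-based columnar transfer for toPandas() (Spark 2.3+)
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")

    sdf = spark.range(1000 * 1000)   # 1 million longs
    pdf = sdf.toPandas()             # transferred via Arrow, not row-by-row pickling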
23
Apache Arrow – Real life improvement
Real life example!
Retrieve a dataset from an MPP database and analyze it in Pandas
1. Run a query in the DB
2. Pass it in columnar form to the DB driver
3. The ODBC layer transforms it into row-wise form
4. Pandas makes it columnar again
Ugly real-life solution: export as CSV, bypass ODBC
24
Better solution: Turbodbc with Arrow support
1. Retrieve columnar results
2. Pass them in a columnar fashion to Pandas
More systems in the future (without the ODBC overhead)
See also Michael’s talk tomorrow: "Turbodbc: Turbocharged database access for data scientists"
Apache Arrow – Real life improvement
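A hedged sketch of that path with turbodbc's Arrow support; the DSN and query are placeholders:

    from turbodbc import connect

    connection = connect(dsn="MyMPPDatabase")
    cursor = connection.cursor()
    cursor.execute("SELECT product, price FROM sales")

    table = cursor.fetchallarrow()   # result set as a pyarrow.Table, columnar end to end
    df = table.to_pandas()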
25
Ray
GPU Open Analytics Initiative
26
https://blogs.nvidia.com/blog/2017/09/22/gpu-data-frame/
27
Get Involved!

Apache Arrow (cross-language DataFrame library)
• Website: https://arrow.apache.org/
• ML: dev@arrow.apache.org
• Issues & Tasks: https://issues.apache.org/jira/browse/ARROW
• Slack: https://apachearrowslackin.herokuapp.com/
• GitHub: https://github.com/apache/arrow

Apache Parquet (famous columnar file format)
• Website: https://parquet.apache.org/
• ML: dev@parquet.apache.org
• Issues & Tasks: https://issues.apache.org/jira/browse/PARQUET
• Slack: https://parquet-slack-invite.herokuapp.com/
• GitHub: https://github.com/apache/parquet-cpp
Blue Yonder GmbH
Ohiostraße 8
76149 Karlsruhe
Germany
+49 721 383117 0
Blue Yonder Software Limited
19 Eastbourne Terrace
London, W2 6LG
United Kingdom
+44 20 3626 0360
Blue Yonder
Best decisions,
delivered daily
Blue Yonder Analytics, Inc.
5048 Tennyson Parkway
Suite 250
Plano, Texas 75024
USA
28
