1
Connecting PyData to other Big Data
Landscapes using Arrow and Parquet
Uwe L. Korn, PyCon.DE 2017
2
About me
• Data Scientist & Architect at Blue Yonder (@BlueYonderTech)
• Apache {Arrow, Parquet} PMC
• Work in Python, Cython, C++11 and SQL
• Heavy Pandas user
xhochy
uwe@apache.org
3
Python is a good companion for a Data Scientist
…but there are other ecosystems out there.
4
Why do I care?
• Large set of files on a distributed filesystem
• Non-uniform schema
• Execute a query
• Only a subset is interesting
…not in Python
5
All are amazing but…
How to get my data out of Python and back in again?
…but there was no fast Parquet access 2 years ago.
Use Parquet!
6
A general problem
• Great interoperability inside ecosystems
• Often based on a common backend (e.g. NumPy)
• Poor integration with other systems
• CSV is your only resort
• "We need to talk!"
• A memory copy runs at about 10 GiB/s
• (De-)serialisation comes on top
7
Columnar Data
Image source: https://arrow.apache.org/img/simd.png ( https://arrow.apache.org/ )
8
Apache Parquet
9
About Parquet
1. Columnar on-disk storage format
2. Started in fall 2012 by Cloudera & Twitter
3. July 2013: 1.0 release
4. Top-level Apache project
5. Fall 2016: Python & C++ support
6. State-of-the-art format in the Hadoop ecosystem
• often used as the default I/O option
10
Why use Parquet?
1. Columnar format —> vectorized operations
2. Efficient encodings and compressions —> small size without the need for a fat CPU
3. Predicate push-down —> bring computation to the I/O layer
4. Language-independent format —> libs in Java / Scala / C++ / Python / …
Compression
1. Shrinks data size independent of its content
2. More CPU intensive than encoding
3. Encoding + compression performs better than compression alone, at less CPU cost
4. LZO, Snappy, GZIP, Brotli —> If in doubt: use Snappy (see the sketch below)
5. GZIP: 174 MiB (11 %), Snappy: 216 MiB (14 %)
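A minimal sketch of picking a compression codec with pyarrow; the DataFrame and file names here are just placeholders:

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame({"product": ["a", "b", "c"] * 1000, "price": range(3000)})
    table = pa.Table.from_pandas(df)

    # Same data, different codecs: Snappy is fast, GZIP is smaller but costs more CPU.
    pq.write_table(table, "data_snappy.parquet", compression="snappy")
    pq.write_table(table, "data_gzip.parquet", compression="gzip")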
Predicate pushdown
1. Only load the data that is used
• skip columns that are not needed
• skip (chunks of) rows that are not relevant
2. Saves I/O load as the data is not transferred
3. Saves CPU as the data is not decoded
Which products are sold in $? (see the sketch below)
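A hedged sketch of what predicate push-down looks like from pyarrow; the file name and column names for the "$" question are assumptions:

    import pyarrow.parquet as pq

    # Only read the columns needed to answer the question above;
    # all other columns are skipped at the I/O layer.
    table = pq.read_table("sales.parquet", columns=["product", "currency"])

    # Row groups carry min/max statistics, so chunks whose value range cannot
    # match a predicate can be skipped without decoding them.
    md = pq.ParquetFile("sales.parquet").metadata
    print(md.row_group(0).column(0).statistics)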
File Structure
• File > RowGroup > Column Chunk > Page
• Statistics (stored per Column Chunk)
Read & Write Parquet
14
https://arrow.apache.org/docs/python/parquet.html
Alternative Implementation: https://fastparquet.readthedocs.io/en/latest/
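A minimal round trip with pyarrow.parquet, following the linked documentation; the example DataFrame and file name are placeholders:

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

    # pandas -> Arrow table -> Parquet file on disk
    pq.write_table(pa.Table.from_pandas(df), "example.parquet")

    # Parquet file -> Arrow table -> pandas
    df_roundtrip = pq.read_table("example.parquet").to_pandas()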
Read & Write Parquet
15
Pandas 0.21 will bring
pd.read_parquet(…)
df.to_parquet(…)
http://pandas.pydata.org/pandas-docs/version/0.21/io.html#io-parquet
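With pandas 0.21 the round trip no longer needs explicit pyarrow calls; a short sketch (the file name is a placeholder, and engine="pyarrow" assumes pyarrow is installed):

    import pandas as pd

    df = pd.DataFrame({"id": [1, 2, 3]})
    df.to_parquet("example.parquet", engine="pyarrow")
    df2 = pd.read_parquet("example.parquet", engine="pyarrow")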
16
Save in one, load in another ecosystem
…but you always have to persist the intermediate data.
17
Zero-Copy DataFrames
18
2.57 s: converting 1 million longs (8 MiB) from Spark to PySpark
19
Apache Arrow
• Specification for in-memory columnar data layout
• No overhead for cross-system communication
• Designed for efficiency (exploits SIMD, cache locality, …)
• Exchange data without conversion between Python, C++, C (GLib), Ruby, Lua, R, JavaScript and the JVM (see the sketch below)
• This brought Parquet to Pandas without any Python code in parquet-cpp
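A small sketch of exchanging data between pandas and Arrow from Python; depending on the column types, the conversion back to pandas can reuse the Arrow buffers instead of copying:

    import pandas as pd
    import pyarrow as pa

    df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

    table = pa.Table.from_pandas(df)   # pandas -> Arrow columnar memory
    df_again = table.to_pandas()       # Arrow -> pandas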
20
Dissecting Arrow C++
• General zero-copy memory management
• jemalloc as the base allocator
• Columnar memory format & metadata
• Schema & DataType
• Columns & Table
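The building blocks listed above map directly onto pyarrow's Python API; a hedged sketch with made-up column names:

    import pyarrow as pa

    # Schema & DataType
    schema = pa.schema([("product", pa.string()), ("price", pa.float64())])

    # Columns & Table
    table = pa.Table.from_arrays(
        [pa.array(["a", "b"]), pa.array([1.0, 2.0])],
        schema=schema,
    )
    print(table.schema)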
21
Dissecting Arrow C++
• Structured data IPC (inter-process communication), see the sketch below
• used in Spark for JVM <-> Python
• future extensions include: gRPC backend, shared-memory communication, …
• Columnar in-memory analytics
• will be the backbone of Pandas 2.0
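A minimal sketch of the IPC format from Python, using today's pyarrow API rather than the 2017-era one: a table is written to an in-memory stream and read back without changing its columnar layout.

    import pyarrow as pa

    table = pa.table({"longs": list(range(10))})

    # Write the table into an in-memory Arrow IPC stream
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)

    # Read it back; the columnar buffers are reused, not re-parsed
    table_back = pa.ipc.open_stream(sink.getvalue()).read_all()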
22
0.05 s: converting 1 million longs from Spark to PySpark with Arrow (see the sketch below)
https://github.com/apache/spark/pull/15821#issuecomment-282175163
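In released Spark versions (2.3 and later) this path is switched on via a config flag; a sketch, with the tiny DataFrame standing in for real data:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # Enable Arrow-based columnar transfer for toPandas() (Spark 2.3+)
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")

    sdf = spark.range(1000 * 1000)   # 1 million longs
    pdf = sdf.toPandas()             # transferred via Arrow, not row-by-row pickling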
23
Apache Arrow – Real life improvement
Real life example!
Retrieve a dataset from an MPP database and analyze it in Pandas
1. Run a query in the DB
2. Pass it in columnar form to the DB driver
3. The ODBC layer transforms it into row-wise form
4. Pandas makes it columnar again
Ugly real-life solution: export as CSV, bypass ODBC
24
Better solution: Turbodbc with Arrow support
1. Retrieve columnar results
2. Pass them in a columnar fashion to Pandas
More systems in the future (without the ODBC overhead)
See also Michael’s talk tomorrow: "Turbodbc: Turbocharged database access for data scientists"
Apache Arrow – Real life improvement
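A hedged sketch of that path with turbodbc's Arrow support; the DSN and query are placeholders:

    from turbodbc import connect

    connection = connect(dsn="MyMPPDatabase")
    cursor = connection.cursor()
    cursor.execute("SELECT product, price FROM sales")

    table = cursor.fetchallarrow()   # result set as a pyarrow.Table, columnar end to end
    df = table.to_pandas()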
25
Ray
GPU Open Analytics Initiative
26
https://blogs.nvidia.com/blog/2017/09/22/gpu-data-frame/
27
Get Involved!

Apache Arrow (cross-language DataFrame library)
• Website: https://arrow.apache.org/
• ML: dev@arrow.apache.org
• Issues & Tasks: https://issues.apache.org/jira/browse/ARROW
• Slack: https://apachearrowslackin.herokuapp.com/
• GitHub: https://github.com/apache/arrow

Apache Parquet (famous columnar file format)
• Website: https://parquet.apache.org/
• ML: dev@parquet.apache.org
• Issues & Tasks: https://issues.apache.org/jira/browse/PARQUET
• Slack: https://parquet-slack-invite.herokuapp.com/
• GitHub: https://github.com/apache/parquet-cpp
Blue Yonder GmbH
Ohiostraße 8
76149 Karlsruhe
Germany
+49 721 383117 0
Blue Yonder Software Limited
19 Eastbourne Terrace
London, W2 6LG
United Kingdom
+44 20 3626 0360
Blue Yonder
Best decisions,
delivered daily
Blue Yonder Analytics, Inc.
5048 Tennyson Parkway
Suite 250
Plano, Texas 75024
USA
28
