A [Incomplete] Data Tools Landscape [for Hackers] in 2015
Wes McKinney @wesmckinn
Data^3 Meeting, Minneapolis, MN
© Cloudera, Inc. All rights reserved.
  
This talk
• A partial look at different languages and tools
• Limiting scope to either:
  • Permissively licensed open source software, e.g. Apache-licensed (OSS)
  • Non-dual-licensed copyleft OSS (e.g. GPL)
  • i.e. “do you [the community] have any incentive to create patches?”
• Some trends (that I see, anyway)
• Challenges and opportunities
  
Who am I?
• Python data firestarter
• Financial analytics in R / Python starting 2007
• pandas project born of frustration in 2008
• 2010-2012
  • Hiatus from gainful employment
  • Make pandas ready for primetime
  • Write "Python for Data Analysis"
  
Who am I? (cont’d)
• 2013-2014: Co-founder/CEO of DataPad (analytics startup, with early pandas collaborator Chang She)
• Late 2014: DataPad team joins Cloudera
• Now: backend systems and all-things-Python @ Cloudera
  
SQL: Still a lingua franca
• “SQL: the Fortran of Analytics”
• Often a concise, declarative way to express data transforms, analytics, etc.
• Relatively easy to parse, analyze
• SQL recently has seen resurgence with focus on interactive-speed SQL engines, especially on top of HDFS/Hadoop
• Relevant and impactful features (e.g. JSON support) still arriving in established RDBMS like PostgreSQL
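
To make the "concise, declarative" point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The trades table and its values are made up for illustration; the query states what result is wanted and leaves the how to the engine.

```python
import sqlite3

# In-memory database with a small, hypothetical trades table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (symbol TEXT, qty INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO trades VALUES (?, ?, ?)",
    [("AAPL", 10, 100.0), ("AAPL", 5, 110.0), ("MSFT", 20, 40.0)],
)

# Declarative: describe the result (totals per symbol), not the loop that builds it.
query = """
    SELECT symbol, SUM(qty) AS total_qty, SUM(qty * price) AS notional
    FROM trades
    GROUP BY symbol
    ORDER BY symbol
"""
result = conn.execute(query).fetchall()
print(result)
```

The same grouping written by hand would need a dictionary, a loop, and explicit accumulation, which is roughly the gap the "declarative" bullet is pointing at.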
  
Historical Python Context
• Scientific / HPC computing focus in 1990s, 2000s
  • Python web community developed in parallel, matured faster!
• NumPy became community standard in 2005, born from Numeric + Numarray
• Pyrex, later Cython, easier C / C++ wrapping
• f2py: easy Fortran wrapping
• Anaconda distribution
  • Finally solving Python deployment for all
  
Essential Python stack
• NumPy: low-level array processing
• SciPy: essential computational algos
• pandas: data wrangling
• scikit-learn: machine learning
• matplotlib (+ add-ons, like seaborn): visualization
• numba: numeric hotspot LLVM compiler
• Domain-specific toolkits: nltk, scikit-image, statsmodels, Theano, PyCUDA/PyOpenCL and many others
  
pandas
• A Pythonic take on the classic R “data frame” data structure
• Critical piece to make the Python stack useful in everyday work
• Added axis metadata / labeling for representing multidimensional data
• Focus on easy data wrangling, IO, plotting, and basic analytics
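
A small sketch of what "axis metadata / labeling" buys in practice, assuming pandas is installed. The tickers, dates, and prices are invented for illustration.

```python
import pandas as pd

# Hypothetical daily prices: rows labeled by date, columns labeled by ticker.
prices = pd.DataFrame(
    {"AAPL": [100.0, 102.0, 101.0], "MSFT": [40.0, 41.0, 42.0]},
    index=pd.to_datetime(["2015-01-05", "2015-01-06", "2015-01-07"]),
)

# Label-based selection: pick a value by row and column label, not by position.
aapl_on_6th = prices.loc[pd.Timestamp("2015-01-06"), "AAPL"]

# Arithmetic aligns on labels; the date missing from `adjustment` becomes NaN.
adjustment = pd.Series(
    [1.0, 2.0], index=pd.to_datetime(["2015-01-06", "2015-01-07"])
)
adjusted = prices["MSFT"] + adjustment

# Basic analytics along an axis: mean price per ticker.
mean_by_ticker = prices.mean()
print(aapl_on_6th, adjusted, mean_by_ticker, sep="\n")
```

With plain NumPy arrays the alignment step would be the caller's job; here the labels carry that bookkeeping.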
  
[Figure: Jeff Reback’s “pandas as PyData middleware” diagram]
  
Newer / Up-and-coming Python projects
• Bokeh: interactive / reactive visualization for the web
• Blaze: uniform data expression API
• Odo: easy data migration
  
R Project
• Trusted base of statistics libraries
  • Latest and greatest stats research often hits R first
• RStudio
• The “Hadley stack”
  • Visualization: ggplot2 (static) and ggvis (interactive)
  • Data Wrangling: dplyr
  • legacy: plyr / reshape2
  
dplyr
• Started late 2012 by Hadley Wickham, supported by RStudio
• Composable / chainable analytics and data wrangling expressions
• In-memory and SQL backends
• Has attracted folks back to R from Python in a lot of cases
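
The idea of one chainable expression running against either an in-memory or a SQL backend can be sketched in a few lines. This is a toy in Python, not dplyr's API or implementation: the Pipeline class, its method names, and the flights data are all hypothetical.

```python
import operator
import sqlite3

# Comparison operators supported by this toy, mapped to Python functions.
OPS = {"=": operator.eq, ">": operator.gt, "<": operator.lt}


class Pipeline:
    """A chainable, immutable description of filter/select steps (hypothetical)."""

    def __init__(self, table, filters=(), columns=None):
        self.table = table
        self.filters = tuple(filters)
        self.columns = columns

    def filter(self, column, op, value):
        # Return a new pipeline so expressions compose without side effects.
        return Pipeline(self.table, self.filters + ((column, op, value),), self.columns)

    def select(self, *columns):
        return Pipeline(self.table, self.filters, columns)

    def run_in_memory(self, rows):
        """Backend 1: evaluate the pipeline against a list of dicts."""
        kept = [r for r in rows
                if all(OPS[op](r[col], val) for col, op, val in self.filters)]
        cols = self.columns or (list(rows[0]) if rows else [])
        return [tuple(r[c] for c in cols) for r in kept]

    def to_sql(self):
        """Backend 2: compile the same pipeline to parameterized SQL.

        Toy only: table and column names are interpolated unchecked.
        """
        cols = ", ".join(self.columns) if self.columns else "*"
        sql = f"SELECT {cols} FROM {self.table}"
        if self.filters:
            sql += " WHERE " + " AND ".join(f"{c} {op} ?" for c, op, _ in self.filters)
        return sql, [v for _, _, v in self.filters]


rows = [
    {"carrier": "AA", "delay": 12},
    {"carrier": "DL", "delay": 45},
    {"carrier": "AA", "delay": 60},
]
query = Pipeline("flights").filter("delay", ">", 30).select("carrier", "delay")

# Same expression, in-memory backend.
in_memory = query.run_in_memory(rows)

# Same expression, SQL backend (SQLite here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (carrier TEXT, delay INTEGER)")
conn.executemany("INSERT INTO flights VALUES (:carrier, :delay)", rows)
sql, params = query.to_sql()
from_sql = conn.execute(sql, params).fetchall()
print(sql, in_memory, from_sql, sep="\n")
```

Because each verb returns a new description rather than executing immediately, the decision about where to run the work can be deferred until the end of the chain.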
  
Some other great R stuff
• shiny: interactive web apps in R
• Rcpp
• data.table
• xts
  
IPython
• IPython started out as a better interactive Python
• Grew to include web-based computational notebook, GUI console, and other components
  • (Google even integrated into Google Drive!)
• IPython Notebook architecture enabled “kernel” processes to be written in nearly any language (even bash!)
• How to build community beyond Python?
  
Enter Jupyter
• http://jupyter.org
• Breaking out notebook machinery into a standalone non-Python-specific project
• Enable project components to evolve at own pace, without large monolithic releases
• JupyterHub: upcoming multi-user notebook server
  
A few words about Hadoop + Big Data
  
Apache Spark
• Originated from Berkeley AMPLab
• General purpose distributed memory-centric data processing framework
• Official APIs: Scala, Java, Python
Source: databricks.com
  
Spark 1.3: DataFrames!
• R/pandas-inspired API for tabular data manipulation in Scala, Python, etc.
• Logical operation graphs rewritten internally in more efficient form
• Good interop with Spark SQL
• Some interoperability with pandas
• Will help close the semantic gap between Spark and R/Python
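
The "logical operation graphs rewritten internally" bullet can be illustrated with a toy plan rewriter in plain Python. This is not Spark's optimizer or API: the Scan, Filter, and Project node types and the single fusion rule are assumptions chosen to show the general idea of building a plan lazily and rewriting it before execution.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Scan:
    rows: Tuple[dict, ...]


@dataclass(frozen=True)
class Filter:
    child: object
    predicate: Callable[[dict], bool]


@dataclass(frozen=True)
class Project:
    child: object
    columns: Tuple[str, ...]


def optimize(plan):
    """Rewrite the plan bottom-up, fusing adjacent Filter nodes into one."""
    if isinstance(plan, Scan):
        return plan
    child = optimize(plan.child)
    if isinstance(plan, Filter) and isinstance(child, Filter):
        inner, outer = child.predicate, plan.predicate
        return Filter(child.child, lambda row: inner(row) and outer(row))
    if isinstance(plan, Filter):
        return Filter(child, plan.predicate)
    return Project(child, plan.columns)


def execute(plan):
    """Naive interpreter: each node makes one pass over its child's rows."""
    if isinstance(plan, Scan):
        return list(plan.rows)
    rows = execute(plan.child)
    if isinstance(plan, Filter):
        return [r for r in rows if plan.predicate(r)]
    return [{c: r[c] for c in plan.columns} for r in rows]


def depth(plan):
    """Number of nodes from this node down to the Scan."""
    return 1 if isinstance(plan, Scan) else 1 + depth(plan.child)


rows = (
    {"user": "a", "age": 31, "country": "US"},
    {"user": "b", "age": 17, "country": "US"},
    {"user": "c", "age": 45, "country": "DE"},
)
# The user writes two separate filters; nothing runs until execute() is called.
plan = Project(
    Filter(Filter(Scan(rows), lambda r: r["age"] >= 18),
           lambda r: r["country"] == "US"),
    ("user",),
)
optimized = optimize(plan)
print(depth(plan), depth(optimized), execute(optimized))
```

The rewritten plan gives the same answer with one fewer pass over the data; a real engine applies many such rules, but the separation between describing and executing is the same.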
  
Some problems in need of solving
• A Shiny-like quick-and-dirty data app development framework for Python
• IPython/Jupyter notebook collaboration
• A community-standard, Apache-licensed C/C++ data frame library with best-in-class performance
• Ubiquitous support for emerging analytical on-disk storage standards like Parquet
  
Other interesting stuff to look at
• Torch7 / LuaJIT: high performance ML / deep learning on GPUs
  • Facebook AI group open sourced several ML modules
• Apache Flink
  • Up-and-coming Scala-based data processing framework
  • Some overlap with Spark use cases
  
Some other interesting industry trends
• Microsoft
  • Acquired Revolution Analytics, leading commercial R vendor
  • Launched Azure ML: R, Python, and more on Azure cloud
• Dato (fka GraphLab)
  • faster, more scalable machine learning, with Python interface (Paid commercial product, free for non-commercial/academic use)
  • Largest-ever VC investment in a data tools company betting big on Python
• Databricks
  • Offering cloud Spark-notebook-as-a-service
  
Thank you
@wesmckinn
  
