Winpython with R : comparing DPLYR and Pandas¶

It is based on the Thomas Augspurger comparison Notebook (refreshed for Pandas 0.16.0)

http://nbviewer.ipython.org/urls/gist.githubusercontent.com/TomAugspurger/6e052140eaa5fdb6e8c0/raw/811585624e843f3f80b9b6fe89e18119d7d2d73c/dplyr_pandas.ipynb

We just play the "R" code at the same time, instead of keeping it in comments

In [1]:

!echo %R_HOME%

D:\WinPython\basedir34\buildFlavorRfull\winpython-3.4.3.2.amd64\tools\R

In [2]:

# Some prep work to get the data from R and into pandas
%matplotlib inline
#bad tetst : move magic befor module imprort
#%load_ext rpy2.ipython

import rpy2
%load_ext rpy2.ipython
from rpy2.robjects.conversion import ri2py
from rpy2.ipython.rmagic import ri2ipython
ri2ipython.register(rpy2.robjects.Sexp, ri2py)
import numpy as np
import pandas as pd
import seaborn as sns

pd.set_option("display.max_rows", 5)

Transform this Markdown cell to a Code cell, if ever you need to re-feed a basic R environnement¶

%R install.packages("tidyr")

%R install.packages("dplyr")

%R install.packages("ggplot2")

%R install.packages("rvest")

%R install.packages('RSQLite')

%R install.packages("zoo")

%R install.packages("forecast")

%R install.packages('R.utils')

%R install.packages("nycflights13")

%R install.packages('hflights')

Thomas Augspurger part (with comments replaced by true %R code)¶

This notebook compares pandas and dplyr. The comparison is just on syntax (verbage), not performance. Whether you're an R user looking to switch to pandas (or the other way around), I hope this guide will help ease the transition.

We'll work through the introductory dplyr vignette to analyze some flight data.

I'm working on a better layout to show the two packages side by side. But for now I'm just putting the dplyr code in a comment above each python call.

In [3]:

%%R
library("dplyr") # for functions
library("nycflights13")
write.csv(flights, "flights.csv")

Data: nycflights13¶

In [4]:

flights = pd.read_csv("flights.csv", index_col=0)

In [5]:

%R dim(flights)

Out[5]:

array([336776,     16], dtype=int32)

In [6]:

# dim(flights)   <--- The R code
flights.shape  # <--- The python code

Out[6]:

(336776, 16)

In [7]:

%R head(flights)

Out[7]:

	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute
0	2013	1	1	517	2	830	11	UA	N14228	1545	EWR	IAH	227	1400	5	17
1	2013	1	1	533	4	850	20	UA	N24211	1714	LGA	IAH	227	1416	5	33
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
4	2013	1	1	554	-6	812	-25	DL	N668DN	461	LGA	ATL	116	762	5	54
5	2013	1	1	554	-4	740	12	UA	N39463	1696	EWR	ORD	150	719	5	54

6 rows × 16 columns

In [8]:

# head(flights)
flights.head()

Out[8]:

	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute
1	2013	1	1	517	2	830	11	UA	N14228	1545	EWR	IAH	227	1400	5	17
2	2013	1	1	533	4	850	20	UA	N24211	1714	LGA	IAH	227	1416	5	33
3	2013	1	1	542	2	923	33	AA	N619AA	1141	JFK	MIA	160	1089	5	42
4	2013	1	1	544	-1	1004	-18	B6	N804JB	725	JFK	BQN	183	1576	5	44
5	2013	1	1	554	-6	812	-25	DL	N668DN	461	LGA	ATL	116	762	5	54

Single table verbs¶

dplyr has a small set of nicely defined verbs. I've listed their closest pandas verbs.

dplyr	pandas
filter() (and slice())	query() (and loc[], iloc[])
arrange()	sort()
select() (and rename())	\_\_getitem\_\_ (and rename())
distinct()	drop_duplicates()
mutate() (and transmute())	None
summarise()	None
sample_n() and sample_frac()	None

Some of the "missing" verbs in pandas are because there are other, different ways of achieving the same goal. For example summarise is spread across mean, std, etc. Others, like sample_n, just haven't been implemented yet.

Filter rows with filter(), query()¶

In [9]:

%R filter(flights, month == 1, day == 1)

Out[9]:

	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute
0	2013	1	1	517	2	830	11	UA	N14228	1545	EWR	IAH	227	1400	5	17
1	2013	1	1	533	4	850	20	UA	N24211	1714	LGA	IAH	227	1416	5	33
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
840	2013	1	1	-2147483648	NaN	-2147483648	NaN	AA	N3EVAA	1925	LGA	MIA	NaN	1096	NaN	NaN
841	2013	1	1	-2147483648	NaN	-2147483648	NaN	B6	N618JB	125	JFK	FLL	NaN	1069	NaN	NaN

842 rows × 16 columns

In [10]:

# filter(flights, month == 1, day == 1)
flights.query("month == 1 & day == 1")

Out[10]:

	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute
1	2013	1	1	517	2	830	11	UA	N14228	1545	EWR	IAH	227	1400	5	17
2	2013	1	1	533	4	850	20	UA	N24211	1714	LGA	IAH	227	1416	5	33
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
841	2013	1	1	NaN	NaN	NaN	NaN	AA	N3EVAA	1925	LGA	MIA	NaN	1096	NaN	NaN
842	2013	1	1	NaN	NaN	NaN	NaN	B6	N618JB	125	JFK	FLL	NaN	1069	NaN	NaN

842 rows × 16 columns

The more verbose version:

In [11]:

%R flights[flights$month == 1 & flights$day == 1, ]

Out[11]:

	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute
0	2013	1	1	517	2	830	11	UA	N14228	1545	EWR	IAH	227	1400	5	17
1	2013	1	1	533	4	850	20	UA	N24211	1714	LGA	IAH	227	1416	5	33
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
840	2013	1	1	-2147483648	NaN	-2147483648	NaN	AA	N3EVAA	1925	LGA	MIA	NaN	1096	NaN	NaN
841	2013	1	1	-2147483648	NaN	-2147483648	NaN	B6	N618JB	125	JFK	FLL	NaN	1069	NaN	NaN

842 rows × 16 columns

In [12]:

# flights[flights$month == 1 & flights$day == 1, ]
flights[(flights.month == 1) & (flights.day == 1)]

Out[12]:

	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute
1	2013	1	1	517	2	830	11	UA	N14228	1545	EWR	IAH	227	1400	5	17
2	2013	1	1	533	4	850	20	UA	N24211	1714	LGA	IAH	227	1416	5	33
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
841	2013	1	1	NaN	NaN	NaN	NaN	AA	N3EVAA	1925	LGA	MIA	NaN	1096	NaN	NaN
842	2013	1	1	NaN	NaN	NaN	NaN	B6	N618JB	125	JFK	FLL	NaN	1069	NaN	NaN

842 rows × 16 columns

In [13]:

%R slice(flights, 1:10)

Out[13]:

	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute
0	2013	1	1	517	2	830	11	UA	N14228	1545	EWR	IAH	227	1400	5	17
1	2013	1	1	533	4	850	20	UA	N24211	1714	LGA	IAH	227	1416	5	33
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
8	2013	1	1	557	-3	838	-8	B6	N593JB	79	JFK	MCO	140	944	5	57
9	2013	1	1	558	-2	753	8	AA	N3ALAA	301	LGA	ORD	138	733	5	58

10 rows × 16 columns

In [14]:

# slice(flights, 1:10)
flights.iloc[:9]

Out[14]:

	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute
1	2013	1	1	517	2	830	11	UA	N14228	1545	EWR	IAH	227	1400	5	17
2	2013	1	1	533	4	850	20	UA	N24211	1714	LGA	IAH	227	1416	5	33
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
8	2013	1	1	557	-3	709	-14	EV	N829AS	5708	LGA	IAD	53	229	5	57
9	2013	1	1	557	-3	838	-8	B6	N593JB	79	JFK	MCO	140	944	5	57

9 rows × 16 columns

Arrange rows with arrange(), sort()¶

In [15]:

%R arrange(flights, year, month, day)

Out[15]:

	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute
0	2013	1	1	517	2	830	11	UA	N14228	1545	EWR	IAH	227	1400	5	17
1	2013	1	1	533	4	850	20	UA	N24211	1714	LGA	IAH	227	1416	5	33
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
336774	2013	12	31	-2147483648	NaN	-2147483648	NaN	UA		219	EWR	ORD	NaN	719	NaN	NaN
336775	2013	12	31	-2147483648	NaN	-2147483648	NaN	UA		443	JFK	LAX	NaN	2475	NaN	NaN

336776 rows × 16 columns

In [16]:

# arrange(flights, year, month, day) 
flights.sort(['year', 'month', 'day'])

Out[16]:

	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute
1	2013	1	1	517	2	830	11	UA	N14228	1545	EWR	IAH	227	1400	5	17
2	2013	1	1	533	4	850	20	UA	N24211	1714	LGA	IAH	227	1416	5	33
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
111295	2013	12	31	NaN	NaN	NaN	NaN	UA	NaN	219	EWR	ORD	NaN	719	NaN	NaN
111296	2013	12	31	NaN	NaN	NaN	NaN	UA	NaN	443	JFK	LAX	NaN	2475	NaN	NaN

336776 rows × 16 columns

In [17]:

%R arrange(flights, desc(arr_delay))

Out[17]:

	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute
0	2013	1	9	641	1301	1242	1272	HA	N384HA	51	JFK	HNL	640	4983	6	41
1	2013	6	15	1432	1137	1607	1127	MQ	N504MQ	3535	JFK	CMH	74	483	14	32
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
336774	2013	9	30	-2147483648	NaN	-2147483648	NaN	MQ	N511MQ	3572	LGA	CLE	NaN	419	NaN	NaN
336775	2013	9	30	-2147483648	NaN	-2147483648	NaN	MQ	N839MQ	3531	LGA	RDU	NaN	431	NaN	NaN

336776 rows × 16 columns

In [18]:

# arrange(flights, desc(arr_delay))
flights.sort('arr_delay', ascending=False)

Out[18]:

	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute
7073	2013	1	9	641	1301	1242	1272	HA	N384HA	51	JFK	HNL	640	4983	6	41
235779	2013	6	15	1432	1137	1607	1127	MQ	N504MQ	3535	JFK	CMH	74	483	14	32
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
336775	2013	9	30	NaN	NaN	NaN	NaN	MQ	N511MQ	3572	LGA	CLE	NaN	419	NaN	NaN
336776	2013	9	30	NaN	NaN	NaN	NaN	MQ	N839MQ	3531	LGA	RDU	NaN	431	NaN	NaN

336776 rows × 16 columns

Select columns with select(), []¶

In [19]:

%R select(flights, year, month, day)

Out[19]:

	year	month	day
0	2013	1	1
1	2013	1	1
...	...	...	...
336774	2013	9	30
336775	2013	9	30

336776 rows × 3 columns

In [20]:

# select(flights, year, month, day) 
flights[['year', 'month', 'day']]

Out[20]:

	year	month	day
1	2013	1	1
2	2013	1	1
...	...	...	...
336775	2013	9	30
336776	2013	9	30

336776 rows × 3 columns

In [21]:

%R select(flights, year:day)

Out[21]:

	year	month	day
0	2013	1	1
1	2013	1	1
...	...	...	...
336774	2013	9	30
336775	2013	9	30

336776 rows × 3 columns

In [22]:

# select(flights, year:day) 

# No real equivalent here. Although I think this is OK.
# Typically I'll have the columns I want stored in a list
# somewhere, which can be passed right into __getitem__ ([]).

In [23]:

%%R
select(flights, -(year:day))

In [24]:

# select(flights, -(year:day)) 

# Again, simliar story. I would just use
# flights.drop(cols_to_drop, axis=1)
# or fligths[flights.columns.difference(pd.Index(cols_to_drop))]
# point to dplyr!

In [25]:

%R select(flights, tail_num = tailnum)

Out[25]:

	tail_num
0	N14228
1	N24211
...	...
336774	N511MQ
336775	N839MQ

336776 rows × 1 columns

In [26]:

# select(flights, tail_num = tailnum)
flights.rename(columns={'tailnum': 'tail_num'})['tail_num']

Out[26]:

1         N14228
2         N24211
           ...  
336775    N511MQ
336776    N839MQ
Name: tail_num, dtype: object

But like Hadley mentions, not that useful since it only returns the one column. dplyr and pandas compare well here.

In [27]:

%R rename(flights, tail_num = tailnum)

Out[27]:

	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tail_num	flight	origin	dest	air_time	distance	hour	minute
0	2013	1	1	517	2	830	11	UA	N14228	1545	EWR	IAH	227	1400	5	17
1	2013	1	1	533	4	850	20	UA	N24211	1714	LGA	IAH	227	1416	5	33
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
336774	2013	9	30	-2147483648	NaN	-2147483648	NaN	MQ	N511MQ	3572	LGA	CLE	NaN	419	NaN	NaN
336775	2013	9	30	-2147483648	NaN	-2147483648	NaN	MQ	N839MQ	3531	LGA	RDU	NaN	431	NaN	NaN

336776 rows × 16 columns

In [28]:

# rename(flights, tail_num = tailnum)
flights.rename(columns={'tailnum': 'tail_num'})

Out[28]:

	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tail_num	flight	origin	dest	air_time	distance	hour	minute
1	2013	1	1	517	2	830	11	UA	N14228	1545	EWR	IAH	227	1400	5	17
2	2013	1	1	533	4	850	20	UA	N24211	1714	LGA	IAH	227	1416	5	33
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
336775	2013	9	30	NaN	NaN	NaN	NaN	MQ	N511MQ	3572	LGA	CLE	NaN	419	NaN	NaN
336776	2013	9	30	NaN	NaN	NaN	NaN	MQ	N839MQ	3531	LGA	RDU	NaN	431	NaN	NaN

336776 rows × 16 columns

Pandas is more verbose, but the the argument to columns can be any mapping. So it's often used with a function to perform a common task, say df.rename(columns=lambda x: x.replace('-', '_')) to replace any dashes with underscores. Also, rename (the pandas version) can be applied to the Index.

Extract distinct (unique) rows¶

In [29]:

%R distinct(select(flights, tailnum))

Out[29]:

	tailnum
0	N14228
1	N24211
...	...
4042	N785SK
4043	N557AS

4044 rows × 1 columns

In [30]:

# distinct(select(flights, tailnum))
flights.tailnum.unique()

Out[30]:

array(['N14228', 'N24211', 'N619AA', ..., 'N776SK', 'N785SK', 'N557AS'], dtype=object)

FYI this returns a numpy array instead of a Series.

In [31]:

%R distinct(select(flights, origin, dest))

Out[31]:

	origin	dest
0	EWR	IAH
1	LGA	IAH
...	...	...
222	EWR	ANC
223	EWR	LGA

224 rows × 2 columns

In [32]:

# distinct(select(flights, origin, dest))
flights[['origin', 'dest']].drop_duplicates()

Out[32]:

	origin	dest
1	EWR	IAH
2	LGA	IAH
...	...	...
255456	EWR	ANC
275946	EWR	LGA

224 rows × 2 columns

OK, so dplyr wins there from a consistency point of view. unique is only defined on Series, not DataFrames. The original intention for drop_duplicates is to check for records that were accidentally included twice. This feels a bit hacky using it to select the distinct combinations, but it works!

Add new columns with mutate()¶

In [33]:

%R mutate(flights, gain = arr_delay - dep_delay, speed = distance / air_time * 60)

Out[33]:

	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute	gain	speed
0	2013	1	1	517	2	830	11	UA	N14228	1545	EWR	IAH	227	1400	5	17	9	370.044053
1	2013	1	1	533	4	850	20	UA	N24211	1714	LGA	IAH	227	1416	5	33	16	374.273128
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
336774	2013	9	30	-2147483648	NaN	-2147483648	NaN	MQ	N511MQ	3572	LGA	CLE	NaN	419	NaN	NaN	NaN	NaN
336775	2013	9	30	-2147483648	NaN	-2147483648	NaN	MQ	N839MQ	3531	LGA	RDU	NaN	431	NaN	NaN	NaN	NaN

336776 rows × 18 columns

In [34]:

# mutate(flights,
#   gain = arr_delay - dep_delay,
#   speed = distance / air_time * 60)

#before pandas 0.16.0
# flights['gain'] = flights.arr_delay - flights.dep_delay
# flights['speed'] = flights.distance / flights.air_time * 60
# flights
flights.assign(gain=flights.arr_delay - flights.dep_delay,
               speed=flights.distance / flights.air_time * 60)

Out[34]:

	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute	gain	speed
1	2013	1	1	517	2	830	11	UA	N14228	1545	EWR	IAH	227	1400	5	17	9	370.044053
2	2013	1	1	533	4	850	20	UA	N24211	1714	LGA	IAH	227	1416	5	33	16	374.273128
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
336775	2013	9	30	NaN	NaN	NaN	NaN	MQ	N511MQ	3572	LGA	CLE	NaN	419	NaN	NaN	NaN	NaN
336776	2013	9	30	NaN	NaN	NaN	NaN	MQ	N839MQ	3531	LGA	RDU	NaN	431	NaN	NaN	NaN	NaN

336776 rows × 18 columns

In [35]:

%R mutate(flights, gain = arr_delay - dep_delay, gain_per_hour = gain / (air_time / 60) )

Out[35]:

	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute	gain	gain_per_hour
0	2013	1	1	517	2	830	11	UA	N14228	1545	EWR	IAH	227	1400	5	17	9	2.378855
1	2013	1	1	533	4	850	20	UA	N24211	1714	LGA	IAH	227	1416	5	33	16	4.229075
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
336774	2013	9	30	-2147483648	NaN	-2147483648	NaN	MQ	N511MQ	3572	LGA	CLE	NaN	419	NaN	NaN	NaN	NaN
336775	2013	9	30	-2147483648	NaN	-2147483648	NaN	MQ	N839MQ	3531	LGA	RDU	NaN	431	NaN	NaN	NaN	NaN

336776 rows × 18 columns

In [36]:

# mutate(flights,
#   gain = arr_delay - dep_delay,
#   gain_per_hour = gain / (air_time / 60)
# )

#before pandas 0.16.0
# flights['gain'] = flights.arr_delay - flights.dep_delay
# flights['gain_per_hour'] = flights.gain / (flights.air_time / 60)
# flights
(flights.assign(gain=flights.arr_delay - flights.dep_delay)
        .assign(gain_per_hour = lambda df: df.gain / (df.air_time / 60)))

Out[36]:

	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute	gain	gain_per_hour
1	2013	1	1	517	2	830	11	UA	N14228	1545	EWR	IAH	227	1400	5	17	9	2.378855
2	2013	1	1	533	4	850	20	UA	N24211	1714	LGA	IAH	227	1416	5	33	16	4.229075
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
336775	2013	9	30	NaN	NaN	NaN	NaN	MQ	N511MQ	3572	LGA	CLE	NaN	419	NaN	NaN	NaN	NaN
336776	2013	9	30	NaN	NaN	NaN	NaN	MQ	N839MQ	3531	LGA	RDU	NaN	431	NaN	NaN	NaN	NaN

336776 rows × 18 columns

The first example is pretty much identical (aside from the names, mutate vs. assign).

The second example just comes down to language differences. In R, it's possible to implement a function like mutate where you can refer to gain in the line calcuating gain_per_hour, even though gain hasn't actually been calcuated yet.

In Python, you can have arbitrary keyword arguments to functions (which we needed for .assign), but the order of the argumnets is arbitrary. So you can't have something like df.assign(x=df.a / df.b, y=x **2), because you don't know whether x or y will come first (you'd also get an error saying x is undefined.

To work around that with pandas, you'll need to split up the assigns, and pass in a callable to the second assign. The callable looks at itself to find a column named gain. Since the line above returns a DataFrame with the gain column added, the pipeline goes through just fine.

In [37]:

%R transmute(flights,  gain = arr_delay - dep_delay,  gain_per_hour = gain / (air_time / 60)  )

Out[37]:

	gain	gain_per_hour
0	9	2.378855
1	16	4.229075
...	...	...
336774	NaN	NaN
336775	NaN	NaN

336776 rows × 2 columns

In [38]:

# transmute(flights,
#   gain = arr_delay - dep_delay,
#   gain_per_hour = gain / (air_time / 60)
# )

#before pandas 0.16.0
# flights['gain'] = flights.arr_delay - flights.dep_delay
# flights['gain_per_hour'] = flights.gain / (flights.air_time / 60)
# flights[['gain', 'gain_per_hour']]
(flights.assign(gain=flights.arr_delay - flights.dep_delay)
        .assign(gain_per_hour = lambda df: df.gain / (df.air_time / 60))
        [['gain', 'gain_per_hour']])

Out[38]:

	gain	gain_per_hour
1	9	2.378855
2	16	4.229075
...	...	...
336775	NaN	NaN
336776	NaN	NaN

336776 rows × 2 columns

Summarise values with summarise()¶

In [39]:

flights.dep_delay.mean()

Out[39]:

12.639070257304708

Randomly sample rows with sample_n() and sample_frac()¶

There's an open PR on Github to make this nicer (closer to dplyr). For now you can drop down to numpy.

In [40]:

%R sample_n(flights, 10)

Out[40]:

	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute
0	2013	12	14	1939	39	2333	78	AA	N339AA	21	JFK	LAX	323	2475	19	39
1	2013	11	6	1954	-6	2221	-18	DL	N964DL	1147	LGA	ATL	113	762	19	54
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
8	2013	4	9	1855	-4	2128	3	FL	N981AT	645	LGA	ATL	105	762	18	55
9	2013	7	1	1931	11	2129	-29	UA	N36472	1667	EWR	LAS	271	2227	19	31

10 rows × 16 columns

In [41]:

# sample_n(flights, 10)
flights.loc[np.random.choice(flights.index, 10)]

Out[41]:

	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute
116699	2013	2	7	921	-7	1033	-17	B6	N649JB	4	JFK	BUF	51	301	9	21
105486	2013	12	25	1205	-1	1516	0	B6	N661JB	1129	JFK	RSW	169	1074	12	5
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
156196	2013	3	22	1037	5	1213	2	EV	N22971	4662	EWR	RDU	72	416	10	37
288597	2013	8	9	NaN	NaN	NaN	NaN	MQ	N735MQ	3493	LGA	CLE	NaN	419	NaN	NaN

10 rows × 16 columns

In [42]:

%R sample_frac(flights, 0.01)

Out[42]:

	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute
0	2013	6	6	1044	44	1441	6	HA	N384HA	51	JFK	HNL	580	4983	10	44
1	2013	1	24	1110	-4	1259	-16	DL	N338NB	1031	LGA	DTW	79	502	11	10
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
3366	2013	3	19	2107	3	15	16	UA	N847UA	852	EWR	IAH	213	1400	21	7
3367	2013	4	19	1633	28	1922	30	DL	N3736C	539	JFK	DEN	220	1626	16	33

3368 rows × 16 columns

In [43]:

# sample_frac(flights, 0.01)
flights.iloc[np.random.randint(0, len(flights),
                               .1 * len(flights))]

Out[43]:

	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute
210476	2013	5	19	1312	107	1435	90	AA	N438AA	327	LGA	ORD	108	733	13	12
68187	2013	11	14	951	-9	1058	-15	US	N953UW	2142	LGA	BOS	39	184	9	51
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
88447	2013	12	6	1242	-3	1447	-2	DL	N943DL	1131	LGA	DTW	85	502	12	42
87856	2013	12	5	2136	36	2254	15	EV	N717EV	5451	LGA	RIC	53	292	21	36

33677 rows × 16 columns

Grouped operations¶

In [44]:

%R planes <- group_by(flights, tailnum)
%R delay <- summarise(planes, count = n(),dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE))
%R delay <- filter(delay, count > 20, dist < 2000)

Out[44]:

	tailnum	count	dist	delay
0		2512	710.257564	NaN
1	N0EGMQ	371	676.188679	9.982955
...	...	...	...	...
2960	N999DN	61	895.459016	14.311475
2961	N9EAMQ	248	674.665323	9.235294

2962 rows × 4 columns

In [45]:

# planes <- group_by(flights, tailnum)
# delay <- summarise(planes,
#   count = n(),
#   dist = mean(distance, na.rm = TRUE),
#   delay = mean(arr_delay, na.rm = TRUE))
# delay <- filter(delay, count > 20, dist < 2000)

planes = flights.groupby("tailnum")
delay = (planes.agg({"year": "count",
                     "distance": "mean",
                     "arr_delay": "mean"})
               .rename(columns={"distance": "dist",
                                "arr_delay": "delay",
                                "year": "count"})
               .query("count > 20 & dist < 2000"))
delay

Out[45]:

	dist	delay	count
tailnum
N0EGMQ	676.188679	9.982955	371
N10156	757.947712	12.717241	153
...	...	...	...
N999DN	895.459016	14.311475	61
N9EAMQ	674.665323	9.235294	248

2961 rows × 3 columns

For me, dplyr's n() looked is a bit starge at first, but it's already growing on me.

I think pandas is more difficult for this particular example. There isn't as natural a way to mix column-agnostic aggregations (like count) with column-specific aggregations like the other two. You end up writing could like .agg{'year': 'count'} which reads, "I want the count of year", even though you don't care about year specifically. Additionally assigning names can't be done as cleanly in pandas; you have to just follow it up with a rename like before.

We may as well reproduce the graph. It looks like ggplots geom_smooth is some kind of lowess smoother. We can either us seaborn:

In [46]:

import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(12, 6))

sns.regplot("dist", "delay", data=delay, lowess=True, ax=ax,
            scatter_kws={'color': 'k', 'alpha': .5, 's': delay['count'] / 10}, ci=90,
            line_kws={'linewidth': 3});

No description has been provided for this image

Or using statsmodels directly for more control over the lowess, with an extremely lazy "confidence interval".

In [47]:

import statsmodels.api as sm

In [48]:

smooth = sm.nonparametric.lowess(delay.delay, delay.dist, frac=1/8)
ax = delay.plot(kind='scatter', x='dist', y = 'delay', figsize=(12, 6),
                color='k', alpha=.5, s=delay['count'] / 10)
ax.plot(smooth[:, 0], smooth[:, 1], linewidth=3);
std = smooth[:, 1].std()
ax.fill_between(smooth[:, 0], smooth[:, 1] - std, smooth[:, 1] + std, alpha=.25);

In [49]:

%R destinations <- group_by(flights, dest)
%R summarise(destinations, planes = n_distinct(tailnum), flights = n())

Out[49]:

	dest	planes	flights
0	ABQ	108	254
1	ACK	58	265
...	...	...	...
103	TYS	273	631
104	XNA	176	1036

105 rows × 3 columns

In [50]:

# destinations <- group_by(flights, dest)
# summarise(destinations,
#   planes = n_distinct(tailnum),
#   flights = n()
# )

destinations = flights.groupby('dest')
destinations.agg({
    'tailnum': lambda x: len(x.unique()),
    'year': 'count'
    }).rename(columns={'tailnum': 'planes',
                       'year': 'flights'})

Out[50]:

	planes	flights
dest
ABQ	108	254
ACK	58	265
...	...	...
TYS	273	631
XNA	176	1036

105 rows × 2 columns

Similar to how dplyr provides optimized C++ versions of most of the summarise functions, pandas uses cython optimized versions for most of the agg methods.

In [51]:

%R daily <- group_by(flights, year, month, day)
%R (per_day   <- summarise(daily, flights = n()))

Out[51]:

	year	month	day	flights
0	2013	1	1	842
1	2013	1	2	943
...	...	...	...	...
363	2013	12	30	968
364	2013	12	31	776

365 rows × 4 columns

In [52]:

# daily <- group_by(flights, year, month, day)
# (per_day   <- summarise(daily, flights = n()))

daily = flights.groupby(['year', 'month', 'day'])
per_day = daily['distance'].count()
per_day

Out[52]:

year  month  day
2013  1      1      842
             2      943
                   ... 
      12     30     968
             31     776
Name: distance, dtype: int64

In [53]:

%R (per_month <- summarise(per_day, flights = sum(flights)))

Out[53]:

	year	month	flights
0	2013	1	27004
1	2013	2	24951
...	...	...	...
10	2013	11	27268
11	2013	12	28135

12 rows × 3 columns

In [54]:

# (per_month <- summarise(per_day, flights = sum(flights)))
per_month = per_day.groupby(level=['year', 'month']).sum()
per_month

Out[54]:

year  month
2013  1        27004
      2        24951
               ...  
      11       27268
      12       28135
Name: distance, dtype: int64

In [55]:

%R (per_year  <- summarise(per_month, flights = sum(flights)))

Out[55]:

	year	flights
0	2013	336776

In [56]:

# (per_year  <- summarise(per_month, flights = sum(flights)))
per_year = per_month.sum()
per_year

Out[56]:

I'm not sure how dplyr is handling the other columns, like year, in the last example. With pandas, it's clear that we're grouping by them since they're included in the groupby. For the last example, we didn't group by anything, so they aren't included in the result.

Chaining¶

Any follower of Hadley's twitter account will know how much R users love the %>% (pipe) operator. And for good reason!

In [57]:

%R flights %>% group_by(year, month, day) %>%  select(arr_delay, dep_delay) %>%  summarise( arr = mean(arr_delay, na.rm = TRUE),  dep = mean(dep_delay, na.rm = TRUE)) %>% filter(arr > 30 | dep > 30)

Out[57]:

	year	month	day	arr	dep
0	2013	1	16	34.247362	24.612865
1	2013	1	31	32.602854	28.658363
...	...	...	...	...	...
47	2013	12	17	55.871856	40.705602
48	2013	12	23	32.226042	32.254149

49 rows × 5 columns

In [58]:

# flights %>%
#   group_by(year, month, day) %>%
#   select(arr_delay, dep_delay) %>%
#   summarise(
#     arr = mean(arr_delay, na.rm = TRUE),
#     dep = mean(dep_delay, na.rm = TRUE)
#   ) %>%
#   filter(arr > 30 | dep > 30)
(
flights.groupby(['year', 'month', 'day'])
    [['arr_delay', 'dep_delay']]
    .mean()
    .query('arr_delay > 30 | dep_delay > 30')
)

Out[58]:

			arr_delay	dep_delay
year	month	day
2013	1	16	34.247362	24.612865
	1	31	32.602854	28.658363
	1	...	...	...
	12	17	55.871856	40.705602
	12	23	32.226042	32.254149

49 rows × 2 columns

Other Data Sources¶

Pandas has tons IO tools to help you get data in and out, including SQL databases via SQLAlchemy.

Summary¶

I think pandas held up pretty well, considering this was a vignette written for dplyr. I found the degree of similarity more interesting than the differences. The most difficult task was renaming of columns within an operation; they had to be followed up with a call to rename after the operation, which isn't that burdensome honestly.

More and more it looks like we're moving towards future where being a language or package partisan just doesn't make sense. Not when you can load up a Jupyter (formerly IPython) notebook to call up a library written in R, and hand those results off to python or Julia or whatever for followup, before going back to R to make a cool shiny web app.

There will always be a place for your "utility belt" package like dplyr or pandas, but it wouldn't hurt to be familiar with both.

If you want to contribute to pandas, we're always looking for help at https://github.com/pydata/pandas/. You can get ahold of me directly on twitter.

In [56]: