Use DataFrame.explode():
df.explode('B')
A B
0 1 1
1 1 2
0 2 1
1 2 2
This was added in pandas 0.25, and support for multiple columns was added in pandas 1.3.0.
[Original answer follows, including legacy solutions]
I know object dtype makes the data hard to convert with pandas functions. When I received the data like this, the first thing that came to mind was to "flatten" or unnest the columns.
I am using pandas and Python functions for this type of question. If you are worried about the speed of the below solutions, check out user3483203's answer, since it's using numpy and most of the time numpy is faster. I recommend Cython or numba if speed matters.
Method 0: .explode() [pandas >= 0.25]
[Solution moved above. The below note no longer applies since NaNs are now handled as expected, but it is kept for reference.]
Given a dataframe with an empty list or a NaN in the column. An empty list will not cause an issue, but a NaN will need to be filled with a list
df = pd.DataFrame({
'A': [1, 2, 3, 4],
'B': [[1, 2], [1, 2], [], np.nan]})
df.B = df.B.fillna({i: [] for i in df.index}) # replace NaN with []
df.explode('B')
A B
0 1 1
0 1 2
1 2 1
1 2 2
2 3 NaN
3 4 NaN
Method 1: apply + pd.Series
(easy to understand but in terms of performance not recommended.)
df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0:'B'})
Out[463]:
A B
0 1 1
1 1 2
0 2 1
1 2 2
Method 2
Using repeat with DataFrame constructor, re-create your dataframe (good at performance, not good at multiple columns)
df = pd.DataFrame({
'A': df.A.repeat(df.B.str.len()),
'B': np.concatenate(df.B.values)})
df
Out[465]:
A B
0 1 1
0 1 2
1 2 1
1 2 2
Method 2.1
for example besides A we have A.1 .....A.n. If we still use the method (Method 2) above it is hard for us to re-create the columns one by one.
Solution: join or merge with the index after 'unnest' the single columns
s = pd.DataFrame({
'B': np.concatenate(df.B.values)},
index=df.index.repeat(df.B.str.len()))
s.join(df.drop('B', 1), how='left')
Out[477]:
B A
0 1 1
0 2 1
1 1 2
1 2 2
If you need the column order exactly the same as before, add reindex at the end.
s.join(df.drop('B', 1), how='left').reindex(columns=df.columns)
Method 3
recreate the list
pd.DataFrame([[x] + [z] for x, y in df.values for z in y], columns=df.columns)
Out[488]:
A B
0 1 1
1 1 2
2 2 1
3 2 2
If more than two columns, use
s = pd.DataFrame([[x] + [z] for x, y in zip(df.index, df.B) for z in y])
s.merge(df, left_on=0, right_index=True)
Out[491]:
0 1 A B
0 0 1 1 [1, 2]
1 0 2 1 [1, 2]
2 1 1 2 [1, 2]
3 1 2 2 [1, 2]
Method 4: using reindex or loc
df.reindex(df.index.repeat(df.B.str.len())).assign(B=np.concatenate(df.B.values))
Out[554]:
A B
0 1 1
0 1 2
1 2 1
1 2 2
#df.loc[df.index.repeat(df.B.str.len())].assign(B=np.concatenate(df.B.values))
Method 5
when the list only contains unique values:
df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [3, 4]]})
from collections import ChainMap
d = dict(ChainMap(*map(dict.fromkeys, df['B'], df['A'])))
pd.DataFrame(list(d.items()), columns=df.columns[::-1])
Out[574]:
B A
0 1 1
1 2 1
2 3 2
3 4 2
Method 6: using numpy for high performance
newvalues = np.dstack((
np.repeat(df.A.values, list(map(len, df.B.values))),
np.concatenate(df.B.values)))
pd.DataFrame(data=newvalues[0], columns=df.columns)
A B
0 1 1
1 1 2
2 2 1
3 2 2
Method 7: Pure Python solution just for fun
using base function itertools cycle and chain
from itertools import cycle, chain
l = df.values.tolist()
l1 = [
list(zip([x[0]], cycle(x[1]))
if len([x[0]]) > len(x[1])
else list(zip(cycle([x[0]]), x[1])))
for x in l]
pd.DataFrame(list(chain.from_iterable(l1)), columns=df.columns)
A B
0 1 1
1 1 2
2 2 1
3 2 2
Generalizing to multiple columns
df = pd.DataFrame({
'A': [1, 2],
'B': [[1, 2], [3, 4]],
'C': [[1, 2], [3, 4]]})
df
Out[592]:
A B C
0 1 [1, 2] [1, 2]
1 2 [3, 4] [3, 4]
Self-def function:
def unnesting(df, explode):
idx = df.index.repeat(df[explode[0]].str.len())
df1 = pd.concat(
[pd.DataFrame({x: np.concatenate(df[x].values)})
for x in explode],
axis=1)
df1.index = idx
return df1.join(df.drop(explode, 1), how='left')
unnesting(df, ['B', 'C'])
Out[609]:
B C A
0 1 1 1
0 2 2 1
1 3 3 2
1 4 4 2
Column-wise Unnesting
All above methods are talking about vertical unnesting and explode. If you need to expand the list horizontally, try pd.DataFrame constructor
df.join(pd.DataFrame(df.B.tolist(), index=df.index).add_prefix('B_'))
Out[33]:
A B C B_0 B_1
0 1 [1, 2] [1, 2] 1 2
1 2 [3, 4] [3, 4] 3 4
Updated function
def unnesting(df, explode, axis):
if axis == 1:
idx = df.index.repeat(df[explode[0]].str.len())
df1 = pd.concat(
[pd.DataFrame({x: np.concatenate(df[x].values)})
for x in explode],
axis=1)
df1.index = idx
return df1.join(df.drop(explode, 1), how='left')
else:
df1 = pd.concat(
[pd.DataFrame(df[x].tolist(), index=df.index).add_prefix(x)
for x in explode],
axis=1)
return df1.join(df.drop(explode, 1), how='left')
Test output
unnesting(df, ['B', 'C'], axis=0)
Out[36]:
B0 B1 C0 C1 A
0 1 2 1 2 1
1 3 4 3 4 2
Update 2021-02-17 with original explode function
def unnesting(df, explode, axis):
if axis == 1:
df1 = pd.concat(
[df[x].explode() for x in explode],
axis=1)
return df1.join(df.drop(explode, 1), how='left')
else:
df1 = pd.concat(
[pd.DataFrame(df[x].tolist(), index=df.index).add_prefix(x)
for x in explode],
axis=1)
return df1.join(df.drop(explode, 1), how='left')