2

I have a DataFrame that looks like this (where 'ID' is the name of the index):

                      VAF
ID  
chr1-115227855-T-A  0.002491
chr1-115227855-T-C  0.005449
chr1-115227856-C-A  0.000466
chr1-115227856-C-G  0.000311
chr1-115227856-C-T  0.002331

And a second DataFrame that looks like this:

  Chrom        Loc WT Var Change ConvChange  AO     DP         VAF IntEx   Gene  Upstream  Downstream  Individual
0  chr1  115227855  T   C    T>C        T>C  43  16155  0.00266171  TIII  TIIIa       NaN         NaN           1
1  chr1  115227856  C   T    C>T        C>T  25  16179  0.00154521  TIII  TIIIa       NaN         NaN           1
2  chr1  115227857  C   T    C>T        C>T  20  16178  0.00123625  TIII  TIIIa       NaN         NaN           1
3  chr1  115227858  A   T    A>T        T>A  29  16178  0.00179256  TIII  TIIIa       NaN         NaN           1
4  chr1  115227880  C   T    C>T        C>T  18  16150  0.00111455  TIII  TIIIa       NaN         NaN           1

I would like to make the second DataFrame look like the first. I have tried setting a new index like this:

df2.set_index(['Chrom','Loc','WT','Var']).VAF

But this just gives me a Series with a MultiIndex rather than a single string index.

Is there a way to do this?

2 Answers

6

apply a format_map

fmt = '{Chrom}-{Loc}-{WT}-{Var}'.format_map
df[['VAF']].set_index(df.apply(fmt, 1).rename('ID'))

                         VAF
ID                          
chr1-115227855-T-C  0.002662
chr1-115227856-C-T  0.001545
chr1-115227857-C-T  0.001236
chr1-115227858-A-T  0.001793
chr1-115227880-C-T  0.001115

one-line

because it's cool ¯\_(ツ)_/¯

df[['VAF']].set_index(df.apply('{Chrom}-{Loc}-{WT}-{Var}'.format_map, 1).rename('ID'))

Explanation

Create a function that takes a mapping and substitutes its key:value pairs into the format string. Notice that 'Loc' can be str or int, because format/format_map uses the string representation of each value.

fmt = '{Chrom}-{Loc}-{WT}-{Var}'.format_map

Make a new series object by applying the function to each row of df using df.apply with axis=1. In this case, each row will be passed as a pandas.Series and can be processed in a dictionary context. That's perfect for format_map. I'll end up renaming the series to 'ID' to match OP's output.

idx = df.apply(fmt, 1).rename('ID')
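As a minimal sketch of why this works: format_map only needs key lookup, so a plain dict and a row Series behave the same way. The two-row frame below is a hypothetical sample that mirrors the question's columns; it is not the OP's full data.

```python
import pandas as pd

# Hypothetical two-row sample with the same column names as the question.
df = pd.DataFrame({
    "Chrom": ["chr1", "chr1"],
    "Loc": [115227855, 115227856],
    "WT": ["T", "C"],
    "Var": ["C", "T"],
    "VAF": [0.00266171, 0.00154521],
})

# Bound method: calling fmt(mapping) fills each {key} from mapping[key].
fmt = "{Chrom}-{Loc}-{WT}-{Var}".format_map

# A plain dict works as the mapping.
from_dict = fmt({"Chrom": "chr1", "Loc": 115227855, "WT": "T", "Var": "C"})

# A row Series works too, because it supports lookup by column name.
from_row = fmt(df.iloc[0])

# Applying fmt row by row yields one label per row.
idx = df.apply(fmt, axis=1).rename("ID")
print(idx.tolist())
```

Extra columns such as 'VAF' are simply ignored, since the format string never asks for them.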

Now we pass that pandas.Series to set_index. Because idx was built from df itself, it has the same index and the same row order as df, so each label lands on the right row.

Use double square brackets, [['VAF']], to slice the columns so that we keep a dataframe whose columns are ['VAF']. If we used df['VAF'] instead, we would get back a series object whose name is 'VAF'. Also, pandas.Series doesn't have a set_index method, whereas pandas.DataFrame does.

df[['VAF']].set_index(idx)

                         VAF
ID                          
chr1-115227855-T-C  0.002662
chr1-115227856-C-T  0.001545
chr1-115227857-C-T  0.001236
chr1-115227858-A-T  0.001793
chr1-115227880-C-T  0.001115

Alternatively, we could have done this to get a series:

df.set_index(idx)['VAF']

ID
chr1-115227855-T-C    0.002662
chr1-115227856-C-T    0.001545
chr1-115227857-C-T    0.001236
chr1-115227858-A-T    0.001793
chr1-115227880-C-T    0.001115
Name: VAF, dtype: float64

See! Same data, but now as a series whose name is 'VAF'.
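To make the DataFrame-versus-Series distinction concrete, here is a small sketch. The frame is a hypothetical two-row stand-in for the result above, and the variable names are chosen only for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the re-indexed result shown above.
df = pd.DataFrame(
    {"VAF": [0.00266171, 0.00154521]},
    index=pd.Index(["chr1-115227855-T-C", "chr1-115227856-C-T"], name="ID"),
)

as_frame = df[["VAF"]]   # list of column names: stays a DataFrame
as_series = df["VAF"]    # single column name: becomes a Series

print(type(as_frame).__name__, list(as_frame.columns))
print(type(as_series).__name__, as_series.name)

# Series has no set_index method, which is why the DataFrame form is
# used when the index is set after selecting the column.
print(hasattr(as_series, "set_index"))

# A Series can be turned back into a one-column DataFrame if needed.
round_trip = as_series.to_frame()
```

Either form carries the same values and the same 'ID' index; the choice only affects which methods are available afterwards.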

4

First join the columns into a single Series of strings (casting 'Loc' to str), pass it to set_index, name the index with rename_axis, and select column VAF with double [] to get a one-column DataFrame:

s = df['Chrom'] + '-' + df['Loc'].astype(str) + '-' +  df['WT'] + '-' + df['Var']

df1 = df.set_index(s).rename_axis('ID')[['VAF']]
print(df1)
                         VAF
ID                          
chr1-115227855-T-C  0.002662
chr1-115227856-C-T  0.001545
chr1-115227857-C-T  0.001236
chr1-115227858-A-T  0.001793
chr1-115227880-C-T  0.001115

1 Comment

This is likely to be much faster by avoiding apply.
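One way to check this on your own data is a rough timing sketch like the one below. The synthetic frame, its size, and the helper names with_apply and with_concat are assumptions made for illustration; actual timings depend on the pandas version and the machine, so none are quoted here.

```python
import timeit
import pandas as pd

# Synthetic data (an assumption): n rows shaped like the question's columns.
n = 2_000
df = pd.DataFrame({
    "Chrom": ["chr1"] * n,
    "Loc": range(115227855, 115227855 + n),
    "WT": ["T"] * n,
    "Var": ["C"] * n,
})

def with_apply():
    # Row-wise: calls format_map once per row.
    return df.apply("{Chrom}-{Loc}-{WT}-{Var}".format_map, axis=1)

def with_concat():
    # Column-wise: string concatenation over whole columns.
    return df["Chrom"] + "-" + df["Loc"].astype(str) + "-" + df["WT"] + "-" + df["Var"]

# Both approaches should produce the same labels.
same_labels = with_apply().tolist() == with_concat().tolist()

t_apply = timeit.timeit(with_apply, number=3)
t_concat = timeit.timeit(with_concat, number=3)
print(f"apply:  {t_apply:.4f}s")
print(f"concat: {t_concat:.4f}s")
```

Increasing n gives a better picture of how each approach scales.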
