I have a dataframe with two columns (and a lot of rows): one column contains the full sequence and the other contains a subsequence.

I want to find the index at which the subsequence starts within the full sequence and add it as another column.

I have tried this:

df["start"] = df.sequence.index(df.sub_sequence)

But this returns: TypeError: 'RangeIndex' object is not callable

What am I doing wrong?

Here's the df and the df I wish to end up with:

Sample dataframe:

import pandas as pd

data = {"sequence": ["abcde", "fghij", "klmno"], "sub_sequence": ["cde", "gh", "no"]}
df = pd.DataFrame(data, columns=["sequence", "sub_sequence"])

  sequence sub_sequence
0    abcde          cde
1    fghij           gh
2    klmno           no

Expected result:

data2 = {"sequence": ["abcde", "fghij", "klmno"], "sub_sequence": ["cde", "gh", "no"], "start": [2, 1, 3]}
df2 = pd.DataFrame(data2, columns=["sequence", "sub_sequence", "start"])

  sequence sub_sequence  start
0    abcde          cde      2
1    fghij           gh      1
2    klmno           no      3

1 Answer

Use zip and str.index in a list comprehension:

df['start'] = [seq.index(sub) for seq, sub in zip(df['sequence'], df['sub_sequence'])]
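
One caveat: str.index raises ValueError when the substring is not present. If some rows might not match, a minimal sketch using str.find instead (which returns -1 for a miss) could look like this. The fourth row is a hypothetical addition to the sample data, included only to show the miss case:

```python
import pandas as pd

# The last row is hypothetical: "xyz" does not occur in "pqrst".
df = pd.DataFrame({
    "sequence": ["abcde", "fghij", "klmno", "pqrst"],
    "sub_sequence": ["cde", "gh", "no", "xyz"],
})

# str.find returns -1 for a missing substring, where str.index would raise ValueError.
df["start"] = [seq.find(sub) for seq, sub in zip(df["sequence"], df["sub_sequence"])]
print(df)
```

Whether -1 or an exception is the better signal depends on how the start column is used afterwards.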

Or use DataFrame.apply along axis=1 with str.index:

df['start'] = df[['sequence', 'sub_sequence']].apply(lambda s: str.index(*s), axis=1)

Result:

  sequence sub_sequence  start
0    abcde          cde      2
1    fghij           gh      1
2    klmno           no      3
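
As for why the original attempt failed: on a pandas Series, .index is an attribute holding the row labels, not the string method str.index, so df.sequence.index(...) tries to call the index object itself. A small sketch that reproduces this on the sample data (the variable name message is only for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "sequence": ["abcde", "fghij", "klmno"],
    "sub_sequence": ["cde", "gh", "no"],
})

# df.sequence.index is the Series' row index (an attribute), not str.index.
print(type(df.sequence.index).__name__)

# Calling that attribute reproduces the error from the question.
try:
    df.sequence.index(df.sub_sequence)
    message = None
except TypeError as exc:
    message = str(exc)
print(message)
```

This is why both approaches above call str.index on the individual Python strings in each row rather than on the column.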