I am working on a Python script that builds a list of N .sdf files using glob, loops over them, performs some calculations for each file, and then stores the results in a pandas DataFrame. Assuming that I calculate 4 different properties for each file, the expected output for 1000 files is a DataFrame with 5 columns and 1000 rows. Here is an example of the code:

import glob
import os

import pandas as pd
from rdkit import Chem
from rdkit.Chem.Crippen import MolLogP
from rdkit.Chem.Descriptors import MolWt
from rdkit.Chem.rdMolDescriptors import CalcNumLipinskiHBA, CalcNumLipinskiHBD

# make a list of all .sdf files present in the data folder:
dirlist = [os.path.basename(p) for p in glob.glob('data' + '/*.sdf')]

# create an empty DataFrame with 5 columns:
# name of the file, value of variable p, value of ac, value of don, value of wt
df = pd.DataFrame(columns=["key", "p", "ac", "don", "wt"])

# for each sdf file get its name and calculate 4 different properties: p, ac, don, wt
for sdf in dirlist:
    # name of the file without its extension
    key = sdf.rsplit(".", 1)[0]
    # read the molecule (assumes RDKit and one molecule per file)
    mol = Chem.MolFromMolFile(os.path.join('data', sdf))
    # --- do some specific calculations ---
    p = MolLogP(mol)  # coeff conc-perm
    ac = CalcNumLipinskiHBA(mol)
    don = CalcNumLipinskiHBD(mol)
    wt = MolWt(mol)
    # add one line to DF in the following order: ["key", "p", "ac", "don", "wt"]
    df[key] = [p, ac, don, wt]

The problem is in the last line of the script, which is supposed to collect all of the calculated values for one file and append them to the DataFrame as a single row, together with the name of the processed file. In the end, for 1000 processed SDF files, my DataFrame should contain 5 columns and 1000 rows.

1 Answer

The troublesome line, df[key] = [p, ac, don, wt], assigns a new column named after the file rather than a new row. You should replace it with something like

df.loc[len(df)] = [key, p, ac, don, wt]

This appends a new row at the end of df.
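
As a quick illustration of this pattern, here is a minimal sketch with placeholder values standing in for the calculated properties (the names mol_a, mol_b, mol_c and all numbers are made up). It also shows one thing to watch for: the pattern assumes the index runs from 0 to n-1, so on a frame with a different index len(df) can point at an existing row.

```python
import pandas as pd

df = pd.DataFrame(columns=["key", "p", "ac", "don", "wt"])

# Placeholder values standing in for the calculated properties.
df.loc[len(df)] = ["mol_a", 1.2, 3, 1, 180.2]   # len(df) == 0, creates row 0
df.loc[len(df)] = ["mol_b", 4.7, 6, 2, 342.4]   # len(df) == 1, creates row 1
print(df)

# Caveat: if the index is no longer 0..n-1, len(df) may name an existing row.
sub = df.drop(index=0)                           # remaining index label: 1
sub.loc[len(sub)] = ["mol_c", 0.5, 2, 0, 120.1]  # label 1 exists, so it is overwritten
print(sub)
```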

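Alternatively, you can do

df = df.append(adict, ignore_index=True)

where adict is a dictionary that maps the column names (as keys) to your values:

adict = {'key':key, 'p':p, ...}

Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0. On current versions, pd.concat([df, pd.DataFrame([adict])], ignore_index=True) gives the same result.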
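
For a large number of files, a common variation on the dictionary approach is to collect one dictionary per file in a plain list and build the DataFrame once after the loop, rather than growing it row by row. A minimal sketch, where fake_results is a hypothetical stand-in for the values computed in the loop (the names and numbers are placeholders, not real results):

```python
import pandas as pd

# Placeholder results: in the real script these come from the loop over SDF files.
fake_results = [
    ("mol_a", 1.2, 3, 1, 180.2),
    ("mol_b", 4.7, 6, 2, 342.4),
]

rows = []
for key, p, ac, don, wt in fake_results:
    # One dictionary per file; the keys are the column names.
    rows.append({"key": key, "p": p, "ac": ac, "don": don, "wt": wt})

# Build the DataFrame once, fixing the column order explicitly.
df = pd.DataFrame(rows, columns=["key", "p", "ac", "don", "wt"])
print(df.shape)
```

Because each value is tied to its column name, reordering the columns later does not silently misplace values.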
7 Comments

Thank you so much! Both methods work equally well. What would be the advantage of the second method (using a dictionary)?
@MaîtreRenard No problem! I find the second one more readable, and it is harder to make a mistake if, for example, you change the order of the columns for some reason. Also, the first one relies on df.index being a simple RangeIndex (rows numbered sequentially from 0 to n-1, as in your case) and would not work for more complex dataframes.
Right, thank you! Personally I also prefer the dictionary. P.S. I would be grateful for your comments on a similar topic about filtering pandas data according to several columns. Thank you! unix.stackexchange.com/questions/622367/…
@MaîtreRenard I am not a member of that community. You should repost it here for a quick pandas answer! Or just check this out, it is quite simple: stackoverflow.com/a/43632964/14551426
OK, thank you again! I suppose in my case this syntax could be adapted to something like: df_filter = df[(df['LogP'] < 5) & (df['Hb_acc'] < 10) & (df['Hb_donnors'] < 5) & (df['Weight'] < 500)], but I am not sure whether to use & or | between the conditions on the different columns.
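
On the & versus | question: in pandas boolean indexing, & keeps only the rows where every condition holds, while | keeps the rows where at least one holds. Each condition needs its own parentheses because & and | bind more tightly than the comparison operators. A small sketch, using the column names from the comment above and made-up values:

```python
import pandas as pd

# Made-up values; the column names follow the comment above.
df = pd.DataFrame({
    "LogP": [2.1, 6.3, 5.5],
    "Hb_acc": [4, 8, 12],
    "Hb_donnors": [1, 2, 3],
    "Weight": [250.0, 480.0, 510.0],
})

# & : every condition must hold for a row to be kept.
all_ok = df[(df["LogP"] < 5) & (df["Hb_acc"] < 10)
            & (df["Hb_donnors"] < 5) & (df["Weight"] < 500)]

# | : a row is kept if at least one of the conditions holds.
any_ok = df[(df["LogP"] < 5) | (df["Weight"] < 500)]

print(all_ok)
print(any_ok)
```

If the goal is to keep only the rows that satisfy all four thresholds at once, & is the operator to use.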