python: multi-column pandas data-file obtained in FOR loop

Question

I am working on a Python script that builds a list of N .sdf files with glob, loops over them, performs some calculations for each file, and then stores the results in a pandas DataFrame. Assuming that I calculate 4 different properties for each file, for 1000 files the expected output should be a DataFrame with 5 columns and 1000 rows. Here is an example of the code:

# make a list of all .sdf files present in the data folder:
dirlist = [os.path.basename(p) for p in glob.glob('data' + '/*.sdf')]

# create an empty DataFrame with 5 columns:
# name of the file, value of p, value of ac, value of don, value of wt
df = pd.DataFrame(columns=["key", "p", "ac", "don", "wt"])

# for each sdf file get its name and calculate 4 different properties: p, ac, don, wt
for sdf in dirlist:
    sdf_name = sdf.rsplit(".", 1)[0]
    # use the file name (without extension) as the key
    key = f'{sdf_name}'
    mol = open(sdf, 'rb')
    # --- do some specific calculations ---
    p = MolLogP(mol)  # coeff conc-perm
    ac = CalcNumLipinskiHBA(mol)
    don = CalcNumLipinskiHBD(mol)
    wt = MolWt(mol)
    # add one row to the DataFrame in the following order: ["key", "p", "ac", "don", "wt"]
    df[key] = [p, ac, don, wt]

The problem is in the last line of the script, which is supposed to collect all of the calculated values in one row and append it to the DataFrame together with the name of the processed file. Eventually, for 1000 processed SDF files, my DataFrame should contain 5 columns and 1000 rows.

piterbarg · Accepted Answer · 2020-12-01 12:53:03Z

You should replace the troublesome line with something like

df.loc[len(df)] = [key, p, ac, don, wt]

This will append a new row at the end of the df.

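Alternatively, you can do

df = df.append(adict, ignore_index=True)

where adict is a dictionary of your values, with the column names as keys:

adict = {'key':key, 'p':p, ...}

Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0. On current versions, use pd.concat([df, pd.DataFrame([adict])], ignore_index=True) instead.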
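As a rough sketch of how the dictionary approach can look on recent pandas versions, the rows can be collected in a list inside the loop and turned into a DataFrame once at the end. The file names and numbers below are made-up stand-ins for the values calculated from each .sdf file, not real results.

```python
import pandas as pd

columns = ["key", "p", "ac", "don", "wt"]

# Hypothetical stand-ins for the values computed per .sdf file
fake_results = {
    "mol_a": (1.2, 3, 1, 180.2),
    "mol_b": (2.5, 5, 2, 310.4),
    "mol_c": (0.7, 2, 0, 94.1),
}

rows = []
for key, (p, ac, don, wt) in fake_results.items():
    # one dictionary per processed file, keyed by column name
    rows.append({"key": key, "p": p, "ac": ac, "don": don, "wt": wt})

# build the DataFrame once, after the loop
df = pd.DataFrame(rows, columns=columns)

# a single extra row can still be added later with pd.concat
extra = {"key": "mol_d", "p": 3.1, "ac": 4, "don": 1, "wt": 250.0}
df = pd.concat([df, pd.DataFrame([extra])], ignore_index=True)

print(df)
```

Building the frame once tends to be cheaper than growing it row by row, because every append or concat call copies the existing data.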

7 Comments

user14418738 Over a year ago

Thank you so much! Both methods work equally well. What could be the advantage of the second method (using a dictionary)?

piterbarg Over a year ago

@MaîtreRenard No problem! I find the second one more readable, and it is harder to make a mistake with it, for example when you change the order of the columns for some reason. Also, the first one relies on df.index being a simple RangeIndex (rows numbered sequentially from 0 to n - 1, as in your case) and would not work for more complex dataframes.
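To illustrate the RangeIndex caveat, here is a small sketch with invented data where the index does not start at 0. In that situation df.loc[len(df)] points at a label that already exists, so the row is overwritten rather than appended.

```python
import pandas as pd

# Toy frame whose index labels are 1, 2, 3 instead of 0, 1, 2
df = pd.DataFrame(
    {"key": ["a", "b", "c"], "p": [1.0, 2.0, 3.0]},
    index=[1, 2, 3],
)

# len(df) is 3 and label 3 already exists, so this replaces row "c"
df.loc[len(df)] = ["d", 4.0]
print(len(df))  # still 3 rows

# Concatenating a one-row frame does not depend on the index labels
new_row = pd.DataFrame([{"key": "e", "p": 5.0}])
df2 = pd.concat([df, new_row], ignore_index=True)
print(len(df2))  # 4 rows
```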

user14418738 Over a year ago

Right, thank you! Personally, I also prefer the dictionary. P.S. I would be grateful for your comments on a similar topic about filtering pandas data files according to several columns. Thank you +++! unix.stackexchange.com/questions/622367/…

piterbarg Over a year ago

@MaîtreRenard I am not a member of that community. You should repost it here for a quick pandas answer! Or just check this out, it is quite simple: stackoverflow.com/a/43632964/14551426

user14418738 Over a year ago

Ok, thank you again! I suppose in my case this syntax could be adapted to something like: df_filter = df[(df['LogP'] < 5) & (df['Hb_acc'] < 10) & (df['Hb_donnors'] < 5) & (df['Weight'] < 500)] but I am not sure whether to use & or | between the conditions on the different columns.
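On the & versus | question, a short sketch with invented values may help: & keeps the rows where every condition holds (logical AND), while | keeps the rows where at least one condition holds (logical OR). The column names are taken from the comment above; the data is made up for illustration.

```python
import pandas as pd

# Toy data, not real molecular properties
df = pd.DataFrame({
    "LogP": [2.1, 6.3, 4.0],
    "Hb_acc": [4, 3, 12],
    "Hb_donnors": [1, 2, 3],
    "Weight": [320.0, 410.0, 450.0],
})

# Each comparison must be wrapped in parentheses before combining
all_conditions = (
    (df["LogP"] < 5)
    & (df["Hb_acc"] < 10)
    & (df["Hb_donnors"] < 5)
    & (df["Weight"] < 500)
)
any_condition = (
    (df["LogP"] < 5)
    | (df["Hb_acc"] < 10)
    | (df["Hb_donnors"] < 5)
    | (df["Weight"] < 500)
)

df_and = df[all_conditions]  # rows passing every filter
df_or = df[any_condition]    # rows passing at least one filter

print(df_and)
print(df_or)
```

If the goal is to keep only the rows that satisfy all four thresholds at once, & is the operator that does that.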
