I am working on a Python script that builds a list of N .sdf files with glob, loops over them, performs some calculations on each file, and then stores the results in a pandas DataFrame. Assuming that I calculate 4 different properties for each file, the expected output for 1000 files is a DataFrame with 5 columns and 1000 rows. Here is an example of the code:
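import glob
import os
import pandas as pd
# assumed: the descriptor functions used below come from RDKit
from rdkit.Chem.Crippen import MolLogP
from rdkit.Chem.Descriptors import MolWt
from rdkit.Chem.rdMolDescriptors import CalcNumLipinskiHBA, CalcNumLipinskiHBD

# make a list of all .sdf files present in the data folder:
dirlist = [os.path.basename(p) for p in glob.glob('data/*.sdf')]
# create an empty DataFrame with 5 columns:
# name of the file, value of p, value of ac, value of don, value of wt
df = pd.DataFrame(columns=["key", "p", "ac", "don", "wt"])
# for each sdf file, get its name and calculate 4 different properties: p, ac, don, wt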
for sdf in dirlist:
    # file name without the extension, used as the key
    sdf_name = sdf.rsplit(".", 1)[0]
    key = sdf_name
    mol = open(sdf, 'rb')
    # --- do some specific calculations ---
    p = MolLogP(mol)              # logP (partition coefficient)
    ac = CalcNumLipinskiHBA(mol)  # number of H-bond acceptors
    don = CalcNumLipinskiHBD(mol) # number of H-bond donors
    wt = MolWt(mol)               # molecular weight
    # add one row to the DataFrame in the following order: ["key", "p", "ac", "don", "wt"]
    df[key] = [p, ac, don, wt]
The problem is in the last line of the script, which is supposed to collect all of the calculated values into one row and append it to the DataFrame together with the name of the processed file. Eventually, for 1000 processed SDF files, my DataFrame should contain 5 columns and 1000 rows.
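
For reference, `df[key] = [...]` assigns a column named after the file rather than appending a row. One common alternative is to collect one dict per file in a plain list and build the DataFrame once after the loop. The following is only a minimal sketch of that pattern: `compute_properties` and the file names are hypothetical stand-ins that return dummy values, so that it runs without RDKit or real .sdf files.

```python
import pandas as pd

# Hypothetical stand-in for the per-file calculations (MolLogP, MolWt, ...).
# It returns fixed dummy numbers so the sketch runs without RDKit or real files.
def compute_properties(name):
    return {"p": 1.5, "ac": 2, "don": 1, "wt": 180.2}

# Hypothetical file names standing in for the glob result.
dirlist = ["mol_a.sdf", "mol_b.sdf", "mol_c.sdf"]

rows = []  # one dict per processed file
for sdf in dirlist:
    key = sdf.rsplit(".", 1)[0]         # file name without the extension
    props = compute_properties(sdf)     # the 4 calculated properties
    rows.append({"key": key, **props})  # one row: key plus 4 properties

# Build the DataFrame once, after the loop, with a fixed column order.
df = pd.DataFrame(rows, columns=["key", "p", "ac", "don", "wt"])
print(df.shape)
```

If the rows really have to be added inside the loop, `df.loc[len(df)] = [key, p, ac, don, wt]` should also work on a DataFrame with a default integer index, but growing a DataFrame row by row is usually slower than building it from a list.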