I want to iteratively append pandas DataFrames to a csv file. This is usually not a problem. However, some of the DataFrames may be missing columns, so simply appending writes their values into the wrong columns.
I start with
import csv

# newline='' is what the csv module recommends for files it writes to
with open('test.csv', 'w', newline='') as output:
    writer = csv.writer(output, delimiter=',')
    writer.writerow(['a', 'b', 'c'])
Then for example I add the DataFrame df
a b c
0 2 2.0 3
1 2 NaN 3
using the command
import pandas as pd

df = pd.DataFrame([{'a':2, 'b':2, 'c':3}, {'a':2, 'c':3}])
df.to_csv('test.csv', index = False, header = False, mode = 'a')
However, the next DataFrame that I want to append may look like
a c
0 1 1
1 1 1
When I append it, I do not want to write the header because it already exists. Doing the same as before does not work (as expected):
df = pd.DataFrame([{'a':1, 'c':1}, {'a':1, 'c':1}])
df.to_csv('test.csv', index = False, header = False, mode = 'a')
It yields
a b c
0 2 2.0 3.0
1 2 NaN 3.0
2 1 1.0 NaN
3 1 1.0 NaN
Of course I could read the existing csv into a DataFrame, append to it, and then overwrite the old file:
file = pd.read_csv('test.csv')
df = pd.DataFrame([{'a':1, 'c':1}, {'a':1, 'c':1}])
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job here
file = pd.concat([file, df])
file.to_csv('test.csv', index = False, header = True)
pd.read_csv('test.csv')
This does exactly what I want:
a b c
0 2 2.0 3
1 2 NaN 3
2 1 NaN 1
3 1 NaN 1
But reading the entire csv file, appending in pandas, and overwriting the file every time is bad for performance when I repeat the process many times. I want to write my intermediate results to a csv because all the aggregated data would be lost if I only appended to a DataFrame in memory and an error occurred. Is there a better solution to my problem?
I also tried adding the missing columns as empty columns, but they get added at the end, which doesn't help. It may still be a starting point for a better-performing solution.
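import csv
import os

import numpy as np

def append_to_csv(df, file):
    if not os.path.exists(file):
        # First write: create the file including the header
        df.to_csv(file, index = False, header = True)
    else:
        # Read only the header row of the existing file
        with open(file) as f:
            header = next(csv.reader(f))
        columns = df.columns
        # Missing columns are added at the END of df, so the order can still differ from the header
        for column in set(header) - set(columns):
            df[column] = np.nan
        df.to_csv(file, index = False, header = False, mode = 'a')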
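One possible direction, sketched here rather than tested at scale: instead of adding the missing columns one by one, reorder the DataFrame to the header's column order with `DataFrame.reindex(columns=...)`, which also fills columns that are absent from the DataFrame with NaN. This assumes the header row already in the file is the authoritative column list. The names `append_aligned` and `path` are hypothetical and only for illustration.

```python
import csv
import os

import pandas as pd


def append_aligned(df, path):
    """Append df to the CSV at path, following the column order of its header."""
    if not os.path.exists(path):
        # No file yet: write df together with its own header
        df.to_csv(path, index=False, header=True)
        return
    # Read just the first row, not the whole file
    with open(path, newline='') as f:
        header = next(csv.reader(f))
    # reindex puts the columns in header order and fills columns that
    # are missing from df with NaN (written as empty fields).
    # Assumption: columns of df that are NOT in the header are dropped.
    aligned = df.reindex(columns=header)
    aligned.to_csv(path, index=False, header=False, mode='a')
```

Calling `append_aligned(df, 'test.csv')` for each new DataFrame only reads the first line of the file, so the cost per append should not grow with the file size. If a DataFrame can contain columns that the header does not have, this sketch silently drops them, which may or may not be acceptable.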