I have Python code that splits a given large CSV into smaller CSVs. The large CSV has an ID column (column 1), which consecutive entries can share. It might look something like this:
sfsddf8sdf8, 123, -234, dfsdfe, fsefsddfe
sfsddf8sdf8, 754, 464, sdfgdg, QFdgdfgdr
sfsddf8sdf8, 485, 469, mgyhjd, brgfgrdfg
sfsddf8sdf8, 274, -234, dnthfh, jyfhghfth
sfsddf8sdf8, 954, -145, lihgyb, fthgfhthj
powedfnsk93, 257, -139, sdfsfs, sdfsdfsdf
powedfnsk93, 284, -126, sdgdgr, sdagssdff
powedfnsk93, 257, -139, srfgfr, sdffffsss
erfsfeeeeef, 978, 677, dfgdrg, ssdttnmmm
etc...
The IDs are not sorted alphabetically in the input file, but consecutive identical IDs are grouped together.
My code never splits an ID across different CSVs, so each ID appears in only one output CSV.
My code is:
import pandas as pd
import os
def iterateIDs(file): #create chunks based on tripID
    csv_reader = pd.read_csv(file, iterator=True, chunksize=1, header=None)
    first_chunk = csv_reader.get_chunk()
    id = first_chunk.iloc[0, 0]
    chunk = pd.DataFrame(first_chunk)
    for l in csv_reader:
        if id == l.iloc[0, 0] or len(chunk) < 1000000: #Keep adding to chunk if less than 1,000,000, or in middle of trip
            id = l.iloc[0, 0]
            chunk = chunk.append(l)
            continue
        id = l.iloc[0, 0]
        yield chunk
        chunk = pd.DataFrame(l)
    yield chunk
waypoint_filesize = os.stat('TripRecordsReportWaypoints.csv').st_size #checks filesize
if waypoint_filesize > 100000000: #if file too big, split into separate chunks
    chunk_count = 1
    chunk_Iterate = iterateIDs("TripRecordsReportWaypoints.csv")
    for chunk in chunk_Iterate:
        chunk.to_csv('SmallWaypoints_{}.csv'.format(chunk_count), header=None, index=None)
        chunk_count = chunk_count + 1
However, this code runs very slowly. I tested it on a small file (284 MB, 3.5 million rows) and it took over an hour to run. Is there any way I can achieve this result more quickly? I don't mind if the solution is outside of Python.
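For reference, this is the direction I was considering if I drop pandas entirely: a plain csv-module version of the same rule (keep filling the current output file until it has at least 1,000,000 rows and the ID changes). The filenames and the row threshold are copied from my code above, the filesize check is omitted, and I haven't benchmarked it, so treat it as a sketch rather than a tested solution.

import csv

def split_by_id(path, prefix='SmallWaypoints_', max_rows=1000000):
    with open(path, newline='') as infile:
        reader = csv.reader(infile)
        file_count = 1
        rows_in_file = 0
        prev_id = None
        outfile = open('{}{}.csv'.format(prefix, file_count), 'w', newline='')
        writer = csv.writer(outfile)
        for row in reader:
            current_id = row[0]
            # Start a new output file only at an ID boundary, so no ID is split across files
            if rows_in_file >= max_rows and current_id != prev_id:
                outfile.close()
                file_count += 1
                rows_in_file = 0
                outfile = open('{}{}.csv'.format(prefix, file_count), 'w', newline='')
                writer = csv.writer(outfile)
            writer.writerow(row)
            rows_in_file += 1
            prev_id = current_id
        outfile.close()

split_by_id('TripRecordsReportWaypoints.csv')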