I have Python code that splits a given large CSV into smaller CSVs. The large CSV has an ID column (column 1), which consecutive entries can share. It might look something like this:
sfsddf8sdf8, 123, -234, dfsdfe, fsefsddfe
sfsddf8sdf8, 754, 464, sdfgdg, QFdgdfgdr
sfsddf8sdf8, 485, 469, mgyhjd, brgfgrdfg
sfsddf8sdf8, 274, -234, dnthfh, jyfhghfth
sfsddf8sdf8, 954, -145, lihgyb, fthgfhthj
powedfnsk93, 257, -139, sdfsfs, sdfsdfsdf
powedfnsk93, 284, -126, sdgdgr, sdagssdff
powedfnsk93, 257, -139, srfgfr, sdffffsss
erfsfeeeeef, 978, 677, dfgdrg, ssdttnmmm
etc...
The IDs are not sorted alphabetically in the input file, but consecutive identical IDs are grouped together.
My code never splits an ID across different CSVs, so each ID appears in only one output CSV.
My code is:
import pandas as pd
import os
def iterateIDs(file): #create chunks based on tripID
    csv_reader = pd.read_csv(file, iterator=True, chunksize=1, header=None)
    first_chunk = csv_reader.get_chunk()
    id = first_chunk.iloc[0, 0]
    chunk = pd.DataFrame(first_chunk)
    for l in csv_reader:
        if id == l.iloc[0, 0] or len(chunk) < 1000000: #Keep adding to chunk if less than 1,000,000, or in middle of trip
            id = l.iloc[0, 0]
            chunk = chunk.append(l)
            continue
        id = l.iloc[0, 0]
        yield chunk
        chunk = pd.DataFrame(l)
    yield chunk
waypoint_filesize = os.stat('TripRecordsReportWaypoints.csv').st_size #checks filesize
if waypoint_filesize > 100000000: #if file too big, split into separate chunks
    chunk_count = 1
    chunk_Iterate = iterateIDs("TripRecordsReportWaypoints.csv")
    for chunk in chunk_Iterate:
        chunk.to_csv('SmallWaypoints_{}.csv'.format(chunk_count), header=None, index=None)
        chunk_count = chunk_count + 1
However, this code runs very slowly. I tested it on a small file (284 MB, 3.5 million rows) and it took over an hour to run. Is there any way I can achieve this result more quickly? I don't mind if the solution is outside of Python.
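For reference, this is the direction I was considering if I drop pandas entirely: a plain csv-module version of the same rule (keep filling the current output file until it has at least 1,000,000 rows and the ID changes). The filenames and the row threshold are copied from my code above, the filesize check is omitted, and I haven't benchmarked it, so treat it as a sketch rather than a tested solution.

import csv

def split_by_id(path, prefix='SmallWaypoints_', max_rows=1000000):
    with open(path, newline='') as infile:
        reader = csv.reader(infile)
        file_count = 1
        rows_in_file = 0
        prev_id = None
        outfile = open('{}{}.csv'.format(prefix, file_count), 'w', newline='')
        writer = csv.writer(outfile)
        for row in reader:
            current_id = row[0]
            # Start a new output file only at an ID boundary, so no ID is split across files
            if rows_in_file >= max_rows and current_id != prev_id:
                outfile.close()
                file_count += 1
                rows_in_file = 0
                outfile = open('{}{}.csv'.format(prefix, file_count), 'w', newline='')
                writer = csv.writer(outfile)
            writer.writerow(row)
            rows_in_file += 1
            prev_id = current_id
        outfile.close()

split_by_id('TripRecordsReportWaypoints.csv')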