Convert a text CSV file to binary and get a random line from it in Python without reading it into memory

Question

I have some CSV text files in the format:

1.3, 0, 1.0
20.0, 3.2, 0
30.5, 5.0, 5.2

The files are about 3.5 GB in size and I cannot read any of them into memory in Pandas in a useful amount of time.

But I don't need to read the whole file, because what I want to do is choose some random lines from the file and read the values there. I know it's theoretically possible if the file is formatted so that all the fields have the same size, for instance float16 in a binary file.

Now, I think I can just convert it, using the NumPy method specified in the answer to the question: How to output list of floats to a binary file in Python

But, how do I go about picking a random line from it after the conversion is done?

In a normal text file, I could just do:

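import os
import random

filesize = os.path.getsize('really_big_file')
offset = random.randrange(filesize)
with open('really_big_file', 'rb') as f:  # binary mode, so any byte offset is a valid seek target
    f.seek(offset)                  # go to random position
    f.readline()                    # discard - bound to be partial line
    random_line = f.readline()      # bingo!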

But I can't find a way for this to work in a binary file made from NumPy.

@TimPietzcker -- Isn't that basically what the code snippet is doing? Of course, with that approach you eliminate the possibility of picking the first line ...
– mgilson, Commented Oct 9, 2012 at 11:56
No, because the lines in the original text CSV have different lengths, so I would get a bias that favours the longer lines over the shorter ones (i.e., in the example data, the 3rd line would have almost a 30% higher probability of being chosen than the 1st).
– jbssm, Commented Oct 9, 2012 at 11:57
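
To put rough numbers on that bias, here is a small sketch using the three sample lines from the question. It assumes ASCII text and one-byte `\n` line endings, so the exact ratio depends on how the real file is written. Note also that with the seek-and-readline trick the line actually returned is the one after the landing point, so the skew comes from the length of the preceding line.

```python
# Sample lines from the question, newline included (assumed "\n" endings).
lines = ["1.3, 0, 1.0\n", "20.0, 3.2, 0\n", "30.5, 5.0, 5.2\n"]
sizes = [len(line.encode("ascii")) for line in lines]
total = sum(sizes)

# Chance that a uniformly random byte offset falls inside each line.
landing = [size / total for size in sizes]

# With fixed-size records, every line would be equally likely instead.
uniform = 1 / len(lines)

print(sizes, landing, uniform)
```

Under these assumptions the third line is 15 bytes against 12 for the first, so a random byte offset lands in it 25% more often.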

mgilson · Accepted Answer · 2012-10-09 12:10:20Z

2

I'd use struct to convert to binary:

import struct
with open('input.txt') as fin, open('output.txt','wb') as fout:
     for line in fin:
         #You could also use `csv` if you're not lazy like me ...
         out_line = struct.pack('3f',*(float(x) for x in line.split(',')))
         fout.write(out_line)

This writes everything as standard 4-byte floats on most systems.
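
If "most systems" is not good enough, the record size can be checked rather than assumed. The sketch below uses `struct.calcsize` and the `<` prefix, which selects the struct module's standard sizes and little-endian byte order. Choosing little-endian here is my assumption, not something the original code does.

```python
import struct

# Native format: size and byte order follow the platform's C compiler.
native_size = struct.calcsize('3f')

# Standard format: with '<', 'f' is always 4 bytes and there is no padding.
portable_size = struct.calcsize('<3f')

record = struct.pack('<3f', 1.3, 0.0, 1.0)
values = struct.unpack('<3f', record)

print(native_size, portable_size, len(record), values)
```

Whatever format string is used for writing must also be used for reading, and `line_size` should come from `struct.calcsize` of that same string.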

Now, to read the data again:

import os
import random
import struct

filesize = os.path.getsize('output.txt')
with open('output.txt','rb') as fin:
    line_size = 12 #each line is 12 bytes long (3 floats, 4 bytes each)
    offset = random.randrange(filesize//line_size)  #pick n'th line randomly
    fin.seek(offset*line_size) #seek to position of n'th line
    three_floats_bytes = fin.read(line_size)
    three_floats = struct.unpack('3f',three_floats_bytes)

If you're concerned about disk space and want to compress the data down using np.float16 (2-byte floats), you can do that too using the basic skeleton above: just substitute np.fromstring for struct.unpack and ndarray.tostring for struct.pack (with the appropriate data-type ndarray of course; line_size would then drop to 6). In current NumPy releases these two calls are deprecated in favour of np.frombuffer and ndarray.tobytes.
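
A minimal sketch of that float16 variant, assuming three values per line and using the non-deprecated `tobytes` / `np.frombuffer` pair. The temporary path and the in-memory `rows` list are stand-ins for the real files.

```python
import os
import random
import tempfile

import numpy as np

# Stand-in for the parsed CSV lines.
rows = [[1.3, 0, 1.0], [20.0, 3.2, 0], [30.5, 5.0, 5.2]]
line_size = 3 * np.dtype(np.float16).itemsize  # 3 values * 2 bytes = 6

# Hypothetical output location for this example.
path = os.path.join(tempfile.mkdtemp(), 'output.bin')
with open(path, 'wb') as fout:
    for row in rows:
        fout.write(np.array(row, dtype=np.float16).tobytes())

with open(path, 'rb') as fin:
    n = random.randrange(os.path.getsize(path) // line_size)  # pick n'th line
    fin.seek(n * line_size)
    three_floats = np.frombuffer(fin.read(line_size), dtype=np.float16)

print(three_floats)
```

Keep in mind that float16 holds only about three significant decimal digits, so values such as 3.2 come back slightly rounded.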

edited Oct 9, 2012 at 12:10

answered Oct 9, 2012 at 12:03

mgilson


4 Comments

jbssm Over a year ago

Thank you, you helped me a lot. By combining your example with the one I mentioned in the question about NumPy, I was able to do what I wanted in NumPy. I think I'll add it below.

jbssm Over a year ago

Just to say that this code doesn't work. I get the error: lineOut = struct.pack('3f', *(float(x) for x in line.split(','))) struct.error: pack requires exactly 3 arguments

mgilson Over a year ago

@jbssm -- Then it looks like you have a record with more (or less) than 3 elements in it.

jbssm Over a year ago

I checked what is wrong. I have like 10 elements, so I use something like out_line = struct.pack('10f',*(float(x) for x in line.split(','))) instead. Thank you.
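
One way to avoid hard-coding the field count is to take it from the first line and reject any line that disagrees. This is only a sketch: the file names, the sample data, and the choice to raise `ValueError` are assumptions made for illustration.

```python
import os
import struct
import tempfile

# Hypothetical sample input with 4 fields per line.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'input.txt')
dst = os.path.join(workdir, 'output.bin')
with open(src, 'w') as f:
    f.write('1.3, 0, 1.0, 7\n20.0, 3.2, 0, 8\n')

record = None   # struct.Struct built from the first line
n_fields = None
with open(src) as fin, open(dst, 'wb') as fout:
    for lineno, line in enumerate(fin, 1):
        values = [float(x) for x in line.split(',')]
        if record is None:
            n_fields = len(values)
            record = struct.Struct('<%df' % n_fields)
        if len(values) != n_fields:
            raise ValueError('line %d has %d fields, expected %d'
                             % (lineno, len(values), n_fields))
        fout.write(record.pack(*values))

print(n_fields, record.size)
```

`record.size` then gives the line size to use when seeking in the binary file.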

Jon Clements · Accepted Answer · 2012-10-09 12:10:16Z

0

You'd have to play around with offsets depending on storage size, but:

import csv
import struct
import random

count = 0
with open('input.csv') as fin, open('input.dat', 'wb') as fout:
    csvin = csv.reader(fin)
    for row in csvin:
        for col in map(float, row):
            fout.write(struct.pack('f', col))
            count += 1


with open('input.dat', 'rb') as fin:
    i = random.randrange(count)
    fin.seek(i * 4)
    print(struct.unpack('f', fin.read(4)))

answered Oct 9, 2012 at 12:10

Jon Clements


2 Comments

mgilson Over a year ago

This seeks to a random float, not a random line of floats. In other words, you're losing your "record" information here.

Jon Clements Over a year ago

@mgilson indeed - I just saw your answer - but it's pretty much the same, just adjust the line_size as you've called it
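
For reference, the adjustment discussed in these comments might look like the sketch below: count rows instead of individual floats, and seek in multiples of the row size. The three-column layout, the file names, and the sample data are assumptions.

```python
import csv
import os
import random
import struct
import tempfile

# Hypothetical sample input with 3 columns.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'input.csv')
dst = os.path.join(workdir, 'input.dat')
with open(src, 'w') as f:
    f.write('1.3, 0, 1.0\n20.0, 3.2, 0\n30.5, 5.0, 5.2\n')

n_cols = 3
row = struct.Struct('<%df' % n_cols)  # one fixed-size record per CSV row

rows = 0
with open(src, newline='') as fin, open(dst, 'wb') as fout:
    for fields in csv.reader(fin):
        fout.write(row.pack(*map(float, fields)))
        rows += 1

with open(dst, 'rb') as fin:
    i = random.randrange(rows)   # random row, not random float
    fin.seek(i * row.size)
    picked = row.unpack(fin.read(row.size))

print(picked)
```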

jbssm · Accepted Answer · 2012-10-10 16:24:20Z

So, using the examples provided by the helpful answers, I found a way to do it with NumPy, in case someone is interested:

import ast
import os
import random
import zipfile

import numpy

# this converts the file from text CSV to bin
with zipfile.ZipFile("input.zip", 'r') as inputZipFile:
    # the zip contains only 1 file
    with inputZipFile.open(inputZipFile.namelist()[0], 'r') as inputCSVFile, \
            open("output.bin", 'wb') as outFile:
        for line in inputCSVFile:
            # zip members are read as bytes, so decode before parsing
            lineParsed = ast.literal_eval(line.decode())
            lineOut = numpy.array(lineParsed, 'float16')
            lineOut.tofile(outFile)

# this reads a random line from the binary file
# (open with 'rb': opening with 'wb' would truncate the file)
with open("output.bin", 'rb') as file:
    lineSize = 20 # float16 has 2 bytes and there are 10 values
    fileSize = os.path.getsize("output.bin")

    offset = random.randrange(fileSize//lineSize)
    file.seek(offset * lineSize)
    random_line = file.read(lineSize)
    randomArr = numpy.frombuffer(random_line, dtype='float16')
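
Since the goal was to pick several random lines, an alternative worth considering is `numpy.memmap`, which maps the binary file as an array without loading it. This is a sketch of a different technique from the seek/read code above, and the generated file is stand-in data with 10 float16 values per line.

```python
import os
import tempfile

import numpy as np

n_cols = 10

# Stand-in binary file: 5 lines of 10 float16 values (hypothetical data).
path = os.path.join(tempfile.mkdtemp(), 'output.bin')
np.arange(50, dtype=np.float16).reshape(5, n_cols).tofile(path)

# Map the file read-only; nothing is read until rows are indexed.
data = np.memmap(path, dtype=np.float16, mode='r').reshape(-1, n_cols)

# Pick 3 distinct line numbers uniformly; sorting keeps disk access sequential.
rng = np.random.default_rng()
picks = np.sort(rng.choice(data.shape[0], size=3, replace=False))

# Copy only the chosen lines into an ordinary in-memory array.
sample = np.array(data[picks])
print(picks, sample.shape)
```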
