I have about 30 data files and need to extract the 4th, 5th, and 6th columns, then skip 14 columns, grab the next 3 columns, and so on until the end of the file. Each data file is about 400 rows and 17000 columns. So far I have this:
    import glob
    import numpy as np

    # Concatenate every .dat file into one big result.dat
    file_list = glob.glob('*.dat')
    with open("result.dat", "wb") as outfile:
        for f in file_list:
            with open(f, "rb") as infile:
                outfile.write(infile.read())

    # Load the combined file and slice out the repeating column groups
    arr = np.loadtxt('result.dat')
    a = arr[:, 4:-1:17]
    b = arr[:, 5:-1:17]
    c = arr[:, 6:-1:17]
This writes a file called result.dat containing all of the data from the individual files, and then I extract the columns I need. However, building the array takes a long time because result.dat also contains all of the information I do not need. Is there a way to read in only the specific columns I am interested in, instead of everything, so the time is cut down significantly?
Is writing result.dat slow, or just reading it? Experiment with the usecols parameter of loadtxt. loadtxt also reads the file one line at a time, splits it, collects the requested columns, and saves it all in a list of lists; only at the end does it turn everything into an array.
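A minimal sketch of that approach, assuming the column pattern described in the question (columns 4, 5, 6 of every 17-column block, matching the slicing in the posted code) and whitespace-delimited files; the names n_cols, wanted_cols, and per_file are just illustrative:

    import glob
    import numpy as np

    file_list = glob.glob('*.dat')

    # Columns 4, 5, 6 and every 17th column after each of them,
    # mirroring the arr[:, 4:-1:17] style slices in the question.
    n_cols = 17000  # assumed total number of columns per row
    wanted_cols = sorted(c for start in (4, 5, 6)
                           for c in range(start, n_cols, 17))

    # Read only the wanted columns from each file, then stack the rows.
    per_file = [np.loadtxt(f, usecols=wanted_cols) for f in file_list]
    arr = np.vstack(per_file)

This skips the intermediate result.dat entirely; whether it is actually faster depends on whether the bottleneck was the file concatenation or the parsing, so it is worth timing both variants on a single file first.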