11

I want to do 10-fold cross-validation for huge files (running into hundreds of thousands of lines each). I want to do a "wc -l" each time I start reading a file, then generate random numbers a fixed number of times, each time writing that line number into a separate file. I am using this:

import os
for i in files:
    os.system("wc -l <insert filename>")

How do I insert the file name there? It's a variable. I went through the documentation, but the examples mostly use ls commands, which don't have this problem.

1 Comment

FYI, Google says 1 lakh == 100 000.

7 Answers

15

Let's compare:

from subprocess import check_output

def wc(filename):
    return int(check_output(["wc", "-l", filename]).split()[0])

def native(filename):
    c = 0
    with open(filename) as file:
        while True:
            chunk = file.read(10 ** 7)
            if chunk == "":
                return c
            c += chunk.count("\n")

def iterate(filename):
    with open(filename) as file:
        for i, line in enumerate(file):
            pass
        return i + 1

Go go timeit function!

from timeit import timeit
from sys import argv

filename = argv[1]

def testwc():
    wc(filename)

def testnative():
    native(filename)

def testiterate():
    iterate(filename)

print "wc", timeit(testwc, number=10)
print "native", timeit(testnative, number=10)
print "iterate", timeit(testiterate, number=10)

Result:

wc 1.25185894966
native 2.47028398514
iterate 2.40715694427

So, wc is about twice as fast on a 150 MB compressed file with ~500 000 line breaks, which is what I tested on. However, testing on a file generated with seq 3000000 > bigfile, I get these numbers:

wc 0.425990104675
native 0.400163888931
iterate 3.10369205475

Hey look, python FTW! However, using longer lines (~70 chars):

wc 1.60881590843
native 3.24313092232
iterate 4.92839002609

So the conclusion: it depends, but wc seems to be the best bet all round.
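
For the 10-fold setup described in the question, the count then just feeds a sampling step. A rough sketch in the same Python 2 style (write_fold_line_numbers, the fold count and the output naming are my own choices here; wc() is the helper defined above):

import random

def write_fold_line_numbers(filename, folds=10):
    n = wc(filename)                 # total line count, using the wc() helper above
    line_numbers = range(1, n + 1)   # 1-based line numbers (a list in Python 2)
    random.shuffle(line_numbers)     # randomly assign lines to folds
    for fold in range(folds):
        with open("%s.fold%d" % (filename, fold), "w") as out:
            # every folds-th shuffled line number goes into this fold's file
            for line_no in sorted(line_numbers[fold::folds]):
                out.write("%d\n" % line_no)

This writes disjoint folds; if you instead want a fixed number of independent random draws per file, random.sample over the same range works too.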


8
import subprocess
for f in files:
    subprocess.call(['wc', '-l', f])

Also have a look at http://docs.python.org/library/subprocess.html#convenience-functions; for example, if you want the output as a string, you'll want to use subprocess.check_output() instead of subprocess.call().
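
For example, something like this (line_count is just an illustrative name):

import subprocess

def line_count(filename):
    # wc -l prints "<count> <filename>"; the first field is the number of lines
    return int(subprocess.check_output(['wc', '-l', filename]).split()[0])

for f in files:
    print f, line_count(f)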

5 Comments

And this also gives me an error. It says: Traceback (most recent call last): File "../../scripts/gps_scripts/cross-validation.py", line 10, in <module> print subprocess.call(['wc','-l',f]) File "/usr/lib/python2.7/subprocess.py", line 486, in call return Popen(*popenargs, **kwargs).wait() File "/usr/lib/python2.7/subprocess.py", line 672, in __init__ errread, errwrite) File "/usr/lib/python2.7/subprocess.py", line 1213, in _execute_child raise child_exception TypeError: execv() arg 2 must contain only strings
@crazyaboutliv You passed it a file object instead of a file name.
one-liner to get the file line count in Python: int(subprocess.check_output(["wc", "-l", fname]).decode("utf8").split()[0])
@sudo, your one-liner works great on my Windows 7 box, but does not work on Windows 10. I get FileNotFoundError: [WinError 2] The system cannot find the file specified. Yet I can see the file clearly exists.
sudo's one-liner should be included in the accepted answer.
4

No need to use wc -l. Use the following Python function:

def file_len(fname):
    with open(fname) as f:
        i = 0  # so an empty file returns 0
        for i, l in enumerate(f, 1):
            pass
    return i

This is probably more efficient than calling an external utility (which loops over the input in a similar fashion).

Update

Dead wrong, wc -l is a lot faster!

$ seq 10000000 > huge_file

$ time wc -l huge_file 
10000000 huge_file

real    0m0.267s
user    0m0.110s
sys 0m0.010s

$ time ./p.py 
10000000

real    0m1.583s
user    0m1.040s
sys 0m0.060s

3 Comments

Depending on the size of the file it might be faster to use wc since it's written in C.
@ThiefMaster true, it's all about knowing your input
Yes, my files are 30 lakh (3 million) lines. I was thinking that counting this way would get slower.
3

os.system takes a string. Just build the string explicitly:

import os 
for i in files:
    os.system("wc -l " + i)

4 Comments

"Execute the command (a string) in a subshell." - I smell security holes if the file list comes from an untrusted source.
I agree, but os.system is a gaping security hole to begin with, for precisely that reason.
Guys, I need to keep this in deployment. This goes straight into production. Do you suggest sticking with enumerate then, even though it would take a tad longer?
This is giving me an error, btw. Here: wc: invalid option -- 'g' Try `wc --help' for more information. 256 (this repeats another 10-12 times). The code is the same as written above.
3

Here is a Python approach I found to solve this problem:

count_of_lines_in_any_textFile = sum(1 for l in open('any_textFile.txt'))
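
If you want the file closed deterministically (see the comments below), the same idea inside a with block is a small variation (count_lines is just an illustrative name):

def count_lines(path):
    # same generator-expression count, but the file is closed on exit
    with open(path) as f:
        return sum(1 for _ in f)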

2 Comments

The file isn't closed here, is it? Or will the Python garbage collector do that for you?
This causes a StopIteration error if you are using an enumeration method afterwards.
1

I found a much simpler way:

import os
linux_shell='more /etc/hosts|wc -l'
linux_shell_result=os.popen(linux_shell).read()
print(linux_shell_result)

1 Comment

While this code may answer the question, providing additional context regarding how and/or why it solves the problem would improve the answer's long-term value.
0

My solution is very similar to the “native” function by lazyr:

import functools

def file_len2(fname):
    with open(fname, 'rb') as f:
        lines = 0
        last_wasnt_nl = False  # so an empty file doesn't raise NameError
        reader = functools.partial(f.read, 131072)
        for datum in iter(reader, ''):
            lines += datum.count('\n')
            last_wasnt_nl = datum[-1] != '\n'
        return lines + last_wasnt_nl

This, unlike wc, considers a final line not ending with '\n' as a separate line. If one wants the same functionality as wc, then it can be (quite unpythonically :) written as:

import functools as ft, itertools as it, operator as op

def file_len3(fname):
    with open(fname, 'rb') as f:
        reader = ft.partial(f.read, 131072)
        counter = op.methodcaller('count', '\n')
        return sum(it.imap(counter, iter(reader, '')))

with times comparable to wc on all test files I produced.

NB: this applies to Windows and POSIX machines. Old Mac OS used '\r' as the line-end character.
