
Is there a way to take the output from subprocess and turn it into an iterable csv.reader or csv.DictReader object? Here's the code I've been trying:

p2 = subprocess.Popen("sort command...", stdout=subprocess.PIPE)
output = p2.communicate()[0]
edits = csv.reader(output, delimiter="\t")

Basically, I'm sorting a large CSV file, and then I'd like to get it into Python as a csv.reader object.

The error I'm getting is

Error: iterator should return strings, not int (did you open the file in text mode?)

Is there a way to treat this bytestream as a csv.reader object, or am I thinking about things the wrong way?

2
  • ...to be honest, I'd be very tempted to do something like 'sort command' | python pythonscript.py and just have the python script read from sys.stdin. Commented Jul 16, 2015 at 3:33
  • That's my plan for now, if I can't figure this out. :) Commented Jul 16, 2015 at 3:36
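The pipeline approach suggested in these comments could look roughly like the sketch below. It assumes tab-delimited input; the function name read_rows and the sample data are made up for illustration, and the StringIO object only stands in for sys.stdin so the snippet runs on its own.

```python
import csv
import io
import sys


def read_rows(stream=None, delimiter="\t"):
    """Parse delimited text from a stream (sys.stdin when none is given)."""
    if stream is None:
        stream = sys.stdin  # Python 3 opens stdin in text mode already
    return list(csv.reader(stream, delimiter=delimiter))


# Stand-in for data piped in from a command such as sort (hypothetical sample).
sample = io.StringIO("1\t2\t3\n4\t5\t6\n")
rows = read_rows(sample)
print(rows)
```

In a real script, calling read_rows() with no argument would read whatever was piped in on standard input.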

4 Answers

6

This is a Python 3 issue: the csv module needs text (str) input, not bytes. In addition, csv.reader() needs an iterable that yields lines, such as an open file or a list of strings; iterating over a bytes object yields integers, hence the error. Try this:

import csv
import subprocess

encoding = 'ascii'    # specify the encoding of the CSV data
p2 = subprocess.Popen(['sort', '/tmp/data.csv'], stdout=subprocess.PIPE)
output = p2.communicate()[0].decode(encoding)
edits = csv.reader(output.splitlines(), delimiter=",")
for row in edits:
    print(row)

If /tmp/data.csv contains (I've used commas as the separator):

1,2,3,4
9,10,11,12
a,b,c,d
5,6,7,8

then the output would be:

['1', '2', '3', '4']
['5', '6', '7', '8']
['9', '10', '11', '12']
['a', 'b', 'c', 'd']
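
To see where the original error message comes from, here is a small sketch that does not run any subprocess. The bytes literal is a made-up stand-in for what communicate()[0] returns, and ASCII is assumed as the encoding.

```python
import csv

# Hypothetical stand-in for the bytes returned by p2.communicate()[0].
raw = b"1\t2\t3\n4\t5\t6\n"

# Iterating over a bytes object yields integers, which csv.reader rejects.
first_item = next(iter(raw))

# Decoding and splitting into lines gives csv.reader an iterable of strings.
rows = list(csv.reader(raw.decode("ascii").splitlines(), delimiter="\t"))
print(first_item, rows)
```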

3

The following works for me (even though the docs warn about reading from stdout). Wrapping stdout with an io.TextIOWrapper() supports newlines embedded in the data for fields.

Doing this allows a generator to be used, which has the advantage of reading stdout incrementally, one line at a time.

import csv
import io
import os
import subprocess

p2 = subprocess.Popen(["sort", "tabbed.csv"], stdout=subprocess.PIPE)
output = io.TextIOWrapper(p2.stdout, newline=os.linesep)
edits = csv.reader((line for line in output), delimiter="\t")
for row in edits:
    print(row)

Output:

['1', '2', '3', '4']
['5', '6', '7', '8']
['9', '10', '11', '12']
['a', 'b\r\nx', 'c', 'd']

The tabbed.csv input test file contained this (where » represents a tab character and ≡ a newline character):

1»2»3»4
9»10»11»12
a»"b≡x"»c»d
5»6»7»8
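
The incremental reading described in this answer can be tried without a sort binary or an input file. The following is only a sketch under assumptions: the child process is a one-line Python program standing in for the sort command, UTF-8 is assumed as the encoding, and newline="" is used as the csv docs recommend.

```python
import csv
import io
import subprocess
import sys

# Hypothetical child process that prints two tab-separated lines;
# it stands in for the external sort command.
child_code = "print('1\\t2'); print('3\\t4')"

rows = []
with subprocess.Popen([sys.executable, "-c", child_code],
                      stdout=subprocess.PIPE) as proc:
    # Wrap the binary pipe so csv.reader receives text, one line at a time.
    text = io.TextIOWrapper(proc.stdout, encoding="utf-8", newline="")
    for row in csv.reader(text, delimiter="\t"):
        rows.append(row)  # each row is available as soon as its line arrives

print(rows, proc.returncode)
```

Leaving the with block closes the pipe and waits for the child, so proc.returncode is set by the time it is printed.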

5 Comments

You should pass a list instead of the string as a command on POSIX systems ("sort tabbed.csv" will fail). Also, to enable embedded newlines, you could use io.TextIOWrapper().
use newline="" to support files created on other systems (as csv docs recommend)
@J.F.Sebastian: I didn't because this isn't a file created on another system and doing so would not support embedded newlines.
newline=os.linesep does not make any sense compared to newline="". 1. csv creates files with \r\n regardless of the system: why would you use os.linesep? Anyway why would you actively reject files from other systems? 2. csv "reader is hard-coded to recognize either '\r' or '\n' as end-of-line" -- it is misleading to provide newline=os.linesep (universal newlines mode is off) here. 3. You should pass newline parameter to TextIOWrapper (to avoid corrupting embedded newlines). Note: newline='' means that universal newlines is on but newlines are passed untranslated.
@J.F.Sebastian: How csv creates files isn't relevant here, where a csv.reader is being used to read output being piped in from the stdout of another program running locally. I am passing a newline keyword argument to TextIOWrapper. Perhaps you're confusing its newline parameter with the csv module's Dialect.lineterminator which is only used by csv.writers. FWIW, all versions of my answer were actually run with the input described and produced the output shown before being posted.
1

To enable text mode, pass the universal_newlines=True parameter (Python 3.7 and later also accept the alias text=True):

#!/usr/bin/env python3
import csv
from subprocess import Popen, PIPE

with Popen(["sort", "a.csv"], stdout=PIPE, universal_newlines=True) as p:
    print(list(csv.reader(p.stdout, delimiter="\t")))

If you need to interpret newlines embedded inside quoted fields, wrap the pipe in io.TextIOWrapper so that you can pass newline='':

#!/usr/bin/env python3
import csv
import io
from subprocess import Popen, PIPE

with Popen(["sort", "a.csv"], stdout=PIPE) as p, \
     io.TextIOWrapper(p.stdout, newline='') as text_file:
    print(list(csv.reader(text_file, delimiter="\t")))

TextIOWrapper also lets you specify the character encoding explicitly (otherwise the default, locale.getpreferredencoding(False), is used).
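
The effect of the newline parameter can be checked in isolation. In this sketch an io.BytesIO object stands in for p.stdout, the sample bytes and the helper name read_rows are made up, and ASCII is assumed as the encoding.

```python
import csv
import io

# Hypothetical sample: one quoted field contains an embedded "\r\n".
data = b'a\t"b\r\nx"\tc\r\n1\t2\t3\r\n'


def read_rows(newline):
    """Parse the sample through a TextIOWrapper with the given newline mode."""
    text = io.TextIOWrapper(io.BytesIO(data), encoding="ascii", newline=newline)
    return list(csv.reader(text, delimiter="\t"))


translated = read_rows(None)    # universal newlines: "\r\n" becomes "\n"
untranslated = read_rows("")    # line endings are passed through unchanged

print(translated[0])
print(untranslated[0])
```

With newline=None the embedded line break inside the quoted field is rewritten to "\n"; with newline='' it survives as "\r\n". The row boundaries are the same in both cases.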

Note: you don't need the external sort command. You could sort the lines in pure Python:

#!/usr/bin/env python3
import csv

with open('a.csv', newline='') as text_file:
    rows = list(csv.reader(text_file, delimiter="\t"))
    rows.sort()
    print(rows)

Note: the latter version sorts CSV rows instead of physical lines (you could sort the lines instead if you want).
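
Sorting parsed rows also makes it possible to choose a sort key. The sketch below uses made-up sample data and assumes the first column holds integers; it contrasts the default string comparison with a numeric key.

```python
import csv
import io

# Hypothetical tab-separated sample; StringIO stands in for the opened file.
text_file = io.StringIO("10\ta\n9\tb\n1\tc\n", newline="")
rows = list(csv.reader(text_file, delimiter="\t"))

# Default ordering compares the fields as strings, so "10" sorts before "9".
by_string = sorted(rows)

# A key function sorts on the first column interpreted as an integer.
by_number = sorted(rows, key=lambda row: int(row[0]))

print(by_string)
print(by_number)
```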


0

This works if your CSV file has column headings.

[ user@system currentDir ]$ ./ProgramThatCreatesCSVData
first,second,third,fourth
1,2,3,4
9,10,11,12
a,b,c,d
5,6,7,8
[ user@system currentDir ]$
[ user@system currentDir ]$ cat CSVtoDict.py
#!/usr/bin/python3
"""Sample program to open a pipe to run a command.
That command generates a CSV with heading names in the first row.
Output of this program is a conversion of that CSV to a list of dictionaries,
in pprint format."""

import csv, pprint, subprocess, io

pipe = subprocess.Popen(["./ProgramThatCreatesCSVData"], stdout=subprocess.PIPE)
pipeWrapper = io.TextIOWrapper(pipe.stdout)
pipeReader = csv.DictReader(pipeWrapper)
listOfDicts = [ dict(row) for row in pipeReader ]

pprint.pprint(listOfDicts)

[ user@system currentDir ]$
[ user@system currentDir ]$ python3 CSVtoDict.py
[{'first': '1', 'fourth': '4', 'second': '2', 'third': '3'},
 {'first': '9', 'fourth': '12', 'second': '10', 'third': '11'},
 {'first': 'a', 'fourth': 'd', 'second': 'b', 'third': 'c'},
 {'first': '5', 'fourth': '8', 'second': '6', 'third': '7'}]
[ user@system currentDir ]$
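
If the data goes through sort first, as in the question, the heading line would be sorted along with the data rows. One possible workaround, sketched here under the assumption that the header has been removed before sorting, is to pass the column names to csv.DictReader explicitly. The sample text and the fieldnames list are illustrative, and StringIO stands in for the wrapped pipe.

```python
import csv
import io

# Column names supplied by hand because the sorted data has no header row.
fieldnames = ["first", "second", "third", "fourth"]

# Hypothetical stand-in for the text-mode pipe carrying sorted, header-less data.
sorted_output = io.StringIO("1,2,3,4\n5,6,7,8\n9,10,11,12\n", newline="")

reader = csv.DictReader(sorted_output, fieldnames=fieldnames)
records = [dict(row) for row in reader]
print(records)
```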
