Regular Expression, Matrix, CSV in Python

Question

I've seen a few related posts about the numpy module, but I need to use the csv module, and it should work for this. While a lot has been written here on using the csv module, I didn't quite find the answer I was looking for. Thanks so much in advance.

Essentially I have the following function/pseudocode (tab didn't copy over well...):

import csv

def copy(inname, outname):
    infile = open(inname, "r")
    outfile = open(outname, "w")
    copying = False  # not copying yet

    # if the first string up to the first whitespace in the "name" column of a row
    # equals the first string up to the first whitespace in the "name" column of
    # the row directly below it AND the value in the "ID" column of the first row
    # does NOT equal the value in the "ID" column of the second row, copy these two
    # rows in full to a new table.

For example, if inname looks like this:

ID,NAME,YEAR,SPORTS_ALMANAC,NOTES

(first thousand rows)

1001,New York Mets,1900,ESPN

1002,New York Yankees,1920,Guiness

1003,Boston Red Sox,1918,ESPN

1004,Washington Nationals,2010

(final large amount of rows until last row)

1231231231235,Detroit Tigers,1990,ESPN

Then I want my output to look like:

ID,NAME,YEAR,SPORTS_ALMANAC,NOTES

1001,New York Mets,1900,ESPN

1002,New York Yankees,1920,Guiness

This is because "New" is the string up to the first whitespace in the "NAME" column of both rows, and the IDs are different. To be clear, I need the code to be as general as possible: a regular expression on "New" is not what I need, since the common first string could be any string. It also doesn't matter what comes after the first whitespace (i.e. "Washington Nationals" and "Washington DC" should still give me a hit, as should the New York examples above).

I'm confused because in R you can write inname$name to look up values in a specific column by name. I tried writing my script in R first, but it got confusing, so I want to stick with Python.

If you've fixed your own problem, and you think the fix would be valuable to the community, it would be really great if you wrote and accepted your own answer.
– Mattie B, Commented Aug 10, 2012 at 18:28
@zigg: I think he's referring to a formatting issue. Originally there were no commas given in the input data, making it seem like it would be a nuisance to parse.
– DSM, Commented Aug 10, 2012 at 18:31
Sorry for the confusion, I just fixed the formatting issue, not the problem.
– user1590499, Commented Aug 10, 2012 at 18:33
I'm confused. Are you saying there are duplicate rows (you don't show any duplicates) and that you want to remove the duplicates? Or that there are many New York Yankees rows with different IDs and you want them all to have the same ID?
– alan, Commented Aug 10, 2012 at 18:37

MRAB · Accepted Answer · 2012-08-10 19:26:39Z

Does this do what you want (Python 3)?

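import csv

def first_word(value):
    # Everything up to the first space, e.g. "New York Mets" -> "New".
    return value.split(" ", 1)[0]

# inname and outname are the input and output file paths from the question.
# In Python 3 the csv module expects files opened with newline="".
with open(inname, "r", newline="") as infile:
    with open(outname, "w", newline="") as outfile:
        in_csv = csv.reader(infile)
        out_csv = csv.writer(outfile)

        # Copy the header row and look up the columns by name.
        column_names = next(in_csv)
        out_csv.writerow(column_names)

        id_index = column_names.index("ID")
        name_index = column_names.index("NAME")

        try:
            row_1 = next(in_csv)
            # True if row_1 has already been written as part of a match.
            written_row = False

            # Compare each row (row_2) with the row directly above it (row_1).
            for row_2 in in_csv:
                if first_word(row_1[name_index]) == first_word(row_2[name_index]) and row_1[id_index] != row_2[id_index]:
                    if not written_row:
                        out_csv.writerow(row_1)

                    out_csv.writerow(row_2)
                    written_row = True
                else:
                    written_row = False

                row_1 = row_2
        except StopIteration:
            # No data rows!
            pass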

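For Python 2, where open() has no newline argument, open the output file like this instead and set the line terminator on the writer: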

with open(outname, "w") as outfile:
    in_csv = csv.reader(infile)
    out_csv = csv.writer(outfile, lineterminator="\n")
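To try the approach without touching real files, here is a minimal Python 3 sketch that wraps the same logic in a function and runs it on the sample rows from the question. The function name copy_matching_rows and the use of io.StringIO in place of files are assumptions made for illustration; they are not part of the original answer.

```python
import csv
import io


def first_word(value):
    # Everything up to the first space, e.g. "New York Mets" -> "New".
    return value.split(" ", 1)[0]


def copy_matching_rows(infile, outfile):
    """Copy the header, plus every run of adjacent rows whose NAME starts
    with the same word but whose ID values differ (hypothetical helper)."""
    in_csv = csv.reader(infile)
    out_csv = csv.writer(outfile, lineterminator="\n")

    column_names = next(in_csv, None)
    if column_names is None:
        return  # completely empty input, nothing to copy
    out_csv.writerow(column_names)
    id_index = column_names.index("ID")
    name_index = column_names.index("NAME")

    previous = None
    previous_written = False
    for row in in_csv:
        if (previous is not None
                and first_word(previous[name_index]) == first_word(row[name_index])
                and previous[id_index] != row[id_index]):
            if not previous_written:
                out_csv.writerow(previous)
            out_csv.writerow(row)
            previous_written = True
        else:
            previous_written = False
        previous = row


# Sample data taken from the question.
sample = io.StringIO(
    "ID,NAME,YEAR,SPORTS_ALMANAC,NOTES\n"
    "1001,New York Mets,1900,ESPN\n"
    "1002,New York Yankees,1920,Guiness\n"
    "1003,Boston Red Sox,1918,ESPN\n"
    "1004,Washington Nationals,2010\n"
)
result = io.StringIO()
copy_matching_rows(sample, result)
print(result.getvalue())
```

On this sample, only the header and the two New York rows end up in the output. Note that a blank line in the input would be read as an empty row and make the column lookups fail, so this sketch assumes the file has no blank lines.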

edited Aug 10, 2012 at 19:26

answered Aug 10, 2012 at 19:04

MRAB


4 Comments

user1590499 Over a year ago

yup! works like a charm! (except I had to take out the newline="" since that gave me an error). Thanks so much!!! Also, how does the except StopIteration: pass work?

MRAB Over a year ago

I've edited my answer. If there are no rows then next(in_csv) will raise StopIteration. I'm catching it only in case there are no data rows, and not if there is no header row (i.e. no rows at all).
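A small sketch of the behavior described in this comment, using io.StringIO as a stand-in for an empty file (an assumption for illustration): calling next() on an exhausted reader raises StopIteration, while a for loop simply ends, and passing a default to next() avoids the exception altogether.

```python
import csv
import io

# A reader over empty input has no rows to return.
empty_reader = csv.reader(io.StringIO(""))
try:
    next(empty_reader)
    raised = False
except StopIteration:
    raised = True

# A for loop over an exhausted reader just runs zero times.
loop_count = 0
for _row in csv.reader(io.StringIO("")):
    loop_count += 1

# Supplying a default value makes next() return it instead of raising.
header = next(csv.reader(io.StringIO("")), None)

print(raised, loop_count, header)
```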

user1590499 Over a year ago

Thanks so much for your help and helpful explanation. One more quick question if you don't mind: How would I tweak the code such that instead of writing the rows we want to a new file, we instead add an extra column to the input file and flag these rows with a "1" in the new column, instead?

MRAB Over a year ago

You can add a new column by adding a new entry to the row (which is a Python list) before writing it out. You can't modify the input file in-place; you'll have to create a new file and then replace the old file with the new file.
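A rough Python 3 sketch of what the last comment describes: append a flag entry to each row, write everything to a new file, then replace the original. The function name flag_matching_rows, the column name "FLAG", and the choice to read the whole file into memory are all assumptions made here for brevity; for a very large file, a streaming version that buffers one row at a time would be a better fit.

```python
import csv
import os
import tempfile


def first_word(value):
    # Everything up to the first space, e.g. "New York Mets" -> "New".
    return value.split(" ", 1)[0]


def flag_matching_rows(path, flag_column="FLAG"):
    """Add a flag column to the CSV at `path`, marking matching rows with "1"
    (hypothetical helper; loads all rows into memory)."""
    with open(path, "r", newline="") as infile:
        rows = list(csv.reader(infile))
    if not rows:
        return  # empty file, nothing to flag
    header, data = rows[0], rows[1:]
    id_index = header.index("ID")
    name_index = header.index("NAME")

    # Flag both rows of every adjacent pair that shares a first word in NAME
    # but has different ID values.
    flags = [""] * len(data)
    for i in range(1, len(data)):
        prev, cur = data[i - 1], data[i]
        if (first_word(prev[name_index]) == first_word(cur[name_index])
                and prev[id_index] != cur[id_index]):
            flags[i - 1] = flags[i] = "1"

    # Write to a temporary file in the same directory, then swap it in.
    directory = os.path.dirname(os.path.abspath(path))
    fd, temp_path = tempfile.mkstemp(dir=directory, suffix=".csv")
    with os.fdopen(fd, "w", newline="") as outfile:
        out_csv = csv.writer(outfile)
        out_csv.writerow(header + [flag_column])
        for row, flag in zip(data, flags):
            # Pad short rows so the flag lands in the new column.
            padding = [""] * (len(header) - len(row))
            out_csv.writerow(row + padding + [flag])
    os.replace(temp_path, path)


# Example call (hypothetical path):
# flag_matching_rows("teams.csv")
```

Writing to a temporary file first means the original is only replaced once the new file has been written completely.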
