
I've seen a few related posts that use the numpy module, but I need to use the csv module, which should be enough for this. A lot has been written here about the csv module, but I didn't quite find the answer I was looking for. Thanks so much in advance.

Essentially I have the following function/pseudocode (tab didn't copy over well...):

import csv

def copy(inname, outname):
    infile = open(inname, "r")
    outfile = open(outname, "w")
    copying = False  # not copying yet

    # if the first string up to the first whitespace in the "name" column of a row
    # equals the first string up to the first whitespace in the "name" column of
    # the row directly below it AND the value in the "ID" column of the first row
    # does NOT equal the value in the "ID" column of the second row, copy these two
    # rows in full to a new table.

For example, if inname looks like this:

ID,NAME,YEAR,SPORTS_ALMANAC,NOTES

(first thousand rows)

1001,New York Mets,1900,ESPN

1002,New York Yankees,1920,Guiness

1003,Boston Red Sox,1918,ESPN

1004,Washington Nationals,2010 

(final large amount of rows until last row)

1231231231235,Detroit Tigers,1990,ESPN

Then I want my output to look like:

ID,NAME,YEAR,SPORTS_ALMANAC,NOTES

1001,New York Mets,1900,ESPN

1002,New York Yankees,1920,Guiness

Because "New" is the string up to the first whitespace in the NAME column of both rows, and the IDs are different. To be clear, I need the code to be as general as possible: a regular expression on "New" is not what I need, since the common first string could be any string. It doesn't matter what comes after the first whitespace (i.e. "Washington Nationals" and "Washington DC" should still give me a hit, as should the New York examples above).

I'm confused because in R you can write inname$name to easily look up values in a specific column. I tried writing my script in R first, but it got confusing, so I want to stick with Python.

  • sorry about that, just fixed it! Commented Aug 10, 2012 at 18:24
  • If you've fixed your own problem, and you think the fix would be valuable to the community, it would be really great if you wrote and accepted your own answer. Commented Aug 10, 2012 at 18:28
  • @zigg: I think he's referring to a formatting issue. Originally there were no commas given in the input data, making it seem like it would be a nuisance to parse. Commented Aug 10, 2012 at 18:31
  • Sorry for the confusion, I just fixed the formatting issue, not the problem. Commented Aug 10, 2012 at 18:33
  • I'm confused. Are you saying there are duplicate rows (you don't show any duplicates) and that you want to remove the duplicates? Or that there are many New York Yankees rows with different IDs and you want them all to have the same ID? Commented Aug 10, 2012 at 18:37

1 Answer


Does this do what you want (Python 3)?

import csv 

def first_word(value):
    return value.split(" ", 1)[0]

# inname and outname are the input and output file paths from the question.
with open(inname, "r") as infile:
    with open(outname, "w", newline="") as outfile:
        in_csv = csv.reader(infile)
        out_csv = csv.writer(outfile)

        column_names = next(in_csv)
        out_csv.writerow(column_names)

        id_index = column_names.index("ID")
        name_index = column_names.index("NAME")

        try:
            row_1 = next(in_csv)
            written_row = False

            for row_2 in in_csv:
                if first_word(row_1[name_index]) == first_word(row_2[name_index]) and row_1[id_index] != row_2[id_index]:
                    if not written_row:
                        out_csv.writerow(row_1)

                    out_csv.writerow(row_2)
                    written_row = True
                else:
                    written_row = False

                row_1 = row_2
        except StopIteration:
            # No data rows!
            pass

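For Python 2, keep the infile line as it is, drop newline="" from the call that opens the output file, and set the line terminator on the writer instead: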

with open(outname, "w") as outfile:
    in_csv = csv.reader(infile)
    out_csv = csv.writer(outfile, lineterminator="\n")

4 Comments

yup! works like a charm! (except I had to take out the newline="" since that gave me an error). Thanks so much!!! Also, how does the except StopIteration: pass work?
I've edited my answer. If there are no rows then next(in_csv) will raise StopIteration. I'm catching it only in case there are no data rows, and not if there is no header row (i.e. no rows at all).
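To illustrate the point about StopIteration, here is a minimal sketch. It assumes the CSV text is held in memory via io.StringIO, and the helper name read_header_and_first_row is hypothetical, not part of the answer above. It only shows that next() on a csv.reader with nothing left to read raises StopIteration.

```python
import csv
import io


def read_header_and_first_row(text):
    """Return (header, first_row); first_row is None when there are no data rows."""
    reader = csv.reader(io.StringIO(text))
    # Not caught here: this raises StopIteration if there are no rows at all.
    header = next(reader)
    try:
        first_row = next(reader)
    except StopIteration:
        # Header only, no data rows.
        first_row = None
    return header, first_row
```

As in the answer, only the "no data rows" case is caught; a completely empty input still raises StopIteration from the first next() call.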
Thanks so much for your help and helpful explanation. One more quick question if you don't mind: How would I tweak the code so that, instead of writing the rows we want to a new file, we add an extra column to the input file and flag these rows with a "1" in the new column?
You can add a new column by adding a new entry to the row (which is a Python list) before writing it out. You can't modify the input file in-place; you'll have to create a new file and then replace the old file with the new file.
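A rough sketch of that approach, under several assumptions: the file fits in memory, has no blank lines, and every row has as many fields as the header. The names flag_matches, flag_file and the FLAG column are hypothetical, chosen only for illustration. It appends a flag to each row, writes everything to a temporary file in the same directory, and then swaps that file in with os.replace.

```python
import csv
import os
import tempfile


def first_word(value):
    return value.split(" ", 1)[0]


def flag_matches(rows, id_index, name_index):
    """Return copies of rows with "1" appended if a row matches a neighbour, else ""."""
    flags = [""] * len(rows)
    for i in range(len(rows) - 1):
        a, b = rows[i], rows[i + 1]
        same_word = first_word(a[name_index]) == first_word(b[name_index])
        if same_word and a[id_index] != b[id_index]:
            flags[i] = flags[i + 1] = "1"
    return [row + [flag] for row, flag in zip(rows, flags)]


def flag_file(path):
    """Rewrite the CSV at path with an extra FLAG column (assumed column name)."""
    with open(path, "r", newline="") as infile:
        reader = csv.reader(infile)
        header = next(reader)
        rows = list(reader)  # assumes the whole file fits in memory

    flagged = flag_matches(rows, header.index("ID"), header.index("NAME"))

    # Write to a temporary file next to the original, then replace the original.
    directory = os.path.dirname(os.path.abspath(path))
    with tempfile.NamedTemporaryFile(
        "w", newline="", dir=directory, delete=False, suffix=".tmp"
    ) as tmp:
        writer = csv.writer(tmp)
        writer.writerow(header + ["FLAG"])
        writer.writerows(flagged)
    os.replace(tmp.name, path)
```

Writing the temporary file in the same directory keeps the final os.replace on one filesystem. For a file too large to hold in memory, the row-pair loop from the answer could be adapted to stream rows instead.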
