Removing rows and columns in python CSV module

Question

I promise I searched and read several pages of google before I came to make this post. Due diligence has been done I swear.

I am trying to open a CSV file in python, read the file, make changes to it, and then write out a new file.

I got this far:

import csv
def water_data ():
    with open('aquastat.csv', 'r') as csv_file:
        csv_reader = csv.reader(csv_file)
        final_file_name = "final_water.data.csv"
        final_file = open(final_file_name,'w')
        csv_writer = csv.writer(final_file,delimiter="\t")
        for row in csv_reader:
            csv_writer.writerow(row)

But I'm struggling to get any further. I want to remove certain columns, but I cannot comprehend how python will know the difference between a row and a column. For example, the columns are Area, Area ID, Year, Value, etc. I only want Area, Year, Value. I tried

for row in final_file:

final_file.writerow(row[0] + row[2] + row[4] + row[5])

but I kept getting the following error: IndexError: list index out of range

[I would also like to replace blank cells with a *, but the column thing is the priority]

Note that I cannot use Pandas

If possible I would really appreciate if someone could not just tell me the code but explain it to me so I can figure it out further myself.

TLDR: How can I remove empty rows from the CVS file and write only certain columns into the new file?

INPUT:

"Area","Area Id","Variable Name","Variable Id","Year","Value","Symbol","Md" 
"Afghanistan",2,"Total area of the country",4100,1977,65286.0,"E","",""
"Afghanistan",2,"Total area of the country",4100,1982,65286.0,"E","","" 
"Afghanistan",2,"Total area of the country",4100,1987,65286.0,"E","",""
"Afghanistan",2,"Total area of the country",4100,1992,65286.0,"E","","" 
"Afghanistan",2,"Total area of the country",4100,1997,65286.0,"E","",""
"Afghanistan",2,"Total area of the country",4100,2002,65286.0,"E","",""

"Area","Area Id","Variable Name","Variable Id","Year","Value","Symbol","Md" "Afghanistan",2,"Total area of the country",4100,1977,65286.0,"E","","" "Afghanistan",2,"Total area of the country",4100,1982,65286.0,"E","","" "Afghanistan",2,"Total area of the country",4100,1987,65286.0,"E","","" "Afghanistan",2,"Total area of the country",4100,1992,65286.0,"E","","" "Afghanistan",2,"Total area of the country",4100,1997,65286.0,"E","","" "Afghanistan",2,"Total area of the country",4100,2002,65286.0,"E","","" — Ashley F
– Ashley F, Commented Feb 15, 2019 at 3:45
I'm not sure. I am just trying to clean up the data. i would like the data to be the same but without the area ID column, or without empty rows — Ashley F
– Ashley F, Commented Feb 15, 2019 at 3:54
@AshleyF do you need to do it by Python ? otherwise, you could use a simple bash command such as cat File.csv | cut -d, -f1,5,6 — user324810
– user324810, Commented Feb 15, 2019 at 4:13

Allan · Accepted Answer · 2019-02-15 04:10:02Z

I have tried to provide you an answer as close as possible than what you have done so far.

Prototype:

import csv

with open('aquastat.csv', 'r') as csv_file:
  csv_reader = csv.reader(csv_file)
  final_file_name = "final_water.data.csv"
  final_file = open(final_file_name,'w')
  csv_writer = csv.writer(final_file,delimiter="\t")
  for row in csv_reader:
    if len(row) >= 6:
        row = [row[0], row[4], row[5]]
        csv_writer.writerow(row)
  final_file.close()

explanations:

Before the line csv_writer.writerow(row) where you output the row in the output csv file. I have added the line row = [row[0], row[4], row[5]] where I overwrite the content of the array row by an array containing only 3 cells, those cells are respectively taken from the Area, Year, Value columns
On top of this, I have added a the if condition if len(row) >= 6: to check that you have at least enough elements in your row to extract the columns until Value.

input:

"Area","Area Id","Variable Name","Variable Id","Year","Value","Symbol","Md"
"Afghanistan",2,"Total area of the country",4100,1977,65286.0,"E","",""
"Afghanistan",2,"Total area of the country",4100,1982,65286.0,"E","",""
"Afghanistan",2,"Total area of the country",4100,1987,65286.0,"E","",""
"Afghanistan",2,"Total area of the country",4100,1992,65286.0,"E","",""
"Afghanistan",2,"Total area of the country",4100,1997,65286.0,"E","",""
"Afghanistan",2,"Total area of the country",4100,2002,65286.0,"E","",""

output:

Area    Year    Value
Afghanistan     1977    65286.0
Afghanistan     1982    65286.0
Afghanistan     1987    65286.0
Afghanistan     1992    65286.0
Afghanistan     1997    65286.0
Afghanistan     2002    65286.0

Benoît P · Accepted Answer · 2019-02-15 03:51:13Z

0

This line will not IndexError and will write the line ignoring non existing values:

final_file.writerow((row[i] for i in (0,2,5) if i<len(row)))

This line will not IndexError and will write the line replacing empty values with a star:

final_file.writerow((row[i] if i<len(row) else "*" for i in (0,2,5)))

This line will not IndexError either but won't write the line:

if len(row)>5: final_file.writerow((row[i] for i in (0,2,5)))

This line will not IndexError either but won't write any line at all:

pass

edited Feb 15, 2019 at 3:51

answered Feb 15, 2019 at 3:42

Benoît P

3,27016 silver badges33 bronze badges

Comments

wwii · Accepted Answer · 2019-02-15 17:23:58Z

You can use a DictReader and DictWriter to selectively modify and write specific columns using their header/column names.

I'll use io.StringIO to simulate the files

s = '''"Area","Area Id","Variable Name","Variable Id","Year","Value","Symbol","Md" 
"Afghanistan",2,"Total area of the country",4100,1977,65286.0,"E","",""
"Afghanistan",2,"Total area of the country",4100,1982,65286.0,"E","","" 
"Afghanistan",2,"Total area of the country",4100,1987,65286.0,"E","",""
"Afghanistan",2,"Total area of the country",4100,1992,65286.0,"E","","" 
"Afghanistan",2,"Total area of the country",4100,1997,65286.0,"E","",""
"Afghanistan",2,"Total area of the country",4100,2002,65286.0,"E","",""'''

f = io.StringIO(s)
g = io.StringIO()

reader = csv.DictReader(f)
writer = csv.DictWriter(g, fieldnames=["Area","Variable Id","Value"], extrasaction='ignore')

for row in reader:
    #process row values?
    row['Value'] = float(row['Value']) / 1000
    writer.writerow(row)

Note that the DictWriter extrasaction parameter needs to be set to 'ignore' because there are extra keys/fields in the original.

If the csv file does not have a header row, you have to specify the field names for the DictWriter.

>>> g.seek(0)
0
>>> print(g.read())
Afghanistan,4100,65.286
Afghanistan,4100,65.286
Afghanistan,4100,65.286
Afghanistan,4100,65.286
Afghanistan,4100,65.286
Afghanistan,4100,65.286

Collectives™ on Stack Overflow

Removing rows and columns in python CSV module

3 Answers 3

Comments

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related