
I have a dataset in the form of a list of dicts that I would like to loop through, extracting the subset of rows that match a value in another list of values.

I am currently doing this with two separate for x in y loops, as shown in the sample code below, but I'm sure this is very inefficient, and it takes an extremely long time on large lists.

example data in CSV format:

╔══════════════╦══════════════╦═══════════════╦════════════════╦══════════════════════╗
║     City     ║    State     ║ 2013 Estimate ║ 2013 Land Area ║ 2013 Popular Density ║
╠══════════════╬══════════════╬═══════════════╬════════════════╬══════════════════════╣
║ New York     ║ New York     ║ 8405837       ║ 302.6 sq mi    ║ 27012 per sq mi      ║
║ Los Angeles  ║ California   ║ 3884307       ║ 468.7 sq mi    ║ 8092 per sq mi       ║
║ Chicago      ║ Illinois     ║ 2718782       ║ 227.6 sq mi    ║ 11842 per sq mi      ║
║ Houston      ║ Texas        ║ 2195914       ║ 599.6 sq mi    ║ 3501 per sq mi       ║
║ Philadelphia ║ Pennsylvania ║ 1553165       ║ 134.1 sq mi    ║ 11379 per sq mi      ║
║ Phoenix      ║ Arizona      ║ 1513367       ║ 516.7 sq mi    ║ 2798 per sq mi       ║
║ San Antonio  ║ Texas        ║ 1409019       ║ 460.9 sq mi    ║ 2880 per sq mi       ║
║ San Diego    ║ California   ║ 1355896       ║ 325.2 sq mi    ║ 4020 per sq mi       ║
║ Dallas       ║ Texas        ║ 1257676       ║ 340.5 sq mi    ║ 3518 per sq mi       ║
║ San Jose     ║ California   ║ 998537        ║ 176.5 sq mi    ║ 5359 per sq mi       ║
╚══════════════╩══════════════╩═══════════════╩════════════════╩══════════════════════╝

sample code:

#read data into list of dicts
import csv 
with open('data.csv', 'rb') as csv_file:
    data = list(csv.DictReader(csv_file))

# cities of interest to extract from larger data
int_cities = [['New York'],['Houston'],['Pheonix'],['San Jose']]

# loop through data and look for a match between i['City'] and int_cities; append matches to int_cities_data
int_cities_data = []
for i in data:
    for u in int_cities:
        if i['City'] == u:
            int_cities_data.append(i)

As stated, this currently works, but it takes a very long time when I have to loop through ~2M rows in data and check each one for a match against another 50k entries in int_cities.

How can I make this more efficient?

EDIT 2014-08-22 9:30 PM EST

I forgot to mention that the data is too large to use csv.DictReader, so I have been using the following to read my data into a list of dicts (after removing the header row):

This is untested

header = ['City','State','2013 Estimate','2013 Land Area','2013 Popular Density']
data = [{key: value for (key, value) in zip(header, line.strip().split(','))} for line in open('data.csv') if line['City'] in int_cities]

This is my attempt to modify the code I've been using to load the data into a list of dicts without csv.DictReader, so that it also filters on int_cities.
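
A corrected sketch of that approach (assuming the header row has already been removed from data.csv and that no field contains an embedded comma) would build the dict from each line first and only then filter on its 'City' key:

header = ['City', 'State', '2013 Estimate', '2013 Land Area', '2013 Popular Density']
int_cities = set(['New York', 'Houston', 'Phoenix', 'San Jose'])

# sketch only: parse each line into a dict, then test its 'City' value
int_cities_data = []
with open('data.csv') as f:
    for line in f:
        row = dict(zip(header, line.strip().split(',')))
        if row['City'] in int_cities:
            int_cities_data.append(row)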

1 Answer


Instead of reading all the data in the file into a list, then iterating over that list to search for the cities you want, iterate over the csv file one line at a time, and only add items to the list if they're for the cities you care about. That way you don't need to store the entire file in memory, and you don't need to iterate over it twice (once to build the complete list, then again to pull the entries you care about out of it).

Additionally, store the cities you care about in a set instead of a list, so you can do lookups in O(1) time, instead of O(n). This will likely drastically improve performance if you're doing lots of lookups (and it sounds like you are).

# read the csv and keep only the rows for the cities of interest
import csv 

int_cities = set(['New York', 'Houston', 'Phoenix', 'San Jose'])
int_cities_data = []
with open('data.csv', 'rb') as csv_file:
    for line in csv.DictReader(csv_file):
        if line['City'] in int_cities:
            int_cities_data.append(line)

Or as a list comprehension:

with open('data.csv', 'rb') as csv_file:
    int_cities_data = [line for line in csv.DictReader(csv_file) if line['City'] in int_cities]
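
If the matching rows themselves don't need to stay in memory, for example if the goal is just a smaller file to work with later, a variation on the same one-pass filter can stream each match straight to an output CSV. This is only an illustrative sketch, with filtered.csv as a made-up output filename:

import csv

int_cities = set(['New York', 'Houston', 'Phoenix', 'San Jose'])

# stream each matching row straight to a second file so nothing large is held in memory
with open('data.csv', 'rb') as csv_file, open('filtered.csv', 'wb') as out_file:
    reader = csv.DictReader(csv_file)
    writer = csv.DictWriter(out_file, fieldnames=reader.fieldnames)
    writer.writeheader()
    for line in reader:
        if line['City'] in int_cities:
            writer.writerow(line)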

11 Comments

line['City'] won't be in int_cities
@PadraicCunningham I tried this with his test data and it seemed to work ok. What do you mean?
If the OP's code was working, then if i['City'] == u: is comparing i['City'] to each of the sublists in int_cities, not to strings
@PadraicCunningham Ah, yeah, I noticed that, too. But the csv data he provided doesn't parse that way - line['City'] will just be a string, not a list. However, if it did parse to a list for some reason, using if line['City'][0] in int_cities: would work and allow int_cities to remain a set.
I tried running my edited code that does not use csv.DictReader since the data is too large for it, and I am getting the error TypeError: string indices must be integers, not str
