I have a dataset stored as a list of dicts. I want to loop through it and extract the subset of rows whose `'City'` value matches an entry in a separate list of values.
I am currently doing this with two separate `for x in y` loops, as shown in the sample code below, but I'm sure this is very inefficient, and it takes an extremely long time with large lists.
Example data (stored as CSV, shown here as a table):
╔══════════════╦══════════════╦═══════════════╦════════════════╦═════════════════════════╗
║ City         ║ State        ║ 2013 Estimate ║ 2013 Land Area ║ 2013 Population Density ║
╠══════════════╬══════════════╬═══════════════╬════════════════╬═════════════════════════╣
║ New York     ║ New York     ║ 8405837       ║ 302.6 sq mi    ║ 27012 per sq mi         ║
║ Los Angeles  ║ California   ║ 3884307       ║ 468.7 sq mi    ║ 8092 per sq mi          ║
║ Chicago      ║ Illinois     ║ 2718782       ║ 227.6 sq mi    ║ 11842 per sq mi         ║
║ Houston      ║ Texas        ║ 2195914       ║ 599.6 sq mi    ║ 3501 per sq mi          ║
║ Philadelphia ║ Pennsylvania ║ 1553165       ║ 134.1 sq mi    ║ 11379 per sq mi         ║
║ Phoenix      ║ Arizona      ║ 1513367       ║ 516.7 sq mi    ║ 2798 per sq mi          ║
║ San Antonio  ║ Texas        ║ 1409019       ║ 460.9 sq mi    ║ 2880 per sq mi          ║
║ San Diego    ║ California   ║ 1355896       ║ 325.2 sq mi    ║ 4020 per sq mi          ║
║ Dallas       ║ Texas        ║ 1257676       ║ 340.5 sq mi    ║ 3518 per sq mi          ║
║ San Jose     ║ California   ║ 998537        ║ 176.5 sq mi    ║ 5359 per sq mi          ║
╚══════════════╩══════════════╩═══════════════╩════════════════╩═════════════════════════╝
Sample code:
# read data into a list of dicts
import csv

with open('data.csv', 'rb') as csv_file:
    data = list(csv.DictReader(csv_file))

# cities of interest to extract from the larger dataset
int_cities = ['New York', 'Houston', 'Phoenix', 'San Jose']

# loop through data, look for a match between row['City'] and int_cities,
# and append matching rows to int_cities_data
int_cities_data = []
for row in data:
    for city in int_cities:
        if row['City'] == city:
            int_cities_data.append(row)
As I said, this currently works, but it takes a very long time when I have to loop through ~2M rows in data and check each one for a match against another 50k entries in int_cities.
How can I make this more efficient?
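One direction I've been considering (only tested on a small sample like the one below) is turning int_cities into a set, so each membership test is a constant-time hash lookup instead of a scan of the whole list:

```python
# small stand-in for the loaded data (the real list has ~2M rows)
data = [
    {'City': 'New York', 'State': 'New York'},
    {'City': 'Chicago', 'State': 'Illinois'},
    {'City': 'Houston', 'State': 'Texas'},
]

# set membership tests are O(1) on average, vs O(m) for scanning a list
int_cities = set(['New York', 'Houston', 'Phoenix', 'San Jose'])

# single pass over data; no inner loop over int_cities needed
int_cities_data = [row for row in data if row['City'] in int_cities]
```
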
EDIT 2014-08-22 9:30 PM EST
I forgot that the data is too large to use csv.DictReader, so I have been reading it into a list of dicts with a comprehension instead (after removing the header row). I tried to modify that code so the filtering happens while the file is read; this is untested:

header = ['City', 'State', '2013 Estimate', '2013 Land Area', '2013 Population Density']

# build each row as a dict, then keep only the cities of interest
rows = (dict(zip(header, line.strip().split(','))) for line in open('data.csv'))
int_cities_data = [row for row in rows if row['City'] in int_cities]
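Combining the two ideas, I believe filtering against a set while streaming the file line by line would avoid both the inner loop and holding all ~2M rows in memory. A sketch, using an in-memory stand-in for data.csv (the real file would be read the same way, with the header row already removed):

```python
import io

header = ['City', 'State', '2013 Estimate', '2013 Land Area', '2013 Population Density']
int_cities = set(['New York', 'Houston', 'Phoenix', 'San Jose'])

# stand-in for open('data.csv') -- header row already removed
csv_file = io.StringIO(
    'New York,New York,8405837,302.6 sq mi,27012 per sq mi\n'
    'Chicago,Illinois,2718782,227.6 sq mi,11842 per sq mi\n'
    'San Jose,California,998537,176.5 sq mi,5359 per sq mi\n'
)

# one streaming pass: test the first field (City) against the set,
# and only build dicts for the rows we actually keep
int_cities_data = [
    dict(zip(header, line.strip().split(',')))
    for line in csv_file
    if line.split(',', 1)[0] in int_cities
]
```
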