Megadata With Python and Hadoop

Processing MegadataWith Python and HadoopJuly 2010 TriHUGRyan Coxwww.asciiarmor.com

0029029070999991901010106004+64333+023450FM-12+000599999V0202701N015919999999N0000001N9-00781+99999102001ADDGF1089919999999999999999990029029070999991901010113004+64333+023450FM-12+000599999V0202901N008219999999N0000001N9-00721+99999102001ADDGF1049919999999999999999990029029070999991901010120004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00941+99999102001ADDGF1089919999999999999999990029029070999991901010206004+64333+023450FM-12+000599999V0201801N008219999999N0000001N9-00611+99999101831ADDGF1089919999999999999999990029029070999991901010213004+64333+023450FM-12+000599999V0201801N009819999999N0000001N9-00561+99999101761ADDGF1089919999999999999999990029029070999991901010220004+64333+023450FM-12+000599999V0201801N009819999999N0000001N9-00281+99999101751ADDGF1089919999999999999999990029029070999991901010306004+64333+023450FM-12+000599999V0202001N009819999999N0000001N9-00671+99999101701ADDGF1069919999999999999999990029029070999991901010313004+64333+023450FM-12+000599999V0202301N011819999999N0000001N9-00331+99999101741ADDGF1089919999999999999999990029029070999991901010320004+64333+023450FM-12+000599999V0202301N011819999999N0000001N9-00281+99999101741ADDGF1089919999999999999999990029029070999991901010406004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00331+99999102311ADDGF1089919999999999999999990029029070999991901010413004+64333+023450FM-12+000599999V0202301N008219999999N0000001N9-00441+99999102261ADDGF1089919999999999999999990029029070999991901010420004+64333+023450FM-12+000599999V0202001N011819999999N0000001N9-00391+99999102231ADDGF1089919999999999999999990029029070999991901010506004+64333+023450FM-12+000599999V0202701N004119999999N0000001N9+00001+99999101821ADDGF1049919999999999999999990029029070999991901010513004+64333+023450FM-12+000599999V0202701N002119999999N0000001N9+00061+99999102591ADDGF1049919999999999999999990029029070999991901010520004+64333+023450FM-12+000599999V0202301N004119999999N0000001N9+00001+99999102671ADDGF1049919999999999999999990029029070999991901010606004+64333+023450FM-12+000599999V0202701N006219999999N0000001N9+00061+99999102751ADDGF1039919999999999999999990029029070999991901010613004+64333+023450FM-12+000599999V0202701N006219999999N0000001N9+00061+99999102981ADDGF1009919999999999999999990029029070999991901010620004+64333+023450FM-12+000599999V0203201N002119999999N0000001N9-00111+99999103191ADDGF1009919999999999999999990029029070999991901010706004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00331+99999103341ADDGF1009919999999999999999990029029070999991901010713004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00501+99999103321ADDGF1009919999999999999999990029029070999991901010720004+64333+023450FM-12+000599999V0202001N009819999999N0000001N9-00441+99999103321ADDGF1009919999999999999999990029029070999991901010806004+64333+023450FM-12+000599999V0202301N009819999999N0000001N9-00281+99999103221ADDGF1089919999999999999999990029029070999991901010813004+64333+023450FM-12+000599999V0202301N011819999999N0000001N9-00331+99999103201ADDGF1089919999999999999999990035029070999991901010820004+64333+023450FM-12+000599999V0202301N013919999999N0000001N9-00331+99999102991ADDGF108991999999999999999999MW17010029029070999991901010906004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00501+99999102871ADDGF1089919999999999999999990029029070999991901010913004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00331+99999102661ADDGF1089919999999999999999990029029070999991901010920004+64333+023450FM-12+000599999V0201801N009819999999N0000001N9-00281+99999102391ADDGF1089919999999999999999990029029070999991901011006004+64333+023450FM-12+000599999V0202301N009819999999N0000001N9-00441+99999101601ADDGF1009919999999999999999990029029070999991901011013004+64333+023450FM-12+000599999V0202301N011819999999N0000001N9-00441+99999101481ADDGF1009919999999999999999990029029070999991901011020004+64333+023450FM-12+000599999V0202301N013919999999N0000001N9-00441+99999101381ADDGF1009919999999999999999990029029070999991901011106004+64333+023450FM-12+000599999V0202501N006219999999N0000001N9-00391+99999101061ADDGF1009919999999999999999990029029070999991901011113004+64333+023450FM-12+000599999V0202701N008219999999N0000001N9-00501+99999101141ADDGF1009919999999999999999990029029070999991901011120004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00441+99999101261ADDGF1009919999999999999999990029029070999991901011206004+64333+023450FM-12+000599999V0202701N004119999999N0000001N9-00391+99999101311ADDGF1049919999999999999999990029029070999991901011213004+64333+023450FM-12+000599999V0203201N008219999999N0000001N9-00331+99999102071ADDGF1039919999999999999999990029029070999991901011220004+64333+023450FM-12+000599999V0202901N009819999999N0000001N9-00221+99999102191ADDGF1009919999999999999999990029029070999991901011306004+64333+023450FM-12+000599999V0203201N004119999999N0000001N9+00001+99999101661ADDGF1009919999999999999999990029029070999991901011313004+64333+023450FM-12+000599999V0203201N008219999999N0000001N9-00061+99999102351ADDGF1009919999999999999999990029029070999991901011320004+64333+023450FM-12+000599999V0203201N004119999999N0000001N9-00171+99999102321ADDGF1009919999999999999999990029029070999991901011406004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00441+99999102721ADDGF1009919999999999999999990029029070999991901011413004+64333+023450FM-12+000599999V0202301N009819999999N0000001N9-00391+99999102551ADDGF1009919999999999999999990029029070999991901011420004+64333+023450FM-12+000599999V0202301N011819999999N0000001N9-00331+99999102261ADDGF1009919999999999999999990029029070999991901011506004+64333+023450FM-12+000599999V0202301N013919999999N0000001N9-00061+99999101831ADDGF1089919999999999999999990029029070999991901011513004+64333+023450FM-12+000599999V0202301N013919999999N0000001N9+00171+99999101541ADDGF1089919999999999999999990035029070999991901011520004+64333+023450FM-12+000599999V0202301N015919999999N0000001N9+00221+99999101321ADDGF108991999999999999999999MW1721~130 GB NCDC climate Dataset

high_temp=0forline inopen('1901'): line =line.strip() (year, temp, quality) = (line[15:19], line[87:92], line[92:93]) if(temp !="+9999"and quality in"01459"): high_temp=max(high_temp,float(temp)) printhigh_tempHow can we make this scale?( and do more interesting things )

JeffREY DEAN – Google - 2004“Our abstraction is in-spired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical record in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.”

defmapper(line): line =line.strip() (year, temp, quality) = (line[15:19], line[87:92], line[92:93]) if(temp !="+9999"and quality in"01459"): returnfloat(temp) returnNoneoutput =map(mapper,open('1901')) printreduce(max,output)Mapreduce in pure python

for line in sys.stdin:val = line.strip() (year, temp, q) = (val[15:19], val[87:92], val[92:93]) if (temp != "+9999" and re.match("[01459]", q)): print "%s\t%s" % (year, temp)mapper.py(last_key, max_val) = (None, 0)for line in sys.stdin: (key, val) = line.strip().split("\t") if last_key and last_key != key: print "%s\t%s" % (last_key, max_val) (last_key, max_val) = (key, int(val)) else: (last_key, max_val) = (key, max(max_val, int(val)))if last_key: print "%s\t%s" % (last_key, max_val)reduer.pycat dataFile | mapper.py | sort | reducer.pyHadoop Streaming

def mapper(key,value): line = value.strip() (year, temp, quality) = (line[15:19], line[87:92], line[92:93]) if (temp != "+9999" and quality in "01459"): yield year, int(temp)def reducer(key,values): yield key,max(values)if __name__ == "__main__": import dumbodumbo.run(mapper,reducer,reducer)Dumbo

Ability to pass around Python objects

Ability to use non-Java combiners

Built-in library of mappers / reducers

Excellent way to model MR algorithmsDumbo

CLI – API – Web ConsoleAmazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).Elastic Map REduce

Use Hadoop’s Job Tracker or Amazon’s ‘Debugger’Elastic Map REduce

CloudWatch metricsElastic Map REduce

Megadata With Python and Hadoop

More Related Content

What's hot

Similar to Megadata With Python and Hadoop

More from ryancox

Megadata With Python and Hadoop

Editor's Notes