Processing Megadata With Python and Hadoop
July 2010 TriHUG
Ryan Cox
www.asciiarmor.com
~130 GB NCDC climate dataset. Sample records, one observation per line:

0029029070999991901010106004+64333+023450FM-12+000599999V0202701N015919999999N0000001N9-00781+99999102001ADDGF108991999999999999999999
0029029070999991901010113004+64333+023450FM-12+000599999V0202901N008219999999N0000001N9-00721+99999102001ADDGF104991999999999999999999
0029029070999991901010120004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00941+99999102001ADDGF108991999999999999999999
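The records are fixed-width, so fields are read by character offset. A minimal sketch of that parsing, using the same offsets as the scripts in this deck; the names `Observation`, `parse_record` and `MISSING_TEMP` are illustrative, and the assumption that the NCDC format stores air temperature in tenths of a degree Celsius is stated in the comments:

```python
from typing import NamedTuple, Optional

MISSING_TEMP = "+9999"  # sentinel the deck's scripts treat as "no reading"


class Observation(NamedTuple):
    year: str
    temp_tenths: Optional[int]  # assumed unit: tenths of a degree Celsius
    quality: str


def parse_record(line: str) -> Observation:
    """Slice one fixed-width record using the offsets from the slides."""
    line = line.rstrip("\n")
    year, temp, quality = line[15:19], line[87:92], line[92:93]
    temp_tenths = None if temp == MISSING_TEMP else int(temp)
    return Observation(year, temp_tenths, quality)


# First sample record from the slide, split into pieces for readability.
sample = (
    "0029029070999991901010106004"
    "+64333+023450FM-12+000599999"
    "V0202701N015919999999N0000001N9"
    "-00781"
    "+99999102001ADDGF108991999999999999999999"
)

obs = parse_record(sample)
print(obs)  # year '1901', temp_tenths -78 (i.e. -7.8 degrees), quality '1'
```

Returning `None` for a missing reading keeps the sentinel from being mistaken for a real temperature of 999.9 degrees.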
    # Start below any real reading so that all-negative data still works.
    high_temp = float("-inf")
    for line in open('1901'):
        line = line.strip()
        (year, temp, quality) = (line[15:19], line[87:92], line[92:93])
        if temp != "+9999" and quality in "01459":
            high_temp = max(high_temp, float(temp))
    print(high_temp)

How can we make this scale? (and do more interesting things)
Jeffrey Dean – Google – 2004

“Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical record in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.”
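The three steps in the quote (map each record to key/value pairs, group values that share a key, reduce each group) can be sketched in a single process. This is only an illustration of the model, not of Hadoop itself; `map_reduce`, `emit_year_temp` and the toy `readings` list are hypothetical names and data:

```python
from collections import defaultdict
from functools import reduce


def map_reduce(records, map_fn, reduce_fn):
    """Run map, group by key, then reduce, all in one process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):  # map: record -> (key, value) pairs
            groups[key].append(value)      # group: collect values per key
    # reduce: fold each key's values down to a single result
    return {key: reduce(reduce_fn, values) for key, values in groups.items()}


# Hypothetical toy input: (year, temperature) tuples.
readings = [("1901", -78), ("1901", 22), ("1902", 0), ("1902", -11), ("1901", 17)]


def emit_year_temp(reading):
    year, temp = reading
    yield year, temp


result = map_reduce(readings, emit_year_temp, max)
print(result)  # {'1901': 22, '1902': 0}
```

Because each record is mapped independently and each key is reduced independently, both loops could be spread across machines, which is the parallelism the quote refers to.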
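MapReduce in pure Python

    from functools import reduce

    def mapper(line):
        line = line.strip()
        (year, temp, quality) = (line[15:19], line[87:92], line[92:93])
        if temp != "+9999" and quality in "01459":
            return float(temp)
        return None

    # Drop the records the mapper rejected before reducing.
    output = [t for t in map(mapper, open('1901')) if t is not None]
    print(reduce(max, output))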
Hadoop Streaming

mapper.py

    import re
    import sys

    for line in sys.stdin:
        val = line.strip()
        (year, temp, q) = (val[15:19], val[87:92], val[92:93])
        if temp != "+9999" and re.match("[01459]", q):
            print("%s\t%s" % (year, temp))

reducer.py

    import sys

    # Start below any real reading so negative maxima are reported correctly.
    (last_key, max_val) = (None, -sys.maxsize)
    for line in sys.stdin:
        (key, val) = line.strip().split("\t")
        if last_key and last_key != key:
            print("%s\t%s" % (last_key, max_val))
            (last_key, max_val) = (key, int(val))
        else:
            (last_key, max_val) = (key, max(max_val, int(val)))
    if last_key:
        print("%s\t%s" % (last_key, max_val))

cat dataFile | python3 mapper.py | sort | python3 reducer.py
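The map, sort, reduce pipeline can also be simulated in one Python process, which is handy for checking the logic before submitting a job. This is a rough sketch, not Hadoop Streaming itself: `map_lines`, `reduce_lines` and `fake_record` are hypothetical helpers, and `fake_record` pads with zeros simply to put the fields at the offsets the mapper expects.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper stage: emit 'year<TAB>temp' for each usable record."""
    for line in lines:
        val = line.strip()
        year, temp, q = val[15:19], val[87:92], val[92:93]
        if temp != "+9999" and q in ("0", "1", "4", "5", "9"):
            yield "%s\t%s" % (year, temp)


def reduce_lines(sorted_lines):
    """Reducer stage: input must be sorted so equal keys are adjacent."""
    def key_of(line):
        return line.split("\t")[0]

    for key, group in groupby(sorted_lines, key=key_of):
        yield "%s\t%d" % (key, max(int(l.split("\t")[1]) for l in group))


def fake_record(year, temp, quality):
    """Build a made-up line with year at [15:19], temp at [87:92], quality at [92]."""
    return "0" * 15 + year + "0" * 68 + temp + quality


records = [
    fake_record("1902", "-0011", "1"),
    fake_record("1901", "+0022", "1"),
    fake_record("1901", "+9999", "9"),  # missing reading, dropped by the mapper
    fake_record("1901", "-0078", "1"),
    fake_record("1902", "+0150", "2"),  # quality code not accepted, dropped
]

# sorted() plays the role of the `sort` step between mapper and reducer.
output = list(reduce_lines(sorted(map_lines(records))))
print(output)  # ['1901\t22', '1902\t-11']
```

The sort step matters: the reducer only compares each key with the previous one, so unsorted input would produce several partial results per year.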
Dumbo
    def mapper(key, value):
        line = value.strip()
        (year, temp, quality) = (line[15:19], line[87:92], line[92:93])
        if temp != "+9999" and quality in "01459":
            yield year, int(temp)

    def reducer(key, values):
        yield key, max(values)

    if __name__ == "__main__":
        import dumbo
        # The reducer is passed twice so that it also runs as the combiner.
        dumbo.run(mapper, reducer, reducer)
Ability to pass around Python objects
Job / Iteration Abstraction
Counter / Status Abstraction
Simplified Joining mechanism
Ability to use non-Java combiners
Built-in library of mappers / reducers
Excellent way to model MapReduce algorithms
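On combiners: the Dumbo example passes the reducer a second time to `dumbo.run`, which appears to reuse it as the combiner. That is safe for a maximum because taking the max of per-split maxima gives the same answer as taking the max over everything. A small sketch of that property in plain Python, with made-up data and a hypothetical helper `combine_max`:

```python
def combine_max(pairs):
    """Collapse (year, temp) pairs to one maximum per year."""
    best = {}
    for year, temp in pairs:
        if year not in best or temp > best[year]:
            best[year] = temp
    return sorted(best.items())


# Hypothetical output of two map tasks working on different input splits.
split_a = [("1901", -78), ("1901", 22), ("1902", 5)]
split_b = [("1901", 17), ("1902", -11), ("1902", 30)]

# Without a combiner, every pair travels to the reducer.
without_combiner = combine_max(split_a + split_b)

# With a combiner, each split is pre-reduced locally before the final reduce.
with_combiner = combine_max(combine_max(split_a) + combine_max(split_b))

assert without_combiner == with_combiner
print(with_combiner)  # [('1901', 22), ('1902', 30)]
```

An average does not have this property (a mean of per-split means is generally not the overall mean), so a reducer like that could not be reused as its own combiner without carrying counts along.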
Elastic MapReduce

CLI – API – Web Console

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).
Use Hadoop’s Job Tracker or Amazon’s ‘Debugger’
CloudWatch metrics
