HADOOP - MAPREDUCE - HADOOP STREAMING
Udacity + Cloudera MOOC: https://www.udacity.com/course/intro-to-hadoop-and-mapreduce--ud617
VirtualBox: create a new Red Hat (64-bit) Linux VM (>= 2 GB RAM), attaching the existing virtual hard disk from http://content.udacity-data.com/courses/ud617/Cloudera-Udacity-Training-VM-4.1.1.c.zip (1.7 GB)
Udacity Hadoop VM login: user training, password training
MapReduce example 1 (MR_example1)
(Udacity course)
Purchase data from 1/1/2012 to 31/12/2012 (4138476 lines)
Data (purchases.txt) can be downloaded from http://content.udacity-data.com/courses/ud617/purchases.txt.gz
Test locally
Total of all purchases per city.
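The local pipeline below pushes the data through mapper.py and reducer.py, with Unix sort standing in for Hadoop's shuffle-and-sort phase between map and reduce. The course supplies both scripts (code/ in the VM); they are not reproduced here, but a minimal sketch, assuming tab-separated records of the form date, time, store, item, cost, payment, could look like this (the actual course scripts may differ):

#!/usr/bin/env python
# mapper.py (sketch only -- the actual course script may differ)
# Emit "store<TAB>cost" for every well-formed purchase record on stdin.
import sys

for line in sys.stdin:
    data = line.strip().split("\t")
    if len(data) == 6:
        date, time, store, item, cost, payment = data
        print("{0}\t{1}".format(store, cost))

#!/usr/bin/env python
# reducer.py (sketch only -- the actual course script may differ)
# Input arrives sorted by key, so all records for one store are adjacent;
# keep a running total and flush it whenever the key changes.
import sys

sales_total = 0
old_key = None

for line in sys.stdin:
    data = line.strip().split("\t")
    if len(data) != 2:
        continue
    this_key, this_sale = data
    if old_key and old_key != this_key:
        print("{0}\t{1}".format(old_key, sales_total))
        sales_total = 0
    old_key = this_key
    sales_total += float(this_sale)

if old_key is not None:
    print("{0}\t{1}".format(old_key, sales_total))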
$ curl http://content.udacity-data.com/courses/ud617/purchases.txt.gz --output purchases.txt.gz
$ gzip -d purchases.txt.gz
# Test locally
$ more purchases.txt
2012-01-01 09:00 San Jose Men's Clothing 214.05 Amex
2012-01-01 09:00 Fort Worth Women's Clothing 153.57 Visa
...
$ wc -l purchases.txt
4138476 purchases.txt
$ head -50 purchases.txt > pur50.txt
$ cat pur50.txt | ./mapper.py | sort | ./reducer.py
Anchorage 327.6
Aurora 117.81
...
$ cat purchases.txt | ./mapper.py | sort | ./reducer.py
Albuquerque 10052311.42
Anaheim 10076416.36
...
$ time cat purchases.txt | ./mapper.py | sort | ./reducer.py > /dev/null
real 0m25.061s
user 0m29.226s
sys 0m1.113s
Test on Hadoop (virtual) cluster
Fire up the VM…
# localhost:50070 --> NameNode web UI
# localhost:50030 --> JobTracker web UI
VM$ cd udacity_training
VM$ tree
VM$ hadoop fs -ls
# copy data from local storage to the Hadoop cluster
VM$ hadoop fs -mkdir myinput
# HDFS has no current directory (no cd); every command takes an explicit path (relative paths resolve against the user's HDFS home directory)
# upload the data to HDFS
VM$ hadoop fs -put data/purchases.txt myinput
VM$ hadoop fs -ls -h myinput
# run the MapReduce job using the hs alias defined in the VM (shorthand for the full streaming command below)
VM$ hs code/mapper.py code/reducer.py myinput output
# run the same job with the full command, using the Hadoop Streaming jar
VM$ hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.1.1.jar -mapper mapper.py -reducer reducer.py -file code/mapper.py -file code/reducer.py -input myinput -output output1
# display results
VM$ hadoop fs -cat output1/part-00000
# copy results from the Hadoop cluster to the local machine
VM$ hadoop fs -get output1/part-00000 results.txt
VM$ more results.txt
# delete input and output data from the Hadoop cluster
VM$ hadoop fs -rm -r -f myinput
VM$ hadoop fs -rm -r -f output1
MapReduce example 2 (MR_example2)
Web log processing example: count the number of hits for each web page resource.
Data (access_log.gz) can be downloaded from http://content.udacity-data.com/courses/ud617/access_log.gz.
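As in example 1, the course's mapper.py and reducer.py are used but not reproduced here. A minimal sketch, assuming Apache-style access-log lines where the requested resource is the 7th whitespace-separated field, could look like this (the actual course scripts may differ):

#!/usr/bin/env python
# mapper.py (sketch only -- assumes Apache Common Log Format; the actual course script may differ)
# Emit "resource<TAB>1" for every request line on stdin.
import sys

for line in sys.stdin:
    fields = line.split()
    if len(fields) > 6:
        resource = fields[6]   # field after the HTTP method inside the quoted request
        print("{0}\t1".format(resource))

#!/usr/bin/env python
# reducer.py (sketch only -- the actual course script may differ)
# Sum the 1s per resource; input is sorted, so one pass with a running count suffices.
import sys

hits = 0
old_key = None

for line in sys.stdin:
    data = line.strip().split("\t")
    if len(data) != 2:
        continue
    this_key, count = data
    if old_key and old_key != this_key:
        print("{0}\t{1}".format(old_key, hits))
        hits = 0
    old_key = this_key
    hits += int(count)

if old_key is not None:
    print("{0}\t{1}".format(old_key, hits))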
$ curl http://content.udacity-data.com/courses/ud617/access_log.gz --output access_log.gz
# test locally
$ gzip -d access_log.gz
# create a small subset of data for testing
$ head -50 access_log > log50.txt
$ cat log50.txt | ./mapper.py | sort | ./reducer.py
# time the full local run
$ time cat ./access_log | ./mapper.py | sort | ./reducer.py > /dev/null
# export results to a file
$ cat access_log | ./mapper.py | sort | ./reducer.py > results.txt
# sort by the 2nd column (hit count), treating its values as numbers
$ sort -k 2n results.txt