Udacity + Cloudera MOOC

VirtualBox: create a new Red Hat (64-bit) Linux VM (>= 2 GB RAM) using the existing hard disk (1.7 GB)

Udacity Hadoop VM login: user training, password training

Map Reduce example 1 (MR_example1)

(Udacity course)

Purchase data beginning at 1/1/2012 and ending at 31/12/2012 (4138476 lines)

Data (purchases.txt) can be downloaded from

Test locally

Total of all purchases per city.

$ curl --output purchases.txt.gz
$ gzip -d purchases.txt.gz
# Test locally
$ more purchases.txt
2012-01-01      09:00   San Jose        Men's Clothing  214.05  Amex
2012-01-01      09:00   Fort Worth      Women's Clothing        153.57  Visa
$ wc -l purchases.txt
4138476 purchases.txt
$ head -50 purchases.txt > pur50.txt
$ cat pur50.txt | ./ | sort | ./
Anchorage 327.6
Aurora 117.81
$ cat purchases.txt | ./ | sort | ./
Albuquerque 10052311.42
Anaheim 10076416.36
$ time cat purchases.txt | ./ | sort | ./ > /dev/null
real 0m25.061s
user 0m29.226s
sys 0m1.113s
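
The `./ | sort | ./` pipeline above is the usual streaming pattern: a mapper emits one key/value pair per input line, `sort` groups identical keys together, and a reducer sums each group. A minimal Python sketch of that idea follows. It is not the course's own scripts. It assumes the input is tab-separated with six fields (date, time, city, item, cost, payment), and the function names and the third sample row are hypothetical.

```python
from itertools import groupby


def map_purchases(lines):
    """Mapper step: emit (city, cost) for each well-formed line."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 6:
            # Assumption: anything without exactly six fields is malformed.
            continue
        _date, _time, city, _item, cost, _payment = fields
        yield city, cost


def reduce_totals(sorted_pairs):
    """Reducer step: sum the costs of each run of identical cities.

    The input must already be sorted by city, which is what `sort`
    (locally) or the Hadoop shuffle (on the cluster) provides.
    """
    for city, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield city, sum(float(cost) for _, cost in group)


sample = [
    "2012-01-01\t09:00\tSan Jose\tMen's Clothing\t214.05\tAmex\n",
    "2012-01-01\t09:00\tFort Worth\tWomen's Clothing\t153.57\tVisa\n",
    "2012-01-01\t09:01\tSan Jose\tMusic\t10.00\tCash\n",  # hypothetical row
    "not a purchase line\n",                              # skipped by the mapper
]

# sorted() plays the role of `sort` between the two scripts.
totals = dict(reduce_totals(sorted(map_purchases(sample))))
for city, total in sorted(totals.items()):
    print(f"{city}\t{total:.2f}")
```

Because the reducer only ever looks at one run of equal keys at a time, it needs constant memory per key, which is why the sort step cannot be skipped.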

Test on Hadoop (virtual) cluster

Fire up the VM…

# localhost:50070  --> name node
# localhost:50030  --> job tracker
VM$ cd udacity_training
VM$ tree
VM$ hadoop fs -ls
# copy data from local storage to the Hadoop cluster
VM$ hadoop fs -mkdir myinput
# HDFS has no working directory, so there is no cd; relative paths such as myinput resolve against the user's HDFS home directory
# upload data to hdfs
VM$ hadoop fs -put data/purchases.txt myinput
VM$ hadoop fs -ls -h myinput
# execute map reduce job using alias hs
VM$ hs code/ code/ myinput output
# execute map reduce job with full command using jar
VM$ hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.1.1.jar -mapper code/ -reducer code/ -file code/ -file code/ -input myinput -output output1
# display results
VM$ hadoop fs -cat output1/part-00000
# copy results from the Hadoop cluster to the local machine
VM$ hadoop fs -get output1/part-00000 results.txt
VM$ more results.txt
# delete input and output data from the Hadoop cluster
VM$ hadoop fs -rm -r -f myinput
VM$ hadoop fs -rm -r -f output1

Map Reduce example 2 (MR_example2)

Web log processing example: count the number of hits for each web page resource.

Data (access_log.gz) can be downloaded from

$ curl --output access_log.gz
# test locally
$ gzip -d access_log.gz
# create a small subset of data for testing
$ head -50 access_log > log50.txt
$ cat log50.txt | ./ | sort | ./
# time the run on the full log
$ time cat access_log | ./ | sort | ./ > /dev/null
# export results to a file
$ cat access_log | ./ | sort | ./ > results.txt
# sort based on 2nd column, treat values of the 2nd column as numbers
$ sort -k 2n results.txt
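
The hit-count job has the same map, sort, reduce shape as the purchases example. The sketch below is a hedged illustration, not the course's scripts. It assumes each log line follows the Common Log Format, with the request in double quotes (for example `"GET /index.html HTTP/1.1"`). The function names and sample lines are hypothetical.

```python
from itertools import groupby


def map_hits(lines):
    """Mapper step: emit the requested resource of each parsable line."""
    for line in lines:
        parts = line.split('"')
        if len(parts) < 3:
            continue  # no quoted request section
        request = parts[1].split()
        if len(request) < 2:
            continue  # request is not "METHOD RESOURCE ..."
        yield request[1]


def reduce_hits(sorted_resources):
    """Reducer step: count each run of identical resources."""
    for resource, group in groupby(sorted_resources):
        yield resource, sum(1 for _ in group)


sample = [
    '10.0.0.1 - - [01/Jan/2012:09:00:00 -0800] "GET /index.html HTTP/1.1" 200 1024\n',
    '10.0.0.2 - - [01/Jan/2012:09:00:01 -0800] "GET /about.html HTTP/1.1" 200 512\n',
    '10.0.0.1 - - [01/Jan/2012:09:00:02 -0800] "GET /index.html HTTP/1.1" 304 0\n',
    "garbage line\n",
]

hits = list(reduce_hits(sorted(map_hits(sample))))

# Equivalent of `sort -k 2n`: order by the count, compared as a number.
by_count = sorted(hits, key=lambda pair: pair[1])
for resource, count in by_count:
    print(f"{resource}\t{count}")
```

Sorting on the integer count rather than on its text form matters for the same reason `-k 2n` needs the `n`: as strings, "10" would sort before "9".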