Hadoop is a powerful framework for distributed processing built around MapReduce jobs. Hadoop ships with Java APIs, but how do you use it with Python, Ruby, or other languages? As it turns out, you can, by using the streaming library Hadoop provides: any executable that reads lines from stdin and writes lines to stdout can act as a mapper or reducer (if you know Unix piping, this will feel familiar).

Set up

First, let’s upload a file to HDFS. If our file is a text file named “students.txt”, we can put it in a directory called “class”:

hadoop fs -mkdir -p class
hadoop fs -put students.txt class/students.txt
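
To confirm the upload landed where the job expects it, you can list the directory:

hadoop fs -ls class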

The file contains tab-separated records of name, assignment, and grade. Our goal is to calculate the average grade for each assignment:

name	a1	80
name	a2	40
name	a3	20
name2	a1	90
name2	a2	80
name3	a3	70

Next, we need to write two scripts: one to perform the map and one to perform the reduce. In Hadoop streaming, both are ordinary executables that read lines from stdin and write lines to stdout. By default, everything up to the first tab in an output line is treated as the key, and the framework sorts the mapper output by key before handing it to the reducer.

Map

#!/usr/bin/env python3
from sys import stdin

if __name__ == "__main__":
    # Emit one (assignment, grade) pair per input record; the
    # student name is dropped since we average per assignment.
    for line in stdin:
        name, assignment, grade = line.strip().split("\t")
        print(assignment + "\t" + grade)
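
To see what the reducer will receive, we can run the mapper over the sample file and sort its output the way the framework would:

cat students.txt | python3 map.py | sort

a1	80
a1	90
a2	40
a2	80
a3	20
a3	70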

Reduce

#!/usr/bin/env python3
from sys import stdin

if __name__ == "__main__":

    total = 0
    count = 0

    prev_assignment = None

    # The input arrives sorted by key, so all grades for one
    # assignment are contiguous; emit an average whenever the key
    # changes, then flush the final group after the loop.
    for line in stdin:
        assignment, grade = line.strip().split("\t")

        if prev_assignment is not None and assignment != prev_assignment:
            print(prev_assignment + " average: \t" + str(total / count))
            total = 0
            count = 0

        prev_assignment = assignment
        total += int(grade)
        count += 1

    if count > 0:
        print(prev_assignment + " average: \t" + str(total / count))
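
If you’d rather not track group boundaries by hand, an equivalent reducer can be written with itertools.groupby, which batches consecutive lines that share a key. That is a natural fit here because the input arrives sorted. This is just an alternative sketch, not part of the job above:

#!/usr/bin/env python3
from itertools import groupby
from sys import stdin

if __name__ == "__main__":
    # Each row is [assignment, grade]; groupby yields one batch per
    # run of identical keys, which covers a whole assignment since
    # the input is sorted.
    rows = (line.strip().split("\t") for line in stdin)
    for assignment, group in groupby(rows, key=lambda row: row[0]):
        grades = [int(grade) for _, grade in group]
        print(assignment + " average: \t" + str(sum(grades) / len(grades)))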

Execute

Now, we can run the following to execute our job (the shebang lines let the scripts run as commands; make sure both are executable with chmod +x map.py reduce.py):

hadoop jar /path/to/hadoop-streaming.jar -mapper map.py -reducer reduce.py -file map.py -file reduce.py -input class -output result
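
When the job finishes, the reducer output is written to the HDFS directory named by -output. The part file name below is the common default, though it can vary with configuration:

hadoop fs -cat result/part-*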

Alternatively, since the scripts just read stdin and write stdout, you can test the whole pipeline locally with Unix pipes:

cat students.txt | python3 map.py | sort | python3 reduce.py

a1 average: 	85.0
a2 average: 	60.0
a3 average: 	45.0