Student Reviews
( 5 Of 5 )
1 review
Video of 8.1. Mapreduce Programming Writing MapReduce Code Using Java in Apache Hadoop course by CloudxLab Official channel, video No. 42 free certified online
In this session, we are going to learn how to write MapReduce programs using Java.
When we want to handle big data, instead of processing data using conventional way, we breakdown our logic into two parts: map and reduce. The map gets executed on all the machines having data and gives output in the form of key-value pairs. And reduce receives the output of all maps grouped by key. The grouping is done by mapreduce framework. Reduce can aggregate the values per key.
Let us learn how to write these map and reduce parts in Java. Why Java? Since Apache Hadoop is written in Java, you can modify the behaviour to a very large extent.
We are going to write a map-reduce job to count unique words in a plain text file big.txt stored in HDFS in /data/mr/wordcount/input/.
Let us first understand how to write mapper. We write a piece of logic which would break down the input record into key-value. The input record is also made up two parts - key and value.
The default inputformat is TextInputFormat which is used for handling plain text files. The textinput format creates a record for everyline in the input file. A record has two parts key and value. In case of TextInputput, the value is the content of line and the key is the location where line starts.
In the example in diagram, The raw data is converted to three records having values "This is a cow", "this is a buffalo" and "there is a hen". Since the first line "this is a cow" starts at location 0, the key is 0. For the line "there is a hen" starts at thirty forth byte or character, the key for the record is 34. Please let is not line number.
A class in java is a complex data type that can have methods and member variable in it too. A class is like is blue print. We create many objects of a class. We can create many objects our of stubmapper. Example of a class could be Person and "sandeep giri" is an instance or object of it. We will write code that defines class between these two curly brackets.
The name of this class could be any thing. Here the name of our class is StubMapper.
We need to create a class for mapper. It means that the behaviour of our class StubMapper would be inherited from Mapper. We will handover this stubmapper class to mapreduce framework and framework will create instance of it for each input split. Your mapper class must extend mapper class. It is mandatory from the framework side.
While extending the Mapper class, we can define the data types of input key, value and output key and value. Between the angle brackets, the first one signifies the Data type of Input Key. In our example, it is number of bytes at which the value is starting.
The second one is the Data type of input value. In our case, input value is each line, i.e. Text
The third datatype is of output key. We are going to give key as word, therefore it is Text
Forth one is for output value. We are going to give value as 1 therefore it is Long
Then we define the function with a name map() and put an override keyword on it. This function should return nothing that is void but has three arguments input key, input value and context. This is the function in which we write the logic to convert the input record into key-values. We will be writing key-values into context object. It is upto us how many and what key-values we give out.
So, the input record that is key and value along with the context is made available to you in this function, you can write whatever logic you want to generate key-values.
In our example where we have the InputFormat as TextInputFormat, the map function would be executed against every line of the file. And your code inside this map() function would be provided with each line as input value and the line's starting location in the input file as input key. And we would be writing the output key value pairs in the context object.
First, we split the input line by space or tabs into an array of strings. Then, for each of the words, we give out the word as key and numeric 1 as the value.
While giving out the key and the value to mapreduce via context, why are we writing "new Text(word)" and new LongWritable(1)?
The usual classes or types of Java to represent numbers, text and other objects were not efficient. So, MapReduce team designed their own classes called writables corresponding to most of the java primitive objects. So, while giving out the values to MapReduce we need to wrap the basic objects into the wrappers provided by Mapreduce framework. It is exactly like wrapping a gift for someone's birthday.
Let us know in the comments below if you find it helpful.
In order to claim the certificate from E&ICT Academy, IIT Roorkee, visit https://bit.ly/cxlyoutube
________
Website https://www.cloudxlab.com
Facebook https://www.facebook.com/cloudxlab
Instagram https://www.instagram.com/cloudxlab
Twitter http://www.twitter.com/cloudxlab