Monday, September 16, 2013

How to write a MapReduce program in Java

In this blog post I am going to show you how to write a MapReduce job, and how we can apply 3 different approaches to get the same output.

I did not have data to run the MapReduce job on, so I wrote a program to create sample data. The benefit of this approach is that I will be able to verify the output after running the MapReduce job.

What does my program do?
In my program I have taken the following 9 different strings:
                    "this is first string",
                    "got it.......................... yeah",
                    "sorry...........................not found",
                    "find me if you can......................",
                    "still not found......................",
                    "okay................try moreeeeee",
                    "work harder...........................",
                    "done..... not found",
                    "try and get it.................."

My program writes random combinations of these strings, each with a timestamp and the name of the class, until the total log size reaches the number of bytes you provide as input.

While writing these combinations to the log files, I count the occurrences of the 2nd string ("got it.......................... yeah") and the last string ("try and get it..................").
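
Just to make the idea concrete, the core of such a generator could be sketched as below. This is only an illustrative sketch of the logic described above, not the actual source (which is linked further below); the class name, file naming, and exact record layout are my own assumptions.

import java.io.FileWriter;
import java.io.IOException;
import java.sql.Timestamp;
import java.util.Random;

public class ProduceDataSketch {

    // The 9 strings listed above.
    private static final String[] STRINGS = {
            "this is first string",
            "got it.......................... yeah",
            "sorry...........................not found",
            "find me if you can......................",
            "still not found......................",
            "okay................try moreeeeee",
            "work harder...........................",
            "done..... not found",
            "try and get it.................."
    };

    public static void main(String[] args) throws IOException {
        String dir = args[0];                         // directory in which to create the data
        long bytesPerFile = Long.parseLong(args[1]);  // size of each file in bytes
        int numFiles = Integer.parseInt(args[2]);     // number of files to create
        Random random = new Random();
        long hits = 0;        // occurrences of the 2nd string
        long almostHits = 0;  // occurrences of the last string

        for (int f = 0; f < numFiles; f++) {
            FileWriter out = new FileWriter(dir + "/data_" + f + ".log");
            long written = 0;
            while (written < bytesPerFile) {
                String s = STRINGS[random.nextInt(STRINGS.length)];
                if (s.equals(STRINGS[1])) hits++;
                if (s.equals(STRINGS[8])) almostHits++;
                // Each record carries a timestamp and the name of the writing class.
                String record = new Timestamp(System.currentTimeMillis())
                        + " com.deepak.utils.ProduceData main " + s + "\n";
                out.write(record);
                written += record.length();
            }
            out.close();
        }
        System.out.println("Total number of hits: " + hits);
        System.out.println("Total number of almost hits: " + almostHits);
    }
}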

To create the sample data, open a terminal on your Linux machine and follow the steps below:

1. Run the command: wget https://www.dropbox.com/s/ifm9xcw6rh855ps/ProduceData.jar

Now you should have "ProduceData.jar" in the current directory. Change the permissions of the jar to make it executable if it is not already (command: chmod +x ProduceData.jar).

Note: alternatively, you can go to this URL and download the jar manually.

2. Now, to produce the data, run the command below (if executing the jar directly does not work on your system, use "java -jar ProduceData.jar" with the same arguments):
./ProduceData.jar <directory in which you want to create data> <size of each file in bytes> <number of files to be created>

For example:
./ProduceData.jar /home/deepak/Desktop/test 10000000 4

At the end of the command's execution the program prints a summary of the generated data.

Look at this summary carefully: its last 2 lines are the expected output of running the MapReduce job on the data.

In my example they are:

Total number of hits: 89044, i.e., the total count of the string "got it.......................... yeah" in all log files
Total number of almost hits: 88690, i.e., the total count of the string "try and get it.................." in all log files

So our job is to find the total count of the string "got it.......................... yeah" in all the log files, which is 89044 in my case.

You can also download the source code of my program from GitHub.

Now we have the data, and we need to write MapReduce jobs to find the count of the string mentioned above.

If you don't have a development environment, you can set one up by following my other blog post. Once your development environment is up, we are ready to write the MapReduce job.

To solve this particular problem I have thought of 3 different approaches, for the sake of showing Hadoop's capabilities. I will explain each approach below.

1st approach:
This is a very simple approach in which every line of every log file becomes one input record for the Mappers. Each map call reads one line and checks whether it contains the required string. This approach takes more time, because the number of records is very large and each record is very small. For those who are not familiar with input splits and records, see the explanation below.

Hadoop relies on the input format of the job to do three things:
1. Validate the input configuration for the job (i.e., check that the data is there).
2. Split the input blocks and files into logical chunks of type InputSplit, each of which is assigned to a map task for processing.
3. Create the RecordReader implementation used to build key/value pairs from the raw InputSplit. These pairs are sent one by one to their mapper.


A RecordReader uses the data within the boundaries created by the input split to generate key/value pairs. In the context of file-based input, the "start" is the byte position in the file where the RecordReader should start generating key/value pairs, and the "end" is where it should stop reading records. These are not hard boundaries as far as the API is concerned; there is nothing stopping a developer from reading the entire file for each map task. While reading the entire file is not advised, reading outside of the boundaries is often necessary to ensure that a complete record is generated.

2nd approach:
If you look closely you will see that each block written to the log starts with a timestamp, which follows a fixed pattern. We can define our records in such a way that each record starts with this date pattern. In this case we will need to define a custom RecordReader that is pattern-delimited.


3rd approach:
In this approach I will use a fixed delimiter. If you look at the logs, each block in the log file contains the string "com.deepak.utils.ProduceData main". We can use this string as a record delimiter, so that it acts as the starting point of each record.

There can be more approaches as well, but I explain only these 3 here, because they will give you a fair idea of how we can write MapReduce jobs.

I will explain the 1st approach here; the 2nd and 3rd approaches will be covered in a separate blog post.

1st approach explained with solution:

Please follow my other blog post to create a MapReduce project in the Eclipse IDE along with the Mapper and Reducer classes.

Your Mapper class will look like the example below.
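
This is a minimal sketch using the old org.apache.hadoop.mapred API that ships with Hadoop 1.2.1; the class name and the TARGET constant are illustrative names of my own.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class HitCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    // The string we are counting across all log files.
    private static final String TARGET = "got it.......................... yeah";
    private static final IntWritable ONE = new IntWritable(1);
    private final Text outKey = new Text(TARGET);

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // Each call receives one line of a log file; emit (TARGET, 1) on a match.
        if (value.toString().contains(TARGET)) {
            output.collect(outKey, ONE);
        }
    }
}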

Your Reducer class will look like the example below.
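
Again a minimal sketch using the old API; it simply sums up the 1s emitted by the Mapper for our string.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class HitCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // Sum all the counts emitted for this string.
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}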

Your Test class will look like the example below.
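
A minimal sketch of the driver, wiring the Mapper and Reducer together with the old JobConf/JobClient API. It takes the input directory (the generated logs) as the 1st argument and the output directory (which must not exist yet) as the 2nd.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class HitCountJob {

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(HitCountJob.class);
        conf.setJobName("hit-count");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(HitCountMapper.class);
        conf.setCombinerClass(HitCountReducer.class);  // safe here, since the reduce is a sum
        conf.setReducerClass(HitCountReducer.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // output directory

        JobClient.runJob(conf);
    }
}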

Once you have created the above classes and set up the Hadoop environment as suggested, you will be ready to run your MapReduce job.

To run the job, click on Run -> Run Configurations...
In the Arguments panel, put your arguments like this:
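
For example (the input path is the test directory created earlier; the output path is a hypothetical directory that must not exist yet):

/home/deepak/Desktop/test /home/deepak/Desktop/output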

Now click on Run.

If everything is fine, your MapReduce program will execute and produce output like the following on the Eclipse console.

Look at the last line of the output on the console, which says:
"13/09/16 16:40:16 INFO mapred.JobClient:     Map output records=89044"

You can also verify it from the file "part-00000" in the output directory given as the 2nd argument to your program, which contains:
"got it.......................... yeah 89044"

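For example, assuming the hypothetical output directory above:

cat /home/deepak/Desktop/output/part-00000
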
As you can see, this is consistent with the output produced by the ProduceData jar.

Thank you for following this post.

In the next blog post I will explain the 2nd and 3rd approaches for solving the same problem.

If you find any difficulty, or you want to suggest something to improve this post, please comment or write me an email.


Write and run MapReduce jobs using the Eclipse IDE!!

I have been working in the Hadoop field for the last 1.5 years, and I can see a future where everyone will be moving to Hadoop and related big data processing tools; it will suddenly become even more popular than it is now, and everyone will be connected to the field in some way. As a developer, I really feel that everyone, or at least every developer, should know how to write a MapReduce job and run it over a Hadoop cluster. I tried to find help for writing MapReduce jobs on the internet, but I found very little, and even that was incomplete.

So I felt an urge to write this blog post, so that everybody can learn this in an easy and fast manner.

In this example I am using the latest stable release of Hadoop, which is 1.2.1. If you are using a different version, it may not work.

Requirements to run this exercise:

  1. A Hadoop cluster (even a single-node cluster will work). I found Michael G. Noll's website very useful for setting up a single-node cluster. You can follow the steps on his site to set up your cluster for testing.
  2. It is very good practice to keep the source files of any open-source software you use, so that you can consult the sources or even change them if required. You can download the sources of Hadoop 1.2.1 from here.
  3. Eclipse IDE 3.7. (I have tested this on Eclipse 3.7; it may not work on higher versions.)

Steps:

  1. Download the MapReduce Eclipse plugin from here
  2. Copy it into the "plugins" directory of your Eclipse installation
  3. Restart Eclipse
  4. Open the MapReduce perspective in Eclipse

5. Add MapReduce locations

6. Click to add a new location
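
In the new location dialog you mainly need to fill in the Map/Reduce Master and DFS Master host and port. These must match the mapred.job.tracker and fs.default.name settings of your cluster; if you set up a single-node cluster following Michael G. Noll's tutorial mentioned above, the values would typically be:

Location name:     any name you like, e.g. localhost
Map/Reduce Master: Host: localhost, Port: 54311   (mapred.job.tracker)
DFS Master:        Host: localhost, Port: 54310   (fs.default.name)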

7. There are a few more advanced parameters as well, which may need to be changed depending upon your environment

Now the development environment setup is done, and you are ready to write and run your MapReduce job. To create a new MapReduce project and run it over the Hadoop cluster, follow the steps below.

8. Create a new MapReduce project

9. Name your project and provide your Hadoop installation directory

10.  Your project structure may look like this

11. Now you need to add a new Mapper class

12. It should look like this
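
Roughly, the plugin's Mapper wizard generates a skeleton along the lines of this sketch, based on the old org.apache.hadoop.mapred API; the exact generated stub and type parameters may differ.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MyMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    public void map(LongWritable key, Text value,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // your map logic goes here
    }
}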

You can write your code for the Mapper class here.

13. Add a Reducer class

14. Your Reducer class should look like this
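
Again roughly, a skeleton along these lines (the same caveats as for the Mapper stub apply):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class MyReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // your reduce logic goes here
    }
}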

You can add your code for the Reducer class here.

15. Add your Test runner class like this
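
A typical runner class for the old API looks roughly like this sketch; the class names are illustrative, and it assumes your Mapper and Reducer are the MyMapper and MyReducer stubs above. The 1st argument is the input path and the 2nd is the output path.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MyJobRunner {

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MyJobRunner.class);
        conf.setJobName("my-job");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        conf.setMapperClass(MyMapper.class);
        conf.setReducerClass(MyReducer.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // output directory

        JobClient.runJob(conf);
    }
}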

16. To run the program, click on Run As -> Run Configurations... and set up a run configuration for your runner class.

Click on Run and your program should start running.

You can find the output of the program at your configured location.

Please let me know if you find any difficulty in running the steps above.
Thanks