How to Run an Elastic MapReduce Job Using a Custom Jar on Amazon EMR
Learn how to develop an application using Java and the MapReduce framework for Hadoop, then execute that app in Amazon's Elastic MapReduce.
Amazon EMR is a web service that allows developers to easily and efficiently process enormous amounts of data. It uses a hosted Hadoop framework running on the web-scale infrastructure of Amazon EC2 and Amazon S3.
Amazon EMR removes most of the cumbersome details of running Hadoop: it takes care of provisioning the Hadoop cluster, running and terminating the job flow, moving data between Amazon EC2 and Amazon S3, and tuning Hadoop.
In this tutorial, we will first develop an example Java app (WordCount) using the MapReduce framework for Hadoop and, thereafter, we'll execute our program on Amazon Elastic MapReduce.
Prerequisites
You must have valid AWS account credentials. You should also have a general familiarity with the Eclipse IDE before you begin, though any other IDE of your choice will work.
Step 1 – Develop MapReduce WordCount Java Program
In this section, we are first going to develop our WordCount application. A WordCount program will determine how many times different words appear in a set of files.
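For example, given an input file containing the single line "the quick brown fox jumps over the lazy dog the end," the output would include entries such as "the 3," "quick 1," and "fox 1."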
- In Eclipse (or whichever IDE you are using), create a simple Java project with the name "WordCount."
- Create a Java class named Map and override the map method as follows:
Java

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the input line into tokens and emit (word, 1) for each one.
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}
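A small design note: the one and word instances are created once and reused across tokens instead of being allocated inside the loop; reusing Writable objects this way is a common Hadoop idiom for reducing object churn in hot mapper code.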
- Create a Java class named Reduce and override the reduce method as shown below:
Java

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the counts emitted by the mappers for this word.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
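Because the summation here is associative and commutative, the same class could also serve as a combiner (registered in the driver with job.setCombinerClass(Reduce.class)) to pre-aggregate counts on the map side; that step is optional, but it's a common WordCount optimization.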
- Create a Java class named WordCount and define the main method as shown below:
Java

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job = new Job(conf, "wordcount");
        job.setJarByClass(WordCount.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // args[0] is the input path, args[1] is the output path.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }
}
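If you have a local Hadoop installation handy, you can optionally smoke-test the JAR before uploading it, along the lines of hadoop jar WordCount.jar WordCount /path/to/input /path/to/output (the paths here are illustrative).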
- Export the WordCount program to a JAR using Eclipse and save it to some location on disk. Make sure that you specify WordCount as the Main Class when exporting the JAR.
Step 2 – Upload the WordCount JAR and Input Files to Amazon S3
Now we are going to upload the WordCount JAR to Amazon S3. First, go to the following URL: https://console.aws.amazon.com/s3/home. Next, click “Create Bucket,” give your bucket a name, and click the “Create” button. Select your new S3 bucket in the left-hand pane, then upload the WordCount JAR along with a sample input file containing the text whose words you want to count.
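If you'd rather script the upload than click through the console, the following is a minimal sketch using the AWS SDK for Java; the bucket name, key names, and local file paths are placeholder assumptions, and credentials are assumed to be available through the SDK's default provider chain.
Java

import java.io.File;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class UploadArtifacts {
    public static void main(String[] args) {
        // Credentials are resolved from the default provider chain
        // (environment variables, ~/.aws/credentials, and so on).
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Bucket name and paths below are placeholders: substitute your own.
        s3.putObject("my-wordcount-bucket", "WordCount.jar", new File("WordCount.jar"));
        s3.putObject("my-wordcount-bucket", "input/sample.txt", new File("sample.txt"));
    }
}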
Step 3 – Running an Elastic MapReduce Job
Now that the JAR is uploaded into S3, all we need to do is create a new job flow. Let's walk through the steps below (for details on each step, I encourage you to check out How to Create a Job Flow Using a Custom JAR).
- Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/
- Click 'Create New Job Flow.'
- On the DEFINE JOB FLOW page, enter the following details:
- Job Flow Name = WordCountJob
- Select 'Run your own application'
- Select 'Custom JAR' in the drop-down list
- Click 'Continue'
- On the SPECIFY PARAMETERS page, enter values in the boxes using the following list as a guide, and then click Continue.
- JAR Location = bucketName/jarFileLocation
- JAR Arguments =
- s3n://bucketName/inputFileLocation
- s3n://bucketName/outputpath
- Please note that the output path must be unique each time we execute the job. Hadoop always creates a folder with the same name as the one specified here, and the job will fail if that folder already exists (if you'd rather reuse a path, see the sketch below).
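If you'd prefer to reuse the same output path between runs, a minimal sketch (assuming the WordCount driver above) is to clear the old output in main() before submission using Hadoop's FileSystem API:
Java

// Optional addition to WordCount.main(), placed before
// FileOutputFormat.setOutputPath(...). FileOutputFormat fails fast when the
// output directory already exists, so any stale copy is deleted first.
// Requires: import org.apache.hadoop.fs.FileSystem;
Path outputPath = new Path(args[1]);
FileSystem fs = outputPath.getFileSystem(conf);
if (fs.exists(outputPath)) {
    fs.delete(outputPath, true); // true = delete recursively
}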
After submitting the job, just wait and monitor it as it runs through the Hadoop flow. You can also look for errors using the Debug button. The job should complete within 10 to 15 minutes, though this also depends on the size of the input. Once the job completes, you can view the results in the S3 Browser panel. You can also download the files from S3 and analyze the outcome of the job.