Before we begin, we should know a few terms: S3, EMR, and bucket.
Amazon S3 stands for Amazon Simple Storage Service.
Amazon EMR stands for Amazon Elastic MapReduce.
A bucket is a container used to store data in S3. We can place files, folders, and so on inside an S3 bucket.
For more details about these terms, refer to the AWS website.
We will create a bucket to store the result. We will use MapReduce to count how many times each word occurs. For this example we will use the sample Python file and data that are already available from AWS as the Word Count example.
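To make the idea of the word count concrete, here is a minimal local sketch of the map and reduce phases in plain Python. It is not the AWS sample script: the function names (map_line, reduce_counts, word_count) and the word pattern are assumptions chosen for illustration, and everything runs on one machine instead of a cluster.

```python
import re
from collections import defaultdict

# Assumed definition of a "word": a letter followed by letters or digits.
WORD_PATTERN = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")


def map_line(line):
    """Map phase: emit one (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in WORD_PATTERN.findall(line)]


def reduce_counts(pairs):
    """Reduce phase: add up the values emitted for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_count(lines):
    """Run the map phase over every line, then reduce all emitted pairs."""
    pairs = []
    for line in lines:
        pairs.extend(map_line(line))
    return reduce_counts(pairs)


if __name__ == "__main__":
    sample = ["the quick brown fox", "The lazy dog and the fox"]
    # Print one word and its count per line, separated by a tab.
    for word, count in sorted(word_count(sample).items()):
        print(f"{word}\t{count}")
```

On a real cluster the map and reduce phases run on many machines in parallel, but the result has the same shape: one word and one count per line.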
We assume that you have the necessary credentials to log in to the AWS Management Console. If not, sign up for AWS. Select Amazon Simple Storage Service. [ You are required to provide your credit card details even if you are using a free account. ]
The following steps show how to set up the bucket.
1. From the Services menu, click on S3.
2. It will open the S3 Management Console. Click on Create Bucket.
3. Provide a bucket name and select the region. Click on Create.
4. In the All Buckets list, you can see your newly created bucket.
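The same bucket setup can also be scripted. The sketch below assumes the boto3 SDK, which this walkthrough does not otherwise use. The helper names create_bucket and bucket_exists and the bucket name in the usage comment are hypothetical. The helpers take the client as a parameter, so they can be tried without touching a real account.

```python
def create_bucket(s3_client, bucket_name, region):
    """Create a bucket in the given region using a boto3-style S3 client."""
    if region == "us-east-1":
        # us-east-1 is the default region and must not be passed
        # as a LocationConstraint.
        return s3_client.create_bucket(Bucket=bucket_name)
    return s3_client.create_bucket(
        Bucket=bucket_name,
        CreateBucketConfiguration={"LocationConstraint": region},
    )


def bucket_exists(s3_client, bucket_name):
    """Return True if the bucket appears in the account's bucket list."""
    response = s3_client.list_buckets()
    return any(b["Name"] == bucket_name for b in response.get("Buckets", []))


# Usage, assuming boto3 is installed and credentials are configured:
#
#   import boto3
#   client = boto3.client("s3", region_name="us-west-2")
#   create_bucket(client, "my-wordcount-bucket", "us-west-2")  # hypothetical name
#   print(bucket_exists(client, "my-wordcount-bucket"))
```

Bucket names are shared across all AWS accounts, so the call fails if the chosen name is already taken by someone else.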
This newly created bucket will be used to hold the data. We could also upload data directly into the bucket. Here we will see how to use MapReduce and store its results in the bucket. To do so, create a cluster using EMR.
5. Click again on Services and select EMR ( Elastic MapReduce ). Click on Create Cluster.
6. In Create Cluster, click on Go to advanced options.
7. Select Streaming Program in the drop-down and click on the Configure button.
8. In the Name field, enter a name. In the Mapper and Reducer fields, enter the mapper program and what the reducer should do.
We will use the sample files from Amazon:
Mapper : s3://elasticmapreduce/samples/wordcount/wordSplitter.py
Reducer : aggregate
This reducer will add up the counts for each word.
Input s3 location : s3://elasticmapreduce/samples/wordcount/input
Output s3 location : s3://<bucket-name>/output
Click on Add.
9. Select Auto-terminate cluster after the last step is completed. Click on Next.
10. Click on Next [ no need to change the hardware settings ].
11. Under General Options, for the S3 folder, enter s3://<bucket-name>/logs. Click on Next.
12. Click on Create Cluster.
13. It will display the cluster details with its state. Click on Cluster List at the top.
14. The cluster list will display all clusters. Click on the small triangle button to the left of the name of the cluster that we created.
15. It will display the state of the cluster: Provisioning, Running, etc.
16. After 10 to 15 minutes, it will display Terminating, All steps completed.
17. To view the results, click on Services in the top menu and open S3. Select the bucket.
18. Click on output.
19. The _SUCCESS file shows that the MapReduce job worked and the results were produced. The results appear as part-00000 etc.
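The streaming step from step 8 and the cluster states from steps 15 and 16 can also be expressed in code. This is a sketch under two assumptions: the cluster runs an EMR release that accepts command-runner.jar (4.x or later), and the cluster is managed through a boto3 EMR client, which this walkthrough does not otherwise use. The names build_streaming_step and wait_for_termination are hypothetical. The state names returned by the API (STARTING, RUNNING, TERMINATED and so on) differ from the labels shown in the console.

```python
import time

SAMPLE_PREFIX = "s3://elasticmapreduce/samples/wordcount"
FINAL_STATES = {"TERMINATED", "TERMINATED_WITH_ERRORS"}


def build_streaming_step(bucket_name, step_name="Word count"):
    """Build a step definition with the mapper, reducer, input and output from step 8."""
    return {
        "Name": step_name,
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                # Ship the mapper script to the cluster nodes.
                "-files", SAMPLE_PREFIX + "/wordSplitter.py",
                "-mapper", "wordSplitter.py",
                # Built-in reducer that sums the values per key.
                "-reducer", "aggregate",
                "-input", SAMPLE_PREFIX + "/input",
                "-output", "s3://" + bucket_name + "/output",
            ],
        },
    }


def wait_for_termination(emr_client, cluster_id, poll_seconds=30, sleep=time.sleep):
    """Poll the cluster state until it is final, then return that state."""
    while True:
        response = emr_client.describe_cluster(ClusterId=cluster_id)
        state = response["Cluster"]["Status"]["State"]
        if state in FINAL_STATES:
            return state
        sleep(poll_seconds)


# Usage, assuming boto3 is installed and a cluster id is already known:
#
#   import boto3
#   emr = boto3.client("emr")
#   print(wait_for_termination(emr, "<cluster-id>"))
```

The output folder must not exist before the step runs, otherwise Hadoop refuses to write to it.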
To see the result [ final step ]:
20. Right-click on part-00000 and select Download. This will download the file.
21. Open the file in an editor [ Notepad ]. It will display the words and their counts in columns.
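Instead of reading the downloaded file in an editor, we can load it with a few lines of Python. The sketch below assumes each line holds a word and its count separated by a tab. The helper names parse_counts and top_words are hypothetical, and a line that does not match the assumed layout raises a ValueError.

```python
def parse_counts(text):
    """Parse lines of the form 'word<TAB>count' into a dictionary."""
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        word, _, count = line.partition("\t")
        counts[word] = int(count)
    return counts


def top_words(counts, n=5):
    """Return the n most frequent words, highest count first."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


# Usage, assuming part-00000 was downloaded to the current folder:
#
#   with open("part-00000") as handle:
#       print(top_words(parse_counts(handle.read())))
```

If the output folder contains several part files, each one holds a different subset of the words, so parse them all and merge the dictionaries.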