
Analyze big data using MapReduce

Learn a step-by-step way to analyze big data using MapReduce.

Amazon Elastic MapReduce (EMR) is a useful tool for developing applications, including log analysis, financial analysis, marketing analysis and bioinformatics. It uses Hadoop, an open source framework, to distribute your data across a cluster of Amazon EC2 instances.

The best way to analyze big data is to use Hive, an open source data warehouse and analytics package running on top of Hadoop. Hive scripts use an SQL-like language called Hive QL. With it, you can avoid the complexities of writing MapReduce programs in Java.
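To see the difference, consider a grouping query: in Java MapReduce it would require a mapper class, a reducer class and driver boilerplate, while in Hive QL it is a single statement. A hypothetical example (table and column names are illustrative, not from the EMR sample), held in a Python string for consistency with the later sketches:

```python
# One Hive QL statement replaces a Java mapper class, a reducer class
# and driver boilerplate for the same aggregation.
# The table and column names are illustrative.
hiveql = "SELECT ad_id, COUNT(*) FROM clicks GROUP BY ad_id"
```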

The following example of creating a Hive cluster is based on an Amazon EMR example, Contextual Advertising using Apache Hive. This example shows how you can correlate customer click data to specific advertisements.

First, open the Amazon Elastic MapReduce console. Then click Create Cluster to set up configurations in five steps.

Step 1. Configure a cluster

  • In the Cluster name field, enter a descriptive name. The name does not need to be unique.
  • In the Termination protection field, the default is Yes. This ensures the cluster does not shut down due to accident or error.
  • In the Logging field, the default is Enabled. Log data is sent to Amazon S3.
  • In the Log folder S3 location field, enter the bucket name and folder in this format: s3://<bucket name>/<folder>/.
  • In the Debugging field, the default is Enabled.

The Tag section is optional. You can add as many as 10 tags to your EMR cluster. A tag consists of a case-sensitive key value pair.
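The step 1 console fields map to parameters you could also supply programmatically, for example as keyword arguments to boto3's run_job_flow. A minimal sketch as a plain dict (the cluster name, bucket and tag values are placeholders, not from the original):

```python
# Step 1 settings expressed as a plain dict mirroring the console fields.
# Name, bucket path and tag values are placeholders; adjust for your account.
cluster_config = {
    "Name": "my-hive-cluster",              # descriptive; need not be unique
    "LogUri": "s3://my-bucket/emr-logs/",   # where EMR sends log data
    "TerminationProtected": True,           # default Yes: guards against accidental shutdown
    "Tags": [                               # up to 10 case-sensitive key-value pairs
        {"Key": "Project", "Value": "hive-ads"},
    ],
}
```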

Step 2. Set up software configuration

  • In the Hadoop distribution box, leave Amazon as the default.
  • In the AMI version box, choose 2.4.2 (Hadoop 1.0.3).
  • In the Applications to be installed box, keep Hive and delete Pig.
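The step 2 software choices can be captured the same way. A sketch of the equivalent settings (the dict shape is illustrative; with this legacy AMI version, the console installs Hive for you):

```python
# Step 2 settings: legacy AMI version and the applications to install.
# Only Hive is kept for this example; Pig is dropped.
software_config = {
    "AmiVersion": "2.4.2",      # bundles Hadoop 1.0.3
    "Applications": ["Hive"],   # keep Hive, delete Pig
}
```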

Step 3. Set up hardware configuration

  • In the Network field, choose Launch into EC2-Classic.
  • In the EC2 Subnet field, choose No preference.
  • In the Master, Core and Task fields, the default EC2 instance type is m1.small. You use small instances for all nodes for light workloads (to keep your costs low). The Count is 1, 2, 0 by default, respectively. Leave Request Spot Instances unchecked for all three fields.

Note: Twenty is the maximum number of nodes per AWS account. If you have two clusters running, the total number of nodes running for both clusters must be 20 or less. If you need more than 20 nodes, you must submit a request to increase your Amazon EC2 instance limit.
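The step 3 hardware choices, including the 20-node account limit from the note above, can be sketched as follows (the dict keys mirror the shape boto3's run_job_flow accepts in its Instances parameter):

```python
# Step 3 settings: one m1.small master, two m1.small core nodes, no task nodes.
# Small instances keep costs low for light workloads.
instance_config = {
    "MasterInstanceType": "m1.small",
    "SlaveInstanceType": "m1.small",   # applies to core and task nodes
    "InstanceCount": 3,                # 1 master + 2 core + 0 task
}

# Twenty is the maximum number of nodes per AWS account across all
# running clusters; beyond that you must request a limit increase.
MAX_NODES_PER_ACCOUNT = 20
within_limit = instance_config["InstanceCount"] <= MAX_NODES_PER_ACCOUNT
```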

Step 4. Set up security and access

  • In the EC2 key pair field, choose an Amazon EC2 key pair from the list. This allows you to use Secure Shell (SSH) to connect to the master node.
  • In the IAM user access field, the default is No other IAM users.
  • In the EC2 role box, the default is No roles found.

You take no actions in the Bootstrap Actions section.

Step 5. Specify cluster parameters

  • In the Steps section, choose Hive Program from the list and click Configure and add.
  • In the Name field, the default is Hive Program.
  • In the Script S3 Location field (required), enter a value in the form of BucketName/path/ScriptName, for example, s3n://elasticmapreduce/samples/hive-ads/libs/model-build.
  • In the Input S3 Location field (optional), enter a value in the form of BucketName/path, for example, s3n://elasticmapreduce/samples/hive-ads/tables. This will be passed to the Hive script as a parameter named INPUT.
  • In the Output S3 Location field (optional), enter a value in the form of BucketName/path, for example, s3n://myawsbucket/hive-ads/output/2014-4-14. This will be passed to the Hive script as a parameter named OUTPUT.
  • In the Arguments field, enter, for example, -d LIBS=s3n://elasticmapreduce/samples/hive-ads/libs. The Hive script requires these additional libraries.
  • In the Action on Failure field, choose Continue. If the current step fails, it will continue to the next step.
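Taken together, the step 5 fields correspond to a single EMR step definition. A sketch using the article's example S3 paths (the script-runner jar and the hive-script argument layout are assumptions based on EMR's legacy Hive script runner, not values from the article):

```python
# Step 5 expressed as an EMR step definition, the shape boto3's run_job_flow
# accepts in its Steps list. S3 paths are the article's example values;
# the Jar path and argument layout are assumptions (legacy Hive script runner).
hive_step = {
    "Name": "Hive Program",
    "ActionOnFailure": "CONTINUE",  # on failure, move on to the next step
    "HadoopJarStep": {
        "Jar": "s3://elasticmapreduce/libs/script-runner/script-runner.jar",
        "Args": [
            "s3://elasticmapreduce/libs/hive/hive-script",
            "--run-hive-script", "--args",
            "-f", "s3n://elasticmapreduce/samples/hive-ads/libs/model-build",
            "-d", "INPUT=s3n://elasticmapreduce/samples/hive-ads/tables",
            "-d", "OUTPUT=s3n://myawsbucket/hive-ads/output/2014-4-14",
            "-d", "LIBS=s3n://elasticmapreduce/samples/hive-ads/libs",
        ],
    },
}
```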

When you are done, click Add, then Create Cluster. You will see the Summary pane.

As in the Contextual Advertising using Apache Hive example, you need to prepare for a Hive session on the master node before you can run a query to analyze big data.

You will need to push impression and click log files to Amazon S3 every five minutes. An entry is added to the impression log every time an advertisement is displayed to a customer, and an entry is added to the click log every time a customer clicks on an advertisement. SQL-like queries simplify the process of correlating customer click data to specific advertisements.
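The correlation described above amounts to a join between the two log tables. A hypothetical Hive QL query (table and column names are invented for illustration, not taken from the EMR sample), again held in a Python string:

```python
# Hypothetical Hive QL: join click entries back to impression entries,
# then count clicks and impressions per advertisement.
# Table and column names are illustrative, not from the EMR sample.
correlate_query = """
SELECT i.ad_id,
       COUNT(i.impression_id) AS impressions,
       COUNT(c.impression_id) AS clicks
FROM impressions i
LEFT OUTER JOIN clicks c
  ON i.impression_id = c.impression_id
GROUP BY i.ad_id
"""
```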

In conclusion, the best way to analyze big data is to have Hive running on top of Hadoop and to use SQL queries to simplify log data analysis.
