

How to identify and avoid Amazon EMR problems

Amazon Elastic MapReduce helps our team process streaming data, but we've run into a number of issues. How can we identify and correct problems with these workloads?

The Amazon Elastic MapReduce platform can run big data frameworks in AWS, but the resulting data can be complex and time-consuming to work with. Developers need to take steps to overcome Amazon EMR problems.

Amazon Elastic MapReduce (EMR) frameworks start with a foundation, such as Apache Hadoop or Apache Spark, which is routinely coupled with open source utilities such as Hive or Pig. When used together, these big data frameworks can process, transform and analyze vast quantities of data, and then interact with AWS databases and storage resources such as Amazon DynamoDB and Amazon Simple Storage Service (S3). This integration between AWS tools can help IT teams more effectively manage, store and gain insights from otherwise unwieldy data.

Big data projects, including real-time analytics, generate valuable data from countless sources, such as IoT device telemetry, customer logs from mobile applications and application instrumentation. But streaming data can be problematic to work with and can create several issues with Amazon EMR.

Logs, such as syslog file entries, can help pinpoint problems with streaming data. For example, incorrectly formatted input data will typically cause errors in a mapper function. Timeouts are another cause of failed tasks: by default, Hadoop terminates a task that has not responded for 10 minutes. To fix this, developers can either simplify the task script so that it finishes within the default limit or raise the task timeout setting on the EMR cluster. In both cases, a syslog file will usually show these faults as failed task attempts.
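
The two faults above, malformed input and unresponsive tasks, can both be softened inside the mapper script itself. The following is a minimal sketch, not a reference implementation. It assumes a hypothetical record layout of a key, a tab and a numeric value, and the function name run_mapper, the counter group Mapper and the counter name MalformedRecords are illustrative. The reporter:counter and reporter:status lines written to stderr follow the convention Hadoop streaming uses for counter and status updates.

```python
import sys


def run_mapper(lines, out=sys.stdout, err=sys.stderr, status_every=10000):
    """Pass through well-formed records and count malformed ones.

    Assumed (hypothetical) record layout: key<TAB>numeric value.
    Returns the number of malformed records that were skipped.
    """
    bad = 0
    for count, line in enumerate(lines, start=1):
        if count % status_every == 0:
            # A periodic status line shows the framework the task is alive.
            err.write(f"reporter:status:processed {count} records\n")
        fields = line.rstrip("\n").split("\t")
        try:
            key, value = fields[0], float(fields[1])
        except (IndexError, ValueError):
            # Skip the bad record instead of letting the whole task fail,
            # and record it in a streaming counter for later inspection.
            bad += 1
            err.write("reporter:counter:Mapper,MalformedRecords,1\n")
            continue
        out.write(f"{key}\t{value}\n")
    return bad
```

In an actual streaming script, the function would be called as run_mapper(sys.stdin). Whether skipping bad records is acceptable depends on the job; some teams may prefer to fail fast so that format problems surface immediately.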

Mistakes in job arguments can be another source of problems with Amazon EMR. Hadoop streaming only recognizes a limited set of arguments, and those arguments must use Java syntax, meaning each one is preceded by a single hyphen. Passing an unsupported argument, or an argument preceded by a double hyphen, will cause the cluster to fail.
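
A quick check before submitting a step can catch the hyphen mistake described above. This is a small sketch under stated assumptions: the helper name find_double_hyphen_args is hypothetical, the bucket paths and script names are placeholders, and the check only looks for double hyphens rather than validating arguments against the full supported list.

```python
def find_double_hyphen_args(args):
    """Return the arguments that start with '--' instead of a single '-'."""
    return [arg for arg in args if arg.startswith("--")]


# Hypothetical argument list for a streaming step; paths are placeholders.
step_args = [
    "-input", "s3://example-bucket/input/",
    "--output", "s3://example-bucket/output/",  # deliberate mistake
    "-mapper", "mapper.py",
    "-reducer", "reducer.py",
]

print(find_double_hyphen_args(step_args))  # prints ['--output']
```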

Task script errors are another culprit. For example, Amazon EMR problems that result from mapper or reducer script errors appear in the stderr file of the logs for the failed task attempts. The error message and the location it reports give developers important details about the root cause and can help them find a solution.
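
When many task attempts fail, reading each stderr file by hand becomes tedious. The sketch below assumes the cluster's logs have already been copied to a local directory and that the files of interest are named stderr or stderr.gz; the directory layout and the function names read_log and failed_attempt_errors are assumptions for illustration, not part of EMR. It collects the last few lines of every non-empty stderr file, which is usually where a script's error message or traceback ends.

```python
import gzip
from pathlib import Path


def read_log(path):
    """Read a plain or gzip-compressed log file as text."""
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", errors="replace") as handle:
        return handle.read()


def failed_attempt_errors(log_root, tail_lines=5):
    """Map each non-empty stderr file under log_root to its last lines."""
    results = {}
    for path in sorted(Path(log_root).rglob("stderr*")):
        if not path.is_file():
            continue
        lines = read_log(path).strip().splitlines()
        if lines:
            results[str(path)] = lines[-tail_lines:]
    return results
```

A non-empty stderr file is not proof of failure on its own, so the output is best read alongside the failed task attempts listed in the syslog file.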
