
How do Hadoop coding errors disrupt big data processing?

Hadoop coding oversights can wreak havoc on big data projects. What Hadoop limitations should we be wary of with data input and output?

Developers running big data jobs in AWS typically move raw data from local storage to cloud storage, such as Amazon Simple Storage Service. But, sometimes, moving data causes Hadoop to stop -- without explanation. The culprit could be mistakes in Hadoop coding.

With a Simple Storage Service (S3) resource, data movement usually involves specifying a URL to the desired S3 bucket. But many web-savvy users neglect to include the end slash in the URL. For example, the correct format is xxx://xxxxxxxxxxx/ -- a slash ends the URL. Web servers generally don't need the end slash, but Hadoop instances need it to work properly.

Hadoop doesn't accept HTTP URLs, so storage resource locations like http://xxxxxx won't work. Mistakes can also crop up in Amazon S3 bucket names. Because of Hadoop requirements, bucket names used with Amazon Elastic MapReduce (EMR) can contain only lowercase letters, numbers, periods and hyphens, and they cannot end in numbers. A bucket name that breaks either rule will stop Hadoop.
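The location rules above can be checked before a job is submitted. The following is a minimal sketch, not part of any AWS or Hadoop API: the function name `check_s3_location`, the `s3://` scheme and the bucket names in the usage lines are assumptions made for illustration.

```python
import re
from typing import List
from urllib.parse import urlparse

# Characters the article allows in bucket names used with Amazon EMR:
# lowercase letters, numbers, periods and hyphens.
_BUCKET_CHARS = re.compile(r"[a-z0-9.-]+")


def check_s3_location(url: str) -> List[str]:
    """Return a list of problems found in a storage location (empty if none).

    Hypothetical helper: it only encodes the rules described in the text.
    """
    problems = []
    parsed = urlparse(url)

    # Hadoop does not accept HTTP URLs for storage locations.
    if parsed.scheme in ("http", "https"):
        problems.append("HTTP URLs are not accepted")

    # Hadoop expects the location to end with a slash.
    if not url.endswith("/"):
        problems.append("missing trailing slash")

    bucket = parsed.netloc
    if not bucket:
        problems.append("no bucket name found")
    else:
        if not _BUCKET_CHARS.fullmatch(bucket):
            problems.append("bucket name has characters other than a-z, 0-9, '.', '-'")
        if bucket[-1] in "0123456789":
            problems.append("bucket name ends in a number")
    return problems


if __name__ == "__main__":
    # Example locations are invented for this sketch.
    for location in ("s3://my-bucket/input/", "http://my-bucket/input", "s3://Logs2/"):
        print(location, check_s3_location(location))
```

Returning a list of problems rather than raising on the first one lets a developer fix every issue in a single pass.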

Subdirectories are a way to organize data, but Hadoop won't search subdirectories for input data. Data must sit directly in the directory or Amazon S3 bucket path that a developer specifies; Hadoop ignores any subdirectories beneath it, whether or not they contain data. For example, if the data lives in /mydata/data1 and /mydata/data2 and the job uses /mydata/ as the data input location, Hadoop won't find the data in /data1 or /data2. You will have to either move the data or change the path.
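One way to spot this problem early is to work out which directories hold files directly, since those are the only useful input locations. The sketch below assumes a plain list of absolute file paths or object keys is already available; the helper name `input_dirs` and the sample paths are hypothetical.

```python
import posixpath
from typing import Iterable, List


def input_dirs(keys: Iterable[str]) -> List[str]:
    """Return the sorted directories that directly contain at least one file.

    Hypothetical helper: `keys` is any listing of absolute file paths or keys.
    """
    dirs = set()
    for key in keys:
        if key.endswith("/"):
            continue  # skip directory placeholders, they hold no data themselves
        parent = posixpath.dirname(key)
        if not parent.endswith("/"):
            parent += "/"  # keep the trailing slash on each location
        dirs.add(parent)
    return sorted(dirs)


if __name__ == "__main__":
    listing = [
        "/mydata/data1/part-0000",
        "/mydata/data1/part-0001",
        "/mydata/data2/part-0000",
    ]
    # /mydata/ is absent from the result because it holds no files directly.
    print(input_dirs(listing))
```

If the parent directory is missing from the result, pointing the job at it would give Hadoop nothing to read.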

Hadoop won't write to an output path that already exists, so a job may work fine once, but fail on later runs after the output path has been created. Hadoop sees the existing output path and stops. Developers may need to remove the existing path so that Hadoop can recreate it on the next job iteration, or they can change the output to a unique path for each run.
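The second option, a unique output path for each run, can be sketched with a timestamp suffix. The helper name `unique_output_path`, the `run-` prefix and the example location are assumptions for illustration, not a convention Hadoop or EMR requires.

```python
from datetime import datetime, timezone
from typing import Optional


def unique_output_path(base: str, now: Optional[datetime] = None) -> str:
    """Append a UTC timestamp directory to `base` so each run gets a fresh path.

    Hypothetical helper; `now` can be injected to make the result predictable.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    # Normalize the base so exactly one slash separates it from the run directory.
    return base.rstrip("/") + "/run-" + stamp + "/"


if __name__ == "__main__":
    # The bucket and prefix are invented for this sketch.
    print(unique_output_path("s3://my-bucket/output/"))
```

The timestamp has one-second resolution, so two runs started within the same second would still collide; a random suffix could be added if that matters.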

Amazon S3 is a service, so it doesn't behave like a traditional file system. Any application that involves S3, such as Hadoop EMR tasks, must check and handle S3 errors and retry failed S3 operations appropriately. Otherwise, an application may crash or cause other undesirable results.
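A common way to retry failed operations is exponential backoff with jitter. The sketch below is generic and makes no S3 calls itself: `with_retries` is a hypothetical wrapper, and `operation` stands in for whatever S3 request the application makes.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    operation: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying on failure with exponential backoff and jitter.

    Hypothetical helper. It catches every Exception for brevity; a real
    application would likely retry only errors it knows to be transient.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the last error
            delay = base_delay * (2 ** attempt)
            # Random jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay))
    raise ValueError("attempts must be at least 1")


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky() -> str:
        # Stand-in for an S3 request that fails twice, then succeeds.
        state["calls"] += 1
        if state["calls"] < 3:
            raise IOError("transient failure")
        return "ok"

    print(with_retries(flaky, base_delay=0.01))
```

Passing `sleep` as a parameter keeps the wrapper easy to test without real delays.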

In addition, reduce the rate at which an application calls S3; frequent calls increase the possibility of lag or failure. Also, reduce unnecessary S3 tasks, such as listing objects in S3, which can be expensive.
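One simple way to cut repeated listing calls is to remember earlier results within a run. The sketch below assumes the application already has some function that lists objects under a prefix; `CachedLister` and the `list_fn` parameter are hypothetical names, and no real S3 client is used.

```python
from typing import Callable, Dict, List


class CachedLister:
    """Wrap a listing function so each prefix is fetched at most once.

    Hypothetical helper: `list_fn` stands in for the application's own
    S3 listing call. Cached results can go stale if objects change.
    """

    def __init__(self, list_fn: Callable[[str], List[str]]) -> None:
        self._list_fn = list_fn
        self._cache: Dict[str, List[str]] = {}
        self.remote_calls = 0  # counts how often the real listing ran

    def list(self, prefix: str) -> List[str]:
        if prefix not in self._cache:
            self.remote_calls += 1
            self._cache[prefix] = list(self._list_fn(prefix))
        # Return a copy so callers cannot alter the cached listing.
        return list(self._cache[prefix])


if __name__ == "__main__":
    def fake_listing(prefix: str) -> List[str]:
        # Invented data standing in for a remote listing call.
        return [prefix + "part-0000", prefix + "part-0001"]

    lister = CachedLister(fake_listing)
    lister.list("input/")
    lister.list("input/")
    print(lister.remote_calls)
```

This trades freshness for fewer calls, so it suits inputs that do not change while the job runs.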
