
How do Hadoop coding errors disrupt big data processing?

Hadoop coding oversights can wreak havoc on big data projects. What Hadoop limitations should we be wary of with data input and output?

Developers running big data jobs in AWS typically move raw data from local storage to cloud storage, such as Amazon Simple Storage Service. But sometimes, moving data causes Hadoop to stop -- without explanation. The culprit could be mistakes in Hadoop coding.

With a Simple Storage Service (S3) resource, data movement usually involves specifying a URL to the desired S3 bucket. But many web-savvy users neglect to include the end slash in the URL. For example, the correct format is xxx://xxxxxxxxxxx/ -- a slash ends the URL. Web servers generally don't need the end slash, but Hadoop instances need it to work properly.
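For illustration, here is roughly how that input location might be passed to a MapReduce job driver. The bucket name and prefix are hypothetical, and the URI scheme (s3, s3n or s3a) varies by Hadoop and EMR version; the point to note is the trailing slash on the location.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class InputPathExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();

        // Hypothetical bucket and prefix; note the trailing slash on the location.
        String input = "s3://my-example-bucket/input/";
        // "s3://my-example-bucket/input" -- without the slash -- is the form that trips Hadoop up.

        FileInputFormat.addInputPath(job, new Path(input));
    }
}
```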

Hadoop coding doesn't accept HTTP URLs, so storage resource locations like http://xxxxxx won't work. Mistakes can also crop up in the naming conventions of Amazon S3 buckets. S3 bucket names cannot end in numbers, and Hadoop coding demands that Amazon S3 bucket names for Amazon Elastic MapReduce (EMR) contain only lowercase letters, numbers, periods and hyphens. Any deviation from these naming restrictions will stop Hadoop.
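A quick client-side check can catch naming problems before a job is submitted. The sketch below simply encodes the restrictions described above; it is not an official AWS validator, and the class and method names are made up for this example.

```java
import java.util.regex.Pattern;

public class BucketNameCheck {
    // Encodes only the EMR-related restrictions mentioned above, not the full S3 naming rules.
    private static final Pattern ALLOWED = Pattern.compile("[a-z0-9.-]+");

    public static boolean looksValidForEmr(String bucket) {
        if (bucket == null || bucket.isEmpty() || !ALLOWED.matcher(bucket).matches()) {
            return false;
        }
        // Per the restriction above, the name should not end in a number.
        return !Character.isDigit(bucket.charAt(bucket.length() - 1));
    }

    public static void main(String[] args) {
        System.out.println(looksValidForEmr("my-emr-logs"));   // true
        System.out.println(looksValidForEmr("MyBucket"));      // false: upper case letters
        System.out.println(looksValidForEmr("emr-data-2017")); // false: ends in a number
    }
}
```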

Subdirectories are a way to organize data, but Hadoop won't search subdirectories for input data. Data must be in the exact directory or Amazon S3 bucket that a developer indicates; otherwise, Hadoop will ignore the subdirectories, whether or not they contain data. For example, if the data sits in /mydata/data1 and /mydata/data2 and the job specifies /mydata/ as the input location, Hadoop won't find the data in /data1 or /data2. You will have to either move the data or change the path.
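One workaround is to add each subdirectory as an explicit input path, as in the sketch below with the hypothetical /mydata paths. Some Hadoop versions also expose a recursive-input setting; check that your distribution supports it before relying on it.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SubdirectoryInputs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Option 1 (if supported by your Hadoop version): let the input format descend into subdirectories.
        // conf.setBoolean("mapreduce.input.fileinputformat.input.dir.recursive", true);

        Job job = Job.getInstance(conf, "subdirectory-inputs");

        // Option 2: point Hadoop at each subdirectory explicitly (hypothetical paths).
        FileInputFormat.addInputPath(job, new Path("/mydata/data1"));
        FileInputFormat.addInputPath(job, new Path("/mydata/data2"));
    }
}
```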

Existing output paths aren't supported in Hadoop coding, so a job may work fine the first time but fail once the output path has been created. Hadoop sees the existing output path and stops. Developers either need to remove the existing path so that Hadoop can recreate it on the next job iteration, or change the output to a unique path for each run.
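Either option can be handled in the job driver itself. The sketch below, using a hypothetical output location, deletes a leftover output path before the job runs and shows the unique-path alternative as a comment.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class OutputPathHandling {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "output-path-handling");

        // Hypothetical output location.
        Path output = new Path("/results/wordcount");

        // Option 1: remove a leftover output path so Hadoop can recreate it.
        FileSystem fs = output.getFileSystem(conf);
        if (fs.exists(output)) {
            fs.delete(output, true);   // true = delete recursively
        }

        // Option 2: give each run its own path instead, e.g. by appending a timestamp.
        // Path output = new Path("/results/wordcount-" + System.currentTimeMillis());

        FileOutputFormat.setOutputPath(job, output);
    }
}
```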

Amazon S3 is a service, so it doesn't behave like a traditional file system. Any application that involves S3, such as Hadoop EMR tasks, must check and handle S3 errors and retry failed S3 operations appropriately. Otherwise, an application may crash or cause other undesirable results.
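As a minimal sketch of that kind of handling -- assuming the AWS SDK for Java v1 and a hypothetical bucket name -- the loop below retries a failing S3 list call a few times with exponential backoff instead of letting the first error end the job. The SDK also has built-in retry settings; this just makes the idea explicit.

```java
import com.amazonaws.AmazonServiceException;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ObjectListing;

public class S3RetryExample {
    public static void main(String[] args) throws InterruptedException {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        ObjectListing listing = null;
        int attempts = 0;
        while (listing == null) {
            try {
                // Hypothetical bucket name.
                listing = s3.listObjects("my-example-bucket");
            } catch (AmazonServiceException e) {
                attempts++;
                if (attempts >= 5) {
                    throw e;                       // give up after a few tries
                }
                Thread.sleep(1000L << attempts);   // simple exponential backoff
            }
        }
        System.out.println("Objects found: " + listing.getObjectSummaries().size());
    }
}
```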

In addition, reduce the rate at which an application calls S3; frequent calls increase the possibility of lag or failure. Also, reduce unnecessary S3 tasks, such as listing objects in S3, which can be expensive.
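One way to cut down on list operations is to fetch the object listing once, cache the keys and reuse them, rather than re-listing the bucket for every task. The sketch below does that with a hypothetical bucket, pausing briefly between pages to keep the request rate down.

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3ObjectSummary;

import java.util.ArrayList;
import java.util.List;

public class ListOnceExample {
    public static List<String> listKeysOnce(AmazonS3 s3, String bucket) throws InterruptedException {
        List<String> keys = new ArrayList<>();
        ObjectListing page = s3.listObjects(bucket);
        while (true) {
            for (S3ObjectSummary summary : page.getObjectSummaries()) {
                keys.add(summary.getKey());
            }
            if (!page.isTruncated()) {
                break;
            }
            Thread.sleep(200);                        // modest pause between pages
            page = s3.listNextBatchOfObjects(page);   // fetch the next page of results
        }
        return keys;   // cache and reuse this list instead of re-listing per task
    }

    public static void main(String[] args) throws InterruptedException {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        List<String> keys = listKeysOnce(s3, "my-example-bucket");   // hypothetical bucket
        System.out.println("Cached " + keys.size() + " keys");
    }
}
```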


This was last published in January 2017
