Developers running big data jobs in AWS typically move raw data from local storage to cloud storage, such as Amazon Simple Storage Service. But sometimes moving data causes Hadoop to stop -- without explanation. The culprit could be a mistake in how the Hadoop job is set up.
With a Simple Storage Service (S3) resource, data movement usually involves specifying a URL to the desired S3 bucket. But many web-savvy users neglect to include the trailing slash in the URL. For example, the correct format is xxx://xxxxxxxxxxx/ -- a slash ends the URL. Web servers generally don't need the trailing slash, but Hadoop instances need it to work properly.
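As a minimal sketch of the trailing-slash rule above, a small helper can normalize a location before it is handed to a job. The function name and the example location `s3://example-bucket/input` are hypothetical and used only for illustration.

```python
def ensure_trailing_slash(location: str) -> str:
    """Return the location unchanged if it ends in '/', else append one."""
    return location if location.endswith("/") else location + "/"


if __name__ == "__main__":
    # Hypothetical bucket and path, for illustration only.
    print(ensure_trailing_slash("s3://example-bucket/input"))
```

Running the normalization once, at the point where locations enter the job configuration, keeps the check out of the rest of the code.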
Hadoop doesn't accept HTTP URLs for storage locations, so locations like http://xxxxxx won't work. Mistakes can also crop up in Amazon S3 bucket names. For use with Amazon Elastic MapReduce (EMR), Hadoop requires that bucket names contain only lowercase letters, numbers, periods and hyphens, and that they not end in a number. A bucket name that breaks any of these rules will stop Hadoop.
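The naming rules above are easy to check before a job is submitted. The sketch below encodes only the rules stated in this article (no HTTP URLs; lowercase letters, numbers, periods and hyphens; no trailing number). The function names and example locations are assumptions, and S3 itself enforces further naming rules that are not covered here.

```python
import re
from urllib.parse import urlsplit

# Characters the article says are allowed in bucket names used with EMR.
_ALLOWED_CHARS = re.compile(r"[a-z0-9.-]+")


def bucket_name_problems(name: str) -> list:
    """Return a list of rule violations for a bucket name (empty if none)."""
    problems = []
    if not _ALLOWED_CHARS.fullmatch(name):
        problems.append(
            "contains characters other than lowercase letters, "
            "numbers, periods or hyphens"
        )
    if name and name[-1] in "0123456789":
        problems.append("ends in a number")
    return problems


def location_problems(location: str) -> list:
    """Check the URL scheme and the bucket name of a storage location."""
    parts = urlsplit(location)
    problems = []
    if parts.scheme in ("http", "https"):
        problems.append("uses an HTTP URL")
    problems.extend(bucket_name_problems(parts.netloc))
    return problems


if __name__ == "__main__":
    # Hypothetical locations, for illustration only.
    for loc in ("s3://my-data/input/", "http://my-data/input/", "s3://MyData2/input/"):
        print(loc, location_problems(loc))
```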
Subdirectories are a way to organize data, but Hadoop won't search subdirectories for input data. Input files must sit directly in the directory or Amazon S3 bucket path that a developer specifies; Hadoop ignores any subdirectories beneath it, whether or not they contain data. For example, if the data is in /mydata/data1 and /mydata/data2 and the job uses /mydata/ as the input location, Hadoop won't find the data in data1 or data2. You will have to either move the data or change the input path.
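To make the non-recursive behavior concrete, the sketch below filters a list of object keys down to those that sit directly under an input prefix. It only mimics the behavior described above and is not Hadoop's own listing code. The function name and the file names are hypothetical.

```python
def direct_children(prefix: str, keys: list) -> list:
    """Return the keys located directly under prefix, skipping deeper levels."""
    if not prefix.endswith("/"):
        prefix += "/"
    found = []
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        # A remaining '/' means the key lives in a subdirectory of prefix.
        if rest and "/" not in rest:
            found.append(key)
    return found


if __name__ == "__main__":
    # Hypothetical object keys, for illustration only.
    keys = ["mydata/data1/part-0.txt", "mydata/data2/part-0.txt"]
    print(direct_children("mydata/", keys))        # nothing directly under mydata/
    print(direct_children("mydata/data1/", keys))  # the file inside data1
```

A check like this before submission can flag an input location that would leave the job with no input files.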
Hadoop won't write to an output path that already exists, so a job may work fine once, but fail on later runs after the output path has been created. Hadoop sees the existing output path and stops. Developers may need to remove the existing path so that Hadoop can recreate it on the next job iteration, or they can change the output to a unique path for each run.
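One way to get a unique output path for each run is to append a timestamp and a short random suffix to a base location. This is a sketch under assumptions: the function name, the `run-` naming scheme and the example bucket are hypothetical, not an EMR convention.

```python
import uuid
from datetime import datetime, timezone
from typing import Optional


def unique_output_path(base: str, now: Optional[datetime] = None) -> str:
    """Build an output location that differs on every call."""
    base = base if base.endswith("/") else base + "/"
    stamp = (now or datetime.now(timezone.utc)).strftime("%Y%m%dT%H%M%SZ")
    # The random suffix separates runs that start within the same second.
    return f"{base}run-{stamp}-{uuid.uuid4().hex[:8]}/"


if __name__ == "__main__":
    # Hypothetical bucket and path, for illustration only.
    print(unique_output_path("s3://example-bucket/output"))
```

The tradeoff is that old output accumulates, so unique paths usually need a separate cleanup step.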
Amazon S3 is a service, so it doesn't behave like a traditional file system. Any application that involves S3, such as Hadoop EMR tasks, must check and handle S3 errors and retry failed S3 operations appropriately. Otherwise, an application may crash or cause other undesirable results.
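Retrying failed S3 operations is often done with exponential backoff and jitter. The sketch below is a generic wrapper, not an AWS API: `retry_with_backoff`, its parameters and the default values are assumptions, and `operation` stands in for whatever S3 call the application makes. In practice, `retry_on` should be narrowed to the errors that are actually transient.

```python
import random
import time


def retry_with_backoff(operation, attempts=5, base_delay=0.5, max_delay=8.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call operation(), retrying on failure with capped exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter keeps many clients from retrying in lockstep.
            sleep(random.uniform(0, delay))


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_operation():
        # Stand-in for an S3 call that fails twice before succeeding.
        state["calls"] += 1
        if state["calls"] < 3:
            raise ConnectionError("transient failure")
        return "done"

    print(retry_with_backoff(flaky_operation, base_delay=0.01))
```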
In addition, reduce the rate at which an application calls S3; frequent calls increase the possibility of lag or failure. Also, cut out unnecessary S3 operations, such as listing objects in S3, which can be expensive.
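One simple way to avoid repeated listings is to cache each result for the duration of a job. The class below is a sketch, not part of any AWS SDK: `CachedListing` is a hypothetical name, and `list_fn` stands in for whatever function the application uses to list objects under a prefix. A cache like this suits data that does not change while the job runs.

```python
class CachedListing:
    """Remember listing results so each prefix is listed at most once."""

    def __init__(self, list_fn):
        self._list_fn = list_fn  # callable taking a prefix, returning keys
        self._cache = {}

    def list(self, prefix):
        if prefix not in self._cache:
            # Only the first request for a prefix reaches the backing store.
            self._cache[prefix] = list(self._list_fn(prefix))
        return self._cache[prefix]


if __name__ == "__main__":
    def fake_list(prefix):
        # Stand-in for a real listing call, for illustration only.
        print("listing", prefix)
        return [prefix + "part-0.txt"]

    listing = CachedListing(fake_list)
    listing.list("input/")
    listing.list("input/")  # served from the cache, no second listing
```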