Developers running big data jobs in AWS typically move raw data from local storage to cloud storage, such as Amazon Simple Storage Service. But, sometimes, moving data causes Hadoop to stop -- without explanation. The culprit could be mistakes in how the Hadoop job is coded.
With a Simple Storage Service (S3) resource, data movement usually involves specifying a URL to the desired S3 bucket. But many web-savvy users neglect to include the trailing slash in the URL. For example, the correct format is xxx://xxxxxxxxxxx/ -- a slash ends the URL. Web servers generally don't need the trailing slash, but Hadoop instances need it to work properly.
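One way to guard against a missing trailing slash is to normalize every location before handing it to the job. The following is a minimal sketch in Python; the function name and the s3://example-bucket/input location are hypothetical and used only for illustration.

```python
def with_trailing_slash(location: str) -> str:
    """Return the location unchanged if it ends in a slash, else append one."""
    return location if location.endswith("/") else location + "/"


# Hypothetical bucket and prefix, for illustration only.
normalized = with_trailing_slash("s3://example-bucket/input")
print(normalized)
```

Calling the function on a location that already ends in a slash returns it unchanged, so it is safe to apply to every path.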
Hadoop doesn't accept HTTP URLs, so storage resource locations like http://xxxxxx won't work. Mistakes can also crop up in the names of Amazon S3 buckets. Bucket names used with Amazon Elastic MapReduce (EMR) cannot end in numbers, and Hadoop requires that they contain only lowercase letters, numbers, periods and hyphens. Any deviation from these naming restrictions will stop Hadoop.
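These rules are easy to check before a job is submitted. The sketch below, written in Python, encodes only the restrictions described above; the function name, the messages and the example locations are assumptions made for illustration, not part of any AWS or Hadoop API.

```python
import re
from typing import List
from urllib.parse import urlparse

# Only lowercase letters, digits, periods and hyphens are allowed,
# and the last character must not be a digit.
BUCKET_PATTERN = re.compile(r"[a-z0-9.-]*[a-z.-]")


def check_location(location: str) -> List[str]:
    """Return a list of problems found in a storage location (empty if none)."""
    problems = []
    parsed = urlparse(location)
    if parsed.scheme in ("http", "https"):
        problems.append("HTTP URLs are not accepted as storage locations")
    bucket = parsed.netloc  # the part between '//' and the next '/'
    if not BUCKET_PATTERN.fullmatch(bucket):
        problems.append(
            "bucket name %r must use only lowercase letters, numbers, "
            "periods and hyphens, and must not end in a number" % bucket
        )
    return problems


# Hypothetical locations, for illustration only.
for candidate in ("s3://my-bucket/input/", "s3://my-bucket1/input/"):
    print(candidate, check_location(candidate))
```

An empty list means the location passed every check in this sketch; it does not guarantee that the bucket exists or is reachable.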
Subdirectories are a way to organize data, but Hadoop won't search subdirectories for input data. Data must sit directly in the directory or Amazon S3 bucket that a developer specifies; Hadoop ignores any subdirectories beneath it, whether or not they contain data. For example, if the data is stored in the directories /mydata/data1 and /mydata/data2 and you specify /mydata/ as the input location, Hadoop won't find the data in /data1 or /data2. You will have to either move the data or change the path.
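If moving the data is not practical, the input paths can be collected explicitly instead. The Python sketch below walks a local directory tree as a stand-in for the real storage layout and returns every directory that directly contains files; the function name and the temporary data1 and data2 directories are hypothetical.

```python
import os
import tempfile


def dirs_with_files(root: str) -> list:
    """Return each directory at or below root that directly contains a file."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if filenames:
            found.append(dirpath)
    return sorted(found)


# Build a throwaway tree shaped like /mydata/data1 and /mydata/data2.
with tempfile.TemporaryDirectory() as root:
    for sub in ("data1", "data2"):
        os.makedirs(os.path.join(root, sub))
        with open(os.path.join(root, sub, "part.txt"), "w") as handle:
            handle.write("sample\n")
    inputs = [os.path.relpath(path, root) for path in dirs_with_files(root)]

print(inputs)  # ['data1', 'data2']
```

The top-level directory does not appear in the result because it holds no files of its own, which is exactly the situation in which Hadoop finds no input.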
Hadoop doesn't support writing to an output path that already exists, so a job may work fine once, but not after the output path is created. Hadoop sees the existing output path and stops. Developers may need to remove the existing path so that Hadoop can recreate it on the next job iteration, or they can change the output to a unique path for each run.
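A unique output path per run can be generated rather than typed by hand. The following Python sketch appends a UTC timestamp and a short random suffix to a base location; the function name, the run- prefix and the s3://example-bucket/output location are assumptions chosen for illustration.

```python
import uuid
from datetime import datetime, timezone


def unique_output_path(base: str) -> str:
    """Return a new output location under base that differs on every call."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    suffix = uuid.uuid4().hex[:8]  # random part guards against same-second runs
    return f"{base.rstrip('/')}/run-{stamp}-{suffix}/"


# Hypothetical base location, for illustration only.
print(unique_output_path("s3://example-bucket/output"))
```

Because each run writes to a fresh path, old results accumulate and need to be cleaned up separately.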
Amazon S3 is a service, so it doesn't behave like a traditional file system. Any application that involves S3, such as Hadoop EMR tasks, must check and handle S3 errors and retry failed S3 operations appropriately. Otherwise, an application may crash or cause other undesirable results.
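Retrying a failed operation is commonly done with exponential backoff. The Python sketch below shows the general idea for any callable; the function name, default values and the flaky stand-in operation are hypothetical, and a real application would narrow retry_on to the specific errors its S3 client raises.

```python
import random
import time


def retry(operation, attempts=5, base_delay=0.5,
          retry_on=(Exception,), sleep=time.sleep):
    """Call operation(), retrying with exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))


# Stand-in for an S3 call that fails twice before succeeding.
calls = {"count": 0}


def flaky_operation():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


result = retry(flaky_operation, retry_on=(ConnectionError,), sleep=lambda _s: None)
print(result, calls["count"])  # ok 3
```

The sleep parameter is injected so the example runs instantly; leaving it at its default makes the function actually wait between attempts.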
In addition, reduce the rate at which an application calls S3; frequent calls increase the possibility of lag or failure. Also, reduce unnecessary S3 tasks, such as listing objects in S3, which can be expensive.
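One way to cut down on repeated listing calls is to cache the result for a short time. The Python sketch below wraps any listing function with a time-limited cache; the class name, the 60-second default and the fake_list stand-in are assumptions for illustration, and whether slightly stale listings are acceptable depends on the job.

```python
import time


class CachedListing:
    """Cache the results of an expensive listing call for ttl_seconds."""

    def __init__(self, list_objects, ttl_seconds=60.0, clock=time.monotonic):
        self._list_objects = list_objects
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}  # prefix -> (time fetched, result)

    def get(self, prefix):
        now = self._clock()
        hit = self._cache.get(prefix)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # still fresh: no call to the service
        result = list(self._list_objects(prefix))
        self._cache[prefix] = (now, result)
        return result


# Stand-in for a real listing call; counts how often it is invoked.
listing_calls = []


def fake_list(prefix):
    listing_calls.append(prefix)
    return [prefix + "part-00000", prefix + "part-00001"]


listing = CachedListing(fake_list)
listing.get("input/")
listing.get("input/")
print(len(listing_calls))  # 1
```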