Big data projects can be complex and demand significant design and coding expertise. Start small: implement big data platforms slowly and test key functions with simple jobs. Once developers become comfortable using tools for big data on AWS, they can move on to larger, complex jobs.
Complex big data projects often run across clusters of compute resources and rely on a mix of native and third-party tools that don't integrate well, which creates problems for big data on AWS.
Two ways to limit big data problems
Apache Hive is an open source data warehouse and analytics tool that typically runs on Hadoop clusters. Data scientists use Hive to tackle jobs involving big data on AWS without writing Amazon Elastic MapReduce (EMR) programs in more traditional programming languages, like Java. Instead, developers write scripts in Hive Query Language (HiveQL), which resembles SQL but adds more capabilities and handles structured elements such as JSON objects and user-defined data types. Hive provides analytical power for big data jobs, but it is not immune to errors.
A Java Archive (JAR) file is a format developers use to collect and package Java class files, metadata and content into one file. Developers can redistribute that file or even make it executable. JAR clusters can drive big data tasks, but errors can creep in when they run on Amazon EMR. In most cases, a JAR cluster produces errors either when it creates a Hadoop job or when the JAR and mapper attempt to process data within the task itself.
Because Java programs and JAR packaging are complex, start by looking for error details in the syslog file. Once syslog reveals the nature of the error, IT teams can fix the underlying problem and build updated JAR files.
Hive has its own failure modes. For example, older Hive versions can return errors when using certain features, so verify that your IT team is using the latest, fully patched version of Hive for big data on AWS. The syslog file tracks task attempts and can contain messages produced by syntax errors in the Hive script or by master node and other cluster failures. These log messages help teams quickly narrow down the possible root causes of Hive problems under Hadoop.
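As a rough illustration of narrowing down root causes from syslog, the sketch below pulls out the error-level lines from a log. It assumes a log4j-style layout of date, time, level and message, which may not match every cluster's configuration. The function names, the sample lines and the `hypothetical.example` class names are invented for this example.

```python
import re
from collections import Counter

# Assumed layout: "<date> <time> <LEVEL> <message>". Adjust if your logs differ.
LINE_RE = re.compile(r"^\S+ \S+ (?P<level>[A-Z]+) (?P<message>.*)$")


def find_problems(lines, levels=("ERROR", "FATAL")):
    """Return (line_number, level, message) for each line at the given levels."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = LINE_RE.match(line.rstrip("\n"))
        if match and match.group("level") in levels:
            hits.append((number, match.group("level"), match.group("message")))
    return hits


def summarize(hits):
    """Count how many hits were found at each level."""
    return Counter(level for _, level, _ in hits)


# Invented sample lines, standing in for a real syslog file.
sample = [
    "2016-05-01 10:15:02,311 INFO hypothetical.example.Job: map 0% reduce 0%",
    "2016-05-01 10:15:09,870 ERROR hypothetical.example.Mapper: input record could not be parsed",
    "    at hypothetical.example.Mapper.map(Mapper.java)",
    "2016-05-01 10:15:10,002 FATAL hypothetical.example.Driver: job failed",
]

for number, level, message in find_problems(sample):
    print(f"line {number}: {level}: {message}")
```

To scan a real file, pass an open file handle instead of `sample`, for example `find_problems(open("syslog"))`, where the path is whatever location your cluster writes its logs to.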
Hive tasks that access cloud storage resources can also run up costs for enterprises. For example, it can be expensive to list the contents of a cloud storage instance, such as an Amazon Simple Storage Service (S3) bucket. To lower costs, reduce the number of tasks that require a list and delete existing storage objects. Hive and task performance can improve if a developer caches the results of the storage list operation locally on the cluster. Using static partitions on storage contents can also improve performance.
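The caching idea can be sketched in a few lines. This is a minimal illustration, not a Hive or EMR feature: `CachedLister`, its parameters and the stand-in `fake_list` function are all hypothetical, and the five-minute lifetime is an arbitrary assumption. In practice, the wrapped function would be whatever call your tooling uses to list a bucket prefix.

```python
import time


class CachedLister:
    """Wrap an expensive list function and reuse its results for a while."""

    def __init__(self, list_fn, ttl_seconds=300.0, clock=time.monotonic):
        self._list_fn = list_fn    # callable: prefix -> iterable of object keys
        self._ttl = ttl_seconds    # how long a cached listing stays valid
        self._clock = clock        # injectable so the expiry logic can be tested
        self._cache = {}           # prefix -> (time fetched, list of keys)

    def list(self, prefix):
        now = self._clock()
        entry = self._cache.get(prefix)
        if entry is not None and now - entry[0] < self._ttl:
            return list(entry[1])  # cache hit: no call to the storage service
        keys = list(self._list_fn(prefix))
        self._cache[prefix] = (now, keys)
        return list(keys)


# Stand-in for a real storage listing call; it records each invocation.
calls = []


def fake_list(prefix):
    calls.append(prefix)
    return [prefix + "part-0000", prefix + "part-0001"]


lister = CachedLister(fake_list)
first = lister.list("logs/")
second = lister.list("logs/")  # served from the local cache
print(len(calls), first == second)
```

The trade-off is staleness: objects added or removed during the cache lifetime will not appear until the entry expires, so a short lifetime suits data that changes often.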