Avoid common issues with big data on AWS

When our IT team clusters Hive and Java Archive files for big data projects, it receives error messages. How can we keep our AWS big data projects running smoothly?

Big data projects can be complex and demand significant design and coding expertise. Start small: implement big data platforms slowly and test key functions with simple jobs. Once developers become comfortable using tools for big data on AWS, they can move on to larger, more complex jobs.

Complex big data projects often run across clusters of compute resources and rely on native and third-party tools that don't integrate well, which creates problems for big data on AWS.

Two ways to limit big data problems

Apache Hive is an open source data warehouse and analytics tool that typically runs on Hadoop clusters. Data scientists use Hive to tackle jobs involving big data on AWS without writing Amazon Elastic MapReduce (EMR) programs in more traditional programming languages, like Java. Instead, developers write scripts in Hive Query Language (HiveQL), which resembles SQL but adds capabilities such as support for structured elements, like JSON objects, and user-defined data types. Hive provides analytical power for big data jobs, but certain errors and big data problems can occur.

A Java Archive (JAR) file is a package format that collects Java class files, metadata and other content into one file. Developers can redistribute that file or make it executable. JAR clusters can drive big data tasks, but errors can creep in when they run on Amazon EMR. In most cases, a JAR cluster produces errors either when it creates a Hadoop job or when the JAR and mapper attempt to process data within the task itself.

Because Java and JAR are complex, look for error details in the syslog file. Once syslog reveals the nature of the error, IT teams can remedy the big data problems and create updated JAR files.
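As a rough illustration of that syslog review, the Python sketch below pulls ERROR and FATAL entries, along with any stack trace lines that follow them, out of a log. It assumes a log4j-style line layout (date, time, level, logger, message); the function name `extract_errors` and the sample logger names are hypothetical, and a real syslog file may need a different pattern.

```python
import re
from typing import Iterable, List

# Assumption: log4j-style lines such as
# "2016-03-01 10:00:05,456 ERROR some.Logger: message".
LEVEL_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \S+ (TRACE|DEBUG|INFO|WARN|ERROR|FATAL)\b"
)


def extract_errors(lines: Iterable[str]) -> List[str]:
    """Return ERROR/FATAL entries, each joined with its continuation lines."""
    entries: List[str] = []
    keep = False  # True while we are inside an ERROR/FATAL entry
    for raw in lines:
        line = raw.rstrip("\n")
        match = LEVEL_RE.match(line)
        if match:
            # A new log entry starts here; keep it only if it is severe.
            keep = match.group(1) in ("ERROR", "FATAL")
            if keep:
                entries.append(line)
        elif keep and line.strip():
            # No level prefix: treat as a continuation (e.g. a stack trace).
            entries[-1] += "\n" + line
    return entries
```

A team could pass an open file handle for a downloaded syslog file to `extract_errors` and read the returned entries first, before paging through the full log.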

Hive clusters have their own sources of errors. For example, older Hive versions can return errors when using certain features, so verify that your IT team is using the latest, fully patched version of Hive for big data on AWS. A syslog file tracks task attempts and can contain messages produced by syntax errors in the Hive script or by other master node and cluster failures. These log messages help teams quickly narrow the possible root causes of Hive problems under Hadoop.
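To show how log messages can narrow a root cause, the sketch below separates Hive script problems from other failures. It assumes the log contains Hive's `FAILED:` messages and treats `ParseException` and `SemanticException` as signs of a script error; the function name `classify_hive_failures` is hypothetical, and the two-bucket split is a simplification rather than a complete diagnosis.

```python
from typing import Dict, Iterable, List


def classify_hive_failures(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Bucket 'FAILED:' messages into script errors and everything else."""
    buckets: Dict[str, List[str]] = {"script": [], "other": []}
    for raw in lines:
        line = raw.strip()
        idx = line.find("FAILED:")
        if idx == -1:
            continue  # not a failure message
        message = line[idx:]
        # Assumption: parse and semantic exceptions point at the Hive script.
        if "ParseException" in message or "SemanticException" in message:
            buckets["script"].append(message)
        else:
            buckets["other"].append(message)
    return buckets
```

If the `script` bucket is non-empty, the Hive script itself is the first place to look; entries in `other` suggest checking the task, node or cluster instead.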

Hive tasks that access cloud storage resources can also run up costs for enterprises. For example, it can be expensive to list the contents of a cloud storage resource, such as an Amazon Simple Storage Service (S3) bucket. To lower costs, reduce the number of tasks that need to list and delete existing storage objects. Hive and task performance can improve if a developer caches the results of the storage list operation locally on the cluster. Using static partitions for storage contents can also improve performance.
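The caching idea can be sketched in a few lines of Python. This is not Hive's or Amazon EMR's own mechanism; it only illustrates remembering a listing locally for a fixed time so that repeated requests for the same prefix do not trigger another remote call. The class name `ListingCache`, the `list_fn` callable that performs the real listing, and the five-minute default lifetime are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, List, Tuple


class ListingCache:
    """Remember the result of an expensive list call for ttl_seconds."""

    def __init__(
        self,
        list_fn: Callable[[str], List[str]],  # hypothetical remote lister
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._list_fn = list_fn
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, List[str]]] = {}
        self.remote_calls = 0  # how many times the real listing ran

    def list(self, prefix: str) -> List[str]:
        now = self._clock()
        cached = self._entries.get(prefix)
        if cached is not None and now - cached[0] < self._ttl:
            return list(cached[1])  # fresh enough: serve the local copy
        self.remote_calls += 1
        keys = list(self._list_fn(prefix))
        self._entries[prefix] = (now, keys)
        return list(keys)
```

The trade-off is staleness: objects added or deleted during the lifetime of a cached entry stay invisible until it expires, so the lifetime should match how often the storage contents change.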
