Essential Guide

An enterprise guide to big data in cloud computing

A comprehensive collection of articles, videos and more, hand-picked by our editors

How can we avoid common AWS big data problems?

When our IT team clusters Hive and Java Archive files for big data projects, it receives error messages. How can we keep our AWS big data projects running smoothly?

Big data projects can be complex and demand significant design and coding expertise. Start small -- implement big data platforms slowly and test key functions with simple jobs. Once developers become comfortable using Amazon's big data tools, they can move on to larger, more complex jobs.
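For teams that want to start small, a minimal sketch such as the following, using Python and the boto3 EMR client, launches a short-lived test cluster with Hive installed and step logging enabled. The region, instance types, roles and S3 log bucket shown here are hypothetical placeholders, not values from this article.

# Minimal sketch: launch a small EMR test cluster to exercise simple jobs first.
# Assumes boto3 credentials are configured; bucket, roles and region are
# hypothetical placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="big-data-smoke-test",
    ReleaseLabel="emr-5.36.0",                # use the release your team has validated
    Applications=[{"Name": "Hive"}],          # install Hive for simple HiveQL test jobs
    LogUri="s3://my-emr-logs/",               # step logs, including syslog, land here
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,                   # keep the test cluster small
        "KeepJobFlowAliveWhenNoSteps": False, # terminate once the test steps finish
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)

print("Test cluster ID:", response["JobFlowId"])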

Complex big data projects often run across clusters of compute resources, using native and third-party tools that don't integrate well, and that mismatch creates big data problems.

Two ways to limit big data problems

Apache Hive is an open source data warehouse and analytics tool that typically runs on Hadoop clusters. Data scientists use Hive to tackle complex analytical jobs without creating Amazon Elastic MapReduce (EMR) programs in more traditional programming languages, like Java. Instead, developers write scripts in Hive Query Language, which resembles SQL but adds more capabilities and handles structured elements such as JSON objects and user-defined data types. Hive provides analytical power for big data jobs, but certain errors and big data problems can occur.
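As a rough illustration, the following Python sketch submits a Hive script stored in S3 as a step on an existing EMR cluster through boto3. The cluster ID, bucket and script path are hypothetical, and the script itself would hold the HiveQL -- for example, queries that extract fields from JSON with get_json_object.

# Minimal sketch: run a Hive script from S3 as a step on an existing cluster.
# The cluster ID, bucket and script path are hypothetical placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

hive_step = {
    "Name": "hive-json-report",
    "ActionOnFailure": "CONTINUE",       # keep the cluster alive if the script fails
    "HadoopJarStep": {
        "Jar": "command-runner.jar",     # invokes the hive-script tool on the master node
        "Args": [
            "hive-script", "--run-hive-script",
            "--args", "-f", "s3://my-bucket/scripts/report.hql",
        ],
    },
}

result = emr.add_job_flow_steps(JobFlowId="j-EXAMPLE12345", Steps=[hive_step])
print("Submitted Hive step:", result["StepIds"][0])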

A Java Archive (JAR) file is a format developers use to collect and package Java class files, metadata and content into one file. Developers can redistribute the file or even make it executable. JAR clusters can drive big data tasks, but errors can creep in when those clusters run on Amazon EMR. In most cases, a JAR cluster produces errors when it creates a Hadoop job or when the JAR and mapper attempt to process data within the task itself.
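A custom JAR step follows the same shape. The sketch below, again with hypothetical names and paths, adds a packaged JAR as an EMR step so that any failure surfaces as a step error whose details land in the step's log files.

# Minimal sketch: run a packaged JAR as a custom step on an existing EMR cluster.
# The JAR location, main class and arguments are hypothetical placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

jar_step = {
    "Name": "custom-jar-job",
    "ActionOnFailure": "CANCEL_AND_WAIT",        # stop pending steps but keep the cluster
    "HadoopJarStep": {
        "Jar": "s3://my-bucket/jars/analytics-job.jar",
        "MainClass": "com.example.AnalyticsJob", # omit if the JAR manifest sets Main-Class
        "Args": ["s3://my-bucket/input/", "s3://my-bucket/output/"],
    },
}

result = emr.add_job_flow_steps(JobFlowId="j-EXAMPLE12345", Steps=[jar_step])
print("Submitted JAR step:", result["StepIds"][0])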

Because Java and JAR files are complex, look for error details in the syslog file. Once syslog reveals the nature of the error, IT teams can remedy the big data problems and create updated JAR files.

For example, older Hive versions can return errors when certain features are used, so verify that your IT team runs the latest, fully patched version of Hive. The syslog file tracks task attempts and can contain messages produced by syntax errors in the Hive script, as well as by other master node and cluster failures. Log messages help teams quickly narrow the possible root causes of Hive problems under Hadoop.
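One way to pull those logs without logging in to the master node is sketched below: read a step's gzipped syslog from the cluster's S3 log location with boto3. The bucket, cluster ID and step ID are hypothetical, and the path assumes the default EMR log layout of <log URI>/<cluster ID>/steps/<step ID>/syslog.gz.

# Minimal sketch: fetch a step's syslog from the EMR log bucket and scan it
# for error lines. Bucket, cluster ID and step ID are hypothetical placeholders.
import gzip
import boto3

s3 = boto3.client("s3")

log_bucket = "my-emr-logs"
cluster_id = "j-EXAMPLE12345"
step_id = "s-EXAMPLE67890"

key = f"{cluster_id}/steps/{step_id}/syslog.gz"
obj = s3.get_object(Bucket=log_bucket, Key=key)
syslog_text = gzip.decompress(obj["Body"].read()).decode("utf-8", errors="replace")

# Surface obvious error lines before digging into the full log.
for line in syslog_text.splitlines():
    if "ERROR" in line or "Exception" in line:
        print(line)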

Hive tasks that access cloud storage resources can also run up costs for enterprises. For example, it can be expensive to list the contents of a cloud storage instance, such as an Amazon Simple Storage Service (S3) bucket. To lower costs, reduce the number of tasks that require a list operation and delete storage objects that are no longer needed. Hive and task performance can also improve if a developer caches the results of the storage list operation locally on the cluster and uses static partitions on storage contents.
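One way to apply that caching advice is sketched below: list the bucket prefix once with a boto3 paginator, write the keys to a local file on the cluster and reuse that file instead of issuing repeated LIST requests. The bucket, prefix and cache path are hypothetical placeholders.

# Minimal sketch: cache the result of an S3 list operation locally so repeated
# tasks do not issue fresh LIST requests. Bucket, prefix and cache file path
# are hypothetical placeholders.
import json
import os
import boto3

CACHE_FILE = "/tmp/s3_listing_cache.json"

def list_objects_cached(bucket, prefix):
    # Reuse the local cache if it already exists.
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as f:
            return json.load(f)

    s3 = boto3.client("s3")
    keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))

    with open(CACHE_FILE, "w") as f:
        json.dump(keys, f)
    return keys

keys = list_objects_cached("my-data-bucket", "warehouse/events/")
print(len(keys), "objects under prefix")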

Next Steps

Amazon EMR problems can be identified and avoided

Fix AWS credential mistakes in Amazon EMR

Hadoop coding errors impact big data processing

This was last published in February 2017
