An enterprise guide to big data in cloud computing
Big data projects can be complex and demand significant design and coding expertise. Start small -- implement big data platforms slowly and test key functions with simple jobs. Once developers become comfortable using Amazon's big data tools, they can move on to larger, more complex jobs.
Complex big data projects often run across clusters of compute resources, using native and third-party tools that don't integrate well. That poor integration is a common source of big data problems.
Two ways to limit big data problems
Apache Hive is an open source data warehouse and analytics tool that typically runs on Hadoop clusters. Data scientists use Hive to tackle complex analytical jobs without creating Amazon Elastic MapReduce (EMR) programs in more traditional programming languages, like Java. Instead, developers write scripts in Hive Query Language, which resembles SQL but adds more capabilities and handles structured elements such as JSON objects and user-defined data types. Hive provides analytical power for big data jobs, but certain errors and big data problems can occur.
A Java Archive file (JAR) is a format developers use to collect and package Java class files, metadata and content into one file. Developers can redistribute that file or even make it executable. Clusters that run JAR-based jobs can drive big data tasks, but errors can creep in when those jobs run on Amazon EMR. In most cases, the errors arise either when creating the Hadoop job or when the JAR and mapper attempt to process data within the task itself.
Because Java code and JAR packaging are complex, look for error details in the syslog file. Once syslog reveals the nature of the error, IT teams can fix the underlying problem and build an updated JAR file.
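Because a JAR is a ZIP-based archive, a quick look inside it can rule out packaging mistakes before a rebuilt file goes back to the cluster. The sketch below is illustrative only and is not part of any Amazon EMR tooling: it reads the manifest and lists the packaged class files using Python's standard zipfile module. The class name com.example.WordCount and the helper build_demo_jar are hypothetical and exist only so the example runs on its own.

```python
import io
import zipfile


def read_manifest(jar_file):
    """Return the main-section attributes of a JAR's manifest as a dict.

    `jar_file` may be a path or a binary file-like object. Continuation
    lines (which start with a space) are folded into the previous value.
    """
    with zipfile.ZipFile(jar_file) as jar:
        raw = jar.read("META-INF/MANIFEST.MF").decode("utf-8")

    attributes = {}
    last_key = None
    for line in raw.splitlines():
        if not line:
            break  # a blank line ends the main section
        if line.startswith(" ") and last_key is not None:
            attributes[last_key] += line[1:]
        elif ": " in line:
            key, value = line.split(": ", 1)
            attributes[key] = value
            last_key = key
    return attributes


def list_classes(jar_file):
    """Return the names of all .class entries packaged in the JAR."""
    with zipfile.ZipFile(jar_file) as jar:
        return [name for name in jar.namelist() if name.endswith(".class")]


def build_demo_jar():
    """Build a tiny in-memory archive laid out like a JAR (demo only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as jar:
        jar.writestr(
            "META-INF/MANIFEST.MF",
            # Hypothetical main class, used purely for illustration.
            "Manifest-Version: 1.0\r\nMain-Class: com.example.WordCount\r\n\r\n",
        )
        jar.writestr("com/example/WordCount.class", b"placeholder bytes")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    demo = build_demo_jar()
    print(read_manifest(demo))
    demo.seek(0)
    print(list_classes(demo))
```

Pass a real file path in place of the demo buffer to inspect an actual JAR. A missing manifest raises KeyError, which is itself a useful signal that the archive was not packaged as expected.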
Older Hive versions can return errors when using certain features, so verify that your IT team is using the latest, fully patched version of Hive. The syslog file tracks task attempts and can contain messages produced by syntax errors in the Hive script or by master node and other cluster failures. These log messages help teams quickly narrow the possible root causes of Hive problems under Hadoop.
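Searching a long syslog by hand is slow, so a small filter can help surface the lines worth reading first. The following is a minimal sketch, assuming the interesting lines contain the words ERROR or FATAL or mention a Java exception; that pattern is an assumption, not a documented log format, and should be adjusted to what the logs actually contain.

```python
import re
import sys

# Assumed markers of trouble; adjust to match what your logs actually contain.
ERROR_PATTERN = re.compile(r"\b(ERROR|FATAL)\b|Exception")


def find_error_lines(lines, context=0):
    """Return (line_number, text) pairs for lines that match ERROR_PATTERN.

    With context > 0, also include that many lines after each match, which
    helps capture the stack trace that follows a Java exception.
    """
    results = []
    remaining = 0  # lines of trailing context still to emit
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if ERROR_PATTERN.search(text):
            results.append((number, text))
            remaining = context
        elif remaining > 0:
            results.append((number, text))
            remaining -= 1
    return results


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python scan_syslog.py /path/to/syslog
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for number, text in find_error_lines(log, context=2):
            print(f"{number}: {text}")
```

The same function works for syslog files from JAR-based jobs and from Hive jobs, since it only looks at the text of each line.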
Hive tasks that access cloud storage resources can also run up costs for enterprises. For example, it can be expensive to list the contents of a cloud storage location, such as an Amazon Simple Storage Service (S3) bucket. To lower costs, reduce the number of tasks that need to list and delete existing storage objects. Hive and task performance can also improve if a developer caches the results of the storage list operation locally on the cluster and uses static partitions on storage contents where possible.
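The idea behind caching a list operation can be shown without any cloud SDK. The sketch below is a generic illustration, not Hive's or Amazon EMR's own mechanism: it wraps an expensive listing call so that repeated requests for the same prefix reuse the first result. The names CachedLister and fake_list_objects, and the sample keys, are hypothetical stand-ins for a real storage client.

```python
class CachedLister:
    """Cache the results of an expensive listing call, keyed by prefix."""

    def __init__(self, list_objects):
        # list_objects is a hypothetical callable: prefix -> iterable of keys.
        self._list_objects = list_objects
        self._cache = {}
        self.remote_calls = 0

    def list(self, prefix):
        """Return keys under prefix, calling the remote lister at most once."""
        if prefix not in self._cache:
            self.remote_calls += 1
            self._cache[prefix] = tuple(self._list_objects(prefix))
        return self._cache[prefix]

    def invalidate(self, prefix):
        """Forget a cached listing after a task writes or deletes under it."""
        self._cache.pop(prefix, None)


def fake_list_objects(prefix):
    """Stand-in for a real storage listing call (demo data only)."""
    keys = ["logs/a.gz", "logs/b.gz", "tables/t1/part-0"]
    return [key for key in keys if key.startswith(prefix)]


if __name__ == "__main__":
    lister = CachedLister(fake_list_objects)
    for _ in range(3):
        lister.list("logs/")
    print(lister.remote_calls)
```

After three lookups of the same prefix, remote_calls is 1, which is the saving the cache is meant to provide. The trade-off is staleness: any task that writes or deletes objects under a prefix should call invalidate so later listings are refreshed.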