

Spark on AWS helps ignite big data workloads

Developers turn to Hadoop for big data workloads, and Spark is a particularly enticing Hadoop service on AWS. Spark teams up with Elastic MapReduce for fast processing and versatility.

AWS users have access to a growing portfolio of application services, especially those related to data analysis.

And with most organizations awash in more data than they know how to handle, AWS has become an important source of big data tools that offer simpler, more cost-effective ways to build otherwise complex systems, such as a Spark deployment.

AWS offers an array of databases for a variety of needs, including NoSQL with DynamoDB, relational databases with Amazon RDS and Aurora, in-memory caching capabilities through ElastiCache, and data warehouse services such as Redshift. Hadoop addresses a new class of data analysis problem that uses extremely large data sets that must be distributed across many systems, forming a Hadoop compute cluster. Yet the cost and complexity of deploying dozens of systems with a stack requiring several software components makes using Hadoop infeasible for many organizations.

Amazon Elastic MapReduce (EMR), a Hadoop framework for distributed data processing, builds on top of core AWS infrastructure, using Simple Storage Service (S3) for data collection and results and an Elastic Compute Cloud cluster for processing. EMR handles the messy details of Hadoop software setup, including system configuration, deployment and decommissioning, job management, and logging.

Big data developers use the service to access Hadoop tools that are complex, time-consuming and expensive to deploy and configure -- particularly for workloads with a dozen or more servers.

EMR provides a Hadoop Distributed File System (HDFS), MapReduce processing engine, APIs and YARN resource manager. But EMR is also a platform for the following Hadoop services:

  • HBase -- Distributed data store using the HDFS
  • Hive -- Data warehouse with SQL query language
  • Pig -- Hadoop programming framework that transforms SQL-like commands into MapReduce jobs
  • Presto -- SQL query engine compatible with HDFS and S3
  • Spark -- Fast, general-purpose data processing engine, available as a configurable option on EMR

Because of its speed, versatility and multilingual programmability, Spark is perhaps the most interesting Hadoop service on EMR. It features Spark SQL for making low-latency, interactive SQL queries on structured data in a distributed Hadoop/HDFS data set and an MLlib library for scalable, distributed machine learning algorithms on data in a Hadoop cluster. Spark Streaming provides APIs for stream processing that use the same syntax and languages -- specifically, Java, Scala and Python -- tailored for streaming applications on resilient Hadoop clusters. GraphX -- Spark's API for graph processing that unifies extract, transform and load (ETL), exploratory analysis, and iterative graph computation within a Hadoop cluster -- is also included.

When running Spark on AWS -- in a Hadoop cluster that Amazon EMR manages -- the master node runs the YARN resource controller, Spark and any optional modules, such as Spark SQL. The cluster reads data from and writes results to an S3 bucket, and EMR spawns one or more slave nodes for additional processing capacity. AWS users can also install Ganglia on EMR clusters for more sophisticated monitoring of Spark workloads.
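As a rough sketch of how such a cluster is provisioned, the AWS CLI can create an EMR cluster with Spark (and optionally Ganglia) preinstalled. The cluster name, key pair, release label and instance settings below are placeholder assumptions, not values from the article:

```shell
# Launch an EMR cluster with Spark and Ganglia preinstalled.
# Name, key pair, release label and instance settings are placeholders.
aws emr create-cluster \
  --name "spark-cluster" \
  --release-label emr-5.36.0 \
  --applications Name=Spark Name=Ganglia \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --ec2-attributes KeyName=my-key-pair \
  --use-default-roles
```

The `--instance-count` value includes the master node, so this example yields one master and two slave nodes.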

Scaling to process big data

Hadoop -- and, hence, the initial release of EMR -- was designed to process huge, effectively unlimited data sets. But the downside of massive scalability is speed. Hadoop reads and writes everything from disk using HDFS and does one task at a time -- each MapReduce job is a batch process. Therefore, if a developer needs to perform complex data transformations that involve multiple, sequential processing steps or ETL-like transformations, it requires separate jobs. This means it will take a long time for processing to complete, and a fault-tolerant workflow is necessary, as individual jobs may fail, requiring an entire data processing step to be rerun.

In contrast, Spark is designed for speed, interactivity and more complex data transformations -- defined as directed acyclic graphs (DAGs) -- not basic MapReduce algorithms. DAGs are a construct in which each action is a node or vertex that can trigger other actions, but not loop back on itself. These properties make DAGs more flexible than sequential batch jobs, as several tasks can run in parallel and a single task can trigger one or more follow-up actions depending on the result.
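The DAG idea above can be sketched in plain Python. This is a toy topological scheduler, not Spark's actual DAG scheduler; the task names and pipeline are invented for illustration:

```python
from collections import deque

def topological_order(dag):
    """Return tasks ordered so every task runs after its dependencies.

    dag maps each task name to the list of tasks it depends on.
    Raises ValueError if the graph contains a cycle (i.e., is not a DAG).
    """
    # Count unmet dependencies for each task.
    pending = {task: len(deps) for task, deps in dag.items()}
    # Reverse edges: which tasks become runnable when this one finishes?
    unlocks = {task: [] for task in dag}
    for task, deps in dag.items():
        for dep in deps:
            unlocks[dep].append(task)

    ready = deque(t for t, n in pending.items() if n == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for nxt in unlocks[task]:
            pending[nxt] -= 1
            if pending[nxt] == 0:
                ready.append(nxt)  # all dependencies satisfied
    if len(order) != len(dag):
        raise ValueError("cycle detected -- not a DAG")
    return order

# A hypothetical ETL-style pipeline: one extract feeds two transforms
# that can run in parallel, and a final report depends on both.
pipeline = {
    "extract": [],
    "clean": ["extract"],
    "aggregate": ["extract"],
    "report": ["clean", "aggregate"],
}
print(topological_order(pipeline))  # → ['extract', 'clean', 'aggregate', 'report']
```

Note how "clean" and "aggregate" have no dependency on each other, so a DAG engine is free to run them in parallel; a sequence of MapReduce batch jobs could not express that.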

Spark often runs entirely in memory when a data set fits within the available RAM on the cluster's nodes. This RAM-based design gives Spark roughly a 100-times speed advantage over disk-based MapReduce, and in-memory caching accelerates complex DAGs because intermediate results never have to be written to disk.
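The payoff of keeping an intermediate result in memory can be illustrated with a plain-Python analogy (a toy stand-in, not Spark's `cache()` mechanism): two downstream actions that share an expensive intermediate step either recompute it or reuse a materialized copy.

```python
calls = {"count": 0}

def expensive_transform(x):
    """Stand-in for a costly intermediate step (e.g., parsing and filtering a data set)."""
    calls["count"] += 1
    return x * 2

# Without caching, two downstream actions each recompute the intermediate result.
a = expensive_transform(21) + 1
b = expensive_transform(21) - 1
assert calls["count"] == 2

# With caching, the intermediate result is materialized once and reused --
# analogous to Spark keeping an RDD or DataFrame in memory.
calls["count"] = 0
cached = expensive_transform(21)  # compute once, keep in memory
a = cached + 1
b = cached - 1
assert calls["count"] == 1
```

In Spark the same trade-off applies per branch of the DAG: without caching, every action that shares an ancestor recomputes that ancestor's lineage.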

Spark's versatility is a key reason why it has become popular with big data developers. It's best to think of Spark as an application platform, not a prescriptive programming language. Developers can write Spark applications in familiar, high-level languages, using more than 80 high-level operators. Furthermore, Spark supports several libraries, including for SQL and DataFrame API, stream processing, machine learning, graph analysis and parallel computation.

Spark on AWS opens up possibilities

With Spark fully supported within EMR, installation on AWS is no longer a manual process. The EMR documentation walks through the steps, but it's all done through the AWS Management Console, and it's as simple as selecting the Spark application when creating a cluster. Developers submit Spark steps to the EMR Step API for batch jobs, or interact directly with the Spark API or Spark shell on a cluster's master node for interactive workflows.
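For the batch path, a Spark job can also be submitted as an EMR step from the AWS CLI. The cluster ID, bucket and script name below are placeholders, not values from the article:

```shell
# Submit a batch Spark job as an EMR step.
# Cluster ID, bucket and script name are placeholders.
aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps 'Type=Spark,Name=MySparkJob,ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,s3://my-bucket/my-spark-job.py]'
```

The arguments after `Args=` are passed to `spark-submit` on the cluster, so the same pattern works for JARs or Python scripts.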

Using Spark on AWS to solve business problems is a question of imagination, not technology. For example, the AWS blog post introducing Spark support uses the well-known Federal Aviation Administration flight data set -- about 4 GB and more than 162 million rows -- to demonstrate Spark's efficiency. Spark on EMR can find the 10 airports with the most flight departures, the most flight delays over 15 minutes, the most flight delays over 60 minutes and the most flight cancellations. The bigger the EMR cluster, the faster the results. AWS cites other Spark uses, such as content recommendations, log data analysis and real-time web clickstream processing.
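The aggregations described -- top airports by departures or by delays -- boil down to a group-and-count. The sketch below shows that logic in plain Python on a tiny made-up sample; on the real 162-million-row data set, Spark runs the same grouping in parallel across the cluster. The sample records are invented for illustration:

```python
from collections import Counter

# Tiny invented sample of flight records: (origin_airport, delay_minutes).
flights = [
    ("ATL", 5), ("ORD", 20), ("ATL", 0), ("DFW", 75),
    ("ORD", 10), ("ATL", 90), ("LAX", 0), ("ORD", 16),
]

# Airports ranked by number of departures.
departures = Counter(origin for origin, _ in flights)
top = departures.most_common(3)

# Departures delayed more than 15 minutes, grouped by airport.
delayed = Counter(origin for origin, delay in flights if delay > 15)

print(top)
print(delayed.most_common(3))
```

In Spark SQL the same queries would be `GROUP BY` statements over the full data set, distributed across the cluster's nodes.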

Next Steps

Amazon EMR embraces Apache Spark

Process and visualize data with these tools

AWS satisfies customers seeking machine learning techniques
