Amazon Elastic MapReduce moves forward with Apache Spark

Amazon Elastic MapReduce has embraced Apache Spark, reflecting an industry shift around big data analytics.

Amazon Elastic MapReduce now supports Apache Spark, marking an important step in the evolution of Hadoop and big data analytics.

Apache Spark is a distributed processing engine that processes data in memory, which can make big data analytics jobs faster than on Hadoop MapReduce, which writes intermediate data out to disk. Spark also offers Scala, Python, and Java APIs, and includes libraries for SQL, machine learning, graph processing, and stream processing.

"With many tightly integrated development options, it can be easier to create and maintain applications for Spark than to work with the various abstractions wrapped around the Hadoop MapReduce API," Amazon said in its blog post.

One big data analytics company using Amazon Elastic MapReduce (EMR) with Spark is Yelp Inc., a San Francisco-based online hub for consumer business reviews. The company uses Spark to train machine learning models for local ad targeting, according to a company spokesperson.

Spark's in-memory computing is a natural fit for iterative algorithms and has led to significant speed-ups in performance, the spokesperson said. This has helped Yelp to train on larger datasets and improve advertising click-through rates and revenue.
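To illustrate why in-memory reuse suits iterative algorithms, here is a minimal sketch in plain Python. It does not use Spark or any EMR API; the `CountingSource` class, the `fit_slope` function and the toy dataset are hypothetical names invented for this illustration. It simply counts how many times a dataset is read when each training pass goes back to the source, compared with reading it once and keeping it in memory.

```python
# Illustrative only: plain Python, not Spark. All names here are hypothetical.


class CountingSource:
    """Stands in for a dataset on disk and counts each full read."""

    def __init__(self, points):
        self._points = list(points)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._points)


def fit_slope(get_points, iterations=20, learning_rate=0.01):
    """Gradient descent for y = w * x, fetching the data on every pass."""
    w = 0.0
    for _ in range(iterations):
        points = get_points()
        gradient = sum(2 * (w * x - y) * x for x, y in points) / len(points)
        w -= learning_rate * gradient
    return w


# Toy data generated from y = 3x.
data = [(x, 3.0 * x) for x in range(1, 6)]

# Disk-style: every iteration re-reads the source.
disk_source = CountingSource(data)
w_disk = fit_slope(disk_source.load)

# In-memory style: read once, then reuse the cached copy on each pass.
memory_source = CountingSource(data)
cached = memory_source.load()
w_memory = fit_slope(lambda: cached)

print("reads without caching:", disk_source.reads)
print("reads with caching:", memory_source.reads)
```

Both runs arrive at the same model, but the cached run touches the source once instead of once per iteration. In a real cluster that difference would show up as avoided disk and network I/O, which is the kind of saving the Yelp spokesperson describes.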

Analysts said Amazon's move reflects a broader shift in the big data market from MapReduce to Spark as the primary analytics engine.

"Spark may offer better performance, and with options for SQL, streaming, machine learning, and graph analytics, as well as support for a variety of data sources, and it's getting a ton of interest," said Nik Rouda, analyst with the Enterprise Strategy Group located in Milford, Mass. "Whether it will completely replace Hadoop remains to be seen, but it's clearly going to complement -- if not challenge -- the ecosystem."

IBM has put a lot of weight behind Apache Spark, noted Kris Bliesner, CTO of 2nd Watch, Inc., an Amazon Premier Partner in Liberty Lake, Wash. Having an alternative to IBM's implementation on the market is a good thing, Bliesner said.

To Bliesner's taste, however, AWS EMR is a very light version of Hadoop management.

"Folks that are serious about hardcore [high-performance computing] workloads will go to Cloudera, et cetera," Bliesner said. "EMR is quick and easy but no frills."

Beth Pariseau is senior news writer for SearchAWS. Follow @PariseauTT on Twitter.
