The battle for cloud services: Microsoft vs. Amazon

Microsoft’s Apache Hadoop on Windows Azure Preview is the software giant’s gambit to unseat Amazon Web Service’s Elastic MapReduce. Learn which approach better suits your development needs.

There’s little question that the Apache Hadoop software library holds the lion’s share of today’s big data analytics mindshare. Gartner reported in March 2012 that Hadoop’s popularity as a search term on its website increased by 601.8% over 2011. The primary driving forces behind Hadoop’s growing popularity are the rise of big data and social computing hype, widespread enterprise-level acceptance of open source software, a growing pool of skilled Hadoop devops personnel, and Hadoop’s capability to deliver high availability with adequate performance from low-cost clusters of commodity servers. This last feature enables enterprises to deploy Hadoop workloads to the public clouds of IaaS and PaaS providers, substituting pay-per-use charges for capital investments in data centers.

The Apache Software Foundation describes Hadoop as follows:

The Apache Hadoop project develops open source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

Commercial distributions of open source software, such as Red Hat Enterprise Linux, are de rigueur at the enterprise level. Cloudera pioneered commercial Hadoop distribution with a freemium model: the Cloudera Distribution for Hadoop (CDH) is free, but support and licenses for the Cloudera Manager application require payment. Because of its business model and market dominance, many consider Cloudera the “Red Hat of Hadoop.” Yahoo!, the original developer of Hadoop, had converted the heathen, but Cloudera was selling the bibles. So in June 2011 Yahoo! spun off its Hadoop engineering group into Hortonworks, a new entity financed by Benchmark Capital, to generate revenue from Hadoop by competing with Cloudera. Cloudera announced a partnership with IBM in March 2012 to integrate CDH and Cloudera Manager with IBM’s BigInsights platform on premises and in its public SmartCloud service.

Amazon’s Elastic MapReduce

Amazon Web Services (AWS) introduced the Elastic MapReduce (EMR) service on April 2, 2009, making AWS the granddaddy of cloud-based Hadoop services. EMR uses clusters of on-demand EC2 instances to process data stored in S3 or DynamoDB. Specialized on-demand EMR instances range in cost from US$0.105 per hour for Small to US$0.864 per hour for Extra Large Hi-CPU instances, including an EMR surcharge. Standard monthly charges for S3 or DynamoDB storage and per-GB data transfers to and from Amazon data centers apply. You’re charged for each hour or fraction thereof that your instances run.

AWS provides code samples and tutorials in an EMR Getting Started Guide for creating a Streaming Job Flow with the EMR Command Line Interface (CLI), using Linux/UNIX as well as Windows syntax. Alternatively, you can create and execute a sample Contextual Advertising Using Hive and Amazon EMR job flow, as illustrated by Figure 1, from the EMR Management Console, which this blog post describes in detail.
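Streaming job flows let any executable act as the mapper or reducer: Hadoop feeds input lines on stdin and expects tab-delimited key/value pairs on stdout. As a rough illustration of the model (not AWS’s sample code), a minimal word-count mapper and reducer in Python might look like this:

```python
#!/usr/bin/env python
# Sketch of how a Hadoop Streaming job flow works. Hadoop invokes the
# mapper and reducer as separate processes; lines arrive on stdin and
# tab-delimited key/value pairs go out on stdout.
from itertools import groupby

def mapper(lines):
    """Emit one 'word\t1' string per word."""
    for line in lines:
        for word in line.split():
            yield '%s\t%d' % (word.lower(), 1)

def reducer(pairs):
    """Sum counts per word; input must be sorted by key, which the
    Hadoop shuffle phase guarantees between map and reduce."""
    split = (p.split('\t') for p in pairs)
    for word, group in groupby(split, key=lambda kv: kv[0]):
        yield '%s\t%d' % (word, sum(int(c) for _, c in group))

if __name__ == '__main__':
    # Local smoke test that mimics the map -> sort -> reduce pipeline:
    mapped = sorted(mapper(['the quick fox', 'the lazy dog']))
    for out in reducer(mapped):
        print(out)
```

On EMR, the two halves would be uploaded to S3 and named as the mapper and reducer of the streaming step; locally you can approximate the pipeline with `mapper | sort | reducer`.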

Figure 1. Diagram of streaming Elastic MapReduce and Hive job flows. You can run interactive Hive sessions from the CLI or the AWS Management Console. (Image from AWS.)

This article compares creating Hive job flows with the AWS Management Console (see Figure 2), rather than the CLI, because Microsoft’s Apache Hadoop on Windows Azure (AHoWA) services include an Interactive Hive Console with similar features. The Apache Software Foundation describes Hive as follows:

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

Figure 2. The Create a Job Flow page in the AWS Management Console’s Elastic MapReduce tab. Clicking the Create New Job Flow button lets you select the Contextual Advertising sample HiveQL statements that transform ad-server impressions data into a Hive table. Additional MapReduce operations produce a sequential file for summarizing advertisement effectiveness.

AWS updated EMR to the latest Hive version (0.8.1) on 5/31/2012. Hive translates HiveQL statements into MapReduce operations and executes them against Hive tables populated by data in on-premises files or public cloud data stores, such as Amazon S3 or Windows Azure blobs. For example, the following sample HiveQL statement creates a Hive table named impressions with seven fields, using a JSON serializer/deserializer (SerDe) to read ad-server impression log files stored in JavaScript Object Notation (JSON) format in an s3 …/tables/impressions folder:

CREATE EXTERNAL TABLE impressions (
   requestBeginTime string,
   adId string,
   impressionId string,
   referrer string,
   userAgent string,
   userCookie string,
   ip string )
PARTITIONED BY (dt string)
ROW FORMAT
    serde 'com.amazon.elasticmapreduce.JsonSerde'
    with serdeproperties ( 'paths'='requestBeginTime, adId,
                                    impressionId, referrer,
                                    userAgent, userCookie, ip' )
LOCATION '${SAMPLE}/tables/impressions' ;
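The SerDe’s 'paths' property maps top-level JSON keys onto the table’s columns, one log line per row. To make that concrete, here is a hypothetical impression record sketched in Python; the field names match the HiveQL statement, but the values are invented for illustration:

```python
import json

# A hypothetical ad-server impression record with the seven fields the
# JsonSerde's 'paths' property projects onto Hive columns (all values
# are invented for illustration).
record = {
    "requestBeginTime": "1239610346000",
    "adId": "m9nwdoP4",
    "impressionId": "1E5Q7R6g",
    "referrer": "example.com",
    "userAgent": "Mozilla/5.0",
    "userCookie": "wQ8aVCJa",
    "ip": "67.189.155.225",
}

# Each line of the log file is one JSON object; the SerDe parses it
# and pulls out the named paths in order.
line = json.dumps(record)
parsed = json.loads(line)
paths = ['requestBeginTime', 'adId', 'impressionId',
         'referrer', 'userAgent', 'userCookie', 'ip']
row = [parsed[p] for p in paths]   # what Hive sees as one table row
```

Any JSON keys not listed in 'paths' are simply ignored, which is why the SerDe approach tolerates evolving log formats.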

The Contextual Advertising job flow runs the preceding statement, which is stored in an S3 script file, to create the Hive table for subsequent analysis. A second CREATE EXTERNAL TABLE statement generates a clicks table from ad-click log data, and another statement joins the impressions and clicks tables. If you use the recommended Large instance size, which costs US$0.42 per instance-hour, with the recommended one master and two core instances, the cost will be US$1.26. With default Small instances, the cost drops to US$0.315. Total execution time is about 20 minutes with Small instances. After execution completes, the Management Console shuts down all running instances.
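Because EMR bills each instance for every started hour, those costs are just instance count × hourly rate × hours, with partial hours rounded up. A quick sanity check of the figures above (rates as quoted in the article; pure arithmetic, no AWS API calls):

```python
import math

def emr_cost(instances, rate_per_hour, runtime_hours):
    """Estimate an EMR job-flow cost: each instance is billed for
    every started hour, so partial hours round up to a whole hour."""
    billed_hours = math.ceil(runtime_hours)
    return instances * rate_per_hour * billed_hours

# One master + two core instances, ~20-minute job (billed as 1 hour):
large_cost = emr_cost(3, 0.42, 20 / 60)    # Large instances
small_cost = emr_cost(3, 0.105, 20 / 60)   # Small instances
print('Large: US$%.2f, Small: US$%.3f' % (large_cost, small_cost))
```

The round-up rule is why a 20-minute run costs the same as a full hour; a job that ran 61 minutes would double the bill.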

Further operations generate a feature_index table that can be used to estimate the chance of clicks on an advertisement. Producing these estimates requires saving a HiveQL statement in an S3 script, selecting it instead of the sample script in step 2 of the Job Flow process, and viewing the resulting S3 file in the Management Console’s S3 tab.

Microsoft’s Apache Hadoop on Windows Azure Services Preview

The SQL Server Big Data team announced an invitation-only Community Technology Preview (CTP) of Apache Hadoop on Windows Azure Services on December 14, 2011, which the team expected to release to the public in early 2012. Microsoft partnered with Hortonworks to create a service that offers core Hadoop/MapReduce features, JavaScript libraries for writing MapReduce programs in JavaScript and running jobs from standard Web browsers, and an interactive JavaScript/Hive console for writing and executing HiveQL statements. Analysts using Excel and other Microsoft business intelligence (BI) tools can download a Hive ODBC driver and Add-In for Excel, which let them issue HiveQL queries to analyze structured or unstructured Hadoop data with BI tools, such as PowerPivot and Power View. Prospective AHoWA users must fill out a brief survey to receive an invitation code by e-mail. Logging in to the AHoWA site with the invitation code opens the Request a New Cluster page (see Figure 3). There’s no charge for AHoWA resources consumed during the preview period. However, clusters are reclaimed after 48 hours; you can renew a cluster for 24 hours during the last 6 hours of its lifetime.

Figure 3. The AHoWA website’s Metro-ized Create a New Cluster page. Specifying a unique DNS name, selecting a cluster size and providing administrative credentials enables the Request Cluster button. Provisioning a large cluster of one head node and four worker nodes takes only a few minutes.

After creating the cluster, you can run one of nine sample Apache MapReduce, Pig, Sqoop and Mahout programs. Alternatively, you can set up a Windows Azure Marketplace DataMarket offering, a Windows Azure blob container or an Amazon S3 folder as the data source for a Hive table, as described for the feature_index data in this blog post (see Figure 4).

Figure 4. The Upload from Amazon S3 form. The Manage Cluster page’s Setup S3 button opens this form, which requires your AWS Access Key and Secret Key for authentication. You specify the URL for the S3 data source folder in the HiveQL statement.

The following HiveQL statement, typed in the text box below the data display area, creates a local feature_index Hive table with three columns in Hadoop SEQUENCEFILE format for subsequent queries:

CREATE EXTERNAL TABLE feature_index (
   feature STRING,
   ad_id STRING,
   clicked_percent DOUBLE )
COMMENT 'Amazon EMR Hive Output'
STORED AS SEQUENCEFILE
LOCATION 's3n://oakleaf-emr/hive-ads/output/2012-05-29/feature_index';

Clicking the Evaluate button executes the statement, clears the text box and creates a link to the data source in about four seconds (see Figure 5). SELECT queries download data from the S3 data source.

Figure 5. Confirmation of execution of a HiveQL query. Viewing job logs requires a Remote Desktop Protocol (RDP) connection to the Windows Azure High Performance Cluster.

Creating a Hive table adds its name to the Tables list and its column names to the Columns list. Executing a SELECT * FROM feature_index LIMIT 20 query displays the first 20 rows of the table (see Figure 6).

Figure 6. The first 20 rows of the feature_index table. It took 7.265 seconds to execute a simple HiveQL SELECT query because of Internet latency and a relatively slow DSL connection. The ads in this selection received no clicks.

The “Applying the Heuristic” section of AWS’ Contextual Advertising article suggests executing the following sample HiveQL query against the feature_index table “to see how it performs for the features 'ua:safari' and 'ua:chrome'”:

SELECT ad_id, -sum(log(if(0.0001 > clicked_percent, 0.0001, clicked_percent))) AS value
FROM feature_index
WHERE feature = 'ua:safari' OR feature = 'ua:chrome'
GROUP BY ad_id
LIMIT 100;

According to the article:

The result is advertisements ordered by a heuristic estimate of the chance of a click. At this point, we could look up the advertisements and see, perhaps, a predominance of advertisements for Apple products.
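The heuristic in the query above sums -log(clicked_percent) per ad over the selected features, clamping percentages below 0.0001 to avoid log(0); smaller sums correspond to a higher estimated chance of a click. A rough Python equivalent of the clamp-log-sum logic (the sample rows are invented for illustration, not real EMR output):

```python
import math

def ad_scores(rows, features):
    """Mirror of the HiveQL heuristic: for each ad, sum
    -log(clicked_percent) over the matching features, clamping
    percentages below 0.0001 to avoid log(0). Smaller scores mean
    a higher estimated chance of a click."""
    scores = {}
    for feature, ad_id, clicked_percent in rows:
        if feature in features:           # the WHERE clause
            clamped = max(0.0001, clicked_percent)
            scores[ad_id] = scores.get(ad_id, 0.0) - math.log(clamped)
    return scores                          # the GROUP BY ad_id result

# Invented rows in feature_index's (feature, ad_id, clicked_percent)
# shape:
rows = [
    ('ua:safari', 'ad1', 0.05),
    ('ua:chrome', 'ad1', 0.02),
    ('ua:safari', 'ad2', 0.0),      # never clicked -> clamped to 0.0001
    ('os:windows', 'ad2', 0.9),     # filtered out by the WHERE clause
]
scores = ad_scores(rows, {'ua:safari', 'ua:chrome'})
best = min(scores, key=scores.get)  # lowest score = best click chance
```

The clamp matters for rows like ad2’s: without it, a single never-clicked feature would produce log(0) and break the sum instead of merely penalizing the ad heavily.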

Figure 7 shows the result of executing the preceding query to display ads with the highest click percentage:

Figure 7. The first 15 rows of 100 returned for the highest percentage of clicks for two features. Hive History data (not shown) lists the steps and durations of the two MapReduce jobs executed.

If you’re interested in integrating PowerPivot for Excel with data generated with the Interactive Hive console, see my Using Excel 2010 and the Hive ODBC Driver to Visualize Hive Data Sources in Apache Hadoop on Windows Azure blog post. To give the Console a test run with Windows Azure blobs as a data source, see my Using Data from Windows Azure Blobs with Apache Hadoop on Windows Azure CTP post.


Comparing the services

Amazon’s EMR is a seasoned Hadoop/MapReduce veteran compared to Microsoft’s AHoWA, which is still in the preview stage. Both offer the full gamut of Apache Hadoop Core features, but AHoWA wins the usability contest with its Interactive Hive and JavaScript consoles. If your analytics team uses Excel or other Microsoft BI tools, the Hive ODBC driver and Hive Add-In for Excel make AHoWA the winner in the added-features department.
