The battle between Amazon and Google for cloud domination has many facets, but among the most prominent is big data cloud services.
Google famously invented MapReduce, but AWS has the more widely used platform for the technology today. Streaming data analytics could be the next frontier, and Google claims its new Dataflow technology trumps MapReduce.
"Google can't spend the next year just duplicating what AWS offers," said James Staten, analyst with Forrester Research, based in Cambridge, Mass. "They have to offer something that's uniquely Google that differentiates the platform, and it's very clear that they have staked out big data as something they believe they can differentiate on."
Thirty percent of 375 respondents to a recent analyst survey said public cloud infrastructure as a service will be the IT trend with the biggest effect on their organizations' big data analytics strategies, and 34 percent said big data software as a service will have the biggest impact. The findings appear in Enterprise Data Analytics Trends: Market Drivers, Organizational Dynamics, and Customer Expectations, a May ESG report by Nik Rouda.
There is a high level of overlap between big data and public cloud trends, Rouda said in the report. Both cloud models are also being considered as places to aggregate and analyze vast and previously untapped volumes of data.
Today's battlegrounds: MapReduce and BigQuery
Amazon Web Services (AWS) has built a strong business with its Elastic MapReduce service, first launched in 2009. Today, big companies and startups alike use the service to glean insights from massive data stores, such as customer purchasing behavior and human genome mapping.
"The Hadoop space is really tied to MapReduce, so there's been a lot of market success with big leaders in this space … all promoting it and getting good traction," Rouda said.
Google's App Engine MapReduce, however, remains experimental.
Google's most popular big data cloud service is BigQuery, which allows users to perform SQL-like queries on large sets of data.
Workiva, a financial reporting software developer, pushes all its application logs and application analytics information to BigQuery to run analysis on things like application performance, feature usage and tracking trends in usage over time, according to Dave Tucker, senior director of platform development for the company, based in Ames, Iowa. Putting a SQL-like interface on big data queries has also shaped emerging big data trends, Rouda said.
MapReduce seems to be fading in popularity, Rouda said, in favor of applications like Spark, which do real-time processing, and projects that use SQL queries to search big data rather than writing to MapReduce.
Some big data practitioners on AWS disagree.
"To me, MapReduce is just a concept, the concept of how you process volumes of data by distributing them out, collapsing them into summaries and then merging the summaries," said Ed Abrams, principal software architect for SynapDx Corp., a biotech research firm based in Lexington, Mass. "I can't imagine that ever going away."
On the horizon: Streaming data analysis
In June, Google began a private beta program for Dataflow, a new data analytics process that company officials described as another evolution of MapReduce, capable of examining a real-time stream of events and implementing multi-step processing pipelines.
This contrasts with MapReduce, which is used for batch analysis, Rouda said.
"Maybe you want to find, 'Show me all the customers who bought today, then show me all the customers in the Northeast who bought today, then show me if there were discounts,'" Rouda said. "It's definitely more real-time than MapReduce."
One of Amazon's newest services, Kinesis, is similar, particularly in that it processes real-time streams of data. Kinesis has been generally available for more than 10 months.
As for how Google will fare when Dataflow becomes generally available, Rouda said it's a new generation of technology the market will have to wrap its mind around.
"If you're limited finding people who can program for Elastic MapReduce, your numbers for Dataflow are going to be in the single digits," he said. "At this point it feels pretty targeted mostly for net-new applications being built in Google's compute cloud and using Google's data store."