Whether it's human genome data or restaurant recommendations, the world is inundated with data, and huge swaths of that information have entered the public cloud through AWS big data customers.
Business intelligence and data analytics are among the most popular uses for cloud computing, according to a survey conducted by TechTarget last fall. Out of 456 total respondents, 36.2% said they use the cloud for these workloads.
Big companies choose to put big data into the Amazon Web Services (AWS) cloud because massive machine resources, from compute to networking and storage, are instantly available, and its many data centers worldwide offer a place for collaborators -- from researchers to programmers and data scientists -- to meet.
The term "big data" refers to large volumes of data, but these workloads can also be defined by their velocity -- the high speed at which data comes in to be analyzed as well as the real-time nature of the processing done to analyze it -- and the variety of data sources involved.
How big is big data?
Yelp Inc., a San Francisco-based online hub for consumer business reviews, has millions of users and millions of reviews to sort through; it sees 250 GB in compressed logs per day and uses a set of databases that hold more than 7 terabytes (TB) of data.
The data it aggregates for analysis using AWS includes reviews, user information, business information, behavioral data -- such as check-ins -- and unstructured data -- such as photos.
Yelp mines this data to improve its targeted advertising, search results and recommendations, and to filter out spam and abuse. The company uses a combination of Amazon's Elastic MapReduce (EMR), Simple Storage Service (S3) and home-grown open source utilities.
HubSpot Inc., a digital marketing software as a service (SaaS) company based in Cambridge, Massachusetts, captures between 500 million and 1 billion new data points each month through its marketing analytics platform. It averages tens of billions of records in that platform, all of which are processed for real-time dashboards and conversion rates, as well as for regularly scheduled effectiveness campaigns.
HubSpot has also sent some three billion emails through its email marketing platform, and applies big data analytics to measure recipients' responses to email pitches, to know whether they open the message and the amount of time they spend reading it. Finally, HubSpot built its contact management system on open source big data tools Hadoop and Apache HBase, so contacts are sorted according to recent activity rather than traditional data attributes.
Illumina Inc., based in San Diego, manufactures and markets systems to analyze human genome data. It has 2,500 instruments connected to its application, called BaseSpace, hosted on AWS. So far over 100,000 sequencing runs of between *5 GB and 600 GB each have been uploaded and the system has 13,000 individual users. Researchers have launched about 20,000 applications on top of BaseSpace; the company's S3 repository holds over *300 million files and adds 10 million files per week.
LexisNexis, which provides computer-assisted research and risk-assessment services, is among the oldest companies to specialize in data analytics and aggregates between three and four petabytes (PB) of data to back its services. The company's High Performance Computing Cluster Systems subsidiary offers a customizable data analytics platform that can be run on top of AWS.
Infrastructure as a service: The bedrock for big data
These companies differ in which AWS services they use, with the exception of two foundational services: the Elastic Compute Cloud (EC2) for compute and S3 for storage. These are staples in virtually every deployment on AWS, but certain features differentiate them from other clouds for big data purposes.
While it's generally a distributed computing problem, big data can require big servers.
For Illumina, information that arises from gene sequencers is in a raw format -- it's all the puzzle pieces but not the actual puzzle. A process called alignment combines all those sub-segments into a single genome.
"It turns out to be just as efficient for some of these genome alignments … if you can put it on a big computer," said Alex Dickinson, senior VP of strategic initiatives for the company. "Amazon seemed to be unique in terms of offering very large instances with about 100 GB of RAM and 12-core high performance processors."
These instances, part of the I2 line of EC2 instances, were crucial for HubSpot's continued use of cloud for its massive amounts of data.
"Most of us wanted big data to run on very commoditized servers, but the reality is the more metal that you can give to it the better it is," said Jim O'Neill, CIO for HubSpot. Depending on the workload, the company has seen between four and eight times the throughput on these instances over those it had used.
HubSpot had used a colo environment from 2012 to 2013, before Amazon offered high-density, high-compute servers, but moved toward AWS once those options were available, O'Neill said.
Virtual Private Cloud networking and EC2 security groups are also instrumental for HubSpot's analyses.
"Between the ability to move resources across these large pools of computing and memory capacity, and the ability to have the network changes done through the same API, it means that no humans are involved," O'Neill said. "It gives us the ability to scale our application in 15 minutes or to replicate entire data sets and reprocess billions of data points in a few hours. I don't see traditional networks being able to do [that as] easily, because they tend to require firewall or switch changes."
App development takes shape around AWS big data services
Once applications mature beyond the basic building blocks of cloud infrastructure, most experienced big data practitioners also use their own open-source elbow grease to get a leg up on competition. They develop apps and libraries that extend the pre-canned services available from AWS.
Elastic MapReduce, a hosted version of Apache Hadoop, the open source implementation of the MapReduce programming model popularized by Google, is at the center of Yelp's data processing in the cloud. The company actually operates its own data centers for most workloads, according to engineering manager Jim Blomo, as it had already built them before considering AWS. But it sends data to S3, and then to EMR for batch processing, before feeding the results back into the on-premises data center for display on its website.
"EMR offered a sweet spot between DIY and full service," Blomo said. "It uses an open source library, so we can use third-party tools, avoid vendor lock-in and dig into the source code if we have questions about implementation, but we don't have to manage software upgrades and cluster configuration."
However, Yelp has constructed several of its own open-source applications to make EMR more efficient and to integrate it with on-premises systems. The mrjob library Yelp put together allows Hadoop jobs to be coded in Python and to be correlated as part of the overall process; mrjob also spins up cluster resources and can repeat cluster sizes and configurations to streamline resources and cut costs.
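To make the map-and-reduce pattern behind such Python jobs concrete, here is a minimal sketch in plain Python. It does not use mrjob or Hadoop, and it is not Yelp's code; the function names and the sample review lines are invented for illustration, and everything runs in a single process rather than across a cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def mapper(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (word, 1) for every word in one input line."""
    for token in line.lower().split():
        word = token.strip(".,!?")
        if word:  # skip tokens that were only punctuation
            yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group every emitted value under its key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse all values for one key into a total."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence on in-memory input."""
    pairs = (pair for line in lines for pair in mapper(line))
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())


# Hypothetical input, standing in for lines read from a log or review dump.
SAMPLE_REVIEWS = [
    "Great tacos, great service!",
    "Service was slow.",
]

if __name__ == "__main__":
    print(run_job(SAMPLE_REVIEWS))
```

On a real cluster, the map and reduce steps would run in parallel on many machines and the framework would handle the shuffle; the sketch only shows how the three steps fit together.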
"For a lot of our jobs we take data from MySQL, dump it out and then upload it into S3," Blomo said. "The advantage there is we can have almost an arbitrary number of people going over database data without having to provision 20 databases for different usages." Another open source package called S3mysqldump handles this process.
A utility called EMR Instance Optimizer (EMRio) looks at EMR usage and suggests billing improvements that can save money through the use of Reserved Instances. An app called Tron does batch process scheduling and allows for job flow pooling, so that compute time, rounded up to the nearest hour on EC2, is used as efficiently as possible by different jobs that share the EMR system.
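The effect of hourly rounding on pooled jobs can be shown with some simple arithmetic. The sketch below is an illustration only, not Tron's scheduling logic: the job durations are hypothetical, the jobs are assumed to run one after another on a cluster of the same size, and cluster startup time is ignored.

```python
import math
from typing import List


def billed_hours_separate(job_minutes: List[int]) -> int:
    """Each job gets its own cluster, so each run is rounded up to a full hour."""
    return sum(math.ceil(minutes / 60) for minutes in job_minutes)


def billed_hours_pooled(job_minutes: List[int]) -> int:
    """Jobs share one cluster back to back, so only the total is rounded up."""
    return math.ceil(sum(job_minutes) / 60)


# Hypothetical durations, in minutes, for three short batch jobs.
JOB_MINUTES = [10, 25, 15]

if __name__ == "__main__":
    print("separate clusters:", billed_hours_separate(JOB_MINUTES), "hours per node")
    print("pooled cluster:   ", billed_hours_pooled(JOB_MINUTES), "hours per node")
```

Under these assumptions, three short jobs that would each be billed a full hour on separate clusters fit into a single billed hour when pooled.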
Other companies have skipped EMR entirely and prefer to go their own way; LexisNexis sees EMR as a competitor to its HPCC Systems technology, and argues that its own platform can process more kinds of data more efficiently. The company offers 'one-click' deployment of HPCC on EC2 and S3 for customers, who can also host the application on-premises.
However, custom application development isn't always necessary for aspiring big data gurus, according to HubSpot's O'Neill. HubSpot doesn't use EMR because its workloads require transactional databases as well as MapReduce processing.
"With big data some assembly is required … [big data apps] take a fair amount of systems data and operational experience to run," O'Neill said. "If you can get by using EMR because you don't have very specialized workloads, I would recommend that."
*Statements corrected after initial publication.