As an enterprise's pool of data grows each year, an otherwise valuable business asset can turn into a flood of noise that drowns out decision-makers.
To harness its unstructured, streaming and transactional data, an organization will sometimes use a data lake. In this model, enterprise data resides in a single repository and is stored in raw -- but easily accessible -- form, ready for applications to pull, transform and process relevant subsets as needed. Public clouds offer a variety of features conducive to data lakes, such as scalability, availability and diverse storage capabilities that include databases and services for data transformation and analytics.
However, most enterprises currently generate and store data on premises. This creates several challenges when they decide to build a data lake in AWS or on another public cloud platform, including:
- securely replicating data to the cloud;
- defining how and where to store the data and how to extract it; and
- transforming data for particular applications and determining how and where to process the results.
AWS provides Quick Start guides, along with various other services, to help an IT team create a data lake in AWS and then transform that information for analytics.
Organizations can move large amounts of data to AWS via either a network connection or transportable drives. Because of the large quantities involved when an enterprise aggregates data, a Direct Connect high-speed, private link is the best network connection option, as speeds can reach up to 10 Gbps. Even over multi-gigabit network pipes, however, it takes time to replicate terabytes of data. For example, 20 TB of data requires more than 51 hours to copy over a 1 Gbps link.
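The transfer-time arithmetic is easy to reproduce. The sketch below is a rough estimate, not a sizing tool: it assumes decimal terabytes, and the `efficiency` parameter (the fraction of the nominal line rate actually achieved) is an illustrative assumption rather than a figure from AWS. At the ideal line rate, 20 TB over 1 Gbps works out to about 44 hours; a figure above 51 hours is consistent with effective throughput somewhat below line rate.

```python
def transfer_hours(terabytes: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Estimate hours needed to copy `terabytes` (decimal TB) over a link.

    `efficiency` is the assumed fraction of the nominal line rate that is
    actually achieved (1.0 means ideal, with no protocol overhead).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    bits = terabytes * 1e12 * 8                      # decimal terabytes -> bits
    seconds = bits / (link_gbps * 1e9 * efficiency)  # bits / effective bits per second
    return seconds / 3600


if __name__ == "__main__":
    for gbps in (1, 10):
        ideal = transfer_hours(20, gbps)
        derated = transfer_hours(20, gbps, efficiency=0.85)  # 85% is an assumption
        print(f"20 TB over {gbps} Gbps: {ideal:.1f} h ideal, {derated:.1f} h at 85% efficiency")
```

Running the same function with your own data volume and link speed gives a quick sense of whether a network transfer is practical at all.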
AWS acknowledged the roadblock this presents to organizations with huge data footprints. In response, the cloud provider unveiled Snowball, a portable, ruggedized storage device with associated software, to simplify batch transfers of multi-terabyte data sets. Snowball, which comes in 50 TB and 80 TB sizes, includes a 10GBase-T network connection, and its software enables you to use multiple devices in parallel to simplify the transfer of 100-plus TB data sets. When you return the Snowball, AWS transfers the data to your data lake, where other cloud services can extract it.
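As a back-of-the-envelope check, the number of devices needed for a parallel transfer is just a ceiling division. The sketch below uses the 50 TB and 80 TB sizes mentioned above and assumes, optimistically, that the full nominal capacity is usable; the function name is hypothetical.

```python
import math

SNOWBALL_SIZES_TB = (50, 80)  # nominal sizes named in the article


def snowballs_needed(dataset_tb: float, device_tb: int) -> int:
    """Return how many devices of `device_tb` nominal capacity hold `dataset_tb`.

    Assumes the whole nominal capacity is usable, which is optimistic.
    """
    if dataset_tb <= 0 or device_tb <= 0:
        raise ValueError("sizes must be positive")
    return math.ceil(dataset_tb / device_tb)


if __name__ == "__main__":
    for size in SNOWBALL_SIZES_TB:
        print(f"250 TB needs {snowballs_needed(250, size)} x {size} TB devices")
```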
The primary purpose of a data lake is to facilitate the extraction and transformation of data into subsets for applications to use. For example, Spark can analyze data held in a Hadoop file system. S3 is an excellent destination for a data lake in AWS, as it:
- supports multiple data formats;
- easily feeds other AWS storage and database services;
- provides the cheapest form of nonarchival AWS storage;
- decouples storage from compute to make it operationally efficient; and
- adds flexibility to support clusterless architectures with serverless services, such as Lambda (event-driven functions), Athena (interactive SQL query), Glue (data transformation) and Macie (data security and access control).
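Because S3 is an object store, how you name objects largely determines how easily downstream services can pull relevant subsets. One common convention, assumed here rather than prescribed by the article, is a Hive-style partitioned prefix (`year=.../month=.../day=...`), which query engines can typically treat as partitions. The `raw/` zone name, the source and data set names, and the helper function are all hypothetical.

```python
from datetime import date


def raw_object_key(source: str, dataset: str, day: date, filename: str) -> str:
    """Build an S3 object key for raw data using Hive-style date partitions.

    The layout (raw/<source>/<dataset>/year=/month=/day=/) is an assumed
    convention for illustration, not an AWS requirement.
    """
    return (
        f"raw/{source}/{dataset}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
        f"{filename}"
    )


if __name__ == "__main__":
    print(raw_object_key("crm", "orders", date(2018, 3, 7), "part-0001.json"))
```

Keeping raw data under a predictable prefix like this leaves the original files untouched while letting each application read only the date range it needs.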
AWS Glue is an on-demand extract, transform, load (ETL) service that can automatically find and categorize AWS-hosted data and can help an IT team build a cloud data processing pipeline. Glue uses crawlers to scour data sources and build a metadata catalog, applying either custom or built-in classifiers for commonly used data types, such as CSV, JSON, various log file formats and databases reachable over JDBC.
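To make the crawler-and-classifier idea concrete, here is a toy version in plain Python. It is not how Glue is implemented and does not call any Glue API; it simply tries a JSON classifier, then a CSV classifier, and records the detected format and column names in a dictionary standing in for the metadata catalog. All function names and sample data are made up for illustration.

```python
import csv
import io
import json


def classify(sample: str) -> str:
    """Return 'json', 'csv' or 'unknown' for a text sample (toy heuristic)."""
    text = sample.strip()
    if not text:
        return "unknown"
    try:
        json.loads(text)
        return "json"
    except ValueError:
        pass  # not valid JSON, so try the next classifier
    rows = list(csv.reader(io.StringIO(text)))
    widths = {len(row) for row in rows if row}
    # Treat as CSV only if there are several rows with one consistent width > 1.
    if len(rows) >= 2 and len(widths) == 1 and widths.pop() > 1:
        return "csv"
    return "unknown"


def build_catalog(sources: dict) -> dict:
    """Map each source name to its detected format and column names."""
    catalog = {}
    for name, sample in sources.items():
        fmt = classify(sample)
        entry = {"format": fmt, "columns": []}
        if fmt == "csv":
            entry["columns"] = next(csv.reader(io.StringIO(sample.strip())))
        elif fmt == "json":
            record = json.loads(sample)
            if isinstance(record, dict):
                entry["columns"] = sorted(record)
        catalog[name] = entry
    return catalog


if __name__ == "__main__":
    samples = {
        "orders.csv": "id,amount\n1,9.50\n2,12.00\n",
        "event.json": '{"user": "a", "action": "click"}',
        "notes.txt": "free text",
    }
    for name, entry in build_catalog(samples).items():
        print(name, entry)
```

The point of the catalog is that later steps can look up a data set's format and schema without reopening the raw files.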
Glue automatically generates scripts in Scala or PySpark with customizable Glue extensions that can clean data and perform other ETL operations. For example, a script could perform an ETL task and use a relational format to store data in a different repository, such as Redshift.
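The shape of such an ETL task can be sketched without any AWS dependencies. The example below is plain Python, not a Glue-generated PySpark or Scala script, and it uses an in-memory SQLite database purely as a stand-in for a relational target such as Redshift. The sample records, table name and cleaning rules are assumptions chosen for illustration.

```python
import sqlite3

# Hypothetical raw records, as they might arrive from a semi-structured source.
RAW_RECORDS = [
    {"order_id": "1001", "customer": " Alice ", "amount": "19.99"},
    {"order_id": "1002", "customer": "Bob", "amount": ""},          # missing amount
    {"order_id": "1001", "customer": "Alice", "amount": "19.99"},   # duplicate id
    {"order_id": "1003", "customer": "carol", "amount": "5"},
]


def transform(records):
    """Drop incomplete rows, trim and normalize names, cast types, de-duplicate."""
    seen = set()
    for rec in records:
        if not rec["amount"] or rec["order_id"] in seen:
            continue
        seen.add(rec["order_id"])
        yield int(rec["order_id"]), rec["customer"].strip().title(), float(rec["amount"])


def load(rows, conn):
    """Write cleaned rows into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(records, conn):
    """Run transform and load, then return the stored rows ordered by id."""
    load(transform(records), conn)
    return conn.execute(
        "SELECT order_id, customer, amount FROM orders ORDER BY order_id"
    ).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")  # stand-in for the warehouse
    for row in run_etl(RAW_RECORDS, connection):
        print(row)
```

In a real pipeline, the extract step would read from the data lake and the load step would target the warehouse, but the clean-cast-deduplicate-load sequence stays the same.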
Data scientists can use the Glue metadata catalog with Apache Hive and associated tools, like Presto and Spark, for SQL queries and analytics processing, respectively. AWS also provides equivalent data analytics services of its own, such as Athena, Elastic MapReduce (EMR) and Redshift Spectrum.
Data processing, warehousing
AWS provides several data services that can store, segment and analyze the ETL output from Glue, though the application determines the best service match. Redshift fits traditional data warehousing needs, such as online analytical processing or predictive analytics with full SQL syntax.
A Hadoop or Spark cluster, such as one running on EMR, can better analyze log files or streaming data for system security analysis and clickstream processing. Here, Glue output could feed EMR or an Amazon machine learning service.
Some organizations use both types of data analysis. For example, Nasdaq was one of the original Redshift users, as it replaced its on-premises data warehouse for daily transaction data. But the company also wanted to mine and analyze its vast and growing archive of historical data; S3 and EMR were a better fit for this task. The general model for this type of hybrid design uses Redshift -- and possibly Athena -- for interactive queries and real-time predictions and then EMR with Spark and other analytical tools for batch prediction across a larger data set.
Many of the objections to Glue and AWS-based data pipelines and warehouses come down to data movement. In most cases, the majority of an organization's archival and newly generated data comes from outside AWS.
But consider this issue more broadly, and weigh the pros and cons of the cloud for data analysis in general. The cloud enables convenience, scalability, performance and flexibility at the expense of data transport, security and associated costs. If the scale tips toward the pro-cloud side of the equation, AWS provides the requisite services to create sophisticated data transformation and analysis systems.
The knock on Glue, in particular, is its immaturity compared with established, stand-alone ETL and data integration software from providers such as Informatica, IBM, SAP and Talend. Experienced data analysts and developers might favor stand-alone clients or an integrated development environment over the AWS Glue console. Glue also comes with default limits on the number of databases, jobs, triggers and crawlers per account, tables per database and partitions per table. While AWS can increase these limits, they might pose problems for particular organizations or system designs.
For most organizations, particularly those without extensive data warehouse and ETL expertise or legacy infrastructure, AWS could be an excellent platform for data aggregation, transformation and analysis, given its scalability and breadth of services.