Big data expansion affects how IT teams treat and retain information. Many businesses forgo all but the most rudimentary data preparation techniques and opt to retain greater volumes of raw data.
Rather than process data at the time of ingestion, these data owners defer applying a schema until they actually query the data. This keeps raw data in its original format, which leaves it available for use in more big data projects.
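This deferred approach is often called schema-on-read. A minimal Python sketch of the idea follows, assuming hypothetical sensor records kept as raw JSON lines; the field names and the `read_with_schema` helper are illustrative and not part of any AWS service.

```python
import json

# Raw records are kept exactly as they arrived; no schema is enforced at ingestion.
# These sample records are made up for illustration.
RAW_LINES = [
    '{"device": "pump-1", "temp": "71.5", "ts": "2024-01-01T00:00:00Z"}',
    '{"device": "pump-2", "temp": "68.0", "extra": "ignored until needed"}',
]

def read_with_schema(lines, schema):
    """Apply a schema (field name -> converter) only when the data is queried."""
    for line in lines:
        record = json.loads(line)
        # Fields missing from a record come back as None instead of failing.
        yield {
            field: convert(record[field]) if field in record else None
            for field, convert in schema.items()
        }

# One project reads temperatures as floats...
temps = list(read_with_schema(RAW_LINES, {"device": str, "temp": float}))
# ...while another reads the same raw lines with a different schema.
stamps = list(read_with_schema(RAW_LINES, {"device": str, "ts": str}))
```

Because the raw lines are never rewritten, a later project can bring its own schema without asking anyone to reprocess the data.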
This trend enables a wide array of compute services to search, catalog, analyze and present enormous amounts of raw data from inside and outside the enterprise. Big data experts call this type of architecture a data lake, and public cloud providers like AWS offer the services, processing scale and storage capacity needed to architect these data lakes for big data projects.
Workload complexity is a big challenge when you implement data lakes on AWS and in other public clouds. A data lake is more than a single application, workload or service. Instead, it cobbles together interrelated services to manage and transform enterprise data. Data lake users will likely need additional services to enhance performance and securely tie everything together.
Data lake features
While an IT team can construct a data lake in a data center or private cloud, the IT infrastructure must provide exceptional performance and elastic scalability. Because this can be expensive to implement on premises, most businesses instead use data lakes hosted on a public cloud.
A data lake architecture, including one on the cloud, typically has the following features:
Data submission. Raw data typically passes from the business or outside data sources to the data lake through a wide area network as a batch or streaming process. IT teams commonly upload discrete files, such as test results, as batch processes. Streaming uploads better fit ongoing data collection projects, such as those that involve the internet of things (IoT). An IT team could also, in some cases, ship disks of data to a cloud provider via services like AWS Snowball.
Data preparation. IT teams typically subject data lake information to various ingestion processes that validate, index and catalog data. They might also create and apply metadata, and perhaps even transform data submissions. But teams could also replicate raw data into separate storage instances to preserve data lake integrity, and then use services to organize and prepare a copy of the data for analysis.
Data management. Once teams submit and prepare data, they apply data management techniques to organize, classify and tier valuable or sensitive data within the data lake. In many cases, data lake management involves services that combine two or more data sets into another view of that data -- sometimes called curation -- which yields a processed data set better suited to analytical tasks. This data typically receives additional indexing, and teams can separate it from raw data to keep it intact for future analytics projects.
Data analysis. Analytics is the heart of any data lake, as it enables users to answer queries, spot correlations and make predictions. Analysts can pose queries and apply one or more analytics engines to transform, aggregate and analyze data sets. While they can analyze raw information, data scientists typically work with curated data sets that were previously transformed and aggregated.
Data delivery. A range of tools can organize and present the results of big data jobs. For example, search features can help users find and see results for queries on indexed metadata attached to ingested data. Other features can publish curated data sets to make them available to Amazon and third-party tools. Teams can also post-process data results through visualization tools.
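The features above can be pictured as stages in a small pipeline. The following is a toy, in-memory Python sketch, not an AWS implementation; every name in it (`submit`, `prepare`, `curate`, `analyze` and the sample records) is hypothetical and chosen only to mirror the feature list.

```python
from collections import defaultdict
from statistics import mean

raw_zone = []   # untouched submissions
catalog = []    # metadata built during preparation

def submit(source, records):
    """Data submission: land a batch in the raw zone unchanged."""
    raw_zone.append({"source": source, "records": records})

def prepare():
    """Data preparation: catalog each batch without altering it."""
    catalog.clear()
    for index, batch in enumerate(raw_zone):
        fields = sorted({key for rec in batch["records"] for key in rec})
        catalog.append({"batch": index, "source": batch["source"],
                        "fields": fields, "count": len(batch["records"])})

def curate(readings_source, devices_source):
    """Data management: join two raw data sets into a separate curated copy."""
    by_source = defaultdict(list)
    for batch in raw_zone:
        by_source[batch["source"]].extend(batch["records"])
    sites = {d["device"]: d["site"] for d in by_source[devices_source]}
    # New dicts are built here, so the raw records stay intact.
    return [{**r, "site": sites.get(r["device"], "unknown")}
            for r in by_source[readings_source]]

def analyze(curated):
    """Data analysis: aggregate the curated set (mean temperature per site)."""
    grouped = defaultdict(list)
    for row in curated:
        grouped[row["site"]].append(row["temp"])
    return {site: mean(values) for site, values in grouped.items()}

submit("readings", [{"device": "a", "temp": 70.0}, {"device": "b", "temp": 80.0},
                    {"device": "a", "temp": 72.0}])
submit("devices", [{"device": "a", "site": "north"}, {"device": "b", "site": "south"}])
prepare()
curated = curate("readings", "devices")
report = analyze(curated)  # data delivery would publish or visualize this result
```

The point of the design is that curation writes to a copy: the raw zone still holds the original submissions, so a future project can curate them differently.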
Native data lake management tools
While the details of big data projects will vary, most data lakes on AWS use a range of services, including the following:
Amazon Simple Storage Service. AWS' object storage offering, Simple Storage Service (S3), provides high-performance batch storage. S3 buckets can retain both raw data sets and processed data that has been modified through curation, analytical processes and consumption. Users can store batch data in S3 buckets via internet transfers or through a physical disk transfer service, such as Snowball.
Amazon Kinesis Data Firehose. S3 can handle batch data uploads, but streaming data is more problematic. Amazon Kinesis Data Firehose enables the data lake to capture, modify and load streaming data, such as continuous telemetry from IoT devices, into storage instances.
Amazon Elasticsearch Service. When you bring raw data into data lakes on AWS, it typically requires a level of pre-processing to properly ingest the content and prepare it for use. Amazon Elasticsearch Service (ES) provides initial validation, metadata extraction and indexing for the arriving data. Other services use the metadata and indexes to support search, curation and other jobs. S3 events can trigger Amazon ES activity, which can then be directed via AWS Lambda functions.
Amazon Kinesis Data Analytics. Data scientists can apply numerous analytical engines to big data tasks. One option is Kinesis Data Analytics, which helps process or transform streaming Kinesis data using conventional SQL, and makes queries on streaming data. Other analytical tools can then manage the processed data.
Amazon Redshift Spectrum. Designed to manage the contents of data lakes on AWS, Amazon Redshift Spectrum enables users to analyze or transform data using SQL and third-party business intelligence tools. Redshift Spectrum supports many common open data formats, and it can run queries across exabytes of raw data in S3 buckets and combine the results with data stored on local disk in a Redshift cluster.
Amazon Athena. AWS' serverless analytics service, Amazon Athena, analyzes data in S3 buckets using standard SQL queries. Users can select an S3 bucket for analysis, create the schema and execute the query. Athena does not rely on conventional extract, transform and load (ETL) jobs to prepare data for analysis, but it can query curated data sets for convenient ad hoc analytics.
AWS Glue. When you do need to perform ETL processes, AWS Glue can handle that task. The AWS Glue managed service works with AWS-native data. It automatically discovers data and creates metadata, which data scientists can search or query. AWS Glue also creates a catalog of discovered content, as well as the code that transforms the data. Services like Hadoop and Apache Spark can reuse those catalogs and code for additional processing.
Amazon Elastic MapReduce. If you need to process huge volumes of data, Amazon Elastic MapReduce (EMR) handles this task with clusters of Amazon EC2 instances. In addition to Hadoop, EMR supports other distributed processing frameworks, including Spark, HBase and Presto. EMR and the chosen framework interact with data from S3 and DynamoDB.
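As a rough illustration of how two of these services are called from code, the sketch below wraps the boto3 `put_record` call for Kinesis Data Firehose and the `start_query_execution` call for Athena. The table name and SQL text are hypothetical placeholders. The clients are passed in as arguments so the functions can be exercised without AWS credentials; in a real deployment they would typically be created with `boto3.client("firehose")` and `boto3.client("athena")`.

```python
import json

def send_telemetry(firehose, stream_name, reading):
    """Push one streaming record to a Firehose delivery stream."""
    # Firehose expects bytes; a trailing newline keeps records separable in S3.
    payload = (json.dumps(reading) + "\n").encode("utf-8")
    return firehose.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": payload},
    )

def start_athena_query(athena, sql, database, output_location):
    """Start a SQL query over data in S3 and return its execution ID."""
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        # Athena writes result files to this S3 location.
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]

# Hypothetical table and column names, for illustration only.
EXAMPLE_SQL = "SELECT device, avg(temp) AS avg_temp FROM telemetry GROUP BY device"
```

Athena queries run asynchronously, so a caller would normally poll `get_query_execution` with the returned ID before reading the results.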
Startup and cost questions
Businesses face two core problems when they construct data lakes on AWS.
Implementation is the first, as an IT team must possess considerable knowledge of AWS architecture and services to properly construct a data lake; typically, only AWS architects have that type of know-how.
But several shortcuts can help accelerate data lake implementation. One-day boot camps can teach big data developers, cloud solutions architects, data analysts and other IT professionals to design, build and operate data lakes. AWS also provides a sample template in AWS CloudFormation that can deploy a basic data lake in a default configuration. IT professionals can use the template to get started, and then refine the deployment to better suit their specific needs.
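For teams that prefer to script the deployment, a CloudFormation template can also be launched through the boto3 `create_stack` call. The sketch below is one possible wrapper, not the documented procedure for the AWS sample template: the stack name and template URL are supplied by the caller, and the `Capabilities` values assume the template creates IAM resources.

```python
def deploy_data_lake_stack(cloudformation, stack_name, template_url):
    """Launch a CloudFormation stack from a template stored in S3."""
    response = cloudformation.create_stack(
        StackName=stack_name,
        TemplateURL=template_url,
        # Assumption: the template creates IAM roles, which must be
        # acknowledged explicitly or the request is rejected.
        Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM"],
    )
    return response["StackId"]
```

Here `cloudformation` would typically be `boto3.client("cloudformation")`. Stack creation is asynchronous, so the returned stack ID only confirms that the request was accepted.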
Admins encounter the second core problem when they attempt to estimate costs for data lakes on AWS. Some services come without a charge, including the AWS CloudFormation template, but users will incur costs for most services involved with data lake composition. Each constituent service carries different price structures and tiers, and they might scale according to big data tasks and performance needs -- resulting in higher costs.
With these factors at play, cost estimation quickly becomes a time-consuming and error-prone endeavor. Tools like the AWS Simple Monthly Calculator can help estimate costs, though they might not account for the nuances involved with a data lake.
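To see why the estimate is sensitive to usage, it can help to model each service as a line item. The Python sketch below does this with placeholder unit prices that are deliberately made up; they are not AWS rates, and the usage figures are equally hypothetical.

```python
# Placeholder unit prices, NOT real AWS rates; replace with current pricing.
HYPOTHETICAL_RATES = {
    "storage_gb_month": 0.02,    # object storage, per GB-month
    "stream_gb_ingested": 0.03,  # streaming ingestion, per GB
    "query_tb_scanned": 5.00,    # serverless SQL, per TB scanned
    "cluster_node_hour": 0.25,   # processing cluster, per node-hour
}

def estimate_monthly_cost(usage, rates=HYPOTHETICAL_RATES):
    """Return per-item costs and a total for one month of projected usage."""
    unknown = set(usage) - set(rates)
    if unknown:
        raise KeyError(f"no rate defined for: {sorted(unknown)}")
    line_items = {item: round(qty * rates[item], 2) for item, qty in usage.items()}
    return line_items, round(sum(line_items.values()), 2)

# Made-up projection: 5 TB stored, 200 GB streamed, 12 TB scanned,
# and a 4-node cluster running 100 hours.
usage = {
    "storage_gb_month": 5000,
    "stream_gb_ingested": 200,
    "query_tb_scanned": 12,
    "cluster_node_hour": 4 * 100,
}
items, total = estimate_monthly_cost(usage)
```

Even in this simplified model, the total depends on four independent usage figures, each of which can grow with the workload. Real price structures add tiers, request charges and data transfer fees on top.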
If you're unsure about how to best implement data lakes on AWS, discuss the planned architecture with AWS support teams. IT can identify the services involved and then estimate the monthly costs.