Amazon S3 has become the de facto standard for cloud storage, and now AWS wants the service to play that same role for data lakes as well.
As part of this effort, the cloud provider enhanced S3 with support for larger storage buckets and upgraded its ecosystem of data lake tools, including the addition of AWS Glue to catalog data and Athena to query unstructured data stored in S3.
Enterprises with extensive Amazon cloud deployments will view S3 as an attractive option for AWS data lakes, but there are some limitations, especially around data transfers and analysis. In some cases, data lake alternatives to S3, such as the open source Hadoop Distributed File System (HDFS), might be better options.
Why a data lake?
Data lakes are architected to store a variety of data types and automatically generate a catalog of these different data sources. They are an evolution from data warehouses, which are only optimized for structured data from traditional transaction processing, ERP and customer relationship management databases. Data lakes make it easier to find correlations spread across structured and unstructured data sources, such as event logs, IoT readings, mobile applications and social media.
Data science typically requires a lot of research and engineering to find data sources and prepare them for a particular type of analysis. Because data lakes store different data types upfront, enterprises can add an appropriate schema later, which makes it easier and less time-consuming for data scientists to identify new algorithms.
Data lakes also pose some challenges, however. It can be difficult to find sources, understand schemas and assess the quality of each data source. AWS has designed S3 to dramatically reduce the overhead in data lake setup and use, with security and governance baked in.
AWS has positioned S3 as a more automated alternative to HDFS. S3 is clearly designed for Amazon's infrastructure, whereas HDFS draws on an open source history with support from leading data management vendors, including IBM.
HDFS is the storage layer of the Hadoop distributed computing framework and was developed alongside MapReduce, Hadoop's original processing engine. HDFS distributes data across multiple compute nodes in a cluster and is well-suited to manage different types of data sources. As a result, it set the stage for enterprise data lakes.
AWS added Amazon Elastic MapReduce a few years ago to automatically provision HDFS across a cluster of EC2 instances. Until recently, this was the best option for enterprises to build a data lake on top of AWS, since S3 was limited to 5 GB objects. An enterprise could create much larger data lakes if it spread HDFS across multiple EC2 instances with attached Elastic Block Store volumes.
Amazon has since expanded S3 to support 5 TB objects, which users can aggregate into multipetabyte buckets. This makes it easier to build much larger data lakes directly on S3 rather than HDFS. In addition, S3 is a service, while HDFS is a file system; with S3, Amazon takes care of the heavy lifting associated with managing multiple servers.
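To make the object size figures concrete, here is a minimal sketch of how an upload might be split into parts. It uses the 5 GB and 5 TB figures cited above and assumes the documented S3 multipart upload limits of 10,000 parts and 5 MiB to 5 GiB per part. The function name `plan_upload` and the 64 MiB default are hypothetical, and service limits change over time, so check the current AWS documentation before relying on these constants.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3
TIB = 1024 ** 4

# Assumed limits (see lead-in); verify against current AWS documentation.
MAX_OBJECT_SIZE = 5 * TIB   # largest object, per the article
MAX_SINGLE_PUT = 5 * GIB    # largest object uploadable in one request
MIN_PART_SIZE = 5 * MIB
MAX_PART_SIZE = 5 * GIB
MAX_PARTS = 10_000


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division without floating-point rounding."""
    return -(-a // b)


def plan_upload(object_size: int, preferred_part_size: int = 64 * MIB) -> dict:
    """Return a hypothetical upload plan for an object of object_size bytes."""
    if object_size <= 0 or object_size > MAX_OBJECT_SIZE:
        raise ValueError("object size must be between 1 byte and 5 TiB")

    # Smallest part size that keeps the upload within the part-count limit.
    floor_for_part_count = ceil_div(object_size, MAX_PARTS)
    part_size = max(preferred_part_size, floor_for_part_count, MIN_PART_SIZE)
    part_size = min(part_size, MAX_PART_SIZE)

    return {
        "multipart_required": object_size > MAX_SINGLE_PUT,
        "part_size": part_size,
        "parts": ceil_div(object_size, part_size),
    }
```

Under these assumptions, a 1 GiB object fits in a single request, while an object at the 5 TiB ceiling needs parts of roughly 524 MiB to stay within 10,000 parts.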
An architect can then use AWS Glue to save time and automate the creation of a data catalog that describes the structure and format of different data sources. The service uses a crawler to scan a collection of S3 buckets, classify data sources and infer their schemas, so that the resulting tables can be queried from AWS offerings, such as Redshift Spectrum or Athena.
AWS Glue also reduces the effort to extract, transform and load data into a centralized S3 repository.
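As a rough illustration of what cataloging involves, the sketch below infers column types from a small CSV sample. It is a toy written for this article, not how the AWS Glue crawler works internally; the function names and the sample data are hypothetical, and the type names simply mirror the `bigint`, `double` and `string` types common in Hive-style catalogs.

```python
import csv
import io


def infer_type(values):
    """Pick the narrowest type that fits every non-empty sample value."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "string"
    for name, caster in (("bigint", int), ("double", float)):
        try:
            for v in non_empty:
                caster(v)
            return name
        except ValueError:
            continue  # at least one value did not fit; try the next type
    return "string"


def infer_schema(csv_text):
    """Return {column: type} inferred from CSV text with a header row."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    # Treat missing cells as empty strings so short rows do not break inference.
    return {col: infer_type([row[col] or "" for row in rows]) for col in columns}


sample = "device_id,temperature,status\n1,21.5,ok\n2,19,warn\n"
schema = infer_schema(sample)
```

A real catalog would also record the data's location, format and partitions, which this sketch leaves out.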
The limits of S3
Enterprises that build AWS data lakes with S3 could face some challenges, such as the cost and complexity of data transfers to other analytics engines. Once users store data in S3, there are no transfer costs for analytics or data processing with apps that run within the same AWS region. However, enterprises must pay a premium when they move data to private infrastructure or other cloud platforms for analytics.
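The trade-off can be put into a back-of-the-envelope estimate. The sketch below is an assumption-laden simplification: the function name is hypothetical, the per-gigabyte rate must be supplied by the caller because no real AWS price is built in, and it ignores request charges and tiered pricing.

```python
def estimate_monthly_transfer_cost(dataset_gb, reads_per_month, same_region, rate_per_gb):
    """Estimate monthly data transfer charges for analytics reads from S3.

    rate_per_gb is supplied by the caller; it is not a real AWS price.
    """
    if min(dataset_gb, reads_per_month, rate_per_gb) < 0:
        raise ValueError("inputs must be non-negative")
    if same_region:
        # As described above: no transfer cost for apps in the same AWS region.
        return 0.0
    return dataset_gb * reads_per_month * rate_per_gb


PLACEHOLDER_RATE = 0.05  # hypothetical rate, for illustration only

in_region = estimate_monthly_transfer_cost(1000, 4, True, PLACEHOLDER_RATE)
off_cloud = estimate_monthly_transfer_cost(1000, 4, False, PLACEHOLDER_RATE)
```

Even with a placeholder rate, the estimate shows that the charge scales with how often the full data set leaves the region, which is why repeated off-cloud analytics runs are the expensive case.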
There are also some technical limitations with S3 compared to HDFS. S3 is an object store rather than a file system, so it has no true directories. This can reduce performance when operations are executed on a directory, since a rename or delete has to be applied to each object under the prefix individually. Some file operations used by Hadoop tools, such as HBase, are not supported on S3. Also, data written to S3 is not visible for analytics until the output stream that writes it is closed, which could reduce performance for some applications.
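The directory penalty is easier to see in a toy model. The class below is an in-memory stand-in for a flat key namespace, not the S3 API; every name in it is hypothetical, and it ignores details such as paginated listings. It counts simulated requests to show that renaming a "directory" costs one listing plus a copy and a delete per object.

```python
class ToyObjectStore:
    """In-memory stand-in for a flat key/value object store (not the S3 API)."""

    def __init__(self):
        self.objects = {}
        self.requests = 0  # number of simulated requests issued so far

    def put(self, key, data):
        self.requests += 1
        self.objects[key] = data

    def list_prefix(self, prefix):
        self.requests += 1
        return sorted(k for k in self.objects if k.startswith(prefix))

    def copy(self, src, dst):
        self.requests += 1
        self.objects[dst] = self.objects[src]

    def delete(self, key):
        self.requests += 1
        del self.objects[key]

    def rename_prefix(self, old, new):
        """Emulate a directory rename: list, then copy and delete every key."""
        for key in self.list_prefix(old):
            self.copy(key, new + key[len(old):])
            self.delete(key)


store = ToyObjectStore()
for name in ("a.log", "b.log", "c.log"):
    store.put("logs/2020/" + name, b"...")
store.requests = 0  # measure only the rename
store.rename_prefix("logs/2020/", "archive/2020/")
```

In this model the rename issues seven requests for three objects, and the count grows linearly with the number of objects, whereas a file system such as HDFS can rename a directory as a single metadata operation.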
Developers have limited control over how data is replicated across S3 infrastructure compared with HDFS. This can lead to consistency issues when files are listed, read, updated or deleted. It's also more difficult to optimize the bandwidth between S3 and analytics applications that run outside of the Amazon ecosystem.
Planning for the future
A data lake could be beneficial in the long run, but it involves more than just technical considerations, since different business departments usually manage different data sources. In the short term, enterprises may consider S3 where practical for their AWS data lakes and HDFS for other data infrastructure required outside of Amazon. Enterprises don't have to decide upfront, since Hadoop infrastructure can use an integrated S3A client to pull data in from S3 buckets and work in concert with other data stored in HDFS.
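For teams that want to try the mixed approach, the sketch below renders a small Hadoop configuration fragment for the S3A client. The helper `build_core_site` is hypothetical; `fs.s3a.endpoint` and `fs.s3a.connection.maximum` are S3A property names, but the values shown are illustrative assumptions, and credentials are deliberately left out because how they are supplied depends on the environment.

```python
import xml.etree.ElementTree as ET


def build_core_site(properties):
    """Render a dict of Hadoop properties as core-site.xml style text."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    if hasattr(ET, "indent"):  # pretty-printing is available on Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# Illustrative values only; adjust for the actual region and workload.
s3a_settings = {
    "fs.s3a.endpoint": "s3.us-east-1.amazonaws.com",
    "fs.s3a.connection.maximum": 64,
}
core_site_xml = build_core_site(s3a_settings)
```

With settings like these in place, a Hadoop job can refer to `s3a://` paths for data held in S3 and `hdfs://` paths for data held in the cluster within the same job.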