
data lake

A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data. Each data element in a lake is assigned a unique identifier and tagged with a set of extended metadata tags. When a business question arises, the data lake can be queried for relevant data, and that smaller set of data can then be analyzed to help answer the question.
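The flat layout described above can be illustrated with a minimal in-memory sketch. It is not how any particular product works: the `DataLake` class, its `put`/`find`/`get` methods, and the tag names (`source`, `kind`, `format`) are all hypothetical, chosen only to show raw data stored under a unique identifier with metadata tags and then narrowed down by tag when a question comes up.

```python
import uuid


class DataLake:
    """Toy flat store: no folders, just identifier -> (raw bytes, metadata tags)."""

    def __init__(self):
        self._objects = {}

    def put(self, raw: bytes, **tags) -> str:
        """Store data in its native format and return its unique identifier."""
        object_id = str(uuid.uuid4())
        self._objects[object_id] = (raw, dict(tags))
        return object_id

    def find(self, **wanted):
        """Return identifiers of objects whose tags match every requested value."""
        return [
            object_id
            for object_id, (_, tags) in self._objects.items()
            if all(tags.get(key) == value for key, value in wanted.items())
        ]

    def get(self, object_id: str) -> bytes:
        """Return the raw, unmodified data for an identifier."""
        return self._objects[object_id][0]


lake = DataLake()
click_id = lake.put(b'{"user": 7, "page": "/pricing"}',
                    source="web", kind="clickstream", format="json")
log_id = lake.put(b"2019-03-01 ERROR disk full",
                  source="server", kind="log", format="text")
csv_id = lake.put(b"sku,qty\nA1,3\n",
                  source="pos", kind="sales", format="csv")

# A business question about web traffic selects only the relevant subset.
web_ids = lake.find(source="web")
```

Here nothing is parsed or reshaped on the way in; the tags are the only thing that makes the raw objects findable later.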

The term data lake is often associated with Hadoop-oriented object storage. In such a scenario, an organization's data is first loaded into the Hadoop platform, and then business analytics and data mining tools are applied to the data where it resides, on the commodity computers that make up the Hadoop cluster's nodes.

Like big data, the term data lake is sometimes disparaged as being simply a marketing label for a product that supports Hadoop. Increasingly, however, the term is being accepted as a way to describe any large data pool in which the schema and data requirements are not defined until the data is queried.
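Deferring the schema until query time is often called schema-on-read. The sketch below shows the idea under simple assumptions: the sample records, the `read_with_schema` helper, and the two schemas are invented for illustration, and records that do not fit a given schema are simply skipped.

```python
import json

# Raw records are stored exactly as they arrived; their fields differ.
raw_records = [
    '{"user": "ana", "amount": "19.99", "country": "PT"}',
    '{"user": "bo", "amount": 5, "device": "mobile"}',
    '{"user": "cy", "note": "no purchase"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at query time: pick fields, cast types, skip misfits."""
    for line in lines:
        record = json.loads(line)
        try:
            yield {field: cast(record[field]) for field, cast in schema.items()}
        except (KeyError, ValueError, TypeError):
            # The record lacks a field or holds a value the schema cannot cast.
            continue


# Two different questions impose two different schemas on the same raw data.
revenue_schema = {"user": str, "amount": float}
device_schema = {"user": str, "device": str}

revenue_rows = list(read_with_schema(raw_records, revenue_schema))
device_rows = list(read_with_schema(raw_records, device_schema))
total_revenue = sum(row["amount"] for row in revenue_rows)
```

The same three stored records yield two rows for the revenue question and one row for the device question, without the stored data ever being restructured.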

Data lake vs. data warehouse

Data lakes and data warehouses are both used for storing big data, but each approach has its own uses. Typically, a data warehouse is a relational database housed on an enterprise mainframe server or the cloud. The data stored in a warehouse is extracted from various online transaction processing (OLTP) applications to support business analytics queries and data marts for specific internal business groups, such as sales or inventory teams.

Data warehouses are useful when there is a massive amount of data from operational systems that needs to be readily available for analysis. Because the data in a lake is often uncurated and can originate from sources outside of the company's operational systems, lakes are not a good fit for the average business analytics user.
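For contrast, the warehouse side of the comparison can be sketched with Python's built-in `sqlite3` module standing in for a relational warehouse. The table name `sales_fact`, the view `sales_mart`, and the sample rows are hypothetical; the point is only that the schema is fixed before loading, non-conforming rows are rejected at load time, and a data mart is a curated slice for one business group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema is defined up front, before any data is loaded.
conn.execute("""
    CREATE TABLE sales_fact (
        order_id INTEGER PRIMARY KEY,
        region   TEXT NOT NULL,
        amount   REAL NOT NULL CHECK (amount >= 0)
    )
""")

# Rows extracted from a (hypothetical) OLTP order system.
oltp_rows = [(1, "EMEA", 120.0), (2, "APAC", 80.0), (3, "EMEA", 40.0)]
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", oltp_rows)

# A row that does not fit the schema is refused when it is written.
try:
    conn.execute("INSERT INTO sales_fact VALUES (?, ?, ?)", (4, None, 10.0))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# A data mart for the sales team: a curated, pre-aggregated view.
conn.execute("""
    CREATE VIEW sales_mart AS
    SELECT region, SUM(amount) AS total
    FROM sales_fact
    GROUP BY region
""")
mart = conn.execute(
    "SELECT region, total FROM sales_mart ORDER BY region"
).fetchall()
```

Because structure is enforced on the way in, an analyst can query `sales_mart` without knowing anything about the source systems, which is the convenience a raw lake does not offer.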

This was last updated in March 2019


