data lake definition

This definition is part of our Essential Guide: AWS analytics tools help make sense of big data

A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed.

While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data. Each data element in a lake is assigned a unique identifier and tagged with a set of extended metadata tags. When a business question arises, the data lake can be queried for relevant data, and that smaller set of data can then be analyzed to help answer the question.  

The term data lake is often associated with Hadoop-oriented object storage. In such a scenario, an organization's data is first loaded into the Hadoop platform, and then business analytics and data mining tools are applied to the data where it resides on Hadoop's cluster nodes of commodity computers

Like big data, the term data lake is sometimes disparaged as being simply a marketing label for a product that supports Hadoop. Increasingly, however, the term is being accepted as a way to describe any large data pool in which the schema and data requirements are not defined until the data is queried.

See also: enterprise data hub, big data analytics, extract, load, transform


This was first published in May 2015

Continue Reading About data lake



Find more PRO+ content and other member only offers, here.

1 comment


Forgot Password?

No problem! Submit your e-mail address below. We'll send you an email containing your password.

Your password has been sent to:


File Extensions and File Formats

Powered by: