A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data. Each data element in a lake is assigned a unique identifier and tagged with a set of extended metadata tags. When a business question arises, the data lake can be queried for relevant data, and that smaller set of data can then be analyzed to help answer the question.
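The identifier-plus-metadata idea described above can be sketched in a few lines of Python. This is a hypothetical illustration, not any specific product's API: the `ingest` function and its fields are invented for the example, and in a real lake the payload itself would land in object storage such as HDFS or Amazon S3.

```python
import uuid

# Hypothetical sketch: wrap each raw data element with a unique identifier
# and extended metadata tags, leaving the payload itself untouched.
def ingest(raw_bytes, source, content_type, tags):
    """Return a catalog entry describing a raw payload."""
    entry = {
        "id": str(uuid.uuid4()),       # unique identifier for the element
        "source": source,              # where the data came from
        "content_type": content_type,  # e.g. "text/csv", "application/json"
        "tags": tags,                  # extended metadata for later search
        "size_bytes": len(raw_bytes),
    }
    # A real lake would also write raw_bytes to object storage here;
    # this sketch keeps only the catalog entry.
    return entry

entry = ingest(b'{"temp": 21.5}', source="iot-sensor-7",
               content_type="application/json",
               tags=["telemetry", "temperature"])
print(entry["tags"])  # ['telemetry', 'temperature']
```

When a business question arises later, it is these tags and identifiers that let the relevant subset of data be found and pulled out for analysis.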
The term data lake is often associated with Hadoop-oriented object storage. In such a scenario, an organization's data is first loaded into the Hadoop platform, and then business analytics and data mining tools are applied to the data where it resides on Hadoop's cluster nodes of commodity computers.
Like big data, the term data lake is sometimes disparaged as being simply a marketing label for a product that supports Hadoop. Increasingly, however, the term is being used to describe any large data pool in which the schema and data requirements are not defined until the data is queried.
The term describes a data storage strategy, not a specific technology, although it is frequently used in conjunction with a specific technology (Hadoop). The same can be said of the term data warehouse, which despite often referring to a specific technology (relational database), actually describes a broad data management strategy.
Data lake vs. data warehouse
Data lakes and data warehouses are two different strategies for storing big data. The most important distinction between them is that in a data warehouse, the schema for the data is preset; that is, there is a plan for the data upon its entry into the database. In a data lake, this is not necessarily the case. A data lake can house both structured and unstructured data and does not have a predetermined schema. A data warehouse handles primarily structured data and has a predetermined schema for the data it houses.
To put it more simply, think of the concept of a warehouse versus the concept of a lake. A lake is liquid, shifting, amorphous, largely unstructured and is fed from rivers, streams, and other unfiltered sources of water. A warehouse, on the other hand, is a man-made structure, with shelves and aisles and designated places for the things inside of it. Warehouses store curated goods from specific sources. Warehouses are prestructured, lakes are not.
This core conceptual difference manifests in several ways, including:
Technology typically used to host data -- A data warehouse is usually a relational database housed on an enterprise mainframe server or in the cloud, whereas a data lake is usually housed in a Hadoop environment or similar big data repository.
Source of the data -- The data stored in a warehouse is extracted from various online transaction processing applications to support business analytics queries and data marts for specific internal business groups, such as sales or inventory teams. Data lakes typically receive both relational and non-relational data from IoT devices, social media, mobile apps and corporate applications.
Users -- Data warehouses are useful when there is a massive amount of data from operational systems that needs to be readily available for analysis. Data lakes are more useful when an organization needs a large repository of data, but does not have a purpose for all of it and can afford to apply a schema to it upon access.
Because the data in a lake is often uncurated and can originate from sources outside of the company's operational systems, lakes are not a good fit for the average business analytics user. Instead, data lakes are better suited for use by data scientists, because it takes a level of skill to be able to sort through the large body of uncurated data and readily extract meaning from it.
Data quality -- In a data warehouse, the highly curated data is generally trusted as the central version of the truth because it has already been processed. The data in a data lake is less reliable because it could be arriving from any source in any state. It may be curated, or it may not be, depending on the source.
Processing -- A data warehouse uses schema-on-write: the schema is defined before the data is written into the warehouse. A data lake uses schema-on-read: no schema is applied until the data is accessed and someone chooses to use it for something.
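The schema-on-read idea can be made concrete with a short sketch. In this hedged example (the record layout and field names are invented for illustration), records land in the lake as raw JSON text with no validation, and a schema is imposed only when someone reads the data:

```python
import json

# Raw records as they might sit in a lake: stored as-is, never validated
# on write. (Field names here are illustrative, not from any real system.)
raw_lines = [
    '{"user": "ana", "amount": "19.99", "ts": "2023-05-01"}',
    '{"user": "bo", "amount": "5.00"}',            # missing field: fine on write
    '{"user": "cy", "amount": "oops", "ts": "2023-05-02"}',
]

def read_with_schema(lines):
    """Apply a schema (user: str, amount: float) at read time; skip bad rows."""
    for line in lines:
        rec = json.loads(line)
        try:
            yield {"user": str(rec["user"]), "amount": float(rec["amount"])}
        except (KeyError, ValueError):
            continue  # schema violations surface only now, on read

rows = list(read_with_schema(raw_lines))
print(rows)  # the 'oops' row fails the schema and is dropped at read time
```

In a warehouse, the third record would have been rejected or cleaned at load time; in the lake it is retained as-is, and each analysis decides how to interpret or filter it.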
Performance/cost -- Data warehouses are usually more expensive for large data volumes, but the trade-off is faster query results, reliability and higher performance. Data lakes are designed with low cost in mind, but query results are improving as the concept and surrounding technologies mature.
Agility -- Data lakes are highly agile; they can be configured and reconfigured as needed. Data warehouses are less so.
Security -- Data warehouses are generally more secure than data lakes because warehouses as a concept have existed for longer and therefore, security methods have had the opportunity to mature.
Because of their differences, and the fact that data lakes are a newer and still-evolving concept, organizations might choose to use both a data warehouse and a data lake in a hybrid deployment. This may be to accommodate the addition of new data sources, or to create an archive repository to deal with data roll-off from the main data warehouse. Frequently data lakes are an addition to, or evolution of, an organization's current data management structure instead of a replacement.
Data lake architecture
The physical architecture of a data lake may vary, as data lake is a strategy that can be applied to multiple technologies. For example, the physical architecture of a data lake using Hadoop might differ from that of a data lake using Amazon Simple Storage Service (Amazon S3).
However, there are three main principles that distinguish a data lake from other big data storage methods and make up the basic architecture of a data lake. They are:
- No data is turned away. All data from the various source systems is loaded and retained.
- Data is stored in an untransformed or nearly untransformed state, as it was received from the source.
- Data is transformed and fit into a schema based on analysis requirements.
Although the data is largely unstructured and not geared toward answering any specific question, it should still be organized in some manner so that it can be found and put to use later. Whatever technology ends up being used to deploy an organization's data lake, a few features should be included to ensure that the data lake is functional and healthy and that the large repository of unstructured data doesn't go to waste. These include:
- A taxonomy of data classifications, which can include data type, content, usage scenarios and groups of possible users.
- A file hierarchy with naming conventions.
- Data profiling tools to provide insight for classifying data objects and addressing data quality issues.
- Standardized data access process to keep track of what members of an organization are accessing data.
- A searchable data catalog.
- Data protections including data masking, data encryption and automated monitoring to generate alerts when data is accessed by unauthorized parties.
- Data awareness among employees, which includes an understanding of proper data management and data governance, training on how to navigate the data lake, and an understanding of strong data quality and proper data usage.
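As one concrete example of a file hierarchy with naming conventions, many lakes use date-partitioned directory layouts. The helper below is a hedged sketch; the exact layout (`<zone>/<source>/<dataset>/year=/month=/day=`) is one common convention, not a standard, and the function name is invented for illustration.

```python
from datetime import date

# Hypothetical naming-convention helper for lake paths, following a
# common date-partitioned layout:
#   <zone>/<source>/<dataset>/year=YYYY/month=MM/day=DD/<file>
def lake_path(zone, source, dataset, d, filename):
    return (f"{zone}/{source}/{dataset}/"
            f"year={d.year:04d}/month={d.month:02d}/day={d.day:02d}/{filename}")

p = lake_path("raw", "mobile-app", "clickstream", date(2023, 5, 1), "events.json")
print(p)  # raw/mobile-app/clickstream/year=2023/month=05/day=01/events.json
```

A consistent convention like this is what lets profiling tools, catalogs and access controls reason about the lake's contents instead of treating it as an undifferentiated heap of files.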
Benefits of a data lake
The data lake offers several benefits, including:
- The ability of developers and data scientists to easily configure a given data model, application, or query on the fly. The data lake is highly agile.
- Data lakes are theoretically more accessible. Because there is no inherent structure, any user can technically access the data in the data lake, even though the prevalence of large amounts of unstructured data might inhibit less skilled users.
- The data lake supports users with varying levels of investment: those who want to return to the source to retrieve more information, those who seek to answer entirely new questions with the data, and those who simply require a daily report. Access is possible for each of these user types.
- Data lakes are cheap to implement because most technologies used to manage them are open source (e.g., Hadoop) and can be installed on low-cost hardware.
- Labor-intensive schema development and data cleanup are deferred until after an organization has identified a clear business need for the data.
- Agility allows for a variety of different analytics methods to interpret data, including big data analytics, real-time analytics, machine learning and SQL queries.
- Data lakes are highly scalable because of their lack of imposed structure.
Despite the benefits of having a cheap, unstructured repository of data at an organization's disposal, several legitimate criticisms have been levied against the strategy.
One of the biggest potential pitfalls of the data lake is that it might turn into a data swamp, or data graveyard. If an organization practices poor data governance and management, it may lose track of the data that exists in the lake, even as more pours in. The result is a body of potentially valuable data rotting away unseen at the "bottom" of the data lake, so to speak: deteriorated, unmanaged and inaccessible.
Data lakes, while providing theoretical accessibility to anyone in an organization, may not be as accessible in practical use, because business analysts may have a difficult time readily parsing unstructured data from a variety of sources. This practical accessibility challenge may also contribute to the lack of proper data maintenance and result in the development of a data graveyard. To maximize the return on a data lake investment and reduce the risk of a failed deployment, organizations must plan for these accessibility and maintenance challenges.
Another problem with the term data lake itself is that it is used in many contexts in public discourse. Although it makes most sense to use it to describe a strategy of data management, it has also commonly been used to describe specific technologies and, as a result, carries a level of arbitrariness. This ambiguity may fade as the term matures and settles into a more concrete meaning.
Although a data lake isn't a specific technology, there are several technologies that enable them. Some vendors that offer those technologies are:
- Apache -- offers the open-source ecosystem Hadoop, one of the most common data lake platforms.
- Amazon -- offers Amazon S3 with virtually unlimited scalability.
- Google -- offers Google Cloud Storage and a collection of services to pair with it for management.
- Oracle -- offers the Oracle Big Data Cloud and a variety of PaaS services to help manage it.
- Microsoft -- offers the Azure Data Lake as a scalable data store and Azure Data Lake Analytics as a parallel analytics service. This is an example of the term data lake being used to refer to a specific technology instead of a strategy.
- HVR -- offers a scalable solution for organizations that need to move large volumes of data and update it in real time.
- Podium -- offers a solution with an easy-to-implement and use suite of management features.
- Snowflake -- offers a solution that specializes in processing diverse datasets, including structured and semi-structured datasets such as JSON, XML and Parquet.
- Zaloni -- offers a solution that comes with Mica, a self-service data prep tool and data catalog. Zaloni has been branded as the data lake company.