What is a Data Lake?
We are living in an age of digital transformation in which vast amounts of data are being created in diverse forms. Every business is trying to derive more value from the data available to it. Over the past decade, two things about data have changed drastically: its quality and its quantity.
Data volume has been doubling roughly every two years and was projected to reach 44 trillion gigabytes (44 zettabytes) by 2020. Up to 90 percent of that data is unstructured or semi-structured, which presents a two-fold challenge: finding a way to store all this data and maintaining the capacity to process it quickly. This is where the data lake comes into the picture: it handles large volumes of data in structured as well as unstructured formats.
So, what is a Data Lake?
In 2010, James Dixon introduced the concept of the data lake and his idea has gained traction ever since. He uses the following analogy:
“If you think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
The term “data lake” reflects the raw, ad hoc nature of the data it holds, as opposed to the clean, processed data stored in traditional data warehouse systems. Adding a data lake is one of the focus areas of data modernization efforts.
A data lake can be defined as follows:
A data lake is a storage repository that can hold large amounts of structured, semi-structured, and unstructured data. It is a place to store every type of data in its native format, with no fixed limits on account size or file size. It makes a large quantity of data available in one place, which helps analytic performance and native integration.
Data lakes are usually built on a cluster of inexpensive, scalable commodity hardware. This allows data to be stored in the lake in case it is needed later, without having to worry about storage capacity. The cluster can run either on-premises or in the cloud.
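To make "store every type of data in its native format" concrete, here is a minimal sketch that uses a local folder as a stand-in for the lake. The directory layout, the `catalog.jsonl` file, and the function names `ingest_raw` and `demo` are assumptions made for illustration, not part of any particular data lake product.

```python
import json
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def ingest_raw(source_file: Path, lake_root: Path, source_name: str) -> Path:
    """Copy a file into the lake unchanged and record minimal metadata."""
    # Hypothetical layout: <lake_root>/raw/<source_name>/<original file name>
    target_dir = lake_root / "raw" / source_name
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source_file.name
    shutil.copy2(source_file, target)  # bytes are stored as-is, no parsing

    # Append one catalog entry so the file can be found again later.
    entry = {
        "path": str(target.relative_to(lake_root)),
        "source": source_name,
        "size_bytes": target.stat().st_size,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    with (lake_root / "catalog.jsonl").open("a", encoding="utf-8") as catalog:
        catalog.write(json.dumps(entry) + "\n")
    return target


def demo() -> list[dict]:
    """Ingest a CSV, a JSON, and a text file; return the catalog entries."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)
        inbox = tmp_path / "inbox"
        inbox.mkdir()
        (inbox / "orders.csv").write_text("id,amount\n1,9.99\n", encoding="utf-8")
        (inbox / "clicks.json").write_text('{"user": "a", "page": "/home"}\n', encoding="utf-8")
        (inbox / "notes.txt").write_text("free-form support note", encoding="utf-8")

        lake = tmp_path / "lake"
        for f in sorted(inbox.iterdir()):
            # Use the file extension as a simple, made-up source label.
            ingest_raw(f, lake, source_name=f.suffix.lstrip("."))
        lines = (lake / "catalog.jsonl").read_text(encoding="utf-8").splitlines()
        return [json.loads(line) for line in lines]
```

Nothing is parsed or validated at ingestion time; structured, semi-structured, and unstructured files are all treated the same way, and the small catalog is what keeps them discoverable.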
Why do you need a data lake?
Data lakes are easily confused with data warehouses, but they have some distinct differences that can offer big benefits to the right organizations, especially as big data and big data processes continue to migrate from on-premises to the cloud.
Traditional data management approaches are not suited to big data and big data analytics, or can handle them only at great cost. With big data analytics, we essentially want to find correlations between different data sets, which need to be combined in order to achieve a business outcome. If these data sets sit in entirely different systems, that is virtually impossible. Data lakes can be seen as a way to end data silos in a fast-growing and increasingly unstructured big data universe.
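As a rough illustration of combining data sets that used to live in separate systems, the sketch below joins a CRM extract (CSV) with web click events (JSON lines) on a shared customer id. The sample data, field names, and the function `clicks_per_segment` are hypothetical.

```python
import csv
import io
import json
from collections import defaultdict

# Hypothetical extracts from two formerly separate systems.
CRM_CSV = "customer_id,segment\n1,premium\n2,basic\n3,basic\n"
CLICKS_JSONL = "\n".join([
    '{"customer_id": 1, "page": "/pricing"}',
    '{"customer_id": 1, "page": "/checkout"}',
    '{"customer_id": 2, "page": "/home"}',
])


def clicks_per_segment(crm_csv: str, clicks_jsonl: str) -> dict:
    """Join click events to CRM segments on customer_id and count clicks per segment."""
    # Build a lookup table from the CRM data: customer id -> segment.
    segment_by_customer = {
        int(row["customer_id"]): row["segment"]
        for row in csv.DictReader(io.StringIO(crm_csv))
    }
    counts = defaultdict(int)
    for line in clicks_jsonl.splitlines():
        event = json.loads(line)
        # Events for customers missing from the CRM are counted as "unknown".
        segment = segment_by_customer.get(event["customer_id"], "unknown")
        counts[segment] += 1
    return dict(counts)
```

Once both data sets are readable from one place, a question such as "which customer segment generates the most clicks?" becomes a few lines of code rather than a cross-system integration project.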
Reasons for using a data lake:
- With the arrival of storage engines like Hadoop, storing disparate information has become easy. With a data lake, there is no need to model data into an enterprise-wide schema.
- As data volume, data quality, and metadata increase, the quality of analyses also increases.
- A data lake offers business agility.
- Machine learning and artificial intelligence can be applied to the data to make profitable predictions.
- It offers a competitive advantage to the implementing organization.
- There are no data silos. A data lake gives a 360-degree view of customers and makes analysis more robust.
Features of a data lake:
- Handles large volumes of diverse (structured, semi-structured, and unstructured) data
- Mostly detailed source data
- Raw material for discovering entities and facts
- Data prep on demand
- Data repurposed later, as needs arise
- Typically schema on read
- Persists data in its original raw state
- Integrates into multiple enterprise data ecosystems and architectures
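The "schema on read" and "persists data in its original raw state" features above can be sketched as follows. The raw records stay untouched, and a schema is applied only when a consumer reads them. The sample records, the `Purchase` schema, and the function `read_purchases` are assumptions made for illustration.

```python
import json
from dataclasses import dataclass

# Raw events exactly as they might land in the lake (hypothetical sample data).
RAW_LINES = [
    '{"user_id": "42", "amount": "19.90", "channel": "web"}',
    '{"user_id": 7, "amount": 5, "note": "field the schema ignores"}',
    'not valid json at all',
    '{"amount": "3.50"}',
]


@dataclass
class Purchase:
    """The schema one particular consumer chooses to apply at read time."""
    user_id: int
    amount: float


def read_purchases(raw_lines):
    """Apply the Purchase schema while reading; count records that do not fit."""
    purchases, rejected = [], 0
    for line in raw_lines:
        try:
            record = json.loads(line)
            # Coerce types here, at read time, not when the data was stored.
            purchases.append(Purchase(int(record["user_id"]), float(record["amount"])))
        except (ValueError, KeyError, TypeError):
            # Malformed JSON, missing fields, or unexpected shapes are skipped.
            rejected += 1
    return purchases, rejected
```

A different consumer could read the same raw lines with a different schema, for example one that keeps the `channel` field, without any change to what is stored in the lake.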