The industry has looked to several models for collecting, storing and imposing order on an organization’s data, and one of the most talked-about in the last decade is the ‘Data Lake’.
Before we talk about that, we need to recap the Data Warehouse concept discussed in a previous post: a systematized process and tightly organized repository for the data an organization needs for its operations and analytics. That strict structure keeps data analysis-ready, but it also has a dark side: it slows the collection and processing of data, making it impractical for Big Data.
The Data Lake emerged as an alternative answer to this challenge: it accepts any kind of data in raw form. That makes it attractive to organizations that collect massive volumes of data at high speed and deal with the full spectrum of structured and unstructured data.
Hadoop - The Original Data Lake
Hadoop - which more or less pioneered the data lake concept (Microsoft also provides one in Azure) - is an ecosystem of tools for handling “Big Data” that can be scaled to handle zettabytes’ worth of data. It all started in 2005, when Doug Cutting (who soon took the project to Yahoo!) and Michael J. Cafarella leveraged concepts from a Google paper to create a bigger and better search engine by coordinating the processing power of multiple computers. Project ‘Hadoop’ - named after a toy stuffed elephant - was turned over to the Apache Software Foundation, which manages it to this day as an open source project. Numerous software platforms have since been built on top of its Java/Linux core, making it possible to leverage its benefits using more common programming and query languages, such as SQL and Python.
The Data Lake, as exemplified by Hadoop, is notably distinguished by its ability to deal with almost any type of data. It takes everything from transactions to emails to images and videos.
What’s Under the Hood in Hadoop?
Hadoop is built on three core components:
HDFS (the Hadoop Distributed File System), which manages storage and decides where each block of data lives across the cluster
YARN (Yet Another Resource Negotiator), which allocates and manages the cluster’s compute resources
MapReduce, the programming model that actually processes the data, and the main way developers put the cluster to work.
There’s also Ambari, a general management and monitoring dashboard, and ZooKeeper, which coordinates the distributed services.
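For a feel of what working with HDFS looks like in practice, here’s a minimal sketch that talks to a cluster over WebHDFS, the REST interface HDFS exposes. The NameNode address, port (9870 is the Hadoop 3 default for the NameNode web interface), user name and paths are all illustrative assumptions about a particular deployment, not anything this post prescribes.

```python
# A hedged sketch of talking to HDFS over its WebHDFS REST API using the
# requests library. The host, port, user and paths below are assumptions.
import requests

NAMENODE = "http://namenode.example.com:9870"   # assumed NameNode address
USER = "analyst"                                # assumed HDFS user name

def list_directory(path):
    """List the contents of an HDFS directory via op=LISTSTATUS."""
    resp = requests.get(
        f"{NAMENODE}/webhdfs/v1{path}",
        params={"op": "LISTSTATUS", "user.name": USER},
    )
    resp.raise_for_status()
    return resp.json()["FileStatuses"]["FileStatus"]

def read_file(path):
    """Read a file's bytes via op=OPEN (the NameNode redirects to a DataNode)."""
    resp = requests.get(
        f"{NAMENODE}/webhdfs/v1{path}",
        params={"op": "OPEN", "user.name": USER},
        allow_redirects=True,
    )
    resp.raise_for_status()
    return resp.content

if __name__ == "__main__":
    # Print the name and size of everything under an assumed landing directory.
    for status in list_directory("/data/raw"):
        print(status["pathSuffix"], status["length"])
```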
Furthermore, Hadoop leverages three major computing principles - clustering, MapReduce and schema-on-read - that allow it to scale workloads over vast networks of servers. The concepts are pretty in-depth, but here’s a high-level summary:
Clustering. Under the traditional ‘client/server’ model, one computer hands work to a single machine--whether that’s another process on the same box or a separate server. But when you’re dealing with Big Data this just doesn’t cut it. Hadoop blows the lid off computing power and enables massive web-scale applications by coordinating ‘clusters’ of computers for storage, processing and computation. Such ‘clustering’--also referred to as ‘distributed computing’--lets the user provision additional servers as needed, allowing the data infrastructure to be expanded almost indefinitely.
MapReduce. When Hadoop runs a job, it ‘maps’ the work out to the cluster ‘nodes’ where the relevant data is stored, computing intermediate results in place, and then ‘reduces’ those results by aggregating them into a final answer. A key point is that the underlying data doesn’t change or move--the computation travels to the data, and Hadoop gives the user a very simplified, abstracted view of what might actually be a very convoluted storage structure. Because the map steps run in parallel across the cluster, jobs execute far more quickly than they would on a single machine.
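To make the pattern concrete, here’s a toy word count in plain Python that mimics the map, shuffle and reduce phases on a single machine. It’s a conceptual illustration only - real Hadoop jobs express these phases through its Java API or wrappers built on top of it - and the sample documents are made up.

```python
# A single-machine illustration of the map/shuffle/reduce pattern that Hadoop
# applies across many nodes. Plain Python, not Hadoop's actual API.
from collections import defaultdict
from functools import reduce

documents = [
    "big data needs big storage",
    "data lakes store raw data",
]

# Map: each "node" emits (key, value) pairs for its slice of the data.
def map_phase(doc):
    return [(word, 1) for word in doc.split()]

# Shuffle: group together every value emitted for the same key.
def shuffle(mapped):
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups

# Reduce: collapse each key's values into a single result (here, a count).
def reduce_phase(groups):
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}

word_counts = reduce_phase(shuffle(map(map_phase, documents)))
print(word_counts)  # e.g. {'big': 2, 'data': 3, 'needs': 1, ...}
```

In a real cluster, each call to the map step would run on the node that holds that slice of the data, and only the intermediate (key, value) pairs would move over the network during the shuffle.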
Schema-on-Read. Here’s where Hadoop’s “data lake” approach really diverges from the data warehouse concept. A data warehouse uses a strict ‘schema-on-write’ ETL approach requiring the data to conform to a specified schema before it is stored (Janković et al., 2018). But with Hadoop’s “schema-on-read” approach, the schema isn’t applied until the data is retrieved for analysis. So a wide range of data types can be ingested very quickly, without making any assumptions about how they might be used in the future. This ‘shoot first and ask questions later’ approach is helpful for organizations that suspect their data may be useful at some future point, but don’t currently know what to use it for (which, to be honest, probably describes most organizations).
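Here’s a hedged sketch of that difference in Python: raw records land in the lake exactly as they arrive, and a schema is only imposed at read time, for whatever question the analyst happens to be asking. The file name and field names are illustrative assumptions.

```python
# Schema-on-read in miniature: ingest raw records untouched, apply a schema
# only when reading. The file and field names here are made-up examples.
import json

# "Ingest": write whatever arrives, as-is, with no upfront schema checks.
raw_events = [
    '{"user": "a1", "amount": "19.99", "ts": "2024-01-05"}',
    '{"user": "b2", "note": "free-form text, no amount field"}',
]
with open("lake_events.jsonl", "w") as f:
    f.write("\n".join(raw_events))

# "Read": impose a schema now, at analysis time, keeping only the records
# that fit the question being asked.
def read_with_schema(path):
    for line in open(path):
        record = json.loads(line)
        try:
            yield {"user": record["user"], "amount": float(record["amount"])}
        except (KeyError, ValueError):
            continue  # doesn't match this schema; it stays in the lake for other uses

print(list(read_with_schema("lake_events.jsonl")))
# [{'user': 'a1', 'amount': 19.99}]
```

Notice that the second record isn’t rejected at ingest time; it simply sits in the lake until some future read applies a schema it does fit.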
Data Lake Drawbacks
Despite all this, data lakes like Hadoop aren’t good for every scenario. For example, critics point out that Hadoop can present problems when dealing with smaller datasets, and some of the technologies designed to support Python programming and machine learning on top of it may lag in terms of usability. So while analysts working in Hadoop may be able to mine massive amounts of data, they may find themselves doing it with less desirable tools.
Critics also note that the Data Lake’s greatest strength is also its greatest weakness; it doesn’t provide the structure necessary to make the data useful. Some have even gone so far as to call it a ‘data swamp’. Consequently, organizations continue to leverage multiple kinds of repositories for different use cases.
These problems have given rise to solutions designed to make Hadoop more manageable. Read on to learn how it is possible not only to give Hadoop a level of structure similar to that of a Data Warehouse, but also to access it as just one part of a data ecosystem that includes many other types of repositories, such as relational and NoSQL databases.