Organizations use a number of models to manage access to their data, the latest and most dynamic being the data fabric. To understand its value, it helps to see how it fundamentally differs from an approach closely associated with big data: the data lake.
The data lake arose partly as a way to overcome the limits of the data warehouse, an attempt to collect and organize all of an organization’s relevant data in a single, highly organized, tightly managed repository. The warehouse concept, however, didn’t scale to big data. Its schema-on-write model meant that data had to conform to strict rules before being loaded, which required heavy oversight, time, and work, and often created bottlenecks in data pipelines.
With the ‘data lake’ concept (which first took shape as Hadoop), instead of a tightly controlled, library-style repository you have a massive reservoir that accepts any kind of data in raw form. For companies ingesting large volumes of structured and unstructured data at high speed, the appeal was obvious.
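To make the contrast concrete, here is a minimal sketch of schema-on-write versus schema-on-read, using only Python’s standard library (the table, fields, and file name are hypothetical):

```python
import json
import sqlite3

# --- Schema-on-write (warehouse style) ---
# The schema is enforced at load time: records must conform
# before they are allowed into the repository at all.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL NOT NULL)")

record = {"id": 1, "amount": None}  # a non-conforming record
try:
    db.execute("INSERT INTO orders VALUES (?, ?)", (record["id"], record["amount"]))
except sqlite3.IntegrityError as err:
    print("rejected at write time:", err)  # the warehouse bottleneck in miniature

# --- Schema-on-read (lake style) ---
# Anything goes in, stored raw; structure is imposed only when
# somebody reads the data back for analysis.
with open("lake_orders.json", "w") as lake:
    lake.write(json.dumps(record))  # lands as-is, no validation

with open("lake_orders.json") as lake:
    parsed = json.loads(lake.read())  # the schema question is deferred to read time
    print("read back from the lake:", parsed)
```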
The lake concept, however, has limitations of its own, which are addressed by the subsequent paradigm of the data fabric. Let’s explore why by looking at some of the similarities and differences between the two approaches.
How data lakes and data fabrics are similar
Data lakes and data fabrics have a few things in common, both conceptually and functionally. For example, both of these systems:
Are designed to handle massive scale. Hadoop ditched the traditional client-server model in favor of ‘distributed computing’: clustering multiple servers for storage, processing, and computation. In theory, the infrastructure can be expanded indefinitely simply by adding ‘nodes’ (servers) to the cluster. A data fabric, for its part, provides a virtual layer that connects the entirety of an organization’s data, along with all of the processes and platforms attached to it, so it, too, can be expanded indefinitely.
Handle multiple data types. Both the data lake and the data fabric paradigms are designed to handle just about any kind of data in existence (although, as we’ll discuss, their approaches differ in important ways).
How data lakes and data fabrics are different
While both systems create a way to handle massive volumes and varieties of data at high velocity (the three V’s of big data), they also differ in fundamental ways:
Handling both ‘big’ and ‘small’ data. Hadoop was designed to manage massive datasets: it is optimized for ingesting and managing a handful of enormous files, not thousands of small, Excel-sized ones, so smaller datasets pose problems for it. For a data fabric this is not an issue, because the fabric is essentially a layer that connects multiple repositories. You can use whatever mix of repositories suits the size of your datasets and still control them as one system.
Handling relational data. Similarly, while a data lake can store relational data, it isn’t optimal for it. The relational model underlying Oracle, SQL Server, MySQL, PostgreSQL, and the like remains the superior way to handle structured data, such as the transactional records that form the backbone of financial and ecommerce systems. Relational databases provide the ACID guarantees (atomicity, consistency, isolation, durability) that keep those systems from breaking; the first sketch after this list shows the idea in miniature. As with the previous point, this isn’t an issue for a data fabric, which lets you use whatever system is best for each data type. The fabric is agnostic: it allows you to draw simultaneously from Hadoop for unstructured data, MySQL for relational data, and MongoDB for semi-structured data, for example.
Data lakes weren’t built with analytics in mind. Hadoop’s primary objective was capturing huge amounts of data; making it easy for analysts to derive insights was an afterthought. Consequently, many of the tools analytics teams rely on, such as R and Python, are awkward to use against Hadoop, and even the attempts to bridge the gap, such as Spark, come with a steep learning curve. Because a data fabric doesn’t force you to store data in any particular format, you’re free to use the tools appropriate to each format: you can still use Spark when working through the fabric against Hadoop, but you can also use R, Python, or any BI tool suited to the dataset and its storage format.
Data lakes provide no structure. Many organizations that took the ‘data lake plunge’ discovered that the very thing that makes a data lake so useful and versatile (its ability to handle any kind of data without requiring a particular schema or structure) also makes it easy for data to get lost. The data fabric addresses this by using AI and machine learning to map and make sense of your data across repositories. In that sense, it provides a next-generation version of the data catalog, which was likewise designed to bring order to unwieldy big data architectures.
Moving the data. Data lakes like Hadoop make moving data less painful by employing ELT (Extract, Load, Transform) as an alternative to ETL (Extract, Transform, Load). Since the transformation doesn’t have to happen until the data is recalled for analysis, you can move data much faster (the second sketch below contrasts the two). But you still have to move it, and for most organizations this is impractical. Moreover, as we’ve already discussed, even setting aside the pain of migrating all of an organization’s data into a lake like Hadoop, it wouldn’t make sense, because a lake isn’t optimal for every kind of data. A data fabric, by contrast, doesn’t require you to move data at all: it creates a virtualized layer that lets you work with data where it lives, regardless of the kind of repository it’s in.
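To make the ACID point from ‘Handling relational data’ concrete, here is a minimal sketch using Python’s standard-library sqlite3 module (the accounts and amounts are hypothetical). A transfer either commits in full or rolls back entirely; that all-or-nothing behavior is exactly what transactional systems depend on:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, "
           "balance REAL CHECK (balance >= 0))")
db.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100.0), ("bob", 50.0)])
db.commit()

def transfer(amount: float) -> None:
    """Move funds atomically: both updates succeed, or neither does."""
    try:
        with db:  # transaction scope: commits on success, rolls back on error
            db.execute("UPDATE accounts SET balance = balance - ? WHERE name = 'alice'", (amount,))
            db.execute("UPDATE accounts SET balance = balance + ? WHERE name = 'bob'", (amount,))
    except sqlite3.IntegrityError:
        print(f"transfer of {amount} rolled back: it would overdraw the account")

transfer(30.0)   # succeeds: both balances change together
transfer(500.0)  # violates the CHECK constraint: neither balance changes

print(db.execute("SELECT name, balance FROM accounts").fetchall())
# [('alice', 70.0), ('bob', 80.0)]: the failed transfer left no partial state
```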
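And to make the ETL/ELT distinction concrete, here is a second sketch, again in standard-library Python (the record layout is hypothetical, and the query-time step assumes a SQLite build with the JSON1 functions, which modern builds include). With ETL the cleanup happens before the load; with ELT the raw records land immediately and the transformation is deferred to query time:

```python
import json
import sqlite3

raw_events = ['{"user": "A", "ms": 1200}', '{"user": "B", "ms": "oops"}']
db = sqlite3.connect(":memory:")

# --- ETL: transform and validate first, load only clean rows ---
db.execute("CREATE TABLE events_etl (user TEXT, seconds REAL)")
for line in raw_events:
    rec = json.loads(line)
    if isinstance(rec["ms"], (int, float)):  # the 'T' happens before the 'L'
        db.execute("INSERT INTO events_etl VALUES (?, ?)", (rec["user"], rec["ms"] / 1000))

# --- ELT: load everything raw immediately; transform when queried ---
db.execute("CREATE TABLE events_elt (raw TEXT)")
db.executemany("INSERT INTO events_elt VALUES (?)", [(line,) for line in raw_events])

# the 'T' happens here, at read time, via SQLite's JSON functions
rows = db.execute(
    "SELECT json_extract(raw, '$.user'), json_extract(raw, '$.ms') / 1000.0 "
    "FROM events_elt "
    "WHERE typeof(json_extract(raw, '$.ms')) IN ('integer', 'real')"
).fetchall()
print(rows)  # [('A', 1.2)]
```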
One final point: a data fabric may contain a data lake, but a data lake will never contain a data fabric. A data fabric exists at a higher level of abstraction, spanning all kinds of data sources, one of which may be a data lake. So with a data fabric you can have an overarching access layer that includes Hadoop alongside other repositories, such as those better suited to smaller data assets or relational data.
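To close with something concrete, the sketch below shows what fabric-style, virtualized access can look like in practice. Dedicated data-fabric and virtualization products exist for this; here PySpark merely stands in as the federation engine for illustration. It assumes a working Spark installation, a MySQL JDBC driver on the classpath, and hypothetical hosts, paths, tables, and column names. The point is that one query spans the lake and the relational store without copying either dataset anywhere first:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fabric-sketch").getOrCreate()

# Semi-structured event data stays in the Hadoop data lake...
clicks = spark.read.json("hdfs://namenode:8020/lake/clickstream/")

# ...while transactional records stay where they belong, in MySQL.
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://db-host:3306/sales")
          .option("dbtable", "orders")
          .option("user", "analyst")
          .option("password", "...")
          .load())

# One query spanning both repositories, with the data left in place.
report = (clicks.join(orders, "customer_id")
                .groupBy("campaign")
                .count())
report.show()
```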