A data warehouse is a data management system designed to support business intelligence (BI) activities by centralizing and consolidating large amounts of data from multiple sources within an organization, and often includes large amounts of historical or “batch” data.
The concept of data warehousing was introduced in 1988 by IBM researchers Barry Devlin and Paul Murphy, and has evolved as the complexity and size of computer systems has increased. The related concept of a ‘data mart’ essentially involves creating a data warehouse that is focused on a specific purpose, such as marketing, HR or logistics.
How do you run a data warehouse?
Building a data warehouse involves several processes, including first data extraction, which involves gathering large amounts of data. The data is regularly retrieved from various organizational sources and brought into the data warehouse or data mart via a process of Extraction, Transformation and Loading, otherwise known as ‘ETL’.
As part of the ‘Transform’ portion of ETL, the data is ‘cleansed’ from errors or inconsistencies and converted to a warehouse format. This procedure is necessary for the ongoing maintenance and operation of data warehouses, as incomplete or otherwise low quality data are regularly obtained from data repositories. Finally, it is sorted, consolidated and summarized for ease of use.
Once the data is in the warehouse, it theoretically should not be changed or altered, and should represent a ‘single version of the truth’. Consequently, data warehouses require secure and reliable storage.
What’s the benefit of data warehousing?
Data warehousing facilitates data mining which involves analyzing large quantities of data to identify patterns, trends, anomalies, dependencies, results and other useful information, which is then typically presented in a user-friendly format like a chart or table. The resulting information could provide insight into customer behaviors, spending patterns, operational efficiencies, or just about any other actionable business decision.
Because they are formally structured for analysis, data from data warehouses should in theory provide more actionable value than data analyzed from other sources. However, in considering data warehousing strategies, it’s important to recognize that the ability of data warehouses to yield high-value insights does not necessarily mean that they are providing the full picture based on all the data that is currently available.
What are the drawbacks of data warehousing?
There are two limitations of traditional data warehouses that need to be discussed in order to understand where they can and cannot provide value. One problem lies in the non-real-time nature of the data, and relates to the fact that data must be ETL’d into the system. Data warehouses store and maintain analytical data separately from operational databases. While a transactional database monitors and updates real-time data, a data warehouse contains data that is aggregated over a certain time period and might not be up to date.
Furthermore, the fact that data has to be moved creates a tremendous amount of work for the company, which costs money, time and resources. It also means that multiple copies of the data now exist in the organization.
Second, traditional data warehouse structures are limited to relational database (RDBMS) models. The data is stored in it in a highly structured and organized way. All the data must fit neatly into a predesignated table and columns and therefore has to conform to strict rules regarding data type and size. While this enables data to be queried and used in a straightforward way, it also presents a major challenge in that new data sources, referred to collectively as ‘Big Data’, do not easily fit into such rigid paradigms. This umbrella class encompasses data that is:
Much less structured and often comes with little or no metadata
Collected in larger volume, velocity, and variety
Is more likely to data quality issues
This data might be generated by any number of sources, ranging from social media data to machine-generated Internet of Things (IOT) data, and has the potential to answer an entirely new set of questions.
What’s next for data warehousing?
The next phase in data warehousing will involve dealing with these issues by integrating the ability to analyze data from non-relational sources (like data lakes and NoSQL databases), and eliminating the need to actually move or create copies of the data. By the same token, this will involve dealing with the general issue of data being increasingly spread across silos and geographies, and residing in both on-premises and cloud repositories which might be managed by different vendors.
To learn more about the next phase in data warehousing, read our blog on the Unified Data Analytics Warehouse.
You might also enjoy reading our blog on the related topic of the data catalog.