Warehouse Not Dead!

Jérôme Darmont, Pegdwendé N. Sawadogo

Université de Lyon, Lyon 2, ERIC EA 3083

Since the 1990s, Data Warehouses (DWs) have been the foundation of business intelligence applications. Through Extract, Transform and Load (ETL) processes, DWs can ingest voluminous and, to some extent, heterogeneous data (e.g., thanks to XML DWs [1]) from various sources, and they support different kinds of analysis such as dashboards, On-Line Analytical Processing (OLAP) and data mining. Yet, with the advent of big data, DWs faced new challenges: even greater data volume and variety, as well as velocity (e.g., data from the Internet of things) and veracity issues (e.g., data quality problems when using external data).

Thus, in the early 2010s, the concept of Data Lake (DL) emerged to address big data management issues. A DL is a large storage system for raw, heterogeneous data that is fed by multiple data sources and allows users to explore, extract and analyze the data [2]. Moreover, DLs are quite often described as addressing the shortcomings of DWs, and even as DW killers [3].

The main differences between DWs and DLs lie at the data management and data analysis levels. As stated above, DWs mostly store structured data that are cleansed through ETL processes, while DLs store raw, heterogeneously structured data that are transformed for analysis a posteriori (Extract, Load, Transform or ELT) [4]. Moreover, DWs bear a fixed schema, which is referred to as schema-on-write or early binding, while DLs have no predefined schema and their data structure may evolve significantly, which is referred to as schema-on-read or late binding.
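The contrast between schema-on-write and schema-on-read can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the records, the `reading` table and the function names are hypothetical, an in-memory SQLite table stands in for a DW, and a list of raw JSON lines stands in for a DL.

```python
import json
import sqlite3

# Hypothetical raw records from heterogeneous sources (illustrative only).
RAW_RECORDS = [
    '{"sensor": "s1", "temp": 21.5}',
    '{"sensor": "s2", "temp": "n/a", "battery": 0.8}',
    '{"device": "d7", "humidity": 40}',
]


def load_schema_on_write(records):
    """Warehouse style: a fixed schema is enforced when data is written."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE reading (sensor TEXT NOT NULL, temp REAL NOT NULL)")
    rejected = []
    for line in records:
        rec = json.loads(line)
        sensor, temp = rec.get("sensor"), rec.get("temp")
        # Only records that fit the predefined schema are stored.
        if isinstance(sensor, str) and isinstance(temp, (int, float)):
            conn.execute("INSERT INTO reading VALUES (?, ?)", (sensor, float(temp)))
        else:
            rejected.append(rec)
    return conn, rejected


def read_schema_on_read(records, field):
    """Lake style: raw records are kept; structure is interpreted at query time."""
    values = []
    for line in records:
        rec = json.loads(line)
        if field in rec:
            values.append(rec[field])
    return values


conn, rejected = load_schema_on_write(RAW_RECORDS)
warehouse_rows = conn.execute("SELECT sensor, temp FROM reading").fetchall()
lake_humidity = read_schema_on_read(RAW_RECORDS, "humidity")
```

In this sketch, the warehouse-style loader keeps only the one record that matches its schema and rejects the other two, whereas the lake-style reader can still answer a question about `humidity`, a field nobody planned for when the data arrived.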

As for analytics, DWs mostly enable predefined, industrialized, query language-based analyses performed by business users (through, e.g., the SQL or MDX languages), while DLs aim at on-the-fly, ad-hoc and programming-based analyses performed by data scientists who access data through Application Programming Interfaces (APIs) [5]. 

Consequently, it is quite clear that DWs and DLs do not offer the same features, and more importantly, do not serve the same purposes. Moreover, DWs and DLs are even complementary. Since DLs allow easy and cheap storage of large amounts of raw data, they can serve as staging areas or Operational Data Stores (ODSs), i.e., intermediary data stores ahead of DWs that gather operational data from several sources before the ETL process takes place [5, 6]. And with a DL sourcing a DW, possibly with semi-structured data, industrialized OLAP analyses are possible over the lake’s data, while on-demand, ad-hoc analyses are still possible directly from the DL.
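The complementarity described above, with a DL acting as a staging area ahead of a DW, can be sketched as follows. This is a toy example under stated assumptions: the `LAKE` records, the `sales` table and the function name are hypothetical, and an in-memory SQLite database again stands in for the DW.

```python
import json
import sqlite3
from collections import Counter

# Hypothetical lake content: raw JSON lines kept exactly as they arrived.
LAKE = [
    '{"store": "Lyon", "product": "book", "amount": "12.50"}',
    '{"store": "Lyon", "product": "pen", "amount": "2.00"}',
    '{"store": "Paris", "product": "book", "amount": "15.00", "coupon": "SPRING"}',
    '{"store": "Paris", "note": "till closed early"}',
]


def etl_lake_to_warehouse(lake):
    """Extract records from the lake, transform them, load them into a DW table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (store TEXT, product TEXT, amount REAL)")
    for line in lake:
        rec = json.loads(line)  # extract
        if not {"store", "product", "amount"} <= rec.keys():
            continue  # records that do not fit the DW schema stay in the lake only
        row = (rec["store"], rec["product"], float(rec["amount"]))  # transform
        conn.execute("INSERT INTO sales VALUES (?, ?, ?)", row)  # load
    return conn


# Industrialized, query language-based analysis over the warehouse.
warehouse = etl_lake_to_warehouse(LAKE)
sales_by_store = warehouse.execute(
    "SELECT store, SUM(amount) FROM sales GROUP BY store ORDER BY store"
).fetchall()

# Ad-hoc, programming-based analysis directly on the lake:
# which fields appear in the raw data, and how often?
field_usage = Counter(key for line in LAKE for key in json.loads(line))
```

Here the aggregate query runs on cleansed data in the warehouse, while the field count runs on the raw lake and also sees the `coupon` and `note` fields that the ETL step left out.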

Furthermore, a whole DW may be part of a DL. For instance, in the data pond architecture [7], subdivisions of a DL, i.e., data ponds, aim to store and manage data of a specific type, i.e., structured, semi-structured or unstructured data. The application data pond, which generally stores structured data from relational databases that are integrated via an ETL process, is a DW.

In conclusion, many applications nowadays need neither big data technology nor ad-hoc analyses. There, DWs remain the right tool. Even when big data come into play, NoSQL DWs can handle the job [8, 9]. DLs are not DW killers because, as we have shown above, they do not serve the same purpose. DWs and DLs are even quite complementary and can smoothly coexist within a decision-support information system.

Therefore, Data Warehouses, although undoubtedly less trendy than Data Lakes, are definitely not dead!

References:

[1] Jaroslav Pokorný. XML Data Warehouse: Modelling and Querying. BalticDB&IS 2002: 267-280.

[2] James Dixon. Pentaho, Hadoop, and Data Lakes. https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/, 2010.

[3] Cédrine Madera, Anne Laurent. The next information architecture evolution: the data lake wave. MEDES 2016: 174-180.

[4] Isuru Suriarachchi, Beth Plale. Crossing analytics systems: A case for integrated provenance in data lakes. eScience 2016: 349-354.

[5] Huang Fang. Managing Data Lakes in Big Data Era: What’s a data lake and why has it become popular in data management ecosystem. CYBER 2015: 820-824.

[6] Brian Stein, Alan Morrison. The enterprise data lake: Better integration and deeper analytics. PWC Technology Forecast, No. 1, 2014: 1-9.

[7] Bill Inmon. Data Lake Architecture: Designing the Data Lake and avoiding the garbage dump. Technics Publications, 2016.

[8] Max Chevalier, Mohammed El Malki, Arlind Kopliku, Olivier Teste, Ronan Tournier. Document-oriented Models for Data Warehouses – NoSQL Document-oriented for Data Warehouses. ICEIS 2016: 142-149.

[9] Hajer Akid, Mounir Ben Ayed. Towards NoSQL Graph Data Warehouse for Big Social Data Analysis. ISDA 2016: 965-973.
