Data Modeling for Analytical Data Warehouses. Interview with Michael Blaha.
“Many data warehouses contain sensitive data such as personal data. There are legal and ethical concerns with accessing such data. So the data must be secured and access controlled as well as logged for audits” — Michael Blaha.
This is the third interview with our expert Dr. Michael Blaha on the topic Database Modeling. This time we look at the issue of data design for Analytical Data Warehouses.
In previous interviews we looked at how good is UML for database design , and how good are Use Cases for database modeling.
Hope you`ll find this interview interesting. I encourage the community to post comments.
Q1: What is the difference between data warehouses and day-to-day business applications?
Michael Blaha: Operational (day-to-day business) applications serve the routine needs of a business handling orders, scheduling manufacturing runs, servicing patients, and generating financial statements.
Operational applications have many short transactions that must process quickly. The transactions both read and write.
Well-written applications pay attention to data quality, striving to ensure correct data and avoid errors.
In contrast analytical (data warehouse) applications step back from the business routine and analyze data that accumulates over time. The idea is to gain insight into business patterns that are overlooked when responding to routine needs. Data warehouse queries can have a lengthy execution time as they process reams of data, searching for underlying patterns.
End users read from a data warehouse, but they don`t write to it. Rather writing occurs as the operational applications supply new data that is added to the data warehouse.
Q2: How do you approach data modeling for data warehouse problems?
Michael Blaha: For operational applications, I use the UML class model for conceptual data modeling. (I often use Enterprise Architect.) The notation is more succinct than conventional database notations and promotes abstract thinking.
In addition, the UML class model is understandable for business customers as it defers database design details. And, of course, the UML reaches out to the programming side of development.
In contrast, for analytical applications, I go straight to a database notation. (I often use ERwin.) Data warehouses revolve around facts and dimensions. The structure of a data warehouse model is so straightforward (unlike the model of operational application) that a database notation alone suffices.
For a business user, the UML model and the conventional data model look much the same for a data warehouse.
The programmers of a data warehouse (the ETL developers) are accustomed to database notations (unlike the developers in day-to-day applications).
As an aside, I note that in a past book (A Manager`s Guide to Database Technology) I used a UML class model for analytical modeling. In retrospect I now realize that was a forced fit. The class model does not deliver any benefits for data warehouses and it`s an unfamiliar technology for data warehouse developers, so there`s no point in using it there.
Q3: Is there any synergy between non-relational databases (NoSQL, Object Databases) and data warehouses?
Michael Blaha: Not for conventional data warehouses that are set-oriented. Mass quantities of data must be processed in bulk. Set-oriented data processing is a strength of relational databases and the SQL language. Furthermore, tables are a good metaphor for facts and dimensions and the data is intrinsically strongly typed.
NoSQL (Hadoop) is being used for mining Web data. Web data is by its nature unstructured and much different from conventional data warehouses.
Q4: How do data warehouses achieve fast performance?
Michael Blaha: The primary technique is pre-computation, by anticipating the need for aggregate data and computing it in advance. Indexing is also important for data warehouses, but less important than with operational applications.
Q5: What are some difficult issues with data warehouses?
–Abstraction. Abstraction is needed to devise the proper facts and dimensions. It is always difficult to perform abstraction.
–Conformed dimensions. A large data warehouse schema must be flexible for mining. This can only be achieved if data is on the same basis. Therefore there is a need for conformed dimensions.
For example, there must be a single definition of Customer that is used throughout the warehouse.
–Size. The sheer size of schema and data is a challenge.
–Data cleansing. Many operational applications are old legacy code. Often their data is flawed and may need to be corrected for a data warehouse.
– Data integration. Many data warehouses combine data from multiple applications. The application data overlaps and must be reconciled.
– Security. Many data warehouses contain sensitive data such as personal data. There are legal and ethical concerns with accessing such data. So the data must be secured and access controlled as well as logged for audits.
Q6: What kind of metadata is associated with a data warehouse and is there a role for Object Databases with this?
Michael Blaha: Maybe. Data warehouse metadata includes source-to-target mappings, definitions (of facts, dimensions, and attributes), as well as the organization of the data warehouse into subject areas. The metadata for a data warehouse is just like operational applications. The metadata has to be custom modeled and doesn`t have a standard metaphor for structure like the facts and dimensions of a data warehouse. Relational databases, OO databases, and possibly other kinds of databases are all reasonable candidates.
Q7. In a recent interview, Florian Waas, EMC/Greenplum, said “in the Big Data era the old paradigm of shipping data to the application isn’t working any more. Rather, the application logic must come to the data or else things will break: this is counter to conventional wisdom and the established notion of strata within the database stack. Instead of stand-alone products for ETL, BI/reporting and analytics we have to think about seamless integration: in what ways can we open up a data processing platform to enable applications to get closer? What language interfaces, but also what resource management facilities can we offer? And so on.”
What is your view on this?
Michael Blaha: It’s well known that to get good performance for relational database applications that stored procedures must be used. Stored procedures are logic that is inside the database kernel. Stored procedures circumvent much of the overhead that is incurred by shuttling back and forth between an application process and the database process. So the stored procedure experience is certainly consistent with this comment.
What I try to do in practice is think in terms of objects. Relational database tables and stored procedures are analogous to objects with methods. I put core functionality that is likely to be reusable and computation intensive into stored procedures. I put lightweight functionality and functionality that is peculiar to an application outside the database kernel.
Q8. Hadoop is the system of choice for Big Data and Analytics. How do you approach data modeling in this case?
Michael Blaha: I have no experience with Hadoop. My projects have involved structured data. In contrast Hadoop is architected for unstructured data as is often found on the Web.
Q9. A lot of insights are contained in unstructured or semi-structured data from Big Data applications. Does it make any sense to do data modeling in this case?
Michael Blaha: I have no experience with unstructured data. I have some experience with semi-structured data (XML / XSD). I routinely practice data modeling for XSD files. I published a paper in 2010 lamenting the fact that so
many SOA projects concern storage and retrieval of data and completely lack a data model. I’ve been working on modeling approaches for XSD files, but have not yet devised a solution to my satisfaction.
Michael Blaha is a partner at Modelsoft Consulting Corporation.
Dr. Blaha is recognized as one of the world’s leading authorities on databases and data modeling. He has more than 25 years of experience as a consultant and trainer in conceiving, architecting, modeling, designing, and tuning databases for dozens of major organizations around the world. He has authored six U.S. patents, six books, and many papers. Dr. Blaha received his doctorate from Washington University in St. Louis and is an alumnus of GE Global Research in Schenectady, New York.
– Use Cases and Database Modeling — An interview with Michael Blaha.
– How good is UML for Database Design? Interview with Michael Blaha.
ODBMS.org: Free Downloads and Links on various data management technologies:
NoSQL Data Stores
Graphs and Data Stores
Cloud Data Stores
Entity Framework (EF) Resources
Object-Relational Impedance Mismatch
Databases in general
Big Data and Analytical Data Platforms
I still have “Object-Oriented Modeling and Design” on my bookshelf.
Michael writes “Many data warehouses contain sensitive data..” . The same is true of course also for operational data.
COBIT (www.isaca.org) addresses the issues of compliance and security related to these data with the process P02 “define information architecture”. This process “creates and regularly updates a business information model; This encompasses the development of a corporate data dictionary” etc. The obligation of IT on corporate level is to maintain proper records about the “data syntax rules, data classification scheme and security levels” (beyond modeling on the project level). COBIT seems to have in mind some sort of global “conceptual model” including information about the physical instances and the access rights.
The “Common Warehouse Metamodel” (OMG) on the other hand is a metamodel for the description of the physical instances of operational data, the warehouse and the mapping between both. The CMW relies (no surprise) on MOF to describe the metamodel. The final outcome is – in contrast to the “data dictionary view” of COBIT – a recipee for the creation of mapping services. This however would not be sufficient to support the reporting and auditing requirements of the IT-Auditors from ISACA. It would be necessary to combine both – i.e. a CWM-Dictionary which contains metadata also for auditing and controlling purposes. Such a dictionary could obviously be the basis for specs or even model driven generation of mapping services.
What is the main problem? Michael said “Data warehouse metadata includes source-to-target mappings, definitions (of facts, dimensions, and attributes)”. I think the “relationships” are the biggest challenge. They are the real source of the complexity of this task: E.g. the relationship between an attribute of a (conceptual) entity and the implementations of this attribute by various databases / columns.
We speak often of the so called “impedance mismatch”. There is such a mismatch between the corporate data governance view (COBIT), the MOF-View of the CWM (still quite global) and the views of individual projects / applications (which deal with the real world details). A “real corporate dictionary” could brigde this gap – but this is quite a challenge!