On Big Data and Data Integration. Q&A with Andreas Buckenhofer
You would expect data integration to be well understood and easy by now. In reality, it is still hard. There is no single version of the truth that would allow data to be reused without any changes; situational factors change too rapidly.
Data integration plays a crucial role in getting value out of many data sources. Today there is more data to analyze than ever before. Most ML algorithms expect data to be curated and easily accessible in a single table. A single version of the truth or a ‘one-size-fits-all’ schema is no longer feasible. Speed and flexibility are essential, and sensor data needs a different approach than transactional data from production systems.
Data integration has often had a negative perception: why does it take so many resources? Or even: is data integration really necessary? Microservices in OLTP systems lead to many small databases, each containing only the data of a limited bounded context, so the need for data integration is evident. Additionally, the use of NoSQL databases, which lack constraints and proper data type checking, increases the data quality issues coming out of production systems. There is a vast gap in data-driven awareness across the industry. A DataOps culture of agile production and delivery of high-quality data for a particular business process is necessary.
Q2. What are your experiences in architectural design for Big Data?
Indeed, Big Data needs architecture; don’t treat Big Data as a black box. The slogan “Think big, start small”, familiar from DWH projects, is more relevant than ever. There is so much to say about architecture, so I will focus on two aspects that are often neglected.
Think about the data supply chain. A one-time load of data is trivial, but projects often struggle to get data continuously from internal and external sources. Source system owners don’t like changes to their systems or the additional performance overhead of data extraction. Log-based change data capture tools like GoldenGate or Debezium and streaming platforms like Apache Pulsar or Kafka need to be considered in the overall architecture.
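To make the log-based approach concrete, here is a minimal sketch of consuming Debezium change events from Kafka with the kafka-python client; the topic name, server name, and event fields are illustrative assumptions, not a reference implementation:

    # Sketch: read Debezium change events from a Kafka topic (kafka-python library).
    # Topic name and event fields are assumptions for illustration only.
    import json
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "dbserver1.inventory.customers",        # Debezium topic: <server>.<schema>.<table>
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
        auto_offset_reset="earliest",
    )

    for message in consumer:
        event = message.value or {}             # tombstone messages have no value
        payload = event.get("payload", {})
        op = payload.get("op")                  # 'c' = insert, 'u' = update, 'd' = delete
        after = payload.get("after")            # row state after the change (None on delete)
        print(op, after)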
You need to architect a Big Data system with security in mind; GDPR requirements, for example, can’t be an afterthought. Security often has a negative perception because it is regarded as a limitation on data usage. It should be seen the other way around: an accurately designed Big Data system ensures that users have access only to the data they are entitled to see. A lot of progress has been made in anonymization techniques such as differential privacy over the last years. The privacy of customers must be protected, and ethics must be applied when data is analyzed, whether manually or automatically.
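As a toy illustration of the idea behind differential privacy (a sketch only, not a production mechanism), a count can be released with Laplace noise calibrated to the query’s sensitivity and a chosen privacy budget epsilon; the data and the epsilon value below are made up:

    # Toy sketch of an epsilon-differentially-private count using the Laplace mechanism.
    # A counting query has sensitivity 1; epsilon is the privacy budget (assumed 0.5 here).
    import numpy as np

    def dp_count(values, epsilon=0.5, sensitivity=1.0):
        true_count = len(values)
        noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
        return true_count + noise

    customers_in_segment = ["c1", "c2", "c3", "c4", "c5"]
    print(dp_count(customers_in_segment))   # noisy count, roughly 5 plus or minus a few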
Q3. Much has been said against ETL. What is your experience?
Data curation for data analysis is costly, resource-intensive, complex, and fault-prone; it takes around 80% of the time spent working with data. Therefore, it is paramount to start with a use case. Doing data preparation without a purpose, or for a merely hypothetical future use case, is waste. Ideally, data quality would be ensured in the source systems; practical experience shows that this is not the case.
Traditionally, ETL has been used to extract, transform, and load data. When tremendous amounts of data are processed, the transformation job has to be pushed to the data (ELT) and not vice versa. A common point of critique of ETL and ELT is their time-consuming batch processing mode. Streaming is the way to go for right-time data requirements, but not that much data actually needs to be processed instantly.
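Here is a minimal sketch of the ELT idea of pushing the transformation to the data, with SQLite standing in for the target database; the tables, columns, and conversion rule are illustrative assumptions:

    # ELT sketch: the transformation runs as SQL inside the database engine,
    # instead of pulling rows into an external ETL server. SQLite stands in
    # for the target platform; tables, columns, and rates are illustrative only.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE stg_orders (order_id INTEGER, amount REAL, currency TEXT)")
    con.execute("CREATE TABLE dwh_orders (order_id INTEGER, amount_eur REAL)")
    con.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)",
                    [(1, 100.0, "EUR"), (2, 50.0, "USD")])

    # The 'T' happens in the database: cleanse, convert, load in one set-based statement.
    con.execute("""
        INSERT INTO dwh_orders (order_id, amount_eur)
        SELECT order_id,
               CASE currency WHEN 'USD' THEN amount * 0.9 ELSE amount END
        FROM stg_orders
        WHERE amount IS NOT NULL
    """)
    print(con.execute("SELECT * FROM dwh_orders").fetchall())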
AI without data quality does not work, so skipping ETL is not an option. ETL must be done, no matter whether you call it data wrangling, data munging, or any other term. ML data pipelines do a lot of work similar to classical ETL flows: they need to preprocess data into a format compatible with an ML framework like TensorFlow. These pipelines also have requirements that were not necessary for most classical ETL use cases, such as high availability, frequent schema changes, data versioning, feedback loops, or API support. Additionally, ML models can themselves help during data curation.
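The following sketch shows how such preprocessing resembles classical ETL: raw records are cleansed and encoded into a numeric feature matrix that an ML framework could consume. Field names, the imputation rule, and the encoding are assumptions for illustration:

    # Sketch of ML data preparation that resembles classical ETL: clean raw
    # records and encode them as a numeric feature matrix. Field names,
    # defaults, and the encoding are illustrative assumptions.
    import numpy as np

    raw_records = [
        {"mileage_km": 12000, "fuel": "diesel"},
        {"mileage_km": None,  "fuel": "petrol"},   # missing value to impute
        {"mileage_km": 43000, "fuel": "diesel"},
    ]

    FUEL_CODES = {"petrol": 0.0, "diesel": 1.0}

    def to_features(records):
        known = [r["mileage_km"] for r in records if r["mileage_km"] is not None]
        mean_mileage = sum(known) / len(known)
        rows = []
        for r in records:
            mileage = r["mileage_km"] if r["mileage_km"] is not None else mean_mileage
            rows.append([mileage, FUEL_CODES.get(r["fuel"], -1.0)])
        return np.array(rows, dtype=np.float32)

    print(to_features(raw_records))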
We had NoSQL, and now there is a lot of talk about NoETL. NoETL advocates argue that data is always in sync and development is faster: transformations are coded in view definitions or queries without persisting the data. NoETL can handle elementary data quality flaws, but complex issues like deduplication of records can only be done effectively and efficiently by ETL. NoETL also quickly leads to lousy query performance, as the same transformations have to be executed again and again. The same is true for data virtualization: many sources can be connected easily, but complex business rules do not scale, as the source systems will suffer from performance issues.
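A simple way to see the difference is that a NoETL-style view re-executes its transformation on every query, while an ETL step persists its result once, for example a deduplication. A small sketch, again with SQLite and made-up tables:

    # NoETL vs ETL sketch: a view keeps data "in sync" but re-runs its logic on
    # every query; deduplication is computed once and persisted by an ETL-style
    # statement. SQLite and the sample rows are illustrative only.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE raw_customers (id INTEGER, email TEXT)")
    con.executemany("INSERT INTO raw_customers VALUES (?, ?)",
                    [(1, "a@example.com"), (2, "a@example.com"), (3, "b@example.com")])

    # NoETL: the transformation lives in the view definition, nothing is persisted.
    con.execute("""CREATE VIEW v_customers AS
                   SELECT id, lower(email) AS email FROM raw_customers""")

    # ETL: deduplicate once and persist the result.
    con.execute("""CREATE TABLE customers AS
                   SELECT min(id) AS id, lower(email) AS email
                   FROM raw_customers GROUP BY lower(email)""")
    print(con.execute("SELECT * FROM customers").fetchall())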
Q4. What did you learn in your practical experience in using Hadoop?
Hadoop, with its zoo of tools, provides capabilities like schema-on-read, parallel processing of large data volumes, data immutability, redundancy, and so on. Hadoop took over some of the time-consuming work from ETL jobs. The overall complexity of the Hadoop ecosystem is monstrous, though, and administrating Hadoop is not easy. Currently, there is a movement into the cloud to avoid the operational burden of on-premises Hadoop. Security and many tools still lack maturity and compatibility with each other.
Skills like SQL are priceless when dealing with data. SQL is the best way to work on data, and that is also true for Hadoop or any other Big Data tool; that is why streaming tools also integrate SQL for processing data-in-motion. Hadoop isn’t the only technology suitable for managing high data volumes or unstructured data. I don’t like the term unstructured data: the structure just hasn’t been modelled yet. Other technologies suitable for Big Data are relational database appliances, analytical databases, Spark, and NoSQL stores. Knowledge of distributed systems will become more and more significant in the future.
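As an example of SQL on data-in-motion, here is a sketch using Spark Structured Streaming with its Kafka source; the topic name is a placeholder and the Kafka connector is assumed to be available:

    # Sketch: SQL on data-in-motion with Spark Structured Streaming.
    # Assumes the Kafka connector is on the classpath; the topic name is a placeholder.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("streaming-sql-sketch").getOrCreate()

    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "localhost:9092")
              .option("subscribe", "sensor-events")
              .load())

    # Kafka records arrive as binary key/value; expose the value as text and query it with SQL.
    events.selectExpr("CAST(value AS STRING) AS payload").createOrReplaceTempView("sensor_events")
    counts = spark.sql("SELECT count(*) AS events_seen FROM sensor_events")

    query = (counts.writeStream
             .outputMode("complete")
             .format("console")
             .start())
    query.awaitTermination()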
Q5. What are the pros and cons of a Data Lake?
Does a Data Lake replace a DWH? No, the two complement each other. The pros and cons of a Data Lake can be simplified to flexibility versus quality. The flexibility comes from avoiding up-front data modelling and storing data with a schema-on-read approach; storing sensor data with rapidly changing schemas becomes easy. That approach can quickly lead to a mess, though: what exactly is stored in the Lake? It is necessary to have Metadata Management in place, ideally not just for the Data Lake but enterprise-wide across all data stores.
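A tiny sketch of schema-on-read: raw events land unchanged, and a schema is applied only when the data is read; the sample events and field names are made up. Without metadata, nobody would know what such raw files contain, which is where the catalogue comes in:

    # Schema-on-read sketch: sensor events are stored as raw JSON lines and a
    # schema is applied only at read time. The events and field names are assumptions.
    import json

    raw_lines = [   # imagine these lines sitting as files in the Lake
        '{"sensor": "s1", "temp_c": 21.5, "ts": "2019-01-01T10:00:00Z"}',
        '{"sensor": "s2", "humidity": 0.4, "ts": "2019-01-01T10:00:01Z"}',   # schema drifted
    ]

    def read_with_schema(lines, fields):
        """Project each raw event onto the fields the consumer cares about."""
        for line in lines:
            event = json.loads(line)
            yield {f: event.get(f) for f in fields}

    for row in read_with_schema(raw_lines, ["sensor", "temp_c", "ts"]):
        print(row)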
Metadata Management is a hot topic that is now labelled Data Catalog. Cataloguing technical metadata is only 20% of it, and a few people cannot document everything by themselves. A combination of expert knowledge, swarm intelligence, and tool automation is necessary to build a data catalogue with the vision of an ‘Amazon for information’.
Qx. Anything else you wish to add?
Today, professionals are trained in using tools. A new version comes out, and further training follows with the vendor’s latest release. Unfortunately, there is a lack of education in fundamentals like modelling, architecture, methods, or concepts. Thinking critically and questioning is vital when dealing with data. That sounds trivial, but in reality data quality is often not questioned, and dubious results are produced from poor data: garbage in, garbage out. The increased usage of external data from social media makes it even harder to detect data quality flaws like #fakenews. Getting value out of data needs professionalization based on education and practical experience. I’m glad that I have had the chance to deliver a DWH and Big Data lecture at a university covering the topics mentioned above for several years.
——————————
Andreas Buckenhofer works at Daimler TSS as a Database Professional and has more than 20 years of experience in data integration. He likes to pass on his practical work experience in internal talks and as a speaker at international conferences. He regularly gives a lecture on Data Warehousing and Big Data at Baden-Wuerttemberg Cooperative State University. Within DOAG, he is responsible for the topic Oracle In-Memory Database, and he has been named an ACE Associate by Oracle.