On Data Intensive Applications. Q&A with Oliver Schabenberger
Q1. Briefly explain the concepts of data intensity and data complexity and why they will be important in the coming years to measure digital maturity and resilience of organizations.
Intensity has a formal definition in physics (the power per unit area carried by a wave, such as a sound wave) and more colloquial meanings: the quality of being concentrated, strong, or exerted with a high degree of force. The concept of data intensity is an evolution of the notion of tech intensity, introduced a few years ago by Satya Nadella, CEO of Microsoft. Tech intensity increases as organizations adopt technology and build unique digital capabilities. The idea of amplification by building on top of adopted technology maps to data in two important ways:
- Data intensity is about the degree to which you are focused and driven by data, and the amount and level of insight derived from data. What matters is not only the data you have now and in the future, but also what you do with it: the data-driven assets you build and deploy, from dashboards all the way to AI models.
- The pattern of technology innovation over the past decades is an arc from innovating machinery to innovating computing to innovating around data. The driving force that moves the focus from computing to data is digital transformation. The increase in data intensity we experience today is a consequence of that increase in tech intensity.
Data intensity is often associated with applications, but the concept is broader. Data intensity is a multi-dimensional attribute of organizations. Here are some signs your data intensity is increasing:
- Primary technology concerns are shifting from infrastructure, hardware, and computing to data
- You are working with much more data and data of greater variety
- You can achieve your data-dependent tasks with minimal data movement or data duplication
- More data insights are delivered at the right time, with batch processing replaced with stream processing
- Predictive analytics are augmenting descriptive analytics
- You are building analytic models and deploying them in production
- You have a chief data officer and/or a chief analytics officer
- Your applications handle multiple data use cases (search, query, analytics, prediction) and can scale
- You have a data literacy program
- You are using digital twins
Organizations are on a journey, and you do not go from dashboards on historical data to operating retail stores through digital twins overnight. It is about continuous improvement: assess where you are today, determine where you want or need to be, and respond with increasing intensity.
Part of this is passive. Data intensity is increasing whether we like it or not.
But we need to respond to it and manage it without introducing friction or more complexity. Be proactive by driving complexity out of the system. Then you will be able to absorb an increase in data intensity and take advantage of it.
For example, if your database is built for scale, multi-model use cases, and low latency, then increasing the data volume by 5x, increasing the number of queries by 20x, or adding geo-referenced data is not complex; it is differentiating. Here we respond in a way that allows us to increase the data intensity even further.
On the other hand, if your existing database technology is no match for the new data reality and you add five special-purpose technologies in response, data intensity becomes synonymous with system integration. Now you have created more complexity that is difficult to maintain and scale.
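To make the multi-model point above a bit more concrete, here is a minimal sketch of a single query that combines a geospatial filter with JSON attribute extraction in one database rather than wiring together separate special-purpose systems. It assumes the `singlestoredb` Python client and a hypothetical `events` table; the connection string, schema, and exact SQL function signatures are illustrative and should be checked against the product documentation.

```python
# A hedged sketch, not a reference implementation: one query that serves both a
# geospatial filter and a JSON attribute lookup in a single database.
# Assumes the `singlestoredb` Python client (pip install singlestoredb) and a
# hypothetical `events` table; all names and signatures here are illustrative.
import singlestoredb as s2

QUERY = """
    SELECT   customer_id,
             JSON_EXTRACT_STRING(payload, 'campaign') AS campaign,
             COUNT(*)                                  AS clicks
    FROM     events
    WHERE    GEOGRAPHY_WITHIN_DISTANCE(location,
                                       'POINT(-73.98 40.75)',
                                       5000)           -- within ~5 km
      AND    event_time > NOW() - INTERVAL 1 HOUR      -- recent data only
    GROUP BY customer_id, campaign
    ORDER BY clicks DESC
    LIMIT    20
"""

def top_recent_campaigns(dsn: str = "user:password@localhost:3306/app_db"):
    """Run the combined geo + JSON aggregation and return the result rows."""
    conn = s2.connect(dsn)
    try:
        cur = conn.cursor()
        cur.execute(QUERY)
        return cur.fetchall()
    finally:
        conn.close()

if __name__ == "__main__":
    for row in top_recent_campaigns():
        print(row)
```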
Q2. Data intensity increases naturally as more constraints are connected to the data: What are the main challenges?
The challenges associated with managing increasing data intensity vary by organization since data intensity is multi-dimensional.
Increasing volume and velocity of the data, and the need to reduce latency in responding to data-dependent events, are among the challenges that organizations often face. Finding data engineering and data science talent is also a consideration. But it is also about building and operationalizing data-dependent assets such as machine learning models, reducing data movement, shifting from a historical approach to data to a predictive approach, and increasing data literacy.
The important question is: Where are the data-related concerns, and how am I managing them? I like to organize the ways in which you recognize data intensity into different categories:
- Attributes of the Data
  - Data volumes are increasing: you worried about gigabytes yesterday, are dealing with terabytes today, and will need to contend with petabytes tomorrow
  - More data sources and more fast-moving data: thousands to millions of rows per second need to be ingested
  - Latency requirements: tasks associated with data have SLAs, and the SLAs are getting shorter. Sub-second to millisecond latencies are not atypical.
  - More data sources processed together: a high number of joins across data sources, possibly in different systems and geographies
  - High concurrency of users
  - Data processing in centralized (data center, cloud) and decentralized (edge) environments
- Type of Analytics
  - Transactions are not just recording the business but are augmented by analytics; systems of record become systems of response
  - Predictive, forward-looking analytics instead of a purely descriptive, backward-looking approach
  - Analytics are not just re-packaging data into numeric outcomes but are driving decisions
  - Automation of data engineering and machine learning
  - The number of analytic models in production increases
  - Applications with more analytic use cases and more diverse data types: text, search, recommendations, audio and video data, real-time and batch
- Organizational & Operational
  - A more data-literate workforce; does everyone “speak data”?
  - A data center strategy gives way to a data-centered cloud strategy, most likely hybrid multi-cloud
  - Your organization has a chief data officer and/or chief analytics officer, and they are executive-level positions
  - You accomplish the same, or more, with fewer data technologies, less data movement, and less data duplication
  - Data privacy and data security by design, rather than capabilities bolted on afterwards
I should add that these are examples of categories. The list is not exhaustive.
Q3. What is more relevant: variety, volume, or velocity; geographic distribution; diverse data types and structures? What about automation, privacy, and security?
They are all relevant.
Big data was not a tangible term to me. In what way is the data “big”? It was not really just about the size of the data. The “big data” era was marked by the fact that new concerns and constraints became associated with the data. There was more of it, and it was of a different type and structure.
For example, click-stream data was new for many organizations. It was streaming, it had an expiration date, and it captured behavior rather than demographic information. My age will go up by one every year; that is easy to predict. My online behavior is much more difficult to predict, but it might be the more important attribute for interacting with me.
The concept of data intensity resonates with me. It makes the many aspects in which data affects my thinking and operations more tangible.
Volume, variety, and velocity are still important. They put operational systems under strain. If your data systems cannot handle modern types of data such as audio, video, handwritten text, or geo-referenced data, then you are probably interacting with customers, employees, patients, users, and others in a sub-optimal way.
Security and privacy are supremely important, because a fumble in these areas puts the organization at risk. The data-intensive organization has data protection (securing data from unauthorized access) and data privacy (governing what someone with authorized access can do with the data) baked into products and operations as design principles. Security and privacy are not bolted on. Security and privacy by design means that products support data minimization, anonymization, pseudonymization, encryption, and other privacy and security technologies out of the box. Access to systems and access to data are differentiated.
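As a small illustration of two of the techniques named above, here is a minimal Python sketch of data minimization and pseudonymization with a keyed hash. The field names and the environment-variable key handling are purely illustrative, not a prescription for production key management.

```python
# A minimal sketch of pseudonymization with a keyed hash (HMAC-SHA256):
# the same identifier always maps to the same token, so downstream joins and
# aggregations still work, but the raw identifier never leaves the ingest
# boundary. Key handling via an environment variable is illustrative only.
import hmac
import hashlib
import os

PSEUDONYM_KEY = os.environ.get("PSEUDONYM_KEY", "change-me").encode()

def pseudonymize(identifier: str) -> str:
    """Return a stable, non-reversible token for a direct identifier."""
    return hmac.new(PSEUDONYM_KEY, identifier.encode(), hashlib.sha256).hexdigest()

def minimize(record: dict, allowed_fields: set) -> dict:
    """Data minimization: keep only the fields a downstream use case needs."""
    return {k: v for k, v in record.items() if k in allowed_fields}

record = {"email": "jane@example.com", "age": 34, "last_click": "2024-05-01"}
safe = minimize(record, {"age", "last_click"})
safe["customer_token"] = pseudonymize(record["email"])
print(safe)
```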
Q4. We want data intensity to increase, but it can lead to complexity and friction if not properly managed. What is your take on this?
Data intensity is a good thing, and in today’s digitally transforming world, it is unavoidable. Data volumes are increasing, the variety of data and data use cases are increasing, every organization has a cloud story, and artificial intelligence and machine learning are everywhere.
Rather than trying to reduce the data intensity, let’s ask: How are we managing it?
Take a customer-facing application that uses data. That describes almost every application today. It has its application database, but you also want to bring in some real-time click-stream data as well as operational data. With the next update you want to include third-party weather and traffic data and predictive capabilities such as leaderboards and churn probabilities. And, of course, the backend needs to be able to scale to 10x the number of today’s users.
The application has become very data-intensive.
If we build such a system from 10 disparate pieces of technology that need to be integrated, synced, secured, updated, and so on, then we have managed intensity by creating complexity.
Data intensity metrics are a good way to measure the degree of digital transformation. High intensity with low complexity is a sign of digital maturity and resilience. If your systems are resilient, they can scale. Through scaling you are increasing the intensity. It is a virtuous cycle.
Q5. While data intensity today is mostly an attribute of applications, in coming years many organizations will have objectives, key results, and KPIs tied to data intensity to capture their digital maturity and resilience. Can you give us some examples?
Since data intensity is multi-dimensional, there is not one KPI or metric tied to it. Once you identify the dimensions along which data intensity increases by design or necessity, the metrics fall into place. Like all metrics, they need to be tied to value creation.
For example, the maturity of a data and analytics program is not measured by the size of the data science team or the number of models that team produces. It should be measured by the number of models in production and the time it takes to operationalize the output of the data science team to create impact.
A simple approach to capturing complexity is to measure its antidotes, latency and scalability (a sketch of estimating the scaling behavior follows the list below):
- How long does it take to go from data capture to dashboard?
- How long does it take to update and re-deploy data-intensive applications?
- Can you meet SLAs if you have to do 2x or 5x more? Does scaling affect your expenses in a sublinear, linear, or exponential way?
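One rough way to make the sublinear/linear/exponential question measurable, sketched below with made-up cost figures, is to fit a power law of the form cost ~ a * load^b to a few observed (load, cost) points and read off the exponent.

```python
# A hedged sketch of turning "sublinear / linear / exponential" into a number:
# fit cost ~ a * load^b on a log-log scale from a few observed (load, cost)
# points. b < 1 suggests sublinear scaling, b ~ 1 linear, b > 1 superlinear.
# The sample figures below are made up purely for illustration.
import math

def scaling_exponent(observations):
    """Least-squares slope of log(cost) vs. log(load)."""
    xs = [math.log(load) for load, _ in observations]
    ys = [math.log(cost) for _, cost in observations]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Hypothetical load (queries/day) vs. monthly infrastructure cost ($):
observed = [(1_000_000, 4_000), (2_000_000, 6_800), (5_000_000, 13_500)]
b = scaling_exponent(observed)
print(f"scaling exponent b = {b:.2f}")  # ~0.76 with these figures: roughly sublinear
```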
Another great way to capture general data intensity is through data movement.
In an ideal world, we would not have to worry about where data is located or whether it needs to be moved to get a job done, whether that means joining data across disparate regions or moving data around to perform analytics. In an ideal world, anyone could produce any data-driven result from anywhere, at the right time, without worrying about cost or risk. Yet the reality is that data movement remains a huge cost factor, performance killer, and governance challenge.
How many copies of the data do you make to support your use cases? Instead of moving data to the analytics, analytics should follow the data. If you can push analytics to a database such as SingleStore through a Spark connector — instead of moving data first to a data lake — your data intensity increases in a measurable way.
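Here is a rough sketch of that pattern using PySpark and the SingleStore Spark connector. The endpoint, credentials, database, and table names are placeholders, and the configuration keys follow the connector's documentation at the time of writing; treat them as assumptions to verify against the connector version you actually deploy.

```python
# A rough sketch of "analytics follow the data": reading from SingleStore
# through its Spark connector so filters and aggregations can be pushed down
# to the database, instead of copying the table into a data lake first.
# Assumes the SingleStore Spark connector jar is on the classpath; endpoint,
# credentials, and table names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("pushdown-sketch")
    # Connector configuration (option names per the connector docs; verify
    # against the version you deploy).
    .config("spark.datasource.singlestore.ddlEndpoint", "singlestore-host:3306")
    .config("spark.datasource.singlestore.user", "app_user")
    .config("spark.datasource.singlestore.password", "********")
    .getOrCreate()
)

events = spark.read.format("singlestore").load("app_db.events")

# With SQL pushdown enabled (the connector's default in recent versions), the
# filter and aggregation below are executed inside SingleStore; only the small
# aggregated result crosses the wire to Spark.
daily_clicks = (
    events
    .filter(F.col("event_type") == "click")
    .groupBy("customer_id")
    .count()
)
daily_clicks.show()
```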
………………………………………………………..
Oliver Schabenberger is the Chief Innovation Officer at SingleStore. He is a former academician and seasoned technology executive with more than 25 years of global experience in data management, advanced analytics, and AI. Oliver formerly served as COO and CTO of SAS, where he led the design, development, and go-to-market effort of massively scalable analytic tools and solutions and helped organizations become more data-driven.
Previously, Oliver led the Analytic Server R&D Division at SAS, with responsibilities for multi-threaded and distributed analytic server architecture, event stream processing, cognitive analytics, deep learning, and artificial intelligence. He has contributed thousands of lines of code to cutting-edge projects at SAS, including SAS Cloud Analytic Services, the engine behind SAS Viya, the next-generation SAS architecture for the open, unified, simple, and powerful cloud. He has a PhD from Virginia Polytechnic Institute and State University.
Sponsored by SingleStore.