On the future of Data Warehousing. Interview with Jacque Istok and Mike Waas
” Open source software comes with a promise, and that promise is not about looking at the code, rather it’s about avoiding vendor lock-in.” –Jacque Istok.
” The cloud has out-paced the data center by far and we should expect to see the entire database market being replatformed into the cloud within the next 5-10 years.” –Mike Waas.
I have interviewed Jacque Istok, Head of Data Technical Field for Pivotal, and Mike Waas, founder and CEO Datometry.
Main topics of the interview are: the future of Data Warehousing, how are open source and the Cloud affecting the Data Warehouse market, and Datometry Hyper-Q and Pivotal Greenplum.
Q1. What is the future of Data Warehouses?
Jacque Istok: I believe that what we’re seeing in the market is a slight course correct with regards to the traditional data warehouse. For 25 years many of us spent many cycles building the traditional data warehouse.
The single source of the truth. But the long duration it took to get alignment from each of the business units regarding how the data related to each other combined with the cost of the hardware and software of the platforms we built it upon left everybody looking for something new. Enter Hadoop and suddenly the world found out that we could split up data on commodity servers and, with the right human talent, could move the ball forward faster and cheaper. Unfortunately the right human talent has proved hard to come by and the plethora of projects that have spawned up are neither production ready nor completely compliant or compatible with the expensive tools they were trying to replace.
So what looks to be happening is the world is looking for the features of yesterday combined with the cost and flexibility of today. In many cases that will be a hybrid solution of many different projects/platforms/applications, or at the very least, something that can interface easily and efficiently with many different projects/platforms/applications.
Mike Waas: Indeed, flexibility is what most enterprises are looking for nowadays when it comes to data warehousing. The business needs to be able to tap data quickly and effectively. However, in today’s world we see an enormous access problem with application stacks that are tightly bonded with the underlying database infrastructure. Instead of maintaining large and carefully curated data silos, data warehousing in the next decade will be all about using analytical applications from a quickly evolving application ecosystem with any and all data sources in the enterprise: in short, any application on any database. I believe data warehouses remain the most valuable of databases, therefore, cracking the access problem there will be hugely important from an economic point of view.
Q2. How is open source affecting the Data Warehouse market?
Jacque Istok: The traditional data warehouse market is having its lunch eaten by open source. Whether it’s one of the Hadoop distributions, one of the up and coming new NoSQL engines, or companies like Pivotal making large bets and open source production proven alternatives like Greenplum. What I ask prospective customers is if they were starting a new organization today, what platforms, databases, or languages would you choose that weren’t open source? The answer is almost always none. Open source software comes with a promise, and that promise is not about looking at the code, rather it’s about avoiding vendor lock-in.
Mike Waas: Whenever a technology stack gets disrupted by open source, it’s usually a sign that the technology has reached a certain maturity and customers have begun doubting the advantage of proprietary solutions. For the longest time, analytical processing was considered too advanced and too far-reaching in scope for an open source project. Greenplum Database is a great example for breaking through this ceiling: it’s the first open source database system with a query optimizer not only worth that title but setting a new standard, and a whole array of other goodies previously only available in proprietary systems.
Q3. Are databases an obstacle to adopting Cloud-Native Technology?
Jacque Istok: I believe quite the contrary, databases are a requirement for Cloud-Native Technology. Any applications that are created need to leverage data in some way. I think where the technology is going is to make it easier for developers to leverage whichever database or datastore makes the most sense for them or they have the most experience with – essentially leveraging the right tool for the right job, instead of the tool “blessed” by IT or Operations for general use. And they are doing this by automating the day 0, day 1, and day 2 operations of those databases. Making it easy to instantiate and use these platforms for anyone, which has never really been the case.
Mike Waas: In fact, a cloud-first strategy is incomplete unless it includes the data assets, i.e., the databases. Now, databases have always been one of the hardest things to move or replatform, and, naturally, it’s the ultimate challenge when moving to the cloud: firing up any new instance in the cloud is easy as 1-2-3 but what to do with the 10s of years of investment in application development? I would say it’s actually not the database that’s the obstacle but the applications and their dependencies.
Q4. What are the pros and cons of moving enterprise data to the cloud?
Jacque Istok: I think there are plenty of pros to moving enterprise data to the cloud, the extent of that list will really depend on the enterprise you’re talking to and the vertical that they are in. But cons? The only cons would be using these incredible tools incorrectly, at which point you might find yourself spending more money and feeling that things are slower or less flexible. Treating the cloud as a virtual data center, and simply moving things there without changing how they are architected or how they are used would be akin to taking
Mike Waas: I second that. A few years ago enterprises were still concerned about security, completeness of offering, and maturity of the stack. But now, the cloud has out-paced the data center by far and we should expect to see the entire database market being replatformed into the cloud within the next 5-10 years. This is going to be the biggest revolution in the database industry since the relational model with great opportunities for vendors and customers alike.
Q5. How do you quantify when is appropriate for an enterprise to move their data management to a new platform?
Jacque Istok: It’s pretty easy from my perspective, when any enterprise is done spending exorbitant amounts of money it might be time to move to a new platform. When you are coming up on a renewal or an upgrade of a legacy and/or expensive system it might be time to move to a new platform. When you have new initiatives to start it might be time to move to a new platform. When you are ready to compete with your competitors, both known and unknown (aka startups), it might be time to move to a new platform. The move doesn’t have to be scary either, as some products are designed to be a bridge to a modern a data platform.
Mike Waas: Traditionally, enterprises have held off from replatforming for too long: the switching cost has deterred them from adopting new and highly superior technology with the result that they have been unable to cut costs or gain true competitive advantage. Staying on an old platform is simply bad for business. Every organization needs to ask themselves constantly the question whether their business can benefit from adopting new technology. At Datometry, we make it easy for enterprises to move their analytics — so easy, in fact, the standard reaction to our technology is, “this is too good to be true.”
Q6. What is the biggest problem when enterprises want to move part or all of their data management to the cloud?
Jacque Istok: I think the biggest problem tends to be not architecting for the cloud itself, but instead treating the cloud like their virtual data center. Leveraging the same techniques, the same processes, and the same architectures will not lead to the cost or scalability efficiencies that you were hoping for.
Mike Waas: As Jacque points out, you really need to change your approach. However, the temptation is to use the move to the cloud as a trigger event to rework everything else at the same time. This quickly leads to projects that spiral out of control, run long, go over budget, or fail altogether. Being able to replatform quickly and separate the housekeeping from the actual move is, therefore, critical.
However, when it comes to databases, trouble runs deeper as applications and their dependencies on specific databases are the biggest obstacle. SQL code is embedded in thousands of applications and, probably most surprising, even third-party products that promise portability between databases get naturally contaminated with system-specific configuration and SQL extensions. We see roughly 90% of third-party systems (ETL, BI tools, and so forth) having been so customized to the underlying database that moving them to a different system requires substantial effort, time, and money.
Q7. How does an enterprise move the data management to a new platform without having to re-write all of the applications that rely on the database?
Mike Waas: At Datometry, we looked very carefully at this problem and, with what I said above, identified the need to rewrite applications each time new technology is adopted as the number one problem in the modern enterprise. Using Adaptive Data Virtualization (ADV) technology, this will quickly become a problem of the past! Systems like Datometry Hyper-Q let existing applications run natively and instantly on a new database without requiring any changes to the application. What would otherwise be a multi-year migration project and run into the millions, is now reduced in time, cost, and risk to a fraction of the conventional approach. “VMware for databases” is a great mental model that has worked really well for our customers.
Q8. What is Adaptive Data Virtualization technology, and how can it help adopting Cloud-Native Technology?
Mike Waas: Adaptive Data Virtualization is the simple, yet incredibly powerful, abstraction of a database: by intercepting the communication between application and database, ADV is able to translate in real-time and dynamically between the existing application and the new database. With ADV, we are drawing on decades of database research and solving what is essentially a compatibility problem between programming languages and systems with an elegant and highly effective approach. This is a space that has traditionally been served by consultants and manual migrations which are incredibly labor-intensive and expensive undertaking.
Through ADV, adopting cloud technology becomes orders of magnitude simpler as it takes away the compatibility challenges that hamper any replatforming initiative.
Q9. Can you quantify what are the reduced time, cost, and risk when virtualizing the data warehouse?
Jacque Istok: In the past, virtualizing the data warehouse meant sacrificing performance in order to get some of the common benefits of virtualization (reduced time for experimentation, maximizing resources, relative ease to readjust the architecture, etc). What we have found recently is that virtualization, when done correctly, actually provides no sacrifices in terms of performance, and the only question becomes whether or not the capital cost expenditure of bare metal versus the opex cost structure of virtual is something that makes sense for your organisation.
Mike Waas: I’d like to take it a step further and include ADV into this context too: instead of a 3-5 year migration, employing 100+ consultants, and rewriting millions of lines of application code, ADV lets you leverage new technology in weeks, with no re-writing of applications. Our customers can expect to save at least 85% of the transition cost.
Q10. What is the massively parallel processing (MPP) Scatter/Gather Streaming™ technology, and what is it useful for?
Jacque Istok: This is arguably one of the most powerful features of Pivotal Greenplum and it allows for the fastest loading of data in the industry. Effectively we scatter data into the Greenplum data cluster as fast as possible with no care in the world to where it will ultimately end up. Terabytes of data per hour, basically as much as you can feed down the wires, is sent to each of the workers within the cluster. The data is therefore disseminated to the cluster in the fastest physical way possible. At that point, each of the workers gathers the data that is pertinent to them according to the architecture you have chosen for the layout of those particular data elements, allowing for a physical optimization to be leveraged during interrogation of the data after it has been loaded.
Q11. How Datometry Hyper-Q & Pivotal Greenplum data warehouse work together?
Jacque Istok: Pivotal Greenplum is the world’s only true open source, production proven MPP data platform that provides out of the box ANSI compliant SQL capabilities along with Machine Learning, AI, Graph, Text, and Spatial analytics all in one. When combined with Datometry Hyper-Q, you can transparently and seamlessly take any Teradata application and, without changing a single line of code or a single piece of SQL, run it and stop paying the outrageous Teradata tax that you have been bearing all this time. Once you’re able to take out your legacy and expensive Teradata system, without a long investment to rewrite anything, you’ll be able to leverage this software platform to really start to analyze the data you have. And that analysis can be either on premise or in the cloud, giving you a truly hybrid and cross-cloud proven platform.
Mike Waas: I’d like to share a use case featuring Datometry Hyper-Q and Pivotal Greenplum featuring a Fortune 100 Global Financial Institution needing to scale their business intelligence application, built using 2000-plus stored procedures. The customer’s analysis showed that replacing their existing data warehouse footprint was prohibitively expensive and rewriting the business applications to a more cost-effective and modern data warehouse posed significant expense and business risk. Hyper-Q allowed the customer to transfer the stored procedures in days without refactoring the logic of the application and implement various control-flow primitives, a time-consuming and expensive proposition.
Qx. Anything else you wish to add?
Jacque Istok: Thank you for the opportunity to speak with you. We have found that there has never been a more valid time than right now for customers to stop paying their heavy Teradata tax and the combination of Pivotal Greenplum and Datometry Hyper-Q allows them to do that right now, with no risk, and immediate ROI. On top of that, they are then able to find themselves on a modern data platform – one that allows them to grow into more advanced features as they are able. Pivotal Greenplum becomes their bridge to transforming your organization by offering the advanced analytics you need but giving you traditional, production proven capabilities immediately. At the end of the day, there isn’t a single Teradata customer that I’ve spoken to that doesn’t want Teradata-like capabilities at Hadoop-like prices and you get all this and more with Pivotal Greenplum.
Mike Waas: Thank you for this great opportunity to speak with you. We, at Datometry, believe that data is the key that will unlock competitive advantage for enterprises and without adopting modern data management technologies, it is not possible to unlock value. According to the leading industry group, TDWI, “today’s consensus says that the primary path to big data’s business value is through the use of so-called ‘advanced’ forms of analytics based on technologies for mining, predictions, statistics, and natural language processing (NLP). Each analytic technology has unique data requirements, and DWs must modernize to satisfy all of them.”
We believe virtualizing the data warehouse is the cornerstone of any cloud-first strategy because data warehouse migration is one of the most risk-laden and most expensive initiatives that a company can embark on during their journey to to the cloud.
Interestingly, the cost of migration is primarily the cost of process and not technology and this is where Datometry comes in with its data warehouse virtualization technology.
We are the key that unlocks the power of new technology for enterprises to take advantage of the latest technology and gain competitive advantage.
Jacque Istok serves as the Head of Data Technical Field for Pivotal, responsible for setting both data strategy and execution of pre and post sales activities for data engineering and data science. Prior to that, he was Field CTO helping customers architect and understand how the entire Pivotal portfolio could be leveraged appropriately.
A hands on technologist, Mr. Istok has been implementing and advising customers in the architecture of big data applications and back end infrastructure the majority of his career.
Prior to Pivotal, Mr. Istok co-founded Professional Innovations, Inc. in 1999, a leading consulting services provider in the business intelligence, data warehousing, and enterprise performance management space, and served as its President and Chairman. Mr. Istok is on the board of several emerging startup companies and serves as their strategic technical advisor.
Mike Waas, CEO Datometry, Inc.
Mike Waas founded Datometry after having spent over 20 years in database research and commercial database development. Prior to Datometry, Mike was Sr. Director of Engineering at Pivotal, heading up Greenplum’s Advanced R&Dteam. He is also the founder and architect of Greenplum’s ORCA query optimizer initiative. Mike has held senior engineering positions at Microsoft, Amazon, Greenplum, EMC, and Pivotal, and was a researcher at Centrum voor Wiskunde en Informatica (CWI), Netherlands, and at Humboldt University, Berlin.
Mike received his M.S. in Computer Science from University of Passau, Germany, and his Ph.D. in Computer Science from the University of Amsterdam, Netherlands. He has authored or co-authored 36 publications on the science of databases and has 24 patents to his credit.
– On Open Source Databases. Interview with Peter Zaitsev, ODBMS Industry Watch, Published on 2017-09-06
– On Apache Ignite, Apache Spark and MySQL. Interview with Nikita Ivanov , ODBMS Industry Watch, Published on 2017-06-30
– On the new developments in Apache Spark and Hadoop. Interview with Amr Awadallah, ODBMS Industry Watch, Published on 2017-03-13
Follow us on Twitter: @odbmsorg