“A lot of times we think of digital transformation as a technology dependent process. The transformation takes place when employees learn new skills, change their mindset and adopt new ways of working towards the end goal.”–Kerem Tomak
I have interviewed Kerem Tomak, Executive VP, Divisional Board Member, Big Data-Advanced Analytics-AI, at Commerzbank AG. We talked about Digital Transformation, Big Data, Advanced Analytics and AI for the financial sector.
Commerzbank AG is a major German bank operating as a universal bank, headquartered in Frankfurt am Main. In the 2019 financial year, the bank was the second largest in Germany after the balance sheet total. The bank is present in more than 50 countries around the world and provides almost a third of Germany’s trade finance. In 2017, it handled nearly 13 million customers in Germany and more than 5 million customers in Central and Eastern Europe. (source: Wikipedia).
Q1. What are the key factors that need to be taken into account when a company wants to digitally transform itself?
Kerem Tomak: It starts with a clear and coherent digital strategy. Depending on the level of the company this can vary from operational efficiencies as the main target to disrupting and changing the business model all together. Having clear scope and objectives of the digital transformation is key in its success.
A lot of times we think of digital transformation as a technology dependent process. The transformation takes place when employees learn new skills, change their mindset and adopt new ways of working towards the end goal. Digital enablement together with a company wide upgrade/replacement of legacy technologies with new ones like Cloud, API, IoT etc. is the next step towards becoming a digital company. With all this comes the most important ingredient, thinking outside the box and taking risks. One of the key success criteria in becoming a digital enterprise is the true and speedy “fail fast, learn and optimize” mentality. Avoiding (calculated) risks, especially at the executive level, will limit growth and hinder transformation efforts.
Q2. What are the main lessons you have learned when establishing strategic, tactical and organizational direction for digital marketing, big data and analytics teams?
Kerem Tomak: For me, culture eats strategy. Efficient teams build a culture in which they thrive. Innovation is fueled by teams which constantly learn and share knowledge, take risks and experiment. Aside from cultural aspects, there are three main lessons I learned over the years.
First: Top down buy-in and support is key. Alignment with internal and external key stakeholders is vital – you cannot create impact without them taking ownership and being actively involved in the development of use cases.
Second: Clear prioritization is necessary. Resources are limited, both in the analytics teams and with the stakeholders. OKRs provide very valuable guidance on steering the teams forward and set priorities.
Third: Building solutions which can scale over a stable and scalable infrastructure. Data quality and governance build clean input channels to analytics development and deployment. This is a major requirement and biggest chunk of the work. Analytics capabilities then guide what kind of tools and technologies can be used to make sense of this data. Finally, integrating with execution outlets such as a digital marketing platform creates a feedback loop that teams can learn and optimize against.
Q3. What are the main challenges (both technical and non) when managing mid and large-size analytics teams?
Kerem Tomak: Again, building a culture in which teams thrive independent of size is key. For analytics teams, constantly learning/testing new techniques and technologies is an important aspect of job satisfaction for the first few years out of academia. Promotion path clarity and availability of a “skills matrix” makes it easy to understand what leadership values in the employees are important and provides guidance on future growth opportunities. I am not a believer in hierarchical organizations so keeping job levels as low as possible is necessary for speed and delivery. Hiring and retaining right skills in the analytics teams are not easy, especially in hot markets like Silicon Valley. Most analytics employees follow leaders and generally stay loyal to them. Head of an analytics team plays an extremely important role. That will “make it or break it” for analytics teams. Finally, analytics platforms with the right tools and scale is critical for the teams’ success.
Q4. What does it take to successfully deliver large scale analytics solutions?
Kerem Tomak: First, one needs a flexible and scalable analytics infrastructure – this can comprise on-premise components like a Chatbots for example, as well as shared components via a Public Cloud. Secondly, it takes an end-to-end automation of processes, in order to attain scale fast and on demand. Last but not least, companies need an accurate sense of customers’ needs and requirements to ensure that the developed solution will be adopted.
Q5. What parameters do you normally use to define if an analytics solution is really successful?
Kerem Tomak: An analytics solution is successful if it has a high impact. Some key parameters are usage, increased revenues and reduced costs.
Q6. Talking about Big Data, Advanced Analytics and AI: Which companies are benefiting from them at present?
Kerem Tomak: Maturity of Big Data, AA and AI differs across industries. Leading the pack are Tech, Telco, Financial Services, Retail and Automotive. In each industry there are leaders and laggards. There are fewer and fewer companies untouched by BDAA and AI.
Q7. Why are Big Data and Advanced Analytics so important for the banking sector?
Kerem Tomak: This has (at least) two dimensions. First: Like any other company that wants to sell products or services, we must understand our client’s needs. Big Data and Advanced Analytics can give us a decisive advantage here. For example – with our customers’ permission of course – we can analyze their transactions and thus gain useful information about their situation and learn what they need from their bank. Simply put: A person with a huge amount of cash in their account obviously has no need for a consumer credit at the moment. But the same person might have a need for advice on investment opportunities. Data analysis can give us very detailed insights and thus help us to understand our customers better.
This leads to the second dimension, which is risk management. As a bank we are risk taking specialists. The better the bank does in understanding the risks it takes, the more efficient it can act to counterbalance those risks. Benefits are a lower rate of credit defaults as well as a more accurate credit pricing. This is in favor of both the bank and its customers.
Data is the fabric which new business models are made of but Big Data does not necessarily mean Big Business: The correct evaluation of data is crucial. This will also be a decisive factor in the future as to whether a company can hold its own in the market.
Q8. What added value can you deliver to your customers with them?
Kerem Tomak: Well, for starters, Advanced Analytics helps us to prevent fraud. In 2017, Commerzbank used algorithms to stop fraudulent payments in excess of EUR 100 million. Another use case is the liquidity forecast for small and medium-sized enterprises. Our Cash Radar runs in a public cloud and generates forecasts for the development of the business account. It can therefore warn companies at an early stage if, for example, an account is in danger of being underfunded. So with the help of such innovative data-driven products, the bank obviously can generate added customer value, but also drive its growth and set itself apart from its competitors.
Additionally, Big Data and Advanced Analytics generate significant internal benefits. For example, Machine Learning is providing us with efficient support to prevent money laundering by automatically detecting conspicuous payment flows. Another example: Chatbots already regulate part of our customer communication. Also, Commerzbank is the first German financial institution to develop a data-based pay-per-use investment loan. The redemption amount is calculated from the use of the capital goods – in this case the utilization of the production machines, which protects the liquidity of the user and gives us the benefit of much more accurate risk calculations.
When we bear in mind that the technology behind examples like these is still quite new, I am confident that we will see many more use cases of all kinds in the future.
Q9. It seems that Artificial Intelligence (AI) will revolutionize the financial industry in the coming years. What is your take on this?
Kerem Tomak: When we talk about artificial intelligence, currently, we basically still mean machine learning. So we are not talking about generalized artificial intelligence in its original sense. It is about applications that recognize patterns and learn from these occurrences. Eventually tying these capabilities to applications that support decisions and provide services make AI (aka Machine Learning) a unique field. Even though the field of data modelling has developed rapidly in recent years, we are still a long way from the much-discussed generalized artificial intelligence which had the machine goal outlined in 1965 as “machines will be capable, within twenty years, of doing any work a man can do”. With the technology available today we can think of the financial industry having new ways of generating, transferring, accumulating wealth in ways we have not seen before all predicated upon individual adoption and trust.
Q10. You have been working for many years in US. What are the main differences you have discovered in now working in Europe?
Kerem Tomak: Europeans are very sensitive to privacy and data security. The European Union has set a high global standard with its General Data Protection Regulation (GDPR). In my opinion, Data protection “made in Europe” is a real asset and has the potential to become a global blueprint.
Also, Europe is very diverse – from language over culture to different market environments and regulatory issues. Even though immense progress has been made in the course of harmonization in the European Union, a level playing field remains one of the key issues in Europe, especially for Banks.
Technology adoption is lagging in some parts of Europe. Bigger infrastructure investments, wider adoption of public cloud, 5G deployment are needed to stay competitive and relevant in global markets which are increasingly dominated by US and China. This is both an opportunity and risk. I see tremendous opportunities everywhere from IoT to AI driven B2B and B2C apps for example. If adoption of public cloud lags any further, I see the risk of falling behind on AI development and innovation in EU.
Finally, I truly enjoy the family oriented work-life balance here which in turn increases work productivity and output.
Dr. Kerem Tomak, Executive VP, Divisional Board Member, Big Data-Advanced Analytics-AI, Commerzbank AG
Kerem brings more than 15 years of experience as a data scientist and an executive. He has expertise in the areas of omnichannel and cross-device attribution, price and revenue optimization, assessing promotion effectiveness, yield optimization in digital marketing and real time analytics. He has managed mid and large-size analytics and digital marketing teams in Fortune 500 companies and delivered large scale analytics solutions for marketing and merchandising units. His out-of-the box thinking and problem solving skills led to 4 patent awards and numerous academic publications. He is also a sought after speaker in Big Data and BI Platforms for Analytics.
Follow us on Twitter: @odbmsorg
“At some point, most companies come to the realization that the advanced technologies and innovation that allow them to improve business operations also generate increased amounts of data that existing legacy technology is unable to handle, resulting in the need for more new technology. It is a cyclical process that CIOs need to prepare for.” –Scott Gnau
InterSystems has appointed last month Scott Gnau to Head of their Data Platforms Business Unit. I have asked Scott a number of questions related to data management, what are his advices for Chief Information Officers, what is the positing of the InterSystems IRIS™ family of data platforms, and what is the technology vision ahead for the company’s Data Platforms business unit.
Q1. What are the main lessons you have learned in more than 20 years of experience in the data management space?
Scott Gnau: The data management space is a people-centric business, whether you are dealing with long-time customers or developers and architects. The formation of a trusted relationship can be the difference between a potential customer selecting one vendor’s technology which comes with the benefit of partnering for long term success, over a similar competitor’s technology.
Throughout my career, I have also learned how risky data management projects can be. They essentially ensure the security, cleanliness and accuracy of an organization’s data. They are then responsible for scaling data-centric applications, which helps inform important business decisions. Data management is a very competitive space which is only becoming more crowded.
Q2. What is your most important advice for Chief Information Officers?
Scott Gnau: At some point, most companies come to the realization that the advanced technologies and innovation that allow them to improve business operations also generate increased amounts of data that existing legacy technology is unable to handle, resulting in the need for more new technology. It is a cyclical process that CIOs need to prepare for.
Phenomena such as big data, the internet of things (IoT), and artificial intelligence (AI) are driving the need for this modern data architecture and processing, and CIOs should plan accordingly. For the last 30 years, data was primarily created inside data centers or firewalls, was standardized, kept in a central location and managed. It was fixed and simple to process.
In today’s world, most data is created outside the firewall and outside of your control. The data management process is now reversed – instead of starting with business requirements, then sourcing data and building and adjusting applications, developers and organizations load the data first and reverse engineer the process. Now data is driving decisions around what is relevant and informing the applications that are built.
Q3. How do you position the InterSystems IRIS™ family of data platforms with respect to other similar products on the market?
Scott Gnau: The data management industry is crowded, but the InterSystems IRIS data platform is like nothing else on the market. It has a unique, solid architecture that attracts very enthusiastic customers and partners, and plays well in the new data paradigm. There is no requirement to have a schema to leverage InterSystems IRIS. It scales unlike any other product in the data management marketplace.
InterSystems IRIS has unique architectural differences that enable all functions to run in a highly optimized fashion, whether it be supporting thousands of concurrent requests, automatic and easy compression, or highly performant data access methods.
Q4. What is your strategy with respect to the Cloud?
Scott Gnau: InterSystems has a cloud-first mentality, and with the goal of easy provisioning and elasticity, we offer customers the choice for cloud deployments. We want to make the consumption model simple, so that it is frictionless to do business with us.
InterSystems IRIS users have the ability to deploy across any cloud, public or private. Inside the software it leverages the cloud infrastructure to take advantage of the new capabilities that are enabled because of cloud and containerized architectures.
Q5. What about Artificial Intelligence?
Scott Gnau: AI is the next killer app for the new data paradigm. With AI, data can tell you things you didn’t already know. While many of the mathematical models that AI is built on are on the older side, it is still true that the more data you feed them the more accurate they become (which fits well with the new paradigm of data). Generating value from AI also implies real time decisioning, so in addition to more data, more compute and edge processing will define success.
Q6. How do you plan to help the company’s customers to a new era of digital transformation?
Scott Gnau: My goal is to help make technology as easy to consume as possible, to ensure that it is highly dependable. I will continue to work in and around vertical industries that are easily replicable.
Q7. What customers are asking for is not always what customers really need. How do you manage this challenge?
Scott Gnau: Disruption in the digital world is at an all-time high, and for some, impending change is sometimes too hard to see before it is too late. I encourage customers to be ready to “rethink normal,” while putting them in the best position for any transitions and opportunities to come. At the same time, as trusted partners we also are a source of advice to our customers on mega trends.
Q8. What is your technology vision ahead for the company’s Data Platforms business unit?
Scott Gnau: InterSystems continues to look for ways to differentiate how our technology creates success for our customers. We judge our success on our customers’ successes. Our unique architecture and overall performance envelope plays very well into data centric applications across multiple industries including financial services, logistics and healthcare. With connected devices and the requirement for augmented transactions we play nicely into the future high value application space.
Q9. What do you expect from your new role at InterSystems?
Scott Gnau: I expect to have a lot of fun because there is an infinite supply of opportunity in the data management space due to the new data paradigm and the demand for new analytics. On top of that, InterSystems has many smart, passionate and loyal customers, partners and employees. As I mentioned up front, it’s about a combination of great tech AND great people that drives success. Our ability to invest in the future is extremely strong – we have all the key ingredients.
Scott Gnau joined InterSystems in 2019 as Vice President of Data Platforms, overseeing the development, management, and sales of the InterSystems IRIS™ family of data platforms. Gnau brings more than 20 years of experience in the data management space helping lead technology and data architecture initiatives for enterprise-level organizations. He joins InterSystems from HortonWorks, where he served as chief technology officer. Prior to Hortonworks, Gnau spent two decades at Teradata in increasingly senior roles, including serving as president of Teradata Labs. Gnau holds a Bachelor’s degree in electrical engineering from Drexel University.
– On AI, Big Data, Healthcare in China. Q&A with Luciano Brustia ODBMS.org, 8 APR, 2019.
Follow us on Twitter: @odbmsorg
“When we started this project in 2013, it was a moonshot. We were not sure if NVM technologies would ever see the light of day, but Intel has finally started shipping NVM devices in 2019. We are excited about the impact of NVM on next-generation database systems.” — Joy Arulraj and Andrew Pavlo.
I have interviewed Joy Arulraj, Assistant Professor of Computer Science at Georgia Institute of Technology and Andrew Pavlo, Assistant Professor of Computer Science at Carnegie Mellon University. They just published a new book “Non-Volatile Memory Database Management Systems“. We talked about non-volatile memory technologies (NVM), and how NVM is going to impact the next-generation database systems.
Q1. What are emerging non-volatile memory technologies?
Arulraj, Pavlo: Non-volatile memory (NVM) is a broad class of technologies, including phase-change memory and memristors, that provide low latency reads and writes on the same order of magnitude as DRAM, but with persistent writes and large storage capacity like an SSD. For instance, Intel recently started shipping its Optane DC NVM modules based on 3D XPoint technology .
Q2. How do they potentially change the dichotomy between volatile memory and durable storage in database management systems?
Arulraj, Pavlo: Existing database management systems (DBMSs) can be classified into two types based on the primary storage location of the database: (1) disk-oriented and (2) memory-oriented DBMSs. Disk-oriented DBMSs are based on the same hardware assumptions that were made in the first relational DBMSs from the 1970s, such as IBM’s System R. The design of these systems target a two-level storage hierarchy comprising of a fast but volatile byte-addressable memory for caching (i.e., DRAM), and a slow, non-volatile block-addressable device for permanent storage (i.e., SSD). These systems take a pessimistic assumption that a transaction could access data that is not in memory, and thus will incur a long delay to retrieve the needed data from disk. They employ legacy techniques, such as heavyweight concurrency-control schemes, to overcome these limitations.
Recent advances in manufacturing technologies have greatly increased the capacity of DRAM available on a single computer.
But disk-oriented systems were not designed for the case where most, if not all, of the data resides entirely in memory.
The result is that many of their legacy components have been shown to impede their scalability for transaction processing workloads. In contrast, the architecture of memory-oriented DBMSs assumes that all data fits in main memory, and it therefore does away with the slower, disk-oriented components from the system. As such, these memory-oriented DBMSs have been shown to outperform disk-oriented DBMSs. But, they still have to employ heavyweight components that can recover the database after a system crash because DRAM is volatile. The design assumptions underlying both disk-oriented and memory-oriented DBMSs are poised to be upended by the advent of NVM technologies.
Q3. Why are existing DBMSs unable to take full advantage of NVM technology?
Arulraj, Pavlo: NVM differs from other storage technologies in the following ways:
- Byte-Addressability: NVM supports byte-addressable loads and stores unlike other non-volatile devices that only support slow, bulk data transfers as blocks.
- High Write Throughput: NVM delivers more than an order of magnitude higher write throughput compared to SSD. More importantly, the gap between sequential and random write throughput of NVM is much smaller than other durable storage technologies.
- Read-Write Asymmetry: In certain NVM technologies, writes take longer to complete compared to reads. Further, excessive writes to a single memory cell can destroy it.
Although the advantages of NVM are obvious, making full use of them in a DBMS is non-trivial. Our evaluation of state-of-the-art disk-oriented and memory-oriented DBMSs on NVM shows that the two architectures achieve almost the same performance when using NVM. This is because current DBMSs assume that memory is volatile, and thus their architectures are predicated on making redundant copies of changes on durable storage. This illustrates the need for a complete rewrite of the database system to leverage the unique properties of NVM.
Q4.With NVM, which components of legacy DBMSs are unnecessary?
Arulraj, Pavlo: NVM requires us to revisit the design of several key components of the DBMS, including that of the (1) logging and recovery protocol, (2) storage and buffer management, and (3) indexing data structures.
We will illustrate it using the logging and recovery protocol. A DBMS must guarantee the integrity of a database against application, operating system, and device failures. It ensures the durability of updates made by a transaction by writing them out to durable storage, such as SSD, before returning an acknowledgment to the application. Such storage devices, however, are much slower than DRAM, especially for random writes, and only support bulk data transfers as blocks.
During transaction processing, if the DBMS were to overwrite the contents of the database before committing the transaction, then it must perform random writes to the database at multiple locations on disk. DBMSs try to minimize random writes to disk by flushing the transaction’s changes to a separate log on disk with only sequential writes on the critical path of the transaction. This method is referred to as write-ahead logging (WAL).
NVM upends the key design assumption underlying the WAL protocol since it supports fast random writes. Thus, we need to tailor the protocol for NVM. We designed such a protocol that we call write-behind logging (WBL). WBL not only improves the runtime performance of the DBMS, but it also enables it to recovery nearly instantaneously from failures. The way that WBL achieves this is by tracking what parts of the database have changed rather than how it was changed. Using this logging method, the DBMS can directly flush the changes made by transactions to the database instead of recording them in the log. By ordering writes to NVM correctly, the DBMS can guarantee that all transactions are durable and atomic. This allows the DBMS to write fewer data per transaction, thereby improving a NVM device’s lifetime.
Q5. You have designed and implemented a DBMS storage engine architectures that are explicitly tailored for NVM. What are the key elements?
Arulraj, Pavlo: The design of all of the storage engines in existing DBMSs are predicated on a two-tier storage hierarchy comprised of volatile DRAM and a non-volatile SSD. These devices have distinct hardware constraints and performance properties. The traditional engines were designed to account for and reduce the impact of these differences.
For example, they maintain two layouts of tuples depending on the storage device. Tuples stored in memory can contain non-inlined fields because DRAM is byte-addressable and handles random accesses efficiently. In contrast, fields in tuples stored on durable storage are inlined to avoid random accesses because they are more expensive. To amortize the overhead for accessing durable storage, these engines batch writes and flush them in a deferred manner. Many of these techniques, however, are unnecessary in a system with a NVM-only storage hierarchy. We adapted the storage and recovery mechanisms of these traditional engines to exploit NVM’s characteristics.
For instance, consider an NVM-aware storage engine that performs in-place updates. When a transaction inserts a tuple, rather than copying the tuple to the WAL, the engine only records a non-volatile pointer to the tuple in the WAL. This is sufficient because both the pointer and the tuple referred to by the pointer are stored on NVM. Thus, the engine can use the pointer to access the tuple after the system restarts without needing to re-apply changes in the WAL. It also stores indexes as non-volatile B+trees that can be accessed immediately when the system restarts without rebuilding.
The effects of committed transactions are durable after the system restarts because the engine immediately persists the changes made by a transaction when it commits. So, the engine does not need to replay the log during recovery. But the changes of uncommitted transactions may be present in the database because the memory controller can evict cache lines containing those changes to NVM at any time. The engine therefore needs to undo those transactions using the WAL. As this recovery protocol does not include a redo process, the engine has a much shorter recovery latency compared to a traditional engine.
Q6. What is the key takeaway from the book?
Arulraj, Pavlo: All together, the work described in this book illustrates that rethinking the key algorithms and data structures employed in a DBMS for NVM not only improves performance and operational cost, but also simplifies development and enables the DBMS to support near-instantaneous recovery from DBMS failures. When we started this project in 2013, it was a moonshot. We were not sure if NVM technologies would ever see the light of day, but Intel has finally started shipping NVM devices in 2019. We are excited about the impact of NVM on next-generation database systems.
Joy Arulraj is an Assistant Professor of Computer Science at Georgia Institute of Technology. He received his Ph.D. from Carnegie Mellon University in 2018, advised by Andy Pavlo. His doctoral research focused on the design and implementation of non-volatile memory database management systems. This work was conducted in collaboration with the Intel Science & Technology Center for Big Data, Microsoft Research, and Samsung Research.
Andrew Pavlo is an Assistant Professor of Databaseology in the Computer Science Department at Carnegie Mellon University. At CMU, he is a member of the Database Group and the Parallel Data Laboratory. His work is also in collaboration with the Intel Science and Technology Center for Big Data.
– Non-Volatile Memory Database Management Systems. by Joy Arulraj, Georgia Institute of Technology, Andrew Pavlo, Carnegie Mellon University. Book, Morgan & Claypool Publishers, Copyright © 2019, 191 Pages.
ISBN: 9781681734842 | PDF ISBN: 9781681734859 , Hardcover ISBN: 9781681734866
–How to Build a Non-Volatile Memory Database Management System (.PDF), Joy Arulraj Andrew Pavlo
Follow us on Twitter: @odbmsorg
“Anyone who expects to have some of their work in the cloud (e.g. just about everyone) will want to consider the offerings of the cloud platform provider in any shortlist they put together for new projects. These vendors have the resources to challenge anyone already in the market.”– Merv Adrian.
I have interviewed Merv Adrian, Research VP, Data & Analytics at Gartner. We talked about the the database market, the Cloud and the 2018 Gartner Magic Quadrant for Operational Database Management Systems.
Q1. Looking Back at 2018, how has the database market changed?
Merv Adrian: At a high level, much is similar to the prior year. The DBMS market returned to double digit growth in 2017 (12.7% year over year in Gartner’s estimate) to $38.8 billion. Over 73% of that growth was attributable to two vendors: Amazon Web Services and Microsoft, reflecting the enormous shift to new spending going to cloud and hybrid-capable offerings. In 2018, the trend grew, and the erosion of share for vendors like Oracle, IBM and Teradata continued. We don’t have our 2018 data completed yet, but I suspect we will see a similar ballpark for overall growth, with the same players up and down as last year. Competition from Chinese cloud vendors, such as Alibaba Cloud and Tencent, is emerging, especially outside North America.
Q2. What most surprised you?
Merv Adrian: The strength of Hadoop. Even before the merger, both Cloudera and Hortonworks continued steady growth, with Hadoop as a cohort outpacing all other nonrelational DBMS activity from a revenue perspective. With the merger, Cloudera becomes the 7th largest vendor by revenue and usage and intentions data suggest continued growth in the year ahead.
Q3. Is the distinction between relational and nonrelational database management still relevant?
Merv Adrian: Yes, but it’s less important than the cloud. As established vendors refresh and extend product offerings that build on their core strengths and capabilities to provide multimodel DBMS and/or or broad portfolios of both, the “architecture” battle will ramp up. New disruptive players and existing cloud platform providers will have to battle established vendors where they are strong – so for DMSA players like Snowflake will have more competition and on the OPDBMS side, relational and nonrelational providers alike – such as EnterpriseDB, MongoDB, and Datastax – will battle more for a cloud foothold than a “nonrelational” one.
Specific nonrelational plays like Graph, Time Series, and ledger DBMSs will be more disruptive than the general “nonrelational” category.
Q4. Artificial intelligence is moving from sci-fi to the mainstream. What is the impact on the database market?
Merv Adrian: Vendors are struggling to make the case that much of the heavy lifting should move to their DBMS layer with in-database processing. Although it’s intuitive, it represents a different buyer base, with different needs for design, tools, expertise and operational support. They have a lot of work to do.
Q5. Recently Google announced BigQuery ML. Machine Learning in the (Cloud) Database. What are the pros and cons?
Merv Adrian: See the above answer. Google has many strong offerings in the space – putting them together coherently is as much of a challenge for them as anyone else, but they have considerable assets, a revamped executive team under the leadership of Thomas Kurian, and are entering what is likely to be a strong growth phase for their overall DBMS business. They are clearly a candidate to be included in planning and testing.
Q6. You recently published the 2018 Gartner Magic Quadrant for Operational Database Management Systems. In a nutshell. what are your main insights?
Merv Adrian: Much of that is included in the first answer above. What I didn’t say there is that the degree of disruption varies between the Operational and DMSA wings of the market, even though most of the players are the same. Most important, specialists are going to be less relevant in the big picture as the converged model of application design and multimodel DBMSs make it harder to thrive in a niche.
Q7. To qualify for inclusion in this Magic Quadrant, vendors must have had to support two of the following four use cases: traditional transactions, distributed variable data, operational analytical convergence and event processing or data in motion. What is the rational beyond this inclusion choice?
Merv Adrian: The rationale is to offer our clients the offerings with the broadest capabilities. We can’t cover all possibilities in depth, so we attempt to reach as many as we can within the constraints we design to map to our capacity to deliver. We call out specialists in various other research offerings such as Other Vendors to Consider, Cool Vendor, Hype Cycle and other documents, and pieces specific to categories where client inquiry makes it clear we need to have a published point of view.
Q8. How is the Cloud changing the overall database market?
Merv Adrian: Massively. In addition to functional and architectural disruption, it’s changing pricing, support, release frequency, and user skills and organizational models. The future value of data center skills, container technology, multicloud and hybrid challenges and more are hot topics.
Q9. In your Quadrant you listed Amazon Web Services, Alibaba Cloud and Google. These are no pure database vendors, strictly speaking. What role do they play in the overall Operational DBMS market?
Merv Adrian: Anyone who expects to have some of their work in the cloud (e.g. just about everyone) will want to consider the offerings of the cloud platform provider in any shortlist they put together for new projects. These vendors have the resources to challenge anyone already in the market. And their deep pockets, and the availability of open source versions of every DBMS technology type that they can use – including creating their own versions of with optimizations for their stack and pre-built integrations to upstream and downstream technologies required for delivery – makes them formidable.
Q10. What are the main data management challenges and opportunities in 2019?
Merv Adrian: Avoiding silver bullet solutions, sticking to sound architectural principles based on understanding real business needs, and leveraging emerging ideas without getting caught in dead end plays. Pretty much the same as always. The details change, but sound design and a focus on outcomes remain the way forward.
Qx Anything else you wish to add?
Merv Adrian: Fasten your seat belt. It’s going to be a bumpy ride.
Merv Adrian, Research VP, Data & Analytics. Gartner
Merv Adrian is an Analyst on the Data Management team following operational DBMS, Apache Hadoop, Spark, nonrelational DBMS and adjacent technologies. Mr. Adrian also tracks the increasing impact of open source on data management software and monitors the changing requirements for data security in information platforms.
Follow us on Twitter: @odbmsorg
” When software reliability issues creep up in production, it’s a finger-pointing moment between suppliers and users. Usually, what’s missing is simple: information. ” –Barry Morris.
“Most organisations also suffer from more continuous disruption caused by a steady stream of less dramatic issues. Intermittent software problems particularly cause a lot of user frustration and dissatisfaction.” — Dale Vile.
I have interviewed Barry Morris, CEO of Undo and Dale Vile, Distinguished Analyst Freeform Dynamics. Main topic of the interview is enterprise software reliability. This interview relates to a recent research on the challenges and impact on troubleshooting software failures in production, conducted by Freeform Dynamics.
Q1. How often software-related failures occur in the enterprise?
Dale Vile: When we hear the term ‘software failure’, we tend to think of major incidents that bring down a whole department or result in significant data loss. Our study suggests that this kind of thing happens around once every couple years on average in most organisations – at least that’s what people admit to when surveyed. The research also tells us, however, that most organisations also suffer from more continuous disruption caused by a steady stream of less dramatic issues. Intermittent software problems particularly cause a lot of user frustration and dissatisfaction.
Q2. What are the common reasons why major system failures and/or incidents leading to loss of data are top of the list when it comes to the potential for damage and disruption?
Dale Vile: Software is now embedded in most aspects of most businesses. A telling observation is that over the years, the percentage of applications considered to be business critical has steadily increased. At the turn-of-the-century, it was usual for organisations to tell us that around 10% of their application portfolio was considered critical. Nowadays, it’s more likely to be 50% or more. This is why it’s so disruptive and potentially damaging when software failures occur – even relatively brief or minor ones.
Barry Morris: The study we commissioned shows that 83% of enterprise customers consider data corruption issues to be highly disruptive to their business. In the database business, that’s probably closer to 100%. Take SAP HANA, Oracle, Teradata or other data management system vendors: they have clients paying them millions of dollars per year for a reliable and predictable system. Consequences are high if the wrong row is returned, there’s a memory corruption issue, or data goes missing. These types of clients have little tolerance that. At best, your reputation in the industry and software renewals will be on the line. At worst, you’re talking about plummeting stock prices wiping off a few millions off the value of your business.
Q3. What are the most important challenges to achieve software that runs reliably and predictably?
Dale Vile: It starts with software quality management in the development or engineering environment. Most of the challenges we see here are to do with adjusting testing and quality management processes to cope with modern approaches such as Agile, DevOps and Continuous Delivery. A lot of people now refer to ‘Continuous Testing’ in this context, and understandably put a lot of emphasis on automation. But even software makers are on a journey here. Our research tells us that few have it fully figured out at the moment. Beyond this, effective testing in the live environment is also essential.
The problem here, though, is that the complex and dynamic nature of today’s enterprise infrastructures makes it very hard or impossible to test every use case in every situation. And even if you could, subsequent changes to the environment, which an application team may not even be aware of, could easily interfere with the solution and cause instability or failure. There’s a lot to think about, and quality management is only the start.
Q4. What factors are influencing users’ satisfaction and confidence with respect to software?
Dale Vile: Confidence and satisfaction stem from users and business stakeholders perceiving that those responsible are working together competently and effectively to resolve issues when they occur in a timely manner. A fundamental requirement here is openness and honesty, and a willingness to take responsibility. Defensiveness, evasion and finger-pointing, however, tend to undermine confidence and satisfaction. Such behaviour can be cultural; but very often it’s more a symptom of inadequate skills, processes and/or tools within either the supplier or the customer environment. When such shortfalls in capability exist, the inevitable result is an elongated troubleshooting and resolution cycle. This is the real killer of confidence and satisfaction.
Barry Morris: When software reliability issues creep up in production, it’s a finger-pointing moment between suppliers and users. Usually, what’s missing is simple: information. Right now, to obtain that information, suppliers ask 20 questions: what did you do, how did you do it, in what environment and so on. There’s a long period of communication & diagnostic, which is frustrating and time-wasting on both sides. That supplier/user relationships at that moment of firefighting would be massively improved if there was data on the table and engineers could just get on with fixing the problem. I see data-driven defect diagnostic as the key to improving customer satisfaction.
Q5. How effective is software quality management in the enterprise?
Dale Vile: I’ll answer the question in relation to software *reliability* management, which is a function of inherent software quality, effective implementation, and competent operation and support thereafter. We generally find that each group or team tends to do reasonably well in their specific area; but challenges often exist because the various silos I disconnected. What many are lacking is good communication and mutual understanding between those involved in the software lifecycle. Lack of adequate visibility and effective feedback is also a common issue. Most organisations on both the supplier and enterprise side are working on improvement, but gaps frequently exist in these kinds of areas, which in turn impact software reliability.
Barry Morris: Despite all the processes and tools put in place in dev/test, we still see mission-critical applications being shipped with defects. Worse, they are being shipped with known defects – some of which could turn disastrous. Ticking time bombs really. Why? Because of tricky intermittent failures that no-one can get to the bottom of.
So actually, in a lot of cases, I don’t think that the software quality management practices I see are as effective as they could be.
Q6. How effective are the commonly used troubleshooting and diagnostics techniques?
Dale Vile: As mentioned above, the most common problems I see here are to do with disconnects between the various teams involved. Within the engineering environment, this is often down to developers and quality teams working in silos with inefficient handoffs and ineffective feedback mechanisms. In the enterprise context, it’s the disconnect between application teams, operations staff and even service desk personnel. Added to this, many also struggle to join the dots to figure out what’s going on when problems occur, and communicate insights back to developers so they can take appropriate action. Against this background, it’s not surprising that over 90% of both software makers and enterprises report that issues frequently go undiagnosed and come back to bite in a disruptive and often expensive manner.
Barry Morris: Sometimes, traditional methods troubleshooting methods like printf, logging, or core dump analysis are the right solutions if the team is confident they can isolate the issue quickly. Static and dynamic analysis tools are also good options for certain classes of failures. But in more complex situations, traditional debugging methods don’t help much. If anything, they lead you down the wrong path with false positive and become time-wasting, which leads to serious client dissatisfaction.
Q7. You wrote in your study that the big enemies of stakeholders and user satisfaction are delay and uncertainty. What remedies do exist to alleviate this?
Dale Vile: Beyond the kind of processes and tools we have mentioned…it boils down to effective communication and adequate visibility.
Barry Morris: I think that next-gen troubleshooting systems like software recording technology (such as what we offer at Undo) offer a unique solution to the problem of software reliability. Once we move away from guesswork and use data-driven insight instead, application vendors will be able to resolve the most challenging software defects faster than they have ever been able to do before. The unnecessary delays and uncertainty will be a thing of the past.
Q8. You wrote in your study that software failures are inevitable. It is what happens when they occur that really matters. Can you please explain what do you mean here?
Dale Vile: No one expects perfection; not even business users and stakeholders. So provided the software isn’t wildly buggy or unstable, it mostly comes down to how well you respond when problems occur. What annoys people the most in this respect is not knowing what’s going on. Informing someone that you know what the problem is, but it’s going to take some time to fix, is much better than telling them you have no idea what’s causing their problem. Even better if you can give them a timescale for a resolution, and/or a workaround that doesn’t represent a major inconvenience. Interestingly, if you diagnose and fix a problem quickly, the research suggests that you can actually turn a software incident into a positive experience that enhances satisfaction, confidence and mutual respect.
Q9. What remedies are available for that?
Dale Vile: A big enabler here is a modern approach to diagnostics: having the tooling and the processes in place that allow you to troubleshoot effectively in a complex production environment. Traditional approaches are often undermined by the sheer number of moving parts and dependencies, so you need a way to deal with that. This is where solutions such as program execution recording and replay capability (aka software flight recording technology) can help.
Q10.You wrote in your study that switching decisions are often down to simple economics. Can you please explain what do you mean?
Dale Vile: If an application is continually causing problems, the result is increased cost. At one level, this could be down to the additional resource required to support, maintain and troubleshoot software defects. Often more significant, is the end-user productivity hit that stems from people not able to do their jobs properly and efficiently.
There are then various kinds of opportunity costs, e.g. weeks spent battling unreliable software is time not being spent adding value to the business. In extreme cases, such as when customer facing systems are involved, repeated failure can lead to reputational damage, loss of customer confidence, and ultimately lost revenue and market share. It depends on the organisation and the specific application; but in every case there comes a point when the cost to the business of living with unreliable software is ultimately higher than the cost of switching.
Q11. What are the most effective solutions to software diagnostic processes?
Dale Vile: Solutions that work holistically. It’s about capturing all of the relevant events, inputs and variables, especially at execution time in the production environment; then providing actionable data and insights for engineers to facilitate rapid diagnosis and resolution.
Barry Morris: Dale is right. The most effective solutions to software failure diagnostic are those that provide full visibility and definitive data-driven insight into what your software really did before it crashed or resulted in incorrect behaviour. Software recording technology will speed up time-to-resolution by a factor of 10. But the beauty of this kind of approach is that you can now diagnose even the hardest of bugs that you couldn’t resolve before – just because a recording represents the reproducible test case you couldn’t obtain before.
Q12. What are the main conclusions of your study?
Dale Vile: In summary, in a world where software is critical to the business, applications must be reliable; otherwise damaging and costly disruption will result. With this in mind, it’s important to be able to respond quickly and effectively when problems occur. This shines a clear spotlight on diagnostics – an area in which many have clear room for improvement. New approaches and tools are required here, especially for troubleshooting in complex production environments.
The good news is that technology is emerging that can help, but at the moment we see an awareness gap. Our recommendation is therefore for anyone involved in software delivery and support to get up to speed on what’s available, e.g. from companies like Undo and others.
Qx. Anything else you wish to add?
Dale Vile: When you get right down to it, software reliability is a business issue. One of the most striking findings from the research for me is the level of willingness among enterprise customers to switch solutions and suppliers when the pain and cost of unreliable software gets too high. This should be a wake-up call for ISVs and other software makers, not just to manage product quality, but also to work proactively with customers on preventative diagnostic and remedial activity.
Barry Morris: As systems are becoming more and more complex, troubleshooting is not getting any easier…so has to be data-driven.
With over 25 years’ experience working in enterprise software and database systems, Barry is a prodigious company builder, scaling start-ups and publicly held companies alike. He was CEO of distributed service-oriented architecture (SOA) specialists IONA Technologies between 2000 and 2003 and built the company up to $180m in revenues and a $2bn valuation.
A serial entrepreneur, Barry founded NuoDB in 2008 and most recently served as its Executive Chairman. Barry has now been appointed as CEO in September 2018 to lead Undo‘s high-growth phase.
Dale is a co-founder of Freeform Dynamics, and today runs the company.
He oversees the organisation’s industry coverage and research agenda, which tracks technology trends and developments, along with IT-related buying behaviour among mainstream enterprises, SMBs and public sector organisations.
During his 30 year career, he has worked in enterprise IT delivery with companies such as Heineken and Glaxo, and has held sales, channel management and international market development roles within major IT vendors such as SAP, Oracle, Sybase and Nortel Networks. He also spent a couple of years managing an IT reseller business for Admiral Software.
Dale has been involved in IT industry research since the year 2000 and has a strong reputation for original thinking and alternative perspectives on the latest technology trends and developments. He is a widely published author of books, reports and articles, and is an authoritative and provocative speaker.
Hosted by Prof. Zicari of ODBMS.org and featuring Undo CEO Barry Morris and Distinguished Analyst Dale Vile, Freeform Dynamics, this webinar recording covers:
– New market research – the frequency, types, and economic impact of defects on users and developers of enterprise software.
– The importance of fast diagnostics and swift remediation when problems occur in production.
– How to increase enterprise software reliability with software flight recording technology
“The challenges, impact and solutions to troubleshooting software failures“, Freeform Dynamics. Access the full study report here (LINK registration required).
– On Software Quality. Q&A with Alexander Boehm (SAP) and Greg Law (Undo). ODBMS.org, November 26, 2018.
Dr. Alexander Boehm is a database architect working on SAP´s HANA in-memory database management system. Greg Law is Co-founder and CTO of Undo.
Follow us on Twitter: @odbmsorg
“Perhaps less obvious is how role definitions in an organization change as scale increases. Once rare tasks that were just a small part of one team’s responsibilities become so common that they are a full-time job for someone. At that point, one either needs to create automation for the task, or a new team needs to be assembled (or hired) to perform that task full time. ” — Eric Tune
I have interviewed Eric Tune, Senior Staff Engineer at Google. We talked about Kubernetes. Eric has been a Kubernetes contributor since 1.0
Q1. What are the main technical challenges in implementing massive-scale environments?
Eric Tune: Whether working at small or massive scale, the high-level technical goals don’t change: security, developer velocity, efficiency in use of compute resources, supportability of production environments, and so on.
As scale increases, there are some fairly obvious discontinuities, like moving from an application that fits on a single-machine to one that spans multiple machines, and from a single data center or zone to multiple regions. Quite a bit has been written about this. Microservices in particular can be a good fit because they scale well to more machines and more regions.
Perhaps less obvious is how role definitions in an organization change as scale increases. Once rare tasks that were just a small part of one team’s responsibilities become so common that they are a full-time job for someone. At that point, one either needs to create automation for the task, or a new team needs to be assembled (or hired) to perform that task full time. Sometimes, it is obvious how to do this. But, when this repeats many times, one can end up with a confusing mess of automation and tickets, dragging down development velocity and confounding attempts to analyze security and debug systemic failure.
So, a key challenge is finding the right separation responsibilities so that multiple pieces of automation, and multiple human teams collaborate well. Doing it requires not only having a broad view of an organization’s current processes and responsibilities around development, operations, and security; but also which assumptions behind those are no longer valid.
Kubernetes can help hereby providing automation for exactly the types of tasks that become toilsome as scale increases. Several of its creators have lived through organic growth to a massive-scale. Kubernetes is built from that experience, with awareness of the new roles that are needed at massive-scale.
Q2. What is Kubernetes and why is it important?
Eric Tune: First, Kubernetes is one of the most popular ways to deploy applications in containers. Containers make the act of maintaining the machine & operating system a largely separate process from installing and maintaining an application instance – no more worrying about shared library or system utility version differences.
Second, it provides an abstraction over IaaS: VMs, VM images, VM types, load balancers, block storage, auto-scalers, etc. Kubernetes runs on numerous clouds, on-premises, and on a laptop. Many complex applications, such as those consisting of many microservices, can be deployed onto any Kubernetes cluster regardless of the underlying infrastructure. For an organization that may want to modernize their applications now, and move to cloud later, targeting Kubernetes means they won’t need to re-architect when they are ready to move. Third, Kubernetes supports infrastructure-as-code (IaC). You can define complex applications, including storage, networking, and application identity, in a common configuration language, called the Kubernetes Resource model. Unlike other IaC systems, which mostly support a “single-user” model, Kubernetes is designed for multiple users. It supports controlled delegation of responsibility from an ops team to a dev team.
Fourth, it provides an opinionated way to build distributed system control planes, and to extend the APIs and infrastructure-as code type system. This allows solution vendors and in-house infrastructure teams to build custom solutions that feel like they are first class parts of Kubernetes.
Q3. Who should be using Kubernetes?
Eric Tune: If your organization runs Linux-based microservices and has explored container technology, then you are ready to try Kubernetes.
Q4. You are a Kubernetes contributor since 1.0 (4 years). What did you work on specifically?
Eric Tune: During the first year, I worked on whatever needed to be done, including security (namespaces, service accounts, authentication and authorization, resource quota), performance, documentation, testing, API review and code review.
In those first years, people were mostly running stateless microservices on Kubernetes. In the second year, I worked to broaden the set of applications that can run on Kubernetes. I worked on the Job and CronJob APIs of Kubernetes, which support basic batch computation, and the StatefulSet API, which supports databases and other stateful applications. Additionally, I worked with the Helm project on Charts (easy-to-install applications for Kubernetes), with the Spark open source community to get it running on Kubernetes.
Starting in 2017, Kubernetes interest was growing so quickly that the project maintainers could not accept a fraction of the new features that were proposed. The answer was to make Kubernetes extensible so that new features could be build “out of the core.” I worked to define the extensibility story for Kubernetes, particularly for Custom Resource Definitions (CRDs) and Webhooks. The extensibility features of Kubernetes have enabled other large projects, such as Istio and Knative, to integrate with Kubernetes with lower overhead for the Kubernetes project maintainers.
Currently, I lead teams which work on both Open Source Kubernetes and Google Cloud.
Q5. What are the main challenges of migrating several microservices to Kubernetes?
Eric Tune: Here are three challenges I see when migrating several microservices to Kubernetes, and how I recommend handling them:
- Remove Ordering Dependencies: Say microservice C depends on microservices A and B to function normally. When migrating to declarative configuration and Kubernetes, the startup order for microservices can become variable, where previously it was ordered (e.g. by a script). This can cause unexpected behaviors. For example, microservice C might log errors at a high rate or crash if A is not ready yet. A first reaction is sometimes “how can I guarantee ordering of microservice startup,” My advice is not to impose order, but to change problematic behavior. For example, C could be changed to return some response for a request even when A and B are unreachable. This is not really a Kubernetes-specific requirement – it is a good practice for microservices, as it allows for graceful recovery from failures and for autoscaling.
- Don’t Persist Peer Network Identity: Some microservices permanently record the IP addresses of their peers at startup time, and then don’t expect it to ever change. That’s not a great match for the Kubernetes approach to networking. Instead, resolve peer addresses using their domain names and re-resolve after disconnection.
- Plan ahead for Running in Parallel: When migrating a complex set of microservices to Kubernetes, it’s typical to run the entire old environment and the new (Kubernetes) environment in parallel. Make sure you have load replay and response diffing tools to evaluate a dual environment setup.
Q6. How can Kubernetes scale without increasing ops team?
Eric Tune: Kubernetes is built to respond to many types of application and infrastructure failures automatically – for example slow memory leaks in an application, or kernel panics in a virtual machine. Previously this kind of problem may have required immediate attention. With Kubernetes as the first line of defense, ops can wait for more data before taking action. This in turn supports faster rollouts, as you don’t need to soak as long if you know that slow memory leaks will be handled automatically, and you can fix by rolling forward rather than back.
Some ops teams also face multiple deployment environments, including multi-cloud, hybrid, or varying hardware in on-premises datacenters. Kubernetes hides somes differences between these, reducing the number of variations of configuration that is needed.
A pattern I have seen is role specialization within ops teams, which can bring efficiencies. Some members specialize in operating the Kubernetes cluster itself, what I call a “Cluster Operations” role, while others specialize in operating a set of applications (microservices). The clean separation between infrastructure and application – in particular the use of Kubernetes configuration files as a contract between the two groups – supports this separation of duties.
Finally, if you are able to choose a hosted version of Kubernetes such as Google Container Engine (GKE), then the hosted service takes on much of the Cluster Operations role. (Note: I work on GKE.)
Q7. On-premises, hybrid, or public cloud infrastructure: which solutions would you think is it better for running Kubernetes?
Eric Tune: Usually factors unrelated to Kubernetes will determine if an application needs to run on-premises, such as data sovereignty, latency concerns or an existing hardware investment. Often some applications need to be on-premises and some can move to public cloud. In this case you have a hybrid Kubernetes deployment, with one or more clusters on-premises, and one or more clusters on public cloud. For application operators and developers, the same tools can be used in all the clusters. Applications in different clusters can be configured to communicate with each other, or to be separate, as security needs dictate. Each cluster is a separate failure domain. One does not typically have a single cluster which spans on-premises and public cloud.
Q8. Kubernetes is open source. How can developers contribute?
Eric Tune: We have 150+ associated repositories that are all looking for developers (and other roles) to contribute. If you want to help but aren’t sure what you want to work on, then start with the Community ReadMe, and come to the community meetings or watch a rerun. If you think you already know what area of Kubernetes you are interested in, then start with our contributors guide, and attend the relevant Special Interest Group (SIG) meeting.
Dr. Eric Tune is a Senior Staff Engineer at Google. He leads dozens of engineers working on Kubernetes and GKE. He has been a Kubernetes contributor since 1.0. Previously at Google he worked on the Borg container orchestration system, drove company-wide compute efficiency improvements, created the Google-wide Profiling system, and helped expand the size of Google’s search index. Prior to Google, he was active in computer architecture research. He holds computer engineering degrees (PhD, MS, BS) from UCSD .
“N1QL for Analytics is the first commercial implementation of SQL++.” –Mike Carey
I have interviewed Michael Carey, Bren Professor of Information and Computer Sciences and Distinguished Professor of Computer Science at UC Irvine, where he leads the AsterixDB project, as well as a Consulting Architect at Couchbase. We talked about SQL++, the AsterixDB project, and the Couchbase N1QL for Analytics.
Q1. You are Couchbase’s Consulting Chief Architect. What are your main tasks in such a role?
Mike Carey: This came about when Couchbase began working on the effort that led to the recently released Couchbase Analytics Service, a service that was born when Ravi Mayuram (Couchbase’s Senior VP of Engineering and CTO) and I realized that Couchbase and the AsterixDB project shared a common vision regarding what future data management systems ought to look like. Rather than making me quit my day job, I was given the opportunity to participate in a consulting role and build a team within Couchbase to make the Analytics Service happen — using AsterixDB as a starting point. I guess now I’m kind of a mini-CTO for database-related issues; I primarily focus on the Analytics Service, but I also pay attention to the Query Service and the Couchbase Data Platform as a whole, especially when it comes to things like its query capabilities. I spend one day a week up at Couchbase HQ, at least most weeks. It’s really fun, and this keeps me connected to what’s happening in the “real world” outside academia.
Q2. What is SQL++ ? And what is special about it?
Mike Carey: SQL++ is a language that came out of work done by Prof. Yannis Papakonstantinou and his group at UC San Diego. Prior to SQL++, in the AsterixDB project, we had invented and implemented a full query language for semi-structured data called AQL (short for Asterix Query Language) based on a data model called ADM (short for Asterix Data Model). ADM was the result of realizing back in 2010 that JSON was coming in a pretty big way — we looked at JSON from a database data modeling perspective and added some things inspired by object databases that were missing. Most notable were the option to specify schemas, at least partially, if desired, and the ability to have multisets as well as arrays as multi-valued fields. AQL was the result of looking at XQuery, since it had been designed by a group of world experts to deal with semi-structured data, and then throwing out its “XML cruft” in order to gain a nice query language for ADM. To make AQL a bit more natural for SQL users, we also allowed some optional keyword substitutions (such as SELECT for RETURN and FROM for FOR). We had a pretty reasonable technical explanation for users as to why AQL was what it was — why it wasn’t just a SQL extension. Users listened and learned AQL, but they always seemed to wistfully sigh and continue to wish that AQL was more directly like SQL (in its syntax and not just its query power).
More or less in parallel, Yannis and friends were building a data integration system called FORWARD to integrate data of varied shapes and sizes from heterogeneous data stores. The FORWARD view of data was based on a semi-structured data model, and SQL++ was the SQL-based language framework that Yannis developed to classify the query capabilities of the stores. It also served as the integration language for FORWARD’s end users. At some point he approached us with a draft of his SQL++ framework paper, getting our attention by saying nice things about AQL relative to the other JSON query languages (:-)), and we took a look. Pretty quickly we realized that SQL++ was very much like AQL, but with a SQL-based syntax that would make those wistful AQL users much happier. Yannis did a very nice job of extending and generalizing SQL, allowing for a few differences where needed, such as where SQL had made “flat-world” or schema-based assumptions that no longer hold for JSON, and exploiting the generality of the nested data model, like adding richer support for grouping and de-mystifying grouped aggregation.
We have since “re-skinned” Apache AsterixDB to use SQL++ as the end-user query language for the system. This was actually relatively easy to do since all of the same algebra and physical operators work for both. We recently deprecated AQL altogether as an end-user language.
Q3. What is N1QL for Analytics?
Mike Carey: The Couchbase Analytics service is a component of the Couchbase Data Platform that allows users to run analytical-sized queries over their Couchbase JSON data. N1QL for Analytics is the product name for the end-user query language of Couchbase Analytics. It’s a dialect of SQL++, which itself is a language framework; the framework includes a number of choices that a SQL++ implementer gets to pin down about details like data types, missing information, supported functions, and so on. N1QL for Analytics could have been called “Couchbase SQL++”, but N1QL (non-1NF query language) is what Couchbase originally called the SQL-inspired query language for its Query service. A decision was made to keep the N1QL brand name, while adding “for Query” or “for Analytics” to more specifically identify the target service. Over time both N1QLs will be converging to the same dialect of SQL++. The bottom line is that N1QL for Analytics is the first commercial implementation of SQL++.
By the way, there’s a terrific new book available on Amazon called “SQL++ for SQL Users: A Tutorial.” It was written by Don Chamberlin, of SQL fame, for folks who want to learn more about SQL++ (from one of the world’s leading query language experts).
Q4. Is N1QL for Analytics based entirely on the SQL++ framework?
Mike Carey: Indeed it is. As I mentioned, N1QL for Analytics is really a dialect of SQL++, having chosen a particular combination of detailed settings that the framework provides options for. In the future it may gain other extensions, e.g., support for window queries, but right now, N1QL for Analytics is based entirely on the SQL++ framework.
Q5. How is new Couchbase Analytics influenced by the open-source Apache AsterixDB project?
Mike Carey: You’ve probably seen those computer ads in magazines that say “Intel Inside,” yes? In this case, the ad would say “Apache AsterixDB Inside”…
Q6. Specifically, did you re-use the Apache AsterixDB query engine? Or else?
Mike Carey: Specifically, yes. The Couchbase Data Platform, internally, is based on a software bus that the Data service (the Key/Value store service) broadcasts all data events on — and components like the Index service, Full Text service, Cross Datacenter Replication service, and others are all bus listeners. The Analytics service is a listener as well, and it manages a real-time replica of the KV data in order to make that data immediately available for analysis in a performance-isolated manner. Performance isolation is needed so that analytical queries don’t interfere with the front-end applications. Under the hood, the Analytics service is based on Apache AsterixDB — its storage facilities are used to store and manage the data, and its query engine powers the parallel query processing. The developers at Couchbase contribute their work on those components back to the Apache AsterixDB open source, and these days they’re among its most prolific committers. Couchbase Analytics also has some extensions that are only available from Couchbase — including integrated system management, cluster resizing, and a nice integrated query console — but the core plumbing is the same.
Q7. SQL does not provide an efficient solution for querying JSON or semi-structured data in JSON form. Can you explain how Couchbase Analytics analyzes data in JSON format? What is that capability useful for?
Mike Carey: Couchbase Analytics supports a JSON-based “come as you are” data model rather than requiring data to be normalized and schematized for analysis. We like to say that this gives users “NoETL for NoSQL.” You can perhaps think of it as being a data mart for Couchbase application data. The application folks think about their data naturally; if it’s nested, it’s allowed to be nested (e.g., an order object can contain a nested set of line items and a nested shipping address), and if it’s heterogeneous, it’s allowed to be heterogeneous (e.g., an electronic product can have different descriptive data than a clothing product or a furniture product). Couchbase Analytics allows data analysis on data that looks like that — data can “come as it is” and SQL++ is ready to query it in that “as is” form. You can do all the same analyses that you could do if you first designed a relational schema and wrote a collection of ETL scripts to move the data into a parallel SQL DBMS — but without having to do all that. Instead, you can now “have your data and query it too” in its original, natural, front-end JSON structure.
Q8. Can you please explain the architecture behind Couchbase`s MPP engine for JSON data?
Mike Carey: Sure, that’s easy — I can pretty much just refer you to the body of literature on parallel relational data management. (For an overview, see the classic DeWitt and Gray CACM paper on parallel database systems.)
Under the hood, the query engine for Couchbase Analytics and Apache AsterixDB looks like a best-practices parallel relational query engine. It uses hash partitioning to scale out horizontally in an MPP fashion, and it using best-practices physical operators (e.g., dynamic hash join, broadcast join, index join, parallel sort, sort-based and hash-based grouped aggregation, …) to deal gracefully with very large volumes of data. The operator set and the optimizer rules have just been extended where needed to accommodate nesting and schema optionality. Data is hash-partitioned on its primary key (the Couchbase key), with optional local secondary indexes on other fields, and queries run in parallel on all nodes in order to support linear speed-up and/or scale-up.
Q9. Do you think other database vendors will implement their own version/dialect of SQL++ ?
Mike Carey: Indeed I do. It’s a really nice language, and it makes a ton of sense as the “right” answer to querying the more general data models that one gets when one lets down their relational guard. It’s a whole lot cleaner than the “JSON as a column type” approach to adding JSON support to traditional RDBMSs in my opinion.
Qx. Anything else you wish to add?
Mike Carey: I teach the “Introduction to Data Management” class at UC Irvine as part of my day job. Our class sizes these days are exceeding 400 students per quarter — database systems are clearly not dead in students’ eyes! For the past few years I’ve been spending the last bit of the class on “NoSQL technology” — which to me means “no schema required” — and I’ve used SQL++ for the associated hands-on homework assignment. It’s been great to see how quickly and easily (relatively new!) SQL users can get their heads around the more relaxed data model and the query power of SQL++. Some faculty friends at the University of Washington have done this as well, and their experience there has been similar. I would like to encourage others to do the same! With SQL++, richer data no longer has to mean writing get/put programs or effectively hand-writing query plans, so it’s a very nice platform for teaching future generations about the emerging NoSQL world and its concepts and benefits.
Michael Carey received his B.S. and M.S. degrees from Carnegie-Mellon University and his Ph.D. from the University of California, Berkeley. He is currently a Bren Professor of Information and Computer Sciences and Distinguished Professor of Computer Science at UC Irvine, where he leads the AsterixDB project, as well as a Consulting Architect at Couchbase, Inc. Before joining UCI in 2008, he worked at BEA Systems for seven years and led the development of their AquaLogic Data Services Platform product for virtual data integration. He also spent a dozen years at the University of Wisconsin-Madison, five years at the IBM Almaden Research Center working on object-relational databases, and a year and a half at e-commerce platform startup Propel Software during the infamous 2000-2001 Internet bubble. He is an ACM Fellow, an IEEE Fellow, a member of the National Academy of Engineering, and a recipient of the ACM SIGMOD E.F. Codd Innovations Award. His current interests center around data-intensive computing and scalable data management (a.k.a. Big Data).
SQL++ For SQL Users: A Tutorial, Don Chamberlin, September 2018 (Free Book 143 pages)
Follow us on Twitter: @odbmsorg
” Learned indexes are able to learn from and benefit from patterns in the data and the workload. Most previous data structures were not designed to optimize for a particular distribution of data.” –Alex Beutel
I have interviewed Alex Beutel, Senior Research Scientist in the Google Brain SIR team. We talked about “Learned Index Structures“- data structures thought of as performing prediction tasks- their difference with respect to traditional index structures and their main benefits.
Q1. What is your role at Google?
Alex Beutel: I’m a research scientist within Google AI, specifically the Google Brain team. I focus on a mixture of recommender systems, machine learning fairness, and machine learning for systems. While these may sound quite different, I think they are all areas of machine learning application with unique, rich challenges and opportunities driving from understanding the data distribution.
Q2. You recently published a paper on so called Learned Index Structures . In the paper, you stated that Indexes (e.g B-Tree-Index, Hash-Index, BitMap-Index) can be replaced with other types of models, including deep-learning models, which you term learned indexes. Why do you want to replace well known Index-structures?
Alex Beutel: Traditional index structures are fundamental to databases and computer science in general, so they are important to study and have been deeply studied for a long time. I think whenever you can find a new perspective on such a well-studied area, it is worth exploring. In this case, we challenge the assumptions in data structure design by jumping from the more traditional discrete structures to continuous, stochastic components that can make mistakes. However, by taking this perspective, we find that we now have at our disposal a whole breadth of tools from the machine learning, data mining, and statistics communities that we can bring to bear on databases and more broadly data systems problems. Personally, rethinking these fundamental tasks with this new lens has been extremely exciting and fun.
Q3. What is the key idea for learned indexes?
Alex Beutel: The key idea for learned indexes is that many data structures can be thought of as performing prediction tasks, and as a result rather than building a discrete structure, use machine learning to build a model for the task .
Q4. What are the main benefits of learned indexes? Which applications could benefits from such learned indexes?
Alex Beutel: I want to separate what are the possible benefits and when or why can learned indexes realize those benefits. At a high level, using machine learned models lets us build data structures from a new broader set of tools. We have found that depending on the learned index configuration, we are able to get improvements in latency (speed), memory usage, and computational cost of running the index structure. Depending on the application, we can tune the learned index to get more savings in one or more of these dimensions. For example, in the paper we propose a hierarchical model structure, and we show that we can build a larger hierarchy and use more memory to get an even faster lookup or use a much smaller hierarchy to save memory and still not make the system too slow.
Why and when we are able to realize these benefits is a much more complicated question. One of the big advantages is that machine learning models make use of floating point operations which can be more easily parallelized with modern hardware, and with the growth of GPUs and TPUs, we may be able to build bigger and more accurate models without increasing latency.
Another aspect that I find exciting is that learned indexes are able to learn from and benefit from patterns in the data and the workload. Most previous data structures were not designed to optimize for a particular distribution of data. Rather, they often assume a worst-case distribution or ignore it entirely. But data structures aren’t being used in the abstract — they are being used on real data, which as we know from other areas of research, have many significant patterns. So one could ask, how can we make use of the patterns in the data being stored or processed to improve the efficiency of systems? ML models are extremely effective in adapting to those varying data distributions.
I think any application that is processing large amounts of data stands to benefit from taking this perspective. We focused on index structures in databases, but we have already seen multiple papers being published applying this perspective to new systems.
Q5. How can learned indexes learn the sort order or structure of lookup keys and use this signal to predict the position or existence of records?
Alex Beutel: B-Trees are already predicting the positions of records: they are built to give the block in which a record lies, and they do this just by processing the key. Learned indexes can do the same thing where they predict approximately where the record is. For example, if the keys are all even integers from 100 to 1000 (that is, key=100 has position 0, key=102 has position 1, key=104 has position 2, etc.), then the model f(key) = (key – 100)/2 will perfectly map from keys to positions. If the data aren’t exactly the even integers but on average we see one key every 2 spots (for example, keys: 100, 101, 105, 106, 109, 110, …) then f(key) above is still a pretty good model and for any key the model will almost find the exact position. Even if the data follow a more complicated pattern, we can learn a model to understand the distribution. It turns out that this is learning the cumulative distribution function, which has long been studied in statistics. This is exciting in that for those examples above, lookups become a constant-time operation, rather than growing with the size of the data; and more generally, this could change how we think about the complexity of these functions.
One challenge is that we can’t just return the approximate position; these data structures need to return the actual record being searched for. Typically, B-Trees will then scan through the block where the key is to find the exact right position. Likewise, when using a learned index, the model may not give the exact right position, but instead a close by one.
To return exactly the correct record, we search near the predicted position to find it; and the more accurate the model is, the faster the search will be.
Knowing if a record exists is quite different. Traditionally, Bloom filters have been used for this task; given a key, the Bloom filter will tell you if the key exists in the dataset, and if the key isn’t in the dataset the Bloom filter will mistakenly tell you it is with some small probability, called the false positive rate (FPR). This is a binary prediction problem: given a key, predict whether it’s in the dataset. Unlike traditional Bloom filters, we learn a model that tries to learn if there is some systematic difference between keys in the dataset and other questions (queries) asked of the Bloom filter. That is, if the dataset has all positive integers less than 1000, there is a trivial model g(key) := 1000 > key > 0 that can perfectly answer any query. If the dataset has all positive integers less than 1000 except for 517 then this is still a pretty good model with very few mistakes (FPR = 0.1%). If the dataset is malware URLs, these patterns are less obvious, but in fact lots of researchers have been studying what patterns are indicative of malware URLs (and distinguish them from normal webpage URLs), and we can build models to make use of these systematic differences.
From an accuracy perspective, Bloom filters have stringent requirements about no false negatives and low FPR, and so we build systems that combine machine learning classifiers and traditional Bloom filters to meet these requirements.
Q6. Under which conditions learned indexes outperform traditional index structures?
Alex Beutel: As mentioned above, I think there are a few key conditions for learned indexes being beneficial. First and foremost, it depends on the patterns of the data and workload being processed. In the range query case (B-Trees), if the data follow a linear pattern then learned indexes will easily excel; more complex data distributions may require more complex model structures which may not be okay for the application at hand. For existence indexes, the success of the model depends on how easily it can distinguish between keys in the dataset and real queries to the Bloom filter; distinguishing between even and odd integers is easy, but if the dataset is entirely random keys this will be very difficult.
In addition to making use of patterns in the data and workload, learned indexes depend on the environment they are being used in. For example, we study in-memory databases in our paper, and more recently we have found that disk-based systems require new techniques. For our learned Bloom filters we assume that saving memory is most important, but if there is a strict latency requirement, then the model design may need to change. If GPUs and TPUs were readily available, the learned index design would likely change dramatically.
Q7. What are the main challenges in designing learned index structures?
Alex Beutel: I think there are interesting challenges both in system design and in machine learning.
For systems, machine learned models provide much looser guarantees about accuracy than traditional data structures.
As a result, making use of ML models’ noisy predictions requires building systems that are robust to those errors.
In the B-Tree case we studied different local search strategies. For existence indexes we coupled the model with a Bloom filter to guarantee no false negatives. Interestingly, new research by Michael Mitzenmacher has shown that sandwiching the model between two Bloom filters does even better . I believe there are lots of interesting questions about (a) what is the right prediction task for machine learning models when incorporated in a system and (b) how should these models be safely integrated in the system.
On the machine learning side there are numerous challenges in building models that match the needs of these systems.
For example, most machine learning models are expected to execute on the order of milliseconds or slower; for learned indexes we often need the model to execute thousands of times faster. Tim Kraska, the first author on our paper, did a lot of optimizations for very fast execution of the model. In most of machine learning, overfitting is bad; for learned indexes that is not true — how should that change model design? How do I build model families that can trade-off memory and latency?
How do I build models that match the hardware they are running on, from parallelization to caching effects?
While these are challenges to making learned indexes work, they also present opportunities for interesting research from different communities working together.
Alex Beutel: We found some really great benefits. Depending on the use case learned indexes were able to be up to 3 times faster and in some cases use only 1% of the memory of a traditional B-Tree.
Q9. What is the implication of replacing core components of a data management system through learned models for future systems designs?
Alex Beutel: As I mentioned above, there have already been multiple papers applying these ideas to new core components, and we have been studying how to extend these ideas to a wide range of areas from indexing multidimensional data to sorting algorithms . We have seen similar opportunities and excitement in systems beyond databases, such as research for scheduling and caching.
My hope is that more folks building data management systems, and really any system that is processing data, think about if there are patterns in the data and workload the system is processing. Because there most likely are patterns, and I believe building new systems that can be customized and optimized for those patterns will greatly improve the systems’ efficiency.
Alex Beutel is a Senior Research Scientist in the Google Brain SIR team working on neural recommendation, fairness in machine learning, and ML for Systems. He received his Ph.D. in 2016 from Carnegie Mellon University’s Computer Science Department, and previously received his B.S. from Duke University in computer science and physics. His Ph.D. thesis on large-scale user behavior modeling, covering recommender systems, fraud detection, and scalable machine learning, was given the SIGKDD 2017 Doctoral Dissertation Award Runner-Up. He received the Best Paper Award at KDD 2016 and ACM GIS 2010, was a finalist for best paper in KDD 2014 and ASONAM 2012, and was awarded the Facebook Fellowship in 2013 and the NSF Graduate Research Fellowship in 2011. More details can be found at alexbeutel.com.
 Michael Mitzenmacher. A Model for Learned Bloom Filters, and Optimizing by Sandwiching. NeurIPS, 2018.
Stanford Seminar – The Case for Learned Index Structures. EE380: Computer Systems. Speakers: Alex Beutel and Ed Chi, Google, Published on Oct 18, 2018 (LINK to YouTube Video)
On Data, Exploratory Analysis, and R. Q&A with Ronald K. Pearson, ODBMS.org, April 13, 2018
On Apache Kafka®. Q&A with Gwen Shapira, ODBMS.org, March 26, 2018.
How to make Artificial Intelligence fair, transparent and accountable, ODBMS.org, January 27, 2018
Follow us on Twitter: @odbmsorg