“When we started this project in 2013, it was a moonshot. We were not sure if NVM technologies would ever see the light of day, but Intel has finally started shipping NVM devices in 2019. We are excited about the impact of NVM on next-generation database systems.” — Joy Arulraj and Andrew Pavlo.
I have interviewed Joy Arulraj, Assistant Professor of Computer Science at Georgia Institute of Technology, and Andrew Pavlo, Assistant Professor of Computer Science at Carnegie Mellon University. They just published a new book, “Non-Volatile Memory Database Management Systems”. We talked about non-volatile memory technologies (NVM), and how NVM is going to impact the next-generation database systems.
Q1. What are emerging non-volatile memory technologies?
Arulraj, Pavlo: Non-volatile memory (NVM) is a broad class of technologies, including phase-change memory and memristors, that provide low-latency reads and writes on the same order of magnitude as DRAM, but with persistent writes and large storage capacity like an SSD. For instance, Intel recently started shipping its Optane DC NVM modules based on 3D XPoint technology.
Q2. How do they potentially change the dichotomy between volatile memory and durable storage in database management systems?
Arulraj, Pavlo: Existing database management systems (DBMSs) can be classified into two types based on the primary storage location of the database: (1) disk-oriented and (2) memory-oriented DBMSs. Disk-oriented DBMSs are based on the same hardware assumptions that were made in the first relational DBMSs from the 1970s, such as IBM’s System R. The design of these systems targets a two-level storage hierarchy comprising a fast but volatile byte-addressable memory for caching (i.e., DRAM), and a slow, non-volatile block-addressable device for permanent storage (i.e., SSD). These systems make the pessimistic assumption that a transaction could access data that is not in memory, and thus will incur a long delay to retrieve the needed data from disk. They employ legacy techniques, such as heavyweight concurrency-control schemes, to overcome these limitations.
Recent advances in manufacturing technologies have greatly increased the capacity of DRAM available on a single computer.
But disk-oriented systems were not designed for the case where most, if not all, of the data resides entirely in memory.
The result is that many of their legacy components have been shown to impede their scalability for transaction processing workloads. In contrast, the architecture of memory-oriented DBMSs assumes that all data fits in main memory, and it therefore does away with the slower, disk-oriented components from the system. As such, these memory-oriented DBMSs have been shown to outperform disk-oriented DBMSs. But, they still have to employ heavyweight components that can recover the database after a system crash because DRAM is volatile. The design assumptions underlying both disk-oriented and memory-oriented DBMSs are poised to be upended by the advent of NVM technologies.
Q3. Why are existing DBMSs unable to take full advantage of NVM technology?
Arulraj, Pavlo: NVM differs from other storage technologies in the following ways:
- Byte-Addressability: NVM supports byte-addressable loads and stores unlike other non-volatile devices that only support slow, bulk data transfers as blocks.
- High Write Throughput: NVM delivers more than an order of magnitude higher write throughput compared to SSD. More importantly, the gap between sequential and random write throughput of NVM is much smaller than other durable storage technologies.
- Read-Write Asymmetry: In certain NVM technologies, writes take longer to complete compared to reads. Further, excessive writes to a single memory cell can destroy it.
Although the advantages of NVM are obvious, making full use of them in a DBMS is non-trivial. Our evaluation of state-of-the-art disk-oriented and memory-oriented DBMSs on NVM shows that the two architectures achieve almost the same performance when using NVM. This is because current DBMSs assume that memory is volatile, and thus their architectures are predicated on making redundant copies of changes on durable storage. This illustrates the need for a complete rewrite of the database system to leverage the unique properties of NVM.
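The byte-addressability point above can be made concrete with a little arithmetic. The following is a minimal sketch, not taken from the interview: the 4096-byte block size, the 64-byte write granularity, and the helper name `bytes_written` are all assumptions chosen for illustration, and real devices differ.

```python
def bytes_written(offset: int, length: int, unit: int) -> int:
    """Bytes a device must physically write when every update is
    rounded out to whole `unit`-sized chunks."""
    first_chunk = offset // unit
    last_chunk = (offset + length - 1) // unit
    return (last_chunk - first_chunk + 1) * unit


BLOCK_SIZE = 4096  # assumed block size of a block-addressable device
CACHE_LINE = 64    # assumed write granularity of a byte-addressable device

# Hypothetical update: change one 8-byte field at byte offset 10,000.
offset, length = 10_000, 8

block_bytes = bytes_written(offset, length, BLOCK_SIZE)
nvm_bytes = bytes_written(offset, length, CACHE_LINE)

print(f"block device writes {block_bytes} bytes for an {length}-byte update")
print(f"byte-addressable device writes {nvm_bytes} bytes for the same update")
```

Under these assumed sizes, the small update costs a full block on the block-addressable device but only a single cache line on the byte-addressable one.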
Q4. With NVM, which components of legacy DBMSs are unnecessary?
Arulraj, Pavlo: NVM requires us to revisit the design of several key components of the DBMS, including that of the (1) logging and recovery protocol, (2) storage and buffer management, and (3) indexing data structures.
We will illustrate it using the logging and recovery protocol. A DBMS must guarantee the integrity of a database against application, operating system, and device failures. It ensures the durability of updates made by a transaction by writing them out to durable storage, such as SSD, before returning an acknowledgment to the application. Such storage devices, however, are much slower than DRAM, especially for random writes, and only support bulk data transfers as blocks.
During transaction processing, if the DBMS were to overwrite the contents of the database before committing the transaction, then it must perform random writes to the database at multiple locations on disk. DBMSs try to minimize random writes to disk by flushing the transaction’s changes to a separate log on disk with only sequential writes on the critical path of the transaction. This method is referred to as write-ahead logging (WAL).
NVM upends the key design assumption underlying the WAL protocol since it supports fast random writes. Thus, we need to tailor the protocol for NVM. We designed such a protocol that we call write-behind logging (WBL). WBL not only improves the runtime performance of the DBMS, but it also enables it to recover nearly instantaneously from failures. The way that WBL achieves this is by tracking what parts of the database have changed rather than how they were changed. Using this logging method, the DBMS can directly flush the changes made by transactions to the database instead of recording them in the log. By ordering writes to NVM correctly, the DBMS can guarantee that all transactions are durable and atomic. This allows the DBMS to write less data per transaction, thereby improving an NVM device’s lifetime.
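The contrast between the two protocols can be sketched in a few lines. This is a toy illustration under loud assumptions, not the authors' implementation: the dictionaries stand in for durable storage, the function names `commit_wal`, `commit_wbl`, and `logged_value_bytes` are hypothetical, and the WBL log record here simply lists the changed keys, which is a simplification of whatever metadata the real protocol records.

```python
def commit_wal(db: dict, log: list, txn_id: int, changes: dict) -> None:
    """Write-ahead logging: record HOW the data changed (the new values)
    in the log before the database itself is modified."""
    log.append(("UPDATE", txn_id, dict(changes)))  # sequential log write with values
    log.append(("COMMIT", txn_id))
    db.update(changes)  # database is updated afterwards, off the critical path


def commit_wbl(db: dict, log: list, txn_id: int, changes: dict) -> None:
    """Write-behind logging (simplified): flush the changes to the
    NVM-resident database first, then log only WHAT changed."""
    db.update(changes)  # changes go directly to the database
    log.append(("COMMIT", txn_id, sorted(changes)))  # keys only, no values


def logged_value_bytes(log: list) -> int:
    """Count the tuple-value bytes that were copied into the log."""
    total = 0
    for record in log:
        if record[0] == "UPDATE":
            total += sum(len(value) for value in record[2].values())
    return total


changes = {"t1": b"x" * 100, "t2": b"y" * 100}

wal_db, wal_log = {}, []
commit_wal(wal_db, wal_log, txn_id=1, changes=changes)

wbl_db, wbl_log = {}, []
commit_wbl(wbl_db, wbl_log, txn_id=1, changes=changes)

print("WAL value bytes in log:", logged_value_bytes(wal_log))
print("WBL value bytes in log:", logged_value_bytes(wbl_log))
```

Both toy engines end with the same database contents, but only the WAL variant copies the tuple values into the log. The ordering inside `commit_wbl` (database first, log record second) mirrors the point about ordering writes to NVM correctly.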
Q5. You have designed and implemented DBMS storage engine architectures that are explicitly tailored for NVM. What are the key elements?
Arulraj, Pavlo: The design of all of the storage engines in existing DBMSs is predicated on a two-tier storage hierarchy composed of volatile DRAM and a non-volatile SSD. These devices have distinct hardware constraints and performance properties. The traditional engines were designed to account for and reduce the impact of these differences.
For example, they maintain two layouts of tuples depending on the storage device. Tuples stored in memory can contain non-inlined fields because DRAM is byte-addressable and handles random accesses efficiently. In contrast, fields in tuples stored on durable storage are inlined to avoid random accesses because they are more expensive. To amortize the overhead for accessing durable storage, these engines batch writes and flush them in a deferred manner. Many of these techniques, however, are unnecessary in a system with an NVM-only storage hierarchy. We adapted the storage and recovery mechanisms of these traditional engines to exploit NVM’s characteristics.
For instance, consider an NVM-aware storage engine that performs in-place updates. When a transaction inserts a tuple, rather than copying the tuple to the WAL, the engine only records a non-volatile pointer to the tuple in the WAL. This is sufficient because both the pointer and the tuple referred to by the pointer are stored on NVM. Thus, the engine can use the pointer to access the tuple after the system restarts without needing to re-apply changes in the WAL. It also stores indexes as non-volatile B+trees that can be accessed immediately when the system restarts without rebuilding.
The effects of committed transactions are durable after the system restarts because the engine immediately persists the changes made by a transaction when it commits. So, the engine does not need to replay the log during recovery. But the changes of uncommitted transactions may be present in the database because the memory controller can evict cache lines containing those changes to NVM at any time. The engine therefore needs to undo those transactions using the WAL. As this recovery protocol does not include a redo process, the engine has a much shorter recovery latency compared to a traditional engine.
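The in-place engine described above can be sketched as a toy model. Everything here is an assumption made for illustration: the class name `ToyNvmEngine`, the use of a dictionary as the NVM heap, and integer addresses standing in for non-volatile pointers. All attributes are treated as if they survive a crash, which is the property NVM is meant to provide.

```python
class ToyNvmEngine:
    """Toy in-place engine: the WAL holds pointers to tuples, not copies,
    and recovery only undoes uncommitted transactions (no redo pass)."""

    def __init__(self) -> None:
        self.nvm_heap = {}      # address -> tuple; stands in for tuples on NVM
        self.wal = []           # (txn_id, operation, address) records
        self.committed = set()  # ids of transactions that have committed
        self._next_addr = 0

    def insert(self, txn_id: int, tuple_value: tuple) -> int:
        addr = self._next_addr
        self._next_addr += 1
        self.nvm_heap[addr] = tuple_value          # write the tuple in place
        self.wal.append((txn_id, "INSERT", addr))  # log only the pointer
        return addr

    def commit(self, txn_id: int) -> None:
        # The tuple is already on NVM, so commit only marks the transaction.
        self.committed.add(txn_id)

    def recover(self) -> None:
        # Undo-only recovery: remove tuples inserted by uncommitted transactions.
        for txn_id, operation, addr in reversed(self.wal):
            if txn_id not in self.committed and operation == "INSERT":
                self.nvm_heap.pop(addr, None)
        # Keep only the records of committed transactions.
        self.wal = [rec for rec in self.wal if rec[0] in self.committed]


engine = ToyNvmEngine()
engine.insert(1, ("alice", 30))
engine.commit(1)
engine.insert(2, ("bob", 41))  # transaction 2 never commits before the "crash"

engine.recover()               # simulate restart after the crash
print(engine.nvm_heap)
```

After `recover()`, only the committed transaction's tuple remains, and no log record had to be replayed to get there.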
Q6. What is the key takeaway from the book?
Arulraj, Pavlo: All together, the work described in this book illustrates that rethinking the key algorithms and data structures employed in a DBMS for NVM not only improves performance and operational cost, but also simplifies development and enables the DBMS to support near-instantaneous recovery from DBMS failures. When we started this project in 2013, it was a moonshot. We were not sure if NVM technologies would ever see the light of day, but Intel has finally started shipping NVM devices in 2019. We are excited about the impact of NVM on next-generation database systems.
Joy Arulraj is an Assistant Professor of Computer Science at Georgia Institute of Technology. He received his Ph.D. from Carnegie Mellon University in 2018, advised by Andy Pavlo. His doctoral research focused on the design and implementation of non-volatile memory database management systems. This work was conducted in collaboration with the Intel Science & Technology Center for Big Data, Microsoft Research, and Samsung Research.
Andrew Pavlo is an Assistant Professor of Databaseology in the Computer Science Department at Carnegie Mellon University. At CMU, he is a member of the Database Group and the Parallel Data Laboratory. His work is also in collaboration with the Intel Science and Technology Center for Big Data.
– Non-Volatile Memory Database Management Systems. by Joy Arulraj, Georgia Institute of Technology, Andrew Pavlo, Carnegie Mellon University. Book, Morgan & Claypool Publishers, Copyright © 2019, 191 Pages.
ISBN: 9781681734842 | PDF ISBN: 9781681734859 , Hardcover ISBN: 9781681734866
– How to Build a Non-Volatile Memory Database Management System (.PDF), Joy Arulraj, Andrew Pavlo
Follow us on Twitter: @odbmsorg
“Anyone who expects to have some of their work in the cloud (e.g. just about everyone) will want to consider the offerings of the cloud platform provider in any shortlist they put together for new projects. These vendors have the resources to challenge anyone already in the market.”– Merv Adrian.
I have interviewed Merv Adrian, Research VP, Data & Analytics at Gartner. We talked about the database market, the Cloud and the 2018 Gartner Magic Quadrant for Operational Database Management Systems.
Q1. Looking Back at 2018, how has the database market changed?
Merv Adrian: At a high level, much is similar to the prior year. The DBMS market returned to double digit growth in 2017 (12.7% year over year in Gartner’s estimate) to $38.8 billion. Over 73% of that growth was attributable to two vendors: Amazon Web Services and Microsoft, reflecting the enormous shift to new spending going to cloud and hybrid-capable offerings. In 2018, the trend grew, and the erosion of share for vendors like Oracle, IBM and Teradata continued. We don’t have our 2018 data completed yet, but I suspect we will see a similar ballpark for overall growth, with the same players up and down as last year. Competition from Chinese cloud vendors, such as Alibaba Cloud and Tencent, is emerging, especially outside North America.
Q2. What most surprised you?
Merv Adrian: The strength of Hadoop. Even before the merger, both Cloudera and Hortonworks continued steady growth, with Hadoop as a cohort outpacing all other nonrelational DBMS activity from a revenue perspective. With the merger, Cloudera becomes the 7th largest vendor by revenue, and usage and intentions data suggest continued growth in the year ahead.
Q3. Is the distinction between relational and nonrelational database management still relevant?
Merv Adrian: Yes, but it’s less important than the cloud. As established vendors refresh and extend product offerings that build on their core strengths and capabilities to provide multimodel DBMS and/or broad portfolios of both, the “architecture” battle will ramp up. New disruptive players and existing cloud platform providers will have to battle established vendors where they are strong – so on the DMSA side, players like Snowflake will have more competition, and on the OPDBMS side, relational and nonrelational providers alike – such as EnterpriseDB, MongoDB, and Datastax – will battle more for a cloud foothold than a “nonrelational” one.
Specific nonrelational plays like Graph, Time Series, and ledger DBMSs will be more disruptive than the general “nonrelational” category.
Q4. Artificial intelligence is moving from sci-fi to the mainstream. What is the impact on the database market?
Merv Adrian: Vendors are struggling to make the case that much of the heavy lifting should move to their DBMS layer with in-database processing. Although it’s intuitive, it represents a different buyer base, with different needs for design, tools, expertise and operational support. They have a lot of work to do.
Q5. Recently Google announced BigQuery ML. Machine Learning in the (Cloud) Database. What are the pros and cons?
Merv Adrian: See the above answer. Google has many strong offerings in the space – putting them together coherently is as much of a challenge for them as anyone else, but they have considerable assets, a revamped executive team under the leadership of Thomas Kurian, and are entering what is likely to be a strong growth phase for their overall DBMS business. They are clearly a candidate to be included in planning and testing.
Q6. You recently published the 2018 Gartner Magic Quadrant for Operational Database Management Systems. In a nutshell, what are your main insights?
Merv Adrian: Much of that is included in the first answer above. What I didn’t say there is that the degree of disruption varies between the Operational and DMSA wings of the market, even though most of the players are the same. Most important, specialists are going to be less relevant in the big picture as the converged model of application design and multimodel DBMSs make it harder to thrive in a niche.
Q7. To qualify for inclusion in this Magic Quadrant, vendors had to support two of the following four use cases: traditional transactions, distributed variable data, operational analytical convergence and event processing or data in motion. What is the rationale behind this inclusion choice?
Merv Adrian: The rationale is to offer our clients the offerings with the broadest capabilities. We can’t cover all possibilities in depth, so we attempt to reach as many as we can within the constraints we design to map to our capacity to deliver. We call out specialists in various other research offerings such as Other Vendors to Consider, Cool Vendor, Hype Cycle and other documents, and pieces specific to categories where client inquiry makes it clear we need to have a published point of view.
Q8. How is the Cloud changing the overall database market?
Merv Adrian: Massively. In addition to functional and architectural disruption, it’s changing pricing, support, release frequency, and user skills and organizational models. The future value of data center skills, container technology, multicloud and hybrid challenges and more are hot topics.
Q9. In your Quadrant you listed Amazon Web Services, Alibaba Cloud and Google. These are not pure database vendors, strictly speaking. What role do they play in the overall Operational DBMS market?
Merv Adrian: Anyone who expects to have some of their work in the cloud (e.g. just about everyone) will want to consider the offerings of the cloud platform provider in any shortlist they put together for new projects. These vendors have the resources to challenge anyone already in the market. And their deep pockets, and the availability of open source versions of every DBMS technology type that they can use – including creating their own versions with optimizations for their stack and pre-built integrations to upstream and downstream technologies required for delivery – make them formidable.
Q10. What are the main data management challenges and opportunities in 2019?
Merv Adrian: Avoiding silver bullet solutions, sticking to sound architectural principles based on understanding real business needs, and leveraging emerging ideas without getting caught in dead end plays. Pretty much the same as always. The details change, but sound design and a focus on outcomes remain the way forward.
Qx. Anything else you wish to add?
Merv Adrian: Fasten your seat belt. It’s going to be a bumpy ride.
Merv Adrian, Research VP, Data & Analytics. Gartner
Merv Adrian is an Analyst on the Data Management team following operational DBMS, Apache Hadoop, Spark, nonrelational DBMS and adjacent technologies. Mr. Adrian also tracks the increasing impact of open source on data management software and monitors the changing requirements for data security in information platforms.
“When software reliability issues creep up in production, it’s a finger-pointing moment between suppliers and users. Usually, what’s missing is simple: information.” – Barry Morris.
“Most organisations also suffer from more continuous disruption caused by a steady stream of less dramatic issues. Intermittent software problems particularly cause a lot of user frustration and dissatisfaction.” — Dale Vile.
I have interviewed Barry Morris, CEO of Undo, and Dale Vile, Distinguished Analyst at Freeform Dynamics. The main topic of the interview is enterprise software reliability. This interview relates to recent research on the challenges and impact of troubleshooting software failures in production, conducted by Freeform Dynamics.
Q1. How often do software-related failures occur in the enterprise?
Dale Vile: When we hear the term ‘software failure’, we tend to think of major incidents that bring down a whole department or result in significant data loss. Our study suggests that this kind of thing happens around once every couple of years on average in most organisations – at least that’s what people admit to when surveyed. The research also tells us, however, that most organisations also suffer from more continuous disruption caused by a steady stream of less dramatic issues. Intermittent software problems particularly cause a lot of user frustration and dissatisfaction.
Q2. What are the common reasons why major system failures and/or incidents leading to loss of data are top of the list when it comes to the potential for damage and disruption?
Dale Vile: Software is now embedded in most aspects of most businesses. A telling observation is that over the years, the percentage of applications considered to be business critical has steadily increased. At the turn-of-the-century, it was usual for organisations to tell us that around 10% of their application portfolio was considered critical. Nowadays, it’s more likely to be 50% or more. This is why it’s so disruptive and potentially damaging when software failures occur – even relatively brief or minor ones.
Barry Morris: The study we commissioned shows that 83% of enterprise customers consider data corruption issues to be highly disruptive to their business. In the database business, that’s probably closer to 100%. Take SAP HANA, Oracle, Teradata or other data management system vendors: they have clients paying them millions of dollars per year for a reliable and predictable system. Consequences are high if the wrong row is returned, there’s a memory corruption issue, or data goes missing. These types of clients have little tolerance for that. At best, your reputation in the industry and software renewals will be on the line. At worst, you’re talking about plummeting stock prices wiping a few million off the value of your business.
Q3. What are the most important challenges to achieve software that runs reliably and predictably?
Dale Vile: It starts with software quality management in the development or engineering environment. Most of the challenges we see here are to do with adjusting testing and quality management processes to cope with modern approaches such as Agile, DevOps and Continuous Delivery. A lot of people now refer to ‘Continuous Testing’ in this context, and understandably put a lot of emphasis on automation. But even software makers are on a journey here. Our research tells us that few have it fully figured out at the moment. Beyond this, effective testing in the live environment is also essential.
The problem here, though, is that the complex and dynamic nature of today’s enterprise infrastructures makes it very hard or impossible to test every use case in every situation. And even if you could, subsequent changes to the environment, which an application team may not even be aware of, could easily interfere with the solution and cause instability or failure. There’s a lot to think about, and quality management is only the start.
Q4. What factors are influencing users’ satisfaction and confidence with respect to software?
Dale Vile: Confidence and satisfaction stem from users and business stakeholders perceiving that those responsible are working together competently and effectively to resolve issues when they occur in a timely manner. A fundamental requirement here is openness and honesty, and a willingness to take responsibility. Defensiveness, evasion and finger-pointing, however, tend to undermine confidence and satisfaction. Such behaviour can be cultural; but very often it’s more a symptom of inadequate skills, processes and/or tools within either the supplier or the customer environment. When such shortfalls in capability exist, the inevitable result is an elongated troubleshooting and resolution cycle. This is the real killer of confidence and satisfaction.
Barry Morris: When software reliability issues creep up in production, it’s a finger-pointing moment between suppliers and users. Usually, what’s missing is simple: information. Right now, to obtain that information, suppliers ask 20 questions: what did you do, how did you do it, in what environment and so on. There’s a long period of communication and diagnostics, which is frustrating and time-wasting on both sides. The supplier/user relationship at that moment of firefighting would be massively improved if there were data on the table and engineers could just get on with fixing the problem. I see data-driven defect diagnostics as the key to improving customer satisfaction.
Q5. How effective is software quality management in the enterprise?
Dale Vile: I’ll answer the question in relation to software *reliability* management, which is a function of inherent software quality, effective implementation, and competent operation and support thereafter. We generally find that each group or team tends to do reasonably well in their specific area; but challenges often exist because the various silos are disconnected. What many are lacking is good communication and mutual understanding between those involved in the software lifecycle. Lack of adequate visibility and effective feedback is also a common issue. Most organisations on both the supplier and enterprise side are working on improvement, but gaps frequently exist in these kinds of areas, which in turn impact software reliability.
Barry Morris: Despite all the processes and tools put in place in dev/test, we still see mission-critical applications being shipped with defects. Worse, they are being shipped with known defects – some of which could turn disastrous. Ticking time bombs really. Why? Because of tricky intermittent failures that no-one can get to the bottom of.
So actually, in a lot of cases, I don’t think that the software quality management practices I see are as effective as they could be.
Q6. How effective are the commonly used troubleshooting and diagnostics techniques?
Dale Vile: As mentioned above, the most common problems I see here are to do with disconnects between the various teams involved. Within the engineering environment, this is often down to developers and quality teams working in silos with inefficient handoffs and ineffective feedback mechanisms. In the enterprise context, it’s the disconnect between application teams, operations staff and even service desk personnel. Added to this, many also struggle to join the dots to figure out what’s going on when problems occur, and communicate insights back to developers so they can take appropriate action. Against this background, it’s not surprising that over 90% of both software makers and enterprises report that issues frequently go undiagnosed and come back to bite in a disruptive and often expensive manner.
Barry Morris: Sometimes, traditional troubleshooting methods like printf, logging, or core dump analysis are the right solutions if the team is confident they can isolate the issue quickly. Static and dynamic analysis tools are also good options for certain classes of failures. But in more complex situations, traditional debugging methods don’t help much. If anything, they lead you down the wrong path with false positives and become time-wasting, which leads to serious client dissatisfaction.
Q7. You wrote in your study that the big enemies of stakeholders and user satisfaction are delay and uncertainty. What remedies do exist to alleviate this?
Dale Vile: Beyond the kind of processes and tools we have mentioned…it boils down to effective communication and adequate visibility.
Barry Morris: I think that next-gen troubleshooting systems like software recording technology (such as what we offer at Undo) offer a unique solution to the problem of software reliability. Once we move away from guesswork and use data-driven insight instead, application vendors will be able to resolve the most challenging software defects faster than they have ever been able to do before. The unnecessary delays and uncertainty will be a thing of the past.
Q8. You wrote in your study that software failures are inevitable. It is what happens when they occur that really matters. Can you please explain what you mean here?
Dale Vile: No one expects perfection; not even business users and stakeholders. So provided the software isn’t wildly buggy or unstable, it mostly comes down to how well you respond when problems occur. What annoys people the most in this respect is not knowing what’s going on. Informing someone that you know what the problem is, but it’s going to take some time to fix, is much better than telling them you have no idea what’s causing their problem. Even better if you can give them a timescale for a resolution, and/or a workaround that doesn’t represent a major inconvenience. Interestingly, if you diagnose and fix a problem quickly, the research suggests that you can actually turn a software incident into a positive experience that enhances satisfaction, confidence and mutual respect.
Q9. What remedies are available for that?
Dale Vile: A big enabler here is a modern approach to diagnostics: having the tooling and the processes in place that allow you to troubleshoot effectively in a complex production environment. Traditional approaches are often undermined by the sheer number of moving parts and dependencies, so you need a way to deal with that. This is where solutions such as program execution recording and replay capability (aka software flight recording technology) can help.
Q10. You wrote in your study that switching decisions are often down to simple economics. Can you please explain what you mean?
Dale Vile: If an application is continually causing problems, the result is increased cost. At one level, this could be down to the additional resource required to support, maintain and troubleshoot software defects. Often more significant is the end-user productivity hit that stems from people not being able to do their jobs properly and efficiently.
There are then various kinds of opportunity costs, e.g. weeks spent battling unreliable software is time not being spent adding value to the business. In extreme cases, such as when customer facing systems are involved, repeated failure can lead to reputational damage, loss of customer confidence, and ultimately lost revenue and market share. It depends on the organisation and the specific application; but in every case there comes a point when the cost to the business of living with unreliable software is ultimately higher than the cost of switching.
Q11. What are the most effective solutions to software diagnostic processes?
Dale Vile: Solutions that work holistically. It’s about capturing all of the relevant events, inputs and variables, especially at execution time in the production environment; then providing actionable data and insights for engineers to facilitate rapid diagnosis and resolution.
Barry Morris: Dale is right. The most effective solutions to software failure diagnostic are those that provide full visibility and definitive data-driven insight into what your software really did before it crashed or resulted in incorrect behaviour. Software recording technology will speed up time-to-resolution by a factor of 10. But the beauty of this kind of approach is that you can now diagnose even the hardest of bugs that you couldn’t resolve before – just because a recording represents the reproducible test case you couldn’t obtain before.
Q12. What are the main conclusions of your study?
Dale Vile: In summary, in a world where software is critical to the business, applications must be reliable; otherwise damaging and costly disruption will result. With this in mind, it’s important to be able to respond quickly and effectively when problems occur. This shines a clear spotlight on diagnostics – an area in which many have clear room for improvement. New approaches and tools are required here, especially for troubleshooting in complex production environments.
The good news is that technology is emerging that can help, but at the moment we see an awareness gap. Our recommendation is therefore for anyone involved in software delivery and support to get up to speed on what’s available, e.g. from companies like Undo and others.
Qx. Anything else you wish to add?
Dale Vile: When you get right down to it, software reliability is a business issue. One of the most striking findings from the research for me is the level of willingness among enterprise customers to switch solutions and suppliers when the pain and cost of unreliable software gets too high. This should be a wake-up call for ISVs and other software makers, not just to manage product quality, but also to work proactively with customers on preventative diagnostic and remedial activity.
Barry Morris: As systems are becoming more and more complex, troubleshooting is not getting any easier…so it has to be data-driven.
With over 25 years’ experience working in enterprise software and database systems, Barry is a prodigious company builder, scaling start-ups and publicly held companies alike. He was CEO of distributed service-oriented architecture (SOA) specialists IONA Technologies between 2000 and 2003 and built the company up to $180m in revenues and a $2bn valuation.
A serial entrepreneur, Barry founded NuoDB in 2008 and most recently served as its Executive Chairman. Barry was appointed CEO in September 2018 to lead Undo’s high-growth phase.
Dale is a co-founder of Freeform Dynamics, and today runs the company.
He oversees the organisation’s industry coverage and research agenda, which tracks technology trends and developments, along with IT-related buying behaviour among mainstream enterprises, SMBs and public sector organisations.
During his 30-year career, he has worked in enterprise IT delivery with companies such as Heineken and Glaxo, and has held sales, channel management and international market development roles within major IT vendors such as SAP, Oracle, Sybase and Nortel Networks. He also spent a couple of years managing an IT reseller business for Admiral Software.
Dale has been involved in IT industry research since the year 2000 and has a strong reputation for original thinking and alternative perspectives on the latest technology trends and developments. He is a widely published author of books, reports and articles, and is an authoritative and provocative speaker.
Hosted by Prof. Zicari of ODBMS.org and featuring Undo CEO Barry Morris and Distinguished Analyst Dale Vile, Freeform Dynamics, this webinar recording covers:
– New market research – the frequency, types, and economic impact of defects on users and developers of enterprise software.
– The importance of fast diagnostics and swift remediation when problems occur in production.
– How to increase enterprise software reliability with software flight recording technology.
“The challenges, impact and solutions to troubleshooting software failures”, Freeform Dynamics. Access the full study report here (registration required).
– On Software Quality. Q&A with Alexander Boehm (SAP) and Greg Law (Undo). ODBMS.org, November 26, 2018.
Dr. Alexander Boehm is a database architect working on SAP´s HANA in-memory database management system. Greg Law is Co-founder and CTO of Undo.
Follow us on Twitter: @odbmsorg
“Perhaps less obvious is how role definitions in an organization change as scale increases. Once rare tasks that were just a small part of one team’s responsibilities become so common that they are a full-time job for someone. At that point, one either needs to create automation for the task, or a new team needs to be assembled (or hired) to perform that task full time. ” — Eric Tune
I have interviewed Eric Tune, Senior Staff Engineer at Google. We talked about Kubernetes. Eric has been a Kubernetes contributor since 1.0.
Q1. What are the main technical challenges in implementing massive-scale environments?
Eric Tune: Whether working at small or massive scale, the high-level technical goals don’t change: security, developer velocity, efficiency in use of compute resources, supportability of production environments, and so on.
As scale increases, there are some fairly obvious discontinuities, like moving from an application that fits on a single machine to one that spans multiple machines, and from a single data center or zone to multiple regions. Quite a bit has been written about this. Microservices in particular can be a good fit because they scale well to more machines and more regions.
Perhaps less obvious is how role definitions in an organization change as scale increases. Once rare tasks that were just a small part of one team’s responsibilities become so common that they are a full-time job for someone. At that point, one either needs to create automation for the task, or a new team needs to be assembled (or hired) to perform that task full time. Sometimes, it is obvious how to do this. But, when this repeats many times, one can end up with a confusing mess of automation and tickets, dragging down development velocity and confounding attempts to analyze security and debug systemic failure.
So, a key challenge is finding the right separation of responsibilities so that multiple pieces of automation and multiple human teams collaborate well. Doing so requires not only a broad view of an organization’s current processes and responsibilities around development, operations, and security, but also an understanding of which assumptions behind them are no longer valid.
Kubernetes can help here by providing automation for exactly the types of tasks that become toilsome as scale increases. Several of its creators have lived through organic growth to massive scale. Kubernetes is built from that experience, with awareness of the new roles that are needed at massive scale.
Q2. What is Kubernetes and why is it important?
Eric Tune: First, Kubernetes is one of the most popular ways to deploy applications in containers. Containers make the act of maintaining the machine & operating system a largely separate process from installing and maintaining an application instance – no more worrying about shared library or system utility version differences.
Second, it provides an abstraction over IaaS: VMs, VM images, VM types, load balancers, block storage, auto-scalers, etc. Kubernetes runs on numerous clouds, on-premises, and on a laptop. Many complex applications, such as those consisting of many microservices, can be deployed onto any Kubernetes cluster regardless of the underlying infrastructure. For an organization that may want to modernize its applications now, and move to cloud later, targeting Kubernetes means it won’t need to re-architect when it is ready to move.
Third, Kubernetes supports infrastructure-as-code (IaC). You can define complex applications, including storage, networking, and application identity, in a common configuration language, called the Kubernetes Resource Model. Unlike other IaC systems, which mostly support a “single-user” model, Kubernetes is designed for multiple users. It supports controlled delegation of responsibility from an ops team to a dev team.
Fourth, it provides an opinionated way to build distributed system control planes, and to extend the APIs and the infrastructure-as-code type system. This allows solution vendors and in-house infrastructure teams to build custom solutions that feel like they are first-class parts of Kubernetes.
Q3. Who should be using Kubernetes?
Eric Tune: If your organization runs Linux-based microservices and has explored container technology, then you are ready to try Kubernetes.
Q4. You have been a Kubernetes contributor since 1.0 (4 years). What did you work on specifically?
Eric Tune: During the first year, I worked on whatever needed to be done, including security (namespaces, service accounts, authentication and authorization, resource quota), performance, documentation, testing, API review and code review.
In those first years, people were mostly running stateless microservices on Kubernetes. In the second year, I worked to broaden the set of applications that can run on Kubernetes. I worked on the Job and CronJob APIs of Kubernetes, which support basic batch computation, and the StatefulSet API, which supports databases and other stateful applications. Additionally, I worked with the Helm project on Charts (easy-to-install applications for Kubernetes), and with the Spark open source community to get Spark running on Kubernetes.
Starting in 2017, Kubernetes interest was growing so quickly that the project maintainers could not accept a fraction of the new features that were proposed. The answer was to make Kubernetes extensible so that new features could be built “out of the core.” I worked to define the extensibility story for Kubernetes, particularly for Custom Resource Definitions (CRDs) and Webhooks. The extensibility features of Kubernetes have enabled other large projects, such as Istio and Knative, to integrate with Kubernetes with lower overhead for the Kubernetes project maintainers.
Currently, I lead teams which work on both Open Source Kubernetes and Google Cloud.
Q5. What are the main challenges of migrating several microservices to Kubernetes?
Eric Tune: Here are three challenges I see when migrating several microservices to Kubernetes, and how I recommend handling them:
- Remove Ordering Dependencies: Say microservice C depends on microservices A and B to function normally. When migrating to declarative configuration and Kubernetes, the startup order for microservices can become variable, where previously it was ordered (e.g. by a script). This can cause unexpected behaviors. For example, microservice C might log errors at a high rate or crash if A is not ready yet. A first reaction is sometimes “how can I guarantee ordering of microservice startup?” My advice is not to impose order, but to change the problematic behavior. For example, C could be changed to return some response for a request even when A and B are unreachable. This is not really a Kubernetes-specific requirement – it is a good practice for microservices, as it allows for graceful recovery from failures and for autoscaling.
- Don’t Persist Peer Network Identity: Some microservices permanently record the IP addresses of their peers at startup time, and then don’t expect them to ever change. That’s not a great match for the Kubernetes approach to networking. Instead, resolve peer addresses using their domain names and re-resolve after disconnection.
- Plan ahead for Running in Parallel: When migrating a complex set of microservices to Kubernetes, it’s typical to run the entire old environment and the new (Kubernetes) environment in parallel. Make sure you have load replay and response diffing tools to evaluate a dual environment setup.
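The first two recommendations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the service name `service-a`, the port, the response dictionaries, and the helper names `resolve_peer` and `handle_request` are all hypothetical, and the `resolver` parameter exists only so the lookup can be swapped out in a test. The one real API used is `socket.getaddrinfo` from the standard library.

```python
import socket
from typing import Callable, List


def resolve_peer(hostname: str, port: int,
                 resolver: Callable = socket.getaddrinfo) -> List[str]:
    """Look up the peer's current addresses by name.

    Call this again after every disconnect instead of caching the result
    forever, since the addresses behind a name can change.
    """
    try:
        infos = resolver(hostname, port, proto=socket.IPPROTO_TCP)
    except OSError:
        # Name not resolvable (yet). socket.gaierror is a subclass of OSError.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string.
    return sorted({info[4][0] for info in infos})


def handle_request(peer_host: str, peer_port: int,
                   resolver: Callable = socket.getaddrinfo) -> dict:
    """Hypothetical handler for service C that depends on service A.

    Instead of crashing when A is not up, it returns a degraded response,
    so startup order does not matter.
    """
    addresses = resolve_peer(peer_host, peer_port, resolver)
    if not addresses:
        return {"status": "degraded", "detail": "dependency unavailable"}
    return {"status": "ok", "peer": addresses[0]}
```

With this shape, C can start before A: early requests get the degraded response, and later requests pick up A's address as soon as its name resolves.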
Q6. How can Kubernetes scale without increasing the ops team?
Eric Tune: Kubernetes is built to respond to many types of application and infrastructure failures automatically – for example slow memory leaks in an application, or kernel panics in a virtual machine. Previously this kind of problem may have required immediate attention. With Kubernetes as the first line of defense, ops can wait for more data before taking action. This in turn supports faster rollouts, as you don’t need to soak as long if you know that slow memory leaks will be handled automatically, and you can fix by rolling forward rather than back.
Some ops teams also face multiple deployment environments, including multi-cloud, hybrid, or varying hardware in on-premises datacenters. Kubernetes hides some differences between these, reducing the number of configuration variations that are needed.
A pattern I have seen is role specialization within ops teams, which can bring efficiencies. Some members specialize in operating the Kubernetes cluster itself, what I call a “Cluster Operations” role, while others specialize in operating a set of applications (microservices). The clean separation between infrastructure and application – in particular the use of Kubernetes configuration files as a contract between the two groups – supports this separation of duties.
Finally, if you are able to choose a hosted version of Kubernetes such as Google Kubernetes Engine (GKE), then the hosted service takes on much of the Cluster Operations role. (Note: I work on GKE.)
Q7. On-premises, hybrid, or public cloud infrastructure: which do you think is better for running Kubernetes?
Eric Tune: Usually factors unrelated to Kubernetes will determine if an application needs to run on-premises, such as data sovereignty, latency concerns or an existing hardware investment. Often some applications need to be on-premises and some can move to public cloud. In this case you have a hybrid Kubernetes deployment, with one or more clusters on-premises, and one or more clusters on public cloud. For application operators and developers, the same tools can be used in all the clusters. Applications in different clusters can be configured to communicate with each other, or to be separate, as security needs dictate. Each cluster is a separate failure domain. One does not typically have a single cluster which spans on-premises and public cloud.
Q8. Kubernetes is open source. How can developers contribute?
Eric Tune: We have 150+ associated repositories that are all looking for developers (and other roles) to contribute. If you want to help but aren’t sure what you want to work on, then start with the Community ReadMe, and come to the community meetings or watch a rerun. If you think you already know what area of Kubernetes you are interested in, then start with our contributors guide, and attend the relevant Special Interest Group (SIG) meeting.
Dr. Eric Tune is a Senior Staff Engineer at Google. He leads dozens of engineers working on Kubernetes and GKE. He has been a Kubernetes contributor since 1.0. Previously at Google he worked on the Borg container orchestration system, drove company-wide compute efficiency improvements, created the Google-wide Profiling system, and helped expand the size of Google’s search index. Prior to Google, he was active in computer architecture research. He holds computer engineering degrees (PhD, MS, BS) from UCSD.
“N1QL for Analytics is the first commercial implementation of SQL++.” –Mike Carey
I have interviewed Michael Carey, Bren Professor of Information and Computer Sciences and Distinguished Professor of Computer Science at UC Irvine, where he leads the AsterixDB project, as well as a Consulting Architect at Couchbase. We talked about SQL++, the AsterixDB project, and the Couchbase N1QL for Analytics.
Q1. You are Couchbase’s Consulting Chief Architect. What are your main tasks in such a role?
Mike Carey: This came about when Couchbase began working on the effort that led to the recently released Couchbase Analytics Service, a service that was born when Ravi Mayuram (Couchbase’s Senior VP of Engineering and CTO) and I realized that Couchbase and the AsterixDB project shared a common vision regarding what future data management systems ought to look like. Rather than making me quit my day job, I was given the opportunity to participate in a consulting role and build a team within Couchbase to make the Analytics Service happen — using AsterixDB as a starting point. I guess now I’m kind of a mini-CTO for database-related issues; I primarily focus on the Analytics Service, but I also pay attention to the Query Service and the Couchbase Data Platform as a whole, especially when it comes to things like its query capabilities. I spend one day a week up at Couchbase HQ, at least most weeks. It’s really fun, and this keeps me connected to what’s happening in the “real world” outside academia.
Q2. What is SQL++ ? And what is special about it?
Mike Carey: SQL++ is a language that came out of work done by Prof. Yannis Papakonstantinou and his group at UC San Diego. Prior to SQL++, in the AsterixDB project, we had invented and implemented a full query language for semi-structured data called AQL (short for Asterix Query Language) based on a data model called ADM (short for Asterix Data Model). ADM was the result of realizing back in 2010 that JSON was coming in a pretty big way — we looked at JSON from a database data modeling perspective and added some things inspired by object databases that were missing. Most notable were the option to specify schemas, at least partially, if desired, and the ability to have multisets as well as arrays as multi-valued fields. AQL was the result of looking at XQuery, since it had been designed by a group of world experts to deal with semi-structured data, and then throwing out its “XML cruft” in order to gain a nice query language for ADM. To make AQL a bit more natural for SQL users, we also allowed some optional keyword substitutions (such as SELECT for RETURN and FROM for FOR). We had a pretty reasonable technical explanation for users as to why AQL was what it was — why it wasn’t just a SQL extension. Users listened and learned AQL, but they always seemed to wistfully sigh and continue to wish that AQL was more directly like SQL (in its syntax and not just its query power).
More or less in parallel, Yannis and friends were building a data integration system called FORWARD to integrate data of varied shapes and sizes from heterogeneous data stores. The FORWARD view of data was based on a semi-structured data model, and SQL++ was the SQL-based language framework that Yannis developed to classify the query capabilities of the stores. It also served as the integration language for FORWARD’s end users. At some point he approached us with a draft of his SQL++ framework paper, getting our attention by saying nice things about AQL relative to the other JSON query languages (:-)), and we took a look. Pretty quickly we realized that SQL++ was very much like AQL, but with a SQL-based syntax that would make those wistful AQL users much happier. Yannis did a very nice job of extending and generalizing SQL, allowing for a few differences where needed, such as where SQL had made “flat-world” or schema-based assumptions that no longer hold for JSON, and exploiting the generality of the nested data model, like adding richer support for grouping and de-mystifying grouped aggregation.
We have since “re-skinned” Apache AsterixDB to use SQL++ as the end-user query language for the system. This was actually relatively easy to do since all of the same algebra and physical operators work for both. We recently deprecated AQL altogether as an end-user language.
Q3. What is N1QL for Analytics?
Mike Carey: The Couchbase Analytics service is a component of the Couchbase Data Platform that allows users to run analytical-sized queries over their Couchbase JSON data. N1QL for Analytics is the product name for the end-user query language of Couchbase Analytics. It’s a dialect of SQL++, which itself is a language framework; the framework includes a number of choices that a SQL++ implementer gets to pin down about details like data types, missing information, supported functions, and so on. N1QL for Analytics could have been called “Couchbase SQL++”, but N1QL (non-1NF query language) is what Couchbase originally called the SQL-inspired query language for its Query service. A decision was made to keep the N1QL brand name, while adding “for Query” or “for Analytics” to more specifically identify the target service. Over time both N1QLs will be converging to the same dialect of SQL++. The bottom line is that N1QL for Analytics is the first commercial implementation of SQL++.
By the way, there’s a terrific new book available on Amazon called “SQL++ for SQL Users: A Tutorial.” It was written by Don Chamberlin, of SQL fame, for folks who want to learn more about SQL++ (from one of the world’s leading query language experts).
Q4. Is N1QL for Analytics based entirely on the SQL++ framework?
Mike Carey: Indeed it is. As I mentioned, N1QL for Analytics is really a dialect of SQL++, having chosen a particular combination of detailed settings that the framework provides options for. In the future it may gain other extensions, e.g., support for window queries, but right now, N1QL for Analytics is based entirely on the SQL++ framework.
Q5. How is new Couchbase Analytics influenced by the open-source Apache AsterixDB project?
Mike Carey: You’ve probably seen those computer ads in magazines that say “Intel Inside,” yes? In this case, the ad would say “Apache AsterixDB Inside”…
Q6. Specifically, did you re-use the Apache AsterixDB query engine? Or else?
Mike Carey: Specifically, yes. The Couchbase Data Platform, internally, is based on a software bus that the Data service (the Key/Value store service) broadcasts all data events on — and components like the Index service, Full Text service, Cross Datacenter Replication service, and others are all bus listeners. The Analytics service is a listener as well, and it manages a real-time replica of the KV data in order to make that data immediately available for analysis in a performance-isolated manner. Performance isolation is needed so that analytical queries don’t interfere with the front-end applications. Under the hood, the Analytics service is based on Apache AsterixDB — its storage facilities are used to store and manage the data, and its query engine powers the parallel query processing. The developers at Couchbase contribute their work on those components back to the Apache AsterixDB open source, and these days they’re among its most prolific committers. Couchbase Analytics also has some extensions that are only available from Couchbase — including integrated system management, cluster resizing, and a nice integrated query console — but the core plumbing is the same.
Q7. SQL does not provide an efficient solution for querying JSON or semi-structured data in JSON form. Can you explain how Couchbase Analytics analyzes data in JSON format? What is that capability useful for?
Mike Carey: Couchbase Analytics supports a JSON-based “come as you are” data model rather than requiring data to be normalized and schematized for analysis. We like to say that this gives users “NoETL for NoSQL.” You can perhaps think of it as being a data mart for Couchbase application data. The application folks think about their data naturally; if it’s nested, it’s allowed to be nested (e.g., an order object can contain a nested set of line items and a nested shipping address), and if it’s heterogeneous, it’s allowed to be heterogeneous (e.g., an electronic product can have different descriptive data than a clothing product or a furniture product). Couchbase Analytics allows data analysis on data that looks like that — data can “come as it is” and SQL++ is ready to query it in that “as is” form. You can do all the same analyses that you could do if you first designed a relational schema and wrote a collection of ETL scripts to move the data into a parallel SQL DBMS — but without having to do all that. Instead, you can now “have your data and query it too” in its original, natural, front-end JSON structure.
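To make the “come as you are” idea concrete, here is a plain-Python analogue of the kind of analysis described above. It is not SQL++ syntax and not Couchbase code; the sample documents, field names, and the `revenue_by_kind` helper are hypothetical. The point is only that nested line items are unnested in place and that missing or extra fields are tolerated rather than rejected by a schema.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical order documents: nested, and not all the same shape.
orders: List[dict] = [
    {"id": 1, "ship_to": {"city": "Irvine"}, "items": [
        {"sku": "tv-55", "kind": "electronics", "price": 500.0, "qty": 1, "voltage": 110},
        {"sku": "tee-m", "kind": "clothing", "price": 20.0, "qty": 3, "size": "M"},
    ]},
    {"id": 2, "items": [
        {"sku": "sofa", "kind": "furniture", "price": 900.0, "qty": 1},
    ]},
    {"id": 3},  # an order with no items field at all
]


def revenue_by_kind(docs: List[dict]) -> Dict[str, float]:
    """Unnest each order's line items and total revenue per product kind."""
    totals: Dict[str, float] = defaultdict(float)
    for order in docs:
        # A missing "items" field is treated as an empty collection.
        for item in order.get("items", []):
            kind = item.get("kind", "unknown")
            totals[kind] += item["price"] * item["qty"]
    return dict(totals)
```

No relational schema or ETL step was needed: the extra `voltage` and `size` fields are simply ignored by this particular analysis, and the order without items contributes nothing.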
Q8. Can you please explain the architecture behind Couchbase’s MPP engine for JSON data?
Mike Carey: Sure, that’s easy — I can pretty much just refer you to the body of literature on parallel relational data management. (For an overview, see the classic DeWitt and Gray CACM paper on parallel database systems.)
Under the hood, the query engine for Couchbase Analytics and Apache AsterixDB looks like a best-practices parallel relational query engine. It uses hash partitioning to scale out horizontally in an MPP fashion, and it uses best-practices physical operators (e.g., dynamic hash join, broadcast join, index join, parallel sort, sort-based and hash-based grouped aggregation, …) to deal gracefully with very large volumes of data. The operator set and the optimizer rules have just been extended where needed to accommodate nesting and schema optionality. Data is hash-partitioned on its primary key (the Couchbase key), with optional local secondary indexes on other fields, and queries run in parallel on all nodes in order to support linear speed-up and/or scale-up.
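A toy sketch of hash partitioning on the primary key followed by a hash-based grouped aggregation may help. It is a simplified two-phase scheme (aggregate locally per partition, then merge) written for illustration; it makes no claim about how AsterixDB or Couchbase Analytics actually implement these operators. The function names, the `"id"` key field, and the use of CRC32 as the hash are all assumptions.

```python
import zlib
from collections import Counter
from typing import Dict, List


def partition_of(key: str, num_nodes: int) -> int:
    """Map a primary key to a node using a stable hash.

    Python's built-in hash() for strings is salted per process,
    so CRC32 is used here to keep the placement repeatable.
    """
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def hash_partition(docs: List[dict], num_nodes: int) -> List[List[dict]]:
    """Distribute documents across nodes by hashing their primary key."""
    partitions: List[List[dict]] = [[] for _ in range(num_nodes)]
    for doc in docs:
        partitions[partition_of(doc["id"], num_nodes)].append(doc)
    return partitions


def local_count_by(docs: List[dict], field: str) -> Counter:
    """Phase 1: each node builds a hash table of group -> count.

    Documents lacking the field are grouped under "MISSING",
    reflecting that the schema is optional.
    """
    return Counter(doc.get(field, "MISSING") for doc in docs)


def global_count_by(partitions: List[List[dict]], field: str) -> Dict[str, int]:
    """Phase 2: merge the per-node partial results.

    The loop is sequential here; in an MPP engine each partition
    would be processed on its own node in parallel.
    """
    total: Counter = Counter()
    for part in partitions:
        total.update(local_count_by(part, field))
    return dict(total)
```

Because each partition is aggregated independently before the merge, adding nodes shrinks the per-node work, which is the intuition behind the speed-up and scale-up mentioned above.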
Q9. Do you think other database vendors will implement their own version/dialect of SQL++ ?
Mike Carey: Indeed I do. It’s a really nice language, and it makes a ton of sense as the “right” answer to querying the more general data models that one gets when one lets down their relational guard. It’s a whole lot cleaner than the “JSON as a column type” approach to adding JSON support to traditional RDBMSs in my opinion.
Qx. Anything else you wish to add?
Mike Carey: I teach the “Introduction to Data Management” class at UC Irvine as part of my day job. Our class sizes these days are exceeding 400 students per quarter — database systems are clearly not dead in students’ eyes! For the past few years I’ve been spending the last bit of the class on “NoSQL technology” — which to me means “no schema required” — and I’ve used SQL++ for the associated hands-on homework assignment. It’s been great to see how quickly and easily (relatively new!) SQL users can get their heads around the more relaxed data model and the query power of SQL++. Some faculty friends at the University of Washington have done this as well, and their experience there has been similar. I would like to encourage others to do the same! With SQL++, richer data no longer has to mean writing get/put programs or effectively hand-writing query plans, so it’s a very nice platform for teaching future generations about the emerging NoSQL world and its concepts and benefits.
Michael Carey received his B.S. and M.S. degrees from Carnegie-Mellon University and his Ph.D. from the University of California, Berkeley. He is currently a Bren Professor of Information and Computer Sciences and Distinguished Professor of Computer Science at UC Irvine, where he leads the AsterixDB project, as well as a Consulting Architect at Couchbase, Inc. Before joining UCI in 2008, he worked at BEA Systems for seven years and led the development of their AquaLogic Data Services Platform product for virtual data integration. He also spent a dozen years at the University of Wisconsin-Madison, five years at the IBM Almaden Research Center working on object-relational databases, and a year and a half at e-commerce platform startup Propel Software during the infamous 2000-2001 Internet bubble. He is an ACM Fellow, an IEEE Fellow, a member of the National Academy of Engineering, and a recipient of the ACM SIGMOD E.F. Codd Innovations Award. His current interests center around data-intensive computing and scalable data management (a.k.a. Big Data).
SQL++ For SQL Users: A Tutorial, Don Chamberlin, September 2018 (Free Book 143 pages)
“Learned indexes are able to learn from and benefit from patterns in the data and the workload. Most previous data structures were not designed to optimize for a particular distribution of data.” –Alex Beutel
I have interviewed Alex Beutel, Senior Research Scientist in the Google Brain SIR team. We talked about “Learned Index Structures” (data structures thought of as performing prediction tasks), how they differ from traditional index structures, and their main benefits.
Q1. What is your role at Google?
Alex Beutel: I’m a research scientist within Google AI, specifically the Google Brain team. I focus on a mixture of recommender systems, machine learning fairness, and machine learning for systems. While these may sound quite different, I think they are all areas of machine learning application with unique, rich challenges and opportunities deriving from understanding the data distribution.
Q2. You recently published a paper on so-called Learned Index Structures. In the paper, you stated that indexes (e.g. B-Tree index, hash index, bitmap index) can be replaced with other types of models, including deep-learning models, which you term learned indexes. Why do you want to replace well-known index structures?
Alex Beutel: Traditional index structures are fundamental to databases and computer science in general, so they are important to study and have been deeply studied for a long time. I think whenever you can find a new perspective on such a well-studied area, it is worth exploring. In this case, we challenge the assumptions in data structure design by jumping from the more traditional discrete structures to continuous, stochastic components that can make mistakes. However, by taking this perspective, we find that we now have at our disposal a whole breadth of tools from the machine learning, data mining, and statistics communities that we can bring to bear on databases and more broadly data systems problems. Personally, rethinking these fundamental tasks with this new lens has been extremely exciting and fun.
Q3. What is the key idea for learned indexes?
Alex Beutel: The key idea for learned indexes is that many data structures can be thought of as performing prediction tasks, so rather than building a discrete structure, we can use machine learning to build a model for the task.
Q4. What are the main benefits of learned indexes? Which applications could benefit from such learned indexes?
Alex Beutel: I want to separate what are the possible benefits and when or why can learned indexes realize those benefits. At a high level, using machine learned models lets us build data structures from a new broader set of tools. We have found that depending on the learned index configuration, we are able to get improvements in latency (speed), memory usage, and computational cost of running the index structure. Depending on the application, we can tune the learned index to get more savings in one or more of these dimensions. For example, in the paper we propose a hierarchical model structure, and we show that we can build a larger hierarchy and use more memory to get an even faster lookup or use a much smaller hierarchy to save memory and still not make the system too slow.
Why and when we are able to realize these benefits is a much more complicated question. One of the big advantages is that machine learning models make use of floating point operations which can be more easily parallelized with modern hardware, and with the growth of GPUs and TPUs, we may be able to build bigger and more accurate models without increasing latency.
Another aspect that I find exciting is that learned indexes are able to learn from and benefit from patterns in the data and the workload. Most previous data structures were not designed to optimize for a particular distribution of data. Rather, they often assume a worst-case distribution or ignore it entirely. But data structures aren’t being used in the abstract — they are being used on real data, which as we know from other areas of research, have many significant patterns. So one could ask, how can we make use of the patterns in the data being stored or processed to improve the efficiency of systems? ML models are extremely effective in adapting to those varying data distributions.
I think any application that is processing large amounts of data stands to benefit from taking this perspective. We focused on index structures in databases, but we have already seen multiple papers being published applying this perspective to new systems.
Q5. How can learned indexes learn the sort order or structure of lookup keys and use this signal to predict the position or existence of records?
Alex Beutel: B-Trees are already predicting the positions of records: they are built to give the block in which a record lies, and they do this just by processing the key. Learned indexes can do the same thing where they predict approximately where the record is. For example, if the keys are all even integers from 100 to 1000 (that is, key=100 has position 0, key=102 has position 1, key=104 has position 2, etc.), then the model f(key) = (key - 100)/2 will perfectly map from keys to positions. If the data aren’t exactly the even integers but on average we see one key every 2 spots (for example, keys: 100, 101, 105, 106, 109, 110, …) then f(key) above is still a pretty good model and for any key the model will find almost exactly the right position. Even if the data follow a more complicated pattern, we can learn a model to understand the distribution. It turns out that this is learning the cumulative distribution function, which has long been studied in statistics. This is exciting in that for those examples above, lookups become a constant-time operation, rather than growing with the size of the data; and more generally, this could change how we think about the complexity of these functions.
One challenge is that we can’t just return the approximate position; these data structures need to return the actual record being searched for. Typically, B-Trees will then scan through the block where the key is to find the exact right position. Likewise, when using a learned index, the model may not give the exact right position, but instead a close by one.
To return exactly the correct record, we search near the predicted position to find it; and the more accurate the model is, the faster the search will be.
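The predict-then-search idea can be sketched as follows, using the model f(key) = (key - 100)/2 from the example. This is a minimal illustration, not the implementation from the paper: the class name `LearnedIndex`, the choice to record the worst prediction error at build time, and the bounded binary search are assumptions made for this sketch, which also assumes a non-empty list of sorted keys.

```python
from bisect import bisect_left
from typing import Callable, List, Optional


class LearnedIndex:
    """Sorted keys plus a model that predicts a key's position."""

    def __init__(self, sorted_keys: List[int], model: Callable[[int], float]):
        self.keys = sorted_keys
        self.model = model
        # Worst-case distance between predicted and true position,
        # measured over the stored keys when the index is built.
        self.max_error = max(
            abs(self._predict(key) - pos) for pos, key in enumerate(sorted_keys)
        )

    def _predict(self, key: int) -> int:
        """Model output rounded and clamped to a valid position."""
        pos = int(round(self.model(key)))
        return min(max(pos, 0), len(self.keys) - 1)

    def lookup(self, key: int) -> Optional[int]:
        """Return the position of key, or None if it is not stored."""
        guess = self._predict(key)
        # Any stored key lies within max_error of its prediction,
        # so only this small window has to be searched.
        lo = max(guess - self.max_error, 0)
        hi = min(guess + self.max_error + 1, len(self.keys))
        i = bisect_left(self.keys, key, lo, hi)
        if i < hi and self.keys[i] == key:
            return i
        return None


def f(key: int) -> float:
    # The model from the text.
    return (key - 100) / 2


even_index = LearnedIndex(list(range(100, 1001, 2)), f)
noisy_index = LearnedIndex([100, 101, 105, 106, 109, 110], f)
```

For the even integers the model is exact, so the search window contains a single slot. For the noisier keys the window grows by the recorded error, which is the sense in which a more accurate model gives a faster search.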
Knowing if a record exists is quite different. Traditionally, Bloom filters have been used for this task; given a key, the Bloom filter will tell you if the key exists in the dataset, and if the key isn’t in the dataset the Bloom filter will mistakenly tell you it is with some small probability, called the false positive rate (FPR). This is a binary prediction problem: given a key, predict whether it’s in the dataset. Unlike traditional Bloom filters, we learn a model that tries to learn if there is some systematic difference between keys in the dataset and other questions (queries) asked of the Bloom filter. That is, if the dataset has all positive integers less than 1000, there is a trivial model g(key) := 1000 > key > 0 that can perfectly answer any query. If the dataset has all positive integers less than 1000 except for 517 then this is still a pretty good model with very few mistakes (FPR = 0.1%). If the dataset is malware URLs, these patterns are less obvious, but in fact lots of researchers have been studying what patterns are indicative of malware URLs (and distinguish them from normal webpage URLs), and we can build models to make use of these systematic differences.
From an accuracy perspective, Bloom filters have stringent requirements about no false negatives and low FPR, and so we build systems that combine machine learning classifiers and traditional Bloom filters to meet these requirements.
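One way to picture the combination of a classifier and a traditional Bloom filter is the sketch below. It is a minimal illustration, not the system described in the paper: the class names `BloomFilter` and `LearnedBloomFilter`, the hashing scheme, and the filter sizes are all assumptions chosen for brevity. The model answers first, and a small backup Bloom filter stores only the keys the model would miss, which is what rules out false negatives.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter backed by a byte array."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


class LearnedBloomFilter:
    """Model first, backup Bloom filter second."""

    def __init__(self, model, keys, num_bits=1024, num_hashes=3):
        self.model = model
        self.backup = BloomFilter(num_bits, num_hashes)
        for key in keys:
            if not model(key):          # the model would wrongly reject this key
                self.backup.add(key)

    def __contains__(self, key):
        return bool(self.model(key)) or key in self.backup


# Dataset: positive integers below 1000, plus two keys the model cannot explain.
keys = list(range(1, 1000)) + [5000, 7321]
learned_filter = LearnedBloomFilter(lambda key: 0 < key < 1000, keys)
```

Because the backup filter only has to hold the model's false negatives (two keys here instead of 1001), it can be much smaller than a Bloom filter over the whole dataset, which is where the memory saving comes from.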
Q6. Under which conditions do learned indexes outperform traditional index structures?
Alex Beutel: As mentioned above, I think there are a few key conditions for learned indexes being beneficial. First and foremost, it depends on the patterns of the data and workload being processed. In the range query case (B-Trees), if the data follow a linear pattern then learned indexes will easily excel; more complex data distributions may require more complex model structures which may not be okay for the application at hand. For existence indexes, the success of the model depends on how easily it can distinguish between keys in the dataset and real queries to the Bloom filter; distinguishing between even and odd integers is easy, but if the dataset is entirely random keys this will be very difficult.
In addition to making use of patterns in the data and workload, learned indexes depend on the environment they are being used in. For example, we study in-memory databases in our paper, and more recently we have found that disk-based systems require new techniques. For our learned Bloom filters we assume that saving memory is most important, but if there is a strict latency requirement, then the model design may need to change. If GPUs and TPUs were readily available, the learned index design would likely change dramatically.
Q7. What are the main challenges in designing learned index structures?
Alex Beutel: I think there are interesting challenges both in system design and in machine learning.
For systems, machine learned models provide much looser guarantees about accuracy than traditional data structures.
As a result, making use of ML models’ noisy predictions requires building systems that are robust to those errors.
In the B-Tree case we studied different local search strategies. For existence indexes we coupled the model with a Bloom filter to guarantee no false negatives. Interestingly, new research by Michael Mitzenmacher has shown that sandwiching the model between two Bloom filters does even better. I believe there are lots of interesting questions about (a) what is the right prediction task for machine learning models when incorporated in a system and (b) how should these models be safely integrated in the system.
On the machine learning side there are numerous challenges in building models that match the needs of these systems.
For example, most machine learning models are expected to execute on the order of milliseconds or slower; for learned indexes we often need the model to execute thousands of times faster. Tim Kraska, the first author on our paper, did a lot of optimizations for very fast execution of the model. In most of machine learning, overfitting is bad; for learned indexes that is not true, so how should that change model design? How do I build model families that can trade off memory and latency?
How do I build models that match the hardware they are running on, from parallelization to caching effects?
While these are challenges to making learned indexes work, they also present opportunities for interesting research from different communities working together.
Alex Beutel: We found some really great benefits. Depending on the use case learned indexes were able to be up to 3 times faster and in some cases use only 1% of the memory of a traditional B-Tree.
Q9. What is the implication of replacing core components of a data management system through learned models for future systems designs?
Alex Beutel: As I mentioned above, there have already been multiple papers applying these ideas to new core components, and we have been studying how to extend these ideas to a wide range of areas from indexing multidimensional data to sorting algorithms. We have seen similar opportunities and excitement in systems beyond databases, such as research for scheduling and caching.
My hope is that more folks building data management systems, and really any system that is processing data, think about if there are patterns in the data and workload the system is processing. Because there most likely are patterns, and I believe building new systems that can be customized and optimized for those patterns will greatly improve the systems’ efficiency.
Alex Beutel is a Senior Research Scientist in the Google Brain SIR team working on neural recommendation, fairness in machine learning, and ML for Systems. He received his Ph.D. in 2016 from Carnegie Mellon University’s Computer Science Department, and previously received his B.S. from Duke University in computer science and physics. His Ph.D. thesis on large-scale user behavior modeling, covering recommender systems, fraud detection, and scalable machine learning, was given the SIGKDD 2017 Doctoral Dissertation Award Runner-Up. He received the Best Paper Award at KDD 2016 and ACM GIS 2010, was a finalist for best paper in KDD 2014 and ASONAM 2012, and was awarded the Facebook Fellowship in 2013 and the NSF Graduate Research Fellowship in 2011. More details can be found at alexbeutel.com.
 Michael Mitzenmacher. A Model for Learned Bloom Filters, and Optimizing by Sandwiching. NeurIPS, 2018.
Stanford Seminar – The Case for Learned Index Structures. EE380: Computer Systems. Speakers: Alex Beutel and Ed Chi, Google. Published on Oct 18, 2018.
On Data, Exploratory Analysis, and R. Q&A with Ronald K. Pearson, ODBMS.org, April 13, 2018
On Apache Kafka®. Q&A with Gwen Shapira, ODBMS.org, March 26, 2018.
How to make Artificial Intelligence fair, transparent and accountable, ODBMS.org, January 27, 2018
Follow us on Twitter: @odbmsorg
“The goal of in-database machine learning is to bring popular machine learning algorithms and advanced analytical functions directly to the data, where it most commonly resides – either in a data warehouse or a data lake.” — Waqas Dhillon.
I have interviewed Waqas Dhillon, Product Manager – Machine Learning at Vertica. We talked about in-database machine learning and the new machine learning features of Vertica.
Q1. What is in-database machine learning?
Waqas Dhillon: The goal of in-database machine learning is to bring popular machine learning algorithms and advanced analytical functions directly to the data, where it most commonly resides – either in a data warehouse or a data lake. While machine learning is a common mechanism used to develop insights across a variety of use cases, the growing volume of data has increased the complexity of building predictive models, since few tools are capable of processing these massive datasets. As a result, most organizations are down-sampling, which can impact the accuracy of machine learning models and create unnecessary steps in the predictive analytics process.
In-database machine learning changes the scale and speed at which these machine learning algorithms can be trained and deployed, removing common barriers and accelerating time to insight on predictive analytics projects. To that end, we’ve built machine learning and data preparation functions natively into Vertica, so the computational processes can be parallelized across nodes, scaling out to address performance requirements, larger data volumes, and many concurrent users. Vertica in-database machine learning aims to eliminate the need to download and install separate packages, purchase third-party tools, or move data out of the database. Unlike traditional statistical analysis tools, we’ve given users the ability to archive and manage machine learning models inside the database, so they can train, deploy, and manage their models with a few simple lines of SQL.
Q2. What problem domains are most suitable for using Predictive Analytics?
Waqas Dhillon: Most organizations are realizing the role that predictive analytics can play in addressing certain business challenges to create a competitive advantage. While simple business intelligence and reporting have played a key role in understanding how an organization operates and where improvements can be made, the volume of data available combined with the power of machine learning is driving the adoption of forward-looking, predictive analytics projects. This adoption is compounded by an increase in end-user/customer demand for applications with embedded intelligence that no longer just identify ‘what happened’ but predict ‘what will happen’.
In general, machine learning models using linear regression, logistic regression, naïve Bayes, etc. are better suited for problem domains involving structured data analysis. Beyond this, the most suitable domains for using predictive analytics are driven by the use cases and business applications that drive new revenue opportunities, increase operational efficiencies, or both.
Q3. Can you give us some examples?
Waqas Dhillon: In-database machine learning and the use of predictive analytics can drive tangible business benefits across a broad range of industries. Below are some of the most common industries and use cases where I’ve seen an adoption of predictive analytics capabilities:
• Financial services organizations can discover fraud, detect investment opportunities, identify clients with high-risk profiles, or determine the probability of an applicant defaulting on a loan.
• Communication service providers can leverage a variety of network probe and sensor data to analyze network performance, predict capacity constraints, and ensure quality service delivery to end customers.
• Marketing and sales organizations can use machine learning to analyze buying patterns, segment customers, personalize the shopping experience, and predict which targeted marketing campaigns will be most effective.
• Oil and gas organizations can leverage machine learning to analyze minerals to find new energy sources, streamline oil distribution for increased efficiency and cost effectiveness, or predict mechanical or sensor failures for proactive maintenance.
• Transportation organizations can analyze trends and identify patterns that can be used to enhance customer service, optimize routes, and increase profitability.
Q4. How do you handle machine learning on Big Data using an in-database approach?
Waqas Dhillon: The Vertica Analytics Platform was built from the start specifically for Big Data analytics and other analytical workloads where speed, scalability, and simplicity are crucial requirements.
Since we had spent years building out such a high-performance, scalable SQL engine, we started to ask ourselves, “Why should we limit the scope of our platform to standard SQL functions and descriptive analytics? Why not extend the power of Vertica to include more advanced analytics and machine learning functions?”
While some solutions might be limited by inherent architectural problems, such as lacking a shared-nothing cluster architecture suitable for big data analytics, Vertica has an incredible engine for performing analytics on large-scale data. That’s why we felt it was such an obvious choice to build machine learning functions natively into the platform. By building these machine learning capabilities on top of a foundation that already provides a tested, reliable distributed architecture and columnar compression, customers can now leverage these core features for advanced and predictive analytics use cases.
In Vertica, we have implemented all in-database algorithms from scratch to run in parallel across multiple nodes in a cluster. Using parallel execution for model training, as well as scoring, not only results in extremely fast performance but also extends the capability of these algorithms to run on much larger datasets in comparison to traditional machine learning tools.
Using Vertica for machine learning provides another great advantage born from the fact that the computation engine and data storage management system are combined – this combination eliminates the need to move data between a database and a statistical analysis tool. You can build, share and deploy your machine learning pipelines in-place, where the data lives. This is a very important consideration when working with Big Data since it’s not just difficult, but sometimes outright impossible to move data at that scale between different tools.
Q5. How does Vertica support the machine learning process? Can you give some examples?
Waqas Dhillon: Vertica supports the entire machine learning workflow from data exploration and preparation to model deployment.
Users can explore their data using native database functions. As an analytics database, Vertica includes a large number of functions to support data exploration, and many more have recently been added to the machine learning library. Users can also prepare data with functions for normalization, outlier detection, sampling, imbalanced data processing, missing value imputation and many other native SQL and extended functions. They can also train and test advanced machine learning models like random forests and support vector machines on very large data sets.
There are multiple model evaluation metrics like ROC, lift-table, AUC, etc. which can be used to assess your existing trained models. Any models built within Vertica can be stored inside the platform, shared with other users using the same instance of Vertica, or exported out to other Vertica databases. This can be quite useful while training models in test clusters and then moving them to production clusters. Training and managing models inside the database also reduces the overhead needed to transfer data into another system for analysis, along with the maintenance of that system.
Q6. How did you take advantage of a Massively Parallel Processing (MPP) Architecture, when implementing in-database machine learning in Vertica?
Waqas Dhillon: Vertica’s MPP architecture provided a great foundation on top of which we built a range of in-database machine learning functions, from data ingestion to model storage and scoring capabilities.
For data ingestion, there was already an extremely fast copy command used to move data in parallel into Vertica, where it’s stored on multiple nodes in a cluster. When we were writing our distributed machine learning algorithms, we could already rely on the data distribution across various nodes and instead focus our engineering efforts on the computation logic used to parallelize model training. We have also used a built-in distributed file system to maintain intermediate results as well as the final, trained models. These machine learning functions are mainly developed using Vertica’s C++ SDK, and are executed with Vertica’s distributed execution engine.
To give an example of a machine learning algorithm used natively within Vertica leveraging the MPP architecture, let’s look at Random Forests. Random Forests is a popular algorithm among data scientists for training predictive models that can be applied to both regression and classification problems. It provides good prediction performance, and is quite robust against overfitting. The running time and memory footprint of this algorithm in the R randomForest package or Python’s scikit-learn can be a major hurdle when working with large data volumes.
Our distributed implementation of Random Forest overcomes these obstacles. Model training is distributed across multiple nodes in the cluster, with multiple trees possibly being trained on the various nodes, and the results are then combined to provide a classification model. This model can then be used to perform scoring in parallel on data that might be distributed across multiple nodes (possibly hundreds) in a cluster.
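The general pattern of training trees where each data partition lives and then combining them can be sketched as follows. This is not Vertica’s implementation: it is a single-process Python illustration in which each list in `partitions` stands in for the rows stored on one node, the trees are one-split stumps rather than full decision trees, and the function names (`train_stump`, `train_forest`, `predict_forest`) are hypothetical.

```python
import random
from collections import Counter


def train_stump(rows, rng):
    """Fit a one-split tree on a bootstrap sample of (features, label) rows."""
    sample = [rng.choice(rows) for _ in rows]
    feature = rng.randrange(len(sample[0][0]))   # random feature choice
    threshold = sum(x[feature] for x, _ in sample) / len(sample)
    left = Counter(y for x, y in sample if x[feature] <= threshold)
    right = Counter(y for x, y in sample if x[feature] > threshold)
    fallback = Counter(y for _, y in sample).most_common(1)[0][0]
    left_label = left.most_common(1)[0][0] if left else fallback
    right_label = right.most_common(1)[0][0] if right else fallback
    return feature, threshold, left_label, right_label


def predict_stump(stump, x):
    feature, threshold, left_label, right_label = stump
    return left_label if x[feature] <= threshold else right_label


def train_forest(partitions, trees_per_node=5, seed=0):
    """Train trees on each partition independently, then pool them."""
    forest = []
    for node_id, rows in enumerate(partitions):
        rng = random.Random(seed + node_id)      # each simulated node has its own RNG
        forest.extend(train_stump(rows, rng) for _ in range(trees_per_node))
    return forest


def predict_forest(forest, x):
    """Score one row by majority vote over all trees."""
    votes = Counter(predict_stump(stump, x) for stump in forest)
    return votes.most_common(1)[0][0]


# Three simulated nodes, each holding rows of the form ((f0, f1), label).
node_rows = [((v, 100 - v), 0) for v in range(10)] + \
            [((v, 100 - v), 1) for v in range(90, 100)]
forest = train_forest([list(node_rows) for _ in range(3)])
```

The point of the sketch is the shape of the computation: training needs no data movement between partitions, and only the small trained trees are gathered. Scoring is independent per row, so it can run in parallel wherever the rows are stored.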
Q7. You offer SQL-based machine learning functions. Is this an extension to SQL? Can you give us some examples?
Waqas Dhillon: Although Vertica follows the SQL standard, it offers multiple SQL extensions such as windowing functions and pattern matching. In-database machine learning algorithms are now part of the database’s analytical toolset, allowing users to write SQL-like commands to run machine learning processes. They go beyond other, simpler SQL extensions users will find within Vertica.
For example, a simpler SQL extension in Vertica would be event series pattern matching. Event patterns are simply a series of events that occur in an order, or pattern that you specify. Vertica evaluates each row in your table, looking for the event you define. When Vertica finds a sequence of rows that conform to your pattern among a dataset of possibly hundreds of billions of rows or more, it outputs the rows that contribute to the match.
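To make the idea of event series pattern matching concrete, here is a small Python sketch of the logic described above. It does not reproduce Vertica’s SQL syntax or its engine: the function `match_event_pattern` and the clickstream rows are hypothetical, and the sketch handles only a fixed sequence of events over consecutive rows.

```python
def match_event_pattern(rows, events, pattern):
    """Return every run of consecutive rows whose events follow `pattern`.

    rows:    ordered list of dict rows
    events:  maps an event name to a predicate over a row
    pattern: list of event names that must occur in that order
    """
    def label(row):
        # A row is tagged with the first event whose predicate it satisfies.
        for name, predicate in events.items():
            if predicate(row):
                return name
        return None

    labels = [label(row) for row in rows]
    width = len(pattern)
    matches = []
    i = 0
    while i + width <= len(rows):
        if labels[i:i + width] == pattern:
            matches.append(rows[i:i + width])    # rows that contribute to the match
            i += width                           # keep matches non-overlapping
        else:
            i += 1
    return matches


clicks = [{"page": p} for p in
          ["home", "product", "checkout", "home", "search", "product", "checkout"]]
events = {
    "enter": lambda row: row["page"] == "home",
    "browse": lambda row: row["page"] == "product",
    "purchase": lambda row: row["page"] == "checkout",
}
matches = match_event_pattern(clicks, events, ["browse", "purchase"])
```

Here `matches` holds two runs of rows, each a product view immediately followed by a checkout.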
An example of a SQL extension for machine learning would be support vector machines (SVM). SVM is a very powerful algorithm that can be applied to large data sets for both classification and regression problems. For instance, an SVM model can be trained to predict the sales revenue of an e-commerce platform. There are many other extended SQL functions in Vertica as well to support a typical machine learning workflow from data preparation to model deployment.
Q8. What are the common barriers to Applying Machine Learning at Scale?
Waqas Dhillon: There are several challenges when it comes to applying machine learning to massive volumes of data. Predictive analytics can be complex, especially when big data is added to the mix. Since larger data sets yield more accurate results, high-performance, distributed, and parallel processing is required to obtain insights at a reasonable speed suitable for today’s business.
Traditional machine learning tools require data scientists to build and tune models using only small subsets of data (called down-sampling) and move data across different databases and tools, often resulting in inaccuracies, delays, increased costs, and slower access to critical insights:
• Slower development: Delays in moving large volumes of data between systems increase the amount of time data scientists spend creating predictive analytics models, which delays time-to-value.
• Inaccurate predictions: Since large data sets cannot be processed due to memory and computational limitations with traditional methods, only a subset of the data is analyzed, reducing the accuracy of subsequent insights and putting at risk any business decisions based on these insights.
• Delayed deployment: Owing to complex processes, deploying predictive models into production is often slow and tedious, jeopardizing the success of big data initiatives.
• Increased costs: Additional hardware, software tools, and administrator and developer resources are required for moving data, building duplicate predictive models, and running them on multiple platforms to obtain the desired results.
• Model management: Archiving and managing the machine learning models is a challenge when using most of the data science tools as they usually lack a mechanism for model management.
Q9. How do you overcome such barriers in Vertica?
Waqas Dhillon: Capable of storing large amounts of diverse data while also providing key built-in machine learning algorithms, Vertica eliminates or minimizes many of these barriers. Built from the ground up to handle massive volumes of data, Vertica is designed specifically to address the challenges of big data analytics using a balanced, distributed, compressed columnar paradigm.
Massively parallel processing enables data to be handled at petabyte scale for your most demanding use cases. Column store capabilities provide data compression, reducing big data analytics query times from hours to minutes or minutes to seconds, compared to legacy technologies. In addition, as a full-featured analytics system, Vertica provides advanced SQL-based analytics including pattern matching, geospatial analytics and many more capabilities.
As an optimized platform enabling advanced predictive modeling to be run from within the database and across large data sets, Vertica eliminates the need for data duplication and processing on alternative platforms—typically requiring multi-vendor offerings—that add complexity and cost. Now that same speed, scale, and performance used for SQL-based analytics can be applied to machine learning algorithms, with both running on a single system for additional simplification and cost savings.
Waqas is the product management lead for machine learning with Vertica. In his current role, he drives the strategy and implementation of advanced analytics and machine learning features in the Vertica MPP platform. Waqas holds a bachelor’s degree in computer software engineering from NUST and a master’s degree in management from Harvard University.
Prior to his current role, Waqas worked in multiple positions where he applied data analytics and machine learning to consumer research and revenue growth for companies in the consumer packaged goods and telecommunications industries.
– Vertica in-database machine learning: product page.
– Vertica in-database machine learning: full documentation.
– Try a version of Vertica for free
– On using AI and Data Analytics in Pharmaceutical Research. Interview with Bryn Roberts ODBMS Industry Watch, Published on 2018-09-10
– On AI and Data Technology Innovation in the Rail Industry. Interview with Gerhard Kress ODBMS Industry Watch, Published on 2018-07-31
– On Artificial Intelligence, Machine Learning, and Deep Learning. Interview with Pedro Domingos ODBMS Industry Watch, Published on 2018-06-18
Are computer system designers (i.e. Software Developers, Software Engineers, Data Scientists, Data Engineers, etc.) the ones who will decide what the impact of these technologies is and whether to replace or augment humans in society?
Big Data, AI and Intelligent systems are becoming sophisticated tools in the hands of a variety of stakeholders, including political leaders.
Some AI applications may raise new ethical and legal questions, for example related to liability or potentially biased decision-making.
I recently gave a talk at UC Berkeley on the ethical and societal implications of Big Data and AI, and on what designers of intelligent systems can do to take responsibility, since this is not a matter only for policy makers and lawyers.
You can find a copy of the presentation here:
I am interested to hear from you and receive your feedback.