“Cameras are now everywhere. Large-scale video processing is a grand challenge representing an important frontier for analytics, what with videos from factory floors, traffic intersections, police vehicles, and retail shops. It’s the golden era for computer vision, AI, and machine learning – it’s a great time now to extract value from videos to impact science, society, and business!” — Ganesh Ananthanarayanan
I have interviewed Ganesh Ananthanarayanan. We talked about his projects at Microsoft Research.
Q1. What is your role at Microsoft Research?
Ganesh Ananthanarayanan: I am a Researcher at Microsoft Research. Microsoft Research is a research wing within Microsoft, and my role is to watch for key technology trends and work on large-scale networked systems.
Q2. Your current research focus is to democratize video analytics. What is it?
Ganesh Ananthanarayanan: Cameras are now everywhere. Large-scale video processing is a grand challenge representing an important frontier for analytics, what with videos from factory floors, traffic intersections, police vehicles, and retail shops. It’s the golden era for computer vision, AI, and machine learning – it’s a great time now to extract value from videos to impact science, society, and business!
Project Rocket’s goal is to democratize video analytics: build a system for real-time, low-cost, accurate analysis of live videos. This system will work across a geo-distributed hierarchy of intelligent edges and large clouds, with the ultimate goal of making it easy and affordable for anyone with a camera stream to benefit from video analytics. Easy in the sense that any non-expert in AI should be able to use video analytics and derive value. Affordable because the latest advances in CV are still very resource-intensive and expensive to use.
Q3. What are the main technical challenges of large-scale video processing?
Ganesh Ananthanarayanan: In the rapidly growing “Internet of Things” domain, cameras are the most challenging of “things” in terms of data volume, (vision) processing algorithms, response latencies, and security sensitivities. They dwarf other sensors in data sizes and analytics costs, and analyzing videos will be a key workload in the IoT space. Consequently, we believe that large-scale video analytics is a grand challenge for the research community, representing an important and exciting frontier for big data systems.
Unlike text or numeric processing, videos require high bandwidth (e.g., up to 5 Mbps for HD streams), fast CPUs and GPUs, richer query semantics, and tight security guarantees. Our goal is to build and deploy a highly efficient distributed video analytics system. This will entail new research on (1) building a scalable, reliable and secure systems framework for capturing and processing video data from geographically distributed cameras; (2) efficient computer vision algorithms for detecting objects, performing analytics and issuing alerts on streaming video; and (3) efficient monitoring and management of computational and storage resources over a hybrid cloud computing infrastructure by reducing data movement, balancing loads over multiple cloud instances, and enhancing data-level parallelism.
Q4. What are the requirements posed by video analytics queries for systems such as IoT and edge computing?
Ganesh Ananthanarayanan: Live video analytics poses the following stringent requirements:
1) Latency: Applications require processing the video at very low latency because the output of the analytics is used to interact with humans (such as in augmented reality scenarios) or to actuate some other system (such as intersection traffic lights).
2) Bandwidth: High-definition video requires large bandwidth (5 Mbps, or even 25 Mbps for 4K video), and streaming a large number of video feeds directly to the cloud might be infeasible. When cameras are connected wirelessly, such as inside a car, the available uplink bandwidth is very limited.
3) Provisioning: Using compute at the cameras allows for correspondingly lower provisioning (or usage) in the cloud. Also, uninteresting parts of the video can be filtered out, for example, using motion-detection techniques, thus dramatically reducing the bandwidth that needs to be provisioned.
Besides low latency and efficient bandwidth usage, another major consideration for continuous video analytics is the high compute cost of video processing. Because of the high data volumes, compute demands, and latency requirements, we believe that large-scale video analytics may well represent the killer application for edge computing.
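As a concrete illustration of the filtering idea mentioned above, here is a minimal frame-differencing sketch in pure Python, with frames modeled as flat lists of pixel intensities. The threshold and frame data are invented for illustration; a production system would use a proper vision library rather than this toy.

```python
# Toy motion-detection filter: only frames that differ enough from the
# previously forwarded frame are sent to the cloud; the rest are dropped
# at the edge, reducing the bandwidth that needs to be provisioned.
# Frames are modeled as flat lists of pixel intensities (0-255).

def mean_abs_diff(frame_a, frame_b):
    """Mean absolute per-pixel difference between two frames."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)

def filter_static_frames(frames, threshold=10.0):
    """Yield only frames whose change vs. the last kept frame exceeds threshold."""
    previous = None
    for frame in frames:
        if previous is None or mean_abs_diff(frame, previous) > threshold:
            yield frame
            previous = frame

# A mostly static scene: four identical frames, then one with motion.
static = [100] * 16
moving = [100] * 8 + [200] * 8   # half the pixels changed
frames = [static, static, static, static, moving]

kept = list(filter_static_frames(frames))
print(len(kept))  # 2 of 5 frames forwarded; 60% of the stream never leaves the edge
```

The same idea generalizes: any cheap edge-side predicate that discards uninteresting frames directly lowers both the uplink bandwidth and the cloud compute that must be provisioned.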
Q5. Can you explain how Rocket allows programmers to plug-in vision algorithms while scaling across a hierarchy of intelligent edges and the cloud?
Ganesh Ananthanarayanan: Rocket (http://aka.ms/rocket) is an extensible software stack for democratizing video analytics: making it easy and affordable for anyone with a camera stream to benefit from computer vision and machine learning algorithms. Rocket allows programmers to plug-in their favorite vision algorithms while scaling across a hierarchy of intelligent edges and the cloud.
The figure above shows our video analytics stack, Rocket, that supports multiple applications including traffic camera analytics for smart cities, retail store intelligence scenarios, and home assistants. The “queries” of these applications are converted into a pipeline of vision modules by the video pipeline optimizer to process live video streams. The video pipeline consists of multiple modules including the decoder, background subtractor, and deep neural network (DNN) models.
Rocket partitions the video pipeline across the edge and the cloud. For instance, it is preferable to run the heavier DNNs on the cloud where the resources are plentiful. Rocket’s edge-cloud partitioning ensures that: (i) the compute (CPU and GPU) on the edge device is not overloaded and only used for cheap filtering, and (ii) the data sent between the edge and the cloud does not overload the network link. Rocket also periodically checks the connectivity to the cloud and falls back to an “edge-only” mode when disconnected. This avoids any disruption to the video analytics but may produce outputs of lower accuracy due to relying only on lightweight models. Finally, Rocket piggybacks on the live video analytics to use its results as an index for after-the-fact interactive querying on stored videos.
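The edge-only fallback described above can be sketched as follows. This is a toy model with placeholder functions, not Rocket’s actual interfaces: the model names and the connectivity flag stand in for real DNNs and a real connectivity check.

```python
# Sketch of Rocket-style edge-only fallback: prefer the heavy cloud-side DNN,
# but degrade to a lightweight edge model when the cloud is unreachable.
# All names are illustrative placeholders, not Rocket's real API.

def heavy_cloud_dnn(frame):
    # Stand-in for an expensive, accurate model running in the cloud.
    return {"label": "car", "confidence": 0.97}

def light_edge_model(frame):
    # Stand-in for a cheap, less accurate model running on the edge device.
    return {"label": "car", "confidence": 0.80}

def analyze(frame, cloud_reachable):
    """Route the frame to the cloud when connected, else run edge-only."""
    if cloud_reachable:
        return heavy_cloud_dnn(frame)
    return light_edge_model(frame)  # no disruption, but lower accuracy

print(analyze(frame=None, cloud_reachable=True)["confidence"])   # 0.97
print(analyze(frame=None, cloud_reachable=False)["confidence"])  # 0.8
```

The key property is that the analytics pipeline keeps producing outputs during a disconnection; only the accuracy degrades, exactly as described above.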
More details can be found in our recent MobiSys 2019 work.
Q6. One of the verticals your project is focused on is video streams from cameras at traffic intersections. Can you please tell us more how this works in practice?
Ganesh Ananthanarayanan: As we embarked on this project, two key trends stood out: (i) cities were already equipped with a lot of cameras and had plans to deploy many more, and (ii) traffic-related fatalities were among the top-10 causes of death worldwide, which is terrible! So, in partnership with my colleague (Franz Loewenherz) at the City of Bellevue, we asked the question: can we use traffic video analytics to improve traffic safety, traffic efficiency, and traffic planning? We understood that most jurisdictions have little to no data on continuous trends in directional traffic volumes; accident near-misses; or pedestrian, bike, and multi-modal volumes. Such data is usually obtained by commissioning an agency to count vehicles for a single day, once or twice a year.
We have built technology that analyzes traffic camera feeds 24x7 at low cost to power a dashboard of directional traffic volumes. The dashboard raises alerts on traffic congestion and conflicts. Such a capability can be vital for traffic planning (lanes), traffic efficiency (light durations), and safety (identifying unsafe intersections).
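Conceptually, a volume alert of this kind reduces to thresholding per-direction counts over a time window. A minimal sketch, with direction names and the threshold invented for illustration:

```python
from collections import Counter

# Toy directional-volume alerting: count detected vehicle movements per
# direction over a time window and flag any direction exceeding a threshold.
# The directions and threshold are invented for illustration; real detections
# would come from the vision pipeline, not a hand-written list.

def volume_alerts(movements, threshold):
    """Return {direction: count} for directions whose count exceeds threshold."""
    counts = Counter(movements)
    return {direction: n for direction, n in counts.items() if n > threshold}

window = ["northbound"] * 120 + ["southbound"] * 40 + ["eastbound"] * 15
print(volume_alerts(window, threshold=100))  # {'northbound': 120}
```

A real dashboard layers this kind of aggregation over a sliding window of detections and attaches alerting and visualization on top.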
A key aspect is that we do our video analytics using existing cameras and consciously decided to shy away from installing our own cameras. Check out this project video on Video Analytics for Smart Cities.
Q7. What are the lessons learned so far from your ongoing pilot in Bellevue (Washington) for active monitoring of traffic intersections live 24x7? Does it really help prevent traffic-related accidents? Does the use of your technology help your partner jurisdictions identify traffic details that impact traffic planning and safety?
Ganesh Ananthanarayanan: Our traffic analytics dashboard runs 24x7 and accumulates data non-stop that the officials didn’t have access to before. It helps them understand instances of unexpectedly high traffic volumes in certain directions. It also generates alerts on traffic volumes to help dispatch personnel accordingly. We also used the technology for planning a bike corridor in Bellevue. The objective was to do a before/after study of the bike corridor to help understand its impact on driver behavior. The City plans to use the results to decide on bike corridor designs.
Our goal is to make the roads considerably safer and more efficient with affordable video analytics. We expect that video analytics will be able to drive cities’ decisions on exactly these fronts: how they manage their lights, lanes, and signs. We also believe that data on traffic volumes from a dense network of cameras will be able to power and augment routing applications for better navigation.
As more cities deploy the solution, the computer vision models will gain better training data and become more accurate, leading to a nice virtuous cycle.
Qx. Anything else you wish to add?
Ganesh Ananthanarayanan: So far I’ve described how our video analytics solution uses video cameras to continuously analyze scenes and gather data. One thing I am particularly excited to make happen is to “complete the loop”: take the output of the video analytics and, in real time, act on it on the ground. For instance, if we predict an unsafe interaction between a bicycle and a car, send a notification to one or both of them. Pedestrian lights can be automatically activated and even extended for people with disabilities (e.g., in a wheelchair) to enable them to safely cross the road (see demo). I believe that the infrastructure will be sufficiently equipped for this kind of communication in a few years. Another example of this is warning approaching cars when they cannot spot pedestrians between parked cars on the road.
I am really excited about the prospect of the AI analytics interacting with the infrastructure and people on the ground and I believe we are well on track for it!
Ganesh Ananthanarayanan is a Researcher at Microsoft Research. His research interests are broadly in systems & networking, with recent focus on live video analytics, cloud computing & large scale data analytics systems, and Internet performance. His work on “Video Analytics for Vision Zero” on analyzing traffic camera feeds won the Institute of Transportation Engineers 2017 Achievement Award as well as the “Safer Cities, Safer People” US Department of Transportation Award. He has published over 30 papers in systems & networking conferences such as USENIX OSDI, ACM SIGCOMM and USENIX NSDI. He has collaborated with and shipped technology to Microsoft’s cloud and online products like the Azure Cloud, Cosmos (Microsoft’s big data system) and Skype. He is a member of the ACM Future of Computing Academy. Prior to joining Microsoft Research, he completed his Ph.D. at UC Berkeley in Dec 2013, where he was also a recipient of the UC Berkeley Regents Fellowship. For more details: http://aka.ms/ganesh
– Rocket (http://aka.ms/rocket)
– On Amundsen. Q&A with Li Gao tech lead at Lyft, ODBMS.org Expert Article, JUL 30, 2019
– On IoT, edge computing and Couchbase Mobile. Q&A with Priya Rajagopal, ODBMS.org Expert article, JUL 25, 2019
“The best possible database migration is when you are able to move all your data and stored procedures unchanged(!) to the new system.” — Michael Widenius.
I have interviewed Michael “Monty” Widenius, Chief Technology Officer at MariaDB Corporation.
Monty is the “spiritual father” of MariaDB, a renowned advocate for the open source software movement and one of the original developers of MySQL.
Q1. What is adaptive scaling and why is it important for a database?
Michael Widenius: Adaptive scaling is provided to automatically change behavior in order to use available resources as efficiently as possible as demand grows or shrinks. For a database, it means the ability to dynamically configure resources, adding or deleting data nodes and processing nodes according to demand. This provides both scale-up and scale-out in an easy manner.
Many databases can do part of this manually, and a few can do it semi-automatically. When it comes to read scaling with replication, there are a few solutions, like Oracle RAC, but there are very few relational database systems that can handle true write scaling while preserving true ACID properties. This is a critical need for any company that wants to compete in the data space. That’s one of the reasons why MariaDB acquired ClustrixDB last year.
Q2. Technically speaking, how is it possible to adjust scaling so that you can run the database in the background in a desktop with very few resources, and up to a multi node cluster with petabytes of data with read and write scaling?
Michael Widenius: Traditionally, databases are optimized for one particular setup. It’s very hard to run efficiently with a very small footprint, which is what desktop users expect, while also providing extreme scale-out.
The reason we can do that in MariaDB Platform is thanks to the unique separation between the query processing and data storage layers (storage engines). One can start by using a storage engine that requires a relatively small footprint (Aria or InnoDB) and, when demands grow, with a few commands move all or just part of the data to distributed storage with MariaDB ColumnStore, Spider, MyRocks or, in the future, ClustrixDB. One can also very easily move to a replication setup where you have one master for all writes and any number of read replicas. MariaDB Cluster can be used to provide a fully functional master-master network that can be replicated to remote data centers.
My belief is that MariaDB is the most advanced database in existence, when it comes to providing complex replication setups and very different ways to access and store data (providing OLTP, OLAP and hybrid OLTP/OLAP functionalities) while still providing one consistent interface to the end user.
Q3. How do you plan to use ClustrixDB distributed database technology for MariaDB?
Michael Widenius: We will add this as another storage engine for the user to choose from. What it means is that if one wants to switch a table called t1 from InnoDB to ClustrixDB, the only command the user needs to run is:
ALTER TABLE t1 ENGINE=ClustrixDB;
The interesting thing with ClustrixDB is not only that it’s distributed and can automatically scale up and down based on demands, but also that a table on ClustrixDB can be accessed by different MariaDB servers. If you create a ClustrixDB table on one MariaDB server, it’s at once visible to all other MariaDB servers that are attached to the same cluster.
Q4. Why is having Oracle compatibility in MariaDB a game changer for the database industry?
Michael Widenius: MariaDB Platform is the only enterprise open source database that supports a significant set of Oracle syntax. This makes it possible for the first time to easily move Oracle applications to an open source solution, get rid of single-vendor lock-in and leverage existing skill sets. MariaDB Corporation is also the best place to get migration help as well as enterprise features, consultative support and maintenance.
Q5. How does MariaDB manage to parse, depending on the case, approximately 80 percent of legacy Oracle PL/SQL without rewriting the code?
Michael Widenius: Oracle PL/SQL was originally based on the same standard that created SQL; however, Oracle decided to use different syntax from what’s used in ANSI SQL. Fortunately, most of the logical language constructs are the same. This made it possible to provide a mapping from most of the PL/SQL constructs to ANSI.
What we did:
– Created a new parser, sql_yacc_ora.yy, which understands the PL/SQL constructs and maps the PL/SQL syntax to existing MariaDB internal structures.
– Added support for SQL_MODE=ORACLE, to allow the user to switch which parser to use. The mode is stored as part of stored procedures, to allow users to run a procedure without having to know whether it’s written in ANSI SQL or PL/SQL.
– Extended MariaDB with new Oracle compatibility features that we didn’t have before, such as SEQUENCES, PACKAGES, ROW TYPE, etc.
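As a toy illustration of the syntax-mapping idea, a few PL/SQL spellings can be rewritten to their ANSI counterparts with simple substitutions. MariaDB’s real implementation is a full parser (sql_yacc_ora.yy), not string rewriting; the rules below only show that many constructs map one-to-one.

```python
import re

# Toy illustration of mapping a few PL/SQL spellings to ANSI-style SQL.
# These three rewrite rules are illustrative examples only; the real parser
# handles full grammar constructs (loops, packages, %ROWTYPE, ...).
REWRITES = [
    (r"\bVARCHAR2\b", "VARCHAR"),
    (r"\bNVL\(", "COALESCE("),
    (r"\bSYSDATE\b", "CURRENT_TIMESTAMP"),
]

def to_ansi(sql):
    """Apply each rewrite rule in order to a PL/SQL-flavored statement."""
    for pattern, replacement in REWRITES:
        sql = re.sub(pattern, replacement, sql)
    return sql

print(to_ansi("SELECT NVL(name, 'n/a'), SYSDATE FROM t"))
# SELECT COALESCE(name, 'n/a'), CURRENT_TIMESTAMP FROM t
```

The point of the illustration is that, because most logical constructs are shared, the mapping is largely mechanical; the hard cases are the constructs MariaDB had to add outright (sequences, packages, row types).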
You can read all about the Oracle compatibility functionality that MariaDB supports here.
Q6. When embarking on a database migration, what are the best practices and technical solutions you recommend?
Michael Widenius: The best possible database migration is when you are able to move all your data and stored procedures unchanged(!) to the new system.
That is our goal when we are supporting a migration from Oracle to MariaDB. This usually means that we are working closely with the customer to analyze the difficulty of the migration and determine a migration plan. It also helps that MariaDB supports MariaDB SQL/PL, a compatible subset of Oracle PL/SQL language.
If MariaDB is fairly new to you, then it’s best to start with something small that only uses a few stored procedures, to give DBAs a chance to get to know MariaDB better. When you’ve succeeded in moving a couple of smaller installations, it’s time to start with the larger ones. Our expert migration team is standing by to assist you in any way possible.
Q7. Why did you combine your transactional and analytical databases into a single platform, MariaDB Platform X3?
Michael Widenius: Thanks to the storage engine interface, it’s easy for MariaDB to provide both transactional and analytical storage behind one interface. Today it’s not efficient or desirable to have to move between databases just because your data needs grow. MariaDB can also provide the unique capability of using different storage engines on the master and replicas. This allows you to have your master optimized for inserts while some of your replicas are optimized for analytical queries.
Q8. You also launched a managed service supporting public and hybrid cloud deployments. What are the benefits of such a service to enterprises?
Michael Widenius: Some enterprises find it hard to find the right DBAs (these are still a scarce resource) and would rather focus on their core business instead of managing their databases. The managed service is there so these enterprises don’t have to think about how to keep the database servers up and running. Maintenance, upgrading and optimizing of the database will instead be done by people who are the definitive experts in this area.
Q9. What are the limitations of existing public cloud service offerings in helping companies succeed across their diverse cloud and on-prem environments?
Michael Widenius: Most of the existing cloud services for databases only ensure that the “database is up and running”. They don’t provide database maintenance, upgrading, optimization, consultative support or disaster management. More importantly, you’re only getting a watered-down version of MariaDB in the cloud rather than the full-featured version you get with MariaDB Platform. If you encounter performance problems, serious bugs, crashes or data loss, you are on your own. You also don’t have anyone to talk with if you need new features for your database that your business requires.
Q10. How does MariaDB Platform Managed Service differ from existing cloud offerings such as Amazon RDS and Aurora?
Michael Widenius: In our benchmarks that we shared at our MariaDB OpenWorks conference earlier this year, we showed that MariaDB’s Managed Service offering beats Amazon RDS and Aurora when it comes to performance. Our managed service also unlocks capabilities such as columnar storage, data masking, database firewall and many more features that you can’t get in Amazon’s services. See the full list here for a comparison.
Q11. What are the main advantages of using a mix of cloud and on-prem?
Michael Widenius: There are many reasons why a company will use a mix of cloud and on-prem. Cloud is where all the growth is and many new applications will likely go to the cloud. At the same time, this will take time and we’ll see many applications stay on prem for a while. Companies may decide to keep applications on prem for compliance and regulatory reasons as well. In general, it’s not good for any company to have a vendor that totally locks them into one solution. By ensuring you can run the exact same database on both on-prem and cloud, including ensuring that you have all your data in both places, you can be sure your company will not have a single point of failure.
Michael “Monty” Widenius, Chief Technology Officer, MariaDB.
Monty is the “spiritual father” of MariaDB, a renowned advocate for the open source software movement and one of the original developers of MySQL, the predecessor to MariaDB. In addition to serving as CTO for the MariaDB Corporation, he also serves as a board member of the MariaDB Foundation. He was a founder at SkySQL, and the CTO of MySQL AB until its sale to Sun Microsystems (now Oracle). Monty was also the founder of TCX DataKonsult AB, a Swedish data warehousing company. He is the co-author of the MySQL Reference Manual and was awarded in 2003 the Finnish Software Entrepreneur of the Year prize. In 2015, Monty was selected as one of the 100 most influential persons in the Finnish IT market. Monty studied at Helsinki University of Technology and lives in Finland.
Follow us on Twitter: @odbmsorg
“LeanXcale is the first startup that instead of going to market with a single innovation or know-how, is going to market with 10 disruptive innovations that are making it really differential for many different workloads and extremely competitive on different use cases.” — Patrick Valduriez.
I have interviewed Patrick Valduriez and Ricardo Jimenez-Peris. Patrick is a well-known database researcher and, since 2019, the scientific advisor of LeanXcale. Ricardo is the CEO and founder of LeanXcale. We talked about NewSQL, Hybrid Transaction and Analytics Processing (HTAP), and LeanXcale, a startup that offers an innovative HTAP database.
Q1. There is a class of new NewSQL databases in the market, called Hybrid Transaction and Analytics Processing (HTAP) – a term created by Gartner Inc. What is special about such systems?
Patrick Valduriez: NewSQL is a recent class of DBMS that seeks to combine the scalability of NoSQL systems with the strong consistency and usability of RDBMSs. An important class of NewSQL is Hybrid Transaction and Analytics Processing (HTAP) whose objective is to perform real-time analysis on operational data, thus avoiding the traditional separation between operational database and data warehouse and the complexity of dealing with ETLs.
Q2. HTAP functionality is offered by several database companies. How does LeanXcale compare with respect to other HTAP systems?
Ricardo Jimenez-Peris: HTAP covers a large spectrum that has three dimensions. One dimension is the scalability of the OLTP part. That is where we excel: we scale out linearly to hundreds of nodes. The second dimension is the ability to scale out OLAP. This is well-known technology from the last two decades; some systems are mostly centralized, but those that are distributed should be able to handle the OLAP part reasonably well. The third dimension is efficiency on the OLAP part. There we are still working to improve the optimizer, and the expectation is that we will become pretty competitive in the next 18 months. Patrick’s expertise in distributed query processing will be key. I would also like to note that, for recurrent aggregation analytical queries, we are really unbeatable thanks to a new invention that enables us to update these aggregations in real time, so these aggregation queries become costless, since they just need to read a single row from the relevant aggregation table.
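The effect of updating aggregations in real time can be sketched with a toy in-memory model (this is not LeanXcale’s implementation, only the idea): every insert also maintains a per-key aggregate, so the analytical query reads a single row instead of scanning the base table.

```python
from collections import defaultdict

# Toy online aggregation: every insert into the base table also updates a
# per-key aggregate row, so "SELECT SUM(amount) ... GROUP BY key" becomes a
# single-row lookup instead of a scan. Table and key names are invented.

class OnlineAggregates:
    def __init__(self):
        self.rows = []                    # base table: (key, amount) tuples
        self.totals = defaultdict(float)  # aggregation table, maintained on ingest

    def insert(self, key, amount):
        self.rows.append((key, amount))
        self.totals[key] += amount        # real-time aggregate maintenance

    def sum_for(self, key):
        return self.totals[key]           # O(1): reads one aggregate row

db = OnlineAggregates()
for key, amount in [("store-1", 10.0), ("store-2", 5.0), ("store-1", 2.5)]:
    db.insert(key, amount)
print(db.sum_for("store-1"))  # 12.5
```

The cost of the aggregation query is thus shifted from query time to ingest time, which is exactly what makes recurrent analytical queries cheap.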
Q3. Patrick, you wrote in a blog that “LeanXcale has a disruptive technology that can make a big difference on the DBMS market”. Can you please explain what is special about LeanXcale?
Patrick Valduriez: I believe that LeanXcale is at the forefront of the HTAP movement, with a disruptive technology that provides ultra-scalable transactions (see Q4), key-value capabilities (see Q5), and polyglot capabilities. On one hand, we support polyglot queries that allow integrating data coming from different data stores, such as HDFS, NoSQL and SQL systems. On the other hand, we already support SQL and key-value functionality on the same database, and soon we will support JSON documents in a seamless manner, so we are becoming a polystore.
LeanXcale is the first startup that instead of going to market with a single innovation or know-how, is going to market with 10 disruptive innovations that are making it really differential for many different workloads and extremely competitive on different use cases.
Q4. What are the basic principles you have used to design and implement LeanXcale as a distributed database that allows scaling transactions from 1 node to thousands?
Ricardo Jimenez-Peris: LeanXcale solves the traditional transaction management bottleneck with a new invention that lies in distributed processing of the ACID properties, where each ACID property is scaled out independently but in a composable manner. LeanXcale’s architecture is based on three layers that scale out independently: 1) KiVi, the storage layer, which is a relational key-value data store; 2) the distributed transactional manager, which provides ultra-scalable transactions; and 3) the distributed query engine, which scales out both OLTP and OLAP workloads. KiVi includes eight disruptive innovations that provide dynamic elasticity, online aggregations, push-down of all algebraic operators but join, active-active replication, simultaneous efficiency for both ingesting data and range queries, efficient execution on NUMA architectures, costless multiversioning, hybrid row-columnar storage, vectorial acceleration, and so on.
Q5. The LeanXcale database offers a so-called dual interface, key-value and SQL. How does it work and what is it useful for?
Ricardo Jimenez-Peris (how does it work): The storage layer is a proprietary relational key-value data store, called KiVi, which we have developed. Unlike traditional key-value data stores, KiVi is not schemaless, but relational. Thus, KiVi tables have a relational schema, but can also have a part that is schemaless. The relational part enabled us to enrich KiVi with predicate filtering, aggregation, grouping, and sorting. As a result, we can push down to KiVi all algebraic operators below a join and execute them in parallel, thus saving the movement of a very large fraction of rows between the storage layer and the query engine layer. Furthermore, KiVi has a direct API that allows doing everything that SQL can do except join, but without the cost of SQL. In particular, it can ingest data as efficiently as the most efficient key-value data stores, but the data is stored in relational tables in a fully ACID way and is accessible through SQL. This greatly reduces the footprint of the database in terms of hardware resources for workloads where data ingestion represents a high fraction.
Patrick Valduriez (what is it useful for): As with RDBMSs, the SQL interface allows rapid application development and remains the preferred interface for BI and analytics tools. The key-value interface is complementary and allows the developer to have better control over the integration of application code and database access, for higher performance. This interface also allows easy migration from other key-value stores.
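The benefit of pushing filters and aggregations below the join into the storage layer can be sketched as follows. This is a toy model only; KiVi’s actual direct API is proprietary and not shown here, and the table contents are invented.

```python
# Toy illustration of operator push-down: the storage layer applies the
# predicate and the aggregation itself, so a single value crosses to the
# query engine instead of every matching row.

TABLE = [  # a KiVi-like relational table of (city, amount) rows
    ("madrid", 10), ("paris", 20), ("madrid", 30), ("berlin", 5),
]

def storage_scan_no_pushdown(predicate):
    # Without push-down: ship every matching row up to the query engine.
    return [row for row in TABLE if predicate(row)]

def storage_sum_with_pushdown(predicate):
    # With push-down: filter and aggregate inside storage, ship one number.
    return sum(amount for city, amount in TABLE if predicate((city, amount)))

pred = lambda row: row[0] == "madrid"
rows_shipped = storage_scan_no_pushdown(pred)   # 2 rows cross the layer boundary
total = storage_sum_with_pushdown(pred)         # 1 value crosses the boundary
print(len(rows_shipped), total)  # 2 40
```

On real tables the gap is of course far larger: pushing the predicate and aggregate down turns row movement proportional to the table size into movement proportional to the result size.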
Q6. You write that LeanXcale could be used in different ways. Can you please elaborate on that?
Ricardo Jimenez-Peris: LeanXcale can be used in many different ways: as an operational database (thanks to transaction scalability), as a data warehouse (thanks to our distributed OLAP query engine), as a real-time analytics platform (due to our HTAP capability), as an ACID key-value data store (using KiVi and our ultra-scalable transactional management), as a time series database (thanks to our high ingestion capabilities), as an integration polyglot query engine (based on our polyglot capabilities), as an operational data lake (combining the scalability in volume of a data lake with operational capabilities at any scale), as a fast data store (using KiVi standalone), as an IoT database (deploying KiVi on IoT devices), and as an edge database (deploying KiVi on IoT devices and the edge, with the full LeanXcale database on the cloud with geo-replication).
Thanks to all our innovations and our efficient and flexible architecture, we can compete in many different scenarios.
Q7. The newly defined SQL++ language allows adding a JSON data type in SQL. N1QL for Analytics is the first commercial implementation of SQL++. Do you plan to support SQL++ as well?
Ricardo Jimenez-Peris and Patrick Valduriez: Yes, but within SQL, as we don’t think any language will replace SQL in the near future. Over the last 30 years, there have been many claims that new languages would (“soon”) replace SQL, e.g., object query languages such as OQL in the 1990s or XML query languages such as XQuery in the 2000s. But this did not happen, for three main reasons. First, SQL’s data abstraction (the table) is ubiquitous and simple. Second, the language is easy to learn, powerful, and has been adopted by legions of developers. Third, it is a (relatively) standard language, which makes it a good interface for tool vendors. This being said, the JSON data model is important to manage documents, and SQL++ is a very nice SQL-like language for JSON. In LeanXcale, we plan to support a JSON data type in SQL columns and have a seamless integration of SQL++ within SQL, with the best of both (relational and document) worlds. Basically, each row can be relational or JSON, and SQL statements can include SQL++ statements.
Q8. What are the typical use cases for LeanXcale? and what are the most interesting verticals for you?
Ricardo Jimenez-Peris: Too many to list. Basically, all data-intensive use cases. We are ideal for the new technological verticals such as traveltech, adtech, IoT, smart-*, online multi-player games, and eCommerce. But we are also very good and cost-effective for traditional use cases such as Banking/Finance, Telco, Retail, Insurance, Transportation, Logistics, etc.
Q9. Patrick, as Scientific Advisor of LeanXcale, what is your role? What are you working on at present?
Patrick Valduriez: My role is as a sort of consulting chief architect for the company, providing advice on architectural and design choices as well as implementation techniques. I will also do what I like most, i.e., teach the engineers the principles of distributed database systems, do technology watch, write white papers and blog posts on HTAP-related topics, and do presentations at various venues. We are currently working on query optimization, based on the Calcite open source software, where we need to improve the optimizer cost model and search space, in particular, to support bushy trees in parallel query execution plans. Another topic is to add the JSON data type in SQL in order to combine the best of relational DBMS and document NoSQL DBMS.
Q10. What is the role that Glenn Osaka is having as an advisor for LeanXcale?
Ricardo Jimenez-Peris: Glenn is an amazing guy and a successful Silicon Valley entrepreneur (CEO of Reactivity, sold to Cisco). He was an advisor to Peter Thiel at Confinity, which later merged with Elon Musk’s X.com to create PayPal, and he continued as an advisor there until it was sold to eBay.
He is guiding us in our strategy to become a global company. The main challenge for a company doing B2B sales to enterprises is overcoming the slowness of enterprise sales, and through his advice we have built a strategy to overcome it.
Q11. You plan to work with Ikea. Can you please tell us more?
Ricardo Jimenez-Peris: Ikea has isolated ERPs per store. The main issue is that when a customer wants to buy an item at a store that does not have enough stock, this isolation prevents them from selling using stock from other stores. Similarly, orders for new stock are not optimized, since they are made based on the local store view. We are providing them with a centralized database that keeps the stock across all stores, solving both problems. We are also working with them on a proximity marketing solution to offer customers coupon-based discounts as they go through the store.
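The stock problem described above can be sketched in a few lines. This is purely illustrative: the store and item names are made up, and the actual solution is a distributed SQL database, not in-process Python dicts.

```python
# Per-store stock views (isolated ERPs) vs. a centralized view that
# lets any store sell from global inventory.
local_stock = {
    "madrid":    {"shelf": 0, "chair": 5},
    "barcelona": {"shelf": 3, "chair": 1},
}

def can_sell_locally(store, item, qty):
    # Isolated ERP: only this store's stock is visible.
    return local_stock[store].get(item, 0) >= qty

def can_sell_globally(item, qty):
    # Centralized database: stock is summed across all stores.
    total = sum(stock.get(item, 0) for stock in local_stock.values())
    return total >= qty

# The isolated ERP refuses the sale; the centralized view accepts it.
print(can_sell_locally("madrid", "shelf", 2))  # False
print(can_sell_globally("shelf", 2))           # True
```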
Qx. Anything else you wish to add?
Patrick Valduriez: Well, the adventure has just started and it is already a lot of fun. It is a great opportunity for me, and probably the right time, to go deeper into applying the principles of distributed and parallel databases to real-world problems. The timing is perfect, as the new (fourth) edition of the book “Principles of Distributed Database Systems”, which I co-authored with Professor Tamer Özsu, is in production at Springer. As a short preview, note that there is a section on LeanXcale’s ultra-scalable transaction management approach in the transaction chapter and another section on LeanXcale’s architecture in the NoSQL/NewSQL chapter.
Ricardo Jimenez-Peris: It is a really exciting moment now that we are going to market. We have managed to build an amazing team able to make the product strong and take it to market. We believe we are the most innovative startup in the database arena, and our objective is to become the next global database company. There is still a lot of work and many exciting challenges ahead. We are now working on our database cloud managed service, which will be delivered on Amazon, hopefully by the end of the year.
Dr. Patrick Valduriez is a senior scientist at Inria in France. He was a scientist at Microelectronics and Computer Technology Corp. in Austin (Texas) in the 1980s and a professor at University Pierre et Marie Curie (UPMC) in Paris in the early 2000s. He has also consulted for major companies in the USA (HP Labs, Lucent Bell Labs, NERA, LECG, Microsoft), Europe (ESA, Eurocontrol, Ask, Shell) and France (Bull, Capgemini, Matra, Murex, Orsys, Schlumberger, Sodifrance, Teamlog). Since 2019, he has been the scientific advisor of the LeanXcale startup.
He currently heads the Zenith team (a joint team of Inria and the University of Montpellier, LIRMM), which focuses on data science, in particular data management in large-scale distributed and parallel systems and scientific data management. He has authored and co-authored many technical papers and several textbooks, among them “Principles of Distributed Database Systems” with Professor Tamer Özsu. He serves as associate editor of several journals, including the VLDB Journal, Distributed and Parallel Databases, and Internet and Databases. He has served as PC chair of major conferences such as SIGMOD and VLDB, and was general chair of SIGMOD 2004, EDBT 2008 and VLDB 2009.
He has received prestigious awards and prizes, including several best paper awards (among them at VLDB 2000), the 1993 IBM scientific prize in Computer Science in France, and the 2014 Innovation Award from Inria – French Academy of Science – Dassault Systèmes. He is an ACM Fellow.
Dr. Ricardo Jimenez-Peris was a professor and researcher at the Technical University of Madrid (Universidad Politécnica de Madrid – UPM) and left his academic career to bring an ultra-scalable database to market. While at UPM, he sold technology to European enterprises such as Ericsson, Telefonica, and Bull. He has been a member of the Advisory Committee on Cloud Computing for the European Commission.
He is co-inventor of two patents already granted in the US and Europe, and of 8 new patent applications under preparation. He is co-author of the book “Replicated Databases” and of more than 100 research papers and articles.
He has been invited to present LeanXcale technology at the headquarters of many tech companies in Silicon Valley, such as Facebook, Twitter, Salesforce, Heroku, Greenplum (now Pivotal), HP, Microsoft, etc.
He has coordinated (as overall coordinator or technical coordinator) over 10 European projects. One of them, LeanBigData, received the “Best European project” award from the Madrid Research Council (Madri+d).
LeanXcale received the “Best SME” award from the Innovation Radar of the European Commission in November 2017, recognizing it as the most innovative European startup. LeanXcale has been identified as one of the innovative startups in the NewSQL arena by the market analyst firm Bloor, and as one of the companies in the HTAP arena by the market analyst firm 451 Research.
Follow us on Twitter: @odbmsorg
“A lot of times we think of digital transformation as a technology dependent process. The transformation takes place when employees learn new skills, change their mindset and adopt new ways of working towards the end goal.”–Kerem Tomak
I have interviewed Kerem Tomak, Executive VP, Divisional Board Member, Big Data-Advanced Analytics-AI, at Commerzbank AG. We talked about Digital Transformation, Big Data, Advanced Analytics and AI for the financial sector.
Commerzbank AG is a major German bank operating as a universal bank, headquartered in Frankfurt am Main. In the 2019 financial year, it was the second largest bank in Germany by total assets. The bank is present in more than 50 countries around the world and provides almost a third of Germany’s trade finance. In 2017, it served nearly 13 million customers in Germany and more than 5 million customers in Central and Eastern Europe. (source: Wikipedia)
Q1. What are the key factors that need to be taken into account when a company wants to digitally transform itself?
Kerem Tomak: It starts with a clear and coherent digital strategy. Depending on the company’s maturity, this can vary from operational efficiencies as the main target to disrupting and changing the business model altogether. Having a clear scope and objectives for the digital transformation is key to its success.
A lot of times we think of digital transformation as a technology-dependent process. The transformation takes place when employees learn new skills, change their mindset and adopt new ways of working towards the end goal. Digital enablement, together with a company-wide upgrade or replacement of legacy technologies with new ones like cloud, APIs and IoT, is the next step towards becoming a digital company. With all this comes the most important ingredient: thinking outside the box and taking risks. One of the key success criteria in becoming a digital enterprise is a true and speedy “fail fast, learn and optimize” mentality. Avoiding (calculated) risks, especially at the executive level, will limit growth and hinder transformation efforts.
Q2. What are the main lessons you have learned when establishing strategic, tactical and organizational direction for digital marketing, big data and analytics teams?
Kerem Tomak: For me, culture eats strategy. Efficient teams build a culture in which they thrive. Innovation is fueled by teams which constantly learn and share knowledge, take risks and experiment. Aside from cultural aspects, there are three main lessons I learned over the years.
First: Top down buy-in and support is key. Alignment with internal and external key stakeholders is vital – you cannot create impact without them taking ownership and being actively involved in the development of use cases.
Second: Clear prioritization is necessary. Resources are limited, both in the analytics teams and with the stakeholders. OKRs (objectives and key results) provide very valuable guidance for steering the teams forward and setting priorities.
Third: Build solutions that can scale on a stable and scalable infrastructure. Data quality and governance provide clean input channels for analytics development and deployment; this is a major requirement and the biggest chunk of the work. Analytics capabilities then guide what kinds of tools and technologies can be used to make sense of this data. Finally, integrating with execution outlets, such as a digital marketing platform, creates a feedback loop that teams can learn and optimize against.
Q3. What are the main challenges (both technical and non) when managing mid and large-size analytics teams?
Kerem Tomak: Again, building a culture in which teams thrive, independent of size, is key. For analytics teams, constantly learning and testing new techniques and technologies is an important aspect of job satisfaction in the first few years out of academia. A clear promotion path and the availability of a “skills matrix” make it easy to understand which employee skills leadership values and provide guidance on future growth opportunities. I am not a believer in hierarchical organizations, so keeping the number of job levels as low as possible is necessary for speed and delivery. Hiring and retaining the right skills in analytics teams is not easy, especially in hot markets like Silicon Valley. Most analytics employees follow leaders and generally stay loyal to them; the head of an analytics team plays an extremely important role and will “make it or break it” for the team. Finally, an analytics platform with the right tools and scale is critical for the teams’ success.
Q4. What does it take to successfully deliver large scale analytics solutions?
Kerem Tomak: First, one needs a flexible and scalable analytics infrastructure – this can comprise on-premise components, like chatbots for example, as well as shared components via a public cloud. Secondly, it takes end-to-end automation of processes in order to attain scale fast and on demand. Last but not least, companies need an accurate sense of customers’ needs and requirements to ensure that the developed solution will be adopted.
Q5. What parameters do you normally use to define if an analytics solution is really successful?
Kerem Tomak: An analytics solution is successful if it has a high impact. Some key parameters are usage, increased revenues and reduced costs.
Q6. Talking about Big Data, Advanced Analytics and AI: Which companies are benefiting from them at present?
Kerem Tomak: Maturity of Big Data, AA and AI differs across industries. Leading the pack are Tech, Telco, Financial Services, Retail and Automotive. In each industry there are leaders and laggards. There are fewer and fewer companies untouched by BDAA and AI.
Q7. Why are Big Data and Advanced Analytics so important for the banking sector?
Kerem Tomak: This has (at least) two dimensions. First: Like any other company that wants to sell products or services, we must understand our clients’ needs. Big Data and Advanced Analytics can give us a decisive advantage here. For example – with our customers’ permission, of course – we can analyze their transactions and thus gain useful information about their situation and learn what they need from their bank. Simply put: a person with a huge amount of cash in their account obviously has no need for consumer credit at the moment. But the same person might need advice on investment opportunities. Data analysis can give us very detailed insights and thus help us to understand our customers better.
This leads to the second dimension, which is risk management. As a bank, we are risk-taking specialists. The better the bank understands the risks it takes, the more efficiently it can act to counterbalance them. The benefits are a lower rate of credit defaults as well as more accurate credit pricing, which is in favor of both the bank and its customers.
Data is the fabric from which new business models are made, but Big Data does not necessarily mean big business: the correct evaluation of data is crucial. In the future, it will also be a decisive factor in whether a company can hold its own in the market.
Q8. What added value can you deliver to your customers with them?
Kerem Tomak: Well, for starters, Advanced Analytics helps us to prevent fraud. In 2017, Commerzbank used algorithms to stop fraudulent payments in excess of EUR 100 million. Another use case is the liquidity forecast for small and medium-sized enterprises. Our Cash Radar runs in a public cloud and generates forecasts for the development of the business account. It can therefore warn companies at an early stage if, for example, an account is in danger of being underfunded. So with the help of such innovative data-driven products, the bank obviously can generate added customer value, but also drive its growth and set itself apart from its competitors.
Additionally, Big Data and Advanced Analytics generate significant internal benefits. For example, machine learning provides us with efficient support in preventing money laundering by automatically detecting conspicuous payment flows. Another example: chatbots already handle part of our customer communication. Also, Commerzbank is the first German financial institution to develop a data-based pay-per-use investment loan. The redemption amount is calculated from the use of the capital goods – in this case, the utilization of the production machines – which protects the liquidity of the user and gives us the benefit of much more accurate risk calculations.
When we bear in mind that the technology behind examples like these is still quite new, I am confident that we will see many more use cases of all kinds in the future.
Q9. It seems that Artificial Intelligence (AI) will revolutionize the financial industry in the coming years. What is your take on this?
Kerem Tomak: When we talk about artificial intelligence today, we basically still mean machine learning. So we are not talking about generalized artificial intelligence in its original sense. It is about applications that recognize patterns and learn from these occurrences. Eventually, tying these capabilities to applications that support decisions and provide services makes AI (aka machine learning) a unique field. Even though the field of data modelling has developed rapidly in recent years, we are still a long way from the much-discussed generalized artificial intelligence, whose goal was outlined in 1965 as “machines will be capable, within twenty years, of doing any work a man can do”. With the technology available today, we can imagine the financial industry having new ways of generating, transferring and accumulating wealth that we have not seen before, all predicated upon individual adoption and trust.
Q10. You have been working for many years in US. What are the main differences you have discovered in now working in Europe?
Kerem Tomak: Europeans are very sensitive to privacy and data security. The European Union has set a high global standard with its General Data Protection Regulation (GDPR). In my opinion, data protection “made in Europe” is a real asset and has the potential to become a global blueprint.
Also, Europe is very diverse – from language and culture to different market environments and regulatory regimes. Even though immense progress has been made in the course of harmonization within the European Union, a level playing field remains one of the key issues in Europe, especially for banks.
Technology adoption is lagging in some parts of Europe. Bigger infrastructure investments, wider adoption of public cloud and 5G deployment are needed to stay competitive and relevant in global markets, which are increasingly dominated by the US and China. This is both an opportunity and a risk. I see tremendous opportunities everywhere, from IoT to AI-driven B2B and B2C apps, for example. If adoption of public cloud lags any further, I see a risk of the EU falling behind on AI development and innovation.
Finally, I truly enjoy the family-oriented work-life balance here, which in turn increases productivity and output.
Dr. Kerem Tomak, Executive VP, Divisional Board Member, Big Data-Advanced Analytics-AI, Commerzbank AG
Kerem brings more than 15 years of experience as a data scientist and an executive. He has expertise in the areas of omnichannel and cross-device attribution, price and revenue optimization, assessing promotion effectiveness, yield optimization in digital marketing, and real-time analytics. He has managed mid- and large-size analytics and digital marketing teams at Fortune 500 companies and delivered large-scale analytics solutions for marketing and merchandising units. His out-of-the-box thinking and problem-solving skills have led to 4 patent awards and numerous academic publications. He is also a sought-after speaker on Big Data and BI platforms for analytics.
“At some point, most companies come to the realization that the advanced technologies and innovation that allow them to improve business operations also generate increased amounts of data that existing legacy technology is unable to handle, resulting in the need for more new technology. It is a cyclical process that CIOs need to prepare for.” –Scott Gnau
Last month, InterSystems appointed Scott Gnau head of its Data Platforms business unit. I asked Scott a number of questions about data management, his advice for Chief Information Officers, the positioning of the InterSystems IRIS™ family of data platforms, and the technology vision ahead for the company’s Data Platforms business unit.
Q1. What are the main lessons you have learned in more than 20 years of experience in the data management space?
Scott Gnau: The data management space is a people-centric business, whether you are dealing with long-time customers or with developers and architects. A trusted relationship can make the difference in a potential customer selecting one vendor’s technology, with the benefit of a long-term partnership, over a similar competitor’s technology.
Throughout my career, I have also learned how risky data management projects can be. They essentially ensure the security, cleanliness and accuracy of an organization’s data, and are then responsible for scaling the data-centric applications that inform important business decisions. Data management is a very competitive space that is only becoming more crowded.
Q2. What is your most important advice for Chief Information Officers?
Scott Gnau: At some point, most companies come to the realization that the advanced technologies and innovation that allow them to improve business operations also generate increased amounts of data that existing legacy technology is unable to handle, resulting in the need for more new technology. It is a cyclical process that CIOs need to prepare for.
Phenomena such as big data, the internet of things (IoT), and artificial intelligence (AI) are driving the need for this modern data architecture and processing, and CIOs should plan accordingly. For the last 30 years, data was primarily created inside data centers or firewalls, was standardized, kept in a central location and managed. It was fixed and simple to process.
In today’s world, most data is created outside the firewall and outside of your control. The data management process is now reversed – instead of starting with business requirements, then sourcing data and building and adjusting applications, developers and organizations load the data first and reverse engineer the process. Now data is driving decisions around what is relevant and informing the applications that are built.
Q3. How do you position the InterSystems IRIS™ family of data platforms with respect to other similar products on the market?
Scott Gnau: The data management industry is crowded, but the InterSystems IRIS data platform is like nothing else on the market. It has a unique, solid architecture that attracts very enthusiastic customers and partners, and plays well in the new data paradigm. There is no requirement to have a schema to leverage InterSystems IRIS. It scales unlike any other product in the data management marketplace.
InterSystems IRIS has unique architectural differences that enable all functions to run in a highly optimized fashion, whether it be supporting thousands of concurrent requests, automatic and easy compression, or highly performant data access methods.
Q4. What is your strategy with respect to the Cloud?
Scott Gnau: InterSystems has a cloud-first mentality, and with the goal of easy provisioning and elasticity, we offer customers the choice for cloud deployments. We want to make the consumption model simple, so that it is frictionless to do business with us.
InterSystems IRIS users have the ability to deploy across any cloud, public or private. Internally, the software leverages the cloud infrastructure to take advantage of the new capabilities enabled by cloud and containerized architectures.
Q5. What about Artificial Intelligence?
Scott Gnau: AI is the next killer app for the new data paradigm. With AI, data can tell you things you didn’t already know. While many of the mathematical models that AI is built on are on the older side, it is still true that the more data you feed them the more accurate they become (which fits well with the new paradigm of data). Generating value from AI also implies real time decisioning, so in addition to more data, more compute and edge processing will define success.
Q6. How do you plan to help the company’s customers to a new era of digital transformation?
Scott Gnau: My goal is to make technology as easy to consume as possible and to ensure that it is highly dependable. I will continue to work in and around vertical industries where solutions are easily replicable.
Q7. What customers are asking for is not always what customers really need. How do you manage this challenge?
Scott Gnau: Disruption in the digital world is at an all-time high, and for some, impending change is sometimes too hard to see before it is too late. I encourage customers to be ready to “rethink normal,” while putting them in the best position for any transitions and opportunities to come. At the same time, as trusted partners we also are a source of advice to our customers on mega trends.
Q8. What is your technology vision ahead for the company’s Data Platforms business unit?
Scott Gnau: InterSystems continues to look for ways to differentiate how our technology creates success for our customers. We judge our success on our customers’ successes. Our unique architecture and overall performance envelope play very well into data-centric applications across multiple industries, including financial services, logistics and healthcare. With connected devices and the requirement for augmented transactions, we play nicely into the future high-value application space.
Q9. What do you expect from your new role at InterSystems?
Scott Gnau: I expect to have a lot of fun because there is an infinite supply of opportunity in the data management space due to the new data paradigm and the demand for new analytics. On top of that, InterSystems has many smart, passionate and loyal customers, partners and employees. As I mentioned up front, it’s about a combination of great tech AND great people that drives success. Our ability to invest in the future is extremely strong – we have all the key ingredients.
Scott Gnau joined InterSystems in 2019 as Vice President of Data Platforms, overseeing the development, management, and sales of the InterSystems IRIS™ family of data platforms. Gnau brings more than 20 years of experience in the data management space, helping lead technology and data architecture initiatives for enterprise-level organizations. He joins InterSystems from Hortonworks, where he served as chief technology officer. Prior to Hortonworks, Gnau spent two decades at Teradata in increasingly senior roles, including serving as president of Teradata Labs. Gnau holds a Bachelor’s degree in electrical engineering from Drexel University.
– On AI, Big Data, Healthcare in China. Q&A with Luciano Brustia, ODBMS.org, April 8, 2019.
“When we started this project in 2013, it was a moonshot. We were not sure if NVM technologies would ever see the light of day, but Intel has finally started shipping NVM devices in 2019. We are excited about the impact of NVM on next-generation database systems.” — Joy Arulraj and Andrew Pavlo.
I have interviewed Joy Arulraj, Assistant Professor of Computer Science at Georgia Institute of Technology, and Andrew Pavlo, Assistant Professor of Computer Science at Carnegie Mellon University. They have just published a new book, “Non-Volatile Memory Database Management Systems”. We talked about non-volatile memory (NVM) technologies and how NVM is going to impact next-generation database systems.
Q1. What are emerging non-volatile memory technologies?
Arulraj, Pavlo: Non-volatile memory (NVM) is a broad class of technologies, including phase-change memory and memristors, that provide low-latency reads and writes on the same order of magnitude as DRAM, but with persistent writes and large storage capacity like an SSD. For instance, Intel recently started shipping its Optane DC NVM modules based on 3D XPoint technology.
Q2. How do they potentially change the dichotomy between volatile memory and durable storage in database management systems?
Arulraj, Pavlo: Existing database management systems (DBMSs) can be classified into two types based on the primary storage location of the database: (1) disk-oriented and (2) memory-oriented DBMSs. Disk-oriented DBMSs are based on the same hardware assumptions that were made in the first relational DBMSs from the 1970s, such as IBM’s System R. The design of these systems targets a two-level storage hierarchy comprising a fast but volatile byte-addressable memory for caching (i.e., DRAM) and a slow, non-volatile block-addressable device for permanent storage (i.e., SSD). These systems make the pessimistic assumption that a transaction could access data that is not in memory, and thus will incur a long delay to retrieve the needed data from disk. They employ legacy techniques, such as heavyweight concurrency-control schemes, to overcome these limitations.
Recent advances in manufacturing technologies have greatly increased the capacity of DRAM available in a single computer, but disk-oriented systems were not designed for the case where most, if not all, of the data resides entirely in memory. The result is that many of their legacy components have been shown to impede their scalability for transaction processing workloads. In contrast, the architecture of memory-oriented DBMSs assumes that all data fits in main memory, and therefore does away with the slower, disk-oriented components of the system. As such, memory-oriented DBMSs have been shown to outperform disk-oriented DBMSs. But they still have to employ heavyweight components that can recover the database after a system crash, because DRAM is volatile. The design assumptions underlying both disk-oriented and memory-oriented DBMSs are poised to be upended by the advent of NVM technologies.
Q3. Why are existing DBMSs unable to take full advantage of NVM technology?
Arulraj, Pavlo: NVM differs from other storage technologies in the following ways:
- Byte-Addressability: NVM supports byte-addressable loads and stores unlike other non-volatile devices that only support slow, bulk data transfers as blocks.
- High Write Throughput: NVM delivers more than an order of magnitude higher write throughput compared to SSD. More importantly, the gap between sequential and random write throughput of NVM is much smaller than other durable storage technologies.
- Read-Write Asymmetry: In certain NVM technologies, writes take longer to complete compared to reads. Further, excessive writes to a single memory cell can destroy it.
Although the advantages of NVM are obvious, making full use of them in a DBMS is non-trivial. Our evaluation of state-of-the-art disk-oriented and memory-oriented DBMSs on NVM shows that the two architectures achieve almost the same performance when using NVM. This is because current DBMSs assume that memory is volatile, and thus their architectures are predicated on making redundant copies of changes on durable storage. This illustrates the need for a complete rewrite of the database system to leverage the unique properties of NVM.
Q4.With NVM, which components of legacy DBMSs are unnecessary?
Arulraj, Pavlo: NVM requires us to revisit the design of several key components of the DBMS, including that of the (1) logging and recovery protocol, (2) storage and buffer management, and (3) indexing data structures.
We will illustrate it using the logging and recovery protocol. A DBMS must guarantee the integrity of a database against application, operating system, and device failures. It ensures the durability of updates made by a transaction by writing them out to durable storage, such as SSD, before returning an acknowledgment to the application. Such storage devices, however, are much slower than DRAM, especially for random writes, and only support bulk data transfers as blocks.
During transaction processing, if the DBMS were to overwrite the contents of the database before committing the transaction, then it must perform random writes to the database at multiple locations on disk. DBMSs try to minimize random writes to disk by flushing the transaction’s changes to a separate log on disk with only sequential writes on the critical path of the transaction. This method is referred to as write-ahead logging (WAL).
NVM upends the key design assumption underlying the WAL protocol, since it supports fast random writes. Thus, we need to tailor the protocol for NVM. We designed such a protocol, which we call write-behind logging (WBL). WBL not only improves the runtime performance of the DBMS, but also enables it to recover nearly instantaneously from failures. WBL achieves this by tracking what parts of the database have changed rather than how they were changed. Using this logging method, the DBMS can directly flush the changes made by transactions to the database instead of recording them in the log. By ordering writes to NVM correctly, the DBMS can guarantee that all transactions are durable and atomic. This allows the DBMS to write less data per transaction, thereby improving the NVM device’s lifetime.
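The contrast between the two protocols can be sketched as follows. This is an illustrative toy, not the authors' implementation: Python dicts and lists stand in for durable storage and the log, and all the real subtleties (ordering of NVM writes, group commit, cache-line flushes) are omitted.

```python
# Write-ahead logging: record *how* the data changed (the redo record)
# in a sequential log before the change reaches the database; recovery
# must replay this log.
def wal_commit(db, log, key, value):
    log.append(("redo", key, value))  # the log carries the data itself
    db[key] = value                   # in reality often deferred/batched

# Write-behind logging: flush the change directly to (persistent) NVM
# and log only *what* changed; recovery needs no redo pass.
def wbl_commit(db, log, key, value):
    db[key] = value                   # durable immediately on NVM
    log.append(("dirty", key))        # the log carries only the location

db1, log1 = {}, []
wal_commit(db1, log1, "k", 42)
db2, log2 = {}, []
wbl_commit(db2, log2, "k", 42)
print(log1)  # [('redo', 'k', 42)]
print(log2)  # [('dirty', 'k')]
```

The design point being illustrated: the WBL log entry is smaller and carries no tuple contents, which is what reduces write volume per transaction and makes near-instant recovery possible.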
Q5. You have designed and implemented a DBMS storage engine architectures that are explicitly tailored for NVM. What are the key elements?
Arulraj, Pavlo: The design of all of the storage engines in existing DBMSs is predicated on a two-tier storage hierarchy comprising volatile DRAM and a non-volatile SSD. These devices have distinct hardware constraints and performance properties. The traditional engines were designed to account for and reduce the impact of these differences.
For example, they maintain two layouts of tuples depending on the storage device. Tuples stored in memory can contain non-inlined fields because DRAM is byte-addressable and handles random accesses efficiently. In contrast, fields in tuples stored on durable storage are inlined to avoid random accesses because they are more expensive. To amortize the overhead for accessing durable storage, these engines batch writes and flush them in a deferred manner. Many of these techniques, however, are unnecessary in a system with a NVM-only storage hierarchy. We adapted the storage and recovery mechanisms of these traditional engines to exploit NVM’s characteristics.
For instance, consider an NVM-aware storage engine that performs in-place updates. When a transaction inserts a tuple, rather than copying the tuple to the WAL, the engine only records a non-volatile pointer to the tuple in the WAL. This is sufficient because both the pointer and the tuple referred to by the pointer are stored on NVM. Thus, the engine can use the pointer to access the tuple after the system restarts without needing to re-apply changes in the WAL. It also stores indexes as non-volatile B+trees that can be accessed immediately when the system restarts without rebuilding.
The effects of committed transactions are durable after the system restarts because the engine immediately persists the changes made by a transaction when it commits. So, the engine does not need to replay the log during recovery. But the changes of uncommitted transactions may be present in the database because the memory controller can evict cache lines containing those changes to NVM at any time. The engine therefore needs to undo those transactions using the WAL. As this recovery protocol does not include a redo process, the engine has a much shorter recovery latency compared to a traditional engine.
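As a hypothetical illustration of this undo-only recovery protocol (the names and structures below are invented for the example, not taken from the engine described in the book), recovery walks the WAL backwards and rolls back only tuples touched by uncommitted transactions; there is no redo pass because committed data is already durable in place:

```python
# Hypothetical sketch of undo-only recovery on an NVM-resident database:
# committed updates are already durable in place, so recovery only rolls
# back tuples touched by transactions that never committed.

def recover(database, wal_entries, committed_txns):
    """wal_entries: list of (txn_id, key, before_image) undo records,
    in commit order; before_image is None for an insert."""
    for txn_id, key, before in reversed(wal_entries):
        if txn_id not in committed_txns:   # undo uncommitted changes only
            if before is None:
                database.pop(key, None)    # insert by an aborted txn: remove
            else:
                database[key] = before     # restore the before-image
    return database  # no redo pass: committed data is already in place
```

Running this over a database where transaction 1 committed and transaction 2 did not would leave transaction 1's writes untouched and reverse only transaction 2's updates and inserts.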
Q6. What is the key takeaway from the book?
Arulraj, Pavlo: All together, the work described in this book illustrates that rethinking the key algorithms and data structures employed in a DBMS for NVM not only improves performance and operational cost, but also simplifies development and enables the DBMS to support near-instantaneous recovery from DBMS failures. When we started this project in 2013, it was a moonshot. We were not sure if NVM technologies would ever see the light of day, but Intel has finally started shipping NVM devices in 2019. We are excited about the impact of NVM on next-generation database systems.
Joy Arulraj is an Assistant Professor of Computer Science at Georgia Institute of Technology. He received his Ph.D. from Carnegie Mellon University in 2018, advised by Andy Pavlo. His doctoral research focused on the design and implementation of non-volatile memory database management systems. This work was conducted in collaboration with the Intel Science & Technology Center for Big Data, Microsoft Research, and Samsung Research.
Andrew Pavlo is an Assistant Professor of Databaseology in the Computer Science Department at Carnegie Mellon University. At CMU, he is a member of the Database Group and the Parallel Data Laboratory. His work is also in collaboration with the Intel Science and Technology Center for Big Data.
– Non-Volatile Memory Database Management Systems, by Joy Arulraj (Georgia Institute of Technology) and Andrew Pavlo (Carnegie Mellon University). Book, Morgan & Claypool Publishers, Copyright © 2019, 191 pages.
ISBN: 9781681734842 | PDF ISBN: 9781681734859 | Hardcover ISBN: 9781681734866
– How to Build a Non-Volatile Memory Database Management System (.PDF), by Joy Arulraj and Andrew Pavlo.
Follow us on Twitter: @odbmsorg
“Anyone who expects to have some of their work in the cloud (e.g. just about everyone) will want to consider the offerings of the cloud platform provider in any shortlist they put together for new projects. These vendors have the resources to challenge anyone already in the market.”– Merv Adrian.
I have interviewed Merv Adrian, Research VP, Data & Analytics at Gartner. We talked about the database market, the Cloud and the 2018 Gartner Magic Quadrant for Operational Database Management Systems.
Q1. Looking Back at 2018, how has the database market changed?
Merv Adrian: At a high level, much is similar to the prior year. The DBMS market returned to double digit growth in 2017 (12.7% year over year in Gartner’s estimate) to $38.8 billion. Over 73% of that growth was attributable to two vendors: Amazon Web Services and Microsoft, reflecting the enormous shift to new spending going to cloud and hybrid-capable offerings. In 2018, the trend grew, and the erosion of share for vendors like Oracle, IBM and Teradata continued. We don’t have our 2018 data completed yet, but I suspect we will see a similar ballpark for overall growth, with the same players up and down as last year. Competition from Chinese cloud vendors, such as Alibaba Cloud and Tencent, is emerging, especially outside North America.
Q2. What most surprised you?
Merv Adrian: The strength of Hadoop. Even before the merger, both Cloudera and Hortonworks continued steady growth, with Hadoop as a cohort outpacing all other nonrelational DBMS activity from a revenue perspective. With the merger, Cloudera becomes the 7th largest DBMS vendor by revenue, and usage and intentions data suggest continued growth in the year ahead.
Q3. Is the distinction between relational and nonrelational database management still relevant?
Merv Adrian: Yes, but it’s less important than the cloud. As established vendors refresh and extend product offerings that build on their core strengths and capabilities to provide multimodel DBMSs and/or broad portfolios of both, the “architecture” battle will ramp up. New disruptive players and existing cloud platform providers will have to battle established vendors where they are strong – so on the DMSA side, players like Snowflake will have more competition, and on the OPDBMS side, relational and nonrelational providers alike – such as EnterpriseDB, MongoDB, and DataStax – will battle more for a cloud foothold than a “nonrelational” one.
Specific nonrelational plays like Graph, Time Series, and ledger DBMSs will be more disruptive than the general “nonrelational” category.
Q4. Artificial intelligence is moving from sci-fi to the mainstream. What is the impact on the database market?
Merv Adrian: Vendors are struggling to make the case that much of the heavy lifting should move to their DBMS layer with in-database processing. Although it’s intuitive, it represents a different buyer base, with different needs for design, tools, expertise and operational support. They have a lot of work to do.
Q5. Recently Google announced BigQuery ML. Machine Learning in the (Cloud) Database. What are the pros and cons?
Merv Adrian: See the above answer. Google has many strong offerings in the space – putting them together coherently is as much of a challenge for them as anyone else, but they have considerable assets, a revamped executive team under the leadership of Thomas Kurian, and are entering what is likely to be a strong growth phase for their overall DBMS business. They are clearly a candidate to be included in planning and testing.
Q6. You recently published the 2018 Gartner Magic Quadrant for Operational Database Management Systems. In a nutshell. what are your main insights?
Merv Adrian: Much of that is included in the first answer above. What I didn’t say there is that the degree of disruption varies between the Operational and DMSA wings of the market, even though most of the players are the same. Most important, specialists are going to be less relevant in the big picture as the converged model of application design and multimodel DBMSs make it harder to thrive in a niche.
Q7. To qualify for inclusion in this Magic Quadrant, vendors had to support two of the following four use cases: traditional transactions, distributed variable data, operational analytical convergence, and event processing or data in motion. What is the rationale behind this inclusion choice?
Merv Adrian: The rationale is to offer our clients the offerings with the broadest capabilities. We can’t cover all possibilities in depth, so we attempt to reach as many as we can within the constraints we design to map to our capacity to deliver. We call out specialists in various other research offerings such as Other Vendors to Consider, Cool Vendor, Hype Cycle and other documents, and pieces specific to categories where client inquiry makes it clear we need to have a published point of view.
Q8. How is the Cloud changing the overall database market?
Merv Adrian: Massively. In addition to functional and architectural disruption, it’s changing pricing, support, release frequency, and user skills and organizational models. The future value of data center skills, container technology, multicloud and hybrid challenges and more are hot topics.
Q9. In your Quadrant you listed Amazon Web Services, Alibaba Cloud and Google. These are no pure database vendors, strictly speaking. What role do they play in the overall Operational DBMS market?
Merv Adrian: Anyone who expects to have some of their work in the cloud (e.g. just about everyone) will want to consider the offerings of the cloud platform provider in any shortlist they put together for new projects. These vendors have the resources to challenge anyone already in the market. And their deep pockets, and the availability of open source versions of every DBMS technology type that they can use – including creating their own versions with optimizations for their stack and pre-built integrations to upstream and downstream technologies required for delivery – make them formidable.
Q10. What are the main data management challenges and opportunities in 2019?
Merv Adrian: Avoiding silver bullet solutions, sticking to sound architectural principles based on understanding real business needs, and leveraging emerging ideas without getting caught in dead end plays. Pretty much the same as always. The details change, but sound design and a focus on outcomes remain the way forward.
Qx. Anything else you wish to add?
Merv Adrian: Fasten your seat belt. It’s going to be a bumpy ride.
Merv Adrian, Research VP, Data & Analytics. Gartner
Merv Adrian is an Analyst on the Data Management team following operational DBMS, Apache Hadoop, Spark, nonrelational DBMS and adjacent technologies. Mr. Adrian also tracks the increasing impact of open source on data management software and monitors the changing requirements for data security in information platforms.
“When software reliability issues creep up in production, it’s a finger-pointing moment between suppliers and users. Usually, what’s missing is simple: information.” – Barry Morris.
“Most organisations also suffer from more continuous disruption caused by a steady stream of less dramatic issues. Intermittent software problems particularly cause a lot of user frustration and dissatisfaction.” — Dale Vile.
I have interviewed Barry Morris, CEO of Undo, and Dale Vile, Distinguished Analyst at Freeform Dynamics. The main topic of the interview is enterprise software reliability. The interview relates to recent research, conducted by Freeform Dynamics, on the challenges and impact of troubleshooting software failures in production.
Q1. How often do software-related failures occur in the enterprise?
Dale Vile: When we hear the term ‘software failure’, we tend to think of major incidents that bring down a whole department or result in significant data loss. Our study suggests that this kind of thing happens around once every couple of years on average in most organisations – at least that’s what people admit to when surveyed. The research also tells us, however, that most organisations suffer from more continuous disruption caused by a steady stream of less dramatic issues. Intermittent software problems in particular cause a lot of user frustration and dissatisfaction.
Q2. What are the common reasons why major system failures and/or incidents leading to loss of data are top of the list when it comes to the potential for damage and disruption?
Dale Vile: Software is now embedded in most aspects of most businesses. A telling observation is that over the years, the percentage of applications considered to be business critical has steadily increased. At the turn of the century, it was usual for organisations to tell us that around 10% of their application portfolio was considered critical. Nowadays, it’s more likely to be 50% or more. This is why it’s so disruptive and potentially damaging when software failures occur – even relatively brief or minor ones.
Barry Morris: The study we commissioned shows that 83% of enterprise customers consider data corruption issues to be highly disruptive to their business. In the database business, that’s probably closer to 100%. Take SAP HANA, Oracle, Teradata or other data management system vendors: they have clients paying them millions of dollars per year for a reliable and predictable system. Consequences are high if the wrong row is returned, there’s a memory corruption issue, or data goes missing. These types of clients have little tolerance for that. At best, your reputation in the industry and software renewals will be on the line. At worst, you’re talking about plummeting stock prices wiping a few million off the value of your business.
Q3. What are the most important challenges to achieve software that runs reliably and predictably?
Dale Vile: It starts with software quality management in the development or engineering environment. Most of the challenges we see here are to do with adjusting testing and quality management processes to cope with modern approaches such as Agile, DevOps and Continuous Delivery. A lot of people now refer to ‘Continuous Testing’ in this context, and understandably put a lot of emphasis on automation. But even software makers are on a journey here. Our research tells us that few have it fully figured out at the moment. Beyond this, effective testing in the live environment is also essential.
The problem here, though, is that the complex and dynamic nature of today’s enterprise infrastructures makes it very hard or impossible to test every use case in every situation. And even if you could, subsequent changes to the environment, which an application team may not even be aware of, could easily interfere with the solution and cause instability or failure. There’s a lot to think about, and quality management is only the start.
Q4. What factors are influencing users’ satisfaction and confidence with respect to software?
Dale Vile: Confidence and satisfaction stem from users and business stakeholders perceiving that those responsible are working together competently and effectively to resolve issues when they occur in a timely manner. A fundamental requirement here is openness and honesty, and a willingness to take responsibility. Defensiveness, evasion and finger-pointing, however, tend to undermine confidence and satisfaction. Such behaviour can be cultural; but very often it’s more a symptom of inadequate skills, processes and/or tools within either the supplier or the customer environment. When such shortfalls in capability exist, the inevitable result is an elongated troubleshooting and resolution cycle. This is the real killer of confidence and satisfaction.
Barry Morris: When software reliability issues creep up in production, it’s a finger-pointing moment between suppliers and users. Usually, what’s missing is simple: information. Right now, to obtain that information, suppliers ask 20 questions: what did you do, how did you do it, in what environment, and so on. There’s a long period of communication and diagnostics, which is frustrating and time-wasting on both sides. The supplier/user relationship at that moment of firefighting would be massively improved if there were data on the table and engineers could just get on with fixing the problem. I see data-driven defect diagnosis as the key to improving customer satisfaction.
Q5. How effective is software quality management in the enterprise?
Dale Vile: I’ll answer the question in relation to software *reliability* management, which is a function of inherent software quality, effective implementation, and competent operation and support thereafter. We generally find that each group or team tends to do reasonably well in its specific area, but challenges often exist because the various silos are disconnected. What many are lacking is good communication and mutual understanding between those involved in the software lifecycle. Lack of adequate visibility and effective feedback is also a common issue. Most organisations on both the supplier and enterprise side are working on improvement, but gaps frequently exist in these kinds of areas, which in turn impact software reliability.
Barry Morris: Despite all the processes and tools put in place in dev/test, we still see mission-critical applications being shipped with defects. Worse, they are being shipped with known defects – some of which could turn disastrous. Ticking time bombs really. Why? Because of tricky intermittent failures that no-one can get to the bottom of.
So actually, in a lot of cases, I don’t think that the software quality management practices I see are as effective as they could be.
Q6. How effective are the commonly used troubleshooting and diagnostics techniques?
Dale Vile: As mentioned above, the most common problems I see here are to do with disconnects between the various teams involved. Within the engineering environment, this is often down to developers and quality teams working in silos with inefficient handoffs and ineffective feedback mechanisms. In the enterprise context, it’s the disconnect between application teams, operations staff and even service desk personnel. Added to this, many also struggle to join the dots to figure out what’s going on when problems occur, and communicate insights back to developers so they can take appropriate action. Against this background, it’s not surprising that over 90% of both software makers and enterprises report that issues frequently go undiagnosed and come back to bite in a disruptive and often expensive manner.
Barry Morris: Sometimes, traditional troubleshooting methods like printf, logging, or core dump analysis are the right solutions if the team is confident they can isolate the issue quickly. Static and dynamic analysis tools are also good options for certain classes of failures. But in more complex situations, traditional debugging methods don’t help much. If anything, they lead you down the wrong path with false positives and waste time, which leads to serious client dissatisfaction.
Q7. You wrote in your study that the big enemies of stakeholder and user satisfaction are delay and uncertainty. What remedies exist to alleviate this?
Dale Vile: Beyond the kind of processes and tools we have mentioned…it boils down to effective communication and adequate visibility.
Barry Morris: I think that next-gen troubleshooting systems like software recording technology (such as what we offer at Undo) offer a unique solution to the problem of software reliability. Once we move away from guesswork and use data-driven insight instead, application vendors will be able to resolve the most challenging software defects faster than they have ever been able to do before. The unnecessary delays and uncertainty will be a thing of the past.
Q8. You wrote in your study that software failures are inevitable; it is what happens when they occur that really matters. Can you please explain what you mean here?
Dale Vile: No one expects perfection; not even business users and stakeholders. So provided the software isn’t wildly buggy or unstable, it mostly comes down to how well you respond when problems occur. What annoys people the most in this respect is not knowing what’s going on. Informing someone that you know what the problem is, but it’s going to take some time to fix, is much better than telling them you have no idea what’s causing their problem. Even better if you can give them a timescale for a resolution, and/or a workaround that doesn’t represent a major inconvenience. Interestingly, if you diagnose and fix a problem quickly, the research suggests that you can actually turn a software incident into a positive experience that enhances satisfaction, confidence and mutual respect.
Q9. What remedies are available for that?
Dale Vile: A big enabler here is a modern approach to diagnostics: having the tooling and the processes in place that allow you to troubleshoot effectively in a complex production environment. Traditional approaches are often undermined by the sheer number of moving parts and dependencies, so you need a way to deal with that. This is where solutions such as program execution recording and replay capability (aka software flight recording technology) can help.
Q10. You wrote in your study that switching decisions are often down to simple economics. Can you please explain what you mean?
Dale Vile: If an application is continually causing problems, the result is increased cost. At one level, this could be down to the additional resource required to support, maintain and troubleshoot software defects. Often more significant is the end-user productivity hit that stems from people not being able to do their jobs properly and efficiently.
There are then various kinds of opportunity costs, e.g. weeks spent battling unreliable software is time not being spent adding value to the business. In extreme cases, such as when customer facing systems are involved, repeated failure can lead to reputational damage, loss of customer confidence, and ultimately lost revenue and market share. It depends on the organisation and the specific application; but in every case there comes a point when the cost to the business of living with unreliable software is ultimately higher than the cost of switching.
Q11. What are the most effective solutions to software diagnostic processes?
Dale Vile: Solutions that work holistically. It’s about capturing all of the relevant events, inputs and variables, especially at execution time in the production environment; then providing actionable data and insights for engineers to facilitate rapid diagnosis and resolution.
Barry Morris: Dale is right. The most effective solutions to software failure diagnostic are those that provide full visibility and definitive data-driven insight into what your software really did before it crashed or resulted in incorrect behaviour. Software recording technology will speed up time-to-resolution by a factor of 10. But the beauty of this kind of approach is that you can now diagnose even the hardest of bugs that you couldn’t resolve before – just because a recording represents the reproducible test case you couldn’t obtain before.
Q12. What are the main conclusions of your study?
Dale Vile: In summary, in a world where software is critical to the business, applications must be reliable; otherwise damaging and costly disruption will result. With this in mind, it’s important to be able to respond quickly and effectively when problems occur. This shines a clear spotlight on diagnostics – an area in which many have clear room for improvement. New approaches and tools are required here, especially for troubleshooting in complex production environments.
The good news is that technology is emerging that can help, but at the moment we see an awareness gap. Our recommendation is therefore for anyone involved in software delivery and support to get up to speed on what’s available, e.g. from companies like Undo and others.
Qx. Anything else you wish to add?
Dale Vile: When you get right down to it, software reliability is a business issue. One of the most striking findings from the research for me is the level of willingness among enterprise customers to switch solutions and suppliers when the pain and cost of unreliable software gets too high. This should be a wake-up call for ISVs and other software makers, not just to manage product quality, but also to work proactively with customers on preventative diagnostic and remedial activity.
Barry Morris: As systems become more and more complex, troubleshooting is not getting any easier…so it has to be data-driven.
With over 25 years’ experience working in enterprise software and database systems, Barry is a prodigious company builder, scaling start-ups and publicly held companies alike. He was CEO of distributed service-oriented architecture (SOA) specialists IONA Technologies between 2000 and 2003 and built the company up to $180m in revenues and a $2bn valuation.
A serial entrepreneur, Barry founded NuoDB in 2008 and most recently served as its Executive Chairman. He was appointed CEO of Undo in September 2018 to lead the company’s high-growth phase.
Dale is a co-founder of Freeform Dynamics, and today runs the company.
He oversees the organisation’s industry coverage and research agenda, which tracks technology trends and developments, along with IT-related buying behaviour among mainstream enterprises, SMBs and public sector organisations.
During his 30 year career, he has worked in enterprise IT delivery with companies such as Heineken and Glaxo, and has held sales, channel management and international market development roles within major IT vendors such as SAP, Oracle, Sybase and Nortel Networks. He also spent a couple of years managing an IT reseller business for Admiral Software.
Dale has been involved in IT industry research since the year 2000 and has a strong reputation for original thinking and alternative perspectives on the latest technology trends and developments. He is a widely published author of books, reports and articles, and is an authoritative and provocative speaker.
Hosted by Prof. Zicari of ODBMS.org and featuring Undo CEO Barry Morris and Dale Vile, Distinguished Analyst at Freeform Dynamics, this webinar recording covers:
– New market research – the frequency, types, and economic impact of defects on users and developers of enterprise software.
– The importance of fast diagnostics and swift remediation when problems occur in production.
– How to increase enterprise software reliability with software flight recording technology
“The challenges, impact and solutions to troubleshooting software failures“, Freeform Dynamics. Access the full study report here (LINK registration required).
– On Software Quality. Q&A with Alexander Boehm (SAP) and Greg Law (Undo). ODBMS.org, November 26, 2018.
Dr. Alexander Boehm is a database architect working on SAP´s HANA in-memory database management system. Greg Law is Co-founder and CTO of Undo.
“Perhaps less obvious is how role definitions in an organization change as scale increases. Once rare tasks that were just a small part of one team’s responsibilities become so common that they are a full-time job for someone. At that point, one either needs to create automation for the task, or a new team needs to be assembled (or hired) to perform that task full time. ” — Eric Tune
I have interviewed Eric Tune, Senior Staff Engineer at Google. We talked about Kubernetes. Eric has been a Kubernetes contributor since 1.0
Q1. What are the main technical challenges in implementing massive-scale environments?
Eric Tune: Whether working at small or massive scale, the high-level technical goals don’t change: security, developer velocity, efficiency in use of compute resources, supportability of production environments, and so on.
As scale increases, there are some fairly obvious discontinuities, like moving from an application that fits on a single-machine to one that spans multiple machines, and from a single data center or zone to multiple regions. Quite a bit has been written about this. Microservices in particular can be a good fit because they scale well to more machines and more regions.
Perhaps less obvious is how role definitions in an organization change as scale increases. Once rare tasks that were just a small part of one team’s responsibilities become so common that they are a full-time job for someone. At that point, one either needs to create automation for the task, or a new team needs to be assembled (or hired) to perform that task full time. Sometimes, it is obvious how to do this. But, when this repeats many times, one can end up with a confusing mess of automation and tickets, dragging down development velocity and confounding attempts to analyze security and debug systemic failure.
So, a key challenge is finding the right separation of responsibilities so that multiple pieces of automation and multiple human teams collaborate well. Doing this requires not only a broad view of an organization’s current processes and responsibilities around development, operations, and security, but also an understanding of which assumptions behind those are no longer valid.
Kubernetes can help here by providing automation for exactly the types of tasks that become toilsome as scale increases. Several of its creators have lived through organic growth to massive scale. Kubernetes is built from that experience, with awareness of the new roles that are needed at massive scale.
Q2. What is Kubernetes and why is it important?
Eric Tune: First, Kubernetes is one of the most popular ways to deploy applications in containers. Containers make the act of maintaining the machine & operating system a largely separate process from installing and maintaining an application instance – no more worrying about shared library or system utility version differences.
Second, it provides an abstraction over IaaS: VMs, VM images, VM types, load balancers, block storage, auto-scalers, etc. Kubernetes runs on numerous clouds, on-premises, and on a laptop. Many complex applications, such as those consisting of many microservices, can be deployed onto any Kubernetes cluster regardless of the underlying infrastructure. For an organization that may want to modernize their applications now, and move to cloud later, targeting Kubernetes means they won’t need to re-architect when they are ready to move. Third, Kubernetes supports infrastructure-as-code (IaC). You can define complex applications, including storage, networking, and application identity, in a common configuration language, called the Kubernetes Resource model. Unlike other IaC systems, which mostly support a “single-user” model, Kubernetes is designed for multiple users. It supports controlled delegation of responsibility from an ops team to a dev team.
Fourth, it provides an opinionated way to build distributed system control planes, and to extend the APIs and infrastructure-as code type system. This allows solution vendors and in-house infrastructure teams to build custom solutions that feel like they are first class parts of Kubernetes.
Q3. Who should be using Kubernetes?
Eric Tune: If your organization runs Linux-based microservices and has explored container technology, then you are ready to try Kubernetes.
Q4. You have been a Kubernetes contributor since 1.0 (4 years). What did you work on specifically?
Eric Tune: During the first year, I worked on whatever needed to be done, including security (namespaces, service accounts, authentication and authorization, resource quota), performance, documentation, testing, API review and code review.
In those first years, people were mostly running stateless microservices on Kubernetes. In the second year, I worked to broaden the set of applications that can run on Kubernetes. I worked on the Job and CronJob APIs of Kubernetes, which support basic batch computation, and the StatefulSet API, which supports databases and other stateful applications. Additionally, I worked with the Helm project on Charts (easy-to-install applications for Kubernetes) and with the Spark open source community to get Spark running on Kubernetes.
Starting in 2017, Kubernetes interest was growing so quickly that the project maintainers could accept only a fraction of the new features that were proposed. The answer was to make Kubernetes extensible so that new features could be built “out of the core.” I worked to define the extensibility story for Kubernetes, particularly for Custom Resource Definitions (CRDs) and Webhooks. The extensibility features of Kubernetes have enabled other large projects, such as Istio and Knative, to integrate with Kubernetes with lower overhead for the Kubernetes project maintainers.
Currently, I lead teams which work on both Open Source Kubernetes and Google Cloud.
Q5. What are the main challenges of migrating several microservices to Kubernetes?
Eric Tune: Here are three challenges I see when migrating several microservices to Kubernetes, and how I recommend handling them:
- Remove Ordering Dependencies: Say microservice C depends on microservices A and B to function normally. When migrating to declarative configuration and Kubernetes, the startup order for microservices can become variable, where previously it was ordered (e.g. by a script). This can cause unexpected behaviors. For example, microservice C might log errors at a high rate or crash if A is not ready yet. A first reaction is sometimes “how can I guarantee ordering of microservice startup?” My advice is not to impose order, but to change the problematic behavior. For example, C could be changed to return some response to a request even when A and B are unreachable. This is not really a Kubernetes-specific requirement – it is a good practice for microservices, as it allows for graceful recovery from failures and for autoscaling.
- Don’t Persist Peer Network Identity: Some microservices permanently record the IP addresses of their peers at startup time, and then never expect those addresses to change. That’s not a great match for the Kubernetes approach to networking. Instead, resolve peer addresses using their domain names, and re-resolve after a disconnection.
- Plan ahead for Running in Parallel: When migrating a complex set of microservices to Kubernetes, it’s typical to run the entire old environment and the new (Kubernetes) environment in parallel. Make sure you have load replay and response diffing tools to evaluate a dual environment setup.
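The first bullet’s advice – change C so it answers even when its dependencies are unreachable – can be sketched as follows. This is an illustrative example, not code from the interview; the service name and URL are hypothetical:

```python
import urllib.request
import urllib.error

def fetch_enrichment(url, timeout=1.0):
    """Call a dependency (a hypothetical microservice A); return None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode()
    except (urllib.error.URLError, OSError):
        return None  # dependency not ready yet -- degrade instead of crashing

def handle_request(item_id, enrichment_url="http://service-a/enrich"):
    """Serve a partial response when the dependency is down, a full one otherwise."""
    enrichment = fetch_enrichment(enrichment_url)
    if enrichment is None:
        # Partial but valid answer: no hard startup-order dependency on service A.
        return {"item": item_id, "enrichment": None, "degraded": True}
    return {"item": item_id, "enrichment": enrichment, "degraded": False}
```

Because C always returns something well-formed, its startup order relative to A and B no longer matters.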
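The second bullet – resolve peers by name and re-resolve after disconnection – amounts to looking the name up again on every reconnect attempt rather than caching an IP at startup. A minimal sketch (IPv4 only, for simplicity):

```python
import socket

def connect_to_peer(hostname, port, attempts=3):
    """Connect to a peer, re-resolving its name on every attempt, so a peer
    that was rescheduled (and got a new IP) is found again."""
    last_err = None
    for _ in range(attempts):
        try:
            # Fresh lookup each time -- never reuse an address cached at startup.
            addr = socket.getaddrinfo(hostname, port, family=socket.AF_INET,
                                      proto=socket.IPPROTO_TCP)[0][4]
            return socket.create_connection(addr, timeout=2.0)
        except OSError as err:
            last_err = err
    raise last_err
```

In a Kubernetes cluster the `hostname` would typically be a Service DNS name, which stays stable while the pod IPs behind it change.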
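For the third bullet, the response-diffing side of such tooling can be as simple as replaying each captured request against both environments and comparing the decoded bodies. A toy sketch of the comparison step, with an assumed allowlist of keys that legitimately differ between environments:

```python
def diff_responses(old, new, ignore_keys=frozenset({"timestamp", "request_id"})):
    """Compare two decoded JSON responses field by field, skipping keys that
    legitimately differ between environments (e.g. timestamps, request IDs).
    Returns a dict of {key: (old_value, new_value)} for every mismatch."""
    keys = (set(old) | set(new)) - ignore_keys
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}
```

An empty result for every replayed request is good evidence that the Kubernetes environment behaves like the old one.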
Q6. How can Kubernetes scale without increasing the size of the ops team?
Eric Tune: Kubernetes is built to respond to many types of application and infrastructure failures automatically – for example, slow memory leaks in an application, or kernel panics in a virtual machine. Previously, this kind of problem might have required immediate attention. With Kubernetes as the first line of defense, ops can wait for more data before taking action. This in turn supports faster rollouts, as you don’t need to soak as long if you know that slow memory leaks will be handled automatically, and you can fix by rolling forward rather than back.
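The “first line of defense” behavior described here is essentially a liveness check paired with an automatic restart. The following is a toy illustration of the idea only – in Kubernetes the kubelet does this via liveness probes configured on the pod; names and thresholds below are hypothetical:

```python
def supervise(probe, restart, max_failures=3, max_cycles=10):
    """Restart a workload after max_failures consecutive failed liveness
    checks -- roughly what Kubernetes does for, e.g., a slowly leaking
    process -- and return how many restarts occurred."""
    failures = 0
    restarts = 0
    for _ in range(max_cycles):
        if probe():
            failures = 0  # healthy again: reset the consecutive-failure count
        else:
            failures += 1
            if failures >= max_failures:
                restart()
                restarts += 1
                failures = 0
    return restarts
```

Because this loop handles the failure automatically, the ops team sees a restart count to investigate later rather than a page to answer immediately.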
Some ops teams also face multiple deployment environments, including multi-cloud, hybrid, or varying hardware in on-premises datacenters. Kubernetes hides some differences between these, reducing the number of configuration variations that are needed.
A pattern I have seen is role specialization within ops teams, which can bring efficiencies. Some members specialize in operating the Kubernetes cluster itself, what I call a “Cluster Operations” role, while others specialize in operating a set of applications (microservices). The clean separation between infrastructure and application – in particular the use of Kubernetes configuration files as a contract between the two groups – supports this separation of duties.
Finally, if you are able to choose a hosted version of Kubernetes such as Google Container Engine (GKE), then the hosted service takes on much of the Cluster Operations role. (Note: I work on GKE.)
Q7. On-premises, hybrid, or public cloud infrastructure: which do you think is better for running Kubernetes?
Eric Tune: Usually factors unrelated to Kubernetes will determine if an application needs to run on-premises, such as data sovereignty, latency concerns or an existing hardware investment. Often some applications need to be on-premises and some can move to public cloud. In this case you have a hybrid Kubernetes deployment, with one or more clusters on-premises, and one or more clusters on public cloud. For application operators and developers, the same tools can be used in all the clusters. Applications in different clusters can be configured to communicate with each other, or to be separate, as security needs dictate. Each cluster is a separate failure domain. One does not typically have a single cluster which spans on-premises and public cloud.
Q8. Kubernetes is open source. How can developers contribute?
Eric Tune: We have 150+ associated repositories that are all looking for developers (and other roles) to contribute. If you want to help but aren’t sure what you want to work on, then start with the Community ReadMe, and come to the community meetings or watch a rerun. If you think you already know what area of Kubernetes you are interested in, then start with our contributors guide, and attend the relevant Special Interest Group (SIG) meeting.
Dr. Eric Tune is a Senior Staff Engineer at Google. He leads dozens of engineers working on Kubernetes and GKE. He has been a Kubernetes contributor since 1.0. Previously at Google he worked on the Borg container orchestration system, drove company-wide compute efficiency improvements, created the Google-wide Profiling system, and helped expand the size of Google’s search index. Prior to Google, he was active in computer architecture research. He holds computer engineering degrees (PhD, MS, BS) from UCSD.