Q&A with Data Engineers: Andy Pavlo
Andy Pavlo is an Assistant Professor of Databaseology in the Computer Science Department at Carnegie Mellon University. At CMU, he is a member of the Database Group and the Parallel Data Laboratory. His work is also in collaboration with the Intel Science and Technology Center for Big Data.
Q1. What are the main technical challenges you typically face in this era of Big Data Analytics?
I think that we are in the golden era of databases. 20 years ago there were only a small handful of DBMSs available (mostly proprietary). But now we have so many choices of systems and tools to ingest, store, and analyze data. And what is also nice to see is that many of these systems are open-source and are backed by dedicated communities. Therefore, the unsolved research challenges in “Big Data” are not directly related to the development of new system architectures. We know how to build high-performance OLTP and OLAP systems for today’s hardware and the hardware that we are likely to have in the near future.
Instead, I contend that the challenges that still remain today are in data cleaning and system management. Automation is the key solution for both of these problem domains. It is also part of the overarching question of trying to understand the opportunities and limitations of AI/ML.
Q2. How do you gain the knowledge and skills across the plethora of technologies and architectures available?
I teach my students that tutorials, blog posts, and papers are nice for getting started in a new area, but you can only learn so much from just reading what other people have done. The best way to truly understand a topic in database systems is to “do” something. Pick a real-world problem that you want to solve (either at your job or for a hobby project) and then go about building the system that solves it for you.
The first incarnation of your system will not be the fastest nor will it be functionally complete, but the process will require you to think about and understand the different technologies that are out there. It will give you real hands-on experience with them and it will help you understand their advantages, limitations, and quirks. Once you’ve done this, then you can go back to look at the literature again because now you will know what are the right questions you should ask about the existing systems and tools.
Q3. What lessons did you learn from using shared-nothing scale-out architectures, which claim to allow for support of large volumes of data?
I have spent the last decade working on distributed OLTP systems (NewSQL). As in all things with databases, the efficacy of a system’s architecture is highly dependent on the workload. Our recently published VLDB 2017 paper evaluates different concurrency control protocols for OLTP workloads running on a shared-nothing DBMS testbed. The results from this study show that all of the protocols suffer from contention issues that prevent the system from scaling when transactions access data at multiple nodes. What is especially telling is that our experiments were a best case scenario where all transactions executed as stored procedures and their read/write set was known ahead of time. This limitation of shared-nothing architectures for OLTP has been known for decades, but our study was the first attempt to examine these issues in detail. In some scenarios, a high-performance OLTP DBMS that is running on a single node is preferable to a scale-out system. Others have reported similar results in data processing workloads.
My current research agenda is focused on improving performance for OLTP and OLAP workloads on a single-node DBMS. In a few years I will look again at multi-node deployments. I think, however, that given the prevalence of cloud-based deployments, a distributed DBMS will have to also include shared-disk storage (e.g., HDFS, Amazon S3). Such systems are no longer considered to be truly shared-nothing.
Q4. What are in your opinion the differences between capacity, scale, and performance?
At first glance these may seem like separate concepts in a database system, but they are actually closely linked. As such, one cannot just consider one without the others.
Capacity is typically thought of as the amount of data that a system can store. A system’s storage capacity is useless if you cannot put data in it and execute queries on that data in a timely manner. This means that you want a high-performance query engine. How to determine whether a system has enough performance depends on how you use it. For storing data, this means that the system can ingest new information into the database at the rate that outside entities (e.g., IoT sensors, web applications) push changes to it. Analyzing that data is a bit more fungible, as it depends on who is executing the query (e.g., a human versus a computer), how much data they want to access, and how “fresh” that data needs to be.
A DBMS that satisfies the above storage and performance requirements is achievable through a scale-up or scale-out architecture. For some workloads a single machine is ideal, while for other applications the database is better to be partitioned across multiple machines.
Q5. There are several types of new data stores including NoSQL, XML-based, Document-based, Key-value, Column-family and Graph data stores. What is your experience with those?
NoSQL is an imprecise term that conflates storage and retrieval concepts in DBMSs, thus I think it’s important that we take them one at a time. There is already plenty of literature that discusses the original motivation and history of the NoSQL movement so it’s not worth repeating here.
For storage, I believe that all of the data models listed here (XML, Document, Key-value, Column-family, Graph) can be represented in a relational DBMS. There is nothing unique about them that require a specialized storage engine. I (along with others) have argued that every data model except for large matrices/arrays for machine learning workloads can be modeled in a relational DBMS. With the exception of MongoDB and some of the graph stores, you see that the major NoSQL systems now support the relational model or something that closely resembles it.
Now given that one can store these NoSQL databases on a relational DBMS, we next need to discuss at how an application accesses the database. It is important to note that a DBMS’s data model is (mostly) separate from its access API.
That means for some applications, a relational query language (i.e., SQL) may not be the most natural or ideal.
I agree that some of the NoSQL systems have better support for writing queries that access data in their underlying data model. For example, a graph DBMS will provide built-in functions to traverse the graph (e.g., calculate the shortest path) directly on the server without needing to make multiple round-trips to the client. This means that instead of the application invoking multiple queries to traverse the graph, all of the processing is done on the server.
But this is just engineering; there is nothing unique about these operations that prevent them from being written in a relational DBMS using recursive functions that perform the query logic directly on the server. Furthermore, there are now several extensions and wrappers for relational DBMSs that provide support for storing and retrieving data (e.g., MSSQL GraphDB, IBM DB2 RDF).
Q6. Do you think Hadoop is suitable for the operational side of data?
If by Hadoop you mean HDFS, then there is nothing inherently wrong with using that for an operational DBMS. The system has to use a log-structured storage architecture since HDFS is append-only (i.e., it does not support in place updates). This is what HBase does. The slowness of writing data to HDFS is because it is a distributed file system; any DBMS running on a distributed storage system will have the same issue.
If by Hadoop you mean the MapReduce execution framework, then it is not suitable because it is inherently a batch-oriented programming model.
Q7. ACID, BASE, CAP. Are these concepts still important?
Yes. Different applications will always have different requirements that may not be met by either an ACID or BASE DBMS. Unless you know what you are doing, you are probably better off with choosing an ACID system because it will (mostly) guarantee the integrity of your data.
Q8. Learning a new technology and developing the knowledge is different than developing skills. Skills are perishable. How do you handle this?
Understanding the high-level concepts is more important than knowing how to use a specific piece of software. The syntax, configuration, and nuances of tools will always be different. If you understand what’s truly going on underneath the covers, then you will be able to pick up new skills quickly.
Q9. Is there any benefit of learning MapReduce?
For the average developer or data scientist, at this point I do not see any reason. There are enough SQL-on-Hadoop engines available (e.g., Presto, Hive, Impala, HAWQ) that make easy to access HDFS-resident data.
Q10. How data management, data science, and machine learning/AI relate to each other?
There are no exact definitions for these terms, but my general understanding is the following: (1) data management is how to collect, store, and query data, (2) data science is how to extrapolate new knowledge from existing data, and (3) machine learning is how to generate models that can predict trends in data from existing data.
Achieving successful outcomes in the second two fields are highly dependent on the first. If you cannot store data efficiently and access it in a variety of ways, then you are not going to have enough prior observations to guide your algorithms correctly.
Q11. What are the unsolved data management challenges that arise from the increased interest in AI?
I think an important unsolved problem is how to execute complex ML pipelines directly on operational databases. I don’t necessarily mean running a ML framework like Spark, TensorFlow, or Torch on the same hardware as the DBMS, but rather having them access the data natively. The most common practice now is to copy the data out to a separate storage system and then execute the ML programs there. The reason why this is difficult is because machine learning algorithms operate on arrays/matrices, whereas operational DBMSs store the database in tables. Previous attempts at in-database ML, such as Apache Madlib, do execute directly inside the DBMS but still requires the data to be transformed into the proper format first.
Having this ability would allow organizations to update their models with the latest information without needing to use an ETL process to copy the data to another location.
Qx Anything else you wish to add?
I think that the next major trend in databases will be in video analytics. Video data is predicted to be 80% of public Internet traffic by 2019, but currently there is no system that is able to extract meaning from them. The current trend is for data scientists to build pipelines from scratch to analyze visual data. What is needed is a system that can provide the same ease-of-use as relational DBMSs. Video analytics will be important in a variety of fields, including public policy, medicine, science, and business.
I predict that these future video analysis systems will use techniques from probabilistic databases, approximate query processing, crowd sourcing, and deep learning. They will also require us to develop new high-dimensional indexes. The database portion of the system will only store meta-data about video and any features that it extracts from it. One obviously does not want to store this video data in a traditional DBMS. Thus, these systems will need to support external (and slower) storage that are better suited for large, read-only data (e.g., HDFS, Amazon S3).