On Vertica Database, Apache ORC and Parquet. Q&A with Deepak Majeti
Q1. Can you tell us about your work extending Vertica onto the Hadoop ecosystem?
The Vertica Analytics Platform was founded in 2005 by Turing Award-winner Michael Stonebraker and his colleagues at MIT and other institutions. Vertica is as a Massively Parallel Processing (MPP) database designed to query large volumes of data where traditional databases failed. Over the years, Vertica’s popularity grew significantly as more and more enterprise customers came to rely on it as a high-performance data warehouse. Alongside the growth of Vertica came the advent of Data Lakes – used to process and store large volumes of data cost-effectively. With these data lakes, there is a need to separate the compute and storage in order to build analytical tools that are elastic and can interact with each other.
I posted a detailed blog on this topic here “How the Separation of Compute and Storage Impacts your Big Data Analytics Way of Life”. Many companies have come to depend on open source technologies such as Hadoop to manage their data lake. In order for Vertica to efficiently interact with this open-source technology and bring SQL analytic query engine capabilities to the data lake, we started the Vertica SQL on Hadoop project at our Pittsburgh office.
My work on extending Vertica onto the Hadoop ecosystem has involved enabling Vertica to efficiently communicate with HDFS, Hadoop’s distributed storage system. Additionally, to achieve this goal, I extended Vertica to work with Hadoop’s security layers. I designed and implemented the “data load” and “data export” stacks in Vertica to read/write data from external file-systems.
To enable efficient communication with HDFS, we need optimized libraries that can 1) read/write file formats and 2) transfer data. My team and I contributed to native C++ Apache ORC and Apache Parquet parsers to achieve the first goal and developed Libhdfs++ to achieve the second goal. Developing these libraries allowed us to implement many optimizations that are crucial for performance, including locality, load balancing, predicate pushdown, zero-copy, concurrency, and resource management.
I also worked on extending Vertica’s Kerberos and delegation token management capabilities to integrate with the Hadoop security layers that protect data in HDFS. Vertica can now authenticate users in the Hadoop space. My work on extending Vertica to authenticate with multiple Kerberos realms helps large organizations use a single Vertica cluster across multiple departments. Using C/C++ security libraries to implement secure authentication is very challenging because there is a lack of sufficient, up to date documentation for these libraries. In many cases, we had to read the library source code in order to understand the behavior of API calls provided.
Q2. What are the differences between a Query Engine and a Database Management System?
A Query engine is simply a compute engine that queries data and reports the results to a user.
A query engine acts in isolation from its surroundings. Critical database functions such as resource management and workload management are controlled with the help of other services. Hive, Cloudera Impala, Vertica SQL on Hadoop, are all examples of a query engine. Ambari, yarn, and Mesos are examples of workload and resource managers. Because these different services are loosely coupled, certain features, such as concurrency and atomicity, are implemented via workaround means that are often inefficient. For instance, Hive supports atomicity at the file-system level by writing to a temporary directory and atomically renaming it. For object stores such as S3, this approach is highly inefficient and doesn’t work.
The textbook definition of a Database Management System (DBMS) states that it must provide atomicity, concurrency, isolation, and durability (ACID). However, modern DBMSs have evolved to handle many more tasks including resource management, workload management, backup and restore, etc..
A DBMS is a tightly coupled system that controls compute, storage, resources, metadata all in a single application. Because of this tight coupling, a DBMS can more efficiently execute concurrent queries and workloads. This is achieved with the help of a query optimizer, which considers all resources and costs involved as part of the optimization equation. The optimizer is also aware of everything the system is already doing. The downside of this tight coupling is that it takes more development hours to implement new features, and it becomes difficult to integrate with third-party systems. Vertica Enterprise and Teradata are some examples of DBMS systems.
Q3. How are Data Warehouses and Data Lakes related to each other?
Traditional Data Warehouses define schemas before writing the data. Once processed and structured data is stored, a data warehouse can analyze this data and report results back to the users.
However, the explosion of diverse, unstructured data gave birth to the Data Lake, as companies had a need to store this data as cost-effectively as possible. (Big) Data today is produced from diverse sources including IoT, clickstream, social media, and the traditional transaction data. Organizations are now interested in performing complex big data analytics over data gathered from these diverse sources to find answers to questions such as how to optimize ad-targeting, detect fraud, or predict user behavior.
Therefore, organizations are storing all the data in their various forms (structured, semi-structured, unstructured) as-is on cheap storage layers such as HDFS and cloud storage services such as Amazon S3. Various analytical tools have been developed in this space to co-exist and bring further value to data lake investments. The tools for Data Lakes have to be elastic to handle the potentially vast differences in data sizes used across different analysis on different storage layers (aka need separate compute and storage).
A modern data warehouse, therefore, is one that effectively integrates with and adds value to data lake infrastructure. Organizations must, therefore, adopt the right mix of analytical tools that can efficiently handle both data warehousing and data lake use cases.
Q4. Can you describe some key differentiators in Vertica compared to other Query Engines for Data Lakes?
Vertica was one of the early adopters of column-store technology, which is specifically designed to handle modern data query and analytics requirements. In fact, many of the legacy products in this space have extended parts of their architecture to use column-store (mostly in-memory) and have begun to advertise themselves as column-store technologies. Vertica, on the other hand, was built from the ground up to be columnar.
Unlike many of the SQL query engines on the market, Vertica is an enterprise-grade DBMS, which is SQL99 and ACID compliant.
Our technology stack is built with numerous advanced optimizations and features such as different projection types, sideways information passing, flattened tables, live aggregate projection, directed query etc., which provide 50x or greater query performance without a need for any custom hardware. Whereas the primary goal of a data lake is to store massive volumes of structured and unstructured data as cost-effectively as possible, Vertica is built for running analytics against structured and semi-structured data. Since it is completely implemented in C++, Vertica has fine-grain control over various hardware and system resources, thereby providing high orders of concurrency and parallelism. Vertica is tuned for the data warehouse use cases where speed and SLAs matter.
However, given the growing adoption of data lakes, Vertica recognized early on that integration with this new infrastructure would add value to customer data stacks. Over the past decade, the Vertica engineering team has added a number of new capabilities that do exactly that.
Vertica for SQL on Hadoop (VSQLH) is deployment option where Vertica acts as a high-performance query engine for various types of data lakes. Although certain optimizations cannot be effectively implemented with VSQLH, it still outperforms every other data lake query engine because it takes advantage of important components of our enterprise technology stack including the optimizer, resource manager, etc. VSQLH also benefits from having tight integration with leading data lakes such as S3, HDFS and columnar file-formats such as Parquet and ORC.
In addition to Vertica for SQL on Hadoop, Vertica adds great value to data lakes with a range of integrations such as S3 backup, external tables, and the ability to read/write ORC and Parquet. This puts Vertica in a unique position where it can effectively target both data warehousing and data lake use cases. None of the other Query Engines or DBMSs have this capability.
Vertica is further loaded with various analytical packages required for modern data analytics. Vertica’s rich suite of tools includes time-series and advanced geospatial functions as well as native in-database machine learning. These advanced analytics packages provide organizations the power to run today’s complex big data requirements.
Q5. What are some benefits of using a paid license solution such as Vertica vs. Open Source products?
Vertica is an enterprise-grade software solution that runs in data centers on commodity hardware, across multiple cloud platforms and natively on Hadoop nodes. Unlike many closed, legacy database management systems, Vertica does not lock you into a proprietary appliance or single public cloud platform. Instead, Vertica users are free to deploy Vertica on their infrastructure of choice, including directly on HDFS data lakes. Vertica supports all of the popularly used file formats in the Big Data space including Avro, ORC, and Parquet, and file systems including Linux, HDFS, and S3. Vertica also integrates with a range of ETL, security and BI products, and open-source tools such as Kafka and Spark.
While open source products are often thought of as “free” many users and companies come to realize the hidden operational costs associated with open source. This often happens late in the game when they’ve been locked into a consulting firm or a large company that supports these open source implementations. Many of these consulting firms and companies that support open source projects are not strongly anchored on to a specific open-source product. This makes it risky to build infrastructure with open source products that lack a stable interface or a long-term roadmap. It is also important to realize that the more popular open source products have many forces that influence its direction. In many cases, companies are forced to spin off their own versions of the product for internal use leading to further costs.
With the Big Data space rapidly evolving at both the application and the technology levels, we believe it makes sense for companies to leverage a proven, enterprise-grade analytics platform that also maintains an open-ecosystem of partner and open-source integrations. And with complete freedom from underlying infrastructure, Vertica provides flexibility for your future deployment options, whether on-premise, in the clouds, on Hadoop, or a combination of all of them.
Q6. Many Big Data products are written in Java. How are the C++ products surviving in this space?
Most of today’s Big Data tools starting from MR/Hadoop are written in Java. But more recently, components that are critical for performance are being re-written in C++. This is partly due to the increase in popularity of using languages like Python and R to develop high-level data analytics applications. Python applications use cython to integrate with the C++ libraries directly. C++ allows explicit memory management which helps optimize memory allocations and implement zero-copy in many layers of the stack. These optimizations significantly help in the scaling of the Python applications. In fact, C++ implementations are more popular than the Java implementations in newer projects such as Apache Arrow.
With the advancement of the C++ standard (improved support for smart pointers, synchronization, etc.) as well as adopting programming techniques such as RAII, the headaches (mainly freeing resources) associated with explicit memory management are things of the past. The use of idioms such as pImpl helps implement cleaner abstractions for the public API. Parallelism and concurrency are the only options to fully utilize the modern heterogeneous hardware.
The power of C++ lies in the ability to use explicit SIMD and other hardware intrinsics. C++ is closer to the operating systems and hardware. This gives C++ the ability to have control over many low-level system components such as multi-threading, low-level networking, which are all helpful for maximizing performance.
My main contributions to the Apache ORC and Apache Parquet projects are to the C++ implementations of these projects. As a member of these projects, I am also responsible for their future directions.
On the ORC side, I shepherded important contributions to the library including the C++ writer and Libhdfs++. I also made significant improvements to its build system. One of the improvements includes extending the external third-party library dependency management. This enabled the ORC library to integrate into a variety of build environments.
On the Parquet side, I was one of the early developers to start the C++ implementation (parquet-cpp). I designed and developed various core components of the library including statistics and metadata. I made significant improvements to the buffer management, which helped reduce memory allocations by 75%. Implementing the Parquet Specification in C++ was interesting and challenging since it enabled the optimization of explicit memory allocations, in-memory layout, and use of SIMD intrinsics.
These C++ implementations have been designed and developed to cater to real-world use cases and workloads. Products such a Vertica can now query and export these file formats in a scalable manner.
My other contributions to these projects include ensuring compatibility between the Java and C++ implementations of the ORC and parquet specification, since compatibility can break at various levels, including the use of native libraries in the code and other known issues.
Deepak Majeti is a systems software engineer at Vertica where he makes several important contributions to the Vertica SQL on Hadoop project. He is also a PMC member of Apache ORC Project and a committer for the Apache Parquet Project. Deepak’s interests lie in getting the best from HPC and big data and building scalable, high-performance, and energy-efficient data analytics tools for modern computer architectures. Deepak holds a Ph.D. in the high-performance computing (HPC) domain from Rice University.