On Column Stores. Interview with Shilpa Lawande
“A true columnar store is not only about the way you store data, but the engine and the optimizations that are enabled by the column store”–Shilpa Lawande.
On the subject of column stores, and on the important features in the new release of the HP Vertica Analytics Platform, I interviewed Shilpa Lawande, VP Engineering & Customer Experience at HP Vertica.
Q1. Back in 2011 I did an interview with you, at the time when Vertica had just been acquired by HP. What is new in the current version of Vertica?
Shilpa Lawande: We’ve come a long way since 2011 and our innovation engine is going strong!
From “Bulldozer” to “Crane” and now “Dragline,” we’ve built on our columnar-compressed, MPP shared-nothing core, expanded security and manageability, dramatically expanded data ingestion capabilities, and, most exciting of all, added a host of advanced analytics functions and extensibility APIs to the HP Vertica Analytics Platform itself. One key innovation is our ability to ingest and auto-schematize semi-structured data using HP Vertica Flex Zone, which takes away much of the friction in the analytic life-cycle from exploration to production.
We’ve also grown a vibrant community of practitioners and an ecosystem of complementary tools, including Hadoop.
Dragline, our next release of the HP Vertica Analytics Platform, addresses the needs of the most demanding, analytic-driven organizations by providing many new features, including:
- Project Maverick’s Live Aggregate Projections to speed up queries that rely on resource-intensive aggregate functions like SUM, MIN/MAX, and COUNT.
- Dynamic mixed workload management, which identifies and adapts to varying query complexities (simple and ad-hoc queries as well as long-running advanced queries) and dynamically assigns the appropriate amount of resources to meet the needs of all data consumers.
- HP Vertica Pulse, which helps organizations leverage an in-database sentiment analysis tool that scores short data posts, including social data, such as Twitter feeds or product reviews, to gauge the most popular topics of interest, analyze how sentiment changes over time and identify advocates and detractors.
- HP Vertica Place, which stores and analyzes geospatial data in real time, including locations, networks and regions.
- An expanded SQL-on-Hadoop offering that gives users the freedom to pick their data formats and where to store their data, including HDFS, while still benefiting from the power of the Vertica analytic engine.
Of course, there’s a lot more to the “Dragline” release, but these are the highlights.
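To illustrate the idea behind the Live Aggregate Projections item in the list above: one common way to speed up SUM, MIN/MAX, and COUNT is to maintain running aggregates as rows are loaded, so that a query reads a small precomputed summary instead of rescanning the raw data. The Python sketch below is a toy illustration under that assumption. The class `LiveAggregate` and its methods are hypothetical names, not Vertica APIs, and the interview does not describe how the feature is implemented internally.

```python
from collections import defaultdict


class LiveAggregate:
    """Toy per-key running aggregates, updated at load time (hypothetical)."""

    def __init__(self):
        self.count = defaultdict(int)
        self.total = defaultdict(float)
        self.minimum = {}
        self.maximum = {}

    def load(self, key, value):
        # Update every aggregate once, when the row arrives.
        self.count[key] += 1
        self.total[key] += value
        self.minimum[key] = min(self.minimum.get(key, value), value)
        self.maximum[key] = max(self.maximum.get(key, value), value)

    def query(self, key):
        # Answer from the summary; no raw rows are scanned here.
        return {
            "count": self.count[key],
            "sum": self.total[key],
            "min": self.minimum[key],
            "max": self.maximum[key],
        }


agg = LiveAggregate()
for store, amount in [("boston", 10.0), ("boston", 5.0), ("austin", 7.5)]:
    agg.load(store, amount)

boston_summary = agg.query("boston")
```

The trade-off in this sketch is a little extra work on every load in exchange for aggregate queries whose cost no longer depends on the number of raw rows.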
Q2. Vertica is referred to as an analytics platform. How does it differentiate with respect to conventional relational database systems (RDBMSes)?
Shilpa Lawande: Good question. First, let me clear the misconception that column stores are not relational – Vertica is a relational database, an RDBMS – it speaks tables and columns and standard SQL and ODBC, and like your favorite RDBMS, talks to a variety of BI tools. Now, there are many variations in the database market from low-cost solutions that lack advanced analytics to high-end solutions that can’t handle big data.
HP Vertica is the only one purpose-built for big data analytics; most conventional RDBMSes were purpose-built for OLTP and then retrofitted for analytics. Vertica’s core architecture, with columnar storage, a columnar engine, aggressive use of data compression, our scale-out architecture, and, most importantly, our unique hybrid load architecture, enables what we call real-time analytics, which gives us the edge over the competition.
You can keep loading your data throughout the day, not in batch at night, and you can query the data as it comes in, without any specialized indexes, materialized views, or other pre-processing. And we have a huge and ever-growing library of features and functions to explore and perform analytics on big data, both structured and semi-structured. All of these core capabilities add up to a powerful analytics platform, far beyond a conventional relational database.
Q3. Vertica is column-based. Could you please explain what are the main technological differences with respect to a conventional relational database system?
Shilpa Lawande: It’s about performance. A conventional RDBMS is bottlenecked by disk I/O.
The reason for this is that with a traditional database, data is stored on disks in a row-wise manner, so even if the query needs only a few columns, the entire row must be retrieved from disk. In analytic workloads, often there are hundreds of columns in the data and only a few are used in the query, so row-oriented databases simply don’t scale as the data sets get large.
Vendors who offer this type of database often require that you create indexes and materialized views to retrieve the relevant data in a reasonable amount of time. With columnar storage, you store data for each column separately, so that you can grab just the columns you need to answer the query. This can speed query times immensely, where hour-long queries can happen in minutes or seconds. Furthermore, Vertica stores and processes the data sorted, which enables us to do all manner of interesting optimizations to queries that further boost performance.
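The difference between the two layouts can be made concrete with a small Python sketch. It holds the same three-row table once as a list of rows and once as one list per column, then counts how many stored values a single-column SUM has to touch in each case. The table, the column names, and the "values touched" counter are illustrative assumptions standing in for disk I/O; this is not Vertica code.

```python
# The same table in a row-wise layout (one record per row).
rows = [
    {"id": 1, "region": "east", "price": 10.0, "qty": 2},
    {"id": 2, "region": "west", "price": 20.0, "qty": 1},
    {"id": 3, "region": "east", "price": 5.0, "qty": 4},
]

# The same table in a column-wise layout (one list per column).
columns = {name: [r[name] for r in rows] for name in rows[0]}


def sum_row_store(rows, col):
    """Sum one column; every row is read in full, as in a row store."""
    total, touched = 0.0, 0
    for r in rows:
        touched += len(r)  # all columns of the row come off "disk"
        total += r[col]
    return total, touched


def sum_column_store(columns, col):
    """Sum one column; only that column's values are read."""
    values = columns[col]
    return sum(values), len(values)


row_total, row_touched = sum_row_store(rows, "price")
col_total, col_touched = sum_column_store(columns, "price")
```

Both functions return the same total, but the row layout touches every value in the table while the column layout touches only the one column the query needs. With hundreds of columns, that gap grows accordingly.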
Some of the traditional database vendors out there claim they now have a columnar store, but a true columnar store is not only about the way you store data, but the engine and the optimizations that are enabled by the column store.
For instance, an optimization called late materialization allows Vertica to delay retrieval of columns as late as possible in query processing so that minimal I/O and data movement is done until absolutely necessary. Vertica is the only engine that is true columnar; everything else out there is a retrofit of a general purpose engine that can read some kind of a columnar format.
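A minimal Python sketch of the late-materialization idea follows, assuming a table held as one list per column. The filter runs against a single column and produces only row positions; the remaining columns are fetched afterwards, and only at the positions that survived. The function name `late_materialize` and the sample data are hypothetical, and the sketch says nothing about how Vertica implements the optimization.

```python
columns = {
    "region": ["east", "west", "east", "north"],
    "price": [10.0, 20.0, 5.0, 8.0],
    "customer": ["ann", "bob", "cy", "di"],
}


def late_materialize(columns, filter_col, predicate, output_cols):
    # Step 1: scan only the filter column and keep matching positions.
    positions = [i for i, v in enumerate(columns[filter_col]) if predicate(v)]
    # Step 2: touch the output columns only at those positions.
    return [tuple(columns[c][i] for c in output_cols) for i in positions]


result = late_materialize(
    columns,
    filter_col="region",
    predicate=lambda r: r == "east",
    output_cols=["customer", "price"],
)
```

Here the `customer` and `price` columns are read for two rows rather than four. The more selective the filter, the less data is fetched and moved.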
Q4. What is so special about Vertica’s data compression?
Shilpa Lawande: The capability of Vertica to store data in columns allows us to take advantage of the similar traits in data. This gives us not only a footprint reduction in the disk needed to store data, but also an I/O performance boost — compressed data takes a shorter time to load. But, even more importantly, we use various encoding techniques on the data itself that enable us to process the data without expanding it first.
We have over a dozen schemes for how we store the data to optimize its storage, retrieval, and processing.
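To make "processing the data without expanding it" concrete, here is a small Python sketch using run-length encoding (RLE), a well-known encoding that works well on sorted columns. The interview does not say which schemes are used for which data, so treat RLE here as an illustrative assumption; the function names are hypothetical.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(v, len(list(group))) for v, group in groupby(values)]


def count_equal(runs, target):
    """Count rows equal to target directly on the encoded runs."""
    return sum(n for v, n in runs if v == target)


def rle_sum(runs):
    """Sum the column directly on the encoded runs."""
    return sum(v * n for v, n in runs)


sorted_col = [1, 1, 1, 2, 2, 5, 5, 5, 5]
runs = rle_encode(sorted_col)
fives = count_equal(runs, 5)
total = rle_sum(runs)
```

Nine stored values become three runs, and both the count and the sum are computed by visiting three pairs instead of nine values. Sorting the column first, as described earlier in the interview, is what makes the runs long.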
Q5. Vertica is designed for massively parallel processing (MPP). What is it?
Shilpa Lawande: Vertica is a database designed to run on a cluster of industry-standard hardware.
There are no special-purpose hardware components. The database is based on a shared-nothing architecture, where many nodes each store part of the database and do part of the work in processing queries. We optimize the processing so as to minimize data traffic over the network. We have built-in high availability to handle node failures. We also have a sophisticated elasticity mechanism that allows us to efficiently add and remove nodes from the cluster. This enables us to scale out to very large data sizes and handle very large data problems. In other words, it is massively parallel processing!
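The shared-nothing pattern can be sketched in a few lines of Python. Rows are hash-partitioned across simulated nodes, each node aggregates only its own rows, and a coordinator combines the small partial results. The node count, the CRC32-based partitioning, and the function names are assumptions chosen for illustration; the interview does not describe Vertica's actual data distribution scheme.

```python
import zlib
from collections import Counter


def partition(rows, num_nodes):
    """Assign each (key, value) row to a node by hashing its key."""
    nodes = [[] for _ in range(num_nodes)]
    for key, value in rows:
        node_id = zlib.crc32(key.encode("utf-8")) % num_nodes
        nodes[node_id].append((key, value))
    return nodes


def local_aggregate(node_rows):
    """Work done independently on one node: sum values per key."""
    partial = Counter()
    for key, value in node_rows:
        partial[key] += value
    return partial


def combine(partials):
    """Coordinator step: merge the per-node partial sums."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds values per key
    return dict(total)


rows = [("east", 10), ("west", 20), ("east", 5), ("north", 8)]
nodes = partition(rows, num_nodes=3)
totals = combine(local_aggregate(n) for n in nodes)
```

Only the per-key partial sums cross the simulated network, not the raw rows, which is the sense in which such a design keeps data traffic low.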
Q6. In the past, columnar databases were said to be slow to load. Is it still true now?
Shilpa Lawande: This may have been true with older unsophisticated columnar databases. We have customers loading over 35 TB of data per hour into Vertica, so I think we’ve put that one squarely to rest.
Q7. Who are the users ready to try column-based “data slicers”? And for what kind of use cases?
Shilpa Lawande: Vertica is a technology broadly applicable in many industries and in many business situations. Here are just a few of them.
Data Warehouse Modernization – the customer has an underperforming data warehouse solution in place and wants to replace or augment its current analytics with a solution that will scale and deliver faster analytics at a lower overall TCO, with substantially fewer hardware resources.
Hadoop Acceleration – the customer has bought into Hadoop for a data lake solution and would like a more expressive and faster SQL-on-Hadoop solution or an analytic platform that can offer real-time analytics for production use.
Predictive analytics – the customer has some kind of machine data, clickstream logs, call detail records, security event data, network performance data, etc. over long periods of time and they would like to get value out of this data via predictive analytics. Use-cases include website personalization, network performance optimization, security threat forensics, quality control, predictive maintenance, etc.
Q8. What are the typical indicators which are used to measure how well systems are running and analyzing data in the enterprise? In other words, how “good” is the value derived from analyzing (Big) Data?
Shilpa Lawande: There are many, many advantages and places to derive value from big data.
First, just having the ability to answer your daily analytics faster can be a huge boost for the organization. For example, we had one brick-and-mortar retailer who wanted to brief sales associates and managers daily on what the hottest selling products were, who had inventory and other store trends. With their legacy analytics system, they could not deliver analytics fast enough to have these analytics on hand. With Vertica, they now provide very detailed (and I might add graphically pleasing) analytics across all of their stores, right in the hands of the store manager via a tablet device. The analytics has boosted sales performance and efficiency across the chain. The user experience they get wouldn’t be possible without the speed of Vertica.
But what is most exciting to me is when Vertica is used to save lives and the environment. We have a client in the medical field who has used Vertica analytics to better detect infections in newborn infants by leveraging the data they have from the NICU. It’s difficult to detect infections in newborns because they don’t often run a fever, nor can they explain how they feel. The estimate is that this big data analytics has saved the lives of hundreds of newborn babies in the first year of use. Another example is the HP Earth Insights project, which used Vertica to create an early warning system to identify species threatened by destruction of tropical forests around the world.
This project, done in cooperation with Conservation International, is making an amazing difference to scientists and helping inform and influence policy decisions around our environment.
There are a LOT of great use cases like these coming out of the Vertica community.
Q9. What are the main technical challenges when analyzing data at speed?
Shilpa Lawande: In an analytics system, you tend to have a lot going on at the same time. There are data loads, both in batch and trickle loads. There is daily and regular analytics for generating daily reports. There may be data discovery where users are trying to find value in data. Of course, there are dashboards that executives rely upon to stay up to date. Finally, you may have ad-hoc queries that come in and try to take away resources. So perhaps the biggest challenge is dealing with all of these workloads and coming up with the most efficient way to manage it all.
We’ve invested a lot of resources in this area and the fruit of that labor is very much evident in the “Dragline” release.
Q10. Do you have some concrete example of use cases where HP Vertica is used to analyze data at speed?
Shilpa Lawande: Yes, we have many, see here.
Q11. How does HP Vertica differ from other analytical platforms offered by competitors such as IBM and Teradata, and from in-memory databases such as SAP HANA?
Shilpa Lawande: Vertica offers everything that’s good about legacy data warehouse technologies like the ability to use your favorite visualization tools, standard SQL, and advanced analytic functionality.
In general, the legacy databases you mentioned are pretty good at handling analysis of business data, but they are still playing catch-up when it comes to big data – the volume, variety, and velocity. A row store simply cannot deliver the analytical performance and scale of an MPP columnar platform like Vertica.
In-memory databases are a good acceleration solution for some classes of business analytics, but, again, when it comes to very large data problems, the economics of putting all the data in memory simply do not work. That said, Vertica itself has an in-memory component which is at the core of our high-speed loading architecture, so I believe we have the best of both worlds – ability to use memory where it matters and still support petabyte scales!
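One generic way to combine an in-memory component with large-scale storage is sketched below in Python: new rows land in a small memory buffer that is cheap to append to, full buffers are flushed as sorted batches (a stand-in for disk-resident storage), and queries consult both so that freshly loaded data is visible immediately. This is an assumption-laden illustration of the general pattern, not a description of Vertica internals; `HybridStore`, `flush_threshold`, and the method names are hypothetical.

```python
import bisect


class HybridStore:
    """Toy store: unsorted memory buffer plus sorted flushed batches."""

    def __init__(self, flush_threshold=3):
        self.memory_buffer = []    # recent rows, appended as they arrive
        self.sorted_batches = []   # stand-in for sorted on-disk storage
        self.flush_threshold = flush_threshold

    def load(self, value):
        self.memory_buffer.append(value)
        if len(self.memory_buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Move buffered rows out of memory as one sorted batch.
        if self.memory_buffer:
            self.sorted_batches.append(sorted(self.memory_buffer))
            self.memory_buffer = []

    def count_at_least(self, low):
        # Sorted batches allow a binary search per batch.
        total = 0
        for batch in self.sorted_batches:
            total += len(batch) - bisect.bisect_left(batch, low)
        # The small memory buffer is scanned directly.
        total += sum(1 for v in self.memory_buffer if v >= low)
        return total


store = HybridStore(flush_threshold=3)
for v in [7, 2, 9, 4, 8]:
    store.load(v)

recent_and_old = store.count_at_least(7)
```

In this sketch the value 8 is still in memory when the query runs, yet it is counted alongside the flushed rows, which mirrors the "query the data as it comes in" behaviour described in the interview.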
Shilpa Lawande has been an integral part of the Vertica engineering team from its inception to its acquisition by HP in 2011. Shilpa brings over 15 years of experience in databases, data warehousing and grid computing to HP/Vertica.
Besides being responsible for Vertica’s Engineering team, Shilpa also manages the Customer Experience organization for Vertica including Customer Support, Training and Professional Services. Prior to Vertica, she was a key member of the Oracle Server Technologies group where she worked directly on several data warehousing and self-managing features in the Oracle Database.
Shilpa is a co-inventor on several patents on query optimization, materialized views and automatic index tuning for databases. She has also co-authored two books on data warehousing using the Oracle database as well as a book on Enterprise Grid Computing. She has been named to the 2012 Women to Watch list by Mass High Tech and awarded HP Software Business Unit Leader of the year in 2012.
Shilpa has a Masters in Computer Science from the University of Wisconsin-Madison and a Bachelors in Computer Science and Engineering from the Indian Institute of Technology, Mumbai.
Follow ODBMS.org on Twitter: @odbmsorg