On Big Data: Interview with Shilpa Lawande, VP of Engineering at Vertica.
“It is expected there will be 6 billion mobile phones by the end of 2011, and there are currently over 300 Twitter accounts and 500K Facebook status updates created every minute. And, there is now a $2 billion a year market for virtual goods!” — Shilpa Lawande
I wanted to know more about the Vertica Analytics Platform for Big Data, so I interviewed Shilpa Lawande, VP of Engineering at Vertica. Vertica was acquired by HP earlier this year.
Q1. What are the main technical challenges for big data analytics?
Shilpa Lawande: Big data problems have several characteristics that make them technically challenging. First is the volume of data, especially machine-generated data, and how fast that data is growing every year as new sources of data emerge. It is expected there will be 6 billion mobile phones by the end of 2011, and there are currently over 300 Twitter accounts and 500K Facebook status updates created every minute. And, there is now a $2 billion a year market for virtual goods!
A lot of insights are contained in unstructured or semi-structured data from these types of applications, and the problem is analyzing this data at scale. Equally challenging is the problem of ‘how to analyze.’ It can take significant exploration to find the right model for analysis, and the ability to iterate very quickly and “fail fast” through many (possibly throwaway) models – at scale – is critical.
Second, as businesses get more value out of analytics, it creates a success problem – they want the data available faster, or in other words, want real-time analytics. And they want more people to have access to it, or in other words, high user volumes.
One of Vertica’s early customers is a Telco that started using Vertica as a ‘data mart’ because they couldn’t get resources from their enterprise data warehouse. Today, they have over a petabyte of data in Vertica, several orders of magnitude bigger than their enterprise data warehouse.
Techniques like social graph analysis, for instance leveraging the influencers in a social network to create a better user experience, are hard problems to solve at scale. All of these problems combined create a perfect storm of challenges and opportunities to create faster, cheaper and better solutions for big data analytics than traditional approaches can provide.
Q2. How does Vertica help solve these challenges?
Shilpa Lawande: Vertica was designed from the ground up for analytics. We did not try to retrofit 30-year-old RDBMS technology to build the Vertica Analytics Platform. Instead, Vertica built a true columnar database engine, including sorted columnar storage, a query optimizer, and an execution engine.
With sorted columnar storage, two mechanisms drastically reduce the I/O bandwidth requirements of big data analytics workloads. First, Vertica reads only the columns that queries actually need.
Second, Vertica compresses the data significantly better than anyone else.
Vertica’s execution engine is optimized for modern multi-core processors and we ensure that data stays compressed as much as possible through query execution, thereby reducing the CPU cycles needed to process the query. Additionally, we have a scale-out MPP architecture, which means you can simply add more nodes to grow a Vertica cluster.
All of these elements are extremely critical to handle the data volume challenge.
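To make the two I/O-saving mechanisms concrete, here is a small illustrative sketch (not Vertica's implementation): a table stored column-wise lets a query read just the columns it touches, and run-length encoding (RLE) collapses a sorted column into a handful of runs.

```python
# Illustrative sketch: column pruning plus run-length encoding (RLE)
# on sorted data. The table layout and RLE scheme here are simplified
# examples, not Vertica internals.

def rle_encode(sorted_values):
    """Collapse runs of equal values into [value, count] pairs."""
    runs = []
    for v in sorted_values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

# A tiny "table" stored column-wise: each column is its own list.
table = {
    "state":  sorted(["MA"] * 500 + ["NY"] * 300 + ["CA"] * 200),
    "amount": list(range(1000)),
    "note":   ["..."] * 1000,   # a wide column many queries never touch
}

# A query like SELECT state, COUNT(*) ... GROUP BY state reads one
# column of three, and RLE shrinks it from 1000 entries to 3 runs.
encoded = rle_encode(table["state"])
print(len(encoded))          # 3 runs instead of 1000 values
counts = {value: count for value, count in encoded}
print(counts["MA"])          # 500
```

Sorting is what makes RLE so effective here: the same column in arbitrary row order would compress far less.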
With Vertica, customers can load several terabytes of data per hour and query it within minutes of it being loaded – that is real-time analytics on big data for you.
There is a myth that columnar databases are slow to load. This may have been true with older generation column stores, but in Vertica, we have a hybrid in-memory/disk load architecture that rapidly ingests incoming data into a write-optimized row store and then converts that to read-optimized sorted columnar storage in the background. This is entirely transparent to the user because queries can access data in both locations seamlessly. We have a very lightweight transaction implementation with snapshot isolation, so queries can always run without any locks.
And we have no auxiliary data structures, like indexes or materialized views, which need to be maintained post-load.
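The hybrid load path described above can be sketched roughly as follows. This is a toy model with assumed names (`HybridStore`, `moveout`), not Vertica's internal design: loads append to a write-optimized row buffer, a background step sorts and converts to columnar form, and scans see both locations.

```python
# Minimal sketch of a hybrid row/column load architecture.
# Class and method names are illustrative, not Vertica's.

class HybridStore:
    def __init__(self, columns):
        self.columns = columns
        self.row_buffer = []                        # write-optimized rows
        self.col_store = {c: [] for c in columns}   # sorted columnar storage

    def load(self, row):
        self.row_buffer.append(row)                 # cheap append, no sorting

    def moveout(self):
        """Background task: sort buffered rows, merge into columnar form."""
        existing = list(zip(*(self.col_store[c] for c in self.columns)))
        merged = sorted(existing + [tuple(r) for r in self.row_buffer])
        for i, c in enumerate(self.columns):
            self.col_store[c] = [row[i] for row in merged]
        self.row_buffer = []

    def scan(self, column):
        """Queries see both locations seamlessly."""
        i = self.columns.index(column)
        return self.col_store[column] + [row[i] for row in self.row_buffer]

store = HybridStore(["ts", "user"])
store.load((3, "a"))
store.load((1, "b"))
print(store.scan("ts"))      # [3, 1] - queryable immediately after load
store.moveout()
print(store.scan("ts"))      # [1, 3] - now in sorted columnar form
```

The key property the sketch shows is that queries never wait for the conversion step: data is visible from the moment it lands in the row buffer.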
Last, but not least, we designed the system for “always on,” with built-in high availability features.
Operations that translate into downtime in traditional databases are online in Vertica, including adding or upgrading nodes, adding or modifying database objects, etc.
With Vertica, we’ve removed many of the barriers to monetizing big data and hope to continue to do so.
Q3. When dealing with terabytes to petabytes of data, how do you ensure scalability and performance?
Shilpa Lawande: The short answer to the performance and scale question is that we make the most efficient use of all the resources, and we parallelize at every opportunity. When we compress and encode the data, we sometimes see 80:1 compression (depending on the data). Vertica’s fully peer-to-peer MPP architecture allows you to issue loads and queries from any node and run multiple load streams. Within each node, operations make full use of the multi-core processors to parallelize their work. A great measure of our raw performance is that we have several OEM customers who run Vertica on a single 1U node, embedded within applications like security and event log management, with very low data latency requirements.
Scalability has three aspects – data volume, hardware size, and concurrency. Vertica’s performance scales linearly (and often super linearly due to compression and other factors) when you double the data volume or run the same data volume on twice the number of nodes. We have customers who have grown their databases from scratch to over a petabyte, with clusters from tens to hundreds of nodes. As far as concurrency goes, running queries 50-200x faster ensures that we can get a lot more queries done in a unit of time. To efficiently handle a highly concurrent mix of short and long queries, we have built-in workload management that controls how resources are allocated to different classes of queries. Some of our customers run with thousands of concurrent users running sub-second queries.
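A back-of-envelope model makes the interplay of compression and scale-out concrete. All the numbers below (data size, compression ratio, per-node bandwidth) are assumed for illustration, not Vertica benchmarks:

```python
# Illustrative scan-time model: time = compressed bytes / aggregate
# I/O bandwidth. Shows why doubling nodes roughly halves scan time.

def scan_seconds(raw_tb, compression_ratio, nodes, node_gb_per_s=1.0):
    compressed_gb = raw_tb * 1024 / compression_ratio
    return compressed_gb / (nodes * node_gb_per_s)

# Assumed workload: 10 TB raw, 10:1 compression, 1 GB/s per node.
base = scan_seconds(raw_tb=10, compression_ratio=10, nodes=8)
doubled = scan_seconds(raw_tb=10, compression_ratio=10, nodes=16)
print(base)             # 128.0 seconds on 8 nodes
print(base / doubled)   # 2.0 - doubling nodes halves scan time
```

The same model also hints at the "super linear" effect mentioned above: raising the compression ratio cuts scan time independently of node count.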
Q4. What is new in Vertica 5.0?
Shilpa Lawande: In Vertica 5.0, we focused on extensibility and elasticity of the platform. Vertica 5.0 introduced a Software Development Kit (SDK) that allows users to write custom user-defined functions so they can do more with the Vertica platform, such as analyze Apache logs or build custom statistical extensions. We also added several more built-in analytic extensions, including event-series joins, event-series pattern matching, and statistical and geospatial functions.
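To give a flavor of what event-series pattern matching does, here is a hedged sketch of the underlying idea (a single ordered pass over events), not Vertica's SQL syntax; the window size and event shape are assumptions for the example:

```python
# Illustrative event-series pattern match: find users whose 'click'
# is followed by a 'buy' within a time window, in one ordered pass.
# This mimics the concept, not Vertica's pattern-matching SQL.

def match_click_then_buy(events, window=30):
    """events: (timestamp, user, action) tuples sorted by time."""
    matches, last_click = [], {}
    for ts, user, action in events:
        if action == "click":
            last_click[user] = ts
        elif action == "buy" and user in last_click \
                and ts - last_click[user] <= window:
            matches.append((user, last_click[user], ts))
    return matches

events = [
    (100, "u1", "click"), (110, "u1", "buy"),
    (200, "u2", "click"), (300, "u2", "buy"),   # outside the 30s window
]
print(match_click_then_buy(events))   # [('u1', 100, 110)]
```

Expressing this kind of ordered, stateful logic directly in the database is what the event-series extensions save users from hand-coding.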
With our new Elastic Cluster features, we allow very fast expansion and contraction of the cluster to handle changing workloads – in a recent POC we showed expansion from 8 nodes to 16 nodes with 11TB of data in an hour. Another feature we’ve added is called Import/Export, which automates fast export of data from one Vertica cluster to another, a very useful feature for creating sandboxes from a production system for exploratory analysis.
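One common way to make that kind of expansion fast is hash segmentation: rows hash to a fixed set of segments, and segments map to nodes, so growing the cluster reassigns segments rather than rehashing every row. The sketch below illustrates that general scheme; the segment count and mapping are assumptions, not Vertica's exact rebalancing algorithm:

```python
# Illustrative hash segmentation behind cluster elasticity (a common
# scheme, not necessarily Vertica's): rows hash to fixed segments,
# segments map to nodes, so expansion moves segments, not rows.

import hashlib

SEGMENTS = 64   # fixed segment count, assumed for illustration

def segment_of(key):
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % SEGMENTS

def node_of(key, nodes):
    return segment_of(key) % nodes   # segments spread over nodes

rows = range(10_000)
before = {k: node_of(k, 8) for k in rows}
after = {k: node_of(k, 16) for k in rows}
moved = sum(before[k] != after[k] for k in rows)
print(f"{moved / len(before):.0%} of rows change nodes on 8 -> 16")
```

Doubling from 8 to 16 nodes moves about half the segments, which is the minimum needed to give the new nodes an equal share of the data.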
Besides these two major features, there are a number of improvements to the manageability of the database, most notably the Data Collector, that captures and maintains a history of system performance data, and the Workload Analyzer, that analyzes this data to point out suboptimal performance and how to fix it.
Q5. How is Vertica currently being used? Could you give examples of applications that use Vertica?
Shilpa Lawande: Vertica has customers in most major verticals, including Telco, Financial Services, Retail, Healthcare, Media and Advertising, and Online Gaming. We have eight of the top 10 US Telcos using Vertica for Call-Detail-Record analysis, and in Financial Services, a common use-case is a tickstore, where Vertica is used to store and analyze many years of financial trades and quotes data to build models.
A growing segment of Vertica customers are Online Gaming and Web 2.0 companies that use Vertica to build models for in-game personalization.
Q6. Vertica vs. Apache Hadoop: what are the similarities and what are the differences?
Shilpa Lawande: Vertica and Hadoop are both systems that can store and analyze large amounts of data on commodity hardware. The main differences are how the data gets in and out, how fast the system can perform, and what transaction guarantees are provided. Also, from the standpoint of data access, Vertica’s interface is SQL and data must be designed and loaded into a SQL schema for analysis.
With Hadoop, data is loaded AS IS into a distributed file system and accessed programmatically by writing Map-Reduce programs. By not requiring a schema first, Hadoop provides a great tool for exploratory analysis of the data, as long as you have the software development expertise to write Map-Reduce programs. Hadoop assumes that the workload it runs will be long running, so it makes heavy use of checkpointing at intermediate stages.
This means parts of a job can fail, be restarted and eventually complete successfully. There are no transactional guarantees.
Vertica, on the other hand, is optimized for performance by careful layout of data and pipelined operations that minimize saving intermediate state. Vertica gets queries to run sub-second and if a query fails, you just run it again. Vertica provides standard ACID transaction semantics on loads and queries.
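The programmatic-versus-declarative contrast can be shown with a toy example. This is illustrative only (neither system's actual API): the same count-by-key computed the Map-Reduce way, with explicit map and reduce functions over raw records, versus the SQL way once a schema exists.

```python
# Toy contrast: explicit Map-Reduce over raw log lines vs. a
# declarative SQL aggregation. Neither snippet is a real system API.

from collections import defaultdict

log_lines = ["GET /a", "GET /b", "POST /a", "GET /a"]

# Map-Reduce style: the code itself imposes structure on raw records.
def map_phase(line):
    method, path = line.split()
    yield path, 1

def reduce_phase(pairs):
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

mapped = [pair for line in log_lines for pair in map_phase(line)]
print(reduce_phase(mapped))   # {'/a': 3, '/b': 1}

# SQL-style equivalent, once the data is loaded into a schema:
#   SELECT path, COUNT(*) FROM requests GROUP BY path;
```

The Map-Reduce version works on data AS IS but requires programming; the SQL version is a one-liner but requires the schema design and load step up front, which is exactly the trade-off described above.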
We recently did a comparison between Hadoop, Pig, and Vertica for a graph problem (see post on our blog) and when it comes to performance, the choice is clearly in favor of Vertica. But we believe in using the right tool for the job and have over 30 customers using both systems together. Hadoop is a great tool for the early exploration phase, where you need to determine what value there is in the data, what the best schema is, or to transform the source data before loading into Vertica. Once the data models have been identified, use Vertica to get fast responses to queries over the data.
Other customers keep all their data in Vertica and leverage Hadoop’s scheduling capabilities to retrieve the data for different kinds of analysis. To facilitate this, we provide a Hadoop connector that allows efficient bi-directional transfer of data between Vertica and Hadoop. We plan to continue to enhance Vertica’s analytic platform as well as be a partner in the Hadoop ecosystem.
Q7. Cloud computing: Does it play a role at Vertica? If yes, how?
Shilpa Lawande: In 2007, when ‘Cloud’ first started to gather steam, we realized that Vertica had the perfect architecture for the cloud. It is no surprise because the very trends in commodity multi-core servers, storage and interconnects that resulted in our design choices also enable the cloud. We were the first analytic database to run on Amazon EC2 and today have several customers using EC2 to run their analytics.
We consider the cloud an important deployment configuration for Vertica, and as IaaS offerings improve and we gain real-world experience from our customers, you can expect Vertica’s product to be further optimized for cloud deployments. Cloud is also a big focus area at HP, and we now have the unique opportunity to create the best cloud platform to run Vertica and to provide value-added solutions for big data analytics based on Vertica.
Q8. How does Vertica fit into HP’s data management strategy?
Shilpa Lawande: Vertica provides HP with a proven platform for big data analytics that can be deployed as software, as an appliance, or in the cloud, all of which are key focus areas for HP’s business.
Big data is a growing market and the majority of the data is unstructured. The combination of Vertica and Autonomy gives HP a comprehensive “Information Platform” to manage, analyze, and derive value from the explosion in both structured and unstructured data.
Shilpa Lawande, VP of Engineering, Vertica.
Shilpa Lawande has been an integral part of the Vertica engineering team since its inception, bringing over 10 years of experience in databases, data warehousing and grid computing to Vertica. Prior to Vertica, she was a key member of the Oracle Server Technologies group where she worked directly on several data warehousing and self-managing features in the Oracle 9i and 10g databases.
Lawande is a co-inventor on several patents on query optimization, materialized views and automatic index tuning for databases. She has also co-authored two books on data warehousing using the Oracle database as well as a book on Enterprise Grid Computing.
Lawande has a Master’s in Computer Science from the University of Wisconsin-Madison and a Bachelor’s in Computer Science and Engineering from the Indian Institute of Technology, Mumbai.