On Big Data Analytics. Interview with Shilpa Lawande
“Really, I would say this is indeed the essence of Big Data – being able to harness data from millions of endpoints whether they be devices or users, and optimizing outcomes for the individual, not just for the collective!”–Shilpa Lawande.
I have been following Vertica since their acquisition by HP back in 2011. This is my third interview with Shilpa Lawande, now Vice President at Hewlett Packard Enterprise, and responsible for strategic direction of the HP Big Data Platforms, including HP Vertica Analytic Platform.
The first interview I did with Shilpa was back on November 16, 2011 (soon after the acquisition by HP), and the second on July 14, 2014.
If you read the three interviews (see links to the two previous interviews at the end of this interview), you will notice how fast the Big Data Analytics and Data Platforms world is changing.
Q1. What are the main technical challenges in offering data analytics in real time? And what are the main problems which occur when trying to ingest and analyze high-speed streaming data, from various sources?
Shilpa Lawande: Before we talk about technical challenges, I would like to point out the difference between two classes of analytic workloads that often get grouped under “streaming” or “real-time analytics”.
The first and perhaps more challenging workload deals with analytics at large scale on stored data but where new data may be coming in very fast, in micro-batches.
In this workload, challenges are twofold – the first challenge is about reducing the latency between ingest and analysis, in other words, ensuring that data can be made available for analysis soon after it arrives, and the second challenge is about offering rich, fast analytics on the entire data set, not just the latest batch. This type of workload is a facet of any use case where you want to build reports or predictive models on the most up-to-date data or provide up-to-date personalized analytics for a large number of users, or when collecting and analyzing data from millions of devices. Vertica excels at solving this problem at very large petabyte scale and with very small micro-batches.
The second type of workload deals with analytics on data in flight (sometimes called fast data) where you want to analyze windows of incoming data and take action, perhaps to enrich the data or to discard some of it or to aggregate it, before the data is persisted. An example of this type of workload might be taking data coming in at arbitrary times and granularity and keeping only the average, min, and max data points per second, minute, and hour for permanent storage. This use case is typically solved by in-memory streaming engines like Storm or, in cases where more state is needed, a NewSQL system like VoltDB, both of which we consider complementary to Vertica.
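The per-second and per-minute roll-up described in this answer can be illustrated with a small Python sketch. It is not Storm or VoltDB code; the names `WindowStats` and `roll_up` and the sample readings are hypothetical, and it assumes events arrive as (timestamp in seconds, value) pairs that fit in memory.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Running count, sum, min and max for one time window."""
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def add(self, value: float) -> None:
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def average(self) -> float:
        return self.total / self.count


def roll_up(events, window_seconds):
    """Aggregate (timestamp_seconds, value) pairs into fixed-size windows.

    Returns a dict mapping each window's start time to its WindowStats.
    """
    windows = defaultdict(WindowStats)
    for timestamp, value in events:
        # Truncate the timestamp down to the start of its window.
        window_start = int(timestamp // window_seconds) * window_seconds
        windows[window_start].add(value)
    return dict(windows)


# Hypothetical readings arriving at irregular times: (seconds, value).
readings = [(0.2, 10.0), (0.9, 14.0), (1.5, 8.0), (61.0, 20.0)]
per_second = roll_up(readings, window_seconds=1)
per_minute = roll_up(readings, window_seconds=60)
```

Only the per-window summaries would then be persisted; the raw readings can be discarded once their window has closed.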
Q2. Do you know of organizations that already today consume, derive insight from, and act on large volumes of data generated from millions of connected devices and applications?
Shilpa Lawande: HP Inc. and Hewlett Packard Enterprise (HPE) are both great examples of this kind of organization. A number of our products – servers, storage, and printers – collect telemetry about their operations and bring that data back for analysis, for purposes of quality control, predictive maintenance, and optimized inventory/parts supply chain management.
We’ve also seen organizations collect telemetry across their networks and data centers to anticipate servers going down, as well as to have better understanding of usage to optimize capacity planning or power usage. If you replace devices by users in your question, online and mobile gaming companies, social networks and adtech companies with millions of daily active users all collect clickstream data and use it for creating new and unique personalized experiences. For instance, user churn is a huge problem in monetizing online gaming.
If you can detect, from the in-game interactions, that users are losing interest, then you can immediately take action to hold their attention just a little bit longer or to transition them to a new game altogether. Companies like Game Show Network and Zynga do this masterfully using Vertica real-time analytics!
Really, I would say this is indeed the essence of Big Data – being able to harness data from millions of endpoints whether they be devices or users, and optimizing outcomes for the individual, not just for the collective!
Q3. Could you comment on the strategic decision of HP to enhance its support for Hadoop?
Shilpa Lawande: As you know, HP recently split into Hewlett Packard Enterprise (HPE) and HP Inc.
With HPE, which is where Big Data and Vertica reside, our strategy is to provide our customers with the best end-to-end solutions for their big data problems, including hardware, software and services. We believe that technologies such as Hadoop, Spark, Kafka and R are key tools in the Big Data ecosystem, and that the deep integration of our technology, such as Vertica, with these open-source tools enables us to solve our customers’ problems more holistically.
At Vertica, we have been working closely with the Hadoop vendors to provide better integrations between our products.
Some notable, recent additions include our ongoing work with Hortonworks to provide an optimized Vertica SQL-on-Hadoop version for the Orcfile data format, as well as our integration with Apache Kafka.
Q4. The new version of HPE Vertica, “Excavator,” is integrated with Apache Kafka, an open source distributed messaging system for data streaming. Why?
Shilpa Lawande: As I mentioned earlier, one of the challenges with streaming data is ingesting it in micro-batches at low latency and high scale. Vertica has always had the ability to do so due to its unique hybrid load architecture whereby data is ingested into a Write Optimized Store in-memory and then optimized and persisted to a Read-Optimized Store on disk.
Before “Excavator,” the onus for engineering the ingest architecture was on our customers. Before Kafka, users were writing custom ingestion tools from scratch using ODBC/JDBC or staging data to files and then loading using Vertica’s COPY command. Besides the challenges of achieving the optimal load rates, users commonly ran into challenges of ensuring transactionality of the loads, so that each batch gets loaded exactly once even under esoteric error conditions. With Kafka, users get a scalable distributed messaging system that enables simplifying the load pipeline.
We saw the combination of Vertica and Kafka becoming a common design pattern and decided to standardize on this pattern by providing out-of-the-box integration between Vertica and Kafka, incorporating the best practices of loading data at scale. The solution aims to maximize the throughput of loads via micro-batches into Vertica, while ensuring transactionality of the load process. It removes a ton of complexity in the load pipeline from the Vertica users.
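The "loaded exactly once" requirement mentioned in this answer is commonly met by committing each micro-batch and the consumer's position in the same transaction. The Python sketch below illustrates that general idea only; it is not the Vertica and Kafka integration itself. As assumptions, SQLite stands in for the target database, a plain list stands in for one ordered message partition, and the table names and `load_micro_batch` are hypothetical.

```python
import sqlite3


def load_micro_batch(conn, source, batch_size):
    """Load the next micro-batch and advance the offset in one transaction.

    `source` is a list standing in for a single ordered message partition.
    Returns the number of rows loaded.
    """
    with conn:  # commits on success, rolls back if an exception is raised
        (offset,) = conn.execute(
            "SELECT next_offset FROM load_progress WHERE id = 1"
        ).fetchone()
        batch = source[offset:offset + batch_size]
        conn.executemany(
            "INSERT INTO events (payload) VALUES (?)",
            [(message,) for message in batch],
        )
        # The offset moves forward only if the rows above are committed too.
        conn.execute(
            "UPDATE load_progress SET next_offset = ? WHERE id = 1",
            (offset + len(batch),),
        )
    return len(batch)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (payload TEXT)")
conn.execute(
    "CREATE TABLE load_progress (id INTEGER PRIMARY KEY, next_offset INTEGER)"
)
conn.execute("INSERT INTO load_progress VALUES (1, 0)")
conn.commit()

messages = ["a", "b", "c", "d", "e"]
while load_micro_batch(conn, messages, batch_size=2):
    pass  # keep loading until the source is drained
```

Because the rows and the offset succeed or fail together, a loader that crashes mid-batch and restarts resumes from the last committed offset, neither skipping nor duplicating messages.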
Q5. What are the pros and cons of this design choice (if any)?
Shilpa Lawande: The pros are that if you already use Kafka, much of the work of ingesting data into Vertica is done for you. Having seen so many different kinds of ingestion horror stories over the past decade, trust me, we’ve eliminated a ton of complexity that you don’t need to worry about anymore. The cons are, of course, that we are making the choice of the tool for you. We believe that the pros far outweigh any cons.
Q6. What kind of enhanced SQL analytics do you provide?
Shilpa Lawande: Great question. Vertica of course provides all the standard SQL analytic capabilities including joins, aggregations, analytic window functions, and, needless to say, performance that is a lot faster than any other RDBMS. But we do much more than that. We’ve built some unique time-series analysis (via SQL) to operate on event streams such as gap-filling and interpolation and event series joins. You can use this feature to do common operations like sessionization in three or four lines of SQL. We can do this because data in Vertica is always sorted and this makes Vertica a superior system for time series analytics. Our pattern matching capabilities enable user path or marketing funnel analytics using simple SQL, which might otherwise take pages of code in Hive or Java.
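For readers unfamiliar with the terms, sessionization and gap-filling with interpolation can be sketched in a few lines of Python. This is not Vertica SQL and says nothing about how Vertica implements these features; `sessionize` and `gap_fill` are hypothetical helpers that assume time-sorted input and, for `gap_fill`, at least two points with distinct times. The reliance on sorted input mirrors the point above about why sorted data suits time-series work.

```python
def sessionize(timestamps, timeout):
    """Assign a session number to each timestamp in a sorted sequence.

    A new session starts whenever the gap since the previous event
    exceeds `timeout` (same units as the timestamps).
    """
    session_ids = []
    current = 0
    previous = None
    for ts in timestamps:
        if previous is not None and ts - previous > timeout:
            current += 1
        session_ids.append(current)
        previous = ts
    return session_ids


def gap_fill(points, step):
    """Linearly interpolate sorted (time, value) points onto a regular grid.

    Emits one value every `step` time units, from the first observed
    time up to the last.
    """
    filled = []
    index = 0
    t = points[0][0]
    while t <= points[-1][0]:
        # Advance to the segment between two observations that contains t.
        while index < len(points) - 2 and points[index + 1][0] < t:
            index += 1
        (t0, v0), (t1, v1) = points[index], points[index + 1]
        filled.append((t, v0 + (v1 - v0) * (t - t0) / (t1 - t0)))
        t += step
    return filled


# Hypothetical click times in seconds; a gap of more than 30 s ends a session.
click_sessions = sessionize([0, 10, 25, 100, 110, 500], timeout=30)

# Hypothetical sparse sensor readings resampled every 2 time units.
resampled = gap_fill([(0, 0.0), (4, 8.0), (6, 2.0)], step=2)
```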
With the open source Distributed R engine, we provide predictive analytical algorithms such as logistic regression and page rank. These can be used to build predictive models using R, and the models can be registered into Vertica for in-database scoring. With Excavator, we’ve also added text search capabilities for machine log data, so you can now do both search and analytics over log data in one system. And you recently featured a five-part blog series by Walter Maguire examining why Vertica is the best graph analytics engine out there.
Q7. What kind of enhanced performance to Hadoop do you provide?
Shilpa Lawande: We see Hadoop, particularly HDFS, as highly complementary to Vertica. Our users often use HDFS as their data lake, for exploratory/discovery phases of their data lifecycle. Our Vertica SQL on Hadoop offering includes the Vertica engine running natively on Hadoop nodes, providing all the advanced SQL capabilities of Vertica on top of data stored in HDFS. We integrate with native metadata stores like HCatalog and can operate on file formats like Orcfiles, Parquet, JSON, Avro, etc. to provide a much more robust SQL engine compared to the alternatives like Hive, Spark or Impala, and with significantly better performance. And, of course, when users are ready to operationalize the analysis, they can seamlessly load the data into Vertica Enterprise which provides the highest performance, compression, workload management, and other enterprise capabilities for your production workloads. The best part is that you do not have to rewrite your reports or dashboards as you move data from Vertica for SQL on Hadoop to Vertica Enterprise.
Qx. Anything else you wish to add?
Shilpa Lawande: As we continue to develop the Vertica product, our goal is to provide the same capabilities in a variety of consumption and deployment models to suit different use cases and buying preferences. Our flagship Vertica Enterprise product can be deployed on-prem, in VMware environments or in AWS via an AMI.
Our SQL on Hadoop product can be deployed directly in Hadoop environments, supporting all Hadoop distributions and a variety of native data formats. We also have Vertica OnDemand, our data warehouse-as-a-service subscription that is accessible via a SQL prompt in AWS; HPE handles all of the operations such as database and OS software updates, backups, etc. We hope that by providing the same capabilities across many deployment environments and data formats, we provide our users the maximum choice so they can pick the right tool for the job. It’s all based on our signature core analytics engine.
We welcome new users to our growing community to download our Community Edition, which provides 1TB of Vertica on a three-node cluster for free, or sign up for a 15-day trial of Vertica OnDemand!
Shilpa Lawande is Vice President at Hewlett Packard Enterprise, responsible for strategic direction of the HP Big Data Platforms, including the flagship HP Vertica Analytic Platform. Shilpa brings over 20 years of experience in databases, data warehousing, analytics and distributed systems.
She joined Vertica at its inception in 2005, being one of the original engineers who built Vertica from the ground up, and running the Vertica Engineering and Customer Experience teams for the better part of the last decade. Shilpa has been at HPE since 2011 through the acquisition of Vertica and has held a diverse set of roles spanning technology and business.
Prior to Vertica, she was a key member of the Oracle Server Technologies group where she worked directly on several data warehousing and self-managing features in the Oracle Database.
Shilpa is a co-inventor on several patents on database technology, both at Oracle and at HP Vertica.
She has co-authored two books on data warehousing using the Oracle database as well as a book on Enterprise Grid Computing.
She has been named to the 2012 Women to Watch list by Mass High Tech, the Rev Boston 2015 list, and awarded HP Software Business Unit Leader of the year in 2012 and 2013. As a working mom herself, Shilpa is passionate about STEM education for Girls and Women In Tech issues, and co-founded the Datagals women’s networking and advocacy group within HPE. In her spare time, she mentors young women at Year Up Boston, an organization that empowers low-income young adults to go from poverty to professional careers in a single year.
- Uplevel Big Data analytics with HP Vertica – Part 1: Graph in a relational database? Seriously? by Walter Maguire
- Uplevel Big Data Analytics with Graph in Vertica – Part 2: Yes, you can write that in SQL by Walter Maguire
- Uplevel Big Data Analytics with Graph in Vertica – Part 3: Yes, you can make it go even faster by Walter Maguire
- Uplevel Big Data Analytics with Graph in Vertica – Part 4: It’s not your dad’s graph engine by Walter Maguire
- Uplevel Big Data Analytics with Graph in Vertica – Part 5: Putting graph to work for your business by Walter Maguire