On Cloud and Vertica 12: Q&A with Paige Roberts

If an organization wants to transition to cloud, the lessons of the past have taught us that a giant lift and shift operation is not the way to go. The most efficient path for analytics platform transitions is to move key workflows onto the new platform one at a time. It helps a great deal if the method of analytics is identical in both cases, so the end users see no difference. They can continue getting their work done without interruption while one by one, their workloads get moved over. 

Q1. In your opinion, what are the top factors that have had an adverse impact on organizations’ data analytics strategies and investments over the last 24 months?

Following analytics “fashions” is, I think, the worst, most harmful thing organizations can do when it comes to data analytics strategy and investments. A few years back, the big flashy trend was to put all your data in a Hadoop data lake, and somehow that was going to leapfrog you ahead of the competition. Now, I hear people talk about how they need a data mesh, when a lot of people still haven’t defined what that means. The data management and analytics industry changes over time, but the key to weathering the changes is to focus on what your business needs, not on the analytics fashion du jour.

Q2. Many organizations look to the public cloud for help. What is your opinion on that?

This is a good example. “Go to the cloud! It will solve all data analytics problems ever and take out stubborn stains, too!” If the only reason an organization has for going to the cloud is that it’s the cool thing to do, and everyone is doing it, then it’s probably not a good idea.

Step back and consider what your organization needs. A lot of organizations can benefit from cloud advantages like elastic infrastructure, if their workloads swing between very high and very low demand. If you are a start-up with a tiny IT staff but growing rapidly, the cloud can really help you get started easily and scale quickly. If you’re a line of business that wants to analyze its own data, without any interdepartmental aspect, cloud software can be a good way to go. The cloud has huge benefits, but that doesn’t mean it’s going to instantly solve every problem.

One thing to look at very closely is which cloud technology you should switch to. “The cloud” isn’t a monolith; there are a lot of options out there. I recently overheard one of my sales guys in a competitive POC. The company was considering Vertica, Snowflake, and Redshift. Snowflake was eliminated early because it would cost three times the company’s budget. Redshift offered a more reasonable price for performance, even considering they’d need three products, Redshift, Athena, and SageMaker, to equal the functionality in Vertica alone, but those services only work on the Amazon cloud. Vertica works on any cloud, plus on-prem, in containers, in private clouds, wherever.

Cloud is today’s fashion, today’s cool kids’ way to do things, but not locking yourself in means you can still change your mind later. Buying something that only works on one specific cloud is the ultimate in lock-in. I get lock-in shudders that remind me of the appliance days. Keeping your options open is the essence of future-proofing your architecture.

Q3. So, Vertica is as capable of deploying on cloud as it is on-premises. When should an organization go to cloud, multi-cloud, or hybrid cloud? How do they know which is right for that organization?

That’s an individual business decision. Nearly every organization can benefit from moving some aspects of the business to the cloud. Most organizations right now are hybrid, and larger ones are also overwhelmingly on more than one cloud, as many as 80% of them, according to one analyst I heard. The big thing we, as technology vendors, can do is help remove some of the risk of going to the cloud.

If an organization wants to transition to cloud, the lessons of the past have taught us that a giant lift and shift operation is not the way to go. The most efficient path for analytics platform transitions is to move key workflows onto the new platform one at a time. It helps a great deal if the method of analytics is identical in both cases, so the end users see no difference. They can continue getting their work done without interruption while one by one, their workloads get moved over.

The other aspect is that conditions change. Having disaster recovery capability in one cloud and production in another is a good option. Or production on-prem and DR in the cloud, or vice versa. And if new regulations require your company to keep its data in a specific region, or price hikes on one cloud mean it makes better business sense to be on another, keeping your options open is the way to go.

That means that the best option to reduce risk when moving to cloud is to have an analytics platform that works in multiple locations: on-prem, on whatever cloud makes sense for your business, wherever you need it.

Your business requirements should determine deployment location, not the software.

Q4. You recently announced the release of version 12 of the Vertica analytical database. What is great about it?

Well, everything I just said isn’t just words. When we say future-proof, we mean making sure you can move with the times if you need to. Version 12 is us walking the talk.

Over the past couple of years, we’ve been adding more and more deployment options. We’re the only ones on the market who support cloud-style, object-storage-backed analytics on premises for private clouds, and we’ve added even more platforms we can do that with, H3C and VAST being major ones. We already support running on every cloud, and we have a managed service version on Amazon, Vertica Accelerator, for folks who need that ease of getting going and management over time. Vertica v12 also expanded our containerized deployment support to Amazon EKS, as well as OperatorHub, to give you as much flexibility as possible.

We’ve improved our integration with open source projects, including a brand-new Node.js API, improved two-way Spark integration, and a better ability to read and write complex, hierarchical data in Parquet and ORC. Vertica is unique in that it doesn’t read that data as one big blob, but actually treats it just like fields inside its own storage format. That gives you much finer-grained queries on externally stored data, as well as import and export without modifying the schema. We also improved our query speed on those data types, which makes Vertica an even more efficient lakehouse option. And, of course, we added a bunch of essential security and governance improvements, like we do with every version.
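To make that concrete, here is a minimal sketch of defining an external table over nested Parquet data and querying a struct field like an ordinary column, using the vertica-python client. The bucket path, table, and column names are hypothetical, and the connection details are assumptions.

```python
# Hypothetical sketch: query nested Parquet data in place from Python.
import vertica_python

conn_info = {"host": "localhost", "port": 5433, "user": "dbadmin",
             "password": "", "database": "VMart"}  # assumed connection

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    # External table over Parquet files; the data stays in object storage.
    # ROW() maps the Parquet struct to queryable fields, not one big blob.
    cur.execute("""
        CREATE EXTERNAL TABLE sales (
            order_id INT,
            customer ROW(name VARCHAR, region VARCHAR)
        ) AS COPY FROM 's3://my-bucket/sales/*.parquet' PARQUET
    """)
    # Nested fields are addressed with dot notation, like native columns.
    cur.execute("SELECT customer.region, COUNT(*) FROM sales GROUP BY 1")
    for region, n in cur.fetchall():
        print(region, n)
```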

Here’s a blog post with more details if you’re interested. 

Q5. What are Vertica version 12’s major new features and enhancements for analytics and machine learning?

A lot of folks think of Vertica the way we were maybe 10 or 15 years ago, a solid data warehouse for old-school business intelligence. But we’ve had some powerful time series, geospatial, and machine learning analytic capabilities built in for years. We do a lot of telecom business, and I like to use the example of a football game where everyone films a touchdown with their phone, and they all try to upload it at once to social media. The cell network gets overwhelmed in that area. Now, if someone tries to call from one side of town to the other, the machine learning-based AI built on Vertica can, in real time, do the IoT time series analysis to detect the overloaded network, do the geospatial analysis to re-route the call, and connect the customer without them ever knowing about the problem. Our churn analysis algorithms and customer analytics then show the resulting lift in customer satisfaction.

Version 12 took that to the next level. Our geospatial analytics got some nice additions, including the ability to load 2D GeoJSON files into our GEOMETRY data storage format. We also added some new extensions to the geometry- and geography-related stored procedures.
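For illustration, here is a hedged sketch of that GeoJSON capability, assuming the v12 ST_GeomFromGeoJSON function and a vertica-python connection; the coordinates are made up.

```python
# Hypothetical sketch: convert GeoJSON text into Vertica's GEOMETRY type.
import vertica_python

with vertica_python.connect(host="localhost", port=5433, user="dbadmin",
                            database="VMart") as conn:
    cur = conn.cursor()
    # ST_GeomFromGeoJSON parses GeoJSON into a GEOMETRY value;
    # ST_AsText renders it back as WKT so we can see the result.
    cur.execute("""
        SELECT ST_AsText(ST_GeomFromGeoJSON(
            '{"type": "Point", "coordinates": [-97.75, 30.25]}'))
    """)
    print(cur.fetchone()[0])  # POINT (-97.75 30.25)
```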

A new open source version of VerticaPy came out in conjunction with v12. VerticaPy lets you use Jupyter notebooks and write Pandas- and SKLearn-style code against a virtual dataframe that never leaves the database, rather than the usual in-memory dataframe. So, all the work can be done by the database engine on the full dataset, with no need to sample. You write comfortable Python code, and Vertica does all the work at any scale, with no worries about overloading your local memory. We already had PMML import/export and TensorFlow graph import, but we’ve added Python pickle export as well, plus some nice explainability graphs for tree algorithms like random forest and XGBoost, and the newly added CHAID trees and Isolation Forest algorithms.
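Here is roughly what that workflow looks like, as a minimal sketch; the table, columns, and model name are hypothetical, and the import paths assume the VerticaPy releases current around v12.

```python
# Hypothetical sketch of the VerticaPy workflow: pandas/sklearn-style
# code whose heavy lifting all runs inside Vertica.
import verticapy as vp
from verticapy.learn.ensemble import RandomForestClassifier

vp.new_connection({"host": "localhost", "port": "5433", "user": "dbadmin",
                   "password": "", "database": "VMart"}, name="vertica")
vp.connect("vertica")

# A virtual dataframe: operations compile to SQL and run in the database,
# so the full dataset never has to fit in local memory.
churn = vp.vDataFrame("public.telecom_churn")
print(churn["day_minutes"].describe())  # computed by the Vertica engine

# In-database model training, scikit-learn style; the model lives in Vertica.
model = RandomForestClassifier("public.churn_rf", n_estimators=50)
model.fit(churn, X=["day_minutes", "intl_calls"], y="churned")
print(model.features_importance())
```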

Q6. When is going to the cloud not the right decision? 

People talk about the advantage of only paying for the compute you use, and in some cases, that is a HUGE advantage. In others, it’s a disaster. On-prem, you pay for compute once, then you can use it as many times as you need without paying for anything but power. In the cloud, you can end up paying for compute over and over and over again, racking up huge bills, especially if the software isn’t optimized for heavy analytics loads. A lot of cloud software uses the “throw more hardware at it” strategy for increasing performance and concurrency. Since you’re paying for every bit of that hardware, you might want to look closely at how the cloud software you’re considering deals with high demand.

Catch Media went all-in on the cloud, since everyone led them to believe it would save costs, and found that instead it hugely increased costs for many workloads. The CTO told me that for the cost of running his workload in the cloud, “I could re-buy the hardware every three months.” Fashion megatrend or not, it just didn’t make good business sense for his company to take their analytics to the cloud. I think a lot of companies need to take that into consideration.

You have to separate the hype from the real benefits, and look closely at what you hope to gain. If cost savings is the benefit you’re shooting for on the cloud, run the numbers very carefully before you commit. And then run them again afterward to verify that things are the way you were told they would be.

Q7. What are the disadvantages of analytics limited to cloud, or just a single cloud vendor?

The main problem isn’t now, it’s what happens tomorrow, or next month, or next year. I’ve been doing this for 25 years, and the analytics and data management industry changes constantly. Regulations change, user demands change, data sources change. Even the network itself is changing, with the telecom industry’s move to 5G. Nothing stays the same. Expecting the analytics choice you made today, and locked yourself into, to be the right choice for even the next two years, much less the next ten, is just not paying attention.

You can’t predict the future, so do your future self a favor and don’t take away their options.

Q8. On the integration front, Vertica 12 increases the interaction with the data analytics ecosystem. How can key proprietary and open-source technologies work together seamlessly? Please explain.

I think I mentioned our two-way Spark integration. A lot of folks use that for feeding pre-processed data into Vertica. Similarly, we have a great two-way connection with Kafka for feeding in streaming data, including the ability to pull in semi-structured data like logs, sensor readings, JSON, or Avro files in their original format. Then, we can do schema-on-read to pull out the data needed for an analysis. We can also automatically parse a lot of those files and put them straight into tables, if that makes better sense for your use case. And we can import, export, or analyze in place ORC and Parquet files in HDFS or any of the major cloud flavors of object storage: S3, Azure Blob Storage, GCS.
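As a small illustration of the schema-on-read piece, here is a hedged sketch using a flex table and Vertica’s built-in fjsonparser; the file path and field names are hypothetical, and in production the COPY source could just as well be a Kafka stream.

```python
# Hypothetical sketch: land raw JSON in a flex table, then read only
# the fields an analysis needs (schema-on-read).
import vertica_python

with vertica_python.connect(host="localhost", port=5433, user="dbadmin",
                            database="VMart") as conn:
    cur = conn.cursor()
    # No schema declared up front; the JSON is stored in its original shape.
    cur.execute("CREATE FLEX TABLE sensor_logs()")
    cur.execute("COPY sensor_logs FROM '/data/sensors.json' PARSER fjsonparser()")
    # Keys are queried like columns; cast them to the types you need.
    cur.execute("""
        SELECT device_id::VARCHAR, AVG(reading::FLOAT)
        FROM sensor_logs GROUP BY 1
    """)
    print(cur.fetchall())
```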

Users can also define their own extensions in Java, C++, Python, or R, and run those in Vertica as if they were native functions. We have VerticaPy for comfortable notebook use of our Python API, and other APIs in C++, Java, Go, and Node.js. A lot of cool applications are built with Vertica as the power underneath.
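For a feel of what a user-defined extension looks like, here is a minimal, hypothetical Python scalar UDx following the vertica_sdk factory pattern; the function name and tax rate are made up.

```python
# Hypothetical sketch of a Python scalar UDx for Vertica.
import vertica_sdk

class add_tax(vertica_sdk.ScalarFunction):
    """Apply a fixed tax rate to each input price."""
    def processBlock(self, server_interface, arg_reader, res_writer):
        while True:
            if arg_reader.isNull(0):
                res_writer.setNull()
            else:
                res_writer.setFloat(arg_reader.getFloat(0) * 1.0825)
            res_writer.next()          # advance the output cursor
            if not arg_reader.next():  # advance the input cursor
                break

class add_tax_factory(vertica_sdk.ScalarFunctionFactory):
    def createScalarFunction(self, srv):
        return add_tax()
    def getPrototype(self, srv_interface, arg_types, return_type):
        arg_types.addFloat()
        return_type.addFloat()
    def getReturnType(self, srv_interface, arg_types, return_type):
        return_type.addFloat()
```

Once the file is loaded with CREATE LIBRARY ... LANGUAGE 'Python' and registered with CREATE FUNCTION ... NAME 'add_tax_factory', it can be called in SQL like any built-in function.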

Q9. What are the typical use cases for Vertica and the new version 12?

Vertica is still, at its foundation, a powerful analytics database for driving reporting, dashboards, and ad hoc BI queries. Nearly all our customers use us for that, even if it isn’t the primary use case they bought Vertica for. No point in having such a slick BI database in house if you don’t use it for BI.

We dominate the telecom space, with 7 of the top 10 telecoms, both in the US and globally, using Vertica. They use Vertica for everything from geospatial, to BI, to IoT, to AIOps (AI for IT Operations) kinds of use cases. AIOps is also a common use case for Vertica in robotic manufacturing and in the computer hardware and networking industry. HPE InfoSight, powered by Vertica, is a good example of an AIOps platform; it automatically detects and resolves 85% of what used to be front-line tech support calls.

Predictive maintenance is a powerful use case for Vertica. Philips uses us to reduce downtime for medical machines worldwide to nearly zero.

Financial services companies, including a lot of major banks worldwide, use Vertica for fraud detection, risk analysis, and the like. Cardlytics uses Vertica machine learning to improve customer acquisition, satisfaction, and retention in their loyalty card business.

The Vertica analytical database delivers the best value with the highest performance for any data analytics, at any scale, anywhere.

………………………………

Paige Roberts (@RobertsPaige) has worked as an engineer, trainer, support technician, technical writer, marketer, product manager, and consultant over the last 25 years. She has built data engineering pipelines and architectures, documented and tested open source analytics implementations, spun up Hadoop clusters, picked the brains of stars in data analytics, worked with different industries, and questioned a lot of assumptions. She’s worked for companies like Pervasive, the Bloor Group, Actian, Hortonworks, Syncsort, and Vertica. Recently, she contributed to “97 Things Every Data Engineer Should Know” and co-authored “Accelerate Machine Learning with a Unified Analytics Architecture,” both from O’Reilly. She promotes understanding of Vertica, distributed data processing, open source, high-scale data engineering architecture, and how the analytics revolution is changing the world.

Sponsored by Micro Focus Vertica
