On Big Data Velocity. Interview with Scott Jarr.
“There is only so much static data in the world as of today. The vast majority of new data, the data that is said to explode in volume over the next 5 years, is arriving from a high velocity source. It’s funny how obvious it is when you think about it. The only way to get Big Data in the future is to have it arrive in a high velocity rate” – Scott Jarr.
One of the key technical challenges of Big Data is (Data) Velocity. On that, I have interviewed Scott Jarr, Co-founder and Chief Strategy Officer of VoltDB.
Q1. Marc Geall, past Head of European Technology Research at Deutsche Bank AG/London, writes about the “Big Data myth”, claiming that there is:
1) limited need of petabyte-scale data today,
2) very low proportion of databases in corporate deployment which requires more than tens of TB of data to be handled, and
3) lack of availability and high cost of highly skilled operators (often post-doctoral) to operate highly scalable NoSQL clusters.
What is your take on this?
Scott Jarr: Interestingly I agree with a lot of this for today. However, I also believe we are in the midst of a massive shift in business to what I call data-as-a-priority.
We are just beginning, but you can already see the signs. People are loath to get rid of anything, sensors are capturing finer resolutions, and people want to make far more data-informed decisions.
I also believe that the value that corporate IT teams were able to extract from data with the advent of data warehouses really whetted the appetite for what could be done with data. We are now seeing people ask questions like “why can’t I see this faster,” or “how do we use this incoming data to better serve customers,” or “how can we beat the other guys with our data.”
Data is becoming viewed as a corporate weapon. Add inbound data rates (velocity) combined with the desire to use data for better decisions and you have data sizes that will dwarf what is considered typical today. And almost no industry is excluded. The cost ceiling has collapsed.
Q2: What are the typical problems that are currently holding back many Big Data projects?
Scott Jarr: 1) Spending too much time trying to figure out what solution to use for what problem. We were seeing this so often that we created a graphic and presentation that addresses this topic. We called it the Data Continuum.
2) Putting out fires that the current data environment is causing. Most infrastructures aren’t ready for the volume or velocity of data that is already starting to arrive at their doorsteps. They are spending a ton of time dealing with band-aids on small-data infrastructure and are unable to shake free to focus on the Big Data infrastructure that will be a longer-term fix.
3) Not being able to clearly articulate the business value the company expects to achieve from a Big Data project has a way of slowing things down in a radical way.
4) Most of the successes in Big Data projects today are in situations where the company has done a very good job maintaining a reasonable scope to the project.
Q3: Why is it important to solve the Velocity problem when dealing with Big Data projects?
Scott Jarr: There is only so much static data in the world as of today. The vast majority of new data, the data that is said to explode in volume over the next 5 years, is arriving from a high velocity source. It’s funny how obvious it is when you think about it. The only way to get Big Data in the future is to have it arrive in a high velocity rate.
Companies are recognizing the business value they can get by acting on that data as it arrives rather than depositing it in a file to be batch processed at some later date. So much of the context that makes that data valuable is lost when it is not acted on quickly.
Q4: What exactly is Big Data Velocity? Is Big Data Velocity the same as stream computing?
Scott Jarr: We think of Big Data Velocity as data that is coming into the organization at a rate that can’t be managed in the traditional database. However, companies want to extract the most value they can from that data as it arrives. We see them doing three specific things:
1) Ingesting relentless feed(s) of data;
2) Making decisions on each piece of data as it arrives; and
3) Using real-time analytics to derive immediate insights into what is happening with this velocity data.
Making the best possible decision each time data is touched is what velocity is all about. These decisions used to be called transactions in the OLTP world. They involve using other data stored in the database to make a decision: approve a transaction, serve the ad, authorize the access, etc. These decisions, and the real-time analytics that support them, all require the context of other data. In other words, the database used to make these decisions must hold some amount of previously processed data; it must hold state. Streaming systems are good at a different set of problems.
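To make the idea of a stateful, per-event decision concrete, here is a minimal Python sketch. It is not VoltDB code: the `VelocityStore` class, the `spend_limit` rule, and the sample events are all hypothetical, and a plain in-memory dictionary stands in for the database. It only illustrates the three steps described above: ingest each event, decide on it using previously stored data, and keep a running analytic.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class VelocityStore:
    """Toy in-memory stand-in for a stateful store (hypothetical, not VoltDB)."""

    spend_limit: float = 100.0  # assumed per-account limit, for illustration only
    spent: dict = field(default_factory=lambda: defaultdict(float))
    approved: int = 0
    declined: int = 0

    def decide(self, account: str, amount: float) -> bool:
        """Approve an event only if stored state plus this amount stays within the limit."""
        if self.spent[account] + amount <= self.spend_limit:
            self.spent[account] += amount  # the decision itself updates the state
            self.approved += 1
            return True
        self.declined += 1
        return False

    def approval_rate(self) -> float:
        """Running analytic maintained alongside the decisions."""
        total = self.approved + self.declined
        return self.approved / total if total else 0.0


store = VelocityStore()
# Hypothetical incoming feed of (account, amount) events.
events = [("a", 60.0), ("b", 30.0), ("a", 50.0), ("a", 40.0)]
decisions = [store.decide(account, amount) for account, amount in events]
print(decisions, store.approval_rate())
```

The third event is declined only because the store remembers the first one. A system that looked at each event in isolation could not make that call, which is the sense in which the decision requires state.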
Q5: Two other critical factors often mentioned for Big Data projects are: 1) Data discovery: How to find high-quality data from the Web? and 2) Data Veracity: How can we cope with uncertainty, imprecision, missing values, mis-statements or untruths? Any thoughts on these?
Scott Jarr: We have a number of customers who are using VoltDB to improve data quality within their organization. We have one customer who is examining incoming financial events and looking for gaps in sequence numbers to detect lost or misordered information. Likewise, a popular use case is to filter out bad data as it comes in by checking it, in its high velocity state, against a known set of bad or good characteristics. This keeps much of the bad data from ever entering the data pipeline.
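The two data-quality checks mentioned here can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the customer's implementation: the function names, the event fields (`source`, `price`), and the `KNOWN_BAD_SOURCES` set are all hypothetical.

```python
def find_sequence_issues(seq_numbers):
    """Return (missing, late) for a stream of integer sequence numbers.

    missing: numbers skipped and never seen afterwards.
    late: numbers that arrived at or below the highest number already seen
          (out-of-order arrivals or duplicates).
    """
    missing = set()
    late = []
    highest = None
    for seq in seq_numbers:
        if highest is None:
            highest = seq
        elif seq > highest:
            # Everything between the previous high-water mark and seq was skipped.
            missing.update(range(highest + 1, seq))
            highest = seq
        else:
            late.append(seq)
            missing.discard(seq)  # a late arrival fills its gap
    return sorted(missing), late


# Hypothetical set of known-bad characteristics to filter on.
KNOWN_BAD_SOURCES = {"test-feed", "unknown"}


def is_clean(event):
    """Keep an event only if its source is not known-bad and its price is positive."""
    return event.get("source") not in KNOWN_BAD_SOURCES and event.get("price", 0) > 0


missing, late = find_sequence_issues([1, 2, 4, 3, 7, 8])
feed = [
    {"source": "exchange-1", "price": 10.5},
    {"source": "test-feed", "price": 9.0},
    {"source": "exchange-2", "price": -1.0},
]
clean = [event for event in feed if is_clean(event)]
print(missing, late, clean)
```

In this sample stream, 3 arrives late and is reported as misordered, while 5 and 6 never arrive and are reported as lost. Only the first event in `feed` passes the filter.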
Q6: Scalability has three aspects: data volume, hardware size, and concurrency. Scale and performance requirements for Big Data strain conventional databases. Which database technology is best to scale to petabytes?
Scott Jarr: VoltDB is focused on a very different problem, which is how to process that data prior to it landing in the long-term petabyte system. We see customers deploying VoltDB in front of both MPP OLAP and Hadoop, in roughly the same numbers. It really all depends on what the customer is ultimately trying to do with the data once it settles into its resting state in the petabyte store.
Q7: A/B testing, sessionization, bot detection, and pathing analysis all require powerful analytics on many petabytes of semi-structured Web data. Do you have some customer examples in this area?
Scott Jarr: Absolutely. Taken broadly, this is one of the most common uses of VoltDB. Micro-segmentation and on-the-fly ad content optimization are examples that we see regularly. The ability to design an ad, in real-time, based on five sets of audience meta-data can have a radical impact on performance.
Q8: When would you recommend to store Big Data in a traditional Data Warehouse and when in Hadoop?
Scott Jarr: My experience here is limited. As I said, our customers are using VoltDB in front of both types of stores to do decisioning and real-time analytics before the data moves into the long-term store. Often, when the data is highly structured, it goes into a data warehouse, and when it is less structured, it goes into Hadoop.
Q9: Instead of stand-alone products for ETL, BI/reporting and analytics wouldn’t it be better to have a seamless integration? In what ways can we open up a data processing platform to enable applications to get closer?
Scott Jarr: This is very much in line with our vision of the world. As Mike (Stonebraker, VoltDB founder) has stated for years, in high performance data systems, you need to have specialized databases. So we see the new world having far more data pipelines than stand-alone databases. A data pipeline will have seamless integrations between velocity stores, warehouses, BI tools and exploratory analytics. Standards go a long way to making these integrations easier.
Q10: Anything you wish to add?
Scott Jarr: Thank you Roberto. Very interesting discussion.
VoltDB Co-founder and Chief Strategy Officer Scott Jarr. Scott brings more than 20 years of experience building, launching and growing technology companies from inception to market leadership in highly competitive environments.
Prior to joining VoltDB, Scott was VP Product Management and Marketing at on-line backup SaaS leader LiveVault Corporation. While at LiveVault, Scott was key in growing the recurring revenue business to 2,000 customers strong, leading to an acquisition by Iron Mountain. Scott has also served as board member and advisor to other early-stage companies in the search, mobile, security, storage and virtualization markets. Scott has an undergraduate degree in mathematical programming from the University of Tampa and an MBA from the University of South Florida.
– Big Data: Challenges and Opportunities.
Roberto V. Zicari, October 5, 2012.
Abstract: In this presentation I review three current aspects related to Big Data:
1. The business perspective, 2. The Technology perspective, and 3. Big Data for social good.