VoltDB – ODBMS Industry Watch

On RDBMS, NoSQL and NewSQL databases. Interview with John Ryan

Roberto V. Zicari — Fri, 09 Mar 2018 11:05:17 +0000

“The single most important lesson I’ve learned is to keep it simple. I find designers sometimes deliver over-complex, generic solutions that could (in theory) do anything, but in reality are remarkably difficult to operate, and often misunderstood.”–John Ryan

I have interviewed John Ryan, Data Warehouse Solution Architect (Director) at UBS.

RVZ

Q1. You are an experienced Data Warehouse architect, designer and developer. What are the main lessons you have learned in your career?

John Ryan: The single most important lesson I’ve learned is to keep it simple. I find designers sometimes deliver over-complex, generic solutions that could (in theory) do anything, but in reality are remarkably difficult to operate, and often misunderstood. I believe this stems from a lack of understanding of the requirement – the second most important lesson.

Everyone from the senior stakeholders to architects, designers and developers need to fully understand the goal. Not the solution, but the “problem we’re trying to solve”. End users never ask for what they need (the requirement), but instead, they describe a potential solution. IT professionals are by nature delivery focused, and, get frustrated when it appears “the user changed their mind”. I find the user seldom changes their mind. In reality, the requirement was never fully understood.

To summarise. Focus on the problem not the solution. Then (once understood), suggest a dozen solutions and pick the best one. But keep it simple.

Q2. How has the Database industry changed in the last 20 years?

John Ryan: On the surface, not a lot. As recent as 2016 Gartner estimated Oracle, Microsoft and IBM accounted for 80% of the commercial database market, but that hides an underlying trend that’s disrupting this $50 billion industry.

Around the year 2000, the primary options were Oracle, DB2 or SQL Server with data warehouse appliances from Teradata and Netezza. Fast forward to today, and the database engine rankings include over 300 databases of which 50% are open source, and there are over 11 categories including Graph, Wide column, Key-value and Document stores; each suited to a different use-case.

While relational databases are still popular, 4 of the top 10 most popular solutions are non-relational (classed as NoSQL), including MongoDB, Redis and Cassandra. Cross-reference this against a of survey of the most highly sought skills, and we find MongoDB, Redis and Cassandra again in the top ranking, with nearly 40% of respondents seeking MongoDB skills compared to just 12% seeking Oracle expertise.

Likewise open source databases make up 60% of the top ten ranking, with open source database MySQL in second place behind Oracle in the rankings, and Gartner states that “By 2018, more than 70% of new in-house applications will be developed on an [Open Source] DBMS”.

The move towards cloud computing and database-as-a-service is causing further disruption in the data warehouse space with cloud and hybrid challengers including Vertica, Amazon Redshift and Snowflake.

In conclusion, the commercial relational vendors currently dominate the market in revenue terms. However, there has been a remarkable growth of open source alternatives, and huge demand for solutions to handle high velocity unstructured and semi-structured data. These use-cases including social media, and the Internet of Things are ill-suited to the legacy structured databases provided by Oracle, DB2 and SQL Server, and this void has been largely filled by open source NoSQL and NewSQL databases.

Q3. RDBMS vs. NoSQL vs. NewSQL: How do you compare Database Technologies?

John Ryan: The traditional RDBMS solutions from Oracle, IBM and Microsoft implement the relational model on a 1970s hardware architecture, and typically provide a good general purpose database platform which can be applied to most OLTP and Data Warehouse use cases.

However, as Dr. Michael Stonebraker indicated in this 2007 paper, The End of an Architectural Era (It’s Time for a Complete Rewrite), these are no longer fit for purpose, as both the hardware technology, and processing demands have moved on. In particular, the need for real time (millisecond) performance, greater scalability to handle web-scale applications, and the need to handle unstructured and semi-structured data.

Whereas the legacy RDBMS is a general purpose (will do anything) database, the NoSQL and NewSQL solutions are dedicated to a single problem, for example, short lived OLTP operations.

The Key-Value NoSQL databases were developed to handle the massive transaction volume, and low latency needed to handle web commerce at Amazon and LinkedIn. Others (eg. MongoDB) where developed to handle semi-structured data, while still others (eg. Neo4J) were built to efficiently model data networks of the type found at Facebook or LinkedIn.

The common thread with NoSQL databases is they tend to use an API interface rather than industry standard SQL, although increasingly that’s changing. They do however, entirely reject the relational model and ACID compliance. They typically don’t support cross-table join operations, and are entirely focused on low latency, trading consistency for scalability.

The so-called NewSQL databases include VoltDB , MemSQL and CockroachDB are a return to the relational model, but re-architected for modern hardware and web scale use cases. Like NoSQL solutions, they tend to run on a shared nothing architecture, and scale to millions of transactions per second, but they also have full transaction support and ACID compliance that are critical for financial operations.

Q4. What are the typical trade-off of performance and consistency when using NoSQL and NewSQL databases to support high velocity OLTP and real time analytics?

John Ryan: The shared nothing architecture is built to support horizontal scalability, and when combined with data replication, can provide high availability and good performance. If one node in the cluster fails, the system continues, as the data is available on other nodes. The NoSQL database is built upon this architecture, and to maximize throughput, ACID compliance is relaxed in favor of Eventual Consistency, and in some cases (eg. Cassandra), it supports tunable consistency, allowing the developer to trade performance for consistency, and durability.

For example, after a write operation, the change cannot be considered durable (the D in ACID) until the change is replicated to at least one, ideally two other nodes , but this would increase latency, and reduce performance. It’s possible however, to relax this constraint, and return immediately, with the risk the change may be lost if the node crashes before the data is successfully replicated. This becomes even more of a potential issue if the node is temporarily disconnected from the network, but is allowed to continue accepting transactions until the connection is restored. In practice, consistency will be eventually be achieved when the connection is reestablished – hence the term Eventual Consistency.

A NewSQL database on the other hand accepts no such compromise, and some databases (eg. VoltDB), even support full serializability, executing transactions as if they were executed serially. Impressively, they manage this impressive feat at a rate of millions of transactions per second, potentially on commodity hardware.

Q5. One of the main challenges for real time systems architects is the potentially massive throughput required which could exceed a million transactions per second. How do handle such a challenge?

John Ryan: The short answer is – with care! The longer answer is described in my article, Big Data – Velocity. I’d break the problem into three components, Data Ingestion, Transformation and Storage.

Data ingestion requires message based middleware (eg. Apache Kafka), with a range of adapters and interfaces, and the ability to smooth out the potentially massive spikes in velocity, with the ability to stream data to multiple targets.

Transformation, typically requires an in-memory data streaming solution to restructure and transform data in near-real time. Options include Spark Streaming, Storm or Flink.

Storage and Analytics is sometimes handled by a NoSQL database, but for application simplicity (avoiding the need to implement transactions or handle eventual consistency problems in the application), I’d recommend a NewSQL database.
All the low-latency, high throughput of the NoSQL solutions, but with the flexibility and ease of a full relational database, and full SQL support.

In conclusion, the solution needs to abandon the traditional batch oriented solution in favour of an always-on streaming solution with all processing in memory.

Q6. Michael Stonebraker introduced the so called “One Size no longer fits all”-concept. Has this concept come true on the database market?

John Ryan: First stated in the paper One Size Fits All – An Idea Whose Time Has Come And Gone Dr. Michael Stonebraker argued that the legacy RDBMS dominance was at an end, and would be replaced by specialized database technology including stream processing, OLTP and Data Warehouse solutions.

Certainly disruption in the Data Warehouse database market has been accelerated with the move towards the cloud, and as this Gigaom Report illustrates, there are at least nine major players in the market, with new specialized tools including Google Big Query, Amazon Redshift and Snowflake, and the column store (in memory or on secondary storage) dominates.

Finally, the explosion of specialized NoSQL and NewSQL databases, each with its own specialty including Key-Value, Document Stores, Text Search and Graph databases lend credence to the statement “One Size no longer fits all”.

I do think however, we’re still in a transformation stage, and the shake-out is not yet complete. I think a lot of large corporations (especially Financial Services) are wary of change, but it’s already happening.

I think the quote Stewart Brand is appropriate: “Once a new technology rolls over you, if you’re not part of the steamroller, your part of the road”.

Q7. Eventual consistency vs ACID. A pragmatic approach or a step too far?

John Ryan: As with so many decisions in IT, it depends. Eventual Consistency was built into the Amazon Dynamo database as a pragmatic decision, because it’s difficult to maintain high throughput and availability at scale. Amazon accepted the relatively minor risk of inconsistencies because of the huge benefits including scalability and availability. In many other web scale applications (eg. Social media), the implications of temporary inconsistency are not important, and it’s therefore an acceptable approach.

Having said that, it does place a huge addition burden on the developer to code for relatively rare and unexpected conditions, and I’d question why would anyone settle for a database which supported eventual consistency, when full ACID compliance and transaction handling is available.

Q8. What is your take on the Lambda Architecture ?

John Ryan: The Lambda Architecture is a an approach to handle massive data volumes, and provide real time results by running two parallel solutions, a Batch Processing and a real time Speed Processing stream.

As a Data Warehouse and ETL designer, I’m aware of the complexity involved in data transformation, and I was immediately concerned about an approach which involved duplicating the logic, probably in two different technologies.

I’d also question the sense of repeatedly executing batch processing on massive data volumes when processing is moving to the cloud, on a pay for what you use basis.

I’ve since found this article on Lambda by Jay Kreps of LinkedIn and more recently Confluent (who developed Apache Kafka), and he describes the challenges from a position of real (and quite painful) experience.

The article recommends an alternative approach, a single Speed Processing stream with slight changes to allow data to be reprocessed (the primary advantage of the Lambda architecture). This solution, named the Kappa Architecture, is based upon the ability of Kafka to retain and re-play data, and it seems an entirely sensible and more logical approach.

Qx Anything else you wish to add?

John Ryan: Thank you for this opportunity. It may be a Chinese curse that “may you live in interesting times”, but I think it’s a fascinating time to be working in the database industry.

—————————–

John Ryan, Data Warehouse Solution Architect (Director), UBS.

John has over 30 years experience in the IT industry, and has maintained a keen interest in database technology since his University days in the 1980s when he invented a B-Tree index, only to find Oracle (and several others) had got there already.

He’s worked as a Data Warehouse Architect and Designer for the past 20 years in a range of industries including Mobile Communications, Energy and Financial Services. He’s regularly found writing articles on Big Data and database architecture, and you can find him on LinkedIn.

Resources

– ODBMS.org: BIG DATA, ANALYTICAL DATA PLATFORMS, DATA SCIENCE

– ODBMS.org: NEWSQL, XML, RDF DATA STORES, RDBMS

– ODBMS.org: NOSQL DATA STORES

Related Posts

– On the InterSystems IRIS Data Platform ,ODBMS Industry Watch, 2018-02-09

– Facing the Challenges of Real-Time Analytics. Interview with David Flower, ODBMS Industry Watch, 2017-12-19

– On the future of Data Warehousing. Interview with Jacque Istok and Mike Waas, ODBMS Industry Watch, 2017-11-09

– On Vertica and the new combined Micro Focus company. Interview with Colin Mahony, ODBMS Industry Watch, 2017-10-25

– On Open Source Databases. Interview with Peter Zaitsev, ODBMS Industry Watch, 2017-09-06

Follow us on Twitter: @odbmsorg

Facing the Challenges of Real-Time Analytics. Interview with David Flower

Roberto V. Zicari — Tue, 19 Dec 2017 19:24:11 +0000

“We are now seeing a number of our customers in financial services adopt a real-time approach to detecting and preventing fraudulent credit card transactions. With the use of ML integrating into the real-time rules engine within VoltDB, the transaction can be monitored, validated and either rejected or passed, before being completed, saving time and money for both the financial institution and the consumer.”–David Flower.

I have interviewed David Flower, President and Chief Executive Officer of VoltDB. We discussed his strategy for VoltDB, and the main data challenges enterprises face nowadays in performing real-time analytics.

RVZ

Q1. You joined VoltDB as Chief Revenue Officer last year, and since March 29, 2017 you have been appointment to the role of President and Chief Executive Officer. What is your strategy for VoltDB?

David Flower : When I joined the company we took a step back to really understand our business and move from the start-up phase to growth stage. As with all organizations, you learn from what you have achieved but you also have to be honest with what your value is. We looked at 3 fundamentals;
1) Success in our customer base – industries, use cases, geography
2) Market dynamics
3) Core product DNA – the underlying strengths of our solution, over and above any other product in the market

The outcome of this exercise is we have moved from a generic veneer market approach to a highly focused specialized business with deep domain knowledge. As with any business, you are looking for repeatability into clearly defined and understood market sectors, and this is the natural next phase in our business evolution and I am very pleased to report that we have made significant progress to date.

With the growing demand for massive data management aligned with real-time decision making, VoltDB is well positioned to take advantage of this opportunity.

Q2. VoltDB is not the only in-memory transactional database in the market. What is your unique selling proposition and how do you position VoltDB in the broader database market?

David Flower : The advantage of operating in the database market is the pure size and scale that it offers – and that is also the disadvantage. You have to be able to express your target value. Through our customers and the strategic review we undertook, we are now able to express more clearly what value we have and where, and equally importantly, where we do not play! Our USP’s revolve around our product principles – vast data ingestion scale, full ACID consistency and the ability to undertake real-time decisioning, all supported through a distributed low-latency in-memory architecture, and we embrace traditional RDBMS through SQL to leverage existing market skills, and reduce the associated cost of change. We offer a proven enterprise grade database that is used by some of the World’s leading and demanding brands, a fact that many other companies in our market are unable to do.

Q3. VoltDB was founded in 2009 by a team of database experts, including Dr. Michael Stonebraker (winner of the ACM Turing award). How much of Stonebraker`s ideas are still in VoltDB and what is new?

David Flower : We are both proud and privileged to be associated with Dr. Stonebraker, and his stature in the database arena is without comparison. Mike’s original ideas underpin our product philosophy and our future direction, and he continues to be actively engaged in the business and will always remain a fundamental part of our heritage. Through our internal engineering experts and in conjunction with our customers, we have developed on Mike’s original ideas to bring additional features, functions and enterprise grade capabilities into the product.

Q4. Stonebraker co-founded several other database companies. Before VoltDB, in 2005, Stonebraker co-founded Vertica to commercialize the technology behind C-Store; and after VoltDB, in 2013 he co-founded another company called Tamr. Is there any relationship between Vertica, VoltDB and Tamr (if any)?

David Flower : Mike’s legacy in this field speaks for itself. VoltDB evolved from the Vertica business and while we have no formal ties, we are actively engaged with numerous leading technology companies that enable clients to gain deeper value through close integrations.

Q5. VoltDB is a ground-up redesign of a relational database. What are the main data challenges enterprises face nowadays in performing real-time analytics?

The demand for ‘real-time’ is one of the most challenging areas for many businesses today. Firstly, the definition of real-time is changing. Batch or micro-batch processing is now unacceptable – whether that be for the consumer, customer and in some cases for compliance. Secondly, analytics is also moving from the back-end (post event) to the front-end (in-event or in-process).
The drivers around AI and ML are forcing this even more. The market requirement is now for real-time analytics but what is the value of this if you cannot act on it? This is where VoltDB excels – we enable the action on this data, in process, and when the data/time is most valuable. VoltDB is able to truly deliver on the value of translytics – the combination of real-time transactions with real-time analytics, and we can demonstrate this through real use cases.

Q6. VoltDB is specialized in high-velocity applications that thrive on fast streaming data. What is fast streaming data and why does it matter?

David Flower : As previously mentioned, VoltDB is designed for high volume data streams that require a decision to be taken ‘in-stream’ and is always consistent. Fast streaming data is best defined through real applications – policy management, authentication, billing as examples in telecoms; fraud detection & prevention in finance (such as massive credit card processing streams); customer engagement offerings in media & gaming; and areas such as smart-metering in IoT.
The underlying principle being that the window of opportunity (action) is available in the fast data stream process, and once passed the opportunity value diminishes.

Q7. You have recently announced an “Enterprise Lab Program” to accelerate the impact of real-time data analysis at large enterprise organizations. What is it and how does it work?

David Flower : The objective of the Enterprise Lab Program is to enable organizations to access, test and evaluate our enterprise solution within their own environment and determine the applicability of VoltDB for either the modernization of existing applications or for the support of next gen applications. This comes without restriction, and provides full access to our support, technical consultants and engineering resources. We realize that selecting a database is a major decision and we want to ensure the potential of our product can be fully understood, tested and piloted with access to all our core assets.

Q8. You have been quoted saying that “Fraud is a huge problem on the Internet, and is one of the most scalable cybercrimes on the web today. The only way to negate the impact of fraud is to catch it before a transaction is processed”. Is this really always possible? How do you detect a fraud in practice?

David Flower : With the phenomenal growth in e-commerce and the changing consumer demands for web-driven retailing, the concerns relating to fraud (credit card) are only going to increase. The internet creates the challenge of handling massive transaction volumes, and cyber criminals are becoming ever more sophisticated in their approach.
Traditional fraud models simply were not designed to manage at this scale, and in many cases post-transaction capture is too late – the damage has been done. We are now seeing a number of our customers in financial services adopt a real-time approach to detecting and preventing fraudulent credit card transactions. With the use of ML integrating into the real-time rules engine within VoltDB, the transaction can be monitored, validated and either rejected or passed, before being completed, saving time and money for both the financial institution and the consumer. By using the combination of post- analytics and ML, the most relevant, current and effective set of rules can be applied as the transaction is processed.

Q9. Another area where VoltDB is used is in mobile gaming. What are the main data challenges with mobile gaming platforms?

David Flower : Mobile gaming is a perfect example of fast data – large data streams that require real-time decisioning for in-game customer engagement. The consumer wants the personal interaction but with relevant offers at that precise moment in the game. VoltDB is able to support this demand, at scale and based on the individual’s profile and stage in the application/game. The concept of the right offer, to the right person, at the right time ensures that the user remains loyal to the game and the game developer (company) can maximize its revenue potential through high customer satisfaction levels.

Q11. Can you explain the purpose of VoltDB`s recently announced co-operations with Huawei and Nokia?

David Flower : We have developed close OEM relationships with a number of major global clients, of which Huawei and Nokia are representative. Our aim is to be more than a traditional vendor, and bring additional value to the table, be it in the form of technical innovation, through advanced application development, or in terms of our ‘total company’ support philosophy. We also recognize that infrastructure decisions are critical by nature, and are not made for the short-term.
VoltDB has been rigorously tested by both Huawei and Nokia and was selected for several reasons against some of the world’s leading technologies, but fundamentally because our product works – and works in the most demanding environments providing the capability for existing and next-generation enterprise grade applications.

—————

David Flower brings more than 28 years of experience within the IT industry to the role of President and CEO of VoltDB. David has a track record of building significant shareholder value across multiple software sectors on a global scale through the development and execution of focused strategic plans, organizational development and product leadership.

Before joining VoltDB, David served as Vice President EMEA for Carbon Black Inc. Prior to Carbon Black he held senior executive positions in numerous successful software companies including Senior Vice President International for Everbridge (NASDAQ: EVBG); Vice President EMEA (APM division) for Compuware (formerly NASDAQ: CPWR); and UK Managing Director and Vice President EMEA for Gomez. David also held the position of Group Vice President International for MapInfo Corp. He began his career in senior management roles at Lotus Development Corp and Xerox Corp – Software Division.

David attended Oxford Brookes University where he studied Finance. David retains strong links within the venture capital investment community.

Resources

– eBook: Fast Data Use Cases for Telecommunications. Ciara Byrne 2017 O’Reilly Media. ( LINK to .PDF (registration required)

– Fast Data Pipeline Design: Updating Per-Event Decisions by Swapping Tables. July 11, 2017 BY JOHN PIEKOS, VoltDB

– VoltDB Extends Open Source Capabilities for Development of Real-Time Applications · OCTOBER 24, 2017

– New VoltDB Study Reveals Business and Psychological Impact of Waiting · OCTOBER 11, 2017

– VoltDB Accelerates Access to Translytical Database with Enterprise Lab Program · SEPTEMBER 29, 2017

– Huawei Selects VoltDB to Power Financial Services Fraud Detection Solution FusionInsight · APRIL 19, 2017

– Nokia Selects VoltDB for Cloud Mobility Manager and Cloud Mobility Gateway Solutions · APRIL 19, 2017

– Internet of Things: Safety, Security and Privacy. Interview with Vint G. Cerf, ODBMS Indutry Watch, June 11, 2017

Follow us on Twitter: @odbmsorg

On Big Data Analytics. Interview with Shilpa Lawande

Roberto V. Zicari — Thu, 10 Dec 2015 08:45:28 +0000

“Really, I would say this is indeed the essence of Big Data – being able to harness data from millions of endpoints whether they be devices or users, and optimizing outcomes for the individual, not just for the collective!”–Shilpa Lawande.

I have been following Vertica since their acquisition by HP back in 2011. This is my third interview with Shilpa Lawande, now Vice President at Hewlett Packard Enterprise, and responsible for strategic direction of the HP Big Data Platforms, including HP Vertica Analytic Platform.
The first interview I did with Shilpa was back on November 16, 2011 (soon after the acquisition by HP), and the second on July 14, 2014.
If you read the three interviews (see links to the two previous interviews at the end of this interview), you will notice how fast the Big Data Analytics and Data Platforms world is changing.

RVZ

Q1. What are the main technical challenges in offering data analytics in real time? And what are the main problems which occur when trying to ingest and analyze high-speed streaming data, from various sources?

Shilpa Lawande: Before we talk about technical challenges, I would like to point out the difference between two classes of analytic workloads that often get grouped under “streaming” or “real-time analytics”.

The first and perhaps more challenging workload deals with analytics at large scale on stored data but where new data may be coming in very fast, in micro-batches.
In this workload, challenges are twofold – the first challenge is about reducing the latency between ingest and analysis, in other words, ensuring that data can be made available for analysis soon after it arrives, and the second challenge is about offering rich, fast analytics on the entire data set, not just the latest batch. This type of workload is a facet of any use case where you want to build reports or predictive models on the most up-to-date data or provide up-to-date personalized analytics for a large number of users, or when collecting and analyzing data from millions of devices. Vertica excels at solving this problem at very large petabyte scale and with very small micro-batches.

The second type of workload deals with analytics on data in flight (sometimes called fast data) where you want to analyze windows of incoming data and take action, perhaps to enrich the data or to discard some of it or to aggregate it, before the data is persisted. An example of this type of workload might be taking data coming in at arbitrary times with granularity and keeping the average, min, and max data points per second, minute, hour for permanent storage. This use case is typically solved by in-memory streaming engines like Storm or, in cases where more state is needed, a NewSQL system like VoltDB, both of which we consider complementary to Vertica.

Q2. Do you know of organizations that already today consume, derive insight from, and act on large volume of data generated from millions of connected devices and applications?

Shilpa Lawande: HP Inc. and Hewlett Packard Enterprise (HPE) are both great examples of this kind of an organization. A number of our products – servers, storage, and printers all collect telemetry about their operations and bring that data back to analyze for purposes of quality control, predictive maintenance, as well as optimized inventory/parts supply chain management.
We’ve also seen organizations collect telemetry across their networks and data centers to anticipate servers going down, as well as to have better understanding of usage to optimize capacity planning or power usage. If you replace devices by users in your question, online and mobile gaming companies, social networks and adtech companies with millions of daily active users all collect clickstream data and use it for creating new and unique personalized experiences. For instance, user churn is a huge problem in monetizing online gaming.
If you can detect, from the in-game interactions, that users are losing interest, then you can immediately take action to hold their attention just a little bit longer or to transition them to a new game altogether. Companies like Game Show Network and Zynga do this masterfully using Vertica real-time analytics!

Really, I would say this is indeed the essence of Big Data – being able to harness data from millions of endpoints whether they be devices or users, and optimizing outcomes for the individual, not just for the collective!

Q3. Could you comment on the strategic decision of HP to enhance its support for Hadoop?

Shilpa Lawande: As you know HP recently split into Hewlett Packard Enterprise (HPE) and HP Inc.
With HPE, which is where Big Data and Vertica resides, our strategy is to provide our customers with the best end-to-end solutions for their big data problems, including hardware, software and services. We believe that technologies Hadoop, Spark, Kafka and R are key tools in the Big Data ecosystem and the deep integration of our technology such as Vertica and these open-source tools enables us to solve our customers’ problems more holistically.
At Vertica, we have been working closely with the Hadoop vendors to provide better integrations between our products.
Some notable, recent additions include our ongoing work with Hortonworks to provide an optimized Vertica SQL-on-Hadoop version for the Orcfile data format, as well as our integration with Apache Kafka.

Q4. The new version of HPE Vertica, “Excavator,” is integrated with Apache Kafka, an open source distributed messaging system for data streaming. Why?

Shilpa Lawande: As I mentioned earlier, one of the challenges with streaming data is ingesting it in micro- batches at low latency and high scale. Vertica has always had the ability to do so due to its unique hybrid load architecture whereby data is ingested into a Write Optimized Store in-memory and then optimized and persisted to a Read-Optimized Store on disk.
Before “Excavator,” the onus for engineering the ingest architecture was on our customers. Before Kafka, users were writing custom ingestion tools from scratch using ODBC/JDBC or staging data to files and then loading using Vertica’s COPY command. Besides the challenges of achieving the optimal load rates, users commonly ran into challenges of ensuring transactionality of the loads, so that each batch gets loaded exactly once even under esoteric error conditions. With Kafka, users get a scalable distributed messaging system that enables simplifying the load pipeline.
We saw the combination of Vertica and Kafka becoming a common design pattern and decided to standardize on this pattern by providing out-of-the-box integration between Vertica and Kafka, incorporating the best practices of loading data at scale. The solution aims to maximize the throughput of loads via micro-batches into Vertica, while ensuring transactionality of the load process. It removes a ton of complexity in the load pipeline from the Vertica users.

Q5.What are the pros and cons of this design choice (if any)?

Shilpa Lawande: The pros are that if you already use Kafka, much of the work of ingesting data into Vertica is done for you. Having seen so many different kinds of ingestion horror stories over the past decade, trust me, we’ve eliminated a ton of complexity that you don’t need to worry about anymore. The cons are, of course, that we are making the choice of the tool for you. We believe that the pros far outweigh any cons.

Q6. What kind of enhanced SQL analytics do you provide?

Shilpa Lawande: Great question. Vertica of course provides all the standard SQL analytic capabilities including joins, aggregations, analytic window functions, and, needless to say, performance that is a lot faster than any other RDBMS. But we do much more than that. We’ve built some unique time-series analysis (via SQL) to operate on event streams such as gap-filling and interpolation and event series joins. You can use this feature to do common operations like sessionization in three or four lines of SQL. We can do this because data in Vertica is always sorted and this makes Vertica a superior system for time series analytics. Our pattern matching capabilities enable user path or marketing funnel analytics using simple SQL, which might otherwise take pages of code in Hive or Java.
With the open source Distributed R engine, we provide predictive analytical algorithms such as logistic regression and page rank. These can be used to build predictive models using R, and the models can be registered into Vertica for in- database scoring. With Excavator, we’ve also added text search capabilities for machine log data, so you can now do both search and analytics over log data in one system. And you recently featured a five-part blog series by Walter Maguire examining why Vertica is the best graph analytics engine out there.

Q7. What kind of enhanced performance to Hadoop do you provide?

Shilpa Lawande We see Hadoop, particularly HDFS, as highly complementary to Vertica. Our users often use HDFS as their data lake, for exploratory/discovery phases of their data lifecycle. Our Vertica SQL on Hadoop offering includes the Vertica engine running natively on Hadoop nodes, providing all the advanced SQL capabilities of Vertica on top of data stored in HDFS. We integrate with native metadata stores like HCatalog and can operate on file formats like Orcfiles, Parquet, JSON, Avro, etc. to provide a much more robust SQL engine compared to the alternatives like Hive, Spark or Impala, and with significantly better performance. And, of course, when users are ready to operationalize the analysis, they can seamlessly load the data into Vertica Enterprise which provides the highest performance, compression, workload management, and other enterprise capabilities for your production workloads. The best part is that you do not have to rewrite your reports or dashboards as you move data from Vertica for SQL on Hadoop to Vertica Enterprise.

Qx Anything else you wish to add?

Shilpa Lawande: As we continue to develop the Vertica product, our goal is to provide the same capabilities in a variety of consumption and deployment models to suit different use cases and buying preferences. Our flagship Vertica Enterprise product can be deployed on-prem, in VMWare environments or in AWS via an AMI.
Our SQL on Hadoop product can be deployed directly in Hadoop environments, supporting all Hadoop distributions and a variety of native data formats. We also have Vertica OnDemand, our data warehouse-as-a-service subscription that is accessible via a SQL prompt in AWS, HPE handles all of the operations such as database and OS software updates, backups, etc. We hope that by providing the same capabilities across many deployment environments and data formats, we provide our users the maximum choice so they can pick the right tool for the job. It’s all based on our signature core analytics engine.
We welcome new users to our growing community to download our Community Edition, which provides 1TB of Vertica on a three-node cluster for free, or sign-up for a 15-day trial of Vertica on Demand!

———
Shilpa Lawande is Vice President at Hewlett Packard Enterprise, responsible for strategic direction of the HP Big Data Platforms, including the flagship HP Vertica Analytic Platform. Shilpa brings over 20 years of experience in databases, data warehousing, analytics and distributed systems.
She joined Vertica at its inception in 2005, being one of the original engineers who built Vertica from ground up, and running the Vertica Engineering and Customer Experience teams for better part of the last decade. Shilpa has been at HPE since 2011 through the acquisition of Vertica and has held a diverse set of roles spanning technology and business.
Prior to Vertica, she was a key member of the Oracle Server Technologies group where she worked directly on several data warehousing and self-managing features in the Oracle Database.

Shilpa is a co-inventor on several patents on database technology, both at Oracle and at HP Vertica.
She has co-authored two books on data warehousing using the Oracle database as well as a book on Enterprise Grid Computing.
She has been named to the 2012 Women to Watch list by Mass High Tech, the Rev Boston 2015 list, and awarded HP Software Business Unit Leader of the year in 2012 and 2013. As a working mom herself, Shilpa is passionate about STEM education for Girls and Women In Tech issues, and co-founded the Datagals women’s networking and advocacy group within HPE. In her spare time, she mentors young women at Year Up Boston, an organization that empowers low-income young adults to go from poverty to professional careers in a single year.

Resources

Uplevel Big Data analytics with HP Vertica – Part 1: Graph in a relational database? Seriously? by Walter Maguire
Uplevel Big Data Analytics with Graph in Vertica – Part 2: Yes, you can write that in SQL by Walter Maguire
Uplevel Big Data Analytics with Graph in Vertica – Part 3: Yes, you can make it go even faster by Walter Maguire
Uplevel Big Data Analytics with Graph in Vertica – Part 4: It’s not your dad’s graph engine by Walter Maggiore
Uplevel Big Data Analytics with Graph in Vertica – Part 5: Putting graph to work for your business by Walter Maguire

– On Column Stores. Interview with Shilpa Lawande. ODBMS Industry Watch,July 14, 2014

– On Big Data: Interview with Shilpa Lawande, VP of Engineering at Vertica. ODBMS Industry Watch,November 16, 2011

Follow ODBMS.org on Twitter: @odbmsorg

Powering Big Data at Pinterest. Interview with Krishna Gade.

Roberto V. Zicari — Wed, 22 Apr 2015 07:55:31 +0000

“Today, we’re storing and processing tens of petabytes of data on a daily basis, which poses the big challenge in building a highly reliable and scalable data infrastructure.”–Krishna Gade.

I have interviewed Krishna Gade, Engineering Manager on the Data team at Pinterest.

RVZ

Q1. What are the main challenges you are currently facing when dealing with data at Pinterest?

Krishna Gade: Pinterest is a data product and a data-driven company. Most of our Pinner-facing features like recommendations, search and Related Pins are created by processing large amounts of data every day. Added to this, we use data to derive insights and make decisions on products and features to build and ship. As Pinterest usage grows, the number of Pinners, Pins and the related metadata are growing rapidly. Today, we’re storing and processing tens of petabytes of data on a daily basis, which poses the big challenge in building a highly reliable and scalable data infrastructure.

On the product side, we’re curating a unique dataset we call the ‘interest graph’ which captures the relationships between Pinners, Pins, boards (collections of Pins) and topic categories. As Pins are visual bookmarks of web pages saved by our Pinners, we can have the same web page Pinned many different times. One of the problems we try to solve is to collate all the Pins that belong to the same web page and aggregate all the metadata associated with them.

Visual discovery is an important feature in our product. When you click on a Pin we need to show you visually related Pins. In order to do this we extract features from the Pin image and apply sophisticated deep learning techniques to suggest Pins related to the original. There is a need to build scalable infrastructure and algorithms to mine and extract value from this data and apply to our features like search, recommendations etc.

Q2. You wrote in one of your blog posts that “data-driven decision making is in your company DNA”. Could please elaborate and explain what do you mean with that?

Krishna Gade: It starts from the top. Our senior leadership is constantly looking for insights from data to make critical decisions. Every day, we look at the various product metrics computed by our daily pipelines to measure how the numerous product features are doing. Every change to our product is first tested with a small fraction of Pinners as an A/B experiment, and at any given time we’re running hundreds of these A/B experiments. Over time data-driven decision making has become an integral part of our culture.

Q3. Specifically, what do you use Real-time analytics for at Pinterest?

Krishna Gade: We build batch pipelines extensively throughout the company to process billions of Pins and the activity on them. These pipelines allow us to process vast amounts of historic data very efficiently and tune and personalize features like search, recommendations, home feed etc. However these pipelines don’t capture the activity happening currently – new users signing up, millions of repins, clicks and searches. If we only rely on batch pipelines, we won’t know much about a new user, Pin or trend for a day or two. We use real-time analytics to bridge this gap.
Our real-time data pipelines process user activity stream that includes various actions taken by the Pinner (repins, searches, clicks, etc.) as they happen on the site, compute signals for Pinners and Pins in near real-time and make these available back to our applications to customize and personalize our products.

Q4 Could you pls give us an overview of the data platforms you use at Pinterest?

Krishna Gade: We’ve used existing open-source technologies and also built custom data infrastructure to collect, process and store our data. We built a logging agent Singer, deployed on all of our web servers that’s constantly pumping log data into Kafka, which we use as a log transport system. After the logs reach Kafka, they’re copied into Amazon S3 by our custom log persistence service called Secor. We built Secor to ensure 0-data loss and overcome the weak eventual consistency model of S3.
After this point, our self-serve big data platform loads the data from S3 into many different Hadoop clusters for batch processing. All our large scale batch pipelines run on Hadoop, which is the core data infrastructure we depend on for improving and observing our product. Our engineers use either Hive or Cascading to build the data pipelines, which are managed by Pinball – a flexible workflow management system we built. More recently, we’ve started using Spark to support our machine learning use-cases.

Q5. You have built a real-time data pipeline to ingest data into MemSQL using Spark Streaming. Why?

Krishna Gade: As of today, most of our analytics happens in the batch processing world. All the business metrics we compute are powered by the nightly workflows running on Hadoop. In the future our goal is to be able to consume real-time insights to move quickly and make product and business decisions faster. A key piece of infrastructure missing for us to achieve this goal was a real-time analytics database that can support SQL.

We wanted to experiment with a real-time analytics database like MemSQL to see how it works for our needs. As part of this experiment, we built a demo pipeline to ingest all our repin activity stream into MemSQL and built a visualization to show the repins coming from the various cities in the U.S.

Q6. Could you pls give us some detail how is it implemented?

Krishna Gade: As Pinners interact with the product, Singer agents hosted on our web servers are constantly writing the activity data to Kafka. The data in Kafka is consumed by a Spark streaming job. In this job, each Pin is filtered and then enriched by adding geolocation and Pin category information. The enriched data is then persisted to MemSQL using MemSQL’s spark connector and is made available for query serving. The goal of this prototype was to test if MemSQL could enable our analysts to use familiar SQL to explore the real-time data and derive interesting insights.

Q7. Why did you choose MemSQL and Spark for this? What were the alternatives?

Krishna Gade: I led the Storm engineering team at Twitter, and we were able to scale the technology for hundreds of applications there. During that time I was able to experience both good and bad aspects of Storm.
When I came to Pinterest, I saw that we were beginning to use Storm but mostly for use-cases like computing the success rate and latency stats for the site. More recently we built an event counting service using Storm and HBase for all of our Pin and user activity. In the long run, we think it would be great to consolidate our data infrastructure to a fewer set of technologies. Since we’re already using Spark for machine learning, we thought of exploring its streaming capabilities. This was the main motivation behind using Spark for this project.

As for MemSQL, we were looking for a relational database that can run SQL queries on streaming data that would not only simplify our pipeline code but would give our analysts a familiar interface (SQL) to ask questions on this new data source. Another attractive feature about MemSQL is that it can also be used for the OLTP use case, so we can potentially have the same pipeline enabling both product insights and user-facing features. Apart from MemSQL, we’re also looking at alternatives like VoltDB and Apache Phoenix. Since we already use HBase as a distributed key-value store for a number of use-cases, Apache Phoenix which is nothing but a SQL layer on top of HBase is interesting to us.

Q8. What are the lessons learned so far in using such real-time data pipeline?

Krishna Gade: It’s early days for the Spark + MemSQL real-time data pipeline, so we’re still learning about the pipeline and ingesting more and more data. Our hope is that in the next few weeks we can scale this pipeline to handle hundreds of thousands of events per second and have our analysts query them in real-time using SQL.

Q9. What are your plans and goals for this year?

Krishna Gade: On the platform side, our plan to is to scale real-time analytics in a big way in Pinterest. We want to be able to refresh our internal company metrics, signals into product features at the granularity of seconds instead of hours. We’re also working on scaling our Hadoop infrastructure especially looking into preventing S3 eventual consistency from disrupting the stability of our pipelines. This year should also see more open-sourcing from us. We started the year by open-sourcing Pinball, our workflow manager for Hadoop jobs. We plan to open-source Singer our logging agent sometime soon.

One the product side, one of our big goals is to scale our self-serve ads product and grow our international user-base. We’re focusing especially on markets like Japan and Europe to grow our user-base and get more local content into our index.

Qx. Anything else you wish to add?

Krishna Gade: For those who are interested in more information, we share latest from the engineering team on our Engineering blog. You can follow along with the blog, as well as updates on our Facebook Page. Thanks a lot for the opportunity to talk about Pinterest engineering and some of the data infrastructure challenges.

————-
Krishna Gade is the engineering manager for the data team at Pinterest. His team builds core data infrastructure to enable data driven products and insights for Pinterest. They work on some of the cutting edge big data technologies like Kafka, Hadoop, Spark, Redshift etc. Before Pinterest, Krishna was at Twitter and Microsoft building large scale search and data platforms.

—————–
Resources

–Singer, Pinterest’s Logging Infrastructure (LINK to SlideShares)

–Introducing Pinterest Secor (LINK to Pinterest engineering blog)

– pinterest/secor (GitHub)

– Spark Streaming

– MemSQL

–MemSQL’s spark connector (memsql/memsql-spark-connector GitHub)

———————-
Related Posts

–Big Data Management at American Express. Interview with Sastry Durvasula and Kevin Murray. ODBMS Industry Watch, October 12, 2014

–Hadoop at Yahoo. Interview with Mithun Radhakrishnan. ODBMS Industry Watch, 2014-09-21

Follow ODBMS.org on Twitter: @odbmsorg

On the SciDB array database. Interview with Mike Stonebraker and Paul Brown.

Roberto V. Zicari — Mon, 14 Apr 2014 07:00:30 +0000

“SciDB is both a data store and a massively parallel compute engine for numerical processing. The inclusion of this computational platform is what makes us the first “computational database”, not just a SQL-style decision support DBMS. Hence, we need a new moniker to describe this class of interactions. We settled on computational databases, but if your readers have a better suggestion, we are all ears!”
–Mike Stonebraker, Paul Brown.

On the SciDB array database, I have interviewed Mike Stonebraker, MIT Professor and Paradigm4 co-founder and CTO, and Paul Brown, Paradigm4 Chief Architect.

RVZ

Q1: What is SciDB and why did you create it?

Mike Stonebraker, Paul Brown: SciDB is an open source array database with scalable, built-in complex analytics, programmable from R and Python. The requirements for SciDB emerged from discussions between academic database researchers—Mike Stonebraker and Dave DeWitt— and scientists at the first Extremely Large Databases conference (XLDB) at SLAC in 2007 about coping with the peta-scale data from the forthcoming LSST telescope.

Recognizing that commercial and industrial users were about to face the same challenges as scientists, Mike Stonebraker founded Paradigm4 in 2010 to make the ideas explored in early prototypes available as a commercial-quality software product. Paradigm4 develops and supports both a free, open-source Community Edition (scidb.org/forum) and an Enterprise Edition with additional features (paradigm4.com).

Q2. With the rise of Big Data analytics, is the convergence of analytic needs between science and industry really happening?

Mike Stonebraker, Paul Brown: There is a “sea change” occurring as companies move from Business Intelligence (think SQL analytics) to Complex Analytics (think predictive modelling, clustering, correlation, principal components analysis, graph analysis, etc.). Obviously science folks have been doing complex analytics on big data all along.

Another force driving this sea change is all the machine-generated data produced by cell phones, genomic sequencers, and by devices on the Industrial Internet and the Internet of Things. Here too science folks have been working with big data from sensors, instruments, telescopes and satellites all along. So it is quite natural that a scalable computational database like SciDB that serves the science world is a good fit for the emerging needs of commercial and industrial users.

There will be a convergence of the two markets as many more companies aspire to develop innovative products and services using complex analytics on big and diverse data. In the forefront are companies doing electronic trading on Wall Street; insurance companies developing new pricing models using telematics data; pharma and biotech companies analyzing genomics and clinical data; and manufacturing companies building predictive models to anticipate repairs on expensive machinery. We expect everybody will move to this new paradigm over time. After all, a predictive model integrating diverse data is much more useful than a chart of numbers about past behavior.

Q3. What are the typical challenges posed by scientific analytics?

Mike Stonebraker, Paul Brown: We asked a lot of working scientists the same question, and published a paper in the IEEE Computing Science & Engineering summarizing their answers (*see citation below). In a nutshell, there are 4 primary issues.

1. Scale. Science has always been intensely “data driven”. With the ever-increasing massive data-generating capabilities of scientific instruments, sensors, and computer simulations, the average scientist is overwhelmed with data and needs data management and analysis tools that can scale to meet his or her needs, now and in the future.

2. New Analytic Methods. Historically analysis tools have focused on business users, and have provided easy-to-use interfaces for submitting SQL aggregates to data warehouses. Such business intelligence (BI) tools are not useful to scientists, who universally want much more complex analyses, whether it be outlier detection, curve fitting, analysis of variance, predictive models or network analysis. Such “complex analytics” is defined on arrays in linear algebra, and requires a new generation of client-side tools and server side tools in DBMSs.

3. Provenance. One of the central requirements that scientists have is reproducibility. They need to be able to send their data to colleagues to rerun their experiments and produce the same answers. As such, it is crucial to keep prior versions of data in the face of updates, error correction, and the like. The right way to provide such provenance is through a no-overwrite DBMS; which allows time-travel back in time to when the experiment in question was performed.

4. Interactivity. Unlike business users who are often comfortable with batch reporting of information, scientific users are invariably exploring their data, asking “what if” questions and testing hypotheses. What they need in interactivity on very large data sets.

Q3. What are in your opinion the commonalities between scientific and industrial analytics?

Mike Stonebraker, Paul Brown: We would state the question in reverse “What are the differences between the two markets?” In our opinion, the two markets will converge quickly as commercial and industrial companies move to the analytic paradigms pervasive in the science marketplace.

Q4. How come in the past the database system software community has failed to build the kinds of systems that scientists needed for managing massive data sets?

Mike Stonebraker, Paul Brown: Mostly it’s because scientific problems represent a $0 billion market! However, the convergence of industrial requirements and science requirements means that science can “piggy back” on the commercial market and get their needs met.

Q5. SciDB is a scalable array database with native complex analytics. Why did you choose a data model based on multidimensional arrays?

Mike Stonebraker, Paul Brown: Our main motivation is that at scale, the complex analyses done by “post sea change” users are invariably about applying parallelized linear algebraic algorithms to arrays. Whether you are doing regression, singular value decomposition, finding eigenvectors, or doing operations on graphs, you are performing a sequence of matrix operations. Obviously, this is intuitive and natural in an array data model, whereas you have to recast tables into arrays if you begin with an RDBMS or keep data in files. Also, a native array implementation can be made much faster than a table-based system by directly implementing multi-dimensional clustering and doing selective replication of neighboring data items.

Our secondary motivation is that, just like mathematical matrices, geospatial data, time-series data, image data, and graph data are most naturally organized as arrays. By preserving the inherent ordering in the data, SciDB supports extremely fast selection (including vectors, planes, ‘hypercubes’), doing multi-dimensional windowed aggregates, and re-gridding it to change spatial or temporal resolution.

Q6. How do you manage in a nutshell scalability with high degrees of tolerance to failures?

Mike Stonebraker, Paul Brown: In a nutshell? Partitioning, and redundancy (k-replication).

First, SciDB splits each array’s attributes apart, just like any columnar system. Then we partition each array into rectilinear blocks we call “chunks”. Then we employ a variety of mapping functions that map an array’s chunks to SciDB instances. For each copy of an array we use a different mapping function to create copies of each chunk on different node of the cluster. If a node goes down, we figure out where there is a redundant copy of the data and move the computation there.

Q7. How do you handle data compression in SciDB?

Mike Stonebraker, Paul Brown: Use of compression in modern data stores is a very important topic. Minimizing storage while retaining information and supporting extremely rapid data access informs every level of SciDB’s design. For example, SciDB splits every array into single-attribute components. We compress a chunk’s worth of cell values for a specific attribute. At the lowest level, we compress attribute data using techniques like run-length encoding on data. In addition, our implementation has an abstraction for compression to support other compression algorithms.

Q8. Why supporting two query languages?

Mike Stonebraker, Paul Brown: Actually the primary interfaces we are promoting are R and Python as they are the languages of choice of data scientists, quants, bioinformaticians, and scientists. SciDB-R and SciDB-Py allow users to interactively query SciDB using R and Python. Data is persisted in SciDB. Math operators are overloaded so that complex analytical computations execute scalably in the database.

Early on we surveyed potential and existing SciDB users, and found there were two very different types. By and large, commercial users using RDMBSs said “make it look like SQL”. For those users we created AQL—array SQL. On the other hand, data scientists and programmers preferred R, Python, and functional languages. For the second class of users we created SciDB-R, SciDB-Py, and AFL—an array functional language.

All queries get compiled into a query plan, which is a sequence of algebraic operations. Essentially all relational versions of SQL do exactly the same thing. In SciDB, AFL, the array functional language, is the underlying language of algebraic operators. Hence, it is easy to surface and support AFL in addition to AQL, SciDB-R, and SciDB-Py, allowing us to satisfy the preferred mode of working for many classes of users.

Q9. You defined SciDB a computational database – not a data warehouse, not a business-intelligence database, and not a transactional database. Could you please elaborate more on this point?

Mike Stonebraker, Paul Brown: In our opinion, there are two mature markets for DBMSs: transactional DBMSs that are optimized for large numbers of users performing short write-oriented ACID transactions, and data warehouses, which strive for high performance on SQL aggregates and other read-oriented longer queries. The users of SciDB fit into neither category. They are universally doing more complex mathematical calculations than SQL aggregates on their data, and their DBMS interactions are typically longer read-oriented queries. SciDB is both a data store and a massively parallel compute engine for numerical processing. The inclusion of this computational platform is what makes us the first “computational database”, not just a SQL-style decision support DBMS. Hence, we need a new moniker to describe this class of interactions. We settled on computational databases, but if your readers have a better suggestion, we are all ears!

Q10. How does SciDB differ from analytical databases, such as for example HP Vertica, and in-memory analytics databases such as SAP HANA?

Mike Stonebraker, Paul Brown: Both are data warehouse products, optimized for warehouse workloads. SciDB serves a different class of users from these other systems. Our customers’ data are naturally represented as arrays that don’t fit neatly or efficiently into relational tables. Our users want more sophisticated analytics—more numerical, statistical, and graph analysis—and not so much SQL OLAP.

Q11. What about Teradata?

Mike Stonebraker, Paul Brown: Another data warehouse vendor. Plus, SciDB runs on commodity hardware clusters or in a cloud and not on a proprietary appliances or expensive servers.

Q12. Anything else you wish to add?

Mike Stonebraker, Paul Brown: SciDB is currently being used by commercial users for computational finance, bioinformatics and clinical informatics, satellite image analysis, and industrial analytics. The publicly accessible NIH NCBI One Thousand Genomes browser has been running on SciDB since the Fall of 2012.

Anyone can try out SciDB using an AMI or a VM available at scidb.org/forum.

————————–

Mike Stonebraker , CTO Paradigm4
Renowned database researcher, innovator, and entrepreneur: Berkeley, MIT, Postgres, Ingres, Illustra, Cohera, Streambase, Vertica, VoltDB, and now Paradigm4.

Paul Brown , Chief Architect Paradigm4
Premier database ‘plumber’ and researcher moving from the “I’s” (Ingres, Illustra, Informix, IBM) to a “P” (Paradigm4).
————————-
Resources

*Citation for IEEE paper
Stonebraker, M.; Brown, P.; Donghui Zhang; Becla, J., “SciDB: A Database Management System for Applications with Complex Analytics,” Computing in Science & Engineering , vol.15, no.3, pp.54,62, May-June 2013
doi: 10.1109/MCSE.2013.19, URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6461866&isnumber=6549993

– ODBMS.org: free resources related to Paradigm4

– Objects in Space vs. Friends in Facebook. ODBMS Industry Watch, April 13, 2011.

Follow ODBMS.org on Twitter: @odbmsorg

Big Data: Three questions to VoltDB.

Roberto V. Zicari — Thu, 06 Feb 2014 19:13:33 +0000

“Some of our current priorities include: augmenting capabilities in the area of real-time analytics – especially around online operations, SQL functionality, integrations with messaging applications, statistics and monitoring procedures, and enhanced developer features.”– Ryan Betts.

The third interview in the “Big Data: three questions to “ series of interviews, is with Ryan Betts, CTO of VoltDB.

RVZ

Q1. What are your current product offerings?

Ryan Betts: VoltDB is a high-velocity database platform that enables developers to build next generation real-time operational applications. VoltDB converges all of the following:

• A dynamically scalable in-memory relational database delivering high-velocity, ACID-compliant OLTP
• High-velocity data ingestion, with millions of writes per second
• Real-time analytics, to enable instant operational visibility at the individual event level
• Real-time decisioning, to enable applications to act on data when it is most valuable—the moment it arrives

Version 4.0 delivers enhanced in-memory analytics capabilities and expanded integrations. VoltDB 4.0 is the only high performance operational database that combines in-memory analytics with real-time transactional decision-making in a single system.
It gives organizations an unprecedented ability to extract actionable intelligence about customer and market behavior, website interactions, service performance and much more by performing real-time analytics on data moving at breakneck speed.

Specifically, VoltDB 4.0 features a tenfold throughput improvement of analytic queries and is capable of writes and reads on millions of data events per second. It provides large-scale concurrent, multiuser access to data, the ability to factor current incoming data into analytics, and enhanced SQL support. VoltDB 4.0 also delivers expanded integrations with an organization’s existing data infrastructure such as message queue systems, improved JDBC driver and monitoring utilities such as New Relic.

Q2. Who are your current customers and how do they typically use your products?

Ryan Betts: Customers use VoltDB for a wide variety of data-management functions, including data caching, stream processing and “on the fly” ETL.
Current VoltDB customers represent industries ranging from telecommunications to e-commerce, power & energy, financial services, online gaming, retail and more.

Following are common use cases:

• Optimized, real-time information delivery
• Personalized audience targeting
• Real-time analytics dashboards
• Caching server replacements
• Session / user management
• Network analysis & monitoring
• Ingestion and on-the-fly-ETL

Below are the customers that have been publicly announced thus far:

Eagle Investments
Conexient
OpenNet
Sakura
Shopzilla
Social Game Universe
Yellowhammer

Q3. What are the main new technical features you are currently working on and why?

Ryan Betts: Our customers are reaping the benefits of VoltDB in the areas of transactional decision-making and generating real-time analytics on that data—right at the moment it’s coming in.

Therefore, some of our current priorities include: augmenting capabilities in the area of real-time analytics – especially around online operations, SQL functionality, integrations with messaging applications, statistics and monitoring procedures, and enhanced developer features.

Although VoltDB has proven to be the industry’s “easiest to use” database, we are also continuing to invest quite heavily in making the process of building and deploying real-time operational applications with VoltDB even easier. Among other things, we are extending the power and simplicity that we offer developers in building high throughput applications to building modest sized throughput applications.

—————
Related Posts

– Setting up a Big Data project. Interview with Cynthia M. Saracco.
ODBMS Industry Watch, January 27, 2014

– Big Data: Three questions to Pivotal.
ODBMS Industry Watch, January 20, 2014.

–Big Data: Three questions to InterSystems.
ODBMS Industry Watch, January 13, 2014.

– Operational Database Management Systems. Interview with Nick Heudecker.
ODBMS Industry Watch, December 16, 2013.

Resources

ODBMS.org: Free resources on Big Data, Analytics, Cloud Data Stores, Graphs Databases, NewSQL, NoSQL, Object Databases.

Follow ODBMS.org on Twitter: @odbmsorg

On NoSQL. Interview with Rick Cattell.

Roberto V. Zicari — Mon, 19 Aug 2013 07:48:00 +0000

” There aren’t enough open source contributors to keep projects competitive in features and performance, and the companies supporting the open source offerings will have trouble making enough money to keep the products competitive themselves. Likewise, companies with closed source will have trouble finding customers willing to risk a closed source (or limited open source) solution. It will be interesting to see what happens. But I don’t see NoSQL going away, there is a well-established following.” –Rick Cattell.

I have asked Rick Cattell, one of the leading independent consultants in database systems, a few questions on NoSQL.

RVZ

Q1. For years, you have been studying the NoSQL area and writing articles about scalable databases. What is new in the last year, in your view? What is changing?

Rick Cattell: It seems like there’s a new NoSQL player every month or two, now!
It’s hard to keep track of them all. However, a few players have become much more popular than the others.

Q2. Which players are those?

Rick Cattell: Among the open source players, I hear the most about MongoDB, Cassandra, and Riak now, and often HBase and Redis. However, don’t forget that the proprietary players like Amazon, Oracle, and Google have NoSQL systems as well.

Q3. How do you define “NoSQL”?

Rick Cattell: I use the term to mean systems that provide a simple operations like key/value storage or simple records and indexes, and that focus on horizontal scalability for those simple operations. Some people categorize horizontally scaled graph databases and object databases to be “NoSQL” as well. However, those systems have very different characteristics. Graph databases and object databases have to efficiently break connections up over distributed servers, and have to provide operations that somehow span servers as you traverse the graph. Distributed graph/object databases have been around for a while, but efficient distribution is a hard problem. The NoSQL databases simply distribute (or shard) each data type based on a primary key; that’s easier to do efficiently.

Q4. What other categories of systems do you see?

Rick Cattell: Well, there are systems that focus on horizontal scaling for full SQL with joins, which are generally called “NewSQL“, and systems optimized for “Big Data” analytics, typically based on Hadoop map/reduce. And of course, you can also sub-categorize the NoSQL systems based on their data model and distribution model.

Q5. What subcategories would those be?

Rick Cattell: On data model, I separate them into document databases like MongoDB and CouchBase, simple key/value stores like Riak and Redis, and grouped-column stores like HBase and Cassandra. However, a categorization by data model is deceptive, because they also differ quite a bit in their performance and concurrency guarantees.

Q6: Which systems perform best?

Rick Cattell: That’s hard to answer. Performance is not a scale from “good” to “bad”… the different systems have better performance for different kinds of applications. MongoDB performs incredibly well if all your data fits in distributed memory, for example, and Cassandra does a pretty good job of using disk, because of its scheme of writing new data to the end of disk files and consolidating later.

Q7: What about their concurrency guarantees?

Rick Cattell: They are all over the place on concurrency. The simplest provide no guarantees, only “eventual consistency“. You don’t know which version of data you’ll get with Cassandra. MongoDB can keep a “primary” replica consistent if you can live with their rudimentary locking mechanism.
Some of the new systems try to provide full ACID transactions. FoundationDB and Oracle NoSQL claim to do that, but I haven’t yet verified that. I have studied Google’s Spanner paper, and they do provide true ACID consistency in a distributed world, for most practical purposes. Many people think the CAP theorem makes that impossible, but I believe their interpretation of the theorem is too narrow: most real applications can have their cake and eat it too, given the right distribution model. By the way, graph/object databases also provide ACID consistency, as does VoltDB, but as I mentioned I consider them a different category.

Q8: I notice you have an unpublished paper on your website, called 2x2x2 Requirements for Scalability. Can you explain what the 2x2x2 means?

Rick Cattell: Well, the first 2x means that there are two different kinds of scalability: horizontal scaling over multiple servers, and vertical scaling for performance on a single server. The remaining 2×2 means that there are two key features needed to achieve the horizontal and vertical scaling, and for each of those, there are two additional things you have to do to make the features practical.

Q9: What are those key features?

Rick Cattell: For horizontal scaling, you need to partition and replicate your data. But you also need automatic failure recovery and database evolution with no downtime, because when your database runs on 200 nodes, you can’t afford to take the database offline and you can’t afford operator intervention on every failure.
To achieve vertical scaling, you need to take advantage of RAM and you need to avoid random disk I/O. You also need to minimize the overhead for locking and latching, and you need to minimize network calls between servers. There are various ways to do that. The best systems have all eight of these key features. These eight features represent my summary of scalable databases in a nutshell.

Q10: What do you see happening with NoSQL, going forward?

Rick Cattell: Good question. I see a lot of consolidation happening… there are too many players! There aren’t enough open source contributors to keep projects competitive in features and performance, and the companies supporting the open source offerings will have trouble making enough money to keep the products competitive themselves. Likewise, companies with closed source will have trouble finding customers willing to risk a closed source (or limited open source) solution.
It will be interesting to see what happens. But I don’t see NoSQL going away, there is a well-established following.

————–
R. G. G. “Rick” Cattell is an independent consultant in database systems.
He previously worked as a Distinguished Engineer at Sun Microsystems, most recently on open source database systems and distributed database scaling. Dr. Cattell served for 20+ years at Sun Microsystems in management and senior technical roles, and for 10 years in research at Xerox PARC and at Carnegie-Mellon University. Dr. Cattell is best known for his contributions in database systems and middleware, including database scalability, enterprise Java, object/relational mapping, object-oriented databases, and database interfaces. He is the author of several dozen papers and five books, and a co-inventor of six U.S. patents.
At Sun he instigated the Enterprise Java, Java DB, and Java Blend projects, and was a contributor to a number of Java APIs and products. He previously developed the Cedar DBMS at Xerox PARC, the Sun Simplify database GUI, and SunSoft’s CORBA-database integration.
He is a co-founder of SQL Access (a predecessor to ODBC), the founder and chair of the Object Data Management Group (ODMG), the co-creator of JDBC, the author of the world’s first monograph on object/relational and object databases, a recipient of the ACM Outstanding PhD Dissertation Award, and an ACM Fellow.

–On Real Time NoSQL. Interview with Brian Bulkowski. May 21, 2013

Resources

– Rick Cattell home page.

– ODBMS.org Free Downloads and Links
In this section you can download free resources covering the following topics:
Big Data and Analytical Data Platforms
Cloud Data Stores
Object Databases
NoSQL Data Stores
Graphs and Data Stores
Object-Oriented Programming
Entity Framework (EF) Resources
ORM Technology
Object-Relational Impedance Mismatch
NewSQL, XML, RDF Data Stores, RDBMS

Follow ODBMS.org on Twitter: @odbmsorg

On PostgreSQL. Interview with Tom Kincaid.

Roberto V. Zicari — Thu, 30 May 2013 10:05:20 +0000

“Application designers need to start by thinking about what level of data integrity they need, rather than what they want, and then design their technology stack around that reality. Everyone would like a database that guarantees perfect availability, perfect consistency, instantaneous response times, and infinite throughput, but it´s not possible to create a
product with all of those properties”–Tom Kincaid.

What is new with PostgreSQL? I have Interviewed Tom Kincaid, head of Products and Engineering at EnterpriseDB.

RVZ

(Tom prepared the following responses with contributions from the EnterpriseDB development team)

Q1. EnterpriseDB products are based upon PostgreSQL. What is special about your product offering?

Tom Kincaid: EnterpriseDB has integrated many enterprise features and performance enhancements into the core PostgreSQL code to create a database with the lowest possible TCO and provide the “last mile” of service needed by enterprise database users.

EnterpriseDB´s Postgres Plus software provides the performance, security and Oracle compatibility needed to address a range of enterprise business applications. EnterpriseDB´s Oracle compatibility, also integrated into the PostgreSQL code base, allows many Oracle shops to realize a much lower database TCO while utilizing their Oracle skills and applications designed to work against Oracle databases.

EnterpriseDB also creates enterprise-grade tools around PostgreSQL and Postgres Plus Advanced Server for use in large-scale deployments. They are Postgres Enterprise Manager, a powerful management console for managing, monitoring and tuning databases en masse whether they´re PostgreSQL community version or EnterpriseDB´s enhanced Postgres Plus Advanced Server; xDB Replication Server with multi-master replication and replication between Postgres, Oracle and SQL Server databases; and SQL/Protect for guarding against SQL Injection attacks.

Q2. How does PostgreSQL compare with MariaDB and MySQL 5.6?

Tom Kincaid: There are several areas of difference. PostgreSQL has traditionally had a stronger focus on data integrity and compliance with the SQL standard.
MySQL has traditionally been focused on raw performance for simple queries, and a typical benchmark is the number of read queries per second that the database engine can carry out, while PostgreSQL tends to focus more on having a sophisticated query optimizer that can efficiently handle more complex queries, sometimes at the expense of speed on simpler queries. And, for a long time, MySQL had a big lead over PostgreSQL in the area of replication technologies, which discouraged many users from choosing PostgreSQL.

Over time, these differences have diminished. PostgreSQL´s replication options have expanded dramatically in the last three releases, and its performance on simple queries has greatly improved in the most recent release (9.2). On the other hand, MySQL and MariaDB have both done significant recent work on their query optimizers. So each product is learning from the strengths of the other.

Of course, there´s one other big difference, which is that PostgreSQL is an independent open source project that is not, and cannot be, controlled by any single company, while MySQL is now owned and controlled by Oracle.
MariaDB is primarily developed by the Monty Program and shows signs of growing community support, but it does not yet have the kind of independent community that PostgreSQL has long enjoyed.

Q3. Tomas Ulin mentioned in an interview that “with MySQL 5.6, developers can now commingle the “best of both worlds” with fast key-value look up operations and complex SQL queries to meet user and application specific requirements”. What is your take on this?

Tom Kincaid: I think anyone who is developing an RDBMS today has to be aware that there are some users who are looking for the features of a key-value store or document database.
On the other hand, many NoSQL vendors are looking to add the sorts of features that have traditionally been associated with an enterprise-grade RDBMS. So I think that theme of convergence is going to come up over and over again in different contexts.
That´s why, for example, PostgreSQL added a native JSON datatype as part of the 9.2 release, which is being further enhanced for the forthcoming 9.3 release.
Will we see a RESTful or memcached-like interface to PostgreSQL in the future? Perhaps.
Right now our customers are much more focused on improving and expanding the traditional RDBMS functionality, so that´s where our focus is as well.

Q4. How would you compare your product offering with respect to NoSQL data stores, such as CouchDB, MongoDB, Cassandra and Riak, and NewSQL such as NuoDB and VoltDB?

Tom Kincaid: It is a matter of the right tools for the right problem. Many of our customers use our products together with the NoSQL solutions you mention. If you need ACID transaction properties for your data, with savepoints and rollback capabilities, along with the ability to access data in a standardized way and a large third party tool set for doing it, a time tested relational database is the answer.
The SQL standard provides the benefit of always being able to switch products and having a host of tools for reporting and administration. PostgreSQL, like Linux, provides the benefit of being able to switch service partners.

If your use case does not mandate the benefits mentioned above and you have data sets in the Petabyte range and require the ability to ingest Terabytes of data every 3-4 hours, a NoSQL solution is likely the right answer. As I said earlier many of our customers use our database products together with NoSQL solutions quite successfully. We expect to be working with many of the NoSQL vendors in the coming year to offer a more integrated solution to our joint customers.

Since it is still pretty new, I haven´t had a chance to evaluate NuoDB so I can´t comment on how it compares with PostgreSQL or Postgres Plus Advanced Server.

As far as VoltDB is concerned there is a blog by Dave Page, our Chief Architect for tools and installers, that describes the differences between PostgreSQL and VoltDB. It can be found here.

There is also some terrific insight, on this topic, in an article by my colleague Bruce Momjian, who is one of the most active contributors to PostgreSQL, that can be found here.

Q5. Justin Sheehy of Basho in an interview said “I would most certainly include updates to my bank account as applications for which eventual consistency is a good design choice. In fact, bankers have understood and used eventual consistency for far longer than there have been computers in the modern sense”. What is your opinion on this?

Tom Kincaid: It´s overly simplistic. There is certainly room for asynchronous multi-master replication in applications such as banking, but it has to be done very, very carefully to avoid losing track of the money.
It´s not clear that the NoSQL products which provide eventual consistency today make the right trade-offs or provide enough control for serious enterprise applications – or that the products overall are sufficiently stable. Relational databases remain the most mature, time-tested, and stable solution for storing enterprise data.
NoSQL may be appealing for Internet-focused applications that must accommodate truly staggering volumes of requests, but we anticipate that the RDBMS will remain the technology of choice for most of the mission-critical applications it has served so well over the last 40 years.

Q6. What are the suggested criteria for users when they need to choose between durability for lower latency, higher throughput and write availability?

Tom Kincaid: Application designers need to start by thinking about what level of data integrity they need, rather than what they want, and then design their technology stack around that reality.
Everyone would like a database that guarantees perfect availability, perfect consistency, instantaneous response times, and infinite throughput, but it´s not possible to create a product with all of those properties.

If you have an application that has a large write throughput and you assume that you can store all of that data using a single database server, which has to scale vertically to meet the load, you´re going to be unhappy eventually. With a traditional RDBMS, you´re going to be unhappy when you can´t scale far enough vertically. With a distributed key-value store, you can avoid that problem, but then you have all the challenges of maintaining a distributed system, which can sometimes involve correlated failures, and it may also turn out that your application makes assumptions about data consistency that are difficult to guarantee in a distributed environment.

By making your assumptions explicit at the beginning of the project, you can consider alternative designs that might meet your needs better, such as incorporating mechanisms for dealing with data consistency issues or even application-level shading into the application itself.

Q7. How do you handle Large Objects Support?

Tom Kincaid: PostgreSQL supports storing objects up to 1GB in size in an ordinary database column.
For larger objects, there1s a separate large object API. In current releases, those objects are limited to just 2GB, but the next release of PostgreSQL (9.3) will increase that limit to 4TB. We don´t necessarily recommend storing objects that large in the database, though
in many cases, it´s more efficient to store enormous objects on a file server rather than as database objects. But the capabilities are there for those who need them.

Q8. Do you use Data Analytics at EnterpriseDB and for what?

Tom Kincaid: Most companies today use some form of data analytics to understand their customers and their marketplace and we1re no exception. However, how we use data is rapidly changing given our rapid growth and deepening
penetration into key markets.

Q9. Do you have customers who have Big Data problem? Could you please give us some examples of Big Data Use Cases?

Tom Kincaid: We have found that most customers with big data problems are using specialized appliances and in fact we partnered with Netezza to assist in creating such an appliance – The Netezza TwinFin Data Warehousing appliance.
See here.

Q10. How do you handle the Big Data Analytics “process” challenges with deriving insight?

Tom Kincaid: EnterpriseDB does not specialize in solutions for the Big Data market and will refer prospects to specialists like Netezza.

Q11. Do you handle un-structured data? If yes, how?

Tom Kincaid: PostgreSQL has an integrated full-text search capability that can be used for document processing, and there are also XML and JSON data types that can be used for data of those types. We also have a PostgreSQL-specific data type called hstore that can be used to store groups of key-value pairs.

Q12. Do you use Hadoop? If yes, what is your experience with Hadoop so far?

Tom Kincaid: We developed, and released in late 2011, our Postgres Plus Connector for Hadoop, which allows massive amounts of data from a Postgres Plus Advanced Server (PPAS) or PostgreSQL database to be accessed, processed and analyzed in a Hadoop cluster. The Postgres Plus Connector for Hadoop allows programmers to process large amounts of SQL-based data using their familiar MapReduce constructs. Hadoop combined with PPAS or PostgreSQL enables users to perform real time queries with Postgres and non-real time CPU intensive analysis and with our connector, users can load SQL data to Hadoop, process it and even push the results back to Postgres.

Q13 Cloud computing and open source: How does it relate to PostgreSQL?

Tom Kincaid: In 2012, EnterpriseDB released its Postgres Plus Cloud Database. We´re seeing a wide-scale migration to cloud computing across the enterprise. With that growth has come greater clarity in what developers need in a cloudified database. The solutions are expected to deliver lower costs and management ease with even greater functionality because they are taking advantage of the cloud.

______________________
Tom Kincaid.As head of Products and Engineering, Tom leads the company’s product development and directs the company’s world-class software engineers. Tom has nearly 25 years of experience in the Enterprise Software Industry.
Prior to EnterpriseDB, he was VP of software development for Oracle’s GlassFish and Web Tier products.
He integrated Sun’s Application Server Product line into Oracle’s Fusion middleware offerings. At Sun Microsystems, he was part of the original Java EE architecture and management teams and played a critical role in defining and delivering the Java Platform.
Tom is a veteran of the Object Database industry and helped build Object Design’s customer service department holding management and senior technical contributor roles. Other positions in Tom’s past include Director of Quality Engineering at Red Hat and Director of Software Engineering at Unica.

– On Eventual Consistency– Interview with Monty Widenius. October 23, 2012

Resources

Follow ODBMS.org on Twitter: @odbmsorg

On Hybrid Relational Databases. Interview with Kingsley Uyi Idehen

Roberto V. Zicari — Mon, 13 May 2013 06:52:11 +0000

“The only obstacle to Semantic Web technologies in the enterprise lies in better articulation of the value proposition in a manner that reflects the concerns of enterprises. For instance, the non disruptive nature of Semantic Web technologies with regards to all enterprise data integration and virtualization initiatives has to be the focal point”
–Kingsley Uyi Idehen.

I have interviewed Kingsley Idehen founder and CEO of OpenLink Software. The main topics of this interview are: the Semantic Web, and the Virtuoso Hybrid Data Server.

RVZ

Q1. The vision of the Semantic Web is the one where web pages contain self describing data that machines will be able to navigate them as easily as humans do now. What are the main benefits? Who could profit most from the Semantics Web?

Kingsley Uyi Idehen: The vision of a Semantic Web is actually the vision of the Web. Unbeknownst to most, they are one and the same. The goal was always to have HTTP URIs denote things, and by implication, said URIs basically resolve to their meaning [1] [2].
Paradoxically, the Web bootstrapped on the back of URIs that denoted HTML documents (due to Mosaic’s ingenious exploitation of the “view source” pattern [3]) thereby accentuating its Web of hyper-linked Documents (i.e., Information Space) aspect while leaving its Web of hyper-linked Data aspect somewhat nascent.
The nascence of the Web of hyper-linked Data (aka Web of Data, Web of Linked Data etc.) laid the foundation for the “Semantic Web Project” which naturally evoled into “The Semantic Web” meme. Unfortunately, “The Semantic Web” meme hit a raft of issues (many self inflicted) that basically disconnected it from its original Web vision and architecture aspect reality.
The Semantic Web is really about the use of hypermedia to enhance the long understood entity relationship model [4] via the incorporation of _explicit_ machine- and human-comprehensible entity relationship semantics via the RDF data model. Basically, RDF is just about an enhancement to the entity relationship model that leverages URIs for denoting entities and relations that are described using subject->predicate->object based proposition statements.
For the rest of this interview, I would encourage readers to view “The Semantic Web” phrase as meaning: a Web-scale entity relationship model driven by hypermedia resources that bear entity relationship model description graphs that describe entities and their relations (associations).

To answer your question, the benefits of the Semantic Web are as follows: fine-grained access to relevant data on the Web (or private Web-like networks) with increasing degrees of serendipity [5].

Q2. Who is currently using Semantic Web technologies and how? Could you please give us some examples of current commercial projects?

Kingsley Uyi Idehen: I wouldn’t used “project” to describe endeavors that exploit Semantic Web oriented solutions. Basically, you have entire sectors being redefined by this technology. Examples range from “Open Government” (US, UK, Italy, Spain, Portugal, Brazil etc..) all the way to publishing (BBC, Globo, Elsevier, New York Times, Universal etc..) and then across to pharmaceuticals (OpenPHACTs, St. Judes, Mayo, etc.. ) and automobiles (Daimler Benz, Volkswagen etc..). The Semantic Web isn’t an embryonic endeavor deficient on usecases and case studies, far from it.

Q3. Virtuoso is a Hybrid RDBMS/Graph Column store. How does it differ from relational databases and from XML databases?

Kingsley Uyi Idehen:: First off, we really need to get the definitions of databases clear. As you know, the database management technology realm is vast. For instance, there isn’t anything such thing as a non relational database.
Such a system would be utterly useless beyond an comprehendible definition, to a marginally engaged audience. A relational database management system is typically implemented with support for a relational model oriented query language e.g., SQL, QUEL, OQL (from the Object DBMS era), and more recently SPARQL (for RDF oriented databases and stores). Virtuoso is comprised of a relational database management system that supports SQL, SPARQL, and XQuery. It is optimized to handle relational tables and/or relational property graphs (aka. entity relationship graphs) based data organization. Thus, Virtuoso is about providing you with the ability to exploit the intensional (open world propositions or claims) and extensional (closed world statements of fact) aspects of relational database management without imposing either on its users.

Q4. Is there any difference with Graph Data stores such as Neo4j?

Kingsley Uyi Idehen: Yes, as per my earlier answer, it is a hybrid relational database server that supports relational tables and entity relationship oriented property graphs. It’s support for RDF’s data model enables the use of URIs as native types. Thus, every entity in a Virtuoso DBMS is endowed with a URI as its _super key_. You can de-reference the description of a Virtuoso entity from anywhere on a network, subject to data access policies and resource access control lists.

Q5. How do you position Virtuoso with respect to NoSQL (e.g Cassandra, Riak, MongoDB, Couchbase) and to NewSQL (e.g.NuoDB, VoltDB)?

Kingsley Uyi Idehen: Virtuoso is a SQL, NoSQL, and NewSQL offering. Its URI based _super keys_ capability differentiates it from other SQL, NewSQL, and NoSQL relational database offerings, in the most basic sense. Virtuoso isn’t a data silo, because its keys are URI based. This is a “deceptively simple” claim that is very easy to verify and understand. All you need is a Web Browser to prove the point i.e., a Virtuoso _super key_ can be placed in the address bar of any browser en route to exposing a hypermedia based entity relationship graph that navigable using the Web’s standard follow-your-nose pattern.

Q6. RDF can be encoded in various formats. How do you handle that in Virtuoso?

Kingsley Uyi Idehen: Virtuoso supports all the major syntax notations and data serialization formats associated with the RDF data model. This implies support for N-Triples, Turtle, N3, JSON-LD, RDF/JSON, HTML5+Microdata, (X)HTML+RDFa, CSV, OData+Atom, OData+JSON.

Q7. Does Virtuoso restrict the contents to triples?

Kingsley Uyi Idehen: Assuming you mean: how does it enforce integrity constraints on triple values?
It doesn’t enforce anything per se. since the principle here is “schema last” whereby you don’t have a restrictive schema acting as an inflexible view over the data (as is the case with conventional SQL relational databases). Of course, an application can apply reasoning to OWL (Web Ontology Language) based relation semantics (i.e, in the so-called RBox) as option for constraining entity types that constitute triples. In addition, we will soon be releasing a SPARQL Views mechanism that provides a middle ground for this matter whereby the aforementioned view can be used in a loosely coupled manner at the application, middleware, or dbms layer for applying constraints to entity types that constitute relations expressed by RDF triples.

Q8. RDF can be represented as a direct graph. Graphs, as data structure do not scale well. How do you handle scalability in Virtuoso? How do you handle scale-out and scale-up?

Kingsley Uyi Idehen: The fundamental mission statement of Virtuoso has always be to destroy any notion of performance and scalability as impediments to entity relationship graph model oriented database management. The crux of the matter with regards to Virtuoso is that it is massively scalable due for the following reasons:
• fine-grained multi-threading scoped to CPU cores
• vectorized (array) execution of query commands across fine-grained threads
• column-store based physical storage which provides storage layout and data compaction optimizations (e.g., key compression)
• share-nothing clustering that scales from multiple instances (leveraging the items above) on a single machine all the way up to a cluster comprised of multiple machines.
The scalability prowess of Virtuoso are clearly showcased via live Web instances such as DBpedia and the LOD Cloud Cache (50+ Billion Triples). You also have no shortage of independent benchmark reports to compliment the live instances:
•50 – 150 Billion scale Berlin SPARQL Benchmark (BSBM) report (.pdf)

Q9. Could you give us some commercial examples where Virtuoso is in use?

Kingsley Uyi Idehen: Elsevier, Globo, St. Judes Medical, U.S. Govt., EU, are a tiny snapshot of entities using Virtuoso on a commercial basis.

Q10. Do you plan in the near future to develop integration interfaces to other NoSQL data stores?

Kingsley Uyi Idehen: If a NewSQL or NoSQL store supports any of the following, their integration with Virtuoso is implicit: HTTP based RESTful interaction patterns, SPARQL, ODBC, JDBC, ADO.NET, OLE-DB. In the very worst of cases, we have to convert the structured data returned into 5-Star Linked Data using Virtuoso’s in-built Linked Data middleware layer for heterogeneous data virtualization.

Q11. Virtuoso supports SPARQL. SPARQL is not SQL, how do handle querying relational data then?

Kingsley Uyi Idehen: Virtuoso support SPARQL, SQL, SQL inside SPARQL and SPARQL inside SQL (we call this SPASQL). Virtuoso has always had its own native SQL engine, and that’s integral to the entire product. Virtuoso provides an extremely powerful and scalable SQL engine as exemplified by the fact that the RDF data management services are basically driven by the SQL engine subsystem.

Q12. How do you support Linked Open Data? What advantages are the main benefits of Linked Open Data in your opinion?

Kingsley Uyi Idehen: Virtuoso enables you expose data from the following sources, courtesy of its in-built 5-star Linked Data Deployment functionality:
• RDF based triples loaded from Turtle, N-Triples, RDF/XML, CSV etc. documents
• SQL relational databases via ODBC or JDBC connections
• SOAP based Web Services
• Web Services that provide RESTful interaction patterns for data access.
• HTTP accessible document types e.g., vCard, iCalendar, RSS, Atom, CSV, and many others.

Q13. What are the most promising application domains where you can apply triple store technology such as Virtuoso?

Kingsley Uyi Idehen: Any application that benefits from high-performance and scalable access to heterogeneously shaped data across disparate data sources. Healthcare, Pharmaceuticals, Open Government, Privacy enhanced Social Web and Media, Enterprise Master Data Management, Big Data Analytics etc..

Q14. Big Data Analysis: could you connect Virtuoso with Hadoop? How does Viruoso relate to commercial data analytics platforms, e.g Hadapt, Vertica?

Kingsley Uyi Idehen: You can integrate data managed by Hadoop based ETL workflows via ODBC or Web Services driven by Hapdoop clusters that expose RESTful interaction patterns for data access. As for how Virtuoso relates to the likes of Vertica re., analytics, this is about Virtuoso being the equivalent of Vertica plus the added capability of RDF based data management, Linked Data Deployment, and share-nothing clustering. There is no job that Vertica performs that Virtuoso can’t perform.
There are several jobs that Virtuoso can perform that Vertica, VoltDB, Hadapt, and many other NoSQL and NewSQL simply cannot perform with regards to scalable, high-performance RDF data management and Linked Data deployment. Remember, RDF based Linked Data is all about data management and data access without any kind of platform lock-in. Virtuoso locks you into a value proposition (performance and scale) not the platform itself.

Q15. Do you also benchmark loading trillion of RDF triples? Do you have current benchmark results? How much time does it take to querying them?

Kingsley Uyi Idehen: As per my earlier responses, there is no shortage of benchmark material for Virtuoso.
The benchmarks are also based on realistic platform configurations unlike the RDBMS patterns of the past which compromised the utility of TPC benchmarks.

Q16. In your opinion, what are the main current obstacles for the adoption of Semantic Web technologies in the Enterprise?

Kingsley Uyi Idehen:The only obstacle to Semantic Web technologies in the enterprise lies in better articulation of the value proposition in a manner that reflects the concerns of enterprises. For instance, the non disruptive nature of Semantic Web technologies with regards to all enterprise data integration and virtualization initiatives has to be the focal point.

Links:

[1]. — 5-Star Linked Data URIs and Semiotic Triangle
[2]. — what do HTTP URIs Identify?
[3]. — View Source Pattern & Web Bootstrap
[4]. — Unified View of Data using the Entity Relationship Model (Peter Chen’s 1976 dissertation)
[5]. — Serendipitous Discovery Quotient (SDQ).

——————–
Kingsley Idehen is the Founder and CEO of OpenLink Software. He is an industry acclaimed technology innovator and entrepreneur in relation to technology and solutions associated with data management systems, integration middleware, open (linked) data, and the semantic web.

Kingsley has been at the helm of OpenLink Software for over 20 years during which he has actively provided dual contributions to OpenLink Software and the industry at large, exemplified by contributions and product deliverables associated with: Open Database Connectivity (ODBC), Java Database Connectivity (JDBC), Object Linking and Embedding (OLE-DB), Active Data Objects based Entity Frameworks (ADO.NET), Object-Relational DBMS technology (exemplified by Virtuoso), Linked (Open) Data (where DBpedia and the LOD cloud are live showcases), and the Semantic Web vision in general.
————-

Resources

— 50 – 150 Billion scale Berlin SPARQL Benchmark (BSBM) report (.pdf)

– History of Virtuoso

–ODBMS.org free resources on : Relational Databases, NewSQL, XML Databases, RDF Data Stores

– MySQL-State of the Union. Interview with Tomas Ulin. February 11, 2013

Follow ODBMS Industry Watch on Twitter: @odbmsorg

On Big Data Velocity. Interview with Scott Jarr.

Roberto V. Zicari — Mon, 28 Jan 2013 07:54:10 +0000

“There is only so much static data in the world as of today. The vast majority of new data, the data that is said to explode in volume over the next 5 years, is arriving from a high velocity source. It’s funny how obvious it is when you think about it. The only way to get Big Data in the future is to have it arrive in a high velocity rate ” — Scott Jarr.

One of the key technical challenges of Big Data is (Data) Velocity. On that, I have interviewed Scott Jarr, Co-founder and Chief Strategy Officer of VoltDB.

RVZ

Q1. Marc Geall, past Head of European Technology Research at Deutsche Bank AG/London, writes about the “Big Data myth”, claiming that there is:
1) limited need of petabyte-scale data today,
2) very low proportion of databases in corporate deployment which requires more than tens of TB of data to be handled, and
3) lack of availability and high cost of highly skilled operators (often post-doctoral) to operate highly scalable NoSQL clusters.
What is your take on this?

Scott Jarr: Interestingly I agree with a lot of this for today. However, I also believe we are in the midst of a massive shift in business to what I call data-as-a-priority.
We are just beginning, but you can already see the signs. People are loathed to get rid of anything, sensors are capturing finer resolutions, and people want to make far more, data informed decisions.
I also believe that the value that corporate IT teams were able to extract from data with the advent of data warehouses really whet the appetite of what could be done with data. We are now seeing people ask questions like “why can’t I see this faster,” or “how do we use this incoming data to better serve customers,” or “how can we beat the other guys with our data.”
Data is becoming viewed as a corporate weapon. Add inbound data rates (velocity) combined with the desire to use data for better decisions and you have data sizes that will dwarf what is considered typical today. And almost no industry is excluded. The cost ceiling has collapsed.

Q2: What are the typical problems that are currently holding back many Big Data projects?

Scott Jarr:
1) Spending too much time trying to figure out what solution to use for what problem. We were seeing this so often that we created a graphic and presentation that addresses this topic. We called it the Data Continuum.
2) Putting out fires that the current data environment is causing. Most infrastructures aren’t ready for the volume or velocity of data that is already starting to arrive at their doorsteps. They are spending a ton of time dealing with band-aids on small-data-infrastructure and unable to shake free to focus on the Big Data infrastructure that will be a longer-term fix.
3) Being able to clearly articulate the business value the company expects to achieve from a Big Data project has a way of slowing things down in a radical way.
4) Most of the successes in Big Data projects today are in situations where the company has done a very good job maintaining a reasonable scope to the project.

Q3: Why is it important to solve the Velocity problem when dealing with Big Data projects?

Scott Jarr: There is only so much static data in the world as of today. The vast majority of new data, the data that is said to explode in volume over the next 5 years, is arriving from a high velocity source. It’s funny how obvious it is when you think about it. The only way to get Big Data in the future is to have it arrive in a high velocity rate.
Companies are recognizing the business value they can get by acting on that data as it arrives rather than depositing it in a file to be batch processed at some later data. So much of the context that makes that data is lost when it not acted on quickly.

Q4: What exactly is Big Data Velocity? Is Big Data Velocity the same as stream computing?

Scott Jarr: We think of Big Data Velocity as data that is coming into the organization at a rate that can’t be managed in the traditional database. However, companies want to extract the most value they can from that data as it arrives. We see them doing three specific things:
1) Ingesting relentless feed(s) of data;
2) Making decisioning on each piece of data as it arrives; and
3) Using real-time analytics to derive immediate insights into what is happening with this velocity data.

Making the best possible decision each time data is touched is what velocity is all about. These decisions used to be called transactions in the OLTP world. They involve using other data stored in the database to make decision – approve a transaction, server the ad, authorize the access, etc. These decisions, and the real-time analytics that support them, all require the context of other data. In other words, the database used to perform these decisions must hold some amount of previously processed data – they must hold state. Streaming systems are good at a different set of problems.

Q5: Two other critical factors often mentioned for Big Data projects are: 1) Data discovery: How to find high-quality data from the Web? and 2) Data Veracity: How can we cope with uncertainty, imprecision, missing values, mis-statements or untruths? Any thoughts on these?

Scott Jarr: We have a number of customers who are using VoltDB in ways to improve data quality within their organization. We have one customer who is examining incoming financial events and looking for misses in sequence numbers to determine lost or miss-ordered information. Likewise, a popular use case is to filter out bad data as it comes in by looking at it in its high velocity state against a known set of bad or good characteristics. This keeps much of the bad data from ever entering the data pipeline.

Q6: Scalability has three aspects: data volume, hardware size, and concurrency. Scale and performance requirements for Big Data strain conventional databases. Which database technology is best to scale to petabytes?

Scott Jarr: VoltDB is focused on a very different problem, which is how to process that data prior to it landing in the long-term petabyte system. We see customers deploying VoltDB in front of both MPP OLAP and Hadoop, in roughly the same numbers. It really all depends on what the customer is ultimately trying to do with the data once it settles into its resting state in the petabyte store.

Q7: A/B testing, sessionization, bot detection, and pathing analysis all require powerful analytics on many petabytes of semi-structured Web data. Do you have some customers examples in this area?

Scott Jarr: Absolutely. Taken broadly, this is one of the most common uses of VoltDB. Micro-segmentation and on-the-fly ad content optimization are examples that we see regularly. The ability to design an ad, in real-time, based on five sets of audience meta-data can have a radical impact on performance.

Q8: When would you recommend to store Big Data in a traditional Data Warehouse and when in Hadoop?

Scott Jarr: My experience here is limited. As I said, our customers are using VoltDB in front of both types of stores to do decisioning and real-time analytics before the data moves into the long term store. Often, when the data is highly structured, it goes into a data warehouse and when it is less structured, it goes into Hadoop.

Q9: Instead of stand-alone products for ETL, BI/reporting and analytics wouldn’t it be better to have a seamless integration? In what ways can we open up a data processing platform to enable applications to get closer?

Scott Jarr: This is very much inline with our vision of the world. As Mike (Stonebraker , VoltDB founder) has stated for years, in high performance data systems, you need to have specialized databases. So we see the new world having far more data pipelines than stand alone databases. A data pipeline will have seamless integrations between velocity stores, warehouses, BI tools and exploratory analytics. Standards go a long way to making these integrations easier.

Q10: Anything you wish to add?

Scott Jarr.: Thank you Roberto. Very interesting discussion.

——————–
VoltDB Co-founder and Chief Strategy Officer Scott Jarr. Scott brings more than 20 years of experience building, launching and growing technology companies from inception to market leadership in highly competitive environments.
Prior to joining VoltDB, Scott was VP Product Management and Marketing at on-line backup SaaS leader LiveVault Corporation. While at LiveVault, Scott was key in growing the recurring revenue business to 2,000 customers strong, leading to an acquisition by Iron Mountain. Scott has also served as board member and advisor to other early-stage companies in the search, mobile, security, storage and virtualization markets. Scott has an undergraduate degree in mathematical programming from the University of Tampa and an MBA from the University of South Florida.

– Two cons against NoSQL. Part II. on November 21, 2012

– Two Cons against NoSQL. Part I. on October 30, 2012

– Interview with Mike Stonebraker. on May 2, 2012

Resources

– Big Data: Challenges and Opportunities.
Roberto V. Zicari, October 5, 2012.
Abstract: In this presentation I review three current aspects related to Big Data:
1. The business perspective, 2. The Technology perspective, and 3. Big Data for social good.
Presentation (89 pages) | Intermediate| English | DOWNLOAD (PDF)| October 2012|
##

You can follow ODBMS.org on Twitter : @odbmsorg.
——————————-