ODBMS Industry Watch – Trends and Information on Big Data, New Data Management Technologies, Data Science and Innovation.

A Grand Tour of Big Data. Interview with Alan Morrison
http://www.odbms.org/blog/2016/02/a-grand-tour-of-big-data-interview-with-alan-morrison/
25 February 2016

“Leading enterprises have a firm grasp of the technology edge that’s relevant to them. Better data analysis and disambiguation through semantics is central to how they gain competitive advantage today.”–Alan Morrison.

I have interviewed Alan Morrison, senior research fellow at PwC's Center for Technology and Innovation.
The main topic of the interview is how the Big Data market is evolving.


Q1. How do you see the Big Data market evolving? 

Alan Morrison: We should note first of all how true Big Data and analytics methods emerged and what has been disruptive. Over the course of a decade, web companies have donated IP and millions of lines of code that serve as the foundation for what's being built on top. In the process, they've built an open source culture that is currently driving most big data-related innovation. As you mentioned to me last year, Roberto, a lot of database innovation was the result of people outside the world of databases changing what they thought needed to be fixed, people who really weren't versed in database technologies to begin with.

Enterprises and the database and analytics systems vendors who serve them have to constantly adjust to the innovation that’s being pushed into the open source big data analytics pipeline. Open source machine learning is becoming the icing on top of that layer cake.

Q2. In your opinion what are the challenges of using Big Data technologies in the enterprise?

Alan Morrison: Traditional enterprise developers were thrown for a loop back in the late 2000s when it came to open source software, and they're still adjusting. The severity of the problem differs depending on the age of the enterprise. In our 2012 issue of the Forecast on DevOps, we made clear distinctions between three age classes of companies: legacy mainstream enterprises, pre-cloud enterprises and cloud natives. Legacy enterprises could have systems that are 50 years old or more still in place and have simply added to those. Pre-cloud enterprises are fighting with legacy that's up to 20 years old. Cloud natives don't have to fight legacy and can start from scratch with current tech.

DevOps (dev + ops) is an evolution of agile development that focuses on closer collaboration between developers and operations personnel. It enables multiple daily updates to operational codebases and feedback-response loop tuning by making small code changes and seeing how those changes affect user experience and behaviour. The linked article makes a distinction between legacy, pre-cloud and cloud native enterprises in terms of their inherent level of agility:

Most enterprises are in the legacy mainstream group, and the technology adoption challenges they face are the same regardless of the technology. Building feedback-response loops for a data-driven enterprise is more complicated in these older environments. But you can create guerrilla teams to kickstart the innovation process.

Q3. Is the Hadoop ecosystem now ready for enterprise deployment at large scale? 

Alan Morrison: Hadoop is ten years old at this point, and Yahoo, a very large mature enterprise, has been running Hadoop on 10,000 nodes for years now. Back in 2010, we profiled a legacy mainstream media company that was doing logfile analysis from all of its numerous web properties on a Hadoop cluster quite effectively. Hadoop is to the point where people in their dens and garages are putting it on Raspberry Pi systems. Lots of companies are storing data in or staging it from HDFS. HDFS is a given. MapReduce, on the other hand, has given way to Spark.

HDFS preserves files in their original format immutably, and that’s important. That innovation was crucial to data-driven application development a decade ago. But Hadoop isn’t the end state for distributed storage, and NoSQL databases aren’t either. It’s best to keep in mind that alternatives to Hadoop and its ecosystem are emerging.

I find it fascinating what folks like LinkedIn and Metamarkets are doing on the data architecture front with the Kappa architecture–essentially a stream processing architecture that also works for batch analytics, a system where operational and analytical data are one and the same. That's appropriate for fully online, all-digital businesses. You can use HDFS, S3, GlusterFS or some other file system along with a database such as Druid. On the transactional side of things, the nascent IPFS (the Interplanetary File System) anticipates both peer-to-peer computing and the use of blockchains in environments that are more and more distributed. Here's a diagram we published last year that describes this evolution to date:

From PwC Technology Forecast 2015
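
To make the Kappa idea concrete, here is a minimal Python sketch (a toy, not LinkedIn's or Metamarkets' code): a single append-only log is the only source of truth, the real-time path applies each event incrementally, and the "batch" path is simply a replay of the same log with the same update function.

```python
# Toy Kappa-style pipeline: one immutable log, two read paths.
# In production the log would be Kafka and the serving store Druid,
# HDFS or S3; here everything is in memory for illustration.

log = []  # the single source of truth: an append-only event log

def append(event):
    log.append(event)

def apply_event(state, event):
    # incremental update used by the streaming path,
    # e.g. a running sum per key
    state[event["key"]] = state.get(event["key"], 0) + event["value"]

def batch_recompute():
    # the "batch" path is a replay of the log from offset 0,
    # using exactly the same update function as the streaming path
    state = {}
    for event in log:
        apply_event(state, event)
    return state

append({"key": "clicks", "value": 3})
append({"key": "clicks", "value": 2})
print(batch_recompute())  # {'clicks': 5}
```

Because both paths share one update function, the operational and analytical views cannot drift apart, which is the property Morrison highlights.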

People shouldn’t be focused on Hadoop, but what Hadoop has cleared a path for that comes next.

Q4. What are in your opinion the most innovative Big Data technologies?

Alan Morrison: The rise of immutable data stores (HDFS, Datomic, Couchbase and other comparable databases, as well as blockchains) was significant because it acknowledged that data history and permanence matter, and that the technology is mature enough and the cost low enough to eliminate the need to overwrite. These data stores also established that eliminating overwrites removes a cause of contention. We're moving toward native cloud and eventually the P2P fog (localized, more truly distributed computing) that will extend the footprint of the cloud for the Internet of Things.
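
A toy append-only store shows why eliminating overwrites matters. The sketch below is illustrative only (it is not how HDFS, Datomic or a blockchain is implemented): every write appends a new fact, so the full history remains queryable and no writer ever contends with another over a cell being overwritten.

```python
import time

class ImmutableStore:
    """Append-only key-value store: writes add versions, never overwrite."""

    def __init__(self):
        self.log = []  # (timestamp, key, value) facts, append-only

    def write(self, key, value):
        self.log.append((time.time(), key, value))

    def read(self, key, as_of=None):
        # latest value for `key` at or before `as_of` (default: now),
        # found by scanning the history rather than a mutable cell
        value = None
        for ts, k, v in self.log:
            if k == key and (as_of is None or ts <= as_of):
                value = v
        return value

store = ImmutableStore()
store.write("status", "draft")
store.write("status", "published")
print(store.read("status"))  # 'published'; 'draft' is still in the log
```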

Unsupervised machine learning has made significant strides in the past year or two, and it has become possible to extract facts from unstructured data, building on the success of entity and relationship extraction. What this advance implies is the ability to put humans in feedback loops with machines, where they let machines discover the data models and facts and then tune or verify those data models and facts.

In other words, large enterprises now have the capability to build their own industry- and organization-specific knowledge graphs and begin to develop cognitive or intelligent apps on top of those knowledge graphs, along the lines of what Cirrus Shakeri of Inventurist envisions.


From Cirrus Shakeri, "From Big Data to Intelligent Applications," blog post, January 2015

At the core of computable semantic graphs (Shakeri’s term for knowledge graphs or computable knowledge bases) is logically consistent semantic metadata. A machine-assisted process can help with entity and relationship extraction and then also ontology generation.

Computability = machine readability. Semantic metadata–the kind of metadata cognitive computing apps use–can be generated with the help of a well-designed and updated ontology. More and more, these ontologies are uncovered in text rather than hand-built, but again, there's no substitute for humans in the loop. Think of cognitive app development as a continual feedback-response loop process. The use of agents can facilitate the construction of these feedback loops.
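
As a minimal illustration of what a computable knowledge graph looks like at the data level, the sketch below stores extracted facts as subject-predicate-object triples and answers pattern queries over them. The entities and relations are invented; in the process described above, a machine-assisted pipeline would populate the graph via entity and relationship extraction, with humans verifying the results.

```python
# Minimal triple store: each fact is a (subject, predicate, object) triple.
triples = {
    ("Acme", "is_a", "Supplier"),
    ("Acme", "supplies", "WidgetCo"),
    ("WidgetCo", "is_a", "Manufacturer"),
}

def query(s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

print(query(p="supplies"))        # who supplies whom
print(query(s="Acme", p="is_a"))  # what kind of entity Acme is
```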

Q5. In a recent note Carl Olofson, Research Vice President, Data Management Software Research, IDC, predicted the RIP of “Big Data” as a concept. What is your view on this?

Alan Morrison: I agree the term is nebulous and can be misleading, and we’ve had our fill of it. But that doesn’t mean it won’t continue to be used. Here’s how we defined it back in 2009:

Big Data is not a precise term; rather, it is a characterization of the never-ending accumulation of all kinds of data, most of it unstructured. It describes data sets that are growing exponentially and that are too large, too raw, or too unstructured for analysis using relational database techniques. Whether terabytes or petabytes, the precise amount is less the issue than where the data ends up and how it is used. (See https://www.pwc.com/us/en/technology-forecast/assets/pwc-tech-forecast-issue3-2010.pdf, pg. 6.)

For that issue of the Forecast, we focused on how Hadoop was being piloted in enterprises and the ecosystem that was developing around it. Hadoop, along with NoSQL databases, was the primary disruptive technology. It helps to consider the data challenge of the 2000s and how relational databases and enterprise data warehousing techniques were falling short at that point. Hadoop has reduced the cost of analyzing data by an order of magnitude and allows processing of very large unstructured datasets. NoSQL has made it possible to move away from rigid data models and standard ETL.

“Big Data” can continue to be shorthand for petabytes of unruly, less structured data. But why not talk about the system instead of just the data? I like the term that George Gilbert of Wikibon latched on to last year. I don’t know if he originated it, but he refers to the System of Intelligence. That term gets us beyond the legacy, pre-web “business intelligence” term, more into actionable knowledge outputs that go beyond traditional reporting and into the realm of big data, machine learning and more distributed systems. The Hadoop ecosystem, other distributed file systems, NoSQL databases and the new analytics capabilities that rely on them are really at the heart of a System of Intelligence.

Q6. How many enterprise IT systems do you think we will need to interoperate in the future? 

Alan Morrison: I like Geoffrey Moore's observations about a System of Engagement that emerged after the System of Record, and just last year George Gilbert added to that taxonomy with a System of Intelligence. But you could add further to that with a System of Collection that we still need to build. In this taxonomy, the System of Collection articulates how the Internet of Things at scale would function on the input side. The System of Engagement would allow distribution of the outputs. For the outputs of the System of Collection to be useful, that system will need to interoperate in various ways with the other systems.

To summarize, there will ultimately be four enterprise IT systems that need to interoperate: the System of Record, the System of Engagement, the System of Intelligence and the System of Collection. Three of these exist; the fourth still needs to be created.

The fuller picture will only emerge when this interoperation becomes possible.

Q7. What are the  requirements, heritage and legacy of such systems?

Alan Morrison: The System of Record (RDBMSes) still relies on databases and tech with their roots in the pre-web era. I'm not saying these systems haven't been substantially evolved and refined, but they do still reflect a centralized, pre-web mentality. Bitcoin and blockchain technology make it clear that the future of Systems of Record won't always be centralized. In fact, microtransaction flows in the Internet of Things at scale will depend on the decentralized approaches, algorithmic transaction validation, and immutable audit trail creation that blockchains inspire.

The Web is only an interim step in the distributed system evolution. P2P systems will eventually complement the web, but they'll take a long time to kick in fully–well into the next decade. There's always the S-curve of adoption that starts flat for years. P2P has ten years of an installed base of cloud tech, twenty years of web tech and fifty-plus years of centralized computing to contend with. The bitcoin blockchain seems to have finally kicked P2P into gear, but progress will be slow through 2020.

The System of Engagement (requiring Web DBs) primarily relies on Web technology (MySQL and NoSQL) in conjunction with traditional CRM and other customer-related structured databases.

The System of Intelligence (requiring Web file systems and less structured DBs) primarily relies on NoSQL, Hadoop, the Hadoop ecosystem and its successors, but is built around a core DW/DM RDBMS analytics environment with ETLed structured data from the System of Record and System of Engagement. The System of Intelligence will have to scale and evolve to accommodate input from the System of Collection.

The System of Collection (requiring distributed file systems and DBs) will rely on distributed file system successors to Hadoop and HTTP, such as IPFS, and the more distributed successors to MySQL and NoSQL. Over the very long term, a peer-to-peer architecture will emerge that will be necessary to extend the footprint of the Internet of Things and allow it to scale.

Q8. Do you already have the piece parts to begin to build out a 2020+ intersystem vision now?

Alan Morrison: Contextual, ubiquitous computing is the vision of the 2020s, but to get to that, we need an intersystem approach. Without interoperation of the four systems I’ve alluded to, enterprises won’t be able to deliver the context required for competitive advantage. Without sufficient entity and relationship disambiguation via machine learning in machine/human feedback loops, enterprises won’t be able to deliver the relevance for competitive advantage.

We do have the piece parts to begin to build out an intersystem vision now. For example, interoperation is a primary stumbling block that can be overcome now. Middleware has been overly complex and inadequate to the current-day task, but middleware platforms such as EnterpriseWeb are emerging that can reach out as an integration fabric for all systems, up and down the stack. Here’s how the integration fabric becomes an essential enabler for the intersystem approach:

PwC, 2015

A lot of what EnterpriseWeb (full disclosure: a JBR partner of PwC) does hinges on the creation and use of agents and semantic metadata that enable the data/logic virtualization. That's what makes the desiloing possible. The EnterpriseWeb platform is a full-stack virtual integration and application platform, using methods that have data-layer granularity but process-layer impact. Enterprise architects can tune their models and update operational processes at the same time. The result: every change is model-driven and near real-time. Stacks can all be simplified down to uniform, virtualized composable entities using enabling technologies that work at the data layer. Here's how they work:

PwC, 2015

So basically you can do process refinement across these systems, and intersystem analytics views thus also become possible.

Qx. Anything else you wish to add?

Alan Morrison: We always quote science fiction writer William Gibson, who said,

“The future is already here — it’s just not very evenly distributed.”

Enterprises would do best to remind themselves what's possible now and start working with it. You've got to grab onto that technology edge and let it pull you forward. If you don't understand what's possible, what's most relevant to your future business success, and how to use it, you'll never make progress and you'll always be reacting to crises. Leading enterprises have a firm grasp of the technology edge that's relevant to them. Better data analysis and disambiguation through semantics is central to how they gain competitive advantage today.

We do a ton of research to get to the big picture and find the real edge, where tech could actually have a major business impact. And we try to think about what the business impact will be, rather than just thinking about the tech. Most folks who are down in the trenches are dismissive of the big picture, but the fact is they aren’t seeing enough of the horizon to make an informed judgement. They are trying to use tools they’re familiar with to address problems the tools weren’t designed for. Alongside them should be some informed contrarians and innovators to provide balance and get to a happy medium.

That’s how you counter groupthink in an enterprise. Executives need to clear a path for innovation and foster a healthy, forward-looking, positive and tolerant mentality. If the workforce is cynical, that’s an indication that they lack a sense of purpose or are facing systemic or organizational problems they can’t overcome on their own.

Alan Morrison (@AlanMorrison) is a senior research fellow at PwC, a longtime technology trends analyst and an issue editor of the firm's Technology Forecast.


Resources

Data-driven payments. How financial institutions can win in a networked economy, by Mark Flamme, Partner; Kevin Grieve, Partner; Mike Horvath, Principal, Strategy&. February 4, 2016, ODBMS.org

The rise of immutable data stores, by Alan Morrison, Senior Manager, PwC Center for Technology and Innovation (CTI). October 9, 2015, ODBMS.org

The enterprise data lake: Better integration and deeper analytics, by Brian Stein and Alan Morrison, PwC. August 20, 2014, ODBMS.org

Related Posts

On the Industrial Internet of Things. Interview with Leon Guzenda, ODBMS Industry Watch, January 28, 2016

On Big Data and Society. Interview with Viktor Mayer-Schönberger, ODBMS Industry Watch, January 8, 2016

On Big Data Analytics. Interview with Shilpa Lawande, ODBMS Industry Watch, December 10, 2015

On Dark Data. Interview with Gideon Goldin, ODBMS Industry Watch, November 16, 2015

Follow us on Twitter: @odbmsorg


On Big Data Analytics. Interview with Shilpa Lawande
http://www.odbms.org/blog/2015/12/on-big-data-analytics-interview-with-shilpa-lawande/
10 December 2015

“Really, I would say this is indeed the essence of Big Data – being able to harness data from millions of endpoints whether they be devices or users, and optimizing outcomes for the individual, not just for the collective!”–Shilpa Lawande.

I have been following Vertica since its acquisition by HP back in 2011. This is my third interview with Shilpa Lawande, now Vice President at Hewlett Packard Enterprise and responsible for the strategic direction of the HP Big Data Platforms, including the HP Vertica Analytic Platform.
The first interview I did with Shilpa was back on November 16, 2011 (soon after the acquisition by HP), and the second on July 14, 2014.
If you read the three interviews (see links to the two previous interviews at the end of this interview), you will notice how fast the Big Data Analytics and Data Platforms world is changing.


Q1. What are the main technical challenges in offering data analytics in real time? And what are the main problems that occur when trying to ingest and analyze high-speed streaming data from various sources?

Shilpa Lawande: Before we talk about technical challenges, I would like to point out the difference between two classes of analytic workloads that often get grouped under “streaming” or “real-time analytics”.

The first and perhaps more challenging workload deals with analytics at large scale on stored data, where new data may be coming in very fast, in micro-batches.
In this workload, the challenges are twofold – the first is reducing the latency between ingest and analysis, in other words, ensuring that data can be made available for analysis soon after it arrives, and the second is offering rich, fast analytics on the entire data set, not just the latest batch. This type of workload is a facet of any use case where you want to build reports or predictive models on the most up-to-date data, provide up-to-date personalized analytics for a large number of users, or collect and analyze data from millions of devices. Vertica excels at solving this problem at very large petabyte scale and with very small micro-batches.

The second type of workload deals with analytics on data in flight (sometimes called fast data), where you want to analyze windows of incoming data and take action – perhaps to enrich the data, discard some of it, or aggregate it – before the data is persisted. An example of this type of workload might be taking data that arrives at arbitrary times and granularities and keeping only the average, min, and max data points per second, minute, or hour for permanent storage. This use case is typically solved by in-memory streaming engines like Storm or, in cases where more state is needed, a NewSQL system like VoltDB, both of which we consider complementary to Vertica.
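
The second workload can be sketched in a few lines. The toy below (an illustration of the pattern, not how Storm or VoltDB works internally) buckets an incoming stream into per-second windows and keeps only the average, min and max for permanent storage, discarding the raw points:

```python
from collections import defaultdict

windows = defaultdict(list)  # second -> raw points, held only in memory

def ingest(ts, value):
    windows[int(ts)].append(value)  # bucket by whole second

def flush(second):
    # summarize and discard the raw points; only this summary persists
    points = windows.pop(second, [])
    if not points:
        return None
    return {"second": second,
            "avg": sum(points) / len(points),
            "min": min(points),
            "max": max(points)}

ingest(100.1, 4.0)
ingest(100.7, 8.0)
print(flush(100))  # {'second': 100, 'avg': 6.0, 'min': 4.0, 'max': 8.0}
```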

Q2. Do you know of organizations that already today consume, derive insight from, and act on large volume of data generated from millions of connected devices and applications?

Shilpa Lawande: HP Inc. and Hewlett Packard Enterprise (HPE) are both great examples of this kind of organization. A number of our products – servers, storage, and printers – all collect telemetry about their operations and bring that data back to analyze for purposes of quality control, predictive maintenance, as well as optimized inventory/parts supply chain management.
We’ve also seen organizations collect telemetry across their networks and data centers to anticipate servers going down, as well as to have better understanding of usage to optimize capacity planning or power usage. If you replace devices by users in your question, online and mobile gaming companies, social networks and adtech companies with millions of daily active users all collect clickstream data and use it for creating new and unique personalized experiences. For instance, user churn is a huge problem in monetizing online gaming.
If you can detect, from the in-game interactions, that users are losing interest, then you can immediately take action to hold their attention just a little bit longer or to transition them to a new game altogether. Companies like Game Show Network and Zynga do this masterfully using Vertica real-time analytics!

Really, I would say this is indeed the essence of Big Data – being able to harness data from millions of endpoints whether they be devices or users, and optimizing outcomes for the individual, not just for the collective!

Q3. Could you comment on the strategic decision of HP to enhance its support for Hadoop?

Shilpa Lawande: As you know, HP recently split into Hewlett Packard Enterprise (HPE) and HP Inc.
With HPE, which is where Big Data and Vertica reside, our strategy is to provide our customers with the best end-to-end solutions for their big data problems, including hardware, software and services. We believe that technologies such as Hadoop, Spark, Kafka and R are key tools in the Big Data ecosystem, and that deep integration between our technology, such as Vertica, and these open-source tools enables us to solve our customers' problems more holistically.
At Vertica, we have been working closely with the Hadoop vendors to provide better integrations between our products.
Some notable, recent additions include our ongoing work with Hortonworks to provide an optimized Vertica SQL-on-Hadoop version for the ORC file data format, as well as our integration with Apache Kafka.

Q4. The new version of HPE Vertica, “Excavator,” is integrated with Apache Kafka, an open source distributed messaging system for data streaming. Why?

Shilpa Lawande: As I mentioned earlier, one of the challenges with streaming data is ingesting it in micro-batches at low latency and high scale. Vertica has always had the ability to do so due to its unique hybrid load architecture, whereby data is ingested into a Write-Optimized Store in memory and then optimized and persisted to a Read-Optimized Store on disk.
Before “Excavator,” the onus for engineering the ingest architecture was on our customers. Before Kafka, users were writing custom ingestion tools from scratch using ODBC/JDBC or staging data to files and then loading using Vertica’s COPY command. Besides the challenges of achieving the optimal load rates, users commonly ran into challenges of ensuring transactionality of the loads, so that each batch gets loaded exactly once even under esoteric error conditions. With Kafka, users get a scalable distributed messaging system that enables simplifying the load pipeline.
We saw the combination of Vertica and Kafka becoming a common design pattern and decided to standardize on this pattern by providing out-of-the-box integration between Vertica and Kafka, incorporating the best practices of loading data at scale. The solution aims to maximize the throughput of loads via micro-batches into Vertica, while ensuring transactionality of the load process. It removes a ton of complexity in the load pipeline from the Vertica users.
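
The general shape of such a loader looks something like the following Python sketch, built on the open-source kafka-python and vertica-python client libraries. This is an illustration of the design pattern, not HPE's built-in integration: the topic, table and connection details are invented, and the real integration also handles the exactly-once bookkeeping and error handling this sketch omits.

```python
from kafka import KafkaConsumer   # pip install kafka-python
import vertica_python             # pip install vertica-python

conn_info = {"host": "vertica.example.com", "port": 5433,
             "user": "dbadmin", "password": "...", "database": "analytics"}

consumer = KafkaConsumer("events",  # hypothetical topic name
                         bootstrap_servers="kafka.example.com:9092",
                         enable_auto_commit=False)

conn = vertica_python.connect(**conn_info)
cur = conn.cursor()
batch = []
for message in consumer:
    batch.append(message.value.decode())
    if len(batch) >= 10000:  # one micro-batch
        # COPY streams the whole batch into Vertica's write-optimized
        # store as a single load (remainder/retry logic omitted)
        cur.copy("COPY events_raw FROM STDIN DELIMITER ','",
                 "\n".join(batch))
        conn.commit()
        consumer.commit()  # advance Kafka offsets only after the load
        batch = []
```

Committing the Kafka offsets only after the database commit is what keeps a crash from silently dropping a batch; turning that "at least once" into "exactly once" is precisely the bookkeeping the out-of-the-box integration standardizes.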

Q5. What are the pros and cons of this design choice (if any)?

Shilpa Lawande: The pros are that if you already use Kafka, much of the work of ingesting data into Vertica is done for you. Having seen so many different kinds of ingestion horror stories over the past decade, trust me, we’ve eliminated a ton of complexity that you don’t need to worry about anymore. The cons are, of course, that we are making the choice of the tool for you. We believe that the pros far outweigh any cons. :-)

Q6. What kind of enhanced SQL analytics do you provide?

Shilpa Lawande: Great question. Vertica of course provides all the standard SQL analytic capabilities including joins, aggregations, analytic window functions, and, needless to say, performance that is a lot faster than any other RDBMS. :) But we do much more than that. We've built some unique time-series analysis capabilities (via SQL) to operate on event streams, such as gap-filling and interpolation and event-series joins. You can use this feature to do common operations like sessionization in three or four lines of SQL. We can do this because data in Vertica is always sorted, and this makes Vertica a superior system for time-series analytics. Our pattern matching capabilities enable user path or marketing funnel analytics using simple SQL, which might otherwise take pages of code in Hive or Java.
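
For example, sessionization can be written along the lines of the sketch below, which uses Vertica's CONDITIONAL_TRUE_EVENT analytic function to start a new session whenever a user has been idle for more than 30 minutes. The table and column names are invented, and the exact syntax should be checked against the documentation for your Vertica version:

```python
import vertica_python  # pip install vertica-python

# Hypothetical clickstream table with (userid, ts) columns.
SESSIONIZE = """
SELECT userid,
       ts,
       CONDITIONAL_TRUE_EVENT(ts - LAG(ts) > '30 minutes')
           OVER (PARTITION BY userid ORDER BY ts) AS session_id
FROM clickstream;
"""

conn = vertica_python.connect(host="vertica.example.com", port=5433,
                              user="dbadmin", password="...",
                              database="analytics")
cur = conn.cursor()
cur.execute(SESSIONIZE)
for userid, ts, session_id in cur.fetchall():
    print(userid, ts, session_id)
```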
With the open source Distributed R engine, we provide predictive analytical algorithms such as logistic regression and PageRank. These can be used to build predictive models using R, and the models can be registered into Vertica for in-database scoring. With Excavator, we've also added text search capabilities for machine log data, so you can now do both search and analytics over log data in one system. And you recently featured a five-part blog series by Walter Maguire examining why Vertica is the best graph analytics engine out there.

Q7. What kind of performance enhancements to Hadoop do you provide?

Shilpa Lawande: We see Hadoop, particularly HDFS, as highly complementary to Vertica. Our users often use HDFS as their data lake, for the exploratory/discovery phases of their data lifecycle. Our Vertica SQL on Hadoop offering includes the Vertica engine running natively on Hadoop nodes, providing all the advanced SQL capabilities of Vertica on top of data stored in HDFS. We integrate with native metadata stores like HCatalog and can operate on file formats like ORC files, Parquet, JSON, Avro, etc. to provide a much more robust SQL engine compared to alternatives like Hive, Spark or Impala, and with significantly better performance. And, of course, when users are ready to operationalize the analysis, they can seamlessly load the data into Vertica Enterprise, which provides the highest performance, compression, workload management, and other enterprise capabilities for your production workloads. The best part is that you do not have to rewrite your reports or dashboards as you move data from Vertica SQL on Hadoop to Vertica Enterprise.

Qx. Anything else you wish to add?

Shilpa Lawande: As we continue to develop the Vertica product, our goal is to provide the same capabilities in a variety of consumption and deployment models to suit different use cases and buying preferences. Our flagship Vertica Enterprise product can be deployed on-prem, in VMware environments, or in AWS via an AMI.
Our SQL on Hadoop product can be deployed directly in Hadoop environments, supporting all Hadoop distributions and a variety of native data formats. We also have Vertica OnDemand, our data-warehouse-as-a-service subscription that is accessible via a SQL prompt in AWS; HPE handles all of the operations such as database and OS software updates, backups, etc. We hope that by providing the same capabilities across many deployment environments and data formats, we give our users the maximum choice so they can pick the right tool for the job. It's all based on our signature core analytics engine.
We welcome new users to our growing community to download our Community Edition, which provides 1TB of Vertica on a three-node cluster for free, or sign-up for a 15-day trial of Vertica on Demand!

Shilpa Lawande is Vice President at Hewlett Packard Enterprise, responsible for strategic direction of the HP Big Data Platforms, including the flagship HP Vertica Analytic Platform. Shilpa brings over 20 years of experience in databases, data warehousing, analytics and distributed systems.
She joined Vertica at its inception in 2005, was one of the original engineers who built Vertica from the ground up, and ran the Vertica Engineering and Customer Experience teams for the better part of the last decade. Shilpa has been at HPE since 2011 through the acquisition of Vertica and has held a diverse set of roles spanning technology and business.
Prior to Vertica, she was a key member of the Oracle Server Technologies group where she worked directly on several data warehousing and self-managing features in the Oracle Database.

Shilpa is a co-inventor on several patents on database technology, both at Oracle and at HP Vertica.
She has co-authored two books on data warehousing using the Oracle database as well as a book on Enterprise Grid Computing.
She has been named to the 2012 Women to Watch list by Mass High Tech and the Rev Boston 2015 list, and was awarded HP Software Business Unit Leader of the Year in 2012 and 2013. As a working mom herself, Shilpa is passionate about STEM education for girls and Women in Tech issues, and co-founded the Datagals women's networking and advocacy group within HPE. In her spare time, she mentors young women at Year Up Boston, an organization that empowers low-income young adults to go from poverty to professional careers in a single year.


Related Posts

On HP Distributed R. Interview with Walter Maguire and Indrajit Roy. ODBMS Industry Watch, April 9, 2015

On Column Stores. Interview with Shilpa Lawande. ODBMS Industry Watch, July 14, 2014

On Big Data: Interview with Shilpa Lawande, VP of Engineering at Vertica. ODBMS Industry Watch, November 16, 2011

Follow ODBMS.org on Twitter: @odbmsorg


On SQL and NoSQL. Interview with Dave Rosenthal
http://www.odbms.org/blog/2014/03/dave-rosenthal/
18 March 2014

“Despite the obvious shared word ‘transaction’ and the canonical example of a database transaction which modifies multiple bank accounts, I don’t think that database transactions are particularly relevant to financial applications.”–Dave Rosenthal.

On SQL and NoSQL, I have interviewed Dave Rosenthal, CEO of FoundationDB.


Q1. What are the suggested criteria for users when they need to trade durability for lower latency, higher throughput and write availability?

Dave Rosenthal: There is a tradeoff between commit latency and durability–especially in distributed databases. At one extreme, a database client can just report success immediately (without even talking to the database server) and buffer the writes in the background. Obviously, that hides latency well, but you could lose a suffix of transactions. At the other extreme, you can replicate writes across multiple machines, fsync them on each of the machines, and only then report success to the client.

FoundationDB is optimized to provide good performance in its default setting, which is the safest end of that tradeoff.

Usually, if you want some reasonable durability guarantee, you are talking about a commit latency of a small constant factor times the network latency. So, the real latency issues come with databases spanning multiple data centers. In that case, FoundationDB users are able to choose whether they want durability guarantees in all data centers before commit (increasing commit latencies), which is our default setting, or whether they would like to relax durability guarantees by returning a commit when the data is fsync'd to disk in just one datacenter.
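
That spectrum of choices can be sketched as three toy commit strategies (this illustrates the tradeoff, not FoundationDB's implementation); what varies is the point at which the client is told "committed", which determines both the latency and what a crash can lose:

```python
import os

def commit_buffered(buffer, record):
    # Fastest: acknowledge immediately and flush in the background.
    # A crash can lose a suffix of acknowledged transactions.
    buffer.append(record)
    return "ok"

def commit_fsync_local(path, record):
    # Durable on one machine: acknowledge only after the bytes
    # are fsync'd to disk in one datacenter.
    with open(path, "ab") as f:
        f.write(record + b"\n")
        f.flush()
        os.fsync(f.fileno())
    return "ok"

def commit_fsync_replicated(replica_paths, record):
    # Safest and slowest: acknowledge only after every replica
    # (e.g. one per datacenter) has fsync'd the record, so latency
    # is gated by the slowest replica plus network round trips.
    for path in replica_paths:
        commit_fsync_local(path, record)
    return "ok"
```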

All that said, in general, we think that the application is usually a more appropriate place to try to hide latency than the database.

Q2. Justin Sheehy of Basho in an interview said [1] “I would most certainly include updates to my bank account as applications for which eventual consistency is a good design choice. In fact, bankers have understood and used eventual consistency for far longer than there have been computers in the modern sense”. What is your opinion on this?

Dave Rosenthal: Yes, we totally agree with Justin. Despite the obvious shared word ‘transaction’ and the canonical example of a database transaction which modifies multiple bank accounts, I don’t think that database transactions are particularly relevant to financial applications. In fact, true ACID transactions are way more broadly important than that. They give you the ability to build abstractions and systems that you can provide guarantees about.
As Michael Cahill says in his thesis which became the SIGMOD paper of the year: “Serializable isolation enables the development of a complex system by composing modules, after verifying that each module maintains consistency in isolation.” It’s this incredibly important ability to compose that makes a system with transactions special.

Q3. FoundationDB claims to provide full ACID transactions. How do you do that?

Dave Rosenthal: In the same basic way as many other transactional databases do. We use a few strategies that tend to work well in distributed systems, such as optimistic concurrency and MVCC. We also, of course, have had to solve some of the fundamental challenges associated with distributed systems and all of the crazy things that can happen in them. Honestly, it's not very hard to build a distributed transactional database. The hard part is making it work gracefully through failure scenarios and run fast.
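
A stripped-down sketch of one of those strategies, optimistic concurrency control, is shown below. It is a single-process toy, not FoundationDB's code, and it ignores the genuinely hard distributed parts; the idea is that transactions read without locking, remember what they read, and are validated at commit time, with conflicts resolved by aborting and retrying.

```python
class Store:
    def __init__(self):
        self.data = {}     # key -> value
        self.version = {}  # key -> version of the commit that wrote it
        self.clock = 0

class Txn:
    def __init__(self, store):
        self.store = store
        self.read_set = {}   # key -> version observed at read time
        self.write_set = {}  # key -> pending new value

    def read(self, key):
        self.read_set[key] = self.store.version.get(key, 0)
        return self.write_set.get(key, self.store.data.get(key))

    def write(self, key, value):
        self.write_set[key] = value

    def commit(self):
        # validate: abort if anything we read changed since we read it
        for key, seen in self.read_set.items():
            if self.store.version.get(key, 0) != seen:
                raise RuntimeError("conflict: retry the transaction")
        self.store.clock += 1
        for key, value in self.write_set.items():
            self.store.data[key] = value
            self.store.version[key] = self.store.clock
```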

Q4. Is this similar to Oracle NoSQL?

Dave Rosenthal: Not really. Both Oracle NoSQL and FoundationDB provide an automatically-partitioned key-value store with fault tolerance. Both also have a concept of ordering keys (for efficient range operations) though Oracle NoSQL only provides ordering “within a Major Key set”. So, those are the similarities, but there are a bunch of other NoSQL systems with all those properties. The huge difference is that FoundationDB provides for ACID transactions over arbitrary keys and ranges, while Oracle NoSQL does not.
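
In FoundationDB's public Python bindings, those two properties look roughly like the sketch below: the @fdb.transactional decorator runs a function as one ACID transaction (retrying it on conflict), and because keys are globally ordered, prefix and range reads are efficient. The account keys and values here are invented for the example.

```python
import fdb

fdb.api_version(300)  # use the API version matching your client
db = fdb.open()

@fdb.transactional
def seed(tr):
    tr[b"acct/alice"] = b"100"
    tr[b"acct/bob"] = b"100"

@fdb.transactional
def transfer(tr, src, dst, amount):
    # reads and writes over arbitrary keys commit atomically
    tr[src] = str(int(tr[src]) - amount).encode()
    tr[dst] = str(int(tr[dst]) + amount).encode()

@fdb.transactional
def balances(tr, prefix):
    # ordered keys make prefix/range scans cheap
    return [(bytes(k), bytes(v)) for k, v in tr.get_range_startswith(prefix)]

seed(db)
transfer(db, b"acct/alice", b"acct/bob", 25)
print(balances(db, b"acct/"))
```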

Q5. How would you compare your product offering with respect to NoSQL data stores, such as CouchDB, MongoDB, Cassandra and Riak, and NewSQL such as NuoDB and VoltDB?

Dave Rosenthal: The most obvious response for the NoSQL data stores would be “we have ACID transactions, they don’t”, but the more important difference is in philosophy and strategy.

Each of those products exposes a single data model and interface. Maybe two. We are pursuing a fundamentally different strategy.
We are building a storage substrate that can be adapted, via layers, to provide a variety of data models, APIs, and true flexibility.
We can do that because of our transactional capabilities. CouchDB, MongoDB, Cassandra and Riak all have different APIs, and we talk to companies that run all of those products side-by-side. The NewSQL database players are also offering a single data model, albeit a very popular one: SQL. FoundationDB is offering an ever-increasing number of data models through its "layers", currently including several popular NoSQL data models, with SQL being the next big one to hit. Our philosophy is that you shouldn't have to increase the complexity of your architecture by adopting a new NoSQL database each time your engineers need access to a new data model.
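
The "layer" idea is easiest to see in miniature: a layer is ordinary code that maps a higher-level data model onto ordered keys and leans on the substrate's transactions for correctness. The sketch below (illustrative only, not FoundationDB's document layer) flattens a nested, JSON-like document into key-value pairs that any ordered transactional store could hold:

```python
def doc_to_kv(doc_id, doc, path=()):
    """Flatten a nested document into ordered (key, value) pairs."""
    for field, value in sorted(doc.items()):
        if isinstance(value, dict):
            yield from doc_to_kv(doc_id, value, path + (field,))
        else:
            yield "doc/%s/%s" % (doc_id, "/".join(path + (field,))), value

doc = {"name": "Ada", "address": {"city": "London", "zip": "N1"}}
for key, value in doc_to_kv("u42", doc):
    print(key, "=>", value)
# doc/u42/address/city => London
# doc/u42/address/zip => N1
# doc/u42/name => Ada
```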

Q6. Cloud computing and open source: How does it relate to FoundationDB?

Dave Rosenthal: Cloud computing: FoundationDB has been designed from the beginning to run well in cloud environments that make use of large numbers of commodity machines connected through a network. Probably the most important aspect of a distributed database designed for cloud deployment is exceptional fault tolerance under very harsh and strange failure conditions – the kind of exceptionally unlikely things that can only happen when you have many machines working together with components failing unpredictably. We have put a huge amount of effort into testing FoundationDB in these grueling scenarios, and feel very confident in our ability to perform well in these types of environments. In particular, we have users running FoundationDB successfully on many different cloud providers, and we’ve seen the system keep its guarantees under real-world hardware and network failure conditions experienced by our users.

Open source: Although FoundationDB's core data storage engine is closed source, our layer ecosystem is open source. The core engine has a very simple feature set and is very difficult to modify properly while maintaining correctness, whereas layers are feature-rich and, because they are stateless, much easier to create and modify, which makes them well suited to third-party contributions.

Q7. Please give some examples of use cases where FoundationDB is currently in use. Is FoundationDB in use for analyzing Big Data as well?

Dave Rosenthal: Some examples: user data, metadata, user social graphs, geo data, data accessed via ORMs using the SQL layer, metrics collection, etc.

We’ve mostly focused on operational systems, but a few of our customers have built what I would call “big data” applications, which I think of as analytics-focused. The most common use case has been for collecting and analyzing time-series data. FoundationDB is strongest in big data applications that call for lots of random reads and writes, not just big table scans—which many systems can do well.

Q8. Rick Cattell said in a recent interview [2] "there aren't enough open source contributors to keep projects competitive in features and performance, and the companies supporting the open source offerings will have trouble making enough money to keep the products competitive themselves". What is your opinion on this?

Dave Rosenthal: People have great ideas for databases all the time. New data models, new query languages, etc.
If nothing else, this NoSQL experiment that we've all been a part of for the past few years has shown us the appetite for data models suited to specific problems. People would love to be able to build these tools, open source them, etc.
The problem is that the checklist of practical considerations for a database is huge: fault tolerance, scalability, a backup solution, management and monitoring, ACID transactions, etc. Add those together and even the simplest concept sounds like a huge project.

Our vision at FoundationDB is that we have done the hard work to build a storage substrate that simultaneously solves all those tricky practical problems. Our engine can be used to quickly build a database layer for any particular application that inherits all of those solutions and their benefits, like scalability, fault tolerance and ACID compliance.

Q9. Nick Heudecker of Gartner predicts that [3] "going forward, we see the bifurcation between relational and NoSQL DBMS markets diminishing over time". What is your take on this?

Dave Rosenthal: I do think that the lines between SQL and NoSQL will start to blur, and I believe that we are leading that charge. We acquired another database startup last year called Akiban that builds an amazing SQL database engine.
In 2014 we'll be bringing that engine to market as a layer running on top of FoundationDB. That will be a true ANSI SQL database operating as a module directly on top of a transactional "NoSQL" engine, inheriting the operational benefits of our core storage engine – scalability, fault tolerance, and ease of operation.

When you run multiple SQL layer modules, you can point many of them at the same key-space in FoundationDB, and it's as if they are all part of the same database, with ACID transactions enforced across the separate SQL layer processes.
It’s very cool. Of course, you can even run the SQL layer on a FoundationDB cluster that’s also supporting other data models, like graph or document. That’s about as blurry as it gets.

Dave Rosenthal is CEO of FoundationDB. Dave started his career in games, building a 3D real-time strategy game with a team of high-school friends that won the 1st annual Independent Games Festival. Previously, Dave was CTO at Visual Sciences, a pioneering web-analytics company that is now part of Adobe. Dave has a degree in theoretical computer science from MIT.

Related Posts
Operational Database Management Systems. Interview with Nick Heudecker, ODBMS Industry Watch December 16, 2013

Follow ODBMS.org on Twitter: @odbmsorg


On Big Data. Interview with Adam Kocoloski
http://www.odbms.org/blog/2013/11/on-big-data-interview-with-adam-kocoloski/
5 November 2013

” The pace that we can generate data will outstrip our ability to store it.
I think you’ll soon see data scientists emphasizing the ability to make decisions on data before storing it ” –Adam Kocoloski.

I have interviewed Adam Kocoloski, Founder & CTO of Cloudant.


Q1. What can we learn from physics when managing and analyzing big data for the enterprise?

Adam Kocoloski: The growing body of data collected in today’s Web applications and sensor networks is a potential goldmine for businesses. But modeling transactions between people and causality between events becomes challenging at large scale, and traditional enterprise systems like data warehousing and business intelligence are too cumbersome to extract value fast enough.

Physicists are natural problem solvers, equipped to think through what tools will work for particular data challenges. In the era of big data, these challenges are growing increasingly relevant, especially to the enterprise.

In a way, physicists have it easier. Analyzing isolated particle collisions translated well to distributed university research systems and parallel models of computing. In other ways, we have shared the challenge of filtering big data to find useful information. In my physics work, we addressed this problem with blind analysis and machine learning. I think you’ll soon see those practices emerge in the field of enterprise data analysis.

Q2. How do you see data science evolving in the near future?

Adam Kocoloski: The pace that we can generate data will outstrip our ability to store it. I think you’ll soon see data scientists emphasizing the ability to make decisions on data before storing it.

The sheer volume of data we’re storing is a factor, but what’s more interesting is the shift toward the distributed generation of data — data from mobile devices, sensor networks, and the coming “Internet of Things.” It’s easy for an enterprise to stand up Hadoop in its own data center and start dumping data into it, especially if it plans to sort out the valuable parts later. It’s not so easy when it’s large volumes of operational data generated in a distributed system. Machine learning algorithms that can recognize and store only the useful patterns can help us better deal with the deluge.

As physicists, we learned that the way big data is headed, there’s no way we’ll be able to keep writing it all down. That’s the tradeoff today’s data scientists must learn: right when you collect the data, you need to make decisions on throwing it away.

Q3. In your opinion, given the current available Big Data technologies, what is the most difficult challenge in filtering big data to find useful information?

Adam Kocoloski: Cloudant is an operational data store and not a big data or offline analytics platform like Hadoop. That means we deal with mutable data that applications are accessing and changing as they run.

From my physics experience, the most difficult big data challenge I’ve seen is the lack of accurate simulations for machine learning. For me, that meant simulations of the STAR particle detector at Brookhaven National Lab’s Relativistic Heavy Ion Collider (RHIC).

People use machine learning algorithms in many fields, and they don't always understand the caveats of building an appropriate training data set. It's easy to apply training data without fully understanding how the process works. If they do that, they won't realize when they've trained their machine learning algorithms inappropriately.

Slicing data from big data sets is great, but at a certain point it becomes a black box that makes it hard to understand what is and what isn’t working well in your analysis. The bigger the data, the more it’s possible for one variable to be related to others in nonlinear ways. This problem makes it harder to reason about data, placing more demands on data scientists to build training data sets using a balanced combination of linear and nonlinear techniques.

Q4. Could you please explain why blind analyses is important for Big Data?

Adam Kocoloski: Humans are naturally predisposed to find signals. It’s an evolutionary trait of ours. It’s better if we recognize the tiger in the jungle, even if there really isn’t one there. If we see a bump in a distribution of data, we do what we can to tease it out. We bias ourselves that way.
So when you do a blind analysis, you hopefully immunize yourself against that bias.

Data scientists are people too, and with big data, they can’t become overly reliant on data visualization. It’s too easy for us to see things that aren’t really there. Instead of seeking out the signals within all that data, we need to work on recognizing the noise — the data we don’t want — so we can inversely select the data we want to keep.

Q5. Is machine learning the right way to analyze Big Data?

Adam Kocoloski: Machine learning offers the possibility to improve the signal-to-noise ratio beyond what any manually constructed analysis can do.
The potential is there, but you have to balance it with the need to understand the training data set. It’s not a panacea. Algorithms have weak points. They have places where they fail. When you’re applying various machine-learning analyses, it’s important that you understand where those weak points are.

Q6. The past year has seen a renaissance in NewSQL. Will transactions ultimately spell the end of NoSQL databases?

Adam Kocoloski: No — 1) because there’s a wide, growing class of problems that don’t require transactional semantics and 2) mobile computing makes transactions at large scale technically infeasible.

Applications like address books, blogs, or content management systems can store a wide variety of data and largely do not require a high degree of transactional integrity. Using systems that inherently enforce schemas and row-level locking — like a relational database management system (RDBMS) — unnecessarily over-complicates these applications.

It’s widely thought that the popularity of NoSQL databases was due to the inability of relational databases to scale horizontally. If NewSQL databases can provide transactional integrity for large, distributed databases and cloud services, does this undercut the momentum of the NoSQL movement? I argue that no, it doesn’t, because mobile computing introduces new challenges (e.g. offline application data and database sync) that fundamentally cannot be addressed in transactional systems.

It’s unrealistic to lock a row in an RDBMS when a mobile device that’s only occasionally connected could introduce painful amounts of latency over unreliable networks. Add that to the fact that many NoSQL systems are introducing new behaviors (strong-consistency, multi-document transactions) and strategies for approximating ACID transactions (event sourcing) — mobile is showing us that we need to rethink the information theory behind it.

Q7. What is the technical role that CouchDB clustering plays for Cloudant’s distributed data hosting platform?

Adam Kocoloski: At Cloudant, clustering allows us to take one logical database and partition that database for large scale and high availability.
We also store redundant copies of the partitions that make up that database, and to our customers, it all looks and operates like one logical database. CouchDB's interface naturally lends itself to this underlying clustering implementation, and it is one of the many technologies we have used to build Cloudant's managed database service.
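
The shape of that scheme fits in a few lines. The toy below (not BigCouch's actual code) hashes a document ID to one of Q partitions and places N copies of each partition on the cluster's nodes, which is how one logical database spreads across machines while surviving node failures:

```python
import hashlib

Q = 8                                    # partitions per logical database
N = 3                                    # redundant copies per partition
nodes = ["node%d" % i for i in range(6)]

def partition(doc_id):
    digest = hashlib.md5(doc_id.encode()).hexdigest()
    return int(digest, 16) % Q

def replicas(doc_id):
    # place the N copies of the partition on consecutive nodes
    p = partition(doc_id)
    return [nodes[(p + i) % len(nodes)] for i in range(N)]

print(partition("user:42"), replicas("user:42"))
```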

Cloudant is built to be more than just hosted CouchDB. Along with CouchDB, open source software projects like HAProxy, Lucene, Chef, and Graphite play a crucial role in running our service and managing the experience for customers. Cloudant is also working with organizations like the Open Geospatial Consortium (OGC) to develop new standards for working with geospatial data sets.

That said, the semantics of CouchDB replication — if not the actual implementation itself — are critical to Cloudant’s ability to synchronize individual JSON documents or entire database partitions between shard copies within a single cluster, between clusters in the same data center, and between data centers across the globe. We’ve been able to horizontally scale CouchDB and apply its unique replication abilities on a much larger scale.

Q8. Cloudant recently announced the merging of its distributed database platform into the Apache CouchDB project. Why? What are the benefits of such integration?

Adam Kocoloski: We merged the horizontal scaling and fault-tolerance framework we built in BigCouch into Apache CouchDB™. The same way Cloudant has applied CouchDB replication in new ways to adapt the database for large distributed systems, Apache CouchDB will now share those capabilities.

Previously, the biggest knock on CouchDB was that it couldn’t scale horizontally to distribute portions of a database across multiple machines. People saw it as a monolithic piece of software, only fit to run on a single server. That is no longer the case.

Obviously new scalability features are good for the Apache project, and a healthy Apache CouchDB is good for Cloudant. The open source community is an excellent resource for engineering talent and sales leads. Our contribution will also improve the quality of our code. Having more of it out there in live deployment will only increase the velocity of our development teams. Many of our engineers wear multiple hats — as Cloudant employees and Apache CouchDB project committers. With the code merger complete, they’ll no longer have to maintain multiple forks of the codebase.

Q9. Will there be two offerings of the same Apache CouchDB: one from Couchbase and one from Cloudant?

Adam Kocoloski: No. Couchbase has distanced itself from the Apache project. Their product, Couchbase Server, is no longer interface-compatible with Apache CouchDB and has no plans to become so.

Adam Kocoloski, Founder & CTO of Cloudant.
Adam is an Apache CouchDB developer and one of the founders of Cloudant. He is the lead architect of BigCouch, a Dynamo-flavored clustering solution for CouchDB that serves as the core of Cloudant’s distributed data hosting platform. Adam received his Ph.D. in Physics from MIT in 2010, where he studied the gluon’s contribution to the spin structure of the proton using a motley mix of server farms running Platform LSF, SGE, and Condor. He and his wife Hillary are the proud parents of two beautiful girls.

Related Posts

Data Analytics at NBCUniversal. Interview with Matthew Eric Bassett. September 23, 2013

On NoSQL. Interview with Rick Cattell. August 19, 2013

On Big Data Analytics. Interview with David Smith. February 27, 2013


"NoSQL Failover Characteristics: Aerospike, Cassandra, Couchbase, MongoDB" (.pdf), by Denis Nelubin and Ben Engber, Thumbtack Technology, 2013

"Ultra-High Performance NoSQL Benchmarking: Analyzing Durability and Performance Tradeoffs" (.pdf), by Denis Nelubin and Ben Engber, Thumbtack Technology, 2013

Follow us on Twitter: @odbmsorg

On geo-distributed data management. Interview with Adam Abrevaya
http://www.odbms.org/blog/2013/10/on-geo-distributed-data-management-interview-with-adam-abrevaya/
19 October 2013

“Geo-distribution is the ability to distribute a single, logical SQL/ACID database that delivers transactional consistency across multiple datacenters, cloud provider regions, or a hybrid” — Adam Abrevaya.

I have interviewed Adam Abrevaya, Vice President of Engineering, NuoDB.


Q1. You just launched NuoDB 2.0, what is special about it?

Adam Abrevaya: NuoDB Blackbirds Release 2.0 demonstrates a strong implementation of the NuoDB vision. It includes over 200 new features and improvements, making it even more stable and reliable than previous versions.
We have improved migration tools, included Java stored procedures, introduced powerful automated administration, made enhancements to core geo-distribution functionality, and more.

Q2. You offer a feature called geo-distribution. What is it and why is it useful?

Adam Abrevaya: Geo-distribution is the ability to distribute a single, logical SQL/ACID database that delivers transactional consistency across multiple datacenters, cloud provider regions, or a hybrid.

NuoDB’s geo-distributed data management lets customers build an active/active, highly-responsive database for high availability and low latency. By bringing the database closer to the end user, we can enable faster responses while simultaneously eliminating the time spent on complex tasks like replication, backup and recovery schemes.

One of the most exciting aspects of the Release 2.0 launch was the discussion about a major deployment of NuoDB Geo-Distribution by a customer. We were very excited to include Cameron Weeks, CEO and Co-Founder of Fathom Voice, talking about the challenges his company was facing—both managing his existing business and cost-effectively expanding globally. After a lengthy evaluation of alternative technologies, he found that NuoDB's distributed database was the only one that met his needs.

Q3. NuoDB falls broadly into the category of NewSQL databases, but you say that you are also a distributed database and that your architecture is fundamentally different than other databases out there. What’s different about it?

Adam Abrevaya: Yes, we are a NewSQL database and we offer the scale-out performance typically associated with NoSQL solutions, while still maintaining the safety and familiarity of SQL and ACID guarantees.

Our architecture, envisioned by renowned data scientist Jim Starkey, is based on what we call "On-demand Replication". We have an architecture whitepaper (registration required) which provides all the technical differentiators of our approach.

Q4. NuoDB is SQL compliant, and you claim that it scales elastically. But how do you handle complex join operations on data sets that are geographically distributed and at the same time scale (in) (out)?

Adam Abrevaya: NuoDB can have transactions that work against completely different on-demand caches.
For example, you can have OLTP transactions running in 9 Amazon AWS regions, each working on a subset of the overall database. Separately, there can be on-demand caches that can be dedicated to queries across the entire data set. NuoDB manages these on-demand ACID-compliant caches with very different use cases automatically without impact to the critical end user OLTP operations.

Q5. What is special about NuoDB with respect to availability? Several other NoSQL data stores are also resilient to infrastructure and partition failures.

Adam Abrevaya: First off, NuoDB offers a distributed SQL database system that provides all the ACID guarantees you expect from a relational database. We scale out like NoSQL databases and offer support for handling independent failures at each level of our architecture. Redundant processes take over for failed processes (due to machine or other failures), and we make it easy for new machines and processes to be brought online and added to the overall database dynamically. Applications that make use of the typical facilities for building an enterprise application will automatically reconnect to surviving processes in our system. We can detect network partition failures and allow the application to take appropriate measures.

Q6 How are some of your customers using NuoDB?

Adam Abrevaya: We are seeing a number of common uses of NuoDB among our customers. These range from startups building new web-facing solutions, to geo-distributed SaaS applications, to ISVs moving existing apps to the cloud, to all sorts of other apps that hit the performance wall with MySQL and other traditional DBMSs. Ultimately, with lots of replication, sharding, new server hardware, etc., customers can use traditional databases to scale out or up, but at a very high cost in time and money, and usually by giving up transactional guarantees. One customer said he decided to look at alternatives to MySQL simply because he was spending so much time in meetings talking about how to get it to do what they needed it to do. He added up the cost of the man-hours and said "migrate."

As I mentioned already, Fathom Voice, a SaaS provider offering VoIP, conference bridging, receptionist services and some innovative communications apps, had a global deployment challenge: how to get the database near their globe-trotting customers, reduce latency, and ensure redundancy. They are one of many customers and prospects tackling these issues.

Adam Abrevaya, Vice President of Engineering, NuoDB
Adam has been building and managing world-class engineering teams and products for almost two decades. His passion is around building and delivering high-performance core infrastructure products that companies depend on to build their businesses.

Adam started his career at MIT Lincoln Laboratory where he developed a distributed platform and image processing algorithms for detecting dangerous weather patterns in radar images. The system was deployed at several airports around the country.

From there, Adam joined Object Design and held various senior management positions where he was responsible for managing several major releases of ObjectStore (an Object database) along with spearheading the development team building XML products that included: Stylus Studio, an XML database, and a Business Process Manager.

Adam joined Pantero Corporation as VP of Development where he developed a revolutionary Semantic Data Integration product. Pantero was eventually sold to Progress Software.

From Pantero, Adam joined m-Qube to manage and build the team creating its Mobile Messaging Gateway platform. The m-Qube platform is a carrier-grade product that has become the leading Mobile Messaging Gateway in North America and generated billions of dollars in revenue. Adam continued managing the m-Qube platform, along with expanded roles, after the technology was acquired by VeriSign and later Mobile Messenger.


Related Posts

On Big Data and NoSQL. Interview with Renat Khasanshyn. October 7, 2013

On NoSQL. Interview with Rick Cattell. August 19, 2013


Download NuoDB Pro Edition (Registration required) (NuoDB Blackbirds Release 2.0)

ODBMS.org free resources on
Relational Databases, NewSQL, XML Databases, RDF Data Stores:
Blog Posts | Free Software | Articles and Presentations | Lecture Notes | Tutorials | Journals

Follow ODBMS.org on Twitter: @odbmsorg


On Hybrid Relational Databases. Interview with Kingsley Uyi Idehen http://www.odbms.org/blog/2013/05/on-hybrid-relational-databases-interview-with-kingsley-uyi-idehen/ http://www.odbms.org/blog/2013/05/on-hybrid-relational-databases-interview-with-kingsley-uyi-idehen/#comments Mon, 13 May 2013 06:52:11 +0000 http://www.odbms.org/blog/?p=2260

“The only obstacle to Semantic Web technologies in the enterprise lies in better articulation of the value proposition in a manner that reflects the concerns of enterprises. For instance, the non disruptive nature of Semantic Web technologies with regards to all enterprise data integration and virtualization initiatives has to be the focal point”
–Kingsley Uyi Idehen.

I have interviewed Kingsley Idehen, founder and CEO of OpenLink Software. The main topics of this interview are the Semantic Web and the Virtuoso Hybrid Data Server.


Q1. The vision of the Semantic Web is one where web pages contain self-describing data that machines can navigate as easily as humans do now. What are the main benefits? Who could profit most from the Semantic Web?

Kingsley Uyi Idehen: The vision of a Semantic Web is actually the vision of the Web. Unbeknownst to most, they are one and the same. The goal was always to have HTTP URIs denote things, and by implication, said URIs basically resolve to their meaning [1] [2].
Paradoxically, the Web bootstrapped on the back of URIs that denoted HTML documents (due to Mosaic’s ingenious exploitation of the “view source” pattern [3]) thereby accentuating its Web of hyper-linked Documents (i.e., Information Space) aspect while leaving its Web of hyper-linked Data aspect somewhat nascent.
The nascence of the Web of hyper-linked Data (aka Web of Data, Web of Linked Data, etc.) laid the foundation for the "Semantic Web Project", which naturally evolved into "The Semantic Web" meme. Unfortunately, "The Semantic Web" meme hit a raft of issues (many self-inflicted) that basically disconnected it from its original Web vision and architectural reality.
The Semantic Web is really about the use of hypermedia to enhance the long-understood entity relationship model [4] via the incorporation of _explicit_ machine- and human-comprehensible entity relationship semantics via the RDF data model. Basically, RDF is just an enhancement to the entity relationship model that leverages URIs for denoting entities and relations, which are described using subject->predicate->object based proposition statements.
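To make the subject->predicate->object model concrete, here is a minimal sketch using the Python rdflib library; the namespace and entities below are illustrative, not drawn from the interview:

```python
# Minimal RDF sketch with rdflib (illustrative names; assumes
# `pip install rdflib`).
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import FOAF

EX = Namespace("http://example.org/")

g = Graph()
# Each statement is a subject -> predicate -> object proposition,
# with URIs denoting the entities and the relation.
g.add((EX.alice, FOAF.knows, EX.bob))           # entity-to-entity relation
g.add((EX.alice, FOAF.name, Literal("Alice")))  # entity-to-value attribute

# The same graph can be serialized in any RDF notation, e.g. Turtle.
print(g.serialize(format="turtle"))
```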
For the rest of this interview, I would encourage readers to view “The Semantic Web” phrase as meaning: a Web-scale entity relationship model driven by hypermedia resources that bear entity relationship model description graphs that describe entities and their relations (associations).

To answer your question, the benefits of the Semantic Web are as follows: fine-grained access to relevant data on the Web (or private Web-like networks) with increasing degrees of serendipity [5].

Q2. Who is currently using Semantic Web technologies and how? Could you please give us some examples of current commercial projects?

Kingsley Uyi Idehen: I wouldn't use "project" to describe endeavors that exploit Semantic Web oriented solutions. Basically, you have entire sectors being redefined by this technology. Examples range from "Open Government" (US, UK, Italy, Spain, Portugal, Brazil, etc.) all the way to publishing (BBC, Globo, Elsevier, New York Times, Universal, etc.), across to pharmaceuticals (OpenPHACTs, St. Judes, Mayo, etc.) and automobiles (Daimler Benz, Volkswagen, etc.). The Semantic Web isn't an embryonic endeavor deficient in use cases and case studies, far from it.

Q3. Virtuoso is a Hybrid RDBMS/Graph Column store. How does it differ from relational databases and from XML databases?

Kingsley Uyi Idehen: First off, we really need to get the definitions of databases clear. As you know, the database management technology realm is vast. For instance, there isn't any such thing as a non-relational database.
Such a system would be utterly useless, beyond offering a comprehensible-sounding label to a marginally engaged audience. A relational database management system is typically implemented with support for a relational model oriented query language, e.g., SQL, QUEL, OQL (from the Object DBMS era), and more recently SPARQL (for RDF oriented databases and stores). Virtuoso is comprised of a relational database management system that supports SQL, SPARQL, and XQuery. It is optimized to handle data organized as relational tables and/or relational property graphs (aka entity relationship graphs). Thus, Virtuoso is about providing you with the ability to exploit the intensional (open world propositions or claims) and extensional (closed world statements of fact) aspects of relational database management without imposing either on its users.

Q4. Is there any difference with Graph Data stores such as Neo4j?

Kingsley Uyi Idehen: Yes, as per my earlier answer, it is a hybrid relational database server that supports relational tables and entity relationship oriented property graphs. Its support for RDF's data model enables the use of URIs as native types. Thus, every entity in a Virtuoso DBMS is endowed with a URI as its _super key_. You can de-reference the description of a Virtuoso entity from anywhere on a network, subject to data access policies and resource access control lists.

Q5. How do you position Virtuoso with respect to NoSQL (e.g Cassandra, Riak, MongoDB, Couchbase) and to NewSQL (e.g.NuoDB, VoltDB)?

Kingsley Uyi Idehen: Virtuoso is a SQL, NoSQL, and NewSQL offering. Its URI based _super keys_ capability differentiates it from other SQL, NewSQL, and NoSQL relational database offerings, in the most basic sense. Virtuoso isn't a data silo, because its keys are URI based. This is a "deceptively simple" claim that is very easy to verify and understand. All you need is a Web browser to prove the point, i.e., a Virtuoso _super key_ can be placed in the address bar of any browser en route to exposing a hypermedia based entity relationship graph that is navigable using the Web's standard follow-your-nose pattern.
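As an illustrative sketch of that follow-your-nose pattern, the same de-referencing a browser does can be scripted: fetch an entity URI and negotiate for an RDF representation. The DBpedia URI below is just a convenient public example:

```python
# De-reference a Linked Data URI with content negotiation
# (assumes the `requests` package and a reachable endpoint).
import requests

uri = "http://dbpedia.org/resource/Berlin"
resp = requests.get(
    uri,
    headers={"Accept": "text/turtle"},  # ask for RDF, not HTML
    allow_redirects=True,               # follow the redirect to the data document
    timeout=30,
)
print(resp.status_code)
print(resp.text[:500])  # start of the entity description graph
```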

Q6. RDF can be encoded in various formats. How do you handle that in Virtuoso?

Kingsley Uyi Idehen: Virtuoso supports all the major syntax notations and data serialization formats associated with the RDF data model. This implies support for N-Triples, Turtle, N3, JSON-LD, RDF/JSON, HTML5+Microdata, (X)HTML+RDFa, CSV, OData+Atom, OData+JSON.

Q7. Does Virtuoso restrict the contents to triples?

Kingsley Uyi Idehen: Assuming you mean: how does it enforce integrity constraints on triple values?
It doesn't enforce anything per se, since the principle here is "schema last", whereby you don't have a restrictive schema acting as an inflexible view over the data (as is the case with conventional SQL relational databases). Of course, an application can apply reasoning to OWL (Web Ontology Language) based relation semantics (i.e., in the so-called RBox) as an option for constraining entity types that constitute triples. In addition, we will soon be releasing a SPARQL Views mechanism that provides a middle ground for this matter, whereby the aforementioned view can be used in a loosely coupled manner at the application, middleware, or DBMS layer for applying constraints to entity types that constitute relations expressed by RDF triples.

Q8. RDF can be represented as a directed graph. Graphs, as a data structure, do not scale well. How do you handle scalability in Virtuoso? How do you handle scale-out and scale-up?

Kingsley Uyi Idehen: The fundamental mission statement of Virtuoso has always been to destroy any notion of performance and scalability as impediments to entity relationship graph model oriented database management. The crux of the matter is that Virtuoso is massively scalable for the following reasons:
• fine-grained multi-threading scoped to CPU cores
• vectorized (array) execution of query commands across fine-grained threads
• column-store based physical storage which provides storage layout and data compaction optimizations (e.g., key compression)
• share-nothing clustering that scales from multiple instances (leveraging the items above) on a single machine all the way up to a cluster comprised of multiple machines.
The scalability prowess of Virtuoso is clearly showcased via live Web instances such as DBpedia and the LOD Cloud Cache (50+ billion triples). You also have no shortage of independent benchmark reports to complement the live instances:
50 – 150 Billion scale Berlin SPARQL Benchmark (BSBM) report (.pdf)

Q9. Could you give us some commercial examples where Virtuoso is in use?

Kingsley Uyi Idehen: Elsevier, Globo, St. Judes Medical, U.S. Govt., and the EU are a tiny snapshot of entities using Virtuoso on a commercial basis.

Q10. Do you plan in the near future to develop integration interfaces to other NoSQL data stores?

Kingsley Uyi Idehen: If a NewSQL or NoSQL store supports any of the following, its integration with Virtuoso is implicit: HTTP based RESTful interaction patterns, SPARQL, ODBC, JDBC, ADO.NET, OLE-DB. In the very worst of cases, we have to convert the structured data returned into 5-Star Linked Data using Virtuoso's in-built Linked Data middleware layer for heterogeneous data virtualization.

Q11. Virtuoso supports SPARQL. SPARQL is not SQL, so how do you handle querying relational data?

Kingsley Uyi Idehen: Virtuoso supports SPARQL, SQL, SQL inside SPARQL, and SPARQL inside SQL (we call this SPASQL). Virtuoso has always had its own native SQL engine, and that's integral to the entire product. Virtuoso provides an extremely powerful and scalable SQL engine, as exemplified by the fact that the RDF data management services are basically driven by the SQL engine subsystem.
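A minimal sketch of what SPASQL looks like from a client, assuming a Virtuoso ODBC data source; the DSN name and credentials here are placeholders, not taken from the interview:

```python
# SPASQL sketch: a SPARQL query sent down an ordinary SQL/ODBC channel.
# Assumes `pip install pyodbc` and a DSN configured for Virtuoso.
import pyodbc

conn = pyodbc.connect("DSN=VirtuosoLocal;UID=dba;PWD=dba")  # placeholder DSN
cur = conn.cursor()
# Prefixing the statement with SPARQL hands it to the SPARQL engine,
# while rows come back through the normal SQL result-set path.
cur.execute("SPARQL SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5")
for row in cur.fetchall():
    print(row)
conn.close()
```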

Q12. How do you support Linked Open Data? What, in your opinion, are the main benefits of Linked Open Data?

Kingsley Uyi Idehen: Virtuoso enables you to expose data from the following sources, courtesy of its in-built 5-star Linked Data Deployment functionality:
• RDF based triples loaded from Turtle, N-Triples, RDF/XML, CSV etc. documents
• SQL relational databases via ODBC or JDBC connections
• SOAP based Web Services
• Web Services that provide RESTful interaction patterns for data access.
• HTTP accessible document types e.g., vCard, iCalendar, RSS, Atom, CSV, and many others.

Q13. What are the most promising application domains where you can apply triple store technology such as Virtuoso?

Kingsley Uyi Idehen: Any application that benefits from high-performance and scalable access to heterogeneously shaped data across disparate data sources: Healthcare, Pharmaceuticals, Open Government, Privacy enhanced Social Web and Media, Enterprise Master Data Management, Big Data Analytics, etc.

Q14. Big Data Analysis: could you connect Virtuoso with Hadoop? How does Virtuoso relate to commercial data analytics platforms, e.g., Hadapt, Vertica?

Kingsley Uyi Idehen: You can integrate data managed by Hadoop based ETL workflows via ODBC, or via Web Services driven by Hadoop clusters that expose RESTful interaction patterns for data access. As for how Virtuoso relates to the likes of Vertica re. analytics, this is about Virtuoso being the equivalent of Vertica plus the added capability of RDF based data management, Linked Data Deployment, and share-nothing clustering. There is no job that Vertica performs that Virtuoso can't perform.
There are several jobs that Virtuoso can perform that Vertica, VoltDB, Hadapt, and many other NoSQL and NewSQL systems simply cannot perform with regards to scalable, high-performance RDF data management and Linked Data deployment. Remember, RDF based Linked Data is all about data management and data access without any kind of platform lock-in. Virtuoso locks you into a value proposition (performance and scale), not the platform itself.

Q15. Do you also benchmark loading trillions of RDF triples? Do you have current benchmark results? How much time does it take to query them?

Kingsley Uyi Idehen: As per my earlier responses, there is no shortage of benchmark material for Virtuoso.
The benchmarks are also based on realistic platform configurations, unlike the RDBMS patterns of the past, which compromised the utility of TPC benchmarks.

Q16. In your opinion, what are the main current obstacles for the adoption of Semantic Web technologies in the Enterprise?

Kingsley Uyi Idehen: The only obstacle to Semantic Web technologies in the enterprise lies in better articulation of the value proposition in a manner that reflects the concerns of enterprises. For instance, the non disruptive nature of Semantic Web technologies with regards to all enterprise data integration and virtualization initiatives has to be the focal point.


[1] 5-Star Linked Data URIs and Semiotic Triangle
[2] What do HTTP URIs Identify?
[3] View Source Pattern & Web Bootstrap
[4] Unified View of Data using the Entity Relationship Model (Peter Chen's 1976 dissertation)
[5] Serendipitous Discovery Quotient (SDQ)

Kingsley Idehen is the Founder and CEO of OpenLink Software. He is an industry acclaimed technology innovator and entrepreneur in relation to technology and solutions associated with data management systems, integration middleware, open (linked) data, and the semantic web.


Kingsley has been at the helm of OpenLink Software for over 20 years during which he has actively provided dual contributions to OpenLink Software and the industry at large, exemplified by contributions and product deliverables associated with: Open Database Connectivity (ODBC), Java Database Connectivity (JDBC), Object Linking and Embedding (OLE-DB), Active Data Objects based Entity Frameworks (ADO.NET), Object-Relational DBMS technology (exemplified by Virtuoso), Linked (Open) Data (where DBpedia and the LOD cloud are live showcases), and the Semantic Web vision in general.


50 – 150 Billion scale Berlin SPARQL Benchmark (BSBM) report (.pdf)

History of Virtuoso

ODBMS.org free resources on: Relational Databases, NewSQL, XML Databases, RDF Data Stores

Related Posts

Graphs vs. SQL. Interview with Michael Blaha April 11, 2013

MySQL-State of the Union. Interview with Tomas Ulin. February 11, 2013

Follow ODBMS Industry Watch on Twitter: @odbmsorg


Acquiring Versant –Interview with Steve Shine. http://www.odbms.org/blog/2013/03/acquiring-versant-interview-with-steve-shine/ http://www.odbms.org/blog/2013/03/acquiring-versant-interview-with-steve-shine/#comments Wed, 06 Mar 2013 17:26:21 +0000 http://www.odbms.org/blog/?p=2096 "So the synergies in data management come not from how the systems connect but how the data is used to derive business value" –Steve Shine.

On Dec. 21, 2012, Actian Corp. announced the completion of the transaction to buy Versant Corporation. I have interviewed Steve Shine, CEO and President, Actian Corporation.


Q1. Why acquire an object-oriented database company such as Versant?

Steve Shine: Versant Corporation, like us, has a long pedigree in solving complex data management problems in some of the world's largest organisations. We see many synergies in bringing the two companies together. The most important of these is that together we are able to invest more resources in helping our customers extract even more value from their data. Our direct clients will have a larger product portfolio to choose from, our partners will be able to expand in adjacent solution segments, and strategically we arm ourselves with the skills and technology to fulfil our plans to deliver innovative solutions in the emerging Big Data market.

Q2. For the enterprise market, Actian offers its legacy Ingres relational database. Versant, on the other hand, offers an object-oriented database, especially suited for complex science/engineering applications. How does this fit? Do you have a strategy on how to offer a set of support processes and related tools for the enterprise? If yes, how?

Steve Shine: While the two databases may not have a direct logical connection at client installations, we recognise that most clients use these two products as part of larger, more holistic solutions to support their operations. The data they manage is the same and interacts to solve business issues – for example, object stores to manage the relationships between entities, transactional systems to manage clients and the supply chain, and analytic systems to monitor and tune operational performance – different systems using the same underlying data to drive a complex business.

We plan to announce a vision of an integrated platform designed to help our clients manage all their data and its complex interactions, both internal and external, so they can not only focus on running their business, but better exploit the incremental opportunity promised by Big Data.

Q3. Bernhard Woebker, president and chief executive officer of Versant stated, “the combination of Actian and Versant provides numerous synergies for data management”. Could you give us some specific examples of such synergies for data management?

Steve Shine: Here is a specific example of what I mean by helping clients extract more value from data, in the Telco space. These types of incremental opportunities exist in every vertical we have looked at.

An OSS (operations support system) in a Telco today may use an object store to manage the complex relationships between the data, while the same data is used in a relational store to monitor, control and manage the telephone network.

Another relational store using variants of the same data manages the provisioning, billing and support for the users of the network. The whole data set in Analytical stores is used to monitor and optimise performance and usage of the network.

Fast forwarding to today, the same data used in more sophisticated ways has allowed voice and data networks to converge to provide a seamless interface to mobile users. As a result, Telcos have tremendous incremental revenue opportunities BUT only if they can exploit the data they already have in their networks. For example: the data on their networks has allowed for a huge increase in location-based services; knowledge and analysis of the data content has allowed providers to push targeted advertising and other revenue-earning services at their users; and turning the phone into a common billing device captures an even greater share of the service providers' revenue… You get the picture.

Now imagine other corporations being able to exploit their information in similar ways: Would a retailer benefit from knowing the preferences of who’s in their stores? Would a hedge fund benefit from detecting a sentiment shift for a stock as it happens? Even knowledge of simple events can help organisations become more efficient.
A salesman knowing immediately when a key client raises a support ticket; a product manager knowing what's being asked on discussion forums; a marketing manager knowing a perfect prospect is on the website.

So the synergies in data management come not from how the systems connect but from how the data is used to derive business value. We want to help manage all the data in our customers' organisations and help them drive incremental value from it. That is what we mean by numerous synergies from data management, and we have a vision to deliver it to our customers.

Q4. Actian claims to have more than 10,000 customers worldwide. What is the value proposition of Versant’s acquisition for the existing Actian`s customers?

Steve Shine: I have covered this in the answers above. They get access to a larger portfolio of products and services and we together drive a vision to help them extract greater value from their data.

Q5. Versant claims to have more than 150,000 installations worldwide. How do you intend to support them?

Steve Shine: Actian already runs a 24/7 global support organisation that prides itself on delivering one of the industry's best client satisfaction scores. As far as numbers are concerned, Versant's large user count is in essence driven by only 250 or so very sophisticated large installations, whereas Actian already deals with over 10,000 discrete, mission-critical installations worldwide. So we are confident of maintaining our very high support levels, and the Versant support infrastructure is being integrated into Actian's as we speak.

Q6. Actian is active in the market for big data analytics. How does Versant’s database technology fit into Actian’s big data analytics offerings and capabilities?

Steve Shine: Using the example above, imagine using OSS data to analyse network utilisation, CDRs and billing information to identify pay plans for your most profitable clients.

Now give these clients the ability to take business action on real-time changes in their data. Now imagine being able to do that with an integrated product set from one vendor. We will be announcing the vision behind this strategy this quarter. In addition, the Versant technology gives us additional options for big data solutions, for example visualisation and managing metadata.

Q7. Do you intend to combine or integrate your analytics database Vectorwise with Versant’s database technology (such as Versant JPA)? If yes, how?

Steve Shine: Specific plans for integrating products within the overall architecture have not been formulated. We have a strong philosophy that you should use the best tool for the job, e.g., an OODB for some things, an OLTP RDBMS for others, etc. But the real value comes from being able to perform sophisticated analysis and management across the different data stores. That is part of the work our platform integration efforts are focused on.

Q8. What are the plans for future software developments. Will you have a joint development team or else?

Steve Shine: We will be merging the engineering teams to focus on providing innovative solutions for Big Data under single leadership.

Q9. You have recently announced two partnerships for Vectorwise, with Inferenda and BiBoard. Will you also pursue this indirect channel path also for Versant’s database technology?

Steve Shine: The beauty of the vision we speak of is that our joint partners have a real opportunity to expand their solutions using Actian's broader product set and, for those that are innovative, the opportunity to address new and emerging markets.

Q10. Versant recently developed Versant JPA. Is the Java market important for Actian?

Steve Shine: Yes !

Q11. It is currently a crowded database market: several new database vendors (NoSQL and NewSQL) offering innovative database technology (NuoDB, VoltDB, MongoDB, Cassandra, Couchbase, Riak to name a few), and large companies such as IBM and Oracle, are all chasing the big data market. What is your plan to stand out of the crowd?

Steve Shine: We are very excited about the upcoming announcement on our plans for the Big Data market. We will be happy to brief you on the details closer to the time, but I will say that early feedback from analyst houses like Gartner has confirmed that our solution is very effective and differentiated in helping corporations extract business value from Big Data. On a higher scale, many of the start-ups are going to get a very rude awakening when they find that delivering a database for mission-critical use is much more than speed and scale of technology. Enterprises want world-class 24×7 support service with failsafe resilience and security. Real industry-grade databases take years and many $m's to reach scalable maturity. Most of the start-ups will not make it. Actian is uniquely positioned in being profitable and having delivered industry-grade database innovation while also being singularly focused on data management, unlike the broad, cumbersome and expensive bigger players. We believe value-conscious enterprises will see our maturity and agility as a great strength.

Qx Anything else you wish to add?

Steve Shine: DATA! – What a great thing to be involved in! Endless value, endless opportunities for innovation and no end in sight as far as growth is concerned. I look forward to the next 5 years.


Steve Shine, CEO and President, Actian Corporation.
Steve comes to Actian from Sybase where he was senior vice president and general manager for EMEA, overseeing all operational, sales, financial and human resources in the region for the past three years. While at Sybase, he achieved more than 200 million in revenue and managed 500 employees, charting over 50 percent growth in the Business Intelligence market for Sybase. Prior to Sybase, Steve was at Canadian-based Geac Computer Corporation for ten successful years, helping to successfully turn around two major global divisions for the ERP firm.

Related Posts

Managing Internet Protocol Television Data. — An interview with Stefan Arbanowski. June 25, 2012

On Versant`s technology. Interview with Vishal Bagga. August 17, 2011


Big Data: Principles and best practices of scalable realtime data systems. Nathan Marz (Twitter) and James Warren, MEAP Began: January 2012,Manning Publications.

Analyzing Big Data With Twitter. A special UC Berkeley iSchool course.

-A write-performance improvement of ZABBIX with NoSQL databases and HistoryGluon. MIRACLE LINUX CORPORATION, February 13, 2013

Ultra-High Performance NoSQL Benchmarking: Analyzing Durability and Performance Tradeoffs. Ben Engber, CEO, Thumbtack Technology, JANUARY 2013.

Follow ODBMS.org on Twitter: @odbmsorg

Hadoop and NoSQL: Interview with J. Chris Anderson http://www.odbms.org/blog/2012/09/hadoop-and-nosql-interview-with-j-chris-anderson/ http://www.odbms.org/blog/2012/09/hadoop-and-nosql-interview-with-j-chris-anderson/#comments Wed, 19 Sep 2012 14:35:05 +0000 http://www.odbms.org/blog/?p=1734 “The missing piece of the Hadoop puzzle is accounting for real time changes. Hadoop can give powerful analysis, but it is fundamentally a batch-oriented paradigm.” — J. Chris Anderson.

How is Hadoop related to NoSQL databases? What are the main performance bottlenecks of NoSQL data stores? On these topics I interviewed J. Chris Anderson, co-founder of Couchbase.


Q1. In order to analyze Big Data, the current state of the art is a parallel database or NoSQL data store, with a Hadoop connector.
What about performance issues arising with the transfer of large amounts of data between the two systems? Can the use of connectors introduce delays and data silos, and increase TCO?

Chris Anderson : The missing piece of the Hadoop puzzle is accounting for real time changes. Hadoop can give powerful analysis, but it is fundamentally a batch-oriented paradigm. Couchbase is designed for real time applications (with all the different trade-offs that implies) yet also provides query-ability, so you can see inside the data as it changes.

We are seeing interesting applications where Couchbase is used to enhance the batch-based Hadoop analysis with real time information, giving the effect of a continuous process.
So hot data lives in Couchbase, in RAM (even replicas in RAM for HA fast-failover). You wouldn’t want to keep 3 copies of your Hadoop data in RAM, that’d be crazy.
But it makes sense for your working set.

And this solves the data transfer costs issue you mention, because you essentially move the data out of Couchbase into Hadoop when it cools off.
That’s much easier than maintaining parallel stores, because you only have to copy data from Couchbase to Hadoop as it passes out of the working set.

For folks working on problems like this, we have a Sqoop connector and we’ll be talking about it with Cloudera at our CouchConf in San Francisco on September 21.

Q2. Wouldn’t a united/integrated platform (data store + Hadoop) be a better solution instead?

Chris Anderson : It would be nice to have a unified query language and developer experience (not to mention goodies like automatically pulling data back out of Hadoop into Couchbase when it comes back into the working set). I think everyone recognizes this.

We’ll get there, but in my opinion the primary interface will be via the real time store, and the Hadoop layer will become a commodity. That is why there is so much competition for the NoSQL brass ring right now.

Q3. Could you please explain, in your opinion, the tradeoff between scaling out and scaling up? What does it mean in practice for an application?

Chris Anderson : Scaling up is easier from a software perspective. It’s essentially the Moore’s Law approach to scaling — buy a bigger box. Well, eventually you run out of bigger boxes to buy, and then you’ve run off the edge of a cliff. You’ve got to pray Moore keeps up.

Scaling out means being able to add independent nodes to a system. This is the real business case for NoSQL. Instead of being hostage to Moore’s Law, you can grow as fast as your data. Another advantage to adding independent nodes is you have more options when it comes to matching your workload. You have more flexibility when you are running on commodity hardware — you can run on SSDs or high-compute instances, in the cloud, or inside your firewall.

Q4. James Phillips a year ago said that “it is possible we will see standards begin to emerge, both in on-the-wire protocols and perhaps in query languages, allowing interoperability between NoSQL database technologies similar to the kind of interoperability we’ve seen with SQL and relational database technology.” What is your take now?

Chris Anderson : That hasn’t changed but the industry is still young and everyone is heads-down on things like reliability and operability. Once these products become more mature there will be time to think about standardization.

Q5. There is a scarcity of benchmarks to substantiate the many claims of scalability made by NoSQL vendors. NoSQL data stores do not qualify for the TPC-C benchmark, since they relax ACID transaction properties. How can you then measure and compare the performance of the various NoSQL data stores instead?

Chris Anderson : I agree. Vendors are making a lot of claims about latency, throughput and scalability without much proof. There are a few benchmarks starting to trickle out from various third parties. Cisco and SolarFlare published one on Couchbase (see here) and are putting other vendors through the same tests. I know there will be other third-party benchmarks comparing Couchbase, MongoDB, and Cassandra coming out soon. I think the Yahoo YCSB benchmarks will be another source of good comparisons.
There are bigger differences between vendors than people are aware of and we think many people will be surprised by the results.

Q6. What are in your opinion the main performance bottlenecks for NoSQL data stores?

Chris Anderson : The three classes of bottleneck correspond to the major areas of hardware: network, disk, and memory. Couchbase has historically been very fast at the network layer – it's based on Memcached, which has had a ton of optimizations for interacting with network hardware and protocols. We're essentially as fast as one can get in the network layer, and I believe most NoSQL databases that use persistent socket connections are also free of significant network bottlenecks. So the network is only a bottleneck for REST or other stateless connection models.

Disk is always going to be the slowest component as far as the inherent latencies of non-volatile storage, but any high-performance database will paper over this by using some form of memory caching or memory-based storage. Couchbase has been designed specifically to decouple the disk from the rest of the system. In the extreme, we’ve seen customers survive prolonged disk outages while maintaining availability, as our memory layer keeps on trucking, even when disks become unresponsive. (This was during the big Amazon EBS outage that left a lot of high-profile sites down due to database issues.)

Memory may be the most interesting bottleneck, because it is the source of non-determinism in performance. So if you are choosing a database for performance reasons, you'll want to take a look at how its memory layer is architected.
Is it decoupled from the disk? Is it free of large locks that can pause unrelated queries as the engine modifies in-memory data structures? Over time does it continue to perform, or does the memory layout become fragmented? These are all problems we’ve been working on for a long time at Couchbase, and we are pretty happy with where we stand.

Q7. Couchbase is the result of the merger (more than one year ago) of CouchOne (document store) and Membase (key-value store). How has your product offering changed since the merger?

Chris Anderson : Our product offering hasn't changed a bit since the merger. The current GA product, Couchbase Server 1.8.1, is essentially a continuation of the Membase Server 1.7 product line. It is a key-value database using the binary memcached interface. It's in use all around the web, at Orbitz, LinkedIn, AOL, Zynga, and lots of other companies.

With our upcoming 2.0 release, we are expanding from a key value database to a document database. This means adding query and index support. We’re even including an Elastic Search adapter and experimental geographic indices. In addition, 2.0 adds cross-datacenter replication support so you can provide high-performance access to the data at multiple points-of-presence.

Q8. How do you implement concurrency in Couchbase?

Chris Anderson : Each node is inherently independent, and there are no special nodes, proxies, or gatekeepers. The client drivers running inside your application server connect directly to the data node for a particular item, which gives low latency but also greater concurrency: a given application server will be talking to multiple backend database nodes at any given time.

For the memory layer, we are based on memcached, which has a long history of concurrency optimizations. We support the full memcached feature set, so operations like CAS write, INCR and DECR are available, which is great for concurrent workloads. We’ve also added some extensions for locking, which facilitates reading and updating an object-graph that is spread across multiple keys.
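To illustrate the optimistic concurrency style that CAS enables, here is a self-contained sketch of the check-and-set retry loop; the dict-backed store below is a stand-in for a real memcached-protocol client, not Couchbase's implementation:

```python
# Illustrative CAS (check-and-set) retry loop over a toy in-memory store.
import itertools

class TinyCASStore:
    def __init__(self):
        self._data = {}                 # key -> (cas_token, value)
        self._tokens = itertools.count(1)

    def set(self, key, value):
        self._data[key] = (next(self._tokens), value)

    def gets(self, key):
        return self._data[key]          # value plus its current CAS token

    def cas(self, key, token, value):
        # The write succeeds only if nobody changed the key since `gets`.
        if self._data[key][0] != token:
            return False
        self._data[key] = (next(self._tokens), value)
        return True

store = TinyCASStore()
store.set("counter", 0)
while True:                             # the classic optimistic retry loop
    token, current = store.gets("counter")
    if store.cas("counter", token, current + 1):
        break
```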

At the disk layer, for Couchbase Server 2.0, we are moving away from SQLite, toward our own highly concurrent storage format. We’re append-only, so once data is written to disk, there’s no chance of corruption. The other advantage of tail-append writes is that you can do all kinds of concurrent reads of the file, even while writes are happening. For instance a backup can be as easy as `cp` or `rsync` (although we provide tools to manage backing up an entire cluster).
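A toy append-only log makes the concurrent-read claim easy to see; this sketch is illustrative plain Python, not Couchbase's storage engine:

```python
# Tail-append writes: the writer only ever extends the file, so readers
# (or a plain `cp`) always see a consistent prefix of the log.
import json
import os

LOG = "data.log"

def append_record(record):
    with open(LOG, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
        f.flush()
        os.fsync(f.fileno())             # durable once fsync returns

def read_all():
    with open(LOG, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

append_record({"key": "a", "value": 1})
append_record({"key": "a", "value": 2})  # a new version is appended, never overwritten
print(read_all()[-1])                    # the latest version wins on replay
```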

Q9. Couchbase does not support SQL queries, how do you query your database instead? What are the lessons learned so far from your users?

Chris Anderson : Our incremental indexing system is designed to be native to our JSON storage format. So the user writes JavaScript code to inspect the document, and pick out which data to use as the index key. It’s essentially putting the developer directly in touch with the index data structure, so while it sounds primitive, there is a ton of flexibility to be had there. We’ve got a killer optimization for aggregation operations, so if you’ve ever been burned by a slow GROUP BY query, you might want to take a look.
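The sketch below mimics that map/emit contract in plain Python; Couchbase views are actually written in JavaScript, so this stand-in only illustrates the idea of a map function choosing the index key:

```python
# Illustrative map/emit indexing over JSON-like documents.
docs = [
    {"id": 1, "type": "beer", "name": "Pliny", "abv": 8.0},
    {"id": 2, "type": "brewery", "name": "Russian River"},
    {"id": 3, "type": "beer", "name": "Stout", "abv": 6.5},
]

def map_fn(doc):
    # The map function inspects the document and picks out the index key;
    # documents it doesn't care about emit nothing.
    if doc.get("type") == "beer":
        yield (doc["abv"], doc["name"])

# Emitted keys are kept sorted, which is what makes range queries
# (and aggregations over key ranges) cheap.
index = sorted(kv for doc in docs for kv in map_fn(doc))
print(index)   # [(6.5, 'Stout'), (8.0, 'Pliny')]
```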

Despite all the power, we know users are also looking for more traditional query approaches. We’re working on a few things in this area.
The first one you will see is our Elastic Search integration, which will simplify querying your Couchbase cluster. Elastic Search provides a JSON-style query API, and we’ve already seen many of our users integrate it with Couchbase, so we are building an official adapter to better support this use case.

Q10 How do you handle both structured and unstructured data at scale?

Chris Anderson : At scale, all data is messy. I’ve seen databases in the 4th normal form accumulate messy errors, so a schema isn’t always a protection. At scale, it’s all about discovering the schema, not imposing it.

If you fill a Couchbase cluster with enough tweets, wall posts, and Instagram photos, you’ll start to see that even though these APIs all have different JSON formats, it’s not hard to pick out some things in common. With our flexible indexing system, we see users normalize data after the fact, so they can query heterogeneous data, and have it show up as a result set that is easy for the application to work with.

This fits with the overall model of a document database: rather than try to “model the world” with a relational schema, your aim becomes to “capture the user’s intent” and make sense of it later. When your goal is to scale up to tens of millions of users (in maybe only a few days), the priority becomes to capture the data and provide a compelling high-performance user experience.

Q11. Couchbase is sponsor of the Couchbase Server NoSQL database open source project. How do you ensure participation of the open source developer community and avoid incompatible versions of the system? Are you able, with an open source project, to produce a highly performant system?

Chris Anderson : All our code is available under the Apache 2.0 license, and I'd wager that we've never heard of the majority of our users, much less asked them for money. When someone's business depends on Couchbase, that's when they come to us, so I'm comfortable with the open source business model. It has some distinct advantages over the Oracle-style license model.

The engineering side of me admits that we haven’t always been the best at engendering participation in the Couchbase open source project. A few months ago, if you tried to compile your own copy of Couchbase, e.g. to provide a patch or bug-fix, you’d be on a wild goose chase for days before you got to the fun part. We’ve fixed that now, but it’s worth noting that open-source friction hurts in more ways than one, as smoothing the path for new external contributors also means new engineering hires can get productive on our tool-chain faster.

So I’ve taken a personal interest in our approach to external contributions. The first step is cleaning up the on-ramps: we already have decent docs for contributing, we just need to make them more prominent. The goal is to have a world-class open-source contributor experience, and we don’t take it lightly.

We do have a history of *very open development* which I am proud of. Not only can you see all the patches that make it into the code base, you can see all the patches that don’t make it through code review, and the comments on them as they are happening. Check out here for an example of how to do open development right.

Q12. How do you compare NewSQL databases, claiming to offer both ACID transaction properties and scalability, with Couchbase?

Chris Anderson : The CAP theorem is well-known these days, so I don't need to go into the reasons why NewSQL is an uphill battle. I see NewSQL technologies as primarily aimed at developers who don't want to learn the lessons about data that the first decade of the web taught us. So there will likely be a significant revenue stream based on applications that need to be scaled up, without being re-written.

But we’ve asked our users what they like about NoSQL, and one of the biggest answers, as important to most of them as scalability and performance, was that they felt more productive with a JSON document model than with rigid schemas. Requirements change quickly, and being able to change your application without working through a DBA is a real advantage. So I think NewSQL is really for legacy applications, and the future is with NoSQL document databases.

Q13 No single data store is best for all uses. What are the applications that are best suited for Couchbase, and which ones are not?

Chris Anderson : I got into Couch in the first place because I see document databases as the 80% solution. It might take a few years, but I fully expect the schema-free document model to be ascendant over the relational model for most applications, most of the time. Of course there will still be uses for relational databases, but the majority of applications being written these days could work just fine on a document database, and as developer preferences change, we'll see more and more serious applications running on NoSQL.

So from that angle, you can see that I think Couchbase is a broad solution.
We’ve seen great success in everything from content management systems to ad-targeting platforms, as well as simpler things like providing a high-performance data store for CRUD operations, or more enterprise-focused use cases like offloading query volume from mainframe applications.

A geekier way to answer your question is to talk about the use cases where you pretty much don't have a choice: it's Couchbase or bust. Those use cases are anything where you need consistent sub-millisecond access latency, for instance maybe you have a time-based service level agreement.

If you need consistent high-performance while you are scaling your database cluster, that’s when you need Couchbase.
So for instance social-gaming has been a great vertical for us, since the hallmark of a successful game is that it may go from obscurity to a household name in just a few weeks. Too many of those games crash and burn when they are running on previous-generation technology.
It’s always been possible to build a backend that can handle internet-scale applications, what’s new with Couchbase is the ability to scale from a relatively small backend, to a backend handling hundreds of thousands of operations per second, at the drop of a hat.

Again, the use-cases where it’s Couchbase or bust are a subset of what our users are doing, but they are a great way to illustrate our priorities. We have plenty of users who don’t have high-performance requirements, but they still enjoy the flexibility of a document database.

If you need transactions across multiple data items, with atomic rollback, then you should be using a relational database. If you need foreign-key constraints, you should be using a relational database.

However, before you make that decision, you may want to ask if the performance tradeoffs are worth it for your application. Often there is a way to redesign something to make it less dependent on schemas, and the business benefits from the increased scale and performance you can get from NoSQL may make it a worthwhile tradeoff.


Chris Anderson is a co-founder of Couchbase, focused on the overall developer experience. In the early days of the company he served as CFO and President, but never strayed too far from writing code. His background includes open-source contributions to the Apache Software Foundation as well as many other projects. Before he wrote database engine code, he cut his teeth building applications and spidering the web. These days he gets a kick out of Node.js, playing bass guitar, and enjoying family time.


Related Posts

Measuring the scalability of SQL and NoSQL systems. by Roberto V. Zicari on May 30, 2011

Next generation Hadoop — interview with John Schroeder. by Roberto V. Zicari on September 7, 2012

Integrating Enterprise Search with Analytics. Interview with Jonathan Ellis. by Roberto V. Zicari on April 16, 2012


ODBMS.org resources on:
Big Data and Analytical Data Platforms: Blog Posts | Free Software | Articles | PhD and Master Thesis|

NoSQL Data Stores: Blog Posts | Free Software | Articles, Papers, Presentations| Documentations, Tutorials, Lecture Notes | PhD and Master Thesis

On Eventual Consistency– An interview with Justin Sheehy. http://www.odbms.org/blog/2012/08/on-eventual-consistency-an-interview-with-justin-sheehy/ http://www.odbms.org/blog/2012/08/on-eventual-consistency-an-interview-with-justin-sheehy/#comments Wed, 15 Aug 2012 14:48:32 +0000 http://www.odbms.org/blog/?p=1671 “I would most certainly include updates to my bank account as applications for which eventual consistency is a good design choice. In fact, bankers have understood and used eventual consistency for far longer than there have been computers in the modern sense” –Justin Sheehy.

On the subject of new data models and eventual consistency I interviewed Justin Sheehy, Chief Technology Officer of Basho Technologies.


Q1. What are in your opinion the main differences and similarities of a key-value store (ala Dynamo), a document store (ala MongoDB), and an “extensible record” store (ala Big Table) when using them in practice?

Justin Sheehy: Describing the kv-store, doc store, and column family data models in general is not the same as describing specific systems like Dynamo, MongoDB, and BigTable. I’ll do the former here as I am guessing that is the intention of the question. Since the following couple of questions ask for differences, I’ll emphasize the similarity here.

All three of these data models have two major things in common: values stored in them are not rigidly structured, and are organized mainly by primary key. The details beyond those similarities, and how given systems expose those details, certainly vary. But the flexibility of semi-structured data and the efficiency of primary-key access generally apply to most such systems.

Q2. When is a key-value store particularly well suited and when is a a document store instead preferable? For which kind of applications and for what kind of data management requirements?

Justin Sheehy: The interesting issue with this question is that “document store” is not well-established as having a specific meaning. Certainly it seems to apply to both MongoDB and CouchDB, but those two systems have very different data access semantics. The closest definition I can come up with quickly that covers the prominent systems known as doc stores might be something like “a key-value store which also has queryability of some kind based on the content of the values.”

If we accept that definition then you can happily use a document store anywhere that a key-value store would work, but would find it most worthwhile when your querying needs are richer than simply primary key direct access.

Q3. What is Riak? A key-value store or a document store? What are the main features of the current version of Riak?

Justin Sheehy: Riak was called a key-value store before the current popularity of the "document store" term, but it is certainly a document store by any reasonable definition that I know — such as the one I gave above. In addition to access by primary key, values in Riak can be queried by secondary key, range query, link walking, full-text search, or map/reduce.

Riak has many features, but the core reasons that people come to Riak over other systems are Availability, Scalability, and Predictability. For people whose business demands extremely high availability, easy and linear scalability, or predictable performance over time, Riak is worth a look.

Q4. How do you achieve horizontal scalability? Do you use a “shared nothing” horizontal scaling – replicating and partitioning data over many servers?
What performance metrics do you have for that?

Justin Sheehy: We use a number of techniques to achieve horizontal scalability. Among them is consistent hashing, an approach invented at Akamai and successfully used by many distributed systems since then. This allows for constant time routing to replicas of data based on the hash of the data’s primary key.
Data is partitioned to servers in the cluster based on consistent hashing, and replicated to a configurable number of those servers. By partitioning the data into many "virtual nodes" per host, growth is relatively easy, as new hosts simply (and automatically) take over some of the virtual nodes previously owned by existing cluster hosts.
Yes, in terms of data location Riak is a “shared nothing” system.
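For readers unfamiliar with the technique, here is a minimal consistent-hash ring with virtual nodes; the MD5 hash and the vnode count are arbitrary sketch choices, and Riak's actual ring differs in detail:

```python
# Minimal consistent hashing with virtual nodes (illustrative only).
import bisect
import hashlib

def _h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, hosts, vnodes=8):
        # Each host owns several points ("virtual nodes") on the ring.
        points = sorted((_h(f"{host}#{i}"), host)
                        for host in hosts for i in range(vnodes))
        self._hashes = [p for p, _ in points]
        self._hosts = [host for _, host in points]

    def owner(self, key):
        # Route to the first virtual node clockwise of the key's hash.
        i = bisect.bisect(self._hashes, _h(key)) % len(self._hashes)
        return self._hosts[i]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.owner("user:42"))
# Adding a host only re-homes the keys of the vnodes it takes over,
# which is why growth is relatively easy:
bigger = Ring(["node-a", "node-b", "node-c", "node-d"])
```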
One (of many) demonstrations of this scalability was performed by Joyent here. That benchmark is approximately 2 years old, so various specific numbers are quite outdated, but the important lesson in it remains and is summed up by a graph late in that post: as servers were added, the throughput (as well as the capacity) of the overall system increased linearly.

Q5. How do you handle updates if you do not support ACID transactions? For which applications is this sufficient, and when is it not?

Justin Sheehy: Riak takes more of the “BASE” approach, which has become accepted over the past several years as a sensible tradeoff for high-availability data systems. By allowing consistency guarantees to be a bit flexible during failure conditions, a Riak cluster is able to provide much more extreme availability guarantees than a strictly ACID system.

Q6. You said that Riak takes more of the “BASE” approach. Did you use the definition of eventual consistency by Werner Vogels?
Reproduced here: “Eventual consistency: The storage system guarantees that if no new updates are made to the object, eventually (after the inconsistency window closes) all accesses will return the last updated value”. You would not wish to have an “eventual consistency” update to your bank account. For which class of applications is eventual consistency a good system design choice?

Justin Sheehy: That definition of Eventual Consistency certainly does apply to Riak, yes.

I would most certainly include updates to my bank account as applications for which eventual consistency is a good design choice. In fact, bankers have understood and used eventual consistency for far longer than there have been computers in the modern sense. Traditional accounting is done in an eventually-consistent way and if you send me a payment from your bank to mine then that transaction will be resolved in an eventually consistent way. That is, your bank account and mine will not have a jointly-atomic change in value, but instead yours will have a debit and mine will have a credit, each of which will be applied to our respective accounts.
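The banking analogy can be made concrete with a small illustrative sketch: the transfer is recorded as two independent entries, each applied to its account on its own schedule, and the balances converge once both have been applied:

```python
# Eventually consistent transfer: a debit and a credit recorded
# independently, not one jointly-atomic update (illustrative only).
from collections import defaultdict

entries = []                                  # pending ledger entries
balances = defaultdict(int, {"yours": 100, "mine": 0})

def transfer(src, dst, amount):
    entries.append((src, -amount))            # your bank records a debit
    entries.append((dst, +amount))            # my bank records a credit

def settle():
    # Entries may be applied later and in any order; once all are
    # applied, the end state is the same. That is the "eventually"
    # in eventual consistency.
    while entries:
        account, delta = entries.pop()
        balances[account] += delta

transfer("yours", "mine", 25)
settle()
print(dict(balances))                         # {'yours': 75, 'mine': 25}
```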

This question contains a very commonly held misconception. The use of eventual consistency in well-designed systems does not lead to inconsistency. Instead, such systems may allow brief (but shortly resolved) discrepancies at precisely the moments when the other alternative would be to simply fail.

To rephrase your statement, you would not wish your bank to fail to accept a deposit due to an insistence on strict global consistency.

It is precisely the cases where you care about very high availability of a distributed system that eventual consistency might be a worthwhile tradeoff.

Q7. Why is Riak written in Erlang? What are the implications for the application developers of this choice?

Justin Sheehy: Erlang’s original design goals included making it easy to build systems with soft real-time guarantees and very robust fault-tolerance properties. That is perfectly aligned with our central goals with Riak, and so Erlang was a natural fit for us. Over the past few years, that choice has proven many times to have been a great choice with a huge payoff for Riak’s developers. Application developers using Riak are not required to care about this choice any more than they need to care what language PostgreSQL is written in. The implications for those developers are simply that the database they are using has very predictable performance and excellent resilience.

Q8. Riak is open source. How do you engage the open source community and how do you make sure that no inconsistent versions are generated?

Justin Sheehy: We engage the open source community everywhere that it exists. We do our development in the open on github, and have lively conversations with a wider community via email lists, IRC, Twitter, many in-person venues, and more.
Mark Phillips and others at Basho are dedicated full-time to ensuring that we continue to engage honestly and openly with developer communities, but all of us consider it an essential part of what we do. We do not try to prevent forks. Instead, we are part of the community in such a way that people generally want to contribute their changes back to the central repository. The only barrier we have to merging such code is about maintaining a standard of quality.

Q9. How do you optimize access to non-key attributes?

Justin Sheehy: Riak stores index content in addition to values, encoded by type and in sorted order on disk. A query by index certainly is more expensive than simply accessing a single value directly by key, as the indices are distributed around the cluster — but this also means that the size of the index is not constrained by a single host.

Q10. How do you optimize access to non-key attributes if you do not support indexes in Riak?

Justin Sheehy: We do support indexes in Riak.

Q11 How does Riak compare with a new generation of scalable relational systems (NewSQL)?

Justin Sheehy: The “NewSQL” term is, much like “NoSQL”, a marketing term that doesn’t usefully define a technical category. The primary argument made by NewSQL proponents is that some NoSQL systems have made unnecessary tradeoffs. I personally consider these NewSQL systems to be a part of the greater movement generally dubbed NoSQL despite the seemingly contradictory names, as the core of that movement has nothing to do with SQL — it is about escaping the architectural monoculture that has gripped the commercial database market for the past few decades. In terms of technical comparison, some systems placing themselves under the NewSQL banner are excellent at scalability and performance, but I know of none whose availability and predictability can rival Riak.

Q12. Please give some examples of use cases where Riak is currently in use. Is Riak in use for analyzing Big Data as well?

Justin Sheehy: A few examples of companies relying on Riak in their business can be found here.
While Riak is primarily about highly-available systems with predictable low-latency performance, it does have analytical capabilities as well and many users make use of map/reduce and other such programming models in Riak. By most definitions of “Big Data”, many of Riak’s users certainly fall into that category.
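[Edit: A sketch of such a map/reduce job submitted to Riak’s /mapred HTTP endpoint, using two of Riak’s built-in JavaScript functions. The bucket is hypothetical, and each stored value is assumed to be a plain JSON number.]

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SumJob {
    public static void main(String[] args) throws Exception {
        // Map each value to its JSON contents, then reduce by summation.
        String job = "{\"inputs\":\"orders\","
                + "\"query\":[{\"map\":{\"language\":\"javascript\",\"name\":\"Riak.mapValuesJson\"}},"
                + "{\"reduce\":{\"language\":\"javascript\",\"name\":\"Riak.reduceSum\"}}]}";
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest post = HttpRequest.newBuilder()
                .uri(URI.create("http://127.0.0.1:8098/mapred"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(job))
                .build();
        System.out.println(client.send(post, HttpResponse.BodyHandlers.ofString()).body());
    }
}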

Q13. Anything you wish to add?

Justin Sheehy: Thank you for your interest. We’re not done making Riak great!


Justin Sheehy
Chief Technology Officer, Basho Technologies

As Chief Technology Officer, Justin Sheehy directs Basho’s technical strategy, roadmap, and new research into storage and distributed systems.
Justin came to Basho from the MITRE Corporation, where, as a principal scientist, he managed large research projects for the U.S. Intelligence Community, including such efforts as high-assurance platforms, automated defensive cyber response, and cryptographic protocol analysis.
He was central to MITRE’s development of research on mission assurance against sophisticated threats, the flagship program of which successfully proposed and created methods for building resilient networks of web services.
Before working for MITRE, Justin worked at a series of technology companies, including five years at Akamai Technologies, where he was a senior architect for systems infrastructure, a role that gave him both a broad and a deep background in distributed systems.
Justin was a key contributor to the technology that enabled the fast growth of Akamai’s networks and services while keeping support costs low. He completed both his undergraduate and postgraduate studies in Computer Science at Northeastern University.

Related Posts

On Data Management: Interview with Kristof Kloeckner, GM IBM Rational Software.

On Big Data: Interview with Dr. Werner Vogels, CTO and VP of Amazon.com

ODBMS.org — Free Downloads and Links
In this section you can download resources covering the following topics:
Big Data and Analytical Data Platforms
Cloud Data Stores
Object Databases
NoSQL Data Stores
Graphs and Data Stores
Object-Oriented Programming
Entity Framework (EF) Resources
ORM Technology
Object-Relational Impedance Mismatch
XML and RDF Data Stores


On Big Graph Data. http://www.odbms.org/blog/2012/08/on-big-graph-data/ http://www.odbms.org/blog/2012/08/on-big-graph-data/#comments Mon, 06 Aug 2012 10:41:46 +0000 http://www.odbms.org/blog/?p=1612

“The ultimate goal is to ensure that the graph community is not hindered by vendor lock-in.” –Marko A. Rodriguez.

“There are three components to scaling OLTP graph databases: effective edge compression, efficient vertex-centric query support, and intelligent graph partitioning.” –Matthias Broecheler.

Titan is a new distributed graph database available in alpha release. It is an open source, Apache2-licensed project maintained and funded by Aurelius. To learn more about it, I have interviewed Dr. Marko A. Rodriguez and Dr. Matthias Broecheler, cofounders of Aurelius.


Q1. What is Titan?

MATTHIAS: Titan is a highly scalable OLTP graph database system optimized for thousands of users concurrently accessing and updating one huge graph.

Q2. Who needs to handle graph-data and why?

MARKO: Much of today’s data is composed of a heterogeneous set of “things” (vertices) connected by a heterogeneous set of relationships (edges) — people, events, items, etc. related by knowing, attending, purchasing, etc. The property graph model leveraged by Titan espouses this world view. This world view is not new; the object-oriented community has long had a similar perspective on data.
However, graph-centric data also aligns well with the numerous algorithms and statistical techniques developed in the network science and graph theory communities.
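[Edit: A minimal Java sketch of the property graph model through the Blueprints API that Titan implements, shown here against the in-memory TinkerGraph reference implementation; the vertices, edge label, and properties are illustrative.]

import com.tinkerpop.blueprints.Edge;
import com.tinkerpop.blueprints.Graph;
import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.tg.TinkerGraph;

public class PropertyGraphSketch {
    public static void main(String[] args) {
        Graph g = new TinkerGraph();
        // Heterogeneous "things" become vertices carrying key/value properties...
        Vertex alice = g.addVertex(null);
        alice.setProperty("type", "person");
        alice.setProperty("name", "Alice");
        Vertex conf = g.addVertex(null);
        conf.setProperty("type", "event");
        conf.setProperty("name", "GraphConf 2012");
        // ...and heterogeneous relationships become labeled edges,
        // which may carry properties of their own.
        Edge attended = g.addEdge(null, alice, conf, "attended");
        attended.setProperty("year", 2012);
    }
}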

Q3. What are the main technical challenges when storing and processing graphs?

MATTHIAS: At the interface level, Titan strives to strike a balance between simplicity and control: developers should be able to think in terms of graphs and traversals without having to worry about persistence and efficiency details. This is achieved both by using the Blueprints API and by extending it with methods that allow developers to give Titan “hints” about the graph data. Titan can then exploit these “hints” to ensure performance at scale.

Q4. Graphs are hard to scale. What are the key ideas that make it so that Titan scales? Do you have any performance metrics available?

MATTHIAS: There are three components to scaling OLTP graph databases: effective edge compression, efficient vertex-centric query support, and intelligent graph partitioning.
Edge compression in Titan comprises various techniques for keeping the memory footprint of each edge as small as possible and for storing all edge information in one consecutive block of memory for fast retrieval.
Vertex-centric queries allow users to query for a specific set of edges by leveraging vertex-centric indices and a query optimizer.
Graph partitioning refers to distributing the graph across multiple machines such that frequently co-accessed data is co-located. Graph partitioning is an NP-hard problem, and it is the aspect of Titan where we will see the most improvement in future releases.
The current alpha release focuses on balanced partitioning and multi-threaded parallel traversals for scale.
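[Edit: A sketch of a vertex-centric query expressed through the Blueprints 2 Vertex.query() interface, which Titan can answer from its vertex-centric indices; the edge label and property are illustrative.]

import com.tinkerpop.blueprints.Direction;
import com.tinkerpop.blueprints.Vertex;

public class VertexCentricQuery {
    // Retrieve only the needed slice of a potentially huge adjacency list:
    // outgoing "follows" edges whose "since" property lies in [2011, 2013),
    // rather than scanning every incident edge.
    static Iterable<Vertex> recentFollows(Vertex user) {
        return user.query()
                .direction(Direction.OUT)
                .labels("follows")
                .interval("since", 2011, 2013)
                .vertices();
    }
}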

MARKO: To your question about performance metrics, Matthias and his colleague Dan LaRocque are currently working on a benchmark that will demonstrate Titan’s performance when tens of thousands of transactions are concurrently interacting with Titan. We plan to release this benchmark via the Aurelius blog.
[Edit: The benchmark is now available here. ]

Q5. What is the relationship of Titan to other open source projects you were previously involved with, such as TinkerPop? Is Titan open source?

MARKO: Titan is a free, open source Apache2 project maintained and funded by Aurelius. Aurelius (our graph consulting firm) developed Titan in order to meet the scalability requirements of a number of our clients.
In fact, Pearson is a primary supporter and early adopter of Titan. TinkerPop, on the other hand, is not directly funded by any company and as such, is an open source group developing graph-based tools that any graph database vendor can leverage.
With that said, Titan natively implements the Blueprints 2 API and is able to leverage the TinkerPop suite of technologies: Pipes, Gremlin, Frames, and Rexster.
We believe this demonstrates the power of the TinkerPop stack — if you are developing a graph persistence store, implement Blueprints and your store automatically gets a traversal language, an OGM (object-to-graph mapper) framework, and a RESTful server.

Q6. How is Titan addressing the problem of analyzing Big Data at scale?

MATTHIAS: Titan is an OLTP database optimized for many concurrent users running short transactions, e.g. graph updates or short traversals, against one huge graph. Titan significantly simplifies the development of scalable graph applications such as those behind Facebook, Twitter, and the like.
Interestingly enough, most of these large companies have built their own internal graph databases.
We hope Titan will allow organizations to not reinvent the wheel. In this way, companies can focus on the value their data adds, not on the “plumbing” needed to process that data.

MARKO: In order to support the type of global OLAP operations typified by the Big Data community, Aurelius will be providing a suite of technologies that allow developers to make use of global graph algorithms. Faunus is a Hadoop connector that implements a multi-relational path algebra developed by Joshua Shinavier and myself. This algebra allows users to derive smaller, “semantically rich” graphs that can then be computed on effectively within the memory confines of a single machine. Fulgora will be the in-memory processing engine. Currently, as Matthias has shown in prototype, Fulgora can store ~90 billion edges on a machine with 64 GB of RAM for graphs with a natural, real-world topology. Titan, Faunus, and Fulgora form Aurelius’ OLAP story.

Q7. How do you handle updates?

MATTHIAS: Updates are bundled into transactions which are executed against the underlying storage backend. Titan can operate on multiple storage backends and currently supports Apache Cassandra, Apache HBase, and Oracle BerkeleyDB.
The degree of transactional support and isolation depends on the chosen storage backend. For non-transactional storage backends, Titan provides its own locking system with fine-grained locking support to achieve consistency while maintaining scalability.
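[Edit: A sketch of such a bundled update, assuming Titan’s alpha-era API in which TitanFactory.open() opens a local BerkeleyDB-backed instance and transactions follow the Blueprints 2 TransactionalGraph convention.]

import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;
import com.tinkerpop.blueprints.TransactionalGraph;
import com.tinkerpop.blueprints.Vertex;

public class TitanUpdate {
    public static void main(String[] args) {
        // Open a Titan instance backed by local BerkeleyDB storage.
        TitanGraph graph = TitanFactory.open("/tmp/titan-demo");
        Vertex a = graph.addVertex(null);
        Vertex b = graph.addVertex(null);
        graph.addEdge(null, a, b, "follows");
        // The two vertices and the edge are committed as one transaction
        // against the underlying storage backend.
        graph.stopTransaction(TransactionalGraph.Conclusion.SUCCESS);
        graph.shutdown();
    }
}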

Q8. Do you offer support for declarative queries?

MARKO: Titan implements the Blueprints 2 API and as such, supports Gremlin as its query/traversal language. Gremlin is a data flow language for graphs whereby traversals are prescriptively described using path expressions.
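[Edit: Gremlin itself is usually written as a Groovy DSL, e.g. g.v(1).out('knows').name; a rough Java equivalent via the gremlin-java GremlinPipeline might look as follows, with the graph, start id, and label illustrative.]

import com.tinkerpop.blueprints.Graph;
import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.gremlin.java.GremlinPipeline;

public class FriendNames {
    // Walk outgoing "knows" edges from a start vertex and emit each
    // neighbour's "name" property: a path expression spelled out in Java.
    static Iterable<Object> namesOfFriends(Graph g, Object startId) {
        Vertex start = g.getVertex(startId);
        return new GremlinPipeline<Vertex, Object>(start)
                .out("knows")
                .property("name");
    }
}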

MATTHIAS: With respect to a declarative query language, the TinkerPop team is currently in the design process of a graph-centric language called “Troll.” We invite anybody interested in graph algorithms and graph processing to help in this effort.
We want to identify the key graph use cases and then build a language that addresses them most effectively. Note that this is happening in TinkerPop, and any Blueprints-enabled graph database will ultimately be able to add “Troll” to its supported languages.

Q9. How does Titan compare with other commercial graph databases and RDF triple stores?

MARKO: As Matthias has articulated previously, Titan is optimized for thousands of concurrent users reading and writing to a single massive graph. Most popular graph databases on the market today are single-machine databases and simply can’t handle the scale of data and the number of concurrent users that Titan can support. However, because Titan is a Blueprints-enabled graph database, it provides the same perspective on graph data as other graph databases.
In terms of RDF quad/triple stores, the biggest obvious difference is the data model. RDF stores make use of a collection of triples composed of a subject, predicate, and object. There is no notion of key/value pairs associated with vertices and edges as there is in Blueprints-based databases. When one wants to model edge weights, timestamps, etc., RDF becomes cumbersome. However, the RDF community has a rich collection of tools and standards that make working with RDF data easy and compatible across all RDF vendors.
For example, I have a deep appreciation for OpenRDF.
Similar to OpenRDF, TinkerPop hopes to make it easy for developers to migrate between various graph solutions whether they be graph databases, in-memory graph frameworks, Hadoop-based graph processing solutions, etc.
The ultimate goal is to ensure that the graph community is not hindered by vendor lock-in.

Q10. How does Titan compare with respect to NoSQL data stores and NewSQL databases?

MATTHIAS: Titan builds on top of the innovation at the persistence layer that we have seen in recent years in the NoSQL movement. At the lowest level, a graph database needs to store bits and bytes and therefore has to address the same issues around persistence, fault tolerance, replication, synchronization, etc. that NoSQL solutions are tackling.
Rather than reinventing the wheel, Titan is standing on the shoulders of giants by being able to utilize different NoSQL solutions for storage through an abstract storage interface. This allows Titan to cover all three sides of the CAP theorem triangle — please see here.
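[Edit: A sketch of that pluggable storage interface from the application’s side, assuming Titan’s configuration keys of the time for selecting a backend; the host is illustrative.]

import org.apache.commons.configuration.BaseConfiguration;
import org.apache.commons.configuration.Configuration;
import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;

public class BackendChoice {
    public static void main(String[] args) {
        // Pointing Titan at Cassandra picks the AP side of the CAP triangle;
        // the same application code over HBase picks CP instead.
        Configuration conf = new BaseConfiguration();
        conf.setProperty("storage.backend", "cassandra"); // or "hbase", "berkeleyje"
        conf.setProperty("storage.hostname", "127.0.0.1");
        TitanGraph graph = TitanFactory.open(conf);
        graph.shutdown();
    }
}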

Q11. Prof. Stonebraker argues that “blinding performance depends on removing overhead. Such overhead has nothing to do with SQL, but instead revolves around traditional implementations of ACID transactions, multi-threading, and disk management. To go wildly faster, one must remove all four sources of overhead, discussed above. This is possible in either a SQL context or some other context.” What is your opinion on this?

MATTHIAS: We absolutely agree with Mike on this. The relational model is a way of looking at your data through tables, and SQL is the language you use when you adopt this tabular view. There is nothing intrinsically inefficient about tables or relational algebra. But it’s important to note that the relational model is simply one way of looking at your data. We promote the graph data model, which is the natural data representation for many applications where entities are highly connected with one another. Using a graph database for such applications will make developers significantly more productive and change the way they derive value from their data.

Dr. Marko A. Rodriguez is the founder of the graph consulting firm Aurelius. He has focused his academic and commercial career on the theoretical and applied aspects of graphs. Marko is a cofounder of TinkerPop and the primary developer of the Gremlin graph traversal language.

Dr. Matthias Broecheler has been researching and developing large-scale graph database systems for many years in both academia and in his role as a cofounder of the Aurelius graph consulting firm. He is the primary developer of the distributed graph database Titan.
Matthias focuses most of his time and effort on novel OLTP and OLAP graph processing solutions.

Related Posts

“Applying Graph Analysis and Manipulation to Data Stores.” (June 22, 2011)

“Marrying objects with graphs”: Interview with Darren Wood. (March 5, 2011)

Resources on Graphs and Data Stores
Blog Posts | Free Software | Articles, Papers, Presentations | Tutorials, Lecture Notes

