On multi-model databases. Interview with Martin Schönert and Frank Celler.

by Roberto V. Zicari on October 28, 2013

“We want to prevent a deadlock where the team is forced to switch the technology in the middle of the project because it doesn’t meet the requirements any longer.”–Martin Schönert and Frank Celler.

On “multi-model” databases, I have interviewed Martin Schönert and Frank Celler, founders and creators of the open source ArangoDB.

RVZ

Q1. What is ArangoDB and what kind of applications is it designed for?

Frank Celler: ArangoDB is a multi-model mostly-memory database with a flexible data model for documents and graphs. It is designed as a “general purpose database”, offering all the features you typically need for modern web applications.

ArangoDB is supposed to grow with the application: the project may start as a simple single-server prototype, nothing you couldn’t do equally well with a relational database. After some time, some geo-location features are needed and a shopping cart requires transactions. ArangoDB’s graph data model is useful for the recommendation system. The smartphone app needs a lean API to the back-end; this is where Foxx, ArangoDB’s integrated JavaScript application framework, comes into play.
The overall idea is: “We want to prevent a deadlock where the team is forced to switch the technology in the middle of the project because it doesn’t meet the requirements any longer.”

ArangoDB is open source (Apache 2 licence)—you can get the source code at GitHub or download the precompiled binaries from our website.

Though ArangoDB takes a universal approach, there are edge cases where we don’t recommend it. In particular, ArangoDB doesn’t compete with massively distributed systems like Cassandra, with thousands of nodes and many terabytes of data.

Q2. What’s so special about the ArangoDB data model?

Martin Schönert: ArangoDB is a multi-model database. It stores documents in collections. A specialized binary data file format is used for disk storage. Documents that have similar structure (i.e., that have the same attribute names and attribute types) can share their structural information. The structure (called “shape”) is saved just once, and multiple documents can re-use it by storing just a pointer to their “shape”.
In practice, documents in a collection are likely to be homogeneous, and sharing the structure data between multiple documents can greatly reduce disk storage space and memory usage for documents.
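The idea of shared shapes can be illustrated with a small Python sketch. This is not ArangoDB's binary format or implementation; the class `ShapedCollection` and its methods are hypothetical names chosen for the example. It only shows the principle: each distinct structure (attribute names and types) is stored once, and every document keeps just a reference to its shape plus its values.

```python
class ShapedCollection:
    """Toy collection that stores each distinct document structure once.

    Hypothetical illustration only; not ArangoDB's actual storage format.
    """

    def __init__(self):
        self.shapes = []      # shape id -> tuple of (attribute name, type name)
        self.shape_ids = {}   # shape tuple -> shape id
        self.documents = []   # each entry: (shape id, list of values)

    def insert(self, doc):
        # The shape is the sorted attribute names plus the type of each value.
        keys = sorted(doc)
        shape = tuple((key, type(doc[key]).__name__) for key in keys)
        shape_id = self.shape_ids.get(shape)
        if shape_id is None:
            # First document with this structure: register the shape once.
            shape_id = len(self.shapes)
            self.shapes.append(shape)
            self.shape_ids[shape] = shape_id
        # Per document, only the shape id and the bare values are stored.
        self.documents.append((shape_id, [doc[key] for key in keys]))

    def get(self, index):
        # Rebuild the document from its values and the shared shape.
        shape_id, values = self.documents[index]
        names = [name for name, _ in self.shapes[shape_id]]
        return dict(zip(names, values))


users = ShapedCollection()
users.insert({"name": "John", "age": 31})
users.insert({"name": "Maria", "age": 28})
users.insert({"name": "Steve", "city": "Cologne"})

print(len(users.shapes))  # number of distinct shapes stored
print(users.get(1))       # document rebuilt from shape and values
```

The first two documents have the same attribute names and types, so they share one shape entry; only the third adds a new one.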

Q3. Who is currently using ArangoDB for what?

Frank Celler: ArangoDB is open source. You don’t have to register to download the source code or precompiled binaries. As a user, you can get support via the Google Group, GitHub’s issue tracker and even via Twitter. We are very approachable, which is an essential part of the project. The drawback is that we don’t really know in detail what people are doing with ArangoDB. We have seen an exponentially increasing number of downloads over the last few months.
We are aware of a broad range of use cases: a CMS, a high-performance logging component, a geo-coding tool, an annotation system for animations, just to name a few. Other interesting use cases are single page apps or mobile apps via Foxx, ArangoDB’s application framework. Many of our users have in-production experience with other NoSQL databases, especially the leading document stores.

Q4. Could you motivate your design decision to use Google’s V8 JavaScript engine?

Martin Schönert: ArangoDB uses Google’s V8 engine to execute server-side JavaScript functions. Users can write server-side business logic in JavaScript and deploy it in ArangoDB. These so-called “actions” are much like stored procedures living close to the data.
For example, with actions it is possible to perform cascading deletes/updates, assign permissions, and do additional calculations and modifications to the data.
ArangoDB also allows users to map URLs to custom actions, making it usable as an application server that handles client HTTP requests with user-defined business logic.
We opted for JavaScript as it meets our requirements for an “embedded language” in the database context:
• JavaScript is widely used. Regardless of which “back-end language” web developers write their code in, almost everybody can also code in JavaScript.
• JavaScript is effective and still modern.
Likewise, we chose Google V8 because it is currently the fastest and most stable JavaScript engine available.

Q5. How do you query ArangoDB if you don’t want to use JavaScript?

Frank Celler: ArangoDB offers a couple of options for getting data out of the database. It has a REST interface for CRUD operations and also allows “querying by example”. “Querying by example” means that you create a JSON document with the attributes you are looking for. The database returns all documents which look like the “example document”.
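To make “querying by example” concrete, here is a minimal Python sketch of the matching rule described above. It does not use ArangoDB's REST interface; the functions `matches_example` and `by_example` and the sample documents are assumptions made for illustration. A document matches when every attribute in the example is present with an equal value.

```python
def matches_example(doc, example):
    """Return True if every attribute of the example has an equal value in doc."""
    return all(key in doc and doc[key] == value for key, value in example.items())


def by_example(collection, example):
    """Return all documents in the collection that look like the example."""
    return [doc for doc in collection if matches_example(doc, example)]


# Hypothetical sample data, not taken from the interview.
users = [
    {"_key": "john", "name": "John", "city": "Cologne"},
    {"_key": "maria", "name": "Maria", "city": "Aachen"},
    {"_key": "steve", "name": "Steve", "city": "Cologne"},
]

in_cologne = by_example(users, {"city": "Cologne"})
print([doc["_key"] for doc in in_cologne])
```

The sketch also shows the limitation: an example document can only express equality on attributes of a single collection, so conditions such as ranges or joins need a richer language.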
Expressing complex queries as JSON documents can become a tedious task, and it’s almost impossible to support joins following this approach. We wanted a convenient and easy-to-learn way to execute even complex queries, without requiring any programming as in an approach based on map/reduce. As ArangoDB supports multiple data models including graphs, it was sufficient neither to stick to SQL nor to simply implement UnQL. We ended up with the “ArangoDB query language” (AQL), a declarative language similar to SQL and JSONiq. AQL supports joins, graph queries, list iteration, results filtering, results projection, sorting, variables, grouping, aggregate functions, unions, and intersections.
Of course, ArangoDB also offers drivers for all major programming languages. The drivers wrap the mentioned query options following the paradigm of the programming language and/or frameworks like Ruby on Rails.

Q6. How do you perform graph queries? How does this differ from systems such as Neo4J?

Frank Celler: SQL can’t cope with the required semantics to express the relationships between graph nodes, so graph databases have to provide other ways to access the data.
The first option is to write small programs, so-called “path traversals.” In ArangoDB you use JavaScript; in Neo4j you use Java. The general approach is very similar.
Programming gives you all the freedom to do whatever comes to your mind. That’s good. For standard use cases, however, programming might be too much effort. So both ArangoDB and Neo4j offer a declarative language: Neo4j has “Cypher,” ArangoDB has the “ArangoDB Query Language.” Both also implement the Blueprints standard, so you can use “Gremlin” as a query language inside Java. We already mentioned that ArangoDB is a multi-model database: AQL covers documents and graphs, and it supports joins, lists, variables, and much more.

The following example is taken from the neo4j website:

“For example, here is a query which finds a user called John in an index and then traverses the graph looking for friends of John’s friends (though not his direct friends) before returning both John and any friends-of-friends that are found.

START john=node:node_auto_index(name = 'John')
MATCH john-[:friend]->()-[:friend]->fof
RETURN john, fof ”

In AQL, the same query looks like this:

FOR t IN TRAVERSAL(users, friends, "users/john", "outbound",
{minDepth: 2}) RETURN t.vertex._key

The result is:
[ "maria", "steve" ]

You see that Cypher describes patterns while AQL describes joins. Internally, ArangoDB has a library of graph functions; those functions return collections of paths, or use those collections in a join.
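For readers who prefer to see the traversal spelled out, the friends-of-friends query can be sketched in plain Python. This is an illustration under stated assumptions, not how ArangoDB or Neo4j evaluate the query: the function `friends_of_friends`, the adjacency-list representation and the intermediate friends "sara" and "joe" are all hypothetical. The sketch uses a breadth-first search and returns the vertices first reached exactly two outbound steps from the start, which excludes direct friends.

```python
from collections import deque


def friends_of_friends(edges, start, depth=2):
    """Return vertices first reached at exactly `depth` outbound steps from start."""
    seen = {start}
    frontier = deque([(start, 0)])
    result = []
    while frontier:
        vertex, dist = frontier.popleft()
        if dist == depth:
            # Reached the target depth: report the vertex, do not go further.
            result.append(vertex)
            continue
        for neighbour in edges.get(vertex, []):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, dist + 1))
    return result


# Hypothetical outbound "friend" edges; only "maria" and "steve" come from the text.
friends = {
    "john": ["sara", "joe"],
    "sara": ["maria", "joe"],
    "joe": ["steve"],
}

print(friends_of_friends(friends, "john"))
```

Because the search tracks visited vertices, "joe" is not reported even though he is also reachable through "sara": he is already a direct friend of "john".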

Q7. How did you design ArangoDB to scale out and/or scale up? Please give us some detail.

Martin Schönert: Solid state disks are increasingly becoming commodity hardware. ArangoDB’s append-only design is a perfect fit for such SSDs, allowing for data sets which are much bigger than main memory but still fit onto a solid state disk.
ArangoDB supports master/slave replication in version 1.4, which will be released in the next few days (a beta has been available for some time). On the one hand, this provides easy fail-over setups. On the other hand, it provides a simple way to scale read performance.
Sharding is implemented in version 2.0. This enables you to store even bigger data sets and to increase write performance. As noted before, however, we see our main use case in scaling to a small number of nodes. We don’t plan to optimize ArangoDB for massive scaling with hundreds of nodes. Plain key/value stores are much better suited to such scenarios.

Q8. What is ArangoDB’s update and delete strategy?

Martin Schönert: ArangoDB versions prior to 1.3 store all revisions of documents in an append-only fashion; the objects will never be overwritten. The latest version of a document is available to the end user.

With the current version 1.3, ArangoDB introduces transactions and lays the technical foundation for replication and sharding. Along with those much-requested features comes “real” MVCC with concurrent writes.

In databases implementing an append-only strategy, obsolete versions of a document have to be removed to save space. As we already mentioned, ArangoDB is multi-threaded: The so-called compaction is automatically done in the background in a different thread without blocking reads and writes.
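The append-only strategy and the role of compaction can be sketched in a few lines of Python. This is a simplified, single-threaded illustration and not ArangoDB's implementation (where compaction runs in a background thread); the class `AppendOnlyStore` and its methods are hypothetical. Every write, including a delete, appends a new entry, reads see only the latest revision, and compaction discards what has become obsolete.

```python
class AppendOnlyStore:
    """Toy append-only store: writes never overwrite, compaction drops old revisions.

    Hypothetical illustration only; not ArangoDB's actual design.
    """

    def __init__(self):
        self.log = []      # list of (key, revision, document or None)
        self.latest = {}   # key -> index of the newest log entry for that key
        self.revision = 0

    def _append(self, key, doc):
        self.revision += 1
        self.log.append((key, self.revision, doc))
        self.latest[key] = len(self.log) - 1

    def save(self, key, doc):
        # Inserts and updates both append a new revision.
        self._append(key, doc)

    def remove(self, key):
        # A deletion is also an appended entry (a tombstone).
        self._append(key, None)

    def get(self, key):
        # Only the latest revision is visible to the reader.
        index = self.latest.get(key)
        return None if index is None else self.log[index][2]

    def compact(self):
        # Keep only the newest entry per key and drop tombstones.
        live = sorted(self.latest.values())
        new_log = [self.log[i] for i in live if self.log[i][2] is not None]
        self.log = new_log
        self.latest = {entry[0]: i for i, entry in enumerate(new_log)}


store = AppendOnlyStore()
store.save("john", {"age": 30})
store.save("john", {"age": 31})
store.save("maria", {"age": 28})
store.remove("maria")

print(len(store.log))  # entries before compaction
store.compact()
print(len(store.log), store.get("john"))
```

Before compaction the log holds all four entries; afterwards only the newest revision of "john" remains, and reads return the same results as before.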

Q9. How does ArangoDB differ from other NoSQL data stores such as Couchbase and MongoDB and graph data stores such as Neo4j, to name a few?

Frank Celler: ArangoDB’s feature scope is driven by the idea to give the developer everything he needs to master typical tasks in a web application—in a convenient and technically sophisticated way alike.
From our point of view, it is the combination of features and product quality that sets ArangoDB apart: ArangoDB handles not only documents but also graphs.
ArangoDB is extensible via JavaScript and Ruby. Bundled with ArangoDB you get “Foxx”. Foxx is an integrated application framework ideal for lean back-ends and single page JavaScript applications (SPA).
Multi-collection transactions are useful not only for online banking and e-commerce but they become crucial in any web app in a distributed architecture. Here again, we offer the developers many choices. If transactions are needed, developers can use them.
If, on the other hand, the problem requires higher performance and less transaction safety, developers are free to ignore multi-collection transactions and to use the standard single-document transactions implemented by most NoSQL databases.
Another unique feature is ArangoDB’s query language AQL—it makes querying powerful and convenient. For simple queries, we offer a simple query-by-example interface. Then again, AQL enables you to describe complex filter conditions and joins in a readable format.

Q10. Could you summarize the main results of your benchmarking tests?

Frank Celler: To quote Jan Lehnardt from CouchDB: “NoSQL is not about performance, scaling, dropping ACID or hating SQL; it is about choice.” As NoSQL databases are all somewhat different, it does not help very much to compare them by throughput and choose the one which is fastest. Instead, users should think carefully about their overall requirements and weigh the different aspects. Massively scalable key/value stores or memory-only systems can achieve much higher benchmark results. But our aim is to provide a much more convenient system for a broader range of use cases, one which is fast enough for almost all of them.
Anyway, we have done a lot of performance tests and are more than happy with the results. ArangoDB 1.3 inserts up to 140,000 documents per second. We are going to publish the whole test suite, including a test runner, soon, so everybody can try it out on their own hardware.

We have also tested the space usage: Storing 3.5 million AOL search queries takes about 200 MB in MongoDB with pre-allocation compared to 55 MB in ArangoDB. This is the benefit of implementing the concept of shapes.

Q11. ArangoDB is open source. How do you motivate and involve the open source development community to contribute to your projects rather than any other open source NoSQL?

Frank Celler: To be honest: The contributors come of their own volition and until now we didn’t have to “push” interested parties. Obviously, ArangoDB is fascinating enough, even though there are more than 150 NoSQL databases available to choose from.

It all started when Ruby inventor Yukihiro “Matz” Matsumoto tweeted about ArangoDB and recommended it to the community. Following this tweet, ArangoDB’s first local fan base was established in Japan, and we learned a lot about the limits of automatic translation from Japanese tweets to English and the other way around ;-).

In our daily “work” with our community, we try to be as open and supportive as possible. The core developers communicate directly and within short response times with people having ideas or needing help through Google Groups or GitHub. We maintain a community, especially for contributors, where we discuss future features and announce upcoming changes early so that API contributors can keep their implementations up to date.

——————————
Martin Schönert
Martin is the origin of many fancy ideas in ArangoDB. As chief architect he is responsible for the overall architecture of the system, bringing in his experience from more than 20 years in IT as developer, architect, project manager and entrepreneur.
Martin started his career as a scientist at the technical university of Aachen after earning his degree in Mathematics. Later he worked as head of product development (Team4 Systemhaus), Director IT (OnVista Technologies) and head of division at Deutsche Post.
Martin has been working with relational and non-relational databases (e.g. a torrid love-hate relationship with the granddaddy of all non-relational databases: Lotus Notes) for the largest part of his professional life.
When no database did what he needed, he wrote his own: one for extremely high update rates and another for distributed caching.

Frank Celler
Frank is both an entrepreneur and a backend developer, and has been developing mostly-memory databases for two decades. He is the lead developer of ArangoDB and co-founder of triAGENS. In addition, Frank organizes Cologne’s NoSQL user group and NoSQL conferences, and speaks at developer conferences.
Frank studied in Aachen and London and received a PhD in Mathematics. Prior to founding triAGENS, the company behind ArangoDB, he worked for several German tech companies as consultant, team lead and developer.
His technical focus is C and C++; recently he gained some experience with Ruby when integrating mruby into ArangoDB.

Resources

– The stable version (1.3 branch) of ArangoDB can be downloaded here.
– ArangoDB on Twitter
– ArangoDB Google Group
– ArangoDB questions on StackOverflow
– Issue Tracker at Github

– On Big Data and NoSQL. Interview with Renat Khasanshyn. October 7, 2013

– On NoSQL. Interview with Rick Cattell. August 19, 2013

– On Big Graph Data. August 6, 2012

Follow ODBMS.org on Twitter: @odbmsorg
