ODBMS Industry Watch – Trends and Information on Big Data, New Data Management Technologies, Data Science and Innovation (http://www.odbms.org/blog)

On Open Source Databases. Interview with Peter Zaitsev
http://www.odbms.org/blog/2017/09/on-open-source-databases-interview-with-peter-zaitsev/
Wed, 06 Sep 2017

“To be competitive with non-open-source cloud deployment options, open source databases need to invest in “ease-of-use.” There is no tolerance for complexity in many development teams as we move to “ops-less” deployment models.” –Peter Zaitsev

I have interviewed Peter Zaitsev, Co-Founder and CEO of Percona.
In this interview, Peter talks about the Open Source Databases market; the Cloud; the scalability challenges at Facebook; compares MySQL, MariaDB, and MongoDB; and presents Percona’s contribution to the MySQL and MongoDB ecosystems.

RVZ

Q1. What are the main technical challenges in obtaining application scaling?

Peter Zaitsev: When it comes to scaling, there are different types. There is a Facebook/Google/Alibaba/Amazon scale: these giants are pushing boundaries, and are usually solving very complicated engineering problems at a scale where solutions aren’t easy or known. This often means finding edge cases that break things like hardware, operating system kernels and the database. As such, these companies not only need to build very large-scale infrastructure, with a high level of automation, but also ensure it is robust enough to handle these kinds of issues with limited user impact. A great deal of hardware and software deployment practices must be in place for such installations.

While these “extreme-scale” applications are very interesting and get a lot of publicity at tech events and in tech publications, this is a very small portion of all the scenarios out there. The vast majority of applications are running at the medium to high scale, where implementing best practices gets you the scalability you need.

When it comes to MySQL, perhaps the most important question is when you need to “shard.” Sharding — while used by every application at extreme scale — isn’t a simple “out-of-the-box” feature in MySQL. It often requires a lot of engineering effort to correctly implement it.

While sharding is sometimes required, you should really examine whether it is necessary for your application. A single MySQL instance can easily handle hundreds of thousands of moderately complicated queries per second (or more), and terabytes of data. Pair that with Memcached or Redis caching, MySQL Replication, or more advanced solutions such as Percona XtraDB Cluster or Amazon Aurora, and you can cover the transactional (operational) database needs of applications at a very significant scale.
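
As a rough illustration of the caching pattern mentioned above, here is a minimal cache-aside sketch. The `query_mysql` helper and the in-memory `CACHE` dict are hypothetical stand-ins (not from the interview); in practice they would wrap a real MySQL driver and a Memcached or Redis client.

```python
import json

CACHE = {}  # stand-in for a Memcached/Redis client

def query_mysql(user_id):
    """Hypothetical helper: run the real query against MySQL (or a replica)."""
    # e.g. SELECT id, name, plan FROM users WHERE id = %s
    return {"id": user_id, "name": "example", "plan": "basic"}

def get_user(user_id):
    """Cache-aside read: try the cache first, fall back to MySQL, then populate."""
    key = f"user:{user_id}"
    cached = CACHE.get(key)
    if cached is not None:
        return json.loads(cached)
    row = query_mysql(user_id)      # cache miss: hit the database
    CACHE[key] = json.dumps(row)    # write back so the next read is cheap
    return row

def update_user(user_id, fields):
    """On writes, update MySQL and invalidate the cached entry."""
    # UPDATE users SET ... WHERE id = %s  (against the primary)
    CACHE.pop(f"user:{user_id}", None)
```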

Besides making such high-level architecture choices, you of course also need to ensure that you exercise basic database hygiene. Ensure that you’re using the correct hardware (or cloud instance type), the right MySQL and operating system version and configuration, and that you have a well-designed schema and good indexes. You also want to ensure good capacity planning, so that you are not caught by surprise when you need to take your system to the next scale.

Q2. Why did Facebook create MyRocks, a new flash-optimized transactional storage engine on top of RocksDB storage engine for MySQL?

Peter Zaitsev: The Facebook Team is the most qualified to answer this question. However, I imagine that at Facebook scale being efficient is very important, because it helps to drive costs down. If your hot data is in the cache, what is important is that your database is efficient at handling writes — thus you want a “write-optimized engine.”
If you use Flash storage, you also care about two things:

– A high level of compression, since Flash storage is much more expensive than spinning disk.

– Writing as little to the storage as possible, as the more you write the faster it wears out (and needs to be replaced).

RocksDB and MyRocks are able to achieve all of these goals. Because they are LSM-based, writes (especially inserts) are very fast — even for giant data sizes. They are also much better suited to achieving high levels of compression than InnoDB.
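
To see why an LSM-based engine is write-optimized, here is a toy log-structured merge sketch (illustrative only, not RocksDB/MyRocks code): writes land in an in-memory memtable and are flushed as sorted, sequential runs rather than updating a B-tree in place.

```python
import bisect

class ToyLSM:
    """Toy LSM tree: in-memory memtable plus sorted, immutable runs on 'disk'."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}   # recent writes, absorbed in memory
        self.runs = []       # older data: list of sorted (key, value) runs
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value   # a write is just a memory insert (fast)
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Sequential flush of a sorted run: cheap on flash, compresses well.
        self.runs.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Search newest run first; real engines use bloom filters to skip runs.
        for run in reversed(self.runs):
            keys = [k for k, _ in run]
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return run[i][1]
        return None
```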

This Blog Post by Mark Callaghan has many interesting details, including this table which shows MyRocks having better performance, write amplification and compression for Facebook’s workload than InnoDB.

Q3. Beringei is Facebook’s open source, in-memory time series database. According to Facebook, large-scale monitoring systems cannot handle large-scale analysis in real time because the query performance is too slow. What is your take on this?

Peter Zaitsev: Facebook operates at extreme scale, so it is no surprise the conventional systems don’t scale well enough or aren’t efficient enough for Facebook’s needs.

I’m very excited that Facebook has released Beringei as open source. Beringei itself is a relatively low-level storage engine that is hard for the majority of users to use on its own, but I hope it gets integrated with other open source projects and grows into a full-blown, high-performance monitoring solution. Integrating it with Prometheus would be a great fit for solutions with extreme data ingestion rates and very high metric cardinality.

Q4. How do you see the market for open source databases evolving?

Peter Zaitsev: The last decade has seen a lot of open source database engines built, offering a lot of different data models, persistence options, high availability options, etc. Some of them were built as open source from scratch, while others were released as open source after years of being proprietary engines — the most recent example being CMDB2 by Bloomberg. I think this heavy competition is great for pushing innovation forward, and is very exciting! For example, I think that if MongoDB hadn’t shown how many developers love a document-oriented data model, we might never have seen MySQL Document Store in the MySQL ecosystem.

With all this variety, I think there will be a lot of consolidation and only a small fraction of these new technologies really getting wide adoption. Many will either have niche deployments, or will be an idea breeding ground that gets incorporated into more popular database technologies.

I do not think SQL will “die” anytime soon, even though it is many decades old. But I also don’t think we will see it remain the dominant “database” language, as it has been since the turn of the millennium.

The interesting disruptive force for open source technologies is the cloud. It will be very interesting for me to see how things evolve. With pay-for-use models of the cloud, the “free” (as in beer) part of open source does not apply in the same way. This reduces incentives to move to open source databases.

To be competitive with non-open-source cloud deployment options, open source databases need to invest in “ease-of-use.” There is no tolerance for complexity in many development teams as we move to “ops-less” deployment models.

Q5. In your opinion what are the pros and cons of MySQL vs. MariaDB?

Peter Zaitsev: While it traces its roots to MySQL, MariaDB is quickly becoming a very different database.
It implements some features MySQL doesn’t, but also leaves out others (MySQL Document Store and Group Replication) or implements them in a different way (JSON support and Replication GTIDs).

From the MySQL side, we have Oracle’s financial backing and engineering. You might dislike Oracle, but I think you agree they know a thing or two about database engineering. MySQL is also far more popular, and as such more battle-tested than MariaDB.

MySQL is developed by a single company (Oracle) and does not have as many external contributors compared to MariaDB — which has its own pluses and minuses.

MySQL is “open core,” meaning some components are available only in the proprietary version, such as Enterprise Authentication, Enterprise Scalability, and others. Alternatives for a number of these features are available in Percona Server for MySQL, though (which is completely open source). MariaDB Server itself is completely open source, though there are other components that aren’t that you might need to build a full solution — namely MaxScale.

Another thing MariaDB has going for it is that it is included in a number of Linux distributions. Many new users will be getting their first “MySQL” experience with MariaDB.

For additional insight into MariaDB, MySQL and Percona Server for MySQL, you can check out this recent article.

Q6. What’s new in the MySQL and MongoDB ecosystem?

Peter Zaitsev: This could be its own and rather large article! With MySQL, we’re very excited to see what is coming in MySQL 8. There should be a lot of great changes in pretty much every area, ranging from the optimizer to retiring a lot of architectural debt (some of it 20 years old). MySQL Group Replication and MySQL InnoDB Cluster, while still early in their maturity, are very interesting products.

For MongoDB, we’re very excited about MongoDB 3.4, which has been taking steps to be a more enterprise-ready database with features like collation support and high-performance sharding. A number of these features are only available in the Enterprise version of MongoDB, such as external authentication, auditing and log redaction. This is where Percona Server for MongoDB 3.4 comes in handy, by providing open source alternatives for the most valuable Enterprise-only features.

For both MySQL and MongoDB, we’re very excited about RocksDB-based storage engines. MyRocks and MongoRocks both offer outstanding performance and efficiency for certain workloads.

Q7. Anything else you wish to add?

Peter Zaitsev: I would like to use this opportunity to highlight Percona’s contribution to the MySQL and MongoDB ecosystems by mentioning two of our open source products that I’m very excited about.

First, Percona XtraDB Cluster 5.7.
While this has been around for about a year, we just completed a major performance improvement effort that allowed us to increase performance up to 10x. I’m not talking about improving some very exotic workloads: these performance improvements are achieved in very typical high-concurrency environments!

I’m also very excited about our Percona Monitoring and Management product, which is unique in being the only fully packaged open source monitoring solution specifically built for MySQL and MongoDB. It is a newer product that has been available for less than a year, but we’re seeing great momentum in adoption in the community. We are focusing many of our resources on improving it and making it more effective.

———————


Peter Zaitsev co-founded Percona and assumed the role of CEO in 2006. As one of the foremost experts on MySQL strategy and optimization, Peter leveraged both his technical vision and entrepreneurial skills to grow Percona from a two-person shop to one of the most respected open source companies in the business. With more than 150 professionals in 29 countries, Peter’s venture now serves over 3000 customers – including the “who’s who” of Internet giants, large enterprises and many exciting startups. Percona was named to the Inc. 5000 in 2013, 2014, 2015 and 2016.

Peter was an early employee at MySQL AB, eventually leading the company’s High Performance Group. A serial entrepreneur, Peter co-founded his first startup while attending Moscow State University where he majored in Computer Science. Peter is a co-author of High Performance MySQL: Optimization, Backups, and Replication, one of the most popular books on MySQL performance. Peter frequently speaks as an expert lecturer at MySQL and related conferences, and regularly posts on the Percona Data Performance Blog. He has also been tapped as a contributor to Fortune and DZone, and his recent ebook Practical MySQL Performance Optimization Volume 1 is one of percona.com’s most popular downloads.
————————-

Resources

Percona, in collaboration with Facebook, announced the first experimental release of MyRocks in Percona Server for MySQL 5.7, with packages. September 6, 2017

eBook, “Practical MySQL Performance Optimization,” by Percona CEO Peter Zaitsev and Principal Consultant Alexander Rubin. (LINK to DOWNLOAD, registration required)

MySQL vs MongoDB – When to Use Which Technology. Peter Zaitsev, June 22, 2017

Percona Live Open Source Database Conference Europe, Dublin, Ireland. September 25 – 27, 2017

Percona Monitoring and Management (PMM) Graphs Explained: MongoDB with RocksDB. By Tim Vaillancourt, June 18, 2017

Related Posts

On Apache Ignite, Apache Spark and MySQL. Interview with Nikita Ivanov. ODBMS Industry Watch, 2017-06-30

On the new developments in Apache Spark and Hadoop. Interview with Amr Awadallah. ODBMS Industry Watch, 2017-03-13

On in-memory, key-value data stores. Ofer Bengal and Yiftach Shoolman. ODBMS Industry Watch, 2017-02-13

follow us on Twitter: @odbmsorg

##

Identity Graph Analysis at Scale. Interview with Niels Meersschaert
http://www.odbms.org/blog/2017/05/interview-with-niels-meersschaert/
Tue, 09 May 2017

“I’ve found the best engineers actually have art backgrounds or interests. The key capability is being able to see problems from multiple perspectives, and realizing there are multiple solutions to a problem. Music, photography and other arts encourage that.”–Niels Meersschaert.

I have interviewed Niels Meersschaert, Chief Technology Officer at Qualia. The Qualia team relies on over one terabyte of graph data in Neo4j, combined with larger amounts of non-graph data to provide major companies with consumer insights for targeted marketing and advertising opportunities.

RVZ

Q1. Your background is in Television & Film Production. How does it relate to your current job?

Niels Meersschaert: Engineering is a lot like producing. You have to understand what you are trying to achieve, and understand what parts and roles you’ll need to accomplish it, all while doing it within a budget. I’ve found the best engineers actually have art backgrounds or interests. The key capability is being able to see problems from multiple perspectives, and realizing there are multiple solutions to a problem. Music, photography and other arts encourage that. Engineering is both art and science, and creativity is a critical skill for the best engineers. I also believe that a breadth of languages is critical for engineers.

Q2. Your company collects data on more than 90% of American households. What kind of data do you collect and how do you use such data?

Niels Meersschaert: We focus on high quality data that is indicative of commercial intent. Some examples include wishlist interaction, content consumption, and location data. While we have the breadth of a huge swath of the American population, a key feature is that we have no personally identifiable information. We use anonymous unique identifiers.
So, we know this ID did actions indicative of interest in a new SUV, but we don’t know their name, email address, phone number or any other personally identifiable information about a consumer. We feel this is a good balance of commercial need and individual privacy.

Q3. If you had to operate with data from Europe, what would be the impact of the new EU General Data Protection Regulation (GDPR) on your work?

Niels Meersschaert: Europe is a very different market than the U.S., and many of the regulations you mentioned do require a different approach to understanding consumer behaviors. Given that we avoid personal IDs, our approach is already better situated than that of many peers that rely on PII.

Q4. Why did you choose a graph database to implement your consumer behavior tracking system?

Niels Meersschaert: Our graph database is used for ID management. We don’t use it for understanding the intent data, but rather recognizing IDs. Conceptually, describing the various IDs involved is a natural fit for a graph.
As an example, a conceptual consumer could be thought of as the top of the graph. That consumer uses many devices and each device could have 1 or more anonymous IDs associated with it, such as cookie IDs. Each node can represent an associated device or ID and the relationships between each node allow us to see the path. A key element we have in our system is something we call the Borg filter. It’s a bit of a reference to Star Trek, but essentially when we find a consumer is too connected, i.e. has dozens or hundreds of devices, we remove all those IDs from the graph as clearly something has gone wrong. A graph database makes it much easier to determine how many connected nodes are at each level.

Q5. Why did you choose Neo4j?

Niels Meersschaert: Neo4J had a rich query language and very fast performance, especially if your hot set was in RAM.

Q6. You manage one terabyte of graph data in Neo4j. How do you combine them with larger amounts of non-graph data?

Niels Meersschaert: You can think of the graph as a compression system for us. While consumer actions occur on multiple devices and anonymous IDs, they represent the actions of a single consumer. This actually simplifies things for us, since the number of unique grouping IDs is much smaller than the number of unique source IDs. It also allows us to eliminate non-human IDs from the graph. This does mean we see the world in different ways than many peers. As an example, if you focus only on cookie IDs, you tend to have a much larger number of unique IDs than the actual consumers they represent. Sadly, the same thing happens with website monthly uniques: many are highly inflated, both in the number of unique people they represent and because many of the IDs are non-human. Ultimately, the entire goal of advertising is to influence consumers, so we feel that having a better representation of actual consumers allows us to be more effective.
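
The “compression” idea can be pictured as a grouping step: many anonymous source IDs collapse to one consumer ID before the intent data is analyzed. A minimal sketch with hypothetical names (not Qualia’s code):

```python
from collections import defaultdict

def compress_to_consumers(events, id_to_consumer):
    """Group raw per-device events under the consumer ID resolved by the graph.

    events         : iterable of (source_id, action) pairs
    id_to_consumer : mapping exported from the ID graph; source IDs removed by
                     the Borg filter (or known non-human IDs) are simply absent
    """
    by_consumer = defaultdict(list)
    for source_id, action in events:
        consumer = id_to_consumer.get(source_id)
        if consumer is None:
            continue                      # drop filtered / non-human IDs
        by_consumer[consumer].append(action)
    return by_consumer
```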

Q7. What are the technical challenges you face when blending data with different structure?

Niels Meersschaert: A key challenge is finding some unifying element between the different systems or structures that link the data. What we did with Neo4J is create a unique property on the nodes that we use for interchange. The internal node IDs that are part of Neo4J aren’t something we use except internally within the graph DB.

Q8. If your data is sharded manually, how do you handle scalability?

Niels Meersschaert: We don’t shard the data manually, but scalability is one of the biggest challenges. We’ve spent a lot of time tuning queries and grouping operations to take advantage of some of the capabilities of Neo4J and to work around some of its limitations. The vast majority of graph customers wouldn’t have the volume or the volatility of data that we do, so our challenges are unique.

Q9. What other technologies do you use and how do they interact with Neo4j?

Niels Meersschaert: We use the classic big data tools like Hadoop and Spark. We also use MongoDB and Google’s Big Query. If you look at the graph as the truth set of device IDs, we interact with it on ingestion and export only. Everything in the middle can operate on the consumer ID, which is far more efficient.

Q10. How do you measure the ROI of your solution?

Niels Meersschaert: There are a few factors we consider. First, how much does the infrastructure cost us to process the data and output? How fast is it in terms of execution time? How much development effort does it take relative to other solutions? How flexible is it for us to extend it? This is an ever-evolving situation, and one where we are always looking at how to improve, especially as a smaller business.

———————————-

Niels Meersschaert
I’ve been coding since I was 7 years old on an Apple II. I’d built radio control model cars and aircraft as a child and built several custom chassis using controlled flex as suspension to keep weight & parts count down. So, I’d had an early interest in both software and physical engineering.

My father was from the Netherlands and my maternal grandfather was a linguist fluent in 43 languages. When I was a kid, my father worked for the airlines, so we traveled often to Europe to see family, and I grew up multilingual. Computer languages are just different ways to describe something; the basic concepts are similar, just as they are in spoken languages, albeit with different grammatical and syntactic structures. Whether you’re speaking French, or writing a program in Python or C, the key is that you are trying to get your communication across to the target of your message, whether it is another person or a computer.

I originally started university in aeronautical engineering, but in my sophomore year, Grumman let go about 3000 engineers, so I didn’t think the career opportunities would be as great. I’d always viewed problem solutions as a combination of art & science, so I switched majors to one in which I could combine the two.

After school I worked producing and editing commercials and industrials, often with special effects. I got into web video early on & spent a lot of time on compression and distribution systems. That led to working on search, bringing the linguistics back front and center again. I then combined the two and came full circle back to advertising, but from the technical angle, at Magnetic, where we built search retargeting. At Qualia, we kicked this into high gear: we understand consumer intent by analyzing sentiment, content and actions across multiple devices and environments, along with the interaction and timing between them, to understand where a consumer is on their intent path.

Resources

EU General Data Protection Regulation (GDPR):

Reform of EU data protection rules

European Commission – Fact Sheet Questions and Answers – Data protection reform

General Data Protection Regulation (Wikipedia)

Neo4j Sandbox: The Neo4j Sandbox enables you to get started with Neo4j, with built-in guides and sample datasets for popular use cases.

Related Posts

LDBC Developer Community: Benchmarking Graph Data Management Systems. ODBMS.org, 6 APR, 2017

Graphalytics benchmark. ODBMS.org, 6 APR, 2017
The Graphalytics benchmark is an industrial-grade benchmark for graph analysis platforms such as Giraph. It consists of six core algorithms, standard datasets, synthetic dataset generators, and reference outputs, enabling the objective comparison of graph analysis platforms.

Collaborative Filtering: Creating the Best Teams Ever. By Maurits van der Goes, Graduate Intern | February 16, 2017

Follow us on Twitter: @odbmsorg

##

New Gartner Magic Quadrant for Operational Database Management Systems. Interview with Nick Heudecker
http://www.odbms.org/blog/2016/11/new-gartner-magic-quadrant-for-operational-database-management-systems-interview-with-nick-heudecker/
Wed, 30 Nov 2016

“It is too soon to call the operational DBMS market a commodity market, but it’s easy to see a future where that is the case.”–Nick Heudecker.

I have interviewed Nick Heudecker, Research Director on Gartner’s Data & Analytics team.
The main topic of the interview is the new Magic Quadrant for Operational Database Management Systems.

RVZ

Q1. You have published the new Magic Quadrant for Operational Database Management Systems (*). How do you define the operational database management system market?

Nick Heudecker: We define a DBMS as a complete software system used to define, create, manage, update and query a database. DBMSs provide interfaces to independent programs and tools that both support and govern the performance of a variety of concurrent workload types. There is no presupposition that DBMSs must support the relational model or that they must support the full set of possible data types in use today. OPDBMSs must include functionality to support backup and recovery, and have some form of transaction durability — although the atomicity, consistency, isolation and durability model is not a requirement. OPDBMSs may support multiple delivery models, such as stand-alone DBMS software, certified configurations, cloud (public and private) images or versions, and database appliances.

Q2. Can you explain the methodology you used for this new Magic Quadrant?

Nick Heudecker: Several of Gartner’s methodologies are public. The Magic Quadrant methodology can be found here.

We use a number of data sources when we’re creating the Magic Quadrant for Operational Database Management Systems.
We survey vendor reference customers and include data from our interactions with Gartner clients. We also consider earlier information and any news about vendors’ products, customers and finances that came to light during the time frame for our analysis.

Once we have the data, we score vendors across the various dimensions of Completeness of Vision and Ability to Execute.
One thing that’s important to note is Magic Quadrants are relative assessments of vendors in a market. We couldn’t have one vendor on an MQ because it would be right in the middle – there’s nothing to compare it to.

Q3. Why were there no Visionaries this year?

Nick Heudecker: We determined there was an overall lack of vision in the market. After a few years of rapid feature expansion, the focus has shifted to operational excellence and execution. Even Leaders shifted to the left on vision, but are still placed in the Leaders quadrant based on their vision for the development of hybrid database management, hardware optimization and integration, emerging deployment models such as containerization, as well as vertical features.

Q4. Were you surprised by the analysis and some of the results you obtained?

Nick Heudecker: The lack of overall vision in the market struck us the most. Other than in a few notable cases, we received largely the same story from most vendors. The explosion of features, and the vendors emerging to implement them, has slowed. The features that initiated the expansion, such as storing new data types, geographically distributed storage, cloud and flexible data consistency models, have become common. Today, nearly every established or emerging DBMS vendor supports these features to some degree. The OPDBMS market has shifted from a phase of rapid innovation to a phase of maturing products and capabilities.

Q5. Do you believe the “NoSQL” label will continue to distinguish DBMSs?

Nick Heudecker: If you look at the entire operational DBMS space, there’s already a great deal of convergence between NoSQL vendors, as well as between NoSQL and traditionally relational vendors. Nearly every vendor, nonrelational and relational, supports multiple data types, like JSON documents, graph or wide-column. NoSQL vendors are adding SQL: MongoDB’s BI Connector and Couchbase’s N1QL are good, if diverse, examples. They’re also adding things like schema management and data validation capabilities.
On the relational side, they’re adding horizontal scaling options and alternative consistency models, as well as modern APIs. And everyone either has or is adding in-memory and cloud capabilities.

It is too soon to call the operational DBMS market a commodity market, but it’s easy to see a future where that is the case.

Q6. What are the other “Vendors to Consider”?

Nick Heudecker: The other vendors to consider are vendors that did not meet the inclusion requirements for the Magic Quadrant. Usually this is because they missed our minimum revenue requirements, but that doesn’t mean they don’t have compelling products.

——————————-
Nick Heudecker is a Research Director on Gartner’s Data & Analytics team. His coverage includes data management technologies and practices.

——————————-

Resources
(*) Magic Quadrant for Operational Database Management Systems. Published: 05 October 2016. ID: G00293203. Analysts: Nick Heudecker, Donald Feinberg, Merv Adrian, Terilyn Palanca, Rick Greenwald

– Complimentary Gartner Research: 100 Data and Analytics Predictions Through 2020. Get exclusive access to Gartner’s top 100 data and analytics predictions through 2020. Plus access other relevant Gartner research including Magic Quadrant reports for database and data warehouse solutions, and the market guide for in-memory computing (LINK to MemSQL web site – registration required).

Related Posts

MarkLogic Named a Next-Generation Database Challenger in 2016 Gartner Magic Quadrant. By GARY BLOOM, Chief Executive Officer and President MARKLOGIC

MarkLogic Recognized in New Gartner® Magic Quadrant. Gartner Magic Quadrant for Operational Database Management Systems positions MarkLogic® the highest for ability to execute in the Challengers Quadrant

– Accelerating Business Value with a Multi-Model, Multi-Workload Data Platform

– NuoDB Recognized by Gartner in Critical Capabilities for Operational Database Management Systems. Elastic SQL database achieves top five score in all four use cases.

– Clustrix Recognized in Gartner Magic Quadrant for Operational Database Management Systems

– Learn why EDB is named a “Challenger” in the 2016 Gartner ODBMS Magic Quadrant

– DataStax Receives Highest Scores in 2 Use Cases in Gartner’s Critical Capabilities for Operational Database Management Systems

– Gartner Scores Oracle Highest In 3 of 4 Use Cases: Gartner Critical Capabilities for Operational Database Management Systems Report

Gartner Critical Capabilities For Operational Database Management Systems 2016 – Redis Labs Ranked Second Highest In 2/4 Categories (Link – Registration required)

 

Follow us on Twitter: @odbmsorg

##

AsterixDB: Better than Hadoop? Interview with Mike Carey
http://www.odbms.org/blog/2014/10/asterixdb-hadoop-interview-mike-carey/
Wed, 22 Oct 2014

“To distinguish AsterixDB from current Big Data analytics platforms – which query but don’t store or manage Big Data – we like to classify AsterixDB as being a “Big Data Management System” (BDMS, with an emphasis on the “M”)”–Mike Carey.

Mike Carey and his colleagues have been working on a new data management system for Big Data called AsterixDB.

The AsterixDB Big Data Management System (BDMS) is the result of approximately four years of R&D involving researchers at UC Irvine, UC Riverside, and Oracle Labs. The AsterixDB code base currently consists of over 250K lines of Java code that has been co-developed by project staff and students at UCI and UCR.

The AsterixDB project has been supported by the U.S. National Science Foundation as well as by several generous industrial gifts.

RVZ

Q1. Why build a new Big Data Management System?

Mike Carey: When we started this project in 2009, we were looking at a “split universe” – there were your traditional parallel data warehouses, based on expensive proprietary relational DBMSs, and then there was the emerging Hadoop platform, which was free but low-function in comparison and wasn’t based on the many lessons known to the database community about how to build platforms to efficiently query large volumes of data. We wanted to bridge those worlds, and handle “modern data” while we were at it, by taking into account the key lessons from both sides.

To distinguish AsterixDB from current Big Data analytics platforms – which query but don’t store or manage Big Data – we like to classify AsterixDB as being a “Big Data Management System” (BDMS, with an emphasis on the “M”). 
We felt that the Big Data world, once the initial Hadoop furor started to fade a little, would benefit from having a platform that could offer things like:

  • a flexible data model that could handle data scenarios ranging from “schema first” to “schema never”;
  • a full query language with at least the expressive power of SQL;
  • support for data storage, data management, and automatic indexing;
  • support for a wide range of query sizes, with query processing cost being proportional to the given query;
  • support for continuous data ingestion, hence the accumulation of Big Data;
  • the ability to scale up gracefully to manage and query very large volumes of data using commodity clusters; and,
  • built-in support for today’s common “Big Data data types”, such as textual, temporal, and simple spatial data.

So that’s what we set out to do.

Q2. What was wrong with the current Open Source Big Data Stack?

Mike Carey: First, we should mention that some reviewers back in 2009 thought we were crazy or stupid (or both) to not just be jumping on the Hadoop bandwagon – but we felt it was important, as academic researchers, to look beyond Hadoop and be asking the question “okay, but after Hadoop, then what?” 
We recognized that MapReduce was great for enabling developers to write massively parallel jobs against large volumes of data without having to “think parallel” – just focusing on one piece of data (map) or one key-sharing group of data (reduce) at a time. As a platform for “parallel programming for dummies”, it was (and still is) very enabling! It also made sense, for expedience, that people were starting to offer declarative languages like Pig and Hive, compiling them down into Hadoop MapReduce jobs to improve programmer productivity – raising the level much like what the database community did in moving to the relational model and query languages like SQL in the 70’s and 80’s.

One thing that we felt was wrong for sure in 2009 was that higher-level languages were being compiled into an assembly language with just two instructions, map and reduce. We knew from Ted Codd and relational history that more instructions – like the relational algebra’s operators – were important – and recognized that the data sorting that Hadoop always does between map and reduce wasn’t always needed.
Trying to simulate everything with just map and reduce on Hadoop made “get something better working fast” sense, but not longer-term technical sense. As for HDFS, what seemed “wrong” about it under Pig and Hive was its being based on giant byte stream files and not on “data objects”, which basically meant file scans for all queries and lack of indexing. We decided to ask “okay, suppose we’d known that Big Data analysts were going to mostly want higher-level languages – what would a Big Data platform look like if it were built ‘on purpose’ for such use, instead of having incrementally evolved from HDFS and Hadoop?”
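
The “two-instruction assembly language” point can be seen in a toy word count, where everything must be phrased as a map step, a sort/shuffle, and a reduce step. This is a minimal Python sketch of the programming model, not Hadoop code:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # map: emit (key, value) pairs, one input record at a time
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Hadoop always sorts by key between map and reduce -- even when, as noted
    # above, a particular computation would not actually need the sort.
    for word, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

if __name__ == "__main__":
    lines = ["big data big queries", "big clusters"]
    print(dict(reduce_phase(map_phase(lines))))   # {'big': 3, 'clusters': 1, ...}
```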

Again, our idea was to try and bring together the best ideas from both the database world and the distributed systems world. (I guess you could say that we wanted to build a Big Data Reese’s Cup…)

Q3. AsterixDB has been designed to manage vast quantities of semi-structured data. How do you define semi-structured data?

Mike Carey: In the late 90’s and early 2000’s there was a bunch of work on that – on relaxing both the rigid/flat nature of the relational model as well as the requirement to have a separate, a priori specification of the schema (structure) of your data. We felt that this flexibility was one of the things – aside from its “free” price point – drawing people to the Hadoop ecosystem (and the key-value world) instead of the parallel data warehouse ecosystem.
In the Hadoop world you can start using your data right away, without spending 3 months in committee meetings to decide on your schema and indexes and getting DBA buy-in. To us, semi-structured means schema flexibility, so in AsterixDB, we let you decide how much of your schema you have to know and/or choose to reveal up front, and how much you want to leave to be self-describing and thus allow it to vary later. And it also means not requiring the world to be flat – so we allow nesting of records, sets, and lists. And it also means dealing with textual data “out of the box”, because there’s so much of that now in the Big Data world.

Q4. The motto of your project is “One Size Fits a Bunch”. You claim that AsterixDB can offer better functionality, manageability, and performance than gluing together multiple point solutions (e.g., Hadoop + Hive + MongoDB). Could you please elaborate on this?

Mike Carey: Sure. If you look at current Big Data IT infrastructures, you’ll see a lot of different tools and systems being tied together to meet an organization’s end-to-end data processing requirements. In between systems and steps you have the glue – scripts, workflows, and ETL-like data transformations – and if some of the data needs to be accessible faster than a file scan, it’s stored not just in HDFS, but also in a document store or a key-value store.
This just seems like too many moving parts. We felt we could build a system that could meet more (not all!) of today’s requirements, like the ones I listed in my answer to the first question.
If your data is in fewer places or can take a flight with fewer hops to get the answers, that’s going to be more manageable – you’ll have fewer copies to keep track of and fewer processes that might have hiccups to watch over. If you can get more done in one system, obviously that’s more functional. And in terms of performance, we’re not trying to out-perform the specialty systems – we’re just trying to match them on what each does well. If we can do that, you can use our new system without needing as many puzzle pieces and can do so without making a performance sacrifice.
We’ve recently finished up a first comparison of how we perform on tasks that systems like parallel relational systems, MongoDB, and Hive can do – and things look pretty good so far for AsterixDB in that regard.

Q5. AsterixDB has been combining ideas from three distinct areas — semi-structured data management, parallel databases, and data-intensive computing. Could you please elaborate on that?

Mike Carey: Our feeling was that each of these areas has some ideas that are really important for Big Data. Borrowing from semi-structured data ideas, but also more traditional databases, leads you to a place where you have flexibility that parallel databases by themselves do not. Borrowing from parallel databases leads to scale-out that semi-structured data work didn’t provide (since scaling is orthogonal to data model) and with query processing efficiencies that parallel databases offer through techniques like hash joins and indexing – which MapReduce-based data-intensive computing platforms like Hadoop and its language layers don’t give you. Borrowing from the MapReduce world leads to the open-source “pricing” and flexibility of Hadoop-based tools, and argues for the ability to process some of your queries directly over HDFS data (which we call “external data” in AsterixDB, and do also support in addition to managed data).

Q6. How does the AsterixDB Data Model compare with the data models of NoSQL data stores, such as document databases like MongoDB and CouchBase, simple key/value stores like Riak and Redis, and column-based stores like HBase and Cassandra?

Mike Carey: AsterixDB’s data model is flexible – we have a notion of “open” versus “closed” data types – it’s a simple idea but it’s unique as far as we know. When you define a data type for records to be stored in an AsterixDB dataset, you can choose to pre-define any or all of the fields and types that objects to be stored in it will have – and if you mark a given type as being “open” (or let the system default it to “open”), you can store objects there that have those fields (and types) as well as any/all other fields that your data instances happen to have at insertion time.
Or, if you prefer, you can mark a type used by a dataset as “closed”, in which case AsterixDB will make sure that all inserted objects will have exactly the structure that your type definition specifies – nothing more and nothing less.
(We do allow fields to be marked as optional, i.e., nullable, if you want to say something about their type without mandating their presence.)

What this gives you is a choice!  If you want to have the total, last-minute flexibility of MongoDB or Couchbase, with your data being self-describing, we support that – you don’t have to predefine your schema if you use data types that are totally open. (The only thing we insist on, at the moment, is that every type must have a key field or fields – we use keys when sharding datasets across a cluster.)

Structurally, our data model was JSON-inspired – it’s essentially a schema language for a JSON superset – so we’re very synergistic with MongoDB or Couchbase data in that regard. 
On the other end of the spectrum, if you’re still a relational bigot, you’re welcome to make all of your data types be flat – don’t use features like nested records, lists, or bags in your record definitions – and mark them all as “closed” so that your data matches your schema. With AsterixDB, we can go all the way from traditional relational to “don’t ask, don’t tell”. As for systems with BigTable-like “data models” – I’d personally shy away from calling those “data models”.
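
As a rough illustration of the “open” versus “closed” distinction, here is a small validation sketch written in Python rather than AsterixDB’s own DDL; the field names are invented and the check is simplified (a fuller version would also require non-optional declared fields to be present):

```python
def validate(record, declared_fields, open_type=True, key_field="id"):
    """Toy check mirroring AsterixDB-style open/closed datatypes.

    declared_fields maps field name -> expected Python type for the
    pre-declared part of the schema; undeclared extra fields are allowed
    only when the type is open.
    """
    if key_field not in record:
        raise ValueError("every type needs a key field (used for sharding)")
    for name, expected in declared_fields.items():
        if name in record and not isinstance(record[name], expected):
            raise TypeError(f"field {name} has the wrong type")
    extras = set(record) - set(declared_fields)
    if extras and not open_type:
        raise ValueError(f"closed type rejects undeclared fields: {extras}")
    return True

declared = {"id": int, "name": str}
validate({"id": 1, "name": "a", "nickname": "b"}, declared, open_type=True)    # ok
# validate({"id": 1, "name": "a", "nickname": "b"}, declared, open_type=False) # raises
```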

Q7. How do you handle horizontal scaling? And vertical scaling?

Mike Carey: We scale out horizontally using the same sort of divide-and-conquer techniques that have been used in commercial parallel relational DBMSs for years now, and more recently in Hadoop as well. That is, we horizontally partition both data (for storage) and queries (when processed) across the nodes of commodity clusters. Basically, our innards look very much like those of systems such as Teradata or Parallel DB2 or PDW from Microsoft – we use join methods like parallel hybrid hash joins, and we pay attention to how data is currently partitioned to avoid unnecessary repartitioning – but we have a data model that’s way more flexible. And we’re open source and free….
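
A toy illustration of the divide-and-conquer idea: both inputs are hash-partitioned on the join key so each partition can be joined independently (and, in a real system, on a different node). This is a sketch of the general technique, not AsterixDB’s Hyracks operators:

```python
from collections import defaultdict

def hash_partition(rows, key, n_parts):
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[hash(row[key]) % n_parts].append(row)   # same key -> same partition
    return parts

def hash_join(left, right, key):
    # Classic in-memory hash join within a single partition.
    table = defaultdict(list)
    for row in left:
        table[row[key]].append(row)
    return [{**l, **r} for r in right for l in table.get(r[key], [])]

def parallel_join(left, right, key, n_parts=4):
    lp = hash_partition(left, key, n_parts)
    rp = hash_partition(right, key, n_parts)
    out = []
    for i in range(n_parts):        # each iteration could run on its own node
        out.extend(hash_join(lp[i], rp[i], key))
    return out
```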

We scale vertically (within one node) in two ways. First of all, we aren’t memory-dependent in the way that many of the current Big Data Analytics solutions are; it’s not the case that you have to buy a big enough cluster so that your data, or at least your intermediate results, can be memory-resident.
Instead, our physical operators (for joins, sorting, aggregation, etc.) all spill to disk if needed – so you can operate on Big Data partitions without getting “out of memory” errors. The other way is that we allow nodes to hold multiple partitions of data; that way, one can also use multi-core nodes effectively.

Q8. What performance figures do you have for AsterixDB?

Mike Carey: As I mentioned earlier, we’ve completed a set of initial performance tests on a small cluster at UCI with 40 cores and 40 disks, and the results of those tests can be found in a recently published AsterixDB overview paper that’s hanging on our project web site’s publication page (http://asterixdb.ics.uci.edu/publications.html).
We have a couple of other performance studies in flight now as well, and we’ll be hanging more information about those studies in the same place on our web site when they’re ready for human consumption. There’s also a deeper dive paper on the AsterixDB storage manager that has some performance results regarding the details of scaling, indexing, and so on; that’s available on our web site too. The quick answer to “how does AsterixDB perform” is that we’re already quite competitive with other systems that have narrower feature sets – which we’re pretty proud of.

Q9. You mentioned support for continuous data ingestion. How does that work?

Mike Carey: We have a special feature for that in AsterixDB – we have a built-in notion of Data Feeds that are designed to simplify the lives of users who want to use our system for warehousing of continuously arriving data.
We provide Data Feed adaptors to enable outside data sources to be defined and plugged in to AsterixDB, and then one can “connect” a Data Feed to an AsterixDB data set and the data will start to flow in. As the data comes in, we can optionally dispatch a user-defined function on each item to do any initial information extraction/annotation that you want.  Internally, this creates a long-running job that our system monitors – if data starts coming too fast, we offer various policies to cope with it, ranging from discarding data to sampling data to adding more UDF computation tasks (if that’s the bottleneck). More information about this is available in the Data Feeds tech report on our web site, and we’ll soon be documenting this feature in the downloadable version of AsterixDB. (Right now it’s there but “hidden”, as we have been testing it first on a set of willing UCI student guinea pigs.)
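
The overload policies described above (discard, sample, or add more UDF workers) can be sketched as a simple ingestion loop. The adaptor interface and policy names here are illustrative assumptions, not AsterixDB’s actual API:

```python
import random

def ingest(feed, store, udf=None, policy="sample", high_watermark=10_000):
    """Toy continuous-ingestion loop with a back-pressure policy.

    feed   : iterable of incoming items (stand-in for a Data Feed adaptor)
    store  : object with an append(item) method and a backlog() size
    udf    : optional per-item extraction/annotation function
    policy : what to do when ingestion falls behind
    """
    for item in feed:
        if store.backlog() > high_watermark:
            if policy == "discard":
                continue                          # drop data to keep up
            if policy == "sample" and random.random() < 0.9:
                continue                          # keep roughly 10% of items
            # an "elastic" policy would instead add more UDF workers here
        store.append(udf(item) if udf else item)
```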

Q10. What is special about the AsterixDB Query Language? Why not use SQL?

Mike Carey: When we set out to define the query language for AsterixDB, we decided to define our own new language – since it seemed like everybody else was doing that at the time (witness Pig, Jaql, HiveQL, etc.) – one aimed at our data model. 
SQL doesn’t handle nested or open data very well, so extending ANSI/ISO SQL seemed like a non-starter – that was also based on some experience working on SQL3 in the late 90’s. (Take a look at Oracle’s nested tables, for example.) Based on our team’s backgrounds in XML querying, we actually started there – XQuery was developed by a team of really smart people from the SQL world (including Don Chamberlin, father of SQL) as well as from the XML world and the functional programming world. We took XQuery and then started throwing the stuff overboard that wasn’t needed for JSON or that seemed like a poor feature that had been added for XPath compatibility.
What remained was AQL, and we think it’s a pretty nice language for semistructured data handling. We periodically do toy with the notion of adding a SQL-like re-skinning of AQL to make SQL users feel more at home – and we may well do that in the future – but that would be different than “real SQL”. (The N1QL effort at Couchbase is doing something along those lines, language-wise, as an example. The SQL++ design from UCSD is another good example there.)

Q11. What level of concurrency and recovery guarantees does AsterixDB offer?

Mike Carey: We offer transaction support that’s akin to that of current NoSQL stores. That is, we promise record-level ACIDity – so inserting or deleting a given record will happen as an atomic, durable action. However, we don’t offer general-purpose distributed transactions. We support an arbitrary number of secondary indexes on data sets, and we’ll keep all the indexes on a data set transactionally consistent – that we can do because secondary index entries for a given record live in the same data partition as the record itself, so those transactions are purely local.

Q12. How does AsterixDB compare with Hadoop? What about Hadoop Map/Reduce compatibility?

Mike Carey: I think we’ve already covered most of that – Hadoop MapReduce is an answer to low-level “parallel programming for dummies”, and it’s great for that – and languages on top like Pig Latin and HiveQL are better programming abstractions for “data tasks” but have runtimes that could be much better. We started over, much as the recent flurry of Big Data analytics platforms are now doing (e.g., Impala, Spark, and friends), but with a focus on scaling to memory-challenging data sizes. We do have a MapReduce compatibility layer that goes along with our Hyracks runtime layer – Hyracks is the name of our internal dataflow runtime layer – but our MapReduce compatibility layer is not related to (or connected to) the AsterixDB system.

Q13. How does AsterixDB relate to Hadapt?

Mike Carey: I’m not familiar with Hadapt, per se, but I read the HadoopDB work that fed into it. 
We’re architecturally very different – we’re not Hadoop-based at all – I’d say that HadoopDB was more of an expedient hybrid coupling of Hadoop and databases, to get some of the indexing and local query efficiency of an existing database engine quickly in the Hadoop world. We were thinking longer term, starting from first principles, about what a next-generation BDMS might look like. AsterixDB is what we came up with.

Q14. How does AsterixDB relate to Spark?

Mike Carey: Spark is aimed at fast Big Data analytics – its data is coming from HDFS, and the task at hand is to scan and slice and dice and process that data really fast. Things like Shark and SparkSQL give users SQL query power over the scanned data, but Spark in general is really catching fire, it appears, due to its applicability to Big Machine Learning tasks. In contrast, we’re doing Big Data Management – we store and index and query Big Data. It would be a very interesting/useful exercise for us to explore how to make AsterixDB another source where Spark computations can get input data from and send their results to, as we’re not targeting the more complex, in-memory computations that Spark aims to support.

Q15. How can others contribute to the project?

Mike Carey: We would love to see this start happening – and we’re finally feeling more ready for that, and even have some NSF funding to make AsterixDB something that others in the Big Data community can utilize and share. 
(Note that our system is Apache-style open source licensed, so there are no “gotchas” lurking there.)
Some possibilities are:

(1) Others can start to use AsterixDB to do real exploratory Big Data projects, or to teach about Big Data (or even just semistructured data) management. Each time we’ve worked with trial users we’ve gained some insights into our feature set, our query optimizations, and so on – so this would help contribute by driving us to become better and better over time.

(2) Folks who are studying specific techniques for dealing with modern data – e.g., new structures for indexing spatio-temporal-textual data – might consider using AsterixDB as a place to try out their new ideas.
(This is not for the meek, of course, as right now effective contributors need to be good at reading and understanding open source software without the benefit of a plethora of internal design documents or other hints.) We also have some internal wish lists of features we wish we had time to work on – some of which are even doable from “outside”, e.g., we’d like to have a much nicer browser-based workbench for users to use when interacting with and managing an AsterixDB cluster.

(3) Students or other open source software enthusiasts who download and try our software and get excited about it – who then might want to become an extension of our team – should contact us and ask about doing so. (Try it first, though!)  We would love to have more skilled hands helping with fixing bugs, polishing features, and making the system better – it’s tough to build robust software in a university setting, and we would especially welcome contributors from companies.

Thanks very much for this opportunity to share what we’ve being doing!

————————
Michael J. Carey is a Bren Professor of Information and Computer Sciences at UC Irvine.
Before joining UCI in 2008, Carey worked at BEA Systems for seven years and led the development of BEA’s AquaLogic Data Services Platform product for virtual data integration. He also spent a dozen years teaching at the University of Wisconsin-Madison, five years at the IBM Almaden Research Center working on object-relational databases, and a year and a half at e-commerce platform startup Propel Software during the infamous 2000-2001 Internet bubble. Carey is an ACM Fellow, a member of the National Academy of Engineering, and a recipient of the ACM SIGMOD E.F. Codd Innovations Award. His current interests all center around data-intensive computing and scalable data management (a.k.a. Big Data).

Resources

– AsterixDB Big Data Management System (BDMS): Downloads, Documentation, Asterix Publications.

Related Posts

Hadoop at Yahoo. Interview with Mithun Radhakrishnan. ODBMS Industry Watch, September 21, 2014

On the Hadoop market. Interview with John Schroeder. ODBMS Industry Watch, June 30, 2014

Follow ODBMS.org on Twitter: @odbmsorg
##

Interview with John Partridge, President & CEO of Tokutek, Inc.
http://www.odbms.org/blog/2014/05/interview-john-partridge-president-ceo-tokutek/
Fri, 09 May 2014

“As the database gets used, shards can grow at an uneven rate and one shard might carry a majority of the load. MongoDB corrects this by balancing shards, but because of MongoDB’s lack of concurrency this operation can stall the database unacceptably.”–John Partridge.

I have interviewed John Partridge, President & CEO of Tokutek, Inc.

RVZ

Q1. Tokutek recently announced to have eliminated performance issues of MongoDB sharding. What was the problem?

John Partridge: The problem occurs after a shard is created. As the database gets used, shards can grow at an uneven rate and one shard might carry a majority of the load. MongoDB corrects this by balancing shards, but because of MongoDB’s lack of concurrency this operation can stall the database unacceptably (see the benchmark).

Q2. For what kind of application users of MongoDB experienced these bottlenecks?

John Partridge: Users who need to scale out, and rely on sharding to do so.

Q3. What is the solution you propose to this problem?

John Partridge: Use TokuMX 1.4. It is a complete distribution and a drop-in replacement for the distribution you get from MongoDB/10gen.

Q4. How TokuMX v1.4 is able to allow shards to be balanced and added without disruption for a NoSQL solution that scales up and scales out?

John Partridge: TokuMX replaces the B-tree indexing used in MongoDB with patented Fractal Tree indexing, which allows for significantly better concurrency (among other things). Because of the improved concurrency, data can be copied from one shard to another, then deleted from the source, without unnecessary locking.

Q5. What is the difference in performance of your solution with respect to the basic MongoDB? What “basic” MongoDB do you use for this comparison?

John Partridge: “Basic” MongoDB is the distro that you get from MongoDB (10gen). We typically see 20x performance improvements, but as you might imagine, it depends on the application. Because TokuMX offers document-level locking rather than database-level locking, TokuMX shines when there are significant reads *and* writes.
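
The concurrency difference can be pictured in terms of lock granularity: one lock for the whole database serializes all writers, while per-document locks let writers on different documents proceed in parallel. This is only a conceptual sketch, not TokuMX or MongoDB internals:

```python
import threading
from collections import defaultdict

DB_LOCK = threading.Lock()                  # database-level locking
DOC_LOCKS = defaultdict(threading.Lock)     # document-level locking
DOCS = {}                                   # (real code would also protect this table)

def write_db_level(doc_id, value):
    with DB_LOCK:                           # every write blocks every other write
        DOCS[doc_id] = value

def write_doc_level(doc_id, value):
    with DOC_LOCKS[doc_id]:                 # writes to different docs don't block each other
        DOCS[doc_id] = value
```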

Q6. How do you compare TokuMX with other distribution of MongoDB, such as the one of 10gen (now MongoDB)?

John Partridge: There are three major differences: 20x performance improvement, 90% smaller database size (we compress the data), and support for ACID transactions. Look at the bottom of http://www.tokutek.com/products/tokumx-for-mongodb/ for more information on each of these benefits.
——————————
John Partridge
Mr. Partridge brings over twenty years of experience in the software industry as a developer, investor, and entrepreneur. He joins Tokutek from StreamBase Systems which John co-founded with database pioneer Dr. Michael Stonebraker. He started his career as a software developer at Microsoft Corporation where he co-authored Excel v1.0. He later worked as a venture capitalist at Accel Partners and the Summit Accelerator Fund where he specialized in investing in early stage internet infrastructure and enterprise software companies. John holds an A.B. in Applied Mathematics / Computer Science from Harvard University and an MBA from the Stanford University Graduate School of Business.

Related Post

Big Data for Genomic Sequencing. Interview with Thibault de Malliard.
ODBMS Industry Watch, March 25, 2013

Scaling MySQL and MariaDB to TBs: Interview with Martín Farach-Colton.
ODBMS Industry Watch, October 8, 2012

Resources

TokuMX vs. MongoDB : Sharding Balancer Performance Posted on February 16, 2014 by Tim Callaghan, Tokutek.

What’s new in TokuMX 1.4, Part 4: Smaller, faster sharded clusters. Posted on February 20, 2014 by Leif Walsh, Tokutek.

MongoDB web site

Follow ODBMS.org on Twitter: @odbmsorg
##

On SQL and NoSQL. Interview with Dave Rosenthal http://www.odbms.org/blog/2014/03/dave-rosenthal/ http://www.odbms.org/blog/2014/03/dave-rosenthal/#comments Tue, 18 Mar 2014 16:04:45 +0000 http://www.odbms.org/blog/?p=2932

“Despite the obvious shared word ‘transaction’ and the canonical example of a database transaction which modifies multiple bank accounts, I don’t think that database transactions are particularly relevant to financial applications.”–Dave Rosenthal.

On SQL and NoSQL, I have interviewed Dave Rosenthal, CEO of FoundationDB.

RVZ

Q1. What are the suggested criteria for users when they need to trade durability for lower latency, higher throughput and write availability?

Dave Rosenthal: There is a tradeoff between commit latency and durability–especially in distributed databases. At one extreme, a database client can just report success immediately (without even talking to the database server) and buffer the writes in the background. Obviously, that hides latency well, but you could lose a suffix of transactions. At the other extreme, you can replicate writes across multiple machines, fsync them on each of the machines, and only then report success to the client.

FoundationDB is optimized to provide good performance in its default setting, which is the safest end of that tradeoff.

Usually, if you want some reasonable durability guarantee, you are talking about a commit latency of a small constant factor times the network latency. So, the real latency issues come with databases spanning multiple data centers. In that case, FoundationDB users are able to choose whether they want durability guarantees in all data centers before commit (increasing commit latencies), which is our default setting, or whether they would like to relax durability guarantees by returning a commit when the data is fsync'd to disk in just one data center.

All that said, in general, we think that the application is usually a more appropriate place to try to hide latency than the database.
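
This latency-versus-durability dial is not unique to FoundationDB; most distributed databases expose it in some form. As an illustration only (using MongoDB write concerns via pymongo, not FoundationDB's API; the host and database names are assumptions), the two extremes described above look roughly like this:

# Sketch of the durability/latency trade-off using MongoDB write concerns (pymongo).
# This illustrates the general dial discussed above; it is not FoundationDB's API.
from pymongo import MongoClient, WriteConcern

client = MongoClient("mongodb://localhost:27017")

# Fast but risky: fire-and-forget, the driver does not wait for acknowledgement,
# so a crash can silently lose the most recent writes (a "suffix of transactions").
fast = client.get_database("demo", write_concern=WriteConcern(w=0))
fast.events.insert_one({"kind": "click"})

# Slower but safe: wait until a majority of replicas have journaled the write to disk
# before reporting success -- roughly the "replicate and fsync everywhere" extreme.
safe = client.get_database("demo", write_concern=WriteConcern(w="majority", j=True))
safe.events.insert_one({"kind": "payment"})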

Q2. Justin Sheehy of Basho in an interview said [1] “I would most certainly include updates to my bank account as applications for which eventual consistency is a good design choice. In fact, bankers have understood and used eventual consistency for far longer than there have been computers in the modern sense”. What is your opinion on this?

Dave Rosenthal: Yes, we totally agree with Justin. Despite the obvious shared word ‘transaction’ and the canonical example of a database transaction which modifies multiple bank accounts, I don’t think that database transactions are particularly relevant to financial applications. In fact, true ACID transactions are way more broadly important than that. They give you the ability to build abstractions and systems that you can provide guarantees about.
As Michael Cahill says in his thesis which became the SIGMOD paper of the year: “Serializable isolation enables the development of a complex system by composing modules, after verifying that each module maintains consistency in isolation.” It’s this incredibly important ability to compose that makes a system with transactions special.

Q3. FoundationDB claims to provide full ACID transactions. How do you do that?

Dave Rosenthal: In the same basic way as many other transactional databases do. We use a few strategies that tend to work well in distributed systems, such as optimistic concurrency and MVCC. We also, of course, have had to solve some of the fundamental challenges associated with distributed systems and all of the crazy things that can happen in them. Honestly, it's not very hard to build a distributed transactional database. The hard part is making it work gracefully through failure scenarios and making it run fast.

Q4. Is this similar to Oracle NoSQL?

Dave Rosenthal: Not really. Both Oracle NoSQL and FoundationDB provide an automatically-partitioned key-value store with fault tolerance. Both also have a concept of ordering keys (for efficient range operations) though Oracle NoSQL only provides ordering “within a Major Key set”. So, those are the similarities, but there are a bunch of other NoSQL systems with all those properties. The huge difference is that FoundationDB provides for ACID transactions over arbitrary keys and ranges, while Oracle NoSQL does not.
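
For readers who want to see what "ACID transactions over arbitrary keys and ranges" looks like in code, here is a minimal sketch using FoundationDB's Python bindings (the fdb package); the API version number, cluster setup and key naming are illustrative assumptions.

# Sketch: a multi-key transaction and a range read with the FoundationDB Python bindings.
import fdb

fdb.api_version(300)   # illustrative; must be called once before opening the database
db = fdb.open()        # uses the default cluster file

@fdb.transactional
def move_item(tr, src_key, dst_key, value):
    # Both operations commit atomically: either the item appears under dst_key
    # and disappears from src_key, or nothing changes at all.
    tr[dst_key] = value
    del tr[src_key]

@fdb.transactional
def list_prefix(tr, prefix):
    # A consistent snapshot of every key starting with `prefix`.
    return [(kv.key, kv.value) for kv in tr.get_range_startswith(prefix)]

move_item(db, b"inbox/42", b"archive/42", b"hello")
print(list_prefix(db, b"archive/"))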

Q5. How would you compare your product offering with respect to NoSQL data stores, such as CouchDB, MongoDB, Cassandra and Riak, and NewSQL such as NuoDB and VoltDB?

Dave Rosenthal: The most obvious response for the NoSQL data stores would be “we have ACID transactions, they don’t”, but the more important difference is in philosophy and strategy.

Each of those products exposes a single data model and interface. Maybe two. We are pursuing a fundamentally different strategy.
We are building a storage substrate that can be adapted, via layers, to provide a variety of data models, APIs, and true flexibility.
We can do that because of our transactional capabilities. CouchDB, MongoDB, Cassandra and Riak all have different APIs, and we talk to companies that run all of those products side-by-side. The NewSQL database players are also offering a single data model, albeit a very popular one: SQL. FoundationDB offers an ever-increasing number of data models through its "layers", currently including several popular NoSQL data models, with SQL being the next big one to hit. Our philosophy is that you shouldn't have to increase the complexity of your architecture by adopting a new NoSQL database each time your engineers need access to a new data model.

Q6. Cloud computing and open source: How does it relate to FoundationDB?

Dave Rosenthal: Cloud computing: FoundationDB has been designed from the beginning to run well in cloud environments that make use of large numbers of commodity machines connected through a network. Probably the most important aspect of a distributed database designed for cloud deployment is exceptional fault tolerance under very harsh and strange failure conditions – the kind of exceptionally unlikely things that can only happen when you have many machines working together with components failing unpredictably. We have put a huge amount of effort into testing FoundationDB in these grueling scenarios, and feel very confident in our ability to perform well in these types of environments. In particular, we have users running FoundationDB successfully on many different cloud providers, and we’ve seen the system keep its guarantees under real-world hardware and network failure conditions experienced by our users.

Open source: Although FoundationDB's core data storage engine is closed source, our layer ecosystem is open source. The core data storage engine has a very simple feature set and is very difficult to modify properly while maintaining correctness; layers, by contrast, are feature rich and, because they are stateless, much easier to create and modify, which makes them well suited to third-party contributions.

Q7. Please give some examples of use cases where FoundationDB is currently in use. Is FoundationDB in use for analyzing Big Data as well?

Dave Rosenthal: Some examples: user data, metadata, user social graphs, geo data, data accessed via ORMs using the SQL layer, metrics collection, etc.

We’ve mostly focused on operational systems, but a few of our customers have built what I would call “big data” applications, which I think of as analytics-focused. The most common use case has been for collecting and analyzing time-series data. FoundationDB is strongest in big data applications that call for lots of random reads and writes, not just big table scans—which many systems can do well.

Q8. Rick Cattell said in a recent interview [2]: "there aren't enough open source contributors to keep projects competitive in features and performance, and the companies supporting the open source offerings will have trouble making enough money to keep the products competitive themselves". What is your opinion on this?

Dave Rosenthal: People have great ideas for databases all the time. New data models, new query languages, etc.
If nothing else, the NoSQL experiment that we've all been a part of over the past few years has shown us the appetite for data models suited to specific problems. People with these ideas would love to be able to build such tools, open source them, etc.
The problem is that the checklist of practical considerations for a database is huge: Fault tolerance, scalability, a backup solution, management and monitoring, ACID transactions, etc. Add those together and even the simplest concept sounds like a huge project.

Our vision at FoundationDB is that we have done the hard work to build a storage substrate that simultaneously solves all those tricky practical problems. Our engine can be used to quickly build a database layer for any particular application that inherits all of those solutions and their benefits, like scalability, fault tolerance and ACID compliance.

Q9. Nick Heudecker of Gartner predicts that [3] "going forward, we see the bifurcation between relational and NoSQL DBMS markets diminishing over time". What is your take on this?

Dave Rosenthal: I do think that the lines between SQL and NoSQL will start to blur, and I believe that we are leading that charge. We acquired another database startup last year, called Akiban, that builds an amazing SQL database engine.
In 2014 we’ll be bringing that engine to market as a layer running on top of FoundationDB. That will be a true ANSI SQL database operating as a module directly on top of a transactional “NoSQL” engine, inheriting the operational benefits of our core storage engine – scalability, fault tolerance, ease of operation.

When you run multiple SQL layer modules, you can point many of them at the same key-space in FoundationDB and it's as if they are all part of the same database, with ACID transactions enforced across the separate SQL layer processes.
It’s very cool. Of course, you can even run the SQL layer on a FoundationDB cluster that’s also supporting other data models, like graph or document. That’s about as blurry as it gets.

———–
Dave Rosenthal is CEO of FoundationDB. Dave started his career in games, building a 3D real-time strategy game with a team of high-school friends that won the 1st annual Independent Games Festival. Previously, Dave was CTO at Visual Sciences, a pioneering web-analytics company that is now part of Adobe. Dave has a degree in theoretical computer science from MIT.

Related Posts
Operational Database Management Systems. Interview with Nick Heudecker, ODBMS Industry Watch December 16, 2013

Follow ODBMS.org on Twitter: @odbmsorg

 

Big Data Analytics at Thomson Reuters. Interview with Jochen L. Leidner http://www.odbms.org/blog/2013/11/big-data-analytics-at-thomson-reuters-interview-with-jochen-l-leidner/ http://www.odbms.org/blog/2013/11/big-data-analytics-at-thomson-reuters-interview-with-jochen-l-leidner/#comments Fri, 15 Nov 2013 09:21:06 +0000 http://www.odbms.org/blog/?p=2779

“My experience overall with almost all open-source tools has been very positive: open source tools are very high quality, well documented, and if you get stuck there is a helpful and responsive community on mailing lists or Stack Exchange and similar sites.” —Dr. Jochen L. Leidner.

I wanted to know how Thomson Reuters uses Big Data. I have interviewed Dr. Jochen L. Leidner, Lead Scientist, of the London R&D at Thomson Reuters.

RVZ

Q1. What is your current activity at Thomson Reuters?
Jochen L. Leidner: For the most part, I carry out applied research in information access, and that's what I have been doing for quite a while. After five years with the company – I joined from the University of Edinburgh, where I had been a postdoctoral Royal Society of Edinburgh Enterprise Fellow half a decade ago – I am currently a Lead Scientist with Thomson Reuters, where I am building up a newly-established London site as part of our Corporate Research & Development group (by the way: we are hiring!). Before that, I held research and innovation-related roles in the USA and in Switzerland.
Let me say a few words about Thomson Reuters before I go more into my own activities, just for background. Thomson Reuters has around 50,000 employees in over 100 countries and sells information to professionals in many verticals, including finance & risk, legal, intellectual property & scientific, tax & accounting.
Our headquarters are located at 3 Times Square in New York City, NY, USA.
Most people know our REUTERS brand from reading their newspapers (thanks to our highly regarded 3,000+ journalists at news desks in about 100 countries, often putting their lives at risk to give us reliable reports of the world's events) or receiving share price information on the radio or TV, but as a company we are also involved in areas as diverse as weather prediction (as the weather influences commodity prices), determining the citation impact of academic journals (which helps publishers sell their academic journals to librarians), and predicting Nobel prize winners.
My research colleagues and I study information access and especially means to improve it, using natural language processing, information extraction, machine learning, search engine ranking, recommendation systems and similar areas of investigation.
We carry out a lot of contract research for internal business units (especially if external vendors do not offer what we need, or if we believe we can build something internally that is lower cost and/or better suited to our needs), feasibility studies to de-risk potential future products that are considered, and also more strategic, blue-sky research that anticipates future needs. As you would expect, we protect our findings and publish them in the usual scientific venues.

Q2. Do you use Data Analytics at Thomson Reuters and for what?
Jochen L. Leidner: Note that terms like “analytics” are rather too broad to be useful in many instances; but the basic answer is “yes”, we develop, apply internally, and sell as products to our customers what can reasonably be called solutions that incorporate “data analytics” functions.
One example capability that we developed is the recommendation engine CaRE (Al-Kofahi et al., 2007), which was built by our group, Corporate Research & Development, led by our VP of Research & Development, Khalid al-Kofahi. It is a bit similar in spirit to Amazon's well-known book recommendations. We used it to recommend legal documents to attorneys, as a service to supplement the user's active search on our legal search engine with "see also…"-type information. This is an example of a capability developed in-house that made it into a product, and it is very popular.
Thomson Reuters sells information services, often under a subscription model, and for that it is important to have metrics available that indicate usage, in order to inform our strategy. So another example of data analytics is that we study how document usage can inform personalization and ranking, and from where documents are accessed, and we use this to plan network bandwidth and to determine caching server locations.
A completely different example is citation information: Since 1969, when our esteemed colleague Eugene Garfield (he is now officially retired, but is still active) came up with the idea of citation impact, our Scientific business division is selling the journal citation impact factor – an analytic that can be used as a proxy for the importance of a journal (and, by implication, as an argument to a librarian to purchase a subscription of that journal for his or her university library).
Or, to give another example from the financial markets area, we sell predictive models (StarMine) that estimate how likely it is that a given company will go bankrupt within the next six months.

Q3. Do you have Big Data at Thomson Reuters? Could you please give us some examples of Big Data Use Cases at your company?
Jochen L. Leidner: For most definitions of "big", yes we do. Consider that we operate a news organization, which generates tens of thousands of news reports daily (if we count all languages together). Then we have photo journalists who create large numbers of high-quality, professional photographs to document current events visually, and videos comprising audio-visual storytelling and interviews. We further collect all major laws, statutes, regulations and legal cases in major jurisdictions around the world, and we enrich the data with our own meta-data using both manual expertise and automatic classification and tagging tools to enhance findability. We hold collections of scientific articles and patents in full text and abstracts.
We gather, consolidate and distribute price information for financial instruments from hundreds of exchanges around the world. We sell real-time live feeds as well as access to decades of these time series for the purpose of back-testing trading strategies.

Q4. What “value” can be derived by analyzing Big Data at Thomson Reuters?
Jochen L. Leidner: This is the killer question: we take the “value” very much without the double quotes – big data analytics lead to cost savings as well as generate new revenues in a very real, monetary sense of the word “value”. Because our solutions provide what we call “knowledge to act” to our customers, i.e., information that lets them make better decisions, we provide them with value as well: we literally help our customers save the cost of making a wrong decision.

Q5. What are the main challenges for big data analytics at Thomson Reuters ?
Jochen L. Leidner: I'd say absolute volume, growth, data management/integration, rights management, and privacy are some of the main challenges.
One obvious challenge is the size of the data. It’s not enough to have enough persistent storage space to keep it, we also need backup space, space to process it, caches and so on – it all adds up. Another is the growth and speed of that growth of the data volume. You can plan for any size, but it’s not easy to adjust your plans if unexpected growth rates come along.
Another challenge that we must look into is the integration between our internal data holdings, external public data (like the World Wide Web, or commercial third-party sources – like Twitter – which play an important role in the modern news ecosystem), and customer data (customers would like to see their own internal, proprietary data be inter-operable with our data). We need to respect the rights associated with each data set, as we deal with our own data, third party data and our customers’ data. We must be very careful regarding privacy when brainstorming about the next “big data analytics” idea – we take privacy very seriously and our CPO joined us from a government body that is in charge of regulating privacy.

Q6. How do you handle the Big Data Analytics “process” challenges with deriving insight?
Jochen L. Leidner: Usually analytics projects happen as an afterthought to leverage existing data created in a legacy process, which means not a lot of change of process is needed at the beginning. This situation changes once there is resulting analytics output, and then the analytics-generating process needs to be integrated with the previous processes.
Even with new projects, product managers still don’t think of analytics as the first thing to build into a product for a first bare-bones version, and we need to change that; instrumentation is key for data gathering so that analytics functionality can build on it later on.
In general, analytics projects follow a process of (1) capturing data, (2) aligning data from different sources (e.g., resolving when two objects are the same), (3) pre-processing or transforming the data into a form suitable for analysis, (4) building some model and (5) understanding the output (e.g. visualizing and sharing the results). This five-step process is followed by an integration phase into the production process to make the analytics repeatable.
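
The five steps map naturally onto everyday tooling. Here is a toy sketch in Python (pandas and scikit-learn); the file names, columns and model choice are invented for illustration and are not Thomson Reuters' actual pipeline:

# Toy sketch of the five-step analytics process described above.
import pandas as pd
from sklearn.linear_model import LinearRegression

# (1) capture: load raw data from two sources
usage = pd.read_csv("usage_log.csv")      # e.g. document views per customer
accounts = pd.read_csv("accounts.csv")    # e.g. customer master data

# (2) align: resolve records that refer to the same customer
df = usage.merge(accounts, on="customer_id", how="inner")

# (3) pre-process: clean and transform into a modelling-friendly form
df = df.dropna(subset=["views", "tenure_years"])

# (4) model: fit a simple predictive model
model = LinearRegression().fit(df[["tenure_years"]], df["views"])

# (5) understand the output: inspect / report the result
print("views per extra year of tenure:", model.coef_[0])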

Q7. What kind of data management technologies do you use? What is your experience in using them?
Jochen L. Leidner: For storage and querying, we use relational database management systems to store valuable data assets and also use NoSQL databases such as CouchDB and MongoDB in projects where applicable. We use homegrown indexing and retrieval engines as well as open source libraries and components like Apache Lucene, Solr and ElasticSearch.
We use parallel, distributed computing platforms such as Hadoop/Pig and Spark to process data, and virtualization to manage isolation of environments.
We sometimes push out computations to Amazon’s EC2 and storage to S3 (but this can only be done if the data is not sensitive). And of course in any large organization there is a lot of homegrown software around. My experience overall with almost all open-source tools has been very positive: open source tools are very high quality, well documented, and if you get stuck there is a helpful and responsive community on mailing lists or Stack Exchange and similar sites.

Q8. Do you handle un-structured data? If yes, how?
Jochen L. Leidner: We curate lots of our own unstructured data from scratch, including the thousands of news stories that our over three thousand REUTERS journalists write every day, and we also enrich the unstructured content produced by others: for instance, in the U.S. we manually enrich legal cases with human-written summaries (written by highly-qualified attorneys) and classify them based on a proprietary taxonomy, which informs our market-leading WestlawNext product (see this CIKM 2011 talk), our search engine for the legal profession, whose members need to find exactly the right cases. Over the years, we have developed proprietary content repositories to manage content storage, meta-data storage, indexing and retrieval. One of our challenges is to unite our data holdings, which often come from various acquisitions that use their own technology.
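
Automatic classification against a taxonomy, as described above, is conceptually close to standard supervised text classification. Here is a toy sketch with scikit-learn; the documents, labels and model choice are invented for illustration and this is not the actual Thomson Reuters pipeline:

# Toy sketch of taxonomy-style text classification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_docs = [
    "The court granted summary judgment on the contract claim.",
    "The patent covers a method for encoding video streams.",
    "The defendant was convicted of securities fraud.",
]
train_labels = ["contracts", "intellectual-property", "securities"]

# TF-IDF features plus a linear classifier is a common, simple baseline.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_docs, train_labels)

print(clf.predict(["A dispute over a licensing agreement for software patents."]))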

Q9. Do you use Hadoop? If yes, what is your experience with Hadoop so far?
Jochen L. Leidner: We have two clusters featuring Hadoop, Spark and GraphLab, and we are using them intensively. Hadoop is mature as an implementation of the MapReduce computing paradigm, but has its shortcomings because it is not a true operating system (but probably it should be) – for instance regarding the stability of HDFS, its distributed file system. People have started to realize there are shortcomings, and have started to build other systems around Hadoop to fill gaps, but these are still early stage, so I expect them to become more mature first and then there might be a wave of consolidation and integration. We have definitely come a long way since the early days.
Typically, on our clusters we run batch information extraction tasks, data transformation tasks and large-scale machine learning processes to train classifiers and taggers for these. We are also inducing language models and training recommender systems on them. Since many training algorithms are iterative, Spark can win over Hadoop for these, as it keeps models in RAM.
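
The advantage of keeping the working set in RAM across iterations can be seen in a minimal PySpark sketch of gradient descent over a cached data set; the HDFS path and the toy linear model are illustrative assumptions:

# Minimal PySpark sketch: iterative gradient descent over a cached RDD.
from pyspark import SparkContext

sc = SparkContext(appName="iterative-sketch")

# Each input line holds "x y"; cache() keeps the parsed points in cluster memory,
# so every iteration below avoids re-reading and re-parsing from HDFS.
points = (sc.textFile("hdfs:///data/points.txt")
            .map(lambda line: tuple(float(v) for v in line.split()))
            .cache())
n = points.count()

w0, w1 = 0.0, 0.0
for _ in range(10):   # each pass reuses the cached RDD instead of hitting disk
    g0, g1 = points.map(
        lambda p: ((w0 + w1 * p[0] - p[1]),           # gradient w.r.t. the intercept
                   (w0 + w1 * p[0] - p[1]) * p[0])    # gradient w.r.t. the slope
    ).reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    w0, w1 = w0 - 0.1 * g0 / n, w1 - 0.1 * g1 / n

print("fitted line:", w0, w1)
sc.stop()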

Q10. Hadoop is a batch processing system. How do you handle Big Data Analytics in real time (if any)?
Jochen L. Leidner: The dichotomy is perhaps between batch processing and dialog processing, whereas real-time (“hard”, deterministic time guarantee for a system response) goes hand-in-hand with its opposite non-real time, but I think what you are after here is that dialog systems have to be responsive. There is no one-size-fits-all method for meeting (near) real-time requirements or for making dialog systems more responsive; a lot of analytics functions require the analysis of more than just a few recent data-points, so if that’s what is needed it may take its time. But it is important that critical functions, such as financial data feeds, are delivered as fast as possible – micro-seconds matter here. The more commoditized analytics functions become, the faster they need to be available to retain at least speed as a differentiator.

Q11 Cloud computing and open source: Do you they play a role at Thomson Reuters? If yes, how?
Jochen L. Leidner: Regarding cloud computing, we use cloud services internally and as part of some of our product offerings. However, there are also reservations – a lot of our applications contain information that is too sensitive to entrust to a third party, especially as many cloud vendors cannot give you a commitment with respect to hosting (or not hosting) in particular jurisdictions. Therefore, we operate our own set of data centers, and some of these operate as what has become known as "private clouds", retaining the benefit of the management outsourcing abstraction, but within our large organization rather than pushing it out to a third party. Of course, the notion of private clouds takes the cloud idea ad absurdum quite a bit, because it sacrifices the economy of scale, but having more control is an advantage.
Open source plays a huge role at Thomson Reuters – we rely on many open source components, libraries and systems, especially under the MIT, BSD, LGPL and Apache licenses. For example, some of our tagging pipelines rely on Apache UIMA, which is a contribution originally developed at IBM, and which has seen contributions by researchers from all around the world (from Darmstadt to Boulder). To date, we have not been very good about opening up our own services in the form of source code, but we are trying to change that now, and we have just launched a new corporation-wide process for open-sourcing software. We also have an internal sharing repository, "Corporate Source", but in my personal view the audience in any single company is too small – open source (like other recent waves such as clouds or crowdsourcing) needs Internet scale to work, and Web search engines for the various projects to be discovered.

Q12 What are the main research challenges ahead? And what are the main business challenges ahead?
Jochen L. Leidner: Some of the main business challenges are the cost pressure that some of our customers face, and the increasing availability of low-cost or free-of-charge information sources, i.e. the commoditization of information. I would caution here that whereas the amount of information available for free is large, this in itself does not help you if you have a particular problem and cannot find the information that helps you solve it, either because the solution is not there despite the size, or because it is there but findability is low. Further challenges include information integration, making systems ever more adaptive (but only to the extent it is useful), and supporting better personalization. Having said this, sometimes systems need to be run in a non-personalized mode (e.g. in the field of e-discovery, you need a certain consistency, namely that the same legal search system retrieves the same things today and tomorrow, and for different parties).

Q13 Anything else you wish to add?
Jochen L. Leidner: I would encourage decision makers of global companies not to get misled by fast-changing “hype” language: words like “cloud”, “analytics” and “big data” are too general to inform a professional discussion. Putting my linguist’s hat on, I can only caution about the lack of precision inherent in marketing language’s use in technological discussions: for instance, cluster computing is not the same as grid computing. And what looks like “big data” today we will almost certainly carry around with us on a mobile computing device tomorrow. Also, buzz words like “Big Data” do not by themselves solve any problem – they are not magic bullets. To solve any problem, look at the input data, specify the desired output data, and think hard about whether and how you can compute the desired result – nothing but “good old” computer science.

Dr. Jochen L. Leidner is a Lead Scientist with Thomson Reuters, where he is building up the corporation’s London site of its Research & Development group.
He holds Master’s degrees in computational linguistics, English language and literature and computer science from Friedrich-Alexander University Erlangen-Nuremberg and in computer speech, text and internet technologies from the University of Cambridge, as well as a Ph.D. in information extraction from the University of Edinburgh (“Toponym resolution in Text”). He is recipient of the first ACM SIGIR Doctoral Consortium Award, a Royal Society of Edinburgh Enterprise Fellowship in Electronic Markets, and two DAAD scholarships.
Prior to his research career, he has worked as a software developer, including for SAP AG (basic technology and knowledge management) as well as for startups.
He led the development teams of multiple question answering systems, including the QED system at Edinburgh and Alyssa at Saarland University, the latter of which ranked third at the last DARPA/NIST TREC open-domain question answering factoid track evaluation.
His main research interests include information extraction, question answering and search, geo-spatial grounding, applied machine learning with a focus on methodology behind research & development in the area of information access.
In 2013, Dr. Leidner has also been teaching an invited lecture course “Language Technology and Big Data” at the University of Zurich, Switzerland.

—————————-
Related Posts

Data Analytics at NBCUniversal. Interview with Matthew Eric Bassett. September 23, 2013

On Linked Data. Interview with John Goodwin. September 1, 2013

Big Data Analytics at Netflix. Interview with Christos Kalantzis and Jason Brown. February 18, 2013

Resources

ODBMS.org free resources on Big Data and Analytical Data Platforms:
Blog Posts | Free Software| Articles | Lecture Notes | PhD and Master Thesis|

Follow us on Twitter: @odbmsorg

On multi-model databases. Interview with Martin Schönert and Frank Celler. http://www.odbms.org/blog/2013/10/on-multi-model-databases-interview-with-martin-schonert-and-frank-celler/ http://www.odbms.org/blog/2013/10/on-multi-model-databases-interview-with-martin-schonert-and-frank-celler/#comments Mon, 28 Oct 2013 09:21:59 +0000 http://www.odbms.org/blog/?p=2743

“We want to prevent a deadlock where the team is forced to switch the technology in the middle of the project because it doesn’t meet the requirements any longer.”–Martin Schönert and Frank Celler.

On “multi-model” databases, I have interviewed Martin Schönert and Frank Celler, founders and creators of the open source ArangoDB.

RVZ

Q1. What is ArangoDB and for what kind of applications is it designed for?

Frank Celler: ArangoDB is a multi-model mostly-memory database with a flexible data model for documents and graphs. It is designed as a “general purpose database”, offering all the features you typically need for modern web applications.

ArangoDB is supposed to grow with the application—the project may start as a simple single-server prototype, nothing you couldn’t do with a relational database equally well. After some time, some geo-location features are needed and a shopping cart requires transactions. ArangoDB’s graph data model is useful for the recommendation system. The smartphone app needs a lean API to the back-end—this is where Foxx, ArangoDB’s integrated Javascript application framework, comes into play.
The overall idea is: “We want to prevent a deadlock where the team is forced to switch the technology in the middle of the project because it doesn’t meet the requirements any longer.”

ArangoDB is open source (Apache 2 licence)—you can get the source code at GitHub or download the precompiled binaries from our website.

Though ArangoDB takes a universal approach, there are edge cases where we don't recommend it. For instance, ArangoDB doesn't compete with massively distributed systems like Cassandra, with thousands of nodes and many terabytes of data.

Q2. What’s so special about the ArangoDB data model?

Martin Schönert: ArangoDB is a multi-model database. It stores documents in collections. A specialized binary data file format is used for disk storage. Documents that have similar structure (i.e., that have the same attribute names and attribute types) can share their structural information. The structure (called “shape”) is saved just once, and multiple documents can re-use it by storing just a pointer to their “shape”.
In practice, documents in a collection are likely to be homogenous, and sharing the structure data between multiple documents can greatly reduce disk storage space and memory usage for documents.

Q3. Who is currently using ArangoDB for what?

Frank Celler: ArangoDB is open source. You don't have to register to download the source code or precompiled binaries. As a user, you can get support via the Google Group, GitHub's issue tracker and even via Twitter. We are very approachable, which is an essential part of the project. The drawback is that we don't really know in detail what people are doing with ArangoDB. We have noticed an exponentially increasing number of downloads over the last months.
We are aware of a broad range of use cases: a CMS, a high-performance logging component, a geo-coding tool, an annotation system for animations, just to name a few. Other interesting use cases are single page apps or mobile apps via Foxx, ArangoDB’s application framework. Many of our users have in-production experience with other NoSQL databases, especially the leading document stores.

Q4. Could you motivate your design decision to use Google’s V8 JavaScript engine?

Martin Schönert: ArangoDB uses Google’s V8 engine to execute server-side JavaScript functions. Users can write server-side business logic in JavaScript and deploy it in ArangoDB. These so-called “actions” are much like stored procedures living close to the data.
For example, with actions it is possible to perform cascading deletes/updates, assign permissions, and do additional calculations and modifications to the data.
ArangoDB also allows users to map URLs to custom actions, making it usable as an application server that handles client HTTP requests with user-defined business logic.
We opted for Javascript as it meets our requirements for an “embedded language” in the database context:
• Javascript is widely used. Regardless in which “back-end language” web developers write their code, almost everybody can code also in Javascript.
• Javascript is effective and still modern.
We chose Google's V8 as it is currently the fastest, most stable JavaScript interpreter available.

Q5. How do you query ArangoDB if you don’t want to use JavaScript?

Frank Celler: ArangoDB offers a couple of options for getting data out of the database. It has a REST interface for CRUD operations and also allows “querying by example”. “Querying by example” means that you create a JSON document with the attributes you are looking for. The database returns all documents which look like the “example document”.
Expressing complex queries as JSON documents can become a tedious task—and it’s almost impossible to support joins following this approach. We wanted a convenient and easy-to-learn way to execute even complex queries, not involving any programming as in an approach based on map/reduce. As ArangoDB supports multiple data models including graphs, it was neither sufficient to stick to SQL nor to simply implement UNQL. We ended up with the “ArangoDB query language” (AQL), a declarative language similar to SQL and Jsoniq. AQL supports joins, graph queries, list iteration, results filtering, results projection, sorting, variables, grouping, aggregate functions, unions, and intersections.
Of course, ArangoDB also offers drivers for all major programming languages. The drivers wrap the mentioned query options following the paradigm of the programming language and/or frameworks like Ruby on Rails.
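
As an example of what the driver story looks like in practice, here is a hedged sketch that runs an AQL query from Python via the community python-arango driver; the connection details, collection and bind variables are assumptions, and the driver's API may differ between versions:

# Sketch: executing an AQL query from Python with the python-arango driver.
from arango import ArangoClient

client = ArangoClient(hosts="http://localhost:8529")       # illustrative endpoint
db = client.db("_system", username="root", password="")    # illustrative credentials

# AQL supports filtering, sorting and bind variables, as described above.
cursor = db.aql.execute(
    "FOR u IN users FILTER u.age >= @min SORT u.name RETURN u.name",
    bind_vars={"min": 21},
)
for name in cursor:
    print(name)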

Q6. How do you perform graph queries? How does this differ from systems such as Neo4J?

Frank Celler: SQL can’t cope with the required semantics to express the relationships between graph nodes, so graph databases have to provide other ways to access the data.
The first option is to write small programs, so-called "path traversals." In ArangoDB, you use JavaScript; in neo4j, Java. The general approach is very similar.
Programming gives you all the freedom to do whatever comes to your mind. That’s good. For standard use cases, programming might be too much effort. So, both ArangoDB and neo4j offer a declarative language—neo4j has “Cypher,” ArangoDB the “ArangoDB Query Language.” Both also implement the blueprints standard so that you can use “Gremlin” as query-language inside Java. We already mentioned that ArangoDB is a multi-model database: AQL covers documents and graphs, it provides support for joins, lists, variables, and does much more.

The following example is taken from the neo4j website:

“For example, here is a query which finds a user called John in an index and then traverses the graph looking for friends of John’s friends (though not his direct friends) before returning both John and any friends-of-friends that are found.

START john=node:node_auto_index(name = 'John')
MATCH john-[:friend]->()-[:friend]->fof
RETURN john, fof ”

The same query looks in AQL like this:

FOR t IN TRAVERSAL(users, friends, "users/john", "outbound",
{minDepth: 2}) RETURN t.vertex._key

The result is:
[ "maria", "steve" ]

You see that Cypher describes patterns while AQL describes joins. Internally, ArangoDB has a library of graph functions—those functions return collections of paths, which can then be used in joins.

Q7. How did you design ArangoDB to scale out and/or scale up? Please give us some detail.

Martin Schönert: Solid state disks are becoming more and more of a commodity. ArangoDB's append-only design is a perfect fit for such SSDs, allowing for data sets which are much bigger than main memory but still fit onto a solid state disk.
ArangoDB supports master/slave replication in version 1.4, which will be released in the next few days (a beta has been available for some time). On the one hand this provides easy fail-over setups; on the other hand it provides a simple way to scale read performance.
Sharding is implemented in version 2.0. This enables you to store even bigger data-sets and increase the write-performance. As noted before, however, we see our main application when scaling to a low number of nodes. We don’t plan to optimize ArangoDB for massive scaling with hundreds of nodes. Plain key/value stores are much more usable in such scenarios.

Q8. What is ArangoDB’s update and delete strategy?

Martin Schönert: ArangoDB versions prior to 1.3 store all revisions of documents in an append-only fashion; the objects will never be overwritten. The latest version of a document is available to the end user.

With the current version 1.3, ArangoDB introduces transactions and lays the technical foundation for replication and sharding. In the course of those highly wanted features comes "real" MVCC with concurrent writes.

In databases implementing an append-only strategy, obsolete versions of a document have to be removed to save space. As we already mentioned, ArangoDB is multi-threaded: The so-called compaction is automatically done in the background in a different thread without blocking reads and writes.

Q9. How does ArangoDB differ from other NoSQL data stores such as Couchbase and MongoDB and graph data stores such as Neo4j, to name a few?

Frank Celler: ArangoDB's feature scope is driven by the idea of giving the developer everything he needs to master typical tasks in a web application—in a way that is both convenient and technically sophisticated.
From our point of view it's the combination of features and the quality of the product which sets ArangoDB apart: ArangoDB not only handles documents but also graphs.
ArangoDB is extensible via Javascript and Ruby. Enclosed with ArangoDB you get “Foxx”. Foxx is an integrated application framework ideal for lean back-ends and single page Javascript applications (SPA).
Multi-collection transactions are useful not only for online banking and e-commerce but they become crucial in any web app in a distributed architecture. Here again, we offer the developers many choices. If transactions are needed, developers can use them.
If, on the other hand, the problem requires a higher performance and less transaction-safety, developers are free to ignore multi-collections transactions and to use the standard single-document transactions implemented by most NoSQL databases.
Another unique feature is ArangoDB’s query language AQL—it makes querying powerful and convenient. For simple queries, we offer a simple query-by-example interface. Then again, AQL enables you to describe complex filter conditions and joins in a readable format.

Q10. Could you summarize the main results of your benchmarking tests?

Frank Celler: To quote Jan Lenhardt from CouchDB: "Nosql is not about performance, scaling, dropping ACID or hating SQL—it is about choice. As nosql databases are somewhat different it does not help very much to compare the databases by their throughput and choose the one which is fastest. Instead—the user should carefully think about his overall requirements and weigh the different aspects. Massively scalable key/value stores or memory-only system[s] can achieve much higher benchmarks. But your aim is [to] provide a much more convenient system for a broader range of use-cases—which is fast enough for almost all cases."
Anyway, we have done a lot of performance tests and are more than happy with the results. ArangoDB 1.3 inserts up to 140,000 documents per second. We are going to publish the whole test suite including a test runner soon, so everybody can try it out on his own hardware.

We have also tested the space usage: storing 3.5 million AQL search queries takes about 200 MB in MongoDB with pre-allocation, compared to 55 MB in ArangoDB. This is the benefit of implementing the concept of shapes.

Q11. ArangoDB is open source. How do you motivate and involve the open source development community to contribute to your projects rather than any other open source NoSQL?

Frank Celler: To be honest: The contributors come of their own volition and until now we didn’t have to “push” interested parties. Obviously, ArangoDB is fascinating enough, even though there are more than 150 NoSQL databases available to choose from.

It all started when Ruby inventor Yukihiro “Matz” Matsumoto tweeted on ArangoDB and recommended it to the community. Following this tweet, ArangoDB’s first local fan base was established in Japan—and we learned a lot about the limits of automatic translation from Japanese tweets to English and the other way around ;-).

In our daily "work" with our community, we try to be as open and supportive as possible. The core developers communicate directly, and with short response times, with people who have ideas or need help, through Google Groups or GitHub. We take special care of contributors: we discuss future features with them and announce upcoming changes early so that API contributors can keep their implementations up to date.

——————————
Martin Schönert
Martin is the origin of many fancy ideas in ArangoDB. As chief architect he is responsible for the overall architecture of the system, bringing in his experience from more than 20 years in IT as developer, architect, project manager and entrepreneur.
Martin started his career as scientist at the technical university of Aachen after earning his degree in Mathematics. Later he worked as head of product development (Team4 Systemhaus), Director IT (OnVista Technologies) and head of division at Deutsche Post.
Martin has been working with relational and non-relational databases (e.g. a torrid love-hate relationship with the granddaddy of all non-relational databases: Lotus Notes) for the largest part of his professional life.
When no database did what he needed, he wrote his own: one for extremely high update rates and another for distributed caching.

Frank Celler
Frank is both an entrepreneur and a backend developer, and has been developing mostly-memory databases for two decades. He is the lead developer of ArangoDB and co-founder of triAGENS. Besides that, Frank organizes Cologne's NoSQL user group and NoSQL conferences, and speaks at developer conferences.
Frank studied in Aachen and London and received a Ph.D. in Mathematics. Prior to founding triAGENS, the company behind ArangoDB, he worked for several German tech companies as a consultant, team lead and developer.
His technical focus is C and C++; recently he gained some experience with Ruby when integrating mruby into ArangoDB.

Resources

– The stable version (1.3 branch) of ArangoDB can be downloaded here.
ArangoDB on Twitter
ArangoDB Google Group
ArangoDB questions on StackOverflow
Issue Tracker at Github

Related Posts

On geo-distributed data management — Interview with Adam Abrevaya. October 19, 2013

On Big Data and NoSQL. Interview with Renat Khasanshyn. October 7, 2013

On NoSQL. Interview with Rick Cattell. August 19, 2013

On Big Graph Data. August 6, 2012

Follow ODBMS.org on Twitter: @odbmsorg

##

On Big Data and NoSQL. Interview with Renat Khasanshyn. http://www.odbms.org/blog/2013/10/on-big-data-and-nosql-interview-with-renat-khasanshyn/ http://www.odbms.org/blog/2013/10/on-big-data-and-nosql-interview-with-renat-khasanshyn/#comments Mon, 07 Oct 2013 19:50:27 +0000 http://www.odbms.org/blog/?p=2636

“The most important thing is to focus on a task you need to solve instead of a technology” –Renat Khasanshyn.

I have interviewed Renat Khasanshyn, Founder and Chairman of Altoros.
Renat is a NoSQL and Big Data specialist.

RVZ

Q1. In your opinion, what are the most popular players in the NoSQL market?

Khasanshyn: I think MongoDB is definitely one of the key players in the NoSQL market. This database has a long history – for this kind of product, I mean – and good commercial support. For many people this database became the first mass-market NoSQL store. I expect MongoDB to become for NoSQL what MySQL is for relational databases. The second position I would give to Cassandra. It has a great architecture and enables building clusters with geographically dispersed nodes. To me that seems absolutely amazing. In addition, this database is often chosen by big companies that need a large, highly available cluster.

Q2. How do you evaluate and compare different NoSQL Databases?

Khasanshyn: Thank you for an interesting question. How to choose a database? Which one is the best? These are the main questions for any company that wants to try a NoSQL solution. Definitely, for some cases it may be quite easy to select a suitable NoSQL store. However, very often it is not enough just to know the customer's business goals. When we suggest a particular database we take into consideration the following factors: the business issues the NoSQL store should solve, database read/write speed, availability, scalability, and many other important indicators. Sometimes we use a hybrid solution that may include several NoSQL databases.
Or we can see that a relational database will be a good match for a case. The most important thing is to focus on a task you need to solve instead of a technology.

We think that good scalability, performance, and ease of administration are the most important criteria for choosing a NoSQL database. These are the key factors that we take into consideration. Of course, there are some additional criteria that sometimes may be even more important than those mentioned above. To somewhat simplify the choice of a database for our engineers and for many other people, we have been carrying out, for two years now, independent tests that evaluate the performance of different NoSQL databases. Although aimed at comparing performance, these investigations also touch on consistency, scalability, and configuration issues. You can take a look at our most recent article on this subject on Network World. Some new research on this subject is to be published in a couple of months.

Q3. Which NoSQL databases did you evaluate so far, and what are the results did you obtain?

Khasanshyn: We used a great variety of NoSQL databases, for instance, Cassandra, HBase, MongoDB, Riak, Couchbase Server, Redis, etc. for our researches and real-life projects. From this experience, we learned that one should be very careful when choosing a database. It is better to spend more time on architecture design and make some changes to the project in the beginning rather than come across a serious issue in the future.

Q4. For which projects did use NoSQL databases and for what kind of problems?

Khasanshyn: It is hardly possible to name a project for which a NoSQL database would be useless, except for a blog or a home page. As the main use cases for NoSQL stores I would mention the following tasks:

● collecting and analyzing large volumes of data
● scaling large historical databases
● building interactive applications for which performance and fast response time to users’ actions are crucial

The major "drawback" of NoSQL architecture is the absence of an ACID engine that verifies transactions. It means that financial operations or user registration are better handled by an RDBMS like Oracle or MS SQL. However, the absence of ACID allows for significant acceleration and decentralization of NoSQL databases, which are their major advantages. The bottom line: non-relational databases are much faster in general, and they pay for it with a fraction of their reliability. Is it a good tradeoff? It depends on the task.

Q5. What do you see happening with NoSQL, going forward?

Khasanshyn: It’s quite difficult to make any predictions, but we guess that NoSQL and relational databases will become closer. For instance, NewSQL solutions took good scalability from NoSQL and a query language from the SQL products.
Probably, a kind of a standard query language based on SQL or SQL-like language will soon appear for NoSQL stores. We are also looking forward to improved consistency, or to be more precise, better predictability of NoSQL databases. In other words, NoSQL solutions should become more mature. We will also see some market consolidation. Smaller players will form alliances or quit the game. Leaders will take bigger part of the market. We will most likely see a couple of acquisitions. Overall, it will be easier to work with NoSQL and to choose a right solution out of the available options.

Q6. What do you think is still needed for big data analytics to be really useful for the enterprise?

Khasanshyn: It is all about people and their skills. Storage is relatively cheap and available. Variety of databases is enormous and it helps solving virtually any task. Hadoop is stable. Majority of software is open source-based, or at least doesn’t cost a fortune. But all these components are useless without data scientists who can do modeling and simulations on the existing data. As well as without engineers who can efficiently employ the toolset. As well as without managers who understand the outcomes of data handling revolution that happened just recently. When we have these three types of people in place, then we will say that enterprises are fully equipped for making an edge in big data analytics.

Q7. Hadoop is still quite new for many enterprises, and different enterprises are at different stages in their Hadoop journey. When you speak with your customers what are the typical use cases and requirements they have?

Khasanshyn: I agree with you. Some customers just make their first steps with Hadoop while others need to know how to optimize their Hadoop-based systems. Unfortunately, the second group of customers is much smaller. I can name the following typical tasks our customers have:

● To process historical data that has been collected over a long period of time. Some time ago, users were unable to process large volumes of unstructured data due to financial and technical limitations. Now Hadoop can do it at a moderate cost and in a reasonable time.

● To develop a system for data analysis based on Hadoop. Once an hour, the system builds patterns of typical user behavior on a Web site. These patterns help to react to users' actions in real time, for instance, allowing an action or temporarily blocking some actions because they are not typical of this user. The data is collected continuously and analyzed at the same time, so the system can rapidly respond to changes in user behavior.

● To optimize data storage. It is interesting that in some cases HDFS can replace a database, especially when the database was used for storing raw input data. Such projects do not need an additional database level.

I should say that our customers have similar requirements. Apart from solving a particular business task, they need a certain level of performance and data consistency.

Q8. In your opinion is Hadoop replacing the role of OLAP (online analytical processing) in preparing data to answer specific questions?

Khasanshyn: In a few words, my answer is yes. Some specific features of Hadoop make it possible to prepare data for future analysis quickly and at a moderate cost. In addition, this technology can work with unstructured data. However, I do not think it will happen very soon. There are many OLAP systems and they solve their tasks, better or worse. In general, enterprises are very reluctant to change anything. In addition, replacing the OLAP tools requires additional investments. The good news is that we don't have to choose one or the other. Hadoop can be used as a pre-processor of the data for OLAP analysis, and analysts can keep working with the familiar tools.

Q9. How do you categorize the various stages of the Hadoop usage in the enterprises?

Khasanshyn: I would name the following stages of Hadoop usage:

1. Development of prototypes to check out whether Hadoop is suitable for their tasks
2. Using Hadoop in combination with other tools for storing and processing data of some business units
3. Implementation of a centralized enterprise data storage system and gradual integration of all business units into it

Q10. Data Warehouse vs Big “Data Lake”. What are the similarities and what are the differences?

Khasanshyn: Even though Big “Data Lake” is a metaphor, I do not really like it.
This name suggests that it is something limited, isolated from other systems. I would rather call this concept a "Big Data Ocean", because the data added to the system can interact with the surrounding systems. In my opinion, data warehouses are a bit outdated. At an earlier stage, such systems enabled companies to aggregate a lot of data in one central storage and arrange this data, all with acceptable speed. Now there are a lot of cheap storage solutions, so we can keep enormous volumes of data and process it much faster than with data warehouses.

The possibility to store large volumes of data is a common feature of data warehouses and a Big "Data Lake". With a flexible architecture and broad capabilities for data analysis and discovery, a Big "Data Lake" provides a wider range of business opportunities. A modern company should adjust to changes very fast. The structure that was good yesterday may become a limitation today.

Q11. In your opinion, is there a technology which is best suited to build a Big Data Analytics Data Platform? If yes, which one?

Khasanshyn: As I have already said, there is no "magic bullet" that can cure every disease. There is no "Universal Big Data Analytics Data Platform" that fits every case; everything depends on the requirements of a particular business. A system of this kind should have the following features:

● A Hadoop-based system for storing and processing data
● An operational database that contains the most recent data (possibly raw data) to be analyzed. A NoSQL solution can be used for this case.
● A database for storing predicted indicators. Most probably, it should be a relational data store.
● A system that allows for creating data analysis algorithms. The R language can be used to complete this task.
● A report building system that provides access to the data. For instance, there are good options like Tableau or Pentaho.

Q12. What about elastic computing in the Cloud? How does it relate to Big Data Analytics?

Khasanshyn: In my opinion, cloud computing became the force that raised the Big Data wave. Elastic computing lets us use exactly the amount of computing resources we need, and it also reduces the cost of maintaining a large infrastructure.

There is a natural connection between elastic computing and Big Data analytics. It is quite typical for us to have to process data only periodically, for instance once a day. To solve this task, we can deploy a new cluster, or simply scale up an existing Hadoop cluster, in the cloud environment. Temporarily enlarging the Hadoop cluster increases the speed of data processing, so the task is executed faster, and afterwards we can stop the cluster or reduce its size. I would even say that cloud technologies are a must-have component of a Big Data analysis system.
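A hedged sketch of this "temporarily enlarge, then shrink" pattern is shown below, assuming the cluster runs on Amazon EMR and that boto3 is installed with AWS credentials configured. The cluster and instance-group IDs are placeholders, and a different cloud or a self-managed Hadoop installation would need a different API.

import boto3

emr = boto3.client("emr", region_name="us-east-1")

CLUSTER_ID = "j-XXXXXXXXXXXXX"      # placeholder: an existing EMR cluster
CORE_GROUP_ID = "ig-XXXXXXXXXXXX"   # placeholder: its core instance group

def resize_core_group(instance_count):
    # Ask EMR to grow or shrink the core instance group of the cluster.
    emr.modify_instance_groups(
        ClusterId=CLUSTER_ID,
        InstanceGroups=[{"InstanceGroupId": CORE_GROUP_ID,
                         "InstanceCount": instance_count}],
    )

resize_core_group(20)   # scale up before the daily batch job
# ... submit and wait for the Hadoop job here ...
resize_core_group(5)    # scale back down once the job has finished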

——-

Renat Khasanshyn, Founder and Chairman, Altoros.
Renat is founder & CEO of Altoros, and Venture Partner at Runa Capital. Renat helps define Altoros’s strategic vision and its role in the Big Data, Cloud Computing, and PaaS ecosystems. Renat is a frequent conference and meetup speaker on these topics.
Under his supervision Altoros has been servicing such innovative companies as Couchbase, RightScale, Canonical, DataStax, Joyent, Nephoscale, and NuoDB.
In the past, Renat was selected as a finalist for the Emerging Executive of the Year award by the Massachusetts Technology Leadership Council and once won an IBM Business Mashup Challenge. Prior to founding Altoros, Renat was VP of Engineering for Tampa-based insurance company PriMed. Renat is also founder of Apatar, an open source data integration toolset, founder of the Silicon Valley NewSQL User Group and co-founder of the Belarusian Java User Group.

——————–
Related Posts

On NoSQL. Interview with Rick Cattell. August 19, 2013

On Oracle NoSQL Database –Interview with Dave Segleau. July 2, 2013

On Big Data and Hadoop. Interview with Paul C. Zikopoulos. June 10, 2013
————————-

Resources

Evaluating NoSQL Performance, Sergey Bushik, Altoros (Slideshare)

“A Vendor-independent Comparison of NoSQL Databases: Cassandra, HBase, MongoDB, Riak”. Altoros (registration required)

ODBMS.org free resources on Big Data and Analytical Data Platforms:
Blog Posts | Free Software| Articles | Lecture Notes | PhD and Master Thesis|

ODBMS.org free resources on NoSQL Data Stores:
Blog Posts | Free Software | Articles, Papers, Presentations| Documentations, Tutorials, Lecture Notes | PhD and Master Thesis

ODBMS.org free resources on Cloud Data Stores:
Blog Posts | Lecture Notes| Articles and Presentations| PhD and Master Thesis|

———————–

Follow ODBMS.org on Twitter: @odbmsorg

##

On Linked Data. Interview with John Goodwin. http://www.odbms.org/blog/2013/09/on-linked-data-interview-with-john-goodwin/ Sun, 01 Sep 2013 15:27:54 +0000

“Semantic technologies may be unfamiliar, but when you have used them for a while you will realise they are no harder than many other technologies…in fact I would argue they are easier.”– John Goodwin.

On the topics of Semantic web technologies, ontology engineering, and linked data, I have interviewed John Goodwin. John is Principal Scientist in the Research Department of Ordnance Survey, which is Great Britain’s National Mapping Authority.

RVZ

Q1. You are a senior data scientist at the Ordnance Survey, Great Britain’s national mapping agency. What is your role there?

John Goodwin: I am a Principal Scientist in the Research Department of Ordnance Survey, which is Great Britain’s National Mapping Authority [note: we are authorities now…not agencies].

I have worked in research for Ordnance Survey for over 10 years now, and my research has mainly focused on semantic web technologies, ontology engineering and linked data. The Principal Scientist role is a fairly new one for me, and as part of this role I am now responsible for a stream of research work around data management, data delivery and web services. This involves looking at new and novel technologies that ensure we have the correct infrastructure and data models to meet the challenges of the future. Furthermore, it involves investigating new ways we can serve our data to the end customer.

Q2. Do you have a Big Data problem at the Ordnance Survey? Could you please give us some examples of Big Data Use Cases?

John Goodwin: Hmmm, that is debatable. Ordnance Survey certainly has big ‘data problems’ but I don’t know if they qualify as ‘big data’ problems. I have heard Big Data defined as any data that won’t fit into Excel (a definition I personally hate), and if that is the case then we certainly have ‘Big Data’. Ordnance Survey currently stores information about half a billion topographic features and 27.5 million geocoded addresses (with around 500,000 changes a year). So we may not have the sheer volume of data that some folk have, but I believe the combination of volume and complexity means that performing analysis or running queries over this data would certainly be a ‘Big Data’ problem.
For example, if you wanted to calculate the number of postboxes in Scotland, or find the total length of all roads in Great Britain, you could be waiting some time using traditional database solutions.

Q3. The vision of the Semantic Web is one where web pages contain self-describing data that machines will be able to navigate as easily as humans do now. What are the main benefits? Who could profit most from the Semantic Web?

John Goodwin: I think an immediate benefit is the ability to provide more structured data to search engines so that they can provide better search services. Structured web content means more meaningful search results and offers new ways to summarise and present pages in a search engine.

Q4. Who is currently using Semantic Web technologies and how? Could you please give us some examples of current commercial projects?

John Goodwin: One interesting example is a company called Garlik (now part of Experian). Garlik provides services to protect people from identity theft and financial fraud. They use semantic web technologies to integrate a number of different datasets, and to provide a flexible way to add new datasets, so they can run queries across these datasets to find potential victims more easily. The BBC are great users of linked data technology and used triplestore technologies as part of the content management systems behind their World Cup and Olympics websites. Again, the flexibility of the technology and the ability to link data across the whole of the BBC proved invaluable.

We are using linked data technologies at Ordnance Survey in research projects to look at ways of integrating our data with third-party data.

The major search engines are backing an initiative called schema.org, which provides a unified schema for structured data in web content; as mentioned above, this has the potential to provide a richer search experience.

Q5. Do you use Linked Data? What are the main benefits of Linked Data in your opinion?

John Goodwin: I am a big supporter of linked data, and this has been the focus of my work for the last few years. I have used it in research projects and also produced the Ordnance Survey linked data.

Linked data is great for data integration – a common data language makes it easy (or rather easier) to bring a number of disparate datasets together. It is also more flexible than traditional relational database technologies. Like other NoSQL technologies, linked data can be seen as ‘schemaless’ to some extent. This means that if you want to change the data model by, say, adding new attributes or properties, it is very easy to do so. Furthermore, and this is a more personal thing, I find graphs to be the most natural way to think about data. It feels far more intuitive, and I have to say I think querying graph data using SPARQL is far easier than querying relational data using SQL (especially if you have a lot of joins).

Q6. What data management technologies are best suited to model and query Linked Open Data?

John Goodwin: Linked Open Data is built around W3C standards such as RDF (Resource Description Framework) – which is the data language of choice on the linked data web (although some people like to debate whether or not RDF is needed for linked data). RDF is to the web of data as HTML is to the web of documents…or at least that is how I see it. RDF has its own query language called SPARQL. A large number of programming libraries (e.g. Jena) are emerging to handle RDF. Furthermore, RDF can be stored in databases called triplestores, and there are many triplestores to choose from. I am not in a position to advocate one triplestore over another, but there are a large number of great technologies being developed by SMEs and more traditional database vendors alike. Furthermore, there are a number of open source options. We have experimented with a number of them at Ordnance Survey.
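For readers unfamiliar with the stack, here is a minimal sketch using the Python library rdflib (a counterpart to Jena for Java): it loads a few triples from an inline Turtle snippet and runs a SPARQL query over them. The http://example.org/ vocabulary and the toy data are made up for illustration and have nothing to do with Ordnance Survey's actual schemas.

# A minimal rdflib sketch: parse a tiny Turtle document and query it.
from rdflib import Graph

TURTLE = """
@prefix ex: <http://example.org/> .

ex:Southampton a ex:City ;
    ex:within ex:Hampshire .
ex:Hampshire a ex:County ;
    ex:within ex:England .
"""

g = Graph()
g.parse(data=TURTLE, format="turtle")

# A SPARQL property path walks any number of ex:within hops in one pattern.
query = """
PREFIX ex: <http://example.org/>
SELECT ?place WHERE { ?place ex:within+ ex:England . }
"""
for row in g.query(query):
    print(row.place)

The same query could be sent unchanged to a triplestore exposing a SPARQL endpoint; the property path ex:within+ expresses a multi-hop traversal that would need recursive joins in SQL, which is the kind of difference behind the earlier remark about SPARQL versus SQL.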

Q7. How do you integrate data from different sources that are not in Linked Open Data format (e.g. relational, raw data, etc.)?

John Goodwin: So far by converting the underlying data to RDF. Most relational data is a simple script away from being RDF. Tools do exist to help ‘triplify’ the data, but if I am honest I find that most of the time it is easier to write a quick Python script to do the job.
OpenRefine is a useful tool that lets you clean up CSV data and has a plugin that allows export of the data to RDF. OpenRefine additionally has the benefit of being able to work with reconciliation APIs. If a linked data site offers a reconciliation API, you can use it with OpenRefine to, for example, convert a column of city names or postcodes to URIs in the Ordnance Survey linked data. This is useful when you need to create explicit links to other datasets. For example, if you had a spreadsheet with place names like ‘City of Southampton’ you could use OpenRefine and the Ordnance Survey linked data reconciliation API to turn ‘City of Southampton’ into its URI.
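As a concrete example of the "quick Python script" approach mentioned above, the sketch below converts a small CSV file into Turtle with rdflib. The file places.csv, its columns name and postcode, and the http://example.org/ URIs are hypothetical; a real conversion would mint URIs in the publisher's own namespace.

# A quick CSV-to-RDF conversion script (sketch only).
import csv
from urllib.parse import quote

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/id/")          # identifiers for things
VOCAB = Namespace("http://example.org/ontology/") # properties and classes

g = Graph()
with open("places.csv", newline="") as f:
    for row in csv.DictReader(f):
        place = EX[quote(row["name"])]            # URL-encode the local name
        g.add((place, RDF.type, VOCAB.Place))
        g.add((place, RDFS.label, Literal(row["name"])))
        g.add((place, VOCAB.postcode, Literal(row["postcode"])))

g.serialize(destination="places.ttl", format="turtle")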

Q8. What are the most promising application domains where you can apply RDF triple store technology such as AllegroGraph and Virtuoso?

John Goodwin: I think any domain where you either want to integrate lots of disparate datasets or want a flexible data model, and where schema evolution might otherwise be a problem. I think geospatial is a promising domain, as ‘everything happens somewhere’ and location provides a useful integration hub for many datasets. Semantic web technologies have also been used widely in the bioinformatics domain. The BBC are another great use case – they use the technology to integrate data across their whole enterprise. This brings together data from news, radio, sport, television and music and allows new and exciting ways to explore the data.
I dare say it is also a technology that will prove useful/interesting to certain three-letter American government agencies that have made the news recently :)

Q9. Do you use Data Analytics at the Ordnance Survey and for what?

John Goodwin: I would say currently we don’t really – though it depends what you mean by analytics. We are largely concerned with collecting and maintaining data, and then shipping this out as products and services. We have experimented with an IBM® Netezza® appliance to perform queries over our data that would have taken too much time in traditional databases, to answer questions such as ‘how many post boxes are there in Great Britain?’.

Q10. Can you do data analytics using Linked Open Data? If yes how?

John Goodwin: I think again it depends what is meant by analytics. Linked data offers a great way to bring lots of datasets together and then, maybe, materialise a view of those integrated datasets that could then be used to perform some analytics. Many people are doing ‘graph analytics’, and given that linked data is a graph I think there is some interesting work to be done in looking at the intersections of graph/network theory and linked data.
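One way to experiment with that intersection of graph/network theory and linked data is to project RDF triples onto a general-purpose graph library and run standard measures over them. The sketch below does this with rdflib and networkx; the input file linked_data.ttl is a hypothetical placeholder, and treating every triple as an edge (literals included) is a deliberate simplification.

# A sketch of simple graph analytics over RDF data.
import networkx as nx
from rdflib import Graph

rdf_graph = Graph()
rdf_graph.parse("linked_data.ttl", format="turtle")

# Project RDF triples onto a directed graph; predicates become edge labels.
nx_graph = nx.DiGraph()
for subject, predicate, obj in rdf_graph:
    nx_graph.add_edge(str(subject), str(obj), predicate=str(predicate))

# Rank resources by how often other resources point at them.
centrality = nx.in_degree_centrality(nx_graph)
top = sorted(centrality.items(), key=lambda kv: kv[1], reverse=True)[:10]
for node, score in top:
    print(f"{score:.3f}  {node}")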

Q11. What are the main current obstacles for the adoption of Semantic Web technologies in the Enterprise?

John Goodwin: I think there are two main obstacles. The first one is a perception that RDF and linked data are hard, and somehow we need to overcome that perception. Lots of things in the ICT domain are hard…RDBMS is hard, C++ is hard, etc. Semantic technologies may be unfamiliar, but when you have used them for a while you will realise they are no harder than many other technologies…in fact I would argue they are easier. I know a lot of developers who have moved on to using SPARQL and, after a few months using it, find it much easier to understand than SQL. Furthermore, I think it is harder to hire people with expertise in these technologies – there are still more people skilled up in traditional RDBMS and other newer NoSQL technologies like MongoDB.

I think the second obstacle is that semantic web technologies are, obviously, not going to be as mature as a good old relational database. There are some great triplestores out there, and there are enterprises that have successfully incorporated them (the BBC are a great example), but being a relatively new technology, I suspect many enterprises are nervous about investing.

——————-
John Goodwin went to university at Royal Holloway and Bedford New College (University of London – based in Egham, Surrey) and graduated in 1992 with a 1st class honours degree in mathematics. Following that he moved to Cambridge and studied Part III of the Mathematics Tripos at the Department of Applied Maths and Theoretical Physics (University of Cambridge) where he obtained a Certificate of Advanced Study in Mathematics. John then moved to the University of Southampton to start his PhD.
He graduated in 1997 with a PhD in “The Cauchy Problem in Spacetimes with Closed Timelike Curves” (which can very roughly be paraphrased as ‘do time machines blow up when you turn them on?’). In 1998 John left academia to start work at Ordnance Survey (located at postcode SO16 0AS) as a systems developer. He left Ordnance Survey in 2000 to start work at a small software company called Neusciences, where he gained experience in various A.I. techniques. After just ten months at Neusciences John returned to Ordnance Survey to work in the research department, where his research concentrated on the semantic web, ontologies and linked data. On the back of this research John produced the current Ordnance Survey linked data. He is currently still at Ordnance Survey, working as a Principal Scientist, where he leads research (at a technical and strategic level) into data management, data delivery and services.
John currently chairs the UK Location Council Linked Data Working Group and participates in the UK Government Linked Data Working Group.

Related Posts

On Hybrid Relational Databases. Interview with Kingsley Uyi Idehen. May 13, 2013

Graphs vs. SQL. Interview with Michael Blaha. April 11, 2013

On Big Graph Data. August 6, 2012

Resources

ODBMS.org: free resources on Graphs and Data Stores
Blog Posts | Free Software | Articles, Papers, Presentations| Tutorials, Lecture Notes

——————————–
Follow ODBMS.org on Twitter: @odbmsorg
