ODBMS Industry Watch – Trends and Information on Big Data, New Data Management Technologies, Data Science and Innovation (http://www.odbms.org/blog)

On Open Source Databases. Interview with Peter Zaitsev
http://www.odbms.org/blog/2017/09/on-open-source-databases-interview-with-peter-zaitsev/
Wed, 06 Sep 2017

“To be competitive with non-open-source cloud deployment options, open source databases need to invest in “ease-of-use.” There is no tolerance for complexity in many development teams as we move to “ops-less” deployment models.” –Peter Zaitsev

I have interviewed Peter Zaitsev, Co-Founder and CEO of Percona.
In this interview, Peter talks about the Open Source Databases market; the Cloud; the scalability challenges at Facebook; compares MySQL, MariaDB, and MongoDB; and presents Percona’s contribution to the MySQL and MongoDB ecosystems.

RVZ

Q1. What are the main technical challenges in obtaining application scaling?

Peter Zaitsev: When it comes to scaling, there are different types. There is a Facebook/Google/Alibaba/Amazon scale: these giants are pushing boundaries, and are usually solving very complicated engineering problems at a scale where solutions aren’t easy or known. This often means finding edge cases that break things like hardware, operating system kernels and the database. As such, these companies not only need to build very large-scale infrastructure, with a high level of automation, but also ensure it is robust enough to handle these kinds of issues with limited user impact. A great deal of hardware and software deployment practices must be in place for such installations.

While these “extreme-scale” applications are very interesting and get a lot of publicity at tech events and in tech publications, this is a very small portion of all the scenarios out there. The vast majority of applications are running at the medium to high scale, where implementing best practices gets you the scalability you need.

When it comes to MySQL, perhaps the most important question is when you need to “shard.” Sharding — while used by every application at extreme scale — isn’t a simple “out-of-the-box” feature in MySQL. It often requires a lot of engineering effort to correctly implement it.

While sharding is sometimes required, you should really examine whether it is necessary for your application. A single MySQL instance can easily handle hundreds of thousands of moderately complicated queries per second (or more), and terabytes of data. Pair that with Memcached or Redis caching, MySQL Replication, or more advanced solutions such as Percona XtraDB Cluster or Amazon Aurora, and you can cover the transactional (operational) database needs of applications at a very significant scale.
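
As a rough illustration of the caching pattern mentioned above, here is a minimal cache-aside sketch. The `query_mysql` helper and the in-memory `CACHE` dict are hypothetical stand-ins (not from the interview); in practice they would wrap a real MySQL driver and a Memcached or Redis client.

```python
import json

CACHE = {}  # stand-in for a Memcached/Redis client

def query_mysql(user_id):
    """Hypothetical helper: run the real query against MySQL (or a replica)."""
    # e.g. SELECT id, name, plan FROM users WHERE id = %s
    return {"id": user_id, "name": "example", "plan": "basic"}

def get_user(user_id):
    """Cache-aside read: try the cache first, fall back to MySQL, then populate."""
    key = f"user:{user_id}"
    cached = CACHE.get(key)
    if cached is not None:
        return json.loads(cached)
    row = query_mysql(user_id)      # cache miss: hit the database
    CACHE[key] = json.dumps(row)    # write back so the next read is cheap
    return row

def update_user(user_id, fields):
    """On writes, update MySQL and invalidate the cached entry."""
    # UPDATE users SET ... WHERE id = %s  (against the primary)
    CACHE.pop(f"user:{user_id}", None)
```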

Besides making such high-level architecture choices, you of course also need to ensure that you exercise basic database hygiene. Ensure that you’re using the correct hardware (or cloud instance type), the right MySQL and operating system version and configuration, and that you have a well-designed schema and good indexes. You also want to ensure good capacity planning, so that you are not caught by surprise when you need to take your system to the next scale.

Q2. Why did Facebook create MyRocks, a new flash-optimized transactional storage engine on top of RocksDB storage engine for MySQL?

Peter Zaitsev: The Facebook Team is the most qualified to answer this question. However, I imagine that at Facebook scale being efficient is very important, because it helps to drive costs down. If your hot data is in the cache, what is important is that your database is efficient at handling writes — thus you want a “write-optimized engine.”
If you use Flash storage, you also care about two things:

– A high level of compression, since Flash storage is much more expensive than spinning disk.

– Writing as little to the storage as possible, as the more you write the faster it wears out (and needs to be replaced).

RocksDB and MyRocks are able to achieve all of these goals. Because they are LSM-based, writes (especially inserts) are very fast — even for giant data sizes. They are also much better suited to achieving high levels of compression than InnoDB.
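
To see why an LSM-based engine is write-optimized, here is a toy log-structured merge sketch (illustrative only, not RocksDB/MyRocks code): writes land in an in-memory memtable and are flushed as sorted, sequential runs rather than updating a B-tree in place.

```python
import bisect

class ToyLSM:
    """Toy LSM tree: in-memory memtable plus sorted, immutable runs on 'disk'."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}   # recent writes, absorbed in memory
        self.runs = []       # older data: list of sorted (key, value) runs
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value   # a write is just a memory insert (fast)
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Sequential flush of a sorted run: cheap on flash, compresses well.
        self.runs.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Search newest run first; real engines use bloom filters to skip runs.
        for run in reversed(self.runs):
            keys = [k for k, _ in run]
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return run[i][1]
        return None
```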

This Blog Post by Mark Callaghan has many interesting details, including this table which shows MyRocks having better performance, write amplification and compression for Facebook’s workload than InnoDB.

Q3. Beringei is Facebook’s open source, in-memory time series database. According to Facebook, large-scale monitoring systems cannot handle large-scale analysis in real time because the query performance is too slow. What is your take on this?

Peter Zaitsev: Facebook operates at extreme scale, so it is no surprise the conventional systems don’t scale well enough or aren’t efficient enough for Facebook’s needs.

I’m very excited that Facebook has released Beringei as open source. Beringei itself is a relatively low-level storage engine that is hard for the majority of users to use on its own, but I hope it gets integrated with other open source projects and grows into a full-blown, high-performance monitoring solution. Integrating it with Prometheus would be a great fit for solutions with extreme data ingestion rates and very high metric cardinality.

Q4. How do you see the market for open source databases evolving?

Peter Zaitsev: The last decade has seen a lot of open source database engines built, offering a lot of different data models, persistence options, high availability options, etc. Some of them were built as open source from scratch, while others were released as open source after years of being proprietary engines — the most recent example being CMDB2 by Bloomberg. I think this heavy competition is great for pushing innovation forward, and is very exciting! For example, I think that if MongoDB hadn’t shown how many developers love a document-oriented data model, we might never have seen MySQL Document Store in the MySQL ecosystem.

With all this variety, I think there will be a lot of consolidation and only a small fraction of these new technologies really getting wide adoption. Many will either have niche deployments, or will be an idea breeding ground that gets incorporated into more popular database technologies.

I do not think SQL will “die” anytime soon, even though it is many decades old. But I also don’t think we will see it remain the dominant “database” language, as it has been since the turn of the millennium.

The interesting disruptive force for open source technologies is the cloud. It will be very interesting for me to see how things evolve. With pay-for-use models of the cloud, the “free” (as in beer) part of open source does not apply in the same way. This reduces incentives to move to open source databases.

To be competitive with non-open-source cloud deployment options, open source databases need to invest in “ease-of-use.” There is no tolerance for complexity in many development teams as we move to “ops-less” deployment models.

Q5. In your opinion what are the pros and cons of MySQL vs. MariaDB?

Peter Zaitsev: While it traces its roots to MySQL, MariaDB is quickly becoming a very different database.
It implements some features MySQL doesn’t, but also leaves out others (MySQL Document Store and Group Replication) or implements them in a different way (JSON support and Replication GTIDs).

From the MySQL side, we have Oracle’s financial backing and engineering. You might dislike Oracle, but I think you agree they know a thing or two about database engineering. MySQL is also far more popular, and as such more battle-tested than MariaDB.

MySQL is developed by a single company (Oracle) and does not have as many external contributors compared to MariaDB — which has its own pluses and minuses.

MySQL is “open core,” meaning some components are available only in the proprietary version, such as Enterprise Authentication, Enterprise Scalability, and others. Alternatives for a number of these features are available in Percona Server for MySQL, though (which is completely open source). MariaDB Server itself is completely open source, though there are other components that aren’t that you might need to build a full solution — namely MaxScale.

Another thing MariaDB has going for it is that it is included in a number of Linux distributions. Many new users will be getting their first “MySQL” experience with MariaDB.

For additional insight into MariaDB, MySQL and Percona Server for MySQL, you can check out this recent article.

Q6. What’s new in the MySQL and MongoDB ecosystem?

Peter Zaitsev: This could be its own and rather large article! With MySQL, we’re very excited to see what is coming in MySQL 8. There should be a lot of great changes in pretty much every area, ranging from the optimizer to retiring a lot of architectural debt (some of it 20 years old). MySQL Group Replication and MySQL InnoDB Cluster, while still early in their maturity, are very interesting products.

For MongoDB, we’re very excited about MongoDB 3.4, which has been taking steps to be a more enterprise-ready database with features like collation support and high-performance sharding. A number of these features are only available in the Enterprise version of MongoDB, such as external authentication, auditing and log redaction. This is where Percona Server for MongoDB 3.4 comes in handy, by providing open source alternatives for the most valuable Enterprise-only features.

For both MySQL and MongoDB, we’re very excited about RocksDB-based storage engines. MyRocks and MongoRocks both offer outstanding performance and efficiency for certain workloads.

Q7. Anything else you wish to add?

Peter Zaitsev: I would like to use this opportunity to highlight Percona’s contribution to the MySQL and MongoDB ecosystems by mentioning two of our open source products that I’m very excited about.

First, Percona XtraDB Cluster 5.7.
While this has been around for about a year, we just completed a major performance improvement effort that allowed us to increase performance up to 10x. I’m not talking about improving some very exotic workloads: these performance improvements are achieved in very typical high-concurrency environments!

I’m also very excited about our Percona Monitoring and Management product, which is unique in being the only fully packaged open source monitoring solution specifically built for MySQL and MongoDB. It is a newer product that has been available for less than a year, but we’re seeing great momentum in adoption in the community. We are focusing many of our resources on improving it and making it more effective.

———————


Peter Zaitsev co-founded Percona and assumed the role of CEO in 2006. As one of the foremost experts on MySQL strategy and optimization, Peter leveraged both his technical vision and entrepreneurial skills to grow Percona from a two-person shop to one of the most respected open source companies in the business. With more than 150 professionals in 29 countries, Peter’s venture now serves over 3000 customers – including the “who’s who” of Internet giants, large enterprises and many exciting startups. Percona was named to the Inc. 5000 in 2013, 2014, 2015 and 2016.

Peter was an early employee at MySQL AB, eventually leading the company’s High Performance Group. A serial entrepreneur, Peter co-founded his first startup while attending Moscow State University where he majored in Computer Science. Peter is a co-author of High Performance MySQL: Optimization, Backups, and Replication, one of the most popular books on MySQL performance. Peter frequently speaks as an expert lecturer at MySQL and related conferences, and regularly posts on the Percona Data Performance Blog. He has also been tapped as a contributor to Fortune and DZone, and his recent ebook Practical MySQL Performance Optimization Volume 1 is one of percona.com’s most popular downloads.
————————-

Resources

Percona, in collaboration with Facebook, announced the first experimental release of MyRocks in Percona Server for MySQL 5.7, with packages. September 6, 2017

eBook, “Practical MySQL Performance Optimization,” by Percona CEO Peter Zaitsev and Principal Consultant Alexander Rubin. (LINK to DOWNLOAD, registration required)

MySQL vs MongoDB – When to Use Which Technology. Peter Zaitsev, June 22, 2017

Percona Live Open Source Database Conference Europe, Dublin, Ireland. September 25 – 27, 2017

Percona Monitoring and Management (PMM) Graphs Explained: MongoDB with RocksDB. By Tim Vaillancourt, June 18, 2017

Related Posts

On Apache Ignite, Apache Spark and MySQL. Interview with Nikita Ivanov. ODBMS Industry Watch, 2017-06-30

On the new developments in Apache Spark and Hadoop. Interview with Amr Awadallah. ODBMS Industry Watch, 2017-03-13

On in-memory, key-value data stores. Ofer Bengal and Yiftach Shoolman. ODBMS Industry Watch, 2017-02-13

follow us on Twitter: @odbmsorg

##

Identity Graph Analysis at Scale. Interview with Niels Meersschaert
http://www.odbms.org/blog/2017/05/interview-with-niels-meersschaert/
Tue, 09 May 2017

“I’ve found the best engineers actually have art backgrounds or interests. The key capability is being able to see problems from multiple perspectives, and realizing there are multiple solutions to a problem. Music, photography and other arts encourage that.”–Niels Meersschaert.

I have interviewed Niels Meersschaert, Chief Technology Officer at Qualia. The Qualia team relies on over one terabyte of graph data in Neo4j, combined with larger amounts of non-graph data to provide major companies with consumer insights for targeted marketing and advertising opportunities.

RVZ

Q1. Your background is in Television & Film Production. How does it relate to your current job?

Niels Meersschaert: Engineering is a lot like producing. You have to understand what you are trying to achieve, and understand what parts and roles you’ll need to accomplish it, all while doing it within a budget. I’ve found the best engineers actually have art backgrounds or interests. The key capability is being able to see problems from multiple perspectives, and realizing there are multiple solutions to a problem. Music, photography and other arts encourage that. Engineering is both art and science, and creativity is a critical skill for the best engineers. I also believe that a breadth of languages is critical for engineers.

Q2. Your company collects data on more than 90% of American households. What kind of data do you collect and how do you use such data?

Niels Meersschaert: We focus on high quality data that is indicative of commercial intent. Some examples include wishlist interaction, content consumption, and location data. While we have the breadth of a huge swath of the American population, a key feature is that we have no personally identifiable information. We use anonymous unique identifiers.
So, we know this ID did actions indicative of interest in a new SUV, but we don’t know their name, email address, phone number or any other personally identifiable information about a consumer. We feel this is a good balance of commercial need and individual privacy.

Q3. If you had to operate with data from Europe, what would be the impact of the new EU General Data Protection Regulation (GDPR) on your work?

Niels Meersschaert: Europe is a very different market than the U.S., and many of the regulations you mentioned do require a different approach to understanding consumer behaviors. Given that we avoid personal IDs, our approach is already better situated than that of many peers that rely on PII.

Q4. Why did you choose a graph database to implement your consumer behavior tracking system?

Niels Meersschaert: Our graph database is used for ID management. We don’t use it for understanding the intent data, but rather recognizing IDs. Conceptually, describing the various IDs involved is a natural fit for a graph.
As an example, a conceptual consumer could be thought of as the top of the graph. That consumer uses many devices and each device could have 1 or more anonymous IDs associated with it, such as cookie IDs. Each node can represent an associated device or ID and the relationships between each node allow us to see the path. A key element we have in our system is something we call the Borg filter. It’s a bit of a reference to Star Trek, but essentially when we find a consumer is too connected, i.e. has dozens or hundreds of devices, we remove all those IDs from the graph as clearly something has gone wrong. A graph database makes it much easier to determine how many connected nodes are at each level.

Q5. Why did you choose Neo4j?

Niels Meersschaert: Neo4J had a rich query language and very fast performance, especially if your hot set was in RAM.

Q6. You manage one terabyte of graph data in Neo4j. How do you combine them with larger amounts of non-graph data?

Niels Meersschaert: You can think of the graph as a compression system for us. While consumer actions occur on multiple devices and anonymous IDs, they represent the actions of a single consumer. This actually simplifies things for us, since the number of unique grouping IDs is much smaller than the number of unique source IDs. It also allows us to eliminate non-human IDs from the graph. This does mean we see the world in different ways than many peers. As an example, if you focus only on cookie IDs, you tend to have a much larger number of unique IDs than the actual consumers they represent. Sadly, the same thing happens with website monthly uniques: many are highly inflated, both in the number of unique people they represent and because many of the IDs are non-human. Ultimately, the entire goal of advertising is to influence consumers, so we feel that having a better representation of actual consumers allows us to be more effective.
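
The “compression” idea can be pictured as a grouping step: many anonymous source IDs collapse to one consumer ID before the intent data is analyzed. A minimal sketch with hypothetical names (not Qualia’s code):

```python
from collections import defaultdict

def compress_to_consumers(events, id_to_consumer):
    """Group raw per-device events under the consumer ID resolved by the graph.

    events         : iterable of (source_id, action) pairs
    id_to_consumer : mapping exported from the ID graph; source IDs removed by
                     the Borg filter (or known non-human IDs) are simply absent
    """
    by_consumer = defaultdict(list)
    for source_id, action in events:
        consumer = id_to_consumer.get(source_id)
        if consumer is None:
            continue                      # drop filtered / non-human IDs
        by_consumer[consumer].append(action)
    return by_consumer
```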

Q7. What are the technical challenges you face when blending data with different structure?

Niels Meersschaert: A key challenge is finding some unifying element between the different systems or structures that link the data. What we did with Neo4J is create a unique property on the nodes that we use for interchange. The internal node IDs that are part of Neo4J aren’t something we use except internally within the graph DB.

Q8. If your data is sharded manually, how do you handle scalability?

Niels Meersschaert: We don’t shard the data manually, but scalability is one of the biggest challenges. We’ve spent a lot of time tuning queries and grouping operations to take advantage of some of the capabilities of Neo4J and to work around some of its limitations. The vast majority of graph customers wouldn’t have the volume or the volatility of data that we do, so our challenges are unique.

Q9. What other technologies do you use and how do they interact with Neo4j?

Niels Meersschaert: We use the classic big data tools like Hadoop and Spark. We also use MongoDB and Google’s Big Query. If you look at the graph as the truth set of device IDs, we interact with it on ingestion and export only. Everything in the middle can operate on the consumer ID, which is far more efficient.

Q10. How do you measure the ROI of your solution?

Niels Meersschaert: There are a few factors we consider. First, how much does the infrastructure cost us to process the data and output? How fast is it in terms of execution time? How much development effort does it take relative to other solutions? How flexible is it for us to extend it? This is an ever-evolving situation, and one where we are always looking at how to improve, especially as a smaller business.

———————————-

Niels Meersschaert
I’ve been coding since I was 7 years old on an Apple II. I’d built radio control model cars and aircraft as a child and built several custom chassis using controlled flex as suspension to keep weight & parts count down. So, I’d had an early interest in both software and physical engineering.

My father was from the Netherlands and my maternal grandfather was a linguist fluent in 43 languages. When I was a kid, my father worked for the airlines, so we traveled often to Europe to see family, and I grew up multilingual. Computer languages are just different ways to describe something; the basic concepts are similar, just as they are in spoken languages, albeit with different grammatical and syntactic structures. Whether you’re speaking French, or writing a program in Python or C, the key is that you are trying to get your communication across to the target of your message, whether it is another person or a computer.

I originally started university in aeronautical engineering, but in my sophomore year, Grumman let go about 3000 engineers, so I didn’t think the career opportunities would be as great. I’d always viewed problem solutions as a combination of art & science, so I switched majors to one in which I could combine the two.

After school I worked producing and editing commercials and industrials, often with special effects. I got into web video early on & spent a lot of time on compression and distribution systems. That led to working on search, bringing the linguistics back front and center again. I then combined the two and came full circle back to advertising, but from the technical angle, at Magnetic, where we built search retargeting. At Qualia, we kicked this into high gear: we understand consumer intent by analyzing sentiment, content and actions across multiple devices and environments, along with the interaction and timing between them, to understand where a consumer is on their intent path.

Resources

EU General Data Protection Regulation (GDPR):

Reform of EU data protection rules

European Commission – Fact Sheet Questions and Answers – Data protection reform

General Data Protection Regulation (Wikipedia)

Neo4j Sandbox: The Neo4j Sandbox enables you to get started with Neo4j, with built-in guides and sample datasets for popular use cases.

Related Posts

LDBC Developer Community: Benchmarking Graph Data Management Systems. ODBMS.org, 6 APR, 2017

Graphalytics benchmark. ODBMS.org, 6 APR, 2017
The Graphalytics benchmark is an industrial-grade benchmark for graph analysis platforms such as Giraph. It consists of six core algorithms, standard datasets, synthetic dataset generators, and reference outputs, enabling the objective comparison of graph analysis platforms.

Collaborative Filtering: Creating the Best Teams Ever. By Maurits van der Goes, Graduate Intern | February 16, 2017

Follow us on Twitter: @odbmsorg

##

New Gartner Magic Quadrant for Operational Database Management Systems. Interview with Nick Heudecker
http://www.odbms.org/blog/2016/11/new-gartner-magic-quadrant-for-operational-database-management-systems-interview-with-nick-heudecker/
Wed, 30 Nov 2016

“It is too soon to call the operational DBMS market a commodity market, but it’s easy to see a future where that is the case.”–Nick Heudecker.

I have interviewed Nick Heudecker, Research Director on Gartner’s Data & Analytics team.
The main topic of the interview is the new Magic Quadrant for Operational Database Management Systems.

RVZ

Q1. You have published the new Magic Quadrant for Operational Database Management Systems (*). How do you define the operational database management system market?

Nick Heudecker: We define a DBMS as a complete software system used to define, create, manage, update and query a database. DBMSs provide interfaces to independent programs and tools that both support and govern the performance of a variety of concurrent workload types. There is no presupposition that DBMSs must support the relational model or that they must support the full set of possible data types in use today. OPDBMSs must include functionality to support backup and recovery, and have some form of transaction durability — although the atomicity, consistency, isolation and durability model is not a requirement. OPDBMSs may support multiple delivery models, such as stand-alone DBMS software, certified configurations, cloud (public and private) images or versions, and database appliances.

Q2. Can you explain the methodology you used for this new Magic Quadrant?

Nick Heudecker: Several of Gartner’s methodologies are public. The Magic Quadrant methodology can be found here.

We use a number of data sources when we’re creating the Magic Quadrant for Operational Database Management Systems.
We survey vendor reference customers and include data from our interactions with Gartner clients. We also consider earlier information and any news about vendors’ products, customers and finances that came to light during the time frame for our analysis.

Once we have the data, we score vendors across the various dimensions of Completeness of Vision and Ability to Execute.
One thing that’s important to note is Magic Quadrants are relative assessments of vendors in a market. We couldn’t have one vendor on an MQ because it would be right in the middle – there’s nothing to compare it to.

Q3. Why were there no Visionaries this year?

Nick Heudecker: We determined there was an overall lack of vision in the market. After a few years of rapid feature expansion, the focus has shifted to operational excellence and execution. Even Leaders shifted to the left on vision, but are still placed in the Leaders quadrant based on their vision for the development of hybrid database management, hardware optimization and integration, emerging deployment models such as containerization, as well as vertical features.

Q4. Were you surprised by the analysis and some of the results you obtained?

Nick Heudecker: The lack of overall vision in the market struck us the most. Other than in a few notable cases, we received largely the same story from most vendors. The explosion of features, and the vendors emerging to implement them, has slowed. The features that initiated the expansion, such as storing new data types, geographically distributed storage, cloud and flexible data consistency models, have become common. Today, nearly every established or emerging DBMS vendor supports these features to some degree. The OPDBMS market has shifted from a phase of rapid innovation to a phase of maturing products and capabilities.

Q5. Do you believe the “NoSQL” label will continue to distinguish DBMSs?

Nick Heudecker: If you look at the entire operational DBMS space, there’s already a great deal of convergence between NoSQL vendors, as well as between NoSQL and traditionally relational vendors. Nearly every vendor, nonrelational and relational, supports multiple data types, like JSON documents, graph or wide-column. NoSQL vendors are adding SQL: MongoDB’s BI Connector and Couchbase’s N1QL are good, if diverse, examples. They’re also adding things like schema management and data validation capabilities.
On the relational side, they’re adding horizontal scaling options and alternative consistency models, as well as modern APIs. And everyone either has or is adding in-memory and cloud capabilities.

It is too soon to call the operational DBMS market a commodity market, but it’s easy to see a future where that is the case.

Q6. What are the other “Vendors to Consider”?

Nick Heudecker: The other vendors to consider are vendors that did not meet the inclusion requirements for the Magic Quadrant. Usually this is because they missed our minimum revenue requirements, but that doesn’t mean they don’t have compelling products.

——————————-
Nick Heudecker is a Research Director on Gartner’s Data & Analytics team. His coverage includes data management technologies and practices.

——————————-

Resources
(*) Magic Quadrant for Operational Database Management Systems. Published: 05 October 2016. ID: G00293203. Analysts: Nick Heudecker, Donald Feinberg, Merv Adrian, Terilyn Palanca, Rick Greenwald

– Complimentary Gartner Research: 100 Data and Analytics Predictions Through 2020. Get exclusive access to Gartner’s top 100 data and analytics predictions through 2020. Plus access other relevant Gartner research including Magic Quadrant reports for database and data warehouse solutions, and the market guide for in-memory computing (LINK to MemSQL web site – registration required).

Related Posts

MarkLogic Named a Next-Generation Database Challenger in 2016 Gartner Magic Quadrant. By GARY BLOOM, Chief Executive Officer and President MARKLOGIC

MarkLogic Recognized in New Gartner® Magic Quadrant. Gartner Magic Quadrant for Operational Database Management Systems positions MarkLogic® the highest for ability to execute in the Challengers Quadrant

– Accelerating Business Value with a Multi-Model, Multi-Workload Data Platform

– NuoDB Recognized by Gartner in Critical Capabilities for Operational Database Management Systems. Elastic SQL database achieves top five score in all four use cases.

– Clustrix Recognized in Gartner Magic Quadrant for Operational Database Management Systems

– Learn why EDB is named a “Challenger” in the 2016 Gartner ODBMS Magic Quadrant

– DataStax Receives Highest Scores in 2 Use Cases in Gartner’s Critical Capabilities for Operational Database Management Systems

– Gartner Scores Oracle Highest In 3 of 4 Use Cases: Gartner Critical Capabilities for Operational Database Management Systems Report

Gartner Critical Capabilities For Operational Database Management Systems 2016 – Redis Labs Ranked Second Highest In 2/4 Categories (Link – Registration required)

 

Follow us on Twitter: @odbmsorg

##

AsterixDB: Better than Hadoop? Interview with Mike Carey
http://www.odbms.org/blog/2014/10/asterixdb-hadoop-interview-mike-carey/
Wed, 22 Oct 2014

“To distinguish AsterixDB from current Big Data analytics platforms – which query but don’t store or manage Big Data – we like to classify AsterixDB as being a “Big Data Management System” (BDMS, with an emphasis on the “M”)”–Mike Carey.

Mike Carey and his colleagues have been working on a new data management system for Big Data called AsterixDB.

The AsterixDB Big Data Management System (BDMS) is the result of approximately four years of R&D involving researchers at UC Irvine, UC Riverside, and Oracle Labs. The AsterixDB code base currently consists of over 250K lines of Java code that has been co-developed by project staff and students at UCI and UCR.

The AsterixDB project has been supported by the U.S. National Science Foundation as well as by several generous industrial gifts.

RVZ

Q1. Why build a new Big Data Management System?

Mike Carey: When we started this project in 2009, we were looking at a “split universe” – there were your traditional parallel data warehouses, based on expensive proprietary relational DBMSs, and then there was the emerging Hadoop platform, which was free but low-function in comparison and wasn’t based on the many lessons known to the database community about how to build platforms to efficiently query large volumes of data. We wanted to bridge those worlds, and handle “modern data” while we were at it, by taking into account the key lessons from both sides.

To distinguish AsterixDB from current Big Data analytics platforms – which query but don’t store or manage Big Data – we like to classify AsterixDB as being a “Big Data Management System” (BDMS, with an emphasis on the “M”). 
We felt that the Big Data world, once the initial Hadoop furor started to fade a little, would benefit from having a platform that could offer things like:

  • a flexible data model that could handle data scenarios ranging from “schema first” to “schema never”;
  • a full query language with at least the expressive power of SQL;
  • support for data storage, data management, and automatic indexing;
  • support for a wide range of query sizes, with query processing cost being proportional to the given query;
  • support for continuous data ingestion, hence the accumulation of Big Data;
  • the ability to scale up gracefully to manage and query very large volumes of data using commodity clusters; and,
  • built-in support for today’s common “Big Data data types”, such as textual, temporal, and simple spatial data.

So that’s what we set out to do.

Q2. What was wrong with the current Open Source Big Data Stack?

Mike Carey: First, we should mention that some reviewers back in 2009 thought we were crazy or stupid (or both) to not just be jumping on the Hadoop bandwagon – but we felt it was important, as academic researchers, to look beyond Hadoop and be asking the question “okay, but after Hadoop, then what?” 
We recognized that MapReduce was great for enabling developers to write massively parallel jobs against large volumes of data without having to “think parallel” – just focusing on one piece of data (map) or one key-sharing group of data (reduce) at a time. As a platform for “parallel programming for dummies”, it was (and still is) very enabling! It also made sense, for expedience, that people were starting to offer declarative languages like Pig and Hive, compiling them down into Hadoop MapReduce jobs to improve programmer productivity – raising the level much like what the database community did in moving to the relational model and query languages like SQL in the 70’s and 80’s.

One thing that we felt was wrong for sure in 2009 was that higher-level languages were being compiled into an assembly language with just two instructions, map and reduce. We knew from Ted Codd and relational history that more instructions – like the relational algebra’s operators – were important – and recognized that the data sorting that Hadoop always does between map and reduce wasn’t always needed.
Trying to simulate everything with just map and reduce on Hadoop made “get something better working fast” sense, but not longer-term technical sense. As for HDFS, what seemed “wrong” about it under Pig and Hive was its being based on giant byte stream files and not on “data objects”, which basically meant file scans for all queries and lack of indexing. We decided to ask “okay, suppose we’d known that Big Data analysts were going to mostly want higher-level languages – what would a Big Data platform look like if it were built ‘on purpose’ for such use, instead of having incrementally evolved from HDFS and Hadoop?”
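
The “two-instruction assembly language” point can be seen in a toy word count, where everything must be phrased as a map step, a sort/shuffle, and a reduce step. This is a minimal Python sketch of the programming model, not Hadoop code:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # map: emit (key, value) pairs, one input record at a time
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Hadoop always sorts by key between map and reduce -- even when, as noted
    # above, a particular computation would not actually need the sort.
    for word, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

if __name__ == "__main__":
    lines = ["big data big queries", "big clusters"]
    print(dict(reduce_phase(map_phase(lines))))   # {'big': 3, 'clusters': 1, ...}
```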

Again, our idea was to try and bring together the best ideas from both the database world and the distributed systems world. (I guess you could say that we wanted to build a Big Data Reese’s Cup…)

Q3. AsterixDB has been designed to manage vast quantities of semi-structured data. How do you define semi-structured data?

Mike Carey: In the late 90’s and early 2000’s there was a bunch of work on that – on relaxing both the rigid/flat nature of the relational model as well as the requirement to have a separate, a priori specification of the schema (structure) of your data. We felt that this flexibility was one of the things – aside from its “free” price point – drawing people to the Hadoop ecosystem (and the key-value world) instead of the parallel data warehouse ecosystem.
In the Hadoop world you can start using your data right away, without spending 3 months in committee meetings to decide on your schema and indexes and getting DBA buy-in. To us, semi-structured means schema flexibility, so in AsterixDB, we let you decide how much of your schema you have to know and/or choose to reveal up front, and how much you want to leave to be self-describing and thus allow it to vary later. And it also means not requiring the world to be flat – so we allow nesting of records, sets, and lists. And it also means dealing with textual data “out of the box”, because there’s so much of that now in the Big Data world.

Q4. The motto of your project is “One Size Fits a Bunch”. You claim that AsterixDB can offer better functionality, manageability, and performance than gluing together multiple point solutions (e.g., Hadoop + Hive + MongoDB). Could you please elaborate on this?

Mike Carey: Sure. If you look at current Big Data IT infrastructures, you’ll see a lot of different tools and systems being tied together to meet an organization’s end-to-end data processing requirements. In between systems and steps you have the glue – scripts, workflows, and ETL-like data transformations – and if some of the data needs to be accessible faster than a file scan, it’s stored not just in HDFS, but also in a document store or a key-value store.
This just seems like too many moving parts. We felt we could build a system that could meet more (not all!) of today’s requirements, like the ones I listed in my answer to the first question.
If your data is in fewer places or can take a flight with fewer hops to get the answers, that’s going to be more manageable – you’ll have fewer copies to keep track of and fewer processes that might have hiccups to watch over. If you can get more done in one system, obviously that’s more functional. And in terms of performance, we’re not trying to out-perform the specialty systems – we’re just trying to match them on what each does well. If we can do that, you can use our new system without needing as many puzzle pieces and can do so without making a performance sacrifice.
We’ve recently finished up a first comparison of how we perform on tasks that systems like parallel relational systems, MongoDB, and Hive can do – and things look pretty good so far for AsterixDB in that regard.

Q5. AsterixDB has been combining ideas from three distinct areas — semi-structured data management, parallel databases, and data-intensive computing. Could you please elaborate on that?

Mike Carey: Our feeling was that each of these areas has some ideas that are really important for Big Data. Borrowing from semi-structured data ideas, but also more traditional databases, leads you to a place where you have flexibility that parallel databases by themselves do not. Borrowing from parallel databases leads to scale-out that semi-structured data work didn’t provide (since scaling is orthogonal to data model) and with query processing efficiencies that parallel databases offer through techniques like hash joins and indexing – which MapReduce-based data-intensive computing platforms like Hadoop and its language layers don’t give you. Borrowing from the MapReduce world leads to the open-source “pricing” and flexibility of Hadoop-based tools, and argues for the ability to process some of your queries directly over HDFS data (which we call “external data” in AsterixDB, and do also support in addition to managed data).

Q6. How does the AsterixDB Data Model compare with the data models of NoSQL data stores, such as document databases like MongoDB and CouchBase, simple key/value stores like Riak and Redis, and column-based stores like HBase and Cassandra?

Mike Carey: AsterixDB’s data model is flexible – we have a notion of “open” versus “closed” data types – it’s a simple idea but it’s unique as far as we know. When you define a data type for records to be stored in an AsterixDB dataset, you can choose to pre-define any or all of the fields and types that objects to be stored in it will have – and if you mark a given type as being “open” (or let the system default it to “open”), you can store objects there that have those fields (and types) as well as any/all other fields that your data instances happen to have at insertion time.
Or, if you prefer, you can mark a type used by a dataset as “closed”, in which case AsterixDB will make sure that all inserted objects will have exactly the structure that your type definition specifies – nothing more and nothing less.
(We do allow fields to be marked as optional, i.e., nullable, if you want to say something about their type without mandating their presence.)

What this gives you is a choice!  If you want to have the total, last-minute flexibility of MongoDB or Couchbase, with your data being self-describing, we support that – you don’t have to predefine your schema if you use data types that are totally open. (The only thing we insist on, at the moment, is that every type must have a key field or fields – we use keys when sharding datasets across a cluster.)

Structurally, our data model was JSON-inspired – it’s essentially a schema language for a JSON superset – so we’re very synergistic with MongoDB or Couchbase data in that regard. 
On the other end of the spectrum, if you’re still a relational bigot, you’re welcome to make all of your data types be flat – don’t use features like nested records, lists, or bags in your record definitions – and mark them all as “closed” so that your data matches your schema. With AsterixDB, we can go all the way from traditional relational to “don’t ask, don’t tell”. As for systems with BigTable-like “data models” – I’d personally shy away from calling those “data models”.
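
As a rough illustration of the “open” versus “closed” distinction, here is a small validation sketch written in Python rather than AsterixDB’s own DDL; the field names are invented and the check is simplified (a fuller version would also require non-optional declared fields to be present):

```python
def validate(record, declared_fields, open_type=True, key_field="id"):
    """Toy check mirroring AsterixDB-style open/closed datatypes.

    declared_fields maps field name -> expected Python type for the
    pre-declared part of the schema; undeclared extra fields are allowed
    only when the type is open.
    """
    if key_field not in record:
        raise ValueError("every type needs a key field (used for sharding)")
    for name, expected in declared_fields.items():
        if name in record and not isinstance(record[name], expected):
            raise TypeError(f"field {name} has the wrong type")
    extras = set(record) - set(declared_fields)
    if extras and not open_type:
        raise ValueError(f"closed type rejects undeclared fields: {extras}")
    return True

declared = {"id": int, "name": str}
validate({"id": 1, "name": "a", "nickname": "b"}, declared, open_type=True)    # ok
# validate({"id": 1, "name": "a", "nickname": "b"}, declared, open_type=False) # raises
```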

Q7. How do you handle horizontal scaling? And vertical scaling?

Mike Carey: We scale out horizontally using the same sort of divide-and-conquer techniques that have been used in commercial parallel relational DBMSs for years now, and more recently in Hadoop as well. That is, we horizontally partition both data (for storage) and queries (when processed) across the nodes of commodity clusters. Basically, our innards look very much like those of systems such as Teradata or Parallel DB2 or PDW from Microsoft – we use join methods like parallel hybrid hash joins, and we pay attention to how data is currently partitioned to avoid unnecessary repartitioning – but we have a data model that’s way more flexible. And we’re open source and free….
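
A toy illustration of the divide-and-conquer idea: both inputs are hash-partitioned on the join key so each partition can be joined independently (and, in a real system, on a different node). This is a sketch of the general technique, not AsterixDB’s Hyracks operators:

```python
from collections import defaultdict

def hash_partition(rows, key, n_parts):
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[hash(row[key]) % n_parts].append(row)   # same key -> same partition
    return parts

def hash_join(left, right, key):
    # Classic in-memory hash join within a single partition.
    table = defaultdict(list)
    for row in left:
        table[row[key]].append(row)
    return [{**l, **r} for r in right for l in table.get(r[key], [])]

def parallel_join(left, right, key, n_parts=4):
    lp = hash_partition(left, key, n_parts)
    rp = hash_partition(right, key, n_parts)
    out = []
    for i in range(n_parts):        # each iteration could run on its own node
        out.extend(hash_join(lp[i], rp[i], key))
    return out
```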

We scale vertically (within one node) in two ways. First of all, we aren’t memory-dependent in the way that many of the current Big Data Analytics solutions are; it’s not the case that you have to buy a big enough cluster so that your data, or at least your intermediate results, can be memory-resident.
Instead, our physical operators (for joins, sorting, aggregation, etc.) all spill to disk if needed – so you can operate on Big Data partitions without getting “out of memory” errors. The other way is that we allow nodes to hold multiple partitions of data; that way, one can also use multi-core nodes effectively.

Q8. What performance figures do you have for AsterixDB?

Mike Carey: As I mentioned earlier, we’ve completed a set of initial performance tests on a small cluster at UCI with 40 cores and 40 disks, and the results of those tests can be found in a recently published AsterixDB overview paper that’s hanging on our project web site’s publication page (http://asterixdb.ics.uci.edu/publications.html).
We have a couple of other performance studies in flight now as well, and we’ll be hanging more information about those studies in the same place on our web site when they’re ready for human consumption. There’s also a deeper dive paper on the AsterixDB storage manager that has some performance results regarding the details of scaling, indexing, and so on; that’s available on our web site too. The quick answer to “how does AsterixDB perform” is that we’re already quite competitive with other systems that have narrower feature sets – which we’re pretty proud of.

Q9. You mentioned support for continuous data ingestion. How does that work?

Mike Carey: We have a special feature for that in AsterixDB – we have a built-in notion of Data Feeds that are designed to simplify the lives of users who want to use our system for warehousing of continuously arriving data.
We provide Data Feed adaptors to enable outside data sources to be defined and plugged in to AsterixDB, and then one can “connect” a Data Feed to an AsterixDB data set and the data will start to flow in. As the data comes in, we can optionally dispatch a user-defined function on each item to do any initial information extraction/annotation that you want.  Internally, this creates a long-running job that our system monitors – if data starts coming too fast, we offer various policies to cope with it, ranging from discarding data to sampling data to adding more UDF computation tasks (if that’s the bottleneck). More information about this is available in the Data Feeds tech report on our web site, and we’ll soon be documenting this feature in the downloadable version of AsterixDB. (Right now it’s there but “hidden”, as we have been testing it first on a set of willing UCI student guinea pigs.)
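
The overload policies described above (discard, sample, or add more UDF workers) can be sketched as a simple ingestion loop. The adaptor interface and policy names here are illustrative assumptions, not AsterixDB’s actual API:

```python
import random

def ingest(feed, store, udf=None, policy="sample", high_watermark=10_000):
    """Toy continuous-ingestion loop with a back-pressure policy.

    feed   : iterable of incoming items (stand-in for a Data Feed adaptor)
    store  : object with an append(item) method and a backlog() size
    udf    : optional per-item extraction/annotation function
    policy : what to do when ingestion falls behind
    """
    for item in feed:
        if store.backlog() > high_watermark:
            if policy == "discard":
                continue                          # drop data to keep up
            if policy == "sample" and random.random() < 0.9:
                continue                          # keep roughly 10% of items
            # an "elastic" policy would instead add more UDF workers here
        store.append(udf(item) if udf else item)
```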

Q10. What is special about the AsterixDB Query Language? Why not use SQL?

Mike Carey: When we set out to define the query language for AsterixDB, we decided to define our own new language – since it seemed like everybody else was doing that at the time (witness Pig, Jaql, HiveQL, etc.) – one aimed at our data model. 
SQL doesn’t handle nested or open data very well, so extending ANSI/ISO SQL seemed like a non-starter – that was also based on some experience working on SQL3 in the late 90’s. (Take a look at Oracle’s nested tables, for example.) Based on our team’s backgrounds in XML querying, we actually started there – XQuery was developed by a team of really smart people from the SQL world (including Don Chamberlin, father of SQL) as well as from the XML world and the functional programming world. We took XQuery and then started throwing the stuff overboard that wasn’t needed for JSON or that seemed like a poor feature that had been added for XPath compatibility.
What remained was AQL, and we think it’s a pretty nice language for semistructured data handling. We periodically do toy with the notion of adding a SQL-like re-skinning of AQL to make SQL users feel more at home – and we may well do that in the future – but that would be different than “real SQL”. (The N1QL effort at Couchbase is doing something along those lines, language-wise, as an example. The SQL++ design from UCSD is another good example there.)

Q11. What level of concurrency and recovery guarantees does AsterixDB offer?

Mike Carey: We offer transaction support that’s akin to that of current NoSQL stores. That is, we promise record-level ACIDity – so inserting or deleting a given record will happen as an atomic, durable action. However, we don’t offer general-purpose distributed transactions. We support an arbitrary number of secondary indexes on data sets, and we’ll keep all the indexes on a data set transactionally consistent – that we can do because secondary index entries for a given record live in the same data partition as the record itself, so those transactions are purely local.

Q12. How does AsterixDB compare with Hadoop? What about Hadoop Map/Reduce compatibility?

Mike Carey: I think we’ve already covered most of that – Hadoop MapReduce is an answer to low-level “parallel programming for dummies”, and it’s great for that – and languages on top like Pig Latin and HiveQL are better programming abstractions for “data tasks” but have runtimes that could be much better. We started over, much as the recent flurry of Big Data analytics platforms are now doing (e.g., Impala, Spark, and friends), but with a focus on scaling to memory-challenging data sizes. We do have a MapReduce compatibility layer that goes along with our Hyracks runtime layer – Hyracks is the name of our internal dataflow runtime layer – but our MapReduce compatibility layer is not related to (or connected to) the AsterixDB system.

Q13. How does AsterixDB relate to Hadapt?

Mike Carey: I’m not familiar with Hadapt, per se, but I read the HadoopDB work that fed into it. 
We’re architecturally very different – we’re not Hadoop-based at all – I’d say that HadoopDB was more of an expedient hybrid coupling of Hadoop and databases, to get some of the indexing and local query efficiency of an existing database engine quickly in the Hadoop world. We were thinking longer term, starting from first principles, about what a next-generation BDMS might look like. AsterixDB is what we came up with.

Q14. How does AsterixDB relate to Spark?

Mike Carey: Spark is aimed at fast Big Data analytics – its data is coming from HDFS, and the task at hand is to scan and slice and dice and process that data really fast. Things like Shark and SparkSQL give users SQL query power over the scanned data, but Spark in general is really catching fire, it appears, due to its applicability to Big Machine Learning tasks. In contrast, we’re doing Big Data Management – we store and index and query Big Data. It would be a very interesting/useful exercise for us to explore how to make AsterixDB another source where Spark computations can get input data from and send their results to, as we’re not targeting the more complex, in-memory computations that Spark aims to support.

Q15. How can others contribute to the project?

Mike Carey: We would love to see this start happening – and we’re finally feeling more ready for that, and even have some NSF funding to make AsterixDB something that others in the Big Data community can utilize and share. 
(Note that our system is Apache-style open source licensed, so there are no “gotchas” lurking there.)
Some possibilities are:

(1) Others can start to use AsterixDB to do real exploratory Big Data projects, or to teach about Big Data (or even just semistructured data) management. Each time we’ve worked with trial users we’ve gained some insights into our feature set, our query optimizations, and so on – so this would help contribute by driving us to become better and better over time.

(2) Folks who are studying specific techniques for dealing with modern data – e.g., new structures for indexing spatio-temporal-textual data – might consider using AsterixDB as a place to try out their new ideas.
(This is not for the meek, of course, as right now effective contributors need to be good at reading and understanding open source software without the benefit of a plethora of internal design documents or other hints.) We also have some internal wish lists of features we wish we had time to work on – some of which are even doable from “outside”, e.g., we’d like to have a much nicer browser-based workbench for users to use when interacting with and managing an AsterixDB cluster.

(3) Students or other open source software enthusiasts who download and try our software and get excited about it – who then might want to become an extension of our team – should contact us and ask about doing so. (Try it first, though!)  We would love to have more skilled hands helping with fixing bugs, polishing features, and making the system better – it’s tough to build robust software in a university setting, and we would especially welcome contributors from companies.

Thanks very much for this opportunity to share what we’ve being doing!

————————
Michael J. Carey is a Bren Professor of Information and Computer Sciences at UC Irvine.
Before joining UCI in 2008, Carey worked at BEA Systems for seven years and led the development of BEA’s AquaLogic Data Services Platform product for virtual data integration. He also spent a dozen years teaching at the University of Wisconsin-Madison, five years at the IBM Almaden Research Center working on object-relational databases, and a year and a half at e-commerce platform startup Propel Software during the infamous 2000-2001 Internet bubble. Carey is an ACM Fellow, a member of the National Academy of Engineering, and a recipient of the ACM SIGMOD E.F. Codd Innovations Award. His current interests all center around data-intensive computing and scalable data management (a.k.a. Big Data).

Resources

– AsterixDB Big Data Management System (BDMS): Downloads, Documentation, Asterix Publications.

Related Posts

Hadoop at Yahoo. Interview with Mithun Radhakrishnan. ODBMS Industry Watch, September 21, 2014

On the Hadoop market. Interview with John Schroeder. ODBMS Industry Watch, June 30, 2014

Follow ODBMS.org on Twitter: @odbmsorg
##

Interview with John Partridge, President & CEO of Tokutek, Inc.
http://www.odbms.org/blog/2014/05/interview-john-partridge-president-ceo-tokutek/
Fri, 09 May 2014

“As the database gets used, shards can grow at an uneven rate and one shard might carry a majority of the load. MongoDB corrects this by balancing shards, but because of MongoDB’s lack of concurrency this operation can stall the database unacceptably.”–John Partridge.

I have interviewed John Partridge, President & CEO of Tokutek, Inc.

RVZ

Q1. Tokutek recently announced to have eliminated performance issues of MongoDB sharding. What was the problem?

John Partridge: The problem occurs after a shard is created. As the database gets used, shards can grow at an uneven rate and one shard might carry a majority of the load. MongoDB corrects this by balancing shards, but because of MongoDB’s lack of concurrency this operation can stall the database unacceptably (see the benchmark).

Q2. For what kind of application users of MongoDB experienced these bottlenecks?

John Partridge: Users who need to scale out, and rely on sharding to do so.

Q3. What is the solution you propose to this problem?

John Partridge: Use TokuMX 1.4. It is a complete distribution and a drop-in replacement for the distribution you get from MongoDB/10gen.

Q4. How TokuMX v1.4 is able to allow shards to be balanced and added without disruption for a NoSQL solution that scales up and scales out?

John Partridge: TokuMX replaces the B-tree indexing used in MongoDB with patented Fractal Tree indexing, which allows for significantly better concurrency (among other things). Because of the improved concurrency, data can be copied from one shard to another, then deleted from the source, without unnecessary locking.

Q5. What is the difference in performance of your solution with respect to the basic MongoDB? What “basic” MongoDB do you use for this comparison?

John Partridge: “Basic” MongoDB is the distro that you get from MongoDB (10gen). We typically see 20x performance improvements, but as you might imagine, it depends on the application. Because TokuMX offers document-level locking rather than database-level locking, TokuMX shines when there are significant reads *and* writes.
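
The concurrency difference can be pictured in terms of lock granularity: one lock for the whole database serializes all writers, while per-document locks let writers on different documents proceed in parallel. This is only a conceptual sketch, not TokuMX or MongoDB internals:

```python
import threading
from collections import defaultdict

DB_LOCK = threading.Lock()                  # database-level locking
DOC_LOCKS = defaultdict(threading.Lock)     # document-level locking
DOCS = {}                                   # (real code would also protect this table)

def write_db_level(doc_id, value):
    with DB_LOCK:                           # every write blocks every other write
        DOCS[doc_id] = value

def write_doc_level(doc_id, value):
    with DOC_LOCKS[doc_id]:                 # writes to different docs don't block each other
        DOCS[doc_id] = value
```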

Q6. How do you compare TokuMX with other distribution of MongoDB, such as the one of 10gen (now MongoDB)?

John Partridge: There are three major differences: 20x performance improvement, 90% smaller database size (we compress the data), and support for ACID transactions. Look at the bottom of http://www.tokutek.com/products/tokumx-for-mongodb/ for more information on each of these benefits.
——————————
John Partridge
Mr. Partridge brings over twenty years of experience in the software industry as a developer, investor, and entrepreneur. He joins Tokutek from StreamBase Systems which John co-founded with database pioneer Dr. Michael Stonebraker. He started his career as a software developer at Microsoft Corporation where he co-authored Excel v1.0. He later worked as a venture capitalist at Accel Partners and the Summit Accelerator Fund where he specialized in investing in early stage internet infrastructure and enterprise software companies. John holds an A.B. in Applied Mathematics / Computer Science from Harvard University and an MBA from the Stanford University Graduate School of Business.

Related Post

Big Data for Genomic Sequencing. Interview with Thibault de Malliard.
ODBMS Industry Watch, March 25, 2013

Scaling MySQL and MariaDB to TBs: Interview with Martín Farach-Colton.
ODBMS Industry Watch, October 8, 2012

Resources

TokuMX vs. MongoDB : Sharding Balancer Performance Posted on February 16, 2014 by Tim Callaghan, Tokutek.

What’s new in TokuMX 1.4, Part 4: Smaller, faster sharded clusters. Posted on February 20, 2014 by Leif Walsh, Tokutek.

MongoDB web site

Follow ODBMS.org on Twitter: @odbmsorg
##

On SQL and NoSQL. Interview with Dave Rosenthal http://www.odbms.org/blog/2014/03/dave-rosenthal/ http://www.odbms.org/blog/2014/03/dave-rosenthal/#comments Tue, 18 Mar 2014 16:04:45 +0000 http://www.odbms.org/blog/?p=2932

“Despite the obvious shared word ‘transaction’ and the canonical example of a database transaction which modifies multiple bank accounts, I don’t think that database transactions are particularly relevant to financial applications.”–Dave Rosenthal.

On SQL and NoSQL, I have interviewed Dave Rosenthal, CEO of FoundationDB.

RVZ

Q1. What are the suggested criteria for users when they need to trade durability for lower latency, higher throughput and write availability?

Dave Rosenthal: There is a tradeoff between commit latency and durability–especially in distributed databases. At one extreme, a database client can just report success immediately (without even talking to the database server) and buffer the writes in the background. Obviously, that hides latency well, but you could lose a suffix of transactions. At the other extreme, you can replicate writes across multiple machines, fsync them on each of the machines, and only then report success to the client.

FoundationDB is optimized to provide good performance in its default setting, which is the safest end of that tradeoff.

Usually, if you want some reasonable durability guarantee, you are talking about a commit latency of a small constant factor times the network latency. So, the real latency issues come with databases spanning multiple data centers. In that case, FoundationDB users are able to choose whether they want durability guarantees in all data centers before commit (increasing commit latencies), which is our default setting, or whether they would like to relax durability guarantees by returning a commit when the data is fsync'd to disk in just one data center.

All that said, in general, we think that the application is usually a more appropriate place to try to hide latency than the database.
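
This latency-versus-durability dial is not unique to FoundationDB; most distributed databases expose it in some form. As an illustration only (using MongoDB write concerns via pymongo, not FoundationDB's API; the host and database names are assumptions), the two extremes described above look roughly like this:

# Sketch of the durability/latency trade-off using MongoDB write concerns (pymongo).
# This illustrates the general dial discussed above; it is not FoundationDB's API.
from pymongo import MongoClient, WriteConcern

client = MongoClient("mongodb://localhost:27017")

# Fast but risky: fire-and-forget, the driver does not wait for acknowledgement,
# so a crash can silently lose the most recent writes (a "suffix of transactions").
fast = client.get_database("demo", write_concern=WriteConcern(w=0))
fast.events.insert_one({"kind": "click"})

# Slower but safe: wait until a majority of replicas have journaled the write to disk
# before reporting success -- roughly the "replicate and fsync everywhere" extreme.
safe = client.get_database("demo", write_concern=WriteConcern(w="majority", j=True))
safe.events.insert_one({"kind": "payment"})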

Q2. Justin Sheehy of Basho in an interview said [1] “I would most certainly include updates to my bank account as applications for which eventual consistency is a good design choice. In fact, bankers have understood and used eventual consistency for far longer than there have been computers in the modern sense”. What is your opinion on this?

Dave Rosenthal: Yes, we totally agree with Justin. Despite the obvious shared word ‘transaction’ and the canonical example of a database transaction which modifies multiple bank accounts, I don’t think that database transactions are particularly relevant to financial applications. In fact, true ACID transactions are way more broadly important than that. They give you the ability to build abstractions and systems that you can provide guarantees about.
As Michael Cahill says in his thesis which became the SIGMOD paper of the year: “Serializable isolation enables the development of a complex system by composing modules, after verifying that each module maintains consistency in isolation.” It’s this incredibly important ability to compose that makes a system with transactions special.

Q3. FoundationDB claims to provide full ACID transactions. How do you do that?

Dave Rosenthal: In the same basic way as many other transactional databases do. We use a few strategies that tend to work well in distributed systems, such as optimistic concurrency and MVCC. We also, of course, have had to solve some of the fundamental challenges associated with distributed systems and all of the crazy things that can happen in them. Honestly, it's not very hard to build a distributed transactional database. The hard part is making it work gracefully through failure scenarios and making it run fast.

Q4. Is this similar to Oracle NoSQL?

Dave Rosenthal: Not really. Both Oracle NoSQL and FoundationDB provide an automatically-partitioned key-value store with fault tolerance. Both also have a concept of ordering keys (for efficient range operations) though Oracle NoSQL only provides ordering “within a Major Key set”. So, those are the similarities, but there are a bunch of other NoSQL systems with all those properties. The huge difference is that FoundationDB provides for ACID transactions over arbitrary keys and ranges, while Oracle NoSQL does not.
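
For readers who want to see what "ACID transactions over arbitrary keys and ranges" looks like in code, here is a minimal sketch using FoundationDB's Python bindings (the fdb package); the API version number, cluster setup and key naming are illustrative assumptions.

# Sketch: a multi-key transaction and a range read with the FoundationDB Python bindings.
import fdb

fdb.api_version(300)   # illustrative; must be called once before opening the database
db = fdb.open()        # uses the default cluster file

@fdb.transactional
def move_item(tr, src_key, dst_key, value):
    # Both operations commit atomically: either the item appears under dst_key
    # and disappears from src_key, or nothing changes at all.
    tr[dst_key] = value
    del tr[src_key]

@fdb.transactional
def list_prefix(tr, prefix):
    # A consistent snapshot of every key starting with `prefix`.
    return [(kv.key, kv.value) for kv in tr.get_range_startswith(prefix)]

move_item(db, b"inbox/42", b"archive/42", b"hello")
print(list_prefix(db, b"archive/"))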

Q5. How would you compare your product offering with respect to NoSQL data stores, such as CouchDB, MongoDB, Cassandra and Riak, and NewSQL such as NuoDB and VoltDB?

Dave Rosenthal: The most obvious response for the NoSQL data stores would be “we have ACID transactions, they don’t”, but the more important difference is in philosophy and strategy.

Each of those products exposes a single data model and interface. Maybe two. We are pursuing a fundamentally different strategy.
We are building a storage substrate that can be adapted, via layers, to provide a variety of data models, APIs, and true flexibility.
We can do that because of our transactional capabilities. CouchDB, MongoDB, Cassandra and Riak all have different APIs, and we talk to companies that run all of those products side-by-side. The NewSQL database players are also offering a single data model, albeit a very popular one: SQL. FoundationDB offers an ever-increasing number of data models through its "layers", currently including several popular NoSQL data models, with SQL being the next big one to hit. Our philosophy is that you shouldn't have to increase the complexity of your architecture by adopting a new NoSQL database each time your engineers need access to a new data model.

Q6. Cloud computing and open source: How does it relate to FoundationDB?

Dave Rosenthal: Cloud computing: FoundationDB has been designed from the beginning to run well in cloud environments that make use of large numbers of commodity machines connected through a network. Probably the most important aspect of a distributed database designed for cloud deployment is exceptional fault tolerance under very harsh and strange failure conditions – the kind of exceptionally unlikely things that can only happen when you have many machines working together with components failing unpredictably. We have put a huge amount of effort into testing FoundationDB in these grueling scenarios, and feel very confident in our ability to perform well in these types of environments. In particular, we have users running FoundationDB successfully on many different cloud providers, and we’ve seen the system keep its guarantees under real-world hardware and network failure conditions experienced by our users.

Open source: Although FoundationDB's core data storage engine is closed source, our layer ecosystem is open source. The core data storage engine has a very simple feature set and is very difficult to modify properly while maintaining correctness; layers, by contrast, are feature rich and, because they are stateless, much easier to create and modify, which makes them well suited to third-party contributions.

Q7. Please give some examples of use cases where FoundationDB is currently in use. Is FoundationDB in use for analyzing Big Data as well?

Dave Rosenthal: Some examples: user data, metadata, user social graphs, geo data, data accessed via ORMs using the SQL layer, metrics collection, etc.

We’ve mostly focused on operational systems, but a few of our customers have built what I would call “big data” applications, which I think of as analytics-focused. The most common use case has been for collecting and analyzing time-series data. FoundationDB is strongest in big data applications that call for lots of random reads and writes, not just big table scans—which many systems can do well.

Q8. Rick Cattell said in a recent interview [2]: "there aren't enough open source contributors to keep projects competitive in features and performance, and the companies supporting the open source offerings will have trouble making enough money to keep the products competitive themselves". What is your opinion on this?

Dave Rosenthal: People have great ideas for databases all the time. New data models, new query languages, etc.
If nothing else, the NoSQL experiment that we've all been a part of over the past few years has shown us the appetite for data models suited to specific problems. People with these ideas would love to be able to build such tools, open source them, etc.
The problem is that the checklist of practical considerations for a database is huge: Fault tolerance, scalability, a backup solution, management and monitoring, ACID transactions, etc. Add those together and even the simplest concept sounds like a huge project.

Our vision at FoundationDB is that we have done the hard work to build a storage substrate that simultaneously solves all those tricky practical problems. Our engine can be used to quickly build a database layer for any particular application that inherits all of those solutions and their benefits, like scalability, fault tolerance and ACID compliance.

Q9. Nick Heudecker of Gartner predicts that [3] "going forward, we see the bifurcation between relational and NoSQL DBMS markets diminishing over time". What is your take on this?

Dave Rosenthal: I do think that the lines between SQL and NoSQL will start to blur, and I believe that we are leading that charge. We acquired another database startup last year, called Akiban, that builds an amazing SQL database engine.
In 2014 we’ll be bringing that engine to market as a layer running on top of FoundationDB. That will be a true ANSI SQL database operating as a module directly on top of a transactional “NoSQL” engine, inheriting the operational benefits of our core storage engine – scalability, fault tolerance, ease of operation.

When you run multiple SQL layer modules, you can point many of them at the same key-space in FoundationDB and it's as if they are all part of the same database, with ACID transactions enforced across the separate SQL layer processes.
It’s very cool. Of course, you can even run the SQL layer on a FoundationDB cluster that’s also supporting other data models, like graph or document. That’s about as blurry as it gets.

———–
Dave Rosenthal is CEO of FoundationDB. Dave started his career in games, building a 3D real-time strategy game with a team of high-school friends that won the 1st annual Independent Games Festival. Previously, Dave was CTO at Visual Sciences, a pioneering web-analytics company that is now part of Adobe. Dave has a degree in theoretical computer science from MIT.

Related Posts
Operational Database Management Systems. Interview with Nick Heudecker, ODBMS Industry Watch December 16, 2013

Follow ODBMS.org on Twitter: @odbmsorg

 

Big Data Analytics at Thomson Reuters. Interview with Jochen L. Leidner http://www.odbms.org/blog/2013/11/big-data-analytics-at-thomson-reuters-interview-with-jochen-l-leidner/ http://www.odbms.org/blog/2013/11/big-data-analytics-at-thomson-reuters-interview-with-jochen-l-leidner/#comments Fri, 15 Nov 2013 09:21:06 +0000 http://www.odbms.org/blog/?p=2779

“My experience overall with almost all open-source tools has been very positive: open source tools are very high quality, well documented, and if you get stuck there is a helpful and responsive community on mailing lists or Stack Exchange and similar sites.” —Dr. Jochen L. Leidner.

I wanted to know how Thomson Reuters uses Big Data. I have interviewed Dr. Jochen L. Leidner, Lead Scientist, of the London R&D at Thomson Reuters.

RVZ

Q1. What is your current activity at Thomson Reuters?
Jochen L. Leidner: For the most part, I carry out applied research in information access, and that's what I have been doing for quite a while. After five years with the company – I joined from the University of Edinburgh, where I had been a postdoctoral Royal Society of Edinburgh Enterprise Fellow half a decade ago – I am currently a Lead Scientist with Thomson Reuters, where I am building up a newly-established London site as part of our Corporate Research & Development group (by the way: we are hiring!). Before that, I held research and innovation-related roles in the USA and in Switzerland.
Let me say a few words about Thomson Reuters before I go more into my own activities, just for background. Thomson Reuters has around 50,000 employees in over 100 countries and sells information to professionals in many verticals, including finance & risk, legal, intellectual property & scientific, tax & accounting.
Our headquarters are located at 3 Times Square in New York City, NY, USA.
Most people know our REUTERS brand from reading their newspapers (thanks to our highly regarded 3,000+ journalists at news desks in about 100 countries, often putting their lives at risk to give us reliable reports of the world's events) or receiving share price information on the radio or TV, but as a company we are also involved in areas as diverse as weather prediction (as the weather influences commodity prices), determining the citation impact of academic journals (which helps publishers sell their academic journals to librarians), and predicting Nobel prize winners.
My research colleagues and I study information access and especially means to improve it, using natural language processing, information extraction, machine learning, search engine ranking, recommendation systems and similar areas of investigation.
We carry out a lot of contract research for internal business units (especially if external vendors do not offer what we need, or if we believe we can build something internally that is lower cost and/or better suited to our needs), feasibility studies to de-risk potential future products that are considered, and also more strategic, blue-sky research that anticipates future needs. As you would expect, we protect our findings and publish them in the usual scientific venues.

Q2. Do you use Data Analytics at Thomson Reuters and for what?
Jochen L. Leidner: Note that terms like “analytics” are rather too broad to be useful in many instances; but the basic answer is “yes”, we develop, apply internally, and sell as products to our customers what can reasonably be called solutions that incorporate “data analytics” functions.
One example capability that we developed is the recommendation engine CaRE (Al-Kofahi et al., 2007), which was built by our group, Corporate Research & Development, led by our VP of Research & Development, Khalid al-Kofahi. It is a bit similar in spirit to Amazon's well-known book recommendations. We used it to recommend legal documents to attorneys, as a service to supplement the user's active search on our legal search engine with "see also…"-type information. This is an example of a capability developed in-house that made it into a product, and it is very popular.
Thomson Reuters sells information services, often under a subscription model, and for that it is important to have metrics available that indicate usage, in order to inform our strategy. So another example of data analytics is that we study how document usage can inform personalization and ranking, and from where documents are accessed, and we use this to plan network bandwidth and to determine caching server locations.
A completely different example is citation information: Since 1969, when our esteemed colleague Eugene Garfield (he is now officially retired, but is still active) came up with the idea of citation impact, our Scientific business division is selling the journal citation impact factor – an analytic that can be used as a proxy for the importance of a journal (and, by implication, as an argument to a librarian to purchase a subscription of that journal for his or her university library).
Or, to give another example from the financial markets area, we sell predictive models (StarMine) that estimate how likely it is that a given company will go bankrupt within the next six months.

Q3. Do you have Big Data at Thomson Reuters? Could you please give us some examples of Big Data Use Cases at your company?
Jochen L. Leidner: For most definitions of "big", yes we do. Consider that we operate a news organization, which generates tens of thousands of news reports daily (if we count all languages together). Then we have photo journalists who create large numbers of high-quality, professional photographs to document current events visually, and videos comprising audio-visual storytelling and interviews. We further collect all major laws, statutes, regulations and legal cases in major jurisdictions around the world, and we enrich the data with our own meta-data using both manual expertise and automatic classification and tagging tools to enhance findability. We hold collections of scientific articles and patents in full text and abstracts.
We gather, consolidate and distribute price information for financial instruments from hundreds of exchanges around the world. We sell real-time live feeds as well as access to decades of these time series for the purpose of back-testing trading strategies.

Q4. What “value” can be derived by analyzing Big Data at Thomson Reuters?
Jochen L. Leidner: This is the killer question: we take the “value” very much without the double quotes – big data analytics lead to cost savings as well as generate new revenues in a very real, monetary sense of the word “value”. Because our solutions provide what we call “knowledge to act” to our customers, i.e., information that lets them make better decisions, we provide them with value as well: we literally help our customers save the cost of making a wrong decision.

Q5. What are the main challenges for big data analytics at Thomson Reuters ?
Jochen L. Leidner: I'd say absolute volume, growth, data management/integration, rights management, and privacy are some of the main challenges.
One obvious challenge is the size of the data. It’s not enough to have enough persistent storage space to keep it, we also need backup space, space to process it, caches and so on – it all adds up. Another is the growth and speed of that growth of the data volume. You can plan for any size, but it’s not easy to adjust your plans if unexpected growth rates come along.
Another challenge that we must look into is the integration between our internal data holdings, external public data (like the World Wide Web, or commercial third-party sources – like Twitter – which play an important role in the modern news ecosystem), and customer data (customers would like to see their own internal, proprietary data be inter-operable with our data). We need to respect the rights associated with each data set, as we deal with our own data, third party data and our customers’ data. We must be very careful regarding privacy when brainstorming about the next “big data analytics” idea – we take privacy very seriously and our CPO joined us from a government body that is in charge of regulating privacy.

Q6. How do you handle the Big Data Analytics “process” challenges with deriving insight?
Jochen L. Leidner: Usually analytics projects happen as an afterthought to leverage existing data created in a legacy process, which means not a lot of change of process is needed at the beginning. This situation changes once there is resulting analytics output, and then the analytics-generating process needs to be integrated with the previous processes.
Even with new projects, product managers still don’t think of analytics as the first thing to build into a product for a first bare-bones version, and we need to change that; instrumentation is key for data gathering so that analytics functionality can build on it later on.
In general, analytics projects follow a process of (1) capturing data, (2) aligning data from different sources (e.g., resolving when two objects are the same), (3) pre-processing or transforming the data into a form suitable for analysis, (4) building some model and (5) understanding the output (e.g. visualizing and sharing the results). This five-step process is followed by an integration phase into the production process to make the analytics repeatable.
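
The five steps map naturally onto everyday tooling. Here is a toy sketch in Python (pandas and scikit-learn); the file names, columns and model choice are invented for illustration and are not Thomson Reuters' actual pipeline:

# Toy sketch of the five-step analytics process described above.
import pandas as pd
from sklearn.linear_model import LinearRegression

# (1) capture: load raw data from two sources
usage = pd.read_csv("usage_log.csv")      # e.g. document views per customer
accounts = pd.read_csv("accounts.csv")    # e.g. customer master data

# (2) align: resolve records that refer to the same customer
df = usage.merge(accounts, on="customer_id", how="inner")

# (3) pre-process: clean and transform into a modelling-friendly form
df = df.dropna(subset=["views", "tenure_years"])

# (4) model: fit a simple predictive model
model = LinearRegression().fit(df[["tenure_years"]], df["views"])

# (5) understand the output: inspect / report the result
print("views per extra year of tenure:", model.coef_[0])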

Q7. What kind of data management technologies do you use? What is your experience in using them?
Jochen L. Leidner: For storage and querying, we use relational database management systems to store valuable data assets and also use NoSQL databases such as CouchDB and MongoDB in projects where applicable. We use homegrown indexing and retrieval engines as well as open source libraries and components like Apache Lucene, Solr and ElasticSearch.
We use parallel, distributed computing platforms such as Hadoop/Pig and Spark to process data, and virtualization to manage isolation of environments.
We sometimes push out computations to Amazon’s EC2 and storage to S3 (but this can only be done if the data is not sensitive). And of course in any large organization there is a lot of homegrown software around. My experience overall with almost all open-source tools has been very positive: open source tools are very high quality, well documented, and if you get stuck there is a helpful and responsive community on mailing lists or Stack Exchange and similar sites.

Q8. Do you handle un-structured data? If yes, how?
Jochen L. Leidner: We curate lots of our own unstructured data from scratch, including the thousands of news stories that our over three thousand REUTERS journalists write every day, and we also enrich the unstructured content produced by others: for instance, in the U.S. we manually enrich legal cases with human-written summaries (written by highly-qualified attorneys) and classify them based on a proprietary taxonomy, which informs our market-leading WestlawNext product (see this CIKM 2011 talk), our search engine for the legal profession, whose members need to find exactly the right cases. Over the years, we have developed proprietary content repositories to manage content storage, meta-data storage, indexing and retrieval. One of our challenges is to unite our data holdings, which often come from various acquisitions that use their own technology.
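
Automatic classification against a taxonomy, as described above, is conceptually close to standard supervised text classification. Here is a toy sketch with scikit-learn; the documents, labels and model choice are invented for illustration and this is not the actual Thomson Reuters pipeline:

# Toy sketch of taxonomy-style text classification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_docs = [
    "The court granted summary judgment on the contract claim.",
    "The patent covers a method for encoding video streams.",
    "The defendant was convicted of securities fraud.",
]
train_labels = ["contracts", "intellectual-property", "securities"]

# TF-IDF features plus a linear classifier is a common, simple baseline.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_docs, train_labels)

print(clf.predict(["A dispute over a licensing agreement for software patents."]))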

Q9. Do you use Hadoop? If yes, what is your experience with Hadoop so far?
Jochen L. Leidner: We have two clusters featuring Hadoop, Spark and GraphLab, and we are using them intensively. Hadoop is mature as an implementation of the MapReduce computing paradigm, but has its shortcomings because it is not a true operating system (but probably it should be) – for instance regarding the stability of HDFS, its distributed file system. People have started to realize there are shortcomings, and have started to build other systems around Hadoop to fill gaps, but these are still early stage, so I expect them to become more mature first and then there might be a wave of consolidation and integration. We have definitely come a long way since the early days.
Typically, on our clusters we run batch information extraction tasks, data transformation tasks and large-scale machine learning processes to train classifiers and taggers for these. We are also inducing language models and training recommender systems on them. Since many training algorithms are iterative, Spark can win over Hadoop for these, as it keeps models in RAM.
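
The advantage of keeping the working set in RAM across iterations can be seen in a minimal PySpark sketch of gradient descent over a cached data set; the HDFS path and the toy linear model are illustrative assumptions:

# Minimal PySpark sketch: iterative gradient descent over a cached RDD.
from pyspark import SparkContext

sc = SparkContext(appName="iterative-sketch")

# Each input line holds "x y"; cache() keeps the parsed points in cluster memory,
# so every iteration below avoids re-reading and re-parsing from HDFS.
points = (sc.textFile("hdfs:///data/points.txt")
            .map(lambda line: tuple(float(v) for v in line.split()))
            .cache())
n = points.count()

w0, w1 = 0.0, 0.0
for _ in range(10):   # each pass reuses the cached RDD instead of hitting disk
    g0, g1 = points.map(
        lambda p: ((w0 + w1 * p[0] - p[1]),           # gradient w.r.t. the intercept
                   (w0 + w1 * p[0] - p[1]) * p[0])    # gradient w.r.t. the slope
    ).reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    w0, w1 = w0 - 0.1 * g0 / n, w1 - 0.1 * g1 / n

print("fitted line:", w0, w1)
sc.stop()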

Q10. Hadoop is a batch processing system. How do you handle Big Data Analytics in real time (if any)?
Jochen L. Leidner: The dichotomy is perhaps between batch processing and dialog processing, whereas real-time (“hard”, deterministic time guarantee for a system response) goes hand-in-hand with its opposite non-real time, but I think what you are after here is that dialog systems have to be responsive. There is no one-size-fits-all method for meeting (near) real-time requirements or for making dialog systems more responsive; a lot of analytics functions require the analysis of more than just a few recent data-points, so if that’s what is needed it may take its time. But it is important that critical functions, such as financial data feeds, are delivered as fast as possible – micro-seconds matter here. The more commoditized analytics functions become, the faster they need to be available to retain at least speed as a differentiator.

Q11 Cloud computing and open source: Do you they play a role at Thomson Reuters? If yes, how?
Jochen L. Leidner: Regarding cloud computing, we use cloud services internally and as part of some of our product offerings. However, there are also reservations – a lot of our applications contain information that is too sensitive to entrust to a third party, especially as many cloud vendors cannot give you a commitment with respect to hosting (or not hosting) in particular jurisdictions. Therefore, we operate our own set of data centers, and some of these operate as what has become known as "private clouds", retaining the benefit of the management outsourcing abstraction, but within our large organization rather than pushing it out to a third party. Of course, the notion of private clouds takes the cloud idea ad absurdum quite a bit, because it sacrifices the economy of scale, but having more control is an advantage.
Open source plays a huge role at Thomson Reuters – we rely on many open source components, libraries and systems, especially under the MIT, BSD, LGPL and Apache licenses. For example, some of our tagging pipelines rely on Apache UIMA, which is a contribution originally developed at IBM, and which has seen contributions by researchers from all around the world (from Darmstadt to Boulder). To date, we have not been very good about opening up our own services in the form of source code, but we are trying to change that now, and we have just launched a new corporation-wide process for open-sourcing software. We also have an internal sharing repository, "Corporate Source", but in my personal view the audience in any single company is too small – open source (like other recent waves such as clouds or crowdsourcing) needs Internet scale to work, and Web search engines for the various projects to be discovered.

Q12 What are the main research challenges ahead? And what are the main business challenges ahead?
Jochen L. Leidner: Some of the main business challenges are the cost pressure that some of our customers face, and the increasing availability of low-cost or free-of-charge information sources, i.e. the commoditization of information. I would caution here that whereas the amount of information available for free is large, this in itself does not help you if you have a particular problem and cannot find the information that helps you solve it, either because the solution is not there despite the size, or because it is there but findability is low. Further challenges include information integration, making systems ever more adaptive (but only to the extent it is useful), and supporting better personalization. Having said this, sometimes systems need to be run in a non-personalized mode (e.g. in the field of e-discovery, you need a certain consistency, namely that the same legal search system retrieves the same things today and tomorrow, and for different parties).

Q13 Anything else you wish to add?
Jochen L. Leidner: I would encourage decision makers of global companies not to get misled by fast-changing “hype” language: words like “cloud”, “analytics” and “big data” are too general to inform a professional discussion. Putting my linguist’s hat on, I can only caution about the lack of precision inherent in marketing language’s use in technological discussions: for instance, cluster computing is not the same as grid computing. And what looks like “big data” today we will almost certainly carry around with us on a mobile computing device tomorrow. Also, buzz words like “Big Data” do not by themselves solve any problem – they are not magic bullets. To solve any problem, look at the input data, specify the desired output data, and think hard about whether and how you can compute the desired result – nothing but “good old” computer science.

Dr. Jochen L. Leidner is a Lead Scientist with Thomson Reuters, where he is building up the corporation’s London site of its Research & Development group.
He holds Master’s degrees in computational linguistics, English language and literature and computer science from Friedrich-Alexander University Erlangen-Nuremberg and in computer speech, text and internet technologies from the University of Cambridge, as well as a Ph.D. in information extraction from the University of Edinburgh (“Toponym resolution in Text”). He is recipient of the first ACM SIGIR Doctoral Consortium Award, a Royal Society of Edinburgh Enterprise Fellowship in Electronic Markets, and two DAAD scholarships.
Prior to his research career, he has worked as a software developer, including for SAP AG (basic technology and knowledge management) as well as for startups.
He led the development teams of multiple question answering systems, including the QED system at Edinburgh and Alyssa at Saarland University, the latter of which ranked third at the last DARPA/NIST TREC open-domain question answering factoid track evaluation.
His main research interests include information extraction, question answering and search, geo-spatial grounding, applied machine learning with a focus on methodology behind research & development in the area of information access.
In 2013, Dr. Leidner has also been teaching an invited lecture course “Language Technology and Big Data” at the University of Zurich, Switzerland.

—————————-
Related Posts

Data Analytics at NBCUniversal. Interview with Matthew Eric Bassett. September 23, 2013

On Linked Data. Interview with John Goodwin. September 1, 2013

Big Data Analytics at Netflix. Interview with Christos Kalantzis and Jason Brown. February 18, 2013

Resources

ODBMS.org free resources on Big Data and Analytical Data Platforms:
Blog Posts | Free Software| Articles | Lecture Notes | PhD and Master Thesis|

Follow us on Twitter: @odbmsorg

On multi-model databases. Interview with Martin Schönert and Frank Celler. http://www.odbms.org/blog/2013/10/on-multi-model-databases-interview-with-martin-schonert-and-frank-celler/ http://www.odbms.org/blog/2013/10/on-multi-model-databases-interview-with-martin-schonert-and-frank-celler/#comments Mon, 28 Oct 2013 09:21:59 +0000 http://www.odbms.org/blog/?p=2743

“We want to prevent a deadlock where the team is forced to switch the technology in the middle of the project because it doesn’t meet the requirements any longer.”–Martin Schönert and Frank Celler.

On “multi-model” databases, I have interviewed Martin Schönert and Frank Celler, founders and creators of the open source ArangoDB.

RVZ

Q1. What is ArangoDB and for what kind of applications is it designed for?

Frank Celler: ArangoDB is a multi-model mostly-memory database with a flexible data model for documents and graphs. It is designed as a “general purpose database”, offering all the features you typically need for modern web applications.

ArangoDB is supposed to grow with the application—the project may start as a simple single-server prototype, nothing you couldn’t do with a relational database equally well. After some time, some geo-location features are needed and a shopping cart requires transactions. ArangoDB’s graph data model is useful for the recommendation system. The smartphone app needs a lean API to the back-end—this is where Foxx, ArangoDB’s integrated Javascript application framework, comes into play.
The overall idea is: “We want to prevent a deadlock where the team is forced to switch the technology in the middle of the project because it doesn’t meet the requirements any longer.”

ArangoDB is open source (Apache 2 licence)—you can get the source code at GitHub or download the precompiled binaries from our website.

Though ArangoDB takes a universal approach, there are edge cases where we don't recommend it. For instance, ArangoDB doesn't compete with massively distributed systems like Cassandra, with thousands of nodes and many terabytes of data.

Q2. What’s so special about the ArangoDB data model?

Martin Schönert: ArangoDB is a multi-model database. It stores documents in collections. A specialized binary data file format is used for disk storage. Documents that have similar structure (i.e., that have the same attribute names and attribute types) can share their structural information. The structure (called “shape”) is saved just once, and multiple documents can re-use it by storing just a pointer to their “shape”.
In practice, documents in a collection are likely to be homogenous, and sharing the structure data between multiple documents can greatly reduce disk storage space and memory usage for documents.

Q3. Who is currently using ArangoDB for what?

Frank Celler: ArangoDB is open source. You don't have to register to download the source code or precompiled binaries. As a user, you can get support via the Google Group, GitHub's issue tracker and even via Twitter. We are very approachable, which is an essential part of the project. The drawback is that we don't really know in detail what people are doing with ArangoDB. We have noticed an exponentially increasing number of downloads over the last months.
We are aware of a broad range of use cases: a CMS, a high-performance logging component, a geo-coding tool, an annotation system for animations, just to name a few. Other interesting use cases are single page apps or mobile apps via Foxx, ArangoDB’s application framework. Many of our users have in-production experience with other NoSQL databases, especially the leading document stores.

Q4. Could you motivate your design decision to use Google’s V8 JavaScript engine?

Martin Schönert: ArangoDB uses Google’s V8 engine to execute server-side JavaScript functions. Users can write server-side business logic in JavaScript and deploy it in ArangoDB. These so-called “actions” are much like stored procedures living close to the data.
For example, with actions it is possible to perform cascading deletes/updates, assign permissions, and do additional calculations and modifications to the data.
ArangoDB also allows users to map URLs to custom actions, making it usable as an application server that handles client HTTP requests with user-defined business logic.
We opted for Javascript as it meets our requirements for an “embedded language” in the database context:
• Javascript is widely used. Regardless in which “back-end language” web developers write their code, almost everybody can code also in Javascript.
• Javascript is effective and still modern.
We chose Google's V8 as it is currently the fastest, most stable JavaScript interpreter available.

Q5. How do you query ArangoDB if you don’t want to use JavaScript?

Frank Celler: ArangoDB offers a couple of options for getting data out of the database. It has a REST interface for CRUD operations and also allows “querying by example”. “Querying by example” means that you create a JSON document with the attributes you are looking for. The database returns all documents which look like the “example document”.
Expressing complex queries as JSON documents can become a tedious task—and it’s almost impossible to support joins following this approach. We wanted a convenient and easy-to-learn way to execute even complex queries, not involving any programming as in an approach based on map/reduce. As ArangoDB supports multiple data models including graphs, it was neither sufficient to stick to SQL nor to simply implement UNQL. We ended up with the “ArangoDB query language” (AQL), a declarative language similar to SQL and Jsoniq. AQL supports joins, graph queries, list iteration, results filtering, results projection, sorting, variables, grouping, aggregate functions, unions, and intersections.
Of course, ArangoDB also offers drivers for all major programming languages. The drivers wrap the mentioned query options following the paradigm of the programming language and/or frameworks like Ruby on Rails.
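
As an example of what the driver story looks like in practice, here is a hedged sketch that runs an AQL query from Python via the community python-arango driver; the connection details, collection and bind variables are assumptions, and the driver's API may differ between versions:

# Sketch: executing an AQL query from Python with the python-arango driver.
from arango import ArangoClient

client = ArangoClient(hosts="http://localhost:8529")       # illustrative endpoint
db = client.db("_system", username="root", password="")    # illustrative credentials

# AQL supports filtering, sorting and bind variables, as described above.
cursor = db.aql.execute(
    "FOR u IN users FILTER u.age >= @min SORT u.name RETURN u.name",
    bind_vars={"min": 21},
)
for name in cursor:
    print(name)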

Q6. How do you perform graph queries? How does this differ from systems such as Neo4J?

Frank Celler: SQL can’t cope with the required semantics to express the relationships between graph nodes, so graph databases have to provide other ways to access the data.
The first option is to write small programs, so-called "path traversals." In ArangoDB, you use JavaScript; in neo4j, Java. The general approach is very similar.
Programming gives you all the freedom to do whatever comes to your mind. That’s good. For standard use cases, programming might be too much effort. So, both ArangoDB and neo4j offer a declarative language—neo4j has “Cypher,” ArangoDB the “ArangoDB Query Language.” Both also implement the blueprints standard so that you can use “Gremlin” as query-language inside Java. We already mentioned that ArangoDB is a multi-model database: AQL covers documents and graphs, it provides support for joins, lists, variables, and does much more.

The following example is taken from the neo4j website:

“For example, here is a query which finds a user called John in an index and then traverses the graph looking for friends of John’s friends (though not his direct friends) before returning both John and any friends-of-friends that are found.

START john=node:node_auto_index(name = 'John')
MATCH john-[:friend]->()-[:friend]->fof
RETURN john, fof ”

The same query looks in AQL like this:

FOR t IN TRAVERSAL(users, friends, "users/john", "outbound",
{minDepth: 2}) RETURN t.vertex._key

The result is:
[ "maria", "steve" ]

You see that Cypher describes patterns while AQL describes joins. Internally, ArangoDB has a library of graph functions—those functions return collections of paths, which can then be used in joins.

Q7. How did you design ArangoDB to scale out and/or scale up? Please give us some detail.

Martin Schönert: Solid state disks are becoming more and more of a commodity. ArangoDB's append-only design is a perfect fit for such SSDs, allowing for data sets which are much bigger than main memory but still fit onto a solid state disk.
ArangoDB supports master/slave replication in version 1.4, which will be released in the next few days (a beta has been available for some time). On the one hand this provides easy fail-over setups; on the other hand it provides a simple way to scale read performance.
Sharding is implemented in version 2.0. This enables you to store even bigger data-sets and increase the write-performance. As noted before, however, we see our main application when scaling to a low number of nodes. We don’t plan to optimize ArangoDB for massive scaling with hundreds of nodes. Plain key/value stores are much more usable in such scenarios.

Q8. What is ArangoDB’s update and delete strategy?

Martin Schönert: ArangoDB versions prior to 1.3 store all revisions of documents in an append-only fashion; the objects will never be overwritten. The latest version of a document is available to the end user.

With the current version 1.3, ArangoDB introduces transactions and lays the technical foundation for replication and sharding. In the course of those highly wanted features comes "real" MVCC with concurrent writes.

In databases implementing an append-only strategy, obsolete versions of a document have to be removed to save space. As we already mentioned, ArangoDB is multi-threaded: The so-called compaction is automatically done in the background in a different thread without blocking reads and writes.

Q9. How does ArangoDB differ from other NoSQL data stores such as Couchbase and MongoDB and graph data stores such as Neo4j, to name a few?

Frank Celler: ArangoDB's feature scope is driven by the idea of giving the developer everything he needs to master typical tasks in a web application—in a way that is both convenient and technically sophisticated.
From our point of view it's the combination of features and the quality of the product which sets ArangoDB apart: ArangoDB not only handles documents but also graphs.
ArangoDB is extensible via Javascript and Ruby. Enclosed with ArangoDB you get “Foxx”. Foxx is an integrated application framework ideal for lean back-ends and single page Javascript applications (SPA).
Multi-collection transactions are useful not only for online banking and e-commerce but they become crucial in any web app in a distributed architecture. Here again, we offer the developers many choices. If transactions are needed, developers can use them.
If, on the other hand, the problem requires a higher performance and less transaction-safety, developers are free to ignore multi-collections transactions and to use the standard single-document transactions implemented by most NoSQL databases.
Another unique feature is ArangoDB’s query language AQL—it makes querying powerful and convenient. For simple queries, we offer a simple query-by-example interface. Then again, AQL enables you to describe complex filter conditions and joins in a readable format.

Q10. Could you summarize the main results of your benchmarking tests?

Frank Celler: To quote Jan Lenhardt from CouchDB: "Nosql is not about performance, scaling, dropping ACID or hating SQL—it is about choice. As nosql databases are somewhat different it does not help very much to compare the databases by their throughput and choose the one which is fastest. Instead—the user should carefully think about his overall requirements and weigh the different aspects. Massively scalable key/value stores or memory-only system[s] can achieve much higher benchmarks. But your aim is [to] provide a much more convenient system for a broader range of use-cases—which is fast enough for almost all cases."
Anyway, we have done a lot of performance tests and are more than happy with the results. ArangoDB 1.3 inserts up to 140,000 documents per second. We are going to publish the whole test suite including a test runner soon, so everybody can try it out on his own hardware.

We have also tested the space usage: storing 3.5 million AQL search queries takes about 200 MB in MongoDB with pre-allocation, compared to 55 MB in ArangoDB. This is the benefit of implementing the concept of shapes.

Q11. ArangoDB is open source. How do you motivate and involve the open source development community to contribute to your projects rather than any other open source NoSQL?

Frank Celler: To be honest: The contributors come of their own volition and until now we didn’t have to “push” interested parties. Obviously, ArangoDB is fascinating enough, even though there are more than 150 NoSQL databases available to choose from.

It all started when Ruby inventor Yukihiro “Matz” Matsumoto tweeted on ArangoDB and recommended it to the community. Following this tweet, ArangoDB’s first local fan base was established in Japan—and we learned a lot about the limits of automatic translation from Japanese tweets to English and the other way around ;-).

In our daily "work" with our community, we try to be as open and supportive as possible. The core developers communicate directly, and with short response times, with people who have ideas or need help, through Google Groups or GitHub. We take special care of contributors: we discuss future features with them and announce upcoming changes early so that API contributors can keep their implementations up to date.

——————————
Martin Schönert
Martin is the origin of many fancy ideas in ArangoDB. As chief architect he is responsible for the overall architecture of the system, bringing in his experience from more than 20 years in IT as developer, architect, project manager and entrepreneur.
Martin started his career as scientist at the technical university of Aachen after earning his degree in Mathematics. Later he worked as head of product development (Team4 Systemhaus), Director IT (OnVista Technologies) and head of division at Deutsche Post.
Martin has been working with relational and non-relational databases (e.g. a torrid love-hate relationship with the granddaddy of all non-relational databases: Lotus Notes) for the largest part of his professional life.
When no database did what he needed, he wrote his own: one for extremely high update rates and another for distributed caching.

Frank Celler
Frank is both an entrepreneur and a backend developer, and has been developing mostly-memory databases for two decades. He is the lead developer of ArangoDB and co-founder of triAGENS. Besides that, Frank organizes Cologne's NoSQL user group and NoSQL conferences, and speaks at developer conferences.
Frank studied in Aachen and London and received a Ph.D. in Mathematics. Prior to founding triAGENS, the company behind ArangoDB, he worked for several German tech companies as a consultant, team lead and developer.
His technical focus is C and C++; recently he gained some experience with Ruby when integrating mruby into ArangoDB.

Resources

– The stable version (1.3 branch) of ArangoDB can be downloaded here.
ArangoDB on Twitter
ArangoDB Google Group
ArangoDB questions on StackOverflow
Issue Tracker at Github

Related Posts

On geo-distributed data management — Interview with Adam Abrevaya. October 19, 2013

On Big Data and NoSQL. Interview with Renat Khasanshyn. October 7, 2013

On NoSQL. Interview with Rick Cattell. August 19, 2013

On Big Graph Data. August 6, 2012

Follow ODBMS.org on Twitter: @odbmsorg

##

On Big Data and NoSQL. Interview with Renat Khasanshyn. http://www.odbms.org/blog/2013/10/on-big-data-and-nosql-interview-with-renat-khasanshyn/ http://www.odbms.org/blog/2013/10/on-big-data-and-nosql-interview-with-renat-khasanshyn/#comments Mon, 07 Oct 2013 19:50:27 +0000 http://www.odbms.org/blog/?p=2636

“The most important thing is to focus on a task you need to solve instead of a technology” –Renat Khasanshyn.

I have interviewed Renat Khasanshyn, Founder and Chairman of Altoros.
Renat is a NoSQL and Big Data specialist.

RVZ

Q1. In your opinion, what are the most popular players in the NoSQL market?

Khasanshyn: I think MongoDB is definitely one of the key players in the NoSQL market. This database has a long history – for this kind of product, I mean – and good commercial support. For many people this database became the first mass-market NoSQL store. I expect MongoDB to become for NoSQL what MySQL is for relational databases. The second position I would give to Cassandra. It has a great architecture and enables building clusters with geographically dispersed nodes. To me that seems absolutely amazing. In addition, this database is often chosen by big companies that need a large, highly available cluster.

Q2. How do you evaluate and compare different NoSQL Databases?

Khasanshyn: Thank you for an interesting question. How to choose a database? Which one is the best? These are the main questions for any company that wants to try a NoSQL solution. Definitely, for some cases it may be quite easy to select a suitable NoSQL store. However, very often it is not enough just to know the customer's business goals. When we suggest a particular database we take into consideration the following factors: the business issues the NoSQL store should solve, database read/write speed, availability, scalability, and many other important indicators. Sometimes we use a hybrid solution that may include several NoSQL databases.
Or we can see that a relational database will be a good match for a case. The most important thing is to focus on a task you need to solve instead of a technology.

We think that good scalability, performance, and ease of administration are the most important criteria for choosing a NoSQL database. These are the key factors that we take into consideration. Of course, there are some additional criteria that sometimes may be even more important than those mentioned above. To somewhat simplify the choice of a database for our engineers and for many other people, we have been carrying out, for two years now, independent tests that evaluate the performance of different NoSQL databases. Although aimed at comparing performance, these investigations also touch on consistency, scalability, and configuration issues. You can take a look at our most recent article on this subject on Network World. Some new research on this subject is to be published in a couple of months.

Q3. Which NoSQL databases did you evaluate so far, and what are the results did you obtain?

Khasanshyn: We used a great variety of NoSQL databases, for instance, Cassandra, HBase, MongoDB, Riak, Couchbase Server, Redis, etc. for our researches and real-life projects. From this experience, we learned that one should be very careful when choosing a database. It is better to spend more time on architecture design and make some changes to the project in the beginning rather than come across a serious issue in the future.

Q4. For which projects did use NoSQL databases and for what kind of problems?

Khasanshyn: It is hardly possible to name a project for which a NoSQL database would be useless, except for a blog or a home page. As the main use cases for NoSQL stores I would mention the following tasks:

● collecting and analyzing large volumes of data
● scaling large historical databases
● building interactive applications for which performance and fast response time to users’ actions are crucial

The major "drawback" of NoSQL architecture is the absence of an ACID engine that verifies transactions. It means that financial operations or user registration are better handled by an RDBMS like Oracle or MS SQL. However, the absence of ACID allows for significant acceleration and decentralization of NoSQL databases, which are their major advantages. The bottom line: non-relational databases are much faster in general, and they pay for it with a fraction of their reliability. Is it a good tradeoff? It depends on the task.

Q5. What do you see happening with NoSQL, going forward?

Khasanshyn: It’s quite difficult to make any predictions, but we guess that NoSQL and relational databases will become closer. For instance, NewSQL solutions took good scalability from NoSQL and a query language from the SQL products.
Probably, a kind of a standard query language based on SQL or SQL-like language will soon appear for NoSQL stores. We are also looking forward to improved consistency, or to be more precise, better predictability of NoSQL databases. In other words, NoSQL solutions should become more mature. We will also see some market consolidation. Smaller players will form alliances or quit the game. Leaders will take bigger part of the market. We will most likely see a couple of acquisitions. Overall, it will be easier to work with NoSQL and to choose a right solution out of the available options.

Q6. What do you think is still needed for big data analytics to be really useful for the enterprise?

Khasanshyn: It is all about people and their skills. Storage is relatively cheap and available. Variety of databases is enormous and it helps solving virtually any task. Hadoop is stable. Majority of software is open source-based, or at least doesn’t cost a fortune. But all these components are useless without data scientists who can do modeling and simulations on the existing data. As well as without engineers who can efficiently employ the toolset. As well as without managers who understand the outcomes of data handling revolution that happened just recently. When we have these three types of people in place, then we will say that enterprises are fully equipped for making an edge in big data analytics.

Q7. Hadoop is still quite new for many enterprises, and different enterprises are at different stages in their Hadoop journey. When you speak with your customers what are the typical use cases and requirements they have?

Khasanshyn: I agree with you. Some customers just make their first steps with Hadoop while others need to know how to optimize their Hadoop-based systems. Unfortunately, the second group of customers is much smaller. I can name the following typical tasks our customers have:

● To process historical data that has been collected over a long period of time. Some time ago, users were unable to process large volumes of unstructured data due to financial and technical limitations. Now Hadoop can do it at a moderate cost and in a reasonable time.

● To develop a system for data analysis based on Hadoop. Once an hour, the system builds patterns of typical user behavior on a Web site. These patterns help to react to users' actions in real time, for instance, allowing an action or temporarily blocking some actions because they are not typical of this user. The data is collected continuously and analyzed at the same time, so the system can rapidly respond to changes in user behavior.

● To optimize data storage. It is interesting that in some cases HDFS can replace a database, especially when the database was used for storing raw input data. Such projects do not need an additional database level.

I should say that our customers have similar requirements. Apart from solving a particular business task, they need a certain level of performance and data consistency.

Q8. In your opinion is Hadoop replacing the role of OLAP (online analytical processing) in preparing data to answer specific questions?

Khasanshyn: In a few words, my answer is yes. Some specific features of Hadoop make it possible to prepare data for future analysis quickly and at a moderate cost. In addition, this technology can work with unstructured data. However, I do not think it will happen very soon. There are many OLAP systems and they solve their tasks, better or worse. In general, enterprises are very reluctant to change anything. In addition, replacing the OLAP tools requires additional investments. The good news is that we don't have to choose one or the other. Hadoop can be used as a pre-processor of the data for OLAP analysis, and analysts can keep working with the familiar tools.

Q9. How do you categorize the various stages of the Hadoop usage in the enterprises?

Khasanshyn: I would name the following stages of Hadoop usage:

1. Development of prototypes to check out whether Hadoop is suitable for their tasks
2. Using Hadoop in combination with other tools for storing and processing data of some business units
3. Implementation of a centralized enterprise data storage system and gradual integration of all business units into it

Q10. Data Warehouse vs Big “Data Lake”. What are the similarities and what are the differences?

Khasanshyn: Even though Big “Data Lake” is a metaphor, I do not really like it.
This name suggests that it is something limited, isolated from other systems. I would rather call this concept a "Big Data Ocean", because the data added to the system can interact with the surrounding systems. In my opinion, data warehouses are a bit outdated. At an earlier stage, such systems enabled companies to aggregate a lot of data in one central storage and arrange this data, all with acceptable speed. Now there are a lot of cheap storage solutions, so we can keep enormous volumes of data and process it much faster than with data warehouses.

The possibility to store large volumes of data is a common feature of data warehouses and a Big "Data Lake". With a flexible architecture and broad capabilities for data analysis and discovery, a Big "Data Lake" provides a wider range of business opportunities. A modern company should adjust to changes very fast. The structure that was good yesterday may become a limitation today.

Q11. In your opinion, is there a technology which is best suited to build a Big Data Analytics Data Platform? If yes, which one?

Khasanshyn: As I have already said, there is no "magic bullet" that can cure every disease. There is no "Universal Big Data Analytics Data Platform" that fits every case; everything depends on the requirements of a particular business. A system of this kind should have the following features:

● A Hadoop-based system for storing and processing data
● An operational database that contains the most recent data (possibly raw data) to be analyzed. A NoSQL solution can be used for this case.
● A database for storing predicted indicators. Most probably, it should be a relational data store.
● A system that allows for creating data analysis algorithms. The R language can be used to complete this task.
● A report building system that provides access to the data. For instance, there are good options like Tableau or Pentaho.

Q12. What about elastic computing in the Cloud? How does it relate to Big Data Analytics?

Khasanshyn: In my opinion, cloud computing became the force that raised the Big Data wave. Elastic computing lets us use exactly the amount of computing resources we need, and it also reduces the cost of maintaining a large infrastructure.

There is a natural connection between elastic computing and Big Data analytics. It is quite typical for us to have to process data only periodically, for instance once a day. To solve this task, we can deploy a new cluster, or simply scale up an existing Hadoop cluster, in the cloud environment. Temporarily enlarging the Hadoop cluster increases the speed of data processing, so the task is executed faster, and afterwards we can stop the cluster or reduce its size. I would even say that cloud technologies are a must-have component of a Big Data analysis system.
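A hedged sketch of this "temporarily enlarge, then shrink" pattern is shown below, assuming the cluster runs on Amazon EMR and that boto3 is installed with AWS credentials configured. The cluster and instance-group IDs are placeholders, and a different cloud or a self-managed Hadoop installation would need a different API.

import boto3

emr = boto3.client("emr", region_name="us-east-1")

CLUSTER_ID = "j-XXXXXXXXXXXXX"      # placeholder: an existing EMR cluster
CORE_GROUP_ID = "ig-XXXXXXXXXXXX"   # placeholder: its core instance group

def resize_core_group(instance_count):
    # Ask EMR to grow or shrink the core instance group of the cluster.
    emr.modify_instance_groups(
        ClusterId=CLUSTER_ID,
        InstanceGroups=[{"InstanceGroupId": CORE_GROUP_ID,
                         "InstanceCount": instance_count}],
    )

resize_core_group(20)   # scale up before the daily batch job
# ... submit and wait for the Hadoop job here ...
resize_core_group(5)    # scale back down once the job has finished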

——-

Renat Khasanshyn, Founder and Chairman, Altoros.
Renat is founder & CEO of Altoros, and Venture Partner at Runa Capital. Renat helps define Altoros’s strategic vision and its role in the Big Data, Cloud Computing, and PaaS ecosystems. Renat is a frequent conference and meetup speaker on these topics.
Under his supervision Altoros has been servicing such innovative companies as Couchbase, RightScale, Canonical, DataStax, Joyent, Nephoscale, and NuoDB.
In the past, Renat was selected as a finalist for the Emerging Executive of the Year award by the Massachusetts Technology Leadership Council and once won an IBM Business Mashup Challenge. Prior to founding Altoros, Renat was VP of Engineering for Tampa-based insurance company PriMed. Renat is also founder of Apatar, an open source data integration toolset, founder of the Silicon Valley NewSQL User Group and co-founder of the Belarusian Java User Group.

——————–
Related Posts

On NoSQL. Interview with Rick Cattell. August 19, 2013

On Oracle NoSQL Database –Interview with Dave Segleau. July 2, 2013

On Big Data and Hadoop. Interview with Paul C. Zikopoulos. June 10, 2013
————————-

Resources

Evaluating NoSQL Performance, Sergey Bushik, Altoros (Slideshare)

“A Vendor-independent Comparison of NoSQL Databases: Cassandra, HBase, MongoDB, Riak”. Altoros (registration required)

ODBMS.org free resources on Big Data and Analytical Data Platforms:
Blog Posts | Free Software| Articles | Lecture Notes | PhD and Master Thesis|

ODBMS.org free resources on NoSQL Data Stores:
Blog Posts | Free Software | Articles, Papers, Presentations| Documentations, Tutorials, Lecture Notes | PhD and Master Thesis

ODBMS.org free resources on Cloud Data Stores:
Blog Posts | Lecture Notes| Articles and Presentations| PhD and Master Thesis|

———————–

Follow ODBMS.org on Twitter: @odbmsorg

##

On Linked Data. Interview with John Goodwin. http://www.odbms.org/blog/2013/09/on-linked-data-interview-with-john-goodwin/ Sun, 01 Sep 2013 15:27:54 +0000

“Semantic technologies may be unfamiliar, but when you have used them for a while you will realise they are no harder than many other technologies…in fact I would argue they are easier.”– John Goodwin.

On the topics of Semantic web technologies, ontology engineering, and linked data, I have interviewed John Goodwin. John is Principal Scientist in the Research Department of Ordnance Survey, which is Great Britain’s National Mapping Authority.

RVZ

Q1. You are a senior data scientist at the Ordnance Survey, Great Britain’s national mapping agency. What is your role there?

John Goodwin: I am a Principal Scientist in the Research Department of Ordnance Survey, which is Great Britain’s National Mapping Authority [note: we are authorities now…not agencies].

I have worked in research for Ordnance Survey for over 10 years now, and my research has mainly focused on semantic web technologies, ontology engineering and linked data. The Principal Scientist role is a fairly new one for me, and as part of this role I am now responsible for a stream of research work around data management, data delivery and web services. This involves looking at new and novel technologies that ensure we have the correct infrastructure and data models to meet the challenges of the future. Furthermore, it involves investigating new ways we can serve our data to the end customer.

Q2. Do you have a Big Data problem at the Ordnance Survey? Could you please give us some examples of Big Data Use Cases?

John Goodwin: Hmmm, that is debatable. Ordnance Survey certainly has big ‘data problems’ but I don’t know if they qualify as ‘big data’ problems. I have heard Big Data defined as any data that won’t fit into Excel (a definition I personally hate), and if that is the case then we certainly have ‘Big Data’. Ordnance Survey currently stores information about half a billion topographic features and 27.5 million geocoded addresses (with around 500,000 changes a year). So we may not have the sheer volume of data that some folk have, but I believe the combination of volume and complexity means that performing analysis or running queries over this data would certainly be a ‘Big Data’ problem.
For example, if you wanted to calculate the number of postboxes in Scotland, or find the total length of all roads in Great Britain, you could be waiting some time using traditional database solutions.

Q3. The vision of the Semantic Web is one where web pages contain self-describing data that machines will be able to navigate as easily as humans do now. What are the main benefits? Who could profit most from the Semantic Web?

John Goodwin: I think an immediate benefit is the ability to provide more structured data to search engines so that they can provide better search services. Structured web content means more meaningful search results and offers new ways to summarise and present pages in a search engine.

Q4. Who is currently using Semantic Web technologies and how? Could you please give us some examples of current commercial projects?

John Goodwin: One interesting example is a company called Garlik (now part of Experian). Garlik provides services to protect people from identity theft and financial fraud. They use semantic web technologies to integrate a number of different datasets, and to provide a flexible way to add new datasets, so they can run queries across these datasets to find potential victims more easily. The BBC are great users of linked data technology and used triplestore technologies as part of the content management systems behind their World Cup and Olympics websites. Again, the flexibility of the technology and the ability to link data across the whole of the BBC proved invaluable.

We are using linked data technologies at Ordnance Survey in research projects to look at ways of integrating our data with third-party data.

The major search engines are backing an initiative called schema.org, which provides a unified schema for structured data in web content; as mentioned above, this has the potential to provide a richer search experience.

Q5. Do you use Linked Data? What are the main benefits of Linked Data in your opinion?

John Goodwin: I am a big supporter of linked data, and this has been the focus of my work for the last few years. I have used it in research projects and also produced the Ordnance Survey linked data.

Linked data is great for data integration – a common data language makes it easy (or rather easier) to bring a number of disparate datasets together. It is also more flexible than traditional relational database technologies. Like other NoSQL technologies, linked data can be seen as ‘schemaless’ to some extent. This means that if you want to change the data model by, say, adding new attributes or properties, it is very easy to do so. Furthermore, and this is a more personal thing, I find graphs to be the most natural way to think about data. It feels far more intuitive, and I have to say I think querying graph data using SPARQL is far easier than querying relational data using SQL (especially if you have a lot of joins).

Q6. What data management technologies are best suited to model and query Linked Open Data?

John Goodwin: Linked Open Data is built around W3C standards such as RDF (Resource Description Framework) – which is the data language of choice on the linked data web (although some people like to debate whether or not RDF is needed for linked data). RDF is to the web of data as HTML is to the web of documents…or at least that is how I see it. RDF has its own query language called SPARQL. A large number of programming libraries (e.g. Jena) are emerging to handle RDF. Furthermore, RDF can be stored in databases called triplestores, and there are many triplestores to choose from. I am not in a position to advocate one triplestore over another, but there are a large number of great technologies being developed by SMEs and more traditional database vendors alike. Furthermore, there are a number of open source options. We have experimented with a number of them at Ordnance Survey.
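For readers unfamiliar with the stack, here is a minimal sketch using the Python library rdflib (a counterpart to Jena for Java): it loads a few triples from an inline Turtle snippet and runs a SPARQL query over them. The http://example.org/ vocabulary and the toy data are made up for illustration and have nothing to do with Ordnance Survey's actual schemas.

# A minimal rdflib sketch: parse a tiny Turtle document and query it.
from rdflib import Graph

TURTLE = """
@prefix ex: <http://example.org/> .

ex:Southampton a ex:City ;
    ex:within ex:Hampshire .
ex:Hampshire a ex:County ;
    ex:within ex:England .
"""

g = Graph()
g.parse(data=TURTLE, format="turtle")

# A SPARQL property path walks any number of ex:within hops in one pattern.
query = """
PREFIX ex: <http://example.org/>
SELECT ?place WHERE { ?place ex:within+ ex:England . }
"""
for row in g.query(query):
    print(row.place)

The same query could be sent unchanged to a triplestore exposing a SPARQL endpoint; the property path ex:within+ expresses a multi-hop traversal that would need recursive joins in SQL, which is the kind of difference behind the earlier remark about SPARQL versus SQL.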

Q7. How do you integrate data from different sources that are not in Linked Open Data format (e.g. relational, raw data, etc.)?

John Goodwin: So far by converting the underlying data to RDF. Most relational data is a simple script away from being RDF. Tools do exist to help ‘triplify’ the data, but if I am honest I find that most of the time it is easier to write a quick Python script to do the job.
OpenRefine is a useful tool that lets you clean up CSV data and has a plugin that allows export of the data to RDF. OpenRefine additionally has the benefit of being able to work with reconciliation APIs. If a linked data site offers a reconciliation API, you can use it with OpenRefine to, for example, convert a column of city names or postcodes to URIs in the Ordnance Survey linked data. This is useful when you need to create explicit links to other datasets. For example, if you had a spreadsheet with place names like ‘City of Southampton’ you could use OpenRefine and the Ordnance Survey linked data reconciliation API to turn ‘City of Southampton’ into its URI.
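As a concrete example of the "quick Python script" approach mentioned above, the sketch below converts a small CSV file into Turtle with rdflib. The file places.csv, its columns name and postcode, and the http://example.org/ URIs are hypothetical; a real conversion would mint URIs in the publisher's own namespace.

# A quick CSV-to-RDF conversion script (sketch only).
import csv
from urllib.parse import quote

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/id/")          # identifiers for things
VOCAB = Namespace("http://example.org/ontology/") # properties and classes

g = Graph()
with open("places.csv", newline="") as f:
    for row in csv.DictReader(f):
        place = EX[quote(row["name"])]            # URL-encode the local name
        g.add((place, RDF.type, VOCAB.Place))
        g.add((place, RDFS.label, Literal(row["name"])))
        g.add((place, VOCAB.postcode, Literal(row["postcode"])))

g.serialize(destination="places.ttl", format="turtle")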

Q8. What are the most promising application domains where you can apply RDF triple store technology such as AllegroGraph and Virtuoso?

John Goodwin: I think any domain where you either want to integrate lots of disparate datasets or want a flexible data model, and where schema evolution might otherwise be a problem. I think geospatial is a promising domain, as ‘everything happens somewhere’ and location provides a useful integration hub for many datasets. Semantic web technologies have also been used widely in the bioinformatics domain. The BBC are another great use case – they use the technology to integrate data across their whole enterprise. This brings together data from news, radio, sport, television and music and allows new and exciting ways to explore the data.
I dare say it is also a technology that will prove useful/interesting to certain three-letter American government agencies that have made the news recently :)

Q9. Do you use Data Analytics at the Ordnance Survey and for what?

John Goodwin: I would say currently we don’t really – though it depends what you mean by analytics. We are largely concerned with collecting and maintaining data, and then shipping this out as products and services. We have experimented with an IBM® Netezza® appliance to perform queries over our data that would have taken too much time in traditional databases, to answer questions such as ‘how many post boxes are there in Great Britain?’.

Q10. Can you do data analytics using Linked Open Data? If yes how?

John Goodwin: I think again it depends what is meant by analytics. Linked data offers a great way to bring lots of datasets together and then, maybe, materialise a view of those integrated datasets that could then be used to perform some analytics. Many people are doing ‘graph analytics’, and given that linked data is a graph I think there is some interesting work to be done in looking at the intersections of graph/network theory and linked data.
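One way to experiment with that intersection of graph/network theory and linked data is to project RDF triples onto a general-purpose graph library and run standard measures over them. The sketch below does this with rdflib and networkx; the input file linked_data.ttl is a hypothetical placeholder, and treating every triple as an edge (literals included) is a deliberate simplification.

# A sketch of simple graph analytics over RDF data.
import networkx as nx
from rdflib import Graph

rdf_graph = Graph()
rdf_graph.parse("linked_data.ttl", format="turtle")

# Project RDF triples onto a directed graph; predicates become edge labels.
nx_graph = nx.DiGraph()
for subject, predicate, obj in rdf_graph:
    nx_graph.add_edge(str(subject), str(obj), predicate=str(predicate))

# Rank resources by how often other resources point at them.
centrality = nx.in_degree_centrality(nx_graph)
top = sorted(centrality.items(), key=lambda kv: kv[1], reverse=True)[:10]
for node, score in top:
    print(f"{score:.3f}  {node}")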

Q11. What are the main current obstacles for the adoption of Semantic Web technologies in the Enterprise?

John Goodwin: I think there are two main obstacles. The first one is a perception that RDF and linked data are hard, and somehow we need to overcome that perception. Lots of things in the ICT domain are hard…RDBMS is hard, C++ is hard, etc. Semantic technologies may be unfamiliar, but when you have used them for a while you will realise they are no harder than many other technologies…in fact I would argue they are easier. I know a lot of developers who have moved on to using SPARQL and, after a few months using it, find it much easier to understand than SQL. Furthermore, I think it is harder to hire people with expertise in these technologies – there are still more people skilled up in traditional RDBMS and other newer NoSQL technologies like MongoDB.

I think the second obstacle is that semantic web technologies are, obviously, not going to be as mature as a good old relational database. There are some great triplestores out there, and there are enterprises that have successfully incorporated them (the BBC are a great example), but being a relatively new technology, I suspect many enterprises are nervous about investing.

——————-
John Goodwin went to university at Royal Holloway and Bedford New College (University of London – based in Egham, Surrey) and graduated in 1992 with a 1st class honours degree in mathematics. Following that he moved to Cambridge and studied Part III of the Mathematics Tripos at the Department of Applied Maths and Theoretical Physics (University of Cambridge) where he obtained a Certificate of Advanced Study in Mathematics. John then moved to the University of Southampton to start his PhD.
He graduated in 1997 with a PhD in “The Cauchy Problem in Spacetimes with Closed Timelike Curves” (which can very roughly be paraphrased as ‘do time machines blow up when you turn them on?’). In 1998 John left academia to start work at Ordnance Survey (located at postcode SO16 0AS) as a systems developer. He left Ordnance Survey in 2000 to start work at a small software company called Neusciences, where he gained experience in various A.I. techniques. After just ten months at Neusciences John returned to Ordnance Survey to work in the research department, where his research concentrated on the semantic web, ontologies and linked data. On the back of this research John produced the current Ordnance Survey linked data. He is currently still at Ordnance Survey, working as a Principal Scientist, where he leads research (at a technical and strategic level) into data management, data delivery and services.
John currently chairs the UK Location Council Linked Data Working Group and participates in the UK Government Linked Data Working Group.

Related Posts

On Hybrid Relational Databases. Interview with Kingsley Uyi Idehen. May 13, 2013

Graphs vs. SQL. Interview with Michael Blaha. April 11, 2013

On Big Graph Data. August 6, 2012

Resources

ODBMS.org: free resources on Graphs and Data Stores
Blog Posts | Free Software | Articles, Papers, Presentations| Tutorials, Lecture Notes

——————————–
Follow ODBMS.org on Twitter: @odbmsorg
