ODBMS Industry Watch – Trends and Information on Big Data, New Data Management Technologies, Data Science and Innovation (http://www.odbms.org/blog)

Database Challenges and Innovations. Interview with Jim Starkey
Published 31 August 2016

“Isn’t it ironic that in 2016 a non-skilled user can find a web page from Google’s untold petabytes of data in millisecond time, but a highly trained SQL expert can’t do the same thing in a relational database one billionth the size?” –Jim Starkey.

I have interviewed Jim Starkey, a database legend whose career as an entrepreneur, architect, and innovator spans more than three decades of database history.

RVZ

Q1. In your opinion, what are the most significant advances in databases in the last few years?

Jim Starkey: I’d have to say the “atom programming model” where a database is layered on a substrate of peer-to-peer replicating distributed objects rather than disk files. The atom programming model enables scalability, redundancy, high availability, and distribution not available in traditional, disk-based database architectures.

Q2. What was your original motivation to invent the NuoDB Emergent Architecture?

Jim Starkey: It all grew out of a long Sunday morning shower. I knew that the performance limits of single-computer database systems were in sight, so distributing the load was the only possible solution, but existing distributed systems required that a new node copy a complete database or partition before it could do useful work. I started thinking of ways to attack this problem and came up with the idea of peer to peer replicating distributed objects that could be serialized for network delivery and persisted to disk. It was a pretty neat idea. I came out much later with the core architecture nearly complete and very wrinkled (we have an awesome domestic hot water system).

Q3. In your career as an entrepreneur and architect what was the most significant innovation you did?

Jim Starkey: Oh, clearly multi-generational concurrency control (MVCC). The problem I was trying to solve was allowing ad hoc access to a production database for a 4GL product I was working on at the time, but the ramifications go far beyond that. MVCC is the core technology that makes true distributed database systems possible. Transaction serialization is like Newtonian physics – all observers share a single universal reference frame. MVCC is like special relativity, where each observer views the universe from his or her reference frame. The views appear different but are, in fact, consistent.
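To make the relativity analogy concrete, here is a toy illustration of the MVCC idea, not how InterBase, Falcon, or NuoDB actually implement it: writes append new versions instead of overwriting, and each reader resolves a key against its own snapshot, so readers never block writers. Commit and abort handling is omitted for brevity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Toy multi-version store: each key maps to a list of versions, and a reader
// only sees versions written at or before its snapshot transaction id.
class MvccStore {
    private static class Version {
        final long txnId;      // transaction that wrote this version
        final String value;
        Version(long txnId, String value) { this.txnId = txnId; this.value = value; }
    }

    private final Map<String, List<Version>> versions = new HashMap<>();
    private final AtomicLong txnCounter = new AtomicLong(0);

    // Begin a transaction: its snapshot is "everything written so far".
    public synchronized long begin() { return txnCounter.incrementAndGet(); }

    // A write appends a new version instead of overwriting the old one.
    public synchronized void put(long txnId, String key, String value) {
        versions.computeIfAbsent(key, k -> new ArrayList<>()).add(new Version(txnId, value));
    }

    // A read returns the newest version visible to this transaction's snapshot.
    public synchronized String get(long snapshotTxnId, String key) {
        String result = null;
        for (Version v : versions.getOrDefault(key, List.of())) {
            if (v.txnId <= snapshotTxnId) result = v.value;   // older or own writes are visible
        }
        return result;
    }

    public static void main(String[] args) {
        MvccStore store = new MvccStore();
        long t1 = store.begin();
        store.put(t1, "balance", "100");
        long t2 = store.begin();          // t2's snapshot includes t1's write
        long t3 = store.begin();
        store.put(t3, "balance", "50");   // t3 writes a newer version
        System.out.println(store.get(t2, "balance")); // prints 100: t2 does not see t3's version
    }
}
```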

Q4. Proprietary vs. open source software: what are the pros and cons?

Jim Starkey: It’s complicated. I’ve had feet in both camps for 15 years. But let’s draw a distinction between open source and open development. Open development – where anyone can contribute – is pretty good at delivering implementations of established technologies, but it’s very difficult to push the state of the art in that environment. Innovation, in my experience, requires focus, vision, and consistency that are hard to maintain in open development. If you have a controlled development environment, the question of open source versus proprietary is tactics, not philosophy. Yes, there’s an argument that having the source available gives users guarantees they don’t get from proprietary software, but with something as complicated as a database, most users aren’t going to try to master the sources. But having source available lowers the perceived risk of new technologies, which is a big plus.

Q5. You led the Falcon project – a transactional storage engine for the MySQL server – through the acquisition of MySQL by Sun Microsystems. What impact did this project have in the database space?

Jim Starkey: In all honesty, I’d have to say that Falcon’s most important contribution was its competition with InnoDB. In the end, that competition made InnoDB three times faster. Falcon, multi-version in memory using the disk for backfill, was interesting, but no matter how we cut it, it was limited by the performance of the machine it ran on. It was fast, but no single node database can be fast enough.

Q6. What are the most challenging issues in databases right now?

Jim Starkey: I think it’s time to step back and reexamine the assumptions that have accreted around database technology – data model, API, access language, data semantics, and implementation architectures. The “relational model”, for example, is based on what Codd called relations and we call tables, but otherwise has nothing to do with his mathematical model. That model, based on set theory, requires automatic duplicate elimination. To the best of my knowledge, nobody ever implemented Codd’s model, but we still have tables which bear a scary resemblance to decks of punch cards. Are they necessary? Or do they just get in the way?
Isn’t it ironic that in 2016 a non-skilled user can find a web page from Google’s untold petabytes of data in millisecond time, but a highly trained SQL expert can’t do the same thing in a relational database one billionth the size? SQL has no provision for flexible text search, no provision for multi-column, multi-table search, and no mechanics in the APIs to handle the results if it could do them. And this is just one of a dozen problems that SQL databases can’t handle. SQL was a really good technical fit for the computers, memory, and disks of the 1980s, but is it the right answer now?

Q7. How do you see the database market evolving?

Jim Starkey: I’m afraid my crystal ball isn’t that good. Blobs, another of my creations, spread throughout the industry in two years. MVCC took 25 years to become ubiquitous. I have a good idea of where I think it should go, but little expectation of how or when it will.

Qx. Anything else you wish to add?

Jim Starkey: Let me say a few things about my current project, AmorphousDB, an implementation of the Amorphous Data Model (meaning, no data model at all). AmorphousDB is my modest effort to question everything database.
The best way to think about Amorphous is to envision a relational database and mentally erase the boxes around the tables so all records free float in the same space – including data and metadata. Then, if you’re uncomfortable, add back a “record type” attribute and associated syntactic sugar, so table-type semantics are available, but optional. Then abandon punch card data semantics and view all data as abstract and subject to search. Eliminate the fourteen different types of numbers and strings, leaving simply numbers and strings, but add useful types like URLs, email addresses, and money. Index everything unless told not to. Finally, imagine an API that fits on a single sheet of paper (OK, 9 point font, both sides) and an implementation that can span hundreds of nodes. That’s AmorphousDB.
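Since AmorphousDB’s API is not public, the following is a purely hypothetical sketch of the idea described above, not the actual product: records are free-floating attribute maps, “record type” is just another optional attribute, and every attribute is indexed so any record can be found by search alone.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical "amorphous" store: records are just attribute maps,
// every attribute value is indexed, and "recordType" is merely another attribute.
class AmorphousSketch {
    private final Map<Long, Map<String, Object>> records = new HashMap<>();
    // Inverted index: attribute name -> value -> record ids.
    private final Map<String, Map<Object, Set<Long>>> index = new HashMap<>();
    private long nextId = 1;

    public long insert(Map<String, Object> record) {
        long id = nextId++;
        records.put(id, record);
        record.forEach((attr, value) ->
            index.computeIfAbsent(attr, a -> new HashMap<>())
                 .computeIfAbsent(value, v -> new HashSet<>())
                 .add(id));
        return id;
    }

    // Search any attribute; no schema and no table required.
    public Set<Long> find(String attr, Object value) {
        return index.getOrDefault(attr, Map.of()).getOrDefault(value, Set.of());
    }

    public static void main(String[] args) {
        AmorphousSketch db = new AmorphousSketch();
        db.insert(Map.of("recordType", "person", "name", "Ada", "email", "ada@example.org"));
        db.insert(Map.of("name", "Falcon", "kind", "storage engine")); // no record type at all
        System.out.println(db.find("name", "Ada"));   // table-free lookup by any attribute
    }
}
```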

————
Jim Starkey invented the NuoDB Emergent Architecture, and developed the initial implementation of the product. He founded NuoDB [formerly NimbusDB] in 2008, and retired at the end of 2012, shortly before the NuoDB product launch.

Jim’s career as an entrepreneur, architect, and innovator spans more than three decades of database history, from the Datacomputer project on the fledgling ARPAnet to his most recent startup, NuoDB, Inc. Throughout that period, he has been responsible for many database innovations, from the date data type to the BLOB to multi-version concurrency control (MVCC). Starkey has extensive experience in proprietary and open source software.

Starkey joined Digital Equipment Corporation in 1975, where he created the Datatrieve family of products, the DEC Standard Relational Interface architecture, and the first of the Rdb products, Rdb/ELN. Starkey was also software architect for DEC’s database machine group.

Leaving DEC in 1984, Starkey founded Interbase Software to develop relational database software for the engineering workstation market. Interbase was a technical leader in the database industry, producing the first commercial implementations of heterogeneous networking, blobs, triggers, two-phase commit, database events, etc. Ashton-Tate acquired Interbase Software in 1991, and was, in turn, acquired by Borland International a few months later. The Interbase database engine was released open source by Borland in 2000 and became the basis for the Firebird open source database project.

In 2000, Starkey founded Netfrastructure, Inc., to build a unified platform for distributable, high quality Web applications. The Netfrastructure platform included a relational database engine, an integrated search engine, an integrated Java virtual machine, and a high performance page generator.

MySQL AB acquired Netfrastructure, Inc. in 2006 to be the kernel of a wholly owned transactional storage engine for the MySQL server, later known as Falcon. Starkey led the Falcon project through the acquisition of MySQL by Sun Microsystems.

Jim has a degree in Mathematics from the University of Wisconsin.
For amusement, Jim codes on weekends, while sailing, but not while flying his plane.

——————

Resources

NuoDB Emergent Architecture (.PDF)

On Database Resilience. Interview with Seth Proctor, ODBMS Industry Watch, March 17, 2015

Related Posts

– Challenges and Opportunities of The Internet of Things. Interview with Steve Cellini, ODBMS Industry Watch, October 7, 2015

– Hands-On with NuoDB and Docker, by MJ Michaels, NuoDB, ODBMS.org, October 27, 2015

– How leading Operational DBMSs rank popularity wise?, by Michael Waclawiczek, ODBMS.org, January 27, 2016

– A Glimpse into U-SQL, by Stephen Dillon, Schneider Electric, ODBMS.org, December 7, 2015

– Gartner Magic Quadrant for Operational DBMS 2015

Follow us on Twitter: @odbmsorg

##

On the Internet of Things. Interview with Colin Mahony
Published 14 March 2016

“Frankly, manufacturers are terrified to flood their data centers with these unprecedented volumes of sensor and network data.”– Colin Mahony

I have interviewed Colin Mahony, SVP & General Manager, HPE Big Data Platform. Topics of the interview are: the challenges of the Internet of Things, the opportunities for Data Analytics, the positioning of HPE Vertica, and the HPE Cloud Strategy.

RVZ

Q1. Gartner says 6.4 billion connected “things” will be in use in 2016, up 30 percent from 2015. How do you see the global Internet of Things (IoT) market developing in the coming years?

Colin Mahony: As manufacturers connect more of their “things,” they have an increased need for analytics to derive insight from massive volumes of sensor or machine data. I see these manufacturers, particularly manufacturers of commodity equipment, needing to provide more value-added services based on their ability to deliver higher levels of service and overall customer satisfaction. Data analytics platforms are key to making that happen. Also, we could see entirely new analytical applications emerge, driven by what consumers want to know about their devices, combining that data with, say, their exercise regimens, health vitals, social activities, and even driving behavior, for full personal insight.
Ultimately, the Internet of Things will drive a need for the Analyzer of Things, and that is our mission.

Q2. What challenges and opportunities does the Internet of Things (IoT) bring?

Colin Mahony: Frankly, manufacturers are terrified to flood their data centers with these unprecedented volumes of sensor and network data. The reason? Traditional data warehouses were designed well before the Internet of Things, or at least before OT (operational technology) like medical devices, industrial equipment, cars, and more were connected to the Internet. So, having an analytical platform that provides the scale and performance required to handle these volumes is important, but customers are taking more of a two- or three-tier approach that involves some sort of analytical processing at the edge before data is sent to an analytical data store. Apache Kafka is also becoming an important tier in this architecture, serving as a message bus that collects data at the edge and pushes it in streams to the appropriate database, CRM system, or analytical platform, where, for example, fault data can be correlated over months or even years to predict and prevent part failures and optimize inventory levels.
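As a small illustration of the edge tier described above, the sketch below uses Apache Kafka’s standard Java producer to publish sensor readings onto a topic that a downstream analytical store can consume. The broker address, topic name, and JSON payload are assumptions made up for the example, not part of any HPE reference architecture.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Minimal edge-gateway producer: each sensor reading is published to a Kafka topic,
// from which a downstream analytical store (data warehouse, CRM, etc.) can consume in bulk.
public class SensorEdgePublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-broker:9092");            // assumed broker address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            String deviceId = "pump-0042";                               // hypothetical device id
            String reading = "{\"deviceId\":\"pump-0042\",\"temperature\":78.3,\"ts\":1457942400}";
            // Keying by device id keeps each device's readings ordered within a partition.
            producer.send(new ProducerRecord<>("sensor-readings", deviceId, reading));
        }
    }
}
```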

Q3. Big Data: In your opinion, what are the current main demands/needs in the market?

Colin Mahony: All organizations want – and need – to become data-driven organizations. I mean, who wants to make such critical decisions based on half answers and anecdotal data? That said, traditional companies with data stores and systems going back 30-40 years aren’t on the same level playing field as the next market disruptor that just received its series B funding and knows only that analytics is the lifeblood of its business and all its critical decisions.
The good news is that whether you are a 100-year old insurance company or the next Uber or Facebook, you can become a data-driven organization by taking an open platform approach that uses the best tool for the job and can incorporate emerging technologies like Kafka and Spark without having to bolt on or buy all of that technology from a single vendor and get locked in.  Understanding the difference between an open platform with a rich ecosystem and open source software as one very important part of that ecosystem has been a differentiator for our customers.

Beyond technology, we have customers that establish analytical centers of excellence that actually work with the data consumers – often business analysts – that run ad-hoc queries using their preferred data visualization tool to get the insight they need for their business unit or department. If the data analysts struggle, then this center of excellence, which happens to report up through IT, collaborates with them to understand and help them get to the analytical insight – rather than simply halting the queries with no guidance on how to improve.

Q4. How do you embed analytics and why is it useful? 

Colin Mahony: OEM software vendors, particularly, see the value of embedding analytics in their commercial software products or software as a service (SaaS) offerings.  They profit by creating analytic data management features or entirely new applications that put customers on a faster path to better, data-driven decision making. Offering such analytics capabilities enables them to not only keep a larger share of their customer’s budget, but at the same time greatly improve customer satisfaction. To offer such capabilities, many embedded software providers are attempting unorthodox fixes with row-oriented OLTP databases, document stores, and Hadoop variations that were never designed for heavy analytic workloads at the volume, velocity, and variety of today’s enterprise. Alternatively, some companies are attempting to build their own big data management systems. But such custom database solutions can take thousands of hours of research and development, require specialized support and training, and may not be as adaptable to continuous enhancement as a pure-play analytics platform. Both approaches are costly and often outside the core competency of businesses that are looking to bring solutions to market quickly.

Because it’s specifically designed for analytic workloads, HPE Vertica is quite different from other commercial alternatives. Vertica differs from OLTP DBMS and proprietary appliances (which typically embed row-store DBMSs) by grouping data together on disk by column rather than by row (that is, so that the next piece of data read off disk is the next attribute in a column, not the next attribute in a row). This enables Vertica to read only the columns referenced by the query, instead of scanning the whole table as row-oriented databases must do. This speeds up query processing dramatically by reducing disk I/O.
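A minimal way to see the effect Mahony describes: with a row layout, summing a single attribute drags every other attribute of every record along with it, while a column layout touches only the one array the query needs. The sketch below contrasts the two in-memory layouts; it illustrates the principle only and is not Vertica code.

```java
// Illustrates why a columnar layout reduces the data touched by an analytic query.
public class RowVsColumn {
    // Row-oriented: each record carries all attributes together.
    static class RowRecord {
        int userId; String name; String email; double purchase;
        RowRecord(int id, String n, String e, double p) { userId = id; name = n; email = e; purchase = p; }
    }

    public static void main(String[] args) {
        int n = 100_000;

        // Row layout: SUM(purchase) must walk every record, pulling names and emails along.
        RowRecord[] rows = new RowRecord[n];
        for (int i = 0; i < n; i++) rows[i] = new RowRecord(i, "user" + i, "u" + i + "@x.com", i % 100);
        double sumRows = 0;
        for (RowRecord r : rows) sumRows += r.purchase;

        // Column layout: the same query reads only the purchase column.
        double[] purchaseColumn = new double[n];
        for (int i = 0; i < n; i++) purchaseColumn[i] = i % 100;
        double sumCols = 0;
        for (double p : purchaseColumn) sumCols += p;

        System.out.println(sumRows == sumCols); // same answer, far less data touched in the column case
    }
}
```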

You’ll find Vertica as the core analytical engine behind some popular products, including Lancope, Empirix, Good Data, and others, as well as many HPE offerings like HPE Operations Analytics, HPE Application Defender, and HPE App Pulse Mobile.

Q5. How do you make a decision when it is more appropriate to “consume and deploy” Big Data on premise, in the cloud, on demand and on Hadoop?

Colin Mahony: The best part is that you don’t need to choose with HPE. Unlike most emerging data warehouses as a service where your data is trapped in their databases when your priorities or IT policies change, HPE offers the most complete range of deployment and consumption models. If you want to spin up your analytical initiative on the cloud for a proof-of-concept or during the holiday shopping season for e-retailers, you can do that easily with HPE Vertica OnDemand.
If your organization finds that due to security or confidentiality or privacy concerns you need to bring your analytical initiative back in house, then you can use HPE Vertica Enterprise on-premises without losing any customizations or disruption to your business. Have petabyte volumes of largely unstructured data where the value is unknown? Use HPE Vertica for SQL on Hadoop, deployed natively on your Hadoop cluster, regardless of the distribution you have chosen. Each consumption model, available in the cloud, on-premise, on-demand, or using reference architectures for HPE servers, is available to you with that same trusted underlying core.

Q6. What is the new class of infrastructures called “composable”? Is it relevant for Big Data?

Colin Mahony: HPE believes that a new architecture is needed for Big Data – one that is designed to power innovation and value creation for the new breed of applications while running traditional workloads more efficiently.
We call this new architectural approach Composable Infrastructure. HPE has a well-established track record of infrastructure innovation and success. HPE Converged Infrastructure, software-defined management, and hyper-converged systems have consistently proven to reduce costs and increase operational efficiency by eliminating silos and freeing available compute, storage, and networking resources. Building on our converged infrastructure knowledge and experience, we have designed a new architecture that can meet the growing demands for a faster, more open, and continuous infrastructure.

Q7. What is HPE Cloud Strategy? 

Colin Mahony: Hybrid cloud adoption is continuing to grow at a rapid rate and a majority of our customers recognize that they simply can’t achieve the full measure of their business goals by consuming only one kind of cloud.
HPE Helion not only offers private cloud deployments and managed private cloud services, but we have created the HPE Helion Network, a global ecosystem of service providers, ISVs, and VARs dedicated to delivering open standards-based hybrid cloud services to enterprise customers. Through our ecosystem, our customers gain access to an expanded set of cloud services and improve their abilities to meet country-specific data regulations.

In addition to the private cloud offerings, we have a strategic and close alliance with Microsoft Azure, which enables many of our offerings, including Haven OnDemand, in the public cloud. We also work closely with Amazon because our strategy is not to limit our customers, but to ensure that they have the choices they need and the services and support they can depend upon.

Q8. What are the advantages of an offering like Vertica in this space?

Colin Mahony: More and more companies are exploring the possibility of moving their data analytics operations to the cloud. We offer HPE Vertica OnDemand, our data warehouse as a service, for organizations that need high-performance enterprise-class data analytics for all of their data to make better business decisions now. Built by design to drastically improve query performance over traditional relational database systems, HPE Vertica OnDemand is engineered from the same technology that powers the HPE Vertica Analytics Platform. For organizations that want to select Amazon hardware and still maintain control over the installation, configuration, and overall maintenance of Vertica for ultimate performance and control, we offer Vertica AMI (Amazon Machine Image). The Vertica AMI is a bring-your-own-license model that is ideal for organizations that want the same experience as on-premise installations, only without procuring and setting up hardware. Regardless of which deployment model you choose, we have you covered for “on demand” or “enterprise cloud” options.

Q9. What is HPE Vertica Community Edition?

Colin Mahony: We have had tens of thousands of downloads of the HPE Vertica Community Edition, a freemium edition of HPE Vertica with all of the core features and functionality that you experience with our core enterprise offering. It’s completely free for up to 1 TB of data storage across three nodes. Companies of all sizes use the Community Edition to download, install, set up, and configure Vertica very quickly on x86 hardware, or use our Amazon Machine Image (AMI) for a bring-your-own-license approach to the cloud.

Q10. Can you tell us how Kiva.org, a non-profit organization, uses on-demand cloud analytics to leverage the internet and a worldwide network of microfinance institutions to help fight poverty? 

Colin Mahony: HPE is a major supporter of Kiva.org, a non-profit organization with a mission to connect people through lending to alleviate poverty. Kiva.org uses the internet and a worldwide network of microfinance institutions to enable individuals to lend as little as $25 to help create opportunity around the world. When the opportunity arose to help support Kiva.org with an analytical platform to further the cause, we jumped at the chance. Kiva.org relies on Vertica OnDemand to reduce capital costs, leverage the SaaS delivery model to adapt more quickly to changing business requirements, and work with over a million lenders, hundreds of field partners and volunteers, across the world. To see a recorded Webinar with HPE and Kiva.org, see here.

Qx Anything else you wish to add?

Colin Mahony: We appreciate the opportunity to share the features and benefits of HPE Vertica as well as the bright market outlook for data-driven organizations. However, I always recommend that any organization struggling with how to get started with its analytics initiative speak and meet with peers to learn best practices and avoid potential pitfalls. The best way to do that, in my opinion, is to visit with the more than 1,000 Big Data experts in Boston from August 29 – September 1 at the HPE Big Data Conference. Click here to learn more and join us for 40+ technical deep-dive sessions.

————-

Colin Mahony, SVP & General Manager, HPE Big Data Platform

Colin Mahony leads the Hewlett Packard Enterprise Big Data Platform business group, which is responsible for the industry-leading Vertica Advanced Analytics portfolio, the IDOL Enterprise software that provides context and analysis of unstructured data, and Haven OnDemand, a platform for developers to leverage APIs and on-demand services for their applications.
In 2011, Colin joined Hewlett Packard as part of the highly successful acquisition of Vertica, and took on the responsibility of VP and General Manager for HP Vertica, where he guided the business to remarkable annual growth and recognized industry leadership. Colin brings a unique combination of technical knowledge, market intelligence, customer relationships, and strategic partnerships to one of the fastest growing and most exciting segments of HP Software.

Prior to Vertica, Colin was a Vice President at Bessemer Venture Partners focused on investments primarily in enterprise software, telecommunications, and digital media. He established a great network and reputation for assisting in the creation and ongoing operations of companies through his knowledge of technology, markets and general management in both small startups and larger companies. Prior to Bessemer, Colin worked at Lazard Technology Partners in a similar investor capacity.

Prior to his venture capital experience, Colin was a Senior Analyst at the Yankee Group serving as an industry analyst and consultant covering databases, BI, middleware, application servers and ERP systems. Colin helped build the ERP and Internet Computing Strategies practice at Yankee in the late nineties.

Colin earned an M.B.A. from Harvard Business School and a bachelor’s degree in Economics with a minor in Computer Science from Georgetown University. He is an active volunteer with Big Brothers Big Sisters of Massachusetts Bay and the Joey Fund for Cystic Fibrosis.

Resources

What’s in store for Big Data analytics in 2016, Steve Sarsfield, Hewlett Packard Enterprise. ODBMS.org, 3 FEB, 2016

What’s New in Vertica 7.2?: Apache Kafka Integration!, HPE, last edited February 2, 2016

Gartner Says 6.4 Billion Connected “Things” Will Be in Use in 2016, Up 30 Percent From 2015, Press release, November 10, 2015

The Benefits of HP Vertica for SQL on Hadoop, HPE, July 13, 2015

Uplevel Big Data Analytics with Graph in Vertica – Part 5: Putting graph to work for your business, by Walter Maguire, Chief Field Technologist, HP Big Data Group, ODBMS.org, November 2, 2015

HP Distributed R, ODBMS.org, February 19, 2015

Understanding ROS and WOS: A Hybrid Data Storage Model, HPE, October 7, 2015

Related Posts

On Big Data Analytics. Interview with Shilpa Lawande. Source: ODBMS Industry Watch, published on December 10, 2015

On HP Distributed R. Interview with Walter Maguire and Indrajit Roy. Source: ODBMS Industry Watch, published on April 9, 2015

Follow us on Twitter: @odbmsorg

##

On Oracle NoSQL Database. Interview with Dave Segleau
Published 2 July 2013

“We went down the path of building Oracle NoSQL Database because of explicit requests from some of our largest Oracle Berkeley DB installations that wanted to move away from maintaining home-grown sharding implementations and very much wanted an out-of-the-box technology that could replicate the robustness of what they had built.” –Dave Segleau.

On October 3, 2011, Oracle announced the Oracle NoSQL Database, and on December 17, 2012, Oracle shipped Oracle NoSQL Database R2. I wanted to know more about the status of the Oracle NoSQL Database. I have interviewed Dave Segleau, Director of Product Management, Oracle NoSQL Database.

RVZ

Q1. Who is currently using Oracle NoSQL Database, and for what kind of domain specific applications? Please give us some examples.

Dave Segleau: There is a range of users from segments such as Web-scale Transaction Processing to Web-scale Personalization and Real-time Event Processing. To pick the area where I would say we see the largest adoption, it would be the Real-time Event Processing category. This is basically the use case that covers things like Fraud Detection, Telecom Services Billing, Online Gaming and Mobile Device Management.

Q2. What is new in Oracle NoSQL Database R2?

Dave Segleau: We added significant enhancements to NoSQL Database in the areas of Configuration Management/Monitoring (CM/M), APIs and Application Developer Usability, as well as Integration with the Oracle technology stack.
In the area of CM/M, we added “Smart Topology” (an automated capacity- and reliability-aware data storage allocation with intelligent request routing), configuration elasticity and rebalancing, and JMX/SNMP support. In the area of APIs and Application Developer Usability, we added a C API, support for values as JSON objects (with AVRO serialization), JSON schema definitions, and a Large Object API (including a highly efficient streaming interface). In the area of Integration, we added support for accessing NoSQL Database data via Oracle External Tables (using SQL in the Oracle Database), RDF Graph support in NoSQL Database, Oracle Coherence integration, as well as integration with Oracle Event Processing.

Q3. How would you compare Oracle NoSQL with respect to other NoSQL data stores, such as CouchDB, MongoDB, Cassandra and Riak?

Dave Segleau: The Oracle NoSQL Database is a key-value store, although it also supports JSON as a value type similar to a document store. Architecturally it is closer to Riak, Cassandra and the Amazon Dynamo-based implementations, rather than the other technologies, at least at the highest level of abstraction. With regards to features, Oracle NoSQL Database shares a lot of commonality with Riak. Our performance and scalability characteristics are showing up with the best results in YCSB benchmarks.

Q4. What is the implication of having Oracle Berkeley DB Java Edition as the core engine for the Oracle NoSQL database?

Dave Segleau: It means that Oracle NoSQL Database provides a mission-critical, proven database technology at the heart of the implementation. Many of the other NoSQL databases use relatively new implementations for data storage and replication. Databases in general, and especially distributed parallel databases, are hard to implement well and achieve high product quality and reliability. So we see the use of Oracle Berkeley DB, a pervasively deployed database engine for thousands of mission-critical applications, as a big differentiation. Plus, many of the early NoSQL technologies are based on Oracle Berkeley DB, for example LinkedIn’s Voldemort, Amazon’s Dynamo, and other popular commercial and enterprise social media products like Yammer.
The bottom line is that we went down the path of building Oracle NoSQL Database because of explicit requests from some of our largest Oracle Berkeley DB installations that wanted to move away from maintaining home-grown sharding implementations and very much wanted an out-of-the-box technology that could replicate the robustness of what they had built.

Q5. What is the relationship between the underlying “cleaning” mechanism to free up unused space in Oracle Berkeley DB, and the predictability and throughput in Oracle NoSQL Database?

Dave Segleau: As mentioned in the previous section, Oracle NoSQL Database uses Oracle Berkeley DB Java Edition as the key-value storage mechanism within the Storage Nodes. Oracle Berkeley DB Java Edition uses a no-overwrite log file system to store the data and a configurable multi-threaded background log cleaner task to compact and clean log files and free up unused disk space. The Oracle Berkeley DB log cleaner has undergone many years of in-house and real-world high-volume validation and tuning. Oracle NoSQL Database pre-defines the BDB cleaner parameters for optimal configuration for this particular use case. The cleaner enhances system throughput and predictability by a) running as a low-level background task, and b) being preconfigured to minimize impact on the running system. The combination of these two characteristics leads to more predictable system throughput.
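A conceptual sketch of the no-overwrite-plus-cleaner design described above (an illustration of the idea, not Berkeley DB’s implementation): every update is appended, obsolete versions accumulate in the log, and a cleaner pass copies only the live entries into a fresh segment. In Berkeley DB the cleaner runs as a background task; here it is invoked explicitly to keep the example short.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual no-overwrite log: puts append entries; a cleaner pass rewrites
// only the latest entry per key, reclaiming space taken by obsolete versions.
class NoOverwriteLog {
    record Entry(String key, String value) {}

    private List<Entry> log = new ArrayList<>();                  // append-only "log file"
    private final Map<String, Integer> latest = new HashMap<>();  // key -> index of newest entry

    public void put(String key, String value) {
        log.add(new Entry(key, value));                           // never overwrite in place
        latest.put(key, log.size() - 1);
    }

    public String get(String key) {
        Integer i = latest.get(key);
        return i == null ? null : log.get(i).value();
    }

    // The "cleaner": copy live entries into a new segment and drop obsolete ones.
    public void clean() {
        List<Entry> compacted = new ArrayList<>();
        for (Map.Entry<String, Integer> e : latest.entrySet()) {
            compacted.add(log.get(e.getValue()));
            latest.put(e.getKey(), compacted.size() - 1);         // value update only, no structural change
        }
        log = compacted;
    }

    public static void main(String[] args) {
        NoOverwriteLog store = new NoOverwriteLog();
        for (int i = 0; i < 5; i++) store.put("counter", "v" + i); // five versions of one key
        store.clean();                                             // one live version remains
        System.out.println(store.get("counter"));                  // prints v4
    }
}
```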

Several other NoSQL database products have implemented heavyweight tasks to compact, compress and free up disk space. Running them definitely impacts system throughput and predictability. From our point of view, not only do you want a NoSQL database that has excellent performance, but you also need predictable performance. Routine tasks like Java GCs and disk space management should not cause major impacts to operational throughput in a production system.

Q7. The Oracle NoSQL data model uses the concepts of “major” and “minor” key paths. Why?

Dave Segleau: We heard from customers that they wanted both even distribution of data as well as co-location of certain sets of records. The Major/Minor key paradigm provides the best of both worlds. The Major key is the basis for the hash function, which causes Major key values to be evenly distributed throughout the key-value data store. The Minor key allows us to cluster multiple records for a given Major key together in the same storage location. In addition to being very flexible, it also provides additional benefits:
a) A scalable two-tier indexing structure. A hash map of Major Keys to partitions that contain the data, and then a B-tree within each partition to quickly locate Minor key values.
b) Minor keys allow us to perform efficient lookups and range scans within a Major key. For example, for userID 1234 (Major key), fetch all of the products that they browsed from January 1st to January 15th (Minor key).
c) Because all of the Minor key records for a given Major key are co-located on the same storage location, this becomes our basic unit of ACID transactions, allowing applications to have a transaction that spans a single record, multiple records or even multiple write operations on multiple records for a given major key.

This degree of flexibility is often lacking in other NoSQL database offerings.
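A short sketch of the userID 1234 example above, modeled on the Oracle NoSQL Database Java key-value API. The store name, host, and key components are placeholders, and the method signatures are reproduced from memory of the documentation, so treat them as approximate rather than authoritative.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.SortedMap;

import oracle.kv.Depth;
import oracle.kv.KVStore;
import oracle.kv.KVStoreConfig;
import oracle.kv.KVStoreFactory;
import oracle.kv.Key;
import oracle.kv.KeyRange;
import oracle.kv.Value;
import oracle.kv.ValueVersion;

public class MajorMinorExample {
    public static void main(String[] args) {
        // Store name and helper host are placeholders for a real deployment.
        KVStore store = KVStoreFactory.getStore(
                new KVStoreConfig("kvstore", "localhost:5000"));

        // Major path = /user/1234 (drives hashing and shard co-location),
        // Minor path = /browsed/<date>/<productId> (ordered within the Major key).
        Key k = Key.createKey(Arrays.asList("user", "1234"),
                              Arrays.asList("browsed", "2013-01-03", "prod-789"));
        store.put(k, Value.createValue("viewed".getBytes()));

        // Range scan: everything user 1234 browsed between January 1st and 15th,
        // all of it co-located on the same shard because the Major key is shared.
        Key parent = Key.createKey(Arrays.asList("user", "1234"),
                                   Arrays.asList("browsed"));
        SortedMap<Key, ValueVersion> results =
                store.multiGet(parent,
                               new KeyRange("2013-01-01", true, "2013-01-15", true),
                               Depth.PARENT_AND_DESCENDANTS);
        for (Map.Entry<Key, ValueVersion> e : results.entrySet()) {
            // ValueVersion wraps the Value; the inner getValue() yields the stored bytes.
            System.out.println(e.getKey() + " -> "
                    + new String(e.getValue().getValue().getValue()));
        }
        store.close();
    }
}
```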

Q8. Oracle NoSQL database is a distributed, replicated key-value store using a shared-nothing master-slave architecture. Why did you choose to have a master node architecture? How does it compare with other systems which have no master?

Dave Segleau: First of all, let’s clarify that each shard has a master (so the system is multi-master across shards) and it is an elected-master-based system. The Oracle NoSQL Database topology is deployed with a user-specified replication factor (how many copies of the data the system should maintain), and then, using a PAXOS-based mechanism, a master is elected. It is quite possible that a new master is elected under certain operating conditions. Plus, if you throw more hardware resources at the system, those “masters” will shift the data for which they are responsible, again to achieve the optimal latency profile. We are leveraging the enterprise-grade replication technology that is widely deployed via the Oracle Berkeley DB Java Edition. Also, by using an elected-master implementation, we can provide fully ACID transactions on an operation-by-operation basis.

Q9. It is known that when the master node for a particular key-value fails (or because of a network failure), some writes may get lost. What is the implication from an application point of view?

Dave Segleau: This is a complex question in that it depends largely on the type of durability requested for the operation, and that is controlled by the developer. In general, though, committed transactions acknowledged by a simple majority of nodes (our default durability) are not lost when a master fails. In the case of less aggressive durability policies, in-flight transactions that have been subject to network, disk, or server failures are handled similarly to process failures in other database implementations: the transactions are rolled back. However, a new master will quickly be elected and future requests will go through without a hitch. Applications can guard against such situations by handling exceptions and performing a retry.
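The retry guard mentioned at the end of the answer can be as simple as the generic wrapper below. It is a sketch only: the exception handling, attempt count, and back-off are illustrative, and a real application would catch the store’s specific failure exceptions rather than a blanket Exception.

```java
import java.util.concurrent.Callable;

// Generic bounded-retry wrapper for a write that may fail transiently during failover.
public class RetryingWriter {
    public static <T> T withRetry(Callable<T> op, int maxAttempts, long backoffMillis) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();                      // e.g. a store.put(...) wrapped in a Callable
            } catch (Exception e) {                    // real code would catch the store's failure exceptions
                last = e;
                Thread.sleep(backoffMillis * attempt); // simple linear back-off while a new master is elected
            }
        }
        throw last;                                    // give up after maxAttempts
    }
}
```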

Q10. Justin Sheehy of Basho in an interview said (1): “I would most certainly include updates to my bank account as applications for which eventual consistency is a good design choice. In fact, bankers have understood and used eventual consistency for far longer than there have been computers in the modern sense” Would you recommend to your clients to use Oracle NoSQL database for banking applications?

Dave Segleau: Absolutely. The Oracle NoSQL Database offers a range of transaction durability and consistency options on a per operation basis. The choice of eventual consistency is best made on a case by case basis, because while using it can open up new levels of scalability and performance, it does come with some risk and/or alternate processes which have a cost. Some NoSQL vendors don’t provide the options to leverage ACID transactions where they make sense, but the Oracle NoSQL Database does.

Q11. Could you detail how Elasticity is provided in R2?

Dave Segleau: The Oracle NoSQL database slices data up into partitions within highly available replication groups. Each replication group contains an elected master and a number of replicas based on user configuration. The exact configuration will vary depending on the read latency/write throughput requirements of the application. The processes associated with those replication groups run on hardware (Storage Nodes) declared to the Oracle NoSQL Database. For elasticity purposes, additional Storage Nodes can be declaratively added to a running system, in which case some of the data partitions will be re-allocated onto the new hardware, thereby increasing the number of shards and the write throughput. Additionally, the number of replicas can be increased to improve read latency and increase reliability. The process of rebalancing data partitions, spawning new replicas, and forming new Replication Groups will cause those internal data partitions to automatically move around the Storage Nodes to take advantage of the new storage capacity.

Q12. What is the implication, from a developer perspective, of having Avro schema support?

Dave Segleau: For the developer, it means better support for seamless JSON storage. There are other downstream implications, like compatibility and integration with Hadoop processing where AVRO is quickly becoming a standard not only for efficient wireline serialization protocols, but for HDFS storage. Also, AVRO is a very efficient serialization format for JSON, unlike other serialization options like BSON which tend to be much less efficient. In the future, Oracle NoSQL Database will leverage this flexible AVRO schema definition in order to provide features like table abstractions, projections and secondary index support.
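For concreteness, here is what an Avro schema for a simple JSON value might look like, parsed and populated with the standard Apache Avro Java library. The record name and fields are invented for the example and are not taken from any Oracle NoSQL sample.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

// A JSON value with an explicit Avro schema: the schema is itself JSON,
// and the serialized bytes are far more compact than the raw JSON text.
public class AvroValueExample {
    public static void main(String[] args) {
        String schemaJson =
            "{ \"type\": \"record\", \"name\": \"UserProfile\", \"namespace\": \"example\"," +
            "  \"fields\": [" +
            "    { \"name\": \"userId\", \"type\": \"long\" }," +
            "    { \"name\": \"email\",  \"type\": \"string\" }," +
            "    { \"name\": \"plan\",   \"type\": \"string\", \"default\": \"free\" }" +
            "  ] }";

        Schema schema = new Schema.Parser().parse(schemaJson);
        GenericRecord profile = new GenericData.Record(schema);
        profile.put("userId", 1234L);
        profile.put("email", "ada@example.org");
        profile.put("plan", "paid");
        System.out.println(profile); // prints the record as JSON-like text
    }
}
```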

Q13. How do you handle Large Object Support?

Dave Segleau: Oracle NoSQL Database provides a streaming interface for Large Objects. Internally, we break a Large Object up into chunks and use parallel operations to read and write those chunks to/from the database. We do it in an ordered fashion so that you can begin consuming the data stream before all of the contents are returned to the application. This is useful when implementing functionality like scrolling partial results, streaming video, etc. Large Object operations are restartable and recoverable. Let’s say that you start to write a 1 GB Large Object and sometime during the write operation a failure occurs and the write is only partially completed. The application will get an exception. When the application re-issues the Large Object operation, NoSQL resumes where it left off, skipping chunks that were already successfully written.
The Large Object chunking implementation also ensures that partially written Large Objects are not readable until they are completely written.

Q14. A NoSQL Database can act as an Oracle Database External Table. What does it mean in practice?

Dave Segleau: What we have achieved here is the ability to treat the Oracle NoSQL Database as a resource that can participate in SQL queries originating from an Oracle Database via standard SQL query facilities. Of course, the developer has to define a template that maps the “value” into a table representation. In release 2 we provide sample templates and configuration files that the application developer can use in order to define the required components. In the future, Oracle NoSQL Database will automate template definitions for JSON values. External Table capabilities give seamless access to both structured relational and unstructured NoSQL data using familiar SQL tools.

Q15. Why is Atomic Batching important?

Dave Segleau: If by Atomic Batching you mean the ability to perform more than one data manipulation in a single transaction, then atomic batching is the only real way to ensure logical consistency in multi-data update transactions. The Oracle NoSQL Database provides this capability for data beneath a given major key.
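A sketch of such a batch, again modeled on the Oracle NoSQL Database Java API with signatures quoted from memory (treat them as approximate): both writes share the major path /order/555, so the store can apply them as a single ACID unit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import oracle.kv.KVStore;
import oracle.kv.KVStoreConfig;
import oracle.kv.KVStoreFactory;
import oracle.kv.Key;
import oracle.kv.Operation;
import oracle.kv.OperationFactory;
import oracle.kv.Value;

public class AtomicBatchExample {
    public static void main(String[] args) throws Exception {
        KVStore store = KVStoreFactory.getStore(new KVStoreConfig("kvstore", "localhost:5000"));
        OperationFactory factory = store.getOperationFactory();

        // Both records live under the same major path, /order/555,
        // which is the unit of ACID transactions in the Major/Minor model.
        List<Operation> ops = new ArrayList<>();
        ops.add(factory.createPut(
                Key.createKey(Arrays.asList("order", "555"), Arrays.asList("status")),
                Value.createValue("shipped".getBytes())));
        ops.add(factory.createPut(
                Key.createKey(Arrays.asList("order", "555"), Arrays.asList("shippedDate")),
                Value.createValue("2013-07-02".getBytes())));

        store.execute(ops);   // applied atomically: either both writes happen or neither does
        store.close();
    }
}
```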

Q16. What are the suggested criteria for users when they need to choose between durability on the one hand, and lower latency, higher throughput and write availability on the other?

Dave Segleau: That’s a tough one to answer, since it is so case by case dependent. As discussed above in the banking question, in general if you can achieve your application latency goals while specifying high durability, then that’s your best course of action. However, if you have more aggressive low-latency/high-throughput requirements, you may have to assess the impact of relaxing your durability constraints and of the rare case where a write operation may fail. It’s useful to keep in mind that a write failure is a rare event because of the inherent reliability built into the technology.

Q17. Tomas Ulin mentioned in an interview (2) that “with MySQL 5.6, developers can now commingle the “best of both worlds” with fast key-value look up operations and complex SQL queries to meet user and application specific requirements”. Isn’t MySQL 5.6 in fact competing with Oracle NoSQL database?

Dave Segleau: MySQL is an SQL database with a KV API on top. We are a KV database. If you have an SQL application with occasional need for fast KV access, MySQL is your best option. If you need pure KV access with unlimited scalability, then NoSQL DB is your best option.

———
David Segleau is the Director of Product Management for the Oracle NoSQL Database, Oracle Berkeley DB and Oracle Database Mobile Server. He joined Oracle as the VP of Engineering for Sleepycat Software (makers of Berkeley DB). He has more than 30 years of industry experience, leading and managing technical product teams and working extensively with database technology as both a customer and a vendor.

Related Posts

(1) On Eventual Consistency– An interview with Justin Sheehy. August 15, 2012.

(2) MySQL-State of the Union. Interview with Tomas Ulin. February 11, 2013.

Resources

Charles Lamb’s Blog

ODBMS.org: Resources on NoSQL Data Stores:
Blog Posts | Free Software | Articles, Papers, Presentations| Documentations, Tutorials, Lecture Notes | PhD and Master Thesis.

Follow ODBMS.org on Twitter: @odbmsorg
##

Two cons against NoSQL. Part II.
Published 21 November 2012

This post is the second part of a series of feedback I received from various experts, with obviously different points of view, on:

Two cons against NoSQL data stores :

Cons1. “It’s very hard to move data out from one NoSQL to some other system, even other NoSQL. There is a very hard lock-in when it comes to NoSQL. If you ever have to move to another database, you have basically to re-implement a lot of your applications from scratch.”

Cons2. “There is no standard way to access a NoSQL data store. All tools that already exist for SQL have to be recreated for each of the NoSQL databases. This means that it will always be harder to access data in NoSQL than from SQL. For example, how many NoSQL databases can export their data to Excel? (Something every CEO wants to get sooner or later).”

You can also read Part I here.

RVZ
————

J Chris Anderson, Couchbase, cofounder and Mobile Architect:
On Cons1: JSON is the de facto format for APIs these days. I’ve found moving between NoSQL stores to be very simple, just a matter of swapping out a driver. It is also usually quite simple to write a small script to migrate the data between databases. Now, there aren’t pre-packaged tools for this yet, but it’s typically one page of code to do. There is some truth that if you’ve tightly bound your application to a particular query capability, there might be some more work involved, but at least you don’t have to redo your stored data structures.
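As an illustration of the “one page of code” claim, here is a hedged sketch that copies JSON documents from MongoDB into Couchbase using their present-day Java drivers. Connection strings, bucket and collection names are placeholders, and the driver entry points are written from memory, so check them against the current SDK documentation before relying on this.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.Collection;
import com.couchbase.client.java.json.JsonObject;

// One-page migration sketch: stream JSON documents out of MongoDB and upsert them into Couchbase.
public class DocStoreMigration {
    public static void main(String[] args) {
        try (MongoClient source = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> docs =
                    source.getDatabase("shop").getCollection("orders");

            Cluster target = Cluster.connect("couchbase://localhost", "user", "password");
            Collection bucket = target.bucket("orders").defaultCollection();

            for (Document d : docs.find()) {
                // Assumes _id is an ObjectId; reuse it as the target key.
                String id = d.getObjectId("_id").toHexString();
                d.remove("_id");                                  // the key lives outside the value
                bucket.upsert(id, JsonObject.fromJson(d.toJson()));
            }
            target.disconnect();
        }
    }
}
```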

On Cons2: I’m from the “it’s a simple matter of programming” school of thought, e.g. just write the query you need, and a little script to turn it into CSV. If you want to do all of this without writing code, then of course the industry isn’t as mature as RDBMS. It’s only a few years old, not decades. But this isn’t a permanent structural issue; it’s just an artifact of the relative freshness of NoSQL.

—-
Marko Rodriguez, Aurelius, cofounder:
On Cons1: NoSQL is not a mirror image of SQL. SQL databases such as MySQL, Oracle, PostgreSQL all share the same data model (table) and query language (SQL).
In the NoSQL space, there are numerous data models (e.g. key/value, document, graph) and numerous query languages (e.g. JavaScript MapReduce, the Mongo Query Language, Gremlin).
As such, the vendor lock-in comment should be directed at a particular type of NoSQL system, not at NoSQL in general. Next, in the graph space, there are efforts to mitigate vendor lock-in. TinkerPop provides a common set of interfaces that allow the various graph computing technologies to work together, and it allows developers to “plug and play” the underlying technologies as needed. In this way, for instance, TinkerPop’s Gremlin traversal language works for graph systems such as OrientDB, Neo4j, Titan, InfiniteGraph, RDF Sail stores, and DEX, to name several.
To stress the point, with TinkerPop and graphs, there is no need to re-implement an application as any graph vendor’s technology is interoperable and swappable.

On Cons2: Much of the argument above holds for this comment as well. Again, one should not see NoSQL as the space, but the particular subset of NoSQL (by data model) as the space to compare against SQL (table). In support of SQL, SQL and the respective databases have been around for much longer than most of the technologies in the NoSQL space. This means they have had longer to integrate into popular data workflows (e.g. Ruby on Rails). However, it does not mean “that it will always be harder to access data in NoSQL than from SQL.” New technologies emerge, they find their footing within the new generation of technologies (e.g. Hadoop) and novel ways of processing/exporting/understanding data emerge. If SQL was the end of the story, it would have been the end of the story.

—–
David Webber, Oracle, Information Technologist:
On Cons1: Well, it makes sense. Of course it depends what you are using the NoSQL store for – if it is a niche application – or innovative solution – then a “one off” may not be an issue for you. Do you really see people using NoSQL as their primary data store? As with any technology – knowing when to apply it successfully is always the key. And these aspects of portability help inform when NoSQL is appropriate. There are obviously more criteria as well that people should reference to understand when NoSQL would be suitable for their particular application. The good news is that there are solid and stable choices available should they decide NoSQL is their appropriate option. BTW – in the early days of SQL too – even with the ANSI standard – it was a devil to port across SQL implementations – not just syntax, but performance and technique issues – I know – I did three such projects!

—-
Wiqar Chaudry, NuoDB, Tech Evangelist:
On Cons1: The answer to the first scenario is relatively straightforward. There are many APIs like REST or third-party ETL tools that now support popular NoSQL databases. The right way to think about this is to put yourself in the shoes of multiple different users. If you are a developer then it should be relatively simple, and if you are a non-developer then it comes down to what third-party tools you have access to and those with which you are familiar. Re-educating yourself to migrate can be time-consuming if you have never used these tools, however. In terms of migrating applications from one NoSQL technology to another, this is largely dependent on how well the data access layer has been abstracted from the physical database. Unfortunately, since there is limited or no support for ORM technologies, this can indeed be a daunting task.

On Cons2: This is a fair assessment of NoSQL. It is limited when it comes to third-party tools and integration. So you will be spending time doing custom design.
However, it’s also important to note that the NoSQL movement was really born out of necessity. For example, technologies such as Cassandra were designed by private companies to solve a specific problem that a particular company was facing. Then the industry saw what NoSQL can do and everyone tried to adopt the technology as a general purpose database. With that said, what many NoSQL companies have ignored is the tremendous opportunity to take from SQL-based technologies the goodness that is applicable to 21st century database needs.

—–
Robert Greene, Versant, Vice President, Technology:
On Cons1: Yes, I agree that this is difficult with most NoSQL solutions and that is a problem for adopters.
Versant has taken the position of trying to be first to deliver enterprise connectivity and standards into the NoSQL community. Of course, we can only take that so far, because many of the concepts that make NoSQL attractive to adopters simply do not have an equivalent in the relational database world. For example, horizontal scale-out capabilities are only loosely defined for relational technologies, but certainly not standardized. Specifically in terms of moving data in/out of other systems, Versant has developed a connector for the Talend Open Studio which has connectivity to over 400 relational and non-relational data sources, making it easy to move data in and out of Versant depending on your needs. For the case of Excel, while it is certainly not our fastest interface, having recognized the need for data access from accepted tools, Versant has developed ODBC/JDBC capabilities which can be used to get data from Versant databases into things like Excel, Toad, etc.

On Cons2: Yes, I also agree that this is a problem for most NoSQL solutions, and again Versant is moving to bring better standards-based programming APIs to the NoSQL community. For example, in our Java language interface, we support JPA (Java Persistence API), which is the interface application developers get whenever they download the Java SDK. They can create an application using JPA and execute it against a Versant NoSQL database without implementing any relational mapping annotations or XML.
Versant thinks this is a great low-risk way for enterprise developers to test out the benefits of NoSQL. For example, if Versant does not perform much faster than the relational databases, run on much less hardware, and scale out effectively to multiple commodity servers, then they can simply take Hibernate or OpenJPA, EclipseLink, etc., drop it into place, do the mapping exercise, and then point it at their relational database with nothing lost in productivity.
In the .NET world, we have an internal implementation that supports LINQ and will be made available in the near future to interested developers. We are also supporting other standards in the area of production management, having SNMP capabilities so we can be integrated into tools like OpenView and others where IT folks can get a unified view of all their production systems.

I think we as an engineering discipline should not forget the lessons we learned in the early 2000s. Some pretty smart people helped many realize that what is important is the life cycle of your application objects, some of which are persistent, and that what matters is providing the appropriate abstraction for things like transaction demarcation, caching activation, state tracking (new, changed, deleted), etc. These are all features common to any application, and developers can easily abstract them away to be database-implementation independent, just like we did in the ORM days. It’s what we do as good software engineers: find the right abstractions and refine and reuse them over time. It is important that the NoSQL vendors embrace such an approach to ease the development burden of the practitioners that will use the technology.

—-
Jason Hunter, MarkLogic, Chief Architect:
On Cons1: When choosing a database, being future-proof is definitely something to consider. You never know where requirements will take you or what future technologies you’ll want to leverage. You don’t want your data locked into a proprietary format that’s going to paint you into a corner and reduce your options. That’s why MarkLogic chose XML (and now JSON also) as its internal data format. It’s an international standard. It’s plain-text, human readable, fully internationalized, widely deployed, and supported by thousands upon thousands of products. Customers choose MarkLogic for several reasons, but a key reason is that the underlying XML data format will still be understood and supported by vendors decades in the future. Furthermore, I think the first sentence above could be restated, “It’s very hard to move the data out from one SQL to some other system, even other SQL.” Ask anyone who’s tried!

On Cons2: People aren’t choosing NoSQL databases because they’re unhappy with the SQL language. They’re picking them because NoSQL databases provide a combination of feature, performance, cost, and flexibility advantages. Customers don’t pick MarkLogic to run away from SQL, they pick MarkLogic because they want the advantages of a document store, the power of integrated text search, the easy scaling and cost savings of a shared-nothing architecture, and the enterprise reliability of a mature product. Yes, there’s a use case for exporting data to Excel. That’s why MarkLogic has a SQL interface as well as REST and Java interfaces. The SQL interface isn’t the only interface, nor is it the most powerful (it limits MarkLogic down to the subset of functionality expressible in SQL), but it provides an integration path.

Related Posts
Two Cons against NoSQL. Part I. by Roberto V. Zicari on October 30, 2012

Resources
NoSQL Data Stores:
Blog Posts | Free Software | Articles, Papers, Presentations| Documentations, Tutorials, Lecture Notes | PhD and Master Thesis

##

Two Cons against NoSQL. Part I.
Published 30 October 2012

Two cons against NoSQL data stores read like this:

1. It’s very hard to move data out from one NoSQL to some other system, even other NoSQL. There is a very hard lock in when it comes to NoSQL. If you ever have to move to another database, you have basically to re-implement a lot of your applications from scratch.

2. There is no standard way to access a NoSQL data store.
All tools that already exist for SQL have to be recreated for each of the NoSQL databases. This means that it will always be harder to access data in NoSQL than from SQL. For example, how many NoSQL databases can export their data to Excel? (Something every CEO wants to get sooner or later).

These are valid points. I wanted to start a discussion on this.
This post is the first part of a series of feedback I received from various experts, with obviously different points of view.

I plan to publish Part II, with more feedback later on.

You are welcome to contribute to the discussion by leaving a comment if you wish!

RVZ
————

1. It’s very hard to move the data out from one NoSQL to some other system, even other NoSQL.

Dwight Merriman (founder of 10gen, maker of MongoDB): I agree it is still early and I expect some convergence in data models over time. By the way, I am having conversations with other NoSQL product groups about standards, but it is super early so nothing is imminent.
50% of the NoSQL products are JSON-based document-oriented databases.
So that is the greatest commonality. Use that and you have some good flexibility, and JSON is standards-based and widely used in general, which is nice. MongoDB, CouchDB, and Riak, for example, use JSON. (Mongo internally stores “BSON“.)

So moving data across these would not be hard.

1. If you ever have to move to another database, you basically have to re-implement a lot of your applications from scratch.

Dwight Merriman: Yes. Once again I wouldn’t assume that to be the case forever, but it is for the present. Also I think there is a bit of an illusion of portability with relational. There are subtle differences in the SQL, medium differences in the features, and there are giant differences in the stored procedure languages.
I remember at DoubleClick long ago we migrated from SQL Server to Oracle and it was a HUGE project. (We liked SQL Server; we just wanted to run on a very, very large server, i.e., we wanted vertical scaling at that time.)

Also: while porting might be work, given that almost all these products are open source, the potential “risks” of lock-in, I think, drop an order of magnitude; with open source the vendors can’t charge too much.

Ironically people are charged a lot to use Oracle, and yet in theory it has the portability properties that folks would want.

I would anticipate SQL-like interfaces for BI tool integration in all the products in the future. However, that doesn’t mean that is the way one will write apps. I don’t really think that, even when present, those are ideal for application development productivity.

1. For example, how many NoSQL databases can export their data to Excel? (Something every CEO wants to get sooner or later).

Dwight Merriman: With MongoDB, what I would do is use the mongoexport utility to dump to a CSV file and then load that into Excel. That is done often by folks today. And when there is nested data that isn’t tabular in structure, you can use the new Aggregation Framework to “unwind” it to a more matrix-like format for Excel before exporting.
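
To make that flow concrete, here is a minimal sketch in Python with pymongo rather than the mongoexport command line; the “orders” collection, its fields, and the output file name are illustrative assumptions, not taken from the interview.

    import csv
    from pymongo import MongoClient

    # Assumes a local MongoDB and a hypothetical "orders" collection whose
    # documents nest line items in an array.
    client = MongoClient("mongodb://localhost:27017")
    orders = client["shop"]["orders"]

    # $unwind flattens the nested array into one document per line item,
    # the matrix-like shape a spreadsheet expects.
    cursor = orders.aggregate([
        {"$unwind": "$line_items"},
        {"$project": {"_id": 0,
                      "customer": 1,
                      "sku": "$line_items.sku",
                      "qty": "$line_items.qty"}},
    ])

    # Write the flattened rows to CSV, ready to open in Excel.
    with open("orders.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["customer", "sku", "qty"])
        writer.writeheader()
        writer.writerows(cursor)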

You’ll see more and more tooling for stuff like that over time. Jaspersoft and Pentaho have MongoDB integration today, but the more the better.

John Hugg (VoltDB Engineering): Regarding your first point, about the issue of moving data out from one NoSQL system to another, even another NoSQL system:
There are a couple of angles to this. First, data movement itself is indeed much easier between systems that share a relational model.
Most SQL relational systems, including VoltDB, will import and export CSV files, usually without much issue. Sometimes you might need to tweak something minor, but it’s straightforward both to do and to understand.
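
As a rough illustration of that CSV round trip, the sketch below loads a CSV file exported from one relational system into another; SQLite stands in for the target here, and the table, columns, and file name are invented for the example.

    import csv
    import sqlite3

    # Target database; any SQL system with a CSV/bulk loader would do.
    conn = sqlite3.connect("target.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS accounts (id INTEGER, name TEXT, balance REAL)"
    )

    # Load the rows exported as CSV from the source system.
    with open("accounts.csv", newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        conn.executemany("INSERT INTO accounts VALUES (?, ?, ?)", reader)

    conn.commit()
    conn.close()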

Beyond just moving data, moving your application to another system is usually more challenging. As soon as you target a platform with horizontal scalability, an application developer must start thinking about partitioning and parallelism. This is true whether you’re moving from Oracle to Oracle RAC/Exadata, or whether you’re moving from MySQL to Cassandra. Different target systems make this easier or harder, from both development and operations perspectives, but the core idea is the same. Moving from a scalable system to another scalable system is usually much easier.

Where NoSQL goes a step further than scalability is in relaxing consistency and transactions in the database layer. While this simplifies the NoSQL system, it pushes complexity onto the application developer. A naive application port will be less successful, and a thoughtful one will take more time.
The amount of additional complexity largely depends on the application in question. Some apps are more suited to relaxed consistency than others. Other applications are nearly impossible to run without transactions. Most lie somewhere in the middle.

To the point about there being no standard way to access a NoSQL data store: while the tooling around some of the most popular NoSQL systems is improving, there’s no escaping that these are largely walled gardens.
The experience gained from using one NoSQL system is only loosely related to another. Furthermore, as you point out, non-traditional data models are often more difficult to export to the tabular data expected by many reporting and processing tools.

By embracing the SQL/Relational model, NewSQL systems like VoltDB can leverage a developer’s experience with legacy SQL systems, or other NewSQL systems.
All share a common query language and data model. Most can be queried at a console. Most have familiar import and export functionality.
The vocabulary of transactions, isolation levels, indexes, views, and more is all shared understanding. That’s especially impressive given the diversity in underlying architecture and target use cases of the many available SQL/Relational systems.

Finally, SQL/Relational doesn’t preclude NoSQL-style development models. Postgres, Clustrix and VoltDB support MongoDB/CouchDB-style JSON documents in columns. Functionality varies, but these systems can offer features not easily replicated on their NoSQL inspiration, such as JSON sub-key joins or multi-row/key transactions on JSON data.
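
A hedged sketch of the JSON-in-a-column idea, using PostgreSQL’s JSON operators through psycopg2; the “events” and “users” tables are hypothetical, and the syntax is Postgres-flavored rather than Clustrix- or VoltDB-specific.

    import psycopg2

    # Assumes a PostgreSQL version that ships the JSON operators and a
    # hypothetical schema: events(payload json) and users(id, name).
    conn = psycopg2.connect("dbname=demo user=demo")
    cur = conn.cursor()

    # Pull a sub-key out of the JSON payload and join it against an
    # ordinary relational table in the same statement.
    cur.execute("""
        SELECT u.name, e.payload ->> 'action'
        FROM events e
        JOIN users u ON u.id = (e.payload ->> 'user_id')::int
        WHERE e.payload ->> 'action' = %s
    """, ("checkout",))

    for name, action in cur.fetchall():
        print(name, action)

    cur.close()
    conn.close()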

1. It’s very hard to move the data out from one NoSQL to some other system, even other NoSQL. There is a very hard lock-in when it comes to NoSQL. If you ever have to move to another database, you basically have to re-implement a lot of your applications from scratch.

Steve Vinoski (Architect at Basho): Keep in mind that relational databases are around 40 years old while NoSQL is 3 years old. In terms of the technology adoption lifecycle, relational databases are well down toward the right end of the curve, appealing to even the most risk-averse consumer. NoSQL systems, on the other hand, are still riding the left side of the curve, appealing to innovators and the early majority who are willing to take technology risks in order to gain advantage over their competitors.

Different NoSQL systems make very different trade-offs, which means they’re not simply interchangeable. So you have to ask yourself: why are you really moving to another database? Perhaps you found that your chosen database was unreliable, or too hard to operate in production, or that your original estimates for read/write rates, query needs, or availability and scale were off such that your chosen database no longer adequately serves your application.
Many of these reasons revolve around not fully understanding your application in the first place, so no matter what you do there’s going to be some inconvenience involved in having to refactor it based on how it behaves (or misbehaves) in production, including possibly moving to a new database that better suits the application model and deployment environment.

2. There is no standard way to access a NoSQL data store.
All tools that already exist for SQL have to be recreated for each of the NoSQL databases. This means that it will always be harder to access data in NoSQL than from SQL. For example, how many NoSQL databases can export their data to Excel? (Something every CEO wants to get sooner or later).

Steve Vinoski: Don’t make the mistake of thinking that NoSQL is attempting to displace SQL entirely. If you want data for your Excel spreadsheet, or you want to keep using your existing SQL-oriented tools, you should probably just stay with your relational database. Such databases are very well understood, they’re quite reliable, and they’ll be helping us solve data problems for a long time to come. Many NoSQL users still use relational systems for the parts of their applications where it makes sense to do so.

NoSQL systems are ultimately about choice. Rather than forcing users to try to fit every data problem into the relational model, NoSQL systems provide other models that may fit the problem better. In my own career, for example, most of my data problems have fit the key-value model, and for that relational systems were overkill, both functionally and operationally. NoSQL systems also provide different tradeoffs in terms of consistency, latency, availability, and support for distributed systems that are extremely important for high-scale applications. The key is to really understand the problem your application is trying to solve, and then understand what different NoSQL systems can provide to help you achieve the solution you’re looking for.

1. It’s very hard to move the data out from one NoSQL to some other system, even other NoSQL. There is a very hard lock-in when it comes to NoSQL. If you ever have to move to another database, you basically have to re-implement a lot of your applications from scratch.

Cindy Saracco (IBM Senior Solutions Architect; these comments reflect my personal views and not necessarily those of my employer, IBM):
Since NoSQL systems are newer to market than relational DBMSs and employ a wider range of data models and interfaces, it’s understandable that migrating data and applications from one NoSQL system to another — or from NoSQL to relational — will often involve considerable effort.
However, I’ve heard more customer interest around NoSQL interoperability than migration. By that, I mean many potential NoSQL users seem more focused on how to integrate that platform into the rest of their enterprise architecture so that applications and users can have access to the data they need regardless of the underlying database used.

2. There is no standard way to access a NoSQL data store.
All tools that already exist for SQL have to be recreated for each of the NoSQL databases. This means that it will always be harder to access data in NoSQL than from SQL. For example, how many NoSQL databases can export their data to Excel? (Something every CEO wants to get sooner or later).

Cindy Saracco: From what I’ve seen, most organizations gravitate to NoSQL systems when they’ve concluded that relational DBMSs aren’t suitable for a particular application (or set of applications). So it’s probably best for those groups to evaluate what tools they need for their NoSQL data stores and determine what’s available commercially or via open source to fulfill their needs.
There’s no doubt that a wide range of compelling tools are available for relational DBMSs and, by comparison, fewer such tools are available for any given NoSQL system. If there’s sufficient market demand, more tools for NoSQL systems will become available over time, as software vendors are always looking for ways to increase their revenues.

As an aside, people sometimes equate Hadoop-based offerings with NoSQL.
We’re already seeing some “traditional” business intelligence tools (i.e., tools originally designed to support query, reporting, and analysis of relational data) support Hadoop, as well as newer Hadoop-centric analytical tools emerge.
There’s also a good deal of interest in connecting Hadoop to existing data warehouses and relational DBMSs, so various technologies are already available to help users in that regard. IBM happens to be one vendor that’s invested quite a bit in different types of tools for its Hadoop-based offering (InfoSphere BigInsights), including a spreadsheet-style analytical tool for non-programmers that can export data in CSV format (among others), Web-based facilities for administration and monitoring, Eclipse-based application development tools, text analysis facilities, and more. Connectivity to relational DBMSs and data warehouses is also part of IBM’s offerings. (Anyone who wants to learn more about BigInsights can explore links to articles, videos, and other technical information available through its public wiki.)

Related Posts

On Eventual Consistency– Interview with Monty Widenius. by Roberto V. Zicari on October 23, 2012

On Eventual Consistency– An interview with Justin Sheehy. by Roberto V. Zicari, August 15, 2012

Hadoop and NoSQL: Interview with J. Chris Anderson by Roberto V. Zicari

##

Re-thinking Relational Database Technology. Interview with Barry Morris, Founder & CEO NuoDB. http://www.odbms.org/blog/2011/12/re-thinking-relational-database-technology-interview-with-barry-morris-founder-ceo-nuodb/ http://www.odbms.org/blog/2011/12/re-thinking-relational-database-technology-interview-with-barry-morris-founder-ceo-nuodb/#comments Wed, 14 Dec 2011 08:14:52 +0000 http://www.odbms.org/blog/?p=1292 The landscape for data management products is rapidly evolving. NuoDB is a new entrant in the market. I asked a few questions to Barry Morris, Founder & Chief Executive Officer NuoDB.

RVZ

Q1. What is NuoDB?

Barry Morris: NuoDB combines the power of standard SQL with cloud elasticity. People want SQL and ACID transactions, but these have been expensive to scale up or down. NuoDB changes that by introducing a breakthrough “emergent” internal architecture.

NuoDB scales elastically as you add machines and is self-healing if you take machines away. It also allows you to share machines between databases, vastly increases business continuity guarantees, and runs in geo-distributed active/active configurations. It does all this with very little database administration.

Q2. When did you start the company?

Barry Morris: The technology has been developed over a period of several years but the company was funded in 2010 and has been operating out of Cambridge MA since then.

Q3. Big data, Web-scale concurrency and Cloud: how can NuoDB help here?

Barry Morris: These are some of the main themes of modern system software.
The 30-year-old database architecture that is universally used by traditional SQL database systems is great for traditional applications and in traditional datacenters, but the old architecture is a liability in web-facing applications running on private, public or hybrid clouds.

NuoDB is a general purpose SQL database designed specifically for these modern application requirements.
Massive concurrency with NuoDB is easy to understand: if the system is overloaded with users, you can just add as many diskless “transaction managers” as you need. NuoDB handles big data by using redundant key-value stores for its storage architecture.

Cloud support boils down to the Five Key Cloud requirements: elastic scalability, sharing machines between databases, extreme availability, geo-distribution, and “zero” DBA costs. You get these for free from a database with an emergent architecture, such as NuoDB.

Data

Q4. What kinds of data (structured, unstructured) and what volumes of data is NuoDB able to handle?

Barry Morris: NuoDB can handle any kind of data, in a set-based relational model.

Naturally we support all standard SQL types. Additionally we have rich BLOB support that allows us to store anything in an opaque fashion. A forthcoming release of the product will also support user-defined types.

NuoDB extends the traditional SQL type model in some interesting ways. We store arbitrary-length strings and numbers because we store everything by value, not by type. Schemas can change easily and dynamically because they are not tightly coupled to the data.

Additionally NuoDB supports table inheritance, a powerful feature for applications that traditionally have wide, sparse tables.

Q5. How do you ensure data availability?

Barry Morris: There is no single point of failure in NuoDB. In fact it is quite hard to stop a NuoDB database from running, or to lose data.

There are three tiers of processes in a NuoDB solution (Brokers, Transaction Managers, and Archive Managers) and each tier can be arbitrarily redundant. All NuoDB Brokers, Transaction Managers and Archive Managers run as true peers of others in their respective tiers. To stop a database you have to stop all processes on at least one tier.

Data is stored redundantly, in as many Archive Managers as you want to deploy. All Archive Managers are peers and if you lose one the system just keeps going with the remaining Archive Managers.

Additionally the system can run across multiple datacenters. In the case of losing a datacenter a NuoDB database will keep running in the other datacenters.

Transactions

Q6. How do you orchestrate and guarantee ACID transactions in a distributed environment?

Barry Morris: We do not use a network lock manager because we need to be asynchronous in order to scale elastically.

Instead concurrency and consistency are managed using an advanced form of MVCC (multi-version concurrency control). We never actually delete or update a record in the system directly. Instead we always create new versions of records pointing at the old versions. It is our job to do all the bookkeeping about who sees which versions of which records and at what time.
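
As a toy illustration of that version-chain idea (not NuoDB’s actual implementation; the names and the integer “clock” are invented), writers append new versions and each reader sees the newest version committed before its own start time:

    import itertools

    _clock = itertools.count(1)  # stand-in for a global commit/start ordering

    class Record:
        def __init__(self, value, txn_id):
            self.versions = [(txn_id, value)]  # (commit id, value), oldest first

        def write(self, value, txn_id):
            # Never update in place: append a new version pointing past the old ones.
            self.versions.append((txn_id, value))

        def read(self, as_of):
            # Return the newest version visible at the reader's start time.
            for committed, value in reversed(self.versions):
                if committed <= as_of:
                    return value
            return None

    balance = Record(100, txn_id=next(_clock))
    reader_start = next(_clock)               # a reader begins here
    balance.write(80, txn_id=next(_clock))    # a later writer commits a new version

    print(balance.read(as_of=reader_start))   # the earlier reader still sees 100
    print(balance.read(as_of=next(_clock)))   # a reader starting now sees 80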

The durability model is based on redundant storage in Key-value stores we call Archive Managers.

Q7. You say you use a distributed non-blocking atomic commit protocol. What is it? What is it useful for?

Barry Morris: Until changes to the state of a transactional database are committed they are not part of the state of the database. This is true for any ACID database system.

The Commit Protocol in NuoDB is complex because at any time we can have thousands of concurrent applications reading, writing, updating and deleting data in an asynchronous distributed system with multiple live storage nodes. A naïve design would “lock” the system in order to commit a change to the durable state, but that is a good way of ensuring the system does not perform or scale. Instead we have a distributed, asynchronous commit protocol that allows a transaction to commit without requiring network locks.

Architecture

Q8. What is special about NuoDB Architecture?

Barry Morris: NuoDB has an emergent architecture. It is like a flock of birds that can fly in an organized formation without having a central “brain”. Each bird follows some simple rules and the overall effect is to organize the group.

In NuoDB, no one is in charge. There is no supervisor, no master, and no central authority on anything. Everything that would normally be centralized is distributed. Any Transaction Manager or Archive Manager with the right security credentials can dynamically opt in or opt out of a particular database. All properties of the system emerge from the interactions of peers participating on a discretionary basis rather than from a monolithic central coordinator.

SQL

Q9. Do you support SQL?

Barry Morris: Yes. NuoDB is a relational database, and SQL is the primary query language. We have very broad and very standard SQL support.

Performance

Q10. How do you reduce the overhead when storing data to a disk? Do you use in-memory cache?

Barry Morris: Our storage model is that we have multiple redundant Archive Nodes for any given database. Archive Nodes are key-value stores that know how to create and retrieve blobs of data. Archive Nodes can be implemented on top of any key-value store (we currently have a file-system implementation and an Amazon S3 implementation). Any database can use multiple types of Archive Nodes at the same time (e.g. SSD-based alongside disk-based).
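
The pluggable blob-store idea can be pictured as a small get/put interface with interchangeable backends; this is a hedged sketch with invented names, not NuoDB’s API, and an S3-backed class would simply implement the same two methods.

    import os

    class FileArchive:
        """Stores opaque blobs keyed by name on the local file system."""

        def __init__(self, root):
            self.root = root
            os.makedirs(root, exist_ok=True)

        def put(self, key, blob):
            with open(os.path.join(self.root, key), "wb") as f:
                f.write(blob)

        def get(self, key):
            with open(os.path.join(self.root, key), "rb") as f:
                return f.read()

    # A database could write the same blob to several such archives for redundancy.
    archive = FileArchive("/tmp/archive-node")
    archive.put("atom-42", b"\x00\x01\x02")
    print(archive.get("atom-42"))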

Archive Nodes allow trade-offs to be made on disk write latency. For an ultra-conservative strategy you can run the system with fully journaled, flushed disk writes on multiple Archive Nodes in multiple datacenters; on the other end of the spectrum you can rely on redundant memory copies as your committed data, with disk writes happening as a background activity. And there are options in between. In all cases there are caching strategies in place that are internal to the Archive Nodes.

Q11. How do you achieve that query processing scales with the number of available nodes?

Barry Morris: Queries running on multiple nodes get the obvious benefit of node parallelism at the query-processing level. They additionally get parallelism at the disk-read level because there are multiple Archive Nodes. In most applications the most significant scaling benefit is that the Transaction Nodes (which do the query processing) can load data from each other’s memory if it is available there. This is orders of magnitude faster than loading data from disk.

Q12. What about storage? How does it scale?

Barry Morris: You can have arbitrarily large databases, bounded only by the storage capacity of your underlying Key-value store. The NuoDB system itself is agnostic about data set size.

Working set size is very important for performance. If you have a fairly stable working set that fits in the distributed memory of your Transaction Nodes then you effectively have a distributed in-memory database, and you will see extremely high performance numbers. With the low and decreasing cost of memory this is not an unusual circumstance.

Scalability and Elasticity

Q13. How do you obtain scalability and elasticity?

Barry Morris: These are simple consequences of the emergent architecture.
Transaction Managers are diskless nodes. When a Transaction Manager is added to a database it starts getting allocated work and will consequently increase the throughput of the database system. If you remove a transaction manager you might have some failed transactions, which the application is necessarily designed to handle, but you won’t lose any data and the system will continue regardless.

Q14. How complex is for a developer to write applications using NuoDB?

Barry Morris: It’s no different from traditional relational databases. NuoDB offers standard SQL with JDBC, ODBC, Hibernate, Ruby ActiveRecord, etc.

Q15. NuoDB is in a restricted Beta program. When will it be available in production? Is there a way to try the system now?

Barry Morris: NuoDB is in its Beta 4 release. We expect it to ship by the end of the year, or very early in 2012. You can try the system by going to our download page.

_____________________________

Related Posts

vFabric SQLFire: Better then RDBMS and NoSQL?

MariaDB: the new MySQL? Interview with Michael Monty Widenius.

On Versant’s technology. Interview with Vishal Bagga.

Interview with Jonathan Ellis, project chair of Apache Cassandra.

The evolving market for NoSQL Databases: Interview with James Phillips.

Related Resources

ODBMS.org: Free Downloads and Links on various data management technologies:
*Analytical Data Platforms
*Cloud Data Stores
*Databases in General
*Entity Framework (EF) Resources
*Graphs and Data Stores
*NoSQL Data Stores
*Object Databases
*Object-Oriented Programming
*Object-Relational Impedance Mismatch
*ORM Technology
##

The future of data management: “Disk-less” databases? Interview with Goetz Graefe http://www.odbms.org/blog/2011/08/the-future-of-data-management-disk-less-databases-interview-with-goetz-graefe/ http://www.odbms.org/blog/2011/08/the-future-of-data-management-disk-less-databases-interview-with-goetz-graefe/#comments Mon, 29 Aug 2011 07:22:11 +0000 http://www.odbms.org/blog/?p=1085 “With no disks and thus no seek delays, assembly of complex objects will have different performance tradeoffs. I think a lot of options in physical database design will change, from indexing to compression and clustering and replication.” — Goetz Graefe.

Are “disk-less” databases the future of data management? What about the issue of energy consumption and databases? Will we have “Green Databases”?
I discussed these issues with Dr. Goetz Graefe, HP Fellow (*), one of the most accomplished and influential technologists in the areas of database query optimization and query processing.

Hope you’ll enjoy this interview.

RVZ.

Q1 The world of data management is changing. Service platforms, scalable cloud platforms, analytical data platforms, NoSQL databases and new approaches to concurrency control are all becoming hot topics both in academia and industry. What is your take on this?

Goetz Graefe: I am wondering whether new and imminent hardware, in particular very large RAM memory as well as inexpensive non-volatile memory (NV-RAM, PCM, etc.), will bring significant shifts that affect software architecture, functionality, scalability, total cost of ownership, etc.
Hasso Plattner in his keynote at BTW (“SanssouciDB: An In-Memory Database for Processing Enterprise Workloads”) definitely seemed to think so. Whether or not one agrees with everything he said, he traced many of the changes back to not having disk drives in their new system (except for backup and restore).

For what it’s worth, I suspect that, based on the performance advantage of semiconductor storage, the vendors will ask for a price premium for semiconductor storage for a long time to come. That enables disk vendors to sell less expensive storage space, i.e., there’ll continue to be a role for traditional disk space for colder data such as long-ago history analyzed only once in a while.

I think there’ll also be a differentiation by RAID level. For example, warm data with updates might be on RAID-1 (mirroring) whereas cold historical data might be on RAID-5 or RAID-6 (dual redundancy, no data loss in dual failure). In the end, we might end up with more rather than fewer levels in the memory hierarchy.

Q2. What is expected impact of large RAM (volatile) memory on database technology?

Goetz Graefe: I believe that large RAM (volatile) memory has already made a significant difference. SAP’s HANA/Sanssouci/NewDB project is one example; C-store/VoltDB is another. Other database management system vendors are sure to follow.

It might be that NoSQL databases and key-value stores will “win” over traditional databases simply because they adapt faster to the new hardware, even if purely because they currently are simpler than database management systems and contain less code that needs adapting.

Non-volatile memory such as phase-change memory and memristors will change a lot of requirements for concurrency control and recovery code. With storage in RAM, including non-volatile RAM, compression will increase in economic value, sophistication, and compression factors. Vertica, for example, already uses multiple compression techniques, some of them pretty clever.

Q3. Will we end up having “disk less” databases then?

Goetz Graefe: With no disks and thus no seek delays, assembly of complex objects will have different performance tradeoffs. I think a lot of options in physical database design will change, from indexing to compression and clustering and replication.
I suspect we’ll see disk-less databases where the database contains only application state, e.g., current account balances, currently active logins, current shopping carts, etc. Disks will continue to have a role and economic value where the database also contains history, including cold history such as transactions that affected the account balances, login & logout events, click streams eventually leading to shopping carts, etc.

Q4. Where will the data go if we have no disks? In the Cloud?

Goetz Graefe: Public clouds in some cases, private clouds in many cases. If “we” don’t have disks, someone else will, and many of us will use them whether we are aware of it or not.

Q5. As new developments in memory (also flash) occur, it will result in possibly less energy consumption when using a database. Are we going to see “Green Databases” in the near future?

Goetz Graefe: I think energy efficiency is a terrific question to pursue. I know of several efforts, e.g., by Jignesh Patel et al. and Stavros Harizopoulos et al. Your own students at DBIS Goethe Universität just did a very nice study, too.

It seems to me there are many avenues to pursue.

For example, some people just look at the most appropriate hardware, e.g., high-performance CPUs such as Xeon versus high-efficiency CPUs such as Centrino (back then). Similar thoughts apply to storage, e.g., (capacity-optimized) 7200 rpm SATA drives versus (performance-optimized) 15K rpm fiber channel drives.
Others look at storage placement, e.g., RAID-1 versus RAID-5/6, and at storage formats, e.g., columnar storage & compression.
Others look at workload management, e.g., deferred index maintenance (& view maintenance) during peak load (perhaps the database equivalent to load shedding in streams) or ‘pause and resume’ functionality in utilities such as index creation.
Yet others look at caching, e.g., memcached. Etc.

Q6. What about Workload management?

Goetz Graefe: Workload management really has two aspects: the policy engine including its monitoring component that provides input into policies, and the engine mechanisms that implement the policies. It seems that most people focus on the first aspect above. Typical mechanisms are then quite crude, e.g., admission control.

I have long been wondering about more sophisticated and graceful mechanisms. For example, should workload management control memory allocation among operations? Should memory-intensive operations such as sorting grow & shrink their memory allocation during their execution (i.e., not only when they start)? Should utilities such as index creation participate in resource management? Should index creation (etc.) support ‘pause and resume’ functionality?

It seems to me that I’d want to say ‘yes’ to all those questions. Some of us at Hewlett-Packard Laboratories have been looking into engine mechanisms in that direction.

Q7. What are the main research questions for data management and energy efficiency?

Goetz Graefe: I’ve recently attended a workshop by NSF on data management and energy efficiency.
The topic was split into data management for energy efficiency (e.g., sensors & history & analytics in smart buildings) and energy efficiency in data management (e.g., efficiency of flash storage versus traditional disk storage).
One big issue in the latter set of topics was the difference between traditional performance & scalability improvements versus improvements in energy efficiency, and we had a hard time coming up with good examples where the two goals (performance, efficiency) differed. I suspect that we’ll need cases with different resources, e.g., trading 1 second of CPU time (50 Joule) against 3 seconds of disk time (20 Joule).
NSF (the US National Science Foundation) seems to be very keen on supporting good research in these directions. I think that’s very timely and very laudable. I hope they’ll receive great proposals and can fund many of them.

————————–
Dr. Goetz Graefe is an HP Fellow, and a member of the Intelligent Information Management Lab within Hewlett-Packard Laboratories. His experience and expertise are focused on relational database management systems, gained in academic research, industrial consulting, and industrial product development.

His current research efforts focus on new hardware technologies in database management as well as robustness in database request processing in order to reduce total cost of ownership. Prior to joining Hewlett-Packard Laboratories in 2006, Goetz spent 12 years as software architect in product development at Microsoft, mostly in database management. Both query optimization and query execution of Microsoft’s re-implementation of SQL Server are based on his designs.

Goetz’s areas of expertise within database management systems include compile-time query optimization including extensible query optimization, run-time query execution including parallel query execution, indexing, and transactions. He has also worked on transactional memory, specifically techniques for software implementations of transactional memory.

Goetz studied Computer Science at TU Braunschweig from 1980 to 1983.

(*) HP Fellows are “pioneers in their fields, setting the standards for technical excellence and driving the direction of research in their respective disciplines”.

Resources:

NoSQL Data Stores (Free Downloads and Links).

Cloud Data Stores (Free Downloads and Links).

Graphs and Data Stores (Free Downloads and Links).

Databases in General (Free Downloads and Links).

##

“Applying Graph Analysis and Manipulation to Data Stores.” http://www.odbms.org/blog/2011/06/applying-graph-analysis-and-manipulation-to-data-stores/ http://www.odbms.org/blog/2011/06/applying-graph-analysis-and-manipulation-to-data-stores/#comments Wed, 22 Jun 2011 06:56:59 +0000 http://www.odbms.org/blog/?p=943 ” This mind set is much different from the set theoretic notions of the relational database world. In the world of graphs, everything is seen as a walk—a traversal. ” — Marko Rodriguez and Peter Neubauer.
__________________________________________________

Interview with Marko Rodriguez and Peter Neubauer.

The open source community is quite active in the area of Graph Analysis and Manipulation, and their applicability to new data stores. I wanted to know more about an open source initiative called “TinkerPop”.
I have interviewed Marko Rodriguez and Peter Neubauer, who are the leaders of the “TinkerPop” project.

RVZ

Q1. You recently started a project called “TinkerPop”. What is it?

Marko Rodriguez and Peter Neubauer:
TinkerPop is an open-source graph software group. Currently, we provide a stack of technologies (called the TinkerPop stack) and members contribute to those aspects of the stack that align with their expertise. The stack starts just above the database layer (just above the graph persistence layer) and connects to various graph database vendors — e.g. Neo4j, OrientDB, DEX, RDF Sail triple/quad stores, etc.

The graph database space is a relatively nascent space. At the time that TinkerPop started back in 2009, graph database vendors were primarily focused on graph persistence issues: storing and retrieving a graph structure to and from disk. Given the expertise of the original TinkerPop members (Marko, Peter, and Josh), we decided to take our research (from our respective institutions) and apply it to the creation of tools one step above the graph persistence layer. Out of that effort came Gremlin, the first TinkerPop project. In late 2009, Gremlin was pieced apart into multiple self-contained projects: Blueprints and Pipes.
From there, other TinkerPop products have emerged which we discuss later.

Q2. Who currently work on “Tinkerpop”?

Marko Rodriguez and Peter Neubauer:
The current members of TinkerPop are Marko A. Rodriguez (USA), Peter Neubauer (Sweden), Joshua Shinavier (USA), Stephen Mallette (USA), Pavel Yaskevich (Belarus), Derrick Wiebe (Canada), and Alex Averbuch (New Zealand).
However, recently, while not yet an “official member” (i.e. picture on website), Pierre DeWilde (Belgium) has contributed much to TinkerPop through code reviews and community relations. Finally, we have a warm, inviting community where users can help guide the development of the TinkerPop stack.

Q3. You say, that you intend to provide higher-level graph processing tools, APIs and constructs? Who needs them? and for what?

Marko Rodriguez and Peter Neubauer:
TinkerPop facilitates the application of graphs to various problems in engineering. These problems are generally defined as those that require expressivity and speed when traversing a joined structure. The joined structure is provided by a graph database. With a graph database, a user does not arbitrarily join two tables according to some predicate, as there is no notion of tables.
There only exists a single atomic structure known as the graph. However, in order to unite disparate data, a traversal is enacted that moves over the data in order to yield some computational side-effect — e.g. a search, a score, a rank, a pattern match, etc.
The benefit of the graph comes from being able to rapidly traverse structures to an arbitrary depth (e.g., tree structures, cyclic structures) and with an arbitrary path description (e.g. friends that work together, roads below a certain congestion threshold). Moreover, this space provides a unique way of thinking about data processing.
We call this data processing pattern, the graph traversal pattern.
This mind set is much different from the set theoretic notions of the relational database world. In the world of graphs, everything is seen as a walk—a traversal.

Q4. Why using graphs and not objects and/or classical relations? What about non normalized data structures offered by NoSQL databases?

Marko Rodriguez and Peter Neubauer:
In a world where memory is expensive, hybrid memory/disk technology is a must (colloquially, a database).
A graph database is nothing more than a memory/disk technology that allows for the rapid creation of an in-memory object (sub)graph from a disk-based (full)graph. A traversal (the means by which data is queried/processed) is all about filling in memory those aspects of the persisted graph that are being touched as the traverser moves along the graph’s vertices and edges.
Graph databases simply cache what is on disk into memory which makes for a highly reusable in-memory cache.
In contrast, with a relational database, where any table can be joined with any table, many different data structures are constructed from the explicit tables persisted. Unlike a relational database, a graph database has one structure, itself.
Thus, components of itself are always reusable. Hence, a “highly reusable cache.” Given this description, if a persistence engine is sufficiently fast at creating an in-memory cache, then it meets the processing requirements of a graph database user.

Q5. Besides graph databases, who may need Tinkerpop tools? Could they be useful for users of relational databases as well? or of other databases, like for example NoSQL or Object Databases? If yes, how?

Marko Rodriguez and Peter Neubauer:
In the end, the TinkerPop stack is based on the low-level Blueprints API.
By implementing the Blueprints API and making it sufficiently speedy, any database can, in theory, provide graph processing functionality. So yes, TinkerPop could be leveraged by other database technologies.

Q6. Tinkerpop is composed of several sub projects: Gremlin, Pipes, Blueprints and more. At a first glimpse, it is difficult to grasp how they are related to each other. What are all these sub projects? do they all relate with each other?

Marko Rodriguez and Peter Neubauer:
The TinkerPop stack is described from bottom-to-top:
Blueprints: A graph API with an operational semantics test suite that when implemented, yields a Blueprints-enabled graph database which is accessible to all TinkerPop products.
Pipes: A data flow framework that allows for lazy graph traversing.
Gremlin: A graph traversal language that compiles down to Pipes.
Frames: An object-to-graph mapper that turns vertices and edges into objects and relations (and vice versa).
Rexster: A RESTful graph server that exposes the TinkerPop suite “over the wire.”

Q7. Is there a unified API for Tinkerpop? And if yes, how does it look like?

Marko Rodriguez and Peter Neubauer:
Blueprints is the foundation of TinkerPop.
You can think of Blueprints as the JDBC of the graph database community. Many graph vendors, while providing their own APIs, also provide a Blueprints implementation so the TinkerPop stack can be used with their database. Currently, Neo4j, OrientDB, DEX, RDF Sail, TinkerGraph, and Rexster are all TinkerPop promoted/supported implementations.
However, out there in the greater developer community, there exists an implementation for HBase (GraphBase) and Redis (Blueredis). Moreover, the graph database vendor InfiniteGraph plans to release a Blueprints implementation in the near future.

Q8. In your projects you speak of “dataflow-inspired traversal models”. What is it?

Marko Rodriguez and Peter Neubauer:
Data flow graph processing, in the Pipes/Gremlin-sense, is a lazy iteration approach to graph traversing.
In this model, chains of pipes are connected. Each pipe is a computational step that is one of three types of operations: transform, filter, or side-effect.
A transformation pipe will take data of one type and emit data of another type. For example, given a vertex, a pipe will emit its outgoing edges. A filter pipe will take data and either emit it or not. For example, given an edge, emit it if its label equals “friend.” Finally, a side-effect will take data and emit the same data, however, in the process, it will yield some side-effect.
For example, increment a counter, update a ranking, print a value to standard out, etc.
Pipes is a library of general purpose pipes that can be composed to effect a graph traversal based computation. Finally, Gremlin is a DSL (domain specific language) that supports the concise specification of a pipeline. The Gremlin code base is actually quite small — all of the work is in Pipes.
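
The transform / filter / side-effect pattern can be mimicked with plain Python generators to show the lazy, composed iteration described above; this is an illustrative sketch, not the Pipes Java API, and the tiny graph is invented.

    def out_edges(vertices, graph):
        # Transform pipe: vertex -> its outgoing edges.
        for v in vertices:
            for edge in graph[v]:
                yield edge

    def label_filter(edges, label):
        # Filter pipe: emit an edge only if its label matches.
        for edge in edges:
            if edge["label"] == label:
                yield edge

    def count(edges, counter):
        # Side-effect pipe: tally each edge, then pass it through unchanged.
        for edge in edges:
            counter["n"] += 1
            yield edge

    graph = {"marko": [{"label": "friend", "to": "peter"},
                       {"label": "created", "to": "gremlin"}],
             "peter": [{"label": "friend", "to": "marko"}]}

    stats = {"n": 0}
    pipeline = count(label_filter(out_edges(["marko"], graph), "friend"), stats)
    print([edge["to"] for edge in pipeline], stats)  # ['peter'] {'n': 1}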

Q9. How other developers could contribute to this project?

Marko Rodriguez and Peter Neubauer:
New members tend to be users. A user will get excited about a particular product or some tangent idea that is generally useful to the community. They provide thoughts, code, and ultimately, if they “click” with the group (coding style, approach, etc.), then they become members. For example, Stephen Mallette was very keen on advancing Rexster and as such, has and continues to work wonders on the server codebase.
Pavel Yaskevich was interested in the compiler aspects of Gremlin and contributed on that front through many versions. Pavel is also a contributing member to Cassandra’s recent query language known as CQL.
Derrick Wiebe has contributed a lot to Pipes and, in his day job, needed to advance particular aspects of Blueprints (and luckily, this benefits others). There are no hard rules to membership. Primarily it’s about excitement, dedication, and expert-level development.
In the end, the community requires that TinkerPop be a solid stack of technologies that is well thought out and consistent throughout. In TinkerPop, it’s less about features and lines of code than it is about a consistent story that resonates well for those succumbing to the graph mentality.

____________________________________________________________________________________

Marko A. Rodriguez:
Dr. Marko A. Rodriguez currently owns the graph consulting firm Aurelius LLC. Prior to this venture, he was a Director’s Fellow at the Center for Nonlinear Studies at the Los Alamos National Laboratory and a Graph Systems Architect at AT&T.
Marko’s work for the last 10 years has focused on the applied and theoretical aspects of graph analysis and manipulation.

Peter Neubauer:
Peter Neubauer has been deeply involved in programming for over a decade and is co-founder of a number of popular open source projects such as Neo4j, TinkerPop, OPS4J and Qi4j. Peter loves connecting things, writing novel prototypes and throwing together new ideas and projects around graphs and society-scale innovation.
Right now, Peter is the co-founder and VP of Product Development at Neo4j Technology, the company sponsoring the development of the Neo4j graph database.
If you want brainstorming, feed him a latte and you are in business.

______________________________________

For further readings

Graphs and Data Stores:
Blog Posts | Free Software | Articles, Papers, Presentations | Tutorials, Lecture Notes

Related Posts

“On Graph Databases: Interview with Daniel Kirstenpfad”.

“Marrying objects with graphs”: Interview with Darren Wood.

“Interview with Jonathan Ellis, project chair of Apache Cassandra”.

The evolving market for NoSQL Databases: Interview with James Phillips.”

_________________________

Interview with Rick Cattell: There is no “one size fits all” solution. http://www.odbms.org/blog/2011/01/interview-with-rick-cattell/ http://www.odbms.org/blog/2011/01/interview-with-rick-cattell/#comments Tue, 18 Jan 2011 16:13:57 +0000 http://www.odbms.org/blog/?p=488 I start in 2011 with an interview with Dr. Rick Cattell.
Rick is best known for his contributions to database systems and middleware: he was a founder of SQL Access (a predecessor to ODBC), the founder and chair of the Object Data Management Group (ODMG), and the co-creator of JDBC.

Rick has worked for over twenty years at Sun in management and senior technical roles, and for ten years in research at Xerox PARC and at Carnegie-Mellon University.

You can download the article by Rick Cattell: “Relational Databases, Object Databases, Key-Value Stores, Document Stores, and Extensible Record Stores: A Comparison.”

And look at recent posts on the same topic: “New and Old Data Stores”, and “NoSQL databases”.

RVZ

Q1. Traditionally, the obvious platform for most database applications has been a relational DBMS. Why do we need new Data Stores?

Rick Cattell: Basically, things changed with “Web 2.0” and with other applications where there were thousands or millions of users writing as well as reading a database. RDBMSs could not scale to this number of writers. Amazon (with Dynamo) and Google (with BigTable) were forced to develop their own scalable datastores. A host of others followed suit.

Q2. There has been recently a proliferation of “new data stores”, such as “document stores”, and “nosql databases”: What are the differences between them?

Rick Cattell: That’s a good question. The proliferation and differences are confusing, and I have no one-paragraph answer to this question. The systems differ in data model, consistency models, and many other dimensions. I wrote a couple of papers and provide some references on my website; these may be helpful for more background. There I categorize several kinds of “NoSQL” data stores, according to data model: key-value stores, document stores, and extensible record stores. I also discuss scalable SQL stores.

Q3. How new data stores compare with respect to relational databases?

Rick Cattell: In a nutshell, NoSQL datastores give up SQL and they give up ACID transactions, in exchange for scalability. Scalability is achieved by partitioning and/or replicating the data over many servers. There are some other advantages, as well: for example, the new data stores generally do not demand a fixed data schema, and provide a simpler programming interface, e.g. a RESTful interface.
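
To illustrate the “simpler programming interface” point, here is a hedged example against CouchDB’s HTTP API using the Python requests library; it assumes a CouchDB server on localhost:5984 that accepts unauthenticated writes, and the database and document names are invented.

    import requests

    base = "http://localhost:5984/userdb"

    # Create the database (CouchDB returns 412 if it already exists; fine for a demo).
    requests.put(base)

    # Store a schema-less JSON document under a key; no table definition needed.
    requests.put(base + "/user-1001",
                 json={"name": "Ada", "plan": "pro", "logins": 42})

    # Read it back over plain HTTP.
    doc = requests.get(base + "/user-1001").json()
    print(doc["name"], doc["plan"])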

Q4. Systems such as CouchDB, MongoDB, SimpleDB, Voldemort, Scalaris, etc. provide less functionality than OODBs and are little more than a distributed “object” cache over multiple machines. How do these new data stores compare with object-oriented databases?

Rick Cattell: It is true, OODBs provide features that NoSQL systems do not, like integration with OOPLs, and ACID transactions. On the other hand, OODBs do not provide the horizontal scalability. There is no “one size fits all” solution, just as OODBs and RDBMSs are good for different applications.

Q5. With the emergence of cloud computing, new data management systems have surfaced. What is in your opinion of the direction in which cloud computing data management is evolving? What are the main challenges of cloud computing data management?

Rick Cattell: There are a number of data management issues with cloud computing, in addition to the scaling issue I already discussed. For example, if you don’t know which servers your software is going to run on, you cannot tune your hardware (RAM, flash, disk, CPU) to your software, or vice versa.

Q6 What are cloud stores omitting that enable them to scale so well?

Rick Cattell: You haven’t defined “cloud stores”. I’m going to assume that you mean something similar to what we discussed earlier: new data stores that provide horizontal scaling. In which case, I answered that question earlier: they give up SQL and ACID.

Q7. Will cloud store projects end up with support for declarative queries and declarative secondary keys?

Rick Cattell: As I interpret this question, systems such as MongoDB already have this. Also, a SQL interpreter has been ported to BigTable, but the lower-level interface has proven to be more popular. The main scalability problem with declarative queries is when queries require operations like joins or transactions that span many servers: then you get killed by the node coordination and data movement.

Q8. In his post, titled “The “NoSQL” Discussion has Nothing to Do With SQL”, Prof. Stonebraker argues that “blinding performance depends on removing overhead. Such overhead has nothing to do with SQL, but instead revolves around traditional implementations of ACID transactions, multi-threading, and disk management.
To go wildly faster, one must remove all four sources of overhead, discussed above. This is possible in either a SQL context or some other context.” What is your opinion on this?

Rick Cattell: I agree with Stonebraker. There are actually two points here: one about performance (of each server) and one about scalability (of all the servers together). We already discussed the latter.
Stonebraker makes an important point about the former: with databases that fit mostly in RAM (on distributed servers), the DBMS architecture needs to change dramatically, otherwise 90% of your overhead goes into transaction coordination, locking, multi-threading latching, buffer management, and other operations that are “acceptable” in traditional DBMSs, where you spend your time waiting for disk. Stonebraker and I had an argument a year ago, and reached agreement on this as well as other issues on scalable DBMSs. We wrote a paper about our agreement, which will appear in CACM. It can be found on my website in the meantime.

Q9. Some progress has also been made on RDBMS scalability. For example, Oracle RAC and MySQL Cluster provide some partitioning of load over multiple nodes. More recently, there are new scalable variations of MySQL underway with ScaleDB and Drizzle, and VoltDB is expected to provide scalability on top of a more performant in-memory RDBMS with minimal overhead. Typically you cannot scale well if your SQL operations span many nodes. And you cannot scale well if your transactions span many nodes. Will RDBMSs provide scalability to 100 nodes or more? And if yes, how?

Rick Cattell: Yes, I believe so. MySQL Cluster is already close to doing so, and I believe that VoltDB and Clustrix will do so. The key to scalability with RDBMSs is to avoid SQL and transactions that span nodes, as you say. VoltDB demands that transactions be encapsulated as stored procedures, and allows some control over how tables are sharded over nodes. This allows transactions to be pre-compiled and pre-analyzed to execute on a single node, in general.
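
The single-partition idea can be sketched as simple hash routing: shard rows by a key and send each single-key transaction to the one node that owns it, so no cross-node coordination is needed. This is a conceptual illustration with invented names, not VoltDB code.

    NODES = ["node-a", "node-b", "node-c"]

    def owner(shard_key):
        # Pick the node responsible for a given sharding key.
        return NODES[hash(shard_key) % len(NODES)]

    def run_single_partition(shard_key, procedure, *args):
        node = owner(shard_key)
        # A real system would ship the pre-compiled stored procedure to `node`;
        # here we just report where the whole transaction would execute.
        return "%s%s executes entirely on %s" % (procedure, args, node)

    print(run_single_partition("customer-42", "DebitAccount", 42, 99.50))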

Q10. There is also xml DBs, which go beyond relational. Hybridization with relational turned out to be very useful. For example, DB2 has a huge investment in XML, and it is extensively published, and it has also succeeded commercially. Monet DB did substantial work in that area early on as well. How do they relate with “new data stores”?

Rick Cattell: With XML, we have yet another data model, like relational and object-oriented. XML data can be stored in a separate DBMS like Monet DB, or could be transformed to store in another DBMS, as with DB2. The focus of the new NoSQL data stores is generally not a new data model, but new scalability. In fact, they generally have quite simple data models. The “document stores” like MongoDB and CouchDB do allow nested objects, which might make them more amenable to storing XML. But in my experience, the new data stores are being used to store simpler data, like the key-value pairs required for user information on a web site.

Q11. Choosing a solution: Given this confusing array of alternatives, which data storage system do you choose for your application?

Rick Cattell: This is an even harder question to answer than the ones contrasting the DBMSs themselves, because each application has characteristics that might make you lean one way or another. I made an attempt at answering this in the paper I mentioned, but I only scratched the surface… I concluded that there is no “cookbook” answer to tell you which way to go.

Don White on “New and old Data stores”. http://www.odbms.org/blog/2010/12/don-white-on-new-and-old-data-stores/ http://www.odbms.org/blog/2010/12/don-white-on-new-and-old-data-stores/#comments Thu, 30 Dec 2010 14:29:07 +0000 http://www.odbms.org/blog/?p=478 “New and old Data stores” : This time I asked Don White a number of questions.
Don White is a senior development manager at Progress Software Inc., responsible for all feature development and engineering support for ObjectStore.

RVZ

Q1. Traditionally, the obvious platform for most database applications has been a relational DBMS. Why do we need new Data Stores?

Don White: Speaking from an OODB perspective, OODBs grew out of the recognition that the relational model is not the best fit for all application needs. OODBs continue to deliver their traditional value, which is transparency in handling and optimizing the movement of rich data models between storage and virtual memory. The emergence of common ORM solutions must have provided benefit for some RDB-based shops, where I presume they needed to use object-oriented programming for data that already fits well into an RDB. There is something important to understand: if you fail to leverage what your storage system is good at, then you are using the wrong tool or the wrong approach. The relational model wants to model with relations and wants to perform joins; an application’s data access pattern should expect to query the database the way the model wants to work. ORM mapping for an RDB that tries to query and build one object at a time will have really poor performance. If you try to query in bulk to avoid the costs of model transitions, then you likely have to live with compromises in less-than-optimal locking and/or fetch patterns. A project with model complexity that pursues OOA/OOD for a solution will find implementation easier with an OOP and will find storage of that data easier and more effective with an OODB. As for the newer data stores that are neither OODB nor RDB, they appear to be trying to fill a need for a storage solution that is less than a general database solution. Not trying to be everything to everybody allows for different implementation tradeoffs to be made.

Q2. There has been recently a proliferation of “new data stores”, such as “document stores”, and “nosql databases”: What are the differences between them?

Don White: This probably needs to be answered by the people involved with the products trying to distinguish themselves. I lump document stores in the NoSQL category. However, there do seem to be some common themes or subclass types within the new NoSQL stores: document stores and key-value stores. Each subclass seems to have a different way of declaring groups of related information and differences in exposing how to find and access stored information. In case it is not obvious, you can argue an OODB has some characteristics of the NoSQL stores, although any discussion will have to clearly define the scope of what is included in the NoSQL discussion.

Q3. How new data stores compare with respect to relational databases?

Don White: In general there seems to be recognition that relational-based technology has difficulty making tradeoffs around managing fluid schema requirements and optimizing access to related information.

Q4. Systems such as CouchDB, MongoDB, SimpleDB, Voldemort, Scalaris, etc. provide less functionality than OODBs and are little more than a distributed “object” cache over multiple machines. How do these new data stores compare with object-oriented databases?

Don White: These new data stores are not object oriented. Some might provide language bindings to object-oriented languages, but they are not preserving OOA/OOD as implemented in an OOP all the way through to the storage model. The new data systems are very data centric and are not trying to facilitate the melding of data and behavior. These new storage systems present specific model abstractions and provide their own specific storage structures. In some cases they offer schema flexibility, but it is basically used just to manage data and not for building sophisticated data structures with type-specific behavior. One way of keeping modeling abilities in perspective: you can use an OODB as a basis to build any other database system, NoSQL or even relational. The simple reason is that an OODB can store any structure a developer needs and/or can even imagine. A document store, name/value pair store, or RDB store each presents a particular abstraction for a data store, but under the hood there is an implementation to serve that abstraction. No matter what that implementation looks like for the store, it could be put into an OODB. Of course the key is determining if the data store abstraction presented works for your actual model and application space.

The problem with an OODB is that not everyone is looking to build a database of their own design; they prefer someone else to supply the storage abstraction and worry about the details to make the abstraction work. That is not to say the only way to interface with an OODB is a 3GL program, but the most effective way to use an OODB is when the storage model matches the needs of the in-memory model. That is a very leading statement because it really is forcing a particular question: why would you want to store data differently than how you intend to use it? I guess the simple answer is when you don’t know how you are going to use your data; but if you don’t know how you are going to use it, then why is any data store abstraction better than another? If you want to use an OO model and implementation, then you will find a non-OODB is a poor way of handling that situation.

To generalize, it appears the newer stores make different compromises in the management of data to suit their intended audience. In other words, they are not developing a general-purpose database solution, so they are willing to make tradeoffs that traditional database products would, should, or could not make. The new data stores do not provide traditional database query language support or even strict ACID transaction support. They do provide abstractions for data storage and processing that leverage the idiosyncrasies of their chosen implementation data structures and/or relaxations in the strictness of the transaction model to try to make gains in processing.

Q5. With the emergence of cloud computing, new data management systems have surfaced. What is your opinion of the direction in which cloud computing data management is evolving? What are the main challenges of cloud computing data management?

Don White: One challenge is just trying to understand what is really meant by cloud computing. In some form it is about leveraging computing resources or facilities available through a network. Those resources could be software or hardware; leveraging them requires nothing to be installed on the accessing device, only a network connection. The network is the virtual mainframe, and any device used to access the network is the virtual terminal endpoint. You have the same concerns in leveraging the computing power of a virtual mainframe as of a real local machine: how to optimize computing resources, how to share them among many users, and how to keep them running. You have an interesting upside with all the possible scalability, but with the power and flexibility come new levels of management complexity. You have to consider how algorithms for processing and handling data can be distributed and coordinated. When you involve more than one machine to do anything, you have to consider what happens when any node or connecting piece fails along the way.

Q6. What are cloud stores omitting that enables them to scale so well?

Don White: Strict serialized transaction processing, for one. I think you will find that the more complex a data model needs to be, the more need there is for strict serialized transactions. You can't expect to navigate relationships cleanly if you don't promise to keep all data strictly serialized.
The data and/or the storage abstractions used in the new models seem devoid of any sophisticated data processing and relationship modeling. What is being managed and distributed is simple data, where the algorithms and the data needing to be managed can be easily partitioned and dispersed, and the required processing is easily replicated with only basic coordination requirements. It is easy to imagine how to process queries that can be replicated in bulk across simple data stored in structures that are amenable to being split apart (see the sketch below).
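A minimal sketch of that kind of bulk, easily partitioned processing, in Python with made-up data: each shard answers the same simple query independently, and the coordinator only has to combine scalar partial results.

    # Made-up records; the point is the shape of the computation, not the data.
    records = [{"user": f"u{i}", "clicks": i % 7} for i in range(1000)]

    # A sharded store splits the data by hashing a key.
    shards = [[] for _ in range(4)]
    for r in records:
        shards[hash(r["user"]) % 4].append(r)

    # Each shard answers the same simple query independently...
    partials = [sum(r["clicks"] for r in shard) for shard in shards]

    # ...and the coordinator only has to combine scalar partial results.
    total_clicks = sum(partials)
    assert total_clicks == sum(r["clicks"] for r in records)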

Why are serialized transactions important? They ensure there is one view of history, which is necessary to maintain integrity among related data. Some systems try to pass off something less than serializable isolation as adequate for transaction processing; however, allowing updates to occur without the prerequisite read locks risks using data that is not correct. If you are using pointers rather than indirect references as part of your processing, the things you point to have to exist for the processing to run. Once you materialize a key/value-based relationship as a pointer, there has to be a commitment not only to the existence of the relationship (the thing pointed to) but also to the state of the data involved in the relationship that allows that existence to be valid.
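Here is a small sketch of that dangling-pointer risk, with the two transactions simulated as interleaved Python steps rather than a real database; the point is what the read locks required by serializable isolation would have prevented.

    class Node:
        def __init__(self, name):
            self.name = name
            self.next = None  # a materialized pointer, not an indirect key

    store = {"a": Node("a"), "b": Node("b")}
    store["a"].next = store["b"]

    # Transaction 1: reads the relationship, holding no read lock on "b".
    target = store["a"].next

    # Transaction 2: deletes "b" and commits; nothing blocked it.
    store["a"].next = None
    del store["b"]

    # Transaction 1: keeps using a pointer whose target is no longer part of the
    # committed database state. Python's garbage collector hides the damage here;
    # a page- or offset-based pointer in real storage would not be so forgiving.
    print(target.name)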

Q7. Will cloud store projects end up with support for declarative queries and declarative secondary keys?

Don White: Can't answer that. It will be a shame if these systems end up having to build many of the things that are already available in other database systems, which could have given them that for free.

Q8. In his post, titled “The “NoSQL” Discussion has Nothing to Do With SQL”, Prof. Stonebraker argues that “blinding performance depends on removing overhead. Such overhead has nothing to do with SQL, but instead revolves around traditional implementations of ACID transactions, multi-threading, and disk management. To go wildly faster, one must remove all four sources of overhead, discussed above. This is possible in either a SQL context or some other context.” What is your opinion on this?

Don White: I don't have any argument with the overheads identified; however, I would say I don't want to use SQL, a non-procedural way of getting to data, when I can solve my problem faster by navigating data structures specifically geared to my targeted problem (see the sketch below). I have seen customers put SQL interfaces on top of specialized models stored in an OODB. They use SQL through ODBC as a standard endpoint to get at the data, but the implementation model under the hood is one the customer built, and it performs queries faster than a traditional relational implementation could.
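As a sketch of navigational access against a model geared to the problem (a made-up bill-of-materials example, not any customer's actual model), the traversal simply follows in-memory references, where the relational equivalent would need a recursive self-join:

    class Part:
        def __init__(self, name, subparts=None):
            self.name = name
            self.subparts = subparts or []  # direct references, no foreign keys

    engine = Part("engine", [Part("piston"), Part("crankshaft", [Part("bearing")])])

    def all_parts(part):
        yield part.name
        for sub in part.subparts:
            yield from all_parts(sub)

    print(list(all_parts(engine)))
    # The relational equivalent would be a recursive self-join (e.g. a recursive
    # CTE over an edge table), evaluated join by join instead of pointer by pointer.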

Q9. Some progress has also been made on RDBMS scalability. For example, Oracle RAC and MySQL Cluster provide some partitioning of load over multiple nodes. More recently, there are new scalable variations of MySQL underway with ScaleDB and Drizzle, and VoltDB is expected to provide scalability on top of a more performant in-memory RDBMS with minimal overhead. Typically you cannot scale well if your SQL operations span many nodes. And you cannot scale well if your transactions span many nodes.
Will RDBMSs provide scalability to 100 nodes or more? And if yes, how?

Don White: Hmm, what is the barrier? What makes SQL hard to span nodes? I suppose one inherent problem is that an RDB is built around the relational model, which is based on joining relations. If processing is going to spread across many nodes, then where does the joining take place? So there are either possible single points of failure or some layer of complicated partitioning that has to be managed to figure out how to join data back together (a sketch of that partitioning step follows below).
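A minimal sketch of that partitioning problem, with made-up rows: before two tables can be joined across nodes, their rows have to be co-located by join key (a "shuffle"), and some layer has to own that partitioning logic.

    orders = [(1, "acme"), (2, "globex"), (3, "acme")]   # (order_id, customer)
    payments = [(1, 100.0), (3, 250.0)]                  # (order_id, amount)

    N_NODES = 3
    nodes = [{"orders": [], "payments": []} for _ in range(N_NODES)]

    # Shuffle phase: hash-partition both inputs on the join key so matching
    # rows land on the same node.
    for row in orders:
        nodes[hash(row[0]) % N_NODES]["orders"].append(row)
    for row in payments:
        nodes[hash(row[0]) % N_NODES]["payments"].append(row)

    # Local join phase: each node joins only what landed on it.
    joined = []
    for node in nodes:
        by_id = dict(node["orders"])
        joined += [(oid, by_id[oid], amt) for oid, amt in node["payments"] if oid in by_id]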

Q10. There are also XML DBs, which go beyond relational. Hybridization with relational turned out to be very useful. For example, DB2 has a huge investment in XML, and it is extensively published, and it has also succeeded commercially. MonetDB did substantial work in that area early on as well. How do they relate with “new data stores”?

Don White: I would think one thing that has to be addressed is how you store and process non-text information that is ultimately represented as text in XML. String-based models are a poor means of managing relationships and numeric information. There are also costs in trying to make sure the information is valid for the real data type you want it to be.
A product would have to decide how to handle types that are not meant to be textual. For example, you can't expect to accurately compare or restrict floating-point numbers that are represented as text, and storing numbers as text is certainly an inefficient storage model (a small illustration follows below). Most likely you would want to leverage parsed XML for your processing, so if the data is not stored in a parsed format then you will have to pay for parsing when moving the data to and from the storage model. XML can be used to store trees of information, but not all data is easily represented with XML. Common data modeling needs like graphs and non-containment relationships among data items would be a challenge. When evaluating any type of storage system, the evaluation should be based on the type of data model needed and how it will be used.
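As a small illustration of the numbers-as-text problem mentioned above, comparing numeric values in their textual form gives lexicographic answers, not numeric ones, so a store must either parse on every comparison or keep a typed representation:

    print(sorted(["9.5", "10.25", "2"]))   # ['10.25', '2', '9.5'] -- wrong order
    print(sorted([9.5, 10.25, 2]))         # [2, 9.5, 10.25]
    print("10" < "9", 10 < 9)              # True False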

Q11. Choosing a solution: Given this confusing array of alternatives, which data storage system do you choose for your application?

Don White: Make sure you choose a tool suited to the job at hand. I think the one thing we know is that the relational model has been used to solve lots of problems, but it is not the be-all and end-all of data storage solutions. Other data storage models can offer advantages in more than niche situations.
