ODBMS Industry Watch

Jan 18 11

Interview with Rick Cattell: There is no “one size fits all” solution.

by Roberto V. Zicari

I start in 2011 with an interview with Dr. Rick Cattell.
Rick is best known for his contributions to database systems and middleware — He was a founder of SQL Access (a predecessor to ODBC), the founder and chair of the Object Data Management Group (ODMG), and the co-creator of JDBC.

Rick has worked for over twenty years at Sun in management and senior technical roles, and for ten years in research at Xerox PARC and at Carnegie-Mellon University.

You can download the article by Rick Cattell: “Relational Databases, Object Databases, Key-Value Stores, Document Stores, and Extensible Record Stores: A Comparison.”

And look at recent posts on the same topic: “New and Old Data Stores”, and “NoSQL databases”.

RVZ

Q1. Traditionally, the obvious platform for most database applications has been a relational DBMS. Why do we need new Data Stores?

Rick Cattell: Basically, things changed with “Web 2.0” and with other applications where there were thousands or millions of users writing as well as reading a database. RDBMSs could not scale to this number of writers. Amazon (with Dynamo) and Google (with BigTable) were forced to develop their own scalable datastores. A host of others followed suit.

Q2. There has been recently a proliferation of “new data stores”, such as “document stores”, and “nosql databases”: What are the differences between them?

Rick Cattell: That’s a good question. The proliferation and differences are confusing, and I have no one-paragraph answer to this question. The systems differ in data model, consistency models, and many other dimensions. I wrote a couple papers and provide some references on my website, these may be helpful for more background. There I categorize several kinds of “NoSQL” data stores, according to data model: key-value stores, document stores, and extensible record stores. I also discuss scalable SQL stores.

Q3. How new data stores compare with respect to relational databases?

Rick Cattell: In a nutshell, NoSQL datastores give up SQL and they give up ACID transactions, in exchange for scalability. Scalability is achieved by partitioning and/or replicating the data over many servers. There are some other advantages, as well: for example, the new data stores generally do not demand a fixed data schema, and provide a simpler programming interface, e.g. a RESTful interface.

Q4. Systems such as CouchDB, MongoDB, SimpleDB, Voldemort, Scalaris, etc. provide less functionality than OODBs and are little more than a distributed “object” cache over multiple machines. How do these new data stores compare with object-oriented databases?

Rick Cattell: It is true, OODBs provide features that NoSQL systems do not, like integration with OOPLs, and ACID transactions. On the other hand, OODBs do not provide the horizontal scalability. There is no “one size fits all” solution, just as OODBs and RDBMSs are good for different applications.

Q5. With the emergence of cloud computing, new data management systems have surfaced. What is in your opinion of the direction in which cloud computing data management is evolving? What are the main challenges of cloud computing data management?

Rick Cattell: There are a number of data management issues with cloud computing, in addition to the scaling issue I already discussed. For example, if you don’t know which servers your software is going to run on, you cannot tune your hardware (RAM, flash, disk, CPU) to your software, or vice versa.

Q6 What are cloud stores omitting that enable them to scale so well?

Rick Cattell: You haven’t defined “cloud stores”. I’m going to assume that you mean something similar to what we discussed earlier: new data stores that provide horizontal scaling. In which case, I answered that question earlier: they give up SQL and ACID.

Q7. Will cloud store projects end up with support for declarative queries and declarative secondary keys?

Rick Cattell: As I interpret this question, systems such as MongoDB already have this. Also, a SQL interpreter has been ported to BigTable, but the lower-level interface has proven to be more popular. The main scalability problem with declarative queries is when queries require operations like joins or transactions that span many servers: then you get killed by the node coordination and data movement.

Q8. In his post, titled “The “NoSQL” Discussion has Nothing to Do With SQL”, Prof. Stonebraker argues that “blinding performance depends on removing overhead. Such overhead has nothing to do with SQL, but instead revolves around traditional implementations of ACID transactions, multi-threading, and disk management.
To go wildly faster, one must remove all four sources of overhead, discussed above. This is possible in either a SQL context or some other context.” What is your opinion on this?

Rick Cattell: I agree with Stonebraker. There are actually two points here: one about performance (of each server) and one about scalability (of all the servers together). We already discussed the latter.
Stonebraker makes an important point about the former: with databases that fit mostly in RAM (on distributed servers), the DBMS architecture needs to change dramatically, otherwise 90% of your overhead goes into transaction coordination, locking, multi-threading latching, buffer management, and other operations that are “acceptable” in traditional DBMSs, where you spend your time waiting for disk. Stonebraker and I had an argument a year ago, and reached agreement on this as well as other issues on scalable DBMSs. We wrote a paper about our agreement, which will appear in CACM. It can be found on my website in the meantime.

Q9. Some progress has also been made on RDBMS scalability. For example, Oracle RAC and MySQL Cluster provide some partitioning of load over multiple nodes. More recently, there are new scalable variations of MySQL underway with ScaleDB and Drizzle, and VoltDB is expected to provide scalability on top of a more performant inmemory RDBMS with minimal overhead. Typically you cannot scale well if your SQL operations span many nodes. And you cannot scale well if your transactions span many nodes. Will RDBMSs provide scalability to 100 nodes or more? And if yes, how?

Rick Cattell: Yes, I believe so. MySQL Cluster is already close to doing so, and I believe that VoltDB and Clustrix will do so. The key to scalability with RDBMSs is to avoid SQL and transactions that span nodes, as you say. VoltDB demands that transactions be encapsulated as stored procedures, and allows some control over how tables are sharded over nodes. This allows transactions to be pre-compiled and pre-analyzed to execute on a single node, in general.

Q10. There is also xml DBs, which go beyond relational. Hybridization with relational turned out to be very useful. For example, DB2 has a huge investment in XML, and it is extensively published, and it has also succeeded commercially. Monet DB did substantial work in that area early on as well. How do they relate with “new data stores”?

Rick Cattell: With XML, we have yet another data model, like relational and object-oriented. XML data can be stored in a separate DBMS like Monet DB, or could be transformed to store in another DBMS, as with DB2. The focus of the new NoSQL data stores is generally not a new data model, but new scalability. In fact, they generally have quite simple data models. The “document stores” like MongoDB and CouchDB do allow nested objects, which might make them more amenable to storing XML. But in my experience, the new data stores are being used to store simpler data, like key-value pairs required for user information on web site.

Q11. Choosing a solution: Given this confusing array of alternatives, which data storage system do you choose for your application?

Rick Cattell: This is an even harder question to answer than the ones contrasting the DBMSs themselves, because each application has characteristics that might make you lean one way or another. I made an attempt at answering this in the paper I mentioned, but I only scratched the surface… I concluded that there is no “cookbook” answer to tell you which way to go.

Dec 30 10

Don White on “New and old Data stores”.

by Roberto V. Zicari

“New and old Data stores” : This time I asked Don White a number of questions.
Don White is a senior development manager at Progress Software Inc., responsible for all feature development and engineering support for ObjectStore.

RVZ

Q1. Traditionally, the obvious platform for most database applications has been a relational DBMS. Why do we need new Data Stores?

Don White : Speaking from an OODB perspective, OODBs grew out of recognition that the relational model is not the best fit for all application needs. OODBs continue to deliver their traditional value which is transparency in handling and optimizing moving rich data models between storage and virtual memory. The emergence of common ORM solutions must have provide benefit for some RDB based shops, where I presume they had need to use object oriented programming for data that already fits well into an RDB. There is something important to understand, if you fail to leverage what your storage system is good at then you are using the wrong tool or the wrong approach. Relational model wants to model with relations, wants to perform joins, an application’s data access pattern should expect to query the database the way the model wants to work. ORM mapping for an RDB that tries to query and build one object at a time will have real poor performance. If you try to query in bulk to avoid costs of model transitions then you likely have to live with compromises in less than optimal locking and/or fetch patterns. A project with model complexity that pursues OOA/OOD to for a solution will find implementation easier with an OOP and will find storage of that data easier and more effective with an OODB. As for newer Data Stores that are neither OODB nor RDB, and they appear to be trying to fill a need that provides a storage solution that is less than general database solution. Not trying to be everything to everybody allows for different implementation tradeoffs to be made.

Q2. There has been recently a proliferation of “new data stores”, such as “document stores”, and “nosql databases”: What are the differences between them?

Don White: This probably needs to be answered by people involved with the products trying to distinguish themselves. I lump document stores in the NoSQL category. However there does seem to be some common themes or subclass types within new NoSQL stores document stores and key value stores. Each subclass seems to have a different way of declaring groups of related information and differences in exposing how to find and access stored information. In case it is not obvious you can argue an OODB has some characteristics of the NoSQL stores, although any discussion will have to clearly define the scope of what is included in the NoSQL discussion.

Q3. How new data stores compare with respect to relational databases?

Don White: In general there seems to be recognition that Relational based technology has difficulty making tradeoffs managing fluid schema requirements and how to optimize access to related information.

Don White: These new data stores are not Object Oriented. Some might provide language bindings to Object Oriented languages but they are not preserving OOA/OOD as implemented in an OOP all the way through to the storage model. The new data systems are very data centric and are not trying to facilitate the melding of data and behavior. These new storage systems present a specific model abstractions and provide their own specific storage structure. In some cases they offer schema flexibility, but it is basically used to just manage data and not for building sophisticated data structures with type specific behavior. One way of keeping modeling abilities in perspective, you can use an OODB as a basis to build any other database system, NoSQL or even relational. The simple reason is an OODB can store any structure a developer needs and/or can even imagine. A document store, name/value pair store or RDB store, all present a particular abstractions for a data store, but under the hood there is an implementation to serve that abstraction. No matter what that implementation looks like for the store it could be put into an OODB. Of course the key is determining if the data store abstraction presented works for your actual model and application space.

The problem with an OODB is not everyone is looking to build a database of their own design and they prefer someone else to supply the storage abstraction and worry about the details to make the abstraction work. Not
to say the only way to interface with an OODB is a 3GL program, but the most effective way to use an OODB is when the storage model matches the needs of the in-memory model. That is a very leading statement because it really is forcing a particular question, why would you want to store data differently than how you intend to use it? I guess the simple answer is when you don’t know how you are going to use your data, so if you don’t know how you are going to use it then why is any data store abstraction better than another? If you want to use an OO model and implementation then you will find a non OODB is a poor way of handling that situation.

To generalize it appears the newer stores make different compromises in the management of the data to suit their intended audience. In other words they are not developing a general purpose database solution so they are
willing to make tradeoffs that traditional database products would/should/could not make. The new data stores do not provide traditional database query language support or even strict ACID transaction support. They do provide an abstractions for data storage and processing capabilities that leverage the idiosyncrasies of their chosen implementation data structures and/or relaxations in strictness of the transaction model to try to make gains in processing.

Don White: One challenge is just trying to understand what is really meant by cloud computing. In some form it is how to leverage computing resources or facilities available through a network. Those resources could be software or hardware, leveraging those resources requires nothing to be installed on the accessing device, you only need a network connection. The network is the virtual mainframe and any device used to access the network is the virtual terminal endpoint. You have the same concerns of trying to leverage the computing power of a virtual mainframe as a real local machine, how to optimize computing resources, how to share them among many users and how to keep them running. You have interesting upside with all the possible scalability but with the power and flexibility comes new levels of management complexity. You have to consider how algorithms for processing and handling data can be distributed and coordinated. When you involve more than one machine to do anything then you have to consider what happens when any node or connecting piece fails along the way.

Q6 What are cloud stores omitting that enable them to scale so well?

Don White: Strict serialized transaction processing for one. I think you will find the more complex a data model needs to be, the more need there is for strict serialized transactions. You can’t expect to navigate relationships cleanly if you don’t promise to keep all data strictly serialized.
The data and/or the storage abstractions used in the new models seem devoid of any sophisticated data processing and relationship modeling. What is being managed and distribute is simple data, where algorithms and the data needing to be managed can be easily partitioned/dispersed and required processing is easily replicated with basic coordination requirements. It is easy to imagine how to process queries that can replicated in bulk across simple data stored in structures that are amenable to be split apart.

Why are serialized transactions important? It makes sure this is one view of history and is necessary to maintain integrity among related data. Some systems try to pass off something less than serializable isolation as
adequate for transaction processing, however allowing updates to occur without the prerequisite read locks risks trying to use data that is not correct. If you are using pointers rather than indirect references as part
of your processing, the things you point to have to exist to run. Once you materialize a key/value based relationship as a pointer then there has to be commitment to not only the existence of the relationship (thing pointed to) but also the state of the data involved in the relationship that allows the existence to be valid.

Q7. Will cloud store projects end up with support for declarative queries and declarative secondary keys?

Don White: Can’t answer that. It will be a shame if these systems end up having to build many things that are available in other database systems that could have given them that for free.

Q8. In his post, titled “The “NoSQL” Discussion has Nothing to Do With SQL”, Prof. Stonebraker argues that “blinding performance depends on removing overhead. Such overhead has nothing to do with SQL, but instead revolves around traditional implementations of ACID transactions, multi-threading, and disk management. To go wildly faster, one must remove all four sources of overhead, discussed above. This is possible in either a SQL context or some other context.” What is your opinion on this?

Don White: I don’t have any argument with the overheads identified, however I would say I don’t want to use SQL, a non-procedural way of getting to data, when I can solve my problem faster by using navigation of data structures specifically geared to solve my targeted problem. I have seen customers put SQL interfaces on top of specialized models stored in an OODB. They use SQL through ODBC as a standard endpoint to get at the data, but the implementation model under the hood is a model the customer implemented that performs queries faster than what a traditional relational implementation could do.

Q9. Some progress has also been made on RDBMS scalability. For example, Oracle RAC and MySQL Cluster provide some partitioning of load over multiple nodes. More recently, there are new scalable variations of MySQL underway with ScaleDB and Drizzle, and VoltDB is expected to provide scalability on top of a more performant inmemory RDBMS with minimal overhead. Typically you cannot scale well if your SQL operations span many nodes. And you cannot scale well if your transactions span many nodes.
Will RDBMSs provide scalability to 100 nodes or more? And if yes, how?

Don White: Hmm, what is the barrier? What makes SQL hard to span nodes? I suppose one inherit problem is an RDB is built around the relational model, which is based on joining relations. If processing is going to spread across many nodes then where does joining take place. So either there possibly single points of failure or some layer of complicated partitioning that has to be managed to figure out how to join data together.

Don White: I would think one thing that has to be addressed is how you store and process non text information that is ultimately represented as text in XML. String based models are a poor means to manage relationships and numeric information. There are also costs in trying to make sure the information is valid for the real data type you want it to be.
A product would have to decide on how to handle types that are not meant to be textual. For example you can’t expect to accurately compare/restrict floating point numbers that are represented as text, certainly storing numbers as text is an inefficient storage model. Most likely you would want to leverage parsed XML for your processing, so if the data is not stored in a parsed format then you will have to pay for parsing when moving the data to and from storage model. XML can be used to store trees of information, but not all data is easily represented with XML.
Common data modeling needs like graphs and non containment relationships among data items would be a challenge. When evaluating any type of storage system it should be based on the type of data model needed and how it will be used.

Q11. Choosing a solution: Given this confusing array of alternatives, which data storage system do you choose for your application?

Don White: Make sure you choose a tool for the job at hand. I think the one thing we know is the Relational Model has been used to solve lots of problems, but it has not the end all and be all of data storage solutions. Other data storage model can offer advantages for more than niche situations.

Dec 16 10

Watch the Video of the Keynote Panel “New and old Data stores”

by Roberto V. Zicari

A number of people asked me to make it easier to watch the video of the Keynote Panel “New and old Data stores”, held at ICOODB 2010 Frankfurt on September 29, 2010.
So rather then downloading the video, you can now watch it directly here!

The panel discussed the pros and cons of new data stores with respect to classical relational databases.

The panel of experts was composed of:
-Robert Greene, Chief Strategist Versant.
-Leon Guzenda, Chief Technology Officer Objectivity.
-Michael Keith, architect at Oracle.
-Patrick Linskey, Apache OpenJPA project.
-Peter Neubauer, COO NeoTechnology.
-Ulf Michael (Monty) Widenius, main author of the original version of the open source MySQL database.

Moderators were: Alan Dearle, University of St. Andrew, and myself.

The panelists engaged in lively discussions addressing a variety of interesting issues, such as: why the recent proliferation of “new data stores”, such as “document stores”, and “nosql databases”; their differences with classic relational databases, how object databases compare with NoSQL databases, scalability and consistency for huge amount of data…to name a few.

RVZ
Since the original video was rather large, I split it into two parts

Keynote Panel “New and old Data stores” PART I:

Keynote Panel “New and old Data stores” PART II:

Dec 2 10

Robert Greene on “New and Old Data stores”

by Roberto V. Zicari

I am back covering the topic “New and Old Data stores”.
I asked several questions to Robert Greene, CTO and V.P. Open Source Operations at Versant.

Q1. Traditionally, the obvious platform for most database applications has been a relational DBMS. Why do we need new Data Stores?

Robert Greene: Well, it’s a question of innovation in the face of need. When relational databases were invented, applications and their models were simpler, data was smaller, concurrent users were less. There was no internet, no wireless devices, no global information systems. In the mid 90’s, even Larry Ellison stated that complexly related information, at the time largely in niche application areas like CAD, did not fit well with the relational model. Now, complexity is pervasive in nearly all applications.

Further, the relational model is based on a runtime relationship execution engine, re- calculating relations based on primary-key, foreign-key data associations even though the vast majority of data relationships remain fixed once established. When data continues to grow at enormous rates, the approach of re-calculating the relations becomes impractical. Today even normal applications start to see data at sizes which in the past were only seen in data warehousing solutions, the first data management space which embraced a non-relational approach to data management.

So, in a generation when millions of users are accessing applications linked to near real-time analytic algorithms, at times operating over terabytes of data, innovation must occur to deal with these new realities.

Q2. There has been recently a proliferation of “new data stores”, such as “document stores”, and “nosql databases”: What are the differences between them?

Robert Greene:The answer to this could require a book, but let’s try to distil into the fundamentals.

I think the biggest difference is the programming model. There is some overlap, so you don’t see clear distinctions, but for each type: object database, distributed file systems, key-value stores, document stores and graph stores, the manner in which the user stores and retrieves data varies considerably. The OODB uses language integration, the distributed file systems use map-reduce, key-value stores use data keys, document stores use keys and query based on indexed meta data overlay, graph stores use a navigational expression language. I think it is important to point out that “store” is probably a more appropriate label than “database” for many of these technologies as most do not implement the classical ACID requirements defined for a database.

Beyond programming model, these technologies vary considerably in architecture, how they actually store data, retrieve it from disk, facilitate backup, recovery, reliability, replication, etc.

Q3. How new data stores compare with respect to relational databases?

Robert Greene: As described above, they have a very different programming model than the RDB. Though in some ways, they are all subsets of the RDB, but their specialization allows them to do what they do ( at times ) better than the RDB.

Most of them are utilizing an underlying architecture which I call, “the oldest scalability architecture of the relational database”. It’s the use of the key-value/blob architecture. The RDB has long suffered performance under scalability and historically many architects have gotten around those performance issues by removing the JOIN operation from the implementation. They manage identity from the application space and store information in either single tables and/or blobs of isolatable information. This comparison is obvious for key-value stores. However, you can also see this approach in the document store, which is storing its information as key-JSON objects. The keys to those documents ( JSON blob objects ) must be managed by user implemented layers in the application space. Try to implement a basic collection reference, you will find yourself writing lots of custom code. Of course, JSON objects also have meta data which can be extracted and indexed, allowing document stores to provide better ways at finding data, but the underlying architecture is key-value.

Robert Greene: They compare similarly in that they achieve better scalability than the RDB by utilizing identity management in the application layer similarly to the way done with the object database. However, the approach is significantly less opaque, because for those NoSQL stores, the management of the identity is not integrated into the language constructs and abstracted away from the user API as it is with the object database. Plus, there is a big difference in the delivery of the ACID properties of a database. The NoSQL databases are almost exclusively non-transactional unless you use them in only the narrowest of use cases.

Robert Greene: Unquestionably, the world is moving to a platform as a service computing model (PaaS). Databases will play a role in this transition in all forms. The challenges in delivering on data management technology which is effective in these “cloud” computing architectures turn out to be very similar to effectively delivering technology for the new n-core chip architectures. They are challenges related to distributed data management, whether it is across machines or across cores, splitting the problem into pieces and managing the distributed execution in the fact of concurrent updates. Then the often overlooked aspect in these discussions is the operational element. How to effectively develop, debug, manage and administer the production deployments of this technology within distributed computing environments.

Q6 What are cloud stores omitting that enable them to scale so well?

Robert Greene: I think architecture plays the biggest role in their ability to scale. It is that application identity managed approach to data retrieval, data distribution, semi-static data relations. These are things they actually have in common with object databases, which incidentally, you also find in some of the worlds largest, most demanding application domains. I think that is the biggest scalability story for those technologies. If you look past architecture then it comes down to some of the sacrifices made in the area of fully supporting the ACID requirements of a database. Taking the “eventually consistent” approach, this in some cases, makes a tremendous amount of sense if you can afford probabilistic results instead of determinism

Q7. Will cloud store projects end up with support for declarative queries and declarative secondary keys?

Robert Greene: I am sure you will see this as literally all database technologies which will remain relevant, will live in the cloud.

Q8. In his post, titled “The “NoSQL” Discussion has Nothing to Do With SQL”, Prof. Stonebraker argues that “blinding performance depends on removing overhead. Such overhead has nothing to do with SQL, but instead revolves around traditional implementations of ACID transactions, multi-threading, and disk management. To go wildly faster, one must remove all four sources of overhead, discussed above. This is possible in either a SQL context or some other context.” What is your opinion on this?

Robert Greene: I agree with the theory. Reality though does introduce some practical limitations during implementation. Technology is doing a remarkable job of removing those bottlenecks. For example, you can now get non-volatile memory appliances which are 5T in size effectively eliminating disk I/O as the what was historically the #1 bottleneck in database systems. Still, architecture will continue in the future to play the strongest role in performance and scalability. Relational databases and other implementations which need to runtime calculate relationships based on data values over growing volumes of data will remain performance challenged.

Q9. Some progress has also been made on RDBMS scalability. For example, Oracle RAC and MySQL Cluster provide some partitioning of load over multiple nodes. More recently, there are new scalable variations of MySQL underway with ScaleDB and Drizzle, and VoltDB is expected to provide scalability on top of a more performant inmemory RDBMS with minimal overhead. Typically you cannot scale well if your SQL operations span many nodes. And you cannot scale well if your transactions span many nodes.
Will RDBMSs provide scalability to 100 nodes or more? And if yes, how?

Robert Greene: Yes of course, they already do in vendors like Netezza, Greenplum, AsterData. The question is will they perform well in the face of those scalability requirements. This distinction between performance and scalability is often overlooked.

However, I think this notion that you cannot scale well if your transactions span many nodes is non-sense. It is a question of implementation. Just because a database has 100 nodes, does not mean that all transactions will operate on data within those 100 nodes. Transactions will naturally partition and span some percentage of nodes especially with regards to relevant data. Access in a multi-node system can be parallelized in all aspects of a transaction. Further, at a commit boundary, the overwhelming case is that the number of nodes involved where data is inserted, changed, deleted and/or logically dependent, is some small fraction of all the physical nodes in the system. Therefore, advanced 2-phase commit protocols can do interesting things like rolling back non active nodes and parallelizing protocol handshaking and using asynchronous I/O and handshaking to finalize the commit. Is it complicated, yes, but is it too complicated to work, not by a long shot.

Robert Greene: I really look at XML databases as large index engines. I have seen implementations of these which look very much like document stores. The main difference being that they are generally indexing everything. Where the document stores appear to be much more selective about the meta data exposed for indexing and query. Still, I think the challenge for XML db’s is the mismatch in it’s use within the programming paradigm. Developers think of XML as data interchange and transformation technology. It is not perceived as transactional data management and storage and developers don’t program in XML, so it feels clunky for them to figure out how to wrap it into their logical transactions. I suspect if feels a little less clunky if what you are dealing with are documents. Perhaps they should be considered the original document stores.

Q11. Choosing a solution: Given this confusing array of alternatives, which data storage system do you choose for your application?

Robert Greene: I choose the right tool for the job. This is again one of those questions which deserves several books. There is no 1 best solution for all applications, deciding factors can be complicated, but here is what I think about as major influencing factors. I look at it from the perspective of whether the application is data driven or model driven.

If it is model driven, I lean towards ODB or RDB.
If it is data driven, I lean towards NoSQL or RDB.

If the project is model driven and has a complex known model, ODB is a good choice because it handles the complexity well. If the project is model driven and has a simple known model, RDB is a good choice, because you should not be performance penalized if the model is simple, there are lots of available choices and people who know how to use the technology.

If the project is data driven and the data is small, RDB is good for the prior reasons. If the project is data driven and the data is huge, then NoSQL is a good choice because it takes a better architectural approach to huge data allowing the use of things like map reduce for parallel processing and/or application managed identity for better data distribution.

Of course, even within these categorizations you have ranges of value in different products. For example, MySQL and Oracle are both RDB, so which one to choose? Similarly, db4o and Versant are both ODB’s, so which one should you choose? So, I also look at the selection process from the perspective of 2 additional requirements: Data Volume, Concurrency. Within a given category, these will help narrow in on a good choice. For example, if you look at the company Oracle, you naturally consider that MySQL is less data scalable and less concurrent than the Oracle database, yet they are both RDB. Similarly, if you look at the company Versant, you would consider db4o to be less data scalable and less concurrent than the Versant database, yet they are both ODB.

Finally, I say you should test, evaluate any selection within the context of your major requirements. Get the core use cases mocked up and put your top choices to the test, it is the only way to be sure.

Nov 22 10

Why Patterns of Data Modeling?

by Roberto V. Zicari

I published another chapter of the new book on “Patterns of Data Modeling” of
Dr. Michael Blaha. All together you can now download three chapters of the book:
Tree Template, Models, and Universal Antipatterns.

At the same time, I asked Dr. Blaha a few questions.
At the end of the interview you`ll find some more opinions on this topic.

Q1. What are Patterns of Data Modeling?

Michael Blaha: Experienced data modelers don’t limit their thinking to primitive constructs. Rather they leverage what they have seen before. Patterns of data modeling are ways of cataloging past superstructures that are profound and likely to recur.
There are different aspects of data modeling patterns. There are models of common data structures (mathematical templates), models to be avoided (antipatterns), core concepts that transcend application domains (archetypes), and models of common services (canonical models). Modelers should avail themselves of the full pattern toolkit and not focus on one technique to the exclusion of others.
The literature covers abstract programming patterns that exist apart from application concepts. For example, the gang of four book — “Design Patterns: Elements of Reusable Object-Oriented Software” has excellent coverage of abstract programming patterns. There is no reason why databases should not have a comparable level of treatment. Until my recent book (“Patterns of Data Modeling“) the literature has lacked an abstract treatment of data modeling patterns.

Q2. Where and when are Patterns of Data Modeling useful?

Michael Blaha: All experienced modelers should use data modeling patterns. It is important to reuse ideas that have been tried and tested, rather than reinvent technology from scratch. I know that data modeling patterns are useful because this is the way that I think as I perform my work as an industrial consultant.
I use data modeling patterns for application data models, enterprise data models, data reverse engineering, and abstract conceptual thinking. Data modeling patterns are not a panacea to the troubles of development, but they are part of the solution. With patterns, developers can accelerate their thinking and reduce modeling errors.

Q3. Is there any difference in the applicability of Patterns of Data Modeling if the underlying Database System is a relational database as opposed to for example an Object Oriented or a NoSQL database?

Michael Blaha: No. That is the whole premise of software engineering — to quickly address the essential aspects of a problem and defer implementation details. A conceptual data model is focused on finding the important concepts for a problem, delineating scope, and determining the proper level of abstraction. All this deep, early thinking happens regardless of the eventual implementation target. Data modeling patterns mostly apply to the early stages of software development
Bill Premerlani and I took this approach in our 1998 book (“Object-Oriented Modeling and Design for Database Applications”). We presented detailed mapping rules for how to implement conceptual models with relational databases, an object-oriented database (ObjectStore) and flat files. Our 1991 book (“Object-Oriented Modeling and Design”) and its 2005 sequel explained how to map OO models to several programming languages.
So patterns of data modeling (as well as programming patterns and other kinds of patterns) apply regardless of the eventual downstream implementation.

Q4. What’s the difference between a pattern and a seed model?

Michael Blaha: A seed model is specific to a problem domain. It is a tangible piece that you can extend to build an entire application. Several authors (such as Hay, Fowler, and Silverston) have published excellent books with seed models. In constrast, a pattern is abstract and stands apart from any particular application domain. Patterns are at the same level of abstraction as UML classes, associations, and generalizations. A pattern is a composite building block. Seed models and abstract patterns are both valuable techniques. They are complimentary and are often used together.

Q5. What do you see as frontier areas of databases and data modeling?

Michael Blaha: I’m now working on a new topic — SOA and databases. SOA is an acronym for Service-Oriented Architecture, an approach for organizing business functionality into meaningful units of work. Instead of placing logic in application silos, SOA organizes functionality into services that transcend the various departments and fiefdoms of a business. A service is a meaningful unit of business processing. Services communicate by passing data back and forth. Such data is typically expressed in terms of XML. XML combines data with metadata that defines the data’s structure. A second language — XSD (XML Schema Definition) — is often used to specify valid XML data structure.
The promise of SOA is being held back by a lack of rigor with XSD files. Many developers focus on the design of individual services and pay little attention to how the services fit together and collectively evolve. Enterprise data modeling is the solution to this problem. A data model is essential for grasping the entirety of services and abstracting services properly. A data model also provides a guide for combining services in flexible ways.
I see evidence for a lack of data modeling in my consulting practice. I have studied several XSD standards and they all ignore data models. The literature in the area of SOA and data modeling is sparse. The current situation is untenable and SOA projects must pay more attention to data.
#

“Patterns of data modeling are very important. They enable data modeling efforts to be both effective and efficient. Working without patterns is like wandering around in the data wilderness trying to find your way.
SOA and Data. This is another vital area that must be addressed. I am doing it in my practice. It brings together data, metadata, metacards, data registries, data catalogs — and service. Very important for scalablility when the data network size grows (e.g., the government, nationwide health services, etc.).” — James Odell.

“I am mostly an object modeller, but I always recommend that my clients start with existing data model patterns rather than with a blank sheet of paper.
The data modelling patterns I most turn to are David C. Hay (Data Model Patterns: Conventions of Thought etc.).” — Jim Arlow.

“I agree with all that Dr. Blaha said advocating the use of patterns. This was very articulately worded, and I like to see those views spread around.
I also recognize that what he’s tried to do in this book is very different from what Len Silverston, Martin Fowler and I did.
It is true that we were focused on modeling the real world–“domains” as he described it. He, on the other hand has abstracted modeling to the point that he describes modeling itself–“tree” structures, undirected graphs, directed graphs, and so forth.
It is true that Dr. Blaha’s book is abstract in the extreme.
In fact, in my new book, Enterprise Model Patterns: Describing the World take on the issue of level of abstraction directly. In this, I am presenting a semantic model that I claim describes the entire enterprise, but on multiple levels of abstraction.
The first (Level 1) is a generic model that any company or government agency can take on as a starting point. It is generic because most attributes are actually captured as data in CHARACTERISTIC entities. (This corresponds to Dr. Blaha’s discussion of soft-coded values.) Thus, they become the problem of the user community, not the data modeler. The data modeler can address the true structures of the business. Yes, this model is organized in terms of five fundamental domains: people and organizations (who), geographic locations (where), physical assets (what), and activities and events (how). It also addresses time (when), but that’s a different kind of model. (This model is based on some 20+ years experience in the field, but I was inspired to write it from my experience over the last few years with the Federal Data Architecture Subcommittee. The committee hasn’t been very effective at creating patterns to distribute to Federal agencies, but it did inspire me to try to capture my views on the subject.)
I then address Level 0, which is a template for the first four categories above. (This is an enhanced version of the THING/THING TYPE model). In addition, at this level are two “meta” models: Document management and accounting. Each of these subject areas itself refers to the entire rest of the model.
At Level 2, I deal with functional specializations. These are more detailed than the level 1 models and make use of the entities in Level 1 combined in specific ways. These subject areas address such things as addresses (both physical addresses–“facilities”–and virtual addresses–telephone numbers, e-mail addresses, etc.), human resources, contracts, and the like. While they are more specialized than level 1, they are still generally applicable patterns. (And these areas address the “why” of the organization.)
At level 3, I address specific industries. For “vertical” models, I take the position that the Level 1/2 models address 80-90% of any company’s requirements. For each industry, however, there are a few special areas that need special attention. These are the things that make that industry unique. I took on a five of these, trying to get a cross-section from completely different worlds: criminal justice, microbiology, banking, oil production, and highway maintenance. If you don’t know anything about one of these industries, here is where you can learn something.
I agree that patterns are technology independent. I disagree that “object” models are technologically independent. That Dr. Blaha began with the gang of four book – “Design Patterns: Elements of Reusable Object-Oriented Software” tells something about his orientation. As it happens, in my latest book, I did (as my colleagues would say) “move over to the dark side”, and use UML as the notation, even though that notation is specifically oriented towards object-oriented design, not business modeling. I had to tweak some of the terms to break out of its object-oriented design history. These are conceptual, business-oriented models, not design models.
In doing this, I may have managed to offend both my data modeling colleagues (“You really have gone over to the dark side, haven’t you?”) and my UML colleagues (“What have you done to my UML?”). Or, perhaps, maybe have started building a bridge between the two groups? Only time will tell.” — Dave Hay.

Oct 28 10

Video “New and old Data stores”.

by Roberto V. Zicari

You can now freely download the Video of the Keynote Panel “New and old Data stores”, held at ICOODB 2010 Frankfurt on September 29, 2010.

Here is the LINK to download the video.

Since the original file was rather large, I split it into two separate files, each one takes about 6 minutes to download…

The panel discussed the pros and cons of new data stores with respect to classical relational databases.

The panel of experts was composed by:
Ulf Michael (Monty) Widenius, main author of the original version of the open source MySQL database.
Michael Keith, architect at Oracle.
Patrick Linskey, Apache OpenJPA project.
Robert Greene, Chief Strategist Versant.
Leon Guzenda, Chief Technology Officer Objectivity.
Peter Neubauer. COO NeoTechnology.

Moderators were: Alan Dearle, University of St. Andrew, and Roberto V. Zicari, Goethe University Frankfurt.

RVZ

Oct 26 10

Proceedings ICOODB 2010 Frankfurt.

by Roberto V. Zicari

The research papers (RESEARCH TRACK) of the ICOODB 2010 Frankfurt conference, have been published by Springer in their Lecture Notes in Computer Science. Here are the details:

Objects and Databases. Dearle, Alan; Zicari, Roberto V. (Eds.)
Proceedings Series: Lecture Notes in Computer Science, Vol. 6348. 1st Edition., 2010, XIV, 161 p., Softcover ISBN: 978-3-642-16091-2
Preface and Table of Contents| September 2010|.

This book constitutes the thoroughly refereed conference proceedings of the Third International Conference on Object Databases, ICOODB 2010, held in Frankfurt/Main, Germany in September 2010.

Most presentations in the Industry Track, Keynotes and Tutorials are available for free download at ODBMS.ORG.

I will very soon upload the video of the very interesting keynote panel “NEW AND OLD DATA STORES” …stay tuned.

RVZ

Oct 11 10

Presentations of ICOODB Frankfurt 2010.

by Roberto V. Zicari

I have published in ODBMS.ORG most of the industry presentations being given at the ICOODB Frankfurt 2010 conference.

Here are the relevant links:

TUTORIALS:
1. “Object Databases” (PDF 75 pages) by Michael Grossniklaus, Politecnico di Milano. |
2. “Patterns of Data Modeling”(PDF 49 pages) | , by Michael R. Blaha, Modelsoft Consulting Corp.
–>Download Link.

NoSQL Workshop:
1. “Approaches to Data Modeling in Non-Relational Systems Using Apache Cassandra”, by Gary Dusbabek, Rackspace
2. “Dinner in the sky with MongoDB.”, by Marc Boeker, ONchestra.
3. “Scale Out vs. Scale In- a face-off between Cassandra and Redis. ” by Tim Lossen, wooga.
4. “The Graph DB Landscape and SonesDB. “ by Achim Friedland, Sones.
5. “Neo4j for deep spatial and social intelligence. “ by Peter Neubauer, Neo Technology.
6. “Mastering Massive Data Volumes with Hypertable. “ by Doug Judd, Hypertable Inc..
—> Link to Download all presentations (.PDF).

ICOODB KEYNOTES and Industry Track Presentations:
1. “Efficient Development of Event-Driven Systems with Versant Object Database.” by Guenter Ressell-Herbert, Versant
2. “Accelerating Application Development with Objects. “ by Eric Falsken, German Viscuso, Roman Stoffel, db4objects.
3. “The Synergy Between the Object Database, Graph Database, Cloud Computing and NoSQL Paradigms. ” by Leon Guzenda, Objectivity.
4. “Unifying Remote Data, Remote Procedures and Web Services.” KEYNOTE by William Cook, University of Texas at Austin.
5. “Searching the Web of Objects” KEYNOTE by Ricardo Baeza-Yates, VP, Yahoo! Research, Europe and Latin America.
—> Download presentations Link.

5. “State of MariaDB” and “Dynamic Columns in MariaDB“, by Michael (Monty) Widenious, MariaDB.
—> Download link

A lot to read…

Around 200 Researchers from around the world attended the conference
You can see some photos here.

RVZ

Oct 1 10

Best Object Databases Lecture Notes for ETH Zurich!

by Roberto V. Zicari

The winners of the ODBMS.ORG “Best Object Databases Lecture Notes” Award 2010 are Dr. Michael Grossniklaus and Prof. Moira Norrie, ETH Zürich, Switzerland, for their Lecture Notes “Object-Oriented Databases”.

Second place for:
“Object Database Tutorial”
by Dr. Rick Cattell, Independent Consultant, USA.

Third place for:
“Modern Database Techniques”
by Prof. Martin Hulin, Hochschule Ravensburg-Weingarten, Germany.

The Award Ceremony was held on September 29, 2010, at the 3rd International Conference on Objects and Databases (ICOODB 2010) in Frankfurt.

The Awards recognize the most complete and up to date lecture notes on Object Databases, that have been, or have strong potential to be, instrumental to the teaching of theory and practice in the field of objects and databases. Any Lecture Notes published in ODBMS.ORG during the years 2004-2010 were eligible for the 2010 award.

“This is a very nice recognition to the award winners, and it also encourages others to contribute educational materials that others can use. Very good.” Prof. Alfonso Cardenas, Computer Science Department, UCLA.

Sep 27 10

Object Database Technologies and Data Management in the Cloud.

by Roberto V. Zicari

One of our expert, Dr. Michael Grossniklaus, has recently been awarded a grant by the Swiss National Science Foundation (SNF) for a fellowship as an advanced researcher in David Maier’s group at Portland State University. There, he will be investigating the use of object database technology for cloud data management.
I asked Michael to elaborate on his research plan and share it with our ODBMS.ORG community.

Q1. People from different fields have slightly different definitions of the term Cloud Computing. What is the common denominator of most of these definitions?

MG: Many of the differences stem from the fact that people use the term Cloud Computing both to denote a vision at the conceptual level and technologies at the implementation level. A nice collection of no less than twenty-one definitions can be found here.
In terms of vision, the common denominator of most definitions is to look at processing power, storage and software as commodities that are readily available from large infrastructures. As a consequence, cloud computing unifies elements of distributed, grid, utility and autonomic computing. The term elastic computing is also often used in this context to describe the ability of cloud computing to cope with bursts or spikes in the demand of resources on an on-demand basis. As for technologies, there is a consensus that cloud computing corresponds to a service-oriented stack that provides computing resources at different levels. Again, there are many variants of cloud computing stacks, but the trend seems to go towards three layers. At the lowest level, Infrastructure-as-a-Service (IaaS) offers resources such as processing power or storage as a service. One level above, Platform-as-a-Service (PaaS) provides development tools to build applications based on the service provider’s API. Finally, on the top-most level, Software-as-a-Service (SaaS) describes the model of deploying applications to clients on demand.

Q2. With the emergence of cloud computing, new data management systems have surfaced. Why?

MG: I see new data management systems such as NoSQL databases and MapReduce systems mainly as a reaction to the way in which cloud computing provides scalability. In cloud computing, more processing power typically translates to more (cheap, shared-nothing) computing nodes, rather than migrating or upgrading to better hardware. Therefore, cloud computing applications need to be parallelizable in order to scale. Both NoSQL and MapReduce advocate simplicity in terms of data models and data processing, in order to provide light-weight and fault-tolerant frameworks that support automatic parallelization and distribution.
In comparison to existing parallel and distributed (relational) databases however, many established data management concepts, such as data independence, declarative query languages, algebraic optimization and transactional data processing, are often omitted. As a consequence, more weight is put on the shoulders of application developers that now face new challenges and responsibilities. Acknowledging the fact that the initial vision was maybe too simple, there is already a trend of extending MapReduce systems with established data management concepts. Yahoo’s PigLatin and Microsoft’s Dryad have introduced a (near-) relational algebra and Facebook’s HIVE supports SQL, to name only a few examples. In this sense, cloud computing has triggered a “reboot” of data management systems by starting from a very simple paradigm and adding classical features back in, whenever they are required.

Q3. What is in your opinion the direction into which cloud computing data management is evolving? What are the main challenges of cloud computing data management?

MG: Data management in cloud computing will take place on a massively parallel and widely distributed scale. Based on these characteristics, several people have argued that cloud data management is more suitable for analytic rather than transactional data processing. Applications that need mostly read-only access to data and perform updates in batch mode are, therefore, expected to profit the most from cloud computing. At the same time, analytical data processing is gaining importance both in industry in terms of market shares and in academia through novel fields of application, such as computational science and e-science. Furthermore, from the classical data management concepts mentioned above, ACID transactions is the notable exception since, so far, nobody has proposed to extend MapReduce systems with transactional data processing. This might be another indication that cloud data management is evolving into the direction of analytical data processing.
At the time of answering these questions, I see three main challenges for data management in cloud computing: massively parallel and widely distributed data storage and processing, integration of novel data processing paradigms as well as the provision of service-based interfaces. The first challenge has been identified many times and is a direct consequence of the very nature of cloud computing. The second challenge is to build a comprehensive data processing platform by integrating novel paradigms with existing database technology. Often cited paradigms include data stream processing systems, service-based data processing or the above-mentioned NoSQL databases and MapReduce systems. Finally, the third challenge is to provide service-based interfaces for this new data processing platform in order to expose the platform itself as a service in the cloud, which is also referred to as “Database-as-a-Service” (DaaS) or “Cloud Data Services”.

Q4. What is the impact of cloud computing on data management research so far?

MG: Most of the challenges mentioned above are already being addressed in some way by the database research community. In particular, parallel and distributed data management is a well-established field of research, which has contributed many results that are strongly related to cloud data management. Research in this area investigates whether and how existing parallel and distributed databases can scale up to the level of parallelism and distribution that is characteristic of cloud computing. While this approach is more “top down”, there is also the “bottom up” approach of starting with an already highly parallel and widely distributed system and extending it with classical database functionality. This second approach has led to the extended MapReduce systems that were mentioned before. While these extended approaches already partially address the second challenge of cloud data management—integrating of novel data processing paradigms—there are also research results that take this integration even further such as HadoopDB and Clustera. The third challenge is being addressed as part of the research on programmability of cloud data services in terms of languages, interfaces and development models.
The impact of cloud computing on data management research is also visible in recent call for papers of both established and emerging workshops and conferences. Furthermore, there are several additional initiatives dedicated to support cloud data management research. For example, the MSR Summer Institute 2010 held at the University of Washington brought together a number of database researcher to discuss the current challenges and opportunities of cloud data services.

Q5. In your opinion, is there a relationship between cloud computing and object database technologies? If yes, please explain.

MG: Yes, there are multiple connections between cloud data management and object database technology which relate to all of the previously mentioned challenges. According to a recent article in Information Week , businesses are likely to split their data management into (transactional) in-house and (analytical) cloud data processing. This requirement corresponds to the first challenge of supporting highly parallel and widely distributed data processing. In this setting, objects and relationships could prove to be a valuable abstraction to bridge the gap between the two partitions.
Introducing the concept of objects in cloud data management systems also makes sense from the perspective of addressing the second challenge of integrating different data processing paradigms. One advantage of MapReduce is that it can cast base data into different implicit models. The associated disadvantage is that the data model is constructed on the fly and, thus, type checking is only possible to a limited extent. To support typing of MapReduce queries, the same base data instances could be exposed using different object wrappers. Microsoft has recently proposed “Orleans”, a next-generation programming model for cloud computing that features a higher level of abstraction than MapReduce. In order to integrate different processing paradigms, Orleans introduces the notion of “grains” that serve as a unit of computation and data storage.
Finally, object database technologies can also contribute to addressing the third challenge, i.e. providing service-based interfaces for cloud data management. Since object data models and service-oriented interfaces are closely related, it makes a lot of sense to consider object database technology, rather than introducing additional mapping layers. The concept of orthogonal persistence, that is an essential feature of most recent object databases, is particularly relevant in this context. In their ICOODB 2009 paper, Dearle et al. have suggested that orthogonal persistence could be extended in order to simplify the development of cloud applications. Instead of only abstracting from the storage hierarchy, this extended orthogonal persistence would also abstract from replication and physical location, giving transparent access to distributed objects. Even though Orleans is built on top of the Windows Azure Platform that provides a relational database (SQL Azure), the vision of grains is to support transparent replication, consistency and persistence.

Q6. Do you know of any application domains where object database technologies are already used in the Cloud?

MG: From the major object database vendors, I am only aware of Objectivity that has a version of their product that is ready to be deployed on cloud infrastructures such as Amazon EC2 and GoGrid. However, I have not yet seen any concrete case study showing how their clients are using this product. This being said, it might be interesting to point out, that many of the applications that are currently deployed using object databases are very close to the envisioned use case of cloud data management. For example, Objectivity has been applied in Space Situational Awareness Foundational Enterprise (SSAFE) system and in several data-intensive science applications, for example at the Stanford Linear Accelerator Center (SLAC). Similarly, the European Space Agency (ESA) has chosen Versant to gather and analyze the data transmitted by the Herschel telescope. All of these applications deal with large or even huge amounts of data and require analytical data processing in the sense that was described before.

Q7. What issues would you recommend as a researcher to tackle to go beyond the current state of the art in cloud computing data management?

MG: There is ample opportunity to tackle interesting and important issues along the lines of all three challenges mentioned before. However, if we abstract even more, there are two general research areas that will need to be tackled in order to deliver the vision of cloud data management.
The first area addresses research questions “under the hood”, for example: How can existing parallel and distributed databases scale up to the level of cloud computing? What traditional database functionality is required in the context of cloud data management and how can it be supported? How can traditional databases be combined with other data processing paradigms such as MapReduce or data stream processing? What architectures will lead to fast and scalable data processing systems? The second important area is how cloud data services are provided to clients and, thus, the following research questions are situated “on the hood”: What interfaces should be offered by cloud data services? Do we still need declarative query languages or is a procedural interface the way to go? Is there even a need for entirely new programming models? Can cloud computing be made independent of or orthogonal to the development of the application business logic? How are cloud data management applications tested, deployed and debugged? Are existing database benchmarks sufficient to evaluate cloud data services or do we need new ones?
Of course, these lists of research questions are not exhaustive and merely highlight some of the challenges. Nevertheless, I believe that in answering these questions, one should always keep an eye on recent and also not-so-recent contributions from object databases. As outlined above, many developments in cloud data services have introduced some kind of object notion and, therefore, contributions from object databases can serve two purposes. On the hand, technologies such as orthogonal persistence can serve as valuable starting points and inspiration for novel developments. On the other hand, we should also learn from previous approaches in order not to reinvent the wheel and not to repeat some of the mistakes that were made before.

Acknowledgement
Michael Grossniklaus would like to thank Moira C. Norrie, David Maier, Bill Howe and Alan Dearle for interesting discussions on this topic and the valuable exchange of ideas.

Michael Grossniklaus
Michael received his doctorate in computer science from ETH Zurich in 2007. His PhD thesis examined how object data models can be extended with versioning to support context-aware data management. In addition to conducting research, Michael has been involved in several courses as a lecturer. Together with Moira C. Norrie, he developed a course on object databases for advanced students which he taught for several years. Currently, Michael is a senior researcher at the Politecnico di Milano, where he both contributes to the “Search Computing” project and works on reasoning over data streams. He has recently been awarded a grant by the Swiss National Science Foundation (SNF) for a fellowship as an advanced researcher in David Maier’s group at Portland State University, where he will be investigating the use of object database technology for cloud data management.
#

ODBMS Industry Watch

Interview with Rick Cattell: There is no “one size fits all” solution.

Don White on “New and old Data stores”.

Watch the Video of the Keynote Panel “New and old Data stores”

Why Patterns of Data Modeling?

Video “New and old Data stores”.

Proceedings ICOODB 2010 Frankfurt.

Presentations of ICOODB Frankfurt 2010.

Best Object Databases Lecture Notes for ETH Zurich!

About the author

Archives

Meta

About

Flickr

Search

About the author

Tags

Archives

Meta

About

Flickr

Search