A number of people asked me to make it easier to watch the video of the Keynote Panel “New and old Data stores”, held at ICOODB 2010 Frankfurt on September 29, 2010.
So rather than downloading the video, you can now watch it directly here!
The panel discussed the pros and cons of new data stores with respect to classical relational databases.
The panel of experts was composed of:
- Robert Greene, Chief Strategist, Versant.
- Leon Guzenda, Chief Technology Officer, Objectivity.
- Michael Keith, Architect, Oracle.
- Patrick Linskey, Apache OpenJPA project.
- Peter Neubauer, COO, NeoTechnology.
- Ulf Michael (Monty) Widenius, main author of the original version of the open source MySQL database.
Moderators were: Alan Dearle, University of St Andrews, and myself.
The panelists engaged in lively discussions addressing a variety of interesting issues: why the recent proliferation of “new data stores”, such as “document stores” and “NoSQL databases”; how they differ from classic relational databases; how object databases compare with NoSQL databases; and scalability and consistency for huge amounts of data, to name a few.
RVZ
Since the original video was rather large, I split it into two parts.
Keynote Panel “New and old Data stores” PART I:
Keynote Panel “New and old Data stores” PART II:
I am back covering the topic “New and Old Data stores”.
I asked Robert Greene, CTO and V.P. of Open Source Operations at Versant, several questions.
Q1. Traditionally, the obvious platform for most database applications has been a relational DBMS. Why do we need new Data Stores?
Robert Greene: Well, it’s a question of innovation in the face of need. When relational databases were invented, applications and their models were simpler, data was smaller, and concurrent users were fewer. There was no internet, no wireless devices, no global information systems. In the mid-90s, even Larry Ellison stated that complexly related information, at the time largely in niche application areas like CAD, did not fit well with the relational model. Now, complexity is pervasive in nearly all applications.
Further, the relational model is based on a runtime relationship execution engine, recalculating relations based on primary-key/foreign-key data associations even though the vast majority of data relationships remain fixed once established. When data continues to grow at enormous rates, the approach of recalculating the relations becomes impractical. Today even normal applications start to see data at sizes which in the past were only seen in data warehousing solutions, the first data management space which embraced a non-relational approach to data management.
So, in a generation when millions of users are accessing applications linked to near real-time analytic algorithms, at times operating over terabytes of data, innovation must occur to deal with these new realities.
Q2. There has been recently a proliferation of “new data stores”, such as “document stores”, and “nosql databases”: What are the differences between them?
Robert Greene: The answer to this could require a book, but let’s try to distill it into the fundamentals.
I think the biggest difference is the programming model. There is some overlap, so you don’t see clear distinctions, but for each type: object database, distributed file system, key-value store, document store and graph store, the manner in which the user stores and retrieves data varies considerably. The OODB uses language integration, distributed file systems use map-reduce, key-value stores use data keys, document stores use keys and query based on an indexed metadata overlay, and graph stores use a navigational expression language. I think it is important to point out that “store” is probably a more appropriate label than “database” for many of these technologies, as most do not implement the classical ACID requirements defined for a database.
Beyond the programming model, these technologies vary considerably in architecture: how they actually store data, retrieve it from disk, and facilitate backup, recovery, reliability, replication, etc.
Q3. How do new data stores compare with relational databases?
Robert Greene: As described above, they have a very different programming model than the RDB. In some ways they are all subsets of the RDB, but their specialization allows them to do what they do (at times) better than the RDB.
Most of them are utilizing an underlying architecture which I call “the oldest scalability architecture of the relational database”: the key-value/blob architecture. The RDB has long suffered performance problems under scalability, and historically many architects have gotten around those issues by removing the JOIN operation from the implementation. They manage identity from the application space and store information in either single tables and/or blobs of isolatable information. This comparison is obvious for key-value stores. However, you can also see this approach in the document store, which stores its information as key-JSON objects. The keys to those documents (JSON blob objects) must be managed by user-implemented layers in the application space. Try to implement a basic collection reference and you will find yourself writing lots of custom code. Of course, JSON objects also have metadata which can be extracted and indexed, allowing document stores to provide better ways of finding data, but the underlying architecture is key-value.
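To make the point about application-managed identity concrete, here is a minimal sketch (a plain Python dict stands in for any key-value store; the `save_order`/`load_order_items` helpers are hypothetical) of what maintaining even a simple collection reference by hand looks like:

```python
import json
import uuid

# A stand-in for any key-value store: the API is just get/put on opaque keys.
store = {}

def put(key, obj):
    store[key] = json.dumps(obj)

def get(key):
    return json.loads(store[key])

def save_order(customer, items):
    # The application must mint the keys, embed them in the parent
    # document, and keep them in sync itself -- the store won't help.
    item_keys = []
    for item in items:
        k = f"item:{uuid.uuid4()}"
        put(k, item)
        item_keys.append(k)
    order_key = f"order:{uuid.uuid4()}"
    put(order_key, {"customer": customer, "item_keys": item_keys})
    return order_key

def load_order_items(order_key):
    # Traversing the "collection reference" means dereferencing
    # each stored key by hand.
    order = get(order_key)
    return [get(k) for k in order["item_keys"]]

key = save_order("ACME", [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}])
print(len(load_order_items(key)))  # 2
```

Everything an RDB foreign key or an OODB object reference would do transparently (identity, referential bookkeeping, traversal) ends up as custom application code.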
Q4. Systems such as CouchDB, MongoDB, SimpleDB, Voldemort, Scalaris, etc. provide less functionality than OODBs and are little more than a distributed “object” cache over multiple machines. How do these new data stores compare with object-oriented databases?
Robert Greene: They compare similarly in that they achieve better scalability than the RDB by utilizing identity management in the application layer, similarly to the way it is done with the object database. However, the approach is significantly less transparent, because for those NoSQL stores the management of identity is not integrated into the language constructs and abstracted away from the user API as it is with the object database. Plus, there is a big difference in the delivery of the ACID properties of a database. The NoSQL databases are almost exclusively non-transactional unless you use them in only the narrowest of use cases.
Q5. With the emergence of cloud computing, new data management systems have surfaced. What is your opinion of the direction in which cloud computing data management is evolving? What are the main challenges of cloud computing data management?
Robert Greene: Unquestionably, the world is moving to a platform-as-a-service (PaaS) computing model. Databases will play a role in this transition in all forms. The challenges in delivering data management technology which is effective in these “cloud” computing architectures turn out to be very similar to those of effectively delivering technology for the new n-core chip architectures. They are challenges related to distributed data management, whether it is across machines or across cores: splitting the problem into pieces and managing the distributed execution in the face of concurrent updates. Then there is the often overlooked operational element: how to effectively develop, debug, manage and administer the production deployments of this technology within distributed computing environments.
Q6. What are cloud stores omitting that enables them to scale so well?
Robert Greene: I think architecture plays the biggest role in their ability to scale. It is the application-identity-managed approach to data retrieval, data distribution, and semi-static data relations. These are things they actually have in common with object databases, which, incidentally, you also find in some of the world’s largest, most demanding application domains. I think that is the biggest scalability story for those technologies. If you look past architecture, then it comes down to some of the sacrifices made in the area of fully supporting the ACID requirements of a database. Taking the “eventually consistent” approach in some cases makes a tremendous amount of sense, if you can afford probabilistic results instead of determinism.
Q7. Will cloud store projects end up with support for declarative queries and declarative secondary keys?
Robert Greene: I am sure you will see this, as virtually all database technologies which remain relevant will live in the cloud.
Q8. In his post, titled “The “NoSQL” Discussion has Nothing to Do With SQL”, Prof. Stonebraker argues that “blinding performance depends on removing overhead. Such overhead has nothing to do with SQL, but instead revolves around traditional implementations of ACID transactions, multi-threading, and disk management. To go wildly faster, one must remove all four sources of overhead, discussed above. This is possible in either a SQL context or some other context.” What is your opinion on this?
Robert Greene: I agree with the theory. Reality, though, does introduce some practical limitations during implementation. Technology is doing a remarkable job of removing those bottlenecks. For example, you can now get non-volatile memory appliances which are 5 TB in size, effectively eliminating disk I/O as what was historically the #1 bottleneck in database systems. Still, architecture will continue to play the strongest role in performance and scalability. Relational databases and other implementations which need to calculate relationships at runtime based on data values over growing volumes of data will remain performance challenged.
Q9. Some progress has also been made on RDBMS scalability. For example, Oracle RAC and MySQL Cluster provide some partitioning of load over multiple nodes. More recently, there are new scalable variations of MySQL underway with ScaleDB and Drizzle, and VoltDB is expected to provide scalability on top of a more performant in-memory RDBMS with minimal overhead. Typically you cannot scale well if your SQL operations span many nodes. And you cannot scale well if your transactions span many nodes.
Will RDBMSs provide scalability to 100 nodes or more? And if yes, how?
Robert Greene: Yes, of course; they already do in products from vendors like Netezza, Greenplum and AsterData. The question is whether they will perform well in the face of those scalability requirements. This distinction between performance and scalability is often overlooked.
However, I think this notion that you cannot scale well if your transactions span many nodes is nonsense. It is a question of implementation. Just because a database has 100 nodes does not mean that all transactions will operate on data within those 100 nodes. Transactions will naturally partition and span some percentage of nodes, especially with regard to relevant data. Access in a multi-node system can be parallelized in all aspects of a transaction. Further, at a commit boundary, in the overwhelming case the number of nodes where data is inserted, changed, deleted and/or logically dependent is some small fraction of all the physical nodes in the system. Therefore, advanced two-phase commit protocols can do interesting things like rolling back non-active nodes, parallelizing protocol handshaking, and using asynchronous I/O and handshaking to finalize the commit. Is it complicated? Yes. But is it too complicated to work? Not by a long shot.
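The key observation above, that commit cost scales with the nodes a transaction actually touched, not with the cluster size, can be sketched with a toy two-phase commit coordinator (a simplification for illustration only; real protocols add logging, timeouts and recovery):

```python
# Toy two-phase commit. Nodes the transaction never touched are excluded
# from the protocol entirely, so a 100-node cluster does not mean a
# 100-participant commit.

class Node:
    def __init__(self, name):
        self.name = name
        self.committed = []

    def prepare(self, txn):
        # A real node would force a redo log record and hold locks here.
        return True

    def commit(self, txn):
        self.committed.append(txn)

    def abort(self, txn):
        pass

def two_phase_commit(all_nodes, touched, txn):
    participants = [n for n in all_nodes if n.name in touched]
    # Phase 1: every participant votes on the outcome.
    if all(n.prepare(txn) for n in participants):
        # Phase 2: in practice these calls are issued in parallel with
        # asynchronous I/O; sequential here for clarity.
        for n in participants:
            n.commit(txn)
        return True
    for n in participants:
        n.abort(txn)
    return False

cluster = [Node(f"n{i}") for i in range(100)]
ok = two_phase_commit(cluster, {"n3", "n42"}, "t1")
print(ok, sum(len(n.committed) for n in cluster))  # True 2
```

Only two of the hundred nodes ever participate in the handshake, which is exactly the "small fraction of all the physical nodes" case described above.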
Q10. There are also XML DBs, which go beyond relational. Hybridization with relational turned out to be very useful. For example, DB2 has a huge investment in XML, it is extensively published, and it has also succeeded commercially. MonetDB did substantial work in that area early on as well. How do they relate to “new data stores”?
Robert Greene: I really look at XML databases as large index engines. I have seen implementations of these which look very much like document stores, the main difference being that they are generally indexing everything, whereas the document stores appear to be much more selective about the metadata exposed for indexing and query. Still, I think the challenge for XML DBs is the mismatch in their use within the programming paradigm. Developers think of XML as data interchange and transformation technology. It is not perceived as transactional data management and storage, and developers don’t program in XML, so it feels clunky for them to figure out how to wrap it into their logical transactions. I suspect it feels a little less clunky if what you are dealing with are documents. Perhaps XML databases should be considered the original document stores.
Q11. Choosing a solution: Given this confusing array of alternatives, which data storage system do you choose for your application?
Robert Greene: I choose the right tool for the job. This is again one of those questions which deserves several books. There is no one best solution for all applications, and the deciding factors can be complicated, but here is what I think about as the major influencing factors. I look at it from the perspective of whether the application is data driven or model driven.
If it is model driven, I lean towards ODB or RDB.
If it is data driven, I lean towards NoSQL or RDB.
If the project is model driven and has a complex known model, ODB is a good choice because it handles the complexity well. If the project is model driven and has a simple known model, RDB is a good choice: you should not be performance penalized if the model is simple, and there are lots of available choices and people who know how to use the technology.
If the project is data driven and the data is small, RDB is good for the prior reasons. If the project is data driven and the data is huge, then NoSQL is a good choice because it takes a better architectural approach to huge data allowing the use of things like map reduce for parallel processing and/or application managed identity for better data distribution.
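The rule of thumb above can be written down as a tiny decision function (a sketch of the heuristic only; the function name and the coarse "complex"/"huge" categories are my own shorthand, not anything Versant ships):

```python
def suggest_store(driven_by, model_complexity=None, data_size=None):
    """Encode the model-driven vs. data-driven rule of thumb above."""
    if driven_by == "model":
        # Complex known model -> ODB; simple known model -> RDB.
        return "ODB" if model_complexity == "complex" else "RDB"
    if driven_by == "data":
        # Huge data -> NoSQL (map-reduce, managed identity); small -> RDB.
        return "NoSQL" if data_size == "huge" else "RDB"
    raise ValueError("driven_by must be 'model' or 'data'")

print(suggest_store("model", model_complexity="complex"))  # ODB
print(suggest_store("data", data_size="small"))            # RDB
```

As the interview goes on to note, this only narrows the category; data volume and concurrency requirements then narrow the choice within it.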
Of course, even within these categorizations you have ranges of value in different products. For example, MySQL and Oracle are both RDBs, so which one to choose? Similarly, db4o and Versant are both ODBs, so which one should you choose? So, I also look at the selection process from the perspective of two additional requirements: data volume and concurrency. Within a given category, these will help narrow in on a good choice. For example, if you look at the company Oracle, you naturally consider that MySQL is less data scalable and less concurrent than the Oracle database, yet they are both RDBs. Similarly, if you look at the company Versant, you would consider db4o to be less data scalable and less concurrent than the Versant database, yet they are both ODBs.
Finally, I say you should test and evaluate any selection within the context of your major requirements. Get the core use cases mocked up and put your top choices to the test; it is the only way to be sure.
Why Patterns of Data Modeling?
I published another chapter of the new book “Patterns of Data Modeling” by Dr. Michael Blaha. Altogether you can now download three chapters of the book:
Tree Template, Models, and Universal Antipatterns.
At the same time, I asked Dr. Blaha a few questions.
At the end of the interview you’ll find some more opinions on this topic.
Q1. What are Patterns of Data Modeling?
Michael Blaha: Experienced data modelers don’t limit their thinking to primitive constructs. Rather they leverage what they have seen before. Patterns of data modeling are ways of cataloging past superstructures that are profound and likely to recur.
There are different aspects of data modeling patterns. There are models of common data structures (mathematical templates), models to be avoided (antipatterns), core concepts that transcend application domains (archetypes), and models of common services (canonical models). Modelers should avail themselves of the full pattern toolkit and not focus on one technique to the exclusion of others.
The literature covers abstract programming patterns that exist apart from application concepts. For example, the gang of four book — “Design Patterns: Elements of Reusable Object-Oriented Software” has excellent coverage of abstract programming patterns. There is no reason why databases should not have a comparable level of treatment. Until my recent book (“Patterns of Data Modeling“) the literature has lacked an abstract treatment of data modeling patterns.
Q2. Where and when are Patterns of Data Modeling useful?
Michael Blaha: All experienced modelers should use data modeling patterns. It is important to reuse ideas that have been tried and tested, rather than reinvent technology from scratch. I know that data modeling patterns are useful because this is the way that I think as I perform my work as an industrial consultant.
I use data modeling patterns for application data models, enterprise data models, data reverse engineering, and abstract conceptual thinking. Data modeling patterns are not a panacea to the troubles of development, but they are part of the solution. With patterns, developers can accelerate their thinking and reduce modeling errors.
Q3. Is there any difference in the applicability of Patterns of Data Modeling if the underlying Database System is a relational database as opposed to for example an Object Oriented or a NoSQL database?
Michael Blaha: No. That is the whole premise of software engineering — to quickly address the essential aspects of a problem and defer implementation details. A conceptual data model is focused on finding the important concepts for a problem, delineating scope, and determining the proper level of abstraction. All this deep, early thinking happens regardless of the eventual implementation target. Data modeling patterns mostly apply to the early stages of software development.
Bill Premerlani and I took this approach in our 1998 book (“Object-Oriented Modeling and Design for Database Applications”). We presented detailed mapping rules for how to implement conceptual models with relational databases, an object-oriented database (ObjectStore) and flat files. Our 1991 book (“Object-Oriented Modeling and Design”) and its 2005 sequel explained how to map OO models to several programming languages.
So patterns of data modeling (as well as programming patterns and other kinds of patterns) apply regardless of the eventual downstream implementation.
Q4. What’s the difference between a pattern and a seed model?
Michael Blaha: A seed model is specific to a problem domain. It is a tangible piece that you can extend to build an entire application. Several authors (such as Hay, Fowler, and Silverston) have published excellent books with seed models. In contrast, a pattern is abstract and stands apart from any particular application domain. Patterns are at the same level of abstraction as UML classes, associations, and generalizations. A pattern is a composite building block. Seed models and abstract patterns are both valuable techniques. They are complementary and are often used together.
Q5. What do you see as frontier areas of databases and data modeling?
Michael Blaha: I’m now working on a new topic — SOA and databases. SOA is an acronym for Service-Oriented Architecture, an approach for organizing business functionality into meaningful units of work. Instead of placing logic in application silos, SOA organizes functionality into services that transcend the various departments and fiefdoms of a business. A service is a meaningful unit of business processing. Services communicate by passing data back and forth. Such data is typically expressed in terms of XML. XML combines data with metadata that defines the data’s structure. A second language — XSD (XML Schema Definition) — is often used to specify valid XML data structure.
The promise of SOA is being held back by a lack of rigor with XSD files. Many developers focus on the design of individual services and pay little attention to how the services fit together and collectively evolve. Enterprise data modeling is the solution to this problem. A data model is essential for grasping the entirety of services and abstracting services properly. A data model also provides a guide for combining services in flexible ways.
I see evidence for a lack of data modeling in my consulting practice. I have studied several XSD standards and they all ignore data models. The literature in the area of SOA and data modeling is sparse. The current situation is untenable and SOA projects must pay more attention to data.
“Patterns of data modeling are very important. They enable data modeling efforts to be both effective and efficient. Working without patterns is like wandering around in the data wilderness trying to find your way.
SOA and Data. This is another vital area that must be addressed. I am doing it in my practice. It brings together data, metadata, metacards, data registries, data catalogs — and service. Very important for scalability when the data network size grows (e.g., the government, nationwide health services, etc.).” — James Odell.
“I am mostly an object modeller, but I always recommend that my clients start with existing data model patterns rather than with a blank sheet of paper.
The data modelling patterns I most turn to are those of David C. Hay (Data Model Patterns: Conventions of Thought, etc.).” — Jim Arlow.
“I agree with all that Dr. Blaha said advocating the use of patterns. This was very articulately worded, and I like to see those views spread around.
I also recognize that what he’s tried to do in this book is very different from what Len Silverston, Martin Fowler and I did.
It is true that we were focused on modeling the real world–“domains” as he described it. He, on the other hand has abstracted modeling to the point that he describes modeling itself–“tree” structures, undirected graphs, directed graphs, and so forth.
It is true that Dr. Blaha’s book is abstract in the extreme.
In fact, in my new book, Enterprise Model Patterns: Describing the World, I take on the issue of level of abstraction directly. In it, I present a semantic model that I claim describes the entire enterprise, but on multiple levels of abstraction.
The first (Level 1) is a generic model that any company or government agency can take on as a starting point. It is generic because most attributes are actually captured as data in CHARACTERISTIC entities. (This corresponds to Dr. Blaha’s discussion of soft-coded values.) Thus, they become the problem of the user community, not the data modeler. The data modeler can address the true structures of the business. Yes, this model is organized in terms of five fundamental domains: people and organizations (who), geographic locations (where), physical assets (what), and activities and events (how). It also addresses time (when), but that’s a different kind of model. (This model is based on some 20+ years experience in the field, but I was inspired to write it from my experience over the last few years with the Federal Data Architecture Subcommittee. The committee hasn’t been very effective at creating patterns to distribute to Federal agencies, but it did inspire me to try to capture my views on the subject.)
I then address Level 0, which is a template for the first four categories above. (This is an enhanced version of the THING/THING TYPE model). In addition, at this level are two “meta” models: Document management and accounting. Each of these subject areas itself refers to the entire rest of the model.
At Level 2, I deal with functional specializations. These are more detailed than the level 1 models and make use of the entities in Level 1 combined in specific ways. These subject areas address such things as addresses (both physical addresses–“facilities”–and virtual addresses–telephone numbers, e-mail addresses, etc.), human resources, contracts, and the like. While they are more specialized than level 1, they are still generally applicable patterns. (And these areas address the “why” of the organization.)
At level 3, I address specific industries. For “vertical” models, I take the position that the Level 1/2 models address 80-90% of any company’s requirements. For each industry, however, there are a few special areas that need special attention. These are the things that make that industry unique. I took on five of these, trying to get a cross-section from completely different worlds: criminal justice, microbiology, banking, oil production, and highway maintenance. If you don’t know anything about one of these industries, here is where you can learn something.
I agree that patterns are technology independent. I disagree that “object” models are technologically independent. That Dr. Blaha began with the Gang of Four book, “Design Patterns: Elements of Reusable Object-Oriented Software”, tells you something about his orientation. As it happens, in my latest book, I did (as my colleagues would say) “move over to the dark side” and use UML as the notation, even though that notation is specifically oriented towards object-oriented design, not business modeling. I had to tweak some of the terms to break out of its object-oriented design history. These are conceptual, business-oriented models, not design models.
In doing this, I may have managed to offend both my data modeling colleagues (“You really have gone over to the dark side, haven’t you?”) and my UML colleagues (“What have you done to my UML?”). Or, perhaps, maybe have started building a bridge between the two groups? Only time will tell.” — Dave Hay.
Video “New and old Data stores”.
You can now freely download the Video of the Keynote Panel “New and old Data stores”, held at ICOODB 2010 Frankfurt on September 29, 2010.
Here is the LINK to download the video.
Since the original file was rather large, I split it into two separate files; each one takes about 6 minutes to download.
The panel discussed the pros and cons of new data stores with respect to classical relational databases.
The panel of experts was composed of:
Ulf Michael (Monty) Widenius, main author of the original version of the open source MySQL database.
Michael Keith, architect at Oracle.
Patrick Linskey, Apache OpenJPA project.
Robert Greene, Chief Strategist Versant.
Leon Guzenda, Chief Technology Officer Objectivity.
Peter Neubauer, COO NeoTechnology.
Moderators were: Alan Dearle, University of St Andrews, and Roberto V. Zicari, Goethe University Frankfurt.
The panelists engaged in lively discussions addressing a variety of interesting issues: why the recent proliferation of “new data stores”, such as “document stores” and “nosql databases”; how they differ from classic relational databases; how object databases compare with NoSQL databases; and scalability and consistency for huge amounts of data, to name a few.
RVZ
Proceedings ICOODB 2010 Frankfurt.
The research papers (RESEARCH TRACK) of the ICOODB 2010 Frankfurt conference have been published by Springer in their Lecture Notes in Computer Science series. Here are the details:
Objects and Databases. Dearle, Alan; Zicari, Roberto V. (Eds.)
Proceedings Series: Lecture Notes in Computer Science, Vol. 6348. 1st Edition, 2010, XIV, 161 p., Softcover. ISBN: 978-3-642-16091-2.
Preface and Table of Contents (September 2010).
This book constitutes the thoroughly refereed conference proceedings of the Third International Conference on Object Databases, ICOODB 2010, held in Frankfurt/Main, Germany in September 2010.
Most presentations in the Industry Track, Keynotes and Tutorials are available for free download at ODBMS.ORG.
I will very soon upload the video of the very interesting keynote panel “NEW AND OLD DATA STORES” …stay tuned.
RVZ
Presentations of ICOODB Frankfurt 2010.
I have published on ODBMS.ORG most of the industry presentations given at the ICOODB Frankfurt 2010 conference.
Here are the relevant links:
TUTORIALS:
1. “Object Databases” (PDF, 75 pages), by Michael Grossniklaus, Politecnico di Milano.
2. “Patterns of Data Modeling” (PDF, 49 pages), by Michael R. Blaha, Modelsoft Consulting Corp.
—> Download link.
NoSQL Workshop:
1. “Approaches to Data Modeling in Non-Relational Systems Using Apache Cassandra”, by Gary Dusbabek, Rackspace.
2. “Dinner in the sky with MongoDB”, by Marc Boeker, ONchestra.
3. “Scale Out vs. Scale In: a face-off between Cassandra and Redis”, by Tim Lossen, wooga.
4. “The Graph DB Landscape and SonesDB”, by Achim Friedland, Sones.
5. “Neo4j for deep spatial and social intelligence”, by Peter Neubauer, Neo Technology.
6. “Mastering Massive Data Volumes with Hypertable”, by Doug Judd, Hypertable Inc.
—> Link to Download all presentations (.PDF).
ICOODB KEYNOTES and Industry Track Presentations:
1. “Efficient Development of Event-Driven Systems with Versant Object Database”, by Guenter Ressell-Herbert, Versant.
2. “Accelerating Application Development with Objects”, by Eric Falsken, German Viscuso, Roman Stoffel, db4objects.
3. “The Synergy Between the Object Database, Graph Database, Cloud Computing and NoSQL Paradigms”, by Leon Guzenda, Objectivity.
4. “Unifying Remote Data, Remote Procedures and Web Services”, KEYNOTE by William Cook, University of Texas at Austin.
5. “Searching the Web of Objects”, KEYNOTE by Ricardo Baeza-Yates, VP, Yahoo! Research, Europe and Latin America.
—> Download presentations Link.
6. “State of MariaDB” and “Dynamic Columns in MariaDB”, by Michael (Monty) Widenius, MariaDB.
—> Download link
A lot to read…
Around 200 researchers from around the world attended the conference.
You can see some photos here.
RVZ
The winners of the ODBMS.ORG “Best Object Databases Lecture Notes” Award 2010 are Dr. Michael Grossniklaus and Prof. Moira Norrie, ETH Zürich, Switzerland, for their Lecture Notes “Object-Oriented Databases”.
Second place for:
“Object Database Tutorial”
by Dr. Rick Cattell, Independent Consultant, USA.
Third place for:
“Modern Database Techniques”
by Prof. Martin Hulin, Hochschule Ravensburg-Weingarten, Germany.
The Award Ceremony was held on September 29, 2010, at the 3rd International Conference on Objects and Databases (ICOODB 2010) in Frankfurt.
The Awards recognize the most complete and up-to-date lecture notes on Object Databases that have been, or have strong potential to be, instrumental to the teaching of theory and practice in the field of objects and databases. Any lecture notes published on ODBMS.ORG during the years 2004-2010 were eligible for the 2010 award.
“This is a very nice recognition to the award winners, and it also encourages others to contribute educational materials that others can use. Very good.” Prof. Alfonso Cardenas, Computer Science Department, UCLA.
One of our experts, Dr. Michael Grossniklaus, has recently been awarded a grant by the Swiss National Science Foundation (SNF) for a fellowship as an advanced researcher in David Maier’s group at Portland State University. There, he will be investigating the use of object database technology for cloud data management.
I asked Michael to elaborate on his research plan and share it with our ODBMS.ORG community.
Q1. People from different fields have slightly different definitions of the term Cloud Computing. What is the common denominator of most of these definitions?
MG: Many of the differences stem from the fact that people use the term Cloud Computing both to denote a vision at the conceptual level and technologies at the implementation level. A nice collection of no less than twenty-one definitions can be found here.
In terms of vision, the common denominator of most definitions is to look at processing power, storage and software as commodities that are readily available from large infrastructures. As a consequence, cloud computing unifies elements of distributed, grid, utility and autonomic computing. The term elastic computing is also often used in this context to describe the ability of cloud computing to cope with bursts or spikes in the demand of resources on an on-demand basis. As for technologies, there is a consensus that cloud computing corresponds to a service-oriented stack that provides computing resources at different levels. Again, there are many variants of cloud computing stacks, but the trend seems to go towards three layers. At the lowest level, Infrastructure-as-a-Service (IaaS) offers resources such as processing power or storage as a service. One level above, Platform-as-a-Service (PaaS) provides development tools to build applications based on the service provider’s API. Finally, on the top-most level, Software-as-a-Service (SaaS) describes the model of deploying applications to clients on demand.
Q2. With the emergence of cloud computing, new data management systems have surfaced. Why?
MG: I see new data management systems such as NoSQL databases and MapReduce systems mainly as a reaction to the way in which cloud computing provides scalability. In cloud computing, more processing power typically translates to more (cheap, shared-nothing) computing nodes, rather than migrating or upgrading to better hardware. Therefore, cloud computing applications need to be parallelizable in order to scale. Both NoSQL and MapReduce advocate simplicity in terms of data models and data processing, in order to provide light-weight and fault-tolerant frameworks that support automatic parallelization and distribution.
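The MapReduce model described here can be illustrated with a minimal sketch. In a real framework such as Hadoop, the map and reduce phases each run in parallel across many shared-nothing nodes; the sketch below shows the same three stages (map, shuffle, reduce) sequentially for clarity, using word counting as the classic example.

```python
from collections import defaultdict

def map_phase(chunks):
    # Map: each input chunk independently emits (word, 1) pairs.
    # In a cluster, every node runs this on its local chunk.
    for chunk in chunks:
        for word in chunk.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: the framework groups intermediate values by key
    # and routes each group to a reducer node.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: each key's values are combined independently,
    # so reducers can also run in parallel.
    return {key: sum(values) for key, values in groups.items()}

def word_count(chunks):
    return reduce_phase(shuffle(map_phase(chunks)))

print(word_count(["cloud data cloud", "data management data"]))
# {'cloud': 2, 'data': 3, 'management': 1}
```

Because the map and reduce functions are side-effect free and operate on independent keys, the framework can parallelize and re-execute them automatically, which is exactly the fault-tolerance property mentioned above.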
In comparison to existing parallel and distributed (relational) databases, however, many established data management concepts, such as data independence, declarative query languages, algebraic optimization and transactional data processing, are often omitted. As a consequence, more weight is put on the shoulders of application developers, who now face new challenges and responsibilities. Acknowledging the fact that the initial vision was maybe too simple, there is already a trend of extending MapReduce systems with established data management concepts. Yahoo’s PigLatin and Microsoft’s Dryad have introduced a (near-) relational algebra and Facebook’s HIVE supports SQL, to name only a few examples. In this sense, cloud computing has triggered a “reboot” of data management systems by starting from a very simple paradigm and adding classical features back in, whenever they are required.
Q3. What is in your opinion the direction into which cloud computing data management is evolving? What are the main challenges of cloud computing data management?
MG: Data management in cloud computing will take place on a massively parallel and widely distributed scale. Based on these characteristics, several people have argued that cloud data management is more suitable for analytical rather than transactional data processing. Applications that need mostly read-only access to data and perform updates in batch mode are, therefore, expected to profit the most from cloud computing. At the same time, analytical data processing is gaining importance both in industry in terms of market shares and in academia through novel fields of application, such as computational science and e-science. Furthermore, among the classical data management concepts mentioned above, ACID transactions are the notable exception since, so far, nobody has proposed to extend MapReduce systems with transactional data processing. This might be another indication that cloud data management is evolving in the direction of analytical data processing.
At the time of answering these questions, I see three main challenges for data management in cloud computing: massively parallel and widely distributed data storage and processing, integration of novel data processing paradigms as well as the provision of service-based interfaces. The first challenge has been identified many times and is a direct consequence of the very nature of cloud computing. The second challenge is to build a comprehensive data processing platform by integrating novel paradigms with existing database technology. Often cited paradigms include data stream processing systems, service-based data processing or the above-mentioned NoSQL databases and MapReduce systems. Finally, the third challenge is to provide service-based interfaces for this new data processing platform in order to expose the platform itself as a service in the cloud, which is also referred to as “Database-as-a-Service” (DaaS) or “Cloud Data Services”.
Q4. What is the impact of cloud computing on data management research so far?
MG: Most of the challenges mentioned above are already being addressed in some way by the database research community. In particular, parallel and distributed data management is a well-established field of research, which has contributed many results that are strongly related to cloud data management. Research in this area investigates whether and how existing parallel and distributed databases can scale up to the level of parallelism and distribution that is characteristic of cloud computing. While this approach is more “top down”, there is also the “bottom up” approach of starting with an already highly parallel and widely distributed system and extending it with classical database functionality. This second approach has led to the extended MapReduce systems that were mentioned before. While these extended approaches already partially address the second challenge of cloud data management, the integration of novel data processing paradigms, there are also research results that take this integration even further, such as HadoopDB and Clustera. The third challenge is being addressed as part of the research on programmability of cloud data services in terms of languages, interfaces and development models.
The impact of cloud computing on data management research is also visible in recent calls for papers of both established and emerging workshops and conferences. Furthermore, there are several additional initiatives dedicated to supporting cloud data management research. For example, the MSR Summer Institute 2010 held at the University of Washington brought together a number of database researchers to discuss the current challenges and opportunities of cloud data services.
Q5. In your opinion, is there a relationship between cloud computing and object database technologies? If yes, please explain.
MG: Yes, there are multiple connections between cloud data management and object database technology, which relate to all of the previously mentioned challenges. According to a recent article in Information Week, businesses are likely to split their data management into (transactional) in-house and (analytical) cloud data processing. This requirement corresponds to the first challenge of supporting highly parallel and widely distributed data processing. In this setting, objects and relationships could prove to be a valuable abstraction to bridge the gap between the two partitions.
Introducing the concept of objects in cloud data management systems also makes sense from the perspective of addressing the second challenge of integrating different data processing paradigms. One advantage of MapReduce is that it can cast base data into different implicit models. The associated disadvantage is that the data model is constructed on the fly and, thus, type checking is only possible to a limited extent. To support typing of MapReduce queries, the same base data instances could be exposed using different object wrappers. Microsoft has recently proposed “Orleans”, a next-generation programming model for cloud computing that features a higher level of abstraction than MapReduce. In order to integrate different processing paradigms, Orleans introduces the notion of “grains” that serve as a unit of computation and data storage.
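The wrapper idea sketched here can be made concrete with a small, purely illustrative example: the same untyped base records are exposed through different typed views, so a query written against one view fails early if the data does not match the declared types. All class and field names below are hypothetical, not taken from any real system.

```python
# Hypothetical "object wrapper" sketch: one set of untyped base records,
# two typed views over the same data.

class SensorReading:
    """Typed view of a raw record for numeric temperature analysis."""
    def __init__(self, record):
        self.station = str(record["station"])
        self.temperature = float(record["temp"])  # enforces a numeric type

class StationEvent:
    """A different typed view over the very same base data."""
    def __init__(self, record):
        self.station = str(record["station"])
        self.raw_payload = record  # keeps the full record for event handling

raw_records = [
    {"station": "A", "temp": "21.5"},
    {"station": "B", "temp": "19.0"},
]

# A "query" written against the typed wrapper: type errors surface at
# wrapping time, instead of deep inside an untyped map function.
readings = [SensorReading(r) for r in raw_records]
avg = sum(r.temperature for r in readings) / len(readings)
print(avg)  # 20.25
```

The point is not the wrappers themselves but where the failure occurs: an ill-typed record raises an error when the view is constructed, which is the limited form of type checking that plain MapReduce pipelines lack.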
Finally, object database technologies can also contribute to addressing the third challenge, i.e. providing service-based interfaces for cloud data management. Since object data models and service-oriented interfaces are closely related, it makes a lot of sense to consider object database technology, rather than introducing additional mapping layers. The concept of orthogonal persistence, that is an essential feature of most recent object databases, is particularly relevant in this context. In their ICOODB 2009 paper, Dearle et al. have suggested that orthogonal persistence could be extended in order to simplify the development of cloud applications. Instead of only abstracting from the storage hierarchy, this extended orthogonal persistence would also abstract from replication and physical location, giving transparent access to distributed objects. Even though Orleans is built on top of the Windows Azure Platform that provides a relational database (SQL Azure), the vision of grains is to support transparent replication, consistency and persistence.
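The flavor of orthogonal persistence discussed above can be sketched in a few lines: the application manipulates plain objects, and everything reachable from a designated root is made durable in one step, without per-object save and load calls. Real object databases do this transparently behind the scenes; in this illustrative sketch, Python's pickle module stands in for the storage layer.

```python
import os
import pickle
import tempfile

class Node:
    """An ordinary in-memory object; nothing persistence-specific."""
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []

def persist(root, path):
    # The whole object graph reachable from the root is stored at once.
    with open(path, "wb") as f:
        pickle.dump(root, f)

def restore(path):
    # The graph comes back as ordinary objects, ready to use.
    with open(path, "rb") as f:
        return pickle.load(f)

path = os.path.join(tempfile.gettempdir(), "graph.db")
root = Node("root", [Node("a"), Node("b", [Node("c")])])
persist(root, path)
copy = restore(path)
print(copy.children[1].children[0].name)  # c
```

The extension Dearle et al. propose goes further than this sketch: the same transparency would cover not just the storage hierarchy but also replication and physical location of the objects.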
Q6. Do you know of any application domains where object database technologies are already used in the Cloud?
MG: Among the major object database vendors, I am only aware of Objectivity, which has a version of their product that is ready to be deployed on cloud infrastructures such as Amazon EC2 and GoGrid. However, I have not yet seen any concrete case study showing how their clients are using this product. This being said, it might be interesting to point out that many of the applications that are currently deployed using object databases are very close to the envisioned use case of cloud data management. For example, Objectivity has been applied in the Space Situational Awareness Foundational Enterprise (SSAFE) system and in several data-intensive science applications, for example at the Stanford Linear Accelerator Center (SLAC). Similarly, the European Space Agency (ESA) has chosen Versant to gather and analyze the data transmitted by the Herschel telescope. All of these applications deal with large or even huge amounts of data and require analytical data processing in the sense that was described before.
Q7. What issues would you recommend as a researcher to tackle to go beyond the current state of the art in cloud computing data management?
MG: There is ample opportunity to tackle interesting and important issues along the lines of all three challenges mentioned before. However, if we abstract even more, there are two general research areas that will need to be addressed in order to deliver the vision of cloud data management.
The first area addresses research questions “under the hood”, for example: How can existing parallel and distributed databases scale up to the level of cloud computing? What traditional database functionality is required in the context of cloud data management and how can it be supported? How can traditional databases be combined with other data processing paradigms such as MapReduce or data stream processing? What architectures will lead to fast and scalable data processing systems? The second important area is how cloud data services are provided to clients and, thus, the following research questions are situated “on the hood”: What interfaces should be offered by cloud data services? Do we still need declarative query languages or is a procedural interface the way to go? Is there even a need for entirely new programming models? Can cloud computing be made independent of or orthogonal to the development of the application business logic? How are cloud data management applications tested, deployed and debugged? Are existing database benchmarks sufficient to evaluate cloud data services or do we need new ones?
Of course, these lists of research questions are not exhaustive and merely highlight some of the challenges. Nevertheless, I believe that in answering these questions, one should always keep an eye on recent and also not-so-recent contributions from object databases. As outlined above, many developments in cloud data services have introduced some kind of object notion and, therefore, contributions from object databases can serve two purposes. On the one hand, technologies such as orthogonal persistence can serve as valuable starting points and inspiration for novel developments. On the other hand, we should also learn from previous approaches in order not to reinvent the wheel and not to repeat some of the mistakes that were made before.
Acknowledgement
Michael Grossniklaus would like to thank Moira C. Norrie, David Maier, Bill Howe and Alan Dearle for interesting discussions on this topic and the valuable exchange of ideas.
Michael Grossniklaus
Michael received his doctorate in computer science from ETH Zurich in 2007. His PhD thesis examined how object data models can be extended with versioning to support context-aware data management. In addition to conducting research, Michael has been involved in several courses as a lecturer. Together with Moira C. Norrie, he developed a course on object databases for advanced students which he taught for several years. Currently, Michael is a senior researcher at the Politecnico di Milano, where he both contributes to the “Search Computing” project and works on reasoning over data streams. He has recently been awarded a grant by the Swiss National Science Foundation (SNF) for a fellowship as an advanced researcher in David Maier’s group at Portland State University, where he will be investigating the use of object database technology for cloud data management.
New Resources
I published some new resources.
1. An interesting new User Report: 33/10 by Tilmann Zäschke. Tilmann used to work for the European Space Agency. His task there was to implement a persistence backend for the Herschel Space Observatory. The Herschel Space Observatory is a satellite that performs observations in the far infrared spectrum, in particular observing very old objects with a high red-shift. The lifetime of the satellite is limited to 3-4 years, during which it is expected to produce 15TB of data. You can download the User Report: 33/10.
2. An article by German Viscuso: “Using Object Database db4o as Storage Provider in Voldemort.” Voldemort’s local persistence component allows for different storage engines to be plugged in. In his article, German shows how to create a new storage engine for Voldemort that uses db4o. You can download the paper (PDF).
The source code is available under the Apache 2.0 license.
3. Three revised TechView Product Reports:
– db4o TechView Product Report – Updated July 2010.
– Objectivity/DB TechView Product Report – Updated June 2010.
– ObjectStore TechView Product Report – Updated August 2010.
The jury has selected three finalists for the ODBMS.ORG “Best Object Databases Lecture Notes” Award 2010.
The three finalists are:
“Object Database Tutorial”
by Rick Cattell, Independent Consultant, USA.
“Object-Oriented Databases”
by Michael Grossniklaus and Moira Norrie, ETH Zürich, Switzerland.
“Modern Database Techniques”
by Martin Hulin, Hochschule Ravensburg-Weingarten, Germany.
You can download the three Lecture Notes here.
The Awards recognize the most complete and up-to-date lecture notes on Object Databases that have been, or have strong potential to be, instrumental to the teaching of theory and practice in the field of objects and databases. Any Lecture Notes published on ODBMS.ORG during the years 2004-2010 were eligible for the 2010 award.
The jury panel was composed of:
Prof. Suad Alagic, University of Southern Maine, USA
Prof. Dr. Alfonso F. Cárdenas, UCLA, USA
Leon Guzenda, Objectivity, USA
John McHugh, Progress Software, USA
Prof. Renzo Orsini, University of Venice, Italy
Prof. Tore J.M. Risch, University of Uppsala, Sweden
Prof. Nicolas Spyratos, University of Paris South, France
Prof. Roberto V. Zicari, Goethe University Frankfurt, Germany.
The Award Ceremony will be on September 29, 2010, at the 3rd International Conference on Objects and Databases (ICOODB 2010) in Frankfurt.

