Object Database Technologies and Data Management in the Cloud.
One of our experts, Dr. Michael Grossniklaus, has recently been awarded a grant by the Swiss National Science Foundation (SNF) for a fellowship as an advanced researcher in David Maier’s group at Portland State University. There, he will be investigating the use of object database technology for cloud data management.
I asked Michael to elaborate on his research plan and share it with our ODBMS.ORG community.
Q1. People from different fields have slightly different definitions of the term Cloud Computing. What is the common denominator of most of these definitions?
MG: Many of the differences stem from the fact that people use the term Cloud Computing both to denote a vision at the conceptual level and technologies at the implementation level. A nice collection of no less than twenty-one definitions can be found here.
In terms of vision, the common denominator of most definitions is to look at processing power, storage and software as commodities that are readily available from large infrastructures. As a consequence, cloud computing unifies elements of distributed, grid, utility and autonomic computing. The term elastic computing is also often used in this context to describe the ability of cloud computing to cope with bursts or spikes in resource demand as they occur. As for technologies, there is a consensus that cloud computing corresponds to a service-oriented stack that provides computing resources at different levels. Again, there are many variants of cloud computing stacks, but the trend seems to go towards three layers. At the lowest level, Infrastructure-as-a-Service (IaaS) offers resources such as processing power or storage as a service. One level above, Platform-as-a-Service (PaaS) provides development tools to build applications based on the service provider’s API. Finally, on the top-most level, Software-as-a-Service (SaaS) describes the model of deploying applications to clients on demand.
Q2. With the emergence of cloud computing, new data management systems have surfaced. Why?
MG: I see new data management systems such as NoSQL databases and MapReduce systems mainly as a reaction to the way in which cloud computing provides scalability. In cloud computing, more processing power typically translates to more (cheap, shared-nothing) computing nodes, rather than migrating or upgrading to better hardware. Therefore, cloud computing applications need to be parallelizable in order to scale. Both NoSQL and MapReduce advocate simplicity in terms of data models and data processing, in order to provide light-weight and fault-tolerant frameworks that support automatic parallelization and distribution.
In comparison to existing parallel and distributed (relational) databases, however, many established data management concepts, such as data independence, declarative query languages, algebraic optimization and transactional data processing, are often omitted. As a consequence, more weight is put on the shoulders of application developers, who now face new challenges and responsibilities. Acknowledging the fact that the initial vision was perhaps too simple, there is already a trend of extending MapReduce systems with established data management concepts. Yahoo’s PigLatin and Microsoft’s Dryad have introduced a (near-) relational algebra and Facebook’s HIVE supports SQL, to name only a few examples. In this sense, cloud computing has triggered a “reboot” of data management systems by starting from a very simple paradigm and adding classical features back in, whenever they are required.
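To make the simplicity of the MapReduce paradigm discussed in this answer more concrete, here is a minimal single-machine sketch in Python. It is not taken from the interview and does not reflect any particular system's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a thread pool merely stands in for the shared-nothing nodes of a real deployment.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(mapped_splits):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for pairs in mapped_splits:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values that share a key."""
    return key, sum(values)


def word_count(documents, workers=2):
    """Run map over the splits in parallel, then shuffle and reduce."""
    # Each split is mapped independently, which is what makes the
    # paradigm easy to parallelize (threads are an assumption here).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, documents))
    groups = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in groups.items())


if __name__ == "__main__":
    docs = ["cloud data management", "object data models", "cloud data services"]
    print(word_count(docs))
```

Note that the developer writes the map and reduce functions by hand and there is no schema, query language or optimizer involved, which illustrates the trade-off described above.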
Q3. What is in your opinion the direction into which cloud computing data management is evolving? What are the main challenges of cloud computing data management?
MG: Data management in cloud computing will take place on a massively parallel and widely distributed scale. Based on these characteristics, several people have argued that cloud data management is more suitable for analytic rather than transactional data processing. Applications that need mostly read-only access to data and perform updates in batch mode are, therefore, expected to profit the most from cloud computing. At the same time, analytical data processing is gaining importance both in industry in terms of market share and in academia through novel fields of application, such as computational science and e-science. Furthermore, of the classical data management concepts mentioned above, ACID transactions are the notable exception since, so far, nobody has proposed to extend MapReduce systems with transactional data processing. This might be another indication that cloud data management is evolving in the direction of analytical data processing.
At the time of answering these questions, I see three main challenges for data management in cloud computing: massively parallel and widely distributed data storage and processing, integration of novel data processing paradigms as well as the provision of service-based interfaces. The first challenge has been identified many times and is a direct consequence of the very nature of cloud computing. The second challenge is to build a comprehensive data processing platform by integrating novel paradigms with existing database technology. Often cited paradigms include data stream processing systems, service-based data processing or the above-mentioned NoSQL databases and MapReduce systems. Finally, the third challenge is to provide service-based interfaces for this new data processing platform in order to expose the platform itself as a service in the cloud, which is also referred to as “Database-as-a-Service” (DaaS) or “Cloud Data Services”.
Q4. What is the impact of cloud computing on data management research so far?
MG: Most of the challenges mentioned above are already being addressed in some way by the database research community. In particular, parallel and distributed data management is a well-established field of research, which has contributed many results that are strongly related to cloud data management. Research in this area investigates whether and how existing parallel and distributed databases can scale up to the level of parallelism and distribution that is characteristic of cloud computing. While this approach is more “top down”, there is also the “bottom up” approach of starting with an already highly parallel and widely distributed system and extending it with classical database functionality. This second approach has led to the extended MapReduce systems that were mentioned before. While these extended approaches already partially address the second challenge of cloud data management, the integration of novel data processing paradigms, there are also research results that take this integration even further, such as HadoopDB and Clustera. The third challenge is being addressed as part of the research on programmability of cloud data services in terms of languages, interfaces and development models.
The impact of cloud computing on data management research is also visible in recent calls for papers of both established and emerging workshops and conferences. Furthermore, there are several additional initiatives dedicated to supporting cloud data management research. For example, the MSR Summer Institute 2010 held at the University of Washington brought together a number of database researchers to discuss the current challenges and opportunities of cloud data services.
Q5. In your opinion, is there a relationship between cloud computing and object database technologies? If yes, please explain.
MG: Yes, there are multiple connections between cloud data management and object database technology, which relate to all of the previously mentioned challenges. According to a recent article in Information Week, businesses are likely to split their data management into (transactional) in-house and (analytical) cloud data processing. This requirement corresponds to the first challenge of supporting highly parallel and widely distributed data processing. In this setting, objects and relationships could prove to be a valuable abstraction to bridge the gap between the two partitions.
Introducing the concept of objects in cloud data management systems also makes sense from the perspective of addressing the second challenge of integrating different data processing paradigms. One advantage of MapReduce is that it can cast base data into different implicit models. The associated disadvantage is that the data model is constructed on the fly and, thus, type checking is only possible to a limited extent. To support typing of MapReduce queries, the same base data instances could be exposed using different object wrappers. Microsoft has recently proposed “Orleans”, a next-generation programming model for cloud computing that features a higher level of abstraction than MapReduce. In order to integrate different processing paradigms, Orleans introduces the notion of “grains” that serve as a unit of computation and data storage.
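The idea of exposing the same base data through different object wrappers can be sketched as follows. This is only an illustration under stated assumptions, not a design proposed in the interview: the record layout, the sample values and the class names (`StorageView`, `TimelineView`) are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical untyped base data, as a MapReduce job might read it:
# every field arrives as a string and carries no schema.
RAW_RECORDS = [
    {"id": "7", "name": "sensor-a", "bytes": "1024", "ts": "2010-06-01"},
    {"id": "8", "name": "sensor-b", "bytes": "4096", "ts": "2010-06-02"},
]


@dataclass(frozen=True)
class StorageView:
    """One typed view of a record: who produced how much data."""
    source: str
    size_bytes: int

    @classmethod
    def wrap(cls, record):
        # int() raises ValueError if the field is not numeric, so bad
        # records are rejected at wrapping time, not deep inside a query.
        return cls(source=record["name"], size_bytes=int(record["bytes"]))


@dataclass(frozen=True)
class TimelineView:
    """A second typed view over the very same records: what arrived when."""
    record_id: int
    day: date

    @classmethod
    def wrap(cls, record):
        return cls(record_id=int(record["id"]),
                   day=date.fromisoformat(record["ts"]))


def total_bytes(views):
    """A query written against the typed view rather than raw fields."""
    return sum(view.size_bytes for view in views)


if __name__ == "__main__":
    print(total_bytes(StorageView.wrap(r) for r in RAW_RECORDS))
    print([TimelineView.wrap(r).day for r in RAW_RECORDS])
```

Both wrappers read the same underlying records, so the flexibility of casting base data into different models is kept, while each query works against named and typed attributes.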
Finally, object database technologies can also contribute to addressing the third challenge, i.e. providing service-based interfaces for cloud data management. Since object data models and service-oriented interfaces are closely related, it makes a lot of sense to consider object database technology, rather than introducing additional mapping layers. The concept of orthogonal persistence, which is an essential feature of most recent object databases, is particularly relevant in this context. In their ICOODB 2009 paper, Dearle et al. have suggested that orthogonal persistence could be extended in order to simplify the development of cloud applications. Instead of only abstracting from the storage hierarchy, this extended orthogonal persistence would also abstract from replication and physical location, giving transparent access to distributed objects. Even though Orleans is built on top of the Windows Azure Platform that provides a relational database (SQL Azure), the vision of grains is to support transparent replication, consistency and persistence.
Q6. Do you know of any application domains where object database technologies are already used in the Cloud?
MG: Of the major object database vendors, I am only aware of Objectivity having a version of their product that is ready to be deployed on cloud infrastructures such as Amazon EC2 and GoGrid. However, I have not yet seen any concrete case study showing how their clients are using this product. This being said, it might be interesting to point out that many of the applications that are currently deployed using object databases are very close to the envisioned use case of cloud data management. For example, Objectivity has been applied in the Space Situational Awareness Foundational Enterprise (SSAFE) system and in several data-intensive science applications, for example at the Stanford Linear Accelerator Center (SLAC). Similarly, the European Space Agency (ESA) has chosen Versant to gather and analyze the data transmitted by the Herschel telescope. All of these applications deal with large or even huge amounts of data and require analytical data processing in the sense that was described before.
Q7. As a researcher, what issues would you recommend tackling in order to go beyond the current state of the art in cloud computing data management?
MG: There is ample opportunity to tackle interesting and important issues along the lines of all three challenges mentioned before. However, if we abstract even more, there are two general research areas that will need to be tackled in order to deliver the vision of cloud data management.
The first area addresses research questions “under the hood”, for example: How can existing parallel and distributed databases scale up to the level of cloud computing? What traditional database functionality is required in the context of cloud data management and how can it be supported? How can traditional databases be combined with other data processing paradigms such as MapReduce or data stream processing? What architectures will lead to fast and scalable data processing systems? The second important area is how cloud data services are provided to clients and, thus, the following research questions are situated “on the hood”: What interfaces should be offered by cloud data services? Do we still need declarative query languages or is a procedural interface the way to go? Is there even a need for entirely new programming models? Can cloud computing be made independent of or orthogonal to the development of the application business logic? How are cloud data management applications tested, deployed and debugged? Are existing database benchmarks sufficient to evaluate cloud data services or do we need new ones?
Of course, these lists of research questions are not exhaustive and merely highlight some of the challenges. Nevertheless, I believe that in answering these questions, one should always keep an eye on recent and also not-so-recent contributions from object databases. As outlined above, many developments in cloud data services have introduced some kind of object notion and, therefore, contributions from object databases can serve two purposes. On the one hand, technologies such as orthogonal persistence can serve as valuable starting points and inspiration for novel developments. On the other hand, we should also learn from previous approaches in order not to reinvent the wheel and not to repeat some of the mistakes that were made before.
Michael Grossniklaus would like to thank Moira C. Norrie, David Maier, Bill Howe and Alan Dearle for interesting discussions on this topic and the valuable exchange of ideas.
Michael received his doctorate in computer science from ETH Zurich in 2007. His PhD thesis examined how object data models can be extended with versioning to support context-aware data management. In addition to conducting research, Michael has been involved in several courses as a lecturer. Together with Moira C. Norrie, he developed a course on object databases for advanced students which he taught for several years. Currently, Michael is a senior researcher at the Politecnico di Milano, where he both contributes to the “Search Computing” project and works on reasoning over data streams. He has recently been awarded a grant by the Swiss National Science Foundation (SNF) for a fellowship as an advanced researcher in David Maier’s group at Portland State University, where he will be investigating the use of object database technology for cloud data management.