ODBMS Industry Watch

Aug 29 11

The future of data management: “Disk-less” databases? Interview with Goetz Graefe

by Roberto V. Zicari

“With no disks and thus no seek delays, assembly of complex objects will have different performance tradeoffs. I think a lot of options in physical database design will change, from indexing to compression and clustering and replication.” — Goetz Graefe.

Are “disk-less” databases the future of data management? What about the issue of energy consumption and databases? Will we have have “Green Databases”?
I discussed these issues with Dr. Goetz Graefe, HP Fellow (*), one of the most accomplished and influential technologists in the areas of database query optimization and query processing.

Hope you`ll enjoy this interview.

RVZ.

Q1 The world of data management is changing. Service platforms, scalable cloud platforms, analytical data platforms, NoSQL databases and new approaches to concurrency control are all becoming hot topics both in academia and industry. What is your take on this?

Goetz Graefe: I am wondering whether new and imminent hardware, in particular very large RAM memory as well as inexpensive non-volatile memory (NV-RAM, PCM, etc.) will be significant shifts that will affect software architecture, functionality, scalability, total cost of ownership, etc.
Hasso Plattner in his keynote at BTW (” SanssouciDB: An In-Memory Database for Processing Enterprise Workloads”) definitely seemed to think so. Whether or not one agrees with everything he said, he traced many of the changes back to not having disk drives in their new system (except for backup and restore).

For what it’s worth, I suspect that, based on the performance advantage of semiconductor storage, the vendors will ask for a price premium for semiconductor storage for a long time to come. That enables disk vendors to sell less expensive storage space, i.e., there’ll continue to be role for traditional disk space for colder data such as long-ago history analyzed only once in a while.

I think there’ll also be a differentiation by RAID level. For example, warm data with updates might be on RAID-1 (mirroring) whereas cold historical data might be on RAID-5 or RAID-6 (dual redundancy, no data loss in dual failure). In the end, we might end up with more rather than fewer levels in the memory hierarchy.

Q2. What is expected impact of large RAM (volatile) memory on database technology?

Goetz Graefe: I believe that large RAM (volatile) memory has already made a significant difference. SAP’s HANA /Sanssouci/NewDB project is one example,
C-store/VoltDB is another. Others database management system vendors are sure to follow.

It might be that NoSQL databases and key-value stores will “win” over traditional databases simply because they adapt faster to the new hardware, even if purely because they currently are simpler than database management systems and contain less code that needs adapting.

Non-volatile memory such as phase-change memory and memristors will change a lot of requirements for concurrency control and recovery code. With storage in RAM, including non-volatile RAM, compression will increase in economic value, sophistication, and compression factors. Vertica, for example, already uses multiple compression techniques, some of them pretty clever.

Q3. Will we end up having “disk less” databases then?

Goetz Graefe: With no disks and thus no seek delays, assembly of complex objects will have different performance tradeoffs. I think a lot of options in physical database design will change, from indexing to compression and clustering and replication.
I suspect we’ll see disk-less databases where the database contains only application state, e.g., current account balances, currently active logins, current shopping carts, etc. Disks will continue to have a role and economic value where the database also contains history, including cold history such as transactions that affected the account balances, login & logout events, click streams eventually leading to shopping carts, etc.

Q4. Where will the data go if we have no disks? In the Cloud?

Goetz Graefe: Public clouds in some cases, private clouds in many cases. If “we” don’t have disks, someone else will, and many of us will use them whether we are aware of it or not.

Q5. As new developments in memory (also flash) occur, it will result in possibly less energy consumption when using a database. Are we going to see “Green Databases” in the near future?

Goetz Graefe: I think energy efficiency is a terrific question to pursue. I know of several efforts, e.g., by Jignesh Patel et al. and Stavros Harizopoulos et al. Your own students at DBIS Goethe Universität just did a very nice study, too.

It seems to me there are many avenues to pursue.

For example, some people just look at the most appropriate hardware, e.g., high-performance CPUs such as Xeon versus high-efficiency CPUs such as Centrino (back then). Similar thoughts apply to storage, e.g., (capacity-optimized) 7200 rpm SATA drives versus (performance-optimized) 15K rpm fiber channel drives.
Others look at storage placement, e.g., RAID-1 versus RAID-5/6, and at storage formats, e.g., columnar storage & compression.
Others look at workload management, e.g., deferred index maintenance (& view maintenance) during peak load (perhaps the database equivalent to load shedding in streams) or ‘pause and resume’ functionality in utilities such as index creation.
Yet others look at caching, e.g., memcached. Etc.

Q6. What about Workload management?

Goetz Graefe: Workload management really has two aspects: the policy engine including its monitoring component that provides input into policies, and the engine mechanisms that implement the policies. It seems that most people focus on the first aspect above. Typical mechanisms are then quite crude, e.g., admission control.

I have long been wondering about more sophisticated and graceful mechanisms. For example, should workload management control memory allocation among operations? Should memory-intensive operations such as sorting grow & shrink their memory allocation during their execution (i.e.,, not only when they start)? Should utilities such as index creation participate is resource management? Should index creation (etc.) support ‘pause and resume’ functionality?

It seems to me that I’d want to say ‘yes’ to all those questions. Some of us at Hewlett-Packard Laboratories have been looking into engine mechanisms in that direction.

Q7. What are the main research questions for data management and energy efficiency?

Goetz Graefe: I’ve recently attended a workshop by NSF on data management and energy efficiency.
The topic was split into data management for energy efficiency (e.g., sensors & history & analytics in smart buildings) and energy efficiency in data management (e.g., efficiency of flash storage versus traditional disk storage).
One big issue in the latter set of topics was the difference between traditional performance & scalability improvements versus improvements in energy efficiency, and we had a hard time coming up with good examples where the two goals (performance, efficiency) differed. I suspect that we’ll need cases with different resources, e.g., trading 1 second of CPU time (50 Joule) against 3 seconds of disk time (20 Joule).
NSF (the US National Science Foundation) seems to be very keen on supporting good research in these directions. I think that’s very timely and very laudable. I hope they’ll receive great proposals and can fund many of them.

————————–
Dr. Goetz Graefe is an HP Fellow, and a member of the Intelligent Information Management Lab within Hewlett-Packard Laboratories. His experience and expertise are focused on relational database management systems, gained in academic research, industrial consulting, and industrial product development.

His current research efforts focus on new hardware technologies in database management as well as robustness in database request processing in order to reduce total cost of ownership. Prior to joining Hewlett-Packard Laboratories in 2006, Goetz spent 12 years as software architect in product development at Microsoft, mostly in database management. Both query optimization and query execution of Microsoft’s re-implementation of SQL Server are based on his designs.

Goetz’s areas of expertise within database management systems include compile-time query optimization including extensible query optimization, run-time query execution including parallel query execution, indexing, and transactions. He has also worked on transactional memory, specifically techniques for software implementations of transactional memory.

Goetz studied Computer Science at TU Braunschweig from 1980 to 1983.

(*) HP Fellows are “pioneers in their fields, setting the standards for technical excellence and driving the direction of research in their respective disciplines”.

Resources:

– NoSQL Data Stores (Free Downloads and Links).

– Cloud Data Stores (Free Downloads and Links).

– Graphs and Data Stores (Free Downloads and Links).

– Databases in General (Free Downloads and Links).

Aug 17 11

On Versant`s technology. Interview with Vishal Bagga.

by Roberto V. Zicari

“We believe that data only becomes useful once it becomes structured.” — Vishal Bagga

There is a lot of discussion on NoSQL databases nowdays. But what about object databases?
I asked a few questions to Vishal Bagga, Senior Product Manager at Versant.

RVZ

Q1. How has Versant’s technology evolved over the past three years?

Vishal Bagga: Versant is a customer driven company. We work closely with our customers trying to understand how we can evolve our technology to meet their challenges – whether it’s regarding complexity, data size or demanding workloads.

In the last 3 years we have seen 2 very clear trends from our interaction with our new and existing customers – growing data sizes and increasingly parallel workloads. This is very much in-line with what the general database market is seeing. In addition there was request for simplified database management and monitoring.

Our state of the art Versant Object Database 8 released last year was designed for exactly these scenarios. We have added increased scalability and performance on multi-core architectures, faster and better defragmentation tools, Eclipse based management and monitoring tools to name a few. We are also re-architecting our database server technology to automatically scale when possible without manual DBA intervention and allow online tuning (reconfigure the database instance online without impacting applications).

Q2. On December 1, 2008 Versant acquired the assets of the database software business of Servo Software, Inc. (formerly db4objects, Inc.). What happened to db4objects since then? How does db4objects fit into Versant technology strategy?

Vishal Bagga: The db4o community is doing well and is an integral part of Versant. In fact, when we first acquired db4o at the end of 2008, there were just short of 50,000 registered members.
Today, the db4o community boasts nearly 110,000 members having more than doubled in size in the last 2+ years.
In addition, db4o has had 2 major releases with some significant advances in enterprise type features allowing things like online defragmentation support. In our latest major release, we announced a new data replication capability between db4o and the large scale enterprise class Versant database.
Versant sees a great need in the mobile markets for technology like db4o which can play well in the lightweight handheld, mobile computing and machine-to-machine space while leveraging big data aggregation servers like Versant which can handle the huge number of events coming off of these intelligent edge devices.
In the coming year, even greater synergies are being developed and our communities are merging into one single group dedicated to next generation NoSQL 2.0 technology development.

Q3. Versant database and NoSQL databases: what are the similarities and what are the differences?

Vishal Bagga: The Not Only SQL databases are essentially systems that have evolved out of a certain business need – The need was essentially to have horizontally scalable systems running on commodity hardware with a simple “soft-schema” model for example social networking, offline data crunching, distributed logging system, event processing systems etc.

Relational databases were considered to be too slow, expensive and difficult to manage and administrate, expensive and difficult to adapt to quick changing models.

If I look at similarities between Versant and NoSQL, I would say that:

Both systems have designed around the inefficiency of JOINs. This is the biggest problem with relational databases. If you think about it, in most operational systems relations don’t change e.g. Blog:Article, Order:OrderItem, so why recalculate those relations each time they are accessed using a methodology which gets slower and slower as the amount of data gets larger. JOINs have a use case, but for some 20%, not 100% of the use cases.

Both systems leverage an architectural shift to a “soft-schema” which allows scale-out capability – the ability to partition information across many physical nodes and treat those nodes as 1 ubiquitous database.

When it comes to differences:

The biggest in my opinion is the complexity of the data. Versant allows to you to model very complicated data models seamlessly with ease whereas doing so with a NoSQL solution would be much more effort and you would need to write a lot of code in the application to represent the data model.
In this respect, Versant prefers to use the term “soft-schema” –vs- the term “schemaless”, terms which are often interchanged in discussion.
We believe that data only becomes useful once it becomes structured, in fact that is the whole point of technologies like Hadoop, to churn unstructured data looking for a way to structure it into something useful.
NoSQL technologies that bill themselves as “schema-less” are in denial of the fact that they are leaving the application developer the burden of defining the structure and mapping the data into that structure in the language of the application space. In many ways, it is the mapping problem all over again. Plus, that kind of data management is very hard to change it over time, leading to a brittle solution difficult to optimize for more than 1 use case. The use of “soft-schema” lends itself to a more enterprise manageable and extensible system where the database still retains important elements of structure, while still being able to store and manipulate unstructured types.

Another is the difference in the consistency model. Versant is ACID centric and Versant’s customers depend on this for their mission critical systems – it would be nearly impossible for these systems to use NoSQL given the relaxed constraints. Versant can do a CAP mode, but that is not our only mode of operation. You use it where it is really needed; you are not forced into using it unilaterally.

NoSQL systems make you store your data in a way that you can lookup efficiently by a key. But what if want to lookup something differently; it is likely to be terribly inefficient. This may be okay for the design but a lot of people do not realize that this is a big change in mindset. Versant offers a more balanced approach where you can navigate between related objects using references; you can for example define a root object and then navigate your tree from that object. At the same time you can run ad-hoc queries whenever you want to.

Q4. Big Data: Can Versant database be useful when dealing with petabytes of user data? How?

Vishal Bagga: I don’t see why not. Versant was always designed to work on a network of databases from the very start. Dealing with a Petabyte is really about designing a system with the right architecture. Versant has that architecture just as intact as anyone in the database space saying they can handle a Petabyte. Make no mistake, no matter how you do it, it is a non-trivial task. Today, our largest customer databases are in the 100’s of terrabyte range, so getting to a Petabyte is really a matter of needing that much data.

Q5. Hadoop is designed to process large batches of data quickly. Do you plan to use Hadoop and leverage components of the Hadoop ecosystem like HBase, Pig, and Hive?

Vishal Bagga: Yes, and some of our customers already do that today. A question for you: “Why are those layers in existence?” I would say the answer is that most of these early NoSQL 1.0 technologies do not handle real world complexity in information models. So, these layers are built to try and compensate for that fact. That is the exact point where Versant’s NoSQL 2.0 technology fits into the picture, we help people deal with complexity of information models, something that 1st generation NoSQL has not managed to accomplish.

Q6. Do you think that projects such as JSON (JavaScript Object Notation) and MessagePack (binary-based efficient object serialization library ) play a role in the odbms market?

Vishal Bagga: Absolutely. We believe in open standards. Fortunately, you can store any type in an ODBMS. These specific libraries are particularly important for current most popular client frameworks like Ajax. Finding ways to deliver a soft-schema into a client friendly format is essential to help ease the development burden.

Q7. Looking at three elements: Data, Platform, Analysis, where is Versant heading up?

Vishal Bagga: It is a difficult question as database and data management is increasingly a cross cutting concern. It used to be perfectly fine to keep your Analysis as part of your off-line OLAP systems, but these days there is an increasing push to get Analytics to the real time business.
So, you play with Data, you play with Analytics whether you do it directly or in concert with other technologies through partnership. Certainly, as Versant embraces Platform as a Service, we will do so through eco system partners who are paving the way with new development and deployment methodologies.

– Benchmarking ORM tools and Object Databases. (March 14, 2011)

– Robert Greene on “New and Old Data stores” . (December 2, 2010)

– Object Database Technologies and Data Management in the Cloud. (September 27, 2010)

Aug 9 11

Google Fusion Tables. Interview with Alon Y. Halevy

by Roberto V. Zicari

“The main challenges is that it’s hard for people who have data, but not database expertise, to manage their data.” –Alon Y. Halevy.

Google Fusion Tables was launched on June 9th, 2009. I wanted to know what happened since then. I have therefore interviewed Dr. Alon Y. Halevy, who heads the Structured Data Group at Google Research.

RVZ

Q1.In your web page you write that your job at Google is ” to make data management tools collaborative and much easier to use, and to leverage the incredible collections of structured data on the Web.” What are the main problems and challenges you are currently facing?

Halevy: The main challenges is that it’s hard for people who have data, but not database expertise, to manage their data, share it and create visualizations. Data management requires too much up-front effort, and that forms a significant impediment towards data sharing on a large scale.

Q2. Your group is responsible for Google Fusion Tables, “a service for managing data in the cloud that focuses on ease of use, collaboration and data integration.” What is exactly Google Fusion Tables? What challenges with respect to data management is solving and how?

Halevy: Fusion Tables enables you to easily upload data sets (e.g., spreadsheets and CSV files) to the cloud and manage it. Fusion Tables makes it easy to create insightful visualizations (e.g,. maps, timelines and other charts) and to share these with collaborators or with the public at large.

In addition, Fusion Tables enables merging data sets that belong to different owners. The true power of data is realized when we can combine data from multiple sources and draw conclusions that were impossible earlier.

As an example, this visualization combines data about earthquakes and data about the location of nuclear power reactors, showing what areas are prone to disasters similar to the one experienced in Japan in March 2011.

Q3. Fusion Tables enables users to upload tabular data files. Is there a limitation to the size of such users data files? Which one?

Halevy: Right now we allow 100MB per table and 250MB per user, but that’s not a technical limitation, just a limitation of our free offering

Q4. Google Fusion Tables was launched On June 9th, 2009. What happened since then?

Halevy: We’ve continually improved our service based largely on needs expressed by our users. In particular, we’ve have made our map visualizations much more powerful, and we’ve developed APIs for programmers.

Q5. What data sources do you consider and how do you integrate them together?

Halevy: We do not prescribe any data sources. All the sources you can obtain in Fusion Tables come from users who’ve explicitly marked their data sets as public. Data sources are combined with a merge operation that’s similar to a ‘join’ in SQL.

Q6. How do you ensure a good performance in the presence of millions of user tables? In fact, do you have any benchmark results for that?

Halevy: Given that we’re the first ones to pursue storing such a large number of tables in a single repository, there isn’t an established benchmark. The technical description of how we do it appears in two short papers that we published in SIGMOD 2010 [2] and SoCC 2010 [1] .

Q7. One of the main features of Fusion Tables is that it “allows multiple users to merge their tables into one, even if they do not belong to the same organization or were not aware of each other when they created the tables.
A table constructed by merging (by means of equi-joins) multiple base tables is a view” [1]. What about data consistency, duplicates and updates? Do you handle such cases?

Halevy: The data belongs to the users, not to us, so they have to ensure that it is up to date and does not contain duplicates. Of course, when you combine data from multiple sources you may get inconsistencies. Hopefully, the visualizations we provide will enable you to discover them quickly and resolve them.

Q8. How does Google Fusion Tables relates to Google Maps? What are the main challenges with respect to data management that you face when dealing with Big Data coming from Google Maps?

Halevy: Fusion Tables relies on a lot of the Google Maps infrastructure to display maps. The challenges when displaying maps from large data sets is that you need to do a lot of the computation on the server side so the client is not overwhelmed with points to render, but at the same time that the user experience remains snappy and interactive.

Q9. Why and how do you manage data in the Cloud?

Halevy: Managing data in the cloud is easier for many data owners because they do not have to maintain their own database system (which requires hiring database experts). Putting data in the cloud is also a key facilitator in order to share data with others, including people outside your organization.

We manage the data using some of the Google infrastructure such as BigTable and some layers built over it.

Q10. When you started Google Fusion Tables you did not support complex SQL queries or high throughput transactions. Why? How is the situation now?

Halevy: We still don’t support all forms of SQL queries, and we’re not in the race to become the database system supporting the highest transaction throughput. There are plenty of products on the market that serve those needs. Our goal from the start was to help under-served users with data management tasks that typically do not require complex SQL queries or high-throughput transactions, but rather emphasize data sharing and visualization.

Q11. Fusion Tables is about to “make it easier for people to create, manage and share on structured data on the Web.”
What about handling non structured data and the Web?

Halevy: There are plenty of other tools for that, including blogs, site creation tools and cloud-based word processors.

Q12. You write “…to facilitate collaboration, users can conduct fine-grained discussions on the data.” What does it mean?

Halevy: This means that users can attach comments to individual rows in a table, individual columns and even individual cells. If you are collaborating on a large data set then it is not enough to put all the comments in one big blob around the table. You really need to attach it to specific pieces of data.

Q13. Is Google Fusion Tables open source? Do you have plans to open up the API to developers?

Halevy: We have had an API for almost two years now. Fusion Tables is not open source; it’s a Google service built on top of Google’s infrastructure.

Q14. Fusion Tables API provides developers with a way of extending the functionality of the platform. What are the main extensions being implemented by the community so far?

Halevy: There have been tools developed for importing data from different formats into Fusion Tables (e.g., Shape to Fusion ). There is also a tool for importing Fusion Tables into the statistical R package and outputting the results back into Fusion Tables.

The API is used mostly to tailor applications using Fusion Tables to specific needs.

Q15. How Fusion Tables differs from Amazon`s SimpleDB?

Halevy: Fusion Tables is designed primarily to be a user-facing tool, not as much for developers. I should emphasize that Fusion Tables is not part of Google Labs anymore — it has “graduated” over a year ago.

Q16. In 2005 you co-authored a paper introducing the concept of “Dataspace Systems”, that is systems which provide “pay-as-you-go data management based on best-effort services.” Is this still actual? What are these “Dataspace Systems” exactly?

Halevy: Yes, it is. In fact, Fusion Tables are one example of a dataspace system. Fusion Tables does not require you to create a schema before entering the data, and it tries to get the data types of the columns in order to offer relevant visualizations.
The collaborative aspects of Fusion Tables make it easier for a group of collaborators to improve the quality of the data and combine it with others.

Dataspace systems are still in their infancy, and we have a long way to go to realize the full vision.

Q17. You have worked on Deep web, Surface web, and now Fusion Tables: How do these three area relate to each others?
What is your next project?

Halevy: All of these projects have the same overall goal: to make structured data on the Web more discoverable, so users can enhance it, combine it with data from other sources, and create and publish interesting new data sets.

The Deep Web project had the goal of extracting data sets from behind HTML forms and making the data discoverable in search.
The Surface Web (WebTables) project’s goal was to identify interesting data sets that are on the Web but are not being treated in the most optimal way. Fusion Tables provides a tool for users to upload their own data and publish data sets that can be crawled by search engines.

These three projects — and filling in the gaps between — will keep me busy for a while to come!

———————
Dr. Alon Halevy heads the Structured Data Group at Google Research. Prior to that, he was a Professor of Computer Science at the University of Washington, where he founded the Database Research Group. From 1993 to 1997 he was a Principal Member of Technical Staff at AT&T Bell Laboratories (later AT&T Laboratories). He received his Ph.D in Computer Science from Stanford University in 1993, and his Bachelors degree in Computer Science and Mathematics from the Hebrew University in Jerusalem in 1988. Dr. Halevy was elected Fellow of the Association of Computing Machinery in 2006.
————————-

Resources:

[1]Google Fusion Tables: Web-Centered Data Management and Collaboration (link .pdf).

Hector Gonzalez, Alon Halevy, Christian S. Jensen, Anno Langen,Jayant Madhavan, Rebecca Shapley, Warren Shen, Google Inc.
in SoCC’10, June 10-11, 2010, Indianapolis, Indiana, USA.

[2] Megastore: A Scalable Data System for User Facing Applications.
J. Furman, J. S. Karlsson, J.-M. Leon, A. Lloyd, S. Newman, and P. Zeyliger. In SIGMOD, 2008.

[3] Megastore: Providing Scalable, Highly Available Storage for Interactive Services (link .pdf)
Jason Baker, Chris Bond, James C. Corbett, JJ Furman, Andrey Khorlin, James Larson, Jean-Michel Leon, Yawei Li, Alexander Lloyd, Vadim Yushprakh , Google, Inc. In 5 Biennial Conference on Innovative Data Systems Research (CIDR ’11), January 9-12, 2011, Asilomar, California, USA.

– Google Fusion Tables (link)

– Fusion Tables API (link)

– Amazon`s SimpleDB (link)

– Deep-web crawl Project (Download .pdf).

– WebTables Project (Download .pdf)

——————–

Jul 25 11

How good is UML for Database Design? Interview with Michael Blaha.

by Roberto V. Zicari

„ The tools are not good at taking changes to a model and generating incremental design code to alter a populated database.”— Michael Blaha

The Unified Modeling Language™ – UML – is OMG’s most-used specification.
UML is a de facto standard for object modeling, and it is often used for database design as well. But how good is UML really for the task of database conceptual modeling?
I asked a few questions to Dr. Michael Blaha, one of the leading authorities on databases and data modeling.

RVZ

Q1. Why using UML for database design?

Blaha: Often the most difficult aspect of software development is abstracting a problem and thinking about it clearly — that is the purpose of conceptual data modeling.
A conceptual model lets developers think deeply about a system, understand its core essence, and then choose a proper representation. A sound data model is extensible, understandable, ready to implement, less prone to errors, and usually performs well without special tuning effort.

The UML is a good notation for conceptual data modeling. The representation stands apart from implementation choices, be it a relational database, object oriented database, files, or some other mechanism.

Q2. What are the main weaknesses of UML for database design? And how do you cope with them in practice?

Blaha: First consider object databases. The design of object database code is similar to the design of OO programming code. The UML class model specifies the static data structure. The most difficult implementation issue is the weak support in many object database engines for associations. The workaround depends on object database features and the application architecture.

Now consider relational databases. Relational database tools do not support the UML. There is no technical reason for this, but several cultural reasons. One is that there is a divide between the programming and database communities; each has their own jargon, style, and history and pay little attention to the other.
Also the UML creators focused on unifying programming notation, but spent little time talking to the database community.
The bottom line is that the relational database tools do not support the UML and the UML tools do not support relational databases. In practice, I usually construct a conceptual model with a UML tool (so that I can think deeply and abstractly).
Then I rekey the model into a database tool (so that I can generate schema).

Q3. Even if you have a sound high level UML design, what else can get wrong?

Blaha: I do lots of database reverse engineering for my consulting clients, mostly for relational database applications because that’s what’s used most often in practice. I start with the database schema and work backwards to a conceptual model. I published a paper 10 years ago with statistics for what does go wrong.

In practice, I would say that about 25% of applications have a solid conceptual model, 50% have a mediocre conceptual model, and 25% are just downright awful. Given that a conceptual model is the foundation for an application, you can see why many applications go awry.

In practice, about 50% of applications have a professional database design and 50% are substantially flawed. It’s odd to see so many database design mistakes, given the wide availability of database design tools. It’s relatively easy to take a conceptual model and generate a database design. This illustrates that the importance of software engineering has not reached many developers.

Of course, there can always be flaws in programming logic and user interface code, but these kinds of flaws are easier to correct if there is a sound conceptual model underlying the application and if the model is implemented well with a database schema.

Q4. And specifically for object databases?

Blaha: An object database is nothing special when it comes to the benefits of a sound model and software engineering practice. A carefully considered conceptual model gives you a free hand to choose the most appropriate development platform.

One of my past books (Object-Oriented Modeling and Design for Database Applications) vividly illustrated this point by driving object-oriented data models into different implementation targets, specifically relational databases, object databases, and flat files.

Q5. What are most common pitfalls?

Blaha: It is difficult to construct a robust conceptual model. A skilled modeler must quickly learn the nuances of a problem domain and be able to meld problem content with data abstractions and data patterns.

Another pitfall is that it is important to perform agile development. Developers much work quickly, deliver often, obtain feedback, and build on prior results to evolve an application. I have seen too many developers not take the principles of agile development to heart and become bogged down by ponderous development of interminable scope.

Another pitfall is that some developers are sloppy with database design. Nowdays there really is no excuse for that as tools can
generate database code. Object-oriented CASE tools can generate programming stubs that can seed an object database.
For relational database projects, I first construct an object-oriented model, then re-enter the design into a relational database tool, and finally generate the database schema. (The UML data modeling notation is nearly isomorphic with the modeling language in most relational database design tools.)

Q6. In your experience, how do you handle the situation when a UML conceptual database design is done and a database is implemented using such design, but then later on, updates to the implementation are done without considering the original conceptual design. What to do in such cases?

Blaha: The more common situation is that an application gradually evolves and the software engineering documentation (such as the conceptual model) is not kept up to date.
With a lack of clarity for its intellectual focus, an application gradually degrades. Eventually there has to be a major effort to revamp the application and clean it up, or replace the application with a new one.

The database design tools are good at taking a model and generating the initial database design.
The tools are not good at taking changes to a model and generating incremental design code to alter a populated database.
Thus much manual effort is needed to make changes as an application evolves and keep documentation up to date. However, the alternative of not doing so is an application that eventually becomes a mess and is unmaintainable.

————————————————–
Michael Blaha is a partner at Modelsoft Consulting Corporation.
Dr. Blaha is recognized as one of the world’s leading authorities on databases and data modeling. He has more than 25 years of experience as a consultant and trainer in conceiving, architecting, modeling, designing, and tuning databases for dozens of major organizations around the world. He has authored six U.S. patents, six books, and many papers. Dr. Blaha received his doctorate from Washington University in St. Louis and is an alumnus of GE Global Research in Schenectady, New York.

Related Resources

– OMG UML Resource Page.

– Object-Oriented Design of Database Stored Procedures, By Michael Blaha, Bill Huth, Peter Cheung

– Models, By Michael Blaha

– Universal Antipatterns, By Michael Blaha

– Patterns of Data Modelling (Database Systems and Applications),Blaha, Michael, CRC Press, May 2010, ISBN 1439819890

Jul 15 11

Big Data and the European Commission: Call for Project Proposals.

by Roberto V. Zicari

I thought that this information is of interest for the data management community:

The European Commission has a budget for funding projects in the area of Intelligent Information Management.
It is currently seeking Project Submissions in this area.

It is not always easy to understand the official documents published by the European Commission… Therefore I tried to develop a set of easy-to-read questions/answers for whose of you who might be interested to know more.

Hope it helps

RZV

– What is it?

The European Commission has a budget for funding projects in the area of Intelligent Information Management.

– Which Program is it?:

Formally the Program is called: “Work Programme 2011-2012 of the FP7 Specific Programme ‘Cooperation’, Theme 3, ICT – Information and Communication Technologies”, which has been published on 19 July 2010.

The Work Programme contains in Challenge 4: ‘Technologies for Digital Content and Languages’ the objective ICT-2011.4.4 – Intelligent Information Management.

The objective ICT 2011-4.4 is part of call 8, which is expected to be published July 26, 2011.

– What kind of projects is the European Commission looking to support?

Projects that develop and test new technologies to manage and exploit extremely large volumes of data, with real time capabilities whenever this is relevant. Projects must be the joint work of consortia of partners.
It is very important that at least one member of your consortium be willing and able to make available the large volumes of data needed to test your ideas.

– What kind of support is the EU giving to accepted project proposals?

Depends on the type of partner and the type of proposals. For the most common type of proposal, up to 75% of direct costs for research institutions and small or medium enterprises.

– Why should I submit a proposal?

To join other talented people to solve a problem that you couldn’t solve alone and with your own resources.

– What are the benefits for me and/or for my organization to participate?

Funding is a clear benefit but it should be thought as a means to real end, which is the benefit of advancing the state of the art (for scientific or business objectives) working with people from all over Europe who have very strong skills complementary to yours.

– What is required if the proposal is accepted?

Entering a grant agreement with the European Commission, committing to an agreed plan of work, opening your work to the evaluation of peers selected by the Commission, periodically reporting your costs using agreed standards, being open to audits throughout the duration of the project and for a few years afterwards.

– How do I qualify to participate?

The full rules for participation (which classify countries in various categories with different types of access to the programme) are available here.
Most participants are legal entities established in the EU.

– How can I participate?

You need to become part of an eligible consortium (see again rules for participation above) and submit a proposal that addresses the specific requirements of the call.

– When is the deadline?

Expected to be 17 January 2012 17:00 Brussels time. Please refer to the official text of the call when it is published.

– How do I submit a proposal?

You need to fill the forms and submit your proposal using the submission system .

– Can I see other proposals submitted in the past?

This is not really possible because the Commission has an obligation of confidentiality to past submitters.

– Can I see some projects funded by the EU in the past in the same area?

Certainly. Please visit content-knowledge/projects_en.

– How do I get more info?

Further details on the scope of the call, individual research lines and indicative budget are provided in the Work Programme 2011-2012.

You can also write to: infso-e2@ec.europa.eu

Note: These calls are highly competitive. It is thus important that your idea be really innovative and that your plans for implementing and testing it be really concrete and really credible. It is also important for you to find the right partners and to work with them as a team. This requires a joint vision based on shared objective.
This means in turn that you will need to try to figure out how your skills can help others as hard as you will try to figure out who could help you with what you want to do.

Jun 22 11

“Applying Graph Analysis and Manipulation to Data Stores.”

by Roberto V. Zicari

” This mind set is much different from the set theoretic notions of the relational database world. In the world of graphs, everything is seen as a walk—a traversal. ” — Marko Rodriguez and Peter Neubauer.
__________________________________________________

Interview with Marko Rodriguez and Peter Neubauer.

The open source community is quite active in the area of Graph Analysis and Manipulation, and their applicability to new data stores. I wanted to know more about an open source initiative called “Tinkerpop”.
I have interviewed Marko Rodriguez and Peter Neubauer, who are the ledears of the Tinkerpop” project.

RVZ

Q1. You recently started a project called “Tinkerpop”. What is it?

Marko Rodriguez and Peter Neubauer:
TinkerPop is an open-source graph software group. Currently, we provide a stack of technologies (called the TinkerPop stack) and members contribute to those aspects of the stack that align with their expertise. The stack starts just above the database layer (just above the graph persistence layer) and connects to various graph database vendors — e.g. Neo4j, OrientDB, DEX, RDF Sail triple/quad stores, etc.

The graph database space is a relatively nascent space. At the time that TinkerPop started back in 2009, graph database vendors were primarily focused on graph persistence issues–storing and retrieving a graph structure to and from disk. Given the expertise of the original TinkerPop members (Marko, Peter, and Josh), we decided to take our research (from our respective institutions) and apply it to the creation of tools one step above the graph persistence layer. Out of that effort came Gremlin — the first TinkerPop project. In late 2009, Gremlin was pieced apart into multiple self contained projects: Blueprints and Pipes.
From there, other TinkerPop products have emerged which we discuss later.

Q2. Who currently work on “Tinkerpop”?

Marko Rodriguez and Peter Neubauer:
The current members of TinkerPop are Marko A. Rodriguez (USA), Peter Neubauer (Sweden), Joshua Shinavier (USA), Stephen Mallette (USA), Pavel Yaskevich (Belarus), Derrick Wiebe (Canada), and Alex Averbuch (New Zealand).
However, recently, while not yet an “official member” (i.e. picture on website), Pierre DeWilde (Belgium) has contributed much to TinkerPop through code reviews and community relations. Finally, we have a warm, inviting community where users can help guide the development of the TinkerPop stack.

Q3. You say, that you intend to provide higher-level graph processing tools, APIs and constructs? Who needs them? and for what?

Marko Rodriguez and Peter Neubauer:
TinkerPop facilitates the application of graphs to various problems in engineering. These problems are generally defined as those that require expressivity and speed when traversing a joined structure. The joined structure is provided by a graph database. With a graph database, a user can does not arbitrarily join two tables according to some predicate as there is no notion of tables.
There only exists a single atomic structure known as the graph. However, in order to unite disparate data, a traversal is enacted that moves over the data in order to yield some computational side-effect — e.g. a search, a score, a rank, a pattern match, etc.
The benefit of the graph comes from being able to rapidly traverse structures to an arbitrary depth (e.g., tree structures, cyclic structures) and with an arbitrary path description (e.g. friends that work together, roads below a certain congestion threshold). Moreover, this space provides a unique way of thinking about data processing.
We call this data processing pattern, the graph traversal pattern.
This mind set is much different from the set theoretic notions of the relational database world. In the world of graphs, everything is seen as a walk—a traversal.

Q4. Why using graphs and not objects and/or classical relations? What about non normalized data structures offered by NoSQL databases?

Marko Rodriguez and Peter Neubauer:
In a world where memory is expensive, hybrid memory/disk technology is a must (colloquially, a database).
A graph database is nothing more than a memory/disk technology that allows for the rapid creation of an in-memory object (sub)graph from a disk-based (full)graph. A traversal (the means by which data is queried/processed) is all about filling in memory those aspects of the persisted graph that are being touched as the traverser moves along the graph’s vertices and edges.
Graph databases simply cache what is on disk into memory which makes for a highly reusable in-memory cache.
In contrast, with a relational database, where any table can be joined with any table, many different data structures are constructed from the explicit tables persisted. Unlike a relational database, a graph database has one structure, itself.
Thus, components of itself are always reusable. Hence, a “highly reusable cache.” Given this description, if a persistence engine is sufficiently fast at creating an in-memory cache, then it meets the processing requirements of a graph database user.

Q5. Besides graph databases, who may need Tinkerpop tools? Could they be useful for users of relational databases as well? or of other databases, like for example NoSQL or Object Databases? If yes, how?

Marko Rodriguez and Peter Neubauer:
In the end, the TinkerPop stack is based on the low-level Blueprints API.
By implementing the Blueprints API and making it sufficiently speedy, any database can, in theory, provide graph processing functionality. So yes, TinkerPop could be leveraged by other database technologies.

Q6. Tinkerpop is composed of several sub projects: Gremlin, Pipes, Blueprints and more. At a first glimpse, it is difficult to grasp how they are related to each other. What are all these sub projects? do they all relate with each other?

Marko Rodriguez and Peter Neubauer:
The TinkerPop stack is described from bottom-to-top:
Blueprints: A graph API with an operational semantics test suite that when implemented, yields a Blueprints-enabled graph database which is accessible to all TinkerPop products.
Pipes: A data flow framework that allows for lazy graph traversing.
Gremlin: A graph traversal language that compiles down to Pipes.
Frames: An object-to-graph mapper that turns vertices and edges into objects and relations (and vice versa).
Rexster: A RESTful graph server that exposes the TinkerPop suite “over the wire.”

Q7. Is there a unified API for Tinkerpop? And if yes, how does it look like?

Marko Rodriguez and Peter Neubauer:
Blueprints is the foundation of TinkerPop.
You can think of Blueprints as the JDBC of the graph database community. Many graph vendors, while providing their own APIs, also provide a Blueprints implementation so the TinkerPop stack can be used with their database. Currently, Neo4j, OrientDB, DEX, RDF Sail, TinkerGraph, and Rexster are all TinkerPop promoted/supported implementations.
However, out there in the greater developer community, there exists an implementation for HBase (GraphBase) and Redis (Blueredis). Moreover, the graph database vendor InfiniteGraph plans to release a Blueprints implementation in the near future.

Q8. In your projects you speak of “dataflow-inspired traversal models”. What is it?

Marko Rodriguez and Peter Neubauer:
Data flow graph processing, in the Pipes/Gremlin-sense, is a lazy iteration approach to graph traversing.
In this model, chains of pipes are connected. Each pipe is a computational step that is one of three types of operations: transform, filter, or side-effect.
A transformation pipe will take data of one type and emit data of another type. For example, given a vertex, a pipe will emit its outgoing edges. A filter pipe will take data and either emit it or not. For example, given an edge, emit it if its label equals “friend.” Finally, a side-effect will take data and emit the same data, however, in the process, it will yield some side-effect.
For example, increment a counter, update a ranking, print a value to standard out, etc.
Pipes is a library of general purpose pipes that can be composed to effect a graph traversal based computation. Finally, Gremlin is a DSL (domain specific language) that supports the concise specification of a pipeline. The Gremlin code base is actually quite small — all of the work is in Pipes.

Q9. How other developers could contribute to this project?

Marko Rodriguez and Peter Neubauer:
New members tend to be users. A user will get excited about a particular product or some tangent idea that is generally useful to the community. They provide thoughts, code, and ultimately, if they “click” with the group (coding style, approach, etc.), then they become members. For example, Stephen Mallette was very keen on advancing Rexster and as such, has and continues to work wonders on the server codebase.
Pavel Yaskevich was interested in the compiler aspects of Gremlin and contributed on that front through many versions. Pavel is also a contributing member to Cassandra’s recent query language known as CQL.
Derrick Wiebe has contributed alot to Pipes and in his day job, needed to advance particular aspects of Blueprints (and luckily, this benefits others). There are no hard rules to membership. Primarily its about excitement, dedication, and expert-level development.
In the end, the community requires that TinkerPop be a solid stack of technologies that is well thought out and consistent throughout. In TinkerPop, its less about features and lines of code as it is about a consistent story that resonates well for those succumbing to the graph mentality.

____________________________________________________________________________________

Marko A. Rodriguez:
Dr. Marko A. Rodriguez currently owns the graph consulting firm Aurelius LLC. Prior to this venture, he was a Director’s Fellow at the Center for Nonlinear Studies at the Los Alamos National Laboratory and a Graph Systems Architect at AT&T.
Marko’s work for the last 10 years has focused on the applied and theoretical aspects of graph analysis and manipulation.

Peter Neubauer:
Peter Neubauer has been deeply involved in programming for over a decade and is co-founder of a number of popular open source projects such as Neo4j, TinkerPop, OPS4J and Qi4j. Peter loves connecting things, writing novel prototypes and throwing together new ideas and projects around graphs and society-scale innovation.
Right now, Peter is the co-founder and VP of Product Development at Neo4j Technology, the company sponsoring the development of the Neo4j graph database.
If you want brainstorming, feed him a latte and you are in business.

______________________________________

For further readings

Graphs and Data Stores:
Blog Posts | Free Software | Articles, Papers, Presentations| Tutorials, Lecture Notes

“Marrying objects with graphs”: Interview with Darren Wood.

“Interview with Jonathan Ellis, project chair of Apache Cassandra”.

“The evolving market for NoSQL Databases: Interview with James Phillips.”

_________________________

Jun 13 11

Interview with Iran Hutchinson, Globals.

by Roberto V. Zicari

“ The newly launched Globals initiative is not about creating a new database.
It is however, about exposing the core multi-dimensional arrays directly to developers.” — Iran Hutchinson.
__________________________

InterSystems recently launched a new initiative: Globals.
I wanted to know more about Globals. I have therefore interviewed Iran Hutchinson, software/systems architect at InterSystems and one of the people behind the Globals project.

RVZ

Q1. InterSystems recently launched a new database product: Globals. Why a new database? What is Globals?

Iran Hutchinson: InterSystems has continually provided innovative database technology to its technology partners for over 30 years. Understanding customer needs to build rich, high-performance, and scalable applications resulted
in a database implementation with a proven track record. The core of the database technology is multi-dimensional arrays (aka globals).
The newly launched Globals initiative is not about creating a new database. It is however, about exposing the core multi-dimensional arrays directly to developers. By closely integrating access into development technologies like Java and JavaScript, developers can take full advantage of high-performance access to our core database components.

We undertook this project to build much broader awareness of the technology that lies at the heart of all of our products. In doing so, we hope to build a thriving developer community conversant in the Globals technology, and aware of the benefits to this approach of building applications.

Q2. You classify Globals as a NoSQL-database. Is this correct? What are the differences and similarities of Globals with respect to other NoSQL databases in the market?

Iran Hutchinson: While Globals can be classified as a NoSQL database, it goes beyond the definition of other NoSQL databases. As you there are many different offerings in NoSQL and no key comparison matrices or feature lists. Below we list some comparisons and differences with hopes of later expanding the available information on the globalsdb.org website.

• Globals differs from other NoSQL databases in a number of ways.

o It is not limited to one of the known paradigms in NoSQL (Column/Wide Column, Key-Value, Graph, Document, etc.). You can build your own paradigm on top of the core engine. This is an approach we took as we evolved Caché to support objects, xml, and relational, to name a few.
o Globals still offers optional transactions and locking. Though efficient in implementation we wanted to make sure that locking and transactions were at the discretion of the developer.
o MVCC is built into the database.
o Globals runs in-memory and writes data to disk.
o There is currently no sharding or replication available in Globals. We are discussing options for these features.
o Globals builds on the over 33 years of success of Caché. It is well proven. It is the exact same database technology. Globals will continue to evolve, and receive the innovations going into the core of Caché.
o Our goal with Globals is be a very good steward of the project and technology. The Globals initiative will also start to drive contests and events to further promote adoption of the technology, as well as innovative approaches to building applications. We see this stewardship as a key differentiator, along with the underlying flexible core technology.

• Globals shares similar traits with other NoSQL databases in the market.

o It is free for development and deployment.
o The data model can optionally use a schema. We mitigate the impact of using schemas by using the same infrastructure we use to store the data. The schema information and the data are both stored in globals.
o Developers can index their data.
o The document paradigm enabled by the Globals Document Store (GDS) API enables a query language for data stored using the GDS API. GDS is also an example of how to build a storage paradigm in Globals. Globals APIs are open source and available on the github link.
o Globals is fast and efficient at storing data. We know performance is one of many hallmarks of NoSQL. Globals can store data at rates exceeding 100,000 objects/records per second.
o Different technology APIs are available for use with Globals. We’ve released 2 Java APIs and the JavaScript API is immanent.

Q3. How do you position Globals with respect to Caché? Who should use Globals and who should use Caché?

Iran Hutchinson: Today, Globals offers multi-dimensional array storage, whereas Caché offers a much richer set of features. Caché (and the InterSystems technology it powers including Ensemble, DeepSee, HealthShare, and TrakCare) offers a core underlying object technology, native web services, distributed communication via ECP (Enterprise Cache Protocol), strategies for high availability, interactive development environment, industry standard data access (JDBC, ODBC, SQL, XML, etc.) and a host of other enterprise ready features.

Anyone can use Globals or Caché to tackle challenges with large data volumes (terabytes, petabytes, etc.), high transactions (100,000+ per second), and complex data (healthcare, financial, aerospace, etc.). However, Caché provides much of the needed out-of-box tooling and technology to get started rapidly building solutions in our core technology, as well as a variety of languages. Currently provided as Java APIs, Globals is a toolkit to build the infrastructure already provided by Caché. Use Caché if you want to get started today; use Globals if you have a keen interest in building the infrastructure of your data management system.

Q4. Globals offers multi-dimensional array storage. Can you please briefly explain this feature, and how this can be beneficial for developers?

Iran Hutchinson: It is beneficial to go here. I grabbed the following paragraphs directly from this page:

Summary Definition: A global is a persistent sparse multi-dimensional array, which consists of one or more storage elements or “nodes”. Each node is identified by a node reference (which is, essentially, its logical address). Each node consists of a name (the name of the global to which this node belongs) and zero or more subscripts.

Subscripts may be of any of the types String, int, long, or double. Subscripts of any of these types can be mixed among the nodes of the same global, at the same or different levels.

Benefits for developers: Globals does not limit developers to using objects, key-value, or any other type of storage paradigm. Developers are free to think of the optimal storage paradigm for what they are working on. With this flexibility, and the history of successful applications powered by globals, we think developers can begin building applications with confidence.

Q5. Globals does not include Objects. Is it possible to use Globals if my data is made of Java objects? If yes, how?

Iran Hutchinson:. Globals exposes a multi-dimensional sparse array directly to Java and other languages. While Globals itself does not include direct Java object storage technology like JPA or JDO, one can easily store and retrieve data in Java objects using the APIs documented here. Anyone can also extend Globals to support popular Java object storage and retrieval interfaces.

One of the core concepts in Globals is that it is not limited to a paradigm, like objects, but can be used in many paradigms. As an example, the new GDS (Globals Document Store) API enables developers to use the NoSQL document paradigm to store their objects in Globals. GDS is available here (more docs to come).

Q6. Is Globals open source?

Iran Hutchinson: Globals itself it not open source. However, the Globals APIs hosted at the github location are open source.

Q7. Do you plan to create a Globals Community? And if yes, what will you offer to the community and what do you expect back from the community?

Iran Hutchinson: We created a community for Globals from the beginning. One of the main goals of the Globals initiative is to create a thriving community around the technology, and applications built on the technology.
We offer the community:
• Proven core data management technology
• An enthusiastic technology partner that will continue to evolve and support project ◦ Marketing the project globally
◦ Continual underlying technology evolution ◦ Involvement in the forums and open source technology development ◦ Participation in or hosting events and contests around Globals.
• A venue to not only express ideas, but take a direct role in bringing those ideas to life in technology
• For those who want to build a business around Globals, 30+ years of experience in supplying software developers with the technology to build successful breakthrough applications.

____________________________________

Iran Hutchinson serves as product manager and software/systems architect at InterSystems. He is one of the people behind the Globals project. He has held architecture and development positions at startups and Fortune 50 companies. He focuses on language platforms, data management technologies, distributed/cloud computing, and high performance computing. When not on trail talking with fellow geeks or behind the computer you can find him eating (just look for the nearest steak house).
___________________________________

Resources

Globals.
Globals is a free database from InterSystem. Globals offer multi-dimensional storage. The first version is for Java. Software | Intermediate | English | LINK | May 2011

Globals APIs
Globals APIs are open source available at github location .

– The evolving market for NoSQL Databases: Interview with James Phillips.

– “Marrying objects with graphs”: Interview with Darren Wood.

– “Distributed joins are hard to scale”: Interview with Dwight Merriman.

– On Graph Databases: Interview with Daniel Kirstenpfad.

– Interview with Rick Cattell: There is no “one size fits all” solution.
———-

May 30 11

Measuring the scalability of SQL and NoSQL systems.

by Roberto V. Zicari

“Our experience from PNUTS also tells that these systems are hard to build: performance, but also scaleout, elasticity, failure handling, replication.
You can’t afford to take any of these for granted when choosing a system.
We wanted to find a way to call these out.”
— Adam Silberstein and Raghu Ramakrishnan, Yahoo! Research.
___________________________________

A team of researchers composed of Adam Silberstein, Brian F. Cooper, Raghu Ramakrishnan, Russell Sears, and Erwin Tam, all at Yahoo! Research Silicon Valley, developed a new benchmark for Cloud Serving Systems, called YCSB.

They published their results in the paper “Benchmarking Cloud Serving Systems with YCSB” at the 1st ACM Symposium on Cloud Computing, ACM, Indianapolis, IN, USA (June 10-11, 2010).

They open-sourced the benchmark about a year ago.

The YCSB benchmark appears to be the best to date for measuring the scalability of SQL and NoSQL systems.

Together with my colleague and odbms.org expert, Dr. Rick Cattell, we interviewed Adam Silberstein and Raghu Ramakrishnan.

Hope you`ll enjoy the interview.

Rick Cattell and Roberto V. Zicari
________________________________

Q1. What motivated you to write a new database benchmark?

Silberstein, Ramakrishnan: Over the last few years, we observed an explosive rise in the number of new large-scale distributed data management systems: BigTable, Dynamo, HBase, Cassandra, MongoDB, Voldemort, etc. And of course our own PNUTS system [2].
This field expanded so quickly that there is no agreement on what to call these systems: NoSQL systems, cloud databases, key-value stores, etc. (Some of these terms (e.g., cloud databases) are even applied to Map-Reduce systems such as Hadoop, which are qualitatively different, leading to terminological confusion.)
This trend created a lot of excitement throughout the community of web application developers and among data management developers and researchers. But it also created a lot of debate. Which systems are most stable and mature? Which have the best performance? Which is best for my use case? We saw these questions being asked in the community, but also within Yahoo!.

Our experience with PNUTS tells us there are many design decisions to make when building one of these systems, and those decisions have a huge impact on how the system performs for different workloads (e.g., read-heavy workloads vs. write-heavy workloads), how it scales, how it handles failures, ease of operation and tuning, etc. We wanted to build a benchmark that lets us expose the nuances of each these decisions and their implementations.

Our initial experiences with systems in this space made it very clear that tuning the systems is a challenging problem requiring expert advice from the systems’ developers. Out-of-the-box, systems might be tuned for small-scale deployment, or running batch jobs rather than serving. By creating a serving benchmark, we could create a common “language” for describing the type of workloads web application developers care about, which in turn might bring about tuning best practices for these workloads.

–The space of cloud databases/nosql systems/key-value stores exploded very quickly: HBase, Cassandra, MongoDB, Voldemort…even the space of names describing such systems grew quickly (nosql, cloud db, etc.).
This created a lot of excitement and confusion, both in Yahoo and externally (what do they do, are they stable/mature, what is performance like, etc).

–Each system has very nice things to say about itself. Wanted to make sense of the space so we could learn for ourselves, give advice, etc. Wanted a way for these systems to speak a common language when describing their capabilities.

Our experience from PNUTS also tells that these systems are hard to build: performance, but also scaleout, elasticity, failure handling, replication. You can’t take afford to take any of these for granted when choosing a system. We wanted to find a way to call these out.
We also quickly learned that tuning these systems is difficult. Out-of-box, they might be tuned for small clusters, rather than large (or vice-versa), for different hardware, etc.
And the systems aren’t at the point where they have detailed tuning guides/best practices. While we planned to open-source all along, this motivated us further: put the tuning problem in the hands of the experts by giving them the workload.

Q2. In your paper [1] you write “The main motivation for developing new cloud serving systems is the difficulty in providing scale out and elasticity using traditional database systems”. Can you please explain this statement? What about databases such as object oriented and XML-based?

Silberstein, Ramakrishnan: Database systems have addressed scale-out, and commercial systems (e.g., high-end systems from the major vendors) do scale out well. However, they typically do not scale to web-scale workloads using commodity servers, and are not designed to be operated in configurations of 1000s of servers, and to support high-availability and geo-replication in these settings.
For example, how does ACID carry over when you scale the traditional systems across data centers? Further, they do not support elastic expansion of a running system—When adding servers, how do you offload data to them? How do you replicate for fault tolerance? The new systems we see are making entirely new design tradeoffs. For example, they may sacrifice much of ACID (e.g., it’s strong consistency model), but make it easy and cost-effective to scale to a large number of servers with high-availability and elastic expandability.

These considerations are largely orthogonal to the underlying data model and programming paradigm (i.e., XML or object-oriented systems), though some of the newer systems have also innovated in the areas of flexible schema evolution and nested columns.

Q3. What is the difference between your YCSB benchmark and well known benchmarks such as TPC-C?

Silberstein, Ramakrishnan: At a high level, we have a lot in common with TPC-C and other OLTP benchmarks.
We care about query latency and overall system throughput. When take a closer look, however, the queries are very different. TPC-C contains several diverse types of queries meant to mimic a company warehouse environment. Some queries execute transactions over multiple tables; some are more heavyweight than others.
In contrast, the web applications we are benchmarking tend to run a huge number of extremely simple queries. Consider a table where each record holds a user’s profile information. Every query touches only a single record, likely either reading it, or reading+writing it. We do include support for skewed workloads; some tables may have active sets accessed much more than others. But we have focused on simple queries, since that is what we see in practice.
Another important difference is that we have emphasized the ease of creating a new suite of benchmarks using the YCSB framework, rather than the particular workload that we have defined, because we feel it is more important to have a uniform approach to defining new benchmarks in this space, given its nascent stage.

Q4. In your paper (1) you focus your work on two properties of “Cloud” systems: “elasticity” and “performance”, or as you write “simplified application development and deployment”. Can you please explain what do you mean with these two properties?

Silberstein, Ramakrishnan: – Performance refers to the usual metrics of latency and throughput, with the ability to scale out by adding capacity. Elasticity refers to the ability to add capacity to a running deployment “on-demand”, without manual intervention (e.g., to re-shard existing data across new servers).

Q5. Cloud systems differ for example in the data models they support (e.g., column-group oriented, simple hashtable based, document based). How do you compare systems with such different data models?

Silberstein, Ramakrishnan: In our initial work we focused on the commonalities among the systems we benchmarked (PNUTS, Cassandra, HBase). We ran the systems as row stores holding ordered data.
The YCSB “core workload” accesses entire records at a time, and executes range queries over small portions of the table.
Though Cassandra and HBase both support column-groups, we did not exercise this feature.

It is of course possible to design workloads that test features like column groups, and we encourage users to do so—as we noted earlier, one of our goals was to make it easy to add benchmarks using the YCSB framework. One way to compare systems with different feature sets is to disqualify systems lacking the desired features.

Rather than disqualify a system because it lacks column groups, for example, it may make sense to compare the system as-is against others with column groups.
It is then possible to measure the cost of reading all columns, even when only one is needed. On the other hand, if a workload must execute range scans over a continuous set of of keys, there is no reasonable way to run that workload against a hash-based store.

Q6. You write that the focus of your work is on “serving” systems, rather than “batch” or “analytical”. Are there any particular reasons for doing that? What are the challenges in defining your benchmark?

Silberstein, Ramakrishnan: This is analogous to having TPC-C and TPC-H benchmarks in the conventional database setting.
Many prominent systems in cloud computing, either target serving workloads or target analytical/batch workloads. Hadoop is the most well-known example of a batch system. Some systems, such as HBase, support both. We believe serving and batch are extremely important, but very different. Serving needs its own treatment and we wanted to very clearly call that out in our benchmark. It is easy to confuse the two when looking at benchmark results. In both settings we talk a lot about throughput, e.g., number of records read or written per second.
But in batch, where the workload reads or writes an entire table, those records are likely sequentially on disk. In serving, the workload does many random reads and writes, so those records are likely scattered on disk. Disk I/O patterns have a huge impact on throughput, and so we must understand whether a workload is batch or serving before we can appreciate the benchmark results.

-These are really different spaces, although some applications combine both. We believe both are important in the cloud space.

-One reason this space gets confusing is because most systems we’ve looked at have some batch functionality (often via Hadoop integration). (We know of cases where YCSB is used to drive both a serving and batch workload). And there are even use cases that require both. But there are many important use cases that do only serving, and in this work we wanted to benchmark that specifically.

-Another reason we started with just serving is because it is very easy to confuse the two when discussing performance.

For example, in both settings we can talk about throughput numbers (ops/second).
For serving you are probably talking about doing a huge number of operations on a random set of records. For batch you are talking about doing 1 operation on a huge number of records, and they are probably physically clustered together. So we’re talking about, for example, random reads vs. sequential scans. The numbers will be very different and its important to look at the results for the case you care about.

Q7. Your YCSB benchmark consists of two tiers: Tier-1 Performance and Tier-2 Scaling. Can you please explain what these two Tiers are and what do you expect to measure with them?

Silberstein, Ramakrishnan: The performance tier is the bread-and-butter of our initial YCSB work; for a fixed system size, we want to see how much load we can put on each system while still getting low latency.

This is one of the key questions, if not the key question, an application developer asks when choosing a data system: how much of my expected workload does the system support per-server, and so how many servers am I going to have to buy? No matter what system we benchmark, the performance results have a similar feel. We plot throughput on the x-axis and (average or 95th percentile) latency on the y-axis. As we push throughput higher, latency grows gradually. Eventually, as the system the system reaches saturation, latency jumps dramatically, and then we can push throughput no higher. It is easy to compare these graphs across systems and get a feel for what kind of load each can support. It is also worthwhile to verify that while systems slow down at saturation point, they nonetheless remain stable.

The big selling point of cloud serving systems is their ability to scale upward by adding more servers. If the system offers low latency at small scale, as we proportionally increase workload size and the number of servers, latency should remain constant. If this is not the case, this is a hint the system might have bottlenecks that surface at scale.

In our scaling tier we also make the point of distinguishing scalability and elasticity. While good scalability means the system runs well over large workloads if pre-configured with the appropriate number of servers, elasticity means the system can actually grow from small to large scale by adding servers while remaining online.

In our initial work we observed systems doing well on scalability, but having erratic behavior when we added capacity online and increased workloads (i.e., they were often weak on elasticity).

Q8. How are the workloads defined in your YCSB benchmark? How do they differ from traditional TCP-C workloads?

Silberstein, Ramakrishnan: Our workloads consist of much simpler queries than TPC-C. There are just a handful of knobs for the user to adjust. The user may specify size settings such as existing database size and record size, and distribution settings, such as the relative proportions of insert, reads, updates, delete, and scan operations and distribution of accesses over the key space. These simple settings let us characterize many important workloads we find at Yahoo!, and we provide a few simple ones as part of our Core Workload.

Of course, users can develop much more complex workloads that look more like TPC-C, and our code is designed to make that easy.

Q9. You used your benchmark to measure four different systems: Cassandra, HBase, your cloud system PNUTS and your implementation of a sharded MySQL. Why did you choose these four systems? What other systems would you have liked to include, or would you like to see someone run YCSB on?

Silberstein, Ramakrishnan: When we started our work there were certainly more than four systems to choose from, and there are even more now. We limited ourselves to just a handful to avoid spreading ourselves too thin. This was a wise decision, since figuring out how to run and tune each system for serving was a full-time job (providing even more motivation to produce the benchmark). Ultimately, we made our choice of systems based on interest at Yahoo! and our desire to compare a collection of systems that made fundamentally different design decisions.
We built PNUTS here, so that is an obvious choice, and many Yahoo! developers are curious about the features and performance of Cassandra and HBase. PNUTS uses a buffer page based architecture, while Cassandra and HBase (itself a clone of BigTable) are based on differential-file architectures.

We are happy with our initial choice of systems. We got a lot of interest in our work and have gained a lot of users and contributors. Those contributors have done a variety of work, including improvements to clients of the systems we initially benchmarked and adding new clients. As of this writing, we have clients for Voldemort, MongoDB, and JDBC.

Q10. In your paper you present the main results of comparing these four systems. Can you please summarize the main results you obtained and the lessons learned?

Silberstein, Ramakrishnan: Our impact comes not from the actual numbers we collected; this was just a snapshot in time for each system. Our key contribution was creating an apples-to-apples environment that made ongoing comparisons feasible. We encouraged some competition between the systems and produced a tool for the systems to compare themselves against in the future.

That said, we had some interesting results. We know that the systems we tested made different design decisions to optimize reads or writes, and we saw these decisions reflected in the results. For a 50% read 50% write workload, Cassandra achieved the highest throughput. For a 95% read workload, PNUTS tied Cassandra on throughput while having better latency.

All systems we tested advertised scalability and elasticity. The systems all performed well at scale, but we noticed hiccups when growing elastically.
Cassandra took an extremely long time to integrate a new server, and had erratic latencies during that time. HBase is extremely lazy about integrating new servers, requiring background compactions to move partitions to them.

This was over a year ago, and these systems may well have overcome these issues by now.

-Actual numbers not the important thing here…this was just a snapshot in time of these systems. The key thing is that we created an apples-to-apples setup that made comparisons possible, created a bit of competition, and created something for the systems to evaluate themselves against.

-Result #1. We knew the systems made fundamental decisions to optimize writes or optimize reads.
It was nice to see these decisions show up in the results.
Example: in a 50/50 workload, Cassandra was best on throughput. In a 95% read workload, PNUTS caught up and had the best latencies.

-Result #2. The systems may advertise scalability and elasticity, but this is clearly a place where the implementations needed more work. Ref. elasticity experiment. Ref. HBase with only 1-2 nodes.

-Lesson. We are in the early stages. The systems are moving fast enough that there is no clear guidance on how to tune each system for particular workloads.

So figuring out how to do justice to each system is tricky.

Q11. Please briefly explain the experiment set up.

Silberstein, Ramakrishnan: We simply performed a bake-off. We had a collection of server class machines, and installed and ran each cloud serving system on them. We spent a huge amount of time configuring each system to perform their best on our hardware. We are in the early days of cloud serving systems, and there are no clear best practices for how to tune these systems to our workloads. We also took care to allocate memory fairly across each system.
For example, HBase runs on top of HDFS, and both HBase and HDFS must be given memory, and the sum must equal what we grant Cassandra.
Finally we made sure to load each system with enough data to avoid fitting all records in memory; all of the systems we benchmarked perform dramatically differently when running solely in memory vs. not.
We were more interested in performance in the latter case, so ran in that mode.

Q12. You plan to extend the benchmark to include two additional properties: availability and replication. Can you please explain how you intend to do so?

Silberstein, Ramakrishnan: These are tricky, but important, tiers to build. We want to measure a variety of things. What is the added cost of replication? What happens to availability under different failure scenarios? Can the records be read and/or written, and is there a cost penalty when doing so? What is the record consistency model during failures, and can we quantify the differences between different failure models?

These tiers expose areas where system designers can make very different design decisions. They might ensure write durability by replicating writes on-disk or in-memory only. They might replicate intra-data center or cross-data center. During network partitions, they might prioritize consistency and make data read-only, or might prioritize availability, and use eventual consistency to make record replicas converge later.

Q13. Your benchmark appears to be the best to date for measuring the scalability of SQL and NoSQL systems; it would be great if others would run it on their systems. Do you have ideas to encourage that to happen?

Silberstein, Ramakrishnan: We think we have been fairly successful here already. We open-sourced the benchmark about a year ago. It is available here.

Before we released the benchmark, we shared drafts of our results with the developer/user lists of each benchmarked system a few times prior to publication, and each time got a lot of interest and questions: what do our machines look like, what tuning parameters did we use, etc. This interest is a great sign. First, as you point out, we’re filling an unmet need.
Second, our audience recognizes our Yahoo! workloads are important and that we as Yahoo! researchers are doing a rigorous comparison.

By the time we released the benchmark we had many people waiting to use it.

Today, we have many users and contributors. We see comments on system mailing lists like “I am running YCSB Workload A and seeing XXX ops/sec and that seems too low. How should I tune my setup?”

Q14. A simpler benchmark might be easier for others to reproduce, getting more results. Is there a simple subset of your benchmark you’d suggest that captures most of the important elements?

Silberstein, Ramakrishnan: The Core Benchmark is already very simple. It uses synthetic data and provides just a few knobs for users to adjust. We and others users have done more complex testing like feed in actual production logs in YCSB, but that is not part of Core. Within Core, we should mention there are 3 very simple workloads that can execute on the simplest hash-based key value store. Workloads A, B, and C execute only inserts, reads, and updates. They do not execute range queries.

Q15. Open source NoSQL projects may not have half a dozen dedicated servers available to reproduce your results on their systems. Do you have suggestions there? Is it possible to run an accurate benchmark on a leased cloud platform?

Silberstein, Ramakrishnan: Funny you ask this because we felt the half dozen servers we ran on was not enough. We went to great lengths to borrow 50 homogeneous servers for a short time to run 1 large-scale experiment that showed YCSB itself scales.
Running YCSB on a leased platform like Amazon EC2 is straightforward.
The big question is determining if the results are accurate and repeatable. Certainly if the machines are VMs on non-dedicated servers, that might not be the case. We have heard if you request the largest size VMs, EC2 might allocate dedicated servers. One of the benefits of working at a large Internet company is that we have not had to try this ourselves!

Q16. Do you think that flash memory may “tip the balance” in the best data platforms? It cannot simply be treated as a disk, nor as RAM, because it has fundamentally different read/write costs.

Silberstein, Ramakrishnan: As SSD hardware has matured, we have noticed two trends. First, it is quite easy to build a machine with enough SSDs to run into CPU, bus and controller bottlenecks. This has led to rewrites of most storage stacks, cloud based or not.

Second, SSDs use extremely sophisticated log structured techniques to mask the cost of writes. Some of these techniques, such as data deduplication and compression only help certain workloads. The big question in this space is how higher-level database indexes will interact with the lower-level log structured system.

It could be that future hardware devices will mask the cost of random writes so well that higher-level log structured techniques will become redundant. On the other hand, higher-level log structured approaches have more computational resources at their disposal, and also have more information about the application. These advantages could mean that they will always be able to significantly improve upon hardware-based approaches.
______________________________________________________

Adam Silberstein, Yahoo! Research.
Research Area: Web Information Management.
My current interests are in large distributed data systems. My research interests are in the general area of large scale data management. Specifically, this includes both online transaction processing, and analytics, and bridging the gap between them,
as well as techniques for generating user feeds in social networks. I joined Yahoo! in August 2007, after finishing my Ph.D. at Duke University in February 2007.

Raghu Ramakrishnan, Yahoo! Research.
Raghu Ramakrishnan is Chief Scientist for Search and Cloud Platforms at Yahoo!, and is a Yahoo! Fellow, heading the Web Information Management research group. His work in database systems, with a focus on data mining, query optimization, and web-scale data management, has influenced query optimization in commercial database systems and the design of window functions in SQL:1999. His paper on the Birch clustering algorithm received the SIGMOD 10-Year Test-of-Time award, and he
has written the widely-used text “Database Management Systems” (with Johannes Gehrke).
His current research interests are in cloud computing, content optimization, and the development of a “web of concepts” that indexes all information on the web in semantically rich terms. Ramakrishnan has received several awards, including the ACM SIGKDD Innovations Award, the ACM SIGMOD Contributions Award, a Distinguished Alumnus Award from IIT Madras, a Packard Foundation Fellowship in Science and Engineering, and an NSF Presidential Young Investigator Award. He is a Fellow of the ACM and IEEE.

Ramakrishnan is on the Board of Directors of ACM SIGKDD, and is a past Chair of ACM SIGMOD and member of the Board of Trustees of the VLDB Endowment. He was Professor of Computer Sciences at the University of Wisconsin-Madison, and was founder and CTO of QUIQ, a company that pioneered crowd-sourcing, specifically question-answering communities, powering Ask Jeeves’ AnswerPoint as well as customer-support for companies such as Compaq.

____________________________________________
Dr. R. G. G. “Rick” Cattell is an independent consultant in database systems and engineering management. He previously worked as a Distinguished Engineer at Sun Microsystems, mostly recently on open source database systems and horizontal database scaling. Dr. Cattell served for 20+ years at Sun Microsystems in management and senior technical roles,
and for 10+ years in research at Xerox PARC and at Carnegie-Mellon University.
Dr. Cattell is best known for his contributions to middleware and database systems, including database scalability, enterprise Java, object/relational mapping, object-oriented databases, and database interfaces. He is the author of several dozen papers and six books. He instigated Java DB and Java 2 Enterprise Edition, and was a contributor to a number of the Enterprise Java APIs and products. He previously led development of the Cedar DBMS at Xerox PARC, the Sun Simplify database GUI, and SunSoft’s ORB-database integration. He was a founder of SQL Access (a predecessor to ODBC), the founder and chair of the Object Data Management Group (ODMG), the co-creator of JDBC, the author of the world’s first monograph on object/relational and object databases, and a recipient of the ACM Outstanding PhD Dissertation Award.

References:
[1] Benchmarking Cloud Serving Systems with YCSB. (.pdf)
Adam Silberstein, Brian F. Cooper, Raghu Ramakrishnan, Russell Sears, Erwin Tam, Yahoo! Research.
1st ACM Symposium on Cloud Computing, ACM, Indianapolis, IN, USA (June 10-11, 2010)

[2] PNUTS: Yahoo!’s Hosted Data Serving Platform (.pdf)
Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver and Ramana Yerneni, Yahoo! Research.
The paper describes PNUTS/Sherpa, Yahoo’s record-oriented cloud database.

Open Source Code

– Yahoo! Cloud Serving Benchmark (brianfrankcooper / YCSB) Downloads

– Interview with Jonathan Ellis, project chair of Apache Cassandra.

– Hadoop for Business: Interview with Mike Olson, Chief Executive Officer at Cloudera

– The evolving market for NoSQL Databases: Interview with James Phillips.

For further readings

– Scalable Datastores, by Rick Cattell

_____________________________________________________________________________________________

May 16 11

Interview with Jonathan Ellis, project chair of Apache Cassandra.

by Roberto V. Zicari

“You’re going to see these databases attempting to make things easy that today are possible but difficult.” –Jonathan Ellis.

This interview is part of my series of interviews on the evolving market for Data Management Platforms. This time, I had the pleasure to interview Jonathan Ellis, project chair of Apache Cassandra.

RVZ: In my understanding of how the market of Data Management Platforms is evolving, I have identified three phases:

Phase I– New Proprietary data platforms developed: Amazon (Dynamo), Google (BigTable). Both systems remained proprietary and are in use by Amazon and Google.

Phase II- The advent of Open Source Developments: Apache projects such as Cassandra, Hadoop (MapReduce, Hive, Pig). Facebook and Yahoo! played major roles. Multitude of new data platforms emerged.

Phase III– Evolving Analytical Data Platforms. Hadoop for analytic. Companies such a Cloudera, but also IBM`s BigInsights are in this space.

Q1. Why Amazon and Google had to develop their own database systems? Why didn’t they use/”adjust” existing database systems?

Jonathan Ellis: Google and Amazon were really breaking new ground when they were working on Bigtable and Dynamo. The thing you have to remember is that the problem they were trying to solve was high-volume transactional systems: while companies like Teradata have been building large-scale databases for some time, these were analytical databases and not designed for high query volumes with low latency.

The state-of-the-art at the time for transactional systems was horizontal and vertical partitioning customized for a given application, built on a traditional database like Oracle. These systems were not application-transparent, meaning repartitioning was a major undertaking, nor were they reusable from one application to another.

Q2. I defined Phase II as the advent of Open Source Developments with Apache projects such as Cassandra, Hadoop (MapReduce, Hive, Pig). Facebook and Yahoo! played major roles. Multitude of new data platforms emerged. Any comment on this?

Jonathan Ellis: Note that Hadoop is a different kind of animal here: Hadoop *is* an analytical system, not a real–time or transaction oriented one.

Q3. How was it possible that Amazon and Google`s proprietary systems were used as input for Open Source Projects?

Jonathan Ellis: Google and Amazon both published whitepapers on Bigtable and Dynamo which were very influential for the open source systems that started appearing soon afterward. Since then, the open-source systems have of course continued to evolve; today Cassandra—which began as a fusion of Bigtable and Dynamo concepts—includes features described by neither of its ancestors, such as distributed counters and column indexes.

Q4. Why Facebook and LinkedIn developed Open Source data platforms and not proprietary software?

Jonathan Ellis: Probably a combination of two reasons:
– They recognized that with the kind of head start that Google and Amazon had, it would be difficult to achieve technical parity without leveraging the efforts of a larger development community.

– They don’t see infrastructure in and of itself as the place where they gain their competitive advantage. You can see more evidence of this in Facebook’s recent announcement that they were opening up the plans for their newest data center. So it goes beyond just source code.

Q5. Not everyone has data and scalability requirements such as Amazon and Google. Who currently needs such new data management platforms, and why?

Jonathan Ellis: Again, I think the piece that’s new here is the emphasis on high-volume transaction processing. Ten years ago you didn’t see this kind of urgency around transaction processing–some large web sites like eBay were concerned already, but it feels like there’s been a kind of Moore’s law of data growth that’s been catching more and more companies, both on the web and off. Today DataStax has customers like startups Backupify and Inkling, as well as companies you might expect to see like Netflix.

Q6. Is there a common denominator between the business models built around such open source projects?

Jonathan Ellis: At the most basic level, there’s only so many options to choose from.
You have services and support, and you have proprietary products built on top or around the open source core. Everyone is doing some combination of those.

Phase III– Evolving Analytical Data Platforms
Q7. Is Business Intelligence becoming more like Science for profit?

Jonathan Ellis: If you mean that a lot of teams are now trying to commercialize technologies that were originally developed without that kind of focus, then yes, we’ve definitely seen a lot of that the last couple years.

Q8. Who are the main actors in the Platform Ecosystem?

Jonathan Ellis: On the real-time side, Cassandra’s strongest competitors are probably Riak and HBase. Riak is backed by Basho, and I believe Cloudera supports HBase although it’s not their focus.

For analytics, everyone is standardizing on Hadoop, and there are a number of companies pushing that.

DataStax is unique here in that our just-released Brisk project gives you the best of both worlds: a Hadoop powered by Cassandra so you never have to do an ETL process before running an analytical query against your real-time data, while at the same time keeping those workloads separate so that they don’t interfere with each other.

Q9. What role will RDBMS play in the future? What about Object Databases, do they have a role o play?t

Jonathan Ellis: Relational databases will continue to be the main choice when you need ACID semantics and you have a relatively small data or query volume that you care about. Many of our customers continue to use a relational database in conjunction with Cassandra for things like user registration.

To be honest, I don’t see object databases being able to ride the NoSQL wave out of their niche. The popularity of NoSQL options isn’t from their rejection of the SQL language per se, but because that was part of what they left behind when they added features that are starting to matter even more than query language, primarily scalability.

Q10. Looking at three elements: Data, Platform, Analysis, what are the main research challenges ahead? And what are the main business challenges ahead?

Jonathan Ellis: I see the technical side as more engineering than R&D. You’re going to see these databases attempting to make things easy that today are possible but difficult. Cassandra’s column indexes are an example of this–you could use Cassandra to look up rows by column values in the past, but you had to maintain those indexes manually, in application code. Today Cassandra can automate that for you.

This ties into the business side as well: the challenge for everyone is to move beyond the early adopter market and go mainstream. Ease of use will be a big part of that.

Q11. What are the main future developments? Anything you wish to add?

Jonathan Ellis: This is an exciting space to work in right now because the more we build, the more we can see that we’ve barely scratched the surface so far. The feature we’re working on right now that I’m personally most excited about is predicate push-down for Brisk: allowing Hive, a data warehouse system for Hadoop, to take advantage of Cassandra column indexes.

Readers curious about Brisk–which is fully open-source–can learn more here.

Thanks for the questions!

Jonathan Ellis

__________________________________
Jonathan Ellis.
Jonathan Ellis is CTO and co-founder of DataStax (formerly Riptano), the commercial leader in products and support for Apache Cassandra. Prior to DataStax, Jonathan built a multi-petabyte, scalable storage system based on Reed-Solomon encoding for backup provider Mozy. Jonathan is project chair of Apache Cassandra.
__________________________________

2. The evolving market for NoSQL Databases: Interview with James Phillips.

____________________

Apr 13 11

Objects in Space vs. Friends in Facebook.

by Roberto V. Zicari

“Data is everywhere, never be at a single location. Not scalable, not maintainable.”–Alex Szalay

I recently reported about the Gaia mission which is considered by the experts “the biggest data processing challenge to date in astronomy“.

Alex Szalay- who knows about data and astronomy, having worked from 1992 till 2008 with the Sloan Digital Sky Survey together with Jim Gray – wrote back in 2004:
“Astronomy is a good example of the data avalanche. It is becoming a data-rich science. The computational-Astronomers are riding the Moore’s Law curve, producing larger and larger datasets each year.” [Gray,Szalay 2004]

Gray and Szalay observed: “If you are reading this you are probably a “database person”, and have wondered why our “stuff” is widely used to manage information in commerce and government but seems to not be used by our colleagues in the sciences. In particular our physics, chemistry, biology, geology, and oceanography colleagues often tell us: “I tried to use databases in my project, but they were just to [slow | hard-to-use |expensive | complex ]. So, I use files.” Indeed, even our computer science colleagues typically manage their experimental data without using database tools. What’s wrong with our database tools? What are we doing wrong? “ [Gray,Szalay 2004].

Six years later, Szalay in his presentation “Extreme Data-Intensive Computing” presented what he calls “Jim Gray`s Law of Data Engineering”:
1. Scientific computing is revolving around data.
2. Need scale-out solution for analysis.” [Szalay2010].

He also says about Scientific Data Analysis, or as he calls it (DISC: Data Intensive Scientific Computing): “Data is everywhere, never be at a single location. Not scalable, not maintainable.”[Szalay2010]

I would like to make three observations:

i. Great thinkers do anticipate the future. They “feel” it. Better said, they “see” more clearly how things really are.
Consider for example what the philosopher Friedrich Nietzsche wrote in his book “Thus Spoke Zarathustra”: “The middle is everywhere.” Confirmed 128 years later by “Data is everywhere”….

ii. “Astronomy is a good example of the data avalanche”: the Universe is beyond our comprenshion, which means I believe, that ultimately we will figure out that indeed “data is not scalable, and not maintainable.”

iii. I now dare to twist the quote: “If you are reading this you are probably a “database person”, and have wondered why our “stuff” is widely used to manage information in commerce and government but seems to not be used by our colleagues at Facebook or Google”….

I have asked Professor Alex Szalay for his opinion.

Alexander Szalay is a professor in the Department of Physics and Astronomy of the Johns Hopkins University. His research interests are theoretical astrophysics and galaxy formation.

RVZ

Alex Szalay: This is very flattering… and I agree. But to be fair, the Facebook guys are using databases, first MySQL, and now Oracle in the middle of their whole system.

I have recently heard a talk by Jeff Hammerbacher, who built the original infrastructure for Facebook. Now he quit, and formed Cloudera. He did explicitly say that in the middle there will always be SQL, but people use Hadoop/MR for the ETL layer… and R and other tools for analytics and reporting.

As far as I can see Google is also gently moving towards not quite a database yet, but Jeff Dean is building Bloom filters and other indexes into BigTable. So even if it is NoSQL, some of their stuff starts to resemble a database….

So I think there is a general agreement that indices are useful, but for large scale data analytics, we do not need full ACID, transactions are much more a burden than an advantage. And there is a lot of religion there, of course.

I would put it in such a way, that there is a phase transition coming, and there is an increasing diversification, where there were only three DB vendors 5 years ago, now there are many options and a broad spectrum of really interesting niche tools. In a healthy ecosystem everything is a 1/f power law, and we will see a much bigger diversity. And this is great for academic research. “In every crisis there is an opportunity” — we again have a chance to do something significant in academia.

RVZ: The National Science Foundation has awarded a $2M grant to you and your team of co-investigators from across many scientific disciplines, to build a 5.2 Petabyte Data-Scope, a new instrument targeted at analyzing the huge data sets emerging in almost all areas of science. The instrument will be a special data-supercomputer, the largest of its kind in the academic world.

What is the project about?

Alex Szalay: We feel that the Data-Scope is not a traditional multi-user computing cluster, but a new kind of instrument, that enables people to do science with datasets ranging between 100TB and 1000TB.
This is simply not possible today. The task is much more, than just throw the necessary storage together.
It requires a holistic approach: the data must be first brought to the instrument, then staged, and then moved to the computing nodes that have both enough compute power and enough storage bandwidth (450GBps) to perform the typical analyses, and then the (complex) analyses must be performed.

RVZ: Could you please explain what are the main challenges that this project poses?

Alex Szalay: It would be quite difficult, if not outright impossible to develop a new instrument with so many cutting-edge features without adequately considering all aspects of the system, beyond the hardware. We need to write at least a barebones set of system management tools (beyond the basic operating system etc), and we need to provide help and support for the teams who are willing to be at the “bleeding-edge” to be able to solve their big data problems today, rather than wait another 5 years, when such instruments become more common.
This is why we feel that our proposal reflects a realistic mix of hardware and personnel, which leads to a high probability of success.

The instrument will be open for scientists beyond JHU. There was an unbelievable amount if interest just at JHU in such an instrument, since analyzing such data sets is beyond the capability of any group on campus. There were 20 groups with data sets totaling over 2.8PB just within JHU, who would use the facility immediately, if it was available. We expect to go no-line at the end of this summer.

References

[Szalay2010]
Extreme Data-Intensive Computing (.pdf)
Alex Szalay, The Johns Hopkins University, 2010.

[Gray,Szalay 2004]
Where the Rubber Meets the Sky: Bridging the Gap between Databases and Science.
Jim Gray,Microsoft Research and Alex Szalay,Johns Hopkins University.
IEEE Data Engineering Bulletin and Technical Report, MSR-TR-2004-110, Microsoft Research, 2004

Friedrich Nietzsche,
Thus Spoke Zarathustra: a Book for Everyone and No-one. (Also Sprach Zarathustra: Ein Buch für Alle und Keinen) – written between 1883 and 1885.

Related Posts

Objects in Space

Objects in Space: “Herschel” the largest telescope ever flown.

Objects in Space. –The biggest data processing challenge to date in astronomy: The Gaia mission.–

Big Data

Hadoop for Business: Interview with Mike Olson, Chief Executive Officer at Cloudera.

The evolving market for NoSQL Databases: Interview with James Phillips.

ODBMS Industry Watch

On Versant`s technology. Interview with Vishal Bagga.

How good is UML for Database Design? Interview with Michael Blaha.

Big Data and the European Commission: Call for Project Proposals.

“Applying Graph Analysis and Manipulation to Data Stores.”

Interview with Iran Hutchinson, Globals.

Interview with Jonathan Ellis, project chair of Apache Cassandra.

Objects in Space vs. Friends in Facebook.

About the author

Archives

Meta

About

Flickr

Search

About the author

Tags

Archives

Meta

About

Flickr

Search