
"Trends and Information on AI, Big Data, Data Science, New Data Management Technologies, and Innovation."

This is the Industry Watch blog. The complete ODBMS.org website offers articles, downloads and industry information.

Aug 9 11

Google Fusion Tables. Interview with Alon Y. Halevy

by Roberto V. Zicari

“The main challenge is that it’s hard for people who have data, but not database expertise, to manage their data.” –Alon Y. Halevy.

Google Fusion Tables was launched on June 9th, 2009. I wanted to know what happened since then. I have therefore interviewed Dr. Alon Y. Halevy, who heads the Structured Data Group at Google Research.

RVZ

Q1. On your web page you write that your job at Google is “to make data management tools collaborative and much easier to use, and to leverage the incredible collections of structured data on the Web.” What are the main problems and challenges you are currently facing?

Halevy: The main challenge is that it’s hard for people who have data, but not database expertise, to manage their data, share it and create visualizations. Data management requires too much up-front effort, and that is a significant impediment to data sharing on a large scale.

Q2. Your group is responsible for Google Fusion Tables, “a service for managing data in the cloud that focuses on ease of use, collaboration and data integration.” What exactly is Google Fusion Tables? Which data management challenges does it solve, and how?

Halevy: Fusion Tables enables you to easily upload data sets (e.g., spreadsheets and CSV files) to the cloud and manage them. Fusion Tables makes it easy to create insightful visualizations (e.g., maps, timelines and other charts) and to share these with collaborators or with the public at large.

In addition, Fusion Tables enables merging data sets that belong to different owners. The true power of data is realized when we can combine data from multiple sources and draw conclusions that were impossible earlier.

As an example, this visualization combines data about earthquakes and data about the location of nuclear power reactors, showing what areas are prone to disasters similar to the one experienced in Japan in March 2011.

Q3. Fusion Tables enables users to upload tabular data files. Is there a limit on the size of such user data files? What is it?

Halevy: Right now we allow 100MB per table and 250MB per user, but that’s not a technical limitation, just a limitation of our free offering.

Q4. Google Fusion Tables was launched on June 9th, 2009. What has happened since then?

Halevy: We’ve continually improved our service based largely on needs expressed by our users. In particular, we’ve made our map visualizations much more powerful, and we’ve developed APIs for programmers.

Q5. What data sources do you consider and how do you integrate them together?

Halevy: We do not prescribe any data sources. All the sources you can obtain in Fusion Tables come from users who’ve explicitly marked their data sets as public. Data sources are combined with a merge operation that’s similar to a ‘join’ in SQL.
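The merge semantics Halevy describes can be sketched in a few lines. This is purely illustrative (all names are hypothetical, and Fusion Tables' actual implementation is of course different): an equi-join over two tables represented as lists of dicts, in the spirit of the earthquake/reactor visualization mentioned earlier.

```python
def merge(left, right, key):
    """Equi-join two tables (lists of dicts) on a shared key column,
    mimicking a Fusion Tables-style merge of two public data sets."""
    # Index the right-hand table by the join key for fast lookup.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    merged = []
    for row in left:
        for match in index.get(row[key], []):
            combined = dict(row)   # copy left row, then add right columns
            combined.update(match)
            merged.append(combined)
    return merged

quakes = [{"region": "Tohoku", "magnitude": 9.0}]
reactors = [{"region": "Tohoku", "plant": "Fukushima Daiichi"}]
print(merge(quakes, reactors, "region"))
```

Rows from either side with no partner on the join key simply drop out, which matches the equi-join behavior described in [1].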

Q6. How do you ensure good performance in the presence of millions of user tables? Do you have any benchmark results for that?

Halevy: Given that we’re the first ones to pursue storing such a large number of tables in a single repository, there isn’t an established benchmark. The technical description of how we do it appears in two short papers that we published in SIGMOD 2010 [2] and SoCC 2010 [1].

Q7. One of the main features of Fusion Tables is that it “allows multiple users to merge their tables into one, even if they do not belong to the same organization or were not aware of each other when they created the tables.
A table constructed by merging (by means of equi-joins) multiple base tables is a view” [1]. What about data consistency, duplicates and updates? Do you handle such cases?

Halevy: The data belongs to the users, not to us, so they have to ensure that it is up to date and does not contain duplicates. Of course, when you combine data from multiple sources you may get inconsistencies. Hopefully, the visualizations we provide will enable you to discover them quickly and resolve them.

Q8. How does Google Fusion Tables relate to Google Maps? What are the main data management challenges you face when dealing with Big Data coming from Google Maps?

Halevy: Fusion Tables relies on a lot of the Google Maps infrastructure to display maps. The challenge when displaying maps from large data sets is that you need to do much of the computation on the server side, so the client is not overwhelmed with points to render, while keeping the user experience snappy and interactive.
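One common server-side technique for the problem Halevy describes is spatial binning: collapse many raw points into one aggregate per coarse grid cell before anything reaches the client. The sketch below is a generic illustration, not Fusion Tables' actual rendering pipeline.

```python
from collections import defaultdict

def thin_points(points, cell_size):
    """Server-side point thinning: bucket (lat, lng) points into a
    coarse grid and return a count per occupied cell, so the client
    renders a bounded number of markers regardless of data set size."""
    cells = defaultdict(int)
    for lat, lng in points:
        cells[(int(lat // cell_size), int(lng // cell_size))] += 1
    return dict(cells)

# Two points near Tokyo fall into one cell; one near Paris into another.
points = [(35.1, 139.2), (35.2, 139.3), (48.8, 2.3)]
print(thin_points(points, 1.0))
```

Zooming in simply means re-running the aggregation with a smaller `cell_size` over the visible bounding box.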

Q9. Why and how do you manage data in the Cloud?

Halevy: Managing data in the cloud is easier for many data owners because they do not have to maintain their own database system (which requires hiring database experts). Putting data in the cloud is also a key facilitator for sharing data with others, including people outside your organization.

We manage the data using some of the Google infrastructure such as BigTable and some layers built over it.

Q10. When you started Google Fusion Tables you did not support complex SQL queries or high throughput transactions. Why? How is the situation now?

Halevy: We still don’t support all forms of SQL queries, and we’re not in the race to become the database system supporting the highest transaction throughput. There are plenty of products on the market that serve those needs. Our goal from the start was to help under-served users with data management tasks that typically do not require complex SQL queries or high-throughput transactions, but rather emphasize data sharing and visualization.

Q11. Fusion Tables aims to “make it easier for people to create, manage and share structured data on the Web.”
What about handling unstructured data on the Web?

Halevy: There are plenty of other tools for that, including blogs, site creation tools and cloud-based word processors.

Q12. You write “…to facilitate collaboration, users can conduct fine-grained discussions on the data.” What does it mean?

Halevy: This means that users can attach comments to individual rows in a table, to individual columns and even to individual cells. If you are collaborating on a large data set, it is not enough to put all the comments in one big blob around the table. You really need to attach them to specific pieces of data.
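One simple way to model such fine-grained comment targets (a sketch only, not the actual Fusion Tables data model) is to key each discussion thread by an optional row and column, with `None` acting as a wildcard:

```python
from collections import defaultdict

class CommentStore:
    """Attach discussion threads to a table, a row, a column, or a cell.
    The key (row, col) uses None as a wildcard: (None, None) addresses
    the whole table, (3, None) row 3, (None, "magnitude") a column,
    and (3, "magnitude") a single cell."""
    def __init__(self):
        self.threads = defaultdict(list)

    def add(self, text, row=None, col=None):
        self.threads[(row, col)].append(text)

    def comments_on(self, row=None, col=None):
        return self.threads[(row, col)]

store = CommentStore()
store.add("Source for this figure?", row=3, col="magnitude")
store.add("Column units need checking", col="magnitude")
print(store.comments_on(row=3, col="magnitude"))  # ['Source for this figure?']
```

The point of the scheme is that a cell-level question and a column-level remark live in separate threads instead of one undifferentiated blob.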

Q13. Is Google Fusion Tables open source? Do you have plans to open up the API to developers?

Halevy: We have had an API for almost two years now. Fusion Tables is not open source; it’s a Google service built on top of Google’s infrastructure.

Q14. Fusion Tables API provides developers with a way of extending the functionality of the platform. What are the main extensions being implemented by the community so far?

Halevy: Tools have been developed for importing data from different formats into Fusion Tables (e.g., Shape to Fusion). There is also a tool for importing Fusion Tables into the R statistical package and outputting the results back into Fusion Tables.

The API is used mostly to tailor applications using Fusion Tables to specific needs.

Q15. How does Fusion Tables differ from Amazon’s SimpleDB?

Halevy: Fusion Tables is designed primarily as a user-facing tool, not so much for developers. I should emphasize that Fusion Tables is not part of Google Labs anymore; it “graduated” over a year ago.

Q16. In 2005 you co-authored a paper introducing the concept of “Dataspace Systems”, that is, systems which provide “pay-as-you-go data management based on best-effort services.” Is this still relevant? What exactly are these “Dataspace Systems”?

Halevy: Yes, it is. In fact, Fusion Tables is one example of a dataspace system. Fusion Tables does not require you to create a schema before entering the data, and it tries to infer the data types of the columns in order to offer relevant visualizations.
The collaborative aspects of Fusion Tables make it easier for a group of collaborators to improve the quality of the data and combine it with others.
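This pay-as-you-go idea of inferring column types rather than demanding a schema up front can be illustrated with a tiny type sniffer. It is purely illustrative (Fusion Tables' real inference also handles dates, locations, and more):

```python
def infer_type(values):
    """Guess a column's type from its string values: try int, then
    float, and fall back to text. Empty cells are ignored."""
    def all_parse(parse):
        try:
            for v in values:
                if v != "":
                    parse(v)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "number"   # whole-number column
    if all_parse(float):
        return "number"   # decimal column
    return "text"

print(infer_type(["3.5", "7", ""]))     # number
print(infer_type(["Tokyo", "Osaka"]))   # text
```

Once a column is tagged "number" the system can offer charts; a "text" column might instead drive filters or map labels.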

Dataspace systems are still in their infancy, and we have a long way to go to realize the full vision.

Q17. You have worked on the Deep Web, the Surface Web, and now Fusion Tables: how do these three areas relate to each other?
What is your next project?

Halevy: All of these projects have the same overall goal: to make structured data on the Web more discoverable, so users can enhance it, combine it with data from other sources, and create and publish interesting new data sets.

The Deep Web project had the goal of extracting data sets from behind HTML forms and making the data discoverable in search.
The Surface Web (WebTables) project’s goal was to identify interesting data sets that are on the Web but are not being treated optimally. Fusion Tables provides a tool for users to upload their own data and publish data sets that can be crawled by search engines.

These three projects (and filling in the gaps between them) will keep me busy for a while to come!

———————
Dr. Alon Halevy heads the Structured Data Group at Google Research. Prior to that, he was a Professor of Computer Science at the University of Washington, where he founded the Database Research Group. From 1993 to 1997 he was a Principal Member of Technical Staff at AT&T Bell Laboratories (later AT&T Laboratories). He received his Ph.D. in Computer Science from Stanford University in 1993, and his Bachelor's degree in Computer Science and Mathematics from the Hebrew University in Jerusalem in 1988. Dr. Halevy was elected Fellow of the Association for Computing Machinery in 2006.
————————-

Resources:

[1] Google Fusion Tables: Web-Centered Data Management and Collaboration (link .pdf).

Hector Gonzalez, Alon Halevy, Christian S. Jensen, Anno Langen, Jayant Madhavan, Rebecca Shapley, Warren Shen, Google Inc.
in SoCC’10, June 10-11, 2010, Indianapolis, Indiana, USA.

[2] Megastore: A Scalable Data System for User Facing Applications.
J. Furman, J. S. Karlsson, J.-M. Leon, A. Lloyd, S. Newman, and P. Zeyliger. In SIGMOD, 2008.

[3] Megastore: Providing Scalable, Highly Available Storage for Interactive Services (link .pdf)
Jason Baker, Chris Bond, James C. Corbett, JJ Furman, Andrey Khorlin, James Larson, Jean-Michel Leon, Yawei Li, Alexander Lloyd, Vadim Yushprakh, Google, Inc. In 5th Biennial Conference on Innovative Data Systems Research (CIDR ’11), January 9-12, 2011, Asilomar, California, USA.

Google Fusion Tables (link)

Fusion Tables API (link)

Amazon’s SimpleDB (link)

Deep-web crawl Project (Download .pdf).

WebTables Project (Download .pdf)

——————–

Jul 25 11

How good is UML for Database Design? Interview with Michael Blaha.

by Roberto V. Zicari

“The tools are not good at taking changes to a model and generating incremental design code to alter a populated database.” — Michael Blaha

The Unified Modeling Language™ – UML – is OMG’s most-used specification.
UML is a de facto standard for object modeling, and it is often used for database design as well. But how good is UML really for the task of database conceptual modeling?
I asked a few questions to Dr. Michael Blaha, one of the leading authorities on databases and data modeling.

RVZ

Q1. Why use UML for database design?

Blaha: Often the most difficult aspect of software development is abstracting a problem and thinking about it clearly — that is the purpose of conceptual data modeling.
A conceptual model lets developers think deeply about a system, understand its core essence, and then choose a proper representation. A sound data model is extensible, understandable, ready to implement, less prone to errors, and usually performs well without special tuning effort.

The UML is a good notation for conceptual data modeling. The representation stands apart from implementation choices, be it a relational database, an object-oriented database, files, or some other mechanism.

Q2. What are the main weaknesses of UML for database design? And how do you cope with them in practice?

Blaha: First consider object databases. The design of object database code is similar to the design of OO programming code. The UML class model specifies the static data structure. The most difficult implementation issue is the weak support in many object database engines for associations. The workaround depends on object database features and the application architecture.

Now consider relational databases. Relational database tools do not support the UML. There is no technical reason for this, but there are several cultural reasons. One is the divide between the programming and database communities; each has its own jargon, style, and history, and pays little attention to the other.
Also, the UML creators focused on unifying programming notation but spent little time talking to the database community.
The bottom line is that the relational database tools do not support the UML and the UML tools do not support relational databases. In practice, I usually construct a conceptual model with a UML tool (so that I can think deeply and abstractly).
Then I rekey the model into a database tool (so that I can generate schema).

Q3. Even if you have a sound high level UML design, what else can get wrong?

Blaha: I do lots of database reverse engineering for my consulting clients, mostly for relational database applications because that’s what’s used most often in practice. I start with the database schema and work backwards to a conceptual model. I published a paper 10 years ago with statistics for what does go wrong.

In practice, I would say that about 25% of applications have a solid conceptual model, 50% have a mediocre conceptual model, and 25% are just downright awful. Given that a conceptual model is the foundation for an application, you can see why many applications go awry.

In practice, about 50% of applications have a professional database design and 50% are substantially flawed. It’s odd to see so many database design mistakes, given the wide availability of database design tools. It’s relatively easy to take a conceptual model and generate a database design. This shows that the importance of software engineering has not yet reached many developers.

Of course, there can always be flaws in programming logic and user interface code, but these kinds of flaws are easier to correct if there is a sound conceptual model underlying the application and if the model is implemented well with a database schema.

Q4. And specifically for object databases?

Blaha: An object database is nothing special when it comes to the benefits of a sound model and software engineering practice. A carefully considered conceptual model gives you a free hand to choose the most appropriate development platform.

One of my past books (Object-Oriented Modeling and Design for Database Applications) vividly illustrated this point by driving object-oriented data models into different implementation targets, specifically relational databases, object databases, and flat files.

Q5. What are most common pitfalls?

Blaha: It is difficult to construct a robust conceptual model. A skilled modeler must quickly learn the nuances of a problem domain and be able to meld problem content with data abstractions and data patterns.

Another pitfall is failing to perform agile development. Developers must work quickly, deliver often, obtain feedback, and build on prior results to evolve an application. I have seen too many developers fail to take the principles of agile development to heart and become bogged down by ponderous development of interminable scope.

Another pitfall is that some developers are sloppy with database design. Nowadays there really is no excuse for that, as tools can generate database code. Object-oriented CASE tools can generate programming stubs that can seed an object database.
For relational database projects, I first construct an object-oriented model, then re-enter the design into a relational database tool, and finally generate the database schema. (The UML data modeling notation is nearly isomorphic with the modeling language in most relational database design tools.)
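The model-to-schema step Blaha describes (conceptual class model in, DDL out) can be sketched as a toy generator. This is only an illustration of what a CASE tool automates; real tools also derive associations, foreign keys, and constraints, and all names below are hypothetical:

```python
def ddl_for(entity, attributes):
    """Emit a CREATE TABLE statement for one conceptual-model class.
    `attributes` maps attribute name -> SQL type. A surrogate primary
    key is synthesized from the class name."""
    cols = [f"  {name} {sqltype}" for name, sqltype in attributes.items()]
    cols.insert(0, f"  {entity.lower()}_id INTEGER PRIMARY KEY")
    return f"CREATE TABLE {entity} (\n" + ",\n".join(cols) + "\n);"

print(ddl_for("Customer", {"name": "VARCHAR(80)", "signup_date": "DATE"}))
```

Because the mapping is mechanical, the interesting design work stays where Blaha says it belongs: in the conceptual model, not in the generated schema.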

Q6. In your experience, how do you handle the situation when a UML conceptual database design is done and a database is implemented using such design, but then later on, updates to the implementation are done without considering the original conceptual design. What to do in such cases?

Blaha: The more common situation is that an application gradually evolves and the software engineering documentation (such as the conceptual model) is not kept up to date.
With a lack of clarity for its intellectual focus, an application gradually degrades. Eventually there has to be a major effort to revamp the application and clean it up, or replace the application with a new one.

The database design tools are good at taking a model and generating the initial database design.
The tools are not good at taking changes to a model and generating incremental design code to alter a populated database.
Thus much manual effort is needed to make changes as an application evolves and keep documentation up to date. However, the alternative of not doing so is an application that eventually becomes a mess and is unmaintainable.
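To make the incremental-design gap concrete, here is a deliberately naive migration sketch (hypothetical code, not any tool's actual output). Pure additions and drops are easy to diff mechanically; what the tools cannot safely automate is deciding whether a changed column is a rename, a type change, or a drop-plus-add on a populated table.

```python
def migration_ddl(table, old_cols, new_cols):
    """Diff two column sets (name -> SQL type) and emit ALTER TABLE
    statements for additions and removals. Renames and type changes
    on populated tables are exactly the cases this cannot decide."""
    stmts = []
    for name, sqltype in new_cols.items():
        if name not in old_cols:
            stmts.append(f"ALTER TABLE {table} ADD COLUMN {name} {sqltype};")
    for name in old_cols:
        if name not in new_cols:
            stmts.append(f"ALTER TABLE {table} DROP COLUMN {name};")
    return stmts

print(migration_ddl(
    "Customer",
    {"name": "VARCHAR(80)"},
    {"name": "VARCHAR(80)", "email": "VARCHAR(120)"},
))
```

A human still has to review every emitted DROP and supply data-migration logic, which is the manual effort Blaha refers to.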

————————————————–
Michael Blaha is a partner at Modelsoft Consulting Corporation.
Dr. Blaha is recognized as one of the world’s leading authorities on databases and data modeling. He has more than 25 years of experience as a consultant and trainer in conceiving, architecting, modeling, designing, and tuning databases for dozens of major organizations around the world. He has authored six U.S. patents, six books, and many papers. Dr. Blaha received his doctorate from Washington University in St. Louis and is an alumnus of GE Global Research in Schenectady, New York.

Related Resources

OMG UML Resource Page.

Object-Oriented Design of Database Stored Procedures, By Michael Blaha, Bill Huth, Peter Cheung

Models, By Michael Blaha

Universal Antipatterns, By Michael Blaha

Patterns of Data Modelling (Database Systems and Applications), Michael Blaha, CRC Press, May 2010, ISBN 1439819890

Jul 15 11

Big Data and the European Commission: Call for Project Proposals.

by Roberto V. Zicari

I thought that this information is of interest for the data management community:

The European Commission has a budget for funding projects in the area of Intelligent Information Management.
It is currently seeking Project Submissions in this area.

It is not always easy to understand the official documents published by the European Commission… I have therefore tried to develop a set of easy-to-read questions and answers for those of you who might be interested in knowing more.

Hope it helps

RVZ

– What is it?

The European Commission has a budget for funding projects in the area of Intelligent Information Management.

– Which Program is it?

Formally the Program is called: “Work Programme 2011-2012 of the FP7 Specific Programme ‘Cooperation’, Theme 3, ICT – Information and Communication Technologies”, which has been published on 19 July 2010.

The Work Programme contains in Challenge 4: ‘Technologies for Digital Content and Languages’ the objective ICT-2011.4.4 – Intelligent Information Management.

The objective ICT 2011-4.4 is part of call 8, which is expected to be published July 26, 2011.

– What kind of projects is the European Commission looking to support?

Projects that develop and test new technologies to manage and exploit extremely large volumes of data, with real time capabilities whenever this is relevant. Projects must be the joint work of consortia of partners.
It is very important that at least one member of your consortium be willing and able to make available the large volumes of data needed to test your ideas.

– What kind of support is the EU giving to accepted project proposals?

Depends on the type of partner and the type of proposals. For the most common type of proposal, up to 75% of direct costs for research institutions and small or medium enterprises.

– Why should I submit a proposal?

To join other talented people to solve a problem that you couldn’t solve alone and with your own resources.

– What are the benefits for me and/or for my organization to participate?

Funding is a clear benefit, but it should be thought of as a means to a real end: advancing the state of the art (for scientific or business objectives) while working with people from all over Europe who have very strong skills complementary to yours.

– What is required if the proposal is accepted?

Entering a grant agreement with the European Commission, committing to an agreed plan of work, opening your work to the evaluation of peers selected by the Commission, periodically reporting your costs using agreed standards, being open to audits throughout the duration of the project and for a few years afterwards.

– How do I qualify to participate?

The full rules for participation (which classify countries in various categories with different types of access to the programme) are available here.
Most participants are legal entities established in the EU.

– How can I participate?

You need to become part of an eligible consortium (see again rules for participation above) and submit a proposal that addresses the specific requirements of the call.

– When is the deadline?

Expected to be 17 January 2012 17:00 Brussels time. Please refer to the official text of the call when it is published.

– How do I submit a proposal?

You need to fill in the forms and submit your proposal using the submission system.

– Can I see other proposals submitted in the past?

This is not really possible because the Commission has an obligation of confidentiality to past submitters.

– Can I see some projects funded by the EU in the past in the same area?

Certainly. Please visit content-knowledge/projects_en.

– How do I get more info?

Further details on the scope of the call, individual research lines and indicative budget are provided in the Work Programme 2011-2012.

You can also write to: infso-e2@ec.europa.eu

Note: These calls are highly competitive. It is thus important that your idea be really innovative and that your plans for implementing and testing it be really concrete and credible. It is also important for you to find the right partners and to work with them as a team. This requires a joint vision based on shared objectives.
This means in turn that you will need to work as hard at figuring out how your skills can help others as at figuring out who could help you with what you want to do.

##

Jun 22 11

“Applying Graph Analysis and Manipulation to Data Stores.”

by Roberto V. Zicari

” This mind set is much different from the set theoretic notions of the relational database world. In the world of graphs, everything is seen as a walk—a traversal. ” — Marko Rodriguez and Peter Neubauer.
__________________________________________________

Interview with Marko Rodriguez and Peter Neubauer.

The open source community is quite active in the area of graph analysis and manipulation, and its applicability to new data stores. I wanted to know more about an open source initiative called “TinkerPop”.
I have interviewed Marko Rodriguez and Peter Neubauer, who are the leaders of the TinkerPop project.

RVZ

Q1. You recently started a project called “TinkerPop”. What is it?

Marko Rodriguez and Peter Neubauer:
TinkerPop is an open-source graph software group. Currently, we provide a stack of technologies (called the TinkerPop stack) and members contribute to those aspects of the stack that align with their expertise. The stack starts just above the database layer (just above the graph persistence layer) and connects to various graph database vendors — e.g. Neo4j, OrientDB, DEX, RDF Sail triple/quad stores, etc.

The graph database space is relatively nascent. At the time TinkerPop started back in 2009, graph database vendors were primarily focused on graph persistence issues: storing and retrieving a graph structure to and from disk. Given the expertise of the original TinkerPop members (Marko, Peter, and Josh), we decided to take our research (from our respective institutions) and apply it to the creation of tools one step above the graph persistence layer. Out of that effort came Gremlin, the first TinkerPop project. In late 2009, Gremlin was pieced apart into multiple self-contained projects: Blueprints and Pipes.
From there, other TinkerPop products have emerged which we discuss later.

Q2. Who currently work on “Tinkerpop”?

Marko Rodriguez and Peter Neubauer:
The current members of TinkerPop are Marko A. Rodriguez (USA), Peter Neubauer (Sweden), Joshua Shinavier (USA), Stephen Mallette (USA), Pavel Yaskevich (Belarus), Derrick Wiebe (Canada), and Alex Averbuch (New Zealand).
However, recently, while not yet an “official member” (i.e. picture on website), Pierre DeWilde (Belgium) has contributed much to TinkerPop through code reviews and community relations. Finally, we have a warm, inviting community where users can help guide the development of the TinkerPop stack.

Q3. You say, that you intend to provide higher-level graph processing tools, APIs and constructs? Who needs them? and for what?

Marko Rodriguez and Peter Neubauer:
TinkerPop facilitates the application of graphs to various problems in engineering. These problems are generally those that require expressivity and speed when traversing a joined structure. The joined structure is provided by a graph database. With a graph database, a user does not arbitrarily join two tables according to some predicate, as there is no notion of tables.
There only exists a single atomic structure known as the graph. However, in order to unite disparate data, a traversal is enacted that moves over the data in order to yield some computational side-effect (e.g., a search, a score, a rank, a pattern match).
The benefit of the graph comes from being able to rapidly traverse structures to an arbitrary depth (e.g., tree structures, cyclic structures) and with an arbitrary path description (e.g. friends that work together, roads below a certain congestion threshold). Moreover, this space provides a unique way of thinking about data processing.
We call this data processing pattern, the graph traversal pattern.
This mindset is much different from the set-theoretic notions of the relational database world. In the world of graphs, everything is seen as a walk, a traversal.
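A minimal illustration of that traversal mindset over an adjacency-list graph (hypothetical code; Gremlin expresses the same walk far more concisely) might look like this:

```python
def traverse(graph, start, edge_label, depth):
    """Walk out from `start`, following only edges with the given
    label, for `depth` steps; return the set of vertices reached.
    `graph` maps a vertex to a list of (label, destination) edges."""
    frontier = {start}
    for _ in range(depth):
        frontier = {
            dst
            for src in frontier
            for label, dst in graph.get(src, [])
            if label == edge_label
        }
    return frontier

# Friends-of-friends: a two-step walk over "knows" edges only.
graph = {
    "alice": [("knows", "bob")],
    "bob": [("knows", "carol"), ("works_with", "dave")],
}
print(traverse(graph, "alice", "knows", 2))  # {'carol'}
```

The path description ("knows" edges only) and the depth are the two knobs the text mentions: arbitrary path descriptions over structures of arbitrary depth, with no join planning anywhere in sight.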

Q4. Why using graphs and not objects and/or classical relations? What about non normalized data structures offered by NoSQL databases?

Marko Rodriguez and Peter Neubauer:
In a world where memory is expensive, hybrid memory/disk technology is a must (colloquially, a database).
A graph database is nothing more than a memory/disk technology that allows for the rapid creation of an in-memory object (sub)graph from a disk-based (full)graph. A traversal (the means by which data is queried/processed) is all about filling in memory those aspects of the persisted graph that are being touched as the traverser moves along the graph’s vertices and edges.
Graph databases simply cache what is on disk into memory which makes for a highly reusable in-memory cache.
In contrast, with a relational database, where any table can be joined with any table, many different data structures are constructed from the explicit tables persisted. Unlike a relational database, a graph database has one structure, itself.
Thus, components of itself are always reusable. Hence, a “highly reusable cache.” Given this description, if a persistence engine is sufficiently fast at creating an in-memory cache, then it meets the processing requirements of a graph database user.

Q5. Besides graph databases, who may need Tinkerpop tools? Could they be useful for users of relational databases as well? or of other databases, like for example NoSQL or Object Databases? If yes, how?

Marko Rodriguez and Peter Neubauer:
In the end, the TinkerPop stack is based on the low-level Blueprints API.
By implementing the Blueprints API and making it sufficiently speedy, any database can, in theory, provide graph processing functionality. So yes, TinkerPop could be leveraged by other database technologies.

Q6. Tinkerpop is composed of several sub projects: Gremlin, Pipes, Blueprints and more. At a first glimpse, it is difficult to grasp how they are related to each other. What are all these sub projects? do they all relate with each other?

Marko Rodriguez and Peter Neubauer:
The TinkerPop stack is described from bottom-to-top:
Blueprints: A graph API with an operational semantics test suite that when implemented, yields a Blueprints-enabled graph database which is accessible to all TinkerPop products.
Pipes: A data flow framework that allows for lazy graph traversing.
Gremlin: A graph traversal language that compiles down to Pipes.
Frames: An object-to-graph mapper that turns vertices and edges into objects and relations (and vice versa).
Rexster: A RESTful graph server that exposes the TinkerPop suite “over the wire.”

Q7. Is there a unified API for Tinkerpop? And if yes, how does it look like?

Marko Rodriguez and Peter Neubauer:
Blueprints is the foundation of TinkerPop.
You can think of Blueprints as the JDBC of the graph database community. Many graph vendors, while providing their own APIs, also provide a Blueprints implementation so the TinkerPop stack can be used with their database. Currently, Neo4j, OrientDB, DEX, RDF Sail, TinkerGraph, and Rexster are all TinkerPop promoted/supported implementations.
However, out there in the greater developer community, there exists an implementation for HBase (GraphBase) and Redis (Blueredis). Moreover, the graph database vendor InfiniteGraph plans to release a Blueprints implementation in the near future.
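The JDBC analogy can be sketched as a minimal provider-neutral graph interface (a Python sketch with hypothetical names; the real Blueprints API is a Java interface backed by an operational-semantics test suite):

```python
class Graph:
    """A tiny Blueprints-style abstraction: any backend implementing
    add_vertex/add_edge/out can, in principle, plug into tooling
    written against this interface, the way drivers plug into JDBC."""
    def add_vertex(self, vid): raise NotImplementedError
    def add_edge(self, src, label, dst): raise NotImplementedError
    def out(self, vid, label): raise NotImplementedError

class InMemoryGraph(Graph):
    """A trivial in-memory backend, standing in for Neo4j, OrientDB, etc."""
    def __init__(self):
        self.adj = {}
    def add_vertex(self, vid):
        self.adj.setdefault(vid, [])
    def add_edge(self, src, label, dst):
        self.add_vertex(src)
        self.add_vertex(dst)
        self.adj[src].append((label, dst))
    def out(self, vid, label):
        return [dst for l, dst in self.adj.get(vid, []) if l == label]

g = InMemoryGraph()
g.add_edge("marko", "created", "gremlin")
print(g.out("marko", "created"))  # ['gremlin']
```

Everything higher in the stack (Pipes, Gremlin, Rexster) targets only the interface, which is why a conformant implementation is all a vendor needs to supply.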

Q8. In your projects you speak of “dataflow-inspired traversal models”. What is it?

Marko Rodriguez and Peter Neubauer:
Data flow graph processing, in the Pipes/Gremlin-sense, is a lazy iteration approach to graph traversing.
In this model, chains of pipes are connected. Each pipe is a computational step that is one of three types of operations: transform, filter, or side-effect.
A transformation pipe will take data of one type and emit data of another type. For example, given a vertex, a pipe will emit its outgoing edges. A filter pipe will take data and either emit it or not. For example, given an edge, emit it if its label equals “friend.” Finally, a side-effect will take data and emit the same data, however, in the process, it will yield some side-effect.
For example, increment a counter, update a ranking, print a value to standard out, etc.
Pipes is a library of general purpose pipes that can be composed to effect a graph traversal based computation. Finally, Gremlin is a DSL (domain specific language) that supports the concise specification of a pipeline. The Gremlin code base is actually quite small — all of the work is in Pipes.

Q9. How other developers could contribute to this project?

Marko Rodriguez and Peter Neubauer:
New members tend to start as users. A user gets excited about a particular product or some tangent idea that is generally useful to the community. They provide thoughts, code, and ultimately, if they “click” with the group (coding style, approach, etc.), they become members. For example, Stephen Mallette was very keen on advancing Rexster and, as such, has worked and continues to work wonders on the server codebase.
Pavel Yaskevich was interested in the compiler aspects of Gremlin and contributed on that front through many versions. Pavel is also a contributing member to Cassandra’s recent query language, known as CQL.
Derrick Wiebe has contributed a lot to Pipes and, in his day job, needed to advance particular aspects of Blueprints (which luckily benefits others). There are no hard rules to membership. Primarily it’s about excitement, dedication, and expert-level development.
In the end, the community requires that TinkerPop be a solid stack of technologies that is well thought out and consistent throughout. In TinkerPop, it’s less about features and lines of code than about a consistent story that resonates well with those succumbing to the graph mentality.

____________________________________________________________________________________

Marko A. Rodriguez:
Dr. Marko A. Rodriguez currently owns the graph consulting firm Aurelius LLC. Prior to this venture, he was a Director’s Fellow at the Center for Nonlinear Studies at the Los Alamos National Laboratory and a Graph Systems Architect at AT&T.
Marko’s work for the last 10 years has focused on the applied and theoretical aspects of graph analysis and manipulation.

Peter Neubauer:
Peter Neubauer has been deeply involved in programming for over a decade and is co-founder of a number of popular open source projects such as Neo4j, TinkerPop, OPS4J and Qi4j. Peter loves connecting things, writing novel prototypes and throwing together new ideas and projects around graphs and society-scale innovation.
Right now, Peter is the co-founder and VP of Product Development at Neo Technology, the company sponsoring the development of the Neo4j graph database.
If you want brainstorming, feed him a latte and you are in business.

______________________________________

For further readings

Graphs and Data Stores:
Blog Posts | Free Software | Articles, Papers, Presentations| Tutorials, Lecture Notes

Related Posts

“On Graph Databases: Interview with Daniel Kirstenpfad”.

“Marrying objects with graphs”: Interview with Darren Wood.

“Interview with Jonathan Ellis, project chair of Apache Cassandra”.

“The evolving market for NoSQL Databases: Interview with James Phillips.”

_________________________

Jun 13 11

Interview with Iran Hutchinson, Globals.

by Roberto V. Zicari

“The newly launched Globals initiative is not about creating a new database.
It is, however, about exposing the core multi-dimensional arrays directly to developers.” — Iran Hutchinson.

__________________________

InterSystems recently launched a new initiative: Globals.
I wanted to know more about Globals. I have therefore interviewed Iran Hutchinson, software/systems architect at InterSystems and one of the people behind the Globals project.

RVZ

Q1. InterSystems recently launched a new database product: Globals. Why a new database? What is Globals?

Iran Hutchinson: InterSystems has continually provided innovative database technology to its technology partners for over 30 years. Understanding customer needs to build rich, high-performance, and scalable applications resulted
in a database implementation with a proven track record. The core of the database technology is multi-dimensional arrays (aka globals).
The newly launched Globals initiative is not about creating a new database. It is, however, about exposing the core multi-dimensional arrays directly to developers. By closely integrating access into development technologies like Java and JavaScript, developers can take full advantage of high-performance access to our core database components.

We undertook this project to build much broader awareness of the technology that lies at the heart of all of our products. In doing so, we hope to build a thriving developer community conversant in the Globals technology, and aware of the benefits to this approach of building applications.

Q2. You classify Globals as a NoSQL-database. Is this correct? What are the differences and similarities of Globals with respect to other NoSQL databases in the market?

Iran Hutchinson: While Globals can be classified as a NoSQL database, it goes beyond the definition of other NoSQL databases. As you know, there are many different offerings in NoSQL and no key comparison matrices or feature lists. Below we list some comparisons and differences, with hopes of later expanding the available information on the globalsdb.org website.

Globals differs from other NoSQL databases in a number of ways.

o It is not limited to one of the known paradigms in NoSQL (Column/Wide Column, Key-Value, Graph, Document, etc.). You can build your own paradigm on top of the core engine. This is an approach we took as we evolved Caché to support objects, XML, and relational access, to name a few.
o Globals still offers optional transactions and locking. Though efficient in implementation, we wanted to make sure that locking and transactions were at the discretion of the developer.
o MVCC is built into the database.
o Globals runs in-memory and writes data to disk.
o There is currently no sharding or replication available in Globals. We are discussing options for these features.
o Globals builds on the over 33 years of success of Caché. It is well proven. It is the exact same database technology. Globals will continue to evolve, and receive the innovations going into the core of Caché.
o Our goal with Globals is to be a very good steward of the project and technology. The Globals initiative will also start to drive contests and events to further promote adoption of the technology, as well as innovative approaches to building applications. We see this stewardship as a key differentiator, along with the underlying flexible core technology.

Globals shares similar traits with other NoSQL databases in the market.

o It is free for development and deployment.
o The data model can optionally use a schema. We mitigate the impact of using schemas by using the same infrastructure we use to store the data. The schema information and the data are both stored in globals.
o Developers can index their data.
o The document paradigm enabled by the Globals Document Store (GDS) API provides a query language for data stored using the GDS API. GDS is also an example of how to build a storage paradigm on Globals. The Globals APIs are open source and available on GitHub.
o Globals is fast and efficient at storing data. We know performance is one of many hallmarks of NoSQL. Globals can store data at rates exceeding 100,000 objects/records per second.
o Different technology APIs are available for use with Globals. We’ve released two Java APIs, and the JavaScript API is imminent.

Q3. How do you position Globals with respect to Caché? Who should use Globals and who should use Caché?

Iran Hutchinson: Today, Globals offers multi-dimensional array storage, whereas Caché offers a much richer set of features. Caché (and the InterSystems technology it powers, including Ensemble, DeepSee, HealthShare, and TrakCare) offers a core underlying object technology, native web services, distributed communication via ECP (Enterprise Cache Protocol), strategies for high availability, an interactive development environment, industry-standard data access (JDBC, ODBC, SQL, XML, etc.), and a host of other enterprise-ready features.

Anyone can use Globals or Caché to tackle challenges with large data volumes (terabytes, petabytes, etc.), high transactions (100,000+ per second), and complex data (healthcare, financial, aerospace, etc.). However, Caché provides much of the needed out-of-box tooling and technology to get started rapidly building solutions in our core technology, as well as a variety of languages. Currently provided as Java APIs, Globals is a toolkit to build the infrastructure already provided by Caché. Use Caché if you want to get started today; use Globals if you have a keen interest in building the infrastructure of your data management system.

Q4. Globals offers multi-dimensional array storage. Can you please briefly explain this feature, and how this can be beneficial for developers?

Iran Hutchinson: It is beneficial to go here. I grabbed the following paragraphs directly from this page:

Summary Definition: A global is a persistent sparse multi-dimensional array, which consists of one or more storage elements or “nodes”. Each node is identified by a node reference (which is, essentially, its logical address). Each node consists of a name (the name of the global to which this node belongs) and zero or more subscripts.

Subscripts may be of any of the types String, int, long, or double. Subscripts of any of these types can be mixed among the nodes of the same global, at the same or different levels.

Benefits for developers: Globals does not limit developers to using objects, key-value, or any other type of storage paradigm. Developers are free to think of the optimal storage paradigm for what they are working on. With this flexibility, and the history of successful applications powered by globals, we think developers can begin building applications with confidence.
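As a concrete illustration of the data model (not the Globals API itself; every name below is invented for the example), a sparse global can be modeled as a map from node references to node values:

```java
import java.util.*;

// A toy model of a "global": a persistent sparse multi-dimensional array.
// This illustrates the data model only -- it is NOT the Globals API.
public class GlobalSketch {
    // Node reference (subscript path) -> value. A TreeMap keeps node
    // references in sorted order, mirroring a global's ordered traversal.
    private final NavigableMap<String, Object> nodes = new TreeMap<>();

    // Subscripts may mix String, int/long, and double at any level.
    void set(Object value, Object... subscripts) {
        nodes.put(ref(subscripts), value);
    }

    Object get(Object... subscripts) {
        return nodes.get(ref(subscripts));
    }

    // Build a node reference by joining the subscripts; being sparse, the
    // structure only stores the nodes that were actually set.
    private static String ref(Object... subscripts) {
        StringJoiner j = new StringJoiner("\u0001");
        for (Object s : subscripts) j.add(String.valueOf(s));
        return j.toString();
    }

    public static void main(String[] args) {
        GlobalSketch person = new GlobalSketch();   // plays the role of ^person
        person.set("Smith", 1, "name");             // ^person(1,"name")="Smith"
        person.set(42.5, 1, "balance");             // mixed subscript types
        person.set("Jones", "id-77", "name");       // String subscript at level 1
        System.out.println(person.get(1, "name"));  // Smith
    }
}
```

Any higher-level paradigm (key-value, document, object) can then be layered on top of this one primitive, which is the flexibility the answer above describes.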

Q5. Globals does not include Objects. Is it possible to use Globals if my data is made of Java objects? If yes, how?

Iran Hutchinson: Globals exposes a multi-dimensional sparse array directly to Java and other languages. While Globals itself does not include direct Java object storage technology like JPA or JDO, one can easily store and retrieve data in Java objects using the APIs documented here. Anyone can also extend Globals to support popular Java object storage and retrieval interfaces.

One of the core concepts in Globals is that it is not limited to a paradigm, like objects, but can be used in many paradigms. As an example, the new GDS (Globals Document Store) API enables developers to use the NoSQL document paradigm to store their objects in Globals. GDS is available here (more docs to come).

Q6. Is Globals open source?

Iran Hutchinson: Globals itself is not open source. However, the Globals APIs hosted at the github location are open source.

Q7. Do you plan to create a Globals Community? And if yes, what will you offer to the community and what do you expect back from the community?

Iran Hutchinson: We created a community for Globals from the beginning. One of the main goals of the Globals initiative is to create a thriving community around the technology, and applications built on the technology.
We offer the community:
• Proven core data management technology
• An enthusiastic technology partner that will continue to evolve and support the project:
◦ Marketing the project globally
◦ Continual underlying technology evolution
◦ Involvement in the forums and open source technology development
◦ Participation in or hosting events and contests around Globals
• A venue to not only express ideas, but take a direct role in bringing those ideas to life in technology
• For those who want to build a business around Globals, 30+ years of experience in supplying software developers with the technology to build successful breakthrough applications.

____________________________________

Iran Hutchinson serves as product manager and software/systems architect at InterSystems. He is one of the people behind the Globals project. He has held architecture and development positions at startups and Fortune 50 companies. He focuses on language platforms, data management technologies, distributed/cloud computing, and high performance computing. When not on the trail talking with fellow geeks or behind the computer, you can find him eating (just look for the nearest steak house).
___________________________________

Resources

Globals.
Globals is a free database from InterSystems. Globals offers multi-dimensional storage. The first version is for Java. Software | Intermediate | English | LINK | May 2011

Globals APIs
Globals APIs are open source and available at the github location.

Related Posts

Interview with Jonathan Ellis, project chair of Apache Cassandra.

The evolving market for NoSQL Databases: Interview with James Phillips.

“Marrying objects with graphs”: Interview with Darren Wood.

“Distributed joins are hard to scale”: Interview with Dwight Merriman.

On Graph Databases: Interview with Daniel Kirstenpfad.

Interview with Rick Cattell: There is no “one size fits all” solution.
———-

May 30 11

Measuring the scalability of SQL and NoSQL systems.

by Roberto V. Zicari

“Our experience from PNUTS also tells that these systems are hard to build: performance, but also scaleout, elasticity, failure handling, replication.
You can’t afford to take any of these for granted when choosing a system.
We wanted to find a way to call these out.”

— Adam Silberstein and Raghu Ramakrishnan, Yahoo! Research.

___________________________________

A team of researchers composed of Adam Silberstein, Brian F. Cooper, Raghu Ramakrishnan, Russell Sears, and Erwin Tam, all at Yahoo! Research Silicon Valley, developed a new benchmark for Cloud Serving Systems, called YCSB.

They published their results in the paper “Benchmarking Cloud Serving Systems with YCSB” at the 1st ACM Symposium on Cloud Computing, ACM, Indianapolis, IN, USA (June 10-11, 2010).

They open-sourced the benchmark about a year ago.

The YCSB benchmark appears to be the best to date for measuring the scalability of SQL and NoSQL systems.

Together with my colleague and odbms.org expert, Dr. Rick Cattell, we interviewed Adam Silberstein and Raghu Ramakrishnan.

Hope you’ll enjoy the interview.

Rick Cattell and Roberto V. Zicari
________________________________

Q1. What motivated you to write a new database benchmark?

Silberstein, Ramakrishnan: Over the last few years, we observed an explosive rise in the number of new large-scale distributed data management systems: BigTable, Dynamo, HBase, Cassandra, MongoDB, Voldemort, etc. And of course our own PNUTS system [2].
This field expanded so quickly that there is no agreement on what to call these systems: NoSQL systems, cloud databases, key-value stores, etc. (Some of these terms (e.g., cloud databases) are even applied to Map-Reduce systems such as Hadoop, which are qualitatively different, leading to terminological confusion.)
This trend created a lot of excitement throughout the community of web application developers and among data management developers and researchers. But it also created a lot of debate. Which systems are most stable and mature? Which have the best performance? Which is best for my use case? We saw these questions being asked in the community, but also within Yahoo!.

Our experience with PNUTS tells us there are many design decisions to make when building one of these systems, and those decisions have a huge impact on how the system performs for different workloads (e.g., read-heavy workloads vs. write-heavy workloads), how it scales, how it handles failures, ease of operation and tuning, etc. We wanted to build a benchmark that lets us expose the nuances of each of these decisions and their implementations.

Our initial experiences with systems in this space made it very clear that tuning the systems is a challenging problem requiring expert advice from the systems’ developers. Out-of-the-box, systems might be tuned for small-scale deployment, or running batch jobs rather than serving. By creating a serving benchmark, we could create a common “language” for describing the type of workloads web application developers care about, which in turn might bring about tuning best practices for these workloads.


Q2. In your paper [1] you write “The main motivation for developing new cloud serving systems is the difficulty in providing scale out and elasticity using traditional database systems”. Can you please explain this statement? What about databases such as object oriented and XML-based?

Silberstein, Ramakrishnan: Database systems have addressed scale-out, and commercial systems (e.g., high-end systems from the major vendors) do scale out well. However, they typically do not scale to web-scale workloads using commodity servers, and are not designed to be operated in configurations of 1000s of servers, and to support high-availability and geo-replication in these settings.
For example, how does ACID carry over when you scale the traditional systems across data centers? Further, they do not support elastic expansion of a running system: when adding servers, how do you offload data to them? How do you replicate for fault tolerance? The new systems we see are making entirely new design tradeoffs. For example, they may sacrifice much of ACID (e.g., its strong consistency model), but make it easy and cost-effective to scale to a large number of servers with high availability and elastic expandability.

These considerations are largely orthogonal to the underlying data model and programming paradigm (i.e., XML or object-oriented systems), though some of the newer systems have also innovated in the areas of flexible schema evolution and nested columns.

Q3. What is the difference between your YCSB benchmark and well known benchmarks such as TPC-C?

Silberstein, Ramakrishnan: At a high level, we have a lot in common with TPC-C and other OLTP benchmarks.
We care about query latency and overall system throughput. When we take a closer look, however, the queries are very different. TPC-C contains several diverse types of queries meant to mimic a company warehouse environment. Some queries execute transactions over multiple tables; some are more heavyweight than others.
In contrast, the web applications we are benchmarking tend to run a huge number of extremely simple queries. Consider a table where each record holds a user’s profile information. Every query touches only a single record, likely either reading it, or reading+writing it. We do include support for skewed workloads; some tables may have active sets accessed much more than others. But we have focused on simple queries, since that is what we see in practice.
Another important difference is that we have emphasized the ease of creating a new suite of benchmarks using the YCSB framework, rather than the particular workload that we have defined, because we feel it is more important to have a uniform approach to defining new benchmarks in this space, given its nascent stage.

Q4. In your paper [1] you focus your work on two properties of “Cloud” systems: “elasticity” and “performance”, or as you write, “simplified application development and deployment”. Can you please explain what you mean by these two properties?

Silberstein, Ramakrishnan: Performance refers to the usual metrics of latency and throughput, with the ability to scale out by adding capacity. Elasticity refers to the ability to add capacity to a running deployment “on-demand”, without manual intervention (e.g., to re-shard existing data across new servers).

Q5. Cloud systems differ for example in the data models they support (e.g., column-group oriented, simple hashtable based, document based). How do you compare systems with such different data models?

Silberstein, Ramakrishnan: In our initial work we focused on the commonalities among the systems we benchmarked (PNUTS, Cassandra, HBase). We ran the systems as row stores holding ordered data.
The YCSB “core workload” accesses entire records at a time, and executes range queries over small portions of the table.
Though Cassandra and HBase both support column-groups, we did not exercise this feature.

It is of course possible to design workloads that test features like column groups, and we encourage users to do so—as we noted earlier, one of our goals was to make it easy to add benchmarks using the YCSB framework. One way to compare systems with different feature sets is to disqualify systems lacking the desired features.

Rather than disqualify a system because it lacks column groups, for example, it may make sense to compare the system as-is against others with column groups.
It is then possible to measure the cost of reading all columns, even when only one is needed. On the other hand, if a workload must execute range scans over a contiguous set of keys, there is no reasonable way to run that workload against a hash-based store.

Q6. You write that the focus of your work is on “serving” systems, rather than “batch” or “analytical”. Are there any particular reasons for doing that? What are the challenges in defining your benchmark?

Silberstein, Ramakrishnan: This is analogous to having TPC-C and TPC-H benchmarks in the conventional database setting.
Many prominent systems in cloud computing either target serving workloads or target analytical/batch workloads. Hadoop is the most well-known example of a batch system. Some systems, such as HBase, support both. We believe serving and batch are both extremely important, but very different. Serving needs its own treatment, and we wanted to very clearly call that out in our benchmark. It is easy to confuse the two when looking at benchmark results. In both settings we talk a lot about throughput, e.g., the number of records read or written per second.
But in batch, where the workload reads or writes an entire table, those records are likely laid out sequentially on disk. In serving, the workload does many random reads and writes, so those records are likely scattered across the disk. Disk I/O patterns have a huge impact on throughput, so we must understand whether a workload is batch or serving before we can appreciate the benchmark results.


Q7. Your YCSB benchmark consists of two tiers: Tier-1 Performance and Tier-2 Scaling. Can you please explain what these two Tiers are and what do you expect to measure with them?

Silberstein, Ramakrishnan: The performance tier is the bread-and-butter of our initial YCSB work; for a fixed system size, we want to see how much load we can put on each system while still getting low latency.

This is one of the key questions, if not the key question, an application developer asks when choosing a data system: how much of my expected workload does the system support per server, and so how many servers am I going to have to buy? No matter what system we benchmark, the performance results have a similar feel. We plot throughput on the x-axis and (average or 95th-percentile) latency on the y-axis. As we push throughput higher, latency grows gradually. Eventually, as the system reaches saturation, latency jumps dramatically, and then we can push throughput no higher. It is easy to compare these graphs across systems and get a feel for what kind of load each can support. It is also worthwhile to verify that while systems slow down at the saturation point, they nonetheless remain stable.
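For concreteness, here is a minimal sketch of how the latency side of such a plot can be reduced from recorded per-operation latencies. This is an illustration only, not YCSB's actual measurement code; the class and method names are invented.

```java
import java.util.*;

// Reduce a run's per-operation latencies (for one throughput target) to the
// two numbers plotted on the y-axis: the average and the 95th percentile.
public class LatencySketch {
    // Nearest-rank percentile by sorting: the value at or below which
    // pct percent of the samples fall.
    static double percentile(double[] latenciesMs, double pct) {
        double[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int idx = (int) Math.ceil(pct / 100.0 * sorted.length) - 1;
        return sorted[Math.max(idx, 0)];
    }

    static double average(double[] latenciesMs) {
        double sum = 0;
        for (double l : latenciesMs) sum += l;
        return sum / latenciesMs.length;
    }

    public static void main(String[] args) {
        // Mostly fast operations with one slow outlier -- the shape that
        // makes the 95th percentile more informative than the average.
        double[] ms = {1, 1, 2, 2, 2, 3, 3, 4, 5, 40};
        System.out.println("avg=" + average(ms) + " p95=" + percentile(ms, 95));
        // avg=6.3 p95=40.0
    }
}
```

Repeating this reduction at increasing throughput targets yields the latency-vs-throughput curve the answer describes, with the characteristic jump near saturation.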

The big selling point of cloud serving systems is their ability to scale upward by adding more servers. If the system offers low latency at small scale, as we proportionally increase workload size and the number of servers, latency should remain constant. If this is not the case, this is a hint the system might have bottlenecks that surface at scale.

In our scaling tier we also make the point of distinguishing scalability and elasticity. While good scalability means the system runs well over large workloads if pre-configured with the appropriate number of servers, elasticity means the system can actually grow from small to large scale by adding servers while remaining online.

In our initial work we observed systems doing well on scalability, but having erratic behavior when we added capacity online and increased workloads (i.e., they were often weak on elasticity).

Q8. How are the workloads defined in your YCSB benchmark? How do they differ from traditional TPC-C workloads?

Silberstein, Ramakrishnan: Our workloads consist of much simpler queries than TPC-C. There are just a handful of knobs for the user to adjust. The user may specify size settings, such as existing database size and record size, and distribution settings, such as the relative proportions of insert, read, update, delete, and scan operations and the distribution of accesses over the key space. These simple settings let us characterize many important workloads we find at Yahoo!, and we provide a few simple ones as part of our Core Workload.
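Those knobs can be sketched as follows. This is only the flavor of such a workload generator, not YCSB's actual classes; the names, the 95/5 mix, and the crude skew function are all illustrative (YCSB proper uses a Zipfian key chooser).

```java
import java.util.*;

// A sketch of a YCSB-style core workload: relative proportions of
// operations plus a skewed distribution over the key space.
public class WorkloadSketch {
    enum Op { READ, UPDATE, INSERT, SCAN }

    final double readProp, updateProp, insertProp; // remainder goes to SCAN
    final Random rng;

    WorkloadSketch(double read, double update, double insert, long seed) {
        this.readProp = read; this.updateProp = update; this.insertProp = insert;
        this.rng = new Random(seed);
    }

    // Choose the next operation according to the configured proportions.
    Op nextOp() {
        double r = rng.nextDouble();
        if (r < readProp) return Op.READ;
        if (r < readProp + updateProp) return Op.UPDATE;
        if (r < readProp + updateProp + insertProp) return Op.INSERT;
        return Op.SCAN;
    }

    // Crude skewed key chooser: low keys form a hot "active set".
    // Squaring a uniform draw biases it toward zero.
    int nextKey(int keyCount) {
        double r = rng.nextDouble();
        return (int) (keyCount * r * r);
    }

    public static void main(String[] args) {
        // 95% reads, 5% updates: a read-heavy single-record mix.
        WorkloadSketch w = new WorkloadSketch(0.95, 0.05, 0.0, 42);
        Map<Op, Integer> hist = new EnumMap<>(Op.class);
        for (int i = 0; i < 10_000; i++) hist.merge(w.nextOp(), 1, Integer::sum);
        System.out.println(hist); // overwhelmingly READ, a few UPDATE
    }
}
```

Each generated operation then touches a single record chosen by `nextKey`, which matches the single-record serving queries described earlier.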

Of course, users can develop much more complex workloads that look more like TPC-C, and our code is designed to make that easy.

Q9. You used your benchmark to measure four different systems: Cassandra, HBase, your cloud system PNUTS and your implementation of a sharded MySQL. Why did you choose these four systems? What other systems would you have liked to include, or would you like to see someone run YCSB on?

Silberstein, Ramakrishnan: When we started our work there were certainly more than four systems to choose from, and there are even more now. We limited ourselves to just a handful to avoid spreading ourselves too thin. This was a wise decision, since figuring out how to run and tune each system for serving was a full-time job (providing even more motivation to produce the benchmark). Ultimately, we made our choice of systems based on interest at Yahoo! and our desire to compare a collection of systems that made fundamentally different design decisions.
We built PNUTS here, so that is an obvious choice, and many Yahoo! developers are curious about the features and performance of Cassandra and HBase. PNUTS uses a buffer-page-based architecture, while Cassandra and HBase (itself a clone of BigTable) are based on differential-file architectures.

We are happy with our initial choice of systems. We got a lot of interest in our work and have gained a lot of users and contributors. Those contributors have done a variety of work, including improvements to clients of the systems we initially benchmarked and adding new clients. As of this writing, we have clients for Voldemort, MongoDB, and JDBC.

Q10. In your paper you present the main results of comparing these four systems. Can you please summarize the main results you obtained and the lessons learned?

Silberstein, Ramakrishnan: Our impact comes not from the actual numbers we collected; this was just a snapshot in time for each system. Our key contribution was creating an apples-to-apples environment that made ongoing comparisons feasible. We encouraged some competition between the systems and produced a tool for the systems to compare themselves against in the future.

That said, we had some interesting results. We know that the systems we tested made different design decisions to optimize reads or writes, and we saw these decisions reflected in the results. For a 50% read 50% write workload, Cassandra achieved the highest throughput. For a 95% read workload, PNUTS tied Cassandra on throughput while having better latency.

All systems we tested advertised scalability and elasticity. The systems all performed well at scale, but we noticed hiccups when growing elastically.
Cassandra took an extremely long time to integrate a new server, and had erratic latencies during that time. HBase is extremely lazy about integrating new servers, requiring background compactions to move partitions to them.

This was over a year ago, and these systems may well have overcome these issues by now.


Q11. Please briefly explain the experiment set up.

Silberstein, Ramakrishnan: We simply performed a bake-off. We had a collection of server-class machines, and installed and ran each cloud serving system on them. We spent a huge amount of time configuring each system to perform at its best on our hardware. We are in the early days of cloud serving systems, and there are no clear best practices for how to tune these systems to our workloads. We also took care to allocate memory fairly across each system.
For example, HBase runs on top of HDFS, and both HBase and HDFS must be given memory, and the sum must equal what we grant Cassandra.
Finally we made sure to load each system with enough data to avoid fitting all records in memory; all of the systems we benchmarked perform dramatically differently when running solely in memory vs. not.
We were more interested in performance in the latter case, so ran in that mode.

Q12. You plan to extend the benchmark to include two additional properties: availability and replication. Can you please explain how you intend to do so?

Silberstein, Ramakrishnan: These are tricky, but important, tiers to build. We want to measure a variety of things. What is the added cost of replication? What happens to availability under different failure scenarios? Can the records be read and/or written, and is there a cost penalty when doing so? What is the record consistency model during failures, and can we quantify the differences between different failure models?

These tiers expose areas where system designers can make very different design decisions. They might ensure write durability by replicating writes on-disk or in-memory only. They might replicate intra-data center or cross-data center. During network partitions, they might prioritize consistency and make data read-only, or might prioritize availability, and use eventual consistency to make record replicas converge later.

Q13. Your benchmark appears to be the best to date for measuring the scalability of SQL and NoSQL systems; it would be great if others would run it on their systems. Do you have ideas to encourage that to happen?

Silberstein, Ramakrishnan: We think we have been fairly successful here already. We open-sourced the benchmark about a year ago. It is available here.

Before we released the benchmark, we shared drafts of our results with the developer/user lists of each benchmarked system a few times prior to publication, and each time got a lot of interest and questions: what do our machines look like, what tuning parameters did we use, etc. This interest is a great sign. First, as you point out, we’re filling an unmet need.
Second, our audience recognizes our Yahoo! workloads are important and that we as Yahoo! researchers are doing a rigorous comparison.

By the time we released the benchmark we had many people waiting to use it.

Today, we have many users and contributors. We see comments on system mailing lists like “I am running YCSB Workload A and seeing XXX ops/sec and that seems too low. How should I tune my setup?”

Q14. A simpler benchmark might be easier for others to reproduce, getting more results. Is there a simple subset of your benchmark you’d suggest that captures most of the important elements?

Silberstein, Ramakrishnan: The Core Benchmark is already very simple. It uses synthetic data and provides just a few knobs for users to adjust. We and other users have done more complex testing, like feeding actual production logs into YCSB, but that is not part of Core. Within Core, we should mention there are 3 very simple workloads that can execute on the simplest hash-based key value store. Workloads A, B, and C execute only inserts, reads, and updates. They do not execute range queries.
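The shape of these Core workloads can be sketched with a toy in-memory stand-in for the YCSB client. The sketch below mimics Workload A (a 50/50 read/update mix over a skewed, Zipf-like request distribution); the key format, table size, and skew parameter are illustrative assumptions, not YCSB's actual defaults.

```python
import bisect, itertools, random

def make_zipfian(n, theta=0.99):
    """Return a sampler over [0, n) with Zipf-like skew: a few hot keys dominate."""
    weights = [1.0 / (i + 1) ** theta for i in range(n)]
    cumulative = list(itertools.accumulate(weights))
    def sample():
        return bisect.bisect_left(cumulative, random.random() * cumulative[-1])
    return sample

# Workload A in miniature: 50% reads, 50% updates against an in-memory table.
random.seed(42)
store = {f"user{i}": f"value{i}" for i in range(1000)}
next_key = make_zipfian(1000)
reads = updates = 0
for _ in range(10_000):
    key = f"user{next_key()}"
    if random.random() < 0.5:
        _ = store[key]           # read the whole record
        reads += 1
    else:
        store[key] = "updated"   # update one record
        updates += 1
print(reads + updates)  # 10000 operations issued
```

In the real benchmark the same operation stream is issued against the system under test rather than a Python dict, and YCSB measures the resulting latency and throughput.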

Q15. Open source NoSQL projects may not have half a dozen dedicated servers available to reproduce your results on their systems. Do you have suggestions there? Is it possible to run an accurate benchmark on a leased cloud platform?

Silberstein, Ramakrishnan: Funny you ask this because we felt the half dozen servers we ran on were not enough. We went to great lengths to borrow 50 homogeneous servers for a short time to run 1 large-scale experiment that showed YCSB itself scales.
Running YCSB on a leased platform like Amazon EC2 is straightforward.
The big question is determining if the results are accurate and repeatable. Certainly if the machines are VMs on non-dedicated servers, that might not be the case. We have heard if you request the largest size VMs, EC2 might allocate dedicated servers. One of the benefits of working at a large Internet company is that we have not had to try this ourselves!

Q16. Do you think that flash memory may “tip the balance” in the best data platforms? It cannot simply be treated as a disk, nor as RAM, because it has fundamentally different read/write costs.

Silberstein, Ramakrishnan: As SSD hardware has matured, we have noticed two trends. First, it is quite easy to build a machine with enough SSDs to run into CPU, bus and controller bottlenecks. This has led to rewrites of most storage stacks, cloud based or not.

Second, SSDs use extremely sophisticated log structured techniques to mask the cost of writes. Some of these techniques, such as data deduplication and compression, only help certain workloads. The big question in this space is how higher-level database indexes will interact with the lower-level log structured system.

It could be that future hardware devices will mask the cost of random writes so well that higher-level log structured techniques will become redundant. On the other hand, higher-level log structured approaches have more computational resources at their disposal, and also have more information about the application. These advantages could mean that they will always be able to significantly improve upon hardware-based approaches.
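The log-structured idea behind both the SSD firmware and the higher-level approaches discussed above can be sketched minimally: every write becomes a sequential append, and an in-memory index maps each key to its latest on-disk offset. The class, file layout, and tab-separated record encoding below are invented for illustration (and assume values contain no tabs or newlines); real systems add compaction, checksums, and crash recovery.

```python
import os, tempfile

class LogStore:
    """Minimal log-structured store: writes are sequential appends,
    and an in-memory index maps each key to its latest offset."""
    def __init__(self, path):
        self.f = open(path, "a+b")
        self.index = {}

    def put(self, key, value):
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        self.f.write(f"{key}\t{value}\n".encode())  # append-only, never in-place
        self.index[key] = offset                    # random writes become index updates

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        self.f.seek(offset)
        _, value = self.f.readline().decode().rstrip("\n").split("\t")
        return value

path = os.path.join(tempfile.mkdtemp(), "store.log")
s = LogStore(path)
s.put("a", "1"); s.put("a", "2")  # an overwrite is just another append
print(s.get("a"))  # 2
```

The write path never seeks, which is exactly the cost-masking the interviewees describe; the open question they raise is what happens when this trick is applied at both the device layer and the database layer at once.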
______________________________________________________

Adam Silberstein, Yahoo! Research.
Research Area: Web Information Management.
My current interests are in large distributed data systems. My research interests are in the general area of large-scale data management: specifically, online transaction processing, analytics, and bridging the gap between them, as well as techniques for generating user feeds in social networks. I joined Yahoo! in August 2007, after finishing my Ph.D. at Duke University in February 2007.

Raghu Ramakrishnan, Yahoo! Research.
Raghu Ramakrishnan is Chief Scientist for Search and Cloud Platforms at Yahoo!, and is a Yahoo! Fellow, heading the Web Information Management research group. His work in database systems, with a focus on data mining, query optimization, and web-scale data management, has influenced query optimization in commercial database systems and the design of window functions in SQL:1999. His paper on the Birch clustering algorithm received the SIGMOD 10-Year Test-of-Time award, and he
has written the widely-used text “Database Management Systems” (with Johannes Gehrke).
His current research interests are in cloud computing, content optimization, and the development of a “web of concepts” that indexes all information on the web in semantically rich terms. Ramakrishnan has received several awards, including the ACM SIGKDD Innovations Award, the ACM SIGMOD Contributions Award, a Distinguished Alumnus Award from IIT Madras, a Packard Foundation Fellowship in Science and Engineering, and an NSF Presidential Young Investigator Award. He is a Fellow of the ACM and IEEE.

Ramakrishnan is on the Board of Directors of ACM SIGKDD, and is a past Chair of ACM SIGMOD and member of the Board of Trustees of the VLDB Endowment. He was Professor of Computer Sciences at the University of Wisconsin-Madison, and was founder and CTO of QUIQ, a company that pioneered crowd-sourcing, specifically question-answering communities, powering Ask Jeeves’ AnswerPoint as well as customer-support for companies such as Compaq.

____________________________________________
Dr. R. G. G. “Rick” Cattell is an independent consultant in database systems and engineering management. He previously worked as a Distinguished Engineer at Sun Microsystems, most recently on open source database systems and horizontal database scaling. Dr. Cattell served for 20+ years at Sun Microsystems in management and senior technical roles,
and for 10+ years in research at Xerox PARC and at Carnegie-Mellon University.
Dr. Cattell is best known for his contributions to middleware and database systems, including database scalability, enterprise Java, object/relational mapping, object-oriented databases, and database interfaces. He is the author of several dozen papers and six books. He instigated Java DB and Java 2 Enterprise Edition, and was a contributor to a number of the Enterprise Java APIs and products. He previously led development of the Cedar DBMS at Xerox PARC, the Sun Simplify database GUI, and SunSoft’s ORB-database integration. He was a founder of SQL Access (a predecessor to ODBC), the founder and chair of the Object Data Management Group (ODMG), the co-creator of JDBC, the author of the world’s first monograph on object/relational and object databases, and a recipient of the ACM Outstanding PhD Dissertation Award.

References:
[1] Benchmarking Cloud Serving Systems with YCSB. (.pdf)
Adam Silberstein, Brian F. Cooper, Raghu Ramakrishnan, Russell Sears, Erwin Tam, Yahoo! Research.
1st ACM Symposium on Cloud Computing, ACM, Indianapolis, IN, USA (June 10-11, 2010)

[2] PNUTS: Yahoo!’s Hosted Data Serving Platform (.pdf)
Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver and Ramana Yerneni, Yahoo! Research.
The paper describes PNUTS/Sherpa, Yahoo’s record-oriented cloud database.

Open Source Code

Yahoo! Cloud Serving Benchmark (brianfrankcooper / YCSB) Downloads

Related Posts

Benchmarking ORM tools and Object Databases.

Interview with Jonathan Ellis, project chair of Apache Cassandra.

Hadoop for Business: Interview with Mike Olson, Chief Executive Officer at Cloudera.

The evolving market for NoSQL Databases: Interview with James Phillips.

For further readings

Scalable Datastores, by Rick Cattell

_____________________________________________________________________________________________

May 16 11

Interview with Jonathan Ellis, project chair of Apache Cassandra.

by Roberto V. Zicari

“You’re going to see these databases attempting to make things easy that today are possible but difficult.” –Jonathan Ellis.

This interview is part of my series of interviews on the evolving market for Data Management Platforms. This time, I had the pleasure to interview Jonathan Ellis, project chair of Apache Cassandra.

RVZ: In my understanding of how the market of Data Management Platforms is evolving, I have identified three phases:

Phase I– New Proprietary data platforms developed: Amazon (Dynamo), Google (BigTable). Both systems remained proprietary and are in use by Amazon and Google.

Phase II- The advent of Open Source Developments: Apache projects such as Cassandra, Hadoop (MapReduce, Hive, Pig). Facebook and Yahoo! played major roles. Multitude of new data platforms emerged.

Phase III– Evolving Analytical Data Platforms. Hadoop for analytics. Companies such as Cloudera, but also IBM’s BigInsights, are in this space.

Q1. Why did Amazon and Google develop their own database systems? Why didn’t they use or adapt existing database systems?

Jonathan Ellis: Google and Amazon were really breaking new ground when they were working on Bigtable and Dynamo. The thing you have to remember is that the problem they were trying to solve was high-volume transactional systems: while companies like Teradata have been building large-scale databases for some time, these were analytical databases and not designed for high query volumes with low latency.

The state-of-the-art at the time for transactional systems was horizontal and vertical partitioning customized for a given application, built on a traditional database like Oracle. These systems were not application-transparent, meaning repartitioning was a major undertaking, nor were they reusable from one application to another.

Q2. I defined Phase II as the advent of Open Source Developments with Apache projects such as Cassandra, Hadoop (MapReduce, Hive, Pig). Facebook and Yahoo! played major roles. Multitude of new data platforms emerged. Any comment on this?

Jonathan Ellis: Note that Hadoop is a different kind of animal here: Hadoop *is* an analytical system, not a real-time or transaction-oriented one.

Q3. How was it possible that Amazon and Google’s proprietary systems were used as input for Open Source Projects?

Jonathan Ellis: Google and Amazon both published whitepapers on Bigtable and Dynamo which were very influential for the open source systems that started appearing soon afterward. Since then, the open-source systems have of course continued to evolve; today Cassandra—which began as a fusion of Bigtable and Dynamo concepts—includes features described by neither of its ancestors, such as distributed counters and column indexes.

Q4. Why did Facebook and LinkedIn develop open source data platforms rather than proprietary software?

Jonathan Ellis: Probably a combination of two reasons:
– They recognized that with the kind of head start that Google and Amazon had, it would be difficult to achieve technical parity without leveraging the efforts of a larger development community.

– They don’t see infrastructure in and of itself as the place where they gain their competitive advantage. You can see more evidence of this in Facebook’s recent announcement that they were opening up the plans for their newest data center. So it goes beyond just source code.

Q5. Not everyone has data and scalability requirements such as Amazon and Google. Who currently needs such new data management platforms, and why?

Jonathan Ellis: Again, I think the piece that’s new here is the emphasis on high-volume transaction processing. Ten years ago you didn’t see this kind of urgency around transaction processing–some large web sites like eBay were concerned already, but it feels like there’s been a kind of Moore’s law of data growth that’s been catching more and more companies, both on the web and off. Today DataStax has customers like startups Backupify and Inkling, as well as companies you might expect to see like Netflix.

Q6. Is there a common denominator between the business models built around such open source projects?

Jonathan Ellis: At the most basic level, there are only so many options to choose from.
You have services and support, and you have proprietary products built on top or around the open source core. Everyone is doing some combination of those.

Phase III– Evolving Analytical Data Platforms
Q7. Is Business Intelligence becoming more like Science for profit?

Jonathan Ellis: If you mean that a lot of teams are now trying to commercialize technologies that were originally developed without that kind of focus, then yes, we’ve definitely seen a lot of that the last couple years.

Q8. Who are the main actors in the Platform Ecosystem?

Jonathan Ellis: On the real-time side, Cassandra’s strongest competitors are probably Riak and HBase. Riak is backed by Basho, and I believe Cloudera supports HBase although it’s not their focus.

For analytics, everyone is standardizing on Hadoop, and there are a number of companies pushing that.

DataStax is unique here in that our just-released Brisk project gives you the best of both worlds: a Hadoop powered by Cassandra so you never have to do an ETL process before running an analytical query against your real-time data, while at the same time keeping those workloads separate so that they don’t interfere with each other.

Q9. What role will RDBMS play in the future? What about Object Databases, do they have a role to play?

Jonathan Ellis: Relational databases will continue to be the main choice when you need ACID semantics and you have a relatively small data or query volume that you care about. Many of our customers continue to use a relational database in conjunction with Cassandra for things like user registration.

To be honest, I don’t see object databases being able to ride the NoSQL wave out of their niche. The popularity of NoSQL options isn’t from their rejection of the SQL language per se, but because that was part of what they left behind when they added features that are starting to matter even more than query language, primarily scalability.

Q10. Looking at three elements: Data, Platform, Analysis, what are the main research challenges ahead? And what are the main business challenges ahead?

Jonathan Ellis: I see the technical side as more engineering than R&D. You’re going to see these databases attempting to make things easy that today are possible but difficult. Cassandra’s column indexes are an example of this–you could use Cassandra to look up rows by column values in the past, but you had to maintain those indexes manually, in application code. Today Cassandra can automate that for you.

This ties into the business side as well: the challenge for everyone is to move beyond the early adopter market and go mainstream. Ease of use will be a big part of that.
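The manual index maintenance Ellis describes (the pre-column-index situation, where application code kept a lookup structure in sync with the primary rows) can be sketched as follows. The store, the indexed column name `state`, and the row keys are all hypothetical, chosen only to show the bookkeeping burden that Cassandra’s column indexes now automate server-side.

```python
class ManualIndexStore:
    """The old approach: the application keeps a row store and a
    hand-maintained index over one column, in lockstep."""
    def __init__(self):
        self.rows = {}        # row_key -> {column: value}
        self.by_state = {}    # column value -> set of row keys (the manual index)

    def put(self, row_key, columns):
        old = self.rows.get(row_key, {})
        if "state" in old:  # application code must un-index the old value...
            self.by_state[old["state"]].discard(row_key)
        self.rows[row_key] = columns
        if "state" in columns:  # ...and index the new one, on every write.
            self.by_state.setdefault(columns["state"], set()).add(row_key)

    def lookup_by_state(self, state):
        return sorted(self.by_state.get(state, set()))

db = ManualIndexStore()
db.put("user1", {"name": "Ann", "state": "TX"})
db.put("user2", {"name": "Bob", "state": "TX"})
db.put("user1", {"name": "Ann", "state": "CA"})  # the index must follow the update
print(db.lookup_by_state("TX"))  # ['user2']
```

Every code path that writes a row has to remember both halves of `put`, which is exactly the ease-of-use gap Ellis says moving this logic into the database closes.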

Q11. What are the main future developments? Anything you wish to add?

Jonathan Ellis: This is an exciting space to work in right now because the more we build, the more we can see that we’ve barely scratched the surface so far. The feature we’re working on right now that I’m personally most excited about is predicate push-down for Brisk: allowing Hive, a data warehouse system for Hadoop, to take advantage of Cassandra column indexes.

Readers curious about Brisk–which is fully open-source–can learn more here.

Thanks for the questions!

Jonathan Ellis

__________________________________
Jonathan Ellis.
Jonathan Ellis is CTO and co-founder of DataStax (formerly Riptano), the commercial leader in products and support for Apache Cassandra. Prior to DataStax, Jonathan built a multi-petabyte, scalable storage system based on Reed-Solomon encoding for backup provider Mozy. Jonathan is project chair of Apache Cassandra.
__________________________________

Related Posts

1. Hadoop for Business: Interview with Mike Olson, Chief Executive Officer at Cloudera.

2. The evolving market for NoSQL Databases: Interview with James Phillips.

____________________

Apr 13 11

Objects in Space vs. Friends in Facebook.

by Roberto V. Zicari

“Data is everywhere, never be at a single location. Not scalable, not maintainable.”–Alex Szalay

I recently reported about the Gaia mission which is considered by the experts “the biggest data processing challenge to date in astronomy“.

Alex Szalay – who knows about data and astronomy, having worked from 1992 till 2008 with the Sloan Digital Sky Survey together with Jim Gray – wrote back in 2004:
“Astronomy is a good example of the data avalanche. It is becoming a data-rich science. The computational-Astronomers are riding the Moore’s Law curve, producing larger and larger datasets each year.” [Gray,Szalay 2004]

Gray and Szalay observed: “If you are reading this you are probably a “database person”, and have wondered why our “stuff” is widely used to manage information in commerce and government but seems to not be used by our colleagues in the sciences. In particular our physics, chemistry, biology, geology, and oceanography colleagues often tell us: “I tried to use databases in my project, but they were just too [slow | hard-to-use | expensive | complex ]. So, I use files.” Indeed, even our computer science colleagues typically manage their experimental data without using database tools. What’s wrong with our database tools? What are we doing wrong?” [Gray,Szalay 2004].

Six years later, in his presentation “Extreme Data-Intensive Computing”, Szalay presented what he calls “Jim Gray’s Law of Data Engineering”:
“1. Scientific computing is revolving around data.
2. Need scale-out solution for analysis.” [Szalay2010]

He also says about Scientific Data Analysis, or as he calls it (DISC: Data Intensive Scientific Computing): “Data is everywhere, never be at a single location. Not scalable, not maintainable.”[Szalay2010]

I would like to make three observations:

i. Great thinkers do anticipate the future. They “feel” it. Better said, they “see” more clearly how things really are.
Consider for example what the philosopher Friedrich Nietzsche wrote in his book “Thus Spoke Zarathustra”: “The middle is everywhere.” Confirmed 128 years later by “Data is everywhere”….

ii. “Astronomy is a good example of the data avalanche”: the Universe is beyond our comprehension, which means, I believe, that ultimately we will figure out that indeed “data is not scalable, and not maintainable.”

iii. I now dare to twist the quote: “If you are reading this you are probably a “database person”, and have wondered why our “stuff” is widely used to manage information in commerce and government but seems to not be used by our colleagues at Facebook or Google”….

I have asked Professor Alex Szalay for his opinion.

Alexander Szalay is a professor in the Department of Physics and Astronomy of the Johns Hopkins University. His research interests are theoretical astrophysics and galaxy formation.

RVZ

Alex Szalay: This is very flattering… and I agree. But to be fair, the Facebook guys are using databases, first MySQL, and now Oracle in the middle of their whole system.

I have recently heard a talk by Jeff Hammerbacher, who built the original infrastructure for Facebook. Now he quit, and formed Cloudera. He did explicitly say that in the middle there will always be SQL, but people use Hadoop/MR for the ETL layer… and R and other tools for analytics and reporting.

As far as I can see Google is also gently moving towards not quite a database yet, but Jeff Dean is building Bloom filters and other indexes into BigTable. So even if it is NoSQL, some of their stuff starts to resemble a database….
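The Bloom filter Szalay mentions is a probabilistic membership test: it never produces false negatives, and its false-positive rate is tunable via the bit-array size and hash count. In Bigtable-style systems it lets a read skip disk files that cannot contain a key. A minimal sketch (the sizes and the MD5-based hashing scheme are illustrative choices, not Bigtable’s implementation):

```python
import hashlib

class BloomFilter:
    """Bit array of m bits with k hash functions per key."""
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, key):
        # Derive k positions by salting the key; a stand-in for k real hash functions.
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means definitely absent; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("row42")
print(bf.might_contain("row42"))   # True: an added key is always found
print(bf.might_contain("row99"))   # almost certainly False at this fill level
```

This is the sense in which “even if it is NoSQL, some of their stuff starts to resemble a database”: the filter plays the role an index plays, pruning the lookups that cannot succeed.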

So I think there is a general agreement that indices are useful, but for large scale data analytics, we do not need full ACID, transactions are much more a burden than an advantage. And there is a lot of religion there, of course.

I would put it in such a way, that there is a phase transition coming, and there is an increasing diversification, where there were only three DB vendors 5 years ago, now there are many options and a broad spectrum of really interesting niche tools. In a healthy ecosystem everything is a 1/f power law, and we will see a much bigger diversity. And this is great for academic research. “In every crisis there is an opportunity” — we again have a chance to do something significant in academia.

RVZ: The National Science Foundation has awarded a $2M grant to you and your team of co-investigators from across many scientific disciplines, to build a 5.2 Petabyte Data-Scope, a new instrument targeted at analyzing the huge data sets emerging in almost all areas of science. The instrument will be a special data-supercomputer, the largest of its kind in the academic world.

What is the project about?

Alex Szalay: We feel that the Data-Scope is not a traditional multi-user computing cluster, but a new kind of instrument that enables people to do science with datasets ranging between 100TB and 1000TB.
This is simply not possible today. The task is much more than just throwing the necessary storage together.
It requires a holistic approach: the data must be first brought to the instrument, then staged, and then moved to the computing nodes that have both enough compute power and enough storage bandwidth (450GBps) to perform the typical analyses, and then the (complex) analyses must be performed.

RVZ: Could you please explain what are the main challenges that this project poses?

Alex Szalay: It would be quite difficult, if not outright impossible to develop a new instrument with so many cutting-edge features without adequately considering all aspects of the system, beyond the hardware. We need to write at least a barebones set of system management tools (beyond the basic operating system etc), and we need to provide help and support for the teams who are willing to be at the “bleeding-edge” to be able to solve their big data problems today, rather than wait another 5 years, when such instruments become more common.
This is why we feel that our proposal reflects a realistic mix of hardware and personnel, which leads to a high probability of success.

The instrument will be open for scientists beyond JHU. There was an unbelievable amount of interest just at JHU in such an instrument, since analyzing such data sets is beyond the capability of any group on campus. There were 20 groups with data sets totaling over 2.8PB just within JHU who would use the facility immediately, if it were available. We expect to go on-line at the end of this summer.

References

[Szalay2010]
Extreme Data-Intensive Computing (.pdf)
Alex Szalay, The Johns Hopkins University, 2010.

[Gray,Szalay 2004]
Where the Rubber Meets the Sky: Bridging the Gap between Databases and Science.
Jim Gray,Microsoft Research and Alex Szalay,Johns Hopkins University.
IEEE Data Engineering Bulletin and Technical Report, MSR-TR-2004-110, Microsoft Research, 2004

Friedrich Nietzsche,
Thus Spoke Zarathustra: a Book for Everyone and No-one. (Also Sprach Zarathustra: Ein Buch für Alle und Keinen) – written between 1883 and 1885.

Related Posts

Objects in Space

Objects in Space: “Herschel” the largest telescope ever flown.

Objects in Space. –The biggest data processing challenge to date in astronomy: The Gaia mission.–

Big Data

Hadoop for Business: Interview with Mike Olson, Chief Executive Officer at Cloudera.

The evolving market for NoSQL Databases: Interview with James Phillips.


Apr 4 11

Hadoop for Business: Interview with Mike Olson, Chief Executive Officer at Cloudera.

by Roberto V. Zicari

“Data is the big one challenge ahead” –Michael Olson.

I was interested to learn more about Hadoop, why it is important, and how it is used for business.
I have therefore interviewed Michael Olson, Chief Executive Officer, Cloudera.

RVZ

RVZ: In my understanding of how the market of Data Management Platforms is evolving, I have identified three phases:

Phase I– New Proprietary data platforms developed: Amazon (Dynamo), Google (BigTable). Both systems remained proprietary and are in use by Amazon and Google.

Phase II- The advent of Open Source Developments: Apache projects such as Cassandra, Hadoop (MapReduce, Hive, Pig). Facebook and Yahoo! played major roles. Multitude of new data platforms emerged.

Phase III– Evolving Analytical Data Platforms. Hadoop for analytics. Companies such as Cloudera, but also IBM’s BigInsights, are in this space.

Q1. Would you agree with this? Would you have anything to add/change?

Michael Olson: I think that’s generally accurate. The one qualification I’d offer is that the process isn’t a waterfall, where the Phase I players do some innovative work that flows down to Phase II where it’s implemented, and so on. The open source projects have come up with some novel and innovative ideas not described in the papers. The arrival of commercial players like Cloudera and IBM was also more interesting than the progression suggests. We’ve spent considerable time with customers, but also in the community, and we’ll continue to build reputation by contributing alongside the many great people working on Apache Hadoop and related projects around the world. IBM’s been working with Hadoop for a couple of years, and BigInsights really flows out of some early exploratory work they did with a small number of early customers.

Hadoop
Q2. What is Hadoop? Why is it becoming so popular?

Michael Olson: Hadoop is an open source project, sponsored by the Apache Software Foundation, aimed at storing and processing data in new ways. It’s able to store any kind of information — you needn’t define a schema and it handles any kind of data, including the tabular stuff that an RDBMS understands, but also complex data like video, text, geolocation data, imagery, scientific data and so on. It can run arbitrary user code over the data it stores, and it’s able to do large-scale parallel processing very easily, so you can answer petabyte-scale questions quickly. It’s genuinely a new thing in the data management world.

Apache Hadoop, the project, consists of three major components: the Hadoop Distributed File System, or HDFS, which handles storage; MapReduce, the distributed processing infrastructure, which handles the work of running analyses on data; and Common, which is a bunch of shared infrastructure that both HDFS and MapReduce need. When most people talk about “Hadoop,” they’re talking about this collection of software.
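The division of labor between user code and the MapReduce component can be sketched with the canonical word-count example: the user supplies a map function and a reduce function, and the framework handles the shuffle/sort between them. The sketch below simulates that shuffle with an in-memory sort; in Hadoop itself the same phases run in parallel across the cluster, reading from and writing to HDFS.

```python
from itertools import groupby
from operator import itemgetter

# Map: emit (word, 1) for every word in an input record.
def map_fn(record):
    for word in record.split():
        yield (word.lower(), 1)

# Reduce: sum the counts emitted for each word.
def reduce_fn(word, counts):
    return (word, sum(counts))

def mapreduce(records):
    # Shuffle/sort: group all intermediate pairs by key (what Hadoop provides).
    pairs = sorted((kv for r in records for kv in map_fn(r)), key=itemgetter(0))
    return [reduce_fn(word, (count for _, count in group))
            for word, group in groupby(pairs, key=itemgetter(0))]

print(mapreduce(["the quick fox", "the lazy dog"]))
# [('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```

Because map and reduce are the only pieces the user writes, the same pattern scales from this toy to the petabyte-scale jobs Olson describes, with the framework owning distribution and fault tolerance.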

Hadoop’s developed by a global community of programmers. It’s one hundred percent open source. No single company owns or controls it.

Q3. Who needs Hadoop?, and for what kind of applications?

Michael Olson: The software was developed first by big web properties, notably Yahoo! and Facebook, to handle jobs like large-scale document indexing and web log processing. It was applied to real business problems by those properties. For example, it can examine the behavior of large numbers of users on a web site, cluster individual users into groups based on the things that they do, and then predict the behavior of individuals based on the behavior of the group.

You should think of Hadoop in kind of the same way that you think of a relational database. All by itself, it’s a general-purpose platform for storing and operating on data. What makes the platform really valuable is the application that runs on top of it. Hadoop is hugely flexible. Lots of businesses want to understand users in the way that Yahoo! and Facebook do, above, but the platform supports other business workloads as well: portfolio analysis and valuation, intrusion detection and network security applications, sequence alignment in biotechnology and more.

Really, anyone who has a large amount of data — structured, complex, messy, whatever — and who wants to ask really hard questions about that data can use Hadoop to do that. These days, that’s just about every significant enterprise on the planet.

Q4. Can you use Hadoop stand alone or do you need other components? If yes which one and why?

Michael Olson: Hadoop provides the storage and processing infrastructure you need, but that’s all. If you need to load data into the platform, then you either have to write some code or else go find a tool that does that, like Flume (for streaming data) or Sqoop (for relational data). There’s no query tool out of the box, so if you want to do interactive data explorations, you have to go find a tool like Apache Pig or Apache Hive to do that.

There’s actually a pretty big collection of those tools that we’ve found that customers need. It’s the main reason we created the open source package we call Cloudera’s Distribution including Apache Hadoop, or CDH. We assemble Apache Hadoop and tools like Pig, Hive, Flume, Sqoop and others — really, the full suite of open source tools that our users have found they require — and we make it available in a single package. It’s 100% open source, not proprietary to us. We’re big believers in the open source platform — customers love not being shackled to a vendor by proprietary code.

Analytical Data Platforms

Q5. It is said that more than 95% of enterprise data is unstructured, and that enterprise data volumes are growing rapidly. Is it true? What kind of applications generate such high volume of unstructured data? and what can be done with such data?

Michael Olson: You have to talk to a firm like IDC to get numbers. What I will say is that what you call “unstructured” data (I prefer “complex” because all data has structure) is big and getting bigger really, really fast.

It used to be that data was generated at human scale. You’d buy or sell something and a transaction record would happen. You’d hire or fire someone and you’d hit the “employee” table in your database.

These days, data comes from machines talking to machines. The servers, switches, routers and disks on your LAN are all furiously conversing. The content of their messages is interesting, and also the patterns and timing of the messages that they send to one another. (In fact, if you can capture all that data and do some pattern detection and machine learning, you have a pretty good tool for finding bad guys breaking into your network.) Same is true for programmed trading on Wall Street, mobile telephony and many other pieces of technology infrastructure we rely on.

Hadoop knows how to capture and store that data cheaply and reliably, even if you get to petabytes. More importantly, Hadoop knows how to process that data — it can run different algorithms and analytic tools, spread across its massively parallel infrastructure, to answer hard questions on enormous amounts of information very quickly.

Q6. Why building data analysis applications on Hadoop? Why not using already existing BI products?

Michael Olson: Lots of BI tools today talk to relational databases. If that’s the case, then you’re constrained to operating on data types that an RDBMS understands, and most of the data in the world — see above — doesn’t fit in an RDBMS. Also, there are some kinds of analyses — complex modeling of systems, user clustering and behavioral analysis, natural language processing — that BI tools were never designed to handle.

I want to be clear: RDBMS engines and the BI tools that run on them are excellent products, hugely successful and handling mission-critical problems for demanding users in production every day. They’re not going away. But for a new generation of problems that they weren’t designed to consider, a new platform is necessary, and we believe that that platform is Apache Hadoop, with a new suite of analytic tools, from existing or new vendors, that understand the data and can answer the questions that Hadoop was designed to handle.

Q7. Why Cloudera? What do you see as your main value proposition?

Michael Olson: We make Hadoop consumable in the way that enterprises require.
Cloudera Enterprise provides an open source platform for data storage and analysis, along with the management, monitoring and administrative applications that enterprise IT staff can use to keep the cluster running. We help our customers set and meet SLAs for work on the cluster, do capacity planning, provision new users, set and enforce policies and more. Of course Cloudera Enterprise comes with 24×7 support and a subscription to updates and fixes during the year.

When one of our customers deploys Hadoop, it’s to solve serious business problems. They can’t tolerate missed deadlines. They need their existing IT staff, who probably know how to run an Oracle database or VMware or other big data center infrastructure. That kind of person can absolutely run Hadoop, but needs the right applications and dashboards to do so. That’s what we provide.

Q8. In your opinion, what role will RDBMS and classical Data Warehouse systems play in the future in the market for Analytical Data Platforms? What about other data stores such as NoSQL and Object Databases? Will they play a role?

Michael Olson: I believe that RDBMS and classic EDWs are here to stay. They’re outstanding at what they do — they’ve evolved alongside the problems they solve for the last thirty years. You’d be nuts to take them on. We view Hadoop as strictly complementary, solving a new class of problems: Complex analyses, complex data, generally at scale.

As to NoSQL and ODBMS, I don’t have a strong view. The “NoSQL” moniker isn’t well-defined, in my opinion.
There are a bunch of different key-value stores out there that provide a bunch of different services and abstractions. Really, it’s knives and arrows and battleships — they’re all useful, but which one you want depends on what kind of fight you’re in.
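The “different services and abstractions” these stores expose mostly vary around a small common core. A hypothetical minimal key-value interface, invented here for illustration and not the API of any particular product, might look like:

```python
# Hypothetical minimal key-value store interface, for illustration only:
# not any real NoSQL product's API. Real systems differentiate on what
# they layer on top of this core: TTLs, secondary indexes, range scans,
# replication and consistency guarantees.

class KVStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        """Store an opaque value under key, overwriting any old value."""
        self._data[key] = value

    def get(self, key, default=None):
        """Return the value for key, or default if the key is absent."""
        return self._data.get(key, default)

    def delete(self, key):
        """Remove key; return True if it existed."""
        return self._data.pop(key, None) is not None

store = KVStore()
store.put("user:42", b'{"name": "Ada"}')
assert store.get("user:42") == b'{"name": "Ada"}'
assert store.delete("user:42") and store.get("user:42") is None
```

Which extensions you need on top of this core — durability, distribution, richer queries — is exactly the “which fight you’re in” question Olson raises.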

Q9. Is Cloud technology important in this context? Why?

Michael Olson: “Cloud” is a deployment detail, not fundamental. Where you run your software and what software you run are two different decisions, and you need to make the right choice in both cases.

Q10. Looking at three elements: Data, Platform, and Analysis, what are the main business and technical challenges ahead?

Michael Olson: Data is the big one. Seriously: More. More complex, more variable, more useful if you can figure out what’s locked up in it. More than you can imagine, even if you take this statement into account.

We obviously need to improve the platforms we have, and I think the next decade will be an exciting time for that. That’s good news — I’ve been in the database industry since 1986, and it has frankly been pretty dull. The same is true for analyses, but our opportunities there will be constrained by both the platforms we have and the data on which we can operate.

Q11. Anything you wish to add?

Michael Olson: Thanks for the opportunity!

–mike
________________
Michael Olson, Chief Executive Officer, Cloudera.
Mike was formerly CEO of Sleepycat Software, makers of Berkeley DB, the open source embedded database engine. Mike spent two years at Oracle Corporation as Vice President for Embedded Technologies after Oracle’s acquisition of Sleepycat in 2006. Prior to joining Sleepycat, Mike held technical and business positions at database vendors Britton Lee, Illustra Information Technologies and Informix Software. Mike has Bachelor’s and Master’s degrees in Computer Science from the University of California at Berkeley.
–––––––––––––––––

Related Post

The evolving market for NoSQL Databases: Interview with James Phillips.

–––––––––––––––––

Mar 26 11

The evolving market for NoSQL Databases: Interview with James Phillips.

by Roberto V. Zicari

“It is possible we will see standards begin to emerge, both in on-the-wire protocols and perhaps in query languages, allowing interoperability between NoSQL database technologies similar to the kind of interoperability we’ve seen with SQL and relational database technology.” — James Phillips.
_______________

In my understanding of how the market of Data Management Platforms is evolving, I have identified three phases:
Phase I – New proprietary data platforms developed: Amazon (Dynamo), Google (BigTable). Both systems remained proprietary and are in use by Amazon and Google.

Phase II – The advent of open source developments: Apache projects such as Cassandra and Hadoop (MapReduce, Hive, Pig). Facebook and Yahoo! played major roles. A multitude of new data platforms emerged.

Phase III – Evolving analytical data platforms. Hadoop for analytics. Companies such as Cloudera, but also IBM’s BigInsights, are in this space.

I wanted to learn more about the background of Phases I and II. I have interviewed James Phillips, who co-founded Membase and, since last month, has been Co-Founder and Senior Vice President of Couchbase, the new company that originated from the merger of Membase and CouchOne.

RVZ

Phase I – New proprietary data platforms developed: Amazon (Dynamo), Google (BigTable). Both systems remained proprietary and are in use by Amazon and Google.

Q1. Why did Amazon and Google have to develop their own database systems? Why didn’t they use/”adjust” existing database systems?

James Phillips: Existing relational database technologies were a poor match for the flexibility, performance, scaling and cost requirements of these organizations. It wasn’t enough to simply “adjust” relational database technology; there was a wholesale rethinking required. See our (non marketing 🙂 white paper for a detailed look at why that was the case.

Phase II – The advent of open source developments: Apache projects such as Cassandra and Hadoop (MapReduce, Hive, Pig). Facebook and Yahoo! played major roles. A multitude of new data platforms emerged.

Q2. How was it possible that Amazon and Google’s proprietary systems were used as input for Open Source Projects?

James Phillips: Amazon and Google published academic papers [e.g. Google-BigTable, Amazon-Dynamo] highlighting the design of many of their data management technologies. These papers inspired the creation of a number of open source software projects.

The Apache Cassandra open source project, initially developed at Facebook, has been described by Jeff Hammerbacher, who led the Facebook Data team at the time, as a BigTable data model running on an Amazon Dynamo-like infrastructure.

Key Apache Hadoop project technologies (initially developed by Doug Cutting while employed at Yahoo!) were heavily influenced by published papers describing the Google File System and MapReduce technologies.

Q3. Why did Facebook and Yahoo! develop open source data platforms rather than proprietary software?

James Phillips: Obviously I can’t answer why another company made a business decision, because I wasn’t involved in the decisions. But one can make a reasonable guess as to why these companies would do this: Facebook is not in the database business; they are a social networking company.

They would probably have taken and used Dynamo and/or BigTable had those been open-sourced by Google and/or Amazon. But they weren’t, and Facebook was forced to write Cassandra. By open-sourcing the technology, Facebook could ostensibly benefit from the advancement of that technology by the open source community: new features, increased performance, bug fixes and other community-driven value.

Assuming rational behavior, one can reasonably infer that the potential value of these community-driven benefits was deemed greater than the perceived cost of possibly “arming” a potential social networking competitor with data management technology, of having to fully maintain the technology themselves, or of any sort of liability associated with open-sourcing the software.

There is also some recruiting value to be gained for companies like Facebook: by demonstrating that they are developing leading-edge technology and solving hard computer science problems, they are able to attract the best and brightest minds to the company.

Q4. Not everyone has data and scalability requirements such as Amazon and Google. Who currently needs such new data management platforms, and why?

James Phillips: Our white paper covers this in great detail. While not everyone has the scalability requirements, anyone building web applications (and who isn’t?) needs the flexibility and cost advantages these solutions deliver.
Google and Amazon had the biggest pain initially, driving the innovation, but everyone benefits now. Velcro was invented to solve problems in space travel. Not everyone struggles with those problems. But use cases for Velcro are still being discovered.

Q5. Is there a common denominator between the business models built around such open source projects?

James Phillips: The only common denominator between business models related to open source software is open source software. There are support, product, services, training and countless other offerings that are derived from, based on or related to open source software projects.

Q6. Last month Membase and CouchOne merged to form Couchbase. What is the reasoning for this merge?

James Phillips: Prior to merging, Membase and CouchOne had focused on different layers of the NoSQL database technology stack:

Membase had focused on distributed caching, cluster management and high-performance memory-to-disk data flows.

CouchOne had concentrated on advanced indexing, document-oriented operations, real-time map-reduce, replication and support for mobile/web application synchronization.

There were many Membase customers asking for features of the CouchOne platform, and vice versa. This merger has allowed us each to eliminate roughly 18 months of redundant R&D we would have done separately. We’ve effectively doubled the size of our engineering team and eliminated two net years of work, allowing us to get better products to market more quickly and to focus on innovating rather than duplicating functionality.

Q7. Technically speaking, do you plan to “merge” the two products into one? If you do not “merge” the two products, what will you do instead?

James Phillips: Yes, Elastic Couchbase is a new product we will introduce this summer; it will combine technologies from Membase and CouchOne.

Q8. What happens to existing customers who are using Membase and CouchOne respective products?

James Phillips: We will continue to support customers using existing Membase and CouchOne products, while providing a seamless upgrade path for customers who want to migrate to Elastic Couchbase. Couch customers who migrate get a higher-performance, elastic version of CouchDB. Membase customers who migrate gain the ability to index and query data stored in the cluster.

Q9. How do you see the NoSQL market evolving in the next 12 months?

James Phillips: It is possible we will see standards begin to emerge, both in on-the-wire protocols and perhaps in query languages, allowing interoperability between NoSQL database technologies similar to the kind of interoperability we’ve seen with SQL and relational database technology. It would not be surprising to see additional consolidation as well.
–––––––––––––––––––––––––––––
James Phillips, Co-Founder and Senior Vice President, Couchbase.

A twenty-five year veteran of the software industry, James Phillips started his career writing software for the Apple II and TRS-80 microcomputer platforms. In 1984, at age 17, he co-founded his first software company, Fifth Generation Systems, which was acquired by Symantec in 1993 forming the foundation of Symantec’s PC backup software business.
Most recently, James was co-founder and CEO of Akimbi Systems, a venture-backed software company acquired by VMware in 2006. Book-ended by these entrepreneurial successes, James has held executive leadership roles in software engineering, product management, marketing and corporate development at large public companies including Intel, Synopsys and Intuit and with venture-backed software startups including Central Point Software (acquired by Symantec), Ensim and Actional Corporation (acquired by Progress Software). Additionally, James spent two years as a technology investment banker with PaineWebber and Robertson Stephens and Co., delivering M&A advisory services to software companies.
James holds a BS in Mathematics and earned his MBA, with honors, from the University of Chicago.
He currently serves on the board of directors of Teneros and as an investor in and advisor to a number of privately-held software companies including Delphix, Replay Solutions and Virsto.

For Further Reading

NoSQL Database Technology.
White Paper, Couchbase.
This white paper introduces the basic concepts of NoSQL Database technology and compares them with RDBMS.

Paper | Introductory | English | Download (PDF) | March 2011

Related Posts
“Marrying objects with graphs”: Interview with Darren Wood.

“Distributed joins are hard to scale”: Interview with Dwight Merriman.

On Graph Databases: Interview with Daniel Kirstenpfad.

Interview with Rick Cattell: There is no “one size fits all” solution.