Managing telemetry data and information on steering and calibrating scientific on-board instruments with an object database.
Interview with Dr. Johannes Riedinger, Herschel Mission Manager, European Space Agency (ESA), and Dr. Jon Brumfitt, System Architect of the Herschel Science Ground Segment, European Space Agency (ESA).
———————————
More Objects in Space…
I became aware of another very interesting project at the European Space Agency (ESA). On May 14, 2009, ESA launched an Ariane 5 rocket carrying the largest telescope ever flown: the “Herschel” telescope, 3.5 meters in diameter. The satellite, whose orbit is some 1.5 million kilometers from Earth, will operate for 48 months.
One interesting aspect of this project, for us at odbms.org, is that they use an object database to manage telemetry data and information on steering and calibrating scientific on-board instruments.
I had the pleasure to interview Dr. Johannes Riedinger, Herschel Mission Manager, and Dr. Jon Brumfitt, System Architect of Herschel Scientific Ground Segment, both at the European Space Agency.
Hope you'll enjoy the interview!
RVZ
Q1. What is the mission of the “Herschel” telescope?
Johannes: The Herschel Space Observatory is the latest in a series of space telescopes that observe the sky at far-infrared wavelengths that cannot be observed from the ground. The most prominent predecessors were the Infrared Astronomical Satellite (IRAS), a joint US, UK, NL project launched in 1983; next in line was the Infrared Space Observatory (ISO), a telescope the European Space Agency launched in 1995, which was followed by the Spitzer Space Telescope, a NASA mission launched in 2003. The two features that distinguish Herschel from all its predecessors are its large primary mirror, which translates into a high sensitivity because of the large photon-collecting area, and the fact that it is sensitive out to a wavelength of 670 μm. Extending the wavelength range of its predecessors by more than a factor of 3, Herschel is closing the last big gap that has remained in viewing celestial objects at wavelengths between 200 μm and the sub-millimetre regime. In the wavelength range in which Herschel overlaps to varying degrees with its predecessors, from 57 to 200 μm, Herschel’s big advantage is once more the size of its primary mirror. Because spatial resolution at a given wavelength improves linearly with telescope diameter, Herschel has 4 to 6 times sharper vision than all earlier far-infrared telescopes.
Q2. In a report in 2010 [1] you wrote that “you collect every day an average of six to seven gigabit raw telemetry data and additional information for steering and calibrating three scientific on-board instruments“. Is this still the case? What are the main technical challenges with respect to data processing, manipulation and storage that this project poses?
Johannes: The average rate at which we have been receiving raw data from Herschel over the past 18 months is about 12 gigabits/day, which increases somewhat through the addition of related data by the time it is ingested into the database. The amount of data, and its distribution to the three national data centres that monitor the health and calibration of the instruments, is quite manageable. More challenging are the data volumes that need to be handled in the form of various levels of scientific data products, which are generated in the data processing pipeline. Another challenge, even for today’s computers, is the amount of memory some products require during processing. This goes significantly beyond the amount of memory a standard desktop or laptop computer comes with, and even beyond the amount of disk space such machines used to come with until a few years ago: only recently have we procured two computers with 256 GB of RAM for the grid on which we run the pipeline to efficiently process some of the data.
Q3. What are the requirements of the “Herschel” data processing software? And what are the main components and functionalities of the Herschel software?
Jon: The software is required to plan daily sequences of astronomical observations and associated spacecraft manoeuvres, based on proposals submitted by the astronomical community, and to turn these into telecommand sequences to uplink to the spacecraft. The data processing system is required to process the resulting telemetry and convert it into scientific products (images, spectra, etc.) for the astronomers. The “uplink” system is particularly critical, because it must have high availability to keep the spacecraft supplied with commands. Significant parts of the software were required to be working several years before launch, so that they could be used to test the scientific instruments in the laboratory. This approach, which we call “smooth transition”, ensures that certain critical parts of the software are very mature by the time we come to launch and that the instruments have been extensively tested with the real control system.
The uplink chain starts with a proposal submission system that allows astronomers to submit their observation requests. The astronomer downloads a Java client which communicates with a JBOSS application server to ingest proposals into the Versant database. Next, the mission planning system allows the mission planner to schedule sequences of observations. This is a complex task which is subject to many constraints on the allowed spacecraft orientation, the target visibility, stray light, ground station availability, etc. It is important to minimize the time wasted by slewing between targets. Each schedule is expanded into a sequence of several thousand telecommands to control the spacecraft and scientific instruments for a period of 24 hours. The spacecraft is in contact with the Earth only once a day for typically 3 hours, during which time a new schedule is uplinked and data from the previous day is downlinked and ingested into the Versant database. Between contact periods, the spacecraft executes the commands autonomously. The sequences of instrument commands have to be accurately synchronized with each little manoeuvre of the spacecraft when we are mapping areas of the sky. This is achieved by a system known as the “Common Uplink System” in which the detailed commanding is programmed in a special language that models the on-board execution timing.
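The expansion of a schedule into time-tagged commands can be sketched in Java. This is only an illustrative guess at the idea, not the actual Common Uplink System language: the command names, timing offsets and class names below are all invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of schedule expansion: each observation in the daily
// schedule becomes a series of time-tagged telecommands, so the spacecraft
// can execute them autonomously between ground contacts.
public class ScheduleExpander {

    public static class TimedCommand {
        public final long executionTime;   // seconds from start of the operational day
        public final String command;
        public TimedCommand(long t, String c) { executionTime = t; command = c; }
    }

    // Expand one observation into a slew, a configure and an expose command,
    // each tagged with its on-board execution time (offsets are invented).
    public static List<TimedCommand> expand(long startTime, String target, long exposure) {
        List<TimedCommand> cmds = new ArrayList<>();
        cmds.add(new TimedCommand(startTime, "SLEW_TO " + target));
        cmds.add(new TimedCommand(startTime + 60, "CONFIGURE_INSTRUMENT"));
        cmds.add(new TimedCommand(startTime + 120, "EXPOSE " + exposure));
        return cmds;
    }
}
```

The point of the real system, as described above, is that the timing model mirrors the on-board execution timing, so instrument commands stay synchronized with the spacecraft manoeuvres.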
The other major component of the software is the Data Processing system. This consists of a pipeline that processes the telemetry each day and places the resulting data products in an archive. Astronomers can download the products. We also provide an interactive analysis system which they can download to carry out further specialized analysis of their data.
Overall, it is quite a complex system with about 3 million lines of Java code (including test code) organized as 1100 Java packages and 300 CVS modules. There are 13,000 classes of which 300 are persistent.
Q4. You have chosen to have two separate database systems for Herschel: a relational database for storing and managing processed data products and an object database for storing and managing proposal data, mission planning data, telecommands and raw (unprocessed) telemetry. Is this correct? Why two databases? What is the size of the data expected for each database?
Jon: The processed data products, which account for the vast bulk of the data, are kept in a relational database, which forms part of a common infrastructure shared by many of our ESA science projects. This archive provides a uniform way of accessing data from different missions and performing multi-mission queries. Also, the archive will need to be maintained for years after the Herschel project finishes. The Herschel data in this archive is expected to grow to around 50 TB.
The proposal data, scheduling data, telecommands and raw (unprocessed) telemetry are kept in the Versant object database. This database will only grow to around 2 TB. In fact, back in 2000, we envisaged that all the data would be in the object database, but it soon became apparent that there were enormous benefits from adopting a common approach with other projects.
The object database is very good for the kind of data involved in the uplink chain. This typically requires complex linking of objects and navigational access. The data in the relational archive is stored in the form of FITS files (a format widely used by astronomers) and the appropriate files can be found by querying the meta-data that consist only of a few hundred keywords even for files that can be several gigabytes in size. We need to deliver the data as FITS files, so it makes sense to store it in that form, rather than to convert objects in a database into files in response to each astronomer query. Our interactive analysis system retrieves the files from the archive and converts them back into objects for processing.
Johannes: At present the Science Data Archive contains about 100 TB of data. This includes older versions of the same products that were generated by earlier versions of the pipeline software—we regenerate the products with the latest software and the best current knowledge of instrument calibration about every 6 months. Several years from now, when we will have generated the “best and final” products for the Herschel legacy archive and after we have discarded all the earlier versions of each product, we expect to end up, as Jon says, with an archive volume of about 50 to 100 TB. This is up by two orders of magnitude from the legacy archive of ISO which, given today’s technology, you can carry around in your pocket on a 512 GB external disk drive.
Q5. How do the two databases interact with each other? Are there any similarities with the Gaia data processing system?
Jon: The two database systems are largely separate as they perform rather different roles. The gap is bridged by the data processing pipeline which takes the telemetry from the Versant database, processes it and puts the resulting products in the archive.
We do, in fact, have the ability to store products in the Versant database and this is used by experts within the instrument teams, although for the normal astronomer everything is archive based. There are many kinds of products and in principle new kinds of products can be defined to meet the needs of data processing as the mission progresses. If it were necessary to define a new persistent class for each new product type, we would have a major schema evolution problem. Consequently, product classes are defined by “composition” of a fixed set of building blocks to build an appropriate hierarchical data-structure. These blocks are implemented as a fixed set of persistent classes, allowing new product types to be defined without a schema change.
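The composition idea can be sketched in plain Java. Everything here is a hypothetical illustration (the class and method names are invented, not taken from the Herschel code): it only shows how a fixed set of persistent building blocks can represent arbitrary new product types without a schema change.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: only this fixed building-block class (and the types it
// composes) would need to be persistence-capable. A new product *type* is just
// a new composition, so the database schema never changes.
public class CompositeProduct {
    private final Map<String, Object> children = new LinkedHashMap<>();
    private final Map<String, String> meta = new LinkedHashMap<>();

    public void setMeta(String keyword, String value) { meta.put(keyword, value); }
    public String getMeta(String keyword) { return meta.get(keyword); }

    public void set(String name, Object component) { children.put(name, component); }
    public Object get(String name) { return children.get(name); }

    // A "new product type" is defined purely by composition, not by a new class:
    public static CompositeProduct makeImageProduct(double[][] pixels) {
        CompositeProduct p = new CompositeProduct();
        p.setMeta("TYPE", "IMAGE");
        p.set("image", pixels);   // array building block holding the pixel data
        return p;
    }
}
```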
The Versant database requires that first-class objects in the database are “enhanced” to make them persistence capable. However, we need to use Plain Old Java Objects for the products so that our interactive analysis system does not require a Versant installation. We have solved this by using “product persistor” classes that act as adaptors to make the POJOs persistent.
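The separation between POJO products and their persistors can be sketched as follows. This is an assumed shape of the adaptor idea, with invented names; in the real system the persistor class would be the one that is Versant-enhanced.

```java
// Hypothetical sketch of the "product persistor" idea: the product itself is a
// plain old Java object with no database dependency, so the interactive
// analysis system can use it without a Versant installation, while a separate
// adaptor class holds the persistent state.
public class PersistorSketch {

    // Plain POJO: no persistence code at all.
    public static class Product {
        private final String name;
        private final byte[] data;
        public Product(String name, byte[] data) { this.name = name; this.data = data; }
        public String getName() { return name; }
        public byte[] getData() { return data; }
    }

    // Adaptor: in the real system this class would be persistence-capable;
    // here it just demonstrates the separation of concerns.
    public static class ProductPersistor {
        private final String name;
        private final byte[] data;
        public ProductPersistor(Product p) { this.name = p.getName(); this.data = p.getData(); }
        public Product restore() { return new Product(name, data); }
    }
}
```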
Johannes: Concerning Gaia, I need to make a point that has little to do with database technology per se, but a lot to do with using databases as a tool for different purposes, even in the same generic area of science (astronomy in this case): I want to emphasize the differences between Herschel and Gaia rather than the similarities.
Herschel is an observatory, i.e. the satellite collects data from any celestial object which is selected by a user to be observed in a particular observing mode with one or more of Herschel’s three on-board instruments. Each observing mode generates its own, specific set of products that is characteristic of the observing mode rather than of the celestial object that is observed: a set of products can, for example, consist of a set of 2-dimensional, monochromatic images at infrared wavelengths, from which various false-colour composites can be generated by superposition. In the lingo of this journal: observing modes are the classes, the data products are the objects. If I observe two different celestial objects in the same observing mode, I can directly compare the results. For example, the distribution of colours in a false-colour image immediately tells me something about the temperature distribution of the material that is present; the intensity of the colour tells me something about the amount of material of a given temperature. And if I also have a spectrum of the celestial object, this tells me something about the chemical composition of the material and its physical state, such as pressure, density, and temperature.
Gaia, on the other hand, is not an observatory and it does not offer different observing modes that can be requested. It measures the same set of a dozen or so parameters for each and every object time and time again, with some of these parameters being time, brightness in different colours, position relative to other objects that appear in the same snapshot, and radial velocity. But every time it measures these parameters for a particular object it measures them in a different context, i.e. in combination with a different set of other objects that appear in the same snapshot. So you end up with a billion objects, each appearing on a different ensemble of several dozen snapshots, and you have to find the “global best fit” that minimizes the residual errors of a dozen parameters fitted to a billion objects. Computationally, and compared to Herschel, this is a gargantuan task. But it is a single task—derive the mechanical state of an ensemble of a billion objects whose motion is controlled by relativistic celestial mechanics in the presence of a few possible disturbances, such as orbiting planets or unstable stellar states (pulsating or otherwise variable stars). On Herschel, the scientific challenge is somewhat different: We are not so intensely interested in the state of motion of individual objects, we are interested in the chemical composition of matter—much of which does not consist of stars but of clouds of gas and dust which Gaia cannot “see”—and its physical state.
Q6. You have chosen an object database, from Versant, for storing and managing raw (unprocessed) telemetry data. What is the rationale for this choice? How do you map raw data into database objects? What is a typical object database representation of these subsets of “Herschel” data stored in the Versant Object Database? What are the typical operations on such data?
Jon: Back in 2000, we looked at the available object databases. Versant and Objectivity appeared to have the scalability to cope with the amount of data we were envisaging. After a detailed comparison, we chose Versant although I think either would have done the job. At this stage, we still envisaged storing all the product data in the object database.
The uplink objects, such as proposals, observations, telecommands, schedules, etc, are stored directly as objects in the Versant database, as you might expect. The telemetry is a special case because we want to preserve the original data exactly as it arrives in binary packets, so that we can always get back to the raw data. So we have a TelemetryPacket class that encapsulates the binary packet and provides methods for accessing its contents. To support queries, we decode key meta-data from the binary packet when the object is constructed and store it as attributes of the object. This allows querying to retrieve, for example, packets for a given time range for a specified instrument or packets for a particular observation.
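The TelemetryPacket idea can be sketched in plain Java. This is an illustrative guess at the shape of such a class; the actual packet layout is not given in the interview, so the 4-byte instrument identifier and 8-byte timestamp below are invented assumptions.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: the raw binary packet is preserved verbatim, while a
// few key fields are decoded once at construction time and stored as
// attributes, so the database can index and query them without ever
// re-parsing (or losing) the original bytes.
public class TelemetryPacket {
    private final byte[] raw;          // original packet, preserved exactly
    private final int instrumentId;    // decoded metadata, queryable
    private final long timestamp;

    public TelemetryPacket(byte[] raw) {
        this.raw = raw.clone();
        ByteBuffer buf = ByteBuffer.wrap(raw);
        this.instrumentId = buf.getInt();   // assumed layout: first 4 bytes
        this.timestamp = buf.getLong();     // assumed layout: next 8 bytes
    }

    public int getInstrumentId() { return instrumentId; }
    public long getTimestamp() { return timestamp; }
    public byte[] getRawPacket() { return raw.clone(); }
}
```

Queries such as "all packets for instrument X in a given time range" would then run against the decoded attributes, while the raw bytes stay available for reprocessing.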
The persistent data model is known as the Core Class Model (CCM). This was developed by starting with a domain model describing the key entities in the problem domain, such as proposals and observations, and then adding methods by analysing the interactions implied by the use-cases. The model then evolved by the introduction of various design patterns and further classes related to the implementation domain.
Q7. How did you specifically use the Versant object database? Are there any specific components of the Versant object database that were (are) crucial for the data storage and management part of the project? If yes, which ones? Were there any technical challenges? If yes, which ones?
Jon: With a project that spans 20 years from the initial concept at the science ground segment level to generation and availability of the final archive of scientific products, it is important to allow for technology becoming obsolete. At the start of the science ground segment development part of the project, a decade ago, we looked at the standards that were then emerging (ODMG and JDO) to see if these could provide a vendor-independent database interface. However, the standards were not sufficiently mature, so we designed our own abstraction layer for the database and the persistent data. That meant that, in principle, we could change to a different database if needed. We tried to only include features that you would reasonably expect from any object database.
We organized all the persistent classes in the system into a single package. This was important to keep changes to the schema under tight control by ensuring that the schema version was defined by a single deliverable item. Consequently, the CCM classes are responsible for persistence and data integrity, but have little application-specific functionality. The persistent objects can be customized by the applications for example by using the decorator pattern. For various reasons, the persistent classes tend to contain Versant-specific code, so by keeping them all in one package it keeps the vendor-specific code together in one module behind the abstraction layer.
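The abstraction-layer idea can be sketched as an interface plus one implementing module. The names and signatures below are invented for illustration; the point is only that applications code against the interface while all vendor-specific calls live in a single module behind it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of a vendor-independent database abstraction layer.
public class StoreSketch {

    // Applications see only this interface, with features one would
    // reasonably expect from any object database.
    public interface ObjectStore {
        void ingest(Object obj);
        <T> List<T> query(Class<T> type, Predicate<T> filter);
    }

    // Trivial in-memory implementation, standing in for the vendor module;
    // a Versant-backed implementation could replace it behind the interface.
    public static class InMemoryStore implements ObjectStore {
        private final List<Object> objects = new ArrayList<>();
        public void ingest(Object obj) { objects.add(obj); }
        public <T> List<T> query(Class<T> type, Predicate<T> filter) {
            List<T> result = new ArrayList<>();
            for (Object o : objects) {
                if (type.isInstance(o) && filter.test(type.cast(o))) {
                    result.add(type.cast(o));
                }
            }
            return result;
        }
    }
}
```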
We needed to propagate copies of subsets of the data to databases at the three instrument control centres. At the time, there wasn’t an out-of-the-box solution that did what we wanted, so we developed our own data propagation. This was quite a substantial amount of work. One nice thing is that this is all hidden within our database abstraction layer, so that it is transparent to the applications. We continuously propagate the database to all three instrument teams and we also propagate it to our test system and backup system.
Another important factor is backup and recovery and it makes a big difference how you organize the data. We have an uplink node and a set of telemetry nodes. There is a current telemetry node into which new data is ingested. All the older telemetry nodes are read-only as the data never changes, which means you only have to back them up once. The abstraction layer is able to hide the mapping of objects onto databases.
Q8. Is scalability an important factor in your application?
Jon: In the early stages, when we were considering putting all the data into the object database, scalability was a very important issue. It became much less of an issue when we decided to store the products in the common archive.
Q9. Do you have specific time requirements for handling data objects?
Jon: We need to ingest the 1.5 GB of telemetry each day and propagate it to the instrument sites. In general the performance of the Versant database is more than adequate. It is also important that we can plan (and if necessary replan) the observations for an operational day within a few hours, although this does not pose a problem in terms of database performance.
Q10. By the end of the four-year life cycle of “Herschel” you are expecting a minimum of 50 terabytes of data which will be available to the scientific community. How do you plan to store, maintain and process this amount of data?
Johannes: At the European Space Astronomy Centre (ESAC) in Madrid we keep the archives of all of ESA’s astrophysics missions, and a few years ago we started to also import and organise into archives the data of the planetary missions that deal with observations of the sun, planets, comets and asteroids. The user interface to all these archives is basically the same. This generates a kind of “corporate identity” of these archives and if you have used one you know how to use all of them.
For many years, the Science Archive Team at ESAC has played a leading role in Europe in the area of “Virtual Observatories”. If you are looking for some specific information on a particular astronomical object, chances are good, and they are getting better by the day, that some archive in the world already has this information. ESA’s archives, and amongst them the Herschel legacy archive once it has been built, are accessible through this virtual observatory from practically anywhere in the world.
Q11. You are now roughly half way through the life cycle of “Herschel”. What are the main lessons learned so far from the project? What are the next steps planned for the project and the main technical challenges ahead?
Johannes: We are very lucky, and without doubt the more than 1,300 individuals who are listed on a web page as having made major contributions to the success of this project have every reason to be proud, that we are operating this world class observatory with only minor problems in satellite hardware which are well under control.
We are approximately half way through the in-orbit lifetime of the mission after launch. But this is not the same as saying that we have reaped half of the science results from the mission.
For the first 4 months of the mission, i.e. until mid-September 2009, the satellite and the ground segment underwent a checkout, a commissioning and a performance verification phase. Through a large number of “engineering” and “calibration” observations in these initial phases we ensured that the observing modes we had planned to provide and had advertised to our users worked in principle, that the telescope pointed in the right direction, that the temperatures did not drift beyond specification, etc. From mid-September to about the end of 2009, we performed an increasing number of scientific observations of astronomical targets in what we called the “Science Demonstration Phase”. These observations were “the proof of the pudding” and showed that, indeed, the astronomers could do the science they had intended to do with the data they were getting from Herschel: More than 250 scientific papers were published in mid-2010 that are based on observations made during this Science Demonstration Phase.
We have been in the “Routine Science Mission Phase” for about one year now, and we expect to remain in this phase for at least another year and a half, i.e. we have collected perhaps 40% of the scientific data Herschel will eventually have collected when the Liquid Helium coolant runs out.

Already now we can see that Herschel will revolutionize some of the currently accepted theories and concepts of how and where stars are born, how they enrich the interstellar medium with elements heavier than Helium when they die, and in many other areas such as how the birth rate of stars has changed over the age of the universe, which is over 13 billion years old.

But, most importantly, we need to realise and remember that sustained progress in science does not come about on short time scales. The initial results that have already been published are spectacular, but they are only the tip of the iceberg, they are results that stare you in the face. Many more results, which together will change our view of the cosmos as profoundly as the cream of the coffee results we already see now, will only be found after years of archival research, by astronomers who plough through vast amounts of archive data in the search of new, exciting similarities between seemingly different types of objects and stunning differences between objects that had been classified to be of the same type. And this kind of knowledge will accumulate from Herschel data for many years beyond the end of the Herschel in-orbit lifetime. ESA will support this slower but more sustained scientific progress through archival research by keeping together a team of about half a dozen software experts and several instrument specialists for up to 5 years after the end of the in-orbit mission.
In addition, members of the astronomical community will, as they have done on previous missions, contribute to the archive their own, interactively improved products which extract features from the data that no automatic pipeline process can extract no matter how clever it has been set up to be.
I believe it is fair to say that no-one yet knows the challenges the software developers and the software users will face.
But I do expect that a lot more problems will arise from the scientific interpretation of the data than from the software technologies we use, which are up to the task that is at hand.
—————————
Dr. Johannes Riedinger, Herschel Mission Manager, European Space Agency.
Johannes Riedinger has a background in Mathematics and Astronomy and has worked on space projects since the start of his PhD thesis in 1980, when he joined a team at Max-Planck-Institut für Astronomie in Heidelberg in the development of an instrument for a space shuttle payload. Working on a successor project, the ISO Photo-Polarimeter ISOPHOT, he worked in industry from 1985 to 1988 as the System Engineer for the phase B study of this payload. Johannes joined the European Space Agency in 1988 and, having contributed to the implementation of the Science Ground Segments of the Infrared Space Observatory (ISO, launched in 1995) and the X-ray Multi Mirror telescope (XMM-Newton, launched in 1999), he became the Herschel Science Centre Development Manager in 2000. Following successful commissioning of the satellite and ground segment, he became Herschel Mission Manager in 2009.
Dr. Jon Brumfitt, System Architect of Herschel Scientific Ground Segment, European Space Agency.
Jon Brumfitt has a background in Electronics with Physics and Mathematics and has worked on several of ESA’s astrophysics missions, including IUE, Hipparcos, ISO, XMM and currently Herschel. After completing his PhD and a post-doctoral fellowship in image processing, Jon worked on data reduction for the IUE satellite before joining Logica Space and Defence in 1980. In 1984 he moved to Logica’s research centre in Cambridge and then in 1993 to ESTEC in the Netherlands to work on the scientific ground segments for ISO and XMM. In January 2000, he joined the newly formed Herschel team as ground segment System Architect. As Herschel approached launch, he moved down to the European Space Astronomy Centre in Madrid to become part of the Herschel Science Operations Team.
For Additional Reading
“Herschel” telescope
[1] Data From Outer Space.
Versant.
This short paper describes a case study: the handling of telemetry data and information of the “Herschel” telescope from outer space with the Versant object database. The telescope was launched by the European Space Agency (ESA) with the Ariane 5 rocket on 14 May 2009.
Paper | Introductory | English | 2010
[2] ESA Web site for “Herschel”
[3] Additional Links
– SPECIALS/Herschel
– HERSCHEL: Exploring the formation of galaxies and stars
– ESA latest news
– ESA Press Releases
Related Post:
Objects in Space.
The biggest data processing challenge to date in astronomy: The Gaia mission.
“I believe that one should benchmark before making any technology decisions.”
An interview with Pieter van Zyl, creator of the OO7J benchmark.
In August last year, I published an interesting resource in ODBMS.ORG, the dissertation of Pieter van Zyl, from the University of Pretoria: “Performance investigation into selected object persistence stores”. The dissertation presented the OO7J benchmark.
OO7J is a Java version of the original OO7 benchmark (written in C++) from Mike Carey, David DeWitt and Jeff Naughton at the University of Wisconsin-Madison. The original benchmark tested Object Databases (ODBMS) performance. This project also includes benchmarking Object Relational Mapping (ORM) tools. Currently there are implementations for Hibernate on PostgreSQL, MySQL, db4o and Versant databases. The source code is available on sourceforge. It uses the GNU GPL license.
Together with Srini Penchikala of InfoQ, I interviewed Pieter van Zyl, creator of the OO7J benchmark.
Hope you'll enjoy the interview.
RVZ
Q.1 Please give us a summary of OO7J research project.
Pieter van Zyl: The study investigated the performance of object persistence, comparing ORM tools to object databases. ORM tools provide an extra layer between the business logic layer and the data layer. This study began with the hypothesis that this extra layer, and the mapping that happens within it, slows down the performance of object persistence.
The aim was to investigate the influence of this extra layer against the use of object databases that remove the need for this extra mapping layer. The study also investigated the impact of certain optimisation techniques on performance.
A benchmark was used to compare ORM tools to object databases. The benchmark provided criteria that were used to compare them with each other. The particular benchmark chosen for this study was OO7, widely used to comprehensively test object persistence performance. Part of the study was to investigate the OO7 benchmark in greater detail, to get a clearer understanding of the OO7 benchmark code and its inner workings.
Because of Java's popularity, reflected in the fact that most of the large persistence providers support persistence for Java objects, it was decided to use Java objects and focus on Java persistence. A consequence of this decision is that the OO7 benchmark, previously available only in C++, had to be re-implemented in Java as part of this study.
Included in this study was a comparison of the performance of an open source object database, db4o, against a proprietary object database, Versant. These representatives of object databases were compared against one another as well as against Hibernate, a popular open source representative of the ORM stable. It is important to note that these applications were initially used in their default modes (out of the box). Later, some optimisation techniques were incorporated into the study, based on feedback obtained from the application developers.
Q.2 Please give us a summary of the recommendations of the research project.
Pieter van Zyl: The study found that:
− The use of an index does help for queries. This was expected. Hibernate’s indexes seem to function better than db4o’s indexes during queries.
− Lazy and eager loading were also investigated, and their influence stood out in the study. Eager loading improves traversal times if the tree being traversed can be cached between iterations. Although the first run can be slow, by caching the whole tree, calls to the server in follow-up runs are reduced. Lazy loading improves query times, and unnecessary objects are not loaded into memory when not needed.
− Caching was investigated. It was found that caching helps to improve most of the operations. By using eager loading and a cache, the speed of repeated traversals of the same objects is increased. Hibernate, with its first-level cache, in some cases performs better than db4o during traversals with repeated access to the same objects.
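The lazy-loading behaviour described in these findings can be sketched generically in Java. This is not Hibernate or db4o code; it is an invented, minimal illustration of the trade-off: a lazy reference defers the fetch until first access, then caches the result so repeated accesses avoid further calls to the server.

```java
import java.util.function.Supplier;

// Generic sketch of lazy loading: the value is fetched from the "server"
// (simulated by a Supplier) only on first access, then cached locally.
public class LazyRef<T> {
    private final Supplier<T> loader;  // simulates a database fetch
    private T value;
    private boolean loaded = false;

    public LazyRef(Supplier<T> loader) { this.loader = loader; }

    public synchronized T get() {
        if (!loaded) {            // first access triggers the fetch
            value = loader.get();
            loaded = true;
        }
        return value;             // later accesses hit the cached value
    }

    public boolean isLoaded() { return loaded; }
}
```

Eager loading is the opposite choice: populate everything up front, paying a slow first run in exchange for cheap repeated traversals, which is exactly the trade-off the study measured.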
When creating and using benchmarks it is important to clearly state what settings and environment are being used. In this study it was found that:
− Running db4o in client-server mode or embedded mode gives different performance results. It was shown that some queries are faster when db4o is run in embedded mode, and that some traversals are faster when db4o is run in client-server mode.
− It is important to state when a cache is being used and when it is cleared. It was shown that clearing the Versant cache at every transaction commit influenced the hot traversal times. It is also important, for example, to state whether a first- or second-level cache is being used in Hibernate, as this could influence traversal times.
− Having a cache does not always improve random inserts, deletes and queries. It was shown that the cache assisted traversals more than inserts, deletes and queries: with random inserts, deletes and queries, not all objects accessed will be in the cache. Also, if query caches are used, this must be stated clearly, otherwise the results can be misleading.
− It is important to note that it is quite difficult to generalise. One mechanism is faster with updates to smaller numbers of objects, others with larger numbers; some perform better with changes to indexed fields, others with changes to non-indexed fields; some perform better if traversals are repeated over the same objects, while others perform better in first-time traversals, which might be what the application under development needs. The application needs to be profiled to see if there is repeated access to the same objects in a client session.
− In most cases the cache helps to improve access times. But if an application does not access the same objects repeatedly, or accesses scattered objects, then the cache will not help as much. In these cases it is very important to look at the OO7 cold traversal times, which access the disk-resident objects. For cold traversals, or put differently, first-time access to objects, Versant is in most cases the fastest of all the mechanisms tested by OO7.
− By not stating the benchmark and persistence mechanism settings clearly, it is very easy to cheat and create false claims. For example, with Versant it is important to state what type of commit is being used: a normal commit or a checkpoint commit. The types of queries used can also impact benchmark results, and there is a slight performance difference between using find() and running a query in Hibernate.
− It is important to make sure that the performance techniques being used actually improve the speed of the application. It was shown that adding subselects to a MySQL database does not automatically improve speed.
This work formed part of my MSc. While the findings are not always surprising or new, the work showed that the OO7 benchmark can still be used to test today’s persistence frameworks. It really brought out performance differences between ORM tools and object databases. This work is also the first OO7 implementation that tested ORM tools and compared open source against commercial object databases.
Q.3 What is the current state of the project?
Pieter van Zyl: The project has implementations for db4o, Hibernate with PostgreSQL and MySQL, and the Versant database. The project currently works with settings files and an Ant script to run different configurations. The project is a complete implementation of the original OO7 C++ implementation. More implementations will be added in the future. I also believe that all results must be audited, so I will keep submitting benchmark results to vendors.
Q.4 What are the best practices and lessons learned in the research project?
Pieter van Zyl: See the answer to Question 2. What is interesting today is that benchmarkers are still not allowed to publish benchmark results for commercial products; their licenses prohibit it. We felt that academics must be allowed to investigate and publish their results freely. In the end we did comply with the licenses and submitted the work to the vendors.
Q.5 Do you see a chance that your benchmark be used by the industry? Why?
Pieter van Zyl: Yes, but I suspect they are using benchmarks already. These benchmarks are probably home-grown. Also, there are no de facto benchmarks for object database and ORM tool vendors, whereas a TPC benchmark exists for relational database vendors. While some vendors did use the OO7 benchmark in the late 90s, they seem not to use it any more, or maybe they have adapted it for in-house use.
OO7J could be used to test improvements from one version to the next. I have used it to benchmark differences between different db4o releases. We compared embedded versions of db4o with the client-server version, and this gave us valuable information: we could discern the differences in performance.
Currently OO7J has its own interface to the persistence store being benchmarked. This means that it can be extended to test most persistence tools. We wanted to use the JPA or JDO interfaces, but not all vendors support these standards.
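A vendor-neutral interface of the kind described might be sketched as follows. This is hypothetical (OO7J's actual interface lives in its source tree and may differ); the toy in-memory implementation exists only to show how such an adapter is exercised.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of a vendor-neutral persistence interface of the kind
// OO7J uses; each benchmarked product would supply its own adapter.
public class PersistenceSketch {
    interface PersistenceLayer {
        void open(String config);                         // connect / open the store
        void store(Object o);                             // make an object persistent
        <T> List<T> query(Class<T> type, Predicate<T> p); // run a query
        void commit();                                    // end the current transaction
        void close();                                     // release the store
    }

    // Toy adapter used only to demonstrate the interface contract.
    static final class InMemoryStore implements PersistenceLayer {
        private final List<Object> objects = new ArrayList<>();
        public void open(String config) { }
        public void store(Object o) { objects.add(o); }
        public <T> List<T> query(Class<T> type, Predicate<T> p) {
            List<T> result = new ArrayList<>();
            for (Object o : objects) {
                if (type.isInstance(o) && p.test(type.cast(o))) {
                    result.add(type.cast(o));
                }
            }
            return result;
        }
        public void commit() { }
        public void close() { }
    }

    public static void main(String[] args) {
        PersistenceLayer db = new InMemoryStore();
        db.open("oo7j.properties");   // hypothetical config file name
        db.store("AtomicPart#1");
        db.store("AtomicPart#2");
        db.commit();
        System.out.println(db.query(String.class, s -> s.endsWith("#2")).size()); // 1
        db.close();
    }
}
```

The benchmark code then talks only to `PersistenceLayer`, which is why a JPA, JDO, db4o or Versant backend can be swapped in without touching the model or the operations.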
Q.6 What feedback have you received so far?
Pieter van Zyl: The dissertation was well received; I got a distinction for the work. I submitted the benchmark to the vendors to get their input on the benchmark and on how to optimize their products. The feedback was good and no bugs were found. It is important that a benchmark is accurate and used consistently for all vendor implementations. I don’t think there are any quirks or inconsistencies in the benchmark code.
Jeffrey C. Mogul states that it is important that benchmarks should be repeatable, relevant, use realistic metrics, be comparable and widely used. I think OO7 complies with those requirements and I stayed as close as possible to OO7 with OO7J.
Also OO7J has been used by students at ETH Zurich – Department of Computer Science. Another object database vendor in America also contacted me about my work and wanted to use it for their benchmarking. Not sure how far they progressed.
Q.7 What are the main related works? How does OO7J research project compare with other persistence benchmarking approaches and what are the limitations of the OO7J project?
Pieter van Zyl: There were related attempts to create a Java implementation of OO7 in the late 90s by a few researchers, and Sun also created a Java version. These versions are no longer available and were not open sourced. See my dissertation for more details.
More recent work includes:
• The ETH Zurich CS department (Prof. Moira C. Norrie) created a Mavenized, GUI version of my work, but it includes changes to the database and some small issues still need to be sorted out.
• Prof. William R. Cook’s students’ effort: a public implementation of the OO7 benchmark.
Other benchmarking work in the Java object space:
• New benchmark that I discovered recently: JPA Performance Benchmark.
• The PolePosition benchmark.
These benchmarks are not entirely vendor independent. But they are open source, so one can examine the code and challenge it.
I think OO7 has one thing going for it that the others don’t have: it is still more widely used, especially in the academic world. It has a lot of vendor independence behind it historically, and it has had more reviews and more documentation on how it works internally.
But I have seen some implementations of OO7 that are not complete: they, for example, build only half the model and then don’t disclose these changes when publishing the results, or have only some of the queries or traversals working.
That is why I like to stay close to the original, well-known OO7. I document any changes clearly.
If you run Query 8 of OO7, you should be able to expect that it functions exactly like the original. If anyone modifies it, they should see this as an extension and rename the operation.
I have also included asserts/checkpoints to make sure the correct number of objects is returned for every operation.
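A result-count checkpoint of the kind mentioned above might look like this. It is an illustrative sketch, not the exact OO7J code, and the object counts are made-up numbers: the point is that an incomplete model or a broken query fails fast instead of silently producing results for the wrong workload.

```java
// Illustrative sketch of a result-count checkpoint: every operation verifies
// that it touched the expected number of objects for the configured database
// size. The counts below are invented for the example.
public class Checkpoint {
    static void check(String operation, long actual, long expected) {
        if (actual != expected) {
            throw new AssertionError(operation + ": expected " + expected
                    + " objects, got " + actual);
        }
    }

    public static void main(String[] args) {
        check("Traversal T1 (small model)", 97_000, 97_000); // passes silently
        try {
            check("Query Q8", 500, 560); // a short-count result is caught here
        } catch (AssertionError e) {
            System.out.println(e.getMessage());
        }
    }
}
```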
Limitations of OO7J:
• It needs to be upgraded to run in client/server mode. More concurrent clients must be created to run operations on the databases at the same time.
• Its configurations need to be updated to create terabyte- and petabyte-scale database models.
Q.8 To do list: What still needs to be done?
Pieter van Zyl:
• Currently there are three implementations, one for each of the products being tested. While they use the same model and operations, I found that parent objects must be saved before child objects in the Hibernate version; the original OO7 also had an implementation per product benchmarked. I want to create one code base that can switch between different products using configuration. The ETH students have attempted this already, but I am thinking of a different approach.
• Configurations for larger datasets
• More concurrent clients
• Investigate if I could use OO7 in a MapReduce world.
• Investigate if OO7 can be used to benchmark column-oriented databases
• Include Hibernate+Oracle combination.
Q.9 NoSQL/NRDBMS solutions are getting a lot of attention these days. Are there any plans to do a persistence performance comparison of NoSQL persistence frameworks in the future?
Pieter van Zyl: Yes, they will be incorporated. I still believe object databases are well suited to this environment.
I am still not sure that people are using them in the correct situations. I sometimes suspect people jump onto a hot technology without really benchmarking or understanding their application needs.
Q.10 What is the future road map of your research?
Pieter van Zyl: Investigate clustering, caches, MapReduce, Column-oriented databases and investigate how to incorporate these into my benchmarking effort.
I would also love to get more implementation experience, either with a vendor or building my own database.
Quoting Martin Fowler (Patterns of Enterprise Application Architecture): “Too often I’ve seen designs used or rejected because of performance considerations, which turn out to be bogus once somebody actually does some measurements on the real setup used for the application”
I believe that one should benchmark before making any technology decisions. People have a lot of opinions about what performs better, but there is usually not enough proof. There is a lot of noise in the market; cut through it and benchmark and investigate for yourself.
For Additional Reading
Pieter van Zyl, University of Pretoria.
Performance investigation into selected object persistence stores.
Dissertation | Advanced | English | LINK DOWNLOAD (187 pages, PDF) | February 2010
OO7J benchmark Pieter van Zyl and Espresso Research Group.
Software | Advanced | English | LINK User Manual | The code is available on SourceForge; access is through the SVN directory | February 2010
Related Articles
Object-Oriented or Object-Relational? An Experience Report from a High-Complexity, Long-Term Case Study.
Peter Baumann, Jacobs University Bremen
Object-Oriented or Object-Relational? We unroll this question based on our experience with the design and implementation of the array DBMS rasdaman which offers storage and query language retrieval on large, multi-dimensional arrays such as 2-D remote sensing imagery and 4-D atmospheric simulation results. This information category is sufficiently far from both relational tuples and object-oriented pointer networks to achieve a “fair” comparison where no approach has an immediate advantage.
Paper | Intermediate | English | LINK DOWNLOAD (PDF) | July 2010
More articles on ORM in the Object Databases section of ODBMS.ORG.
More articles on ORM in the OO Programming Section of ODBMS.ORG.
One minute of our time.
Many of us are very busy, engaged in activities, duties, hopes, worries, joy. This is our life.
But life can end within a few minutes or even less.
In dedication to those who suffer in Japan right now.
RVZ
Is it possible to have both objects and graphs?
This is what Objectivity has done. They recently launched InfiniteGraph, a distributed graph database for the enterprise. Is InfiniteGraph a NoSQL database? How does it relate to their Objectivity/DB object database?
To know more about it, I have interviewed Darren Wood, Chief Architect, InfiniteGraph, Objectivity, Inc.
RVZ
Q1. Traditionally, the obvious platform for most database applications has been a relational DBMS. Why do we need new Data Stores?
Wood: I think at some level, there have always been requirements for data stores that don’t fit the traditional relational data model.
Objectivity (and many others) have built long-term successful businesses meeting scale and complexity requirements that generally went beyond what was offered by the RDBMS market. In many cases, the other significant player in this market was not a generally available alternative technology at all, but “home grown” systems designed to meet a specific data challenge. This included everything from application-managed file-based systems to the better-known projects like Dynamo (Amazon) and BigTable (Google).
This trend is not in any way a failing of RDBMS products; it is simply a recognition that not all data fits squarely into rows and columns, and not all data access patterns can be expressed precisely or efficiently in SQL.
Over time, this market has simply grown to a point where it makes sense to start grouping data storage, consistency and access model requirements together and create solutions that are designed specifically to meet them.
Q2. There has been recently a proliferation of New data stores, such as document stores, and NoSQL databases: What are the differences between them?
Wood: Although NoSQL has been broadly categorized as a collection of Graph, Document, Key-Value, and BigTable style data stores, it is really a collection of alternatives which are best defined by the use case for which they are most suited.
Dynamo-derived key-value stores, for example, mostly trade off data consistency for extreme horizontal scalability, with a high tolerance to host failure and network partitioning. This is obviously suited to systems serving large numbers of concurrent clients (like a web property), which can tolerate stale or inconsistent data to some extent.
Graph databases are another great example of this: they treat relationships (edges) as first-class citizens and organize data physically to accelerate traversal performance across the graph. There is also typically a very “graph oriented” API to simplify applications that view their data as a graph. This is a perfect example of how providing a database with a specific data model and access pattern can dramatically simplify application development and significantly improve performance.
Q3. Objectivity has recently launched InfiniteGraph, a Distributed Graph Database for the Enterprise. Is InfiniteGraph DB a NoSQL graph database? How does it relate to your Objectivity/DB object database?
Wood: InfiniteGraph had been an idea within our company for some time. Objectivity has a long and successful track record providing customers with a scalable, distributed data platform and in many cases the underlying data model was a graph. In various customer accounts our Systems Engineers would assist building custom applications to perform high speed data ingest and advanced relationship analytics.
For the most part, there was a common “graph analytics” theme emerging from these engagements and the differences were really only in the specifics of the customer’s domain or industry.
Eventually, an internal project began with a focus on management and analysis of graph data. It took the high performance distributed data engine from Objectivity/DB and married it to a graph management and analysis platform which makes development of complex graph analytic applications significantly easier. We were very happy with what we had achieved in the first iteration and eventually offered it as a public beta under the InfiniteGraph name. A year later we are now into our third public release and adding even more features focused around scaling the graph in a distributed environment.
Q4. Systems such as CouchDB, MongoDB, SimpleDB, Voldemort, Scalaris, etc. provide less functionality than OODBs and are little more than a distributed object cache over multiple machines. How do these new data stores compare with object-oriented databases?
Wood: I think that most of the technologies you mention generally have unique properties that appeal to some segment of the database market. In some cases it is the data model (like the flexibility of document-model databases), and in others it is the ease of use or deployment and a simple interface that will attract users (which is often said about CouchDB). Systems architects that evaluate various solutions will invariably balance ease of use and flexibility against other requirements like performance, scalability and supported consistency models.
Object databases will be treated in much the same way: they handle complex object models really well and minimize the impedance mismatch between your application and the database, so in cases where this is an important requirement, they will always be considered a good option.
Of course not all OODBMS implementations are the same, so even within this genre there are significant differences to help make a clear choice for a particular use case.
Q5. With the emergence of cloud computing, new data management systems have surfaced.
What is in your opinion the direction in which cloud computing data management is evolving? What are the main challenges of cloud computing data management?
Wood: This is an area of great interest to us. Coming from a distributed data background, we are well positioned to take advantage of the trend for sure. I think there are a couple of major things that distinguish traditional “distributed systems” from the typical cloud environment.
Firstly, products that live in the cloud need to be specifically designed for tolerance to host failures and network partitions or inconsistencies. Cloud platforms are invariably built on low-cost commodity hardware and provide a virtualized environment that offers a lower class of reliability than that seen in dedicated “enterprise class” hardware. This essentially requires that availability be built into the software to some extent, which translates to replication and redundancy in the data world.
Another important requirement in the cloud is ease of deployment. The elasticity of a cloud environment generally leads to “on the fly” provisioning of resources, so spinning up nodes needs to be simple from a deployment and configuration perspective. When you look at the technologies (like Dynamo-based KV stores) that have their roots in early cloud-based systems, there is a real focus on these areas.
Q6. Will cloud store projects end up with support for declarative queries and declarative secondary keys?
Wood: If you look at most of the KV-type stores out there, a lot of them are now turning some focus to what is being termed “search”. Cross-population indexing and complex queries were not the primary design goal of these systems; however, many users find “some” capability in this area necessary, especially if the store is being used exclusively as the persistence engine of the system.
An alternative to this is actually using multiple data stores side by side (so-called polyglot persistence) and directing queries at the system that has been best designed to handle them. We are seeing a lot of this in the graph database market.
Q7. In his post, titled “The “NoSQL” Discussion has Nothing to Do With SQL”, Prof. Stonebraker argues that “blinding performance depends on removing overhead. Such overhead has nothing to do with SQL, but instead revolves around traditional implementations of ACID transactions, multi-threading, and disk management. To go wildly faster, one must remove all four sources of overhead, discussed above. This is possible in either a SQL context or some other context.” What is your opinion on this?
Wood: I agree totally that NoSQL has nothing to do with SQL; it’s an unfortunate term which is often misunderstood. It is simply about choosing the data store with just the right mix of trade-offs and characteristics that are the closest match to your application’s requirements. The problem was that, for the most part, people are familiar with RDBMS/SQL, so NoSQL became a “place” for other lesser-known data stores and models to call home (hence the unfortunate name).
A good example is ACID, mentioned in the excerpt above. ACID in itself is one of these choices: some use cases require it and others don’t. Arguing about the efficiency of its implementation is somewhat of a moot point if it isn’t a requirement at all! In other cases, like graph databases, the physical data model can dramatically affect performance for certain types of navigational queries which the RDBMS data model and SQL query language are simply not designed for.
I think this post was a reaction to the idea that NoSQL somehow threatened RDBMS and SQL, when that isn’t the case at all. There are still a large proportion of data problems out there that are very well suited to the RDBMS model and the plethora of implementations.
Q8. Some progress has also been made on RDBMS scalability. For example, Oracle RAC and MySQL Cluster provide some partitioning of load over multiple nodes. More recently, there are new scalable variations of MySQL underway with ScaleDB and Drizzle, and VoltDB is expected to provide scalability on top of a more performant in-memory RDBMS with minimal overhead. Typically you cannot scale well if your SQL operations span many nodes. And you cannot scale well if your transactions span many nodes. Will RDBMSs provide scalability to 100 nodes or more? And if yes, how?
Wood: Certainly partitioning and replication of an RDBMS can be used to great effect where the application suits the relational model well; however, this doesn’t change its suitability for a specific task. Even a set of indexed partitions doesn’t make sense if a KV store is all that is required. Using a distributed hashing algorithm doesn’t require execution of lookups on every node, so there is no reason to pay an overhead for a generalized query when it is not required. Of course this doesn’t devalue the existence of a partitioned RDBMS at all, since there are many applications where this would be a perfect solution.
Recent Related Interviews/Videos:
“Distributed joins are hard to scale”: Interview with Dwight Merriman.
On Graph Databases: Interview with Daniel Kirstenpfad.
Interview with Rick Cattell: There is no “one size fits all” solution.
Don White on “New and old Data stores”.
Watch the Video of the Keynote Panel “New and old Data stores”
Objects in Space.
–The biggest data processing challenge to date in astronomy: The Gaia mission.–
I became aware of a very interesting project at the European Space Agency: the Gaia mission. It is considered by the experts “the biggest data processing challenge to date in astronomy”.
I wanted to know more, so I interviewed William O’Mullane, Science Operations Development Manager at the European Space Agency, and Vik Nagjee, Product Manager, Core Technologies, at InterSystems Corporation, both deeply involved with the proof-of-concept for the data management part of this project.
Hope you’ll enjoy this interview.
RVZ
Q1. Space missions are long-term. Generally 15 to 20 years in length. The European Space Agency plans to launch in 2012 a Satellite called Gaia. What is Gaia supposed to do?
O’Mullane:
Well, we have now learned we launch in early 2013; delays are fairly commonplace in complex space missions. Gaia is ESA’s ambitious space astrometry mission, the main objective of which is to astrometrically and spectro-photometrically map 1,000 million celestial objects (mostly in our galaxy) with unprecedented accuracy. The satellite will downlink close to 100 TB of raw telemetry data over 5 years.
To achieve its required accuracy of a few tens of microarcseconds, highly involved processing of this data is required. The data processing is a pan-European effort undertaken by the Gaia Data Processing and Analysis Consortium. The result is a phase-space map of our Galaxy, helping to untangle its evolution and formation.
Q2. In your report, “Charting the Galaxy with the Gaia Satellite”, you write “the Gaia mission is considered the biggest data processing challenge to date in astronomy. Gaia is expected to observe around 1,000,000,000 celestial objects”. What kind of information is Gaia expected to collect on celestial objects? And what kind of information does Gaia itself need in order to function properly?
O’Mullane:
Gaia has two telescopes and a Radial Velocity Spectrometer (RVS). Images are taken simultaneously from each telescope to calculate position on the sky and magnitude. Two special sets of CCDs at the end of the focal plane record images in red and blue bands to provide photometry for every object. Then, for a large number of objects, spectrographic images are recorded.
From the combination of this data we derive very accurate positions, distances and motions for celestial objects. Additionally we get metallicities, temperatures, etc., which allow the objects to be classified into different star groups. We use the term celestial objects since not all objects that Gaia observes are stars.
Gaia will, for example, see many asteroids and improve their known orbits. Many planets will be detected (though not directly observed).
As for any satellite, Gaia has a barrage of on-board instrumentation, such as gyroscopes, star trackers and thermometers, which is read out and downlinked as ‘housekeeping’ telemetry. All of this information is needed to track Gaia’s state and position. Rather more unusual for Gaia is the use of the data taken through the scientific instruments for ‘self’ calibration. Gaia is all about beating noise with statistics.
Q3. All Gaia data processing software is written in Java. What are the main functionalities of the Gaia data processing software? What were the data requirements for this software?
O’Mullane:
The functionality is to process all of the observed data and reduce it to a set of star catalogue entries which represent the observed stars. The data requirements vary: in the science operations centre we downlink 50-80 GB per day, so there is a requirement on the daily processing software to be able to process this volume of data in 8 hours. This is the strictest requirement, because if we are swamped by the incoming data stream we may never recover.
The astrometric solution involves a few TB of data extracted from the complete set. The process runs less frequently (every six months to one year), and the requirement on that software is to do its job in 4 weeks.
There are many systems like this, for photometry, spectroscopy, non-single stars, classification and variability analysis; each has its own requirements on data volume and processing time.
Q4. What are the main technical challenges with respect to data processing, manipulation and storage this project poses?
Nagjee:
The sheer volume of data that is expected to be captured by the Gaia satellite poses a technical challenge. For example, 1 billion celestial objects will be surveyed, and roughly 1000 observations (100*10) will be captured for each object, totaling around 1000 billion observations; each observation is represented as a discrete Java object and contains many properties that express various characteristics of these celestial bodies.
The trick is to not only capture this information and ingest (insert) it into the database very quickly, but to also be able to do this as discrete (non-BLOB) objects so that downstream processing can be facilitated in an easy manner. Additionally, the system needs to be optimized for very fast inserts (writes) and also for very fast queries (reads). All of this needs to be done in a very economical and cost-effective fashion – reducing power, energy, cooling, etc. requirements, and also reducing costs as much as possible.
These are some of the challenges that this project poses.
Q5. A specific part of the Gaia data processing software is the so called Astrometric Global Iterative Solution (AGIS). This software is written in Java. What is the function of such module? And what specific data requirements and technical challenges does it have?
Nagjee:
AGIS is a solution or program that iteratively turns the raw data into meaningful information.
O’Mullane:
AGIS takes a subset of the data, so-called well-behaved or primary stars, and basically fits the observations to the astrometric model. This involves refining the satellite attitude (using the known basic angle and the multiple simultaneous observations in each telescope) and the calibration (given an attitude and known star positions, the stars should appear at definite points in the CCD at given times). It’s a huge (too huge, in fact) matrix inversion, but we do a block-iterative estimation based on the conjugate gradient method. This requires looping (or iterating) over the observational data up to 40 times to converge the solution. So we need the I/O to be reasonable; we know the access pattern, though, so we can organize the data such that the reads are almost serial.
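At its core, the block-iterative scheme described above rests on the classical conjugate-gradient method for symmetric positive-definite systems. A minimal dense sketch follows; it is illustrative only, since the real AGIS solver is distributed and block-structured and operates on vastly larger data.

```java
// Minimal conjugate-gradient solver for a symmetric positive-definite
// system A x = b -- the textbook method underlying AGIS's block-iterative
// scheme (the real solver is distributed and works on far larger data).
public class ConjugateGradient {
    static double[] solve(double[][] a, double[] b, int maxIter, double tol) {
        int n = b.length;
        double[] x = new double[n];          // start from x = 0
        double[] r = b.clone();              // residual r = b - A*0 = b
        double[] p = r.clone();              // initial search direction
        double rsOld = dot(r, r);
        for (int it = 0; it < maxIter && Math.sqrt(rsOld) > tol; it++) {
            double[] ap = mul(a, p);
            double alpha = rsOld / dot(p, ap);           // step length
            for (int i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * ap[i]; }
            double rsNew = dot(r, r);
            for (int i = 0; i < n; i++) p[i] = r[i] + (rsNew / rsOld) * p[i];
            rsOld = rsNew;
        }
        return x;
    }

    static double dot(double[] u, double[] v) {
        double s = 0;
        for (int i = 0; i < u.length; i++) s += u[i] * v[i];
        return s;
    }

    static double[] mul(double[][] a, double[] v) {
        double[] out = new double[v.length];
        for (int i = 0; i < a.length; i++) out[i] = dot(a[i], v);
        return out;
    }

    public static void main(String[] args) {
        double[][] a = {{4, 1}, {1, 3}};     // symmetric positive-definite
        double[] b = {1, 2};
        double[] x = solve(a, b, 100, 1e-10);
        System.out.printf("%.4f %.4f%n", x[0], x[1]); // 0.0909 0.6364
    }
}
```

Each iteration needs one pass over the data (the matrix-vector product), which is why organizing the observations for near-serial reads matters so much when up to 40 such passes are required.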
Q6. In your high-level architecture you use two databases, a so-called Main Database and an AGIS Database. What is the rationale for this choice, and what are the functionalities expected from the two databases?
O’Mullane:
Well, the Main Database will hold all data from Gaia and the products of processing. This will grow from a few TBs to a few hundred TBs during the mission; it’s a large repository of data. Now, we could consider lots of tasks reading and updating this, but the accounting would be a nightmare, and astronomers really like to know the provenance of their results. So we made the Main Database a little slower to update, declaring each version immutable. One version is then the input to all processing tasks, the outputs of which form the next version.
AGIS meanwhile (and other tasks) only require a subset of this data. Often the tasks are iterative and will need to scan over the data frequently or access it in some particular way. Again it is easier for the designer of each task to also design and optimize his data system than try to make one system work for all.
Q7. You write that the AGIS database will contain data for roughly 500,000,000 sources (totaling 50,000,000,000 observations). This is roughly 100 Terabyte of Java data objects. Can you please tell us what are the attributes of the Java objects and what do you plan to do with these 100 Terabyte of data? Is scalability an important factor in your application? Do you have specific time requirements for handling the 100 Terabyte of Java data objects?
O’Mullane:
Last question first: 4 weeks is the target time for running AGIS, that’s ~40 passes through the dataset. The 500 million sources plus observations for Gaia are a subset of the Gaia data; we estimate it to be around 10 TB. Reduced to its quintessential element, each object observation is a time: the time when the celestial image crosses the fiducial line on the CCD. For one transit there are 10 such observations, with an average of 80 transits per source over the 5-year mission lifetime. Carried with that is the small cut-out image from the CCD. In addition there are various other pieces of information, such as which telescope was used and residuals, which are carried around for each observation. For each source we calculate the astrometric parameters, which you may see as 6 numbers plus errors. These are: the position (alpha and delta), the distance or parallax (varPi), the transverse motion (muAlpha and muDelta) and the radial velocity (muRvs).
Then there is a Magnitude estimation and various other parameters. We see no choice but to scale the application to achieve the run times we desire and it has always been designed as a distributed application.
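The six astrometric parameters listed above could be represented as a plain Java value class. This is illustrative only, using the field names from the interview; it is not the actual Gaia/AGIS data model, and the sample values in `main` are invented.

```java
// Illustrative only -- not the actual Gaia data model: the six astrometric
// parameters computed per source, using the names from the interview.
public class AstrometricParams {
    final double alpha;    // position: right ascension
    final double delta;    // position: declination
    final double varPi;    // parallax, from which distance follows
    final double muAlpha;  // transverse motion in alpha
    final double muDelta;  // transverse motion in delta
    final double muRvs;    // radial velocity

    AstrometricParams(double alpha, double delta, double varPi,
                      double muAlpha, double muDelta, double muRvs) {
        this.alpha = alpha; this.delta = delta; this.varPi = varPi;
        this.muAlpha = muAlpha; this.muDelta = muDelta; this.muRvs = muRvs;
    }

    public static void main(String[] args) {
        // A hypothetical source with a parallax of 0.010 arcseconds:
        AstrometricParams p =
            new AstrometricParams(266.4, -29.0, 0.010, 1.2, -0.8, 30.5);
        // Distance in parsecs is the reciprocal of the parallax in arcseconds:
        System.out.println(1.0 / p.varPi); // 100.0
    }
}
```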
Nagjee:
The AGIS Data Model comprises several objects and is defined in terms of Java interfaces. Specifically, AGIS treats each observation as a discrete AstroElementary object. As described in the paper, the AstroElementary object contains various properties (mostly of the IEEE long data type) and is roughly 600 bytes on disk. In addition, the AGIS database contains several supporting indexes which are built during the ingestion phase. These indexes assist with queries during AGIS processing, and also provide fast ad-hoc reporting capabilities. Using InterSystems Caché, with its Caché eXTreme for Java capability, multiple AGIS Java programs will ingest the 100 Terabytes of data generated by Gaia as 50,000,000,000 discrete AstroElementary objects within 5 days (yielding roughly 115,000 object inserts per second, sustained over 5 days).
Internally, we will spread the data across several database files within Caché using our Global and Subscript Mapping capabilities (you can read more about these capabilities here), while providing seamless access to the data across all ranges. The spreading of the data across multiple database files is mainly done for manageability.
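As a back-of-the-envelope check (my arithmetic, not InterSystems’ figures), the quoted insert rate follows directly from the numbers above:

```java
// Sanity check of the quoted sustained ingest rate:
// 50 billion AstroElementary objects written over 5 days.
public class IngestRate {
    public static void main(String[] args) {
        long objects = 50_000_000_000L;
        long seconds = 5L * 24 * 60 * 60;   // 5 days = 432,000 seconds
        long rate = objects / seconds;
        System.out.println(rate);           // 115740, i.e. roughly 115,000/s
        // At ~600 bytes per object, that is on the order of 70 MB/s sustained.
    }
}
```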
Q8. You conducted a proof-of-concept with Caché for the AGIS database. What were the main technical challenges of such proof-of-concept and what are the main results you obtained? Why did you select Caché and not a relational database for the AGIS database?
O’Mullane:
We have worked with Oracle for years, and we can run AGIS on Derby. We have tested MySQL and PostgreSQL (though not with AGIS). To make the relational systems work fast enough we had to reduce our row count – this we did by effectively combining objects in blobs, with the result that the RDBMS became more like a file system. Tests with Caché have shown we can achieve the read and write speeds we require without having to group data in blobs, which is obviously more flexible. It may have been possible to do this with another product, but each time we had a problem InterSystems came (quickly) and showed us how to get around it, or fixed it. For the most recent test we requested the writing of a specific representative dataset within 24 hours on our hardware – this was achieved in 12 hours. Caché is also a more cost-effective solution.
Nagjee:
The main technical challenge of such a Proof-of-Concept is to be able to generate realistic data and load on the system, and to tune and configure the system to be able to meet the strict insert requirements, while still optimizing sufficiently for down-stream querying of the data.
The white paper (.pdf) discusses the results, but in summary, we were able to ingest 5,000,000,000 AstroElementary objects (roughly 1/10th of the eventual projected amount) in around 12 hours. Our target was to ingest this data within 24 hours, and we succeeded in doing it in half the time.
Caché is an extremely high-performance database, and as the Proof-of-Concept outlined in the white paper proves, Caché is more than capable of handling the stringent time requirements imposed by the Gaia project, even when run on relatively modest hardware.
Q9. How do you handle data ingestion in the AGIS database, and how do you publish back updated objects into the main database?
Nagjee:
We use the Caché eXTreme for Java capability for the interaction between the Java AGIS application and the Caché database.
Q10. One main component of the proof-of-concept is the new Caché eXTreme for Java. Why is it important, and how did it get used in the proof-of-concept? How do you ensure low latency data storage and retrieval in the AGIS solution?
Nagjee:
Caché eXTreme for Java is a new capability of the InterSystems Caché database that exposes Caché’s enterprise and high-performance features to Java via the JNI (Java Native Interface). It enables “in-process” communication between Java and Caché, thereby providing extremely low-latency data storage and retrieval.
Q11. What are the next steps planned for AGIS project and the main technical challenges ahead?
Nagjee:
The next series of testing will focus on ingesting even more data sources – up to 50% of the total projected objects. Next, we’ll work on tuning the application and system for reads (queries), and will also continue to explore additional deployment options for the read/query phase (for example, ESA and InterSystems are looking at deploying hundreds of AGIS nodes in the Amazon EC2 cloud so as to reduce the amount of hardware that ESA has to purchase).
O’Mullane:
Well, on the technical side we need to move AGIS definitively to Caché for production. There are always communication bottlenecks to be investigated which limit scalability. AGIS itself requires several further developments, such as a more robust outlier scheme and a more complete set of calibration equations. AGIS is in a good state, but needs more work to deal with the mission data.
———————-
William O’Mullane, Science Operations Development Manager, European Space Agency.
William O’Mullane has a background in Computer Science and has worked on space science projects since 1996, when he assisted with the production of the Hipparcos CD-ROMs. During this period he was also involved with the Planck and Integral science ground segments, as well as contemplating the Gaia data processing problem. From 2000 to 2005 Wil worked on developing the US National Virtual Observatory (NVO) and on the Sloan Digital Sky Survey (SDSS) in Baltimore, USA. In August 2005 he rejoined the European Space Agency as Gaia Science Operations Development Manager to lead the ESAC development effort for the Gaia Data Processing and Analysis Consortium.
Vik Nagjee, Product Manager, Core Technologies, InterSystems Corporation.
Vik Nagjee is the Product Manager for Core Technologies in the Systems Development group at InterSystems. He is responsible for the areas of Reliability, Availability, Scalability, and Performance for InterSystems’ core products – Caché and Ensemble. Prior to joining InterSystems in 2008, Nagjee held several positions, including Security Architect, Development Lead, and head of the performance & scalability group, at Epic Systems Corporation, a leading healthcare application vendor in the US.
————————-
For additional reading:
The Gaia Mission:
1. “Pinpointing the Milky Way” (download paper, .pdf).
2. “Gaia: Organisation and challenges for the data processing” (download paper, .pdf).
3. “Gaia Data Processing Architecture 2009” (download paper, .pdf).
The AGIS Database
4. “Charting the Galaxy with the Gaia Satellite” (download white paper, .pdf).
On the topic of NoSQL databases, I asked a few questions to Dwight Merriman, CEO of 10gen.
You can read the interview below.
RVZ
Q1. Traditionally, the obvious platform for most database applications has been a relational DBMS. Why do we need new Data Stores?
Dwight Merriman:
The catalyst for acceptance of new non-relational tools has been the desire to scale – particularly the desire to scale operational databases horizontally on commodity hardware. Relational is a powerful tool that every developer already knows; using something else has to clear a pretty big hurdle. The push for scale and speed is helping people clear that hurdle and add other tools to the mix. However, once the new tools are part of the established set of things developers use, they often find there are other benefits – such as agility of development of data-backed applications. So we see two big benefits: speed and agility.
Q2. There has been recently a proliferation of “new data stores”, such as “document stores”, and “nosql databases”: What are the differences between them? How do you position 10gen?
Dwight Merriman:
The products vary on a few dimensions:
– what’s the data model?
– what’s the scale-out / data distribution model?
– what’s the consistency model (strong consistency, eventual consistency, or both)?
– does the product make development more agile, or does it focus on scaling only?
MongoDB uses a document-oriented data model, auto-sharding (range-based partitioning) for data distribution, and sits towards the strong-consistency side of the spectrum (for example, atomic operations on single documents are possible). Agility is a big priority for the MongoDB project.
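To make “range-based partitioning” concrete, here is a deliberately simplified sketch (not MongoDB’s actual implementation): each chunk of the shard-key range is assigned to a shard, and a document is routed to the shard whose range contains its key.

```java
import java.util.TreeMap;

// Simplified range-based sharding: chunks of the key space map to shards.
public class RangeSharding {
    // Lowest key of each chunk -> shard holding that chunk.
    private final TreeMap<String, String> chunks = new TreeMap<>();

    public void addChunk(String lowestKey, String shard) {
        chunks.put(lowestKey, shard);
    }

    // Route a key to the shard owning the chunk whose range contains it
    // (assumes the key is not below the lowest chunk boundary).
    public String shardFor(String key) {
        return chunks.floorEntry(key).getValue();
    }

    public static void main(String[] args) {
        RangeSharding router = new RangeSharding();
        router.addChunk("a", "shard0");   // keys from "a" up to "m"
        router.addChunk("m", "shard1");   // keys from "m" onward
        System.out.println(router.shardFor("kirsten")); // shard0
        System.out.println(router.shardFor("zicari"));  // shard1
    }
}
```

A range-based scheme keeps nearby keys on the same shard, which makes range queries cheap but can concentrate load on one shard when keys increase monotonically.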
Q3. Systems such as CouchDB, MongoDB, SimpleDB, Voldemort, Scalaris, etc. provide less functionality than OODBs and are little more than a distributed “object” cache over multiple machines. Do you agree with this? How do these new data stores compare with object-oriented databases?
Dwight Merriman:
Some products have a simple philosophy, and others in the space do not. However, a little bit must be left out if one wants to scale well horizontally. The philosophy of the MongoDB project is to try to make the functionality rich, but not to add any features which make scaling hard or impossible. Thus there is quite a bit of functionality: ad hoc queries, secondary indexes, sorting, atomic operations, etc.
Q4. Recently you announced the completion of a round of funding adding Sequoia Capital as a new investment partner. Why is Sequoia Capital investing in 10gen, and how do you see the database market shaping up?
Dwight Merriman:
We think “NoSQL” will be an important new tool for companies of all sizes. We expect large companies in the future to have in house three types of data stores: a traditional RDBMS, a NoSQL database, and a data warehousing technology.
Q5. In his post, titled “The “NoSQL” Discussion has Nothing to Do With SQL”, Prof. Stonebraker argues that “blinding performance depends on removing overhead. Such overhead has nothing to do with SQL, but instead revolves around traditional implementations of ACID transactions, multi-threading, and disk management. To go wildly faster, one must remove all four sources of overhead, discussed above. This is possible in either a SQL context or some other context.” What is your opinion on this?
Dwight Merriman:
We agree. It has nothing to do with SQL. It does have to do with relational, though – distributed joins are hard to scale.
Q6. Some progress has also been made on RDBMS scalability. For example, Oracle RAC and MySQL Cluster provide some partitioning of load over multiple nodes. More recently, there are new scalable variations of MySQL underway with ScaleDB and Drizzle, and VoltDB is expected to provide scalability on top of a more performant in-memory RDBMS with minimal overhead. Typically you cannot scale well if your SQL operations span many nodes. And you cannot scale well if your transactions span many nodes.
Will RDBMSs provide scalability to 100 nodes or more? And if yes, how?
Dwight Merriman:
Without loss of generality, we believe no. With loss of generality, perhaps yes. For example, if you say: I only want star schemas, I only want to bulk-load data at night and query it all day, I only want to run a few really expensive queries, not millions of tiny ones – then it works, and you are now in the realm of the relational business intelligence / data warehousing products, which do scale out quite well.
If you put some restrictions on – say, require the user to model their data in certain ways, or force them to run a proprietary low-latency network, or only have 10 nodes instead of 1000 – you can do it. But we think there is a need for scalable solutions that work on commodity hardware, particularly because cloud computing is such an important trend for the future. And there are other benefits too, such as developer agility.
I wanted to know more about Sones, a relatively new graph database company based in Germany.
I therefore interviewed Daniel Kirstenpfad, their CTO.
RVZ
Q1. Sones was founded in 2007 in Germany, with the aim of developing an object oriented database with its own file system technology. What motivated you to start the company?
Daniel Kirstenpfad: Building a product and a new technology with a great team surely was the main motivation to start sones. What started as a private project took off in 2007, when a business angel believed in our idea, followed by substantial investments from T-Venture, TGFS and KfW. A great team and great opportunities are the things that keep that motivation high for everyone at sones.
Q2. Your database, ‘GraphDB’, is a graph database. Why did you develop a graph database? How does it differ from an object oriented database? Is GraphDB a NoSQL database?
Daniel Kirstenpfad: Well, once upon a time sones got its name from the words “SOcial NEtwork Systems”. At that time we were mainly focused on tools and technologies for social networks. We started to develop those technologies and found out that a graph-theoretical approach is the most natural one for typical problems in social networks.
When we drilled deeper into graph theory we found out that there are a bazillion real problems people are having which can be solved faster and more elegantly with a graph-theory-based database management system – not to mention the new opportunities such a database brings.
Taking the knowledge from the tools and use cases we had worked on previously, we started work on the multi-purpose graph DBMS that is today the open source sones GraphDB.
The sones GraphDB is a graph database management system which includes many important features object-oriented databases typically have. Object-oriented databases, in turn, typically lack specific graph features like graph algorithms and easy-to-use property graph data structures (vertices, directed and undirected edges, hyperedges, …). More importantly, we think that an easy-to-learn and easy-to-use query language, derived from things users already know like SQL, needs to be included in a database these days – in sones GraphDB we have those features and more.
We think of NoSQL as “not only SQL”. When we think of SQL as relational databases, we think that RDBMS technology is still very important for the use cases that RDBMSs are tailored to solve. It’s just that people have found a huge number of problems that cannot be solved in a scalable and performant way with the SQL approach.
When we think of SQL as a query language, we think that having easy-to-use functionality for ad-hoc queries on a large set of structured data is probably one of the things that made those RDBMSs so popular. It’s the reason we at sones believe it’s very important to have an easy-to-use query language for ad-hoc queries on large sets of semi-structured and unstructured data, and it’s the reason why the query language you use to query the sones GraphDB is solely based on the syntax of SQL: it’s easy to use and easy to learn. A typical user can write and run rather complex queries within minutes.
Q3. One feature of a graph database is to allow linking data from various SQL databases. How does it work in practice? And who needs such a feature?
Daniel Kirstenpfad: Data in most cases does not stand alone like a list; in most cases it comes with links and relations between the data. SQL is highly limited when working with complex linked data sets. Even worse, with RDBMSs the query performance is significantly impacted when schemas get more complex. To overcome those performance and scalability limitations it can be helpful to establish a graph database layer which handles those complex linking needs. So in practice the user keeps his RDBMS databases and establishes what we call a “metadata repository” across the different RDBMS databases. This metadata repository is basically a large graph of all the edges that link the data sets, giving the user the ability to do ad-hoc graph queries globally on a virtually unlimited number of different RDBMS databases (or data silos, as we call them).
That also answers the question of which users benefit the most from such a feature: everyone who has multiple relational databases, each storing another aspect of the data, can benefit by establishing a globally available graph metadata repository.
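A minimal sketch of the metadata-repository idea (the API and naming here are invented for illustration; this is not the sones interface): rows in different RDBMS silos are identified as vertices, and the repository stores only the edges that link them, so cross-silo relationships can be traversed without SQL joins.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative metadata repository: vertex ids name rows in different
// RDBMS silos ("silo:table:key"); only the cross-silo links are stored here.
public class MetadataRepository {
    private final Map<String, List<String>> edges = new HashMap<>();

    public void link(String fromRow, String toRow) {
        edges.computeIfAbsent(fromRow, k -> new ArrayList<>()).add(toRow);
    }

    // Ad-hoc graph traversal across silos; no join is performed, and the
    // underlying databases would only be consulted to fetch row contents.
    public List<String> linkedTo(String row) {
        return edges.getOrDefault(row, List.of());
    }

    public static void main(String[] args) {
        MetadataRepository repo = new MetadataRepository();
        repo.link("crm:customer:42", "billing:invoice:1001");
        repo.link("crm:customer:42", "support:ticket:7");
        System.out.println(repo.linkedTo("crm:customer:42"));
        // [billing:invoice:1001, support:ticket:7]
    }
}
```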
Q4. How do you handle semistructured and unstructured data?
Daniel Kirstenpfad: Semistructured data is a compromise between structured and unstructured data without sacrificing important things. To handle semistructured data the sones GraphDB comes with a dynamic data scheme and consistency criteria. This data scheme is an extension of the OOP data model, allowing inheritance, abstract types, data streams (binary, unstructured data), undefined attributes, and Editions and Revisions on Object Namespaces and Object Instances. Having a dynamic scheme means that the user can change the data scheme anytime without performance penalties. Unstructured data is handled either using undefined attributes on edges and vertices, or using the binary data streams, which the sones GraphDB can handle without any mapping to external storage – basically, binary data is stored with the data objects.
Q5. How do you ensure scalability and performance?
Daniel Kirstenpfad: Ensuring scalability and performance was a major design goal through all development steps of the sones GraphDB. Because there are so many possible ways of enhancing the scalability and performance of a graph database, we have been busy implementing some of them, and we will be busy implementing more in the future. Currently, master-slave replication is one key factor in scaling the query performance of a data set. By replicating a data set onto multiple machines and running queries on read slaves, the number of possible queries per second scales nearly linearly. In the future the sones GraphDB will have an easy-to-use and extensible framework to partition a graph onto multiple machines – coupled with master-slave replication, this allows virtually unlimited sizes of data sets and a virtually unlimited number of simultaneous queries.
Q6. How do you see the market for Cloud computing and open source?
Daniel Kirstenpfad: We see a very diverse cloud computing market with some bigger and some smaller cloud platform providers on the one hand and many customers who want to utilize these cloud platforms, either as a software vendor or a platform user, on the other hand.
Basically, the cloud for us is another opportunity to host the services the sones GraphDB provides. It’s important for us to support the major cloud platforms like Microsoft’s Windows Azure and Amazon’s EC2. There are some cool technologies and frameworks that can be utilized by a database to store and access data in a fast and very scalable way.
What most of those cloud platforms lack these days is an actually usable system that allows software vendors like sones to publish their applications and services. Currently many users need to take the not-so-scenic route until they get their instance of something hosted on a cloud platform. It would be great to have better software vendor integration and thereby allow the users to take the scenic route to a service.
Q7. sones offers its product using a dual licensing scheme: open source software under AGPLv3, and a full enterprise version. Why did you choose a dual licensing scheme, and how do you handle the two license models?
Daniel Kirstenpfad: We started as a closed source company – mainly because we wanted to have something to share before going open source. Making the whole sones GraphDB open source was always an option – one which we finally chose to take in mid-2010. Since then there has been an Open Source Edition of the sones GraphDB available which shares its code with the enterprise version.
What differentiates the Open Source Edition and the Enterprise Edition are feature plugins which extend the functionality in certain ways. There are, however, plans to gradually publish the previously closed source plugins under open source licenses. We chose this dual licensing scheme first of all because it gives us and the software the most flexibility: we can combine closed source technology – from partners, for example – and open source technology using dual licensing. And we think that this will suit most of our users’ needs best.
Q8. Sones has recently concluded a round of financing. What is your strategy for the next months?
Daniel Kirstenpfad: First of all we want to broaden our presence. One key factor in enabling a great community is to start an open dialogue with the community. We want to bring our team to the next level – that means expanding the development, customer support, press activities and management teams. We are furthermore going to expand our cooperation with universities and partners. Beyond anything else, partners are another key factor which will lead to the success of the sones GraphDB.
————————-
Agile data modeling and databases.
In a previous interview with our expert Dr. Michael Blaha, we talked about Patterns of Data Modeling. This is a follow up interview with Dr. Blaha on a related topic: Agile data modeling and databases.
RVZ
Q1. Many developers are familiar with the notion of agile programming. What is the main promise of agile programming? And how does it relate to conceptual data modeling?
Blaha: Agile programming focuses on writing code quickly and showing the evolving results to the customer. Agile programming is a reaction to broken software engineering practices where a lengthy and tedious process keeps software hidden until the very end.
However, software engineering need not be ponderous if you use agile modeling. Conceptual data models can be constructed live, in front of an audience — I know this for a fact, because I often build models this way. Business users understand a data model if you explain it as you go. They are patient as long as there is clear progress and they can see that the data model reflects their business goals.
Q2. You have posted a series of videos on YouTube that demonstrate agile data modeling. What is the main message you want to convey?
Blaha: The videos demonstrate that it is possible to rapidly construct a conceptual data model. (Many developers do not realize that it is possible.) The videos use the example of developing library asset management software and are representative of how I build models.
Q3. What about object oriented databases?
Blaha: When I develop data-oriented applications I always start with a data model — this applies to relational databases, object-oriented databases, flat files, XML, and other formats.
For object-oriented databases, the UML class model specifies the structure of domain classes. Several books elaborate the mapping details, such as my book Object-Oriented Modeling and Design for Database Applications.
The trickiest issue for mapping a UML class model to OO databases is the handling of associations. An association relates objects of the participating classes and is inherently bidirectional. For example, the WorksFor association can be traversed to find the collection of employees for a company, as well as the employer for an employee.
By their very nature, associations transcend classes and cannot be private to a class. Associations are important because they break class encapsulation.
OO databases vary in their association support. ObjectStore uses dual pointers — a class A to class B association is implemented with an instance variable in A that refers to B and an instance variable in B that refers to A. The ObjectStore database engine automatically keeps the dual pointers mutually consistent across inserts, deletes, and
updates.
Many other OO databases lack association support and require that you architect your own solution. You can have dual references that application software maintains. Or you can build a generic association mechanism apart from your application. Or you can just implement a single direction when that suffices.
In any case, a data model documents associations so that they are clearly recognized and understood. Then you are sure not to overlook the associations in your application architecture.
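A minimal dual-reference sketch of the WorksFor association discussed above (plain Java with hypothetical class names; an OODB like ObjectStore would maintain the inverse pointer automatically, whereas here the application does it):

```java
import java.util.ArrayList;
import java.util.List;

class Company {
    final List<Employee> employees = new ArrayList<>();
}

class Employee {
    Company employer;

    // Application-maintained dual references: update both directions of the
    // WorksFor association so they stay mutually consistent.
    void worksFor(Company newEmployer) {
        if (employer != null) {
            employer.employees.remove(this);
        }
        employer = newEmployer;
        if (newEmployer != null) {
            newEmployer.employees.add(this);
        }
    }
}

public class AssociationDemo {
    public static void main(String[] args) {
        Company acme = new Company();
        Employee alice = new Employee();
        alice.worksFor(acme);
        // Both traversal directions now agree:
        System.out.println(acme.employees.contains(alice)); // true
        System.out.println(alice.employer == acme);         // true
    }
}
```

This is exactly the kind of code that is easy to get wrong when an association is overlooked: forget one of the two update paths and the references silently drift apart, which is why documenting associations in the data model matters.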
You can watch the series of videos here.
I start in 2011 with an interview with Dr. Rick Cattell.
Rick is best known for his contributions to database systems and middleware: he was a founder of SQL Access (a predecessor to ODBC), the founder and chair of the Object Data Management Group (ODMG), and the co-creator of JDBC.
Rick has worked for over twenty years at Sun in management and senior technical roles, and for ten years in research at Xerox PARC and at Carnegie-Mellon University.
You can download the article by Rick Cattell: “Relational Databases, Object Databases, Key-Value Stores, Document Stores, and Extensible Record Stores: A Comparison.”
And look at recent posts on the same topic: “New and Old Data Stores”, and “NoSQL databases”.
RVZ
Q1. Traditionally, the obvious platform for most database applications has been a relational DBMS. Why do we need new Data Stores?
Rick Cattell: Basically, things changed with “Web 2.0” and with other applications where there were thousands or millions of users writing as well as reading a database. RDBMSs could not scale to this number of writers. Amazon (with Dynamo) and Google (with BigTable) were forced to develop their own scalable datastores. A host of others followed suit.
Q2. There has been recently a proliferation of “new data stores”, such as “document stores”, and “nosql databases”: What are the differences between them?
Rick Cattell: That’s a good question. The proliferation and the differences are confusing, and I have no one-paragraph answer to this question. The systems differ in data model, consistency model, and many other dimensions. I wrote a couple of papers and provide some references on my website; these may be helpful for more background. There I categorize several kinds of “NoSQL” data stores according to data model: key-value stores, document stores, and extensible record stores. I also discuss scalable SQL stores.
Q3. How do the new data stores compare with relational databases?
Rick Cattell: In a nutshell, NoSQL datastores give up SQL and they give up ACID transactions, in exchange for scalability. Scalability is achieved by partitioning and/or replicating the data over many servers. There are some other advantages, as well: for example, the new data stores generally do not demand a fixed data schema, and provide a simpler programming interface, e.g. a RESTful interface.
Q4. Systems such as CouchDB, MongoDB, SimpleDB, Voldemort, Scalaris, etc. provide less functionality than OODBs and are little more than a distributed “object” cache over multiple machines. How do these new data stores compare with object-oriented databases?
Rick Cattell: It is true, OODBs provide features that NoSQL systems do not, like integration with OOPLs and ACID transactions. On the other hand, OODBs do not provide horizontal scalability. There is no “one size fits all” solution, just as OODBs and RDBMSs are good for different applications.
Q5. With the emergence of cloud computing, new data management systems have surfaced. What is in your opinion of the direction in which cloud computing data management is evolving? What are the main challenges of cloud computing data management?
Rick Cattell: There are a number of data management issues with cloud computing, in addition to the scaling issue I already discussed. For example, if you don’t know which servers your software is going to run on, you cannot tune your hardware (RAM, flash, disk, CPU) to your software, or vice versa.
Q6. What are cloud stores omitting that enables them to scale so well?
Rick Cattell: You haven’t defined “cloud stores”. I’m going to assume that you mean something similar to what we discussed earlier: new data stores that provide horizontal scaling. In which case, I answered that question earlier: they give up SQL and ACID.
Q7. Will cloud store projects end up with support for declarative queries and declarative secondary keys?
Rick Cattell: As I interpret this question, systems such as MongoDB already have this. Also, a SQL interpreter has been ported to BigTable, but the lower-level interface has proven to be more popular. The main scalability problem with declarative queries is when queries require operations like joins or transactions that span many servers: then you get killed by the node coordination and data movement.
Q8. In his post, titled “The “NoSQL” Discussion has Nothing to Do With SQL”, Prof. Stonebraker argues that “blinding performance depends on removing overhead. Such overhead has nothing to do with SQL, but instead revolves around traditional implementations of ACID transactions, multi-threading, and disk management.
To go wildly faster, one must remove all four sources of overhead, discussed above. This is possible in either a SQL context or some other context.” What is your opinion on this?
Rick Cattell: I agree with Stonebraker. There are actually two points here: one about performance (of each server) and one about scalability (of all the servers together). We already discussed the latter.
Stonebraker makes an important point about the former: with databases that fit mostly in RAM (on distributed servers), the DBMS architecture needs to change dramatically, otherwise 90% of your overhead goes into transaction coordination, locking, multi-threading, latching, buffer management, and other operations that are “acceptable” in traditional DBMSs, where you spend your time waiting for disk. Stonebraker and I had an argument a year ago, and reached agreement on this as well as other issues on scalable DBMSs. We wrote a paper about our agreement, which will appear in CACM. It can be found on my website in the meantime.
Q9. Some progress has also been made on RDBMS scalability. For example, Oracle RAC and MySQL Cluster provide some partitioning of load over multiple nodes. More recently, there are new scalable variations of MySQL underway with ScaleDB and Drizzle, and VoltDB is expected to provide scalability on top of a more performant in-memory RDBMS with minimal overhead. Typically you cannot scale well if your SQL operations span many nodes. And you cannot scale well if your transactions span many nodes. Will RDBMSs provide scalability to 100 nodes or more? And if yes, how?
Rick Cattell: Yes, I believe so. MySQL Cluster is already close to doing so, and I believe that VoltDB and Clustrix will do so. The key to scalability with RDBMSs is to avoid SQL and transactions that span nodes, as you say. VoltDB demands that transactions be encapsulated as stored procedures, and allows some control over how tables are sharded over nodes. This allows transactions to be pre-compiled and pre-analyzed to execute on a single node, in general.
Q10. There are also XML DBs, which go beyond relational. Hybridization with relational turned out to be very useful. For example, DB2 has a huge investment in XML, which is extensively published and has also succeeded commercially. MonetDB did substantial work in that area early on as well. How do they relate to the “new data stores”?
Rick Cattell: With XML, we have yet another data model, like relational and object-oriented. XML data can be stored in a separate DBMS like MonetDB, or could be transformed for storage in another DBMS, as with DB2. The focus of the new NoSQL data stores is generally not a new data model, but new scalability. In fact, they generally have quite simple data models. The “document stores” like MongoDB and CouchDB do allow nested objects, which might make them more amenable to storing XML. But in my experience, the new data stores are being used to store simpler data, like the key-value pairs required for user information on a web site.
Q11. Choosing a solution: Given this confusing array of alternatives, which data storage system do you choose for your application?
Rick Cattell: This is an even harder question to answer than the ones contrasting the DBMSs themselves, because each application has characteristics that might make you lean one way or another. I made an attempt at answering this in the paper I mentioned, but I only scratched the surface… I concluded that there is no “cookbook” answer to tell you which way to go.
Don White on “New and old Data stores”.
“New and old Data stores” : This time I asked Don White a number of questions.
Don White is a senior development manager at Progress Software Inc., responsible for all feature development and engineering support for ObjectStore.
RVZ
Q1. Traditionally, the obvious platform for most database applications has been a relational DBMS. Why do we need new Data Stores?
Don White: Speaking from an OODB perspective, OODBs grew out of the recognition that the relational model is not the best fit for all application needs. OODBs continue to deliver their traditional value, which is transparency in handling and optimizing the movement of rich data models between storage and virtual memory. The emergence of common ORM solutions must have provided benefit for some RDB-based shops, where I presume they needed to use object-oriented programming for data that already fits well into an RDB. There is something important to understand: if you fail to leverage what your storage system is good at, then you are using the wrong tool or the wrong approach. The relational model wants to model with relations and wants to perform joins, so an application’s data access pattern should query the database the way the model wants to work. An ORM mapping for an RDB that tries to query and build one object at a time will have really poor performance. If you try to query in bulk to avoid the costs of model transitions, then you likely have to live with compromises in less-than-optimal locking and/or fetch patterns. A project with model complexity that pursues OOA/OOD for a solution will find implementation easier with an OOP language, and will find storage of that data easier and more effective with an OODB. As for the newer data stores that are neither OODBs nor RDBs, they appear to be trying to fill a need with a storage solution that is less than a general database solution. Not trying to be everything to everybody allows for different implementation tradeoffs to be made.
Q2. There has been recently a proliferation of “new data stores”, such as “document stores”, and “nosql databases”: What are the differences between them?
Don White: This probably needs to be answered by the people involved with the products trying to distinguish themselves. I lump document stores in the NoSQL category. However, there do seem to be some common themes, or subclass types, within the new NoSQL stores: document stores and key/value stores. Each subclass seems to have a different way of declaring groups of related information, and differences in exposing how to find and access stored information. In case it is not obvious, you can argue an OODB has some characteristics of the NoSQL stores, although any discussion would have to clearly define the scope of what is included in the NoSQL category.
Q3. How do new data stores compare with relational databases?
Don White: In general there seems to be a recognition that relational technology has difficulty managing fluid schema requirements and optimizing access to related information.
Q4. Systems such as CouchDB, MongoDB, SimpleDB, Voldemort, Scalaris, etc. provide less functionality than OODBs and are little more than a distributed “object” cache over multiple machines. How do these new data stores compare with object-oriented databases?
Don White: These new data stores are not object oriented. Some might provide language bindings to object-oriented languages, but they are not preserving OOA/OOD, as implemented in an OOP language, all the way through to the storage model. The new data systems are very data centric and are not trying to facilitate the melding of data and behavior. These new storage systems present specific model abstractions and provide their own specific storage structures. In some cases they offer schema flexibility, but it is basically used just to manage data, not to build sophisticated data structures with type-specific behavior. One way of keeping modeling abilities in perspective: you can use an OODB as the basis to build any other database system, NoSQL or even relational. The simple reason is that an OODB can store any structure a developer needs or can even imagine. A document store, a name/value-pair store or an RDB store each presents a particular abstraction for a data store, but under the hood there is an implementation to serve that abstraction. No matter what that implementation looks like, it could be put into an OODB. Of course the key is determining whether the data store abstraction presented works for your actual model and application space.
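The point that different store abstractions are just layers over stored structures can be sketched in a few lines. This is an illustrative toy, not ObjectStore's API: a key/value store and a document store are both thin facades over the same underlying object graph, here simulated by a plain dictionary of persistent roots.

```python
# Illustrative sketch: two different data store abstractions layered
# over one underlying object graph. The ObjectStoreSim class is a
# hypothetical stand-in for a persistent object store.

class ObjectStoreSim:
    """Stand-in for a persistent object graph with named roots."""
    def __init__(self):
        self.roots = {}

class KeyValueStore:
    """Key/value abstraction over the shared object graph."""
    def __init__(self, backing):
        self.data = backing.roots.setdefault("kv", {})
    def put(self, key, value): self.data[key] = value
    def get(self, key): return self.data.get(key)

class DocumentStore:
    """Document abstraction over the same object graph."""
    def __init__(self, backing):
        self.docs = backing.roots.setdefault("docs", {})
    def insert(self, doc_id, doc): self.docs[doc_id] = dict(doc)
    def find(self, **criteria):
        return [d for d in self.docs.values()
                if all(d.get(k) == v for k, v in criteria.items())]

backend = ObjectStoreSim()
kv = KeyValueStore(backend)
docs = DocumentStore(backend)
kv.put("answer", 42)
docs.insert("d1", {"type": "note", "text": "hello"})
```

Both facades persist into the same `roots` structure; the abstraction each one presents is a choice made above the storage layer, which is the answer's point about building any store on top of an OODB.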
The problem with an OODB is that not everyone is looking to build a database of their own design; many prefer someone else to supply the storage abstraction and worry about the details that make the abstraction work. That is not to say the only way to interface with an OODB is a 3GL program, but the most effective way to use an OODB is when the storage model matches the needs of the in-memory model. That is a very leading statement because it forces a particular question: why would you want to store data differently from how you intend to use it? I guess the simple answer is when you don’t know how you are going to use your data, but if you don’t know how you are going to use it, then why is any one data store abstraction better than another? If you want to use an OO model and implementation, then you will find a non-OODB a poor way of handling that situation.
To generalize, it appears the newer stores make different compromises in the management of the data to suit their intended audience. In other words, they are not developing a general-purpose database solution, so they are willing to make tradeoffs that traditional database products would, should or could not make. The new data stores do not provide traditional database query language support, or even strict ACID transaction support. They do provide abstractions for data storage, and processing capabilities, that leverage the idiosyncrasies of their chosen implementation data structures and/or relaxations in the strictness of the transaction model to try to make gains in processing.
Q5. With the emergence of cloud computing, new data management systems have surfaced. What is your opinion of the direction in which cloud computing data management is evolving? What are the main challenges of cloud computing data management?
Don White: One challenge is just trying to understand what is really meant by cloud computing. In some form it is about how to leverage computing resources or facilities available through a network. Those resources could be software or hardware; leveraging them requires nothing to be installed on the accessing device, only a network connection. The network is the virtual mainframe, and any device used to access the network is the virtual terminal endpoint. You have the same concerns in trying to leverage the computing power of a virtual mainframe as of a real local machine: how to optimize computing resources, how to share them among many users and how to keep them running. You have an interesting upside with all the possible scalability, but with the power and flexibility come new levels of management complexity. You have to consider how algorithms for processing and handling data can be distributed and coordinated. And when you involve more than one machine to do anything, you have to consider what happens when any node or connecting piece fails along the way.
Q6. What are cloud stores omitting that enables them to scale so well?
Don White: Strict serialized transaction processing, for one. I think you will find that the more complex a data model needs to be, the more need there is for strict serialized transactions. You can’t expect to navigate relationships cleanly if you don’t promise to keep all data strictly serialized.
The data and/or the storage abstractions used in the new models seem devoid of any sophisticated data processing and relationship modeling. What is being managed and distributed is simple data, where the algorithms and the data needing to be managed can be easily partitioned and dispersed, and the required processing is easily replicated with basic coordination requirements. It is easy to imagine how to process queries that can be replicated in bulk across simple data stored in structures that are amenable to being split apart.
Why are serialized transactions important? They make sure there is one view of history, and they are necessary to maintain integrity among related data. Some systems try to pass off something less than serializable isolation as adequate for transaction processing; however, allowing updates to occur without the prerequisite read locks risks using data that is not correct. If you are using pointers rather than indirect references as part of your processing, the things you point to have to exist for the processing to run. Once you materialize a key/value-based relationship as a pointer, there has to be a commitment not only to the existence of the relationship (the thing pointed to) but also to the state of the data involved in the relationship that allows the existence to be valid.
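The anomaly behind "allowing updates without the prerequisite read locks" is the classic lost update. A minimal sketch, with the interleaving scripted by hand so the outcome is reproducible (no real transactions or threads involved):

```python
# Deterministic sketch of a lost update: two "transactions" both read
# a balance before either writes. Without read locks, T2's write
# silently discards T1's deposit.

balance = {"acct": 100}

# Interleaved execution: both reads happen first (no read lock held).
t1_read = balance["acct"]        # T1 reads 100
t2_read = balance["acct"]        # T2 reads 100

balance["acct"] = t1_read + 50   # T1 deposits 50  -> 150
balance["acct"] = t2_read - 30   # T2 withdraws 30 -> 70, clobbering T1

lost_update_result = balance["acct"]     # 70: T1's deposit is lost

# Serialized execution (T1 fully before T2) preserves both effects.
balance["acct"] = 100
balance["acct"] += 50            # T1
balance["acct"] -= 30            # T2
serialized_result = balance["acct"]      # 120
```

Serializable isolation guarantees the second outcome: the result is what some one-at-a-time ordering of the transactions would produce, which is the "one view of history" the answer refers to.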
Q7. Will cloud store projects end up with support for declarative queries and declarative secondary keys?
Don White: I can’t answer that. It will be a shame if these systems end up having to build many things that are already available in other database systems, which could have given them those features for free.
Q8. In his post, titled “The “NoSQL” Discussion has Nothing to Do With SQL”, Prof. Stonebraker argues that “blinding performance depends on removing overhead. Such overhead has nothing to do with SQL, but instead revolves around traditional implementations of ACID transactions, multi-threading, and disk management. To go wildly faster, one must remove all four sources of overhead, discussed above. This is possible in either a SQL context or some other context.” What is your opinion on this?
Don White: I don’t have any argument with the overheads identified, however I would say I don’t want to use SQL, a non-procedural way of getting to data, when I can solve my problem faster by using navigation of data structures specifically geared to solve my targeted problem. I have seen customers put SQL interfaces on top of specialized models stored in an OODB. They use SQL through ODBC as a standard endpoint to get at the data, but the implementation model under the hood is a model the customer implemented that performs queries faster than what a traditional relational implementation could do.
Q9. Some progress has also been made on RDBMS scalability. For example, Oracle RAC and MySQL Cluster provide some partitioning of load over multiple nodes. More recently, there are new scalable variations of MySQL underway with ScaleDB and Drizzle, and VoltDB is expected to provide scalability on top of a more performant in-memory RDBMS with minimal overhead. Typically you cannot scale well if your SQL operations span many nodes. And you cannot scale well if your transactions span many nodes.
Will RDBMSs provide scalability to 100 nodes or more? And if yes, how?
Don White: Hmm, what is the barrier? What makes SQL hard to span nodes? I suppose one inherent problem is that an RDB is built around the relational model, which is based on joining relations. If processing is going to spread across many nodes, then where does the joining take place? So there are either possible single points of failure, or some layer of complicated partitioning that has to be managed to figure out how to join data together.
Q10. There are also XML DBs, which go beyond relational. Hybridization with relational has turned out to be very useful. For example, DB2 has a huge investment in XML, it is extensively published, and it has also succeeded commercially. MonetDB did substantial work in that area early on as well. How do they relate to “new data stores”?
Don White: I would think one thing that has to be addressed is how you store and process non-text information that is ultimately represented as text in XML. String-based models are a poor means of managing relationships and numeric information. There are also costs in trying to make sure the information is valid for the real data type you want it to be.
A product would have to decide how to handle types that are not meant to be textual. For example, you can’t expect to accurately compare or restrict floating-point numbers that are represented as text, and storing numbers as text is certainly an inefficient storage model. Most likely you would want to leverage parsed XML for your processing, so if the data is not stored in a parsed format, then you will have to pay for parsing when moving the data to and from the storage model. XML can be used to store trees of information, but not all data is easily represented with XML. Common data modeling needs like graphs and non-containment relationships among data items would be a challenge. When evaluating any type of storage system, the evaluation should be based on the type of data model needed and how it will be used.
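The comparison problem with numbers stored as text is easy to demonstrate: strings compare lexicographically, character by character, which disagrees with numeric order.

```python
# Numbers stored as text compare lexicographically, not numerically.
as_text = sorted(["9", "10", "2.5"])      # compares first characters:
                                          # '1' < '2' < '9'
as_numbers = sorted([9.0, 10.0, 2.5])     # true numeric order

text_says_9_less = "9" < "10"             # False: '9' > '1'
number_says_9_less = 9 < 10               # True
```

A range restriction such as `value < "10"` over text-encoded numbers would therefore wrongly exclude `"9"`, which is why an XML store has to map such fields to real numeric types before comparing them.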
Q11. Choosing a solution: Given this confusing array of alternatives, which data storage system do you choose for your application?
Don White: Make sure you choose the right tool for the job at hand. I think the one thing we know is that the relational model has been used to solve lots of problems, but it is not the be-all and end-all of data storage solutions. Other data storage models can offer advantages in more than niche situations.

