ODBMS Industry Watch » Versant
Trends and Information on Big Data, New Data Management Technologies, Data Science and Innovation.

Big Data from Space: the “Herschel” telescope.
August 2, 2013

“One of the biggest challenges with any project of such a long duration is coping with change. There are many aspects to coping with change, including changes in requirements, changes in technology, vendor stability, changes in staffing and so on.” –Jon Brumfitt.

On May 14, 2009, the European Space Agency launched an Ariane 5 rocket carrying the largest telescope ever flown: the “Herschel” telescope, 3.5 meters in diameter.

I first did an interview with Dr. Jon Brumfitt, System Architect & System Engineer of Herschel Scientific Ground Segment, at the European Space Agency in March 2011. You can read that interview here.

Two years later, I wanted to know the status of the project. This is a follow-up interview.

RVZ

Q1. What is the status of the mission?

Jon Brumfitt: The operational phase of the Herschel mission came to an end on 29th April 2013, when the super-fluid helium used to cool the instruments was finally exhausted. By operating in the far infra-red, Herschel has been able to see cold objects that are invisible to normal telescopes.
However, this requires that the detectors are cooled to an even lower temperature. The helium cools the instruments down to 1.7K (about -271 Celsius). Individual detectors are then cooled down further to about 0.3K. This is very close to absolute zero, which is the coldest possible temperature. The exhaustion of the helium marks the end of new observations, but it is by no means the end of the mission.
We still have a lot of work to do in getting the best results from the data processing to give astronomers a final legacy archive of high-quality data to work with for years to come.

The spacecraft has been in orbit around a point known as the second Lagrangian point “L2”, which is about 1.5 million kilometres from Earth (around four times as far away as the Moon). This location provided a good thermal environment and a relatively unrestricted view of the sky. The spacecraft cannot simply be left in this orbit, because regular correction manoeuvres would be needed. Consequently, it is being transferred into a “parking” orbit around the Sun.

Q2. What are the main results obtained so far by using the “Herschel” telescope?

Jon Brumfitt: That is a difficult one to answer in a few sentences. Just to take a few examples, Herschel has given us new insights into the way that stars form and the history of star formation and galaxy evolution since the Big Bang.
It has discovered large quantities of cold water vapour in the dusty disk surrounding a young star, which suggests the possibility of other water-covered planets. It has also given us new evidence for the origins of water on Earth.
The following are some links giving more detailed highlights from the mission:

– Press
– Results
– Press Releases
– Latest news

With its 3.5 metre diameter mirror, Herschel is the largest space telescope ever launched. The large mirror not only gives it a high sensitivity but also allows us to observe the sky with a high spatial resolution. So in a sense every observation we make is showing us something we have never seen before. We have performed around 35,000 science observations, which have already resulted in over 600 papers being published in scientific journals. There are many years of work ahead for astronomers in interpreting the results, which will undoubtedly lead to many new discoveries.

Q3. How much data did you receive and process so far? Could you give us some up to date information?

Jon Brumfitt: We have about 3 TB of data in the Versant database, most of which is raw data from the spacecraft. The data received each day is processed by our data processing pipeline and the resulting data products, such as images and spectra, are placed in an archive for access by astronomers.
Each time we make a major new release of the software (roughly every six months at this stage), with improvements to the data processing, we reprocess everything.
The data processing runs on a grid with around 35 nodes, each with typically 8 cores and between 16 and 256 GB of memory. This is able to process around 40 days’ worth of data per day, so it is possible to reprocess everything in a few weeks. The data in the archive is stored as FITS files (a standard format for astronomical data).
The archive uses a relational (PostgreSQL) database to catalogue the data and allow queries to find relevant data. This relational database is only about 60 GB, whereas the product files account for about 60 TB.
This may reduce somewhat for the final archive, once we have cleaned it up by removing the results of earlier processing runs.

Q4. What are the main technical challenges in the data management part of this mission and how did you solve them?

Jon Brumfitt: One of the biggest challenges with any project of such a long duration is coping with change. There are many aspects to coping with change, including changes in requirements, changes in technology, vendor stability, changes in staffing and so on.

The lifetime of Herschel will have been 18 years from the start of software development to the end of the post-operations phase.
We designed a single system to meet the needs of all mission phases, from early instrument development, through routine in-flight operations to the end of the post-operations phase. Although the spacecraft was not launched until 2009, the database was in regular use from 2002 for developing and testing the instruments in the laboratory. By using the same software to control the instruments in the laboratory as we used to control them in flight, we ended up with a very robust and well-tested system. We call this approach “smooth transition”.

The development approach we adopted is probably best classified as an Agile iterative and incremental one. Object orientation helps a lot because changes in the problem domain, resulting from changing requirements, tend to result in localised changes in the data model.
Other important factors in managing change are separation of concerns and minimization of dependencies, for example using component-based architectures.

When we decided to use an object database, it was a new technology and it would have been unwise to rely on any database vendor or product surviving for such a long time. Although work was under way on the ODMG and JDO standards, these were quite immature and the only suitable object databases used proprietary interfaces.
We therefore chose to implement our own abstraction layer around the database. This was similar in concept to JDO, with a factory providing a pluggable implementation of a persistence manager. This abstraction provided a route to change to a different object database, or even a relational database with an object-relational mapping layer, should it have proved necessary.
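
To make this concrete, here is a minimal Java sketch of what such an abstraction layer might look like (all names are invented for illustration; this is not the actual Herschel API). A factory hands the application a persistence manager interface, so a Versant-backed implementation could in principle be swapped for another object database or an object-relational layer without touching application code:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a JDO-like persistence abstraction; not the real Herschel interfaces.
interface PersistenceManager {
    void makePersistent(Object id, Object obj);      // store or update an object
    <T> T getObjectById(Class<T> type, Object id);   // retrieve by identity
    void close();
}

final class PersistenceManagerFactory {
    private PersistenceManagerFactory() {}

    // Selects an implementation by name; a real system would register a
    // Versant-backed implementation here instead of the in-memory stand-in.
    static PersistenceManager create(String impl) {
        if ("in-memory".equals(impl)) {
            return new InMemoryPersistenceManager();
        }
        throw new IllegalArgumentException("No implementation registered for: " + impl);
    }
}

// Trivial stand-in used only to keep the sketch self-contained and runnable.
final class InMemoryPersistenceManager implements PersistenceManager {
    private final Map<Object, Object> store = new HashMap<>();

    @Override public void makePersistent(Object id, Object obj) { store.put(id, obj); }

    @Override public <T> T getObjectById(Class<T> type, Object id) { return type.cast(store.get(id)); }

    @Override public void close() { store.clear(); }
}
```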

One aspect that is difficult to abstract is the use of queries, because query languages differ. In principle, an object database could be used without any queries, by navigating to everything from a global root object. However, in practice navigation and queries both have their role. For example, to find all the observation requests that have not yet been scheduled, it is much faster to perform a query than to iterate by navigation to find them. However, once an observation request is in memory it is much easier and faster to navigate to all the associated objects needed to process it. We have used a variety of techniques for encapsulating queries. One is to implement them as methods of an extent class that acts as a query factory.
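
The “extent class as query factory” idea can be pictured roughly like this (again a hypothetical sketch: real methods would build and run a Versant or JDO query rather than filter an in-memory list):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of encapsulating queries behind an extent class.
final class ObservationRequest {
    private final String id;
    private final boolean scheduled;

    ObservationRequest(String id, boolean scheduled) {
        this.id = id;
        this.scheduled = scheduled;
    }
    String getId() { return id; }
    boolean isScheduled() { return scheduled; }
}

final class ObservationRequestExtent {
    private final List<ObservationRequest> extent;   // stands in for the database extent

    ObservationRequestExtent(List<ObservationRequest> extent) {
        this.extent = extent;
    }

    // Callers ask for "all unscheduled requests" without ever seeing a query language.
    List<ObservationRequest> findUnscheduled() {
        return select(r -> !r.isScheduled());
    }

    private List<ObservationRequest> select(Predicate<ObservationRequest> criterion) {
        List<ObservationRequest> matches = new ArrayList<>();
        for (ObservationRequest request : extent) {
            if (criterion.test(request)) {
                matches.add(request);
            }
        }
        return matches;
    }
}
```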

Another challenge was designing a robust data model that would serve all phases of the mission from instrument development in the laboratory, through pre-flight tests and routine operations to the end of post-operations. We approached this by starting with a model of the problem domain and then analysing use-cases to see what data needed to be persistent and where we needed associations. It was important to avoid the temptation to store too much just because transitive persistence made it so easy.

One criticism that is sometimes raised against object databases is that the associations tend to encode business logic in the object schema, whereas relational databases just store data in a neutral form that can outlive the software that created it; if you subsequently decide that you need a new use-case, such as report generation, the associations may not be there to support it. This is true to some extent, but consideration of use cases for the entire project lifetime helped a lot. It is of course possible to use queries to work around missing associations.

Examples are sometimes given of how easy an object database is to use by directly persisting your business objects. This may be fine for a simple application with an embedded database, but for a complex system you still need to cleanly decouple your business logic from the data storage. This is true whether you are using a relational or an object database. With an object database, the persistent classes should only be responsible for persistence and referential integrity and so typically just have getter and setter methods.
We have encapsulated our persistent classes in a package called the Core Class Model (CCM) that has a factory to create instances. This complements the pluggable persistence manager. Hence, the application sees the persistence manager and CCM factories and interfaces, but the implementations are hidden.
Applications define their own business classes which can work like decorators for the persistent classes.
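
A rough sketch of that decorator idea (hypothetical classes, not the real Core Class Model): the persistent class holds only state and accessors, while an application-level business class wraps it and adds behaviour.

```java
// Hypothetical sketch: persistent class with state and accessors only, plus a business decorator.
class PersistentObservation {
    private String targetName;
    private long durationSeconds;

    public String getTargetName() { return targetName; }
    public void setTargetName(String targetName) { this.targetName = targetName; }
    public long getDurationSeconds() { return durationSeconds; }
    public void setDurationSeconds(long durationSeconds) { this.durationSeconds = durationSeconds; }
}

// Business logic lives in the application layer, wrapping the persistent object.
class ObservationPlanner {
    private final PersistentObservation observation;

    ObservationPlanner(PersistentObservation observation) {
        this.observation = observation;
    }

    // Example business rule kept out of the persistent class.
    boolean fitsInRemainingSchedule(long remainingSeconds) {
        return observation.getDurationSeconds() <= remainingSeconds;
    }
}
```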

Q5. What is your experience in having two separate database systems for Herschel? A relational database for storing and managing processed data products and an object database for storing and managing proposal data, mission planning data, telecommands and raw (unprocessed) telemetry?

Jon Brumfitt: There are essentially two parts to the ground segment for a space observatory.
One is the “uplink” which is used for controlling the spacecraft and instruments. This includes submission of observing proposals, observation planning, scheduling, flight dynamics and commanding.
The other is the “downlink”, which involves ingesting and processing the data received from the spacecraft.

On some missions the data processing is carried out by a data centre, which is separate from spacecraft operations. In that case there is a very clear separation.
On Herschel, the original concept was to build a completely integrated system around an object database that would hold all uplink and downlink data, including processed data products. However, after further analysis it became clear that it was better to integrate our product archive with those from other missions. This also means that the Herschel data will remain available long after the project has finished. The role of the object database is essentially for operating the spacecraft and storing the raw data.

The Herschel archive is part of a common infrastructure shared by many of our ESA science projects. This provides a uniform way of accessing data from multiple missions.
The following is a nice example of how data from Herschel and our XMM-Newton X-ray telescope have been combined to make a multi-spectral image of the Andromeda Galaxy.

Our archive, in turn, forms part of a larger international archive known as the “Virtual Observatory” (VO), which includes both space and ground-based observatories from all over the world.

I think that using separate databases for operations and product archiving has worked well. In fact, it is the norm rather than the exception. The two databases serve very different roles.
The uplink database manages the day-to-day operations of the spacecraft and is constantly being updated. The uplink data forms a complex object graph which is accessed by navigation, so an object database is well suited.
The product archive is essentially a write-once-read-many repository. The data is not modified, but new versions of products may be added as a result of reprocessing. There are a large number of clients accessing it via the Internet. The archive database is a catalogue containing the product meta-data, which can be queried to find the relevant product files. This is better suited to a relational database.

The motivation for the original idea of using a single object database for everything was that it allowed direct association between uplink and downlink data. For example, processed products could be associated with their observation requests. However, using separate databases does not prevent one database being queried with an observation identifier obtained from the other.
One complication is that processing an observation requires both downlink data and the associated uplink data.
We solved this by creating “uplink products” from the relevant uplink data and placing them in the archive. This has the advantage that external users, who do not have access to the Versant database, have everything they need to process the data themselves.

Q6. What are the main lessons learned so far in using Versant object database for managing telemetry data and information on steering and calibrating scientific on-board instruments?

Jon Brumfitt: Object databases can be very effective for certain kinds of application, but may have less benefit for others. A complex system typically has a mixture of application types, so the advantages are not always clear cut. Object databases can give a high performance for applications that need to navigate through a complex object graph, particularly if used with fairly long transactions where a significant part of the object graph remains in memory. Web (JavaEE) applications lose some of the benefit because they typically perform many short transactions with each one performing a query. They also use additional access layers that result in a system which loses the simplicity of the transparent persistence of an object database.

In our case, the object database was best suited for the uplink. It simplified the uplink development by avoiding object-relational mapping and the complexity of a design based on JDBC or EJB 2. Nowadays with JPA, relational databases are much easier to use for object persistence, so the rationale for using an object database is largely determined by whether the application can benefit from fast navigational access and how much effort is saved in mapping. There are now at least two object database vendors that support both JDO and JPA, so the distinction is becoming somewhat blurred.

For telemetry access we query the database instead of using navigation, as the packets don’t fit neatly into a single containment hierarchy. Queries allow packets to be accessed by many different criteria, such as time, instrument, type, source and so on.
Processing calibration observations does not introduce any special considerations as far as the database is concerned.
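
As a sketch of what criteria-based telemetry access could look like (invented names and values; the real system would hand an equivalent filter to the database’s query facility):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical builder that assembles a telemetry filter from optional criteria.
final class TelemetryQueryBuilder {
    private final List<String> clauses = new ArrayList<>();

    TelemetryQueryBuilder timeRange(long startUtc, long endUtc) {
        clauses.add("obsTime >= " + startUtc + " && obsTime <= " + endUtc);
        return this;
    }

    TelemetryQueryBuilder instrument(String name) {
        clauses.add("instrument == \"" + name + "\"");
        return this;
    }

    TelemetryQueryBuilder packetType(String type) {
        clauses.add("type == \"" + type + "\"");
        return this;
    }

    // Combines the accumulated criteria into a single filter expression.
    String buildFilter() {
        return String.join(" && ", clauses);
    }

    public static void main(String[] args) {
        // Example: housekeeping packets from one instrument in a given time window
        // (field names and values are illustrative only).
        String filter = new TelemetryQueryBuilder()
                .timeRange(1_000_000L, 1_086_400L)
                .instrument("SPIRE")
                .packetType("HOUSEKEEPING")
                .buildFilter();
        System.out.println(filter);
    }
}
```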

Q7. Did you have any scalability and or availability issues during the project? If yes, how did you solve them?

Jon Brumfitt: Scalability would have been an important issue if we had kept to the original concept of storing everything including products in a single database. However, using the object database for just uplink and telemetry meant that this was not a big issue.

The data processing grid retrieves the raw telemetry data from the object database server, which is a 16-core Linux machine with 64 GB of memory. The average load on the server is quite low, but occasionally there have been high peak loads from the grid that have saturated the server disk I/O and slowed down other users of the database. Interactive applications such as mission planning need a rapid response, whereas batch data processing is less critical. We solved this by implementing a mechanism to spread out the grid load by treating the database as a resource.
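
The interview does not spell out the load-spreading mechanism, so the following is purely an assumed illustration of “treating the database as a resource”: capping the number of grid jobs that may fetch telemetry from the server at the same time.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

// Assumed illustration only: limit concurrent telemetry fetches so batch
// reprocessing cannot saturate the database server's disk I/O.
final class TelemetryFetchThrottle {
    private final Semaphore slots;

    TelemetryFetchThrottle(int maxConcurrentFetches) {
        this.slots = new Semaphore(maxConcurrentFetches, true);   // fair queueing of grid jobs
    }

    // Each grid job wraps its database fetch in acquire/release.
    <T> T fetch(Callable<T> databaseFetch) throws Exception {
        slots.acquire();
        try {
            return databaseFetch.call();
        } finally {
            slots.release();
        }
    }
}
```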

Once a year, we have made an “Announcement of Opportunity” for astronomers to propose observations that they would like to perform with Herschel. It is only human nature that many people leave it until the last minute and we get a very high peak load on the server in the last hour or two before the deadline! We have used a separate server for this purpose, rather than ingesting proposals directly into our operational database. This has avoided any risk of interfering with routine operations. After the deadline, we have copied the objects into the operational database.

Q8. What about the overall performance of the two databases? What are the lessons learned?

Jon Brumfitt: The databases are good at different things.
As mentioned before, an object database can give a high performance for applications involving a complex object graph which you navigate around. An example is our mission planning system. Object persistence makes application design very simple, although in a real system you still need to introduce layers to decouple the business logic from the persistence.

For the archive, on the other hand, a relational database is more appropriate. We are querying the archive to find data that matches a set of criteria. The data is stored in files rather than as objects in the database.

Q9. What are the next steps planned for the project and the main technical challenges ahead?

Jon Brumfitt: As I mentioned earlier, the coming post-operations phase will concentrate on further improving the data processing software to generate a top-quality legacy archive, and on provision of high-quality support documentation and continued interactive support for the community of astronomers that forms our “customer base”. The system was designed from the outset to support all phases of the mission, from early instrument development tests in the laboratory, through routine operations to the end of the post-operations phase of the mission. The main difference moving into post-operations is that we will stop uplink activities and ingesting new telemetry. We will continue to reprocess all the data regularly as improvements are made to the data processing software.

We are currently in the process of upgrading from Versant 7 to Versant 8.
We have been using Versant 7 since launch and the system has been running well, so there has been little urgency to upgrade.
However, with routine operations coming to an end, we are doing some “technology refresh”, including upgrading to Java 7 and Versant 8.

Q10. Anything else you wish to add?

Jon Brumfitt: These are just some personal thoughts on the way the database market has evolved over the lifetime of Herschel. Thirteen years ago, when we started development of our system, there were expectations that object databases would really take off in line with the growing use of object orientation, but this did not happen. Object databases still represent rather a niche market. It is a pity there is no open-source object-database equivalent of MySQL. This would have encouraged more people to try object databases.

JDO has developed into a mature standard over the years. One of its key features is that it is “architecture neutral”, but in fact there are very few implementations for relational databases. However, it seems to be finding a new role for some NoSQL databases, such as the Google AppEngine datastore.
NoSQL appears to be taking off far quicker than object databases did, although it is an umbrella term that covers quite a few kinds of datastore. Horizontal scaling is likely to be an important feature for many systems in the future. The relational model is still dominant, but there is a growing appreciation of alternatives. There is even talk of “Polyglot Persistence” using different kinds of databases within a system; in a sense we are doing this with our object database and relational archive.

More recently, JPA has created considerable interest in object persistence for relational databases and appears to be rapidly overtaking JDO.
This is partly because it is being adopted by developers of enterprise applications who previously used EJB 2.
If you look at the APIs of JDO and JPA they are actually quite similar apart from the locking modes. However, there is an enormous difference in the way they are typically used in practice. This is more to do with the fact that JPA is often used for enterprise applications. The distinction is getting blurred by some object database vendors who now support JPA with an object database. This could expand the market for object databases by attracting some traditional relational type applications.

So, I wonder what the next 13 years will bring! I am certainly watching developments with interest.
——

Dr Jon Brumfitt, System Architect & System Engineer of Herschel Scientific Ground Segment, European Space Agency.

Jon Brumfitt has a background in Electronics with Physics and Mathematics and has worked on several of ESA’s astrophysics missions, including IUE, Hipparcos, ISO, XMM and currently Herschel. After completing his PhD and a post-doctoral fellowship in image processing, Jon worked on data reduction for the IUE satellite before joining Logica Space and Defence in 1980. In 1984 he moved to Logica’s research centre in Cambridge and then in 1993 to ESTEC in the Netherlands to work on the scientific ground segments for ISO and XMM. In January 2000, he joined the newly formed Herschel team as science ground segment System Architect. As Herschel approached launch, he moved down to the European Space Astronomy Centre in Madrid to become part of the Herschel Science Operations Team, where he is currently System Engineer and System Architect.

Related Posts

The Gaia mission, one year later. Interview with William O’Mullane. January 16, 2013

Objects in Space: “Herschel” the largest telescope ever flown. March 18, 2011

Resources

Introduction to ODBMS By Rick Grehan

ODBMS.org Resources on Object Database Vendors.

—————————————
You can follow ODBMS.org on Twitter: @odbmsorg

##

Acquiring Versant – Interview with Steve Shine.
March 6, 2013

“So the synergies in data management come not from how the systems connect but how the data is used to derive business value” –Steve Shine.

On Dec. 21, 2012, Actian Corp. announced the completion of the transaction to buy Versant Corporation. I have interviewed Steve Shine, CEO and President, Actian Corporation.

RVZ

Q1. Why acquire an object-oriented database company such as Versant?

Steve Shine: Versant Corporation, like us, has a long pedigree in solving complex data management in some of the world’s largest organisations. We see many synergies in bringing the two companies together. The most important of these is that together we are able to invest more resources in helping our customers extract even more value from their data. Our direct clients will have a larger product portfolio to choose from, our partners will be able to expand in adjacent solution segments, and strategically we arm ourselves with the skills and technology to fulfil our plans to deliver innovative solutions in the emerging Big Data market.

Q2. For the enterprise market, Actian offers its legacy Ingres relational database. Versant, on the other hand, offers an object-oriented database, especially suited for complex science/engineering applications. How does this fit? Do you have a strategy on how to offer a set of support processes and related tools for the enterprise? If yes, how?

Steve Shine: While the two databases may not have a direct logical connection at client installations, we recognise that most clients use these two products as part of larger, more holistic solutions to support their operations. The data they manage is the same and interacts to solve business issues – for example, object stores to manage the relationships between entities, transactional systems to manage clients and the supply chain, and analytic systems to monitor and tune operational performance – different systems using the same underlying data to drive a complex business.

We plan to announce a vision of an integrated platform designed to help our clients manage all their data and their complex interactions, both internal and external, so they can not only focus on running their business, but also better exploit the incremental opportunity promised by Big Data.

Q3. Bernhard Woebker, president and chief executive officer of Versant stated, “the combination of Actian and Versant provides numerous synergies for data management”. Could you give us some specific examples of such synergies for data management?

Steve Shine: Here is a specific example of what I mean by helping clients extract more value from data in the Telco space. These types of incremental opportunities exist in every vertical we have looked at.

An OSS system in a Telco today may use an Object store to manage the complex relationships between the data, the same data is used in a relational store to monitor, control and manage the telephone network.

Another relational store using variants of the same data manages the provisioning, billing and support for the users of the network. The whole data set in Analytical stores is used to monitor and optimise performance and usage of the network.

Fast forwarding to today, the same data used in more sophisticated ways has allowed voice and data networks to converge to provide a seamless interface to mobile users. As a result, Telcos have tremendous incremental revenue opportunities, BUT only if they can exploit the data they already have in their networks. For example: the data on their networks has allowed for a huge increase in location-based services; knowledge and analysis of the data content has allowed providers to push targeted advertising and other revenue-earning services at their users; and the phone is being turned into a common billing device to capture an even greater share of the service providers’ revenue… You get the picture.

Now imagine other corporations being able to exploit their information in similar ways: Would a retailer benefit from knowing the preferences of who’s in their stores? Would a hedge fund benefit from detecting a sentiment shift for a stock as it happens? Even knowledge of simple events can help organisations become more efficient.
A salesman knowing immediately when a key client raises a support ticket; a product manager knowing what’s being asked on discussion forums; a marketing manager knowing a perfect prospect is on the website.

So the synergies in data management come not from how the systems connect but from how the data is used to derive business value. We want to help manage all the data in our customers’ organisations and help them drive incremental value from it. That is what we mean by numerous synergies from data management, and we have a vision to deliver it to our customers.

Q4. Actian claims to have more than 10,000 customers worldwide. What is the value proposition of the Versant acquisition for existing Actian customers?

Steve Shine: I have covered this in the answers above. They get access to a larger portfolio of products and services and we together drive a vision to help them extract greater value from their data.

Q5. Versant claims to have more than 150,000 installations worldwide. How do you intend to support them?

Steve Shine: Actian already runs a 24/7 global support organisation that prides itself on delivering one of the industry’s best client satisfaction scores. As far as numbers are concerned, Versant’s large user count is in essence driven by only 250 or so very sophisticated large installations, whereas Actian already deals with over 10,000 discrete mission-critical installations worldwide. So we are confident of maintaining our very high support levels, and the Versant support infrastructure is being integrated into Actian’s as we speak.

Q6. Actian is active in the market for big data analytics. How does Versant’s database technology fit into Actian’s big data analytics offerings and capabilities?

Steve Shine: Using the example above, imagine using OSS data to analyse network utilisation, CDRs and billing information to identify pay plans for your most profitable clients.

Now give these clients the ability to take business action on real-time changes in their data. Now imagine being able to do that with an integrated product set from one vendor. We will be announcing the vision behind this strategy this quarter. In addition, the Versant technology gives us additional options for big data solutions, for example visualisation and managing metadata.

Q7. Do you intend to combine or integrate your analytics database Vectorwise with Versant’s database technology (such as Versant JPA)? If yes, how?

Steve Shine: Specific plans for integrating products within the overall architecture have not been formulated. We have a strong philosophy that you should use the best tool for the job, e.g. an OODB for some things, an OLTP RDBMS for others, etc. But the real value comes from being able to perform sophisticated analysis and management across the different data stores. That is part of the work our platform integration efforts are focused on.

Q8. What are the plans for future software developments. Will you have a joint development team or else?

Steve Shine: We will be merging the engineering teams to focus on providing innovative solutions for Big Data under a single leadership.

Q9. You have recently announced two partnerships for Vectorwise, with Inferenda and BiBoard. Will you also pursue this indirect channel path for Versant’s database technology?

Steve Shine: The beauty of the vision we speak of is that our joint partners have a real opportunity to expand their solutions using Actian’s broader product set and, for those that are innovative, the opportunity to address new emerging markets.

Q10. Versant recently developed Versant JPA. Is the Java market important for Actian?

Steve Shine: Yes!

Q11. It is currently a crowded database market: several new database vendors (NoSQL and NewSQL) offering innovative database technology (NuoDB, VoltDB, MongoDB, Cassandra, Couchbase, Riak to name a few), and large companies such as IBM and Oracle, are all chasing the big data market. What is your plan to stand out of the crowd?

Steve Shine: We are very excited about the upcoming announcement on our plans for the Big Data market. We will be happy to brief you on the details closer to the time, but I will say that early feedback from analyst houses like Gartner has confirmed that our solution is very effective and differentiated in helping corporations extract business value from Big Data. On a higher scale, many of the start-ups are going to get a very rude awakening when they find that delivering a database for mission-critical use is much more than speed and scale of technology. Enterprises want world-class 24×7 support service with failsafe resilience and security. Real industry-grade databases take years and many millions of dollars to reach scalable maturity. Most of the start-ups will not make it. Actian is uniquely positioned in being profitable and having delivered industry-grade database innovation, while also being singularly focused on data management, unlike the broad, cumbersome and expensive bigger players. We believe value-conscious enterprises will see our maturity and agility as a great strength.

Qx Anything else you wish to add?

Steve Shine: DATA! – What a great thing to be involved in! Endless value, endless opportunities for innovation and no end in sight as far as growth is concerned. I look forward to the next 5 years.

———————–

Steve Shine, CEO and President, Actian Corporation.
Steve comes to Actian from Sybase where he was senior vice president and general manager for EMEA, overseeing all operational, sales, financial and human resources in the region for the past three years. While at Sybase, he achieved more than 200 million in revenue and managed 500 employees, charting over 50 percent growth in the Business Intelligence market for Sybase. Prior to Sybase, Steve was at Canadian-based Geac Computer Corporation for ten successful years, helping to successfully turn around two major global divisions for the ERP firm.

Related Posts

Managing Internet Protocol Television Data. — An interview with Stefan Arbanowski. June 25, 2012

On Versant’s technology. Interview with Vishal Bagga. August 17, 2011

Resources

Big Data: Principles and best practices of scalable realtime data systems. Nathan Marz (Twitter) and James Warren, MEAP began January 2012, Manning Publications.

Analyzing Big Data With Twitter. A special UC Berkeley iSchool course.

A write-performance improvement of ZABBIX with NoSQL databases and HistoryGluon. MIRACLE LINUX CORPORATION, February 13, 2013

Ultra-High Performance NoSQL Benchmarking: Analyzing Durability and Performance Tradeoffs. Ben Engber, CEO, Thumbtack Technology, JANUARY 2013.

Follow ODBMS.org on Twitter: @odbmsorg
##

Managing Internet Protocol Television Data. An interview with Stefan Arbanowski.
June 25, 2012

“There is a variety of services possible via IPTV. Starting with live/linear TV and Video on Demand (VoD) over interactive broadcast related apps, like shopping or advertisement, up to social TV apps where communities of users have shared TV experience” –Stefan Arbanowski.

The research center Fraunhofer FOKUS (Fraunhofer Institute for Open Communication Systems) in Berlin has established a “SmartTV Lab” to build an independent development and testing environment for HybridTV technologies and solutions. They did some work on benchmarking databases for Internet Protocol Television Data. I have interviewed Stefan Arbanowski, who leads the Lab.

RVZ

Q1.What are the main research areas at the Fraunhofer Fokus research center?

Stefan Arbanowski: Be it on your mobile device, TV set or car – the FOKUS Competence Center for Future Media and Applications (FAME) develops future web technologies to offer intelligent services and applications. Our team of visionaries combines creativity and innovation with their technical expertise for the creation of interactive media. These technologies enable smart personalization and support future web functionalities on various platforms from diverse domains.

The experts rigorously focus on web-based technologies and strategically use open standards. In the FOKUS Hybrid TV Lab our experts develop future IPTV technologies compliant with current standards, with emphasis on advanced functionality, convergence and interoperability. The FOKUS Open IPTV Ecosystem offers one of the first solutions for standardized media services and core components of the various standards.

Q2. At Fraunhofer Fokus, you have experience in using a database for managing and controlling IPTV (Internet Protocol Television) content. What is IPTV? What kind of internet television services can be delivered using IPTV?

Stefan Arbanowski: There is a variety of services possible via IPTV.
Starting with live/linear TV and Video on Demand (VoD) over interactive broadcast related apps, like shopping or advertisement, up to social TV apps where communities of users have shared TV experience.

Q3. What is IPTV data? Could you give a short description of the structure of a typical IPTV data?

Stefan Arbanowski: This is complex: start with page 14 of this doc. 😉

Q4. What are the main requirements for a database to manage such data?

Stefan Arbanowski: There are different challenges. One is the management of the different sessions of streams that are used by viewers following a particular service, including for instance electronic program guide (EPG) data. Another one is the pure usage data for billing purposes. The requirements are concurrent read/write operations on large (>=1 GB) databases while ensuring fast response times.

Q5. How did you evaluate the feasibility of a database technology for managing IPTV data?

Stefan Arbanowski: We compared the Versant ODB (JDO interface) with MySQL Server 5.0 and with handling data in RAM. For this we did three implementations, trying to get the most out of the individual technologies.

Q6. Your IPTV benchmark is based on use cases. Why? Could you briefly explain them?

Stefan Arbanowski: It has to be a real-world scenario to judge whether a particular technology really helps. We identified the bottlenecks in current IPTV systems and used them as the basis for our use cases.

The objective of the first test case was to handle a demanding number of simultaneous read/write operations and queries with small data objects, typically found in an IPTV Session Control environment.
V/OD performed more than 50% better compared to MySQL in a server side, 3-tier application server architecture. Our results for a traditional client/server architecture showed smaller advantages for the Versant Object Database, performing only approximately 25% better than MySQL, probably because of the network latency of the test environment.

The second test case managed larger-sized Broadband Content Guide (BCG, i.e. Electronic Program Guide) data in one large transaction. V/OD was more than 8 times faster than MySQL. In our analysis, the test case demonstrated V/OD’s significant advantages when managing complex data structures.

We wrote a white paper for more details.

Q7. What are the main lessons learned in running your benchmark?

Stefan Arbanowski: Comparing databases is never an easy task. Many specific requirements influence the decision making process, for example, the application specific data model and application specific data management tasks. Instead of using a standard database benchmark, such as TPC-C, we chose to develop a couple of benchmarks that are based on our existing IPTV Ecosystem data model and data management requirements, which allowed us to analyze results that are more relevant to the real world requirements found in such systems.

Q8. Anything else you wish to add?

Stefan Arbanowski: Considering these results, we would recommend a V/OD database implementation where performance is mandatory and in particular when the application must manage complex data structures.

_____________________
Dr. Stefan Arbanowski is head of the Competence Centre Future Applications and Media (FAME) at Fraunhofer Institute for Open Communication Systems FOKUS in Berlin, Germany.
Currently, he is coordinating Fraunhofer FOKUS’ IPTV activities, bundling expertise in the areas of interactive applications, media handling, mobile telecommunications, and next generation networks. FOKUS channels those activities towards networked media environments featuring live, on demand, context-aware, and personalized interactive media.
Besides telecommunications and distributed service platforms, he has published more than 70 papers in respective journals and conferences in the area of personalized service provisioning. He is a member of various program committees of international conferences.

On Versant’s technology. Interview with Vishal Bagga.
August 17, 2011

“We believe that data only becomes useful once it becomes structured.” –Vishal Bagga

There is a lot of discussion on NoSQL databases nowadays. But what about object databases?
I asked a few questions to Vishal Bagga, Senior Product Manager at Versant.

RVZ

Q1. How has Versant’s technology evolved over the past three years?

Vishal Bagga: Versant is a customer driven company. We work closely with our customers trying to understand how we can evolve our technology to meet their challenges – whether it’s regarding complexity, data size or demanding workloads.

In the last three years we have seen two very clear trends from our interaction with our new and existing customers – growing data sizes and increasingly parallel workloads. This is very much in line with what the general database market is seeing. In addition there were requests for simplified database management and monitoring.

Our state-of-the-art Versant Object Database 8, released last year, was designed for exactly these scenarios. We have added increased scalability and performance on multi-core architectures, faster and better defragmentation tools, and Eclipse-based management and monitoring tools, to name a few. We are also re-architecting our database server technology to automatically scale when possible without manual DBA intervention and to allow online tuning (reconfiguring the database instance online without impacting applications).

Q2. On December 1, 2008 Versant acquired the assets of the database software business of Servo Software, Inc. (formerly db4objects, Inc.). What happened to db4objects since then? How does db4objects fit into Versant technology strategy?

Vishal Bagga: The db4o community is doing well and is an integral part of Versant. In fact, when we first acquired db4o at the end of 2008, there were just short of 50,000 registered members.
Today, the db4o community boasts nearly 110,000 members, having more than doubled in size in the last 2+ years.
In addition, db4o has had two major releases with some significant advances in enterprise-type features, allowing things like online defragmentation support. In our latest major release, we announced a new data replication capability between db4o and the large-scale, enterprise-class Versant database.
Versant sees a great need in the mobile markets for technology like db4o, which can play well in the lightweight handheld, mobile computing and machine-to-machine space while leveraging big data aggregation servers like Versant, which can handle the huge number of events coming off these intelligent edge devices.
In the coming year, even greater synergies are being developed and our communities are merging into one single group dedicated to next generation NoSQL 2.0 technology development.

Q3. Versant database and NoSQL databases: what are the similarities and what are the differences?

Vishal Bagga: The Not Only SQL databases are essentially systems that have evolved out of a certain business need: the need for horizontally scalable systems running on commodity hardware with a simple “soft-schema” model, for example social networking, offline data crunching, distributed logging systems, event processing systems, etc.

Relational databases were considered to be too slow, too expensive, difficult to manage and administer, and difficult to adapt to quickly changing models.

If I look at similarities between Versant and NoSQL, I would say that:

Both systems are designed around the inefficiency of JOINs. This is the biggest problem with relational databases. If you think about it, in most operational systems relations don’t change, e.g. Blog:Article, Order:OrderItem, so why recalculate those relations each time they are accessed, using a methodology which gets slower and slower as the amount of data gets larger? JOINs have a use case, but for some 20%, not 100%, of the use cases.

Both systems leverage an architectural shift to a “soft-schema” which allows scale-out capability – the ability to partition information across many physical nodes and treat those nodes as 1 ubiquitous database.

When it comes to differences:

The biggest in my opinion is the complexity of the data. Versant allows you to model very complicated data models with ease, whereas doing so with a NoSQL solution would take much more effort and you would need to write a lot of code in the application to represent the data model.
In this respect, Versant prefers to use the term “soft-schema” –vs- the term “schemaless”, terms which are often interchanged in discussion.
We believe that data only becomes useful once it becomes structured, in fact that is the whole point of technologies like Hadoop, to churn unstructured data looking for a way to structure it into something useful.
NoSQL technologies that bill themselves as “schema-less” are in denial of the fact that they are leaving the application developer with the burden of defining the structure and mapping the data into that structure in the language of the application space. In many ways, it is the mapping problem all over again. Plus, that kind of data management is very hard to change over time, leading to a brittle solution that is difficult to optimize for more than one use case. The use of a “soft-schema” lends itself to a more enterprise-manageable and extensible system where the database still retains important elements of structure, while still being able to store and manipulate unstructured types.

Another is the difference in the consistency model. Versant is ACID centric and Versant’s customers depend on this for their mission critical systems – it would be nearly impossible for these systems to use NoSQL given the relaxed constraints. Versant can do a CAP mode, but that is not our only mode of operation. You use it where it is really needed; you are not forced into using it unilaterally.

NoSQL systems make you store your data in a way that you can look up efficiently by a key. But what if you want to look up something differently? It is likely to be terribly inefficient. This may be okay for the design, but a lot of people do not realize that this is a big change in mindset. Versant offers a more balanced approach where you can navigate between related objects using references; you can, for example, define a root object and then navigate your tree from that object. At the same time you can run ad-hoc queries whenever you want to.
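
To illustrate the navigational style (using the Blog:Article example mentioned earlier; the classes are invented): once the root object is in memory, related objects are reached by following plain references rather than by key lookups or joins.

```java
import java.util.ArrayList;
import java.util.List;

// Invented classes illustrating navigation from a root object via references.
class Comment {
    private final String text;
    Comment(String text) { this.text = text; }
    String getText() { return text; }
}

class Article {
    private final String title;
    private final List<Comment> comments = new ArrayList<>();
    Article(String title) { this.title = title; }
    String getTitle() { return title; }
    List<Comment> getComments() { return comments; }
}

class Blog {
    private final List<Article> articles = new ArrayList<>();
    List<Article> getArticles() { return articles; }
}

class NavigationExample {
    // With the root in memory, the whole tree is reachable by reference;
    // an ad-hoc query would only be needed for lookups that cut across the tree.
    static int countComments(Blog root) {
        int total = 0;
        for (Article article : root.getArticles()) {
            total += article.getComments().size();
        }
        return total;
    }
}
```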

Q4. Big Data: Can Versant database be useful when dealing with petabytes of user data? How?

Vishal Bagga: I don’t see why not. Versant was always designed to work on a network of databases from the very start. Dealing with a Petabyte is really about designing a system with the right architecture. Versant has that architecture just as intact as anyone in the database space saying they can handle a Petabyte. Make no mistake, no matter how you do it, it is a non-trivial task. Today, our largest customer databases are in the hundreds-of-terabytes range, so getting to a Petabyte is really a matter of needing that much data.

Q5. Hadoop is designed to process large batches of data quickly. Do you plan to use Hadoop and leverage components of the Hadoop ecosystem like HBase, Pig, and Hive?

Vishal Bagga: Yes, and some of our customers already do that today. A question for you: “Why are those layers in existence?” I would say the answer is that most of these early NoSQL 1.0 technologies do not handle real world complexity in information models. So, these layers are built to try and compensate for that fact. That is the exact point where Versant’s NoSQL 2.0 technology fits into the picture, we help people deal with complexity of information models, something that 1st generation NoSQL has not managed to accomplish.

Q6. Do you think that projects such as JSON (JavaScript Object Notation) and MessagePack (a binary-based efficient object serialization library) play a role in the ODBMS market?

Vishal Bagga: Absolutely. We believe in open standards. Fortunately, you can store any type in an ODBMS. These specific libraries are particularly important for current most popular client frameworks like Ajax. Finding ways to deliver a soft-schema into a client friendly format is essential to help ease the development burden.

Q7. Looking at three elements: Data, Platform, Analysis, where is Versant heading up?

Vishal Bagga: It is a difficult question, as databases and data management are increasingly a cross-cutting concern. It used to be perfectly fine to keep your Analysis as part of your off-line OLAP systems, but these days there is an increasing push to get Analytics to the real-time business.
So, you play with Data, you play with Analytics, whether you do it directly or in concert with other technologies through partnership. Certainly, as Versant embraces Platform as a Service, we will do so through ecosystem partners who are paving the way with new development and deployment methodologies.

Related Posts

Objects in Space: “Herschel” the largest telescope ever flown. (March 18, 2011)

Benchmarking ORM tools and Object Databases. (March 14, 2011)

Robert Greene on “New and Old Data stores” . (December 2, 2010)

Object Database Technologies and Data Management in the Cloud. (September 27, 2010)

##

Objects in Space: “Herschel” the largest telescope ever flown.
March 18, 2011

Managing telemetry data and information on steering and calibrating scientific on-board instruments with an object database.

Interview with Dr. Johannes Riedinger, Herschel Mission Manager, European Space Agency (ESA), and Dr Jon Brumfitt, System Architect of the Herschel Science Ground Segment, European Space Agency (ESA).
———————————
More Objects in Space…
I became aware of another very interesting project at the European Space Agency (ESA). On May 14, 2009, the European Space Agency launched an Ariane 5 rocket carrying the largest telescope ever flown: the “Herschel” telescope, 3.5 meters in diameter. The satellite, whose orbit is some 1.6 million kilometers from the Earth, will operate for 48 months.

One interesting aspect of this project, for us at odbms.org, is that they use an object database to manage telemetry data and information on steering and calibrating scientific on-board instruments.

I had the pleasure to interview Dr. Johannes Riedinger, Herschel Mission Manager, and Dr. Jon Brumfitt, System Architect of Herschel Scientific Ground Segment, both at the European Space Agency.

Hope you’ll enjoy the interview!

RVZ

Q1. What is the mission of the “Herschel” telescope?

Johannes: The Herschel Space Observatory is the latest in a series of space telescopes that observe the sky at far-infrared wavelengths that cannot be observed from the ground. The most prominent predecessors were the Infrared Astronomical Satellite (IRAS), a joint US, UK, NL project launched in 1983, next in line was the Infrared Space Observatory (ISO), a telescope the European Space Agency launched in 1995, which was followed by the Spitzer Space Telescope, a NASA mission launched in 2003. The two features that distinguish Herschel from all its predecessors are its large primary mirror, which translates into a high sensitivity because of the large photon collecting area, and the fact that it is sensitive out to a wavelength of 670 μm. Extending the wavelength range of its predecessors by more than a factor of 3, Herschel is closing the last big gap that has remained in viewing celestial objects at wavelengths between 200 μm and the sub-millimetre regime. In the wavelength range in which Herschel overlaps to varying degrees with its predecessors, from 57 to 200 μm, Herschel’s big advantage is once more the size of its primary mirror. Because spatial resolution at a given wavelength improves linearly with telescope diameter, Herschel has a 4 to 6 times sharper vision than all earlier far-infrared telescopes.

Q2. In a report in 2010 [1] you wrote that “you collect every day an average of six to seven gigabit raw telemetry data and additional information for steering and calibrating three scientific on-board instruments”. Is this still the case? What are the main technical challenges with respect to data processing, manipulation and storage that this project poses?

Johannes: The average rate at which we have been receiving raw data from Herschel over the past 18 months is about 12 gigabits/day, which increases somewhat through the addition of related data by the time it is ingested into the database. The amount of data, and its distribution to the three national data centres that monitor the health and calibration of the instruments, is quite manageable. More challenging are the data volumes that need to be handled in the form of various levels of scientific data products, which are generated in the data processing pipeline. Another challenge, even for today’s computers, is the amount of memory some products require during processing. This goes significantly beyond the amount of memory a standard desktop or laptop computer comes with, and even the amount of disk space such machines used to come with until a few years ago: Only recently we have procured two computers with 256 GB of RAM for the grid on which we run the pipeline to efficiently process some of the data.

Q3. What are the requirements of the “Herschel” data processing software? And what are the main components and functionalities of the Herschel software?

Jon: The software is required to plan daily sequences of astronomical observations and associated spacecraft manoeuvres, based on proposals submitted by the astronomical community, and to turn these into telecommand sequences to uplink to the spacecraft. The data processing system is required to process the resulting telemetry and convert it into scientific products (images, spectra, etc) for the astronomers. The “uplink” system is particularly critical, because it must have a high availability to keep the spacecraft supplied with commands. Significant parts of the software were required to be working several years before launch, so that they could be used to test the scientific instruments in the laboratory. This approach, which we call “smooth transition”, ensures that certain critical parts of the software are very mature by the time we come to launch and that the instruments have been extensively tested with the real control system.

The uplink chain starts with a proposal submission system that allows astronomers to submit their observation requests. The astronomer downloads a Java client which communicates with a JBOSS application server to ingest proposals into the Versant database. Next, the mission planning system allows the mission planner to schedule sequences of observations. This is a complex task which is subject to many constraints on the allowed spacecraft orientation, the target visibility, stray light, ground station availability, etc. It is important to minimize the time wasted by slewing between targets. Each schedule is expanded into a sequence of several thousand telecommands to control the spacecraft and scientific instruments for a period of 24 hours. The spacecraft is in contact with the Earth only once a day for typically 3 hours, during which time a new schedule is uplinked and data from the previous day is downlinked and ingested into the Versant database. Between contact periods, the spacecraft executes the commands autonomously. The sequences of instrument commands have to be accurately synchronized with each little manoeuvre of the spacecraft when we are mapping areas of the sky. This is achieved by a system known as the “Common Uplink System” in which the detailed commanding is programmed in a special language that models the on-board execution timing.

The other major component of the software is the Data Processing system. This consists of a pipeline that processes the telemetry each day and places the resulting data products in an archive. Astronomers can download the products. We also provide an interactive analysis system which they can download to carry out further specialized analysis of their data.

Overall, it is quite a complex system with about 3 million lines of Java code (including test code) organized as 1100 Java packages and 300 CVS modules. There are 13,000 classes of which 300 are persistent.

Q4. You have chosen to have two separate database systems for Herschel: a relational database for storing and managing processed data products and an object database for storing and managing proposal data, mission planning data, telecommands and raw (unprocessed) telemetry. Is this correct? Why two databases? What is the size of the data expected for each database?

Jon: The processed data products, which account for the vast bulk of the data, are kept in a relational database, which forms part of a common infrastructure shared by many of our ESA science projects. This archive provides a uniform way of accessing data from different missions and performing multi-mission queries. Also, the archive will need to be maintained for years after the Herschel project finishes. The Herschel data in this archive is expected to grow to around 50 TB.

The proposal data, scheduling data, telecommands and raw (unprocessed) telemetry are kept in the Versant object database. This database will only grow to around 2 TB. In fact, back in 2000, we envisaged that all the data would be in the object database, but it soon became apparent that there were enormous benefits from adopting a common approach with other projects.

The object database is very good for the kind of data involved in the uplink chain. This typically requires complex linking of objects and navigational access. The data in the relational archive is stored in the form of FITS files (a format widely used by astronomers) and the appropriate files can be found by querying the meta-data that consist only of a few hundred keywords even for files that can be several gigabytes in size. We need to deliver the data as FITS files, so it makes sense to store it in that form, rather than to convert objects in a database into files in response to each astronomer query. Our interactive analysis system retrieves the files from the archive and converts them back into objects for processing.

Johannes: At present the Science Data Archive contains about 100 TB of data. This includes older versions of the same products that were generated by earlier versions of the pipeline software—we regenerate the products with the latest software and the best current knowledge of instrument calibration about every 6 months. Several years from now, when we will have generated the “best and final” products for the Herschel legacy archive and after we have discarded all the earlier versions of each product, we expect to end up, as Jon says, with an archive volume of about 50 to 100 TB. This is up by two orders of magnitude from the legacy archive of ISO which, given today’s technology, you can carry around in your pocket on a 512 GB external disk drive.

Q5. How do the two databases interact with each other? Are there any similarities with the Gaia data processing system?

Jon: The two database systems are largely separate as they perform rather different roles. The gap is bridged by the data processing pipeline which takes the telemetry from the Versant database, processes it and puts the resulting products in the archive.

We do, in fact, have the ability to store products in the Versant database and this is used by experts within the instrument teams, although for the normal astronomer everything is archive based. There are many kinds of products and in principle new kinds of products can be defined to meet the needs of data processing as the mission progresses. If it were necessary to define a new persistent class for each new product type, we would have a major schema evolution problem. Consequently, product classes are defined by “composition” of a fixed set of building blocks to build an appropriate hierarchical data-structure. These blocks are implemented as a fixed set of persistent classes, allowing new product types to be defined without a schema change.
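
For illustration only, here is a minimal Java sketch of the composition idea, using invented class and method names rather than the actual Herschel CCM API: a generic Product is assembled at runtime from a small, fixed set of persistent building blocks, so defining a new product type never requires a schema change.

import java.util.LinkedHashMap;
import java.util.Map;

// A small, fixed set of persistent building blocks; only these classes
// would need to be persistence capable in the database schema.
abstract class ProductComponent { }

class TableDataset extends ProductComponent {
    // Columns of primitive arrays, keyed by column name.
    private final Map<String, double[]> columns = new LinkedHashMap<>();
    void addColumn(String name, double[] values) { columns.put(name, values); }
}

class MetaData extends ProductComponent {
    private final Map<String, String> keywords = new LinkedHashMap<>();
    void set(String keyword, String value) { keywords.put(keyword, value); }
}

// A generic product is just a named composition of building blocks.
class Product {
    private final MetaData meta = new MetaData();
    private final Map<String, ProductComponent> children = new LinkedHashMap<>();
    Product(String type) { meta.set("TYPE", type); }
    void set(String name, ProductComponent component) { children.put(name, component); }
    MetaData getMeta() { return meta; }
}

class ProductCompositionDemo {
    public static void main(String[] args) {
        // A "new" product type is defined purely by composition.
        Product spectrum = new Product("PointSourceSpectrum");
        TableDataset flux = new TableDataset();
        flux.addColumn("wavelength", new double[] {70.0, 100.0, 160.0});
        flux.addColumn("flux", new double[] {1.2, 3.4, 5.6});
        spectrum.set("fluxTable", flux);
        spectrum.getMeta().set("INSTRUMENT", "PACS");
    }
}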

The Versant database requires that first-class objects in the database are “enhanced” to make them persistence capable. However, we need to use Plain Old Java Objects for the products so that our interactive analysis system does not require a Versant installation. We have solved this by using “product persistor” classes that act as adaptors to make the POJOs persistent.
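
A rough sketch of that adaptor idea, with invented names and without the real Versant enhancement mechanism: the POJO stays free of persistence code, and a separate persistor class copies its state into (and back out of) the database.

// Plain Old Java Object: no persistence code, so the interactive analysis
// system can use it without any Versant installation.
class Image {
    private final int width;
    private final int height;
    private final double[] pixels;
    Image(int width, int height, double[] pixels) {
        this.width = width;
        this.height = height;
        this.pixels = pixels;
    }
    int getWidth() { return width; }
    int getHeight() { return height; }
    double[] getPixels() { return pixels; }
}

// The "product persistor" is the only class that would be made persistence
// capable. It mirrors the POJO's state in fields the database can store and
// rebuilds the POJO on retrieval.
class ImagePersistor {
    private final int width;
    private final int height;
    private final double[] pixels;

    ImagePersistor(Image image) {
        this.width = image.getWidth();
        this.height = image.getHeight();
        this.pixels = image.getPixels();
    }

    Image toProduct() {
        return new Image(width, height, pixels);
    }
}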

Johannes: Concerning Gaia, and here I need to make a point that has little to do with database technology per se but a lot to do with using databases as a tool for different purposes even in the same generic area of science—astronomy in this case—I want to emphasize the differences between Herschel and Gaia rather than the similarities.

Herschel is an observatory, i.e. the satellite collects data from any celestial object which is selected by a user to be observed in a particular observing mode with one or more of Herschel’s three on-board instruments. Each observing mode generates its own, specific set of products that is characteristic of the observing mode rather than characteristic of the celestial object that is observed: A set of products can e.g. consist of a set of 2-dimensional, monochromatic images at infrared wavelengths, from which various false-colour composites can be generated by superposition. In the lingo of this journal: Observing modes are the classes, the data products are the objects. If I observe two different celestial objects in the same observing mode, I can directly compare the results. E.g. the distribution of colours in a false-colour image immediately tells me something about the temperature distribution of the material that is present; the intensity of the colour tells me something about the amount of material of a given temperature. And if I also have a spectrum of the celestial object, this tells me something about the chemical composition of the material and its physical state, such as pressure, density, and temperature.

Gaia, on the other hand, is not an observatory and it does not offer different observing modes that can be requested. It measures the same set of a dozen or so parameters for each and every object time and time again, with some of these parameters being time, brightness in different colours, position relative to other objects that appear in the same snapshot, and radial velocity. But every time it measures these parameters for a particular object it measures them in a different context, i.e. in combination with a different set of other objects that appear in the same snapshot. So you end up with a billion objects, each appearing on a different ensemble of several dozen snapshots, and you have to find the “global best fit” that minimizes the residual errors of a dozen parameters fitted to a billion objects. Computationally, and compared to Herschel, this is a gargantuan task. But it is a single task—derive the mechanical state of an ensemble of a billion objects whose motion is controlled by relativistic celestial mechanics in the presence of a few possible disturbances, such as orbiting planets or unstable stellar states (pulsating or otherwise variable stars). On Herschel, the scientific challenge is somewhat different: We are not so intensely interested in the state of motion of individual objects, we are interested in the chemical composition of matter—much of which does not consist of stars but of clouds of gas and dust which Gaia cannot “see”—and its physical state.

Q6. You have chosen an object database, from Versant, for storing and managing raw (unprocessed) telemetry data. What is the rationale for this choice? How do you map raw data into database objects? What is a typical object database representation of these subsets of “Herschel” data stored in the Versant Object Database? What are the typical operations on such data?

Jon: Back in 2000, we looked at the available object databases. Versant and Objectivity appeared to have the scalability to cope with the amount of data we were envisaging. After a detailed comparison, we chose Versant although I think either would have done the job. At this stage, we still envisaged storing all the product data in the object database.

The uplink objects, such as proposals, observations, telecommands, schedules, etc, are stored directly as objects in the Versant database, as you might expect. The telemetry is a special case because we want to preserve the original data exactly as it arrives in binary packets, so that we can always get back to the raw data. So we have a TelemetryPacket class that encapsulates the binary packet and provides methods for accessing its contents. To support queries, we decode key meta-data from the binary packet when the object is constructed and store it as attributes of the object. This allows querying to retrieve, for example, packets for a given time range for a specified instrument or packets for a particular observation.
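
A simplified sketch of such a class, with an invented packet layout (the real telemetry format differs): the raw bytes are kept untouched, while a few decoded attributes make the packets queryable.

import java.nio.ByteBuffer;

// Encapsulates the binary packet exactly as received, plus decoded
// attributes used for querying in the object database.
class TelemetryPacket {
    private final byte[] rawPacket;   // original binary data, never modified
    private final long obsTime;       // decoded on-board time
    private final int instrumentId;   // decoded instrument identifier
    private final long observationId; // decoded observation identifier

    TelemetryPacket(byte[] rawPacket) {
        this.rawPacket = rawPacket.clone();
        // Hypothetical layout: time in bytes 0-7, instrument in 8-9,
        // observation id in bytes 10-17.
        ByteBuffer buf = ByteBuffer.wrap(this.rawPacket);
        this.obsTime = buf.getLong(0);
        this.instrumentId = buf.getShort(8);
        this.observationId = buf.getLong(10);
    }

    byte[] getRawPacket()   { return rawPacket.clone(); }
    long getObsTime()       { return obsTime; }
    int getInstrumentId()   { return instrumentId; }
    long getObservationId() { return observationId; }
}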

The persistent data model is known as the Core Class Model (CCM). This was developed by starting with a domain model describing the key entities in the problem domain, such as proposals and observations, and then adding methods by analysing the interactions implied by the use-cases. The model then evolved by the introduction of various design patterns and further classes related to the implementation domain.

Q7. How did you specifically use the Versant object database? Are there any specific components of the Versant object database that were (are) crucial for the data storage and management part of the project? If yes, which ones? Were there any technical challenges? If yes, which ones?

Jon: With a project that spans 20 years from the initial concept at the science ground segment level to generation and availability of the final archive of scientific products, it is important to allow for technology becoming obsolete. At the start of the science ground segment development part of the project, a decade ago, we looked at the standards that were then emerging (ODMG and JDO) to see if these could provide a vendor-independent database interface. However, the standards were not sufficiently mature, so we designed our own abstraction layer for the database and the persistent data. That meant that, in principle, we could change to a different database if needed. We tried to only include features that you would reasonably expect from any object database.
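
Such a layer might expose only the generic operations one could reasonably expect from any object database; a minimal, hypothetical sketch of the interface applications would program against:

import java.util.List;

// Vendor-neutral view of the object store. Applications depend only on this
// interface; a single implementation module contains the vendor-specific code.
interface ObjectStoreSession {
    void begin();
    void commit();
    void rollback();
    void persist(Object obj);
    void delete(Object obj);
    // Deliberately restricted query facility, limited to features any
    // object database could be expected to provide.
    <T> List<T> query(Class<T> type, String predicate);
}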

We organized all the persistent classes in the system into a single package. This was important to keep changes to the schema under tight control by ensuring that the schema version was defined by a single deliverable item. Consequently, the CCM classes are responsible for persistence and data integrity, but have little application-specific functionality. The persistent objects can be customized by the applications for example by using the decorator pattern. For various reasons, the persistent classes tend to contain Versant-specific code, so by keeping them all in one package it keeps the vendor-specific code together in one module behind the abstraction layer.
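
Roughly, the persistent class stays thin and the application wraps it to add behaviour; a small sketch in the spirit of that decorator approach, with invented names:

// Persistent class: data and integrity checks only, no application logic.
class Observation {
    private final long id;
    private String status;
    Observation(long id) { this.id = id; this.status = "NEW"; }
    long getId() { return id; }
    String getStatus() { return status; }
    void setStatus(String status) { this.status = status; }
}

// Application-side wrapper adds behaviour without touching the schema.
class SchedulableObservation {
    private final Observation delegate;
    SchedulableObservation(Observation delegate) { this.delegate = delegate; }
    boolean isReadyForScheduling() { return "ACCEPTED".equals(delegate.getStatus()); }
    void markScheduled() { delegate.setStatus("SCHEDULED"); }
}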

We needed to propagate copies of subsets of the data to databases at the three instrument control centres. At the time, there wasn’t an out-of-the-box solution that did what we wanted, so we developed our own data propagation mechanism. This was quite a substantial amount of work. One nice thing is that it is all hidden within our database abstraction layer, so that it is transparent to the applications. We continuously propagate the database to all three instrument teams, and we also propagate it to our test system and backup system.

Another important factor is backup and recovery and it makes a big difference how you organize the data. We have an uplink node and a set of telemetry nodes. There is a current telemetry node into which new data is ingested. All the older telemetry nodes are read-only as the data never changes, which means you only have to back them up once. The abstraction layer is able to hide the mapping of objects onto databases.
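
A sketch of how that mapping could be hidden behind the abstraction layer, with an invented time-based partitioning rule: reads fan out to every telemetry database covering the requested range, while new packets always go to the single writable node.

import java.util.ArrayList;
import java.util.List;

// Maps time ranges onto the telemetry databases that hold them. Older nodes
// are read-only (backed up once); only the current node accepts new data.
class TelemetryNodeRouter {
    static class Node {
        final String databaseName;
        final long startTime;  // inclusive, on-board time
        final long endTime;    // exclusive; Long.MAX_VALUE for the current node
        final boolean readOnly;
        Node(String databaseName, long startTime, long endTime, boolean readOnly) {
            this.databaseName = databaseName;
            this.startTime = startTime;
            this.endTime = endTime;
            this.readOnly = readOnly;
        }
    }

    private final List<Node> nodes = new ArrayList<>();

    void addNode(Node node) { nodes.add(node); }

    // All nodes overlapping the requested time range must be queried.
    List<Node> nodesFor(long from, long to) {
        List<Node> result = new ArrayList<>();
        for (Node node : nodes) {
            if (node.startTime < to && node.endTime > from) {
                result.add(node);
            }
        }
        return result;
    }

    // New packets are ingested into the single writable (current) node.
    Node ingestNode() {
        for (Node node : nodes) {
            if (!node.readOnly) {
                return node;
            }
        }
        throw new IllegalStateException("no writable telemetry node configured");
    }
}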

Q8. Is scalability an important factor in your application?

Jon: In the early stages, when we were considering putting all the data into the object database, scalability was a very important issue. It became much less of an issue when we decided to store the products in the common archive.

Q9. Do you have specific time requirements for handling data objects?

Jon: We need to ingest the 1.5 GB of telemetry each day and propagate it to the instrument sites. In general the performance of the Versant database is more than adequate. It is also important that we can plan (and if necessary replan) the observations for an operational day within a few hours, although this does not pose a problem in terms of database performance.

Q10. By the end of the four-year life cycle of “Herschel” you are expecting a minimum of 50 terabytes of data, which will be available to the scientific community. How do you plan to store, maintain and process this amount of data?

Johannes: At the European Space Astronomy Centre (ESAC) in Madrid we keep the archives of all of ESA’s astrophysics missions, and a few years ago we started to also import and organise into archives the data of the planetary missions that deal with observations of the sun, planets, comets and asteroids. The user interface to all these archives is basically the same. This generates a kind of “corporate identity” of these archives and if you have used one you know how to use all of them.

For many years, the Science Archive Team at ESAC has played a leading role in Europe in the area of “Virtual Observatories”. If you are looking for some specific information on a particular astronomical object, chances are good, and they are getting better by the day, that some archive in the world already has this information. ESA’s archives, and amongst them the Herschel legacy archive once it has been built, are accessible through this virtual observatory from practically anywhere in the world.

Q11. You are basically halfway through the life cycle of “Herschel”. What are the main lessons learned so far from the project? What are the next steps planned for the project, and the main technical challenges ahead?

Johannes: We are very lucky to be operating this world-class observatory with only minor problems in the satellite hardware, all of which are well under control, and without doubt the more than 1,300 individuals who are listed on a web page as having made major contributions to the success of this project have every reason to be proud.

We are approximately half way through the in-orbit life time of the mission after launch. But this is not the same as saying that we have reaped half of the science results from the mission.

For the first 4 months of the mission, i.e. until mid-September 2009, the satellite and the ground segment underwent a checkout, a commissioning and a performance verification phase. Through a large number of “engineering” and “calibration” observations in these initial phases we ensured that the observing modes we had planned to provide and had advertised to our users worked in principle, that the telescope pointed in the right direction, that the temperatures did not drift beyond specification, etc. From mid-September to about the end of 2009, we performed an increasing number of scientific observations of astronomical targets in what we called the “Science Demonstration Phase”. These observations were “the proof of the pudding” and showed that, indeed, the astronomers could do the science they had intended to do with the data they were getting from Herschel: More than 250 scientific papers were published in mid-2010 that are based on observations made during this Science Demonstration Phase.

We have been in the “Routine Science Mission Phase” for about one year now, and we expect to remain in this phase for at least another year and a half; in other words, we have collected perhaps 40% of the scientific data Herschel will eventually have collected when the liquid helium coolant runs out. Already now we can see that Herschel will revolutionize some of the currently accepted theories and concepts of how and where stars are born, how they enrich the interstellar medium with elements heavier than helium when they die, and many other areas, such as how the birth rate of stars has changed over the age of the universe, which is over 13 billion years old.

But, most importantly, we need to realise and remember that sustained progress in science does not come about on short time scales. The initial results that have already been published are spectacular, but they are only the tip of the iceberg: they are the results that stare you in the face. Many more results, which together will change our view of the cosmos as profoundly as the cream-of-the-coffee results we already see now, will only be found after years of archival research, by astronomers who plough through vast amounts of archive data in search of new, exciting similarities between seemingly different types of objects and stunning differences between objects that had been classified as being of the same type. This kind of knowledge will accumulate from Herschel data for many years beyond the end of the Herschel in-orbit lifetime. ESA will support this slower but more sustained scientific progress through archival research by keeping together a team of about half a dozen software experts and several instrument specialists for up to 5 years after the end of the in-orbit mission. In addition, members of the astronomical community will, as they have done on previous missions, contribute to the archive their own, interactively improved products, which extract features from the data that no automatic pipeline process can extract, no matter how cleverly it has been set up.

I believe it is fair to say that no-one yet knows the challenges the software developers and the software users will face.
But I do expect that a lot more problems will arise from the scientific interpretation of the data than from the software technologies we use, which are up to the task that is at hand.
—————————

Dr. Johannes Riedinger, Herschel Mission Manager, European Space Agency.
Johannes Riedinger has a background in Mathematics and Astronomy and has worked on space projects since the start of his PhD thesis in 1980, when he joined a team at the Max-Planck-Institut für Astronomie in Heidelberg developing an instrument for a Space Shuttle payload. Working on a successor project, the ISO Photo-Polarimeter ISOPHOT, he joined industry from 1985-1988 as the System Engineer for the phase B study of this payload. Johannes joined the European Space Agency in 1988 and, having contributed to the implementation of the Science Ground Segments of the Infrared Space Observatory (ISO, launched in 1995) and the X-ray Multi Mirror telescope (XMM-Newton, launched in 1999), he became the Herschel Science Centre Development Manager in 2000. Following successful commissioning of the satellite and ground segment, he became Herschel Mission Manager in 2009.

Dr. Jon Brumfitt, System Architect of Herschel Scientific Ground Segment, European Space Agency.
Jon Brumfitt has a background in Electronics with Physics and Mathematics and has worked on several of ESA’s astrophysics missions, including IUE, Hipparcos, ISO, XMM and currently Herschel. After completing his PhD and a post-doctoral fellowship in image processing, Jon worked on data reduction for the IUE satellite before joining Logica Space and Defence in 1980. In 1984 he moved to Logica’s research centre in Cambridge and then in 1993 to ESTEC in the Netherlands to work on the scientific ground segments for ISO and XMM. In January 2000, he joined the newly formed Herschel team as ground segment System Architect. As Herschel approached launch, he moved down to the European Space Astronomy Centre in Madrid to become part of the Herschel Science Operations Team.

For Additional Reading

“Herschel” telescope

[1] Data From Outer Space.
Versant.
This short paper describes a case study: the handling of telemetry data and information of the “Herschel” telescope from outer space with the Versant object database. The telescope was launched by the European Space Agency (ESA) with the Ariane 5 rocket on 14 May 2009.
Paper | Introductory | English | LINK to DOWNLOAD (PDF)| 2010|

[2] ESA Web site for “Herschel”

[3] Additional Links
SPECIALS/Herschel
HERSCHEL: Exploring the formation of galaxies and stars
ESA latest_news
ESA Press Releases

Related Post:

Objects in Space.
–The biggest data processing challenge to date in astronomy: The Gaia mission.–

Robert Greene on “New and Old Data stores” http://www.odbms.org/blog/2010/12/robert-greene-on-new-and-old-data-stores/ http://www.odbms.org/blog/2010/12/robert-greene-on-new-and-old-data-stores/#comments Thu, 02 Dec 2010 07:46:01 +0000 http://www.odbms.org/blog/?p=430 I am back covering the topic “New and Old Data stores”.
I asked several questions to Robert Greene, CTO and V.P. Open Source Operations at Versant.

Q1. Traditionally, the obvious platform for most database applications has been a relational DBMS. Why do we need new Data Stores?

Robert Greene: Well, it’s a question of innovation in the face of need. When relational databases were invented, applications and their models were simpler, data volumes were smaller and there were fewer concurrent users. There was no internet, no wireless devices, no global information systems. In the mid-90s, even Larry Ellison stated that complexly related information, at the time largely in niche application areas like CAD, did not fit well with the relational model. Now, complexity is pervasive in nearly all applications.

Further, the relational model is based on a runtime relationship execution engine, re-calculating relations based on primary-key, foreign-key data associations even though the vast majority of data relationships remain fixed once established. When data continues to grow at enormous rates, the approach of re-calculating the relations becomes impractical. Today even normal applications start to see data at sizes which in the past were only seen in data warehousing solutions, the first data management space which embraced a non-relational approach to data management.

So, in a generation when millions of users are accessing applications linked to near real-time analytic algorithms, at times operating over terabytes of data, innovation must occur to deal with these new realities.

Q2. There has recently been a proliferation of “new data stores”, such as “document stores” and “NoSQL databases”: what are the differences between them?

Robert Greene: The answer to this could require a book, but let’s try to distil it into the fundamentals.

I think the biggest difference is the programming model. There is some overlap, so you don’t see clear distinctions, but for each type: object database, distributed file systems, key-value stores, document stores and graph stores, the manner in which the user stores and retrieves data varies considerably. The OODB uses language integration, the distributed file systems use map-reduce, key-value stores use data keys, document stores use keys and query based on indexed meta data overlay, graph stores use a navigational expression language. I think it is important to point out that “store” is probably a more appropriate label than “database” for many of these technologies as most do not implement the classical ACID requirements defined for a database.

Beyond programming model, these technologies vary considerably in architecture, how they actually store data, retrieve it from disk, facilitate backup, recovery, reliability, replication, etc.

Q3. How do new data stores compare with relational databases?

Robert Greene: As described above, they have a very different programming model from the RDB. In some ways they are all subsets of the RDB, but their specialization allows them to do what they do (at times) better than the RDB.

Most of them are utilizing an underlying architecture which I call “the oldest scalability architecture of the relational database”: the key-value/blob architecture. The RDB has long suffered performance problems at scale, and historically many architects have gotten around those issues by removing the JOIN operation from the implementation. They manage identity from the application space and store information in single tables and/or blobs of isolatable information. This comparison is obvious for key-value stores. However, you can also see this approach in the document store, which is storing its information as key-JSON objects. The keys to those documents (JSON blob objects) must be managed by user-implemented layers in the application space. Try to implement a basic collection reference and you will find yourself writing lots of custom code. Of course, JSON objects also have meta data which can be extracted and indexed, allowing document stores to provide better ways of finding data, but the underlying architecture is key-value.
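
To make “managing identity in the application space” concrete, here is a rough Java illustration with an in-memory map standing in for a key-value or document store; the classes and key formats are invented, and in a real document store the child keys would sit in a JSON array inside the parent document.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Stand-in for a key-value / document store: opaque blobs under string keys.
class BlobStore {
    private final Map<String, String> data = new HashMap<>();
    void put(String key, String blob) { data.put(key, blob); }
    String get(String key) { return data.get(key); }
}

// A "collection reference" has to be built by hand: the parent record only
// holds child keys, and the application resolves them itself.
class OrderRepository {
    private final BlobStore store = new BlobStore();

    String saveOrder(List<String> lineItemDocs) {
        StringBuilder childKeys = new StringBuilder();
        for (String item : lineItemDocs) {
            String key = "item:" + UUID.randomUUID();
            store.put(key, item);
            if (childKeys.length() > 0) childKeys.append(',');
            childKeys.append(key);
        }
        String orderKey = "order:" + UUID.randomUUID();
        // The relationship is just a list of keys embedded in the parent blob,
        // not a JOIN the database would recalculate for us.
        store.put(orderKey, childKeys.toString());
        return orderKey;
    }

    List<String> loadLineItems(String orderKey) {
        List<String> items = new ArrayList<>();
        String keys = store.get(orderKey);
        if (keys == null || keys.isEmpty()) {
            return items;
        }
        for (String key : keys.split(",")) {
            items.add(store.get(key)); // application-level "join"
        }
        return items;
    }
}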

Q4. Systems such as CouchDB, MongoDB, SimpleDB, Voldemort, Scalaris, etc. provide less functionality than OODBs and are little more than a distributed “object” cache over multiple machines. How do these new data stores compare with object-oriented databases?

Robert Greene: They compare similarly in that they achieve better scalability than the RDB by utilizing identity management in the application layer, similar to the way it is done with the object database. However, the approach is much more exposed to the developer, because for those NoSQL stores the management of identity is not integrated into the language constructs and abstracted away from the user API as it is with the object database. Plus, there is a big difference in the delivery of the ACID properties of a database. The NoSQL databases are almost exclusively non-transactional unless you use them in only the narrowest of use cases.

Q5. With the emergence of cloud computing, new data management systems have surfaced. What, in your opinion, is the direction in which cloud computing data management is evolving? What are the main challenges of cloud computing data management?

Robert Greene: Unquestionably, the world is moving to a platform-as-a-service (PaaS) computing model. Databases will play a role in this transition in all forms. The challenges in delivering data management technology which is effective in these “cloud” computing architectures turn out to be very similar to those of effectively delivering technology for the new n-core chip architectures. They are challenges related to distributed data management, whether it is across machines or across cores: splitting the problem into pieces and managing the distributed execution in the face of concurrent updates. Then there is the often overlooked operational element: how to effectively develop, debug, manage and administer the production deployments of this technology within distributed computing environments.

Q6. What are cloud stores omitting that enables them to scale so well?

Robert Greene: I think architecture plays the biggest role in their ability to scale. It is the application-managed-identity approach to data retrieval, data distribution and semi-static data relations. These are things they actually have in common with object databases, which, incidentally, you also find in some of the world’s largest, most demanding application domains. I think that is the biggest scalability story for those technologies. If you look past architecture, then it comes down to some of the sacrifices made in the area of fully supporting the ACID requirements of a database. Taking the “eventually consistent” approach in some cases makes a tremendous amount of sense, if you can afford probabilistic results instead of determinism.

Q7. Will cloud store projects end up with support for declarative queries and declarative secondary keys?

Robert Greene: I am sure you will see this, as virtually all database technologies that remain relevant will live in the cloud.

Q8. In his post, titled “The “NoSQL” Discussion has Nothing to Do With SQL”, Prof. Stonebraker argues that “blinding performance depends on removing overhead. Such overhead has nothing to do with SQL, but instead revolves around traditional implementations of ACID transactions, multi-threading, and disk management. To go wildly faster, one must remove all four sources of overhead, discussed above. This is possible in either a SQL context or some other context.” What is your opinion on this?

Robert Greene: I agree with the theory. Reality, though, does introduce some practical limitations during implementation. Technology is doing a remarkable job of removing those bottlenecks. For example, you can now get non-volatile memory appliances which are 5 TB in size, effectively eliminating disk I/O as what was historically the #1 bottleneck in database systems. Still, architecture will continue to play the strongest role in performance and scalability in the future. Relational databases and other implementations which need to calculate relationships at runtime based on data values over growing volumes of data will remain performance challenged.

Q9. Some progress has also been made on RDBMS scalability. For example, Oracle RAC and MySQL Cluster provide some partitioning of load over multiple nodes. More recently, there are new scalable variations of MySQL underway with ScaleDB and Drizzle, and VoltDB is expected to provide scalability on top of a more performant in-memory RDBMS with minimal overhead. Typically you cannot scale well if your SQL operations span many nodes. And you cannot scale well if your transactions span many nodes.
Will RDBMSs provide scalability to 100 nodes or more? And if yes, how?

Robert Greene: Yes, of course; they already do in vendors like Netezza, Greenplum and AsterData. The question is whether they will perform well in the face of those scalability requirements. This distinction between performance and scalability is often overlooked.

However, I think this notion that you cannot scale well if your transactions span many nodes is nonsense. It is a question of implementation. Just because a database has 100 nodes does not mean that all transactions will operate on data within those 100 nodes. Transactions will naturally partition and span some percentage of nodes, especially with regard to relevant data. Access in a multi-node system can be parallelized in all aspects of a transaction. Further, at a commit boundary, the overwhelming case is that the number of nodes involved where data is inserted, changed, deleted and/or logically dependent is some small fraction of all the physical nodes in the system. Therefore, advanced 2-phase commit protocols can do interesting things like rolling back non-active nodes, parallelizing protocol handshaking and using asynchronous I/O and handshaking to finalize the commit. Is it complicated? Yes. But is it too complicated to work? Not by a long shot.
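
As a rough illustration of that point, the sketch below shows a drastically simplified two-phase commit coordinator that only contacts the nodes a transaction actually touched; the interfaces are invented, and a real protocol would add durable logging, timeouts, recovery and the parallel, asynchronous handshaking mentioned above.

import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// One shard / physical node taking part in distributed transactions.
interface Participant {
    String nodeId();
    boolean prepare(long txId);  // phase 1: vote to commit
    void commit(long txId);      // phase 2a
    void rollback(long txId);    // phase 2b
}

class TwoPhaseCoordinator {
    private final List<Participant> allNodes;

    TwoPhaseCoordinator(List<Participant> allNodes) {
        this.allNodes = allNodes;
    }

    // Only the nodes whose data was actually read or written take part in
    // the protocol; the rest of the cluster is never contacted.
    boolean commit(long txId, Set<String> touchedNodeIds) {
        List<Participant> involved = allNodes.stream()
                .filter(p -> touchedNodeIds.contains(p.nodeId()))
                .collect(Collectors.toList());

        // Phase 1: every involved node must vote yes (could run in parallel).
        for (Participant p : involved) {
            if (!p.prepare(txId)) {
                involved.forEach(q -> q.rollback(txId));
                return false;
            }
        }
        // Phase 2: finalize (could use asynchronous I/O and handshaking).
        involved.forEach(p -> p.commit(txId));
        return true;
    }
}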

Q10. There are also XML DBs, which go beyond relational. Hybridization with relational has turned out to be very useful. For example, DB2 has a huge investment in XML, it is extensively published, and it has also succeeded commercially. MonetDB did substantial work in that area early on as well. How do they relate to “new data stores”?

Robert Greene: I really look at XML databases as large index engines. I have seen implementations of these which look very much like document stores, the main difference being that they are generally indexing everything, whereas the document stores appear to be much more selective about the meta data exposed for indexing and query. Still, I think the challenge for XML DBs is the mismatch between XML and the programming paradigm. Developers think of XML as data interchange and transformation technology. It is not perceived as transactional data management and storage, and developers don’t program in XML, so it feels clunky for them to figure out how to wrap it into their logical transactions. I suspect it feels a little less clunky if what you are dealing with are documents. Perhaps they should be considered the original document stores.

Q11. Choosing a solution: Given this confusing array of alternatives, which data storage system do you choose for your application?

Robert Greene: I choose the right tool for the job. This is again one of those questions which deserves several books. There is no one best solution for all applications and the deciding factors can be complicated, but here are the major influencing factors I think about. I look at it from the perspective of whether the application is data driven or model driven.

If it is model driven, I lean towards ODB or RDB.
If it is data driven, I lean towards NoSQL or RDB.

If the project is model driven and has a complex known model, ODB is a good choice because it handles the complexity well. If the project is model driven and has a simple known model, RDB is a good choice, because you should not be performance penalized if the model is simple, there are lots of available choices and people who know how to use the technology.

If the project is data driven and the data is small, RDB is good for the prior reasons. If the project is data driven and the data is huge, then NoSQL is a good choice because it takes a better architectural approach to huge data allowing the use of things like map reduce for parallel processing and/or application managed identity for better data distribution.

Of course, even within these categorizations you have ranges of value in different products. For example, MySQL and Oracle are both RDBs, so which one to choose? Similarly, db4o and Versant are both ODBs, so which one should you choose? So, I also look at the selection process from the perspective of two additional requirements: data volume and concurrency. Within a given category, these will help narrow in on a good choice. For example, if you look at the company Oracle, you naturally consider that MySQL is less data scalable and less concurrent than the Oracle database, yet they are both RDBs. Similarly, if you look at the company Versant, you would consider db4o to be less data scalable and less concurrent than the Versant database, yet they are both ODBs.

Finally, I say you should test and evaluate any selection within the context of your major requirements. Get the core use cases mocked up and put your top choices to the test; it is the only way to be sure.

Object Database Technologies and Data Management in the Cloud. http://www.odbms.org/blog/2010/09/object-database-technologies-and-data-management-in-the-cloud/ http://www.odbms.org/blog/2010/09/object-database-technologies-and-data-management-in-the-cloud/#comments Mon, 27 Sep 2010 08:04:21 +0000 http://www.odbms.org/blog/?p=368 One of our experts, Dr. Michael Grossniklaus, has recently been awarded a grant by the Swiss National Science Foundation (SNF) for a fellowship as an advanced researcher in David Maier’s group at Portland State University. There, he will be investigating the use of object database technology for cloud data management.
I asked Michael to elaborate on his research plan and share it with our ODBMS.ORG community.

Q1. People from different fields have slightly different definitions of the term Cloud Computing. What is the common denominator of most of these definitions?

MG: Many of the differences stem from the fact that people use the term Cloud Computing both to denote a vision at the conceptual level and technologies at the implementation level. A nice collection of no less than twenty-one definitions can be found here.
In terms of vision, the common denominator of most definitions is to look at processing power, storage and software as commodities that are readily available from large infrastructures. As a consequence, cloud computing unifies elements of distributed, grid, utility and autonomic computing. The term elastic computing is also often used in this context to describe the ability of cloud computing to cope with bursts or spikes in the demand of resources on an on-demand basis. As for technologies, there is a consensus that cloud computing corresponds to a service-oriented stack that provides computing resources at different levels. Again, there are many variants of cloud computing stacks, but the trend seems to go towards three layers. At the lowest level, Infrastructure-as-a-Service (IaaS) offers resources such as processing power or storage as a service. One level above, Platform-as-a-Service (PaaS) provides development tools to build applications based on the service provider’s API. Finally, on the top-most level, Software-as-a-Service (SaaS) describes the model of deploying applications to clients on demand.

Q2. With the emergence of cloud computing, new data management systems have surfaced. Why?

MG: I see new data management systems such as NoSQL databases and MapReduce systems mainly as a reaction to the way in which cloud computing provides scalability. In cloud computing, more processing power typically translates to more (cheap, shared-nothing) computing nodes, rather than migrating or upgrading to better hardware. Therefore, cloud computing applications need to be parallelizable in order to scale. Both NoSQL and MapReduce advocate simplicity in terms of data models and data processing, in order to provide light-weight and fault-tolerant frameworks that support automatic parallelization and distribution.
In comparison to existing parallel and distributed (relational) databases, however, many established data management concepts, such as data independence, declarative query languages, algebraic optimization and transactional data processing, are often omitted. As a consequence, more weight is put on the shoulders of application developers, who now face new challenges and responsibilities. Acknowledging that the initial vision was maybe too simple, there is already a trend of extending MapReduce systems with established data management concepts. Yahoo’s PigLatin and Microsoft’s Dryad have introduced a (near-)relational algebra and Facebook’s HIVE supports SQL, to name only a few examples. In this sense, cloud computing has triggered a “reboot” of data management systems by starting from a very simple paradigm and adding classical features back in whenever they are required.
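
To make the simplicity of the paradigm concrete: from the developer’s point of view the whole processing model is a pair of functions. Below is a minimal, single-machine word-count sketch in plain Java (9+); the real frameworks supply the distribution, shuffling and fault tolerance around these two functions.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MiniMapReduce {
    // map: one input record -> zero or more (key, value) pairs
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // reduce: one key and all its values -> an aggregated result
    static int reduce(String key, List<Integer> counts) {
        return counts.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<String> input = List.of("to be or not to be", "to see or not");
        // The "shuffle" step, done here in memory: group values by key.
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (String line : input) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        grouped.forEach((word, counts) ->
                System.out.println(word + " = " + reduce(word, counts)));
    }
}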

Q3. What, in your opinion, is the direction in which cloud computing data management is evolving? What are the main challenges of cloud computing data management?

MG: Data management in cloud computing will take place on a massively parallel and widely distributed scale. Based on these characteristics, several people have argued that cloud data management is more suitable for analytic rather than transactional data processing. Applications that need mostly read-only access to data and perform updates in batch mode are, therefore, expected to profit the most from cloud computing. At the same time, analytical data processing is gaining importance both in industry in terms of market shares and in academia through novel fields of application, such as computational science and e-science. Furthermore, from the classical data management concepts mentioned above, ACID transactions is the notable exception since, so far, nobody has proposed to extend MapReduce systems with transactional data processing. This might be another indication that cloud data management is evolving into the direction of analytical data processing.
At the time of answering these questions, I see three main challenges for data management in cloud computing: massively parallel and widely distributed data storage and processing, integration of novel data processing paradigms as well as the provision of service-based interfaces. The first challenge has been identified many times and is a direct consequence of the very nature of cloud computing. The second challenge is to build a comprehensive data processing platform by integrating novel paradigms with existing database technology. Often cited paradigms include data stream processing systems, service-based data processing or the above-mentioned NoSQL databases and MapReduce systems. Finally, the third challenge is to provide service-based interfaces for this new data processing platform in order to expose the platform itself as a service in the cloud, which is also referred to as “Database-as-a-Service” (DaaS) or “Cloud Data Services”.

Q4. What is the impact of cloud computing on data management research so far?

MG: Most of the challenges mentioned above are already being addressed in some way by the database research community. In particular, parallel and distributed data management is a well-established field of research, which has contributed many results that are strongly related to cloud data management. Research in this area investigates whether and how existing parallel and distributed databases can scale up to the level of parallelism and distribution that is characteristic of cloud computing. While this approach is more “top down”, there is also the “bottom up” approach of starting with an already highly parallel and widely distributed system and extending it with classical database functionality. This second approach has led to the extended MapReduce systems that were mentioned before. While these extended approaches already partially address the second challenge of cloud data management—integrating novel data processing paradigms—there are also research results that take this integration even further, such as HadoopDB and Clustera. The third challenge is being addressed as part of the research on programmability of cloud data services in terms of languages, interfaces and development models.
The impact of cloud computing on data management research is also visible in recent calls for papers of both established and emerging workshops and conferences. Furthermore, there are several additional initiatives dedicated to supporting cloud data management research. For example, the MSR Summer Institute 2010 held at the University of Washington brought together a number of database researchers to discuss the current challenges and opportunities of cloud data services.

Q5. In your opinion, is there a relationship between cloud computing and object database technologies? If yes, please explain.

MG: Yes, there are multiple connections between cloud data management and object database technology which relate to all of the previously mentioned challenges. According to a recent article in Information Week, businesses are likely to split their data management into (transactional) in-house and (analytical) cloud data processing. This requirement corresponds to the first challenge of supporting highly parallel and widely distributed data processing. In this setting, objects and relationships could prove to be a valuable abstraction to bridge the gap between the two partitions.
Introducing the concept of objects in cloud data management systems also makes sense from the perspective of addressing the second challenge of integrating different data processing paradigms. One advantage of MapReduce is that it can cast base data into different implicit models. The associated disadvantage is that the data model is constructed on the fly and, thus, type checking is only possible to a limited extent. To support typing of MapReduce queries, the same base data instances could be exposed using different object wrappers. Microsoft has recently proposed “Orleans”, a next-generation programming model for cloud computing that features a higher level of abstraction than MapReduce. In order to integrate different processing paradigms, Orleans introduces the notion of “grains” that serve as a unit of computation and data storage.
Finally, object database technologies can also contribute to addressing the third challenge, i.e. providing service-based interfaces for cloud data management. Since object data models and service-oriented interfaces are closely related, it makes a lot of sense to consider object database technology, rather than introducing additional mapping layers. The concept of orthogonal persistence, which is an essential feature of most recent object databases, is particularly relevant in this context. In their ICOODB 2009 paper, Dearle et al. have suggested that orthogonal persistence could be extended in order to simplify the development of cloud applications. Instead of only abstracting from the storage hierarchy, this extended orthogonal persistence would also abstract from replication and physical location, giving transparent access to distributed objects. Even though Orleans is built on top of the Windows Azure Platform that provides a relational database (SQL Azure), the vision of grains is to support transparent replication, consistency and persistence.

Q6. Do you know of any application domains where object database technologies are already used in the Cloud?

MG: Of the major object database vendors, I am only aware of Objectivity having a version of its product that is ready to be deployed on cloud infrastructures such as Amazon EC2 and GoGrid. However, I have not yet seen any concrete case study showing how their clients are using this product. This being said, it might be interesting to point out that many of the applications that are currently deployed using object databases are very close to the envisioned use case of cloud data management. For example, Objectivity has been applied in the Space Situational Awareness Foundational Enterprise (SSAFE) system and in several data-intensive science applications, for example at the Stanford Linear Accelerator Center (SLAC). Similarly, the European Space Agency (ESA) has chosen Versant to gather and analyze the data transmitted by the Herschel telescope. All of these applications deal with large or even huge amounts of data and require analytical data processing in the sense that was described before.

Q7. What issues would you recommend as a researcher to tackle to go beyond the current state of the art in cloud computing data management?

MG: There is ample opportunity to tackle interesting and important issues along the lines of all three challenges mentioned before. However, if we abstract even more, there are two general research areas that will need to be tackled in order to deliver the vision of cloud data management.
The first area addresses research questions “under the hood”, for example: How can existing parallel and distributed databases scale up to the level of cloud computing? What traditional database functionality is required in the context of cloud data management and how can it be supported? How can traditional databases be combined with other data processing paradigms such as MapReduce or data stream processing? What architectures will lead to fast and scalable data processing systems? The second important area is how cloud data services are provided to clients and, thus, the following research questions are situated “on the hood”: What interfaces should be offered by cloud data services? Do we still need declarative query languages or is a procedural interface the way to go? Is there even a need for entirely new programming models? Can cloud computing be made independent of or orthogonal to the development of the application business logic? How are cloud data management applications tested, deployed and debugged? Are existing database benchmarks sufficient to evaluate cloud data services or do we need new ones?
Of course, these lists of research questions are not exhaustive and merely highlight some of the challenges. Nevertheless, I believe that in answering these questions, one should always keep an eye on recent and also not-so-recent contributions from object databases. As outlined above, many developments in cloud data services have introduced some kind of object notion and, therefore, contributions from object databases can serve two purposes. On the one hand, technologies such as orthogonal persistence can serve as valuable starting points and inspiration for novel developments. On the other hand, we should also learn from previous approaches in order not to reinvent the wheel and not to repeat some of the mistakes that were made before.

Acknowledgement
Michael Grossniklaus would like to thank Moira C. Norrie, David Maier, Bill Howe and Alan Dearle for interesting discussions on this topic and the valuable exchange of ideas.

Michael Grossniklaus
Michael received his doctorate in computer science from ETH Zurich in 2007. His PhD thesis examined how object data models can be extended with versioning to support context-aware data management. In addition to conducting research, Michael has been involved in several courses as a lecturer. Together with Moira C. Norrie, he developed a course on object databases for advanced students which he taught for several years. Currently, Michael is a senior researcher at the Politecnico di Milano, where he both contributes to the “Search Computing” project and works on reasoning over data streams. He has recently been awarded a grant by the Swiss National Science Foundation (SNF) for a fellowship as an advanced researcher in David Maier’s group at Portland State University, where he will be investigating the use of object database technology for cloud data management.

the OO7J benchmark. http://www.odbms.org/blog/2010/08/the-oo7j-benchmark/ http://www.odbms.org/blog/2010/08/the-oo7j-benchmark/#comments Thu, 05 Aug 2010 08:35:38 +0000 http://www.odbms.org/blog/?p=326 I just published a very interesting resource in ODBMS.ORG, the dissertation of Pieter van Zyl, from the University of Pretoria.

The title of the dissertation is: “Performance investigation into selected object persistence stores” and presents the OO7J benchmark.

OO7J is a Java version of the original OO7 benchmark (written in C++) from Mike Carey, David DeWitt and Jeff Naughton at the University of Wisconsin-Madison. The original benchmark tested ODBMS performance.

OO7J also includes benchmarking ORM Tools. Currently there are implementations for Hibernate on PostgreSQL and MySQL, db4o and Versant.

You can download the dissertation (187 pages PDF) : LINK

The code is available on Sourceforge: LINK

RVZ

Versant acquired the assets of the database software business of privately-held Servo Software, Inc. (formerly db4objects, Inc.) http://www.odbms.org/blog/2008/12/versant-acquired-assets-of-database_22/ http://www.odbms.org/blog/2008/12/versant-acquired-assets-of-database_22/#comments Mon, 22 Dec 2008 07:22:00 +0000 http://www.odbms.org/odbmsblog/2008/12/22/versant-acquired-the-assets-of-the-database-software-business-of-privately-held-servo-software-inc-formerly-db4objects-inc/ You probably noticed some news in the object database market: on December 1, 2008 “Versant acquired the assets of the database software business of privately-held Servo Software, Inc. (formerly db4objects, Inc.)”.

What’s the meaning of this acquisition? I asked a few questions to Robert Greene, who is responsible for defining Versant’s overall object database strategy…

Q1. What’s the meaning of this acquisition for Versant? db4o is an open source object database, but Versant had no open source strategy until now.

[RCG] This acquisition recognizes the value the db4objects team created by bringing the relevance of object database technology in the software development toolkit to the attention of software developers.

Incidentally, this is not Versant’s first initiative in the open source space. In 2006, Versant open sourced a JDO/JPA-based ORM driver and initiated an open source JPA project within Eclipse, at the time known as Eclipse JSR220-ORM. Eclipse managed to use this project to get Oracle to contribute a similar open source project. In the end, both projects merged into what is now the Eclipse Dali project, and Oracle became the project lead.

This open source activity by Versant was aimed at making developers more aware of object based transparent persistence and fostering such an API approach in their development. We view this as a tremendous success, as now a substantial portion of the Java community uses Hibernate (or TopLink) and Eclipse Dali to develop applications.

Those ORM APIs, which have flourished since the early 2002 timeframe, are in essence the Versant database APIs that have existed in our object database technology since the mid-90s. It was an ex-Versant product manager who went to Sun and drove those standards through the Java JSR process. Ultimately, it was open source Hibernate’s flavor which gained the most acceptance, but the similarity of the approach is undeniable.

Due to the power of open source, anyone who knows ORM technology has, in essence, become an expert in the use of object databases. They can simply get rid of the mapping portion of the ORM work, and then everything else is nearly the same as long as they point their connections to an object database. In fact, Versant plans to release a compatibility version for Eclipse Dali.

Q2. Will you keep db4o as a separate product or will you merge it into Versant Object Database?

[RCG] Versant plans to continue to operate db4o in the same manner, continuing to foster the community and improve the technology in the traditional open source fashion. It will remain a separate product.

Q3. How do you plan to manage/support the db4o open source community?

[RCG] One of the nice things about db4o is the extended community of supporters it has developed over the years. Versant plans to simply join that community, following the same open form which has worked for db4o in the past. That being said, Versant has a long history and extended expertise in the OODB technology space. In that regard, we have opened our technology stack to the db4o core team and, where it makes good technology sense, we can contribute significant pieces of functionality that would otherwise take a long time to create.

Q4. db4o is targeting the embedded device market. Is this a market for Versant as well?

[RCG] Versant technology has had many successes in the embedded space. However, our real commercial success comes from the many large-scale systems developed using our technology to overcome limitations in traditional database systems. So, in this regard, db4o will dominate the embedded side of the Versant business, and the Versant commercial object database will exist to help those who want the simplicity of the OODB programming model but require greater scaling capabilities.

Q5. Are there going to be any changes in the db4o business model?

[RCG] No. The db4o brand will continue to offer the dual licensing model common to open source businesses, along with professional levels of subscription based support.

TechView Product Reports http://www.odbms.org/blog/2008/12/techview-product-reports/ http://www.odbms.org/blog/2008/12/techview-product-reports/#comments Sat, 20 Dec 2008 00:54:00 +0000 http://www.odbms.org/odbmsblog/2008/12/20/techview-product-reports/ Most of the time it is difficult to gather good technical information on products, without marketing or sales hype.

I therefore decided to create a series of product reports on some of the leading Object Database Systems around.

For that, I have prepared 23 questions which I sent to four vendors: db4objects, Objectivity, Inc., Progress Software and Versant Corporation.
I asked them for detailed information on their products, such as: Support of Programming Languages, Queries, Data Modeling, Integration with relational data, Transactions, Persistence, Storage, Architecture, Applications, and Performance.

The result is four TechView Product Reports, which contain detailed, useful information on the respective products:
– db4o
– Objectivity/DB
– ObjectStore
– Versant Object Database

I hope these will be useful resources for developers and architects alike.
As always you can freely download the reports.
