Big Data from Space: the “Herschel” telescope.
” One of the biggest challenges with any project of such a long duration is coping with change. There are many aspects to coping with change, including changes in requirements, changes in technology, vendor stability, changes in staffing and so on”–Jon Brumfitt.
I first did an interview with Dr. Jon Brumfitt, System Architect & System Engineer of Herschel Scientific Ground Segment, at the European Space Agency in March 2011. You can read that interview here.
Two years later, I wanted to know the status of the project. This is a follow up interview.
Q1. What is the status of the mission?
Jon Brumfitt: The operational phase of the Herschel mission came to an end on 29th April 2013, when the super-fluid helium used to cool the instruments was finally exhausted. By operating in the far infra-red, Herschel has been able to see cold objects that are invisible to normal telescopes.
However, this requires that the detectors are cooled to an even lower temperature. The helium cools the instruments down to 1.7K (about -271 Celsius). Individual detectors are then cooled down further to about 0.3K. This is very close to absolute zero, which is the coldest possible temperature. The exhaustion of the helium marks the end of new observations, but it is by no means the end of the mission.
We still have a lot of work to do in getting the best results from the data processing to give astronomers a final legacy archive of high-quality data to work with for years to come.
The spacecraft has been in orbit around a point known as the second Lagrangian point “L2″, which is about 1.5 million kilometres from Earth (around four times as far away as the Moon). This location provided a good thermal environment and a relatively unrestricted view of the sky. The spacecraft cannot be left in this orbit because regular correction manoeuvres would be needed. Consequently, it is being transferred into a “parking” orbit around the Sun.
Q2. What are the main results obtained so far by using the “Herschel” telescope?
Jon Brumfitt: That is a difficult one to answer in a few sentences. Just to take a few examples, Herschel has given us new insights into the way that stars form and the history of star formation and galaxy evolution since the big-bang.
It has discovered large quantities of cold water vapour in the dusty disk surrounding a young star, which suggests the possibility of other water covered planets. It has also given us new evidence for the origins of water on Earth.
The following are some links giving more detailed highlights from the mission:
With its 3.5 metre diameter mirror, Herschel is the largest space telescope ever launched. The large mirror not only gives it a high sensitivity but also allows us to observe the sky with a high spatial resolution. So in a sense every observation we make is showing us something we have never seen before. We have performed around 35,000 science observations, which have already resulted in over 600 papers being published in scientific journals. There are many years of work ahead for astronomers in interpreting the results, which will undoubtedly lead to many new discoveries.
Q3. How much data did you receive and process so far? Could you give us some up to date information?
Jon Brumfitt: We have about 3 TB of data in the Versant database, most of which is raw data from the spacecraft. The data received each day is processed by our data processing pipeline and the resulting data products, such as images and spectra, are placed in an archive for access by astronomers.
Each time we make a major new release of the software (roughly every six months at this stage), with improvements to the data processing, we reprocess everything.
The data processing runs on a grid with around 35 nodes, each with typically 8 cores and between 16 and 256 GB of memory. This is able to process around 40 days worth of data per day, so it is possible to reprocess everything in a few weeks. The data in the archive is stored as FITS files (a standard format for astronomical data).
The archive uses a relational (PostgreSQL) database to catalogue the data and allow queries to find relevant data. This relational database is only about 60 GB, whereas the product files account for about 60 TB.
This may reduce somewhat for the final archive, once we have cleaned it up by removing the results of earlier processing runs.
Q4. What are the main technical challenges in the data management part of this mission and how did you solve them?
Jon Brumfitt: One of the biggest challenges with any project of such a long duration is coping with change. There are many aspects to coping with change, including changes in requirements, changes in technology, vendor stability, changes in staffing and so on.
The lifetime of Herschel will have been 18 years from the start of software development to the end of the post-operations phase.
We designed a single system to meet the needs of all mission phases, from early instrument development, through routine in-flight operations to the end of the post-operations phase. Although the spacecraft was not launched until 2009, the database was in regular use from 2002 for developing and testing the instruments in the laboratory. By using the same software to control the instruments in the laboratory as we used to control them in flight, we ended up with a very robust and well-tested system. We call this approach “smooth transition”.
The development approach we adopted is probably best classified as an Agile iterative and incremental one. Object orientation helps a lot because changes in the problem domain, resulting from changing requirements, tend to result in localised changes in the data model.
Other important factors in managing change are separation of concerns and minimization of dependencies, for example using component-based architectures.
When we decided to use an object database, it was a new technology and it would have been unwise to rely on any database vendor or product surviving for such a long time. Although work was under way on the ODMG and JDO standards, these were quite immature and the only suitable object databases used proprietary interfaces.
We therefore chose to implement our own abstraction layer around the database. This was similar in concept to JDO, with a factory providing a pluggable implementation of a persistence manager. This abstraction provided a route to change to a different object database, or even a relational database with an object-relational mapping layer, should it have proved necessary.
One aspect that is difficult to abstract is the use of queries, because query languages differ. In principle, an object database could be used without any queries, by navigating to everything from a global root object. However, in practice navigation and queries both have their role. For example, to find all the observation requests that have not yet been scheduled, it is much faster to perform a query than to iterate by navigation to find them. However, once an observation request is in memory it is much easier and faster to navigate to all the associated objects needed to process it. We have used a variety of techniques for encapsulating queries. One is to implement them as methods of an extent class that acts as a query factory.
Another challenge was designing a robust data model that would serve all phases of the mission from instrument development in the laboratory, through pre-flight tests and routine operations to the end of post-operations. We approached this by starting with a model of the problem domain and then analysing use-cases to see what data needed to be persistent and where we needed associations. It was important to avoid the temptation to store too much just because transitive persistence made it so easy.
One criticism that is sometimes raised against object databases is that the associations tend to encode business logic in the object schema, whereas relational databases just store data in a neutral form that can outlive the software that created it; if you subsequently decide that you need a new use-case, such as report generation, the associations may not be there to support it. This is true to some extent, but consideration of use cases for the entire project lifetime helped a lot. It is of course possible to use queries to work-around missing associations.
Examples are sometimes given of how easy an object database is to use by directly persisting your business objects. This may be fine for a simple application with an embedded database, but for a complex system you still need to cleanly decouple your business logic from the data storage. This is true whether you are using a relational or an object database. With an object database, the persistent classes should only be responsible for persistence and referential integrity and so typically just have getter and setter methods.
We have encapsulated our persistent classes in a package called the Core Class Model (CCM) that has a factory to create instances. This complements the pluggable persistence manager. Hence, the application sees the persistence manager and CCM factories and interfaces, but the implementations are hidden.
Applications define their own business classes which can work like decorators for the persistent classes.
Q5. What is your experience in having two separate database systems for Herschel? A relational database for storing and managing processed data products and an object database for storing and managing proposal data, mission planning data, telecommands and raw (unprocessed) telemetry?
Jon Brumfitt: There are essentially two parts to the ground segment for a space observatory.
One is the “uplink” which is used for controlling the spacecraft and instruments. This includes submission of observing proposals, observation planning, scheduling, flight dynamics and commanding.
The other is the “downlink”, which involves ingesting and processing the data received from the spacecraft.
On some missions the data processing is carried out by a data centre, which is separate from spacecraft operations. In that case there is a very clear separation.
On Herschel, the original concept was to build a completely integrated system around an object database that would hold all uplink and downlink data, including processed data products. However, after further analysis it became clear that it was better to integrate our product archive with those from other missions. This also means that the Herschel data will remain available long after the project has finished. The role of the object database is essentially for operating the spacecraft and storing the raw data.
The Herschel archive is part of a common infrastructure shared by many of our ESA science projects. This provides a uniform way of accessing data from multiple missions.
The following is a nice example of how data from Herschel and our XMM-Newton X-ray telescope have been combined to make a multi-spectral image of the Andromeda Galaxy.
Our archive, in turn, forms part of a larger international archive known as the “Virtual Observatory” (VO), which includes both space and ground-based observatories from all over the world.
I think that using separate databases for operations and product archiving has worked well. In fact, it is more the norm rather than the exception. The two databases serve very different roles.
The uplink database manages the day-to-day operations of the spacecraft and is constantly being updated. The uplink data forms a complex object graph which is accessed by navigation, so an object database is well suited.
The product archive is essentially a write-once-read-many repository. The data is not modified, but new versions of products may be added as a result of reprocessing. There are a large number of clients accessing it via the Internet. The archive database is a catalogue containing the product meta-data, which can be queried to find the relevant product files. This is better suited to a relational database.
The motivation for the original idea of using a single object database for everything was that it allowed direct association between uplink and downlink data. For example, processed products could be associated with their observation requests. However, using separate databases does not prevent one database being queried with an observation identifier obtained from the other.
One complication is that processing an observation requires both downlink data and the associated uplink data.
We solved this by creating “uplink products” from the relevant uplink data and placing them in the archive. This has the advantage that external users, who do not have access to the Versant database, have everything they need to process the data themselves.
Q6. What are the main lessons learned so far in using Versant object database for managing telemetry data and information on steering and calibrating scientific on-board instruments?
Jon Brumfitt: Object databases can be very effective for certain kinds of application, but may have less benefit for others. A complex system typically has a mixture of application types, so the advantages are not always clear cut. Object databases can give a high performance for applications that need to navigate through a complex object graph, particularly if used with fairly long transactions where a significant part of the object graph remains in memory. Web (JavaEE) applications lose some of the benefit because they typically perform many short transactions with each one performing a query. They also use additional access layers that result in a system which loses the simplicity of the transparent persistence of an object database.
In our case, the object database was best suited for the uplink. It simplified the uplink development by avoiding object-relational mapping and the complexity of a design based on JDBC or EJB 2. Nowadays with JPA, relational databases are much easier to use for object persistence, so the rationale for using an object database is largely determined by whether the application can benefit from fast navigational access and how much effort is saved in mapping. There are now at least two object database vendors that support both JDO and JPA, so the distinction is becoming somewhat blurred.
For telemetry access we query the database instead of using navigation, as the packets don’t fit neatly into a single containment hierarchy. Queries allows packets to be accessed by many different criteria, such as time, instrument, type, source and so on.
Processing calibration observations does not introduce any special considerations as far as the database is concerned.
Q7. Did you have any scalability and or availability issues during the project? If yes, how did you solve them?
Jon Brumfitt: Scalability would have been an important issue if we had kept to the original concept of storing everything including products in a single database. However, using the object database for just uplink and telemetry meant that this was not a big issue.
The data processing grid retrieves the raw telemetry data from the object database server, which is a 16-core Linux machine with 64 GB of memory. The average load on the server is quite low, but occasionally there have been high peak loads from the grid that have saturated the server disk I/O and slowed down other users of the database. Interactive applications such as mission planning need a rapid response, whereas batch data processing is less critical. We solved this by implementing a mechanism to spread out the grid load by treating the database as a resource.
Once a year, we have made an “Announcement of Opportunity” for astronomers to propose observations that they would like to perform with Herschel. It is only human nature that many people leave it until the last minute and we get a very high peak load on the server in the last hour or two before the deadline! We have used a separate server for this purpose, rather than ingesting proposals directly into our operational database. This has avoided any risk of interfering with routine operations. After the deadline, we have copied the objects into the operational database.
Q8. What about the overall performance of the two databases? What are the lessons learned?
Jon Brumfitt: The databases are good at different things.
As mentioned before, an object database can give a high performance for applications involving a complex object graph which you navigate around. An example is our mission planning system. Object persistence makes application design very simple, although in a real system you still need to introduce layers to decouple the business logic from the persistence.
For the archive, on the other hand, a relational database is more appropriate. We are querying the archive to find data that matches a set of criteria. The data is stored in files rather than as objects in the database.
Q9. What are the next steps planned for the project and the main technical challenges ahead?
Jon Brumfitt: As I mentioned earlier, the coming post-operations phase will concentrate on further improving the data processing software to generate a top-quality legacy archive, and on provision of high-quality support documentation and continued interactive support for the community of astronomers that forms our “customer base”. The system was designed from the outset to support all phases of the mission, from early instrument development tests in the laboratory, though routine operations to the end of the post-operations phase of the mission. The main difference moving into post-operations is that we will stop uplink activities and ingesting new telemetry. We will continue to reprocess all the data regularly as improvements are made to the data processing software.
We are currently in the process of upgrading from Versant 7 to Versant 8.
We have been using Versant 7 since launch and the system has been running well, so there has been little urgency to upgrade.
However, with routine operations coming to an end, we are doing some “technology refresh”, including upgrading to Java 7 and Versant 8.
Q10. Anything else you wish to add?
Jon Brumfitt: These are just some personal thoughts on the way the database market has evolved over the lifetime of Herschel. Thirteen years ago, when we started development of our system, there were expectations that object databases would really take off in line with the growing use of object orientation, but this did not happen. Object databases still represent rather a niche market. It is a pity there is no open-source object-database equivalent of MySQL. This would have encouraged more people to try object databases.
JDO has developed into a mature standard over the years. One of its key features is that it is “architecture neutral”, but in fact there are very few implementations for relational databases. However, it seems to be finding a new role for some NoSQL databases, such as the Google AppEngine datastore.
NoSQL appears to be taking off far quicker than object databases did, although it is an umbrella term that covers quite a few kinds of datastore. Horizontal scaling is likely to be an important feature for many systems in the future. The relational model is still dominant, but there is a growing appreciation of alternatives. There is even talk of “Polyglot Persistence” using different kinds of databases within a system; in a sense we are doing this with our object database and relational archive.
More recently, JPA has created considerable interest in object persistence for relational databases and appears to be rapidly overtaking JDO.
This is partly because it is being adopted by developers of enterprise applications who previously used EJB 2.
If you look at the APIs of JDO and JPA they are actually quite similar apart from the locking modes. However, there is an enormous difference in the way they are typically used in practice. This is more to do with the fact that JPA is often used for enterprise applications. The distinction is getting blurred by some object database vendors who now support JPA with an object database. This could expand the market for object databases by attracting some traditional relational type applications.
So, I wonder what the next 13 years will bring! I am certainly watching developments with interest.
Dr Jon Brumfitt, System Architect & System Engineer of Herschel Scientific Ground Segment, European Space Agency.
Jon Brumfitt has a background in Electronics with Physics and Mathematics and has worked on several of ESA’s astrophysics missions, including IUE, Hipparcos, ISO, XMM and currently Herschel. After completing his PhD and a post-doctoral fellowship in image processing, Jon worked on data reduction for the IUE satellite before joining Logica Space and Defence in 1980. In 1984 he moved to Logica’s research centre in Cambridge and then in 1993 to ESTEC in the Netherlands to work on the scientific ground segments for ISO and XMM. In January 2000, he joined the newly formed Herschel team as science ground segment System Architect. As Herschel approached launch, he moved down to the European Space Astronomy Centre in Madrid to become part of the Herschel Science Operations Team, where he is currently System Engineer and System Architect.
You can follow ODBMS.org on Twitter : @odbmsorg