The Gaia mission, one year later. Interview with William O’Mullane.
” We will observe at LEAST 1,000,000,000 celestial objects. If we launched today we would cope with difficulty – but we are on track to be ready by September when we actually launch. This is a game changer for astronomy thus very challenging for us, but we have done many large scale tests to gain confidence in our ability to process the complex and voluminous data arriving on ground and turn it into catalogues. I still feel the galaxy has plenty of scope to throw us an unexpected curve ball though and really challenge us in the data processing.” — William O`Mullane.
The Gaia mission is considered by the experts “the biggest data processing challenge to date in astronomy”. I recall here the Objectives and the Mission of the Gaia Project (source ESA Web site):
“To create the largest and most precise three dimensional chart of our Galaxy by providing unprecedented positional and radial velocity measurements for about one billion stars in our Galaxy and throughout the Local Group.”
“Gaia is an ambitious mission to chart a three-dimensional map of our Galaxy, the Milky Way, in the process revealing the composition, formation and evolution of the Galaxy. Gaia will provide unprecedented positional and radial velocity measurements with the accuracies needed to produce a stereoscopic and kinematic census of about one billion stars in our Galaxy and throughout the Local Group. This amounts to about 1 per cent of the Galactic stellar population. Combined with astrophysical information for each star, provided by on-board multi-colour photometry, these data will have the precision necessary to quantify the early formation, and subsequent dynamical, chemical and star formation evolution of the Milky Way Galaxy.
Additional scientific products include detection and orbital classification of tens of thousands of extra-solar planetary systems, a comprehensive survey of objects ranging from huge numbers of minor bodies in our Solar System, through galaxies in the nearby Universe, to some 500 000 distant quasars. It will also provide a number of stringent new tests of general relativity and cosmology.”
Last year in February, I have interviewed William O`Mullane, Science Operations Development Manager, at the European Space Agency, and Vik Nagjee, Product Manager, Core Technologies, at InterSystems Corporation, both deeply involved with the initial Proof-of-Concept of the data management part of the project.
A year later, I have asked William O`Mullane (European Space Agency), and Jose Ruperez (Intersystems Spain), some follow up questions.
Q1. The original goal of the Gaia mission was to “observe around 1,000,000,000 celestial objects”. Is this still true? Are you ready for that?
William O’Mullane: YES ! We will have a Ground Segment Readiness Review next Spring and a Flight Acceptance Review before summer. We will observe at LEAST 1,000,000,000 celestial objects. If we launched today we would cope with difficulty – but we are on track to be ready by September when we actually launch. This is a game changer for astronomy thus very challenging for us, but we have done many large scale tests to gain confidence in our ability to process the complex and voluminous data arriving on ground and turn it into catalogues. I still feel the galaxy has plenty of scope to throw us an unexpected curve ball though and really challenge us in the data processing.
Q2. The plan was to launch the Gaia satellite in early 2013. Is this plan confirmed?
William O’Mullane: Currently September 2013 in Q1 is the official launch date.
Q3. Did the data requirements for the project change in the last year? If yes, how?
William O’Mullane: Downlink rate has not changed so we know how much comes into the System still only about 100TB over 5 years. Data processing volumes depend on how many intermediate steps we keep in different locations. Not much change there since last year.
Q4. The sheer volume of data that is expected to be captured by the Gaia satellite poses a technical challenge. What work has been done in the last year to prepare for such a challenge? What did you learn from the Proof-of-Concept of the data management part of this project?
William O’Mullane: I suppose we learned the same lessons as other projects. We have multiple processing centres with different needs met by different systems. We did not try for a single unified approach across these centers.
The CNES have gone completely to Hadoop for their processing. At ESAC we are going to InterSystems Caché. Last year only AGIS was on Caché – now the main daily processing chain is in Caché also [Edit: see also Q.9 ]. There was a significant boost in performance here but it must be said some of this was probably internal to the system, in moving it we looked at some bottlenecks more closely.
Jose Ruperez: We are very pleased to know that last year was only AGIS and now they have several other databases in Caché.
William O’Mullane: The second operations rehearsal is just drawing to a close. This was run completely on Caché (the first rehearsal used Oracle). There were of course some minor problems (also with our software) but in general from Caché perspective it went well.
Q5. Could you please give us some numbers related to performance? Could you also tells us what bottlenecks did you look at, and how did you avoid them?
William O’Mullane: Would take me time to dig out numbers .. we got factor 10 in some places with combination of better queries and removing some code bottle necks. We seem to regularly see factor 10 on “non optimized” systems.
Q6. Is it technically possible to interchange data between Hadoop and Caché ? Does it make sense for the project?
Jose Ruperez: The raw data downloaded from the satellite link every day can be loaded in any database in general. ESAC has chosen InterSystems Caché for performance reasons, but not only. William also explains how cost-effectiveness as well as the support from InterSystems were key. Other centers can try and use other products.
This is a valid point – support is a major reason for our using Caché. InterSystems work with us very well and respond to needs quickly. InterSystems certainly have a very developer oriented culture which matches our team well.
Hadoop is one thing HDFS is another .. but of course they go together. In many ways our DataTrain Whiteboard do “map reduce” with some improvements for our specific problem. There are Hadoop database interfaces so it could work with Caché.
Q7. Could you tell us a bit more what did you learn so far with this project? In particular, what is the implication for Caché, now that also the the main daily processing chain is stored in Caché?
Jose Ruperez: A relevant number, regarding InterSystems Caché performance, is to be able to insert over 100,000 records per second sustained over several days. This also means that InterSystems Caché, in turn, has to write hundreds of MegaBytes per second to disk. To me, this is still mind-boggling.
And done with fairly standard NetApp Storage. Caché and NetApp engineers sat together here at ESAC to align the configuration of both systems to get the max IO for Java through Caché to NetApp. There were several low level page size settings etc. which were modified for this.
Q8. What else still need to be done?
William O’Mullane: Well we have most of the parts but it is not a well oiled machine yet. We need more robustness and a little more automation across the board.
Q9. Your high level architecture a year ago consisted of two databases, a so called Main Database and an AGIS Database.
The Main Database was supposed to hold all data from Gaia and the products of processing. (This was expected to grow from a few TBs to few hundreds of TBs during the mission). AGIS was only required a subset of this data for analytics purpose. Could you tell us how the architecture has evolved in the last year?
William O’Mullane: This remains precisely the same.
Q10. For the AGIS database, were you able to generate realistic data and load on the system?
William O’Mullane: We have run large scale AGIS tests with 50,000,000 sources or about 4,500,000,000 simulated observations. This worked rather nicely and well within requirements. We confirmed going from 2 to 10 to 50 million sources that the problem scales as expected. The final (end of mission 2018) requirement is for 100,000,000 sources, so for now we are quite confident with the load characteristics. The simulation had a realistic source distribution in magnitude and coordinates (i.e. real sky like inhomogeneities are seen).
Q11. What results did you obtain in tuning and configuring the AGIS system in order to meet the strict insert requirements, while still optimizing sufficiently for down-stream querying of the data?
William O’Mullane: We still have bottlenecks in the update servers but the 50 million test still ran inside one month on a small in house cluster. So the 100 million in 3 months (system requirement) will be easily met especially with new hardware.
Q12. What are the next steps planned for the Gaia mission and what are the main technical challenges ahead?
William O’Mullane: AGIS is the critical piece of Gaia software for astrometry but before that the daily data treatment must be run. This so called Initial Data Treatment (IDT) is our main focus right now. It must be robust and smoothly operating for the mission and able to cope with non nominal situations occurring in commissioning the instrument. So some months of consolidation, bug fixing and operational rehearsals for us. The future challenge I expect not to be technical but rather when we see the real data and it is not exactly as we expect/hope it will be.
I may be pleasantly surprised of course. Ask me next year …
William O`Mullane, Science Operations Development Manager, European Space Agency.
William O’Mullane has a PhD in Physics and a background in Computer Science and has worked on space science projects since 1996 when he assisted with the production of the Hipparcos CDROMS. During this period he was also involved with the Planck and Integral science ground segments as well as contemplating the Gaia data processing problem. From 2000-2005 Wil worked on developing the US National Virtual Observatory (NVO) and on the Sloan Digital Sky Survey (SDSS) in Baltimore, USA. In August 2005 he rejoined the European Space Agency as Gaia Science Operations Development Manager to lead the ESAC development effort for the Gaia Data Processing and Analysis Consortium.
José Rupérez, Senior Engineer at InterSystems.
He has been providing technical advise to customers and partners in Spain and Portugal for the last 10 years. In particular, he has been working with the European Space Agency since December 2008. Before InterSystems, José developed his career at eSkye Solutions and A.T. Kearney in the United States, always in Software. He started his career working for Alcatel in 1994 as a Software Engineer. José holds a Bachelor of Science in Physics from Universidad Complutense (Madrid, Spain) and a Master of Science in Computer Science from Ball State University (Indiana, USA). He has also attended courses at the MIT Sloan School of Business.
You can follow ODBMS.org on Twitter : @odbmsorg.