Gaia Mission maps 1 Billion stars. Interview with Uwe Lammers
“Gaia continues to be a challenging mission in all areas even after 4 years of operation.
In total we have processed almost 800 Billion (=800,000 Million) astrometric, 160 Billion (=160,000 Million) photometric and more than 15 Billion spectroscopic observations, which is the largest astronomical dataset from a science space mission until the present day.”
— Uwe Lammers.
In December of 2013, the European Space Agency (ESA) launched a satellite called Gaia on a five-year mission to map the galaxy and learn about its past. The Gaia mission is considered by the experts “the biggest data processing challenge to date in astronomy”.
I recall here the Objectives of the Gaia Project (source ESA Web site):
“To create the largest and most precise three dimensional chart of our Galaxy by providing unprecedented positional and radial velocity measurements for about one billion stars in our Galaxy and throughout the Local Group.”
I have been following the Gaia mission since 2011, and I have reported on it in three interviews until now.
In this interview, Uwe Lammers – Gaia’s Science Operations Manager – gives a very detailed description of the data challenges and the opportunities of the Gaia mission.
This interview is the fourth of the series, the second after the launch.
Q1. Of the raw astrometry, photometry and spectroscopy data collected so far by the Gaia spacecraft, what is their Volume, Velocity, Variety, Veracity and Value?
From the beginning of the nominal mission in 2014 until the end of June 2017, the satellite has delivered about 47.5 TB of compressed raw data. This data is not suitable for any scientific analysis; it first has to be processed into higher-level products, which inflates the volume about 4 times.
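As a rough cross-check of these figures, the following Python sketch computes the processed volume and the average raw data rate they imply. The exact start date of the nominal mission (25 July 2014) and decimal units (1 TB = 1000 GB) are assumptions not stated in the interview.

```python
from datetime import date

RAW_TB = 47.5               # compressed raw data delivered (from the interview)
INFLATION = 4               # processing inflates the volume about 4 times
START = date(2014, 7, 25)   # ASSUMPTION: start of the nominal mission
END = date(2017, 6, 30)     # "end June 2017"


def volume_summary(raw_tb=RAW_TB, inflation=INFLATION, start=START, end=END):
    """Return elapsed days, processed volume and implied average raw daily rate."""
    days = (end - start).days
    return {
        "days": days,
        "processed_tb": raw_tb * inflation,
        # ASSUMPTION: decimal units, 1 TB = 1000 GB
        "avg_raw_gb_per_day": raw_tb * 1000 / days,
    }


if __name__ == "__main__":
    print(volume_summary())
```

Under these assumptions the processed products come to roughly 190 TB, and the implied average raw rate is a little over 40 GB per day.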
The average raw daily data rate is about 40 GB but highly variable, depending on which part of the sky the satellite is currently scanning. The data is highly complex and interdependent but not unstructured: it does not come with a lot of meta-information as such, but it follows strictly defined structures. In general it is very trustworthy; however, the downstream data processing cannot blindly assume that every single observation is valid.
As with all scientific measurements, there can be outliers which must be identified and eliminated from the data stream as part of the analysis. Regarding value, Gaia’s data set is absolutely unique in a number of ways.
Gaia is the only mission surveying the complete sky with unprecedented precision and completeness. The end result is expected to be a treasure trove for generations of astronomers to come.
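The interview does not say how outliers are identified in the Gaia pipeline. As a generic illustration only, one common robust approach flags values that lie far from the median in units of the median absolute deviation (MAD). The function name and the threshold of 5 are hypothetical choices for this sketch.

```python
from statistics import median


def reject_outliers(values, threshold=5.0):
    """Split values into (kept, rejected) using a median/MAD criterion.

    Generic robust-statistics sketch, NOT the method used by the Gaia pipeline.
    """
    values = list(values)
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No measurable spread: nothing can be flagged reliably.
        return values, []
    # 1.4826 * MAD estimates the standard deviation for Gaussian noise.
    scale = 1.4826 * mad
    kept, rejected = [], []
    for v in values:
        if abs(v - med) <= threshold * scale:
            kept.append(v)
        else:
            rejected.append(v)
    return kept, rejected


if __name__ == "__main__":
    print(reject_outliers([10.1, 9.9, 10.0, 10.2, 9.8, 25.0]))
```

The median and MAD are used instead of the mean and standard deviation because a few bad measurements barely move them.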
Q2. How is this data transmitted to Earth?
Under normal observing conditions the data is transmitted from the satellite to the ground through a so-called phased-array antenna (PAA) at a rate of up to 8.5 Mbps. As the satellite spins, it continuously keeps a radio beam directed towards the Earth by activating successive panels on the PAA. This is a fully electronic process, as there can be no moving parts on Gaia which would otherwise disturb the precise measurements. On the Earth we use three 35 m radio dishes in Spain, Australia, and Argentina to receive the telemetry from Gaia.
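To put the downlink rate in perspective, this small sketch estimates how long it takes to send an average day of raw data (about 40 GB, as quoted earlier) at the maximum rate of 8.5 Mbps. It assumes decimal units and ignores protocol overhead and ground-station visibility, so it is a lower bound rather than an operational figure.

```python
def downlink_hours(volume_gb, rate_mbps=8.5):
    """Hours needed to transmit volume_gb gigabytes at rate_mbps megabits/s.

    ASSUMPTIONS: 1 GB = 1e9 bytes, 1 Mbps = 1e6 bit/s, no overhead.
    """
    bits = volume_gb * 1e9 * 8
    seconds = bits / (rate_mbps * 1e6)
    return seconds / 3600


if __name__ == "__main__":
    print(f"{downlink_hours(40):.1f} hours for 40 GB")
```

Under these assumptions an average day of data needs roughly ten and a half hours of contact at the full rate.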
Q3. Calibrated processed data, high level data products and raw data. What is the difference? What kind of technical data challenges do they each pose?
That question is not easy to answer in a few words. Raw data are essentially unprocessed digital measurements from the CCDs, perhaps comparable to data from the “raw mode” of digital consumer cameras. They have to be processed with a range of complex software to turn them into higher-level products from which, in the end, astrophysical information can be inferred. There are many technical challenges; the most basic one is still handling the 100s of GBs of daily data. Handling means reception, storage, processing, I/O by the scientific algorithms, backing up, and disseminating the processed data to 5 other partner data processing centres across Europe.
Here at the Science Operations Centre (SOC) near Madrid we chose InterSystems Caché RDBMS + NetApp hardware as our storage solution years ago, and this continues to be a good solution. The system is reliable and performant, which are crucial prerequisites for us. Another technical challenge is data accountability, which means keeping track of the more than 70 million scientific observations we get from the satellite every single day.
Q4. Who are the users for such data and what do they do with it?
The data we are generating here at the SOC has no immediate users. It is sent out to the 5 other Gaia Data Processing Centres, where more scientific processing takes place and more higher-level products get created. From all this processed data we are constructing a stellar catalogue, which is our final result, and this is what the end users, the astronomical community of the world, get to see. The first version of our catalogue was published on 14 September last year (Gaia Data Release 1), and we are currently working hard to release the second version (DR2) in April next year.
Our end users do fundamental astronomical research with the data, ranging from looking at individual stars, studies of clusters, and the dynamics of our Milky Way to cosmological questions like the expansion rate of our universe. The scientific exploitation of the Gaia data has just started, but already more than 200 scientific articles have been published. This is about 1 per day since DR1, and we expect this rate to increase after DR2.
Q5. Can you explain at a high level how the ground processing of Gaia data is implemented?
ESA has entrusted the Gaia data processing to the Data Processing and Analysis Consortium (DPAC), of which the SOC is an integral part. DPAC consists of 9 so-called Coordination Units (CU) and 6 data processing centres (DPCs) across Europe, so this is a large distributed system.
In total some 450 people from 20+ countries, with a large range of educational backgrounds and experiences, form DPAC. Roughly speaking, the CUs are responsible for writing and validating the scientific processing software, which is then run in one of the DPCs (every CU is associated with exactly one DPC).
The different CUs cover different aspects of the data processing (e.g. CU3 takes care of astrometry, CU5 of photometry).
The corresponding processes run more or less independently of each other; however, due to the complex interdependencies of the Gaia data itself, this is only a first approximation. Ultimately, everything depends on everything else (e.g. astrometry depends on photometry and vice versa), which means that the entire system has to be iterated to produce the final solution. As you can imagine, a lot of data has to be exchanged. SOC/DPCE is the hub in a hub-and-spokes topology where the other 5 DPCs are sitting at the ends of the spokes. No direct data exchange between DPCs is allowed; all the data flow is centrally managed through the hub at DPCE.
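The hub-and-spokes rule can be expressed as a simple routing function: any transfer between two spoke DPCs must pass through DPCE. This is only an illustrative sketch. The spoke labels below are hypothetical placeholders, since the interview names only DPCE, and it says nothing about the actual transfer software.

```python
HUB = "DPCE"
# ASSUMPTION: placeholder labels for the 5 other DPCs (not named in the interview).
SPOKES = {"DPC-A", "DPC-B", "DPC-C", "DPC-D", "DPC-E"}


def route(src, dst):
    """Return the list of centres a data product passes through from src to dst."""
    nodes = SPOKES | {HUB}
    if src not in nodes or dst not in nodes:
        raise ValueError(f"unknown centre: {src!r} or {dst!r}")
    if src == dst:
        return [src]
    if HUB in (src, dst):
        # Hub-to-spoke and spoke-to-hub transfers are direct.
        return [src, dst]
    # Spoke-to-spoke transfers are never direct: always relay via the hub.
    return [src, HUB, dst]


if __name__ == "__main__":
    print(route("DPC-A", "DPC-B"))
```

Routing everything through one hub keeps data accounting in a single place, at the cost of making the hub's bandwidth and availability critical.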
Q6. How do you process the data stream in near real-time in order to provide rapid alerts to facilitate ground-based follow-up?
Yes, indeed we do. For ground-based follow up observations of variable objects quick turn-around times are essential. The time difference between an observation made on-board and the confirmation of a photometric alert on the ground is typically 2 days now which is close to the optimal value given all the operational constraints we have.
Q7. What are the main technical challenges with respect to data processing, manipulation and storage you have encountered so far? And how did you solve them?
Regarding storage, the handling of 100s of GBs of raw and processed data every day has always been and remains until today quite a challenge as explained above. The Gaia data reduction task is also a formidable computational problem. Years ago we estimated the total numerical effort to produce the final catalogue at some 10^20 FLOPs and this has proven fairly accurate.
So we need quite some number-crunching capabilities in the DPCs and to continuously expand CPU resources as the data volume keeps growing in the operational phase of the mission. Moore’s law is slowly coming to an end but, fortunately, a number of algorithms are perfectly parallelizable (processing every object in the sky individually and isolated) such that CPU bottlenecks can be ameliorated by simply adding more processors to the existing systems.
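To give a feeling for the 10^20 figure, the sketch below converts a total operation count into wall-clock time for a given number of cores. The sustained rate of 10^9 floating-point operations per second per core and the core counts are illustrative assumptions, and the linear scaling only applies to the perfectly parallelizable parts mentioned above.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def wall_clock_years(total_flop=1e20, cores=1000, flops_per_core=1e9):
    """Years needed to execute total_flop operations with ideal linear scaling.

    ASSUMPTIONS: flops_per_core is a sustained (not peak) rate, and the work
    splits perfectly across cores with no communication cost.
    """
    seconds = total_flop / (cores * flops_per_core)
    return seconds / SECONDS_PER_YEAR


if __name__ == "__main__":
    for n in (1000, 2000, 10000):
        print(n, "cores:", round(wall_clock_years(cores=n), 2), "years")
```

With these assumed numbers, 1000 cores would need a little over three years, and doubling the cores halves the time.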
Data transfers are likewise a challenge. At the moment 1 Gbps connections (public Internet) between DPCE and the other 5 DPCs are sufficient, however, in the coming years we heavily rely on seeing bandwidths increasing to 10 Gbps and beyond. Unfortunately, this is largely not under our control which is a risk to the project.
Q8. What kind of databases and analytics tools do you use for Gaia's data pipeline?
As explained above, for the so-called daily pipeline we have chosen InterSystems Caché and are very satisfied with this approach. We had some initial problems with the system but were able to overcome all difficulties with the help of InterSystems. We much appreciated their excellent service and customer orientation in this phase and to the present day. Regarding analytics tools, we use most facilities that are part of Caché, but have also developed a suite of custom-made solutions.
Q9. How do you transform the raw information into useful and reliable stellar positions?
The raw data from the satellite is first turned into higher-level products, which already include preliminary estimates for the stellar positions. But each of these positions is then based on only a single measurement. The high accuracy of Gaia comes from combining _all_ observations that have been taken during the mission with a scheme called the Astrometric Global Iterative Solution (AGIS) [see The astrometric core solution for the Gaia mission. Overview of models, algorithms, and software implementation].
This cannot be done on a star-by-star basis but is a global, simultaneous optimization of a large number of parameters, including the 5 basic astrometric parameters of each star (about 1 Billion in total), the time-varying attitude of the satellite (a few Million), and a number of calibration parameters (a few 10,000).
The process is iterative and in the end gives the best match between the model parameters and the actual observations. The stellar positions are two of the five astrometric parameters of each object.
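The idea of a global, iterative solution can be illustrated with a deliberately tiny toy problem. This is not the AGIS implementation: it assumes a one-dimensional model in which each observation equals a source position plus an attitude offset for its epoch, and it alternates between solving for the sources with the attitude fixed and for the attitude with the sources fixed. All names are hypothetical.

```python
def solve_block_iterative(observations, n_sources, n_epochs, n_iter=50):
    """Toy block-iterative least squares for the model obs = source[i] + attitude[j].

    observations is a list of (source_index, epoch_index, value) tuples.
    Illustrative sketch only; the real solution has far richer models.
    """
    sources = [0.0] * n_sources
    attitude = [0.0] * n_epochs
    for _ in range(n_iter):
        # Source block: with attitude fixed, each source is an independent mean.
        total, count = [0.0] * n_sources, [0] * n_sources
        for i, j, value in observations:
            total[i] += value - attitude[j]
            count[i] += 1
        sources = [t / c if c else s for t, c, s in zip(total, count, sources)]
        # Attitude block: with sources fixed, each epoch is an independent mean.
        total, count = [0.0] * n_epochs, [0] * n_epochs
        for i, j, value in observations:
            total[j] += value - sources[i]
            count[j] += 1
        attitude = [t / c if c else a for t, c, a in zip(total, count, attitude)]
        # A constant can be moved between the two blocks without changing the
        # fit; remove this degeneracy by forcing the attitude mean to zero.
        shift = sum(attitude) / n_epochs
        attitude = [a - shift for a in attitude]
        sources = [s + shift for s in sources]
    return sources, attitude


if __name__ == "__main__":
    true_s = [0.5 * i for i in range(12)]
    true_a = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
    # Each source is seen at only some epochs, so the blocks are coupled.
    obs = [(i, j, true_s[i] + true_a[j])
           for i in range(12) for j in range(6) if (i + 2 * j) % 3 != 0]
    est_s, est_a = solve_block_iterative(obs, 12, 6)
    print(max(abs(e - t) for e, t in zip(est_s, true_s)))
```

Even in this toy, no source can be solved correctly on its own: its estimate depends on the attitude, which in turn depends on all other sources observed at the same epochs.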
Q10. What is the level of accuracy you have achieved so far?
The accuracies depend on the brightnesses of the stars – the brighter a star, the higher is the achievable accuracy. In DR1 the typical uncertainty is about 0.3 mas for the positions and parallaxes, and about 1 mas yr^-1 for the proper motions.
For positions and parallaxes a systematic component of another 0.3 mas should be added. With DR2 we are aiming to reduce these formal errors by at least a factor of 3 and likewise reduce systematic errors by the same or a larger amount.
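To see what a parallax uncertainty of 0.3 mas means in practice, the sketch below uses the standard relation between parallax and distance (distance in parsec = 1000 / parallax in mas) and a first-order error propagation. The function names are hypothetical, and the simple inversion is only a reasonable approximation when the relative parallax error is small.

```python
def distance_pc(parallax_mas):
    """Distance in parsec from a parallax in milliarcseconds."""
    if parallax_mas <= 0:
        raise ValueError("parallax must be positive for a simple inversion")
    return 1000.0 / parallax_mas


def relative_distance_error(parallax_mas, sigma_mas=0.3):
    """First-order relative distance error: sigma_d / d ~ sigma_parallax / parallax."""
    if parallax_mas <= 0:
        raise ValueError("parallax must be positive")
    return sigma_mas / parallax_mas


if __name__ == "__main__":
    for plx in (10.0, 1.0):
        print(distance_pc(plx), "pc:", relative_distance_error(plx))
```

For a star at 100 pc (parallax 10 mas) this gives about 3%, while at 1000 pc (parallax 1 mas) the same 0.3 mas already corresponds to about 30%.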
Q11. The first catalogue of more than a billion stars from ESA’s Gaia satellite was published on 14 September 2016 – the largest all-sky survey of celestial objects to date. What data is in this catalog? What is the size and structure of the information you analysed so far?
Gaia DR1 contains astrometry, G-band photometry (brightnesses), and a modest number of variable star light curves, for a total of 1 142 679 769 sources [See Gaia Data Release 1. Summary of the astrometric, photometric, and survey properties]. For the large majority of those we only provide position and magnitude but about 2 Million stars also have parallaxes and proper motions. In DR2 these numbers will be substantially larger.
The information is structured in simple, easy-to-use tables which can be queried via the central Gaia Archive and a number of other data centres around the world.
Q12. What insights have been derived so far by analysing this data?
The astronomical community eagerly grabbed the DR1 data, and since 14 September a couple of hundred scientific articles have appeared in peer-reviewed astronomical journals covering a large breadth of topics.
Only to give one example: A new so-called open cluster of stars was discovered very close to the brightest star in the night sky, Sirius. All previous surveys had missed it!
Q13. How do you offer a proper infrastructure and middleware upon which scientists will be able to do exploration and modeling with this huge data set?
That is a very good question! At the moment the archive system does not yet allow real big-data mining using the entire large Gaia data set. Up to now we do not know precisely what scientists will want to do with the Gaia data in the end.
There is the “traditional” astronomical research which mostly uses only subsets of the data, e.g. all stars in a particular area of the sky. Such data requests can be satisfied with traditional queries to an RDBMS.
But in the future we expect also applications which will need data mining capabilities and we are experimenting with a number of different approaches using the “code-to-the-data” paradigm. The idea is that scientists will be able to upload and deploy their codes directly through a platform which allows execution with quick data access close to the archive.
For DR2 this will only be available for DPAC-internal use but, depending on experiences gained, as of DR3 it might become a service for public use. One technology we are looking at is Apache Spark for big data mining.
Q14. What software technologies do you use for accessing the Gaia catalogue and associated data?
As explained above, at the moment we are offering access to the catalogue only through a traditional RDBMS system which allows queries to be submitted in a special SQL dialect called ADQL (Astronomical Data Query Language). This DB system is not using InterSystems Caché but Postgres.
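As an illustration of what such a query looks like, the Python helper below builds an ADQL cone search, i.e. all sources within a given radius of a sky position. The table name `gaiadr1.gaia_source` and the column names are assumptions about the archive schema and should be checked against the archive documentation; the helper only builds the query text and does not contact any service.

```python
def cone_search_adql(ra_deg, dec_deg, radius_deg,
                     table="gaiadr1.gaia_source", limit=100):
    """Build an ADQL cone-search query string.

    ASSUMPTIONS: the table and column names are illustrative; verify them
    against the schema published by the archive before use.
    """
    return (
        f"SELECT TOP {int(limit)} source_id, ra, dec, phot_g_mean_mag\n"
        f"FROM {table}\n"
        "WHERE 1 = CONTAINS(\n"
        "    POINT('ICRS', ra, dec),\n"
        f"    CIRCLE('ICRS', {float(ra_deg)!r}, {float(dec_deg)!r}, {float(radius_deg)!r})\n"
        ")"
    )


if __name__ == "__main__":
    # Roughly the position of Sirius, with a half-degree search radius.
    print(cone_search_adql(101.287, -16.716, 0.5))
```

The resulting text can be pasted into the query form of an ADQL-capable archive. `TOP` limits the number of rows returned, which is useful while exploring.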
Q15. In addition to the query access, how do you “visualize” such data? Which “big data” techniques do you use for histograms production?
Visualization is done with a special custom-made application that sits close to the archive and is using not the raw data but pre-computed special objects especially constructed for fast visualization. We are not routinely using any big data techniques but are experimenting with a few key concepts.
For visualization one interesting novel application is called vaex and we are looking at it.
Histogramming of the entire data set is likewise done using pre-canned summary statistics which were generated when the data was ingested into the archive. The number of users really wanting the entire data set and this kind of functionality is very limited at the moment. We as well as the scientific community are still learning what can be done with the Gaia data set.
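The general idea of pre-computed summary statistics can be sketched as follows: each chunk of data is reduced to bin counts once, at ingestion time, and a later histogram request only merges these small summaries instead of scanning the raw rows. This is a generic illustration, not the archive's actual implementation. The bin width and function names are hypothetical.

```python
import math
from collections import Counter

BIN_WIDTH = 0.5  # ASSUMPTION: illustrative bin width in magnitudes


def summarize_chunk(magnitudes, width=BIN_WIDTH):
    """Reduce one ingested chunk to {bin index: count}; done once at ingestion."""
    return Counter(math.floor(m / width) for m in magnitudes)


def merge_summaries(summaries):
    """Combine per-chunk summaries into a full histogram without raw data."""
    total = Counter()
    for summary in summaries:
        total.update(summary)
    return total


if __name__ == "__main__":
    chunks = [[10.1, 10.4, 11.2], [10.3, 12.9]]
    stored = [summarize_chunk(c) for c in chunks]   # at ingestion time
    histogram = merge_summaries(stored)             # at query time
    for index in sorted(histogram):
        print(f"[{index * BIN_WIDTH}, {(index + 1) * BIN_WIDTH}): {histogram[index]}")
```

Because bin counts simply add up, the cost of a histogram request depends on the number of stored summaries rather than on the number of sources.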
Q16. Which “big data” software and hardware technologies did you use so far? And what are the lessons learned?
Again, we are only starting to look into big data technologies that may be useful for us. Until now most of the effort has gone into robustifying all systems and preparing DR1 and now DR2 for April next year. One issue is always that the Gaia data is so peculiar and special that COTS solutions rarely work. Most of the software systems we use are special developments.
Q17. What are the main technical challenges ahead?
As far as the daily systems are concerned we are now finally in the routine phase. The main future challenges lie in robustifying and validating the big outer iterative loop that I described above. It has not been tested yet, so, we are executing it for the first time with real flight data.
Producing DR3 (mid to late 2020) will be a challenge, as this for the first time involves output from all CUs and the results from the outer iterative loop. DR4 around the end of 2022 is then the final release for the nominal mission, and for that we want to release “everything”. This also means the individual observation data (“epoch data”), which will inflate the total volume served by the archive by a factor of 100 or so.
Qx. Anything else you wish to add?
Gaia continues to be a challenging mission in all areas even after 4 years of operation. In total we have processed almost 800 Billion (=800,000 Million) astrometric, 160 Billion (=160,000 Million) photometric and more than 15 Billion spectroscopic observations, which is the largest astronomical dataset from a science space mission until the present day.
Gaia is fulfilling its promises in every regard and the scientific community is eagerly looking into what is available already now and the coming data releases. This continues to be a great source of motivation for everybody working on this great mission.
Uwe Lammers. My academic background is in physics and computer science. After my PhD I joined ESA to first work on the X-ray missions EXOSAT, Beppo-SAX, and XMM-Newton before getting interested in Gaia in 2005. The first years I led the development of the so-called Astrometric Global Iterative Solution (AGIS) system and then became Gaia’s Science Operations Manager in 2014.
– The astrometric core solution for the Gaia mission. Overview of models, algorithms, and software implementation. L. Lindegren, U. Lammers et al. Astronomy & Astrophysics, Volume 538, id.A78, 47 pp. February 2012, DOI: 10.1051/0004-6361/201117905
– Gaia Data Release 1. Summary of the astrometric, photometric, and survey properties. A.G.A. Brown and Gaia Collaboration. Astronomy & Astrophysics, Volume 595, id.A2, 23 pp. November 2016, DOI: 10.1051/0004-6361/201629512
– Gaia Data Release 1. Astrometry: one billion positions, two million proper motions and parallaxes. L. Lindegren, U. Lammers et al. Astronomy & Astrophysics, Volume 595, id.A4, 32 pp. November 2016, DOI: 10.1051/0004-6361/201628714
– The Gaia mission in 2015. Interview with Uwe Lammers and Vik Nagjee. ODBMS Industry Watch, March 24, 2015
– The Gaia mission, one year later. Interview with William O’Mullane. ODBMS Industry Watch, January 16, 2013
– Objects in Space vs. Friends in Facebook. ODBMS Industry Watch, April 13, 2011
– Objects in Space. ODBMS Industry Watch, February 14, 2011