The Gaia mission in 2015. Interview with Uwe Lammers and Vik Nagjee
“Some believe that the Gaia data will revolutionize astronomy! Only time will tell if that is true, but it is clear that it will be a treasure trove for astronomers for decades to come.”–Dr. Uwe Lammers.
“The Gaia mission is considered to be the largest data processing challenge in astronomy.”–Vik Nagjee
In December of 2013, the European Space Agency (ESA) launched a satellite called Gaia on a five-year mission to map the galaxy and learn about its past.
The Gaia mission is considered by the experts “the biggest data processing challenge to date in astronomy”.
I recall here the Objectives of the Gaia Project (source ESA Web site):
“To create the largest and most precise three dimensional chart of our Galaxy by providing unprecedented positional and radial velocity measurements for about one billion stars in our Galaxy and throughout the Local Group.”
I have been following the GAIA mission since 2011, and I have reported it in two interviews until now. This is the third interview of the series, the first one after the launch.
The interview is with Dr. Uwe Lammers, Gaia Science Operations Manager at the European Space Agency, and Vik Nagjee, Product Manager for Data Platforms at InterSystems.
Q1. Could you please elaborate in some detail what is the goal and what are the expected results of the Gaia mission?
Uwe Lammers: We are trying to construct the most consistent, most complete and most accurate astronomical catalog ever done. Completeness means to observe all objects in the sky that are brighter than a so-called magnitude limit of 20. These are mostly stars in our Milky Way up to 1.5 billion in number. In addition, we expect to observe as many as 10 million other galaxies, hundreds of thousands of celestial bodies in our solar system (mostly asteroids), tens of thousands of new exo-planets, and more. Some believe that the Gaia data will revolutionize astronomy! Only time will tell if that is true, but it is clear that it will be a treasure trove for astronomers for decades to come.
Vik Nagjee: The data collected from Gaia will ultimately result in a three-dimensional map of the Milky Way, plotting over a billion celestial objects at a distance of up to 30,000 light years. This will reveal the composition, formation and evolution of the Galaxy, and will enable the testing of Albert Einstein’s Theory of Relativity, the space-time continuum, and gravitational waves, among other things. As such, the Gaia mission is considered to be the largest data processing challenge in astronomy.
Orbiting the Lagrange 2 (L2) point, a fixed spot 1.5 million kilometers from Earth, Gaia will measure the position, movement, and brightness of more than a billion celestial objects, looking at each one an average of 70 times over the course of five years. Gaia’s measurements will be much more complete, powerful, and accurate than anything that has been done before. ESA scientists estimate that Gaia will find hundreds of thousands of new celestial objects, including extra-solar planets, and the failed stars known as brown dwarfs. In addition, because Gaia can so accurately measure the position and movement of the stars, it will provide valuable information about the galaxy’s past – and future – evolution.
Read more about the Gaia mission here.
Q2. What is the size and structure of the information you analysed so far?
Uwe Lammers: From the start of the nominal mission on 25 July until today, we have received about 13 terabytes of compressed binary telemetry from the satellite. The daily pipeline running here at the Science Operations Centre (SOC) has processed all this and generated about 48 TB of higher-level data products for downstream systems.
At the end of the mission, the Main Database (MDB) is expected to hold more than 1 petabyte of data. The structure of the data is complex and this is one of the main challenges of the project. Our data model contains about 1,500 tables with thousands of fields in total, and many inter-dependencies. The final catalog to be released sometime around 2020 will have a simpler structure, and there will be ways to access and work with it in a convenient form, of course.
Q3. Since the launch of Gaia in December 2013, what intermediate results did you obtain by analysing the data received so far?
Uwe Lammers: Last year we found our first supernova (exploding star) with the prototype of the so-called Science Alert pipeline. When this system is fully operational, we expect to find several of these per day. The recent detection of a micro-lensing event was another nice demonstration of Gaia’s capabilities.
Q4. Did you find out any unexpected information and/or confirmation of theories by analysing the data generated by Gaia so far?
Uwe Lammers: It is still too early in the mission to prove or disprove established astronomical theories. For that we need to collect more data and do much more processing. The daily SOC pipeline is only one, the first part, of a large distributed system that involves five other Data Processing Centres (DPCs), each running complex scientific algorithms on the data. The whole system is designed to improve the results iteratively, step by step, until the final accuracy has been reached. However, there will certainly be intermediate results. One simple example of an unexpected early finding is that Gaia gets hit by micro-meteoroids much more often than pre-launch estimates predicted.
Q5. Could you please explain at some high level the Gaia’s data pipeline?
Uwe Lammers: Hmmm, that’s not easy to do in a few words. The daily pipeline at the SOC converts compact binary telemetry of the satellite into higher level products for the downstream systems at the SOC and the other processing centres. This sounds simple, but it is not – mainly because of the complex dependencies and the fact that data does not arrive from the satellite in strict time order. The output of the daily pipeline is only the start as mentioned above.
From the SOC, data gets sent out daily to the other DPCs, which perform more specialized processing. After a number of months we declare the current data segment as closed, receive the outputs from the other DPCs back at the SOC, and integrate all into a coherent next version of the MDB. The creation of it marks the end of the current iteration and the start of a new one. This cyclic processing will go on for as many iterations as needed to converge to a final result.
An important key process is the Astrometric Global Iterative Solution (AGIS), which will give us the astrometric part of the catalog. As the name suggests, it is in itself an iterative process and we run it likewise here at the SOC.
Vik Nagjee: To add on to what Dr. Lammers describes, Gaia data processing is handled by a pan-European collaboration, the Gaia Data Processing and Analysis Consortium (DPAC), and consists of about 450 scientists and engineers from across Europe. The DPAC is organized into nine Coordination Units (CUs); each CU is responsible for a specific portion of the Gaia data processing challenge.
One of the CUs – CU3: Core Processing – is responsible for unpacking, decompressing, and processing the science data retrieved from the satellite to provide rapid monitoring and feedback of the spacecraft and payload performances at the ultra-precise accuracy levels targeted by the mission. In other words, CU3 is responsible for ensuring the accuracy of the data collected by Gaia, as it is being collected, to ensure the accuracy of the eventual 3-D catalog of the Milky Way.
Over its lifetime, Gaia will generate somewhere between 500,000 to 1 million GB of data. On an average day, approximately 50 million objects will “transit” Gaia’s field of view, resulting in about 285 GB of data. When Gaia is surveying a densely populated portion of the galaxy, the daily amount could be 7 to 10 times as much, climbing to over 2,000 GB of data in a day.
There is an eight-hour window of time each day when raw data from Gaia is downloaded to one of three ground stations.
The telemetry is sent to the European Space Astronomy Centre (ESAC) in Spain – the home of CU3: Core Processing – where the data is ingested and staged.
The initial data treatment converts the data into the complex astrometric data models required for further computation. These astrometric objects are then sent to various other Computational Units, each of which is responsible for looking at different aspects of the data. Eventually the processed data will be combined into a comprehensive catalog that will be made available to astronomers around the world.
In addition to performing the initial data treatment, ESAC also processes the resulting astrometric data with some complex algorithms to take a “first-look” at the data, making sure that Gaia is operating correctly and sending back good information. This processing occurs on the Initial Data Treatment / First Look (IDT/FL) Database; the data platform for the IDT/FL database is InterSystems Caché.
Q6. Observations made and conclusions drawn are only as good as the data that supports them. How do you evaluate the “quality” of the data you receive? and how do you discard the “noise” from the valuable information?
Uwe Lammers: A very good question! If you refer to the final catalog, this is a non-trivial problem and a whole dedicated group of people is working on it. The main issue is, of course, that we do not know the “true” values as in simulations. We work with models, e.g., models of the stars’ positions and the satellite orientation. With those we can predict the observations, and the difference between the predicted and the observed values tells us how well our models represent reality. We can also do consistency checks. For instance, we do two runs of AGIS, one with only the observations from odd months and another one from even months, and both must give similar results. But we will also make use of external astronomical knowledge to validate results, e.g., known distances to particular stars. For distinguishing “noise” from “signal,” we have implemented robust outlier rejection schemes. The quality of the data coming directly from the satellite and from the daily pipeline is assessed with a special system called First Look running also at the SOC.
Vik Nagjee: The CU3: Core Processing Unit is responsible for ensuring the accuracy of the data being collected by Gaia, as it is being collected, so as to ensure the accuracy of the eventual 3-D catalog of the Milky Way.
InterSystems Caché is the data platform used by CU3 to quickly determine that Gaia is working properly and that the data being downloaded is trustworthy. Caché was chosen for this task because of its proven ability to rapidly ingest large amounts of data, populate extremely complex astrometric data models, and instantly make the data available for just-in-time analytics using SQL, NoSQL, and object paradigms.
One million GB of data easily qualifies as Big Data. What makes InterSystems Caché unique is not so much its ability to handle very large quantities of data, but its abilities to provide just-in-time analytics on just the right data.
We call this “Big Slice” — which is where analytics is performed just-in-time for a focused result.
A good analogy is how customer service benefits from occasional Big Data analytics. Breakthrough customer service comes from improving service at the point of service, one customer at a time, based on just-in-time processing of a Big Slice – the data relevant to the customer and her interactions. Back to the Gaia mission: at the conclusion of five years of data collection, a true Big Data exercise will plot the solar map. Yet, frequently ensuring data accuracy is an example of the increasing strategic need for our “Big Slice” concept.
Q7. What kind of databases and analytics tools do you use for the Gaia`s data pipeline?
Uwe Lammers: At the SOC all systems use InterSystems’ Caché database. Despite some initial hiccups, Cache´ has proved to be a good choice for us. For analytics we use a few popular generic astronomical tools (e.g., topcat), but most are custom-made and specific to Gaia data. All DPCs had originally used relational databases, but some have migrated to Apache’s Hadoop.
Q8. Specifically for the Initial Data Treatment/First Look (IDT/FL) database, what are the main data management challenges you have?
Uwe Lammers: The biggest challenge is clearly the data volumes and the steady incoming stream that will not stop for the next five years. The satellite sends us 40-100 GB of compressed raw data every day, which the daily pipeline needs to process and store the output in near real time, as otherwise we quickly accumulate backlogs.
This means all components, the hardware, databases, and software, have to run and work robustly more or less around the clock. The IDTFL database grows daily by a few hundred gigabytes, but not all data has to be kept forever. There is an automatic cleanup process running that deletes data that falls out of chosen retention periods. Keeping all this machinery running around the clock is tough!
Vik Nagjee: Gaia’s data pipeline imposes some rather stringent requirements on the data platform used for the Initial Data Treatment/First Look (IDT/FL) database. The technology must be capable of ingesting a large amount of data and converting it into complex objects very quickly. In addition, the data needs to be immediately accessible for just-in-time analytics using SQL.
ESAC initially attempted to use traditional relational technology for the IDT/FL database, but soon discovered that a traditional RDBMS couldn’t ingest discrete objects quickly enough. To achieve the required insert rate, the data would have to be ingested as large BLOBs of approximately 50,000 objects, which would make further analysis extremely difficult. In particular, the first look process, which requires rapid, just-in-time analytics of the discrete astrometric data, would be untenable. Another drawback to using traditional relational technology, in addition to the typical performance and scalability challenges, was the high cost of the hardware that would be needed.
Since traditional RDBMS technology couldn’t meet the stringent demands imposed by CU3, ESAC decided to use InterSystems Caché.
Q9. How did you solve such challenges and what lessons did you learn until now?
Uwe Lammers: I have a good team of talented and very motivated people and this is certainly one aspect.
In case of problems we are also totally dependent on quick response times from the hardware vendors, the software developers and InterSystems. This has worked well in the past, and InterSystems’ excellent support in all cases where the database was involved is much appreciated. As far as the software is concerned, the clear lesson is that rigorous validation testing is essential – the more the better. There can never be too much. As a general lesson, one of my favorite quotes from Einstein captures it well: “Everything should be made as simple as possible, but no simpler.”
Q10. What is the usefulness of the CU3’s IDT/FL database for the Gaia’s mission so far?
Uwe Lammers: It is indispensable. It is the central working repository of all input/output data for the daily pipeline including the important health monitoring of the satellite.
Vik Nagjee: The usefulness of CU3’s IDT/FL database was proven early in Gaia’s mission. During the commissioning period for the satellite, an initial look at the data it was generating showed that extraneous light was being gathered. If the situation couldn’t be corrected, the extra light could significantly degrade Gaia’s ability to see and measure faint objects.
It was hypothesized that water vapor from the satellite outgassed in the vacuum of space, and refroze on Gaia’s mirrors, refracting light into its focal plane. Although this phenomenon was anticipated (and the mirrors equipped with heaters for that very reason), the amount of ice deposited was more than expected. Heating the mirrors melted the ice and solved the problem.
Scientists continue to rely on the IDT/FL database to provide just-in-time feedback about the efficacy and reliability of the data they receive from Gaia.
Qx Anything else you wish to add?
Uwe Lammers: Gaia is by far the most interesting and challenging project I have every worked on.
It is fascinating to see science, technology, and a large diverse group of people working together trying to create something truly great and lasting. Please all stay tuned for exciting results from Gaia to come!
Vik Nagjee: As Dr. Lammers said, Gaia is truly one of the most interesting and challenging computing projects of all time. I’m honored to have been a contributor to this project, and cannot wait to see the results from the Gaia catalog. Here’s to unraveling the chemical and dynamical history of our Galaxy!
Dr. Uwe Lammers, Gaia Science Operations Manager at the European Space Agency.
Uwe Lammers has a PhD in Physics and a degree in Computer Science and has been working for the European Space Agency on a number of space science mission for the past 20 years. After being involved in the X-ray missions
EXOSAT, BeppoSAX, and XMM-Newton, Gaia caught his attention in 2004.
As of late 2005, together with William O’Mullane, he built up the Gaia Science Operations Centre (SOC) at ESAC near Madrid. From early 2006 to mid-2014 he was in charge of the development of AGIS and is now leading the SOC as Gaia Science Operations Manager.
Vik Nagjee is a Product Manager for Data Platforms at InterSystems.
He’s responsible for Performance and Scalability of InterSystems Caché, and spends the rest of his time helping people (prospects, application partners, end users, etc.) find perfect solutions for their data, processing, and system architecture needs.
Follow ODBMS.org on Twitter: @odbmsorg