Objects in Space.
Objects in Space.
–The biggest data processing challenge to date in astronomy: The Gaia mission.–
I became aware of a very interesting project at the European Space Agency:the Gaia mission. It is considered by the experts “the biggest data processing challenge to date in astronomy“.
I wanted to know more. I have interviewed William O`Mullane, Science Operations Development Manager, at the European Space Agency, and Vik Nagjee, Product Manager, Core Technologies, at InterSystems Corporation, both deeply involved with the Proof-of-Concept of the data management part of this project.
Hope you`ll enjoy this interview.
Q1. Space missions are long-term. Generally 15 to 20 years in length. The European Space Agency plans to launch in 2012 a Satellite called Gaia. What is Gaia supposed to do?
Well, we now have learned we launch in early 2013, delays are fairly common place in complex space missions. Gaia is ESA’s ambitious space astrometry mission, the main objective of which is to astrometrically and spectro-photometrically map 1000 Million celestial objects (mostly in our galaxy) with unprecedented accuracy. The satellite will downlink close to 100 TB of raw telemetry data over 5 years.
To achieve its required accuracy of a few 10s of Microarcsecond astrometry, a highly involved processing of this data is required. The data processing is a pan-European effort undertaken by the Gaia Data processing and Analysis consortium. The result a phase space map of our Galaxy helping to untangle its evolution and formation.
Q2. In your report, “Charting the Galaxy with the Gaia Satellite”, you write “the Gaia mission is considered the biggest data processing challenge to date in astronomy. Gaia is expected to observe around 1,000,000,000 celestial objects”. What kind of information Gaia is expected to collect on celestial objects? And what kind of information Gaia itself needs in order to function properly?
Gaia has two telescopes and a Radial Velocity Spectrometer (RVS). From each telescope simultaneously images are taken to calculate position on the sky and magnitude. Two special sets of CCDs at the end of the focal plane record images in red and blue bands to provide photometry for every object. Then for a large number of objects spectrographic images are recorded.
From the combination of this data we derive very accurate positions distances and motions for celestial objects. Additionally we get metalicites and temperatures etc. which allow the objects to be classified in different star groups. We use the term celestial objects since not all objects that Gaia observes are stars.
Gaia will, for example, see many asteroids and improve their know orbits. Many planets will be detected (though not directly observed).
As for any satellite Gaia has a barrage of on board instrumentation such as gyroscopes, star trackers and thermometers which are read out and downlinked as ‘house keeping’ telemetry. All of this information is needed to track Gaia’s state and position. Perhaps rather more unique for Gaia is the use of the data taken through scientific instruments for ‘self’ calibration. Gaia is all about beating noise with statistics.
Q3. All Gaia data processing software is written in Java. What are the main functionalities of the Gaia data processing software? What were the data requirements for this software?
The functionality is to process all of the observed data and reduce it to a set of star catalogue entries which represent the observed stars. The data requirements vary – in the science operations centre we downlink 50-80GB per day so there is a requirement on the daily processing software to be able to process this volume of data in 8 hours. This is the strictest requirement because if we are swamped by the incoming data stream we may never recover.
The Astrometric solution involves a few TB of data extracted from the complete set, the process runs less frequently (each six months to one year) the requirement on that software is to do its job in 4 weeks.
There are many systems like this for photometry, spectroscopy, non single stars, classification, variability analysis, each have there own requirements on data volume and processing time.
Q4. What are the main technical challenges with respect to data processing, manipulation and storage this project poses?
The sheer volume of data that is expected to be captured by the Gaia satellite poses a technical challenge. For example, 1 billion celestial objects will be surveyed, and roughly 1000 observations (100*10) will be captured for each object, totaling around 1000 billion observations; each observation is represented as a discrete Java object and contains many properties that express various characteristics of these celestial bodies.
The trick is to not only capture this information and ingest (insert) it into the database very quickly, but to also be able to do this as discrete (non-BLOB) objects so that downstream processing can be facilitated in an easy manner. Additionally, the system needs to be optimized for very fast inserts (writes) and also for very fast queries (reads). All of this needs to be done in a very economical and cost-effective fashion – reducing power, energy, cooling, etc. requirements, and also reducing costs as much as possible.
These are some of the challenges that this project poses.
Q5. A specific part of the Gaia data processing software is the so called Astrometric Global Iterative Solution (AGIS). This software is written in Java. What is the function of such module? And what specific data requirements and technical challenges does it have?
AGIS is a solution or program that iteratively turns the raw data into meaningful information.
AGIS takes a subset of the data, so called well behaved or primary stars and basically fits the observations to the astrometric model. This involves refining the satellite attitude (using the known basic angle and the multiple simultaneous observations in each telescope) and calibration (given an attitude and known star positions they should appear at a definite points in the CCD at a given times). It`s a huge (too huge infact) matrix inversion but we do a block iterative estimation based on the conjugate gradient method. This requires looping (or iterating) over the observational data up to 40 times to converge the solution. So we need the IO to be reasonable -we know the access pattern though so we can organize the data such that the reads are almost serial.
Q6. In your high level architecture you use two databases, a so called Main Database and an AGIS Database. What is the rational for this choice and what are the functionalities expected from the two databases?
Well the Main Database will hold all data from Gaia and the products of processing. This will grow from a few TBs to few hundreds of TBs during the mission. It`s a large repository of data. Now we could consider lots of tasks reading and updating this but the accounting would be a nightmare – and astronomers really like to know the provenance of their results. So we made the Main Database a little slower to update declaring a version as immutable. One version is then the input to all processing tasks the outputs of which form the next version.
AGIS meanwhile (and other tasks) only require a subset of this data. Often the tasks are iterative and will need to scan over the data frequently or access it in some particular way. Again it is easier for the designer of each task to also design and optimize his data system than try to make one system work for all.
Q7. You write that the AGIS database will contain data for roughly 500,000,000 sources (totaling 50,000,000,000 observations). This is roughly 100 Terabyte of Java data objects. Can you please tell us what are the attributes of the Java objects and what do you plan to do with these 100 Terabyte of data? Is scalability an important factor in your application? Do you have specific time requirements for handling the 100 Terabyte of Java data objects?
Last question first – 4 weeks is the target time for running AGIS, that’s ~40 passes through the dataset. The 500 Million sources plus observations for Gaia is a subset of the Gaia Data we estimate it to be around 10TB. Reduced to its quintessential element each object observation is a time, the time when the celestial image crosses the fiducial line on the CCD. For one transit there are 10 such observations with an average of 80 transits per source. Carried with that is the small cut out image over the 5 year mission lifetime, from the CCD. In addition there are various other pieces of information such as which telescope and residuals which are carried around for each observation. For each source we calculate the astrometric parameters which you may see as 6 numbers plus errors. These are: the position (alpha and delta) the distance or parallax (varPi) the transverse motion (muAlpha and muDelta) and the radial velocity (muRvs).
Then there is a Magnitude estimation and various other parameters. We see no choice but to scale the application to achieve the run times we desire and it has always been designed as a distributed application.
The AGIS Data Model comprises several objects and is defined in terms of Java interfaces. Specifically, AGIS treats each observation as a discrete AstroElementary object. As described in the paper, the AstroElementary object contains various properties (mostly of the IEEE long data type) and is roughly 600 bytes on disk. In addition, the AGIS database contains several supporting indexes which are built during the ingestion phase. These indexes assist with queries during AGIS processing, and also provide fast ad-hoc reporting capabilities. Using InterSystems Caché, with its Caché eXTreme for Java capability, multiple AGIS Java programs will ingest the 100 Terabytes of data generated by Gaia as 50,000,000,000 discrete AstroElementary objects within 5 days (yielding roughly 115,000 object inserts per second, sustained over 5 days).
Internally, we will spread the data across several database files within Caché using our Global and Subscript Mapping capabilities (you can read more about these capabilities here) ,while providing seamless access to the data across all ranges. The spreading of the data across multiple database files is mainly done for manageability.
Q8. You conducted a proof-of-concept with Caché for the AGIS database. What were the main technical challenges of such proof-of-concept and what are the main results you obtained? Why did you select Caché and not a relational database for the AGIS database?
We have worked with Oracle for years and we can run AGIS on Derby. We have tested MySql and Postgress (though not with AGIS). To make the relation systems work fast enough we had to reduce our row count – this we did by effectively combining objects in blobs with the result that the RDBMs became more like a files system. Tests with Caché have shown we can achieve the read and write speeds we require without having to group data in blobs. This obviously is more flexible. It may have been possible to do this with another product but each time we had a problem InterSystems came (quickly) and showed us how to get
around it or fixed it. For the recent test we requested writing of specific representative dataset within 24 hours our our hardware – this was achieved in 12 hours. Caché is also a more cost-effective solution.
The main technical challenge of such a Proof-of-Concept is to be able to generate realistic data and load on the system, and to tune and configure the system to be able to meet the strict insert requirements, while still optimizing sufficiently for down-stream querying of the data.
The white paper(.pdf)” discusses the results, but in summary, we were able to ingest 5,000,000,000 AstroElementary objects (roughly 1/10th of the eventual projected amount) in around 12 hours. Our target was to ingest this data within 24 hours, and we were successful at being able to do this in 1/2 the time.
Caché is an extremely high-performance database, and as the Proof-of-Concept outlined in the white paper proves, Caché is more than capable of handling the stringent time requirements imposed by the Gaia project, even when run on relatively modest hardware.
Q9. How do you handle data ingestion in the AGIS database, and how do you publish back updated objects into the main database?
We use the Caché eXTreme for Java capability to interact between the Java AGIS application.
Q10. One main component of the proof-of-concept is the new Caché eXTreme for Java. Why is it important, and how did it get used in the proof-of-concept? How do you ensure low latency data storage and retrieval in the AGIS solution?
Caché eXTreme for Java is a new capability of the InterSystems Caché database that exposes Caché’s enterprise and high-performance features to Java via the JNI (Java Native Intervace). It enables “in-process” communication between Java and Caché, thereby providing extremely low-latency data storage and retrieval.
Q11. What are the next steps planned for AGIS project and the main technical challenges ahead?
The next series of testing will focus on ingesting even more data sources –up to 50% of the total projected objects. Next, we’ll work on tuning the application and system for reads (queries), and will also continue to explore additional deployment options for the read/query phase (for example, ESA and InterSystems are looking at deploying hundreds of AGIS nodes in the Amazon EC2 cloud so as to reduce the amount of hardware that ESA has to purchase).
Well on the technical side we need to move AGIS definitively to Caché for production. There are always communication bottlenecks to be investigated which limit scalability. AGIS itself requires several further developments such as a more robust outlier scheme and a more complete set of calibration equations. AGIS is in a good state but needs more work to deal with the mission data.
William O`Mullane, Science Operations Development Manager, European Space Agency.
William O’Mullane has a background in Computer Science and has worked on space science projects since 1996 when he assisted with the production of the Hipparcos CDROMS. During this period he was also involved with the Planck and Integral science ground segments as well as contemplating the Gaia data processing problem. From 2000-2005 Wil worked on developing the US National Virtual Observatory (NVO) and on the Sloan Digital Sky Survey (SDSS) in Baltimore, USA. In August 2005 he rejoined the European Space Agency as Gaia Science Operations Development Manager to lead the ESAC development effort for the Gaia Data Processing and Analysis Consortium.
Vik Nagjee, Product Manager, Core Technologies, InterSystems Corporation.
Vik Nagjee is the Product Manager for Core Technologies in the Systems Development group at InterSystems. He is responsible for the areas of Reliability, Availability, Scalability, and Performance for InterSystems’ core products – Caché and Ensemble. Prior to joining InterSystems in 2008, Nagjee held several positions, including Security Architect, Development Lead, and head of the performance & scalability group, at Epic Systems Corporation, a leading healthcare application vendor in the US.
For additional reading:
The Gaia Mission :
1. “Pinpointing the Milky Way” (download paper, .pdf).
2. “Gaia: Organisation and challenges for the data processing” (download paper, .pdf).
3. “Gaia Data Processing Architecture 2009” (download paper, .pdf).
4. “To Boldly Go Where No Man has Gone Before: Seeking Gaia’s Astrometric Solution with AGIS” (download paper, .pdf).
The AGIS Database
5. “Charting the Galaxy with the Gaia Satellite”.(download white paper, .pdf)”
really very nice and interesting, thanks 🙂
This is a great article!
I think that some interesting options for the Gaia project and Amazon Web Services would be:
1) Reserved EC2 Instances
2) Spot EC2 Instances
There can be significant cost savings if you can use the right combination of Reserved, Spot, or some on-Demand EC2 instances.
Interesting article and use of Java. I wonder if the project thinks it will need 64 bit implementations of the JVM to make make it work?
Very interesting article! It would have been great to learn more about other parts of your infrastructure although I understand that Roberto’s main focus is on databases 😉
– how do you ensure non-blocking ingress?
– how do you deal with the load of multiple tenants running queries in parallel on the vast amounts of data?
– how do you deal with application prioritization? Do you over-provision to account for peaks?
– what tools do you use to track performance / application health?
There are probably a lot of parallels between your implementation and a more normal SaaS setup but there are definitely areas that are unique to your use case that would be very interesting to learn from. Where can we learn more about how you have set up your infrastructure?
Leveraging the cloud makes sense (you mention 100s of AWS EC2 instances) to ensure elasticity but I hope you also look at solutions that enable you to do more with your existing infrastructure by increasing your application density – particularly as you mentioned your goal to reduce energy consumption. Linux Containers or Librato Silverline would be a more elegant approach than throwing VMs at the problem.
I hope we get an update on the project further down the road!
Thank you for sending this interesting articles. It’s amazing how far
we have come along in terms of handling database to the tune of 10^12!
Thank you again for sharing such good information and making the
internet much more valuable than a mere social network…:)