Objects in Space vs. Friends in Facebook.
“Data is everywhere, never be at a single location. Not scalable, not maintainable.”–Alex Szalay
I recently reported about the Gaia mission which is considered by the experts “the biggest data processing challenge to date in astronomy“.
Alex Szalay- who knows about data and astronomy, having worked from 1992 till 2008 with the Sloan Digital Sky Survey together with Jim Gray – wrote back in 2004:
“Astronomy is a good example of the data avalanche. It is becoming a data-rich science. The computational-Astronomers are riding the Moore’s Law curve, producing larger and larger datasets each year.” [Gray,Szalay 2004]
Gray and Szalay observed: “If you are reading this you are probably a “database person”, and have wondered why our “stuff” is widely used to manage information in commerce and government but seems to not be used by our colleagues in the sciences. In particular our physics, chemistry, biology, geology, and oceanography colleagues often tell us: “I tried to use databases in my project, but they were just to [slow | hard-to-use |expensive | complex ]. So, I use files.” Indeed, even our computer science colleagues typically manage their experimental data without using database tools. What’s wrong with our database tools? What are we doing wrong? “ [Gray,Szalay 2004].
Six years later, Szalay in his presentation “Extreme Data-Intensive Computing” presented what he calls “Jim Gray`s Law of Data Engineering”:
1. Scientific computing is revolving around data.
2. Need scale-out solution for analysis.” [Szalay2010].
He also says about Scientific Data Analysis, or as he calls it (DISC: Data Intensive Scientific Computing): “Data is everywhere, never be at a single location. Not scalable, not maintainable.”[Szalay2010]
I would like to make three observations:
i. Great thinkers do anticipate the future. They “feel” it. Better said, they “see” more clearly how things really are.
Consider for example what the philosopher Friedrich Nietzsche wrote in his book “Thus Spoke Zarathustra”: “The middle is everywhere.” Confirmed 128 years later by “Data is everywhere”….
ii. “Astronomy is a good example of the data avalanche”: the Universe is beyond our comprenshion, which means I believe, that ultimately we will figure out that indeed “data is not scalable, and not maintainable.”
iii. I now dare to twist the quote: “If you are reading this you are probably a “database person”, and have wondered why our “stuff” is widely used to manage information in commerce and government but seems to not be used by our colleagues at Facebook or Google”….
I have asked Professor Alex Szalay for his opinion.
Alexander Szalay is a professor in the Department of Physics and Astronomy of the Johns Hopkins University. His research interests are theoretical astrophysics and galaxy formation.
RVZ
Alex Szalay: This is very flattering… and I agree. But to be fair, the Facebook guys are using databases, first MySQL, and now Oracle in the middle of their whole system.
I have recently heard a talk by Jeff Hammerbacher, who built the original infrastructure for Facebook. Now he quit, and formed Cloudera. He did explicitly say that in the middle there will always be SQL, but people use Hadoop/MR for the ETL layer… and R and other tools for analytics and reporting.
As far as I can see Google is also gently moving towards not quite a database yet, but Jeff Dean is building Bloom filters and other indexes into BigTable. So even if it is NoSQL, some of their stuff starts to resemble a database….
So I think there is a general agreement that indices are useful, but for large scale data analytics, we do not need full ACID, transactions are much more a burden than an advantage. And there is a lot of religion there, of course.
I would put it in such a way, that there is a phase transition coming, and there is an increasing diversification, where there were only three DB vendors 5 years ago, now there are many options and a broad spectrum of really interesting niche tools. In a healthy ecosystem everything is a 1/f power law, and we will see a much bigger diversity. And this is great for academic research. “In every crisis there is an opportunity” — we again have a chance to do something significant in academia.
RVZ: The National Science Foundation has awarded a $2M grant to you and your team of co-investigators from across many scientific disciplines, to build a 5.2 Petabyte Data-Scope, a new instrument targeted at analyzing the huge data sets emerging in almost all areas of science. The instrument will be a special data-supercomputer, the largest of its kind in the academic world.
What is the project about?
Alex Szalay: We feel that the Data-Scope is not a traditional multi-user computing cluster, but a new kind of instrument, that enables people to do science with datasets ranging between 100TB and 1000TB.
This is simply not possible today. The task is much more, than just throw the necessary storage together.
It requires a holistic approach: the data must be first brought to the instrument, then staged, and then moved to the computing nodes that have both enough compute power and enough storage bandwidth (450GBps) to perform the typical analyses, and then the (complex) analyses must be performed.
RVZ: Could you please explain what are the main challenges that this project poses?
Alex Szalay: It would be quite difficult, if not outright impossible to develop a new instrument with so many cutting-edge features without adequately considering all aspects of the system, beyond the hardware. We need to write at least a barebones set of system management tools (beyond the basic operating system etc), and we need to provide help and support for the teams who are willing to be at the “bleeding-edge” to be able to solve their big data problems today, rather than wait another 5 years, when such instruments become more common.
This is why we feel that our proposal reflects a realistic mix of hardware and personnel, which leads to a high probability of success.
The instrument will be open for scientists beyond JHU. There was an unbelievable amount if interest just at JHU in such an instrument, since analyzing such data sets is beyond the capability of any group on campus. There were 20 groups with data sets totaling over 2.8PB just within JHU, who would use the facility immediately, if it was available. We expect to go no-line at the end of this summer.
References
[Szalay2010]
Extreme Data-Intensive Computing (.pdf)
Alex Szalay, The Johns Hopkins University, 2010.
[Gray,Szalay 2004]
Where the Rubber Meets the Sky: Bridging the Gap between Databases and Science.
Jim Gray,Microsoft Research and Alex Szalay,Johns Hopkins University.
IEEE Data Engineering Bulletin and Technical Report, MSR-TR-2004-110, Microsoft Research, 2004
Friedrich Nietzsche,
Thus Spoke Zarathustra: a Book for Everyone and No-one. (Also Sprach Zarathustra: Ein Buch für Alle und Keinen) – written between 1883 and 1885.
Related Posts
Objects in Space
Objects in Space: “Herschel” the largest telescope ever flown.
Objects in Space. –The biggest data processing challenge to date in astronomy: The Gaia mission.–
Big Data
Hadoop for Business: Interview with Mike Olson, Chief Executive Officer at Cloudera.
The evolving market for NoSQL Databases: Interview with James Phillips.
#
“in the middle there is always SQL” – SQL is used to manage and organize data about data stored in whatever we have to store this amount of data.
In the middle of the middle there is a model – (ER I think) to describe the meaning of the tables and their relationships. With thousands of databases in use (which is quite common in a large organization) also this “model” represents quite a large complex volume of data – and it is dynamic, multiuser and hopefully ACID compliant. SAP reportedly consists of a collection of 90000 tables (guess the number of rows and relationships!). The obvious conclusion is that there must be another layer to control this structure and to reduce the span of control.
I think the real challenge is to build a hiearchical layer of data structures which reduce the amount of data to manageable sets of information, which provide an insight into the relevant meaning of the PBs of raw data.