Business Big Data from a Scientific High Performance Computing (HPC) perspective
By Luigi Scorzato, PhD, Accenture AG.
I have been using supercomputers and PC-clusters for scientific applications since I started my PhD in particle physics in 1996. In the last few years, I have also been using very similar hardware (HW) and software (SW) tools for business applications. When I tell my scientist friends and former colleagues what I am doing now, they recognise the large similarities, but they are quite puzzled when I tell them that … no, we do not use MPI (www.mpi-forum.org) here. Their bewilderment is very similar to that of my present colleagues when I tell them what I was doing in science, but … no, nobody uses Hadoop (hadoop.apache.org) there…
Why such different tools? Some folklore says that HPC applications are very different from Big Data applications. This is not accurate. Of course, all applications are different, but, from a computational point of view, I have seen far more variety within HPC applications and within Big Data applications than between the two domains. This misconception has some historical reasons. The HPC community has always emphasised the counting of FLOPS (floating point operations per second). This is an old, traditional measure that is hard to dismiss, but it is less and less useful: it is now quite cheap to put many floating point units on a chip. For as long as I can recall, the main bottlenecks in HPC applications have always been related to data movement, up and down the memory hierarchy, just like in Big Data (the small sketch below illustrates why counting FLOPS misses this point). So, why are the tools so different? Let us have a closer look at them.
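To make the data-movement point concrete, here is a minimal sketch, in the spirit of the well-known STREAM "triad" loop (not taken from any particular application): each iteration moves 24 bytes but performs only 2 floating point operations, so on any recent machine the memory system, not the floating point units, sets the speed limit.

```c
#include <stdlib.h>

/* STREAM-like "triad": a[i] = b[i] + s*c[i].
 * 2 flops per iteration vs. 24 bytes of memory traffic,
 * so the loop is memory bound, and counting FLOPS says
 * little about how fast it will actually run. */
void triad(double *a, const double *b, const double *c, double s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + s * c[i];
}
```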
MPI stands for Message Passing Interface, and it is a set of standardised routines to exchange data between processes in a network (with or without shared memory). MPI was developed in the early 1990s to replace a variety of HW- and vendor-dependent communication routines with a single, portable standard. The MPI standard defines APIs (Application Program Interfaces) for Fortran, C and C++ (there is also a de facto standard for Java). Since 1996, various implementations of the MPI standard have existed (e.g. MPICH, LAM/MPI, openMPI…), and they have been hugely successful in the HPC community. They let us port our code smoothly between any PC-cluster and any supercomputer.
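For readers who have never seen it, the following is a minimal, self-contained sketch of the MPI programming model (not taken from any real application): every process runs the same executable, learns its own rank, and exchanges data through explicit calls. It would typically be compiled with a wrapper such as mpicc and launched with mpirun.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);                  /* start the MPI runtime            */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* who am I?                        */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* how many processes are running?  */

    if (rank == 0 && size > 1) {
        double payload = 3.14;
        MPI_Send(&payload, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);   /* send to rank 1 */
    } else if (rank == 1) {
        double payload;
        MPI_Recv(&payload, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                                /* receive from rank 0 */
        printf("rank 1 of %d received %f\n", size, payload);
    }

    MPI_Finalize();
    return 0;
}
```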
Since Hadoop was developed only around 2005, it is natural to ask why it was not based on MPI. Tom White, in the introduction of his “Hadoop: The Definitive Guide” (O’Reilly), gives three answers to this question. Tom White was among the early, legendary developers of Hadoop, so his reasons must be historically accurate. It is very interesting to examine them one by one, because they tell us a lot about the potential benefit of a closer interaction between the scientific HPC community and the business big data community.
The first reason why MPI was not used in Hadoop is the principle of “data locality”: within Hadoop, each computing node tries to use its local disk as much as possible. Tom White notes that, in the HPC paradigm, the computing nodes instead typically use an external, shared storage area network (SAN) rather than their local disks. To access the SAN in a simple and efficient way, MPI offers a set of dedicated routines. But many MPI applications do in fact also use the local disks; they just do not need an MPI routine for that, since a plain fprintf suffices. Data locality is of paramount importance in HPC too, of course, but the reason why local disks are not exploited much in HPC, in practice, is the same reason why they are less and less loved within Hadoop as well. When the data are local, we want to keep them in RAM as much as possible. When this is not possible, writing to the local disk has some speed advantages with respect to writing to the external SAN, but it also brings serious disadvantages in terms of the complexity of reshuffling the data in case of failures, restarts, job management, HW management, etc. Having an external SAN is not a luxury of “F1” HPC applications; it is simply the most cost-effective way to manage the workflow of most applications I know of. The success of Spark and other in-memory solutions seems to confirm that this is the case for many business big data applications as well.
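As an illustration of the two I/O paths discussed above, here is a rough sketch (the file paths are purely hypothetical) in which each rank writes its slice of the results both to a single shared file through the MPI-IO routines and to a node-local scratch disk with a plain fprintf:

```c
#include <mpi.h>
#include <stdio.h>

void write_results(const double *local, int nlocal, int rank)
{
    /* (a) shared storage (SAN): one global file, each rank writes at its own offset */
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "/san/results.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_Offset offset = (MPI_Offset)rank * nlocal * sizeof(double);
    MPI_File_write_at_all(fh, offset, local, nlocal, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    /* (b) local disk: ordinary C I/O, one private file per rank, no MPI needed */
    char path[64];
    snprintf(path, sizeof path, "/local/scratch/part-%d.txt", rank);
    FILE *f = fopen(path, "w");
    if (f) {
        for (int i = 0; i < nlocal; i++)
            fprintf(f, "%f\n", local[i]);
        fclose(f);
    }
}
```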
The HPC community feels no obligation toward supercomputers. In fact, every other year for the last 20 years, someone has claimed that commodity PC-clusters have just become more cost effective than dedicated supercomputers for many applications. Indeed, commodity PCs have evolved a lot in the past 20 years. But when we consider the cost of cooling, the big savings made possible by liquid cooling, the cost of a scalable network (which benefits a lot from a compact architecture), the management costs, and many other details, it becomes clear that out-of-the-box PC clusters are rarely the best solution. You certainly need to invest some more work to get a cost-effective configuration. On the other hand, supercomputers have changed even more over the years. As they grew in processor count, they could no longer afford to be the tightly coupled systems they were in the past. As a result, modern supercomputers tend to look more and more like well organised PC-clusters.
For all these reasons, I believe that the experience of the HPC community will be very useful to the business big data community. But the converse is also true: by now, the Hadoop community has developed a wealth of excellent tools to manage the data on the local disks of a cluster effectively. Many HPC users could benefit a lot from that experience.
The second reason why MPI was not used in Hadoop is the lack of fault tolerance in MPI. But what does “fault tolerance” mean? Any system corrects some failures; no system can correct all failures. Because automatic fault recovery inevitably adds very expensive overheads, you want to implement only those automatic recoveries that are really justified by your application, your context and your workflow. Because of the variety of HPC applications, the MPI forum decided to leave the responsibility of managing failures entirely in the hands of the application developers. This was perhaps unfortunate. In fact, much work has been done recently to offer tools that deal with the most common kinds of failures (see e.g. www.open-mpi.org/faq/?category=ft). On the other hand, because of the relative homogeneity of the original Hadoop applications, it made sense to integrate a general system of fault tolerance. As the spectrum of applications that run on Hadoop broadens, I expect more demand for flexibility and modularity in the choice of which fault tolerance to include in which cases. Once again, a closer interaction between the “MPI” community and the “Hadoop” community could benefit both a lot.
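To give an idea of what leaving failures to the application looks like in practice, here is a hedged sketch (the checkpoint file and its format are invented for illustration): MPI is asked to return error codes instead of aborting, and the application itself decides how often to checkpoint and how to react when a communication fails.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* By default MPI aborts on errors; ask it to return error codes instead,
     * so that the application can choose its own recovery policy. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    double state = 0.0;                       /* stand-in for the real application state */
    for (int step = 0; step < 1000; step++) {
        state += 1.0;                         /* ... the real work would go here ...     */

        double global;
        int err = MPI_Allreduce(&state, &global, 1, MPI_DOUBLE,
                                MPI_SUM, MPI_COMM_WORLD);
        if (err != MPI_SUCCESS) {             /* the application decides what a failure means */
            fprintf(stderr, "rank %d: communication failed at step %d\n", rank, step);
            break;
        }

        if (step % 100 == 0 && rank == 0) {   /* periodic, application-level checkpoint */
            FILE *ck = fopen("checkpoint.dat", "w");
            if (ck) { fprintf(ck, "%d %f\n", step, global); fclose(ck); }
        }
    }

    MPI_Finalize();
    return 0;
}
```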
The third reason is the most problematic one. MPI was not used because it exposes the data distribution to the programmer, who must explicitly handle the data flow between the different computing nodes. Hadoop, on the other hand, wants to hide the data distribution from the programmer, who should only care about the high-level logic.
I doubt that Hadoop will continue to support only this paradigm in the long term. In fact, we had very similar wishes within the scientific community. Scientists do not particularly enjoy programming parallel computers; they do it because they have to, and most of them would happily give up a factor of two or more in performance if that were the price of avoiding parallel coding. For some time we dreamed about automatic task distribution, but the results were extremely disappointing. In fact, the data distribution crucially affects the high-level loop structure of a code. Now, it is fairly easy for a good programmer to choose the right loop structure if she is aware of the data distribution. But reshuffling an inadequate loop structure can be impossible for any realistic compiler. Compilers are excellent at reshuffling the low-level organisation of a program, but they get lost at the high level, which is what matters for an effective task distribution. I do not say that automatic task distribution is impossible or useless: it can be very effective for some tasks, with the help of a programmer who is aware of the data distribution. But why hide the data distribution from the programmer if we need her to be aware of it anyway? In fact, this is essentially the reason why Hadoop has a quite steep learning curve for the average programmer, who has only seen serial codes.
One of the reasons for the success of MPI was that people finally gave up the idea of hiding the data distribution from the programmer, and instead identified the simplest possible way of making it explicit. Once we accept this point of view, MPI is a very natural framework.
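As a small illustration, here is a sketch of a distributed sum in which the data distribution is explicit (one contiguous block per rank, assuming for simplicity that the global size divides evenly) and the high-level loop is written directly around it; a single collective call then combines the partial results.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1000000                               /* global problem size (illustrative) */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int nlocal = N / size;                      /* the distribution is explicit...        */
    double *x = malloc(nlocal * sizeof(double));
    for (int i = 0; i < nlocal; i++)            /* ...and the loop is written around it:  */
        x[i] = 1.0 / (rank * nlocal + i + 1);   /* global index = rank*nlocal + i         */

    double local_sum = 0.0, global_sum = 0.0;
    for (int i = 0; i < nlocal; i++)            /* each rank sums only the block it owns  */
        local_sum += x[i];

    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE,
                  MPI_SUM, MPI_COMM_WORLD);     /* one explicit call combines the results */
    if (rank == 0)
        printf("sum = %f\n", global_sum);

    free(x);
    MPI_Finalize();
    return 0;
}
```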
To conclude, it may be unfortunate that the scientific HPC and the business big data communities have had so little interaction, in spite of dealing with very similar problems. But it is never too late: we now have a great opportunity to learn from each other.