High Performance High Functionality Big Data Software Stack
Geoffrey Fox, Judy Qiu, Shantenu Jha
Indiana University and Rutgers University
Big data is important in all areas, including research, government, and commercial applications. Of the 51 use cases gathered in the NIST study, 34, 8, and 9 fell into these three categories respectively. Indeed, the largest datasets are probably those associated with commercial clouds, including search and social media. It is estimated that this year there are approximately 6 zettabytes of stored data, and Cisco estimates almost a zettabyte of total IP traffic this year; by contrast, the largest individual science applications, such as the LHC, hold "only" around 0.0001 zettabytes (100 petabytes). Other research applications in the NIST study were typically fractions of a petabyte, with much larger astronomy projects (LSST, SKA) underway, and imagery (light sources, medical, surveillance, and radar such as EISCAT-3D) large and growing in size; non-cardiac medical imagery alone amounts to around 70 petabytes a year in the USA. This broad interest in big data has spurred frantic software activity, much of it aimed at commercial cloud deployments, as was evident in the NIST study both in the architecture discussions and in the details of many use cases. Here we suggest that it is valuable to understand the large-scale commercial approaches, how they can be made use of, and, where appropriate, how they can be integrated into HPC and exascale data environments. This approach is helped by the broad use of open-source technology in the consensus commercial clouds, with many Apache software projects in particular contributing to what we can term ABDS, the Apache Big Data Stack.
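As a quick sanity check of the data-volume arithmetic quoted above (the specific byte counts below are illustrative round numbers taken from the figures in the text, not new measurements):

```python
# Unit arithmetic behind the comparisons in the paragraph above (SI units).
PB = 10**15  # one petabyte in bytes
ZB = 10**21  # one zettabyte in bytes

lhc_bytes = 100 * PB     # LHC archive, ~100 petabytes
stored_bytes = 6 * ZB    # estimated global stored data this year

print(lhc_bytes / ZB)             # 0.0001 zettabytes, as stated
print(stored_bytes / lhc_bytes)   # LHC is ~1/60,000 of global stored data
```

So even the largest single science dataset is roughly four orders of magnitude below the global stored-data estimate, which is the point the paragraph is making.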
Download article (.PDF): HPCandApacheBigDataFinal