Survey of Apache Big Data Stack
For the PhD Qualifying Exam 12/16/2013
Advisory Committee: Prof. Geoffrey Fox, Prof. David Leake, Prof. Judy Qiu
Over the last decade there has been an explosion of data. Large amounts of data are generated by scientific experiments, government records, large Internet sites and sensor networks. The term Big Data was introduced to identify data that cannot be captured, curated, managed or processed by traditional tools in a reasonable amount of time. With the amount of data generated increasing every day and no end in sight, there is a strong need to store, process and analyze these huge volumes of data. A rich set of tools, ranging from storage management to data processing and data analytics, has been developed for this specific purpose. Because of the sheer volume of data, a large amount of computing power and storage space is required to process it. Using specialized hardware such as supercomputing infrastructure for this processing is not economically feasible most of the time. Large clusters of commodity hardware are a good economical alternative for obtaining the required computing power. But such clusters impose challenges that are not seen in traditional high-end hardware, and big data frameworks must be specifically designed to overcome these challenges in order to be successful.
Open source software plays a key role in the big data field, and there is a growing trend to make big data solutions open and free to the general public. Open source software dominates the big data solutions space, and we hardly hear the names of proprietary software giants like Microsoft, Oracle and IBM in it. Open source development is rapidly changing both the pace of innovation in the big data field and how we generally think about big data solutions.
The free software movement, from which the open source movement grew, began in 1983 with the start of the GNU Project by Richard Stallman. Open source software development is well studied in Information Science research, and the development methods used by open source communities have proven successful in projects that require human innovation and passion. One of the most important factors for a successful open source project is the community around it. The community drives the innovation and development of an open source project. Well-functioning, diverse communities can create software solutions that are robust and accommodate a diverse set of requirements.
The Apache Software Foundation (ASF) is a non-profit open source software foundation that has put itself in a very important position in the big data space. The foundation has a diverse development and user community spanning the globe and is home to some of the most widely used software projects. In ASF the community is considered more important than all other requirements, and this has been the main reason behind its success. As a result, ASF provides a flexible and agile environment for open source project development, along with the infrastructure and legal framework necessary for maintaining successful projects. Because of its success as an open source foundation, ASF has attracted a large share of the successful big data projects of the past few years.
The other leading open source software platform for big data projects is GitHub. GitHub is a Git-based code repository host for open source projects, and it doesn't provide the legal and organizational framework that ASF provides. GitHub is an ad hoc platform for developing open source software, and communities form in an ad hoc manner around the successful projects. Software foundations such as ASF, universities, and companies such as Netflix and LinkedIn use GitHub to host their open source projects. There are successful big data projects developed on GitHub, and communities have formed around them.
There is a recent trend among large Internet companies to make much of their operational code available as free, open source platforms. Netflix and LinkedIn are pioneers in making their source code open to the public. Various projects created by large software companies such as Yahoo, Facebook and Twitter are being donated to the public through open source software foundations. The process benefits both the community and the original software creators. The original creators get their code exposed to a diverse community, which helps the products mature and evolve quickly. The software gets battle tested in all kinds of scenarios for free, which makes the product resilient. One of the most rewarding aspects of making software open source may be the credibility and trust that the leaders of open source projects gain among their peer developers.
Even though there are many open source platforms for big data projects, the Apache Software Foundation has emerged as the clear leader in the open source big data space, hosting the key big data projects. ASF has a complete stack of big data solutions, ranging from basic storage management to complex data analytics. These are mostly autonomous projects, intertwined by the technologies they use, yet together they form a complete solution stack for big data. It is interesting to understand the inter-relationships among these projects and their internals. In this paper we analyze the big data ecosystem in ASF by categorizing the projects into a layered big data stack.
The first part of the report focuses on the overall layered architecture of the big data stack, and the rest of the report discusses each layer, starting from the bottom.