How Spark and Hadoop can Drive Business Value Together
John Kreisa, VP International Marketing, Hortonworks.
Recently, Apache Spark set the world of Big Data on fire. With its promise of amazing performance and comfortable APIs, some thought Spark was bound to replace Hadoop MapReduce. But is that the case? On closer inspection, Spark appears instead to be a natural complement to Apache Hadoop YARN, the architectural center of Hadoop…
Hadoop is already transforming many industries, accelerating Big Data projects to help businesses translate information into competitive advantage. Everywhere you look, you can find companies using Hadoop in large-scale projects to enable deep data discovery, to capture a single view of customers across multiple data sets, and to help data scientists perform predictive analytics. In these ways, companies meet current customer needs, anticipate shifting market dynamics and consumer behaviors, and test business hypotheses—all crucial capabilities to help them outmaneuver and outperform their competitors.
The booming demand for Big Data has fueled a dizzying rise in spending on the technologies that make it possible.
One of the most active and remarkable open source projects in the Apache Software Foundation is Apache Spark™, which makes it possible to run programs up to 100X faster than MapReduce using an advanced DAG (directed acyclic graph) execution engine that supports cyclic data flows and in-memory computing. Spark is also developer-friendly, offering APIs in Java, Scala, Python and R with more than 80 high-level operators that make it easy to build parallel apps. Since Spark combines SQL, streaming and complex analytics, it offers broad compatibility with multiple tools—a key advantage for running analytics against diverse data sources.
Apache Spark has generated a lot of excitement in the Big Data community, inspiring contributions by more than 400 developers since the project started in 2009. With the promise of such performance and comfortable APIs, some thought this could be the end of Hadoop MapReduce. But while Spark performs better when all the data fits in memory, especially on dedicated clusters, Hadoop MapReduce is designed for data that doesn’t fit in memory, and it can run well alongside other services. Both have benefits, and most companies won’t use Spark on its own; they still need HDFS to store the data and may want to use HBase, Hive, Pig, Drill or other Hadoop projects. This means they will still need to run Hadoop and MapReduce alongside Spark for a full Big Data package.
The truth is that Spark is a natural complement to Apache Hadoop. Through YARN, the architectural center of Hadoop, multiple data processing engines can interact with data stored in a single platform. YARN provides resource management and a central platform to deliver consistent operations, security, and data governance tools across Hadoop clusters. It unlocks an entirely new approach to Big Data analytics, and Spark is a key pillar in that approach.
Together, Spark and YARN enable a modern data architecture that allows users to store data in a single location and interact with it in multiple ways, using whichever data processing engine best matches the analysis.
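In practice, this means a Spark application is submitted to YARN rather than to a standalone Spark cluster. As a sketch (assuming a cluster with Spark and YARN already configured; the job script name and resource figures below are illustrative assumptions, not recommendations):

```shell
# Hypothetical submission of a PySpark job to a YARN-managed Hadoop cluster.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 2g \
  analytics_job.py
```

YARN then schedules the Spark executors alongside MapReduce, Hive and other workloads, all reading from the same data in HDFS.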
The combination of Spark and Hadoop addresses several organizational challenges. First, it improves storage and processing scalability, which can help cut costs by 20-40% while accommodating high volumes of incoming data. Second, it unifies separate clusters into a single one that supports both Spark and Hadoop workloads. Finally, with only a retrospective view of data, companies have limited predictive capabilities, hampering Big Data’s strategic value in anticipating emerging market trends and customer needs; Spark helps by processing billions of events per day at a blistering analytical pace of 40 milliseconds per event. Tackling these issues with Spark and Hadoop offers companies enormous potential benefits.
We see the adoption of Hadoop and Spark going hand-in-hand, as both are enabling technologies that make it possible for organizations to create new data applications that were previously impossible. This is driving a transformation across the enterprise in nearly every industry, whether in manufacturing, financial services, retail, or telecommunications.
All industries are being impacted by data and all organizations need to realize and drive the value and the potential of the data they have.