Open Source Software and IBM’s Big Data platform
By Cynthia M. Saracco, senior solutions architect at IBM’s Silicon Valley Laboratory.
It’s a question I hear a lot: why is IBM basing so much of its data and analytics strategy around software that it doesn’t own – specifically, around Apache Spark, Apache Hadoop, and complementary open source projects? The answer is pretty simple: market demand.
It’s not hard to find statistics about the phenomenal growth rates of digital data and the strong global interest that many organizations have in managing and analyzing all that varied data. And even a cursory look at the various conferences, Meetups, papers, and technical courses dedicated to Big Data, data science, and data analytics points to the popularity of open source offerings.
Vendors who ignore this reality risk losing business. And IBM is no exception. That’s why its Big Data platform features software or cloud-based services that leverage, integrate with, and add value to the open source software preferred by many organizations.
Curious? Skeptical? Read on.
The growth of open source software
A 2015 independent survey from North Bridge and Black Duck revealed that the number of firms running all or part of their business operations on open source software has nearly doubled since 2010. Furthermore, more than 66% of respondents said they look to open source software before commercial options. Why? Open source software is often associated with speeding innovation, creating competitive advantage, improving productivity and ease of use, providing scalability, helping firms contain costs, and providing other benefits.
Want to know which technology areas are expected to be most impacted by open source software in the near future? They’re areas that IBM has identified as key to its business interests, including cloud computing, Big Data, and the Internet of Things (IoT).
About Apache Spark and Apache Hadoop
Let’s focus a bit on Spark and Hadoop. Both enable firms to analyze and process vast amounts of data captured in a variety of formats. Spark’s in-memory processing provides fast runtime performance and its software libraries for machine learning, streaming, SQL access, etc. provide development productivity. Hadoop stores and processes massive amounts of data distributed across a computing cluster; it’s often used to create a “data lake” or “data reservoir” for advanced analytics. And, of course, Spark and Hadoop can be used separately or together to address various Big Data challenges.
According to a 2015 IBM study, “Analytics: The upside of disruption,” 57% of surveyed firms indicated that they had already implemented or were planning to implement a Big Data platform based on Hadoop/Spark. The same study indicated that widespread use of Big Data technologies has more than doubled when compared with 2012-2014 usage. Finally, surveyed firms were largely bullish on an analytics platform based on Hadoop/Spark; indeed, 79% of implementers said that these technologies had (or would have) a positive disruptive impact on their organizations.
Furthermore, a recent Wikibon survey showed that more than 85% of organizations that used Hadoop had moved or were planning to move workloads from traditional data warehouses (such as Teradata or Oracle) or from IBM mainframes to Hadoop.
What’s IBM really doing?
So, with that backdrop, perhaps it’s not so hard to understand why IBM is investing in various initiatives based on open source software. But is its investment substantial or merely symbolic?
As of this writing, several IBM Big Data offerings use open source technologies at their core: the IBM Open Platform for Apache Hadoop and Spark, IBM Analytics for Apache Spark (a Bluemix cloud-based service), IBM BigInsights, and IBM BigInsights on Cloud (a Bluemix service). In addition, IBM offers a text analytics engine and a high-performance SQL query engine (Big SQL) for its Hadoop-based platform. Various IBM offerings have already announced or delivered support for Spark or Hadoop, including SPSS Modeler, IBM DataWorks, and Watson Health. IBM is even providing technical support for a 100% open source Big Data platform through its Elite Support offering for the IBM Open Platform for Apache Hadoop and Spark.
But IBM isn’t simply embedding open source software in its fee-based products and services. IBM’s newly created Spark Technology Center in San Francisco employs engineers who contribute to Spark and help facilitate its adoption. Other IBM developers contribute to Hadoop and related projects, and IBM recently donated internally developed machine learning software to the open source community. In addition, IBM is actively supporting efforts to standardize core Big Data technologies through ODPi, an industry group spanning various Big Data providers and consumers worldwide. Furthermore, for those interested in learning more about key open source technologies for data science and analytics, IBM has developed free online courses for Big Data University and posted various technical materials on Hadoop Dev.
So what lies ahead? Ultimately, client demands will determine that for IBM and other vendors. But all indications point to open source software continuing to play a very significant role.