On Spring for Apache Hadoop. Interview with Thomas Risberg.
“Spring for Apache Hadoop together with Spring XD is used by many large organizations for developing new big data apps for stream processing and using HDFS for storage.”–Thomas Risberg.
I interviewed Thomas Risberg, Software Engineer focusing on Big Data at Pivotal, about Spring for Apache Hadoop.
Q1. What is new with the current Spring for Apache Hadoop?
Thomas Risberg: The main focus for Spring for Apache Hadoop version 2.0 is to support new distributions based on Hadoop v2 including support for YARN. We provide backwards compatibility with Hadoop v1 based distributions, so as an end-user you can choose when to move to a new Hadoop version.
In Spring for Apache Hadoop 2.0 we are adding YARN application development support in addition to improvements in the HDFS and MapReduce support. We are introducing a Spring Boot based programming model for easy YARN app development. The main goal is to provide a simplified development experience so the developer can focus on getting the business logic implemented and not have to worry about the “plumbing”.
Q2. How do you measure the “effectiveness” of how Spring for Apache Hadoop simplifies developing for Apache Hadoop?
Thomas Risberg: The main differentiator would be developer productivity where Spring for Apache Hadoop provides support for the infrastructure plumbing code and configuration allowing the developer to focus on code that brings business value. The most common measurement would be how long completing a project would take compared to the same project being developed without Spring.
Q3. What is the rationale for offering a unified configuration model and APIs for using HDFS, MapReduce, Pig, and Hive?
Thomas Risberg: Big Data workflow development usually involves parts that are executed on Hadoop and parts that are executed on, or at least interact with, resources outside of Hadoop. So, a unified configuration strategy will help developers when they move between different parts of the workflow. The basic configuration used across all Hadoop components is based on Spring and is therefore similar whether working on a Hive job or an FTP file transfer job.
Q4. How does Spring for Apache Hadoop relate to the Spring Data project, and in general to the overall Spring ecosystem of projects?
Thomas Risberg: Spring for Apache Hadoop is part of the Spring Data umbrella project, but not part of the “release train” that most other Spring Data projects are part of. The reason is that the coupling is not very tight with other Spring Data projects. Spring for Apache Hadoop doesn’t directly use other Spring Data components although anyone using Spring for Apache Hadoop can use the Spring Data MongoDB project when exporting data from HDFS to MongoDB.
Q5. What about the integration with other software systems that are not part of the Spring ecosystem?
Thomas Risberg: In terms of Hadoop we have integration with Hive, Pig and HBase. We also use features from projects outside of the Apache Hadoop family. One example is Kite SDK which is a project that started out at Cloudera but is now a separate project that can be used with any Hadoop distribution. Other examples would be a JSON library like Jackson etc. The whole Spring IO platform uses 689 third party libraries so managing all of this is crucial for anyone using Spring. That is the main motivating factor behind the new Spring IO platform that provides a unified set of dependency versions across all Spring projects.
Q6. Could you give us some detail on how you handle big data ingest/export: e.g. from enterprise databases into Hadoop and vice versa? How is this different than conventional ETL?
Thomas Risberg: We rely on Spring Batch functionality for this task, so in this case HDFS ingest isn’t different than loading data into any other data store. It’s simply a different batch writer. Spring Batch is a proven technology that is the basis for JSR-352 and is now certified as a JSR-352 compliant implementation. We support import/export with most relational databases that have a JDBC driver and also with many NoSQL stores that have Spring Data support like MongoDB, Cassandra, Couchbase or Redis.
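The "simply a different batch writer" idea can be illustrated with a small plain-Java sketch. This is not Spring Batch or Spring for Apache Hadoop code: `RecordSink`, `FakeJdbcSink`, `FakeHdfsSink` and `load` are hypothetical names invented here, and both sinks only write to memory. The sketch only assumes that records are read once, grouped into fixed-size chunks, and handed to whichever destination is plugged in.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ChunkedLoadSketch {

    /** Hypothetical destination abstraction; NOT a Spring Batch type. */
    interface RecordSink<T> {
        void write(List<T> chunk);
    }

    /** In-memory stand-in for a relational target: one call = one "batch insert". */
    static class FakeJdbcSink implements RecordSink<String> {
        final List<String> rows = new ArrayList<>();
        int batches = 0;

        @Override
        public void write(List<String> chunk) {
            rows.addAll(chunk);
            batches++;
        }
    }

    /** In-memory stand-in for an HDFS target: each record becomes a line of text. */
    static class FakeHdfsSink implements RecordSink<String> {
        final StringBuilder file = new StringBuilder();

        @Override
        public void write(List<String> chunk) {
            for (String record : chunk) {
                file.append(record).append('\n');
            }
        }
    }

    /**
     * Reads every record from the source, groups records into chunks of
     * chunkSize, and passes each chunk to the sink. Returns the chunk count.
     * The loop is identical whichever sink is supplied.
     */
    static <T> int load(Iterator<T> source, RecordSink<T> sink, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<T> chunk = new ArrayList<>(chunkSize);
        int chunks = 0;
        while (source.hasNext()) {
            chunk.add(source.next());
            if (chunk.size() == chunkSize) {
                sink.write(new ArrayList<>(chunk)); // copy so the sink owns its list
                chunk.clear();
                chunks++;
            }
        }
        if (!chunk.isEmpty()) { // flush the final partial chunk
            sink.write(new ArrayList<>(chunk));
            chunks++;
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("a,1", "b,2", "c,3", "d,4", "e,5");

        FakeJdbcSink db = new FakeJdbcSink();
        FakeHdfsSink hdfs = new FakeHdfsSink();

        // Same source data, same loop; only the writer differs.
        load(records.iterator(), db, 2);
        load(records.iterator(), hdfs, 2);

        System.out.println(db.rows.size() + " rows in " + db.batches + " batches");
        System.out.print(hdfs.file);
    }
}
```

In a real job the two sinks would presumably be replaced by the framework's own JDBC and HDFS writers; the point of the sketch is only that the read-and-chunk loop does not change when the destination does.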
Q7. What is the main contribution of Spring Data Hadoop to the Hadoop workflow and security?
Thomas Risberg: Spring for Apache Hadoop allows the developer to treat Hadoop workloads the same way as they would approach any workflow problem. Just because Hadoop is involved doesn’t have to mean that you need to use new and different tools from what you are used to. In terms of security we use what Hadoop itself provides and haven’t so far attempted to integrate that with Spring Security.
Q8. Do you also offer tools for analyzing Big Data? If yes, which ones?
Thomas Risberg: Spring XD provides integration with PMML analytics via a plug-in module. That module integrates with the JPMML-Evaluator library that provides support for a wide range of model types and is interoperable with models exported from R, Rattle, KNIME, and RapidMiner. Pivotal also provides MADlib, developed in collaboration with researchers at UC Berkeley and a growing worldwide user community. This library is typically used with Pivotal’s Greenplum database or HAWQ, which is the SQL engine that is part of Pivotal’s Hadoop distribution.
Q9. In which situations can Spring Data Hadoop add value, and in which situations would it be a poor choice?
Thomas Risberg: It definitely adds a lot of value if you are already using Spring in your workflow and just want to add some Hadoop functionality. It also makes sense if you are using Java and would like to take advantage of Spring’s dependency injection approach when developing your enterprise applications. It would make less sense for an organization that is not using Java as its development language or for someone who already has a working solution using other tools that they are happy with.
Q10. Who is currently using Spring Data Hadoop and for which projects/business problems?
Thomas Risberg: I can’t name names, but Spring for Apache Hadoop together with Spring XD is used by many large organizations for developing new big data apps for stream processing and using HDFS for storage. Industries include telecommunications, equipment manufacturing, retail and finance. I’ve mentioned Spring XD, and Spring for Apache Hadoop is a key component of that project. Spring XD is Pivotal’s new Spring project providing a unified, distributed, and extensible system for data ingestion, real time analytics, batch processing, and data export. The Spring XD project’s goal is to simplify the development of big data applications.
Thomas Risberg, Software Engineer focusing on Big Data, Pivotal, New Hampshire, USA
My current focus is on the “Spring XD”, “Spring for Apache Hadoop” and “Spring Data JDBC Extensions” projects. I’m a co-author of “Spring Data, Modern Data Access for Enterprise Java” published by O’Reilly Media in 2013 and “Professional Java Development with the Spring Framework” published by Wiley in 2005.