David Turanski is a Senior Software Engineer with Pivotal. David is a member of the Spring XD team. Prior to this, David was the project leader for the Spring Data GemFire project. He is also a contributor to Spring Integration and Spring Batch. David has extensive experience as a developer, architect, and consultant serving a variety of industries. In addition, he has trained hundreds of developers to use the Spring Framework effectively.
Q1. What are, in your opinion, the main technical challenges in assembling, deploying, and running big data analysis applications?
Today, the cloud, along with open source technologies such as Hadoop, makes the computing and storage required for big data analysis cheaper than ever. It is relatively painless to set up a Hadoop cluster in the public cloud. Once the cluster is in place, the first challenge is ingesting data into it. There are many potential sources for such data, and in general companies want to capture all of it: application and system logs, click streams, interactions with mobile applications and social media, as well as data generated by legacy systems, batch jobs, and the like. Organizing and making sense of all this data requires developing a new suite of custom applications. At this point, IT departments face some familiar challenges involving enterprise integration, stream processing, and batch workflow orchestration. They also face new challenges that come with introducing new technologies. The Hadoop ecosystem offers various tools such as MapReduce, Pig, Hive, Sqoop, and Oozie, with more popping up every day. However, these tools tend to be point solutions which themselves need to be integrated, often requiring complex shell scripts to perform end-to-end analysis tasks. In addition, dealing with large data volumes requires mastery of highly distributed systems at levels of scalability not previously seen in many enterprises.
Q2. Are there any similarities between Big data applications and Enterprise Integration and Batch applications?
Absolutely. At their core, big data applications are very much tied to enterprise integration and batch processing. Analysis applications that depend on the Hadoop file system (HDFS) tend to be batch oriented. For example, applications that look at historical trends tend to be very data and compute intensive, similar to traditional batch jobs, and require the same sort of infrastructure to support scheduling, orchestration, error handling, and restarting failed jobs. The sweet spot for enterprise integration is stream processing, in which data is ingested, filtered, and transformed as it becomes available. Stream processing pipelines are commonly used to ingest data into HDFS, to be used as a source for subsequent batch processes. In addition, stream processing is essential for real-time analysis. Stream processing applications are especially well suited to asynchronous, event-driven, message-oriented designs, following widely accepted best practices for enterprise integration.
Q3. You are currently working on Spring-XD. What is it? What is it useful for?
Spring XD is a relatively new addition to the Spring IO platform. It is positioned as a Domain Specific Runtime, meaning it is downloaded and installed as a distributed runtime environment for big data applications. The goal of Spring XD is to be a one-stop shop for big data applications, providing a common programming model for stream processing and batch processing, along with communication between these two processing domains. For example, completion of a batch step can trigger a stream, and vice versa. Spring XD includes a Domain Specific Language (DSL) along with many out-of-the-box components that make it simple to implement processing streams and batch jobs without writing code. The DSL is based on the familiar UNIX pipes-and-filters syntax for defining a processing stream. For example:
http | transform | hdfs
represents a stream definition that listens for data posted to an HTTP endpoint, performs a simple transformation, and stores the result in HDFS. In this example, ‘http’, ‘transform’, and ‘hdfs’ are among the many pre-built components, known as Modules, included with the Spring XD distribution. Modules may be configured using parameters that follow the standard syntax for command line options (not shown in the example). A more complete version of the above is:
http --port=9010 | transform --expression=payload.toUpperCase() | hdfs --filename=mydata
Note that the ‘transform’ module accepts a Spring Expression Language (SpEL) expression that is evaluated against the input payload to perform a simple transformation. The ‘transform’ module may also be configured with a Groovy script. The ‘hdfs’ module represents an HDFS file system, the ultimate destination for this stream, and requires HDFS to be installed somewhere on the network. The “pipes” represented by the “|” symbol are actually backed by a distributed transport in Spring XD. Currently the available transport options are RabbitMQ and Redis, although the Spring XD architecture allows alternate transports to be plugged in. Each module therefore generally runs on a separate node (called a Container) in the cluster, which means that each module may be independently scaled horizontally. For example, this stream may be configured to run three instances of the transform module, each on a separate node.
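As a rough illustration of per-module scaling (a sketch based on the Spring XD shell; the stream name ‘mystream’ is hypothetical, and the deployment-property syntax depends on the Spring XD release), an instance count can be requested for a single module when the stream is deployed:

```
xd:> stream create --name mystream --definition "http --port=9010 | transform --expression=payload.toUpperCase() | hdfs --filename=mydata"
xd:> stream deploy --name mystream --properties "module.transform.count=3"
```

Here only the ‘transform’ module is scaled to three instances; the ‘http’ source and ‘hdfs’ sink keep a single instance each, and the distributed transport balances messages across the transform instances.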
Spring XD provides a simple command line interface to define, deploy, and manage streams using the DSL, along with an Admin node which deploys streams in a distributed fashion to a fault-tolerant cluster of Containers. The Spring XD runtime and the Modules are built using mature Spring libraries, notably Spring Integration, Spring Batch, and Spring Data. Spring XD’s architecture is extensible; for instance, users may easily provide their own modules. The runtime is portable and may be run standalone or deployed to Amazon EC2, Hadoop YARN, and Cloud Foundry (work in progress). In addition, all major Hadoop distributions are supported.
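A typical interaction with the command line interface looks roughly like the following (a sketch; the stream name ‘mystream’ is hypothetical, and exact commands may vary between Spring XD releases):

```
xd:> stream create --name mystream --definition "http --port=9010 | transform --expression=payload.toUpperCase() | hdfs --filename=mydata" --deploy
xd:> http post --target http://localhost:9010 --data "hello world"
xd:> stream list
xd:> stream destroy --name mystream
```

The shell talks to the Admin node, which in turn distributes the stream’s modules across the available Containers; the developer never deals with individual nodes directly.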
Q4. How can Spring-XD simplify the challenges described before?
Hopefully, the overview above makes a compelling case for Spring XD. Spring XD gives users the ability to assemble and deploy a distributed stream processing or batch application in a purely declarative fashion. Many common big data use cases may be satisfied without writing a single line of Java code. In addition, the distributed runtime is scalable and fault-tolerant. If a Container node goes down, the Admin detects it and automatically redeploys its modules to available nodes. In the meantime, streaming messages are held in a queue until the required modules come back online, so stream processing continues essentially uninterrupted. If an Admin node goes down, a backup takes over. As with all Spring projects, Spring XD’s mission is to eliminate boilerplate code, simplify infrastructure, and allow developers to focus on the business problem.
Q5. How does Spring-XD integrate with the Spring IO platform?
Spring XD is actually a component of the Spring IO platform which is a high-level tiered model encompassing all Spring products. Spring XD is part of the top tier of Domain Specific Runtimes which also includes Grails and Spring Boot. The DSRs all build on top of Foundation Spring libraries, such as Spring Integration, Spring Batch, and Spring Data, which in turn build upon components in the Core layer, which provides cross-cutting infrastructure including the Spring Framework, Spring Security, Groovy, and the recent Reactor project. The Spring IO platform also serves as a formal specification defining which versions of each Spring project are certified to work together. The Spring team is currently working on the first release of the platform specification.
Q6. In which way is Spring XD leveraging Spring Data, Spring Batch, and Spring Integration?
Spring Integration and Spring Batch are at the very heart of Spring XD. Stream processing applications are implemented entirely with Spring Integration: the modules used to assemble streams are fine-grained, reusable message flows defined with Spring Integration. Likewise, modules used to define batch processes are Spring Batch job definitions. Batch jobs are triggered in Spring XD by sending a message to the job’s control channel using Spring Integration, and batch jobs may be monitored via wire taps built with Spring Integration. Spring Integration adapters are used to implement modules that support integration with the file system, HTTP, FTP, MQTT, JMS, AMQP, TCP, mail, Splunk, Twitter, Syslog, Reactor, and more. Spring Data is used to implement various data-related modules that support integration with HDFS, JDBC, MongoDB, and GemFire. In addition, Spring XD provides common metrics such as gauges and counters that are useful for real-time analytics. These are built using Spring Data Redis.
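To sketch how the analytics metrics fit into the stream model: a tap can copy messages from an existing stream into a counter without disturbing the primary flow. The following is an illustrative XD shell session (the names ‘mystream’, ‘mytap’, and ‘mycounter’ are hypothetical, and the exact tap syntax may vary by release):

```
xd:> stream create --name mystream --definition "http --port=9010 | log" --deploy
xd:> stream create --name mytap --definition "tap:stream:mystream > counter --name=mycounter" --deploy
xd:> http post --target http://localhost:9010 --data "hello"
xd:> counter display --name mycounter
```

Because the tap is itself a stream, it benefits from the same distributed transport and scaling model as any other stream, and the counter is persisted in Redis via Spring Data Redis.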
Q7. Do you have any sample projects with Spring-XD to share?