The Spring Data project. Interview with David Turanski.

by Roberto V. Zicari on January 3, 2013

“Given the recent explosion of NoSQL data stores, we saw the need for a common data access abstraction to simplify development with NoSQL stores. Hence the Spring Data team was created.” –David Turanski.

I wanted to know more about the Spring Data project. I have interviewed David Turanski, Senior Software Engineer with SpringSource, a division of VMWare.

RVZ

Q1. What is the Spring Framework?

David Turanski: Spring is a widely adopted open source application development framework for enterprise Java, used by millions of developers. Version 1.0 was released in 2004 as a lightweight alternative to Enterprise Java Beans (EJB). Since then, Spring has expanded into many other areas of enterprise development, such as enterprise integration (Spring Integration), batch processing (Spring Batch), web development (Spring MVC, Spring Webflow), and security (Spring Security). Spring continues to push the envelope for mobile applications (Spring Mobile), social media (Spring Social), rich web applications (Spring MVC, s2js Javascript libraries), and NoSQL data access (Spring Data).

Q2. In how many open source Spring projects is VMware actively contributing?

David Turanski: It’s difficult to give an exact number. Spring is very modular by design, so if you look at the SpringSource page on github, there are literally dozens of projects. I would estimate there are about 20 Spring projects actively supported by VMware.

Q3. What is the Spring Data project?

David Turanski: The Spring Data project started in 2010, when Rod Johnson (Spring Framework’s inventor) and Emil Eifrem (founder of Neo Technology) were trying to integrate Spring with the Neo4j graph database. Spring has always provided excellent support for working with RDBMS and ORM frameworks such as Hibernate. However, given the recent explosion of NoSQL data stores, we saw the need for a common data access abstraction to simplify development with NoSQL stores. Hence the Spring Data team was created with the mission to:

“…provide a familiar and consistent Spring-based programming model for NoSQL and relational stores while retaining store-specific features and capabilities.”

The last bit is significant. It means we don’t take a least common denominator approach. We want to expose a full set of capabilities whether it’s JPA/Hibernate, MongoDB, Neo4j, Redis, Hadoop, GemFire, etc.

Q4. Could you give us an example of how you build Spring-powered applications that use NoSQL data stores (e.g. Redis, MongoDB, Neo4j, HBase)?

David Turanski: Spring Data provides an abstraction for the Repository pattern for data access. A Repository is akin to a Data Access Object and provides an interface for managing persistent objects. This includes the standard CRUD operations, but also domain-specific query operations. For example, if you have a Person object:

public class Person {
    int id;
    int age;
    String firstName;
    String lastName;
}

You may want to perform queries such as findByFirstNameAndLastName, findByLastNameStartsWith, findByFirstNameContains, findByAgeLessThan, etc. Traditionally, you would have to write code to implement each of these methods. With Spring Data, you simply declare a Java interface to define the operations you need. Using method naming conventions, as illustrated above, Spring Data generates a dynamic proxy to implement the interface on top of whatever data store is configured for the application. The Repository interface in this case looks like:

	
public interface PersonRepository extends CrudRepository<Person, Integer> {
  Person findByFirstNameAndLastName(String firstName, String lastName);
  List<Person> findByLastNameStartsWith(String lastName);
  List<Person> findByAgeLessThan(int age);
  ...
}

In addition, Spring Data Repositories provide declarative support for pagination and sorting.
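For example, a query method can accept a Pageable to return one page of results at a time. A minimal sketch, reusing the illustrative Person class and property names from above ("Smith" and the page size are just sample values):

import java.util.List;
import org.springframework.data.domain.Page;
import org.springframework.data.domain.PageRequest;
import org.springframework.data.domain.Pageable;
import org.springframework.data.domain.Sort;
import org.springframework.data.repository.PagingAndSortingRepository;

public interface PersonRepository extends PagingAndSortingRepository<Person, Integer> {
  // Returns one page of matching results at a time
  Page<Person> findByLastName(String lastName, Pageable pageable);
}

// Usage: request the first page of 20 results, sorted by firstName
Page<Person> page = personRepository.findByLastName("Smith",
        new PageRequest(0, 20, new Sort("firstName")));
List<Person> people = page.getContent();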

Then, using Spring’s dependency injection capabilities, you simply wire the repository into your application. For example:

public class PersonApp {

    @Autowired
    PersonRepository personRepository;

    public Person findPerson(String lastName, String firstName) {
        return personRepository.findByFirstNameAndLastName(firstName, lastName);
    }
}

Essentially, you don’t have to write any data access code! However, you must provide Java annotations on your domain class to configure entity mapping to the data store. For example, if using MongoDB you would associate the domain class with a document:

@Document
public class Person {
    int id;
    int age;
    String firstName;
    String lastName;
}

Note that the entity mapping annotations are store-specific. Also, you need to provide some Spring configuration to tell your application how to connect to the data store, in which package(s) to search for Repository interfaces and the like.
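For MongoDB, for instance, a minimal Java-based configuration sketch might look like the following, assuming a Spring Data MongoDB version that supports @EnableMongoRepositories; the package and database names are purely illustrative:

import com.mongodb.MongoClient;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.repository.config.EnableMongoRepositories;

@Configuration
// Tells Spring Data where to scan for Repository interfaces (illustrative package name)
@EnableMongoRepositories(basePackages = "com.example.repositories")
public class MongoConfig {

    // How to connect to the data store; host and database name are illustrative
    @Bean
    public MongoTemplate mongoTemplate() throws Exception {
        return new MongoTemplate(new MongoClient("localhost"), "demo");
    }
}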

The Spring Data team has written an excellent book, Spring Data: Modern Data Access for Enterprise Java, recently published by O’Reilly, which includes lots of code examples. Also, the project web site includes many resources to help you get started using Spring Data.

Q5. And for map-reduce frameworks?

David Turanski: Spring Data provides excellent support for developing applications with Apache Hadoop along with Pig and/or Hive. However, Hadoop applications typically involve a complex data pipeline which may include loading data from multiple sources, pre-processing and real-time analysis while loading data into HDFS, data cleansing, implementing a workflow to coordinate several data analysis steps, and finally publishing data from HDFS to one or more relational or NoSQL data stores.

The complete pipeline can be implemented using Spring for Apache Hadoop along with Spring Integration and Spring Batch. However, Hadoop has its own set of challenges which the Spring for Apache Hadoop project is designed to address. Like all Spring projects, it leverages the Spring Framework to provide a consistent structure and simplify writing Hadoop applications. For example, Hadoop applications rely heavily on command shell tools, so applications end up being a hodge-podge of Perl, Python, Ruby, and bash scripts. Spring for Apache Hadoop provides a dedicated XML namespace for configuring Hadoop jobs with embedded scripting features and support for Hive and Pig. In addition, Spring for Apache Hadoop allows you to take advantage of core Spring Framework features such as task scheduling, Quartz integration, and property placeholders to reduce lines of code, improve testability and maintainability, and simplify the development process.

Q6. What about cloud-based data services? And support for relational database technologies or object-relational mappers?

David Turanski: While there are currently no plans to support cloud-based services such as Amazon S3, Spring Data provides a flexible architecture upon which these may be implemented. Relational technologies and ORM are supported via Spring Data JPA. Spring has always provided first-class support for relational databases via the JdbcTemplate using a vendor-provided JDBC driver. For ORM, Spring supports Hibernate, any JPA provider, and iBATIS. Additionally, Spring provides excellent support for declarative transactions.
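For instance, a hand-coded data access object using JdbcTemplate looks roughly like this (a sketch; the table and column names are illustrative):

import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.List;
import javax.sql.DataSource;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.core.RowMapper;

public class JdbcPersonDao {

    private final JdbcTemplate jdbcTemplate;

    public JdbcPersonDao(DataSource dataSource) {
        this.jdbcTemplate = new JdbcTemplate(dataSource);
    }

    // Hand-written query and row mapping for the Person class shown earlier
    public List<Person> findByLastName(String lastName) {
        return jdbcTemplate.query(
                "select id, age, first_name, last_name from person where last_name = ?",
                new Object[] { lastName },
                new RowMapper<Person>() {
                    public Person mapRow(ResultSet rs, int rowNum) throws SQLException {
                        Person person = new Person();
                        person.id = rs.getInt("id");
                        person.age = rs.getInt("age");
                        person.firstName = rs.getString("first_name");
                        person.lastName = rs.getString("last_name");
                        return person;
                    }
                });
    }
}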

With Spring Data, things get even easier. In a traditional Spring application backed by JDBC, you are required to hand code the Repositories or Data Access Objects. With Spring Data JPA, the data access layer is generated by the framework while persistent objects use standard JPA annotations.
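A sketch of the Spring Data JPA equivalent, with standard JPA annotations on the domain class and no hand-written repository implementation:

import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import org.springframework.data.jpa.repository.JpaRepository;

@Entity
public class Person {
    @Id
    @GeneratedValue
    int id;
    int age;
    String firstName;
    String lastName;
}

// The implementation is generated by the framework at runtime
public interface PersonRepository extends JpaRepository<Person, Integer> {
    Person findByFirstNameAndLastName(String firstName, String lastName);
}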

Q7. How can you use Spring to perform:
– Data ingestion from various data sources into Hadoop,
– Orchestrating Hadoop based analysis workflow,
– Exporting data out of Hadoop into relational and non-relational databases

David Turanski: As previously mentioned, a complete big data processing pipeline involving all of these steps will require Spring for Apache Hadoop in conjunction with Spring Integration and Spring Batch.

Spring Integration greatly simplifies enterprise integration tasks by providing a lightweight messaging framework, based on the well-known Enterprise Integration Patterns by Hohpe and Woolf. Sometimes referred to as the “anti ESB”, Spring Integration requires no runtime component other than a Spring container and is embedded in your application process to handle data ingestion from various distributed sources, mediation, transformation, and data distribution.

Spring Batch provides a robust framework for any type of batch processing and can be used to configure and execute scheduled jobs composed of coarse-grained processing steps. Individual steps may be implemented as Spring Integration message flows or Hadoop jobs.

Q8. What is the Spring Data GemFire project?

David Turanski: Spring Data GemFire began life as a separate project from Spring Data following VMWare’s acquisition of GemStone and its commercial GemFire distributed data grid.
Initially, its aim was to simplify the development of GemFire applications and the configuration of GemFire caches, data regions, and related components. While this was, and still is, developed independently as an open source Spring project, the GemFire product team recognized the value to its customers of developing with Spring and has increased its commitment to Spring Data GemFire. As of the recent GemFire 7.0 release, Spring Data GemFire is being promoted as the recommended way to develop GemFire applications for Java. At the same time, the project was moved under the Spring Data umbrella. We implemented a GemFire Repository and will continue to provide first class support for GemFire.

Q9. Could you give a technical example on how do you simplify the development of building highly scalable applications?

David Turanski: GemFire is a fairly mature distributed, memory-oriented data grid used to build highly scalable applications. As a consequence, there is inherent complexity involved in configuring cache members and data stores known as regions (a region is roughly analogous to a table in a relational database). GemFire supports peer-to-peer and client-server topologies, and regions may be local, replicated, or partitioned. In addition, GemFire provides a number of advanced features for event processing, remote function execution, and so on.

Prior to Spring Data GemFire, GemFire configuration was done predominantly via its native XML support. This works well but is relatively limited in terms of flexibility. Today, configuration of core components can be done entirely in Spring, making simple things simple and complex things possible.

In a client-server scenario, an application developer may only be concerned with data access. In GemFire, a client application accesses data via a client cache and a client region, which act as proxies to provide access to the grid. Such components are easily configured with Spring, and the application code is the same whether data is distributed across one hundred servers or cached locally. This allows developers to take advantage of Spring’s environment profiles to easily switch to a local cache and region suitable for self-contained integration tests that may run anywhere, including automated build environments. The cache resources are configured in Spring XML:

<beans>
    <beans profile="test">
        <gfe:cache/>
        <gfe:local-region name="Person"/>
    </beans>

    <beans profile="default">
        <context:property-placeholder location="cache.properties"/>
        <gfe:client-cache/>
        <gfe:client-region name="Person"/>
        <gfe:pool>
            <gfe:locator host="${locator.host}" port="${locator.port}"/>
        </gfe:pool>
    </beans>
</beans>

Here we see the deployed application (default profile) depends on a remote GemFire locator process. The client region does not store data locally by default but is connected to an available cache server via the locator. The region is distributed among the cache server and its peers and may be partitioned or replicated. The test profile sets up a self-contained region in local memory, suitable for integration testing.

Additionally, applications may be further simplified by using a GemFire-backed Spring Data Repository. The key difference from the example above is that the entity mapping annotations are replaced with GemFire-specific annotations:

@Region
public class Person {
    int id;
    int age;
    String firstName;
    String lastName;
}

The @Region annotation maps the Person type to an existing region of the same name; the annotation also provides an attribute to specify the region name explicitly if necessary.
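For example, a minimal sketch assuming the region has been configured under the name "People" rather than "Person":

import org.springframework.data.gemfire.mapping.Region;

// The value attribute overrides the default region name derived from the class name
@Region("People")
public class Person {
    int id;
    int age;
    String firstName;
    String lastName;
}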

Q10. The project uses GemFire as a distributed data management platform. Why using an In-Memory Data Management platform, and not a NoSQL or NewSQL data store?

David Turanski: Customers choose GemFire primarily for performance. As an in-memory grid, data access can be an order of magnitude faster than with disk-based stores. Many disk-based systems also cache data in memory to gain performance; however, your mileage may vary depending on the specific operation and when disk I/O is needed. In contrast, GemFire’s performance is very consistent. This is a major advantage for a certain class of high-volume, low-latency distributed systems. Additionally, GemFire is extremely reliable, providing disk-based backup and recovery.

GemFire also builds in advanced features not commonly found in the NoSQL space. This includes a number of advanced tuning parameters to balance performance and reliability, synchronous or asynchronous replication, advanced object serialization features, flexible data partitioning with configurable data colocation, WAN gateway support, continuous queries, .Net interoperability, and remote function execution.

Q11. Is GemFire a full-fledged distributed database management system, or something else?

David Turanski: Given all its capabilities and proven track record supporting many mission critical systems, I would certainly characterize GemFire as such.
———————————-

David Turanski is a Senior Software Engineer with SpringSource, a division of VMWare. David is a member of the Spring Data team and lead of the Spring Data GemFire project. He is also a committer on the Spring Integration project. David has extensive experience as a developer, architect, and consultant serving a variety of industries. In addition, he has trained hundreds of developers to use the Spring Framework effectively.

