
"Trends and Information on Big Data, New Data Management Technologies, and Innovation."


Jan 28 13

On Big Data Velocity. Interview with Scott Jarr.

by Roberto V. Zicari

“There is only so much static data in the world as of today. The vast majority of new data, the data that is said to explode in volume over the next 5 years, is arriving from a high velocity source. It’s funny how obvious it is when you think about it. The only way to get Big Data in the future is to have it arrive at high velocity.” – Scott Jarr.

One of the key technical challenges of Big Data is (Data) Velocity. On that, I have interviewed Scott Jarr, Co-founder and Chief Strategy Officer of VoltDB.

RVZ

Q1. Marc Geall, former Head of European Technology Research at Deutsche Bank AG/London, writes about the “Big Data myth”, claiming that there is:
1) limited need for petabyte-scale data today,
2) a very low proportion of databases in corporate deployment that require more than tens of TB of data to be handled, and
3) a lack of availability and a high cost of highly skilled operators (often post-doctoral) to operate highly scalable NoSQL clusters.
What is your take on this?

Scott Jarr: Interestingly I agree with a lot of this for today. However, I also believe we are in the midst of a massive shift in business to what I call data-as-a-priority.
We are just beginning, but you can already see the signs. People are loath to get rid of anything, sensors are capturing data at finer resolutions, and people want to make far more data-informed decisions.
I also believe that the value that corporate IT teams were able to extract from data with the advent of data warehouses really whetted the appetite for what could be done with data. We are now seeing people ask questions like “why can’t I see this faster,” or “how do we use this incoming data to better serve customers,” or “how can we beat the other guys with our data.”
Data is becoming viewed as a corporate weapon. Combine rising inbound data rates (velocity) with the desire to use data for better decisions and you have data sizes that will dwarf what is considered typical today. And almost no industry is excluded. The cost ceiling has collapsed.

Q2: What are the typical problems that are currently holding back many Big Data projects?

Scott Jarr:
1) Spending too much time trying to figure out what solution to use for what problem. We were seeing this so often that we created a graphic and presentation that addresses this topic. We called it the Data Continuum.
2) Putting out fires that the current data environment is causing. Most infrastructures aren’t ready for the volume or velocity of data that is already starting to arrive at their doorsteps. They are spending a ton of time applying band-aids to small-data infrastructure and are unable to shake free to focus on the Big Data infrastructure that will be the longer-term fix.
3) Not being able to clearly articulate the business value the company expects to achieve from a Big Data project; this slows things down radically.
4) Most of the successes in Big Data projects today are in situations where the company has done a very good job maintaining a reasonable scope to the project.

Q3: Why is it important to solve the Velocity problem when dealing with Big Data projects?

Scott Jarr: There is only so much static data in the world as of today. The vast majority of new data, the data that is said to explode in volume over the next 5 years, is arriving from a high velocity source. It’s funny how obvious it is when you think about it. The only way to get Big Data in the future is to have it arrive at high velocity.
Companies are recognizing the business value they can get by acting on that data as it arrives rather than depositing it in a file to be batch processed at some later date. So much of the context that makes that data valuable is lost when it is not acted on quickly.

Q4: What exactly is Big Data Velocity? Is Big Data Velocity the same as stream computing?

Scott Jarr: We think of Big Data Velocity as data that is coming into the organization at a rate that can’t be managed in the traditional database. However, companies want to extract the most value they can from that data as it arrives. We see them doing three specific things:
1) Ingesting relentless feed(s) of data;
2) Making a decision on each piece of data as it arrives; and
3) Using real-time analytics to derive immediate insights into what is happening with this velocity data.

Making the best possible decision each time data is touched is what velocity is all about. These decisions used to be called transactions in the OLTP world. They involve using other data stored in the database to make a decision – approve a transaction, serve the ad, authorize the access, etc. These decisions, and the real-time analytics that support them, all require the context of other data. In other words, the database used to perform these decisions must hold some amount of previously processed data – it must hold state. Streaming systems are good at a different set of problems.
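
To make the idea of a per-event decision over stored state concrete, here is a minimal sketch written as a VoltDB-style Java stored procedure. It assumes the standard VoltDB stored-procedure API (VoltProcedure, SQLStmt, voltQueueSQL/voltExecuteSQL); the table, columns and procedure name are hypothetical, not taken from the interview.

import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;
import org.voltdb.VoltTable;

// Hypothetical per-event decision: approve or reject a payment as it arrives,
// using previously processed data (the account balance) already held in the database.
public class ApprovePayment extends VoltProcedure {

    public final SQLStmt getBalance = new SQLStmt(
        "SELECT balance FROM accounts WHERE account_id = ?;");
    public final SQLStmt debit = new SQLStmt(
        "UPDATE accounts SET balance = balance - ? WHERE account_id = ?;");

    public long run(long accountId, double amount) {
        voltQueueSQL(getBalance, accountId);
        VoltTable balances = voltExecuteSQL()[0];
        if (balances.getRowCount() == 0) {
            return 0; // unknown account: reject
        }
        double balance = balances.fetchRow(0).getDouble("balance");
        if (balance < amount) {
            return 0; // insufficient funds: reject
        }
        voltQueueSQL(debit, amount, accountId);
        voltExecuteSQL(true); // final SQL batch of this transaction
        return 1; // approved
    }
}

Each such procedure runs as one ACID transaction against state already in the database, which is the distinction drawn here between velocity decisioning and stateless stream processing.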

Q5: Two other critical factors often mentioned for Big Data projects are: 1) Data discovery: How to find high-quality data from the Web? and 2) Data Veracity: How can we cope with uncertainty, imprecision, missing values, mis-statements or untruths? Any thoughts on these?

Scott Jarr: We have a number of customers who are using VoltDB in ways that improve data quality within their organization. We have one customer who is examining incoming financial events and looking for gaps in sequence numbers to detect lost or mis-ordered information. Likewise, a popular use case is to filter out bad data as it comes in by checking it, while it is still in its high velocity state, against a known set of bad or good characteristics. This keeps much of the bad data from ever entering the data pipeline.
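
As a rough, store-agnostic illustration of the sequence-number check described above (the field names and the notion of a per-feed counter are hypothetical), the per-event logic might look like this:

import java.util.HashMap;
import java.util.Map;

// Illustrative gap detector: remember the last sequence number seen per feed
// and flag events that arrive out of order or that skip numbers.
public class SequenceChecker {

    private final Map<String, Long> lastSeen = new HashMap<String, Long>();

    // Returns a verdict for one incoming event on the given feed.
    public String check(String feedId, long sequenceNumber) {
        Long previous = lastSeen.get(feedId);
        if (previous != null && sequenceNumber <= previous) {
            return "mis-ordered or duplicate event";
        }
        String verdict = "ok";
        if (previous != null && sequenceNumber > previous + 1) {
            verdict = "gap detected: " + (sequenceNumber - previous - 1) + " event(s) missing";
        }
        lastSeen.put(feedId, sequenceNumber);
        return verdict;
    }
}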

Q6: Scalability has three aspects: data volume, hardware size, and concurrency. Scale and performance requirements for Big Data strain conventional databases. Which database technology is best to scale to petabytes?

Scott Jarr: VoltDB is focused on a very different problem, which is how to process that data prior to it landing in the long-term petabyte system. We see customers deploying VoltDB in front of both MPP OLAP and Hadoop, in roughly the same numbers. It really all depends on what the customer is ultimately trying to do with the data once it settles into its resting state in the petabyte store.

Q7: A/B testing, sessionization, bot detection, and pathing analysis all require powerful analytics on many petabytes of semi-structured Web data. Do you have some customer examples in this area?

Scott Jarr: Absolutely. Taken broadly, this is one of the most common uses of VoltDB. Micro-segmentation and on-the-fly ad content optimization are examples that we see regularly. The ability to design an ad, in real time, based on five sets of audience metadata can have a radical impact on performance.

Q8: When would you recommend to store Big Data in a traditional Data Warehouse and when in Hadoop?

Scott Jarr: My experience here is limited. As I said, our customers are using VoltDB in front of both types of stores to do decisioning and real-time analytics before the data moves into the long term store. Often, when the data is highly structured, it goes into a data warehouse and when it is less structured, it goes into Hadoop.

Q9: Instead of stand-alone products for ETL, BI/reporting and analytics wouldn’t it be better to have a seamless integration? In what ways can we open up a data processing platform to enable applications to get closer?

Scott Jarr: This is very much in line with our vision of the world. As Mike (Stonebraker, VoltDB founder) has stated for years, in high performance data systems you need to have specialized databases. So we see the new world having far more data pipelines than stand-alone databases. A data pipeline will have seamless integrations between velocity stores, warehouses, BI tools and exploratory analytics. Standards go a long way to making these integrations easier.

Q10: Anything you wish to add?

Scott Jarr: Thank you Roberto. Very interesting discussion.

——————–
VoltDB Co-founder and Chief Strategy Officer Scott Jarr. Scott brings more than 20 years of experience building, launching and growing technology companies from inception to market leadership in highly competitive environments.
Prior to joining VoltDB, Scott was VP Product Management and Marketing at on-line backup SaaS leader LiveVault Corporation. While at LiveVault, Scott was key in growing the recurring revenue business to 2,000 customers strong, leading to an acquisition by Iron Mountain. Scott has also served as board member and advisor to other early-stage companies in the search, mobile, security, storage and virtualization markets. Scott has an undergraduate degree in mathematical programming from the University of Tampa and an MBA from the University of South Florida.

Related Posts

- On Big Data, Analytics and Hadoop. Interview with Daniel Abadi. on December 5, 2012

- Two cons against NoSQL. Part II. on November 21, 2012

- Two Cons against NoSQL. Part I. on October 30, 2012

- Interview with Mike Stonebraker. on May 2, 2012

Resources

- ODBMS.org: NewSQL.
Blog Posts | Free Software | Articles and Presentations | Lecture Notes | Tutorials| Journals |

- Big Data: Challenges and Opportunities.
Roberto V. Zicari, October 5, 2012.
Abstract: In this presentation I review three current aspects related to Big Data:
1. The business perspective, 2. The Technology perspective, and 3. Big Data for social good.

Presentation (89 pages) | Intermediate| English | DOWNLOAD (PDF)| October 2012|
##

You can follow ODBMS.org on Twitter : @odbmsorg.
——————————-

Jan 16 13

The Gaia mission, one year later. Interview with William O’Mullane.

by Roberto V. Zicari

“We will observe at LEAST 1,000,000,000 celestial objects. If we launched today we would cope with difficulty – but we are on track to be ready by September when we actually launch. This is a game changer for astronomy and thus very challenging for us, but we have done many large scale tests to gain confidence in our ability to process the complex and voluminous data arriving on ground and turn it into catalogues. I still feel the galaxy has plenty of scope to throw us an unexpected curve ball though and really challenge us in the data processing.” — William O’Mullane.

The Gaia mission is considered by the experts “the biggest data processing challenge to date in astronomy”. I recall here the Objectives and the Mission of the Gaia Project (source ESA Web site):
OBJECTIVES:
“To create the largest and most precise three dimensional chart of our Galaxy by providing unprecedented positional and radial velocity measurements for about one billion stars in our Galaxy and throughout the Local Group.”
THE MISSION:
“Gaia is an ambitious mission to chart a three-dimensional map of our Galaxy, the Milky Way, in the process revealing the composition, formation and evolution of the Galaxy. Gaia will provide unprecedented positional and radial velocity measurements with the accuracies needed to produce a stereoscopic and kinematic census of about one billion stars in our Galaxy and throughout the Local Group. This amounts to about 1 per cent of the Galactic stellar population. Combined with astrophysical information for each star, provided by on-board multi-colour photometry, these data will have the precision necessary to quantify the early formation, and subsequent dynamical, chemical and star formation evolution of the Milky Way Galaxy.
Additional scientific products include detection and orbital classification of tens of thousands of extra-solar planetary systems, a comprehensive survey of objects ranging from huge numbers of minor bodies in our Solar System, through galaxies in the nearby Universe, to some 500 000 distant quasars. It will also provide a number of stringent new tests of general relativity and cosmology.”

Last year in February, I interviewed William O’Mullane, Science Operations Development Manager at the European Space Agency, and Vik Nagjee, Product Manager, Core Technologies, at InterSystems Corporation, both deeply involved with the initial Proof-of-Concept of the data management part of the project.

A year later, I asked William O’Mullane (European Space Agency) and Jose Ruperez (InterSystems Spain) some follow-up questions.

RVZ

Q1. The original goal of the Gaia mission was to “observe around 1,000,000,000 celestial objects”. Is this still true? Are you ready for that?

William O’Mullane: YES! We will have a Ground Segment Readiness Review next spring and a Flight Acceptance Review before summer. We will observe at LEAST 1,000,000,000 celestial objects. If we launched today we would cope with difficulty – but we are on track to be ready by September when we actually launch. This is a game changer for astronomy and thus very challenging for us, but we have done many large scale tests to gain confidence in our ability to process the complex and voluminous data arriving on ground and turn it into catalogues. I still feel the galaxy has plenty of scope to throw us an unexpected curve ball though and really challenge us in the data processing.

Q2. The plan was to launch the Gaia satellite in early 2013. Is this plan confirmed?

William O’Mullane: Currently (as of Q1 2013), September 2013 is the official launch date.

Q3. Did the data requirements for the project change in the last year? If yes, how?

William O’Mullane: The downlink rate has not changed, so we know how much comes into the system: still only about 100 TB over 5 years. Data processing volumes depend on how many intermediate steps we keep in different locations. Not much change there since last year.

Q4. The sheer volume of data that is expected to be captured by the Gaia satellite poses a technical challenge. What work has been done in the last year to prepare for such a challenge? What did you learn from the Proof-of-Concept of the data management part of this project?

William O’Mullane: I suppose we learned the same lessons as other projects. We have multiple processing centres with different needs met by different systems. We did not try for a single unified approach across these centres.
CNES has gone completely to Hadoop for their processing. At ESAC we are going to InterSystems Caché. Last year only AGIS was on Caché – now the main daily processing chain is in Caché also [Edit: see also Q9]. There was a significant boost in performance here, but it must be said that some of this was probably internal to the system; in moving it we looked at some bottlenecks more closely.

Jose Ruperez: We are very pleased to learn that last year it was only AGIS and now they have several other databases in Caché.

William O’Mullane: The second operations rehearsal is just drawing to a close. This was run completely on Caché (the first rehearsal used Oracle). There were of course some minor problems (also with our software), but in general, from a Caché perspective, it went well.

Q5. Could you please give us some numbers related to performance? Could you also tell us what bottlenecks you looked at, and how you avoided them?

William O’Mullane: It would take me time to dig out numbers, but we got a factor of 10 in some places with a combination of better queries and the removal of some code bottlenecks. We seem to regularly see a factor of 10 on “non-optimized” systems.

Q6. Is it technically possible to interchange data between Hadoop and Caché? Does it make sense for the project?

Jose Ruperez: The raw data downloaded from the satellite link every day can, in general, be loaded into any database. ESAC has chosen InterSystems Caché for performance reasons, but not only that: as William also explains, cost-effectiveness as well as the support from InterSystems were key. Other centres can try and use other products.

William O’Mullane:
This is a valid point – support is a major reason for our using Caché. InterSystems work with us very well and respond to needs quickly. InterSystems certainly have a very developer-oriented culture, which matches our team well.
Hadoop is one thing, HDFS is another – but of course they go together. In many ways our DataTrain Whiteboard does “map reduce” with some improvements for our specific problem. There are Hadoop database interfaces, so it could work with Caché.

Q7. Could you tell us a bit more about what you have learned so far with this project? In particular, what is the implication for Caché, now that the main daily processing chain is also stored in Caché?

Jose Ruperez: A relevant number regarding InterSystems Caché performance is being able to insert over 100,000 records per second, sustained over several days. This also means that InterSystems Caché, in turn, has to write hundreds of megabytes per second to disk. To me, this is still mind-boggling.

William O’Mullane:
And this was done with fairly standard NetApp storage. Caché and NetApp engineers sat together here at ESAC to align the configuration of both systems to get the maximum I/O for Java through Caché to NetApp. Several low-level settings (page sizes, etc.) were modified for this.

Q8. What else still needs to be done?

William O’Mullane: Well, we have most of the parts, but it is not a well-oiled machine yet. We need more robustness and a little more automation across the board.

Q9. Your high-level architecture a year ago consisted of two databases, a so-called Main Database and an AGIS Database.
The Main Database was supposed to hold all data from Gaia and the products of processing. (This was expected to grow from a few TBs to a few hundred TBs during the mission.) AGIS only required a subset of this data for analytics purposes. Could you tell us how the architecture has evolved in the last year?

William O’Mullane: This remains precisely the same.

Q10. For the AGIS database, were you able to generate realistic data and load on the system?

William O’Mullane: We have run large scale AGIS tests with 50,000,000 sources, or about 4,500,000,000 simulated observations. This worked rather nicely and well within requirements. We confirmed, going from 2 to 10 to 50 million sources, that the problem scales as expected. The final (end of mission 2018) requirement is for 100,000,000 sources, so for now we are quite confident with the load characteristics. The simulation had a realistic source distribution in magnitude and coordinates (i.e. real-sky-like inhomogeneities are seen).

Q11. What results did you obtain in tuning and configuring the AGIS system in order to meet the strict insert requirements, while still optimizing sufficiently for down-stream querying of the data?

William O’Mullane: We still have bottlenecks in the update servers, but the 50 million test still ran inside one month on a small in-house cluster. So the 100 million in 3 months (the system requirement) will easily be met, especially with new hardware.

Q12. What are the next steps planned for the Gaia mission and what are the main technical challenges ahead?

William O’Mullane: AGIS is the critical piece of Gaia software for astrometry, but before that the daily data treatment must be run. This so-called Initial Data Treatment (IDT) is our main focus right now. It must be robust and smoothly operating for the mission, and able to cope with non-nominal situations occurring while commissioning the instrument. So some months of consolidation, bug fixing and operational rehearsals lie ahead for us. The future challenge, I expect, will not be technical but rather will come when we see the real data and it is not exactly as we expect/hope it will be.
I may be pleasantly surprised of course. Ask me next year …

———–
William O’Mullane, Science Operations Development Manager, European Space Agency.
William O’Mullane has a PhD in Physics and a background in Computer Science, and has worked on space science projects since 1996, when he assisted with the production of the Hipparcos CD-ROMs. During this period he was also involved with the Planck and Integral science ground segments, as well as contemplating the Gaia data processing problem. From 2000-2005 Wil worked on developing the US National Virtual Observatory (NVO) and on the Sloan Digital Sky Survey (SDSS) in Baltimore, USA. In August 2005 he rejoined the European Space Agency as Gaia Science Operations Development Manager to lead the ESAC development effort for the Gaia Data Processing and Analysis Consortium.

José Rupérez, Senior Engineer at InterSystems.
He has been providing technical advice to customers and partners in Spain and Portugal for the last 10 years. In particular, he has been working with the European Space Agency since December 2008. Before InterSystems, José developed his career at eSkye Solutions and A.T. Kearney in the United States, always in software. He started his career working for Alcatel in 1994 as a Software Engineer. José holds a Bachelor of Science in Physics from Universidad Complutense (Madrid, Spain) and a Master of Science in Computer Science from Ball State University (Indiana, USA). He has also attended courses at the MIT Sloan School of Business.
—————-

Related Posts
- Objects in Space -The biggest data processing challenge to date in astronomy: The Gaia mission. February 14, 2011

- Objects in Space: “Herschel” the largest telescope ever flown. March 18, 2011

- Objects in Space vs. Friends in Facebook. April 13, 2011

Resources

- Gaia Overview (ESA)

- Gaia Web page at ESA Spacecraft Operations.

- ESA’s web site for the Gaia scientific community.

- Gaia library (ESA) collates Gaia conference proceedings, selected reports, papers, and articles on the Gaia mission as well as public DPAC documents.

- “Implementing the Gaia Astrometric Global Iterative Solution (AGIS) in Java”. William O’Mullane, Uwe Lammers, Lennart Lindegren, Jose Hernandez and David Hobbs. Aug. 2011

- “Implementing the Gaia Astrometric Solution”, William O’Mullane, PhD Thesis, 2012
##

You can follow ODBMS.org on Twitter : @odbmsorg.
——————————-

Jan 3 13

The Spring Data project. Interview with David Turanski.

by Roberto V. Zicari

“Given the recent explosion of NoSQL data stores, we saw the need for a common data access abstraction to simplify development with NoSQL stores. Hence the Spring Data team was created.” –David Turanski.

I wanted to know more about the Spring Data project. I have interviewed David Turanski, Senior Software Engineer with SpringSource, a division of VMWare.

RVZ

Q1. What is the Spring Framework?

David Turanski: Spring is a widely adopted open source application development framework for enterprise Java, used by millions of developers. Version 1.0 was released in 2004 as a lightweight alternative to Enterprise Java Beans (EJB). Since then, Spring has expanded into many other areas of enterprise development, such as enterprise integration (Spring Integration), batch processing (Spring Batch), web development (Spring MVC, Spring Webflow), and security (Spring Security). Spring continues to push the envelope for mobile applications (Spring Mobile), social media (Spring Social), rich web applications (Spring MVC, s2js JavaScript libraries), and NoSQL data access (Spring Data).

Q2. In how many open source Spring projects is VMware actively contributing?

David Turanski: It’s difficult to give an exact number. Spring is very modular by design, so if you look at the SpringSource page on github, there are literally dozens of projects. I would estimate there are about 20 Spring projects actively supported by VMware.

Q3. What is the Spring Data project?

David Turanski: The Spring Data project started in 2010, when Rod Johnson (Spring Framework’s inventor), and Emil Eifrem (founder of Neo Technologies) were trying to integrate Spring with the Neo4j graph database. Spring has always provided excellent support for working with RDBMS and ORM frameworks such as Hibernate. However, given the recent explosion of NoSQL data stores, we saw the need for a common data access abstraction to simplify development with NoSQL stores. Hence the Spring Data team was created with the mission to:

“…provide a familiar and consistent Spring-based programming model for NoSQL and relational stores while retaining store-specific features and capabilities.”

The last bit is significant. It means we don’t take a least common denominator approach. We want to expose a full set of capabilities whether it’s JPA/Hibernate, MongoDB, Neo4j, Redis, Hadoop, GemFire, etc.

Q4. Could you give us an example of how you build Spring-powered applications that use NoSQL data stores (e.g. Redis, MongoDB, Neo4j, HBase)?

David Turanski: Spring Data provides an abstraction for the Repository pattern for data access. A Repository is akin to a Data Access Object and provides an interface for managing persistent objects. This includes the standard CRUD operations, but also includes domain specific query operations. For example, if you have a Person object:

public class Person {
    int id;
    int age;
    String firstName;
    String lastName;
}

You may want to perform queries such as findByFirstNameAndLastName, findByLastNameStartsWith, findByFirstNameContains, findByAgeLessThan, etc. Traditionally, you would have to write code to implement each of these methods. With Spring Data, you simply declare a Java interface to define the operations you need. Using method naming conventions, as illustrated above, Spring Data generates a dynamic proxy to implement the interface on top of whatever data store is configured for the application. The Repository interface in this case looks like:

	
public interface PersonRepository
        extends CrudRepository<Person, Integer> {
    Person findByFirstNameAndLastName(String firstName, String lastName);
    Person findByLastNameStartsWith(String lastName);
    Person findByAgeLessThan(int age);
    ...
}

In addition, Spring Data Repositories provide declarative support for pagination and sorting.
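
For instance, a paging variant of the same repository could be declared as in the sketch below; Page, Pageable and Sort come from the Spring Data commons API, while the method and property names are illustrative.

import java.util.List;

import org.springframework.data.domain.Page;
import org.springframework.data.domain.Pageable;
import org.springframework.data.domain.Sort;
import org.springframework.data.repository.PagingAndSortingRepository;

// Derived queries with declarative paging and sorting.
public interface PagedPersonRepository extends PagingAndSortingRepository<Person, Integer> {

    // One page of matches; page number, page size and sort order come from the Pageable.
    Page<Person> findByLastName(String lastName, Pageable pageable);

    // All matches, sorted as requested by the caller.
    List<Person> findByAgeLessThan(int age, Sort sort);
}

A caller would then request, say, the first 20 results ordered by first name with new PageRequest(0, 20, new Sort("firstName")).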

Then, using Spring’s dependency injection capabilities, you simply wire the repository into your application. For example:

public class PersonApp {
    @Autowired
    PersonRepository personRepository;

    public Person findPerson(String lastName, String firstName) {
        return personRepository.findByFirstNameAndLastName(firstName, lastName);
    }
}

Essentially, you don’t have to write any data access code! However, you must provide Java annotations on your domain class to configure entity mapping to the data store. For example, if using MongoDB you would map the domain class to a document:

@Document
public class Person {
    int id;
    int age;
    String firstName;
    String lastName;
}

Note that the entity mapping annotations are store-specific. Also, you need to provide some Spring configuration to tell your application how to connect to the data store, in which package(s) to search for Repository interfaces and the like.
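
As an illustration of the kind of configuration meant here, a minimal Java-based setup for Spring Data MongoDB might look like the following sketch; the package name, database name and connection details are hypothetical.

import com.mongodb.MongoClient;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.repository.config.EnableMongoRepositories;

// Tells Spring which package(s) to scan for repository interfaces
// and how to reach the underlying data store.
@Configuration
@EnableMongoRepositories(basePackages = "com.example.repositories")
public class MongoConfig {

    @Bean
    public MongoTemplate mongoTemplate() throws Exception {
        // Connects to a local MongoDB instance and uses the "demo" database.
        return new MongoTemplate(new MongoClient("localhost"), "demo");
    }
}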

The Spring Data team has written an excellent book, including lots of code examples: Spring Data: Modern Data Access for Enterprise Java, recently published by O’Reilly. The project web site also includes many resources to help you get started using Spring Data.

Q5. And for map-reduce frameworks?

David Turanski: Spring Data provides excellent support for developing applications with Apache Hadoop, along with Pig and/or Hive. However, Hadoop applications typically involve a complex data pipeline which may include loading data from multiple sources, pre-processing and real-time analysis while loading data into HDFS, data cleansing, implementing a workflow to coordinate several data analysis steps, and finally publishing data from HDFS to one or more relational or NoSQL data stores.

The complete pipeline can be implemented using Spring for Apache Hadoop along with Spring Integration and Spring Batch. However, Hadoop has its own set of challenges which the Spring for Apache Hadoop project is designed to address. Like all Spring projects, it leverages the Spring Framework to provide a consistent structure and simplify writing Hadoop applications. For example, Hadoop applications rely heavily on command shell tools, so applications end up being a hodge-podge of Perl, Python, Ruby, and bash scripts. Spring for Apache Hadoop provides a dedicated XML namespace for configuring Hadoop jobs, with embedded scripting features and support for Hive and Pig. In addition, Spring for Apache Hadoop allows you to take advantage of core Spring Framework features such as task scheduling, Quartz integration, and property placeholders to reduce lines of code, improve testability and maintainability, and simplify the development process.

Q6. What about cloud based data services? and support for relational database technologies or object-relational mappers?

David Turanski: While there are currently no plans to support cloud based services such as Amazon S3, Spring Data provides a flexible architecture upon which these may be implemented. Relational technologies and ORM are supported via Spring Data JPA. Spring has always provided first class support for relational databases via the JdbcTemplate, using a vendor provided JDBC driver. For ORM, Spring supports Hibernate, any JPA provider, and iBATIS. Additionally, Spring provides excellent support for declarative transactions.

With Spring Data, things get even easier. In a traditional Spring application backed by JDBC, you are required to hand code the Repositories or Data Access Objects. With Spring Data JPA, the data access layer is generated by the framework while persistent objects use standard JPA annotations.
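
For example, under Spring Data JPA the persistent class carries standard JPA annotations and the data access layer is again just an interface. This is a generic sketch, not code from the interview; the entity and method names are made up.

import java.util.List;

import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;

import org.springframework.data.jpa.repository.JpaRepository;

// Customer.java -- a plain JPA entity mapped with standard annotations.
@Entity
class Customer {
    @Id @GeneratedValue
    Long id;
    String firstName;
    String lastName;
}

// CustomerRepository.java -- the repository Spring Data JPA implements at runtime.
interface CustomerRepository extends JpaRepository<Customer, Long> {
    List<Customer> findByLastName(String lastName);
}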

Q7. How can you use Spring to perform:
- Data ingestion from various data sources into Hadoop,
- Orchestrating Hadoop based analysis workflow,
- Exporting data out of Hadoop into relational and non-relational databases

David Turanski: As previously mentioned, a complete big data processing pipeline involving all of these steps will require Spring for Apache Hadoop in conjunction with Spring Integration and Spring Batch.

Spring Integration greatly simplifies enterprise integration tasks by providing a lightweight messaging framework, based on the well-known Enterprise Integration Patterns book by Hohpe and Woolf. Sometimes referred to as the “anti ESB”, Spring Integration requires no runtime component other than a Spring container and is embedded in your application process to handle data ingestion from various distributed sources, mediation, transformation, and data distribution.

Spring Batch provides a robust framework for any type of batch processing and is used to configure and execute scheduled jobs composed of coarse-grained processing steps. Individual steps may be implemented as Spring Integration message flows or Hadoop jobs.

Q8. What is the Spring Data GemFire project?

David Turanski: Spring Data GemFire began life as a separate project from Spring Data, following VMWare’s acquisition of GemStone and its commercial GemFire distributed data grid.
Initially, its aim was to simplify the development of GemFire applications and the configuration of GemFire caches, data regions, and related components. While this was, and still is, developed independently as an open source Spring project, the GemFire product team recognized the value to its customers of developing with Spring and has increased its commitment to Spring Data GemFire. As of the recent GemFire 7.0 release, Spring Data GemFire is being promoted as the recommended way to develop GemFire applications for Java. At the same time, the project was moved under the Spring Data umbrella. We implemented a GemFire Repository and will continue to provide first class support for GemFire.

Q9. Could you give a technical example on how do you simplify the development of building highly scalable applications?

David Turanski: GemFire is a fairly mature distributed, memory oriented data grid used to build highly scalable applications. As a consequence, there is inherent complexity involved in configuration of cache members and data stores known as regions (a region is roughly analogous to a table in a relational database). GemFire supports peer-to-peer and client-server topologies, and regions may be local, replicated, or partitioned. In addition, GemFire provides a number of advanced features for event processing, remote function execution, and so on.

Prior to Spring Data GemFire, GemFire configuration was done predominantly via its native XML support. This works well but is relatively limited in terms of flexibility. Today, configuration of core components can be done entirely in Spring, making simple things simple and complex things possible.

In a client-server scenario, an application developer may only be concerned with data access. In GemFire, a client application accesses data via a client cache and a client region, which act as proxies to provide access to the grid. Such components are easily configured with Spring, and the application code is the same whether data is distributed across one hundred servers or cached locally. This fortunately allows developers to take advantage of Spring’s environment profiles to easily switch to a local cache and region suitable for unit integration tests, which are self-contained and may run anywhere, including automated build environments. The cache resources are configured in Spring XML:

<beans>
    <beans profile="test">
        <gfe:cache/>
        <gfe:local-region name="Person"/>
    </beans>

    <beans profile="default">
        <context:property-placeholder location="cache.properties"/>
        <gfe:client-cache/>
        <gfe:client-region name="Person"/>
        <gfe:pool>
            <gfe:locator host="${locator.host}" port="${locator.port}"/>
        </gfe:pool>
    </beans>
</beans>

Here we see the deployed application (default profile) depends on a remote GemFire locator process. The client region does not store data locally by default but is connected to an available cache server via the locator. The region is distributed among the cache server and its peers and may be partitioned or replicated. The test profile sets up a self contained region in local memory, suitable for unit integration testing.
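
To make the profile switch concrete, a test could activate the “test” profile before loading the same XML file, as in this small sketch (the context file name is illustrative):

import org.springframework.context.support.GenericXmlApplicationContext;

public class CacheProfileDemo {
    public static void main(String[] args) {
        GenericXmlApplicationContext context = new GenericXmlApplicationContext();
        // Selects the self-contained local cache and region instead of the client-server setup.
        context.getEnvironment().setActiveProfiles("test");
        context.load("classpath:cache-context.xml");
        context.refresh();
        // ... obtain the region or a repository from the context and run the tests ...
        context.close();
    }
}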

Additionally, applications may be further simplified by using a GemFire-backed Spring Data Repository. The key difference from the example above is that the entity mapping annotations are replaced with GemFire-specific annotations:

@Region
public class Person {
    int id;
    int age;
    String firstName;
    String lastName;
}

The @Region annotation maps the Person type to an existing region of the same name. The Region annotation provides an attribute to specify the name of the region if necessary.

Q10. The project uses GemFire as a distributed data management platform. Why using an In-Memory Data Management platform, and not a NoSQL or NewSQL data store?

David Turanski: Customers choose GemFire primarily for performance. As an in-memory grid, data access can be an order of magnitude faster than in disk based stores. Many disk based systems also cache data in memory to gain performance; however, your mileage may vary depending on the specific operation and on when disk I/O is needed. In contrast, GemFire’s performance is very consistent. This is a major advantage for a certain class of high volume, low latency, distributed systems. Additionally, GemFire is extremely reliable, providing disk-based backup and recovery.

GemFire also builds in advanced features not commonly found in the NoSQL space. This includes a number of advanced tuning parameters to balance performance and reliability, synchronous or asynchronous replication, advanced object serialization features, flexible data partitioning with configurable data colocation, WAN gateway support, continuous queries, .Net interoperability, and remote function execution.

Q11. Is GemFire a full-fledged distributed database management system, or something else?

David Turanski: Given all its capabilities and proven track record supporting many mission critical systems, I would certainly characterize GemFire as such.
———————————-

David Turanski is a Senior Software Engineer with SpringSource, a division of VMWare. David is a member of the Spring Data team and lead of the Spring Data GemFire project. He is also a committer on the Spring Integration project. David has extensive experience as a developer, architect and consultant serving a variety of industries. In addition he has trained hundreds of developers how to use the Spring Framework effectively.

Related Posts

- On Big Data, Analytics and Hadoop. Interview with Daniel Abadi. December 5, 2012

- Two cons against NoSQL. Part II. November 21, 2012

- Two Cons against NoSQL. Part I. October 30, 2012

- Interview with Mike Stonebraker. May 2, 2012

Resources

- ODBMS.org Lecture Notes: Data Management in the Cloud.
Michael Grossniklaus, David Maier, Portland State University.
Course Description: “Cloud computing has recently seen a lot of attention from research and industry for applications that can be parallelized on shared-nothing architectures and have a need for elastic scalability. As a consequence, new data management requirements have emerged with multiple solutions to address them. This course will look at the principles behind data management in the cloud as well as discuss actual cloud data management systems that are currently in use or being developed. The topics covered in the course range from novel data processing paradigms (MapReduce, Scope, DryadLINQ), to commercial cloud data management platforms (Google BigTable, Microsoft Azure, Amazon S3 and Dynamo, Yahoo PNUTS) and open-source NoSQL databases (Cassandra, MongoDB, Neo4J). The world of cloud data management is currently very diverse and heterogeneous. Therefore, our course will also report on efforts to classify, compare and benchmark the various approaches and systems. Students in this course will gain broad knowledge about the current state of the art in cloud data management and, through a course project, practical experience with a specific system.”
Lecture Notes | Intermediate/Advanced | English | DOWNLOAD ~280 slides (PDF)| 2011-12|

##

Dec 27 12

In-memory OLTP database. Interview with Asa Holmstrom.

by Roberto V. Zicari

“Those who claim they can give you both ACID transactions and linear scalability at the same time are not telling the truth, because it is theoretically proven impossible.” – Asa Holmstrom.

I heard about a start-up called Starcounter. I wanted to know more, so I have interviewed the CEO of the company, Asa Holmstrom.

RVZ

Q1. You just launched Starcounter 2.0 public beta. What is it? and who can already use it?

Asa Holmstrom: Starcounter is a high performance in-memory OLTP database. We have partners who have built applications on top of Starcounter, e.g. an AdServe application and a retail application. Today Starcounter has 60+ customers using it in production.

Q2. You define Starcounter as “memory centric”, using a technique you call “VMDBMS”. What is special about VMDBMS?

Asa Holmstrom: VMDBMS integrates the application runtime virtual machine (VM) with the database management system (DBMS). Data resides in one single place all the time, in RAM; no data is transferred back and forth between the database memory and the temporary storage (object heap) of the application. The VMDBMS makes Starcounter significantly faster than other in-memory databases.

Q3. When you say “the VMDBMS makes Starcounter significantly faster than other in-memory databases”, could you please give some specific benchmarking numbers? Which other in-memory databases did you compare with your benchmark?

Asa Holmstrom: In general we are 100 times faster than any other RDBMS: a factor of 10 comes from being an IMDBMS, and another factor of 10 comes from the VMDBMS.

Q4. How do you handle situations when data in RAM is no more available due to hardware failures?

Asa Holmstrom: In Starcounter the data is just as secure as in any disk-centric database. Image files and the transaction log are stored on disk, and before a transaction is regarded as committed it has been written to the transaction log.
When restarting Starcounter after a crash, a recovery of the database is automatically performed. To guarantee high availability we recommend that our customers have a hot stand-by machine which subscribes to the transaction log.
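
The durability scheme described here, acknowledging a commit only after the transaction has reached the log on disk, is the classic write-ahead-log pattern. The following is a generic Java sketch of that idea, not Starcounter code; the record format is deliberately simplistic.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Generic write-ahead-log sketch: a transaction counts as committed only after
// its log record has been forced to disk. In-memory state is applied afterwards
// and can be rebuilt by replaying the log during recovery after a crash.
public class TransactionLog implements AutoCloseable {

    private final FileChannel channel;

    public TransactionLog(String path) throws IOException {
        channel = FileChannel.open(Paths.get(path),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
    }

    // Durably records one transaction; returns only once the record is on disk.
    public void commit(String record) throws IOException {
        ByteBuffer buffer = ByteBuffer.wrap((record + "\n").getBytes(StandardCharsets.UTF_8));
        while (buffer.hasRemaining()) {
            channel.write(buffer);
        }
        channel.force(true); // flush data and metadata before acknowledging the commit
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}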

Q5. Goetz Graefe, HP fellow, commented in an interview (ref.1) that “disks will continue to have a role and economic value where the database also contains history, including cold history such as transactions that affected the account balances, login & logout events, click streams eventually leading to shopping carts, etc.” What is your take on this?

Asa Holmstrom: Since RAM databases have hardware limitations, in practice about 2 TB, there will still be a need for database storage on disk.

Q6. You claim to achieve high performance and consistent data. Do you have any benchmark results to sustain such a claim?

Asa Holmstrom: Yes, we have made internal benchmark tests to compare performance while keeping data consistent.

Q7: Are some results of your benchmark tests publicly available? If yes, could you please summarize the main results here?

Asa Holmstrom: As TPC tests are not applicable to us, we have done some internal tests. We can’t share them with you.

Q8. What kind of consistency do you implement?

Asa Holmstrom: We support true ACID consistency, implemented using snapshot isolation and fake writes, in a similar way to Oracle.

Q9. How do you achieve scalability?

Asa Holmstrom: ACID transactions are not scalable. All parallel ACID transactions need to be synchronized, and the closer together the transactions are executed in space, the faster the synchronization becomes. Therefore you get the best performance by executing all ACID transactions on one machine. We call this scaling in. When it comes to storage, you scale up a Starcounter database by adding more RAM.

Q10: For which class of applications is it realistic to expect to execute all ACID transactions on one machine?

Asa Holmstrom: For all applications where you want high transactional throughput. When you have parallel ACID transactions you need to synchronize them, and this synchronization becomes harder when you scale out to several different machines. The benefit of scaling out grows linearly with the number of machines, but the cost of synchronization grows quadratically. Consequently you do not gain anything by scaling out. In fact, you get better total transactional throughput by running all transactions in RAM on one machine, which we call “scaling in”. No other database can give you the same total ACID transactional throughput as Starcounter. Those who claim they can give you both ACID transactions and linear scalability at the same time are not telling the truth, because it is theoretically proven impossible. Databases that can give you ACID transactions or linear scalability cannot give you both at the same time.

Q11. How do you define queries and updates?

Asa Holmstrom: We distinguish between read-only transactions and read-write transactions. You can only write (insert/update) database data using a read-write transaction.

Q12. Are you able to handle Big Data analytics with your system?

Asa Holmstrom: Starcounter is optimized for transactional processing, not for analytical processing.

Q13. How does Starcounter differs from other in-memory databases, such as for example SAP HANA, and McObject?

Asa Holmstrom: In general, the primary differentiator between Starcounter and any other in-memory database is the VMDBMS. SAP HANA has primarily an OLAP focus.

Q14. From a user perspective, what is the value proposition of having a VMDBMS as a database engine?

Asa Holmstrom: Unmatched ACID transactional performance.

Q15. How do you differentiate with respect to VoltDB?

Asa Holmstrom: Better ACID transactional performance. VoltDB gives you either ACID transactions (on one machine) or the possibility to scale out without any guarantees of global database consistency (no ACID). Starcounter has a native .Net object interface which makes it easy to use from any .Net language.

Q16. Is Starcounter 2.0 open source? If not, do you have any plan to make it open source?

Asa Holmstrom: We do not have any current plans of making Starcounter open source.

——————
CEO Asa Holmstrom brings to her role at Starcounter more than 20 years of executive leadership in the IT industry. Previously, she served as the President of Columbitech, where she successfully established its operations in the U.S. Prior to Columbitech, Asa was CEO of Kvadrat, a technology consultancy firm. Asa also spent time as a management consultant, focusing on sales, business development and leadership within global technology companies such as Ericsson and Siemens. She holds a bachelor’s degree in mathematics and computer science from Stockholm University.

Related Posts

- Interview with Mike Stonebraker. by Roberto V. Zicari on May 2, 2012

- In-memory database systems. Interview with Steve Graves, McObject. by Roberto V. Zicari on March 16, 2012

(ref. 1)- The future of data management: “Disk-less” databases? Interview with Goetz Graefe. by Roberto V. Zicari on August 29, 2011

Resources

- Cloud Data Stores – Lecture Notes: Data Management in the Cloud.
by Michael Grossniklaus, David Maier, Portland State University.
Course Description: “Cloud computing has recently seen a lot of attention from research and industry for applications that can be parallelized on shared-nothing architectures and have a need for elastic scalability. As a consequence, new data management requirements have emerged with multiple solutions to address them. This course will look at the principles behind data management in the cloud as well as discuss actual cloud data management systems that are currently in use or being developed.
The topics covered in the course range from novel data processing paradigms (MapReduce, Scope, DryadLINQ), to commercial cloud data management platforms (Google BigTable, Microsoft Azure, Amazon S3 and Dynamo, Yahoo PNUTS) and open-source NoSQL databases (Cassandra, MongoDB, Neo4J).
The world of cloud data management is currently very diverse and heterogeneous. Therefore, our course will also report on efforts to classify, compare and benchmark the various approaches and systems.
Students in this course will gain broad knowledge about the current state of the art in cloud data management and, through a course project, practical experience with a specific system.”
Lecture Notes | Intermediate/Advanced | English | DOWNLOAD ~280 slides (PDF)| 2011-12|

##

Dec 20 12

What has coffee to do with managing data? An Interview with Alon Halevy.

by Roberto V. Zicari

What has coffee to do with managing data? Probably nothing, but at the same time how many developers do I know who drink coffee? Quite a lot.
I myself decided years ago to reduce my intake of coffee, and I now drink mostly green tea, but not early in the morning, when I need an espresso to get started :-)

So, once I found out that my fellow database colleague Alon Halevy, very well known for his work on data management, also wrote a (seemingly successful) book on coffee, I had to ask him a few questions.

RVZ

Q1. How did you end up writing a book on coffee?

Alon Halevy: It was a natural outcome of going to many conferences on database management. I realized that anywhere I go in the world, the first thing I look for are nice cafes in town. I also noticed that the coffee culture varies quite a bit from one city to another, and started wondering about how culture affects coffee and vice versa. I think it was during VLDB 2008 in Auckland, New Zealand, that I realized that I should investigate the topic in much more detail.

Q2. When did you start this project?

Alon Halevy: I was toying around with the idea for a while, but did very little. One day in March, 2009, recalling a conversation with a friend, I googled “barista competitions”. I found out that the US Barista Championship was taking place in Portland in a week’s time. I immediately bought a ticket to Portland and a week later I was immersed in world-class barista culture. From that point on, there was no turning back.

Q3. What is the book about?

Alon Halevy: In the book I go to over 25 countries on 6 continents and ask what is special about coffee culture in that country. I typically tell the story of the culture through a fascinating character or a unique cafe. I cover places such as the birthplace of coffee, Ethiopia, where I spend time in a farming community; in Iceland, where the coffee culture developed because of a bored spouse of an engineering Ph.D student at the University of Wisconsin, Madison; and Bosnia, where coffee pervades culture like nowhere else in the world. The book includes many photos and data visualizations (of course!) and makes you want to travel the world to experience some of these great places.

Q4. How did you find the time to visit the main locations you write in your book?

Alon Halevy: Many of the trips were, of course, piggybacked off a database conference or other speaking engagements. After a while, I was so determined to finish the book that I simply got on a plane and covered the regions that don’t attract database conferences (e.g., Central America, Ethiopia). Over time, I became very efficient when I arrived at a destination (and the rumor that I usually bring Google goodies with me started spreading through the coffee community). I developed a rich network of coffee friends, and everywhere I went I was taken directly to the most interesting places.

Q5. So what is the best coffee in the world :-)?

Alon Halevy: That is obviously very subjective, and depends more than anything on who you’re having the coffee with! In terms of sheer quality, I think some of the Scandinavian roasters are doing an amazing job. I was really impressed by the coffee culture in Melbourne, Australia. Japan and Korea were also unexpected highlights. But if you know where to go, you can find amazing coffee almost anywhere. Once you start sensing coffee in “high definition”, tasting the variety of flavors and notes in the brew, coffee becomes a fascinating journey.

Q6. Would you say that coffee is an important part of your work?

Alon Halevy: Yes, even putting aside the fact that no work gets done before coffee is consumed, the coffee community has given me an interesting perspective on computing. Coffee folks are enthusiastic users of technology, as they are a globally distributed community that is constantly on the move (and have a lot to say thanks to their elevated caffeine levels). It is also interesting to think of how technology, and data management techniques in particular, can be used to help this global industry. I hope to investigate these issues more in the future (I’m already being tapped by the coffee community for various database advice often).

Q7. Coffee can be bad, if you drink it in excess. Do you cover this aspect in your book? (I assume not), but what is your take on this?

Alon Halevy: There is a lot of debate on the benefits of coffee in the scientific literature. If you consume coffee in moderation (2-3 cups a day), you’re fine (unless you have some abnormal condition). But remember, I am a Doctor of Computer Science, not Medicine.
—————–

The Infinite Emotions of Coffee, by Alon Halevy.

You can download a sample chapter of the book: “COFFEE: A Competitive Sport” (Link to download .pdf)

Details on where to get the book: the book’s page.
(The digital version is available on iTunes, Kindle and the Google bookstore) .

##

——————————————

Dec 5 12

On Big Data, Analytics and Hadoop. Interview with Daniel Abadi.

by Roberto V. Zicari

“Some people even think that “Hadoop” and “Big Data” are synonymous (though this is an over-characterization). Unfortunately, Hadoop was designed based on a paper by Google in 2004 which was focused on use cases involving unstructured data (e.g. extracting words and phrases from Webpages in order to create Google’s Web index). Since it was not originally designed to leverage the structure in relational data in order to take short-cuts in query processing, its performance for processing relational data is therefore suboptimal.”– Daniel Abadi.

On the subject of Big Data, Analytics and Hadoop, I have interviewed Daniel Abadi, associate professor of computer science at Yale University and Chief Scientist and Co-founder of Hadapt.

RVZ

Q1. On the subject of “eventual consistency”, Justin Sheehy of Basho recently said in an interview (1): “I would most certainly include updates to my bank account as applications for which eventual consistency is a good design choice. In fact, bankers have understood and used eventual consistency for far longer than there have been computers in the modern sense”.
What is your opinion on this? Would you wish to have an “eventual consistency” update to your bank account?

Daniel Abadi: It’s definitely true that it’s a common misconception that banks use the ACID features of database systems to perform certain banking transactions (e.g. a transfer from one customer to another customer) in a single atomic transaction. In fact, they use complicated workflows that leverage semantics that reasonably approximate eventual consistency.

Q2. Justin Sheehy also said in the same interview that “the use of eventual consistency in well-designed systems does not lead to inconsistency.” How do you ensure that a system is well designed, then?

Daniel Abadi: That’s exactly the problem with eventual consistency. If your database only guarantees eventual consistency, you have to make sure your application is well-designed to resolve all consistency conflicts.
For example, eventually consistent systems allow you to update two different replicas to two different values simultaneously. When these updates propagate to each other, you get a conflict. Application code has to be smart enough to deal with any possible kind of conflict, and resolve them correctly. Sometimes a simple policy like “last update wins” is sufficient to handle all cases.
But other applications are far more complicated, and using an eventually consistent database makes the application code far more complex. Often this complexity can lead to errors and security flaws, such as the ATM heist that was written up recently. Having a database that can make stronger guarantees greatly simplifies application design.
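
As a toy illustration of the kind of resolution logic the application has to supply, a “last update wins” merge of two timestamped replica values could look like the sketch below. It is deliberately simplified: it assumes timestamps are comparable across replicas, which real systems cannot take for granted, hence techniques such as vector clocks and sibling values.

// Toy "last update wins" resolver for two conflicting replica values.
public class LastWriteWins {

    public static final class Versioned<T> {
        final T value;
        final long timestampMillis;

        public Versioned(T value, long timestampMillis) {
            this.value = value;
            this.timestampMillis = timestampMillis;
        }
    }

    public static <T> Versioned<T> resolve(Versioned<T> a, Versioned<T> b) {
        // Keep whichever update carries the later timestamp; ties go to the first argument.
        return (a.timestampMillis >= b.timestampMillis) ? a : b;
    }
}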

Q3. You have created a start-up called Hadapt, which claims to be the “first platform to combine Apache Hadoop and relational DBMS technologies”. What is it? Why combine Hadoop with relational database technologies?

Daniel Abadi: Hadoop is becoming the standard platform for doing large scale processing of data in the enterprise. Its rate of growth far exceeds that of any other “Big Data” processing platform. Some people even think that “Hadoop” and “Big Data” are synonymous (though this is an over-characterization). Unfortunately, Hadoop was designed based on a paper by Google in 2004 which was focused on use cases involving unstructured data (e.g. extracting words and phrases from Webpages in order to create Google’s Web index).
Since it was not originally designed to leverage the structure in relational data in order to take short-cuts in query processing, its performance for processing relational data is therefore suboptimal.

At Hadapt, we’re bringing 3 decades of relational database research to Hadoop. We have added features like indexing, co-partitioned joins, broadcast joins, and SQL access (with interactive query response times) to Hadoop, in order to both accelerate its performance for queries over relational data and provide an interface that third-party data processing and business intelligence tools are familiar with.
Therefore we have taken Hadoop, which used to be just a tool for super-smart data scientists, and brought it to the mainstream by providing a high performance SQL interface that business analysts and data analysis tools already know how to use. However, we’ve gone a step further and made it possible to include both relational data and non-relational data in the same query; so what we’ve got now is a platform that people can use to do really new and innovative types of analytics involving both unstructured data, like tweets or blog posts, and structured data, such as traditional transactional data that usually sits in relational databases.

Q4. What is special about the Hadapt architecture that makes it different from other Hadoop-based products?

Daniel Abadi: The prevalent architecture that people use to analyze structured and unstructured data is a two-system configuration, where Hadoop is used for processing the unstructured data and a relational database system is used for the structured data. However, this is a highly undesirable architecture, since now you have two systems to maintain, two systems where data may be stored, and if you want to do analysis involving data in both systems, you end up having to send data over the network which can be a major bottleneck. What is special about the Hadapt architecture is that we are bringing database technology to Hadoop, so that Hadapt customers only need to deploy a single cluster — a normal Hadoop cluster — that is optimized for both structured and unstructured data, and is capable of pushing the envelope on the type of analytics that can be run over Big Data.

Q5. You claim that Hadapt SQL queries are an order of magnitude faster than Hadoop+Hive. Do you have any evidence for that? How significant is this comparison for an enterprise?

Daniel Abadi: We ran some experiments in my lab at Yale using the queries from the TPC-H benchmark and compared the technology behind Hadapt to both Hive and also traditional database systems.
These experiments were peer-reviewed and published in SIGMOD 2011. You can see the full 12-page paper here. From my point of view, query performance is certainly important for the enterprise, but not as important as the new type of analytics that our platform opens up, and also the enterprise features around availability, security, and SQL compliance that distinguish our platform from Hive.

Q6. You say that “Hadoop-connected SQL databases” do not eliminate “silos”. What do you mean by silos? And what does it mean in practice for an enterprise to have silos?

Daniel Abadi: A lot of people are using Hadoop as a sort of data refinery. Data starts off unstructured, and Hadoop jobs are run to clean, transform, and structure the data. Once the data is structured, it is shipped to SQL databases where it can be subsequently analyzed. This leads to the raw data being left in Hadoop and the refined data in the SQL databases. But it’s basically the same data — one is just a cleaned (and potentially aggregated) version of the other. Having multiple copies of the data can lead to all kinds of problems. For example, let’s say you want to update the data in one of the two locations — it does not get automatically propagated to the copy in the other silo. Furthermore, let’s say you are doing some analysis in the SQL database and you see something interesting and want to drill down to the raw data — if the raw data is located on a different system, such a drill down becomes highly nontrivial. Furthermore, data provenance is a total nightmare. It’s just a really ugly architecture to have these two systems with a connector between them.

Q7. What is “multi-structured” data analytics?

Daniel Abadi: I would define it as analytics that include both structured data and unstructured data in the same “query” or “job”. For example, Hadapt enables keyword search, advanced analytic functions, and common machine learning algorithms over unstructured data, all of which can be invoked from a SQL query that includes data from relational tables stored inside the same Hadoop cluster. A great example of how this can be used in real life can be seen in the demo given by Data Scientist Mingsheng Hong.
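Hadapt’s exact SQL extensions are not spelled out here, so the snippet below only illustrates the general shape of such a “multi-structured” query: a relational table joined with the output of a hypothetical keyword-search function over documents stored in the same cluster, issued through a standard Python DB-API connection. All table, column, and function names are invented for illustration.

    # Illustrative only: `search_documents` and the table/column names are
    # placeholders, not Hadapt's actual API; `conn` is any DB-API connection.
    MULTI_STRUCTURED_QUERY = """
        SELECT c.customer_id,
               c.lifetime_value,
               d.doc_id,
               d.snippet
        FROM   customers AS c
        JOIN   TABLE(search_documents('outage OR refund')) AS d
               ON d.customer_id = c.customer_id
        WHERE  c.region = 'EMEA'
        ORDER  BY c.lifetime_value DESC
    """

    def top_complaining_customers(conn):
        """Run one query that touches both structured rows and unstructured text."""
        cur = conn.cursor()
        cur.execute(MULTI_STRUCTURED_QUERY)
        return cur.fetchall()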

Q8. On the subject of Big Data Analytics, Werner Vogels of Amazon.com said in an interview [2] that “one of the core concepts of Big Data is being able to evolve analytics over time. In the new world of data analysis your questions are going to evolve and change over time and as such you need to be able to collect, store and analyze data without being constrained by resources.”
What is your opinion on this? How can Hadapt help with this?

Daniel Abadi: Yes, I agree. This is precisely why you need a system like Hadapt that has extreme flexibility in terms of the types of data that it can analyze (see above) and types of analysis that can be performed.

Q9. With your research group at Yale you are developing a new data management prototype system called Calvin. What is it? What is special about it?

Daniel Abadi: Calvin is a system designed and built by a large team of researchers in my lab at Yale (Alex Thomson, Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, Anton Petrov, Michael Giuffrida, and Aaron Segal) that scales transactional processing in database systems to millions of transactions per second while still maintaining ACID guarantees.
If you look at what is available in the industry today for scaling a transactional workload, you have all these NoSQL systems like HBase or Cassandra or Riak — all of which have to eliminate ACID guarantees in order to achieve scalability. On the other hand, you have NewSQL systems like VoltDB or various sharded MySQL implementations that maintain ACID but can only scale when transactions rarely cross physical server boundaries. Calvin is unique in the way that it uses a deterministic execution model to eliminate the need for commit protocols (such as two-phase commit) and allow arbitrary transactions (that can include data on multiple different physical servers) to scale while still maintaining the full set of ACID guarantees. It’s a cool project. I recommend your readers check out the most recent research paper describing the system.
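Calvin’s real design is in the paper referenced below; purely as intuition, the toy sketch that follows shows why agreeing on a global transaction order up front removes the need for a per-transaction commit protocol: every replica applies the same log in the same order with deterministic logic, so all replicas converge on identical state without coordinating.

    # Toy illustration of deterministic execution, not Calvin itself.
    def apply(state, txn):
        """Deterministically apply one transfer transaction to an account map."""
        src, dst, amount = txn
        if state.get(src, 0) >= amount:   # the same check runs on every replica
            state[src] = state.get(src, 0) - amount
            state[dst] = state.get(dst, 0) + amount

    # A sequencing layer (here just a list) fixes a global order of transactions.
    global_log = [("alice", "bob", 30), ("bob", "carol", 50), ("alice", "carol", 80)]

    replica_1 = {"alice": 100, "bob": 40, "carol": 0}
    replica_2 = {"alice": 100, "bob": 40, "carol": 0}
    for txn in global_log:
        apply(replica_1, txn)
        apply(replica_2, txn)

    assert replica_1 == replica_2  # same order + deterministic logic = same state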

Q10. Looking at three elements: Data, Platform, Analysis, what are the main research challenges ahead?

Daniel Abadi: Here are a few that I think are interesting:
(1) Scalability of non-SQL analytics. How do you parallelize clustering, classification, statistical, and algebraic functions that are not “embarrassingly parallel” (and that have traditionally been performed on a single server in main memory) over a large cluster of shared-nothing servers?
(2) Reducing the cognitive complexity of “Big Data” so that it can fit in the working set of the brain of a single analyst who is wrangling with the data.
(3) Incorporating graph data sets and graph algorithms into database management systems.
(4) Enabling platform support for probabilistic data and probabilistic query processing.

———————-
Daniel Abadi is an associate professor of computer science at Yale University where his research focuses on database system architecture and implementation, and cloud computing. He received his PhD from MIT, where his dissertation on column-store database systems led to the founding of Vertica (which was eventually acquired by Hewlett Packard).
His research lab at Yale has produced several scalable database systems, the best known of which is HadoopDB, which was subsequently commercialized by Hadapt, where Professor Abadi serves as Chief Scientist. He is a recipient of a Churchill Scholarship, an NSF CAREER Award, a Sloan Research Fellowship, the 2008 SIGMOD Jim Gray Doctoral Dissertation Award, and the 2007 VLDB best paper award.

———-

Related Posts

[1] On Eventual Consistency- An interview with Justin Sheehy. by Roberto V. Zicari on August 15, 2012

[2] On Big Data: Interview with Dr. Werner Vogels, CTO and VP of Amazon.com. by Roberto V. Zicari on November 2, 2011

Resources

- Efficient Processing of Data Warehousing Queries in a Split Execution Environment (.pdf)

- Calvin: Fast Distributed Transactions for Partitioned Database Systems (.pdf)

- Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing), 2010 (.pdf)

- MapReduce vs RDBMS [PPT]

- Big Data and Analytical Data Platforms:
Blog Posts | Free Software | Articles | PhD and Master Thesis|

##

Nov 21 12

Two cons against NoSQL. Part II.

by Roberto V. Zicari

This post is the second part of a series of feedback I received from various experts, with obviously different points of view, on:

Two cons against NoSQL data stores:

Cons1. “It’s very hard to move data out from one NoSQL system to some other system, even another NoSQL system. There is a very hard lock-in when it comes to NoSQL. If you ever have to move to another database, you basically have to re-implement a lot of your applications from scratch.”

Cons2. “There is no standard way to access a NoSQL data store. All the tools that already exist for SQL have to be re-created for each of the NoSQL databases. This means that it will always be harder to access data in NoSQL than in SQL. For example, how many NoSQL databases can export their data to Excel? (Something every CEO wants to get sooner or later.)”

You can also read Part I here.

RVZ
————

J Chris Anderson, Couchbase, cofounder and Mobile Architect:
On Cons1: JSON is the de facto format for APIs these days. I’ve found moving between NoSQL stores to be very simple, just a matter of swapping out a driver. It is also usually quite simple to write a small script to migrate the data between databases. Now, there aren’t pre-packaged tools for this yet, but it’s typically one page of code to do. There is some truth that if you’ve tightly bound your application to a particular query capability, there might be some more work involved, but at least you don’t have to redo your stored data structures.
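As a rough sketch of the kind of one-page migration script being described (database, collection, and file names are assumptions): stream JSON documents out of one store into a newline-delimited file that another store’s bulk loader, or a second short script, can ingest. The example reads from MongoDB via pymongo.

    from bson import json_util          # ships with pymongo; handles ObjectId, dates
    from pymongo import MongoClient

    source = MongoClient("mongodb://localhost:27017")["crm"]["customers"]

    with open("customers.jsonl", "w") as out:
        for doc in source.find():
            doc.pop("_id", None)                    # drop the store-specific key if desired
            out.write(json_util.dumps(doc) + "\n")  # one JSON document per line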

On Cons2: I’m from the “it’s a simple matter of programming” school of thought: just write the query you need, and a little script to turn it into CSV. If you want to do all of this without writing code, then of course the industry isn’t as mature as RDBMS. It’s only a few years old, not decades. But this isn’t a permanent structural issue, it’s just an artifact of the relative freshness of NoSQL.
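And the “little script to turn it into CSV” can indeed be tiny, along the lines of the hedged sketch below: flatten whatever query results your driver returns (a list of dicts here) into fixed columns and hand the file to Excel. The field names and sample rows are placeholders.

    import csv

    # Pretend these came back from whatever query you ran against your NoSQL store.
    results = [
        {"order_id": 1, "customer": "acme",   "total": 240.0},
        {"order_id": 2, "customer": "globex", "total": 99.5, "coupon": "WINTER"},
    ]

    fields = ["order_id", "customer", "total", "coupon"]
    with open("orders.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(results)   # missing keys simply become empty cells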

—-
Marko Rodriguez, Aurelius, cofounder:
On Cons1: NoSQL is not a mirror image of SQL. SQL databases such as MySQL, Oracle, PostgreSQL all share the same data model (table) and query language (SQL).
In the NoSQL space, there are numerous data models (e.g. key/value, document, graph) and numerous query languages (e.g. JavaScript MapReduce, Mongo Query Language, Gremlin).
As such, the vendor lock-in comment should be directed to a particular type of NoSQL system, not to NoSQL in general. Next, in the graph space, there are efforts to mitigate vendor lock-in. TinkerPop provides a common set of interfaces that allow the various graph computing technologies to work together and it allows developers to “plug and play” the underlying technologies as needed. In this way, for instance, TinkerPop’s Gremlin traversal language works for graph systems such as OrientDB, Neo4j, Titan, InfiniteGraph, RDF Sail stores, and DEX, to name several.
To stress the point, with TinkerPop and graphs, there is no need to re-implement an application as any graph vendor’s technology is interoperable and swappable.
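For readers who have not seen Gremlin, the hedged sketch below shows what a vendor-neutral traversal looks like using TinkerPop’s Python client (gremlinpython, which postdates this interview); pointing it at a different TinkerPop-enabled graph database means changing only the connection endpoint. The graph contents and server URL are assumptions.

    # pip install gremlinpython; assumes a Gremlin Server at the URL below.
    from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
    from gremlin_python.process.anonymous_traversal import traversal

    conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
    g = traversal().withRemote(conn)

    # "Which companies do the people that 'marko' knows work for?" (toy graph)
    employers = (
        g.V().has("person", "name", "marko")
             .out("knows")
             .out("worksFor")
             .values("company")
             .toList()
    )
    print(employers)
    conn.close()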

On Cons2: Much of the argument above holds for this comment as well. Again, one should not see NoSQL as the space, but the particular subset of NoSQL (by data model) as the space to compare against SQL (table). In support of SQL, SQL and the respective databases have been around for much longer than most of the technologies in the NoSQL space. This means they have had longer to integrate into popular data workflows (e.g. Ruby on Rails). However, it does not mean “that it will always be harder to access data in NoSQL than from SQL.” New technologies emerge, they find their footing within the new generation of technologies (e.g. Hadoop) and novel ways of processing/exporting/understanding data emerge. If SQL was the end of the story, it would have been the end of the story.

—–
David Webber, Oracle, Information Technologist:
On Cons1: Well, it makes sense. Of course, it depends on what you are using the NoSQL store for – if it is a niche application or an innovative solution, then a “one off” may not be an issue for you. Do you really see people using NoSQL as their primary data store? As with any technology, knowing when to apply it successfully is always the key, and these aspects of portability help inform when NoSQL is appropriate. There are obviously more criteria as well that people should reference to understand when NoSQL would be suitable for their particular application. The good news is that there are solid and stable choices available should they decide NoSQL is their appropriate option. BTW, in the early days of SQL too – even with the ANSI standard – it was a devil to port across SQL implementations: not just syntax, but performance and technique issues. I know – I did three such projects!

—-
Wiqar Chaudry, NuoDB, Tech Evangelist :
On Cons1: The answer to the first scenario is relatively straightforward. There are many APIs (like REST) and third-party ETL tools that now support popular NoSQL databases. The right way to think about this is to put yourself in the shoes of multiple different users. If you are a developer, then it should be relatively simple; if you are a non-developer, then it comes down to what third-party tools you have access to and which of those you are familiar with. Re-educating yourself in order to migrate can be time-consuming, however, if you have never used these tools. In terms of migrating applications from one NoSQL technology to another, this is largely dependent on how well the data access layer has been abstracted from the physical database. Unfortunately, since there is limited or no support for ORM technologies, this can indeed be a daunting task.

On Cons2: This is a fair assessment of NoSQL. It is limited when it comes to third-party tools and integration. So you will be spending time doing custom design.
However, it’s also important to note that the NoSQL movement was really born out of necessity. For example, technologies such as Cassandra were designed by private companies to solve a specific problem that a particular company was facing. Then the industry saw what NoSQL could do, and everyone tried to adopt the technology as a general-purpose database. With that said, what many NoSQL companies have ignored is the tremendous opportunity to take from SQL-based technologies the goodness that is applicable to 21st-century database needs.

—–
Robert Greene, Versant, Vice President, Technology:
On Cons1: Yes, I agree that this is difficult with most NoSQL solutions and that is a problem for adopters.
Versant has taken the position of trying to be first to deliver enterprise connectivity and standards into the NoSQL community. Of course, we can only take that so far, because many of the concepts that make NoSQL attractive to adopters simply do not have an equivalent in the relational database world. For example, horizontal scale-out capabilities are only loosely defined for relational technologies, and certainly not standardized. Specifically in terms of moving data in and out of other systems, Versant has developed a connector for Talend Open Studio, which has connectivity to over 400 relational and non-relational data sources, making it easy to move data in and out of Versant depending on your needs. For the case of Excel, while it is certainly not our fastest interface, having recognized the need for data access from accepted tools, Versant has developed ODBC/JDBC capabilities which can be used to get data from Versant databases into things like Excel, Toad, etc.

On Cons2: Yes, I also agree that this is a problem for most NoSQL solutions, and again Versant is moving to bring better standards-based programming APIs to the NoSQL community. For example, in our Java language interface, we support JPA (Java Persistence API), which is the interface application developers get whenever they download the Java SDK. They can create an application using JPA and execute it against a Versant NoSQL database without implementing any relational mapping annotations or XML.
Versant thinks this is a great low-risk way for enterprise developers to test out the benefits of NoSQL. For example, if Versant does not perform much faster than the relational databases, run on much less hardware, and scale out effectively to multiple commodity servers, then they can simply take Hibernate or OpenJPA, EclipseLink, etc., drop it into place, do the mapping exercise and then point it at their relational database with nothing lost in productivity.
In the .NET world, we have an internal implementation that supports LINQ and will be made available in the near future to interested developers. We are also supporting other standards in the area of production management, having SNMP capabilities so we can be integrated into tools like OpenView and others where IT folks can get a unified view of all their production systems.

I think we as an engineering discipline should not forget the lessons we learned in the early 2000s. Some pretty smart people helped many realize that what is important is the life cycle of your application objects, some of which are persistent, and that what matters is providing the appropriate abstraction for things like transaction demarcation, caching activation, state tracking (new, changed, deleted), etc. These are all features common to any application, and developers can easily abstract them away to be independent of the database implementation, just like we did in the ORM days. It’s what we do as good software engineers: find the right abstractions and refine and reuse them over time. It is important that the NoSQL vendors embrace such an approach to ease the development burden of the practitioners who will use the technology.

—-
Jason Hunter, MarkLogic, Chief Architect:
On Cons1: When choosing a database, being future-proof is definitely something to consider. You never know where requirements will take you or what future technologies you’ll want to leverage. You don’t want your data locked into a proprietary format that’s going to paint you into a corner and reduce your options. That’s why MarkLogic chose XML (and now JSON also) as its internal data format. It’s an international standard. It’s plain-text, human readable, fully internationalized, widely deployed, and supported by thousands upon thousands of products. Customers choose MarkLogic for several reasons, but a key reason is that the underlying XML data format will still be understood and supported by vendors decades in the future. Furthermore, I think the first sentence above could be restated, “It’s very hard to move the data out from one SQL to some other system, even other SQL.” Ask anyone who’s tried!

On Cons2: People aren’t choosing NoSQL databases because they’re unhappy with the SQL language. They’re picking them because NoSQL databases provide a combination of feature, performance, cost, and flexibility advantages. Customers don’t pick MarkLogic to run away from SQL, they pick MarkLogic because they want the advantages of a document store, the power of integrated text search, the easy scaling and cost savings of a shared-nothing architecture, and the enterprise reliability of a mature product. Yes, there’s a use case for exporting data to Excel. That’s why MarkLogic has a SQL interface as well as REST and Java interfaces. The SQL interface isn’t the only interface, nor is it the most powerful (it limits MarkLogic to the subset of functionality expressible in SQL), but it provides an integration path.

Related Posts
- Two Cons against NoSQL. Part I. by Roberto V. Zicari on October 30, 2012

Resources
NoSQL Data Stores:
Blog Posts | Free Software | Articles, Papers, Presentations| Documentations, Tutorials, Lecture Notes | PhD and Master Thesis

##

Nov 12 12

Big Data Analytics– Interview with Duncan Ross

by Roberto V. Zicari

“The biggest technical challenge is actually the separation of the technology from the business use! Too often people are making the assumption that big data is synonymous with Hadoop, and any time that technology leads business, things become difficult.” –Duncan Ross.

I asked Duncan Ross (Director Data Science, EMEA, Teradata), what is in his opinion the current status of Big Data Analytics industry.

RVZ

Q1. What is in your opinion the current status of Big Data Analytics Industry?

Duncan Ross: The industry is still in an immature state, dominated by a single technology whilst at the same time experiencing an explosion of different technological solutions. Many of the technologies are far from robust or enterprise-ready, often requiring significant technical skills to support the software even before analysis is attempted.
At the same time there is a clear shortage of analytical experience to take advantage of the new data. Nevertheless the potential value is becoming increasingly clear.

Q2. What are the main technical challenges for Big Data analytics?

Duncan Ross: The biggest technical challenge is actually the separation of the technology from the business use! Too often people are making the assumption that big data is synonymous with Hadoop, and any time that technology leads business, things become difficult. Part of this is the difficulty of use that comes with the technology. It’s reminiscent of the command-line technologies of the 70s – it wasn’t until the GUI became popular that computing could take off.

Q3. Is BI really evolving into Consumer Intelligence? Could you give us some examples of existing use cases?

Duncan Ross: BI and big data analytics are far more than just Consumer Intelligence. Already more than 50% of IP traffic is non-human, and M2M will become increasingly important. From the connected vehicle alone we’re already seeing behaviour-based insurance pricing and condition-based maintenance. Individual movement patterns are being used to detect the early onset of illness.
New measures of voice of the customer are allowing companies to reach out beyond their internal data to understand the motivations and influence of their customers. We’re also seeing the growth of data philanthropy, with these approaches being used to benefit charities and not-for-profits.

Q4. How do you handle large volume of data? When dealing with petabytes of data, how do you ensure scalability and performance?

Duncan Ross: Teradata has years of experience dealing with petabyte-scale data. The core of both our traditional platform and the Teradata Aster big data platform is a shared-nothing MPP system with a track record of proven linear scalability. For low-information-density data we provide a direct link to HDFS and work with partners such as Hortonworks.

Q5. How do you analyze structured data; semi-structured data, and unstructured data at scale?

Duncan Ross: The Teradata Aster technology combines the functionality of MapReduce within the well-understood framework of ANSI SQL, allowing complex programmatic analysis to sit alongside more traditional data mining techniques. Many MapReduce functions have been simplified (from the users’ perspective) and can be called easily and directly – but more advanced users are free to write their own functions. By parallelizing the analysis within the database you get extremely high scalability and performance.

Q6. How do you reduce the I/O bandwidth requirements for big data analytics workloads?

Duncan Ross: Two methods: firstly, by matching the analytical approach to the technology – set-based analysis using traditional SQL-based approaches, and programmatic and iterative analysis using MapReduce-style approaches.
Secondly, by matching data ‘temperature’ to different storage media: hot data on SSD, cool data on fast disk drives, and cold data on cheap, large (slow) drives.
The skill is to automatically rebalance without impacting users!
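A toy sketch of the ‘temperature’ idea (not Teradata’s actual mechanism, and with invented thresholds): classify each block of data by how recently and how often it is touched, and map that class to a storage tier. The hard part alluded to above is doing the rebalancing continuously without users noticing.

    import time

    def storage_tier(last_access_ts: float, accesses_last_30d: int) -> str:
        """Map an access pattern to a storage tier (illustrative thresholds only)."""
        age_days = (time.time() - last_access_ts) / 86_400
        if age_days < 1 or accesses_last_30d > 1_000:
            return "hot: SSD"
        if age_days < 30:
            return "warm: fast disk"
        return "cold: large slow disk"

    print(storage_tier(time.time() - 3_600, accesses_last_30d=50))       # hot: SSD
    print(storage_tier(time.time() - 90 * 86_400, accesses_last_30d=2))  # cold: large slow disk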

Q7. What is the tradeoff between Accuracy and Speed that you usually need to make with Big Data?

Duncan Ross: In the world of data mining this isn’t really a problem, as our approaches are based around sampling anyway. A more important distinction is between speed of analysis and business demand. We’re entering a world where data requires us to work in a far more agile way than we have in the past.

Q8. Brewer’s CAP theorem says that for really big distributed database systems you can have only 2 out of 3 from Consistency (“atomic”), Availability and (network) Partition Tolerance. Do you have practical evidence of this? And if yes, how?

Duncan Ross: No. Although it may be true for an arbitrarily big system, in most real-world cases this isn’t too much of a problem.

Q9. Hadoop is a batch processing system. How do you handle Big Data Analytics in real time (if any)?

Duncan Ross: Well, we aren’t using Hadoop, and as I commented earlier, equating Hadoop with Big Data is a dangerous assumption. Many analyses do not require anything approaching real time, but as closeness to an event becomes more important we can look to scoring within an EDW environment or even embedding code within an operational system.
To do this, of course, you need to understand the eventual use of your analysis when starting out.
A great example of this approach is eBay’s Hadoop-Singularity-EDW configuration.

Q10. What is the typical in-database support for analytics operations?

Duncan Ross: It’s clear that moving the analysis to the data is more beneficial than moving the data to the analysis. Teradata has great experience in this area.
We have examples of fast Fourier transforms, predictive modelling, and parameterised modelling all happening in highly parallel ways within the database. I once built and estimated 64,000 models in parallel for a project.

Q11. Cloud computing: What role does it play in Analytics? What are the main differences between ground and cloud analytics?

Duncan Ross: The cloud is a useful approach for evaluating and testing new approaches, but it has some significant drawbacks in terms of data security. Of course, there is a huge difference between public and private cloud solutions.
………………
Duncan Ross, Director Data Science, EMEA, Teradata.
Duncan has been a data miner since the mid 1990s. He was Director of Advanced Analytics at Teradata until 2010, leaving to become Data Director of Experian UK. He recently rejoined Teradata to lead their European Data Science team.

At Teradata he has been responsible for developing analytical solutions across a number of industries, including warranty and root cause analysis in manufacturing, and social network analysis in telecommunications.
These solutions have been developed directly with customers and have been deployed against some of the largest consumer bases in Europe.

In his spare time Duncan has been a city Councillor and chair of a national charity, has founded an award-winning farmers’ market, and is one of the founding Directors of the Society of Data Miners.

Related Posts

- Managing Big Data. An interview with David Gorbet by Roberto V. Zicari on July 2, 2012

- Big Data: Smart Meters — Interview with Markus Gerdes. by Roberto V. Zicari on June 18, 2012

- Big Data for Good. by Roberto V. Zicari on June 4, 2012

- Analytics at eBay. An interview with Tom Fastner. by Roberto V. Zicari on October 6, 2011

Big Data Resources
Big Data and Analytical Data Platforms: Blog Posts | Free Software | Articles | PhD and Master Thesis |

##

Oct 30 12

Two Cons against NoSQL. Part I.

by Roberto V. Zicari

Two cons against NoSQL data stores read like this:

1. It’s very hard to move data out from one NoSQL system to some other system, even another NoSQL system. There is a very hard lock-in when it comes to NoSQL. If you ever have to move to another database, you basically have to re-implement a lot of your applications from scratch.

2. There is no standard way to access a NoSQL data store.
All the tools that already exist for SQL have to be re-created for each of the NoSQL databases. This means that it will always be harder to access data in NoSQL than in SQL. For example, how many NoSQL databases can export their data to Excel? (Something every CEO wants to get sooner or later.)

These are valid points. I wanted to start a discussion on this.
This post is the first part of a series of feedback I received from various experts, with obviously different points of view.

I plan to publish Part II, with more feedback later on.

You are welcome to contribute to the discussion by leaving a comment if you wish!

RVZ
————

1. It’s very hard to move the data out from one NoSQL system to some other system, even another NoSQL system.

Dwight Merriman (founder of 10gen, maker of MongoDB): I agree it is still early and I expect some convergence in data models over time. BTW, I am having conversations with other NoSQL product groups about standards, but it is super early so nothing is imminent.
50% of the NoSQL products are JSON-based document-oriented databases.
So that is the greatest commonality. Use that and you have some good flexibility, and JSON is standards-based and widely used in general, which is nice. MongoDB, CouchDB, and Riak, for example, use JSON. (Mongo internally stores “BSON”.)

So moving data across these would not be hard.

1. If you ever have to move to another database, you basically have to re-implement a lot of your applications from scratch.

Dwight Merriman: Yes. Once again I wouldn’t assume that to be the case forever, but it is for the present. Also I think there is a bit of an illusion of portability with relational. There are subtle differences in the SQL, medium differences in the features, and there are giant differences in the stored procedure languages.
I remember at DoubleClick long ago we migrated from SQL Server to Oracle and it was a HUGE project. (We liked SQL Server; we just wanted to run on a very, very large server — i.e. we wanted vertical scaling at that time.)

Also: while porting might be work, given that almost all these products are open source, the potential “risks” of lock-in drop, I think, by an order of magnitude — with open source the vendors can’t charge too much.

Ironically people are charged a lot to use Oracle, and yet in theory it has the portability properties that folks would want.

I would anticipate SQL-like interfaces for BI tool integration in all the products in the future. However, that doesn’t mean that is the way one will write apps. I don’t really think that, even when present, those are ideal for application development productivity.

1. For example, how many NoSQL databases can export their data to Excel? (Something every CEO wants to get sooner or later.)

Dwight Merriman: With MongoDB, what I would do is use the mongoexport utility to dump to a CSV file and then load that into Excel. That is done often by folks today. And when there is nested data that isn’t tabular in structure, you can use the new Aggregation Framework to “unwind” it into a more matrix-like format for Excel before exporting.
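As a hedged sketch of that workflow (database, collection, and field names are assumptions, and mongoexport’s flag names vary slightly between versions): mongoexport can dump flat fields straight to CSV, while the aggregation framework’s $unwind stage first flattens nested arrays into one row per element.

    # Flat fields can go straight to CSV from the shell, roughly:
    #   mongoexport --db shop --collection orders --type=csv \
    #               --fields order_id,customer,total --out orders.csv
    import csv
    from pymongo import MongoClient

    orders = MongoClient("mongodb://localhost:27017")["shop"]["orders"]

    # $unwind turns {"order_id": 1, "items": [a, b]} into two documents,
    # one per item, i.e. the matrix-like shape Excel can digest.
    pipeline = [
        {"$unwind": "$items"},
        {"$project": {"_id": 0, "order_id": 1, "sku": "$items.sku", "qty": "$items.qty"}},
    ]

    with open("order_items.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["order_id", "sku", "qty"])
        writer.writeheader()
        writer.writerows(orders.aggregate(pipeline))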

You’ll see more and more tooling for stuff like that over time. Jaspersoft and Pentaho have MongoDB integration today, but the more the better.

John Hugg (VoltDB Engineering): Regarding your first point, about the issue of moving data out from one NoSQL system to some other system, even another NoSQL system:
There are a couple of angles to this. First, data movement itself is indeed much easier between systems that share a relational model.
Most SQL relational systems, including VoltDB, will import and export CSV files, usually without much issue. Sometimes you might need to tweak something minor, but it’s straightforward both to do and to understand.

Beyond just moving data, moving your application to another system is usually more challenging. As soon as you target a platform with horizontal scalability, an application developer must start thinking about partitioning and parallelism. This is true whether you’re moving from Oracle to Oracle RAC/Exadata, or whether you’re moving from MySQL to Cassandra. Different target systems make this easier or harder, from both development and operations perspectives, but the core idea is the same. Moving from a scalable system to another scalable system is usually much easier.

Where NoSQL goes a step further than scalability is in relaxing consistency and transactions in the database layer. While this simplifies the NoSQL system, it pushes complexity onto the application developer. A naive application port will be less successful, and a thoughtful one will take more time.
The amount of additional complexity largely depends on the application in question. Some apps are more suited to relaxed consistency than others. Other applications are nearly impossible to run without transactions. Most lie somewhere in the middle.

To the point about there being no standard way to access a NoSQL data store: while the tooling around some of the most popular NoSQL systems is improving, there’s no escaping that these are largely walled gardens.
The experience gained from using one NoSQL system is only loosely related to another. Furthermore, as you point out, non-traditional data models are often more difficult to export to the tabular data expected by many reporting and processing tools.

By embracing the SQL/Relational model, NewSQL systems like VoltDB can leverage a developer’s experience with legacy SQL systems, or other NewSQL systems.
All share a common query language and data model. Most can be queried at a console. Most have familiar import and export functionality.
The vocabulary of transactions, isolation levels, indexes, views and more is all shared understanding. That’s especially impressive given the diversity in underlying architecture and target use cases of the many available SQL/Relational systems.

Finally, SQL/Relational doesn’t preclude NoSQL-style development models. Postgres, Clustrix and VoltDB support MongoDB/CouchDB-style JSON documents in columns. Functionality varies, but these systems can offer features not easily replicated on their NoSQL inspirations, such as JSON sub-key joins or multi-row/key transactions on JSON data.
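As one concrete, hedged example of ‘JSON documents in columns’ (the table, keys, and connection settings are assumptions): PostgreSQL lets you store a document per row and still query it relationally, here through psycopg2.

    # Assumes a PostgreSQL table like: CREATE TABLE events (id serial PRIMARY KEY, doc jsonb);
    import psycopg2

    conn = psycopg2.connect("dbname=demo user=demo")
    cur = conn.cursor()

    # ->> extracts a JSON field as text; the cast lets us filter and aggregate it.
    cur.execute("""
        SELECT doc->>'customer'               AS customer,
               SUM((doc->>'total')::numeric)  AS spend
        FROM   events
        WHERE  doc->>'type' = 'purchase'
        GROUP  BY doc->>'customer'
        ORDER  BY spend DESC
    """)
    for customer, spend in cur.fetchall():
        print(customer, spend)
    cur.close()
    conn.close()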

1. It’s very hard to move the data out from one NoSQL system to some other system, even another NoSQL system. There is a very hard lock-in when it comes to NoSQL. If you ever have to move to another database, you basically have to re-implement a lot of your applications from scratch.

Steve Vinoski (Architect at Basho): Keep in mind that relational databases are around 40 years old while NoSQL is 3 years old. In terms of the technology adoption lifecycle, relational databases are well down toward the right end of the curve, appealing to even the most risk-averse consumer. NoSQL systems, on the other hand, are still riding the left side of the curve, appealing to innovators and the early majority who are willing to take technology risks in order to gain advantage over their competitors.

Different NoSQL systems make very different trade-offs, which means they’re not simply interchangeable. So you have to ask yourself: why are you really moving to another database? Perhaps you found that your chosen database was unreliable, or too hard to operate in production, or that your original estimates for read/write rates, query needs, or availability and scale were off such that your chosen database no longer adequately serves your application.
Many of these reasons revolve around not fully understanding your application in the first place, so no matter what you do there’s going to be some inconvenience involved in having to refactor it based on how it behaves (or misbehaves) in production, including possibly moving to a new database that better suits the application model and deployment environment.

2. There is no standard way to access a NoSQL data store.
All the tools that already exist for SQL have to be re-created for each of the NoSQL databases. This means that it will always be harder to access data in NoSQL than in SQL. For example, how many NoSQL databases can export their data to Excel? (Something every CEO wants to get sooner or later.)

Steve Vinoski: Don’t make the mistake of thinking that NoSQL is attempting to displace SQL entirely. If you want data for your Excel spreadsheet, or you want to keep using your existing SQL-oriented tools, you should probably just stay with your relational database. Such databases are very well understood, they’re quite reliable, and they’ll be helping us solve data problems for a long time to come. Many NoSQL users still use relational systems for the parts of their applications where it makes sense to do so.

NoSQL systems are ultimately about choice. Rather than forcing users to try to fit every data problem into the relational model, NoSQL systems provide other models that may fit the problem better. In my own career, for example, most of my data problems have fit the key-value model, and for that relational systems were overkill, both functionally and operationally. NoSQL systems also provide different tradeoffs in terms of consistency, latency, availability, and support for distributed systems that are extremely important for high-scale applications. The key is to really understand the problem your application is trying to solve, and then understand what different NoSQL systems can provide to help you achieve the solution you’re looking for.

1. It’s very hard to move the data out from one NoSQL system to some other system, even another NoSQL system. There is a very hard lock-in when it comes to NoSQL. If you ever have to move to another database, you basically have to re-implement a lot of your applications from scratch.

Cindy Saracco (IBM Senior Solutions Architect; these comments reflect my personal views and not necessarily those of my employer, IBM):
Since NoSQL systems are newer to market than relational DBMSs and employ a wider range of data models and interfaces, it’s understandable that migrating data and applications from one NoSQL system to another — or from NoSQL to relational — will often involve considerable effort.
However, I’ve heard more customer interest around NoSQL interoperability than migration. By that, I mean many potential NoSQL users seem more focused on how to integrate that platform into the rest of their enterprise architecture so that applications and users can have access to the data they need regardless of the underlying database used.

2. There is no standard way to access a NoSQL data store.
All the tools that already exist for SQL have to be re-created for each of the NoSQL databases. This means that it will always be harder to access data in NoSQL than in SQL. For example, how many NoSQL databases can export their data to Excel? (Something every CEO wants to get sooner or later.)

Cindy Saracco: From what I’ve seen, most organizations gravitate to NoSQL systems when they’ve concluded that relational DBMSs aren’t suitable for a particular application (or set of applications). So it’s probably best for those groups to evaluate what tools they need for their NoSQL data stores and determine what’s available commercially or via open source to fulfill their needs.
There’s no doubt that a wide range of compelling tools are available for relational DBMSs and, by comparison, fewer such tools are available for any given NoSQL system. If there’s sufficient market demand, more tools for NoSQL systems will become available over time, as software vendors are always looking for ways to increase their revenues.

As an aside, people sometimes equate Hadoop-based offerings with NoSQL.
We’re already seeing some “traditional” business intelligence tools (i.e., tools originally designed to support query, reporting, and analysis of relational data) support Hadoop, as well as newer Hadoop-centric analytical tools emerge.
There’s also a good deal of interest in connecting Hadoop to existing data warehouses and relational DBMSs, so various technologies are already available to help users in that regard. IBM happens to be one vendor that’s invested quite a bit in different types of tools for its Hadoop-based offering (InfoSphere BigInsights), including a spreadsheet-style analytical tool for non-programmers that can export data in CSV format (among others), Web-based facilities for administration and monitoring, Eclipse-based application development tools, text analysis facilities, and more. Connectivity to relational DBMSs and data warehouses is also part of IBM’s offerings. (Anyone who wants to learn more about BigInsights can explore links to articles, videos, and other technical information available through its public wiki.)

Related Posts

- On Eventual Consistency– Interview with Monty Widenius. by Roberto V. Zicari on October 23, 2012

- On Eventual Consistency– An interview with Justin Sheehy. by Roberto V. Zicari, August 15, 2012

- Hadoop and NoSQL: Interview with J. Chris Anderson by Roberto V. Zicari

##

Oct 23 12

On Eventual Consistency– Interview with Monty Widenius.

by Roberto V. Zicari

“For analytical things, eventual consistency is OK (as long as you can know, after you have run them, whether they were consistent or not). For real-world things involving money or resources it’s not necessarily the case.” — Michael “Monty” Widenius.

In a recent interview, I asked Justin Sheehy, Chief Technology Officer at Basho Technologies, maker of Riak, the following two questions, related to the subject of eventual consistency:

Q1. “How do you handle updates if you do not support ACID transactions? For which applications this is sufficient, and when this is not?”

Q2. “You said that Riak takes more of the “BASE” (Basically Available, Soft state, Eventual consistency) approach. Did you use the definition of eventual consistency by Werner Vogels? Reproduced here: “Eventual consistency: The storage system guarantees that if no new updates are made to the object, eventually (after the inconsistency window closes) all accesses will return the last updated value.”
You would not wish to have an “eventual consistency” update to your bank account. For which class of applications is eventual consistency a good system design choice? “

On the same subject, I did a follow up interview with Michael “Monty” Widenius, the main author of the original version of the open-source MySQL database, and currently working on a branch of the MySQL code base, called MariaDB.

RVZ

Q. Justin Sheehy`s reply to Q1: “Riak takes more of the “BASE” approach, which has become accepted over the past several years as a sensible tradeoff for high-availability data systems. By allowing consistency guarantees to be a bit flexible during failure conditions, a Riak cluster is able to provide much more extreme availability guarantees than a strictly ACID system.”
When do you think a “BASE” approach to consistency is justified?

“Monty” Widenius: The big questions are:
a) How are conflicts resolved? Who will win when there are conflicting updates on two nodes and the communication between the nodes is temporarily down?
b) Can a user at any point read data that is not consistent?
c) How long can the conflict window be?

The answers to the above questions tell us how suitable the solution is for different applications. For analytical things, eventual consistency is OK (as long as you can know, after you have run them, whether they were consistent or not). For real-world things involving money or resources it’s not necessarily the case.

Q. How do you handle consistency in MariaDB and at the same time ensuring scalability and availability? Aren’t you experiencing the limitations of the CAP Theorem?

“Monty” Widenius: We are using the traditional approaches with transactions or synchronous replication when you need guaranteed consistent answers.

We also provide asynchronous updates to slaves when you can tolerate a lag for the data on the slaves. However, as we only make things visible when the whole transaction has run on either the master or the slave, things are always consistent.

So when it comes to CAP, it’s up to the user to define where he wants to make his tradeoff: speed, reliability, or ease of management.

Q. Justin Sheehy`s reply to Q2: “That definition of Eventual Consistency certainly does apply to Riak, yes. I would most certainly include updates to my bank account as applications for which eventual consistency is a good design choice. In fact, bankers have understood and used eventual consistency for far longer than there have been computers in the modern sense. Traditional accounting is done in an eventually-consistent way and if you send me a payment from your bank to mine then that transaction will be resolved in an eventually consistent way. That is, your bank account and mine will not have a jointly-atomic change in value, but instead yours will have a debit and mine will have a credit, each of which will be applied to our respective accounts.”

“Monty” Widenius: The question is how much time passes between when things are inconsistent and when they become consistent. For example, at no point in time should there be more money in my account than I have the right to use.
The reason why banks in the past have been using eventual consistency is that the computer systems at the banks simply have not kept up with the rest of the world.
In many places there is still human interaction needed to get money into the account! (Especially for larger amounts.)
Still, if you ask any bank, they would prefer to have things always consistent if they could!

Q. Justin says that “this question contains a very commonly held misconception. The use of eventual consistency in well-designed systems does not lead to inconsistency. Instead, such systems may allow brief (but shortly resolved) discrepancies at precisely the moments when the other alternative would be to simply fail”.

“Monty” Widenius: In some cases it’s better to fail. For example, it’s common that an ATM will not give out money when the line to the bank is down. Giving out money in that case is probably always the wrong choice. The other question is whether things are 100% guaranteed to be consistent down to the millisecond during normal operations.

Q. Justin says: “to rephrase your statement, you would not wish your bank to fail to accept a deposit due to an insistence on strict global consistency.”

“Monty” Widenius: Actually, you would, if you can’t verify the identity of the user. Certainly the end user would not want the deposit to be accepted if there is only a record of it in a single place.

Q. Justin says: “It is precisely the cases where you care about very high availability of a distributed system that eventual consistency might be a worthwhile tradeoff.”
What is your take on this? Is eventual consistency also a valid approach for traditional banking applications?

“Monty” Widenius: That is what banks have traditionally used. However, if they had a choice between eventual consistency and always consistent, they would always choose the latter if it were possible within their resources.

———————
Michael “Monty” Widenius is the main author of the original version of the open-source MySQL database and a founding member of the MySQL AB company. Since 2009, Monty has been working on a branch of the MySQL code base, called MariaDB.

Related Posts

On Eventual Consistency– An interview with Justin Sheehy. by Roberto V. Zicari, August 15, 2012

MariaDB: the new MySQL? Interview with Michael Monty Widenius. by Roberto V. Zicari on September 29, 2011

##