“Some people even think that “Hadoop” and “Big Data” are synonymous (though this is an over-characterization). Unfortunately, Hadoop was designed based on a paper by Google in 2004 which was focused on use cases involving unstructured data (e.g. extracting words and phrases from Webpages in order to create Google’s Web index). Since it was not originally designed to leverage the structure in relational data in order to take short-cuts in query processing, its performance for processing relational data is therefore suboptimal.”– Daniel Abadi.
Q1. On the subject of “eventual consistency”, Justin Sheehy of Basho recently said in an interview (1): “I would most certainly include updates to my bank account as applications for which eventual consistency is a good design choice. In fact, bankers have understood and used eventual consistency for far longer than there have been computers in the modern sense”.
What is your opinion on this? Would you wish to have an “eventual consistency” update to your bank account?
Daniel Abadi: It’s definitely true that it’s a common misconception that banks use the ACID features of database systems to perform certain banking transactions (e.g. a transfer from one customer to another customer) in a single atomic transaction. In fact, they use complicated workflows that leverage semantics that reasonably approximate eventual consistency.
Q2. Justin Sheely also said in the same interview that “the use of eventual consistency in well-designed systems does not lead to inconsistency.” How do you ensure that a system is well designed then?
Daniel Abadi: That’s exactly the problem with eventual consistency. If your database only guarantees eventual consistency, you have to make sure your application is well-designed to resolve all consistency conflicts.
For example, eventually consistent systems allow you to update two different replicas to two different values simultaneously. When these updates propagate to the each other, you get a conflict. Application code has to be smart enough to deal with any possible kind of conflict, and resolve them correctly. Sometimes simple policies like “last update wins” is sufficient to handle all cases.
But other applications are far more complicated, and using an eventually consistent database makes the application code far more complex. Often this complexity can lead to errors and security flaws, such as the ATM heist that was written up recently. Having a database that can make stronger guarantees greatly simplifies application design.
Q3. You have created a start up called Hadapt which claims to be the ” first platform to combine Apache Hadoop and relational DBMS technologies”. What is it? Why combining Hadoop with Relational database technologies?
Daniel Abadi: Hadoop is becoming the standard platform for doing large scale processing of data in the enterprise. It’s rate of growth far exceeds any other “Big Data” processing platform. Some people even think that “Hadoop” and “Big Data” are synonymous (though this is an over-characterization). Unfortunately, Hadoop was designed based on a paper by Google in 2004 which was focused on use cases involving unstructured data (e.g. extracting words and phrases from Webpages in order to create Google’s Web index).
Since it was not originally designed to leverage the structure in relational data in order to take short-cuts in query processing, its performance for processing relational data is therefore suboptimal.
At Hadapt, we’re bringing 3 decades of relational database research to Hadoop. We have added features like indexing, co-partitioned joins, broadcast joins, and SQL access (with interactive query response times) to Hadoop, in order to both accelerate its performance for queries over relational data and also provide an interface that third party data processing and business intelligence tools are familiar with.
Therefore we have taken Hadoop, which used to be just a tool for super-smart data scientists, and brought it to the mainstream by providing a high performance SQL interface that business analysts and data analysis tools already know how to use. However, we’ve gone a step further and made it possible to include both relational data and non-relational data in the same query; so what we’ve got now is a platform that people can use to do really new and innovative types of analytics involving both unstructured data like tweets or blog posts and structured data such as traditional transactional data that usually sits in relational databases.
Q4. What is special about tha Hadapt architecture that makes it different from other Hadoop-based products?
Daniel Abadi: The prevalent architecture that people use to analyze structured and unstructured data is a two-system configuration, where Hadoop is used for processing the unstructured data and a relational database system is used for the structured data. However, this is a highly undesirable architecture, since now you have two systems to maintain, two systems where data may be stored, and if you want to do analysis involving data in both systems, you end up having to send data over the network which can be a major bottleneck. What is special about the Hadapt architecture is that we are bringing database technology to Hadoop, so that Hadapt customers only need to deploy a single cluster — a normal Hadoop cluster — that is optimized for both structured and unstructured data, and is capable of pushing the envelope on the type of analytics that can be run over Big Data.
Q5. You claim that Hadapt SQL queries are an order of magnitude faster than Hadoop+Hive? Do you have any evidence for that? How significant is this comparison for an enterprise?
Daniel Abadi: We ran some experiments in my lab at Yale using the queries from the TPC-H benchmark and compared the technology behind Hadapt to both Hive and also traditional database systems.
These experiments were peer-reviewed and published in SIGMOD 2011. You can see the full 12 page paper here. From my point of view, query performance is certainly important for the enterprise, but not as important as the new type of analytics that our platform opens up, and also the enterprise features around availability, security, and SQL compliance that distinguishes our platform from Hive.
Q6. You say that “Hadoop-connected SQL databases” do not eliminate “silos”. What do you mean with silos? And what does it mean in practice to have silos for an enterprise?
Daniel Abadi: A lot of people are using Hadoop as a sort of data refinery. Data starts off unstructured, and Hadoop jobs are run to clean, transform, and structure the data. Once the data is structured, it is shipped to SQL databases where it can be subsequently analyzed. This leads to the raw data being left in Hadoop and the refined data in the SQL databases. But it’s basically the same data — one is just a cleaned (and potentially aggregated) version of the other. Having multiple copies of the data can lead to all kinds of problems. For example, let’s say you want to update the data in one of the two locations — it does not get automatically propagated to the copy in the other silo. Furthermore, let’s say you are doing some analysis in the SQL database and you see something interesting and want to drill down to the raw data — if the raw data is located on a different system, such a drill down becomes highly nontrivial. Furthermore, data provenance is a total nightmare. It’s just a really ugly architecture to have these two systems with a connector between them.
Q7. What is “multi-structured” data analytics?
Daniel Abadi: I would define it as analytics than include both structured data and unstructured data in the same “query” or “job”. For example, Hadapt enables keyword search, advanced analytic functions, and common machine learning algorithms over unstructured data, all of which can be invoked from a SQL query that includes data from relational tables stored inside the same Hadoop cluster. A great example of how this can be used in real life can be seen in the demo given by Data Scientist Mingsheng Hong.
Q8. On the subject of Big Data Analytics, Werner Vogels of Amazon.com said in an interview  that “one of the core concepts of Big Data is being able to evolve analytics over time. In the new world of data analysis your questions are going to evolve and change over time and as such you need to be able to collect, store and analyze data without being constrained by resources.”
What is your opinion on this? How Hadapt can help on this?
Daniel Abadi: Yes, I agree. This is precisely why you need a system like Hadapt that has extreme flexibility in terms of the types of data that it can analyze (see above) and types of analysis that can be performed.
Q9. With your research group at Yale you are developing a new data management prototype system called Calvin. What is it? What is special about it?
Daniel Abadi: Calvin is a system designed and built by a large team of researchers in my lab at Yale (Alex Thomson, Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, Anton Petrov, Michael Giuffrida, and Aaron Segal) that scales transactional processing in database systems to millions of transactions per second while still maintaining ACID guarantees.
If you look at what is available in the industry today if you want to scale your transactional workload, you have all these NoSQL systems like HBase or Cassandra or Riak — all of which have to eliminate ACID guarantees in order to achieve scalability. On the other hand, you have NewSQL systems like VoltDB or various sharded MySQL implementations that maintain ACID but can only scale when transactions rarely cross physical server boundaries. Calvin is unique in the way that it uses a deterministic execution model to eliminate the need for commit protocols (such as two phase commit) and allow arbitrary transactions (that can include data on multiple different physical servers) to scale while still maintaining the full set of ACID guarantees. It’s a cool project. I recommend your readers check out the most recent research paper describing the system.
Q10. Looking at three elements: Data, Platform, Analysis, what are the main research challenges ahead?
Daniel Abadi: Here are a few that I think are interesting:
(1) Scalability of non-SQL analytics. How do you parallelize clustering, classification, statistical, and algebraic functions that are not “embarrassingly parallel” (that have traditionally been performed on a single server in main memory) over a large cluster of shared-nothing servers.
(2) Reducing the cognitive complexity of “Big Data” so that it can fit in the working set of the brain of a single analyst who is wrangling with the data.
(3) Incorporating graph data sets and graph algorithms into database management systems.
(4) Enabling platform support for probabilistic data and probabilistic query processing.
Daniel Abadi is an associate professor of computer science at Yale University where his research focuses on database system architecture and implementation, and cloud computing. He received his PhD from MIT, where his dissertation on column-store database systems led to the founding of Vertica (which was eventually acquired by Hewlett Packard).
His research lab at Yale has produced several scalable database systems, the best known of which is HadoopDB, which was subsequently commercialized by Hadapt, where Professor Abadi serves as Chief Scientist. He is a recipient of a Churchill Scholarship, an NSF CAREER Award, a Sloan Research Fellowship, the 2008 SIGMOD Jim Gray Doctoral Dissertation Award, and the 2007 VLDB best paper award.
- MapReduce vs RDBMS [PPT]
- Big Data and Analytical Data Platforms:
Blog Posts | Free Software | Articles | PhD and Master Thesis|
This post is the second part of a series of feedback I received from various experts, with obviously different point of views, on:
Two cons against NoSQL data stores :
Cons1. ” It’s very hard to move data out from one NoSQL to some other system, even other NoSQL. There is a very hard lock in when it comes to NoSQL. If you ever have to move to another database, you have basically to re-implement a lot of your applications from scratch. “
Cons2. “There is no standard way to access a NoSQL data store. All tools that already exist for SQL has to be recreated to each of the NoSQL databases. This means that it will always be harder to access data in NoSQL than from SQL. For example, how many NoSQL databases can export their data to Excel? (Something every CEO wants to get sooner or later).”
You can also read Part I here.
J Chris Anderson, Couchbase, cofounder and Mobile Architect:
On Cons1: JSON is the defacto format for APIs these days. I’ve found moving between NoSQL stores to be very simple, just a matter of swapping out a driver. It is also usually quite simple to write a small script to migrate the data between databases. Now, there aren’t pre packaged tools for this yet, but it’s typically one page of code to do. There is some truth that if you’ve tightly bound your application to a particular query capability, there might be some more work involved, but a least you don’t have to redo your stored data structures.
On Cons2: I’m from the “it’s a simple matter of programming” school of thought. e.g. just write the query you need, and a little script to turn it into CSV. If you want to do all of this without writing code, then of course the industry isn’t as mature as RDBMS. It’s only a few years old, not decades. But this isn’t a permanent structural issues, it’s just an artifact of the relative freshness of NoSQL.
Marko Rodriguez, Aurelius, cofounder :
On Cons1: NoSQL is not a mirror image of SQL. SQL databases such as MySQL, Oracle, PostgreSQL all share the same data model (table) and query language (SQL).
As such, the vendor lock-in comment should be directed to a particular type of NoSQL system, not to NoSQL in general. Next, in the graph space, there are efforts to mitigate vendor lock-in. TinkerPop provides a common set of interfaces that allow the various graph computing technologies to work together and it allows developers to “plug and play” the underlying technologies as needed. In this way, for instance, TinkerPop’s Gremlin traversal language works for graph systems such as OrientDB , Neo4j, Titan, InfiniteGraph, RDF Sail stores , and DEX to name several.
To stress the point, with TinkerPop and graphs, there is no need to re-implement an application as any graph vendor’s technology is interoperable and swappable.
On Cons2: Much of the argument above holds for this comment as well. Again, one should not see NoSQL as the space, but the particular subset of NoSQL (by data model) as the space to compare against SQL (table). In support of SQL, SQL and the respective databases have been around for much longer than most of the technologies in the NoSQL space. This means they have had longer to integrate into popular data workflows (e.g. Ruby on Rails). However, it does not mean “that it will always be harder to access data in NoSQL than from SQL.” New technologies emerge, they find their footing within the new generation of technologies (e.g. Hadoop) and novel ways of processing/exporting/understanding data emerge. If SQL was the end of the story, it would have been the end of the story.
David Webber, Oracle, Information Technologist:
On Cons1: Well it makes sense. Of course depends what you are using the NoSQL store for – if it is a niche application – or innovative solution – then a “one off” may not be an issue for you. Do you really see people using NoSQL as their primary data store? As with any technology – knowing when to apply it successfully is always the key. And these aspects of portability help inform when NoSQL is appropriate. There are obviously more criteria as well that people should reference to understand when NoSQL would be suitable for their particular application. The good news is that there are solid and stable choices available should they decide NoSQL is their appropriate option. BTW – in the early days of SQL too – even with the ANSI standard – its was a devil to port across SQL implementations – not just syntax, but performance and technique issues – I know – I did three such projects!
Wiqar Chaudry, NuoDB, Tech Evangelist :
On Cons1: The answer to the first scenario is relatively straightforward. There are many APIs like REST or third-party ETL tools that now support popular NoSQL databases. The right way to think about this is to put yourself in the shoes of multiple different users. If you are a developer then it should be relatively simple and if you are a non-developer then it comes down to what third-party tools you have access to and those with which you are familiar. Re-educating yourself to migrate can be time consuming if you have never used these tools however. In terms of migrating applications from one NoSQL technology to another this is largely dependent on how well the data access layer has been abstracted from the physical database. Unfortunately, since there is limited or no support for ORM technologies this can indeed be a daunting task.
On Cons2: This is a fair assessment of NoSQL. It is limited when it comes to third-party tools and integration. So you will be spending time doing custom design.
However, it’s also important to note that the NoSQL movement was really born out of necessity. For example, technologies such as Cassandra were designed by private companies to solve a specific problem that a particular company was facing. Then the industry saw what NoSQL can do and everyone tried to adopt the technology as a general purpose database. With that said, what many NoSQL companies have ignored is the tremendous opportunity to take from SQL-based technologies the goodness that is applicable to 21st century database needs. .
Robert Greene, Versant, Vice President, Technology:
On Cons1: Yes, I agree that this is difficult with most NoSQL solutions and that is a problem for adopters.
Versant has taken the position of trying to be first to deliver enterprise connectivity and standards into the NoSQL community. Of course, we can only take that so far, because many of the concepts that make NoSQL attractive to adopters simply do not have an equivalent in the relational database world. For example, horizontal scale-out capabilities are only loosely defined for relational technologies, but certainly not standardized. Specifically in terms of moving data in/out of other systems, Versant has developed a connector for the Talend Open Studio which has connectivity to over 400 relational and non-relational data sources, making it easy to move data in and out of Versant depending on your needs. For the case of Excell, while it is certainly not our fastest interface, having recognized the needs of data access from accepted tools, Versant has developed odbc/jdbc capabilities which can be used to get data from Versant databases into things like Excell, Toad, etc.
On Cons2: Yes, I also agree that this is a problem for most NoSQL solutions and again Versant is moving to bring better standards based programming API’s to the NoSQL community. For example, in our Java language interface, we support JPA ( Java Persistence API ), which is the interface application developers get when ever they download the Java SDK. They can create an application using JPA and execute against a Versant NoSQL database without implementing any relational mapping annotations or XML.
Versant thinks this is a great low risk way for enterprise developers to test out the benefits of NoSQL with limited risk. For example, if Versant does not perform much faster that the relational databases, run on much less hardware, scale-out effectively to multiple commodity servers, then they can simply take Hibernate or OpenJPA, EclipseLink, etc and drop it into place, do the mapping exercise and then point it at their relational database with nothing lost in productivity.
In the .NET world,b we have an internal implementation that support LINK and will be made available in the near future to interested developers. We are also supporting other standards in the area of production management, having SNMP capabilities so we can be integrated into tools like OpenView and others where IT folks can get a unified view of all their production systems.
I think we as an engineering discipline should not forget our lessons learned in the early 2000′s. Some pretty smart people helped many realize that what is important is the life cycle of your application objects, some of which are persistent, and that what is important is providing the appropriate abstraction for things like transaction demarcation, caching activation, state tracking ( new, changed, deleted ) etc. These are all features common to any application and developers can easily abstract them away to be database implementation independent, just like we did in the ORM days. Its what we do as good software engineers, find the right abstractions and refine and reuse them over time. It is important that the NoSQL vendors embrace such an approach to ease the development burden of the practitioners that will use the technology.
Jason Hunter, MarkLogic , Chief Architect:
On Cons1: When choosing a database, being future-proof is definitely something to consider. You never know where requirements will take you or what future technologies you’ll want to leverage. You don’t want your data locked into a proprietary format that’s going to paint you into a corner and reduce your options. That’s why MarkLogic chose XML (and now JSON also) as its internal data format. It’s an international standard. It’s plain-text, human readable, fully internationalized, widely deployed, and supported by thousands upon thousands of products. Customers choose MarkLogic for several reasons, but a key reason is that the underlying XML data format will still be understood and supported by vendors decades in the future. Furthermore, I think the first sentence above could be restated, “It’s very hard to move the data out from one SQL to some other system, even other SQL.” Ask anyone who’s tried!
On Cons2: People aren’t choosing NoSQL databases because they’re unhappy with the SQL language. They’re picking them because NoSQL databases provide a combination of feature, performance, cost, and flexibility advantages. Customers don’t pick MarkLogic to run away from SQL, they pick MarkLogic because they want the advantages of a document store, the power of integrated text search, the easy scaling and cost savings of a shared-nothing architecture, and the enterprise reliability of a mature product. Yes, there’s a use case for exporting data to Excel. That’s why MarkLogic has a SQL interface as well as REST and Java interfaces. The SQL interface isn’t the only interface, nor is it the most powerful (it limits MarkLogic down to the subset of functionality expressable in SQL) but it provides an integration path.
“The biggest technical challenge is actually the separation of the technology from the business use! Too often people are making the assumption that big data is synonymous with Hadoop, and any time that technology leads business things become difficult.” –Duncan Ross.
I asked Duncan Ross (Director Data Science, EMEA, Teradata), what is in his opinion the current status of Big Data Analytics industry.
Q1. What is in your opinion the current status of Big Data Analytics Industry?
Duncan Ross: The industry is still in an immature state, dominated by a single technology whilst at the same time experiencing an explosion of different
technological solutions. Many of the technologies are far from robust or enterprise ready, often requiring significant technical skills to support the software even before analysis is attempted.
At the same time there is a clear shortage of analytical experience to take advantage of the new data. Nevertheless the potential value is becoming increasingly clear.
Q2. What are the main technical challenges for Big Data analytics?
Duncan Ross: The biggest technical challenge is actually the seperation of the technology from the business use! Too often people are making the assumption that big data is synonymous with Hadoop, and any time that technology leads business things become difficult. Part of this is the difficulty of use that comes with this. It’s reminiscent of the command line technologies of the 70s – it wasn’t until the GUI became popular that computing could take off.
Q3. Is BI really evolving into Consumer Intelligence? Could you give us some examples of existing use cases?
Duncan Ross: BI and big data analytics are far more than just Consumer Intelligence. Already more than 50% of IP traffic is non human, and M2M will become increasingly important. But out of the connected vehicle we’re already seeing behaviour based insurance pricing and condition based maintenance. Individual movement patterns are being used to detect the early onset of illness.
New measures of voice of the customer are allowing companies to reach out beyond their internal data to understand the motivations and influence of their customers. We’re also seeing the growth of data philanthropy, with these approaches being used to benefit charities and not-for-profits.
Q4. How do you handle large volume of data? When dealing with petabytes of data, how do you ensure scalability and performance?
Duncan Ross: Teradata has years of experience dealing with Petabyte scale data. The core of both our traditional platform and the Teradata Aster big data platform is a shared nothing MPP system with a track history of proven linear scalability. For low information density data we provide a direct link to HDFS and work with partners such as Hortonworks.
Q5. How do you analyze structured data; semi-structured data, and unstructured data at scale?
Duncan Ross: The Teradata Aster technology combines the functionality of MapReduce within the well understood framework of ANSI SQL, allowing complex programatic analysis to sit alongside more traditional data mining techniques. Many MapReduce functions have been simplified (from the users’ perspective) and can be called easily and directly – but more advanced users are free to write their own functions. By parallelizing the analysis within the database you get extremely high scalability and performance.
Q6. How do you reduce the I/O bandwidth requirements for big data analytics workloads?
Duncan Ross: Two methods: firstly by matching analytical approach to technology – set based analysis using traditional SQL based approaches, and programmatic and iterative analysis using MapReduce style approaches.
Secondly by matching data ‘temperature’ to different storage medium: hot data on SSD, cool data on fast disk drives, and cold data on cheap large (slow) drives.
The skill is to automatically rebalance without impacting users!
Q7. What is the tradeoff between Accuracy and Speed that you usually need to make with Big Data?
Duncan Ross: In the world of data mining this isn’t reeally a problem as our approaches are based around sampling anyway. A more important distinction is between speed of analysis and business demand. We’re entering a world where data requires us to work far more agiley than we have in the past.
Q8. Brewer’s CAP theorem says that for really big distributed database systems you can have only 2 out of 3 from Consistency (“atomic”), Availability and (network) Partition Tolerance. Do you have practical evidence of this? And if yes, how?
Duncan Ross: No. Although it may be true for an arbitarily big system, in most real world cases this isn’t too much of a problem.
Q9. Hadoop is a batch processing system. How do you handle Big Data Analytics in real time (if any)?
Duncan Ross: Well we aren’t using Hadoop, and as I commented earlier, equating Hadoop with Big Data is a dangerous assumption. Many analyses do not require
anything approaching real time, but as closeness to an event becomes more important then we can look to scoring within an EDW environment or even embedding code within an operational system.
To do this requires you to understand the eventual use of your analysis when starting out of course.
A great example of this approach is Ebay’s Hadoop-Singularity-EDW configuration.
Q10. What are the typical In-database support for analytics operations?
Duncan Ross: It’s clear that moving the analysis to the data is more beneficial than moving data to the analysis. Teradata has great experience in this area.
We have examples of fast fourier transforms, predictive modelling, and parameterised modelling all happening in highly parallel ways within the database. I once built and estimated 64 000 models in parallel for a project.
Q11. Cloud computing: What role does it play with Analytics? What are the main differences between Ground vs Cloud analytics?
Duncan Ross: The cloud is a useful approach for evaluating and testing new approaches, but has some significant drawbacks in terms of data security. Of course
there is a huge difference between public and private cloud solutions.
Duncan Ross, Director Data Science, EMEA, Teradata.
Duncan has been a data miner since the mid 1990s. He was Director of Advanced Analytics at Teradata until 2010, leaving to become Data Director of Experian UK. He recently rejoined Teradata to lead their European Data Science team.
At Teradata he has been responsible for developing analytical solutions across a number of industries, including warranty and root cause analysis in manufacturing, and social network analysis in telecommunications.
These solutions have been developed directly with customers and have been deployed against some of the largest consumer bases in Europe.
In his spare time Duncan has been a city Councillor, chair of a national charity, founded an award winning farmers’ market, and is one of the founding Directors of the Society of Data Miners.
Two cons against NoSQL data stores read like this:
1. It’s very hard to move data out from one NoSQL to some other system, even other NoSQL. There is a very hard lock in when it comes to NoSQL. If you ever have to move to another database, you have basically to re-implement a lot of your applications from scratch.
2. There is no standard way to access a NoSQL data store.
All tools that already exist for SQL has to be recreated to each of the NoSQL databases. This means that it will always be harder to access data in NoSQL than from SQL. For example, how many NoSQL databases can export their data to Excel? (Something every CEO wants to get sooner or later).
These are valid points. I wanted to start a discussion on this.
This post is the first part of a series of feedback I received from various experts, with obviously different point of views.
I plan to publish Part II, with more feedback later on.
You are welcome to contribute to the discussion by leaving a comment if you wish!
1. It’s very hard to move the data out from one NoSQL to some other system, even other NoSQL.
Dwight Merriman ( founder 10gen, maker of MongoDB): I agree it is still early and I expect some convergence in data models over time. btw I am having conversations with other nosql product groups about standards but it is super early so nothing is imminent.
50% of the nosql products are JSON-based document-oriented databases.
So that is the greatest commonality. Use that and you have some good flexibility and JSON is standards-based and widely used in general which is nice. MongoDB, couchdb, riak for example use JSON. (Mongo internally stores “BSON“.)
So moving data across these would not be hard.
1. If you ever have to move to another database, you have basically to re-implement a lot of your applications from scratch.
Dwight Merriman: Yes. Once again I wouldn’t assume that to be the case forever, but it is for the present. Also I think there is a bit of an illusion of portability with relational. There are subtle differences in the SQL, medium differences in the features, and there are giant differences in the stored procedure languages.
I remember at DoubleClick long ago we migrated from SQL Server to Oracle and it was a HUGE project. (We liked SQL server we just wanted to run on a very very large server — i.e. we wanted vertical scaling at that time.)
Also: while porting might be work, given that almost all these products are open source, the potential “risks” of lock-in I think drops an order of magnitude — with open source the vendors can’t charge too much.
Ironically people are charged a lot to use Oracle, and yet in theory it has the portability properties that folks would want.
I would anticipate SQL-like interfaces for BI tool integration in all the products in the future. However that doesn’t mean that is the way one will write apps though. I don’t really think that even when present those are ideal for application development productivity.
1. For example, how many noSQL databases can export their data to excel? (Something every CEO wants to get sooner or later).
Dwight Merriman: So with MongoDB what I would do would be to use the mongoexport utility to dump to a CSV file and then load that into excel. That is done often by folks today. And when there is nested data that isn’t tabular in structure, you can use the new Aggregation Framework to “unwind” it to a more matrix-like format for Excel before exporting.
You’ll see more and more tooling for stuff like that over time. Jaspersoft and Pentaho have mongo integration today, but the more the better.
John Hugg (VoltDB Engineering): Regarding your first point about the issue with moving data out from one NoSQL to some other system, even other NoSQL.
There are a couple of angles to this. First, data movement itself is indeed much easier between systems that share a relational model.
Most SQL relational systems, including VoltDB, will import and export CSV files, usually without much issue. Sometimes you might need to tweak something minor, but it’s straightforward both to do and to understand.
Beyond just moving data, moving your application to another system is usually more challenging. As soon as you target a platform with horizontal scalability, an application developer must start thinking about partitioning and parallelism. This is true whether you’re moving from Oracle to Oracle RAC/Exadata, or whether you’re moving from MySQL to Cassandra. Different target systems make this easier or harder, from both development and operations perspectives, but the core idea is the same. Moving from a scalable system to another scalable system is usually much easier.
Where NoSQL goes a step further than scalability, is the relaxing of consistency and transactions in the database layer. While this simplifies the NoSQL system, it pushes complexity onto the application developer. A naive application port will be less successful, and a thoughtful one will take more time.
The amount of additional complexity largely depends on the application in question. Some apps are more suited to relaxed consistency than others. Other applications are nearly impossible to run without transactions. Most lie somewhere in the middle.
To the point about there being no standard way to access a NoSQL data store. While the tooling around some of the most popular NoSQL systems is improving, there’s no escaping that these are largely walled gardens.
The experience gained from using one NoSQL system is only loosely related to another. Furthermore, as you point out, non-traditional data models are often more difficult to export to the tabular data expected by many reporting and processing tools.
By embracing the SQL/Relational model, NewSQL systems like VoltDB can leverage a developer’s experience with legacy SQL systems, or other NewSQL systems.
All share a common query language and data model. Most can be queried at a console. Most have familiar import and export functionality.
The vocabulary of transactions, isolation levels, indexes, views and more are all shared understanding. That’s especially impressive given the diversity in underlying architecture and target use cases of the many available SQL/Relational systems.
Finally, SQL/Relational doesn’t preclude NoSQL-style development models. Postgres, Clustrix and VoltDB support MongoDB/CouchDB-style JSON Documents in columns. Functionality varies, but these systems can offer features not easily replicated on their NoSQL inspiration, such as JSON sub-key joins or multi-row/key transactions on JSON data
1. It’s very hard to move the data out from one NoSQL to some other system, even other NoSQL. There is a very hard lock in when it comes to NoSQL. If you ever have to move to another database, you have basically to re-implement a lot of your applications from scratch.
Steve Vinoski (Architect at Basho): Keep in mind that relational databases are around 40 years old while NoSQL is 3 years old. In terms of the technology adoption lifecycle, relational databases are well down toward the right end of the curve, appealing to even the most risk-averse consumer. NoSQL systems, on the other hand, are still riding the left side of the curve, appealing to innovators and the early majority who are willing to take technology risks in order to gain advantage over their competitors.
Different NoSQL systems make very different trade-offs, which means they’re not simply interchangeable. So you have to ask yourself: why are you really moving to another database? Perhaps you found that your chosen database was unreliable, or too hard to operate in production, or that your original estimates for read/write rates, query needs, or availability and scale were off such that your chosen database no longer adequately serves your application.
Many of these reasons revolve around not fully understanding your application in the first place, so no matter what you do there’s going to be some inconvenience involved in having to refactor it based on how it behaves (or misbehaves) in production, including possibly moving to a new database that better suits the application model and deployment environment.
2. There is no standard way to access a NoSQL data store.
All tools that already exists for SQL has to recreated to each of the NoSQL databases. This means that it will always be harder to access data in NoSQL than from SQL. For example, how many noSQL databases can export their data to Excel? (Something every CEO wants to get sooner or later).
Steve Vinoski: Don’t make the mistake of thinking that NoSQL is attempting to displace SQL entirely. If you want data for your Excel spreadsheet, or you want to keep using your existing SQL-oriented tools, you should probably just stay with your relational database. Such databases are very well understood, they’re quite reliable, and they’ll be helping us solve data problems for a long time to come. Many NoSQL users still use relational systems for the parts of their applications where it makes sense to do so.
NoSQL systems are ultimately about choice. Rather than forcing users to try to fit every data problem into the relational model, NoSQL systems provide other models that may fit the problem better. In my own career, for example, most of my data problems have fit the key-value model, and for that relational systems were overkill, both functionally and operationally. NoSQL systems also provide different tradeoffs in terms of consistency, latency, availability, and support for distributed systems that are extremely important for high-scale applications. The key is to really understand the problem your application is trying to solve, and then understand what different NoSQL systems can provide to help you achieve the solution you’re looking for.
1. It’s very hard to move the data out from one NoSQL to some other system, even other NoSQL. There is a very hard lock in when it comes to NoSQL. If you ever have to move to another database, you have basically to re-implement a lot of your applications from scratch.
Cindy Saracco (IBM Senior Solutions Architect) (these comments reflect my personal views and not necessarily those of my employer, IBM) :
Since NoSQL systems are newer to market than relational DBMSs and employ a wider range of data models and interfaces, it’s understandable that migrating data and applications from one NoSQL system to another — or from NoSQL to relational — will often involve considerable effort.
However, I’ve heard more customer interest around NoSQL interoperability than migration. By that, I mean many potential NoSQL users seem more focused on how to integrate that platform into the rest of their enterprise architecture so that applications and users can have access to the data they need regardless of the underlying database used.
2. There is no standard way to access a NoSQL data store.
All tools that already exists for SQL has to recreated to each of the NoSQL databases. This means that it will always be harder to access data in NoSQL than from SQL. For example, how many noSQL databases can export their data to excel? (Something every CEO wants to get sooner or later).
Cindy Saracco: From what I’ve seen, most organizations gravitate to NoSQL systems when they’ve concluded that relational DBMSs aren’t suitable for a particular application (or set of applications). So it’s probably best for those groups to evaluate what tools they need for their NoSQL data stores and determine what’s available commercially or via open source to fulfill their needs.
There’s no doubt that a wide range of compelling tools are available for relational DBMSs and, by comparison, fewer such tools are available for any given NoSQL system. If there’s sufficient market demand, more tools for NoSQL systems will become available over time, as software vendors are always looking for ways to increase their revenues.
As an aside, people sometimes equate Hadoop-based offerings with NoSQL.
We’re already seeing some “traditional” business intelligence tools (i.e., tools originally designed to support query, reporting, and analysis of relational data) support Hadoop, as well as newer Hadoop-centric analytical tools emerge.
There’s also a good deal of interest in connecting Hadoop to existing data warehouses and relational DBMSs, so various technologies are already available to help users in that regard . . . . IBM happens to be one vendor that’s invested quite a bit in different types of tools for its Hadoop-based offering (InfoSphere BigInsights), including a spreadsheet-style analytical tool for non-programmers that can export data in CSV format (among others), Web-based facilities for administration and monitoring, Eclipse-based application development tools, text analysis facilities, and more. Connectivity to relational DBMSs and data warehouses are also part IBM’s offerings. (Anyone who wants to learn more about BigInsights can explore links to articles, videos, and other technical information available through its public wiki. )
“For analytical things, eventual consistency is ok (as long as you can know after you have run them if they were consistent or not). For real world involving money or resources it’s not necessarily the case.” — Michael “Monty” Widenius.
Q1. “How do you handle updates if you do not support ACID transactions? For which applications this is sufficient, and when this is not?”
Q2. “You said that Riak takes more of the “BASE” (Basically Available, Soft state, Eventual consistency) approach. Did you use the definition of eventual consistency by Werner Vogels? Reproduced here: “Eventual consistency: The storage system guarantees that if no new updates are made to the object, eventually (after the inconsistency window closes) all accesses will return the last updated value.”
You would not wish to have an “eventual consistency” update to your bank account. For which class of applications is eventual consistency a good system design choice? “
On the same subject, I did a follow up interview with Michael “Monty” Widenius, the main author of the original version of the open-source MySQL database, and currently working on a branch of the MySQL code base, called MariaDB.
Q. Justin Sheehy`s reply to Q1: “Riak takes more of the “BASE” approach, which has become accepted over the past several years as a sensible tradeoff for high-availability data systems. By allowing consistency guarantees to be a bit flexible during failure conditions, a Riak cluster is able to provide much more extreme availability guarantees than a strictly ACID system.”
When do you think a “BASE” approach to consistency is justified?
“Monty” Widenius: The big questions are:
a) How are conflict’s solved? Who will win when there are conflicting updates on two nodes and the communication between the nodes are temporarily down?
b) Can a user at any point read data that is not consistent?
c) How long can the conflicting window be?
The answers to the above questions tells us how suitable the solution is for different applications. For analytical things, eventual consistency is ok (as long as you can know after you have run them if they were consistent or not). For real world involving money or resources it’s not necessarily the case.
Q. How do you handle consistency in MariaDB and at the same time ensuring scalability and availability? Aren’t you experiencing the limitations of the CAP Theorem?
“Monty” Widenius: We are using the traditional approaches with transactions or synchronous replication when you need guaranteed consistent answers.
We also provide asynchronous updates to slaves when you can tolerate a log for the data on the slaves. However, as we are only making things visible when the total transaction is run on either master/slave you have always things consistent.
So when it comes to CAP, it’s up the user to define where he wants to have his tradeoff; Speed, reliability or easy to manage.
Q. Justin Sheehy`s reply to Q2: “That definition of Eventual Consistency certainly does apply to Riak, yes. I would most certainly include updates to my bank account as applications for which eventual consistency is a good design choice. In fact, bankers have understood and used eventual consistency for far longer than there have been computers in the modern sense. Traditional accounting is done in an eventually-consistent way and if you send me a payment from your bank to mine then that transaction will be resolved in an eventually consistent way. That is, your bank account and mine will not have a jointly-atomic change in value, but instead yours will have a debit and mine will have a credit, each of which will be applied to our respective accounts.”
“Monty” Widenius: The question is time spent between the consistency and where things will be inconsistent. For example, at no point in time should there be more money on my account than I have the right to use.
The reason why banks in the past have been using eventual consistency is that the computer systems on the banks simply has not kept up with the rest of the word.
In many places there is still human interaction needed to get money on the account! (especially for larger amounts).
Still, if you ask any bank, they would prefer to have things always consistent if they could!
Q. Justin says that “this question contains a very commonly held misconception. The use of eventual consistency in well-designed systems does not lead to inconsistency. Instead, such systems may allow brief (but shortly resolved) discrepancies at precisely the moments when the other alternative would be to simply fail”.
“Monty” Widenius: In some cases it’s better to fail. For example it’s common that ATM will not give out money when the line to the bank account is down. Giving out money is probably always the wrong choice. The other question is if things are 100 % guaranteed to be consistent down to the millisecond during normal operations.
Q. Justin says: “to rephrase your statement, you would not wish your bank to fail to accept a deposit due to an insistence on strict global consistency.”
“Monty” Widenius: Actually you would, if you can’t verify the identity of the user. Certainly the end user would not want to have the deposit be accepted if there is only a record in a single place of the deposit.
Q. Justin says: ”It is precisely the cases where you care about very high availability of a distributed system that eventual consistencymight be a worthwhile tradeoff.”
What is your take on this? Is Eventual Consistency a valid approach also for traditional banking applications?
“Monty” Widenius: That is what banks have traditionally used. However, if they would have a choice between eventual consistency and always consistent they would always choose the later if it would be possible within their resources.
Michael “Monty” Widenius is the main author of the original version of the open-source MySQL database and a founding member of the MySQL AB company. Since 2009, Monty is working on a branch of the MySQL code base, called MariaDB.
On Eventual Consistency– An interview with Justin Sheehy. by Roberto V. Zicari, August 15, 2012
MariaDB: the new MySQL? Interview with Michael Monty Widenius. by Roberto V. Zicari on September 29, 2011
“While I believe that one size fits most, claims that RDBMS can no longer keep up with modern workloads come in from all directions. When people talk about performance of databases on large systems, the root cause of their concerns is often the performance of the underlying B-tree index”– Martín Farach-Colton.
Scaling MySQL and MariaDB to TBs and still be performant. Possible? I did interview on this topic Martín Farach-Colton, Tokutek Co-founder & Chief Technology Officer.
Q1. What is TokuDB?
Farach-Colton: TokuDB® is a storage engine for MySQL and MariaDB that uses Fractal Tree Indexes. It enables MySQL and MariaDB to scale from GBs to TBs and provides improved query performance, excellent compression, and online schema flexibility. TokuDB requires no changes to MySQL applications or code and is fully ACID and MVCC complaint.
While I believe that one size fits most, claims that RDBMS can no longer keep up with modern workloads come in from all directions. When people talk about performance of databases on large systems, the root cause of their concerns is often the performance of the underlying B-tree index. This makes sense, because almost every system out there that indexes big data does so with a B-tree. The B-tree was invented more than 40 years ago and has been fighting hardware trends ever since.
For example, although some people really do have OLAP workloads and others really do have OLTP workloads, I believe that most users are forced to choose between analytics and live updates by the performance profile of the B-tree, and that their actual use case requires the best of both worlds. The B-tree forces compromises, which the database world has internalized.
Fractal Tree Indexes® replace B-tree indexes and are based on my academic research on so-called Cache-Oblivious Analysis, which I’ve been working on with my Tokutek co-founders, Michael Bender and Bradley Kuszmaul. Fractal Tree Indexes speed up indexing — that is, they are write optimized — without giving up query performance. The best way to get fast queries is to define the right set of indexes, and Fractal Tree Indexes let you do that. Legacy technologies have staked out claims on some part of the B-tree query/indexing tradeoff curve. Fractal Tree Indexes put you on a higher-performing tradeoff curve. Query-optimal write-optimized indexing is all about making general-purpose databases faster. For some of our customers’ workloads, it’s as much as two orders of magnitude faster.
Q2. You claim to be able to “scale MySQL from GBs to TBs while improving insert and query speed, compression, replication performance, and online schema flexibility.” How do you do that?
Farach-Colton: In a Fractal Tree Index, all changes — insertions, deletions, updates, schema changes — are messages that get injected into the tree. Even though these messages get injected into the root and might get moved several times before getting to a leaf, all queries will see all relevant messages. Thus, injecting a message is fast, and queries reflect the effects of a message, e.g., changing a field or adding a column, immediately.
In order to make indexing fast, we make sure that each I/O to disk does a lot of work. I/Os are slow. For example, if accessing the fastest part of memory is like walking across a room, getting something from disk is like walking from New York to St Louis. The analogy in Germany would be to walk from Berlin to Rome — Germany itself isn’t big enough to hold a disk I/O on this scale!
To see how Fractal Tree Indexes work, first consider a B-tree. A B-tree delivers each item to its final destination as it arrives. It’s as if Amazon were to put a single book in a truck and drive it to your home. A Fractal Tree Index manages a set of buffers — imagine regional warehouses — so that whenever you move a truck, it is full of useful information.
The result is that insertion speed is dramatically higher. Query speeds are higher because you keep more indexes. And compression is higher because we can afford to read and write data in larger chunks. Compression on bigger chunks of data is typically more effective, and that rule of thumb has panned out for our customers.
Q3 Do you have any benchmark results to support your claims? If yes, could you please give us some details?
Q4. You developed your own benchmark iiBench, why not use TPC benchmarks?
Farach-Colton: Our benchmarking page also shows performance on TPC and Sysbench workloads. iiBench is meant to expand the range of comparison for storage engines, not to replace existing benchmarks.
Benchmarks are developed to measure performance where vendors differ. iiBench measures how a database performs on a mixed insertion/query workload. B-trees are notoriously slow at updating indexes on the fly, so no one has bothered to produce a benchmark that measures performance on such workloads. We believe (and our customers confirm for us) that this Real-time OLAP (aka Analytical OLTP) workload is critical. Furthermore, iiBench seems to be filling a need in that there has been some third-party adoption, most notably by Mark Callaghan at Facebook.
Q5. How is iiBench defined? What kind of queries did you benchmark and with which data size?
At best, all I could do is summarize it here and risk omitting details important to some of your readers. Instead, let me refer you to our website where you can read it in detail here , and the Facebook modifications can be found here.
Q6. How do you handle MySQL slave lag (delays)?
Farach-Colton: The bottleneck that causes slave lag is the single-threaded nature of master-slave replication. InnoDB has single-threaded indexing that is much slower than its multi-threaded indexing. Thus, the master can support a multi-threaded workload that is substantially higher than that achievable on the single-threaded slave. (There are plans to improve this in upcoming releases of InnoDB.)
TokuDB has much higher insertion rates, and in particular, very high single-threaded insertion rates. Thus, the slave can keep up with a much more demanding workload. (Please see our benchmark page for details.)
Q7. You have developed a technique called “Fractal Tree indexes.” What is it? How does it compare with standard Tree indexes?
Farach-Colton: Much of this question was answered in question 2. Here, I’ll add that Fractal Tree Indexes are the first write-optimized data structure used as an index in databases. The Log-Structured Merge Tree (LSM) also achieves very high indexing rates, but at the cost of so-called read amplification, which means that indexing is much faster than for B-trees, but queries end up being much slower.
Fractal Tree Indexes are as fast as LSMs for indexing, and as fast as B-trees for queries. Thus, they dominate both.
A final note on SSDs: B-trees are I/O bound, and SSDs have much higher IOPS than rotational disks. Thus, it would seem that B-trees and SSDs are a marriage made in heaven. However, consider the fact that when a single row is modified in a B-tree leaf, the entire leaf must be re-written to disk. So, in addition to the problems of write amplification caused by wear leveling and garbage collection on SSDs, B-trees induce a much greater level of write amplification, because small changes induce large writes. Write-optimized indexes largely eliminate both types of write amplification, as well as potentially extending the life of SSDs.
Q8. What is special about your data compression technique?
Farach-Colton: Compression is like a freight train, slow to get started but effective once it’s had a chance to get up to speed. Thus, one of the biggest factors for how much compression is achieved is how much you compress at a time. Give a compression algorithm some data 4KB at a time, and the compression will be poor. Give it 1MB or more at a time, and the same compression algorithm will do a much better job. Because of the structure of our indexes, we can compress larger blocks at a time.
I should also point out that one of the problems that dogs databases is aging or fragmentation, in which the leaves of the B-tree index get scattered on disk, and range queries become much slower. This is because leaf-to-leaf seeks become slower, when the leaves are scattered. When leaves are clustered, they can be read quickly by exploiting the prefetching mechanisms of disks. The standard solution is to periodically rebuild the index, which can involve substantial down time. Fractal Tree indexes do not age, under any workload, for the same reason that they compress so well: by handling much larger blocks of data than a B-tree can, we are able to keep the data much more clustered, and thus it is never necessary to rebuild an index.
Q9. How is TokuDB different with respect to SchoonerSQL, NuoDB and VoltDB?
Farach-Colton: VoltDB is an in-memory database. The point is therefore to get the highest OLTP performance for data that is small enough to fit in RAM.
By comparison, TokuDB focuses on very large data sets — too big for RAM — for clients who are interested in moving away from the OLAP/OLTP dichotomy, that is, for clients who want to keep data up to date while still querying it through rich indexes.
SchoonerSQL has a combination of technologies involving optimization for SSDs as well as scale-out. TokuDB is a scale-up solution that improves performance on a per-node basis.
NuoDB is also a scale-out solution, in this case cloud-based, though I am less familiar with the details of their innovation at a more technical level.
Martín Farach-Colton is a co-founder and chief technology officer at Tokutek. He was an early employee at Google, where he worked on improving performance of the Web Crawl and where he developed the prototype for the AdSense system.
An expert in algorithmic and information retrieval, Prof. Farach-Colton is also a Professor of Computer Science at Rutgers University.
“There are only a few Facebook-sized IT organizations that can have 60 Stanford PhDs on staff to run their Hadoop infrastructure. The others need it to be easier to develop Hadoop applications, deploy them and run them in a production environment.”– John Schroeder.
How easy is to use Hadoop? What are the next generation Hadoop distributions? On these topics, I did Interview John Schroeder, Cofounder and CEO of MapR.
Q1. What is the value that Apache Hadoop provides as a Big Data analytics platform?
John Schroeder: Apache Hadoop is a software framework that supports data-intensive distributed applications. Apache Hadoop provides a new platform to analyze and process Big Data. With data growth exploding and new unstructured sources of data expanding a new approach is required to handle the volume, variety and velocity of data. Hadoop was inspired by Google’s MapReduce and Google File System (GFS) papers.
Q2. Is scalability the only benefits of Apache Hadoop then?
John Schroeder: No, you can build applications that aren’t feasible using traditional data warehouse platforms.
The combination of scale, ability to process unstructured data along with the availability of machine learning algorithms and recommendation engines creates the opportunity to build new game changing applications.
Q3. What are the typical requirements of advanced Hadoop users as well as those new to Hadoop?
John Schroeder Advanced users of Hadoop are looking to go beyond batch uses of Hadoop to support real-time streaming of content. Advanced users also need multi-tenancy to balance production and decision support workloads on their shared Hadoop cloud.
New users need Hadoop to become easier. There are only a few Facebook-sized IT organizations that can have 60 Stanford PhDs on staff to run their Hadoop infrastructure. The others need it to be easier to develop Hadoop applications, deploy them and run them in a production environment.
Q4. Why this? Please give us some practical examples of applications.
John Schroeder: Product recommendations, ad placements, customer churn, patient outcome predictions, fraud detection and sentiment analysis are just a few examples that improve with real time information.
Organizations are also looking to expand Hadoop use cases to include business critical, secure applications that easily integrate with file-based applications and products. Requirements for data protection include snapshots to provide point-in-time recovery and mirroring for business continuity.
With mainstream adoption comes the need for tools that don’t require specialized skills and programmers. New Hadoop developments must be simple for users to operate and to get data in and out. This includes direct access with standard protocols using existing tools and applications.
Q5. What are in your opinion the core limitations that limit the adoption of Hadoop in the enterprise? How do you contribute in taking Big Data into mainstream?
John Schroeder: MapR has and continues to knock down the barriers to Hadoop adoption. Hadoop needed five 9’s availability and the ability to run in a ‘lights-out” datacenter so we transformed Hadoop into a reliable compute and dependable storage platform. Hadoop use cases were too narrow so we expanded access to Hadoop data for industry standard file-based processing.
The MapR Control Center makes it easy to administer, monitor and provision Hadoop applications.
We improved Hadoop economics by dramatically improving performance. We just released our multi-tenancy features that were key to our recently announced Amazon and Google partnership announcements.
Next on our roadmap is to continue expanding the use cases and additional progress moving Hadoop from batch to real-time.
Q6. What are the benefits of having an automated stateful failover?
John Schroeder: Automated stateful failover provides high availability and continuous operations for organizations.
Even with multiple hardware or software outages and errors applications will continue running without any administrator actions required.
Q7. What is special about MapR architecture? Please give us some technical detail.
John Schroeder: MapR rethought and rebuilt the entire internal infrastructure while maintaining complete compatibility with Hadoop APIs. MapR opened up Hadoop to programming, products and data access by providing a POSIX storage layer, NFS access, ODBC/JDBC and REST. The MapR strategy is to expand use cases appropriate for Hadoop while avoiding proprietary features that would result in vendor lock-in. We rebuilt the underlying storage services layer and eliminated the “append only” limitation of the Hadoop Distributed File System (HDFS). MapR writes directly to block devices eliminating the inefficiencies caused by layering HDFS on top of a Linux local file system. In other Hadoop implementations, these continually rob the entire system of performance efficiencies.
The storage layer rearchitecture also enabled us to implement snapshot and mirroring capabilities. The MapR Distributed NameNode eliminates the well-known file scalability limitation allowing us to optimize the Hadoop shuffle algorithm.
MapR provides transparent compression on the client making it easy to reduce data transmission over the network or on disk.
Finally MapR eliminated the impact of periodic Java garbage collection.
This kind of increase is simply impossible with current implementations because it is limited by the architecture itself.
Q8. There are several commercial Hadoop distribution companies on the market (e.g. Cloudera, Datameer, Greenplum, Hortonworks, Platform Computing to name a few). What is special about MapR?
John Schroeder: Datameer is not a Hadoop distribution provider they provide an analytic application and are a partner of MapR. Platform Computing is also not a Hadoop distribution provider. MapR is the only company to provide an enterprise-grade Hadoop distribution.
Q9. What do you mean with an enterprise-grade Hadoop distribution? Isnt` for example Cloudera (to name one) also an enterprise Hadoop distribution?
John Schroeder: There is no other alternative in the market for full HA, business continuity, real-time streaming, standard file-based access through NFS, full database access through ODBC, and support for mission-critical SLAs.
Q10. How do you support this claim? Did you do a market analysis? What other systems did you look at?
John Schroeder: We performed a complete review of available Hadoop distributions. The recent selections of MapR by Amazon as an integrated offering into their Amazon Elastic MapReduce service and by Google for their Google Compute Engine are further validations of MapR’s differentiated Hadoop offering.
With MapR users can mount a cluster using the standard NFS protocol. Applications can write directly into the cluster with no special clients or applications. Users can use every file-based application that has been developed over the last 20 years, ranging from BI applications to file browsers and command-line utilities such as grep or sort directly on data in the cluster. This dramatically simplifies development. Existing applications and workflows can be used and just the specific steps requiring parallel processing need to be converted to take advantage of the MapReduce framework.
MapR also delivers ease of data management. With clusters expanding well into the petabyte range simplifying how data is managed is critical. MapR uniquely supports volumes to make it easy to apply policies across directories and file contents without managing individual files.
These policies include data protection, retention, snapshots, and access privileges.
Additionally, MapR delivers business critical reliability. This includes full HA, business continuity and data protection features. In addition to replication, MapR includes snapshots and mirroring. Snapshots provide for point-in-time recovery. Mirroring provides backup to an alternate cluster, data center or between on-premise and private Hadoop clouds. These features provide a level of protection that is necessary for business critical uses.
Q11. What are the next generations of Hadoop distributions?
John Schroeder: The first generation of Hadoop surrounded the open source Apache project with services and management utilities. MapR is the next generation of Hadoop that combines standards-based innovations developed by MapR with open source components resulting in the industry’s only differentiated distribution that meets the needs of both the largest and emerging Hadoop installations.
MapR’s extensive engineering efforts have resulted in the first software distribution for Hadoop that provides extreme high performance, unprecedented scale, business continuity and is easy to deploy and manage.
Q12. Do you have any results to show us to support these claims “high performance, unprecedented scale”?
John Schroeder: We have many examples of high performance and scale. Google recently unveiled the Google Compute Engine with MapR on stage at the Google IO conference. We demonstrated a 1256 node cluster perform a Terasort in 1 minute and 20 seconds. One of our customers, comScore presented a session at the Hadoop Summit and showed how they process 30B internet events a day using MapR. As for scale differences we have a customer with 18 billion files in a single MapR cluster. By comparison, the largest clusters of other distributions max out around 200 million files.
Q13. What functionalities still need to be added to Hadoop to serve new business critical and real-time applications?
John Schroeder: Other Hadoop distributions present customers with several challenges including:
• Getting data in and out of Hadoop. Other Hadoop distributions are limited by the append-only nature of the Hadoop Distributed File System (HDFS) that requires programs to batch load and unload data into a cluster.
• Deploying Hadoop into mission critical business projects. The lack of reliability of current Hadoop software platforms is a major impediment for expansion.
• Protecting data against application and user errors. Hadoop has no backup and restore capabilities. Users have to contend with data loss or resort to very expensive solutions that reside outside the actual Hadoop cluster.
According to industry research firm, ESG, half of the companies they surveyed plan to leverage commercial distributions of Hadoop as opposed to the open source version. This trend indicates organizations are moving from experimental and pilot projects to mainstream applications with mission-critical requirements that include high availability, better performance, data protection, security, and ease of use.
Q14. There is work to be done training developers in learning advanced statistics and software (such as Hadoop) to ensure adoption in the Enterprise. Do you agree with this? What is your role here?
John Schroeder: Simply put the limitations of the Hadoop Distributed File System require whole scale changes to existing applications and extensive development of new ones. MapR’s next generation storage services layer provides full random/read support and provides direct access with NFS. This dramatically simplifies development. Existing applications and workflows can be used and just the specific steps requiring parallel processing need to be converted to take advantage of the MapReduce framework.
Q15. Are customers willing to share their private data?
John Schroeder: In general customers are concerned with the protection and security of their data. That said, we see growing adoption of Hadoop in the cloud. Amazon has a significant web-services business around Hadoop and recently added MapR as part of their Elastic MapReduce offering. Google has also announced the Google Compute Engine and integration with MapR.
Q16. Data quality from different sources is a problem. How do you handle this?
John Schroeder: Data quality issues can be similar to those in a traditional data warehouse. Scrubbing can be built into the Hadoop applications using algorithms similar to those used during ELT.
ETL and ELT can both accomplish data scrubbing. The storage/compute resources and ability to combine unlike datasets provide significant advantages to Hadoop-based ELT.
There are different views with respect to this issue. IT personnel that are used to traditional data warehouses are typically concerned with data quality and ETL processes. The advantage of Hadoop is that you can have disparate data from many different sources and different data types in the same cluster. Some advanced users have pointed out that “quality” issues are actually valuable information that can provide insight into issues, anomalies and opportunities. With Hadoop users have the flexibility to process and analyze. Analytics are not dependent on having a pre-defined schema.
Q17. Moving older data online. Is this a business opportunity for you?
John Schroeder: The advantage of Hadoop is performing compute on data. It makes much more sense to perform analytics directly on large data stores so you send only results over the network instead of dragging the entire data set over the network for processing. For this use case to be viable requires a highly reliable cluster with full data protection and business continuity features.
Q18. Yes, but what about big data that is not digitalized yet? This is what I meant with moving older data online.
John Schroeder: Most organizations are looking for a solution to help them cope with fast growing digital sources of machine generated content such as log files, sensor data, etc. Images, video and audio are also a fast growing data source that can provide valuation analytics.
John Schroeder, Cofounder and CEO, MapR.
John has led companies creating innovative and disruptive business intelligence, database management, storage and virtualization technologies at early stage ventures through success as large public companies. John founded MapR to produce the next generation Hadoop distribution to expand the use cases beyond batch Hadoop to include real-time, business critical, secure applications that easily integrate with file-based applications and products.
ODBMS.org Free Downloads and Links
In this section you can download free resources on Big Data and Analytical Data Platforms (Blog Posts | Free Software| Articles| PhD and Master Thesis)
“I would most certainly include updates to my bank account as applications for which eventual consistency is a good design choice. In fact, bankers have understood and used eventual consistency for far longer than there have been computers in the modern sense” –Justin Sheehy.
On the subject of new data models and eventual consistency I did interview Justin Sheehy Chief Technology Officer, Basho Technologies.
Q1. What are in your opinion the main differences and similarities of a key-value store (ala Dynamo), a document store (ala MongoDB), and an “extensible record” store (ala Big Table) when using them in practice?
Justin Sheehy: Describing the kv-store, doc store, and column family data models in general is not the same as describing specific systems like Dynamo, MongoDB, and BigTable. I’ll do the former here as I am guessing that is the intention of the question. Since the following couple of questions ask for differences, I’ll emphasize the similarity here.
All three of these data models have two major things in common: values stored in them are not rigidly structured, and are organized mainly by primary key. The details beyond those similarities, and how given systems expose those details, certainly vary. But the flexibility of semi-structured data and the efficiency of primary-key access generally apply to most such systems.
Q2. When is a key-value store particularly well suited and when is a a document store instead preferable? For which kind of applications and for what kind of data management requirements?
Justin Sheehy: The interesting issue with this question is that “document store” is not well-established as having a specific meaning. Certainly it seems to apply to both MongoDB and CouchDB, but those two systems have very different data access semantics. The closest definition I can come up with quickly that covers the prominent systems known as doc stores might be something like “a key-value store which also has queryability of some kind based on the content of the values.”
If we accept that definition then you can happily use a document store anywhere that a key-value store would work, but would find it most worthwhile when your querying needs are richer than simply primary key direct access.
Q3. What is Riak? A key-value store or a document store? What are the main features of the current version of Riak?
Justin Sheehy: Riak began as being called a key-value store before the current popularity of the term “document store” term, but it is certainly a document store by any reasonable definition that I know — such as the one I gave above. In addition to access by primary key, values in Riak can be queried by secondary key, range query, link walking, full text search, or map/reduce.
Riak has many features, but the core reasons that people come to Riak over other systems are Availability, Scalability, and Predictability. For people whose business demands extremely high availability, easy and linear scalability, or predictable performance over time, Riak is worth a look.
Q4. How do you achieve horizontal scalability? Do you use a “shared nothing” horizontal scaling – replicating and partitioning data over many servers?
What performance metrics do you have for that?
Justin Sheehy: We use a number of techniques to achieve horizontal scalability. Among them is consistent hashing, an approach invented at Akamai and successfully used by many distributed systems since then. This allows for constant time routing to replicas of data based on the hash of the data’s primary key.
Data is partitioned to servers in the cluster based on consistent hashing, and replicated to a configurable number of of those servers. By partitioning the data to many “virtual nodes” per host, growth is relatively easy as new hosts simply (and automatically) take over some of the virtual nodes that has previously owned by existing cluster hosts.
Yes, in terms of data location Riak is a “shared nothing” system.
One (of many) demonstrations of this scalability was performed by Joyent here.
That benchmark is approximately 2 years old, so various specific numbers are quite outdated, but the important lesson in it remains and is summed up by this graph late in this post.
It shows that as servers were added, the throughput (as well as the capacity) of the overall system increased linearly.
Q5. How do you handle updates if you do not support ACID transactions? For which applications this is sufficient, and when this is not?
Justin Sheehy: Riak takes more of the “BASE” approach, which has become accepted over the past several years as a sensible tradeoff for high-availability data systems. By allowing consistency guarantees to be a bit flexible during failure conditions, a Riak cluster is able to provide much more extreme availability guarantees than a strictly ACID system.
Q6. You said that Riak takes more of the “BASE” approach. Did you use the definition of eventual consistency by Werner Vogels?
Reproduced here: “Eventual consistency: The storage system guarantees that if no new updates are made to the object, eventually (after the inconsistency window closes) all accesses will return the last updated value”. You would not wish to have an “eventual consistency” update to your bank account. For which class of applications is eventual consistency a good system design choice?
Justin Sheehy: That definition of Eventual Consistency certainly does apply to Riak, yes.
I would most certainly include updates to my bank account as applications for which eventual consistency is a good design choice. In fact, bankers have understood and used eventual consistency for far longer than there have been computers in the modern sense. Traditional accounting is done in an eventually-consistent way and if you send me a payment from your bank to mine then that transaction will be resolved in an eventually consistent way. That is, your bank account and mine will not have a jointly-atomic change in value, but instead yours will have a debit and mine will have a credit, each of which will be applied to our respective accounts.
This question contains a very commonly held misconception. The use of eventual consistency in well-designed systems does not lead to inconsistency. Instead, such systems may allow brief (but shortly resolved) discrepancies at precisely the moments when the other alternative would be to simply fail.
To rephrase your statement, you would not wish your bank to fail to accept a deposit due to an insistence on strict global consistency.
It is precisely the cases where you care about very high availability of a distributed system that eventual consistency might be a worthwhile tradeoff.
Q7. Why is Riak written in Erlang? What are the implications for the application developers of this choice?
Justin Sheehy: Erlang’s original design goals included making it easy to build systems with soft real-time guarantees and very robust fault-tolerance properties. That is perfectly aligned with our central goals with Riak, and so Erlang was a natural fit for us. Over the past few years, that choice has proven many times to have been a great choice with a huge payoff for Riak’s developers. Application developers using Riak are not required to care about this choice any more than they need to care what language PostgreSQL is written in. The implications for those developers are simply that the database they are using has very predictable performance and excellent resilience.
Q8. Riak is open source. How do you engage the open source community and how do you make sure that no inconsistent versions are generated?
Justin Sheehy: We engage the open source community everywhere that it exists. We do our development in the open on github, and have lively conversations with a wider community via email lists, IRC, Twitter, many in-person venues, and more.
Mark Phillips and others at Basho are dedicated full-time to ensuring that we continue to engage honestly and openly with developer communities, but all of us consider it an essential part of what we do. We do not try to prevent forks. Instead, we are part of the community in such a way that people generally want to contribute their changes back to the central repository. The only barrier we have to merging such code is about maintaining a standard of quality.
Q9. How do you optimize access to non-key attributes?
Justin Sheehy: Riak stores index content in addition to values, encoded by type and in sorted order on disk. A query by index certainly is more expensive than simply accessing a single value directly by key, as the indices are distributed around the cluster — but this also means that the size of the index is not constrained by a single host.
Q10. How do you optimize access to non-key attributes if you do not support indexes in Riak?
Justin Sheehy: We do support indexes in Riak.
Q11 How does Riak compare with a new generation of scalable relational systems (NewSQL)?
Justin Sheehy: The “NewSQL” term is, much like “NoSQL”, a marketing term that doesn’t usefully define a technical category. The primary argument made by NewSQL proponents is that some NoSQL systems have made unnecessary tradeoffs. I personally consider these NewSQL systems to be a part of the greater movement generally dubbed NoSQL despite the seemingly contradictory names, as the core of that movement has nothing to do with SQL — it is about escaping the architectural monoculture that has gripped the commercial database market for the past few decades. In terms of technical comparison, some systems placing themselves under the NewSQL banner are excellent at scalability and performance, but I know of none whose availability and predictability can rival Riak.
Q12 Pls give some examples of use cases where Riak is currently in use. Is Riak in use for analyzing Big Data as well?
Justin Sheehy: A few examples of companies relying on Riak in their business can be found here.
While Riak is primarily about highly-available systems with predictable low-latency performance, it does have analytical capabilities as well and many users make use of map/reduce and other such programming models in Riak. By most definitions of “Big Data”, many of Riak’s users certainly fall into that category.
Q Anything you wish to add?
Justin Sheehy: Thank you for your interest. We’re not done making Riak great!
Chief Technology Officer, Basho Technologies
As Chief Technology Officer, Justin Sheehy directs Basho’s technical strategy, roadmap, and new research into storage and distributed systems.
Justin came to Basho from the MITRE Corporation, where as a principal scientist he managed large research projects for the U.S. Intelligence Community including such efforts as high assurance platforms, automated defensive cyber response, and cryptographic protocol analysis.
He was central to MITRE’s development of research for mission assurance against sophisticated threats, the flagship program of which successfully proposed and created methods for building resilient networks of web services.
Before working for MITRE, Justin worked at a series of technology companies including five years at Akamai Technologies, where he was a senior architect for systems infrastructure giving Justin a broad as well as deep background in distributed systems.
Justin was a key contributor to the technology that enabled fast growth of Akamai’s networks and services while allowing support costs to stay low. Justin performed both undergraduate and postgraduate studies in Computer Science at Northeastern University.
ODBMS.org — Free Downloads and Links
In this section you can download resources covering the following topics:
Big Data and Analytical Data Platforms
Cloud Data Stores
NoSQL Data Stores
Graphs and Data Stores
Entity Framework (EF) Resources
Object-Relational Impedance Mismatch
XML, RDF Data Stores,