“The only obstacle to Semantic Web technologies in the enterprise lies in better articulation of the value proposition in a manner that reflects the concerns of enterprises. For instance, the non-disruptive nature of Semantic Web technologies with regard to all enterprise data integration and virtualization initiatives has to be the focal point.”
–Kingsley Uyi Idehen.
I have interviewed Kingsley Idehen, founder and CEO of OpenLink Software. The main topics of this interview are the Semantic Web and the Virtuoso Hybrid Data Server.
Q1. The vision of the Semantic Web is one where web pages contain self-describing data that machines can navigate as easily as humans do now. What are the main benefits? Who could profit most from the Semantic Web?
Kingsley Uyi Idehen: The vision of a Semantic Web is actually the vision of the Web. Unbeknownst to most, they are one and the same. The goal was always to have HTTP URIs denote things, and by implication, said URIs basically resolve to their meaning.
Paradoxically, the Web bootstrapped on the back of URIs that denoted HTML documents (due to Mosaic’s ingenious exploitation of the “view source” pattern), thereby accentuating its Web of hyper-linked Documents (i.e., Information Space) aspect while leaving its Web of hyper-linked Data aspect somewhat nascent.
The nascence of the Web of hyper-linked Data (aka Web of Data, Web of Linked Data, etc.) laid the foundation for the “Semantic Web Project”, which naturally evolved into “The Semantic Web” meme. Unfortunately, “The Semantic Web” meme hit a raft of issues (many self inflicted) that basically disconnected it from the reality of its original Web vision and architecture.
The Semantic Web is really about the use of hypermedia to enhance the long-understood entity relationship model via the incorporation of _explicit_ machine- and human-comprehensible entity relationship semantics via the RDF data model. Basically, RDF is just an enhancement to the entity relationship model that leverages URIs for denoting entities and relations, which are described using subject->predicate->object based proposition statements.
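The subject->predicate->object model described above can be illustrated in a few lines of plain Python (a toy sketch, not Virtuoso’s API; the URIs are examples):

```python
# Minimal illustration of RDF-style subject->predicate->object statements,
# with URIs (here plain strings) denoting entities and relations.
triples = {
    ("http://example.org/kidehen", "http://xmlns.com/foaf/0.1/name", "Kingsley Idehen"),
    ("http://example.org/kidehen", "http://xmlns.com/foaf/0.1/workplaceHomepage",
     "http://www.openlinksw.com/"),
}

def match(s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

print(match(p="http://xmlns.com/foaf/0.1/name"))
```

A SPARQL basic graph pattern is essentially this kind of wildcard match, scaled up and indexed.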
For the rest of this interview, I would encourage readers to view “The Semantic Web” phrase as meaning: a Web-scale entity relationship model driven by hypermedia resources that bear entity relationship model description graphs that describe entities and their relations (associations).
To answer your question, the benefits of the Semantic Web are as follows: fine-grained access to relevant data on the Web (or private Web-like networks) with increasing degrees of serendipity.
Q2. Who is currently using Semantic Web technologies and how? Could you please give us some examples of current commercial projects?
Kingsley Uyi Idehen: I wouldn’t use “project” to describe endeavors that exploit Semantic Web oriented solutions. Basically, you have entire sectors being redefined by this technology. Examples range from “Open Government” (US, UK, Italy, Spain, Portugal, Brazil, etc.) all the way to publishing (BBC, Globo, Elsevier, New York Times, Universal, etc.) and then across to pharmaceuticals (OpenPHACTs, St. Judes, Mayo, etc.) and automobiles (Daimler Benz, Volkswagen, etc.). The Semantic Web isn’t an embryonic endeavor short on use cases and case studies; far from it.
Q3. Virtuoso is a Hybrid RDBMS/Graph Column store. How does it differ from relational databases and from XML databases?
Kingsley Uyi Idehen: First off, we really need to get the definitions of databases clear. As you know, the database management technology realm is vast. For instance, there isn’t any such thing as a non-relational database.
Such a system would be utterly useless beyond a comprehensible definition to a marginally engaged audience. A relational database management system is typically implemented with support for a relational-model-oriented query language, e.g., SQL, QUEL, OQL (from the Object DBMS era), and more recently SPARQL (for RDF oriented databases and stores). Virtuoso comprises a relational database management system that supports SQL, SPARQL, and XQuery. It is optimized to handle data organized as relational tables and/or relational property graphs (aka entity relationship graphs). Thus, Virtuoso is about providing you with the ability to exploit the intensional (open world propositions or claims) and extensional (closed world statements of fact) aspects of relational database management without imposing either on its users.
Q4. Is there any difference with Graph Data stores such as Neo4j?
Kingsley Uyi Idehen: Yes, as per my earlier answer, it is a hybrid relational database server that supports relational tables and entity relationship oriented property graphs. Its support for RDF’s data model enables the use of URIs as native types. Thus, every entity in a Virtuoso DBMS is endowed with a URI as its _super key_. You can de-reference the description of a Virtuoso entity from anywhere on a network, subject to data access policies and resource access control lists.
Q5. How do you position Virtuoso with respect to NoSQL (e.g., Cassandra, Riak, MongoDB, Couchbase) and to NewSQL (e.g., NuoDB, VoltDB)?
Kingsley Uyi Idehen: Virtuoso is a SQL, NoSQL, and NewSQL offering. Its URI based _super keys_ capability differentiates it from other SQL, NewSQL, and NoSQL relational database offerings, in the most basic sense. Virtuoso isn’t a data silo, because its keys are URI based. This is a “deceptively simple” claim that is very easy to verify and understand. All you need is a Web browser to prove the point, i.e., a Virtuoso _super key_ can be placed in the address bar of any browser en route to exposing a hypermedia based entity relationship graph that is navigable using the Web’s standard follow-your-nose pattern.
Q6. RDF can be encoded in various formats. How do you handle that in Virtuoso?
Kingsley Uyi Idehen: Virtuoso supports all the major syntax notations and data serialization formats associated with the RDF data model. This implies support for N-Triples, Turtle, N3, JSON-LD, RDF/JSON, HTML5+Microdata, (X)HTML+RDFa, CSV, OData+Atom, OData+JSON.
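As a toy illustration (not Virtuoso’s loader), the same statement can be rendered in two of the serializations listed above, N-Triples and Turtle; the URIs and prefix table are examples:

```python
# One triple rendered in two of the serializations listed above.
s, p, o = ("http://dbpedia.org/resource/Virtuoso_Universal_Server",
           "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
           "http://schema.org/SoftwareApplication")

def to_ntriples(s, p, o):
    # N-Triples: one fully spelled-out statement per line.
    return f"<{s}> <{p}> <{o}> ."

def to_turtle(s, p, o, prefixes):
    # Turtle: the same statement, with URIs abbreviated via prefixes.
    def qname(uri):
        for pfx, base in prefixes.items():
            if uri.startswith(base):
                return f"{pfx}:{uri[len(base):]}"
        return f"<{uri}>"
    return f"{qname(s)} {qname(p)} {qname(o)} ."

prefixes = {"rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
            "schema": "http://schema.org/"}
print(to_ntriples(s, p, o))
print(to_turtle(s, p, o, prefixes))
```

The point of supporting many notations is that they are interchangeable views of the same graph; a store parses any of them into the same triples.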
Q7. Does Virtuoso restrict the contents to triples?
Kingsley Uyi Idehen: Assuming you mean: how does it enforce integrity constraints on triple values?
It doesn’t enforce anything per se, since the principle here is “schema last”, whereby you don’t have a restrictive schema acting as an inflexible view over the data (as is the case with conventional SQL relational databases). Of course, an application can apply reasoning to OWL (Web Ontology Language) based relation semantics (i.e., in the so-called RBox) as an option for constraining the entity types that constitute triples. In addition, we will soon be releasing a SPARQL Views mechanism that provides a middle ground for this matter, whereby the aforementioned view can be used in a loosely coupled manner at the application, middleware, or DBMS layer for applying constraints to the entity types that constitute relations expressed by RDF triples.
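The “schema last” idea, loading data freely and applying constraints only on demand, can be sketched as follows (a toy illustration; the constraint map merely stands in for RBox-style range rules, and all names are invented):

```python
# "Schema last": load triples freely, then optionally validate against
# range constraints, rather than letting a schema reject data up front.
triples = [
    ("ex:order1", "ex:quantity", 5),
    ("ex:order2", "ex:quantity", "five"),   # loads fine; flagged only on demand
]

constraints = {"ex:quantity": int}          # hypothetical range rule

def violations(triples, constraints):
    """Return the triples whose object breaks a declared range constraint."""
    return [t for t in triples
            if t[1] in constraints and not isinstance(t[2], constraints[t[1]])]

print(violations(triples, constraints))
```

The contrast with a conventional RDBMS is that the second triple is never rejected at load time; the constraint layer is consulted only when an application chooses to apply it.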
Q8. RDF can be represented as a directed graph. Graphs, as data structures, do not scale well. How do you handle scalability in Virtuoso? How do you handle scale-out and scale-up?
Kingsley Uyi Idehen: The fundamental mission statement of Virtuoso has always been to destroy any notion of performance and scalability as impediments to entity relationship graph model oriented database management. The crux of the matter is that Virtuoso is massively scalable for the following reasons:
• fine-grained multi-threading scoped to CPU cores
• vectorized (array) execution of query commands across fine-grained threads
• column-store based physical storage which provides storage layout and data compaction optimizations (e.g., key compression)
• shared-nothing clustering that scales from multiple instances (leveraging the items above) on a single machine all the way up to a cluster comprising multiple machines.
The scalability prowess of Virtuoso is clearly showcased via live Web instances such as DBpedia and the LOD Cloud Cache (50+ billion triples). You also have no shortage of independent benchmark reports to complement the live instances:
• 50–150 billion triple scale Berlin SPARQL Benchmark (BSBM) report (.pdf)
Q9. Could you give us some commercial examples where Virtuoso is in use?
Kingsley Uyi Idehen: Elsevier, Globo, St. Judes Medical, the U.S. Government, and the EU are a tiny snapshot of the entities using Virtuoso on a commercial basis.
Q10. Do you plan in the near future to develop integration interfaces to other NoSQL data stores?
Kingsley Uyi Idehen: If a NewSQL or NoSQL store supports any of the following, their integration with Virtuoso is implicit: HTTP based RESTful interaction patterns, SPARQL, ODBC, JDBC, ADO.NET, OLE-DB. In the very worst of cases, we have to convert the structured data returned into 5-Star Linked Data using Virtuoso’s in-built Linked Data middleware layer for heterogeneous data virtualization.
Q11. Virtuoso supports SPARQL. SPARQL is not SQL, so how do you handle querying relational data?
Kingsley Uyi Idehen: Virtuoso supports SPARQL, SQL, SQL inside SPARQL, and SPARQL inside SQL (we call this SPASQL). Virtuoso has always had its own native SQL engine, and that’s integral to the entire product. Virtuoso provides an extremely powerful and scalable SQL engine, as exemplified by the fact that the RDF data management services are basically driven by the SQL engine subsystem.
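A SPASQL statement is, roughly, a SPARQL query submitted through the SQL channel. A hedged Python sketch (the DSN and credentials are hypothetical, and actually executing the statement requires a live Virtuoso instance reachable over ODBC):

```python
# SPARQL-inside-SQL (SPASQL): the leading SPARQL keyword tells Virtuoso's
# SQL parser to hand the rest of the statement to the SPARQL engine.
spasql = """
SPARQL
SELECT ?s ?name
WHERE { ?s <http://xmlns.com/foaf/0.1/name> ?name }
LIMIT 10
"""

# Against a live Virtuoso instance this would be executed over ODBC, e.g.:
#   import pyodbc
#   conn = pyodbc.connect("DSN=Virtuoso;UID=dba;PWD=dba")  # hypothetical DSN
#   rows = conn.cursor().execute(spasql).fetchall()
print(spasql.strip())
```

The appeal is that any SQL client stack (ODBC, JDBC, ADO.NET) can issue graph queries without a separate driver.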
Q12. How do you support Linked Open Data? What, in your opinion, are the main benefits of Linked Open Data?
Kingsley Uyi Idehen: Virtuoso enables you to expose data from the following sources, courtesy of its in-built 5-star Linked Data deployment functionality:
• RDF based triples loaded from Turtle, N-Triples, RDF/XML, CSV etc. documents
• SQL relational databases via ODBC or JDBC connections
• SOAP based Web Services
• Web Services that provide RESTful interaction patterns for data access.
• HTTP accessible document types e.g., vCard, iCalendar, RSS, Atom, CSV, and many others.
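The second bullet above, exposing SQL relational data as Linked Data, can be sketched outside Virtuoso (whose RDF Views do this declaratively, in the spirit of R2RML) as a toy row-to-triples mapping; all names and URIs here are illustrative:

```python
# Toy row-to-triples mapping: each column becomes a predicate, each row
# an entity whose primary key is minted into a URI "super key".
def row_to_triples(table, pk, row, base="http://example.org/"):
    subject = f"{base}{table}/{row[pk]}"
    return [(subject, f"{base}{table}#{col}", val)
            for col, val in row.items() if col != pk]

row = {"id": 42, "name": "ACME", "country": "US"}
for t in row_to_triples("customer", "id", row):
    print(t)
```

Once rows carry URI subjects like this, they are addressable and de-referenceable like any other Web resource, which is the whole point of 5-star deployment.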
Q13. What are the most promising application domains where you can apply triple store technology such as Virtuoso?
Kingsley Uyi Idehen: Any application that benefits from high-performance and scalable access to heterogeneously shaped data across disparate data sources. Healthcare, Pharmaceuticals, Open Government, Privacy enhanced Social Web and Media, Enterprise Master Data Management, Big Data Analytics etc..
Q14. Big Data analysis: can you connect Virtuoso with Hadoop? How does Virtuoso relate to commercial data analytics platforms, e.g., Hadapt, Vertica?
Kingsley Uyi Idehen: You can integrate data managed by Hadoop based ETL workflows via ODBC or Web Services driven by Hadoop clusters that expose RESTful interaction patterns for data access. As for how Virtuoso relates to the likes of Vertica re. analytics, this is about Virtuoso being the equivalent of Vertica plus the added capabilities of RDF based data management, Linked Data deployment, and shared-nothing clustering. There is no job that Vertica performs that Virtuoso can’t perform.
There are several jobs that Virtuoso can perform that Vertica, VoltDB, Hadapt, and many other NoSQL and NewSQL offerings simply cannot perform with regard to scalable, high-performance RDF data management and Linked Data deployment. Remember, RDF based Linked Data is all about data management and data access without any kind of platform lock-in. Virtuoso locks you into a value proposition (performance and scale), not the platform itself.
Q15. Do you also benchmark loading trillions of RDF triples? Do you have current benchmark results? How much time does it take to query them?
Kingsley Uyi Idehen: As per my earlier responses, there is no shortage of benchmark material for Virtuoso.
The benchmarks are also based on realistic platform configurations, unlike the RDBMS patterns of the past, which compromised the utility of TPC benchmarks.
Q16. In your opinion, what are the main current obstacles for the adoption of Semantic Web technologies in the Enterprise?
Kingsley Uyi Idehen: The only obstacle to Semantic Web technologies in the enterprise lies in better articulation of the value proposition in a manner that reflects the concerns of enterprises. For instance, the non-disruptive nature of Semantic Web technologies with regard to all enterprise data integration and virtualization initiatives has to be the focal point.
• 5-Star Linked Data URIs and the Semiotic Triangle
• What do HTTP URIs identify?
• The “view source” pattern and the Web bootstrap
• Unified view of data using the Entity Relationship Model (Peter Chen, 1976)
• Serendipitous Discovery Quotient (SDQ)
Kingsley Idehen is the Founder and CEO of OpenLink Software. He is an industry acclaimed technology innovator and entrepreneur in relation to technology and solutions associated with data management systems, integration middleware, open (linked) data, and the semantic web.
Kingsley has been at the helm of OpenLink Software for over 20 years during which he has actively provided dual contributions to OpenLink Software and the industry at large, exemplified by contributions and product deliverables associated with: Open Database Connectivity (ODBC), Java Database Connectivity (JDBC), Object Linking and Embedding (OLE-DB), Active Data Objects based Entity Frameworks (ADO.NET), Object-Relational DBMS technology (exemplified by Virtuoso), Linked (Open) Data (where DBpedia and the LOD cloud are live showcases), and the Semantic Web vision in general.
ODBMS.org free resources on: Relational Databases, NewSQL, XML Databases, RDF Data Stores.
Follow ODBMS Industry Watch on Twitter: @odbmsorg
“A common misconception when virtualizing Hadoop clusters is that we decouple the data nodes from the physical infrastructure. This is not necessarily true. When users virtualize a Hadoop cluster using Project Serengeti, they separate data from compute while preserving data locality. By preserving data locality, we ensure that performance isn’t negatively impacted, or essentially making the infrastructure appear as static.” – Joe Russell.
VMware announced in June last year an open source project called Serengeti.
The main idea of Project Serengeti is to enable users and companies to quickly deploy, manage and scale Apache Hadoop on virtual infrastructure.
I have interviewed Joe Russell, VMware, Product Line Marketing Manager, Big Data.
Q1. Why Virtualize Hadoop?
Joe Russell: Hadoop is a technology that Enterprises are increasingly using to process large amounts of information. While the technology is generally pretty early in its lifecycle, we are starting to see more enterprise-grade use cases. In its current form, Hadoop is difficult to use and lacks the toolsets to efficiently deploy, run, and manage Hadoop clusters in an Enterprise context. Virtualizing Hadoop not only brings enterprise-tested High Availability and Fault Tolerance to Hadoop, but it also allows for much more agile and automated management of Hadoop clusters. Additionally, virtualization allows for the separation of data and compute, which allows users to preserve data locality and paves the way towards more advanced use cases such as mixed workload deployments and Hadoop-as-a-service.
Q2. You claim to be able to deploy an Apache Hadoop cluster (HDFS, MapReduce, Pig, Hive) in minutes on an existing vSphere cluster using Serengeti. How do you do this? Could you give us an example on how you customize a Hadoop Cluster?
Joe Russell: You will be able to customize Hadoop clusters easily from within the Serengeti tool by specifying node and resource allocations through an easy-to-use user interface.
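As a sketch of what such a customization might look like, here is a hypothetical node-and-resource specification built in Python and serialized to JSON (the field names are illustrative, not Serengeti’s actual spec schema):

```python
import json

# Illustrative cluster spec: node groups with their own roles and
# resource allocations, serialized as JSON for a deployment tool.
spec = {
    "nodeGroups": [
        {"name": "master", "roles": ["hadoop_namenode", "hadoop_jobtracker"],
         "instanceNum": 1, "cpuNum": 4, "memCapacityMB": 8192},
        {"name": "worker", "roles": ["hadoop_datanode", "hadoop_tasktracker"],
         "instanceNum": 8, "cpuNum": 2, "memCapacityMB": 4096},
    ]
}
print(json.dumps(spec, indent=2))
```

The separation into named node groups is what lets data nodes and compute nodes be sized and scaled independently, as discussed below.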
Q3. There are concerns about the approach of decoupling Apache Hadoop nodes from the underlying physical infrastructure. Quoting Steve Loughran (HP Research): “Hadoop contains lots of assumptions about running in a static infrastructure; its scheduling and recovery algorithms assume this.” What is your take on this?
Joe Russell: A common misconception when virtualizing Hadoop clusters is that we decouple the data nodes from the physical infrastructure. This is not necessarily true. When users virtualize a Hadoop cluster using Project Serengeti, they separate data from compute while preserving data locality. By preserving data locality, we ensure that performance isn’t negatively impacted, or essentially making the infrastructure appear as static. Additionally, it creates true multi-tenancy within more layers of the Hadoop stack, not just the name node.
I think there is some confusion when we say “in the cloud”. Here, Steve is talking about running it on a public cloud like Amazon. Steve is largely introducing the concept of data locality, or the notion that large amounts of data are hard to move. In this scenario, it makes sense to bring compute resources to the data to ensure performance isn’t negatively impacted by networking limitations. VMware advocates that Hadoop should be virtualized, as it introduces a level of flexibility and management that allows companies to easily deploy, manage, and scale internal Hadoop clusters.
Q4. How do you ensure High Availability (HA)? How do you protect against host and VM failures?
Joe Russell: We ensure High Availability (HA) by leveraging vSphere’s tested solution via Project Serengeti’s integration with vCenter (management console of vSphere).
In the event of physical server failure, affected virtual machines are automatically restarted on other production servers with spare capacity. In the case of operating system failure, vSphere HA restarts the affected virtual machine on the same physical server.
In Hadoop nomenclature, this means that there is HA on more than just the name node. vSphere’s solution also allows for HA on the jobtracker node, metastores, and on the management server, which are critical pieces of any Hadoop system that require high availability.
More importantly, as Hadoop is a batch-oriented process, it is important that when a physical host does fail, you are able to pause and then restart that job from the point in time at which it went down. VMware’s vSphere solution allows for this and has been tested amongst the biggest Enterprises for the better part of the past decade.
Q5. How do you get Data Insights? Do you already have examples of how such virtualized Hadoop is currently used in Enterprises? If yes, which ones?
Joe Russell: Data Insights occur farther up the stack with analytics vendors.
Project Serengeti is a tool that allows you to run Hadoop on top of vSphere, and is a solution designed to allow users to consolidate Hadoop clusters on a single underlying virtual infrastructure. The tool allows users to run different types of Hadoop distributions on a hypervisor to gain the benefits of virtualization, which include efficiency, elasticity, and agility.
Q6. Does Serengeti only work with the VMware vSphere® platform?
Joe Russell: Project Serengeti today only works with the vSphere hypervisor.
However, VMware made the decision to open source Project Serengeti to make the code available to anyone who wishes to use it.
By making it open source vs. just offering a free closed-source product, VMware allows users to take the Serengeti code and alter it for their own purposes. For example, any user could download the Project Serengeti code and alter it to work with hypervisors other than vSphere. While it isn’t in VMware’s interest to dedicate resources to making Project Serengeti run with other hypervisors, it doesn’t prevent users from doing so. This is an important point.
Q7. VMware is working with the Apache Hadoop community to contribute changes to the Hadoop Distributed File System (HDFS) and Hadoop MapReduce projects to make them “virtualization-aware”. What does it mean? What are these changes?
Joe Russell: Hadoop Virtual Extensions (“HVE”) is one example of this. VMware contributed HVE back to the Apache community to make Hadoop distributions virtualization-aware. This means inserting a node group layer between the rack and the host to make Hadoop distributions topology-aware on virtualized platforms. In its simplest terms, this allows VMware to preserve data locality and increase reliability through the separation of data and compute.
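The node group layer HVE adds can be pictured with a toy topology in Python (the VM names and the three-level distance function are illustrative, not Hadoop’s actual `NetworkTopology` API):

```python
# Sketch of the rack -> node group -> host hierarchy HVE introduces:
# two VMs on the same physical host share a node group, so the scheduler
# can treat them as "local" even though they are distinct Hadoop nodes.
topology = {
    "vm1": ("rack1", "host1"),
    "vm2": ("rack1", "host1"),   # same host => same node group
    "vm3": ("rack1", "host2"),
    "vm4": ("rack2", "host3"),
}

def distance(a, b):
    """0 = same node group (host), 1 = same rack, 2 = off-rack."""
    rack_a, host_a = topology[a]
    rack_b, host_b = topology[b]
    if host_a == host_b:
        return 0
    if rack_a == rack_b:
        return 1
    return 2

print(distance("vm1", "vm2"), distance("vm1", "vm3"), distance("vm1", "vm4"))
```

Without the node group level, the scheduler would see vm1 and vm2 as unrelated nodes and could miss the cheap local transfer between them.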
A link to a whitepaper with further detail can be found here (.pdf).
Q8. What about the performance of such virtualized Hadoop? Do you have performance measures to share?
Joe Russell: Please see whitepaper referenced above.
Q9. What is the value of Hadoop-in-cloud? How does it relate to the virtualization of Hadoop?
Joe Russell: I don’t necessarily understand the question and it would be particularly helpful to define what you mean by “Hadoop-in-Cloud”.
I think you may be referring to Hadoop-as-a-Service, which is valuable in that users are able to deliver Hadoop resources to internal users based on need. Centralized control through Hadoop-as-a-Service ensures high cluster utilization, lower TCO, and an agile framework to adjust to ever-changing business needs. As Enterprises increasingly look to service internal customers, I expect Hadoop-as-a-Service to become more popular as the Hadoop technology emerges within the enterprise. Please keep in mind that this relates both to private and public clouds. Virtualizing Hadoop is the first step toward being able to provision Hadoop in the cloud.
Q10. VMware also announced updates to Spring for Apache Hadoop. Could you tell us what are these updates?
Q11. VMware is working with a number of Apache Hadoop distribution vendors (Cloudera, Greenplum, Hortonworks, IBM, and MapR) to support a wide range of distributions. Why? Could you tell us exactly what VMware’s contribution is?
Joe Russell: VMware is focused on providing a common underlying virtual infrastructure so each of these vendors can run their software better on vSphere. Project Serengeti is a toolset that pre-configures, tunes and makes it easier to deploy and run Hadoop with increased reliability on vSphere. These efforts make it easier for enterprises to make architectural decisions around how to setup Hadoop within their companies. Deciding to virtualize Hadoop can have dramatic effects not only on companies just beginning to use Hadoop, but also on more advanced users of the technology. VMware’s contributions through Project Serengeti allow each of the vendor’s software to run better on virtualized infrastructure. As you know, these contributions are available for anyone to use.
Q12. Serengeti, virtualized Hadoop, Hadoop in the Cloud, Spring for Apache Hadoop: what is the global picture here? How do all of these efforts relate to each other? What are the main benefits for developers and users of Apache Hadoop?
Joe Russell: All of these efforts improve the technology and make it easier for developers and users of Hadoop to actually use Hadoop. Additionally, these efforts focus on virtualizing Hadoop to make the technology more elastic, reliable, and performant.
VMware is focused on bringing the benefits of virtualization to Hadoop, both from a community standpoint and a customer standpoint. It has been open in its approach to contributing back technology that makes it easier for users / developers to utilize virtualization for their Hadoop clusters. Conversely, it is investing in bringing Hadoop to its existing customers by making the technology more reliable and building easy to use tools around the technology to make it easier to deploy and administrate in an Enterprise setting with SLAs and business critical workloads.
Joe Russell is responsible for product strategy, GTM, evangelism and product marketing of Big Data at VMware.
He has over a decade of experience in a blend of product marketing, finance, operations, and M&A roles.
Previously he worked for Yahoo!, and as an Investment Banking – Technology M&A Analyst for GCA Savvian, Credit Suisse and Societe Generale.
He holds an MSc in Accounting & Finance from the London School of Economics and Political Science, a BS in Economics with Honors from the University of Washington, and an MBA from the Wharton School, University of Pennsylvania.
ODBMS.org - Lecture Notes: Data Management in the Cloud.
by Michael Grossniklaus, David Maier, Portland State University.
Course Description: “Cloud computing has recently seen a lot of attention from research and industry for applications that can be parallelized on shared-nothing architectures and have a need for elastic scalability. As a consequence, new data management requirements have emerged with multiple solutions to address them. This course will look at the principles behind data management in the cloud as well as discuss actual cloud data management systems that are currently in use or being developed. The topics covered in the course range from novel data processing paradigms (MapReduce, Scope, DryadLINQ), to commercial cloud data management platforms (Google BigTable, Microsoft Azure, Amazon S3 and Dynamo, Yahoo PNUTS) and open-source NoSQL databases (Cassandra, MongoDB, Neo4J). The world of cloud data management is currently very diverse and heterogeneous. Therefore, our course will also report on efforts to classify, compare and benchmark the various approaches and systems. Students in this course will gain broad knowledge about the current state of the art in cloud data management and, through a course project, practical experience with a specific system.”
Lecture Notes | Intermediate/Advanced | English | LINK TO DOWNLOAD ~280 slides (PDF) | 2011-12
“A distribution is not–or not necessarily–a fork of the code and we have no intention to fork Hadoop. At this point, the value-add that we bring to the table is strictly layered on top of Apache HD and interacts cleanly with the vanilla Hadoop stack” –Scott Yara and Florian Waas.
Greenplum announced on Monday, February 25th a new Hadoop distribution: Pivotal HD. I asked a few questions on Pivotal HD to Scott Yara, Senior Vice President, Products and Co-Founder Greenplum/EMC, and Florian Waas, Senior Director of Advanced Research and Development at Greenplum/EMC.
Q1. What is in your opinion the status of adoption of, and investment in, open source projects such as Hadoop within the Enterprise?
Scott Yara, Florian Waas: We have seen a massive shift in perception when it comes to open source.
In the past, innovation was primarily driven by commercial R&D departments and open source was merely trying to catch up to them. And even though a number of open source projects from that era have become household names they weren’t necessarily viewed as leaders in innovation.
This has fundamentally changed in recent years: open source has become a hotbed of innovation, in particular in infrastructure technology. Hadoop and a variety of other data management and database products are testament to this change. Enterprise customers do realize this trend and have started adopting open source at large scale. It allows them to get their hands on new technology much faster than was the case before, and as an additional perk, this technology comes without the dreaded vendor lock-in.
By now, even the most conservative enterprises have developed open source strategies that ensure they have their finger on the pulse and that adoption cycles are short and effective.
So, in short, the prospects for open source have never been better!
Q2. In your opinion is the future of Hadoop made of hybrid products?
Scott Yara, Florian Waas: Hadoop is a collection of products or tools and, apart from the relatively mature HDFS interfaces, is still evolving. Its original value proposition has changed quite dramatically. Remember, initially it was all about MapReduce, the cool programming paradigm that lets you whip up large-scale distributed programs in no time, requiring only rudimentary programming skills.
Yet, that’s not the reason Hadoop has attracted the attention of enterprises lately. Frankly, the MapReduce programming paradigm was a non-starter for most enterprise customers: it’s at too low a level of abstraction and curating and auditing MapReduce programs is prohibitively expensive for customers unless they have a serious software development shop dedicated to it. What has caught on, however, is the idea of ‘cheap scalable storage’!
In our view the future of Hadoop is really this: a solid abstraction of storage in the form of HDFS with any number of different processing stacks on top, including higher-level query languages. Naturally this will be a collection of different products, hybrids where necessary. I think we’ve only seen the tip of the iceberg yet.
Q3. Why introducing a new Hadoop distribution?
Scott Yara, Florian Waas: Let’s be clear about one thing first: to us a distribution is simply a bundle of software components that comes with the assurance that the bundled products have been integration-tested and certified. To enterprise customers this assurance is vital as it gives them the single point of contact when things go wrong. And exactly this is the objective of Pivotal HD.
A distribution is not–or not necessarily–a fork of the code and we have no intention to fork Hadoop. At this point, the value-add that we bring to the table is strictly layered on top of Apache HD and interacts cleanly with the vanilla Hadoop stack.
As long as no vendor actively subverts the Hadoop project, we don’t see any need to fork. That being said, if a single vendor sweeps up a significant number of contributors or even committers of any individual project it always raises a couple of red flags and customers will be concerned whether somebody is going to hijack the project. At this point, we’re not aware of any such threat to the open-source nature of Hadoop.
Q4. How did you expand Hadoop capabilities as a data platform with Pivotal HD?
Scott Yara, Florian Waas: Pivotal HD is a full Apache HD distribution plus some Pivotal add-ons. As we said before, the HDFS abstraction is a pretty good one, but the standard stack on top of it lacks severely in performance and expressiveness; so we give customers better alternatives. For enterprise customers this means: you can use Pivotal HD like regular Hadoop where applicable, but if you need more, you get it in the same bundle.
Q5. What is the rationale beyond introducing HAWQ, a relational database that runs atop of HDFS?
Scott Yara, Florian Waas: Not quite. We’ve transplanted a modern distributed query engine onto HDFS. We stripped out a lot of “incidental” database technology that databases are notorious for. HAWQ gives enterprises the best of both worlds: high-performance query processing for a query language they already know on the one hand, and scalable open storage on the other hand. And, unlike with a database, data isn’t locked away in a proprietary format: in HAWQ you can access all stored data with any number of tools when you need to.
Q6. How does Pivotal HD differ from Hadapt in this respect?
Scott Yara, Florian Waas: Hadapt is still in its infancy with what looks like a long way to go; mainly because they couldn’t tap into a MPP database product.
Folks sometimes forget how much work goes into building a commercially viable query processor. In the case of Greenplum, it’s been about 10 years of engineering.
Q7. How does HAWQ work?
Scott Yara, Florian Waas: HAWQ is a modern distributed and parallel query processor atop HDFS, with all the features you truly need but without the bloat of a complete RDBMS.
Obviously there are a number of rather technical details about how exactly the two worlds integrate, and interested readers can find specific technical descriptions on our website.
Q8 You write in the Greenplum Blog that “HAWQ draws from the 10 years of development on the Greenplum Database product”. Can you be more specific?
Scott Yara, Florian Waas: Building a distributed query engine that is general and powerful enough to support deep analytics is a very tough job. There are no shortcuts. Hive and all of the SQL-ish interfaces we’ve recently seen are attempts at it; they work well for simple queries but have basically failed to deliver solid performance when it comes to advanced analytics.
Having spent a long time working on DB internals, we sometimes forget how much development this technology has undergone. Folks new to this space constantly “discover” problems query processing has dealt with for a long time already, like join ordering; this learning-by-doing approach is kind of cute, but not necessarily effective.
Q9. Why does HAWQ have its own execution engine separate from MapReduce? Why does it manage its own data?
Scott Yara, Florian Waas: Looks like we’re answering the questions in the wrong order!
MapReduce is a great tool to teach parallelism and distributed processing and makes for an intuitive hands-on experience. But unless your problem is as simple as a distributed word-count, MapReduce quickly becomes a major development and maintenance headache; and even then the resulting performance is sub-standard.
In short, MapReduce, while maybe great for software shops with deep expertise in distributed programming and a do-it-yourself attitude, is not enterprise-ready.
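For readers unfamiliar with the pattern, the distributed word-count mentioned above can be sketched in plain Python. This is an illustrative simulation of the map, shuffle, and reduce phases, not Hadoop’s actual API:

```python
from collections import defaultdict

def map_phase(lines):
    # Mapper: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group intermediate values by key, as the framework
    # would do between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts collected for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big tools", "big queries"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"])  # 3
```

Even this trivial problem needs three cooperating phases; anything beyond word-count multiplies that boilerplate, which is the maintenance headache the answer alludes to.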
Q10. HAWQ supports Columnar or row-oriented storage. Why this design choice?
Scott Yara, Florian Waas: Columnar vs. row orientation really is a smoke screen; it always has been. We’ve long advocated viewing columnar for what it is: a feature, not an architectural principle. If your query processor follows even the most basic software engineering principles, supporting column orientation is really easy.
Plenty of white papers have been written on the differences and have discussed the application scenarios where one out-performs the other and vice versa. As so often, there is no one-size-fits-all. HAWQ lets customers use what they feel is the right format for the job. We want customers to be successful, not blindly follow an ideology.
Just as different workload requirements demand different orientations, HAWQ can ingest different data formats well beyond column or row orientation, optimized for query processing, or optimized for third-party applications, etc., which rounds out the picture.
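The row vs. column distinction the answer refers to can be illustrated with a toy sketch (data and field names are made up):

```python
# Row orientation: each record is stored contiguously; cheap to
# fetch or write a whole record at once.
rows = [
    ("alice", 30, "NYC"),
    ("bob", 25, "SF"),
]

# Column orientation: each attribute is stored contiguously; cheap to
# scan or aggregate one attribute across many records.
columns = {
    "name": ["alice", "bob"],
    "age":  [30, 25],
    "city": ["NYC", "SF"],
}

# An aggregate over one attribute touches only that column's data,
# which is why analytic scans favor columnar layouts:
avg_age = sum(columns["age"]) / len(columns["age"])
print(avg_age)  # 27.5
```

Either layout holds the same records, which is the sense in which orientation is a storage feature rather than an architectural principle.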
Q11. Could you give us some technical detail on how the SQL parallel query optimizer and planner works?
Scott Yara, Florian Waas: What you see in HAWQ today is the tried and true Greenplum Database MPP optimizer with a couple of modifications, but largely the same battle-tested technology. That’s what allowed us to move ahead of the competition so quickly while everybody else is still trying to catch up to basic MPP functionality.
Having said that, we’re constantly striving for improvement and pushing the limits. Over the past years, we have invested in what we believe is a ground-breaking optimizer infrastructure which we’ll unveil later this summer. So, stay tuned!
Q12. Could you give us some details on the partitioning strategy you use and what kind of benchmark results do you have?
Scott Yara, Florian Waas: Benchmarks are a funny thing: hardly any competitor can run even the most basic database benchmarks yet, so we’re comparing on simple, almost trivial, queries only. Anyway, here’s what we’ve been seeing so far: if the query is completely trivial, the nearest competitor is slower by at least a factor of two. For anything even slightly more complex, the difference widens quickly to one to two orders of magnitude.
Q13. Apache Hadoop is open-source, do you have plans to open up HAWQ and the other technologies layered atop it?
Scott Yara, Florian Waas: We’ve been debating this but haven’t really made a decision, as of yet.
Q14. The Hadoop market is crowded: e.g., Cloudera (Impala), Hortonworks’ Data Platform for Windows, Intel’s Hadoop distribution, and the NewSQL data store Hadapt. How do you stand out from this crowd of competitors?
Scott Yara, Florian Waas: We clearly captured a position of leadership with HAWQ, and enterprise customers do recognize that. We’ve also received a lot of attention from competitors, which shows that we clearly hit a nerve and deliver a piece of the puzzle enterprises have long been waiting for.
Q15. With Pivotal HD are you competing in the same market space as Teradata Aster?
Scott Yara, Florian Waas: Aster has traditionally targeted a few select verticals. For all we can tell, it looks like we’re seeing the continuation of that strategy with a highly specialized offering going forward.
In contrast to that, Pivotal HD strives to be a general purpose solution for as broad a customer spectrum as you can imagine.
Q16. Jeff Hammerbacher in 2011 said “The best minds of my generation are thinking about how to make people click ads… That sucks. If instead of pointing their incredible infrastructure at making people click on ads, they pointed it at great unsolved problems in science, how would the world be different today?” What is your take on this?
Scott Yara, Florian Waas: Jeff garnered a lot of attention with this quote, but let’s face it, this type of criticism isn’t exactly novel or very productive. Joseph Weizenbaum, one of the pioneers of AI, famously lamented for decades about the genius and technology wasted on TV satellites. Along the same lines, other MIT faculty have decried the fact that their most successful engineering students become quants on Wall Street. The list is probably long.
Instead of scolding people for what they didn’t do, I’d say, let’s empower people and give them the tools to do great things and solve truly important problems. It’s no coincidence that Big Data problems are at the heart of the most pressing challenges humanity faces today. So, let’s get moving!
Scott Yara is Senior Vice President, Products, and Co-Founder, Greenplum/EMC.
In his role as SVP, Products, Scott is responsible for the division’s overall product development and go-to-market efforts, including engineering, product management, and marketing. Scott is a co-founder of Greenplum and was President of the company. Prior to Greenplum, Scott served as vice president for Digital Island, a publicly traded Internet infrastructure services company that was acquired by Cable & Wireless in 2001. Prior to Digital Island, Scott served as vice president for Sandpiper Networks, an Internet content delivery services company that merged with Digital Island in 1999. At Sandpiper, Scott helped to create the industry’s first content delivery network (CDN), a globally distributed computing infrastructure comprised of several thousand servers, and used by many of the industry’s largest Internet services including Microsoft and Disney.
As Senior Director of Advanced Research and Development at Greenplum/EMC, Florian Waas heads up the division’s department of Impossible Ideas. That is to say, his day job is to look into ideas that are far from ready to be undertaken as engineering efforts, and then look at what it would take to turn theory into practice.
He obtained his MSc in Computer Science from Passau University, Germany and a PhD in database research from the University of Amsterdam. Florian Waas has worked as a researcher for several European research consortia and universities in Germany, Italy, and The Netherlands. Before joining Greenplum, Florian Waas held positions at Microsoft and Amazon.com.
- ODBMS.org free resources on Big Data and Analytical Data Platforms
Blog Posts | Free Software | Articles | Lecture Notes | PhD and Master Thesis |
Follow ODBMS.org on Twitter: @odbmsorg
“For traditional business applications, the schema is known in advance, so there is no need to use a graph database, which has weaker enforcement of integrity. If instead you’re dealing with data that, at best, conforms to a generic model, then a schema-oriented approach does not provide much; a graph-oriented approach is more natural and easier to develop against.” – Michael Blaha
Graphs, SQL and Databases. On this topic I have interviewed our expert Michael Blaha.
Q1. A lot of today’s data can be modeled as a heterogeneous set of “vertices” connected by a heterogeneous set of “edges”: people, events, items, etc., related by knowing, attending, purchasing, etc. This world view is not new, as the object-oriented community has a similar perspective on data. What is in your opinion the main difference with respect to a graph-centric data world?
Michael Blaha: This world view is also not new because this is the approach Charlie Bachman took with network databases many years ago. I can think of at least two major distinguishing aspects of graph-centric databases relative to relational databases.
(1) Graph-centric databases are occurrence-oriented while relational databases are schema-oriented. If you know the schema in advance and must ensure that data conforms to it, then a schema-oriented approach is best. Examples include traditional business applications, such as flight reservations, payroll, and order processing.
(2) Graph-centric databases emphasize navigation. You start with a root object and pull together a meaningful group of related objects. Relational databases permit navigation via joins, but such navigation is more cumbersome and less natural. Many relational database developers are not adept at performing such navigation.
Q2. The development of scalable graph applications, such as those at Facebook and Twitter, requires a different kind of database than SQL. Most of these large Web companies have built their own internal graph databases. But what about other enterprise applications?
Michael Blaha: The key is the distinction between being occurrence-oriented and schema-oriented. For traditional business applications, the schema is known in advance, so there is no need to use a graph database, which has weaker enforcement of integrity. If instead you’re dealing with data that, at best, conforms to a generic model, then a schema-oriented approach does not provide much; a graph-oriented approach is more natural and easier to develop against.
Q3: Marko Rodriguez and Peter Neubauer in an interview say that “the benefit of the graph comes from being able to rapidly traverse structures to an arbitrary depth (e.g., tree structures, cyclic structures) and with an arbitrary path description (e.g. friends that work together, roads below a certain congestion threshold). We call this data processing pattern, the graph traversal pattern. This mind set is much different from the set theoretic notions of the relational database world. In the world of graphs, everything is seen as a walk’s traversal”. What is your take on this?
Michael Blaha: That’s a great point and one that I should have mentioned in my answer to Q1. Relational databases have poor handling of recursion. I will note that vendor products have extensions for this, but they aren’t natural and are an awkward graft onto SQL. Graph databases, in contrast, are great at handling recursion. This is a big advantage of graph databases for applications where recursion arises.
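As a concrete illustration of the point about recursion, here is a sketch using Python’s built-in SQLite: the same ancestor traversal expressed as a recursive SQL common table expression and as a direct walk over a graph-style adjacency structure (the table, column names, and data are invented for the example):

```python
import sqlite3

# A tiny parent/child edge table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE edges (child TEXT, parent TEXT)")
con.executemany("INSERT INTO edges VALUES (?, ?)",
                [("b", "a"), ("c", "b"), ("d", "c")])

# In SQL, following the relationship to arbitrary depth requires a
# recursive common table expression:
sql = """
WITH RECURSIVE ancestors(node) AS (
    SELECT parent FROM edges WHERE child = 'd'
    UNION
    SELECT e.parent FROM edges e JOIN ancestors a ON e.child = a.node
)
SELECT node FROM ancestors
"""
via_sql = {row[0] for row in con.execute(sql)}

# In a graph representation, the same traversal is a plain walk:
parent = {"b": "a", "c": "b", "d": "c"}
via_graph, node = set(), "d"
while node in parent:
    node = parent[node]
    via_graph.add(node)

print(via_sql == via_graph)  # True
```

Both yield the ancestors of `d`, but the SQL version carries the “awkward graft” flavor Blaha describes, while the walk is the natural idiom in a graph model.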
Q4. Is there any synergy between graphs and conventional relational databases?
Michael Blaha: Graphs are also important for relational databases, more so than some people may realize…
– Graphs are clearly relevant for data modeling. An Entity-Relationship data model portrays the database structure as a graph.
– Graphs are also important for expressing database constraints. The OMG’s Object Constraint Language (OCL) expresses database constraints using graph traversal. The OCL is a textual language so it can be tedious to use, but it is powerful. The Common Warehouse Metamodel (CWM) specifies many fine constraints with the OCL and is a superb example of proper OCL usage.
– Even though the standard does not emphasize it, the OCL is also an excellent language for database traversal as a starting point for database queries. Bill Premerlani and I explained this in a past book (Object-Oriented Modeling and Design for Database Applications).
– Graphs are also helpful for characterizing the complexity of a relational database design. Robert Hilliard presents an excellent technique for doing this in his book (Information-Driven Business).
Q5. You say that graphs are important for data modeling, but at the end you do not store graphs in a relational database but tables, and you need joins to link them together… Graph databases in contrast cache what is on disk into memory and vendors claim that this makes for a highly reusable in-memory cache. What is your take on this?
Michael Blaha: Relational databases play many optimization games under the covers, so in general, possible performance differences are often not obvious. I would say that the difference in expressiveness is what determines the suitable applications for graph and relational databases; performance is a secondary issue, except for very specialized applications.
Q6: What are advantages of SQL relative to graph databases?
Michael Blaha: Here are some advantages of SQL:
– SQL has a widely-accepted standard.
– SQL is a set-oriented language. This is good for mass-processing of set-oriented data.
– SQL databases have powerful query optimizers for handling set-oriented queries, such as for data warehouses.
– The transaction processing behavior of relational databases (the ACID properties) is robust, powerful, and sound.
– SQL has extensive support for controlling data access.
Q7: What are disadvantages of SQL relative to graph databases?
Michael Blaha: Here are some disadvantages of SQL:
– SQL is awkward for processing the explosion of data that can result from starting with an object and traversing a graph.
– SQL, at best, awkwardly handles recursion.
– SQL has lots of overhead for multi-user locking that can make it difficult to access individual objects and their data.
– Advanced and specialty applications often require less rigorous transaction processing with reduced overhead and higher throughput.
Q8: For which applications is SQL best? For which applications are graph databases best?
Michael Blaha:
– SQL is schema based. Define the structure in advance and then store the data. This is a good approach for conventional data processing, such as many business and financial systems.
– Graph databases are occurrence based. Store data and relationships as they are encountered. Do not presume that there is an encompassing structure. This is a good approach for some scientific and engineering applications as well as data that is acquired from Web browsers and search engines.
Q9. What about RDF quad/triple stores?
Michael Blaha: I have not paid much attention to this. RDF is an entity-attribute-value approach. From what I can tell, it seems occurrence based rather than schema based, so my earlier comments apply.
Michael Blaha is a partner at Modelsoft Consulting Corporation.
Blaha received his doctorate from Washington University in St. Louis, Missouri in Chemical Engineering with his dissertation being about databases. Both his academic background and working experience involve engineering and computer science. He is an alumnus of the GE R&D Center in Schenectady, New York, working there for eight years. Since 1993, Blaha has been a consultant and trainer in the areas of modeling, software architecture, database design, and reverse engineering. Blaha has authored six U.S. patents, four books, and many papers. Blaha is an editor for IEEE Computer as well as a member of the IEEE-CS publications board. He has also been active in the IEEE Working Conferences on Reverse Engineering.
“I think that it’s incredibly important for all programmers to have a public presence by being involved in open source or having side projects that are publicly available. The industry is quickly changing and more and more people are realizing how ineffective the standard techniques of programmer evaluation are. This includes things like resumes and coding questions.” –Nathan Marz
Nathan Marz: You only live once, so it’s important to make the most of the time you have. I find Bezos’s “regret minimization framework” a great way to make decisions with a long-term perspective. Too often people make decisions thinking only about marginal, short-term gains, and this can lead you down a path you never intended to go. And failure, if it happens, is not as bad as it seems. Worst comes to worst, I’ll have learned an enormous amount, had a unique and interesting experience, and will just try something else.
Q2. Do you want to disclose in general terms what you’ll be working on?
Nathan Marz: Sorry, not at the moment. [Edit: as of now, he has not disclosed it.]
Q3. You open-sourced Cascalog, ElephantDB, and Storm. Which of the three is in your opinion the most rewarding?
Nathan Marz: Storm has been very rewarding because of the sheer number of people using it and the diversity of industries it has penetrated, from healthcare to analytics to social networking to financial services and more.
Q4. What are in your opinion the current main challenges for Big Data analytics?
Nathan Marz: I think the biggest challenge is an educational one. There’s an overwhelming number of tools in the Big Data ecosystem, all very much different from the relational databases people are used to, and none is a one-size-fits-all solution. This is why I’m writing my book “Big Data”: to show people a structured, principled approach to architecting data systems and how to use those principles to choose the right tool for your particular use case.
Q5. In January 2013, version 0.8.2 of Storm was released. What is new?
Nathan Marz: There was a lot of work done in 0.8.2 on making it easier to use a shared cluster for both production and in-development applications. This included improved monitoring support, which helps with detecting when you’ll need to scale with more resources, and a brand new scheduler that isolates production and development topologies from each other. And of course, the usual bug fixes and small improvements.
Q6. How do you expect Storm evolving?
Nathan Marz: There’s a lot of work happening right now on making Storm enterprise-ready. This includes security features such as authentication and authorization, enhanced monitoring capabilities, and high availability for the Storm master. Long term, we want to continue with the theme of having Storm seamlessly integrate with your other realtime backend systems, such as databases, queues, and other services.
Q7. Daniel Abadi of Hadapt, said in a recent interview: “the prevalent architecture that people use to analyze structured and unstructured data is a two-system configuration, where Hadoop is used for processing the unstructured data and a relational database system is used for the structured data. However, this is a highly undesirable architecture, since now you have two systems to maintain, two systems where data may be stored, and if you want to do analysis involving data in both systems, you end up having to send data over the network which can be a major bottleneck.”
What is your opinion on this?
Nathan Marz: I think that “structured” vs. “unstructured” is a false dichotomy. It’s easy to store both unstructured and structured data in a distributed filesystem: just use a tool like Thrift to make your structured schema and serialize those records into files. A common objection to this is: “What if I need to delete or modify one of those records? You can’t cheaply do that when the data is stored in files on a DFS.” The answer is to move beyond the era of CRUD and embrace immutable data models where you only ever create or read data. In the architecture I’ve developed, which I call the “Lambda Architecture”, you then build views off of that data using tools like Hadoop and Storm, and it’s the views that are indexed and go on to feed the low-latency requests in your system.
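A minimal sketch of the immutable, create-and-read-only model described above (the record layout and field names are assumptions for illustration, not Marz’s actual design): facts are only ever appended, and a “view” is recomputed from the full log:

```python
import json

# Immutable master dataset: facts are appended, never updated
# or deleted in place.
events = []

def record(fact):
    events.append(json.dumps(fact))  # serialized, append-only

record({"user": 1, "field": "email", "value": "a@x.com", "ts": 100})
record({"user": 1, "field": "email", "value": "b@x.com", "ts": 200})

def current_emails(log):
    # A "view" built by replaying the whole log: the latest email per
    # user. An "update" is simply a newer fact superseding an older one.
    view = {}
    for raw in sorted(log, key=lambda r: json.loads(r)["ts"]):
        fact = json.loads(raw)
        if fact["field"] == "email":
            view[fact["user"]] = fact["value"]
    return view

print(current_emails(events)[1])  # b@x.com
```

In the Lambda Architecture the replay would be done at scale by batch tools like Hadoop, with the resulting view indexed for low-latency reads; the point of the sketch is only that mutation is never needed on the master data.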
Q8. What are the main lessons learned in the last three years of your professional career?
Nathan Marz: I think that it’s incredibly important for all programmers to have a public presence by being involved in open source or having side projects that are publicly available. The industry is quickly changing and more and more people are realizing how ineffective the standard techniques of programmer evaluation are. This includes things like resumes and coding questions. These techniques frequently label strong people as weak or weak people as strong. Having good work out in the open makes it much easier to evaluate you as a strong programmer. This gives you many more job options and will likely drive up your salary as well because of the increased competition for your services.
For this reason, programmers should strongly prefer to work at companies that are very permissive about contributing to open source or releasing internal projects as open source. Ironically, a company having this policy is assisting in driving up the value (and price) of the employee, but as time goes on I think this policy will be necessary to even have access to the strongest programmers in the first place.
Nathan Marz was the Lead Engineer at BackType before BackType was acquired by Twitter in July of 2011. At Twitter he started the streaming compute team, which provides infrastructure that supports many critical applications throughout the company. He left Twitter in March of 2013 to start his own company (currently in stealth).
- Big Data: Principles and best practices of scalable realtime data systems.
Nathan Marz (Twitter) and James Warren
MEAP Began: January 2012
Softbound print: Fall 2013 | 425 pages
Download Chapter 1: A new paradigm for Big Data (.PDF)
“Working with empirical genomic data and modern computational models, the laboratory addresses questions relevant to how genetics and the environment influence the frequency and severity of diseases in human populations” –Thibault de Malliard.
Big Data for Genomic Sequencing. On this subject, I have interviewed Thibault de Malliard, researcher at the University of Montreal’s Philip Awadalla Laboratory, who is working on bioinformatics solutions for next-generation genomic sequencing.
Q1. What are the main research activities of the University of Montreal’s Philip Awadalla Laboratory?
Thibault de Malliard: The Philip Awadalla Laboratory is the Medical and Population Genomics Laboratory at the University of Montreal. Working with empirical genomic data and modern computational models, the laboratory addresses questions relevant to how genetics and the environment influence the frequency and severity of diseases in human populations. Its research includes work relevant to all types of human diseases: genetic, immunological, infectious, chronic and cancer.
Using genomic data from single-nucleotide polymorphisms (SNP), next-generation re-sequencing, and gene expression, along with modern statistical tools, the lab is able to locate genome regions that are associated with disease pathology and virulence as well as study the mechanisms that cause the mutations.
Q2. What is the lab’s medical and population genomics research database?
Thibault de Malliard: The lab’s database groups together all the mutations (SNPs) found by DNA genotyping, DNA sequencing, and RNA sequencing for each sample. There is also annotation data from public databases.
Q3. Why is data management important for the genomic research lab?
Thibault de Malliard: All the data we have is in text CSV files. This is what our software takes as input, and it will output other text CSV files. So we use a lot of Bash and Perl to extract the information we need and to do some stats. As time goes on, we multiply the number of files by sample and by experiment, and finally we get statistics based on the whole data set that need recalculating each time we perform a new sequencing/genotyping run (mutation frequency, mutations per gene, etc.).
With this database, we are also preparing for the lab’s future:
• As the amount of data increases, one day it will no longer fit in an in-memory associative array.
• Looking through a 200 GB file to find one specific mutation will not be a good option.
• Adding new data to the current files will take more and more time/space.
• We need to be able to select the data according to every parameter we have, i.e., grouping by type of mutation and/or by chromosome, and/or by sample information such as gender, ethnicity, age, or pathology.
• We then need to export a file, or count / sum / average it.
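The kind of file-based extraction described above might look like the following in Python (column names and values are invented; the lab’s actual formats will differ): streaming through a CSV of mutation calls to pull out one position.

```python
import csv
import io

# Stand-in for a (potentially huge) CSV file of mutation calls.
data = io.StringIO(
    "sample,chrom,pos,ref,alt\n"
    "s1,chr1,12345,A,T\n"
    "s2,chr1,99999,G,C\n"
)

# Find every call at one specific position by scanning the whole file;
# with a 200 GB file this full scan is exactly the bottleneck described.
hits = [row for row in csv.DictReader(data)
        if row["chrom"] == "chr1" and row["pos"] == "12345"]
print(hits[0]["sample"])  # s1
```

A database index turns this linear scan into a direct lookup, which is the motivation for moving off flat files.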
Q4. Could you give us a description of what kind of data the lab’s genomic research database is storing and processing? And for what applications?
Thibault de Malliard: We are storing single nucleotide polymorphisms (SNPs), which are the most common form of genetic mutation among people, from sequencing and genotyping. When an SNP is found for a sample, we also look at what we have at the same position in the other samples:
• There is no SNP but data for the sample, so we know this sample does not have the SNP.
• There is no data for the sample, so we cannot assess whether or not there is an SNP for this sample at this position.
We gather between 1.8 and 2.5 million nucleotides per sample (counting positions where at least one sample has data), depending on the experiment technique. We store them in the database along with some information:
• how damaged the SNP can be for the function of the gene
• its frequency in different populations (African, European, French Canadian…).
The database also contains information about each sample, such as gender, ethnicity, pathology. This will keep growing with our needs. So, basically, we have a sample table, a mutations table with their information, an experiment table and a big table linking the 3 previous tables with relations one to many.
Here is a very slightly simplified example of a single record in our database:
| Type of data | Data | Table |
| --- | --- | --- |
| SNP | T | Begin: Mutation information table |
| Damaging for gene function? | synonymous | |
| Present in known database? | yes | End: Mutation information table |
| Sequencing quality | 26 | Begin: linking table (information about 1 mutation for 1 sample from 1 sequencing) |
| Validated by another experiment? | no | End: linking table |
| Sample | 345 | Begin: Sample table |
| family | 10 | End: Sample table |
| Sequencing information | Illumina Hiseq 2500 | Begin: Sequencing table |
| Sequencing type (DNA, RNA…) | RNAseq | |
| Analysis pipeline info | No PCR duplicates, only properly paired | End: Sequencing table |
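The layout described, three reference tables plus one big linking table with one-to-many relations, could be sketched purely hypothetically with SQLite (all table and column names are assumptions, not the lab’s actual schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Three reference tables and one large linking table, mirroring the
# sample / mutation / sequencing structure described in the answer.
con.executescript("""
CREATE TABLE sample     (id INTEGER PRIMARY KEY, gender TEXT, family INTEGER);
CREATE TABLE mutation   (id INTEGER PRIMARY KEY, nucleotide TEXT, damaging TEXT, known INTEGER);
CREATE TABLE sequencing (id INTEGER PRIMARY KEY, platform TEXT, seq_type TEXT);
CREATE TABLE observation (
    mutation_id   INTEGER REFERENCES mutation(id),
    sample_id     INTEGER REFERENCES sample(id),
    sequencing_id INTEGER REFERENCES sequencing(id),
    quality       INTEGER,
    validated     INTEGER
);
""")
# The single example record from the table above:
con.execute("INSERT INTO sample VALUES (345, NULL, 10)")
con.execute("INSERT INTO mutation VALUES (1, 'T', 'synonymous', 1)")
con.execute("INSERT INTO sequencing VALUES (1, 'Illumina Hiseq 2500', 'RNAseq')")
con.execute("INSERT INTO observation VALUES (1, 345, 1, 26, 0)")

# e.g. find every sample carrying mutation 1:
rows = con.execute("""
    SELECT s.id FROM observation o JOIN sample s ON s.id = o.sample_id
    WHERE o.mutation_id = 1
""").fetchall()
print(rows)  # [(345,)]
```

The `observation` table is the one that grows by tens of millions of rows per experiment, which is why insert throughput dominates the storage-engine choice discussed below.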
The applications are multiple, but here are some which come to my mind:
• extract subset of data to use with our tools
• doing stats, counts
• find specific data
• annotate our data with public databases
Q5. Why did you decide to deploy TokuDB database storage engine to optimize the lab’s medical and population genomics research database?
Thibault de Malliard: We knew that the data could not be managed with MySQL and MyISAM. One big issue is the insert rate, and TokuDB offered a solution up to 50 times faster. Furthermore, TokuDB allows us to manipulate the structure of the database without blocking access to it. As a research team, we always have new information to add, which means column additions.
Q6. Did you look/consider other vendor alternatives? If yes, which ones?
Thibault de Malliard: None. This is much too time consuming.
Q7. What are you specifically using TokuDB for?
Thibault de Malliard: We only store genetic data, along with information related to it.
Q8. How many databases do you use? What are the data requirements?
Thibault de Malliard: I had planned to use three databases:
1. Database for RNA/DNA sequencing and from DNA genotyping (described before);
2. Database for data from well-known reference databases (dbSNP, 1000 Genomes);
3. A last one to store analyzed data from database 1 and 2.
The data stored is mainly the nucleotide (a character: A, C, G, T) with integer information like quality and position, and Boolean flags. I avoid using any strings to keep the tables as small as possible.
Q9. Especially, what are the requirements for data ingest of records and retrieve of data?
Thibault de Malliard: As a research team, we do not have high requirements like real-time insertion from logs. But I would say that, at most, an import should take less than a night. The update of database 1 is critical with the addition of a new sequencing or genotyping experiment: a batch of 50M records (it can be more than 3 times that!) has to be inserted. This has been happening monthly, but it should increase this year.
We have a huge amount of data, and we need to get query results as fast as possible. We had been used to one or two days (a weekend) of query time; getting an answer in 10 seconds is much preferable!
Q10. Could you give some examples of the typical research requests that need data ingestion and retrieval?
Thibault de Malliard: We have a table with all the SNPs for 1000 samples. This is currently a 100GB table.
A typical query could be to get the sample that has a mutation the 999 others do not. We also have some samples that are families: a child with its parents. We want to find the SNPs present in this child but not present in the other family members.
We may want to find mutations common to one group of samples given gender, disease state, or ethnicity.
Q11. What kind of scalability problems did you have?
Thibault de Malliard: The problem is managing this huge amount of data. The number of connections is very low; most of the time there is only one user. So I had to choose the data types carefully, as well as the relationships between the tables. Lately, I ran into a very slow join with a range, so I decided to split the position-based tables by chromosome. Now there are 26 tables and some procedures to launch queries across the chromosomes. The time saved is hard to quantify.
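The manual partitioning described, one table per chromosome plus procedures that fan a query out across them, might be sketched as follows (the table-naming scheme, the 26th partition’s identity, and the query stub are assumptions):

```python
# Human chromosomes 1-22 plus X, Y, MT, and an assumed catch-all
# partition, giving the 26 tables mentioned in the answer.
chromosomes = [str(i) for i in range(1, 23)] + ["X", "Y", "MT", "other"]

def query_all(run_query):
    # Apply the same per-table query to every chromosome partition
    # and concatenate the results, as the stored procedures would.
    results = []
    for c in chromosomes:
        results.extend(run_query(f"snp_chr{c}"))
    return results

# A stub standing in for a real per-table SQL query:
fake = {"snp_chr1": [("chr1", 12345)], "snp_chr2": [("chr2", 500)]}
rows = query_all(lambda table: fake.get(table, []))
print(len(rows))  # 2
```

Splitting by chromosome keeps each range join within one much smaller table, which is what made the slow join tractable.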
Q12. Do you have any benchmarking measures to sustain the claim that Tokutek’s TokuDB has improved scalability of your system?
Thibault de Malliard: I populated the database with two billion records in the main table and then ran queries. While I did not see improvements with my particular query workload, I did see significant insertion performance gains. When I tried to add an extra 1M records (LOAD DATA INFILE), it took 51 minutes for MyISAM to load the data, but less than one minute with TokuDB. Extrapolating to an RNA sequencing experiment: it should take 2.5 days for MyISAM but one hour for TokuDB.
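For reference, the quoted extrapolation is roughly consistent with the 1M-record measurement if the sequencing batch holds on the order of 70M records (an assumed figure; the exact batch size is not stated):

```python
# Extrapolating the 1M-record load measurement to a full RNA
# sequencing batch. Per-million timings are from the answer above;
# the batch size is an assumption.
myisam_min_per_1m = 51   # minutes to load 1M records with MyISAM
tokudb_min_per_1m = 1    # minutes with TokuDB (quoted as "less than one")

batch_records_m = 70     # assumed batch size, in millions of records

myisam_days = myisam_min_per_1m * batch_records_m / 60 / 24
tokudb_hours = tokudb_min_per_1m * batch_records_m / 60

print(round(myisam_days, 1), round(tokudb_hours, 1))  # 2.5 1.2
```

At that scale the gap is about a 50x speedup, matching the "up to 50 times faster" insert rate cited earlier in the interview.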
Q13. What are the lessons learned so far in using TokuDB database storage engine in your application domain?
Thibault de Malliard: We are still developing it and adding data. But inserting data into the two main tables (0.9G records, 2.3G records) was done in fairly good time, less than one day. Adding columns to fulfill the needs of the team is also very easy: it takes one second to create a column. Populating it is another story, but the table remains accessible during the process.
Another great feature, one I use with every query, is being able to follow the state of a query.
In the process list you can follow the number of rows queried so far, so if you have a good estimate of the number of records expected, you know roughly how long the query will take. I cannot count the number of processes I killed because the expected query time was not acceptable.
Qx. Anything you wish to add?
Thibault de Malliard: The sequencing/genotyping technologies evolve very fast. Evolving means more data from the machines. I expect our data to grow at least three times each year. We are glad to have TokuDB in place to handle the challenge.
Since 2010, Thibault de Malliard has worked in the University of Montreal’s Philip Awadalla Laboratory, where he provides bioinformatics support to the lab crew and develops bioinformatics solutions for next-generation genomic sequencing. Previously, he worked for the French National Institute for Agricultural Research (INRA) with the MIG laboratory (Mathematics, Informatics and Genomics) where, as part of the European Nanomubiop project, he was tasked with developing software to produce probes for an HPV chip. He holds a master’s degree in bioinformatics (France).
- Big Data for Good. by Roger Barga, Laura Haas, Alon Halevy, Paul Miller, Roberto V. Zicari. June 5, 2012:
A distinguished panel of experts discuss how Big Data can be used to create Social Capital.
Follow ODBMS.org on Twitter: @odbmsorg
“We were an early user of Hadoop and found ourselves pushing its scalability bounds and of necessity innovating our own solutions. For example we wrote our own API on top of it, rebuilt its sorter, and developed an alternative file system to achieve better performance and cost-effectiveness” — Jim Kelly and Sriram Rao.
Last October, Quantcast released the Quantcast File System (QFS) as open source. I have interviewed Jim Kelly, head of R&D at Quantcast, and Sriram Rao, Principal Scientist in Microsoft’s Cloud and Information Services Lab (CISL). Both worked closely on the development of the QFS platform at Quantcast.
Q1. What is the main business of Quantcast?
Jim Kelly: Quantcast measures digital audiences for millions of web destinations and helps advertisers reach their audiences more effectively. We observe billions of anonymous digital media consumption events every month and use large scale machine learning techniques to characterize audiences and to identify and reach relevant ones for advertisers.
We’re not in the distributed file system business; we’re in the same category as Google and Yahoo: an advertising company doing pioneering work in big data and sharing it with the rest of the community.
Q2. What are the main big data processing challenges you have experienced so far?
Jim Kelly: We’ve been doing big data processing since the launch of our measurement product in 2006, which put us in the big data space before it had become a space. We were an early user of Hadoop and found ourselves pushing its scalability bounds and of necessity innovating our own solutions. For example we wrote our own API on top of it, rebuilt its sorter, and developed an alternative file system to achieve better performance and cost-effectiveness. Our business and data volume have grown steadily since then. We currently receive over 50 TB of data per day, continually challenging us to operate at a scale beyond most organizations’ experience and most technologies’ comfort zone.
Sriram: In terms of big data, Hadoop is targeted to the sweet spot where the volume of data being processed in a computation is roughly the size of the amount of RAM in the cluster. At Quantcast, we found that we were frequently running jobs where the amount of data being processed per day (even with compression) was substantial. For instance, when I started at Quantcast in 2008, we were doing 100TB per day. Scaling the data processing volume required us to rethink components of Hadoop and build novel solutions. As we developed novel solutions and deployed them in our cluster, we were able to increase data processing volumes to as much as 500TB per day over a 2-year period. This increase occurred purely due to software improvements on the same hardware.
Q3. What lessons did you learn so far in using Hadoop for Big Data Analytics?
Jim Kelly: Hadoop has been a fantastic boon for us and the rest of the community. By providing a framework for developers (and increasingly, non-developers) to do parallel computing without a lot of systems knowledge, it has democratized an arcane and difficult problem. It has been a catalyst for the emergence of the big data community: a whole ecosystem of people, companies and technologies focused on different aspects of the problem. We’ve been a big beneficiary, both in terms of technology we’ve been able to use directly, and team members we’ve been able to attract because they want to do cutting-edge work in the dynamic space Hadoop has created. The lesson, I think, is how much value a technology can provide above and beyond what it does when you run it.
Sriram: The Apache Hadoop distribution is the same one that powers Yahoo!’s large clusters. Having open source software tested at such scale has provided a great starting point for companies looking to process big data. While the data processing needs at Quantcast are extreme (tens of PB per day), the main lesson, I think, is that for the vast majority of companies interested in data analytics, Hadoop has become a stable platform which works “out-of-the-box”.
Q4. What are the scalability limits of current implementation of Hadoop?
Jim Kelly: It’s difficult to generalize, because there are so many different environments using Hadoop in different ways. What becomes an issue for one set of use cases on one set of hardware is not necessarily relevant for another, and I think it’s safe to say that the great majority of environments working with Hadoop needn’t worry about its scalability.
In our environment we’re processing a steady diet of many-terabyte jobs on many hundreds of servers, and we’ve invested a great deal in making the “sort” phase of MapReduce more efficient. In between the “map” and the “reduce,” Hadoop has to sort and group the mapper output and distribute it to the appropriate reducers. In general every reducer will have to read some of every mapper’s output, and with M mappers and N reducers that causes MxN disk seeks. When your jobs are large enough to require M and N in the thousands, this becomes a big performance bottleneck. We wrote a blog post describing the problem in more detail.
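The MxN seek count grows quickly at the scale described. A back-of-the-envelope sketch (the 10 ms per-seek figure is an assumed typical value for spinning disks, not a number from the interview):

```python
# With M mappers and N reducers, every reducer reads a slice of every
# mapper's output, so the naive shuffle costs on the order of M*N seeks.

def shuffle_seeks(mappers, reducers):
    return mappers * reducers

SEEK_MS = 10  # assumed typical spinning-disk seek time, for illustration

small_job = shuffle_seeks(10, 10)        # 100 seeks: negligible
large_job = shuffle_seeks(3000, 3000)    # 9,000,000 seeks

# Pure seek time for the large job, ignoring all transfer time:
seek_hours = large_job * SEEK_MS / 1000 / 3600   # 25 hours
```

Even before any data is transferred, a job with thousands of mappers and reducers can spend the equivalent of a full day of disk time just repositioning heads, which is why the shuffle becomes the bottleneck.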
In our own environment we’ve achieved more efficiency at this scale by developing the Quantcast File System (QFS) and Quantsort. Sriram joined us in 2008 and led the effort to adapt the Kosmos File System (KFS) to the job and to build a new sort architecture on top of it. Using the two systems, we’ve sorted data sets of nearly a petabyte, and we regularly process over 20 PB per day.
QFS also improved where Hadoop was scaling too readily for us: cost. By using a more efficient data encoding, it lets us do more data processing using only half the hardware we would need with Hadoop’s HDFS, and therefore half the power and half the cooling and half the maintenance.
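The "half the hardware" figure follows directly from the storage overheads. A sketch of the arithmetic, assuming HDFS-style triple replication versus a Reed-Solomon 6+3 layout (the 6 data / 3 parity stripe counts are an assumption based on common RS configurations, not confirmed in the interview):

```python
# Replication stores k full copies of every byte; Reed-Solomon stores
# data stripes plus parity stripes, tolerating the loss of any
# `parity_stripes` stripes while paying far less overhead.

def storage_overhead(data_stripes, parity_stripes):
    return (data_stripes + parity_stripes) / data_stripes

replication = 3.0                  # 3x: three full copies per byte
rs_6_3 = storage_overhead(6, 3)    # 1.5x: 6 data + 3 parity stripes

ratio = replication / rs_6_3       # 2.0: half the disks for the same data
```

Halving raw storage also halves the power, cooling, and maintenance that scale with it, which is the cost argument made above.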
Sriram: The fundamental issue with Hadoop is how the “shuffle” phase is handled as the data volume scales.
In a Map-Reduce computation, the output generated by map tasks during the “map” phase is partitioned and a reducer obtains its input partition from each of the map tasks. In particular, for large jobs, the size of the map output may exceed the amount of RAM in the cluster. When that happens, data has to be written to disk and then read back and transported over the network.
The “shuffle” is an “external” distributed merge-sort and is known to be seek intensive. Hadoop uses traditional mechanisms for implementing the shuffle and this is known to have scalability issues when the volume of data to be shuffled exceeds the amount of RAM in the cluster.
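The k-way merge at the heart of that external sort can be sketched with Python's standard library; the in-memory lists here stand in for sorted on-disk spill files:

```python
# Each mapper spill is a sorted run; a reducer streams one merged,
# key-ordered sequence out of many runs without holding them all in RAM.
import heapq

sorted_runs = [
    [("a", 1), ("c", 3)],      # stand-ins for sorted on-disk spill files
    [("b", 2), ("c", 30)],
    [("a", 10), ("d", 4)],
]

merged = list(heapq.merge(*sorted_runs))   # streaming k-way merge
keys = [k for k, _ in merged]              # arrives in key order
```

In the real system each run is a file, and it is the per-run disk reads, one seek per run per reducer, that make this merge seek-intensive once the data no longer fits in memory.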
Q5. Sriram what was your contribution to the QFS project?
Sriram: QFS is an evolution of the KFS system that I started at Kosmix. The bulk of the KFS code base is in QFS. At Quantcast, we continued to evolve KFS by improving performance, adding features, etc. One key feature we added at Quantcast was support for multi-writer append (i.e., atomic append). It turns out that this feature can be leveraged to build high-performance sort architectures, which are fundamental to doing Map-Reduce computations at scale. In general, I worked on all aspects of the KFS system.
Q6. How does the Kosmos File System (KFS) relate to the QFS?
Jim Kelly: QFS evolved from KFS and has a couple years’ improvements on it. The most significant is Reed-Solomon erasure coding, which doubles effective storage and improves performance.
Sriram: The QFS project has its roots in the KFS system that I built. The bulk of the KFS code base is in QFS.
I started KFS originally at Kosmix Corp (now, @WalmartLabs) in 2006. At that point, Kosmix was trying to build a search engine and KFS was intended to be the storage substrate. The KFS architecture is based on the Google filesystem paper. The main idea in KFS is that blocks of a file are striped across nodes and replicated for fault-tolerance. We eventually released KFS as open source in 2007, after which I joined Quantcast. At Quantcast, we continued to evolve KFS by improving performance, by adding features, etc. KFS 0.5 is a snapshot of the code base that was run in production cluster at Quantcast in 2010. So, if you will, think of KFS as QFS 0.5. The significant difference between the two systems is erasure coding.
Q7. Sriram you also worked on another open source project Sailfish. Could you tell us a bit more about it?
Sriram: Data analytics computations are run on machines in a datacenter. The frameworks for doing these computations were designed in 2004-2006 timeframe when bandwidth within the datacenter was a scarce resource. However, that design point is changing.
Over the next few years, technological trends suggest that bandwidth within a datacenter will substantially increase. Inter-node connectivity is expected to go up from 1Gbps between any pair of nodes to 10Gbps (and possibly higher). Given such plentiful bandwidth, how would we build analytical engines (such as, Map-Reduce computation engines)? Sailfish is an attempt at re-designing how the shuffle data is transported in Map-Reduce frameworks by taking advantage of better connectivity.
Sailfish design is based on two key ideas:
1. Use network-wide data aggregation to improve disk subsystem performance.
2. Gather statistics on the aggregated data to plan for reduce phase of execution.
We instantiated these two ideas in the context of Hadoop-0.20.2, where we re-worked the entire shuffle pipeline and how the reduce phase of execution is done. Sailfish leverages the concurrent append capabilities of KFS to move data from mappers to reducers. When a Hadoop Map-Reduce job is run with Sailfish, (1) the user does not tune sort parameters and (2) the number of reducers in a job and their task assignment are determined at run-time in a data-dependent manner. Essentially, Sailfish simplifies running large-scale Hadoop-based Map-Reduce computations. We have also added support for preemption in Sailfish, where the preemption is implemented via a checkpoint/restart mechanism. This allows us to seamlessly handle skew.
The Sailfish website has links to the papers we have published and the software we have released.
Q8. Who is currently using Sailfish?
Sriram: Sailfish is a research prototype which has been released to open-source so that other researchers/users can benefit from our work. We are currently working on migrating many of the ideas from Sailfish to Hadoop version 2.0 (aka YARN). We have filed some JIRAs and intend to contribute the code to Hadoop.
Q9. Quantcast released last October the Quantcast File System (QFS) to open source, (link here). Why?
Jim Kelly: We’ve always relied heavily on open source software internally, and it’s good to be able to give back. It’s a chance for developers to have more of an impact than they could working on a proprietary system. And we believe file systems particularly, as foundational infrastructure components, benefit from the open source model where they can get a lot of scrutiny and be challenged to prove themselves in many different environments.
Q10. Is QFS supposed to be an alternative to Apache Hadoop’s HDFS, or just a complementary solution?
Jim Kelly: Hadoop has a wide variety of customers with diverse priorities. Some are just getting started and put a lot of value on ease of use. Others care most about performance at small scale, or high availability. HDFS offers broad functionality to meet needs for all of them.
Quantcast’s priorities are cost efficiency and high performance at large scale, so in developing QFS we went deeper around that use case. QFS has significantly improved the performance and economics of our data processing and can do the same for other organizations running their own clusters at large scale. By contrast organizations just getting started or needing specific HDFS features will probably find HDFS is a better fit.
Q11. What are the key technical innovations of QFS?
Jim Kelly: Architecturally QFS places a couple of different bets than HDFS. It’s implemented in C++ rather than Java, which allows better control and optimization in a number of areas. It also leans more heavily on modern networks, which have become much faster since HDFS launched and allow different optimization choices.
Those differences in approach make possible QFS’s Reed-Solomon erasure coding, which doubles storage efficiency and improves performance. The concurrent append feature allows many processes to write simultaneously to a single file, enabling applications like Quantsort. QFS’s use of direct I/O also improves both performance and manageability, by ensuring processes live within a predictable memory footprint.
Sriram: One of KFS’s key innovations that we did at Quantcast was support for “atomic append”.
This is the ability to allow multiple writers to append data to a file. Basically, think of the file as a “bag” into which data is dropped and there are no ordering guarantees on the order in which data is appended to a file. It turned out that this feature can be leveraged to build high-performance sort engines.
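The "bag" contract can be sketched with threads standing in for distributed writers (this illustrates the semantics only, not the KFS implementation):

```python
# Atomic append: each writer's record lands intact and exactly once,
# but the order across writers is unspecified -- a "bag", not a log.
import threading

bag, lock = [], threading.Lock()

def writer(wid, n):
    for i in range(n):
        with lock:                 # each append is atomic...
            bag.append((wid, i))   # ...but interleaving across writers is not ordered

threads = [threading.Thread(target=writer, args=(w, 100)) for w in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every record is present exactly once, regardless of arrival order.
complete = sorted(bag) == [(w, i) for w in range(4) for i in range(100)]
```

A sort engine exploiting this never has to coordinate writers up front: all partitions for a key range can be dumped into one file concurrently and ordered later.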
Q12. What kind of benchmark results do you have comparing QFS with Hadoop’s HDFS? What benchmark did you use for that?
Jim Kelly: Both QFS and HDFS rely on a central server to manage the file system, and its performance is critical as every cluster operation is dispatched through it. We’ve tested the two head-to-head for various operations and found QFS’s leaner C++ implementation pays off. Directory listings were 22% faster, and directory creation was nearly three times as fast.
These were our tests, and we encourage other people to run their own. We’ve put scripts and instructions to help on github.
Our whole-cluster tests will be harder for others to replicate at the same scale we did on our production cluster. We attempted to get some sense of how the file system affects real-world job run time. We measured end-to-end run time for simple Hadoop jobs that wrote or read 20TB of user data on an otherwise idle cluster, using QFS or HDFS as the underlying file system. The average throughput of the write job was about 75% higher using QFS, due to having to write less physical data. The read job throughput was 47% higher, primarily due to better parallelism. More information about these tests is available on the QFS wiki.
Q13. How reliable is QFS? What is your experience in using QFS in the Quantcast environment?
Jim Kelly: We started running KFS internally in 2008 on secondary storage and over years evolved it into today’s QFS, in the process adding features to make it more manageable at large scale and hardening it for production use. By 2011 it was running reliably enough that we were comfortable going all in. We copied all our data over from HDFS and have been running QFS exclusively since then for all our storage and MapReduce work. QFS has handled over six exabytes of I/O since then, so we’re confident it’s ready for other organizations’ production workloads.
Q14. You mentioned Hadoop’s architectural limitations for large jobs (terabytes and beyond). The more tasks a job uses, the less efficient its disk I/O becomes. To improve that, Quantcast developed an improved Hadoop Sort, called Quantsort. What is special about Quantsort?
Jim Kelly: Quantsort avoids the MxN combinatorics Hadoop faces shuffling data between mappers and reducers. It leverages QFS’s concurrent append feature to shuffle data through a fixed number of intermediate files. Reducers read their data in larger chunks, keeping disk I/O efficient. We wrote a blog post describing Quantsort more fully.
Q15. How is the performance of Quantsort?
Jim Kelly: Fast. Sriram did Quantcast’s first large sort back in 2009, of 140 TB in 6.2 hours on much less hardware than we have today. Quantsort’s record on a production job is 971 TB in 6.4 hours.
Sriram: For large jobs (think tens to hundreds of TB of data and beyond) where the map output exceeds the size of RAM, Quantsort performance can be up to an order of magnitude faster than Hadoop’s shuffle. Quantsort enables you to process more data quickly using the same hardware. Since jobs finish faster, it has the multiplier effect of allowing users to run more jobs on the cluster.
Q17. Is Quantsort part of QFS and therefore, also open source?
Jim Kelly: No, Quantsort is a separate code base.
Q18. In 2011, Quantcast migrated all production data to QFS and turned off HDFS. How did you manage to migrate the data stored in HDFS to QFS?
Jim Kelly: The data migration was very straightforward, a matter of running a trivial MapReduce job that copies data from one place to another. We avoided any outage for our production jobs by phasing them over, making them read from either file system but write to QFS during the transition.
Q19. How easy is to plug-in and use QFS into an existing Hadoop cluster which has already data stored in HDFS?
Jim Kelly: QFS is plug-in compatible with Hadoop, using KFS bindings that are already included in standard Hadoop distributions, so it’s very easy to integrate. The data storage format is different, requiring either migrating data to QFS or phasing it in gradually. The QFS wiki has a migration guide.
Q20. KFS has also been released as an open source project. How does it relate to the Quantcast File System (QFS)?
Jim Kelly: Quantcast adopted KFS in 2008, became the main sponsor of contributions to the KFS project for the next two years and subsequently evolved it into QFS. You’ll find the concurrent append feature in the KFS repository, for example, but Reed-Solomon encoding is only in QFS.
Qx Anything you wish to add?
Jim Kelly: An invitation to other organizations doing distributed batch processing and interested in doubling their cluster capacity to give QFS a try. We’re happy to field questions or receive feedback on the QFS mailing list.
Jim Kelly leads Quantcast’s R&D team, which works on both adding computing capacity through cluster software innovations and using it up through new analytic and modeling products. Having been at Quantcast for six years, he has seen its data volumes and processing challenges grow from zero to petabytes and led technical and organizational changes that have kept Quantcast a step ahead. Previously he held engineering leadership roles at Oracle, Kana, and Scopus Technology (acquired by Siebel). Jim holds a PhD in physics from Princeton University.
Sriram Rao is currently a Principal Scientist in Microsoft’s Cloud and Information Services Lab (CISL), and formerly led Quantcast’s Cluster Team. Sriram was the creator of the Kosmos File System (KFS), which would become the underlying architecture for the Quantcast File System (QFS) that was released to open source last year. Sriram is also the creator of Sailfish, another open source project geared towards processing massive amounts of data. Previously he worked at Quantcast, Yahoo, and Kosmix (now @WalmartLabs). Sriram holds a PhD in Computer Sciences from UT-Austin.
-ODBMS.org: Big Data and Analytical Data Platforms – Free Software
-ODBMS.org: Big Data and Analytical Data Platforms – Articles
“So the synergies in data management come not from how the systems connect but how the data is used to derive business value” –Steve Shine.
On Dec. 21, 2012, Actian Corp. announced the completion of the transaction to buy Versant Corporation. I have interviewed Steve Shine, CEO and President, Actian Corporation.
Q1. Why acquire an object-oriented database company such as Versant?
Steve Shine: Versant Corporation, like us, has a long pedigree in solving complex data management in some of the world’s largest organisations. We see many synergies in bringing the two companies together. The most important of these is together we are able to invest more resources in helping our customers extract even more value from their data. Our direct clients will have a larger product portfolio to choose from, our partners will be able to expand in adjacent solution segments, and strategically we arm ourselves with the skills and technology to fulfil our plans to deliver innovative solutions in the emerging Big Data Market.
Q2. For the enterprise market, Actian offers its legacy Ingres relational database. Versant on the other hand offers an object oriented database, especially suited for complex science/engineering applications. How does this fit? Do you have a strategy on how to offer a set of support processes and related tools for the enterprise? if yes, how?
Steve Shine: While the two databases may not have a direct logical connection at client installations, we recognise that most clients use these two products as part of larger, more holistic solutions to support their operations. The data they manage is the same and interacts to solve business issues – for example, object stores to manage the relationships between entities; transactional systems to manage clients and the supply chain; and analytic systems to monitor and tune operational performance – different systems using the same underlying data to drive a complex business.
We plan to announce a vision of an integrated platform designed to help our clients manage all their data and its complex interactions, both internal and external, so they can not only focus on running their business, but better exploit the incremental opportunity promised by Big Data.
Q3. Bernhard Woebker, president and chief executive officer of Versant stated, “the combination of Actian and Versant provides numerous synergies for data management”. Could you give us some specific examples of such synergies for data management?
Steve Shine: Here is a specific example, from the Telco space, of what I mean by helping clients extract more value from data. These types of incremental opportunities exist in every vertical we have looked at.
An OSS system in a Telco today may use an Object store to manage the complex relationships between the data, the same data is used in a relational store to monitor, control and manage the telephone network.
Another relational store using variants of the same data manages the provisioning, billing and support for the users of the network. The whole data set in Analytical stores is used to monitor and optimise performance and usage of the network.
Fast forwarding to today, the same data used in more sophisticated ways has allowed voice and data networks to converge to provide a seamless interface to mobile users. As a result, Telcos have tremendous incremental revenue opportunities, BUT only if they can exploit the data they already have in their networks. For example: the data on their networks has allowed for a huge increase in location-based services; knowledge and analysis of the data content has allowed providers to push targeted advertising and other revenue-earning services at their users; and turning the phone into a common billing device captures an even greater share of the service providers’ revenue… You get the picture.
Now imagine other corporations being able to exploit their information in similar ways: Would a retailer benefit from knowing the preferences of who’s in their stores? Would a hedge fund benefit from detecting a sentiment shift for a stock as it happens? Even knowledge of simple events can help organisations become more efficient.
A salesman knowing immediately a key client raises a support ticket; A product manager knowing what’s being asked on discussion forums; A marketing manager knowing a perfect prospect is on the website.
So the synergies in data management come not from how the systems connect but from how the data is used to derive business value. We want to help manage all the data in our customers’ organisations and help them drive incremental value from it. That is what we mean by numerous synergies from data management, and we have a vision to deliver it to our customers.
Q4. Actian claims to have more than 10,000 customers worldwide. What is the value proposition of Versant’s acquisition for existing Actian customers?
Steve Shine: I have covered this in the answers above. They get access to a larger portfolio of products and services and we together drive a vision to help them extract greater value from their data.
Q5. Versant claims to have more than 150,000 installations worldwide. How do you intend to support them?
Steve Shine: Actian already runs a 24/7 global support organisation that prides itself on delivering one of the industry’s best client satisfaction scores. As far as numbers are concerned, Versant’s large user count is in essence driven by only 250 or so very sophisticated large installations, whereas Actian already deals with over 10,000 discrete mission-critical installations worldwide. So we are confident of maintaining our very high support levels, and the Versant support infrastructure is being integrated into Actian’s as we speak.
Q6. Actian is active in the market for big data analytics. How does Versant’s database technology fit into Actian’s big data analytics offerings and capabilities?
Steve Shine: Using the example above imagine using OSS data to analyse network utilisation, CDR’s and billing information to identify pay plans for your most profitable clients.
Now give these clients the ability to take business action on real-time changes in their data. Now imagine being able to do that with an integrated product set from one vendor. We will be announcing the vision behind this strategy this quarter. In addition, the Versant technology gives us additional options for big data solutions, for example visualisation and managing metadata.
Q7. Do you intend to combine or integrate your analytics database Vectorwise with Versant’s database technology (such as Versant JPA)? If yes, how?
Steve Shine: Specific plans for integrating products within the overall architecture have not been formulated. We have a strong philosophy that you should use the best tool for the job, e.g., an OODB for some things, an OLTP RDBMS for others, etc. But the real value comes from being able to perform sophisticated analysis and management across the different data stores. That is part of the work our platform integration efforts are focused on.
Q8. What are the plans for future software developments. Will you have a joint development team or else?
Steve Shine: We will be merging the engineering teams to focus on providing innovative solutions for big Data under single leadership.
Q9. You have recently announced two partnerships for Vectorwise, with Inferenda and BiBoard. Will you also pursue this indirect channel path also for Versant’s database technology?
Steve Shine: The beauty of the vision we speak of is that our joint partners have a real opportunity to expand their solutions using Actian’s broader product set, and, for those that are innovative, the opportunity of new emerging markets.
Q10. Versant recently developed Versant JPA. Is the Java market important for Actian?
Steve Shine: Yes !
Q11. It is currently a crowded database market: several new database vendors (NoSQL and NewSQL) offering innovative database technology (NuoDB, VoltDB, MongoDB, Cassandra, Couchbase, Riak to name a few), and large companies such as IBM and Oracle, are all chasing the big data market. What is your plan to stand out of the crowd?
Steve Shine: We are very excited about the upcoming announcement of our plans for the Big Data market. We will be happy to brief you on the details closer to the time, but I will say that early feedback from analyst houses like Gartner has confirmed that our solution is very effective and differentiated in helping corporations extract business value from Big Data. On a higher scale, many of the start-ups are going to get a very rude awakening when they find that delivering a database for mission-critical use is much more than speed and scale of technology. Enterprises want world-class 24×7 support service with failsafe resilience and security. Real industry-grade databases take years and many millions of dollars to reach scalable maturity. Most of the start-ups will not make it. Actian is uniquely positioned in being profitable and having delivered industry-grade database innovation, while also being singularly focused on data management, unlike the broad, cumbersome and expensive bigger players. We believe value-conscious enterprises will see our maturity and agility as a great strength.
Qx Anything else you wish to add?
Steve Shine: DATA! – What a great thing to be involved in! Endless value, endless opportunities for innovation and no end in sight as far as growth is concerned. I look forward to the next 5 years.
Steve Shine, CEO and President, Actian Corporation.
Steve comes to Actian from Sybase where he was senior vice president and general manager for EMEA, overseeing all operational, sales, financial and human resources in the region for the past three years. While at Sybase, he achieved more than 200 million in revenue and managed 500 employees, charting over 50 percent growth in the Business Intelligence market for Sybase. Prior to Sybase, Steve was at Canadian-based Geac Computer Corporation for ten successful years, helping to successfully turn around two major global divisions for the ERP firm.
“The data you’re likely to need for any real-world predictive model today is unlikely to be sitting in any one data management system. A data scientist will often combine transactional data from a NoSQL system, demographic data from a RDBMS, unstructured data from Hadoop, and social data from a streaming API” –David Smith.
On the subject of Big Data Analytics I have interviewed David Smith, Vice President of Marketing and Community at Revolution Analytics.
Q1. How would you define the job of a data scientist?
David Smith: A data scientist is someone charged with analyzing and communicating insight from data.
It’s someone with a combination of skills: computer science, to be able to access and manipulate the data; statistical modeling, to be able to make predictions from the data; and domain expertise, to be able to understand and answer the question being asked.
Q2. What are the main technical challenges for Big Data predictive analytics?
David Smith: For a skilled data scientist, the main challenge is time. Big Data takes a long time just to move (so don’t do that, if you don’t have to!), not to mention the time required to apply complex statistical algorithms. That’s why it’s important to have software that can make use of modern data architectures to fit predictive models to Big Data in the shortest time possible. The more iterations a data scientist can make to improve the model, the more robust and accurate it will be.
Q3. R is an open source programming language for statistical analysis. Is R useful for Big Data as well? Can you analyze petabytes of data with R, and at the same time ensure scalability and performance?
David Smith: Petabytes? That’s a heck of a lot of data: even Facebook has “only” 70 PB of data in total. The important thing to remember is that “Big Data” means different things in different contexts: while raw data in Hadoop may be measured in petabytes, by the time a data scientist selects, filters and processes it, you’re more likely to be in the terabyte or even gigabyte range when the data’s ready to be applied to predictive models.
Open source R, with its in-memory, single-threaded engine, will still struggle even at this scale, though. That’s why Revolution Analytics added scalable, parallelized algorithms to R, making predictive modeling on terabytes of data possible. With Revolution R Enterprise, you can use SMP servers or MPP grids to fit powerful predictive models to hundreds of millions of rows of data in just minutes.
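Revolution's parallelized algorithms are proprietary, but the general external-memory idea, streaming over chunks and accumulating sufficient statistics instead of materializing all rows, can be sketched in a few lines (illustrative Python, not their implementation):

```python
# Fit a simple linear regression by streaming over chunks of (x, y)
# rows, accumulating sufficient statistics; memory use is independent
# of the total number of rows, so the data can be arbitrarily large.

def fit_streaming(chunks):
    n = sx = sy = sxx = sxy = 0.0
    for chunk in chunks:                 # each chunk fits in RAM
        for x, y in chunk:
            n += 1
            sx += x; sy += y
            sxx += x * x; sxy += x * y
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# y = 2x + 1, split across three "chunks" as terabyte-scale data would be.
data = [[(0, 1), (1, 3)], [(2, 5), (3, 7)], [(4, 9)]]
slope, intercept = fit_streaming(data)   # recovers slope 2, intercept 1
```

The same pattern parallelizes naturally: each node accumulates its own partial sums, and combining them is just addition, which is what makes SMP and MPP execution straightforward for this class of algorithm.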
Q4. Could you give us some information on how Google, and Bank of America use R for their statistical analysis?
David Smith: Google has more than 500 R users, where R is used to study the effectiveness of ads, for forecasting, and for statistical modeling with Big Data.
In the financial sector, R is used by banks like Bank of America and Northern Trust and insurance companies like Allstate for a variety of applications, including data visualization, simulation, portfolio optimization, and time series forecasting.
Q5. How do you handle the Big Data Analytics “process” challenges with deriving insight?
- capturing data
- aligning data from different sources (e.g., resolving when two objects are the same)
- transforming the data into a form suitable for analysis
- modeling it, whether mathematically, or through some form of simulation
- understanding the output
- visualizing and sharing the results
David Smith: These steps reflect the fact that data science is an iterative process: long gone are the days where we would simply pump data through a black-box algorithm and hope for the best. Data transformation, evaluation of multiple model options, and visualizing the results are essential to creating a powerful and reliable statistical model. That’s why the R language has proven so popular: its interactive language encourages exploration, refinement and presentation, and Revolution R Enterprise provides the speed and big-data support to allow the data scientist to iterate through this process quickly.
Q6. What is the tradeoff between Accuracy and Speed that you usually need to make with Big Data?
David Smith: Real-time predictive analytics with Big Data are certainly possible. (See here for a detailed explanation.) Accuracy comes with real-time scoring of the model, which is dependent on a data scientist building the predictive model on Big Data. To maintain accuracy, that model will need to be refreshed on a regular basis using the latest data available.
Q7. In your opinion, is there a technology best suited to building an analytics platform? An RDBMS, or non-relational database technology, such as a columnar database engine? Something else?
David Smith: The data you’re likely to need for any real-world predictive model today is unlikely to be sitting in any one data management system. A data scientist will often combine transactional data from a NoSQL system, demographic data from an RDBMS, unstructured data from Hadoop, and social data from a streaming API.
That’s one of the reasons the R language is so powerful: it provides interfaces to a variety of data storage and processing systems, instead of being dependent on any one technology.
Q8. Cloud computing: What role does it play with Analytics? What are the main differences between Ground vs Cloud analytics?
David Smith: Cloud computing can be a cost-effective platform for the Big-Data computations inherent in predictive modeling: if you occasionally need a 40-node grid to fit a big predictive model, it’s convenient to be able to spin one up at will. The big catch is in the data: if your data is already in the cloud you’re golden, but if it lives in a ground-based data center it’s going to be expensive (in time *and* money) to move it to the cloud.
David Smith, Vice President, Marketing & Community, Revolution Analytics
David Smith has a long history with the R and statistics communities. After graduating with a degree in Statistics from the University of Adelaide, South Australia, he spent four years researching statistical methodology at Lancaster University in the United Kingdom, where he also developed a number of packages for the S-PLUS statistical modeling environment.
He continued his association with S-PLUS at Insightful (now TIBCO Spotfire) overseeing the product management of S-PLUS and other statistical and data mining products. David Smith is the co-author (with Bill Venables) of the popular tutorial manual, An Introduction to R, and one of the originating developers of the ESS: Emacs Speaks Statistics project.
Today, David leads marketing for Revolution R, supports R communities worldwide, and is responsible for the Revolutions blog.
Prior to joining Revolution Analytics, David served as vice president of product management at Zynchros, Inc.
Follow ODBMS.org on Twitter: @odbmsorg
“Our experience with MongoDB is that its architecture requires nodes to be declared as Master, and others as Slaves, and the configuration is complex and unintuitive. C* architecture is much simpler. Every node is a peer in a ring and replication is handled internally by Cassandra based on your desired redundancy level. There is much less manual intervention, which allows us to then easily automate many tasks when managing a C* cluster.” – Christos Kalantzis and Jason Brown.
Netflix, Inc. (NASDAQ: NFLX) is an online DVD and Blu-Ray movie retailer offering streaming movies through video game consoles, Apple TV, TiVo and more.
Last year, Netflix had a total of 29.4 million subscribers worldwide for their streaming service (Source).
I have interviewed Christos Kalantzis, Engineering Manager – Cloud Persistence Engineering, and Jason Brown, Senior Software Engineer, both at Netflix. They were involved in deploying Cassandra in a production EC2 environment at Netflix.
Q1. What are the main technical challenges that Big data analytics pose to modern Data Centers?
Kalantzis, Brown: As companies are learning how to extract value from the data they already have, they are also identifying all the value they could be getting from data they are currently not collecting.
This is creating an appetite for more data collection. This appetite for more data is pushing the boundaries of traditional RDBMS systems and forcing companies to research alternative data stores.
This new data size also requires companies to think about the extra costs involved in storing this new data (space/power/hardware/redundant copies).
Q2. How do you handle large volume of data? (terabytes to petabytes of data)?
Kalantzis, Brown: Netflix does not have its own datacenter. We store all of our data and applications on Amazon’s AWS. This allows us to focus on creating really good applications without the overhead of thinking about how we are going to architect the Datacenter to hold all of this information.
Economies of scale also allow us to negotiate a good price with Amazon.
Q3. Why did you choose Apache Cassandra (C*)?
Kalantzis, Brown: There are several reasons we selected Cassandra. First, as Netflix is growing internationally, a solid multi-datacenter story is important to us. Configurable replication and consistency, as well as resiliency in the face of failure, are absolute requirements, and we have tested those capabilities more than once in production! Other compelling qualities include being an open source, Apache project and having an active and vibrant user community.
Q4: What do you exactly mean with “a solid multi-datacenter story is important to us”? Please explain.
Kalantzis, Brown: As we expand internationally we are moving and standing up new datacenters close to our new customers. In many cases we need a copy of the full dataset of an application. It is important that our database product be able to replicate across multiple datacenters reliably, efficiently and with very little lag.
Q5. Tell us a little bit about the application powered by C*
Kalantzis, Brown: Almost everything we run in the cloud (which is almost the entire Netflix infrastructure) uses C* as a database. From customer/subscriber information, to movie metadata, to monitoring stats, it’s all hosted in Cassandra.
In most of our uses, Cassandra is the source-of-truth database. There are a few legacy datasets in other solutions, but they are actively being migrated.
Q6: What are the typical data insights you obtained by analyzing all of these data? Please give some examples. How do you technically analyze the data? And by the way, how large are your data sets?
Kalantzis, Brown: All the data Netflix gathers goes towards improving the customer experience. We analyze our data to understand viewing preferences, give great recommendations and make appropriate choices when buying new content.
Our BI team has done a great job with the Hadoop platform and has been able to extract the information we need from the terabytes of data we capture and store.
Q7. What Availability expectations do you have from customers?
Kalantzis, Brown: Our internal teams are incredibly demanding on every part of the infrastructure, and databases are certainly no exception. Thus a database solution must have low latency, high throughput, strict uptime/availability requirements, and be scalable to massive amounts of data. The solution must, of course, be able to withstand failure and not fall over.
Q8: Be scalable up to?
Kalantzis, Brown: We don’t have an upper boundary to how much data an application can store. That being said, we also expect the application designers to be intelligent about how much data they need readily available in their OLTP system. Our applications store anywhere from 10 GB to 100 terabytes [Edit corrected a typo] of data in their respective C* clusters. C* architecture is such that the cluster’s capacity grows linearly with every node added to the cluster. So in theory we can scale “infinitely”.
Q9. What other methods did you consider for continuous availability?
Kalantzis, Brown: We considered and experimented with MongoDB, yet the operational overhead and complexity made it unmanageable so we quickly backed away from it. One team even built a sharded RDBMS cluster, with every node in the cluster being replicated twice. This solution is also very complex to manage. We are currently working to migrate to C* for that application.
Q10. Could you please explain in a bit more detail, what kind of complexity made it unmanageable using MongoDB for you?
Kalantzis, Brown: Netflix strives to choose architectures that are simple and require very little in the way of manual intervention to manage and scale. Our experience with MongoDB is that its architecture requires nodes to be declared as Master, and others as Slaves, and the configuration is complex and unintuitive. C* architecture is much simpler. Every node is a peer in a ring and replication is handled internally by Cassandra based on your desired redundancy level. There is much less manual intervention, which allows us to then easily automate many tasks when managing a C* cluster.
Q11. How many data centers are you replicating among?
Kalantzis, Brown: For most workloads, we use two Amazon EC2 regions. For very specific workloads, up to four EC2 regions.
To slice that further, each region has two to three availability zones (you can think of an availability zone as the closest equivalent to a traditional data center). We shard and replicate the data within a region across its availability zones, and replicate between regions.
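The peer-ring placement behind this sharding can be sketched with a toy consistent-hash ring (md5 stands in for Cassandra’s real partitioner, and the node names are made up): every node is a peer, and a row’s replicas are simply the next few distinct nodes clockwise from the row’s position on the ring.

```python
import hashlib
from bisect import bisect_right

def token(key: str) -> int:
    # Hash a key onto the ring (Cassandra uses Murmur3; md5 is just for illustration)
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    """Minimal token ring: a row's replicas are the next `rf` distinct
    nodes clockwise from the row's token; no node is special."""
    def __init__(self, nodes, rf=3):
        self.rf = rf
        self.ring = sorted((token(n), n) for n in nodes)

    def replicas(self, row_key: str):
        tokens = [t for t, _ in self.ring]
        i = bisect_right(tokens, token(row_key)) % len(self.ring)
        picked = []
        while len(picked) < min(self.rf, len(self.ring)):
            node = self.ring[i][1]
            if node not in picked:
                picked.append(node)
            i = (i + 1) % len(self.ring)
        return picked

ring = Ring(["node-a", "node-b", "node-c", "node-d"], rf=3)
replicas = ring.replicas("subscriber:12345")  # three peer nodes holding this row
```

Because placement is derived from the hash alone, any node can answer “who owns this row?”, which is what removes the master/slave coordination that made the MongoDB setup complex.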
Q12: When you shard data, don’t you have a possible data consistency problem when updating the data?
Kalantzis, Brown: Yes, when writing across multiple nodes there is always the issue of consistency. C* “solves” this by allowing the client (application) to choose the consistency level it desires when writing. You can choose to write to only one node, all nodes, or a quorum of nodes. Each choice offers a different level of consistency, and the application will only return from its write statement when the desired consistency is reached.
The same is true for reading: you can choose the level of consistency you desire with every statement. The timestamps of the records returned by the replicas are compared, and only the latest record is returned.
In the end, the application developer needs to understand the trade-offs of each setting (speed vs. consistency) and make the appropriate decision that best fits their use case.
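This tunable-consistency behavior can be sketched in a few lines of Python. This is a toy model, not the Cassandra wire protocol: writes are delivered only to the acked replicas (ignoring eventual propagation) so the staleness trade-off is visible, and reads deliberately poll from the opposite end of the replica list to model hitting different replicas.

```python
ONE, QUORUM, ALL = "ONE", "QUORUM", "ALL"

class Replica:
    def __init__(self):
        self.value, self.timestamp = None, -1
    def write(self, value, ts):
        if ts > self.timestamp:          # last-write-wins on timestamp
            self.value, self.timestamp = value, ts

def required(level, n):
    return {ONE: 1, QUORUM: n // 2 + 1, ALL: n}[level]

def write(replicas, value, ts, level):
    # Toy model: only the replicas that must ack receive the write
    for r in replicas[: required(level, len(replicas))]:
        r.write(value, ts)

def read(replicas, level):
    # Poll `required` replicas and return the newest value seen
    polled = replicas[-required(level, len(replicas)):]
    return max(polled, key=lambda r: r.timestamp).value

reps = [Replica() for _ in range(3)]
write(reps, "v1", ts=1, level=ALL)
write(reps, "v2", ts=2, level=ONE)     # only one replica sees v2
stale = read(reps, ONE)                # may return the stale "v1"
write(reps, "v3", ts=3, level=QUORUM)
fresh = read(reps, QUORUM)             # quorum read overlaps quorum write: "v3"
```

The last two lines show the rule of thumb: when write replicas plus read replicas exceed the replica count (here 2 + 2 > 3), the read is guaranteed to see the latest write; a ONE write offers no such overlap.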
Q13. How do you reduce the I/O bandwidth requirements for big data analytics workloads?
Kalantzis, Brown: What is great about C* is that it allows you to scale linearly with commodity hardware. Furthermore, adding more hardware is very simple and the cluster will rebalance itself. To solve the I/O issue we simply add more nodes. This reduces the amount of data stored on each node, allowing us to get as close as possible to the ideal data-to-memory ratio of 1.
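The sizing arithmetic behind “just add more nodes” can be sketched as follows (the figures are hypothetical, not Netflix’s actual cluster sizes): to keep the data-to-memory ratio near 1, each node should hold no more data than it has RAM, and replication multiplies the total stored.

```python
import math

def nodes_needed(dataset_gb, ram_per_node_gb, rf=1, target_ratio=1.0):
    """Nodes required so per-node data stays within target_ratio * RAM.
    rf is the replication factor: every replica counts toward storage."""
    stored = dataset_gb * rf
    return math.ceil(stored / (ram_per_node_gb * target_ratio))

# e.g. 3 TB of raw data on nodes with 60 GB of RAM:
unreplicated = nodes_needed(3000, 60)        # 50 nodes
replicated   = nodes_needed(3000, 60, rf=3)  # 150 nodes with 3 replicas
```

Because capacity grows linearly with node count, hitting an I/O ceiling is answered by recomputing this figure and growing the ring, rather than by buying bigger machines.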
Q14. What is the tradeoff between Accuracy and Speed that you usually need to make with Big Data?
Kalantzis, Brown: When we chose C* we made a conscious decision to accept eventually consistent data.
We decided write/read speed and high availability were more important than consistency. Our application is such that this tradeoff, although it might rarely produce inaccurate data (e.g., starting a movie at the wrong location), does not negatively impact the user. When an application does require accuracy, we increase the consistency to quorum.
Q15. How do you ensure that your system does not become unavailable?
Kalantzis, Brown: Minimally, we run our C* clusters (over 50 in production now) across multiple availability zones within each region of EC2. If a cluster is multi-region (multi-datacenter), then one region is naturally isolated (for reads) from the other; all writes will eventually make it to all regions due to Cassandra’s replication.
Q16. How do you handle information security?
Kalantzis, Brown: We currently rely on Amazon’s security features to limit who can read/write to our C* clusters. As we move forward with our cloud strategy and want to store Financial data in C* and in the cloud, we are hoping that new security features in DSE 3.0 will provide a roadmap for us. This will enable us to move even more sensitive information to the Cloud & C*.
Q17. How is your experience with virtualization technologies?
Kalantzis, Brown: Netflix runs all of its infrastructure in the cloud (AWS). We have extensive experience with Virtualization, and fully appreciate all the Pros and Cons of virtualization.
Q18 How operationally complex is it to manage a multi-datacenter environment?
Kalantzis, Brown: Netflix invested heavily in creating tools to manage multi-datacenter and multi-region computing environments. Tools such as Asgard and Priam have made that management easy and scalable.
Q19. How do you handle modularity and flexibility?
Kalantzis, Brown: At the data level, each service typically has its own Cassandra cluster so it can scale independently of other services. We are developing internal tools and processes for automatically consolidating and splitting clusters to optimize efficiency and cost. At the code level, we have built and open sourced our Java Cassandra client, Astyanax, and added many recipes for extending the use patterns of Cassandra.
Q20. How do you reduce (if any) energy use?
Kalantzis, Brown: Currently, we do not. However, C* is introducing a new feature in version 1.2 called ‘virtual nodes’. The short explanation of virtual nodes is that it makes scaling the size of a C* cluster up and down much easier to manage. Thus, we are planning on using the virtual nodes concept with EC2 auto-scaling, so we would scale up the number of nodes in a C* cluster during peak times, and reduce the node count during troughs. We’ll save energy (and money) by simply using fewer resources when demand is not present.
Q21. What advice would you give to someone needing to ensure continuous availability?
Kalantzis, Brown: Availability is just one of the three dimensions of CAP (Consistency, Availability and Partition tolerance). They should evaluate which of the other two dimensions are important to them and choose their technology accordingly. C* solves for A&P (and some of C).
Q22. What are the main lessons learned at Netflix in using and deploying Cassandra in a production EC2 environment?
Kalantzis, Brown: We learned that deploying C* (or any database product) in a virtualized environment involves trade-offs. You gain easy and quick access to many virtual machines, yet each virtual machine has MUCH less IOPS capacity than a traditional server. Netflix has learned how to deal with those limitations and trade-offs by sizing clusters appropriately (number of nodes to use).
It has also forced us to reimagine the DBA role as a DB Engineer role, which means we work closely with application developers to make sure that their schema and application design are as efficient as possible and do not access the data store unnecessarily.
Christos Kalantzis (@chriskalan), Engineering Manager – Cloud Persistence Engineering, Netflix
Previously Engineering-Platform Manager YouSendIt
A tech enthusiast at heart, I try to focus my efforts on creating technology that enhances our lives.
I have built and led teams at YouSendIt and Netflix, which has led to the scaling out of persistence layers, the creation of a cloud file system and the adoption of Apache Cassandra as a scalable and highly available data solution.
I’ve worked as a DB2, SQL Server and MySQL DBA for over 10 years and through, sometimes painful, trial and error I have learned the advantages and limitations of RDBMS and when the modern NoSQL solutions make sense.
I believe in sharing knowledge, that is why I am a huge advocate of Open Source software. I share my software experience through blogging, pod-casting and mentoring new start-ups. I sit on the tech advisory board of the OpenFund Project which is an Angel VC for European start-ups.
Jason Brown (@jasobrown), Senior Software Engineer, Netflix
Jason Brown is a Senior Software Engineer at Netflix where he led the pilot project for using and deploying Cassandra in a production EC2 environment. Lately he’s been contributing to various open source projects around the Cassandra ecosystem, including the Apache Cassandra project itself. Holding a Master’s Degree in Music Composition, Jason longs to write a second string quartet.