“As the database gets used, shards can grow at an uneven rate and one shard might carry a majority of the load. MongoDB corrects this by balancing shards, but because of MongoDB’s lack of concurrency this operation can stall the database unacceptably.”–John Partridge.
I have interviewed John Partridge, President & CEO of Tokutek, Inc.
Q1. Tokutek recently announced that it has eliminated the performance issues of MongoDB sharding. What was the problem?
John Partridge: The problem occurs after a shard is created. As the database gets used, shards can grow at an uneven rate and one shard might carry a majority of the load. MongoDB corrects this by balancing shards, but because of MongoDB’s lack of concurrency this operation can stall the database unacceptably (see the benchmark).
Q2. For what kinds of applications have MongoDB users experienced these bottlenecks?
John Partridge: Users who need to scale out, and rely on sharding to do so.
Q3. What is the solution you propose to this problem?
Q4. How is TokuMX v1.4 able to allow shards to be balanced and added without disruption, for a NoSQL solution that scales up and scales out?
John Partridge: TokuMX replaces the B-tree indexing used in MongoDB with patented Fractal Tree indexing, which allows for significantly better concurrency (among other things). Because of the improved concurrency, data can be copied, then deleted, from one shard to another without unnecessary locking.
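The copy-then-delete migration described above can be pictured with a toy model. The following is a minimal sketch in Python, not MongoDB's or TokuMX's actual balancer: the `balance` function, the shard names, and the imbalance threshold of 2 chunks are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def balance(shards: Dict[str, List[int]], threshold: int = 2) -> List[Tuple[int, str, str]]:
    """Toy balancer: move chunks from the fullest shard to the emptiest one.

    Stops once the chunk counts of the fullest and emptiest shards differ
    by less than `threshold`. Returns the list of (chunk, donor, receiver)
    moves. All names here are hypothetical, not a real MongoDB API.
    """
    if threshold < 2:
        # With a threshold below 2 a chunk could bounce back and forth forever.
        raise ValueError("threshold must be at least 2")
    moves = []
    while True:
        donor = max(shards, key=lambda name: len(shards[name]))
        receiver = min(shards, key=lambda name: len(shards[name]))
        if len(shards[donor]) - len(shards[receiver]) < threshold:
            return moves
        chunk = shards[donor][-1]
        shards[receiver].append(chunk)  # step 1: copy the chunk to the receiver
        shards[donor].pop()             # step 2: delete it from the donor
        moves.append((chunk, donor, receiver))


# One shard starts out carrying the whole load.
cluster = {"shard0": list(range(8)), "shard1": [], "shard2": []}
for chunk, donor, receiver in balance(cluster):
    print(f"moved chunk {chunk}: {donor} -> {receiver}")
```

In a real system the expensive part is what happens to concurrent reads and writes while the copy and the delete are in flight, which is the point John Partridge makes about locking; the sketch only shows the bookkeeping.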
Q5. What is the difference in performance of your solution with respect to the basic MongoDB? What “basic” MongoDB do you use for this comparison?
John Partridge: “Basic” MongoDB is the distro that you get from MongoDB (10gen). We typically see 20x performance improvements but as you might imagine, it depends on the application. Because TokuMX offers document-level locking rather than database-level locking, TokuMX shines when there are significant reads *and* writes.
Q6. How do you compare TokuMX with other distributions of MongoDB, such as the one from 10gen (now MongoDB)?
John Partridge: There are three major differences: 20x performance improvement, 90% smaller database size (we compress the data), and support for ACID transactions. Look at the bottom of http://www.tokutek.com/products/tokumx-for-mongodb/ for more information on each of these benefits.
Mr. Partridge brings over twenty years of experience in the software industry as a developer, investor, and entrepreneur. He joins Tokutek from StreamBase Systems which John co-founded with database pioneer Dr. Michael Stonebraker. He started his career as a software developer at Microsoft Corporation where he co-authored Excel v1.0. He later worked as a venture capitalist at Accel Partners and the Summit Accelerator Fund where he specialized in investing in early stage internet infrastructure and enterprise software companies. John holds an A.B. in Applied Mathematics / Computer Science from Harvard University and an MBA from the Stanford University Graduate School of Business.
- TokuMX vs. MongoDB : Sharding Balancer Performance Posted on February 16, 2014 by Tim Callaghan, Tokutek.
- What’s new in TokuMX 1.4, Part 4: Smaller, faster sharded clusters. Posted on February 20, 2014 by Leif Walsh, Tokutek.
Follow ODBMS.org on Twitter: @odbmsorg
“So it is not even the volume of data that imputes political or economic value. Hence, it is clear that data has enormous political and economic value. Given the increasing digitization of our world it seems inevitable that our legal, economic, and political systems, amongst others, will ascribe to formal measures of value for data.” –Michael L. Brodie.
What is the other side of Big Data? What are the societal benefits, risks, and values of Big Data? These are difficult questions to answer.
On this topic, I have interviewed Dr. Michael L. Brodie, Research Scientist at MIT Computer Science and Artificial Intelligence Laboratory. Dr. Brodie has over 40 years experience in research and industrial practice.
Q1. You recently wrote that “we are in the midst of two significant shifts – the shift to Big Data requiring new computational solutions, and the more profound shift in societal benefits, risks, and values”. Can you please elaborate on this?
Michael L. Brodie: The database world deals with data that is bounded, even if vast and growing beyond belief, and used for known, discrete models of our world most of which support a single version of truth. While Big Data expands the existing scale (volume, velocity, variety) it does far more as it takes us into a world that we experience in life but not in computing. I call the vision, the direction that Big Data is taking us, Computing Reality. A simple explanation is that in the database world, we work top-down with schemas that define how the data should behave. For example, Telecom billing systems are essentially all in an equivalence class of the same billing model and require that billing data conform. Billing databases have a single version of truth so that telecom bills have justifiable charges. Not so with Big Data.
If we impose a model or our biases on the data we may preclude the very value that we are trying to discover.
In Big Data worlds, as in life, there is not a single version of truth over the data but multiple perspectives each with a probability of being true or reasonable. We are probably not looking for one likely model but an ensemble of models each of which provides a different perspective and discloses some discoveries in the data that we otherwise would not have found.
So the one paradigm shift is from small data that involve discrete, bounded, top-down approaches to computing to big data that require bottom-up approaches that tend to be vague or probabilistic, unbounded, and support multiple perspectives. I call this latter approach Computing Reality, reflecting the vagueness and unboundedness of reality.
A second, related shift – from why to what – can be understood in terms of Scientific Discovery. The history of scientific and Western thought, starting before Aristotle and Plato, has matured into what we know today as the Scientific Method in which one makes observations about a phenomenon, e.g., sees some data, hypothesizes a model, and determines if the model makes sense over the observed data.
This process is What: What are the correlations in the data that might explain the phenomenon.
A reasonable model over the data leads to Why: the Holy Grail of Science – causation – Why does the phenomenon occur.
For over 2,000 years a little What has guided Why – Scientific Discovery through empiricism.
Big Data has the potential of turning scientific discovery on its ear. Big Data is leading to a shift from Why to What.
The value of Big Data and the emergence of Big Data analytics may shift the preponderance of scientific discovery to What, since it is so much cheaper than Why – clinical studies that take vast resources and years of careful work. Here is the challenge. Why – causation – cannot be deduced from What. It is not clear that Big Data practitioners understand the tenuous link between What and Why. Massive Big Data blunders [1, 2] suggest that this is the case.
My research into Computing Reality explores this link with the hope of providing guidance for Big Data tools and techniques. Even cooler than that is the prospect of accelerating Scientific Discovery by adding mechanisms and metrics of veracity to Big Data and its symbiosis with empiricism.
Q2. You also wrote that with Big Data “more than ever before, technology is far ahead of the law”. What do you mean by this?
Michael L. Brodie: The Fourth Amendment of the United States Constitution was tested many times by technology, including when electronic techniques could be used to determine activities inside a citizen’s home. When the Constitution was written in 1787, electronic surveillance could never have been anticipated.
Today, the laws of search and seizure, based on the Fourth Amendment, permit those with a warrant to acquire all of your electronic devices so the government can examine everything on those devices, although it appears that the intent of the law was to permit search and seizure of evidence relative to the suspected offence. That is, the current laws were perfectly rational when written; however, technology has so changed the world we live in that the law, interpreted simply, allows the government to look at your entire digital life, which for many of us is much of our lives, thus minimizing or eliminating the protections of the Fourth Amendment. The simple matter is that technology will always be ahead of the law.
So we must constantly balance current and unforeseen consequences of technological advances on our lives and societies.
Since time immemorial, and as observed by Benjamin Franklin, we must always judiciously balance freedom and security; you can’t have both. Technology, more than many domains, pushes this balance.
Q3. John Podesta, Obama’s Counselor and study lead, asked the following question during a workshop: “Does our legal privacy framework support and balance safety and freedom?” What is your personal view on this, especially related to the ongoing discussion on an open and free Internet and Big Data?
Michael L. Brodie: What a great question, worthy of serious pursuit, more than I will pursue here. A fundamental part of your question is that of a free and open Internet. While it is debatable whether computing or the Internet has created economic growth and increased productivity, it is fair to say that our economies have become so dependent on computing that Balkanizing the Internet, as exemplified recently by Turkey, China, Brazil, and even Switzerland, will surely cause major economic disruption.
Not only does a significant portion of our existing economy ride on a currently open and free Internet, that platform has been and will continue to be a fountain of innovation and potential economic growth, and, ideally, increased productivity; not to mention the daily lives of billions of people on the planet. As we have seen in Tunisia, Egypt, North Korea, China, Syria, and other constrained countries, an open and free Internet, e.g., Twitter, is becoming a means for democratic expression and constraint on totalitarian behaviour. Much is at stake to maintain an open and free Internet.
This should encourage a robust debate of the various Internet Bills of Rights currently on offer. The Snowden-NSA incidents and the resulting events in the White House, the Supreme Court, and the US Congress clearly indicate that our legal privacy framework is inadequate. The more interesting question is what changes are required to permit a balance of freedom and safety. Such a framework should result from a robust, informed public debate on the issues. Hopefully these discussions will start in earnest. The workshop is an example of the White House’s commitment to such a discussion.
Q4. What would be the discussion on an open and free Internet, while balancing safety with freedom, if Edward Snowden had not disclosed the NSA surveillance?
Michael L. Brodie: What great questions with profound implications, clearly beyond my skills, but fun to poke at. Let me add to the question: Is Snowden a Whistle Blower or terrorist? Is he working to uphold the constitution or undermine it?
I happen to have had some direct experience on this issue. From April 2006 to January 2008 I served on the United States of America National Academies Committee on Technical and Privacy Dimensions of Information for Terrorism Prevention and other National Goals, co-chaired by Dr. Charles Vest, president of the National Academy of Engineering and Dr. William Perry, former US Secretary of Defense, that was commissioned by the Department of Homeland Security and the National Science Foundation.
The recent White House Investigation prompted by Snowden’s disclosures heavily cited the commission’s report.
Although the 21-month investigation by 20 experts chosen by the academy uncovered some aspects of what Snowden’s disclosures led to, it did not uncover the scope and scale of the NSA actions that emerged from Snowden’s disclosures. It is not until you discover the actions that you question the relevant laws or, as the White House justifiably asked, the legal privacy framework to support and balance safety and freedom.
As I said in the piece that you reference, the White House and Snowden are asking exactly the same questions. Snowden has said that he saw it as his obligation to do what he did given his oath to uphold the constitution. Hence, such a discussion could emerge without Snowden in the next decade, but it would not have emerged at the moment without his actions.
Would that it had emerged in 2006 or as a consequence of the many other similarly intended investigations.
It seems to me that Snowden blew the whistle on NSA.
Q5. De facto, the Internet is becoming a new battlefield among different political and economic systems in the world. What is the political and economic value of data?
Michael L. Brodie: Again a grand question for my betters. This is another profound question that I am not skilled to answer. But why let that stop me?
Our economic system is based on commodities, goods and services, with almost no means of attributing economic value to data. Indirectly, data is valued at inconceivably high values according to many Internet company acquisitions, especially Facebook’s recent $16 Billion acquisition of Whatsapp that appears to be acquiring people and their data by the network effect.
How do you ascribe value to data? Who owns data? Does it age and does time reduce or raise its value? If it has economic value, then what legal jurisdiction governs data? What is the political value of data? For one example look at Europe’s solicitation of business away from the United States based on data, data ownership, and data governance.
Another example is that President Lyndon Johnson achieved the US Civil Rights Bill because of data – he knew where all the bodies were buried. What is the value of data there?
So it is not even the volume of data that imputes political or economic value. Hence, it is clear that data has enormous political and economic value. Given the increasing digitization of our world it seems inevitable that our legal, economic, and political systems, amongst others, will ascribe to formal measures of value for data.
Q6. There has been a claim that “Big data” has rendered obsolete the current approach to protecting privacy and civil liberties. Is this really so?
Michael L. Brodie: Without question expanding beyond bounded, discrete, top-down models of the world, to a vastly larger, more complex digital version of the world, requires a reevaluation of previous approaches to computing problems, including privacy and civil liberties. The quote is from Craig Mundie, who makes the observation from a policy and strategy point of view. A recent report on machine learning and curly fries claims that organizations, e.g., marketing, can create complete profiles of individuals without their permission and presumably use them in many ways, e.g., to refuse a loan. Does that threaten privacy and civil liberties?
While I quoted Mundie concerning civil liberties, my knowledge is in computing and databases. My reference concerns the fact that current solutions will simply not scale to the world of Big Data and Computing Reality. It seems a safe statement since Butler Lampson and Mike Stonebraker have both said the same thing. Simply stated, we cannot anticipate every attack, what combination of data accesses could be used to deduce private information. A famous case is to use Netflix movie selection data to identify private patient information from anonymized Medicare data. So while you may do a top-down job applying existing protection mechanisms, your only hope is to detect violations and stop further such attacks, as has been claimed for Heartbleed.
As Butler Lampson said:
“It’s time to change the way we think about computer security: instead of trying to prevent security breaches, we should focus on dealing with them after they happen.
Today computer security depends on access control, and it’s been a failure. Real world security, by contrast, is mainly retroactive: the reason burglars don’t break into my house is that they are afraid of going to jail, and the financial system is secure mainly because almost any transaction can be undone.
There are many ways to make security retroactive:
• Track down and punish offenders.
• Selectively undo data corruption caused by malware.
• Require applications and online services to respect people’s ownership of their personal data.
Access control is still needed, but it can be much more coarse-grained, and therefore both more reliable and less intrusive. Authentication and auditing are the most important features. Retroactive security will not be perfect, but perfect security is not to be had, and it will be much better than what we have now.”
 Protecting Individual Privacy in the Struggle Against Terrorism: A Framework for Program Assessment, Committee on Technical and Privacy Dimensions of Information for Terrorism Prevention and Other National Goals, National Research Council, Washington, D.C. 2008. ISBN-10: 0-309-12488-3 ISBN-13: 978-0-309-12488-1
 John Podesta, White House Counselor, White House-MIT Big Data Privacy Workshop: Advancing the State of the Art in Technology and Practice, March 4, 2014, MIT, Cambridge, MA http://web.mit.edu/bigdata-priv/agenda.html
 White House-MIT Big Data Privacy Workshop A Personal View, Dr. Michael L. Brodie , Computer Science and Artificial Intelligence Laboratory, MIT , March 24, 2014 http://www.odbms.org/2014/04/white-house-mit-big-data-privacy-workshop/
Dr. Michael L. Brodie
Dr. Brodie has over 40 years experience in research and industrial practice in databases, distributed systems, integration, artificial intelligence, and multi-disciplinary problem solving. He is concerned with the Big Picture aspects of information ecosystems including business, economic, social, application, and technical. Dr. Brodie is a Research Scientist, MIT Computer Science and Artificial Intelligence Laboratory; advises startups; serves on Advisory Boards of national and international research organizations; and is an adjunct professor at the National University of Ireland, Galway. For over 20 years he served as Chief Scientist of IT, Verizon, a Fortune 20 company, responsible for advanced technologies, architectures, and methodologies for Information Technology strategies and for guiding industrial scale deployments of emergent technologies, most recently Cloud Computing and Big Data and start ups Jisto.com and data-tamer.com. He has served on several National Academy of Science committees.
Dr. Brodie holds a PhD in Databases from the University of Toronto and a Doctor of Science (honoris causa) from the National University of Ireland.
“SciDB is both a data store and a massively parallel compute engine for numerical processing. The inclusion of this computational platform is what makes us the first “computational database”, not just a SQL-style decision support DBMS. Hence, we need a new moniker to describe this class of interactions. We settled on computational databases, but if your readers have a better suggestion, we are all ears!”
–Mike Stonebraker, Paul Brown.
On the SciDB array database, I have interviewed Mike Stonebraker, MIT Professor and Paradigm4 co-founder and CTO, and Paul Brown, Paradigm4 Chief Architect.
Q1: What is SciDB and why did you create it?
Mike Stonebraker, Paul Brown: SciDB is an open source array database with scalable, built-in complex analytics, programmable from R and Python. The requirements for SciDB emerged from discussions between academic database researchers—Mike Stonebraker and Dave DeWitt— and scientists at the first Extremely Large Databases conference (XLDB) at SLAC in 2007 about coping with the peta-scale data from the forthcoming LSST telescope.
Recognizing that commercial and industrial users were about to face the same challenges as scientists, Mike Stonebraker founded Paradigm4 in 2010 to make the ideas explored in early prototypes available as a commercial-quality software product. Paradigm4 develops and supports both a free, open-source Community Edition (scidb.org/forum) and an Enterprise Edition with additional features (paradigm4.com).
Q2. With the rise of Big Data analytics, is the convergence of analytic needs between science and industry really happening?
Mike Stonebraker, Paul Brown: There is a “sea change” occurring as companies move from Business Intelligence (think SQL analytics) to Complex Analytics (think predictive modelling, clustering, correlation, principal components analysis, graph analysis, etc.). Obviously science folks have been doing complex analytics on big data all along.
Another force driving this sea change is all the machine-generated data produced by cell phones, genomic sequencers, and by devices on the Industrial Internet and the Internet of Things. Here too science folks have been working with big data from sensors, instruments, telescopes and satellites all along. So it is quite natural that a scalable computational database like SciDB that serves the science world is a good fit for the emerging needs of commercial and industrial users.
There will be a convergence of the two markets as many more companies aspire to develop innovative products and services using complex analytics on big and diverse data. In the forefront are companies doing electronic trading on Wall Street; insurance companies developing new pricing models using telematics data; pharma and biotech companies analyzing genomics and clinical data; and manufacturing companies building predictive models to anticipate repairs on expensive machinery. We expect everybody will move to this new paradigm over time. After all, a predictive model integrating diverse data is much more useful than a chart of numbers about past behavior.
Q3. What are the typical challenges posed by scientific analytics?
Mike Stonebraker, Paul Brown: We asked a lot of working scientists the same question, and published a paper in IEEE Computing in Science & Engineering summarizing their answers (*see citation below). In a nutshell, there are 4 primary issues.
1. Scale. Science has always been intensely “data driven”. With the ever-increasing massive data-generating capabilities of scientific instruments, sensors, and computer simulations, the average scientist is overwhelmed with data and needs data management and analysis tools that can scale to meet his or her needs, now and in the future.
2. New Analytic Methods. Historically analysis tools have focused on business users, and have provided easy-to-use interfaces for submitting SQL aggregates to data warehouses. Such business intelligence (BI) tools are not useful to scientists, who universally want much more complex analyses, whether it be outlier detection, curve fitting, analysis of variance, predictive models or network analysis. Such “complex analytics” is defined on arrays in linear algebra, and requires a new generation of client-side tools and server side tools in DBMSs.
3. Provenance. One of the central requirements that scientists have is reproducibility. They need to be able to send their data to colleagues to rerun their experiments and produce the same answers. As such, it is crucial to keep prior versions of data in the face of updates, error correction, and the like. The right way to provide such provenance is through a no-overwrite DBMS, which allows time travel back to when the experiment in question was performed.
4. Interactivity. Unlike business users who are often comfortable with batch reporting of information, scientific users are invariably exploring their data, asking “what if” questions and testing hypotheses. What they need is interactivity on very large data sets.
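The no-overwrite idea in item 3 above can be illustrated with a small append-only store. This is a minimal sketch in Python under stated assumptions, not SciDB's storage engine: the class `NoOverwriteStore` and its `commit` and `read` methods are hypothetical names invented for this example.

```python
from typing import Any, Dict, List, Optional


class NoOverwriteStore:
    """Append-only key/value store: every commit creates a new version.

    Hypothetical illustration of a no-overwrite design. Old values are
    never modified, so any past version can still be read.
    """

    def __init__(self) -> None:
        self._deltas: List[Dict[str, Any]] = []  # one dict of changes per version

    def commit(self, changes: Dict[str, Any]) -> int:
        """Record a set of changes and return the new version number (from 1)."""
        self._deltas.append(dict(changes))
        return len(self._deltas)

    def read(self, key: str, version: Optional[int] = None) -> Any:
        """Return the value of `key` as of `version` (default: latest)."""
        if version is None:
            version = len(self._deltas)
        if not 0 <= version <= len(self._deltas):
            raise ValueError(f"unknown version {version}")
        # Walk backwards from the requested version to the newest change of `key`.
        for delta in reversed(self._deltas[:version]):
            if key in delta:
                return delta[key]
        raise KeyError(key)


store = NoOverwriteStore()
v1 = store.commit({"temperature": 20.5})   # original measurement
v2 = store.commit({"temperature": 21.0})   # later error correction
print(store.read("temperature", version=v1), store.read("temperature"))
```

A colleague rerunning an experiment would read at the version that was current when the experiment was performed, so a later correction does not change the reproduced answer.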
Q3. What are, in your opinion, the commonalities between scientific and industrial analytics?
Mike Stonebraker, Paul Brown: We would state the question in reverse “What are the differences between the two markets?” In our opinion, the two markets will converge quickly as commercial and industrial companies move to the analytic paradigms pervasive in the science marketplace.
Q4. How come in the past the database system software community has failed to build the kinds of systems that scientists needed for managing massive data sets?
Mike Stonebraker, Paul Brown: Mostly it’s because scientific problems represent a $0 billion market! However, the convergence of industrial requirements and science requirements means that science can “piggy back” on the commercial market and get their needs met.
Q5. SciDB is a scalable array database with native complex analytics. Why did you choose a data model based on multidimensional arrays?
Mike Stonebraker, Paul Brown: Our main motivation is that at scale, the complex analyses done by “post sea change” users are invariably about applying parallelized linear algebraic algorithms to arrays. Whether you are doing regression, singular value decomposition, finding eigenvectors, or doing operations on graphs, you are performing a sequence of matrix operations. Obviously, this is intuitive and natural in an array data model, whereas you have to recast tables into arrays if you begin with an RDBMS or keep data in files. Also, a native array implementation can be made much faster than a table-based system by directly implementing multi-dimensional clustering and doing selective replication of neighboring data items.
Our secondary motivation is that, just like mathematical matrices, geospatial data, time-series data, image data, and graph data are most naturally organized as arrays. By preserving the inherent ordering in the data, SciDB supports extremely fast selection (including vectors, planes, ‘hypercubes’), doing multi-dimensional windowed aggregates, and re-gridding it to change spatial or temporal resolution.
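To make the windowed aggregate and re-gridding operations mentioned above concrete, here is a minimal sketch on a plain Python list of lists. It is not SciDB code and makes no claim about how SciDB implements these operators; the function names `window_mean` and `regrid` are assumptions chosen for this example.

```python
from typing import List

Grid = List[List[float]]


def window_mean(grid: Grid, radius: int) -> Grid:
    """Mean over a (2*radius+1)-wide square window around each cell.

    Windows are clipped at the array edges, so border cells average
    fewer neighbours.
    """
    rows, cols = len(grid), len(grid[0])
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            cells = [
                grid[i][j]
                for i in range(max(0, r - radius), min(rows, r + radius + 1))
                for j in range(max(0, c - radius), min(cols, c + radius + 1))
            ]
            out_row.append(sum(cells) / len(cells))
        out.append(out_row)
    return out


def regrid(grid: Grid, factor: int) -> Grid:
    """Coarsen resolution by averaging non-overlapping factor x factor blocks."""
    rows, cols = len(grid), len(grid[0])
    if rows % factor or cols % factor:
        raise ValueError("grid dimensions must be multiples of factor")
    out = []
    for r in range(0, rows, factor):
        out_row = []
        for c in range(0, cols, factor):
            block = [grid[i][j] for i in range(r, r + factor) for j in range(c, c + factor)]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


# A 4x4 array holding the values 0..15 in row-major order.
data = [[float(4 * r + c) for c in range(4)] for r in range(4)]
print(regrid(data, 2))
print(window_mean(data, 1)[1][1])
```

Both operations only ever touch neighbouring cells, which is why keeping the data in its natural order, as the answer describes, matters for speed.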
Q6. How do you manage in a nutshell scalability with high degrees of tolerance to failures?
Mike Stonebraker, Paul Brown: In a nutshell? Partitioning, and redundancy (k-replication).
First, SciDB splits each array’s attributes apart, just like any columnar system. Then we partition each array into rectilinear blocks we call “chunks”. Then we employ a variety of mapping functions that map an array’s chunks to SciDB instances. For each copy of an array we use a different mapping function to create copies of each chunk on different nodes of the cluster. If a node goes down, we figure out where there is a redundant copy of the data and move the computation there.
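The chunking and replica placement just described can be sketched in a few lines. This is an illustration under assumptions, not SciDB's actual mapping functions: `chunk_of`, `place`, and `find_live` are hypothetical names, and the placement rule (a simple linearization of the chunk coordinates plus a per-replica offset) was picked only because it is easy to follow.

```python
from typing import Optional, Sequence, Set, Tuple


def chunk_of(cell: Sequence[int], chunk_shape: Sequence[int]) -> Tuple[int, ...]:
    """Return the coordinates of the rectilinear chunk that contains `cell`."""
    return tuple(i // size for i, size in zip(cell, chunk_shape))


def place(chunk: Sequence[int], replica: int, n_instances: int) -> int:
    """Map a chunk to an instance; each replica number uses a shifted mapping.

    Assumed rule for this sketch: linearize the chunk coordinates, then
    offset by the replica number so copies land on different instances
    (as long as there are fewer replicas than instances).
    """
    base = 0
    for coord in chunk:
        base = base * 31 + coord
    return (base + replica) % n_instances


def find_live(chunk: Sequence[int], copies: int, n_instances: int,
              down: Set[int]) -> Optional[int]:
    """Return the first instance holding a copy of `chunk` that is still up."""
    for replica in range(copies):
        instance = place(chunk, replica, n_instances)
        if instance not in down:
            return instance
    return None  # every copy is on a failed instance


chunk = chunk_of((250, 1030), chunk_shape=(100, 500))
primary = place(chunk, replica=0, n_instances=4)
fallback = find_live(chunk, copies=2, n_instances=4, down={primary})
print(chunk, primary, fallback)
```

With two copies the data survives the loss of any single instance; losing both instances that hold a chunk makes `find_live` return `None`, which is the case more replicas are meant to cover.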
Q7. How do you handle data compression in SciDB?
Mike Stonebraker, Paul Brown: Use of compression in modern data stores is a very important topic. Minimizing storage while retaining information and supporting extremely rapid data access informs every level of SciDB’s design. For example, SciDB splits every array into single-attribute components. We compress a chunk’s worth of cell values for a specific attribute. At the lowest level, we compress attribute data using techniques like run-length encoding on data. In addition, our implementation has an abstraction for compression to support other compression algorithms.
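Run-length encoding, the technique named above, is easy to show on a single attribute's cell values. The following is a generic textbook sketch in Python, not SciDB's implementation; `rle_encode` and `rle_decode` are names chosen for this example.

```python
from itertools import groupby
from typing import Any, List, Tuple


def rle_encode(values: List[Any]) -> List[Tuple[Any, int]]:
    """Collapse each run of equal adjacent values into a (value, count) pair."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(runs: List[Tuple[Any, int]]) -> List[Any]:
    """Expand (value, count) pairs back into the original sequence."""
    return [value for value, count in runs for _ in range(count)]


# One attribute from one chunk: mostly repeated values, as in sparse sensor data.
cells = [0, 0, 0, 0, 7, 7, 0, 0, 0]
runs = rle_encode(cells)
print(runs)
assert rle_decode(runs) == cells  # the encoding is lossless
```

Splitting an array into single-attribute components, as the answer describes, helps here because values of one attribute tend to repeat far more than interleaved values of several attributes would.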
Q8. Why support two query languages?
Mike Stonebraker, Paul Brown: Actually the primary interfaces we are promoting are R and Python as they are the languages of choice of data scientists, quants, bioinformaticians, and scientists. SciDB-R and SciDB-Py allow users to interactively query SciDB using R and Python. Data is persisted in SciDB. Math operators are overloaded so that complex analytical computations execute scalably in the database.
Early on we surveyed potential and existing SciDB users, and found there were two very different types. By and large, commercial users using RDBMSs said “make it look like SQL”. For those users we created AQL, an array SQL. On the other hand, data scientists and programmers preferred R, Python, and functional languages. For the second class of users we created SciDB-R, SciDB-Py, and AFL, an array functional language.
All queries get compiled into a query plan, which is a sequence of algebraic operations. Essentially all relational versions of SQL do exactly the same thing. In SciDB, AFL, the array functional language, is the underlying language of algebraic operators. Hence, it is easy to surface and support AFL in addition to AQL, SciDB-R, and SciDB-Py, allowing us to satisfy the preferred mode of working for many classes of users.
Q9. You defined SciDB as a computational database – not a data warehouse, not a business-intelligence database, and not a transactional database. Could you please elaborate more on this point?
Mike Stonebraker, Paul Brown: In our opinion, there are two mature markets for DBMSs: transactional DBMSs that are optimized for large numbers of users performing short write-oriented ACID transactions, and data warehouses, which strive for high performance on SQL aggregates and other read-oriented longer queries. The users of SciDB fit into neither category. They are universally doing more complex mathematical calculations than SQL aggregates on their data, and their DBMS interactions are typically longer read-oriented queries. SciDB is both a data store and a massively parallel compute engine for numerical processing. The inclusion of this computational platform is what makes us the first “computational database”, not just a SQL-style decision support DBMS. Hence, we need a new moniker to describe this class of interactions. We settled on computational databases, but if your readers have a better suggestion, we are all ears!
Q10. How does SciDB differ from analytical databases, such as for example HP Vertica, and in-memory analytics databases such as SAP HANA?
Mike Stonebraker, Paul Brown: Both are data warehouse products, optimized for warehouse workloads. SciDB serves a different class of users from these other systems. Our customers’ data are naturally represented as arrays that don’t fit neatly or efficiently into relational tables. Our users want more sophisticated analytics—more numerical, statistical, and graph analysis—and not so much SQL OLAP.
Q11. What about Teradata?
Mike Stonebraker, Paul Brown: Another data warehouse vendor. Plus, SciDB runs on commodity hardware clusters or in a cloud and not on proprietary appliances or expensive servers.
Q12. Anything else you wish to add?
Mike Stonebraker, Paul Brown: SciDB is currently being used by commercial users for computational finance, bioinformatics and clinical informatics, satellite image analysis, and industrial analytics. The publicly accessible NIH NCBI One Thousand Genomes browser has been running on SciDB since the Fall of 2012.
Anyone can try out SciDB using an AMI or a VM available at scidb.org/forum.
Mike Stonebraker, CTO Paradigm4
Renowned database researcher, innovator, and entrepreneur: Berkeley, MIT, Postgres, Ingres, Illustra, Cohera, Streambase, Vertica, VoltDB, and now Paradigm4.
Paul Brown, Chief Architect Paradigm4
Premier database ‘plumber’ and researcher moving from the “I’s” (Ingres, Illustra, Informix, IBM) to a “P” (Paradigm4).
*Citation for IEEE paper
Stonebraker, M.; Brown, P.; Donghui Zhang; Becla, J., “SciDB: A Database Management System for Applications with Complex Analytics,” Computing in Science & Engineering, vol. 15, no. 3, pp. 54-62, May-June 2013
doi: 10.1109/MCSE.2013.19, URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6461866&isnumber=6549993
- ODBMS.org: free resources related to Paradigm4
“High volume and data driven businesses have led to new types of data emerging from the cloud, mobile devices, social media and sensor devices. For applications processing such data, traditional relational databases such as Oracle simply run out of steam.”–Robin Schumacher
The sixth interview in the “Big Data: three questions to” series of interviews is with Robin Schumacher, VP of Products at DataStax.
Q1. What is your current product offering?
Robin Schumacher: DataStax offers the first enterprise-class NoSQL platform for data-driven, real-time online applications. Our flagship product is DataStax Enterprise 4.0, built on Apache Cassandra. It is a complete big data platform with the full power of Cassandra, offering a range of solutions including built-in analytics, integrated search, an in-memory option, and the most comprehensive security feature set of any NoSQL database.
An integrated analytics component allows users to store and manage line-of-business application data and analyze that same data within the platform. The analytics capability allows for comprehensive workload management and allows the user to run real-time transactions and enterprise search workloads in a seamlessly integrated database.
Built-in search offers robust full-text search, faceted search, rich document handling and geospatial search.
Benefits include full workload management, continuous availability, real-time functionality and data protection.
Lastly, security runs through the entire platform to prevent unauthorized access and guard sensitive data. Visual backup and restore processes make retrieving lost data extremely easy.
DataStax OpsCenter, a simplified management solution, is included with DataStax Enterprise. This service makes it easy to manage Cassandra and DataStax Enterprise clusters by giving administrators, architects and developers a view of the system from a centralized dashboard. OpsCenter installs seamlessly and gives system operators the flexibility to monitor and manage the most complex workloads from any web browser.
Q2. Who are your current customers and how do they typically use your products?
Robin Schumacher: DataStax is the first viable alternative to Oracle and powers the online applications for 400+ customers and more than 20 of the Fortune 100. Our customer industries range from e-commerce to education to digital entertainment and the top use cases are the following:
1. Fraud detection
2. The Internet of Things
The most common baseline use for our product is to serve as an operational database management system for online applications that must scale to incredible levels and must remain online at all times.
Q3. What are the main new technical features you are currently working on and why?
Robin Schumacher: We recently added an in-memory option that enables companies to process data up to 100 times faster. This option excels in use cases that require fast write and read operations, and is particularly suited when data is overwritten frequently, but not actually deleted. DataStax Enterprise 4.0 is the first NoSQL database to combine this in-memory option with Cassandra’s always-on architecture, linear scalability and datacenter support, delivering lightning performance that allows businesses to scale applications with zero downtime – particularly useful in financial services use cases or any application where performance is key.
High volume and data driven businesses have led to new types of data emerging from the cloud, mobile devices, social media and sensor devices. For applications processing such data, traditional relational databases such as Oracle simply run out of steam. DataStax Enterprise 4.0 offers a powerful, modern alternative to help build online applications that scale as the business grows. This in-memory capability equals faster performance, easy development, flexible performance management and seamless search:
Objects created in-memory optimize performance and deliver increased speed which enables businesses to deliver data to customers faster than ever before.
In-memory objects act as Cassandra tables so they are transparent to applications and developers have no learning curve to manage.
Administrators can decide where to assign data, making performance optimization easier than ever.
Enhanced internal cluster communications deliver faster search operations and help developers build applications more efficiently.
“The real problem here is the word “silo.” To answer today’s data challenges requires a holistic approach. Your storage, network and compute need to work together.”–David Gorbet.
What are the challenges for modern data centers? On this topic I have interviewed David Gorbet, Vice President of Engineering at MarkLogic.
Q1. Data centers are evolving to meet the demands and complexities imposed by increasing business requirements. What are the main challenges?
David Gorbet: The biggest business challenge is the rapid pace of change in both available data and business requirements. It’s no longer acceptable to spend years designing a data application, or the infrastructure to run it. You have to be able to iterate on functionality quickly. This means that your applications need to be developed in a much more agile manner, but you also need to be able to reallocate your infrastructure dynamically to the most pressing needs. In the era of Big Data this problem is exacerbated. The increasing volume and complexity of data under management is stressing both existing technologies and IT budgets. It’s not just a matter of scale, although traditional “scale-up” technologies do become very expensive as data volumes grow. It’s also a matter of complexity of data. Today a lot of data has a mix of structured and unstructured components, and the traditional solution to this problem is to split the structured components into an RDBMS, and use a search technology for the unstructured components. This creates additional complexity in the infrastructure, as different technology stacks are required for what really should be components of the same data.
Traditional technologies for data management are not agile. You have to spend an inordinate amount of time designing schemas and planning indexing strategies, both of which require pretty-much full knowledge of the data and query patterns you need to provide the application value. This has to be done before you can even load data. On the infrastructure side, even if you’ve embraced cloud technologies like virtualization, it’s unlikely you’re able to make good use of them at the data layer. Most database technologies are not architected to allow elastic expansion or contraction of capacity or compute power, which makes it hard to achieve many of the benefits (and cost savings) of cloud technologies.
To solve these problems you need to start thinking differently about your data center strategy. You need to be thinking about a data-centered data center, versus today’s more application-centered model.
Q2. You talked about a “data-centered” data center. What is it, and what is the difference with respect to a classical data warehouse?
David Gorbet: To understand what I mean by “data-centered” data center, you have to think about the alternative, which is more application-centered. Today, if you have data that’s useful, you build an application or put it in a data warehouse to unlock its value. These are database applications, so you need to build out a database to power them. This database needs a schema, and that schema is optimized for the application. To build this schema, you need to understand both the data you’ll be using, and the queries that the application requires.
So you have to know in advance everything the application is going to do before you can build anything. What’s more, you then have to ETL this data from wherever it lives into the application-specific database.
Now, if you want another application, you have to do the same thing. Pretty soon, you have hundreds of data stores with data duplicated all over the place. Actually, it’s not really duplicated, it’s data derived from other data, because as you ETL the data you change its form losing some of the context and combining what’s left with bits of data from other sources. That’s even worse than straight-up duplication because provenance is seldom retained through this process, so it’s really hard to tell where data came from and trace it back to its source. Now imagine that you have to correct some data.
Can you be sure that the correction flowed through to every downstream system? Or what if you have to delete data due to a privacy issue, or change security permissions on data? Even with “small data” this is complicated, but it’s much harder and costlier with high volumes of data.
A “data-centered” data center is one that is focused on the data, its use, and its governance through its lifecycle as the primary consideration. It’s architected to allow a single management and governance model, and to bring the applications to the data, rather than copying data to the applications. With the right technologies, you can build a data-centered data center that minimizes all the data duplication, gives you consistent data governance, enables flexibility both in application development over the data and in scaling up and down capacity to match demand, allowing you to manage your data securely and cost-effectively throughout its lifecycle.
Q3. Data center resources are typically stored in three silos: compute, storage and network: is this a problem?
David Gorbet: It depends on your technology choices. Some data management technologies require direct-attached storage (DAS), so obviously you can’t manage storage separately with that kind of technology. Others can make use of either DAS or shared storage like SAN or NAS.
With the right technology, it’s not necessarily a problem to have storage managed independently from compute.
The real problem here is the word “silo.” To answer today’s data challenges requires a holistic approach. Your storage, network and compute need to work together.
Your question could also apply to application architectures. Traditionally, applications are built in a three-tiered architecture, with a DBMS for data management, an application server for business logic, and a front-end client where the UI lives. There are very good reasons for this architecture, and I believe it’s likely to be the predominant model for years to come. But even though business logic is supposed to reside in the app server, every enterprise DBMS supports stored procedures, and these are commonly used to leverage compute power near the data for cases where it would be too slow and inefficient to move data to the middle tier. Increasingly, enterprise DBMSes also have sophisticated built-in functions (and in many cases user-defined functions) to make it easy to do things that are most efficiently done right where the data lives. Analytic aggregate calculations are a good example of this. Compute doesn’t just reside in the middle tier.
This is nothing new, so why am I bringing it up? Because as data volumes grow larger, the problem of moving data out of the DBMS to do something with it is going to get a lot worse. Consider for example the problem faced by the National Cancer Institute. The current model for institutions wanting to do research based on genomic data is to download a data set and analyze it. But by the end of 2014, the Cancer Genome Atlas is expected to have grown from less than 500 TB to 2.5 PB. Just downloading 2.5 PB, even over a 10-gigabit network, would take almost a month.
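The “almost a month” figure can be checked with a back-of-the-envelope calculation. The sketch below assumes decimal units (1 PB = 10^15 bytes), a fully utilized 10 Gbit/s link and no protocol overhead; none of these assumptions come from the interview, and the function name is only illustrative.

```python
# Ideal transfer time for a large data set over a network link.
# Assumptions (not from the interview): decimal units, a link that is
# saturated for the whole transfer, and zero protocol overhead.

PETABYTE = 10**15           # bytes
TEN_GIGABIT = 10 * 10**9    # bits per second
SECONDS_PER_DAY = 86_400


def transfer_days(size_bytes: float, link_bits_per_second: float) -> float:
    """Return the best-case transfer time in days."""
    seconds = size_bytes * 8 / link_bits_per_second
    return seconds / SECONDS_PER_DAY


days = transfer_days(2.5 * PETABYTE, TEN_GIGABIT)
print(f"{days:.1f} days")  # prints "23.1 days"
```

Real transfers rarely saturate a link, so the practical figure would be higher, which is consistent with the estimate quoted above.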
The solution? Bring more compute to the data. The implication? Twofold: First, methods for narrowing down data sets prior to acting on them are critical. This is why search technology is fast becoming a key feature of a next-generation DBMS. Search is the query language for unstructured data, and if you have complex data with a mix of structured and unstructured components, you need to be able to mix search and query seamlessly. Second, DBMS technologies need to become much more powerful so that they can execute sophisticated programs and computations efficiently where the data lives, scoped in real-time to a search that can narrow the input set down significantly. That’s the only way this stuff is going to get fast enough to happen in real-time. Another way of putting this is that the “M” in DBMS is going to increase in scope. It’s not enough just to store and retrieve data. Modern DBMS technology needs to be able to do complex, useful computations on it as well.
Q4. How do you build such a “data-centered” data center?
David Gorbet: First you need to change your mindset. Think about the data as the center of everything. Think about managing your data in one place, and bringing the application to the data by exposing data services off your DBMS. The implications for how you architect your systems are significant. Think service-oriented architectures and continuous deployment models.
Next, you need the right technology stack. One that can provide application functionality for transactions, search and discovery, analytics, and batch computation with a single governance and scale model. You need a storage system that gives great SLAs on high-value data and great TCO on lower-value data, without ETL. You need the ability to expand and contract compute power to serve the application needs in real time without downtime, and to run this infrastructure on premises or in the cloud.
You need the ability to manage data throughout its lifecycle, to take it offline for cost savings while leaving it available for batch analytics, and to bring it back online for real-time search, discovery or analytics within minutes if necessary. To power applications, you need the ability to create powerful, performant and secure data services and expose them right from where the data lives, providing the data in the format needed by your application on the fly.
We call this “schema on read.”
Of course all this has to be enterprise class, with high availability, disaster recovery, security, and all the enterprise functionality your data deserves, and it has to fit in your shrinking IT budget. Sounds impossible, but the technology exists today to make this happen.
Q5. For what kind of mission critical apps is such a “data-centered” data center useful?
David Gorbet: If you have a specific application that uses specific data, and you won’t need to incorporate new data sources to that application or use that data for another application, then you don’t need a data-centered data center. Unfortunately, I’m having a hard time thinking of such an application. Even the dull line of business apps don’t stand alone anymore. The data they create and maintain is sent to a data warehouse for analysis.
The new mindset is that all data is potentially valuable, and that isn’t just restricted to data created in-house.
More and more data comes from outside the organization, whether in the form of reference data, social media, linked data, sensor data, log data… the list is endless.
A data-centered data center strategy isn’t about a specific application or application type. It’s about the way you have to think about your data in this new era.
Q6. How does Hadoop fit into this “data-centered” data center?
David Gorbet: Hadoop is a key enabling technology for the data-centered data center. HDFS is a great file system for storing loads of data cheaply.
I think of it as the new shared storage infrastructure for “big data.” Now HDFS isn’t fast, so if you need speed, you may need NAS, SAN, or even DAS or SSD. But if you have a lot of data, it’s going to be much cheaper to store it in HDFS than in traditional data center storage technologies. Hadoop MapReduce is a great technology for batch analytics. If you want to comb through a lot of data and do some pretty sophisticated stuff to it, this is a good way to do it. The downside to MapReduce is that it’s for batch jobs. It’s not real-time.
So Hadoop is an enabling technology for a data-centered data center, but it needs to be complemented with high-performance storage technologies for data that needs this kind of SLA, and more powerful analytic technologies for real-time search, discovery and analysis. Hadoop is not a DBMS, so you also need a DBMS with Hadoop to manage transactions, security, real-time query, etc.
Q7. What are the main challenges when designing an ETL strategy?
David Gorbet: ETL is hard to get right, but the biggest challenge is maintaining it. Every app has a v2, and usually this means new queries that require new data that needs a new schema and revised ETL. ETL also just fundamentally adds complexity to a solution.
It adds latency since many ETL jobs are designed to run in batches. It’s hard to track provenance of data through ETL, and it’s hard to apply data security and lifecycle management rules through ETL. This isn’t the fault of ETL or ETL tools.
It’s just that the model is fundamentally complex.
Q8. With Big Data analytics you don’t know in advance what data you’re going to need (or get in the future). What is the solution to this problem?
David Gorbet: This is a big problem for relational technologies, where you need to design a schema that can fit all your data up front.
The best approach here is to use a technology that does not require a predefined schema, and that allows you to store different entities with different schemas (or no schema) together in the same database and analyze them together.
A document database, which is a type of NoSQL database, is great for this, but be careful which one you choose because some NoSQL databases don’t do transactions and some don’t have the indexing capability you need to search and query the data effectively.
Another trend is to use Semantic Web technology. This involves modeling data as triples, which represent assertions with a subject, a predicate, and an object.
Like “This derivative (subject) is based on (predicate) this underlying instrument (object).”
It turns out you can model pretty much any data that way, and you can invent new relationships (predicates) on the fly as you need them.
No schema required. It’s also easy to relate data entities together, since triples are ideal for modeling relationships. The challenge with this approach is that there’s still quite a bit of thought required to figure out the best way to represent your data as triples. To really make it work, you need to define rules about what predicates you’re going to allow and what they mean so that data is modeled consistently.
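To make the triple model concrete, here is a minimal in-memory sketch. It is not MarkLogic’s API or any Semantic Web standard; the class names, the predicate vocabulary and the identifiers such as "derivative-1" are hypothetical. It shows both halves of the point above: new predicates can be introduced on the fly, while a small rule (an allow-list) keeps the data modeled consistently.

```python
from typing import List, NamedTuple, Optional, Set


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


class TripleStore:
    """Toy triple store with a controlled predicate vocabulary."""

    def __init__(self, allowed_predicates: Set[str]) -> None:
        self.allowed_predicates = set(allowed_predicates)
        self.triples: Set[Triple] = set()

    def allow(self, predicate: str) -> None:
        """Introduce a new relationship type on the fly."""
        self.allowed_predicates.add(predicate)

    def add(self, subject: str, predicate: str, obj: str) -> None:
        # Rule: only declared predicates may be used.
        if predicate not in self.allowed_predicates:
            raise ValueError(f"unknown predicate: {predicate!r}")
        self.triples.add(Triple(subject, predicate, obj))

    def match(
        self,
        subject: Optional[str] = None,
        predicate: Optional[str] = None,
        obj: Optional[str] = None,
    ) -> List[Triple]:
        """Return matching triples; None acts as a wildcard."""
        return sorted(
            t for t in self.triples
            if subject in (None, t.subject)
            and predicate in (None, t.predicate)
            and obj in (None, t.obj)
        )


store = TripleStore({"is_based_on"})
store.add("derivative-1", "is_based_on", "instrument-A")
store.allow("is_traded_by")  # a new predicate, no schema migration
store.add("derivative-1", "is_traded_by", "desk-7")
print(store.match(subject="derivative-1"))
```

Real triple stores add indexing and a query language such as SPARQL; the allow-list here only stands in for the modeling rules mentioned above.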
Q9. What is the cost to analyze a terabyte of data?
David Gorbet: That depends on what technologies you’re using, and what SLAs are required on that data.
If you’re ingesting new data as you analyze, and you need to feed some of the results of the analysis back to the data in real time, for example if you’re analyzing risk on derivatives trades before confirming them, and executing business rules based on that, then you need fast disk, a fair amount of compute power, replicas of your data for HA failover, and additional replicas for DR. Including compute, this could cost you about $25,000/TB.
If your data is read-only and your analysis does not require high-availability, for example a compliance application to search those aforementioned derivatives transactions, you can probably use cheaper, more tightly packed storage and less powerful compute, and get by with about $4,000/TB. If you’re doing mostly batch analytics and can use HDFS as your storage, you can do this for as low as $1,500/TB.
This wide disparity in prices is exactly why you need a technology stack that can provide real-time capability for data that needs it, but can also provide great TCO for the data that doesn’t. There aren’t many technologies that can work across all these data tiers, which is why so many organizations have to ETL their data out of their transactional system to an analytic or archive system to get the cost savings they need. The best solution is to have a technology that can work across all these storage tiers and can manage migration of data through its lifecycle across these tiers seamlessly.
Again, this is achievable today with the right technology choices.
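The effect of tiering can be illustrated with a small calculation. The per-terabyte prices are the rough figures quoted above; the split of 200 TB across the three tiers is a purely hypothetical example, as are the tier names.

```python
from typing import Dict

# Approximate cost per terabyte, as quoted in the answer above (USD).
COST_PER_TB = {
    "real_time": 25_000,   # fast disk, HA and DR replicas, more compute
    "read_only": 4_000,    # denser storage, less compute, no HA
    "batch_hdfs": 1_500,   # mostly batch analytics on HDFS
}


def blended_cost(tb_by_tier: Dict[str, float]) -> float:
    """Total cost when each tier holds the given number of terabytes."""
    return sum(COST_PER_TB[tier] * tb for tier, tb in tb_by_tier.items())


# Hypothetical data estate: 200 TB spread across the three tiers.
example = {"real_time": 10, "read_only": 40, "batch_hdfs": 150}

tiered = blended_cost(example)
all_real_time = COST_PER_TB["real_time"] * sum(example.values())

print(f"tiered:        ${tiered:,.0f}")         # $635,000
print(f"all real-time: ${all_real_time:,.0f}")  # $5,000,000
```

Under these assumed volumes, keeping everything on the real-time tier would cost roughly eight times as much as placing each data set on the cheapest tier that meets its SLA.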
David Gorbet, Vice President, Engineering, MarkLogic.
David brings two decades of experience bringing to market some of the highest-volume applications and enterprise software in the world. David has shipped dozens of releases of business and consumer applications, server products and services ranging from open source to large-scale online services for businesses, and twice has helped start and grow billion-dollar software products.
Prior to MarkLogic, David helped pioneer Microsoft’s business online services strategy by founding and leading the SharePoint Online team. In addition to SharePoint Online, David has held a number of positions at Microsoft and elsewhere with a number of products, including Microsoft Office, Extricity B2Bi server software, and numerous incubation products.
David holds a Bachelor of Applied Science degree in Systems Design Engineering with an additional major in Psychology from the University of Waterloo, and an MBA from the University of Washington Foster School of Business.
- Got Loss? Get zOVN!
Authors: Daniel Crisan, Robert Birke, Gilles Cressier, Cyriel Minkenberg and Mitch Gusat. IBM Research – Zurich Research Laboratory.
Abstract: Datacenter networking is currently dominated by two major trends. One aims toward lossless, flat layer-2 fabrics based on Converged Enhanced Ethernet or InfiniBand, with benefits in efficiency and performance.
- F1: A Distributed SQL Database That Scales
Authors: Jeff Shute, Radek Vingralek, Eric Rollins, Stephan Ellner, Traian Stancescu, Bart Samwel, Mircea Oancea, John Cieslewicz, Himani Apte, Ben Handy, Kyle Littlefield, Ian Rae*. Google, Inc., *University of Wisconsin-Madison
Abstract: F1 is a distributed relational database system built at Google to support the AdWords business.
David Gorbet will be speaking at MarkLogic World in San Francisco from April 7-10, 2014.
“Despite the obvious shared word ‘transaction’ and the canonical example of a database transaction which modifies multiple bank accounts, I don’t think that database transactions are particularly relevant to financial applications.”–Dave Rosenthal.
On SQL and NoSQL, I have interviewed Dave Rosenthal CEO of FoundationDB.
Q1. What are the suggested criteria for users when they need to choose between durability for lower latency, higher throughput and write availability?
Dave Rosenthal: There is a tradeoff available between commit latency and durability–especially in distributed databases. At one extreme a database client can just report success immediately (without even talking to the database server) and buffer the writes in the background. Obviously, that hides latency well, but you could lose a suffix of transactions. At the other extreme, you can replicate writes across multiple machines, fsync them on each of the machines, and only then report success to the client.
FoundationDB is optimized to provide good performance in its default setting, which is the safest end of that tradeoff.
Usually, if you want some reasonable amount of durability guarantee, you are talking about a commit latency of a small constant factor times the network latency. So, the real latency issues come with databases spanning multiple data centers. In that case FoundationDB users are able to choose whether or not they want durability guarantees in all data centers before commit (increasing commit latencies), which is our default setting, or whether they would like to relax durability guarantees by returning a commit when the data is fsync’d to disk in just one datacenter.
All that said, in general, we think that the application is usually a more appropriate place to try to hide latency than the database.
Q2. Justin Sheehy of Basho said in an interview: “I would most certainly include updates to my bank account as applications for which eventual consistency is a good design choice. In fact, bankers have understood and used eventual consistency for far longer than there have been computers in the modern sense”. What is your opinion on this?
Dave Rosenthal: Yes, we totally agree with Justin. Despite the obvious shared word ‘transaction’ and the canonical example of a database transaction which modifies multiple bank accounts, I don’t think that database transactions are particularly relevant to financial applications. In fact, true ACID transactions are way more broadly important than that. They give you the ability to build abstractions and systems that you can provide guarantees about.
As Michael Cahill says in his thesis which became the SIGMOD paper of the year: “Serializable isolation enables the development of a complex system by composing modules, after verifying that each module maintains consistency in isolation.” It’s this incredibly important ability to compose that makes a system with transactions special.
Q3. FoundationDB claims to provide full ACID transactions. How do you do that?
Dave Rosenthal: In the same basic way as many other transactional databases do. We use a few strategies that tend to work well in distributed systems, such as optimistic concurrency and MVCC. We also, of course, have had to solve some of the fundamental challenges associated with distributed systems and all of the crazy things that can happen in them. Honestly, it’s not very hard to build a distributed transactional database. The hard part is making it work gracefully through failure scenarios and to run fast.
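For readers unfamiliar with the two techniques named here, the following is a single-process toy sketch of optimistic concurrency control on top of multi-version storage. It is not FoundationDB’s implementation or API; every class and method name is hypothetical, and it ignores distribution, durability and failure handling, which the answer identifies as the hard part.

```python
from typing import Any, Dict, List, Set, Tuple


class ConflictError(Exception):
    """Raised when a transaction's reads are stale at commit time."""


class Store:
    """Keeps every committed version of each key (MVCC)."""

    def __init__(self) -> None:
        # key -> list of (commit_version, value), in ascending order
        self._versions: Dict[str, List[Tuple[int, Any]]] = {}
        self._last_commit = 0

    def begin(self) -> "Transaction":
        return Transaction(self, read_version=self._last_commit)

    def _read(self, key: str, read_version: int) -> Any:
        # Newest value committed at or before the snapshot version.
        for version, value in reversed(self._versions.get(key, [])):
            if version <= read_version:
                return value
        return None

    def _commit(self, txn: "Transaction") -> None:
        # Optimistic check: no locks were taken, so validate now.
        for key in txn.read_keys:
            history = self._versions.get(key, [])
            if history and history[-1][0] > txn.read_version:
                raise ConflictError(key)
        self._last_commit += 1
        for key, value in txn.writes.items():
            self._versions.setdefault(key, []).append(
                (self._last_commit, value)
            )


class Transaction:
    def __init__(self, store: Store, read_version: int) -> None:
        self.store = store
        self.read_version = read_version
        self.read_keys: Set[str] = set()
        self.writes: Dict[str, Any] = {}

    def get(self, key: str) -> Any:
        if key in self.writes:          # read your own writes
            return self.writes[key]
        self.read_keys.add(key)
        return self.store._read(key, self.read_version)

    def set(self, key: str, value: Any) -> None:
        self.writes[key] = value        # buffered until commit

    def commit(self) -> None:
        self.store._commit(self)
```

Two transactions that start from the same snapshot and both read a key will not block each other; the one that commits second fails with ConflictError and would normally be retried by the application.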
Q4. Is this similar to Oracle NoSQL?
Dave Rosenthal: Not really. Both Oracle NoSQL and FoundationDB provide an automatically-partitioned key-value store with fault tolerance. Both also have a concept of ordering keys (for efficient range operations) though Oracle NoSQL only provides ordering “within a Major Key set”. So, those are the similarities, but there are a bunch of other NoSQL systems with all those properties. The huge difference is that FoundationDB provides for ACID transactions over arbitrary keys and ranges, while Oracle NoSQL does not.
Q5. How would you compare your product offering with respect to NoSQL data stores, such as CouchDB, MongoDB, Cassandra and Riak, and NewSQL such as NuoDB and VoltDB?
Dave Rosenthal: The most obvious response for the NoSQL data stores would be “we have ACID transactions, they don’t”, but the more important difference is in philosophy and strategy.
Each of those products exposes a single data model and interface. Maybe two. We are pursuing a fundamentally different strategy.
We are building a storage substrate that can be adapted, via layers, to provide a variety of data models, APIs, and true flexibility.
We can do that because of our transactional capabilities. CouchDB, MongoDB, Cassandra and Riak all have different APIs and we talk to companies that run all of those products side-by-side. The NewSQL database players are also offering a single data model, albeit a very popular one, SQL. FoundationDB is offering an ever increasing number of data models through its “layers”, currently including several popular NoSQL data models and with SQL being the next big one to hit. Our philosophy is that you shouldn’t have to increase the complexity of your architecture by adopting a new NoSQL database each time your engineers need access to a new data model.
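The “layer” idea can be sketched in a few lines: a stateless piece of code that maps a richer data model onto an ordered key-value store. The example below is an illustration only. It does not use FoundationDB’s client library, all names are hypothetical, and the plain dictionary stands in for a store that would in reality be distributed and transactional.

```python
from typing import Any, Dict, Iterator, Tuple

Key = Tuple[str, ...]


class OrderedKV:
    """Stand-in for an ordered key-value store with prefix scans."""

    def __init__(self) -> None:
        self._data: Dict[Key, Any] = {}

    def set(self, key: Key, value: Any) -> None:
        self._data[key] = value

    def scan(self, prefix: Key) -> Iterator[Tuple[Key, Any]]:
        # Yield entries in key order whose key starts with the prefix.
        for key in sorted(self._data):
            if key[:len(prefix)] == prefix:
                yield key, self._data[key]


def _flatten(doc: Dict[str, Any], path: Key = ()) -> Iterator[Tuple[Key, Any]]:
    """Turn a nested dict into (path, leaf value) pairs."""
    for field, value in doc.items():
        if isinstance(value, dict):
            yield from _flatten(value, path + (field,))
        else:
            yield path + (field,), value


class DocumentLayer:
    """Stateless document model on top of the key-value store."""

    def __init__(self, kv: OrderedKV) -> None:
        self.kv = kv

    def put(self, doc_id: str, doc: Dict[str, Any]) -> None:
        # One key per leaf field: (doc_id, field, subfield, ...).
        for path, value in _flatten(doc):
            self.kv.set((doc_id,) + path, value)

    def get(self, doc_id: str) -> Dict[str, Any]:
        # Rebuild the nested document from one prefix scan.
        doc: Dict[str, Any] = {}
        for key, value in self.kv.scan((doc_id,)):
            *parents, leaf = key[1:]
            node = doc
            for part in parents:
                node = node.setdefault(part, {})
            node[leaf] = value
        return doc


kv = OrderedKV()
docs = DocumentLayer(kv)
docs.put("user:1", {"name": "Ada", "address": {"city": "London"}})
print(docs.get("user:1"))
```

Because the layer itself holds no state, a second layer with a different data model could share the same store. In a real system each put or get would run inside one transaction, which is the property the answer credits for making layers possible. This sketch drops empty nested dictionaries and does not delete stale fields on overwrite.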
Q6. Cloud computing and open source: How does it relate to FoundationDB?
Dave Rosenthal: Cloud computing: FoundationDB has been designed from the beginning to run well in cloud environments that make use of large numbers of commodity machines connected through a network. Probably the most important aspect of a distributed database designed for cloud deployment is exceptional fault tolerance under very harsh and strange failure conditions – the kind of exceptionally unlikely things that can only happen when you have many machines working together with components failing unpredictably. We have put a huge amount of effort into testing FoundationDB in these grueling scenarios, and feel very confident in our ability to perform well in these types of environments. In particular, we have users running FoundationDB successfully on many different cloud providers, and we’ve seen the system keep its guarantees under real-world hardware and network failure conditions experienced by our users.
Open source: Although FoundationDB’s core data storage engine is closed source, our layer ecosystem is open source. Although the core data storage engine has a very simple feature set, and is very difficult to properly modify while maintaining correctness, layers are very feature rich and because they are stateless, are much easier to create and modify which makes them well suited to third-party contributions.
Q7. Please give some examples of use cases where FoundationDB is currently in use. Is FoundationDB in use for analyzing Big Data as well?
Dave Rosenthal: Some examples: User data, meta data, user social graphs, geo data, via ORMs using the SQL layer, metrics collection, etc.
We’ve mostly focused on operational systems, but a few of our customers have built what I would call “big data” applications, which I think of as analytics-focused. The most common use case has been for collecting and analyzing time-series data. FoundationDB is strongest in big data applications that call for lots of random reads and writes, not just big table scans—which many systems can do well.
Q8. Rick Cattell said in a recent interview: “there aren’t enough open source contributors to keep projects competitive in features and performance, and the companies supporting the open source offerings will have trouble making enough money to keep the products competitive themselves”. What is your opinion on this?
Dave Rosenthal: People have great ideas for databases all the time. New data models, new query languages, etc.
If nothing else, this NoSQL experiment that we’ve all been a part of the past few years has shown us all the appetite for data models suited to specific problems. They would love to be able to build these tools, open source them, etc.
The problem is that the checklist of practical considerations for a database is huge: Fault tolerance, scalability, a backup solution, management and monitoring, ACID transactions, etc. Add those together and even the simplest concept sounds like a huge project.
Our vision at FoundationDB is that we have done the hard work to build a storage substrate that simultaneously solves all those tricky practical problems. Our engine can be used to quickly build a database layer for any particular application that inherits all of those solutions and their benefits, like scalability, fault tolerance and ACID compliance.
Q9. Nick Heudecker of Gartner predicts that “going forward, we see the bifurcation between relational and NoSQL DBMS markets diminishing over time”. What is your take on this?
Dave Rosenthal: I do think that the lines between SQL and NoSQL will start to blur and I believe that we are leading that charge. We acquired another database startup last year called Akiban that builds an amazing SQL database engine.
In 2014 we’ll be bringing that engine to market as a layer running on top of FoundationDB. That will be a true ANSI SQL database operating as a module directly on top of a transactional “NoSQL” engine, inheriting the operational benefits of our core storage engine – scalability, fault tolerance, ease of operation.
When you run multiple of the SQL layer modules, you can point many of them at the same key-space in FoundationDB and it’s as if they are all part of the same database, with ACID transactions enforced across the separate SQL layer processes.
It’s very cool. Of course, you can even run the SQL layer on a FoundationDB cluster that’s also supporting other data models, like graph or document. That’s about as blurry as it gets.
Dave Rosenthal is CEO of FoundationDB. Dave started his career in games, building a 3D real-time strategy game with a team of high-school friends that won the 1st annual Independent Games Festival. Previously, Dave was CTO at Visual Sciences, a pioneering web-analytics company that is now part of Adobe. Dave has a degree in theoretical computer science from MIT.
“Many tools now exist to run database software without installing software. From vagrant boxes, to one click cloud install, to a cloud service that doesn’t require any installation, developer ease of use has always been a path to storage platform success.”–Brian Bulkowski.
The fifth interview in the “Big Data: three questions to” series of interviews is with Brian Bulkowski, Aerospike co-founder and CTO.
Q1. What is your current product offering?
Brian Bulkowski: Aerospike is the first in-memory NoSQL database optimized for flash or solid state drives (SSDs).
In-memory for speed and NoSQL for scale. Our approach to memory is unique – we have built our own file system to access flash, we store indexes in DRAM and you can configure data sets to be in a combination of DRAM or flash. This gives you close to DRAM speeds, the persistence of rotational drives and the price performance of flash.
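As a rough illustration of keeping the index in DRAM while the data sits on a persistent device, here is a minimal sketch. It is not Aerospike's implementation: `IndexedLog` is a hypothetical name, an ordinary file stands in for raw flash, and a Python dict stands in for the in-memory index.

```python
import os
import tempfile


class IndexedLog:
    """Toy store: the index lives in a dict (RAM), values live in a file."""

    def __init__(self, path):
        self._file = open(path, "w+b")
        self._index = {}  # key -> (offset, length) of the latest value

    def put(self, key, value):
        # Append the value and remember where it landed.
        self._file.seek(0, os.SEEK_END)
        offset = self._file.tell()
        self._file.write(value)
        self._file.flush()
        self._index[key] = (offset, len(value))

    def get(self, key):
        # One index lookup in memory, then one positioned read from the file.
        offset, length = self._index[key]
        self._file.seek(offset)
        return self._file.read(length)

    def close(self):
        self._file.close()


path = os.path.join(tempfile.mkdtemp(), "data.log")
log = IndexedLog(path)
log.put("user:1", b"alice")
log.put("user:2", b"bob")
log.put("user:1", b"alicia")  # newer value; the index now points at it
print(log.get("user:1"))
log.close()
```

The point of the layout is that a read never has to search the storage device: the in-memory index says exactly where to read.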
As next gen apps scale up beyond enterprise scale to “global scale”, managing billions of rows, terabytes of data and processing from 20k to 2 million read/write transactions per second, scaling costs are an important consideration. Servers, DRAM, power and operations – the costs add up, so even developers with small initial deployments must architect their systems with the bottom line in mind and take advantage of flash.
Aerospike is an operational database, a fast key-value store with ACID properties – immediate consistency for single row reads and writes, plus secondary indexes and user defined functions. Values can be simple strings, ints, blobs as well as lists and maps.
Queries are distributed and processed in parallel across the cluster and results on each node can be filtered, transformed, aggregated via user defined functions. This enables developers to enhance key value workloads with a few queries and some in-database processing.
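The filter, transform and aggregate flow described here can be sketched as follows. The function and parameter names (`run_query`, `predicate`, `transform`, `combine`) are assumptions for illustration and do not correspond to Aerospike's client API; a plain loop stands in for nodes working in parallel.

```python
from functools import reduce


def run_query(nodes, predicate, transform, combine, initial):
    """Apply user-supplied functions on each node, then merge the partial results.

    `initial` must be an identity value for `combine` (for example 0 for a sum).
    """
    partials = []
    for records in nodes:  # in a real cluster each node would do this locally
        local = initial
        for record in records:
            if predicate(record):
                local = combine(local, transform(record))
        partials.append(local)
    # Only the small per-node partial results travel back to be merged.
    return reduce(combine, partials, initial)


nodes = [
    [{"segment": "sports", "clicks": 3}, {"segment": "news", "clicks": 5}],
    [{"segment": "sports", "clicks": 4}],
    [{"segment": "news", "clicks": 1}, {"segment": "sports", "clicks": 2}],
]

sports_clicks = run_query(
    nodes,
    predicate=lambda r: r["segment"] == "sports",
    transform=lambda r: r["clicks"],
    combine=lambda a, b: a + b,
    initial=0,
)
print(sports_clicks)  # 9
```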
Q2. Who are your current customers and how do they typically use your products?
Brian Bulkowski: We see two use cases – one as an edge database or real-time context store (user profile store, cookie store) and another as a very cost-effective and reliable cache in front of a relational database like MySQL or DB2.
Our customers are some of the biggest names in real-time bidding, cross channel (display, mobile, video, social, gaming) advertising and digital marketing, including AppNexus, BlueKai, TheTradeDesk and [X+1]. These companies use Aerospike to store real-time user profile information like cookies, device-ids, IP addresses, clickstreams, combined with behavioral segment data calculated using analytics platforms and models run in Hadoop or data warehouses. They choose Aerospike for predictable high performance, where reads and writes consistently, meaning 99% of the time, complete within 2-3 milliseconds.
The second set of customers use us in front of an existing database for more cost-effective and reliable caching. In addition to predictable high performance they don’t want to shard Redis, and they need persistence, high availability and reliability. Some need rack-awareness and cross data center support and they all want to take advantage of Aerospike deployments that are both simpler to manage and more cost-effective than alternative NoSQL databases, in-memory databases and caching technologies.
Q3. What are the main new technical features you are currently working on and why?
Brian Bulkowski: We are focused on ease of use, making development easier – quickly writing powerful, scalable applications – with developer tools and connectors. In our Aerospike 3 offering, we launched indexes and distributed queries, user defined functions for in-database processing, expressive API support, and aggregation queries. Performance continues to improve, with support for today’s highly parallel CPUs, higher density flash arrays, and improved allocators for RAM based in-memory use cases.
Developers love Aerospike because it’s easy to run a service operationally. That scale comes after the developer builds their original applications, so developers want samples and connectors that are tested and work easily. Whether that’s an ETL loader for CSV and JSON that’s parallel and scalable, a Hadoop connector to pour insights directly to Aerospike in order to drive hot interface changes, or improving our Mac OSX client that developers need, or HTTP/REST interfaces, developers need the ability to write their core application code to easily use Aerospike.
Many tools now exist to run database software without installing software. From vagrant boxes, to one click cloud install, to a cloud service that doesn’t require any installation, developer ease of use has always been a path to storage platform success.
“The real problem is not collecting the data, even at insanely high speeds; the real problem is acting on it in time. This is where we have things like automatic stock trading systems. The database is integrated rather than separated from the application.” –Joe Celko.
I have interviewed Joe Celko, a well-known database expert, on the challenges of Big Data and when it makes sense to use Non-Relational Databases.
Q1. Three areas make today’s new data different from the data of the past: Velocity, Volume and Variety. Why?
Joe Celko: I did a keynote at a PostgreSQL conference in Prague with the title “Our Enemy, the Punch Card” on the theme that we had been mimicking the old data models with the new technology. This is no surprise; the first motion pictures were done with a single camera that never moved to mimic a seat at a theater.
Eventually, “moving picture shows” evolved into modern cinema. This is the same pattern in data. It is physically impossible to make a punch card and magnetic tape data move as fast as fiber optics, or hold as many bits. More importantly, the cost per bit dropped by orders of magnitude. Now it was practical to computerize everything! And since we can do it, and do it cheap, we will do it.
But what we found out is that this new, computerizable (is that a word?) data is not always traditionally structured data.
Q2. What about data Veracity? Is this a problem as well?
Q3. When information is changing faster than you can collect and query it, it simply cannot be treated the same as static data. What are the solutions available to solve this problem?
Joe Celko: I have to do a disclaimer here: I have done videos for Streambase and Kx Systems.
There is an old joke about two morons trying to fix a car. Q: “Is my signal light working?” A: “Yes. No. Yes. No. Yes. No. ..” but it summarizes the basic problem with streaming data. That is streaming data or “complex events” in the literature.
The model is that tables are replaced by streams of data, but the query language in Streambase is an extended SQL dialect.
The Victory of SELECT-FROM-WHERE!
The Kx products are more like C or other low level languages.
The real problem is not collecting the data, even at insanely high speeds; the real problem is acting on it in time. This is where we have things like automatic stock trading systems. The database is integrated rather than separated from the application.
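To make the "act on it in time" point concrete, here is a small sketch of a continuous query over a stream: each arriving value is checked against a sliding window the moment it arrives, rather than being stored first and queried later. The function name, window size and threshold are illustrative assumptions, not Streambase or Kx syntax.

```python
from collections import deque


def moving_average_alerts(ticks, window, threshold):
    """Yield (price, average) when a price exceeds the average of the
    previous `window` prices by more than `threshold`."""
    recent = deque(maxlen=window)
    for price in ticks:
        if len(recent) == window:
            average = sum(recent) / window
            if price - average > threshold:
                # The decision is made as the tick arrives.
                yield price, average
        recent.append(price)  # the oldest tick drops out automatically


ticks = iter([10.0, 10.0, 10.0, 10.0, 12.0, 10.0])
alerts = list(moving_average_alerts(ticks, window=4, threshold=1.0))
print(alerts)  # [(12.0, 10.0)]
```

Only the last few values are ever held in memory, which is what lets the logic keep up with data that never stops arriving.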
Q4. Old storage and access models do not work for big data. Why?
Joe Celko: First of all, the old stuff does not hold enough data. How would you put even a day’s worth of Wal-Mart sales on punch cards? Sequential access will not work; we need parallelism. We do not have time to index the data; the traditional tree indexing requires extra time, usually O(lg2(n)). Our best bets are perfect hashing functions and special hardware.
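The cost difference between tree-style and hash-style lookups can be seen in a small sketch. Binary search over a sorted list stands in for a tree index, and a Python dict stands in for hashing; a dict is an ordinary hash table, not the perfect hashing function mentioned above, so this illustrates only the growth in lookup steps.

```python
import math


def binary_search_steps(sorted_keys, target):
    """Return (position, probes), where probes counts the keys examined."""
    lo, hi, steps = 0, len(sorted_keys) - 1, 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_keys[mid] == target:
            return mid, steps
        if sorted_keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, steps


keys = list(range(100_000))
position, steps = binary_search_steps(keys, 99_999)
bound = math.floor(math.log2(len(keys))) + 1
print(f"binary search: {steps} probes (upper bound {bound})")

# A hash index answers the same question with one expected lookup,
# at the price of building and storing the table up front.
hash_index = {key: i for i, key in enumerate(keys)}
print("hash lookup:", hash_index[99_999])
```

The probe count for the sorted structure grows with the logarithm of the data size, while the hash lookup stays roughly constant on average.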
Q5. What are different ways available to store and access data such as petabytes and exabytes?
Joe Celko: Today, we are still stuck with moving disk. Optical storage is still too expensive and slow to write.
Solid State Disk is still too expensive, but dropping fast. My dream is really cheap solid state drives that have lots of processors in the drive which monitor a small subset of the data. We send out a command “Hey, minions, find red widgets and send me your results!” and it happens all at once. The ultimate Map-Reduce model in the hardware!
Q6. Not all data can fit into a relational model, including genetic data, semantic data, and data generated by social networks. How do you handle data variety?
Joe Celko: We have graph databases for social networks. I was a math major, so I love them. Graph theory has a lot of good problems and algorithms we can steal, just like SQL uses set theory and logic. But genetic data and semantics do not have a mature theory behind them. The real way to handle the diversity is new tools, starting at the conceptual level. How many times have you seen someone write 1960s COBOL file systems in SQL?
Q7. What are the alternative storage, query, and management frameworks needed by certain kinds of Big Data?
Joe Celko: As best you can, do not scare your existing staff with a totally new environment.
Q8. Columnar-data stores, graph-databases, streaming databases, analytic databases. How do you classify and evaluate all of these NewSQL/NoSQL solutions available?
Joe Celko: First decide what the problem is, then pick the tool. One of my war stories was consulting at a large California company that wanted to put their labor relations law library on their new DB2 database. It was all text, and used by lawyers. Lawyers do not know SQL. Lawyers do not want to learn SQL. But they do know Lexis and WestLaw text query tools. They know labor law and the special lingo. Programmers do not know labor law. Programmers do not want to learn labor law. But the programmers can set up a textbase for the lawyers.
Q9. If you were a user, how would you select the “right” data management tools and technology for the job?
Joe Celko: There is no generic answer. Oh, there will be a better answer by the time you get into production. Welcome to IT!
Joe Celko served 10 years on the ANSI/ISO SQL Standards Committee and contributed to the SQL-89 and SQL-92 Standards. Mr. Celko is the author of a series of books on SQL and RDBMS for Morgan-Kaufmann. He is an independent consultant based in Austin, Texas. He has written over 1300 columns in the computer trade and academic press, mostly dealing with data and databases.
“Joe Celko’s Complete Guide to NoSQL: What Every SQL Professional Needs to Know about Non-Relational Databases”, Paperback: 244 pages, Morgan Kaufmann; 1st edition (October 31, 2013), ISBN-10: 0124071929
“Big Data: Challenges and Opportunities” (.PDF), Roberto V. Zicari, Goethe University Frankfurt, ODBMS.org, October 5, 2012
“In a nutshell, pipelining is a programming technique that combines functions from the database system’s library of vector-based functions into an assembly line of processing for market data, with the output of one function becoming input for the next.”–Steven T. Graves.
The fourth interview in the “Big Data: three questions to” series of interviews is with Steven T. Graves, President and CEO of McObject.
Q1. What is your current product offering?
Steven T. Graves: McObject has two product lines. One is the eXtremeDB product family. eXtremeDB is a real-time embedded database system built on a core in-memory database system (IMDS) architecture, with the eXtremeDB IMDS edition representing the “standard” product. Other eXtremeDB editions offer special features and capabilities such as an optional SQL API, high availability, clustering, 64-bit support, optional and selective persistent storage, transaction logging and more.
In addition, our eXtremeDB Financial Edition database system targets real-time capital markets systems such as algorithmic trading and risk management (and has its own Web site). eXtremeDB Financial Edition comprises a super-set of the individual eXtremeDB editions (bundling together all specialized libraries such as clustering, 64-bit support, etc.) and offers features including columnar data handling and vector-based statistical processing for managing market data (or any other type of time series data).
Features shared across the eXtremeDB product family include: ACID-compliant transactions; multiple application programming interfaces (a native and type-safe C/C++ API; SQL/ODBC/JDBC; native Java, C# and Python interfaces); multi-user concurrency with an optional multi-version concurrency control (MVCC) transaction manager; event notifications; cache prioritization; and support for multiple database indexes (b-tree, r-tree, kd-tree, hash, Patricia trie, etc.). eXtremeDB’s footprint is small, with an approximately 150K code size. eXtremeDB is available for a wide range of server, real-time operating system (RTOS) and desktop operating systems, and McObject provides eXtremeDB source code for porting.
McObject’s second product offering is the Perst open source, object-oriented embedded database system, available in all-Java and all-C# (.NET) versions. Perst is small (code size typically less than 500K) and very fast, with features including ACID-compliant transactions; specialized collection classes (such as a classic b-tree implementation; r-tree indexes for spatial data; database containers optimized for memory-only access, etc.); garbage collection; full-text search; schema evolution; a “wrapper” that provides a SQL-like interface (SubSQL); XML import/export; database replication, and more.
Perst also operates in specialized environments. Perst for .NET includes support for .NET Compact Framework, Windows Phone 8 (WP8) and Silverlight (check out our browser-based Silverlight CRM demo, which showcases Perst’s support for storage on users’ local file systems). The Java edition supports the Android smartphone platform, and includes the Perst Lite embedded database for Java ME.
Q2. Who are your current customers and how do they typically use your products?
Steven T. Graves: eXtremeDB initially targeted real-time embedded systems, often residing in non-PC devices such as set-top boxes, telecom switches or industrial controllers.
There are literally millions of eXtremeDB-based devices deployed by our customers; a few examples are set-top boxes from DIRECTV (eXtremeDB is the basis of an electronic programming guide); F5 Networks’ BIG-IP network infrastructure (eXtremeDB is built into the devices’ proprietary embedded operating system); and BAE Systems (avionics in the Panavia Tornado GR4 combat jet). A recent new customer in telecom/networking is Compass-EOS, which has released the first photonics-based core IP router, using eXtremeDB High Availability to manage the device’s control plane database.
Addition of “enterprise-friendly” features (support for SQL, Java, 64-bit, MVCC, etc.) drove eXtremeDB’s adoption for non-embedded systems that demand fast performance. Examples include software-as-a-service provider hetras GmbH (eXtremeDB handles the most performance-intensive queries in its Cloud-based hotel management system); Transaction Network Services (eXtremeDB is used in a highly scalable system for real-time phone number lookups/routing); and MeetMe.com (formerly MyYearbook.com – eXtremeDB manages data in social networking applications).
In the financial industry, eXtremeDB is used by a variety of trading organizations and technology providers. Examples include the broker-dealer TradeStation (McObject’s database technology is part of its next-generation order execution system); Financial Technologies of India, Ltd. (FTIL), which has deployed eXtremeDB in the order-matching application used across its network of financial exchanges in Asia and the Middle East; and NSE.IT (eXtremeDB supports risk management in algorithmic trading).
Users of Perst are many and varied, too. You can find Perst in many commercial software applications such as enterprise application management solutions from the Wily Division of CA. Perst has also been adopted for community-based open source projects, including the Frost client for the Freenet global peer-to-peer network. Some of the most interesting Perst-based applications are mobile. For example, 7City Learning, which provides training for financial professionals, gives students an Android tablet with study materials that are accessed using Perst. Several other McObject customers use Perst in mobile medical apps.
Q3. What are the main new technical features you are currently working on and why?
Steven T. Graves: One feature we’re very excited about is the ability to pipeline vector-based statistical functions in eXtremeDB Financial Edition – we’ve even released a short video and a 10-page white paper describing this capability. In a nutshell, pipelining is a programming technique that combines functions from the database system’s library of vector-based functions into an assembly line of processing for market data, with the output of one function becoming input for the next.
This may not sound unusual, since almost any algorithm or program can be viewed as a chain of operations acting on data.
But this pipelining has a unique purpose and a powerful result: it keeps market data inside CPU cache as the data is being worked.
Without pipelining, the results of each function would typically be materialized outside cache, in temporary tables residing in main memory. Handing interim results back and forth “across the transom” between CPU cache and main memory imposes significant latency, which is eliminated by pipelining. We’ve been improving this capability by adding new statistical functions to the library. (For an explanation of pipelining that’s more in-depth than the video but shorter than the white paper, check out this article on the financial technology site Low-Latency.com.)
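The shape of that dataflow can be sketched as below. This is not eXtremeDB's API: `chunks`, `pipeline` and the two stage functions are hypothetical, and Python gives no control over CPU cache. The sketch only shows the structural idea, where each small block of data passes through every function before the next block is touched, so no full-size intermediate result is ever materialized.

```python
from itertools import islice


def chunks(values, size):
    """Split an iterable into small blocks (standing in for cache-sized tiles)."""
    it = iter(values)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block


def pipeline(source, *stages):
    """Run every stage on one block before moving on to the next block."""
    for block in source:
        for stage in stages:
            block = stage(block)  # output of one stage is input to the next
        yield block


def to_notional(block):
    # (price, volume) pairs -> traded value per trade
    return [price * volume for price, volume in block]


def cap_at_400(block):
    # Clip outliers; any element-wise vector function would fit here.
    return [min(value, 400) for value in block]


trades = [(100, 2), (101, 1), (99, 3), (102, 4), (100, 5)]
total = sum(sum(block) for block in pipeline(chunks(trades, 2), to_notional, cap_at_400))
print(total)  # 1398
```

The non-pipelined alternative would compute the whole `to_notional` result as one large temporary list, then make a second full pass for the cap, which is the "back and forth" traffic the answer describes.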
We are also adding to the capabilities of eXtremeDB Cluster edition to make clustering faster and more flexible, and further simplify cluster administration. Improvements include a local tables option, in which database tables can be made exempt from replication, but shareable through a scatter/gather mechanism. Dynamic clustering, added in our recent v. 5.0 upgrade, enables nodes to join and leave clusters without interrupting processing. This further simplifies administration for a clustering database technology that counts minimal run-time maintenance as a key benefit. On selected platforms, clustering now supports the Infiniband switched fabric interconnect and Message Passing Interface (MPI) standard. In our tests, these high performance networking options accelerated performance more than 7.5x compared to “plain vanilla” gigabit networking (TCP/IP and Ethernet).
“Some of our current priorities include: augmenting capabilities in the area of real-time analytics – especially around online operations, SQL functionality, integrations with messaging applications, statistics and monitoring procedures, and enhanced developer features.”– Ryan Betts.
The third interview in the “Big Data: three questions to” series of interviews is with Ryan Betts, CTO of VoltDB.
Q1. What are your current product offerings?
Ryan Betts: VoltDB is a high-velocity database platform that enables developers to build next generation real-time operational applications. VoltDB converges all of the following:
• A dynamically scalable in-memory relational database delivering high-velocity, ACID-compliant OLTP
• High-velocity data ingestion, with millions of writes per second
• Real-time analytics, to enable instant operational visibility at the individual event level
• Real-time decisioning, to enable applications to act on data when it is most valuable—the moment it arrives
Version 4.0 delivers enhanced in-memory analytics capabilities and expanded integrations. VoltDB 4.0 is the only high performance operational database that combines in-memory analytics with real-time transactional decision-making in a single system.
It gives organizations an unprecedented ability to extract actionable intelligence about customer and market behavior, website interactions, service performance and much more by performing real-time analytics on data moving at breakneck speed.
Specifically, VoltDB 4.0 features a tenfold throughput improvement of analytic queries and is capable of writes and reads on millions of data events per second. It provides large-scale concurrent, multiuser access to data, the ability to factor current incoming data into analytics, and enhanced SQL support. VoltDB 4.0 also delivers expanded integrations with an organization’s existing data infrastructure such as message queue systems, improved JDBC driver and monitoring utilities such as New Relic.
Q2. Who are your current customers and how do they typically use your products?
Ryan Betts: Customers use VoltDB for a wide variety of data-management functions, including data caching, stream processing and “on the fly” ETL.
Current VoltDB customers represent industries ranging from telecommunications to e-commerce, power & energy, financial services, online gaming, retail and more.
Following are common use cases:
• Optimized, real-time information delivery
• Personalized audience targeting
• Real-time analytics dashboards
• Caching server replacements
• Session / user management
• Network analysis & monitoring
• Ingestion and on-the-fly-ETL
Below are the customers that have been publicly announced thus far:
Q3. What are the main new technical features you are currently working on and why?
Ryan Betts: Our customers are reaping the benefits of VoltDB in the areas of transactional decision-making and generating real-time analytics on that data—right at the moment it’s coming in.
Therefore, some of our current priorities include: augmenting capabilities in the area of real-time analytics – especially around online operations, SQL functionality, integrations with messaging applications, statistics and monitoring procedures, and enhanced developer features.
Although VoltDB has proven to be the industry’s “easiest to use” database, we are also continuing to invest quite heavily in making the process of building and deploying real-time operational applications with VoltDB even easier. Among other things, we are extending the power and simplicity that we offer developers in building high throughput applications to building modest sized throughput applications.