ODBMS Industry Watch » Big Data Analytics http://www.odbms.org/blog
Trends and Information on Big Data, New Data Management Technologies, Data Science and Innovation.

On the future of Data Warehousing. Interview with Jacque Istok and Mike Waas
http://www.odbms.org/blog/2017/11/on-the-future-of-data-warehousing-interview-with-jacque-istok-and-mike-waas/
Thu, 09 Nov 2017

” Open source software comes with a promise, and that promise is not about looking at the code, rather it’s about avoiding vendor lock-in.” –Jacque Istok.

” The cloud has out-paced the data center by far and we should expect to see the entire database market being replatformed into the cloud within the next 5-10 years.” –Mike Waas.

I have interviewed Jacque Istok, Head of Data Technical Field for Pivotal, and Mike Waas, founder and CEO of Datometry.
The main topics of the interview are: the future of Data Warehousing, how open source and the Cloud are affecting the Data Warehouse market, and Datometry Hyper-Q and Pivotal Greenplum.


Q1. What is the future of Data Warehouses?

Jacque Istok: I believe that what we’re seeing in the market is a slight course correction with regards to the traditional data warehouse. For 25 years many of us spent many cycles building the traditional data warehouse.
The single source of the truth. But the long duration it took to get alignment from each of the business units regarding how the data related to each other, combined with the cost of the hardware and software of the platforms we built it upon, left everybody looking for something new. Enter Hadoop, and suddenly the world found out that we could split up data on commodity servers and, with the right human talent, move the ball forward faster and cheaper. Unfortunately the right human talent has proved hard to come by, and the plethora of projects that have sprung up are neither production ready nor completely compliant or compatible with the expensive tools they were trying to replace.
So what looks to be happening is the world is looking for the features of yesterday combined with the cost and flexibility of today. In many cases that will be a hybrid solution of many different projects/platforms/applications, or at the very least, something that can interface easily and efficiently with many different projects/platforms/applications.

Mike Waas: Indeed, flexibility is what most enterprises are looking for nowadays when it comes to data warehousing. The business needs to be able to tap data quickly and effectively. However, in today’s world we see an enormous access problem with application stacks that are tightly bonded with the underlying database infrastructure. Instead of maintaining large and carefully curated data silos, data warehousing in the next decade will be all about using analytical applications from a quickly evolving application ecosystem with any and all data sources in the enterprise: in short, any application on any database. I believe data warehouses remain the most valuable of databases, therefore, cracking the access problem there will be hugely important from an economic point of view.

Q2. How is open source affecting the Data Warehouse market?

Jacque Istok: The traditional data warehouse market is having its lunch eaten by open source, whether it’s one of the Hadoop distributions, one of the up-and-coming new NoSQL engines, or companies like Pivotal making large bets on open source, production-proven alternatives like Greenplum. What I ask prospective customers is: if you were starting a new organization today, what platforms, databases, or languages would you choose that weren’t open source? The answer is almost always none. Open source software comes with a promise, and that promise is not about looking at the code, rather it’s about avoiding vendor lock-in.

Mike Waas: Whenever a technology stack gets disrupted by open source, it’s usually a sign that the technology has reached a certain maturity and customers have begun doubting the advantage of proprietary solutions. For the longest time, analytical processing was considered too advanced and too far-reaching in scope for an open source project. Greenplum Database is a great example of breaking through this ceiling: it’s the first open source database system with a query optimizer that is not only worth that title but sets a new standard, plus a whole array of other goodies previously available only in proprietary systems.

Q3. Are databases an obstacle to adopting Cloud-Native Technology?

Jacque Istok: I believe quite the contrary: databases are a requirement for Cloud-Native Technology. Any applications that are created need to leverage data in some way. I think where the technology is going is to make it easier for developers to leverage whichever database or datastore makes the most sense for them, or the one they have the most experience with – essentially leveraging the right tool for the right job, instead of the tool “blessed” by IT or Operations for general use. And they are doing this by automating the day 0, day 1, and day 2 operations of those databases, making it easy for anyone to instantiate and use these platforms, which has never really been the case.

Mike Waas: In fact, a cloud-first strategy is incomplete unless it includes the data assets, i.e., the databases. Now, databases have always been one of the hardest things to move or replatform, and, naturally, they are the ultimate challenge when moving to the cloud: firing up any new instance in the cloud is easy as 1-2-3, but what do you do with the tens of years of investment in application development? I would say it’s actually not the database that’s the obstacle but the applications and their dependencies.

Q4. What are the pros and cons of moving enterprise data to the cloud?

Jacque Istok: I think there are plenty of pros to moving enterprise data to the cloud; the extent of that list will really depend on the enterprise you’re talking to and the vertical that they are in. But cons? The only cons would be using these incredible tools incorrectly, at which point you might find yourself spending more money and feeling that things are slower or less flexible. Treating the cloud as a virtual data center, and simply moving things there without changing how they are architected or how they are used, will not get you the benefits you moved for.

Mike Waas: I second that. A few years ago enterprises were still concerned about security, completeness of offering, and maturity of the stack. But now, the cloud has out-paced the data center by far and we should expect to see the entire database market being replatformed into the cloud within the next 5-10 years. This is going to be the biggest revolution in the database industry since the relational model with great opportunities for vendors and customers alike.

Q5. How do you quantify when it is appropriate for an enterprise to move their data management to a new platform?

Jacque Istok: It’s pretty easy from my perspective: when any enterprise is done spending exorbitant amounts of money, it might be time to move to a new platform. When you are coming up on a renewal or an upgrade of a legacy and/or expensive system, it might be time to move to a new platform. When you have new initiatives to start, it might be time to move to a new platform. When you are ready to compete with your competitors, both known and unknown (aka startups), it might be time to move to a new platform. The move doesn’t have to be scary either, as some products are designed to be a bridge to a modern data platform.

Mike Waas: Traditionally, enterprises have held off from replatforming for too long: the switching cost has deterred them from adopting new and highly superior technology, with the result that they have been unable to cut costs or gain true competitive advantage. Staying on an old platform is simply bad for business. Every organization needs to constantly ask itself whether its business can benefit from adopting new technology. At Datometry, we make it easy for enterprises to move their analytics — so easy, in fact, the standard reaction to our technology is, “this is too good to be true.”

Q6. What is the biggest problem when enterprises want to move part or all of their data management to the cloud?

Jacque Istok: I think the biggest problem tends to be not architecting for the cloud itself, but instead treating the cloud like their virtual data center. Leveraging the same techniques, the same processes, and the same architectures will not lead to the cost or scalability efficiencies that you were hoping for.

Mike Waas: As Jacque points out, you really need to change your approach. However, the temptation is to use the move to the cloud as a trigger event to rework everything else at the same time. This quickly leads to projects that spiral out of control, run long, go over budget, or fail altogether. Being able to replatform quickly and separate the housekeeping from the actual move is, therefore, critical.
However, when it comes to databases, trouble runs deeper, as applications and their dependencies on specific databases are the biggest obstacle. SQL code is embedded in thousands of applications and, perhaps most surprisingly, even third-party products that promise portability between databases naturally get contaminated with system-specific configuration and SQL extensions. We see roughly 90% of third-party systems (ETL, BI tools, and so forth) having been so customized to the underlying database that moving them to a different system requires substantial effort, time, and money.

Q7. How does an enterprise move the data management to a new platform without having to re-write all of the applications that rely on the database?

Mike Waas: At Datometry, we looked very carefully at this problem and, with what I said above, identified the need to rewrite applications each time new technology is adopted as the number one problem in the modern enterprise. Using Adaptive Data Virtualization (ADV) technology, this will quickly become a problem of the past! Systems like Datometry Hyper-Q let existing applications run natively and instantly on a new database without requiring any changes to the application. What would otherwise be a multi-year migration project running into the millions is now reduced in time, cost, and risk to a fraction of the conventional approach. “VMware for databases” is a great mental model that has worked really well for our customers.

Q8. What is Adaptive Data Virtualization technology, and how can it help adopting Cloud-Native Technology?

Mike Waas: Adaptive Data Virtualization is the simple, yet incredibly powerful, abstraction of a database: by intercepting the communication between application and database, ADV is able to translate dynamically, in real time, between the existing application and the new database. With ADV, we are drawing on decades of database research and solving what is essentially a compatibility problem between programming languages and systems with an elegant and highly effective approach. This is a space that has traditionally been served by consultants and manual migrations, which are incredibly labor-intensive and expensive undertakings.
Through ADV, adopting cloud technology becomes orders of magnitude simpler as it takes away the compatibility challenges that hamper any replatforming initiative.
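As a rough illustration of the idea, and emphatically not Datometry's actual implementation, a dialect translator can intercept each SQL statement in flight and rewrite source-specific constructs before they reach the target database. The toy rules below cover three genuine Teradata-isms; a production system parses the SQL rather than pattern-matching it:

```python
import re

# Toy Teradata -> Greenplum/PostgreSQL rewrite rules. These are real
# dialect differences, but a real translator works on a parsed query
# tree, not regular expressions.
REWRITE_RULES = [
    # Teradata allows SEL as shorthand for SELECT
    (re.compile(r"\bSEL\b", re.IGNORECASE), "SELECT"),
    # ZEROIFNULL(x) has no direct equivalent; COALESCE(x, 0) matches it
    (re.compile(r"\bZEROIFNULL\s*\(([^)]*)\)", re.IGNORECASE),
     r"COALESCE(\1, 0)"),
    # Teradata's infix MOD operator becomes %
    (re.compile(r"(\S+)\s+MOD\s+(\S+)", re.IGNORECASE), r"\1 % \2"),
]

def translate(teradata_sql: str) -> str:
    """Rewrite a Teradata statement into the target dialect on the fly,
    the way a virtualization layer would while proxying the wire
    protocol between application and database."""
    sql = teradata_sql
    for pattern, replacement in REWRITE_RULES:
        sql = pattern.sub(replacement, sql)
    return sql
```

Even this toy version shows why the approach beats hand-migration: the application keeps emitting Teradata SQL and never knows the backend changed.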

Q9. Can you quantify what are the reduced time, cost, and risk when virtualizing the data warehouse?

Jacque Istok: In the past, virtualizing the data warehouse meant sacrificing performance in order to get some of the common benefits of virtualization (reduced time for experimentation, maximizing resources, relative ease to readjust the architecture, etc). What we have found recently is that virtualization, when done correctly, actually provides no sacrifices in terms of performance, and the only question becomes whether or not the capital cost expenditure of bare metal versus the opex cost structure of virtual is something that makes sense for your organisation.

Mike Waas: I’d like to take it a step further and include ADV into this context too: instead of a 3-5 year migration, employing 100+ consultants, and rewriting millions of lines of application code, ADV lets you leverage new technology in weeks, with no re-writing of applications. Our customers can expect to save at least 85% of the transition cost.

Q10. What is the massively parallel processing (MPP) Scatter/Gather Streaming™ technology, and what is it useful for?

Jacque Istok: This is arguably one of the most powerful features of Pivotal Greenplum, and it allows for the fastest loading of data in the industry. Effectively, we scatter data into the Greenplum cluster as fast as possible, with no care in the world as to where it will ultimately end up. Terabytes of data per hour, basically as much as you can feed down the wires, is sent to each of the workers within the cluster. The data is therefore disseminated to the cluster in the fastest physical way possible. At that point, each of the workers gathers the data that is pertinent to it according to the architecture you have chosen for the layout of those particular data elements, allowing a physical optimization to be leveraged during interrogation of the data after it has been loaded.
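The two phases described above can be sketched in a few lines. This is a toy model, not Greenplum's implementation: rows are sprayed round-robin at ingest with no placement logic, then each worker redistributes its rows by a hash of the chosen distribution key so every segment ends up owning a deterministic slice:

```python
from collections import defaultdict

def scatter(rows, n_workers):
    """Phase 1: spray incoming rows across workers round-robin, with no
    regard for where they ultimately belong, so loading runs as fast as
    the wire allows."""
    workers = [[] for _ in range(n_workers)]
    for i, row in enumerate(rows):
        workers[i % n_workers].append(row)
    return workers

def gather(workers, key, n_workers):
    """Phase 2: every worker redistributes its rows by hash of the
    distribution key, so each final segment owns a predictable slice
    that later queries can exploit."""
    segments = defaultdict(list)
    for worker_rows in workers:
        for row in worker_rows:
            segments[hash(row[key]) % n_workers].append(row)
    return segments
```

The point of the split is that the expensive placement decision happens in parallel on the workers, not on the single machine feeding the load.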

Q11. How do Datometry Hyper-Q and the Pivotal Greenplum data warehouse work together?

Jacque Istok: Pivotal Greenplum is the world’s only true open source, production-proven MPP data platform that provides out-of-the-box ANSI-compliant SQL capabilities along with Machine Learning, AI, Graph, Text, and Spatial analytics all in one. When combined with Datometry Hyper-Q, you can transparently and seamlessly take any Teradata application and, without changing a single line of code or a single piece of SQL, run it and stop paying the outrageous Teradata tax that you have been bearing all this time. Once you’re able to take out your legacy and expensive Teradata system, without a long investment to rewrite anything, you’ll be able to leverage this software platform to really start to analyze the data you have. And that analysis can be either on premise or in the cloud, giving you a truly hybrid and cross-cloud proven platform.

Mike Waas: I’d like to share a use case featuring Datometry Hyper-Q and Pivotal Greenplum, involving a Fortune 100 global financial institution that needed to scale its business intelligence application, built using 2,000-plus stored procedures. The customer’s analysis showed that replacing their existing data warehouse footprint was prohibitively expensive, and rewriting the business applications for a more cost-effective and modern data warehouse posed significant expense and business risk. Hyper-Q allowed the customer to transfer the stored procedures in days, without refactoring the logic of the application or re-implementing various control-flow primitives, a time-consuming and expensive proposition.

Qx. Anything else you wish to add?

Jacque Istok: Thank you for the opportunity to speak with you. We have found that there has never been a more valid time than right now for customers to stop paying their heavy Teradata tax, and the combination of Pivotal Greenplum and Datometry Hyper-Q allows them to do that right now, with no risk and immediate ROI. On top of that, they then find themselves on a modern data platform – one that allows them to grow into more advanced features as they are able. Pivotal Greenplum becomes their bridge to transforming their organization, offering the advanced analytics they need while giving them traditional, production-proven capabilities immediately. At the end of the day, there isn’t a single Teradata customer I’ve spoken to that doesn’t want Teradata-like capabilities at Hadoop-like prices, and you get all this and more with Pivotal Greenplum.

Mike Waas: Thank you for this great opportunity to speak with you. We, at Datometry, believe that data is the key that will unlock competitive advantage for enterprises and without adopting modern data management technologies, it is not possible to unlock value. According to the leading industry group, TDWI, “today’s consensus says that the primary path to big data’s business value is through the use of so-called ‘advanced’ forms of analytics based on technologies for mining, predictions, statistics, and natural language processing (NLP). Each analytic technology has unique data requirements, and DWs must modernize to satisfy all of them.”
We believe virtualizing the data warehouse is the cornerstone of any cloud-first strategy, because data warehouse migration is one of the most risk-laden and most expensive initiatives that a company can embark on during their journey to the cloud.
Interestingly, the cost of migration is primarily the cost of process and not technology and this is where Datometry comes in with its data warehouse virtualization technology.
We are the key that unlocks the power of the latest technology for enterprises, helping them gain competitive advantage.

Jacque Istok serves as the Head of Data Technical Field for Pivotal, responsible for setting both data strategy and execution of pre- and post-sales activities for data engineering and data science. Prior to that, he was Field CTO, helping customers architect and understand how the entire Pivotal portfolio could be leveraged appropriately.
A hands-on technologist, Mr. Istok has been implementing and advising customers in the architecture of big data applications and back-end infrastructure for the majority of his career.

Prior to Pivotal, Mr. Istok co-founded Professional Innovations, Inc. in 1999, a leading consulting services provider in the business intelligence, data warehousing, and enterprise performance management space, and served as its President and Chairman. Mr. Istok is on the board of several emerging startup companies and serves as their strategic technical advisor.

Mike Waas, CEO Datometry, Inc.
Mike Waas founded Datometry after having spent over 20 years in database research and commercial database development. Prior to Datometry, Mike was Sr. Director of Engineering at Pivotal, heading up Greenplum’s Advanced R&D team. He is also the founder and architect of Greenplum’s ORCA query optimizer initiative. Mike has held senior engineering positions at Microsoft, Amazon, Greenplum, EMC, and Pivotal, and was a researcher at Centrum voor Wiskunde en Informatica (CWI), Netherlands, and at Humboldt University, Berlin.

Mike received his M.S. in Computer Science from University of Passau, Germany, and his Ph.D. in Computer Science from the University of Amsterdam, Netherlands. He has authored or co-authored 36 publications on the science of databases and has 24 patents to his credit.


– Datometry Releases Hyper-Q Data Warehouse Virtualization Software Version 3.0. AUGUST 11, 2017

– Replatforming Custom Business Intelligence | Use Case. ODBMS.org, NOVEMBER 7, 2017

– Disaster Recovery Cloud Data Warehouse | Use Case. ODBMS.org, NOVEMBER 3, 2017

– Scaling Business Intelligence in the Cloud | Use Case. ODBMS.org, NOVEMBER 3, 2017

– Re-Platforming Data Warehouses – Without Costly Migration Of Applications. ODBMS.org, NOVEMBER 3, 2017

– Meet Greenplum 5: The World’s First Open-Source, Multi-Cloud Data Platform Built for Advanced Analytics. ODBMS.org, SEPTEMBER 21, 2017

Related Posts

– On Open Source Databases. Interview with Peter Zaitsev. ODBMS Industry Watch, Published on 2017-09-06

– On Apache Ignite, Apache Spark and MySQL. Interview with Nikita Ivanov. ODBMS Industry Watch, Published on 2017-06-30

– On the new developments in Apache Spark and Hadoop. Interview with Amr Awadallah. ODBMS Industry Watch, Published on 2017-03-13

Follow us on Twitter: @odbmsorg


Identity Graph Analysis at Scale. Interview with Niels Meersschaert
http://www.odbms.org/blog/2017/05/interview-with-niels-meersschaert/
Tue, 09 May 2017

“I’ve found the best engineers actually have art backgrounds or interests. The key capability is being able to see problems from multiple perspectives, and realizing there are multiple solutions to a problem. Music, photography and other arts encourage that.”–Niels Meersschaert.

I have interviewed Niels Meersschaert, Chief Technology Officer at Qualia. The Qualia team relies on over one terabyte of graph data in Neo4j, combined with larger amounts of non-graph data to provide major companies with consumer insights for targeted marketing and advertising opportunities.


Q1. Your background is in Television & Film Production. How does it relate to your current job?

Niels Meersschaert: Engineering is a lot like producing. You have to understand what you are trying to achieve, understand what parts and roles you’ll need to accomplish it, all while doing it within a budget. I’ve found the best engineers actually have art backgrounds or interests. The key capability is being able to see problems from multiple perspectives, and realizing there are multiple solutions to a problem. Music, photography and other arts encourage that. Engineering is both art and science, and creativity is a critical skill for the best engineers. I also believe that a breadth of languages is critical for engineers.

Q2. Your company collects data on more than 90% of American households. What kind of data do you collect and how do you use such data?

Niels Meersschaert: We focus on high quality data that is indicative of commercial intent. Some examples include wishlist interaction, content consumption, and location data. While our data covers a huge swath of the American population, a key feature is that we have no personally identifiable information. We use anonymous unique identifiers.
So, we know this ID did actions indicative of interest in a new SUV, but we don’t know their name, email address, phone number or any other personally identifiable information about a consumer. We feel this is a good balance of commercial need and individual privacy.
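One common way to get stable anonymous identifiers without retaining any PII is keyed hashing: the same consumer always maps to the same opaque ID, but nothing personal can be recovered from it. The sketch below is illustrative only, not necessarily Qualia's actual scheme, and the key value is hypothetical:

```python
import hashlib
import hmac

# Hypothetical secret; in practice this would live in a key store
# and be rotated on a schedule.
SECRET_KEY = b"hypothetical-rotating-secret"

def anonymous_id(raw_identifier: str) -> str:
    """Replace a raw identifier (email, device serial, cookie value)
    with a stable but non-reversible anonymous ID via HMAC-SHA256.
    Identical inputs always yield the same ID, so behavior can be
    linked over time without storing who the person is."""
    digest = hmac.new(SECRET_KEY, raw_identifier.encode(), hashlib.sha256)
    return digest.hexdigest()[:16]
```

The keyed construction matters: with a plain unsalted hash, anyone holding a list of known emails could recompute the IDs and re-identify people.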

Q3. If you had to operate with data from Europe, what would be the impact of the new EU General Data Protection Regulation (GDPR) on your work?

Niels Meersschaert: Europe is a very different market than the U.S., and many of the regulations you mentioned do require a different approach to understanding consumer behaviors. Given that we avoid personal IDs, our approach is already better situated than that of many peers that rely on PII.

Q4. Why did you choose a graph database to implement your consumer behavior tracking system?

Niels Meersschaert: Our graph database is used for ID management. We don’t use it for understanding the intent data, but rather recognizing IDs. Conceptually, describing the various IDs involved is a natural fit for a graph.
As an example, a conceptual consumer could be thought of as the top of the graph. That consumer uses many devices and each device could have 1 or more anonymous IDs associated with it, such as cookie IDs. Each node can represent an associated device or ID and the relationships between each node allow us to see the path. A key element we have in our system is something we call the Borg filter. It’s a bit of a reference to Star Trek, but essentially when we find a consumer is too connected, i.e. has dozens or hundreds of devices, we remove all those IDs from the graph as clearly something has gone wrong. A graph database makes it much easier to determine how many connected nodes are at each level.

Q5. Why did you choose Neo4j?

Niels Meersschaert: Neo4J had a rich query language and very fast performance, especially if your hot set was in RAM.

Q6. You manage one terabyte of graph data in Neo4j. How do you combine them with larger amounts of non-graph data?

Niels Meersschaert: You can think of the graph as a compression system for us. While consumer actions occur on multiple devices and anonymous IDs, they represent the actions of a single consumer. This actually simplifies things for us, since the number of unique grouping IDs is much smaller than the number of unique source IDs. It also allows us to eliminate non-human IDs from the graph. This does mean we see the world in different ways than many peers. As an example, if you focus only on cookie IDs, you tend to have a much larger number of unique IDs than the actual consumers they represent. Sadly, the same thing happens with website monthly uniques: many are highly inflated, both in the number of unique people they represent and because many of the IDs are non-human. Ultimately, the entire goal of advertising is to influence consumers, so we feel that having the better representation of actual consumers allows us to be more effective.
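The grouping that makes the graph act as a compression system is, at heart, connected components over pairs of IDs observed acting together, which a union-find structure computes cheaply. A sketch of the idea (illustrative, not Qualia's pipeline):

```python
class UnionFind:
    """Minimal disjoint-set structure for merging co-observed IDs."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            # Path halving keeps lookups near-constant amortized time.
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def group_ids(linked_pairs):
    """Collapse pairs of anonymous IDs seen together (same session,
    same household signal, etc.) into consumer groups: many source IDs
    in, far fewer grouping IDs out."""
    uf = UnionFind()
    for a, b in linked_pairs:
        uf.union(a, b)
    groups = {}
    for node in list(uf.parent):
        groups.setdefault(uf.find(node), set()).add(node)
    return list(groups.values())
```

Downstream systems can then operate on the one group ID per consumer, which is exactly the efficiency win described above.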

Q7. What are the technical challenges you face when blending data with different structure?

Niels Meersschaert: A key challenge is finding some unifying element between different systems or structures that links the data. What we did with Neo4J is create a unique property on the nodes that we use for interchange. The internal node IDs that are part of Neo4J aren’t something we use except internally within the graph DB.

Q8. If your data is sharded manually, how do you handle scalability?

Niels Meersschaert: We don’t shard the data manually, but scalability is one of the biggest challenges. We’ve spent a lot of time tuning queries and grouping operations to take advantage of some of the capabilities of Neo4J and to work around some limitations it has. The vast majority of graph customers wouldn’t have the volume nor the volatility of data that we do, so our challenges are unique.

Q9. What other technologies do you use, and how do they interact with Neo4j?

Niels Meersschaert: We use the classic big data tools like Hadoop and Spark. We also use MongoDB and Google’s Big Query. If you look at the graph as the truth set of device IDs, we interact with it on ingestion and export only. Everything in the middle can operate on the consumer ID, which is far more efficient.

Q10. How do you measure the ROI of your solution?

Niels Meersschaert: There are a few factors we consider. First is how much does the infrastructure cost us to process the data and output? How fast is it in terms of execution time? How much development effort does it take relative to other solutions? How flexible is it for us to extend it? This is an ever evolving situation and one we always look at how to improve, especially as a smaller business.


Niels Meersschaert
I’ve been coding since I was 7 years old on an Apple II. I’d built radio control model cars and aircraft as a child and built several custom chassis using controlled flex as suspension to keep weight & parts count down. So, I’d had an early interest in both software and physical engineering.

My father was from the Netherlands and my maternal grandfather was a linguist fluent in 43 languages. When I was a kid, my father worked for the airlines, so we traveled often to Europe to see family, and I grew up multilingual. Computer languages are just different ways to describe something; the basic concepts are similar, just as they are in spoken languages, albeit with different grammatical and syntax structures. Whether you’re speaking French, or writing a program in Python or C, the key is that you are trying to get your communication across to the target of your message, whether it is another person or a computer.

I originally started university in aeronautical engineering, but in my sophomore year, Grumman let go about 3000 engineers, so I didn’t think the career opportunities would be as great. I’d always viewed problem solutions as a combination of art & science, so I switched majors to one in which I could combine the two.

After school I worked producing and editing commercials and industrials, often with special effects. I got into web video early on & spent a lot of time on compression and distribution systems. That led to working on search, and bringing the linguistics back front and center again. I then combined the two and came full circle back to advertising, but from the technical angle at Magnetic, where we built search retargeting. At Qualia, we kicked this into high gear, where we understand consumer intent by analyzing sentiment, content and actions across multiple devices and environments and the interaction and timing between them to understand the point in the intent path of a consumer.


EU General Data Protection Regulation (GDPR):

Reform of EU data protection rules

European Commission – Fact Sheet Questions and Answers – Data protection reform

General Data Protection Regulation (Wikipedia)

Neo4j Sandbox: The Neo4j Sandbox enables you to get started with Neo4j, with built-in guides and sample datasets for popular use cases.

Related Posts

LDBC Developer Community: Benchmarking Graph Data Management Systems. ODBMS.org, 6 APR, 2017

Graphalytics benchmark. ODBMS.org, 6 APR, 2017
The Graphalytics benchmark is an industrial-grade benchmark for graph analysis platforms such as Giraph. It consists of six core algorithms, standard datasets, synthetic dataset generators, and reference outputs, enabling the objective comparison of graph analysis platforms.

Collaborative Filtering: Creating the Best Teams Ever. By Maurits van der Goes, Graduate Intern | February 16, 2017

Follow us on Twitter: @odbmsorg


How the 11.5 million Panama Papers were analysed. Interview with Mar Cabra
http://www.odbms.org/blog/2016/10/how-the-11-5-million-panama-papers-were-analysed-interview-with-mar-cabra/
Tue, 11 Oct 2016

“The best way to explore all The Panama Papers data was using graph database technology, because it’s all relationships, people connected to each other or people connected to companies.” –Mar Cabra.

I have interviewed Mar Cabra, head of the Data & Research Unit of the International Consortium of Investigative Journalists (ICIJ). Main subject of the interview is how the 11.5 million Panama Papers were analysed.


Q1. What is the mission of the International Consortium of Investigative Journalists (ICIJ)?

Mar Cabra: Founded in 1997, the ICIJ is a global network of more than 190 independent journalists in more than 65 countries who collaborate on breaking big investigative stories of global social interest.

Q2. What is your role at ICIJ?

Mar Cabra: I am the Editor at the Data and Research Unit – the desk at the ICIJ that deals with data, analysis and processing, as well as supporting the technology we use for our projects.

Q3. The Panama Papers investigation was based on a 2.6 Terabyte trove of data obtained by Süddeutsche Zeitung and shared with ICIJ and a network of more than 100 media organisations. What was your role in this data investigation?

Mar Cabra: I co-ordinated the work of the team of developers and journalists that first got the leak from Süddeutsche Zeitung, then processed it to make it available online through secure platforms to more than 370 journalists.
I also supervised the data analysis that my team did to enhance and focus the stories. My team was also in charge of the interactive product that we produced for the publication stage of The Panama Papers, so we built an interactive visual application called the ‘Powerplayers’ where we detailed the main stories of the politicians with connections to the offshore world. We also released a game explaining how the offshore world works! Finally, in early May, we updated the offshore database with information about the Panama Papers companies, the 200,000-plus companies connected with Mossack Fonseca.

Q4. The leaked dataset are 11.5 million files from Panamanian law firm Mossack Fonseca. How was all this data analyzed?

Mar Cabra: We relied on Open Source technology and processes that we had worked on in previous projects to process the data. We used Apache Tika to process the documents and also to access them, and created a processing chain of 30 to 40 machines in Amazon Web Services which would process those documents in parallel, then index them into a document search platform that could be used by hundreds of journalists from anywhere in the world.

Q5. Why did you decide to use a graph-based approach for that?

Mar Cabra: Inside the 11.5 million files in the original dataset given to us, there were more than 3 million that came from Mossack Fonseca’s internal database, which basically contained names of companies in offshore jurisdictions and the people behind them. In other words, that’s a graph! The best way to explore all The Panama Papers data was using graph database technology, because it’s all relationships, people connected to each other or people connected to companies.

Q6. What were the main technical challenges you encountered in analysing such a large dataset?

Mar Cabra: We had already used all of the tools in this investigation on previous projects. The main issue here was dealing with many more files, in many more formats. So the main challenge was making all those files readable quickly, given that many of them were images.
Our next problem was how could we make them understandable to journalists that are not tech savvy. Again, that’s where a graph database became very handy, because you don’t need to be a data scientist to work with a graph representation of a dataset, you just see dots on a screen, nodes, and then just click on them and find the connections – like that, very easily, and without having to hand-code or build queries. I should say you can build queries if you want using Cypher, but you don’t have to.
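The "click on a node, see its connections" exploration Mar describes can be sketched with a toy in-memory graph: entities as nodes, relationships as edges, and a breadth-first search that surfaces the chain linking two entities. All names here are invented, and a real deployment would issue a Cypher query against Neo4j rather than hand-rolling the traversal:

```python
from collections import deque

# Toy property graph: (node, relation, node) triples; names are invented.
edges = [
    ("Alice", "officer_of", "Shell Co A"),
    ("Bob", "officer_of", "Shell Co A"),
    ("Bob", "officer_of", "Shell Co B"),
    ("Carol", "intermediary_of", "Shell Co B"),
]

# Build an undirected adjacency list so we can walk in either direction.
adj = {}
for src, rel, dst in edges:
    adj.setdefault(src, []).append((rel, dst))
    adj.setdefault(dst, []).append((rel, src))

def connection(start, goal):
    """Shortest chain of relationships between two entities (BFS)."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for _rel, nxt in adj.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no connection found

print(connection("Alice", "Carol"))
# ['Alice', 'Shell Co A', 'Bob', 'Shell Co B', 'Carol']
```

This is exactly the kind of multi-hop question ("how is this person linked to that company?") that is awkward in SQL joins but natural in a graph store.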

Q7. What are the similarities with the way you analysed data in the Swiss Leaks story (exposing the fraudulent activity of 100,000 HSBC private bank clients in Switzerland)?

Mar Cabra: We used the same tools for that – a document search platform and a graph database – and we used them in combination to find stories. The baseline was the same, but the complexity was 100 times greater for the Panama Papers. So the technology is the same in principle, but because we were dealing with many more documents, much more complex data and many more formats, we had to make a lot of improvements to the tools so they really worked for this project. For example, we had to improve the document search platform with a batch search feature, where journalists would upload a list of names and get back a list of links to the documents in which those names had hits.
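The batch-search feature can be sketched as a toy in-memory stand-in. The real system matched the uploaded names against an Apache Solr index; the documents and names below are invented:

```python
# Toy version of the batch-search feature: upload a list of names,
# get back, for each name, the documents where it appears.
# (The real system queried an Apache Solr index; names are invented.)
docs = {
    "doc-001": "Agreement between Shell Co A and John Smith, director.",
    "doc-002": "Invoice issued to Jane Roe for Shell Co B.",
    "doc-003": "John Smith appointed nominee shareholder.",
}

def batch_search(names, documents):
    """Return, for each uploaded name, the ids of documents that mention it."""
    hits = {name: [] for name in names}
    for doc_id, text in documents.items():
        lowered = text.lower()
        for name in names:
            if name.lower() in lowered:
                hits[name].append(doc_id)
    return hits

print(batch_search(["John Smith", "Jane Roe"], docs))
# {'John Smith': ['doc-001', 'doc-003'], 'Jane Roe': ['doc-002']}
```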

Q8. Emil Eifrem, CEO, Neo Technology wrote: “If the Panama Papers leak had happened ten years ago, no story would have been written because no one else would have had the technology and skillset to make sense of such a massive dataset at this scale.” What is your take on this?

Mar Cabra: We would have done the Panama Papers differently, probably printing the documents – and that would have had a tremendous effect on the paper supplies of the world, because printing out all 11.5 million files would have been crazy! We would have published some stories and the public might have seen some names on the front page of a few newspapers, but the scale, the depth and the understanding of this complex world would not have been possible without the technology we have today. We simply could not have done such an in-depth investigation at a global scale without it.

Q9. Whistleblowers take incredible risks to help you tell data stories. Why do they do it?

Mar Cabra: Occasionally, some whistleblowers have a grudge and are motivated in more personal terms. Many have been what we call in Spanish ‘widows of power’: people who have been in power and lost it, and who wish to expose the competition or settle a score. Motivations vary, but I think there is always an intention to expose injustice. ‘John Doe’, the source behind the Panama Papers, explained his motivation a few weeks after we published: he wanted to expose an unjust system.

Mar Cabra is the head of ICIJ’s Data & Research Unit, which produces the organization’s key data work and also develops tools for better collaborative investigative journalism. She has been an ICIJ staff member since 2011, and is also a member of the network.

Mar fell in love with data while being a Fulbright scholar and fellow at the Stabile Center for Investigative Journalism at Columbia University in 2009/2010. Since then, she’s promoted data journalism in her native Spain, co-creating the first-ever master’s degree on investigative reporting, data journalism and visualisation, and the national data journalism conference, which gathers more than 500 people every year.

She previously worked in television (BBC, CNN+ and laSexta Noticias) and her work has been featured in the International Herald Tribune, The Huffington Post, PBS, El País, El Mundo and El Confidencial, among others.
In 2012 she received the Spanish Larra Award for the country’s most promising journalist under 30.


– Panama Papers Source Offers Documents To Governments, Hints At More To Come. International Consortium of Investigative Journalists. May 6, 2016

– The Panama Papers. ICIJ

– The two journalists from Süddeutsche Zeitung: Frederik Obermaier and Bastian Obermayer

– Offshore Leaks Database: Released in June 2013, the Offshore Leaks Database is a simple search box.

Open Source used for analysing the #PanamaPapers:

– Oxwall: We found an open source social network tool called Oxwall that we tweaked to our advantage. We basically created a private social network for our reporters.

– Apache Tika and Tesseract to do optical character recognition (OCR),

– We created a small program ourselves, called Extract, which is available in our GitHub account and allowed us to do this parallel processing. Extract would take a file and try to recognize its content. If it couldn’t, we would do OCR and then send the result to our document search platform, which was Apache Solr.
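The dispatch logic described for Extract can be sketched like this. The functions below are stand-ins only: the real chain used Apache Tika for parsing, Tesseract for OCR and Apache Solr for indexing, and ran in parallel across many machines:

```python
def extract_text(blob):
    """Stand-in for Apache Tika: returns text, or None for image-only files."""
    return blob if isinstance(blob, str) else None

def ocr(blob):
    """Stand-in for Tesseract OCR on an image blob."""
    return blob.decode("ascii", errors="ignore")

index = {}  # stand-in for the Solr index

def process(doc_id, blob):
    """Extract-style dispatch: use parsed text if available, else fall back to OCR."""
    text = extract_text(blob)
    if text is None:
        text = ocr(blob)
    index[doc_id] = text
    return text

process("contract.txt", "plain text contract")   # already text: indexed as-is
process("scan.tif", b"scanned page text")        # image-like: goes through OCR
print(index)
```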

– Based on Apache Solr, we created an index, and then we used Project Blacklight, another open source tool originally built for libraries, as our front-end. For example, Columbia University Library, where I studied, uses this tool.

– Linkurious: Linkurious is software that allows you to visualize graphs very easily. You get a license, you put it on your server, and if you have a database in Neo4j you just plug it in; within hours you have the system set up. It also has a private access system where our reporters can log in and out.

– Thanks to another open source tool – in this case Talend, an extract, transform and load (ETL) tool – we were able to easily transform our database into Neo4j, plug in Linkurious and get reporters searching.

– Neo4j: Neo4j is a highly scalable, native graph database purpose-built to leverage not only data but also its relationships. Neo4j’s native graph storage and processing engine deliver constant, real-time performance, helping enterprises build intelligent applications to meet today’s evolving data challenges.

– The good thing about Linkurious is that the reporters or the developers at the other end of the spectrum can also make highly technical Cypher queries if they want to start looking more in depth at the data.



On Big Data Analytics. Interview with Anthony Bak http://www.odbms.org/blog/2014/12/anthony-bak-data-scientist-mathematician-ayasdi/ Sun, 07 Dec 2014 19:27:53 +0000

“The biggest challenge facing data analytics is how to turn complex data into actionable information. One way to think about complexity is that there are many stories happening simultaneously in the data – some relevant to the problem being solved but most irrelevant. The goal of Big Data Analytics is to find the relevant story, reducing complexity to actionable information.”–Anthony Bak

On Big Data Analytics, I have interviewed Anthony Bak, Data Scientist and Mathematician at Ayasdi.


Q1. What are the most important challenges for Big Data Analytics?

Anthony Bak: The biggest challenge facing data analytics is how to turn complex data into actionable information. One way to think about complexity is that there are many stories happening simultaneously in the data – some relevant to the problem being solved but most irrelevant. The goal of Big Data Analytics is to find the relevant story, reducing complexity to actionable information. How do we sort through all the stories in an efficient manner?

Historically, organizations extracted value from data by building data infrastructure and employing large teams of highly trained Data Scientists who spend months, and sometimes years, asking questions of data to find breakthrough insights. The probability of discovering these insights is low because there are too many questions to ask and not enough data scientists to ask them.

Ayasdi’s platform uses Topological Data Analysis (TDA) to automatically find the relevant stories in complex data and operationalize them to solve difficult and expensive problems. We combine machine learning and statistics with topology, allowing for ground-breaking automation of the discovery process.

Q2. How can you “measure” the value you extract from Big Data in practice?

Anthony Bak: We work closely with our clients to find valuable problems to solve. Before we tackle a problem, we quantify both its value to the customer and the outcome that will deliver that value.

Q3. You use a so called Topological Data Analysis. What is it?

Anthony Bak: Topology is the branch of pure mathematics that studies the notion of shape.
We use topology as a framework combining statistics and machine learning to form geometric summaries of Big Data spaces. These summaries allow us to understand the important and relevant features of the data. We like to say that “Data has shape and shape has meaning”. Our goal is to extract shapes from the data and then understand their meaning.

While there is no complete taxonomy of all geometric features and their meaning there are a few simple patterns that we see in many data sets: clusters, flares and loops.

Clusters are the most basic property of shape a data set can have. They represent natural segmentations of the data into distinct pieces, groups or classes. An example might find two clusters of doctors committing insurance fraud.
Having two groups suggests that there may be two types of fraud represented in the data. From the shape we extract meaning or insight about the problem.

That said, many problems don’t naturally split into clusters and we have to use other geometric features of the data to get insight. We often see that there’s a core of data points that are all very similar representing “normal” behavior and coming off of the core we see flares of points. Flares represent ways and degrees of deviation from the norm.
An example might be gene expression levels for cancer patients where people in various flares have different survival rates.

Loops can represent periodic behavior in the data set. An example might be patient disease profiles (clinical and genetic information) where they go from being healthy, through various stages of illness and then finally back to healthy.
The loop in the data is formed not by a single patient but by sampling many patients in various stages of disease. Understanding and characterizing the disease path potentially allows doctors to give better more targeted treatment.

Finally, a given data set can exhibit all of these geometric features simultaneously as well as more complicated ones that we haven’t described here. Topological Data Analysis is the systematic discovery of geometric features.
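The geometric summaries described above can be illustrated with a toy, one-dimensional Mapper-style construction (an illustration only, not Ayasdi's implementation): cover the range of a filter function with overlapping bins, cluster the points inside each bin, and connect clusters that share points. Distinct connected components of the resulting graph then correspond to clusters in the data:

```python
def mapper_1d(points, filt, n_bins=4, overlap=0.25, eps=1.5):
    """Toy 1-D Mapper: cover the filter range with overlapping bins,
    cluster within each bin (single linkage), link clusters sharing points."""
    values = [filt(p) for p in points]
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    nodes = []
    for i in range(n_bins):
        a = lo + i * width - overlap * width          # bin start, widened
        b = lo + (i + 1) * width + overlap * width    # bin end, widened
        members = sorted(p for p, v in zip(points, values) if a <= v <= b)
        clusters = []
        for p in members:
            if clusters and p - clusters[-1][-1] <= eps:
                clusters[-1].append(p)   # close enough: same cluster
            else:
                clusters.append([p])     # start a new cluster (a new node)
        nodes.extend(set(c) for c in clusters)
    edges = {(i, j) for i in range(len(nodes))
             for j in range(i + 1, len(nodes)) if nodes[i] & nodes[j]}
    return nodes, edges

nodes, edges = mapper_1d(list(range(13)), filt=lambda x: x, n_bins=2)
print(len(nodes), edges)   # 2 {(0, 1)}
```

On well-separated data (e.g. points clustered near 0 and near 11) the same call yields two nodes with no edge between them, i.e. the "two clusters" pattern described above.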

Q4. The core algorithm you use is called “Mapper”, developed at Stanford in the Computational Topology group by Gunnar Carlsson and Gurjeet Singh. How has your company, Ayasdi, turned this idea into a product?

Anthony Bak: Gunnar Carlsson, co-founder and Stanford University mathematics professor, is one of the leaders in a branch of mathematics called topology. While topology has been studied for the last 300 years, it’s in just the last 15 years that Gunnar has pioneered the application of topology to understand large and complex sets of data.

Between 2001 and 2005, DARPA and the National Science Foundation sponsored Gunnar’s research into what he called Topological Data Analysis (TDA). Tony Tether, the director of DARPA at the time, has said that TDA was one of the most important projects DARPA was involved in during his eight years at the agency.
Tony told the New York Times, “The discovery techniques of topological data analysis are going to have a huge impact, and Gunnar Carlsson is at the forefront of this research.”

That led to Gunnar teaming up with a group of others to develop a commercial product that could aid the efforts of life sciences, national security, oil and gas and financial services organizations. Today, Ayasdi already has customers in a broad range of industries, including at least 3 of the top global pharmaceutical companies, at least 3 of the top oil and gas companies and several agencies and departments inside the U.S. Government.

Q5. Do you have some uses cases where Topological Data Analysis is implemented to share?

Anthony Bak: There is a well-known, 11-year-old data set representing a breast cancer research project conducted by the Netherlands Cancer Institute–Antoni van Leeuwenhoek Hospital. The research looked at 272 cancer patients covering 25,000 different genetic markers. Scientists around the world have analyzed this data over and over again. In essence, everyone believed that anything that could be discovered from this data had been discovered.

Within a matter of minutes, Ayasdi was able to identify new, previously undiscovered populations of breast cancer survivors. Ayasdi’s discovery was recently published in Nature.

Using connections and visualizations generated from the breast cancer study, oncologists can map their own patients’ data onto the existing data set to custom-tailor triage plans. In a separate study, Ayasdi helped discover previously unknown biomarkers for leukaemia.

You can find additional case studies here.

Q6. Query-Based Approach vs. Query-Free Approach: could you please elaborate on this and explain the trade off?

Anthony Bak: Since the creation of SQL in the 1980s, data analysts have tried to find insights by asking questions and writing queries. This approach has two fundamental flaws. First, all queries are based on human assumptions and bias. Secondly, query results only reveal slices of data and do not show relationships between similar groups of data. While this method can uncover clues about how to solve problems, it is a game of chance that usually results in weeks, months, and years of iterative guesswork.

Ayasdi’s insight is that the shape of the data – its flares, clusters, loops – tells you about natural segmentations, groupings and relationships in the data. This information forms the basis of a hypothesis to query and investigate further. The analytical process no longer starts with coming up with a hypothesis and then testing it; instead, we let the data, through its geometry, tell us where to look and what questions to ask.

Q7. Anything else you wish to add?

Anthony Bak: Topological data analysis represents a fundamental new framework for thinking about, analyzing and solving complex data problems. While I have emphasized its geometric and topological properties it’s important to point out that TDA does not replace existing statistical and machine learning methods. 
Instead, it forms a framework that utilizes existing tools while gaining additional insight from the geometry.

I like to say that statistics and geometry form orthogonal toolsets for analyzing data, to get the best understanding of your data you need to leverage both. TDA is the framework for doing just that.

Anthony Bak is currently a Data Scientist and mathematician at Ayasdi. Prior to Ayasdi, Anthony was at Stanford University where he worked with Ayasdi co-founder Gunnar Carlsson on new methods and applications of Topological Data Analysis. He did his Ph.D. work in algebraic geometry with applications to string theory.


Extracting insights from the shape of complex data using topology
P. Y. Lum,G. Singh,A. Lehman,T. Ishkanov,M. Vejdemo-Johansson,M. Alagappan,J. Carlsson & G. Carlsson
Nature, Scientific Reports 3, Article number: 1236 doi:10.1038/srep01236, 07 February 2013

Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition

Extracting insights from the shape of complex data using topology

Related Posts

Predictive Analytics in Healthcare. Interview with Steve Nathan, ODBMS Industry Watch, August 26, 2014

Follow ODBMS.org on Twitter: @odbmsorg


Big Data: Three questions to McObject. http://www.odbms.org/blog/2014/02/big-data-three-questions-to-mcobject/ Fri, 14 Feb 2014 08:21:08 +0000

“In a nutshell, pipelining is a programming technique that combines functions from the database system’s library of vector-based functions into an assembly line of processing for market data, with the output of one function becoming input for the next.”–Steven T. Graves.

The fourth interview in the “Big Data: three questions to…” series is with Steven T. Graves, President and CEO of McObject.


Q1. What is your current product offering?

Steven T. Graves: McObject has two product lines. One is the eXtremeDB product family. eXtremeDB is a real-time embedded database system built on a core in-memory database system (IMDS) architecture, with the eXtremeDB IMDS edition representing the “standard” product. Other eXtremeDB editions offer special features and capabilities such as an optional SQL API, high availability, clustering, 64-bit support, optional and selective persistent storage, transaction logging and more.

In addition, our eXtremeDB Financial Edition database system targets real-time capital markets systems such as algorithmic trading and risk management (and has its own Web site). eXtremeDB Financial Edition comprises a super-set of the individual eXtremeDB editions (bundling together all specialized libraries such as clustering, 64-bit support, etc.) and offers features including columnar data handling and vector-based statistical processing for managing market data (or any other type of time series data).

Features shared across the eXtremeDB product family include: ACID-compliant transactions; multiple application programming interfaces (a native and type-safe C/C++ API; SQL/ODBC/JDBC; native Java, C# and Python interfaces); multi-user concurrency with an optional multi-version concurrency control (MVCC) transaction manager; event notifications; cache prioritization; and support for multiple database indexes (b-tree, r-tree, kd-tree, hash, Patricia trie, etc.). eXtremeDB’s footprint is small, with an approximately 150K code size. eXtremeDB is available for a wide range of server, real-time operating system (RTOS) and desktop operating systems, and McObject provides eXtremeDB source code for porting.

McObject’s second product offering is the Perst open source, object-oriented embedded database system, available in all-Java and all-C# (.NET) versions. Perst is small (code size typically less than 500K) and very fast, with features including ACID-compliant transactions; specialized collection classes (such as a classic b-tree implementation; r-tree indexes for spatial data; database containers optimized for memory-only access, etc.); garbage collection; full-text search; schema evolution; a “wrapper” that provides a SQL-like interface (SubSQL); XML import/export; database replication, and more.

Perst also operates in specialized environments. Perst for .NET includes support for .NET Compact Framework, Windows Phone 8 (WP8) and Silverlight (check out our browser-based Silverlight CRM demo, which showcases Perst’s support for storage on users’ local file systems). The Java edition supports the Android smartphone platform, and includes the Perst Lite embedded database for Java ME.

Q2. Who are your current customers and how do they typically use your products?

Steven T. Graves: eXtremeDB initially targeted real-time embedded systems, often residing in non-PC devices such as set-top boxes, telecom switches or industrial controllers.
There are literally millions of eXtremeDB-based devices deployed by our customers; a few examples are set-top boxes from DIRECTV (eXtremeDB is the basis of an electronic programming guide); F5 Networks’ BIG-IP network infrastructure (eXtremeDB is built into the devices’ proprietary embedded operating system); and BAE Systems (avionics in the Panavia Tornado GR4 combat jet). A recent new customer in telecom/networking is Compass-EOS, which has released the first photonics-based core IP router, using eXtremeDB High Availability to manage the device’s control plane database.

Addition of “enterprise-friendly” features (support for SQL, Java, 64-bit, MVCC, etc.) drove eXtremeDB’s adoption for non-embedded systems that demand fast performance. Examples include software-as-a-service provider hetras Gmbh (eXtremeDB handles the most performance-intensive queries in its Cloud-based hotel management system); Transaction Network Services (eXtremeDB is used in a highly scalable system for real-time phone number lookups/ routing); and MeetMe.com (formerly MyYearbook.com – eXtremeDB manages data in social networking applications).

In the financial industry, eXtremeDB is used by a variety of trading organizations and technology providers. Examples include the broker-dealer TradeStation (McObject’s database technology is part of its next-generation order execution system); Financial Technologies of India, Ltd. (FTIL), which has deployed eXtremeDB in the order-matching application used across its network of financial exchanges in Asia and the Middle East; and NSE.IT (eXtremeDB supports risk management in algorithmic trading).

Users of Perst are many and varied, too. You can find Perst in many commercial software applications such as enterprise application management solutions from the Wily Division of CA. Perst has also been adopted for community-based open source projects, including the Frost client for the Freenet global peer-to-peer network. Some of the most interesting Perst-based applications are mobile. For example, 7City Learning, which provides training for financial professionals, gives students an Android tablet with study materials that are accessed using Perst. Several other McObject customers use Perst in mobile medical apps.

Q3. What are the main new technical features you are currently working on and why?

Steven T. Graves: One feature we’re very excited about is the ability to pipeline vector-based statistical functions in eXtremeDB Financial Edition – we’ve even released a short video and a 10-page white paper describing this capability. In a nutshell, pipelining is a programming technique that combines functions from the database system’s library of vector-based functions into an assembly line of processing for market data, with the output of one function becoming input for the next.

This may not sound unusual, since almost any algorithm or program can be viewed as a chain of operations acting on data.
But this pipelining has a unique purpose and a powerful result: it keeps market data inside CPU cache as the data is being worked.
Without pipelining, the results of each function would typically be materialized outside cache, in temporary tables residing in main memory. Handing interim results back and forth “across the transom” between CPU cache and main memory imposes significant latency, which is eliminated by pipelining. We’ve been improving this capability by adding new statistical functions to the library. (For an explanation of pipelining that’s more in-depth than the video but shorter than the white paper, check out this article on the financial technology site Low-Latency.com.)
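The assembly-line idea can be sketched with Python generators, where each value streams through the whole chain instead of being materialized between steps. The function names are illustrative, not eXtremeDB's actual vector library, and generators only mimic the cache-residency effect the interview describes:

```python
# Toy of the pipelining idea: chain vector functions so values stream
# through the whole assembly line rather than being materialized in
# temporary tables between steps.
def scale(xs, factor):
    """Multiply each element of a stream by a constant."""
    for x in xs:
        yield x * factor

def moving_sum(xs, window):
    """Sliding-window sum over a stream."""
    buf = []
    for x in xs:
        buf.append(x)
        if len(buf) > window:
            buf.pop(0)
        if len(buf) == window:
            yield sum(buf)

prices = [10, 11, 12, 13, 14]
# The output of one function is the input of the next; nothing is stored between.
pipeline = scale(moving_sum(prices, window=2), factor=0.5)
print(list(pipeline))   # [10.5, 11.5, 12.5, 13.5] — 2-period moving averages
```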

We are also adding to the capabilities of eXtremeDB Cluster edition to make clustering faster and more flexible, and further simplify cluster administration. Improvements include a local tables option, in which database tables can be made exempt from replication, but shareable through a scatter/gather mechanism. Dynamic clustering, added in our recent v. 5.0 upgrade, enables nodes to join and leave clusters without interrupting processing. This further simplifies administration for a clustering database technology that counts minimal run-time maintenance as a key benefit. On selected platforms, clustering now supports the Infiniband switched fabric interconnect and Message Passing Interface (MPI) standard. In our tests, these high performance networking options accelerated performance more than 7.5x compared to “plain vanilla” gigabit networking (TCP/IP and Ethernet).

Related Posts

Big Data: Three questions to VoltDB.
ODBMS Industry Watch, February 6, 2014

Big Data: Three questions to Pivotal.
ODBMS Industry Watch, January 20, 2014.

Big Data: Three questions to InterSystems.
ODBMS Industry Watch, January 13, 2014.

Cloud based hotel management– Interview with Keith Gruen.
ODBMS Industry Watch, July 25, 2013

In-memory database systems. Interview with Steve Graves, McObject.
ODBMS Industry Watch, March 16, 2012


ODBMS.org: Free resources on Big Data, Analytics, Cloud Data Stores, Graph Databases, NewSQL, NoSQL, Object Databases.


On Big Graph Data. http://www.odbms.org/blog/2012/08/on-big-graph-data/ Mon, 06 Aug 2012 10:41:46 +0000

“The ultimate goal is to ensure that the graph community is not hindered by vendor lock-in.” –Marko A. Rodriguez.

“There are three components to scaling OLTP graph databases: effective edge compression, efficient vertex-centric query support, and intelligent graph partitioning.” –Matthias Broecheler.

Titan is a new distributed graph database available in alpha release. It is an open source Apache project maintained and funded by Aurelius. To learn more about it, I have interviewed Dr. Marko A. Rodriguez and Dr. Matthias Broecheler, cofounders of Aurelius.


Q1. What is Titan?

MATTHIAS: Titan is a highly scalable OLTP graph database system optimized for thousands of users concurrently accessing and updating one huge graph.

Q2. Who needs to handle graph-data and why?

MARKO: Much of today’s data is composed of a heterogeneous set of “things” (vertices) connected by a heterogeneous set of relationships (edges) — people, events, items, etc. related by knowing, attending, purchasing, etc. The property graph model leveraged by Titan espouses this world view. This world view is not new, as the object-oriented community has a similar perspective on data.
However, graph-centric data aligns well with the numerous algorithms and statistical techniques developed in both the network science and graph theory communities.

Q3. What are the main technical challenges when storing and processing graphs?

MATTHIAS: At the interface level, Titan strives to strike a balance between simplicity — so that developers can think in terms of graphs and traversals without having to worry about persistence and efficiency details — and performance. This is achieved both by using the Blueprints API and by extending it with methods that allow developers to give Titan “hints” about the graph data. Titan can then exploit these “hints” to ensure performance at scale.

Q4. Graphs are hard to scale. What are the key ideas that make it so that Titan scales? Do you have any performance metrics available?

MATTHIAS: There are three components to scaling OLTP graph databases: effective edge compression, efficient vertex-centric query support, and intelligent graph partitioning.
Edge compression in Titan comprises various techniques for keeping the memory footprint of each edge as small as possible and storing all edge information in one consecutive block of memory for fast retrieval.
Vertex-centric queries allow users to query for a specific set of edges by leveraging vertex-centric indices and a query optimizer.
Graph data partitioning refers to distributing the graph across multiple machines such that frequently co-accessed data is co-located. Graph partitioning is an NP-hard problem, and this is the aspect of Titan where we will see the most improvement in future releases.
The current alpha release focuses on balanced partitioning and multi-threaded parallel traversals for scale.
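The vertex-centric query idea can be sketched as follows: if each vertex keeps its incident edges sorted by (label, sort key), a query such as "this vertex's 'follows' edges since time T" becomes a binary search plus a short scan instead of a walk over every incident edge. This is a toy illustration with invented data, not Titan's actual storage format:

```python
import bisect

class VertexIndex:
    """Sketch of a vertex-centric index: edges of one vertex kept sorted
    by (label, timestamp) so label/range queries avoid full edge scans."""

    def __init__(self):
        self.edges = []  # sorted list of (label, timestamp, neighbor)

    def add_edge(self, label, timestamp, neighbor):
        bisect.insort(self.edges, (label, timestamp, neighbor))

    def query(self, label, since=0):
        """All `label` edges with timestamp >= since, oldest first."""
        start = bisect.bisect_left(self.edges, (label, since, ""))
        out = []
        for l, t, n in self.edges[start:]:
            if l != label:   # sorted by label first, so we can stop early
                break
            out.append((t, n))
        return out

v = VertexIndex()
v.add_edge("follows", 3, "w")
v.add_edge("likes", 1, "p")
v.add_edge("follows", 7, "u")
print(v.query("follows", since=4))   # [(7, 'u')]
```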

MARKO: To your question about performance metrics, Matthias and his colleague Dan LaRocque are currently working on a benchmark that will demonstrate Titan’s performance when tens of thousands of transactions are concurrently interacting with Titan. We plan to release this benchmark via the Aurelius blog.
[Edit: The benchmark is now available here.]

Q5. What is the relationship of Titan with other open source projects you were previously involved with, such as TinkerPop? Is Titan open source?

MARKO: Titan is a free, open source Apache2 project maintained and funded by Aurelius. Aurelius (our graph consulting firm) developed Titan in order to meet the scalability requirements of a number of our clients.
In fact, Pearson is a primary supporter and early adopter of Titan. TinkerPop, on the other hand, is not directly funded by any company and, as such, is an open source group developing graph-based tools that any graph database vendor can leverage.
With that said, Titan natively implements the Blueprints 2 API and is able to leverage the TinkerPop suite of technologies: Pipes, Gremlin, Frames, and Rexster.
We believe this demonstrates the power of the TinkerPop stack — if you are developing a graph persistence store, implement Blueprints and your store automatically gets a traversal language, an OGM (object-to-graph mapper) framework, and a RESTful server.

Q6. How is Titan addressing the problem of analyzing Big Data at scale?

MATTHIAS: Titan is an OLTP database that is optimized for many concurrent users running short transactions, e.g. graph updates or short traversals, against one huge graph. Titan significantly simplifies the development of scalable graph applications such as Facebook, Twitter, and the like.
Interestingly enough, most of these large companies have built their own internal graph databases.
We hope Titan will allow organizations to not reinvent the wheel. In this way, companies can focus on the value their data adds, not on the “plumbing” needed to process that data.

MARKO: In order to support the type of global OLAP operations typified by the Big Data community, Aurelius will be providing a suite of technologies that will allow developers to make use of global graph algorithms. Faunus is a Hadoop connector that implements a multi-relational path algebra developed by myself and Joshua Shinavier. This algebra allows users to derive smaller, “semantically rich” graphs that can then be effectively computed on within the memory confines of a single machine. Fulgora will be the in-memory processing engine. Currently, as Matthias has shown in prototype, Fulgora can store ~90 billion edges on a 64 GB RAM machine for graphs with a natural, real-world topology. Titan, Faunus, and Fulgora form Aurelius’ OLAP story.

    Q7. How do you handle updates?

    MATTHIAS: Updates are bundled in transactions which are executed against the underlying storage backend. Titan can be operated on multiple storage backends and currently supports Apache Cassandra, Apache HBase and Oracle BerkeleyDB.
    The degree of transactional support and isolation depends on the chosen storage backend. For non-transactional storage backends Titan provides its own locking system and fine-grained locking support to achieve consistency while maintaining scalability.
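    The idea of adding fine-grained locking on top of a non-transactional store can be sketched as follows. This is an illustrative Python sketch only, not Titan's actual implementation; all names are invented.

    ```python
    import threading
    from collections import defaultdict

    class LockingStore:
        """Per-key locking over a plain dict, sketching how a layer above a
        non-transactional backend can provide consistent single-key updates."""

        def __init__(self):
            self._data = {}
            # One lock per key (fine-grained), rather than one global lock:
            # only writers touching the same key contend with each other.
            self._locks = defaultdict(threading.Lock)

        def update(self, key, fn):
            with self._locks[key]:
                self._data[key] = fn(self._data.get(key))
                return self._data[key]

    store = LockingStore()
    store.update("v1:degree", lambda v: (v or 0) + 1)
    store.update("v1:degree", lambda v: (v or 0) + 1)
    print(store._data["v1:degree"])  # 2
    ```

    Because locks are taken per key, many updates on different keys can proceed in parallel, which is what preserves scalability.
    
    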

    Q8. Do you offer support for declarative queries?

    MARKO: Titan implements the Blueprints 2 API and as such, supports Gremlin as its query/traversal language. Gremlin is a data flow language for graphs whereby traversals are prescriptively described using path expressions.
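    The flavor of a path-expression traversal can be sketched in Python over a toy adjacency structure. This is not Gremlin itself (Gremlin is typically hosted in Groovy/Java); the graph contents and function names here are invented for illustration.

    ```python
    # Toy graph: vertex -> {edge label -> list of neighbor vertices}
    graph = {
        "marko": {"knows": ["josh", "vadas"]},
        "josh": {"created": ["lop", "ripple"]},
        "vadas": {"created": []},
        "lop": {},
        "ripple": {},
    }

    def out(vertices, label):
        """Follow outgoing edges with the given label from each vertex."""
        result = []
        for v in vertices:
            result.extend(graph.get(v, {}).get(label, []))
        return result

    # Path expression in the spirit of: start at marko, traverse 'knows',
    # then traverse 'created' -- "what did the people marko knows create?"
    projects = out(out(["marko"], "knows"), "created")
    print(sorted(projects))  # ['lop', 'ripple']
    ```

    Each step consumes a set of vertices and emits the next set, which is the data-flow style of evaluation that Gremlin's path expressions describe.
    
    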

    MATTHIAS: With respect to a declarative query language, the TinkerPop team is currently in the design process of a graph-centric language called “Troll.” We invite anybody interested in graph algorithms and graph processing to help in this effort.
    We want to identify the key graph use cases and then build a language that addresses those most effectively. Note that this is happening in TinkerPop and any Blueprints-enabled graph database will ultimately be able to add “Troll” to their supported languages.

    Q9. How does Titan compare with other commercial graph databases and RDF triple stores?

    MARKO: As Matthias has articulated previously, Titan is optimized for thousands of concurrent users reading and writing to a single massive graph. Most popular graph databases on the market today are single machine databases and simply can’t handle the scale of data and number of concurrent users that Titan can support. However, because Titan is a Blueprints-enabled graph database, it provides that same perspective on graph data as other graph databases.
    In terms of RDF quad/triple stores, the biggest obvious difference is the data model. RDF stores make use of a collection of triples composed of a subject, predicate, and object. There is no notion of key/value pairs associated with vertices and edges like Blueprints-based databases. When one wants to model edge weights, timestamps, etc., RDF becomes cumbersome. However, the RDF community has a rich collection of tools and standards that make working with RDF data easy and compatible across all RDF vendors.
    For example, I have a deep appreciation for OpenRDF.
    Similar to OpenRDF, TinkerPop hopes to make it easy for developers to migrate between various graph solutions whether they be graph databases, in-memory graph frameworks, Hadoop-based graph processing solutions, etc.
    The ultimate goal is to ensure that the graph community is not hindered by vendor lock-in.

    Q10. How does Titan compare with respect to NoSQL data stores and NewSQL databases?

    MATTHIAS: Titan builds on top of the innovation at the persistence layer that we have seen in recent years in the NoSQL movement. At the lowest level, a graph database needs to store bits and bytes and therefore has to address the same issues around persistence, fault tolerance, replication, synchronization, etc. that NoSQL solutions are tackling.
    Rather than reinventing the wheel, Titan stands on the shoulders of giants by being able to utilize different NoSQL solutions for storage through an abstract storage interface. This allows Titan to cover all three sides of the CAP theorem triangle.
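    The abstract-storage-interface pattern described above can be sketched in a few lines of Python. This is a hedged illustration of the design idea, not Titan's actual interface; class and function names are invented.

    ```python
    from abc import ABC, abstractmethod

    class StorageBackend(ABC):
        """Minimal abstract storage interface: graph logic written against it
        can run unchanged over different backends (Cassandra, HBase, ...)."""

        @abstractmethod
        def put(self, key, value): ...

        @abstractmethod
        def get(self, key): ...

    class InMemoryBackend(StorageBackend):
        """A trivial dict-backed implementation, standing in for a real store."""

        def __init__(self):
            self._rows = {}

        def put(self, key, value):
            self._rows[key] = value

        def get(self, key):
            return self._rows.get(key)

    def save_edge(backend, src, label, dst):
        # The graph layer depends only on the interface, not on the backend chosen.
        backend.put((src, label, dst), True)

    b = InMemoryBackend()
    save_edge(b, "a", "knows", "b")
    print(b.get(("a", "knows", "b")))  # True
    ```

    Swapping in a backend with different CAP trade-offs then requires no change to the graph layer itself.
    
    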

    Q11. Prof. Stonebraker argues that “blinding performance depends on removing overhead. Such overhead has nothing to do with SQL, but instead revolves around traditional implementations of ACID transactions, multi-threading, and disk management. To go wildly faster, one must remove all four sources of overhead, discussed above. This is possible in either a SQL context or some other context.” What is your opinion on this?

    MATTHIAS: We absolutely agree with Mike on this. The relational model is a way of looking at your data through tables and SQL is the language you use when you adopt this tabular view. There is nothing intrinsically inefficient about tables or relational algebra. But it's important to note that the relational model is simply one way of looking at your data. We promote the graph data model which is the natural data representation for many applications where entities are highly connected with one another. Using a graph database for such applications will make developers significantly more productive and change the way one can derive value from their data.

    Dr. Marko A. Rodriguez is the founder of the graph consulting firm Aurelius. He has focused his academic and commercial career on the theoretical and applied aspects of graphs. Marko is a cofounder of TinkerPop and the primary developer of the Gremlin graph traversal language.

    Dr. Matthias Broecheler has been researching and developing large-scale graph database systems for many years in both academia and in his role as a cofounder of the Aurelius graph consulting firm. He is the primary developer of the distributed graph database Titan.
    Matthias focuses most of his time and effort on novel OLTP and OLAP graph processing solutions.

    Related Posts

    “Applying Graph Analysis and Manipulation to Data Stores.” (June 22, 2011)

    “Marrying objects with graphs”: Interview with Darren Wood. (March 5, 2011)

    Resources on Graphs and Data Stores
    Blog Posts | Free Software | Articles, Papers, Presentations | Tutorials, Lecture Notes


    Interview with Mike Stonebraker. http://www.odbms.org/blog/2012/05/interview-with-mike-stonebraker/ Wed, 02 May 2012 13:14:16 +0000

    “I believe that “one size does not fit all”. I.e. in every vertical market I can think of, there is a way to beat legacy relational DBMSs by 1-2 orders of magnitude.” — Mike Stonebraker.

    I have interviewed Mike Stonebraker, serial entrepreneur and professor at MIT. In particular, I wanted to know more about his last endeavor, VoltDB.


    Q1. In your career you developed several data management systems, namely: the Ingres relational DBMS, the object-relational DBMS PostgreSQL, the Aurora Borealis stream processing engine (commercialized as StreamBase), the C-Store column-oriented DBMS (commercialized as Vertica), and the H-Store transaction processing engine (commercialized as VoltDB). In retrospect, what are, in a nutshell, the main differences and similarities between all these systems? What are their respective strengths and weaknesses?

    Stonebraker: In addition, I am building SciDB, a DBMS oriented toward complex analytics.
    I believe that “one size does not fit all”. I.e. in every vertical market I can think of, there is a way to beat legacy relational DBMSs by 1-2 orders of magnitude.
    The techniques used vary from market to market. Hence, StreamBase, Vertica, VoltDB and SciDB are all specialized to different markets. At this point Postgres and Ingres are legacy code bases.

    Q2. In 2009 you co-founded VoltDB, a commercial start up based on ideas from the H-Store project. H-Store is a distributed In Memory OLTP system. What is special of VoltDB? How does it compare with other In-memory databases, for example SAP HANA, or Oracle TimesTen?

    Stonebraker: A bunch of us wrote a paper “Through the OLTP Looking Glass and What We Found There” (SIGMOD 2008). In it, we identified 4 sources of significant OLTP overhead (concurrency control, write-ahead logging, latching and buffer pool management).
    Unless you make a big dent in ALL FOUR of these sources, you will not run dramatically faster than current disk-based RDBMSs. To the best of my knowledge, VoltDB is the only system that eliminates or drastically reduces all four of these overhead components. For example, TimesTen uses conventional record-level locking, an ARIES-style write-ahead log and conventional multi-threading, leading to a substantial need for latching. Hence, they eliminate only one of the four sources.

    Q3. VoltDB is designed for what you call “high velocity” applications. What do you mean with that? What are the main technical challenges for such systems?

    Stonebraker: Consider an application that maintains the “state” of a multi-player internet game. This state is subject to a collection of perhaps thousands of streams of player actions. Hence, there is a collective “firehose” that the DBMS must keep up with.

    In a variety of OLTP applications, the input is a high velocity stream of some sort. These include electronic trading, wireless telephony, digital advertising, and network monitoring.
    In addition to drinking from the firehose, such applications require ACID transactions and light real-time analytics, exactly the requirements of traditional OLTP.

    In effect, the definition of transaction processing has been expanded to include non-traditional applications.

    Q4. Goetz Graefe (HP fellow) said in an interview that “disk-less databases are appropriate where the database contains only application state, e.g., current account balances, currently active logins, current shopping carts, etc. Disks will continue to have a role and economic value where the database also contains history (e.g. cold history such as transactions that affected the account balances, login & logout events, click streams eventually leading to shopping carts, etc.)” What is your take on this?

    Stonebraker: In my opinion the best way to organize data management is to run a specialized OLTP engine on current data. Then, send transaction history data, perhaps including an ETL component, to a companion data warehouse. VoltDB is a factor of 50 or so faster than legacy RDBMSs on the transaction piece, while column stores, such as Vertica, are a similar amount faster on historical analytics. In other words, specialization allows each component to run dramatically faster than a “one size fits all” solution.

    A “two system” solution also avoids resource management issues and lock contention, and is very widely used as a DBMS architecture.

    Q5. Where will the (historical) data go if we have no disks? In the Cloud?

    Stonebraker: Into a companion data warehouse. The major DW players are all disk-based.

    Q6. How does VoltDB ensure durability?

    Stonebraker: VoltDB automatically replicates all tables. On a failure, it performs “Tandem-style” failover and eventual failback. Hence, it totally masks most errors. To protect against cluster-wide failures (such as power issues), it supports snapshotting of data and an innovative “command logging” capability. Command logging has been shown to be wildly faster than data logging, and supports the same durability as data logging.
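    The contrast between command logging and data logging can be sketched briefly. This is an illustrative Python sketch of the general idea only, not VoltDB internals; all names and values are invented.

    ```python
    # Data logging records every changed value; command logging records only
    # the logical command and its arguments, and recovers by replaying them.

    balances = {"alice": 100, "bob": 50}
    command_log = []

    def transfer(frm, to, amount):
        balances[frm] -= amount
        balances[to] += amount

    def execute(cmd, *args):
        # One tiny log record per transaction, regardless of how much data it touches.
        command_log.append((cmd.__name__, args))
        cmd(*args)

    execute(transfer, "alice", "bob", 30)

    # Recovery: restore the last snapshot, then replay the command log in order.
    balances = {"alice": 100, "bob": 50}          # snapshot
    for name, args in command_log:
        {"transfer": transfer}[name](*args)       # deterministic replay

    print(balances)  # {'alice': 70, 'bob': 80}
    ```

    The log record is small and independent of the transaction's write footprint, which is where the speed advantage over value-level (data) logging comes from; replay must be deterministic for this to be safe.
    
    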

    Q7. How does VoltDB support atomicity, consistency and isolation?

    Stonebraker: All transactions are executed (logically) in timestamp order. Hence, the net outcome of a stream of transactions on a VoltDB database is equivalent to their serial execution in timestamp order.
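    The effect of timestamp-ordered serial execution can be sketched in a few lines. This is a toy Python illustration of the scheduling idea, not VoltDB code; the workload and names are invented.

    ```python
    # Transactions run one at a time in timestamp order, so no locks or
    # latches are needed and the outcome equals a serial schedule.

    inventory = {"widget": 5}

    def buy(item, qty):
        """A short transaction: succeeds only if enough stock remains."""
        if inventory[item] >= qty:
            inventory[item] -= qty
            return True
        return False

    # (timestamp, transaction) pairs, possibly arriving out of order
    pending = [
        (3, lambda: buy("widget", 2)),
        (1, lambda: buy("widget", 4)),
    ]

    # Execute strictly in timestamp order.
    results = [txn() for _, txn in sorted(pending, key=lambda p: p[0])]
    print(inventory, results)  # {'widget': 1} [True, False]
    ```

    Because every run applies the same transactions in the same order, the final state is deterministic, which is also what makes replica consistency and command-log replay straightforward.
    
    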

    Q8. Would you call VoltDB a relational database system? Does it support standard SQL? How do you handle scalability problems for complex joins of large amounts of data?

    Stonebraker: VoltDB supports standard SQL.
    Complex joins should be run on a companion data warehouse. After all, the only way to interleave “big reads” with “small writes” in a legacy RDBMS is to use snapshot isolation or run with a reduced level of consistency.
    You either get an out-of-date, but consistent answer or an up-to-date, but inconsistent answer. Directing big reads to a companion DW, gives you the same result as snapshot isolation. Hence, I don’t see any disadvantage to doing big reads on a companion system.

    Concerning larger amounts of data, our experience is that OLTP problems with more than a few terabytes of data are quite rare. Hence, these can easily fit in main memory, using a VoltDB architecture.

    In addition, we are planning extensions of the VoltDB architecture to handle larger-than-main-memory data sets. Watch for product announcements in this area.

    Q9. Does VoltDB handle disaster recovery? If yes, how?

    Stonebraker: VoltDB just announced support for replication over a wide area network. This capability supports failover to a remote site if a disaster occurs. Check out the VoltDB web site for details.

    Q10. VoltDB's mission statement is “to deliver the fastest, most scalable in-memory database products on the planet”. What performance measurements do you have so far to sustain this claim?

    Stonebraker: We have run TPC-C at about 50 X the performance of a popular legacy RDBMS. In addition, we have shown linear TPC-C scalability to 384 cores (more than 3 million transactions per second). That was the biggest cluster we could get access to; there is no reason why VoltDB would not continue to scale.

    Q11. Can In-Memory Data Management play a significant role also for Big Data Analytics (up to several PB of data)? If yes, how? What are the largest data sets that VoltDB can handle?

    Stonebraker: VoltDB is not focused on analytics. We believe they should be run on a companion data warehouse.

    Most of the warehouse customers I talk to want to keep increasingly large amounts of increasingly diverse history to run their analytics over. The major data warehouse players are routinely being asked to manage petabyte-sized data warehouses. It is not clear how important main memory will be in this vertical market.

    Q12. You were very critical about Apache Hadoop, but VoltDB offers an integration with Hadoop. Why? How does it work technically? What are the main business benefits from such an integration?

    Stonebraker: Consider the “two system” solution mentioned above. VoltDB is intended for the OLTP portion, and some customers wish to run Hadoop as a data warehouse platform. To facilitate this architecture, VoltDB offers a Hadoop connector.

    Q13. How “green” is VoltDB? What are the tradeoffs between total power consumption and performance? Do you have any benchmarking results for that?

    Stonebraker: We have no official benchmarking numbers. However, on a large variety of applications VoltDB is a factor of 50 or more faster than traditional RDBMSs. Put differently, if legacy folks need 100 nodes, then we need 2!

    In effect, if you can offer vastly superior performance (say times 50) on the same hardware, compared to another system, then you can offer the same performance on 1/50th of the hardware. By definition, you are 50 times “greener” than they are.

    Q14. You are currently working on science-oriented DBMSs and search engines for accessing the deep web. Could you please give us some details. What kind of results did you obtain so far?

    Stonebraker: We are building SciDB, which is oriented toward complex analytics (regression, clustering, machine learning, …). It is my belief that such analytics will become much more important off into the future. Such analytics are invariably defined on arrays, not tables. Hence, SciDB is an array DBMS, supporting a dialect of SQL for array data. We expect it to be wildly faster than legacy RDBMSs on this kind of application. See SciDB.org for more information.

    Q15. You are a co-founder of several venture capital backed start-ups. In which area?

    Stonebraker: The recent ones are: StreamBase (stream processing), Vertica (data warehouse market), VoltDB (OLTP), Goby.com (data aggregation of web sources), Paradigm4 (SciDB and complex analytics).

    Check the company web sites for more details.

    Mike Stonebraker
    Dr. Stonebraker has been a pioneer of data base research and technology for more than a quarter of a century. He was the main architect of the INGRES relational DBMS, and the object-relational DBMS, POSTGRES. These prototypes were developed at the University of California at Berkeley where Stonebraker was a Professor of Computer Science for twenty five years. More recently at M.I.T. he was a co-architect of the Aurora/Borealis stream processing engine, the C-Store column-oriented DBMS, and the H-Store transaction processing engine. Currently, he is working on science-oriented DBMSs, OLTP DBMSs, and search engines for accessing the deep web. He is the founder of five venture-capital backed startups, which commercialized his prototypes. Presently he serves as Chief Technology Officer of VoltDB, Paradigm4, Inc. and Goby.com.

    Professor Stonebraker is the author of scores of research papers on data base technology, operating systems and the architecture of system software services. He was awarded the ACM System Software Award in 1992, for his work on INGRES. Additionally, he was awarded the first annual Innovation award by the ACM SIGMOD special interest group in 1994, and was elected to the National Academy of Engineering in 1997. He was awarded the IEEE John Von Neumann award in 2005, and is presently an Adjunct Professor of Computer Science at M.I.T.

    Related Posts

    In-memory database systems. Interview with Steve Graves, McObject.
    (March 16, 2012)

    On Big Data Analytics: Interview with Florian Waas, EMC/Greenplum. (February 1, 2012)

    A super-set of MySQL for Big Data. Interview with John Busch, Schooner. (February 20, 2012)

    Re-thinking Relational Database Technology. Interview with Barry Morris, Founder & CEO NuoDB. (December 14, 2011)

    On Big Data: Interview with Shilpa Lawande, VP of Engineering at Vertica. (November 16, 2011)

    vFabric SQLFire: Better then RDBMS and NoSQL? (October 24, 2011)

    The future of data management: “Disk-less” databases? Interview with Goetz Graefe. (August 29, 2011).


    In-memory database systems. Interview with Steve Graves, McObject. http://www.odbms.org/blog/2012/03/in-memory-database-systems-interview-with-steve-graves-mcobject/ Fri, 16 Mar 2012 07:43:44 +0000

    “Application types that benefit from an in-memory database system are those for which eliminating latency is a key design goal, and those that run on systems that simply have no persistent storage, like network routers and low-end set-top boxes” — Steve Graves.

    On the topic of in-memory database systems, I interviewed one of our experts, Steve Graves, co-founder and CEO of McObject.


    Q1. What is an in-memory database system (IMDS)?

    Steve Graves: An in-memory database system (IMDS) is a database management system (DBMS) that uses main memory as its primary storage medium.
    A “pure” in-memory database system is one that requires no disk or file I/O, whatsoever.
    In contrast, a conventional DBMS is designed around the assumption that records will ultimately be written to persistent storage (usually hard disk or flash memory).
    Obviously, disk or flash I/O is expensive, in performance terms, and therefore retrieving data from RAM is faster than fetching it from disk or flash, so IMDSs are very fast.
    An IMDS also offers a more streamlined design. Because it is not built around the assumption of storage on hard disk or flash memory, the IMDS can eliminate the various DBMS sub-systems required for persistent storage, including cache management, file management and others. For this reason, an in-memory database is also faster than a conventional database that is either fully-cached or stored on a RAM-disk.

    In other areas (not related to persistent storage) an IMDS can offer the same features as a traditional DBMS. These include SQL and/or native language (C/C++, Java, C#, etc.) programming interfaces; formal data definition language (DDL) and database schemas; support for relational, object-oriented, network or combination data designs; transaction logging; database indexes; client/server or in-process system architectures; security features, etc. The list could go on and on. In-memory database systems are a sub-category of DBMSs, and should be able to do everything that entails.

    Q2. What are the significant differences between an in-memory database and a database that happens to be in memory (e.g. deployed on a RAM-disk)?

    Steve Graves: We use the comparison to illustrate IMDSs’ contribution to performance beyond the obvious elimination of disk I/O. If IMDSs’ sole benefit stemmed from getting rid of physical I/O, then we could get the same performance by deploying a traditional DBMS entirely in memory – for example, using a RAM-disk in place of a hard drive.

    We tested an application performing the same tasks with three storage scenarios: using an on-disk DBMS with a hard drive; the same on-disk DBMS with a RAM-disk; and an IMDS (McObject’s eXtremeDB). Moving the on-disk database to a RAM drive resulted in nearly 4x improvement in database reads, and more than 3x improvement in writes. But the IMDS (using main memory for storage) outperformed the RAM-disk database by 4x for reads and 420x for writes.

    Clearly, factors other than eliminating disk I/O contribute to the IMDS’s performance – otherwise, the DBMS-on-RAM-disk would have matched it. The explanation is that even when using a RAM-disk, the traditional DBMS is still performing many persistent storage-related tasks.
    For example, it is still managing a database cache – even though the cache is now entirely redundant, because the data is already in RAM. And the DBMS on a RAM-disk is transferring data to and from various locations, such as a file system, the file system cache, the database cache and the client application, compared to an IMDS, which stores data in main memory and transfers it only to the application. These sources of processing overhead are hard-wired into on-disk DBMS design, and persist even when the DBMS uses a RAM-disk.

    An in-memory database system also uses the storage space (memory) more efficiently.
    A conventional DBMS can use extra storage space in a trade-off to minimize disk I/O (the assumption being that disk I/O is expensive, and storage space is abundant, so it’s a reasonable trade-off). Conversely, an IMDS needs to maximize storage efficiency because memory is not abundant in the way that disk space is. So a 10 gigabyte traditional database might only be 2 gigabytes when stored in an in-memory database.

    Q3. What is in your opinion the current status of the in-memory database technology market?

    Steve Graves: The best word for the IMDS market right now is “confusing.” “In-memory database” has become a hot buzzword, with seemingly every DBMS vendor now claiming to have one. Often these purported IMDSs are simply the providers’ existing disk-based DBMS products, which have been tweaked to keep all records in memory – and they more closely resemble a 100% cached database (or a DBMS that is using a RAM-disk for storage) than a true IMDS. The underlying design of these products has not changed, and they are still burdened with DBMS overhead such as caching, data transfer, etc. (McObject has published a white paper, Will the Real IMDS Please Stand Up?, about this proliferation of claims to IMDS status.)

    Only a handful of vendors offer IMDSs that are built from scratch as in-memory databases. If you consider these to comprise the in-memory database technology market, then the status of the market is mature. The products are stable, have existed for a decade or more and are deployed in a variety of real-time software applications, ranging from embedded systems to real-time enterprise systems.

    Q4. What are the application types that benefit the use of an in-memory database system?

    Steve Graves: Application types that benefit from an IMDS are those for which eliminating latency is a key design goal, and those that run on systems that simply have no persistent storage, like network routers and low-end set-top boxes. Sometimes these types overlap, as in the case of a network router that needs to be fast, and has no persistent storage. Embedded systems often fall into the latter category, in fields such as telco and networking gear, avionics, industrial control, consumer electronics, and medical technology. What we call the real-time enterprise sector is represented in the first category, encompassing uses such as analytics, capital markets (algorithmic trading, order matching engines, etc.), real-time cache for e-commerce and other Web-based systems, and more.

    Software that must run with minimal hardware resources (RAM and CPU) can also benefit.
    As discussed above, IMDSs eliminate sub-systems that are part-and-parcel of on-disk DBMS processing. This streamlined design results in a smaller database system code size and reduced demand for CPU cycles. When it comes to hardware, IMDSs can “do more with less.” This means that the manufacturer of, say, a set-top box that requires a database system for its electronic programming guide, may be able to use a less powerful CPU and/or less memory in each box when it opts for an IMDS instead of an on-disk DBMS. These manufacturing cost savings are particularly desirable in embedded systems products targeting the mass market.

    Q5. McObject offers an in-memory database system called eXtremeDB, and an open source embedded DBMS, called Perst. What is the difference between the two? Is there any synergy between the two products?

    Steve Graves: Perst is an object-oriented embedded database system.
    It is open source and available in Java (including Java ME) and C# (.NET) editions. The design goal for Perst is to provide persistence for Java and C# objects that is as nearly transparent as practically possible within the normal Java and .NET frameworks. In other words, no special tools, byte codes, or virtual machine are needed. Perst should provide persistence to Java and C# objects while changing the way a programmer uses those objects as little as possible.

    eXtremeDB is not an object-oriented database system, though it does have attributes that give it an object-oriented “flavor.” The design goals of eXtremeDB were to provide a full-featured, in-memory DBMS that could be used right across the computing spectrum: from resource-constrained embedded systems to high-end servers used in systems that strive to squeeze out every possible microsecond of latency. McObject’s eXtremeDB in-memory database system product family has features including support for multiple APIs (SQL ODBC/JDBC & native C/C++, Java and C#), varied database indexes (hash, B-tree, R-tree, KD-tree, and Patricia Trie), ACID transactions, multi-user concurrency (via both locking and “optimistic” transaction managers), and more. The core technology is embodied in the eXtremeDB IMDS edition. The product family includes specialized editions, built on this core IMDS, with capabilities including clustering, high availability, transaction logging, hybrid (in-memory and on-disk) storage, 64-bit support, and even kernel mode deployment. eXtremeDB is not open source, although McObject does license the source code.

    The two products do not overlap. There is no shared code, and there is no mechanism for them to share or exchange data. Perst for Java is written in Java, Perst for .NET is written in C#, and eXtremeDB is written in C, with optional APIs for Java and .NET. Perst is a candidate for Java and .NET developers that want an object-oriented embedded database system, have no need for the more advanced features of eXtremeDB, do not need to access their database from C/C++ or from multiple programming languages (a Perst database is compatible with Java or C#), and/or prefer the open source model. Perst has been popular for smartphone apps, thanks to its small footprint and smart engineering that enables Perst to run on mobile platforms such as Windows Phone 7 and Java ME.
    eXtremeDB will be a candidate when eliminating latency is a key concern (Perst is quite fast, but not positioned for real-time applications), when the target system doesn’t have a JVM (or sufficient resources for one), when the system needs to support multiple programming languages, and/or when any of eXtremeDB’s advanced features are required.

    Q6. What are the current main technological developments for in-memory database systems?

    Steve Graves: At McObject, we’re excited about the potential of IMDS technology to scale horizontally, across multiple hardware nodes, to deliver greater scalability and fault-tolerance while enabling more cost-effective system expansion through the use of low-cost (i.e. “commodity”) servers. This enthusiasm is embodied in our new eXtremeDB Cluster edition, which manages data stores across distributed nodes. Among eXtremeDB Cluster’s advantages is that it eliminates any performance ceiling from being CPU-bound on a single server.

    Scaling across multiple hardware nodes is receiving a lot of attention these days with the emergence of NoSQL solutions. But database system clustering actually has much deeper roots. One of the application areas where it is used most widely is in telecommunications and networking infrastructure, where eXtremeDB has always been a strong player. And many emerging application categories – ranging from software-as-a-service (SaaS) platforms to e-commerce and social networking applications – can benefit from a technology that marries IMDSs’ performance and “real” DBMS features, with a distributed system model.

    Q7. What are the similarities and differences between current various database clustering solutions? In particular, let’s look at dimensions such as scalability, ACID vs. CAP, intended/applicable problem domains, structured vs. unstructured, and complexity of implementation.

    Steve Graves: ACID support vs. “eventual consistency” is a good place to start looking at the differences between clustering database solutions (including some cluster-like NoSQL products). ACID-compliant transactions will be Atomic, Consistent, Isolated and Durable; consistency implies the transaction will bring the database from one valid state to another and that every process will have a consistent view of the database. ACID-compliance enables an on-line bookstore to ensure that a purchase transaction updates the Customers, Orders and Inventory tables of its DBMS. All other things being equal, this is desirable: updating Customers and Orders while failing to change Inventory could potentially result in other orders being taken for items that are no longer available.
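    The bookstore example above can be sketched with Python's built-in sqlite3 module: either all three tables change, or none do. The table and column names here are invented for illustration.

    ```python
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE customers (name TEXT, orders_placed INTEGER);
        CREATE TABLE orders (customer TEXT, item TEXT);
        CREATE TABLE inventory (item TEXT, stock INTEGER);
        INSERT INTO customers VALUES ('alice', 0);
        INSERT INTO inventory VALUES ('book', 3);
    """)

    try:
        with con:  # one transaction: commits on success, rolls back on exception
            con.execute("UPDATE customers SET orders_placed = orders_placed + 1 "
                        "WHERE name = 'alice'")
            con.execute("INSERT INTO orders VALUES ('alice', 'book')")
            con.execute("UPDATE inventory SET stock = stock - 1 WHERE item = 'book'")
    except sqlite3.Error:
        pass  # on failure, none of the three updates becomes visible

    stock = con.execute("SELECT stock FROM inventory WHERE item = 'book'").fetchone()[0]
    print(stock)  # 2
    ```

    If any of the three statements raised an error, the `with con:` block would roll the whole purchase back, avoiding exactly the orders-without-inventory inconsistency described above.
    
    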

    However, enforcing the ACID properties becomes more of a challenge with distributed solutions, such as database clusters, because the node initiating a transaction has to wait for acknowledgement from the other nodes that the transaction can be successfully committed (i.e. there are no conflicts with concurrent transactions on other nodes). To speed up transactions, some solutions have relaxed their enforcement of these rules in favor of an “eventual consistency” that allows portions of the database (typically on different nodes) to become temporarily out-of-synch (inconsistent).

    Systems embracing eventual consistency will be able to scale horizontally better than ACID solutions – it boils down to their asynchronous rather than synchronous nature.

    Eventual consistency is, obviously, a weaker consistency model, and implies some process for resolving consistency problems that will arise when multiple asynchronous transactions give rise to conflicts. Resolving such conflicts increases complexity.
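    One common, simple conflict-resolution policy for eventually consistent replicas is last-write-wins on a timestamp. The sketch below is an illustrative Python toy, not any particular product's protocol; all names are invented.

    ```python
    # Two replicas accept writes independently and reconcile later
    # ("anti-entropy"), with the newer timestamp winning on conflict.

    replica_a = {}  # key -> (timestamp, value)
    replica_b = {}

    def write(replica, key, ts, value):
        replica[key] = (ts, value)

    def merge(dst, src):
        for key, (ts, value) in src.items():
            if key not in dst or dst[key][0] < ts:
                dst[key] = (ts, value)  # last write wins

    write(replica_a, "cart", ts=1, value=["book"])
    write(replica_b, "cart", ts=2, value=["book", "pen"])  # concurrent update elsewhere

    # The replicas are temporarily inconsistent; merging in both directions
    # converges them on the latest write.
    merge(replica_a, replica_b)
    merge(replica_b, replica_a)
    print(replica_a == replica_b, replica_a["cart"][1])  # True ['book', 'pen']
    ```

    Note the cost this policy pays for its simplicity: the ts=1 write is silently discarded, which is exactly the kind of conflict-resolution complexity (and potential data loss) described above.
    
    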

    Another area where clustering solutions differ is along the lines of shared-nothing vs. shared-everything approaches. In a shared-nothing cluster, each node has its own set of data.
    In a shared-everything cluster, each node works on a common copy of database tables and rows, usually stored in a fast storage area network (SAN). Shared-nothing architecture is naturally more complex: if the data in such a system is partitioned (each node has only a subset of the data) and a query requests data that “lives” on another node, there must be code to locate and fetch it. If the data is not partitioned (each node has its own copy) then there must be code to replicate changes to all nodes when any node commits a transaction that modifies data.
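The "code to locate and fetch" data in a partitioned shared-nothing cluster often reduces to hashing a key to its owning node. This toy sketch (node names and key scheme are made up for illustration) shows the deterministic routing step every node can compute locally:

```python
import hashlib

NODES = ["node0", "node1", "node2"]  # each holds a disjoint subset of rows

def node_for(key: str) -> str:
    """Deterministically map a primary key to the node that owns it."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]

# Every node runs the same function, so any node can locate the owner
# of a row without consulting a central catalog.
def put(stores: dict, key: str, value):
    stores[node_for(key)][key] = value

def get(stores: dict, key: str):
    return stores[node_for(key)].get(key)

stores = {n: {} for n in NODES}     # stand-ins for per-node storage
put(stores, "order:42", {"item": "book"})
print(get(stores, "order:42"))      # fetched from whichever node owns the key
```

In the replicated (non-partitioned) variant described above, `put` would instead have to apply the write to every node's store, which is where the replication machinery comes in.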

NoSQL solutions emerged in the past several years to address challenges that occur when scaling the traditional RDBMS. To achieve scale, these solutions generally embrace eventual consistency (consistent with the CAP Theorem, which holds that a distributed system cannot simultaneously provide all three of Consistency, Availability and Partition tolerance). And this choice defines the intended/applicable problem domains: specifically, it rules out systems that require strict consistency. However, many systems don’t have this strict consistency requirement – an on-line retailer such as the bookstore mentioned above may accept the occasional order for a non-existent inventory item as a small price to pay for being able to meet its scalability goals. Conversely, transaction processing systems typically demand absolute consistency.

NoSQL is often described as a better choice for so-called unstructured data. Whereas RDBMSs have a data definition language that describes a database schema and is recorded in a database dictionary, NoSQL databases are often schema-less, storing opaque “documents” that are keyed by one or more attributes for subsequent retrieval. Proponents argue that schema-less solutions free us from the rigidity imposed by the relational model and make it easier to adapt to real-world changes. Opponents argue that schema-less systems are for lazy programmers, that they create a maintenance nightmare, and that they have no equivalent of relational calculus or the ANSI SQL standard. But the entire structured-vs-unstructured discussion is tangential to database cluster solutions.
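The schema-less model described above can be mimicked with opaque documents plus a secondary index on one chosen attribute. This dict-based sketch illustrates only the idea, not any particular NoSQL product's API; the field names are invented:

```python
# Documents are opaque dicts; no schema is declared up front.
docs = {}          # primary key -> document
by_author = {}     # secondary index on one chosen attribute

def store(doc_id, doc):
    docs[doc_id] = doc
    # Index only documents that happen to carry an 'author' field;
    # documents of any other shape are accepted anyway -- nothing
    # enforces a schema.
    if "author" in doc:
        by_author.setdefault(doc["author"], []).append(doc_id)

store("d1", {"author": "Codd", "title": "A Relational Model"})
store("d2", {"payload": b"\x00\x01"})        # different shape, still fine
store("d3", {"author": "Codd", "year": 1979})

print([docs[i] for i in by_author["Codd"]])  # retrieval by indexed attribute
```

The flexibility (d2 needs no declared columns) and the maintenance concern (nothing stops d3 from silently diverging from d1's shape) are both visible in a few lines.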

Q8. Are in-memory database systems an alternative to classical disk-based relational database systems?

    Steve Graves: In-memory database systems are an ideal alternative to disk-based DBMSs when performance and efficiency are priorities. However, this explanation is a bit fuzzy, because what programmer would not claim speed and efficiency as goals? To nail down the answer, it’s useful to ask, “When is an IMDS not an alternative to a disk-based database system?”

    Volatility is pointed to as a weak point for IMDSs. If someone pulls the plug on a system, all the data in memory can be lost. In some cases, this is not a terrible outcome. For example, if a set-top box programming guide database goes down, it will be re-provisioned from the satellite transponder or cable head-end. In cases where volatility is more of a problem, IMDSs can mitigate the risk. For example, an IMDS can incorporate transaction logging to provide recoverability. In fact, transaction logging is unavoidable with some products, such as Oracle’s TimesTen (it is optional in eXtremeDB). Database clustering and other distributed approaches (such as master/slave replication) contribute to database durability, as does use of non-volatile RAM (NVRAM, or battery-backed RAM) as storage instead of standard DRAM. Hybrid IMDS technology enables the developer to specify persistent storage for selected record types (presumably those for which the “pain” of loss is highest) while all other records are managed in memory.
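Transaction logging, one of the mitigations mentioned above, amounts to appending each change to persistent storage before applying it in memory, so the in-memory state can be replayed after a crash. A minimal write-ahead sketch (the file format and function names are invented for illustration; real products like TimesTen or eXtremeDB implement this very differently):

```python
import json
import os

LOG = "txn.log"

def apply_txn(db: dict, txn: dict, log):
    # Write-ahead rule: persist the change before mutating in-memory state.
    log.write(json.dumps(txn) + "\n")
    log.flush()
    os.fsync(log.fileno())
    db[txn["key"]] = txn["value"]

def recover() -> dict:
    """Rebuild the in-memory database by replaying the log from disk."""
    db = {}
    if os.path.exists(LOG):
        with open(LOG) as f:
            for line in f:
                txn = json.loads(line)
                db[txn["key"]] = txn["value"]
    return db

with open(LOG, "w") as log:
    db = {}
    apply_txn(db, {"key": "channel", "value": "guide"}, log)

print(recover())  # same state as db, even after a simulated restart
```

The `fsync` call is the cost being discussed: each commit now waits on persistent storage, which is precisely the latency an IMDS tries to avoid when durability is not required.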

    However, all of these strategies require some effort to plan and implement. The easiest way to reduce volatility is to use a database system that implements persistent storage for all records by default – and that’s a traditional DBMS. So, the IMDS use-case occurs when the need to eliminate latency outweighs the risk of data loss or the cost of the effort to mitigate volatility.

It is also the case that flash memory and, especially, spinning disks are much less expensive than DRAM, which puts an economic lid on very large in-memory databases for all but the richest users. And, riches notwithstanding, it is not yet possible to build a system with hundreds of terabytes, let alone petabytes or exabytes, of memory, whereas disk-based storage has no such limitation.

    By continuing to use traditional databases for most applications, developers and end-users are signaling that DBMSs’ built-in persistence is worth its cost in latency. But the growing role of IMDSs in real-time technology ranging from financial trading to e-commerce, avionics, telecom/Netcom, analytics, industrial control and more shows that the need for speed and efficiency often outweighs the convenience of a traditional DBMS.

    Steve Graves is co-founder and CEO of McObject, a company specializing in embedded Database Management System (DBMS) software. Prior to McObject, Steve was president and chairman of Centura Solutions Corporation and vice president of worldwide consulting for Centura Software Corporation.

    Related Posts

    A super-set of MySQL for Big Data. Interview with John Busch, Schooner.

    Re-thinking Relational Database Technology. Interview with Barry Morris, Founder & CEO NuoDB.

    On Data Management: Interview with Kristof Kloeckner, GM IBM Rational Software.

vFabric SQLFire: Better than RDBMS and NoSQL?

    Related Resources

    ODBMS.ORG: Free Downloads and Links:
    Object Databases
    NoSQL Data Stores
    Graphs and Data Stores
    Cloud Data Stores
    Object-Oriented Programming
    Entity Framework (EF) Resources
    ORM Technology
    Object-Relational Impedance Mismatch
    Databases in general
    Big Data and Analytical Data Platforms


On Big Data Analytics: Interview with Florian Waas, EMC/Greenplum. http://www.odbms.org/blog/2012/02/on-big-data-analytics-interview-with-florian-waas-emcgreenplum/ Wed, 01 Feb 2012

“With terabytes, things are actually pretty simple — most conventional databases scale to terabytes these days. However, try to scale to petabytes and it’s a whole different ball game.” –Florian Waas.

    On the subject of Big Data Analytics, I interviewed Florian Waas (flw). Florian is the Director of Software Engineering at EMC/Greenplum and heads up the Query Processing team.


    Q1. What are the main technical challenges for big data analytics?

Florian Waas: Put simply, in the Big Data era the old paradigm of shipping data to the application no longer works. Rather, the application logic must “come” to the data, or else things will break: this runs counter to conventional wisdom and the established notion of strata within the database stack.
Instead of stand-alone products for ETL, BI/reporting and analytics, we have to think about seamless integration: in what ways can we open up a data processing platform to let applications get closer to the data?
What language interfaces can we offer, and what resource management facilities? And so on.

At Greenplum, we’ve pioneered a couple of ways to make this integration a reality: a few years ago with a Map-Reduce interface for the database, and more recently with MADlib, an open source in-database analytics package. In fact, both rely on a powerful query processor under the covers that automatically ships application logic directly to the data.
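The "bring the logic to the data" idea can be shown in miniature: each partition applies the application function where the data lives, and only small partial results travel. This is a generic map-reduce sketch of the principle, not Greenplum's actual interface; the partitioning and word-count task are invented:

```python
from collections import Counter
from functools import reduce

# Data is partitioned across nodes; the word-count logic runs on each
# partition locally, and only compact Counter objects are shipped back.
partitions = [
    ["big", "data", "big"],
    ["data", "analytics"],
]

def map_local(partition):
    return Counter(partition)   # executes on the node holding the data

def merge(a, b):
    return a + b                # small partial results combined centrally

totals = reduce(merge, (map_local(p) for p in partitions))
print(totals)  # Counter({'big': 2, 'data': 2, 'analytics': 1})
```

Shipping the raw word lists to a central application and counting there would invert this: in the petabyte regime discussed below, that data movement is exactly what breaks.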

    Q2. When dealing with terabytes to petabytes of data, how do you ensure scalability and performance?

    Florian Waas: With terabytes, things are actually pretty simple — most conventional databases scale to terabytes these days. However, try to scale to petabytes and it’s a whole different ball game.
Scale and performance requirements strain conventional databases. Almost always, the problems are a matter of the underlying architecture. If not built for scale from the ground up, a database will ultimately hit the wall — this is what makes it so difficult for the established vendors to play in this space: you cannot simply retrofit a 20+ year-old architecture to become a distributed MPP database overnight.
Having said that, over the past few years a whole crop of new MPP database companies has demonstrated that multiple petabytes don’t pose a terribly big challenge if you approach them with the right architecture in mind.

    Q3. How do you handle structured and unstructured data?

Florian Waas: As a rule of thumb, we suggest to our customers that they use Greenplum Database for structured data and consider Greenplum HD—Greenplum’s enterprise Hadoop edition—for unstructured data. We’ve equipped both systems with high-performance connectors for importing and exporting data to each other, which makes for a smooth transition whether one system pre-processes data for the other, Greenplum Database queries HD directly, or whatever combination the application scenario might call for.

Having said this, we have seen a growing number of customers load highly unstructured data directly into Greenplum Database and convert it into structured data on the fly through in-database logic for data cleansing, etc.
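Converting unstructured input into structured rows "on the fly" typically looks like the cleansing step below: parse what matches an expected shape, drop or divert what doesn't. The log format and regular expression here are hypothetical, chosen only to illustrate the pattern:

```python
import re

# Raw, unstructured lines as they might arrive in a landing area.
raw = [
    "2012-01-31 ERROR disk full on /dev/sda1",
    "2012-01-31 INFO  backup complete",
    "not a log line at all",
]

PATTERN = re.compile(r"^(\d{4}-\d{2}-\d{2})\s+(ERROR|INFO|WARN)\s+(.*)$")

def structure(lines):
    """Yield (date, level, message) rows; skip lines that don't parse."""
    for line in lines:
        m = PATTERN.match(line)
        if m:
            yield m.groups()

rows = list(structure(raw))
print(rows)
# [('2012-01-31', 'ERROR', 'disk full on /dev/sda1'),
#  ('2012-01-31', 'INFO', 'backup complete')]
```

Running such logic in-database, as the answer describes, means the raw lines never leave the platform before becoming queryable rows.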

Q4. Cloud computing and open source: do they play a role at Greenplum? If yes, how?

    Florian Waas: Cloud computing is an important direction for our business and hardly any vendor is better positioned than EMC in this space. Suffice it to say, we’re working on some exciting projects.
    So, stay tuned!

As you know, Greenplum has historically been very close to the open source movement. Besides our ties with the Postgres and Hadoop communities, we released our own open source distribution of MADlib for in-database analytics (see also madlib.net).

    Q5. In your blog you write that classical database benchmarks “aren’t any good at assessing the query optimizer”. Can you please elaborate on this?

Florian Waas: Unlike customer workloads, standard benchmarks pose few challenges for a query optimizer – the emphasis in these benchmarks is on query execution and storage structures. Recently, several systems with no query optimizer to speak of have scored top results in the TPC-H benchmark.
And, while impressive in these benchmarks, such systems usually do not perform well in customer accounts when faced with ad-hoc queries — that’s where a good optimizer makes all the difference.

    Q6. Why do we need specialized benchmarks for a subcomponent of a database?

    Florian Waas: On the one hand, an optimizer benchmark will be a great tool for consumers.
    A significant portion of the total cost of ownership of a database system comes from the cost of query tuning and manual query rewriting, in other words, the shortcomings of the query optimizer. Without an optimizer benchmark it’s impossible for consumers to compare the maintenance cost. That’s like buying a car without knowing its fuel consumption!

    On the other hand, an optimizer benchmark will be extremely useful for engineering teams in optimizer development. It’s somewhat ironic that vendors haven’t invested in a methodology to show off that part of the system where most of their engine development cost goes.

    Q7. Are you aware of any work in this area (Benchmarking query optimizers)?

Florian Waas: Funny you should ask. Over the past months I’ve been working with coworkers and colleagues in the industry on some techniques – we’re still far away from a complete benchmark, but we’ve made some inroads.

Q8. You have done some work on “dealing with plan regressions caused by changes to the query optimizer”. Could you please explain what the problem is and what kind of solutions you developed?

Florian Waas: A plan regression is a regression of a query due to changes to the optimizer from one release to the next. For the customer this could mean that, after an upgrade or patch release, one or more of their truly critical queries runs slower – maybe even so slowly that it starts impacting their daily business operations.

With current test technology, plan regressions are very hard to guard against, simply because the size of the input space makes perfect test coverage impossible. This dilemma has made a number of vendors increasingly risk-averse and has turned into the biggest obstacle to innovation in this space. Some vendors came up with rather reactionary safety measures. To use another car analogy: many of these are akin to driving with defective brakes but wearing a helmet in the hope that this will help prevent the worst in a crash.

I firmly believe in fixing the defective brakes, so to speak, and developing better test and analysis tools. We’ve made good progress on this front and are starting to see some payback already. This is an exciting and largely under-developed area of research!

Q9. Glenn Paulley of Sybase, in a keynote at SIGMOD 2011, asked the question of “how much more complexity can database systems deal with?” What is your take on this?

    Florian Waas: Unnecessary complexity is bad. I think everybody will agree with that. Some complexity is inevitable though, and the question becomes: How are we dealing with it?

    Database vendors have all too often fallen into the trap of implementing questionable features quickly without looking at the bigger picture. This has led to tons of internal complexity and special casing, not to mention the resulting spaghetti code.
When abstracted correctly and broken down into sound building blocks, a lot of complexity can actually be handled quite well. Again, query optimization is a great example here: modern optimizers can be a joy to work with.
They are built and maintained by small, surgical teams that innovate effectively, whereas older designs require literally dozens of engineers just to maintain the code base and fix bugs.

    In short, I view dealing with complexity primarily as an exciting architecture and design challenge and I’m proud we assembled a team here at Greenplum that’s equally excited to take on this challenge!

    Q10. I published an interview with Marko Rodriguez and Peter Neubauer, leaders of the Tinkerpop2 project. What is your opinion on Graph Analysis and Manipulation for databases?

Florian Waas: Great stuff these guys are building – I’m interested to see how we can combine Big Data with graph analysis!

    Q11. Anything else you wish to add?

    Florian Waas: It’s been fun!

    Florian Waas (flw) is Director of Software Engineering at EMC/Greenplum and heads up the Query Processing team. His day job is to bring theory and practice together in the form of scalable and robust database technology.

Related Posts
    On Data Management: Interview with Kristof Kloeckner, GM IBM Rational Software.

    On Big Data: Interview with Shilpa Lawande, VP of Engineering at Vertica.

    On Big Data: Interview with Dr. Werner Vogels, CTO and VP of Amazon.com

    Related Resources

ODBMS.ORG: Resources on Analytical Data Platforms: Blog Posts | Free Software | Articles

