“My own personal opinion is that data analysis is much less important than data re-analysis. It’s hard for a data team to get things right on the very first try, and the team shouldn’t be faulted for their honest efforts. When everything is available for review, and when more data is added over time, you’ll increase your chances of converging to someplace near the truth.”–Jules J. Berman.
I have interviewed Jules J. Berman, former President of the Association for Pathology Informatics. The focus of the interview is on how to manage Big Data.
Q1. In your experience what are the common mistakes that endanger most Big Data projects?
Jules J. Berman: Overconfidence is the biggest culprit. The creators of Big Data resources like to believe that they have collected all the data relevant to their domain, that all of the data is accurate, and that the data is organized in a manner that supports meaningful data searches. The Big Data analysts like to believe that their results and conclusions are correct. Hah!
Q2. How do you organize large volumes of complex data? Any insights you could give us on this?
Jules J. Berman: Large volumes of Big Data are organized the same way that humans organize the large volumes of complex data held in their brains: through classification. We could not cope with all the sensory input we receive each day if we did not bin visual objects into categories.
There is a science to constructing classifications, and if the science is misapplied, then the complex data objects held in a Big Data resource cannot be sensibly retrieved, or collected with objects to which they are logically related. Novices to the field make two common errors: confusing properties with classes (e.g., creating red-colored objects as a new class), or assigning a part of an object as a subclass (e.g., making “legs” a subclass of “person”). Just like any other science, the science of classification must be studied, practiced, and mastered.
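The two novice errors described above can be sketched in a few lines of Python. This is only an illustration: the class names (Vehicle, Car, Person, Leg) are hypothetical and do not come from the interview or the book.

```python
from dataclasses import dataclass, field


@dataclass
class Vehicle:
    # Color is a property of an instance, so it is a field.
    # A "RedColoredObject" class would confuse a property with a class.
    color: str


@dataclass
class Car(Vehicle):
    # "A car is a vehicle" holds, so Car is a legitimate subclass.
    pass


@dataclass
class Leg:
    side: str


@dataclass
class Person:
    # "A leg is a person" is false, so Leg must not subclass Person.
    # A part is modeled by composition instead: a person has legs.
    legs: list = field(default_factory=lambda: [Leg("left"), Leg("right")])


red_car = Car(color="red")
person = Person()
```

A quick check for any proposed subclass is the "is a" test: if "every X is a Y" sounds wrong, X is probably a property or a part of Y, not a subclass.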
Q3. You have been working on data permanence: what does it mean in practice? How can it be achieved when the content of the data is constantly changing?
Jules J. Berman: Everyone knows the slogan from Orwell’s masterpiece, 1984: “Big Brother is watching you”. If you’ve read the book, you’ll remember that there was another major theme; one that involved data mutability. The minions of Big Brother were constantly fiddling with collected data to distort reality. Because Big Brother held all the data, Big Brother could create perceptions of reality that suited the totalitarian state.
I see the problem of data mutability (i.e., the ability to modify, delete, or fabricate data) as being much more important than issues related to over-surveillance. In hospitals, the regrettable act of “retro-noting” (i.e., inserting patient notes out of sequence to cover omissions, or to justify billing, or to eradicate errors), is an example of data mutability.
The solution involves employing time stamps and metadata, and procedures that block data erasures. Data mutability, and the related topic of missing legacy data, are two of my favorite issues, and they are both covered in my book, “Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information.”
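One way to picture time stamps, metadata, and blocked erasures is an append-only log, where a correction is a new entry that points at the old one. The sketch below is an assumption-laden illustration, not the method from the book: the class name AppendOnlyLog, its fields, and the hash chaining used to detect tampering are all additions made for this example.

```python
import hashlib
import json
from datetime import datetime, timezone


class AppendOnlyLog:
    """Hypothetical log: entries can be added or superseded, never edited."""

    def __init__(self):
        self._entries = []

    def append(self, record_id, content, supersedes=None):
        # Each entry carries a time stamp and the hash of the entry before it.
        previous_hash = self._entries[-1]["hash"] if self._entries else ""
        entry = {
            "record_id": record_id,
            "content": content,
            "supersedes": supersedes,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "previous_hash": previous_hash,
        }
        payload = json.dumps(entry, sort_keys=True).encode("utf-8")
        entry["hash"] = hashlib.sha256(payload).hexdigest()
        self._entries.append(entry)
        return entry["hash"]

    def verify(self):
        # Recompute every hash; an edited or deleted entry breaks the chain.
        previous_hash = ""
        for entry in self._entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            payload = json.dumps(body, sort_keys=True).encode("utf-8")
            if entry["previous_hash"] != previous_hash:
                return False
            if hashlib.sha256(payload).hexdigest() != entry["hash"]:
                return False
            previous_hash = entry["hash"]
        return True

    def entries(self):
        # Hand out copies so callers cannot modify stored entries.
        return [dict(entry) for entry in self._entries]


log = AppendOnlyLog()
log.append("note-1", "Patient admitted.")
log.append("note-2", "Correction: admitted at 14:00.", supersedes="note-1")
```

The original note stays in the log with its own time stamp, so a late correction is visible as a correction instead of silently replacing what was written first.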
Q4. Data verification: what are the challenges?
Jules J. Berman: The biggest challenge involves getting data analysts to take the topic of data verification seriously. I personally know data scientists who have the attitude that data verification is “not my job.” They believe that they have no control over the data; their job is to do the best they can with the data they receive.
I think that we really must get everybody on board with the idea that data needs to be verified. The task of creating verified sets of data is child’s play compared with the professional issues instigated by recalcitrant data scientists.
Q5. Data validation: what are the challenges?
Jules J. Berman: There are many ways of thinking about validation, but my perception is that most people in the field are approaching validation as a post-analytic process, wherein old conclusions are tested on new data, or tested on alternate data sources, or are re-calculated on a regular basis. The validation process is aimed at determining whether what seems true for me today will be true for you and me, today and tomorrow.
Like anything else in Big Data, it requires work and vigilance, and a delay in gratification.
Q6. Are there any general methods for data verification and validation that can be specifically applied to Big Data resources?
Jules J. Berman: There’s a large literature out there on this subject. In my opinion, the methods are not as important as the documentation. Protocols must be written, actions must be recommended, and steps must be taken to implement corrections. If you’re serious about Big Data, you must be serious about documenting everything: how you found errors, what you did to correct the errors, what you did to make sure that future errors of the same kind will not occur, what you did to monitor the occurrence of future errors of the same type. It never seems to end, but it’s just part of the job.
Q7. How would you find relationships among data objects held in disparate Big Data resources: Could you give us some examples?
Jules J. Berman: In my book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information, published in May, I give real-world examples for all of the points raised in this interview, but my favorite “reach” into disparate data involves some inventive research into the sinking of the Titanic.
Here’s an excerpt from my book: “A recent headline story explains how century-old tidal data plausibly explained the appearance of the iceberg that sank the Titanic, on April 15, 1912. Records show that several months earlier, in 1912, the moon, earth, and sun aligned to produce a strong tidal pull, and this happened when the moon was the closest to the earth in 1,400 years. The resulting tidal surge was sufficient to break the January Labrador ice sheet, sending an unusual number of icebergs towards the open North Atlantic waters.
The Labrador icebergs arrived in the commercial shipping lanes four months later, in time for a fateful rendezvous with the Titanic. Back in January 1912, when tidal measurements were being collected, nobody foresaw that the data would be examined a century later.”
Of course, the finest tool for finding relationships among data objects held in disparate Big Data resources is the human brain. Good data analysts spend lots of time surveying the data held in various resources. When you spend the time, the inspirational moments will come, and you will begin to synthesize new relationships among data from different knowledge domains. Typically, analysis follows inspiration; not vice versa.
Q8. Data integration: how can data be extracted and integrated with data from other resources?
Jules J. Berman: Of course, standards, specifications, and metadata play an important role.
The Holy Grail in the Big Data field involves finding and implementing standard methods for organizing and tagging data, so that every piece of data held on any computer, can be linked and combined into a virtual Super-Big Data resource.
On a less grand scale, it’s always nice when workers in a common field collect their data in a standard form.
In most cases, I’ve been favoring specifications over standards. Data standards seldom, if ever, “fit” your data correctly, are prone to re-versioning, often cost money, and usually come with a fine-print license that restricts how the standards are used and how your annotated data are distributed. Specifications are recommendations for describing data; RDF is a good example. Specifications provide the flexibility required for complex data, along with the structure required for data integration.
A smart data manager can do a lot more with a specification than with a standard.
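To make the RDF example concrete, here is a minimal sketch in plain Python of data described as subject-predicate-object triples. It does not use an RDF library, and the namespace, identifiers, and the two "hospital" sources are hypothetical.

```python
# Hypothetical namespace used only for this illustration.
EX = "http://example.org/"

# Each statement is a (subject, predicate, object) triple, as in RDF.
hospital_a = {
    (EX + "patient/42", EX + "hasDiagnosis", EX + "diagnosis/asthma"),
    (EX + "patient/42", EX + "birthYear", "1980"),
}
hospital_b = {
    (EX + "patient/42", EX + "hasAllergy", EX + "allergen/penicillin"),
    (EX + "patient/42", EX + "birthYear", "1980"),
}

# Both sources describe data the same way, so integration is a set union.
merged = hospital_a | hospital_b


def describe(graph, subject):
    """Return all (predicate, object) pairs asserted about a subject."""
    return sorted((p, o) for s, p, o in graph if s == subject)


facts = describe(merged, EX + "patient/42")
```

Neither source had to adopt a fixed record layout; each simply added the statements it had, which is the flexibility the answer above attributes to specifications.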
Q9. What about Big Data sharing?
Jules J. Berman: Data sharing is absolutely essential to the field of data science. If the data upon which your assertions are based is unavailable to the public, then why would anyone believe your results and conclusions?
In the Big Data realm, there are lots of things that can go wrong with a data analysis project. The chances that any new analysis is correct, on first pass, are slim to none. Everything must be repeated over and over, critiqued, and validated on fresh data.
My own personal opinion is that data analysis is much less important than data re-analysis. It’s hard for a data team to get things right on the very first try, and the team shouldn’t be faulted for their honest efforts. When everything is available for review, and when more data is added over time, you’ll increase your chances of converging to someplace near the truth.
Jules Berman received two baccalaureate degrees from MIT; in Mathematics, and in Earth and Planetary Sciences. He received the Ph.D. from Temple University, and the M.D. from the U. of Miami.
He received post-doctoral training at NIH and residency training at Geo. Washington U Med Ctr. He is board certified in anatomic pathology and in cytopathology. He served as Chief of Anatomic Pathology, Surgical Pathology and Cytopathology at the Veterans Administration Medical Center in Baltimore, Maryland, where he held joint appointments at the University of Maryland Medical Center and the Johns Hopkins Medical Institutions. In 1998, he became a Medical Officer at the U.S. National Cancer Institute and served as the Program Director for Pathology Informatics in the Institute’s Cancer Diagnosis Program. In 2006, Jules Berman was President of the Association for Pathology Informatics. In 2011 he received the Lifetime Achievement Award from the Association for Pathology Informatics. Today, Jules Berman is a free-lance writer. He has first-authored more than 100 articles and 11 book titles in science and medicine.
- Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information, Jules J Berman, Ph.D., M.D. Paperback: 288 pages, Morgan Kaufmann; 1 edition (June 13, 2013), ISBN-10: 0124045766
Follow ODBMS.org on Twitter: @odbmsorg
“A true columnar store is not only about the way you store data, but the engine and the optimizations that are enabled by the column store”–Shilpa Lawande.
I have interviewed Shilpa Lawande, VP Engineering & Customer Experience at HP Vertica, on the subject of column stores and the important features in the new release of the HP Vertica Analytics Platform.
Q1. Back in 2011 I did an interview with you, at the time Vertica had just been acquired by HP. What is new in the current version of Vertica?
Shilpa Lawande: We’ve come a long way since 2011 and our innovation engine is going strong!
From “Bulldozer” to “Crane” and now “Dragline,” we’ve built on our columnar-compressed, MPP share-nothing core, expanded security and manageability, dramatically expanded data ingestion capabilities, and what’s most exciting is that we’ve added a host of advanced analytics functions and extensibility APIs to the HP Vertica Analytics Platform itself. One key innovation is our ability to ingest and auto-schematize semi-structured data using HP Vertica Flex Zone, which takes away much of the friction in the analytic life-cycle from exploration to production.
We’ve also grown a vibrant community of practitioners and an ecosystem of complementary tools, including Hadoop.
Dragline, our next release of the HP Vertica Analytics Platform, addresses the needs of the most demanding, analytic-driven organizations by providing many new features, including:
- Project Maverick’s Live Aggregate Projections to speed up queries that rely on resource-intensive aggregate functions like SUM, MIN/MAX, and COUNT.
- Dynamic mixed workload management, which identifies and adapts to varying query complexities (simple and ad-hoc queries as well as long-running advanced queries) and dynamically assigns the appropriate amount of resources to meet the needs of all data consumers.
- HP Vertica Pulse, which helps organizations leverage an in-database sentiment analysis tool that scores short data posts, including social data, such as Twitter feeds or product reviews, to gauge the most popular topics of interest, analyze how sentiment changes over time and identify advocates and detractors.
- HP Vertica Place, which stores and analyzes geospatial data in real time, including locations, networks and regions.
- An expanded SQL-on-Hadoop offering that gives users freedom to pick their data formats and where to store their data, including HDFS, but still benefit from the power of the Vertica analytic engine.
Of course, there’s a lot more to the “Dragline” release, but these are the highlights.
Q2. Vertica is referred to as an analytics platform. How does it differentiate with respect to conventional relational database systems (RDBMSes)?
Shilpa Lawande: Good question. First, let me clear the misconception that column stores are not relational – Vertica is a relational database, an RDBMS – it speaks tables and columns and standard SQL and ODBC, and like your favorite RDBMS, talks to a variety of BI tools. Now, there are many variations in the database market from low-cost solutions that lack advanced analytics to high-end solutions that can’t handle big data.
HP Vertica is the only one purpose-built for big data analytics – most conventional RDBMS were purpose-built for OLTP and then retrofitted for analytics. Vertica’s core architecture, with columnar storage, a columnar engine, aggressive use of data compression, our scale-out architecture, and, most importantly, our unique hybrid load architecture, enables what we call real-time analytics, which gives us the edge over the competition.
You can keep loading your data throughout the day, not in batch at night, and you can query the data as it comes in, without any specialized indexes, materialized views, or other pre-processing. And we have a huge and ever-growing library of features and functions to explore and perform analytics on big data, both structured and semi-structured. All of these core capabilities add up to a powerful analytics platform, far beyond a conventional relational database.
Q3. Vertica is column-based. Could you please explain what are the main technological differences with respect to a conventional relational database system?
Shilpa Lawande: It’s about performance. A conventional RDBMS is bottlenecked with disk I/O.
The reason for this is that with a traditional database, data is stored on disks in a row-wise manner, so even if the query needs only a few columns, the entire row must be retrieved from disk. In analytic workloads, often there are hundreds of columns in the data and only a few are used in the query, so row-oriented databases simply don’t scale as the data sets get large.
Vendors who offer this type of database often require that you create indexes and materialized views to retrieve the relevant data in a reasonable amount of time. With columnar storage, you store data for each column separately, so that you can grab just the columns you need to answer the query. This can speed query times immensely, so that hour-long queries can run in minutes or seconds. Furthermore, Vertica stores and processes the data sorted, which enables us to do all manner of interesting optimizations to queries that further boost performance.
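The difference between the two layouts can be sketched with a toy in-memory table. This is a simplified illustration, not Vertica's storage format: the table, its column names, and the "values read" counter standing in for disk I/O are all assumptions.

```python
# Row-wise layout: each record keeps all of its columns together.
rows = [
    {"order_id": 1, "region": "east", "amount": 120.0, "notes": "gift wrap"},
    {"order_id": 2, "region": "west", "amount": 80.0, "notes": ""},
    {"order_id": 3, "region": "east", "amount": 45.5, "notes": "rush"},
]

# Column-wise layout: one list per column, stored separately.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def total_amount_row_store(rows):
    """Sum one column, counting every value that had to be fetched."""
    values_read = 0
    total = 0.0
    for row in rows:
        values_read += len(row)  # the whole row comes back from storage
        total += row["amount"]
    return total, values_read


def total_amount_column_store(columns):
    """Sum one column by touching only that column's values."""
    amounts = columns["amount"]
    return sum(amounts), len(amounts)


row_total, row_reads = total_amount_row_store(rows)
col_total, col_reads = total_amount_column_store(columns)
```

Both functions return the same total, but the row layout touches four values per record while the column layout touches one. With hundreds of columns and only a few used per query, that gap is the I/O saving described above.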
Some of the traditional database vendors out there claim they now have columnar store, but a true columnar store is not only about the way you store data, but the engine and the optimizations that are enabled by the column store.
For instance, an optimization called late materialization allows Vertica to delay retrieval of columns as late as possible in query processing, so that minimal I/O and data movement is done until absolutely necessary. Vertica is the only engine that is true columnar; everything else out there is a retrofit of a general purpose engine that can read some kind of a columnar format.
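The idea of delaying column retrieval can be shown with a small sketch. It illustrates the general technique only and says nothing about Vertica's internals; the function name select_late and the sample columns are hypothetical.

```python
columns = {
    "region": ["east", "west", "east", "north"],
    "amount": [120.0, 80.0, 45.5, 300.0],
    "customer": ["ann", "bob", "cy", "dee"],
}


def select_late(columns, filter_col, predicate, output_cols):
    # Step 1: scan only the filter column and keep matching row positions.
    positions = [i for i, v in enumerate(columns[filter_col]) if predicate(v)]
    # Step 2: fetch the output columns only at those positions, so values
    # from rows that fail the filter are never read.
    return [tuple(columns[c][i] for c in output_cols) for i in positions]


result = select_late(
    columns, "region", lambda r: r == "east", ["customer", "amount"]
)
```

Here result holds the customer and amount values for the two "east" rows only; the other rows' customer and amount values are never fetched.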
Q4. What is so special of Vertica data compression?
Shilpa Lawande: The capability of Vertica to store data in columns allows us to take advantage of the similar traits in data. This gives us not only a footprint reduction in the disk needed to store data, but also an I/O performance boost — compressed data takes a shorter time to load. But, even more importantly, we use various encoding techniques on the data itself that enable us to process the data without expanding it first.
We have over a dozen schemes for how we store the data to optimize its storage, retrieval, and processing.
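Processing data "without expanding it first" can be illustrated with run-length encoding, chosen here only as a familiar example. The interview does not name the specific schemes, so treat the choice of encoding and the helper names below as assumptions.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_sum(encoded):
    """Sum the column directly from the runs, without decoding them."""
    return sum(value * count for value, count in encoded)


def rle_count_equal(encoded, target):
    """Count rows equal to target by adding run lengths."""
    return sum(count for value, count in encoded if value == target)


# Sorted data produces long runs, which is where this encoding pays off.
sorted_quantities = [1, 1, 1, 1, 2, 2, 5, 5, 5]
encoded = rle_encode(sorted_quantities)
```

Nine values shrink to three pairs, and both aggregates work on those three pairs. This also shows why storing a column sorted helps: sorting is what creates the long runs.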
Q5. Vertica is designed for massively parallel processing (MPP). What is it?
Shilpa Lawande: Vertica is a database designed to run on a cluster of industry-standard hardware.
There are no special-purpose hardware components. The database is based on a shared-nothing architecture, where many nodes each store part of the database and do part of the work in processing queries. We optimize the processing so as to minimize data traffic over the network. We have built-in high availability to handle node failures. We also have a sophisticated elasticity mechanism that allows us to efficiently add and remove nodes from the cluster. This enables us to scale out to very large data sizes and handle very large data problems. In other words, it is massively parallel processing!
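A shared-nothing aggregation can be sketched as three steps: partition the rows across nodes, let each node aggregate its own slice, and merge the small partial results. The sketch below simulates nodes as Python lists; the hash-partitioning scheme and all function names are assumptions for illustration, not a description of Vertica.

```python
import zlib
from collections import Counter


def node_for(key, node_count):
    # A stable hash decides which node owns a key.
    return zlib.crc32(key.encode("utf-8")) % node_count


def partition(rows, node_count):
    """Place each (region, amount) row on the node that owns its region."""
    nodes = [[] for _ in range(node_count)]
    for region, amount in rows:
        nodes[node_for(region, node_count)].append((region, amount))
    return nodes


def local_sum(node_rows):
    """Work done independently on one node, using only its own rows."""
    partial = Counter()
    for region, amount in node_rows:
        partial[region] += amount
    return partial


def merge(partials):
    """Combine the per-node partial sums into the final answer."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


rows = [("east", 10), ("west", 5), ("east", 7), ("north", 3), ("west", 1)]
nodes = partition(rows, node_count=3)
totals = merge(local_sum(node_rows) for node_rows in nodes)
```

Only the per-node partial sums cross the simulated network, not the raw rows, which is one way to keep data traffic low.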
Q6. In the past, columnar databases were said to be slow to load. Is it still true now?
Shilpa Lawande: This may have been true with older, unsophisticated columnar databases. We have customers loading over 35 TB of data per hour into Vertica, so I think we’ve put that one squarely to rest.
Q7. Who are the users ready to try column-based “data slicers”? And for what kind of use cases?
Shilpa Lawande: Vertica is a technology broadly applicable in many industries and in many business situations. Here are just a few of them.
Data Warehouse Modernization – the customer has an underperforming data warehouse solution in place and wants to replace or augment their current analytics with a solution that will scale and deliver faster analytics at a lower overall TCO, with substantially fewer hardware resources.
Hadoop Acceleration – the customer has bought into Hadoop for a data lake solution and would like a more expressive and faster SQL-on-Hadoop solution or an analytic platform that can offer real-time analytics for production use.
Predictive analytics – the customer has some kind of machine data, clickstream logs, call detail records, security event data, network performance data, etc. over long periods of time and they would like to get value out of this data via predictive analytics. Use cases include website personalization, network performance optimization, security threat forensics, quality control, predictive maintenance, etc.
Q8. What are the typical indicators which are used to measure how well systems are running and analyzing data in the enterprise? In other words, how “good” is the value derived from analyzing (Big) Data?
Shilpa Lawande: There are many, many advantages and places to derive value from big data.
First, just having the ability to answer your daily analytics faster can be a huge boost for the organization. For example, we had one brick-and-mortar retailer who wanted to brief sales associates and managers daily on what the hottest selling products were, who had inventory and other store trends. With their legacy analytics system, they could not deliver analytics fast enough to have these analytics on hand. With Vertica, they now provide very detailed (and I might add graphically pleasing) analytics across all of their stores, right in the hands of the store manager via a tablet device. The analytics has boosted sales performance and efficiency across the chain. The user experience they get wouldn’t be possible without the speed of Vertica.
But what is most exciting to me is when Vertica is used to save lives and the environment. We have a client in the medical field who has used Vertica analytics to better detect infections in newborn infants by leveraging the data they have from the NICU. It’s difficult to detect infections in newborns because they don’t often run a fever, nor can they explain how they feel. The estimate is that this big data analytics has saved the lives of hundreds of newborn babies in the first year of use. Another example is the HP Earth Insights project, which used Vertica to create an early warning system to identify species threatened by destruction of tropical forests around the world.
This project, done in cooperation with Conservation International, is making an amazing difference to scientists and helping inform and influence policy decisions around our environment.
There are a LOT of great use cases like these coming out of the Vertica community.
Q9. What are the main technical challenges when analyzing data at speed?
Shilpa Lawande: In an analytics system, you tend to have a lot going on at the same time. There are data loads, both in batch and trickle loads. There is daily and regular analytics for generating daily reports. There may be data discovery where users are trying to find value in data. Of course, there are dashboards that executives rely upon to stay up to date. Finally, you may have ad-hoc queries that come in and try to take away resources. So perhaps the biggest challenge is dealing with all of these workloads and coming up with the most efficient way to manage it all.
We’ve invested a lot of resources in this area and the fruit of that labor is very much evident in the “Dragline” release.
Q10. Do you have some concrete example of use cases where HP Vertica is used to analyze data at speed?
Shilpa Lawande: Yes, we have many, see here.
Q11. How does HP Vertica differ from other analytical platforms offered by competitors such as IBM and Teradata, and from in-memory databases such as SAP HANA?
Shilpa Lawande: Vertica offers everything that’s good about legacy data warehouse technologies like the ability to use your favorite visualization tools, standard SQL, and advanced analytic functionality.
In general, the legacy databases you mentioned are pretty good at handling analysis of business data, but they are still playing catch-up when it comes to big data – the volume, variety, and velocity. A row store simply cannot deliver the analytical performance and scale of an MPP columnar platform like Vertica.
In-memory databases are a good acceleration solution for some classes of business analytics, but, again, when it comes to very large data problems, the economics of putting all the data in memory simply do not work. That said, Vertica itself has an in-memory component which is at the core of our high-speed loading architecture, so I believe we have the best of both worlds – ability to use memory where it matters and still support petabyte scales!
Shilpa Lawande has been an integral part of the Vertica engineering team from its inception to its acquisition by HP in 2011. Shilpa brings over 15 years of experience in databases, data warehousing and grid computing to HP/Vertica.
Besides being responsible for Vertica’s Engineering team, Shilpa also manages the Customer Experience organization for Vertica including Customer Support, Training and Professional Services. Prior to Vertica, she was a key member of the Oracle Server Technologies group where she worked directly on several data warehousing and self-managing features in the Oracle Database.
Shilpa is a co-inventor on several patents on query optimization, materialized views and automatic index tuning for databases. She has also co-authored two books on data warehousing using the Oracle database as well as a book on Enterprise Grid Computing. She has been named to the 2012 Women to Watch list by Mass High Tech and awarded HP Software Business Unit Leader of the year in 2012.
Shilpa has a Masters in Computer Science from the University of Wisconsin-Madison and a Bachelors in Computer Science and Engineering from the Indian Institute of Technology, Mumbai.
“Hadoop continues to mature with regards to structuring data and interactive query, so future overlap between Hadoop and OLAP will increase.”–John Schroeder.
I have interviewed John Schroeder, CEO and Cofounder of MapR Technologies. Main topics of the interview are managing Big Data projects and how the Hadoop market is evolving.
Q1. What are the most common problems and challenges encountered in Big Data projects?
John Schroeder: First of all there is no single Big Data use case. Applications cut across industries and involve a wide variety of data sources. These projects can result in revenue gains, cost reductions or risk mitigation. While the challenges for these projects also vary, we see customers embracing our platform to deal with common challenges in meeting mission critical service levels, addressing real-time response pressures, and supporting multiple users and applications.
Q2. How do you see the Hadoop market evolving?
John Schroeder: We have leading customers in diverse industries who are using Hadoop to drive operational analytics. Customer examples include performing 100B ad auctions a day, fraud detection for over 100 million card holders, and real-time adjustments to improve fleet efficiency. These examples require the right architecture to support streaming writes, so data can be constantly written to the system while analysis is being conducted; high performance to meet the business needs and real-time operations; and the ability to perform online database operations to react to the business situation and impact business as it happens, rather than producing a batch report days or weeks later.
Q3. Is Hadoop really replacing the role of OLAP (online analytical processing) in preparing data to answer specific questions?
John Schroeder: Hadoop’s impact is more disruptive than a replacement for OLAP technologies that have been in the market since the 90s. Customers deploy use cases on Hadoop that were not feasible or cost effective using these traditional technologies. For example, the use of clustering algorithms and recommendation engines that can be run much more frequently against much larger datasets opens opportunities for use cases that drive new revenue streams.
Hadoop is also more powerful for unstructured data. So while we do see customers offload data warehouse processing on MapR, most MapR customers are deploying net new use cases. The business impact is that the net new growth of analytic use cases is happening on Hadoop.
Hadoop is not currently a direct replacement for OLAP or an Enterprise Data Warehouse, for that matter. These technologies will continue to have their place. Hadoop does not require schema definition or structuring of data. In fact, acting as a Datahub, Hadoop can be quite complementary to these by offloading processing and data from these systems. The average cost to store data in a data warehouse is $16,000/terabyte. The cost for MapR is less than $1000/terabyte. OLAP engines leverage data that has been transformed and processed into precise schemas. They can perform very well for well understood problems. One of the benefits of Hadoop is that you don’t need to understand the questions you are going to ask ahead of time: you can combine many different data types and determine the analysis you need after the data is in place. Hadoop continues to mature with regards to structuring data and interactive query, so future overlap between Hadoop and OLAP will increase.
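The point about not needing a schema or the questions up front can be sketched with newline-delimited JSON. This is plain Python, not Hadoop code, and the sample events and function names are hypothetical.

```python
import json

# Raw events are stored as they arrive; no schema is declared up front,
# and records are free to carry different fields.
raw_lines = [
    '{"user": "ann", "event": "click", "page": "/home"}',
    '{"user": "bob", "event": "purchase", "amount": 19.99}',
    '{"user": "ann", "event": "purchase", "amount": 5.0, "coupon": "X1"}',
]

records = [json.loads(line) for line in raw_lines]

# The structure is discovered at read time rather than defined in advance.
discovered_fields = sorted({key for record in records for key in record})


def spend_per_user(records):
    """A question decided after the data was already in place."""
    totals = {}
    for record in records:
        if record.get("event") == "purchase":
            user = record["user"]
            totals[user] = totals.get(user, 0.0) + record.get("amount", 0.0)
    return totals


spend = spend_per_user(records)
```

A schema-first system would need the "coupon" field added to a table definition before the third record could be loaded; here it simply shows up in discovered_fields.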
Q4. Organizations embracing Hadoop often struggle to empower large groups of business analysts who require sophisticated SQL and BI tools to do their jobs. How do you handle this problem?
John Schroeder: MapR has the broadest support concerning SQL-in-Hadoop and SQL-on-Hadoop. Hive, Drill, Spark and Impala continue to mature as technologies. We are consultative to our customers assisting them to select the technology best suited to their use case. These technologies are rapidly evolving so we assist in “future proofing” the SQL technology selection to reduce technology lock in. In the case of large groups of business analysts and users we’re very excited about our partnership with HP Vertica. HP Vertica runs natively within the MapR platform and it provides full 100% ANSI SQL support to users. MapR also supports a broad range of SQL solutions designed specifically for Hadoop.
MapR also provides a standard file-based interface so any tool that uses enterprise storage systems can easily access data directly in MapR.
With MapR, you are in charge. You decide what you want to use to query your data; we focus on providing a reliable, scalable and affordable platform with full enterprise support.
Q5. How do you define the Total Cost of Ownership for Big Data architecture?
John Schroeder: There are many factors that drive TCO. The cost of storing data in MapR can be 50 to 100 times cheaper than other analytic platforms. MapR has innovated at the architecture level in many important areas that result in a much lower TCO. These include hardware performance and efficiency, which result in a much smaller footprint and save on hardware, operations, and management costs. We have had customers tell us that they would need to deploy clusters 2-5 times larger with other distributions for the same workloads. We have also spent a great deal of time on the underlying data platform to provide high availability, reliability, and serviceability to make a MapR deployment extremely efficient. When customers are deploying an in-Hadoop database, MapR provides many TCO advantages. Our M7 Database Edition is an in-Hadoop NoSQL database that addresses HBase limitations by eliminating region servers, eliminating compactions, and automating table management to support continuous, low-latency online applications.
Q6. Is YARN expanding Hadoop use cases in the enterprise? And if yes, how?
John Schroeder: Much has been said about Hadoop 2.x and YARN and how it promises to expand Hadoop beyond MapReduce. YARN’s promise is to enable multiple execution frameworks to run on top of Hadoop, thereby expanding the Hadoop use cases beyond batch into interactive, real-time and others. At its core, YARN is a resource allocation framework that allows execution frameworks such as classical MapReduce, and also newer ones like interactive SQL-on-Hadoop, streaming, and others, to ask for and receive CPU and memory resources on the cluster for a period of time. YARN’s power is in making the resource allocation of a Hadoop cluster a more streamlined and centralized decision, thereby allowing for more efficient cluster use and, more importantly, opening up Hadoop for emerging use cases. We’re happy to include YARN in MapR’s distribution and have uniquely enhanced YARN to allow both Map Reduce V1 and Map Reduce V2 applications to run simultaneously on the same cluster to reduce the barrier to YARN adoption.
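The "ask for and receive CPU and memory for a period of time" idea can be pictured with a toy central allocator. This is a conceptual sketch only: it is not YARN's API, and the class name ResourceManager, its methods, and the framework names are hypothetical.

```python
class ResourceManager:
    """Toy central allocator: one place decides who gets cluster resources."""

    def __init__(self, total_cpus, total_memory_gb):
        self.free_cpus = total_cpus
        self.free_memory_gb = total_memory_gb
        self.grants = {}

    def request(self, framework, cpus, memory_gb):
        # Grant the request only if the cluster still has capacity.
        if cpus > self.free_cpus or memory_gb > self.free_memory_gb:
            return False
        self.free_cpus -= cpus
        self.free_memory_gb -= memory_gb
        held = self.grants.get(framework, (0, 0))
        self.grants[framework] = (held[0] + cpus, held[1] + memory_gb)
        return True

    def release(self, framework):
        # Resources are held for a period of time, then returned to the pool.
        cpus, memory_gb = self.grants.pop(framework, (0, 0))
        self.free_cpus += cpus
        self.free_memory_gb += memory_gb


manager = ResourceManager(total_cpus=16, total_memory_gb=64)
batch_ok = manager.request("mapreduce", cpus=8, memory_gb=32)
sql_ok = manager.request("interactive-sql", cpus=6, memory_gb=16)
stream_ok = manager.request("streaming", cpus=4, memory_gb=8)  # over capacity
manager.release("mapreduce")
stream_retry_ok = manager.request("streaming", cpus=4, memory_gb=8)
```

Different frameworks share one pool: the streaming request is refused while the batch job holds its CPUs and succeeds once they are released.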
Q7. Do you have any metrics to define how good is the “value” that can be derived by analyzing Big Data?
John Schroeder: We have customers that get 50X the performance at 1/50 the cost. We have other customers that have ROI over 1000X because of better approaches to drive revenue. We have other customers whose entire business model is built on the advantages that Hadoop provides. Earlier, I pointed out operational workloads that allow customers to dramatically transform their businesses; these are the applications that really drive value for organizations.
Beyond top line or cost savings value is the ability to support use cases that were not feasible before MapR.
MapR is key to Rubicon running Internet ad exchanges and comScore’s ability to measure what people do as they navigate the digital world.
Q8. What are the benefits of MapR’s Hadoop Distribution on the Google Compute Engine at Google I/O?
John Schroeder: Through the Google Compute Engine infrastructure, MapR makes big data accessible to any size business by leveraging the Google Compute Engine to provide a high performance, scalable, predictable, and easy to provision Hadoop infrastructure.
With respect to the scale and performance advantages, using MapR, Google was able to demonstrate a significant Hadoop price/performance breakthrough. We were able to run the Hadoop TeraSort benchmark to sort 1TB of data in a world-record setting time of 54 seconds on a 1003-node cluster that Google provided for our use. This broke the previous world record with approximately one third the number of cores.
Q9. You recently announced the early access release of the new HP Vertica Analytics Platform on MapR. What are the benefits of such cooperation for the enterprise?
John Schroeder: MapR and Vertica together demonstrate technical leadership in providing the best-of-breed SQL-on-Hadoop solution for enterprises. HP Vertica and MapR produce a comprehensive, tightly integrated, scalable, open-standards big data platform solution. There is no need to manage a dual cluster environment.
MapR is the only platform that could integrate an MPP analytic platform natively on Hadoop without requiring connectors or external tables in order for the MPP platform to interact with Hadoop data. With this integration, HP Vertica works as a native application on top of MapR, sharing the cluster resources with other Hadoop frameworks and applications.
The storage utilization of each application is dynamic and grows to the needs of business without requiring pre-allocation of file system space for HP Vertica. The architecture also allows customers to leverage MapR’s consistent snapshots and mirroring to provide point-in-time recovery and disaster recovery for HP Vertica with practically no effort.
For analysts, data scientists, and business users wanting more analytical power and faster ability to drive business decisions and execution, HP Vertica delivers the industry’s most advanced SQL-on-Hadoop analytics directly on MapR for higher performance and lower TCO.
Qx Anything else you wish to add?
John Schroeder: Two additional thoughts: data agility and operations.
MapR is investing engineering resources in data agility by decreasing time to value from data. Apache Drill is the only interactive SQL project that is architected for both centrally structured and self-describing data. Requiring DBA-like work to structure new data sources, together with the cumbersome process for altering structure, delays time to value from new or changed data. Drill supports querying data structured in HCatalog, but can also query data that uses data-interchange formats like JSON.
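To make the contrast between centrally structured and self-describing data concrete, here is a minimal Python sketch of querying JSON records with no schema declared in advance. It is not Drill and does not use Drill's API; the record fields and the `select` helper are hypothetical and exist only for illustration.

```python
import json

# Hypothetical self-describing records: each line carries its own field names,
# and the records do not all share the same fields.
RAW_LINES = [
    '{"user": "ann", "clicks": 12, "country": "US"}',
    '{"user": "bob", "clicks": 3}',
    '{"user": "cy", "clicks": 25, "device": {"os": "android"}}',
]


def select(lines, columns, where=lambda rec: True):
    """Parse each JSON line on read and project the requested columns.

    Fields missing from a record come back as None instead of raising,
    so structure is discovered per record rather than declared up front.
    """
    for line in lines:
        rec = json.loads(line)
        if where(rec):
            yield {col: rec.get(col) for col in columns}


# Roughly: SELECT user, country FROM records WHERE clicks > 5
rows = list(select(RAW_LINES, ["user", "country"], where=lambda r: r["clicks"] > 5))
print(rows)
```

A new field such as `device` can appear in the data without anyone altering a table definition first, which is the time-to-value point made above.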
Many use cases have batch, interactive and real-time (operational) aspects. Ad exchanges have to store and analyze auctions, but they also have to provide information like yield estimates in real-time to publishers and brands.
Credit fraud detection has analytic aspects but also has to interact during a credit card swipe. Investment in MapR’s M7 in-Hadoop NoSQL database has provided, and continues to provide, technology to support those real-time operations and avoid the cost and complexity of a second non-Hadoop platform. We aren’t going to replace an OLTP database, but we can cover many of the operational use cases.
John Schroeder, CEO and Cofounder, MapR Technologies. John has served as MapR’s Chief Executive Officer and Chairman of the Board since founding the company in 2009. Prior to founding MapR, John held executive positions in a number of enterprise software companies with a focus on data, storage and business intelligence at both private and public companies including: CEO of Calista Technologies (now Microsoft), CEO of Rainfinity (now EMC), SVP of Products and Marketing at Brio Technologies (BRYO) and General Manager at Compuware (CPWR).
Follow ODBMS.org and ODBMS Industry Watch on Twitter: @odbmsorg
“There are four things that motivate open source development teams:
1. The challenge/puzzle of programming, 2. Need for the software, 3. Personal advancement, 4. Belief in open source” — Bruce Momjian.
On PostgreSQL and the challenges of motivating and managing open source teams, I have interviewed Bruce Momjian, Senior Database Architect at EnterpriseDB, and Co-founder of the PostgreSQL Global Development Group and Core Contributor.
Q1. How did you manage to transform PostgreSQL from an abandoned academic project into a commercially viable, now enterprise-class relational database?
Bruce Momjian: Ever since I was a developer of database applications, I have been interested in how SQL databases work internally. In 1996, Postgres was the only open source database available that was like the commercial ones I used at work. And I could look at the Postgres code and see how SQL was processed.
At the time, I also kind of had a boring job, or at least a non-challenging one, writing database reports and applications. So getting to know Postgres was exciting and helped me grow as a developer. I started getting more involved with Postgres. I took over the Postgres website from the university, reading bug reports and fixes, and I started interacting with other developers through the website. Fortunately, I got to know other developers who also found Postgres attractive and exciting, and together we assembled a group of people with similar interests.
We had a small user community at the time, but found enough developers to keep the database feature set moving forward.
As we got more users, we got more developers. Then commercial support opportunities began to grow, helping foster a rich ecosystem of users, developers and support companies that created a self-reinforcing structure that in turn continued to drive growth.
You look at Postgres now and it looks like there was some grand plan. But in fact, it was just a matter of setting up some structure and continuing to make sure all the aspects continued to efficiently reinforce each other.
Q2. What are the current activities and projects of the PostgreSQL community Global Development Group?
Bruce Momjian: We are working on finalizing the version 9.4 beta, which will feature greatly improved JSON capabilities and lots of other stuff. We hope to release 9.4 in September/October of this year.
Q3. How do you manage motivating and managing open source teams?
Bruce Momjian: There are four things that motivate open source development teams:
* The challenge/puzzle of programming
* Need for the software
* Personal advancement
* Belief in open source
Our developers are motivated by a combination of these. Plus, our community members are very supportive of one another, meaning that working on Postgres is seen as something that gives contributors a sense of purpose and value.
I couldn’t tell you the mix of motivations that drive any one individual and I doubt many people you were to ask could answer the question simply. But it’s clear that a mixture of these motivations really drives everything we do, even if we can’t articulate exactly which are most important at any one time.
Q4. What is the technical roadmap for PostgreSQL?
Bruce Momjian: We are continuing to work on handling NoSQL-like workloads better. Our plans for expanded JSON support in 9.4 are part of that.
We need greater parallelism, particularly to allow a single query to make better use of all the server’s resources.
We have made some small steps in this area in v9.3 and plan to in v9.4, but there is still much work to be done.
We are also focused on data federation, allowing Postgres to access data from other data sources.
We already support many data interfaces, but they need improvement, and we need to add the ability to push joins and aggregates to foreign data sources, where applicable.
I should add that Evan Quinn, an analyst at Enterprise Management Associates, wrote a terrific research report, PostgreSQL: The Quiet Giant of Enterprise Database, and included some information on plans for 9.4.
Q5. How do managers and executives view Postgres, and in particular how do they make Postgres deployment decisions?
Bruce Momjian: In the early years, users that had little or no money, and organizations with heavy data needs and small profit margins drove the adoption of Postgres. These users were willing to overlook Postgres’ limitations.
Now that Postgres has filled out its feature set, almost every user segment using relational databases is considering Postgres. For management, cost savings are key. For engineers, it’s Postgres’ technology, ease of use and flexibility.
Q6. What is your role at EnterpriseDB?
Bruce Momjian: My primary responsibility at EnterpriseDB is to help the Postgres community.
EnterpriseDB supports my work as a core team member and I play an active role in the overall decision-making and the organization of community initiatives. I also travel frequently to conferences worldwide, delivering presentations on advances in Postgres and leading Postgres training sessions. At EnterpriseDB, I occasionally do trainings, help with tech support, attend conferences as a Postgres ambassador and visit customers. And of course, do some PR and interviews, like this one.
Q7. Has PostgreSQL still something in common with the original Ingres project at the University of California, Berkeley?
Bruce Momjian: Not really. There is no Ingres code in Postgres though I think the psql terminal SQL tool is similar to the one in Ingres.
Q8. If you had to compare PostgreSQL with MySQL and MariaDB, what would be the differentiators?
Bruce Momjian: The original focus of MySQL was simple read-only queries. While it has improved since then, it has struggled to go beyond that. Postgres has always targeted the middle-level SQL workload, and is now targeting high-end usage, and the simple-usage cases of NoSQL.
MySQL certainly has greater adoption and application support. But in almost every other measure, Postgres is a better choice for most users. The good news is that people are finally starting to realize that.
Q9. How do you see the database market evolving? And how do you position PostgreSQL in such database market?
Bruce Momjian: We are in close communication with our user community, with our developers reading and responding to email requests daily. That keeps our focus on users’ needs. Postgres, being an object-relational, extensible database, is well suited to being expanded to meet changing user workloads. I don’t think any other database has that level of flexibility and strong developer/user community interaction.
Qx. Would you like to add something?
Bruce Momjian: We had a strong PG NYC conference recently. I posted a summary about it here.
I think that conference highlights some significant trends for Postgres in the months ahead.
Bruce Momjian co-founded in 1996 the PostgreSQL community Global Development Group, the organization of volunteers that steer the development and release of the PostgreSQL open source database. Bruce played a key role in organizing like-minded database professionals to shepherd PostgreSQL from an abandoned academic project into a commercially viable, now enterprise-class relational database. He dedicates the bulk of his time to organizing, educating and evangelizing within the open source database community while acting as Senior Database Architect for EnterpriseDB. Bruce began his career as a high school math and computer science teacher and still serves as an adjunct professor at Drexel University. After leaving high school education, Bruce worked for more than a decade as a database consultant building specialized applications for law firms. He then went on to work for the PostgreSQL community with the support of several private companies before joining EnterpriseDB in 2006 to continue his work in the community. Bruce holds a master’s degree from Arcadia University and earned his bachelor’s degree at Columbia University.
“The Internet of Things is a good fit for NoSQL technologies, as you face the challenge of dealing with huge volumes of data over time. For businesses that wish to scale their IoT implementations and make use of the data that these networks create, NoSQL solutions are a better fit than RDBMS options.”–Mike Williams
I have interviewed Mike Williams, software director for i2O Water. Mike has an interesting use case for NoSQL in the area of water distribution networks. We also discussed how NoSQL can be used for the Internet of Things.
Q1. What is the business of i2O Water?
Mike Williams: i2O is the world’s leading developer of Smart Pressure Management solutions for water distribution networks.
Q2. What are the main benefits for utility companies?
Mike Williams: i2O’s Smart Pressure Management solutions optimise the performance of water distribution networks through improving network visibility, and enabling the remote control and automatic optimisation of network pressures.
These technology-enabled best practices deliver benefits in six key areas, with customers typically achieving return on investment in 6-18 months.
The opportunities for savings fall into two areas: on the network side, we see leakage reduction, energy savings and a big reduction in burst pipes.
For our utility customers, there are business-level returns as well based on improved customer service and operational cost savings. We also see customers being able to extend the life of their assets across the network as well, so they see a real long-term benefit to being able to control water pressure more accurately.
Q3. Do you have any metrics to share with us on the estimated volume of water saved by using your software?
Mike Williams: The best metric we can give is the simple volume of water that we help customers save: we currently help our customers to save over 235 Million Litres of water every day.
Q4. How can big data be used to reduce water leakage?
Mike Williams: The i2O system monitors and controls water pressure throughout a zone or network. This enables water companies to fully optimise water pressures remotely and automatically to meet agreed customer service levels throughout the network. The Big Data is all of this time-series data of pressures and flows (and other metrics) for lots of locations over years’ worth of points.
The solution continuously learns key characteristics within a Zone and then automatically controls the pressure within the Zone to achieve a stable target pressure at the critical point. This is achieved through a sophisticated mathematical algorithm, which automatically generates a control model. The control model is supplied ‘over the air’ from i2O software to the Pressure Reducing Valve (PRV) controller.
The control model is automatically updated if the software detects a significant change in the head loss characteristics. This ensures that no more pressure than is required enters the network on an ongoing basis. The i2O Automatic Optimisation solution is the world’s first – and most widely deployed – system for automatically optimising and remotely controlling water pressure in your network.
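As a rough illustration of the idea described above (learn a zone's head loss characteristics, then set the valve outlet pressure so the critical point stays at its target), here is a minimal Python sketch. It is not i2O's algorithm: the power-law head loss model `k * flow**exponent`, the function names, and all numbers are assumptions made for the example.

```python
def fit_head_loss_coefficient(observations, exponent=2.0):
    """Least-squares fit of k in the assumed model head_loss = k * flow**exponent.

    observations is a list of (flow, head_loss) pairs, where head_loss is the
    measured pressure drop between the valve outlet and the critical point.
    """
    num = sum(loss * flow ** exponent for flow, loss in observations)
    den = sum(flow ** (2 * exponent) for flow, _ in observations)
    return num / den


def outlet_setpoint(target_pressure, flow, k, exponent=2.0):
    """Outlet pressure needed so the critical point sees target_pressure."""
    return target_pressure + k * flow ** exponent


# Hypothetical learning data for one zone: (flow, head loss) pairs.
history = [(1.0, 0.5), (2.0, 2.0), (3.0, 4.5)]
k = fit_head_loss_coefficient(history)

# Control step: hold 20.0 units of pressure at the critical point at flow 2.0.
setpoint = outlet_setpoint(target_pressure=20.0, flow=2.0, k=k)
print(k, setpoint)
```

Re-running the fit when new observations drift away from the current `k` plays the role of the automatic model update described above: the valve supplies the target pressure plus the predicted loss and no more.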
i2O’s Automatic PRV Optimisation has delivered significant results in hundreds of zones for major water companies worldwide.
Q5. Could you please describe your IT back-end infrastructure? What are the main challenges you have?
Mike Williams: We have an eco-system of distributed loosely coupled Services, each with access to numerous dedicated and tailored data stores.
These services collaborate through an Event-Driven Architecture (EDA) and a distributed Event Broker to deliver business services to our customers.
The main challenges are around the scalability of these services and the data stores where the vast amount of time-series data are held.
Due to the way we have architected our data, we have the ability to replay these events over history when we make changes to our products and services – we can use this historical data to analyse how devices on our networks respond to changes in circumstances, and measure what difference our new features would provide.
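The replay idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not i2O's implementation: the `PressureReading` event type, the in-memory `EVENT_LOG`, and the alert-counting handlers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PressureReading:
    """Hypothetical immutable event recorded by a device."""
    device_id: str
    timestamp: int
    pressure: float


# Hypothetical stored history; a real system would read this from a data store.
EVENT_LOG = [
    PressureReading("prv-1", 1, 31.0),
    PressureReading("prv-1", 2, 29.5),
    PressureReading("prv-2", 1, 42.0),
    PressureReading("prv-1", 3, 35.5),
]


def replay(events, handler, state):
    """Feed historical events, in time order, through a handler function."""
    for event in sorted(events, key=lambda e: e.timestamp):
        state = handler(state, event)
    return state


def count_high_pressure(threshold):
    """Build a handler that counts readings above threshold per device."""
    def handler(state, event):
        if event.pressure > threshold:
            count = state.get(event.device_id, 0) + 1
            state = {**state, event.device_id: count}
        return state
    return handler


# Compare the existing rule with a candidate new rule over the same history.
old_alerts = replay(EVENT_LOG, count_high_pressure(40.0), {})
new_alerts = replay(EVENT_LOG, count_high_pressure(30.0), {})
print(old_alerts, new_alerts)
```

Because the events themselves are never modified, any new feature can be evaluated against the full history simply by writing a new handler.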
Q6. Why did you make the shift to NoSQL? What were you doing previously?
Mike Williams: The challenge was the overall scale-up of the whole platform. From both a technical and a business standpoint, we needed to scale up for the coming years. Based on the number of devices that our customers had in place, and the number of new customers that we were projected to win, our RDBMS was not able to cope by design. Storing time-series data is a specialist need that column-oriented data stores are better suited to than traditional RDBMS row-oriented technologies. We were (and for some customers, still are) using MS SQL Server as the only data store. NoSQL technologies also better fit our data modelling needs such as searching, where we employ the ElasticSearch NoSQL solution in a clustered manner.
Q7. What is your experience of using a NoSQL database so far?
Mike Williams: So far it has been very positive. We had a learning curve to go up and the technologies themselves (Cassandra and ElasticSearch) have matured greatly in the last 18 months that we have been using them.
Q8. What does the future hold for Cassandra in your organisation? How are you using other database types as well?
Mike Williams: We moved over to using Cassandra as our main data store for time-series data – this was because it provides better support for columnar data, as well as meeting the requirement that we had around scale. We are committed to Cassandra for the foreseeable future and so its future is to grow alongside our business. We also use ElasticSearch as mentioned, as well as PostgreSQL when our specific data models dictate the use of tabular, relational data.
Q9. Do you work on the cross-over with the Internet of Things?
Mike Williams: Indeed we do as we produce intelligent devices that communicate with our Platform over the Internet (GPRS mobile network). The devices automatically sample for pressure, temperature and other information that can then be compared across the network.
The Internet of Things is a good fit for NoSQL technologies, as you face the challenge of dealing with huge volumes of data over time. For businesses that wish to scale their IoT implementations and make use of the data that these networks create, NoSQL solutions are a better fit than RDBMS options.
The ability to capture information from across our network has two key value propositions: the first is for our customers right now, as they can manage their water pressure more effectively. The second is the long term value that the data can provide. By being able to model and re-use historical data, we can offer much more value to customers than they can achieve by themselves. We can add new features to our platform, and demonstrate how these new features can provide greater opportunities to save money and water for customers.
Mike Williams is the software director for i2O Water. He has 25 years of experience working for innovative high-tech companies in finance, payments, process engineering and the environment, where he has focused on solving problems and translating business challenges into tangible technical solutions. Before joining i2O Water Mike was the Chief Software Architect and head of development for Bottomline Technologies, the leading supplier of banking transaction and payments software.
Mike is also an Agile Coach and has helped transform numerous businesses as they become Agile in their approaches to business as a whole and not just software development. Mike is the organiser and founder of the Agile South Coast group.
“Spring for Apache Hadoop together with Spring XD is used by many large organizations for developing new big data apps for stream processing and using HDFS for storage.”–Thomas Risberg.
On Spring for Apache Hadoop, I have interviewed Thomas Risberg, Software Engineer focusing on Big Data at Pivotal.
Q1. What is new with the current Spring for Apache Hadoop?
Thomas Risberg: The main focus for Spring for Apache Hadoop version 2.0 is to support new distributions based on Hadoop v2 including support for YARN. We provide backwards compatibility with Hadoop v1 based distributions, so as an end-user you can choose when to move to a new Hadoop version.
In Spring for Apache Hadoop 2.0 we are adding YARN application development support in addition to improvements in the HDFS and MapReduce support. We are introducing a Spring Boot based programming model for easy YARN app development. The main goal is to provide a simplified development experience so the developer can focus on getting the business logic implemented and not having to worry about the “plumbing”.
Q2. How do you measure the “effectiveness” of how Spring for Apache Hadoop simplifies developing Apache Hadoop?
Thomas Risberg: The main differentiator would be developer productivity where Spring for Apache Hadoop provides support for the infrastructure plumbing code and configuration allowing the developer to focus on code that brings business value. The most common measurement would be how long completing a project would take compared to the same project being developed without Spring.
Q3. What is the rationale for offering a unified configuration model and APIs for using HDFS, MapReduce, Pig, and Hive?
Thomas Risberg: Big Data workflow development usually involves parts that are executed on Hadoop and parts that are executed or at least interact with resources outside of Hadoop. So, a unified configuration strategy will help developers when they move between different parts of the workflow. The basic configuration used across all Hadoop components is based on Spring and therefore similar whether working on a Hive job or an FTP file transfer job.
Q4. How does Spring for Apache Hadoop relate to the Spring Data project, and in general to the overall Spring ecosystem of projects?
Thomas Risberg: Spring for Apache Hadoop is part of the Spring Data umbrella project, but not part of the “release train” that most other Spring Data projects are part of. The reason is that the coupling is not very tight with other Spring Data projects. Spring for Apache Hadoop doesn’t directly use other Spring Data components although anyone using Spring for Apache Hadoop can use the Spring Data MongoDB project when exporting data from HDFS to MongoDB.
Q5. What about the integration with other software systems that are not part of the Spring ecosystem?
Thomas Risberg: In terms of Hadoop we have integration with Hive, Pig and HBase. We also use features from projects outside of the Apache Hadoop family. One example is Kite SDK which is a project that started out at Cloudera but is now a separate project that can be used with any Hadoop distribution. Other examples would be a JSON library like Jackson etc. The whole Spring IO platform uses 689 third party libraries so managing all of this is crucial for anyone using Spring. That is the main motivating factor behind the new Spring IO platform that provides a unified set of dependency versions across all Spring projects.
Q6. Could you give us some detail on how you handle big data ingest/export: e.g. from enterprise databases into Hadoop and vice versa? How is this different than conventional ETL?
Thomas Risberg: We rely on Spring Batch functionality for this task, so in that case HDFS ingest isn’t different from loading data into any other data store. It’s simply a different batch writer. Spring Batch is a proven technology that is the basis for JSR-352 and is now certified as a JSR-352 compliant implementation. We support import/export with most relational databases that have a JDBC driver and also with many NoSQL stores that have Spring Data support like MongoDB, Cassandra, Couchbase or Redis.
Q7. What is the main contribution of Spring Data Hadoop to the Hadoop workflow and security?
Thomas Risberg: Spring for Apache Hadoop allows the developer to treat Hadoop workloads the same way as they would approach any workflow problem. Just because Hadoop is involved doesn’t have to mean that you need to use new and different tools from what you are used to. In terms of security we use what Hadoop itself provides and haven’t so far attempted to integrate that with Spring Security.
Q8. Do you also offer tools for analyzing Big Data? If yes, which ones?
Thomas Risberg: Spring XD provides integration with PMML analytics via a plug-in module. That module integrates with the JPMML-Evaluator library that provides support for a wide range of model types and is interoperable with models exported from R, Rattle, KNIME, and RapidMiner. Pivotal also provides MADLib, developed in collaboration with researchers at UC Berkeley and a growing world wide user community. This library is typically used with Pivotal’s Greenplum database or HAWQ which is the SQL engine that is part of Pivotal’s Hadoop distribution.
Q9. In which situations can Spring Data Hadoop add value, and in which situations would it be a poor choice?
Thomas Risberg: It definitely adds a lot of value if you are already using Spring in your workflow and just want to add some Hadoop functionality. It also makes sense if you are using Java and would like to take advantage of Spring’s dependency injection approach when developing your enterprise applications. It would make less sense for an organization that is not using Java as its development language or for someone who already has a working solution using other tools that they are happy with.
Q10 Who is currently using Spring Data Hadoop and for which projects/business problems?
Thomas Risberg: I can’t name names, but Spring for Apache Hadoop together with Spring XD is used by many large organizations for developing new big data apps for stream processing and using HDFS for storage. Industries include telecommunications, equipment manufacturing, retail and finance. I’ve mentioned Spring XD; Spring for Apache Hadoop is a key component of that project. Spring XD is Pivotal’s new Spring project providing a unified, distributed, and extensible system for data ingestion, real time analytics, batch processing, and data export. The Spring XD project’s goal is to simplify the development of big data applications.
Thomas Risberg, Software Engineer focusing on Big Data, Pivotal, New Hampshire, USA
My current focus is on the “Spring XD”, “Spring for Apache Hadoop” and “Spring Data JDBC Extensions” projects. I’m a co-author of “Spring Data, Modern Data Access for Enterprise Java” published by O’Reilly Media in 2013 and “Professional Java Development with the Spring Framework” published by Wiley in 2005.
“You need a team of dedicated data scientists to develop and tune the core intellectual property–statistical, predictive, and other analytic models–that drive your Big Data applications. You don’t often think of data scientists as “programmers,” per se, but they are the pivotal application developers in the age of Big Data.”–James Kobielus
Managing the pitfalls and challenges of Big Data projects. On this topic I have interviewed James Kobielus, IBM Senior Program Director, Product Marketing, Big Data Analytics solutions.
Q1. Why run a Big Data project in the enterprise?
James Kobielus: Many Big Data projects are in support of customer relationship management (CRM) initiatives in marketing, customer service, sales, and brand monitoring. Justifying a Big Data project with a CRM focus involves identifying the following quantitative ROI:
• Volume-based value: The more comprehensive your 360-degree view of customers and the more historical data you have on them, the more insight you can extract from it all and, all things considered, the better decisions you can make in the process of acquiring, retaining, growing and managing those customer relationships.
• Velocity-based value: The more customer data you can ingest rapidly into your big-data platform and the more questions that a user can pose more rapidly against that data (via queries, reports, dashboards, etc.) within a given time period, the more likely you are to make the right decision at the right time to achieve your customer relationship management objectives.
• Variety-based value: The more varied customer data you have – from the CRM system, social media, call-center logs, etc. – the more nuanced portrait you have on customer profiles, desires and so on, hence the better-informed decisions you can make in engaging with them.
• Veracity-based value: The more consolidated, conformed, cleansed, consistent, and current the data you have on customers, the more likely you are to make the right decisions based on the most accurate data.
How can you attach a dollar value to any of this? It’s not difficult. Customer lifetime value (CLV) is a standard metric that you can calculate from big-data analytics’ impact on customer acquisition, onboarding, retention, upsell, cross-sell and other concrete bottom-line indicators, as well as from corresponding improvements in operational efficiency.
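As a worked example of attaching a dollar value, here is a minimal Python sketch of one common textbook form of CLV: a discounted sum of expected yearly margins weighted by the chance the customer is still retained. The interview does not specify a formula, so the model, the parameter names, and all figures are assumptions for illustration.

```python
def customer_lifetime_value(annual_margin, retention_rate, discount_rate, years):
    """Discounted sum of expected margins over a fixed horizon.

    Year t (starting at 0) contributes
    annual_margin * retention_rate**t / (1 + discount_rate)**t.
    """
    return sum(
        annual_margin * retention_rate ** t / (1 + discount_rate) ** t
        for t in range(years)
    )


# Hypothetical baseline: 100 margin per year, 80% retention, 10% discount rate.
baseline = customer_lifetime_value(100.0, 0.80, 0.10, years=3)

# Hypothetical scenario in which better analytics lifts retention to 90%.
improved = customer_lifetime_value(100.0, 0.90, 0.10, years=3)

print(round(baseline, 2), round(improved, 2), round(improved - baseline, 2))
```

The difference between the two values, multiplied by the number of affected customers, gives a first-order estimate of what an analytics-driven retention gain is worth.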
Q2. What are the business decisions that need to be made in order to successfully support a Big Data project in the enterprise?
James Kobielus: In order to successfully support a Big Data project in the enterprise, you have to make the infrastructure and applications production-ready in your operations.
Production-readiness means that your big-data investment is fit to realize its full operational potential. If you think “productionizing” can be done in a single step, such as by, say, introducing HDFS NameNode redundancy, then you need a cold slap of reality. Productionizing demands a lifecycle focus that encompasses all of your big-data platforms, not just a single one (e.g., Hadoop/HDFS), and addresses more than just a single requirement (e.g., ensuring a highly available distributed file system).
Productionizing involves jumping through a series of procedural hoops to ensure that your big-data investment can function as a reliable business asset. Here are several high-level considerations to keep in mind as you ready your big-data initiative for primetime deployment:
• Stakeholders: Have you aligned your big-data initiatives with stakeholder requirements? If stakeholders haven’t clearly specified their requirements or expectations for your big-data initiative, it’s not production-ready. The criteria of production-readiness must conform to what stakeholders require, and that depends greatly on the use cases and applications they have in mind for Big Data. Service-level agreements (SLAs) vary widely for Big Data deployed as an enterprise data warehouse (EDW), as opposed to an exploratory data-science sandbox, an unstructured information transformation tier, a queryable archive, or some other use. SLAs for performance, availability, security, governance, compliance, monitoring, auditing and so forth will depend on the particulars of each big-data application, and on how your enterprise prioritizes them by criticality.
• Stacks: Have you hardened your big-data technology stack – databases, middleware, applications, tools, etc. – to address the full range of SLAs associated with the chief use cases? If the big-data platform does not meet the availability, security and other robustness requirements expected of most enterprise infrastructure, it’s not production-ready. Ideally, all production-grade big-data platforms should benefit from a common set of enterprise management tools.
• Scalability: Have you architected your environment for modular scaling to keep pace with inexorable growth in data volumes, velocities and varieties? If you can’t provision, add, or reallocate new storage, compute and network capacity on the big-data platform in a fast, cost-effective, modular way to meet new requirements, the platform is not production-ready.
• Skillsets: Have you beefed up your organization’s big-data skillsets for maximum productivity? If your staff lacks the requisite database, integration and analytics skills and tools to support your big-data initiatives over their expected life, your platform is not production-ready. Don’t go deep on Big Data until your staff skills are upgraded.
• Seamless service: Have you re-engineered your data management and analytics IT processes for seamless support for disparate big-data initiatives? If you can’t provide trouble response, user training and other support functions in an efficient, reliable fashion that’s consistent with existing operations, your big-data platform is not production-ready.
To the extent that your enterprise already has a mature enterprise data warehousing (EDW) program in production, you should use that as the template for your big-data platform. There is absolutely no need to redefine “productionizing” for Big Data’s sake.
Q3. What are the most common problems and challenges encountered in Big Data projects?
James Kobielus: The most common problems and challenges in Big Data projects revolve around integrated lifecycle management (ILM).
ILM faces a new frontier when it comes to Big Data. The core challenges are threefold: the sheer unbounded size of Big Data, the ephemeral nature of much of the new data, and the difficulty of enforcing consistent quality as the data scales along any and all of the three Vs (volume, velocity, and variety). Comprehensive ILM has grown more difficult to ensure in Big Data environments, given rapid changes in the following areas:
• New Big Data platform: Big data is ushering a menagerie of new platforms (Hadoop, NoSQL, in-memory, and graph databases) into enterprise computing environments, alongside stalwarts such as MPP RDBMS, columnar, and dimensional databases. The chance that your existing ILM tools work out of the box with all of these new platforms is slim. Also, to the extent that you’re doing Big Data in a public cloud, you may be required to use whatever ILM features (strong, weak, or middling) are native to the provider’s environment. To mitigate your risks in this heterogeneous new world and to maintain strong confidence in your core data, you’ll need to examine new Big Data platforms closely to ensure they have ILM features (data security, governance, archiving, retention) that are commensurate with the roles for which you plan to deploy them.
• New Big Data subject domains: Big data has not altered enterprise requirements for data governance hubs where you store and manage official systems of record (customers, finances, HR). This is the role of your established EDWs, most of which run on traditional RDBMS-based data platforms and incorporate strong ILM. But these system-of-record data domains may have very little presence on your newer Big Data platforms, many of which focus instead on handling fresh data from social, event, sensor, clickstream, geospatial, and other new sources. These new data domains are often “ephemeral” in the sense that there may be no need to retain the bulk of the data in permanent systems of record.
• New Big Data scales: Big data does not mean that your new platforms support infinite volume, instantaneous velocity, or unbounded varieties. The sheer magnitudes of new data will make it impossible to store most of it anywhere, given the stubborn technological and economic constraints we all face. This reality will deepen Big Data managers’ focus on tweaking multitemperature storage management, archiving, and retention policies. As you scale your Big Data environment, you will need to ensure that ILM requirements can be supported within your current constraints of volume (storage capacity), velocity (bandwidth, processor, and memory speeds), and variety (metadata depth).
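The multitemperature storage and retention policies mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any product's ILM feature: the tier names, the age limits, and the `assign_tier` helper are all hypothetical.

```python
from datetime import date

# Hypothetical tiers and age limits in days. A real policy would be driven by
# business retention requirements, not by these illustrative numbers.
TIERS = [
    ("hot", 7),     # fastest storage, queried constantly
    ("warm", 90),   # ordinary disk, queried occasionally
    ("cold", 365),  # archive storage, rarely queried
]


def assign_tier(last_accessed, today, system_of_record=False):
    """Return the storage tier for a data set, or "purge" if it has aged out."""
    age_days = (today - last_accessed).days
    for tier, max_age in TIERS:
        if age_days <= max_age:
            return tier
    # Assumption: systems of record stay in the archive, ephemeral data is dropped.
    return "cold" if system_of_record else "purge"


today = date(2014, 3, 1)
print(assign_tier(date(2014, 2, 27), today))        # recent clickstream data
print(assign_tier(date(2013, 12, 15), today))       # sensor data a few months old
print(assign_tier(date(2012, 1, 1), today))         # old ephemeral data
print(assign_tier(date(2012, 1, 1), today, True))   # old system-of-record data
```

The point of the sketch is that ephemeral data and systems of record can share one policy function while ending up with different retention outcomes.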
Q4. How best to get started with a Big Data project?
James Kobielus: Scope the project well to deliver near-term business benefit. Use this nucleus project as the foundation for accelerating future Big Data projects. Recognize that the database technology you use in that initial project is just one of many storage layers that will need to play together in the hybridized, multi-tier Big Data architecture of your future.
In the larger evolutionary perspective, Big Data is evolving into a hybridized paradigm under which Hadoop, massively parallel processing (MPP) enterprise data warehouses (EDW), in-memory columnar, stream computing, NoSQL, document databases, and other approaches support extreme analytics in the cloud.
Hybrid architectures address the heterogeneous reality of Big Data environments and respond to the need to incorporate both established and new analytic database approaches into a common architecture. The fundamental principle of hybrid architectures is that each constituent Big Data platform is fit-for-purpose to the role for which it’s best suited. These Big Data deployment roles may include any or all of the following:
• Data acquisition
• Interactive exploration
In any role, a fit-for-purpose Big Data platform often supports specific data sources, workloads, applications, and users.
Hybrid is the future of Big Data because users increasingly realize that no single type of analytic platform is always best for all requirements. Also, platform churn—plus the heterogeneity it usually produces—will make hybrid architectures more common in Big Data deployments. The inexorable trend is toward hybrid environments that address the following enterprise Big Data imperatives:
• Extreme scalability and speed: The emerging hybrid Big Data platform will support scale-out, shared-nothing massively parallel processing, optimized appliances, optimized storage, dynamic query optimization, and mixed workload management.
• Extreme agility and elasticity: The hybrid Big Data environment will persist data in diverse physical and logical formats across a virtualized cloud of interconnected memory and disk that can be elastically scaled up and out at a moment’s notice.
• Extreme affordability and manageability: The hybrid environment will incorporate flexible packaging/pricing, including licensed software, modular appliances, and subscription-based cloud approaches.
Hybrid deployments are already widespread in many real-world Big Data deployments. The most typical are the three-tier—also called “hub-and-spoke”—architectures. These environments may have, for example, Hadoop (e.g., IBM InfoSphere BigInsights) in the data acquisition, collection, staging, preprocessing, and transformation layer; relational-based MPP EDWs (e.g., IBM PureData System for Analytics) in the hub/governance layer; and in-memory databases (e.g., IBM Cognos TM1) in the access and interaction layer.
The complexity of hybrid architectures depends on the range of sources, workloads, and applications you’re trying to support. In the back-end staging tier, you might need different preprocessing clusters for each of the disparate sources: structured, semi-structured, and unstructured. In the hub tier, you may need disparate clusters configured with different underlying data platforms (RDBMS, stream computing, HDFS, HBase, Cassandra, NoSQL, and so on) and corresponding metadata, governance, and in-database execution components. And in the front-end access tier, you might require various combinations of in-memory, columnar, OLAP, dimensionless, and other database technologies to deliver the requisite performance on diverse analytic applications, ranging from operational BI to advanced analytics and complex event processing.
Ensuring that hybrid Big Data architectures stay cost-effective demands the following multipronged approach to optimization of distributed storage:
• Apply fit-for-purpose databases to particular Big Data use cases: Hybrid architectures spring from the principle that no single data storage, persistence, or structuring approach is optimal for all deployment roles and workloads. For example, no matter how well-designed the dimensional data model is within an OLAP environment, users eventually outgrow these constraints and demand more flexible decision support. Other database architectures—such as columnar, in-memory, key-value, graph, and inverted indexing—may be more appropriate for such applications, but not generic enough to address other broader deployment roles.
• Align data models with underlying structures and applications: Hybrid architectures leverage the principle that no fixed Big Data modeling approach—physical and logical—can do justice to the ever-shifting mix of queries, loads, and other operations. As you implement hybrid Big Data architectures, make sure you adopt tools that let you focus on logical data models, while the infrastructure automatically reconfigures the underlying Big Data physical data models, schemas, joins, partitions, indexes, and other artifacts for optimal query and data load performance.
• Intelligently compress and manage the data: Hybrid architectures should allow you to apply intelligent compression to Big Data sets to reduce their footprint and make optimal use of storage resources. Also, some physical data models are more inherently compact than others (e.g., tokenized and columnar storage are more efficient than row-based storage), just as some logical data models are more storage-efficient (e.g., third-normal-form relational is typically more compact than large denormalized tables stored in a dimensional star schema).
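To make the remark about tokenized and columnar storage concrete, here is a minimal Python sketch. It is a toy illustration, not how any particular database implements compression: the sample rows and the helper names `run_length_encode` and `tokenize` are invented for the example.

```python
from itertools import groupby

# Invented sample data: six rows of (date, country, event).
rows = [
    ("2014-03-01", "US", "click"),
    ("2014-03-01", "US", "click"),
    ("2014-03-01", "US", "view"),
    ("2014-03-01", "DE", "view"),
    ("2014-03-02", "DE", "view"),
    ("2014-03-02", "DE", "click"),
]


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def tokenize(values):
    """Replace each distinct value with a small integer token."""
    dictionary = {}
    tokens = [dictionary.setdefault(v, len(dictionary)) for v in values]
    return dictionary, tokens


# Pivot the rows into columns so that similar values sit next to each other.
columns = list(zip(*rows))
date_runs = run_length_encode(columns[0])
country_runs = run_length_encode(columns[1])
event_dict, event_tokens = tokenize(columns[2])

print(date_runs)
print(country_runs)
print(event_dict, event_tokens)
```

Stored row by row, the six records hold eighteen string values. Stored by column, the first two columns shrink to two runs each and the third to six small integers plus a two-entry dictionary, which is the intuition behind the efficiency claim.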
Q5. What kind of expertise do you need to run a Big Data project in the enterprise?
James Kobielus: Data-driven organizations succeed when all personnel, both technical and business, have a common understanding of the core big-data skills, tools and best practices. You need all the skills of data management, integration, modeling, and so forth that you already have running your data marts, warehouses, OLAP cubes, and the like.
Just as important, you need a team of dedicated data scientists to develop and tune the core intellectual property–statistical, predictive, and other analytic models–that drive your Big Data applications. You don’t often think of data scientists as “programmers,” per se, but they are the pivotal application developers in the age of Big Data.
The key practical difference between data scientists and other programmers, including those who develop orchestration logic, is that the former specify logic grounded in non-deterministic patterns (i.e., statistical models derived from propensities revealed inductively from historical data), whereas the latter specify logic whose basis is predetermined (i.e., if/then/else, case-based and other rules, procedural and/or declarative, that were deduced from functional analysis of some problem domain).
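The distinction drawn above can be illustrated with a small Python sketch. It is a deliberately simplified example under assumed data: the transaction amounts, the fixed limit of 1000, and the "mean plus two standard deviations" rule are all hypothetical choices, not a recommended fraud model.

```python
from statistics import mean, stdev


def rule_based_flag(amount):
    """Predetermined logic: a limit fixed in advance by analysing the domain."""
    return amount > 1000.0


def fit_threshold(history, k=2.0):
    """Statistical logic: derive the limit inductively from historical data."""
    return mean(history) + k * stdev(history)


# Invented history of past transaction amounts.
history = [100.0, 120.0, 80.0, 110.0, 90.0]
threshold = fit_threshold(history)

amount = 500.0
print(rule_based_flag(amount))   # decision from the hand-written rule
print(amount > threshold)        # decision from the data-derived model
```

The hand-written rule gives the same answer whatever the customer's history looks like, while the fitted threshold changes whenever the historical data changes.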
The practical distinctions between data scientists and other programmers have always been a bit fuzzy, and they’re growing even blurrier over time. For starters, even a cursory glance at programming paradigms shows that core analytic functions—data handling and calculation—have always been the heart of programming. For another, many business applications leverage statistical analyses and other data-science models to drive transactional and other functions.
Furthermore, data scientists and other developers use a common set of programming languages. Of course, data scientists differ from most other types of programmers in various ways that go beyond the deterministic vs. non-deterministic logic distinction mentioned above:
• Data scientists have adopted analytic domain-specific languages such as R, SAS, SPSS and Matlab.
• Data scientists specialize in business problems that are best addressed with statistical analysis.
• Data scientists are often more aligned with specific business-application domains—such as marketing campaign optimization and financial risk mitigation—than the traditional programmer.
These distinctions primarily apply to what you might call the “classic” data scientist, such as multivariate statistical analysts and data mining professionals. But the notion of a “classic” data scientist might be rapidly fading away in the big-data era as more traditional programmers need some grounding in statistical modeling in order to do their jobs effectively—or, at the very least, need to collaborate productively with statistical modelers.
Q6. How do you select the “right” software and hardware for a Big Data project?
James Kobielus: It’s best to choose the right appliance–a pre-optimized, pre-configured hardware/software appliance–for the specific workloads and applications of your Big Data project. At the same time, you should make sure that the chosen appliances can figure into the eventual cloud architecture toward which your Big Data infrastructure is likely to evolve.
An appliance is a workload-optimized system. Its hardware/software nodes are the key building block for every Big Data cloud. In other words, appliances, also known as expert integrated systems, are the bedrock of all three “Vs” of the Big Data universe, regardless of whether your specific high-level topology is centralized, hub-and-spoke, federated or some other configuration, and regardless of whether you’ve deployed all of these appliance nodes on premises or are outsourcing some or all of it to a cloud/SaaS provider.
Within the coming 2-3 years, expert integrated systems will become a dominant approach for enterprises to put Hadoop and other emerging Big Data approaches into production. Already, appliances are the principal approach in the core Big Data platform market: enterprise data warehousing solutions that implement massively parallel processing, such as those powered by IBM PureData System for Analytics.
The core categories of workloads that users need their optimized Big Data appliances to support within cloud environments are as follows:
• Big-data storage: A Big Data appliance can be a core building block in an enterprise data storage architecture. Chief uses may be for archiving, governance and replication, as well as for discovering, acquiring, aggregating and governing multistructured content. The appliance should provide the modularity, scalability and efficiency of high-performance applications for these key data consolidation functions. Typically, it would support these functions through integration with a high-capacity storage area network architecture such as IBM provides.
• Big-data processing: A Big Data appliance should support massively parallel execution of advanced data processing, manipulation, analysis and access functions. It should support the full range of advanced analytics, as well as some functions traditionally associated with EDWs, BI and OLAP. It should have all the metadata, models and other services needed to handle such core analytics functions as query, calculation, data loading and data integration. And it should handle a subset of these functions and interface through connectors to analytic platforms such as IBM PureData Systems.
• Big-data development: A Big Data appliance should support Big Data modeling, mining, exploration and analysis. The appliance should provide a scalable “sandbox” with tools that allow data scientists, predictive modelers and business analysts to interactively and collaboratively explore rich information sets. It should incorporate a high-performance analytic runtime platform where these teams can aggregate and prepare data sets, tweak segmentations and decision trees, and iterate through statistical models as they look for deep statistical patterns. It should furnish data scientists with massively parallel CPU, memory, storage and I/O capacity for tackling analytics workloads of growing complexity. And it should enable elastic scaling of sandboxes from traditional statistical analysis, data mining and predictive modeling, into new frontiers of Hadoop/MapReduce, R, geospatial, matrix manipulation, natural language processing, sentiment analysis and other resource-intensive types of Big Data processing.
A big-data appliance should not be a stand-alone server, but, instead, a repeatable, modular building block that, when deployed in larger cloud configurations, can be rapidly optimized to new workloads as they come online. Many appliances will be configured to support mixes of two or all three of these types of workloads within specific cloud nodes or specific clusters. Some will handle low latency and batch jobs with equal agility in your cloud. And still others will be entirely specialized to a particular function that they perform with lightning speed and elastic scalability. The best appliances, like IBM Netezza, facilitate flexible re-optimization by streamlining the myriad deployment, configuration and tuning tasks across larger, more complex deployments.
You may not be able to forecast with fine-grained precision the mix of workloads you’ll need to run on your big-data cloud two years from next Tuesday. But investing in the right family of big-data appliance building blocks should give you confidence that, when the day comes, you’ll have the foundation in place to provision resources rapidly and efficiently.
Q7. Is Hadoop replacing the role of OLAP (online analytical processing) in preparing data to answer specific questions?
James Kobielus: No. Hadoop is powering unstructured ETL, queryable archiving, data-science exploratory sandboxing, and other use cases. OLAP–in terms of traditional cubing–remains key to front-end query acceleration in decision support applications and data marts. In support of those front-end applications, OLAP is facing competition from other approaches, especially in-memory, columnar databases (such as the BLU Acceleration feature of IBM DB2 10.5).
Q8. Could you give some examples of successful Big Data projects?
James Kobielus: Examples are here.
James Kobielus is IBM Senior Program Director, Product Marketing, Big Data Analytics solutions. He is an industry veteran, a popular speaker and social media participant, and a thought leader in big data, Hadoop, enterprise data warehousing, advanced analytics, business intelligence, data management, and next best action technologies.
Follow ODBMS.org on Twitter: @odbmsorg
“As the database gets used, shards can grow at an uneven rate and one shard might carry a majority of the load. MongoDB corrects this by balancing shards, but because of MongoDB’s lack of concurrency this operation can stall the database unacceptably.”–John Partridge.
I have interviewed John Partridge, President & CEO of Tokutek, Inc.
Q1. Tokutek recently announced that it has eliminated the performance issues of MongoDB sharding. What was the problem?
John Partridge: The problem occurs after a shard is created. As the database gets used, shards can grow at an uneven rate and one shard might carry a majority of the load. MongoDB corrects this by balancing shards, but because of MongoDB’s lack of concurrency this operation can stall the database unacceptably (see the benchmark).
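The imbalance and rebalancing described above can be pictured with a toy Python model. This is not MongoDB's or TokuMX's balancer: the shard names, chunk counts, the `balance` helper, and the stopping threshold are all invented for illustration.

```python
def balance(chunks_per_shard, threshold=2):
    """Move one chunk at a time from the fullest shard to the emptiest.

    Stops once the gap between them is below `threshold` and returns the
    list of (source, destination) migrations that were performed.
    """
    migrations = []
    while True:
        fullest = max(chunks_per_shard, key=chunks_per_shard.get)
        emptiest = min(chunks_per_shard, key=chunks_per_shard.get)
        if chunks_per_shard[fullest] - chunks_per_shard[emptiest] < threshold:
            return migrations
        # In a real system each step is a copy followed by a delete.
        chunks_per_shard[fullest] -= 1
        chunks_per_shard[emptiest] += 1
        migrations.append((fullest, emptiest))


# One shard has grown much faster than the others.
shards = {"shard0": 12, "shard1": 3, "shard2": 3}
moves = balance(shards)
print(shards)
print(len(moves), "migrations")
```

Every migration in the sketch touches the busiest shard, which suggests why the cost of each copy-and-delete step matters so much when that shard is already carrying most of the load.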
Q2. For what kinds of applications did MongoDB users experience these bottlenecks?
John Partridge: Users who need to scale out, and rely on sharding to do so.
Q3. What is the solution you propose to this problem?
Q4. How is TokuMX v1.4 able to allow shards to be balanced and added without disruption for a NoSQL solution that scales up and scales out?
John Partridge: TokuMX replaces the B-tree indexing used in MongoDB with patented Fractal Tree indexing, which allows for significantly better concurrency (among other things). Because of the improved concurrency, data can be copied, then deleted, from one shard to another without unnecessary locking.
Q5. What is the difference in performance of your solution with respect to the basic MongoDB? What “basic” MongoDB do you use for this comparison?
John Partridge: “Basic” MongoDB is the distro that you get from MongoDB (10gen). We typically see 20x performance improvements but, as you might imagine, it depends on the application. Because TokuMX offers document-level locking rather than database-level locking, TokuMX shines when there are significant reads *and* writes.
Q6. How do you compare TokuMX with other distributions of MongoDB, such as the one from 10gen (now MongoDB)?
John Partridge: There are three major differences: 20x performance improvement, 90% smaller database size (we compress the data), and support for ACID transactions. Look at the bottom of http://www.tokutek.com/products/tokumx-for-mongodb/ for more information on each of these benefits.
Mr. Partridge brings over twenty years of experience in the software industry as a developer, investor, and entrepreneur. He joins Tokutek from StreamBase Systems which John co-founded with database pioneer Dr. Michael Stonebraker. He started his career as a software developer at Microsoft Corporation where he co-authored Excel v1.0. He later worked as a venture capitalist at Accel Partners and the Summit Accelerator Fund where he specialized in investing in early stage internet infrastructure and enterprise software companies. John holds an A.B. in Applied Mathematics / Computer Science from Harvard University and an MBA from the Stanford University Graduate School of Business.
- TokuMX vs. MongoDB : Sharding Balancer Performance Posted on February 16, 2014 by Tim Callaghan, Tokutek.
- What’s new in TokuMX 1.4, Part 4: Smaller, faster sharded clusters. Posted on February 20, 2014 by Leif Walsh, Tokutek.
“So it is not even the volume of data that imputes political or economic value. Hence, it is clear that data has enormous political and economic value. Given the increasing digitization of our world it seems inevitable that our legal, economic, and political systems, amongst others, will ascribe to formal measures of value for data.” –Michael L. Brodie.
What is the other side of Big Data? What are the societal benefits, risks, and values of Big Data? These are difficult questions to answer.
On this topic, I have interviewed Dr. Michael L. Brodie, Research Scientist at MIT Computer Science and Artificial Intelligence Laboratory. Dr. Brodie has over 40 years experience in research and industrial practice.
Q1. You recently wrote that “we are in the midst of two significant shifts – the shift to Big Data requiring new computational solutions, and the more profound shift in societal benefits, risks, and values”. Can you please elaborate on this?
Michael L. Brodie: The database world deals with data that is bounded, even if vast and growing beyond belief, and used for known, discrete models of our world most of which support a single version of truth. While Big Data expands the existing scale (volume, velocity, variety) it does far more as it takes us into a world that we experience in life but not in computing. I call the vision, the direction that Big Data is taking us, Computing Reality. A simple explanation is that in the database world, we work top-down with schemas that define how the data should behave. For example, Telecom billing systems are essentially all in an equivalence class of the same billing model and require that billing data conform. Billing databases have a single version of truth so that telecom bills have justifiable charges. Not so with Big Data.
If we impose a model or our biases on the data we may preclude the very value that we are trying to discover.
In Big Data worlds, as in life, there is not a single version of truth over the data but multiple perspectives each with a probability of being true or reasonable. We are probably not looking for one likely model but an ensemble of models each of which provides a different perspective and discloses some discoveries in the data that we otherwise would not have found.
So the one paradigm shift is from small data that involve discrete, bounded, top-down approaches to computing to big data that require bottom-up approaches that tend to be vague or probabilistic, unbounded, and support multiple perspectives. I call this latter approach Computing Reality, reflecting the vagueness and unboundedness of reality.
A second, related shift – from why to what – can be understood in terms of Scientific Discovery. The history of scientific and Western thought, starting before Aristotle and Plato, has matured into what we know today as the Scientific Method in which one makes observations about a phenomenon, e.g., sees some data, hypothesizes a model, and determines if the model makes sense over the observed data.
This process is What: What are the correlations in the data that might explain the phenomenon.
A reasonable model over the data leads to Why: the Holy Grail of Science – causation – Why does the phenomenon occur.
For over 2,000 years a little What has guided Why – Scientific Discovery through empiricism.
Big Data has the potential of turning scientific discovery on its ear. Big Data is leading to a shift from Why to What.
The value of Big Data and the emergence of Big Data analytics may shift the preponderance of scientific discovery to What, since it is so much cheaper than Why – clinical studies that take vast resources and years of careful work. Here is the challenge. Why – causation – cannot be deduced from What. It is not clear that Big Data practitioners understand the tenuous link between What and Why. Massive Big Data blunders [1, 2] suggest that this is the case.
My research into Computing Reality explores this link with the hope of providing guidance for Big Data tools and techniques and, even cooler than that, of accelerating Scientific Discovery by adding mechanisms and metrics of veracity to Big Data and its symbiosis with empiricism.
Q2. You also wrote that with Big Data “more than ever before, technology is far ahead of the law”. What do you mean by this?
Michael L. Brodie: The Fourth Amendment of the United States Constitution was tested many times by technology, including when electronic techniques could be used to determine activities inside a citizen’s home. When the constitution was written in 1787 electronic surveillance could never have been anticipated.
Today, the laws of search and seizure, based on the Fourth Amendment, permit those with a warrant to acquire all of your electronic devices so the government can examine everything on those devices, although it appears that the intent of the law was to permit search and seizure of evidence relative to the suspected offence. That is, the current laws were perfectly rational when written; however, technology has so changed the world we live in that the law, interpreted simply, allows the government to look at your entire digital life, which for many of us is much of our lives, thus minimizing or eliminating the protections of the Fourth Amendment. The simple matter is that technology will always be ahead of the law.
So we must constantly balance current and unforeseen consequences of technology advance on our lives and societies.
Since time immemorial, and as observed by Benjamin Franklin, we must always judiciously balance freedom and security; you can’t have both. Technology, more than many domains, pushes this balance.
Q3. John Podesta, Obama’s Counselor and study lead, asked the following question during a workshop: “Does our legal privacy framework support and balance safety and freedom?” What is your personal view on this, especially related to the ongoing discussion on an open and free Internet and Big Data?
Michael L. Brodie: What a great question, worthy of serious pursuit, more than I will pursue here. A fundamental part of your question concerns a free and open Internet. While it is debatable as to whether computing or the Internet has created economic growth and increased productivity, it is fair to say that our economies have become so dependent on computing that Balkanizing the Internet, as exemplified recently by Turkey, China, Brazil, and even Switzerland, will surely cause major economic disruption.
Not only does a significant portion of our existing economy ride on a currently open and free Internet, that platform has been and will continue to be a fountain of innovation and potential economic growth, and, ideally, increased productivity; not to mention the daily lives of billions of people on the planet. As we have seen in Tunisia, Egypt, North Korea, China, Syria, and other constrained countries, an open and free Internet, e.g., Twitter, is becoming a means for democratic expression and constraint on totalitarian behaviour. Much is at stake to maintain an open and free Internet.
This should encourage a robust debate of the various Internet Bills of Rights currently on offer. The Snowden-NSA incidents and the resulting events in the White House, the Supreme Court, and the US Congress clearly indicate that our legal privacy framework is inadequate. The more interesting question is what changes are required to permit a balance of freedom and safety. Such a framework should result from a robust, informed public debate on the issues. Hopefully these discussions will start in earnest. The workshop is an example of the White House’s commitment to such a discussion.
Q4. What would be the discussion on an open and free Internet, while balancing safety with freedom, if Edward Snowden had not disclosed the NSA surveillance?
Michael L. Brodie: What great questions with profound implications, clearly beyond my skills, but fun to poke at. Let me add to the question: Is Snowden a whistleblower or a terrorist? Is he working to uphold the constitution or undermine it?
I happen to have had some direct experience on this issue. From April 2006 to January 2008 I served on the United States of America National Academies Committee on Technical and Privacy Dimensions of Information for Terrorism Prevention and other National Goals, co-chaired by Dr. Charles Vest, president of the National Academy of Engineering and Dr. William Perry, former US Secretary of Defense, that was commissioned by the Department of Homeland Security and the National Science Foundation.
The recent White House investigation prompted by Snowden’s disclosures heavily cited the commission’s report.
The 21-month investigation by 20 experts chosen by the academy uncovered some aspects of what Snowden’s disclosures led to, but it did not uncover the scope and scale of the NSA actions that emerged from Snowden’s disclosures. It is not until you discover the actions that you question the relevant laws or, as the White House justifiably did, the legal privacy framework to support and balance safety and freedom.
As I said in the piece that you reference, the White House and Snowden are asking exactly the same questions. Snowden has said that he saw it as his obligation to do what he did given his oath to uphold the constitution. Hence, such a discussion could emerge without Snowden in the next decade, but it would not have emerged at the moment without his actions.
Would that it had emerged in 2006 or as a consequence of the many other similarly intended investigations.
It seems to me that Snowden blew the whistle on NSA.
Q5. De facto, the Internet is becoming a new battlefield among different political and economic systems in the world. What is the political and economic value of data?
Michael L. Brodie: Again a grand question for my betters. This is another profound question that I am not skilled to answer. But why let that stop me?
Our economic system is based on commodities, goods and services, with almost no means of attributing economic value to data. Indirectly, data is valued at inconceivably high values according to many Internet company acquisitions, especially Facebook’s recent $16 billion acquisition of WhatsApp, which appears to be acquiring people and their data by the network effect.
How do you ascribe value to data? Who owns data? Does it age and does time reduce or raise its value? If it has economic value, then what legal jurisdiction governs data? What is the political value of data? For one example look at Europe’s solicitation of business away from the United States based on data, data ownership, and data governance.
Another example is that President Lyndon Johnson achieved the US Civil Rights Bill because of data – he knew where all the bodies were buried. What is the value of data there?
So it is not even the volume of data that imputes political or economic value. Hence, it is clear that data has enormous political and economic value. Given the increasing digitization of our world it seems inevitable that our legal, economic, and political systems, amongst others, will ascribe to formal measures of value for data.
Q6. There has been a claim that “Big data” has rendered obsolete the current approach to protecting privacy and civil liberties . Is this really so?
Michael L. Brodie: Without question, expanding beyond bounded, discrete, top-down models of the world, to a vastly larger, more complex digital version of the world, requires a reevaluation of previous approaches to computing problems, including privacy and civil liberties. The quote is from Craig Mundie, who makes the observation from a policy and strategy point of view. A recent report on machine learning and curly fries claims that organizations, e.g., marketing, can create complete profiles of individuals without their permission and presumably use them in many ways, e.g., to refuse a loan. Does that threaten privacy and civil liberties?
While I quoted Mundie concerning civil liberties, my knowledge is in computing and databases. My reference concerns the fact that current solutions will simply not scale to the world of Big Data and Computing Reality. It seems a safe statement since Butler Lampson and Mike Stonebraker have both said the same thing. Simply stated, we cannot anticipate every attack, that is, every combination of data accesses that could be used to deduce private information. A famous case is to use Netflix movie selection data to identify private patient information from anonymized Medicare data. So while you may do a top-down job applying existing protection mechanisms, your only hope is to detect violations and stop further such attacks, as has been claimed for Heartbleed.
As Butler Lampson said:
“It’s time to change the way we think about computer security: instead of trying to prevent security breaches, we should focus on dealing with them after they happen.
Today computer security depends on access control, and it’s been a failure. Real world security, by contrast, is mainly retroactive: the reason burglars don’t break into my house is that they are afraid of going to jail, and the financial system is secure mainly because almost any transaction can be undone.
There are many ways to make security retroactive:
• Track down and punish offenders.
• Selectively undo data corruption caused by malware.
• Require applications and online services to respect people’s ownership of their personal data.
Access control is still needed, but it can be much more coarse-grained, and therefore both more reliable and less intrusive. Authentication and auditing are the most important features. Retroactive security will not be perfect, but perfect security is not to be had, and it will be much better than what we have now.”
 Protecting Individual Privacy in the Struggle Against Terrorism: A Framework for Program Assessment, Committee on Technical and Privacy Dimensions of Information for Terrorism Prevention and Other National Goals, National Research Council, Washington, D.C. 2008. ISBN-10: 0-309-12488-3 ISBN-13: 978-0-309-12488-1
 John Podesta, White House Counselor, White House-MIT Big Data Privacy Workshop: Advancing the State of the Art in Technology and Practice, March 4, 2014, MIT, Cambridge, MA http://web.mit.edu/bigdata-priv/agenda.html
 White House-MIT Big Data Privacy Workshop: A Personal View, Dr. Michael L. Brodie, Computer Science and Artificial Intelligence Laboratory, MIT, March 24, 2014 http://www.odbms.org/2014/04/white-house-mit-big-data-privacy-workshop/
Dr. Michael L. Brodie
Dr. Brodie has over 40 years of experience in research and industrial practice in databases, distributed systems, integration, artificial intelligence, and multi-disciplinary problem solving. He is concerned with the Big Picture aspects of information ecosystems including business, economic, social, application, and technical. Dr. Brodie is a Research Scientist, MIT Computer Science and Artificial Intelligence Laboratory; advises startups; serves on Advisory Boards of national and international research organizations; and is an adjunct professor at the National University of Ireland, Galway. For over 20 years he served as Chief Scientist of IT, Verizon, a Fortune 20 company, responsible for advanced technologies, architectures, and methodologies for Information Technology strategies and for guiding industrial scale deployments of emergent technologies, most recently Cloud Computing and Big Data, and startups Jisto.com and data-tamer.com. He has served on several National Academy of Sciences committees.
Dr. Brodie holds a PhD in Databases from the University of Toronto and a Doctor of Science (honoris causa) from the National University of Ireland.
Follow ODBMS.org on Twitter: @odbmsorg
“SciDB is both a data store and a massively parallel compute engine for numerical processing. The inclusion of this computational platform is what makes us the first “computational database”, not just a SQL-style decision support DBMS. Hence, we need a new moniker to describe this class of interactions. We settled on computational databases, but if your readers have a better suggestion, we are all ears!”
–Mike Stonebraker, Paul Brown.
On the SciDB array database, I have interviewed Mike Stonebraker, MIT Professor and Paradigm4 co-founder and CTO, and Paul Brown, Paradigm4 Chief Architect.
Q1: What is SciDB and why did you create it?
Mike Stonebraker, Paul Brown: SciDB is an open source array database with scalable, built-in complex analytics, programmable from R and Python. The requirements for SciDB emerged from discussions between academic database researchers—Mike Stonebraker and Dave DeWitt— and scientists at the first Extremely Large Databases conference (XLDB) at SLAC in 2007 about coping with the peta-scale data from the forthcoming LSST telescope.
Recognizing that commercial and industrial users were about to face the same challenges as scientists, Mike Stonebraker founded Paradigm4 in 2010 to make the ideas explored in early prototypes available as a commercial-quality software product. Paradigm4 develops and supports both a free, open-source Community Edition (scidb.org/forum) and an Enterprise Edition with additional features (paradigm4.com).
Q2. With the rise of Big Data analytics, is the convergence of analytic needs between science and industry really happening?
Mike Stonebraker, Paul Brown: There is a “sea change” occurring as companies move from Business Intelligence (think SQL analytics) to Complex Analytics (think predictive modelling, clustering, correlation, principal components analysis, graph analysis, etc.). Obviously science folks have been doing complex analytics on big data all along.
Another force driving this sea change is all the machine-generated data produced by cell phones, genomic sequencers, and by devices on the Industrial Internet and the Internet of Things. Here too science folks have been working with big data from sensors, instruments, telescopes and satellites all along. So it is quite natural that a scalable computational database like SciDB that serves the science world is a good fit for the emerging needs of commercial and industrial users.
There will be a convergence of the two markets as many more companies aspire to develop innovative products and services using complex analytics on big and diverse data. In the forefront are companies doing electronic trading on Wall Street; insurance companies developing new pricing models using telematics data; pharma and biotech companies analyzing genomics and clinical data; and manufacturing companies building predictive models to anticipate repairs on expensive machinery. We expect everybody will move to this new paradigm over time. After all, a predictive model integrating diverse data is much more useful than a chart of numbers about past behavior.
Q3. What are the typical challenges posed by scientific analytics?
Mike Stonebraker, Paul Brown: We asked a lot of working scientists the same question, and published a paper in IEEE Computing in Science & Engineering summarizing their answers (*see citation below). In a nutshell, there are four primary issues.
1. Scale. Science has always been intensely “data driven”. With the ever-increasing massive data-generating capabilities of scientific instruments, sensors, and computer simulations, the average scientist is overwhelmed with data and needs data management and analysis tools that can scale to meet his or her needs, now and in the future.
2. New Analytic Methods. Historically analysis tools have focused on business users, and have provided easy-to-use interfaces for submitting SQL aggregates to data warehouses. Such business intelligence (BI) tools are not useful to scientists, who universally want much more complex analyses, whether it be outlier detection, curve fitting, analysis of variance, predictive models or network analysis. Such “complex analytics” is defined on arrays in linear algebra, and requires a new generation of client-side tools and server side tools in DBMSs.
3. Provenance. One of the central requirements that scientists have is reproducibility. They need to be able to send their data to colleagues to rerun their experiments and produce the same answers. As such, it is crucial to keep prior versions of data in the face of updates, error correction, and the like. The right way to provide such provenance is through a no-overwrite DBMS, which allows time travel back to when the experiment in question was performed.
4. Interactivity. Unlike business users who are often comfortable with batch reporting of information, scientific users are invariably exploring their data, asking “what if” questions and testing hypotheses. What they need is interactivity on very large data sets.
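To make the no-overwrite idea from point 3 concrete, here is a minimal Python sketch. It is not SciDB code: the class `NoOverwriteStore` and its `put`/`get` methods are hypothetical names chosen for illustration. Every update appends a new version instead of replacing the old one, so a reader can ask for the value "as of" the version that was current when an experiment ran.

```python
from bisect import bisect_right


class NoOverwriteStore:
    """Toy append-only store (hypothetical): updates add versions, nothing is overwritten."""

    def __init__(self):
        # key -> list of (version, value), kept in increasing version order
        self._history = {}
        self._version = 0

    def put(self, key, value):
        """Record a new value and return the version number it was stored at."""
        self._version += 1
        self._history.setdefault(key, []).append((self._version, value))
        return self._version

    def get(self, key, as_of=None):
        """Return the value of key as seen at version as_of (latest if None)."""
        versions = self._history.get(key, [])
        if as_of is None:
            as_of = self._version
        # Position of the last entry whose version is <= as_of.
        idx = bisect_right([v for v, _ in versions], as_of)
        if idx == 0:
            raise KeyError(f"{key!r} did not exist at version {as_of}")
        return versions[idx - 1][1]


store = NoOverwriteStore()
v1 = store.put("reading", 10.0)   # original measurement
v2 = store.put("reading", 10.4)   # later error correction
latest = store.get("reading")
original = store.get("reading", as_of=v1)
```

A colleague rerunning the experiment would read with `as_of=v1` and see the uncorrected measurement, while new work reads the latest version.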
Q3. What, in your opinion, are the commonalities between scientific and industrial analytics?
Mike Stonebraker, Paul Brown: We would state the question in reverse: “What are the differences between the two markets?” In our opinion, the two markets will converge quickly as commercial and industrial companies move to the analytic paradigms pervasive in the science marketplace.
Q4. How come in the past the database system software community has failed to build the kinds of systems that scientists needed for managing massive data sets?
Mike Stonebraker, Paul Brown: Mostly it’s because scientific problems represent a $0 billion market! However, the convergence of industrial requirements and science requirements means that science can “piggy back” on the commercial market and get its needs met.
Q5. SciDB is a scalable array database with native complex analytics. Why did you choose a data model based on multidimensional arrays?
Mike Stonebraker, Paul Brown: Our main motivation is that at scale, the complex analyses done by “post sea change” users are invariably about applying parallelized linear algebraic algorithms to arrays. Whether you are doing regression, singular value decomposition, finding eigenvectors, or doing operations on graphs, you are performing a sequence of matrix operations. Obviously, this is intuitive and natural in an array data model, whereas you have to recast tables into arrays if you begin with an RDBMS or keep data in files. Also, a native array implementation can be made much faster than a table-based system by directly implementing multi-dimensional clustering and doing selective replication of neighboring data items.
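As a small illustration of the claim that regression is a sequence of matrix operations, the sketch below fits a straight line by ordinary least squares using the normal equations, beta = (XᵀX)⁻¹ Xᵀy. It uses plain Python lists to stay self-contained; the helper names (`transpose`, `matmul`, `inverse_2x2`) and the sample data are assumptions made for this example, and a real system would run parallel linear algebra routines instead.

```python
def transpose(m):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def inverse_2x2(m):
    """Invert a 2x2 matrix using the closed-form formula."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


# Sample data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

X = [[1.0, x] for x in xs]    # design matrix: intercept column plus x
y = [[v] for v in ys]         # observations as a column vector
Xt = transpose(X)

# Normal equations: beta = (X^T X)^-1 X^T y
beta = matmul(inverse_2x2(matmul(Xt, X)), matmul(Xt, y))
intercept, slope = beta[0][0], beta[1][0]
```

Every step is a transpose, a product, or an inverse of an array, which is the point the answer makes about the array data model.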
Our secondary motivation is that, just like mathematical matrices, geospatial data, time-series data, image data, and graph data are most naturally organized as arrays. By preserving the inherent ordering in the data, SciDB supports extremely fast selection (including vectors, planes, ‘hypercubes’), multi-dimensional windowed aggregates, and re-gridding to change spatial or temporal resolution.
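The two array operations named here can be sketched in a few lines of Python. This is an illustration on a small in-memory grid, not SciDB syntax; the function names `regrid_mean` and `window_mean` are hypothetical. Re-gridding averages non-overlapping blocks to produce a coarser array, and the windowed aggregate computes a moving average over each cell's neighbourhood.

```python
def regrid_mean(grid, block_rows, block_cols):
    """Coarsen a 2-D grid by averaging non-overlapping blocks."""
    rows, cols = len(grid), len(grid[0])
    if rows % block_rows or cols % block_cols:
        raise ValueError("grid shape must be a multiple of the block shape")
    out = []
    for r0 in range(0, rows, block_rows):
        out_row = []
        for c0 in range(0, cols, block_cols):
            cells = [grid[r][c]
                     for r in range(r0, r0 + block_rows)
                     for c in range(c0, c0 + block_cols)]
            out_row.append(sum(cells) / len(cells))
        out.append(out_row)
    return out


def window_mean(grid, radius):
    """Moving average over a square window of side 2*radius+1, clipped at the edges."""
    rows, cols = len(grid), len(grid[0])
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            cells = [grid[i][j]
                     for i in range(max(0, r - radius), min(rows, r + radius + 1))
                     for j in range(max(0, c - radius), min(cols, c + radius + 1))]
            out_row.append(sum(cells) / len(cells))
        out.append(out_row)
    return out


grid = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
    [13.0, 14.0, 15.0, 16.0],
]
coarse = regrid_mean(grid, 2, 2)   # 4x4 grid reduced to 2x2
smooth = window_mean(grid, 1)      # 3x3 moving average, same shape as grid
```

Both functions only ever touch neighbouring cells, which is why keeping the data in its natural array order pays off.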
Q6. How do you manage in a nutshell scalability with high degrees of tolerance to failures?
Mike Stonebraker, Paul Brown: In a nutshell? Partitioning, and redundancy (k-replication).
First, SciDB splits each array’s attributes apart, just like any columnar system. Then we partition each array into rectilinear blocks we call “chunks”. Then we employ a variety of mapping functions that map an array’s chunks to SciDB instances. For each copy of an array we use a different mapping function to create copies of each chunk on different nodes of the cluster. If a node goes down, we figure out where there is a redundant copy of the data and move the computation there.
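The chunking and replication scheme described above can be sketched as follows. The interview does not say which mapping functions SciDB uses, so the one here (row-major chunk number plus a per-copy offset, modulo the number of instances) is an assumption chosen only because it places the copies of a chunk on different instances. All function names are hypothetical.

```python
from itertools import product


def chunk_counts(array_shape, chunk_shape):
    """Number of chunks along each dimension (ceiling division)."""
    return [-(-dim // size) for dim, size in zip(array_shape, chunk_shape)]


def chunk_of(cell, chunk_shape):
    """Coordinates of the rectilinear chunk that holds a given cell."""
    return tuple(c // size for c, size in zip(cell, chunk_shape))


def all_chunks(array_shape, chunk_shape):
    """Coordinates of every chunk needed to cover the array."""
    counts = chunk_counts(array_shape, chunk_shape)
    return list(product(*(range(n) for n in counts)))


def place(chunk, copy, counts, n_instances):
    """Assumed mapping function: a different offset for each copy of the array."""
    linear = 0
    for coord, n in zip(chunk, counts):
        linear = linear * n + coord       # row-major chunk number
    return (linear + copy) % n_instances


def locate(chunk, counts, n_instances, copies, down=frozenset()):
    """Return the first live instance that holds a copy of the chunk."""
    for copy in range(copies):
        instance = place(chunk, copy, counts, n_instances)
        if instance not in down:
            return instance
    raise RuntimeError("all replicas of the chunk are unavailable")


array_shape, chunk_shape = (4, 6), (2, 3)
counts = chunk_counts(array_shape, chunk_shape)
chunk = chunk_of((3, 4), chunk_shape)
primary = locate(chunk, counts, n_instances=3, copies=2)
fallback = locate(chunk, counts, n_instances=3, copies=2, down={primary})
```

When the instance holding the primary copy is marked as down, `locate` falls through to the second copy, which mirrors the "move the computation there" step in the answer.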
Q7. How do you handle data compression in SciDB?
Mike Stonebraker, Paul Brown: Use of compression in modern data stores is a very important topic. Minimizing storage while retaining information and supporting extremely rapid data access informs every level of SciDB’s design. For example, SciDB splits every array into single-attribute components. We compress a chunk’s worth of cell values for a specific attribute. At the lowest level, we compress attribute data using techniques like run-length encoding on data. In addition, our implementation has an abstraction for compression to support other compression algorithms.
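Run-length encoding, the technique mentioned for attribute data, is simple to illustrate. The sketch below is generic Python, not SciDB's codec; `rle_encode`, `rle_decode`, and the sample values are made up for the example. It compresses one attribute's values within a chunk into (value, run length) pairs and restores them losslessly.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out


# One attribute's cell values for a single chunk (illustrative data).
chunk_values = [0, 0, 0, 0, 7, 7, 3, 0, 0]
encoded = rle_encode(chunk_values)
restored = rle_decode(encoded)
```

Storing each attribute separately tends to produce longer runs of equal values than storing whole cells together, which is where this kind of encoding helps most.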
Q8. Why support two query languages?
Mike Stonebraker, Paul Brown: Actually the primary interfaces we are promoting are R and Python as they are the languages of choice of data scientists, quants, bioinformaticians, and scientists. SciDB-R and SciDB-Py allow users to interactively query SciDB using R and Python. Data is persisted in SciDB. Math operators are overloaded so that complex analytical computations execute scalably in the database.
Early on we surveyed potential and existing SciDB users, and found there were two very different types. By and large, commercial users using RDBMSs said “make it look like SQL”. For those users we created AQL (array SQL). On the other hand, data scientists and programmers preferred R, Python, and functional languages. For the second class of users we created SciDB-R, SciDB-Py, and AFL, an array functional language.
All queries get compiled into a query plan, which is a sequence of algebraic operations. Essentially all relational versions of SQL do exactly the same thing. In SciDB, AFL, the array functional language, is the underlying language of algebraic operators. Hence, it is easy to surface and support AFL in addition to AQL, SciDB-R, and SciDB-Py, allowing us to satisfy the preferred mode of working for many classes of users.
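The idea of overloaded math operators that produce a functional-style query, rather than computing locally, can be sketched in Python. This is not the SciDB-Py API: the class `ArrayExpr` is hypothetical, and the operator names it emits (`add`, `multiply`, `transpose`) are placeholders that have not been checked against AFL. The sketch only shows the mechanism, in which each operator call extends a nested expression that a server could then plan and execute.

```python
class ArrayExpr:
    """Hypothetical handle for an array that lives in the database."""

    def __init__(self, text):
        self.text = text   # functional-form expression built so far

    def _binary(self, op, other):
        rhs = other.text if isinstance(other, ArrayExpr) else repr(other)
        return ArrayExpr(f"{op}({self.text}, {rhs})")

    def __add__(self, other):
        # Element-wise addition becomes a nested operator call, not a computation.
        return self._binary("add", other)

    def __matmul__(self, other):
        # Matrix product, written with Python's @ operator.
        return self._binary("multiply", other)

    def transpose(self):
        return ArrayExpr(f"transpose({self.text})")


A = ArrayExpr("A")
B = ArrayExpr("B")

# Nothing is computed on the client; the expression is only recorded.
plan = (A.transpose() @ A + B).text
```

Here `plan` holds the string `add(multiply(transpose(A), A), B)`, a nested sequence of algebraic operators of the kind the answer describes a query plan as being.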
Q9. You defined SciDB as a computational database – not a data warehouse, not a business-intelligence database, and not a transactional database. Could you please elaborate more on this point?
Mike Stonebraker, Paul Brown: In our opinion, there are two mature markets for DBMSs: transactional DBMSs that are optimized for large numbers of users performing short write-oriented ACID transactions, and data warehouses, which strive for high performance on SQL aggregates and other read-oriented longer queries. The users of SciDB fit into neither category. They are universally doing more complex mathematical calculations than SQL aggregates on their data, and their DBMS interactions are typically longer read-oriented queries. SciDB is both a data store and a massively parallel compute engine for numerical processing. The inclusion of this computational platform is what makes us the first “computational database”, not just a SQL-style decision support DBMS. Hence, we need a new moniker to describe this class of interactions. We settled on computational databases, but if your readers have a better suggestion, we are all ears!
Q10. How does SciDB differ from analytical databases, such as for example HP Vertica, and in-memory analytics databases such as SAP HANA?
Mike Stonebraker, Paul Brown: Both are data warehouse products, optimized for warehouse workloads. SciDB serves a different class of users from these other systems. Our customers’ data are naturally represented as arrays that don’t fit neatly or efficiently into relational tables. Our users want more sophisticated analytics—more numerical, statistical, and graph analysis—and not so much SQL OLAP.
Q11. What about Teradata?
Mike Stonebraker, Paul Brown: Another data warehouse vendor. Plus, SciDB runs on commodity hardware clusters or in a cloud and not on proprietary appliances or expensive servers.
Q12. Anything else you wish to add?
Mike Stonebraker, Paul Brown: SciDB is currently being used by commercial users for computational finance, bioinformatics and clinical informatics, satellite image analysis, and industrial analytics. The publicly accessible NIH NCBI One Thousand Genomes browser has been running on SciDB since the Fall of 2012.
Anyone can try out SciDB using an AMI or a VM available at scidb.org/forum.
Mike Stonebraker, CTO Paradigm4
Renowned database researcher, innovator, and entrepreneur: Berkeley, MIT, Postgres, Ingres, Illustra, Cohera, Streambase, Vertica, VoltDB, and now Paradigm4.
Paul Brown, Chief Architect Paradigm4
Premier database ‘plumber’ and researcher moving from the “I’s” (Ingres, Illustra, Informix, IBM) to a “P” (Paradigm4).
*Citation for IEEE paper
Stonebraker, M.; Brown, P.; Zhang, D.; Becla, J., “SciDB: A Database Management System for Applications with Complex Analytics,” Computing in Science & Engineering, vol. 15, no. 3, pp. 54-62, May-June 2013
doi: 10.1109/MCSE.2013.19, URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6461866&isnumber=6549993
- ODBMS.org: free resources related to Paradigm4