IBM – ODBMS Industry Watch

On RDBMS, NoSQL and NewSQL databases. Interview with John Ryan

Roberto V. Zicari — Fri, 09 Mar 2018 11:05:17 +0000

“The single most important lesson I’ve learned is to keep it simple. I find designers sometimes deliver over-complex, generic solutions that could (in theory) do anything, but in reality are remarkably difficult to operate, and often misunderstood.”–John Ryan

I have interviewed John Ryan, Data Warehouse Solution Architect (Director) at UBS.

RVZ

Q1. You are an experienced Data Warehouse architect, designer and developer. What are the main lessons you have learned in your career?

John Ryan: The single most important lesson I’ve learned is to keep it simple. I find designers sometimes deliver over-complex, generic solutions that could (in theory) do anything, but in reality are remarkably difficult to operate, and often misunderstood. I believe this stems from a lack of understanding of the requirement – the second most important lesson.

Everyone from the senior stakeholders to architects, designers and developers need to fully understand the goal. Not the solution, but the “problem we’re trying to solve”. End users never ask for what they need (the requirement), but instead, they describe a potential solution. IT professionals are by nature delivery focused, and, get frustrated when it appears “the user changed their mind”. I find the user seldom changes their mind. In reality, the requirement was never fully understood.

To summarise. Focus on the problem not the solution. Then (once understood), suggest a dozen solutions and pick the best one. But keep it simple.

Q2. How has the Database industry changed in the last 20 years?

John Ryan: On the surface, not a lot. As recent as 2016 Gartner estimated Oracle, Microsoft and IBM accounted for 80% of the commercial database market, but that hides an underlying trend that’s disrupting this $50 billion industry.

Around the year 2000, the primary options were Oracle, DB2 or SQL Server with data warehouse appliances from Teradata and Netezza. Fast forward to today, and the database engine rankings include over 300 databases of which 50% are open source, and there are over 11 categories including Graph, Wide column, Key-value and Document stores; each suited to a different use-case.

While relational databases are still popular, 4 of the top 10 most popular solutions are non-relational (classed as NoSQL), including MongoDB, Redis and Cassandra. Cross-reference this against a of survey of the most highly sought skills, and we find MongoDB, Redis and Cassandra again in the top ranking, with nearly 40% of respondents seeking MongoDB skills compared to just 12% seeking Oracle expertise.

Likewise open source databases make up 60% of the top ten ranking, with open source database MySQL in second place behind Oracle in the rankings, and Gartner states that “By 2018, more than 70% of new in-house applications will be developed on an [Open Source] DBMS”.

The move towards cloud computing and database-as-a-service is causing further disruption in the data warehouse space with cloud and hybrid challengers including Vertica, Amazon Redshift and Snowflake.

In conclusion, the commercial relational vendors currently dominate the market in revenue terms. However, there has been a remarkable growth of open source alternatives, and huge demand for solutions to handle high velocity unstructured and semi-structured data. These use-cases including social media, and the Internet of Things are ill-suited to the legacy structured databases provided by Oracle, DB2 and SQL Server, and this void has been largely filled by open source NoSQL and NewSQL databases.

Q3. RDBMS vs. NoSQL vs. NewSQL: How do you compare Database Technologies?

John Ryan: The traditional RDBMS solutions from Oracle, IBM and Microsoft implement the relational model on a 1970s hardware architecture, and typically provide a good general purpose database platform which can be applied to most OLTP and Data Warehouse use cases.

However, as Dr. Michael Stonebraker indicated in this 2007 paper, The End of an Architectural Era (It’s Time for a Complete Rewrite), these are no longer fit for purpose, as both the hardware technology, and processing demands have moved on. In particular, the need for real time (millisecond) performance, greater scalability to handle web-scale applications, and the need to handle unstructured and semi-structured data.

Whereas the legacy RDBMS is a general purpose (will do anything) database, the NoSQL and NewSQL solutions are dedicated to a single problem, for example, short lived OLTP operations.

The Key-Value NoSQL databases were developed to handle the massive transaction volume, and low latency needed to handle web commerce at Amazon and LinkedIn. Others (eg. MongoDB) where developed to handle semi-structured data, while still others (eg. Neo4J) were built to efficiently model data networks of the type found at Facebook or LinkedIn.

The common thread with NoSQL databases is they tend to use an API interface rather than industry standard SQL, although increasingly that’s changing. They do however, entirely reject the relational model and ACID compliance. They typically don’t support cross-table join operations, and are entirely focused on low latency, trading consistency for scalability.

The so-called NewSQL databases include VoltDB , MemSQL and CockroachDB are a return to the relational model, but re-architected for modern hardware and web scale use cases. Like NoSQL solutions, they tend to run on a shared nothing architecture, and scale to millions of transactions per second, but they also have full transaction support and ACID compliance that are critical for financial operations.

Q4. What are the typical trade-off of performance and consistency when using NoSQL and NewSQL databases to support high velocity OLTP and real time analytics?

John Ryan: The shared nothing architecture is built to support horizontal scalability, and when combined with data replication, can provide high availability and good performance. If one node in the cluster fails, the system continues, as the data is available on other nodes. The NoSQL database is built upon this architecture, and to maximize throughput, ACID compliance is relaxed in favor of Eventual Consistency, and in some cases (eg. Cassandra), it supports tunable consistency, allowing the developer to trade performance for consistency, and durability.

For example, after a write operation, the change cannot be considered durable (the D in ACID) until the change is replicated to at least one, ideally two other nodes , but this would increase latency, and reduce performance. It’s possible however, to relax this constraint, and return immediately, with the risk the change may be lost if the node crashes before the data is successfully replicated. This becomes even more of a potential issue if the node is temporarily disconnected from the network, but is allowed to continue accepting transactions until the connection is restored. In practice, consistency will be eventually be achieved when the connection is reestablished – hence the term Eventual Consistency.

A NewSQL database on the other hand accepts no such compromise, and some databases (eg. VoltDB), even support full serializability, executing transactions as if they were executed serially. Impressively, they manage this impressive feat at a rate of millions of transactions per second, potentially on commodity hardware.

Q5. One of the main challenges for real time systems architects is the potentially massive throughput required which could exceed a million transactions per second. How do handle such a challenge?

John Ryan: The short answer is – with care! The longer answer is described in my article, Big Data – Velocity. I’d break the problem into three components, Data Ingestion, Transformation and Storage.

Data ingestion requires message based middleware (eg. Apache Kafka), with a range of adapters and interfaces, and the ability to smooth out the potentially massive spikes in velocity, with the ability to stream data to multiple targets.

Transformation, typically requires an in-memory data streaming solution to restructure and transform data in near-real time. Options include Spark Streaming, Storm or Flink.

Storage and Analytics is sometimes handled by a NoSQL database, but for application simplicity (avoiding the need to implement transactions or handle eventual consistency problems in the application), I’d recommend a NewSQL database.
All the low-latency, high throughput of the NoSQL solutions, but with the flexibility and ease of a full relational database, and full SQL support.

In conclusion, the solution needs to abandon the traditional batch oriented solution in favour of an always-on streaming solution with all processing in memory.

Q6. Michael Stonebraker introduced the so called “One Size no longer fits all”-concept. Has this concept come true on the database market?

John Ryan: First stated in the paper One Size Fits All – An Idea Whose Time Has Come And Gone Dr. Michael Stonebraker argued that the legacy RDBMS dominance was at an end, and would be replaced by specialized database technology including stream processing, OLTP and Data Warehouse solutions.

Certainly disruption in the Data Warehouse database market has been accelerated with the move towards the cloud, and as this Gigaom Report illustrates, there are at least nine major players in the market, with new specialized tools including Google Big Query, Amazon Redshift and Snowflake, and the column store (in memory or on secondary storage) dominates.

Finally, the explosion of specialized NoSQL and NewSQL databases, each with its own specialty including Key-Value, Document Stores, Text Search and Graph databases lend credence to the statement “One Size no longer fits all”.

I do think however, we’re still in a transformation stage, and the shake-out is not yet complete. I think a lot of large corporations (especially Financial Services) are wary of change, but it’s already happening.

I think the quote Stewart Brand is appropriate: “Once a new technology rolls over you, if you’re not part of the steamroller, your part of the road”.

Q7. Eventual consistency vs ACID. A pragmatic approach or a step too far?

John Ryan: As with so many decisions in IT, it depends. Eventual Consistency was built into the Amazon Dynamo database as a pragmatic decision, because it’s difficult to maintain high throughput and availability at scale. Amazon accepted the relatively minor risk of inconsistencies because of the huge benefits including scalability and availability. In many other web scale applications (eg. Social media), the implications of temporary inconsistency are not important, and it’s therefore an acceptable approach.

Having said that, it does place a huge addition burden on the developer to code for relatively rare and unexpected conditions, and I’d question why would anyone settle for a database which supported eventual consistency, when full ACID compliance and transaction handling is available.

Q8. What is your take on the Lambda Architecture ?

John Ryan: The Lambda Architecture is a an approach to handle massive data volumes, and provide real time results by running two parallel solutions, a Batch Processing and a real time Speed Processing stream.

As a Data Warehouse and ETL designer, I’m aware of the complexity involved in data transformation, and I was immediately concerned about an approach which involved duplicating the logic, probably in two different technologies.

I’d also question the sense of repeatedly executing batch processing on massive data volumes when processing is moving to the cloud, on a pay for what you use basis.

I’ve since found this article on Lambda by Jay Kreps of LinkedIn and more recently Confluent (who developed Apache Kafka), and he describes the challenges from a position of real (and quite painful) experience.

The article recommends an alternative approach, a single Speed Processing stream with slight changes to allow data to be reprocessed (the primary advantage of the Lambda architecture). This solution, named the Kappa Architecture, is based upon the ability of Kafka to retain and re-play data, and it seems an entirely sensible and more logical approach.

Qx Anything else you wish to add?

John Ryan: Thank you for this opportunity. It may be a Chinese curse that “may you live in interesting times”, but I think it’s a fascinating time to be working in the database industry.

—————————–

John Ryan, Data Warehouse Solution Architect (Director), UBS.

John has over 30 years experience in the IT industry, and has maintained a keen interest in database technology since his University days in the 1980s when he invented a B-Tree index, only to find Oracle (and several others) had got there already.

He’s worked as a Data Warehouse Architect and Designer for the past 20 years in a range of industries including Mobile Communications, Energy and Financial Services. He’s regularly found writing articles on Big Data and database architecture, and you can find him on LinkedIn.

Resources

– ODBMS.org: BIG DATA, ANALYTICAL DATA PLATFORMS, DATA SCIENCE

– ODBMS.org: NEWSQL, XML, RDF DATA STORES, RDBMS

– ODBMS.org: NOSQL DATA STORES

Related Posts

– On the InterSystems IRIS Data Platform ,ODBMS Industry Watch, 2018-02-09

– Facing the Challenges of Real-Time Analytics. Interview with David Flower, ODBMS Industry Watch, 2017-12-19

– On the future of Data Warehousing. Interview with Jacque Istok and Mike Waas, ODBMS Industry Watch, 2017-11-09

– On Vertica and the new combined Micro Focus company. Interview with Colin Mahony, ODBMS Industry Watch, 2017-10-25

– On Open Source Databases. Interview with Peter Zaitsev, ODBMS Industry Watch, 2017-09-06

Follow us on Twitter: @odbmsorg

On the Industrial Internet of Things. Interview with Leon Guzenda

Roberto V. Zicari — Fri, 23 Feb 2018 15:09:18 +0000

“With IBM having contributed huge amounts of code and other resources to Spark we are likely to see an explosion in the number of new machine learning components.”–Leon Guzenda

I have interviewed Leon Guzenda, co- founder of Objectivity, Inc. We covered in the interview: the Industrial Internet of Things, Sensor Fusion systems and ThingSpan.

RVZ

Q1. What is the Industrial Internet of Things (IIoT) ? How is it different from the Internet of Things (IoT) ?

Leon Guzenda: The IIoT generally refers to the application of IoT technologies to manufacturing or process control problems. As such it is a subset of IoT with specialized extensions for the problems that it has to tackle.

Q2. What is a sensor fusion system?

Leon Guzenda: A sensor fusion system takes data streamed from multiple sensors and combines it, and possibly other data, to form a composite view of a situation or system. An example would be combining data from different kinds of reconnaissance sources, such as images, signals intelligence and infrared sensors, taken from different viewpoints to produce a 3D visualization for tracking or targeting purposes.

Q3. What are the key requirements for a database software platform for real-time data and sensor fusion systems?

Leon Guzenda: Some sensor fusion systems combine one or a few types of data from multiples sources, such as the detectors in a linear accelerator, or measurements from medical instruments, though there may be many variants of a single kind of data. However, most have to handle a wide variety of data types, ranging from video to documents and streams of financial or other information. The data may be highly interconencted by many types of relationship, forming tree or graph structures. The DBMS must make it easy to track the provenance and quality of data as algorithms are applied to the raw data to make it suitable for downstream processes and queries. The DBMS must have low latency, i.e. the time from receiving data to it being available to multiple users. It has to be able to cope with fast moving streams of data in addition to small transactions and batched inputs. Above all, it must be able to scale and work in distributed environments.

Q4. What are the main technical challenges in capturing and analysing information from many different sources in near real-time for new insights in order to make critical decisions?

Leon Guzenda: The DBMS must have the ability to support compute intensive algorithms, which generally precludes the use of tabular schemas. There is a trend to suing modular, open source components, such as Spark Machine Learning Library (MLlib), so support for Spark Dataframes is important in some applications. It must have a flexible schema so that it can adapt rapidly to deal with new or changed data sources. Maintaining consistent, low latency is challenging when fast moving streams of incoming data have to be merged with and correlated with huge volumes of existing data.

Q5. Why does ‘after the fact analysis’ not work with real-time and data sensor fusion systems?

Leon Guzenda: Processes managed with the help of sensors and fusion systems may fail or get out of control if action isn’t taken immediately when changes occur. In other cases, opportunities may be lost if resources can’t be brought to bear on a problem, be it a cybersecurity or physical threat.

Q6. What is the impact of open source technologies, such as Spark, Kafka, HDFS, YARN, for the Industrial Internet of Things?

Leon Guzenda: Apache Spark provides a scalable, standard and flexible platform for bringing multiple components together to build standard or ad hoc workflows, e.g. with YARN. Kafka and Samza make it easier to split streams of data into pipelines for parallel ingest and query handling. HDFS is good for reliably storing files, but it is far from ideal for handling randomly accessed data as it moves data in 64 MB blocks, increasing latency. Nevertheless, ThingSpan can run on HDFS with data cached by Spark, but we prefer to run it on industry standard POSIX filesystems for most purposes.

With IBM having contributed huge amounts of code and other resources to Spark we are likely to see an explosion in the number of new machine learning components. By combining this with ThingSpan’s graph analytics capabilities, we’ll be able to attack new kinds of problem.

Q7. Why ThingSpan’s offer DO as a query language and not an extension of SQL?

Leon Guzenda: We would like to contribute the graph processing ideas in DO to the SQL community and are seeking partners to try to make that happen. However, our customers need a solution now, so we considered open source options, such as Cypher and SparQL. However, we decided that it would be faster and more controllable to leverage the flexible schema and query handling components within the ThingSpan kernel to give our products a competitive edge, particularly at scale.

Q8. What are the similarities and differences between ThingSpan and Neo4j? They both handle complex graphs.

Leon Guzenda: Both handle Vertex and Edge objects. Neo4j depends on properties whereas ThingSpan can also operate with connections that have no data within them. The ThingSpan declarative query language, DO, incorporates most of the graph querying capabilities of Cypher and extends them with advanced parallel pathfinding capabilities.

However, the main differentiator is performance as a graph scales. Although Neo4j has been introducing some distributed operations and has a port for Spark, it is inherently not a distributed DBMS with a single logical view of all of the data within a repository. Although it is capable of handling graphs with millions of nodes it hasn’t shown the ability to handle very large graphs. Objectivity has customers processing tens of trillions of nodes and connections per day for thousands of analysts.

Q9. ThingSpan and Objectivity/DB: how do they relate with each other (if any)?

Leon Guzenda: ThingSpan uses Objectivity/DB as its data repository. Besides the Java, C++, C# and Python APIs It also has a REST API and adaptors for Spark Dataframes and HDFS. Objectivity/DB is a component of the ThingSpan suite and can be purchased on its own for embedded applications or to run in non-Spark environments.

Q10. What kinds of things are on the roadmap for ThingSpan?

Leon Guzenda: We recently announced the availability of ThingSpan on the Amazon AWS Market Place, making it easier to evaluate and deploy ThingSpan in a resilient, elastic cloud environment.

The next release, which is in QA at the moment, will add high speed pipelining for ingesting streamed data. It also has extensions to DO, particularly in regard to pathfinding and schema manipulation. There is also a new graph visualization tool for developers.

——————————–

Leon Guzenda, Chief Technology Marketing Officer, was one of the founding members of Objectivity in 1988 and one of the original architects of Objectivity/DB.

He worked with Objectivity’s major partners and customers to help them deploy the industry’s highest-performing, most reliable DBMS technology. Leon has over 40 years experience in the software industry. At Automation Technology Products, he managed the development of the ODBMS for the Cimplex solid modeling and numerical control system. Before that he was Principal Project Director for the Dataskil division of International Computers Ltd. in the United Kingdom, delivering major projects for NATO and leading multinationals. He was also design and development manager for ICL’s 2900 IDMS product at ICL Bracknell. He spent the first 7 years of his career working in defense and government systems.

Resources

– Internet of Things From Hype to Reality

– Objectivity Powers the Most Demanding Real-Time Data and Sensor Fusion Systems on the Planet

– Objectivity Enables Complex Configuration Management for Rockwell Collins’ Avionics Product Lines

– Kx Systems Joins the Industrial Internet Consortium

– Sensor Data Storage for Industrial IoT

– KX FUELING IOT REVOLUTION WITH MANUFACTURING WIN

Follow us on Twitter: @odbmsorg

New Gartner Magic Quadrant for Operational Database Management Systems. Interview with Nick Heudecker

Roberto V. Zicari — Wed, 30 Nov 2016 20:30:20 +0000

“It is too soon to call the operational DBMS market a commodity market, but it’s easy to see a future where that is the case.”–Nick Heudecker.

I have interviewed Nick Heudecker, Research Director on Gartner’s Data & Analytics team.
The main topic of the interview is the new Magic Quadrant for Operational Database Management Systems.

RVZ

Q1. You have published the new Magic Quadrant for Operational Database Management Systems (*). How do you define the operational database management system market?

Nick Heudecker: We define a DBMS as a complete software system used to define, create, manage, update and query a database. DBMSs provide interfaces to independent programs and tools that both support and govern the performance of a variety of concurrent workload types. There is no presupposition that DBMSs must support the relational model or that they must support the full set of possible data types in use today. OPDBMSs must include functionality to support backup and recovery, and have some form of transaction durability — although the atomicity, consistency, isolation and durability model is not a requirement. OPDBMSs may support multiple delivery models, such as stand-alone DBMS software, certified configurations, cloud (public and private) images or versions, and database appliances.

Q2. Can you explain the methodology you used for this new Magic Quadrant?

Nick Heudecker: The methodologies for several Gartner methodologies are public. The Magic Quadrant methodology can be found here.

We use a number of data sources when we’re creating the Magic Quadrant for Operational Database Management Systems.
We survey vendor reference customers and include data from our interactions with Gartner clients. We also consider earlier information and any news about vendors’ products, customers and finances that came to light during the time frame for our analysis.

Once we have the data, we score vendors across the various dimensions of Completeness of Vision and Ability to Execute.
One thing that’s important to note is Magic Quadrants are relative assessments of vendors in a market. We couldn’t have one vendor on an MQ because it would be right in the middle – there’s nothing to compare it to.

Q3. Why were there no Visionaries this year?

Nick Heudecker: We determined there was an overall lack of vision in the market. After a few years of rapid feature expansion, the focus has shifted to operational excellence and execution. Even Leaders shifted to the left on vision, but are still placed in the Leaders quadrant based on their vision for the development of hybrid database management, hardware optimization and integration, emerging deployment models such as containerization, as well as vertical features.

Q4. Were you surprised by the analysis and some of the results you obtained?

Nick Heudecker: The lack of overall vision in the market struck us the most. Other than in a few notable cases, we received largely the same story from most vendors. The explosion of features, and the vendors emerging to implement them, has slowed. The features that initiated the expansion, such as storing new data types, geographically distributed storage, cloud and flexible data consistency models, have become common. Today, nearly every established or emerging DBMS vendor supports these features to some degree. The OPDBMS market has shifted from a phase of rapid innovation to a phase of maturing products and capabilities.

Q5. Do you believe the “NoSQL” label will continue to distinguish DBMSs?

Nick Heudecker: If you look at the entire operational DBMS space, there’s already a great deal of convergence between NoSQL vendors, as well as between NoSQL and traditionally relational vendors. Nearly every vendor, nonrelational and relational, supports multiple data types, like JSON documents, graph or wide-column. NoSQL vendors are adding SQL: MongoDB’s BI Connector and Couchbase’s N1QL are good, if diverse, examples. They’re also adding things like schema management and data validation capabilities.
On the relational side, they’re adding horizontal scaling options and alternative consistency models, as well as modern APIs. And everyone either has or is adding in-memory and cloud capabilities.

It is too soon to call the operational DBMS market a commodity market, but it’s easy to see a future where that is the case.

Q6. What are the other “Vendors to Consider”?

Nick Heudecker: The other vendors to consider are vendors that did not meet the inclusion requirements for the Magic Quadrant. Usually this is because they missed our minimum revenue requirements, but that doesn’t mean they don’t have compelling products.

——————————-
Nick Heudecker is a Research Director on Gartner’s Data & Analytics team. His coverage includes data management technologies and practices.

——————————-

Resources
(*) Magic Quadrant for Operational Database Management Systems. Published: 05 October 2016 ID: G00293203Analyst(s): Nick Heudecker, Donald Feinberg, Merv Adrian, Terilyn Palanca, Rick Greenwald

– Complimentary Gartner Research: 100 Data and Analytics Predictions Through 2020. Get exclusive access to Gartner’s top 100 data and analytics predictions through 2020. Plus access other relevant Gartner research including Magic Quadrant reports for database and data warehouse solutions, and the market guide for in-memory computing (LINK to MemSQL web site – registration required).

– MarkLogic Recognized in New Gartner® Magic Quadrant. Gartner Magic Quadrant for Operational Database Management Systems positions MarkLogic® the highest for ability to execute in the Challengers Quadrant

– Accelerating Business Value with a Multi-Model, Multi-Workload Data Platform

– NuoDB Recognized by Gartner in Critical Capabilities for Operational Database Management Systems. Elastic SQL database achieves top five score in all four use cases.

– Clustrix Recognized in Gartner Magic Quadrant for Operational Database Management Systems

– Learn why EDB is named a “Challenger” in the 2016 Gartner ODBMS Magic Quadrant

– DataStax Receives Highest Scores in 2 Use Cases in Gartner’s Critical Capabilities for Operational Database Management Systems

– Gartner Scores Oracle Highest In 3 of 4 Use Cases: Gartner Critical Capabilities for Operational Database Management Systems Report

– Gartner Critical Capabilities For Operational Database Management Systems 2016 – Redis Labs Ranked Second Highest In 2/4 Categories (Link- Registation required)

Follow us on Twitter: @odbmsorg

Machines of Loving Grace. Interview with John Markoff.

Roberto V. Zicari — Thu, 11 Aug 2016 19:13:46 +0000

“Intelligent system designers do have ethical responsibilities.”
–John Markoff.

I have interviewed John Markoff, technology writer at The New York Times.
In 2013 he was awarded a Pulitzer Prize.
The interview is related to his recent book “Machines of Loving Grace: The Quest for Common Ground Between Humans and Robots, published in August of 2015 by HarperCollins Ecco.

RVZ

Q1. Do you share the concerns of prominent technology leaders such as Tesla’s chief executive, Elon Musk, who suggested we might need to regulate the development of artificial intelligence?

John Markoff: I share their concerns, but not their assertions that we may be on the cusp of some kind of singularity or rapid advance to artificial general intelligence. I do think that machine autonomy raises specific ethical and safety concerns and regulation is an obvious response.

Q2. How difficult is it to reconcile the different interests of the people who are involved in a direct or indirect way in developing and deploying new technology?

John Markoff: This is why we have governments and governmental regulation. I think AI, in that respect is no different than any other technology. It should and can be regulated when human safety is at stake.

Q3. In your book Machines of Loving Grace you argued that “we must decide to design ourselves into our future, or risk being excluded from it altogether”. What do you mean by that?

John Markoff: You can use AI technologies either to automate or to augment humans. The problem is minimized when you take an approach that is based on human centric design principles.

Q4. How is it possible in practice? Isn’t the technology space dominated by giants such as IBM, Apple,Google who dictate the direction of new technology?

John Markoff: This is a very interesting time with “giant” technology companies realizing that there are consequences in the deployment of these technologies. Google, IBM and Microsoft have all recently made public commitments to the safe use of AI.

Q5. What are the most significant new developments in the humans-computers area, that are likely to have a significant influence in our daily life in the near future?

John Markoff: One of the best things about being a reporter is that you don’t have to predict the future. You only have to note what the various visionaries say, so you can call that to their attention when their predictions prove inaccurate. With that caveat, if I am forced to bet on any particular information technology it would be augmented reality. This is because I believe that multi-touch interfaces for mobile devices simply can’t be the last step in user interface.

Q6. Do you believe that robots will really transform modern life?

John Markoff: I struggle with the definition of what is a “robot.” If something is tele-operated, for example, is it a robot? That said I think that we will increasingly be surrounded by machines that perform tasks.
The question is will they come as quickly as Silicon Valley seems to believe. My friend Paul Saffo has said, “Never mistake a clear view for a short distance.” And I think that is the case with all kinds of mobile robots, including self driving cars.

Q7. For the designers of Intelligent Systems, how difficult is to draw a line between what is human and what is machine?

John Markoff: I feel strongly that the possibility of designing cyborgs, particularly with respect to intellectual prosthesis is a boundary we should cross with great caution. Remember the Borg from StarTrek. “Resistance is futile, you will be assimilated.” I think the challenge is to use these systems to enhance human thought, not for social control.

Q8. What are the ethical responsibilities of designers of intelligent systems?

John Markoff: I think the most important aspect of that question is the simple acknowledgement that intelligent system designers do have ethical responsibilities. That has not always been the case, but it seems to be a growing force within the community of AI and robotics designers in the past five years, so I’m not entirely pessimistic.

Q9. If humans delegate decisions to machines, who will be responsible for the consequences?

John Markoff: Ben Shneiderman, the University of Maryland computer scientist and user interface designer has written eloquently on this point. Indeed he argues against autonomous systems for precisely this reason. His point is that it is essential to keep a human in the loop. If not you run the risk of abdicating ethical responsibility for system design.

Q10. Assuming there is a real potential in using data–driven methods to both help charities develop better services and products, and understand civil society activity. In your opinion, what are the key lessons and recommendations for future work in this space?

John Markoff: I’m afraid I’m not an expert in the IT needs of either charities or NGOs. That said a wide range of AI advances are already being delivered at nominal cost via smart phones. As cheap sensors proliferate virtually all everyday objects will gain intelligence that will be widely accessible.

Qx. Anything else you wish to add?

John Markoff: Only that I think it is interesting that the augmentation vs automation dichotomy is increasingly seen as a path through which to navigate the impact of these technologies. Computer system designers are the ones who will decide what the impact of these technologies are and whether to replace or augment humans in society.

—————————————-

JOHN GREGORY MARKOFF

John Markoff joined The New York Times in March 1988 as a reporter for the business section. He is now a technology writer based in San Francisco bureau of the paper. Prior to joining the Times, he worked for The San Francisco Examiner from 1985 to 1988. He reported for the New York Times Science Section from 2010 to 2015.

Markoff has written about technology and science since 1977. He covered technology and the defense industry for The Pacific News Service in San Francisco from 1977 to 1981; he was a reporter at Infoworld from 1981 to 1983; he was the West Coast editor for Byte Magazine from 1984 to 1985 and wrote a column on personal computers for The San Jose Mercury from 1983 to 1985.

He has also been a lecturer at the University of California at Berkeley School of Journalism and an adjunct faculty member of the Stanford Graduate Program on Journalism.

The Times nominated him for a Pulitzer Prize in 1995, 1998 and 2000. The San Francisco Examiner nominated him for a Pulitzer in 1987. In 2005, with a group of Times reporters, he received the Loeb Award for business journalism. In 2007 he shared the Society of American Business Editors and Writers Breaking News award. In 2013 he was awarded a Pulitzer Prize in explanatory reporting as part of a New York Times project on labor and automation.

In 2007 he became a member of the International Media Council at the World Economic Forum. Also in 2007, he was named a fellow of the Society of Professional Journalists, the organization’s highest honor.

In June of 2010 the New York Times presented him with the Nathaniel Nash Award, which is given annually for foreign and business reporting.

Born in Oakland, California on October 29, 1949, Markoff grew up in Palo Alto, California and graduated from Whitman College, Walla Walla, Washington, in 1971. He attended graduate school at the University of Oregon and received a masters degree in sociology in 1976.

Markoff is the co-author of “The High Cost of High Tech,” published in 1985 by Harper & Row. He wrote “Cyberpunk: Outlaws and Hackers on the Computer Frontier” with Katie Hafner, which was published in 1991 by Simon & Schuster.
In January of 1996 Hyperion published “Takedown: The Pursuit and Capture of America’s Most Wanted Computer Outlaw,” which he co-authored with Tsutomu Shimomura. “What the Dormouse Said: How the Sixties Counterculture shaped the Personal Computer Industry,” was published in 2005 by Viking Books. “Machines of Loving Grace: The Quest for Common Ground Between Humans and Robots,” was published in August of 2015 by HarperCollins Ecco.

He is currently researching a biography of Stewart Brand.

He is married to Leslie Terzian Markoff and they live in San Francisco, Calif.

Resources

MACHINES OF LOVING GRACE – The Quest for Common Ground Between Humans and Robots By John Markoff, Illustrated. 378 pp. Ecco/HarperCollins Publishers.

–Shneiderman’s “Eight Golden Rules of Interface Design”. These rules were obtained from the text Designing the User Interface by Ben Shneiderman.

– “Designing the User Interface”, 6th Edition. This is a revised edition of the highly successful textbook on Human Computer Interaction originally developed by Ben Shneiderman and Catherine Plaisant at the University of Maryland.

Related Posts

– Recruit Institute of Technology. Interview with Alon Halevy. ODBMS Industry Watch, Published on 2016-04-02

– Civility in the Age of Artificial Intelligence, by STEVE LOHR, technology reporter for The New York Times, ODBMS.org

– On Artificial Intelligence and Society. Interview with Oren Etzioni, ODBMS Industry Watch.

– On Big Data and Society. Interview with Viktor Mayer-Schönberger, ODBMS Industry Watch.

Follow us on Twitter: @odbmsorg

# #

On Big Data and Data Science. Interview with James Kobielus

Roberto V. Zicari — Tue, 19 Apr 2016 08:34:09 +0000

“One of the most typical mistakes in large-scale data projects is losing sight of the biases that may skew the insights you extract.”– James Kobielus

On the topics of Big Data, and Data Science, I have interviewed James Kobielus, IBM Big Data Evangelist.

RVZ

Q1. What kind of companies generate Big Data, besides the Internet giants?

James Kobielus: Big data isn’t something you “generate.” Rather, the term refers to the ability to achieve differentiated value from advanced analytics on trustworthy data at any scale. In other words, it’s a best practice, not a specific type of data or even a specific scale of data (measured in volume, velocity, and/or variety).

When considered in this light, you can identify big data analytic applications in every industry. Every C-level executive has strategic applications of big data. Here are just a smattering:

Chief Marketing Officers have been the prime movers on many big data initiatives that involve Hadoop, NoSQL, and other approaches. Their primary applications consist of marketing campaign optimization, customer churn and loyalty, upsell and cross-sell analysis, targeted offers, behavioral targeting, social media monitoring, sentiment analysis, brand monitoring, influencer analysis, customer experience optimization, content optimization, and placement optimization
Chief Information Officers use big data platforms for data discovery, data integration, business analytics, advanced analytics, exploratory data science.
Chief Operations Officers rely on big data for supply chain optimization, defect tracking, sensor monitoring, and smart grid, among other applications.
Chief Information Security Officer run security incident and event management, anti-fraud detection, and other sensitive applications on big data.
Chief Technology Officers do IT log analysis, event analytics, network analytics, and other systems monitoring, troubleshooting, and optimization applications on big data.
Chief Financial Officers run complex financial risk analysis and mitigation modeling exercises on big data platforms.

Q2. What are the most challenging problems you are facing when analysing Big Data?

James Kobielus: Searching for actionable intelligence in big data involves building and testing advanced-analytics models against large volumes of complex data that may be flowing in at high velocities.

At these scales, it’s easy to get overwhelmed in your analysis unless you automate the end-to-end processes of extracting intelligence at scale. Automation can also help control the cost of managing a growing volume of algorithmic models against ever expanding big-data collections. The key processes that need automating are data discovery, profiling, sampling, and preparation, as well as model building, scoring, and deployment.

Q3. How do you typically handle them?

James Kobielus: Automating the modeling process will boost data scientist productivity by an order of magnitude, freeing them from drudgery so that they can focus on the sorts of exploration, modeling, and visualization challenges that demand expert human judgment. Data scientists can accelerate their modeling automation initiatives by following these steps:

Virtualize access to data, metadata, rules, and predictive models, as well as to data integration, data warehousing, and advanced analytic applications through a BI semantic virtualization layer;
Unify access, governance, orchestration, automation, and administration across these resources within a service-oriented architecture;
Explore commercial tools that support maximum automation of model development, scoring, deployment, and execution;
Consolidate, accelerate, and deepen predictive analytics through integration into big-data platforms with scalable in-database execution; and
Migrate existing analytical data marts into multidomain big-data platforms with unified data, metadata, and model governance within service-oriented virtualization framework.

Q4. What are in your experience the typical mistakes made in large scale data projects?

James Kobielus: One of the most typical mistakes in large-scale data projects is losing sight of the biases that may skew the insights you extract.

Even if you accept that a data scientist’s integrity is rock-solid, intentions pure, skills stellar, and discipline rigorous, there’s no denying that bias may creep inadvertently into their work with big data. The biases may be minor or major, episodic or systematic, tangential or material to their findings and recommendations. Whatever their nature, the biases must be understood and corrected as fully as possible.

Here are some of the key sources of bias that may crop up in a data scientist’s work with big data:

Cognitive bias: This is the tendency to make skewed decisions based on pre-existing cognitive and heuristic factors–such as a misunderstanding of probabilities–rather than on the data and other hard evidence. You might say that the educated intuition that drives data science is rife with cognitive bias, but that’s not always a bad thing.

Selection bias: This is the tendency to skew your choice of data sources to those that may be most available, convenient, and cost-effective for your purposes, as opposed to being necessarily the most valid and relevant for your study. Clearly, data scientists do not have unlimited budgets, may operate under tight deadlines, and don’t use data for which they lack authorization. These constraints may introduce an unconscious bias in the big-data collections they are able to assemble.
Sampling bias: This is the tendency to skew the sampling of data sets toward subgroups of the population most relevant to the initial scope of a data-science project, thereby making it unlikely that you will uncover any meaningful correlations that may apply to other segments. Another source of sampling bias is “data dredging,” in which the data scientist uses regression techniques that may find correlations in samples but that may not be statistically significant in the wider population. Consequently, you’re likely to spuriously confirm your initial model for the segments that happen to make the sampling cut.
Modeling bias: Beyond the biases just discussed, this is the tendency to skew data-science models by starting with a biased set of project assumptions that drive selection of the wrong variables, the wrong data, the wrong algorithms, and the wrong metrics of fitness. In addition, overfitting of models to past data without regard for predictive lift is a common bias. Likewise, failure to score and iterate models in a timely fashion with fresh observational data also introduces model decay, hence bias.
Funding bias: This may be the most silent but pernicious bias in data-scientific studies of all sorts. It’s the unconscious tendency to skew all modeling assumptions, interpretations, data, and applications to favor the interests of the party–employer, customer, sponsor, etc.–that employs or otherwise financially supports the data-science initiative. Funding bias makes it highly unlikely that data scientists will uncover disruptive insights that will “break the rice bowl” in which they make their living.

Q5. How do you measure “success” when analysing data?

James Kobielus: You measure success in your ability to distill useful insights in a timely fashion from the data at your disposal.

Q6. What skills are required to be an effective Data Scientist?

James Kobielus: Data science’s learning curve is formidable. To a great degree, you will need a degree, or something substantially like it, to prove you’re committed to this career. You will need to submit yourself to a structured curriculum to certify you’ve spent the time, money and midnight oil necessary for mastering this demanding discipline.

Sure, there are run-of-the-mill degrees in data-science-related fields, and then there are uppercase, boldface, bragging-rights “DEGREES.” To some extent, it matters whether you get that old data-science sheepskin from a traditional university vs. an online school vs. a vendor-sponsored learning program. And it matters whether you only logged a year in the classroom vs. sacrificed a considerable portion of your life reaching for the golden ring of a Ph.D. And it certainly matters whether you simply skimmed the surface of old-school data science vs. pursued a deep specialization in a leading-edge advanced analytic discipline.

But what matters most to modern business isn’t that every data scientist has a big honking doctorate. What matters most is that a substantial body of personnel has a common grounding in core curriculum of skills, tools and approaches. Ideally, you want to build a team where diverse specialists with a shared foundation can collaborate productively.

Big data initiatives thrive if all data scientists have been trained and certified on a curriculum with the following foundation:

Paradigms and practices: Every data scientist should acquire a grounding in core concepts of data science, analytics and data management. They should gain a common understanding of the data science lifecycle, as well as the typical roles and responsibilities of data scientists in every phase. They should be instructed on the various role(s) of data scientists and how they work in teams and in conjunction with business domain experts and stakeholders. And they learn a standard approach for establishing, managing and operationalizing data science projects in the business.
Algorithms and modeling: Every data scientist should obtain a core understanding of linear algebra, basic statistics, linear and logistic regression, data mining, predictive modeling, cluster analysis, association rules, market basket analysis, decision trees, time-series analysis, forecasting, machine learning, Bayesian and Monte Carlo Statistics, matrix operations, sampling, text analytics, summarization, classification, primary components analysis, experimental design, unsupervised learning constrained optimization.
Tools and platforms: Every data scientist should master a core group of modeling, development and visualization tools used on your data science projects, as well as the platforms used for storage, execution, integration and governance of big data in your organization. Depending on your environment, and the extent to which data scientists work with both structured and unstructured data, this may involve some combination of data warehousing, Hadoop, stream computing, NoSQL and other platforms. It will probably also entail providing instruction in MapReduce, R and other new open-source development languages, in addition to SPSS, SAS and any other established tools.
Applications and outcomes: Every data scientist should learn the chief business applications of data science in your organization, as well as in how to work best with subject-domain experts. In many companies, data science focuses on marketing, customer service, next best offer, and other customer-centric applications. Often, these applications require that data scientists understand how to leverage customer data acquired from structured survey tools, sentiment analysis software, social media monitoring tools and other sources. It also essential that every data scientist gain an understanding of the key business outcomes–such as maximizing customer lifetime value–that should focus their modeling initiatives.

Classroom instruction is important, but a curriculum that is 100 percent devoted to reading books, taking tests and sitting through lectures is insufficient. Hands-on laboratory work is paramount for a truly well-rounded data scientist. Make sure that your data scientists acquire certifications and degrees that reflect them actually developing statistical models that use real data and address substantive business issues.

A business-oriented data-science curriculum should produce expert developers of statistical and predictive models. It should not degenerate into a program that produces analytics geeks with heads stuffed with theory but whose diplomas are only fit for hanging on the wall.

Q7. Hadoop vs. Spark: what are the pros and cons?

James Kobielus: Big data analytics infrastructures are growing more hybridized than ever. Every new technology—such as Hadoop, in-memory databases, and graph databases—finds its specific niche in terms of use cases, deployment modes, and applications for which it is best suited.

Even as Apache Spark pushes more deeply into big-data environments, it won’t substantially change this trend. Yes, of course Spark is on the fast track to ubiquity in big-data analytics. This is especially true for the next generation of machine-learning applications that feed on growing in-memory pools and require low-latency distributed computations for streaming and graph analytics. But those use cases aren’t the sum total of big-data analytics and never will be.

As we all grow more infatuated with Spark, it’s important to continually remind ourselves of what it’s not suitable for. If, for example, one considers all the critical data management, integration, and preparation tasks that must be performed prior to modeling in Spark, it’s clear that these will not be executed in any of the Spark engines (Spark SQL, Spark Streaming, GraphX). Instead, they’ll be carried out in the data platforms and elastic clusters (HDFS, Cassandra, HBase, Mesos, cloud services, etc.) upon which those engines run. Likewise, you’d be hardpressed to find anyone who’s seriously considering Spark in isolation for data warehousing, data governance, master data management, or operational business intelligence.

Above all else, Spark is the new power tool for data scientists who are pushing boundaries in the emerging era of in-memory big data analytics in low-latency scenarios of all types. Spark is proving its value as a development tool for the new generation of data scientists building the in-memory statistical models upon which it all will depend.

Let’s not fall into the delusion that everything is converging toward Spark, as if it were the ravenous maw that will devour every other big-data analytics tool and platform. Spark is just another approach that’s being fitted to and optimized for specific purposes.

And let’s resist the hype that treats Spark as Hadoop’s “successor.” This implies that Hadoop and other big-data approaches are “legacy,” rather than what they are, which is foundational. For example, no one is seriously considering doing “data lakes,” “data reservoirs,” or “data refineries” on anything but Hadoop or NoSQL.

——————–

James Kobielus is an industry veteran and serves as IBM Big Data Evangelist; Senior Program Director for Product Marketing in Big Data Analytics; and Team Lead, Technical Marketing, IBM Big Data & Analytics Hub. He spearheads thought leadership activities across the IBM Analytics solution portfolio. He has spoken at such leading industry events as IBM Insight, Hadoop Summit, and Strata. He has published several business technology books and is a very popular provider of original commentary on blogs and many social media.

Resources

– Master of Information and Data Science, UC Berkeley School of Information.

– MS in Data Science, NYU Center for Data Science.

– Free data science curriculum, kdnuggets.com

– Data Science | Coursera

– Master of Science in Data Science – Data Science Institute

–Data Mining and Applications Graduate Certificate, Stanford

–The European Data Science Academy (EDSA) designs curricula for data science training and data science education across the European Union (EU).

-The EDISON project will focus on activities to establish the new profession of ‘Data Scientist’, following the emergence of Data Science technologies (also referred to as Data Intensive or Big Data technologies) which changes the way research is done, how scientists think and how the research data are used and shared. This includes definition of the required skills, competences framework/profile, corresponding Body Of Knowledge and model curriculum. It will develop a sustainability/business model to ensure a sustainable increase of Data Scientists, graduated from universities and trained by other professional education and training institutions in Europe.
EDISON will facilitate the establishment of a Data Science education and training infrastructure at major European universities by promoting experience of ‘champion’ universities involving them into coordinated development and implementation of the model curriculum and creation of cooperative educational and training infrastructure.

–Open Source Software and IBM’s Big Data platform. By Cynthia M. Saracco, senior solutions architect at IBM’s Silicon Valley Laboratory. ODBMS.org, April 2016.

–Looking back at Big Data in 2015, By Cynthia M. Saracco, IBM Senior Solution Architect, ODBMS.org. November 2015

– Heuristics for a Data Scientist: A common sense approach. BY Silvia Dassiè, Data Scientist at Ryanair. ODBMS.org, December 2015

–The rise of immutable data stores. By Alan Morrison, Senior Manager, PwC Center for technology and innovation. ODBMS.org. October 2015

Follow us on Twitter: @odbmsorg

On Data Mining and Data Science. Interview with Charu Aggarwal

Roberto V. Zicari — Tue, 12 May 2015 12:03:08 +0000

“What is different in big data applications, is that sometimes the data is stored in a distributed sense, and even simple processing becomes more challenging” — Charu Aggarwal.

On Data Mining, Data Science and Big Data, I have interviewed Charu Aggarwal, Research Scientist at the IBM T. J. Watson Research Center, an expert in this area.

RVZ

Q1. You recently edited two books: Data Classification: Algorithms and Applications and Data Clustering: Algorithms and Applications.
What are the main lessons learned in data classification and data clustering that you can share with us?

Charu Aggarwal: The most important lesson, which is perhaps true for all of data mining applications, is that feature extraction, selection and representation are extremely important. It is all too often that we ignore these important aspects of the data mining process.

Q2. How Data Classification and Data Clustering relate to each other?

Charu Aggarwal: Data classification is the supervised version of data clustering. Data clustering is about dividing the data into groups of similar points. In data classification, examples of groups of points are made available to you. Then, for a given test instance, you are supposed to predict which group this point might belong to.
In the latter case, the groups often have a semantic interpretation. For example, the groups might correspond to fraud/not fraud labels in a credit-card application. In many cases, it is natural for the groups in classification to be clustered as well. However, this is not always the case.
Some methods such as semi-supervised clustering/classification leverage the natural connections between these problems to provide better quality results.

Q3. Can data classification and data clustering be useful also for large data sets and data streams? If yes, how?

Charu Aggarwal: Data clustering is definately useful for large data sets, because clusters can be viewed as summaries of the data. In fact, a particular form of fine-grained clustering, referred to as micro-clustering, is commonly used for summarizing high-volume streaming data in real time. These summaries are then used for many different applications, such as first-story detection, novelty detection, prediction, and so on.
In this sense, clustering plays an intermediate role in enabling other applications for large data sets.
Classification can also be used to generate different types of summary information, although it is a little less common. The reason is that classification is often used as the end-user application, rather than as an intermediate application
like clustering. Therefore, big-data serves as a challenge and as an opportunity for classification.
It serves as a challenge because of obvious computational reasons. It serves as an opportunity because you can build more complex and accurate models with larger data sets without creating a situation, where the model inadvertently overfits to the random noise in the data.

Q4. How do you typically extract “information” from Big Data?

Charu Aggarwal: This is a highly application-specific question, and it really depends on what you are looking for. For example, for the same stream of health-care data, you might be looking for different types of information, depending on whether you are trying to detect fraud, or whether you are trying to discover clinical anomalies. At the end of the day, the role of the domain expert can never be discounted.
However, the common theme in all these cases is to create a more compressed, concise, and clean representation into one of the data types we all recognize and know how to process. Of course, this step is required in all data mining applications, and not just big data applications. What is different in big data applications, is that sometimes the data is stored in a distributed sense, and even simple processing becomes more challenging.
For example, if you look at Google’s original MapReduce framework, it was motivated by a need to efficiently perform operations that are almost trivial for smaller data sets, but suddenly become very expensive in the big-data setting.

Q5. What are the typical problems and scenarios when you cluster multimedia, text, biological, categorical, network, streams, and uncertain data?

Charu Aggarwal: The heterogeneity of the data types causes significant challenges.
One problem is that the different data types may often be mixed, as a result of which the existing methods can sometimes not be used directly. Some common scenarios in which such data types arise are photo/music/video-sharing (multimedia), healthcare (time-series streams and biological), and social networks. Among these different data types, the probabilistic (uncertain) data types does not seem to have graduated from academia into industry very well. Of course, it is a new area and there is a lot of active research going on. The picture will become clearer in a few years.

Q6. How effective are today ́s clustering algorithms?

Charu Aggarwal: Clustering problems have become increasingly effective in recent years because of advances in high-dimensional methods. In the past, when the data was very high-dimensional most existing methods work poorly because of locally irrelevant attributes and concentration effects. These are collectively referred to as the curse of dimensionality. Techniques such as subspace and projected clustering have been introduced to discover clusters in lower dimensional views of the data. One nice aspect of this approach is that some variations of it are highly interpretable.

Q7. What is in common between pattern recognition, database analytics, data mining, and machine learning?

Charu Aggarwal: They really do the same thing, which is that of analyzing and gleaning insights from data. It is just that the styles and emphases are different in various communities. Database folks are more concerned
about scalability. Pattern recognition and machine learning folks are somewhat more theoretical. The statistical folks tend to use their statistical models. The data mining community is the most recent one, and it was formed to create a common meeting ground for these diverse communities.
The first KDD conference was held in 1995, and we have come a long way since then towards integration. I believe that the KDD conference has played a very major role in the amalgamation of these communities. Today, it is actually possible for the folks from database and machine learning communities to be aware of each other’s work. This was not quite true 20 years ago.

Q8. What are the most “precise” methods in data classification?

Charu Aggarwal: I am sure that you will find experts who are willing to swear by a particular model. However, each model comes with a different set of advantages over different data sets. Furthermore, some models, such as univariate decision trees and rule-based methods, have the advantage of being interpretable even when they are outperformed by other methods. After all, analysts love to know about the “why” aside from the “what.”

While I cannot say which models are the most accurate (highly data specific), I can certainly point to the most “popular” ones today from a research point of view. I would say that SVMs, and neural networks (deep learning) are the most popular classification methods. However, my personal experience has been mixed.
While I have found SVMs to work quite well across a wide variety of settings, neural networks are generally less robust. They can easily over fit to noise or show unstable performance over small ranges of parameters. I am watching the debate over deep learning with some interest to see how it plays out.

Q9. When to use Mahout for classification? and What is the advantage of using Mahout for classification?

Charu Aggarwal: Apache Mahout is a scalable machine learning environment for data mining applications. One distinguishing feature of Apache Mahout is that it builds on top of distributed infrastructures like MapReduce, and enables easy building of machine learning applications. It includes libraries of various operations and applications.
Therefore, it reduces the effort of the end user beyond the basic MapReduce framework. It should be used in cases, where the data is large enough to require the use of such distributed infrastructures.

Q10. What are your favourite success stories in Data Classifications and/or Data Clustering?

Charu Aggarwal: One of my favorite success stores is in the field of high dimensional data, where I explored the effect of locally irrelevant dimensions and concentration effects on various data mining algorithms.
I designed a suite of algorithms for such high-dimensional tasks as clustering, similarity search, and outlier detection.
The algorithms continue to be relevant even today, and we have even generalized some of these results to big-data (streaming) scenarios and other application domains, such as the graph and text domains.

Qx Anything else you wish to add?

Charu Aggarwal: Data mining and data sciences are at exciting cross-roads today. I have been working in this field since 1995, and I have never seen as much excitement about data science in my first 15 years, as I have seen
in the last 5. This is truly quite amazing!

——————————-
Charu C. Aggarwal is a Research Scientist at the IBM T. J. Watson Research Center in Yorktown Heights, New York.
He completed his B.S. from IIT Kanpur in 1993 and his Ph.D. from Massachusetts Institute of Technology in 1996.
His research interest during his Ph.D. years was in combinatorial optimization (network flow algorithms), and his thesis advisor was Professor James B. Orlin.
He has since worked in the field of performance analysis, databases, and data mining. He has published over 200 papers in refereed conferences and journals, and has applied for or been granted over 80 patents. He is author or editor of nine books.
Because of the commercial value of the above-mentioned patents, he has received several invention achievement awards and has thrice been designated a Master Inventor at IBM. He is a recipient of an IBM Corporate Award (2003) for his work on bio-terrorist threat detection in data streams, a recipient of the IBM Outstanding Innovation Award (2008) for his scientific contributions to privacy technology, and a recipient of an IBM Research Division Award (2008) for his scientific contributions to data stream research.
He has served on the program committees of most major database/data mining conferences, and served as program vice-chairs of the SIAM Conference on Data Mining, 2007, the IEEE ICDM Conference, 2007, the WWW Conference 2009, and the IEEE ICDM Conference, 2009. He served as an associate editor of the IEEE Transactions on Knowledge and Data Engineering Journal from 2004 to 2008. He is an associate editor of the ACM TKDD Journal, an action editor of the Data Mining and Knowledge Discovery Journal, an associate editor of the ACM SIGKDD Explorations, and an associate editor of the Knowledge and Information Systems Journal.
He is a fellow of the IEEE for “contributions to knowledge discovery and data mining techniques”, and a life-member of the ACM.

Resources

Books

– Data Clustering: Algorithms and Applications, Edited by Charu C. Aggarwal, Chandan K. Reddy, August 21, 2013 by Chapman and Hall/CRC

– MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat
Appeared in:OSDI’04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004. Download: PDF Version

Related Posts

On HP Distributed R. Interview with Walter Maguire and Indrajit Roy
Source: ODBMS Industry Watch, Published on 2015-04-09

On Data Curation. Interview with Andy Palmer
Source: ODBMS Industry Watch, Published on 2015-01-14

Follow ODBMS.org on Twitter: @odbmsorg

How to run a Big Data project. Interview with James Kobielus

Roberto V. Zicari — Thu, 15 May 2014 17:20:16 +0000

“You need a team of dedicated data scientists to develop and tune the core intellectual property–statistical, predictive, and other analytic models–that drive your Big Data applications. You don’t often think of data scientists as “programmers,” per se, but they are the pivotal application developers in the age of Big Data.”–James Kobielus

Managing the pitfalls and challenges of Big Data projects. On this topic I have interviewed James Kobielus, IBM Senior Program Director, Product Marketing, Big Data Analytics solutions.

RVZ

Q1. Why run a Big Data project in the enterprise?

James Kobielus: Many Big Data projects are in support of customer relationship management (CRM) initiatives in marketing, customer service, sales, and brand monitoring. Justifying a Big Data project with a CRM focus involves identifying the following quantitative ROI:
• Volume-based value: The more comprehensive your 360-degree view of customers and the more historical data you have on them, the more insight you can extract from it all and, all things considered, the better decisions you can make in the process of acquiring, retaining, growing and managing those customer relationships.
• Velocity-based value: The more customer data you can ingest rapidly into your big-data platform and the more questions that a user can pose more rapidly against that data (via queries, reports, dashboards, etc.) within a given time period prior, the more likely you are to make the right decision at the right time to achieve your customer relationship management objectives.
• Variety-based value: The more varied customer data you have – from the CRM system, social media, call-center logs, etc. – the more nuanced portrait you have on customer profiles, desires and so on, hence the better-informed decisions you can make in engaging with them.
• Veracity-based value: The more consolidated, conformed, cleansed, consistent current the data you have on customers, the more likely you are to make the right decisions based on the most accurate data.
How can you attach a dollar value to any of this? It’s not difficult. Customer lifetime value (CLV) is a standard metric that you can calculate from big-data analytics’ impact on customer acquisition, onboarding, retention, upsell, cross-sell and other concrete bottom-line indicators, as well as from corresponding improvements in operational efficiency.

Q2. What are the business decisions that need to be made in order to successfully support a Big Data project in the enterprise?

James Kobielus: In order to successfully support a Big Data project in the enterprise, you have to make the infrastructure and applications production-ready in your operations.
Production-readiness means that your big-data investment is fit to realize its full operational potential. If you think “productionizing” can be done in a single step, such as by, say, introducing HDFS NameNode redundancy, then you need a cold slap of reality. Productionizing demands a lifecycle focus that encompasses all of your big-data platforms, not just a single one (e.g., Hadoop/HDFS), and addresses more than just a single requirement (e.g., ensuring a highly available distributed file system).
Productionizing involves jumping through a series of procedural hoops to ensure that your big-data investment can function as a reliable business asset. Here are several high-level considerations to keep in mind as you ready your big-data initiative for primetime deployment:
• Stakeholders: Have you aligned your big-data initiatives with stakeholder requirements? If stakeholders haven’t clearly specified their requirements or expectations for your big-data initiative, it’s not production-ready. The criteria of production-readiness must conform to what stakeholders require, and that depends greatly on the use cases and applications they have in mind for Big Data. Service-level agreements (SLAs) vary widely for Big Data deployed as an enterprise data warehouse (EDW), as opposed to an exploratory data-science sandbox, an unstructured information transformation tier, a queryable archive, or some other use. SLAs for performance, availability, security, governance, compliance, monitoring, auditing and so forth will depend on the particulars of each big-data application, and on how your enterprise prioritizes them by criticality.
• Stacks: Have you hardened your big-data technology stack – databases, middleware, applications, tools, etc. – to address the full range of SLAs associated with the chief use cases? If the big-data platform does not meet the availability, security and other robustness requirements expected of most enterprise infrastructure, it’s not production-ready. Ideally, all production-grade big-data platforms should benefit from a common set of enterprise management tools.
• Scalability: Have you architected your environment for modular scaling to keep pace with inexorable growth in data volumes, velocities and varieties? If you can’t provision, add, or reallocate new storage, compute and network capacity on the big-data platform in a fast, cost-effective, modular way to meet new requirements, the platform is not production-ready.
• Skillsets: Have you beefed up your organization’s big-data skillsets for maximum productivity? If your staff lacks the requisite database, integration and analytics skills and tools to support your big-data initiatives over their expected life, your platform is not production-ready. Don’t go deep on Big Data until your staff skills are upgraded.
• Seamless service: Have your re-engineered your data management and analytics IT processes for seamless support for disparate big-data initiatives? If you can’t provide trouble response, user training and other support functions in an efficient, reliable fashion that’s consistent with existing operations, your big-data platform is not production-ready.
To the extent that your enterprise already has a mature enterprise data warehousing (EDW) program in production, you should use that as the template for your big-data platform. There is absolutely no need to redefine “productionizing” for Big Data’s sake.

Q3. What are the most common problems and challenges encountered in Big Data projects?

James Kobielus: The most common problems and challenges in Big Data projects revolve around integrated lifecycle management (ILM).
ILM faces a new frontier when it comes to Big Data. The core challenges are threefold: the sheer unbounded size of Big Data, the ephemeral nature of much of the new data, and the difficulty of enforcing consistent quality as the data scales along any and all of the three Vs (volume, velocity, and variability). Comprehensive ILM has grown more difficult to ensure in Big Data environments, given rapid changes in the following areas:
• New Big Data platform: Big data is ushering a menagerie of new platforms (Hadoop, NoSQL, in-memory, and graph databases) into enterprise computing environments, alongside stalwarts such MPP RDBMS, columnar, and dimensional databases. The chance that your existing ILM tools work out of the box with all of these new platforms is slim. Also, to the extent that you’re doing Big Data in a public cloud, you may be required to use whatever ILM features — strong, weak, or middling — that may be native to the provider’s environment. To mitigate your risks in this heterogeneous new world and to maintain strong confidence in your core data, you’ll need to examine new Big Data platforms closely to ensure they have ILM features (data security, governance, archiving, retention) that are commensurate to the roles for which you plan to deploy them.
• New Big Data subject domains: Big data has not altered enterprise requirements for data governance hubs where you store and manage office systems of record (customers, finances, HR). This is the role of your established EDW, most of which run on traditional RDBMS-based data platforms and incorporate strong ILM. But these systems of record data domains may have very little presence on your newer Big Data platforms, many of which focus instead on handling fresh data from social, event, sensor, clickstream, geospatial, and other new sources. These new data domains are often “ephemeral” in the sense there may be no need to retain the bulk of the data in permanent systems of record.
• New Big Data scales: Big data does not mean that your new platforms support infinite volume, instantaneous velocity, or unbounded varieties. The sheer magnitudes of new data will make it impossible to store most of it anywhere, given the stubborn technological and economic constraints we all face. This reality will deepen Big Data managers’ focus on tweaking multitemperature storage management, archiving, and retention policies. As you scale your Big Data environment, you will need to ensure that ILM requirements can be supported within your current constraints of volume (storage capacity), velocity (bandwidth, processor, and memory speeds), and variety (metadata depth).

Q4. How best is to get started with a Big Data project?

James Kobielus: Scope the project well to deliver near-term business benefit. Using the nucleus project as the foundation for accelerating future Big Data projects. Recognize that the initial database technology you use in that initial project is just one of many storage layers that will need to play together in a hybridized, multi-tier Big Data architecture of your future.
In the larger evolutionary perspective, Big Data is evolving into a hybridized paradigm under which Hadoop, massively parallel processing (MPP) enterprise data warehouses (EDW), in-memory columnar, stream computing, NoSQL, document databases, and other approaches support extreme analytics in the cloud.
Hybrid architectures address the heterogeneous reality of Big Data environments and respond to the need to incorporate both established and new analytic database approaches into a common architecture. The fundamental principle of hybrid architectures is that each constituent Big Data platform is fit-for-purpose to the role for which it’s best suited. These Big Data deployment roles may include any or all of the following:
• Data acquisition
• Collection
• Transformation
• Movement
• Cleansing
• Staging
• Sandboxing
• Modeling
• Governance
• Access
• Delivery
• Interactive exploration
• Archiving
In any role, a fit-for-purpose Big Data platform often supports specific data sources, workloads, applications, and users.
Hybrid is the future of Big Data because users increasingly realize that no single type of analytic platform is always best for all requirements. Also, platform churn—plus the heterogeneity it usually produces—will make hybrid architectures more common in Big Data deployments. The inexorable trend is toward hybrid environments that address the following enterprise Big Data imperatives:
• Extreme scalability and speed: The emerging hybrid Big Data platform will support scale-out, shared-nothing massively parallel processing, optimized appliances, optimized storage, dynamic query optimization, and mixed workload management.
• Extreme agility and elasticity: The hybrid Big Data environment will persist data in diverse physical and logical formats across a virtualized cloud of interconnected memory and disk that can be elastically scaled up and out at a moment’s notice.
• Extreme affordability and manageability: The hybrid environment will incorporate flexible packaging/pricing, including licensed software, modular appliances, and subscription-based cloud approaches.
Hybrid deployments are already widespread in many real-world Big Data deployments. The most typical are the three-tier—also called “hub-and-spoke”—architectures. These environments may have, for example, Hadoop (e.g., IBM InfoSphere BigInsights) in the data acquisition, collection, staging, preprocessing, and transformation layer; relational-based MPP EDWs (e.g., IBM PureData System for Analytics) in the hub/governance layer; and in-memory databases (e.g., IBM Cognos TM1) in the access and interaction layer.
The complexity of hybrid architectures depends on range of sources, workloads, and applications you’re trying to support. In the back-end staging tier, you might need different preprocessing clusters for each of the disparate sources: structured, semi-structured, and unstructured. In the hub tier, you may need disparate clusters configured with different underlying data platforms—RDBMS, stream computing, HDFS, HBase, Cassandra, NoSQL, and so on—-and corresponding metadata, governance, and in-database execution components. And in the front-end access tier, you might require various combinations of in-memory, columnar, OLAP, dimensionless, and other database technologies to deliver the requisite performance on diverse analytic applications, ranging from operational BI to advanced analytics and complex event processing.
Ensuring that hybrid Big Data architectures stay cost-effective demands the following multipronged approach to optimization of distributed storage:
• Apply fit-for-purpose databases to particular Big Data use cases: Hybrid architectures spring from the principle that no single data storage, persistence, or structuring approach is optimal for all deployment roles and workloads. For example, no matter how well-designed the dimensional data model is within an OLAP environment, users eventually outgrow these constraints and demand more flexible decision support. Other database architectures—such as columnar, in-memory, key-value, graph, and inverted indexing—may be more appropriate for such applications, but not generic enough to address other broader deployment roles.
• Align data models with underlying structures and applications: Hybrid architectures leverage the principle that no fixed Big Data modeling approach—physical and logical—can do justice to the ever-shifting mix of queries, loads, and other operations. As you implement hybrid Big Data architectures, make sure you adopt tools that let you focus on logical data models, while the infrastructure automatically reconfigures the underlying Big Data physical data models, schemas, joins, partitions, indexes, and other artifacts for optimal query and data load performance.
• Intelligently compress and manage the data: Hybrid architectures should allow you to apply intelligent compression to Big Data sets to reduce their footprint and make optimal use of storage resources. Also, some physical data models are more inherently compact than others (e.g., tokenized and columnar storage are more efficient than row-based storage), just as some logical data models are more storage-efficient (e.g., third-normal-form relational is typically more compact than large denormalized tables stored in a dimensional star schema).

Q5. What kind of expertise do you need to run a Big Data project in the enterprise?

James Kobielus: Data-driven organizations succeed when all personnel—both technical and business—have a common understanding of the core big-data best skills, tools and practices. You need all the skills of data management, integration, modeling, and so forth that you already have running your data marts, warehouses, OLAP cubes, and the like.

Just as important, you need a team of dedicated data scientists to develop and tune the core intellectual property–statistical, predictive, and other analytic models–that drive your Big Data applications. You don’t often think of data scientists as “programmers,” per se, but they are the pivotal application developers in the age of Big Data.
The key practical difference between data scientists and other programmers—including those who develop orchestration logic—is that the former specifies logic grounded in non-deterministic patterns (i.e., statistical models derived from propensities revealed inductively from historical data), whereas the latter specifies logic whose basis is predetermined (i.e., if/then/else, case-based and other rules, procedural and/or declarative, that were deduced from functional analysis of some problem domain).
The practical distinctions between data scientists and other programmers have always been a bit fuzzy, and they’re growing even blurrier over time. For starters, even a cursory glance at programming paradigms shows that core analytic functions—data handling and calculation—have always been the heart of programming. For another, many business applications leverage statistical analyses and other data-science models to drive transactional and other functions.
Furthermore, data scientists and other developers use a common set of programming languages. Of course, data scientists differ from most other types of programmers in various ways that go beyond the deterministic vs. non-deterministic logic distinction mentioned above:
• Data scientists have adopted analytic domain-specific languages such as R, SAS, SPSS and Matlab.
• Data scientists specialize in business problems that are best addressed with statistical analysis.
• Data scientists are often more aligned with specific business-application domains—such as marketing campaign optimization and financial risk mitigation—than the traditional programmer.
These distinctions primarily apply to what you might call the “classic” data scientist, such as multivariate statistical analysts and data mining professionals. But the notion of a “classic” data scientist might be rapidly fading away in the big-data era as more traditional programmers need some grounding in statistical modeling in order to do their jobs effectively—or, at the very least, need to collaborate productively with statistical modelers.

Q6. How do you select the “right” software and hardware for a Big Data project?

James Kobielus: It’s best to choose the right appliance–a pre-optimized, pre-configured hardware/software appliance–for the specific workloads and applications of your Big Data project. At the same time, you should make sure that the chosen appliances can figure into the eventual cloud architecture toward which your Big Data infrastructure is likely to evolve.

An appliance is a workload-optimized system. Its hardware/software nodes are the key building block for every Big Data cloud. In other words, appliances, also known as expert integrated systems, are the bedrock of all three “Vs” of the Big Data universe, regardless of whether your specific high-level topology is centralized, hub-and-spoke, federated or some other configuration, and regardless of whether you’ve deployed all of these appliance nodes on premises or are outsourcing some or all of it to a cloud/SaaS provider.
Within the coming 2-3 years, expert integrated systems will become a dominant approach for enterprises to put Hadoop and other emerging Big Data approaches into production. Already, appliances are the principal approach in the core Big Data platform market: enterprise data warehousing solutions that implement massively parallel processing, such as those powered by IBM PureData Systems for Analytics..
The core categories of workloads that user need their optimized Big Data appliances to support within cloud environments are as follows:
• Big-data storage: A Big Data appliance can be core building block in a enterprise data storage architecture. Chief uses may be for archiving, governance and replication, as well as for discovering, acquiring, aggregating and governing multistructured content. The appliance should provide the modularity, scalability and efficiency of high-performance applications for these key data consolidation functions. Typically, it would support these functions through integration with a high-capacity storage area network architecture such as IBM provides.
• Big-data processing: A Big Data appliance should support massively parallel execution of advanced data processing, manipulation, analysis and access functions. It should support the full range of advanced analytics, as well as some functions traditionally associated with EDWs, BI and OLAP. It should have all the metadata, models and other services needed to handle such core analytics functions as query, calculation, data loading and data integration. And it should handle a subset of these functions and interface through connectors to analytic platforms such as IBM PureData Systems.
• Big-data development: A Big Data appliance should support Big Data modeling, mining, exploration and analysis. The appliance should provide a scalable “sandbox” with tools that allow data scientists, predictive modelers and business analysts to interactively and collaboratively explore rich information sets. It should incorporate a high-performance analytic runtime platform where these teams can aggregate and prepare data sets, tweak segmentations and decision trees, and iterate through statistical models as they look for deep statistical patterns. It should furnish data scientists with massively parallel CPU, memory, storage and I/O capacity for tackling analytics workloads of growing complexity. And it should enable elastic scaling of sandboxes from traditional statistical analysis, data mining and predictive modeling, into new frontiers of Hadoop/MapReduce, R, geospatial, matrix manipulation, natural language processing, sentiment analysis and other resource-intensive types of Big Data processing.
A big-data appliance should not be a stand-alone server, but, instead, a repeatable, modular building block that, when deployed in larger cloud configurations, can be rapidly optimized to new workloads as they come online. Many appliances will be configured to support mixes of two or all three of these types of workloads within specific cloud nodes or specific clusters. Some will handle low latency and batch jobs with equal agility in your cloud. And still others will be entirely specialized to a particular function that they perform with lightning speed and elastic scalability. The best appliances, like IBM Netezza, facilitate flexible re-optimization by streamlining the myriad deployment, configuration tuning tasks across larger, more complex deployments.
You may not be able to forecast with fine-grained precision the mix of workloads you’ll need to run on your big-data cloud two years from next Tuesday. But investing in the right family of big-data appliance building blocks should give you confidence that, when the day comes, you’ll have the foundation in place to provision resources rapidly and efficiently.

Q7. Is Hadoop replacing the role of OLAP (online analytical processing) in preparing data to answer specific questions?

James Kobielus: No. Hadoop is powering unstructured ETL, queryable archiving, data-science exploratory sandboxing, and other use cases. OLAP–in terms of traditional cubing–remains key to front-end query acceleration in decision support applications and data marts. In support of those front-end applicatioins, OLAP is facing competition from other approaches, especially in-memory, columnar databases (such as the BLU Acceleration feature of IBM DB2 10.5).

Q8. Could you give some examples of successful Big Data projects?

James Kobielus: Examples are here.

James Kobielus is IBM Senior Program Director, Product Marketing, Big Data Analytics solutions. He is an industry veteran, a popular speaker and social media participant, and a thought leader in big data, Hadoop, enterprise data warehousing, advanced analytics, business intelligence, data management, and next best action technologies.

–What are the challenges for modern Data Centers? Interview with David Gorbet.
ODBMS Industry Watch, March 25, 2014

–Setting up a Big Data project. Interview with Cynthia M. Saracco.
ODBMS Industry Watch, January 27, 2014

Resources

– BIG DATA AND ANALYTICAL DATA PLATFORMS

Downloads for:

Follow ODBMS.org on Twitter: @odbmsorg
##

Setting up a Big Data project. Interview with Cynthia M. Saracco.

Roberto V. Zicari — Mon, 27 Jan 2014 07:39:46 +0000

“Begin with a clear definition of the project’s business objectives and timeline, and be sure that you have appropriate executive sponsorship. The key stakeholders need to agree on a minimal set of compelling results that will impact your business; furthermore, technical leaders need to buy into the overall feasibility of the project and bring design and implementation ideas to the table.”–Cynthia M. Saracco.

How easy is to set up a Big Data project? On this topic I have interviewed Cynthia M. Saracco, senior solutions architect at IBM’s Silicon Valley Laboratory. Cynthia is an expert in Big Data, analytics, and emerging technologies. She has more than 25 years of software industry experience.

RVZ

Q1. How best is to get started with a Big Data project?

Cynthia M. Saracco: Begin with a clear definition of the project’s business objectives and timeline, and be sure that you have appropriate executive sponsorship.
The key stakeholders need to agree on a minimal set of compelling results that will impact your business; furthermore, technical leaders need to buy into the overall feasibility of the project and bring design and implementation ideas to the table. At that point, you can evaluate your technical options for the best fit. Those options might include Hadoop, a relational DBMS, a stream processing engine, analytic tools, visualization tools, and other types of software. Often, a combination of several types of software is needed for a single Big Data project. Keep in mind that every technology has its strengths and weaknesses, so be sure you understand enough about the technologies you’re inclined to use before moving forward.

If you decide that Hadoop should be part of your project, give serious consideration to using a distribution that packages commonly needed components into a single bundle so you can minimize the time required to install and configure your environment. It’s also helpful to keep in mind the existing skills of your staff and seek out offerings that enable them to be productive quickly.
Tools, applications, and support for common scripting and query languages all contribute to improved productivity. If your business application needs to integrate with existing analytical tools, DBMSs, or other software, look for offerings that have some built-in support for that as well.

Finally, because Big Data projects can get pretty complex, I often find it helpful to segment the work into broad categories and then drill down into each to create a solid plan. Examples of common technical tasks include collecting data (perhaps from various sources), preparing the data for analysis (which can range from simple format conversions to more sophisticated data cleansing and enrichment operations), analyzing the data, and rendering or sharing the results of that analysis with business users or downstream applications. Consider scalability and performance needs in addition to your functional requirements.

Q2. What are the most common problems and challenges encountered in Big Data projects?

Cynthia M. Saracco: Lack of appropriately scoped objectives and lack of required skills are two common problems. Regarding objectives, you need to find an appropriate use case that will impact your business and tailor your project’s technical work to meet the business goals of that project efficiently. Big Data is an exciting, rapidly evolving technology area, and it’s easy to get side tracked experimenting with technical features that may not be essential to solving your business problem. While such experimentation can be fun and educational, it can also result in project delays as well as deliverables that are off target. In addition, without well-scoped business objectives, the technical staff may end up chasing a moving target.

Regarding skills, there’s high demand for data scientists, architects, and developers experienced with Big Data projects. So you may need to decide if you want to engage a service provider to supplement in-house skills or if you want to focus on growing (or acquiring) new in-house skills. Fortunately, there are a number of Big Data training options available today that didn’t exist several years ago. Online courses, conferences, workshops, MeetUps, and self-study tutorials can help motivated technical professionals expand their skill set. However, from a project management point of view, organizations need to be realistic about the time required for staff to learn new Big Data technologies. Giving someone a few days or weeks to master Hadoop and its complementary offerings isn’t very realistic. But really, I see the skills challenge as a point-in-time issue. Many people recognize the demand for Big Data skills and are actively expanding their skills, so supply will grow.

Q3. Do you have any metrics to define how good is the “value” that can be derived by analyzing Big Data?

Cynthia M. Saracco: Most organizations want to focus on their return on investment (ROI). Even if your Big Data solution uses open source software, there are still expenses involved for designing, developing, deploying, and maintaining your solution. So what did your business gain from that investment?
The answer to that question is going to be specific to your application and your business. For example, if a telecommunications firm is able to reduce customer churn by 10% as a result of a Big Data project, what’s that worth? If an organization can improve the effectiveness of an email marketing campaign by 20%, what’s that worth? If an organization can respond to business requests twice as quickly, what’s that worth? Many clients have these kinds of metrics in mind as they seek to quantify the value they have derived — or hope to derive — from their investment in a Big Data project.

Q4. Is Hadoop replacing the role of OLAP (online analytical processing) in preparing data to answer specific questions?

Cynthia M. Saracco: More often, I’ve seen Hadoop used to augment or extend traditional forms of analytical processing, such as OLAP, rather than completely replace them. For example, Hadoop is often deployed to bring large volumes of new types of information into the analytical mix — information that might have traditionally been ignored or discarded. Log data, sensor data, and social data are just a few examples of that. And yes, preparing that data for analysis is certainly one of the tasks for which Hadoop is used.

Q4. IBM is offering BigInsights and Big SQL? What is it?

Cynthia M. Saracco: InfoSphere BigInsights is IBM’s Hadoop-based platform for analyzing and managing Big Data. It includes Hadoop, a number of complementary open source projects (such as HBase, Hive, ZooKeeper, Flume, Pig, and others) and a number of IBM-specific technologies designed to add value.

Big SQL is part of BigInsights. It’s IBM’s SQL interface to data stored in BigInsights. Users can create tables, query data, load data from various sources, and perform other functions. For a quick introduction to Big SQL, read this article.

Q5. How does it compare to RDBMS technology? When’s it most useful?

Cynthia M. Saracco: Big SQL provides standard SQL-based query access to data managed by BigInsights. Query support includes joins, unions, sub-queries, windowed aggregates, and other popular capabilities. Because Big SQL is designed to exploit the Hadoop ecosystem, it introduces Hadoop-specific language extensions for certain SQL statements.
For example, Big SQL supports Hive and HBase for storage management, so a Big SQL CREATE TABLE statement might include clauses related to data formats, field delimiters, SerDes (serializers/deserializers), column mappings, column families, etc. The article I mentioned earlier has some examples of these, and the product InfoCenter has further details.

In many ways, Big SQL can serve as an easy on-ramp to Hadoop for technical professionals who have a relational DBMS background. Big SQL is good for organizations that want to exploit in-house SQL skills to work with data managed by BigInsights. Because Big SQL supports JDBC and ODBC, many traditional SQL-based tools can work readily with Big SQL tables, which can also make Big Data easier to use by a broader user community.

However, Big SQL doesn’t turn Hadoop — or BigInsights — into a relational DBMS. Commercial relational DBMSs come with built-in, ACID-based transaction management services and model data largely in tabular formats. They support granular levels of security via SQL GRANT and REVOKE statements. In addition, some RDBMSs support 3GL applications developed in “legacy” programming languages such as COBOL. These are some examples of capabilities aren’t part of Big SQL.

Q6. What are some of its current limitations?

Cynthia M. Saracco: The current level of Big SQL included in BigInsights V2.1.0.1 enables users to create tables but not views.
Date/time data is supported through a full TIMESTAMP data type, and some common SQL operations supported by relational DBMSs aren’t available or have specific restrictions.
Examples include INSERT, UPDATE, DELETE, GRANT, and REVOKE statements. For more details on what’s currently supported in Big SQL, skim through the InfoCenter.

Q7. How BigInsights differs from / adds value to open source Hadoop?

Cynthia M. Saracco: As I mentioned earlier, BigInsights includes a number of IBM-specific technologies designed to add value to the open source technologies included with the product. Very briefly, these include:
– A Web console with administrative facilities, a Web application catalog, customizable dashboards, and other features.
– A text analytic engine and library that extracts phone numbers, names, URLs, addresses, and other popular business artifacts from messages, documents, and other forms of textual data.
– Big SQL, which I mentioned earlier.
– BigSheets, a spreadsheet-style tool for business analysts.
– Web-accessible sample applications for importing and exporting data, collecting data from social media sites, executing ad hoc queries, and monitoring the cluster. In addition, application accelerators (tool kits with dozens of pre-built software articles) are available for those working with social data and machine data.
– Eclipse tooling to speed development and testing of BigInsights applications, new text extractors, BigSheets functions, SQL-based applications, Java applications, and more.
– An integrated installation tool that installs and configures all selected components across the cluster and performs a system-wide health check.
– Connectivity to popular enterprise software offerings, including IBM and non-IBM RDBMSs.
– Platform enhancements focusing on performance, security, and availability. These include options to use with an alternative, POSIX-compliant distributed file system (GPFS-FPO) and an alternative MapReduce layer (Adaptive MapReduce) that features Platform Symphony’s advanced job scheduler, workload manager, and other capabilities.

You might wonder what practical benefits these kinds of capabilities bring. While that varies according to each organization’s usage patterns, one industry analyst study concluded that BigInsights lowers total cost of ownership (TCO) by an average of 28% over a three-year period compared with an open source-only implementation.

Finally, a number of IBM and partner offerings support BigInsights, which is something that’s important to organizations that want to integrate a Hadoop-based environment into their broader IT infrastructure. Some examples of IBM products that support BigInsights include DataStage, Cognos Business Intelligence, Data Explorer, and InfoSphere Streams.

Q8. Could you give some examples of successful Big Data projects?

Cynthia M. Saracco: I’ll summarize a few that have been publicly discussed so you can follow links I provide for more details. An energy firm launched a Big Data project to analyze large volumes of data that could help it improve the placement of new wind turbines and significantly reduce response time to business user requests.
A financial services firm is using Big Data to process large volumes of text data in minutes and offer its clients more comprehensive information based on both in-house and Internet-based data.
An online marketing firm is using Big Data to improve the performance of its clients email campaigns.
And other firms are using Big Data to detect fraud, assess risk, cross-sell products and services, prevent or minimize network outages, and so on. You can find a collection of videos about Big Data projects undertaken by various organizations; many of these videos feature users speaking directly about their Big Data experiences and the results of their projects.
And a recent report on Analytics: The real-world use of big data contains further examples. based the results of a survey of more than 1100 businesses that the Said Business School at the University of Oxford conducted with IBM’s Institute for Business Value.

Qx Anything else to add?

Cynthia M. Saracco: Hadoop isn’t the only technology relevant to managing and analyzing Big Data, and IBM’s Big Data software portfolio certainly includes more than BigInsights (its Hadoop-based offering). But if you’re a technologist who wants to learn more about Hadoop, your best bet is to work with the software. You’ll find a number of free online courses in the public domain, such as those at Big Data University. And IBM offers a free copy of its Quick Start Edition of BigInsights as a VMWare image or an installable image to help you get started with minimal effort.

—–
Cynthia M. Saracco is a senior solutions architect at IBM’s Silicon Valley Laboratory, specializing in Big Data, analytics, and emerging technologies. She has more than 25 years of software industry experience, has written three books and more than 70 technical papers, and holds six patents.
—————
Related Posts

– Big Data: Three questions to Pivotal. ODBMS Industry Watch, January 20, 2014.

–Big Data: Three questions to InterSystems. ODBMS Industry Watch, January 13, 2014.

– Operational Database Management Systems. Interview with Nick Heudecker. ODBMS Industry Watch, December 16, 2013.

– On Big Data and Hadoop. Interview with Paul C. Zikopoulos. ODBMS Industry Watch, June 10, 2013.

Resources

– What’s the big deal about Big SQL? by Cynthia M. Saracco , Senior Software Engineer, IBM, and Uttam Jain, Software Architect, IBM.

Follow ODBMS.org on Twitter: @odbmsorg

##
——-

On Big Data and Hadoop. Interview with Paul C. Zikopoulos.

Roberto V. Zicari — Mon, 10 Jun 2013 06:35:23 +0000

“We’re not all LinkedIns and Facebooks; we don’t have budgets to hire 1000s of new hires with these skills, and what’s more we’ve invested in existing skills and people today. So to democratize Big Data, you need it to be consumable and integrated. These will flatten the time to value for Hadoop” — Paul C. Zikopoulos.

I have interviewed Paul C. Zikopoulos, Director of Technical Professionals for IBM Software Group’s Information Management division. The topic: Apache Hadoop and Big Data, State of the Union in 2013 and Vision for the future.

RVZ

Q1. What what do you think is still needed for big data analytics to be really useful for the enterprise?

Paul C. Zikopoulos: Integration and Consumability. We’re not all LinkedIns and Facebooks; we don’t have budgets to hire 1000s of new hires with these skills, and what’s more we’ve invested in existing skills and people today.
So to democratize Big Data, you need it to be consumable and integrated.
These will flatten the time to value for Hadoop. IBM is working really hard in these areas. I could go into other areas, but this is key.

Q2. Hadoop is still quite new for many enterprises, and different enterprises are at different stages in their Hadoop journey.
When you speak with your customers what are the typical use cases and requirements they have?

Paul C. Zikopoulos: No matter what industry I’m working with, 90% of the Big Data use cases always have 2 common denominators: Whole Population Analytics to break free of traditional capacity constrained samples and analytics for data at-rest moving to in-motion.
So if you think about churn prediction, next best action, next best offer, fraud prediction, condition monitor, out of tolerance quality predictors, and more – it’s all going to rely on using more data (could be volume, could be variety, and often both) to build better models.
If you’re looking for specific use cases by industry, here’s a bunch of them that we’ve worked with clients on at IBM.

Q3. How do you categorize the various stages of the Hadoop usage in the enterprises?

Paul C. Zikopoulos: The IBM Institute for Business Value did a joint study with Said Business School (University of Oxford). They talked to a lot of Big Data folks and found that 28% were in the pilot phase, 24% haven’t started anything, and 47% are planning. After going through their research, they broke the answers into four stages: Educate / Explore / Engage / Execute.
So I’ll detail those four stages, but you can get the entire study here.

Educate: Building a base of knowledge (24 percent of respondents).
In the Educate stage, the primary focus is on awareness and knowledge development.
Almost 25 percent of respondents indicated they are not yet using big data within their organizations. While some remain relatively unaware of the topic of big data, our interviews suggest that most organizations in this stage are studying the potential benefits of big data technologies and analytics, and trying to better understand how big data can help address important business opportunities in their own industries or markets.
Within these organizations, it is mainly individuals doing the knowledge gathering as opposed to formal work groups, and their learnings are not yet being used by the organization. As a result, the potential for big data has not yet been fully understood and embraced by the business executives.

Explore: Defining the business case and roadmap (47 percent).
The focus of the Explore stage is to develop an organization’s roadmap for big data development.
Almost half of respondents reported formal, ongoing discussions within their organizations about how to use big data to solve important business challenges.
Key objectives of these organizations include developing a quantifiable business case and creating a big data blueprint.
This strategy and roadmap takes into consideration existing data, technology and skills, and then outlines where to start and how to develop a plan aligned with the organization’s business strategy.

Engage: Embracing big data (22 percent).
In the Engage stage, organizations begin to prove the business value of big data, as well as perform an assessment of their technologies and skills.
More than one in five respondent organizations is currently developing POCs to validate the requirements associated with implementing big data initiatives, as well as to articulate the expected returns. Organizations in this group are working – within a defined, limited scope – to understand and test the technologies and skills required to capitalize on new sources of data.

Execute: Implementing big data at scale (6 percent).
In the Execute stage, big data and analytics capabilities are more widely operationalized and implemented within the organization. However, only 6 percent of respondents reported that their organizations have implemented two or more big data solutions at scale – the threshold for advancing to this stage. The small number of organizations in the Execute stage is consistent with the implementations we see in the marketplace. Importantly, these leading organizations are leveraging big data to transform their businesses and thus are deriving the greatest value from their information assets.
With the rate of enterprise big data adoption accelerating rapidly – as evidenced by 22 percent of respondents in the Engage stage, with either POCs or active pilots underway – we expect the percentage of organizations at this stage to more than double over the next year. NOW ! While only 6% are executing, about 25% of respondents in this study are ‘piloting’ initiatives.

Q4. Could you give us some examples on how do you get (Big) Data Insights?

Paul C. Zikopoulos: IBM has a non-forked version of Hadoop called BigInsights.
When it comes to open source, it’s really hard to look past IBM’s achievements. Lucene, Apache Derby, Apache Jakarta, Apache Geronimo, Eclipse and so much more – so it shouldn’t surprise anyone that IBM is squarely in Hadoop’s corner.
Our strategy here is Embrace and Extend. We will embrace the open source Hadoop community. We are a vibrant part of it (in the latest Hadoop patch as of the time of this interview, the most fixes came from IBM; we have a number of contribution to HBase, and more). IBM has a long history in understanding enterprise concerns, that’s the extend part.
Some of the extensions work just fine with open source. For example, we provide a rich management tool, a quick installer, and concentrate opens ports into a single one to make your Hadoop cluster pass audit easier.
Some of our extensions overlay Hadoop. For example, our Adaptive Map Reduce which can deliver a 30% performance boost using its algorithms to optimize the overhead of MapReduce task startup.
We have enhanced schedulers, announced the option to use GPFS as the file system which provides a lot of benefits, and more. But these are optional. If you use BigInsights you are using a non-forked Hadoop distro.
Some of our extensions are ’round-trip-able’ – if you use them, you can walk back to pure Open Source Hadoop at any time, and some aren’t. If you want to get our fast to install non extended version of Hadoop for free, you can download InfoSphere BigInsights Basic Edition here.

Q5. What are the main technical challenges for big data analytics when data is in motion rather than at rest?

Paul C. Zikopoulos: Well the challenge is to ask yourselves how do I get those analytics artifacts that I learn at rest either in Hadoop or the EDW and get them to real time; I call this Nowcasting instead of Forecasting.
In order to do that, with agility and speed, you’re going to want a platform that’s designed for in-motion at-rest analytics.
I’m not seeing that in the marketplace today. In fact, I’m not seeing a focus on in-motion analytics.
When I refer to in-motion, I refer to the Velocity attribute of Big Data (people often talk to the Big Vs in Big Data, so that’s the one for in-motion). Velocity IS the game change.
It’s not just how fast data is produces or changes, BUT the speed at which it must be understood, acted upon, turned into something useful. So to me the main technical challenge in getting to in-motion from at-rest is the fact that I’m not really seeing that kind of true integration and it’s something we squarely hit on in the IBM Big Data platform.
Let me share an example, if you were to build some text analytical function at rest in Hadoop, perhaps an email phrase that’s highly correlated with a customer churn even, you can SEAMLESSLY take that artifact and deploy it on InfoSphere Streams (our Big Data Velocity engine) without any work at all, you just deploy the compiled AOG file. Wow! Platform.
The other challenge is just the volume and speed in which you have to process events. IBM invented our streaming products with the US government – and it can scale. For example, one of our clients analyzes and correlates over 5M market messages a second to execute algorithmic option trades with average latency of 50 microseconds.
The point is that this is not CEP; this is not 1 or 2 servers with 10-20,000 events a second. CEP can be a style or a technology.
You need to be able to do the style, but you need a technology platform too. If you asked me what is one of the biggest things IBM has done in the Big Data space, it is flattening the technical challenge to perform Big Data analytics on data in motion.

Q6. In your opinion, is there a technology which is best suited to build a Big Data Analytics Data Platform? If yes, which one?

Paul C. Zikopoulos: Well you say the word platform, and that’s going to imply a number of technologies. Right?
When I get asked this question, I refer to my Big Data Platform Manifesto, this is what you’re going to need to form a Big Data platform. Many people think big data is about Hadoop technology. It is and it isn’t. Its about a lot more than Hadoop.
One of the key requirements is to understand and navigate federated sources of big data – to discover data in place.
New technology has emerged that discovers, indexes, searches, and navigates diverse sources of big data. Of course big data is also about Hadoop. Hadoop is a collection of open source capabilities.
Two of the most prominent ones are Hadoop Distributed File System (HDFS) for storing a variety of information, and MapReduce – a parallel processing engine.
Data warehouses also manage big data- the volume of structured data is growing quickly. The ability to run deep analytic queries on huge volumes of structured data is a big data problem. It requires massive parallel processing data warehouses and purpose-built appliances for deep analytics.
Big data isn’t just at rest – it’s also in motion. Streaming data represents an entirely different big data problem – the ability to quickly analyze and act upon data while its still moving. This new technology opens a world of possibilities – from processing volumes of data that were just not practical to store, to detecting insight and responding quickly.
As much of the worlds big data is unstructured and in textual content, text analytics is a critical component to analyze and derive meaning from text.
And finally, integration and governance technology – ETL, data quality, security, MDM, and lifecycle management. Integration and governance technology establishes the veracity of big data, and is critical in determining whether information is trusted.
Finally, consumability, characteristics here include such items as being able to declare what you want done, not how to do it, expert integrated systems, deployment patterns, and so on.

So if you wanted a short answer a Big Data platform needs to be consumable, governable, give the opportunity for analytics in-motion, at rest (in an EDW AND things like Hadoop), discovery and index Big Data, and finally, provide the ability to analyze unstructured data.

Notice I didn’t mention one IBM product above; you can piece together a platform with a mash of vendors if you want; if you start to look into what IBM is doing, and although I’m bias and work there, I think you will find we have a true Big Data platform.

Q6. Does it make sense in your opinion to virtualize Hadoop?

Paul C. Zikopoulos: It can. It’s going to depend on the use case right? I see a lot of efforts by EMC in that area and that’s cool. Of course the Cloud and Hadoop kind of go hand and hand. I think this space is growing by leaps and bounds…fun to watch.

Q7. What is your opinion on the evolution of Hadoop?

Paul C. Zikopoulos: It’s just that – an evolution. I think that innovation is going to deliver more and more of what enterprises need from a ‘hardening’ aspect as time goes on. Hadoop 2.0 is a big step forward for availability. It’s out there yet now, but not ready for production in my humble opinion (although some vendors are shipping it, their documentation tells you it’s not ready for production). Next version of MapReduce (Yarn) and making Hive really fast (Tez) are also part of the evolution, stay close here, it’s changing fast!
That’s the best part of community. Now if you look at most of the vendors in this space, many are getting distracted and working on non-Hadoop’ish things to help Hadoop, and that’s fine too. We’re on a good path here.
A lot of vendors here are and more popping up all the time (like Intel just announced their own distribution). At some point, I think there will be a consolidated of distros out there, but with the hype around it right now, it will continue to evolve.
For example, it’s becoming more than just a MapReduce processing areas. Right? Lots of technologies are storing data in Hadoop’s HDFS, but bypassing MapReduce. So I find the file system key to the evolution.

Q8. Can In-Memory Data Management play a significant role for Big Data Analytics? If yes, how?

Paul C. Zikopoulos: I think it’s essential, but in a Big Data world, it would seem that the amount of data we are storing – at least right now – is proportionally bigger than the amount we can get into memory at a cost effective rate.
So in-memory needs to harmoniously live with the database. If you look at what we did with BLU Acceleration and DB2, we did just that.
In-memory columnar and typical relational tables live side by side in the same database kernel.
You can work with both structures together, in the same memory structures, queries, and so on.

When you can’t fit all the columns into memory, performance either falls off the cliff, or worse! Could crash the system.

From an analytics side, BLU Acceleration allows you to run queries faster, amazingly faster. That’s going to get more iterations of queries, analytics and what not. It’s not for everything, but if you can help my reports run faster, that’s cool. So imagine you find in a Discovery Zone powered by a Hadoop engine some interesting pieces of information, pulling that out and packing it into an in-memory structure and surfacing it to the enterprise is going to be pretty cool

Q9. What about elastic computing in the Cloud? How does it relate to Big Data Analytics?

Paul C. Zikopoulos: This is pretty important because I need the utility-like nature of a Hadoop cluster, without the capital investment. Time to analytics is the benefit here. After all, if you’re a start-up analytics firm seeking venture capital funding, do you really walk into to your investor and ask for millions to set up a cluster; you’ll get kicked out the door.
No, you go to Racksapce or Amazon, swipe a card, and get going. IBM is there with its Hadoop clusters (private and public) and you’re looking at clusters that cost as low as $0.60 US an hour.
I think at one time I costed out a 100 node Hadoop cluster for an hour and it was like $34US – and the price has likely gone down. What’s more, your cluster will be up and running in 30 minutes. So on-premise or off-premise Cloud is key for these environments.

___________________________
Paul C. Zikopoulos, B.A., M.B.A., is the Director of Technical Professionals for IBM Software Group’s Information Management division and additionally leads the World Wide Competitive Database and Big Data Technical Sales Acceleration teams.
Paul is an award winning writer and speaker with more than 19 years of experience in Information Management.
Paul is seen as a global expert in Big Data and database. He was picked by SAP as one of its “Top 50 Big Data Twitter Influencers”, named by BigData Republic to its “Top 100 Most Influential” list, Technopedia listed him a “A Big Data Expert to Follow”, and he was consulted on Big Data by the popular TV show “60 Minutes”.
Paul has written more than 350 magazine articles and 16 books, some of which include “Harness the Power of Big Data”, “Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data”, “Warp Speed, Time Travel, Big Data, and More: DB2 10 New Features”, “DB2 pureScale: Risk Free Agile Scaling”, “DB2 Certification for Dummies”, “DB2 for Dummies”, and more.
In his spare time, he enjoys all sorts of sporting activities, including running with his dog Chachi, avoiding punches in his MMA training, and trying to figure out the world according to Chloë—his daughter.

–On Pivotal HD. Interview with Scott Yara and Florian Waas. April 22, 2013

–On Big Data Velocity. Interview with Scott Jarr. January 28, 2013

Resources

– Harness the Power of Big Data The IBM Big Data Platform.
Paul C. Zikopoulos, Dirk deRoos, Krishnan Parasuraman, Thomas Deutsch, David Corrigan,James Giles, Chris Eaton.
Book, Copyright © 2013 by The McGraw-Hill Companies.
Download Book (.PDF 250 pages)

– Warp Speed, Time Travel, Big Data, and More. DB2 10 for Linux, UNIX, and Windows New Features.
Paul Zikopoulos, George Baklarz, Matt Huras, Walid Rjaibi, Dale McInnis, Matthias Nicola, Leon Katsnelson.
Book, Copyright © 2012 by The McGraw-Hill Companies.
Download book (.PDF 217 pages)

– Understanding Big Data Analytics for Enterprise Class Hadoop and Streaming Data.
Paul C. Zikopoulos, Chris Eaton, Dirk deRoos, Thomas Deutsch, George Lapis,
Book, Copyright © 2012 by The McGraw-Hill Companies.
Download book (.PDF 142 pages)

– Follow ODBMS.org on Twitter: @odbmsorg

Two Cons against NoSQL. Part I.

Roberto V. Zicari — Tue, 30 Oct 2012 16:57:52 +0000

Two cons against NoSQL data stores read like this:

1. It’s very hard to move data out from one NoSQL to some other system, even other NoSQL. There is a very hard lock in when it comes to NoSQL. If you ever have to move to another database, you have basically to re-implement a lot of your applications from scratch.

2. There is no standard way to access a NoSQL data store.
All tools that already exist for SQL has to be recreated to each of the NoSQL databases. This means that it will always be harder to access data in NoSQL than from SQL. For example, how many NoSQL databases can export their data to Excel? (Something every CEO wants to get sooner or later).

These are valid points. I wanted to start a discussion on this.
This post is the first part of a series of feedback I received from various experts, with obviously different point of views.

I plan to publish Part II, with more feedback later on.

You are welcome to contribute to the discussion by leaving a comment if you wish!

RVZ
————

1. It’s very hard to move the data out from one NoSQL to some other system, even other NoSQL.

Dwight Merriman ( founder 10gen, maker of MongoDB): I agree it is still early and I expect some convergence in data models over time. btw I am having conversations with other nosql product groups about standards but it is super early so nothing is imminent.
50% of the nosql products are JSON-based document-oriented databases.
So that is the greatest commonality. Use that and you have some good flexibility and JSON is standards-based and widely used in general which is nice. MongoDB, couchdb, riak for example use JSON. (Mongo internally stores “BSON“.)

So moving data across these would not be hard.

1. If you ever have to move to another database, you have basically to re-implement a lot of your applications from scratch.

Dwight Merriman: Yes. Once again I wouldn’t assume that to be the case forever, but it is for the present. Also I think there is a bit of an illusion of portability with relational. There are subtle differences in the SQL, medium differences in the features, and there are giant differences in the stored procedure languages.
I remember at DoubleClick long ago we migrated from SQL Server to Oracle and it was a HUGE project. (We liked SQL server we just wanted to run on a very very large server — i.e. we wanted vertical scaling at that time.)

Also: while porting might be work, given that almost all these products are open source, the potential “risks” of lock-in I think drops an order of magnitude — with open source the vendors can’t charge too much.

Ironically people are charged a lot to use Oracle, and yet in theory it has the portability properties that folks would want.

I would anticipate SQL-like interfaces for BI tool integration in all the products in the future. However that doesn’t mean that is the way one will write apps though. I don’t really think that even when present those are ideal for application development productivity.

1. For example, how many noSQL databases can export their data to excel? (Something every CEO wants to get sooner or later).

Dwight Merriman: So with MongoDB what I would do would be to use the mongoexport utility to dump to a CSV file and then load that into excel. That is done often by folks today. And when there is nested data that isn’t tabular in structure, you can use the new Aggregation Framework to “unwind” it to a more matrix-like format for Excel before exporting.

You’ll see more and more tooling for stuff like that over time. Jaspersoft and Pentaho have mongo integration today, but the more the better.

John Hugg (VoltDB Engineering): Regarding your first point about the issue with moving data out from one NoSQL to some other system, even other NoSQL.
There are a couple of angles to this. First, data movement itself is indeed much easier between systems that share a relational model.
Most SQL relational systems, including VoltDB, will import and export CSV files, usually without much issue. Sometimes you might need to tweak something minor, but it’s straightforward both to do and to understand.

Beyond just moving data, moving your application to another system is usually more challenging. As soon as you target a platform with horizontal scalability, an application developer must start thinking about partitioning and parallelism. This is true whether you’re moving from Oracle to Oracle RAC/Exadata, or whether you’re moving from MySQL to Cassandra. Different target systems make this easier or harder, from both development and operations perspectives, but the core idea is the same. Moving from a scalable system to another scalable system is usually much easier.

Where NoSQL goes a step further than scalability, is the relaxing of consistency and transactions in the database layer. While this simplifies the NoSQL system, it pushes complexity onto the application developer. A naive application port will be less successful, and a thoughtful one will take more time.
The amount of additional complexity largely depends on the application in question. Some apps are more suited to relaxed consistency than others. Other applications are nearly impossible to run without transactions. Most lie somewhere in the middle.

To the point about there being no standard way to access a NoSQL data store. While the tooling around some of the most popular NoSQL systems is improving, there’s no escaping that these are largely walled gardens.
The experience gained from using one NoSQL system is only loosely related to another. Furthermore, as you point out, non-traditional data models are often more difficult to export to the tabular data expected by many reporting and processing tools.

By embracing the SQL/Relational model, NewSQL systems like VoltDB can leverage a developer’s experience with legacy SQL systems, or other NewSQL systems.
All share a common query language and data model. Most can be queried at a console. Most have familiar import and export functionality.
The vocabulary of transactions, isolation levels, indexes, views and more are all shared understanding. That’s especially impressive given the diversity in underlying architecture and target use cases of the many available SQL/Relational systems.

Finally, SQL/Relational doesn’t preclude NoSQL-style development models. Postgres, Clustrix and VoltDB support MongoDB/CouchDB-style JSON Documents in columns. Functionality varies, but these systems can offer features not easily replicated on their NoSQL inspiration, such as JSON sub-key joins or multi-row/key transactions on JSON data

1. It’s very hard to move the data out from one NoSQL to some other system, even other NoSQL. There is a very hard lock in when it comes to NoSQL. If you ever have to move to another database, you have basically to re-implement a lot of your applications from scratch.

Steve Vinoski (Architect at Basho): Keep in mind that relational databases are around 40 years old while NoSQL is 3 years old. In terms of the technology adoption lifecycle, relational databases are well down toward the right end of the curve, appealing to even the most risk-averse consumer. NoSQL systems, on the other hand, are still riding the left side of the curve, appealing to innovators and the early majority who are willing to take technology risks in order to gain advantage over their competitors.

Different NoSQL systems make very different trade-offs, which means they’re not simply interchangeable. So you have to ask yourself: why are you really moving to another database? Perhaps you found that your chosen database was unreliable, or too hard to operate in production, or that your original estimates for read/write rates, query needs, or availability and scale were off such that your chosen database no longer adequately serves your application.
Many of these reasons revolve around not fully understanding your application in the first place, so no matter what you do there’s going to be some inconvenience involved in having to refactor it based on how it behaves (or misbehaves) in production, including possibly moving to a new database that better suits the application model and deployment environment.

2. There is no standard way to access a NoSQL data store.
All tools that already exists for SQL has to recreated to each of the NoSQL databases. This means that it will always be harder to access data in NoSQL than from SQL. For example, how many noSQL databases can export their data to Excel? (Something every CEO wants to get sooner or later).

Steve Vinoski: Don’t make the mistake of thinking that NoSQL is attempting to displace SQL entirely. If you want data for your Excel spreadsheet, or you want to keep using your existing SQL-oriented tools, you should probably just stay with your relational database. Such databases are very well understood, they’re quite reliable, and they’ll be helping us solve data problems for a long time to come. Many NoSQL users still use relational systems for the parts of their applications where it makes sense to do so.

NoSQL systems are ultimately about choice. Rather than forcing users to try to fit every data problem into the relational model, NoSQL systems provide other models that may fit the problem better. In my own career, for example, most of my data problems have fit the key-value model, and for that relational systems were overkill, both functionally and operationally. NoSQL systems also provide different tradeoffs in terms of consistency, latency, availability, and support for distributed systems that are extremely important for high-scale applications. The key is to really understand the problem your application is trying to solve, and then understand what different NoSQL systems can provide to help you achieve the solution you’re looking for.

Cindy Saracco (IBM Senior Solutions Architect) (these comments reflect my personal views and not necessarily those of my employer, IBM) :
Since NoSQL systems are newer to market than relational DBMSs and employ a wider range of data models and interfaces, it’s understandable that migrating data and applications from one NoSQL system to another — or from NoSQL to relational — will often involve considerable effort.
However, I’ve heard more customer interest around NoSQL interoperability than migration. By that, I mean many potential NoSQL users seem more focused on how to integrate that platform into the rest of their enterprise architecture so that applications and users can have access to the data they need regardless of the underlying database used.

2. There is no standard way to access a NoSQL data store.
All tools that already exists for SQL has to recreated to each of the NoSQL databases. This means that it will always be harder to access data in NoSQL than from SQL. For example, how many noSQL databases can export their data to excel? (Something every CEO wants to get sooner or later).

Cindy Saracco: From what I’ve seen, most organizations gravitate to NoSQL systems when they’ve concluded that relational DBMSs aren’t suitable for a particular application (or set of applications). So it’s probably best for those groups to evaluate what tools they need for their NoSQL data stores and determine what’s available commercially or via open source to fulfill their needs.
There’s no doubt that a wide range of compelling tools are available for relational DBMSs and, by comparison, fewer such tools are available for any given NoSQL system. If there’s sufficient market demand, more tools for NoSQL systems will become available over time, as software vendors are always looking for ways to increase their revenues.

As an aside, people sometimes equate Hadoop-based offerings with NoSQL.
We’re already seeing some “traditional” business intelligence tools (i.e., tools originally designed to support query, reporting, and analysis of relational data) support Hadoop, as well as newer Hadoop-centric analytical tools emerge.
There’s also a good deal of interest in connecting Hadoop to existing data warehouses and relational DBMSs, so various technologies are already available to help users in that regard . . . . IBM happens to be one vendor that’s invested quite a bit in different types of tools for its Hadoop-based offering (InfoSphere BigInsights), including a spreadsheet-style analytical tool for non-programmers that can export data in CSV format (among others), Web-based facilities for administration and monitoring, Eclipse-based application development tools, text analysis facilities, and more. Connectivity to relational DBMSs and data warehouses are also part IBM’s offerings. (Anyone who wants to learn more about BigInsights can explore links to articles, videos, and other technical information available through its public wiki. )

– On Eventual Consistency– An interview with Justin Sheehy. by Roberto V. Zicari, August 15, 2012

– Hadoop and NoSQL: Interview with J. Chris Anderson by Roberto V. Zicari