ODBMS Industry Watch – Trends and Information on Big Data, New Data Management Technologies, Data Science and Innovation. http://www.odbms.org/blog

On the new developments in Apache Spark and Hadoop. Interview with Amr Awadallah
Mon, 13 Mar 2017
http://www.odbms.org/blog/2017/03/on-the-new-developments-in-apache-spark-and-hadoop-interview-with-amr-awadallah/

“What this Big Data movement is about is using data to actually change our businesses in real-time (versus showing the business leaders a report that they make a decision based on).”–Amr Awadallah

I have interviewed Amr Awadallah, Chief Technology Officer at Cloudera.  
The main topics of the interview are: the new developments in the Apache Spark 2.0 Beta and the Hadoop 3.0.0-alpha1 release; the lessons learned from Amr’s experience of using Hadoop at Yahoo!; and the business problems that the world’s leading organisations face.

RVZ

Q1. Before Cloudera, you served as Vice President of Product Intelligence Engineering at Yahoo!, and ran one of the very first organisations to use Hadoop for data analysis and business intelligence. What are the main lessons you learned in that period?

Amr Awadallah: Couple of things. First, I learned that Hadoop is capable of solving all the business intelligence problems that I had at Yahoo.
Namely:
(1) our systems weren’t scaling fast enough (we needed to cut down transformation times from hours to minutes),
(2) our systems weren’t economical on a $/TB basis, thus making it hard to retain valuable data for longer time periods, and (3) we needed new methods to be able to store and analyze semi-structured (e.g. logs) and unstructured data (e.g. social media).
By implementing Hadoop in our team we saw first hand how it can address all these problems. The second lesson that I learned was that Hadoop, back then, was very rough to deploy and program against (it took us many months to deploy it and reprogram our transformations to run on it). It was these lessons that made it clear that there is room for a startup to focus on Hadoop since (1) it was solving very real data problems that many organizations would face, and (2) it needed a lot of polish to make it work smoothly, securely, and reliably within the enterprise.

Q2. In 2008 you founded Cloudera together with Mike Olson (Oracle), Jeff Hammerbacher (Facebook) and Christophe Bisciglia (Google). What was your main motivation at that time?

Amr Awadallah: Pretty much to do what I describe above, we wanted to make the Hadoop technology easy to use for organizations. That included: (1) creating a distribution for Hadoop that bundles all the necessary open-source projects that make it work (we call that CDH, short for Cloudera Distribution for Apache Hadoop). (2) We also created a number of proprietary system management, security, and meta-data management tools around CDH to make it easier for organizations to deploy and operate Hadoop in production.

Q3. What are the typical challenging business problems that the world’s leading organisations have?

Amr Awadallah: The technology we provide is very powerful and can be used to solve many problems across many industries, but we see four common themes. The first is simply using Hadoop as a faster, bigger, cheaper system for business intelligence and data analytics; that is, a lot of organizations just use us to do things they have been doing already, only in a more economically scalable way.
The second use case is around deeper understanding of customers, i.e. moving away from segmenting all customers into a number of predefined buckets and instead creating dynamic micro-segments that address each customer in a more precise way (thus reducing false positives).
The third use case is about using data to build better products and services, and this use case is catalyzed by the internet-of-things. Thanks to smart sensors we are able to measure the real world better than ever before, so this use case is about taking all that data and leveraging it either to enhance our current product/service offerings or to build entirely new ones.
The fourth use case is about reducing business risk, and it manifests itself in a number of different sub-cases depending on the industry. For example, cyber-security is one of the key ways to reduce risk, and we have an open source project co-developed with Intel, called Apache Spot, which organizations can use to collect all their network flow data and then use Spark machine learning algorithms to detect the anomalies in that data. Anti-money laundering and fraud detection are other ways that our banking customers employ our platform to reduce risk within their businesses. Similarly, our insurance industry customers use our system to detect fraudulent claims, etc.

Q4. Can they be solved by analysing data? Can you give us some examples of how the use of advanced analytics drive business decisions?

Amr Awadallah: Yes, all the problems mentioned above can be solved with data. I want to highlight though that this isn’t necessarily about business decisions, which is what the Business Intelligence movement was about (we just help make that cheaper and faster). What this Big Data movement is about is using data to actually change our businesses in real-time (versus showing the business leaders a report that they make a decision based on).
One of my favorite examples is a solution that one of our customers built to give voice to premature babies in neonatal intensive care units. They analyze the signals coming from the baby (sounds, blood pressure, heart rate, temperature, a few brain signals), and based on that a message appears on the monitor above the infant telling the nurse whether the baby is hungry, distressed from too much noise or light, etc.
That is really what we mean by using data to create new products and services that weren’t possible before (and not just reports/dashboards).

Q5. Graphs are important. Is it possible to do scalable graph analytics? If yes, how?

Amr Awadallah: Graphs are indeed important, and a lot of our customer use cases trace back to that (not just social media analytics; anti-money laundering, for example, requires analyzing relationships between many financial accounts to detect bad behavior, and the same is true for cyber-security applications). I think scalability depends a fair bit on what’s being analyzed and what we mean by scalable. But for most practical purposes I would say Spark’s GraphX is good enough. For example, you can compute PageRank fairly efficiently and scalably on a cluster using GraphX.
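As a rough illustration (a minimal sketch rather than anything Cloudera-specific; the input path is a placeholder), computing PageRank over an edge list with GraphX takes only a few lines of Scala:

```scala
import org.apache.spark.graphx.GraphLoader
import org.apache.spark.sql.SparkSession

object PageRankSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("PageRankSketch").getOrCreate()
    val sc = spark.sparkContext

    // Load an edge list (one "srcId dstId" pair per line) into a graph.
    // The HDFS path below is a placeholder.
    val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/followers.txt")

    // Run PageRank until the scores converge within the given tolerance.
    val ranks = graph.pageRank(0.0001).vertices

    // Print the ten highest-ranked vertices.
    ranks.sortBy(_._2, ascending = false).take(10).foreach(println)

    spark.stop()
  }
}
```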

Q6. Data security is increasingly important. The risk is due to the growing number of device endpoints. What solutions exist to minimise such risk?

Amr Awadallah: A comprehensive enterprise data security strategy seeks to mitigate the risks presented by a growing number of potentially compromised endpoints connecting to corporate networks. Endpoint security will enable one or all of the following preventative controls:
The first is policy-based enforcement of endpoint security configuration prior to granting an endpoint access to network-based corporate assets. This ensures that any endpoint connected to corporate networks meets minimum requirements for endpoint security configuration.
The second measure is endpoint based anti-malware software (the existence of which may be a policy requirement to connect to the network per the first measure). Anti-malware prevents malicious code from infecting endpoints by monitoring for changes to system configuration and unusual activity or processes.
The third measure is endpoint encryption of corporate data on hard drives, folders and even removable media.
As mentioned above we also collaborate with Intel on Apache Spot, which tracks network flow patterns to detect anomalous communication behavior between different devices (including end point devices). Apache Spot just recently won InfoWorld 2017 Tech of the Year Award. Other advanced analytics security partners we closely work with are: CounterTack, Securonix, Niara, and Jask.

Q7. You recently announced the availability of an Apache Spark 2.0 Beta release for users of the Cloudera platform. How does it work? And how does it differ from the Hadoop-based data platform?

Amr Awadallah: First, at a meta-level, Hadoop (MapReduce specifically) was very good at achieving scalable computation by spreading jobs across many CPU cores and hard disk spindles. That said, MapReduce wasn’t very efficient in how it leveraged memory to optimize the performance of data processing pipelines that have many stages or iterations.
The main power of Spark, which made it take over from MapReduce, was how it truly leveraged memory to achieve better performance in deep or iterative data pipelines. That, coupled with a simpler developer API, is why Spark displaced MapReduce so quickly.
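To make the contrast concrete, here is a minimal illustrative sketch (not Cloudera code; the input path is a placeholder) of the iterative pattern Spark handles well: the dataset is parsed once, cached in memory, and then reused across passes instead of being re-read from disk each time, which is where MapReduce-style pipelines lose time.

```scala
import org.apache.spark.sql.SparkSession

object IterativeCacheSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("IterativeCacheSketch").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input: one numeric value per line.
    val points = sc.textFile("hdfs:///data/points.txt").map(_.toDouble)

    // Keep the parsed dataset in memory so each iteration below reads it
    // from RAM instead of re-reading and re-parsing it from disk.
    points.cache()

    var estimate = 0.0
    for (i <- 1 to 10) {
      // Each pass reuses the cached RDD.
      val mean = points.sum() / points.count()
      estimate = (estimate + mean) / 2.0
      println(s"iteration $i: estimate = $estimate")
    }

    spark.stop()
  }
}
```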
Most of our new customer implementations for data processing or data science tend to be in Spark these days, versus MapReduce.
I should clarify however that this doesn’t mean that Hadoop is dead, as some say. Apache Hadoop comprises three key subsystems: (1) MapReduce for computation, (2) YARN for resource scheduling, and (3) HDFS for storage. Spark only replaces MapReduce; we still rely heavily on both YARN and HDFS.

That said, the most notable features in Apache Spark 2.0 are:

1) Dataset API: a new API that represents the distributed collections of objects processed by Spark’s execution engine. It is an extension of Spark’s Dataframe API, and improves upon it by providing type-safe, object-oriented programming interfaces. Users can now write user-defined functions and lambda functions that provide compile-time type safety. With the Dataset API, users benefit from optimized operations (like sort, join, hash, etc.) in the SparkSQL engine, while also getting compile-time type safety for user-defined functions.
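A minimal sketch of the Dataset API in Scala, assuming Spark 2.0 (the input path and the Customer schema are illustrative placeholders):

```scala
import org.apache.spark.sql.SparkSession

// A typed record; the compiler checks field names and types at build time.
case class Customer(id: Long, country: String, spend: Double)

object DatasetApiSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("DatasetApiSketch").getOrCreate()
    import spark.implicits._

    // Hypothetical JSON input with id/country/spend fields.
    val customers = spark.read.json("hdfs:///data/customers.json").as[Customer]

    // Lambdas over Customer objects are type-checked at compile time,
    // while the query still runs through the SparkSQL engine's optimized operators.
    val bigSpenders = customers
      .filter(c => c.spend > 1000.0)
      .groupByKey(c => c.country)
      .count()

    bigSpenders.show()
  }
}
```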

2) Model & Pipeline Persistence in Spark’s ML library: Machine learning Pipelines built with Spark’s ML library can now be serialized to a file and read back in.
The ability to save and reload these pipelines makes it easy for users to perform version control on the pipelines and safely distribute the pipelines. This helps in operationalizing them in production systems.
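A sketch of what save and reload look like with the Spark 2.0 ML persistence API (the toy text-classification pipeline and the paths are placeholders):

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object PipelinePersistenceSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("PipelinePersistence").getOrCreate()

    // Hypothetical training data with "text" and "label" columns.
    val training = spark.read.parquet("hdfs:///data/training.parquet")

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)

    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
    val model = pipeline.fit(training)

    // Serialize the fitted pipeline to a file ...
    model.write.overwrite().save("hdfs:///models/spam-v1")

    // ... and read it back later, e.g. in a production scoring job.
    val restored = PipelineModel.load("hdfs:///models/spam-v1")
    restored.transform(training).show(5)
  }
}
```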

3) Structured Streaming: a new stream processing API and engine that provides SQL-like abstractions for authoring operations on data streams, and also improves performance by using the SparkSQL engine for processing the data streams. However, this is still an experimental API and not yet ready for production usage.
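The word-count example gives a feel for the API; this is a sketch against the Spark 2.0 Structured Streaming interfaces, using the toy socket source for local experimentation:

```scala
import org.apache.spark.sql.SparkSession

object StructuredStreamingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("StreamingWordCount").getOrCreate()
    import spark.implicits._

    // Read a stream of lines from a socket (a toy source for testing).
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // The same DataFrame/Dataset operators used in batch jobs apply here;
    // the SparkSQL engine plans and runs the streaming query.
    val counts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()

    // Continuously print the running word counts to the console.
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```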

Besides the above 3 notable enhancements, there are a bunch of performance and scalability improvements across the board.

Q8. Apache Impala vs. Amazon Redshift: How Does Redshift Compare to Impala?

Amr Awadallah: Apache Impala is an analytic database engine architecturally designed to perform high-performance, highly concurrent SQL analytics on scalable, open data platforms like Hadoop’s HDFS and Amazon S3.
Impala decouples data storage from compute and lets users query data without having to move/load data specifically into an Impala storage-engine (it doesn’t have one). This architectural difference uniquely enables Impala to deliver a more flexible Business Intelligence experience than traditional database architectures like Redshift (which requires pre-loading the data).

Some of the key benefits of the Impala approach include:

* On-demand resources that are immediately ready to query existing S3 data without loading to a different data silo
* Ability to elastically grow/shrink clusters as needed due to decoupled storage and compute
* More predictable, multi-tenant isolation due to the ability to have multiple Impala clusters sharing a common S3 data repository
* Ability to share common data not only amongst Impala clusters, but also with any application that runs on cloud-native S3 storage (for example, you can have both Apache Impala and Apache Spark run against the same data asset in S3, whereas Apache Spark cannot easily access the data stored in Redshift; it has to go through SQL first; see the sketch after this list)
* Greater flexibility to explore new use cases, analytics, and data by directly querying S3 without rigid traditional data models and ETL
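As an illustrative sketch of the shared-data point above (bucket, path, and column names are placeholders, not a specific customer setup), a Spark job can read the same S3-resident Parquet files that an Impala cluster exposes as an external table, with no copy or load step:

```scala
import org.apache.spark.sql.SparkSession

object SharedS3DataSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("SharedS3Data").getOrCreate()

    // The same Parquet files an Impala cluster queries as an external table
    // can be read directly by Spark from S3.
    val sales = spark.read.parquet("s3a://example-bucket/warehouse/sales/")

    sales.createOrReplaceTempView("sales")
    spark.sql(
      "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
    ).show()
  }
}
```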

Not only does Impala deliver this additional flexibility, it does so at greater cost-performance and scalability compared to Redshift. See the following benchmark for data on that.

That said, Redshift’s sweet spot is a different target: the smaller data mart. Most Redshift installations are in the dozens-of-nodes range, where Redshift’s limitations in scalability, elasticity, and flexibility, and its requirement to maintain separate copies of data, are less critical.

Q9. What is Apache Kudu, and why is it relevant for Impala Users?

Amr Awadallah: Historically we had two storage engines in our distribution: (1) HDFS, which is optimized for high-throughput analytics but doesn’t support updates/inserts, and (2) HBase, which is optimized for low-latency updates/inserts but isn’t good for high-throughput queries. To build a proper data warehouse or time-series analytics system, you typically still need to make updates/inserts, and that is why we created Apache Kudu.

Kudu is a new storage system that combines the benefits of both HDFS and HBase into one: it allows for low-latency updates/inserts, but also supports high-throughput analytical queries (i.e. fast analytics on fast moving data).
Unlike HDFS, Kudu is not a file system; it is a record-based system, so the unit of storage is a record as opposed to a file. This allows Kudu to unlock Impala for real-time streaming applications that were not possible with HDFS.
In HDFS the data would only be visible to Impala after we finish closing the file, which typically happens after a large number of records have accumulated (that adds latency between when records are written and when they become visible to the analytical engine). With Kudu, as soon as a record is written it is immediately visible to the Impala analytical engine. Finally, just like HDFS and HBase, the Kudu storage engine is fully integrated with our entire stack, not just Impala.
For example, you can also use Apache Spark for machine-learning jobs directly against Kudu.
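As an illustrative sketch (assuming the kudu-spark connector is available on the cluster; the master address, table name, and column are placeholders), reading a Kudu table from Spark looks roughly like this:

```scala
import org.apache.spark.sql.SparkSession

object KuduSparkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("KuduSparkSketch").getOrCreate()

    // Read a Kudu table into a DataFrame via the kudu-spark connector.
    val metrics = spark.read
      .format("org.apache.kudu.spark.kudu")
      .option("kudu.master", "kudu-master.example.com:7051")
      .option("kudu.table", "sensor_readings")
      .load()

    // Records written to Kudu by a streaming ingest job are visible here
    // immediately -- no file close or batch load step is required.
    metrics.groupBy("sensor_id").count().show()
  }
}
```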

Q10. The Apache Hadoop project recently announced its 3.0.0-alpha1 release. What is it?

Amr Awadallah: HDFS Erasure Coding is really the main exciting new feature in Hadoop 3. Traditionally HDFS required three replicas, by default, for every data block to achieve durability, concurrent performance, and availability. Using erasure coding techniques, HDFS in Hadoop 3 allows us to significantly reduce the storage overhead from 3x (i.e. 200%) to just 20% extra bits for parity. This will allow us to achieve the same durability benefits of 3x replication, but it comes at the cost of potentially lower concurrent performance (when more than one job is trying to access the same block at the same time) and lower availability resilience in the face of top-of-rack switch failures (less of an issue these days).
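As a back-of-the-envelope way to see the storage arithmetic (the exact parity overhead depends on which erasure-coding policy is configured):

```latex
\[
\text{3-way replication: } \frac{\text{raw bytes stored}}{\text{logical bytes}} = 3
  \;\Rightarrow\; \text{overhead} = 200\%,
\qquad
\text{RS}(k, m): \ \text{overhead} = \frac{m}{k} \times 100\%
\]
```

where RS(k, m) is a Reed-Solomon scheme with k data blocks and m parity blocks per stripe.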

Other cool additions are ATS v2 and classpath isolation, which you can read more about here.

Q11. What is the roadmap ahead for Cloudera Enterprise?

Amr Awadallah: We don’t discuss details of our product roadmap publicly, but there are three guiding themes for us in 2017: The first theme is fast-analytics on fast-moving data (which I covered above in regards to Kudu).
The second theme is cloud: making Cloudera Enterprise work better in cloud environments, and making it easier to move workloads (and skill sets) from on-premise clusters to transient cloud clusters in AWS, Azure, and/or Google Cloud.
The third theme is simplifying data-science and machine learning development, especially reducing the time from when a new algorithm is developed to how it can be deployed into production (stay tuned for more on that front).
——————————
Amr Awadallah, Ph.D. Chief Technology Officer, Cloudera
Before co-founding Cloudera in 2008, Amr (@awadallah) was an Entrepreneur-in-Residence at Accel Partners. Prior to joining Accel he served as Vice President of Product Intelligence Engineering at Yahoo!, and ran one of the very first organizations to use Hadoop for data analysis and business intelligence. Amr joined Yahoo after they acquired his first startup, VivaSmart, in July of 2000. Amr holds Bachelor’s and Master’s degrees in Electrical Engineering from Cairo University, Egypt, and a Doctorate in Electrical Engineering from Stanford University.

Resources

Download Page for Apache Spark™

Apache Impala supported by Cloudera Enterprise

DATA-X: Videobook- 8 short videos introduce query analytics for Apache Hadoop

A package that allows R developers to use Hadoop HBase

Book: Big Data Analytics with Spark

Related Posts

Streaming Analytics for Chain Monitoring. By Natalino Busa, Head of Data Science at Teradata — Thursday, ODBMS.org January 12, 2017

Five Challenges to IoT Analytics Success. By Dr. Srinath Perera. ODBMS.org SEPTEMBER 23, 2016

Next-Generation Genomics Analysis with Apache Spark. by Jason Bailey. ODBMS.org Thursday, June 30th, 2016

Supporting the Fast Data Paradigm with Apache Spark. By Stephen Dillon, Data Architect, Schneider Electric. ODBMS.org, 23 APR, 2016

– The new series of Q&A with Leading Data Scientists– ODBMS.org:
Part II
Part I

Follow us on Twitter: @odbmsorg

##

Big Data and The Great A.I. Awakening. Interview with Steve Lohr
Mon, 19 Dec 2016
http://www.odbms.org/blog/2016/12/big-data-and-the-great-a-i-awakening-interview-with-steve-lohr/

“I think we’re just beginning to grapple with implications of data as an economic asset” –Steve Lohr.

My last interview for this year is with Steve Lohr. Steve Lohr has covered technology, business, and economics for the New York Times for more than twenty years. In 2013 he was part of the team awarded the Pulitzer Prize for Explanatory Reporting. We discussed Big Data and how it influences the new Artificial Intelligence awakening.

Wishing you all the best for the Holiday Season and a healthy and prosperous New Year!

RVZ

Q1. Why do you think Google (TensorFlow) and Microsoft (Computational Network Toolkit) are open-sourcing their AI software?

Steve Lohr: Both Google and Microsoft are contributing their tools to expand and enlarge the AI community, which is good for the world and good for their businesses. But I also think the move is a recognition that algorithms are not where their long-term advantage lies. Data is.

Q2. What are the implications of that for both business and policy?

Steve Lohr: The companies with big data pools can have great economic power. Today, that shortlist would include Google, Microsoft, Facebook, Amazon, Apple and Baidu.
I think we’re just beginning to grapple with implications of data as an economic asset. For example, you’re seeing that now with Microsoft’s plan to buy LinkedIn, with its personal profiles and professional connections for more than 400 million people. In the evolving data economy, is that an antitrust issue of concern?

Q3. In this competing world of AI, what is more important, vast data pools, sophisticated algorithms or deep pockets?

Steve Lohr: The best answer to that question, I think, came from a recent conversation with Andrew Ng, a Stanford professor who worked at GoogleX, is co-founder of Coursera and is now chief scientist at Baidu. I asked him why Baidu, and he replied there were only a few places to go to be a leader in A.I. Superior software algorithms, he explained, may give you an advantage for months, but probably no more. Instead, Ng said, you look for companies with two things — lots of capital and lots of data. “No one can replicate your data,” he said. “It’s the defensible barrier, not algorithms.”

Q4. What is the interplay and implications of big data and artificial intelligence?

Steve Lohr: The data revolution has made the recent AI advances possible. We’ve seen big improvements in the last few years, for example, in AI tasks like speech recognition and image recognition, using neural network and deep learning techniques. Those technologies have been around for decades, but they are getting a huge boost from the abundance of training data because of all the web image and voice data that can be tapped now.

Q5. Is data science really only a here-and-now version of AI?

Steve Lohr: No, certainly not only. But I do find that phrase a useful way to explain to most of my readers — intelligent people, but not computer scientists — the interplay between data science and AI, and to convey that the rudiments of data-driven AI are already all around us. It’s not — surely not yet — robot armies and self-driving cars as fixtures of everyday life. But it is internet search, product recommendations, targeted advertising and elements of personalized medicine, to cite a few examples.

Q6. Technology is moving beyond increasing the odds of making a sale, to being used in higher-stakes decisions like medical diagnosis, loan approvals, hiring and crime prevention. What are the societal implications of this?

Steve Lohr: The new, higher-stakes decisions that data science and AI tools are increasingly being used to make — or assist in making — are fundamentally different than marketing and advertising. In marketing and advertising, a decision that is better on average is plenty good enough. You’ve increased sales and made more money. You don’t really have to know why.
But the other decisions you mentioned are practically and ethically very different. These are crucial decisions about individual people’s lives. Better on average isn’t good enough. For these kinds of decisions, issues of accuracy, fairness and discrimination come into play.
That, I think, argues for two things. First, some sort of auditing tool; the technology has to be able to explain itself, to explain how a data-driven algorithm came to the decision or recommendation that it did.
Second, I think it argues for having a “human in the loop” for most of these kinds of decisions for the foreseeable future.

Q7. Will data analytics move into the mainstream of the economy (far beyond the well known, born-on-the-internet success stories like Google, Facebook and Amazon)?

Steve Lohr: Yes, and I think we’re seeing that now in nearly every field — health care, agriculture, transportation, energy and others. That said, it is still very early. It is a phenomenon that will play out for years, and decades.
Recently, I talked to Jeffrey Immelt, the chief executive of General Electric, America’s largest industrial company. GE is investing heavily to put data-generating sensors on its jet engines, power turbines, medical equipment and other machines — and to hire software engineers and data scientists.
Immelt said if you go back more than a century to the origins of the company, dating back to Thomas Edison‘s days, GE’s technical foundation has been materials science and physics. Data analytics, he said, will be the third fundamental technology for GE in the future.
I think that’s a pretty telling sign of where things are headed.

—————————–
Steve Lohr has covered technology, business, and economics for the New York Times for more than twenty years and writes for the Times’ Bits blog. In 2013 he was part of the team awarded the Pulitzer Prize for Explanatory Reporting.
He was a foreign correspondent for a decade and served as an editor, and has written for national publications such as the New York Times Magazine, the Atlantic, and the Washington Monthly. He is the author of Go To: The Story of the Math Majors, Bridge Players, Engineers, Chess Wizards, Maverick Scientists, Iconoclasts—the Programmers Who Created the Software Revolution and Data-ism: The Revolution Transforming Decision Making, Consumer Behavior, and Almost Everything Else.
He lives in New York City.

————————–

Resources

Google (TensorFlow): TensorFlow™ is an open source software library for numerical computation using data flow graphs.

Microsoft (Computational Network Toolkit): A free, easy-to-use, open-source, commercial-grade toolkit that trains deep learning algorithms to learn like the human brain.

Data-ism: The Revolution Transforming Decision Making, Consumer Behavior, and Almost Everything Else, by Steve Lohr. 2016, HarperCollins Publishers

Related Posts

Don’t Fear the Robots. By STEVE LOHR. -OCT. 24, 2015-The New York Times, SundayReview | NEWS ANALYSIS

G.E., the 124-Year-Old Software Start-Up. By STEVE LOHR. -AUG. 27, 2016- The New York Times, TECHNOLOGY

Machines of Loving Grace. Interview with John Markoff. ODBMS Industry Watch, Published on 2016-08-11

Recruit Institute of Technology. Interview with Alon Halevy. ODBMS Industry Watch, Published on 2016-04-02

Civility in the Age of Artificial Intelligence, by STEVE LOHR, technology reporter for The New York Times, ODBMS.org

On Artificial Intelligence and Society. Interview with Oren Etzioni, ODBMS Industry Watch.

On Big Data and Society. Interview with Viktor Mayer-Schönberger, ODBMS Industry Watch.

Follow us on Twitter: @odbmsorg

##

New Gartner Magic Quadrant for Operational Database Management Systems. Interview with Nick Heudecker
Wed, 30 Nov 2016
http://www.odbms.org/blog/2016/11/new-gartner-magic-quadrant-for-operational-database-management-systems-interview-with-nick-heudecker/

“It is too soon to call the operational DBMS market a commodity market, but it’s easy to see a future where that is the case.”–Nick Heudecker.

I have interviewed Nick Heudecker, Research Director on Gartner’s Data & Analytics team.
The main topic of the interview is the new Magic Quadrant for Operational Database Management Systems.

RVZ

Q1. You have published the new Magic Quadrant for Operational Database Management Systems (*). How do you define the operational database management system market?

Nick Heudecker: We define a DBMS as a complete software system used to define, create, manage, update and query a database. DBMSs provide interfaces to independent programs and tools that both support and govern the performance of a variety of concurrent workload types. There is no presupposition that DBMSs must support the relational model or that they must support the full set of possible data types in use today. OPDBMSs must include functionality to support backup and recovery, and have some form of transaction durability — although the atomicity, consistency, isolation and durability model is not a requirement. OPDBMSs may support multiple delivery models, such as stand-alone DBMS software, certified configurations, cloud (public and private) images or versions, and database appliances.

Q2. Can you explain the methodology you used for this new Magic Quadrant?

Nick Heudecker: Several of Gartner’s methodologies are public. The Magic Quadrant methodology can be found here.

We use a number of data sources when we’re creating the Magic Quadrant for Operational Database Management Systems.
We survey vendor reference customers and include data from our interactions with Gartner clients. We also consider earlier information and any news about vendors’ products, customers and finances that came to light during the time frame for our analysis.

Once we have the data, we score vendors across the various dimensions of Completeness of Vision and Ability to Execute.
One thing that’s important to note is Magic Quadrants are relative assessments of vendors in a market. We couldn’t have one vendor on an MQ because it would be right in the middle – there’s nothing to compare it to.

Q3. Why were there no Visionaries this year?

Nick Heudecker: We determined there was an overall lack of vision in the market. After a few years of rapid feature expansion, the focus has shifted to operational excellence and execution. Even Leaders shifted to the left on vision, but are still placed in the Leaders quadrant based on their vision for the development of hybrid database management, hardware optimization and integration, emerging deployment models such as containerization, as well as vertical features.

Q4. Were you surprised by the analysis and some of the results you obtained?

Nick Heudecker: The lack of overall vision in the market struck us the most. Other than in a few notable cases, we received largely the same story from most vendors. The explosion of features, and the vendors emerging to implement them, has slowed. The features that initiated the expansion, such as storing new data types, geographically distributed storage, cloud and flexible data consistency models, have become common. Today, nearly every established or emerging DBMS vendor supports these features to some degree. The OPDBMS market has shifted from a phase of rapid innovation to a phase of maturing products and capabilities.

Q5. Do you believe the “NoSQL” label will continue to distinguish DBMSs?

Nick Heudecker: If you look at the entire operational DBMS space, there’s already a great deal of convergence between NoSQL vendors, as well as between NoSQL and traditionally relational vendors. Nearly every vendor, nonrelational and relational, supports multiple data types, like JSON documents, graph or wide-column. NoSQL vendors are adding SQL: MongoDB’s BI Connector and Couchbase’s N1QL are good, if diverse, examples. They’re also adding things like schema management and data validation capabilities.
On the relational side, they’re adding horizontal scaling options and alternative consistency models, as well as modern APIs. And everyone either has or is adding in-memory and cloud capabilities.

It is too soon to call the operational DBMS market a commodity market, but it’s easy to see a future where that is the case.

Q6. What are the other “Vendors to Consider”?

Nick Heudecker: The other vendors to consider are vendors that did not meet the inclusion requirements for the Magic Quadrant. Usually this is because they missed our minimum revenue requirements, but that doesn’t mean they don’t have compelling products.

——————————-
Nick Heudecker is a Research Director on Gartner’s Data & Analytics team. His coverage includes data management technologies and practices.

——————————-

Resources
(*) Magic Quadrant for Operational Database Management Systems. Published: 05 October 2016. ID: G00293203. Analyst(s): Nick Heudecker, Donald Feinberg, Merv Adrian, Terilyn Palanca, Rick Greenwald

– Complimentary Gartner Research: 100 Data and Analytics Predictions Through 2020. Get exclusive access to Gartner’s top 100 data and analytics predictions through 2020. Plus access other relevant Gartner research including Magic Quadrant reports for database and data warehouse solutions, and the market guide for in-memory computing (LINK to MemSQL web site – registration required).

Related Posts

MarkLogic Named a Next-Generation Database Challenger in 2016 Gartner Magic Quadrant. By GARY BLOOM, Chief Executive Officer and President MARKLOGIC

MarkLogic Recognized in New Gartner® Magic Quadrant. Gartner Magic Quadrant for Operational Database Management Systems positions MarkLogic® the highest for ability to execute in the Challengers Quadrant

– Accelerating Business Value with a Multi-Model, Multi-Workload Data Platform

– NuoDB Recognized by Gartner in Critical Capabilities for Operational Database Management Systems. Elastic SQL database achieves top five score in all four use cases.

– Clustrix Recognized in Gartner Magic Quadrant for Operational Database Management Systems

– Learn why EDB is named a “Challenger” in the 2016 Gartner ODBMS Magic Quadrant

– DataStax Receives Highest Scores in 2 Use Cases in Gartner’s Critical Capabilities for Operational Database Management Systems

– Gartner Scores Oracle Highest In 3 of 4 Use Cases: Gartner Critical Capabilities for Operational Database Management Systems Report

Gartner Critical Capabilities For Operational Database Management Systems 2016 – Redis Labs Ranked Second Highest In 2/4 Categories (Link – Registration required)

 

Follow us on Twitter: @odbmsorg

##

Database Challenges and Innovations. Interview with Jim Starkey
Wed, 31 Aug 2016
http://www.odbms.org/blog/2016/08/database-challenges-and-innovations-interview-with-jim-starkey/

“Isn’t it ironic that in 2016 a non-skilled user can find a web page from Google’s untold petabytes of data in millisecond time, but a highly trained SQL expert can’t do the same thing in a relational database one billionth the size?”–Jim Starkey.

I have interviewed Jim Starkey. A database legend, Jim’s career as an entrepreneur, architect, and innovator spans more than three decades of database history.

RVZ

Q1. In your opinion, what are the most significant advances in databases in the last few years?

Jim Starkey: I’d have to say the “atom programming model” where a database is layered on a substrate of peer-to-peer replicating distributed objects rather than disk files. The atom programming model enables scalability, redundancy, high availability, and distribution not available in traditional, disk-based database architectures.

Q2. What was your original motivation to invent the NuoDB Emergent Architecture?

Jim Starkey: It all grew out of a long Sunday morning shower. I knew that the performance limits of single-computer database systems were in sight, so distributing the load was the only possible solution, but existing distributed systems required that a new node copy a complete database or partition before it could do useful work. I started thinking of ways to attack this problem and came up with the idea of peer to peer replicating distributed objects that could be serialized for network delivery and persisted to disk. It was a pretty neat idea. I came out much later with the core architecture nearly complete and very wrinkled (we have an awesome domestic hot water system).

Q3. In your career as an entrepreneur and architect what was the most significant innovation you did?

Jim Starkey: Oh, clearly multi-generational concurrency control (MVCC). The problem I was trying to solve was allowing ad hoc access to a production database for a 4GL product I was working on at the time, but the ramifications go far beyond that. MVCC is the core technology that makes true distributed database systems possible. Transaction serialization is like Newtonian physics – all observers share a single universal reference frame. MVCC is like special relativity, where each observer views the universe from his or her reference frame. The views appear different but are, in fact, consistent.
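A toy sketch of the core visibility rule behind MVCC (an illustration of the general idea, not any particular engine’s implementation): each row keeps multiple timestamped versions, writers append rather than overwrite, and a reader sees the newest version committed at or before its own snapshot.

```scala
// A toy model of MVCC snapshot visibility.
case class Version(value: String, committedAt: Long)

class MvccRow {
  private var versions: List[Version] = Nil // newest first

  // A committing writer appends a new version; older versions are retained
  // so concurrent readers are never blocked or overwritten.
  def write(value: String, commitTs: Long): Unit =
    versions = Version(value, commitTs) :: versions

  // A reader with snapshot timestamp `snapshotTs` sees the newest version
  // committed at or before its snapshot -- its own "reference frame".
  def readAt(snapshotTs: Long): Option[String] =
    versions.filter(_.committedAt <= snapshotTs)
      .sortBy(-_.committedAt)
      .headOption
      .map(_.value)
}

object MvccDemo extends App {
  val row = new MvccRow
  row.write("v1", commitTs = 10)
  row.write("v2", commitTs = 20)
  println(row.readAt(15)) // Some(v1): a snapshot taken at 15 ignores v2
  println(row.readAt(25)) // Some(v2)
}
```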

Q4. Proprietary vs. open source software: what are the pros and cons?

Jim Starkey: It’s complicated. I’ve had feet in both camps for 15 years. But let’s draw a distinction between open source and open development. Open development – where anyone can contribute – is pretty good at delivering implementations of established technologies, but it’s very difficult to push the state of the art in that environment. Innovation, in my experience, requires focus, vision, and consistency that are hard to maintain in open development. If you have a controlled development environment, the question of open source versus proprietary is tactics, not philosophy. Yes, there’s an argument that having the source available gives users guarantees they don’t get from proprietary software, but with something as complicated as a database, most users aren’t going to try to master the sources. But having source available lowers the perceived risk of new technologies, which is a big plus.

Q5. You led the Falcon project – a transactional storage engine for the MySQL server- through the acquisition of MySQL by Sun Microsystems. What impact did it have this project in the database space?

Jim Starkey: In all honesty, I’d have to say that Falcon’s most important contribution was its competition with InnoDB. In the end, that competition made InnoDB three times faster. Falcon, multi-version in memory using the disk for backfill, was interesting, but no matter how we cut it, it was limited by the performance of the machine it ran on. It was fast, but no single node database can be fast enough.

Q6. What are the most challenging issues in databases right now?

Jim Starkey: I think it’s time to step back and reexamine the assumptions that have accreted around database technology – data model, API, access language, data semantics, and implementation architectures. The “relational model”, for example, is based on what Codd called relations and we call tables, but otherwise has nothing to do with his mathematical model. That model, based on set theory, requires automatic duplicate elimination. To the best of my knowledge, nobody ever implemented Codd’s model, but we still have tables which bear a scary resemblance to decks of punch cards. Are they necessary? Or do they just get in the way?
Isn’t it ironic that in 2016 a non-skilled user can find a web page from Google’s untold petabytes of data in millisecond time, but a highly trained SQL expert can’t do the same thing in a relational database one billionth the size? SQL has no provision for flexible text search, no provision for multi-column, multi-table search, and no mechanics in the APIs to handle the results if it could do them. And this is just one of a dozen problems that SQL databases can’t handle. It was a really good technical fit for computers, memory, and disks of the 1980’s, but is it the right answer now?

Q7. How do you see the database market evolving?

Jim Starkey: I’m afraid my crystal ball isn’t that good. Blobs, another of my creations, spread throughout the industry in two years. MVCC took 25 years to become ubiquitous. I have a good idea of where I think it should go, but little expectation of how or when it will.

Qx. Anything else you wish to add?

Jim Starkey: Let me say a few things about my current project, AmorphousDB, an implementation of the Amorphous Data Model (meaning, no data model at all). AmorphousDB is my modest effort to question everything database.
The best way to think about Amorphous is to envision a relational database and mentally erase the boxes around the tables so all records free float in the same space – including data and metadata. Then, if you’re uncomfortable, add back a “record type” attribute and associated syntactic sugar, so table-type semantics are available, but optional. Then abandon punch card data semantics and view all data as abstract and subject to search. Eliminate the fourteen different types of numbers and strings, leaving simply numbers and strings, but add useful types like URL’s, email addresses, and money. Index everything unless told not to. Finally, imagine an API that fits on a single sheet of paper (OK, 9 point font, both sides) and an implementation that can span hundreds of nodes. That’s AmorphousDB.

————
Jim Starkey invented the NuoDB Emergent Architecture, and developed the initial implementation of the product. He founded NuoDB [formerly NimbusDB] in 2008, and retired at the end of 2012, shortly before the NuoDB product launch.

Jim’s career as an entrepreneur, architect, and innovator spans more than three decades of database history from the Datacomputer project on the fledgling ARPAnet to his most recent startup, NuoDB, Inc. Through the period, he has been
responsible for many database innovations from the date data type to the BLOB to multi-version concurrency control (MVCC). Starkey has extensive experience in proprietary and open source software.

Starkey joined Digital Equipment Corporation in 1975, where he created the Datatrieve family of products, the DEC Standard Relational Interface architecture, and the first of the Rdb products, Rdb/ELN. Starkey was also software architect for DEC’s database machine group.

Leaving DEC in 1984, Starkey founded Interbase Software to develop relational database software for the engineering workstation market. Interbase was a technical leader in the database industry producing the first commercial implementations of heterogeneous networking, blobs, triggers, two phase commit, database events, etc. Ashton-Tate acquired Interbase Software in 1991, and was, in turn, acquired by Borland International a few months later. The Interbase database engine was released open source by Borland in 2000 and became the basis for the Firebird open source database project.

In 2000, Starkey founded Netfrastructure, Inc., to build a unified platform for distributable, high quality Web applications. The Netfrastructure platform included a relational database engine, an integrated search engine, an integrated Java virtual machine, and a high performance page generator.

MySQL, AB, acquired Netfrastructure, Inc. in 2006 to be the kernel of a wholly owned transactional storage engine for the MySQL server, later known as Falcon. Starkey led the Falcon project through the acquisition of MySQL by Sun Microsystems.

Jim has a degree in Mathematics from the University of Wisconsin.
For amusement, Jim codes on weekends, while sailing, but not while flying his plane.

——————

Resources

NuoDB Emergent Architecture (.PDF)

On Database Resilience. Interview with Seth Proctor, ODBMS Industry Watch, March 17, 2015

Related Posts

– Challenges and Opportunities of The Internet of Things. Interview with Steve Cellini, ODBMS Industry Watch, October 7, 2015

– Hands-On with NuoDB and Docker, by MJ Michaels, NuoDB. ODBMS.org – OCT 27, 2015

– How leading Operational DBMSs rank popularity wise? By Michael Waclawiczek – ODBMS.org · JANUARY 27, 2016

– A Glimpse into U-SQL, by Stephen Dillon, Schneider Electric. ODBMS.org – DECEMBER 7, 2015

– Gartner Magic Quadrant for Operational DBMS 2015

Follow us on Twitter: @odbmsorg

##

Machines of Loving Grace. Interview with John Markoff
Thu, 11 Aug 2016
http://www.odbms.org/blog/2016/08/machines-of-loving-grace-interview-with-john-markoff/

“Intelligent system designers do have ethical responsibilities.”
–John Markoff.

I have interviewed John Markoff, technology writer at The New York Times. 
In 2013 he was awarded a Pulitzer Prize.
The interview is related to his recent book “Machines of Loving Grace: The Quest for Common Ground Between Humans and Robots”, published in August of 2015 by HarperCollins Ecco.

RVZ

Q1. Do you share the concerns of prominent technology leaders such as Tesla’s chief executive, Elon Musk, who suggested we might need to regulate the development of artificial intelligence?

John Markoff: I share their concerns, but not their assertions that we may be on the cusp of some kind of singularity or rapid advance to artificial general intelligence. I do think that machine autonomy raises specific ethical and safety concerns and regulation is an obvious response.

Q2. How difficult is it to reconcile the different interests of the people who are involved in a direct or indirect way in developing and deploying new technology?

John Markoff: This is why we have governments and governmental regulation. I think AI, in that respect is no different than any other technology. It should and can be regulated when human safety is at stake.

Q3. In your book Machines of Loving Grace you argued that “we must decide to design ourselves into our future, or risk being excluded from it altogether”. What do you mean by that?

John Markoff: You can use AI technologies either to automate or to augment humans. The problem is minimized when you take an approach that is based on human centric design principles.

Q4. How is it possible in practice? Isn’t the technology space dominated by giants such as IBM, Apple, and Google, who dictate the direction of new technology?

John Markoff:  This is a very interesting time with “giant” technology companies realizing that there are consequences in the deployment of these technologies. Google, IBM and Microsoft have all recently made public commitments to the safe use of AI.

Q5. What are the most significant new developments in the humans-computers area, that are likely to have a significant influence in our daily life in the near future?

John Markoff:  One of the best things about being a reporter is that you don’t have to predict the future. You only have to note what the various visionaries say, so you can call that to their attention when their predictions prove inaccurate. With that caveat, if I am forced to bet on any particular information technology it would be augmented reality. This is because I believe that multi-touch interfaces for mobile devices simply can’t be the last step in user interface.

Q6. Do you believe that robots will really transform modern life?

John Markoff:  I struggle with the definition of what is a “robot.” If something is tele-operated, for example, is it a robot? That said I think that we will increasingly be surrounded by machines that perform tasks.
The question is will they come as quickly as Silicon Valley seems to believe. My friend Paul Saffo has said, “Never mistake a clear view for a short distance.” And I think that is the case with all kinds of mobile robots, including self driving cars.

Q7. For the designers of Intelligent Systems, how difficult is to draw a line between what is human and what is machine?

John Markoff: I feel strongly that the possibility of designing cyborgs, particularly with respect to intellectual prostheses, is a boundary we should cross with great caution. Remember the Borg from Star Trek: “Resistance is futile, you will be assimilated.” I think the challenge is to use these systems to enhance human thought, not for social control.

Q8. What are the ethical responsibilities of designers of intelligent systems?

John Markoff: I think the most important aspect of that question is the simple acknowledgement that intelligent system designers do have ethical responsibilities. That has not always been the case, but it seems to be a growing force within the community of AI and robotics designers in the past five years, so I’m not entirely pessimistic.

Q9. If humans delegate decisions to machines, who will be responsible for the consequences?

John Markoff: Ben Shneiderman, the University of Maryland computer scientist and user interface designer has written eloquently on this point. Indeed he argues against autonomous systems for precisely this reason. His point is that it is essential to keep a human in the loop. If not you run the risk of abdicating ethical responsibility for system design.

Q10. Assuming there is a real potential in using data–driven methods to both help charities develop better services and products, and understand civil society activity. In your opinion, what are the key lessons and recommendations for future work in this space?

John Markoff: I’m afraid I’m not an expert in the IT needs of either charities or NGOs. That said a wide range of AI advances are already being delivered at nominal cost via smart phones. As cheap sensors proliferate virtually all everyday objects will gain intelligence that will be widely accessible.

Qx. Anything else you wish to add?

John Markoff: Only that I think it is interesting that the augmentation vs automation dichotomy is increasingly seen as a path through which to navigate the impact of these technologies. Computer system designers are the ones who will decide what the impact of these technologies are and whether to replace or augment humans in society.

—————————————-

JOHN GREGORY MARKOFF

John Markoff joined The New York Times in March 1988 as a reporter for the business section. He is now a technology writer based in the San Francisco bureau of the paper. Prior to joining the Times, he worked for The San Francisco Examiner from 1985 to 1988. He reported for the New York Times Science Section from 2010 to 2015.

Markoff has written about technology and science since 1977. He covered technology and the defense industry for The Pacific News Service in San Francisco from 1977 to 1981; he was a reporter at Infoworld from 1981 to 1983; he was the West Coast editor for Byte Magazine from 1984 to 1985 and wrote a column on personal computers for The San Jose Mercury from 1983 to 1985.

He has also been a lecturer at the University of California at Berkeley School of Journalism and an adjunct faculty member of the Stanford Graduate Program on Journalism.

The Times nominated him for a Pulitzer Prize in 1995, 1998 and 2000. The San Francisco Examiner nominated him for a Pulitzer in 1987. In 2005, with a group of Times reporters, he received the Loeb Award for business journalism. In 2007 he shared the Society of American Business Editors and Writers Breaking News award. In 2013 he was awarded a Pulitzer Prize in explanatory reporting as part of a New York Times project on labor and automation.

In 2007 he became a member of the International Media Council at the World Economic Forum. Also in 2007, he was named a fellow of the Society of Professional Journalists, the organization’s highest honor.

In June of 2010 the New York Times presented him with the Nathaniel Nash Award, which is given annually for foreign and business reporting.

Born in Oakland, California on October 29, 1949, Markoff grew up in Palo Alto, California and graduated from Whitman College, Walla Walla, Washington, in 1971. He attended graduate school at the University of Oregon and received a master’s degree in sociology in 1976.

Markoff is the co-author of “The High Cost of High Tech,” published in 1985 by Harper & Row. He wrote “Cyberpunk: Outlaws and Hackers on the Computer Frontier” with Katie Hafner, which was published in 1991 by Simon & Schuster.
In January of 1996 Hyperion published “Takedown: The Pursuit and Capture of America’s Most Wanted Computer Outlaw,” which he co-authored with Tsutomu Shimomura. “What the Dormouse Said: How the Sixties Counterculture Shaped the Personal Computer Industry” was published in 2005 by Viking Books. “Machines of Loving Grace: The Quest for Common Ground Between Humans and Robots” was published in August of 2015 by HarperCollins Ecco.

He is currently researching a biography of Stewart Brand.

He is married to Leslie Terzian Markoff and they live in San Francisco, Calif.

Resources

MACHINES OF LOVING GRACE – The Quest for Common Ground Between Humans and Robots By John Markoff, Illustrated. 378 pp. Ecco/HarperCollins Publishers.

Shneiderman’s “Eight Golden Rules of Interface Design”. These rules were obtained from the text Designing the User Interface by Ben Shneiderman.

“Designing the User Interface”, 6th Edition. This is a revised edition of the highly successful textbook on Human Computer Interaction originally developed by Ben Shneiderman and Catherine Plaisant at the University of Maryland.

Related Posts

– Recruit Institute of Technology. Interview with Alon Halevy ODBMS Industry Watch, Published on 2016-04-02

– Civility in the Age of Artificial Intelligence,  by STEVE LOHR, technology reporter for The New York Times, ODBMS.org

– On Artificial Intelligence and Society. Interview with Oren Etzioni, ODBMS Industry Watch.

– On Big Data and Society. Interview with Viktor Mayer-Schönberger, ODBMS Industry Watch.

Follow us on Twitter: @odbmsorg

##

Recruit Institute of Technology. Interview with Alon Halevy
Sat, 02 Apr 2016
http://www.odbms.org/blog/2016/04/recruit-institute-of-technology-interview-with-alon-halevy/

” A revolution will happen when tools like Siri can truly serve as your personal assistant and you start relying on such an assistant throughout your day. To get there, these systems need more knowledge about your life and preferences, more knowledge about the world, better conversational interfaces and at least basic commonsense reasoning capabilities. We’re still quite far from achieving these goals.”–Alon Halevy

I have interviewed Alon Halevy, CEO at Recruit Institute of Technology.

RVZ

Q1. What is the mission of the Recruit Institute of Technology?

Alon Halevy: Before I describe the mission, I should introduce our parent company Recruit Holdings to those who may not be familiar with it. Recruit (founded in 1960) is a leading “life-style” information services and human resources company in Japan with services in the areas of recruitment, advertising, employment placement, staffing, education, housing and real estate, bridal, travel, dining, beauty, automobiles and others. The company is currently expanding worldwide and operates similar businesses in the U.S., Europe and Asia. In terms of size, Recruit has over 30,000 employees and its revenues are similar to those of Facebook at this point in time.

The mission of R.I.T is threefold. First, being the lab of Recruit Holdings, our goal is to develop technologies that improve the products and services of our subsidiary companies and create value for our customers from  the vast collections of data we have. Second, our mission is to advance scientific knowledge by contributing to the research community through publications in top-notch venues. Third, we strive to use technology for social good. This latter goal may be achieved through contributing to open-source software, working on digital artifacts that would be of general use to society, or even working with experts in a particular domain to contribute to a cause.

Q2. Isn’t it similar to the mission of the Allen Institute for Artificial Intelligence?

Alon Halevy: The Allen Institute is a non-profit whose admirable goal is to make fundamental contributions to Artificial Intelligence. While R.I.T strives to make fundamental contributions to A.I and related areas such as data management, we plan to work closely with our subsidiary companies and to impact the world through their products.

Q3. Driverless cars, digital Personal Assistants (e.g. Siri), Big Data, the Internet of Things, Robots: Are we on the brink of the next stage of the computer revolution?

Alon Halevy: I think we are seeing many applications in which AI and data (big or small) are starting to make a real difference and affecting people’s lives. We will see much more of it in the next few years as we refine our techniques. A revolution will happen when tools like Siri can truly serve as your personal assistant and you start relying on such an assistant throughout your day. To get there, these systems need more knowledge about your life and preferences, more knowledge about the world, better conversational interfaces and at least basic commonsense reasoning capabilities. We’re still quite far from achieving these goals.

Q4. You were for more than 10 years senior staff research scientist at Google, leading the Structured Data Group in Google Research. Was it difficult to leave Google?

Alon Halevy: It was extremely difficult leaving Google! I struggled with the decision for quite a while, and waving goodbye to my amazing team on my last day was emotionally heart wrenching. Google is an amazing company and I learned so much from my colleagues there. Fortunately, I’m very excited about my new colleagues and the entrepreneurial spirit of Recruit.
One of my goals at R.I.T is to build a lab with the same culture as that of Google and Google Research. So in a sense, I’m hoping to take Google with me. Some of my experiences from a decade at Google that are relevant to building a successful research lab are described in a blog post I contributed to the SIGMOD blog in September, 2015.

Q5. What is your vision for the next three years for the Recruit Institute of Technology?

Alon Halevy: I want to build a vibrant lab with world-class researchers and engineers. I would like the lab to become a world leader in the broad area of making data usable, which includes data discovery, cleaning, integration, visualization and analysis.
In addition, I would like the lab to build collaborations with disciplines outside of Computer Science where computing techniques can make an even broader impact on society.

Q6. What are the most important research topics you intend to work on?

Alon Halevy: One of the roadblocks to applying AI and analysis techniques more widely within enterprises is data preparation.
Before you can analyze data or apply AI techniques to it, you need to be able to discover which datasets exist in the enterprise, understand the semantics of a dataset and its underlying assumptions, and to combine disparate datasets as needed. We plan to work on the full spectrum of these challenges with the goal of enabling many more people in the enterprise to explore their data.

Recruit being a lifestyle company, another fundamental question we plan to investigate is whether technology can help people make better life decisions. In particular, can technology help you take many factors in your life into consideration as you make decisions, and steer you towards decisions that will make you happier over time? Clearly, we’ll need more than computer scientists to even ask the right questions here.

Q7. If we delegate decisions to machines, who will be responsible for the consequences? What are the ethical responsibilities of designers of intelligent systems?

Alon Halevy: You got an excellent answer from Oren Etzioni to this question in a recent interview. I agree with him fully and could not say it any better than he did.

Qx Anything you wish to add?

Alon Halevy: Yes. We’re hiring! If you’re a researcher or strong engineer who wants to make a real impact on products and services in the fascinating area of lifestyle events and decision making, please consider R.I.T!

———-

Alon Halevy is the Executive Director of the Recruit Institute of Technology. From 2005 to 2015 he headed the Structured Data Management Research group at Google. Prior to that, he was a professor of Computer Science at the University of Washington in Seattle, where he founded the Database Group. In 1999, Dr. Halevy co-founded Nimble Technology, one of the first companies in the Enterprise Information Integration space, and in 2004 he founded Transformic, a company that created search engines for the deep web and was later acquired by Google.
Dr. Halevy is a Fellow of the Association for Computing Machinery, received the Presidential Early Career Award for Scientists and Engineers (PECASE) in 2000, and was a Sloan Fellow (1999-2000). Halevy is the author of the book “The Infinite Emotions of Coffee”, published in 2011, and serves on the board of the Alliance of Coffee Excellence.
He is also a co-author of the book “Principles of Data Integration”, published in 2012.
Dr. Halevy received his Ph.D. in Computer Science from Stanford University in 1993 and his Bachelor’s degree from the Hebrew University in Jerusalem.

Resources

– Civility in the Age of Artificial Intelligence, by Steve Lohr, technology reporter for The New York Times, ODBMS.org

– The threat from AI is real, but everyone has it wrong, by Robert Munro, CEO Idibon, ODBMS.org

Related Posts

– On Artificial Intelligence and Society. Interview with Oren Etzioni, ODBMS Industry Watch.

– On Big Data and Society. Interview with Viktor Mayer-Schönberger, ODBMS Industry Watch.

Follow us on Twitter: @odbmsorg

##

On Dark Data. Interview with Gideon Goldin http://www.odbms.org/blog/2015/11/on-dark-data-interview-with-gideon-goldin/ http://www.odbms.org/blog/2015/11/on-dark-data-interview-with-gideon-goldin/#comments Mon, 16 Nov 2015 12:19:11 +0000 http://www.odbms.org/blog/?p=4023

“Top-down cataloging and master-data management tools typically require expensive data curators, and are not simple to use. This poses a significant threat to cataloging efforts since so much knowledge about your organization’s data is inevitably clustered across the minds of the people who need to question it and the applications they use to answer those questions.”–Gideon Goldin

I have interviewed Gideon Goldin, UX Architect, Product Manager at Tamr.

RVZ

Q1. What is “dark data”?

Gideon Goldin: Gartner refers to dark data as “the information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing).” For most organizations, dark data comprises the majority of available data, and it is often the result of the constantly changing and unpredictable nature of enterprise data, something that is likely to be exacerbated by corporate restructuring, M&A activity, and a number of external factors.

By shedding light on this data, organizations are better positioned to make more accurate, data-driven business decisions.
Tamr Catalog, which is available as a free downloadable app, aims to do this, providing users with a view of their entire data landscape so they can quickly understand what was in the dark and why.

Q2. What are the main drawbacks of traditional top-down methods of cataloging or “master data management”?

Gideon Goldin: The main drawbacks are scalability and simplicity. When Yahoo, for example, started to catalog the web, they employed some top-down approaches, hiring specialists to curate structured directories of information. As the web grew, however, their solution became less relevant and significantly more costly. Google, on the other hand, mined the web to understand references that exist between pages, allowing the relevance of sites to emerge from the bottom up. As a result, Google’s search engine was more accurate, easier to scale, and simpler.

Top-down cataloging and master-data management tools typically require expensive data curators, and are not simple to use. This poses a significant threat to cataloging efforts since so much knowledge about your organization’s data is inevitably clustered across the minds of the people who need to question it and the applications they use to answer those questions. Tamr Catalog aims to deliver an innovative and vastly simplified method for cataloging your organization’s data.

Q3. Tamr recently opened a public Beta program, Tamr Catalog, for an enterprise metadata catalog. What is it?

Gideon Goldin: The Tamr Catalog Beta Program is an open invitation to test-drive our free cataloging software. We have yet to find an organization that is content with their current cataloging approaches, and we found that the biggest barrier to reform is often knowing where to start. Catalog can help: the goal of the Catalog Beta Program is to better understand how people want and need to collaborate around their data sources. We believe that an early partnership with the community will ensure that we develop useful functionality and thoughtful design.

Q4. What is the core functionality of Tamr Catalog?

Gideon Goldin: Tamr Catalog enables users to easily register, discover and organize their data assets.

Q5. How does it help simplify access to high-quality data sets for analytics?

Gideon Goldin: Not surprisingly, people are biased to use the data sets closest to them. With Catalog, scientists and analysts can easily discover unfamiliar data sets: data sets, for example, that may belong to other departments or analysts. Catalog profiles and collects pointers to your sources, providing multifaceted, visual browsing of all data and trivializing the search for any given set of data.
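To make the idea of registering and profiling sources concrete, here is a minimal, hypothetical sketch in Python (using the pandas library) of what building a tiny catalog over a set of CSV files might look like. This is only an illustration of the general technique, not Tamr’s implementation; the directory path and the “email” facet are invented for the example.

    import glob
    import pandas as pd

    def profile_source(path):
        # Register one CSV source: capture basic metadata and per-column statistics.
        df = pd.read_csv(path, nrows=10000)  # sample rows rather than the full file
        return {
            "path": path,
            "rows_sampled": len(df),
            "columns": {
                col: {
                    "dtype": str(df[col].dtype),
                    "null_fraction": float(df[col].isna().mean()),
                    "distinct_in_sample": int(df[col].nunique()),
                }
                for col in df.columns
            },
        }

    # Build a tiny "catalog" over every CSV under a (hypothetical) data directory.
    catalog = [profile_source(p) for p in glob.glob("/data/**/*.csv", recursive=True)]

    # Faceted lookup: which registered sources expose a column that looks like an email?
    print([s["path"] for s in catalog
           if any("email" in c.lower() for c in s["columns"])])

A real catalog adds ownership, provenance and collaboration on top of this kind of profile, but the sketch shows why even lightweight profiling makes discovery much easier than asking around.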

Q6. How does Tamr Catalog relate to the Tamr Data Unification Platform?

Gideon Goldin: Before organizations can unify their data, preparing it for improved analysis or management, they need to know what they have. Organizations often lack a good approach for this first (and repeating) step in data unification. We realized this quickly when helping large organizations begin their unification projects, and we even realized we lacked a satisfactory tool to understand our own data. Thus, we built Catalog as a part of the Tamr Data Unification Platform to illuminate your data landscape, such that people can be confident that their unification efforts are as comprehensive as possible.

Q7. What are the main challenges (technical and non-technical) in achieving broad adoption of vendor- and platform-neutral metadata cataloging?

Gideon Goldin: Often the challenge isn’t about volume, it’s about variety. While a vendor-neutral Catalog intends to solve exactly this, there remains a technical challenge in providing a flexible and elegant interface for cataloging dozens or hundreds of different types of data sets and the structures they comprise.

However, we find that some of the biggest (and most interesting) challenges revolve around organizational processes and culture. Some organizations have developed sophisticated but unsustainable approaches to managing their data, while others have become paralyzed by the inherently disorganized nature of their data. It can be difficult to appreciate the value of investing in these problems. Figuring out where to start, however, shouldn’t be difficult. This is why we chose to release a lightweight application free of charge.

Q8. Chief Data Officers (CDOs), data architects and business analysts have different requirements and different modes of collaborating on (shared) data sets. How do you address this in your catalog?

Gideon Goldin: The goal of cataloging isn’t cataloging, it’s helping CDOs identify business opportunities, empowering architects to improve infrastructures, enabling analysts to enrich their studies, and more. Catalog allows anyone to register and organize sources, encouraging open communication along the way.

Q9. How do you handle issues such as data protection, ownership, provenance and licensing in the Tamr catalog?

Gideon Goldin: Catalog allows users to indicate who owns what. Over the course of our Beta program, we have been fortunate enough to have over 800 early users of Catalog and have collected feedback about how our users would like to see data protection and provenance implemented in their own environments. We are eager to release new functionality to address these needs in the near future.

Q10. Do you plan to use the Tamr Catalog also for collecting data sets that can be used for data projects for the Common Good?

Gideon Goldin: We do. We know of a few instances of Catalog being used for such purposes, including projects that will build on the documenting of city and health data. In addition to our Catalog Beta Program, we are introducing a Community Developer Program, where we are eager to see how the community links Tamr Catalogs to new sources (including those in other catalogs), new analytics and visualizations, and ultimately insights. We believe in the power of open data at Tamr, and we’re excited to learn how we can help the Common Good.

—————————–
Gideon Goldin, UX Architect, Product Manager at Tamr.

Prior to Tamr, Gideon Goldin worked as a data visualization/UX consultant and university lecturer. He holds a Master’s in HCI and a PhD in cognitive science from Brown University, and is interested in designing novel human-machine experiences. You can reach Gideon on Twitter at @gideongoldin or email him at Gideon.Goldin at tamr.com.

Resources

– Download the free Tamr Catalog app.

– Tamr Catalog Developer Community
Online community where Tamr Catalog users can comment, interact directly with the development team, and learn more about the software; and where developers can explore extending the tool by creating new data connectors.

Gartner IT Glossary: Dark data

Related Posts

– Data for the Common Good. Interview with Andrea Powell. ODBMS Industry Watch, June 9, 2015

– Doubt and Verify: Data Science Power Tools, by Michael L. Brodie, CSAIL, MIT

– Data Wisdom for Data Science, by Bin Yu, Departments of Statistics and EECS, University of California at Berkeley

Follow ODBMS.org on Twitter: @odbmsorg

On Hadoop and Big Data. Interview with John Leach http://www.odbms.org/blog/2015/07/on-hadoop-and-big-data-interview-with-john-leach/ http://www.odbms.org/blog/2015/07/on-hadoop-and-big-data-interview-with-john-leach/#comments Mon, 13 Jul 2015 08:32:52 +0000 http://www.odbms.org/blog/?p=3941

“One common struggle for data-driven enterprises is managing unnecessarily complicated data workflows with bloated ETL pipelines and a lack of native system integration.”– John Leach

I have interviewed John Leach, CTO & Cofounder of Splice Machine. Main topics of the interview are Hadoop, Big Data integration and what Splice Machine has to offer in this space. Monte Zweben, CEO of Splice Machine, also contributed to the interview.

RVZ

Q1. What are the Top Ten Pitfalls to Avoid in a SQL-on-Hadoop Implementation?

John Leach, Monte Zweben:
1. Individual record lookups. Most SQL-on-Hadoop engines are designed for full table scans in analytics, but tend to be too slow for the individual record lookups and range scans used by operational applications.
2. Dirty Data. Dirty data is a problem for any system, but it is compounded in Big Data, often resulting in bad reports and delays to reload an entire data set.
3. Sharding. It can be difficult to know which key to distribute data on and what the right shard size is. Getting this wrong results in slow queries, especially for large joins or aggregations.
4. Hotspotting. This happens when data becomes too concentrated in a few nodes, especially for time series data. The impact is slow queries and poor parallelization (a common mitigation is sketched just after this list).
5. SQL coverage. Limited SQL dialects will make it so you can’t run queries to meet business needs. You’ll want to make sure you do your homework. Compile the list of toughest queries and test.
6. Concurrency. Low concurrency can result in the inability to power real-time apps, handle many users, support many input sources, and deliver reports as updates happen.
7. Columnar. Not all columnar solutions are created equally. Besides columnar storage, there are many other optimizations, such as vectorization and run-length encoding, that can have a big impact on analytic performance. If your OLAP queries run slowly, as is common with large joins and aggregations, productivity suffers: queries may take minutes or hours instead of seconds. The flip side is choosing columnar when what you really need is concurrency and real-time updates.
8. Node Sizing. Do your homework and profile your workload. Choosing the wrong node size (e.g., CPU cores, memory) can negatively impact price/performance and create performance bottlenecks.
9. Brittle ETL on Hadoop. With many SQL-on-Hadoop solutions being unable to provide update or delete capabilities without a full data reload, this can cause a very brittle ETL that will require restarting your ETL pipeline because of errors or data quality issues. The result is a missed ETL window and delayed reports to business users.
10. Cost-Based Optimizer. A cost-based optimizer improves performance by selecting the right join strategy, the right index, and the right ordering. Some SQL-on-Hadoop engines have no cost-based optimizer or relatively immature ones that can result in poor performance and poor productivity, as well as manual tuning by DBAs.
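As an illustration of pitfall 4, a common mitigation for hotspotting on time-series keys in HBase-style stores is to prefix the row key with a small, deterministic salt so that consecutive timestamps land in different regions. The Python sketch below shows the general idea only; it is not Splice Machine’s mechanism, and the bucket count and key layout are assumptions made up for the example.

    import hashlib

    NUM_SALT_BUCKETS = 16  # roughly matched to the number of regions/shards

    def salted_key(device_id, timestamp_ms):
        # Derive a stable salt bucket from the natural key, then prefix it,
        # so keys written milliseconds apart spread across regions.
        raw = "%s:%d" % (device_id, timestamp_ms)
        bucket = int(hashlib.md5(raw.encode()).hexdigest(), 16) % NUM_SALT_BUCKETS
        return ("%02d|%s" % (bucket, raw)).encode()

    print(salted_key("sensor-42", 1436745600000))
    print(salted_key("sensor-42", 1436745600001))

The trade-off is that range scans must now fan out across all salt buckets and merge the results, which is why the right choice depends on your read patterns.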

Q2. In your experience, what are the most common problems in Big Data integration?

John Leach, Monte Zweben: Providing users access to data in a fashion they can understand and at the moment they need it, while ensuring quality and security, can be incredibly challenging.

The volume and velocity of data that businesses are churning out, along with the variety of different sources, can pose many issues.

One common struggle for data-driven enterprises is managing unnecessarily complicated data workflows with bloated ETL pipelines and a lack of native system integration. Businesses may also find their skill sets, workload, and budgets over-stretched by the need to manage terabytes or petabytes of structured and unstructured data in a way that delivers genuine value to business users.

When data is siloed and there is no solution put into place, businesses can’t access the real-time insights they need to make the best decisions for their business. Performance goes down, headaches abound and cost goes way up, all in the effort to manage the data. That’s why a Big Data integration solution is a prerequisite for getting the best performance and the most real-time insights, at the lowest cost.

Q3. What are the capabilities of Hadoop beyond data storage?

John Leach, Monte Zweben: Hadoop has a very broad range of capabilities and tools:

Oozie for workflow
Pig for scripting
Mahout or SparkML for machine learning
Kafka and Storm for streaming
Flume and Sqoop for integration
Hive, Impala, Spark, and Drill for SQL analytic querying
HBase for NoSQL
Splice Machine for operational, transactional RDBMS

Q4. What programming skills are required to handle application development around Big Data platforms like Hadoop?

John Leach, Monte Zweben: To handle application development on Hadoop, individuals can choose between raw Hadoop and SQL-on-Hadoop. When going the SQL route, very few new skills are required and developers can open connections to an RDBMS on Hadoop just like they used to do on Oracle, DB2, SQL Server, or Teradata. Raw Hadoop application developers should know their way around the core components of the Hadoop stack, such as HDFS, MapReduce, Kafka, Storm, Oozie, Hive, Pig, HBase, and YARN. They should also be proficient in Java.
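As a concrete example of the SQL route, the sketch below opens a connection to a HiveServer2-compatible SQL-on-Hadoop engine from Python using the open-source PyHive client. The hostname, schema, and table names are hypothetical; the point is that the workflow is the familiar connect/execute/fetch pattern developers already know from traditional RDBMSs.

    from pyhive import hive  # third-party client for HiveServer2-compatible engines

    # Hypothetical edge-node endpoint; 10000 is the usual HiveServer2 port.
    conn = hive.connect(host="hadoop-edge.example.com", port=10000, username="analyst")
    cursor = conn.cursor()

    cursor.execute("""
        SELECT region, COUNT(*) AS orders
        FROM sales.orders
        WHERE order_date >= '2015-01-01'
        GROUP BY region
    """)
    for region, orders in cursor.fetchall():
        print(region, orders)

    cursor.close()
    conn.close()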

Q5. What are the current challenges for real-time application deployment on Hadoop?

John Leach, Monte Zweben: When we talk about real-time at Splice Machine, we’re focused on applications that require not only real-time responses to queries, but also real-time database updates from a variety of data sources. The former is not all that uncommon on Hadoop; the latter is nearly impossible for most Hadoop-based systems.

Deploying real-time applications on Hadoop is really a function of moving Hadoop beyond its batch processing roots to be able to handle real-time database updates with high concurrency and transactional integrity. We harness HBase along with a lockless snapshot isolation design to provide full ACID transactions across rows and tables.

This technology enables Splice Machine to execute the high concurrency of transactions required by real-time applications.

Q6. What is special about Splice Machine auto-sharding replication and failover technology?

John Leach, Monte Zweben: As part of its auto-sharding, HBase horizontally partitions, or splits, each table into smaller chunks or shards that are distributed across multiple servers. Using the inherent failover and replication capabilities of HBase and Hadoop, Splice Machine can support applications that demand high availability.

HBase co-processors are used to embed Splice Machine in each distributed HBase region (i.e., data shard). This enables Splice Machine to achieve massive parallelization by pushing the computation down to each distributed data shard without any overhead of MapReduce.
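Conceptually, the range-based auto-sharding described above routes each row key to the region whose key range contains it. The toy Python sketch below illustrates that routing with a sorted list of split points; real HBase region metadata is of course richer than this, and the split points here are invented for the example.

    import bisect

    # Sorted split points define the key ranges that the four regions own:
    # (-inf, "g"), ["g", "n"), ["n", "t"), ["t", +inf)
    split_points = [b"g", b"n", b"t"]

    def region_for(row_key):
        # The index of the region whose key range contains row_key.
        return bisect.bisect_right(split_points, row_key)

    for key in [b"alice", b"harry", b"zoe"]:
        print(key, "-> region", region_for(key))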

Q7. How difficult is it for customers to migrate from legacy databases to Splice Machine?

John Leach, Monte Zweben: Splice Machine offers a variety of services to help businesses efficiently deploy the Splice Machine database and derive maximum value from their investment. These services include both implementation consulting and educational offerings delivered by our expert team.

Splice Machine has designed a Safe Journey program to significantly ease the effort and risk for companies migrating to a Splice Machine database. The Safe Journey program includes a proven methodology that helps choose the right workloads to migrate, implements risk-mitigation best practices, and includes commercial tools that automate most of the PL/SQL conversion process.

This is not to suggest that all legacy databases will convert to a Hadoop RDBMS.
The best candidates will typically have over 1TB of data, which often leads to cost and scaling issues in legacy databases.

Q8. You have recently announced partnership with Talend, mrc (michaels, ross & cole ltd.) and RedPoint Global. Why Talend, mrc, and RedPoint Global? What is the strategic meaning of these partnerships for Splice Machine?

John Leach, Monte Zweben: Our uptick in recent partnerships demonstrates the tremendous progress our team has made over the past year. We have been working relentlessly to develop the Splice Machine Hadoop RDBMS into a fully enterprise-ready database that can replace legacy database systems.

The demand for programming talent to handle application development is growing faster than the supply of skilled talent, especially around newer platforms like Hadoop. We partnered with mrc to give businesses a solution that can speed real-time application deployment on Hadoop with the staff and tools they currently have, while also offering future-proof applications over a database that scales to meet increasing data demands.

We partnered with Talend to bring our customers the benefit of two different approaches for managing data integration affordable and at scale. Talend’s rich capabilities including drag and drop user interface, and adaptable platform allow for increased productivity and streamlined testing for faster deployment of web, mobile, OLTP or Internet of Things applications.

And finally, we integrated and certified our Hadoop RDBMS on RedPoint’s Convergent Marketing Platform™ to create a new breed of solution for marketers. With cost-efficient database scale-out and real-time cross-channel execution, the solution enables enterprises to future-proof their marketing technology investment through affordable access to all their data (social, mobile, click streams, website behaviors, etc.) across a proliferating and ever-changing list of channels. Furthermore, it complements any existing Hadoop deployment, including those on the Cloudera, MapR and Hortonworks distributions.

Q9. How is Splice Machine working with Hadoop distribution partners –such as MapR, Hortonworks and Cloudera?

John Leach, Monte Zweben: Since Splice Machine does not modify HBase, it can be used with any standard Hadoop distribution that includes HBase, including Cloudera, MapR and Hortonworks. Splice Machine enables enterprises using these three companies to tap into real-time updates with transactional integrity, an important feature for companies looking to become real-time, data-driven businesses.

In 2013, Splice Machine partnered with MapR to enable companies to use the MapR distribution for Hadoop to build their real time, SQL-on-Hadoop applications. In 2014, we joined the Cloudera Connect Partner Program, after certifying on CDH 5. We are working closely with Cloudera to maximize the potential of its full suite of Hadoop-powered software and our unique approach to real-time Hadoop.

That same year, we joined Hortonworks Technology Partner program. This enabled our users to harness innovations in management, provisioning and security for HDP deployments. For HDP users, Splice Machine enables them to build applications that use ANSI-standard SQL and support real-time updates with transactional integrity, allowing Hadoop to be used in both OLTP and OLAP applications.

Earlier this year, we were excited to achieve Hortonworks® Data Platform (HDP™) Certification. With the HDP certification, our customers can leverage the pre-built and validated integrations between leading enterprise technologies and the Hortonworks Data Platform, the industry’s only 100-percent open source Hadoop distribution, to simplify and accelerate their Splice Machine and Hadoop deployments.

Q10 What are the challenges of running online transaction processing on Hadoop?

John Leach, Monte Zweben: With its heritage as a batch processing system, Hadoop does not provide the transaction support required by online transaction processing. Transaction support can be tricky enough to implement for shared-disk RDBMSs such as Oracle, but it becomes far more difficult to implement in distributed environments such as Hadoop. A distributed transactional model requires a high level of coordination across a cluster, which can easily introduce too much overhead, while it must simultaneously provide high performance for a high concurrency of small reads and writes, high-speed ingest, and massive bulk loads. We prove this by being able to run the TPC-C benchmark at scale.

Splice Machine met those requirements by using distributed snapshot isolation, a Multi-Version Concurrency Control model that delivers lockless, high-concurrency transactional support. Splice Machine extended research from Google’s Percolator project, Yahoo Labs’ OMID project, and the University of Waterloo’s HBaseSI project to develop its own patent-pending, distributed transactions.
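To illustrate the general idea of lockless snapshot isolation (and only the general idea; this is a toy, single-node sketch, not Splice Machine’s distributed engine), the Python fragment below keeps multiple timestamped versions of each value so that a reader sees a consistent snapshot even while later writes land.

    class VersionedStore:
        """Toy multi-version store: writers append (commit_ts, value) pairs;
        readers only see versions committed at or before their snapshot."""

        def __init__(self):
            self.versions = {}   # key -> append-only list of (commit_ts, value)
            self.clock = 0       # logical clock issuing commit timestamps

        def write(self, key, value):
            self.clock += 1
            self.versions.setdefault(key, []).append((self.clock, value))

        def snapshot(self):
            return self.clock    # the reader's frozen view of the world

        def read(self, key, snapshot_ts):
            visible = [v for ts, v in self.versions.get(key, []) if ts <= snapshot_ts]
            return visible[-1] if visible else None

    store = VersionedStore()
    store.write("balance", 100)
    snap = store.snapshot()        # reader takes a snapshot ...
    store.write("balance", 250)    # ... and a later write does not disturb it
    print(store.read("balance", snap))              # 100
    print(store.read("balance", store.snapshot()))  # 250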

 

———————-
John Leach – CTO & Cofounder, Splice Machine
With over 15 years of software experience under his belt, John’s expertise in analytics and BI drives his role as Chief Technology Officer. Prior to Splice Machine, John founded Incite Retail in June 2008 and led the company’s strategy and development efforts. At Incite Retail, he built custom Big Data systems (leveraging HBase and Hadoop) for Fortune 500 companies.
Prior to Incite Retail, he ran the business intelligence practice at Blue Martini Software and built strategic partnerships with integration partners. John was a key subject matter expert for Blue Martini Software in many strategic implementations across the world. His focus at Blue Martini was helping clients incorporate decision support knowledge into their current business processes utilizing advanced algorithms and machine learning.
John received dual bachelor’s degrees in biomedical and mechanical engineering from Washington University in Saint Louis. Leach currently is the organizer for the Saint Louis Hadoop Users Group and is active in the Washington University Elliot Society.

Monte Zweben – CEO & Cofounder Splice Machine
A technology industry veteran, Monte’s early career was spent with the NASA Ames Research Center as the Deputy Chief of the Artificial Intelligence Branch, where he won the prestigious Space Act Award for his work on the Space Shuttle program.
Monte then founded and was the Chairman and CEO of Red Pepper Software, a leading supply chain optimization company, which merged in 1996 with PeopleSoft, where he was VP and General Manager, Manufacturing Business Unit.

In 1998, Monte was the founder and CEO of Blue Martini Software – the leader in e-commerce and multi-channel systems for retailers. Blue Martini went public on NASDAQ in one of the most successful IPOs of 2000, and is now part of JDA.
Following Blue Martini, he was the chairman of SeeSaw Networks, a digital, place-based media company. Monte is also the co-author of Intelligent Scheduling and has published articles in the Harvard Business Review and various computer science journals and conference proceedings.

Zweben currently serves on the Board of Directors of Rocket Fuel Inc. as well as the Dean’s Advisory Board for Carnegie-Mellon’s School of Computer Science.

 

Resources

– Splice Machine resource page, ODBMS.org

Related Posts

Common misconceptions about SQL on Hadoop. By Cynthia M. Saracco, ODBMS.org, July 2015

– SQL over Hadoop: Performance isn’t everything… By Simon Harris, ODBMS.org, March 2015

– Archiving Everything with Hadoop. By Mark Cusack, ODBMS.org. December 2014.

– On Hadoop RDBMS. Interview with Monte Zweben. ODBMS Industry Watch, November 2, 2014

– AsterixDB: Better than Hadoop? Interview with Mike Carey, ODBMS Industry Watch, October 22, 2014

 

Follow ODBMS.org on Twitter: @odbmsorg

##

 

On MarkLogic 8. Interview with Stephen Buxton http://www.odbms.org/blog/2015/02/stephen-buxton/ http://www.odbms.org/blog/2015/02/stephen-buxton/#comments Fri, 13 Feb 2015 09:55:02 +0000 http://www.odbms.org/blog/?p=3780

“When trades are reconciled with counterparties and then closed, updates can and do occur. Bitemporal helps ensure investment banks can always go back and see when updates occurred for specific trades. This is critical to managing risk and handling increased concerns about regulatory compliance and future audits. “– Stephen Buxton.

MarkLogic recently released MarkLogic 8. I wanted to know more about this release. For that, I have interviewed Stephen Buxton, Senior Director, Product Management at MarkLogic.

RVZ

Q1. You have recently launched MarkLogic® 8 software release. How is it positioned in the Big Data market? How does it differentiate from other products from NoSQL vendors?

Stephen Buxton: MarkLogic 8 is our biggest release ever, further solidifying MarkLogic’s position in the market as the only Enterprise NoSQL database.
With MarkLogic 8, you can now store, manage and search JSON, XML, and RDF all in one unified platform—without sacrificing enterprise features such as transactional consistency, security, or backup and recovery.
While other database companies are still figuring out how to strengthen their platform and add features like transactional consistency, we’ve moved far ahead of them by working on new innovative features such as Bitemporal and Semantics. It’s for these reasons that over 500 enterprise organizations have chosen MarkLogic to run their mission-critical applications.

MarkLogic 8 is more powerful, agile, and trusted than ever before, and is an ideal platform for doing two things: making heterogeneous data integration simpler and faster; and for doing dynamic content delivery at massive scale.
Relational databases do not offer enough flexibility—integration projects can take multiple years, cost millions of dollars, and struggle at scale. But, the newer NoSQL databases that do have agility still lack the enterprise features required to run in the data centers at large organizations. MarkLogic is the only NoSQL database that is able to solve today’s challenge, having the flexibility to serve as an operational and analytical database for all of an organization’s data.

Q2. Could you please explain the way the new version of MarkLogic supports JavaScript and JSON? Could you gives us an example of how does it work?

Stephen Buxton: MarkLogic 8 introduces a new phase in our roadmap with JSON and JavaScript. JSON is rapidly becoming the data format of choice for many use cases, and now MarkLogic provides the ability to store JSON natively, right alongside other formats such as XML and RDF so you don’t have to worry about slow and brittle conversion between data formats. The combination of Server-Side JavaScript and native JSON provides an ideal platform for building JSON-based services with JavaScript in every tier of an application.

Within MarkLogic, the JSON structure is mapped directly to the internal structure already used by the XML document format, so it has the same speed and scalability as with XML. This also means that all of the production-proven indexing, data management, and security capabilities that MarkLogic is known for are fully maintained.

With Server-Side JavaScript, developers now have access to the powerful query and data manipulation capabilities of MarkLogic in a language and with tools that they’re already familiar with. Developers now have a friendly API to express queries, aggregates, and data manipulation while automatically distributing evaluation across a MarkLogic cluster to run in parallel, close to the data. MarkLogic 8’s implementation of Server-Side JavaScript is done by embedding Google’s V8 engine, the same engine that powers Chrome.

Not only that, but MarkLogic 8 also includes a Node.js Client API, an open source JavaScript library that allows developers to quickly, easily, and reliably access MarkLogic from an application they built using Node.js.
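For readers who are not on Node.js, the same native-JSON behavior can also be exercised through MarkLogic’s REST API. The Python sketch below writes and reads back a JSON document via the /v1/documents endpoint; the host, port and credentials are placeholder assumptions for a local REST instance with digest authentication.

    import requests
    from requests.auth import HTTPDigestAuth

    base = "http://localhost:8000/v1"            # hypothetical REST instance
    auth = HTTPDigestAuth("admin", "admin")      # placeholder credentials

    doc = {"name": "John Thomas", "city": "Richmond", "state": "Virginia"}

    # Store the JSON document natively, keyed by its URI (no format conversion step).
    r = requests.put(base + "/documents", params={"uri": "/people/john.json"},
                     json=doc, auth=auth)
    r.raise_for_status()

    # Read it back as JSON.
    resp = requests.get(base + "/documents", params={"uri": "/people/john.json"}, auth=auth)
    print(resp.json())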

Q3. In MarkLogic 8 you have been adding full SPARQL 1.1 support and Inferencing capability. Could you please explain what kind of Inferencing capability did you add and what are they useful for?

Stephen Buxton: We made a big leap forward on the semantics foundation that was laid in our previous release, adding full SPARQL 1.1 support, which includes support for property paths, aggregates, and SPARQL Update. Support for automatic inferencing was also added, which is a powerful capability that allows the database to combine existing data and apply pre-defined rules to infer new data. SPARQL 1.1 is a standard defined by the W3C that is supported by many RDF triple stores. But, MarkLogic differentiates itself among triple stores as you can store your documents and data right alongside your triples, and you can query across all three data models easily and efficiently.

Automatic inferencing is a really powerful feature that is part of an overall strategy to provide a more intelligent data layer so that you can build smarter apps.
With inferencing, for example, if you had two pieces of data stored as RDF triples, such as “John lives in Virginia” and “Virginia is in the United States”, then MarkLogic 8 could infer the new fact, “John lives in the United States”.
This can make search results richer and also show you new relationships in your data.

In MarkLogic 8, rules for inferencing are applied at query time. This approach is referred to as backward-chaining inference, a very flexible approach in which only the required rules are applied for each query, so the server does the minimum work necessary to get the correct results; and when your data or ontology or rules sets change, that change is available immediately – it takes effect with the very next query. And, of course, inference queries are transactional, distributed, and obey MarkLogic’s rule-based security, just like any other query. MarkLogic 8 has supplied rule sets for RDFS, RDFS-Plus, OWL-Horst, and their subsets; and you can create your own. With MarkLogic 8 you can further restrict any SPARQL query (with or without inference) by any document attribute, including timestamp, provenance, or even a bitemporal constraint.
More details and examples can be found at developer.marklogic.com.
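The “John lives in Virginia” example can be reproduced outside MarkLogic with a few lines of Python and the open-source rdflib library, applying one hand-written rule. This is only meant to show what inferring a new triple looks like; MarkLogic itself applies its supplied or user-defined rule sets at query time, as described above, and the namespace below is invented.

    from rdflib import Graph, Namespace

    EX = Namespace("http://example.org/")
    g = Graph()
    g.add((EX.John, EX.livesIn, EX.Virginia))
    g.add((EX.Virginia, EX.locatedIn, EX.UnitedStates))

    # One hand-written rule: livesIn(x, y) and locatedIn(y, z) => livesIn(x, z).
    inferred = [
        (person, EX.livesIn, country)
        for person, _, place in g.triples((None, EX.livesIn, None))
        for _, _, country in g.triples((place, EX.locatedIn, None))
    ]
    for triple in inferred:
        g.add(triple)

    print((EX.John, EX.livesIn, EX.UnitedStates) in g)   # True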

Q4. The additions to SPARQL include Property Paths, Aggregates, and SPARQL Update. Could you please explain briefly each of them?

Stephen Buxton: SPARQL 1.1 brings support for property paths, aggregates, and SPARQL Update. These capabilities make working with RDF data simpler and more powerful, which means increased context for your data—all using the SPARQL 1.1 industry standard query language.

SPARQL 1.1’s property paths let you traverse an RDF graph – bouncing from point-to-point across a graph. This graph traversal allows you to do powerful, complex queries such as, “Show me all the people who are connected to John” by finding people that know John, and people that know people that know John, and so on.

With aggregate SPARQL functions you can do analytic queries over hundreds of billions of triples. MarkLogic 8 supports all the SPARQL 1.1 Aggregate functions – COUNT, SUM, MIN, MAX, and AVG – as well as the grouping operations GROUP BY, GROUP BY .. HAVING, GROUP_CONCAT and SAMPLE.

SPARQL 1.1 also includes SPARQL Update. With these capabilities, you can delete, insert, and update (delete/insert) individual triples, and manipulate RDF graphs, all using SPARQL 1.1.
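Because the SPARQL itself is standard, the “people connected to John” property-path example combined with an aggregate can be tried against any conformant engine. The sketch below uses the open-source rdflib library in Python with a made-up example namespace; against MarkLogic the same query text would simply be submitted through its SPARQL endpoint instead.

    from rdflib import Graph, Namespace

    EX = Namespace("http://example.org/")
    g = Graph()
    for a, b in [("Ann", "Bob"), ("Bob", "Carol"), ("Carol", "John")]:
        g.add((EX[a], EX.knows, EX[b]))

    # Property path (one or more 'knows' hops) plus a COUNT aggregate.
    results = g.query("""
        PREFIX ex: <http://example.org/>
        SELECT (COUNT(DISTINCT ?person) AS ?connected)
        WHERE { ?person ex:knows+ ex:John }
    """)
    for row in results:
        print(row.connected)   # 3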

Q5. The addition of SPARQL Update capabilities could have the potential to influence the capability you offer of a RDF triple store that scales horizontally and manages billions of triples. Any comment on that?

Stephen Buxton: The enhancements in MarkLogic 8 make it able to function as a full-featured, stand-alone triple store. This means you can now get a triple store that is horizontally scalable as part of a shared-nothing cluster, and still get all of the enterprise features MarkLogic is known for, such as High Availability, Disaster Recovery, and certified security. Beyond that, anyone looking for “just a triple store” will find they can also store, manage, and query documents and data in the same database, a unique capability that only MarkLogic has.

Q6. You have been adding a so called Bitemporal Data Management. What is it and why is it useful?

Stephen Buxton: Bitemporal is a new feature that allows you to ask, “What did you know and when did you know it?” The MarkLogic Bitemporal feature answers this critical question by tracking what happened, when it happened, and when we found out. A bitemporal database is much more powerful than a temporal database that can only track when something happened. The difference between when something happened and when you found out about it can be incredibly significant, particularly when it comes to audits and regulation.

A bitemporal database tracks time across two different axes, the system and valid time axes. This allows you to go back in time and explore data, manage historical data across systems, ensure data integrity, and do complex bitemporal analysis. You can answer complex questions such as:
• Where did John Thomas live on August 20th as we knew it on September 1st?
• Where was the Blue Van on October 12th as we knew it on October 23rd?
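The two questions above can be made concrete with a toy two-axis record store. The Python sketch below is purely conceptual (MarkLogic implements this with documents and range indexes, as discussed later in the interview); the dates and the correction scenario are invented for illustration.

    from dataclasses import dataclass
    from datetime import date

    FUTURE = date(9999, 12, 31)

    @dataclass
    class Fact:
        value: str
        valid_from: date   # when it was true in the real world
        valid_to: date
        system_from: date  # when the database learned it
        system_to: date

    facts = [
        # Recorded on Aug 1: John lives in Richmond from Jun 1 onward.
        Fact("Richmond", date(2014, 6, 1), FUTURE, date(2014, 8, 1), date(2014, 9, 5)),
        # Learned on Sep 5: he had actually moved to Norfolk on Aug 15.
        Fact("Richmond", date(2014, 6, 1), date(2014, 8, 15), date(2014, 9, 5), FUTURE),
        Fact("Norfolk", date(2014, 8, 15), FUTURE, date(2014, 9, 5), FUTURE),
    ]

    def as_of(valid_time, system_time):
        # Where did John live at valid_time, as we knew it at system_time?
        for f in facts:
            if (f.valid_from <= valid_time < f.valid_to
                    and f.system_from <= system_time < f.system_to):
                return f.value

    print(as_of(date(2014, 8, 20), date(2014, 9, 1)))   # Richmond (what we believed then)
    print(as_of(date(2014, 8, 20), date(2014, 9, 10)))  # Norfolk  (after the correction)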

Bitemporal is important for a wide variety of use cases across industries. Getting a more accurate picture of a business at different points-in-time used to be impossible, or very challenging at best. Bitemporal helps ensure that you always have a full and accurate picture of your data at every point-in-time, which is particularly useful in regulated industries.

• Regulatory requirements – Avoid the increasingly harsh downside consequences from not adhering to government and industry regulations, particularly in financial services and insurance
• Audits – Preserve the history of all your data, including the changes made to it, so that clear audits can be conducted without having to worry about lost data, data integrity, or cumbersome ETL processes with archived data
• Investigations and Intelligence – No more lost emails and no more missing information. Bitemporal databases never erase data, so it is possible to see exactly how data was updated based on what was known at the time
• Business Analytics – Run complex queries that were not previously possible in order to better understand your business and answer new questions about how different decisions and changes in the past could have led to different results
• Cost reduction – Manage data with a smaller footprint as the shape of the data changes, avoiding the need to set up additional databases for historical data.

Bitemporal is enhanced by MarkLogic’s Tiered Storage, which allows you to more easily archive your data to cheaper storage tiers with little administrative overhead. This keeps Bitemporal simple, and obviates the high cost imposed by the few relational databases that do have Bitemporal. MarkLogic also eliminates the schema roadblocks that relational databases that have Bitemporal struggle with. MarkLogic is schema-agnostic and can adjust to the shape of data as that data changes over time.

Q7. How is bitemporal different from versioning?

Stephen Buxton: Bitemporal works by ingesting bitemporal documents that are managed as a series of documents with range indexes for valid and system time axes. Documents are stored in a temporal collection protected by security permissions. The initial document inserted into the database is kept and never changes, allowing you to track the provenance of information with full governance and immutability.

Q8. Could you give us some examples of how Bitemporal Data Management could be useful applications for the financial services industry?

Stephen Buxton: One example of Bitemporal is trade reconciliation in financial services. When trades are reconciled with counterparties and then closed, updates can and do occur. Bitemporal helps ensure investment banks can always go back and see when updates occurred for specific trades. This is critical to managing risk and handling increased concerns about regulatory compliance and future audits.
Imagine the Head of IT Architecture at a major bank working on mining information and looking for changes in risk profiles. The risk profiles cannot be accurately calculated without having an accurate picture of the reference and trade data, and how it changed over time. This task becomes simple and fast using Bitemporal.

Qx Anything else you wish to add?

Stephen Buxton: In addition to innovative features such as Bitemporal and Semantics, and features that make MarkLogic more widely accessible in the developer community, there are other updates in Marklogic 8 that make it easier to administer and manage. For example, Incremental Backup, another feature added in MarkLogic 8, allows DBAs to perform backups faster while using less storage.
With MarkLogic 8, you can have multiple daily incremental backups with only a minimal impact on database performance. This feature is one worth highlighting because it will make DBAs’ lives much easier, and will save an organization time and money.
It’s just another example of MarkLogic’s continuing dedication to being an enterprise NoSQL database that is more powerful, agile, and trusted than anything else.

————–
Stephen Buxton is Senior Director of Product Management for Search and Semantics at MarkLogic, where he has been a member of the Products team for 8 years. Stephen focuses on bringing a rich semantic search experience to users of the MarkLogic NoSQL database, document store, and triple store. Before joining MarkLogic, Stephen was Director of Product Management for Text and XML at Oracle Corporation.

Resources

MarkLogic 8: What’s new (ODBMS.org)

Related Posts

On making information accessible. Interview with David Leeming. ODBMS Industry Watch, July 30, 2014

Follow ODBMS.org on Twitter: @odbmsorg

##

On the Hadoop market. Interview with John Schroeder http://www.odbms.org/blog/2014/06/john-schroeder-ceo-cofounder-mapr-technologies/ http://www.odbms.org/blog/2014/06/john-schroeder-ceo-cofounder-mapr-technologies/#comments Mon, 30 Jun 2014 18:25:26 +0000 http://www.odbms.org/blog/?p=3212

“Hadoop continues to mature with regards to structuring data and interactive query, so future overlap between Hadoop and OLAP will increase.”– John Schroeder.

I have interviewed John Schroeder, CEO and Cofounder of MapR Technologies. Main topics of the interview are managing Big Data projects and how the Hadoop market is evolving.

RVZ

Q1. What are the most common problems and challenges encountered in Big Data projects?

John Schroeder: First of all there is no single Big Data use case. Applications cut across industries and involve a wide variety of data sources. These projects can result in revenue gains, cost reductions or risk mitigation. While the challenges for these projects also vary, we see customers embracing our platform to deal with common challenges in meeting mission critical service levels, addressing real-time response pressures, and supporting multiple users and applications.

Q2. How do you see the Hadoop market evolving?

John Schroeder: We have leading customers in diverse industries who are using Hadoop to drive operational analytics; customer examples include performing 100B ad auctions a day, fraud detection for over 100 million card holders, and real-time adjustments to improve fleet efficiency. These examples require the right architecture: support for streaming writes so data can be constantly written to the system while analysis is being conducted; high performance to meet business needs and real-time operations; and the ability to perform online database operations to react to the business situation and impact business as it happens, rather than producing a batch report days or weeks later.

Q3. Is Hadoop really replacing the role of OLAP (online analytical processing) in preparing data to answer specific questions?

John Schroeder: Hadoop’s impact is more disruptive than a replacement for OLAP technologies that have been in the market since the 90s. Customers deploy use cases on Hadoop that were not feasible or cost effective using these traditional technologies. For example, the use of clustering algorithms and recommendation engines that can be run much more frequently against much larger datasets open opportunities for use cases that drive new revenue streams.
Hadoop is also more powerful for unstructured data. So while we do see customers offload data warehouse processing onto MapR, most MapR customers are deploying net new use cases. The business impact is that the net new growth in analytic use cases is happening on Hadoop.

Hadoop is not currently a direct replacement for OLAP or an Enterprise Data Warehouse, for that matter. These technologies will continue to have their place. Hadoop does not require schema definition or structuring of data. In fact, acting as a data hub, Hadoop can be quite complementary to these by offloading processing and data from these systems. The average cost to store data in a data warehouse is $16,000/terabyte. The cost for MapR is less than $1,000/terabyte. OLAP engines leverage data that has been transformed and processed into precise schemas. They can perform very well for well-understood problems. One of the benefits of Hadoop is that you don’t need to understand the questions you are going to ask ahead of time; you can combine many different data types and determine the analysis you need after the data is in place. Hadoop continues to mature with regards to structuring data and interactive query, so future overlap between Hadoop and OLAP will increase.

Q4. Organizations embracing Hadoop often struggle to empower large groups of business analysts who require sophisticated SQL and BI tools to do their jobs. How do you handle this problem?

John Schroeder: MapR has the broadest support for SQL-in-Hadoop and SQL-on-Hadoop. Hive, Drill, Spark and Impala continue to mature as technologies. We take a consultative approach with our customers, assisting them in selecting the technology best suited to their use case. These technologies are rapidly evolving, so we also help “future-proof” the SQL technology selection to reduce technology lock-in. In the case of large groups of business analysts and users, we’re very excited about our partnership with HP Vertica. HP Vertica runs natively within the MapR platform and provides full 100% ANSI SQL support to users. MapR also supports a broad range of SQL solutions designed specifically for Hadoop.
MapR also provides a standard file-based interface so any tool that uses enterprise storage systems can easily access data directly in MapR.

With MapR, you are in charge. You decide what you want to use to query your data; we focus on providing a reliable, scalable and affordable platform with full enterprise support.

Q5. How do you define the Total Cost of Ownership for Big Data architecture?

John Schroeder: There are many factors that drive TCO. The cost of storing data in MapR can be 50 to 100 times cheaper than in other analytic platforms. MapR has innovated at the architecture level in many important areas to deliver a much lower TCO; these include hardware performance and efficiency, which result in a much smaller footprint and save on hardware, operations and management costs. We have had customers tell us that they would need to deploy clusters 2-5 times larger with other distributions for the same workloads. We have also spent a great deal of time on the underlying data platform to provide high availability, reliability, and serviceability to make a MapR deployment extremely efficient. When customers are deploying an in-Hadoop database, MapR provides many TCO advantages. Our M7 Database Edition is an in-Hadoop NoSQL database that addresses HBase limitations by eliminating region servers, eliminating compactions and automating table management to support continuous, low-latency online applications.

Q6. Is YARN expanding Hadoop use cases in the enterprise? And if yes, how?

John Schroeder: Much has been said about Hadoop 2.x and YARN and how it promises to expand Hadoop beyond MapReduce. YARN’s promise is to enable multiple execution frameworks to run on top of Hadoop, thereby expanding the Hadoop use cases beyond batch into interactive, real-time and others. At its core, YARN is a resource allocation framework that allows execution frameworks such as classical MapReduce, and also newer ones like interactive SQL-on-Hadoop, streaming, and others, to ask for and receive CPU and memory resources on the cluster for a period of time. YARN’s power is in making the resource allocation of a Hadoop cluster a more streamlined and centralized decision, thereby allowing for more efficient cluster use and, more importantly, opening up Hadoop for emerging use cases. We’re happy to include YARN in MapR’s distribution and have uniquely enhanced YARN to allow both MapReduce V1 and MapReduce V2 applications to run simultaneously on the same cluster to reduce the barrier to YARN adoption.

Q7. Do you have any metrics to define how good is the “value” that can be derived by analyzing Big Data?

John Schroeder: We have customers that get 50X the performance at 1/50 the cost. We have other customers that have ROI over 1000X because of better approaches to drive revenue. We have other customers whose entire business model is built on the advantages that Hadoop provides. Earlier, I pointed out operational workloads that allow customers to dramatically transform their businesses; these are the applications that really drive value for organizations.
Beyond top line or cost savings value is the ability to support use cases that were not feasible before MapR.
MapR is key to Rubicon running Internet ad exchanges and comScore’s ability to measure what people do as they navigate the digital world.

Q8. What are the benefits of MapR’s Hadoop Distribution on the Google Compute Engine at Google I/O?

John Schroeder: MapR makes big data accessible to businesses of any size by leveraging the Google Compute Engine infrastructure to provide a high-performance, scalable, predictable, and easy-to-provision Hadoop infrastructure.

With respect to the scale and performance advantages, using MapR, Google was able to demonstrate a significant Hadoop price/performance breakthrough. We were able to run the Hadoop TeraSort benchmark to sort 1TB of data in a world-record setting time of 54 seconds on a 1003-node cluster that Google provided for our use. This broke the previous world record with approximately one third the number of cores.

Q9. You recently announced the early access release of the new HP Vertica Analytics Platform on MapR. What are the benefits of such cooperation for the enterprise?

John Schroeder: MapR and Vertica together demonstrate technical leadership in providing the best-of-breed SQL-on-Hadoop solution for enterprises. HP Vertica and MapR produce a comprehensive, tightly integrated, scalable, open-standards big data platform solution. There is no need to manage a dual cluster environment.

MapR is the only platform that could integrate an MPP analytic platform natively on Hadoop without requiring connectors or external tables in order for the MPP platform to interact with Hadoop data. With this integration, HP Vertica works as a native application on top of MapR, sharing the cluster resources with other Hadoop frameworks and applications.
The storage utilization of each application is dynamic and grows to the needs of business without requiring pre-allocation of file system space for HP Vertica. The architecture also allows customers to leverage MapR’s consistent snapshots and mirroring to provide point-in-time recovery and disaster recovery for HP Vertica with practically no effort.

For analysts, data scientists, and business users wanting more analytical power and faster ability to drive business decisions and execution, HP Vertica delivers the industry’s most advanced SQL-on-Hadoop analytics directly on MapR for higher performance and lower TCO.

Qx Anything else you wish to add?

John Schroeder: Two additional thoughts: data agility and operations.
MapR is investing engineering resources in data agility, decreasing time to value from data. Apache Drill is the only interactive SQL project that is architected for both centrally structured and self-describing data. Requiring DBA-like work to structure new data sources, and the cumbersome process of altering structure, delays time to value from new or changed data. Drill supports querying data structured in HCatalog, but can also query data structures using data-interchange formats like JSON.
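As an illustration of querying self-describing JSON without declaring a schema first, the sketch below submits SQL to Apache Drill’s REST endpoint from Python. The Drillbit address and the JSON file path are assumptions for the example; 8047 is Drill’s default web port.

    import requests

    drill = "http://localhost:8047/query.json"   # hypothetical local Drillbit

    payload = {
        "queryType": "SQL",
        # Drill infers the structure of the JSON file at query time.
        "query": "SELECT t.campaign, COUNT(*) AS auctions "
                 "FROM dfs.`/data/auctions.json` t GROUP BY t.campaign",
    }
    resp = requests.post(drill, json=payload)
    resp.raise_for_status()
    for row in resp.json().get("rows", []):
        print(row)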
Many use cases have batch, interactive and real-time (operational) aspects. Ad exchanges have to store and analyze auctions, but they also have to provide information like yield estimates in real-time to publishers and brands.
Credit card fraud has analytic aspects but also has to interact during a credit card swipe. Investment in MapR’s M7 in-Hadoop NoSQL database has provided, and continues to provide, technology to support those real-time operations and avoid the cost and complexity of a second non-Hadoop platform. We aren’t going to replace an OLTP database, but we can cover many of the operational use cases.
————————
John Schroeder, CEO and Cofounder, MapR Technologies. John has served as MapR’s Chief Executive Officer and Chairman of the Board since founding the company in 2009. Prior to founding MapR, John held executive positions in a number of enterprise software companies with a focus on data, storage and business intelligence at both private and public companies including: CEO of Calista Technologies (now Microsoft), CEO of Rainfinity (now EMC), SVP of Products and Marketing at Brio Technologies (BRYO) and General Manager at Compuware (CPWR).

Related Posts

How to run a Big Data project. Interview with James Kobielus. ODBMS Industry Watch, May 15, 2014

Setting up a Big Data project. Interview with Cynthia M. Saracco. ODBMS Industry Watch, January 27, 2014

Resources

MapR Apache Hadoop Distribution

BigDataBench: As a multi-discipline research effort, BigDataBench is an open-source big data benchmark suite.

SQL-on-Hadoop without compromise, IBM Software Group Thought Leadership White Paper

Applied Predictive Analytics: Principles and Techniques for the Professional Data Analyst. Dean Abbott, 456 pages, Wiley, May 2014

Professional Hadoop Solutions,Boris Lublinsky, Kevin T. Smith, Alexey Yakubovich, Wiley, October 2013.

From TPC-C to Big Data Benchmarks: A Functional Workload Model. Authors: Yanpei Chen, Francois Raab, Randy H. Katz.


Follow ODBMS.org and ODBMS Industry Watch on Twitter: @odbmsorg
