On the Hadoop market. Interview with John Schroeder
” Hadoop continue to mature with regards to structuring data and interactive query, so future overlap between Hadoop and OLAP will increase.”– John Schroeder.
I have interviewed John Schroeder, CEO and Cofounder of MapR Technologies. Main topics of the interview are managing Big Data projects and how the Hadoop market is evolving.
Q1. What are the most common problems and challenges encountered in Big Data projects?
John Schroeder: First of all there is no single Big Data use case. Applications cut across industries and involve a wide variety of data sources. These projects can result in revenue gains, cost reductions or risk mitigation. While the challenges for these projects also vary, we see customers embracing our platform to deal with common challenges in meeting mission critical service levels, addressing real-time response pressures, and supporting multiple users and applications.
Q2. How do you see the Hadoop market evolving?
John Schroeder: We have leading customers in diverse industries who are using Hadoop to drive operational analytics, customer examples include performing 100B ad auctions a day, fraud detection for over 100 million card holders and real-time adjustments to improve fleet efficiency. These examples require the right architecture to support streaming writes so data can be constantly writing to the system while analysis is being conducted; high performance to meet the business needs and real-time operations; and the ability to perform online database operations to react to the business situation and impact business as it happens not producing a batch to report days or weeks later.
Q3. Is Hadoop really replacing the role of OLAP (online analytical processing) in preparing data to answer specific questions?
John Schroeder: Hadoop’s impact is more disruptive than a replacement for OLAP technologies that have been in the market since the 90s. Customers deploy use cases on Hadoop that were not feasible or cost effective using these traditional technologies. For example, the use of clustering algorithms and recommendation engines that can be run much more frequently against much larger datasets open opportunities for use cases that drive new revenue streams.
Hadoop is also more powerful for unstructured data. So while we do see customers offload data warehouse processing on MapR, most MapR customers are deploying net new use cases. The business impact is the net new growth of analytic use cases is being done on Hadoop.
Hadoop is not currently a direct replacement to OLAP or an Enterprise Data Warehouse, for that matter. These technologies will continue to have their place. Hadoop does not require schema definition or structuring of data. In fact, acting as a Datahub, Hadoop can be quite complementary to these by offloading processing and data from these systems. The average cost to store data in a data warehouse is $16,000/terabyte. The cost for MapR is less than $1000/terabyte. OLAP engines leverage data that has been transformed and processed into precise schemas. They can perform very well for well understood problems. One of the benefits of Hadoop is that you don’t need to understand the questions you are going to ask ahead of time, you can combine many different data types and determine required analysis you need after the data is in place. Hadoop continue to mature with regards to structuring data and interactive query, so future overlap between Hadoop and OLAP will increase.
Q4. Organizations embracing Hadoop often struggle to empower large groups of business analysts who require sophisticated SQL and BI tools to do their jobs. How do you handle this problem?
John Schroeder: MapR has the broadest support concerning SQL-in-Hadoop and SQL-on-Hadoop. Hive, Drill, Spark and Impala continue to mature as technologies. We are consultative to our customers assisting them to select the technology best suited to their use case. These technologies are rapidly evolving so we assist in “future proofing” the SQL technology selection to reduce technology lock in. In the case of large groups of business analysts and users we’re very excited about our partnership with HP Vertica. HP Vertica runs natively within the MapR platform and it provides full 100% ANSI SQL support to users. MapR also supports a broad range of SQL solutions designed specifically for Hadoop.
MapR also provides a standard file-based interface so any tool that uses enterprise storage systems can easily access data directly in MapR.
With MapR, you are in charge. You decide what you want to use to query your data; we focus on providing a reliable, scalable and affordable platform with full enterprise support.
Q5. How do you define the Total Cost of Ownership for Big Data architecture?
John Schroeder: There are many factors that drive TCO. The cost of storing data in MapR can be 50 to 100 times cheaper than other analytic platforms. MapR has innovated at the architecture level to drive many important areas to result in a much lower TCO, these include hardware performance and efficiency that results in a much smaller footprint which saves on hardware, operations and management costs. We have had customers tell us that they would need to deploy clusters 2-5 times larger with other distributions for the same workloads. We have also spent a great deal of time on the underlying data platform to provide high availability, reliability, and serviceability to make a MapR deployment extremely efficient. When customers are deploying an in-Hadoop database, MapR provides many TCO advantages. Our M7 Database Edition is an in-Hadoop NoSQL database that addresses HBase limitations by eliminating region servers, eliminating compactions and automating table management to support continuous, low latency on-line applications.
Q6. Is YARN expanding Hadoop use cases in the enterprise? And if yes, how?
John Schroeder: Much has been talked about Hadoop 2.x and YARN and how it promises to expand Hadoop beyond MapReduce. YARN’s promise is to enable multiple execution frameworks to run on top of Hadoop, thereby expanding the Hadoop use cases beyond batch into interactive, real-time and others. At its core, YARN is a resource allocation framework that allows for execution frameworks such as classical MapReduce, and also newer ones like interactive SQL-on-Hadoop, streaming, and others to ask for and receive CPU and memory resources on the cluster for a period of time. YARN’s power is in making the resource allocation of a Hadoop cluster a more streamlined and centralized decision, thereby allowing for more efficient cluster use and more importantly, opening up Hadoop for emerging use cases. We’re happy to include YARN in MapR’s distribution and have uniquely enhanced YARN to allow both Map Reduce V1 and Map Reduce V2 applications to run simultaneously on the same cluster to reduce the barrier to YARN adoption.
Q7. Do you have any metrics to define how good is the “value” that can be derived by analyzing Big Data?
John Schroeder: We have customers that get 50X the performance at 1/50 the cost. We have other customers that have ROI over 1000X because of better approaches to drive revenue. We have other customers whose entire business model is built on the advantages that Hadoop provides. Earlier, I pointed out operational workloads that allow customers to dramatically transform their businesses, these are the applications that really drive value for organizations.
Beyond top line or cost savings value is the ability to support use cases that were not feasible before MapR.
MapR is key to Rubicon running Internet ad exchanges and comScore’s ability to measure what people do as they navigate the digital world.
Q8. What are the benefits of MapR’s Hadoop Distribution on the Google Compute Engine at Google I/O?
John Schroeder: Through the Google Compute Engine infrastructure, MapR makes big data accessible to any size business by leveraging the Google Compute Engine to provide a high performance, scalable, predictable, and easy to provision Hadoop infrastructure.
With respect to the scale and performance advantages, using MapR, Google was able to demonstrate a significant Hadoop price/performance breakthrough. We were able to run the Hadoop TeraSort benchmark to sort 1TB of data in a world-record setting time of 54 seconds on a 1003-node cluster that Google provided for our use. This broke the previous world record with approximately one third the number of cores.
Q9. You recently announced the early access release of the new HP Vertica Analytics Platform on MapR. What are the benefits of such cooperation for the enterprise?
John Schroeder: MapR and Vertica together demonstrate technical leadership in providing the best-of-breed SQL-on-Hadoop solution for enterprises. HP Vertica and MapR produce a comprehensive, tightly integrated, scalable, open-standards big data platform solution. There is no need to manage a dual cluster environment.
MapR is the only platform that could integrate an MPP analytic platform natively on Hadoop without requiring connectors or external tables in order for the MPP platform to interact with Hadoop data. With this integration, HP Vertica works as a native application on top of MapR, sharing the cluster resources with other Hadoop frameworks and applications.
The storage utilization of each application is dynamic and grows to the needs of business without requiring pre-allocation of file system space for HP Vertica. The architecture also allows customers to leverage MapR’s consistent snapshots and mirroring to provide point-in-time recovery and disaster recovery for HP Vertica with practically no effort.
For analysts, data scientists, and business users wanting more analytical power and faster ability to drive business decisions and execution, HP Vertica delivers the industry’s most advanced SQL-on-Hadoop analytics directly on MapR for higher performance and lower TCO.
Qx Anything else you wish to add?
John Schroeder: Two additional thoughts: data agility and operations.
MapR is investing engineering resources for data agility by decreasing time to value from data. Apache Drill is the only interactive SQL project that is architected for both centrally structured and self-describing data. Requiring DBAs-like work to structure new data sources and the cumbersome process for altering structure, delays time to value from new or changed data. Drill supports query of data structured in HCatalog, but also can query data structures using data-interchange formats like JSON.
Many use cases have batch, interactive and real-time (operational) aspects. Ad exchanges have to store and analyze auctions, but they also have to provide information like yield estimates in real-time to publishers and brands.
Credit fraud has analytic aspects but also have to interact during a credit card swipe. Investment in MapR’s M7 in-Hadoop NoSQL database has, and continues, to provide technology to support those real-time operations and avoid the cost and complexity of a second non-Hadoop platform. We aren’t going to replace and OLTP database, but we can cover many of the operational use cases.
John Schroeder, CEO and Cofounder, MapR Technologies. John has served as MapR’s Chief Executive Officer and Chairman of the Board since founding the company in 2009. Prior to founding MapR, John held executive positions in a number of enterprise software companies with a focus on data, storage and business intelligence at both private and public companies including: CEO of Calista Technologies (now Microsoft), CEO of Rainfinity (now EMC), SVP of Products and Marketing at Brio Technologies (BRYO) and General Manager at Compuware (CPWR).
– How to run a Big Data project. Interview with James Kobielus. ODBMS Industry Watch,May 15, 2014
–Setting up a Big Data project. Interview with Cynthia M. Saracco. ODBMS Industry Watch, January 27, 2014
– MapR Apache Hadoop Distribution
– BigDataBench: As a multi-discipline research effort, BigDataBench is an open-source big data benchmark suite.
– SQL-on-Hadoop without compromise, IBM Software Group Thought Leadership White Paper
– Applied Predictive Analytics: Principles and Techniques for the Professional Data Analyst. Dean Abbott, 456 pages, Wiley, May 2014
– Professional Hadoop Solutions,Boris Lublinsky, Kevin T. Smith, Alexey Yakubovich, Wiley, October 2013.
– From TPC-C to Big Data Benchmarks: A Functional Workload Model. Authors: Yanpei Chen, Francois Raab, Randy H. Katz.
Follow ODBMS.org and ODBMS Industry Watch on Twitter: @odbmsorg