Next generation Hadoop — interview with John Schroeder.
“There are only a few Facebook-sized IT organizations that can have 60 Stanford PhDs on staff to run their Hadoop infrastructure. The others need it to be easier to develop Hadoop applications, deploy them and run them in a production environment.”— John Schroeder.
How easy is to use Hadoop? What are the next generation Hadoop distributions? On these topics, I did Interview John Schroeder, Cofounder and CEO of MapR.
Q1. What is the value that Apache Hadoop provides as a Big Data analytics platform?
John Schroeder: Apache Hadoop is a software framework that supports data-intensive distributed applications. Apache Hadoop provides a new platform to analyze and process Big Data. With data growth exploding and new unstructured sources of data expanding a new approach is required to handle the volume, variety and velocity of data. Hadoop was inspired by Google’s MapReduce and Google File System (GFS) papers.
Q2. Is scalability the only benefits of Apache Hadoop then?
John Schroeder: No, you can build applications that aren’t feasible using traditional data warehouse platforms.
The combination of scale, ability to process unstructured data along with the availability of machine learning algorithms and recommendation engines creates the opportunity to build new game changing applications.
Q3. What are the typical requirements of advanced Hadoop users as well as those new to Hadoop?
John Schroeder Advanced users of Hadoop are looking to go beyond batch uses of Hadoop to support real-time streaming of content. Advanced users also need multi-tenancy to balance production and decision support workloads on their shared Hadoop cloud.
New users need Hadoop to become easier. There are only a few Facebook-sized IT organizations that can have 60 Stanford PhDs on staff to run their Hadoop infrastructure. The others need it to be easier to develop Hadoop applications, deploy them and run them in a production environment.
Q4. Why this? Please give us some practical examples of applications.
John Schroeder: Product recommendations, ad placements, customer churn, patient outcome predictions, fraud detection and sentiment analysis are just a few examples that improve with real time information.
Organizations are also looking to expand Hadoop use cases to include business critical, secure applications that easily integrate with file-based applications and products. Requirements for data protection include snapshots to provide point-in-time recovery and mirroring for business continuity.
With mainstream adoption comes the need for tools that don’t require specialized skills and programmers. New Hadoop developments must be simple for users to operate and to get data in and out. This includes direct access with standard protocols using existing tools and applications.
Q5. What are in your opinion the core limitations that limit the adoption of Hadoop in the enterprise? How do you contribute in taking Big Data into mainstream?
John Schroeder: MapR has and continues to knock down the barriers to Hadoop adoption. Hadoop needed five 9’s availability and the ability to run in a ‘lights-out” datacenter so we transformed Hadoop into a reliable compute and dependable storage platform. Hadoop use cases were too narrow so we expanded access to Hadoop data for industry standard file-based processing.
The MapR Control Center makes it easy to administer, monitor and provision Hadoop applications.
We improved Hadoop economics by dramatically improving performance. We just released our multi-tenancy features that were key to our recently announced Amazon and Google partnership announcements.
Next on our roadmap is to continue expanding the use cases and additional progress moving Hadoop from batch to real-time.
Q6. What are the benefits of having an automated stateful failover?
John Schroeder: Automated stateful failover provides high availability and continuous operations for organizations.
Even with multiple hardware or software outages and errors applications will continue running without any administrator actions required.
Q7. What is special about MapR architecture? Please give us some technical detail.
John Schroeder: MapR rethought and rebuilt the entire internal infrastructure while maintaining complete compatibility with Hadoop APIs. MapR opened up Hadoop to programming, products and data access by providing a POSIX storage layer, NFS access, ODBC/JDBC and REST. The MapR strategy is to expand use cases appropriate for Hadoop while avoiding proprietary features that would result in vendor lock-in. We rebuilt the underlying storage services layer and eliminated the “append only” limitation of the Hadoop Distributed File System (HDFS). MapR writes directly to block devices eliminating the inefficiencies caused by layering HDFS on top of a Linux local file system. In other Hadoop implementations, these continually rob the entire system of performance efficiencies.
The storage layer rearchitecture also enabled us to implement snapshot and mirroring capabilities. The MapR Distributed NameNode eliminates the well-known file scalability limitation allowing us to optimize the Hadoop shuffle algorithm.
MapR provides transparent compression on the client making it easy to reduce data transmission over the network or on disk.
Finally MapR eliminated the impact of periodic Java garbage collection.
This kind of increase is simply impossible with current implementations because it is limited by the architecture itself.
Q8. There are several commercial Hadoop distribution companies on the market (e.g. Cloudera, Datameer, Greenplum, Hortonworks, Platform Computing to name a few). What is special about MapR?
John Schroeder: Datameer is not a Hadoop distribution provider they provide an analytic application and are a partner of MapR. Platform Computing is also not a Hadoop distribution provider. MapR is the only company to provide an enterprise-grade Hadoop distribution.
Q9. What do you mean with an enterprise-grade Hadoop distribution? Isnt` for example Cloudera (to name one) also an enterprise Hadoop distribution?
John Schroeder: There is no other alternative in the market for full HA, business continuity, real-time streaming, standard file-based access through NFS, full database access through ODBC, and support for mission-critical SLAs.
Q10. How do you support this claim? Did you do a market analysis? What other systems did you look at?
John Schroeder: We performed a complete review of available Hadoop distributions. The recent selections of MapR by Amazon as an integrated offering into their Amazon Elastic MapReduce service and by Google for their Google Compute Engine are further validations of MapR’s differentiated Hadoop offering.
With MapR users can mount a cluster using the standard NFS protocol. Applications can write directly into the cluster with no special clients or applications. Users can use every file-based application that has been developed over the last 20 years, ranging from BI applications to file browsers and command-line utilities such as grep or sort directly on data in the cluster. This dramatically simplifies development. Existing applications and workflows can be used and just the specific steps requiring parallel processing need to be converted to take advantage of the MapReduce framework.
MapR also delivers ease of data management. With clusters expanding well into the petabyte range simplifying how data is managed is critical. MapR uniquely supports volumes to make it easy to apply policies across directories and file contents without managing individual files.
These policies include data protection, retention, snapshots, and access privileges.
Additionally, MapR delivers business critical reliability. This includes full HA, business continuity and data protection features. In addition to replication, MapR includes snapshots and mirroring. Snapshots provide for point-in-time recovery. Mirroring provides backup to an alternate cluster, data center or between on-premise and private Hadoop clouds. These features provide a level of protection that is necessary for business critical uses.
Q11. What are the next generations of Hadoop distributions?
John Schroeder: The first generation of Hadoop surrounded the open source Apache project with services and management utilities. MapR is the next generation of Hadoop that combines standards-based innovations developed by MapR with open source components resulting in the industry’s only differentiated distribution that meets the needs of both the largest and emerging Hadoop installations.
MapR’s extensive engineering efforts have resulted in the first software distribution for Hadoop that provides extreme high performance, unprecedented scale, business continuity and is easy to deploy and manage.
Q12. Do you have any results to show us to support these claims “high performance, unprecedented scale”?
John Schroeder: We have many examples of high performance and scale. Google recently unveiled the Google Compute Engine with MapR on stage at the Google IO conference. We demonstrated a 1256 node cluster perform a Terasort in 1 minute and 20 seconds. One of our customers, comScore presented a session at the Hadoop Summit and showed how they process 30B internet events a day using MapR. As for scale differences we have a customer with 18 billion files in a single MapR cluster. By comparison, the largest clusters of other distributions max out around 200 million files.
Q13. What functionalities still need to be added to Hadoop to serve new business critical and real-time applications?
John Schroeder: Other Hadoop distributions present customers with several challenges including:
• Getting data in and out of Hadoop. Other Hadoop distributions are limited by the append-only nature of the Hadoop Distributed File System (HDFS) that requires programs to batch load and unload data into a cluster.
• Deploying Hadoop into mission critical business projects. The lack of reliability of current Hadoop software platforms is a major impediment for expansion.
• Protecting data against application and user errors. Hadoop has no backup and restore capabilities. Users have to contend with data loss or resort to very expensive solutions that reside outside the actual Hadoop cluster.
According to industry research firm, ESG, half of the companies they surveyed plan to leverage commercial distributions of Hadoop as opposed to the open source version. This trend indicates organizations are moving from experimental and pilot projects to mainstream applications with mission-critical requirements that include high availability, better performance, data protection, security, and ease of use.
Q14. There is work to be done training developers in learning advanced statistics and software (such as Hadoop) to ensure adoption in the Enterprise. Do you agree with this? What is your role here?
John Schroeder: Simply put the limitations of the Hadoop Distributed File System require whole scale changes to existing applications and extensive development of new ones. MapR’s next generation storage services layer provides full random/read support and provides direct access with NFS. This dramatically simplifies development. Existing applications and workflows can be used and just the specific steps requiring parallel processing need to be converted to take advantage of the MapReduce framework.
Q15. Are customers willing to share their private data?
John Schroeder: In general customers are concerned with the protection and security of their data. That said, we see growing adoption of Hadoop in the cloud. Amazon has a significant web-services business around Hadoop and recently added MapR as part of their Elastic MapReduce offering. Google has also announced the Google Compute Engine and integration with MapR.
Q16. Data quality from different sources is a problem. How do you handle this?
John Schroeder: Data quality issues can be similar to those in a traditional data warehouse. Scrubbing can be built into the Hadoop applications using algorithms similar to those used during ELT.
ETL and ELT can both accomplish data scrubbing. The storage/compute resources and ability to combine unlike datasets provide significant advantages to Hadoop-based ELT.
There are different views with respect to this issue. IT personnel that are used to traditional data warehouses are typically concerned with data quality and ETL processes. The advantage of Hadoop is that you can have disparate data from many different sources and different data types in the same cluster. Some advanced users have pointed out that “quality” issues are actually valuable information that can provide insight into issues, anomalies and opportunities. With Hadoop users have the flexibility to process and analyze. Analytics are not dependent on having a pre-defined schema.
Q17. Moving older data online. Is this a business opportunity for you?
John Schroeder: The advantage of Hadoop is performing compute on data. It makes much more sense to perform analytics directly on large data stores so you send only results over the network instead of dragging the entire data set over the network for processing. For this use case to be viable requires a highly reliable cluster with full data protection and business continuity features.
Q18. Yes, but what about big data that is not digitalized yet? This is what I meant with moving older data online.
John Schroeder: Most organizations are looking for a solution to help them cope with fast growing digital sources of machine generated content such as log files, sensor data, etc. Images, video and audio are also a fast growing data source that can provide valuation analytics.
John Schroeder, Cofounder and CEO, MapR.
John has led companies creating innovative and disruptive business intelligence, database management, storage and virtualization technologies at early stage ventures through success as large public companies. John founded MapR to produce the next generation Hadoop distribution to expand the use cases beyond batch Hadoop to include real-time, business critical, secure applications that easily integrate with file-based applications and products.
ODBMS.org Free Downloads and Links
In this section you can download free resources on Big Data and Analytical Data Platforms (Blog Posts | Free Software| Articles| PhD and Master Thesis)