On the new developments in Apache Spark and Hadoop. Interview with Amr Awadallah
“What this Big Data movement is about is using data to actually change our businesses in real-time (versus show the business leaders a report that they make a decision based on).”–Amr Awadallah
I have interviewed Amr Awadallah, Chief Technology Officer at Cloudera.
Main topics of the interview are: the new developments in Apache Spark 2.0 Beta, and Hadoop 3.0.0-alpha1 release ; the lessons learned from Amr´s experience of using Hadoop at Yahoo!; and the business problems that world’s leading organisations do have.
Q1. Before Cloudera, you served as Vice President of Product Intelligence Engineering at Yahoo!, and ran one of the very first organisations to use Hadoop for data analysis and business intelligence. What are the main lessons you learned in that period?
Amr Awadallah: Couple of things. First, I learned that Hadoop is capable of solving all the business intelligence problems that I had at Yahoo.
(1) our systems weren’t scaling fast enough (we needed to cut down transformation times from hours to minutes),
(2) our systems weren’t economical on a $/TB basis thus making it hard to retain valuable data for longer time periods, and (3) we needed new methods to be able to store and analyze semi-structured (e.g. logs) and unstructured data (e.g. social media).
By implementing Hadoop in our team we saw first hand how it can address all these problems. The second lesson that I learned was that Hadoop, back then, was very rough to deploy and program against (it took us many months to deploy it and reprogram our transformations to run on it). It was these lessons that made it clear that there is room for a startup to focus on Hadoop since (1) it was solving a very real data problems that many organizations will face, and (2) it needed a lot of polish to make it work smoothly, securely, and reliably within the enterprise.
Q2. In 2008 you founded Cloudera together with Mike Olson (Oracle), Jeff Hammerbacher (Facebook) and Christophe Bisciglia (Google). What was your main motivation at that time?
Amr Awadallah: Pretty much to do what I describe above, we wanted to make the Hadoop technology easy to use for organizations. That included: (1) creating a distribution for Hadoop that bundles all the necessary open-source projects that make it work (we call that CDH, short for Cloudera Distribution for Apache Hadoop). (2) We also created a number of proprietary system management, security, and meta-data management tools around CDH to make it easier for organizations to deploy and operate Hadoop in production.
Q3. What are the typical challenging business problems that world’s leading organisations have?
Amr Awadallah: The technology we provide is very powerful and can be used to solve many problems across many industries, but we see four common themes: The first is simply using Hadoop as a faster, bigger, cheaper system for business intelligence and data analytics. i.e. a lot of organizations just use us to do things they have been doing already, just doing these things in a more economically scalable way.
The second use case is around deeper understanding of customers, i.e. moving away from segmenting all customers into a number of predefined buckets, but rather creating a dynamic micro-segment addressing each customer in a more precise way (thus reducing false positives).
The third use case is about using data to build better products and services, and this use-case is catalyzed by of the internet-of-things. Due to smart-sensors we are able to measure the real-world better than ever before; so this use-case is about taking all that data and leveraging it to either enhance our current product/service offerings, or build entirely new ones.
The fourth use case is about reducing business risk, and it manifests itself in a number of different sub-cases depending on the industry. For example, cyber-security is one of the key ways to reduce risk, and we have an open source project co-developed with Intel, called Apache Spot, which organizations can use to collect all their network flow data then use Spark machine learning algorithms to detect the anomalies in that data. Anti-money laundering and fraud detection is another way that our banking customers employ our platform to reduce risk within their businesses. Similarly, our insurance industry customers use our system to detect fraudulent claims, etc.
Q4. Can they be solved by analysing data? Can you give us some examples of how the use of advanced analytics drive business decisions?
Amr Awadallah: Yes, all the problems mentioned above can be solved with data. I want to highlight though that this isn’t necessarily about business decisions, which is what the Business Intelligence movement was about (we just help make that cheaper and faster). What this Big Data movement is about is using data to actually change our businesses in real-time (versus show the business leaders a report that they make a decision based on).
One of my favorite examples is a solution that one of our customers built to give voice to premature babies in neonatal intensive care units. They analyze the signals coming from the baby (sounds, blood pressure, heart rate, temperature, few brain signals), and based on that a message appears on the monitor above the infant showing the nurse if they are hungry, distressed from too much noise or light, etc.
That is really what we mean by using data to create new products and services that weren’t possible before (and not just reports/dashboard).
Q4. Graphs are important. Is it possible to do scalable graph analytics? If yes, how?
Amr Awadallah: Graphs are indeed important, a lot of our customer use-cases trace back to that (not just for social media analytics, but for example anti-money laundering requires analyzing relationships between many financial accounts for detecting bad behaviors, similarly for cyber security applications). I think scalability depends a fair bit on what’s being analyzed and how scalable we mean by scalable. But for most practical purposes I would say Spark’s GraphX is good enough. For example, you can compute PageRank fairly efficiently and scalably on a cluster using GraphX.
Q5. Data security is increasing important. The risk is due to the growing number of device endpoints. What solutions do exist to minimise such risk?
Amr Awadallah: A comprehensive enterprise data security strategy seeks to mitigate the risks presented by a growing number of potentially compromised endpoints connecting to corporate networks. Endpoint security will enable one or all of the following preventative controls:
The first is policy based enforcement of endpoint security configuration prior to granting and endpoint access to network based corporate assets. This ensures that any endpoint connected to corporate networks meets minimum requirements for endpoint security configuration.
The second measure is endpoint based anti-malware software (the existence of which may be a policy requirement to connect to the network per the first measure). Anti-malware prevents malicious code from infecting endpoints by monitoring for changes to system configuration and unusual activity or processes.
The third measure is endpoint encryption of corporate data on hard drives, folders and even removable media.
As mentioned above we also collaborate with Intel on Apache Spot, which tracks network flow patterns to detect anomalous communication behavior between different devices (including end point devices). Apache Spot just recently won InfoWorld 2017 Tech of the Year Award. Other advanced analytics security partners we closely work with are: CounterTack, Securonix, Niara, and Jask.
Q6. You recently announced the availability of an Apache Spark 2.0 Beta release for users of the Cloudera platform. How does it work? And how does it differ from the Hadoop-based data platform?
Amr Awadallah: First, at a meta-level, Hadoop (MapReduce specifically) was very good at achieving scalable computation by spreading jobs across many CPU cores and hard disk spindles. That said, MapReduce wasn’t very efficient in how it leveraged memory to optimize the performance of data processing pipelines that have many stages or iterations.
The main power of Spark, that made it take over from MapReduce, was how it truly leveraged memory to achieve better performance in deep or iterative data pipelines. That coupled with a simpler developer API made Spark take over very quickly from MapReduce.
Most of our new customer implementations for data processing or data science tend to be in Spark these days, versus MapReduce.
I should clarify however that this doesn’t mean that Hadoop is dead as some say. Apache Hadoop is comprised of three key subsystems: (1) MapReduce for computation, (2) YARN for resource scheduling, and (3) HDFS for storage. Spark only replaces MapReduce, we still rely heavily on both YARN and HDFS.
That said, the most notable features in Apache Spark 2.0 are:
1) Dataset API: It is a new API that represents the distributed collections of objects processed by Spark’s execution engine. It is an extension of Spark’s Dataframe API. It improves upon the Dataframe API by providing type-safe, object oriented programming interfaces. Users can now write User-Defined Functions and Lambda functions that provide compile time type safety. With the Dataset API, users benefit from optimized operations (like sort, join, hash, etc) in the SparkSQL engine, while also getting compile time type safety for user defined functions.
2) Model & Pipeline Persistence in Spark’s ML library: Machine learning Pipelines built with Spark’s ML library can now be serialized to a file and read back in.
The ability to save and reload these pipelines makes it easy for users to perform version control on the pipelines and safely distribute the pipelines. This helps in operationalizing them in production systems.
3) Structured Streaming: New stream processing API and engine that provides SQL like abstractions for authoring operations on data streams, and also improves performance by using the SparkSQL engine for processing the data streams. However, this is still an experimental API and not ready for production usage yet.
Besides the above 3 notable enhancements, there are a bunch of performance and scalability improvements across the board.
Q7. Apache Impala vs. Amazon Redshift: How Does Redshift Compare to Impala?
Amr Awadallah: Apache Impala is an analytic database engine architecturally designed to perform high-performance highly-concurrent SQL analytics on scalable, open data platforms like Hadoop’s HDFS and Amazon S3.
Impala decouples data storage from compute and lets users query data without having to move/load data specifically into an Impala storage-engine (it doesn’t have one). This architectural difference uniquely enables Impala to deliver a more flexible Business Intelligence experience than traditional database architectures like Redshift (which requires pre-loading the data).
Some of the key benefits of the Impala approach include:
* On-demand resources that are immediately ready to query existing S3 data without loading to a different data silo
* Ability to elastically grow/shrink clusters as needed due to decoupled storage and compute
* More predictable, multi-tenant isolation due to the ability to have multiple Impala clusters sharing a common S3 data repository
* Ability to share common data not only amongst Impala clusters, but also any application that runs on cloud-native S3 storage (for example, you can have both Apache Impala and Apache Spark run against the same data asset in S3, while it isn’t possible to have Apache Spark easily access the data stored in Redshift, it has to go through SQL first).
* Greater flexibility to explore new use cases, analytics, and data by directly querying S3 without rigid traditional data models and ETL
Not only does Impala deliver this additional flexibility, it does so at greater cost-performance and scalability compared to Redshift. See the following benchmark for data on that.
That said, Redshift’s sweet spot is in a different target as a smaller datamart as most Redshift installations are in the dozen of nodes range where Redshift’s limitations in scalability, elasticity, flexibility, and requirement to maintain separate copies of data are less critical.
Q8. What is Apache Kudu, and why is it relevant for Impala Users?
Amr Awadallah: Historically we had two storage engines in our distribution: (1) HDFS which is optimized for high-throughput analytics, but doesn’t support updates/inserts and (2) HBase which is optimized for low-latency updates/inserts but isn’t good for doing high-throughput queries. To build a proper data warehouse or time-series analytics system, you typically still need to make updates/inserts and that was why we created Apache Kudu.
Kudu is a new storage system that combines the benefits of both HDFS and HBase into one: it allows for low-latency updates/inserts, but also supports high-throughput analytical queries (i.e. fast analytics on fast moving data).
Unlike HDFS, Kudu is not a file-system, it is a record-based system, so the unit of storage is a record as opposed to a file. This allows Kudu to unlock Impala for real-time streaming applications that were not possible with HDFS.
In HDFS the data would only be visible to Impala after we finish closing the file, which typically happens after a large number of records are accumulated (that adds latency between when records are written to when they become visible to the analytical engine). With Kudu as soon as a record is written it is immediately visible to the Impala analytical engine. Finally, just like HDFS and HBase, the Kudu storage engine is fully integrated with our entire stack, not just Impala.
For example, you can also use Apache Spark for machine-learning jobs directly against Kudu.
Q9. The Apache Hadoop project recently announced its 3.0.0-alpha1 release. What is it?
Amr Awadallah: HDFS Erasure Encoding is really the main exciting new feature in Hadoop 3. Traditionally HDFS required three replicas, by default, for every data block to achieve durability, concurrent performance, and availability. Using erasure encoding techniques, HDFS in Hadoop 3 allows us to significantly reduce the storage overhead from 3x (i.e. 200%) to just 20% extra bits for parity. This will allow us to achieve the same durability benefits of 3x replication, but comes at the cost of potentially lower concurrent performance (when more than one job are trying to access the same block at same time) and lower availability resilience in face of top-of-rack switch failures (less of an issue these days).
Other cool additions are ATS v2 and classpath isolation which you can read more about here
Q10. What is the roadmap ahead for Cloudera Enterprise?
Amr Awadallah: We don’t discuss details of our product roadmap publicly, but there are three guiding themes for us in 2017: The first theme is fast-analytics on fast-moving data (which I covered above in regards to Kudu).
The second theme is cloud, which is making Cloudera Enterprise work better in cloud environments, and make it easier to move workloads (and skill sets) from on-premise clusters to transient cloud clusters in AWS, Azure, and/or Google Cloud.
The third theme is simplifying data-science and machine learning development, especially reducing the time from when a new algorithm is developed to how it can be deployed into production (stay tuned for more on that front).
Amr Awadallah, Ph.D. Chief Technology Officer, Cloudera
Before co-founding Cloudera in 2008, Amr (@awadallah) was an Entrepreneur-in-Residence at Accel Partners. Prior to joining Accel he served as Vice President of Product Intelligence Engineering at Yahoo!, and ran one of the very first organizations to use Hadoop for data analysis and business intelligence. Amr joined Yahoo after they acquired his first startup, VivaSmart, in July of 2000. Amr holds a Bachelor’s and Master’s degrees in Electrical Engineering from Cairo University, Egypt, and a Doctorate in Electrical Engineering from Stanford University.
Follow us on Twitter: @odbmsorg