ODBMS Industry Watch » Cloudera
Trends and Information on Big Data, New Data Management Technologies, Data Science and Innovation.

On the new developments in Apache Spark and Hadoop. Interview with Amr Awadallah
http://www.odbms.org/blog/2017/03/on-the-new-developments-in-apache-spark-and-hadoop-interview-with-amr-awadallah/
Mon, 13 Mar 2017

“What this Big Data movement is about is using data to actually change our businesses in real-time (versus show the business leaders a report that they make a decision based on).”–Amr Awadallah

I have interviewed Amr Awadallah, Chief Technology Officer at Cloudera.  
The main topics of the interview are: the new developments in the Apache Spark 2.0 Beta and Hadoop 3.0.0-alpha1 releases; the lessons learned from Amr’s experience of using Hadoop at Yahoo!; and the business problems that the world’s leading organisations face.


Q1. Before Cloudera, you served as Vice President of Product Intelligence Engineering at Yahoo!, and ran one of the very first organisations to use Hadoop for data analysis and business intelligence. What are the main lessons you learned in that period?

Amr Awadallah: A couple of things. First, I learned that Hadoop is capable of solving all the business intelligence problems that I had at Yahoo:
(1) our systems weren’t scaling fast enough (we needed to cut down transformation times from hours to minutes),
(2) our systems weren’t economical on a $/TB basis, which made it hard to retain valuable data for longer time periods, and
(3) we needed new methods to store and analyze semi-structured data (e.g. logs) and unstructured data (e.g. social media).
By implementing Hadoop in our team we saw first hand how it can address all these problems. The second lesson I learned was that Hadoop, back then, was very rough to deploy and program against (it took us many months to deploy it and reprogram our transformations to run on it). These lessons made it clear that there was room for a startup to focus on Hadoop, since (1) it was solving very real data problems that many organizations would face, and (2) it needed a lot of polish to make it work smoothly, securely, and reliably within the enterprise.

Q2. In 2008 you founded Cloudera together with Mike Olson (Oracle), Jeff Hammerbacher (Facebook) and Christophe Bisciglia (Google). What was your main motivation at that time?

Amr Awadallah: Pretty much to do what I described above: we wanted to make the Hadoop technology easy to use for organizations. That included (1) creating a distribution for Hadoop that bundles all the necessary open-source projects that make it work (we call that CDH, short for Cloudera Distribution for Apache Hadoop), and (2) creating a number of proprietary system management, security, and metadata management tools around CDH to make it easier for organizations to deploy and operate Hadoop in production.

Q3. What are the typical challenging business problems that the world’s leading organisations have?

Amr Awadallah: The technology we provide is very powerful and can be used to solve many problems across many industries, but we see four common themes: The first is simply using Hadoop as a faster, bigger, cheaper system for business intelligence and data analytics. i.e. a lot of organizations just use us to do things they have been doing already, just doing these things in a more economically scalable way.
The second use case is around deeper understanding of customers, i.e. moving away from segmenting all customers into a number of predefined buckets, but rather creating a dynamic micro-segment addressing each customer in a more precise way (thus reducing false positives).
The third use case is about using data to build better products and services, and this use case is catalyzed by the internet of things. Thanks to smart sensors we are able to measure the real world better than ever before, so this use case is about taking all that data and leveraging it to either enhance our current product/service offerings or build entirely new ones.
The fourth use case is about reducing business risk, and it manifests itself in a number of different sub-cases depending on the industry. For example, cyber-security is one of the key ways to reduce risk, and we have an open source project co-developed with Intel, called Apache Spot, which organizations can use to collect all their network flow data then use Spark machine learning algorithms to detect the anomalies in that data. Anti-money laundering and fraud detection is another way that our banking customers employ our platform to reduce risk within their businesses. Similarly, our insurance industry customers use our system to detect fraudulent claims, etc.

Q4. Can they be solved by analysing data? Can you give us some examples of how the use of advanced analytics drives business decisions?

Amr Awadallah: Yes, all the problems mentioned above can be solved with data. I want to highlight though that this isn’t necessarily about business decisions, which is what the Business Intelligence movement was about (we just help make that cheaper and faster). What this Big Data movement is about is using data to actually change our businesses in real-time (versus show the business leaders a report that they make a decision based on).
One of my favorite examples is a solution that one of our customers built to give voice to premature babies in neonatal intensive care units. They analyze the signals coming from the baby (sounds, blood pressure, heart rate, temperature, a few brain signals), and based on that a message appears on the monitor above the infant telling the nurse whether the baby is hungry, distressed from too much noise or light, etc.
That is really what we mean by using data to create new products and services that weren’t possible before (and not just reports/dashboard).

Q4. Graphs are important. Is it possible to do scalable graph analytics? If yes, how?

Amr Awadallah: Graphs are indeed important; a lot of our customer use cases trace back to them. This is not just social media analytics: anti-money laundering, for example, requires analyzing relationships between many financial accounts to detect bad behavior, and cyber security applications are similar. I think scalability depends a fair bit on what’s being analyzed and how scalable we mean by scalable. But for most practical purposes I would say Spark’s GraphX is good enough. For example, you can compute PageRank fairly efficiently and scalably on a cluster using GraphX.
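To make the PageRank reference concrete, here is a minimal single-machine sketch of the power-iteration idea in plain Python. It is not GraphX and does not use Spark; the function name `pagerank`, the damping value, the iteration count, and the account-transfer edges are all assumptions chosen for illustration. A cluster implementation distributes the same per-edge "share" computation across machines.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (src, dst) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    # Start from a uniform distribution over all nodes.
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node receives a base share from the "random jump" term.
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            # Nodes without outgoing edges spread their rank over all nodes.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical transfer graph: (from_account, to_account).
transfers = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]
ranks = pagerank(transfers)
most_central = max(ranks, key=ranks.get)
```

In this toy graph account `c` receives transfers from three other accounts and ends up with the highest rank, while `d`, which nobody transfers to, ends up with the lowest.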

Q5. Data security is increasingly important. The risk is due to the growing number of device endpoints. What solutions exist to minimise such risk?

Amr Awadallah: A comprehensive enterprise data security strategy seeks to mitigate the risks presented by a growing number of potentially compromised endpoints connecting to corporate networks. Endpoint security will enable one or all of the following preventative controls:
The first is policy-based enforcement of endpoint security configuration prior to granting an endpoint access to network-based corporate assets. This ensures that any endpoint connected to corporate networks meets minimum requirements for endpoint security configuration.
The second measure is endpoint-based anti-malware software (the existence of which may be a policy requirement to connect to the network per the first measure). Anti-malware prevents malicious code from infecting endpoints by monitoring for changes to system configuration and unusual activity or processes.
The third measure is endpoint encryption of corporate data on hard drives, folders and even removable media.
As mentioned above, we also collaborate with Intel on Apache Spot, which tracks network flow patterns to detect anomalous communication behavior between different devices (including endpoint devices). Apache Spot recently won an InfoWorld 2017 Technology of the Year Award. Other advanced analytics security partners we work closely with are CounterTack, Securonix, Niara, and Jask.

Q6. You recently announced the availability of an Apache Spark 2.0 Beta release for users of the Cloudera platform. How does it work? And how does it differ from the Hadoop-based data platform?

Amr Awadallah: First, at a meta-level, Hadoop (MapReduce specifically) was very good at achieving scalable computation by spreading jobs across many CPU cores and hard disk spindles. That said, MapReduce wasn’t very efficient in how it leveraged memory to optimize the performance of data processing pipelines that have many stages or iterations.
The main power of Spark, that made it take over from MapReduce, was how it truly leveraged memory to achieve better performance in deep or iterative data pipelines. That coupled with a simpler developer API made Spark take over very quickly from MapReduce.
Most of our new customer implementations for data processing or data science tend to be in Spark these days, versus MapReduce.
I should clarify, however, that this doesn’t mean that Hadoop is dead, as some say. Apache Hadoop comprises three key subsystems: (1) MapReduce for computation, (2) YARN for resource scheduling, and (3) HDFS for storage. Spark only replaces MapReduce; we still rely heavily on both YARN and HDFS.

That said, the most notable features in Apache Spark 2.0 are:

1) Dataset API: It is a new API that represents the distributed collections of objects processed by Spark’s execution engine. It is an extension of Spark’s DataFrame API, and improves upon it by providing type-safe, object-oriented programming interfaces. Users can now write user-defined functions and lambda functions with compile-time type safety. With the Dataset API, users benefit from optimized operations (like sort, join, hash, etc.) in the SparkSQL engine, while also getting compile-time type safety for user-defined functions.

2) Model & Pipeline Persistence in Spark’s ML library: Machine learning Pipelines built with Spark’s ML library can now be serialized to a file and read back in.
The ability to save and reload these pipelines makes it easy for users to perform version control on the pipelines and safely distribute the pipelines. This helps in operationalizing them in production systems.

3) Structured Streaming: New stream processing API and engine that provides SQL like abstractions for authoring operations on data streams, and also improves performance by using the SparkSQL engine for processing the data streams. However, this is still an experimental API and not ready for production usage yet.

Besides the above 3 notable enhancements, there are a bunch of performance and scalability improvements across the board.

Q7. Apache Impala vs. Amazon Redshift: How Does Redshift Compare to Impala?

Amr Awadallah: Apache Impala is an analytic database engine architecturally designed to perform high-performance highly-concurrent SQL analytics on scalable, open data platforms like Hadoop’s HDFS and Amazon S3.
Impala decouples data storage from compute and lets users query data without having to move/load data specifically into an Impala storage-engine (it doesn’t have one). This architectural difference uniquely enables Impala to deliver a more flexible Business Intelligence experience than traditional database architectures like Redshift (which requires pre-loading the data).

Some of the key benefits of the Impala approach include:

* On-demand resources that are immediately ready to query existing S3 data without loading to a different data silo
* Ability to elastically grow/shrink clusters as needed due to decoupled storage and compute
* More predictable, multi-tenant isolation due to the ability to have multiple Impala clusters sharing a common S3 data repository
* Ability to share common data not only amongst Impala clusters, but also with any application that runs on cloud-native S3 storage (for example, you can have both Apache Impala and Apache Spark run against the same data asset in S3, whereas Apache Spark cannot easily access the data stored in Redshift; it has to go through SQL first).
* Greater flexibility to explore new use cases, analytics, and data by directly querying S3 without rigid traditional data models and ETL

Not only does Impala deliver this additional flexibility, it does so at greater cost-performance and scalability compared to Redshift. See the following benchmark for data on that.

That said, Redshift’s sweet spot is a different target: the smaller data mart. Most Redshift installations are in the dozen-node range, where Redshift’s limitations in scalability, elasticity, and flexibility, and its requirement to maintain separate copies of data, are less critical.

Q8. What is Apache Kudu, and why is it relevant for Impala Users?

Amr Awadallah: Historically we had two storage engines in our distribution: (1) HDFS which is optimized for high-throughput analytics, but doesn’t support updates/inserts and (2) HBase which is optimized for low-latency updates/inserts but isn’t good for doing high-throughput queries. To build a proper data warehouse or time-series analytics system, you typically still need to make updates/inserts and that was why we created Apache Kudu.

Kudu is a new storage system that combines the benefits of both HDFS and HBase into one: it allows for low-latency updates/inserts, but also supports high-throughput analytical queries (i.e. fast analytics on fast moving data).
Unlike HDFS, Kudu is not a file-system, it is a record-based system, so the unit of storage is a record as opposed to a file. This allows Kudu to unlock Impala for real-time streaming applications that were not possible with HDFS.
In HDFS the data would only be visible to Impala after we finish closing the file, which typically happens after a large number of records are accumulated (that adds latency between when records are written to when they become visible to the analytical engine). With Kudu as soon as a record is written it is immediately visible to the Impala analytical engine. Finally, just like HDFS and HBase, the Kudu storage engine is fully integrated with our entire stack, not just Impala.
For example, you can also use Apache Spark for machine-learning jobs directly against Kudu.
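The visibility difference described above can be sketched with two toy in-memory classes. This is only an illustration of the file-versus-record distinction; neither class reflects the real HDFS or Kudu APIs, and the names `FileBackedTable` and `RecordBackedTable` are hypothetical.

```python
class FileBackedTable:
    """Toy model: rows become readable only after the open file is closed."""

    def __init__(self):
        self._open_file = []
        self._closed_files = []

    def write(self, record):
        # Buffered in the file that is still being written.
        self._open_file.append(record)

    def close_file(self):
        # Only now do the buffered rows become part of the readable data.
        self._closed_files.append(self._open_file)
        self._open_file = []

    def scan(self):
        return [row for closed in self._closed_files for row in closed]


class RecordBackedTable:
    """Toy model: each row is readable as soon as the write returns."""

    def __init__(self):
        self._rows = {}

    def upsert(self, key, record):
        # Insert or update a single row by key.
        self._rows[key] = record

    def scan(self):
        return list(self._rows.values())


files = FileBackedTable()
records = RecordBackedTable()
for i in range(3):
    files.write({"id": i})
    records.upsert(i, {"id": i})

visible_before_close = len(files.scan())
visible_in_record_store = len(records.scan())
files.close_file()
visible_after_close = len(files.scan())
```

Before the file is closed, a scan of the file-backed table sees nothing while the record-backed table already returns all three rows; that gap is the latency the answer refers to.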

Q9. The Apache Hadoop project recently announced its 3.0.0-alpha1 release. What is it?

Amr Awadallah: HDFS Erasure Coding is really the main exciting new feature in Hadoop 3. Traditionally, HDFS required three replicas, by default, for every data block to achieve durability, concurrent performance, and availability. Using erasure coding techniques, HDFS in Hadoop 3 allows us to significantly reduce the storage overhead from 3x (i.e. 200% extra) to just 20% extra bits for parity. This allows us to achieve the same durability benefits as 3x replication, but comes at the cost of potentially lower concurrent performance (when more than one job is trying to access the same block at the same time) and lower availability resilience in the face of top-of-rack switch failures (less of an issue these days).
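The overhead arithmetic above can be checked with a short sketch. The helper names are hypothetical, and the layouts are examples only: a figure of 20% corresponds to something like 10 data units plus 2 parity units, while a layout with more parity, such as 6 data plus 3 parity, costs 50%. The recovery demo uses simple XOR parity, which tolerates a single lost block; production systems typically use Reed-Solomon codes, which this sketch does not implement.

```python
def storage_overhead(data_units, redundant_units):
    """Extra storage as a fraction of the original data size."""
    return redundant_units / data_units


def xor_parity(blocks):
    """Byte-wise XOR of equal-length blocks."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)


# Three full copies: one original plus two redundant copies.
replication_3x = storage_overhead(1, 2)
# Hypothetical layout of 10 data units and 2 parity units.
ten_plus_two = storage_overhead(10, 2)
# Hypothetical layout of 6 data units and 3 parity units.
six_plus_three = storage_overhead(6, 3)

# Single-parity demo: store three data blocks plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_parity(data)

# Pretend data[1] was lost; XOR of the survivors and the parity rebuilds it.
recovered = xor_parity([data[0], data[2], parity])
```

The trade-off noted in the answer is visible here: with replication any surviving copy can serve a read directly, whereas a lost erasure-coded block has to be recomputed from several other blocks.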

Other cool additions are ATS v2 and classpath isolation which you can read more about here

Q10. What is the roadmap ahead for Cloudera Enterprise?

Amr Awadallah: We don’t discuss details of our product roadmap publicly, but there are three guiding themes for us in 2017: The first theme is fast-analytics on fast-moving data (which I covered above in regards to Kudu).
The second theme is cloud: making Cloudera Enterprise work better in cloud environments, and making it easier to move workloads (and skill sets) from on-premises clusters to transient cloud clusters in AWS, Azure, and/or Google Cloud.
The third theme is simplifying data-science and machine learning development, especially reducing the time from when a new algorithm is developed to how it can be deployed into production (stay tuned for more on that front).
Amr Awadallah, Ph.D. Chief Technology Officer, Cloudera
Before co-founding Cloudera in 2008, Amr (@awadallah) was an Entrepreneur-in-Residence at Accel Partners. Prior to joining Accel he served as Vice President of Product Intelligence Engineering at Yahoo!, and ran one of the very first organizations to use Hadoop for data analysis and business intelligence. Amr joined Yahoo after they acquired his first startup, VivaSmart, in July of 2000. Amr holds Bachelor’s and Master’s degrees in Electrical Engineering from Cairo University, Egypt, and a Doctorate in Electrical Engineering from Stanford University.


On Hadoop and Big Data. Interview with Lawrence Schwartz
http://www.odbms.org/blog/2015/08/on-hadoop-and-big-data-interview-with-lawrence-schwartz/
Wed, 19 Aug 2015

“The best way to define Big Data ROI is to look at how our customers define it and benefit from Hadoop.
Wellcare has been able to improve its query speeds from 30 days to just 7 days. This acceleration enabled the Company to increase its analytics and operational reporting by 73%.”–Lawrence Schwartz

I have interviewed Lawrence Schwartz, Chief Marketing Officer, Attunity.


Q1. What are the common challenges that enterprises face when trying to use Hadoop?

Lawrence Schwartz: The advent of Hadoop and Big Data has significantly changed the way organizations handle data. There’s a need now for new skills, new organizational processes, new strategies and technologies to adapt to the new playing field. It’s a change that permeates everywhere from how you touch the data, to how much you can support resource-wise and architecturally, to how you manage it and use it to stay competitive. Hadoop itself presents two primary challenges. First, the data has to come from somewhere. Enterprises must efficiently load high volumes of widely-varied data in a timely fashion. We can help with software that enables automated bulk loading into Hadoop without manual coding, and change data capture for efficient updates. The second challenge is finding engineers and Data Scientists with the right skills to exploit Hadoop. Talent is scarce in this area.

Q2. Could you give us some examples of how your customers use Hadoop for their businesses?

Lawrence Schwartz: We have an interesting range of customers using Hadoop, so I’ll provide three examples. One major cable provider we are working with uses Hadoop as a data lake. They are integrating feeds from 200 data stores into Pivotal HD. This data lake includes fresh enterprise data – fed in real-time, not just as an archival area – to run up-to-date reporting and analytics without hitting key transactional systems. This enables them to improve decision support and gain competitive advantage.

Another example involves a Fortune 50 high-technology manufacturer. This customer’s business analytics requirements were growing exponentially, straining IT resources, systems and budgets.
The company selected Attunity Visibility to help it better understand its enterprise-wide data usage analytics across its various data platforms.
Having this capability enables the company to optimize business performance and maximize its investment in its Hadoop, data warehouse and business analytics systems. Attunity Visibility has helped to improve the customer’s system throughput by 25% enabling them to onboard new analytic applications without increasing investment in data warehouse infrastructure.

The third example is a financial services institution. This customer has many different data sources, including Hadoop, and one of its key initiatives is to streamline and optimize fraud detection. Using a historical analysis component, the organization would monitor real-time activity against historical trends to detect any suspicious activity. For example, if you go to a grocery store outside of your normal home ZIP code one day and pay for your goods with a credit card, this could trigger an alert at your bank. The bank would then see that you historically did not use your credit card at that retailer, prompting them to put a hold on your card, but potentially preventing a thief from using your card unlawfully. Using Attunity to leverage both historical and real-time transactions in its analytics, this company is able to decrease fraud and improve customer satisfaction.
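The grocery-store scenario above boils down to comparing each incoming transaction with what has been seen historically for the same card. A minimal sketch of that check follows; the class name `HistoryCheck`, the card identifier, and the ZIP codes are hypothetical, and a real fraud system would weigh many more signals than a single ZIP-code lookup.

```python
from collections import defaultdict


class HistoryCheck:
    """Flags a purchase made in a ZIP code never seen before for that card."""

    def __init__(self):
        self._seen = defaultdict(set)

    def record(self, card_id, zip_code):
        # Historical component: remember where the card has been used.
        self._seen[card_id].add(zip_code)

    def is_suspicious(self, card_id, zip_code):
        history = self._seen.get(card_id)
        # With no history there is nothing to compare against, so no alert.
        if not history:
            return False
        # Real-time component: compare the new purchase against history.
        return zip_code not in history


check = HistoryCheck()
for past_zip in ["94105", "94105", "94107"]:
    check.record("card-1", past_zip)

home_purchase = check.is_suspicious("card-1", "94105")
away_purchase = check.is_suspicious("card-1", "10001")
```

Here the purchase in a familiar ZIP code passes, while the one in an unfamiliar ZIP code is flagged for review.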

Q3. How difficult is it to perform deep insight into data usage patterns? 

Lawrence Schwartz: Historically, enterprises just haven’t had the tools to efficiently understand how datasets and data warehouse infrastructure are being used. We provide Visibility software that uniquely enables organizations to understand how tables and other data warehouse components are being used by business lines, departments, organizations, etc. It continuously collects, stores, and analyzes all queries and applications against data warehouses. These are then correlated with data usage and workload performance metrics in a centralized repository that provides detailed usage and performance metrics for the entire data warehouse. With this insight, organizations can place the right data on the right platform at the right time. This can reduce the cost and complexity of managing multiple platforms.

Q4. Do you believe that moving data across platforms is a feasible alternative for Big Data? 

Lawrence Schwartz: It really has to be, because nearly every enterprise has more than one platform, even before Hadoop is considered in the mix. Having multiple types of platforms also yields the benefits and challenges of trying to tier data based on its value, between data warehouses, Hadoop, and cloud offerings. Our customers rely on Attunity to help them with this challenge every day. Moving heterogeneous data in many different formats, and from many different sources is challenging when you don’t have the right tools or resources at your disposal. The problem gets magnified when you’re under the gun to meet real-time SLAs. In order to be able to do all of that well, you need to have a way to understand what data to move, and how to move the data easily, seamlessly and in a timely manner. Our solutions make the whole process of data management and movement automated and seamless, and that’s our hallmark.

Q5. What is “Application Release Automation” and why is it important for enterprises?

Lawrence Schwartz: Application release automation (ARA) solutions are a proven way to support Agile development, accelerate release cycles, and standardize deployment processes across all tiers of the application and content lifecycles. ARA solutions can be used to support a wide variety of activities, ranging from publishing and modifying web site content to deploying web-based tools, distributing software to business end users, and moving code between Development, Test, and Production environments.

Attunity addresses this market with an automation platform for enterprise server, web operations, shared hosting, and data center operations teams. Attunity ARA solutions are designed to offload critical, time-consuming deployment processes in complex enterprise IT environments. Enterprises that adopt ARA solutions enjoy greater business flexibility, improved productivity, better cross-team collaboration, and improved consistency.

Q6. What are your relationships with other Hadoop vendors?

Lawrence Schwartz: Attunity has great working partnerships with all of the major Hadoop platform vendors, including Cloudera, Hortonworks, Pivotal and MapR. We have terrific synergy and work together towards a common goal – to help our customers meet the demands of a growing data infrastructure, optimize their Big Data environments, and make onboarding to Hadoop as easy as possible. Our solutions are certified with each of these vendors, so customers feel confident knowing that they can rely on us to deliver a complete and seamless joint solution for Hadoop.

Q7. Attunity recently acquired Appfluent Technology, Inc. and BIReady. Why Appfluent Technology? Why BIReady? How do these acquisitions fit into Attunity’s overall strategy?

Lawrence Schwartz: When we talk with enterprises today, we hear about how they are struggling to manage mountains of growing data and looking for ways to make complex processes easier. We develop software and acquire companies that help our customers streamline and optimize existing systems as well as scale to meet the growing demands of business.

Appfluent brings the Visibility software I described earlier. With Visibility, companies can rebalance data to improve performance and cost in high-scale, rapidly growing environments. They also can meet charge-back, show-back and audit requirements.

BIReady, now known as Attunity Compose, helps enterprises build and update data warehouses more easily. Data warehouse creation and administration is among the most labor-intensive and time-consuming aspects of analytics preparation. Attunity Compose overcomes the complexity with automation, using significantly fewer resources. It automatically designs, generates and populates enterprise data warehouses and data marts, adding data modeling and structuring capabilities inside the data warehouse.

Q8. How do you define Big Data ROI?

Lawrence Schwartz: The best way to define this is to look at how our customers define it and benefit from Hadoop.

One of our Fortune 500 customers is Wellcare, which provides managed care services to government-sponsored healthcare programs like Medicaid and Medicare. Wellcare plans to use our software to load data from its Pivotal data warehouse into Hadoop, where they will do much of their data processing and transformations. They will then move a subset of that data from Hadoop back into Pivotal and run their analytics from there. So in this case Hadoop is a staging area. As a result of implementing the first half of this solution (moving data from various databases into Pivotal), Wellcare has been able to improve its query speeds from 30 days to just 7 days. This acceleration enabled the Company to increase its analytics and operational reporting by 73%. At the same time, the solution helps Wellcare meet regulatory requirements in a timely manner more easily, ensuring that it receives the state and federal funding required to run efficiently and productively.

In another example, one of our customers, a leading online travel services company, was dealing with exploding data volumes, escalating costs and an insatiable appetite for business analytics. They selected Attunity Visibility to reduce costs and improve information agility by offloading data and workload from their legacy data warehouse systems to a Hadoop Big Data platform. Attunity Visibility has saved the company over $6 million in two years by ensuring that the right workload and data are stored and processed on the most cost-effective platform based on usage.


On Hadoop and Big Data. Interview with John Leach
http://www.odbms.org/blog/2015/07/on-hadoop-and-big-data-interview-with-john-leach/
Mon, 13 Jul 2015

“One common struggle for data-driven enterprises is managing unnecessarily complicated data workflows with bloated ETL pipelines and a lack of native system integration.”– John Leach

I have interviewed John Leach, CTO & Cofounder, Splice Machine. The main topics of the interview are Hadoop, Big Data integration and what Splice Machine has to offer in this space. Monte Zweben, CEO of Splice Machine, also contributed to the interview.


Q1. What are the Top Ten Pitfalls to Avoid in a SQL-on-Hadoop Implementation?

John Leach, Monte Zweben:
1. Individual record lookups. Most SQL-on-Hadoop engines are designed for full table scans in analytics, but tend to be too slow for the individual record lookups and range scans used by operational applications.
2. Dirty Data. Dirty data is a problem for any system, but it is compounded in Big Data, often resulting in bad reports and delays to reload an entire data set.
3. Sharding. It can be difficult to know which key to distribute data on and what the right shard size is. Getting this wrong results in slow queries, especially for large joins or aggregations.
4. Hotspotting. This happens when data becomes too concentrated in a few nodes, especially for time series data. The impact is slow queries and poor parallelization.
5. SQL coverage. Limited SQL dialects will make it so you can’t run queries to meet business needs. You’ll want to make sure you do your homework. Compile the list of toughest queries and test.
6. Concurrency. Low concurrency can result in the inability to power real-time apps, handle many users, support many input sources, and deliver reports as updates happen.
7. Columnar. Not all columnar solutions are created equal. Besides columnar storage, there are many other optimizations, such as vectorization and run-length encoding, that can have a big impact on analytic performance. If your OLAP queries run slowly, which is common with large joins and aggregations, productivity suffers: queries may take minutes or hours instead of seconds. The flip side is using a columnar engine when what you need is concurrency and real-time access.
8. Node Sizing. Do your homework and profile your workload. Choosing the wrong node size (e.g., CPU cores, memory) can negatively impact price/performance and create performance bottlenecks.
9. Brittle ETL on Hadoop. With many SQL-on-Hadoop solutions being unable to provide update or delete capabilities without a full data reload, this can cause a very brittle ETL that will require restarting your ETL pipeline because of errors or data quality issues. The result is a missed ETL window and delayed reports to business users.
10. Cost-Based Optimizer. A cost-based optimizer improves performance by selecting the right join strategy, the right index, and the right ordering. Some SQL-on-Hadoop engines have no cost-based optimizer or relatively immature ones that can result in poor performance and poor productivity, as well as manual tuning by DBAs.
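Points 3 and 4 in the list above can be made concrete with a small sketch. It is an illustration only, not taken from any particular engine: the shard count, the one-day key range, and the `device-N` identifiers are all made-up values. It compares range partitioning on a timestamp key with a salted key whose first component is a stable hash of another column.

```python
import hashlib
from collections import Counter
from typing import Tuple

NUM_SHARDS = 8                      # hypothetical cluster size
DAY_START, DAY_END = 0, 86_400      # one day of key space, in seconds


def range_shard(timestamp: int) -> int:
    """Range partitioning: each shard owns a contiguous slice of the day."""
    width = (DAY_END - DAY_START) // NUM_SHARDS
    return min((timestamp - DAY_START) // width, NUM_SHARDS - 1)


def salted_row_key(timestamp: int, device_id: str) -> Tuple[int, int]:
    """Salted key: a stable hash of another column is placed in front of the timestamp."""
    salt = hashlib.md5(device_id.encode("utf-8")).digest()[0] % NUM_SHARDS
    return (salt, timestamp)


# All current writes carry timestamps from the last minute of the day.
recent_writes = [(DAY_END - 60 + i % 60, f"device-{i}") for i in range(1_000)]

# Number of writes landing on each shard under the two schemes.
range_load = Counter(range_shard(ts) for ts, _ in recent_writes)
salted_load = Counter(salted_row_key(ts, dev)[0] for ts, dev in recent_writes)

print("range partitioned:", dict(range_load))
print("salted:           ", dict(sorted(salted_load.items())))
```

With range partitioning every recent write lands on the last shard, which is the hotspot. The salted key spreads the same writes across the shards, at the price that a scan over a time range now has to visit every shard.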

Q2. In your experience, what are the most common problems in Big Data integration?

John Leach, Monte Zweben: Providing users access to data in a fashion they can understand and at the moment they need it, while ensuring quality and security, can be incredibly challenging.

The volume and velocity of data that businesses are churning out, along with the variety of different sources, can pose many issues.

One common struggle for data-driven enterprises is managing unnecessarily complicated data workflows with bloated ETL pipelines and a lack of native system integration. Businesses may also find their skill sets, workload, and budgets over-stretched by the need to manage terabytes or petabytes of structured and unstructured data in a way that delivers genuine value to business users.

When data is siloed and there is no solution put into place, businesses can’t access the real-time insights they need to make the best decisions for their business. Performance goes down, headaches abound and cost goes way up, all in the effort to manage the data. That’s why a Big Data integration solution is a prerequisite for getting the best performance and the most real-time insights, at the lowest cost.

Q3. What are the capabilities of Hadoop beyond data storage?

John Leach, Monte Zweben: Hadoop has a very broad range of capabilities and tools:

Oozie for workflow
Pig for scripting
Mahout or SparkML for machine learning
Kafka and Storm for streaming
Flume and Sqoop for integration
Hive, Impala, Spark, and Drill for SQL analytic querying
HBase for NoSQL
Splice Machine for operational, transactional RDBMS

Q4. What programming skills are required to handle application development around Big Data platforms like Hadoop?

John Leach, Monte Zweben: To handle application development on Hadoop, individuals can choose between raw Hadoop and SQL-on-Hadoop. When going the SQL route, very few new skills are required and developers can open connections to an RDBMS on Hadoop just like they used to do on Oracle, DB2, SQL Server, or Teradata. Raw Hadoop application developers should know their way around the core components of the Hadoop stack–such as HDFS, MapReduce, Kafka, Storm, Oozie, Hive, Pig, HBase, and YARN. They should also be proficient in Java.

Q5. What are the current challenges for real-time application deployment on Hadoop?

John Leach, Monte Zweben: When we talk about real-time at Splice Machine, we’re focused on applications that require not only real-time responses to queries, but also real-time database updates from a variety of data sources. The former is not all that uncommon on Hadoop; the latter is nearly impossible for most Hadoop-based systems.

Deploying real-time applications on Hadoop is really a function of moving Hadoop beyond its batch processing roots to be able to handle real-time database updates with high concurrency and transactional integrity. We harness HBase along with a lockless snapshot isolation design to provide full ACID transactions across rows and tables.

This technology enables Splice Machine to execute the high concurrency of transactions required by real-time applications.

Q6. What is special about Splice Machine auto-sharding replication and failover technology?

John Leach, Monte Zweben: As part of its automatic sharding, HBase horizontally partitions or splits each table into smaller chunks or shards that are distributed across multiple servers. Using the inherent failover and replication capabilities of HBase and Hadoop, Splice Machine can support applications that demand high availability.

HBase co-processors are used to embed Splice Machine in each distributed HBase region (i.e., data shard). This enables Splice Machine to achieve massive parallelization by pushing the computation down to each distributed data shard without any overhead of MapReduce.
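The idea of pushing computation down to each data shard can be sketched in a few lines. This is a toy model under stated assumptions, not the HBase co-processor API or Splice Machine's implementation: shards are plain in-memory lists, the "cluster" is a thread pool, and the function names are invented for the example. Each shard computes a small partial result next to its own data, and only those partials travel to the coordinator.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Shard = List[float]  # stand-in for the rows held by one region/shard


def local_partial(shard: Shard) -> Tuple[float, int]:
    """Runs next to the data: reduce a shard to a (sum, count) pair."""
    return (sum(shard), len(shard))


def distributed_average(shards: List[Shard]) -> float:
    """Coordinator: gather one small partial per shard and merge them."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(local_partial, shards))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    if count == 0:
        raise ValueError("cannot average an empty table")
    return total / count


shards = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
average = distributed_average(shards)
print(average)
```

The rows themselves never move. The coordinator sees one pair of numbers per shard, which is why this style of execution parallelizes well for aggregations.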

Q7. How difficult is it for customers to migrate from legacy databases to Splice Machine?

John Leach, Monte Zweben: Splice Machine offers a variety of services to help businesses efficiently deploy the Splice Machine database and derive maximum value from their investment. These services include both implementation consulting and educational offerings delivered by our expert team.

Splice Machine has designed a Safe Journey program to significantly ease the effort and risk for companies migrating to a Splice Machine database. The Safe Journey program includes a proven methodology that helps choose the right workloads to migrate, implements risk-mitigation best practices, and includes commercial tools that automate most of the PL/SQL conversion process.

This is not to suggest that all legacy databases will convert to a Hadoop RDBMS.
The best candidates will typically have over 1TB of data, which often leads to cost and scaling issues in legacy databases.

Q8. You have recently announced partnerships with Talend, mrc (michaels, ross & cole ltd.) and RedPoint Global. Why Talend, mrc, and RedPoint Global? What is the strategic meaning of these partnerships for Splice Machine?

John Leach, Monte Zweben: Our uptick in recent partnerships demonstrates the tremendous progress our team has made over the past year. We have been working relentlessly to develop the Splice Machine Hadoop RDBMS into a fully enterprise-ready database that can replace legacy database systems.

The demand for programming talent to handle application development is growing faster than the supply of skilled talent, especially around newer platforms like Hadoop. We partnered with mrc to give businesses a solution that can speed real-time application deployment on Hadoop with the staff and tools they currently have, while also offering future-proof applications over a database that scales to meet increasing data demands.

We partnered with Talend to bring our customers the benefit of two different approaches for managing data integration affordably and at scale. Talend’s rich capabilities, including a drag-and-drop user interface and an adaptable platform, allow for increased productivity and streamlined testing for faster deployment of web, mobile, OLTP or Internet of Things applications.

And finally, we integrated and certified our Hadoop RDBMS on RedPoint’s Convergent Marketing Platform™ to create a new breed of solution for marketers. With cost-efficient database scale-out and real-time cross-channel execution, the solution enables enterprises to future-proof their marketing technology investment through affordable access to all their data (social, mobile, click streams, website behaviors, etc.) across a proliferating and ever-changing list of channels. Furthermore, it complements any existing Hadoop deployment, including those on the Cloudera, MapR and Hortonworks distributions.

Q9. How is Splice Machine working with Hadoop distribution partners –such as MapR, Hortonworks and Cloudera?

John Leach, Monte Zweben: Since Splice Machine does not modify HBase, it can be used with any standard Hadoop distribution that includes HBase, including Cloudera, MapR and Hortonworks. Splice Machine enables enterprises using these three companies to tap into real-time updates with transactional integrity, an important feature for companies looking to become real-time, data-driven businesses.

In 2013, Splice Machine partnered with MapR to enable companies to use the MapR distribution for Hadoop to build their real-time, SQL-on-Hadoop applications. In 2014, we joined the Cloudera Connect Partner Program, after certifying on CDH 5. We are working closely with Cloudera to maximize the potential of its full suite of Hadoop-powered software and our unique approach to real-time Hadoop.

That same year, we joined the Hortonworks Technology Partner program. This enabled our users to harness innovations in management, provisioning and security for HDP deployments. For HDP users, Splice Machine enables them to build applications that use ANSI-standard SQL and support real-time updates with transactional integrity, allowing Hadoop to be used in both OLTP and OLAP applications.

Earlier this year, we were excited to achieve Hortonworks® Data Platform (HDP™) Certification. With the HDP certification, our customers can leverage the pre-built and validated integrations between leading enterprise technologies and the Hortonworks Data Platform, the industry’s only 100-percent open source Hadoop distribution, to simplify and accelerate their Splice Machine and Hadoop deployments.

Q10. What are the challenges of running online transaction processing on Hadoop?

John Leach, Monte Zweben: With its heritage as a batch processing system, Hadoop does not provide the transaction support required by online transaction processing. Transaction support can be tricky enough to implement for shared-disk RDBMSs such as Oracle, but it becomes far more difficult to implement in distributed environments such as Hadoop. A distributed transactional model requires high levels of coordination across a cluster without too much overhead, while simultaneously providing high performance for a high concurrency of small reads and writes, high-speed ingest, and massive bulk loads. We prove this by being able to run the TPC-C benchmark at scale.

Splice Machine met those requirements by using distributed snapshot isolation, a Multi-Version Concurrency Control model that delivers lockless, high-concurrency transactional support. Splice Machine extended research from Google’s Percolator project, Yahoo Labs’ OMID project, and the University of Waterloo’s HBaseSI project to develop its own patent-pending distributed transactions.
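For readers unfamiliar with snapshot isolation under multi-version concurrency control, the following single-process sketch shows the basic mechanics. It is a teaching toy, not Splice Machine's patent-pending design and not distributed: the `Store` and `Transaction` classes, the in-memory version lists, and the "first committer wins" conflict rule are assumptions chosen for the example. Readers never take locks. Each transaction reads the newest version committed at or before its start timestamp, and a writer aborts at commit time if someone else committed the same key after it started.

```python
import itertools
from typing import Any, Dict, List, Tuple


class ConflictError(Exception):
    """Another transaction committed to the same key first."""


class Store:
    def __init__(self) -> None:
        self._clock = itertools.count(1)  # shared source of timestamps
        # key -> [(commit_ts, value), ...] in ascending commit order
        self._versions: Dict[str, List[Tuple[int, Any]]] = {}

    def begin(self) -> "Transaction":
        return Transaction(self, next(self._clock))

    def read_as_of(self, key: str, ts: int) -> Any:
        """Return the newest value committed at or before ts, or None."""
        for commit_ts, value in reversed(self._versions.get(key, [])):
            if commit_ts <= ts:
                return value
        return None

    def commit(self, txn: "Transaction") -> int:
        # Write-write conflict check: first committer wins.
        for key in txn.writes:
            versions = self._versions.get(key)
            if versions and versions[-1][0] > txn.start_ts:
                raise ConflictError(key)
        commit_ts = next(self._clock)
        for key, value in txn.writes.items():
            self._versions.setdefault(key, []).append((commit_ts, value))
        return commit_ts


class Transaction:
    def __init__(self, store: Store, start_ts: int) -> None:
        self.store = store
        self.start_ts = start_ts
        self.writes: Dict[str, Any] = {}  # buffered until commit

    def get(self, key: str) -> Any:
        if key in self.writes:
            return self.writes[key]
        return self.store.read_as_of(key, self.start_ts)

    def put(self, key: str, value: Any) -> None:
        self.writes[key] = value

    def commit(self) -> int:
        return self.store.commit(self)


store = Store()
setup = store.begin()
setup.put("balance", 100)
setup.commit()

reader = store.begin()            # snapshot taken here
writer = store.begin()
writer.put("balance", 50)
writer.commit()

seen_by_reader = reader.get("balance")          # still the old snapshot
seen_by_new_txn = store.begin().get("balance")  # sees the committed write

reader.put("balance", 70)
try:
    reader.commit()
    conflict_detected = False
except ConflictError:
    conflict_detected = True
```

The reader keeps seeing its own snapshot even after the writer commits, and its later attempt to overwrite the same key is rejected. In a distributed setting the hard parts are the ones this sketch hides: issuing timestamps and checking conflicts across many servers without a central bottleneck.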


John Leach – CTO & Cofounder Splice Machine
With over 15 years of software experience under his belt, John’s expertise in analytics and BI drives his role as Chief Technology Officer. Prior to Splice Machine, John founded Incite Retail in June 2008 and led the company’s strategy and development efforts. At Incite Retail, he built custom Big Data systems (leveraging HBase and Hadoop) for Fortune 500 companies.
Prior to Incite Retail, he ran the business intelligence practice at Blue Martini Software and built strategic partnerships with integration partners. John was a key subject matter expert for Blue Martini Software in many strategic implementations across the world. His focus at Blue Martini was helping clients incorporate decision support knowledge into their current business processes utilizing advanced algorithms and machine learning.
John received dual bachelor’s degrees in biomedical and mechanical engineering from Washington University in Saint Louis. Leach currently is the organizer for the Saint Louis Hadoop Users Group and is active in the Washington University Elliot Society.

Monte Zweben – CEO & Cofounder Splice Machine
A technology industry veteran, Monte’s early career was spent with the NASA Ames Research Center as the Deputy Chief of the Artificial Intelligence Branch, where he won the prestigious Space Act Award for his work on the Space Shuttle program.
Monte then founded and was the Chairman and CEO of Red Pepper Software, a leading supply chain optimization company, which merged in 1996 with PeopleSoft, where he was VP and General Manager, Manufacturing Business Unit.

In 1998, Monte was the founder and CEO of Blue Martini Software – the leader in e-commerce and multi-channel systems for retailers. Blue Martini went public on NASDAQ in one of the most successful IPOs of 2000, and is now part of JDA.
Following Blue Martini, he was the chairman of SeeSaw Networks, a digital, place-based media company. Monte is also the co-author of Intelligent Scheduling and has published articles in the Harvard Business Review and various computer science journals and conference proceedings.

Zweben currently serves on the Board of Directors of Rocket Fuel Inc. as well as the Dean’s Advisory Board for Carnegie-Mellon’s School of Computer Science.



– Splice Machine resource page, ODBMS.org

Related Posts

– Common misconceptions about SQL on Hadoop. By Cynthia M. Saracco, ODBMS.org, July 2015

– SQL over Hadoop: Performance isn’t everything… By Simon Harris, ODBMS.org, March 2015

– Archiving Everything with Hadoop. By Mark Cusack, ODBMS.org. December 2014.

– On Hadoop RDBMS. Interview with Monte Zweben. ODBMS Industry Watch, November 2, 2014

– AsterixDB: Better than Hadoop? Interview with Mike Carey, ODBMS Industry Watch, October 22, 2014


On Big Data benchmarks. Interview with Francois Raab and Yanpei Chen. http://www.odbms.org/blog/2014/08/big-data-benchmarks-interview-francois-raab-yanpei-chen/ http://www.odbms.org/blog/2014/08/big-data-benchmarks-interview-francois-raab-yanpei-chen/#comments Thu, 14 Aug 2014 04:29:03 +0000 http://www.odbms.org/blog/?p=3153

“It’s unlikely that a big data benchmark will gain wide recognition until a clear “playing field” has emerged and focused the competitive pressure.” –Francois Raab

On the topic of constructing big data benchmarks I have interviewed Francois Raab and Yanpei Chen. Francois is the original author of the TPC-C Benchmark. He is currently the President of InfoSizing, Inc.
Yanpei is a member of the Performance Engineering Team at Cloudera.


Q1. There have been a number of attempts at constructing big data benchmarks. None of them has yet gained wide recognition and usage. Why?

Yanpei: Many big data benchmarks are just like big data systems – new, and with room to improve and grow.
In more detail, big data systems:
– rapidly evolve, so it’s important to define performance in ways that matter for end customers.
– consist of many interdependent components, so it’s difficult to measure performance in a reliable fashion.
– service diverse business needs using diverse implementations, so benchmarks need to accommodate different system implementations.

Francois: It’s unlikely that a big data benchmark will gain wide recognition until a clear “playing field” has emerged and focused the competitive pressure. There are 3 phases in the evolution of a new technology. First, the technology is introduced and applied to a wide array of solutions without a proven return on investment. Next, a “killer app” emerges from the early adopters and its rapid growth draws all the vendors into competing on a common playing field. Lastly, some technologies emerge as clear winners in the race and the market starts to consolidate around a few dominant vendors. Big data has not entered the second phase yet.

Q2. Is it possible to build a truly representative big data benchmark?

Yanpei: Absolutely!
To me, the rise of “big data” in part comes from our increased ability to instrument, measure, and ultimately derive value from large scale systems – technology systems, financial systems, medical systems, or physical systems touching day-to-day life. Big data systems, as a special case of technology systems, also deal with ever increasing instrumentation and measurement. Over time, I am absolutely confident that we will increase our understanding of big data systems, and with it, improve the quality of our big data benchmarks.

Cloudera’s broad customer base gives us visibility into big data deployments across telecom, banking, retail, manufacturing, media, government, healthcare, and many other industry sectors. We’re in a great position to identify representative use cases.

Francois: A benchmark is a somewhat abstract (i.e. simplified) model of a real life scenario. The question we face today is to identify a scenario that Fortune 500 companies would widely recognize as relevant to their operations and vital to their competitive survival. Once that critical mass has been reached it will quickly spread to the entire commercial data processing landscape and a successful big data benchmark will be built based on that scenario.

Q3. How would you define a Big Data Benchmark ?

Yanpei: The key properties of good big data benchmarks are a re-cast of the same properties for benchmarks of more established systems.

A good big data benchmark should be representative of real-life use cases; it should generate performance insights immediately relevant to diverse and evolving big data use cases. The benchmark should also be scalable; it should stress big data systems today, as well as the vastly improved systems in the future. The benchmark should be portable, meaning it should accommodate systems with different implementations that achieve the same end-goal. The benchmark should also be verifiable, in that the results can be checked by independent auditors if needed, and end-users can reproduce on their own systems the winning configurations and result.

Q4. Can you give some examples of Successful Benchmarks ?

Yanpei: My co-author Francois was a lead contributor to TPC-C, a very successful benchmark for online transactional processing (OLTP). He can share other examples.

Francois: The success of a benchmark can be measured by its number of published results and by its longevity over shifts in the underlying technologies. By that measure TPC-C and TPC-H are leading the field. While it can be argued that they have lost relevance over their two-decade lifetime, they still encapsulate critical elements at the core of the application domains they represent (transaction processing and decision support).

Q5. One of the main purposes of a benchmark is to evaluate and contrast the merits of various implementations of the same set of requirements. How do you do this with Big Data?

Yanpei: You construct benchmarks that are portable. In other words, you specify implementation-independent requirements.

Best illustrated by example – TPC-C. TPC-C specifies five operations – New Order, Payment, Delivery, Order-Status, and Stock-Level. It also describes the interdependencies between these operations. For example, every New Order will be accompanied by Payment, but only one in ten New Orders will trigger an Order Status. TPC-C describes the load that the system under test should handle – many concurrent operations arriving in randomized order with randomized inter-arrival time, but at controlled relative frequencies. TPC-C also specifies the initial content of all the datasets, as well as how the content grows over the execution of the benchmark. This is an implementation-independent set of requirements – “handle these operations on these data sets.” The underlying system could be a relational database, or a key-value store like HBase.
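The load description above (randomized order, randomized inter-arrival time, controlled relative frequencies) can be sketched as a small generator. This is an illustration, not the TPC-C specification: the 10:10:1 weights for New Order, Payment and Order-Status follow the ratios mentioned in the answer, while the weights for Delivery and Stock-Level, the exponential inter-arrival times, and the function name `generate_load` are assumptions made for the example. Note that it only matches relative frequencies and does not pair each New Order with a specific Payment.

```python
import random
from collections import Counter
from typing import Iterator, Tuple

# Relative frequencies of the five operations.
OPERATION_WEIGHTS = {
    "New Order": 10,
    "Payment": 10,       # as frequent as New Order
    "Order-Status": 1,   # one per ten New Orders
    "Delivery": 1,       # assumed weight, not stated in the interview
    "Stock-Level": 1,    # assumed weight, not stated in the interview
}


def generate_load(
    n: int, mean_gap_seconds: float, seed: int = 0
) -> Iterator[Tuple[float, str]]:
    """Yield (arrival_time, operation) pairs in arrival order."""
    rng = random.Random(seed)  # seeded so a run can be repeated
    names = list(OPERATION_WEIGHTS)
    weights = list(OPERATION_WEIGHTS.values())
    now = 0.0
    for _ in range(n):
        # Randomized inter-arrival time with the requested mean.
        now += rng.expovariate(1.0 / mean_gap_seconds)
        # Randomized order, but controlled relative frequency.
        yield now, rng.choices(names, weights=weights, k=1)[0]


events = list(generate_load(10_000, mean_gap_seconds=0.01))
mix = Counter(op for _, op in events)
for name in OPERATION_WEIGHTS:
    print(f"{name:13s} {mix[name] / len(events):.1%}")
```

Nothing in the generator says how the operations are executed, which is the point: the same stream of requests could be replayed against a relational database or a key-value store.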

Francois: Benchmarks can be defined in one of two ways: by creating a kit to be deployed on technology-specific platforms or by specifying a set of technology-agnostic requirements to be implemented at will. Because big data first emerged from the MapReduce paradigm, we have seen a number of technology-centric benchmarks (also called component benchmarks) that put a narrow focus on one or more components of a predefined solution. But we should soon expect to see a big data application emerge as the new must-have in commercial data centers.

Q6. In a recent position paper you argued for building future big data benchmarks using what you call a “functional workload model”. What is it?

Francois: We introduced a couple of terms in that position paper to highlight the core concepts underlying representative, scalable, portable, and verifiable big data benchmarks.

The “functional workload model” is a way to specify such benchmarks. It contains three things – the “functions of abstraction”, the load pattern serviced by the system, and the data sets being acted upon.

“Functions of abstraction” describes “what is being computed” without specifying “how the computation should be done.”
The intent is an abstract, functional description that allows the benchmark to be portable across systems of different compute paradigms. “What is being computed” should be justified by empirical evidence, either system traces or industry-wide surveys, with emphasis on identifying the common computation goals.

The load pattern describes “what is the serviced load” without specifying “how it is serviced.” It outlines the execution frequency, distribution, arrival rate, bursts and averages over time of each individual function of abstraction.

The data sets describe “what is the data and the relationships within the data” without specifying “how it is represented.” It is in terms of the structure and interdependence between data elements, initial size and contents, how it evolves over the course of the workload execution, and how it is expected to scale with the system size and load volume.

These concepts help us routinely identify shortcomings in haphazardly specified benchmarks. For example, some of the most often-cited big data benchmarks contain artificial functions of abstraction that do not match any common use cases.
Or, a multi-job, multi-query load pattern is missing altogether, or the data sets are represented in unrealistic formats that inflate performance advantages.
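One way to picture the three parts of a functional workload model is as a small set of data structures. The sketch below is an interpretation for illustration only: the class and field names are invented here and do not come from the position paper, and the `shortcomings` check covers just two of the problems mentioned above (functions with no supporting evidence, and functions absent from the load pattern).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class FunctionOfAbstraction:
    name: str
    computes: str        # "what is being computed", never "how"
    evidence: str = ""   # system trace or survey that justifies it


@dataclass(frozen=True)
class LoadPattern:
    relative_frequency: Dict[str, float]  # function name -> share of load
    mean_arrivals_per_second: float


@dataclass(frozen=True)
class DataSet:
    name: str
    structure: str       # relationships, not physical representation
    initial_rows: int
    rows_added_per_operation: float


@dataclass
class FunctionalWorkloadModel:
    functions: List[FunctionOfAbstraction]
    load: LoadPattern
    data_sets: List[DataSet]

    def shortcomings(self) -> List[str]:
        """List gaps in the specification, in a stable order."""
        problems = []
        for f in self.functions:
            if not f.evidence:
                problems.append(f"{f.name}: no empirical evidence cited")
        names = {f.name for f in self.functions}
        for name in sorted(names - set(self.load.relative_frequency)):
            problems.append(f"{name}: missing from the load pattern")
        if not self.data_sets:
            problems.append("no data sets specified")
        return problems


model = FunctionalWorkloadModel(
    functions=[
        FunctionOfAbstraction("sort", "output ordered by key", "production traces"),
        FunctionOfAbstraction("toy-function", "made-up computation"),
    ],
    load=LoadPattern({"sort": 1.0}, mean_arrivals_per_second=5.0),
    data_sets=[DataSet("records", "flat key/value rows", 1_000_000, 0.0)],
)
issues = model.shortcomings()
print(issues)
```

None of the fields mention a storage format, a query language, or an execution engine, which is what keeps a specification written this way portable.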

Q7. Why did you select TPC-C as a starting point for your work?

Yanpei: Because TPC-C already has a functional workload model within its specification. And because Francois wrote TPC-C.

Francois: The functional workload model is the underlying structure on which TPC-C was built. Subsequent TPC benchmarks, like TPC-H and TPC-E, were also built based on a functional workload model.

Q8. How does your functional workload model compare with TPC-C ?

Yanpei: TPC-C already uses the functional workload concept.

Q9. For your functions of abstractions concept to be useful, it must be applicable to different types of big data systems. Two important examples are relational databases and MapReduce. How do you do that? How does your work compare with other MapReduce-Specific Benchmarks ?

Yanpei: Best illustrated by example.

Suppose we discover that sorting data is a common operation in real-life production use cases. We would then define “sort” as a function of abstraction. We would define it in the same fashion as the official Sort Benchmark – the input data is of size X, format Y, and the system is asked to produce output sorted by order Z.

A relational database implementation could do, say, “insert into TABLE … ” followed by “select * from TABLE order by COLUMN”. A MapReduce implementation would use the IdentityMapper and IdentityReducer, and rely on the implicit shuffle-sort in MapReduce.

This is obvious for sort, because the sort operation has traditionally been defined in a system-independent way.
In contrast, many of the existing MapReduce and relational database performance measurement tools are specified in ways that do not translate across different types of systems. The many SQL-on-Hadoop systems are fast removing the boundary.
The functions of abstraction concept allows us to understand use cases at a level above any SQL-only or Hadoop-only specification.
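The sort example can be written down as one system-independent check plus two interchangeable implementations. This is a sketch under assumptions: SQLite stands in for "a relational database", and the MapReduce side is simulated in plain Python with an identity map, a range-partitioned shuffle that sorts each partition, and an identity reduce. The function names are invented for the example and integer keys are assumed.

```python
import sqlite3
from collections import Counter
from typing import List


def verify_sort(records: List[int], output: List[int]) -> bool:
    """The function of abstraction: same records, in non-decreasing order."""
    same_records = Counter(records) == Counter(output)
    in_order = all(a <= b for a, b in zip(output, output[1:]))
    return same_records and in_order


def sort_relational(records: List[int]) -> List[int]:
    """Relational implementation: insert, then select with ORDER BY."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (k INTEGER)")
    conn.executemany("INSERT INTO t (k) VALUES (?)", [(r,) for r in records])
    rows = conn.execute("SELECT k FROM t ORDER BY k").fetchall()
    conn.close()
    return [k for (k,) in rows]


def sort_mapreduce_style(records: List[int], num_reducers: int = 3) -> List[int]:
    """Simulated MapReduce: identity map, shuffle-sort, identity reduce."""
    if not records:
        return []
    mapped = [(r, r) for r in records]  # identity map: key = record
    lo, hi = min(records), max(records)
    span = hi - lo + 1
    # Shuffle: range-partition keys so partition i only holds keys
    # smaller than those in partition i + 1.
    partitions: List[List[int]] = [[] for _ in range(num_reducers)]
    for key, value in mapped:
        partitions[(key - lo) * num_reducers // span].append(value)
    output: List[int] = []
    for part in partitions:
        output.extend(sorted(part))     # per-partition sort, identity reduce
    return output


records = [5, 3, 9, 1, 3, 7]
relational_output = sort_relational(records)
mapreduce_output = sort_mapreduce_style(records)
print(relational_output, verify_sort(records, relational_output))
print(mapreduce_output, verify_sort(records, mapreduce_output))
```

Only `verify_sort` belongs to the benchmark definition. Either implementation, or any other system, satisfies the function as long as its output passes the same check.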

Q10. What are in your opinion the Emerging Big Data Application Domains?

Francois: Everyone wants to figure out which application domain will become the big data killer app. Today, no commercial data center can live without on-line transaction processing or without decision support systems.
Which big data application will become indispensable tomorrow? That is the million dollar question! Once we know that, a standard big data benchmark will soon follow.

Yanpei: The maturation of the Hadoop platform has been relentless. Its role has changed as the platform has gotten more secure, more reliable, more powerful, and (especially) more real-time. It’s no longer a system used for just big batch jobs. Instead, it has become the first place that data lands. It scales and it can store anything – no data need be discarded. It’s used to pre-process data before delivering it to an enterprise data warehouse, a document repository, an analytic engine, a CRM or ERP application, or other specialized system. Most significantly, it has begun to take over some of the work previously done by those traditional platforms, because it can do real-time search and analysis on the data directly, in place, and without further Extract-Transform-Load (ETL).

This leads to the emergence of the enterprise data hub (EDH), a new architecture to complement existing investments and help put data at the center of an organisation’s business. An enterprise data hub allows storage of any amount and type of data, for as long as is needed, and accessible in any way needed.
Additional necessary attributes of EDHs include: It’s Secure and Compliant, offering perimeter security and encryption, plus fine-grained (row and column-level), role-based access controls over data, just like a data warehouse. It’s Governed, enabling users to do data discovery, data auditing, and data lineage, thus understanding what data is in their EDH and how the data are used.
It’s Unified and Manageable, providing native high-availability, fault-tolerance, self-healing storage, automated replication, and disaster recovery, as well as advanced workload management capabilities to enable multiple specialist systems to analyze the same data set. And it’s Open, ensuring that customers are not locked in to any particular vendor’s license agreement, that you can choose what tools to use with your EDH, and nobody can hold your data or applications hostage.

The emergence of EDHs poses both challenges and opportunities for defining big data benchmarks. As Francois alluded to, the representative scenarios typically involve application domains whose performance has traditionally been measured separately, as is the case for on-line transaction processing and decision support systems. How to define and measure performance for such concurrent application domains presents both a challenge and an opportunity.
Further, to compare different EDHs, it becomes necessary to quantify characteristics that were previously yes/no checks – which is the more secure EDH? the better governed? the more unified and more manageable? the more open? How to quantify such characteristics will stretch our performance thinking and measurement methodology into new territory.

Q11. Future Work ?

Yanpei: We have a strong Performance Engineering Team at Cloudera. We insist on systematic, fair, and repeatable tests both for our internal performance assessment and competitive studies. We are also engaged with community efforts to define big data benchmarks. Look for our future posts on the Cloudera Developer Blog!


Francois Raab is a recognized, award winning expert in the field of performance engineering, benchmark design and system testing. He is the original author of the TPC-C Benchmark, the most successful industry standard measure of OLTP performance. He was also co-author of “The Benchmark Handbook” (pub. Morgan Kaufmann). Francois is accredited as a Certified Benchmark Auditor by the Transaction Processing Performance Council. His consulting services are retained by most major system vendors as well as Fortune-500 IT organizations. With over 30 years of experience in the field of databases and commercial data processing, Francois is a leading member of the performance measurement, system sizing and technology evaluation community. He is currently the President of InfoSizing, Inc.

Yanpei Chen is a member of the Performance Engineering Team at Cloudera, where he works on internal and competitive performance measurement and optimization. His work touches upon multiple interconnected computation frameworks, including Cloudera Search, Cloudera Impala, Apache Hadoop, Apache HBase, and Apache Hive. He is the lead author of the Statistical Workload Injector for MapReduce (SWIM), an open source tool that allows someone to synthesize and replay MapReduce production workloads. SWIM has become a standard MapReduce performance measurement tool used to certify many Cloudera partners. He received his doctorate at the UC Berkeley AMP Lab, where he worked on performance-driven, large-scale system design and evaluation.


New (August 18, 2014): TPCx-HS: First Vendor-Neutral, Industry Standard Big Data Benchmark.
The Transaction Processing Performance Council (TPC) announced the immediate availability of TPCx-HS, developed to provide verifiable performance, price/performance, availability, and optional energy consumption metrics of big data systems.

From TPC-C to Big Data Benchmarks: A Functional Workload Model. Yanpei Chen, Francois Raab, Randy H. Katz, July 1, 2012

Workload-Driven Design and Evaluation of Large-Scale Data-Centric Systems. Yanpei Chen. Spring 2012

Statistical Workload Injector for MapReduce (SWIM). Yanpei Chen, Sara Alspaugh, Archana Ganapathi, Rean Griffith, Randy Katz

The Fifth Workshop on Big Data Benchmarking (5th WBDB) August 5-6, 2014, Potsdam, Germany: Program and Videos of all talks.

Related Posts

Benchmarking XML Databases: New TPoX Benchmark Results Available. ODBMS Industry Watch. September 19, 2011

Measuring the scalability of SQL and NoSQL systems. ODBMS Industry Watch. May 30, 2011

On Pivotal HD. Interview with Scott Yara and Florian Waas. http://www.odbms.org/blog/2013/04/on-pivotal-hd-interview-with-scott-yara-and-florian-waas/ http://www.odbms.org/blog/2013/04/on-pivotal-hd-interview-with-scott-yara-and-florian-waas/#comments Mon, 22 Apr 2013 06:28:17 +0000 http://www.odbms.org/blog/?p=2223

“A distribution is not–or not necessarily–a fork of the code and we have no intention to fork Hadoop. At this point, the value-add that we bring to the table is strictly layered on top of Apache HD and interacts cleanly with the vanilla Hadoop stack” –Scott Yara and Florian Waas.

Greenplum announced on Monday, February 25th a new Hadoop distribution: Pivotal HD. I asked a few questions on Pivotal HD to Scott Yara, Senior Vice President, Products and Co-Founder Greenplum/EMC, and Florian Waas, Senior Director of Advanced Research and Development at Greenplum/EMC.


Q1. What is in your opinion the status of adoption of, and investment in, open source projects such as Hadoop within the Enterprise?

Scott Yara, Florian Waas: We have seen a massive shift in perception when it comes to open source.

In the past, innovation was primarily driven by commercial R&D departments and open source was merely trying to catch up to them. And even though a number of open source projects from that era have become household names they weren’t necessarily viewed as leaders in innovation.

This has fundamentally changed in recent years: open source has become a hotbed of innovation, in particular in infrastructure technology. Hadoop and a variety of other data management and database products are testament to this change. Enterprise customers do realize this trend and have started adopting open source on a large scale. It allows them to get their hands on new technology much faster than was the case before and, as an additional perk, this technology comes without the dreaded vendor lock-in.

By now, even the most conservative enterprises have developed open source strategies that ensure they have their finger on the pulse and that adoption cycles are short and effective.

So, in short, the prospects for open source have never been better!

Q2. In your opinion is the future of Hadoop made of hybrid products?

Scott Yara, Florian Waas: Hadoop is a collection of products or tools and, apart from the relatively mature HDFS interfaces, is still evolving. Its original value proposition has changed quite dramatically. Remember, initially it was all about MapReduce, the cool programming paradigm that lets you whip up large-scale distributed programs in no time, requiring only rudimentary programming skills.

Yet, that’s not the reason Hadoop has attracted the attention of enterprises lately. Frankly, the MapReduce programming paradigm was a non-starter for most enterprise customers: it’s at too low a level of abstraction and curating and auditing MapReduce programs is prohibitively expensive for customers unless they have a serious software development shop dedicated to it. What has caught on, however, is the idea of ‘cheap scalable storage’!

In our view the future of Hadoop is really this: a solid abstraction of storage in the form of HDFS with any number of different processing stacks on top, including higher-level query languages. Naturally this will be a collection of different products, hybrids where necessary. I think we’ve only seen the tip of the iceberg yet.

Q3. Why introduce a new Hadoop distribution?

Scott Yara, Florian Waas: Let’s be clear about one thing first: to us a distribution is simply a bundle of software components that comes with the assurance that the bundled products have been integration-tested and certified. To enterprise customers this assurance is vital as it gives them the single point of contact when things go wrong. And exactly this is the objective of Pivotal HD.

A distribution is not–or not necessarily–a fork of the code and we have no intention to fork Hadoop. At this point, the value-add that we bring to the table is strictly layered on top of Apache HD and interacts cleanly with the vanilla Hadoop stack.

As long as no vendor actively subverts the Hadoop project, we don’t see any need to fork. That being said, if a single vendor sweeps up a significant number of contributors or even committers of any individual project it always raises a couple of red flags and customers will be concerned whether somebody is going to hijack the project. At this point, we’re not aware of any such threat to the open-source nature of Hadoop.

Q4. How did you expand Hadoop capabilities as a data platform with Pivotal HD?

Scott Yara, Florian Waas: Pivotal HD is a full Apache HD distribution plus some Pivotal add-ons. As we said before, the HDFS abstraction is a pretty good one, but the standard stack on top of it is severely lacking in performance and expressiveness; so we give customers better alternatives. For enterprise customers this means: you can use Pivotal HD like regular Hadoop where applicable but if you need more, you get it in the same bundle.

Q5. What is the rationale behind introducing HAWQ, a relational database that runs atop HDFS?

Scott Yara, Florian Waas: Not quite. We’ve transplanted a modern distributed query engine onto HDFS. We stripped out a lot of “incidental” database technology that databases are notorious for. HAWQ gives enterprises the best of both worlds: high-performance query processing for a query language they already know on the one hand, and scalable open storage on the other hand. And, unlike with a database, data isn’t locked away in a proprietary format: in HAWQ you can access all stored data with any number of tools when you need to.

Q6. How does Pivotal HD differ from Hadapt in this respect?

Scott Yara, Florian Waas: Hadapt is still in its infancy with what looks like a long way to go; mainly because they couldn’t tap into an MPP database product.

Folks sometimes forget how much work goes into building a commercially viable query processor. In the case of Greenplum, it’s been about 10 years of engineering.

Q7. How does HAWQ work?

Scott Yara, Florian Waas: HAWQ is a modern distributed and parallel query processor atop HDFS–with all the features you truly need, but without the bloat of a complete RDBMS.

Obviously there are a number of rather technical details on how exactly the two worlds integrate, and interested readers can find specific technical descriptions on our website.

Q8. You write in the Greenplum Blog that “HAWQ draws from the 10 years of development on the Greenplum Database product”. Can you be more specific?

Scott Yara, Florian Waas: Building a distributed query engine that is general and powerful enough to support deep analytics is a very tough job. There are no shortcuts. Hive and all of these SQL-ish interfaces we’ve recently seen are an attempt at it and work well for simple queries but have basically failed to deliver solid performance when it comes to advanced analytics.

Having spent a long time working on DB internals we sometimes keep forgetting how steep a development this technology has undergone. Folks new in this space constantly “discover” some of the problems query processing has dealt with for a long time already, like join ordering—this learning-by-doing approach is kind of cute, but not necessarily effective.

Q9. Why does HAWQ have its own execution engine separate from MapReduce? Why does it manage its own data?

Looks like we’re answering the questions in the wrong order :-)

MapReduce is a great tool to teach parallelism and distributed processing and makes for an intuitive hands-on experience. But unless your problem is as simple as a distributed word-count, MapReduce quickly becomes a major development and maintenance headache; and even then the resulting performance is sub-standard.

In short, MapReduce, while maybe great for software shops with deep expertise in distributed programming and a do-it-yourself attitude, is not enterprise-ready.
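For readers unfamiliar with it, the "distributed word-count" mentioned above is the canonical MapReduce example. The following is a minimal single-process sketch in Python of the map, shuffle, and reduce steps. It only illustrates the programming model: it does not use Hadoop's actual API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are invented here for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would do between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big storage", "data on HDFS"]))
```

In a real cluster the map and reduce calls would run on many machines in parallel; the point made in the answer above is that anything beyond this simple shape quickly becomes much harder to write and maintain by hand.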

Q10. HAWQ supports Columnar or row-oriented storage. Why this design choice?

Scott Yara, Florian Waas: Columnar vs. row-orientation really is a smoke screen; always has been. We’ve long advocated to view columnar for what it is: a feature, not an architectural principle. If your query processor follows even the most basic software engineering principles supporting column-orientation is really easy.

Plenty of white papers have been written on the differences and discussed the application scenarios where one out-performs the other and vice versa. As so often, there is no one-size-fits-all. HAWQ lets customers use what they feel is the right format for the job. We want customers to be successful, not blindly follow an ideology.

The same way different requirements in the workload demand different orientation, HAWQ can ingest different data formats way beyond column or row orientation—optimized for query processing, or optimized for 3rd party applications, etc.—which rounds out the picture.
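To make the row versus column distinction concrete, here is a small Python sketch under simple assumptions: the same three records are held once as a list of rows and once as a dictionary of columns. This is not HAWQ's storage format; the table, field names, and helper functions are hypothetical.

```python
# Row orientation: each record is stored together.
ROWS = [
    {"id": 1, "region": "EU", "amount": 10.0},
    {"id": 2, "region": "US", "amount": 20.0},
    {"id": 3, "region": "EU", "amount": 30.0},
]


def to_columns(rows):
    """Pivot a list of row dictionaries into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_row_oriented(rows, field):
    """Aggregate one field; every full row is visited to reach it."""
    return sum(row[field] for row in rows)


def total_column_oriented(columns, field):
    """Aggregate one field; only that column's values are touched."""
    return sum(columns[field])


if __name__ == "__main__":
    columns = to_columns(ROWS)
    print(total_row_oriented(ROWS, "amount"))
    print(total_column_oriented(columns, "amount"))
```

Both functions return the same answer; the difference is in how much data has to be read, which is why neither layout wins for every workload.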

Q11. Could you give us some technical detail on how the SQL parallel query optimizer and planner works?

Scott Yara, Florian Waas: What you see in HAWQ today is the tried and true Greenplum Database MPP optimizer with a couple of modifications but largely the same battle-tested technology. That’s what allowed us to move ahead of the competition so quickly while everybody else is still trying to catch up to basic MPP functionality.

Having said that, we’re constantly striving for improvement and pushing the limits. Over the past years, we have invested in what we believe is a ground-breaking optimizer infrastructure which we’ll unveil later this summer. So, stay tuned!

Q12. Could you give us some details on the partitioning strategy you use and what kind of benchmark results do you have?

Scott Yara, Florian Waas: The benchmarks are a funny thing: hardly any competitor can run even the most basic database benchmarks yet, so we’re comparing on the simple, almost trivial, queries only. Anyways, here’s what we’ve been seeing so far: if the query is completely trivial the nearest competitor is slower by at least a factor of two. For anything even slightly more complex the difference widens quickly to one to two orders of magnitude.

Q13. Apache Hadoop is open-source, do you have plans to open up HAWQ and the other technologies layered atop it?

Scott Yara, Florian Waas: We’ve been debating this but haven’t really made a decision, as of yet.

Q14. The Hadoop market is crowded: e.g Cloudera (Impala), Hortonworks’ Data Platform for Windows, Intel’s Hadoop distribution, NewSQL data store Hadapt. How do you stand out of this crowd of competitors?

Scott Yara, Florian Waas: We clearly captured a position of leadership with HAWQ and enterprise customers do recognize that. We’ve also received a lot of attention from competitors which shows that we clearly hit a nerve and deliver a piece of the puzzle enterprises have long been waiting for.

Q15. With Pivotal HD are you competing in the same market space as Teradata Aster?

Scott Yara, Florian Waas: Aster has traditionally targeted a few select verticals. For all we can tell, it looks like we’re seeing the continuation of that strategy with a highly specialized offering going forward.

In contrast to that, Pivotal HD strives to be a general purpose solution for as broad a customer spectrum as you can imagine.

Q16. Jeff Hammerbacher in 2011 said “The best minds of my generation are thinking about how to make people click ads… That sucks. If instead of pointing their incredible infrastructure at making people click on ads, they pointed it at great unsolved problems in science, how would the world be different today?” What is your take on this?

Scott Yara, Florian Waas: Jeff garnered a lot of attention with this quote but let’s face it, this type of criticism isn’t exactly novel nor very productive. For decades, Joseph Weizenbaum, one of the pioneers of AI famously lamented about the genius and technology wasted on TV satellites. Along the same lines other MIT faculty have decried the fact that their most successful engineering students become quants on Wall Street. The list is probably long.

Instead of scolding people for what they didn’t do, I’d say, let’s empower people and give them tools to do great things and solve truly important problems. It’s not a coincidence that Big Data problems are at the heart of the most pressing challenges humanity faces today. So, let’s get moving!

Scott Yara
Senior Vice President, Products and Co-Founder Greenplum/EMC.
In his role as SVP, Products, Scott is responsible for the division’s overall product development and go-to-market efforts, including engineering, product management, and marketing. Scott is a co-founder of Greenplum and was President of the company. Prior to Greenplum, Scott served as vice president for Digital Island, a publicly traded Internet infrastructure services company that was acquired by Cable & Wireless in 2001. Prior to Digital Island, Scott served as vice president for Sandpiper Networks, an Internet content delivery services company that merged with Digital Island in 1999. At Sandpiper, Scott helped to create the industry’s first content delivery network (CDN), a globally distributed computing infrastructure comprised of several thousand servers, and used by many of the industry’s largest Internet services including Microsoft and Disney.

Florian Waas
As Senior Director of Advanced Research and Development at Greenplum/EMC, Florian Waas heads up the division’s department of Impossible Ideas. That is to say, his day job is to look into ideas that are far from ready to be undertaken as engineering efforts, and then look at what it would take to turn theory into practice.
He obtained his MSc in Computer Science from Passau University, Germany and a PhD in database research from the University of Amsterdam. Florian Waas has worked as a researcher for several European research consortia and universities in Germany, Italy, and The Netherlands. Before joining Greenplum, Florian Waas held positions at Microsoft and Amazon.com.

Related Posts

Big Data: Improving Hadoop for Petascale Processing at Quantcast. March 13, 2013

On Big Data, Analytics and Hadoop. Interview with Daniel Abadi. December 5, 2012

On Big Data Analytics: Interview with Florian Waas, EMC/Greenplum. February 1, 2012


– ODBMS.org free resources on Big Data and Analytical Data Platforms
Blog Posts | Free Software | Articles | Lecture Notes | PhD and Master Thesis |

Follow ODBMS.org on Twitter: @odbmsorg


Lufthansa and Data Analytics. Interview with James Dixon. http://www.odbms.org/blog/2013/02/lufthansa-and-data-analytics-interview-with-james-dixon/ http://www.odbms.org/blog/2013/02/lufthansa-and-data-analytics-interview-with-james-dixon/#comments Mon, 04 Feb 2013 10:12:22 +0000 http://www.odbms.org/blog/?p=1852 “Lufthansa is now able to aggregate and feed data into a management cockpit to analyze collected data for key decision-making purposes in the future. Users get instantly notified of transmission errors, enabling the company to detect patterns in large amounts of data at a rapid speed. Automatic alarm messages are also sent out to IT product management, and partner airlines are informed of errors right away in the case of transmission errors between different IT systems for passenger data. Lufthansa is now able to comprehensively monitor one of its most important core processes in real-time for quality management: the handover of passenger data between different airlines” –James Dixon.

On the state of the market for Big Data Analytics I have interviewed James Dixon, co-founder and Chief Geek / CTO, Pentaho Corporation.


Q1. What is, in your opinion, the expected realistic market demand for Big Data analytics?

James Dixon: Big. Until recently it has not been possible to perform analysis of sub-transactional and detailed operational data for a reasonable price-tag. Systems such as Hadoop and the NoSQL repositories such as MongoDB and Cassandra make it possible to economically store and process large amounts of data. The first use of this data is often to answer operational and tactical questions. Shortly after that comes the desire to answer managerial and strategic questions, and this is where Big Data Analytics comes in. I estimate that 90% of all Big Data repositories will have some form of reporting/visualization/analysis requirement applied to them.

Q2. Aren’t we too early with respect to the maturity of the Big Data Analytics technology and the market acceptance?

James Dixon: We are early, but not too early. There is significant market acceptance in certain domains already – financial services, SaaS application providers, and media companies to name a few. As these initial markets mature we will see common use cases emerge and public endorsements of these technologies; this will help to increase acceptance in other markets. We’ve seen a significant uptake in commercial deals over the last few quarters, whereas 2011 was more tire-kicking and exploratory.

Q3. Pentaho has worked with Lufthansa to improve their passenger handling. Could you please tell us more about this? In particular what requirement and technical challenges did you have for this project? And how did you solve them?

James Dixon: Lufthansa needed a solution that would make the core processes of Inter Airline Through Check In (IATCI) accessible, measurable and available for real-time operational monitoring. They also wanted to deliver consolidated management reporting dashboards to inform decision making based on this information. This was implemented by Pentaho’s services organization with onsite training and consulting. Our Pentaho Business Analytics suite was used for the front-end for real-time data analysis and report generation. In the back-end, Pentaho Data Integration (aka Kettle) retrieves, transforms and loads the message data streams into the data warehouse on a continuous basis.

Q4. And what results did you obtain so far?

James Dixon: Lufthansa is now able to aggregate and feed data into a management cockpit to analyze collected data for key decision-making purposes in the future. Users get instantly notified of transmission errors, enabling the company to detect patterns in large amounts of data at a rapid speed. Automatic alarm messages are also sent out to IT product management, and partner airlines are informed of errors right away in the case of transmission errors between different IT systems for passenger data. Lufthansa is now able to comprehensively monitor one of its most important core processes in real-time for quality management: the handover of passenger data between different airlines. With Pentaho, Lufthansa is now instantly aware if they are dealing with a single occurrence of an error or if there is a pattern. They can immediately take action in order to minimize the impact on their passengers.

Q5. What is special about Pentaho’s big data analytic platform? How does it differ with respect to other vendors?

James Dixon: We have an end-to-end offering that encompasses data integration/orchestration across Big Data and regular data stores/sources, data transformation, desktop and web-based reporting, slice-and-dice analysis tools, dashboarding, and predictive analytics. Very few vendors have the breadth of technology that we do, and those that do are mainly pushing hardware and services. We enable the creation of hybrid solutions that allow companies to use the most appropriate data storage technology for every part of their system – we don’t force you to load all your data into Hadoop, for example. From an architecture perspective our ability to run our data integration engine inside of MapReduce tasks on the data nodes is a unique capability. And we provide analytics directly on top of big data tech that gives users instant results via our schema-on-read approach – you don’t have to predefine ETL or schemas or data marts – we do it on the fly.
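As a rough illustration of the general idea of schema-on-read (not Pentaho's implementation), the Python sketch below keeps raw records exactly as they arrived and applies a schema only at query time. The sample events, the field names, and the helpers `read_with_schema` and `average_latency_by_user` are all hypothetical.

```python
import json

# Raw records are stored untouched; they need not share the same fields.
RAW_EVENTS = [
    '{"user": "a", "ms": "120", "page": "/home"}',
    '{"user": "b", "ms": "95"}',
    '{"user": "a", "ms": "300", "page": "/cart", "extra": true}',
]


def read_with_schema(lines, schema):
    """Parse each raw line and project it onto `schema` at read time.

    `schema` maps a field name to a converter; fields that are absent
    from a record come back as None, and unlisted fields are ignored.
    """
    for line in lines:
        record = json.loads(line)
        yield {
            name: (convert(record[name]) if name in record else None)
            for name, convert in schema.items()
        }


def average_latency_by_user(lines):
    """Answer one question by choosing the schema when the query runs."""
    schema = {"user": str, "ms": int}
    totals = {}
    for row in read_with_schema(lines, schema):
        count, total = totals.get(row["user"], (0, 0))
        totals[row["user"]] = (count + 1, total + row["ms"])
    return {user: total / count for user, (count, total) in totals.items()}


if __name__ == "__main__":
    print(average_latency_by_user(RAW_EVENTS))
```

A different question could pass a different schema over the same raw lines, which is the contrast with defining tables and ETL jobs up front.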

Q6. What are the technical challenges in creating and viewing Analytics on the iPad?

James Dixon: The navigation concepts are different on mobile devices, so the overall user experience of the analysis software needs to be adapted for the iPad. Vendors need to be sensitive to the interaction techniques that the touch screens provide. We have changed the way that all of our end-user web-based interfaces work so that the experience on the iPad is similar. It is possible to allow ad-hoc analysis and content authoring on the iPad, and Pentaho provides that with our recent V4.8 release.

Q7. Big Data and Mobile: what are the challenges and opportunities?

James Dixon: There are some use cases that are easy to identify. Report bursting to mobile and non-mobile devices is a technique that is easy to do today. Real-time analysis of Big Data combined with the alerting and notification capability of mobile devices is an interesting combination.

Q8. How are you supposed to view complex analytics with the limited display of a mobile phone?

James Dixon: Even with a desktop computer and a large monitor, analysis of Big Data requires lots of aggregation and/or lots of filtering. If you could display all the raw data from a Big Data repository, you would not be able to interpret it. As the display gets smaller the amount of aggregation and filtering has to go up, and the complexity has to come down. It is possible to do reasonably complex analysis on a tablet, but it is certainly a challenge on the smaller devices.

Q9. What are the main technical and business challenges that customers face when they want to use Cloud analytics deployments?

James Dixon: Moving large amounts of data around is a hurdle for some organizations. For this reason cloud analytics is not very appealing to companies with established data centers. However young companies that exclusively use hosted applications do not have their data on-premise. As these companies grow and mature we will see the market for cloud analytics increase.

Q10. Pentaho announced in July this year a technical integration of their analytics platform with Cloudera. What is the technical and business meaning of this? What are the results obtained so far?

James Dixon: We are working closely with Cloudera on a technical and business level. For example we worked with Cloudera to test their new Impala database with Pentaho’s analytics, so that we could demo the integration on the day that Impala was announced. We also have joint marketing campaigns and sales field engagement, as customers of Cloudera find tremendous benefit in engaging with Pentaho and vice-versa. Our tech makes it much easier and 20x faster to get Hadoop productive so their customers gravitate to us naturally.

James Dixon, Founder and Chief Geek / CTO, Pentaho Corporation
As “Chief Geek” (CTO) at Pentaho, James Dixon is responsible for Pentaho’s architecture and technology roadmap. James has over 15 years of professional experience in software architecture, development and systems consulting. Prior to Pentaho, James held key technical roles at AppSource Corporation (acquired by Arbor Software which later merged into Hyperion Solutions) and Keyola (acquired by Lawson Software). Earlier in his career, James was a technology consultant working with large and small firms to deliver the benefits of innovative technology in real-world environments.

Related Posts

On Big Data Velocity. Interview with Scott Jarr. on January 28, 2013

The Gaia mission, one year later. Interview with William O’Mullane. on January 16, 2013

Big Data Analytics– Interview with Duncan Ross on November 12, 2012

On Big Data, Analytics and Hadoop. Interview with Daniel Abadi. on December 5, 2012

Managing Big Data. An interview with David Gorbet on July 2, 2012

On Big Data: Interview with Dr. Werner Vogels, CTO and VP of Amazon.com. by Roberto V. Zicari on November 2, 2011

Analytics at eBay. An interview with Tom Fastner. on October 6, 2011


– Big Data: Challenges and Opportunities.
Roberto V. Zicari, October 5, 2012.
Abstract: In this presentation I review three current aspects related to Big Data:
1. The business perspective, 2. The Technology perspective, and 3. Big Data for social good.

Presentation (89 pages) | Intermediate| English | DOWNLOAD (PDF)| October 2012|

ODBMS.org: Big Data and Analytical Data Platforms.
Blog Posts | Free Software | Articles | PhD and Master Thesis |


You can follow ODBMS.org on Twitter : @odbmsorg.

Objects in Space vs. Friends in Facebook. http://www.odbms.org/blog/2011/04/objects-in-space-vs-friends-in-facebook/ http://www.odbms.org/blog/2011/04/objects-in-space-vs-friends-in-facebook/#comments Wed, 13 Apr 2011 06:14:09 +0000 http://www.odbms.org/blog/?p=765 “Data is everywhere, never be at a single location. Not scalable, not maintainable.”–Alex Szalay

I recently reported about the Gaia mission which is considered by the experts “the biggest data processing challenge to date in astronomy“.

Alex Szalay – who knows about data and astronomy, having worked from 1992 till 2008 with the Sloan Digital Sky Survey together with Jim Gray – wrote back in 2004:
“Astronomy is a good example of the data avalanche. It is becoming a data-rich science. The computational-Astronomers are riding the Moore’s Law curve, producing larger and larger datasets each year.” [Gray,Szalay 2004]

Gray and Szalay observed: “If you are reading this you are probably a “database person”, and have wondered why our “stuff” is widely used to manage information in commerce and government but seems to not be used by our colleagues in the sciences. In particular our physics, chemistry, biology, geology, and oceanography colleagues often tell us: “I tried to use databases in my project, but they were just too [slow | hard-to-use | expensive | complex]. So, I use files.” Indeed, even our computer science colleagues typically manage their experimental data without using database tools. What’s wrong with our database tools? What are we doing wrong?” [Gray,Szalay 2004].

Six years later, Szalay in his presentation “Extreme Data-Intensive Computing” presented what he calls “Jim Gray’s Law of Data Engineering”:
“1. Scientific computing is revolving around data.
2. Need scale-out solution for analysis.”

He also says about Scientific Data Analysis, or as he calls it (DISC: Data Intensive Scientific Computing): “Data is everywhere, never be at a single location. Not scalable, not maintainable.”[Szalay2010]

I would like to make three observations:

i. Great thinkers do anticipate the future. They “feel” it. Better said, they “see” more clearly how things really are.
Consider for example what the philosopher Friedrich Nietzsche wrote in his book “Thus Spoke Zarathustra”: “The middle is everywhere.” Confirmed 128 years later by “Data is everywhere”….

ii. “Astronomy is a good example of the data avalanche”: the Universe is beyond our comprehension, which means, I believe, that ultimately we will figure out that indeed “data is not scalable, and not maintainable.”

iii. I now dare to twist the quote: “If you are reading this you are probably a “database person”, and have wondered why our “stuff” is widely used to manage information in commerce and government but seems to not be used by our colleagues at Facebook or Google”….

I have asked Professor Alex Szalay for his opinion.

Alexander Szalay is a professor in the Department of Physics and Astronomy of the Johns Hopkins University. His research interests are theoretical astrophysics and galaxy formation.


Alex Szalay: This is very flattering… and I agree. But to be fair, the Facebook guys are using databases, first MySQL, and now Oracle in the middle of their whole system.

I have recently heard a talk by Jeff Hammerbacher, who built the original infrastructure for Facebook. Now he quit, and formed Cloudera. He did explicitly say that in the middle there will always be SQL, but people use Hadoop/MR for the ETL layer… and R and other tools for analytics and reporting.

As far as I can see Google is also gently moving towards not quite a database yet, but Jeff Dean is building Bloom filters and other indexes into BigTable. So even if it is NoSQL, some of their stuff starts to resemble a database….

So I think there is a general agreement that indices are useful, but for large scale data analytics, we do not need full ACID, transactions are much more a burden than an advantage. And there is a lot of religion there, of course.

I would put it in such a way, that there is a phase transition coming, and there is an increasing diversification, where there were only three DB vendors 5 years ago, now there are many options and a broad spectrum of really interesting niche tools. In a healthy ecosystem everything is a 1/f power law, and we will see a much bigger diversity. And this is great for academic research. “In every crisis there is an opportunity” — we again have a chance to do something significant in academia.
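The Bloom filters mentioned above are a compact index structure that can answer "definitely not present" or "possibly present" for a key, which lets a store skip reads that cannot match. The following is a minimal Python sketch of the general technique, not of BigTable's implementation; the class name, the default sizes, and the use of seeded SHA-256 hashes are choices made here for illustration.

```python
import hashlib


class BloomFilter:
    """A small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch simple; a real filter packs bits.
        self.bits = bytearray(num_bits)

    def _positions(self, key):
        """Derive `num_hashes` bit positions from seeded SHA-256 digests."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        """Set every bit position associated with the key."""
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        """Return False only if the key was certainly never added."""
        return all(self.bits[pos] for pos in self._positions(key))


if __name__ == "__main__":
    index = BloomFilter()
    for row_key in ("row-1", "row-2", "row-3"):
        index.add(row_key)
    print(index.might_contain("row-2"))
```

A lookup that returns False can skip the underlying storage entirely; a True result still has to be confirmed against the data, since different keys can share bit positions.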

RVZ: The National Science Foundation has awarded a $2M grant to you and your team of co-investigators from across many scientific disciplines, to build a 5.2 Petabyte Data-Scope, a new instrument targeted at analyzing the huge data sets emerging in almost all areas of science. The instrument will be a special data-supercomputer, the largest of its kind in the academic world.

What is the project about?

Alex Szalay: We feel that the Data-Scope is not a traditional multi-user computing cluster, but a new kind of instrument, that enables people to do science with datasets ranging between 100TB and 1000TB.
This is simply not possible today. The task is much more than just throwing the necessary storage together.
It requires a holistic approach: the data must be first brought to the instrument, then staged, and then moved to the computing nodes that have both enough compute power and enough storage bandwidth (450GBps) to perform the typical analyses, and then the (complex) analyses must be performed.
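To give a feel for the figures quoted above, the short Python sketch below estimates how long one full pass over a dataset would take at the stated storage bandwidth. It rests on assumptions not confirmed by the interview: that "450GBps" means 450 gigabytes per second in aggregate, that decimal units apply (1 TB = 1000 GB), and that the full bandwidth is sustained for the whole scan. The function name is hypothetical.

```python
def scan_seconds(dataset_tb, bandwidth_gb_per_s=450.0):
    """Seconds needed to read a dataset once at the given aggregate bandwidth.

    Assumes decimal units (1 TB = 1000 GB) and perfectly sustained throughput.
    """
    return dataset_tb * 1000.0 / bandwidth_gb_per_s


if __name__ == "__main__":
    for size_tb in (100, 1000):
        seconds = scan_seconds(size_tb)
        print(f"{size_tb} TB: {seconds:.0f} s (about {seconds / 60:.1f} min)")
```

Under these assumptions a 100 TB dataset takes roughly four minutes per pass and a 1000 TB dataset roughly 37 minutes, which suggests why storage bandwidth, and not only capacity, is central to the design described here.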

RVZ: Could you please explain what are the main challenges that this project poses?

Alex Szalay: It would be quite difficult, if not outright impossible to develop a new instrument with so many cutting-edge features without adequately considering all aspects of the system, beyond the hardware. We need to write at least a barebones set of system management tools (beyond the basic operating system etc), and we need to provide help and support for the teams who are willing to be at the “bleeding-edge” to be able to solve their big data problems today, rather than wait another 5 years, when such instruments become more common.
This is why we feel that our proposal reflects a realistic mix of hardware and personnel, which leads to a high probability of success.

The instrument will be open for scientists beyond JHU. There was an unbelievable amount of interest just at JHU in such an instrument, since analyzing such data sets is beyond the capability of any group on campus. There were 20 groups with data sets totaling over 2.8PB just within JHU, who would use the facility immediately, if it were available. We expect to go on-line at the end of this summer.


Extreme Data-Intensive Computing (.pdf)
Alex Szalay, The Johns Hopkins University, 2010.

[Gray,Szalay 2004]
Where the Rubber Meets the Sky: Bridging the Gap between Databases and Science.
Jim Gray, Microsoft Research and Alex Szalay, Johns Hopkins University.
IEEE Data Engineering Bulletin and Technical Report, MSR-TR-2004-110, Microsoft Research, 2004

Friedrich Nietzsche,
Thus Spoke Zarathustra: a Book for Everyone and No-one. (Also Sprach Zarathustra: Ein Buch für Alle und Keinen) – written between 1883 and 1885.

Related Posts

Objects in Space

Objects in Space: “Herschel” the largest telescope ever flown.

Objects in Space. –The biggest data processing challenge to date in astronomy: The Gaia mission.–

Big Data

Hadoop for Business: Interview with Mike Olson, Chief Executive Officer at Cloudera.

The evolving market for NoSQL Databases: Interview with James Phillips.


Hadoop for Business: Interview with Mike Olson, Chief Executive Officer at Cloudera. http://www.odbms.org/blog/2011/04/hadoop-for-business-interview-with-mike-olson-chief-executive-officer-at-cloudera/ http://www.odbms.org/blog/2011/04/hadoop-for-business-interview-with-mike-olson-chief-executive-officer-at-cloudera/#comments Mon, 04 Apr 2011 06:58:42 +0000 http://www.odbms.org/blog/?p=730 “Data is the big one challenge ahead” –Michael Olson.

I was interested to learn more about Hadoop, why it is important, and how it is used for business.
I have therefore interviewed Michael Olson, Chief Executive Officer, Cloudera.


RVZ: In my understanding of how the market of Data Management Platforms is evolving, I have identified three phases:

Phase I– New Proprietary data platforms developed: Amazon (Dynamo), Google (BigTable). Both systems remained proprietary and are in use by Amazon and Google.

Phase II- The advent of Open Source Developments: Apache projects such as Cassandra, Hadoop (MapReduce, Hive, Pig). Facebook and Yahoo! played major roles. Multitude of new data platforms emerge.

Phase III– Evolving Analytical Data Platforms. Hadoop for analytics. Companies such as Cloudera, but also IBM’s BigInsights, are in this space.

Q1. Would you agree with this? Would you have anything to add/change?

Michael Olson: I think that’s generally accurate. The one qualification I’d offer is that the process isn’t a waterfall, where the Phase I players do some innovative work that flows down to Phase II where it’s implemented, and so on. The open source projects have come up with some novel and innovative ideas not described in the papers. The arrival of commercial players like Cloudera and IBM was also more interesting than the progression suggests. We’ve spent considerable time with customers, but also in the community, and we’ll continue to build reputation by contributing alongside the many great people working on Apache Hadoop and related projects around the world. IBM’s been working with Hadoop for a couple of years, and BigInsights really flows out of some early exploratory work they did with a small number of early customers.

Q2. What is Hadoop? Why is it becoming so popular?

Michael Olson: Hadoop is an open source project, sponsored by the Apache Software Foundation, aimed at storing and processing data in new ways. It’s able to store any kind of information: you needn’t define a schema and it handles any kind of data, including the tabular stuff that an RDBMS understands, but also complex data like video, text, geolocation data, imagery, scientific data and so on. It can run arbitrary user code over the data it stores, and it’s able to do large-scale parallel processing very easily, so you can answer petabyte-scale questions quickly. It’s genuinely a new thing in the data management world.

Apache Hadoop, the project, consists of three major components: the Hadoop Distributed File System, or HDFS, which handles storage; MapReduce, the distributed processing infrastructure, which handles the work of running analyses on data; and Common, which is a bunch of shared infrastructure that both HDFS and MapReduce need. When most people talk about “Hadoop,” they’re talking about this collection of software.
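
To make the MapReduce component concrete, here is a minimal, single-process sketch of the map, shuffle and reduce steps in Python. It is not Hadoop's API: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `run_job`), the three-field log format and the sample records are all assumptions made for illustration. A real cluster would read the records from HDFS and run these phases in parallel on many machines.

```python
from collections import defaultdict

# Hypothetical web log records in the form "<ip> <path> <status>".
SAMPLE_LOG = [
    "10.0.0.1 /index.html 200",
    "10.0.0.2 /missing 404",
    "10.0.0.1 /about.html 200",
    "10.0.0.3 /index.html 200",
]


def map_phase(records, mapper):
    """Apply the mapper to every record and collect the (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle_phase(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def status_mapper(line):
    """Emit (status, 1) for each well-formed log line; skip anything else."""
    parts = line.split()
    return [(parts[2], 1)] if len(parts) == 3 else []


def sum_reducer(key, values):
    """Add up the counts emitted for one key."""
    return sum(values)


def run_job(records, mapper, reducer):
    """Run the three phases in order, in a single process."""
    return reduce_phase(shuffle_phase(map_phase(records, mapper)), reducer)


if __name__ == "__main__":
    print(run_job(SAMPLE_LOG, status_mapper, sum_reducer))
```

The user supplies only the mapper and the reducer; grouping by key is generic, which is what lets a framework distribute the work without knowing what the analysis means.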

Hadoop’s developed by a global community of programmers. It’s one hundred percent open source. No single company owns or controls it.

Q3. Who needs Hadoop, and for what kind of applications?

Michael Olson: The software was developed first by big web properties, notably Yahoo! and Facebook, to handle jobs like large-scale document indexing and web log processing. It was applied to real business problems by those properties. For example, it can examine the behavior of large numbers of users on a web site, cluster individual users into groups based on the things that they do, and then predict the behavior of individuals based on the behavior of the group.
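
The cluster-then-predict idea can be sketched in a few lines of Python. This is a deliberately simple stand-in, not the method used at Yahoo! or Facebook: users are grouped by their most frequent action instead of by a real clustering algorithm, and the user names, event names and purchase outcomes are invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical click histories and known outcomes for existing users.
HISTORY = {
    "ann": ["search", "search", "view"],
    "bob": ["search", "view", "search"],
    "cat": ["view", "view", "search"],
    "dan": ["view", "view", "view"],
}
PURCHASED = {"ann": True, "bob": True, "cat": False, "dan": True}


def dominant_action(events):
    """Return the most frequent action; ties are broken alphabetically."""
    counts = Counter(events)
    return min(counts, key=lambda action: (-counts[action], action))


def cluster_users(history):
    """Group users by their dominant action (a crude stand-in for clustering)."""
    groups = defaultdict(list)
    for user, events in history.items():
        groups[dominant_action(events)].append(user)
    return dict(groups)


def purchase_rate_by_group(groups, purchased):
    """Compute the fraction of users in each group who purchased."""
    return {
        group: sum(purchased[user] for user in users) / len(users)
        for group, users in groups.items()
    }


def predict(events, rates, default=0.0):
    """Predict a new user's purchase likelihood from the matching group's rate."""
    return rates.get(dominant_action(events), default)


if __name__ == "__main__":
    rates = purchase_rate_by_group(cluster_users(HISTORY), PURCHASED)
    print(rates)
    print(predict(["search", "search", "view"], rates))
```

At production scale the grouping and the per-group aggregation are the parts that would run as parallel jobs over the full user log.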

You should think of Hadoop in kind of the same way that you think of a relational database. All by itself, it’s a general-purpose platform for storing and operating on data. What makes the platform really valuable is the application that runs on top of it. Hadoop is hugely flexible. Lots of businesses want to understand users in the way that Yahoo! and Facebook do, above, but the platform supports other business workloads as well: portfolio analysis and valuation, intrusion detection and network security applications, sequence alignment in biotechnology and more.

Really, anyone who has a large amount of data — structured, complex, messy, whatever — and who wants to ask really hard questions about that data can use Hadoop to do that. These days, that’s just about every significant enterprise on the planet.

Q4. Can you use Hadoop stand-alone, or do you need other components? If so, which ones and why?

Michael Olson: Hadoop provides the storage and processing infrastructure you need, but that’s all. If you need to load data into the platform, then you either have to write some code or else go find a tool that does that, like Flume (for streaming data) or Sqoop (for relational data). There’s no query tool out of the box, so if you want to do interactive data explorations, you have to go find a tool like Apache Pig or Apache Hive to do that.

There’s actually a pretty big collection of those tools that we’ve found that customers need. It’s the main reason we created the open source package we call Cloudera’s Distribution including Apache Hadoop, or CDH. We assemble Apache Hadoop and tools like Pig, Hive, Flume, Sqoop and others — really, the full suite of open source tools that our users have found they require — and we make it available in a single package. It’s 100% open source, not proprietary to us. We’re big believers in the open source platform — customers love not being shackled to a vendor by proprietary code.

Analytical Data Platforms

Q5. It is said that more than 95% of enterprise data is unstructured, and that enterprise data volumes are growing rapidly. Is it true? What kind of applications generate such a high volume of unstructured data, and what can be done with such data?

Michael Olson: You have to talk to a firm like IDC to get numbers. What I will say is that what you call “unstructured” data (I prefer “complex” because all data has structure) is big and getting bigger really, really fast.

It used to be that data was generated at human scale. You’d buy or sell something and a transaction record would happen. You’d hire or fire someone and you’d hit the “employee” table in your database.

These days, data comes from machines talking to machines. The servers, switches, routers and disks on your LAN are all furiously conversing. The content of their messages is interesting, and also the patterns and timing of the messages that they send to one another. (In fact, if you can capture all that data and do some pattern detection and machine learning, you have a pretty good tool for finding bad guys breaking into your network.) Same is true for programmed trading on Wall Street, mobile telephony and many other pieces of technology infrastructure we rely on.

Hadoop knows how to capture and store that data cheaply and reliably, even if you get to petabytes. More importantly, Hadoop knows how to process that data — it can run different algorithms and analytic tools, spread across its massively parallel infrastructure, to answer hard questions on enormous amounts of information very quickly.

Q6. Why build data analysis applications on Hadoop? Why not use existing BI products?

Michael Olson: Lots of BI tools today talk to relational databases. If that’s the case, then you’re constrained to operating on data types that an RDBMS understands, and most of the data in the world (see above) doesn’t fit in an RDBMS. Also, there are some kinds of analyses, such as complex modeling of systems, user clustering and behavioral analysis, and natural language processing, that BI tools were never designed to handle.

I want to be clear: RDBMS engines and the BI tools that run on them are excellent products, hugely successful and handling mission-critical problems for demanding users in production every day. They’re not going away. But for a new generation of problems that they weren’t designed to consider, a new platform is necessary, and we believe that that platform is Apache Hadoop, with a new suite of analytic tools, from existing or new vendors, that understand the data and can answer the questions that Hadoop was designed to handle.

Q7. Why Cloudera? What do you see as your main value proposition?

Michael Olson: We make Hadoop consumable in the way that enterprises require.
Cloudera Enterprise provides an open source platform for data storage and analysis, along with the management, monitoring and administrative applications that enterprise IT staff can use to keep the cluster running. We help our customers set and meet SLAs for work on the cluster, do capacity planning, provision new users, set and enforce policies and more. Of course Cloudera Enterprise comes with 24×7 support and a subscription to updates and fixes during the year.

When one of our customers deploys Hadoop, it’s to solve serious business problems. They can’t tolerate missed deadlines. They need their existing IT staff, who probably know how to run an Oracle database or VMware or other big data center infrastructure. That kind of person can absolutely run Hadoop, but needs the right applications and dashboards to do so. That’s what we provide.

Q8. In your opinion, what role will RDBMS and classical Data Warehouse systems play in the future in the market for Analytical Data Platforms? What about other data stores such as NoSQL and Object Databases? Will they play a role?

Michael Olson: I believe that RDBMS and classic EDWs are here to stay. They’re outstanding at what they do — they’ve evolved alongside the problems they solve for the last thirty years. You’d be nuts to take them on. We view Hadoop as strictly complementary, solving a new class of problems: Complex analyses, complex data, generally at scale.

As to NoSQL and ODBMS, I don’t have a strong view. The “NoSQL” moniker isn’t well-defined, in my opinion.
There are a bunch of different key-value stores out there that provide a bunch of different services and abstractions. Really, it’s knives and arrows and battleships — they’re all useful, but which one you want depends on what kind of fight you’re in.

Q9. Is Cloud technology important in this context? Why?

Michael Olson: “Cloud” is a deployment detail, not fundamental. Where you run your software and what software you run are two different decisions, and you need to make the right choice in both cases.

Q10. Looking at three elements: Data, Platform, and Analysis, what are the main business and technical challenges ahead?

Michael Olson: Data is the big one. Seriously: More. More complex, more variable, more useful if you can figure out what’s locked up in it. More than you can imagine, even if you take this statement into account.

We obviously need to improve the platforms we have, and I think the next decade will be an exciting time for that. That’s good news — I’ve been in the database industry since 1986, and it has frankly been pretty dull. Same is true for analyses, but our opportunities there will be constrained by both the platforms we have and the data on which we can operate.

Q11. Anything you wish to add?

Michael Olson: Thanks for the opportunity!

Michael Olson, Chief Executive Officer, Cloudera.
Mike was formerly CEO of Sleepycat Software, makers of Berkeley DB, the open source embedded database engine. Mike spent two years at Oracle Corporation as Vice President for Embedded Technologies after Oracle’s acquisition of Sleepycat in 2006. Prior to joining Sleepycat, Mike held technical and business positions at database vendors Britton Lee, Illustra Information Technologies and Informix Software. Mike has Bachelor’s and Master’s degrees in Computer Science from the University of California at Berkeley.

Related Post

The evolving market for NoSQL Databases: Interview with James Phillips.

