Hadoop for Business: Interview with Mike Olson, Chief Executive Officer at Cloudera.
“Data is the big challenge ahead” –Michael Olson.
I was interested to learn more about Hadoop, why it is important, and how it is used for business.
I have therefore interviewed Michael Olson, Chief Executive Officer, Cloudera.
RVZ: In my understanding of how the market of Data Management Platforms is evolving, I have identified three phases:
Phase I– New Proprietary data platforms developed: Amazon (Dynamo), Google (BigTable). Both systems remained proprietary and are in use by Amazon and Google.
Phase II– The advent of open source developments: Apache projects such as Cassandra and Hadoop (MapReduce, Hive, Pig). Facebook and Yahoo! played major roles. A multitude of new data platforms emerged.
Phase III– Evolving Analytical Data Platforms. Hadoop for analytics. Companies such as Cloudera, but also IBM’s BigInsights, are in this space.
Q1. Would you agree with this? Would you have anything to add/change?
Michael Olson: I think that’s generally accurate. The one qualification I’d offer is that the process isn’t a waterfall, where the Phase I players do some innovative work that flows down to Phase II where it’s implemented, and so on. The open source projects have come up with some novel and innovative ideas not described in the papers. The arrival of commercial players like Cloudera and IBM was also more interesting than the progression suggests. We’ve spent considerable time with customers, but also in the community, and we’ll continue to build reputation by contributing alongside the many great people working on Apache Hadoop and related projects around the world. IBM’s been working with Hadoop for a couple of years, and BigInsights really flows out of some early exploratory work they did with a small number of early customers.
Q2. What is Hadoop? Why is it becoming so popular?
Michael Olson: Hadoop is an open source project, sponsored by the Apache Software Foundation, aimed at storing and processing data in new ways. It’s able to store any kind of information — you needn’t define a schema and it handles any kind of data, including the tabular stuff that an RDBMS understands, but also complex data like video, text, geolocation data, imagery, scientific data and so on. It can run arbitrary user code over the data it stores, and it’s able to do large-scale parallel processing very easily, so you can answer petabyte-scale questions quickly. It’s genuinely a new thing in the data management world.
Apache Hadoop, the project, consists of three major components: the Hadoop Distributed File System, or HDFS, which handles storage; MapReduce, the distributed processing infrastructure, which handles the work of running analyses on data; and Common, which is a bunch of shared infrastructure that both HDFS and MapReduce need. When most people talk about “Hadoop,” they’re talking about this collection of software.
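To make the MapReduce idea concrete, here is a minimal sketch of the programming model in plain Python — a word count, the canonical example. This is an illustration of the map/reduce pattern only, not Hadoop’s actual Java API; in a real cluster, the map and reduce phases run in parallel across many machines over data stored in HDFS.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every input record."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce: sum the counts for each key (Hadoop groups pairs by key
    and runs reducers for different keys in parallel)."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["the quick brown fox", "the lazy dog", "the fox"]
print(reduce_phase(map_phase(docs)))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

The key property is that both phases operate on independent records or keys, which is what lets Hadoop spread the work across thousands of nodes without the programmer managing the parallelism.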
Hadoop’s developed by a global community of programmers. It’s one hundred percent open source. No single company owns or controls it.
Q3. Who needs Hadoop, and for what kinds of applications?
Michael Olson: The software was developed first by big web properties, notably Yahoo! and Facebook, to handle jobs like large-scale document indexing and web log processing. It was applied to real business problems by those properties. For example, it can examine the behavior of large numbers of users on a web site, cluster individual users into groups based on the things that they do, and then predict the behavior of individuals based on the behavior of the group.
You should think of Hadoop in kind of the same way that you think of a relational database. All by itself, it’s a general-purpose platform for storing and operating on data. What makes the platform really valuable is the application that runs on top of it. Hadoop is hugely flexible. Lots of businesses want to understand users in the way that Yahoo! and Facebook do, above, but the platform supports other business workloads as well: portfolio analysis and valuation, intrusion detection and network security applications, sequence alignment in biotechnology and more.
Really, anyone who has a large amount of data — structured, complex, messy, whatever — and who wants to ask really hard questions about that data can use Hadoop to do that. These days, that’s just about every significant enterprise on the planet.
Q4. Can you use Hadoop stand-alone, or do you need other components? If so, which ones, and why?
Michael Olson: Hadoop provides the storage and processing infrastructure you need, but that’s all. If you need to load data into the platform, then you either have to write some code or else go find a tool that does that, like Flume (for streaming data) or Sqoop (for relational data). There’s no query tool out of the box, so if you want to do interactive data explorations, you have to go find a tool like Apache Pig or Apache Hive to do that.
There’s actually a pretty big collection of those tools that we’ve found that customers need. It’s the main reason we created the open source package we call Cloudera’s Distribution including Apache Hadoop, or CDH. We assemble Apache Hadoop and tools like Pig, Hive, Flume, Sqoop and others — really, the full suite of open source tools that our users have found they require — and we make it available in a single package. It’s 100% open source, not proprietary to us. We’re big believers in the open source platform — customers love not being shackled to a vendor by proprietary code.
Analytical Data Platforms
Q5. It is said that more than 95% of enterprise data is unstructured, and that enterprise data volumes are growing rapidly. Is that true? What kinds of applications generate such high volumes of unstructured data? And what can be done with such data?
Michael Olson: You have to talk to a firm like IDC to get numbers. What I will say is that what you call “unstructured” data (I prefer “complex” because all data has structure) is big and getting bigger really, really fast.
It used to be that data was generated at human scale. You’d buy or sell something and a transaction record would happen. You’d hire or fire someone and you’d hit the “employee” table in your database.
These days, data comes from machines talking to machines. The servers, switches, routers and disks on your LAN are all furiously conversing. The content of their messages is interesting, and also the patterns and timing of the messages that they send to one another. (In fact, if you can capture all that data and do some pattern detection and machine learning, you have a pretty good tool for finding bad guys breaking into your network.) Same is true for programmed trading on Wall Street, mobile telephony and many other pieces of technology infrastructure we rely on.
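As a toy illustration of the timing-pattern analysis Olson alludes to (not a real intrusion-detection system — the function and threshold here are hypothetical), one simple signal is an unusually rapid burst of machine-to-machine messages:

```python
def flag_bursts(timestamps, threshold=0.1):
    """Flag message arrivals whose gap from the previous message is
    shorter than `threshold` seconds - a crude stand-in for the kind of
    timing-pattern detection described above."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [i for i, gap in enumerate(gaps, start=1) if gap < threshold]

# Normal traffic roughly every second, then a burst 10 ms apart.
arrivals = [0.0, 1.0, 2.0, 2.01, 2.02, 3.0]
print(flag_bursts(arrivals))  # [3, 4]: the two burst arrivals
```

At petabyte scale, a computation like this would be expressed as MapReduce jobs over the raw logs in HDFS, with more sophisticated statistics or machine-learned models in place of a fixed threshold.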
Hadoop knows how to capture and store that data cheaply and reliably, even if you get to petabytes. More importantly, Hadoop knows how to process that data — it can run different algorithms and analytic tools, spread across its massively parallel infrastructure, to answer hard questions on enormous amounts of information very quickly.
Q6. Why build data analysis applications on Hadoop? Why not use already existing BI products?
Michael Olson: Lots of BI tools today talk to relational databases. If that’s the case, then you’re constrained to operating on data types that an RDBMS understands, and most of the data in the world — see above — doesn’t fit in an RDBMS. Also, there are some kinds of analyses — complex modeling of systems, user clustering and behavioral analysis, natural language processing — that BI tools were never designed to handle.
I want to be clear: RDBMS engines and the BI tools that run on them are excellent products, hugely successful and handling mission-critical problems for demanding users in production every day. They’re not going away. But for a new generation of problems that they weren’t designed to consider, a new platform is necessary, and we believe that that platform is Apache Hadoop, with a new suite of analytic tools, from existing or new vendors, that understand the data and can answer the questions that Hadoop was designed to handle.
Q7. Why Cloudera? What do you see as your main value proposition?
Michael Olson: We make Hadoop consumable in the way that enterprises require.
Cloudera Enterprise provides an open source platform for data storage and analysis, along with the management, monitoring and administrative applications that enterprise IT staff can use to keep the cluster running. We help our customers set and meet SLAs for work on the cluster, do capacity planning, provision new users, set and enforce policies and more. Of course Cloudera Enterprise comes with 24×7 support and a subscription to updates and fixes during the year.
When one of our customers deploys Hadoop, it’s to solve serious business problems. They can’t tolerate missed deadlines. They need their existing IT staff, who probably know how to run an Oracle database or VMware or other big data center infrastructure. That kind of person can absolutely run Hadoop, but needs the right applications and dashboards to do so. That’s what we provide.
Q8. In your opinion, what role will RDBMS and classical Data Warehouse systems play in the future in the market for Analytical Data Platforms? What about other data stores such as NoSQL and object databases? Will they play a role?
Michael Olson: I believe that RDBMS and classic EDWs are here to stay. They’re outstanding at what they do — they’ve evolved alongside the problems they solve for the last thirty years. You’d be nuts to take them on. We view Hadoop as strictly complementary, solving a new class of problems: Complex analyses, complex data, generally at scale.
As to NoSQL and ODBMS, I don’t have a strong view. The “NoSQL” moniker isn’t well-defined, in my opinion.
There are a bunch of different key-value stores out there that provide a bunch of different services and abstractions. Really, it’s knives and arrows and battleships — they’re all useful, but which one you want depends on what kind of fight you’re in.
Q9. Is Cloud technology important in this context? Why?
Michael Olson: “Cloud” is a deployment detail, not fundamental. Where you run your software and what software you run are two different decisions, and you need to make the right choice in both cases.
Q10. Looking at three elements: Data, Platform, and Analysis, what are the main business and technical challenges ahead?
Michael Olson: Data is the big one. Seriously: More. More complex, more variable, more useful if you can figure out what’s locked up in it. More than you can imagine, even if you take this statement into account.
We obviously need to improve the platforms we have, and I think the next decade will be an exciting time for that. That’s good news — I’ve been in the database industry since 1986, and it has frankly been pretty dull. Same is true for analyses, but our opportunities there will be constrained by both the platforms we have and the data on which we can operate.
Q11. Anything you wish to add?
Michael Olson: Thanks for the opportunity!
Michael Olson, Chief Executive Officer, Cloudera.
Mike was formerly CEO of Sleepycat Software, makers of Berkeley DB, the open source embedded database engine. Mike spent two years at Oracle Corporation as Vice President for Embedded Technologies after Oracle’s acquisition of Sleepycat in 2006. Prior to joining Sleepycat, Mike held technical and business positions at database vendors Britton Lee, Illustra Information Technologies and Informix Software. Mike has Bachelor’s and Master’s degrees in Computer Science from the University of California at Berkeley.