“We’re not all LinkedIns and Facebooks; we don’t have budgets to hire 1000s of new hires with these skills, and what’s more we’ve invested in existing skills and people today. So to democratize Big Data, you need it to be consumable and integrated. These will flatten the time to value for Hadoop” — Paul C. Zikopoulos.
I have interviewed Paul C. Zikopoulos, Director of Technical Professionals for IBM Software Group’s Information Management division. The topic: Apache Hadoop and Big Data, State of the Union in 2013 and Vision for the future.
Q1. What do you think is still needed for big data analytics to be really useful for the enterprise?
Paul C. Zikopoulos: Integration and Consumability. We’re not all LinkedIns and Facebooks; we don’t have budgets to hire 1000s of new hires with these skills, and what’s more we’ve invested in existing skills and people today.
So to democratize Big Data, you need it to be consumable and integrated.
These will flatten the time to value for Hadoop. IBM is working really hard in these areas. I could go into other areas, but this is key.
Q2. Hadoop is still quite new for many enterprises, and different enterprises are at different stages in their Hadoop journey.
When you speak with your customers what are the typical use cases and requirements they have?
Paul C. Zikopoulos: No matter what industry I’m working with, 90% of the Big Data use cases always have 2 common denominators: Whole Population Analytics to break free of traditional capacity constrained samples and analytics for data at-rest moving to in-motion.
So if you think about churn prediction, next best action, next best offer, fraud prediction, condition monitor, out of tolerance quality predictors, and more – it’s all going to rely on using more data (could be volume, could be variety, and often both) to build better models.
If you’re looking for specific use cases by industry, here’s a bunch of them that we’ve worked with clients on at IBM.
Q3. How do you categorize the various stages of the Hadoop usage in the enterprises?
Paul C. Zikopoulos: The IBM Institute for Business Value did a joint study with Said Business School (University of Oxford). They talked to a lot of Big Data folks and found that 28% were in the pilot phase, 24% haven’t started anything, and 47% are planning. After going through their research, they broke the answers into four stages: Educate / Explore / Engage / Execute.
So I’ll detail those four stages, but you can get the entire study here.
Educate: Building a base of knowledge (24 percent of respondents).
In the Educate stage, the primary focus is on awareness and knowledge development.
Almost 25 percent of respondents indicated they are not yet using big data within their organizations. While some remain relatively unaware of the topic of big data, our interviews suggest that most organizations in this stage are studying the potential benefits of big data technologies and analytics, and trying to better understand how big data can help address important business opportunities in their own industries or markets.
Within these organizations, it is mainly individuals doing the knowledge gathering as opposed to formal work groups, and their learnings are not yet being used by the organization. As a result, the potential for big data has not yet been fully understood and embraced by the business executives.
Explore: Defining the business case and roadmap (47 percent).
The focus of the Explore stage is to develop an organization’s roadmap for big data development.
Almost half of respondents reported formal, ongoing discussions within their organizations about how to use big data to solve important business challenges.
Key objectives of these organizations include developing a quantifiable business case and creating a big data blueprint.
This strategy and roadmap takes into consideration existing data, technology and skills, and then outlines where to start and how to develop a plan aligned with the organization’s business strategy.
Engage: Embracing big data (22 percent).
In the Engage stage, organizations begin to prove the business value of big data, as well as perform an assessment of their technologies and skills.
More than one in five respondent organizations is currently developing POCs to validate the requirements associated with implementing big data initiatives, as well as to articulate the expected returns. Organizations in this group are working – within a defined, limited scope – to understand and test the technologies and skills required to capitalize on new sources of data.
Execute: Implementing big data at scale (6 percent).
In the Execute stage, big data and analytics capabilities are more widely operationalized and implemented within the organization. However, only 6 percent of respondents reported that their organizations have implemented two or more big data solutions at scale – the threshold for advancing to this stage. The small number of organizations in the Execute stage is consistent with the implementations we see in the marketplace. Importantly, these leading organizations are leveraging big data to transform their businesses and thus are deriving the greatest value from their information assets.
With the rate of enterprise big data adoption accelerating rapidly – as evidenced by 22 percent of respondents in the Engage stage, with either POCs or active pilots underway – we expect the percentage of organizations at this stage to more than double over the next year. NOW ! While only 6% are executing, about 25% of respondents in this study are ‘piloting’ initiatives.
Q4. Could you give us some examples of how you get (Big) Data Insights?
Paul C. Zikopoulos: IBM has a non-forked version of Hadoop called BigInsights.
When it comes to open source, it’s really hard to look past IBM’s achievements. Lucene, Apache Derby, Apache Jakarta, Apache Geronimo, Eclipse and so much more – so it shouldn’t surprise anyone that IBM is squarely in Hadoop’s corner.
Our strategy here is Embrace and Extend. We will embrace the open source Hadoop community. We are a vibrant part of it (in the latest Hadoop patch as of the time of this interview, the most fixes came from IBM; we have a number of contributions to HBase, and more). IBM has a long history in understanding enterprise concerns; that’s the extend part.
Some of the extensions work just fine with open source. For example, we provide a rich management tool and a quick installer, and we concentrate open ports into a single one to make it easier for your Hadoop cluster to pass an audit.
Some of our extensions overlay Hadoop. For example, our Adaptive MapReduce can deliver a 30% performance boost by using its algorithms to optimize the overhead of MapReduce task startup.
We have enhanced schedulers, announced the option to use GPFS as the file system which provides a lot of benefits, and more. But these are optional. If you use BigInsights you are using a non-forked Hadoop distro.
Some of our extensions are ’round-trip-able’ (if you use them, you can walk back to pure Open Source Hadoop at any time) and some aren’t. If you want to get our fast-to-install, non-extended version of Hadoop for free, you can download InfoSphere BigInsights Basic Edition here.
Q5. What are the main technical challenges for big data analytics when data is in motion rather than at rest?
Paul C. Zikopoulos: Well, the challenge is to ask yourself: how do I take the analytics artifacts that I learn at rest, either in Hadoop or the EDW, and get them to real time? I call this Nowcasting instead of Forecasting.
In order to do that, with agility and speed, you’re going to want a platform that’s designed for both in-motion and at-rest analytics.
I’m not seeing that in the marketplace today. In fact, I’m not seeing a focus on in-motion analytics.
When I refer to in-motion, I refer to the Velocity attribute of Big Data (people often talk about the Big Vs in Big Data, so that’s the one for in-motion). Velocity IS the game changer.
It’s not just how fast data is produced or changes, BUT the speed at which it must be understood, acted upon, and turned into something useful. So to me the main technical challenge in getting to in-motion from at-rest is the fact that I’m not really seeing that kind of true integration, and it’s something we squarely hit on in the IBM Big Data platform.
Let me share an example: if you were to build some text analytics function at rest in Hadoop, perhaps an email phrase that’s highly correlated with a customer churn event, you can SEAMLESSLY take that artifact and deploy it on InfoSphere Streams (our Big Data Velocity engine) without any work at all; you just deploy the compiled AOG file. Wow! Platform.
The other challenge is just the volume and speed in which you have to process events. IBM invented our streaming products with the US government – and it can scale. For example, one of our clients analyzes and correlates over 5M market messages a second to execute algorithmic option trades with average latency of 50 microseconds.
The point is that this is not CEP; this is not 1 or 2 servers with 10-20,000 events a second. CEP can be a style or a technology.
You need to be able to do the style, but you need a technology platform too. If you asked me what is one of the biggest things IBM has done in the Big Data space, it is flattening the technical challenge to perform Big Data analytics on data in motion.
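The idea in this answer, building an analytic artifact on data at rest and then reusing it unchanged on data in motion, can be illustrated with a minimal Python sketch. This is not InfoSphere Streams, BigInsights, or the AOG format; the phrases and all function names below are hypothetical and chosen only for illustration.

```python
from typing import Callable, Iterable, Iterator, List

# Hypothetical phrases, assumed to have been found correlated with churn
# during analysis of stored (at-rest) email.
CHURN_PHRASES = ("cancel my account", "switch provider")


def build_churn_rule(phrases: Iterable[str]) -> Callable[[str], bool]:
    """Package the learned phrases into one reusable predicate (the 'artifact')."""
    lowered = tuple(p.lower() for p in phrases)

    def rule(text: str) -> bool:
        body = text.lower()
        return any(p in body for p in lowered)

    return rule


def score_at_rest(rule: Callable[[str], bool], stored: List[str]) -> List[str]:
    """Batch use: scan a stored collection in one pass."""
    return [email for email in stored if rule(email)]


def score_in_motion(rule: Callable[[str], bool], incoming: Iterable[str]) -> Iterator[str]:
    """Streaming use: flag each message as it arrives, without storing the stream."""
    for email in incoming:
        if rule(email):
            yield email


if __name__ == "__main__":
    rule = build_churn_rule(CHURN_PHRASES)
    print(score_at_rest(rule, ["Please cancel my account", "Thanks for the help"]))
    for hit in score_in_motion(rule, iter(["hello", "I will switch provider soon"])):
        print("alert:", hit)
```

The same `rule` object serves both paths, which is the point of the example: only the way data reaches the rule changes, not the rule itself.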
Q6. In your opinion, is there a technology which is best suited to build a Big Data Analytics Data Platform? If yes, which one?
Paul C. Zikopoulos: Well you say the word platform, and that’s going to imply a number of technologies. Right?
When I get asked this question, I refer to my Big Data Platform Manifesto: this is what you’re going to need to form a Big Data platform. Many people think big data is about Hadoop technology. It is and it isn’t. It’s about a lot more than Hadoop.
One of the key requirements is to understand and navigate federated sources of big data – to discover data in place.
New technology has emerged that discovers, indexes, searches, and navigates diverse sources of big data. Of course big data is also about Hadoop. Hadoop is a collection of open source capabilities.
Two of the most prominent ones are Hadoop Distributed File System (HDFS) for storing a variety of information, and MapReduce – a parallel processing engine.
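For readers unfamiliar with the MapReduce model named above, here is a minimal sketch of its three steps (map, shuffle, reduce) as a word count. It is plain single-process Python, not the Hadoop API, and the function names are illustrative assumptions; in Hadoop the framework performs the shuffle and distributes the map and reduce tasks across the cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle, and reduce in sequence over an iterable of text records."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["big data at rest", "big data in motion"]))
```

Because each call to `map_phase` looks at one record and each call to `reduce_phase` looks at one key, both steps can run in parallel on separate machines, which is what makes the model a parallel processing engine.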
Data warehouses also manage big data: the volume of structured data is growing quickly. The ability to run deep analytic queries on huge volumes of structured data is a big data problem. It requires massively parallel processing data warehouses and purpose-built appliances for deep analytics.
Big data isn’t just at rest – it’s also in motion. Streaming data represents an entirely different big data problem – the ability to quickly analyze and act upon data while it’s still moving. This new technology opens a world of possibilities – from processing volumes of data that were just not practical to store, to detecting insight and responding quickly.
As much of the world’s big data is unstructured and in textual content, text analytics is a critical component to analyze and derive meaning from text.
Then there is integration and governance technology – ETL, data quality, security, MDM, and lifecycle management. Integration and governance technology establishes the veracity of big data, and is critical in determining whether information is trusted.
Finally, consumability: characteristics here include being able to declare what you want done rather than how to do it, expert integrated systems, deployment patterns, and so on.
So if you wanted a short answer: a Big Data platform needs to be consumable and governable, give the opportunity for analytics in motion and at rest (in an EDW AND things like Hadoop), discover and index Big Data, and finally, provide the ability to analyze unstructured data.
Notice I didn’t mention one IBM product above; you can piece together a platform with a mash of vendors if you want. If you start to look into what IBM is doing, and although I’m biased and work there, I think you will find we have a true Big Data platform.
Q6. Does it make sense in your opinion to virtualize Hadoop?
Paul C. Zikopoulos: It can. It’s going to depend on the use case, right? I see a lot of efforts by EMC in that area and that’s cool. Of course the Cloud and Hadoop kind of go hand in hand. I think this space is growing by leaps and bounds…fun to watch.
Q7. What is your opinion on the evolution of Hadoop?
Paul C. Zikopoulos: It’s just that – an evolution. I think that innovation is going to deliver more and more of what enterprises need from a ‘hardening’ aspect as time goes on. Hadoop 2.0 is a big step forward for availability. It’s out there now, but not ready for production in my humble opinion (although some vendors are shipping it, their documentation tells you it’s not ready for production). The next version of MapReduce (YARN) and making Hive really fast (Tez) are also part of the evolution; stay close here, it’s changing fast!
That’s the best part of community. Now if you look at most of the vendors in this space, many are getting distracted and working on non-Hadoop’ish things to help Hadoop, and that’s fine too. We’re on a good path here.
There are a lot of vendors here, and more are popping up all the time (Intel, for example, just announced its own distribution). At some point, I think there will be a consolidation of distros out there, but with the hype around it right now, it will continue to evolve.
For example, it’s becoming more than just a MapReduce processing area. Right? Lots of technologies are storing data in Hadoop’s HDFS, but bypassing MapReduce. So I find the file system key to the evolution.
Q8. Can In-Memory Data Management play a significant role for Big Data Analytics? If yes, how?
Paul C. Zikopoulos: I think it’s essential, but in a Big Data world, it would seem that the amount of data we are storing – at least right now – is proportionally bigger than the amount we can get into memory at a cost effective rate.
So in-memory needs to harmoniously live with the database. If you look at what we did with BLU Acceleration and DB2, we did just that.
In-memory columnar and typical relational tables live side by side in the same database kernel.
You can work with both structures together, in the same memory structures, queries, and so on.
When you can’t fit all the columns into memory, performance either falls off a cliff or, worse, the system could crash.
From an analytics side, BLU Acceleration allows you to run queries faster, amazingly faster. That’s going to get you more iterations of queries, analytics and what not. It’s not for everything, but if you can help my reports run faster, that’s cool. So imagine you find some interesting pieces of information in a Discovery Zone powered by a Hadoop engine; pulling that out, packing it into an in-memory structure and surfacing it to the enterprise is going to be pretty cool.
Q9. What about elastic computing in the Cloud? How does it relate to Big Data Analytics?
Paul C. Zikopoulos: This is pretty important because I need the utility-like nature of a Hadoop cluster, without the capital investment. Time to analytics is the benefit here. After all, if you’re a start-up analytics firm seeking venture capital funding, do you really walk in to your investor and ask for millions to set up a cluster? You’ll get kicked out the door.
No, you go to Rackspace or Amazon, swipe a card, and get going. IBM is there with its Hadoop clusters (private and public) and you’re looking at clusters that cost as low as $0.60 US an hour.
I think at one time I costed out a 100 node Hadoop cluster for an hour and it was like $34US – and the price has likely gone down. What’s more, your cluster will be up and running in 30 minutes. So on-premise or off-premise Cloud is key for these environments.
Paul C. Zikopoulos, B.A., M.B.A., is the Director of Technical Professionals for IBM Software Group’s Information Management division and additionally leads the World Wide Competitive Database and Big Data Technical Sales Acceleration teams.
Paul is an award winning writer and speaker with more than 19 years of experience in Information Management.
Paul is seen as a global expert in Big Data and database. He was picked by SAP as one of its “Top 50 Big Data Twitter Influencers”, named by BigData Republic to its “Top 100 Most Influential” list, listed by Technopedia as “A Big Data Expert to Follow”, and consulted on Big Data by the popular TV show “60 Minutes”.
Paul has written more than 350 magazine articles and 16 books, some of which include “Harness the Power of Big Data”, “Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data”, “Warp Speed, Time Travel, Big Data, and More: DB2 10 New Features”, “DB2 pureScale: Risk Free Agile Scaling”, “DB2 Certification for Dummies”, “DB2 for Dummies”, and more.
In his spare time, he enjoys all sorts of sporting activities, including running with his dog Chachi, avoiding punches in his MMA training, and trying to figure out the world according to Chloë—his daughter.
– Harness the Power of Big Data The IBM Big Data Platform.
Paul C. Zikopoulos, Dirk deRoos, Krishnan Parasuraman, Thomas Deutsch, David Corrigan, James Giles, Chris Eaton.
Book, Copyright © 2013 by The McGraw-Hill Companies.
Download Book (.PDF 250 pages)
– Warp Speed, Time Travel, Big Data, and More. DB2 10 for Linux, UNIX, and Windows New Features.
Paul Zikopoulos, George Baklarz, Matt Huras, Walid Rjaibi, Dale McInnis, Matthias Nicola, Leon Katsnelson.
Book, Copyright © 2012 by The McGraw-Hill Companies.
Download book (.PDF 217 pages)
– Understanding Big Data Analytics for Enterprise Class Hadoop and Streaming Data.
Paul C. Zikopoulos, Chris Eaton, Dirk deRoos, Thomas Deutsch, George Lapis,
Book, Copyright © 2012 by The McGraw-Hill Companies.
Download book (.PDF 142 pages)