How to run a Big Data project. Interview with James Kobielus
“You need a team of dedicated data scientists to develop and tune the core intellectual property–statistical, predictive, and other analytic models–that drive your Big Data applications. You don’t often think of data scientists as “programmers,” per se, but they are the pivotal application developers in the age of Big Data.”–James Kobielus
Managing the pitfalls and challenges of Big Data projects. On this topic I have interviewed James Kobielus, IBM Senior Program Director, Product Marketing, Big Data Analytics solutions.
Q1. Why run a Big Data project in the enterprise?
James Kobielus: Many Big Data projects are in support of customer relationship management (CRM) initiatives in marketing, customer service, sales, and brand monitoring. Justifying a Big Data project with a CRM focus involves identifying the following quantitative ROI:
• Volume-based value: The more comprehensive your 360-degree view of customers and the more historical data you have on them, the more insight you can extract from it all and, all things considered, the better decisions you can make in the process of acquiring, retaining, growing and managing those customer relationships.
• Velocity-based value: The more customer data you can ingest rapidly into your big-data platform and the more questions that a user can pose more rapidly against that data (via queries, reports, dashboards, etc.) within a given time period prior, the more likely you are to make the right decision at the right time to achieve your customer relationship management objectives.
• Variety-based value: The more varied customer data you have – from the CRM system, social media, call-center logs, etc. – the more nuanced portrait you have on customer profiles, desires and so on, hence the better-informed decisions you can make in engaging with them.
• Veracity-based value: The more consolidated, conformed, cleansed, consistent current the data you have on customers, the more likely you are to make the right decisions based on the most accurate data.
How can you attach a dollar value to any of this? It’s not difficult. Customer lifetime value (CLV) is a standard metric that you can calculate from big-data analytics’ impact on customer acquisition, onboarding, retention, upsell, cross-sell and other concrete bottom-line indicators, as well as from corresponding improvements in operational efficiency.
Q2. What are the business decisions that need to be made in order to successfully support a Big Data project in the enterprise?
James Kobielus: In order to successfully support a Big Data project in the enterprise, you have to make the infrastructure and applications production-ready in your operations.
Production-readiness means that your big-data investment is fit to realize its full operational potential. If you think “productionizing” can be done in a single step, such as by, say, introducing HDFS NameNode redundancy, then you need a cold slap of reality. Productionizing demands a lifecycle focus that encompasses all of your big-data platforms, not just a single one (e.g., Hadoop/HDFS), and addresses more than just a single requirement (e.g., ensuring a highly available distributed file system).
Productionizing involves jumping through a series of procedural hoops to ensure that your big-data investment can function as a reliable business asset. Here are several high-level considerations to keep in mind as you ready your big-data initiative for primetime deployment:
• Stakeholders: Have you aligned your big-data initiatives with stakeholder requirements? If stakeholders haven’t clearly specified their requirements or expectations for your big-data initiative, it’s not production-ready. The criteria of production-readiness must conform to what stakeholders require, and that depends greatly on the use cases and applications they have in mind for Big Data. Service-level agreements (SLAs) vary widely for Big Data deployed as an enterprise data warehouse (EDW), as opposed to an exploratory data-science sandbox, an unstructured information transformation tier, a queryable archive, or some other use. SLAs for performance, availability, security, governance, compliance, monitoring, auditing and so forth will depend on the particulars of each big-data application, and on how your enterprise prioritizes them by criticality.
• Stacks: Have you hardened your big-data technology stack – databases, middleware, applications, tools, etc. – to address the full range of SLAs associated with the chief use cases? If the big-data platform does not meet the availability, security and other robustness requirements expected of most enterprise infrastructure, it’s not production-ready. Ideally, all production-grade big-data platforms should benefit from a common set of enterprise management tools.
• Scalability: Have you architected your environment for modular scaling to keep pace with inexorable growth in data volumes, velocities and varieties? If you can’t provision, add, or reallocate new storage, compute and network capacity on the big-data platform in a fast, cost-effective, modular way to meet new requirements, the platform is not production-ready.
• Skillsets: Have you beefed up your organization’s big-data skillsets for maximum productivity? If your staff lacks the requisite database, integration and analytics skills and tools to support your big-data initiatives over their expected life, your platform is not production-ready. Don’t go deep on Big Data until your staff skills are upgraded.
• Seamless service: Have your re-engineered your data management and analytics IT processes for seamless support for disparate big-data initiatives? If you can’t provide trouble response, user training and other support functions in an efficient, reliable fashion that’s consistent with existing operations, your big-data platform is not production-ready.
To the extent that your enterprise already has a mature enterprise data warehousing (EDW) program in production, you should use that as the template for your big-data platform. There is absolutely no need to redefine “productionizing” for Big Data’s sake.
Q3. What are the most common problems and challenges encountered in Big Data projects?
James Kobielus: The most common problems and challenges in Big Data projects revolve around integrated lifecycle management (ILM).
ILM faces a new frontier when it comes to Big Data. The core challenges are threefold: the sheer unbounded size of Big Data, the ephemeral nature of much of the new data, and the difficulty of enforcing consistent quality as the data scales along any and all of the three Vs (volume, velocity, and variability). Comprehensive ILM has grown more difficult to ensure in Big Data environments, given rapid changes in the following areas:
• New Big Data platform: Big data is ushering a menagerie of new platforms (Hadoop, NoSQL, in-memory, and graph databases) into enterprise computing environments, alongside stalwarts such MPP RDBMS, columnar, and dimensional databases. The chance that your existing ILM tools work out of the box with all of these new platforms is slim. Also, to the extent that you’re doing Big Data in a public cloud, you may be required to use whatever ILM features — strong, weak, or middling — that may be native to the provider’s environment. To mitigate your risks in this heterogeneous new world and to maintain strong confidence in your core data, you’ll need to examine new Big Data platforms closely to ensure they have ILM features (data security, governance, archiving, retention) that are commensurate to the roles for which you plan to deploy them.
• New Big Data subject domains: Big data has not altered enterprise requirements for data governance hubs where you store and manage office systems of record (customers, finances, HR). This is the role of your established EDW, most of which run on traditional RDBMS-based data platforms and incorporate strong ILM. But these systems of record data domains may have very little presence on your newer Big Data platforms, many of which focus instead on handling fresh data from social, event, sensor, clickstream, geospatial, and other new sources. These new data domains are often “ephemeral” in the sense there may be no need to retain the bulk of the data in permanent systems of record.
• New Big Data scales: Big data does not mean that your new platforms support infinite volume, instantaneous velocity, or unbounded varieties. The sheer magnitudes of new data will make it impossible to store most of it anywhere, given the stubborn technological and economic constraints we all face. This reality will deepen Big Data managers’ focus on tweaking multitemperature storage management, archiving, and retention policies. As you scale your Big Data environment, you will need to ensure that ILM requirements can be supported within your current constraints of volume (storage capacity), velocity (bandwidth, processor, and memory speeds), and variety (metadata depth).
Q4. How best is to get started with a Big Data project?
James Kobielus: Scope the project well to deliver near-term business benefit. Using the nucleus project as the foundation for accelerating future Big Data projects. Recognize that the initial database technology you use in that initial project is just one of many storage layers that will need to play together in a hybridized, multi-tier Big Data architecture of your future.
In the larger evolutionary perspective, Big Data is evolving into a hybridized paradigm under which Hadoop, massively parallel processing (MPP) enterprise data warehouses (EDW), in-memory columnar, stream computing, NoSQL, document databases, and other approaches support extreme analytics in the cloud.
Hybrid architectures address the heterogeneous reality of Big Data environments and respond to the need to incorporate both established and new analytic database approaches into a common architecture. The fundamental principle of hybrid architectures is that each constituent Big Data platform is fit-for-purpose to the role for which it’s best suited. These Big Data deployment roles may include any or all of the following:
• Data acquisition
• Interactive exploration
In any role, a fit-for-purpose Big Data platform often supports specific data sources, workloads, applications, and users.
Hybrid is the future of Big Data because users increasingly realize that no single type of analytic platform is always best for all requirements. Also, platform churn—plus the heterogeneity it usually produces—will make hybrid architectures more common in Big Data deployments. The inexorable trend is toward hybrid environments that address the following enterprise Big Data imperatives:
• Extreme scalability and speed: The emerging hybrid Big Data platform will support scale-out, shared-nothing massively parallel processing, optimized appliances, optimized storage, dynamic query optimization, and mixed workload management.
• Extreme agility and elasticity: The hybrid Big Data environment will persist data in diverse physical and logical formats across a virtualized cloud of interconnected memory and disk that can be elastically scaled up and out at a moment’s notice.
• Extreme affordability and manageability: The hybrid environment will incorporate flexible packaging/pricing, including licensed software, modular appliances, and subscription-based cloud approaches.
Hybrid deployments are already widespread in many real-world Big Data deployments. The most typical are the three-tier—also called “hub-and-spoke”—architectures. These environments may have, for example, Hadoop (e.g., IBM InfoSphere BigInsights) in the data acquisition, collection, staging, preprocessing, and transformation layer; relational-based MPP EDWs (e.g., IBM PureData System for Analytics) in the hub/governance layer; and in-memory databases (e.g., IBM Cognos TM1) in the access and interaction layer.
The complexity of hybrid architectures depends on range of sources, workloads, and applications you’re trying to support. In the back-end staging tier, you might need different preprocessing clusters for each of the disparate sources: structured, semi-structured, and unstructured. In the hub tier, you may need disparate clusters configured with different underlying data platforms—RDBMS, stream computing, HDFS, HBase, Cassandra, NoSQL, and so on—-and corresponding metadata, governance, and in-database execution components. And in the front-end access tier, you might require various combinations of in-memory, columnar, OLAP, dimensionless, and other database technologies to deliver the requisite performance on diverse analytic applications, ranging from operational BI to advanced analytics and complex event processing.
Ensuring that hybrid Big Data architectures stay cost-effective demands the following multipronged approach to optimization of distributed storage:
• Apply fit-for-purpose databases to particular Big Data use cases: Hybrid architectures spring from the principle that no single data storage, persistence, or structuring approach is optimal for all deployment roles and workloads. For example, no matter how well-designed the dimensional data model is within an OLAP environment, users eventually outgrow these constraints and demand more flexible decision support. Other database architectures—such as columnar, in-memory, key-value, graph, and inverted indexing—may be more appropriate for such applications, but not generic enough to address other broader deployment roles.
• Align data models with underlying structures and applications: Hybrid architectures leverage the principle that no fixed Big Data modeling approach—physical and logical—can do justice to the ever-shifting mix of queries, loads, and other operations. As you implement hybrid Big Data architectures, make sure you adopt tools that let you focus on logical data models, while the infrastructure automatically reconfigures the underlying Big Data physical data models, schemas, joins, partitions, indexes, and other artifacts for optimal query and data load performance.
• Intelligently compress and manage the data: Hybrid architectures should allow you to apply intelligent compression to Big Data sets to reduce their footprint and make optimal use of storage resources. Also, some physical data models are more inherently compact than others (e.g., tokenized and columnar storage are more efficient than row-based storage), just as some logical data models are more storage-efficient (e.g., third-normal-form relational is typically more compact than large denormalized tables stored in a dimensional star schema).
Q5. What kind of expertise do you need to run a Big Data project in the enterprise?
James Kobielus: Data-driven organizations succeed when all personnel—both technical and business—have a common understanding of the core big-data best skills, tools and practices. You need all the skills of data management, integration, modeling, and so forth that you already have running your data marts, warehouses, OLAP cubes, and the like.
Just as important, you need a team of dedicated data scientists to develop and tune the core intellectual property–statistical, predictive, and other analytic models–that drive your Big Data applications. You don’t often think of data scientists as “programmers,” per se, but they are the pivotal application developers in the age of Big Data.
The key practical difference between data scientists and other programmers—including those who develop orchestration logic—is that the former specifies logic grounded in non-deterministic patterns (i.e., statistical models derived from propensities revealed inductively from historical data), whereas the latter specifies logic whose basis is predetermined (i.e., if/then/else, case-based and other rules, procedural and/or declarative, that were deduced from functional analysis of some problem domain).
The practical distinctions between data scientists and other programmers have always been a bit fuzzy, and they’re growing even blurrier over time. For starters, even a cursory glance at programming paradigms shows that core analytic functions—data handling and calculation—have always been the heart of programming. For another, many business applications leverage statistical analyses and other data-science models to drive transactional and other functions.
Furthermore, data scientists and other developers use a common set of programming languages. Of course, data scientists differ from most other types of programmers in various ways that go beyond the deterministic vs. non-deterministic logic distinction mentioned above:
• Data scientists have adopted analytic domain-specific languages such as R, SAS, SPSS and Matlab.
• Data scientists specialize in business problems that are best addressed with statistical analysis.
• Data scientists are often more aligned with specific business-application domains—such as marketing campaign optimization and financial risk mitigation—than the traditional programmer.
These distinctions primarily apply to what you might call the “classic” data scientist, such as multivariate statistical analysts and data mining professionals. But the notion of a “classic” data scientist might be rapidly fading away in the big-data era as more traditional programmers need some grounding in statistical modeling in order to do their jobs effectively—or, at the very least, need to collaborate productively with statistical modelers.
Q6. How do you select the “right” software and hardware for a Big Data project?
James Kobielus: It’s best to choose the right appliance–a pre-optimized, pre-configured hardware/software appliance–for the specific workloads and applications of your Big Data project. At the same time, you should make sure that the chosen appliances can figure into the eventual cloud architecture toward which your Big Data infrastructure is likely to evolve.
An appliance is a workload-optimized system. Its hardware/software nodes are the key building block for every Big Data cloud. In other words, appliances, also known as expert integrated systems, are the bedrock of all three “Vs” of the Big Data universe, regardless of whether your specific high-level topology is centralized, hub-and-spoke, federated or some other configuration, and regardless of whether you’ve deployed all of these appliance nodes on premises or are outsourcing some or all of it to a cloud/SaaS provider.
Within the coming 2-3 years, expert integrated systems will become a dominant approach for enterprises to put Hadoop and other emerging Big Data approaches into production. Already, appliances are the principal approach in the core Big Data platform market: enterprise data warehousing solutions that implement massively parallel processing, such as those powered by IBM PureData Systems for Analytics..
The core categories of workloads that user need their optimized Big Data appliances to support within cloud environments are as follows:
• Big-data storage: A Big Data appliance can be core building block in a enterprise data storage architecture. Chief uses may be for archiving, governance and replication, as well as for discovering, acquiring, aggregating and governing multistructured content. The appliance should provide the modularity, scalability and efficiency of high-performance applications for these key data consolidation functions. Typically, it would support these functions through integration with a high-capacity storage area network architecture such as IBM provides.
• Big-data processing: A Big Data appliance should support massively parallel execution of advanced data processing, manipulation, analysis and access functions. It should support the full range of advanced analytics, as well as some functions traditionally associated with EDWs, BI and OLAP. It should have all the metadata, models and other services needed to handle such core analytics functions as query, calculation, data loading and data integration. And it should handle a subset of these functions and interface through connectors to analytic platforms such as IBM PureData Systems.
• Big-data development: A Big Data appliance should support Big Data modeling, mining, exploration and analysis. The appliance should provide a scalable “sandbox” with tools that allow data scientists, predictive modelers and business analysts to interactively and collaboratively explore rich information sets. It should incorporate a high-performance analytic runtime platform where these teams can aggregate and prepare data sets, tweak segmentations and decision trees, and iterate through statistical models as they look for deep statistical patterns. It should furnish data scientists with massively parallel CPU, memory, storage and I/O capacity for tackling analytics workloads of growing complexity. And it should enable elastic scaling of sandboxes from traditional statistical analysis, data mining and predictive modeling, into new frontiers of Hadoop/MapReduce, R, geospatial, matrix manipulation, natural language processing, sentiment analysis and other resource-intensive types of Big Data processing.
A big-data appliance should not be a stand-alone server, but, instead, a repeatable, modular building block that, when deployed in larger cloud configurations, can be rapidly optimized to new workloads as they come online. Many appliances will be configured to support mixes of two or all three of these types of workloads within specific cloud nodes or specific clusters. Some will handle low latency and batch jobs with equal agility in your cloud. And still others will be entirely specialized to a particular function that they perform with lightning speed and elastic scalability. The best appliances, like IBM Netezza, facilitate flexible re-optimization by streamlining the myriad deployment, configuration tuning tasks across larger, more complex deployments.
You may not be able to forecast with fine-grained precision the mix of workloads you’ll need to run on your big-data cloud two years from next Tuesday. But investing in the right family of big-data appliance building blocks should give you confidence that, when the day comes, you’ll have the foundation in place to provision resources rapidly and efficiently.
Q7. Is Hadoop replacing the role of OLAP (online analytical processing) in preparing data to answer specific questions?
James Kobielus: No. Hadoop is powering unstructured ETL, queryable archiving, data-science exploratory sandboxing, and other use cases. OLAP–in terms of traditional cubing–remains key to front-end query acceleration in decision support applications and data marts. In support of those front-end applicatioins, OLAP is facing competition from other approaches, especially in-memory, columnar databases (such as the BLU Acceleration feature of IBM DB2 10.5).
Q8. Could you give some examples of successful Big Data projects?
James Kobielus: Examples are here.
James Kobielus is IBM Senior Program Director, Product Marketing, Big Data Analytics solutions. He is an industry veteran, a popular speaker and social media participant, and a thought leader in big data, Hadoop, enterprise data warehousing, advanced analytics, business intelligence, data management, and next best action technologies.
Follow ODBMS.org on Twitter: @odbmsorg