{"id":3006,"date":"2014-05-15T17:20:16","date_gmt":"2014-05-15T17:20:16","guid":{"rendered":"http:\/\/www.odbms.org\/blog\/?p=3006"},"modified":"2014-05-16T08:15:20","modified_gmt":"2014-05-16T08:15:20","slug":"james-kobielus","status":"publish","type":"post","link":"https:\/\/www.odbms.org\/blog\/2014\/05\/james-kobielus\/","title":{"rendered":"How to run a Big Data project. Interview with James Kobielus"},"content":{"rendered":"<blockquote><p> <strong>&#8220;You need a team of dedicated data scientists to develop and tune the core intellectual property\u2013statistical, predictive, and other analytic models\u2013that drive your Big Data applications. You don\u2019t often think of data scientists as \u201cprogrammers,\u201d per se, but they are the pivotal application developers in the age of Big Data.&#8221;&#8211;James Kobielus<\/strong><\/p><\/blockquote>\n<p>Managing the pitfalls and challenges of Big Data projects. On this topic I have interviewed <strong>James Kobielus<\/strong>, IBM Senior Program Director, Product Marketing, Big Data Analytics solutions.<\/p>\n<p>RVZ<\/p>\n<p><strong>Q1. Why run a Big Data project in the enterprise?<\/strong><\/p>\n<p><strong>James Kobielus: <\/strong>Many Big Data projects are in support of <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/en.wikipedia.org\/wiki\/Customer_relationship_management');\"  href=\"http:\/\/en.wikipedia.org\/wiki\/Customer_relationship_management\" target=\"_blank\">customer relationship management (CRM)<\/a> initiatives in marketing, customer service, sales, and brand monitoring. 
Justifying a Big Data project with a CRM focus involves identifying the following quantitative ROI:<br \/>\n\u2022 Volume-based value: The more comprehensive your 360-degree view of customers and the more historical data you have on them, the more insight you can extract from it all and, all things considered, the better decisions you can make in the process of acquiring, retaining, growing and managing those customer relationships.<br \/>\n\u2022 Velocity-based value: The more customer data you can ingest rapidly into your big-data platform and the more questions a user can pose rapidly against that data (via queries, reports, dashboards, etc.) within a given time period, the more likely you are to make the right decision at the right time to achieve your customer relationship management objectives.<br \/>\n\u2022 Variety-based value: The more varied customer data you have \u2013 from the CRM system, social media, call-center logs, etc. \u2013 the more nuanced a portrait you have of customer profiles, desires and so on, hence the better-informed decisions you can make in engaging with them.<br \/>\n\u2022 Veracity-based value: The more consolidated, conformed, cleansed, consistent, and current the data you have on customers, the more likely you are to make the right decisions based on the most accurate data.<br \/>\nHow can you attach a dollar value to any of this? It\u2019s not difficult. <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/en.wikipedia.org\/wiki\/Customer_lifetime_value');\"  href=\"http:\/\/en.wikipedia.org\/wiki\/Customer_lifetime_value\" target=\"_blank\">Customer lifetime value (CLV)<\/a> is a standard metric that you can calculate from big-data analytics\u2019 impact on customer acquisition, onboarding, retention, upsell, cross-sell and other concrete bottom-line indicators, as well as from corresponding improvements in operational efficiency.<\/p>\n<p><strong>Q2. 
What are the business decisions that need to be made in order to successfully support a Big Data project in the enterprise?<\/strong><\/p>\n<p><strong>James Kobielus: <\/strong>In order to successfully support a Big Data project in the enterprise, you have to make the infrastructure and applications production-ready in your operations.<br \/>\nProduction-readiness means that your big-data investment is fit to realize its full operational potential. If you think \u201cproductionizing\u201d can be done in a single step, such as by, say, introducing <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/hadoop.apache.org\/docs\/r2.3.0\/hadoop-yarn\/hadoop-yarn-site\/HDFSHighAvailabilityWithQJM.html');\"  href=\"http:\/\/hadoop.apache.org\/docs\/r2.3.0\/hadoop-yarn\/hadoop-yarn-site\/HDFSHighAvailabilityWithQJM.html\" target=\"_blank\">HDFS NameNode redundancy<\/a>, then you need a cold slap of reality. Productionizing demands a lifecycle focus that encompasses all of your big-data platforms, not just a single one (e.g., Hadoop\/HDFS), and addresses more than just a single requirement (e.g., ensuring a highly available distributed file system).<br \/>\nProductionizing involves jumping through a series of procedural hoops to ensure that your big-data investment can function as a reliable business asset. Here are several high-level considerations to keep in mind as you ready your big-data initiative for primetime deployment:<br \/>\n\u2022 Stakeholders: Have you aligned your big-data initiatives with stakeholder requirements? If stakeholders haven\u2019t clearly specified their requirements or expectations for your big-data initiative, it\u2019s not production-ready. The criteria of production-readiness must conform to what stakeholders require, and that depends greatly on the use cases and applications they have in mind for Big Data. 
Service-level agreements (SLAs) vary widely for Big Data deployed as an enterprise data warehouse (EDW), as opposed to an exploratory data-science sandbox, an unstructured information transformation tier, a queryable archive, or some other use. SLAs for performance, availability, security, governance, compliance, monitoring, auditing and so forth will depend on the particulars of each big-data application, and on how your enterprise prioritizes them by criticality.<br \/>\n\u2022 Stacks: Have you hardened your big-data technology stack \u2013 databases, middleware, applications, tools, etc. \u2013 to address the full range of SLAs associated with the chief use cases? If the big-data platform does not meet the availability, security and other robustness requirements expected of most enterprise infrastructure, it\u2019s not production-ready. Ideally, all production-grade big-data platforms should benefit from a common set of enterprise management tools.<br \/>\n\u2022 Scalability: Have you architected your environment for modular scaling to keep pace with inexorable growth in data volumes, velocities and varieties? If you can\u2019t provision, add, or reallocate new storage, compute and network capacity on the big-data platform in a fast, cost-effective, modular way to meet new requirements, the platform is not production-ready.<br \/>\n\u2022 Skillsets: Have you beefed up your organization\u2019s big-data skillsets for maximum productivity? If your staff lacks the requisite database, integration and analytics skills and tools to support your big-data initiatives over their expected life, your platform is not production-ready. Don\u2019t go deep on Big Data until your staff skills are upgraded.<br \/>\n\u2022 Seamless service: Have you re-engineered your data management and analytics IT processes for seamless support of disparate big-data initiatives? 
If you can\u2019t provide trouble response, user training and other support functions in an efficient, reliable fashion that\u2019s consistent with existing operations, your big-data platform is not production-ready.<br \/>\nTo the extent that your enterprise already has a mature <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/en.wikipedia.org\/wiki\/Data_warehouse');\"  href=\"http:\/\/en.wikipedia.org\/wiki\/Data_warehouse\" target=\"_blank\">enterprise data warehousing (EDW)<\/a> program in production, you should use that as the template for your big-data platform. There is absolutely no need to redefine \u201cproductionizing\u201d for Big Data\u2019s sake.<\/p>\n<p><strong>Q3. What are the most common problems and challenges encountered in Big Data projects?<\/strong><\/p>\n<p><strong>James Kobielus: <\/strong>The most common problems and challenges in Big Data projects revolve around<a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/en.wikipedia.org\/wiki\/Information_Lifecycle_Management');\"  href=\"http:\/\/en.wikipedia.org\/wiki\/Information_Lifecycle_Management\" target=\"_blank\"> integrated lifecycle management (ILM)<\/a>.<br \/>\nILM faces a new frontier when it comes to Big Data. The core challenges are threefold: the sheer unbounded size of Big Data, the ephemeral nature of much of the new data, and the difficulty of enforcing consistent quality as the data scales along any and all of the three Vs (v<a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/en.wikipedia.org\/wiki\/Big_data');\"  href=\"http:\/\/en.wikipedia.org\/wiki\/Big_data\" target=\"_blank\">olume, velocity, and variability<\/a>). 
Comprehensive ILM has grown more difficult to ensure in Big Data environments, given rapid changes in the following areas:<br \/>\n\u2022 New Big Data platforms: Big data is ushering in a menagerie of new platforms (<a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/hadoop.apache.org');\"  href=\"http:\/\/hadoop.apache.org\" target=\"_blank\">Hadoop<\/a>, <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/en.wikipedia.org\/wiki\/NoSQL');\"  href=\"http:\/\/en.wikipedia.org\/wiki\/NoSQL\" target=\"_blank\">NoSQL<\/a>, <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/en.wikipedia.org\/wiki\/In-memory_database');\"  href=\"http:\/\/en.wikipedia.org\/wiki\/In-memory_database\" target=\"_blank\">in-memory<\/a>, and <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/en.wikipedia.org\/wiki\/Graph_database');\"  href=\"http:\/\/en.wikipedia.org\/wiki\/Graph_database\" target=\"_blank\">graph databases<\/a>) into enterprise computing environments, alongside stalwarts such as <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/en.wikipedia.org\/wiki\/Data_warehouse_appliance');\"  href=\"http:\/\/en.wikipedia.org\/wiki\/Data_warehouse_appliance\" target=\"_blank\">MPP RDBMS<\/a>, <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/en.wikipedia.org\/wiki\/Column-oriented_DBMS');\"  href=\"http:\/\/en.wikipedia.org\/wiki\/Column-oriented_DBMS\" target=\"_blank\">columnar<\/a>, and <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/stackoverflow.com\/questions\/2798595\/relational-vs-dimensional-databases-whats-the-difference');\"  href=\"http:\/\/stackoverflow.com\/questions\/2798595\/relational-vs-dimensional-databases-whats-the-difference\" target=\"_blank\">dimensional databases<\/a>. The chance that your existing ILM tools work out of the box with all of these new platforms is slim. 
Also, to the extent that you&#8217;re doing Big Data in a public cloud, you may be required to use whatever ILM features &#8212; strong, weak, or middling &#8212; may be native to the provider&#8217;s environment. To mitigate your risks in this heterogeneous new world and to maintain strong confidence in your core data, you&#8217;ll need to examine new Big Data platforms closely to ensure they have ILM features (data security, governance, archiving, retention) that are commensurate with the roles for which you plan to deploy them.<br \/>\n\u2022 New Big Data subject domains: Big data has not altered enterprise requirements for data governance hubs where you store and manage office systems of record (customers, finances, HR). This is the role of your established EDWs, most of which run on traditional RDBMS-based data platforms and incorporate strong ILM. But these system-of-record data domains may have very little presence on your newer Big Data platforms, many of which focus instead on handling fresh data from social, event, sensor, clickstream, geospatial, and other new sources. These new data domains are often &#8220;ephemeral&#8221; in the sense that there may be no need to retain the bulk of the data in permanent systems of record.<br \/>\n\u2022 New Big Data scales: Big data does not mean that your new platforms support infinite volume, instantaneous velocity, or unbounded varieties. The sheer magnitudes of new data will make it impossible to store most of it anywhere, given the stubborn technological and economic constraints we all face. This reality will deepen Big Data managers&#8217; focus on tweaking multitemperature storage management, archiving, and retention policies. As you scale your Big Data environment, you will need to ensure that ILM requirements can be supported within your current constraints of volume (storage capacity), velocity (bandwidth, processor, and memory speeds), and variety (metadata depth).<\/p>\n<p><strong>Q4. 
How best to get started with a Big Data project?<\/strong><\/p>\n<p><strong>James Kobielus: <\/strong>Scope the project well to deliver near-term business benefit. Use the nucleus project as the foundation for accelerating future Big Data projects. Recognize that the database technology you use in that initial project is just one of many storage layers that will need to play together in the hybridized, multi-tier Big Data architecture of your future.<br \/>\nIn the larger evolutionary perspective, Big Data is evolving into a hybridized paradigm under which Hadoop, massively parallel processing (MPP) enterprise data warehouses (EDW), in-memory columnar, stream computing, NoSQL, document databases, and other approaches support extreme analytics in the cloud.<br \/>\nHybrid architectures address the heterogeneous reality of Big Data environments and respond to the need to incorporate both established and new analytic database approaches into a common architecture. The fundamental principle of hybrid architectures is that each constituent Big Data platform is fit-for-purpose to the role for which it\u2019s best suited. These Big Data deployment roles may include any or all of the following:<br \/>\n\u2022 Data acquisition<br \/>\n\u2022 Collection<br \/>\n\u2022 Transformation<br \/>\n\u2022 Movement<br \/>\n\u2022 Cleansing<br \/>\n\u2022 Staging<br \/>\n\u2022 Sandboxing<br \/>\n\u2022 Modeling<br \/>\n\u2022 Governance<br \/>\n\u2022 Access<br \/>\n\u2022 Delivery<br \/>\n\u2022 Interactive exploration<br \/>\n\u2022 Archiving<br \/>\nIn any role, a fit-for-purpose Big Data platform often supports specific data sources, workloads, applications, and users.<br \/>\nHybrid is the future of Big Data because users increasingly realize that no single type of analytic platform is always best for all requirements. Also, platform churn\u2014plus the heterogeneity it usually produces\u2014will make hybrid architectures more common in Big Data deployments. 
The inexorable trend is toward hybrid environments that address the following enterprise Big Data imperatives:<br \/>\n\u2022 Extreme scalability and speed: The emerging hybrid Big Data platform will support scale-out, shared-nothing massively parallel processing, optimized appliances, optimized storage, dynamic query optimization, and mixed workload management.<br \/>\n\u2022 Extreme agility and elasticity: The hybrid Big Data environment will persist data in diverse physical and logical formats across a virtualized cloud of interconnected memory and disk that can be elastically scaled up and out at a moment\u2019s notice.<br \/>\n\u2022 Extreme affordability and manageability: The hybrid environment will incorporate flexible packaging\/pricing, including licensed software, modular appliances, and subscription-based cloud approaches.<br \/>\nHybrid deployments are already widespread in many real-world Big Data deployments. The most typical are the three-tier\u2014also called \u201c<a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/en.wikipedia.org\/wiki\/Spoke-hub_distribution_paradigm');\"  href=\"http:\/\/en.wikipedia.org\/wiki\/Spoke-hub_distribution_paradigm\" target=\"_blank\">hub-and-spoke<\/a>\u201d\u2014architectures. 
These environments may have, for example, Hadoop (e.g., <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www-01.ibm.com\/software\/data\/infosphere\/biginsights\/');\"  href=\"http:\/\/www-01.ibm.com\/software\/data\/infosphere\/biginsights\/\" target=\"_blank\">IBM InfoSphere BigInsights<\/a>) in the data acquisition, collection, staging, preprocessing, and transformation layer; relational-based MPP EDWs (e.g., <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www-01.ibm.com\/software\/data\/puredata\/analytics\/');\"  href=\"http:\/\/www-01.ibm.com\/software\/data\/puredata\/analytics\/\" target=\"_blank\">IBM PureData System for Analytics<\/a>) in the hub\/governance layer; and in-memory databases (e.g., <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www-03.ibm.com\/software\/products\/de\/cognostm1');\"  href=\"http:\/\/www-03.ibm.com\/software\/products\/de\/cognostm1\" target=\"_blank\">IBM Cognos TM1<\/a>) in the access and interaction layer.<br \/>\nThe complexity of hybrid architectures depends on the range of sources, workloads, and applications you\u2019re trying to support. In the back-end staging tier, you might need different preprocessing clusters for each of the disparate sources: structured, semi-structured, and unstructured. 
In the hub tier, you may need disparate clusters configured with different underlying data platforms\u2014RDBMS, stream computing, <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/hadoop.apache.org\/docs\/r1.2.1\/hdfs_design.html');\"  href=\"http:\/\/hadoop.apache.org\/docs\/r1.2.1\/hdfs_design.html\" target=\"_blank\">HDFS<\/a>, <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/hbase.apache.org');\"  href=\"https:\/\/hbase.apache.org\" target=\"_blank\">HBase<\/a>, <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/cassandra.apache.org');\"  href=\"http:\/\/cassandra.apache.org\" target=\"_blank\">Cassandra<\/a>, NoSQL, and so on\u2014and corresponding metadata, governance, and in-database execution components. And in the front-end access tier, you might require various combinations of in-memory, columnar, OLAP, dimensional, and other database technologies to deliver the requisite performance on diverse analytic applications, ranging from operational BI to advanced analytics and complex event processing.<br \/>\nEnsuring that hybrid Big Data architectures stay cost-effective demands the following multipronged approach to optimization of distributed storage:<br \/>\n\u2022 Apply fit-for-purpose databases to particular Big Data use cases: Hybrid architectures spring from the principle that no single data storage, persistence, or structuring approach is optimal for all deployment roles and workloads. For example, no matter how well-designed the dimensional data model is within an OLAP environment, users eventually outgrow these constraints and demand more flexible decision support. 
Other database architectures\u2014such as columnar, in-memory, key-value, graph, and inverted indexing\u2014may be more appropriate for such applications, but not generic enough to address other, broader deployment roles.<br \/>\n\u2022 Align data models with underlying structures and applications: Hybrid architectures leverage the principle that no fixed Big Data modeling approach\u2014physical or logical\u2014can do justice to the ever-shifting mix of queries, loads, and other operations. As you implement hybrid Big Data architectures, make sure you adopt tools that let you focus on logical data models, while the infrastructure automatically reconfigures the underlying Big Data physical data models, schemas, joins, partitions, indexes, and other artifacts for optimal query and data load performance.<br \/>\n\u2022 Intelligently compress and manage the data: Hybrid architectures should allow you to apply intelligent compression to Big Data sets to reduce their footprint and make optimal use of storage resources. Also, some physical data models are more inherently compact than others (e.g., tokenized and columnar storage are more efficient than row-based storage), just as some logical data models are more storage-efficient (e.g., third-normal-form relational is typically more compact than large denormalized tables stored in a dimensional star schema).<\/p>\n<p><strong>Q5. What kind of expertise do you need to run a Big Data project in the enterprise?<\/strong><\/p>\n<p><strong>James Kobielus: <\/strong>Data-driven organizations succeed when all personnel\u2014both technical and business\u2014have a common understanding of core big-data skills, tools and best practices. 
You need all the skills of data management, integration, modeling, and so forth that you already have running your data marts, warehouses, OLAP cubes, and the like.<\/p>\n<p>Just as important, you need a team of dedicated data scientists to develop and tune the core intellectual property&#8211;statistical, predictive, and other analytic models&#8211;that drive your Big Data applications. You don\u2019t often think of data scientists as \u201cprogrammers,\u201d per se, but they are the pivotal application developers in the age of Big Data.<br \/>\nThe key practical difference between data scientists and other programmers\u2014including those who develop orchestration logic\u2014is that the former specifies logic grounded in non-deterministic patterns (i.e., statistical models derived from propensities revealed inductively from historical data), whereas the latter specifies logic whose basis is predetermined (i.e., if\/then\/else, case-based and other rules, procedural and\/or declarative, that were deduced from functional analysis of some problem domain).<br \/>\nThe practical distinctions between data scientists and other programmers have always been a bit fuzzy, and they\u2019re growing even blurrier over time. For starters, even a cursory glance at programming paradigms shows that core analytic functions\u2014data handling and calculation\u2014have always been the heart of programming. For another, many business applications leverage statistical analyses and other data-science models to drive transactional and other functions.<br \/>\nFurthermore, data scientists and other developers use a common set of programming languages. Of course, data scientists differ from most other types of programmers in various ways that go beyond the deterministic vs. 
non-deterministic logic distinction mentioned above:<br \/>\n\u2022 Data scientists have adopted analytic domain-specific languages such as R, SAS, SPSS and Matlab.<br \/>\n\u2022 Data scientists specialize in business problems that are best addressed with statistical analysis.<br \/>\n\u2022 Data scientists are often more aligned with specific business-application domains\u2014such as marketing campaign optimization and financial risk mitigation\u2014than the traditional programmer.<br \/>\nThese distinctions primarily apply to what you might call \u201cclassic\u201d data scientists, such as multivariate statistical analysts and data mining professionals. But the notion of a \u201cclassic\u201d data scientist might be rapidly fading away in the big-data era as more traditional programmers need some grounding in statistical modeling in order to do their jobs effectively\u2014or, at the very least, need to collaborate productively with statistical modelers.<\/p>\n<p><strong>Q6. How do you select the &#8220;right&#8221; software and hardware for a Big Data project?<\/strong><\/p>\n<p><strong>James Kobielus: <\/strong>It&#8217;s best to choose the right appliance&#8211;a pre-optimized, pre-configured hardware\/software appliance&#8211;for the specific workloads and applications of your Big Data project. At the same time, you should make sure that the chosen appliances can figure into the eventual cloud architecture toward which your Big Data infrastructure is likely to evolve.<\/p>\n<p>An appliance is a workload-optimized system. Its hardware\/software nodes are the key building block for every Big Data cloud. 
In other words, appliances, also known as expert integrated systems, are the bedrock of all three \u201cVs\u201d of the Big Data universe, regardless of whether your specific high-level topology is centralized, hub-and-spoke, federated or some other configuration, and regardless of whether you\u2019ve deployed all of these appliance nodes on premises or are outsourcing some or all of them to a cloud\/SaaS provider.<br \/>\nWithin the coming 2-3 years, expert integrated systems will become a dominant approach for enterprises to put Hadoop and other emerging Big Data approaches into production. Already, appliances are the principal approach in the core Big Data platform market: enterprise data warehousing solutions that implement massively parallel processing, such as those powered by IBM PureData Systems for Analytics.<br \/>\nThe core categories of workloads that users need their optimized Big Data appliances to support within cloud environments are as follows:<br \/>\n\u2022 Big-data storage: A Big Data appliance can be a core building block in an enterprise data storage architecture. Chief uses may be for archiving, governance and replication, as well as for discovering, acquiring, aggregating and governing multistructured content. The appliance should provide the modularity, scalability and efficiency of high-performance applications for these key data consolidation functions. Typically, it would support these functions through integration with a high-capacity storage area network architecture such as IBM provides.<br \/>\n\u2022 Big-data processing: A Big Data appliance should support massively parallel execution of advanced data processing, manipulation, analysis and access functions. It should support the full range of advanced analytics, as well as some functions traditionally associated with EDWs, BI and OLAP. 
It should have all the metadata, models and other services needed to handle such core analytics functions as query, calculation, data loading and data integration. And it should handle a subset of these functions and interface through connectors to analytic platforms such as IBM PureData Systems.<br \/>\n\u2022 Big-data development: A Big Data appliance should support Big Data modeling, mining, exploration and analysis. The appliance should provide a scalable \u201csandbox\u201d with tools that allow data scientists, predictive modelers and business analysts to interactively and collaboratively explore rich information sets. It should incorporate a high-performance analytic runtime platform where these teams can aggregate and prepare data sets, tweak segmentations and decision trees, and iterate through statistical models as they look for deep statistical patterns. It should furnish data scientists with massively parallel CPU, memory, storage and I\/O capacity for tackling analytics workloads of growing complexity. And it should enable elastic scaling of sandboxes from traditional statistical analysis, data mining and predictive modeling, into new frontiers of Hadoop\/MapReduce, R, geospatial, matrix manipulation, natural language processing, sentiment analysis and other resource-intensive types of Big Data processing.<br \/>\nA big-data appliance should not be a stand-alone server, but, instead, a repeatable, modular building block that, when deployed in larger cloud configurations, can be rapidly optimized to new workloads as they come online. Many appliances will be configured to support mixes of two or all three of these types of workloads within specific cloud nodes or specific clusters. Some will handle low latency and batch jobs with equal agility in your cloud. And still others will be entirely specialized to a particular function that they perform with lightning speed and elastic scalability. 
The best appliances, like <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www-01.ibm.com\/software\/data\/netezza\/');\"  href=\"http:\/\/www-01.ibm.com\/software\/data\/netezza\/\" target=\"_blank\">IBM Netezza<\/a>, facilitate flexible re-optimization by streamlining the myriad deployment and configuration tuning tasks across larger, more complex deployments.<br \/>\nYou may not be able to forecast with fine-grained precision the mix of workloads you\u2019ll need to run on your big-data cloud two years from next Tuesday. But investing in the right family of big-data appliance building blocks should give you confidence that, when the day comes, you\u2019ll have the foundation in place to provision resources rapidly and efficiently.<\/p>\n<p><strong>Q7. Is Hadoop replacing the role of OLAP (online analytical processing) in preparing data to answer specific questions?<\/strong><\/p>\n<p><strong>James Kobielus: <\/strong>No. Hadoop is powering unstructured ETL, queryable archiving, data-science exploratory sandboxing, and other use cases. OLAP&#8211;in terms of traditional cubing&#8211;remains key to front-end query acceleration in decision support applications and data marts. In support of those front-end applications, OLAP is facing competition from other approaches, especially in-memory, columnar databases (such as the <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www-01.ibm.com\/software\/data\/db2\/linux-unix-windows\/db2-blu-acceleration\/');\"  href=\"http:\/\/www-01.ibm.com\/software\/data\/db2\/linux-unix-windows\/db2-blu-acceleration\/\" target=\"_blank\">BLU Acceleration feature of IBM DB2 10.5<\/a>).<\/p>\n<p><strong>Q8. 
Could you give some examples of successful Big Data projects?<\/strong><\/p>\n<p><strong>James Kobielus: <\/strong>Examples are <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www-01.ibm.com\/software\/success\/cssdb.nsf\/solutionareaL2VW?OpenView&amp;Count=30&amp;RestrictToCategory=default_BigData');\"  href=\"http:\/\/www-01.ibm.com\/software\/success\/cssdb.nsf\/solutionareaL2VW?OpenView&amp;Count=30&amp;RestrictToCategory=default_BigData\" target=\"_blank\">here<\/a>.<\/p>\n<p><strong>James Kobielus<\/strong> <em>is IBM Senior Program Director, Product Marketing, Big Data Analytics solutions. He is an industry veteran, a popular speaker and social media participant, and a thought leader in big data, Hadoop, enterprise data warehousing, advanced analytics, business intelligence, data management, and next best action technologies. <\/em><\/p>\n<p><strong>Related Posts<\/strong><\/p>\n<p>&#8211;<a href=\"http:\/\/www.odbms.org\/blog\/2014\/04\/side-big-data-interview-michael-l-brodie\/\" target=\"_blank\">The other side of Big Data. Interview with Michael L. Brodie.<br \/>\nODBMS Industry Watch, April 26, 2014<\/a><\/p>\n<p>&#8211;<a href=\"http:\/\/www.odbms.org\/blog\/2014\/03\/data-centers-challenges-interview-david-gorbet\/\" target=\"_blank\">What are the challenges for modern Data Centers? Interview with David Gorbet.<br \/>\nODBMS Industry Watch, March 25, 2014<\/a><\/p>\n<p>&#8211;<a href=\"http:\/\/www.odbms.org\/blog\/2014\/01\/setting-up-a-big-data-project-interview-with-cynthia-m-saracco\/\" target=\"_blank\">Setting up a Big Data project. Interview with Cynthia M. 
Saracco.<br \/>\nODBMS Industry Watch, January 27, 2014<\/a><\/p>\n<p><strong>Resources<\/strong><\/p>\n<div>&#8211; BIG DATA AND ANALYTICAL DATA PLATFORMS<\/div>\n<div>\n<article><strong>Downloads<\/strong> for:<\/p>\n<ul>\n<li><a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.odbms.org\/category\/downloads\/big-data-and-analytical-data-platforms\/big-data-and-analytical-data-platforms-free-software\/');\"  href=\"http:\/\/www.odbms.org\/category\/downloads\/big-data-and-analytical-data-platforms\/big-data-and-analytical-data-platforms-free-software\/\">Free Software<\/a><\/li>\n<li><a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.odbms.org\/category\/downloads\/big-data-and-analytical-data-platforms\/big-data-and-analytical-data-platforms-articles\/');\"  href=\"http:\/\/www.odbms.org\/category\/downloads\/big-data-and-analytical-data-platforms\/big-data-and-analytical-data-platforms-articles\/\">Articles<\/a><\/li>\n<li><a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.odbms.org\/category\/downloads\/big-data-and-analytical-data-platforms\/big-data-and-analytical-data-platforms-lecture-notes\/');\"  href=\"http:\/\/www.odbms.org\/category\/downloads\/big-data-and-analytical-data-platforms\/big-data-and-analytical-data-platforms-lecture-notes\/\">Lecture Notes<\/a><\/li>\n<li><a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/www.odbms.org\/category\/downloads\/big-data-and-analytical-data-platforms\/big-data-and-analytical-data-platforms-phd-and-master-thesis\/');\"  href=\"http:\/\/www.odbms.org\/category\/downloads\/big-data-and-analytical-data-platforms\/big-data-and-analytical-data-platforms-phd-and-master-thesis\/\">PhD and Master Thesis<\/a><\/li>\n<\/ul>\n<\/article>\n<\/div>\n<p><strong>Follow ODBMS.org on Twitter: <a onclick=\"javascript:pageTracker._trackPageview('\/outgoing\/twitter.com\/odbmsorg');\"  href=\"https:\/\/twitter.com\/odbmsorg\" 
target=\"_blank\">@odbmsorg<\/a><\/strong><br \/>\n##<\/p>\n<!-- AddThis Advanced Settings generic via filter on the_content --><!-- AddThis Share Buttons generic via filter on the_content -->","protected":false},"excerpt":{"rendered":"<p>&#8220;You need a team of dedicated data scientists to develop and tune the core intellectual property\u2013statistical, predictive, and other analytic models\u2013that drive your Big Data applications. You don\u2019t often think of data scientists as \u201cprogrammers,\u201d per se, but they are the pivotal application developers in the age of Big Data.&#8221;&#8211;James Kobielus Managing the pitfalls and [&hellip;]<!-- AddThis Advanced Settings generic via filter on get_the_excerpt --><!-- AddThis Share Buttons generic via filter on get_the_excerpt --><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[35,66,239,263,677,412,413],"_links":{"self":[{"href":"https:\/\/www.odbms.org\/blog\/wp-json\/wp\/v2\/posts\/3006"}],"collection":[{"href":"https:\/\/www.odbms.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.odbms.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.odbms.org\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.odbms.org\/blog\/wp-json\/wp\/v2\/comments?post=3006"}],"version-history":[{"count":9,"href":"https:\/\/www.odbms.org\/blog\/wp-json\/wp\/v2\/posts\/3006\/revisions"}],"predecessor-version":[{"id":3282,"href":"https:\/\/www.odbms.org\/blog\/wp-json\/wp\/v2\/posts\/3006\/revisions\/3282"}],"wp:attachment":[{"href":"https:\/\/www.odbms.org\/blog\/wp-json\/wp\/v2\/media?parent=3006"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.odbms.org\/blog\/wp-json\/wp\/v2\/categories?post=3006"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.odbms.o
rg\/blog\/wp-json\/wp\/v2\/tags?post=3006"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}