Setting up a Big Data project. Interview with Cynthia M. Saracco.
“Begin with a clear definition of the project’s business objectives and timeline, and be sure that you have appropriate executive sponsorship. The key stakeholders need to agree on a minimal set of compelling results that will impact your business; furthermore, technical leaders need to buy into the overall feasibility of the project and bring design and implementation ideas to the table.”–Cynthia M. Saracco.
How easy is it to set up a Big Data project? On this topic I have interviewed Cynthia M. Saracco, senior solutions architect at IBM’s Silicon Valley Laboratory. Cynthia is an expert in Big Data, analytics, and emerging technologies. She has more than 25 years of software industry experience.
Q1. What is the best way to get started with a Big Data project?
Cynthia M. Saracco: Begin with a clear definition of the project’s business objectives and timeline, and be sure that you have appropriate executive sponsorship.
The key stakeholders need to agree on a minimal set of compelling results that will impact your business; furthermore, technical leaders need to buy into the overall feasibility of the project and bring design and implementation ideas to the table. At that point, you can evaluate your technical options for the best fit. Those options might include Hadoop, a relational DBMS, a stream processing engine, analytic tools, visualization tools, and other types of software. Often, a combination of several types of software is needed for a single Big Data project. Keep in mind that every technology has its strengths and weaknesses, so be sure you understand enough about the technologies you’re inclined to use before moving forward.
If you decide that Hadoop should be part of your project, give serious consideration to using a distribution that packages commonly needed components into a single bundle so you can minimize the time required to install and configure your environment. It’s also helpful to keep in mind the existing skills of your staff and seek out offerings that enable them to be productive quickly.
Tools, applications, and support for common scripting and query languages all contribute to improved productivity. If your business application needs to integrate with existing analytical tools, DBMSs, or other software, look for offerings that have some built-in support for that as well.
Finally, because Big Data projects can get pretty complex, I often find it helpful to segment the work into broad categories and then drill down into each to create a solid plan. Examples of common technical tasks include collecting data (perhaps from various sources), preparing the data for analysis (which can range from simple format conversions to more sophisticated data cleansing and enrichment operations), analyzing the data, and rendering or sharing the results of that analysis with business users or downstream applications. Consider scalability and performance needs in addition to your functional requirements.
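The four broad task categories described above (collecting, preparing, analyzing, and rendering) can be sketched as a tiny pipeline. This is only an illustration of how the work might be segmented, not a recommended implementation: the sample data, the field names, and every function name are hypothetical, and a real project would replace each stage with the appropriate tool.

```python
import csv
import io
from collections import Counter

# Hypothetical raw input: a few web-log style records, one of them malformed.
RAW = """timestamp,user,action
2013-06-01T10:00:00,alice,login
2013-06-01T10:05:00,bob,purchase
bad-row
2013-06-01T10:07:00,alice,purchase
"""


def collect(source):
    """Collect: read records from a source (here, an in-memory CSV string)."""
    return list(csv.DictReader(io.StringIO(source)))


def prepare(records):
    """Prepare: drop rows with missing fields and normalize the rest."""
    clean = []
    for record in records:
        # DictReader fills missing columns with None, so short rows fail this check.
        if not record.get("user") or not record.get("action"):
            continue
        clean.append({
            "user": record["user"].strip().lower(),
            "action": record["action"].strip().lower(),
        })
    return clean


def analyze(records):
    """Analyze: count how often each action occurs."""
    return Counter(record["action"] for record in records)


def render(counts):
    """Render: format the result for a business user or downstream application."""
    return "\n".join(f"{action}: {n}" for action, n in sorted(counts.items()))


def run_pipeline(source):
    """Chain the four stages in order."""
    return render(analyze(prepare(collect(source))))


print(run_pipeline(RAW))
```

Keeping each stage behind its own function makes it easier to plan, test, and later swap the implementation of one stage (for example, moving preparation onto a cluster) without touching the others.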
Q2. What are the most common problems and challenges encountered in Big Data projects?
Cynthia M. Saracco: Lack of appropriately scoped objectives and lack of required skills are two common problems. Regarding objectives, you need to find an appropriate use case that will impact your business and tailor your project’s technical work to meet the business goals of that project efficiently. Big Data is an exciting, rapidly evolving technology area, and it’s easy to get sidetracked experimenting with technical features that may not be essential to solving your business problem. While such experimentation can be fun and educational, it can also result in project delays as well as deliverables that are off target. In addition, without well-scoped business objectives, the technical staff may end up chasing a moving target.
Regarding skills, there’s high demand for data scientists, architects, and developers experienced with Big Data projects. So you may need to decide if you want to engage a service provider to supplement in-house skills or if you want to focus on growing (or acquiring) new in-house skills. Fortunately, there are a number of Big Data training options available today that didn’t exist several years ago. Online courses, conferences, workshops, MeetUps, and self-study tutorials can help motivated technical professionals expand their skill set. However, from a project management point of view, organizations need to be realistic about the time required for staff to learn new Big Data technologies. Giving someone a few days or weeks to master Hadoop and its complementary offerings isn’t very realistic. But really, I see the skills challenge as a point-in-time issue. Many people recognize the demand for Big Data skills and are actively expanding their skills, so supply will grow.
Q3. Do you have any metrics for defining how much “value” can be derived by analyzing Big Data?
Cynthia M. Saracco: Most organizations want to focus on their return on investment (ROI). Even if your Big Data solution uses open source software, there are still expenses involved for designing, developing, deploying, and maintaining your solution. So what did your business gain from that investment?
The answer to that question is going to be specific to your application and your business. For example, if a telecommunications firm is able to reduce customer churn by 10% as a result of a Big Data project, what’s that worth? If an organization can improve the effectiveness of an email marketing campaign by 20%, what’s that worth? If an organization can respond to business requests twice as quickly, what’s that worth? Many clients have these kinds of metrics in mind as they seek to quantify the value they have derived — or hope to derive — from their investment in a Big Data project.
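As a worked illustration of the "what's that worth?" question, the sketch below puts numbers on the churn example. Every figure is invented for the purpose of the arithmetic (customer count, churn rate, revenue per customer, and project cost are all assumptions), and the 10% reduction is interpreted as a relative reduction in churn, which is only one possible reading.

```python
def churn_reduction_value(customers, annual_churn_rate, relative_reduction,
                          annual_revenue_per_customer):
    """Annual revenue retained when churn falls by a relative fraction."""
    customers_lost_before = customers * annual_churn_rate
    customers_retained = customers_lost_before * relative_reduction
    return customers_retained * annual_revenue_per_customer


def roi(gain, cost):
    """Return on investment as a fraction: (gain - cost) / cost."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (gain - cost) / cost


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
retained_revenue = churn_reduction_value(
    customers=1_000_000,
    annual_churn_rate=0.20,          # assumed: 20% of customers leave each year
    relative_reduction=0.10,         # the 10% churn reduction from the example
    annual_revenue_per_customer=500,
)
project_cost = 4_000_000             # assumed: design, development, deployment, maintenance

print(f"Retained revenue: {retained_revenue:,.0f}")
print(f"ROI: {roi(retained_revenue, project_cost):.0%}")
```

With these assumed inputs, 200,000 customers would have been lost, 20,000 of them are retained, and the retained revenue is compared against the full cost of the solution, including the non-license expenses mentioned above.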
Q4. Is Hadoop replacing the role of OLAP (online analytical processing) in preparing data to answer specific questions?
Cynthia M. Saracco: More often, I’ve seen Hadoop used to augment or extend traditional forms of analytical processing, such as OLAP, rather than completely replace them. For example, Hadoop is often deployed to bring large volumes of new types of information into the analytical mix — information that might have traditionally been ignored or discarded. Log data, sensor data, and social data are just a few examples of that. And yes, preparing that data for analysis is certainly one of the tasks for which Hadoop is used.
Q4. IBM offers BigInsights and Big SQL. What are they?
Cynthia M. Saracco: InfoSphere BigInsights is IBM’s Hadoop-based platform for analyzing and managing Big Data. It includes Hadoop, a number of complementary open source projects (such as HBase, Hive, ZooKeeper, Flume, Pig, and others) and a number of IBM-specific technologies designed to add value.
Big SQL is part of BigInsights. It’s IBM’s SQL interface to data stored in BigInsights. Users can create tables, query data, load data from various sources, and perform other functions. For a quick introduction to Big SQL, read this article.
Q5. How does it compare to RDBMS technology? When’s it most useful?
Cynthia M. Saracco: Big SQL provides standard SQL-based query access to data managed by BigInsights. Query support includes joins, unions, sub-queries, windowed aggregates, and other popular capabilities. Because Big SQL is designed to exploit the Hadoop ecosystem, it introduces Hadoop-specific language extensions for certain SQL statements.
For example, Big SQL supports Hive and HBase for storage management, so a Big SQL CREATE TABLE statement might include clauses related to data formats, field delimiters, SerDes (serializers/deserializers), column mappings, column families, etc. The article I mentioned earlier has some examples of these, and the product InfoCenter has further details.
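To make those clauses a little more concrete, here is a minimal sketch that assembles a CREATE TABLE statement with a field delimiter and a storage format. The clause syntax follows Apache Hive's DDL; the exact form that Big SQL accepts may differ, so treat this as illustrative and check the InfoCenter for the real syntax. The helper function, the table name, and the column names are all hypothetical.

```python
def hive_style_create_table(table, columns, delimiter="\\t", stored_as="TEXTFILE"):
    """Build a Hive-style CREATE TABLE statement for delimited data.

    columns is a list of (name, type) pairs. Identifiers are inserted as-is,
    so this helper is for illustration only and not for untrusted input.
    """
    if not columns:
        raise ValueError("at least one column is required")
    column_list = ",\n".join(f"  {name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE TABLE {table} (\n{column_list}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS {stored_as}"
    )


# Hypothetical table for tab-delimited web log data.
ddl = hive_style_create_table(
    "web_logs",
    [("ts", "TIMESTAMP"), ("user_id", "STRING"), ("url", "STRING")],
)
print(ddl)
```

The point of the sketch is the shape of the statement: the column list looks like ordinary SQL, while the ROW FORMAT and STORED AS clauses describe how the underlying files are laid out, which is the kind of Hadoop-specific extension described above.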
In many ways, Big SQL can serve as an easy on-ramp to Hadoop for technical professionals who have a relational DBMS background. Big SQL is good for organizations that want to exploit in-house SQL skills to work with data managed by BigInsights. Because Big SQL supports JDBC and ODBC, many traditional SQL-based tools can work readily with Big SQL tables, which can also make Big Data easier to use by a broader user community.
However, Big SQL doesn’t turn Hadoop, or BigInsights, into a relational DBMS. Commercial relational DBMSs come with built-in, ACID-based transaction management services and model data largely in tabular formats. They support granular levels of security via SQL GRANT and REVOKE statements. In addition, some RDBMSs support 3GL applications developed in “legacy” programming languages such as COBOL. These are some examples of capabilities that aren’t part of Big SQL.
Q6. What are some of its current limitations?
Cynthia M. Saracco: The current level of Big SQL included in BigInsights V184.108.40.206 enables users to create tables but not views.
Date/time data is supported through a full TIMESTAMP data type, and some common SQL operations supported by relational DBMSs aren’t available or have specific restrictions.
Examples include INSERT, UPDATE, DELETE, GRANT, and REVOKE statements. For more details on what’s currently supported in Big SQL, skim through the InfoCenter.
Q7. How does BigInsights differ from, or add value to, open source Hadoop?
Cynthia M. Saracco: As I mentioned earlier, BigInsights includes a number of IBM-specific technologies designed to add value to the open source technologies included with the product. Very briefly, these include:
- A Web console with administrative facilities, a Web application catalog, customizable dashboards, and other features.
- A text analytic engine and library that extracts phone numbers, names, URLs, addresses, and other popular business artifacts from messages, documents, and other forms of textual data.
- Big SQL, which I mentioned earlier.
- BigSheets, a spreadsheet-style tool for business analysts.
- Web-accessible sample applications for importing and exporting data, collecting data from social media sites, executing ad hoc queries, and monitoring the cluster. In addition, application accelerators (tool kits with dozens of pre-built software artifacts) are available for those working with social data and machine data.
- Eclipse tooling to speed development and testing of BigInsights applications, new text extractors, BigSheets functions, SQL-based applications, Java applications, and more.
- An integrated installation tool that installs and configures all selected components across the cluster and performs a system-wide health check.
- Connectivity to popular enterprise software offerings, including IBM and non-IBM RDBMSs.
- Platform enhancements focusing on performance, security, and availability. These include options to use an alternative, POSIX-compliant distributed file system (GPFS-FPO) and an alternative MapReduce layer (Adaptive MapReduce) that features Platform Symphony’s advanced job scheduler, workload manager, and other capabilities.
You might wonder what practical benefits these kinds of capabilities bring. While that varies according to each organization’s usage patterns, one industry analyst study concluded that BigInsights lowers total cost of ownership (TCO) by an average of 28% over a three-year period compared with an open source-only implementation.
Finally, a number of IBM and partner offerings support BigInsights, which is something that’s important to organizations that want to integrate a Hadoop-based environment into their broader IT infrastructure. Some examples of IBM products that support BigInsights include DataStage, Cognos Business Intelligence, Data Explorer, and InfoSphere Streams.
Q8. Could you give some examples of successful Big Data projects?
Cynthia M. Saracco: I’ll summarize a few that have been publicly discussed so you can follow links I provide for more details. An energy firm launched a Big Data project to analyze large volumes of data that could help it improve the placement of new wind turbines and significantly reduce response time to business user requests.
A financial services firm is using Big Data to process large volumes of text data in minutes and offer its clients more comprehensive information based on both in-house and Internet-based data.
An online marketing firm is using Big Data to improve the performance of its clients’ email campaigns.
And other firms are using Big Data to detect fraud, assess risk, cross-sell products and services, prevent or minimize network outages, and so on. You can find a collection of videos about Big Data projects undertaken by various organizations; many of these videos feature users speaking directly about their Big Data experiences and the results of their projects.
A recent report, Analytics: The real-world use of big data, contains further examples. It is based on the results of a survey of more than 1100 businesses that the Saïd Business School at the University of Oxford conducted with IBM’s Institute for Business Value.
Qx. Anything else to add?
Cynthia M. Saracco: Hadoop isn’t the only technology relevant to managing and analyzing Big Data, and IBM’s Big Data software portfolio certainly includes more than BigInsights (its Hadoop-based offering). But if you’re a technologist who wants to learn more about Hadoop, your best bet is to work with the software. You’ll find a number of free online courses in the public domain, such as those at Big Data University. And IBM offers a free copy of its Quick Start Edition of BigInsights as a VMware image or an installable image to help you get started with minimal effort.
Cynthia M. Saracco is a senior solutions architect at IBM’s Silicon Valley Laboratory, specializing in Big Data, analytics, and emerging technologies. She has more than 25 years of software industry experience, has written three books and more than 70 technical papers, and holds six patents.
- What’s the big deal about Big SQL? by Cynthia M. Saracco, Senior Software Engineer, IBM, and Uttam Jain, Software Architect, IBM.
- ODBMS.org: Free resources on Big Data and Analytical Data Platforms: Blog Posts, Free Software, Articles, Lecture Notes, PhD and Master Theses.