ODBMS Industry Watch » IBM http://www.odbms.org/blog Trends and Information on Big Data, New Data Management Technologies, Data Science and Innovation. Fri, 09 Feb 2018 15:17:45 +0000

New Gartner Magic Quadrant for Operational Database Management Systems. Interview with Nick Heudecker http://www.odbms.org/blog/2016/11/new-gartner-magic-quadrant-for-operational-database-management-systems-interview-with-nick-heudecker/ Wed, 30 Nov 2016 20:30:20 +0000

“It is too soon to call the operational DBMS market a commodity market, but it’s easy to see a future where that is the case.”–Nick Heudecker.

I have interviewed Nick Heudecker, Research Director on Gartner’s Data & Analytics team.
The main topic of the interview is the new Magic Quadrant for Operational Database Management Systems.


Q1. You have published the new Magic Quadrant for Operational Database Management Systems (*). How do you define the operational database management system market?

Nick Heudecker: We define a DBMS as a complete software system used to define, create, manage, update and query a database. DBMSs provide interfaces to independent programs and tools that both support and govern the performance of a variety of concurrent workload types. There is no presupposition that DBMSs must support the relational model or that they must support the full set of possible data types in use today. OPDBMSs must include functionality to support backup and recovery, and have some form of transaction durability — although the atomicity, consistency, isolation and durability model is not a requirement. OPDBMSs may support multiple delivery models, such as stand-alone DBMS software, certified configurations, cloud (public and private) images or versions, and database appliances.
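
As a rough illustration of the capabilities named in this definition (defining, updating and querying a database, plus transactions and backup), here is a minimal sketch using Python's built-in sqlite3 module. SQLite and the `accounts` table are assumptions chosen only because they need no setup; the interview does not mention them. An in-memory database shows rollback and backup, but not durability on disk.

```python
import sqlite3

# Hypothetical example schema: the CHECK constraint forbids negative balances.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE accounts ("
    "id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)

# Using the connection as a context manager commits on success.
with source:
    source.executemany(
        "INSERT INTO accounts (id, balance) VALUES (?, ?)",
        [(1, 100), (2, 50)],
    )

# A transfer that overdraws account 1: the second UPDATE violates the
# CHECK constraint, so the context manager rolls back both statements.
try:
    with source:
        source.execute("UPDATE accounts SET balance = balance + 200 WHERE id = 2")
        source.execute("UPDATE accounts SET balance = balance - 200 WHERE id = 1")
except sqlite3.IntegrityError:
    pass

balances = source.execute(
    "SELECT id, balance FROM accounts ORDER BY id"
).fetchall()

# Copy the whole database into a second connection and read it back.
backup = sqlite3.connect(":memory:")
source.backup(backup)
restored = backup.execute(
    "SELECT id, balance FROM accounts ORDER BY id"
).fetchall()

print(balances, restored)
```

The failed transfer leaves both balances unchanged, and the backup copy returns the same rows as the source.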

Q2. Can you explain the methodology you used for this new Magic Quadrant?

Nick Heudecker: The methodologies behind several types of Gartner research are public. The Magic Quadrant methodology can be found here.

We use a number of data sources when we’re creating the Magic Quadrant for Operational Database Management Systems.
We survey vendor reference customers and include data from our interactions with Gartner clients. We also consider earlier information and any news about vendors’ products, customers and finances that came to light during the time frame for our analysis.

Once we have the data, we score vendors across the various dimensions of Completeness of Vision and Ability to Execute.
One thing that’s important to note is Magic Quadrants are relative assessments of vendors in a market. We couldn’t have one vendor on an MQ because it would be right in the middle – there’s nothing to compare it to.

Q3. Why were there no Visionaries this year?

Nick Heudecker: We determined there was an overall lack of vision in the market. After a few years of rapid feature expansion, the focus has shifted to operational excellence and execution. Even Leaders shifted to the left on vision, but are still placed in the Leaders quadrant based on their vision for the development of hybrid database management, hardware optimization and integration, emerging deployment models such as containerization, as well as vertical features.

Q4. Were you surprised by the analysis and some of the results you obtained?

Nick Heudecker: The lack of overall vision in the market struck us the most. Other than in a few notable cases, we received largely the same story from most vendors. The explosion of features, and the vendors emerging to implement them, has slowed. The features that initiated the expansion, such as storing new data types, geographically distributed storage, cloud and flexible data consistency models, have become common. Today, nearly every established or emerging DBMS vendor supports these features to some degree. The OPDBMS market has shifted from a phase of rapid innovation to a phase of maturing products and capabilities.

Q5. Do you believe the “NoSQL” label will continue to distinguish DBMSs?

Nick Heudecker: If you look at the entire operational DBMS space, there’s already a great deal of convergence between NoSQL vendors, as well as between NoSQL and traditionally relational vendors. Nearly every vendor, nonrelational and relational, supports multiple data types, like JSON documents, graph or wide-column. NoSQL vendors are adding SQL: MongoDB’s BI Connector and Couchbase’s N1QL are good, if diverse, examples. They’re also adding things like schema management and data validation capabilities.
On the relational side, they’re adding horizontal scaling options and alternative consistency models, as well as modern APIs. And everyone either has or is adding in-memory and cloud capabilities.

It is too soon to call the operational DBMS market a commodity market, but it’s easy to see a future where that is the case.

Q6. What are the other “Vendors to Consider”?

Nick Heudecker: The other vendors to consider are vendors that did not meet the inclusion requirements for the Magic Quadrant. Usually this is because they missed our minimum revenue requirements, but that doesn’t mean they don’t have compelling products.

Nick Heudecker is a Research Director on Gartner’s Data & Analytics team. His coverage includes data management technologies and practices.


(*) Magic Quadrant for Operational Database Management Systems. Published: 05 October 2016. ID: G00293203. Analyst(s): Nick Heudecker, Donald Feinberg, Merv Adrian, Terilyn Palanca, Rick Greenwald

– Complimentary Gartner Research: 100 Data and Analytics Predictions Through 2020. Get exclusive access to Gartner’s top 100 data and analytics predictions through 2020. Plus access other relevant Gartner research including Magic Quadrant reports for database and data warehouse solutions, and the market guide for in-memory computing (LINK to MemSQL web site – registration required).

Related Posts

– MarkLogic Named a Next-Generation Database Challenger in 2016 Gartner Magic Quadrant. By GARY BLOOM, Chief Executive Officer and President MARKLOGIC

– MarkLogic Recognized in New Gartner® Magic Quadrant. Gartner Magic Quadrant for Operational Database Management Systems positions MarkLogic® the highest for ability to execute in the Challengers Quadrant

– Accelerating Business Value with a Multi-Model, Multi-Workload Data Platform

– NuoDB Recognized by Gartner in Critical Capabilities for Operational Database Management Systems. Elastic SQL database achieves top five score in all four use cases.

– Clustrix Recognized in Gartner Magic Quadrant for Operational Database Management Systems

– Learn why EDB is named a “Challenger” in the 2016 Gartner ODBMS Magic Quadrant

– DataStax Receives Highest Scores in 2 Use Cases in Gartner’s Critical Capabilities for Operational Database Management Systems

– Gartner Scores Oracle Highest In 3 of 4 Use Cases: Gartner Critical Capabilities for Operational Database Management Systems Report

– Gartner Critical Capabilities For Operational Database Management Systems 2016 – Redis Labs Ranked Second Highest In 2/4 Categories (Link – Registration required)


Follow us on Twitter: @odbmsorg


Machines of Loving Grace. Interview with John Markoff. http://www.odbms.org/blog/2016/08/machines-of-loving-grace-interview-with-john-markoff/ Thu, 11 Aug 2016 19:13:46 +0000

“Intelligent system designers do have ethical responsibilities.”
–John Markoff.

I have interviewed John Markoff, technology writer at The New York Times. 
In 2013 he was awarded a Pulitzer Prize.
The interview is related to his recent book “Machines of Loving Grace: The Quest for Common Ground Between Humans and Robots,” published in August of 2015 by HarperCollins Ecco.


Q1. Do you share the concerns of prominent technology leaders such as Tesla’s chief executive, Elon Musk, who suggested we might need to regulate the development of artificial intelligence?

John Markoff: I share their concerns, but not their assertions that we may be on the cusp of some kind of singularity or rapid advance to artificial general intelligence. I do think that machine autonomy raises specific ethical and safety concerns and regulation is an obvious response.

Q2. How difficult is it to reconcile the different interests of the people who are involved in a direct or indirect way in developing and deploying new technology?

John Markoff: This is why we have governments and governmental regulation. I think AI, in that respect, is no different from any other technology. It should and can be regulated when human safety is at stake.

Q3. In your book Machines of Loving Grace you argued that “we must decide to design ourselves into our future, or risk being excluded from it altogether”. What do you mean by that?

John Markoff: You can use AI technologies either to automate or to augment humans. The problem is minimized when you take an approach that is based on human centric design principles.

Q4. How is it possible in practice? Isn’t the technology space dominated by giants such as IBM, Apple and Google, who dictate the direction of new technology?

John Markoff:  This is a very interesting time with “giant” technology companies realizing that there are consequences in the deployment of these technologies. Google, IBM and Microsoft have all recently made public commitments to the safe use of AI.

Q5. What are the most significant new developments in the humans-computers area, that are likely to have a significant influence in our daily life in the near future?

John Markoff:  One of the best things about being a reporter is that you don’t have to predict the future. You only have to note what the various visionaries say, so you can call that to their attention when their predictions prove inaccurate. With that caveat, if I am forced to bet on any particular information technology it would be augmented reality. This is because I believe that multi-touch interfaces for mobile devices simply can’t be the last step in user interface.

Q6. Do you believe that robots will really transform modern life?

John Markoff: I struggle with the definition of what is a “robot.” If something is tele-operated, for example, is it a robot? That said, I think that we will increasingly be surrounded by machines that perform tasks.
The question is whether they will come as quickly as Silicon Valley seems to believe. My friend Paul Saffo has said, “Never mistake a clear view for a short distance.” And I think that is the case with all kinds of mobile robots, including self-driving cars.

Q7. For the designers of Intelligent Systems, how difficult is it to draw a line between what is human and what is machine?

John Markoff: I feel strongly that the possibility of designing cyborgs, particularly with respect to intellectual prosthesis, is a boundary we should cross with great caution. Remember the Borg from Star Trek: “Resistance is futile, you will be assimilated.” I think the challenge is to use these systems to enhance human thought, not for social control.

Q8. What are the ethical responsibilities of designers of intelligent systems?

John Markoff: I think the most important aspect of that question is the simple acknowledgement that intelligent system designers do have ethical responsibilities. That has not always been the case, but it seems to be a growing force within the community of AI and robotics designers in the past five years, so I’m not entirely pessimistic.

Q9. If humans delegate decisions to machines, who will be responsible for the consequences?

John Markoff: Ben Shneiderman, the University of Maryland computer scientist and user interface designer, has written eloquently on this point. Indeed, he argues against autonomous systems for precisely this reason. His point is that it is essential to keep a human in the loop. If not, you run the risk of abdicating ethical responsibility for system design.

Q10. Assuming there is real potential in using data-driven methods both to help charities develop better services and products and to understand civil society activity, what are, in your opinion, the key lessons and recommendations for future work in this space?

John Markoff: I’m afraid I’m not an expert in the IT needs of either charities or NGOs. That said, a wide range of AI advances are already being delivered at nominal cost via smartphones. As cheap sensors proliferate, virtually all everyday objects will gain intelligence that will be widely accessible.

Qx. Anything else you wish to add?

John Markoff: Only that I think it is interesting that the augmentation vs. automation dichotomy is increasingly seen as a path through which to navigate the impact of these technologies. Computer system designers are the ones who will decide what the impact of these technologies is and whether to replace or augment humans in society.



John Markoff joined The New York Times in March 1988 as a reporter for the business section. He is now a technology writer based in the San Francisco bureau of the paper. Prior to joining the Times, he worked for The San Francisco Examiner from 1985 to 1988. He reported for the New York Times Science Section from 2010 to 2015.

Markoff has written about technology and science since 1977. He covered technology and the defense industry for The Pacific News Service in San Francisco from 1977 to 1981; he was a reporter at Infoworld from 1981 to 1983; he was the West Coast editor for Byte Magazine from 1984 to 1985 and wrote a column on personal computers for The San Jose Mercury from 1983 to 1985.

He has also been a lecturer at the University of California at Berkeley School of Journalism and an adjunct faculty member of the Stanford Graduate Program on Journalism.

The Times nominated him for a Pulitzer Prize in 1995, 1998 and 2000. The San Francisco Examiner nominated him for a Pulitzer in 1987. In 2005, with a group of Times reporters, he received the Loeb Award for business journalism. In 2007 he shared the Society of American Business Editors and Writers Breaking News award. In 2013 he was awarded a Pulitzer Prize in explanatory reporting as part of a New York Times project on labor and automation.

In 2007 he became a member of the International Media Council at the World Economic Forum. Also in 2007, he was named a fellow of the Society of Professional Journalists, the organization’s highest honor.

In June of 2010 the New York Times presented him with the Nathaniel Nash Award, which is given annually for foreign and business reporting.

Born in Oakland, California on October 29, 1949, Markoff grew up in Palo Alto, California and graduated from Whitman College, Walla Walla, Washington, in 1971. He attended graduate school at the University of Oregon and received a master’s degree in sociology in 1976.

Markoff is the co-author of “The High Cost of High Tech,” published in 1985 by Harper & Row. He wrote “Cyberpunk: Outlaws and Hackers on the Computer Frontier” with Katie Hafner, which was published in 1991 by Simon & Schuster.
In January of 1996 Hyperion published “Takedown: The Pursuit and Capture of America’s Most Wanted Computer Outlaw,” which he co-authored with Tsutomu Shimomura. “What the Dormouse Said: How the Sixties Counterculture Shaped the Personal Computer Industry” was published in 2005 by Viking Books. “Machines of Loving Grace: The Quest for Common Ground Between Humans and Robots” was published in August of 2015 by HarperCollins Ecco.

He is currently researching a biography of Stewart Brand.

He is married to Leslie Terzian Markoff and they live in San Francisco, Calif.


MACHINES OF LOVING GRACE – The Quest for Common Ground Between Humans and Robots By John Markoff, Illustrated. 378 pp. Ecco/HarperCollins Publishers.

Shneiderman’s “Eight Golden Rules of Interface Design”. These rules were obtained from the text Designing the User Interface by Ben Shneiderman.

“Designing the User Interface”, 6th Edition. This is a revised edition of the highly successful textbook on Human Computer Interaction originally developed by Ben Shneiderman and Catherine Plaisant at the University of Maryland.

Related Posts

– Recruit Institute of Technology. Interview with Alon Halevy ODBMS Industry Watch, Published on 2016-04-02

– Civility in the Age of Artificial Intelligence,  by STEVE LOHR, technology reporter for The New York Times, ODBMS.org

– On Artificial Intelligence and Society. Interview with Oren Etzioni, ODBMS Industry Watch.

– On Big Data and Society. Interview with Viktor Mayer-Schönberger, ODBMS Industry Watch.

On Big Data and Data Science. Interview with James Kobielus http://www.odbms.org/blog/2016/04/on-big-data-and-data-science-interview-with-james-kobielus/ Tue, 19 Apr 2016 08:34:09 +0000

“One of the most typical mistakes in large-scale data projects is losing sight of the biases that may skew the insights you extract.”– James Kobielus

On the topics of Big Data and Data Science, I have interviewed James Kobielus, IBM Big Data Evangelist.


Q1. What kind of companies generate Big Data, besides the Internet giants?

James Kobielus: Big data isn’t something you “generate.” Rather, the term refers to the ability to achieve differentiated value from advanced analytics on trustworthy data at any scale. In other words, it’s a best practice, not a specific type of data or even a specific scale of data (measured in volume, velocity, and/or variety).

When considered in this light, you can identify big data analytic applications in every industry. Every C-level executive has strategic applications of big data. Here are just a smattering:

  • Chief Marketing Officers have been the prime movers on many big data initiatives that involve Hadoop, NoSQL, and other approaches. Their primary applications consist of marketing campaign optimization, customer churn and loyalty, upsell and cross-sell analysis, targeted offers, behavioral targeting, social media monitoring, sentiment analysis, brand monitoring, influencer analysis, customer experience optimization, content optimization, and placement optimization.
  • Chief Information Officers use big data platforms for data discovery, data integration, business analytics, advanced analytics, and exploratory data science.
  • Chief Operations Officers rely on big data for supply chain optimization, defect tracking, sensor monitoring, and smart grid, among other applications.
  • Chief Information Security Officers run security incident and event management, anti-fraud detection, and other sensitive applications on big data.
  • Chief Technology Officers do IT log analysis, event analytics, network analytics, and other systems monitoring, troubleshooting, and optimization applications on big data.
  • Chief Financial Officers run complex financial risk analysis and mitigation modeling exercises on big data platforms.

Q2. What are the most challenging problems you are facing when analysing Big Data?

James Kobielus: Searching for actionable intelligence in big data involves building and testing advanced-analytics models against large volumes of complex data that may be flowing in at high velocities.

At these scales, it’s easy to get overwhelmed in your analysis unless you automate the end-to-end processes of extracting intelligence at scale. Automation can also help control the cost of managing a growing volume of algorithmic models against ever expanding big-data collections. The key processes that need automating are data discovery, profiling, sampling, and preparation, as well as model building, scoring, and deployment.
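
One way to picture this kind of end-to-end automation is a fixed chain of steps that runs without manual hand-offs. The sketch below is only an illustration under stated assumptions: the step functions (`profile`, `sample`, `prepare`, `build_model`, `score`), the synthetic data and the straight-line model are all hypothetical stand-ins, not tools or methods named in the interview.

```python
import random


def profile(state):
    """Profiling: record row count and how many targets are missing."""
    rows = state["rows"]
    state["profile"] = {
        "count": len(rows),
        "missing_y": sum(1 for _, y in rows if y is None),
    }
    return state


def sample(state, k=200, seed=0):
    """Sampling: keep a reproducible random subset of the rows."""
    rng = random.Random(seed)
    state["rows"] = rng.sample(state["rows"], min(k, len(state["rows"])))
    return state


def prepare(state):
    """Preparation: drop rows whose target value is missing."""
    state["rows"] = [(x, y) for x, y in state["rows"] if y is not None]
    return state


def build_model(state):
    """Model building: ordinary least-squares fit of y = slope * x + intercept."""
    xs = [x for x, _ in state["rows"]]
    ys = [y for _, y in state["rows"]]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    state["model"] = (slope, mean_y - slope * mean_x)
    return state


def score(state):
    """Scoring: mean absolute error of the fitted line on the prepared rows."""
    slope, intercept = state["model"]
    errors = [abs(y - (slope * x + intercept)) for x, y in state["rows"]]
    state["mean_abs_error"] = sum(errors) / len(errors)
    return state


# The whole process is just an ordered list of steps.
PIPELINE = [profile, sample, prepare, build_model, score]


def run_pipeline(rows):
    state = {"rows": list(rows)}
    for step in PIPELINE:
        state = step(state)
    return state


def make_synthetic_rows(n=1000, seed=1):
    """Synthetic data: y = 2x + 1 plus noise, with every tenth target missing."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        x = rng.uniform(0, 10)
        y = None if i % 10 == 0 else 2.0 * x + 1.0 + rng.gauss(0, 0.5)
        rows.append((x, y))
    return rows


result = run_pipeline(make_synthetic_rows())
print(result["profile"], result["model"], result["mean_abs_error"])
```

Because each step takes and returns the same state dictionary, steps can be added, reordered or replaced without touching the runner, which is the property that makes automation at scale manageable.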

Q3. How do you typically handle them?

James Kobielus: Automating the modeling process will boost data scientist productivity by an order of magnitude, freeing them from drudgery so that they can focus on the sorts of exploration, modeling, and visualization challenges that demand expert human judgment. Data scientists can accelerate their modeling automation initiatives by following these steps:

  • Virtualize access to data, metadata, rules, and predictive models, as well as to data integration, data warehousing, and advanced analytic applications through a BI semantic virtualization layer;
  • Unify access, governance, orchestration, automation, and administration across these resources within a service-oriented architecture;
  • Explore commercial tools that support maximum automation of model development, scoring, deployment, and execution;
  • Consolidate, accelerate, and deepen predictive analytics through integration into big-data platforms with scalable in-database execution; and
  • Migrate existing analytical data marts into multidomain big-data platforms with unified data, metadata, and model governance within a service-oriented virtualization framework.

Q4. What are, in your experience, the typical mistakes made in large-scale data projects?

James Kobielus: One of the most typical mistakes in large-scale data projects is losing sight of the biases that may skew the insights you extract.

Even if you accept that a data scientist’s integrity is rock-solid, intentions pure, skills stellar, and discipline rigorous, there’s no denying that bias may creep inadvertently into their work with big data. The biases may be minor or major, episodic or systematic, tangential or material to their findings and recommendations. Whatever their nature, the biases must be understood and corrected as fully as possible.

Here are some of the key sources of bias that may crop up in a data scientist’s work with big data:

  • Cognitive bias: This is the tendency to make skewed decisions based on pre-existing cognitive and heuristic factors–such as a misunderstanding of probabilities–rather than on the data and other hard evidence. You might say that the educated intuition that drives data science is rife with cognitive bias, but that’s not always a bad thing.
  • Selection bias: This is the tendency to skew your choice of data sources to those that may be most available, convenient, and cost-effective for your purposes, as opposed to being necessarily the most valid and relevant for your study. Clearly, data scientists do not have unlimited budgets, may operate under tight deadlines, and don’t use data for which they lack authorization. These constraints may introduce an unconscious bias in the big-data collections they are able to assemble.
  • Sampling bias: This is the tendency to skew the sampling of data sets toward subgroups of the population most relevant to the initial scope of a data-science project, thereby making it unlikely that you will uncover any meaningful correlations that may apply to other segments. Another source of sampling bias is “data dredging,” in which the data scientist uses regression techniques that may find correlations in samples but that may not be statistically significant in the wider population. Consequently, you’re likely to spuriously confirm your initial model for the segments that happen to make the sampling cut.
  • Modeling bias: Beyond the biases just discussed, this is the tendency to skew data-science models by starting with a biased set of project assumptions that drive selection of the wrong variables, the wrong data, the wrong algorithms, and the wrong metrics of fitness. In addition, overfitting of models to past data without regard for predictive lift is a common bias. Likewise, failure to score and iterate models in a timely fashion with fresh observational data also introduces model decay, hence bias.
  • Funding bias: This may be the most silent but pernicious bias in data-scientific studies of all sorts. It’s the unconscious tendency to skew all modeling assumptions, interpretations, data, and applications to favor the interests of the party–employer, customer, sponsor, etc.–that employs or otherwise financially supports the data-science initiative. Funding bias makes it highly unlikely that data scientists will uncover disruptive insights that will “break the rice bowl” in which they make their living.
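
The “data dredging” risk mentioned under sampling bias can be demonstrated with pure noise. In the minimal sketch below, which assumes nothing beyond the Python standard library, 200 random candidate features are screened against a random target of 30 observations. The function name `dredge` and all parameter values are illustrative choices. The strongest correlation found in the small sample looks meaningful, yet the same kind of variable shows essentially no correlation on a large fresh sample.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def dredge(n_samples=30, n_features=200, n_fresh=5000, seed=0):
    """Screen many noise features against a noise target and keep the best.

    Returns (index of best feature, its in-sample correlation, and the
    correlation of fresh independent draws of feature and target).
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    features = [
        [rng.gauss(0, 1) for _ in range(n_samples)] for _ in range(n_features)
    ]
    correlations = [pearson(feature, target) for feature in features]
    best = max(range(n_features), key=lambda i: abs(correlations[i]))

    # Every feature is independent noise, so "collecting more data" for the
    # winning feature simply means drawing new independent values.
    fresh_target = [rng.gauss(0, 1) for _ in range(n_fresh)]
    fresh_feature = [rng.gauss(0, 1) for _ in range(n_fresh)]
    return best, correlations[best], pearson(fresh_feature, fresh_target)


best_index, in_sample_r, fresh_r = dredge()
print(f"best feature #{best_index}: in-sample r = {in_sample_r:.2f}, "
      f"fresh-sample r = {fresh_r:.2f}")
```

Nothing in the data relates any feature to the target, so the apparent in-sample relationship comes entirely from testing many candidates on few observations.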

Q5. How do you measure “success” when analysing data?

James Kobielus: You measure success in your ability to distill useful insights in a timely fashion from the data at your disposal.

Q6. What skills are required to be an effective Data Scientist?

James Kobielus: Data science’s learning curve is formidable. To a great degree, you will need a degree, or something substantially like it, to prove you’re committed to this career. You will need to submit yourself to a structured curriculum to certify you’ve spent the time, money and midnight oil necessary for mastering this demanding discipline.

Sure, there are run-of-the-mill degrees in data-science-related fields, and then there are uppercase, boldface, bragging-rights “DEGREES.” To some extent, it matters whether you get that old data-science sheepskin from a traditional university vs. an online school vs. a vendor-sponsored learning program. And it matters whether you only logged a year in the classroom vs. sacrificed a considerable portion of your life reaching for the golden ring of a Ph.D. And it certainly matters whether you simply skimmed the surface of old-school data science vs. pursued a deep specialization in a leading-edge advanced analytic discipline.

But what matters most to modern business isn’t that every data scientist has a big honking doctorate. What matters most is that a substantial body of personnel has a common grounding in a core curriculum of skills, tools and approaches. Ideally, you want to build a team where diverse specialists with a shared foundation can collaborate productively.

Big data initiatives thrive if all data scientists have been trained and certified on a curriculum with the following foundation:

  • Paradigms and practices: Every data scientist should acquire a grounding in core concepts of data science, analytics and data management. They should gain a common understanding of the data science lifecycle, as well as the typical roles and responsibilities of data scientists in every phase. They should be instructed on the various role(s) of data scientists and how they work in teams and in conjunction with business domain experts and stakeholders. And they should learn a standard approach for establishing, managing and operationalizing data science projects in the business.
  • Algorithms and modeling: Every data scientist should obtain a core understanding of linear algebra, basic statistics, linear and logistic regression, data mining, predictive modeling, cluster analysis, association rules, market basket analysis, decision trees, time-series analysis, forecasting, machine learning, Bayesian and Monte Carlo statistics, matrix operations, sampling, text analytics, summarization, classification, principal components analysis, experimental design, unsupervised learning, and constrained optimization.
  • Tools and platforms: Every data scientist should master a core group of modeling, development and visualization tools used on your data science projects, as well as the platforms used for storage, execution, integration and governance of big data in your organization. Depending on your environment, and the extent to which data scientists work with both structured and unstructured data, this may involve some combination of data warehousing, Hadoop, stream computing, NoSQL and other platforms. It will probably also entail providing instruction in MapReduce, R and other new open-source development languages, in addition to SPSS, SAS and any other established tools.
  • Applications and outcomes: Every data scientist should learn the chief business applications of data science in your organization, as well as how to work best with subject-domain experts. In many companies, data science focuses on marketing, customer service, next best offer, and other customer-centric applications. Often, these applications require that data scientists understand how to leverage customer data acquired from structured survey tools, sentiment analysis software, social media monitoring tools and other sources. It is also essential that every data scientist gain an understanding of the key business outcomes–such as maximizing customer lifetime value–that should focus their modeling initiatives.

Classroom instruction is important, but a curriculum that is 100 percent devoted to reading books, taking tests and sitting through lectures is insufficient. Hands-on laboratory work is paramount for a truly well-rounded data scientist. Make sure that your data scientists acquire certifications and degrees that reflect them actually developing statistical models that use real data and address substantive business issues.

A business-oriented data-science curriculum should produce expert developers of statistical and predictive models. It should not degenerate into a program that produces analytics geeks with heads stuffed with theory but whose diplomas are only fit for hanging on the wall.

Q7. Hadoop vs. Spark: what are the pros and cons?

James Kobielus: Big data analytics infrastructures are growing more hybridized than ever. Every new technology—such as Hadoop, in-memory databases, and graph databases—finds its specific niche in terms of use cases, deployment modes, and applications for which it is best suited.

Even as Apache Spark pushes more deeply into big-data environments, it won’t substantially change this trend. Yes, of course Spark is on the fast track to ubiquity in big-data analytics. This is especially true for the next generation of machine-learning applications that feed on growing in-memory pools and require low-latency distributed computations for streaming and graph analytics. But those use cases aren’t the sum total of big-data analytics and never will be.

As we all grow more infatuated with Spark, it’s important to continually remind ourselves of what it’s not suitable for. If, for example, one considers all the critical data management, integration, and preparation tasks that must be performed prior to modeling in Spark, it’s clear that these will not be executed in any of the Spark engines (Spark SQL, Spark Streaming, GraphX). Instead, they’ll be carried out in the data platforms and elastic clusters (HDFS, Cassandra, HBase, Mesos, cloud services, etc.) upon which those engines run. Likewise, you’d be hard-pressed to find anyone who’s seriously considering Spark in isolation for data warehousing, data governance, master data management, or operational business intelligence.

Above all else, Spark is the new power tool for data scientists who are pushing boundaries in the emerging era of in-memory big data analytics in low-latency scenarios of all types. Spark is proving its value as a development tool for the new generation of data scientists building the in-memory statistical models upon which it all will depend.

Let’s not fall into the delusion that everything is converging toward Spark, as if it were the ravenous maw that will devour every other big-data analytics tool and platform. Spark is just another approach that’s being fitted to and optimized for specific purposes.

And let’s resist the hype that treats Spark as Hadoop’s “successor.” This implies that Hadoop and other big-data approaches are “legacy,” rather than what they are, which is foundational. For example, no one is seriously considering doing “data lakes,” “data reservoirs,” or “data refineries” on anything but Hadoop or NoSQL.


James Kobielus is an industry veteran and serves as IBM Big Data Evangelist; Senior Program Director for Product Marketing in Big Data Analytics; and Team Lead, Technical Marketing, IBM Big Data & Analytics Hub. He spearheads thought leadership activities across the IBM Analytics solution portfolio. He has spoken at such leading industry events as IBM Insight, Hadoop Summit, and Strata. He has published several business technology books and is a very popular provider of original commentary on blogs and many social media.


–  Master of Information and Data Science,  UC Berkeley School of Information.

– MS in Data Science, NYU Center for Data Science.

– Free data science curriculum, kdnuggets.com

Data Science | Coursera

– Master of Science in Data Science – Data Science Institute

Data Mining and Applications Graduate Certificate, Stanford

The European Data Science Academy (EDSA) designs curricula for data science training and data science education across the European Union (EU).

-The EDISON project will focus on activities to establish the new profession of ‘Data Scientist’, following the emergence of Data Science technologies (also referred to as Data Intensive or Big Data technologies) which changes the way research is done, how scientists think and how the research data are used and shared. This includes definition of the required skills, competences framework/profile, corresponding Body Of Knowledge and model curriculum. It will develop a sustainability/business model to ensure a sustainable increase of Data Scientists, graduated from universities and trained by other professional education and training institutions in Europe. 
EDISON will facilitate the establishment of a Data Science education and training infrastructure at major European universities by promoting experience of ‘champion’ universities involving them into coordinated development and implementation of the model curriculum and creation of cooperative educational and training infrastructure.

Related Posts

– RIP Big Data, By Carl Olofson, Research Vice President, Data Management Software Research, IDC. ODBMS.org, January  2016

Open Source Software and IBM’s Big Data platform. By Cynthia M. Saracco, senior solutions architect at IBM’s Silicon Valley Laboratory. ODBMS.org, April 2016.

Looking back at Big Data in 2015, By Cynthia M. Saracco, IBM Senior Solution Architect, ODBMS.org. November 2015

–  Heuristics for a Data Scientist: A common sense approach. BY Silvia Dassiè, Data Scientist at Ryanair. ODBMS.org, December 2015

The rise of immutable data stores. By Alan Morrison, Senior Manager, PwC Center for technology and innovation. ODBMS.org. October 2015

Follow us on Twitter: @odbmsorg


On Data Mining and Data Science. Interview with Charu Aggarwal http://www.odbms.org/blog/2015/05/on-data-mining-and-data-science-interview-with-charu-aggarwal/ http://www.odbms.org/blog/2015/05/on-data-mining-and-data-science-interview-with-charu-aggarwal/#comments Tue, 12 May 2015 12:03:08 +0000 http://www.odbms.org/blog/?p=3894

“What is different in big data applications, is that sometimes the data is stored in a distributed sense, and even simple processing becomes more challenging” — Charu Aggarwal.

On Data Mining, Data Science and Big Data, I have interviewed Charu Aggarwal, Research Scientist at the IBM T. J. Watson Research Center, an expert in this area.


Q1. You recently edited two books: Data Classification: Algorithms and Applications and Data Clustering: Algorithms and Applications.
What are the main lessons learned in data classification and data clustering that you can share with us?

Charu Aggarwal: The most important lesson, which is perhaps true for all of data mining applications, is that feature extraction, selection and representation are extremely important. It is all too often that we ignore these important aspects of the data mining process.

Q2. How do data classification and data clustering relate to each other?

Charu Aggarwal: Data classification is the supervised version of data clustering. Data clustering is about dividing the data into groups of similar points. In data classification, examples of groups of points are made available to you. Then, for a given test instance, you are supposed to predict which group this point might belong to.
In the latter case, the groups often have a semantic interpretation. For example, the groups might correspond to fraud/not fraud labels in a credit-card application. In many cases, it is natural for the groups in classification to be clustered as well. However, this is not always the case.
Some methods such as semi-supervised clustering/classification leverage the natural connections between these problems to provide better quality results.
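To make the contrast concrete, here is a minimal sketch in Python (standard library only). The one-dimensional toy data, the function names, and the choice of k-means and a nearest-centroid classifier are illustrative assumptions, not methods named in the interview.

```python
from statistics import mean

def kmeans_1d(points, centers, iterations=10):
    """Unsupervised: group points around centers; no labels are given."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            groups[nearest].append(p)
        # Keep the old center if a group ends up empty.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers

def train_nearest_centroid(labeled):
    """Supervised: labeled examples define the groups up front."""
    by_label = {}
    for value, label in labeled:
        by_label.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in by_label.items()}

def classify(centroids, value):
    """Predict the label whose centroid is closest to the test instance."""
    return min(centroids, key=lambda label: abs(value - centroids[label]))

# Hypothetical transaction amounts (unlabeled) for clustering.
amounts = [10, 12, 11, 13, 900, 950, 1000]
centers = kmeans_1d(amounts, centers=[0.0, 500.0])

# Hypothetical labeled examples for classification.
labeled = [(10, "not fraud"), (12, "not fraud"), (900, "fraud"), (1000, "fraud")]
model = train_nearest_centroid(labeled)
prediction = classify(model, 870)
```

In this toy case the clusters found without labels happen to line up with the labeled groups, which is the "natural connection" that semi-supervised methods exploit; as noted above, that alignment is not guaranteed on real data.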

Q3. Can data classification and data clustering be useful also for large data sets and data streams? If yes, how?

Charu Aggarwal: Data clustering is definitely useful for large data sets, because clusters can be viewed as summaries of the data. In fact, a particular form of fine-grained clustering, referred to as micro-clustering, is commonly used for summarizing high-volume streaming data in real time. These summaries are then used for many different applications, such as first-story detection, novelty detection, prediction, and so on.
In this sense, clustering plays an intermediate role in enabling other applications for large data sets.
Classification can also be used to generate different types of summary information, although it is a little less common. The reason is that classification is often used as the end-user application, rather than as an intermediate application like clustering. Therefore, big data serves as both a challenge and an opportunity for classification.
It serves as a challenge because of obvious computational reasons. It serves as an opportunity because you can build more complex and accurate models with larger data sets without creating a situation where the model inadvertently overfits to the random noise in the data.
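The micro-clustering idea mentioned above can be sketched as follows. This is a deliberately simplified, one-dimensional illustration: the class name, the `max_distance` parameter, and the absorb-or-create rule are assumptions made for this example, and published stream-clustering methods typically track more state (for instance, timestamps). The key point it shows is that each summary is additive, so a point can be folded in and then discarded.

```python
import math

class MicroCluster:
    """Additive summary of the points absorbed so far (1-D for brevity)."""

    def __init__(self, x):
        self.n = 1
        self.linear_sum = x
        self.square_sum = x * x

    def centroid(self):
        return self.linear_sum / self.n

    def radius(self):
        # Standard deviation recovered from the three running statistics.
        variance = self.square_sum / self.n - self.centroid() ** 2
        return math.sqrt(max(variance, 0.0))

    def absorb(self, x):
        self.n += 1
        self.linear_sum += x
        self.square_sum += x * x


def summarize(stream, max_distance):
    """Single pass: each point joins the closest micro-cluster or starts a new one."""
    clusters = []
    for x in stream:
        closest = min(clusters, key=lambda c: abs(x - c.centroid()), default=None)
        if closest is not None and abs(x - closest.centroid()) <= max_distance:
            closest.absorb(x)
        else:
            clusters.append(MicroCluster(x))
    return clusters


# Hypothetical sensor readings arriving one at a time.
readings = [1.0, 1.2, 0.9, 10.0, 10.3, 1.1, 9.8]
summary = summarize(readings, max_distance=1.0)
sizes = [c.n for c in summary]
```

Downstream applications such as novelty detection can then work from `summary` alone: a reading that is far from every centroid is a candidate novelty, and the raw stream never has to be stored.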

Q4. How do you typically extract “information” from Big Data?

Charu Aggarwal: This is a highly application-specific question, and it really depends on what you are looking for. For example, for the same stream of health-care data, you might be looking for different types of information, depending on whether you are trying to detect fraud, or whether you are trying to discover clinical anomalies. At the end of the day, the role of the domain expert can never be discounted.
However, the common theme in all these cases is to create a more compressed, concise, and clean representation into one of the data types we all recognize and know how to process. Of course, this step is required in all data mining applications, and not just big data applications. What is different in big data applications, is that sometimes the data is stored in a distributed sense, and even simple processing becomes more challenging.
For example, if you look at Google’s original MapReduce framework, it was motivated by a need to efficiently perform operations that are almost trivial for smaller data sets, but suddenly become very expensive in the big-data setting.
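As a rough illustration of why such a framework helps, the sketch below simulates the map, shuffle, and reduce steps for word counting in a single Python process. Counting words is trivial on one machine; the point of the decomposition is that the map and reduce functions can run independently on many machines. The function names and input strings are assumptions for this example, and nothing here uses an actual Hadoop or MapReduce API.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

# Hypothetical input splits; in a real cluster each would live on a different machine.
splits = ["big data is big", "data is distributed"]

emitted = [pair for split in splits for pair in map_phase(split)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())
```

Because `map_phase` looks at one split at a time and `reduce_phase` looks at one key at a time, neither needs a global view of the data, which is what makes the expensive part parallelizable.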

Q5. What are the typical problems and scenarios when you cluster multimedia, text, biological, categorical, network, streams, and uncertain data?

Charu Aggarwal: The heterogeneity of the data types causes significant challenges.
One problem is that the different data types may often be mixed, as a result of which the existing methods can sometimes not be used directly. Some common scenarios in which such data types arise are photo/music/video-sharing (multimedia), healthcare (time-series streams and biological), and social networks. Among these different data types, the probabilistic (uncertain) data type does not seem to have graduated from academia into industry very well. Of course, it is a new area and there is a lot of active research going on. The picture will become clearer in a few years.

Q6. How effective are today’s clustering algorithms?

Charu Aggarwal: Clustering algorithms have become increasingly effective in recent years because of advances in high-dimensional methods. In the past, when the data was very high-dimensional, most existing methods worked poorly because of locally irrelevant attributes and concentration effects. These are collectively referred to as the curse of dimensionality. Techniques such as subspace and projected clustering have been introduced to discover clusters in lower dimensional views of the data. One nice aspect of this approach is that some variations of it are highly interpretable.
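The concentration effect can be observed with a small experiment. The sketch below is an illustration under assumed settings (uniform random points in a unit hypercube, Euclidean distance, a fixed seed), not an analysis from the interview. It measures the relative contrast between the farthest and nearest neighbor of a query point; as dimensionality grows, that contrast shrinks and "nearest" becomes less meaningful.

```python
import math
import random

def relative_contrast(dimensions, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dimensions)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dimensions)]
        distances.append(math.dist(query, point))
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest

low = relative_contrast(dimensions=2)     # neighbors are clearly distinguishable
high = relative_contrast(dimensions=500)  # distances bunch together
```

Comparing `low` and `high` shows why distance-based clustering degrades in the full space and why searching for clusters in lower-dimensional projections can help.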

Q7. What is in common between pattern recognition, database analytics, data mining, and machine learning?

Charu Aggarwal: They really do the same thing, which is that of analyzing and gleaning insights from data. It is just that the styles and emphases are different in various communities. Database folks are more concerned about scalability. Pattern recognition and machine learning folks are somewhat more theoretical. The statistical folks tend to use their statistical models. The data mining community is the most recent one, and it was formed to create a common meeting ground for these diverse communities.
The first KDD conference was held in 1995, and we have come a long way since then towards integration. I believe that the KDD conference has played a very major role in the amalgamation of these communities. Today, it is actually possible for the folks from database and machine learning communities to be aware of each other’s work. This was not quite true 20 years ago.

Q8. What are the most “precise” methods in data classification?

Charu Aggarwal: I am sure that you will find experts who are willing to swear by a particular model. However, each model comes with a different set of advantages over different data sets. Furthermore, some models, such as univariate decision trees and rule-based methods, have the advantage of being interpretable even when they are outperformed by other methods. After all, analysts love to know about the “why” aside from the “what.”

While I cannot say which models are the most accurate (highly data specific), I can certainly point to the most “popular” ones today from a research point of view. I would say that SVMs, and neural networks (deep learning) are the most popular classification methods. However, my personal experience has been mixed.
While I have found SVMs to work quite well across a wide variety of settings, neural networks are generally less robust. They can easily overfit to noise or show unstable performance over small ranges of parameters. I am watching the debate over deep learning with some interest to see how it plays out.

Q9. When should one use Mahout for classification, and what is the advantage of doing so?

Charu Aggarwal: Apache Mahout is a scalable machine learning environment for data mining applications. One distinguishing feature of Apache Mahout is that it builds on top of distributed infrastructures like MapReduce, and enables easy building of machine learning applications. It includes libraries of various operations and applications.
Therefore, it reduces the effort of the end user beyond the basic MapReduce framework. It should be used in cases where the data is large enough to require the use of such distributed infrastructures.

Q10. What are your favourite success stories in Data Classification and/or Data Clustering?

Charu Aggarwal: One of my favorite success stories is in the field of high dimensional data, where I explored the effect of locally irrelevant dimensions and concentration effects on various data mining algorithms.
I designed a suite of algorithms for such high-dimensional tasks as clustering, similarity search, and outlier detection.
The algorithms continue to be relevant even today, and we have even generalized some of these results to big-data (streaming) scenarios and other application domains, such as the graph and text domains.

Qx. Anything else you wish to add?

Charu Aggarwal: Data mining and data science are at an exciting crossroads today. I have been working in this field since 1995, and I did not see as much excitement about data science in my first 15 years as I have seen in the last 5. This is truly quite amazing!

Charu C. Aggarwal is a Research Scientist at the IBM T. J. Watson Research Center in Yorktown Heights, New York.
He completed his B.S. from IIT Kanpur in 1993 and his Ph.D. from Massachusetts Institute of Technology in 1996.
His research interest during his Ph.D. years was in combinatorial optimization (network flow algorithms), and his thesis advisor was Professor James B. Orlin.
He has since worked in the field of performance analysis, databases, and data mining. He has published over 200 papers in refereed conferences and journals, and has applied for or been granted over 80 patents. He is author or editor of nine books.
Because of the commercial value of the above-mentioned patents, he has received several invention achievement awards and has thrice been designated a Master Inventor at IBM. He is a recipient of an IBM Corporate Award (2003) for his work on bio-terrorist threat detection in data streams, a recipient of the IBM Outstanding Innovation Award (2008) for his scientific contributions to privacy technology, and a recipient of an IBM Research Division Award (2008) for his scientific contributions to data stream research.
He has served on the program committees of most major database/data mining conferences, and served as program vice-chair of the SIAM Conference on Data Mining, 2007, the IEEE ICDM Conference, 2007, the WWW Conference 2009, and the IEEE ICDM Conference, 2009. He served as an associate editor of the IEEE Transactions on Knowledge and Data Engineering Journal from 2004 to 2008. He is an associate editor of the ACM TKDD Journal, an action editor of the Data Mining and Knowledge Discovery Journal, an associate editor of the ACM SIGKDD Explorations, and an associate editor of the Knowledge and Information Systems Journal.
He is a fellow of the IEEE for “contributions to knowledge discovery and data mining techniques”, and a life-member of the ACM.



Data Classification: Algorithms and Applications, Editor: Charu C. Aggarwal, Publisher: CRC Press/Taylor & Francis Group, 978-1-4665-8674-1, © 2014, 707 pages

Data Clustering: Algorithms and Applications, Edited by Charu C. Aggarwal, Chandan K. Reddy, August 21, 2013 by Chapman and Hall/CRC

– MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat
Appeared in: OSDI’04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004.


Follow ODBMS.org on Twitter: @odbmsorg


How to run a Big Data project. Interview with James Kobielus http://www.odbms.org/blog/2014/05/james-kobielus/ http://www.odbms.org/blog/2014/05/james-kobielus/#comments Thu, 15 May 2014 17:20:16 +0000 http://www.odbms.org/blog/?p=3006

“You need a team of dedicated data scientists to develop and tune the core intellectual property–statistical, predictive, and other analytic models–that drive your Big Data applications. You don’t often think of data scientists as “programmers,” per se, but they are the pivotal application developers in the age of Big Data.”–James Kobielus

Managing the pitfalls and challenges of Big Data projects. On this topic I have interviewed James Kobielus, IBM Senior Program Director, Product Marketing, Big Data Analytics solutions.


Q1. Why run a Big Data project in the enterprise?

James Kobielus: Many Big Data projects are in support of customer relationship management (CRM) initiatives in marketing, customer service, sales, and brand monitoring. Justifying a Big Data project with a CRM focus involves identifying the following quantitative ROI:
• Volume-based value: The more comprehensive your 360-degree view of customers and the more historical data you have on them, the more insight you can extract from it all and, all things considered, the better decisions you can make in the process of acquiring, retaining, growing and managing those customer relationships.
• Velocity-based value: The more customer data you can ingest rapidly into your big-data platform and the more questions a user can pose against that data (via queries, reports, dashboards, etc.) within a given time period, the more likely you are to make the right decision at the right time to achieve your customer relationship management objectives.
• Variety-based value: The more varied customer data you have – from the CRM system, social media, call-center logs, etc. – the more nuanced portrait you have on customer profiles, desires and so on, hence the better-informed decisions you can make in engaging with them.
• Veracity-based value: The more consolidated, conformed, cleansed, consistent, and current the data you have on customers, the more likely you are to make the right decisions based on the most accurate data.
How can you attach a dollar value to any of this? It’s not difficult. Customer lifetime value (CLV) is a standard metric that you can calculate from big-data analytics’ impact on customer acquisition, onboarding, retention, upsell, cross-sell and other concrete bottom-line indicators, as well as from corresponding improvements in operational efficiency.
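As a back-of-the-envelope illustration of attaching a dollar value, the sketch below uses one common simplified CLV formula, margin × r / (1 + d − r), which assumes a constant annual margin, a constant retention rate r, and a discount rate d. The formula choice and every figure in the example are assumptions for illustration; the interview does not specify how CLV should be computed.

```python
def customer_lifetime_value(annual_margin, retention_rate, discount_rate):
    """Simplified CLV: margin * r / (1 + d - r), with constant margin and retention."""
    if not 0 <= retention_rate <= 1:
        raise ValueError("retention_rate must be between 0 and 1")
    if discount_rate <= 0:
        raise ValueError("discount_rate must be positive")
    return annual_margin * retention_rate / (1 + discount_rate - retention_rate)

# Hypothetical figures: $200 annual margin per customer, 10% discount rate.
# Suppose an analytics-driven retention program lifts retention from 70% to 75%.
before = customer_lifetime_value(200.0, retention_rate=0.70, discount_rate=0.10)
after = customer_lifetime_value(200.0, retention_rate=0.75, discount_rate=0.10)
uplift_per_customer = after - before
```

Multiplying `uplift_per_customer` by the number of affected customers gives a first-order estimate to set against the project's cost.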

Q2. What are the business decisions that need to be made in order to successfully support a Big Data project in the enterprise?

James Kobielus: In order to successfully support a Big Data project in the enterprise, you have to make the infrastructure and applications production-ready in your operations.
Production-readiness means that your big-data investment is fit to realize its full operational potential. If you think “productionizing” can be done in a single step, such as by, say, introducing HDFS NameNode redundancy, then you need a cold slap of reality. Productionizing demands a lifecycle focus that encompasses all of your big-data platforms, not just a single one (e.g., Hadoop/HDFS), and addresses more than just a single requirement (e.g., ensuring a highly available distributed file system).
Productionizing involves jumping through a series of procedural hoops to ensure that your big-data investment can function as a reliable business asset. Here are several high-level considerations to keep in mind as you ready your big-data initiative for primetime deployment:
• Stakeholders: Have you aligned your big-data initiatives with stakeholder requirements? If stakeholders haven’t clearly specified their requirements or expectations for your big-data initiative, it’s not production-ready. The criteria of production-readiness must conform to what stakeholders require, and that depends greatly on the use cases and applications they have in mind for Big Data. Service-level agreements (SLAs) vary widely for Big Data deployed as an enterprise data warehouse (EDW), as opposed to an exploratory data-science sandbox, an unstructured information transformation tier, a queryable archive, or some other use. SLAs for performance, availability, security, governance, compliance, monitoring, auditing and so forth will depend on the particulars of each big-data application, and on how your enterprise prioritizes them by criticality.
• Stacks: Have you hardened your big-data technology stack – databases, middleware, applications, tools, etc. – to address the full range of SLAs associated with the chief use cases? If the big-data platform does not meet the availability, security and other robustness requirements expected of most enterprise infrastructure, it’s not production-ready. Ideally, all production-grade big-data platforms should benefit from a common set of enterprise management tools.
• Scalability: Have you architected your environment for modular scaling to keep pace with inexorable growth in data volumes, velocities and varieties? If you can’t provision, add, or reallocate new storage, compute and network capacity on the big-data platform in a fast, cost-effective, modular way to meet new requirements, the platform is not production-ready.
• Skillsets: Have you beefed up your organization’s big-data skillsets for maximum productivity? If your staff lacks the requisite database, integration and analytics skills and tools to support your big-data initiatives over their expected life, your platform is not production-ready. Don’t go deep on Big Data until your staff skills are upgraded.
• Seamless service: Have you re-engineered your data management and analytics IT processes for seamless support for disparate big-data initiatives? If you can’t provide trouble response, user training and other support functions in an efficient, reliable fashion that’s consistent with existing operations, your big-data platform is not production-ready.
To the extent that your enterprise already has a mature enterprise data warehousing (EDW) program in production, you should use that as the template for your big-data platform. There is absolutely no need to redefine “productionizing” for Big Data’s sake.

Q3. What are the most common problems and challenges encountered in Big Data projects?

James Kobielus: The most common problems and challenges in Big Data projects revolve around integrated lifecycle management (ILM).
ILM faces a new frontier when it comes to Big Data. The core challenges are threefold: the sheer unbounded size of Big Data, the ephemeral nature of much of the new data, and the difficulty of enforcing consistent quality as the data scales along any and all of the three Vs (volume, velocity, and variety). Comprehensive ILM has grown more difficult to ensure in Big Data environments, given rapid changes in the following areas:
• New Big Data platforms: Big data is ushering a menagerie of new platforms (Hadoop, NoSQL, in-memory, and graph databases) into enterprise computing environments, alongside stalwarts such as MPP RDBMS, columnar, and dimensional databases. The chance that your existing ILM tools work out of the box with all of these new platforms is slim. Also, to the extent that you’re doing Big Data in a public cloud, you may be required to use whatever ILM features (strong, weak, or middling) are native to the provider’s environment. To mitigate your risks in this heterogeneous new world and to maintain strong confidence in your core data, you’ll need to examine new Big Data platforms closely to ensure they have ILM features (data security, governance, archiving, retention) that are commensurate with the roles for which you plan to deploy them.
• New Big Data subject domains: Big data has not altered enterprise requirements for data governance hubs where you store and manage official systems of record (customers, finances, HR). This is the role of your established EDWs, most of which run on traditional RDBMS-based data platforms and incorporate strong ILM. But these system-of-record data domains may have very little presence on your newer Big Data platforms, many of which focus instead on handling fresh data from social, event, sensor, clickstream, geospatial, and other new sources. These new data domains are often “ephemeral” in the sense that there may be no need to retain the bulk of the data in permanent systems of record.
• New Big Data scales: Big data does not mean that your new platforms support infinite volume, instantaneous velocity, or unbounded varieties. The sheer magnitudes of new data will make it impossible to store most of it anywhere, given the stubborn technological and economic constraints we all face. This reality will deepen Big Data managers’ focus on tweaking multitemperature storage management, archiving, and retention policies. As you scale your Big Data environment, you will need to ensure that ILM requirements can be supported within your current constraints of volume (storage capacity), velocity (bandwidth, processor, and memory speeds), and variety (metadata depth).

Q4. What is the best way to get started with a Big Data project?

James Kobielus: Scope the project well to deliver near-term business benefit. Use the nucleus project as the foundation for accelerating future Big Data projects. Recognize that the database technology you use in that initial project is just one of many storage layers that will need to play together in the hybridized, multi-tier Big Data architecture of your future.
In the larger evolutionary perspective, Big Data is evolving into a hybridized paradigm under which Hadoop, massively parallel processing (MPP) enterprise data warehouses (EDW), in-memory columnar, stream computing, NoSQL, document databases, and other approaches support extreme analytics in the cloud.
Hybrid architectures address the heterogeneous reality of Big Data environments and respond to the need to incorporate both established and new analytic database approaches into a common architecture. The fundamental principle of hybrid architectures is that each constituent Big Data platform is fit-for-purpose to the role for which it’s best suited. These Big Data deployment roles may include any or all of the following:
• Data acquisition
• Collection
• Transformation
• Movement
• Cleansing
• Staging
• Sandboxing
• Modeling
• Governance
• Access
• Delivery
• Interactive exploration
• Archiving
In any role, a fit-for-purpose Big Data platform often supports specific data sources, workloads, applications, and users.
Hybrid is the future of Big Data because users increasingly realize that no single type of analytic platform is always best for all requirements. Also, platform churn—plus the heterogeneity it usually produces—will make hybrid architectures more common in Big Data deployments. The inexorable trend is toward hybrid environments that address the following enterprise Big Data imperatives:
• Extreme scalability and speed: The emerging hybrid Big Data platform will support scale-out, shared-nothing massively parallel processing, optimized appliances, optimized storage, dynamic query optimization, and mixed workload management.
• Extreme agility and elasticity: The hybrid Big Data environment will persist data in diverse physical and logical formats across a virtualized cloud of interconnected memory and disk that can be elastically scaled up and out at a moment’s notice.
• Extreme affordability and manageability: The hybrid environment will incorporate flexible packaging/pricing, including licensed software, modular appliances, and subscription-based cloud approaches.
Hybrid architectures are already widespread in real-world Big Data deployments. The most typical are the three-tier (also called “hub-and-spoke”) architectures. These environments may have, for example, Hadoop (e.g., IBM InfoSphere BigInsights) in the data acquisition, collection, staging, preprocessing, and transformation layer; relational-based MPP EDWs (e.g., IBM PureData System for Analytics) in the hub/governance layer; and in-memory databases (e.g., IBM Cognos TM1) in the access and interaction layer.
The complexity of hybrid architectures depends on the range of sources, workloads, and applications you’re trying to support. In the back-end staging tier, you might need different preprocessing clusters for each of the disparate sources: structured, semi-structured, and unstructured. In the hub tier, you may need disparate clusters configured with different underlying data platforms (RDBMS, stream computing, HDFS, HBase, Cassandra, NoSQL, and so on) and corresponding metadata, governance, and in-database execution components. And in the front-end access tier, you might require various combinations of in-memory, columnar, OLAP, dimensionless, and other database technologies to deliver the requisite performance on diverse analytic applications, ranging from operational BI to advanced analytics and complex event processing.
Ensuring that hybrid Big Data architectures stay cost-effective demands the following multipronged approach to optimization of distributed storage:
• Apply fit-for-purpose databases to particular Big Data use cases: Hybrid architectures spring from the principle that no single data storage, persistence, or structuring approach is optimal for all deployment roles and workloads. For example, no matter how well-designed the dimensional data model is within an OLAP environment, users eventually outgrow these constraints and demand more flexible decision support. Other database architectures—such as columnar, in-memory, key-value, graph, and inverted indexing—may be more appropriate for such applications, but not generic enough to address other broader deployment roles.
• Align data models with underlying structures and applications: Hybrid architectures leverage the principle that no fixed Big Data modeling approach—physical and logical—can do justice to the ever-shifting mix of queries, loads, and other operations. As you implement hybrid Big Data architectures, make sure you adopt tools that let you focus on logical data models, while the infrastructure automatically reconfigures the underlying Big Data physical data models, schemas, joins, partitions, indexes, and other artifacts for optimal query and data load performance.
• Intelligently compress and manage the data: Hybrid architectures should allow you to apply intelligent compression to Big Data sets to reduce their footprint and make optimal use of storage resources. Also, some physical data models are more inherently compact than others (e.g., tokenized and columnar storage are more efficient than row-based storage), just as some logical data models are more storage-efficient (e.g., third-normal-form relational is typically more compact than large denormalized tables stored in a dimensional star schema).
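To illustrate why tokenized, columnar layouts tend to be more compact than row-based ones, here is a minimal sketch of dictionary encoding applied column by column. The records, function names, and the character-count size measure are assumptions for this example; real columnar engines combine this with further techniques and measure storage in bytes on disk.

```python
def dictionary_encode(column):
    """Replace each value with a small integer token plus a lookup table."""
    dictionary = []
    index = {}
    tokens = []
    for value in column:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        tokens.append(index[value])
    return dictionary, tokens

def dictionary_decode(dictionary, tokens):
    """Recover the original column from its tokens."""
    return [dictionary[t] for t in tokens]

# Hypothetical row-oriented records: (date, country, channel).
rows = [
    ("2014-05-15", "United States", "web"),
    ("2014-05-15", "United States", "call-center"),
    ("2014-05-15", "Germany", "web"),
    ("2014-05-16", "United States", "web"),
]

# Columnar layout: one list per attribute, so repeated values sit together.
columns = [list(col) for col in zip(*rows)]
encoded = [dictionary_encode(col) for col in columns]

# Crude size comparison: characters stored as raw strings vs. in the dictionaries.
raw_chars = sum(len(value) for row in rows for value in row)
dict_chars = sum(len(value) for dictionary, _ in encoded for value in dictionary)
```

Each distinct string is stored once per column and every occurrence becomes a small integer, so the saving grows with the number of rows and the amount of repetition within a column.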

Q5. What kind of expertise do you need to run a Big Data project in the enterprise?

James Kobielus: Data-driven organizations succeed when all personnel, both technical and business, have a common understanding of the core big-data skills, tools and best practices. You need all the skills of data management, integration, modeling, and so forth that you already have running your data marts, warehouses, OLAP cubes, and the like.

Just as important, you need a team of dedicated data scientists to develop and tune the core intellectual property–statistical, predictive, and other analytic models–that drive your Big Data applications. You don’t often think of data scientists as “programmers,” per se, but they are the pivotal application developers in the age of Big Data.
The key practical difference between data scientists and other programmers, including those who develop orchestration logic, is that the former specify logic grounded in non-deterministic patterns (i.e., statistical models derived from propensities revealed inductively from historical data), whereas the latter specify logic whose basis is predetermined (i.e., if/then/else, case-based and other rules, procedural and/or declarative, that were deduced from functional analysis of some problem domain).
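The distinction can be sketched in a few lines of Python. In the first function the decision logic is written down in advance; in the second, it is estimated from historical outcomes and would change if the data changed. The customer attributes, segments, threshold, and history below are all hypothetical.

```python
from collections import defaultdict

def rule_based_offer(customer):
    """Deterministic logic: the rule is fixed in advance by an analyst."""
    return customer["tenure_years"] >= 2 and customer["monthly_spend"] >= 50

def fit_propensity(history):
    """Inductive logic: estimate acceptance rates per segment from past outcomes."""
    accepted = defaultdict(int)
    total = defaultdict(int)
    for segment, did_accept in history:
        total[segment] += 1
        accepted[segment] += int(did_accept)
    return {segment: accepted[segment] / total[segment] for segment in total}

def model_based_offer(propensity, segment, threshold=0.5):
    """Make the offer when the learned acceptance rate clears a threshold."""
    return propensity.get(segment, 0.0) >= threshold

# Hypothetical history of (segment, accepted_offer) observations.
history = [("student", True), ("student", True), ("student", False),
           ("retiree", False), ("retiree", False), ("retiree", True)]
propensity = fit_propensity(history)

offer_by_rule = rule_based_offer({"tenure_years": 3, "monthly_spend": 60})
offer_by_model = model_based_offer(propensity, "student")
```

Both functions end in a yes/no decision, which is part of why the line between the two kinds of programmer is blurry in practice.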
The practical distinctions between data scientists and other programmers have always been a bit fuzzy, and they’re growing even blurrier over time. For starters, even a cursory glance at programming paradigms shows that core analytic functions—data handling and calculation—have always been the heart of programming. For another, many business applications leverage statistical analyses and other data-science models to drive transactional and other functions.
Furthermore, data scientists and other developers use a common set of programming languages. Of course, data scientists differ from most other types of programmers in various ways that go beyond the deterministic vs. non-deterministic logic distinction mentioned above:
• Data scientists have adopted analytic domain-specific languages such as R, SAS, SPSS and Matlab.
• Data scientists specialize in business problems that are best addressed with statistical analysis.
• Data scientists are often more aligned with specific business-application domains—such as marketing campaign optimization and financial risk mitigation—than the traditional programmer.
These distinctions primarily apply to what you might call the “classic” data scientist, such as multivariate statistical analysts and data mining professionals. But the notion of a “classic” data scientist might be rapidly fading away in the big-data era as more traditional programmers need some grounding in statistical modeling in order to do their jobs effectively—or, at the very least, need to collaborate productively with statistical modelers.

Q6. How do you select the “right” software and hardware for a Big Data project?

James Kobielus: It’s best to choose the right appliance–a pre-optimized, pre-configured hardware/software appliance–for the specific workloads and applications of your Big Data project. At the same time, you should make sure that the chosen appliances can figure into the eventual cloud architecture toward which your Big Data infrastructure is likely to evolve.

An appliance is a workload-optimized system. Its hardware/software nodes are the key building block for every Big Data cloud. In other words, appliances, also known as expert integrated systems, are the bedrock of all three “Vs” of the Big Data universe, regardless of whether your specific high-level topology is centralized, hub-and-spoke, federated or some other configuration, and regardless of whether you’ve deployed all of these appliance nodes on premises or are outsourcing some or all of it to a cloud/SaaS provider.
Within the coming 2-3 years, expert integrated systems will become a dominant approach for enterprises to put Hadoop and other emerging Big Data approaches into production. Already, appliances are the principal approach in the core Big Data platform market: enterprise data warehousing solutions that implement massively parallel processing, such as those powered by IBM PureData Systems for Analytics.
The core categories of workloads that users need their optimized Big Data appliances to support within cloud environments are as follows:
• Big-data storage: A Big Data appliance can be a core building block in an enterprise data storage architecture. Chief uses may be for archiving, governance and replication, as well as for discovering, acquiring, aggregating and governing multistructured content. The appliance should provide the modularity, scalability and efficiency of high-performance applications for these key data consolidation functions. Typically, it would support these functions through integration with a high-capacity storage area network architecture such as IBM provides.
• Big-data processing: A Big Data appliance should support massively parallel execution of advanced data processing, manipulation, analysis and access functions. It should support the full range of advanced analytics, as well as some functions traditionally associated with EDWs, BI and OLAP. It should have all the metadata, models and other services needed to handle such core analytics functions as query, calculation, data loading and data integration. And it should handle a subset of these functions and interface through connectors to analytic platforms such as IBM PureData Systems.
• Big-data development: A Big Data appliance should support Big Data modeling, mining, exploration and analysis. The appliance should provide a scalable “sandbox” with tools that allow data scientists, predictive modelers and business analysts to interactively and collaboratively explore rich information sets. It should incorporate a high-performance analytic runtime platform where these teams can aggregate and prepare data sets, tweak segmentations and decision trees, and iterate through statistical models as they look for deep statistical patterns. It should furnish data scientists with massively parallel CPU, memory, storage and I/O capacity for tackling analytics workloads of growing complexity. And it should enable elastic scaling of sandboxes from traditional statistical analysis, data mining and predictive modeling, into new frontiers of Hadoop/MapReduce, R, geospatial, matrix manipulation, natural language processing, sentiment analysis and other resource-intensive types of Big Data processing.
A big-data appliance should not be a stand-alone server, but, instead, a repeatable, modular building block that, when deployed in larger cloud configurations, can be rapidly optimized to new workloads as they come online. Many appliances will be configured to support mixes of two or all three of these types of workloads within specific cloud nodes or specific clusters. Some will handle low latency and batch jobs with equal agility in your cloud. And still others will be entirely specialized to a particular function that they perform with lightning speed and elastic scalability. The best appliances, like IBM Netezza, facilitate flexible re-optimization by streamlining the myriad deployment, configuration and tuning tasks across larger, more complex deployments.
You may not be able to forecast with fine-grained precision the mix of workloads you’ll need to run on your big-data cloud two years from next Tuesday. But investing in the right family of big-data appliance building blocks should give you confidence that, when the day comes, you’ll have the foundation in place to provision resources rapidly and efficiently.

Q7. Is Hadoop replacing the role of OLAP (online analytical processing) in preparing data to answer specific questions?

James Kobielus: No. Hadoop is powering unstructured ETL, queryable archiving, data-science exploratory sandboxing, and other use cases. OLAP–in terms of traditional cubing–remains key to front-end query acceleration in decision support applications and data marts. In support of those front-end applications, OLAP is facing competition from other approaches, especially in-memory, columnar databases (such as the BLU Acceleration feature of IBM DB2 10.5).

Q8. Could you give some examples of successful Big Data projects?

James Kobielus: Examples are here.

James Kobielus is IBM Senior Program Director, Product Marketing, Big Data Analytics solutions. He is an industry veteran, a popular speaker and social media participant, and a thought leader in big data, Hadoop, enterprise data warehousing, advanced analytics, business intelligence, data management, and next best action technologies.

Related Posts

The other side of Big Data. Interview with Michael L. Brodie.
ODBMS Industry Watch, April 26, 2014

What are the challenges for modern Data Centers? Interview with David Gorbet.
ODBMS Industry Watch, March 25, 2014

Setting up a Big Data project. Interview with Cynthia M. Saracco.
ODBMS Industry Watch, January 27, 2014



Follow ODBMS.org on Twitter: @odbmsorg

Setting up a Big Data project. Interview with Cynthia M. Saracco. http://www.odbms.org/blog/2014/01/setting-up-a-big-data-project-interview-with-cynthia-m-saracco/ http://www.odbms.org/blog/2014/01/setting-up-a-big-data-project-interview-with-cynthia-m-saracco/#comments Mon, 27 Jan 2014 07:39:46 +0000 http://www.odbms.org/blog/?p=2929

“Begin with a clear definition of the project’s business objectives and timeline, and be sure that you have appropriate executive sponsorship. The key stakeholders need to agree on a minimal set of compelling results that will impact your business; furthermore, technical leaders need to buy into the overall feasibility of the project and bring design and implementation ideas to the table.”–Cynthia M. Saracco.

How easy is it to set up a Big Data project? On this topic I have interviewed Cynthia M. Saracco, senior solutions architect at IBM’s Silicon Valley Laboratory. Cynthia is an expert in Big Data, analytics, and emerging technologies. She has more than 25 years of software industry experience.


Q1. What is the best way to get started with a Big Data project?

Cynthia M. Saracco: Begin with a clear definition of the project’s business objectives and timeline, and be sure that you have appropriate executive sponsorship.
The key stakeholders need to agree on a minimal set of compelling results that will impact your business; furthermore, technical leaders need to buy into the overall feasibility of the project and bring design and implementation ideas to the table. At that point, you can evaluate your technical options for the best fit. Those options might include Hadoop, a relational DBMS, a stream processing engine, analytic tools, visualization tools, and other types of software. Often, a combination of several types of software is needed for a single Big Data project. Keep in mind that every technology has its strengths and weaknesses, so be sure you understand enough about the technologies you’re inclined to use before moving forward.

If you decide that Hadoop should be part of your project, give serious consideration to using a distribution that packages commonly needed components into a single bundle so you can minimize the time required to install and configure your environment. It’s also helpful to keep in mind the existing skills of your staff and seek out offerings that enable them to be productive quickly.
Tools, applications, and support for common scripting and query languages all contribute to improved productivity. If your business application needs to integrate with existing analytical tools, DBMSs, or other software, look for offerings that have some built-in support for that as well.

Finally, because Big Data projects can get pretty complex, I often find it helpful to segment the work into broad categories and then drill down into each to create a solid plan. Examples of common technical tasks include collecting data (perhaps from various sources), preparing the data for analysis (which can range from simple format conversions to more sophisticated data cleansing and enrichment operations), analyzing the data, and rendering or sharing the results of that analysis with business users or downstream applications. Consider scalability and performance needs in addition to your functional requirements.
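The four broad task categories above (collect, prepare, analyze, share) can be sketched as a tiny Python pipeline. This is only an illustrative skeleton: the log lines, field layout and function names are hypothetical, and a real project would replace each stage with its own tooling.

```python
import json
from collections import Counter


def collect():
    """Collect raw records; here a hard-coded, hypothetical log sample."""
    return [
        "2014-01-27 ERROR disk",
        "2014-01-27 info ok ",
        "",
        "2014-01-28 ERROR net",
    ]


def prepare(raw_lines):
    """Cleanse and normalize: skip malformed lines, upper-case the level."""
    rows = []
    for line in raw_lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # drop blank or incomplete records
        rows.append({"date": parts[0], "level": parts[1].upper(), "message": parts[2]})
    return rows


def analyze(rows):
    """Analyze: count records per severity level."""
    return Counter(row["level"] for row in rows)


def share(counts):
    """Share: render the result as JSON for a downstream application."""
    return json.dumps(dict(sorted(counts.items())))


report = share(analyze(prepare(collect())))
```

Keeping each stage behind its own function makes it easier to plan, scale and test the stages separately.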

Q2. What are the most common problems and challenges encountered in Big Data projects?

Cynthia M. Saracco: Lack of appropriately scoped objectives and lack of required skills are two common problems. Regarding objectives, you need to find an appropriate use case that will impact your business and tailor your project’s technical work to meet the business goals of that project efficiently. Big Data is an exciting, rapidly evolving technology area, and it’s easy to get sidetracked experimenting with technical features that may not be essential to solving your business problem. While such experimentation can be fun and educational, it can also result in project delays as well as deliverables that are off target. In addition, without well-scoped business objectives, the technical staff may end up chasing a moving target.

Regarding skills, there’s high demand for data scientists, architects, and developers experienced with Big Data projects. So you may need to decide if you want to engage a service provider to supplement in-house skills or if you want to focus on growing (or acquiring) new in-house skills. Fortunately, there are a number of Big Data training options available today that didn’t exist several years ago. Online courses, conferences, workshops, MeetUps, and self-study tutorials can help motivated technical professionals expand their skill set. However, from a project management point of view, organizations need to be realistic about the time required for staff to learn new Big Data technologies. Giving someone a few days or weeks to master Hadoop and its complementary offerings isn’t very realistic. But really, I see the skills challenge as a point-in-time issue. Many people recognize the demand for Big Data skills and are actively expanding their skills, so supply will grow.

Q3. Do you have any metrics to define how good the “value” is that can be derived by analyzing Big Data?

Cynthia M. Saracco: Most organizations want to focus on their return on investment (ROI). Even if your Big Data solution uses open source software, there are still expenses involved for designing, developing, deploying, and maintaining your solution. So what did your business gain from that investment?
The answer to that question is going to be specific to your application and your business. For example, if a telecommunications firm is able to reduce customer churn by 10% as a result of a Big Data project, what’s that worth? If an organization can improve the effectiveness of an email marketing campaign by 20%, what’s that worth? If an organization can respond to business requests twice as quickly, what’s that worth? Many clients have these kinds of metrics in mind as they seek to quantify the value they have derived — or hope to derive — from their investment in a Big Data project.
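As a rough illustration of the "what's that worth?" arithmetic, here is a small Python sketch for the churn example. Every figure in it (customer count, churn rate, revenue per customer, project cost) is a made-up assumption, and the formula is the textbook ROI ratio, not a method taken from the interview.

```python
def churn_reduction_value(customers, annual_churn_rate, relative_reduction,
                          annual_revenue_per_customer):
    """Revenue retained when churn drops by `relative_reduction` (e.g. 0.10)."""
    customers_lost_before = customers * annual_churn_rate
    customers_retained = customers_lost_before * relative_reduction
    return customers_retained * annual_revenue_per_customer


def roi(gain, cost):
    """Return on investment as a ratio: (gain - cost) / cost."""
    return (gain - cost) / cost


# Hypothetical telecom firm: 1M customers, 20% annual churn, churn cut by 10%,
# 500 currency units of annual revenue per customer, project cost of 4M.
gain = churn_reduction_value(1_000_000, 0.20, 0.10, 500.0)
project_roi = roi(gain, 4_000_000)
```

Under these assumptions the project retains roughly 20,000 customers a year, so the gain is about 10 million against a cost of 4 million.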

Q4. Is Hadoop replacing the role of OLAP (online analytical processing) in preparing data to answer specific questions?

Cynthia M. Saracco: More often, I’ve seen Hadoop used to augment or extend traditional forms of analytical processing, such as OLAP, rather than completely replace them. For example, Hadoop is often deployed to bring large volumes of new types of information into the analytical mix — information that might have traditionally been ignored or discarded. Log data, sensor data, and social data are just a few examples of that. And yes, preparing that data for analysis is certainly one of the tasks for which Hadoop is used.

Q4. IBM is offering BigInsights and Big SQL. What are they?

Cynthia M. Saracco: InfoSphere BigInsights is IBM’s Hadoop-based platform for analyzing and managing Big Data. It includes Hadoop, a number of complementary open source projects (such as HBase, Hive, ZooKeeper, Flume, Pig, and others) and a number of IBM-specific technologies designed to add value.

Big SQL is part of BigInsights. It’s IBM’s SQL interface to data stored in BigInsights. Users can create tables, query data, load data from various sources, and perform other functions. For a quick introduction to Big SQL, read this article.

Q5. How does it compare to RDBMS technology? When’s it most useful?

Cynthia M. Saracco: Big SQL provides standard SQL-based query access to data managed by BigInsights. Query support includes joins, unions, sub-queries, windowed aggregates, and other popular capabilities. Because Big SQL is designed to exploit the Hadoop ecosystem, it introduces Hadoop-specific language extensions for certain SQL statements.
For example, Big SQL supports Hive and HBase for storage management, so a Big SQL CREATE TABLE statement might include clauses related to data formats, field delimiters, SerDes (serializers/deserializers), column mappings, column families, etc. The article I mentioned earlier has some examples of these, and the product InfoCenter has further details.

In many ways, Big SQL can serve as an easy on-ramp to Hadoop for technical professionals who have a relational DBMS background. Big SQL is good for organizations that want to exploit in-house SQL skills to work with data managed by BigInsights. Because Big SQL supports JDBC and ODBC, many traditional SQL-based tools can work readily with Big SQL tables, which can also make Big Data easier to use by a broader user community.

However, Big SQL doesn’t turn Hadoop — or BigInsights — into a relational DBMS. Commercial relational DBMSs come with built-in, ACID-based transaction management services and model data largely in tabular formats. They support granular levels of security via SQL GRANT and REVOKE statements. In addition, some RDBMSs support 3GL applications developed in “legacy” programming languages such as COBOL. These are some examples of capabilities that aren’t part of Big SQL.

Q6. What are some of its current limitations?

Cynthia M. Saracco: The current level of Big SQL included in BigInsights V2.1.0.1 enables users to create tables but not views.
Date/time data is supported through a full TIMESTAMP data type, and some common SQL operations supported by relational DBMSs aren’t available or have specific restrictions.
Examples include INSERT, UPDATE, DELETE, GRANT, and REVOKE statements. For more details on what’s currently supported in Big SQL, skim through the InfoCenter.

Q7. How does BigInsights differ from / add value to open source Hadoop?

Cynthia M. Saracco: As I mentioned earlier, BigInsights includes a number of IBM-specific technologies designed to add value to the open source technologies included with the product. Very briefly, these include:
– A Web console with administrative facilities, a Web application catalog, customizable dashboards, and other features.
– A text analytic engine and library that extracts phone numbers, names, URLs, addresses, and other popular business artifacts from messages, documents, and other forms of textual data.
– Big SQL, which I mentioned earlier.
– BigSheets, a spreadsheet-style tool for business analysts.
– Web-accessible sample applications for importing and exporting data, collecting data from social media sites, executing ad hoc queries, and monitoring the cluster. In addition, application accelerators (tool kits with dozens of pre-built software artifacts) are available for those working with social data and machine data.
– Eclipse tooling to speed development and testing of BigInsights applications, new text extractors, BigSheets functions, SQL-based applications, Java applications, and more.
– An integrated installation tool that installs and configures all selected components across the cluster and performs a system-wide health check.
– Connectivity to popular enterprise software offerings, including IBM and non-IBM RDBMSs.
– Platform enhancements focusing on performance, security, and availability. These include options to use an alternative, POSIX-compliant distributed file system (GPFS-FPO) and an alternative MapReduce layer (Adaptive MapReduce) that features Platform Symphony’s advanced job scheduler, workload manager, and other capabilities.

You might wonder what practical benefits these kinds of capabilities bring. While that varies according to each organization’s usage patterns, one industry analyst study concluded that BigInsights lowers total cost of ownership (TCO) by an average of 28% over a three-year period compared with an open source-only implementation.

Finally, a number of IBM and partner offerings support BigInsights, which is something that’s important to organizations that want to integrate a Hadoop-based environment into their broader IT infrastructure. Some examples of IBM products that support BigInsights include DataStage, Cognos Business Intelligence, Data Explorer, and InfoSphere Streams.

Q8. Could you give some examples of successful Big Data projects?

Cynthia M. Saracco: I’ll summarize a few that have been publicly discussed so you can follow links I provide for more details. An energy firm launched a Big Data project to analyze large volumes of data that could help it improve the placement of new wind turbines and significantly reduce response time to business user requests.
A financial services firm is using Big Data to process large volumes of text data in minutes and offer its clients more comprehensive information based on both in-house and Internet-based data.
An online marketing firm is using Big Data to improve the performance of its clients’ email campaigns.
And other firms are using Big Data to detect fraud, assess risk, cross-sell products and services, prevent or minimize network outages, and so on. You can find a collection of videos about Big Data projects undertaken by various organizations; many of these videos feature users speaking directly about their Big Data experiences and the results of their projects.
And a recent report, Analytics: The real-world use of big data, contains further examples, based on the results of a survey of more than 1100 businesses that the Said Business School at the University of Oxford conducted with IBM’s Institute for Business Value.

Qx. Anything else to add?

Cynthia M. Saracco: Hadoop isn’t the only technology relevant to managing and analyzing Big Data, and IBM’s Big Data software portfolio certainly includes more than BigInsights (its Hadoop-based offering). But if you’re a technologist who wants to learn more about Hadoop, your best bet is to work with the software. You’ll find a number of free online courses in the public domain, such as those at Big Data University. And IBM offers a free copy of its Quick Start Edition of BigInsights as a VMware image or an installable image to help you get started with minimal effort.

Cynthia M. Saracco is a senior solutions architect at IBM’s Silicon Valley Laboratory, specializing in Big Data, analytics, and emerging technologies. She has more than 25 years of software industry experience, has written three books and more than 70 technical papers, and holds six patents.
Related Posts

Big Data: Three questions to Pivotal. ODBMS Industry Watch, January 20, 2014.

Big Data: Three questions to InterSystems. ODBMS Industry Watch, January 13, 2014.

Operational Database Management Systems. Interview with Nick Heudecker. ODBMS Industry Watch, December 16, 2013.

On Big Data and Hadoop. Interview with Paul C. Zikopoulos. ODBMS Industry Watch, June 10, 2013.


What’s the big deal about Big SQL? by Cynthia M. Saracco, Senior Software Engineer, IBM, and Uttam Jain, Software Architect, IBM.

ODBMS.org: Free resources on Big Data and Analytical Data Platforms:
| Blog Posts | Free Software| Articles| Lecture Notes | PhD and Master Thesis|


    On Big Data and Hadoop. Interview with Paul C. Zikopoulos. http://www.odbms.org/blog/2013/06/on-big-data-and-hadoop-interview-with-paul-c-zikopoulos/ http://www.odbms.org/blog/2013/06/on-big-data-and-hadoop-interview-with-paul-c-zikopoulos/#comments Mon, 10 Jun 2013 06:35:23 +0000 http://www.odbms.org/blog/?p=2335

    “We’re not all LinkedIns and Facebooks; we don’t have budgets to hire 1000s of new hires with these skills, and what’s more we’ve invested in existing skills and people today. So to democratize Big Data, you need it to be consumable and integrated. These will flatten the time to value for Hadoop” — Paul C. Zikopoulos.

    I have interviewed Paul C. Zikopoulos, Director of Technical Professionals for IBM Software Group’s Information Management division. The topic: Apache Hadoop and Big Data, State of the Union in 2013 and Vision for the future.


    Q1. What do you think is still needed for big data analytics to be really useful for the enterprise?

    Paul C. Zikopoulos: Integration and Consumability. We’re not all LinkedIns and Facebooks; we don’t have budgets to hire 1000s of new hires with these skills, and what’s more we’ve invested in existing skills and people today.
    So to democratize Big Data, you need it to be consumable and integrated.
    These will flatten the time to value for Hadoop. IBM is working really hard in these areas. I could go into other areas, but this is key.

    Q2. Hadoop is still quite new for many enterprises, and different enterprises are at different stages in their Hadoop journey.
    When you speak with your customers what are the typical use cases and requirements they have?

    Paul C. Zikopoulos: No matter what industry I’m working with, 90% of the Big Data use cases always have 2 common denominators: Whole Population Analytics to break free of traditional capacity constrained samples and analytics for data at-rest moving to in-motion.
    So if you think about churn prediction, next best action, next best offer, fraud prediction, condition monitor, out of tolerance quality predictors, and more – it’s all going to rely on using more data (could be volume, could be variety, and often both) to build better models.
    If you’re looking for specific use cases by industry, here’s a bunch of them that we’ve worked with clients on at IBM.

    Q3. How do you categorize the various stages of the Hadoop usage in the enterprises?

    Paul C. Zikopoulos: The IBM Institute for Business Value did a joint study with Said Business School (University of Oxford). They talked to a lot of Big Data folks and found that 28% were in the pilot phase, 24% hadn’t started anything, and 47% were planning. After going through their research, they broke the answers into four stages: Educate / Explore / Engage / Execute.
    So I’ll detail those four stages, but you can get the entire study here.

    Educate: Building a base of knowledge (24 percent of respondents).
    In the Educate stage, the primary focus is on awareness and knowledge development.
    Almost 25 percent of respondents indicated they are not yet using big data within their organizations. While some remain relatively unaware of the topic of big data, our interviews suggest that most organizations in this stage are studying the potential benefits of big data technologies and analytics, and trying to better understand how big data can help address important business opportunities in their own industries or markets.
    Within these organizations, it is mainly individuals doing the knowledge gathering as opposed to formal work groups, and their learnings are not yet being used by the organization. As a result, the potential for big data has not yet been fully understood and embraced by the business executives.

    Explore: Defining the business case and roadmap (47 percent).
    The focus of the Explore stage is to develop an organization’s roadmap for big data development.
    Almost half of respondents reported formal, ongoing discussions within their organizations about how to use big data to solve important business challenges.
    Key objectives of these organizations include developing a quantifiable business case and creating a big data blueprint.
    This strategy and roadmap takes into consideration existing data, technology and skills, and then outlines where to start and how to develop a plan aligned with the organization’s business strategy.

    Engage: Embracing big data (22 percent).
    In the Engage stage, organizations begin to prove the business value of big data, as well as perform an assessment of their technologies and skills.
    More than one in five respondent organizations is currently developing POCs to validate the requirements associated with implementing big data initiatives, as well as to articulate the expected returns. Organizations in this group are working – within a defined, limited scope – to understand and test the technologies and skills required to capitalize on new sources of data.

    Execute: Implementing big data at scale (6 percent).
    In the Execute stage, big data and analytics capabilities are more widely operationalized and implemented within the organization. However, only 6 percent of respondents reported that their organizations have implemented two or more big data solutions at scale – the threshold for advancing to this stage. The small number of organizations in the Execute stage is consistent with the implementations we see in the marketplace. Importantly, these leading organizations are leveraging big data to transform their businesses and thus are deriving the greatest value from their information assets.
    With the rate of enterprise big data adoption accelerating rapidly – as evidenced by 22 percent of respondents in the Engage stage, with either POCs or active pilots underway – we expect the percentage of organizations at this stage to more than double over the next year. Even now, while only 6% are executing, about 25% of respondents in this study are ‘piloting’ initiatives.

    Q4. Could you give us some examples of how you get (Big) Data Insights?

    Paul C. Zikopoulos: IBM has a non-forked version of Hadoop called BigInsights.
    When it comes to open source, it’s really hard to look past IBM’s achievements. Lucene, Apache Derby, Apache Jakarta, Apache Geronimo, Eclipse and so much more – so it shouldn’t surprise anyone that IBM is squarely in Hadoop’s corner.
    Our strategy here is Embrace and Extend. We will embrace the open source Hadoop community. We are a vibrant part of it (in the latest Hadoop patch as of the time of this interview, the most fixes came from IBM; we have a number of contributions to HBase, and more). IBM also has a long history of understanding enterprise concerns; that’s the extend part.
    Some of the extensions work just fine with open source. For example, we provide a rich management tool and a quick installer, and we concentrate open ports into a single one to make it easier for your Hadoop cluster to pass an audit.
    Some of our extensions overlay Hadoop. For example, our Adaptive MapReduce can deliver a 30% performance boost by using its algorithms to reduce the overhead of MapReduce task startup.
    We have enhanced schedulers, announced the option to use GPFS as the file system which provides a lot of benefits, and more. But these are optional. If you use BigInsights you are using a non-forked Hadoop distro.
    Some of our extensions are ’round-trip-able’ – if you use them, you can walk back to pure Open Source Hadoop at any time, and some aren’t. If you want to get our fast-to-install, non-extended version of Hadoop for free, you can download InfoSphere BigInsights Basic Edition here.

    Q5. What are the main technical challenges for big data analytics when data is in motion rather than at rest?

    Paul C. Zikopoulos: Well the challenge is to ask yourselves how do I get those analytics artifacts that I learn at rest either in Hadoop or the EDW and get them to real time; I call this Nowcasting instead of Forecasting.
    In order to do that, with agility and speed, you’re going to want a platform that’s designed for both in-motion and at-rest analytics.
    I’m not seeing that in the marketplace today. In fact, I’m not seeing a focus on in-motion analytics.
    When I refer to in-motion, I refer to the Velocity attribute of Big Data (people often talk about the Big Vs in Big Data, so that’s the one for in-motion). Velocity IS the game changer.
    It’s not just how fast data is produced or changes, BUT the speed at which it must be understood, acted upon, turned into something useful. So to me the main technical challenge in getting to in-motion from at-rest is the fact that I’m not really seeing that kind of true integration and it’s something we squarely hit on in the IBM Big Data platform.
    Let me share an example: if you were to build some text analytical function at rest in Hadoop, perhaps an email phrase that’s highly correlated with a customer churn event, you can SEAMLESSLY take that artifact and deploy it on InfoSphere Streams (our Big Data Velocity engine) without any work at all; you just deploy the compiled AOG file. Wow! Platform.
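The "learn at rest, apply in motion" idea can be sketched in a few lines of plain Python. This is a conceptual toy only: it does not use InfoSphere Streams, AOG files or any IBM API, and the archive, candidate phrases, function names and the 0.9 cut-off are all hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical archive at rest: (email text, did the customer later churn?)
ARCHIVE = [
    ("please cancel my account today", True),
    ("i want to cancel my account", True),
    ("thank you for the quick fix", False),
    ("thank you, but cancel my account", True),
]
CANDIDATE_PHRASES = ["cancel my account", "thank you"]


def learn_phrases(archive, candidates, min_precision=0.9) -> List[str]:
    """At rest: keep phrases whose emails almost always preceded churn."""
    learned = []
    for phrase in candidates:
        outcomes = [churned for text, churned in archive if phrase in text]
        if outcomes and sum(outcomes) / len(outcomes) >= min_precision:
            learned.append(phrase)
    return learned


def flag_stream(messages: Iterable[str], phrases) -> Iterator[Tuple[str, bool]]:
    """In motion: score each message as it arrives, without storing it."""
    for text in messages:
        yield text, any(phrase in text.lower() for phrase in phrases)


phrases = learn_phrases(ARCHIVE, CANDIDATE_PHRASES)
flags = list(flag_stream(["Cancel my account now", "all good"], phrases))
```

The point of the sketch is that the artifact produced by the batch step (`phrases`) is handed unchanged to the streaming step.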
    The other challenge is just the volume and speed at which you have to process events. IBM invented our streaming products with the US government – and they can scale. For example, one of our clients analyzes and correlates over 5M market messages a second to execute algorithmic option trades with average latency of 50 microseconds.
    The point is that this is not CEP; this is not 1 or 2 servers with 10-20,000 events a second. CEP can be a style or a technology.
    You need to be able to do the style, but you need a technology platform too. If you asked me what is one of the biggest things IBM has done in the Big Data space, it is flattening the technical challenge to perform Big Data analytics on data in motion.

    Q6. In your opinion, is there a technology which is best suited to build a Big Data Analytics Data Platform? If yes, which one?

    Paul C. Zikopoulos: Well you say the word platform, and that’s going to imply a number of technologies. Right?
    When I get asked this question, I refer to my Big Data Platform Manifesto; this is what you’re going to need to form a Big Data platform. Many people think big data is about Hadoop technology. It is and it isn’t. It’s about a lot more than Hadoop.
    One of the key requirements is to understand and navigate federated sources of big data – to discover data in place.
    New technology has emerged that discovers, indexes, searches, and navigates diverse sources of big data. Of course big data is also about Hadoop. Hadoop is a collection of open source capabilities.
    Two of the most prominent ones are Hadoop Distributed File System (HDFS) for storing a variety of information, and MapReduce – a parallel processing engine.
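    To make the MapReduce idea concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps applied to a word count. It is plain Python, not the Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are illustrative assumptions, and a real cluster would run the map and reduce calls in parallel over blocks stored in HDFS.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence (hypothetical driver)."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data at rest", "big data in motion"]))
```

    Because each map call sees only one document and each reduce call sees only one key, the two phases can be spread across many machines without coordination, which is the property the engine relies on.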
    Data warehouses also manage big data – the volume of structured data is growing quickly. The ability to run deep analytic queries on huge volumes of structured data is a big data problem. It requires massively parallel processing data warehouses and purpose-built appliances for deep analytics.
    Big data isn’t just at rest – it’s also in motion. Streaming data represents an entirely different big data problem – the ability to quickly analyze and act upon data while it’s still moving. This new technology opens a world of possibilities – from processing volumes of data that were just not practical to store, to detecting insight and responding quickly.
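    As a rough illustration of analyzing data while it is still moving, the sketch below keeps a sliding time window over an event stream and reports the event count as each event arrives, without storing the full history. It is a toy in plain Python, not InfoSphere Streams; the class name SlidingWindowCounter is hypothetical, and it assumes timestamps arrive in non-decreasing order.

```python
from collections import deque


class SlidingWindowCounter:
    """Count events seen within the last `window_seconds` of the stream."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def add(self, timestamp):
        """Record one event and return the count inside the current window."""
        self._timestamps.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Discard events that have aged out of the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps)


if __name__ == "__main__":
    counter = SlidingWindowCounter(window_seconds=10)
    for t in [0, 5, 10, 21]:
        print(t, counter.add(t))
```

    Only the events inside the window are held in memory, so the cost per event depends on the window size rather than on the total volume of the stream.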
    As much of the world’s big data is unstructured and in textual content, text analytics is a critical component to analyze and derive meaning from text.
    And finally, integration and governance technology – ETL, data quality, security, MDM, and lifecycle management. Integration and governance technology establishes the veracity of big data, and is critical in determining whether information is trusted.
    Finally, consumability: characteristics here include such items as being able to declare what you want done, not how to do it, expert integrated systems, deployment patterns, and so on.

    So if you wanted a short answer: a Big Data platform needs to be consumable and governable, give the opportunity for analytics in-motion and at rest (in an EDW AND things like Hadoop), discover and index Big Data, and finally, provide the ability to analyze unstructured data.

    Notice I didn’t mention one IBM product above; you can piece together a platform with a mash of vendors if you want. If you start to look into what IBM is doing, and although I’m biased and work there, I think you will find we have a true Big Data platform.

    Q6. Does it make sense in your opinion to virtualize Hadoop?

    Paul C. Zikopoulos: It can. It’s going to depend on the use case, right? I see a lot of efforts by EMC in that area and that’s cool. Of course the Cloud and Hadoop kind of go hand in hand. I think this space is growing by leaps and bounds…fun to watch.

    Q7. What is your opinion on the evolution of Hadoop?

    Paul C. Zikopoulos: It’s just that – an evolution. I think that innovation is going to deliver more and more of what enterprises need from a ‘hardening’ aspect as time goes on. Hadoop 2.0 is a big step forward for availability. It’s out there now, but not ready for production in my humble opinion (although some vendors are shipping it, their documentation tells you it’s not ready for production). The next version of MapReduce (Yarn) and making Hive really fast (Tez) are also part of the evolution; stay close here, it’s changing fast!
    That’s the best part of community. Now if you look at most of the vendors in this space, many are getting distracted and working on non-Hadoop’ish things to help Hadoop, and that’s fine too. We’re on a good path here.
    A lot of vendors are here and more are popping up all the time (like Intel, which just announced its own distribution). At some point, I think there will be a consolidation of distros out there, but with the hype around it right now, it will continue to evolve.
    For example, it’s becoming more than just a MapReduce processing area. Right? Lots of technologies are storing data in Hadoop’s HDFS, but bypassing MapReduce. So I find the file system key to the evolution.

    Q8. Can In-Memory Data Management play a significant role for Big Data Analytics? If yes, how?

    Paul C. Zikopoulos: I think it’s essential, but in a Big Data world, it would seem that the amount of data we are storing – at least right now – is proportionally bigger than the amount we can get into memory at a cost effective rate.
    So in-memory needs to harmoniously live with the database. If you look at what we did with BLU Acceleration and DB2, we did just that.
    In-memory columnar and typical relational tables live side by side in the same database kernel.
    You can work with both structures together, in the same memory structures, queries, and so on.

    When you can’t fit all the columns into memory, performance either falls off the cliff or, worse, the system could crash.

    From an analytics side, BLU Acceleration allows you to run queries faster, amazingly faster. That’s going to get you more iterations of queries, analytics and what not. It’s not for everything, but if you can help my reports run faster, that’s cool. So imagine you find some interesting pieces of information in a Discovery Zone powered by a Hadoop engine; pulling that out, packing it into an in-memory structure, and surfacing it to the enterprise is going to be pretty cool.

    Q9. What about elastic computing in the Cloud? How does it relate to Big Data Analytics?

    Paul C. Zikopoulos: This is pretty important because I need the utility-like nature of a Hadoop cluster, without the capital investment. Time to analytics is the benefit here. After all, if you’re a start-up analytics firm seeking venture capital funding, do you really walk in to your investor and ask for millions to set up a cluster? You’ll get kicked out the door.
    No, you go to Rackspace or Amazon, swipe a card, and get going. IBM is there with its Hadoop clusters (private and public) and you’re looking at clusters that cost as low as $0.60 US an hour.
    I think at one time I costed out a 100 node Hadoop cluster for an hour and it was like $34US – and the price has likely gone down. What’s more, your cluster will be up and running in 30 minutes. So on-premise or off-premise Cloud is key for these environments.

    Paul C. Zikopoulos, B.A., M.B.A., is the Director of Technical Professionals for IBM Software Group’s Information Management division and additionally leads the World Wide Competitive Database and Big Data Technical Sales Acceleration teams.
    Paul is an award winning writer and speaker with more than 19 years of experience in Information Management.
    Paul is seen as a global expert in Big Data and database. He was picked by SAP as one of its “Top 50 Big Data Twitter Influencers”, named by BigData Republic to its “Top 100 Most Influential” list, listed by Technopedia as “A Big Data Expert to Follow”, and consulted on Big Data by the popular TV show “60 Minutes”.
    Paul has written more than 350 magazine articles and 16 books, some of which include “Harness the Power of Big Data”, “Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data”, “Warp Speed, Time Travel, Big Data, and More: DB2 10 New Features”, “DB2 pureScale: Risk Free Agile Scaling”, “DB2 Certification for Dummies”, “DB2 for Dummies”, and more.
    In his spare time, he enjoys all sorts of sporting activities, including running with his dog Chachi, avoiding punches in his MMA training, and trying to figure out the world according to Chloë—his daughter.

    Related Posts

    On Virtualize Hadoop. Interview with Joe Russell. April 29, 2013

    On Pivotal HD. Interview with Scott Yara and Florian Waas. April 22, 2013

    On Big Data Velocity. Interview with Scott Jarr. January 28, 2013


    Harness the Power of Big Data The IBM Big Data Platform.
    Paul C. Zikopoulos, Dirk deRoos, Krishnan Parasuraman, Thomas Deutsch, David Corrigan, James Giles, Chris Eaton.
    Book, Copyright © 2013 by The McGraw-Hill Companies.
    Download Book (.PDF 250 pages)

    Warp Speed, Time Travel, Big Data, and More. DB2 10 for Linux, UNIX, and Windows New Features.
    Paul Zikopoulos, George Baklarz, Matt Huras, Walid Rjaibi, Dale McInnis, Matthias Nicola, Leon Katsnelson.
    Book, Copyright © 2012 by The McGraw-Hill Companies.
    Download book (.PDF 217 pages)

    Understanding Big Data Analytics for Enterprise Class Hadoop and Streaming Data.
    Paul C. Zikopoulos, Chris Eaton, Dirk deRoos, Thomas Deutsch, George Lapis.
    Book, Copyright © 2012 by The McGraw-Hill Companies.
    Download book (.PDF 142 pages)

    – ODBMS.org Resources on Big Data and Analytical Data Platforms:
    Blog Posts | Free Software | Articles | Lecture Notes | PhD and Master Thesis|

    Follow ODBMS.org on Twitter: @odbmsorg


    Two Cons against NoSQL. Part I. http://www.odbms.org/blog/2012/10/two-cons-against-nosql-part-i/ http://www.odbms.org/blog/2012/10/two-cons-against-nosql-part-i/#comments Tue, 30 Oct 2012 16:57:52 +0000 http://www.odbms.org/blog/?p=1785 Two cons against NoSQL data stores read like this:

    1. It’s very hard to move data out from one NoSQL to some other system, even other NoSQL. There is a very hard lock in when it comes to NoSQL. If you ever have to move to another database, you have basically to re-implement a lot of your applications from scratch.

    2. There is no standard way to access a NoSQL data store.
    All tools that already exist for SQL have to be recreated for each of the NoSQL databases. This means that it will always be harder to access data in NoSQL than from SQL. For example, how many NoSQL databases can export their data to Excel? (Something every CEO wants to get sooner or later).

    These are valid points. I wanted to start a discussion on this.
    This post is the first part of a series of feedback I received from various experts, with obviously different points of view.

    I plan to publish Part II, with more feedback later on.

    You are welcome to contribute to the discussion by leaving a comment if you wish!


    1. It’s very hard to move the data out from one NoSQL to some other system, even other NoSQL.

    Dwight Merriman (founder of 10gen, maker of MongoDB): I agree it is still early and I expect some convergence in data models over time. By the way, I am having conversations with other NoSQL product groups about standards, but it is super early so nothing is imminent.
    50% of the NoSQL products are JSON-based document-oriented databases.
    So that is the greatest commonality. Use that and you have some good flexibility, and JSON is standards-based and widely used in general, which is nice. MongoDB, CouchDB, and Riak, for example, use JSON. (Mongo internally stores “BSON”.)

    So moving data across these would not be hard.

    1. If you ever have to move to another database, you have basically to re-implement a lot of your applications from scratch.

    Dwight Merriman: Yes. Once again I wouldn’t assume that to be the case forever, but it is for the present. Also I think there is a bit of an illusion of portability with relational. There are subtle differences in the SQL, medium differences in the features, and there are giant differences in the stored procedure languages.
    I remember at DoubleClick long ago we migrated from SQL Server to Oracle and it was a HUGE project. (We liked SQL server we just wanted to run on a very very large server — i.e. we wanted vertical scaling at that time.)

    Also: while porting might be work, given that almost all these products are open source, the potential “risks” of lock-in I think drops an order of magnitude — with open source the vendors can’t charge too much.

    Ironically people are charged a lot to use Oracle, and yet in theory it has the portability properties that folks would want.

    I would anticipate SQL-like interfaces for BI tool integration in all the products in the future. However that doesn’t mean that is the way one will write apps though. I don’t really think that even when present those are ideal for application development productivity.

    1. For example, how many NoSQL databases can export their data to Excel? (Something every CEO wants to get sooner or later).

    Dwight Merriman: So with MongoDB what I would do would be to use the mongoexport utility to dump to a CSV file and then load that into Excel. That is done often by folks today. And when there is nested data that isn’t tabular in structure, you can use the new Aggregation Framework to “unwind” it to a more matrix-like format for Excel before exporting.

    You’ll see more and more tooling for stuff like that over time. Jaspersoft and Pentaho have mongo integration today, but the more the better.
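    The “unwind, then export to CSV” step described above can be sketched in plain Python. This is not the MongoDB API: the unwind function below only imitates, under simplifying assumptions, what the Aggregation Framework’s $unwind stage does to an array field, and the sample documents and field names (customer, orders) are made up for illustration.

```python
import csv
import io


def unwind(document, array_field):
    """Yield one flat copy of the document per element of array_field."""
    for element in document.get(array_field, []):
        row = {k: v for k, v in document.items() if k != array_field}
        row[array_field] = element
        yield row


def to_csv(documents, array_field, columns):
    """Flatten nested documents and return them as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for document in documents:
        for row in unwind(document, array_field):
            writer.writerow(row)
    return buffer.getvalue()


if __name__ == "__main__":
    docs = [
        {"customer": "acme", "orders": [10, 20]},
        {"customer": "zen", "orders": [5]},
    ]
    print(to_csv(docs, "orders", ["customer", "orders"]))
```

    Each array element becomes its own row with the parent fields repeated, which is the matrix-like shape a spreadsheet expects.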

    John Hugg (VoltDB Engineering): Regarding your first point about the issue with moving data out from one NoSQL to some other system, even other NoSQL.
    There are a couple of angles to this. First, data movement itself is indeed much easier between systems that share a relational model.
    Most SQL relational systems, including VoltDB, will import and export CSV files, usually without much issue. Sometimes you might need to tweak something minor, but it’s straightforward both to do and to understand.

    Beyond just moving data, moving your application to another system is usually more challenging. As soon as you target a platform with horizontal scalability, an application developer must start thinking about partitioning and parallelism. This is true whether you’re moving from Oracle to Oracle RAC/Exadata, or whether you’re moving from MySQL to Cassandra. Different target systems make this easier or harder, from both development and operations perspectives, but the core idea is the same. Moving from a scalable system to another scalable system is usually much easier.
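    A minimal sketch of the partitioning decision mentioned above: rows are assigned to partitions by hashing a chosen key, so all rows sharing a key land on the same node. The function names and the sample rows are hypothetical, and real systems (VoltDB, Cassandra, Oracle RAC) each use their own partitioning schemes; this only illustrates the general idea.

```python
import hashlib


def partition_for(key, partition_count):
    """Map a key to a partition index using a stable hash."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


def partition_rows(rows, key_field, partition_count):
    """Split rows into partition_count buckets by their partitioning key."""
    partitions = [[] for _ in range(partition_count)]
    for row in rows:
        partitions[partition_for(row[key_field], partition_count)].append(row)
    return partitions


if __name__ == "__main__":
    rows = [{"account": "a1", "amount": 5}, {"account": "a1", "amount": 7},
            {"account": "b2", "amount": 3}]
    for index, bucket in enumerate(partition_rows(rows, "account", 4)):
        print(index, bucket)
```

    Operations that touch a single key stay on one partition and run in parallel with others; operations that span keys need cross-partition coordination, which is where the application developer has to start thinking differently.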

    Where NoSQL goes a step further than scalability, is the relaxing of consistency and transactions in the database layer. While this simplifies the NoSQL system, it pushes complexity onto the application developer. A naive application port will be less successful, and a thoughtful one will take more time.
    The amount of additional complexity largely depends on the application in question. Some apps are more suited to relaxed consistency than others. Other applications are nearly impossible to run without transactions. Most lie somewhere in the middle.

    To the point about there being no standard way to access a NoSQL data store. While the tooling around some of the most popular NoSQL systems is improving, there’s no escaping that these are largely walled gardens.
    The experience gained from using one NoSQL system is only loosely related to another. Furthermore, as you point out, non-traditional data models are often more difficult to export to the tabular data expected by many reporting and processing tools.

    By embracing the SQL/Relational model, NewSQL systems like VoltDB can leverage a developer’s experience with legacy SQL systems, or other NewSQL systems.
    All share a common query language and data model. Most can be queried at a console. Most have familiar import and export functionality.
    The vocabulary of transactions, isolation levels, indexes, views and more are all shared understanding. That’s especially impressive given the diversity in underlying architecture and target use cases of the many available SQL/Relational systems.

    Finally, SQL/Relational doesn’t preclude NoSQL-style development models. Postgres, Clustrix and VoltDB support MongoDB/CouchDB-style JSON Documents in columns. Functionality varies, but these systems can offer features not easily replicated on their NoSQL inspiration, such as JSON sub-key joins or multi-row/key transactions on JSON data.

    1. It’s very hard to move the data out from one NoSQL to some other system, even other NoSQL. There is a very hard lock in when it comes to NoSQL. If you ever have to move to another database, you have basically to re-implement a lot of your applications from scratch.

    Steve Vinoski (Architect at Basho): Keep in mind that relational databases are around 40 years old while NoSQL is 3 years old. In terms of the technology adoption lifecycle, relational databases are well down toward the right end of the curve, appealing to even the most risk-averse consumer. NoSQL systems, on the other hand, are still riding the left side of the curve, appealing to innovators and the early majority who are willing to take technology risks in order to gain advantage over their competitors.

    Different NoSQL systems make very different trade-offs, which means they’re not simply interchangeable. So you have to ask yourself: why are you really moving to another database? Perhaps you found that your chosen database was unreliable, or too hard to operate in production, or that your original estimates for read/write rates, query needs, or availability and scale were off such that your chosen database no longer adequately serves your application.
    Many of these reasons revolve around not fully understanding your application in the first place, so no matter what you do there’s going to be some inconvenience involved in having to refactor it based on how it behaves (or misbehaves) in production, including possibly moving to a new database that better suits the application model and deployment environment.

    2. There is no standard way to access a NoSQL data store.
    All tools that already exist for SQL have to be recreated for each of the NoSQL databases. This means that it will always be harder to access data in NoSQL than from SQL. For example, how many NoSQL databases can export their data to Excel? (Something every CEO wants to get sooner or later).

    Steve Vinoski: Don’t make the mistake of thinking that NoSQL is attempting to displace SQL entirely. If you want data for your Excel spreadsheet, or you want to keep using your existing SQL-oriented tools, you should probably just stay with your relational database. Such databases are very well understood, they’re quite reliable, and they’ll be helping us solve data problems for a long time to come. Many NoSQL users still use relational systems for the parts of their applications where it makes sense to do so.

    NoSQL systems are ultimately about choice. Rather than forcing users to try to fit every data problem into the relational model, NoSQL systems provide other models that may fit the problem better. In my own career, for example, most of my data problems have fit the key-value model, and for that relational systems were overkill, both functionally and operationally. NoSQL systems also provide different tradeoffs in terms of consistency, latency, availability, and support for distributed systems that are extremely important for high-scale applications. The key is to really understand the problem your application is trying to solve, and then understand what different NoSQL systems can provide to help you achieve the solution you’re looking for.

    1. It’s very hard to move the data out from one NoSQL to some other system, even other NoSQL. There is a very hard lock in when it comes to NoSQL. If you ever have to move to another database, you have basically to re-implement a lot of your applications from scratch.

    Cindy Saracco (IBM Senior Solutions Architect) (these comments reflect my personal views and not necessarily those of my employer, IBM):
    Since NoSQL systems are newer to market than relational DBMSs and employ a wider range of data models and interfaces, it’s understandable that migrating data and applications from one NoSQL system to another — or from NoSQL to relational — will often involve considerable effort.
    However, I’ve heard more customer interest around NoSQL interoperability than migration. By that, I mean many potential NoSQL users seem more focused on how to integrate that platform into the rest of their enterprise architecture so that applications and users can have access to the data they need regardless of the underlying database used.

    2. There is no standard way to access a NoSQL data store.
    All tools that already exist for SQL have to be recreated for each of the NoSQL databases. This means that it will always be harder to access data in NoSQL than from SQL. For example, how many NoSQL databases can export their data to Excel? (Something every CEO wants to get sooner or later).

    Cindy Saracco: From what I’ve seen, most organizations gravitate to NoSQL systems when they’ve concluded that relational DBMSs aren’t suitable for a particular application (or set of applications). So it’s probably best for those groups to evaluate what tools they need for their NoSQL data stores and determine what’s available commercially or via open source to fulfill their needs.
    There’s no doubt that a wide range of compelling tools are available for relational DBMSs and, by comparison, fewer such tools are available for any given NoSQL system. If there’s sufficient market demand, more tools for NoSQL systems will become available over time, as software vendors are always looking for ways to increase their revenues.

    As an aside, people sometimes equate Hadoop-based offerings with NoSQL.
    We’re already seeing some “traditional” business intelligence tools (i.e., tools originally designed to support query, reporting, and analysis of relational data) support Hadoop, as well as newer Hadoop-centric analytical tools emerge.
    There’s also a good deal of interest in connecting Hadoop to existing data warehouses and relational DBMSs, so various technologies are already available to help users in that regard. IBM happens to be one vendor that’s invested quite a bit in different types of tools for its Hadoop-based offering (InfoSphere BigInsights), including a spreadsheet-style analytical tool for non-programmers that can export data in CSV format (among others), Web-based facilities for administration and monitoring, Eclipse-based application development tools, text analysis facilities, and more. Connectivity to relational DBMSs and data warehouses is also part of IBM’s offerings. (Anyone who wants to learn more about BigInsights can explore links to articles, videos, and other technical information available through its public wiki.)

    Related Posts

    On Eventual Consistency– Interview with Monty Widenius. by Roberto V. Zicari on October 23, 2012

    On Eventual Consistency– An interview with Justin Sheehy. by Roberto V. Zicari, August 15, 2012

    Hadoop and NoSQL: Interview with J. Chris Anderson by Roberto V. Zicari


    Big Data for Good. http://www.odbms.org/blog/2012/06/big-data-for-good/ http://www.odbms.org/blog/2012/06/big-data-for-good/#comments Mon, 04 Jun 2012 06:15:30 +0000 http://www.odbms.org/blog/?p=1474 A distinguished panel of experts discuss how Big Data can be used to create Social Capital.

    Every day, 2.5 quintillion bytes of data are created. This data comes from digital pictures, videos, posts to social media sites, intelligent sensors, purchase transaction records, and cell phone GPS signals, to name a few. This is Big Data.

    There is a great interest both in the commercial and in the research communities around Big Data. It has been predicted that “analyzing Big Data will become a key basis of competition, underpinning new waves of productivity growth, innovation, and consumer surplus”, according to research by MGI and McKinsey’s Business Technology Office.

    But very few people seem to look at how Big Data can be used for solving social problems. Most of the work in fact is not in this direction. Why is this?
    What can be done in the international research community to make sure that some of the most brilliant ideas also have an impact on social issues?

    I have invited a panel of distinguished, well-known researchers and professionals to discuss this issue. The list of panelists includes:

    Roger Barga, Microsoft Research, group lead eXtreme Computing Group, USA

    Laura Haas, IBM Fellow and Director Institute for Massive Data, Analytics and Modeling IBM Research, USA

    Alon Halevy, Google Research, Head of the Structured Data Group, USA

    Paul Miller, Consultant, Cloud of Data, UK

    This Q&A panel focuses exactly on this question: is it possible to conduct research for a corporation and/or a research lab, and at the same time make sure that the potential output of our research also has a social impact?

    We take Big Data as a key example. Big Data is clearly of interest to marketers and enterprises alike who wish to offer their customers better services and better quality products. Ultimately their goal is to sell their products/services.

    This is good, but how about digging into Big Data to help people in need? Preventing / predicting natural catastrophes, or helping to offer services targeted to people and structures in social need?

    Hope you’ll find this interview interesting, as well as eye-opening.


    Q1. In your opinion, would it be possible to exploit some of the current and future research and developments efforts on Big Data for achieving social capital?

    Alon: Yes, Big Data is not just the size of an individual data set, but rather the collection of data that is available to us online (e.g., government data, NGOs, local governments, journalists, etc).
    By putting these data together we help tell stories about the data and make them of interest and of value to the wider public. As one simple example, a recent Danish Journalism Award was given to a nice visualization of data about which doctors are being sponsored by the medical industry. The ability to communicate this data with the public is certainly part of the Big Data agenda.

    Laura: Absolutely. In fact, many of the efforts that we are engaged in today are exactly in this direction.
    Much of our “Smarter Planet” related research is around utilizing more intelligently the large amounts of data coming from instrumenting, observing, and capturing the information about phenomena on planet earth, both natural and man-made.

    Paul: First, it’s important to recognise that technological advances, new techniques, and new ways of working often deliver both tangible and intangible social benefit as a by-product of something else. Robert Owen and his peers in the late 18th and early 19th centuries might have had genuine motives for the social welfare and educational programmes they delivered for workers in their factories, but it was the commercial success of the factories themselves that paid for the philanthropy. And better educated children became better integrated factory workers, so it wasn’t completely altruistic.

    That said, there is clearly scope for Big Data to deliver direct benefits in areas that aid society. Google Flu Trends is perhaps the best-known example – analysis of many millions of searches for flu-related terms (symptoms, medicines, etc) enabling Google’s non-profit Foundation to provide early visibility of illness in ways that could/should assist local healthcare systems. Google’s search engine isn’t about flu, and its indices aren’t for flu detection or prediction; this piece of societal value simply emerges from the ‘data exhaust’ of all those people searching a single site. Flu Trends isn’t alone; Harvard researchers found that Twitter data could be analysed to track the spread of Cholera on Haiti in a way that proved “substantially faster” than traditional techniques. According to Mathew Ingram’s write-up of the research, “What the Harvard and HealthMap study shows is that analyzing the data from large sets like the tweets around Haiti isn’t just good at tracking patterns or seeing connections after an event has occurred, but can actually be of use to researchers on the ground while those events are underway” (my emphasis).

    Roger: Absolutely, we have already seen several such examples.
    One such example in science is Jim Gray and Alex Szalay’s collaboration to build a virtual observatory for astronomy, which leveraged relational database technology.
    The SDSS Sky Server has since supported hundreds of researchers and resulted in thousands of publications over the years.
    Another, more recent example is the language translation system that researchers in Microsoft Research built for aid relief workers in Haiti after the 2010 earthquake.
    They leveraged the same technology we leverage in our search operations to build, from scratch, a statistical machine translation engine that translates Haitian Creole to English; it took 4 days, 17 hours, and 30 minutes and was delivered to aid workers in Haiti.

    Q2. If yes, what are the areas where in your opinion Big Data could have a real impact on social capital?

    Alon: Bringing data that is otherwise hidden from view to the eyes of the interested public. Data activists and journalists world-wide need to be able to easily discover data sets, merge them in a sensible fashion and tell stories about them that will grab people’s attention.
    As another example, helping people in crisis response situations has huge potential. For instance, people have used Google Fusion Tables to create maps with critical information for people after the Japan earthquake in 2011 and before the hurricane in NYC later that year.

    Laura: Healthcare is an obvious one, where leveraging the vast amounts of genomic information now being produced together with patient records, and the medical literature could help us provide the best known treatments to an individual patient — or discover new therapies that may be more effective than those currently in use.
    We have worked already on leveraging big data and machine learning to predict the best therapeutic regimens for AIDS patients, for example. Or, when it comes to natural resources, we are leveraging big data to optimize the placement of turbines in a wind farm so that we get the most power for the least environmental impact. We can also look at man-made phenomena — for example, understanding traffic patterns and using the insight to do better planning or provide incentives that can reduce traffic during crunch hours. Many other examples can be given of how Big Data is being used to improve the planet!

    Paul: The opportunities must – surely – be enormous? Any of the big issues affecting society, from environmental change, to population growth, to the need for clean water, food, and healthcare; all of these affect large groups of people and all of them have aspects of policy formulation or delivery that are (or should be, if anyone collected it) data-rich. The Volume, Velocity and Variety of data in many of these areas should offer challenging research opportunities for practitioners… and tangible benefits to society when they’re successful.

    Roger: Top of mind is to advance scientific research, what has been referred to as eScience, which covers everything from the traditional hard sciences, such as astronomy and oceanography, to the social sciences and economics.
    Our ability to acquire and analyze unprecedented amounts of data has the potential to have a profound impact on science. It is a leap from the application of computing to support scientists to ‘do’ science (i.e. ‘computational science’) to the integration of computer science and ability to analyze volumes of data to extract insights into the very fabric of science. While on the face of it, this change may seem subtle, we believe it to be fundamental to science and the way science is practiced. Indeed, we believe this development represents the foundations of a new revolution in science.
    We captured stories from many different scientific investigations in the book “The Fourth Paradigm: Data-Intensive Scientific Discovery.”

    Q3. What are the main challenges in such areas?

    Alon: Data discovery is a huge challenge (how to find high-quality data from the vast collections of data that are out there on the Web).
    Determining the quality of data sets and their relevance to particular issues is another (i.e., is the data set making some underlying assumption that renders it biased or not informative for a particular question). Combining multiple data sets by people who have little knowledge of database techniques is a constant challenge.

    Laura: With any big data project, many of the same issues exist. I’ll mention three major categories of issues: those related to the data, itself, those related to the process of deriving insight and benefit from the data, and finally, those related to management issues such as data privacy, security, and governance in general. In the data space, we talk about the 4 V’s of data — Volume (just dealing with the sheer size of it), Variety (handling the multiplicity of types and sources and formats), Velocity (reacting to the flood of information in the time required by the application), and, last and perhaps least understood, Veracity (how can we cope with uncertainty, imprecision, missing values, and yes, occasionally, mis-statements or untruths?).
    The challenges with deriving insight include capturing data, aligning data from different sources (e.g., resolving when two objects are the same), transforming the data into a form suitable for analysis, modeling it, whether mathematically, or through some form of simulation, etc, and then understanding the output — visualizing and sharing the results, for example. And governance includes ensuring that data is used correctly (abiding by its intended uses and relevant laws), tracking how the data is used, transformed, derived, etc, and managing its lifecycle. There are research topics in ALL of these areas!

    Paul: Data availability – is there data available, at all? Increasingly, there is. But coverage and comprehensiveness often remain patchy, and the rigour with which datasets are compiled may still raise concerns. A good process will, typically, make bad decisions if based upon bad data.
    Data quality – how good is the data? How broad is the coverage? How fine is the sampling resolution? How timely are the readings? How well understood are the sampling biases? What are the implications in, for example, a Tsunami that affects several Pacific Rim countries? If data is of high quality in one country, and poorer in another, does the Aid response skew ‘unfairly’ toward the well-surveyed country or toward the educated guesses being made for the poorly surveyed one?
    Data comprehensiveness – are there areas without coverage? What are the implications?
    Personally Identifiable Information – much of this information is about people. Can we extract enough information to help people without extracting so much as to compromise their privacy? Partly, this calls for effective industrial practices. Partly, it calls for effective oversight by Government. Partly – perhaps mostly – it requires a realistic reconsideration of what privacy really means… and an informed grown up debate about the real trade-off between aspects of privacy ‘lost’ and benefits gained.
    Rather than offering blanket privacy policies, perhaps customers, regulators and software companies should be moving closer to some form of explicit data agreement; if you give me access to X, Y, and Z about yourself, I will use it for purposes A, B, and C… and you will gain benefits/services D, E, and F. The first two parts are increasingly in place, albeit informally. The final part – the benefits – is far less well expressed.
    Data dogmatism – analysis of big data can offer quite remarkable insights, but we must be wary of becoming too beholden to the numbers. Domain experts – and common sense – must continue to play a role. It would be worrying, indeed, if the healthcare sector only responded to flu outbreaks when Google Flu Trends told them to! See, for example, a recent blog post of mine

    Roger: The first important step is to embrace a data-centric view. The goal is not merely to store data for a specific community but to improve data quality and to deliver accurate, consistent data as a service to operational systems. It isn’t simply a matter of connecting the plumbing between many different data sources; there’s a quality function that has to be applied to clean and reconcile all of this information.
    Researchers don’t simply need data, they need services-based information over this data to support their work.

    Q4. What are the main difficulties or barriers hindering our community from working on social capital projects?

    Alon: I don’t think there are particular barriers from a technical perspective. Perhaps the main barrier is ideas of how to actually take this technology and make social impact. These ideas typically don’t come from the technical community, so we need more inspiration from activists.

    Laura: Funding and availability of data are two big issues here. Much funding for social capital projects comes from governments, and as we know, it is but a small fraction of the overall budget. Further, the market for new tools and so on that might be created in these spaces is relatively limited, so it is not always attractive to private companies to invest. While there is a lot of publicly available data today, often key pieces are missing, or privately held, or cannot be obtained for legal reasons, such as the privacy of individuals, or a country’s national interests. While this is clearly an issue for most medical investigations, it crops up as well even with such apparently innocent topics as disaster management (some data about, e.g., coastal structures, may be classified as part of the national defense).

    Paul: Perceived lack of easy access to data that’s unencumbered by legal and privacy issues? The large-scale and long term nature of most of the problems?
    It’s not as ‘cool’ as something else? A perception (whether real or otherwise) that academic funding opportunities push researchers in other directions?
    Honestly, I’m not sure that there are significant insurmountable difficulties or barriers, if people want to do it enough.
    As Tim O’Reilly said in 2009 (and many times since), developers should “Work on stuff that matters.” The same is true of researchers.

    Roger: The greatest barrier may be social. Such projects require community awareness to bring people to take action and often a champion to frame the technical challenges in a way that is approachable by the community.
    These projects will likely require close collaboration between the technical community and those familiar with the problem.

    Q5. What could we do to help support initiatives for Big Data for Good?

    Alon: Building a collection of high quality data that is widely available and can serve as the backbone for many specific data projects. For example, data sets that include boundaries of countries/counties and other administrative regions, data sets with up-to-date demographic data. It’s very common that when a particular data story arises, these data sets serve to enrich it.

    Laura: Increasingly, we see consortiums of institutions banding together to work on some of these problems. These Centers may provide data and platforms for data-intensive work, alleviating some of the challenges mentioned above by acquiring and managing data, setting up an environment and tools, bringing in expertise in a given topic, or in data, or in analytics, providing tools for governance, etc. My own group is creating just such a platform, with the goal of facilitating such collaborative ventures. Of course, lobbying our governments for support of such initiatives wouldn’t hurt!

    Paul: Match domains with a need to researchers/companies with a skill/product. Activities such as the recent Big Data Week Hackathons might be one route to follow – encourage the organisers (and companies like Kaggle, which do this every day) to run Hackathons and competitions that are explicitly targeted at a ‘social’ problem of some sort. Continue to encourage the Open Data release of key public data sets. Talk to the agencies that are working in areas of interest, and understand the problems that they face. Find ways to help them do what they already want to do, and build trust and rapport that way.

    Roger: Provide tools and resources to empower the long tail of research.
    Today, only a fraction of scientists and engineers enjoy regular access to high performance and data-intensive computing resources to process and analyze massive amounts of data and run models and simulations quickly. The reality for most of the scientific community is that speed to discovery is often hampered as they have to either queue up for access to limited resources or pare down the scope of research to accommodate available processing power. This problem is particularly acute at the smaller research institutes which represent the long tail of the research community. Tier 1 and some tier 2 universities have sufficient funding and infrastructure to secure and support computing resources while the smaller research programs struggle. Our funding agencies and corporations must provide resources to support researchers, in particular those who do not have access to sufficient resources.

    Q6. Are you aware of existing projects/initiatives for Big Data for Good?

    Laura: Yes, many! See above for some examples. IBM Research alone has efforts in each of the areas mentioned, and many more. For example, we’ve been working with the city of Rio, in Brazil, to do detailed flood modeling, meter by meter; with the Toronto Children’s Hospital to monitor premature babies in the neonatal ward, allowing detection of life-threatening infections up to 24 hours earlier; and with the Rizzoli Institute in Italy to find the best cancer treatments for particular groups of patients.

    Roger: Yes, the United Nations Global Pulse initiative is one example. Earlier this year at the 2012 Annual Meeting in Davos, the World Economic Forum published a white paper entitled “Big Data, Big Impact: New Possibilities for International Development”. The WEF paper lays out several of the ideas which fundamentally drive the Global Pulse initiative and presents in concrete terms the opportunity presented by the explosion of data in our world today, and how researchers and policymakers are beginning to realize the potential for leveraging Big Data to extract insights that can be used for Good, in particular for the benefit of low-income populations. What I find intriguing about this project from a technical perspective is how to extract insight from ambient data, from GPS devices, cell phones and medical devices, combined with crowd sourced data from health and aid workers in the field, then analyzed with machine learning and analytics to predict a potential social need or crisis in advance while remediation is still viable.

    Q7. Anything else you wish to add?

    Alon: Google Fusion Tables has been used in many cases for social good, either through journalists, crisis response or data activists making a compelling visualization that caught people’s attention. This has been one of the most gratifying aspects of working on Fusion Tables and has served as a main driver for prioritizing our work: make it easy for people with passion for the data (rather than database expertise) to get their work done; make it easier for them to find relevant data and combine it with their own. We look very carefully at the workflow of these professionals and try to make it as efficient as possible.

    Laura: I think our community has the ability to do a lot of good by leveraging the tools we are developing, and our expertise, to attack some of the critical problems facing our world. We may even create economic value (not a bad thing, either!) while doing so.

    Dr. Roger Barga has been with the Microsoft Corporation since 1997, first working as a researcher in the database research group of Microsoft Research, then as architect of the Technical Computing Initiative, followed by architect and engineering group lead in the eXtreme Computing Group of Microsoft Research.
    He currently leads a product group developing an advanced analytics service on Windows Azure. Roger holds a PhD in Computer Science (database systems), MS in Computer Science (machine learning), and a BS in Mathematics.

    Dr. Alon Halevy heads the Structured Data Group at Google Research.
    Prior to that, he was a Professor of Computer Science at the University of Washington, where he founded the Database Research Group. From 1993 to 1997 he was a Principal Member of Technical Staff at AT&T Bell Laboratories (later AT&T Laboratories). He received his Ph.D. in Computer Science from Stanford University in 1993, and his Bachelor’s degree in Computer Science and Mathematics from the Hebrew University in Jerusalem in 1988. Dr. Halevy was elected Fellow of the Association for Computing Machinery in 2006.

    Dr. Laura Haas is an IBM Fellow, and Director of IBM Research’s new Institute for Massive Data, Analytics and Modeling; she also serves as a “catalyst” for ambitious research across IBM’s worldwide research labs. She was the Director of Computer Science at IBM’s Almaden Research Center from 2005 to 2011.
    From 2001-2005, she led the Information Integration Solutions architecture and development teams in IBM’s Software Group. Previously, Dr. Haas was a research staff member and manager at Almaden.
    She is best known for her work on the Starburst query processor, from which DB2 LUW was developed, on Garlic, a system which allowed integration of heterogeneous data sources, and on Clio, the first semi-automatic tool for heterogeneous schema mapping.
    She has received several IBM awards for Outstanding Innovation and Technical Achievement, an IBM Corporate Award for her work on information integration technology, and the Anita Borg Institute Technical Leadership Award. Dr. Haas was Vice President of the VLDB Endowment Board of Trustees from 2004-2009, and is a member of the National Academy of Engineering and the IBM Academy of Technology, an ACM Fellow, and Vice Chair of the board of the Computing Research Association.

    Dr. Paul Miller is Founder of the Cloud of Data, a UK-based consultancy primarily concerned with Cloud Computing, Big Data, and Semantic Technologies.
    He works with public and private sector clients in Europe and North America, and has a Ph.D in Archaeology (Geographic Information Systems) from the University of York.


    Acknowledgement: I would like to thank Michael J. Carey with whom I have brainstormed about this project at EDBT in Berlin. RVZ



    Benchmarking XML Databases: New TPoX Benchmark Results Available.
    http://www.odbms.org/blog/2011/09/benchmarking-xml-databases-new-tpox-benchmark-results-available/
    Mon, 19 Sep 2011 07:19:28 +0000

    “A key value is to provide strong data points that demonstrate and quantify how XML database processing can be done with very high performance.”–Agustin Gonzalez, Intel Corporation.

    “We wanted to show that DB2’s shared-nothing architecture scales horizontally for XML warehousing just as it does for traditional relational warehousing workloads.”–Dr. Matthias Nicola, IBM Corporation.

    TPoX stands for “Transaction Processing over XML” and is an XML database benchmark that Intel and IBM developed several years ago and then released as open source.
    A couple of months ago, the project published some new results.

    To learn more about this I have interviewed the main leaders of the TPoX project, Dr. Matthias Nicola, Senior engineer for DB2 at IBM Corporation and Agustin Gonzalez, Senior Staff Software Engineer at Intel Corporation.


    Q1. What exactly is TPoX?

    Matthias: TPoX is an XML database benchmark that focuses on XML transaction processing. TPoX simulates a simple financial application that issues XQuery or SQL/XML transactions to stress the XML storage, XML indexing, XML Schema support, XML updates, logging, concurrency and other components of an XML database system. The TPoX package comes with an XML data generator, an extensible Workload Driver, three XML Schemas that define the XML structures, and a set of predefined transactions. TPoX is free, open source, and available at http://tpox.sourceforge.net/ where detailed information can be found. Although TPoX comes with a predefined workload, it’s very easy to change this workload to adjust the benchmark to whatever your goals might be. The TPoX Workload Driver is very flexible: it can even run plain old SQL against a relational database and simulate hundreds of concurrent database users. So, when you ask “What is TPoX”, the complete answer is that it is an XML database benchmark but also a very flexible and extensible framework for database performance testing in general.
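
    The idea of a driver that simulates many concurrent database users can be illustrated with a minimal sketch in Python. This is not TPoX code (the real driver is written in Java); the names `run_workload` and `fake_transaction` are hypothetical, and the "transaction" here is only a short sleep standing in for a database call.

```python
import threading
import time


def run_workload(transaction, users, tx_per_user):
    """Run `transaction` from `users` concurrent threads.

    Returns (completed_transactions, transactions_per_second).
    """
    completed = 0
    lock = threading.Lock()

    def user_loop(user_id):
        nonlocal completed
        for seq in range(tx_per_user):
            transaction(user_id, seq)
            with lock:  # protect the shared counter
                completed += 1

    threads = [threading.Thread(target=user_loop, args=(u,)) for u in range(users)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - start
    tps = completed / elapsed if elapsed > 0 else float("inf")
    return completed, tps


def fake_transaction(user_id, seq):
    """Hypothetical stand-in for one database transaction."""
    time.sleep(0.001)


if __name__ == "__main__":
    count, tps = run_workload(fake_transaction, users=8, tx_per_user=5)
    print(f"{count} transactions, {tps:.0f} tx/sec")
```

    A real driver would replace `fake_transaction` with a call that submits an XQuery or SQL/XML statement and would also record per-transaction response times.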

    Q2. When did you start with this project? What was the original motivation for TPoX? What is the motivation now?

    Matthias: We started with this project approximately in 2003/2004. At that time we were working on the native XML support in DB2 that was later released in DB2 version 9.1 in 2006. We needed an XML workload -a benchmark- that was representative of an important class of real-world XML applications and that would stress all critical parts of a database system.
    We needed a tool to put a heavy load on the new XML database functionality that we were developing. Some XML benchmarks had been proposed by the research community, such as XMark, MBench, XMach-1, XBench, XOO7, and a few others. They were all useful in their respective scope, such as evaluating XQuery processors, but we felt that none of them truly aimed at evaluating a database system in its entirety. We found that they did not represent all relevant characteristics of real-world XML applications.
    For example, many of them only defined a read-only and single-user workload on a single XML document. However, real applications typically have many concurrent users, a mix of read and write operations, and millions or even billions of XML documents.
    That’s what we wanted to capture in the TPoX benchmark.

    Agustin: And the motivation today is the same as when TPoX became freely available as open source: database and hardware vendors, database researchers, and even database practitioners in the IT departments of large corporations need a tool to evaluate system performance, compare products, or compare different design and configuration options.
    At Intel, the main motivation behind TPoX is to benchmark and improve our platforms for the increasingly relevant intersection of XML and databases. So far, the joint results with IBM have exceeded our expectations.

    Q3. TPoX is an application-level benchmark. What does it mean? Why did you choose to develop an application-level benchmark?

    Matthias: We typically distinguish between micro-benchmarks and application-level benchmarks, both of which are very useful but have different goals. A micro-benchmark typically defines a range of tests such that each test exercises a narrow and well-defined piece of functionality. For example, if your focus is an XQuery processor you can define tests to evaluate XPath with parent steps, other tests to evaluate XPath with descendant-or-self axis, other tests to evaluate XQuery “let” clauses, and so on.
    This is very useful for micro-optimization of important features and functions. In contrast, an application-level benchmark tries to evaluate the end-to-end performance of a realistic application scenario and to exercise the performance of a complete system as a whole, instead of just parts of it.

    Agustin: As an application-level benchmark, TPoX has proven much more useful and believable than “synthetic” micro-benchmarks. As a result, TPoX can even be used to predict how similar real-world applications will perform, or where they will encounter a bottleneck. You cannot make such predictions with a micro-benchmark. Another important feature is that TPoX is very scalable – you can run TPoX on a laptop but also scale it up and run on large enterprise-grade servers, such as multi-processor Intel Xeon platforms.

    Q4. How exactly do you evaluate the performance of XML databases?

    Agustin: Well, one way is to use TPoX on a given platform and then compare to existing results on different combinations of hardware and software. I know that this is a simplistic answer but we really learn a lot from this approach. Keeping a precise history of the test configurations and the results obtained is always critical.

    Matthias: This is actually a very broad question! We use a wide range of approaches. We use micro-benchmarks, we use application-level benchmarks such as TPoX, we use real-world workloads that we get from some of our DB2 customers, and we continuously develop new performance tests. When we use TPoX, we often choose a certain database and hardware configuration that we want to test and then we gradually “turn up the heat”. For example, we perform repeated TPoX benchmark runs and increase the number of concurrent users until we hit a bottleneck, either in the hardware or the software. Then we analyze the bottleneck, try to fix it, and repeat the process. The goal is to always push the available hardware and software to the limit, in order to continuously improve both.
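
    The “turn up the heat” procedure described above (raise the number of concurrent users until throughput stops improving) can be sketched as a simple search loop. The sketch below is in Python and is only an illustration under stated assumptions: `simulated_throughput` is a made-up model of a system that scales linearly until it hits a fixed capacity, and `find_saturation`, the step size, and the 2% gain threshold are hypothetical choices, not part of TPoX.

```python
def simulated_throughput(users, per_user_tps=50.0, capacity_tps=1000.0):
    """Made-up system model: linear scaling until a hard capacity limit."""
    return min(users * per_user_tps, capacity_tps)


def find_saturation(measure, start_users=1, step=5, max_users=500, min_gain=0.02):
    """Increase the user count until throughput gains fall below `min_gain`.

    `measure(users)` must return the throughput observed with that many users.
    Returns (users_at_saturation, throughput, history_of_measurements).
    """
    users = start_users
    best = measure(users)
    history = [(users, best)]
    while users + step <= max_users:
        candidate = users + step
        tps = measure(candidate)
        history.append((candidate, tps))
        if tps < best * (1 + min_gain):
            # Adding users no longer helps: a bottleneck has been reached.
            return users, best, history
        users, best = candidate, tps
    return users, best, history


if __name__ == "__main__":
    users, tps, history = find_saturation(simulated_throughput)
    print(f"saturated at {users} users, {tps:.0f} tx/sec")
    for n, t in history:
        print(f"  {n:4d} users -> {t:7.1f} tx/sec")
```

    In practice `measure` would be a full benchmark run at a given user count, and the interesting work starts after the loop returns: finding out whether the limit is in the hardware or the software, fixing it, and repeating.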

    Q5. What is the difference of TPoX with respect to classical database benchmarks such as TPC-C and TPC-H?

    Matthias: One of the obvious differences is that TPC-C and TPC-H focus on very traditional and mature relational database scenarios. In contrast, TPoX aims at the comparatively young field of XML in databases. Another difference is that the TPC benchmarks have been standardized and “approved” by the TPC committee, while TPoX was developed by Intel and IBM, and extended by various students and Universities as an open source project.

    Agustin: But, TPoX also has some important commonalities with the TPC benchmarks. TPC-C, TPC-H, and TPoX are all application-level benchmarks. Also, TPC-C, TPC-H, and TPoX have each chosen to focus on a specific type of database workload. This is important because no benchmark can (or should try to) exercise all possible types of workloads. TPC-C is a relational transaction processing benchmark, TPC-H is a relational decision support benchmark, and TPoX is an XML transaction processing benchmark. Some people have called TPoX the “XML-equivalent of TPC-C”. Another similarity between TPC-C, TPC-E, and TPoX is that all three are throughput oriented “steady state benchmarks”, which makes it straightforward to communicate results and perform comparisons.

    Q6. Do you evaluate both XML-enabled and Native XML databases? Which XML Databases did you evaluate?

    Matthias: TPoX can be used to evaluate pretty much any database that offers XML support. The TPoX workload driver is architected such that only a thin layer (a single Java class) deals with the specific interaction to the database system under test. Personally, I have used TPoX only on DB2. I know that other companies as well as students at various Universities have also run TPoX against other well-known database systems.

    Q7. How did you define the TPoX Application Scenario? How did you ensure that the TPoX Application Scenario you defined is representative of a broader class of applications?

    Matthias: Over the years we have been working with a broad range of companies that have XML applications and require XML database support. Many of them are in the financial sector. We have worked closely with them to understand their XML processing needs. We have examined their XML documents, their XML Schemas, their XML operations, their data volumes, their transaction rates, and so on. All of that experience has flowed into the design of TPoX. One very basic but very critical observation is that there are practically no real-world XML applications that use only a single large XML document. Instead, the majority of XML applications use very large numbers of small documents.

    Agustin: TPoX is also very realistic because it uses a real-world XML Schema called FIXML, which standardizes trade-related messages in the financial industry. It is a very complex schema that defines thousands of optional elements and attributes and allows for immense document variability. It is extremely hard to map the FIXML schema to a traditional normalized relational schema. In the past, many XML processing systems were not able to handle the FIXML schema. But, since this type of XML is used in real-world applications, it is a great fit for a benchmark.

    Q8. How did you define the workload?

    Matthias: Again, by experience with real XML transaction processing applications.

    Q9. In your documentation you write that TPoX uses a “stateless” workload? What does it mean in practice? Why did you make this choice?

    Matthias: It means that every transaction is submitted to the database independently from any previous transactions. As a result, the TPoX workload driver doesn’t need to remember anything about previous transactions. This makes it easier to design and implement a benchmark that scales to billions of XML documents and hundreds of millions of transactions in a single benchmark run.
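
    One way to picture a stateless workload is a generator that builds each transaction purely from random draws over a known key range, so nothing about earlier transactions has to be stored. The Python sketch below assumes this interpretation; the transaction names, the `TOTAL_CUSTOMERS` constant, and the function `next_transaction` are hypothetical and are not taken from the TPoX kit.

```python
import random

# Hypothetical size of the generated data set. Because IDs are dense in
# 1..TOTAL_CUSTOMERS, any random draw refers to an existing document.
TOTAL_CUSTOMERS = 1_000_000

# Hypothetical transaction types, for illustration only.
TRANSACTION_KINDS = ("get_order", "get_account", "insert_order")


def next_transaction(rng):
    """Build one transaction request from random draws only.

    No argument or global carries information about previous
    transactions, which is what makes the workload stateless.
    """
    return {
        "kind": rng.choice(TRANSACTION_KINDS),
        "customer_id": rng.randrange(1, TOTAL_CUSTOMERS + 1),
    }


if __name__ == "__main__":
    rng = random.Random(42)  # each simulated user could own its own generator
    for _ in range(3):
        print(next_transaction(rng))
```

    Because each request depends only on the random generator, any number of simulated users can produce transactions in parallel without coordinating with one another.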

    Q10. Why not define a workload also for complex analytical queries?

    Matthias: We did! And we ran it on a 10TB XML data warehouse with more than 5.5 Billion XML documents.
    That was a very exciting project and you can find more details on my blog.
    Although the initial wave of XML database adoption was more focused on transactional and operations systems, companies soon realized that they were accumulating very large volumes of XML documents that contained a goldmine of information. Hence, the need for XML warehousing and complex analytical XML queries was pressing. We wanted to show that DB2’s shared-nothing architecture scales horizontally for XML warehousing just as it does for traditional relational warehousing workloads.

    Agustin: Admittedly, we have not yet formally included this workload of complex XML queries into the TPoX benchmark. Just like TPC-C and TPC-H are separate for transaction processing vs. decision support, we would also need to define two flavors of TPoX, even if the underlying XML data remains the same. A TPoX workload with complex queries is definitely very meaningful and desirable.

    Q11. What are the main new results you obtained so far? What are the main values of the results obtained so far?

    Agustin: We have produced many results using TPoX over the years, with ever larger numbers of transactions per second and continuous scalability of the benchmark on increasingly larger platforms. A key value is to provide strong data points that demonstrate and quantify how XML database processing can be done with very high performance. In particular, the first public 1TB XML benchmark that we did a few years ago has helped establish the notion that efficient XML transaction processing is a reality today. Such results give the hardware and the software a lot of credibility in the industry. And of course we learn a lot with every benchmark, which allows us to continuously improve our products.

    Q12. You write in your Blog “For 5 years now Intel has a strong history of testing and showcasing many of their latest processors with the Transaction Processing over XML (TPoX) benchmark.” Why has Intel been using the TPoX benchmark? What results did they obtain?

    Matthias: I’ll let Agustin answer this one.

    Agustin: Intel uses the TPoX benchmark because it helps us demonstrate the power of Intel platforms and generate insights on how to improve them. TPoX also enables us to work with IBM to improve the performance of DB2 on Intel platforms, which is good for both IBM and Intel. This collaboration of Intel and IBM around TPoX is an example of an extensive effort at Intel to make sure that enterprise software has excellent performance on Intel. You can see our most important results on the TPoX web page.

    Q13. Can you use TPoX to evaluate other kinds of databases (e.g. Relational, NoSQL, Object Oriented, Cloud stores)? How does TPoX compare with the Yahoo! YCSB benchmark for Cloud Serving Systems?

    Matthias: Yes, the TPoX workload driver can be used to run traditional SQL workloads against relational databases. Assuming you have a populated relational database, you can define a SQL workload and use the TPoX driver to parameterize, execute, and measure it. TPoX and YCSB have been designed for different systems under test. However, parts of the TPoX framework can be reused to quickly develop other types of benchmarks, especially since TPoX offers various extension points.

    Agustin: Some open source relational databases have started to offer at least partial support for the SQL/XML functions and the XML data type. Given the level of parameterization and the extensible nature of the TPoX workload driver it would be very easy to develop custom workloads for the emerging support of the XML data type on open source databases. At the same time, the powerful XML document generator included in the kit can be used to generate the required data. Using TPoX to test the performance of XML in open source databases is an intriguing possibility.

    Q14. Is it possible to extend TPoX? If yes, how?

    Matthias: Yes, TPoX can be extended in several ways. First, you can change the TPoX workload in any way you want. You can modify, add, or remove transactions from the workload, you can change their relative weight, and you can change the random value distributions that are used for the transaction parameters. We have used the TPoX workload driver to run many different XML workloads, also on other XML data than just the TPoX documents. We have also used the workload driver for relational SQL performance tests, just because it’s so easy to set up concurrent workloads.
    Second, the database specific interface of the TPoX workload driver is encapsulated in a single Java class, so it is relatively easy to port the driver to another database system. And third, the new version TPoX 2.1 allows transactions to be coded not only in SQL, SQL/XML, and XQuery, but also in Java. TPoX 2.1 supports “Java-Plugin transactions” that allow you to implement whatever activities you want to run and measure in a concurrent manner. For example, you can run transactions that call a web service, send or receive data from a message queue, access a content management system, or perform any other operations – only limited by what you can code in Java!
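
    Two of the extension points mentioned above, a weighted transaction mix and a single class that isolates all database-specific calls, can be sketched as follows. This is an illustrative Python sketch, not the TPoX API: `DatabaseAdapter`, `RecordingAdapter`, `WORKLOAD`, `pick_transaction`, and `run_mix` are hypothetical names, and the weights are invented.

```python
import random
from abc import ABC, abstractmethod
from collections import Counter


class DatabaseAdapter(ABC):
    """Hypothetical thin layer: the only code that talks to the database."""

    @abstractmethod
    def execute(self, statement, params):
        """Submit one transaction to the system under test."""


class RecordingAdapter(DatabaseAdapter):
    """Stand-in adapter that records calls instead of contacting a database."""

    def __init__(self):
        self.calls = []

    def execute(self, statement, params):
        self.calls.append((statement, params))


# Hypothetical workload definition: (transaction name, relative weight).
WORKLOAD = [("get_order", 60), ("get_account", 30), ("insert_order", 10)]


def pick_transaction(rng, workload):
    """Choose a transaction name with probability proportional to its weight."""
    names = [name for name, _ in workload]
    weights = [weight for _, weight in workload]
    return rng.choices(names, weights=weights, k=1)[0]


def run_mix(adapter, workload, n, seed=0):
    """Send `n` transactions through `adapter`; return how often each ran."""
    rng = random.Random(seed)
    for _ in range(n):
        name = pick_transaction(rng, workload)
        adapter.execute(name, {"customer_id": rng.randrange(1, 1001)})
    return Counter(name for name, _ in adapter.calls)


if __name__ == "__main__":
    print(run_mix(RecordingAdapter(), WORKLOAD, n=1000))
```

    Porting to another database would then mean writing one more subclass of the adapter, and changing the mix would mean editing only the weight table.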

    Agustin: At Intel we have been using TPoX internally for various other projects. Since the TPoX workload driver is open source, it is straightforward to modify it to support other types of workloads, not necessarily steady state, which makes it amenable to testing other aspects of computer systems such as power management, storage, and so on.

    Q15. What are the current limitations of TPoX?

    Matthias: Out of the box, the TPoX workload driver only works with databases that offer a JDBC interface. If a particular database system has specific requirements for its API or query syntax, then some modifications may be necessary. Some database systems might require their own JDBC driver to be compiled into the workload driver.

    Q16. Who else is using TPoX?

    Matthias: You can see some examples of other TPoX usage on the TPoX web site. We know that other database vendors are using TPoX internally, even if they haven’t decided to publish results yet. I also know a company in the data security space that uses TPoX to evaluate the performance of different data encryption algorithms. TPoX also continues to be used at various universities in Europe, the US, and Asia for a variety of research and student projects. For example, the University of Kaiserslautern in Germany has used TPoX to evaluate the benefit of solid-state disks for XML databases. Other universities have used TPoX to evaluate and compare the performance of several XML-only databases.

    Q17. TPoX is an open source project. How can the community contribute?

    Matthias: A good starting point is to use TPoX. From there, contributing to the TPoX project is easy. For example, you can report problems and bugs, or you can submit new feature requests. Even better, you can implement bug fixes and enhancements yourself and submit them to the SVN code repository on sourceforge.net.
    If you design other workloads for the TPoX data set, you can upload them to the TPoX project site and have your results posted on the TPoX web site.

    Agustin: As is customary for an open source project on sourceforge, anybody can download all TPoX files and source code freely.
    If you want to upload any changed or new files or modify the TPoX web page, you only need to become a member of the TPoX sourceforge project, which is quick and easy.
    Everybody is welcome, without exceptions.


    TPoX software:

    XML Database Benchmark: “Transaction Processing over XML (TPoX)”:
    TPoX is an application-level XML database benchmark based on a financial application scenario. It is used to evaluate the performance of XML database systems, focusing on XQuery, SQL/XML, XML Storage, XML Indexing, XML Schema support, XML updates, logging, concurrency and other database aspects.
    Download TPoX (LINK), July 2009. | TPoX Results (LINK), April 2011.


    “Taming a Terabyte of XML Data”.
    Agustin Gonzalez, Matthias Nicola, IBM Silicon Valley Lab.
    Paper | Advanced | English | LINK DOWNLOAD (PDF) | 2009

    “An XML Transaction Processing Benchmark”.
    Matthias Nicola, Irina Kogan, Berni Schiefer, IBM Silicon Valley Lab.
    Paper | Advanced | English | LINK DOWNLOAD (PDF) | 2007

    “A Performance Comparison of DB2 9 pureXML with CLOB and Shredded XML Storage”.
    Matthias Nicola et al., IBM Silicon Valley Lab.
    Paper | Advanced | English | LINK DOWNLOAD (PDF) | 2006

    Related Posts

    Measuring the scalability of SQL and NoSQL systems.

    Benchmarking ORM tools and Object Databases.
