"Trends and Information on Big Data, Data Science, New Data Management Technologies, and Innovation."

This is the Industry Watch blog. The complete ODBMS.org website offers useful articles, downloads and industry information.

Feb 5 15

Polyglot approach to storing data. Interview with John Allison

by Roberto V. Zicari

“We were looking for solutions which provided the data integrity guarantees we needed, provided clustering tools to ease operational complexity, and were able to handle our data size and the read/write throughput we required.”–John Allison

I have interviewed John Allison, CTO and founder of Customer.io, a startup company in Portland, Oregon.

RVZ

Q1. What is the business of Customer.io?

John Allison: We help our customers send timely, targeted messages based on user activity on their website or mobile app. We achieve this by collecting analytical data, providing real-time segmentation, and allowing our customers to define rules to trigger messages at different points in their interactions with a user.

Q2. How large are the data sets you analyze?

John Allison: We’ve collected 6 terabytes of analytical event data for over 55 million unique users across our platform. Due to its nature, this data continues to grow, and it grows faster as we collect data for more and more users.

Q3. What are the main business and technical challenges you are currently facing?

John Allison: As we continue to grow our business, we need to ensure the technical side of our service can easily scale out to support new customers who want to use our product.

Q4. Why did you replace the existing underlying database architecture supporting your “MVP” product? What were the main technical problems you encountered?

John Allison: As our data set grew in size to the point where we couldn’t realistically manage it all on a small number of servers, we began looking for alternatives which would allow us to continue providing our service in a larger, more distributed way.

Q5. How did you evaluate the alternatives?

John Allison: We evaluated many options and found that most didn’t live up to the availability or consistency guarantees they promised when run over a cluster of servers. We were looking for solutions which provided the data integrity guarantees we needed, provided clustering tools to ease operational complexity, and were able to handle our data size and the read/write throughput we required.

Q6. What does the new solution look like?

John Allison: We’ve taken more of a polyglot approach to storing our data. We are consolidating on three main clustered databases:

1) FoundationDB – Data where distributed transactions and consistency guarantees are most important.
2) Riak – Large amounts of immutable data where availability is more important.
3) ElasticSearch – Indexing data for ad-hoc querying.

All three have built in tools for expanding and administrating a cluster, provide fault-tolerance and increased reliability in the face of server faults, and each provides us with unique ways to access our data.

Q7. What has your experience been with this new database architecture so far? Do you have any measurable results you can share with us?

John Allison: Embracing a distributed architecture and storing data in the right database for a given use-case has led to less time worrying about operations, increased reliability of our service as a whole, and the ability to scale out all parts of our infrastructure to increase our platform’s capacity.

Q8. Moving forward, what are your plans for the next implementation of your product?

John Allison: Continuing to improve our product in order to provide the most value we can for our customers.

——————————
John Allison is the CTO and founder of Customer.io, a startup focused on making it easy to build, manage, and measure automatic customer retention emails. Prior to that he was the head of engineering at Challengepost.com. He is a world traveler, Golfer, and an Arkansas Razorback fan.

Resources
We have published several new expert articles on Big Data and Analytics in ODBMS.org.

Related Posts

On Mobile Data Management. Interview with Bob Wiederhold. ODBMS Industry Watch, 2014-11-18.

Big Data Management at American Express. Interview with Sastry Durvasula and Kevin Murray. ODBMS Industry Watch, 2014-10-12

Follow ODBMS.org on Twitter: @odbmsorg

Jan 14 15

On Data Curation. Interview with Andy Palmer

by Roberto V. Zicari

“We propose more data transparency, not less.”–Andy Palmer

I have interviewed Andy Palmer, a serial entrepreneur, who co-founded Tamr, with database scientist and MIT professor Michael Stonebraker.

Happy and Peaceful 2015!

RVZ

Q1. What is the business proposition of Tamr?

Andy Palmer: Tamr provides a data unification platform that reduces by as much as 90% the time and effort of connecting and enriching multiple data sources to achieve a unified view of silo-ed enterprise data. Using Tamr, organizations are able to complete data unification projects in days or weeks versus months or quarters, dramatically accelerating time to analytics.
This capability is particularly valuable to businesses as they can get a 360-degree view of the customer, unify their supply chain data for reducing costs or risk, e.g. parts catalogs and supplier lists, and speed up conversion of clinical trial data for submission to the FDA.

Q2. What are the main technological and business challenges in producing a single, unified view across various enterprise ERPs, Databases, Data Warehouses, back-office systems, and most recently sensor and social media data in the enterprise?

Andy Palmer: Technological challenges include:
– Silo-ed data, stored in varying formats and standards
– Disparate systems, instrumented but expensive to consolidate and difficult to synchronize
– Inability to use knowledge from data owners/experts in a programmatic way
– Top-down, rules-based approaches not able to handle the extreme variety of data typically found, for example, in large PLM and ERP systems.

Business challenges include:
– Globalization, where similar or duplicate data may exist in different places in multiple divisions
– M&As, which can increase the volume, variety and duplication of enterprise data sources overnight
– No complete view of enterprise data assets
– “Analysis paralysis,” the inability of business people to access the data they want/need because IT people are in the critical path of preparing it for analysis

Tamr can connect and enrich data from internal and external sources, from structured data in relational databases, data warehouses, back-office systems and ERP/PLM systems to semi- or unstructured data from sensors and social media networks.

Q3. How do you manage to integrate various part and supplier data sources to produce a unified view of vendors across the enterprise?

Andy Palmer: Patent-pending technology using machine learning algorithms performs most of the work, unifying up to 90% of supplier, part and site entities by:

– Referencing each transaction and record across many data sources

– Building correct supplier names, addresses, ID’s, etc. for a variety of analytics

– Cataloging into an organized inventory of sources, entities, and attributes

When human intervention is necessary, Tamr generates questions for data experts, aggregates responses, and feeds them back into the system. This feedback enables Tamr to continuously improve its accuracy and speed.

Q4. Who should be using Tamr?

Andy Palmer: Organizations whose business and profitability depend on being able to do analysis on a unified set of data, and ask questions of that data, should be using Tamr.

Examples include:
– a manufacturer that wants to optimize spend across supply chains, but lacks a unified view of parts and suppliers.

– a biopharmaceutical company that needs to achieve a unified view of diverse clinical trials data to convert it to mandated CDISC standards for ongoing submissions to the FDA – but lacks an automated and repeatable way to do this.

– a financial services company that wants to achieve a unified view of its customers – but lacks an efficient, repeatable way to unify customer data across multiple systems, applications, and its consumer banking, loans, wealth management and credit card businesses.

– the research arm of a pharmaceutical company that wants to unify data on bioassay experiments across 8,000 research scientists, to achieve economies, avoid duplication of effort and enable better collaboration

Q5. “Data transparency” is not always welcome in the enterprise, mainly for non-technical reasons. What do you suggest doing to encourage people in the enterprise to share their data?

Andy Palmer: We propose more data transparency not less.
This is because in most companies, people don’t even know what data sources are available to them, let alone have insight into them or use of them. With Tamr, companies can create a catalog of all their enterprise data sources; they can then choose how transparent to make those individual data sources, by showing meta data about each. Then, they can control usage of the data sources using the enterprise’s access management and security policies/systems.
On the business side, we have found that people in enterprises typically want an easier way to share the data sources they have built or nurtured ─ a way that gets them out of the critical path.
Tamr makes people’s data usable by many others and for many purposes, while eliminating the busywork involved.

Q6. What is Data Curation and why is it important for Big Data?

Andy Palmer: Data Curation is the process of creating a unified view of your data with the standards of quality, completeness, and focus that you define. A typical curation process consists of:

– Identifying data sets of interest (whether from inside the enterprise or outside),

– Exploring the data (to form an initial understanding),

– Cleaning the incoming data (for example, 99999 is not a valid ZIP code),

– Transforming the data (for example, to remove phone number formatting),

– Unifying it with other data of interest (into a composite whole), and

– Deduplicating the resulting composite.
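
The cleaning, transforming and deduplicating steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Tamr's method: the record fields (`name`, `zip`, `phone`), the helper names, and the rule that two records are duplicates when their lower-cased name and digits-only phone number match are all hypothetical.

```python
import re


def clean_zip(zip_code):
    """Return the ZIP code if it looks valid, otherwise None."""
    # Assumption: a valid ZIP is exactly five digits, and 99999 is a
    # placeholder value (the example given in the interview).
    if re.fullmatch(r"\d{5}", zip_code) and zip_code != "99999":
        return zip_code
    return None


def normalize_phone(phone):
    """Remove phone number formatting so that only digits remain."""
    return re.sub(r"\D", "", phone)


def curate(records):
    """Clean, transform and deduplicate a list of record dicts."""
    seen = set()
    curated = []
    for rec in records:
        row = {
            "name": rec["name"].strip(),
            "zip": clean_zip(rec["zip"]),
            "phone": normalize_phone(rec["phone"]),
        }
        # Hypothetical duplicate rule: same name (case-insensitive)
        # and same digits-only phone number.
        key = (row["name"].lower(), row["phone"])
        if key not in seen:
            seen.add(key)
            curated.append(row)
    return curated


# Two made-up sources describing the same supplier in different formats.
source_a = [{"name": "Acme Corp", "zip": "02139", "phone": "(617) 555-0100"}]
source_b = [{"name": "acme corp ", "zip": "99999", "phone": "617.555.0100"}]

# "Unifying" is reduced here to concatenating the two sources.
result = curate(source_a + source_b)
```

An exact-match rule like this is the kind of top-down approach the interview describes as insufficient at scale; it is shown only to make the individual steps concrete.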

Data Curation is important for Big Data because people want to mix and match from all the data available to them ─ external and internal ─ for analytics and downstream applications that give them competitive advantage. Tamr is important because traditional, rule-based approaches to data curation are not sufficient to solve the problem of broad integration.

Q7. What does it mean to do “fuzzy” matches between different data sources?

Andy Palmer: Tamr can make educated guesses that two similar fields refer to the same entity even though the fields describe it differently: for example, Tamr can tell that “IBM” and “International Business Machines” refer to the same company.
In Supply Chain data unification, fuzzy matching is extremely helpful in speeding up entity and attribute resolution between parts, suppliers and customers.
Tamr’s secret sauce: Connecting hundreds or thousands of sources through a bottom-up, probabilistic solution reminiscent of Google’s approach to web search and connection.
Tamr’s upside: it becomes the Google of Enterprise Data, using probabilistic data source connection and curation to revolutionize enterprise data analysis.
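
To make the idea of a fuzzy match concrete, here is a minimal Python sketch. It is an assumption-laden illustration, not Tamr's probabilistic approach: it combines a simple acronym check with the string similarity ratio from the standard library's `difflib`, and the function names and the 0.85 threshold are hypothetical choices.

```python
from difflib import SequenceMatcher


def acronym(name):
    """Build an acronym from the first letter of each word."""
    return "".join(word[0] for word in name.split()).upper()


def same_entity(a, b, threshold=0.85):
    """Guess whether two names refer to the same entity."""
    a_norm, b_norm = a.strip().lower(), b.strip().lower()
    if a_norm == b_norm:
        return True
    # One name may be the acronym of the other, e.g. "IBM".
    if a.strip().upper() == acronym(b) or b.strip().upper() == acronym(a):
        return True
    # Otherwise fall back to character-level similarity.
    # The threshold is an arbitrary assumption for this sketch.
    return SequenceMatcher(None, a_norm, b_norm).ratio() >= threshold


match_acronym = same_entity("IBM", "International Business Machines")
match_typo = same_entity("Acme Corp", "Acme Corp.")
match_other = same_entity("IBM", "Oracle")
```

A production system would learn such matching rules and their weights from data and expert feedback rather than hard-coding them.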

Q8. What is data unification and how effective is it to use Machine Learning for this?

Andy Palmer: Data Unification is part of the curation process, during which related data sources are connected to provide a unified view of a given entity and its associated attributes. Tamr’s application of machine learning is very effective: it can get you 90% of the way to data unification in many cases, then involve human experts strategically to guide unification the rest of the way.

Q9. How do you leverage the knowledge of existing business experts for guiding/modifying the machine learning process?

Andy Palmer: Patent-pending technology using machine learning algorithms performs most of the data integration work. When human intervention is necessary, Tamr generates questions for data experts, sends them simple yes-no questions, aggregates their responses, and feeds them back into the system. This feedback enables Tamr to continuously improve its accuracy and speed.
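
The aggregation of yes-no answers can be pictured with a small Python sketch. The interview does not say how Tamr aggregates responses, so the agreement rule below (accept the majority answer only when at least 70% of experts agree) and the function name are purely assumptions for illustration.

```python
from collections import Counter


def aggregate_answers(answers, min_agreement=0.7):
    """Combine yes/no answers (True/False) from several experts.

    Returns the majority answer when agreement is strong enough,
    or None when the experts are too divided to decide.
    """
    if not answers:
        return None
    winner, votes = Counter(answers).most_common(1)[0]
    # Assumption: require a clear majority before trusting the answer.
    if votes / len(answers) >= min_agreement:
        return winner
    return None


# "Do these two records describe the same supplier?"
decided = aggregate_answers([True, True, True, False])
undecided = aggregate_answers([True, False])
```

Answers that come back as `None` could be routed to more experts, while decided answers would be fed back as training examples.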

Q10. With Tamr you claim that less human involvement is required as the system “learns.” What are, in your opinion, the challenges and possible dangers of such an “automated” decision-making process if not properly used or understood? Isn’t there a danger of replacing the experts with intelligent machines?

Andy Palmer: We aren’t replacing human experts at all: we are bringing them into the decision-making process in a high-value, programmatic way. And there are data stewards and provenance and governance procedures in place that control how this is done. For example: in one of our pharma customers, we’re actually bringing the research scientists who created the data into the decision-making process, capturing their wisdom in Tamr. Before, they were never asked: some guy in IT was trying to guess what each scientist meant when he created his data. Or the scientists were asked via email, which, due to the nature of the biopharmaceutical industry, required printing out the emails for audit purposes.

Q11. How do you quantify the cost savings using Tamr?

Andy Palmer: The biggest savings aren’t from the savings in data curation (although these are significant), but the opportunities for savings uncovered through analysis of unified data ─ opportunities that wouldn’t otherwise have been discovered. For example, by being able to create and update a ‘golden record’ of suppliers across different countries and business groups, Tamr can provide a more comprehensive view of supplier spend.
You can use this view to identify long-tail opportunities for savings across many smaller suppliers, instead of the few large vendors visible to you without Tamr.
In the aggregate, these long-tail opportunities can easily account for 85% of total spend savings.

Q12. Could you give us some examples of use cases where Tamr is making a significant difference?

Andy Palmer: Supply Chain Management, for streamlining spend analytics and spend management. Unified views of supplier and parts data enable optimization of supplier payment terms, identification of “long-tail” savings opportunities in small or outlier suppliers that were not easily identifiable before.

Clinical Trials Management, for automated conversion of multi-source/multi-standard CDISC data (typically stored in SAS databases) to meet submission standards mandated by regulators.
Tamr eliminates manual methods, which are usually conducted by expensive outside consultants and can result in additional, inflexible data stored in proprietary formats; and provides a scalable, repeatable process for data conversion (IND/NDA programs necessitate frequent resubmission of data).

Sales and Marketing, for achieving a unified view of the customer.
Tamr enables the business to connect and unify customer data across multiple applications, systems and business units, to improve segmentation/targeting and ultimately sell more products and services.

——————–

Andy Palmer, Co-Founder and CEO, Tamr Inc.

Andy Palmer is co-founder and CEO of Tamr, Inc. Palmer co-founded Tamr with fellow entrepreneur Michael Stonebraker, PhD. Previously, Palmer was co-founder and founding CEO of Vertica Systems, a pioneering big data analytics company (acquired by HP). During his career as an entrepreneur, Palmer has served as founder, founding investor, BOD member or advisor to more than 50 start-up companies. He also served as Global Head of Software Engineering and Architecture at Novartis Institutes for BioMedical Research (NIBR) and as a member of the start-up team and Senior Vice President of Operations and CIO at Infinity Pharmaceuticals (NASDAQ: INFI). He earned undergraduate degrees in English, history and computer science from Bowdoin College, and an MBA from the Tuck School of Business at Dartmouth.
————————–
-Resources

Data Science is mainly a Human Science. ODBMS.org, October 7, 2014

Big Data Can Drive Big Opportunities, by Mike Cavaretta, Data Scientist and Manager at Ford Motor Company. ODBMS.org, October 2014.

Big Data: A Data-Driven Society? by Roberto V. Zicari, Goethe University, Stanford EE Computer Systems Colloquium, October 29, 2014

-Related Posts

On Big Data Analytics. Interview with Anthony Bak. ODBMS Industry Watch, December 7, 2014

Predictive Analytics in Healthcare. Interview with Steve Nathan. ODBMS Industry Watch, August 26, 2014

-Webinar
January 27th at 1PM
Webinar: Toward Automated, Scalable CDISC Conversion
John Keilty, Third Rock Ventures | Timothy Danford, Tamr, Inc.

During a one-hour webinar, join John Keilty, former VP of Informatics at Infinity Pharmaceuticals, and Timothy Danford, CDISC Solution Lead for Tamr, as they discuss some of the key challenges in preparing clinical trial data for submission to the FDA, and the problems associated with current preparation processes.

Follow ODBMS.org on Twitter: @odbmsorg

Jan 6 15

On Solr and Mahout. Interview with Grant Ingersoll

by Roberto V. Zicari

“When does it get practical for most people, not just the Google’s and the Facebook’s of the world? I’ve seen some cool usages of big data over the years, but I also see a lot of people with a solution looking for a problem.”–Grant Ingersoll.

I have interviewed Grant Ingersoll, CTO and co-founder of LucidWorks. Grant is an active member of the Lucene community, and co-founder of the Apache Mahout machine learning project.

I wish you a Happy and a Peaceful 2015!

RVZ

Q1. Why LucidWorks Search? What kind of value-add capabilities does it provide with respect to the Apache Lucene/Solr open source search?

Grant Ingersoll: I like to think of LucidWorks Search (LWS) as Solr++, that is, we give you all of the goodness of Solr and then some more. Our primary focus in building LWS is in 4 key areas:

1. IT integration — Make it easy to consume Solr within an IT organization via things like monitoring, APIs, installation and so on.
2. Enterprise readiness — Large enterprises have 1 of everything and they all have a multitude of security requirements, so we focus on making it easier to operate in these environments via things like connectors for data acquisition, security and the like
3. Tools for Subject Matter Experts — These are aimed at technical non developers like Business Analysts, Merchandisers, etc. who are responsible for understanding who asked for what, when and why. These tools are primarily aimed at understanding relevancy of search results and then taking action based on business needs.
4. Deliver a supported version of the open source so that companies can reliably deploy it knowing they have us to back them up.

Q2. At LucidWorks you have integrated Apache open source projects to deliver a Big Data application development and deployment platform. What does the emerging big data stack look like?

Grant Ingersoll: We use capabilities from the Hadoop ecosystem for a number of activities that we routinely see customers struggling with when they try to better understand their data. In many cases, this boils down to large scale log analysis to power things like recommendation systems or Mahout for machine learning, but it also can be more subtle like doing large scale content extraction from Office documents or natural language processing approaches for identifying interesting phrases. We also rely on Zookeeper quite heavily to make sure that our cluster stays in a happy state and doesn’t suffer from split brain issues and cause failures.

Q3. How does it differ from other Big Data Hadoop-based distributions such as Cloudera, Hortonworks, and Greenplum Pivotal HD?

Grant Ingersoll: I can’t speak to their integrations in great detail, but we integrate with all of them (as well as partner with most of them), so I guess you would say we try to work at a layer above the core Hadoop infrastructure and focus on how the Hadoop ecosystem can solve specific problems as opposed to being a general purpose tool. For instance, we ship with a number of out of the box workflows designed to solve common problems in search like click-through log analysis and whole collection document clustering so you don’t have to write them yourself.

Q4. How does it work to build a framework for big data with open source technologies that are “pre-integrated”?

Grant Ingersoll: Well, you quickly realize what a version soup there is out there, trying to support all the different “flavors” of Hadoop. Other than that, it is a lot of fun to leverage the technologies to solve real problems that help people better understand their data. Naturally, there are challenges in making sure all the processes work together at scale, so a lot of effort goes into those areas.

Q5. What happens when big data plus search meets the cloud?

Grant Ingersoll: You get cost-effective access and insight into your data instead of a big science experiment. In many ways, the benefits are the same as search and ranking in on-prem situations plus the added benefits the cloud brings you in terms of costs, scaling and flexibility. Of course, the well-documented challenge in the cloud is how to get your data there. So, for users who already have their data in the cloud, it’s an especially easy win; for those who don’t, we provide connectors that help.

Q6. Solr Query includes simple join capability between two document types. How do such queries scale with Big Data?

Grant Ingersoll: Solr scales quite well (billions of documents and very large query volumes).
In fact, we’ve seen it routinely scale linearly to quite large cluster sizes.

As with databases, joins require you to pay attention to how you do the join or whether there are better ways of asking your question, but I have seen them used quite successfully in the appropriate situation. At the end of the day, I try to remain pragmatic and use the appropriate tool for the job. A search engine can handle some types of joins, but that doesn’t always mean you should do it in a search engine. I like to think of a search engine as a very fast ranking engine. If the problem requires me to rank something, then search engine technology is going to be hard to beat. If you need it to do all different kinds of joins across a large number of document types or constant large table scans, it may be appropriate to do in a search engine and it may not. It’s a classic “it depends” situation. That being said, over the past few years, these kinds of problems have become much more efficient to do in a search engine thanks to a multitude of improvements the community has made to Lucene and Solr.

Q7. The Apache Mahout Machine Learning Project’s goal is to build scalable machine learning libraries. What is the current status of the project?

Grant Ingersoll: We released 0.9 and are working towards a 1.0. The main focus lately has been on preparing for a 1.0 release by culling old, unused code and tightly focusing on a core set of algorithms which are tried and true that we want to support going forward.

Q8. What kind of algorithms is Apache Mahout currently supporting?

Grant Ingersoll: I tend to think of Mahout as being focused on the three “C’s”: clustering, classification and collaborative filtering (recommenders). These algorithms help people better understand and organize their data. Mahout also has various other algorithms like singular value decomposition, collocations and a bunch of libraries for Java primitives.

Q9. How does Mahout rely on the Apache Hadoop framework?

Grant Ingersoll: Many of the algorithms are written for Hadoop specifically, but not all. We try to be prudent about where it makes sense to use Hadoop and where it doesn’t, as not all machine learning algorithms are best suited for Map-Reduce style programming. We are also looking at how to leverage other frameworks like Spark or custom distributed code.

Q10. Who is using Apache Mahout and for what?

Grant Ingersoll: It really spans a lot of interesting companies, ranging from those using it to power recommendations to others classifying users to show them ads. At LucidWorks, we use Mahout for identifying statistically interesting phrases, clustering and classification of users’ query intent and more.

Q11. How scalable is Apache Mahout? What are the limits?

Grant Ingersoll: That will depend on the algorithm. I haven’t personally run an exhaustive benchmark, but I’ve seen many of the clustering and classification algorithms scale linearly.

Q12. How do you take into account user feedback when performing Recommendation mining with Apache Mahout?

Grant Ingersoll: Mahout’s recommenders are primarily of the “collaborative filtering” type, where user feedback equates to a vote for a particular item. All of those votes are, to simplify things a bit, added up to produce a recommendation for the user. Mahout supports a number of different ways of calculating those recommendations, since it is a library for producing recommendations and not just a one size fits all product.
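
The “votes added up” idea can be sketched in plain Python. This is not Mahout's API (Mahout is a Java library with several configurable similarity measures); it is a simplified, assumed form of user-based collaborative filtering in which each other user's votes are weighted by how many liked items they share with the target user. All names and data are hypothetical.

```python
from collections import defaultdict


def recommend(votes, user, top_n=3):
    """Recommend items for `user` from a dict of user -> set of liked items."""
    liked = votes[user]
    scores = defaultdict(int)
    for other, items in votes.items():
        if other == user:
            continue
        # Similarity weight: number of items both users voted for.
        overlap = len(liked & items)
        if overlap == 0:
            continue
        # Each unseen item collects the weighted vote of this user.
        for item in items - liked:
            scores[item] += overlap
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


votes = {
    "ann": {"a", "b"},
    "bob": {"a", "b", "c"},
    "cat": {"b", "d"},
    "dan": {"e"},
}

suggestions = recommend(votes, "ann")
```

Here "c" outranks "d" because the user who liked it shares two items with "ann" rather than one, and "e" is never suggested because its only voter has nothing in common with her.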

Q13. Looking at three elements: Data, Platform, Analysis, what are the main challenges ahead?

Grant Ingersoll: I’d add a fourth element: the user. Lots of interesting challenges here:

When do we get past the hype cycle of big data and into the nitty gritty of making it real? That is, when does it get practical for most people, not just the Google’s and the Facebook’s of the world? I’ve seen some cool usages of big data over the years, but I also see a lot of people with a solution looking for a problem.

How do we leverage the data, the platform and the analysis to make us smarter/better off instead of just better marketing targets? How do we use these tools to personalize without offending or destroying privacy?

How do we continue to meet scale requirements without breaking the bank on hardware purchases, etc?

Qx. Anything you wish to add?

Grant Ingersoll: Thanks for the great questions!

-Grant

————–
Grant Ingersoll, CTO and co-founder of LucidWorks, is an active member of the Lucene community – a Lucene and Solr committer, co-founder of the Apache Mahout machine learning project and a long-standing member of the Apache Software Foundation. He is co-author of “Taming Text” from Manning Publications, and his experience includes work at the Center for Natural Language Processing at Syracuse University in natural language processing and information retrieval.
Ingersoll has a Bachelor of Science degree in Math and Computer Science from Amherst College and a Master of Science degree in Computer Science from Syracuse University.

Resources

Taming Text How to Find, Organize, and Manipulate It
Grant S. Ingersoll, Thomas S. Morton, and Andrew L. Farris
Softbound print: September 2012 (est.) | 350 pages, Manning, ISBN: 193398838X

Related Posts

AsterixDB: Better than Hadoop? Interview with Mike Carey. ODBMS Industry Watch, October 22, 2014

Hadoop at Yahoo. Interview with Mithun Radhakrishnan. ODBMS Industry Watch, September 21, 2014

Follow ODBMS.org on Twitter: @odbmsorg

Dec 7 14

On Big Data Analytics. Interview with Anthony Bak

by Roberto V. Zicari

“The biggest challenge facing data analytics is how to turn complex data into actionable information. One way to think about complexity is that there are many stories happening simultaneously in the data – some relevant to the problem being solved but most irrelevant. The goal of Big Data Analytics is to find the relevant story, reducing complexity to actionable information.”–Anthony Bak

On Big Data Analytics, I have interviewed Anthony Bak, Data Scientist and Mathematician at Ayasdi.

RVZ

Q1. What are the most important challenges for Big Data Analytics?

Anthony Bak: The biggest challenge facing data analytics is how to turn complex data into actionable information. One way to think about complexity is that there are many stories happening simultaneously in the data – some relevant to the problem being solved but most irrelevant. The goal of Big Data Analytics is to find the relevant story, reducing complexity to actionable information. How do we sort through all the stories in an efficient manner?

Historically, organizations extracted value from data by building data infrastructure and employing large teams of highly trained Data Scientists who spend months, and sometimes years, asking questions of data to find breakthrough insights. The probability of discovering these insights is low because there are too many questions to ask and not enough data scientists to ask them.

Ayasdi’s platform uses Topological Data Analysis (TDA) to automatically find the relevant stories in complex data and operationalize them to solve difficult and expensive problems. We combine machine learning and statistics with topology, allowing for ground-breaking automation of the discovery process.

Q2. How can you “measure” the value you extract from Big Data in practice?

Anthony Bak: We work closely with our clients to find valuable problems to solve. Before we tackle a problem we quantify both its value to the customer and the outcome delivering that value.

Q3. You use so-called Topological Data Analysis. What is it?

Anthony Bak: Topology is the branch of pure mathematics that studies the notion of shape.
We use topology as a framework combining statistics and machine learning to form geometric summaries of Big Data spaces. These summaries allow us to understand the important and relevant features of the data. We like to say that “Data has shape and shape has meaning”. Our goal is to extract shapes from the data and then understand their meaning.

While there is no complete taxonomy of all geometric features and their meaning, there are a few simple patterns that we see in many data sets: clusters, flares and loops.

Clusters are the most basic property of shape a data set can have. They represent natural segmentations of the data into distinct pieces, groups or classes. An example might find two clusters of doctors committing insurance fraud.
Having two groups suggests that there may be two types of fraud represented in the data. From the shape we extract meaning or insight about the problem.

That said, many problems don’t naturally split into clusters and we have to use other geometric features of the data to get insight. We often see that there’s a core of data points that are all very similar representing “normal” behavior and coming off of the core we see flares of points. Flares represent ways and degrees of deviation from the norm.
An example might be gene expression levels for cancer patients where people in various flares have different survival rates.

Loops can represent periodic behavior in the data set. An example might be patient disease profiles (clinical and genetic information) where they go from being healthy, through various stages of illness and then finally back to healthy.
The loop in the data is formed not by a single patient but by sampling many patients in various stages of disease. Understanding and characterizing the disease path potentially allows doctors to give better more targeted treatment.

Finally, a given data set can exhibit all of these geometric features simultaneously as well as more complicated ones that we haven’t described here. Topological Data Analysis is the systematic discovery of geometric features.
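
Of the three patterns, clusters are the simplest to illustrate in code. The Python sketch below is not the Mapper algorithm and not Ayasdi's implementation; it assumes a much simpler rule, linking any two points that lie within a chosen radius and reporting the resulting connected components as clusters. The function name, the radius and the sample points are all hypothetical.

```python
from math import dist


def clusters(points, radius):
    """Group point indices into connected components.

    Two points are linked when their Euclidean distance is at most
    `radius`; a cluster is a set of points connected through such links.
    """
    unvisited = set(range(len(points)))
    groups = []
    while unvisited:
        start = unvisited.pop()
        group, frontier = [start], [start]
        while frontier:
            i = frontier.pop()
            # Collect unvisited neighbours of point i.
            near = [j for j in unvisited if dist(points[i], points[j]) <= radius]
            for j in near:
                unvisited.remove(j)
            group.extend(near)
            frontier.extend(near)
        groups.append(sorted(group))
    return sorted(groups)


# Two well-separated groups of made-up 2-D points.
points = [(0, 0), (0, 1), (1, 1), (10, 10), (10, 11)]
found = clusters(points, radius=1.5)
```

Flares and loops need more than connectivity to detect, which is part of why a fuller topological framework is used in practice.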

Q4. The core algorithm you use is called “Mapper“, developed at Stanford in the Computational Topology group by Gunnar Carlsson and Gurjeet Singh. How has your company, Ayasdi, turned this idea into a product?

Anthony Bak: Gunnar Carlsson, co-founder and Stanford University mathematics professor, is one of the leaders in a branch of mathematics called topology. While topology has been studied for the last 300 years, it’s in just the last 15 years that Gunnar has pioneered the application of topology to understand large and complex sets of data.

Between 2001 and 2005, DARPA and the National Science Foundation sponsored Gunnar’s research into what he called Topological Data Analysis (TDA). Tony Tether, the director of DARPA at the time, has said that TDA was one of the most important projects DARPA was involved in during his eight years at the agency.
Tony told the New York Times, “The discovery techniques of topological data analysis are going to have a huge impact, and Gunnar Carlsson is at the forefront of this research.”

That led to Gunnar teaming up with a group of others to develop a commercial product that could aid the efforts of life sciences, national security, oil and gas and financial services organizations. Today, Ayasdi already has customers in a broad range of industries, including at least 3 of the top global pharmaceutical companies, at least 3 of the top oil and gas companies and several agencies and departments inside the U.S. Government.

Q5. Do you have some use cases to share where Topological Data Analysis is implemented?

Anthony Bak: There is a well known, 11-year old data set representing a breast cancer research project conducted by the Netherlands Cancer Institute-Antoni van Leeuwenhoek Hospital. The research looked at 272 cancer patients covering 25,000 different genetic markers. Scientists around the world have analyzed this data over and over again. In essence, everyone believed that anything that could be discovered from this data had been discovered.

Within a matter of minutes, Ayasdi was able to identify new, previously undiscovered populations of breast cancer survivors. Ayasdi’s discovery was recently published in Nature.

Using connections and visualizations generated from the breast cancer study, oncologists can map their own patients’ data onto the existing data set to custom-tailor triage plans. In a separate study, Ayasdi helped discover previously unknown biomarkers for leukaemia.

You can find additional case studies here.

Q6. Query-Based Approach vs. Query-Free Approach: could you please elaborate on this and explain the trade off?

Anthony Bak: Since the creation of SQL in the 1970s, data analysts have tried to find insights by asking questions and writing queries. This approach has two fundamental flaws. First, all queries are based on human assumptions and bias. Second, query results only reveal slices of data and do not show relationships between similar groups of data. While this method can uncover clues about how to solve problems, it is a game of chance that usually results in weeks, months, and years of iterative guesswork.

Ayasdi’s insight is that the shape of the data – its flares, clusters, loops – tells you about natural segmentations, groupings and relationships in the data. This information forms the basis of a hypothesis to query and investigate further. The analytical process no longer starts with coming up with a hypothesis and then testing it; instead, we let the data, through its geometry, tell us where to look and what questions to ask.
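
As a rough illustration of how a Mapper-style construction turns the shape of a data set into a graph, here is a toy sketch in Python. It is not Ayasdi's implementation: the function names (`single_linkage`, `mapper_graph`), the choice of filter function and the parameters `n_intervals`, `overlap` and `eps` are all assumptions made for the example. The filter values are covered with overlapping intervals, the points falling in each interval are clustered, and two clusters are linked whenever they share a point.

```python
import math
from itertools import combinations


def single_linkage(points, members, eps):
    """Group the given point indices into connected components,
    joining two points when they are at most eps apart."""
    remaining = set(members)
    clusters = []
    while remaining:
        seed = remaining.pop()
        cluster, frontier = {seed}, [seed]
        while frontier:
            current = frontier.pop()
            near = {j for j in remaining
                    if math.dist(points[current], points[j]) <= eps}
            remaining -= near
            cluster |= near
            frontier.extend(near)
        clusters.append(frozenset(cluster))
    return clusters


def mapper_graph(points, filter_fn, n_intervals=4, overlap=0.25, eps=1.5):
    """Toy Mapper: nodes are clusters of points, edges join clusters
    that have at least one point in common."""
    values = [filter_fn(p) for p in points]
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_intervals
    pad = width * overlap  # how far each interval reaches into its neighbours
    nodes = []
    for i in range(n_intervals):
        a = lo + i * width - pad
        b = lo + (i + 1) * width + pad
        members = [j for j, v in enumerate(values) if a <= v <= b]
        nodes.extend(single_linkage(points, members, eps))
    edges = {(u, v) for u, v in combinations(range(len(nodes)), 2)
             if nodes[u] & nodes[v]}
    return nodes, edges


# Sixteen points sampled from a circle, filtered by their x coordinate.
circle = [(math.cos(2 * math.pi * k / 16), math.sin(2 * math.pi * k / 16))
          for k in range(16)]
nodes, edges = mapper_graph(circle, filter_fn=lambda p: p[0], eps=0.5)
```

With points sampled from a circle, the resulting graph should itself close up into a loop, which is the kind of feature the answer above describes. Different filters, covers and clustering methods give different summaries of the same data.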

Q7. Anything else you wish to add?

Anthony Bak: Topological data analysis represents a fundamentally new framework for thinking about, analyzing and solving complex data problems. While I have emphasized its geometric and topological properties, it’s important to point out that TDA does not replace existing statistical and machine learning methods.
Instead, it forms a framework that utilizes existing tools while gaining additional insight from the geometry.

I like to say that statistics and geometry form orthogonal toolsets for analyzing data; to get the best understanding of your data you need to leverage both. TDA is the framework for doing just that.

———————
Anthony Bak is currently a Data Scientist and mathematician at Ayasdi. Prior to Ayasdi, Anthony was at Stanford University where he worked with Ayasdi co-founder Gunnar Carlsson on new methods and applications of Topological Data Analysis. He did his Ph.D. work in algebraic geometry with applications to string theory.

Resources

Extracting insights from the shape of complex data using topology
P. Y. Lum, G. Singh, A. Lehman, T. Ishkanov, M. Vejdemo-Johansson, M. Alagappan, J. Carlsson & G. Carlsson
Nature, Scientific Reports 3, Article number: 1236 doi:10.1038/srep01236, 07 February 2013

Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition

Related Posts

Predictive Analytics in Healthcare. Interview with Steve Nathan, ODBMS Industry Watch, August 26, 2014

Follow ODBMS.org on Twitter: @odbmsorg

##

Nov 18 14

On Mobile Data Management. Interview with Bob Wiederhold

by Roberto V. Zicari

“We see mobile rapidly emerging as a core requirement for data management. Any vendor who is serious about being a leader in the next generation database market, has to have a mobile strategy.”
–Bob Wiederhold.

I have interviewed Bob Wiederhold, President and Chief Executive Officer of Couchbase.

RVZ

Q1. On June 26, you announced a $60 Million series E round of financing. What are Couchbase’s chances of becoming a major player in the database market (and not only in the NoSQL market)? And what is your strategy for achieving this?

Bob Wiederhold: Enterprises are moving from early NoSQL validation projects to mission critical implementations.
As NoSQL deployments evolve to support the core business, requirements for performance at scale and completeness increase. Couchbase Server is the most complete offering on the market today, delivering the performance, scalability and reliability that enterprises require.
Additionally, we see mobile rapidly emerging as a core requirement for data management. Any vendor who is serious about being a leader in the next generation database market, has to have a mobile strategy.
At this point, we are the only NoSQL vendor offering an embedded mobile database and the sync needed to manage data between the cloud, the device and other devices. We believe that having the most complete, best performing operational NoSQL database along with a comprehensive mobile offering, uniquely positions us for leadership in the NoSQL market.

Q2. Why is Couchbase Lite so strategically important for you?

Bob Wiederhold: First, because the world is going mobile. That is indisputable. Mobile initiatives top the list of every IT department. As I said above, if you don’t have a mobile data management offering, you are not looking at the complete needs of the developer or the enterprise.
Second, let’s level set on Couchbase Lite. Couchbase Lite is our offering for an embedded mobile JSON database.
Our complete mobile offering, Couchbase Mobile, includes Couchbase Server – for data management in the cloud, and Sync Gateway for synchronization of data stored on the device with other devices, or the database in the cloud.
Today, because connectivity is unknown, data synchronization challenges force developers to choose either a fully online (data stored in the cloud) or a fully offline (data stored on the device) data management strategy.
This approach limits functionality: when the network is unavailable, online apps may freeze and not work at all. People want access to their applications (travel, expense reporting, multi-user collaboration, etc.) whether they’re online or not.
Couchbase Mobile is the only NoSQL offering available that allows developers to build JSON applications that work whether an application is online or off, and manages the synchronization of the data between those applications and the cloud, or other devices. This is revolutionary for the mobile world and we are seeing tremendous interest from the mobile developer community.
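
The online/offline behaviour described here can be pictured with a small, generic sketch. This is not the Couchbase Lite or Sync Gateway API: the class `OfflineFirstStore`, its methods and the "highest revision wins" rule are assumptions chosen to keep the example short. Writes always land in the local store first and are pushed to the remote side only when a connection is available.

```python
class OfflineFirstStore:
    """Hypothetical on-device JSON store with a deferred sync step."""

    def __init__(self):
        self.docs = {}      # doc_id -> (revision, body)
        self.pending = []   # doc_ids changed locally and not yet pushed

    def put(self, doc_id, body):
        """Write locally; this works whether or not the device is online."""
        revision = self.docs.get(doc_id, (0, None))[0] + 1
        self.docs[doc_id] = (revision, dict(body))
        if doc_id not in self.pending:
            self.pending.append(doc_id)

    def get(self, doc_id):
        entry = self.docs.get(doc_id)
        return entry[1] if entry else None

    def sync(self, remote, online):
        """Push pending changes, then pull newer remote documents.
        `remote` is a plain dict standing in for the cloud database.
        Returns the number of documents pushed."""
        if not online:
            return 0  # keep working locally, try again later
        pushed = 0
        for doc_id in self.pending:
            revision, body = self.docs[doc_id]
            # Assumed conflict rule: the higher revision number wins.
            if revision > remote.get(doc_id, (0, None))[0]:
                remote[doc_id] = (revision, dict(body))
                pushed += 1
        self.pending.clear()
        for doc_id, (revision, body) in remote.items():
            if revision > self.docs.get(doc_id, (0, None))[0]:
                self.docs[doc_id] = (revision, dict(body))
        return pushed
```

A real synchronization layer also has to deal with conflicting edits, authentication and partial failures, none of which this sketch attempts.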

Q3. What can enterprises do with a NoSQL mobile database that they would not be able to do with a non-mobile database?

Bob Wiederhold: Offline access and syncing has been too time and resource intensive for mobile app developers. With Couchbase Mobile, developers don’t have to spend months, or years, building a solution that can store unstructured data on the device and sync that data with external sources – whether that is the cloud or another device. With Couchbase Mobile, developers can easily create mobile applications that are not tied to connectivity or limited by sync considerations. This empowers developers to build an entirely new class of enterprise applications that go far beyond what is available today.

Q4. What kind of businesses and applications will benefit when people use NoSQL databases on their mobile devices? Can you give us some examples?

Bob Wiederhold: Nearly every business can benefit from the use of a complete mobile solution to build always available apps that work offline or online. One business example is our customer Infinite Campus.
Focused on educational transformation through the use of information technology, Infinite Campus is looking at Couchbase Lite as a solution that will enable students to complete their homework modules even when they don’t have access to a network outside of school. Instructional videos and homework assignments can be selectively pushed to students’ mobile devices when they are online at school.
Using Couchbase Lite, students can work online at school and then complete their homework assignments anywhere – on or offline. And the data seamlessly syncs across devices and between users, so teachers and students can participate in real-time Q&A chat sessions during lectures.

Q5. Do you have some customers who have gone into production with that?

Bob Wiederhold: The product is new, but we already have several customers that are live.
In addition to Microsoft, we have several companies around the world. You can check out one iOS app by Spraed, who is using Couchbase Server – running on AWS, Sync Gateway and Couchbase Lite.

Q6. Couchbase Server is a JSON document-based database. Why this design choice?

Bob Wiederhold: The world is changing. Businesses need to be agile and responsive.
Relational databases, with rigid schema design, don’t allow for fast change. JSON is the next generation architecture that businesses are increasingly using for mission critical applications because the technology allows them to manage and react to all aspects of big data (volume, variety and velocity of data, as well as big users) and to do that in a cloud-based landscape.

Q7. Do you have any plan to work with Cloud providers?

Bob Wiederhold: We already work with many cloud providers. We have a great relationship with Amazon Web Services and many of our customers, including WebMD and Viber, run on AWS.
We also have partnerships and customers running on Windows Azure, GoGrid, and others. More and more organizations are moving infrastructure to the cloud and we will continue expanding our eco system to give our customers the flexibility to choose the best deployment options for their businesses.

Q8. Do you see any convergence happening between operational data management and analytical data processing? And if yes, how?

Bob Wiederhold: Yes. Analytics can happen in real time, in near real time in operational stores, and in batch mode. We have several customers who are deploying and have deployed complete solutions to integrate operational big data with real time analytical processing. LivePerson has done some incredibly innovative work here. They have been very open about the work they are doing and you can hear them tell their story here.

Q9. Do you have any plan to integrate your system with platforms for use in big data analytics?

Bob Wiederhold: Absolutely, and we are integrated today into many platforms, including Hadoop via our Couchbase Hadoop connector, and have many customers using Couchbase Server with both realtime and batch mode analytics platforms. See Avira and LivePerson presentations for examples. We continue to work with big data ISVs to ensure our customers can easily integrate their systems with the analytics system of their choosing.

—————
Bob Wiederhold, President and Chief Executive Officer, Couchbase

Bob has more than 25 years of high technology experience. Until an acquisition by IBM in 2008, Bob served as chairman, CEO, and president of Transitive Corporation, the worldwide leader in cross-platform virtualization with over 20 million users. Previously, he was president and CEO of Tality Corporation, the worldwide leader in electronic design services, whose revenues and size grew to almost $200 million and had 1,500 worldwide employees.
Bob held several executive general management positions at Cadence Design Systems, Inc., an electronic design automation company, which he joined in 1985 as an early stage start-up and helped to grow to more than $1.5 billion during his 13 years at the company. Bob also headed High Level Design Systems, a successful electronic design automation start-up that was acquired by Cadence in 1996. Bob has extensive board experience having served on both public (Certicom, HLDS) and private company boards (Snaketech, Tality, Transitive, FanfareGroup).

Resources

Magic Quadrant for Operational Database Management Systems. 16 October 2014. Analyst(s): Donald Feinberg, Merv Adrian, Nick Heudecker, Gartner.

Related Posts

Using NoSQL at BMW. Interview with Jutta Bremm and Peter Palm. ODBMS Industry Watch, September 29, 2014

NoSQL for the Internet of Things. Interview with Mike Williams. ODBMS Industry Watch, June 5, 2014

##

Nov 2 14

On Hadoop RDBMS. Interview with Monte Zweben.

by Roberto V. Zicari

“HBase and Hadoop are the only technologies proven to scale to dozens of petabytes on commodity servers, currently being used by companies such as Facebook, Twitter, Adobe and Salesforce.com.”–Monte Zweben.

Is it possible to turn Hadoop into a RDBMS? On this topic, I have interviewed Monte Zweben, Co-Founder and Chief Executive Officer of Splice Machine.

RVZ

Q1. What are the main challenges of applications and operational analytics that support real-time, interactive queries on data updated in real-time for Big Data?

Monte Zweben: Let’s break down “real-time, interactive queries on data updated in real-time for Big Data”. “Real-time, interactive queries” means that results need to be returned in milliseconds to a few seconds.
For “Data updated in real-time” to happen, changes in data should be reflected in milliseconds. “Big Data” is often defined as dramatically increased volume, velocity, and variety of data. Of these three attributes, data volume typically dominates, because unlike the other attributes, its growth is virtually unbounded.

Traditional RDBMSs like MySQL or Oracle can support real-time, interactive queries on data updated in real-time, but they struggle on handling Big Data. They can only scale up on larger servers that can cost hundreds of thousands, if not millions of dollars per server.

Big Data technologies such as Hadoop can easily handle Big Data volumes with their ability to scale out on commodity hardware. However, with their batch analytics heritage, they often struggle to provide real-time, interactive queries. They also lack ACID transactions to support data updated in real time.

So, real-time applications and operational analytics had to choose between real-time interactive queries on data updated in real-time, or Big Data volumes. With Splice Machine, these applications can have the best of both worlds: real-time interactive queries, the reliability of real-time updates on ACID transactions, and the ability to handle Big Data volumes with a 10x price/performance improvement over traditional RDBMSs.

Q2. You suggested that companies should replace their traditional RDBMS systems. Why and when? Do you really think this is always possible? What about legacy systems?

Monte Zweben: Companies should consider replacing their traditional RDBMSs when they experience significant cost or scaling issues. Our informal surveys of customers indicate that up to half of traditional RDBMSs experience cost or scaling issues. The biggest barrier to migrating from a traditional RDBMS to a new database like Splice Machine is converting custom stored procedures (e.g., PL/SQL code). Operational analytics often have limited custom stored procedure code, so the migration process is generally straightforward.

Operational applications typically have thousands of lines of custom stored procedure code, but in extreme cases it can run into hundreds of thousands to millions of lines of code. There are actually commercially-supported tools that will convert from PL/SQL to the Java needed for Splice Machine. We have typically seen them convert 70-95% of the code accurately, but it will obviously depend on the complexity of the original code. Financially, migration makes sense for many companies to get an ongoing 10x price/performance improvement, but there are cases when it does not make sense because converting custom code is too expensive.

Q3. Is scale-out the solution to Big Data at scale? Why?

Monte Zweben: Scale-out is definitely the critical technology to making Big Data work at scale. Scale-out leverages inexpensive, commodity hardware to parallelize queries to easily achieve a 10x price/performance improvement over existing database technologies.

Q4. You have announced your real-time relational database management system. What is special about Splice Machine’s Hadoop RDBMS?

Monte Zweben: We are the only Hadoop RDBMS. There are obviously many RDBMSs, but we are the only one with scale-out technology from Hadoop. Hadoop is the only scale-out technology proven to scale to dozens of petabytes on commodity hardware at companies like Facebook. There are other SQL-on-Hadoop technologies, but none of them can support real-time ACID transactions.

Q5. Hadoop-connected SQL databases do not eliminate “silos”. How do you handle this?

Monte Zweben: We are not a database that has a connector to Hadoop. We are tightly integrated into Hadoop, using HBase and HDFS as our storage layer.

Q6. How did you manage to move Hadoop beyond its batch analytics heritage to power operational applications and real-time analytics?

Monte Zweben: At its core, Hadoop is a distributed file system (HDFS) where data cannot be updated or deleted. If you want to update or delete anything, you have to reload all the data (i.e., batch load). As a file system, it has very limited ability to seek specific data; instead, you use Java MapReduce programs to scan all of the data to find the data you need. It can easily take hours or even days for queries to return data (i.e., batch analytics). There is no way you could support a real-time application on top of HDFS and MapReduce.

By using HBase (a real-time key value store on top of HDFS), Splice Machine provides a full RDBMS on top of Hadoop.
You can now get real-time, interactive queries on real-time updated data on Hadoop, necessary to support operational applications and analytics.
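
The difference between scanning files and looking rows up by key can be sketched in a few lines of Python. This is only an analogy, not HBase or Splice Machine code: `full_scan` and `SortedKeyStore` are hypothetical names, and the in-memory sorted list stands in for a store that keeps rows ordered by key so that a single row can be found and updated without rereading everything.

```python
import bisect


def full_scan(rows, wanted_key):
    """Batch-style access: read rows one by one until the key turns up.
    Returns the value and the number of rows touched."""
    touched = 0
    for key, value in rows:
        touched += 1
        if key == wanted_key:
            return value, touched
    return None, touched


class SortedKeyStore:
    """Hypothetical key-value store that keeps rows sorted by key."""

    def __init__(self, rows):
        self.rows = sorted(rows, key=lambda row: row[0])
        self.keys = [key for key, _ in self.rows]

    def get(self, key):
        # Binary search: about log2(n) comparisons instead of n reads.
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.rows[i][1]
        return None

    def put(self, key, value):
        # Update or insert a single row in place; no bulk reload needed.
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.rows[i] = (key, value)
        else:
            self.keys.insert(i, key)
            self.rows.insert(i, (key, value))
```

On a thousand rows the scan may touch every row to find the last key, while the keyed lookup needs roughly ten comparisons; the gap widens as the data grows.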

Q7. How do you use Apache Derby™ and Apache HBase™/Hadoop?

Monte Zweben: Splice Machine marries two proven technology stacks: Apache Derby for ANSI SQL and HBase/Hadoop for proven scale out technology. With over 15 years of development, Apache Derby is a Java-based SQL database. Splice Machine chose Derby because it is a full-featured ANSI SQL database, lightweight (<3 MB), and easy to embed into the HBase/Hadoop stack.

HBase and Hadoop are the only technologies proven to scale to dozens of petabytes on commodity servers, currently being used by companies such as Facebook, Twitter, Adobe and Salesforce.com. Splice Machine chose HBase and Hadoop because of their proven auto-sharding, replication, and failover technology.

Q8. Why did you replace the storage engine in Apache Derby with HBase?

Monte Zweben: Apache Derby has a native shared-disk (i.e., non-distributed) storage layer. We replaced that storage layer with HBase to provide an auto-sharded, distributed computing storage layer.

Q9. Why did you redesign the planner, optimizer, and executor of Apache Derby?

Monte Zweben: We redesigned the planner, optimizer, and executor of Derby because Splice Machine has a distributed computing infrastructure instead of its old shared-disk storage. Distributed computing requires a functional re-architecting because computation must be distributed to where the data is, instead of moving the data to the computation.
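
A minimal sketch of what moving the computation to the data means in practice, under the assumption that each shard can run a small function locally and return only a partial result. The names `local_partial` and `distributed_average` are illustrative and not part of Splice Machine or Derby.

```python
def local_partial(shard_rows, predicate):
    """Runs where the shard lives: filter and aggregate locally,
    returning only a small (sum, count) pair instead of the rows."""
    matching = [amount for region, amount in shard_rows if predicate(region)]
    return sum(matching), len(matching)


def distributed_average(shards, predicate):
    """Coordinator: collect the partial results and merge them.
    In a real cluster the local_partial calls would run in parallel
    on the nodes holding each shard; here they run in a simple loop."""
    partials = [local_partial(rows, predicate) for rows in shards]
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


shards = [
    [("west", 10), ("east", 20)],
    [("west", 30)],
    [("east", 40)],
]
average_west = distributed_average(shards, lambda region: region == "west")
```

Only two numbers per shard cross the network, regardless of how many rows each shard holds. That is why the planner and executor have to be aware of where the data is stored.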

Q10. What are the main benefits for developers and database architects who build applications?

Monte Zweben: There are two main benefits to Splice Machine for developers and database architects. First, no longer is data scaling a barrier to using massive amounts of data in an application; you no longer need to prune data or rewrite applications to do unnatural acts like manual sharding. Second, you can enjoy the scaling with all the critical features of an RDBMS – strong consistency, joins, secondary indexes for fast lookups, and reliable updates with transactions. Without those features, developers have to implement those functions for each application, a costly, time-consuming, and error-prone process.

———————
Monte Zweben, Co-Founder and Chief Executive Officer, Splice Machine

A technology industry veteran, Monte’s early career was spent with the NASA Ames Research Center as the Deputy Branch Chief of the Artificial Intelligence Branch, where he won the prestigious Space Act Award for his work on the Space Shuttle program. Monte then founded and was the Chairman and CEO of Red Pepper Software, a leading supply chain optimization company, which merged in 1996 with PeopleSoft, where he was VP and General Manager, Manufacturing Business Unit.

In 1998, Monte was the founder and CEO of Blue Martini Software – the leader in e-commerce and multi-channel systems for retailers. Blue Martini went public on NASDAQ in one of the most successful IPOs of 2000, and is now part of Red Prairie. Following Blue Martini, he was the chairman of SeeSaw Networks, a digital, place-based media company, and is the chairman of Clio Music, an advanced music research and development company. Monte is also the co-author of Intelligent Scheduling and has published articles in the Harvard Business Review and various computer science journals and conference proceedings.

Zweben currently serves on the Board of Directors of Rocket Fuel Inc. as well as the Dean’s Advisory Board for Carnegie-Mellon’s School of Computer Science. Monte’s involvement with CMU, which has been a long-time leader in distributed computing and Big Data research, helped inspire the original concept behind Splice Machine.

Resources

ODBMS.org: Several Free Resources on Hadoop.

Related Posts

AsterixDB: Better than Hadoop? Interview with Mike Carey. ODBMS Industry Watch, October 22, 2014

Hadoop at Yahoo. Interview with Mithun Radhakrishnan. ODBMS Industry Watch, September 21, 2014

On the Hadoop market. Interview with John Schroeder. ODBMS Industry Watch, June 30, 2014

##

Oct 22 14

AsterixDB: Better than Hadoop? Interview with Mike Carey

by Roberto V. Zicari

“To distinguish AsterixDB from current Big Data analytics platforms – which query but don’t store or manage Big Data – we like to classify AsterixDB as being a “Big Data Management System” (BDMS, with an emphasis on the “M”)”–Mike Carey.

Mike Carey and his colleagues have been working on a new data management system for Big Data called AsterixDB.

The AsterixDB Big Data Management System (BDMS) is the result of approximately four years of R&D involving researchers at UC Irvine, UC Riverside, and Oracle Labs. The AsterixDB code base currently consists of over 250K lines of Java code that has been co-developed by project staff and students at UCI and UCR.

The AsterixDB project has been supported by the U.S. National Science Foundation as well as by several generous industrial gifts.

RVZ

Q1. Why build a new Big Data Management System?

Mike Carey: When we started this project in 2009, we were looking at a “split universe” – there were your traditional parallel data warehouses, based on expensive proprietary relational DBMSs, and then there was the emerging Hadoop platform, which was free but low-function in comparison and wasn’t based on the many lessons known to the database community about how to build platforms to efficiently query large volumes of data. We wanted to bridge those worlds, and handle “modern data” while we were at it, by taking into account the key lessons from both sides.

To distinguish AsterixDB from current Big Data analytics platforms – which query but don’t store or manage Big Data – we like to classify AsterixDB as being a “Big Data Management System” (BDMS, with an emphasis on the “M”). 
We felt that the Big Data world, once the initial Hadoop furor started to fade a little, would benefit from having a platform that could offer things like:

  • a flexible data model that could handle data scenarios ranging from “schema first” to “schema never”;
  • a full query language with at least the expressive power of SQL;
  • support for data storage, data management, and automatic indexing;
  • support for a wide range of query sizes, with query processing cost being proportional to the given query;
  • support for continuous data ingestion, hence the accumulation of Big Data;
  • the ability to scale up gracefully to manage and query very large volumes of data using commodity clusters; and,
  • built-in support for today’s common “Big Data data types”, such as textual, temporal, and simple spatial data.

So that’s what we set out to do.

Q2. What was wrong with the current Open Source Big Data Stack?

Mike Carey: First, we should mention that some reviewers back in 2009 thought we were crazy or stupid (or both) to not just be jumping on the Hadoop bandwagon – but we felt it was important, as academic researchers, to look beyond Hadoop and be asking the question “okay, but after Hadoop, then what?” 
We recognized that MapReduce was great for enabling developers to write massively parallel jobs against large volumes of data without having to “think parallel” – just focusing on one piece of data (map) or one key-sharing group of data (reduce) at a time. As a platform for “parallel programming for dummies”, it was (and still is) very enabling! It also made sense, for expedience, that people were starting to offer declarative languages like Pig and Hive, compiling them down into Hadoop MapReduce jobs to improve programmer productivity – raising the level much like what the database community did in moving to the relational model and query languages like SQL in the 70’s and 80’s.
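
For readers who have not written a MapReduce job, the "one piece of data (map) or one key-sharing group of data (reduce)" idea can be sketched as a single-process word count in Python. This is an illustration of the programming model only, not Hadoop code; `map_phase`, `reduce_phase` and `run_job` are names chosen for the example.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(line):
    """Look at one input record at a time and emit (key, value) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_phase(word, counts):
    """Look at all values that share one key and combine them."""
    return word, sum(counts)


def run_job(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    # Stand-in for the shuffle step: sort so equal keys sit together.
    pairs.sort(key=itemgetter(0))
    grouped = groupby(pairs, key=itemgetter(0))
    return dict(reduce_phase(word, [count for _, count in group])
                for word, group in grouped)


word_counts = run_job(["big data", "Big Data management"])
```

The developer writes only the two small functions; distributing the work over many machines is left to the framework.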

One thing that we felt was wrong for sure in 2009 was that higher-level languages were being compiled into an assembly language with just two instructions, map and reduce. We knew from Ted Codd and relational history that more instructions – like the relational algebra’s operators – were important – and recognized that the data sorting that Hadoop always does between map and reduce wasn’t always needed.
Trying to simulate everything with just map and reduce on Hadoop made “get something better working fast” sense, but not longer-term technical sense. As for HDFS, what seemed “wrong” about it under Pig and Hive was its being based on giant byte stream files and not on “data objects”, which basically meant file scans for all queries and lack of indexing. We decided to ask “okay, suppose we’d known that Big Data analysts were going to mostly want higher-level languages – what would a Big Data platform look like if it were built ‘on purpose’ for such use, instead of having incrementally evolved from HDFS and Hadoop?”

Again, our idea was to try and bring together the best ideas from both the database world and the distributed systems world. (I guess you could say that we wanted to build a Big Data Reese’s Cup… :-) )

Q3. AsterixDB has been designed to manage vast quantities of semi-structured data. How do you define semi-structured data?

Mike Carey: In the late 90’s and early 2000’s there was a bunch of work on that – on relaxing both the rigid/flat nature of the relational model as well as the requirement to have a separate, a priori specification of the schema (structure) of your data. We felt that this flexibility was one of the things – aside from its “free” price point – drawing people to the Hadoop ecosystem (and the key-value world) instead of the parallel data warehouse ecosystem.
In the Hadoop world you can start using your data right away, without spending 3 months in committee meetings to decide on your schema and indexes and getting DBA buy-in. To us, semi-structured means schema flexibility, so in AsterixDB, we let you decide how much of your schema you have to know and/or choose to reveal up front, and how much you want to leave to be self-describing and thus allow it to vary later. And it also means not requiring the world to be flat – so we allow nesting of records, sets, and lists. And it also means dealing with textual data “out of the box”, because there’s so much of that now in the Big Data world.

Q4. The motto of your project is “One Size Fits a Bunch”. You claim that AsterixDB can offer better functionality, manageability, and performance than gluing together multiple point solutions (e.g., Hadoop + Hive + MongoDB). Could you please elaborate on this?

Mike Carey: Sure. If you look at current Big Data IT infrastructures, you’ll see a lot of different tools and systems being tied together to meet an organization’s end-to-end data processing requirements. In between systems and steps you have the glue – scripts, workflows, and ETL-like data transformations – and if some of the data needs to be accessible faster than a file scan, it’s stored not just in HDFS, but also in a document store or a key-value store.
This just seems like too many moving parts. We felt we could build a system that could meet more (not all!) of today’s requirements, like the ones I listed in my answer to the first question.
If your data is in fewer places or can take a flight with fewer hops to get the answers, that’s going to be more manageable – you’ll have fewer copies to keep track of and fewer processes that might have hiccups to watch over. If you can get more done in one system, obviously that’s more functional. And in terms of performance, we’re not trying to out-perform the specialty systems – we’re just trying to match them on what each does well. If we can do that, you can use our new system without needing as many puzzle pieces and can do so without making a performance sacrifice.
We’ve recently finished up a first comparison of how we perform on tasks that systems like parallel relational systems, MongoDB, and Hive can do – and things look pretty good so far for AsterixDB in that regard.

Q5. AsterixDB has been combining ideas from three distinct areas — semi-structured data management, parallel databases, and data-intensive computing. Could you please elaborate on that?

Mike Carey: Our feeling was that each of these areas has some ideas that are really important for Big Data. Borrowing from semi-structured data ideas, but also more traditional databases, leads you to a place where you have flexibility that parallel databases by themselves do not. Borrowing from parallel databases leads to scale-out that semi-structured data work didn’t provide (since scaling is orthogonal to data model) and with query processing efficiencies that parallel databases offer through techniques like hash joins and indexing – which MapReduce-based data-intensive computing platforms like Hadoop and its language layers don’t give you. Borrowing from the MapReduce world leads to the open-source “pricing” and flexibility of Hadoop-based tools, and argues for the ability to process some of your queries directly over HDFS data (which we call “external data” in AsterixDB, and do also support in addition to managed data).
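
As one example of the query processing techniques mentioned above, here is a minimal in-memory hash join in Python. It is a textbook sketch and not AsterixDB code; the function name `hash_join` and the sample field names are assumptions. One input is loaded into a hash table keyed on the join column, and the other input is streamed past it, so neither side needs to be sorted.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Equi-join two lists of dicts on build_key == probe_key."""
    # Build phase: index the (ideally smaller) input by its join key.
    table = defaultdict(list)
    for row in build_rows:
        table[row[build_key]].append(row)
    # Probe phase: one hash lookup per row of the other input.
    joined = []
    for row in probe_rows:
        for match in table.get(row[probe_key], []):
            joined.append({**match, **row})
    return joined


users = [{"user_id": 1, "name": "Ann"}, {"user_id": 2, "name": "Bo"}]
events = [
    {"uid": 1, "event": "login"},
    {"uid": 1, "event": "buy"},
    {"uid": 3, "event": "login"},
]
joined = hash_join(users, events, "user_id", "uid")
```

In a parallel database the same idea is applied per partition after both inputs have been repartitioned on the join key.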

Q6. How does the AsterixDB Data Model compare with the data models of NoSQL data stores, such as document databases like MongoDB and Couchbase, simple key/value stores like Riak and Redis, and column-based stores like HBase and Cassandra?

Mike Carey: AsterixDB’s data model is flexible – we have a notion of “open” versus “closed” data types – it’s a simple idea but it’s unique as far as we know. When you define a data type for records to be stored in an AsterixDB dataset, you can choose to pre-define any or all of the fields and types that objects to be stored in it will have – and if you mark a given type as being “open” (or let the system default it to “open”), you can store objects there that have those fields (and types) as well as any/all other fields that your data instances happen to have at insertion time.
Or, if you prefer, you can mark a type used by a dataset as “closed”, in which case AsterixDB will make sure that all inserted objects will have exactly the structure that your type definition specifies – nothing more and nothing less.
(We do allow fields to be marked as optional, i.e., nullable, if you want to say something about their type without mandating their presence.)
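
To make the open/closed distinction concrete, here is a minimal Python sketch. It is not AsterixDB code: the `RecordType` class, its fields and the `accepts` method are hypothetical names chosen for illustration, and Python types stand in for the system's own type language.

```python
from dataclasses import dataclass, field


@dataclass
class RecordType:
    """Hypothetical stand-in for a record type; not AsterixDB's API."""
    required: dict                                  # field name -> Python type
    optional: dict = field(default_factory=dict)    # may be absent or None
    is_open: bool = True                            # open types admit undeclared fields

    def accepts(self, record: dict) -> bool:
        # Every required field must be present with the declared type.
        for name, typ in self.required.items():
            if name not in record or not isinstance(record[name], typ):
                return False
        # Optional fields may be missing or None, but must match if present.
        for name, typ in self.optional.items():
            value = record.get(name)
            if value is not None and not isinstance(value, typ):
                return False
        # A closed type rejects any field that was not declared.
        declared = set(self.required) | set(self.optional)
        extra = set(record) - declared
        return self.is_open or not extra


open_user = RecordType({"id": int, "name": str}, {"email": str}, is_open=True)
closed_user = RecordType({"id": int, "name": str}, {"email": str}, is_open=False)
```

With these definitions, a record carrying an undeclared `nickname` field passes `open_user.accepts(...)` but fails `closed_user.accepts(...)`.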

What this gives you is a choice!  If you want to have the total, last-minute flexibility of MongoDB or Couchbase, with your data being self-describing, we support that – you don’t have to predefine your schema if you use data types that are totally open. (The only thing we insist on, at the moment, is that every type must have a key field or fields – we use keys when sharding datasets across a cluster.)

Structurally, our data model was JSON-inspired – it’s essentially a schema language for a JSON superset – so we’re very synergistic with MongoDB or Couchbase data in that regard. 
On the other end of the spectrum, if you’re still a relational bigot, you’re welcome to make all of your data types be flat – don’t use features like nested records, lists, or bags in your record definitions – and mark them all as “closed” so that your data matches your schema. With AsterixDB, we can go all the way from traditional relational to “don’t ask, don’t tell”. As for systems with BigTable-like “data models” – I’d personally shy away from calling those “data models”.

Q7. How do you handle horizontal scaling? And vertical scaling?

Mike Carey: We scale out horizontally using the same sort of divide-and-conquer techniques that have been used in commercial parallel relational DBMSs for years now, and more recently in Hadoop as well. That is, we horizontally partition both data (for storage) and queries (when processed) across the nodes of commodity clusters. Basically, our innards look very like those of systems such as Teradata or Parallel DB2 or PDW from Microsoft – we use join methods like parallel hybrid hash joins, and we pay attention to how data is currently partitioned to avoid unnecessary repartitioning – but have a data model that’s way more flexible. And we’re open source and free….
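
The general idea of hash partitioning, and of skipping repartitioning when both inputs are already partitioned on the join key, can be sketched in a few lines of Python. This is a toy model under simplifying assumptions (in-memory lists stand in for cluster nodes, and all function names are invented here); it says nothing about how AsterixDB itself is implemented.

```python
from collections import defaultdict
from zlib import crc32


def partition_of(key, num_partitions):
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    return crc32(repr(key).encode("utf-8")) % num_partitions


def hash_partition(records, key_field, num_partitions):
    """Assign each record to a partition based on a hash of its key."""
    parts = [[] for _ in range(num_partitions)]
    for rec in records:
        parts[partition_of(rec[key_field], num_partitions)].append(rec)
    return parts


def local_hash_join(left, right, left_key, right_key):
    """Build a hash table on the left input, then probe it with the right."""
    table = defaultdict(list)
    for l in left:
        table[l[left_key]].append(l)
    return [{**l, **r} for r in right for l in table.get(r[right_key], [])]


def partitioned_join(left_parts, right_parts, left_key, right_key):
    # Both inputs are partitioned on the join key with the same function,
    # so matching rows share a partition and no data has to move.
    out = []
    for lp, rp in zip(left_parts, right_parts):
        out.extend(local_hash_join(lp, rp, left_key, right_key))
    return out
```

In a real parallel system each pair of partitions would be joined on a different node at the same time; the loop in `partitioned_join` only stands in for that.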

We scale vertically (within one node) in two ways. First of all, we aren’t memory-dependent in the way that many of the current Big Data Analytics solutions are; it’s not the case that you have to buy a big enough cluster so that your data, or at least your intermediate results, can be memory-resident.
Instead, our physical operators (for joins, sorting, aggregation, etc.) all spill to disk if needed – so you can operate on Big Data partitions without getting “out of memory” errors. The other way is that we allow nodes to hold multiple partitions of data; that way, one can also use multi-core nodes effectively.
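
A classic example of an operator that spills to disk is the external merge sort. The Python sketch below is a generic textbook version, not AsterixDB's sort operator: it assumes temporary files and `pickle` as the spill format, and the `max_in_memory` limit counts items rather than bytes.

```python
import heapq
import pickle
import tempfile


def _spill(sorted_run):
    """Write one sorted run to a temporary file and rewind it for reading."""
    f = tempfile.TemporaryFile()
    for item in sorted_run:
        pickle.dump(item, f)
    f.seek(0)
    return f


def _read(f):
    """Stream the items of a spilled run back, one at a time."""
    while True:
        try:
            yield pickle.load(f)
        except EOFError:
            return


def external_sort(items, max_in_memory):
    """Sort an iterable while buffering at most max_in_memory items."""
    runs, buffer = [], []
    for item in items:
        buffer.append(item)
        if len(buffer) >= max_in_memory:
            runs.append(_spill(sorted(buffer)))   # memory is full: spill a run
            buffer = []
    if buffer:
        runs.append(_spill(sorted(buffer)))
    try:
        # heapq.merge lazily merges the sorted runs, one item per run at a time.
        yield from heapq.merge(*(_read(run) for run in runs))
    finally:
        for run in runs:
            run.close()
```

The input can be far larger than `max_in_memory`; only one buffer's worth of items is sorted in memory at any time.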

Q8. What performance figures do you have for AsterixDB?

Mike Carey: As I mentioned earlier, we’ve completed a set of initial performance tests on a small cluster at UCI with 40 cores and 40 disks, and the results of those tests can be found in a recently published AsterixDB overview paper that’s hanging on our project web site’s publication page (http://asterixdb.ics.uci.edu/publications.html).
We have a couple of other performance studies in flight now as well, and we’ll be hanging more information about those studies in the same place on our web site when they’re ready for human consumption. There’s also a deeper dive paper on the AsterixDB storage manager that has some performance results regarding the details of scaling, indexing, and so on; that’s available on our web site too. The quick answer to “how does AsterixDB perform” is that we’re already quite competitive with other systems that have narrower feature sets – which we’re pretty proud of.

Q9. You mentioned support for continuous data ingestion. How does that work?

Mike Carey: We have a special feature for that in AsterixDB – we have a built-in notion of Data Feeds that are designed to simplify the lives of users who want to use our system for warehousing of continuously arriving data.
We provide Data Feed adaptors to enable outside data sources to be defined and plugged in to AsterixDB, and then one can “connect” a Data Feed to an AsterixDB data set and the data will start to flow in. As the data comes in, we can optionally dispatch a user-defined function on each item to do any initial information extraction/annotation that you want.  Internally, this creates a long-running job that our system monitors – if data starts coming too fast, we offer various policies to cope with it, ranging from discarding data to sampling data to adding more UDF computation tasks (if that’s the bottleneck). More information about this is available in the Data Feeds tech report on our web site, and we’ll soon be documenting this feature in the downloadable version of AsterixDB. (Right now it’s there but “hidden”, as we have been testing it first on a set of willing UCI student guinea pigs.)
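
The interplay between a feed, a per-item function and a congestion policy can be modelled with a toy simulation. Everything in this Python sketch is an assumption made for illustration: the `admit` and `run_feed` functions, the policy names `"discard"` and `"sample"`, and the fixed-size buffer are not AsterixDB's Data Feed API.

```python
from collections import deque


def admit(arrivals, free_slots, policy, sample_every=2):
    """Decide which arriving items enter the buffer during one tick."""
    if len(arrivals) <= free_slots:
        return list(arrivals)                       # no congestion: keep all
    if policy == "discard":
        return list(arrivals[:free_slots])          # drop the excess tail
    if policy == "sample":
        return list(arrivals[::sample_every][:free_slots])  # thin the stream
    raise ValueError(f"unknown policy: {policy}")


def run_feed(batches, store_rate, buffer_size, policy="discard", udf=None):
    """Simulate a feed: one batch arrives per tick, store_rate items are stored."""
    buffer, dataset = deque(), []

    def store(item):
        # Apply the optional user-defined function before storing the item.
        dataset.append(udf(item) if udf else item)

    for arrivals in batches:
        buffer.extend(admit(arrivals, buffer_size - len(buffer), policy))
        for _ in range(min(store_rate, len(buffer))):
            store(buffer.popleft())
    while buffer:                                   # drain once the feed ends
        store(buffer.popleft())
    return dataset
```

For the batches `[[1, 2, 3, 4, 5, 6, 7, 8], [9, 10]]` with `store_rate=2` and `buffer_size=4`, the discard policy keeps the first four items of the burst, while the sample policy keeps every second one.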

Q10. What is special about the AsterixDB Query Language? Why not use SQL?

Mike Carey: When we set out to define the query language for AsterixDB, we decided to define our own new language – since it seemed like everybody else was doing that at the time (witness Pig, Jaql, HiveQL, etc.) – one aimed at our data model. 
SQL doesn’t handle nested or open data very well, so extending ANSI/ISO SQL seemed like a non-starter – that was also based on some experience working on SQL3 in the late 90’s. (Take a look at Oracle’s nested tables, for example.) Based on our team’s backgrounds in XML querying, we actually started there – XQuery was developed by a team of really smart people from the SQL world (including Don Chamberlin, father of SQL) as well as from the XML world and the functional programming world – so we started there. We took XQuery and then started throwing the stuff overboard that wasn’t needed for JSON or that seemed like a poor feature that had been added for XPath compatibility.
What remained was AQL, and we think it’s a pretty nice language for semistructured data handling. We periodically do toy with the notion of adding a SQL-like re-skinning of AQL to make SQL users feel more at home – and we may well do that in the future – but that would be different than “real SQL”. (The N1QL effort at Couchbase is doing something along those lines, language-wise, as an example. The SQL++ design from UCSD is another good example there.)

Q11. What level of concurrency and recovery guarantees does AsterixDB offer?

Mike Carey: We offer transaction support that’s akin to that of current NoSQL stores. That is, we promise record-level ACIDity – so inserting or deleting a given record will happen as an atomic, durable action. However, we don’t offer general-purpose distributed transactions. We support an arbitrary number of secondary indexes on data sets, and we’ll keep all the indexes on a data set transactionally consistent – that we can do because secondary index entries for a given record live in the same data partition as the record itself, so those transactions are purely local.
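
Why co-locating secondary index entries with their record keeps index maintenance local can be shown with a small Python model. The `Partition` class below is hypothetical, and a single lock stands in for record-level atomicity; durability (logging and recovery) is deliberately left out.

```python
import threading


class Partition:
    """One data partition holding records and their secondary index entries."""

    def __init__(self, indexed_fields):
        self._lock = threading.Lock()
        self.primary = {}                                # key -> record
        self.secondary = {f: {} for f in indexed_fields} # field -> value -> keys

    def _unindex(self, key, record):
        for f, idx in self.secondary.items():
            if f in record:
                keys = idx.get(record[f])
                if keys:
                    keys.discard(key)
                    if not keys:
                        del idx[record[f]]

    def upsert(self, key, record):
        # The record and all of its index entries change under one local lock,
        # so no cross-partition coordination is needed.
        with self._lock:
            old = self.primary.get(key)
            if old is not None:
                self._unindex(key, old)
            self.primary[key] = record
            for f, idx in self.secondary.items():
                if f in record:
                    idx.setdefault(record[f], set()).add(key)

    def delete(self, key):
        with self._lock:
            old = self.primary.pop(key, None)
            if old is not None:
                self._unindex(key, old)

    def lookup(self, field, value):
        with self._lock:
            keys = sorted(self.secondary[field].get(value, ()))
            return [self.primary[k] for k in keys]
```

A query on a secondary index would then be sent to every partition, each answering from its own local index.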

Q12. How does AsterixDB compare with Hadoop? What about Hadoop Map/Reduce compatibility?

Mike Carey: I think we’ve already covered most of that – Hadoop MapReduce is an answer to low-level “parallel programming for dummies”, and it’s great for that – and languages on top like Pig Latin and HiveQL are better programming abstractions for “data tasks” but have runtimes that could be much better. We started over, much as the recent flurry of Big Data analytics platforms are now doing (e.g., Impala, Spark, and friends), but with a focus on scaling to memory-challenging data sizes. We do have a MapReduce compatibility layer that goes along with our Hyracks runtime layer – Hyracks is the name of our internal dataflow runtime layer – but our MapReduce compatibility layer is not related to (or connected to) the AsterixDB system.

Q13. How does AsterixDB relate to Hadapt?

Mike Carey: I’m not familiar with Hadapt, per se, but I read the HadoopDB work that fed into it. 
We’re architecturally very different – we’re not Hadoop-based at all – I’d say that HadoopDB was more of an expedient hybrid coupling of Hadoop and databases, to get some of the indexing and local query efficiency of an existing database engine quickly in the Hadoop world. We were thinking longer term, starting from first principles, about what a next-generation BDMS might look like. AsterixDB is what we came up with.

Q14. How does AsterixDB relate to Spark?

Mike Carey: Spark is aimed at fast Big Data analytics – its data is coming from HDFS, and the task at hand is to scan and slice and dice and process that data really fast. Things like Shark and SparkSQL give users SQL query power over the scanned data, but Spark in general is really catching fire, it appears, due to its applicability to Big Machine Learning tasks. In contrast, we’re doing Big Data Management – we store and index and query Big Data. It would be a very interesting/useful exercise for us to explore how to make AsterixDB another source where Spark computations can get input data from and send their results to, as we’re not targeting the more complex, in-memory computations that Spark aims to support.

Q15. How can others contribute to the project?

Mike Carey: We would love to see this start happening – and we’re finally feeling more ready for that, and even have some NSF funding to make AsterixDB something that others in the Big Data community can utilize and share. 
(Note that our system is Apache-style open source licensed, so there are no “gotchas” lurking there.)
Some possibilities are:

(1) Others can start to use AsterixDB to do real exploratory Big Data projects, or to teach about Big Data (or even just semistructured data) management. Each time we’ve worked with trial users we’ve gained some insights into our feature set, our query optimizations, and so on – so this would help contribute by driving us to become better and better over time.

(2) Folks who are studying specific techniques for dealing with modern data – e.g., new structures for indexing spatiotemporaltextual :-) data – might consider using AsterixDB as a place to try out their new ideas.
(This is not for the meek, of course, as right now effective contributors need to be good at reading and understanding open source software without the benefit of a plethora of internal design documents or other hints.) We also have some internal wish lists of features we wish we had time to work on – some of which are even doable from “outside”, e.g., we’d like to have a much nicer browser-based workbench for users to use when interacting with and managing an AsterixDB cluster.

(3) Students or other open source software enthusiasts who download and try our software and get excited about it – who then might want to become an extension of our team – should contact us and ask about doing so. (Try it first, though!)  We would love to have more skilled hands helping with fixing bugs, polishing features, and making the system better – it’s tough to build robust software in a university setting, and we would especially welcome contributors from companies.

Thanks very much for this opportunity to share what we’ve been doing!

————————
Michael J. Carey is a Bren Professor of Information and Computer Sciences at UC Irvine.
Before joining UCI in 2008, Carey worked at BEA Systems for seven years and led the development of BEA’s AquaLogic Data Services Platform product for virtual data integration. He also spent a dozen years teaching at the University of Wisconsin-Madison, five years at the IBM Almaden Research Center working on object-relational databases, and a year and a half at e-commerce platform startup Propel Software during the infamous 2000-2001 Internet bubble. Carey is an ACM Fellow, a member of the National Academy of Engineering, and a recipient of the ACM SIGMOD E.F. Codd Innovations Award. His current interests all center around data-intensive computing and scalable data management (a.k.a. Big Data).

Resources

– AsterixDB Big Data Management System (BDMS): Downloads, Documentation, Asterix Publications.

Related Posts

Hadoop at Yahoo. Interview with Mithun Radhakrishnan. ODBMS Industry Watch, September 21, 2014

On the Hadoop market. Interview with John Schroeder. ODBMS Industry Watch, June 30, 2014

Follow ODBMS.org on Twitter: @odbmsorg
##

Oct 12 14

Big Data Management at American Express. Interview with Sastry Durvasula and Kevin Murray.

by Roberto V. Zicari

“The Hadoop platform indeed provides the ability to efficiently process large-scale data at a price point we haven’t been able to justify with traditional technology. That said, not every technology process requires Hadoop; therefore, we have to be smart about which processes we deploy on Hadoop and which are a better fit for traditional technology (for example, RDBMS).”–Kevin Murray.

I wanted to learn how American Express is taking advantage of analysing big data.
I have interviewed Sastry Durvasula, Vice President – Technology, American Express, and Kevin Murray, Vice President – Technology, American Express.

RVZ

Q1. With the increasing demand for mobile and digital capabilities, how are American Express’ customer expectations changing?

SASTRY DURVASULA: American Express customers expect us to know them, to understand and anticipate their preferences and personalize our offerings to meet their specific needs. As the world becomes increasingly mobile, our Card Members expect to be able to engage with us whenever, wherever and using whatever device or channel they prefer.
In addition, merchants, small businesses and corporations also want increased value, insights and relevance from our global network.

Q2. Could you explain what is American Express’ big data strategy?

SD: American Express seeks to leverage big data to deliver innovative products in the payments and commerce space that provide value to our customers. This is underpinned by best-in-class engineering and decision science.

From a technical perspective, we are advancing an enterprise-wide big data platform that leverages open source technologies like Hadoop, integrating it with our analytical and operational capabilities across the various business lines. This platform also powers strategic partnerships and real-time experiences through emerging digital channels. Examples include Amex Offers, which connects our Card Members and merchants through relevant and personalized digital offers; an innovative partnership with Trip Advisor to unlock exclusive benefits; insights and tools for our B2B partners and small businesses; and advanced credit and fraud risk management.

Additionally, as always, we seek to leverage data responsibly and in a privacy-controlled environment. Trust and security are hallmarks of our brand. As we leverage big data to create new products and services, these two values remain at the forefront.

Q3. What is the “value” you derive by analysing big data for American Express?

SD: Within American Express, our Technology and Risk & Information Management organizations partner with our lines of business to create new opportunities to drive commerce and serve customers across geographies with the help of big data. Big data is one of our most important tools in being the company we want to be – one that identifies solutions to customers’ needs and helps us deliver what customers want today and what they may want in the future.

Q4. What metrics do you use to monitor big data analytics at American Express?

SD: Big data investments are no different than any other investments in terms of the requirement for quantitative and qualitative ROI metrics with pre- and post-measurements that assess the projects’ value for revenue generation, cost avoidance and customer satisfaction. There is also the recognition that some of the investments, especially in the big data arena, are strategic and longer term in nature, and the value generated should be looked at from that perspective.

Additionally, we are constantly focused on benchmarking the performance of our platform with industry standards, like minute-sort and tera-sort, as well as our proprietary demand management metrics.

Q5. Could you explain how did you implement your big data infrastructure platform at Amex?

KEVIN MURRAY: We started small and expanded as our use cases grew over time, about once or twice a year.
We make it a practice to reassess the hardware and software state within the industry before each major expansion to determine whether any external changes should alter the deployment path we have chosen.

Q6. How did you select the components for your big data infrastructure platform, choosing among the various competing compute and storage solutions available today?

KM: Our research told us low-cost commodity servers with local storage were the common deployment stack across the industry. We made an assessment of industry offerings and evaluated them against our objectives to determine a good balance of cost, capabilities and time to market.

Q7. How did you unleash big data across your enterprise and put it to work in a sustainable and agile environment?

SD: We engineered our enterprise-wide big data platform to foster R&D and rapid development of use cases, while delivering highly available production applications. This allows us to be adaptable and agile, scaling up or redeploying, as needed, to meet market and business demands. With the Risk and Information Management team, we established Big Data Labs comprising top-notch decision scientists and engineers to help democratize big data, leveraging self-service tools, APIs and common libraries of algorithms.

Q8. What are the most significant challenges you have encountered so far?

SD: An ongoing challenge is balancing our big data investment between immediate needs and research or innovations that will drive the next generation of capabilities. You can’t focus solely on one or the other but have to find a balance.

Another key challenge is ensuring we are focused on driving outcomes that are meaningful to customers – that are responsive to their current and anticipated needs.

Q9. What did you learn along the way?

KM: The Hadoop platform indeed provides the ability to efficiently process large-scale data at a price point we haven’t been able to justify with traditional technology. That said, not every technology process requires Hadoop; therefore, we have to be smart about which processes we deploy on Hadoop and which are a better fit for traditional technology (for example, RDBMS). Some components of the ecosystem are mature and work well, and others require some engineering to get to an enterprise-ready state. In the end, it’s an exciting journey to offer new innovation to our business.

Q10. Anything else you wish to add?

KM: The big data industry is evolving at lightning speed with new products and services coming to market every day. I think this is being driven by the enterprise’s appetite for something new and innovative that leverages the power of compute, network and storage advancements in the marketplace, combined with a groundswell of talent in the data science domain, pushing academic ideas into practical business use cases. The result is a wealth of new offerings in the marketplace – from ideas and early startups to large-scale mission-critical solutions. This is providing choice to enterprises like we’ve never seen before, and we are focused on maximizing this advantage to bring groundbreaking products and opportunities to life.

———————————-
Sastry Durvasula, Vice President – Technology, American Express
Sastry Durvasula is Vice President and Global Technology Head of Information Management and Digital Capabilities within the Technology organization at American Express. In this role, Sastry leads IT strategy and transformational development to power the company’s data-driven capabilities and digital products globally. His team also delivers enterprise-wide analytics and business intelligence platforms, and supports critical risk, fraud and regulatory demands. Most recently, Sastry and his team led the launch of the company’s big data platform and transformation of its enterprise data warehouse, which are powering the next generation of information, analytics and digital capabilities. His team also led the development of the company’s API strategy, as well as the Sync platform to deliver innovative products, drive social commerce and launch external partnerships.

Kevin Murray, Vice President – Technology, American Express
Kevin Murray is Vice President of Information Management Infrastructure & Integration within the Technology organization at American Express. Throughout his 25+ year career, he has brought emerging technologies into large enterprises, and most recently launched the big data infrastructure platform at American Express. His team architects and implements a wide range of information management capabilities to leverage the power of increasing compute and storage solutions available today.

Related Posts

Hadoop at Yahoo. Interview with Mithun Radhakrishnan. ODBMS Industry Watch, 2014-09-21

On Big Data benchmarks. Interview with Francois Raab and Yanpei Chen. ODBMS Industry Watch, 2014-08-14

Resources

Presenting at Strata/Hadoop World NY
Big Data: A Journey of Innovation
Thursday, October 16, 2014, at 1:45-2:25 p.m. Eastern
Room: 1 CO3/1 CO4

The power of big data has become the catalyst for American Express to accelerate transformation for the digital age, drive innovative products, and create new commerce opportunities in a meaningful and responsible way. With the increasing demand for mobile and digital capabilities, the customer expectation for real-time information and differentiated experiences is rapidly changing. Big data offers a solution that enables this organization to use their proprietary closed-loop network to bring together consumers and merchants around the world, adding value to each in a way that is individualized and unique.

During their presentation, Sastry Durvasula and Kevin Murray will discuss American Express’ ongoing big data journey of transformation and innovation. How did the company unleash big data across its global network and put it to work in a sustainable and agile environment? How is it delivering offers using digital channels relevant to their Card Members and partners? What have they learned along the way? Sastry and Kevin will address these questions and share their experiences and insights on the company’s big data strategy in the digital ecosystem.

Follow ODBMS.org and ODBMS Industry Watch on Twitter: @odbmsorg
##

Sep 29 14

Using NoSQL at BMW. Interview with Jutta Bremm and Peter Palm.

by Roberto V. Zicari

“We need high performance databases for a wide range of challenges and analyses that arise from a variety of different systems and processes.”–Jutta Bremm, BMW

BMW is using a NoSQL database, CortexDB, for the configuration of test vehicles. I have interviewed Jutta Bremm, IT Project Leader at BMW, and Peter Palm, CVO at Cortex.

RVZ

Q1. What is your role, and what IT projects are you responsible for at BMW?

Jutta Bremm: I am IT Project Leader for IT projects at BMW with a volume of more than 10 million Euro per year.

Q2. What are the main technical challenges you have at BMW?

Jutta Bremm: We need high performance databases for a wide range of challenges and analyses that arise from a variety of different systems and processes.

These don’t only include recursive, parameterized explosions for bills of materials, but also the provision of standardized tools to the business departments. That way, they can run their own queries more often and are not so dependent on IT to do it for them.

Q3. You define CortexDB as a schema-less multi-model database. What does it mean in practice? What kind of applications is it useful for?

Peter Palm: In CortexDB, datasets are stored as independent entities (cf. objects). To achieve this, the system transforms all content into a new type of index structure. This ensures that every item of content and every field “knows” the context in which it is being used. As a result, the database isn’t searched. Instead, queries are run on information that is already known and the results are combined using simple procedures based on set theory.

This is why there’s no predefined schema for the datasets – only for the index of all fields and the content.
This is what differentiates CortexDB from all other databases, which require the configuration of at least one index even though the datasets themselves are stored in schema-less mode.

The innovative index structure means that no administrative adaptation or optimization of the index is necessary.
Nor is there any requirement for an index for a specific application – and that enables users to query all the content whenever they want and combine queries with each other too. That makes it very flexible for them to query any field and easily make any necessary development changes to in-house applications.

From the server’s perspective, the fields and content, as well as the interpretation of dataset structure and utilization, are not that important. The application working with the data creates a data structure that can be changed at any time (this is known as schema-less). For CortexDB, all that’s relevant is the content-based structure, which can be used in a generalized way and modified any time. This design gives customers a significant advantage when working with recursive data structures.

This is why CortexDB is particularly well suited to tasks whose definitive structure cannot be fixed at the beginning of the project, as well as for systems that change dynamically. The content-based architecture and the innovative index also deliver significant benefits for BI systems, as ad hoc analyses can be run and adapted whenever required.

In addition, users can add a validity period (“valid from…”) to any item of content. This enables them to view the evolution of particular data over time (known as historization). This evolutionary information is ideal for storing data that change frequently, such as smart metering and insurance information. For each field in a dataset, users don’t only see the information that was valid at the time of the transaction, but also the validity date after which the information was/is/will be valid. This is what we call a temporal database.
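
The “valid from” idea can be illustrated with a short Python sketch. It is a generic model of a single historized field, not CortexDB's storage format: the `TemporalField` class is hypothetical, and it tracks only the validity date, not the separate transaction date mentioned above.

```python
from bisect import bisect_right
from datetime import date


class TemporalField:
    """Values of one field, each tagged with a 'valid from' date."""

    def __init__(self):
        self._dates = []    # sorted valid-from dates
        self._values = []   # values aligned with _dates

    def set(self, valid_from, value):
        # Keep both lists ordered by date, whatever the insertion order.
        i = bisect_right(self._dates, valid_from)
        self._dates.insert(i, valid_from)
        self._values.insert(i, value)

    def value_at(self, when):
        # The applicable value is the latest one whose date is <= when.
        i = bisect_right(self._dates, when)
        return self._values[i - 1] if i else None

    def history(self):
        return list(zip(self._dates, self._values))


tariff = TemporalField()
tariff.set(date(2014, 1, 1), "basic")
tariff.set(date(2015, 1, 1), "premium")    # a future value, entered in advance
tariff.set(date(2014, 7, 1), "standard")   # entered later, sorted into place
```

Asking for `tariff.value_at(date(2014, 3, 1))` returns the value in force on that day, and `history()` lists the full evolution of the field.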

These benefits are complemented by the fact that individual fields can be used alone or in combination with others and repeated within a dataset. This – together with the use of validity dates – is what we call a “multi-value” database.

The terms “multi-model”, “multi-value” and “schema-less” also explain the fact that benefits of the database functions mentioned above apply to other NoSQL databases too, but users can extend these with new functions. In principle, any other database can be seen as a subset of CortexDB:

Database type: Key/Value Store
Function: One dataset = one key with one value (a value or value list) => a single, large index of keys
How it works in CortexDB: Every value and every field is indexed automatically and can be freely combined with others by using an occurrence list

Database type: Document Store
Function: One dataset combines several fields using a common ID (often json objects)
How it works in CortexDB: One ID combines fields that belong together in a dataset. Datasets can be output as json objects via an API.

Database type: GraphDB
Function: Links to other datasets are saved as meta information and can be used via proprietary graph queries.
How it works in CortexDB: Links are stored as actual data in a dataset and can be edited using additional fields. Fields can be repeated as often as required.

Database type: Big Table
Function: Multi-dimensional tables that use timestamps to define the validity of information. Its datasets can have a variety of attributes.
How it works in CortexDB: The use of a validity date in addition to a transaction date delivers a temporal database. Additional content can be added despite the dataset description.

Database type: Object oriented
Function: A class model defines the objects that need to be monitored persistently.
How it works in CortexDB: With the Cortex UniPlex application, users can define dataset types. Compared with classes, these define the maximum attributes of a dataset. Nevertheless, users can add more fields at any time, even if they have not been defined for UniPlex.

Q4. Can you please describe the use cases where you use CortexDB at BMW?

Jutta Bremm: The current use case for which we’re working with CortexDB is the explosion of bills of material for the configuration of test vehicles.

The construction of test vehicles must be planned and timed just as carefully as with mass production. To make the process smoother, we conduct reviews before starting construction to ensure that the bills of materials include the right parts and are therefore complete and free of any errors and conflicts.

One thing I’d like to point out here is that every vehicle comprises 15,000 parts, so there are between 10 to the power of 30 and 10 to the power of 60 configuration possibilities! It’s easy to understand why this isn’t an easy task. This high variance is due to the number of different models, engine types, displacements, optional extras, interior fittings and colors. As a result, a development BOM can only be stored in a highly compressed format.

To obtain an individual car from all this, the BOM must be “exploded” recursively. Multiple parameters have an effect on this, including validities (deadlines for parts, products, optional extras, markets etc.), construction stipulations (“this part can only be installed together with a navigation device and a 3-liter engine”) and structures (“this part is comprised of several smaller parts”).
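
A recursive explosion that honours validities, construction stipulations and part structures might look roughly like the following Python sketch. The BOM layout (a dict of child entries with optional `valid_from`, `valid_to` and `requires` keys), the part names and the `explode` function are all invented for illustration; BMW's actual data and CortexDB's explosion function are not shown in the interview.

```python
from datetime import date


def explode(part, bom, config, on_date, depth=0):
    """Yield (depth, part) for every component that applies to one car.

    Assumes the BOM is acyclic; config is the set of chosen options.
    """
    yield depth, part
    for child in bom.get(part, []):
        valid_from = child.get("valid_from", date.min)
        valid_to = child.get("valid_to", date.max)
        if not (valid_from <= on_date <= valid_to):
            continue    # validity: the part is not usable on this date
        if not child.get("requires", set()) <= config:
            continue    # stipulation: a required option is not configured
        # structure: descend into the sub-parts of this part
        yield from explode(child["part"], bom, config, on_date, depth + 1)


bom = {
    "car": [
        {"part": "engine-3.0", "requires": {"3-liter"}},
        {"part": "engine-2.0", "requires": {"2-liter"}},
        {"part": "nav-bracket", "requires": {"navigation", "3-liter"}},
        {"part": "old-radio", "valid_to": date(2013, 12, 31)},
    ],
    "engine-3.0": [{"part": "piston"}, {"part": "crankshaft"}],
}
```

Calling `explode("car", bom, {"3-liter", "navigation"}, date(2014, 9, 29))` walks only the branches that apply to that one configuration on that date.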

Unlike conventional solutions, for which an explosion function is complex and expensive, the interpretation of the compressed BOM is very easy for CortexDB due to its bidirectional linking technology.

Q5. Why did you select CortexDB and not a classical relational database system? Did you compare CortexDB with other database management systems?

Jutta Bremm: We were looking for a product that would be easy to use, as well as simple and flexible to configure, for our users in product data management. We also wanted the highest possible level of functionality included as standard.

We looked at 4 products that appeared to be suitable for use by the departments for analysis and evaluation. The essential functions for product data management – explosion and the documentation of components used – were only available as standard with CORTEX. For all other products, we were looking at customer-specific extensions that would have cost several hundred thousand euros.

Q6. How do you store complex data structures (such as for example graphs) in CortexDB?

Peter Palm: CortexDB sees graphs as a derivative of certain database functions.

Firstly, it uses the “internal reference” field type (link). This is a data field in which the UUID of a target dataset is stored. That alone enables the use of simple links.

Second, users can choose to define fields as “repeating fields”. That means that the same field can be used more than once within a dataset. This is useful when a contact has more than one email address or phone number, and for links to individual parts in a BOM.

Repeating fields defined in this way can be grouped together to produce “repeating field groups”. Content items that belong together are thus stored as an information block. An example of this is bank account details that comprise the bank’s name, the sort code and the account number.

The use of repeating field groups, in which validity values are added to linked fields, enables complex data structures within a single dataset.

In addition, every dataset “knows” which other dataset is pointing to it. This bidirectional information using a simple link means that data administration is only required for one dataset. It is only necessary in both datasets if there are two conflicting points of view on a graph (e.g. “my friend considers me as an enemy”).

In addition, result sets can be combined with partial sets resulting from links when running queries and making selections. This limits the results to those that include certain details about their link structures.
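The building blocks described above (link fields holding a UUID, repeating fields, repeating field groups, and datasets that know who points at them) can be sketched in a few lines of Python. This is an illustrative model only; the `Store` class and its methods are hypothetical and say nothing about how CortexDB is actually implemented.

```python
import uuid
from collections import defaultdict


class Store:
    """Toy dataset store with link fields and automatic back-references."""

    def __init__(self):
        self.datasets = {}                   # uuid -> {field name: [values]}
        self.backlinks = defaultdict(set)    # target uuid -> {source uuids}

    def add(self, **fields):
        """Create a dataset; every field is a repeating field (a list)."""
        key = str(uuid.uuid4())
        self.datasets[key] = {name: list(values) for name, values in fields.items()}
        return key

    def link(self, source, field, target, **attributes):
        """Append one entry to a repeating field group that holds a link."""
        entry = {"target": target, **attributes}
        self.datasets[source].setdefault(field, []).append(entry)
        # Maintained on the target side, so the link is stored only once.
        self.backlinks[target].add(source)

    def referrers(self, target):
        """Datasets that point at `target`, without scanning all datasets."""
        return set(self.backlinks.get(target, set()))


store = Store()
contact = store.add(name=["Alice"], email=["a@example.com", "b@example.com"])
car = store.add(name=["car"])
wheel = store.add(name=["wheel"])
bolt = store.add(name=["bolt"])
store.link(car, "parts", wheel, valid_from="2015-01-01")
store.link(wheel, "parts", bolt, valid_from="2015-01-01")
```

The link from `car` to `wheel` is administered only in the `car` dataset, yet `store.referrers(wheel)` still answers the reverse question of which datasets use the wheel.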

Q7. How do you perform data analytics with CortexDB?

Peter Palm: The content in every field “knows” the field context it is being used in and how often (“occurrence list” or “field index”). By combining partial sets (as in set theory), result sets are determined extremely fast, eliminating the need for read access to individual datasets.
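A minimal sketch of the idea, assuming a simple inverted index from (field, value) pairs to dataset IDs: queries are then answered purely with set operations on the index, and the datasets themselves are never read. The `FieldIndex` class and the sample data are hypothetical and are not CortexDB's API.

```python
from collections import defaultdict


class FieldIndex:
    """Toy occurrence list: where does each value occur, and how often?"""

    def __init__(self):
        # (field, value) -> ids of the datasets holding that value in that field
        self.occurrences = defaultdict(set)

    def add(self, dataset_id, fields):
        for name, value in fields.items():
            self.occurrences[(name, value)].add(dataset_id)

    def ids(self, name, value):
        return self.occurrences.get((name, value), set())

    def count(self, name, value):
        return len(self.ids(name, value))


index = FieldIndex()
index.add(1, {"color": "red", "engine": "3l"})
index.add(2, {"color": "red", "engine": "2l"})
index.add(3, {"color": "blue", "engine": "3l"})
index.add(4, {"color": "red", "engine": "3l"})

red_and_3l = index.ids("color", "red") & index.ids("engine", "3l")    # intersection
red_not_3l = index.ids("color", "red") - index.ids("engine", "3l")    # difference
any_color = index.ids("color", "red") | index.ids("color", "blue")    # union
```

Each query touches only the partial sets for the values involved, so the cost depends on the size of those sets rather than on the number of datasets stored.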

CortexDB comes with an application that lets users freely configure queries, reports and graphical output. There is also an application API (data service) that enables these elements to be used within in-house applications or interfaces.

The solution also identifies correlations itself using algorithms, even if they are connected via graphs. Unlike data warehouse systems, this lets users do more than just test estimates or ideas – it determines a result on its own and delivers it to the user for further analysis or for modification of the algorithm.

Q8. Do you have some performance metrics for the analysis of recursively structured BOMs (bills of materials) for your vehicles?

Jutta Bremm: Internal tests on BOM explosion with conventional relational databases showed that it took up to 120 seconds. Compare that with CortexDB, which delivers the result of the same explosion in 50 milliseconds.

Q9. How do you handle data quality control?

Jutta Bremm: We require 100% data quality (consistency at all times) and CortexDB delivers that.

Q10. What are the main business benefits of using CortexDB for these use cases?

Jutta Bremm: The agile modeling, the flexible adaptation options and the level of functionality delivered as standard shorten the duration of a project and reduce the costs compared to the other products we tested (see Q5).

Qx. Anything else you wish to add?

Jutta Bremm, Peter Palm: By using the temporal capabilities (time of transaction and time of validity), users can easily see which individual value in a dataset was/is/will be valid and from when.
In addition, the server-side JavaScript is used to calculate ad hoc results from the recursive structure, eliminating the need for these to be calculated and saved in the database beforehand.
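The two time axes mentioned above can be illustrated with a small Python sketch. Everything in it is hypothetical (the `Version` record, the `value_as_of` function and the supplier data), and it makes no claim about how CortexDB stores temporal data. It only shows how "what was/is/will be valid" differs from "what was known at the time".

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Version:
    value: str
    valid_from: date     # time of validity: first day (inclusive)
    valid_to: date       # time of validity: end (exclusive)
    recorded_on: date    # time of transaction: when the value was entered


def value_as_of(versions: List[Version], valid_on: date,
                known_on: date) -> Optional[str]:
    """Value valid on `valid_on`, according to what was recorded by `known_on`."""
    candidates = [v for v in versions
                  if v.recorded_on <= known_on
                  and v.valid_from <= valid_on < v.valid_to]
    if not candidates:
        return None
    # The most recently recorded statement wins.
    return max(candidates, key=lambda v: v.recorded_on).value


history = [
    Version("supplier A", date(2014, 1, 1), date(2016, 1, 1), date(2013, 12, 1)),
    # A future change, entered in March 2015 and taking effect in July 2015.
    Version("supplier B", date(2015, 7, 1), date(2016, 1, 1), date(2015, 3, 1)),
]

before_change_was_recorded = value_as_of(history, date(2015, 8, 1), date(2015, 2, 5))
after_change_was_recorded = value_as_of(history, date(2015, 8, 1), date(2015, 4, 1))
```

Asked in February 2015 about August 2015, the answer is still "supplier A"; asked in April 2015 about the same day, it is "supplier B", because the planned change has been recorded by then.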

——————–
Jutta Bremm, IT Project Manager, BMW.
Jutta has been an IT Project Leader in product data management at BMW since 1987.
She has been involved in IT projects since 1978, including at Siemens, Wacker Chemie and Sparkassenverband.

Peter Palm, Chief Visionary Officer (CVO) at Cortex.
Started CortexDB development in 1997.
Holds a Master’s degree in electronic engineering.
Area of expertise: Computer hardware development, Chip design, Independent Design Center for Chip Design (Std-Cell, Gate Array), Operating system development, CRM development since 1986.

Resources

ODBMS.org: Resources related to Cortex.

Related Posts

NoSQL for the Internet of Things. Interview with Mike Williams. ODBMS Industry Watch, June 5, 2014

On making information accessible. Interview with David Leeming. ODBMS Industry Watch, July 30, 2014

On SQL and NoSQL. Interview with Dave Rosenthal. ODBMS Industry Watch, March 18, 2014

Follow ODBMS.org on Twitter: @odbmsorg

##

Sep 21 14

Hadoop at Yahoo. Interview with Mithun Radhakrishnan

by Roberto V. Zicari

“The main challenge when working with “big data” in Yahoo has always been our definition of “big”. :] There are several thousands of feeds on Yahoo’s Hadoop clusters, with daily, hourly and up-to-the-minute data frequencies, spanning Petabytes of data”.–Mithun Radhakrishnan

I have interviewed one of our experts, Mithun Radhakrishnan, member of the Yahoo Hive team.

RVZ

Q1. You work on Apache Hive, in the Yahoo Hadoop team. What are the most current projects you are working on?

Mithun Radhakrishnan: I work on the Hive team at Yahoo. Currently, we are migrating our Hadoop clusters from Hadoop 0.23 (initial release of YARN) to Hadoop 2.5. My team has been focusing on making sure that Hive 0.12 is performant on Hadoop 2.5, as well as rolling out Hive 0.13 to Yahoo’s Grid infrastructure. We have also been busy trying to enhance the performance of Hive queries, as well as of the Hive metastore, to work effectively at Yahoo’s large scale.

Q2. What are the most important challenges you are facing for the deployment, scaling and performance of Hive-related services at Yahoo?

Mithun Radhakrishnan: The main challenge when working with “big data” in Yahoo has always been our definition of “big”. :] There are several thousands of feeds on Yahoo’s Hadoop clusters, with daily, hourly and up-to-the-minute data frequencies, spanning Petabytes of data. Each feed would correspond to a Hive table, with the timestamp (date, hour, minute) being just one of several levels of partition-keys. Some of our more popular feeds add hundreds of thousands of partitions daily, and span millions overall. We’re working on optimizations in Hive’s metadata-storage, to scale to these high levels.
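To see how partition counts reach that order of magnitude, here is a small Python sketch. It builds partition paths in Hive's usual `key=value` directory style and multiplies out the partitions added per day. The table name, the key names (`dt`, `hour`, `minute`, `region`) and all the numbers are assumptions chosen for illustration; they are not Yahoo's actual schema or figures.

```python
from datetime import datetime
from typing import Dict, List


def partition_path(table: str, ts: datetime, extra: Dict[str, str]) -> str:
    """Directory-style path of one partition (key names are hypothetical)."""
    parts = [f"dt={ts:%Y-%m-%d}", f"hour={ts:%H}", f"minute={ts:%M}"]
    parts += [f"{key}={value}" for key, value in extra.items()]
    return "/".join([table] + parts)


def partitions_per_day(minutes_between_loads: int,
                       values_per_extra_key: List[int]) -> int:
    """Partitions added daily: loads per day times every further key level."""
    total = 24 * 60 // minutes_between_loads
    for distinct_values in values_per_extra_key:
        total *= distinct_values
    return total


example = partition_path("ad_clicks", datetime(2014, 9, 21, 13, 5), {"region": "us"})
minutely_with_100_values = partitions_per_day(1, [100])
hourly_only = partitions_per_day(60, [])
```

A feed loaded every minute with one extra partition key of 100 distinct values already adds 144,000 partitions a day, and each of them is a row of metadata that the metastore has to serve.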

Another recent challenge has been the increased adoption of Business Intelligence and Data visualization tools (such as Tableau and MicroStrategy), connected directly to Grid data over HiveServer2. Such use imposes expectations not only on Hive query performance, but also on data transport as well as the metastore.

And finally, the hardware on which Hadoop runs at Yahoo is heterogeneous, accumulated over many years of usage at Yahoo. While our newer clusters use bleeding-edge hardware with gobs of memory, some of our clusters are several years old.
At our scale, we don’t have the luxury of completely replacing our hardware every year. We need our Grid software (Hadoop, Hive, Pig, etc.) to be performant on a variety of processor/memory/disk configurations.

Q3. What kind of Hive-related services did you implement at Yahoo?

Mithun Radhakrishnan: Yahoo has traditionally been an Apache Pig shop, but recently, we’ve seen an increase in the number of Hive jobs. This may be attributed to increased SQL-based analytics, proliferation of Business Intelligence tools, and some use of Hive for data transformations.

At Yahoo, we use HCatalog (i.e. Hive’s metadata server) for interoperability between Pig, Hive and MapReduce. An HCatalog Server runs as a separate service, serving metadata about various datasets.
Users consume this data using Hive directly, or using Pig and MapReduce (via HCatalog wrappers).

The data lifecycle (ingestion, replication and retirement) is managed via the Grid Data Management (GDM) suite, which was a pre-cursor to the Apache Falcon project. GDM is tightly integrated with HCatalog, and deals with data-registrations and discovery with HCatalog.

To enable data analysis and visualization tools for analysts, we deploy HiveServer2 instances. This allows direct JDBC/ODBC based connections to Grid data, to drastically cut down analysis and decision time, as well as unnecessary intermediate copies in a separate data warehouse.

A large number of users employ Oozie jobs to produce/consume Hive data, using Oozie’s “Hive Actions”. The Yahoo Hive and Oozie teams have integrated the two systems to reduce latencies in data processing pipelines.

Q4. What is Y!Grid? And what is it useful for?

Mithun Radhakrishnan: Y!Grid is Yahoo’s Grid of Hadoop Clusters that’s used for all the “big data” processing that happens in Yahoo today. It currently consists of 16 clusters in multiple datacenters, spanning 32,500 nodes, and accounts for almost a million Hadoop jobs every day. Around 10% of those are Hive jobs.

No one else makes as much use out of Hadoop every single day as Yahoo does. Some of the notable use cases of Hadoop at Yahoo include:
Content Personalization: increasing engagement by presenting personalized content to users based on their profile and current activity
Ad Targeting and Optimization: serving the right ad to the right customer by targeting billions of impressions every day based on recent user activities
New Revenue Streams: native ads and mobile search monetization through better serving, budgeting, reporting and analytics
Data Processing Pipelines: aggregating various dimensions of event-level traffic data (page, ad, link views, link clicks, etc.) across billions of audience, search, and advertising events every day
Mail Anti-spam and Membership Anti-abuse: blocking billions of spam emails and hundreds of thousands of abusive accounts per day through machine learning algorithms
Search Assist and Analytics: improving the Yahoo Search experience by processing billions of web pages

Q5. What are the strengths and weaknesses of Hive in your experience?

Mithun Radhakrishnan: Hive combines the immense computing power of Hadoop with the accessibility and expressiveness of SQL. Its main strengths include:
Scale: Hive scales easily to multi-terabyte datasets, and isn’t shackled by memory constraints.
SQL: Hive allows business logic to be expressed in SQL. This lowers the bar of entry for usage, allowing data-analysts with little Hadoop experience to use their expertise with Hadoop data. (Performance tuning is, admittedly, a different kettle of fish. ;])
Standard: Apache Hive supports analytics through Tableau, MicroStrategy and Microsoft Excel, and has supported this for the longest time.
Strong community: The dev community in Hive is brilliant, vibrant and active (as a glance at the Git log would reveal. ;]) We’ve recently seen the introduction of an Apache Tez backend, vectorization support, optimized file-formats like ORC, as well as the promise of very interesting things to come (such as the new Cost Based Optimizer, and an Apache Spark-based back-end).

Which is not to say that everything’s perfect:

M/R: Until recently, Hive’s physical plans could only target MapReduce, which caused multi-stage queries to run quite slowly. Hive 0.13 now supports the expression of physical plans as arbitrary DAGs, using Apache Tez. This dramatically boosts performance, as our benchmarks have shown.
Standard SQL: HiveQL isn’t quite SQL-92-compliant yet, although it’s tending in the right direction. Industry-standard benchmarks like TPC-H and TPC-DS typically need rewriting to run on Hive. To borrow a simile from Rowan Atkinson: it is sort of like Andrew Lloyd Webber rearranging the score of Evita to suit the vocal range of Britney Spears. :]
Metastore performance and data throughput in HiveServer2 still have room for improvement.

Q6. You are an Apache HCatalog committer. What are your most important contributions? Who is currently using HCatalog and for what?

Mithun Radhakrishnan: The Apache HCatalog project has been merged with the main Apache Hive project now.
My work with HCatalog has primarily revolved around integration with other projects. Specifically:
I worked on the HCatalog notification system, to send JMS-compliant notifications in response to changes to a dataset’s metadata. In Yahoo, we use this specifically with Oozie, to kick off Oozie workflows as soon as their dependency dataset-partitions are published in HCatalog. This reduces workflow launch latencies and end-to-end pipeline execution times, while also reducing NameNode pressure caused by polling.
I’ve worked (and am still working) on integration with data ingestion services like GDM. My focus at the moment is on metastore performance, and replication of tables/partitions across HCatalog instances.
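The notification-driven launch described in the first item can be sketched as a small publish/subscribe model in Python. This is an in-memory stand-in, not the HCatalog, JMS or Oozie API: `MetadataBus`, `Workflow` and the table and partition names are all hypothetical. The point is only that a workflow starts the moment its last dependency is published, with no polling loop.

```python
from collections import defaultdict


class MetadataBus:
    """In-memory stand-in for a message topic (not a real JMS client)."""

    def __init__(self):
        self.subscribers = defaultdict(list)     # table -> callbacks

    def subscribe(self, table, callback):
        self.subscribers[table].append(callback)

    def publish_partition(self, table, partition):
        """Called when a new partition of `table` is registered."""
        for callback in self.subscribers[table]:
            callback(table, partition)


class Workflow:
    """Launches once every partition it depends on has been published."""

    def __init__(self, name, required):
        self.name = name
        self.pending = set(required)             # (table, partition) pairs
        self.launched = False

    def on_partition(self, table, partition):
        self.pending.discard((table, partition))
        if not self.pending and not self.launched:
            self.launched = True                 # a real system would start the job here


bus = MetadataBus()
report = Workflow("daily_report", {("clicks", "dt=2014-09-21"),
                                   ("views", "dt=2014-09-21")})
bus.subscribe("clicks", report.on_partition)
bus.subscribe("views", report.on_partition)

bus.publish_partition("clicks", "dt=2014-09-21")
launched_after_first = report.launched           # one dependency still missing
bus.publish_partition("views", "dt=2014-09-21")
launched_after_second = report.launched          # all dependencies present
```

With polling, the same workflow would have to ask repeatedly whether its inputs exist yet; here it is simply told, which is where the latency and NameNode savings come from.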

HCatalog is an integral part of data processing pipelines at Yahoo, given its integration with GDM and Oozie. Outside of Yahoo, HCatalog is also used at Twitter and LinkedIn, as far as I’m aware. I’m sure there are other firms as well.

HCatalog is also used externally by several projects such as Apache Falcon, Apache Oozie, etc.

Q7. You have been benchmarking various versions of Hive. What are the main results you have obtained?

Mithun Radhakrishnan: I’ve had the opportunity to benchmark Apache Hive 0.10 through Hive 0.13, across various scales of input data, multiple data formats and tuning parameters. We’ve observed that the query performance has improved steadily, with each major release. But the jump in Hive 0.13 has been quite phenomenal. The switch to a more expressive physical execution engine in Apache Tez, coupled with vectorization, ORC files and table/column statistics has really paid dividends.

For the Yahoo Hadoop team, the main result from the benchmark was that Apache Hive 0.13 supports a “high dynamic range” of data scale: it is performant enough at the 100GB scale to approach interactivity, while simultaneously also scaling to 10+ TB of data. Given that the system scales over such a wide range, and that Yahoo already deploys Hive in production, we find little reason to deploy any other frameworks for SQL-based analytics on Y!Grid.

Q8. How did you define the workloads for your benchmark of Hive?

Mithun Radhakrishnan: When we started off, we considered creating a Yahoo-specific benchmark: a set of Hive scripts and accompanying datasets to represent the Yahoo workload. The problem was that there was a variety of datasets, and several Hive users, running different kinds of workloads.

In the end, we opted to use the TPC-H benchmarks instead. These are industry standard and more or less representative of the jobs we run at Yahoo. Hortonworks was already running a large subset of TPC-DS benchmarks on Hive. We decided that TPC-H would allow for complementary coverage. We did partition the data and transform it in the way that we would have with production data.

At the time, the comparisons most people were trying to make were between Shark (on Apache Spark) and Hive.
Shark engineers had posted results from running a port of TPC-H, transliterated to Hive’s SQL dialect. We figured we’d get an apples-to-apples comparison by running those scripts against Hive.

Q9. Did you compare Hive with other Big Data software platforms?

Mithun Radhakrishnan: The objective of the benchmark was primarily to track Apache Hive’s progressive performance gains vis-à-vis prior versions. However, I did compare Hive 0.12 and 0.13’s performance against Shark 0.7.1 and Shark 0.8 (which was trunk at the time).

The results were mixed. At the 100GB scale, I did see Shark perform admirably. But I ran into problems with Shark at scale. A large majority of queries simply didn’t complete on Shark at the 10+TB scale. It appeared that a lot of time was lost in shuffling data between consecutive stages. Add to that the facts that Shark was only compatible with Hive Metastore v0.9, that the number of reducers wasn’t deduced per job, and that it lacked support for security and interoperability with our existing production systems, and Apache Hive looked the better fit for Y!Grid.
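For readers unfamiliar with the point about reducers, the sketch below shows what deducing the reducer count per job from its input size can look like. It is loosely modeled on Hive's `hive.exec.reducers.bytes.per.reducer` and `hive.exec.reducers.max` settings, but the function itself and the parameter values used are illustrative assumptions, not the configuration used in the benchmark.

```python
def estimate_reducers(input_bytes: int, bytes_per_reducer: int,
                      max_reducers: int) -> int:
    """One reducer per `bytes_per_reducer` of input, between 1 and the cap."""
    needed = -(-input_bytes // bytes_per_reducer)    # ceiling division
    return max(1, min(max_reducers, needed))


GB = 10 ** 9
# Illustrative settings: 1 GB of input per reducer, at most 999 reducers.
small_job = estimate_reducers(100 * GB, GB, 999)
large_job = estimate_reducers(10_000 * GB, GB, 999)
```

With these settings the 100GB job gets 100 reducers and the 10TB job is capped at 999, whereas a fixed reducer count has to be tuned by hand for each scale.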

I haven’t had the opportunity to compare Hive against other systems yet. I do hope to, as soon as I can find the time, but my day-job keeps me pretty busy. :]

Qx. Anything else you wish to add?

Mithun Radhakrishnan: Lots of people fret over query performance. Performance is important, but one must think holistically about the data and workload that need to be processed, hardware choices available at various price points, long-term TCOs of operating the system, current and future use cases, support, etc. Everyone’s situation will be a bit different, and that is something to take into account when thinking about a SQL-on-Hadoop solution.

————
My name is Mithun Radhakrishnan. I work on Apache Hive, in the Yahoo Hadoop team. My team is responsible for the deployment, scaling and performance of Hive-related services (including HCatalog and HiveServer2) on the Y!Grid, the largest production Hadoop Clusters in existence today.

I’ve been working on Hadoop-related projects in Yahoo since 2009, including the Grid Data Management System (pre-cursor to Apache Falcon), HCatalog and Hive. I’m an Apache HCatalog committer and Hive contributor. Prior to working at Yahoo, I was a firmware developer at Hewlett-Packard, writing hardware self-diagnostic and healing firmware for HP’s big-iron boxen (Integrity Servers, running Intel Itaniums).

I’m currently working broadly on getting the Hive Metastore to perform at Yahoo-scale.

I’ve recently had the pleasure of benchmarking various versions of Hive (0.10-13), with different settings, file-formats, etc., to gauge progressive performance gains. I’ll be presenting my findings at Strata 2014.

Resources

Hive on Apache Tez: Benchmarked at Yahoo! Scale. Mithun Radhakrishnan (Yahoo! Inc.). Talk at Strata+Hadoop Conference. 2:35pm Thursday, 10/16/2014

Related Posts

On Big Data benchmarks. Interview with Francois Raab and Yanpei Chen. ODBMS Industry Watch, August 14, 2014

On the Hadoop market. Interview with John Schroeder. ODBMS Industry Watch, June 30, 2014

On Spring for Apache Hadoop. Interview with Thomas Risberg. ODBMS Industry Watch, May 28, 2014

##