On Data Curation. Interview with Andy Palmer
“We propose more data transparency not less.”—Andy Palmer
I have interviewed Andy Palmer, a serial entrepreneur who co-founded Tamr with database scientist and MIT professor Michael Stonebraker.
Happy and Peaceful 2015!
Q1. What is the business proposition of Tamr?
Andy Palmer: Tamr provides a data unification platform that reduces the time and effort of connecting and enriching multiple data sources by as much as 90%, giving a unified view of siloed enterprise data. Using Tamr, organizations are able to complete data unification projects in days or weeks versus months or quarters, dramatically accelerating time to analytics.
This capability is particularly valuable to businesses: they can get a 360-degree view of the customer, unify supply chain data such as parts catalogs and supplier lists to reduce costs or risk, and speed up conversion of clinical trial data for submission to the FDA.
Q2. What are the main technological and business challenges in producing a single, unified view across various enterprise ERPs, Databases, Data Warehouses, back-office systems, and most recently sensor and social media data in the enterprise?
Andy Palmer: Technological challenges include:
– Siloed data, stored in varying formats and standards
– Disparate systems, instrumented but expensive to consolidate and difficult to synchronize
– Inability to use knowledge from data owners/experts in a programmatic way
– Top-down, rules-based approaches not able to handle the extreme variety of data typically found, for example, in large PLM and ERP systems.
Business challenges include:
– Globalization, where similar or duplicate data may exist in different places in multiple divisions
– M&As, which can increase the volume, variety and duplication of enterprise data sources overnight
– No complete view of enterprise data assets
– “Analysis paralysis,” the inability of business people to access the data they want/need because IT people are in the critical path of preparing it for analysis
Tamr can connect and enrich data from internal and external sources, from structured data in relational databases, data warehouses, back-office systems and ERP/PLM systems to semi- or unstructured data from sensors and social media networks.
Q3. How do you manage to integrate various part and supplier data sources to produce a unified view of vendors across the enterprise?
Andy Palmer: Patent-pending technology using machine learning algorithms performs most of the work, unifying up to 90% of supplier, part and site entities by:
– Referencing each transaction and record across many data sources
– Building correct supplier names, addresses, IDs, etc. for a variety of analytics
– Cataloging into an organized inventory of sources, entities, and attributes
When human intervention is necessary, Tamr generates questions for data experts, aggregates responses, and feeds them back into the system. This feedback enables Tamr to continuously improve its accuracy and speed.
Q4. Who should be using Tamr?
Andy Palmer: Organizations whose business and profitability depend on being able to do analysis on a unified set of data, and ask questions of that data, should be using Tamr. Examples include:
– A manufacturer that wants to optimize spend across supply chains, but lacks a unified view of parts and suppliers.
– A biopharmaceutical company that needs a unified view of diverse clinical trials data to convert it to mandated CDISC standards for ongoing submissions to the FDA, but lacks an automated and repeatable way to do this.
– A financial services company that wants a unified view of its customers, but lacks an efficient, repeatable way to unify customer data across multiple systems, applications, and its consumer banking, loans, wealth management and credit card businesses.
– The research arm of a pharmaceutical company that wants to unify data on bioassay experiments across 8,000 research scientists, to achieve economies, avoid duplication of effort and enable better collaboration.
Q5. “Data transparency” is not always welcome in the enterprise, mainly due to non-technical reasons. What do you suggest to do in order to encourage people in the enterprise to share their data?
Andy Palmer: We propose more data transparency not less.
This is because in most companies, people don’t even know what data sources are available to them, let alone have insight into them or use of them. With Tamr, companies can create a catalog of all their enterprise data sources; they can then choose how transparent to make those individual data sources, by showing metadata about each. Then, they can control usage of the data sources using the enterprise’s access management and security policies/systems.
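To make the catalog idea concrete, here is a minimal sketch in Python. It is not Tamr's implementation: the class names, the `listed` flag, and the group-based `can_read` check are all hypothetical stand-ins for a real catalog and for the enterprise's own access-management system.

```python
from dataclasses import dataclass, field


@dataclass
class SourceEntry:
    """Metadata describing one data source (all fields are illustrative)."""
    name: str
    owner: str
    description: str
    listed: bool = True  # hypothetical flag: show this source's metadata in the catalog?
    allowed_groups: set = field(default_factory=set)


class Catalog:
    """A toy inventory of data sources, keyed by source name."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def browse(self):
        """Return metadata only, and only for sources marked as listed."""
        return [
            {"name": e.name, "owner": e.owner, "description": e.description}
            for e in self._entries.values()
            if e.listed
        ]

    def can_read(self, source_name, user_groups):
        """Stand-in for a call to the enterprise access-management system."""
        entry = self._entries[source_name]
        return bool(entry.allowed_groups & set(user_groups))


catalog = Catalog()
catalog.register(SourceEntry("supplier_master", "procurement", "Global supplier list",
                             allowed_groups={"procurement", "finance"}))
catalog.register(SourceEntry("hr_salaries", "hr", "Payroll data",
                             listed=False, allowed_groups={"hr"}))

visible = catalog.browse()
```

The point of the split is that seeing that a source exists (`browse`) is a separate decision from being allowed to use it (`can_read`).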
On the business side, we have found that people in enterprises typically want an easier way to share the data sources they have built or nurtured: a way that gets them out of the critical path.
Tamr makes people’s data usable by many others and for many purposes, while eliminating the busywork involved.
Q6. What is Data Curation and why is it important for Big Data?
Andy Palmer: Data Curation is the process of creating a unified view of your data with the standards of quality, completeness, and focus that you define. A typical curation process consists of:
– Identifying data sets of interest (whether from inside the enterprise or outside),
– Exploring the data (to form an initial understanding),
– Cleaning the incoming data (for example, 99999 is not a valid ZIP code),
– Transforming the data (for example, to remove phone number formatting),
– Unifying it with other data of interest (into a composite whole), and
– Deduplicating the resulting composite.
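The cleaning, transforming, unifying and deduplicating steps above can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not how Tamr works: the record fields, the `INVALID_ZIPS` set, and the exact-key deduplication rule are all hypothetical.

```python
import re

# Hypothetical placeholder values treated as invalid; 99999 is the example from the text.
INVALID_ZIPS = {"99999", "00000"}


def clean(record):
    """Blank out ZIP codes that are malformed or known placeholders."""
    zip_code = str(record.get("zip", "")).strip()
    if not re.fullmatch(r"\d{5}", zip_code) or zip_code in INVALID_ZIPS:
        zip_code = None
    return {**record, "zip": zip_code}


def transform(record):
    """Remove phone number formatting, keeping digits only."""
    phone = re.sub(r"\D", "", str(record.get("phone", "")))
    return {**record, "phone": phone or None}


def unify(*sources):
    """Clean and transform every record, then combine all sources into one list."""
    return [transform(clean(r)) for source in sources for r in source]


def deduplicate(records, key=("name", "phone")):
    """Keep the first record per key; later duplicates only fill in missing fields."""
    merged = {}
    for record in records:
        k = tuple(str(record.get(f) or "").lower() for f in key)
        if k not in merged:
            merged[k] = dict(record)
        else:
            for f, value in record.items():
                if merged[k].get(f) is None:
                    merged[k][f] = value
    return list(merged.values())


# Two hypothetical sources describing the same supplier differently.
crm = [{"name": "Acme Corp", "phone": "(617) 555-0100", "zip": "99999"}]
erp = [{"name": "ACME CORP", "phone": "617.555.0100", "zip": "02139"}]

curated = deduplicate(unify(crm, erp))
```

Here the two records collapse into one, and the placeholder ZIP from the first source is replaced by the valid ZIP from the second. Real data rarely matches on an exact key, which is where probabilistic matching comes in.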
Data Curation is important for Big Data because people want to mix and match from all the data available to them, external and internal, for analytics and downstream applications that give them competitive advantage. Tamr is important because traditional, rule-based approaches to data curation are not sufficient to solve the problem of broad integration.
Q7. What does it mean to do “fuzzy” matches between different data sources?
Andy Palmer: Tamr can make educated guesses that two similar fields refer to the same entity even though the fields describe it differently: for example, Tamr can tell that “IBM” and “International Business Machines” refer to the same company.
In Supply Chain data unification, fuzzy matching is extremely helpful in speeding up entity and attribute resolution between parts, suppliers and customers.
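As a rough illustration of fuzzy matching, the sketch below scores two company names using Python's standard `difflib.SequenceMatcher`, plus a simple acronym check so that “IBM” and “International Business Machines” are linked. It is an assumption-laden toy, not Tamr's algorithm: the `SUFFIXES` list and the fixed 0.9 acronym score are hypothetical choices.

```python
from difflib import SequenceMatcher

# Hypothetical list of legal suffixes to ignore when comparing names.
SUFFIXES = {"inc", "corp", "corporation", "co", "ltd", "llc"}

# Hypothetical confidence assigned when one name is the acronym of the other.
ACRONYM_SCORE = 0.9


def normalize(name):
    """Lowercase, strip surrounding punctuation, and drop legal suffixes."""
    tokens = [t.strip(".,").lower() for t in name.split()]
    return [t for t in tokens if t and t not in SUFFIXES]


def is_acronym(short, long):
    """True if the single token in `short` spells the initials of `long`."""
    return len(short) == 1 and len(long) > 1 and short[0] == "".join(t[0] for t in long)


def match_score(a, b):
    """Return a similarity score in [0, 1] for two entity names."""
    ta, tb = normalize(a), normalize(b)
    if not ta or not tb:
        return 0.0
    if is_acronym(ta, tb) or is_acronym(tb, ta):
        return ACRONYM_SCORE
    # Fall back to character-level similarity of the normalized names.
    return SequenceMatcher(None, " ".join(ta), " ".join(tb)).ratio()


ibm = match_score("IBM", "International Business Machines")
acme = match_score("Acme Inc", "ACME, Inc.")
other = match_score("IBM", "Intel")
```

A score near 1 suggests the same entity and a score near 0 suggests different ones. The middle range is where a system would be least sure of its guess.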
Tamr’s secret sauce: Connecting hundreds or thousands of sources through a bottom-up, probabilistic solution reminiscent of Google’s approach to web search and connection.
Tamr’s upside: it becomes the Google of Enterprise Data, using probabilistic data source connection and curation to revolutionize enterprise data analysis.
Q8. What is data unification and how effective is it to use Machine Learning for this?
Andy Palmer: Data Unification is part of the curation process, during which related data sources are connected to provide a unified view of a given entity and its associated attributes. Tamr’s application of machine learning is very effective: it can get you 90% of the way to data unification in many cases, then involve human experts strategically to guide unification the rest of the way.
Q9. How do you leverage the knowledge of existing business experts for guiding/modifying the machine learning process?
Andy Palmer: Patent-pending technology using machine learning algorithms performs most of the data integration work. When human intervention is necessary, Tamr generates simple yes-no questions for data experts, aggregates their responses, and feeds them back into the system. This feedback enables Tamr to continuously improve its accuracy and speed.
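The expert feedback loop described here can be pictured with a small Python sketch. It assumes a matcher that emits a score per candidate pair; the thresholds, the majority-vote rule, and the sample data are hypothetical and are not taken from Tamr.

```python
from collections import Counter

# Hypothetical thresholds: above AUTO_MATCH accept, below AUTO_REJECT reject, else ask.
AUTO_MATCH, AUTO_REJECT = 0.9, 0.3


def triage(scored_pairs):
    """Split (pair, score) items into automatic decisions and questions for experts."""
    decided, questions = {}, []
    for pair, score in scored_pairs:
        if score >= AUTO_MATCH:
            decided[pair] = True
        elif score <= AUTO_REJECT:
            decided[pair] = False
        else:
            questions.append(pair)  # "Do these two records refer to the same entity?"
    return decided, questions


def aggregate(answers):
    """Majority vote over yes/no answers; a tie (or no answers) returns None."""
    counts = Counter(answers)
    if counts[True] == counts[False]:
        return None
    return counts[True] > counts[False]


scored = [
    (("Acme Corp", "ACME Corporation"), 0.95),
    (("Acme Corp", "Apex Corp"), 0.60),
    (("Acme Corp", "Zenith Ltd"), 0.10),
]
decided, questions = triage(scored)

# Hypothetical yes/no answers collected from three data experts.
expert_answers = {("Acme Corp", "Apex Corp"): [False, False, True]}

# Labels that would be fed back to the matcher as new training examples.
labels = dict(decided)
for pair in questions:
    verdict = aggregate(expert_answers.get(pair, []))
    if verdict is not None:
        labels[pair] = verdict
```

Only the uncertain middle pair reaches the experts, so their time goes to the cases the machine cannot settle on its own.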
Q10. With Tamr you claim that less human involvement is required as the system “learns.” What, in your opinion, are the challenges and possible dangers of such an “automated” decision-making process if it is not properly used or understood? Isn’t there a danger of replacing the experts with intelligent machines?
Andy Palmer: We aren’t replacing human experts at all: we are bringing them into the decision-making process in a high-value, programmatic way. And there are data stewards and provenance and governance procedures in place that control how this is done. For example: in one of our pharma customers, we’re actually bringing the research scientists who created the data into the decision-making process, capturing their wisdom in Tamr. Before, they were never asked: some guy in IT was trying to guess what each scientist meant when he created his data. Or the scientists were asked via email, which, due to the nature of the biopharmaceutical industry, required printing out the emails for audit purposes.
Q11. How do you quantify the cost savings using Tamr?
Andy Palmer: The biggest savings aren’t from the savings in data curation (although these are significant), but from the opportunities for savings uncovered through analysis of unified data: opportunities that wouldn’t otherwise have been discovered. For example, by being able to create and update a ‘golden record’ of suppliers across different countries and business groups, Tamr can provide a more comprehensive view of supplier spend.
You can use this view to identify long-tail opportunities for savings across many smaller suppliers, instead of the few large vendors visible to you without Tamr.
In the aggregate, these long-tail opportunities can easily account for 85% of total spend savings.
Q12. Could you give us some examples of use cases where Tamr is making a significant difference?
Andy Palmer: Supply Chain Management, for streamlining spend analytics and spend management. Unified views of supplier and parts data enable optimization of supplier payment terms and identification of “long-tail” savings opportunities in small or outlier suppliers that were not easily identifiable before.
Clinical Trials Management, for automated conversion of multi-source/multi-standard CDISC data (typically stored in SAS databases) to meet submission standards mandated by regulators.
Tamr eliminates manual methods, which are usually conducted by expensive outside consultants and can result in additional, inflexible data stored in proprietary formats; and provides a scalable, repeatable process for data conversion (IND/NDA programs necessitate frequent resubmission of data).
Sales and Marketing, for achieving a unified view of the customer.
Tamr enables the business to connect and unify customer data across multiple applications, systems and business units, to improve segmentation/targeting and ultimately sell more products and services.
Andy Palmer, Co-Founder and CEO, Tamr Inc.
Andy Palmer is co-founder and CEO of Tamr, Inc. Palmer co-founded Tamr with fellow entrepreneur Michael Stonebraker, PhD. Previously, Palmer was co-founder and founding CEO of Vertica Systems, a pioneering big data analytics company (acquired by HP). During his career as an entrepreneur, Palmer has served as founder, founding investor, BOD member or advisor to more than 50 start-up companies. He also served as Global Head of Software Engineering and Architecture at Novartis Institutes for BioMedical Research (NIBR) and as a member of the start-up team and Senior Vice President of Operations and CIO at Infinity Pharmaceuticals (NASDAQ: INFI). He earned undergraduate degrees in English, history and computer science from Bowdoin College, and an MBA from the Tuck School of Business at Dartmouth.
January 27th at 1PM
Webinar: Toward Automated, Scalable CDISC Conversion
John Keilty, Third Rock Ventures | Timothy Danford, Tamr, Inc.
During a one-hour webinar, join John Keilty, former VP of Informatics at Infinity Pharmaceuticals, and Timothy Danford, CDISC Solution Lead for Tamr, as they discuss some of the key challenges in preparing clinical trial data for submission to the FDA, and the problems associated with current preparation processes.