On Dark Data. Interview with Gideon Goldin
“Top-down cataloging and master data management tools typically require expensive data curators, and are not simple to use. This poses a significant threat to cataloging efforts since so much knowledge about your organization’s data is inevitably clustered across the minds of the people who need to question it and the applications they use to answer those questions.” –Gideon Goldin
I have interviewed Gideon Goldin, UX Architect, Product Manager at Tamr.
Q1. What is “dark data”?
Gideon Goldin: Gartner refers to dark data as “the information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing).” For most organizations, dark data comprises the majority of available data, and it is often the result of the constantly changing and unpredictable nature of enterprise data, something that is likely to be exacerbated by corporate restructuring, M&A activity, and a number of external factors.
By shedding light on this data, organizations are better positioned to make more data-driven, accurate business decisions.
Tamr Catalog, which is available as a free downloadable app, aims to do this, providing users with a view of their entire data landscape so they can quickly understand what was in the dark and why.
Q2. What are the main drawbacks of traditional top-down methods of cataloging or “master data management”?
Gideon Goldin: The main drawbacks are scalability and simplicity. When Yahoo, for example, started to catalog the web they employed some top-down approaches, hiring specialists to curate structured directories of information. As the web grew, however, their solution became less relevant and significantly more costly. Google, on the other hand, mined the web to understand references that exist between pages, allowing the relevance of sites to emerge from the bottom up. As a result, Google’s search engine was more accurate, easier to scale, and simpler.
Top-down cataloging and master data management tools typically require expensive data curators, and are not simple to use. This poses a significant threat to cataloging efforts since so much knowledge about your organization’s data is inevitably clustered across the minds of the people who need to question it and the applications they use to answer those questions. Tamr Catalog aims to deliver an innovative and vastly simplified method for cataloging your organization’s data.
Q3. Tamr recently opened a public Beta program, Tamr Catalog, for an enterprise metadata catalog. What is it?
Gideon Goldin: The Tamr Catalog Beta Program is an open invitation to test-drive our free cataloging software. We have yet to find an organization that is content with their current cataloging approaches, and we found that the biggest barrier to reform is often knowing where to start. Catalog can help: the goal of the Catalog Beta Program is to better understand how people want and need to collaborate around their data sources. We believe that an early partnership with the community will ensure that we develop useful functionality and thoughtful design.
Q4. What is the core functionality of Tamr Catalog?
Gideon Goldin: Tamr Catalog enables users to easily register, discover and organize their data assets.
Q5. How does it help simplify access to high-quality data sets for analytics?
Gideon Goldin: Not surprisingly, people are biased to use the data sets closest to them. With Catalog, scientists and analysts can easily discover unfamiliar data sets (data sets, for example, that may belong to other departments or analysts). Catalog profiles and collects pointers to your sources, providing multifaceted and visual browsing of all data, trivializing the search for any given set of data.
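To make the idea of profiling sources and collecting pointers to them concrete, here is a minimal Python sketch. It is an illustration only, not Tamr Catalog's implementation or API: the names `SourceEntry`, `profile_rows`, and `Catalog`, the example URIs, and the choice of per-column statistics are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class SourceEntry:
    """One hypothetical catalog entry: a pointer to a data set plus a light profile."""
    name: str
    uri: str      # pointer to where the data lives; the data itself is not copied
    owner: str
    columns: dict = field(default_factory=dict)  # column name -> profile statistics


def profile_rows(rows):
    """Profile a sample of rows (a list of dicts), column by column."""
    raw = {}
    for row in rows:
        for column, value in row.items():
            stats = raw.setdefault(column, {"count": 0, "nulls": 0, "distinct": set()})
            stats["count"] += 1
            if value is None:
                stats["nulls"] += 1
            else:
                stats["distinct"].add(value)
    # Replace each set of distinct values with its size.
    return {
        column: {"count": s["count"], "nulls": s["nulls"], "distinct": len(s["distinct"])}
        for column, s in raw.items()
    }


class Catalog:
    """A toy in-memory catalog (an assumption, not a real product interface)."""

    def __init__(self):
        self._entries = {}

    def register(self, name, uri, owner, sample_rows):
        """Store a pointer to the source together with a profile of a sample."""
        entry = SourceEntry(name, uri, owner, profile_rows(sample_rows))
        self._entries[name] = entry
        return entry

    def discover(self, keyword):
        """Return entries whose name or any column name contains the keyword."""
        keyword = keyword.lower()
        return [
            entry for entry in self._entries.values()
            if keyword in entry.name.lower()
            or any(keyword in column.lower() for column in entry.columns)
        ]


# Two sources owned by different departments (made-up names and URIs).
catalog = Catalog()
customers = catalog.register(
    "crm_customers", "postgres://crm/customers", "sales",
    [{"customer_id": 1, "email": "a@example.com"},
     {"customer_id": 2, "email": None}],
)
catalog.register(
    "web_orders", "s3://hypothetical-bucket/orders.csv", "marketing",
    [{"order_id": 10, "customer_id": 1}],
)

# An analyst in one department can find the other department's data set.
print([entry.name for entry in catalog.discover("customer_id")])
```

In this sketch, searching by column name surfaces both sources even though they belong to different owners, which is the kind of cross-department discovery described above.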
Q6. How does Tamr Catalog relate to the Tamr Data Unification Platform?
Gideon Goldin: Before organizations can unify their data, preparing it for improved analysis or management, they need to know what they have. Organizations often lack a good approach for this first (and repeating) step in data unification. We realized this quickly when helping large organizations begin their unification projects, and we even realized we lacked a satisfactory tool to understand our own data. Thus, we built Catalog as a part of the Tamr Data Unification Platform to illuminate your data landscape, such that people can be confident that their unification efforts are as comprehensive as possible.
Q7. What are the main challenges (technical and non-technical) in achieving a broad adoption of vendor- and platform-neutral metadata cataloging?
Gideon Goldin: Often the challenge isn’t about volume, it’s about variety. While a vendor-neutral Catalog intends to solve exactly this, there remains a technical challenge in providing a flexible and elegant interface for cataloging dozens or hundreds of different types of data sets and the structures they comprise.
However, we find that some of the biggest (and most interesting) challenges revolve around organizational processes and culture. Some organizations have developed sophisticated but unsustainable approaches to managing their data, while others have become paralyzed by the inherently disorganized nature of their data. It can be difficult to appreciate the value of investing in these problems. Figuring out where to start, however, shouldn’t be difficult. This is why we chose to release a lightweight application free of charge.
Q8. Chief Data Officers (CDOs), data architects and business analysts have different requirements and different modes of collaborating on (shared) data sets. How do you address this in your catalog?
Gideon Goldin: The goal of cataloging isn’t cataloging, it’s helping CDOs identify business opportunities, empowering architects to improve infrastructures, enabling analysts to enrich their studies, and more. Catalog allows anyone to register and organize sources, encouraging open communication along the way.
Q9. How do you handle issues such as data protection, ownership, provenance and licensing in the Tamr catalog?
Gideon Goldin: Catalog allows users to indicate who owns what. Over the course of our Beta program, we have been fortunate enough to have over 800 early users of Catalog and have collected feedback about how our users would like to see data protection and provenance implemented in their own environments. We are eager to release new functionality to address these needs in the near future.
Q10. Do you plan to use the Tamr Catalog also for collecting data sets that can be used for data projects for the Common Good?
Gideon Goldin: We do know of a few instances of Catalog being used for such purposes, including projects that will build on the documenting of city and health data. In addition to our Catalog Beta Program, we are introducing a Community Developer Program, where we are eager to see how the community links Tamr Catalogs to new sources (including those in other catalogs), new analytics and visualizations, and ultimately insights. We believe in the power of open data at Tamr, and we’re excited to learn how we can help the Common Good.
Gideon Goldin, UX Architect, Product Manager at Tamr.
Prior to Tamr, Gideon Goldin worked as a data visualization/UX consultant and university lecturer. He holds a Master’s in HCI and a PhD in cognitive science from Brown University, and is interested in designing novel human-machine experiences. You can reach Gideon on Twitter at @gideongoldin or email him at Gideon.Goldin at tamr.com.
-Tamr Catalog Developer Community
Online community where Tamr Catalog users can comment, interact directly with the development team, and learn more about the software, and where developers can explore extending the tool by creating new data connectors.