Data Curation at Scale: The Data Tamer System
Authors: Michael Stonebraker, Daniel Bruckner, George Beskales, Mitch Cherniack, Ihab F. Ilyas, Stan Zdonik, Alexander Pagan, Shan Xu.
Data curation is the act of discovering one or more data sources of interest, cleaning and transforming the new data, semantically integrating it with other local data sources, and deduplicating the resulting composite. There has been much research on the various components of curation (especially data integration and deduplication). However, there has been little work on collecting all of the curation components into an integrated end-to-end system.
In addition, most of the previous work will not scale to the sizes of problems that we are finding in the field. For example, one web aggregator requires the curation of 80,000 URLs, and a biotech company has the problem of curating 8,000 spreadsheets. At this scale, data curation cannot be a manual (human) effort, but must entail machine learning approaches with a human assist only when necessary.
This paper describes Data Tamer, an end-to-end curation system we have built at M.I.T., Brandeis, and Qatar Computing Research Institute (QCRI). It expects as input a sequence of data sources to add to a composite being constructed over time. A new source is subjected to machine learning algorithms to perform attribute identification, grouping of attributes into tables, transformation of incoming data, and deduplication. When necessary, a human can be asked for guidance. Also, Data Tamer includes a data visualization component so a human can examine a data source at will and specify manual transformations.
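The idea of automating attribute identification and asking a human only when necessary can be illustrated with a minimal sketch. This is not Data Tamer's algorithm: it swaps in plain string similarity from Python's standard library for the machine learning step, and the function names, sample attribute names, and the 0.8 acceptance threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (a stand-in for a learned matcher)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_attributes(incoming, known, accept=0.8):
    """Map each incoming attribute to its closest known attribute.

    Matches scoring at or above `accept` (a hypothetical threshold) are taken
    automatically; everything else is queued for human review.
    """
    auto = {}
    needs_review = []
    for attr in incoming:
        best = max(known, key=lambda k: similarity(attr, k))
        score = similarity(attr, best)
        if score >= accept:
            auto[attr] = best
        else:
            needs_review.append((attr, best, score))
    return auto, needs_review


# Hypothetical attribute names from a new source and from the existing composite.
auto, needs_review = match_attributes(
    ["Price", "phone_no", "zip"],
    ["price", "phone_number", "postal_code"],
)
print("accepted automatically:", auto)
print("sent to a human:", [attr for attr, _, _ in needs_review])
```

Raising the threshold sends more attributes to a human and lowers the risk of a wrong automatic match; lowering it does the opposite.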
We have run Data Tamer on three real-world enterprise curation problems, and it has been shown to lower curation cost by about 90%, relative to the currently deployed production software.
Download article (.PDF): Data Tamer CIDR13_Paper28.pdf
This article is published under a Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits distribution and reproduction in any medium as well as allowing derivative works, provided that you attribute the original work to the author(s) and CIDR 2013. 6th Biennial Conference on Innovative Data Systems Research (CIDR ’13), January 6-9, 2013, Asilomar, California, USA.