HoloClean: A Machine Learning System for Data Enrichment
by Roberto Zicari
HoloClean is a statistical inference engine for imputing, cleaning, and enriching data. As a weakly supervised machine learning system, it combines available quality rules, value correlations, reference data, and other signals into a probabilistic model that accurately captures the data generation process, and it uses that model in a variety of data curation tasks. HoloClean saves data practitioners and scientists the considerable time they would otherwise spend building piecemeal cleaning solutions. Instead, they communicate their domain knowledge declaratively, which enables accurate analytics, predictions, and insights from noisy, incomplete, and erroneous data.
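To make the idea of combining a quality rule with value correlations concrete, here is a minimal sketch in plain Python. It is not HoloClean's API and does not reproduce its learned probabilistic model: the toy table, the rule "zip determines city", and the function names `cooccurrence`, `noisy_cells`, and `suggest_repairs` are all assumptions made for illustration, and a simple co-occurrence frequency stands in for real inference.

```python
from collections import Counter, defaultdict

# Hypothetical toy table: one misspelled city and one missing city.
rows = [
    {"zip": "60608", "city": "Chicago"},
    {"zip": "60608", "city": "Chicago"},
    {"zip": "60608", "city": "Cicago"},
    {"zip": "60608", "city": None},
    {"zip": "53703", "city": "Madison"},
    {"zip": "53703", "city": "Madison"},
]


def cooccurrence(rows, given, target):
    """Count how often each target value appears with each given value."""
    counts = defaultdict(Counter)
    for r in rows:
        if r[given] is not None and r[target] is not None:
            counts[r[given]][r[target]] += 1
    return counts


def noisy_cells(rows, given, target):
    """Return indices of rows whose target cell is missing or takes part
    in a violation of the quality rule 'given determines target'."""
    counts = cooccurrence(rows, given, target)
    return [
        i for i, r in enumerate(rows)
        if r[target] is None or len(counts[r[given]]) > 1
    ]


def suggest_repairs(rows, given, target):
    """For each noisy cell, propose the most frequent co-occurring value
    together with its relative frequency as a rough confidence score."""
    counts = cooccurrence(rows, given, target)
    repairs = {}
    for i in noisy_cells(rows, given, target):
        candidates = counts[rows[i][given]]
        if not candidates:
            continue  # no evidence available for this cell
        value, n = candidates.most_common(1)[0]
        repairs[i] = (value, n / sum(candidates.values()))
    return repairs


repairs = suggest_repairs(rows, "zip", "city")
for i, (value, score) in sorted(repairs.items()):
    print(i, rows[i]["city"], "->", value, round(score, 2))
```

The rule is stated once, declaratively, and the statistics of the data decide which value to prefer. Every cell involved in a violation is treated as potentially noisy, so cells that are already correct are simply reassigned their current value.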
Resources
Christopher De Sa, Ihab F. Ilyas, Benny Kimelfeld, Christopher Ré, and Theodoros Rekatsinas, A formal framework for probabilistic unclean databases, Manuscript, 2018. [PDF]
Theodoros Rekatsinas, Xu Chu, Ihab F. Ilyas, and Christopher Ré, HoloClean: Holistic data repairs with probabilistic inference, PVLDB 10 (2017), no. 11, 1190-1201. [PDF]
Theodoros Rekatsinas, Manas Joglekar, Hector Garcia-Molina, Aditya Parameswaran, and Christopher Ré, SLiMFast: Guaranteed results for data fusion and source reliability, SIGMOD 2017. [PDF]
Ihab F. Ilyas and Xu Chu, Trends in Cleaning Relational Data: Consistency and Deduplication, Foundations and Trends in Databases 2015. [PDF]