What is data blending
By Oleg Roderick, David Sanchez, Geisinger Data Science, November 2015
2 Comparable Topics in Applied Science
3 Expectations for Participants
4 History of Data Blending
5 Extending the Method
Contest participants (Geisinger Health Collider Project) asked the Geisinger team to provide some additional guidance on the task of data blending. We write this note with a single reservation: as of 2014-2015, there is no concise dictionary definition. There are multiple terms: data integration, data blending, data fusion, use of data from multiple sources, data acquisition. They describe roughly the same applied practice and respond to the same challenge, namely, that the flow of information is not, by nature, well organized. Recorded knowledge enters our focus of attention in the wrong order, in very inconvenient formats, and at varying levels of quality.
Informally: the answer is not always written in the same book as the question. Thus, we must learn to decipher it from multiple books. Some of them are in a foreign language, some are hundreds of times thicker than others, and most of them are by different authors who have never agreed on a literary style. And there is no catalogue.
Data integration refers to collection of data from multiple sources, including changes of format and cleanup of redundant or useless entries. The outcome is a standardized, unified table.
Data fusion almost invariably means integration of imperfect data sources overlapping over a small group of objects (perhaps a single object, think: target tracking).
Data blending (as we have been using it) allows sources to be imperfect, incomplete, and to overlap over a few objects or none at all, requiring inspired guesses and generalizations. These guesses will then be subjected to rigorous hypothesis testing, which is where it becomes science again, not a narrative about data.
We are at the initial stage of multidisciplinary investigation. Thus, we are happy with having just the narrative about data.
Comparable Topics in Applied Science
Consider the core data set describing patients in medical care. The variables (or features) describing patients are schematically divided into two groups: [A] [B]. There is also an outcome, or a set of outcomes of interest, Y=F(A,B), approximated by a model M: Y=M(A,B). Here, F is the unknowable “true” relationship between cause and effect. M is its practical approximation, with simplifications and noise.
Features in A are private, highly specialized (difficult to obtain, transfer or interpret without additional skills and tools, e.g. 3-d internal scans); many of them are in the status of unknown knowns (i.e. we don’t know if such measurements are possible before we specifically ask for them). External researchers should not be able to see A until they have very specific reasons.
Features in B can be shared for research. They are standard clinical variables listed in our data dictionaries. We provide ~100 variables, but in practice there are time-dependent sets of thousands of features (more if we include genomic data). From roughly the 1970s through the 2010s, they were used extensively in clinical informatics. Modern methods of statistics / machine learning were used to create models in the format Y=M(B). For many such models, we have reached a stage where incremental development continues, but qualitative improvement is very hard or impossible.
The connected society of the 21st century offers another approach to the analysis of healthcare data: blending it with socioeconomic data. Consider the general population as represented through multiple information sources. In the population, a person (or an aggregate of a group of people) will be characterized by features [B’] [C]. Here B’ is a much simplified subset of B, perhaps including only a few general labels for the condition of interest, such as: ‘obesity’, or ‘PTSD’, or ‘clinical inpatient’. The most interesting part is the variables in [C], as many of them were never considered part of a healthcare study.
The task of data blending, put very simply, consists of two parts:
- Given [B], find data sets that include [B’ C]. Even that is valuable: many, if not most, academic data science programs have not developed this capacity.
- Make a case that [B’ C] can be used to predict Y. It does not have to predict “better”; we are happy with additional volume of data at the cost of prediction quality. The narrative does not have to be mathematically rigorous; we hope to see a lot of arguments using subject matter knowledge.
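As a rough illustration of the first part, the sketch below attaches area-level public features to patient-level records through a shared zip code. All field names and values are hypothetical and invented for this example; they are not taken from the Geisinger data dictionaries. Here obesity_rate plays the role of B’, and the remaining public columns play the role of C.

```python
# Hypothetical clinical records [B]: one row per patient.
patients = [
    {"patient_id": 1, "zip": "17821", "obesity": True},
    {"patient_id": 2, "zip": "17821", "obesity": False},
    {"patient_id": 3, "zip": "99999", "obesity": True},  # no public data for this zip
]

# Hypothetical public data [B' C]: one row per zip code.
public_by_zip = {
    "17821": {"obesity_rate": 0.31, "median_income": 52000, "grocery_stores": 4},
    "18702": {"obesity_rate": 0.38, "median_income": 34000, "grocery_stores": 1},
}


def blend(patient_rows, area_rows):
    """Left-join area-level features onto patient rows by zip code.

    Area features are prefixed with "area_" so they cannot be mistaken for
    patient-level measurements. Patients with no match keep only their own fields.
    """
    blended = []
    for row in patient_rows:
        area = area_rows.get(row["zip"], {})
        prefixed = {f"area_{name}": value for name, value in area.items()}
        blended.append({**row, **prefixed})
    return blended


blended = blend(patients, public_by_zip)
for row in blended:
    print(row)
```

The prefix is a deliberate design choice: an area-level value describes the neighborhood, not the person, and any later argument that [B’ C] predicts Y has to keep that distinction visible.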
Expectations for Participants
Once we review the narratives from the teams, the next goal of the study would be to compare Y=M1(B) versus Y=M2(B’,C), where M1 is some standard clinical model (we will be using very basic medical informatics literature) and M2 is innovative, based on subject knowledge and guesswork, perhaps not yet rigorously tested.
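One simple way such a comparison could be set up is to fit both models on the same training rows and score them on the same held-out rows. The sketch below is only an illustration under strong assumptions: the data are toy numbers made up for the example, M1 and M2 are each reduced to a one-variable least-squares line (M2 uses a single feature from C rather than the full [B’ C]), and fit_line and mean_squared_error are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(model, xs, ys):
    """Average squared prediction error of a fitted line on (xs, ys)."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def column(rows, index):
    return [row[index] for row in rows]


# Toy rows (b, c, y): a clinical feature, a socioeconomic feature, an outcome.
train = [(1.0, 10.0, 2.1), (2.0, 8.0, 3.9), (3.0, 7.0, 6.2), (4.0, 3.0, 7.8)]
held_out = [(5.0, 2.0, 10.1), (6.0, 1.0, 11.9)]

m1 = fit_line(column(train, 0), column(train, 2))  # stand-in for Y = M1(B)
m2 = fit_line(column(train, 1), column(train, 2))  # stand-in for Y = M2(C)

mse_m1 = mean_squared_error(m1, column(held_out, 0), column(held_out, 2))
mse_m2 = mean_squared_error(m2, column(held_out, 1), column(held_out, 2))
print("held-out MSE, M1(B):", mse_m1)
print("held-out MSE, M2(C):", mse_m2)
```

Scoring both models on rows that neither has seen keeps the comparison fair even when M2 is, as the text allows, the weaker predictor.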
Mature machine learning must imitate human ability to acquire data from multiple sources. While we have not achieved true AI yet, a modern data-driven organization with humans and computers is a working substitute. In the language of machine learning/statistical inference, data blending closely corresponds to inductive transfer, or transfer learning, and there is a good amount of mathematical literature on the subject.
However, human and organizational intelligence does not consciously reproduce a mathematical process. We transfer skills and portions of knowledge, and then test their appropriateness in the new situation. In that type of cognitive activity, thinking by analogy is allowed, and the ability to set up the connection is relatively more valuable (we have statistical approaches to testing, so the latter part of the process is largely figured out).
History of Data Blending
When we turn to the practical experience in the industry that introduced the concept of data blending, we see that the concept emerged in the data science community as a topic of interest around 2014 or late 2013. At the time, software packages like Tableau were offering a “data blending” method, which was intended to improve productivity and experience for the segment of the user population whose primary interface to data was through the tool itself (as opposed to power users who could combine multiple data feeds themselves and often had no need for this convenience). The method worked as follows: suppose one is interested in combining spreadsheet-based data (e.g., in Excel or a local .csv file) with data stored in an enterprise data-management system, perhaps Oracle or Hadoop. Typically, a business analyst would require an exchange with the team responsible for data engineering or ETL in order to realize this workflow. In 2014, “Data Blending” meant that the BI tool was capable of providing this functionality directly for the end-user by, for example, treating Oracle and Excel abstractly as relational stores and leveraging user- or enterprise-defined metadata to reason about the necessary joining structure. At this point in time, the “data blending” workflow consisted of: 1) identifying the data sources one wishes to blend; 2) describing to the tool some metadata concerning the desired join; 3) performing some standard cleaning and sanity checks against the results.
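The three-step workflow can be imitated in a few lines. This is a minimal sketch and not how any particular BI tool implements blending: an in-memory CSV stands in for the spreadsheet, an in-memory SQLite database stands in for the enterprise store, and join_spec, blend, and all table and column names are hypothetical. It also assumes the join key is unique on the database side.

```python
import csv
import io
import sqlite3

# Step 1: identify the sources.
# An in-memory CSV stands in for a local spreadsheet file.
csv_text = "region,target\nEast,100\nWest,80\nNorth,60\n"
targets = list(csv.DictReader(io.StringIO(csv_text)))

# An in-memory SQLite database stands in for the enterprise system.
db = sqlite3.connect(":memory:")
db.row_factory = sqlite3.Row
db.execute("CREATE TABLE sales (region_name TEXT, amount REAL)")
db.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("East", 120.0), ("West", 75.0), ("South", 40.0)],
)
sales = [dict(row) for row in db.execute("SELECT region_name, amount FROM sales")]

# Step 2: metadata describing the join (the key has a different name in each source).
join_spec = {"left_key": "region", "right_key": "region_name"}


def blend(left_rows, right_rows, spec):
    """Join two row lists on the keys named in spec and report what did not match."""
    index = {row[spec["right_key"]]: row for row in right_rows}
    joined, unmatched_left = [], []
    for row in left_rows:
        match = index.get(row[spec["left_key"]])
        if match is None:
            unmatched_left.append(row[spec["left_key"]])
        else:
            joined.append({**row, **match})
    left_keys = {row[spec["left_key"]] for row in left_rows}
    unmatched_right = sorted(set(index) - left_keys)
    return joined, unmatched_left, unmatched_right


# Step 3: sanity-check the result before using it.
joined, unmatched_left, unmatched_right = blend(targets, sales, join_spec)
print("joined rows:", joined)
print("in spreadsheet only:", unmatched_left)
print("in database only:", unmatched_right)
```

Reporting the unmatched keys on both sides is the cheapest sanity check available, and it is usually the first place where a mismatch in naming conventions between the two sources shows up.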
This led to some interesting consequences, as the user community encountered certain performance bottlenecks in dealing with compute loads distributed across server and client. For instance, Teradata is designed to map the join of two large tables efficiently, and the underlying appliance has enough horsepower to return this result to the user in a timely manner. In a data blending scenario, the BI tool is performing the join across multiple systems, which are each (presumably) ignorant of the total data-space of the join. In this situation, it falls to the client (or some intermediary machine) to marshal resources to execute the join. In the worst-case scenario, this means transferring large amounts of data to the user’s machine and doing large-scale joins on the client side. Users quickly sought to employ the typical trick, which is to coarsen the join parameter and stage intermediate joins on the client side. For example, if joining by date, do the joins one year at a time. This is an asymmetrical method leveraging the asymmetrical computing power between server and host.
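The one-year-at-a-time trick might look like the following sketch. An in-memory SQLite database stands in for the remote server, and the table, the column names, and the local_notes dictionary are all hypothetical. The point is only that each request pulls a bounded slice of the server table, so the client never holds more than one year of remote rows at once.

```python
import sqlite3

# In-memory SQLite stands in for the remote server (hypothetical table and data).
server = sqlite3.connect(":memory:")
server.execute("CREATE TABLE visits (visit_date TEXT, n_visits INTEGER)")
server.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [("2013-03-01", 5), ("2013-11-15", 7), ("2014-02-10", 3), ("2014-07-04", 9)],
)

# Client-side data keyed by ISO date (hypothetical).
local_notes = {
    "2013-11-15": "flu clinic",
    "2014-07-04": "holiday",
    "2015-01-01": "new year",
}


def join_one_year(conn, local, year):
    """Fetch one year of server rows and join them to the local data by date."""
    rows = conn.execute(
        "SELECT visit_date, n_visits FROM visits "
        "WHERE visit_date >= ? AND visit_date < ? ORDER BY visit_date",
        (f"{year}-01-01", f"{year + 1}-01-01"),
    )
    return [(day, count, local[day]) for day, count in rows if day in local]


def chunked_join(conn, local, years):
    """Stage the join in yearly pieces and concatenate the results."""
    result = []
    for year in years:
        result.extend(join_one_year(conn, local, year))
    return result


blended = chunked_join(server, local_notes, range(2013, 2016))
print(blended)
```

ISO-formatted date strings sort in calendar order, which is what makes the plain string comparison in the WHERE clause a valid year filter here.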
Extending the Method
The inversion of this use-pattern led to the idea of data blending as we have been presenting it. This method has been described as the creation of fictitious identifiers between clusters, and the joining thereof–which is true, and carries all of the caution tape typically associated with the intentional addition of bias into a data science workflow. However, it also represents an opportunity for the user to include domain-, method-, or problem-specific metadata which associates data the underlying system may not itself have a capacity to associate. In industry, it is not uncommon, for example, to granularize features such as income, temperature, highest-education-level, occupation, etc., by zip code (geospatial discretization) in a way that the resulting set is its own commercially viable product or is representative of the knowledge base of a particular enterprise. Even subjective data has use–it is no stretch of the imagination that Google would happily pay a large amount for a model which would perfectly identify whether a given user was an active smoker. However, it is known that the company is in no hurry to retire whatever model they may already have, however imperfect it may be.
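To make the zip-code granularization concrete, here is a small sketch that collapses person-level rows into one summary row per zip code. The survey rows are invented for the example, and the choice of summaries (median income, most common education level) is an assumption, not a prescribed recipe. The resulting table is keyed by the "fictitious identifier", the zip code, and can then be joined to any other data set that carries one.

```python
from collections import Counter, defaultdict
from statistics import median

# Hypothetical person-level survey rows.
survey = [
    {"zip": "17821", "income": 40000, "education": "high school"},
    {"zip": "17821", "income": 60000, "education": "bachelor"},
    {"zip": "17821", "income": 50000, "education": "high school"},
    {"zip": "18702", "income": 30000, "education": "high school"},
    {"zip": "18702", "income": 36000, "education": "associate"},
    {"zip": "18702", "income": 33000, "education": "associate"},
]


def granularize_by_zip(rows):
    """Collapse person-level rows into one summary row per zip code."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["zip"]].append(row)

    summary = {}
    for zip_code, members in groups.items():
        education_counts = Counter(row["education"] for row in members)
        summary[zip_code] = {
            "n": len(members),
            "median_income": median(row["income"] for row in members),
            "top_education": education_counts.most_common(1)[0][0],
        }
    return summary


by_zip = granularize_by_zip(survey)
for zip_code, features in by_zip.items():
    print(zip_code, features)
```

Keeping the group size n in the summary matters: a cluster-level feature built from three people deserves far less trust than one built from three thousand, and downstream users need to be able to see the difference.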
This is not the only necessary approach. Analysts could condition their models upon different classes of clusters, resulting in a multitude of models for individual cluster-combinations. Practitioners realized that there typically aren’t that many degrees of freedom in the underlying “universal” parametrization of the system, so either one is forced to take clever approaches toward aliasing cluster-combinations (e.g., probabilistically), or to have been gifted with a sufficiently large amount of data to properly train a respectable subset of the cluster-combinations.
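A very simple version of conditioning on cluster-combinations is sketched below. Each "model" is just the mean outcome within a cluster-combination, and combinations with too few observations are aliased to the pooled mean. This hard fallback is a deliberately crude stand-in for the probabilistic aliasing mentioned above; the data, the min_count threshold, and the function name fit_cluster_means are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Toy observations: ((cluster labels), outcome).
observations = [
    (("urban", "smoker"), 4.0),
    (("urban", "smoker"), 5.0),
    (("urban", "smoker"), 6.0),
    (("rural", "smoker"), 8.0),  # only one observation for this combination
]


def fit_cluster_means(rows, min_count=3):
    """Fit one mean-outcome model per cluster-combination.

    Combinations with fewer than min_count observations get no model of their
    own; predictions for them (and for unseen combinations) fall back to the
    pooled mean over all rows.
    """
    pooled = mean(outcome for _, outcome in rows)
    by_cluster = defaultdict(list)
    for cluster, outcome in rows:
        by_cluster[cluster].append(outcome)
    models = {
        cluster: mean(outcomes)
        for cluster, outcomes in by_cluster.items()
        if len(outcomes) >= min_count
    }

    def predict(cluster):
        return models.get(cluster, pooled)

    return predict


predict = fit_cluster_means(observations)
print(predict(("urban", "smoker")))      # cluster-specific mean
print(predict(("rural", "smoker")))      # too few rows: pooled mean
print(predict(("rural", "non-smoker")))  # unseen combination: pooled mean
```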
Another aspect is in getting systems to take the essential step to data-blending, even on their own. This is a ubiquitous challenge for those engineering Big Data systems. A technically-challenged user might “know” what they want to do–for instance, they want a 90-day moving average of the regional sale figures across the geographically-diverse business units of a company–but they don’t necessarily know how to make the system perform the requisite operations. Unfortunately, it is extraordinarily difficult to design a system which is capable of inferring this requirement, with or without the introduction of carefully-curated metadata about the business and its underlying business processes. The insight here is that the successful user is performing operations of which the system is incapable: they are leveraging metadata to reason about the combination of data elements. Successful users leverage the bias they’ve gained as domain-experts in order to create solutions to what are otherwise combinatorially-challenging problems.
Although this provides very few concrete examples of how to do so, we hope to convey at least one interesting facet behind the intent of this Collider: leverage interesting metadata. Introductory statistics contains useful insight regarding how one quantifies and removes bias. An entire specialty is concerned with the careful design of experiment. The practitioner then performs complex operations with a degree of confidence that the “average case” holds. Contrast this with the state of the art in the field: introduce bias in a controlled manner, inspect its implications, and inject knowledge in a way that unlocks the true potential of the analyst and of the data.
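For reference, the 90-day moving average in the example is easy to state once the user's implicit decisions are written down. The sketch below makes several assumptions that a system could not infer on its own: sales from all business units are already tagged with a region, the window is trailing and ends on each sale date, and the average is taken over recorded sales rather than over calendar days. The function name and the data are hypothetical.

```python
from collections import defaultdict, deque
from datetime import date, timedelta


def trailing_average(sales, window_days=90):
    """Per-region trailing average of sale amounts.

    sales: iterable of (region, day, amount) tuples.
    For each sale date, averages the amounts recorded in the window_days
    ending on that date (inclusive). Returns {region: [(day, average), ...]}.
    """
    by_region = defaultdict(list)
    for region, day, amount in sales:
        by_region[region].append((day, amount))

    result = {}
    for region, rows in by_region.items():
        rows.sort()
        window, total, averages = deque(), 0.0, []
        for day, amount in rows:
            window.append((day, amount))
            total += amount
            cutoff = day - timedelta(days=window_days)
            # Drop sales that have aged out of the window.
            while window[0][0] <= cutoff:
                total -= window.popleft()[1]
            averages.append((day, total / len(window)))
        result[region] = averages
    return result


# Toy sales pooled from several business units (hypothetical).
sales = [
    ("East", date(2015, 1, 1), 100.0),
    ("East", date(2015, 2, 1), 200.0),
    ("East", date(2015, 6, 1), 300.0),
    ("West", date(2015, 1, 15), 50.0),
]
averages = trailing_average(sales)
for region, series in averages.items():
    print(region, series)
```

Each of the assumptions listed above is a piece of metadata supplied by a person who knows the business, which is exactly the kind of knowledge the paragraph argues a system cannot be expected to infer.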
Some references you may find helpful:
- Pan, Sinno Jialin, and Qiang Yang. “A survey on transfer learning.” Knowledge and Data Engineering, IEEE Transactions on 22.10 (2010): 1345-1359.
- Lenzerini, Maurizio. “Data integration: A theoretical perspective.” Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. ACM, 2002.
- Yager, Ronald R. “A framework for multi-source data fusion.” Information Sciences 163.1 (2004): 175-200.
- Goodman, Irwin R., Ronald P. Mahler, and Hung T. Nguyen. Mathematics of data fusion. Vol. 37. Springer Science & Business Media, 2013.
- Sambhoos, Kedar, et al. “Enhancements to high level data fusion using graph matching and state space search.” Information Fusion 11.4 (2010): 351-364.
- Blasius, Jorg, and Michael Greenacre, eds. Visualization and verbalization of data. CRC Press, 2014.
- Wang, Zhijun, et al. “A comparative analysis of image fusion methods.” Geoscience and Remote Sensing, IEEE Transactions on 43.6 (2005): 1391-1402.
- Núnez, R. C., et al. “Credibility assessment and inference for fusion of hard and soft information.” Proc. 2nd International Conference on Cross-Cultural Decision Making: Focus 2012 (also in Advances in Design for Cross-Cultural Activities. Vol. 1. 2012).
- Data-Blending Tutorial
- Gurus, Data Blending