Big Data in Medicine – Big or Dense?
Weiqi Wang, PhD; School of Medicine, Stanford University
It is well known that big data is supposed to fit the “3V” model: high volume of data, high velocity of data flow, and high variety of data types. Many kinds of big data do exemplify these 3V features, including but not limited to sensor data, weblog data, business transaction data, and stock data. Such data are often bigger than what we can afford to analyze once collected, due to the computational limitations of the hardware, the software, or both. In fact, sometimes they are even too big to store, let alone analyze. On top of that, they keep arriving in real time, so people never get a chance to catch their breath. These are the kinds of data we usually have in mind when we use the term “big data” in an everyday context.
In medicine, however, things can be a little different.
Medical big data, which comes in several subtypes such as electronic medical records (EMR), genomic data, and medical device signal data (including but not limited to graphical data), is usually not as large as the other types of big data listed above. The reasons are many. As an example, if one compares sensor data or weblog data to genomic data, the first difference one notices is that the former keep accumulating as time goes on, whereas the latter, by its nature, typically does not.
This difference of one dimension, namely the time dimension in this case, is the main reason the data volumes differ by an order of magnitude or more. Thinking a bit further, one also notices that sensors can be installed in whatever numbers people choose and can monitor whatever people design them to monitor, whereas one person’s genomic data can only come from 23 chromosome pairs (don’t get me wrong, that is still a huge number of DNA base calls; it is just that the number is “fixed” in a sense), and not that many people have had their genomes sequenced. That number is certainly growing, but it will not catch up with the time dimension, especially since the number of sensors keeps growing as well. Similar stories apply to EMR and other types of medical data: they will not be as big as the 3V model typically demands, because the number of patients in clinical trials is simply not that large in mathematical terms.
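To put rough numbers on this volume gap, here is a quick back-of-envelope sketch in Python; all of the figures (number of sensors, sampling rate, storage per reading) are illustrative assumptions of mine, not measurements from any real deployment.

```python
# Back-of-envelope comparison of a fixed-size genome versus a growing
# sensor stream. All numbers below are illustrative assumptions.

GENOME_BASES = 3.2e9          # ~3.2 billion base pairs per human genome
BYTES_PER_BASE = 0.25         # 2 bits per base if stored compactly (A/C/G/T)

SENSORS = 10_000              # assumed number of deployed sensors
READINGS_PER_SEC = 10         # assumed sampling rate per sensor
BYTES_PER_READING = 8         # e.g. one 64-bit float per reading

def genome_bytes(n_people: int) -> float:
    """Storage for n_people genomes; fixed per person, independent of time."""
    return n_people * GENOME_BASES * BYTES_PER_BASE

def sensor_bytes(days: float) -> float:
    """Storage for the sensor fleet; grows linearly with observation time."""
    seconds = days * 24 * 3600
    return SENSORS * READINGS_PER_SEC * BYTES_PER_READING * seconds

if __name__ == "__main__":
    print(f"1,000 genomes:     {genome_bytes(1_000) / 1e12:.1f} TB (fixed)")
    print(f"Sensors, 30 days:  {sensor_bytes(30) / 1e12:.1f} TB")
    print(f"Sensors, 365 days: {sensor_bytes(365) / 1e12:.1f} TB (keeps growing)")
```

Under these assumed figures, a thousand genomes stay under a terabyte forever, while a modest sensor fleet passes that mark within weeks and never stops growing.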
So, does that mean that there is no “big data” in medicine?
To answer this question, one needs to think about what “big data” really refers to. Data, after all, carry and represent information, or ultimately knowledge that reflects reality. If a dataset contains a vast amount of noise or artifacts, it can be “big” in volume yet small in value. This is, in fact, the case for sensor data and the like.
Think about the weblog data from Google: say a random person named Weiqi, or whoever, searches for “bikini beauty”, or whatever, on Google today, but a connection problem on his laptop forces him to repeat the search on his desktop to get the results.
The weblog on the server faithfully records both queries, yet it cannot tell that the two queries came from the same person; if they are counted as two independent queries with the same keyword, that is an “artifact” in a sense. Similarly, if he shuts down the desktop and turns it back on a week later, and the browser restores the same search page from its history, another query gets counted that he never intended. If the weblog then suggests that the same user was still interested in bikini beauty a week later, that is another artifact.
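A minimal sketch of this counting problem, using a made-up toy weblog (the client identifiers and queries are invented for illustration): naive counting treats every logged record as an independent, intentional query, and nothing in the log itself can link the two devices back to one person.

```python
# Toy weblog: each record is (client_id, query). The same person appears
# under two client ids (laptop and desktop), and the desktop later replays
# the search page from history. All entries are made up for illustration.
from collections import Counter

weblog = [
    ("laptop-123",  "bikini beauty"),   # original search, connection failed
    ("desktop-456", "bikini beauty"),   # same person retrying on another device
    ("desktop-456", "bikini beauty"),   # browser restoring history a week later
]

# Naive counting treats every record as an independent, intentional query.
naive_counts = Counter(query for _, query in weblog)

# Even grouping by client id cannot recover the truth: the log has no way to
# know that laptop-123 and desktop-456 are the same person, nor that the
# history restore was unintentional.
per_client = Counter((client, query) for client, query in weblog)

print(naive_counts)     # Counter({'bikini beauty': 3}) -- three apparent queries
print(len(per_client))  # 2 "distinct" client/query pairs; the true count is 1
```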
Examples like this tell us that big data in this sense may have more issues than we would normally expect, some of them unforeseeable; in other words, the true information, or knowledge, is less dense than we might think. Genomic data, on the other hand, are far less likely to suffer from such problems, especially as the technology advances and the chance of a faulty readout drops. Each readout of a DNA base thus carries information as it really is, essentially artifact free, so when it comes to analysis, every bit of data counts.
In this sense, the data, although smaller in volume, are bigger in knowledge density. Similar stories apply to EMR and the like, where manual supervision ensures better quality, and therefore a higher knowledge density, than automatically collected data.
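If one wanted to make this notion of “density” slightly more concrete, one could think of it as the fraction of records that actually carry usable, artifact-free information; the ratio and the figures below are my own informal shorthand, not a standard metric.

```python
# An informal "knowledge density" ratio: the fraction of records in a dataset
# that carry genuine, artifact-free information. The datasets and counts
# below are hypothetical, purely for illustration.

def knowledge_density(useful_records: int, total_records: int) -> float:
    """Fraction of records that actually contribute information."""
    return useful_records / total_records

# Hypothetical figures: a huge weblog riddled with artifacts versus a smaller,
# manually supervised genomic/EMR dataset.
weblog  = {"total": 10_000_000, "useful": 2_000_000}
genomic = {"total":    500_000, "useful":   490_000}

print(f"weblog density:  {knowledge_density(weblog['useful'], weblog['total']):.2f}")
print(f"genomic density: {knowledge_density(genomic['useful'], genomic['total']):.2f}")
```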
Incidentally, it is interesting that people make efforts, from time to time, to abstract a second layer of data from raw sensor or weblog data, to make it more knowledge dense and more reliable for analysis; even so, it typically does not reach the density that medical big data can achieve, owing to the unpredictable nature of the sensor data themselves.
In this sense, medical big data, or big data in medicine, usually has a smaller volume than sensor data but can be a lot more “dense”. Judging by the 3V model, it might be more appropriate to call it “dense data” instead of “big data”, though that remains an arguable and open question for the community. Moreover, with the development of portable and wearable medical devices, medical big data can be replenished by this subtype along the time dimension, which is essentially sensor data; while this may allow it to satisfy the 3V criteria, the cost is that more noise and artifacts are likely to come in as well.
After all, big or dense, one can only pick one.