On Managing Big Data. Interview with Jules J. Berman
“My own personal opinion is that data analysis is much less important than data re-analysis. It’s hard for a data team to get things right on the very first try, and the team shouldn’t be faulted for their honest efforts. When everything is available for review, and when more data is added over time, you’ll increase your chances of converging to someplace near the truth.”–Jules J. Berman.
I have interviewed Jules J. Berman, former President of the Association for Pathology Informatics. The focus of the interview is on how to manage Big Data.
Q1. In your experience what are the common mistakes that endanger most Big Data projects?
Jules J. Berman: Overconfidence is the biggest culprit. The creators of Big Data resources like to believe that they have collected all the data relevant to their domain, that all of the data is accurate, and that the data is organized in a manner that supports meaningful data searches. The Big Data analysts like to believe that their results and conclusions are correct. Hah!
Q2. How do you organize large volumes of complex data? Any insights you could give us on this?
Jules J. Berman: Large volumes of Big Data are organized the same way that humans organize the large volumes of complex data held in their brains: through classification. We could not cope with all the sensory input we receive each day if we did not bin visual objects into categories.
There is a science to constructing classifications, and if the science is misapplied, then the complex data objects held in a Big Data resource cannot be sensibly retrieved, or collected with objects to which they are logically related. Novices to the field make two common errors: confusing properties with classes (e.g., creating red-colored objects as a new class), or assigning a part of an object as a subclass (e.g., making “legs” a subclass of “person”). Just like any other science, the science of classification must be studied, practiced, and mastered.
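The two novice errors described above can be made concrete with a minimal Python sketch. The class and attribute names used here (Person, Leg, Vehicle, Car, color) are hypothetical, chosen only for illustration; they are not taken from the interview or from the book.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Leg:
    """A part of a person. A leg is not a kind of person, so it is not a subclass."""
    side: str


@dataclass
class Person:
    """A class: every instance 'is a' person."""
    name: str
    # Parts are held by composition ("has a"), not by inheritance ("is a").
    legs: List[Leg] = field(default_factory=lambda: [Leg("left"), Leg("right")])


@dataclass
class Vehicle:
    """A class whose instances carry color as a property."""
    color: str


@dataclass
class Car(Vehicle):
    """A legitimate subclass: every car 'is a' vehicle."""
    doors: int = 4


def objects_with_color(objects, color):
    """Find red objects by querying a property, not by inventing a RedObject class."""
    return [obj for obj in objects if getattr(obj, "color", None) == color]
```

In this sketch, "red" stays a value of a property that many unrelated classes can share, and a leg is stored inside a Person rather than inheriting from it. Either mistake would show up as a failed "is a" test: a leg is not a person, and a red car is not a different kind of thing from a blue car.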
Q3. You have been working on data permanence: what does it mean in practice? How can it be achieved when the content of the data is constantly changing?
Jules J. Berman: Everyone knows the slogan from Orwell’s masterpiece, 1984: “Big Brother is watching you”. If you’ve read the book, you’ll remember that there was another major theme, one that involved data mutability. The minions of Big Brother were constantly fiddling with collected data to distort reality. Because Big Brother held all the data, Big Brother could create perceptions of reality that suited the totalitarian state.
I see the problem of data mutability (i.e., the ability to modify, delete, or fabricate data) as being much more important than issues related to over-surveillance. In hospitals, the regrettable act of “retro-noting” (i.e., inserting patient notes out of sequence to cover omissions, or to justify billing, or to eradicate errors), is an example of data mutability.
The solution involves employing time stamps and metadata, and procedures that block data erasures. Data mutability, and the related topic of missing legacy data, are two of my favorite issues, and they are both covered in my book, “Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information.”
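One way to picture "time stamps and metadata, and procedures that block data erasures" is an append-only log. The following is a minimal sketch under stated assumptions, not the method described in the book: the names Entry, AppendOnlyLog, and amends are hypothetical, and a real system would also need durable storage and access control.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Entry:
    """One immutable, time-stamped assertion about a record."""
    record_id: str
    text: str
    recorded_at: datetime          # when the entry was actually written
    amends: Optional[int] = None   # index of the earlier entry this one corrects


class AppendOnlyLog:
    """Hypothetical store: entries can be added, never edited or erased."""

    def __init__(self):
        self._entries = []

    def append(self, record_id, text, amends=None):
        """Add an entry; the store, not the caller, assigns the time stamp."""
        if amends is not None and not 0 <= amends < len(self._entries):
            raise ValueError("amends must point at an existing entry")
        entry = Entry(record_id, text, datetime.now(timezone.utc), amends)
        self._entries.append(entry)
        return len(self._entries) - 1

    def history(self, record_id):
        """Return every entry ever written for a record, oldest first."""
        return tuple(e for e in self._entries if e.record_id == record_id)
```

Because the store assigns the time stamp, a note inserted later cannot pose as an earlier one, and a correction is recorded as a new entry that points back at the entry it amends. The original entry stays visible in the history.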
Q4. Data verification: what are the challenges?
Jules J. Berman: The biggest challenge involves getting data analysts to take the topic of data verification seriously. I personally know data scientists who have the attitude that data verification is “not my job.” They believe that they have no control over the data; their job is to do the best they can with the data they receive.
I think that we really must get everybody on board with the idea that data needs to be verified. The task of creating verified sets of data is child’s play compared with the professional issues instigated by recalcitrant data scientists.
Q5. Data validation: what are the challenges?
Jules J. Berman: There are many ways of thinking about validation, but my perception is that most people in the field are approaching validation as a post-analytic process, wherein old conclusions are tested on new data, or tested on alternate data sources, or are re-calculated on a regular basis. The validation process is aimed at determining whether what seems true for me today will be true for you and me, today and tomorrow.
Like anything else in Big Data, it requires work and vigilance, and a delay in gratification.
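The post-analytic view of validation described above, in which an old conclusion is re-tested on new data and on alternate sources, can be sketched in a few lines of Python. The function names and all numbers below are invented for illustration; the "conclusion" is a deliberately simple stand-in.

```python
from statistics import mean


def conclusion_holds(treated, control):
    """The old conclusion under test: treated values average higher than control."""
    return mean(treated) > mean(control)


def revalidate(conclusion, datasets):
    """Re-test one conclusion on every named dataset and report each outcome."""
    return {name: conclusion(*data) for name, data in datasets.items()}


# Toy numbers, made up for this sketch only.
datasets = {
    "original": ([5.1, 5.4, 5.0], [4.2, 4.5, 4.1]),
    "new_batch": ([4.9, 5.2, 5.3], [4.4, 4.0, 4.3]),
    "alternate_source": ([4.1, 4.0, 4.2], [4.3, 4.4, 4.2]),
}
results = revalidate(conclusion_holds, datasets)
```

With these toy numbers the conclusion holds on the original data and on the new batch but fails on the alternate source, which is exactly the kind of outcome that repeated validation is meant to surface.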
Q6. Are there any general methods for data verification and validation that can be specifically applied to Big Data resources?
Jules J. Berman: There’s a large literature out there on this subject. In my opinion, the methods are not as important as the documentation. Protocols must be written, actions must be recommended, and steps must be taken to implement corrections. If you’re serious about Big Data, you must be serious about documenting everything: how you found errors, what you did to correct the errors, what you did to make sure that future errors of the same kind will not occur, what you did to monitor the occurrence of future errors of the same type. It never seems to end, but it’s just part of the job.
Q7. How would you find relationships among data objects held in disparate Big Data resources: Could you give us some examples?
Jules J. Berman: In my book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information, published in May, I give real-world examples for all of the points raised in this interview, but my favorite “reach” into disparate data involves some inventive research into the sinking of the Titanic.
Here’s an excerpt from my book: “A recent headline story explains how century-old tidal data plausibly explained the appearance of the iceberg that sank the Titanic, on April 15, 1912. Records show that several months earlier, in 1912, the moon, earth, and sun aligned to produce a strong tidal pull, and this happened when the moon was the closest to the earth in 1,400 years. The resulting tidal surge was sufficient to break the January Labrador ice sheet, sending an unusual number of icebergs towards the open North Atlantic waters.
The Labrador icebergs arrived in the commercial shipping lanes four months later, in time for a fateful rendezvous with the Titanic. Back in January 1912, when tidal measurements were being collected, nobody foresaw that the data would be examined a century later.”
Of course, the finest tool for finding relationships among data objects held in disparate Big Data resources is the human brain. Good data analysts spend lots of time surveying the data held in various resources. When you spend the time, the inspirational moments will come, and you will begin to synthesize new relationships among data from different knowledge domains. Typically, analysis follows inspiration; not vice versa.
Q8. Data integration: how can data be extracted and integrated with data from other resources?
Jules J. Berman: Of course, standards, specifications, and metadata play an important role.
The Holy Grail in the Big Data field involves finding and implementing standard methods for organizing and tagging data, so that every piece of data held on any computer can be linked and combined into a virtual Super-Big Data resource.
On a less grand scale, it’s always nice when workers in a common field collect their data in a standard form.
In most cases, I’ve been favoring specifications over standards. Data standards seldom, if ever, “fit” your data correctly, are prone to re-versioning, often cost money, and usually come with a fine-print license that restricts how the standards are used and how your annotated data are distributed. Specifications are recommendations for describing data; RDF is a good example. Specifications provide the flexibility required for complex data, along with the structure required for data integration.
A smart data manager can do a lot more with a specification than with a standard.
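To illustrate why a specification such as RDF helps with integration, here is a minimal sketch that uses plain Python tuples to mimic RDF's subject-predicate-object triples. It does not use an actual RDF library or RDF syntax, and every identifier, predicate name, and value below is made up for illustration.

```python
from collections import defaultdict

# Each resource describes its objects as (subject, predicate, object) triples,
# which is the data model that RDF specifies. All identifiers are hypothetical.
hospital_a = [
    ("urn:example:specimen:001", "diagnosis", "adenocarcinoma"),
    ("urn:example:specimen:001", "collected_on", "2012-03-14"),
]
registry_b = [
    ("urn:example:specimen:001", "tissue_site", "colon"),
    ("urn:example:specimen:002", "tissue_site", "lung"),
]


def integrate(*resources):
    """Merge triples from any number of resources, grouped by subject identifier."""
    merged = defaultdict(dict)
    for resource in resources:
        for subject, predicate, obj in resource:
            # A predicate may have several values, so collect them in a set.
            merged[subject].setdefault(predicate, set()).add(obj)
    return dict(merged)


combined = integrate(hospital_a, registry_b)
```

Neither resource had to adopt the other's schema. Because both describe their data as triples keyed on a shared identifier, the statements about specimen 001 combine without any field-by-field mapping, while each resource remains free to use whatever predicates suit its own data.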
Q9. What about Big Data sharing?
Jules J. Berman: Data sharing is absolutely essential to the field of data science. If the data upon which your assertions are based is unavailable to the public, then why would anyone believe your results and conclusions?
In the Big Data realm, there are lots of things that can go wrong with a data analysis project. The chances that any new analysis is correct on the first pass are slim to none. Everything must be repeated over and over, critiqued, and validated on fresh data.
My own personal opinion is that data analysis is much less important than data re-analysis. It’s hard for a data team to get things right on the very first try, and the team shouldn’t be faulted for their honest efforts. When everything is available for review, and when more data is added over time, you’ll increase your chances of converging to someplace near the truth.
Jules Berman received two baccalaureate degrees from MIT, in Mathematics and in Earth and Planetary Sciences. He received his Ph.D. from Temple University and his M.D. from the University of Miami.
He received post-doctoral training at NIH and residency training at George Washington University Medical Center. He is board certified in anatomic pathology and in cytopathology. He served as Chief of Anatomic Pathology, Surgical Pathology and Cytopathology at the Veterans Administration Medical Center in Baltimore, Maryland, where he held joint appointments at the University of Maryland Medical Center and the Johns Hopkins Medical Institutions. In 1998, he became a Medical Officer at the U.S. National Cancer Institute and served as the Program Director for Pathology Informatics in the Institute’s Cancer Diagnosis Program. In 2006, Jules Berman was President of the Association for Pathology Informatics. In 2011, he received the Lifetime Achievement Award from the Association for Pathology Informatics. Today, Jules Berman is a freelance writer. He has first-authored more than 100 articles and 11 book titles in science and medicine.
– Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information, Jules J. Berman, Ph.D., M.D. Paperback: 288 pages, Morgan Kaufmann; 1st edition (June 13, 2013), ISBN-10: 0124045766