“Blinded Analysis”: Protecting privacy with the power of big data
BY Luigi Scorzato & Stefan Rustler — Accenture Digital
About a century ago, a newborn child in Europe could expect to live, on average, about 40 years. Today, life expectancy at birth is about twice that. The progress in the medical sciences that effectively granted us a second life would never have been possible without the systematic analysis of some of the most sensitive personal data of our fellow citizens: the data on their individual health. There is no medicine, no science and no progress without data.
More recently, new tools have greatly improved our ability to collect and analyse large amounts of data in a variety of contexts. As a result, we are witnessing (or we can realistically expect) huge improvements in many important aspects of our lives: from health to security, from transportation to communication, from education to … you name it.
But to achieve that, personal data are almost always essential, and whenever personal data are involved, there are legitimate concerns about privacy and potential abuses: “Who exactly can access my data?” or “For which other purposes will my data be used?”. Unfortunately, the ways in which many organisations try to reassure their audience often exacerbate the problem rather than solve it. It is not very helpful to ask the visitors of a website to consent to the use of their data if the website can’t specify exactly how those data will be used. It is often far too ambiguous which uses of the data are part of the service provided and which are not. Nor is it reassuring to say that the data will not be given to “third parties”: what is inside and what is outside a typical multinational organisation of today?
Further, the reassurance that the data are “anonymised” is also hardly sufficient. What is done in those cases is more accurately called “pseudonymisation”: it consists of deleting some obviously sensitive fields (e.g. names, identification numbers, addresses) or substituting them with auxiliary keys. This is appropriate in some contexts, but far from sufficient to ensure anonymisation in general, because identities can often be reconstructed even without those pieces of information, from just a few other properties that, taken separately, do not seem sensitive at all. For example, how many Italian particle physicists lived in the UK in the year 2000 and in Germany in the year 2005? One might respond by removing still more fields, but removing precise addresses and identifiers may compromise some important analyses. For example, identifying the origin of a polluting agent through the incidence of pathologies may require very accurate data about residence. In conclusion, removing more information is not a sensible general solution either.
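The re-identification risk described above can be made concrete with a minimal sketch. The records below are invented for illustration, and the function name `unique_records` is hypothetical. The code counts how many records are singled out by a combination of fields that look harmless when taken one at a time.

```python
from collections import Counter

# Invented example records: names and ID numbers have already been removed.
records = [
    {"nationality": "Italian", "profession": "particle physicist",
     "country_2000": "UK", "country_2005": "Germany"},
    {"nationality": "Italian", "profession": "teacher",
     "country_2000": "UK", "country_2005": "UK"},
    {"nationality": "Italian", "profession": "teacher",
     "country_2000": "UK", "country_2005": "UK"},
    {"nationality": "German", "profession": "nurse",
     "country_2000": "Germany", "country_2005": "Germany"},
    {"nationality": "German", "profession": "nurse",
     "country_2000": "Germany", "country_2005": "Germany"},
]


def unique_records(rows, fields):
    """Return the rows whose combination of `fields` values occurs exactly once."""
    def key(row):
        return tuple(row[f] for f in fields)

    counts = Counter(key(row) for row in rows)
    return [row for row in rows if counts[key(row)] == 1]


# Taken alone, nationality singles out nobody in this toy dataset...
print(len(unique_records(records, ["nationality"])))

# ...but the combination of all four fields isolates one individual.
quasi_identifiers = ["nationality", "profession", "country_2000", "country_2005"]
print(len(unique_records(records, quasi_identifiers)))
```

In a real dataset the same check would be run over many more fields and rows. The point of the sketch is only that uniqueness emerges from combinations, not from any single field.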
A good answer to privacy concerns is probably the greatest challenge to our hope of a better future. We cannot imagine a functioning world of 10 billion people that is not data-driven, and we cannot imagine a bright data-driven society that is not based on transparency and trust. In fact, if people do not trust the organisations (public or private) that manage their data, those organisations are likely to end up navigating a sea of garbage data. Many people fear a “Big Brother” world enabled by big data technologies. But there might be an even greater risk: a world where everyone struggles to communicate with the organisations that matter to them, but those organisations do not listen and do not learn from their mistakes, because they can’t tell the signal from the noise of unreliable data. There is no science and no progress with garbage data. This prospect should also worry us.
Many people assume that the values of privacy and analytics are necessarily in conflict, and that we can at best aim at a decent compromise. But this view is technologically naive: it assumes that things can only be done as they are done now. In fact, there is a better way to ensure that our data are not misused: Blinded Analysis (BA). The idea is that nobody has the right to access the data directly. Only the metadata (i.e. the description of each field in the database) are available. Moreover, the queries performed by the analysts are fully recorded for auditing purposes, they can be restricted to certain forms (typically large aggregates), and they can eventually be published. After all, what analysts, corporations and organisations are, or should be, interested in is the results of queries on data, not the data themselves.
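As an illustration of these ingredients, here is a minimal sketch of a query gateway, not a definitive implementation. The class `BlindedDataset`, its methods, the threshold `min_group_size` and the sample rows are all assumptions introduced for this example. Analysts can read the metadata, can ask for an aggregate over a sufficiently large group, and every request is written to an audit log whether or not it is answered.

```python
from datetime import datetime, timezone
from statistics import mean


class BlindedDataset:
    """Hypothetical gateway: keeps rows private and answers only aggregates."""

    def __init__(self, rows, field_descriptions, min_group_size=3):
        self._rows = list(rows)              # never returned to the caller
        self._meta = dict(field_descriptions)
        self._min_group_size = min_group_size
        self.audit_log = []                  # every query, allowed or refused

    def metadata(self):
        """Field descriptions are the only thing analysts may read directly."""
        return dict(self._meta)

    def average(self, analyst, field, where=None):
        """Mean of `field` over the rows matching the equality filter `where`."""
        where = dict(where or {})
        matching = [r for r in self._rows
                    if all(r[k] == v for k, v in where.items())]
        allowed = len(matching) >= self._min_group_size
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "analyst": analyst,
            "query": {"aggregate": "average", "field": field, "where": where},
            "allowed": allowed,
        })
        if not allowed:
            raise PermissionError("aggregate covers too few records")
        return mean(r[field] for r in matching)


# Invented sample data.
rows = [
    {"region": "north", "age": 30},
    {"region": "north", "age": 40},
    {"region": "north", "age": 50},
    {"region": "south", "age": 60},
]
db = BlindedDataset(rows, {"region": "region of residence", "age": "age in years"})

print(db.metadata())
print(db.average("alice", "age", {"region": "north"}))  # three rows: answered

try:
    db.average("alice", "age", {"region": "south"})      # one row: refused
except PermissionError as err:
    print("refused:", err)

print(len(db.audit_log))  # both requests were recorded
```

A size threshold alone does not stop an analyst from combining several allowed queries to isolate one person, which is one reason the recorded, publishable query log matters in this design.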
Performing a BA is definitely more difficult than a traditional data analysis, because the analyst cannot inspect the data directly. However, a good big data analyst is trained for exactly this kind of task: performing a full data analysis without being able to inspect the data directly. In that case, the reason is simply that the data are too big. Indeed, the analysis of very large datasets already requires setting up a system of data cleaning and data analysis that is based only on highly aggregated queries, without access to any single entry.
Moreover, managing data without accessing them directly is a standard service offered, for instance, by cloud service providers. These services build their reputation entirely on ensuring that they protect the data properly. Of course, we still have to trust the system administrators who manage the system that stores the encryption keys. The concept of BA is not meant to enable a world with no trust: it is a way to separate the role of those responsible for performing the analysis from the role of those responsible for protecting the data. Even in the best possible world, we can only choose whom to trust; we can never choose to trust no one.
The BA approach has many advantages. The first and most important is its great potential to build a relation of trust between the organisations that collect and manage personal data and their users. In this way, for example, a company X can certify that it knows nothing more about its customers than specific aggregate information that everyone can access, because the original database is hidden from everyone, and the questions that X asks of the database can be shared among all parties. This symmetry removes the fear of being spied upon and transforms the process of data analysis into an open conversation between the company X and its audience, as it should be.
The BA approach also protects organisations from several big risks: the risk that the data may be stolen, the risk that some analysts might misuse the data to which they are given access, and the risk that false news spreads about how the data are being used. These risks are very serious today, and most organisations around the world are investing huge resources to mitigate them, often unsuccessfully. None of these is a problem, however, if nobody has direct access to the data and the queries are recorded and published.
The BA approach is also the ideal substitute for exchanging data between organisations. Allowing organisation X to access the data of organisation Y is always a very painful process that rarely ends successfully, even between different departments of the same company. In fact, before granting X access to its data, organisation Y must imagine all possible uses that X could make of Y’s data, decide whether they are all admissible, and partially delete some fields or some records to prevent any conceivable misuse. This assessment is extremely difficult and always questionable. It requires long, often inconclusive, discussions between legal experts and subject matter experts, and it mostly ends with someone taking on a big risk and/or with a restricted and basically useless view of the data. With a BA approach, instead, no organisation assumes the risk of granting full access to any data. The two organisations simply agree on the specific aggregated analysis to be performed and approve only that, without needing to imagine other possible misuses.
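One way such an agreement could be enforced is sketched below, under stated assumptions: queries are described as plain dictionaries, organisation Y keeps a registry of approved query fingerprints, and anything not in the registry is rejected. The names `fingerprint` and `ApprovedQueryRegistry`, and the query fields, are hypothetical and chosen only for this illustration.

```python
import hashlib
import json


def fingerprint(query):
    """Stable SHA-256 fingerprint of a query description (a plain dict)."""
    canonical = json.dumps(query, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ApprovedQueryRegistry:
    """Hypothetical registry kept by organisation Y, the data holder."""

    def __init__(self):
        self._approved = set()

    def approve(self, query):
        """Record that both parties agreed on this exact query."""
        fp = fingerprint(query)
        self._approved.add(fp)
        return fp

    def is_approved(self, query):
        """True only if this exact query was approved earlier."""
        return fingerprint(query) in self._approved


# The analysis that X and Y agreed on (invented example).
agreed = {"aggregate": "count", "field": "claims", "group_by": ["region", "year"]}
registry = ApprovedQueryRegistry()
registry.approve(agreed)

# The same query with keys in another order has the same fingerprint.
reordered = {"group_by": ["region", "year"], "field": "claims", "aggregate": "count"}
print(registry.is_approved(reordered))

# A finer-grained variant was never part of the agreement, so it is rejected.
finer = dict(agreed, group_by=["region", "year", "postcode"])
print(registry.is_approved(finer))
```

The design choice here is that Y approves a specific question, not a view of the data, so there is no need to anticipate every other use that X might have in mind.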
The BA approach also avoids the risk that personal information spreads uncontrollably across several databases within a company and its partners, and stays there indefinitely. Indeed, personal information never leaves the original database. Once a personal record is deleted from there, it will not appear in any subsequent analysis.
In summary, the concept of BA makes it possible to strike the best balance between privacy and transparency: namely, privacy for the individual records and transparency for the aggregated analysis.
Almost any sector can benefit from a BA approach. For example, insurance companies currently suffer for two opposite reasons. On the one hand, they are often accused of exploiting their customers’ data to model risks more precisely than the public is able to, and of taking advantage of their exclusive knowledge. This damages the relation of trust that they need to build with their customers and the public. On the other hand, insurance companies struggle to obtain good data, mainly because of the suspicion that surrounds the use they make of them.
Actually, non-exclusive access to more accurate data would be far more useful than exclusive access to poor data, and everybody would benefit from that. But in order to have better data we need to ensure that personal data are safe, and we also need to tell people exactly what happens to their personal data. That means, in particular, telling them which of their own personal information matters in computing the premium, and how. This does not mean revealing all commercial information, but only the part of the formula that relies on users’ data. We believe that the advantages of more accurate modelling, both for the customers and for the insurers, will vastly exceed the unclear advantages of keeping unreliable secrets.
The interests of insurance companies are not necessarily in conflict with the interests of their customers. We could all gain from a fairer and more efficient sharing of risks. Indeed, insurance is, potentially, the most efficient and civilised way in which a society shares its risks, realises solidarity in practice, and decides which factors may justify a higher premium and which should not. But to make this real, we need to build trust.