How the 11.5 million Panama Papers were analysed. Interview with Mar Cabra
“The best way to explore all The Panama Papers data was using graph database technology, because it’s all relationships, people connected to each other or people connected to companies.” –Mar Cabra.
I have interviewed Mar Cabra, head of the Data & Research Unit of the International Consortium of Investigative Journalists (ICIJ). Main subject of the interview is how the 11.5 million Panama Papers were analysed.
Q1. What is the mission of the International Consortium of Investigative Journalists (ICIJ)?
Mar Cabra: Founded in 1997, the ICIJ is a global network of more than 190 independent journalists in more than 65 countries who collaborate on breaking big investigative stories of global social interest.
Q2. What is your role at ICIJ?
Mar Cabra: I am the Editor at the Data and Research Unit – the desk at the ICIJ that deals with data, analysis and processing, as well as supporting the technology we use for our projects.
Q3. The Panama Papers investigation was based on a 2.6 Terabyte trove of data obtained by Süddeutsche Zeitung and shared with ICIJ and a network of more than 100 media organisations. What was your role in this data investigation?
Mar Cabra: I co-ordinated the work of the team of developers and journalists that first got the leak from Süddeutsche Zeitung, then processed it to make it available online though secure platforms with more than 370 journalists.
I also supervised the data analysis that my team did to enhance and focus the stories. My team was also in charge of the interactive product that we produced for the publication stage of The Panama Papers, so we built an interactive visual application called the ‘Powerplayers’ where we detailed the main stories of the politicians with connections to the offshore world. We also released a game explaining how the offshore world works! Finally, in early May, we updated the offshore database with information about the Panama Papers companies, the 200,000-plus companies connected with Mossack Fonseca.
Q4. The leaked dataset are 11.5 million files from Panamanian law firm Mossack Fonseca. How was all this data analyzed?
Mar Cabra: We relied on Open Source technology and processes that we had worked on in previous projects to process the data. We used Apache Tika to process the documents and also to access them, and created a processing chain of 30 to 40 machines in Amazon Web Services which would process in parallel those documents, then index them onto a document search platform that could be used by 100s of journalists from anywhere in the world.
Q5. Why did you decide to use a graph-based approach for that?
Mar Cabra: Inside the 11.5 million files in the original dataset given to us, there were more than 3 million that came from Mossaka Fonseca’s internal database, which basically contained names of companies in offshore jurisdictions and the people behind them. In other words, that’s a graph! The best way to explore all The Panama Papers data was using graph database technology, because it’s all relationships, people connected to each other or people connected to companies.
Q6. What were the main technical challenges you encountered in analysing such a large dataset?
Mar Cabra: We had already used all the tools that we were using in this investigation, in previous projects. The main issue here was dealing with many more files in many more formats. So the main challenge was how can we make readable all those files, which in many cases were images, in a fast way.
Our next problem was how could we make them understandable to journalists that are not tech savvy. Again, that’s where a graph database became very handy, because you don’t need to be a data scientist to work with a graph representation of a dataset, you just see dots on a screen, nodes, and then just click on them and find the connections – like that, very easily, and without having to hand-code or build queries. I should say you can build queries if you want using Cypher, but you don’t have to.
Q7. What are the similarities with the way you analysed data in the Swiss Leaks story (exposing the fraudulent activity of 100,000 HSBC private bank clients in Switzerland)?
Mar Cabra: We used the same tools for that – a document search platform and a graph database and we used them in combination to find stories. The baseline was the same but the complexity was 100 times more for the Panama Papers. So the technology is the same in principle, but because we were dealing with many more documents, much more complex data, in many more formats, we had to make a lot of improvements in the tools so they really worked for this project. For example, we had to improve the document search platform with a batch search feature, where journalists would upload a list of names and then they would get a list back of links when that list of names had a hit a document.
Q8. Emil Eifrem, CEO, Neo Technology wrote: “If the Panama Papers leak had happened ten years ago, no story would have been written because no one else would have had the technology and skillset to make sense of such a massive dataset at this scale.” What is your take on this?
Mar Cabra: We would have done the Panama Papers papers differently, probably printing the documents – and that would have had a tremendous effect on the paper supplies of the world, because printing out all 11.5 million files would have been crazy! We would have published some stories and the public might have seen some names on the front page of a few newspapers, but the scale and the depth and the understanding of this complex world would not have been able to happen without access to the technology we have today. We would just have not been able to do such an in-depth investigation at a global scale without the technology we have access to now.
Q9. Whistleblowers take incredible risks to help you tell data stories. Why do they do it?
Mar Cabra: Occasionally, some whistleblowers have a grudge and are motivated in more personal terms. Many have been what we call in Spanish ‘widows of power’: people who have been in power and have lost it, and those who wish to expose the competition or have a grudge. Motivations of Whistleblowers vary, but I think there is always an intention to expose injustice. ‘John Doe’ is the source behind the Panama Papers, and a few weeks after we published, he explained his motivation; he wanted to expose an unjust system.
Mar Cabra is the head of ICIJ’s Data & Research Unit, which produces the organization’s key data work and also develops tools for better collaborative investigative journalism. She has been an ICIJ staff member since 2011, and is also a member of the network.
Mar fell in love with data while being a Fulbright scholar and fellow at the Stabile Center for Investigative Journalism at Columbia University in 2009/2010. Since then, she’s promoted data journalism in her native Spain, co-creating the first ever masters degree on investigative reporting, data journalism and visualisation and the national data journalism conference, which gathers more than 500 people every year.
She previously worked in television (BBC, CCN+ and laSexta Noticias) and her work has been featured in the International Herald Tribune, The Huffington Post, PBS, El País, El Mundo or El Confidencial, among others.
In 2012 she received the Spanish Larra Award to the country’s most promising journalist under 30. (PGP public key)
– Offshore Leaks Database: Released in June 2013, the Offshore Leaks Database is a simple search box.
Open Source used for analysing the #PanamaPapers:
– We created a small program ourselves which we called Extract which is actually in our GitHub account that allowed us to do this parallel processing. Extract would get a file and try to see if it could recognize the content. If it couldn’t recognize the content, then we would do OCR and then send it to our document searching platform, which was Apache Solr.
– Based on Apache Solr, we created an index, and then we used Project Blacklight, another open source tool that was originally used for libraries, as our front-end tool. For example, Columbia University Library, where I studied, used this tool.
– Linkurious: Linkurious is software that allows you to visualize graphs very easily. You get a license, you put it in your server, and if you have a database in Neo4j you just plug it in and within hours you have the system set up. It also has this private system where our reporters can login or logout.
– Thanks to another open source tool – in this case Talend – and extractions from a load tool, we were able to easily transform our database into Neo4j, plug in Linkurious and get reporters to search.
–Neo4j: Neo4j is a highly scalable, native graph database purpose-built to leverage not only data but also its relationships. Neo4j’s native graph storage and processing engine deliver constant, real-time performance, helping enterprises build intelligent applications to meet today’s evolving data challenges.
-The good thing about Linkurious is that the reporters or the developers at the other end of the spectrum can also make highly technical Cypher queries if they want to start looking more in depth at the data.