Big Data Analytics at Thomson Reuters. Interview with Jochen L. Leidner
“My experience overall with almost all open-source tools has been very positive: open source tools are very high quality, well documented, and if you get stuck there is a helpful and responsive community on mailing lists or Stack Exchange and similar sites.” —Dr. Jochen L. Leidner.
Q1. What is your current activity at Thomson Reuters?
Jochen L. Leidner: For the most part, I carry out applied research in information access, and that’s what I have been doing for quite a while. After five years with the company – I joined from the University of Edinburgh, where I had been a postdoctoral Royal Society of Edinburgh Enterprise Fellow half a decade ago – I am currently a Lead Scientist with Thomson Reuters, where I am building up a newly established London site as part of our Corporate Research & Development group (by the way: we are hiring!). Before that, I held research and innovation-related roles in the USA and in Switzerland.
Let me say a few words about Thomson Reuters before I go more into my own activities, just for background. Thomson Reuters has around 50,000 employees in over 100 countries and sells information to professionals in many verticals, including finance & risk, legal, intellectual property & scientific, tax & accounting.
Our headquarters are located at 3 Times Square in the city of New York, NY, USA.
Most people know our REUTERS brand from reading their newspapers (thanks to our highly regarded 3,000+ journalists at news desks in about 100 countries, often putting their lives at risk to give us reliable reports of the world’s events) or receiving share price information on the radio or TV, but as a company, we are also involved in areas as diverse as weather prediction (as the weather influences commodity prices), determining the citation impact of academic journals (which helps publishers sell their academic journals to librarians), and predicting Nobel prize winners.
My research colleagues and I study information access and especially means to improve it, using techniques including natural language processing, information extraction, machine learning, search engine ranking, recommender systems and similar areas of investigation.
We carry out a lot of contract research for internal business units (especially if external vendors do not offer what we need, or if we believe we can build something internally that is lower cost and/or better suited to our needs), feasibility studies to de-risk potential future products that are considered, and also more strategic, blue-sky research that anticipates future needs. As you would expect, we protect our findings and publish them in the usual scientific venues.
Q2. Do you use Data Analytics at Thomson Reuters and for what?
Jochen L. Leidner: Note that terms like “analytics” are rather too broad to be useful in many instances; but the basic answer is “yes”, we develop, apply internally, and sell as products to our customers what can reasonably be called solutions that incorporate “data analytics” functions.
One example capability that we developed is the recommendation engine CaRE (Al-Kofahi et al., 2007), which was developed by our group, Corporate Research & Development, which is led by our VP of Research & Development, Khalid al-Kofahi. This is a bit similar in spirit to Amazon’s well-known book recommendations. We used it to recommend legal documents to attorneys, as a service to supplement the user’s active search on our legal search engine with “see also...”-type information. This is an example of a capability developed in-house that also made it into a product, and is very popular.
Thomson Reuters is selling information services, often under a subscription model, and for that it is important to have metrics available that indicate usage, in order to inform our strategy. So another example of data analytics is that we study how document usage can inform personalization and ranking, or from where documents are accessed, and we use this to plan network bandwidth and to determine caching server locations.
A completely different example is citation information: Since 1969, when our esteemed colleague Eugene Garfield (he is now officially retired, but is still active) came up with the idea of citation impact, our Scientific business division has been selling the journal citation impact factor – an analytic that can be used as a proxy for the importance of a journal (and, by implication, as an argument to a librarian to purchase a subscription to that journal for his or her university library).
Or, to give another example from the financial markets area, we are selling predictive models (Starmine) that estimate how likely it is that a given company goes bankrupt within the next six months.
Q3. Do you have Big Data at Thomson Reuters? Could you please give us some examples of Big Data Use Cases at your company?
Jochen L. Leidner: For most definitions of “big”, yes we do. Consider that we operate a news organization, which generates tens of thousands of news reports every day (if we count all languages together). Then we have photo journalists who create large numbers of high-quality, professional photographs to document current events visually, and videos comprising audio-visual storytelling and interviews. We further collect all major laws, statutes, regulations and legal cases in major jurisdictions around the world, and enrich the data with our own meta-data using both manual expertise and automatic classification and tagging tools to enhance findability. We hold collections of scientific articles and patents in full text and abstracts.
We gather, consolidate and distribute price information for financial instruments from hundreds of exchanges around the world. We sell real-time live feeds as well as access to decades of these time series for the purpose of back-testing trading strategies.
Q4. What “value” can be derived by analyzing Big Data at Thomson Reuters?
Jochen L. Leidner: This is the killer question: we take the “value” very much without the double quotes – big data analytics lead to cost savings as well as generate new revenues in a very real, monetary sense of the word “value”. Because our solutions provide what we call “knowledge to act” to our customers, i.e., information that lets them make better decisions, we provide them with value as well: we literally help our customers save the cost of making a wrong decision.
Q5. What are the main challenges for big data analytics at Thomson Reuters?
Jochen L. Leidner: I’d say absolute volume, growth, data management/integration, rights management, and privacy are some of the main challenges.
One obvious challenge is the size of the data. It’s not enough to have sufficient persistent storage space to keep it; we also need backup space, space to process it, caches and so on – it all adds up. Another is the growth of the data volume and the speed of that growth. You can plan for any size, but it’s not easy to adjust your plans if unexpected growth rates come along.
Another challenge that we must look into is the integration between our internal data holdings, external public data (like the World Wide Web, or commercial third-party sources – like Twitter – which play an important role in the modern news ecosystem), and customer data (customers would like to see their own internal, proprietary data be inter-operable with our data). We need to respect the rights associated with each data set, as we deal with our own data, third party data and our customers’ data. We must be very careful regarding privacy when brainstorming about the next “big data analytics” idea – we take privacy very seriously and our CPO joined us from a government body that is in charge of regulating privacy.
Q6. How do you handle the Big Data Analytics “process” challenges with deriving insight?
Jochen L. Leidner: Usually analytics projects happen as an afterthought to leverage existing data created in a legacy process, which means not a lot of change of process is needed at the beginning. This situation changes once there is resulting analytics output, and then the analytics-generating process needs to be integrated with the previous processes.
Even with new projects, product managers still don’t think of analytics as the first thing to build into a product for a first bare-bones version, and we need to change that; instrumentation is key for data gathering so that analytics functionality can build on it later on.
In general, analytics projects follow a process of (1) capturing data, (2) aligning data from different sources (e.g., resolving when two objects are the same), (3) pre-processing or transforming the data into a form suitable for analysis, (4) building some model and (5) understanding the output (e.g. visualizing and sharing the results). This five-step process is followed by an integration phase into the production process to make the analytics repeatable.
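The five steps above can be sketched as a small pipeline. This is only a minimal illustration in Python, not a description of any Thomson Reuters system: the function names, the toy records and the name-normalization rule are all hypothetical and chosen purely to make each step visible.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: two sources that refer to the same companies
# under slightly different names.
SOURCE_A = [{"name": "Acme Corp.", "views": 10}, {"name": "Globex", "views": 3}]
SOURCE_B = [{"name": "ACME CORP", "views": 5}, {"name": "Initech", "views": 7}]


def capture():
    """Step 1: capture raw records from every source."""
    return SOURCE_A + SOURCE_B


def canonical_key(name):
    """Normalize a name so that trivially different spellings match."""
    return name.lower().rstrip(".").strip()


def align(records):
    """Step 2: group records that refer to the same object."""
    groups = defaultdict(list)
    for record in records:
        groups[canonical_key(record["name"])].append(record)
    return groups


def transform(groups):
    """Step 3: reduce each group to a form suitable for analysis."""
    return {key: sum(r["views"] for r in recs) for key, recs in groups.items()}


def build_model(totals):
    """Step 4: build a (trivial) model, here a popularity ranking."""
    return Counter(totals).most_common()


def report(ranking):
    """Step 5: present the output in a readable form."""
    return [f"{rank}. {name}: {views} views"
            for rank, (name, views) in enumerate(ranking, start=1)]


def run_pipeline():
    """Chain the five steps in order."""
    return report(build_model(transform(align(capture()))))


if __name__ == "__main__":
    for line in run_pipeline():
        print(line)
```

Keeping each step as a separate function is one way to make the later integration phase easier, since every stage can be scheduled, tested and replaced on its own.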
Q7. What kind of data management technologies do you use? What is your experience in using them?
Jochen L. Leidner: For storage and querying, we use relational database management systems to store valuable data assets and also use NoSQL databases such as CouchDB and MongoDB in projects where applicable. We use homegrown indexing and retrieval engines as well as open source libraries and components like Apache Lucene, Solr and ElasticSearch.
We use parallel, distributed computing platforms such as Hadoop/Pig and Spark to process data, and virtualization to manage isolation of environments.
We sometimes push out computations to Amazon’s EC2 and storage to S3 (but this can only be done if the data is not sensitive). And of course in any large organization there is a lot of homegrown software around. My experience overall with almost all open-source tools has been very positive: open source tools are very high quality, well documented, and if you get stuck there is a helpful and responsive community on mailing lists or Stack Exchange and similar sites.
Q8. Do you handle un-structured data? If yes, how?
Jochen L. Leidner: We curate lots of our own unstructured data from scratch, including the thousands of news stories that our over three thousand REUTERS journalists write every day, and we also enrich the unstructured content produced by others: for instance, in the U.S. we manually enrich legal cases with human-written summaries (written by highly-qualified attorneys) and classify them based on a proprietary taxonomy, which informs our market-leading WestlawNext product (see this CIKM 2011 talk), our search engine for the legal profession, who need to find exactly the right cases. Over the years, we have developed proprietary content repositories to manage content storage, meta-data storage, indexing and retrieval. One of our challenges is to unite our data holdings, which often come from various acquisitions that use their own technology.
Q9. Do you use Hadoop? If yes, what is your experience with Hadoop so far?
Jochen L. Leidner: We have two clusters featuring Hadoop, Spark and GraphLab, and we are using them intensively. Hadoop is mature as an implementation of the MapReduce computing paradigm, but has its shortcomings because it is not a true operating system (but probably it should be) – for instance regarding the stability of HDFS, its distributed file system. People have started to realize there are shortcomings, and have started to build other systems around Hadoop to fill gaps, but these are still early stage, so I expect them to become more mature first and then there might be a wave of consolidation and integration. We have definitely come a long way since the early days.
Typically, on our clusters we run batch information extraction tasks, data transformation tasks and large-scale machine learning processes to train classifiers and taggers for these. We are also inducing language models and training recommender systems on them. Since many training algorithms are iterative, Spark can win over Hadoop for these, as it keeps models in RAM.
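The point about iterative training can be illustrated without a cluster. The sketch below is plain Python, not Spark or Hadoop code: `load_from_storage` is a hypothetical stand-in for an expensive read from distributed storage, and the counter only shows how often the input would be re-read if every iteration ran as a separate batch job, compared with keeping the working set in memory.

```python
# Toy data set: points on the line y = 2x, to be fitted with y = w * x.
DATA = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

# Counts how often the (simulated) storage layer is read.
READ_COUNTER = {"reads": 0}


def load_from_storage():
    """Hypothetical stand-in for an expensive read from distributed storage."""
    READ_COUNTER["reads"] += 1
    return list(DATA)


def gradient_step(w, points, learning_rate=0.05):
    """One gradient-descent step for least squares with the model y = w * x."""
    grad = sum(2 * (w * x - y) * x for x, y in points) / len(points)
    return w - learning_rate * grad


def train_rereading(iterations):
    """Every iteration reloads its input, as a chain of separate batch jobs would."""
    w = 0.0
    for _ in range(iterations):
        w = gradient_step(w, load_from_storage())
    return w


def train_in_memory(iterations):
    """The input is loaded once and reused across all iterations."""
    points = load_from_storage()
    w = 0.0
    for _ in range(iterations):
        w = gradient_step(w, points)
    return w
```

Both functions compute the same weight; they differ only in the number of simulated reads, which grows with the iteration count in the first case and stays at one in the second.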
Q10. Hadoop is a batch processing system. How do you handle Big Data Analytics in real time (if any)?
Jochen L. Leidner: The dichotomy is perhaps between batch processing and dialog processing, whereas real-time (“hard”, deterministic time guarantee for a system response) goes hand-in-hand with its opposite non-real time, but I think what you are after here is that dialog systems have to be responsive. There is no one-size-fits-all method for meeting (near) real-time requirements or for making dialog systems more responsive; a lot of analytics functions require the analysis of more than just a few recent data-points, so if that’s what is needed it may take its time. But it is important that critical functions, such as financial data feeds, are delivered as fast as possible – micro-seconds matter here. The more commoditized analytics functions become, the faster they need to be available to retain at least speed as a differentiator.
Q11. Cloud computing and open source: Do they play a role at Thomson Reuters? If yes, how?
Jochen L. Leidner: Regarding cloud computing, we use cloud services internally and as part of some of our product offerings. However, there are also reservations – a lot of our applications contain information that is too sensitive to entrust to a third party, especially as many cloud vendors cannot give you a commitment with respect to hosting (or not hosting) in particular jurisdictions. Therefore, we operate our own set of data centers, and some part of these operates as what has become known as “private clouds”, retaining the benefit of the management outsourcing abstraction, but within our large organization rather than pushing it out to a third party. Of course the notion of private clouds is leading the cloud idea ad absurdum quite a bit, because it sacrifices the economy of scale, but having more control is an advantage.
Open source plays a huge role at Thomson Reuters – we rely on many open source components, libraries and systems, especially under the MIT, BSD, LGPL and Apache licenses. For example, some of our tagging pipelines rely on Apache UIMA, which is a contribution originally developed at IBM, and which has seen contributions by researchers from all around the world (from Darmstadt to Boulder). To date, we have not been very good about opening up our own services in the form of source code, but we are trying to change that now, and we have just launched a new corporation-wide process for open-sourcing software. We also have an internal sharing repository, “Corporate Source”, but in my personal view the audience in any single company is too small – open source (like other recent waves such as clouds or crowdsourcing) needs Internet-scale to work, and Web search engines for the various projects to be discovered.
Q12. What are the main research challenges ahead? And what are the main business challenges ahead?
Jochen L. Leidner: Some of the main business challenges are the cost pressure that some of our customers face, and the increasing availability of low-cost or free-of-charge information sources, i.e. the commoditization of information. I would caution here that whereas the amount of information available for free is large, this in itself does not help you if you have a particular problem and cannot find the information that helps you solve it, either because the solution is not there despite the size, or because it is there but findability is low. Further challenges include information integration, making systems ever more adaptive, but only to the extent it is useful, or supporting better personalization. Having said this, sometimes systems need to be run in a non-personalized mode (e.g. in the field of e-discovery, you need to have a certain consistency, namely that the same legal search system retrieves the same things today and tomorrow, and for different parties).
Q13. Anything else you wish to add?
Jochen L. Leidner: I would encourage decision makers of global companies not to get misled by fast-changing “hype” language: words like “cloud”, “analytics” and “big data” are too general to inform a professional discussion. Putting my linguist’s hat on, I can only caution about the lack of precision inherent in marketing language’s use in technological discussions: for instance, cluster computing is not the same as grid computing. And what looks like “big data” today we will almost certainly carry around with us on a mobile computing device tomorrow. Also, buzz words like “Big Data” do not by themselves solve any problem – they are not magic bullets. To solve any problem, look at the input data, specify the desired output data, and think hard about whether and how you can compute the desired result – nothing but “good old” computer science.
Dr. Jochen L. Leidner is a Lead Scientist with Thomson Reuters, where he is building up the corporation’s London site of its Research & Development group.
He holds Master’s degrees in computational linguistics, English language and literature and computer science from Friedrich-Alexander University Erlangen-Nuremberg and in computer speech, text and internet technologies from the University of Cambridge, as well as a Ph.D. in information extraction from the University of Edinburgh (“Toponym resolution in Text”). He is recipient of the first ACM SIGIR Doctoral Consortium Award, a Royal Society of Edinburgh Enterprise Fellowship in Electronic Markets, and two DAAD scholarships.
Prior to his research career, he worked as a software developer, including for SAP AG (basic technology and knowledge management) as well as for startups.
He led the development teams of multiple question answering systems, including the systems QED at Edinburgh and Alyssa at Saarland University, the latter of which ranked third at the last DARPA/NIST TREC open-domain question answering factoid track evaluation.
His main research interests include information extraction, question answering and search, geo-spatial grounding, and applied machine learning, with a focus on methodology behind research & development in the area of information access.
In 2013, Dr. Leidner also taught an invited lecture course, “Language Technology and Big Data”, at the University of Zurich, Switzerland.