Time for Harmonization across the Data Landscape
Peter Wittenburg (RDA Europe Director)
The trends towards increasingly complex data and a growing number of computational solutions in almost all scientific disciplines are well known. These trends, and the need to open the distributed domain of data and tools to others such as entrepreneurs and citizens, require efficiently working information infrastructures. In some global regions such as the US and Europe, programs such as Cyberinfrastructure and ESFRI (European Strategy Forum on Research Infrastructures) [1,2] were launched some years ago in anticipation of the need for advanced infrastructures. Currently, funders follow the strategy of letting a thousand flowers bloom, since there is not yet a clear view of how to arrive at an efficient ecosystem of information infrastructures (II).
A German study showed that, mainly at the European and German funding levels, there are already more than 500 initiatives, all working hard to provide solutions for their immediate needs. Some of these IIs are project-oriented and thus focus directly on specific researcher interests, others address the needs of broad communities, as the ESFRI projects typically do, and some address common cross-disciplinary requirements, as the e-Infrastructures typically do.
This phase of broad funding was and is extremely helpful and essential for two reasons:
(1) A large number of scientists and technologists have been engaged from various disciplines thereby creating both broad awareness and a large group of experts.
(2) All these initiatives and their comparison have given us a much better understanding of what it means to build such IIs and what their building blocks are.
Recent discussions have shown that a growing number of stakeholders understand that we urgently need to start a “harmonization track” in parallel, mainly for the following reasons:
(1) Most of these infrastructures have been organized like projects and started developing components in silos to serve immediate needs.
(2) Heterogeneity decreases interoperability and thus increases inefficiencies, since research is, in general, organized across boundaries.
(3) The solution space is huge, as many initiatives addressed similar problems in slightly different ways, and there is no chance of maintaining all these solutions, since the costs are simply too high.
(4) Industry and also scientists hesitate to invest, since there are no clear directions yet and the risk that money is wasted due to wrong decisions is too high.
Thus, we need to enter a phase of “harmonization” or “reduction of the solution space”, as happened many years ago when TCP/IP was accepted globally as THE basis for computer communication. This global agreement had, as we know, a huge impact on how we exchange information globally, and it created completely new businesses, companies and labour.
To make data globally accessible and exploitable, and much more efficiently so, to the benefit of societies, we need convergence on what the Data Fabric Interest Group of the Research Data Alliance calls Common Components, the generic building blocks of information infrastructures. Obviously, making the data landscape more interoperable is much more complex than interconnecting computers and other networking devices such as routers. The question is therefore how to achieve progress in harmonization within a limited time. Attempts to define standards in a top-down fashion will fail, as will imposing a specific type of solution found in one discipline on all others. The only possibility is to choose the time-consuming path of bringing people together across boundaries (geographic, disciplinary), to analyse the various solutions and to identify the common components and their characteristics based on rough consensus principles. This is the route the Research Data Alliance is taking, similar to other organizations such as the IETF (Internet Engineering Task Force).
The question is whether this will be sufficient, and the simple answer is NO. Twelve RDA working groups produced concrete results within 2.5 years. Most of the roughly 700 data practitioners involved in creating these results will be satisfied, since they built concrete bridges that make their work more interoperable and efficient in a cross-disciplinary fashion. But this is still a small group, and why should investors rely on the broad uptake of these results? We need to start applying the paradigm of virtuous circles, which is often the only way to improve complex systems, i.e. we need to stimulate testing and adoption of the results (also in combination) and to learn from these exercises how to improve the current results, detect gaps, etc. If we manage to include many communities across many countries in this active process, this will create the momentum that is necessary to reach impact and achieve convergence.
This path is tough and requires time. Some are still asking whether RDA is the right vehicle to stimulate and catalyse this process. Of course, RDA recognizes that other initiatives such as W3C are also working on generic components, that a number of strong communities have come up with highly relevant solutions, and that infrastructures built up over years cannot throw away all their solutions overnight. However, now that RDA has shown that it is able to bring people together across disciplinary and geographic boundaries based on widely agreed principles, there seems to be only one answer: if we think that RDA is not working well enough, we should convince the active crowd and change it.
It is all of us (data scientists, data managers, data librarians etc.) who are responsible for defining the structures that are urgently needed to build the social and technological bridges for improved data sharing and re-use effectively.
Obviously, this bottom-up engagement alone will still not be sufficient to change practices. We also need to achieve momentum at policy level that has the power to turn wide agreements into recommendations. TCP/IP would not have been accepted had there been no policy momentum. Bringing all these aspects together at policy level to arrive at global recommendations is the real challenge, and it will take time. But we need to try it NOW.
The RDA Europe project works on all these levels:
(1) funds were reserved to support testing and adopting work,
(2) structures were established to extract knowledge from all testing and to feed it back to the RDA global process,
(3) interactions at all levels (policy makers, practitioners) are continued to bring people together and to facilitate the steps from testing to recommendations.
But we also see an urgent need for even more activity to accelerate the virtuous circle, which can only be achieved by creating additional motivation.
If you are interested in participating in this global activity and in being at the cutting edge of infrastructure discussions, contact RDA global and/or RDA Europe to make your contribution and take advantage of the opportunities.
(German) Council for Information Infrastructures: survey, not yet published.
M. L. Brodie, Understanding Data Science: An Emerging Discipline for Data-Intensive Discovery, keynote, Proc. of the XVII Int’l Conf. Data Analytics and Management in Data Intensive Domains (DAMDID’2015), Obninsk, Russia, October 13-16, 2015.