The Frontier of Data Discovery
- Mark Sanden (SURFsara BV)
Open Science reproducibility and the optimal use and reuse of research data can only be realised if data is consistently maintained according to the FAIR principles (findable, accessible, interoperable and re-usable) within a secure and trustworthy environment. Scientific data is growing exponentially, is produced in increasingly automated ways, and must be shared more and more among researchers within and across disciplines, so making research data discoverable is an essential step. Scientific communities and data providers have adopted very different standards to describe scientific output. This makes it difficult to extract enough content-related information to enable cross-disciplinary search and to link scientific output to publications. Scientific output is also highly distributed, stored across European, national, regional, institutional and community-based repositories. OpenAIRE and EUDAT provide cross-disciplinary data discovery services that harvest metadata from many of these repositories and present it in a simple and user-friendly way.
Data discovery relies heavily on data providers and on the quality of the information they provide. The semantics used to describe relationships between datasets and publications are heterogeneous across communities, and the granularity at which datasets are described differs across disciplines. A dataset can consist of a single object or a few objects, or of a large number of objects amounting to terabytes or even petabytes of data. For example, in bio-databases of sequences a single publication can be linked to millions of sequences, and there is no formal way to identify sets of sequences. Many challenges remain: missing standards for, or missing use of, licenses; poor descriptive metadata; heterogeneous ways of referring to formats and schemas; how to link datasets to research communities; how versioned datasets can be referred to or discovered; how to deduplicate links when information is collected in different places; and how the quality of data can be measured in terms of access (usage statistics), liveliness (number of versions), citations or feedback from users. Where can we provide added value to the individual researcher, the research community and other stakeholders active within the science domain?
In this World Cafe session, we present the current state of data discovery through the work of the OpenAIRE and EOSC-hub projects and from the perspective of a community. In a panel discussion, feedback will be collected from the audience and presenters on the current status of data discovery and on future directions for improving it.
- Community representatives and data managers with interest to extend
- Data repository owners who want to make research data findable