Over the last decades, the extraction and resolution of entities in documents has become a central task in large-scale text mining and has been the subject of numerous research and development projects.
Entity extraction and resolution (linking, disambiguation) is the task of determining the identity of entities mentioned in a text against an existing knowledge base reflecting the available knowledge for the domain under consideration.
The task may comprise the recognition of generic named entities relevant to general-purpose subjects, such as person names, locations and organisation names, but also the resolution of specialized entities associated with scientific or technical domains such as chemistry, biology or astronomy. Beyond the ongoing research and development work, there is a need to deploy named entity services that can be used seamlessly within a variety of communities, in particular in the humanities, where digital methods are being taken up swiftly.
*entity-fishing* addresses these needs and provides a generic service for entity extraction and disambiguation (NERD) against Wikipedia and Wikidata, while supporting further adaptation to specialist domains. This generic design keeps it independent of any particular framework or usage scenario, for maximum reuse.
*entity-fishing* offers an efficient and compact design and a rich API that allows the processing of different inputs (raw or partially annotated text, PDF, search queries), different languages (English, Italian, French, German and Spanish) and different formats. Thanks to the integration of the GROBID library, it natively supports PDF as source or destination.
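To make the notion of "different inputs" concrete, the following minimal sketch shows how a client might assemble a JSON query for raw text, a short search query, or partially annotated text. The field names (`text`, `shortText`, `language`, `entities`, `rawName`, `offsetStart`, `offsetEnd`), the two-letter language codes and the helper `build_query` are illustrative assumptions rather than a description of the service's actual interface; the service documentation remains the reference.

```python
import json

# Languages listed in the text above; the two-letter codes are an assumption.
SUPPORTED_LANGUAGES = {"en", "it", "fr", "de", "es"}


def build_query(text, lang="en", entities=None, short=False):
    """Build a hypothetical JSON query for a disambiguation request.

    text     -- raw text, or a short search query when short=True
    lang     -- one of SUPPORTED_LANGUAGES
    entities -- optional pre-annotated mentions (partially annotated input),
                each a dict with rawName, offsetStart and offsetEnd
    """
    if lang not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {lang}")
    # Assumed convention: short queries and full texts use different keys.
    key = "shortText" if short else "text"
    query = {
        key: text,
        "language": {"lang": lang},
        "entities": list(entities or []),
    }
    return json.dumps(query, ensure_ascii=False)


# Partially annotated input: the mention "Austria" is already marked.
sentence = "Austria invaded Serbia in 1914."
mention = {"rawName": "Austria", "offsetStart": 0, "offsetEnd": 7}
payload = build_query(sentence, lang="en", entities=[mention])
```

Under these assumptions, the resulting string would be sent as the body of a request to a deployed instance; the sketch deliberately makes no network call.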
*entity-fishing* employs supervised machine learning algorithms for both recognition and disambiguation.
Not only does the API target state-of-the-art performance in terms of accuracy, coverage (entity variety), speed and exploitation of the context, but the service is also completely decoupled from the implementation details, and thus independent of any particular platform, infrastructure or architecture, making it easier to integrate, scale and monitor.
*entity-fishing* is currently available in the DARIAH infrastructure, deployed at HUMA-NUM, the French organisation providing large infrastructure for researchers in the Social Sciences and Digital Humanities.
This application was initially developed by Inria in the context of the EU FP7 project CENDARI and, within the scope of the European H2020 project HIRMEOS, it has been released and deployed as a service within the DARIAH research infrastructure.
HIRMEOS is a 30-month project that focuses on the monograph as a significant mode of scholarly communication in the social sciences and humanities (SSH) and tackles the main obstacles to the full integration of five large-scale platforms supporting open access content: OpenEdition Books, OAPEN Library, EKT Open Book Press, Ubiquity Press and Göttingen University Press.
As part of our contribution, we present *entity-fishing* and discuss the use cases implemented by the partners within the HIRMEOS project, such as enhanced semantic search, on-the-fly PDF annotation and timeline visualisation.
These use cases are applicable not only to other publishing platforms in SSH but potentially to any open access publication repository.