Villa Romanazzi Carducci - Federico II
A Tutorial on Hybrid Data Infrastructures: D4Science as a case study
- Pasquale PAGANO
- Gianpaolo CORO
- Donatella CASTELLI (Consiglio Nazionale delle Ricerche (CNR) - ISTI)
An e-Infrastructure is a distributed network of service nodes, residing on multiple sites and managed by one or more organizations, that allows scientists at distant locations to collaborate. It may offer a multiplicity of facilities as-a-Service, supporting data sharing and usage at different levels of abstraction. E-Infrastructures can have different implementations (Andronico et al. 2011). A major distinction is between (i) Data e-Infrastructures, i.e. digital infrastructures promoting data sharing and consumption within a community of practice (e.g. MyOcean, Blanc 2008) and (ii) Computational e-Infrastructures, which support the processes required by a community of practice using GRID and Cloud computing facilities (e.g. Candela et al. 2013). A more recent type of e-Infrastructure is the Hybrid Data Infrastructure (HDI) (Candela et al. 2010), i.e. a Data and Computational e-Infrastructure that adopts a delivery model for data management in which computing, storage, data and software are made available as-a-Service. HDIs support, for example, data transfer, data harmonization and data processing workflows. Hybrid Data e-Infrastructures have already been used in several European and international projects (e.g. i-Marine 2011; EuBrazil OpenBio 2011), and their exploitation is growing fast, supporting new projects and initiatives such as Parthenos, Ariadne and Descramble. A particular HDI, named D4Science (Candela et al. 2009), has been used by communities of practice in the fields of biodiversity conservation, geothermal energy monitoring, fisheries management, and cultural heritage. This e-Infrastructure hosts models and resources provided by several international organizations involved in these fields. Its capabilities help scientists to access and manage data, reuse data and models, obtain results in a short time and share these results with colleagues.
In this tutorial, we will give an overview of the D4Science capabilities; in particular, we will show practices and methods that large international organizations like FAO and UNESCO apply by means of D4Science. At the same time, we will explain how the D4Science facilities conform to the concepts of e-Infrastructures, Virtual Research Environments (VREs), data sharing and experiment reproducibility. We will also give insight into how D4Science contributors can add new models and algorithms to the processing platform. D4Science adopts methods to embed software developed by communities of practice that include people with limited expertise in Computer Science. Community software includes legacy programs (e.g. written in Fortran 90) as well as R scripts developed under different operating systems and versions of the R interpreter. D4Science is able to manage this multi-language scenario in its Cloud computing platform (Coro et al. 2014). Finally, D4Science uses the EGI Federated Cloud (FedCloud) infrastructure for data processing: computations are parallelized by dividing the input into several chunks, and each chunk is sent to D4Science services residing on FedCloud (Generic Workers) to be processed. Furthermore, another D4Science service that executes data mining algorithms (DataMiner) also resides on FedCloud and adopts an interface that is compliant with the Web Processing Service (WPS, Schut and Whiteside 2015) specifications.
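The chunk-based parallelization mentioned above can be illustrated with a minimal Python sketch. It is not D4Science code: the function names (split_into_chunks, process_in_parallel, square_chunk) are hypothetical, and a local thread pool stands in for the Generic Workers on FedCloud, whose actual interface is not described in this abstract.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def split_into_chunks(items: Sequence, n_chunks: int) -> List[Sequence]:
    """Split items into at most n_chunks contiguous, near-equal chunks."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    # Never create more chunks than items; keep one (empty) chunk for empty input.
    n_chunks = min(n_chunks, len(items)) or 1
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks receive one additional item.
        end = start + base + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def process_in_parallel(
    items: Sequence,
    worker: Callable[[Sequence], list],
    n_chunks: int,
) -> list:
    """Send each chunk to a worker and merge the partial results in input order."""
    chunks = split_into_chunks(items, n_chunks)
    # Assumption: local threads play the role of remote worker services.
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        partial_results = list(pool.map(worker, chunks))
    return [result for part in partial_results for result in part]


def square_chunk(chunk: Sequence) -> list:
    """Hypothetical stand-in for an algorithm applied to one chunk."""
    return [x * x for x in chunk]


if __name__ == "__main__":
    print(process_in_parallel(list(range(10)), square_chunk, n_chunks=3))
```

Because pool.map returns results in the order the chunks were submitted, the merged output preserves the input order even if workers finish at different times. This is one possible design; a real deployment would also need to handle worker failures and data transfer.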