We’re happy to announce that EGI2024 will take place in beautiful Lecce, Italy – from September 30th to October 4th.
For this edition of our annual conference, we are closely collaborating with Fondazione Centro Euro-Mediterraneo sui Cambiamenti Climatici (CMCC), based in Lecce.
About CMCC
The Euro-Mediterranean Center on Climate Change is an Italian research centre dedicated to climate and climate-related research, including climate variability and its causes and consequences, carried out through numerical models ranging from global Earth system models to regional models of the Euro-Mediterranean area.
About Lecce
Located in the very heel of Italy, Lecce is a gorgeous, baroque city with a delightful Mediterranean climate at the end of September, known for its olive oil and wine culture.
About the venue
The main location for the conference will be the Hilton Garden Inn in Lecce. Some side meetings will take place at the CMCC offices.
Timeline
FitSM Foundation Training co-located with EGI2024
Details and registration: https://indico.egi.eu/event/6526/
The EU-funded SPECTRUM project brings together leading European science organisations and e-Infrastructure providers to formulate a Strategic Research, Innovation, and Deployment Agenda (SRIDA) along with a Technical Blueprint for a European compute and data continuum. This collaborative effort is set to create an Exabyte-scale research data federation and compute continuum, fostering data-intensive scientific collaborations across Europe. Further information: https://www.spectrumproject.eu/
Take the Next Step in IT Service Management with FitSM Foundation Training!
Enrol Now on Indico (https://indico.egi.eu/event/6526/) and Transform Your IT Service Delivery!
Are you ready to transform your IT service delivery and enhance customer satisfaction?
The FitSM Foundation Training is your gateway to mastering the essentials of IT Service Management (ITSM). Designed to provide you with a comprehensive understanding of the FitSM framework, this course equips you with the knowledge and tools to optimise your IT services for efficiency, cost savings, and superior user experience.
We are co-locating our FitSM Foundation Training with EGI2024, taking place in Lecce, Italy, and we will be hosted by our partners at the CMCC Foundation for this activity.
Why Choose FitSM Foundation Training?
Gain In-Depth Knowledge: Understand the fundamental principles of ITSM and how they apply to your organisation.
Master the FitSM Process Model: Learn to navigate and implement the FitSM-aligned processes seamlessly within your team.
Achieve Professional Recognition: Earn the prestigious FitSM Foundation certificate issued by APMG, demonstrating your expertise and commitment to excellence.
What’s in it for you?
Enhanced Service Delivery: Implementing FitSM standards can significantly boost your IT service availability, reliability, and user-centric focus.
Cost Efficiency: Optimise your processes to achieve greater cost savings without compromising on service quality.
Immediate Application: Start applying FitSM-compliant processes within your organisation as soon as you complete the course.
Course Highlights
Neutral Introduction: Provides a balanced introduction to ITSM and the FitSM process-driven model, comparable to other standards like ITIL and ISO20k.
Official Certification: Successfully pass the exam (20 multiple-choice questions, 13 needed to pass) and receive a certification endorsed by APMG, recognised globally for its excellence.
Training Schedule and Location at EGI2024
Dates:
Monday, 30th September: 14:00 - 18:00
Tuesday, 1st October: 9:00 - 12:30
Venue: CMCC (Via Marco Biagi, 5, 73100 Lecce, Italy)
Requirements:
There are no prerequisites for the FitSM Foundation Training.
A limited number of tickets:
The training will be delivered to a minimum of 6 and a maximum of 15 trainees.
Cost
Early Bird Course Fee: €450 + VAT (€82 certification fee included)
Course Fee: €550 + VAT (€82 certification fee included)
EGI Federation members pay a discounted price of €180 + VAT (€82 certification fee included). Unsure about eligibility? Contact us at events@egi.eu.
What is included in the fee
In-person training (not online)
EGI training material
Catering
Facility
APMG Exam + APMG Certificate upon success (€82 + VAT) - a second try is permitted upon failure
Free access to the FitSM Workshop on Thursday 3rd of October
If you are part of one of the following projects, you can use your project's FitSM-related training budget to cover the expenses:
METROFOOD
iMagine
PITHIA
IRISCC
Don’t miss out on this opportunity to advance your IT service management skills and take your organisation’s service delivery to the next level. Enrol in the FitSM Foundation Training today and begin your journey towards ITSM mastery and professional growth!
Enrol Now on Indico (https://indico.egi.eu/event/6526/) and Transform Your IT Service Delivery!
This workshop is for the institutes that contribute to the EGI AAI infrastructure and will cover recent developments, future roadmaps and broader collaboration opportunities in the AAI domain.
This is the General Assembly meeting of the GreenDIGIT project, run by four digital infrastructures (EGI, SoBigData, SLICES, EBRAINS) to address the vital need of ESFRI Research Infrastructures and other digital service providers for science to lower their energy consumption and environmental impact. Further information: https://greendigit-project.eu/
The ICT sector contributes 1.8%–2.8% of global greenhouse gas emissions. Green computing initiatives aim to reduce this impact.
EGI, in partnership with the SLICES, SoBigData and EBRAINS digital research infrastructures, is at the forefront of this movement through the EC-funded GreenDIGIT project. This session explores how GreenDIGIT is working to:
We will delve into the GreenDIGIT work plan and its initial results, including a landscape analysis across EGI, other research and e-infrastructures, and partner research institutions.
Join us for a panel discussion where experts will explore how these findings can shape the future of green computing initiatives across GreenDIGIT, EGI and the ESFRI research infrastructures.
In order to keep Research Infrastructures (RIs) at the highest level of excellence in science, new technologies and solutions must be developed to steer toward a reduced environmental footprint, as is the case for all domains of our societies. Lowering the environmental impact of digital services and technologies has to become a priority for both the operation of existing digital services and the design of future digital infrastructures. GreenDIGIT brings together four major distributed Digital Infrastructures at different lifecycle stages (EGI, SLICES, SoBigData, EBRAINS) to tackle the challenge of environmental impact reduction, with the ambition to provide solutions that are reusable across the whole spectrum of digital services in the ESFRI landscape and to act as a role model. GreenDIGIT will capture good practices and existing solutions and will develop new technologies and solutions for all aspects of the digital continuum: from service provisioning to monitoring, job scheduling, resource allocation, architecture, workload and Open Science practices, task execution, storage, and use of green energy. GreenDIGIT will deliver these solutions as building blocks, with a reference architecture and guidelines for RIs to lower their environmental footprint. User-side tools and Virtual Research Environments will also be expanded with energy usage reporting and reproducibility capabilities to motivate users to apply low-energy practices. The new solutions will be validated through reference scientific use cases from diverse disciplines and will be promoted to providers and users to prepare the next generation of Digital RIs with a low environmental footprint.
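To make the idea of user-side energy usage reporting concrete, here is a minimal sketch of how the footprint of a single compute job could be estimated; the power draw, PUE and grid carbon-intensity figures are illustrative assumptions, not GreenDIGIT reference values.

```python
# Illustrative sketch: estimating the energy and carbon footprint of a compute job.
# All numbers below are assumptions for demonstration, not GreenDIGIT reference values.

def job_footprint(cpu_core_hours: float,
                  avg_power_per_core_w: float = 10.0,   # assumed average draw per core
                  pue: float = 1.4,                      # assumed data-centre PUE
                  grid_gco2_per_kwh: float = 250.0):     # assumed grid carbon intensity
    """Return (energy in kWh, emissions in kg CO2e) for a job."""
    energy_kwh = cpu_core_hours * avg_power_per_core_w / 1000.0 * pue
    emissions_kg = energy_kwh * grid_gco2_per_kwh / 1000.0
    return energy_kwh, emissions_kg

if __name__ == "__main__":
    energy, co2 = job_footprint(cpu_core_hours=5000)
    print(f"Estimated energy: {energy:.1f} kWh, emissions: {co2:.1f} kg CO2e")
```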
The GreenDIGIT project ran a survey among research infrastructures to understand their current status, practices, plans and needs for lowering the environmental impact of their digital services. This presentation will share the data and findings from this survey.
AWS support for research and collaboration with HPC centres using the greenest cloud platform
09.00-09.45: Panel discussion
Moderator: Lisbon Council/EGI
- The Lethe approach (LC)
- Capturing patient data through apps (FORTH)
- The Lethe app + Prediction models (FHJ)
- Integration of applications to prevent cognitive decline (COMB)
- EGI FedCloud and infrastructure solutions (EGI)
09.45-10.30: Workshop activity
Moderator: Lisbon Council/EGI
New to EGI? Feeling overwhelmed? This session is your essential introduction to understanding EGI!
EGI is your gateway to a powerful global network of computing resources, data analytics tools, and expert support. We empower researchers, innovators, and educators to break boundaries in their fields.
In this EGI101 session, you'll gain a clear understanding of:
No matter your background, this session will equip you with the knowledge you need to understand EGI.
A collaborative approach is the hallmark of modern science and research. Scientists and researchers from different disciplines and countries work together to advance humanity’s understanding of the world around us. As science and research become more “digital” – combining digital assets with digital tools and services to increase the speed and excellence of the scientific process – those digital assets and services must also be used collaboratively and increasingly provided and shared on a collaborative basis. Several terms have been used to describe this collaborative process, including open science commons and research ecosystems. Open science commons will face opportunities and challenges in the coming years. In this presentation we will illustrate some of them.
Firstly, accessibility of distributed data, computing and storage services is a major requirement to enable data-intensive science. Compute and storage are the capabilities by which value is added to existing research data through its re-use and combination with other data - by researchers from any discipline and any country. In the current and future geopolitical landscape, data, computing and storage will be critical infrastructures to retain scientific excellence. It is important that policies are established to safeguard the digital sovereignty of research performing organisations.
Secondly, more and increasingly large scientific datasets are generated at research facilities and made available in ‘data holdings’ or ‘FAIR data repositories’. Although datasets are offered as resources in these holdings for primary and secondary use, as the size of holdings, the size of individual datasets and the complexity of datasets grow, the re-use of data becomes practically impossible without technical knowledge of compute environments, data staging, data analysis and AI techniques. To retain scientific excellence, the data analysis capacity of research performing organisations needs to be significantly enhanced. This entails the coordinated, sustainable provisioning of data as a service together with scalable computational platforms and the integration of AI frameworks. This will enable more complex, data-intensive research, accelerating scientific discoveries and technological innovations. The advancement of interoperable standards and protocols will be required to facilitate seamless data exchange and collaboration across different scientific domains, breaking down silos and enabling multidisciplinary research efforts.
Lastly, AI democratisation will be necessary. AI needs to be put into the hands of researchers without specialised AI and technical knowledge. Large-scale AI adoption requires open-source datasets, integrated computing infrastructure and tools which demand less knowledge of AI from the user, so that they can build innovative AI software. Research performing organisations need to integrate their capabilities to provide access to advanced computing, datasets, models, software, training and user support to researchers.
We will illustrate how the EGI Federation, as the reference digital infrastructure for data-intensive computing in Europe and beyond, is getting ready with its members and partners to face these opportunities and challenges, and an overview of its flagship Research and Development projects will be provided.
"Join us for a session delving into the world of cloud computing. Discover how cloud computing empowers researchers by providing on-demand computing resources and complete control over hosting environments. Learn about the EGI Cloud, which federates 30 providers across Europe and beyond, offering seamless access using Single Sign-On with Check-in.
This session will bring together national cloud computing initiatives and showcase the latest developments in EGI cloud service components and frameworks. Don't miss this opportunity to explore how researchers can leverage cloud resources for their projects."
Last year, we introduced Beskar Cloud - an open-source community around deploying and maintaining OpenStack on top of Kubernetes. Since then, we have successfully built two OpenStack sites and seamlessly transitioned users from our original OpenStack instance to the new environment built on Beskar Cloud.
In this presentation, we aim to provide an overview of our progress throughout the past year, detailing the advancements made within the project. Additionally, we will share insights gained from our experiences with migrations and day-to-day operations.
Over the past years, the Italian National Institute for Nuclear Physics (INFN) has developed and refined its cloud platform, designed to facilitate access to distributed computing and storage resources for scientific research. This evolution in Platform-as-a-Service (PaaS) orchestration has focused on enabling seamless service deployment, improving user experience, and integrating innovative solutions to address changing demands and technological challenges.
INFN's journey toward a robust cloud platform began with the deployment of a national cloud system designed to streamline access to distributed resources. A key element of this initiative was a user-friendly web portal, the INFN Cloud Dashboard, allowing users to instantiate high-level services on-demand. This was achieved through TOSCA templates processed by an orchestration system that supported a lightweight federation of cloud sites and automated scheduling for optimal resource allocation.
The orchestration system used by INFN Cloud is based on the open-source INDIGO PaaS middleware, designed to federate heterogeneous computing environments. It plays a crucial role in orchestrating virtual infrastructure deployment, enabling high-level services like Jupyter Hub, Kubernetes, and Spark clusters. The core component, the Orchestrator, is supported by micro-services, facilitating the optimal selection of cloud providers based on specific deployment requirements.
In the context of the internal INFN DataCloud project and some European projects like interTwin and AI4EOSC, INFN is undertaking a comprehensive revamp of its PaaS system to accommodate the changing technology landscape and replace old and legacy software components. A key example of this effort is the transition from the legacy Configuration Management Database (CMDB) to the Federation-Registry, a modern solution built on the FastAPI framework and using neo4j, a flexible graph database. This transition will ensure more robust and scalable management of federation-related information, supporting a diverse set of cloud providers and modern security protocols.
To further optimize the orchestration system, INFN is exploring the use of artificial intelligence to improve deployment scheduling. The Cloud Provider Ranker, which provides the list of providers based on various metrics and Service Level Agreements (SLAs), is going to be enhanced with AI techniques. This improvement will allow for the identification of meaningful metrics, creation of predictive models for deployment success/failure, and regression models for deployment times. These models will enable a more dynamic and accurate ranking of cloud providers, leading to more efficient resource usage and a reduction in deployment failures.
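As an illustration of the kind of predictive model mentioned above, the following sketch trains a simple scikit-learn classifier on hypothetical historical deployment records and ranks candidate providers by predicted success probability; the feature names and data are invented for the example and do not reflect the actual Cloud Provider Ranker implementation.

```python
# Hypothetical sketch of ranking cloud providers by predicted deployment success.
# Features and data are invented for illustration only.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical historical deployment records (one row per past deployment).
history = pd.DataFrame({
    "provider_id":      [0, 0, 1, 1, 2, 2, 0, 1],
    "requested_vcpus":  [4, 16, 4, 32, 8, 8, 2, 16],
    "requested_ram_gb": [8, 64, 8, 128, 16, 16, 4, 64],
    "provider_load":    [0.3, 0.9, 0.5, 0.8, 0.2, 0.4, 0.1, 0.7],
    "success":          [1, 0, 1, 0, 1, 1, 1, 1],
})

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(history.drop(columns="success"), history["success"])

# Score the same request against each candidate provider and rank them.
candidates = pd.DataFrame({
    "provider_id":      [0, 1, 2],
    "requested_vcpus":  [8, 8, 8],
    "requested_ram_gb": [16, 16, 16],
    "provider_load":    [0.6, 0.3, 0.5],
})
candidates["p_success"] = model.predict_proba(candidates)[:, 1]
print(candidates.sort_values("p_success", ascending=False))
```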
Finally, the PaaS dashboard, which serves as a gateway for user interaction with the orchestration and service deployment system, recently underwent a major renovation to improve usability and security. The dashboard redesign aimed to offer a more secure, efficient, and user-friendly interface while providing a visually appealing design.
This contribution will outline the key advancements in the PaaS orchestration system aimed at supporting scientific communities with a reliable, scalable, and user-friendly environment for their computational needs.
Contemporary HPC and cloud-based data processing is based on complex workflows requiring close access to large amounts of data. OpenEO process graphs allow users to access data collections and create complex processing chains. Currently, OpenEO can be accessed via one of the clients in JavaScript, R, or Python. Direct access to data is provided via SpatioTemporal Asset Catalogs (STAC). As part of our ongoing research under the InterTwin project, the focus is on extending the capabilities of OpenEO to support the management and execution of OGC Application Packages.
An Application Package allows users to create automated, scalable, reusable and portable workflows. It does so by creating, for example, a Docker Image containing all of the application code and dependencies. The workflow is described with Common Workflow Language (CWL). The CWL document references all of the inputs, outputs, steps, and environmental configurations to automate the execution of an application.
The execution is handled by an Application Deployment and Execution Service (ADES) coming from the EOEPCA project. It is a Kubernetes-based processing service capable of executing Application Packages via the OGC WPS 1.0 and 2.0 OWS services and the OGC API - Processes. In many ways, the core goals and objectives of EOEPCA and the InterTwin project align well. The focus is on allowing workflows to be seamlessly executed without the need for substantial code rewrites or adaptations to a specific platform.
OpenEO is on its way to become an OGC community standard. It currently supports a large set of well-defined cloud optimized processes that allow users to preprocess and process data directly in the cloud. The goal is to integrate the Application Deployment and Execution Service (ADES) from the Earth Observation Exploitation Platform Common Architecture (EOEPCA) project to create a fusion between OpenEO process graphs and Application Packages.
Application Package support in OpenEO is a means of providing users the ability to bring their applications directly to the platform. Instead of having to reimplement the code in a process graph, it is possible to wrap any existing application. The fusion of process graphs and CWL based application workflows extends OpenEO for users that would like to perform testing of their models, ensemble models etc. while utilizing the same process graph and direct access to data.
Many complex workflows require some kind of data preprocessing. This preprocessing can be done using OpenEO process graphs and then be directly sent for execution to an Application Package to run the actual process. OpenEO complements STAC by providing a standardized interface and processing framework for accessing and analyzing Earth observation data. Having all of the data and tools readily available on a single platform creates an accessible, interoperable, and a reproducible environment for users to create efficient workflows.
The ability to create standardized, reusable workflows using CWL and execute them on distributed computing resources via ADES can significantly reduce the time and effort required for data processing tasks. Researchers can focus on algorithm development and data analysis rather than worrying about infrastructure management or software compatibility issues.
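For readers unfamiliar with OpenEO process graphs, the following is a minimal sketch using the openEO Python client; the back-end URL, collection identifier and band names are placeholders, and authentication details depend on the target platform.

```python
# Minimal sketch using the openEO Python client; the back-end URL, collection id and
# band names are placeholders -- adapt them to the target platform.
import openeo

connection = openeo.connect("https://openeo.example.org")  # placeholder back end
connection.authenticate_oidc()                             # interactive OIDC login

cube = connection.load_collection(
    "SENTINEL2_L2A",                                        # example collection id
    spatial_extent={"west": 16.1, "south": 48.1, "east": 16.6, "north": 48.4},
    temporal_extent=["2024-06-01", "2024-06-30"],
    bands=["B04", "B08"],
)

# Small process graph: NDVI followed by a temporal mean.
ndvi = cube.ndvi(red="B04", nir="B08")
result = ndvi.reduce_dimension(dimension="t", reducer="mean")

# Execute on the back end and download the result.
result.download("ndvi_mean.tiff")
```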
Nowadays, more and more services are dynamically deployed in Cloud environments. Usually, the services hosted on virtual machines in the Cloud are accessible only via IP addresses or pre-configured hostnames given by the target Cloud providers, making it difficult to provide them with meaningful domain names. The Dynamic DNS service was developed by the Institute of Informatics of the Slovak Academy of Sciences (IISAS) to alleviate this problem.
The Dynamic DNS service provides a unified Dynamic DNS support for virtual machines across the EGI Cloud infrastructure. Users can register their chosen hostnames in predefined domains (e.g., my-server.vo.fedcloud.eu) and assign them to the public IPs of their servers.
The Dynamic DNS service significantly simplifies the deployment of services that are dynamically deployed in Cloud infrastructures. It removes the obstacle of service IP addresses changing at every deployment and enables obtaining SSL certificates for the hostnames. Service providers can migrate services from local servers to the Cloud, or from one Cloud site to another, without users noticing the change.
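To give a flavour of how such a hostname update typically works, the sketch below sends a dyndns2-style update request from Python; the endpoint URL, hostname and credentials are placeholders, so the actual update interface of the service should be taken from its documentation.

```python
# Hypothetical sketch of a dyndns2-style hostname update; the endpoint, hostname
# and credentials below are placeholders, not the actual service configuration.
import requests

UPDATE_URL = "https://dyndns.example.org/nic/update"   # placeholder update endpoint
HOSTNAME = "my-server.vo.fedcloud.eu"                  # hostname registered beforehand
USERNAME = "my-server.vo.fedcloud.eu"                  # placeholder credentials
PASSWORD = "secret-update-token"

def update_ip(new_ip: str) -> str:
    """Point the registered hostname at the VM's current public IP."""
    response = requests.get(
        UPDATE_URL,
        params={"hostname": HOSTNAME, "myip": new_ip},
        auth=(USERNAME, PASSWORD),
        timeout=30,
    )
    response.raise_for_status()
    return response.text  # e.g. "good <ip>" or "nochg <ip>" in the dyndns2 convention

if __name__ == "__main__":
    print(update_ip("203.0.113.10"))
```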
The service has been in operation since 2018 with more than one hundred active users. It is being upgraded for stability and security. There are several new and ongoing developments that may be of interest to users of the Dynamic DNS service:
Stay tuned!
Data spaces are an EU data-sharing paradigm to enable data flows between different domains and stakeholders while promoting fair participation and respect for data sharing conditions. Sectorial data spaces have been launched covering domains such as the Green Deal, cultural heritage and agriculture, to help solve some of today’s most complex societal challenges such as climate neutrality. The session will address data sharing approaches, technology solutions and dedicated use cases that make data spaces a reality.
Lambousa is a 25-meter-long wooden boat of the Liberty type, built in 1955 in Greece. It was registered in Cyprus in 1965 and was used as a fishing trawler until 2004, when it was withdrawn according to EU Fishing Policy (EU Directive 2008/56/EC). The boat was preserved in the sea by the Municipality of Limassol as a monument of the local cultural heritage. In 2020, the boat was dry docked and European funding of more than one million euro was acquired for its full restoration. The project began in January 2023, undertaken by a local marine maintenance company. More than 20 different traditional craftsmen were engaged in a combination of simultaneous works and completed the restoration in one year. The project was under the supervision of a municipal engineers’ team and an archaeologist-consultant and superintendent, in order to record the restoration procedures and follow traditional shipbuilding techniques during the restoration.
This constitutes the largest, most detailed, most expensive and most complex multidisciplinary project of its type in Cyprus, and most probably in the Eastern Mediterranean.
The team of the UNESCO Chair on Digital Cultural Heritage at CUT, in cooperation with the Municipality of Limassol and with the support of two EU projects, the H2020 ERA Chair Mnemosyne and the Digital Europe EUreka3D, undertook the detailed 2D and 3D survey of the boat, including its entire intangible heritage/memory.
For the digital survey, high-resolution photogrammetry and LiDAR were used, resulting in an accurate 3D model. The entire data acquisition and survey were based on the results of the newly published EU Study on quality in 3D digitisation of tangible cultural heritage.
In addition, an online platform for the holistic digital documentation of the boat, including its entire biography/memory, is under development to serve further research and the multidisciplinary community of users. The complex 3D reconstruction of the trawler and its related records, such as paradata and metadata, will be harvested in Europeana and presented during Europeana’s TwinIT event at the headquarters of the European Commission in Brussels on the 14th of May 2026.
This is the first time in the EU that a 3D object is harvested in Europeana using the EUreka3D methodology, based on the latest requirements from the EU policy on the Data Cloud in Cultural Heritage, utilising the full power of the EGI Data Cloud infrastructure.
This contribution discusses the boat’s characteristics, its restoration procedures and the positive impact on the preservation of the local maritime cultural heritage, achieved by creating an exact #MemoryTwin and making all information and data available under open access to the entire world.
Onedata [1] is a high-performance data management system with a distributed, global infrastructure that enables users to access heterogeneous storage resources worldwide. It supports various use cases ranging from personal data management to data-intensive scientific computations. Onedata has a fully distributed architecture that facilitates the creation of a hybrid cloud infrastructure with private and commercial cloud resources. Users can collaborate, share, and publish data, as well as perform high-performance computations on distributed data using different interfaces: POSIX-compliant native mounts, pyfs (Python filesystem) plugins, REST/CDMI API, and the S3 protocol (currently in beta).
The latest Onedata release line, 21.02, introduces several new features and improvements that enhance its capabilities in managing distributed datasets throughout their lifecycle. The software allows users to establish a hierarchical structure of datasets, control multi-site replication and distribution using Quality-of-Service rules, and keep track of the dataset size statistics over time. In addition, it also supports the annotation of datasets with metadata, which is crucial for organising and searching for specific data. The platform also includes robust protection mechanisms that prevent data and metadata modification, ensuring the integrity of the dataset in its final stage of preparation. Another key feature of Onedata is its ability to archive datasets for long-term preservation, enabling organisations to retain critical data for future use. This is especially useful in fields such as scientific research, where datasets are often used for extended periods or cited in academic papers. Finally, Onedata supports data-sharing mechanisms aligned with the idea of Open Data, such as the OAI-PMH protocol and the newly introduced Space Marketplace. These features enable users to easily share their datasets with others, either openly or through controlled access.
Currently, Onedata is used in the European projects EUreka3D [2], EuroScienceGateway [3], DOME [4], and InterTwin [5], where it provides a data transparency layer for managing large, distributed datasets in dynamic, containerised hybrid cloud environments.
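As an illustration of programmatic access, the sketch below uses the pyfs (PyFilesystem) plugin mentioned above; the Oneprovider hostname, access token and space name are placeholders, and the exact constructor arguments may differ between Onedata releases.

```python
# Illustrative sketch of accessing a Onedata space through the pyfs (PyFilesystem)
# plugin; hostname, token and paths are placeholders, and the constructor arguments
# may differ between Onedata releases.
from fs.onedatafs import OnedataFS  # provided by the fs-onedatafs plugin

ONEPROVIDER_HOST = "oneprovider.example.org"   # placeholder Oneprovider hostname
ACCESS_TOKEN = "MDAxY2xvY2F0aW9uIG9uZXpvbmUK"  # placeholder access token

odfs = OnedataFS(ONEPROVIDER_HOST, ACCESS_TOKEN)

# List the contents of a space and read a small text file (standard PyFilesystem API).
for entry in odfs.listdir("/my-space"):
    print(entry)

with odfs.open("/my-space/readme.txt") as handle:
    print(handle.read())

odfs.close()
```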
Acknowledgements: This work is co-financed by the Polish Ministry of Education and Science under the program entitled International Co-financed Projects (projects no. 5398/DIGITAL/2023/2 and 5399/DIGITAL/2023/2).
REFERENCES:
[1] Onedata project website. https://onedata.org
[2] EUreka3D: European Union's REKonstructed in 3D. https://eureka3d.eu
[3] EuroScienceGateway project: open infrastructure for data-driven research. https://galaxyproject.org/projects/esg/
[4] DOME: A Distributed Open Marketplace for Europe Cloud and Edge Services. https://dome-marketplace.eu
[5] InterTwin: Interdisciplinary Digital Twin Engine for Science. https://intertwin.eu
CEDAR is a brand-new Horizon Europe project whose key goal is to develop methods, tools, and guidelines to digitise, protect, and integrate data to address significant issues like corruption, aligning with the European Strategy for Data, the development of Common European Data Spaces (CEDS), and the European Data Act. This will lead to improved transparency and accountability in public governance, promoting European values and rights in the digital world, and enriching the European data ecosystem and economy.
The Consortium boasts nine top research institutions and universities, twelve technology and business developing companies, seven public sector end users, and three relevant NGOs. By sharing high-quality datasets, developing secure connectors for European data repositories, and employing innovative technologies for efficient big data management and analysis, CEDAR aims to promote better evidence-based decision-making, combat corruption, and reduce fraud in public administration.
In this short talk (10 minutes) we would like to present the key objectives of the project and, most prominently, the three Pilot Studies (co-located in three different EU member states) that will effectively co-create and test the project's outcomes in a relevant setting with the end users, as well as validate the key CEDAR benefits, which are:
Trust, defined as the favourable response of a decision-making party assessing the risk regarding another party’s ability to fulfil a promise, is an essential enabler for data sharing.
Participants in a data space need to have verifiable information about each other's identities and rely on each other’s compliance with the data space rules, possibly including compliance with domain-specific standards and overarching legal requirements.
The Gaia-X Trust Framework, combining the most established standards on conformity assessment and digital attestations, provides the means to assess compliance with the requirements set to operate in data spaces while ensuring data sovereignty, security and transparency. Furthermore, it promotes organizational and semantic interoperability, by contributing to the alignment of business processes, focusing on the users' needs, and ensuring that the meaning of the exchanged information is preserved throughout the exchanges between parties.
The accreditation by the data space governance authority of the data space Trust Anchors, parties allowed to issue attestations about specific claims within a defined scope, is an essential component of the operationalisation of the Trust Framework.
From a technical standpoint, among the elements that constitute the Framework, asymmetric cryptography and linked data principles are used to build a machine-readable knowledge graph of claims about the objects of assessment, to verify at any time the content integrity of the claims, and to keep track of the origin of the claims and of the parties issuing them.
Finally, the Gaia-X Trust Framework introduces automation in the process of verification of compliance and speeds it up, with the result of lowering the costs and the barriers to participation in data spaces and involvement in data sharing processes, especially for SMEs.
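To illustrate the role of asymmetric cryptography in such attestations, the sketch below signs a canonicalised JSON claim with an Ed25519 key and verifies its integrity with the Python cryptography library; it is a conceptual example of the principle only, not the Gaia-X credential format.

```python
# Conceptual sketch of signing and verifying a claim with asymmetric cryptography;
# this is not the Gaia-X credential format, just an illustration of the principle.
import json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

# A claim issued by a (hypothetical) Trust Anchor about a participant.
claim = {
    "issuer": "did:example:trust-anchor",
    "subject": "did:example:participant-42",
    "statement": "ISO/IEC 27001 certification valid until 2026-01-01",
}
canonical = json.dumps(claim, sort_keys=True, separators=(",", ":")).encode()

# The issuer signs the canonicalised claim with its private key.
private_key = Ed25519PrivateKey.generate()
signature = private_key.sign(canonical)

# Any verifier holding the issuer's public key can check content integrity.
public_key = private_key.public_key()
try:
    public_key.verify(signature, canonical)
    print("Claim integrity verified")
except InvalidSignature:
    print("Claim has been tampered with")
```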
ECMWF’s IT service provision strategy offers a seamless cloud infrastructure and harmonised services to all ECMWF users, allowing them to effectively use the computing services and data available, including Copernicus data and services. To this effect, ECMWF procured a “Multi-Purpose Cloud Infrastructure incorporating the European Weather Cloud” (EWC). This infrastructure was recently extended to incorporate the Copernicus Data Stores (CDS/ADS) Services, converging into an ECMWF Common Cloud Infrastructure (CCI). The CCI cloud is hosted at ECMWF’s data centre in Bologna.
Cloud hosting has provided the Data Store (DS) Services, since their launch, with the capacity for sustained growth in terms of stored data and computing capabilities, which is envisaged to continue at a steady rate in the future, fostered by their integration within the CCI. The DS Services are split into two main layers with different functions and Cloud requirements: Data Repositories and Software Services.
Data Repositories are the foundational base of the DSs. They are distributed and diverse in size, format and scope, ranging from a CDS-MARS Archive to other repositories such as a modernised Observations Repository and a series of smaller on-disk datasets. Serving data from cloud disks allows high efficiency and performance, serving a daily average of over 160 TB in the form of more than 500k requests coming from over 3k active users. The Service is also implementing an ARCO (Analysis Ready, Cloud Optimized) Data Lake to improve visualisation and interactivity of C3S and CAMS data on WEkEO and, in addition, to address the needs of demanding ML/AI solutions and visual-interactive applications.
The Running Services layer comprises the software components supporting the range of functionalities of the Data Stores Service. These components are optimized to deploy and run in Cloud environments. They offer users processing and visualisation capabilities in addition to data access functions. The Climate Data Store (CDS) for C3S, the Atmosphere Data Store (ADS) for CAMS and the recently launched Early Warning Data Store (EWDS) for CEMS are the well-known public-facing interfaces of this layer. The operational management of these Services relies heavily on the automatic deployment and configuration of multiple instances, facilitated by the elastic resources offered by the CCI. As an extension of the Services, a beta version of a JupyterHub is in the pipeline. This will allow users to launch temporary sessions and allocate computing and storage resources to perform computation and visualisation on top of the data, using a set of preconfigured expert tools mostly offered by earthkit.
The aim of this presentation is to introduce the scope and plans of the Modernised Data Stores Service and to describe how this fits into the streamlined services hosted on the ECMWF Common Cloud Infrastructure (CCI), to foster the future evolution of the Services and the synergies with other platforms such as WEkEO or the Copernicus Data Space Ecosystem.
"As we advance into the age of precision science and engineering, digital twins are becoming essential tools for research and development. This session will explore the role of scientific digital twins across various domains, highlighting their ability to replicate and predict complex systems with remarkable accuracy. Attendees will gain insights into how these virtual models mirror physical counterparts, enabling experiments, performance optimisation, and future behaviour predictions without physical constraints.
We will also showcase key projects funded by the European Commission, particularly the EGI-coordinated InterTwin project. Experts will discuss the development, application, and impact of these initiatives, demonstrating how digital twins are driving innovation and offering solutions to complex scientific and engineering challenges. Join us to discover how digital twins are enhancing our understanding and paving the way for groundbreaking advancements."
The Horizon Europe interTwin project is developing a highly generic yet powerful Digital Twin Engine (DTE) to support interdisciplinary Digital Twins (DTs). Comprising thirty-one high-profile scientific partner institutions, the project brings together infrastructure providers, technology providers, and DT use cases from Climate Research and Environmental Monitoring, High Energy and AstroParticle Physics, and Radio Astronomy. This group of experts enables the co-design of the DTE Blueprint Architecture and the prototype platform, benefiting end users such as scientists and policymakers as well as DT developers. It achieves this by significantly simplifying the process of creating and managing complex Digital Twin workflows.
As part of our contribution, we'll share the latest updates on our project, including the DTE Blueprint Architecture, whose latest version will be finalised in Q4/2024. Thanks to the collaboration with the ECMWF partner in the project, the interTwin components are designed to be aligned with what Destination Earth is designing and building. Therefore, we will show the activities carried out by the project to analyse the DestinE architecture and the planned points of interoperability.
The contribution will also cover the status of the DT use cases we currently support and describe the software releases of the DTE.
In this session we will discuss and report on progress regarding how Earth System digital twins and digital twins that are part of the wider Green Deal initiative could operate together in a digital twin platform.
For this purpose, we explain in detail the processes, technical implementation, and ontology alignment that need to be put in place in order to allow for interoperability of digital twin systems stemming from different communities and initiatives. We do not intend to provide a generic interoperability framework but work from the assumption that the most value can be derived from providing specific solutions, driven by use cases, that are generic by design but not designed to be generic.
We are not aiming for integration through aggregation but for integration through federation, where each system focuses on the integration functions or services that allow interoperability between digital twin system components when required.
The level of integration required between digital twins can span a wide range of functions and services, from full integration at the physics level (tightly coupled digital twins) to integration through DT outputs (loosely coupled). In order to capture these requirements, we defined a so-called integration continuum where we can map integration requirements between digital twins and digital twin systems.
From these exercises we developed a shared high-level architectural view and a common glossary that can describe the implementation for each participating project.
The DT-GEO project (2022-2025), funded under the Horizon Europe topic call INFRA-2021-TECH-01-01, is implementing an interdisciplinary digital twin for modelling and simulating geophysical extremes at the service of research infrastructures and related communities. The digital twin consists of interrelated Digital Twin Components (DTCs) dealing with geohazards from earthquakes to volcanoes to tsunamis and that harness world-class computational (FENIX, EuroHPC) and data (EPOS) Research Infrastructures, operational monitoring networks, and leading-edge research and academic partnerships in various fields of geophysics. The project is merging and assembling the latest developments from other European projects and EuroHPC Centers of Excellence to deploy 12 DTCs, intended as self-contained containerised entities embedding flagship simulation codes, artificial intelligence layers, large volumes of (real-time) data streams from and into data-lakes, data assimilation methodologies, and overarching workflows for deployment and execution of single or coupled DTCs in centralised HPC and virtual cloud computing Research Infrastructures (RIs). Each DTC addresses specific scientific questions and circumvents technical challenges related to hazard assessment, early warning, forecasts, urgent computing, or geo-resource prospection. This presentation summarises the results from the first two years of the project, including the digital twin architecture and the (meta)data structures enabling (semi-)automatic discovery, contextualisation, and orchestration of software (services) and data assets. This is a preliminary step before verifying the DTCs at 13 Site Demonstrators and starts a long-term community effort towards a twin on Geophysical Extremes integrated in the Destination Earth (DestinE) initiative.
Digital Twins provide a virtual representation of a physical asset enabled through data and models. They can be used for multiple applications such as real-time forecast of system dynamics, system monitoring and controlling, and support to decision making. Recent tools take advantage of the huge online volume of data streams provided by satellites, IoT sensing and many real-time surveillance platforms, and the availability of powerful computational resources that make process-solving, high-resolution models and AI-based models possible, to build high accuracy replicas of the real world.
The Tagus estuary is the largest estuarine region in the Iberian Peninsula and holds a multitude of services of huge economic, environmental and social value. The management of this large system is quite complex and there are often conflicting uses that require high resolution, complex tools to understand and predict its dynamics and support any interventions. Simultaneously, the Tagus basin raises concerns related to inundation and erosion (Fortunato et al., 2021) and water quality (Rodrigues et al., 2020). A variety of models have been applied here to address multiple concerns from physical to water quality and ecology. At the same time, the Tagus holds several observatories supported by data (e.g. CoastNet, http://geoportal.coastnet.pt/) and integrated model and data (UBEST, http://ubest.lnec.pt/). In spite of all these efforts, no integrated infrastructure, from river to ocean, accounting for the city of Lisbon and other important cities‘ drainage, was available to support management and research alike, allowing for users to interact with data and models to build customized knowledge.
The CONNECT project, funded through the CMEMS coastal downscaling programme, developed a multi-purpose collaboratory that combines digital twin technology, a smart coastal observatory tool (Rodrigues et al., 2021) and a monitoring infrastructure – CoastNet – to address both inundation and water quality concerns. The work takes advantage of the on-demand, relocatable coastal forecast framework OPENCoastS (Oliveira et al., 2021) to build a user-centered, multi-purpose DT platform that provides tailored services customized to meet the users’ needs. A combination of process-based modeling in the estuary, using the SCHISM suite, and AI modeling for the river inflow, using the AI4Rivers model builder, supports the automatic creation of both 2D and 3D predictions daily. Model performance is automatically shared with the users, both through online comparison with the in-situ and remote sensing data from CoastNet and CMEMS, and through the calculation of indicators at several time scales.
Fortunato, A.B., Freire, P., Mengual, B., Bertin, X., Pinto, C., Martins, K., Guérin, T., Azevedo, A., 2021. Sediment dynamics and morphological evolution in the Tagus Estuary inlet. Marine Geology 440, 106590.
Oliveira et al., 2021. Forecasting contrasting coastal and estuarine hydrodynamics with OPENCoastS. Environmental Modelling & Software 143, 105132.
Rodrigues, M., Cravo, A., Freire, P., Rosa, A., Santos, D., 2020. Temporal assessment of the water quality along an urban estuary (Tagus estuary, Portugal). Marine Chemistry 223, 103824.
Rodrigues, M., Martins, R., Rogeiro, J., Fortunato, A.B., Oliveira, A., Cravo, A., Jacob, J., Rosa, A., Azevedo, A., Freire, P., 2021. A Web-Based Observatory for Biogeochemical Assessment in Coastal Regions. J ENVIRON INFORM.
Frontier and Summit, two of the largest supercomputers in the world, are hosted at the Oak Ridge Leadership Computing Facility (OLCF) and managed on behalf of the US Department of Energy (USDOE). They are also counted among the “leadership class” systems in the world, offering capability computing that accommodates modeling and simulation as well as data analytics and artificial intelligence applications at scale, not readily available at most capacity computing centers. The portfolio of recent computing projects at OLCF includes kilometer-scale earth system modeling, using the DOE Energy Exascale Earth System Model (E3SM) and the ECMWF Integrated Forecasting System (IFS), and the development of AI foundation models for climate and environmental applications. The presentation will summarize recent advances and highlights from computational earth and environmental sciences projects at OLCF, including: [a] global 3.5 km simulations using the DOE Simple Cloud Resolving E3SM Atmosphere Model (SCREAM); [b] the Oak Ridge Base Foundation Model for Earth System Predictability (ORBIT), a 113 billion parameter vision transformer model trained on CMIP6 simulations; and [c] two geoAI foundation models trained on large volumes of earth observation data from satellites.
The ever-growing volume of environmental data presents both exciting possibilities and significant challenges. This session delves into the critical role of advanced data analysis tools, robust infrastructure solutions, and collaborative practices based on FAIR data principles in unlocking its full potential for tackling global environmental challenges.
Through a series of engaging presentations, we'll explore advanced data analysis tools, scalable data infrastructure, best practices in data interoperability based on the FAIR principles, and the value of data publication and usage metrics to enhance research impact and ensure responsible data sharing practices, with practical case studies highlighting the power of advanced tools and collaborative data access in tackling critical environmental challenges.
The escalating volume and complexity of Earth and environmental data necessitate an effective, interdisciplinary partnership among scientists and data providers. Achieving this requires the utilization of research infrastructures that offer sophisticated e-services. These services enhance data integration and interoperability, enable seamless machine-to-machine data exchanges, and leverage High-Performance Computing (HPC) along with cloud capabilities.
In this presentation, we will demonstrate a case study focused on the import, analysis, and visualization of geodata within the ENES Data Space (https://enesdataspace.vm.fedcloud.eu), a cloud-enabled data science environment designed for climate data analysis and built on top of the European Open Science Cloud (EOSC) Compute Platform. By logging in with either an institutional or social media account, users gain access to the ENES Data Space. Here, they can launch JupyterLab, accessing a personal workspace equipped with computational resources, analytical tools, and pre-prepared climate datasets. These datasets, which include historical records and future projections, are primarily sourced from the CMIP (Coupled Model Intercomparison Project).
Our case study will utilize global precipitation data derived from the Centro Euro-Mediterraneo sui Cambiamenti Climatici (CMCC) experiments, analyzed within the ENES workspace through two distinct approaches:
1. Direct MATLAB Online Integration: Users can launch MATLAB Online directly from the ENES Data Space JupyterLab. Utilizing a Live Script (.mlx), the process involves importing, filtering, and manipulating data, creating visual maps, comparing results, and conducting hypothesis testing to ascertain the statistical significance of the project findings. Live Scripts serve as interactive notebooks that facilitate the clear articulation of research methodologies and goals by integrating data, hyperlinks, figures, text, and code. These scripts also incorporate UI tools for intuitive, point-and-click data analysis and visualization, eliminating the need for extensive programming expertise.
2. MATLAB Kernel within Jupyter Notebook: This method demonstrates the analysis process using a MATLAB kernel executed from a Jupyter notebook (.ipynb) within the same JupyterLab environment.
In both scenarios, the results can be exported in multiple formats (e.g., PDF, markdown, LaTeX, etc.), allowing for easy downloading and sharing with other researchers, educators, and students. This entire workflow is seamlessly executed in MATLAB within the ENES Data Space, without the need for software installation or data downloads on local (non-cloud) devices. This case study exemplifies the power of cloud-based platforms in enhancing the accessibility, efficiency, and collaborative potential of climate data analysis.
The Global Fish Tracking System (GFTS) is a use case from the European Space Agency's DestinE Platform. It leverages the Pangeo software stack to enhance our understanding of fish habitats, in particular those of sea bass and pollack. By addressing a data gap highlighted by the International Council for the Exploration of the Sea (ICES), the project combines various data sources, including data from the DestinE Climate Adaptation Digital Twin, data from Copernicus marine services, and biologging data from sea bass tracking.
The 'Pangeo-fish' software, a key part of GFTS, improves data access and usage efficiency. Initially developed for HPC, it was ported to cloud infrastructure thanks to the versatility of the Pangeo ecosystem. This system's model and approach can be adapted for wider marine ecosystem conservation efforts across different scales, species and regions.
The GFTS system was also tested on Pangeo@EOSC. This Pangeo platform, deployed in collaboration with the EGI-ACE and C-SCALE projects, offers Pangeo notebooks with a Dask gateway for comprehensive data analysis at scale. An equivalent system was implemented on the OVH cloud, to prepare for future porting on the DestinE Platform.
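For readers unfamiliar with the Pangeo setup described above, the sketch below shows the typical pattern of scaling an analysis through a Dask Gateway and opening a cloud-hosted dataset with xarray; the dataset URL, variable name and cluster sizing are placeholders rather than the actual GFTS configuration.

```python
# Typical Pangeo-style pattern: request Dask workers from a Dask Gateway and open a
# cloud-hosted Zarr dataset with xarray. The dataset URL, variable name and worker
# counts are placeholders, not the actual GFTS configuration.
import xarray as xr
from dask_gateway import Gateway

gateway = Gateway()                   # uses the gateway configured on the platform
cluster = gateway.new_cluster()
cluster.adapt(minimum=2, maximum=10)  # let the cluster scale with the workload
client = cluster.get_client()
print(client.dashboard_link)

# Lazily open a (placeholder) Zarr store of ocean temperature fields.
ds = xr.open_zarr("https://example.org/gfts/reference-fields.zarr")

# Compute a simple diagnostic in parallel on the Dask workers
# ("thetao" is a placeholder variable name).
mean_temperature = ds["thetao"].mean(dim="time").compute()
print(mean_temperature)

cluster.close()
```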
Reflecting its original Pangeo ecosystem, GFTS follows open science guidelines. It includes a Decision Support Tool (DST), which enables users to understand complex results and make informed decisions. Accessibility, usability, and data sharing compliance with FAIR principles are prioritised.
In conclusion, GFTS represents a blend of careful management of data and computational resources, a strong commitment to improving the conservation of the ocean and fish habitats, and the efficient use of advanced technology for data analysis and modelling. The presentation will delve into the project's achievements and challenges, providing valuable insights into the practical benefits of incorporating Open Science practices for marine ecosystem preservation.
The increase in the volume of Earth Observation (EO) data in the past decade has led to the emergence of cloud-based services in recent years. Copernicus data and services have provided a wealth of EO and Earth modelling data to European citizens. Data acquired from the Sentinel satellites is made available to end users through the Copernicus Data Space Ecosystem, providing free access to a wide range of data and services from the Copernicus Sentinel missions and other land, ocean, and atmosphere EO data. Moreover, there are six Copernicus services providing data for the atmosphere, marine, land, climate change, security, and emergency related services. As these services, which are not directly integrated, require different data access methods, the Copernicus Data and Information Access Services (DIAS) provide centralised access to Copernicus data and information, in addition to cloud infrastructure and processing tools. The Copernicus Data Access Service (C-DAS) builds on the existing DIAS distribution services, ensuring their continuity and bringing significant improvements such as advanced search functions, virtualisations, and APIs.
Destination Earth (DestinE) develops a high-precision digital model of the Earth (a digital twin) to monitor and simulate natural and human activity, with the first two digital twins focusing on weather-induced and geophysical extremes, and on climate change adaptation. DestinE will deliver enormous volumes of new Earth modelling data and access to Copernicus data. Finally, there are several existing European Data Spaces providing data from various domains (agriculture, food security, health, energy, natural resources, environmental monitoring, insurance, tourism, security). This data opens new opportunities for the creation of beyond state-of-the-art solutions which can provide new products and services to the public.
Despite the significant volume and plethora of EO and Earth modelling data offered, access to this data has not yet been extended beyond experts and scientists to the wider industry to deliver tangible applications that improve our health and lives and protect the planet. Unfortunately, only a small part of the market has that kind of expertise and, as a result, high-value EO information remains unexploited: it is often fragmented, complex, diverse, and difficult to find, retrieve, download and process, while users must have some kind of domain expertise to find and access data, understand how to pre-process it, find storage solutions, and transform it into useful formats for analytics and Geographic Information Systems (GIS).
The EO4EU project is providing an integrated and scalable platform to make the above-mentioned EO data easily findable and accessible, relying on machine learning and advanced user interfaces supported by a highly automated multi-cloud computing platform and a pre-exascale high-performance computing infrastructure. EO4EU introduces an ecosystem for the holistic management of EO data, improving its FAIRness by delivering dynamic data mapping and labelling based on AI, while bridging the gap between domain experts and end users, and bringing to the foreground technological advances that address market constraints limiting a wider usage of EO data.
In this session, the key innovative features of the EO4EU Platform will be presented, and architectural insights will be provided.
In the climate domain, the Coupled Model Intercomparison Project (CMIP) represents a collaborative framework designed to improve knowledge of climate change, with the important goal of collecting output from global coupled models and making it publicly available in a standardized format. CMIP has led to the development of the Earth System Grid Federation (ESGF), one of the largest-ever collaborative data efforts in earth system science, involving a large set of data providers and modelling centres around the globe.
ESGF manages a huge distributed and decentralized database for accessing multiple petabytes of science data at dozens of federated sites. In this context, providing an in-depth understanding about the data published and exploited across the federation is of paramount importance in order to get useful insights on the long tail of research.
To this end, the ESGF infrastructure includes a specific software component, named ESGF Data Statistics, deployed at the CMCC SuperComputing Center. More specifically, the service takes care of collecting, storing, and analyzing data usage logs (after filtering out sensitive information) sent by the ESGF data nodes on a daily basis. A set of relevant usage metrics and data archive information is then visualized on an analytics user interface that includes a rich set of charts, maps and reports, allowing users and system managers to see the status of the infrastructure through smart and attractive web gadgets.
Further insights relevant to research infrastructure managers could come from applying a data-driven approach to the download information, in order to identify changes in download patterns and predict possible issues at the infrastructure level.
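As a rough illustration of the kind of analysis such a service performs, the sketch below aggregates anonymised download logs into daily usage metrics with pandas; the file name, column names and the simple anomaly heuristic are assumptions for illustration, not the actual ESGF Data Statistics schema.

```python
import pandas as pd

# Illustrative only: columns are assumed, one row per download event.
logs = pd.read_csv("esgf_download_logs.csv", parse_dates=["timestamp"])

daily = (logs
         .assign(day=logs["timestamp"].dt.date)
         .groupby(["day", "project", "data_node"])
         .agg(downloads=("dataset_id", "count"),
              volume_gb=("size_bytes", lambda s: s.sum() / 1e9),
              unique_users=("client_id", "nunique"))
         .reset_index())

# A naive anomaly flag: days whose download volume deviates strongly from a
# node's rolling mean could hint at issues at the infrastructure level.
daily["volume_z"] = (daily.groupby("data_node")["volume_gb"]
                          .transform(lambda s: (s - s.rolling(30, min_periods=5).mean())
                                               / s.rolling(30, min_periods=5).std()))
print(daily.head())
```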
Developing digital twins of environmental systems requires accessing heterogeneous data sources, connecting them to varied, often interconnected, statistical, computational or AI models, running on distributed computing resources.
To tackle this complexity, digital twin developers need to reuse resources like data, models, workflows, and services from different sources. Collaborative Virtual Research Environments (VREs) can facilitate this process with tools for the discovery, access, interoperation and reuse of research assets, and for the integration of all resources into cohesive observational, experimental, and simulation investigations with replicable workflows. However, while effective for specific scientific communities, existing VREs often lack adaptability and require substantial time investment for incorporating external resources or custom tools. In contrast, many researchers and data scientists prefer notebook environments like Jupyter for their flexibility and familiarity.
To bridge this gap, we propose Notebook-as-a-VRE (NaaVRE), a VRE solution built on Jupyter notebooks.
The NaaVRE empowers users to construct functional blocks by containerizing cells within notebooks, organizing them into workflows, and overseeing the entire experiment cycle along with its generated data. These functional blocks, workflows, and data can then be shared within a common marketplace, fostering user communities. Additionally, NaaVRE can integrate with external repositories, enabling users to access assets such as data, software, and algorithms. Lastly, NaaVRE is designed to seamlessly operate within cloud infrastructures, offering users the flexibility and cost efficiency of utilizing computational resources as needed.
We showcase the versatility of NaaVRE by building several customized VREs that support scientific workflows and prototype digital twins across different communities. These include tasks such as extracting ecosystem structures from Light Detection and Ranging (LiDAR) data, monitoring bird migrations via radar observations, analyzing phytoplankton species, and digital twins of ecosystems as part of the Dutch NWO LTER-LIFE project.
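As a rough sketch of the cell-to-container idea behind NaaVRE described above (not its actual API), the snippet below extracts one code cell from a notebook with nbformat and wraps it in a minimal container recipe that a workflow engine could run as a single step; the notebook name, base image and requirements file are placeholders.

```python
import os
import nbformat

os.makedirs("block", exist_ok=True)

# Read a notebook and take one code cell as the body of a "functional block".
nb = nbformat.read("experiment.ipynb", as_version=4)
cell = next(c for c in nb.cells if c.cell_type == "code")

# Write the cell out as a standalone script...
with open("block/cell.py", "w") as f:
    f.write(cell.source)

# ...and wrap it in a minimal container recipe so a workflow engine
# (e.g. Argo or Airflow) could run it as one step.
dockerfile = """\
FROM python:3.11-slim
COPY requirements.txt /block/requirements.txt
RUN pip install --no-cache-dir -r /block/requirements.txt
COPY cell.py /block/cell.py
CMD ["python", "/block/cell.py"]
"""
with open("block/Dockerfile", "w") as f:
    f.write(dockerfile)
```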
"High-performance computing (HPC) offers incredible power for tackling complex research challenges, but it's often locked away in isolated systems. What if you could seamlessly connect HPC resources to the broader ecosystem? This session explores the exciting world of HPC interoperability. We'll delve into ongoing efforts to integrate HPC systems with the larger ecosystems, including Authentication and Authorisation Infrastructure (AAI), ensuring smooth user access across HPC and other resources, and Data Movement, enabling easy data transfer between HPC and the broader research environment.
By overcoming these challenges, we'll enrich the EGI offering and make HPC more accessible than ever before.
Join us to discover:
- Latest advancements in delivering HPC resources with EGI integration
- How the EGI cloud continuum unlocks the full potential of HPC
- Opportunities for collaboration in building a unified research computing ecosystem"
In recent years, in particular with the rise of AI, the diversity of workloads that need to be supported by research infrastructures has exploded. Many of these workloads take advantage of new technologies, such as Kubernetes, that need to be run alongside the traditional workhorse of the large batch cluster. Some require access to specialist hardware, such as GPUs or network accelerators. Others, such as Trusted Research Environments, have to be executed in a secure sandbox.
Here, we show how a flexible and dynamic research computing cloud infrastructure can be achieved, without sacrificing performance, using OpenStack. By having OpenStack manage the hardware, we get access to APIs for reconfiguring that hardware, allowing the deployment of platforms to be automated with full control over the levels of isolation. Optimisations like CPU-pinning, PCI passthrough and SR-IOV allow us to take advantage of the efficiency gains from virtualisation without sacrificing performance where it matters.
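A minimal sketch of how a pinned, accelerator-equipped instance might be requested with the openstacksdk Python client; the cloud, flavor, image and network names are assumptions, and the extra specs listed in the comments are standard Nova flavor properties that an operator would set on the flavor beforehand.

```python
import openstack

conn = openstack.connect(cloud="egi-site")  # hypothetical clouds.yaml entry

# The performance-critical settings live in Nova flavor extra specs, e.g.:
#   hw:cpu_policy=dedicated      -> CPU pinning
#   hw:mem_page_size=large       -> hugepage-backed guest memory
#   pci_passthrough:alias=gpu:1  -> pass through a device defined in nova.conf
# Here we assume an operator has already created a flavor "hpc.gpu.pinned"
# carrying those properties.
flavor = conn.compute.find_flavor("hpc.gpu.pinned")
image = conn.compute.find_image("ubuntu-22.04")
network = conn.network.find_network("research-net")

server = conn.compute.create_server(
    name="hpc-ai-node-01",
    flavor_id=flavor.id,
    image_id=image.id,
    networks=[{"uuid": network.id}],
)
server = conn.compute.wait_for_server(server)
print(server.status)
```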
The HPC+AI Cloud becomes even more powerful when combined with Azimuth, an open-source self-service portal for HPC and AI workloads. Using the Azimuth interface, users can self-service from a curated set of optimised platforms, from web desktops through to Kubernetes apps such as Jupyter notebooks. Those applications are accessed securely, with SSO, via the open-source Zenith application proxy. Self-service platforms provisioned via Azimuth can co-exist with large bare-metal batch clusters on the same OpenStack cloud, allowing users to pick the environments and tools that best suit their workflow.
Introduction
The digital twin concept is gaining traction in research, demanding substantial computational power for simulations. Sano Centre for Computational Medicine, in collaboration with ACC Cyfronet AGH, is actively developing tools to optimize high performance computing (HPC) resources. Our focus is on providing scientists with a user-friendly toolkit for seamless model execution. This paper introduces the integration of the Model Execution Environment platform with data repositories, streamlining data management for researchers.
Description of the problem
Harnessing HPC resources necessitates specific expertise and extensive data management, posing challenges for researchers. Additionally, sharing processed data and research results among teams demands adherence to fair involvement rules, involving external services and consuming valuable time. Our aim is to alleviate these challenges by providing a comprehensive platform for efficient data management.
Related work
While Pegasus and others operate on various infrastructures, Model Execution Environment (MEE) focuses on an established execution framework. Our unique approach prioritizes seamless data staging across diverse repositories, such as Dataverse and Zenodo, enhancing flexibility, streamlining execution and fostering effortless collaboration.
Solution of the problem
Our platform integrates with Dataverse and Zenodo APIs, enhancing efficiency and collaboration by eliminating intermediaries. Customizable repository rules ensure fair data sharing, safeguarding confidentiality.
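As an illustration of the kind of integration described above (not the MEE code itself), the sketch below deposits a result file to Zenodo through its public REST API using Python requests; the file name, metadata and access token are placeholders, and endpoint details should be checked against the current Zenodo API documentation.

```python
import requests

ZENODO = "https://zenodo.org/api"
TOKEN = "..."  # personal access token (placeholder)

# 1. Create an empty deposition.
r = requests.post(f"{ZENODO}/deposit/depositions",
                  params={"access_token": TOKEN}, json={})
r.raise_for_status()
deposition = r.json()

# 2. Upload a result file to the deposition's file bucket.
bucket = deposition["links"]["bucket"]
with open("simulation_results.csv", "rb") as fp:
    requests.put(f"{bucket}/simulation_results.csv",
                 data=fp, params={"access_token": TOKEN}).raise_for_status()

# 3. Attach minimal metadata (publishing would be a separate call).
metadata = {"metadata": {
    "title": "MEE simulation results",
    "upload_type": "dataset",
    "description": "Results staged from the Model Execution Environment.",
    "creators": [{"name": "Doe, Jane"}],
}}
requests.put(f"{ZENODO}/deposit/depositions/{deposition['id']}",
             params={"access_token": TOKEN}, json=metadata).raise_for_status()
```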
Conclusions
Our research has resulted in a sophisticated toolkit that improves the efficiency of medical research. Future plans include broader integration and simplified data retrieval via Digital Object Identifiers, further enhancing the accessibility of our toolkit.
Acknowledgements. This publication is (partly) supported by the European Union’s Horizon 2020 research and innovation programme under grant agreement ISW No 101016503, supported by the European Union’s Horizon 2020 research and innovation programme under grant agreement Sano No 857533 and was created within the project of the Minister of Science and Higher Education "Support for the activity of Centers of Excellence established in Poland under Horizon 2020" on the basis of the contract number MEiN/2023/DIR/3796. We gratefully acknowledge Polish high-performance computing infrastructure PLGrid (HPC Centers: ACK Cyfronet AGH) for providing computer facilities and support within computational grant no. PLG/2023/016227.
The integration of High-Performance Computing (HPC), High-Throughput Computing (HTC), and Cloud computing is key to enabling the convergent use of hybrid infrastructures.
We envision a model where multi-stage workflows can move back and forth across multiple resource providers by offloading containerized payloads.
From a technical perspective, the project aims to use Kubernetes API primitives to enable transparent access to any number of external machines and types of backend.
We created the interLink project, an open-source extension of the Virtual-Kubelet concept, with the primary goal of making HPC centres exploitable through native Kubernetes APIs with close to zero effort from all stakeholders.
interLink is developed by INFN in the context of interTwin, an EU-funded project that aims to build a digital twin platform (the Digital Twin Engine) for the sciences, and of the ICSC National Research Centre for High Performance Computing, Big Data and Quantum Computing in Italy. In this talk we will walk through the key features and the early use cases of a Kubernetes-based computing platform capable of extending its computational capabilities over heterogeneous providers: among others, the integration of world-class supercomputers such as EuroHPC Vega and Juelich will be showcased.
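A minimal sketch, using the official Kubernetes Python client, of how a containerized payload might be steered onto a Virtual-Kubelet-style node such as the one interLink provides; the container image, node selector and toleration key are assumptions that depend on the actual deployment.

```python
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# Node name and toleration key are illustrative; real values depend on how
# the virtual node fronting the HPC site is labelled and tainted.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="offloaded-job", namespace="default"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[client.V1Container(
            name="payload",
            image="ghcr.io/example/simulation:latest",  # placeholder image
            command=["python", "run.py"],
        )],
        # Schedule the pod onto the virtual node that fronts the remote site.
        node_selector={"kubernetes.io/hostname": "interlink-vk-node"},
        tolerations=[client.V1Toleration(
            key="virtual-node.interlink/no-schedule", operator="Exists")],
    ),
)
core.create_namespaced_pod(namespace="default", body=pod)
```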
The management of sensitive data affects many different communities, across disciplines and innovation areas. It is not only a question of technical development, but also of developments related to operations, processes and regulations. In this session we discuss sensitive data and its management and processing, taking into account governance points of view, architecture approaches and technical solutions.
Session agenda:
- Pioneering Integrated Data Access and Analysis Across Research Fields: A Dutch Initiative, Ahmad Hesam (SURF)
- A Complete Workflow For Finding, Sharing And Analysing Sensitive Data: The SSH Use Case, Lucas van der Meer (ODISSEI)
- INFN Approach in Handling Sensitive Data, Barbara Martelli (INFN)
- GÉANT Initiatives on Sensitive Data Management, Mario Reale (GÉANT)
- EGI Landscaping Report on Trusted Research Environments - an overview, Ville Tenhunen (EGI Foundation)
This talk introduces a significant Dutch initiative designed to transform the landscape of data access and analysis for all research fields, using the ODISSEI metadata portal as a specific example for the social sciences and humanities (SSH) community. Our integrated workflow begins with the Data Access Broker (DAB), developed by SURF, which standardizes data access requests and data transfers across diverse data providers, addressing the complexities of handling sensitive data with varying access scopes and policies.
Following data acquisition, the workflow advances to the Secure Analysis Environment (SANE), a Trusted Research Environment (TRE) facilitated by SURF. SANE allows researchers to work securely with sensitive data, while the data provider stays in full control of the data and tools within the TRE. Through Federated Identity and Access Management (FIAM) we simplify the collaboration between data providers and researchers. This cloud-based solution simplifies the often intricate process of sensitive data analysis, providing researchers with essential tools and access needed to drive forward their investigations.
In this presentation, we will not only explore the architectural and operational aspects of DAB and SANE but also outline our strategy for expanding these services beyond SSH to encompass all research domains. As a leading Dutch initiative for integrated data systems, our aim is to reveal the potential and current capabilities of this innovative framework, showcasing how the ODISSEI metadata portal serves as a model for other research communities.
Attendees will gain insight into the full workflow, from data discovery through to detailed analysis, and understand how this initiative is shaping a new frontier in research capabilities across the Netherlands.
A large portion of datasets in the Social Science and Humanities (SSH) community is sensitive, for instance for privacy or copyright reasons. The Dutch national infrastructures for the social sciences and humanities, ODISSEI and CLARIAH, collaborate with the Dutch NREN SURF in the development of an integrated workflow to find, request and analyse sensitive data.
In the ODISSEI Portal, researchers can find datasets from a wide variety of data providers, through rich metadata. A service called the Data Access Broker enables researchers to submit a data access request that is processed semi-automatically based on the user’s credentials and the data provider’s access procedure. After approval, the sensitive data set is transferred directly to SANE: an off-the-shelf, data provider-agnostic Trusted Research Environment (TRE). It is a secure analysis environment that leaves the data provider in full control of the sensitive information.
In an interactive session, ODISSEI and SURF will illustrate how they facilitate a complete workflow: from finding, to requesting, and finally analysing sensitive SSH data.
"As we advance into the age of precision science and engineering, digital twins are becoming essential tools for research and development. This session will explore the role of scientific digital twins across various domains, highlighting their ability to replicate and predict complex systems with remarkable accuracy. Attendees will gain insights into how these virtual models mirror physical counterparts, enabling experiments, performance optimisation, and future behaviour predictions without physical constraints.
We will also showcase key projects funded by the European Commission, particularly the EGI-coordinated InterTwin project. Experts will discuss the development, application, and impact of these initiatives, demonstrating how digital twins are driving innovation and offering solutions to complex scientific and engineering challenges. Join us to discover how digital twins are enhancing our understanding and paving the way for groundbreaking advancements."
The term “digital twin” has been used to designate 3D models of physical cultural artefacts to which additional information might be added. If the 3D model consists of a point cloud, as when it is generated via scanning, such information is attached to its points or regions as a sort of Post-it, creating so-called “augmented objects”. When, instead, CAD systems are used to produce the 3D model, the extra data are incorporated in an extension of BIM (Building Information Modelling) called HBIM (Heritage BIM), which adds the necessary heritage-related classes to BIM, an ISO standard used in the construction industry to incorporate information about the materials, services and processes of a building.
In 2023 we proposed a novel ontology for heritage information based on the Heritage Digital Twin (HDT), a holistic approach to heritage information in which the 3D graphical component is just one element. It also allows intangible heritage to be documented, where the visual documentation may consist of video or audio recordings or be absent altogether. This ontology, named HDTO, is a compatible extension of CIDOC-CRM, the standard for heritage documentation, allowing a straightforward incorporation of existing data organized according to it. The HDTO has been used to set up the cloud-based Knowledge Base (KB) created in 4CH, an EU-funded project designing a Competence Centre for the Conservation of Cultural Heritage. Documentation in the 4CH KB includes the relevant information about heritage assets, from visual documentation to the results of scientific analyses, conservation activities and historical documents.
The HDT does not consider the dynamic and interactive aspects connecting a digital twin to reality. The proposed improved model, named Reactive HDT Ontology (RHDTO), includes the documentation of dynamic interactions with the real world. A first example of application concerns the Internet of Cultural Things (IoCT), i.e. the use of IoT in the cultural heritage domain, for example fire sensors based on smoke or heat and other environmental sensors, activating processes and reactions. But the connection with reality may also consist of data directly provided by external digital systems, such as those providing weather forecasts or monitoring landslide hazards. The “reactive” nature of the system consists of three steps: an input/sensor receives data from the real world and processes them; the resulting outcome is fed into a decider, which then transmits orders to an activator. Each of these is documented as a member of a digital-process RHDTO class, and the related process is described in a specific instance of it. Such instances vary according to the nature of the planned reaction and are programmed according to scientific or heuristic knowledge about the relevant phenomenon, which may also be stored in the KB. The system may be connected to and receive inputs from larger models such as the Digital Twin of the Earth, the ECMWF or the CMCC. Finally, the system also allows “what-if” simulations to explore risks and mitigating measures, by defining simulated deciders and activators and providing the simulation results as outcomes.
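As a purely illustrative sketch of the three-step reactive chain described above (not the RHDTO classes themselves), the following Python fragment wires a sensor reading through a decider to an activator; all names and thresholds are invented.

```python
from dataclasses import dataclass

# Illustrative only: class names and thresholds mirror the
# sensor -> decider -> activator chain, not the actual ontology.

@dataclass
class SmokeSensorReading:
    site_id: str
    smoke_ppm: float

def decider(reading: SmokeSensorReading) -> str:
    """Turn a processed sensor input into an order for an activator."""
    if reading.smoke_ppm > 50.0:          # heuristic threshold
        return "trigger_fire_suppression"
    if reading.smoke_ppm > 10.0:
        return "notify_custodian"
    return "no_action"

def activator(order: str, site_id: str) -> None:
    """Carry out (or, in a what-if run, simulate) the decided reaction."""
    print(f"[{site_id}] executing order: {order}")

reading = SmokeSensorReading(site_id="chapel-01", smoke_ppm=63.0)
activator(decider(reading), reading.site_id)
```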
Equitable flood risk management is contingent upon understanding the evolution of floods and their impacts on different groups in society. While rapid, open-source, physics-based flood and impact models offer valuable insights, their complexity often limits accessibility for decision-makers lacking technical expertise. Digital twins for flood risk management can address this issue by automating model pre-processing, execution, and post-processing, enabling end users to evaluate meaningful “what-if” scenarios, such as specific events, future conditions, or protective measures, regardless of their technical expertise. These digital twins employ automated workflows and model builders to configure and execute state-of-the-art flood and impact models across various contexts efficiently. However, orchestrating multiple models across disciplines poses challenges, including standardised data management and reproducibility. Our work focuses on developing a digital twin for flood risk management, building on the FloodAdapt desktop application. FloodAdapt integrates compound flood modeling and detailed impact assessment, providing an accessible platform for defining, simulating, and visualizing flood scenarios and their consequences. Users can explore diverse scenarios, including historical events, future projections, and adaptation strategies like green infrastructure, floodwalls, or elevating buildings. In our presentation, we will highlight the capabilities of the flood risk management digital twin that are under development. We’ll describe how we leveraged Destination Earth and the interTwin Digital Twin Engine in the implementation of FloodAdapt as a digital twin web application, highlighting the benefits this presents to end-users.
One of the main strengths of modern radio astronomy, its ability to collect ever higher-resolution and wider-bandwidth data from more and more antennas, is now also starting to become one of its greatest problems. The advent of cutting-edge radio telescopes, such as MeerKAT, a precursor to the Square Kilometre Array (SKA), has made it impractical to rely on the traditional method of storing the raw data for extended periods and then manually processing it. Furthermore, the high data rates necessitate the use of High-Performance Computing (HPC), yet existing common radio astronomical data reduction tools, like Common Astronomy Software Applications (CASA), are not well-suited for parallel computing. We have addressed these challenges in developing the ML-PPA (Machine Learning-based Pipeline for Pulsar Analysis). It is an automated classification system capable of categorizing pulsar observation data and assigning labels, such as "pulse", "pure noise", or various types of Radio Frequency Interference (RFI), to each time fragment, represented as a 2D time-frequency image or "frame". The analysis is performed by a Convolutional Neural Network (CNN). Given the highly imbalanced distribution of different frame types in real data (e.g. only 0.2% are "pulses"), it is essential to generate artificial data sequences with specific characteristics to effectively train such systems. To achieve this, "digital twins" were developed to replicate the signal path from the source to a pulsar-observing telescope. A corresponding pipeline was created and tested in Python, and then rewritten in C++, making it more suitable for HPC applications. The initial version of the ML-PPA framework has been released and successfully tested. This talk presents a comprehensive overview of the project, its current status and future prospects.
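A minimal PyTorch sketch of the kind of CNN frame classifier described above; the real ML-PPA architecture, input resolution and class list may differ.

```python
import torch
import torch.nn as nn

# Class names are illustrative stand-ins for the labels mentioned above.
CLASSES = ["pulse", "pure_noise", "rfi_broadband", "rfi_narrowband"]

class FrameCNN(nn.Module):
    def __init__(self, n_classes: int = len(CLASSES)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):                   # x: (batch, 1, freq, time)
        return self.classifier(self.features(x).flatten(1))

model = FrameCNN()
frames = torch.randn(8, 1, 128, 128)        # a batch of synthetic 2D frames
logits = model(frames)
print(logits.argmax(dim=1))                 # predicted class index per frame
```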
Climate Extreme Events and their impacts are getting a lot of attention lately, because their occurrence, severity and spatial coverage are increasing and will likely increase further towards the middle and end of the century. Many countries are experiencing significant impacts from these climate extremes. It is becoming more and more important to better assess how the characteristics of climate extremes are changing, according to the needs of users and society.
However, it is not straightforward to correctly assess and quantify uncertainties. It is also a challenge to find and characterize climate extremes in all available and relevant climate simulations. This is mainly due to the very large number of simulations, along with significant data volumes. It is unfortunate to limit the number of climate simulations used in a climate change assessment study, only because of those technical and time constraints, as we should use all available information.
A novel approach and methodology is being developed to detect and characterize changes in climate extreme events using Artificial Intelligence (AI). This is a generic method based on Convolutional Variational Autoencoders (CVAE). This deep learning technique, which uses neural networks, can process large climate datasets much faster than traditional analytical methods and can also exploit efficient hardware architectures such as GPUs. It has the potential to better assess and quantify the uncertainties associated with the various projected IPCC (Intergovernmental Panel on Climate Change) scenarios. This has been integrated into a Digital Twin Engine (DTE) architecture provided by the Core Components and a Data Lake within the interTwin project.
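A compact PyTorch sketch of a convolutional variational autoencoder of the kind described above; layer sizes are illustrative, and the conditioning (e.g. on the emission scenario) that would make it a CVAE is omitted for brevity.

```python
import torch
import torch.nn as nn

class ConvVAE(nn.Module):
    def __init__(self, latent_dim: int = 32):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(1, 16, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Flatten(),
        )
        self.to_mu = nn.Linear(32 * 16 * 16, latent_dim)
        self.to_logvar = nn.Linear(32 * 16 * 16, latent_dim)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, 32 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (32, 16, 16)),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),
        )

    def forward(self, x):                   # x: (batch, 1, 64, 64) climate fields
        h = self.enc(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterisation
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    recon = nn.functional.mse_loss(x_hat, x, reduction="sum")
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    # Fields that reconstruct poorly (high loss) are candidate extremes.
    return recon + kld

x = torch.randn(4, 1, 64, 64)
x_hat, mu, logvar = ConvVAE()(x)
print(vae_loss(x, x_hat, mu, logvar).item())
```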
In this presentation, first results of the method applied on Global Coupled Climate Model datasets will be shown for several greenhouse gas scenarios, over Western Europe. A comparison to analytical methods will also be presented to assess the robustness of the method.
In summary, this DT application will enable end users to perform on-demand what-if scenarios in order to better evaluate the impact of climate change on several real-world applications in specific regions, so that society can better adapt and prepare.
This project (interTwin) has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement N°824084.
Digital Twin technology isn't a single monolithic software solution. It is a complex system that must adapt to varying and potentially unpredictable user needs. This adaptability is crucial in environments where data, models, and objectives are shared across different domains, sectors and organisations, and across expertise groups and roles within those organisations. To ensure that Digital Twin applications and models are future-compatible and can cater to diverse requirements, a holistic approach to design and management is essential. We outline a strategy using a Platform-as-a-Service model for Digital Twinning Infrastructure Components. This approach enables AI-powered virtual sensing to support multiple Digital Twin applications and models, illustrated through a case study where groundwater level measurements are integrated into a digital twin of The Netherlands. We share the insights gained from developing this operational platform service and their implications for future services.
NBIS (National Bioinformatics Infrastructure Sweden) is one of the largest research infrastructures in Sweden. With approximately 120 multidisciplinary experts positioned across Sweden's major universities, NBIS constitutes the SciLifeLab Bioinformatics platform and represents Sweden within ELIXIR, the European infrastructure for biological information.
NBIS's team is composed of specialists in various bioinformatics fields, such as protein bioinformatics, mass spectrometry (MS), next-generation sequencing (NGS), large-scale data management, metagenomics, systems biology, biostatistics, and RNA sequencing. Committed to advancing research, NBIS delivers tailored support for numerous projects, providing sophisticated infrastructure and analytical tools for bioinformatics.
Establishing dynamic partnerships with SciLifeLab platforms, the SciLifeLab Data Centre, the National Academic Infrastructure for Supercomputing in Sweden (NAISS), and an extensive network of stakeholders, NBIS encourages collaboration and cohesion. These partnerships ensure exceptional support for researchers, enhancing bioinformatics research throughout Sweden and internationally.
As the Swedish node for ELIXIR, NBIS plays a crucial role in collaborative European initiatives, contributing to prominent projects like the Federated European Genome-phenome Archive (EGA), Genomic Data Infrastructure (GDI), BigPicture, EUCAIM, and the European Joint Programme on Rare Diseases (EJP-RD). By connecting local expertise with worldwide efforts, NBIS not only bolsters Sweden's computational capacities but also contributes to a concerted effort to manage the complexities of biological data across Europe.
Through the National Recovery and Resilience Program (NRRP), Italy has funded the constitution of an unprecedented national infrastructure targeting digital resources and services for science and industry. Specifically, the National Center on HPC, Big Data and Quantum Computing (“ICSC”) is an initiative funded with €320M to evolve existing public state-of-the-art network, data, and compute services in the Country, establish new facilities and solutions, and drive an ambitious program for fundamental as well as industrial research. In this contribution, the current state of work of ICSC will be given, exploring the instruments and collaborations that ICSC has been enacting to maximize the impact of this initiative at the national and international levels.
The primary objective of the CoastPredict Programme is to provide decision-makers and coastal communities with integrated observing and predicting systems to manage risk in the short term and plan for mitigation and adaptation in the longer-term context of future climate and ocean change. To accomplish the CoastPredict goals, the GlobalCoast initiative has been launched to create globally replicable solutions, standards, and applications that enhance coastal resilience. The advancement of CoastPredict innovation will be facilitated through the creation of an open and freely accessible digital platform known as the GlobalCoast Cloud (GCC). It is a cloud platform to transform the way we can improve and expand monitoring and forecasting of the global coastal ocean. By harnessing extensive data and establishing an infrastructure for cloud-based data and computing, this platform will expedite the dissemination of science-driven tools and information, making sure they are available and practical for the benefit of the public, decision-makers, coastal communities, and the research community. The GlobalCoast Cloud will host products and services for 125 Pilot Sites across the world's coastal ocean areas.
interTwin co-designs and implements the prototype of an interdisciplinary Digital Twin Engine (DTE), an open-source platform that provides generic and tailored software components for modelling and simulation to integrate application-specific Digital Twins (DTs). Its specifications and implementation are based on a co-designed conceptual model, the DTE blueprint architecture, guided by the principles of open standards and interoperability. The ambition is to develop a common approach to the implementation of DTs that is applicable across the whole spectrum of scientific disciplines and beyond, to facilitate development and collaboration. Further information: https://www.intertwin.eu/
Agenda can be found under https://indico.egi.eu/event/6547/
This double session will demonstrate the commitment of the EGI community towards Open Science through a series of presentations and discussions presenting the EGI contribution to the establishment of the European Open Science Cloud.
The content will include an overview of the current status of the EOSC with a focus on the ongoing process of defining the EOSC Federation as a network of EOSC Nodes. The EGI-coordinated EOSC Beyond project, which started in April 2024, has taken on a relevant role in this activity with its set of EOSC Pilot Nodes that will be established during its execution. Presentations of the EOSC Beyond Pilot Nodes are included in the session agenda together with a status update on the establishment of the first node of the EOSC Federation, the EOSC EU Node, that highlights the EGI contribution to the EOSC procurement activities.
This talk will provide an in-depth look at the initiative's rationale, outline the various phases of its development, and offer a clear picture of what to expect moving forward. Attendees will gain a thorough understanding of the EOSC EU Node's objectives, milestones, and the impact it aims to achieve within the European Open Science Cloud framework. Don't miss this opportunity to engage with key aspects of this pivotal project and its future directions.
The talk will dive into the contribution of EGI to the building of the EOSC EU Node, the first in a new federated EOSC node ecosystem. EGI participates in developing the core services constituting the EOSC EU Node and brings the expertise in service planning, delivery, operation, and control founded on FitSM. Moreover, EGI is responsible for a comprehensive verification and validation plan for the entire EOSC EU node, ensuring the best possible experience for the users regarding the availability and stability of features, performance, and security. Last but not least, EGI oversees the security coordination of the EOSC EU Node.
The ambition of EOSC Beyond is to support the growth of the European Open Science Cloud (EOSC) in terms of integrated providers and active users by providing new EOSC Core technical solutions that allow developers of scientific application environments to easily compose a diverse portfolio of EOSC Resources, offering them as integrated capabilities to researchers.
EOSC Beyond introduces a novel concept of EOSC, establishing a federated network of pilot Nodes operating at various levels (national, regional, international, and thematic) to cater to specific scientific missions. Key objectives include accelerating the development of scientific applications, enabling Open Science through dynamic resource deployment, fostering innovation with testing environments, and aligning EOSC architecture with European data spaces.
The project advances EOSC Core through co-design methodologies, collaborating with diverse use cases from national and regional initiatives (e-Infra CZ, NFDI, NI4OS), and thematic research infrastructures (CESSDA, CNB-CSIC, Instruct-ERIC, ENES, LifeWatch, METROFood-RI).
At the heart of the EOSC Beyond project lies the development of new EOSC Core services to further elevate the platform's capabilities: EOSC Integration Suite, EOSC Execution Framework, EOSC Core Innovation Sandbox. EOSC Beyond is also dedicated to enhancing current EOSC Core services and framework.
Simpl is part of a broader vision under the Common European Data Spaces initiative. The Common European Data Space serves as a technical tool for data pooling and for facilitating data sharing and exchange in a secure manner. Data holders remain in control of who can access and use their data, for which purpose and under which conditions. However, there is currently no unified approach for data spaces. Instead, different sectors and organisations – public and private – are using different technologies and governance models to develop and deploy their own unique data spaces. Against that background, Simpl is the common software behind data spaces. It is the one piece of software that should be used by European common data spaces. It builds on existing knowledge and software, promotes cross-country and cross-border interoperability and data exchange, and saves money and time in the long run. The aim of this talk is to provide an update on the state of play of Simpl after nine months of development; present the EOSC's Simpl-Live feasibility study: results, lessons learnt, and possibilities for Data Spaces interested in adopting Simpl; and comment on what to expect from Simpl in the coming months.
In order to fulfil its “catalysing and leveraging role” in the development of European Research Infrastructures (RIs) and e-Infrastructures, the European Commission (EC) introduced mechanisms in the previous Framework Programme to provide researchers who participated in EC-funded projects with access to European RIs. Access to “depletable” resources, including physical and remote access to facilities, was regulated by Trans-national Access (TNA), while access to “non-depletable” resources (like e.g. data) was done via Virtual Access (VA). TNA was restricted to partners in the consortium, while VA could be extended to users outside as well.
TNA and VA allowed projects to use money from grants to reimburse providers for the costs incurred in the provision of the service, including support-related costs, and covered also any travel costs of researchers accessing the services. This approach helped pool resources across Europe to “properly address the cost and complexity of new world-class RIs” and ensured “wider and more efficient access to and use of” European RIs. By transferring the money directly to the provider, the EC enabled researchers to use facilities around Europe free at the point of use.
While continued for Destination INFRASERV calls of the current Framework Programme, VA and TNA are not included as eligible costs for projects awarded in Destination INFRAEOSC calls, causing several digital services to be discontinued and making access by researchers to others difficult due to the lack of funding sources that would allow them to pay for their use.
However, demand for access to datasets, data processing applications and other data-related services is expected to continue increasing. Processing, analysis and storage of data carry considerable costs when incurred by researchers outside of their own communities, linked to the infrastructure, maintenance, and operating staff. In the context of the Open Science paradigm, the EC and the EU Member States (EU MS) have agreed that enabling the “secondary use” of data is needed, with the aim of providing any potential user with access to all data obtained with public funding. This key ingredient in the “EOSC Federation” put forward by the EC, the EOSC Association and the EU MS is expected to result in a further increase in the costs incurred by RIs, since the additional access to and processing of data by researchers not included in the original user base have in general not been foreseen when planning RIs. Some RIs will therefore face problems in fulfilling the requirements placed on them.
We argue that a mechanism that replaces VA and TNA is needed in the future FP10 for a successful implementation of EOSC as a federation of “EOSC nodes”. The EC and EU MS must agree on a way by which data providers can be reimbursed for the extra costs generated by the “secondary use” of data such that access to data remains essentially free at the point of use for researchers.
In our talk we will evaluate the current situation according to the plans to build the EOSC Federation, and will suggest possible ways forward to be discussed with the audience.
Open Science, an approach that promotes transparency, accessibility, and collaboration in research, is revolutionizing the way knowledge is created and shared worldwide. This session delves into the pivotal role of EGI in enabling and accelerating Open Science practices on a global scale.
Join us as we explore diverse perspectives and experiences from leading experts and practitioners who are at the forefront of this transformation in collaboration with EGI. We will examine how advanced computational resources, including high-performance computing, cloud platforms, data repositories, and collaborative tools, are driving innovation and inclusivity in global scientific research. Discover how EGI and worldwide infrastructures and initiatives are breaking down barriers, fostering collaboration, and empowering researchers to address complex global challenges through Open Science.
The Australian Research Data Commons (ARDC) is establishing three national-scale Thematic Research Data Commons to meet Australia’s future research needs with long-term, enduring digital infrastructure. Each Thematic Research Data Commons integrates the ARDC’s underpinning compute and storage infrastructure and services with data assets, analysis platforms and tools. Each is supported by our expertise, skills-building activities and our work on developing community-agreed standards and best practices. These coordinated, structured, and complementary activities are building data assets, tools and skills that will constitute a national ‘knowledge infrastructure’, enabling Australian researchers to transform our lives and address the complex global challenges we are facing through Open Science. Co-designed with the research community through extensive consultations and broad partnerships, the Thematic Research Data Commons will enable us to achieve our goal of supporting the maximum number of researchers in strategic priority areas of research through a new approach to participation and organisation.
This presentation explores how the data, storage and compute Solutions and Services provided by EGI might be transformed into an EGI Research Commons.
The publication in October 2023 by the RDA Global Open Research Commons Working Group of the Global Open Research Commons International Model (GORC Model) made available a well-researched and fully featured template for a Research Commons. To borrow the definition by Scott Yockel, University Research Computing Officer at Harvard, a research commons "brings together data with cloud computing infrastructure and commonly used software, services and applications for managing, analyzing and sharing data to create an interoperable resource for a research community".
Since the publication of the GORC model, national organizations in Sweden, The Netherlands, Germany, and elsewhere, and ELIXIR, are using the GORC Model to explore establishing Research Commons.
A fully featured proposal to create a Research Commons for Norway (REASON) based on the GORC Model, was submitted to the Norwegian Infrastructure Fund in November, 2023. REASON is being used as a reference by many groups exploring the establishment of a Research Commons.
As Research Commons are coming into prominence, parallel initiatives, called Research Clouds, are also emerging. Examples include the ARDC Nectar Cloud in Australia, the Alliance Cloud in Canada, and the New England Research Cloud in the northeast US. 'Bringing Data to Compute' is a central objective of Research Clouds that is overlooked in most Research Commons. The New England Research Cloud (NERC) is arguably the most interesting Research Cloud, because it also incorporates as a foundational feature an important element of Research Commons, namely the deployment of a series of researcher-facing research data management tools that interoperate through the research data lifecycle and are deployed in conjunction with storage and compute resources.
Three key features of both Research Commons and Research Clouds are: first, they offer researchers access to an integrated series of complementary services that are accessible from a single platform; second, the researcher-facing data management services are integrated with the cloud and compute layer; and third, the researcher-facing data management services facilitate the passage of data and metadata between tools throughout the research lifecycle.
EGI provides most of the storage, cloud and compute services identified in the GORC model, but these do not present as an integrated platform, and it provides only a few, unconnected researcher-facing data management services. How might EGI add these elements to present as a Research Cloud/Commons?
The presentation is divided into the following sections:
The China Science and Technology Cloud (CSTCloud) stands as one of the key national research e-infrastructures in China. Sponsored by the Chinese Academy of Sciences, CSTCloud aims to empower scientists with efficient and integrated cloud solutions across domains and disciplines. Through the integration of big data, cloud computing, and artificial intelligence, CSTCloud delivers robust data and cloud computing services to bolster scientific innovation and socioeconomic development. To break silos across domains and regions, the idea of co-designing and co-developing a Global Open Science Cloud was proposed during the CODATA Beijing 2019 Conference. This report introduces both CSTCloud and the GOSC initiative, focusing on CSTCloud cloud services and recent pilots within the CSTCloud-EGI and CSTCloud-AOSP EA cloud federations. It also highlights regional collaborations with Africa and Southeast Asia under the GOSC umbrella. Additionally, key cloud technologies and applications deployed in the newly established GOSC testbed will be shared, discussing the interoperability issues across diverse research domains and geographical regions. We aim to offer insights into constructing innovative open science infrastructures in the digital age, fostering robust alignment among stakeholders for interconnected and interoperable open science clouds, and bringing open dialogues in connecting research e-infrastructures for future-led Open Science and SDGs.
The European Union (EU) has been working on the establishment of the European Open Science Cloud (EOSC) for several years now to support Open Science with a federated infrastructure that can underpin every stage of the science life-cycle. In response to this global trend, the Korea Institute of Science and Technology Information (KISTI), under the Ministry of Science and ICT in Korea, is currently undertaking the construction of the Korean Research Data Commons (KRDC) to facilitate the integration of various domestic or institutional research data commons systems. Specifically, starting from 2022, KISTI has been designing and developing the KRDC Framework to implement KRDC and concurrently building strategies for collaboration with advanced global systems such as EOSC, with the aim of establishing a global interconnection framework in the future.
To facilitate the aforementioned objectives, EGI, which plays a pivotal role in the EOSC initiative, and KISTI conducted a bilateral project in 2024 to establish a methodology for connectivity between EGI, EOSC and KRDC, thereby ensuring seamless interoperability among these infrastructures. This presentation will report on the outcomes of this project.
Significant investments have been made by the South African government in efforts to support the e-research environments across multiple disciplines in the South African research landscape. This has given birth to the National Integrated Cyberinfrastructure Systems (NICIS) which currently supports communication networks, high performance computing (HPC), data storage and research data management services across the research landscape of South Africa.
The Data Intensive Research Initiative of South Africa (DIRISA) is tasked with dealing with the increasing proliferation of data generated by new technologies and scientific instruments. Large amounts of research data are created daily, which introduces new challenges for DIRISA and requires increased efforts to solve them. This presentation discusses the primary objectives of DIRISA, which are: providing a national research data infrastructure, providing coordination and advocacy, developing human capital and skills, providing research data management services, and providing thought leadership in local and international efforts.
DIRISA is critical for researchers that are engaged in data intensive research and international research collaboration as it is able to bridge the gaps of infrastructure limitations at various public institutions by providing dedicated access to data and high capacity data storage. The comprehensive suite of research data management services offered by DIRISA ensures that South African researchers derive value from their research data. DIRISA offers research data management services that span the entire research data lifecycle such as: single sign-on authentication and authorization mechanisms, tools for crafting data management plans, metadata catalogue and management, digital object identifier (DOI) issuance, and tools for data depositing, data sharing and long-term archival. As underscored by DIRISA's objectives, community training assumes paramount importance, enabling researchers to effectively harness the technologies and tools provided by the initiative. This presentation also deliberates on DIRISA's diverse human capital development and training endeavors that not only cover the researchers but also reach down to high school level students.
An impact assessment of how DIRISA services have contributed to the advancement of research in the country along with the challenges and gaps that currently exist at DIRISA are discussed. This presentation provides a framework that can be used by other African and developing countries towards creating cross-disciplinary data infrastructures through an analysis and evaluation of DIRISA by focusing on infrastructure, research data management services, policies and human capital development. DIRISA aims to provide a platform for supporting researchers through the provision of data infrastructure for South Africa and the lessons from DIRISA can have applicability for the African context. Finally, the future directions for addressing emerging challenges in data management and infrastructure development are discussed to provide a glimpse into how data infrastructure can adapt to the changing research data management landscape.
This session will explore cutting-edge technologies and strategies that enhance data management and access to federated data sources, currently developed in the context of EU projects. The session will also include examples of usage and integration of those platforms by scientific communities.
Document structuring is a fundamental aspect of information management, involving the categorization and organization of documents into logical and physical structures. This presentation explores the benefits and challenges associated with document structuring, focusing on the distinctions between physical and logical structures, metadata and content, as well as addressing the implications for businesses and research centers dealing with large volumes of data encompassed in data warehouses and lakes of textual documents.
In the task of document structuring, distinctions arise between physical and logical structures. Physical structures pertain to the layout and presentation of documents, encompassing elements such as tables, figures, and images. On the other hand, logical structures refer to the organization of content within documents, including metadata that describes document attributes and content that comprises the textual information.
Implementing structured document management systems brings several benefits for business and research bodies. Firstly, structured documents target search queries more effectively, yielding more relevant search results and reducing the volume of irrelevant hits. This not only enhances search efficiency but also saves time and resources, resulting in cost savings and eco-friendly practices. Additionally, structured documents facilitate comparisons between similar structures, enabling deeper analysis and insights. Moreover, the adoption of structured documents enables the extraction of statistics and the creation of dashboards, as it allows for the identification and analysis of document elements beyond mere text.
However, document structuring still faces great challenges. Legacy documents pose a significant hurdle, particularly those with poor scans or generated through low-quality optical character recognition (OCR). These documents may contain noise, artifacts, or degradation, compromising the accuracy of structure recognition algorithms. Furthermore, complex layouts, heterogeneous documents, handwritten content, tables, figures, images, multilingual text, and dynamic content all contribute to the complexity of document structuring. Moreover, the scarcity of labeled data exacerbates the challenge, hindering the development of accurate and robust structuring algorithms.
While Generative Artificial Intelligence (GenAI) has demonstrated remarkable capabilities in various domains, including natural language understanding and image recognition, it still struggles with document structuring due to the complexity of document layouts, ambiguity in content, and limited contextual understanding. It faces challenges in handling diverse document formats, noisy data, and legacy documents. Additionally, GenAI's reliance on labeled data for training limits its generalization ability, hindering its performance on unseen document structures. Overcoming these challenges requires interdisciplinary collaboration and continued research to develop more robust Artificial Intelligence (AI) models capable of effectively managing the complexities of document organization and content extraction.
In conclusion, document structuring offers substantial benefits for businesses and research centers, enabling more efficient information retrieval, automated data extraction, enhanced searchability, standardization, and improved data analysis. However, overcoming these challenges requires innovative solutions and advancements in document structuring technology. By addressing these challenges, organizations can harness the full potential of structured documents to optimize workflows, facilitate knowledge management, and drive innovation.
In many societal and scientific challenges, such as Digital Twins of the Oceans and virtual research environments, fast access to a large number of multidisciplinary data resources is key. However, achieving performance is a major challenge, as the original data is in many cases organised in millions of observation files, which makes it hard to achieve fast responses. Next to this, data from different domains are stored in a large variety of data infrastructures, each with their own data-access mechanisms, which causes researchers to spend much time on trying to access relevant data. In a perfect world, users should be able to retrieve data in a uniform way from different data infrastructures following their selection criteria, including for example spatial or temporal boundaries, parameter types, depth ranges and other filters. Therefore, as part of the EOSC Future and Blue-Cloud 2026 projects, MARIS developed a software system called ‘BEACON’ with a unique indexing system that can, on the fly and with high performance, extract specific data based on the user’s request from millions of observational data files containing multiple parameters in diverse units.
The BEACON system has a core written in Rust (a low-level programming language) and its indexed data can be accessed via a REST API exposed by BEACON itself, meaning clients can query data via a simple JSON request. The system is built in such a way that it returns a single harmonised file as output, regardless of whether the input contains many different data types or dimensions. It also allows the units of the original data to be converted if parameters are measured in different types of units (for this it makes use of, for example, the NERC Vocabulary Server (NVS) and the I-Adopt framework).
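A hypothetical example of such a JSON query issued from Python; the endpoint URL and the request fields shown here are placeholders, since the exact schema is defined by each BEACON instance and its API documentation.

```python
import requests

# Placeholder endpoint and query structure, for illustration only.
BEACON_URL = "https://beacon.example.org/api/query"

query = {
    "query_parameters": [
        {"name": "TEMPERATURE", "unit": "degC"},
        {"name": "PSAL", "unit": "psu"},
    ],
    "filters": [
        {"column": "LATITUDE", "min": 30.0, "max": 46.0},
        {"column": "LONGITUDE", "min": -6.0, "max": 36.0},
        {"column": "TIME", "min": "2010-01-01", "max": "2020-12-31"},
        {"column": "DEPTH", "min": 0, "max": 100},
    ],
    "output": {"format": "netcdf"},
}

response = requests.post(BEACON_URL, json=query, timeout=300)
response.raise_for_status()
with open("subset.nc", "wb") as f:
    f.write(response.content)   # one harmonised file, regardless of inputs
```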
EOSC-FUTURE Marine Data Viewer
To showcase the performance and usability of BEACON, the system has been applied to the SeaDataNet CDI database, Euro-Argo and the ERA5 dataset from the Climate Data Store. These are also connected to a Marine Data Viewer that was developed as part of the EOSC-FUTURE project to co-locate Copernicus Marine satellite-derived data products for Temperature and Salinity with observed in-situ data, made available through BEACON instances for the Euro-Argo and SeaDataNet marine data services.
The user interface of the Marine Data Viewer (https://eosc-future.maris.nl/) is designed to allow (citizen) scientists to interact with the data collections and retrieve parameter values from observation data. Enabled by the performance of BEACON, the user can filter the data on-the-fly using sliders for date, time and depth. At present, the ocean variables concern temperature, oxygen, nutrients and pH measurements, from Euro-Argo and SeaDataNet. The in-situ values are overlayed at the same time and space with product layers from Copernicus Marine, based upon modelling and satellite data.
Presentation
During the presentation, more details will be given about the BEACON software and its performance. Moreover, the latest developments will be presented, including the deployment of BEACON instances for several leading marine and ocean data repositories as part of Blue-Cloud 2026 to provide data lakes to the VRE user community and the DTO.
The Blue-Cloud 2026 HE project aims at a further evolution of the pilot Blue-Cloud open science infrastructure into a federated European ecosystem to deliver FAIR & Open data and analytical services, instrumental for deepening research of oceans, EU seas, and coastal and inland waters. It also strives to become a major data and analytical component for the Digital Twins of the Oceans (DTOs), as well as a blueprint for a thematic EOSC node.
One of the key services is the Blue-Cloud Data Discovery & Access service (DD&AS), which federates key European data management infrastructures to facilitate users in finding and retrieving multi-disciplinary datasets from multiple repositories through a common interface. In Europe, there are several research infrastructures and data management services operating in the marine and ocean domains. These cover a multitude of marine research disciplines and provide access to data sets originating directly from observations, as well as to derived data products. A number of them are ocean observing networks, while others are data aggregation services. Furthermore, there are major EU-driven initiatives, such as EMODnet and Copernicus Marine. Together, these infrastructures constitute a diverse world, with different user interfaces. The Blue-Cloud DD&AS has been initiated to overcome this fragmentation and to provide a common interface for users by means of federation.
The pilot Blue-Cloud Data Discovery & Access service (DD&AS) already federates EMODnet Chemistry, SeaDataNet, Euro-Argo, ICOS-Marine, SOCAT, EcoTaxa, ELIXIR-ENA, and EurOBIS, and provides common discovery and access to more than 10 million marine datasets for physics, chemistry, geology, bathymetry, biology, biodiversity, and genomics. As part of the Blue-Cloud 2026 project, it is being expanded by federating more leading European Aquatic Data Infrastructures, such as EMSO, SIOS, EMODnet Physics, and EBI – Mgnify. In addition, upgrading is underway to optimise the FAIRness of the underpinning web services, incorporate semantic brokering, and add data sub-setting query services.
The common interface includes facilities for discovery in two steps, from collection to granular data level, including mapping and viewing of the locations of data sets. The interface features a shopping mechanism that allows users to compose and submit mixed shopping baskets with requests for data sets from multiple BDIs. The DD&AS is fully based on, and managed using, web services and APIs, following protocols such as OGC CSW, OAI-PMH, ERDDAP, Swagger APIs, and others, as provided and maintained by the BDIs. These are used to deploy machine-to-machine interactions for harvesting metadata, submitting queries, and retrieving the resulting metadata, data sets and data products.
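As an illustration of the machine-to-machine pattern described above, the sketch below harvests Dublin Core metadata from an OAI-PMH endpoint with the Python Sickle library. It is only a generic example: the endpoint URL is a placeholder, and the actual DD&AS harvesters, protocol mappings and BDI endpoints are not shown here.

```python
# Illustrative sketch only: harvesting Dublin Core metadata records from an
# OAI-PMH endpoint with the "sickle" library. The endpoint URL below is a
# placeholder, not one of the actual BDI endpoints used by the DD&AS.
from sickle import Sickle

endpoint = "https://example.org/oai"  # hypothetical OAI-PMH base URL
harvester = Sickle(endpoint)

# Iterate over records exposed in the standard oai_dc metadata format.
for record in harvester.ListRecords(metadataPrefix="oai_dc"):
    metadata = record.metadata            # dict of Dublin Core fields
    titles = metadata.get("title", [])
    identifiers = metadata.get("identifier", [])
    print(titles, identifiers)
```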
Presentation:
During the presentation more details will be given about the federation principles, the semantic brokerage, and the embedding of the DD&AS in the Blue-Cloud e-infrastructure, serving external users as well as users of Blue-Cloud Virtual Labs and EOV WorkBenches.
The Centre for Environmental Data Analysis (CEDA) stores over 20 Petabytes of atmospheric and Earth observation data. Sources for the CEDA Archive include aircraft campaigns, satellites, automatic weather stations and climate models, amongst many others. The data mainly consists of well-described formats such as netCDF files but we also hold historical data where the format cannot be easily discerned from the file name and extension.
CEDA are investigating the SpatioTemporal Asset Catalogue (STAC) specification to allow for user interfaces and search services to be enhanced and facilitate interoperability with user tools and our partners. We are working to create a full-stack software implementation including an indexing framework, API server, web and programmatic clients, and vocabulary management. All components are open-source so that they can be adopted and co-developed with other organisations working in the same space.
We have built the “stac-generator”, a tool that can be used to create a STAC catalog, which utilises a plugin architecture to allow for more configurability. A range of input, output, and extraction methods can be selected to enable data extraction across the diverse archive data and its use by other organisations. Elasticsearch was chosen to host the indexed metadata because it is performant, highly scalable and supports semi-structured data - in this case the faceted search values related to different data collections. As STAC’s existing API was backed by an SQL database this called for the development of a new ES backed STAC API, which has now been merged back into the community developed API as an alternate database backend. We have also developed several extensions to the STAC framework to meet requirements that weren’t met by the core and community functionality. These include an end-point for interrogating the facet values, as queryables, and a free-text search capability across all properties held in the index.
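As a generic illustration of the kind of records such an indexing tool produces, the snippet below builds a tiny STAC catalogue and item with the pystac library. It is not the CEDA stac-generator itself, which is configuration- and plugin-driven, and all identifiers, paths, and facet values are invented.

```python
# A toy STAC catalogue with one item, built with pystac. Identifiers, the
# asset URL, and the "variable" facet are made up for illustration only.
from datetime import datetime, timezone
import pystac

catalog = pystac.Catalog(
    id="example-archive",
    description="Toy catalogue standing in for an archive index.",
)

item = pystac.Item(
    id="example-item-0001",
    geometry={"type": "Point", "coordinates": [0.0, 52.0]},
    bbox=[0.0, 52.0, 0.0, 52.0],
    datetime=datetime(2020, 1, 1, tzinfo=timezone.utc),
    properties={"variable": "air_temperature"},  # a faceted search value
)
item.add_asset(
    "data",
    pystac.Asset(href="https://example.org/data/file.nc",
                 media_type="application/netcdf"),
)

catalog.add_item(item)
catalog.normalize_and_save("example_catalog",
                           catalog_type=pystac.CatalogType.SELF_CONTAINED)
```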
The development of our search system has also included pilots for the Earth Observation Data Hub (EODH) and a future version of the Earth System Grid Federation (ESGF) search service, in which we have created an experimental index containing a subset of CMIP6, CORDEX, Sentinel 2 ARD, Sentinel 1, and UKCP data to investigate performance and functionality.
With the increasing demand for cloud-accessible, analysis-ready data that we are seeing in several of our upcoming projects, we have started to explore Kerchunk, a lightweight non-conversion approach for referencing existing data, which works with open-source Python packages like fsspec and xarray, and we are looking to integrate this with our STAC work.
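A minimal sketch of the Kerchunk pattern mentioned above is shown below, assuming a single netCDF/HDF5 file on object storage; the URL is a placeholder. The generated references can be opened lazily with fsspec and xarray without converting or copying the data.

```python
# Build Kerchunk references for one netCDF/HDF5 file and open them as a
# virtual Zarr store. The source URL below is hypothetical.
import fsspec
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr

url = "s3://example-bucket/example.nc"  # hypothetical source file

# Scan the file once and build a JSON reference set describing its chunks.
with fsspec.open(url, "rb", anon=True) as f:
    refs = SingleHdf5ToZarr(f, url).translate()

# Open the references as a virtual Zarr store; no data is copied or converted.
mapper = fsspec.get_mapper(
    "reference://", fo=refs, remote_protocol="s3", remote_options={"anon": True}
)
ds = xr.open_dataset(mapper, engine="zarr", backend_kwargs={"consolidated": False})
print(ds)
```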
It is the aim of the project to increase the interoperability of our search services, as well as to foster collaboration with other organisations who share our goals. Additionally, it is hoped that this work will allow for greater and easier access to the data held at CEDA.
In an era where web search serves as a cornerstone driving the global digital economy, the necessity for an impartial and transparent web index has reached unprecedented levels, not only in Europe but also worldwide. Presently, the landscape is dominated by a select few gatekeepers who provide their web search services with minimal scrutiny from the general populace. Moreover, web data has emerged as a pivotal element in the development of AI systems, particularly Large Language Models. The efficacy of these models is contingent upon both the quantity and calibre of the data available. Consequently, restricted access to web data and search capabilities severely curtails the innovation potential, particularly for smaller innovators and researchers who lack the resources to manage petabyte-scale platforms.
In this talk, we present the OpenWebSearch.eu project which is currently developing the core of a European Open Web Index (OWI) as a basis for a new Internet Search in Europe. We mainly focus on the setup of a Federated Data Infrastructure leveraging geographically distributed data and compute resources at top-tier supercomputing centres across Europe. We then detail the use of the LEXIS platform to orchestrate and automate the execution of complex preprocessing and indexing of crawled data at each of the centres. We finally present the effort to adhere to the FAIR data principles and to make the data available to the general public.
Sunet Drive is a national file storage infrastructure for universities and research institutions in Sweden. It is based on a Nextcloud setup and comprises 54 nodes, one prepared and provisioned for each institution. The aim of Sunet Drive is to become an Academic Toolbox capable of collecting, storing, analyzing, and publishing research data, supporting FAIR principles. We present Sunet Drive as an integrated solution comprising four essential building blocks:
Participating organizations co-manage their Sunet Drive node as part of a global-scale setup, meaning that every node is governed by the operating organization while being able to collaborate and share data with users within the federation, but also with external partners through the Open Cloud Mesh (OCM) protocol, for example via the ScienceMesh. S3-compatible buckets are used as logical storage entities that can be assigned for different purposes: research projects, institutions, laboratories. They are technically independent from the EFSS layer and their life-cycle can be managed beyond the lifetime of the selected EFSS software, an important step towards long-term sustainability of FAIR data.
Collaboration is encouraged by allowing users to log in through eduGAIN and subsequently accept documents, shares, and data from their collaboration partners. External collaboration is enabled via eduID.se. Added security can be provided through step-up authentication, adding a second authentication factor for identity providers that have not added support for 2FA yet. Further security can be added by activating MFA zones, mandating the receiver of a file or folder to add a second authentication factor, such as TOTP or a WebAuthn device.
During the runtime of a research project, data can be processed and analyzed directly through a scalable JupyterHub integration, an open source application developed by Sunet and funded by the GÉANT Project Incubator. Compute resources are intelligently managed in a kubernetes environment and can be allocated on a per-project basis, which includes support for CPU and GPU flavours.
The integration of Research Data Services (RDS) enables the preparation and direct publication of datasets. This includes services like InvenioRDM (e.g., Zenodo), Harvard Dataverse, or Doris from the Swedish National Data Service (SND). Research Object Crates (RO-Crate) are used as an intermediate lightweight package for the data and respective metadata, while connectors ensure compliance with each publication service. Domain-specific customizations include the integration of different publishing paradigms: while data is being actively pushed to repositories such as InvenioRDM or OSF, the SND Doris connector uses a more lightweight approach where the metadata is pushed to Doris, with the data storage remaining under the sovereignty of the publishing institution.
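Since the connectors rely on RO-Crate as the exchange format, a small, hedged illustration may help: the snippet below writes a minimal ro-crate-metadata.json describing one dataset with a single file. The file names and metadata values are invented, and the real Sunet Drive connectors add far richer, repository-specific metadata.

```python
# Write a minimal RO-Crate metadata file (RO-Crate 1.1) describing a dataset
# with one attached file. All names and paths are invented for illustration.
import json

crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            "@id": "./",
            "@type": "Dataset",
            "name": "Example research dataset",
            "hasPart": [{"@id": "data/measurements.csv"}],
        },
        {"@id": "data/measurements.csv", "@type": "File", "name": "Measurements"},
    ],
}

with open("ro-crate-metadata.json", "w", encoding="utf-8") as f:
    json.dump(crate, f, indent=2)
```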
Providing researchers with an Academic Toolbox with streamlined support for authentication, data management, analysis, and publication helps to ensure compliance with local, national, and international guidelines for storing of research data, including FAIR principles.
Datacubes form an acknowledged cornerstone for analysis-ready data – the multi-dimensional paradigm is natural for humans and easier to handle than zillions of scenes, for both humans and programs. Today, datacubes are common in many places – powerful management and analytics tools exist, with both datacube servers and clients ranging from simple mapping over virtual globes and Web GIS to high-end analytics through python and R. This ecosystem is backed by established standards in OGC, ISO, and further bodies.
In the EarthServer federation, institutions from the US, Europe, and Asia contribute spatio-temporal datacubes through OGC compliant services, including the CoperniCUBE datacube ecosystem. Weather and climate data, satellite data timeseries, and further data are provided, altogether multi-Petabyte. A unique feature is the location transparency: users see the federation offerings as a single, integrated pool. The federation member nodes orchestrate incoming queries automatically, including distributed data fusion.
Further, a tight integration of AI and datacubes is provided through the novel concept of AI-Cubes.
We briefly introduce the concepts of datacubes and then explore hands-on together how to access, extract, analyze, and reformat data from datacubes. Particular emphasis is on federation aspects.
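To give a flavour of the kind of standards-based access the tutorial covers, the sketch below issues an OGC WCS 2.0 GetCoverage request that trims a small spatio-temporal slice out of a coverage. The server URL, coverage name, and axis labels are placeholders, not actual EarthServer federation endpoints.

```python
# Extract a small spatio-temporal subset of a datacube via a WCS 2.0
# GetCoverage request. Endpoint, coverage and axis names are hypothetical.
import requests

base_url = "https://example.org/rasdaman/ows"  # hypothetical WCS endpoint
params = {
    "service": "WCS",
    "version": "2.0.1",
    "request": "GetCoverage",
    "coverageId": "ExampleTemperatureCube",       # hypothetical datacube name
    "subset": [
        "Lat(40,45)",
        "Long(10,20)",
        'ansi("2020-01-01","2020-01-31")',        # time-axis trim (axis name assumed)
    ],
    "format": "application/netcdf",
}

response = requests.get(base_url, params=params, timeout=60)
response.raise_for_status()
with open("subset.nc", "wb") as f:
    f.write(response.content)
```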
Most of the examples can be recapitulated and modified by participants with online access. Ample room will be left for discussion.
The contributor is editor of the datacube standards in OGC and ISO, and a member of EOSC.
FAIR EVA is a tool that allows checking the level of adoption of the FAIR principles for digital objects. It provides an API for querying via a persistent identifier and a web interface to interpret the offered results. These results assess, based on a series of indicators and automated technical tests, whether certain requirements are met. Additionally, FAIR EVA not only aims to evaluate and validate digital objects and their level of compliance with the FAIR principles, but it also intends to help data producers improve the characteristics of their published objects through a series of tips.
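As a purely illustrative sketch of how a client might call such an API, the snippet below sends a persistent identifier to a FAIR EVA instance over HTTP and prints the per-indicator results. The endpoint path and payload fields are assumptions made for the example; the actual FAIR EVA API documentation should be consulted.

```python
# Illustrative only: querying a FAIR EVA deployment for a given persistent
# identifier. The base URL, endpoint path, and payload fields are assumptions.
import requests

base_url = "http://localhost:9090"                 # hypothetical FAIR EVA deployment
payload = {
    "id": "10.1234/example-doi",                   # persistent identifier to evaluate
    "repo": "oai-pmh",                             # plugin choosing how (meta)data is accessed
    "oai_base": "https://example.org/oai",         # endpoint used by that plugin
}

resp = requests.post(f"{base_url}/v1.0/rda/rda_all", json=payload, timeout=120)
resp.raise_for_status()
for indicator, result in resp.json().items():      # per-indicator scores and tips
    print(indicator, result)
```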
The diversity of repository systems and data portals means that, technically, the way data and metadata are accessed varies significantly. Although there are interoperability solutions like OAI-PMH or Signposting, certain indicators require a higher level of technical detail, such as those related to metadata standards or formats specific to scientific communities. Moreover, the FAIR principles mainly focus on metadata, and data quality is only superficially assessed.
To address this issue, FAIR EVA is designed modularly and, through its plugin system, can connect with various repositories or data portals with very different technical characteristics. In general, FAIR EVA implements the indicators of the RDA FAIR Maturity Working Group but allows them to be replaced with others or even extended to perform quality tests and metrics for a specific domain. For instance, a plugin has been developed for GBIF (Global Biodiversity Information Facility) that evaluates the adoption level of the FAIR principles and extends the tests to check certain specific quality indices for biodiversity data.
The proposed demo aims to showcase the fundamental features of FAIR EVA, particularly how a plugin can be created and adapted for a specific community, extending the list of tests to assess other aspects of data quality.
FAIR EVA started to be developed in the context of the EOSC-Synergy project, and its second version was released this year. Different plugins are being developed for diverse communities: the DT-GEO project for geosciences, AI4EOSC, SIESTA, etc.
Keeping a research community updated with all the most relevant and impactful research and information is a never-ending task. With over 4 million articles published in 2021, growing rapidly at over 5% per year[1], it’s hard for anyone to keep up with a given topic.
Other academic sharing and networking platforms rely on users to add and share papers within specific groups, placing a heavy burden on them to maintain relevance and currency. This results in incomplete and messy data, including grey literature and errors, alongside peer-reviewed research. Community curation demands significant time and effort, often leading to outdated or inactive communities. Additionally, fragmented information sharing limits interdisciplinary discussion and collaboration. Finally, impact scores are publication-based rather than community-specific, hindering the recognition of relevant impactful research.
The result is that online research communities are currently much more limited in scope than they could be and often not good information sources, limiting online collaboration and making finding and following research areas much harder than necessary.
This is why we have developed OiPub - to discover and share cutting-edge research effortlessly.
OiPub automatically tracks all the latest research from recognised respected sources (e.g. CrossRef, OpenAire, ORCID, Scholexplorer) and categorises it into Topic-based information and discussion hubs. Topics can also be combined by users into custom Communities following specific interests. These are automatically kept up to date with feeds of all the latest research data relevant to them, with filtering and sorting tools allowing users to easily find the exact information they need.
This makes keeping up to date with research in any and every niche easier than ever before, allowing you to spend less time finding research and more time doing research.
In this talk we will provide a live demo of OiPub, focusing on our design concept along with insights into the data perspectives and outcomes of our work in:
Omni Iota Science Limited received valuable support from EGI-ACE in developing OiPub, starting from the EOSC Digital Innovation Hub (https://eosc-dih.eu/oipub/) through the EOSC Future project, and continuing with EGI computational services and support from the EGI DIH. This was complemented by support from other EOSC-DIH service partners, including OpenAire Nexus (https://www.openaire.eu/integration-of-openaire-graph-on-oipub-sme-platform), and by funding grants through the Malta Council for Science and Technology and the Malta Information Technology Agency for the development of the OiPub platform.
You can find and use OiPub at https://oipub.com/.
In the context of Artificial Intelligence (AI), the evolution of computing paradigms from centralized data centers to the edge of the network heralds a transformative shift in how AI applications are developed, deployed, and operated. Specifically, the edge computing paradigm is characterized by processing data directly in the devices where it is collected, such as smartphones, wearables, and IoT. Edge computing significantly reduces latency, conserves bandwidth, and enhances data privacy and security, thereby catalyzing the realization of real-time, responsive AI solutions.
The AI-SPRINT (Artificial Intelligence in Secure PRIvacy-preserving computing coNTinuum) H2020 project has developed a comprehensive design and runtime framework dedicated to accelerating the development and deployment of AI applications across the computing continuum. This continuum spans from cloud data centers to edge devices and sensors, integrating AI components to operate seamlessly and efficiently. The project's core objective is to offer a suite of tools that enable an optimal balance among application performance, energy efficiency, and AI model accuracy, all while upholding rigorous security and privacy standards.
The Personalized Healthcare use case focuses on harnessing the power of AI and wearable technologies for health monitoring. By integrating quantitative data on heart functions from wearable device sensors with qualitative lifestyle information, the use case aims to develop a personalized stroke risk assessment model. This initiative is particularly critical given the prevalence of stroke among the aging population, marking it as a significant cause of death and disability globally. Leveraging the AI-SPRINT framework, this application enables efficient resource distribution and computation across the edge-to-cloud continuum, facilitating real-time, non-invasive and secure monitoring and risk assessment.
The use case has been implemented using the PyCOMPSs programming framework from BSC: a Machine Learning model for detecting atrial fibrillation (AF) in electrocardiogram (ECG) data was developed using the parallel dislib library built on top of PyCOMPSs.
In this demo we show how the risk prediction is calculated, using ECGs extracted from a wearable device, on a Cloud server deployed in the EGI Cloud, with resources provided on-demand by the OSCAR framework from UPV.
OSCAR is an open-source platform to support the serverless computing model for event-driven data-processing applications. It can be automatically deployed on multiple Cloud backends, thanks to the Infrastructure Manager, to create highly parallel event-driven data-processing serverless applications that execute on customized runtime environments provided by Docker containers that run on an elastic Kubernetes cluster.
In the demo, the OSCAR cluster is deployed and configured with a MinIO storage. When new data (ECG files) is sent through an HTTP request by uploading the ECG file to MinIO, OSCAR triggers the creation of a PyCOMPSs/dislib container to serve the execution of the inference computation in a Function-as-a-Service mode. This is implemented through a script that starts a PyCOMPSs instance and uploads the result data back to the storage. The number of resources used in the execution can be configured dynamically through an integration between OSCAR and the COMPSs runtime via environment variables.
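The event-driven trigger described above can be illustrated with a short, hedged sketch: uploading an ECG file to the MinIO bucket watched by the OSCAR service and later fetching the result. The endpoint, credentials, bucket and object names are placeholders; in the demo they are provided by the OSCAR cluster deployed on the EGI Cloud.

```python
# Upload an input file to a MinIO bucket watched by an OSCAR service, then
# retrieve the produced output. All names and credentials are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://minio.example.org",   # hypothetical MinIO endpoint
    aws_access_key_id="minio-access-key",
    aws_secret_access_key="minio-secret-key",
)

# OSCAR services typically watch an "input" prefix of a bucket associated
# with the service; uploading a new object fires the event that launches the
# processing container.
s3.upload_file("patient_ecg.dat", "af-inference", "input/patient_ecg.dat")

# Once processing finishes, the result can be fetched from the output prefix.
s3.download_file("af-inference", "output/patient_ecg_prediction.json",
                 "prediction.json")
```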
The increasing accessibility of quantum computing technology has opened up new avenues for exploring its potential applications in various scientific fields such as artificial intelligence, manufacturing, and finance. Many research scientists heavily depend on cloud computing infrastructures for their investigations. However, accessing actual quantum hardware resources, often located remotely, involves deploying and configuring different software components.
In this demonstration, we present our reference architecture [1,2], which combines cloud computing and quantum resources for easier initiation of experiments across diverse quantum compute resources. Our solution simplifies distributed quantum computing simulations in traditional cloud environments and provides access to remote quantum compute resources. The reference architecture is portable and adaptable to different cloud platforms, offering efficient utilization and application opportunities for research communities. It incorporates essential quantum software development kits (SDKs) with machine learning support and access to various quantum devices. Furthermore, we provide practical examples serving as references for constructing solutions to predefined problems. Our reference architecture prioritizes a user-friendly interface.
Additionally, our reference architecture enables continuous deployment of quantum applications, allowing seamless orchestration with traditional cloud-based applications. These quantum applications are deployed as microservices, accessible through standard REST APIs following open standards. This combination simplifies the design and deployment of quantum services, showcasing the effective utilization of standard methodologies from traditional service-oriented computing in this hybrid context.
The reference architecture is deployed amongst others within the National Laboratory for Autonomous Systems in Hungary [3] (abbreviated as ARNL) and within the National Research on Hydrogen-Powered, Cooperative Autonomous Remote Sensing Devices and Related Data Processing Framework project (TKP2021-NVA-01). We would like to demonstrate our reference architecture through a selected use case involving global route planning for autonomous vehicles such as unmanned ground vehicles (UGVs).
The aerial-ground cooperative mapping system aims to construct a comprehensive 3D model of an unknown environment by leveraging the perspectives of different agents. Drones enable rapid exploration, but this may result in incomplete 3D reconstructions due to limited aerial coverage. Aerial robot-collected data is used in structure-from-motion (SfM) pipelines to create initial 3D reconstructions. To address the aforementioned issue, we propose automatically locating unmapped regions and guiding the ground robot to complete the 3D model. Leveraging the initial aerial 3D model, we determine the shortest traversable paths between unmapped regions and utilize quantum computing to generate an optimal global route.
The research was partially supported by the Ministry of Innovation and Technology NRDI Office within the framework of the Autonomous Systems National Laboratory Program. Project no. TKP2021-NVA-01 has been implemented with the support provided by the Ministry of Innovation and Technology of Hungary from the National Research, Development and Innovation Fund, financed under the TKP2021-NVA funding scheme.
[1] Quantum | HUN-REN Cloud - https://science-cloud.hu/en/reference-architectures/quantum
[2] A. Cs. Marosi, A. Farkas, T. Máray and R. Lovas, "Toward a Quantum-Science Gateway: A Hybrid Reference Architecture Facilitating Quantum Computing Capabilities for Cloud Utilization," in IEEE Access, vol. 11, pp. 143913-143924, 2023, doi: 10.1109/ACCESS.2023.3342749.
[3] National Laboratory for Autonomous Systems - https://autonom.nemzetilabor.hu/
OSCAR is an open-source serverless framework to support the event-driven serverless computing model for data-processing applications. It can connect to an object storage solution where users upload files to trigger the execution of parallel invocations to a service responsible for processing each file. It also supports other flexible execution approaches such as programmatic synchronous invocations and exposing user-defined APIs for fast AI inference.
Serverless computing is very appropriate for the inference phase of the AI model lifecycle, as it offers several advantages such as automatic scalability and resource optimization, both at the level of costs and energy consumption. This model, in combination with the composition of workflows using visual environments, can significantly benefit AI scientists. With this objective, we have designed, in the context of the AI4EOSC project, AI4Compose, a framework responsible for supporting composite AI by allowing the workflow composition of multiple inference requests to different AI models. This solution relies on Node-RED and Elyra, two widely adopted open-source tools for graphical pipeline composition, employing a user-friendly drag-and-drop approach. Node-RED, in combination with Flowfuse to support multitenancy, serves as a powerful graphical tool for rapid communication between different services; meanwhile, Elyra provides a visual Notebook Pipeline editor extension for JupyterLab Notebooks to build notebook-based AI pipelines, simplifying the conversion of multiple notebooks into batch jobs or workflows. The integration with OSCAR is made through flow and node implementations offered as reusable components inside both Node-RED and Elyra visual pipeline compositors.
During the session, we want to demonstrate how AI4Compose works, for both Node-RED and Elyra environments, making use of the Flowfuse instance of AI4EOSC and the EGI Notebooks service, empowered by the Elyra extension. We will present how to trigger the inference of AI models available in the AI4EOSC marketplace and compose the workflow graphically, demonstrating that, with AI4Compose, AI scientists can easily design, deploy, and manage workflows using an intuitive visual environment. This reduces the time and effort required for pipeline composition, while the AI model inference can be executed on remote OSCAR clusters running in the EGI Cloud.
This work was supported by the project AI4EOSC ‘‘Artificial Intelligence for the European Open Science Cloud’’ that has received funding from the European Union’s Horizon Europe Research and Innovation Programme under Grant 101058593. Also, Project PDC2021-120844-I00 funded by MCIN/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR and Grant PID2020-113126RB-I00 funded by MCIN/AEI/10.13039/501100011033.
One of the Key Exploitable Results of the DECIDO project is the EOSC Competence Centre for Public Authorities concept, aiming to provide a sustainable path to foster bilateral collaboration between representatives from the Public Sector and EOSC experts and consultants, so that the two communities can interact and profit from each other. As the project has recently ended, this short talk will recap the main findings from the project and show the way forward and how to adhere to the Collaboration and Membership Charter.
URBREATH aims to develop, implement, demonstrate, validate and replicate a comprehensive urban revitalization methodology based on community and stakeholder participation focusing on greening and renaturing issues. This process will be supported by advanced technologies that the project will further develop and test. In fact, the project will use techniques such as local digital twins and artificial intelligence, and social innovation to achieve its vision. The climate neutrality paradigm is at the core of this methodology which aims at regenerating deprived and abandoned areas, brownfields and other problematic sites through physical transformations of the built environment and renaturing interventions that will radically enhance social interactions, inclusion, equitability and liveability in cities.
The BeOpen project aims to provide a comprehensive framework to support open data and metadata lifecycle management pipelines. These pipelines are designed to access, curate, and publish high-value datasets (HVDs) based on the FAIR principles, making them available for future Data Spaces that support the sustainable city domain. The framework encompasses open-source tools, replicable pipelines, ontologies, and best practices for data collection, curation, semantic annotation, data and metadata harmonization, quality improvement, and publication in machine-readable formats (either in bulk or via APIs). The focus is particularly on HVDs related to statistics, mobility, environmental issues, earth observation, and geo-spatial data. Legal interoperability will also be taken into account. The framework will be deployed and evaluated through several pilots in six different countries: Germany, Greece, Italy, Lithuania, Portugal, and Spain. For potential future replications, lessons learned and best practices will be shared through relevant city networks, including Open and Agile Smart Cities, the FIWARE city community, and the Living-in.eu community.
In the context of the EOSC Competence Centre for Public Authorities supported by the DECIDO project, the BeOpen project can provide PAs with technologies and know-how on how to improve HVDs and exploit them to create added-value digital services for cities and communities. BeOpen can also leverage EOSC services to exploit scalable solutions for data exchange and analysis in highly distributed ecosystems.
Agenda:
Introduction & EOSC CC4PA (5 min) (Xavier Salazar)
BeOpen (10 min) (Antonio Filograna)
URBREATH (10 min) (Francesco Mureddu)
Q&As (5 min)
Building simulations for the Virtual Human Twin (VHT) is a challenging and complex task. In order to contemplate practical use of the VHT concept, we require an inclusive ecosystem of digital twins in healthcare, a federated cloud-based repository for gathering human digital twin resources such as models, data sets, algorithms and practices, along with a simulation platform to facilitate the transition towards personalised medicine.
These challenges are the focus of the EDITH EU-funded project [1], whose primary goal is to prepare the European roadmap for developing Virtual Human Twins. In the scope of preparing such a roadmap, we validated its key points by building a prototype implementation of the simulation platform. We began by analysing the internal structure and functional requirements of typical applications simulating human physiology, developed by EDITH partners. This formed the basis for a demonstrator of the execution subsystem of the VHT ecosystem: a software architecture that enables execution of computational models. An integrated versioning system enables collaborative editing and tagging of specific model versions that may be later selected to suit the researchers’ needs. The platform also provides a straightforward way to display, download and analyse simulation results. The functionality of the demonstrator was successfully validated with a set of typical VHT modules on ACC Cyfronet HPC resources.
The demonstrator utilises standardised solutions to implement the simulation environment, such as Git repositories to store and version the simulation source code, S3 to store patient data, along with simulation outputs, Dataverse/Zenodo integration to utilise published datasets (or to create new ones), along with HPC to run complex and time-consuming workflows. The environment enables development of algorithms, models and simulations that can make personalised medicine easier, and, as a result, increase the effectiveness and timeliness of medical treatment. Our research has resulted in a demonstrator which can run VHT modules on HPC resources, and which may be integrated with model and data repositories. We consider this the first step towards elaborating the whole VHT ecosystem [2].
In the scope of this presentation we will show the main building blocks of the demonstrator, and discuss how they help build the VHTs and enact the corresponding methodology, ensuring that simulations follow the 3R principles (Repeatability, Replicability and Reproducibility). We will also address the obstacles and challenges posed by existing HPC infrastructures that need to be overcome to simplify the integration of platforms similar to the demonstrator with large-scale computational resources.
Acknowledgements. We acknowledge the support of the EU, under grants EDITH No. 101083771, Teaming Sano No. 857533, and ACK Cyfronet AGH grant PLG/2023/016723.
References
The LifeWatch ERIC Metadata Catalogue is a centralized platform for discovering and disseminating data and services, ensuring equitable access to information and promoting inclusivity in biodiversity and ecosystem research and conservation efforts. LifeWatch ERIC Metadata Catalogue was designed to tackle several challenges, critical for biodiversity and ecosystem research:
LifeWatch ERIC has adopted GeoNetwork as the technology for its Metadata Catalogue, obtaining numerous advantages: open-source flexibility, geospatial capabilities, standards compliance, a user-friendly interface, metadata management features, interoperability, scalability, performance, and community support. Different showcases have been developed in recent years to demonstrate the discoverability of data and services across different institutions and research infrastructures, such as the one developed jointly with AnaEE and eLTER for the Research Sites in the ENVRI-FAIR context (https://envri.eu/home-envri-fair/), or the metadata harvesting from the GBIF network. The Metadata Catalogue has been continuously improving, progressively incorporating new features: a template based on a profile of the EML 2.2.0 standard, developed for ecological datasets; a template with the LifeWatch ERIC profile of ISO 19139/119, developed for services; customisations to supply DOI assignment, if needed, using the DataCite services; and metadata FAIRness evaluation, added thanks to the F-UJI tool (https://www.f-uji.net). Continuous developments are planned to keep improving the quality of metadata; the next step will be the integration with the LifeWatch ERIC semantic repository EcoPortal (ecoportal.lifewatch.eu) and with AI technology to ensure metadata consistency.
Furthermore, LifeWatch ERIC is an EOSC Candidate Node and is working to federate the Metadata Catalogue in the context of the EOSC-Beyond project.
The EGI Cloud Container Compute is a container orchestrator that facilitates the deployment of containerised applications. Containers offer the advantage of having the entire software and runtime environment in a single package, which can simplify the deployment process. However, deploying a containerised application can be challenging due to the need to install a container orchestrator. This often requires the user to manage the entire virtual machine with all the required system services, which can be time-consuming and complex. In recent years, the adoption of the Kubernetes container orchestrator has led to a notable improvement in the efficiency of both developers and system administrators. This trend is also being observed in scientific computing, where containers are being tailored for use in federated computing environments to facilitate the execution of scientific workloads. This demonstration will introduce the EGI Cloud Container Compute Service, a managed platform that provides seamless container execution.
[In a nutshell]
This demo, run on the production EGI DataHub service, will cover a multi-cloud distributed data management scenario. We will showcase how the data can be ingested, managed, processed in a repetitive way using automation workflows, and interacted with using middleware and scientific tools. All that will happen in a distributed environment, underlining Onedata’s capabilities for collaborative data sharing that crosses organizational borders.
Join us to see the latest developments of Onedata: S3 interface, evolved GUI, improved automation workflows, developer tools and lightweight Python client libraries.
[Full abstract]
Onedata continues to evolve with subsequent releases within the 21.02 line, enhancing its capabilities and solidifying its position as a versatile distributed data management system. Key improvements include the rapid development of the automation workflow engine, the maturation of the S3 interface, and powerful enhancements to the web UI for a smoother user experience and greater control over the distributed data.
Apart from that, a significant focus has been put on enhancing the interoperability of the platform. Onedata can be easily integrated as a back-end storage solution for various scientific tools, data processing and analysis platforms, and domain-specific solutions, providing a unified logical view on otherwise highly distributed datasets. This is achieved thanks to the S3, POSIX, and Pythonic data interfaces and tools that enable effortless inclusion of Onedata as a 3rd party solution in CI/CD pipelines. For example, the "demo mode" makes it straightforward to develop and test arbitrary middleware against a fully functional, zero-configuration Onedata backend. With the ability to integrate with SSO and IAM services and reflect the fine-grained federated VO structures, Onedata can serve as a comprehensive data management solution in federated, multi-cloud, and cross-organizational environments. Currently, it's serving this purpose in the ongoing EuroScienceGateway, EUreka3D, and Dome EU-funded projects.
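As a hedged illustration of the S3 interface mentioned above, the snippet below lists and reads files in a Onedata space through an S3-compatible endpoint using the s3fs library. The endpoint, credentials, and space name are placeholders, not actual EGI DataHub values.

```python
# List and read files in a Onedata space exposed over an S3-compatible
# endpoint. Endpoint URL, credentials, and space name are hypothetical.
import s3fs

fs = s3fs.S3FileSystem(
    key="access-key",
    secret="secret-key",
    client_kwargs={"endpoint_url": "https://s3.datahub.example.org"},
)

# A space exposed over S3 appears as a bucket-like container.
for path in fs.ls("my-space"):
    print(path)

with fs.open("my-space/results/summary.csv", "rb") as f:
    print(f.read(200))
```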
Automation workflows in Onedata can streamline data processing, transformation, and management tasks by automating repetitive actions and running user-defined logic fitted to specific requirements. The integrated automation engine runs containerized jobs on a scalable cluster next to the data provider's storage systems. This allows seamless integration of data management and processing steps, enabling efficient handling of large-scale datasets across distributed environments.
During our demonstration, we will present a comprehensive use case demonstrating Onedata's capabilities in managing and processing distributed data based on the EGI DataHub environment. It will showcase a pipeline that embraces the user's federated identity and VO entitlements, automated data processing workflows, the wide range of Onedata's tools for data management, and interoperability with scientific tools and middleware, with a special focus on the S3 interface.
Join us for the demo to see how Onedata empowers organizations to manage and process federated and multi-cloud data efficiently, driving collaboration and accelerating scientific discovery.
Federated learning aims to revolutionize the training of artificial intelligence models, in particular deep learning and machine learning with distributed data. Emerging as a privacy-preserving technique, it makes it possible to train models without centralizing or sharing data, preserving their integrity and privacy. Moreover, different studies show that in some cases it also offers advantages in terms of accuracy and robustness of the developed models, as well as savings in energy consumption, computational cost, latency, etc.
In this demonstration, we will showcase how to carry out the implementation of a complete federated learning system in the AI4EOSC platform. Specifically, during the session we will perform the live training of an AI model under a personalized federated learning approach. This federated training will be done with multiple clients using distributed data in different locations (including resources from the platform itself, but also from the EGI Federated Cloud), simulating a real world application, including participation from the audience in the overall training process.
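For readers unfamiliar with the client/server split used in federated learning, the sketch below shows a toy Flower (flwr) client that trains on local data and only exchanges model parameters with a server. This is an illustration of the general technique, not the AI4EOSC platform's actual federated-learning stack, and the model, data, and server address are invented.

```python
# A minimal federated-learning client using the Flower (flwr) framework. The
# "model" is a single weight vector and the training step is simulated; only
# parameters, never data, are sent to the (hypothetical) server.
import numpy as np
import flwr as fl


class DummyClient(fl.client.NumPyClient):
    """Client holding a local 'model' (one weight vector) and local data."""

    def __init__(self):
        self.weights = np.zeros(10)

    def get_parameters(self, config):
        return [self.weights]

    def fit(self, parameters, config):
        # Pretend to train locally: nudge the received global weights.
        self.weights = parameters[0] + np.random.normal(scale=0.01, size=10)
        return [self.weights], 100, {}  # updated params, num examples, metrics

    def evaluate(self, parameters, config):
        loss = float(np.linalg.norm(parameters[0]))
        return loss, 100, {}


if __name__ == "__main__":
    # In a real deployment the server runs on the platform and each data
    # holder starts a client next to its (private) data.
    fl.client.start_numpy_client(server_address="127.0.0.1:8080",
                                 client=DummyClient())
```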
This is a side meeting to support the GlobalCoast initiative.
While physical presence is strongly encouraged, remote attendance will be possible via Zoom.
The connection details to attend the GA are the following:
https://eu01web.zoom.us/j/61437139251?pwd=V1dhF8tmHoxvfVGNyjvGBiGTrwh8Rb.1
Agenda:
15:00-15:05 Welcome (5', Giuseppe La Rocca, EGI F.)
15:05-15:15 Intro: Challenges and mission of the GlobalCoast initiative (10', Giovanni Coppini, CMCC)
15:15-15:25 GlobalCoast technical architecture design and requirements (10', Giovanni Coppini, CMCC)
15:25-15:35 EGI Federated Infrastructure for advanced research (10', Giuseppe La Rocca, EGI F.)
15:35-16:00 Discussion on the technical architecture & requirements (25’)
Coffee break (10')
Part I - Data management solutions
16:10-16:20 Beacon (10', Dick Schaap, MARIS/SeaDataNet)
16:20-16:30 EGI DataHub (10', Andrea Cristofori, EGI F.)
16:30-16:40 Pangeo (10', Richard Signell, OpenScience Computing)
16:40-16:50 Pangeo@EOSC (10', Tina Odaka & Anne Fouilloux)
Part II - Example of applications
16:50-17:00 Witoil Cloud prototype (Giovanni Coppini, CMCC)
17:00-17:10 OPENCoastS (10’, Anabela Oliveira, LNEC)
Part III - Authentication Authorization Infrastructure (60’)
17:10-17:50 Collecting requirements to set-up the GlobalCoast AAI (40', Valeria Ardizzone, EGI F.)
17:50-18:00 Wrap-up and next steps (10')
This double session will demonstrate the commitment of the EGI community towards Open Science through a series of presentations and discussions presenting the EGI contribution to the establishment of the European Open Science Cloud.
The content will include an overview of the current status of the EOSC with a focus on the ongoing process of defining the EOSC Federation as a network of EOSC Nodes. The EGI-coordinated EOSC Beyond project, which started in April 2024, has taken on a relevant role in this activity with its set of EOSC Pilot Nodes that will be established during its execution. Presentations of the EOSC Beyond Pilot Nodes are included in the session agenda together with a status update on the establishment of the first node of the EOSC Federation, the EOSC EU Node, that highlights the EGI contribution to the EOSC procurement activities.
The German National Research Data Infrastructure (Nationale Forschungsdateninfrastruktur, NFDI) comprises over 270 institutions, including science organisations, universities, higher education institutions, non-university research institutions, scientific societies, and associations. It is organized in 26 consortia, five sections, and one basic services initiative, whose vision is that data is a common good for excellent science. Sustainability and FAIRness are key factors when it comes to preserving data that has been acquired over time by scientists in laborious and eventually costly efforts.
The multitude of scientific disciplines and institutions represented in NFDI and EOSC Beyond makes it a fitting setting for a pilot node for EOSC. Leveraging the knowledge and experience of the people behind the Helmholtz Cloud, the node is not being built from scratch but on the shoulders of the community and with an infrastructure that has been tried and tested. The node itself will be built from existing building blocks, such as an AARC-compliant AAI architecture, providing access to services hosted in a federated manner all over Germany to the members of the NFDI consortia and their partners. A service catalogue and corresponding marketplace will not only serve as a common entry point for all scientists but also make it easy for institutions to integrate new services for the community.
There are several services planned initially, e.g. an OpenData portal that will be built in collaboration with the consortium DAPHNE4NFDI for the photon and neutron community, an ontology catalogue stemming from the needs of NFDI4ING for the engineering community as well as a JupyterHub instance deployed by BASE4NFDI who engage in providing base services to NFDI. In the talk, we will present the planned architecture of the German NFDI EOSC pilot node together with the existing components that will be subsequently connected within EOSC Beyond's timeframe. A short introduction of the initial services will conclude the presentation.
Several scientific disciplines, including climate science, have experienced significant changes in recent years due to the increase in data volumes and the emergence of data science and Machine Learning (ML) approaches. In this scenario, ensuring fast data access and analytics has become crucial. The data space concept has emerged to address some of the key challenges and support scientific communities towards a more sustainable and FAIR use of data.
The ENES Data Space (EDS) represents a domain-specific example of this concept for the climate community developed under the umbrella of the European Open Science Cloud (EOSC) initiative.
More specifically, the EDS was launched in early 2022 in the context of the EGI-ACE project, and it is being further advanced in the frame of the EOSC Beyond project, contributing to enhancing and validating the EOSC Core functionalities, with the ambition of becoming one of the nodes in the EOSC Federation.
In more detail, the ENES Data Space pilot node aims to offer core services and capabilities relevant to the climate community. This includes data (input datasets and research products), resources (storage and compute), infrastructural components for the deployment and orchestration of services, and software solutions supporting researchers and institutional departments in realistic scenarios. In this way, scientists can run AI/ML-based applications and perform big data processing, interactive analytics and visualisation of climate data, without having to download datasets, install software and prepare the environment. The environment will also provide provenance capabilities, thus allowing scientists from different domains to publish and/or manage provenance documents, so that they can share and explore data lineage information related to their scientific workflows.
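As a hedged example of the kind of interactive analysis such an environment enables from a notebook, the snippet below opens a CMIP6-style netCDF file with xarray and computes an area-weighted global mean temperature series. The file path, variable and coordinate names are placeholders.

```python
# Compute an area-weighted global mean of near-surface air temperature from a
# CMIP6-style netCDF file. Path, variable and coordinate names are assumed.
import numpy as np
import xarray as xr

ds = xr.open_dataset("data/tas_example_cmip6.nc")   # hypothetical dataset
tas = ds["tas"]                                      # near-surface air temperature

# Weight grid cells by the cosine of latitude before averaging over the globe.
weights = np.cos(np.deg2rad(tas.lat))
global_mean = tas.weighted(weights).mean(dim=["lat", "lon"])

# Annual means of the global series, e.g. for a quick anomaly plot.
annual = global_mean.groupby("time.year").mean()
print(annual)
```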
LifeBlock, developed by LifeWatch ERIC, stands at the forefront of the advancement of the FAIR data management principles. This talk explores ways by which LifeBlock integrates federated data sources, employs semantic treatment, and incorporates AI to support ecological and biodiversity research.
LifeBlock excels in federating data from diverse, heterogeneous sources, creating a unified environment for comprehensive data access and discovery. This capability is critical for ecological research, which relies on various datasets from different contexts and methodologies. By federating these sources, LifeBlock contributes to defragmentation, thus accelerating data access and analysis. It also supports multidisciplinary and cross-domain research, which is fundamental for biodiversity and ecology.
The platform leverages semantic technologies to enrich data with meaningful context through ontologies and standardised vocabularies. This semantic treatment enhances data interoperability and usability, facilitating complex queries and advanced analytics. LifeBlock's use as a Scientific Knowledge Graph (SKG) further organises and links data based on semantic relationships, enabling intuitive exploration and knowledge discovery.
AI integration within LifeBlock significantly enhances its functionality. AI algorithms are employed for data quality assessment, automated metadata generation, and intelligent search, reducing manual effort and increasing efficiency. Additionally, AI-driven analytics support predictive modelling and scenario analysis, providing powerful tools for addressing complex queries.
The system ensures proper attribution to the original data sources, rewarding contributors for their data with recognition and potential incentives. Provenance tracking within LifeBlock guarantees the authenticity and traceability of the data, fostering trust and reliability in the dataset.
Through these advanced methodologies, LifeBlock introduces an alternative for FAIR data management, ensuring that ecological and biodiversity data are managed effectively and leveraged to their full potential. This presentation will provide a detailed exploration of these technologies, showcasing their functional aspects and their role in fostering a collaborative and innovative scientific community.
This talk will very briefly discuss the three main aspects of the proposed CESSDA Pilot Node:
- Resource usage tracking and cost calculation for service providers;
- Access to digital objects using institutional credentials;
- PID registration and resolution for digital objects.
The problems to be addressed, user scenarios, constraints and target groups for each will be elaborated.
EGI is a federator of national and domain-specific e-infrastructure capabilities, with governance rooted in national e-infrastructures (NGIs) and international scientific communities. This session will zoom into the national level of EGI, offering a comprehensive view of the latest developments and future initiatives from EGI member countries.
Through a series of insightful talks, representatives from various countries will highlight their recent achievements, showcasing successful services and innovative approaches tailored to serve scientific communities. Attendees will gain an understanding of the diverse strategies and roadmaps that member countries are implementing to enhance their e-infrastructures. The session will also feature discussions on upcoming projects and collaborative efforts aimed at addressing future challenges and advancing the capabilities of EGI. Join us to explore how national perspectives within the EGI federation are driving progress and fostering a robust ecosystem for scientific research and innovation.
In Italy, thanks to the investment coming from Italy’s Recovery and Resilience Plan projects (mainly ICSC: https://www.supercomputing-icsc.it/, Terabit: https://www.terabit-project.it/, and others), INFN is implementing a distributed hardware and software infrastructure that will be used to support very heterogeneous scientific use cases, not only from the INFN community but from the full Italian scientific community.
INFN DataCloud aims to create a next-generation integrated computing and network infrastructure by 2025. The primary goal is to enhance collaboration and information exchange among Italian scientific communities.
In particular, within both ICSC and Terabit, INFN is collaborating with CINECA and GARR with the aim of building an integrated computing and network infrastructure to eliminate disparities in access to high-performance computing across Italy.
The ICSC project has a very large partnership of around 55 entities, both public and private bodies, which ensures that the infrastructure put into operation by the end of the project will be able to support requirements from essentially the full range of Italian scientific communities.
The INFN DataCloud project must address technical and non-technical challenges related to network architecture, data storage, and computational resources, together with coordinating the distributed team of people working on all the project activities from each of the INFN main sites (about 12 distributed data centres in Italy).
One of the main challenges is Data Security and Privacy: Handling sensitive scientific data requires robust security measures. The project must address data encryption, access controls, and compliance with privacy regulations to protect researchers’ work.
The distributed computing and storage facilities upgraded by Italy’s Recovery and Resilience Plan projects, which funded many fat nodes (we call these “HPC Bubbles”) with GPUs, many-core CPUs, large RAM and SSD-based storage, increase the resources available to researchers in order to support the modern requirements of AI algorithms and the very large datasets needed for most sciences (physics, bioinformatics, earth studies, climate, etc.).
In the context of data access, management and transfer, the INFN DataCloud project is federating both POSIX and object storage into a geographically distributed data lake.
In the talk we will show both the technical and non-technical solutions implemented to build a transparent, high-level federation of both compute and data resources, and how the development, operation and user support activities are organised in such a heterogeneous environment.
In summary, the INFN DataCloud project faces a mix of technical, organizational, and logistical challenges. However, its potential impact on Italian scientific communities makes overcoming these hurdles worthwhile.
Moreover, the INFN DataCloud project has democratized access to high-performance computing and data resources, empowering Italian researchers to accelerate their scientific endeavors.
PLGrid is a nationwide computing infrastructure designed to support scientific research and experimental development across a wide range of scientific and economic fields. PLGrid provides access to supercomputers, quantum computers, specialized accelerators for artificial intelligence, cloud computing, disk storage, optimized computing software and assistance from experts from all over Poland. The Polish PLGrid infrastructure is managed by the PLGrid Consortium, established in January 2007, which includes the following computing centers: Cyfronet, ICM, PSNC, CI TASK, WCSS, NCBJ.
In order to make it easier for users to use the available distributed resources, it was necessary to create a centralized platform that includes many applications, tools, and solutions, with the PLGrid Portal as its main component. The platform has been developed since the beginning of the PLGrid Consortium, going through successive new versions. Thanks to the experience gained over 10 years, it has become a mature and flexible solution. This allows us to easily adapt to changing requirements and to respond effectively to new challenges in terms of both user and operational convenience in a federated infrastructure.
As the main application in the PLGrid Infrastructure, the PLGrid Portal covers everything from creating an account to requesting distributed resources through PLGrid grants. It provides all the functionalities a user needs, such as creating and managing an account, affiliations, subordinates, teams, services (access to resources), applications, and SSH keys. However, it is the process of requesting resources that is specific. The user fills out a grant application, which is negotiated with resource administrators, and after the grant is completed, it must be settled. From our perspective, the most important thing was to build a Portal that would be flexible. We have a variety of types: accounts, teams, services, and grants. New types of resources, with different limitations, can easily be defined; for example, a specific account type cannot create a specific grant type, or a given type of grant has other fields that the user is required to provide. On the operational side, we have many roles, where the main ones, such as Resource/Service Administrator, have dedicated web views, the ability to replicate data to LDAP, and API access which allows all data to be synchronized among the various HPC centers and different types of resources (computing, storage, object storage, cloud, etc.).
Modern life sciences research has undergone a rapid development driven mainly by the technical improvements in analytical areas leading to miniaturization, parallelization, and high throughput processing of biological samples. This has led to the generation of huge amounts of experimental data. To meet these rising demands, the German Network for Bioinformatics Infrastructure (de.NBI) was established in 2015 as a national bioinformatics consortium aiming to provide high quality bioinformatics services, comprehensive training, powerful computing capacities (de.NBI Cloud) as well as connections to the European Life Science Infrastructure ELIXIR, with the goal to assist researchers in exploring and exploiting data more effectively.
Our de.NBI Cloud project type SimpleVM enables users with little to no background knowledge in cloud computing or systems administration to employ cloud resources with a few clicks. SimpleVM is an abstraction layer on top of OpenStack to manage virtual machines (VMs) or clusters thereof. It was designed to support the combination of resources from independent OpenStack installations, thus operating as a federated multi-cloud platform which is accessible from a single web-based control panel. The entire software stack only requires access to 1) the OpenStack API and 2) an AAI provider (via Keycloak, LifeScience AAI), and it can be deployed on any vanilla OpenStack installation using Ansible. In general, SimpleVM primarily eases the creation and management of individual pre-configured virtual machines and provides web-based, SSO-protected access to popular research and development environments such as RStudio, Guacamole Remote Desktop, Theia IDE, JupyterLab and Visual Studio Code. However, custom recipes based on Packer can be added to provide specific VMs tailored to user requirements.
A single SimpleVM project can host multiple VMs with individual access permissions for users. On top of this functionality, a dedicated SimpleVM Workshop mode streamlines virtual machine provisioning for workshops. Organizers can define a custom VM image and possible access methods, optionally based on the research environments mentioned above. When the workshop starts, participants can instantly access individual VMs based on this predefined configuration via ssh or browser. Once the VMs are ready, the system allows the organizers to automatically inform participants on how to access the resource.
Further, with SimpleVM, de.NBI Cloud users can effortlessly configure and manage their own SLURM-based BiBiGrid clusters with just a few clicks. This feature addresses the needs of researchers who want to run their tools or entire workflows across multiple machines and provides a simple route for users to learn how to use grid-based scheduling systems.
In summary, SimpleVM provides a comprehensive solution to bring federated, multi-cloud resources to end-users and in addition, provides a simple to use basis for online training and as an entry to grid-based computing.
The Hungarian Research Network's (HUN-REN) Data Repository Platform (ARP) is a national repository infrastructure that was opened to the public in March 2024. With ARP, we aim to create a federated research data repository system that supports the data management needs across its institutional network. Implementing ARP is our first step towards establishing an EOSC compliant research infrastructure.
Here we present the conceptualization, development, and deployment of this federated repository infrastructure, focusing on the ARP project's objectives, architecture, and functionalities.
The primary goal of the ARP project is to establish a FAIR focused, sustainable, continuously operational federated data repository infrastructure that not only supports the central storage and management of digital objects but also ensures the interoperability and accessibility of research data across various scientific domains.
The Hungarian Research Network (HUN-REN) currently comprises 11 research centers, 7 research institutes and 116 additional supported research groups, conducting research in the most varied disciplines of mathematics and natural sciences, life sciences, social sciences and the humanities.
ARP is built on a foundation of secure and scalable storage solutions, utilizing the HUN-REN's existing infrastructure to establish a resilient and redundant data storage environment. The system incorporates a hierarchical storage model with a capacity of 1.4 Petabytes, distributed across two sites for enhanced data security. This model supports triple replication of data, ensuring high availability and disaster recovery capabilities.
Central to the ARP's functionality is its suite of data management tools. The primary service of ARP is the data repository itself, built on Harvard's Dataverse repository system. In ARP we addressed an important shortcoming of Dataverse, namely its difficulty in handling a diverse set of metadata schemas. As ARP's goal is to support the metadata annotation needs of researchers from various domains, it was essential to provide a richer set of metadata schemas beyond the ones built into Dataverse. To achieve this, we added a Metadata Schema Registry as a central component, built on Stanford University's CEDAR framework and closely integrated with the ARP repository, to manage diverse data types and standards and ensure interoperability across different research disciplines.
Besides providing the possibility to author and use any domain-specific metadata schema, we also extended Dataverse with the import, export and authoring of datasets using RO-Crate via our custom AROMA tool. AROMA and RO-Crate facilitate the structured packaging and rich metadata annotation of research data, enhancing the granularity and usability of data curation. With RO-Crate it is possible to describe datasets, or individual files within them, at a level of detail that is not otherwise possible in Dataverse.
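For illustration, a minimal RO-Crate metadata document (ro-crate-metadata.json) has the shape sketched below; this is generic RO-Crate 1.1 structure rather than AROMA's actual output, and the dataset and file names are hypothetical:

```python
import json

# Minimal RO-Crate 1.1: a metadata descriptor, a root dataset, and one described file.
crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            "@id": "./",
            "@type": "Dataset",
            "name": "Example dataset",                      # hypothetical
            "hasPart": [{"@id": "data/measurements.csv"}],
        },
        {
            "@id": "data/measurements.csv",                  # hypothetical file
            "@type": "File",
            "name": "Measurements",
            "encodingFormat": "text/csv",
        },
    ],
}

with open("ro-crate-metadata.json", "w") as fh:
    json.dump(crate, fh, indent=2)
```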
ARP as a federated service integrates disparate data management systems into a cohesive framework that supports a unified knowledge graph and query service for researchers nationwide. The VIVO-based knowledge graph of ARP enables detailed, file-level search functionality and supports federated searches across a variety of national and international research databases, significantly improving data discoverability.
The HUN-REN ARP project represents a significant advancement in the field of research data management for the Hungarian research community.
The University of Zagreb, University Computing Centre (SRCE) has been providing all layers of e-infrastructure for Croatian science and higher education for more than 50 years. The latest expansion, achieved through the Croatian Scientific and Educational Cloud (HR-ZOO) project, brought five new data centers in four major cities, a significant bandwidth improvement in the national educational and research network, and resources for two new advanced ICT services: Virtual Data Centers and Advanced Computing.
The Advanced Computing service provides users with two advanced computing resources: Supek and Vrančić. Supek is a supercomputer based on HPE Cray technology with a sustained computing power of 1.25 PFLOPS. Vrančić is a cloud computing platform based on the widely used open-source platforms OpenStack and Ceph that provides 11,520 CPU cores, 16 GPUs and 57 TB of RAM.
Furthermore, SRCE provides several data service options: PUH, for storing and sharing data during education and research, and DABAR, a digital academic archives and repositories framework for establishing and maintaining reliable and interoperable institutional and thematic repositories.
Both services are integrated with Advanced Computing in the sense that users can easily access and store data on PUH or DABAR.
Finally, SRCE actively participates in the development and maintenance of the Croatian Research Information System (CroRIS), the central place for reliable information about institutions, researchers, projects, equipment, publications, etc. Access to Advanced Computing resources for researchers is currently fully integrated with CroRIS, which enables advanced resource usage reporting based on institutions, projects or funding streams, as well as the automatic correlation of publications with resources.
The SRCE ecosystem provides users from science and higher education with the variety of features described above, together with workshops that enable speedy onboarding and expert support that prepares scientific applications for the advanced computing resources, as well as other tools that simplify the usage of the provided services. Information systems such as CroRIS glue all of this together by enabling the development of rich and transparent usage reports.
"This session will discuss the SPECTRUM project's efforts to create a roadmap for the future of data-intensive research in High Energy Physics and Radio Astronomy. The project aims to address the unprecedented data processing and analysis needs of new instruments that will generate exabytes of data. This will require a federated European Exabyte-scale research data federation and compute continuum, encompassing HTC, HPC, Cloud and Quantum technologies.
Key Topics:
- Challenges of Exascale data processing and analysis in frontier research
- Development of a Strategic Research, Innovation and Deployment Agenda (SRIDA)
- Collaboration with a Community of Practice composed of external experts
- Roadmap for data-intensive scientific collaborations and access to a federated European research infrastructure
Join us to learn about the latest developments in data-intensive research infrastructure and how it will enable groundbreaking discoveries in High Energy Physics and Radio Astronomy.
Short introduction to the SPECTRUM project highlighting the objectives, expected results, partners and timeline.
This presentation provides an introduction to the SPECTRUM Community of Practice. It describes the survey that consults the community on current best practices and future needs in large-scale and data-intensive scientific computing, as well as the approach to collecting use cases.
See also: https://www.spectrumproject.eu/spectrumcop
CERN-related HEP
This presentation covers the current and future needs of two important Radio Astronomy initiatives: LOFAR and SKA.
"Have you heard about the EGI Federation services but are unsure about what they mean?
Then, join this insightful session on the cutting-edge EGI Federation services. This service federation and management platform ensures seamless coordination of the EGI infrastructure. Discover how the federation core empowers service providers to integrate into a unified hub while engaging with research communities to meet their needs, facilitate access, and spark innovation. During the session, we'll dig into the latest features and development roadmaps of the EGI Federation services, providing a forum for scientific user communities and infrastructure providers to collaborate and reflect on these promising advancements. Service list: https://www.egi.eu/services/federation/
The EGI Services for Federation complement the EGI Services for Research and Business with tools for coordination of the EGI Federation, improving how we, as a federation, work together.
This presentation will give a bird's-eye view of the EGI Federation services landscape before diving into the individual services and service roadmaps. https://www.egi.eu/services/federation/
This presentation unveils the enhanced EGI Helpdesk platform, designed to empower researchers and foster collaboration across Europe's open science initiatives. It provides an overview of the migration to the new EGI Helpdesk, detailing the optimization of existing workflows and the implementation of new ticketing processes to ensure a more efficient support experience for EGI users. The presentation will delve into the current status of the migration, highlighting accomplished integrations with other services and outlining the roadmap for a seamless transition. Several demonstrations will showcase the most interesting and complex ticket management workflows for the Worldwide LHC Computing Grid (WLCG) and broader EGI communities.
This presentation provides an overview of the architecture and implementation of the new artefacts repositories for EGI.
The EGI repositories are developed, maintained and operated by LIP and IFCA/CSIC. The new repositories will host RPMs (for RHEL and compatible distributions), DEBs (for Ubuntu and compatible distributions) and Docker images for container-based services and micro-services.
The presentation will describe the architecture of the new repositories and their various components and capabilities.
The EGI Software Vulnerability Group (SVG) has been handling software vulnerabilities since 2010, with the purpose 'To Minimize the risk of security incidents due to software vulnerabilities'.
This is important in order to help sites protect themselves against the most serious vulnerabilities and to give the communities using the services confidence that their credentials and data are secure and that sites patch in a consistent manner.
As the EGI is evolving, the EGI SVG is evolving to cope with the changing landscape. This includes increased inhomogeneity of the infrastructure, and increasing proliferation of services on the infrastructure.
This short talk aims to inform people of what changes we have made in recent times, what our plans are, and invite others to become involved. Whether reported vulnerabilities are deemed to be in scope depends on sufficient participation of people with expertise in the affected areas. We aim for service providers to effectively help one another stay secure via the sharing of their invaluable knowledge!
The Belle II experiment is an international collaboration operating at the SuperKEKB energy-asymmetric electron-positron collider at the Japanese KEK laboratory. The experiment takes advantage of a distributed computing infrastructure to run MC simulations and analyze the large amount of data produced by the detector.
For several years, the collaboration has been testing different solutions to integrate cloud resources into their computing model for general activities and within the context of the European project JENNIFER2.
In this presentation, we will show our experience in using the EGI Federated Cloud and how its usage is evolving with the new interface provided by the DIRAC framework, which is used as the workload management system by the experiment. We will also provide some insights for the JENNIFER3 project.
Modern scientific organisations require a robust, scalable, and flexible environment that supports high-performance computing, large-scale data management, and collaborative research environments. These requirements often include specialised hardware, software, and infrastructure to handle large-scale data analysis, high-performance computing (HPC), and collaborative research efforts.
Addressing these requirements involves a combination of cutting-edge hardware, specialised software, efficient data management practices, and a strong emphasis on security, compliance, and ethical considerations. By fulfilling these needs, scientific organisations can advance their research capabilities and drive innovation in their respective fields.
This session will zoom in the technical computing and data continuum requirements of modern scientific communities and Research Infrastructures (RIs) and discuss how modern infrastructures can pave the way towards new data-intensive scientific collaborations.
A panel discussion will follow.
The proliferation of terms used to characterize digital infrastructures providing services for cultural heritage reflects different perspectives on their role but may be confusing for their potential users. Deciding what is a research infrastructure (RI) and what is not may affect stakeholders’ understanding of them and the related use, and impact on their funding, where support to their creation and management depends on the legal definition of what is eligible and what is not. Adding the qualification of “research” to such infrastructures further complicates the situation, as it also involves the understanding of what research-oriented means. Cultural heritage is a domain in which the term “research” has a broad connotation, including activities performed according to the methodology of social sciences and humanities and to the scientific method (within the so-called heritage and archaeological sciences) and requires a tailored approach to digitization. Important research results can be achieved both by professionals and by researchers, thus also the target user communities can have fuzzy delimitations.
Attempts to identify different approaches to research date back to the well-known study by Stokes, which distinguished among fundamental research, use-inspired research and solution-oriented applied research, with most of the research activities in cultural heritage belonging to the second and third categories. This distinction was considered in a study carried out in 2018 by the RI-PATHS EU project, which included competence centres in the third grouping but also considered the case of multi-purpose RIs. Digital infrastructures are not considered separately in this RI-PATHS taxonomy, or are just mentioned as data and service providers. This requires an update, since research activities now increasingly consist of, or rely on, digital services. Notable examples of digital research infrastructures for cultural heritage are national initiatives such as DIGILAB.be, a Belgian data infrastructure for conservation, and HSDS, a similar infrastructure in the UK. Similarly, the ARIADNE RI provides a catalogue of 4,000,000 archaeological research datasets with services to process the data they contain. A European initiative is 4CH, implementing a European competence centre for heritage conservation relying on a knowledge base of heritage information. Thus the question moves to the digital backbones of the above and how they could be further developed to reap the benefits of an advanced digital transformation. The recently started ECCCH project aims at developing cloud-based vertical services for the different heritage communities, and will deploy its results by 2029.
So the main question seems to be not just which services are required, but in which cloud environment they should be implemented. While the question “is a cloud the place to develop a digital research infrastructure for cultural heritage?” has an obvious positive answer, the features of such a cloud still need clarification and further investigation. The present contribution will discuss this question, with a focus on the digital services to be provided by 4CH in its development as an international digital research infrastructure and competence centre.
METROFOOD-RI is a distributed research infrastructure for promoting metrology in food and nutrition, which will provide high-level metrology services for enhancing food quality and safety. METROFOOD entered the ESFRI roadmap in 2018 and is currently in its implementation phase. The physical part of METROFOOD-RI consists of facilities such as laboratories, experimental fields/farms for crop production and animal breeding, small-scale plants for food processing and storage, kitchen-labs for food preparation, and “demo” sites for direct stakeholder engagement, while the digital part consists of resources such as databases, apps, Wikis, and e-learning platforms. In the preparatory phase, METROFOOD-RI analysed its own digital requirements and compared them with those of other infrastructures, associations, and SMEs, and some similarities were identified. The minimum viable digital requirements for METROFOOD-RI were then defined for the central hub, the national nodes and the partner institutes. The requirements were separated into internal requirements, needed by the back office to run the legal entity, and public requirements, covering the access to be provided to users, researchers, policy makers, industry, and the public. Two of these core requirements are the service and resource catalogues and the central authentication and authorisation infrastructure (AAI). Many of the required resources will have connections between each other, representing workflows and processes within METROFOOD-RI, and many of these resources will also provide open APIs for automated data exchange. In this presentation, an overview of these digital requirements will be given and some of the requirements will be presented in detail.
ENVRI-Hub NEXT is a 36-month project designed to address major environmental challenges such as climate change, natural hazards, and ecosystem loss by advancing multidisciplinary research and integration among European research infrastructures (RIs). This project aims to build on the current ENVRI-Hub platform to create a robust framework that brings together different environmental RIs, facilitating their collaboration and contribution to the European Open Science Cloud (EOSC).
The primary goal of ENVRI-Hub NEXT is to support the integration of environmental RIs across four major domains: atmospheric, marine, terrestrial, and Earth observation. This integration is crucial to unlock the potential of RI data for addressing complex research questions, in line with the European Green Deal and Digital Transition. It will also help establish a coherent, sustainable, and world-leading RI cluster.
To achieve this goal, ENVRI-Hub NEXT promotes operational synergies among environmental RIs and e-infrastructures, provides interdisciplinary science-based services, and enhances the integration with EOSC. The consortium leading this project includes key ESFRI Landmarks (ACTRIS, AnaEE ERIC, EPOS ERIC, Euro-Argo ERIC, IAGOS AISBL, ICOS ERIC, LifeWatch ERIC) and RIs from ESFRI Roadmap (eLTER), along with several technology providers and the EGI Foundation to support the project's technical operations and integration with EOSC Core.
The technical objectives of the project include advancing the ENVRI-Catalogue, the ENVRI-Knowledge Base and the Federated Search Engine using AI-based dialogue techniques, semantic web technologies and metadata standards. These advancements will enable end users to effectively discover research assets from multiple RIs.
ENVRI-Hub NEXT aims to consolidate the conceptual and technical structure of the ENVRI-Hub platform. This consolidation will involve providing data-driven services including an analytical framework and training that facilitate interdisciplinary research, promoting the integrated use of data from different environmental RIs, and expanding the frontiers of multidisciplinary environmental sciences supporting virtual research environments.
The project officially began in February 2024. Its early focus is on summarising the status of the ENVRI-Hub and outlining the plans for ENVRI-Hub NEXT, with an emphasis on building a framework that supports the long-term sustainability and impact of the integrated environmental research infrastructure.
Within the framework of the European data strategy (European Commission, European Data Strategy, 2020; eu/info/strategy/priorities-2019-2024/europe-fit-digital-age/european-data-strategy_it), the establishment of European data spaces for specific domains (e.g., the European Health Data Space, EHDS) has been proposed, with a concomitant strengthening of the regulations governing cybersecurity.
The European Health Data Space aims to create a common space where individuals can easily control their electronic health data by defining directly applicable common rules and principles. It will also enable researchers and policymakers to use such electronic health data reliably and in compliance with privacy regulations.
The sectionalization of data movement spaces has led to increased legislation on cybersecurity: in addition to the GDPR, the EU institutions have advanced the regulation on artificial intelligence, the proposal for a data governance regulation, the proposed regulation on data, and Directive (EU) 2016/1148 concerning the security of network and information systems (the NIS Directive). The new legal framework operates within a scenario where the principles of Open Access, Open Science, and FAIR data are increasingly affirmed. The principles to be introduced within the European legal framework will impact the application of Open Science, Open Access, and FAIR principles. The impact of the ongoing regulatory adoption will also be significant in the context of research projects that concern human data, in particular the Genome Data Infrastructure (GDI) project, which will enable access to genomic, phenotypic, and clinical data across Europe. GDI aims to establish a federated, sustainable, and secure infrastructure to access the data. It builds on the outputs of the Beyond 1 Million Genomes (B1MG) project to realize the ambition of the 1+Million Genomes (1+MG) initiative. Additionally, the ELIXIRxNextGenerationIT project for empowering the Italian Node of ELIXIR, the European Research Infrastructure for Life Science Data, has the primary goal of enhancing six platforms (Data, Compute, Tools, Interoperability, Omics, and Training) and integrating the activities of the national infrastructure dedicated to Systems Biology. This contribution aims to outline the legal framework currently being defined and to assess the impact of these regulations on research activities in the field of biological data.
This session will showcase simple-to-use platforms and gateways that make life easier to researchers in data-driven research disciplines. Platforms and gateways hide away the complexity of the underlying computing infrastructure so researchers can focus on doing data-intensive science.
Example platforms and gateways will demonstrate how to empower researchers with customised environments tailored to their needs, making it possible to run a variety of complex workloads across federated cloud sites, including the access and processing of big data, easy visualization of results and enabling the sharing of research outputs with colleagues in the field.
PITHIA-NRF (Plasmasphere Ionosphere Thermosphere Integrated Research Environment and Access services: a Network of Research Facilities) is a project funded by the European Commission’s H2020 programme to build a distributed network of observing facilities, data processing tools and prediction models dedicated to ionosphere, thermosphere and plasmasphere research. One of the core components of PITHIA-NRF is the PITHIA e-Science Centre (PeSC), which supports access to distributed data resources and facilitates the execution of various prediction models on local infrastructures and remote cloud computing resources. As the project nears its completion in 2025, the e-Science Centre has become a mature and widely utilised tool within the community. The PeSC facilitates the registration of Data Collections, which can be either datasets or prediction models. Registration utilises a rich set of metadata based on the ISO 19156 standard on Observations and Measurements (O&M) and a Space Physics Ontology to define the applicable keywords. While these standards are based on XML, a wizard is also available for resource providers that makes the creation of these XML files easier and more automated. Users can either browse the registered Data Collections or search for them using free-text keywords or the Space Physics Ontology. Once found, they can interact with a Data Collection by either navigating to its external site or accessing it through an Application Programming Interface (API) directly from the e-Science Centre. Data Collections can be deployed either at the provider's premises or on EGI cloud computing resources. Data storage is facilitated by the EGI DataHub service. Besides Data Collections, the PeSC also supports the execution of workflows. Workflows can be composed of registered Data Collections and executed via APIs. User management in the PeSC is realised through the integration of the EGI Check-in federated identity management system and the Perun authorisation framework. While the PeSC is completely open to end users and anyone can access it without registration, the publication of Data Collections and workflows requires authentication and authorisation. Users belong to Institutions that own the Data Collections, and only members of a given Institution can manage the corresponding resources. Handling of user tickets is managed by the GGUS ticketing system, also provided by EGI and fully integrated with the PeSC. Currently, there are 57 Data Collections and two workflows registered in the PeSC, represented by 790 XML files describing institutions, individuals, projects, platforms, instruments, etc., and made available to the wider PITHIA research community. The presentation and a live demonstration will explain the above functionalities of the e-Science Centre, give examples of PITHIA Data Collections and workflows, and outline the next steps in the development process. The PeSC can be accessed at https://esc.pithia.eu/.
The advancement of EOSC promotes a more reproducible and verifiable research paradigm in response to the growing complexity and interdisciplinarity of modern research, which necessitates an unprecedented level of collaboration and data sharing. In line with this, federated data infrastructures, like the Blue-Cloud project, have been established, integrating marine data sources across Europe to catalyze advancements in marine science. Among these initiatives, the NEw REsearch Infrastructure Datacenter for EMSO (NEREIDE), developed by INGV and located near the Western Ionian Sea facility, is designed to drive data science forward through its Virtual Data Center (VDC) platform.
The core of NEREIDE's innovation is the Virtual Data Center, a managed environment where Data Scientists (DSs) can control cloud resources, including Virtual Machines (VMs), behind a dedicated, customizable Gateway with a public IP address. DSs have administrative privileges over their cloud segments, enabling them to autonomously create VMs and manage network components including firewalls, VPNs and advanced routing configurations. Meanwhile, the overarching management of the cloud infrastructure, including the physical data center, remains under the control of the Infrastructure Administrators. This dual structure ensures a balance between stringent infrastructural measures and the DSs' operational autonomy. VDCs provide a sophisticated, plug-and-play infrastructure that allows DSs to focus on their data services, bypassing the complexities of traditional physical data center management.
DSs want to work in dynamic and customizable environments where they manage substantial computational resources to tackle complex scientific questions. The main component of VDCs is OpenStack, an open-source “Infrastructure as a Service” platform that enables seamless scalability and flexibility in resource management, aided by workflow automation tools such as MAAS and Juju. This setup allows DSs to optimize computing and storage capacities according to project needs, essential for processing extensive datasets and performing complex simulations.
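As an illustration of the self-service networking a VDC delegates to Data Scientists, the sketch below creates a private network and a gateway router with generic openstacksdk calls (this is not NEREIDE's actual automation; the cloud, network names and CIDR are placeholders):

```python
import openstack

conn = openstack.connect(cloud="vdc")  # placeholder cloud name from clouds.yaml

# Private network and subnet managed by the Data Scientist.
net = conn.network.create_network(name="ds-private")
subnet = conn.network.create_subnet(
    network_id=net.id, name="ds-subnet", ip_version=4, cidr="192.168.10.0/24"
)

# Router acting as the VDC gateway towards the provider's external network.
external = conn.network.find_network("public")  # assumed external network name
router = conn.network.create_router(
    name="ds-gateway", external_gateway_info={"network_id": external.id}
)
conn.network.add_interface_to_router(router, subnet_id=subnet.id)
print(f"Gateway {router.name} connects {subnet.cidr} to {external.name}")
```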
VDCs also rely on Ceph, a distributed, software-defined storage engine, which offers flexible and scalable storage resources along with data security and integrity. This solution allows DSs to handle heavy scientific data loads and to efficiently manage multiple storage types.
Additionally, VDCs not only provide bare computational power and raw storage space to DSs, but also enable the integration of sophisticated arrays of tools, such as JupyterHub for interactive data analysis, ERDDAP for data distribution, ElasticSearch for data querying, etc. These tools underpin data science activities including data analysis, visualization, and collaborative research, thereby making complex data comprehensible and accessible across various scientific domains.
Essentially, with the introduction of a custom gateway, VDCs represent an evolution of the concept of virtualization, extending it beyond individual virtual machines to include an entire ready-to-use data center infrastructure.
The potential integration of VDC platforms within federated data infrastructures, like Blue-Cloud, suggests a future where seamless data and resource sharing could significantly boost the analytical and operational capacities within different scientific domains. These advancements foster new scientific applications and innovations, accelerating the achievement of open science goals and easing the work of data scientists.
Many scientific problems, such as environmental research or cancer diagnosis, require large data volumes, advanced statistical or AI models, and distributed computing resources.
To conduct their research more effectively, domain scientists need to reuse resources such as data, AI models, workflows, and services from different sources to address complex challenges. Sharing resources requires collaborative platforms that facilitate advanced data science research by offering discovery, access, interoperation and reuse of research assets, as well as the integration of all resources into cohesive observational, experimental, and simulation investigations with replicable workflows. Virtual Research Environments (VREs) have effectively supported such use cases, offering software tools and functional modules for research management. However, while effective for specific scientific communities, existing VREs often lack adaptability and require substantial time investment for incorporating external resources or custom tools. In contrast, many researchers and data scientists prefer notebook environments like Jupyter for their flexibility and familiarity.
To bridge this gap, we propose a VRE solution based on Jupyter: Notebook-as-a-VRE (NaaVRE).
The NaaVRE empowers users to construct functional blocks by containerizing cells within notebooks, organizing them into workflows, and overseeing the entire experiment cycle along with its generated data. These functional blocks, workflows, and data can then be shared within a common marketplace, fostering user communities and tailored Virtual Research Environments (VREs). Additionally, NaaVRE seamlessly integrates with external repositories, enabling users to explore, select, and reuse various assets such as data, software, and algorithms. Lastly, NaaVRE is designed to seamlessly operate within cloud infrastructures, offering users the flexibility and cost efficiency of utilizing computational resources as needed.
We showcase the versatility of NaaVRE by building several customized VREs that support specific scientific workflows across different communities. These include tasks such as extracting ecosystem structures from Light Detection and Ranging (LiDAR) data, monitoring bird migrations via radar observations, and analyzing phytoplankton species. Additionally, NaaVRE finds application in developing Digital Twins for ecosystems as part of the Dutch NWO LTER-LIFE project.
The key task led by EOSC Focus is to coordinate the ‘Enabling an operational, open and FAIR EOSC ecosystem (INFRAEOSC)’ projects under the Horizon Europe (HE) Programme. Technical coordination activities have evolved from annual coordination meetings with the European Commission and online meetings of the HE Technology Working Group in the EOSC Forum to the recent EOSC Winter School 2024, attended mainly by INFRAEOSC projects granted between 2021 and 2023. Among the six Opportunity Areas (OA) identified for collaboration across INFRAEOSC projects, OA4 was dedicated to User & Resource Environments, aligned with the EOSC Strategic Research and Innovation Agenda (SRIA). The participants in OA4 had interactive and hands-on workshops focusing on Virtual Research Environments (VREs), supported by the EGI and Galaxy initiatives and EOSC-A Task Force members, who are also the main stakeholders in INFRAEOSC projects highlighting the adoption of EGI resources in their scientific workflows. At the end of the intensive Winter School, the roadmap for advancing VREs was discussed as a contribution to making reproducible, FAIR-aligned Open Science the new normal. In summary, short-term, mid-term and long-term objectives are being updated and contributed across the participating projects in the run-up to the EOSC Symposium 2024 and the second edition of the Winter School.
EuroScienceGateway leverages a distributed computing network across 13 European countries, accessible via 6 national, user-friendly web portals, facilitating access to compute and storage infrastructures across Europe as well as to data, tools, workflows and services that can be customized to suit researchers’ needs.
In this talk we present the work carried out in the project during the last year, including the integration of EuroScienceGateway with new Identity Providers, Bring Your Own Compute (with Pulsar and ARC), Bring Your Own Storage (with S3 and Onedata) and improvements in the metascheduling of jobs across the distributed computing infrastructure.
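Since the EuroScienceGateway portals are based on the Galaxy platform, they can also be driven programmatically through the Galaxy API; the sketch below uses the BioBlend client (the portal URL, API key and file name are placeholders, and this is an illustration rather than project code):

```python
from bioblend.galaxy import GalaxyInstance

# Placeholder portal URL and API key; any Galaxy-based portal exposes the same API.
gi = GalaxyInstance(url="https://usegalaxy.eu", key="YOUR_API_KEY")

# Create a history, upload a dataset and list the workflows available to the user.
history = gi.histories.create_history(name="esg-demo")
gi.tools.upload_file("reads.fastq", history["id"])   # hypothetical local file
workflows = gi.workflows.get_workflows()
print([w["name"] for w in workflows])
```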
In the session, there will be presentations on various aspects of AI covering the entire MLOps lifecycle, from data acquisition, preprocessing, and labelling to inference pipelines. The projects AI4EOSC, iMagine, and interTwin will provide insights on key topics such as federated learning, distributed computing, experiment tracking, and drift detection. Use cases will demonstrate applications from composing inference pipelines to the application of Graph Neural Network and 3DGAN. The session will open with a presentation and discussion on the EGI AI roadmap, outlining future directions for advancing the AI capabilities of the EGI infrastructure.
Aquatic ecosystems are vital in regulating climate and providing resources, but they face threats from global change and local stressors. Understanding their dynamics is crucial for sustainable use and conservation. The iMagine AI Platform offers a suite of AI-powered image analysis tools for researchers in aquatic sciences, facilitating a better understanding of scientific phenomena and applying AI and ML for processing image data.
The platform supports the entire machine learning cycle, from model development to deployment, leveraging data from underwater platforms, webcams, microscopes, drones, and satellites, and utilising distributed resources across Europe. With a serverless architecture and DevOps approach, it enables easy sharing and deployment of AI models. Four providers within the pan-European EGI federation power the platform, offering substantial computational resources for image processing.
Five use cases focus on image analytics services, which will be available to external researchers through Virtual Access. Additionally, three new use cases are developing AI-based image processing services, and two external use cases are kickstarting through recent Open Calls. The iMagine Competence Centre aids use case teams in model development and deployment. The talk will provide an overview of the development status of the use cases and offer insights on the platform.
The iMagine platform utilizes AI-driven tools to enhance the processing and analysis of imaging data in marine and freshwater research, supporting the study of crucial processes for ocean, sea, coastal, and inland water health. Leveraging the European Open Science Cloud (EOSC), the project provides a framework for developing, training, and deploying AI models. To effectively achieve the objectives of the project, about twelve use cases in different areas of aquatic science are collaborating with the providers of the iMagine AI platform. This collaboration has yielded valuable insights and practical knowledge. Thoroughly reviewing the existing solutions, from data acquisition and preprocessing to the final stage, results in a trained model being provided as a service to the users.
Within the framework of iMagine, we outline various tools, techniques, and methodologies appropriate for image processing and analysis in aquatic science. In this work, we delve into the best AI-based solutions for image processing, drawing on the extensive experience and knowledge we have gained over the course of the iMagine project. Clear guidelines for annotating images, coupled with comprehensive training and accessible tools, ensure consistency and accuracy in labeling. We therefore evaluate annotation tools such as BIIGLE, Roboflow, Label Studio, CVAT, and Labelbox based on their different features, along with real-time video streaming tools.
Preprocessing techniques and quality control measures are discussed to enhance data quality in aquatic datasets, aiming to identify and address issues such as blurriness, glare, or artifacts. Preparation of training datasets and their publishing in a data repository with the relevant metadata is assessed. Following this, an overview of deep learning models, including convolutional neural networks, and their applications in classification, object detection, localization, and segmentation methods is provided.
Performance metrics and evaluation methods, along with experiment tracking tools such as TensorBoard, MLflow, Weights & Biases, and Data Version Control (DVC), are discussed for the purpose of reproducibility and transparency. Ground truth data is utilized to validate and calibrate image analysis algorithms, ensuring accuracy and reliability. Furthermore, AI model drift tools, data biases, and fairness considerations in aquatic science models are discussed, concluding with case studies and discussions on challenges and limitations in AI applications for aquatic sciences.
By embracing these best practices, providers of image collections and image analysis applications in aquatic sciences can enhance data quality, promote reproducibility, and facilitate scientific progress in this field. A collaboration of research infrastructures and IT experts within the iMagine framework results in the development of best practices for delivering image processing services. The project establishes common solutions in data management, quality control, performance, integration, and FAIRness across research infrastructures, thereby promoting harmonization and providing input for best practice guidelines.
Finally, iMagine shares its developments with other leading projects such as AI4EOSC and Blue-Cloud 2026 to achieve optimal synergy and wider uptake of the iMagine platform and best practices by the larger aquatic and AI research communities.
The release of oil into marine environments can result in considerable harm to coastal ecosystems and marine life, while also disrupting various human activities. Despite advances in maritime safety, there has been a noticeable uptick in spill occurrences throughout the Mediterranean basin, as documented by the European Maritime Safety Agency's Cleanseanet program. Precisely predicting the movement and transformation of oil slicks is crucial for assessing their impact on coastal and marine regions. Numerical modeling of oil spills plays a pivotal role in understanding their unseen consequences and addressing observational gaps. However, these models often rely on manually selected simulation parameters, which can affect result accuracy. We propose an innovative approach integrating satellite observations, the Medslik-II oil spill model, and Machine Learning techniques to optimize model parameterization, thereby enhancing the accuracy of oil numerical simulations. Utilizing a Bayesian Optimization Framework, the study seeks the optimal configuration within the parameter space for which model simulations best represent actual oil spill observations.
Validation of the proposed approach is performed using a real case of an oil spill in the Baniyas area (Syria) in 2021. Preliminary evaluations of this framework show promising results, suggesting that combining physics-based and data-driven methodologies can lead to more precise risk assessment and planning for oil spill incidents. Furthermore, the resulting workflow represents an integrated solution for optimal and automated selection of model simulation parameters.
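As a minimal sketch of such a Bayesian optimisation loop (not the project's actual code; the two tunable parameters and the mismatch function are hypothetical stand-ins for a Medslik-II run compared against a satellite-observed slick), using scikit-optimize:

```python
from skopt import gp_minimize
from skopt.space import Real

# Hypothetical tunable parameters: horizontal diffusivity and wind drift factor.
space = [Real(1.0, 100.0, name="horizontal_diffusivity"),
         Real(0.0, 0.05, name="wind_drift_factor")]

def mismatch(params):
    """Run the oil-spill model with `params` and score it against the observed slick.
    A placeholder quadratic stands in for the real model-vs-satellite comparison."""
    diffusivity, drift = params
    return (diffusivity - 20.0) ** 2 / 400.0 + (drift - 0.03) ** 2 * 1e4

result = gp_minimize(mismatch, space, n_calls=30, random_state=0)
print("Best parameters:", result.x, "mismatch:", result.fun)
```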
The work is being developed within the framework of the EGI coordinated iMagine project, which focuses on a portfolio of "free at the point of use" image datasets, high-performance image analysis tools empowered with Artificial Intelligence (AI), and best practice documents for scientific image analysis.
Cloud computing has revolutionized how we store, process, and access data, offering flexibility, scalability, and cost-effectiveness. On the other hand, High Performance Computing (HPC) provides unparalleled processing power and speed, making it an essential tool for complex computational tasks. However, leveraging these two powerful technologies together has been a challenge.
In recent years, Artificial Intelligence (AI) and Machine Learning (ML) have grown exponentially, with many software tools being developed for the Cloud. Despite this, the potential of integrating these tools with HPC resources has yet to be explored.
Containers have revolutionized application delivery due to their lightweight and versatility. They are standard in cloud-native applications, but new containerization technologies have emerged specifically for HPC environments.
Our team presents a solution for seamlessly integrating Cloud and HPC environments using two essential tools: OSCAR, a Kubernetes-based serverless event-driven platform where the user can easily create services for running jobs within a container, and interLink, a middleware that allows the offloading of tasks created in a Kubernetes cluster to an HPC cluster.
The OSCAR-interLink integration together with iTwinAI, a framework for advanced AI/ML workflows, allows for AI workloads, such as 3DGAN inference, to take advantage of the resources available in HPC, including GPU processing power.
Our results showcase a successful use case, integrating dCache, Apache NiFi, OSCAR, interLink, and iTwinAI, based on a 3DGAN for particle simulation, demonstrating the benefits of the approach by exploiting remote GPUs from an HPC facility from an OSCAR cluster running on a Cloud infrastructure.
This work was supported by the project “An interdisciplinary Digital Twin Engine for science’’ (interTwin) that has received funding from the European Union’s Horizon Europe Programme under Grant 101058386. GM would like to thank Grant PID2020-113126RB-I00 funded by MCIU/AEI/10.13039/501100011033. GM and SL would like to thank project PDC2021-120844-I00 funded by MCIU/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR.
In recent years, the escalation of Extreme Weather Events (EWEs), including storms and wildfires, due to Climate Change has become a pressing concern. This exacerbation is characterised by the increased intensity, frequency and duration of such events.
Machine Learning (ML) presents a promising avenue for tackling the challenges associated with predicting global wildfire burned areas. It offers sophisticated modelling techniques capable of estimating EWEs in a cost-effective manner. ML-based algorithms not only assist in detection and prediction but also provide robust data-driven tools for scientists, policymakers, and the general public. Yet, the implementation of such solutions requires a comprehensive infrastructure including data acquisition systems, preprocessing modules, computing platforms, and visualisation tools.
A relevant aspect on which the interTwin project, funded by the EU, focuses is the development of a Digital Twin for EWE analysis. This Digital Twin harnesses artificial neural networks to model the non-linear relationships between various climate, geomorphological and human factors and the occurrence of EWEs, thereby enabling insights from historical data and projections for future events.
In particular, within the interTwin project, our work focuses on modelling and predicting global wildfire burned areas, together with tropical cyclone detection and tracking. We aim to establish a resilient system for timely prediction and for EWE assessment and analysis on projection scenarios.
The Digital Twin on wildfires prediction integrates data and ML models to provide a proactive approach to the fire danger assessment. These efforts underscore the importance of leveraging cutting-edge technologies to address the challenges posed by Climate Change-induced EWEs, ultimately fostering informed actions and resilient communities.
Researchers exploiting artificial intelligence (AI) techniques like machine learning and deep learning require access to specialized computing and storage resources. Addressing this need, the AI4EOSC project is providing an easy-to-use suite of services and tools within the European Open Science Cloud (EOSC). This platform aims to facilitate the development of AI models, including federated learning, zero-touch deployment of models, MLOps tools and composite AI pipelines, among others.
In this presentation, we will provide an exploration of our platform's high-level architecture, with a particular emphasis on meeting the diverse needs of users. We will give an overview of the frameworks and technologies that lay the foundations of our implementation. Through real-world examples coming from active projects and communities (including the notable involvement of iMagine) we will illustrate how researchers are effectively leveraging the platform to advance their AI initiatives. This showcase serves not only to highlight the capabilities of the AI4EOSC project but also to underscore its practical utility and impact within the scientific community.
Managing and monitoring AI models in production, also known as machine learning operations (MLOps), has become essential nowadays, resulting in the need for highly reliable MLOps platforms and frameworks. In the AI4EOSC project, in order to provide our customers with the best available ones, we reviewed the field of open-source MLOps and examined the platforms that serve as the backbone of machine learning systems. It should be noted that experiment tracking can improve the organisation and analysis of machine learning results, as well as team collaboration and knowledge sharing. From workflow orchestration to drift detection, every aspect of the machine learning lifecycle was reviewed.
Based on that study, and in order to aid scientists in their goal of achieving high model standards and implementing MLOps practices, we have deployed the MLflow platform for AI4EOSC and iMagine users, offer the Frouros drift-detection Python library, and are developing a monitoring system for logging drift-detection runs. The provided MLflow platform features a central remote tracking server, so that every AI experiment run, whether on the AI4EOSC platform or on any other resource, can be individually tracked and shared with other registered users if desired. The Frouros library combines classical and more recent algorithms for both concept and data drift detection.
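As a minimal sketch of logging an experiment against a remote MLflow tracking server (the server URL, parameter and metric values are placeholders; the drift check uses a plain Kolmogorov-Smirnov test from SciPy to illustrate the idea, not the Frouros API itself):

```python
import mlflow
import numpy as np
from scipy.stats import ks_2samp

# Placeholder remote tracking server, e.g. the central one offered by the platform.
mlflow.set_tracking_uri("https://mlflow.example.org")
mlflow.set_experiment("demo-experiment")

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.001)
    mlflow.log_metric("val_accuracy", 0.93)        # placeholder metric

    # Toy data-drift check: compare a reference feature sample with new data.
    reference = np.random.normal(0.0, 1.0, 1000)
    production = np.random.normal(0.3, 1.0, 1000)  # shifted distribution
    statistic, p_value = ks_2samp(reference, production)
    mlflow.log_metric("ks_p_value", p_value)       # a low p-value suggests drift
```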
In this contribution, the global MLOps landscape of the continuously growing AI world will be presented together with our practical implementation in the AI4EOSC project and lessons learned from our users.
In recent years, Large Language Models (LLMs) have become powerful tools in the machine learning (ML) field, including features of natural language processing (NLP) and code generation. Employing these tools often involves complex processes, from interacting with a variety of providers to fine-tuning models so that they meet the project’s needs.
This work explores in detail the use of MLflow [1] for deploying and evaluating two notable LLMs: Mixtral [2] from Mistral AI and Databricks Rex (DBRX) [3] from Databricks, both available as open-source models on the Hugging Face portal. The focus lies on enhancing inference efficiency, specifically emphasising the fact that DBRX has better throughput than traditional models of similar scale.
Hence, MLflow offers a unified interface for interacting with various LLM providers through the Deployments Server (previously known as “MLflow AI Gateway”) [4], which streamlines the deployment process. Further, with standardised evaluation metrics, we present a comparative analysis between Mixtral and DBRX.
MLflow's LLM Evaluation tools are designed to address the unique challenges of evaluating LLMs. Unlike traditional models, LLMs often lack a single ground truth, making their evaluation more complex.
MLflow allows customers to use a bundle of tools and features specifically tailored to the difficulties that arise from integrating LLMs in a comprehensive manner. The MLflow Deployments Server serves as the central location, eliminating the need to juggle multiple provider APIs and simplifying integration with self-hosted models.
We plan to implement this solution using the MLflow tracking server deployed in the AI4EOSC project [5] as a showcase.
In conclusion, this contribution seeks to offer insights into the efficient deployment and evaluation of LLMs using MLflow, with a focus on optimising inference efficiency through a unified user interface. With MLflow capabilities, developers and data scientists can navigate through integrating LLMs into their applications easily and effectively, unlocking their maximum potential for revolutionary AI-driven solutions.
[1] https://mlflow.org
[2] https://huggingface.co/mistralai
[3] https://huggingface.co/databricks
[4] https://mlflow.org/docs/latest/llms/index.html
[5] https://ai4eosc.eu
With the expansion of applications and services based on machine learning (ML), the obligation to ensure data privacy and security has become increasingly important in recent times. Federated Learning (FL) is a privacy-preserving machine learning paradigm introduced to address concerns related to data sharing in centralized model training. In this approach, multiple parties collaborate to jointly train a model without disclosing their individual data.
There are various aggregation algorithms for aggregating the local model updates in Federated Learning, e.g. FedAvg, FedProx, Scaffold, and Ditto, designed to overcome the challenges posed by the fact that data in FL environments can be unbalanced, non-independent, or non-identically distributed (non-IID). Various communication workflows also exist, such as Scatter and Gather, Cyclic Learning and Swarm Learning. In addition, various security enhancements, including Differential Privacy (DP), Homomorphic Encryption (HE), and secure model aggregation, have been developed to address privacy concerns. Key considerations when setting up an FL process involve selecting the framework that best meets the specific requirements of the task in terms of aggregation algorithm, workflow, and security enhancements.
To help researchers make informed decisions, within the AI4EOSC project we provide a comprehensive evaluation and comparison of the two most widely used frameworks for federated learning, NVFlare and Flower, whose developers have also recently announced a collaboration.
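For illustration, a minimal Flower server running FedAvg looks roughly as follows (a sketch only, not the AI4EOSC deployment; the address and round count are placeholders):

```python
import flwr as fl

# FedAvg aggregation, requiring at least two participating clients per round.
strategy = fl.server.strategy.FedAvg(
    min_fit_clients=2,
    min_available_clients=2,
)

# Start the federation server; clients connect to this placeholder address
# and train on their local data without ever sharing it.
fl.server.start_server(
    server_address="0.0.0.0:8080",
    config=fl.server.ServerConfig(num_rounds=5),
    strategy=strategy,
)
```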
To compare the frameworks in terms of the features they offer, we develop a deep learning solution for the Detection of thermal anomalies use case of AI4EOSC. We use this real-world case study to demonstrate the practical impact and performance of various FL aggregation algorithms, workflows, and security enhancements, and their implementation in each FL framework.
We highlight the different features and capabilities that these frameworks bring to FL settings to provide a better understanding of their respective strengths and applications. The Flower server was seamlessly integrated into the AI4EOSC dashboard, which simplified our experimentation process. All experiments are monitored and tracked using the MLflow instance provided by the AI4EOSC project. Our evaluation included analyses of the convergence speed of the various aggregation methods offered by these frameworks, global model accuracy, communication overhead in various workflows, and the privacy-preserving functionalities of both frameworks, such as HE and DP. Furthermore, we examine the recently announced collaboration between these two frameworks to identify synergies and potential improvements in federated learning methods for thermal bridge detection.
The AI4EOSC project will deliver an enhanced set of services for the development of Artificial Intelligence (AI), Machine Learning (ML) and Deep Learning (DL) models and applications for the European Open Science Cloud (EOSC). One of the components of the platform is the workload management system that manages execution of compute requests on different sites on EGI Federated Cloud.
To be able to manage the distributed compute resources in a simple and efficient way, a distributed computing platform must be created. We based this platform on the service mesh technology paradigm. The platform consists of three parts:
This platform is a unified, reliable, distributed computing system spanning different sites of the EGI Federated Cloud. It resembles the Kubernetes platform; on the other hand, HashiCorp Consul and Nomad are simpler, lighter and more flexible than Kubernetes, and together they form a completely distributed and fault-tolerant platform for reliable job execution.
Marine and coastal ecosystems (MCEs) play a vital role in human well-being, contributing significantly to Earth’s climate regulation and providing ecosystem services like carbon sequestration and coastal protection against sea level rise. However, they face serious threats, including those deriving from the interaction between multiple human stressors (e.g. pollution) and pressures more related to climate change (CC) (e.g. rising sea temperature, ocean acidification, etc.). The complex interplay of these pressures is escalating cumulative impacts on MCEs, jeopardizing their ability to provide ecosystem services and compromising their health and resilience. Machine Learning (ML), using different types of algorithms such as Random Forest (RF) or Support Vector Machine (SVM), can be an effective tool to evaluate changes in environmental and ecological status against multiple pressures, but these methods often overlook the spatial dependence of pressure effects. The examination of spatial relationships among anthropogenic and CC-related pressures is facilitated by Graph Neural Networks (GNNs), which explicitly model the relationships between data points and hence offer a solution to the issue of neglected spatial dependencies in the predictions. Based on these considerations, the main aims of this study are to explore the application of GNN-based models to evaluate the impact of pressures on seagrass ecosystems in the Italian Seas and to compare these methods with the models usually employed in this field (i.e., RF, SVM and Multi-Layer Perceptron (MLP)). The methodology involves compiling a comprehensive dataset encompassing key variables influencing seagrass health, including several endogenic and exogenic pressures (e.g., nutrient concentrations, temperature, salinity). Geospatial data from open-source platforms (e.g., Copernicus, EMODnet) are processed and synthesized into a 4 km raster grid. The study area was defined based on 2017 seagrass coverage, considering a bathymetry layer up to 50 meters. The seagrass distribution in each pixel of the study area was considered, categorizing each pixel as a presence or absence pixel. Experiments include implementing and evaluating different GNN architectures (i.e., Graph Convolutional Networks (GCNs) and Graph Attention Networks (GATs)) alongside traditional ML models. To construct the graph for the GNNs, each pixel in the study area, identified by latitude and longitude, is a node. The feature vectors associated with each node represent the pressures. Nodes are connected to their nearest neighboring pixels, forming a spatially informed graph structure. Model performance is assessed using accuracy and F1-score metrics, with GNNs showing the highest F1-score in detecting the presence of seagrasses. Qualitative analysis reveals that models lacking spatial context in their predictions tend to exhibit errors attributable to the isolated consideration of individual pixels. For instance, these models incorrectly predict the absence of seagrass in regions surrounded by meadows, or vice versa. In contrast, GNNs predominantly misclassify pixels along seagrass patch boundaries. While spatial context proves invaluable for prediction accuracy, challenges stemming from the limited availability of high-resolution datasets impede a comprehensive exploration of temporal dynamics within the seagrass ecosystem. Future research aims to transition to a local scale, gathering high-resolution data.
This facilitates the incorporation of temporal dimensions and the consideration of relevant physical processes, such as ocean currents or extreme events, influencing ecosystem dynamics within the graph.
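A minimal sketch of the spatially informed graph construction and GCN classifier described above, assuming PyTorch Geometric (with torch-cluster providing knn_graph); the coordinate, pressure and label arrays below are random placeholders, not the study's actual data:

import torch
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv, knn_graph

# Placeholder inputs: N pixels with lon/lat coordinates, eight pressure
# features and a presence/absence label each (randomly generated here).
coords = torch.rand(1000, 2)
pressures = torch.rand(1000, 8)
labels = torch.randint(0, 2, (1000,))

# Connect each pixel to its k nearest neighbours to encode spatial context.
edge_index = knn_graph(coords, k=8)
graph = Data(x=pressures, edge_index=edge_index, y=labels)

class SeagrassGCN(torch.nn.Module):
    def __init__(self, in_dim, hidden=32, classes=2):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, classes)

    def forward(self, data):
        h = F.relu(self.conv1(data.x, data.edge_index))
        return self.conv2(h, data.edge_index)

model = SeagrassGCN(in_dim=pressures.shape[1])
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
for epoch in range(50):
    optimizer.zero_grad()
    loss = F.cross_entropy(model(graph), graph.y)  # presence/absence per pixel
    loss.backward()
    optimizer.step()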
Users may have difficulty finding the information they need in product documentation when it is spread over many web pages or email forums. We have developed and tested an AI-based tool that helps users find answers to their questions. Docu-bot uses a Retrieval-Augmented Generation (RAG) approach to generate answers: it takes GitHub or open GitLab repositories with documentation as its source of information, and zip files with documentation in plain text or Markdown format can also be used as input. A sentence-transformer model retrieves relevant passages and a Large Language Model generates the answers.
Different LLMs can be used. For performance reasons, in most tests we use the model Mistral-7B-Instruct-v0.2, which fits into the memory of an Nvidia T4 GPU. We have also tested a larger model, Mixtral-8x7B-Instruct-v0.1, which requires more GPU memory, available for example on Nvidia A100, A40 or H100 GPU cards. Another possibility is to use the API of OpenAI models like gpt-3.5-turbo, but users have to provide their own API access key to cover the expenses.
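To illustrate the retrieval step of such a RAG pipeline, the following hedged sketch uses the sentence-transformers library to embed documentation chunks and select the most relevant ones for a question; the embedding model name, the example chunks and the final LLM call are assumptions, not Docu-bot's actual code:

from sentence_transformers import SentenceTransformer, util

# Placeholder documentation chunks (in practice: markdown/plain-text files
# pulled from a GitHub/GitLab repository or an uploaded zip archive).
docs = [
    "To renew a proxy certificate, run the renewal command described here.",
    "Jobs are submitted to the cluster with the command-line client.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = embedder.encode(docs, convert_to_tensor=True)

def retrieve(question, k=2):
    # Rank documentation chunks by semantic similarity to the question.
    q_emb = embedder.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, doc_emb, top_k=k)[0]
    return [docs[h["corpus_id"]] for h in hits]

question = "How do I submit a job?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this documentation:\n{context}\n\nQuestion: {question}"
# The prompt is then passed to the chosen LLM (e.g. Mistral-7B-Instruct served
# locally, or the OpenAI API) to generate the final answer.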
Reproducibility is a key component of open science, ensuring that scientific findings are reliable, transparent, and credible.
This session will showcase services and tools to support reproducibility and open science in EGI and EOSC, covering access to reusable data, producing provenance information, delivering reproducible computing environments and facilitating the comparison of results in sets of experiments.
The ReproVIP project aimed at evaluating and improving the reproducibility of scientific results obtained with the Virtual Imaging Platform (VIP) in the field of medical imaging. ReproVIP focused on a reproducibility level ensuring that the code produces the same result when executed with the same set of inputs and that an investigator is able to re-obtain the published results. We investigated reproducibility at three levels: (i) the code itself, in particular different versions of the same code [Lig2023]; (ii) the execution environment, such as the operating system and code dependencies [Vila2024], parallel executions and the use of distributed infrastructures; and (iii) the exploration process, from the beginning of the study until the final published results [Vila2023].
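As an illustration of the first level (same inputs, same results), a minimal sketch that compares two runs of a pipeline file by file; the directory layout is hypothetical and this is not ReproVIP's actual tooling:

import hashlib
from pathlib import Path

def checksum(path: Path) -> str:
    """SHA-256 digest of a single output file."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def runs_identical(run_a: Path, run_b: Path) -> bool:
    """True if two result directories contain bitwise-identical files."""
    files_a = sorted(p.relative_to(run_a) for p in run_a.rglob("*") if p.is_file())
    files_b = sorted(p.relative_to(run_b) for p in run_b.rglob("*") if p.is_file())
    if files_a != files_b:
        return False
    return all(checksum(run_a / f) == checksum(run_b / f) for f in files_a)

# Compare two executions of the same pipeline on the same inputs.
print(runs_identical(Path("results/run1"), Path("results/run2")))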
Within this project, we conducted different studies corresponding to these three reproducibility levels. Some of them were conducted on the EGI infrastructure, in production conditions, others on the Grid’5000 research infrastructure. Grid’5000 is a large-scale testbed deployed in France (and a member of the SLICES RI) for experiment-driven research in all areas of computer science. It provides access to a large amount of highly reconfigurable and controllable resources, which allowed us to adopt solutions available on EGI, such as CVMFS.
Within ReproVIP, we also enriched the ecosystem around VIP with tools facilitating the assessment of the reproducibility of scientific results: a reproducibility dashboard, a data management platform and a continuous integration tool. The tools are interconnected and linked to VIP, providing researchers with an integrated end-to-end solution to improve the reproducibility of scientific results.
The talk will present the studies and tools produced within ReproVIP, highlighting the findings and lessons learnt during the project.
References:
[Lig2023] Morgane Des Ligneris, Axel Bonnet, Yohan Chatelain, et al., “Reproducibility of Tumor Segmentation Outcomes with a Deep Learning Model,” in International Symposium on Biomedical Imaging (ISBI), Cartagena de Indias, Colombia, Apr. 2023
[Vila2023] Gaël Vila, Axel Bonnet, Fabian Chauveau, et al., “Computational Reproducibility in Metabolite Quantification Applied to Short Echo Time in vivo MR Spectroscopy,” in International Symposium on Biomedical Imaging (ISBI), Cartagena de Indias, Colombia, Apr. 2023
[Vila2024] Gaël Vila, Emmanuel Medernach, Inés Gonzalez, et al., “The Impact of Hardware Variability on Applications Packaged with Docker and Guix: a Case Study in Neuroimaging,” Submitted at https://acm-rep.github.io/2024/, Feb. 2024.
The Cloud Computing Platform (CCP), developed under the aegis of D4Science [1], an operational digital infrastructure initiated 18 years ago with funding from the European Commission, represents a significant advancement in supporting the FAIR (Findable, Accessible, Interoperable, and Reusable) principles, open science, and reproducible data-intensive science. D4Science has evolved to harness the "as a Service" paradigm, offering web-accessible Virtual Laboratories [2] that have also been instrumental in facilitating science collaborations [3]. These laboratories simplify access to datasets whilst concealing underlying complexities, and include functionalities such as a cloud-based workspace for file organisation, a platform for large-scale data analysis, a catalogue for publishing research results, and a communication system rooted in social networking practices.
At the core of the platform for large-scale data analysis, CCP promotes widespread adoption of microservice development patterns, significantly enhancing software interoperability and composability across varied scientific disciplines. CCP introduces several innovative features that streamline the scientific method lifecycle, including a method importer tool, lifecycle tracking, and an executions monitor with real-time output streaming. These features ensure that every step—from creation, through execution, to sharing and updating—is meticulously recorded and readily accessible, thus adhering to open science mandates. CCP supports a broad range of programming languages through automatic code generation, making it effortlessly adaptable to diverse scientific requirements. The robust support for containerisation, utilising Docker, simplifies the deployment of methods on scalable cloud infrastructures. This approach not only reduces the overhead of traditional virtualisation but also enhances the execution efficiency of complex scientific workflows. The platform’s RESTful API design further facilitates seamless interactions between disparate software components, promoting a cohesive ecosystem for method execution and data analysis.
Significantly, CCP embodies the principles of Open Science by ensuring that all scientific outputs are transparent, repeatable, and reusable. Methods and their executions are documented and shared within the scientific community, enhancing collaborative research and enabling peers to verify and build upon each other's work. The platform’s design also includes comprehensive provenance management, which meticulously tracks the origin and history of data, thus providing a record for scientific discoveries.
CCP serves as a platform for large-scale data analysis for (i) the EOSC Blue-Cloud2026 project VRE, which, by leveraging digital technologies for ocean science, utilises CCP to perform large-scale collaborative data analytics, significantly benefiting from CCP's robust, scalable cloud infrastructure and tools designed for extensive data processing and collaboration, and (ii) the SoBigData Research Infrastructure, which, with its focus on social data mining and Big Data analytics, integrates CCP to facilitate an ecosystem for ethical, scientific discoveries across multiple dimensions of social life.
Keywords: Open Science, Cloud Computing, FAIR Principles, Reproducibility, Data-intensive Science, Containerisation, Microservices
Open Science plays an important role to fully support the whole research process, which also includes addressing provenance and reproducibility of scientific experiments. Indeed, handling provenance at different levels of granularity and during the entire analytics workflow lifecycle is key for managing lineage information related to large-scale experiments in a flexible way as well as enabling reproducibility scenarios, which in turn foster re-usability, one of the FAIR guiding data principles.
This contribution focuses on a multi-level approach applied to climate analytics experiments as a way to manage provenance information in a more structured and multi-faceted fashion, thus allowing scientists to explore the provenance space across multiple dimensions and get coarse- or fine-grained information according to their needs. More specifically, the talk introduces the yProv multi-level provenance service, a new core component within an Open Science-enabled research data lifecycle, along with its design, main features and graph-based data model.
The service can be deployed on several platforms, including cloud infrastructures: indeed, thanks to the recent integration in the Infrastructure Manager Dashboard (https://im.egi.eu/im-dashboard), non-advanced users can easily launch the deployment of a yProv service instance on top of a wide range of cloud providers.
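yProv's own API is not reproduced here; as a hedged illustration of the kind of lineage such a service records, the following sketch uses the generic W3C PROV data model via the Python prov package, with invented identifiers:

from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("ex", "http://example.org/climate/")

dataset = doc.entity("ex:cmip6-subset")        # input data
index = doc.entity("ex:heatwave-index")        # produced result
run = doc.activity("ex:workflow-run-42")       # one experiment execution
scientist = doc.agent("ex:researcher")

doc.used(run, dataset)                         # the run consumed the dataset
doc.wasGeneratedBy(index, run)                 # ... and produced the index
doc.wasAssociatedWith(run, scientist)          # ... under the researcher's control

print(doc.serialize(indent=2))                 # PROV-JSON by default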
This work is partially funded by the EU InterTwin project (Grant Agreement 101058386), the EU Climateurope2 project (Grant Agreement 101056933) and partially under the National Recovery and Resilience Plan (NRRP), Mission 4 Component 2 Investment 1.4 - Call for tender No. 1031 of 17/06/2022 of Italian Ministry for University and Research funded by the European Union – NextGenerationEU (proj. nr. CN_00000013).
DESY is one of the largest synchrotron facilities in Europe and as such is involved with a large number of different scientific fields. Among these are High Energy and Astroparticle Physics, Dark Matter research, Physics with Photons and Structural Biology, which generate huge amounts of data. This data is valuable and mostly handled in accordance with domain- and community-specific policies, which ensure that embargo periods, ownership and license restrictions are respected. Nowadays there is a push towards opening the data up to the public, as requested by funding agencies and scientific journals. In order to support this push, DESY IT is implementing and deploying solutions that support and enable the publishing of Open Data sets for the scientific community. These solutions will make the Open Data easily findable, browsable and reusable for further analyses by the long tail of science, especially when its participants are not supported by large e-infrastructures.
With Open and FAIR data principles in mind, we will provide a metadata catalogue to make the data findable. The accessibility aspect is covered by making use of federated user accounts via eduGAIN, HelmholtzID, NFDI and later EOSC-AAI, which gives community members access to the data with their institutional accounts. The interoperability of the data sets is ensured by establishing the use of commonly accepted data formats such as HDF5, specifically NeXuS and openPMD, wherever possible. Providing the technical and scientific metadata will finally make the open data sets reusable for subsequent analyses and research. In order to address the spirit of sharing in Open Science, the blueprint for our Open Data solution will be shared with others through HIFIS first and, upon successful evaluation, also with the wider community.
Our prototype will initially consist of three connected solutions: the metadata catalogue SciCat, the storage system dCache and the VISA (Virtual Infrastructure for Scientific Analysis) portal. Scientific data is placed in a specific directory on dCache together with its metadata, which is ingested into SciCat to be available for access and download. Here, it is crucial to ensure that the scientific metadata stored in the catalogue is harmonized among similar experiments. In order to achieve this, we are devising a method of creating experiment-specific metadata schemata against which metadata will be validated before ingestion. Simultaneously, a subset of the technical and scientific metadata will be integrated into the VISA portal so that scientists can access the dataset within it. VISA is a portal that allows creating virtual machines with pre-installed analysis tools and the selected data sets already mounted, accessible from a web browser, forming a consistent environment that allows easy access to data and tools.
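As a hedged sketch of the validation step, assuming the experiment-specific schemata are expressed as JSON Schema (the field names below are illustrative, not DESY's actual schema):

import jsonschema

# Hypothetical experiment-specific schema: required fields and their types.
experiment_schema = {
    "type": "object",
    "required": ["beamline", "sample", "photon_energy_eV"],
    "properties": {
        "beamline": {"type": "string"},
        "sample": {"type": "string"},
        "photon_energy_eV": {"type": "number", "minimum": 0},
    },
}

metadata = {"beamline": "P11", "sample": "lysozyme", "photon_energy_eV": 12000}

# Raises jsonschema.ValidationError if the metadata does not conform,
# i.e. the dataset would be rejected before ingestion into the catalogue.
jsonschema.validate(instance=metadata, schema=experiment_schema)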
During the talk, we will present the architecture of the system, its individual components as well as their interplay. The focus will be the harmonization of the metadata schemata as well as the roadmap for the development of tooling and processes for ingestion and validation of the ingested metadata.
In this presentation we review the available services in the EGI portfolio that assist researchers with the reproducibility of computational experiments. We will describe how to use EGI Replay, EGI DataHub, EGI AppDB and Infrastructure Manager to create, run and share scientific workflows publicly.
As a subset of open science, open infrastructure has enormous potential to alter research and education. This is because open infrastructure is flexible, cost-effective, and scalable. Invest in Open Infrastructure (IOI) has created the Infra Finder tool to help activists make the case for adopting open infrastructure in their local communities. The tool helps to level the playing field between organisations with little resources and those with analyst teams by enhancing openness and the ease with which information on open infrastructure services can be obtained. To encourage trust, the tool presents information that goes beyond traditional market indicators. Governance frameworks, regulations, pricing, user contributions and community, technological affordances, and interoperability are just a few examples. Infra Finder enables users to discover, filter, and compare services that complement their existing offers, empowering all users and cultivating a sense of belonging.
Infra Finder, currently housing 57 infrastructures, is on a trajectory of growth. It aims to expand its platform to include a larger number of open infrastructures, thereby enhancing their visibility and subsequent adoption across the continent. This growth promises exciting opportunities and advancements in the field of open infrastructure.
The IOI's State of Open Infrastructure report was created by methodically curating and analysing data from the 57 infrastructures listed in Infra Finder. The report contains comprehensive data and analysis, providing a unique and insightful view of open infrastructure characteristics and the concerns affecting them internationally. A chapter is dedicated to studying the global policy framework for open research and infrastructure, bringing significant insights to the audience.
Trust is a critical pillar of information security, and mastering its concepts is essential for protecting against security threats and ensuring the confidentiality, integrity, and availability of information and resources.
By comprehensively understanding the components of trust, including authentication, authorization, trust models, trust boundaries, and trustworthiness, it is possible to effectively mitigate security risks and safeguard valuable information assets.
We present updates about the mytoken service, giving a short overview of the mytoken service, its idea and concept, and then focusing on the newest developments and future work.
These include the new notification feature, which allows users to receive email notifications for various events, e.g. to be notified before a mytoken expires so that a new one can easily be created. Mytoken expirations can also be integrated into a user's calendar application.
The mytoken software offers a central service to obtain OpenID Connect Access Tokens in an easy but secure way for extended periods of time and across multiple devices. In particular, mytoken was developed to provide OIDC Access Tokens to long-running compute jobs.
Mytokens can be restricted through the concepts of "restrictions" and "capabilities", which allow very fine-grained access rights - much more detailed and flexible than plain OIDC tokens would allow.
Public instances are available at https://mytoken.data.kit.edu and https://mytok.eu; the latter runs in a secure credential store environment.
To conduct research and foster innovation, collaboration and resource sharing have become the primary focus for research communities and national e-infrastructures. It can reduce duplication of work, leading to reduced costs, while opening possibilities for achieving common goals by combining data, skills, and efforts. However, offering such functionalities brings complex challenges to building an environment that is secure and easy to administer. That’s where the authentication and authorization infrastructure (AAI) steps forward.
At CESNET, we are continuously addressing those needs by implementing Perun AAI, which is actively used as a fundamental component for large communities such as the LifeScience cluster at the European Open Science Cloud (EOSC) level and the Czech national e-infrastructure. In terms of collaboration, we create an environment that enables easy integration of services, which users can access by logging in through their own organisation accounts without having to remember extra usernames and passwords. Our primary focus is also to protect users’ resources by enabling multi-factor authentication, anti-phishing protection, and advanced authorization mechanisms, i.e., GA4GH passports, which provide a convenient and standardized way of communicating users’ data access authorizations based on either their role (e.g. being a researcher), affiliation, or access status. Last but not least, we put effort into automating processes, such as the user life cycle within the organization, together with automatic provisioning and de-provisioning of accounts without manual intervention.
This talk will introduce the benefits Perun AAI can bring to research communities and national e-infrastructures to help them foster collaboration, improve the security of all operations, and lower administration costs.
In today's infrastructures, the collection, exchange and continuous processing of geospatial data take place at pre-defined network endpoints of a spatial data infrastructure. Each participating operator hosts predefined, static functionality at a network endpoint. Some network endpoints of an operator may provide data access, while other endpoints may provide processing functionality or uploading capabilities. Security context constraints are fundamental for installing services in production environments. Several regulations apply, from security technical implementation guides to information security policies. Recent legislation such as the EU Data Act entered into force on 11 January 2024 and will become applicable in September 2025. Because of this regulation, connected products will have to be designed and manufactured in a way that empowers users (businesses or consumers) to easily and securely access, use and share the generated data. The EU Digital Services Act (DSA) states its applicability for simple websites, Internet infrastructure services and online platforms.
Our novel approach introduces an agile, decentralized ecosystem concerned with trust and authenticity by introducing Smart Certificates, which can be applied to data products, workflow processes and services. Smart Certificates enable the flexible and trustworthy creation, distribution and verification of data products. The certification process can take place either manually or automatically, if the data has appropriate Identity, Integrity, Provenance and Trust (I2PT) enabling metadata.
Our approach differs from well-known X.509 certificates in that the information contained in Smart Certificates is defined by a schema. The schema - hence the structure of the certificate - is stored on a blockchain to become immutable. Support for Zero-Knowledge Proofs, as well as for requesting information from the certificates during the verification procedure, enables a wide range of use cases.
To illustrate the use of Smart Certificates, a process - Reprojection - allows a user or a workflow engine to reproject an image for which the user has a Smart Certificate. For the output image, the process creates a Smart Certificate. The process itself is verifiable because it also has a Smart Certificate associated. The bundling of image data and Smart Certificates allows the process to check for authentic input but also to verify the usage of the image. If the usage is not appropriate for the trusted process, the execution is refused.
The Trusted Reprojection process was implemented in Python and deployed using the OGC API Processes Standard and a modified version of pygeoapi. The deployed process validates the input image Smart Certificates and creates a Smart Certificate for the created output product - the reprojected image. The Hyperledger Indy Blockchain and Aries Cloud Agent are used as the backbone.
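The following control-flow sketch illustrates the trusted-process idea; verify_certificate(), issue_certificate() and reproject() are hypothetical stand-ins, not the actual Hyperledger Indy/Aries or pygeoapi implementation:

# verify_certificate(), reproject() and issue_certificate() are hypothetical
# helper functions used only to show the control flow of a trusted process.
def trusted_reproject(image_path, image_certificate, target_crs, wallet):
    # 1. Check that the input image carries an authentic Smart Certificate
    #    and that its usage terms permit processing by this trusted process.
    if not verify_certificate(image_certificate, expected_subject=image_path,
                              allowed_usage="reprojection"):
        raise PermissionError("Execution refused: certificate invalid or "
                              "usage not permitted")

    # 2. Run the actual geospatial operation (e.g. a GDAL/rasterio warp).
    output_path = reproject(image_path, target_crs)

    # 3. Issue a new Smart Certificate for the output product, recording
    #    provenance: the input certificate, the process and its parameters.
    output_certificate = issue_certificate(
        subject=output_path,
        issuer=wallet,
        provenance={"input": image_certificate,
                    "process": "reprojection",
                    "target_crs": target_crs},
    )
    return output_path, output_certificate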
Our approach introduces trusted computing in a distributed environment by leveraging Hyperledger Indy, Aries Cloud Agent and specific business logic. The solution can be used to verify and issue Smart Certificates for data products and trusted processes. The introduced ecosystem is one example solution to support the EU Data Act.
Account linking may be useful at different places in the AAI architecture. Over the past years, we have seen account linking at the Community-AAI, where multiple Home-Organisation logins may be used to log in to a single account at the Community. This typically allows linking attributes of services such as ORCID or Google. More recently, this type of account linking is being integrated in an additional proxy above the Community-AAI. These additional proxies are known as "national Edu-ID". They aim to support researcher mobility by allowing links to several different, sometimes international, Home Organisations.
To complement these early (or northbound) linkings, we have designed and implemented a system for late (or southbound) linking of accounts. Our use case is users who authenticate with their federated identity to a modern service inside a particular computer centre. Computer centres are often reluctant to invest early into new AAI systems. Their Unix-based infrastructure (HPC clusters, filesystems) therefore does not support federated identities. To allow our modern service to use this infrastructure for federated users, we need to know which Unix account the federated user will be mapped to when logging in with an account local to the computer centre.
ALISE, the Account LInking SErvice, does exactly that. The web interface asks the user to log in with the computer centre account. Once authenticated, federated identities may be linked to the computer centre account. The linkage information may be accessed via a REST API, so that our modern service can use this information.
The initial setup is working for the VEGA HPC Centre in Slovenia, where an instance of dCache needs to utilise local storage to read or write data that a VEGA HPC user owns.
An overview of EGI Check-in use cases, such as EUreka3D and LETHE
Open Policy Agent (OPA) is an open-source, general-purpose authorization engine that provides a high-level declarative language, called Rego, which allows the expression of policies as code, using a combination of data manipulation and logical operators. OPA takes policy decisions by evaluating the query input against policies and data. The OPA RESTful APIs allow the service to be integrated into any application, making it a versatile tool for authorization and access control.
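As an example of this integration pattern, a client can query a running OPA instance over its REST data API; the package path and input document below are invented for illustration:

import requests

# Hypothetical policy package path (storm/tape/allow) on a local OPA instance.
OPA_URL = "http://localhost:8181/v1/data/storm/tape/allow"

query = {
    "input": {
        "method": "GET",
        "path": ["tape", "archive", "file.root"],
        "scopes": ["storage.read:/tape"],
    }
}

resp = requests.post(OPA_URL, json=query, timeout=5)
resp.raise_for_status()
decision = resp.json().get("result", False)   # OPA responds with {"result": <value>}
print("authorized" if decision else "denied")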
One of the main advantages of using OPA is its performance optimization capabilities. The OPA policy evaluation engine is designed to handle large volumes of requests, making it an ideal choice for the Grid middleware. Additionally, the OPA caching mechanism allows it to minimize the number of policy evaluations, further improving performance. Moreover, the OPA declarative approach to policy management allows for a more intuitive and straightforward policy development process.
With this contribution, we want to highlight the potential of this framework in the context of our Grid middleware and to illustrate how we are exploring the use of OPA in two use cases: to implement the authorization rules defined in the WLCG JWT profile for StoRM Tape and StoRM WebDAV, and to replace the home-made scope policy engine within INDIGO IAM. The appropriate comparison in terms of performance and compliance between the previous solutions and those based on OPA will also be illustrated.
The Secure Shell (SSH) Protocol is widely recognized as the de facto standard for accessing remote servers on the command line, across a number of use cases, such as remote system administration, git operations, system backups via rsync, and high-performance computing (HPC) access.
However, as federated infrastructures become more prevalent, there is a growing demand for SSH to operate seamlessly and securely in such environments. Managing SSH keys in federated setups poses a number of challenges, since SSH keys are trusted permanently, can be shared across devices and teams, and do not have a mechanism to enforce the use of passphrases. Unfortunately, there is currently no universally accepted usage pattern for globally federated usage.
The large variety of users with different backgrounds and usage profiles motivated us to develop a set of tools facilitating the integration with federated user identities. The main novelty presented in this contribution is the integration of an SSH-certificate-based mechanism into the existing ecosystem for SSH with OpenID Connect, consisting of motley-cue and oidc-agent.
This new mechanism consists of a set of programs collectively referred to as "oinit". It aims to simplify the usage of SSH certificates by leveraging authorization information via established federation mechanisms. The main benefit is that, after an initial setup step, SSH may be used securely without interrupting existing flows, enabling the use of rsync, for example.
The core components of oinit include the following:
In addition to outlining the architecture and functionality of our solution, we provide an initial security assessment and offer a live demo of SSH with OpenID Connect, with oinit and selected components.
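For context, the generic SSH-certificate mechanism that oinit builds on can be exercised with standard OpenSSH tooling; the sketch below (not oinit's actual code) shows a certificate authority signing a user key to produce a short-lived certificate, with placeholder paths and principal:

import subprocess

# Standard OpenSSH certificate issuance: the CA signs the user's public key,
# producing id_ed25519-cert.pub, a short-lived certificate accepted by any
# server that trusts the CA. Paths and principal are placeholders.
subprocess.run(
    [
        "ssh-keygen",
        "-s", "/etc/ssh/ca_key",       # CA private key
        "-I", "alice@community-aai",   # certificate identity (key ID)
        "-n", "alice",                 # authorized principal (local Unix account)
        "-V", "+1h",                   # validity: one hour
        "id_ed25519.pub",              # user public key to be signed
    ],
    check=True,
)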
Currently, data processing and analysis predominantly occur within data centers and centralized computing environments, with 80% of this activity centralized and only 20% happening through intelligent, connected devices. Additionally, merely one out of four European companies leverages cloud technologies, and non-EU entities control 75% of Europe's cloud market.
To leverage the shift towards edge computing, which aligns with Europe's strategies on data, the environment, and industry, it's crucial for Europe to consolidate significant investments. The emphasis should be on the creation and implementation of advanced computing components, systems, and platforms. These technologies are essential for facilitating the move to a computing continuum that boasts potent edge and far-edge capabilities, all while being energy-efficient and reliable.
The development of the edge to cloud continuum faces a number of technological and conceptual challenges. First, seamless, transparent and trustworthy integration of diverse computing and data environments spanning from core cloud to edge, in an AI-enabled computing continuum. Secondly, automatic adaptation to the growing complexity of requirements and the exponential increase of data driven by IoT deployment across sectors, users and contexts while achieving optimal use of resources, holistic security and data privacy and credibility. Finally, interoperability challenges among computing and data platform providers and cloud federation approaches based on open standards, interoperability models and open platforms.
To cope with those challenges, ACES will provide an edge-services cloud with hierarchical intelligence, specifically autopoiesis and cognitive behaviors to manage and automate the platform.
These solutions include:
Such new solutions are tested through three use cases:
Extended list of the consortium members includes:
Climate change and transformation are urging scientific communities and decision makers around the world to better understand and handle this systemic shift and its consequences at different levels, and to instill gradual societal adaptation and change in the population.
The availability of tailored and robust information about current climate and climate change at local, regional or national scales is an increasing requirement in a wide range of end-user applications and as a decision-support basis in the fields of risk reduction and adaptation planning.
Numerous European and national portals have been recently developed to ease access to climate data, visualize precomputed information and promote climate change communication. At the same time, a wide range of ready-to-use packages written in popular programming languages and published in open-source code repositories, e.g., GitHub, have been released with the aim of enabling end users to derive customized data for specific applications, e.g., climate indices for sector-oriented analyses, or to further integrate the utilities into tailored climate services.
However, open-source packages completely integrating all steps composing a service-oriented application – from the calculation of climate information to the open-access publication in repositories, the metadata curation, and customizable analyses – are still missing.
In this framework, with the aim of answering the increasing need to process climate data for research activities as well as practice-oriented applications, we developed an open-source tool called climdex-kit [1], published in the official Python Package Index (PyPI, https://pypi.org/). The package is designed to support users with some programming skills in carrying out research in the field of climate change and impact prediction, in supporting dissemination and educational activities through effective visualization, or in developing more complex architectures for operational platforms addressing a broad audience. The tool is written in Python and integrates utilities from the well-established Climate Data Operators (CDO) and NetCDF Operators (NCO) libraries. climdex-kit provides utilities to implement the whole calculation pipeline, orchestrate parallelized processing over multiple climate datasets, publish and analyze climate indices, and shape the visualization of results based on user needs. The current version offers the calculation of 37 climate indices, while the package can easily be extended to support further indices and unforeseen operators, thanks to thorough documentation for developers available in the source repository.
We will present and discuss the climdex-kit functionalities as well as its potential integration into local applications by applying the software to a dataset of climate projections for the Italian region Trentino-South Tyrol, used as study case.
[1] https://pypi.org/project/climdex-kit/
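climdex-kit's own API is not shown here; as a hedged illustration of what one of its indices boils down to, the following computes FD (the annual count of frost days, i.e. daily minimum temperature below 0 °C) with xarray, using an assumed file and variable name:

import xarray as xr

# Daily minimum temperature in Kelvin; file and variable names follow common
# CMIP/CORDEX conventions but are assumptions here.
ds = xr.open_dataset("tasmin_EUR-11_daily.nc")

# FD: annual count of days with tasmin below 0 °C (273.15 K).
frost_days = (ds["tasmin"] < 273.15).groupby("time.year").sum("time")
frost_days.name = "FD"
frost_days.to_netcdf("FD_EUR-11_yearly.nc")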
The Pangeo community is eager to demonstrate the Pangeo@EOSC service, derived from a collaboration between Pangeo and the EGI-ACE and C-SCALE projects. Offering Pangeo notebooks as well as Machine Learning (both Pytorch and TensorFlow) and Data Science notebooks (R & Julia), Pangeo@EOSC provides an integrative platform within the EOSC for scientific data analysis. Our demonstration will effectively showcase the functionality, convenience, and far-reaching impact of this service.
Pangeo@EOSC, a powerful, scalable, and open-source platform, enables Big Data analysis across an array of disciplines using vast multi-dimensional data, such as geoscience and environmental science, among others. The platform serves as a bridge between data storage, computation, and the scientist, creating a seamless, integrated working environment that stimulates more efficient research and collaborations.
During our 30-minute demonstration, we will delve into Pangeo@EOSC's functionalities. Starting from data access, we will navigate through data exploration, visualisation, and analysis, and further explore its collaborative features. The demonstration will further illuminate how Pangeo@EOSC facilitates end-to-end reproducibility.
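A typical Pangeo-style analysis cell of the kind shown in the demonstration might look as follows (a hedged sketch; the Zarr store URL and variable name are placeholders):

import xarray as xr
from dask.distributed import Client

client = Client()  # connect to (or start) a Dask cluster for parallel execution

# Lazy, chunked access to a cloud-hosted Zarr store (URL is a placeholder).
ds = xr.open_dataset("https://example.org/store.zarr", engine="zarr", chunks={})

# Example analysis: monthly climatology of sea-surface temperature.
sst_clim = ds["sst"].groupby("time.month").mean("time")
result = sst_clim.compute()  # triggers the distributed computation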
We look forward to engaging with fellow researchers, scientists, and data enthusiasts during the dedicated networking session. This will not only provide valuable insight into practical requirements and evolving expectations in the scientific world, but also offer us a great opportunity to receive feedback on Pangeo@EOSC.
With Pangeo@EOSC, the future of scalable, collaborative, and reproducible scientific research is not just a possibility, but a reality within our reach.
Software engineering best practices favour the creation of better-quality projects, where similar projects should originate from a similar layout, also called a software template. This approach greatly enhances project comprehension and reduces developers’ effort in the implementation of high-quality code. For example, reproducibility and reusability are key aspects of this software engineering process, and the use of packaging tools and containers is a common practice to achieve robustness and portability for long-term software maintenance. However, these tools are not always easy to use and require a certain level of expertise to set up from scratch. Software templates are known to be an excellent way to reduce the complexity that such tools place on the developer’s side.
There exist various tools to create such templates and routinely generate projects from them. One such Open Source tool is cookiecutter [1], a cross-platform command-line utility where a new project is replicated according to a set of files and directories that are pre-configured to provide the base structure. These templates, or cookiecutters, can be re-used and freely hosted on software version control platforms e.g. GitHub, where customization is achieved by filling in placeholders in the template files using project-specific values.
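For example, generating a project from a template can be done either with the cookiecutter CLI or with its Python API; the template URL and context values below are placeholders, not the AI4OS templates:

from cookiecutter.main import cookiecutter

# Generate a new project from a template hosted on GitHub, non-interactively,
# overriding selected placeholders with project-specific values.
cookiecutter(
    "https://github.com/example-org/cookiecutter-ai-module",
    no_input=True,
    extra_context={"project_name": "my-model", "license": "MIT"},
)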
In this contribution, we show how to develop custom modules within the AI4OS dashboard and make use of the best software development practices for this scientific framework. We present a new service that provides a collection of templates on a marketplace/hub and uses them to generate new projects on the fly through a web interface, without requiring the installation of the cookiecutter tool on the client side. The platform features a GitHub repository to collect metadata about templates, a Python-based backend, and a JavaScript web GUI with authentication via EGI Check-in. From data preprocessing to model evaluation, this session covers the most critical steps in the process of module development, empowering participants to achieve the desired reproducibility and reusability in their projects.
[1] https://github.com/cookiecutter/cookiecutter
EGI, EUDAT, GÉANT, OpenAIRE and PRACE are glad to announce the establishment of the e-Infrastructure Assembly. By joining this collaborative effort, EGI, EUDAT, GÉANT, OpenAIRE and PRACE seek to promote e-Infrastructures in the European Research Area (ERA) and its changing landscape, and ensure that the bottom-up experience developed by e-Infrastructures over the last decades plays an active role in the definition and shaping of future discussions on the digitisation of Europe, with a particular focus on the ERA, Research and Education.
With the signature of the e-Infrastructure Assembly Memorandum of Understanding by Tiziana Ferrari (EGI), Yann Le Franc (EUDAT), Erik Huizer (GÉANT), Natalia Manola (OpenAIRE) and Serge Bogaerts (PRACE), the e-Infrastructure Assembly will be formally established.
The newly constituted collaboration will promote the dialogue and the interaction with major European initiatives in the European and international research landscape, with the purpose of increasing today’s collaboration in research and development areas of common interest, and of defining common strategies for the development, interoperation, integration and delivery of e-Infrastructure services.
The participants in the session will learn about the strategies of EGI, EUDAT, GÉANT, OpenAIRE and PRACE, and how their participation in the e-Infrastructure Assembly will support such strategies.
The ambition of EOSC Beyond is to support the growth of the European Open Science Cloud (EOSC) in terms of integrated providers and active users by providing new EOSC Core technical solutions that allow developers of scientific application environments to easily compose a diverse portfolio of EOSC Resources, offering them as integrated capabilities to researchers.
EOSC Beyond introduces a novel concept of EOSC, establishing a federated network of pilot Nodes operating at various levels (national, regional, international, and thematic) to cater to specific scientific missions. Key objectives include accelerating the development of scientific applications, enabling Open Science through dynamic resource deployment, fostering innovation with testing environments, and aligning EOSC architecture with European data spaces.
The project advances EOSC Core through co-design methodologies, collaborating with diverse use cases from national and regional initiatives (e-Infra CZ, NFDI, NI4OS), thematic research infrastructures (CESSDA, CNB-CSIC, Instruct-ERIC, ENES, LifeWatch, METROFood-RI), and e-Infrastructures (EGI).
In the age of digital transformation, data spaces play a pivotal role in enabling secure, transparent, and cross-sectoral data sharing. This workshop, hosted by the TANGO project in collaboration with CEDAR and DIVINE at the EGI2024 conference, will focus on the shared mission of these Horizon Europe projects: creating resilient, trusted, and interoperable data spaces that align with Europe’s digital sovereignty ambitions.
By bringing together TANGO, CEDAR, and DIVINE, this session will emphasize the synergistic approach these projects have adopted to address shared challenges in data management and connectivity across sectors such as healthcare, finance, public administration, and industry. Attendees will gain a holistic view of how these projects complement each other, creating a unified front in Europe’s efforts to establish data sovereignty and trust through advanced data space connectors.
Key Themes:
Collaborative Innovation: How the combined efforts of TANGO, CEDAR, and DIVINE are laying the groundwork for a new era of secure and privacy-preserving data sharing.
Connector Interoperability: Focus on how each project has approached the design, evaluation, and implementation of data space connectors to ensure interoperability across sectors and platforms.
Cross-Project Synergies: Highlighting areas where these projects are working together to develop best practices, share knowledge, and establish scalable, trusted data-sharing frameworks.
Policy Alignment: Mapping the technological solutions to the strategic pillars of the European Data Strategy, showcasing how these projects jointly support the EU's goals of data sovereignty, transparency, and compliance.
The session is led by TANGO, an EU-funded project aimed at establishing a secure, trustworthy, and energy efficient cross-sector data-sharing platform. This workshop serves as a platform for knowledge exchange, exploring how these projects collectively tackle the challenges of secure data sharing in an increasingly complex regulatory landscape.
TANGO: This project is centered on developing a secure, citizen-centric data management and sharing platform that incorporates advanced technologies like artificial intelligence, self-sovereign identity, and blockchain. The aim is to facilitate privacy-preserving and energy-efficient data handling across various domains such as finance, public administration, and manufacturing. TANGO's efforts also address green data operations by optimizing energy use within data centers. https://tango-project.eu/
CEDAR: The project is focused on combating corruption and enhancing transparency through data-driven governance. It addresses challenges like emergency public spending and fraudulent activities by creating high-quality, harmonized datasets and improving data management and machine learning tools. CEDAR aims to enable robust, human-centric decision-making and validate its methods through pilot studies in areas like public healthcare funds and foreign aid. The project emphasizes collaboration to generate positive impacts on the European economy and society.
DIVINE: DIVINE is dedicated to building resilient data spaces that prioritize data sovereignty and trustworthiness. The project leverages open-source tools and frameworks like the Dataspace Connector, which ensures secure data flows between different sectors and facilitates cross-industry collaboration. DIVINE aims to enhance secure communication between data providers and consumers, ensuring compliance with regulatory standards.
Goal of the Workshop: The goal is to facilitate cross-project discussions on the latest advancements in data space technologies and connectors, share best practices, and identify opportunities for collaboration. This event is an essential gathering for stakeholders interested in the future of data sovereignty, secure data sharing, and compliance in a rapidly evolving digital landscape.
Agenda
15:15: Welcome and Introduction to the Workshop (15 minutes)
Mpampis Chatzimallis, The Lisbon Council, TANGO
- Introduction to the workshop’s cross-project, synergy-driven approach
15:30: Introductory Session on Data Spaces and Connectors (15 min)
Giulia Giussani, IDSA
- Overview of the role of data spaces and the critical importance of connectors
- Insights into how these connectors form the backbone of interoperability and trust
15:45: Collaborative Project Pitches: Unified Challenges and Solutions (5 min each)
Giulia Giussani, IDSA
16:00: Deep-Dive into Data Space Connectors: Synergies and Challenges (15 min each)
Mod: Giulia/Mpampis (+Apostolos, Ilias +tech)
- TANGO: Leveraging AI and blockchain for secure, compliant, and efficient connectors. Kaitai Liang
- CEDAR: Tackling governance challenges with high-quality data and machine learning tools for public good. Silvio Sorace
- DIVINE: Building resilient, cross-sector connectors with open-source frameworks. Sergio Comella
Rationale for the Workshop:
Through investments in European Digital Sovereignty, a marketplace will emerge with many different hardware and software service providers. End users will be able to freely select service providers to manage, share and exploit their data and AI models and to fulfil their compute needs. Trust and trustworthiness refer to the confidence users and stakeholders have in three aspects of a system: reliability, security and performance.
ACES, a European-funded R&D project, foremost develops the orchestration of distributed, powerful edge-cloud computing infrastructures. The framework of ACES and the blueprint of the solution, with its various technologies, were developed with trust and trustworthiness by design.
More often, trust is implemented as an afterthought by stacking technologies on top of an existing system to create security and trustworthiness, which introduces overhead. The ACES project applies security by design in that the selected technologies can have a dual use: an operational function and a trust function.
The focus of ACES is performance optimization in support of big data and AI processing in data-dense environments and mission-critical applications. The consortium is planning a follow-up project, ACES II, to work on the other two aspects of trust and trustworthiness: reliability and security.
The moderated (Luca Remotti, Francesco Mureddu), fast-paced workshop will have a range of short presentations, each followed by a 10-minute Q&A and reflections from the audience on aspects of trust and trustworthiness at the high-performing edge. The workshop will address the following aspects: trust and trustworthiness by design, the ACES meta-vision and how it was incorporated in the ACES blueprint (Fred Buining, HIRO-MicroDataCenters BV), relevant trust metrics (Petar Kochovski, University of Ljubljana), knowledge graphs for transparency (Felix Cuadrado, UPM), Bayesian analytics and trust (Loris Canelli, SUPSI), peer-to-peer trust (Vadim Bulavintsev, HIRO-MicroDataCenters), network-embedded AI for security (Fernando Ramos, University of Lisbon), and swarm-based orchestration of distributed edge-computing infrastructures (Melanie Schranz, Lakeside Labs).
Workshop Agenda:
15:15 - 15:25: Introduction to the Project
Presenter: Luca Remotti
Duration: 10 minutes
Content:
Luca Remotti will introduce the project, outlining the objectives and the importance of integrating AI into cloud-edge computing systems. He will briefly discuss the ethical, security, and transparency considerations central to the project.
15:25 - 15:50: Keynote Speech on Trust & Trustworthiness at powerful edge-computing infrastructures
Speaker: Fred Buining
Duration: 15 minutes + 10 min Q&A, Reflections.
Content:
Fred Buining will present a meta-vision on the future of edge-computing infrastructures within Europe’s Digital Sovereignty and the edge custodianship for trust and trustworthiness for edge devices, applications, AI and dataspaces. Fred will also introduce the trust and trustworthiness by design that is embedded in the ACES framework.
15:50 - 16:10: Trust Metrics collection (ACES, ACES II); Presentation, Q&A and Discussion
Speaker: Petar Kochovski (University of Ljubljana), Moderator: Francesco Mureddu
Content: Optimized metrics collection relevant for performance and reliability of distributed edge, future metrics for security
Petar will present the metrics, and their optimized collection methods, that are relevant for system operations. He will highlight the metrics earmarked for performance and reliability monitoring. Finally, he will present the metrics that are particularly relevant for security.
16:10 - 16:20: Break
Duration: 10 minutes
Content: A short break for networking and refreshments.
16:20 - 16:40: Knowledge Graphs for Infrastructure transparency (ACES, ACES II); Presentation, Q&A and Discussion
Speaker: Felix Cuadrado (Universidad Politécnica de Madrid), Moderator Francesco Mureddu
Content: Knowledge graphs to capture the ontology and complex relationships of the distributed infrastructure. Current and future use to enhance trust and trustworthiness.
Felix will discuss the use of knowledge graphs for defining and explaining complex distributed systems and their relationships, and for storing relevant data and metadata. Felix will look towards the future of using knowledge graphs to enhance system intelligence and security.
16:40 - 17:00: Bayesian analytics for analysis under dynamicity and uncertainty (ACES, ACES II); Presentation, Q&A and Discussion
Speaker: Loris Canelli (SUPSI), Moderator Francesco Mureddu
Content: Loris will present how Bayesian analytics are used to analyse the relationships and data stored in the knowledge graph and to discover probabilistic relationships between different components and the overall system performance and reliability. Loris will end his presentation with a view on how these methods can model, infer, and predict potential security threats and anomalies.
17:00 - 17:20: Swarm Intelligence to manage highly dynamic distributed and future composable edge cloud computing infrastructures at the edge
Speaker: Melanie Schranz (Lakeside Labs), Moderator Francesco Mureddu
Content: Melanie will present how custom-built swarm mechanisms can help manage highly complex, distributed and fine-grained infrastructures and workloads in edge-cloud computing settings. Performance and reliability can be achieved via lightweight swarm mechanisms. Her presentation will end with how swarm elements can be used to enhance security.
17:20 - 17:40: Intelligence in the network of a distributed infrastructure
Speaker: Fernando Ramos (University of Lisbon), Moderator Francesco Mureddu
Content: Fernando will present how programmability and embedded intelligence in network switches pave the way for highly performant, reliable networks. The presentation will end with an outlook on how AI embedded in the network switch can enhance security in future distributed networks.
17:40 - 18:00: Reflections on trust and trustworthiness in distributed powerful edge computing infrastructures
All presenters and audience, Moderator Francesco Mureddu
Content: The main questions the audience will try to answer are: “How can Europe own a trusted, powerful edge-cloud infrastructure? Which initiatives and investments are needed, and what type of research could drive this development forward?”
This is an open, face-to-face meeting of the EGI Federated Cloud task force to discuss the current status of the federation and the next steps towards its evolution.
ENVRI-Hub NEXT offers a user-friendly platform for seamless access to data from environmental Research Infrastructures, fostering interdisciplinary research and driving breakthroughs in environmental science. The project consolidates and advances the technical structure established by the previous ENVRI projects to empower the "ENVRI Science Cluster" to provide interdisciplinary data-driven services. Further information: https://www.egi.eu/project/envri-hub-next/
This session is the last session of EGI's Working Group on Trusted Research Environments and Sensitive Data Management (TRE WG). A landscaping report prepared by the WG will be presented and discussed. In this session we go through the big picture of trusted research environments and some future development directions. The session is open to everyone.
Session preliminary agenda:
- Trusted Research Environments and Sensitive Data Management landscape report
- Future directions of TREs; Architectures, technology and user experiences
- European collaboration between TREs, projects and infrastructures
- Global initiatives
- Next steps of the EGI’s TRE WG
IT Service Management is a discipline that helps provide services with a focus on customer needs and in a professional manner. It is widely used in the commercial and public sectors to manage IT services of all types, but current solutions are very heavyweight with high barriers to entry.
FitSM is an open, lightweight standard for professionally managing services. It brings order and traceability to a complex area and provides simple, practical support in getting started with ITSM. FitSM training and certification provide crucial help in delivering services and improving their management. It provides a common conceptual and process model, sets out straightforward and realistic requirements and links them to supporting materials.
Through FitSM, the aim is to conduct effective IT service management in the EOSC federated environment and to achieve a baseline level of ITSM, which can also support ‘management interoperability’ in federated environments (e.g. EOSC).
This three-hour workshop focuses on equipping participants with the necessary skills and knowledge to effectively design and implement IT Service Management (ITSM) processes in their organizations.
Duration
3 hours
Registration:
contact elia.bellussi@egi.eu directly
Target audience
All individuals involved in the provisioning of (federated) IT services.
Service and Process Managers and Owners.
Candidates who wish to understand better how to implement what was learned from FitSM Trainings.
Entry requirements
Participants must have a solid understanding of ITSM principles and processes.
Contents
Basic IT service management concepts and terms (based on FitSM-0)
Approaches and strategies for implementing ITSM.
Training Outputs
Design Skills: Equip participants with the skills to design effective ITSM processes.
Implementation Insights: Provide practical insights into successful ITSM process implementation.
Engagement: Foster interactive learning and collaboration among participants.