30 September 2024 to 4 October 2024
Hilton Garden Inn, Lecce, Italy
Europe/Amsterdam timezone

Distributed computing platform on EGI Federated Cloud

3 Oct 2024, 12:00
10m
Hilton Garden Inn, Lecce, Italy

Speakers

Martin Seleng (IISAS), Viet Tran (IISAS)

Description

The AI4EOSC project will deliver an enhanced set of services for the development of Artificial Intelligence (AI), Machine Learning (ML) and Deep Learning (DL) models and applications for the European Open Science Cloud (EOSC). One component of the platform is the workload management system, which manages the execution of compute requests at different sites of the EGI Federated Cloud.

To manage the distributed compute resources in a simple and efficient way, a distributed computing platform is needed. We based this platform on the service mesh paradigm. The platform consists of three parts:

  • The underlying network connection is based on HashiCorp Consul, which, with the help of the Envoy proxy, manages secure network connectivity across different cloud environments (multi-cloud and multi-provider) as well as on premises. It offers services such as service discovery, service mesh, traffic management, and automated updates to network infrastructure devices.
  • To manage the workload on the computing resources we adopted the workload orchestrator HashiCorp Nomad, which enables the deployment and management of containerized and non-containerized applications at scale. Nomad can run a diverse mix of workloads: Docker containers, non-containerized applications, microservices, and batch jobs.
  • Last but not least is the AI4EOSC API for managing job execution on Nomad. The API adds advanced authentication and authorization mechanisms (OIDC authentication, VO-based authorization) and job monitoring, and it simplifies job management by attaching additional metadata to jobs.
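As a rough illustration of the third part, the sketch below builds a batch job in the JSON form that Nomad's HTTP API accepts for job registration and attaches metadata to it, then uses that metadata for a simple VO-based check. It is a minimal sketch under stated assumptions, not the AI4EOSC API itself: the metadata keys (owner, vo), the function names, the datacenter name and the container image are all hypothetical and chosen only for illustration.

```python
import json


def build_batch_job(job_id, image, datacenters, owner, vo):
    """Build a job payload in the JSON layout used by Nomad's job registration endpoint."""
    return {
        "Job": {
            "ID": job_id,
            "Name": job_id,
            "Type": "batch",
            "Datacenters": list(datacenters),
            # Hypothetical metadata keys; the real AI4EOSC schema is not shown in the abstract.
            "Meta": {"owner": owner, "vo": vo},
            "TaskGroups": [
                {
                    "Name": "main",
                    "Tasks": [
                        {
                            "Name": "train",
                            "Driver": "docker",
                            "Config": {"image": image},
                        }
                    ],
                }
            ],
        }
    }


def is_authorized(user_vos, job):
    """Illustrative VO-based check: the user must belong to the VO recorded on the job."""
    return job["Job"]["Meta"]["vo"] in user_vos


# Example values are placeholders, not real sites or images.
payload = build_batch_job(
    job_id="demo-training",
    image="example/trainer:latest",
    datacenters=["site-a"],
    owner="user-123",
    vo="vo.example.eu",
)
print(json.dumps(payload, indent=2))
```

Keeping ownership and VO information in the job's own metadata means that listing, monitoring and authorization can all be driven from the job records held by the orchestrator, without a separate database of jobs. Whether the AI4EOSC API works exactly this way is an assumption here.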

This platform is a unified, reliable, distributed computing system spanning different sites of the EGI Federated Cloud. It resembles the Kubernetes platform; on the other hand, HashiCorp Consul and Nomad are simpler, lighter and more flexible than Kubernetes. The result is a fully distributed and fault-tolerant platform for reliable job execution.

Topic Needs and solutions in scientific computing: Federated operation
