Overview
Many research activities in several fields, such as high energy, nuclear and atomic physics, biology, medicine, geophysics and environmental science, rely on simulation and data analysis tools that are highly CPU-consuming and/or data-driven. Sequential computation may require months or years of CPU time, so a loosely-coupled parallel distributed execution is expected to benefit these applications. In fact, the large number of storage and computation resources offered by a Grid environment makes it possible to reduce the computation time substantially by splitting the whole task into several parts and executing each part on a single node.
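The splitting strategy can be sketched in a few lines. The helper below is a hypothetical illustration, not part of the suite: it partitions a simulation of a given number of events into near-equal per-job event ranges, one range per Grid node.

```python
# Hypothetical sketch: split n_events into n_jobs near-equal ranges,
# each range (first_event, count) to be processed by one Grid job.
def split_events(n_events, n_jobs):
    base, extra = divmod(n_events, n_jobs)
    ranges = []
    first = 0
    for i in range(n_jobs):
        # the first `extra` jobs take one extra event each
        count = base + (1 if i < extra else 0)
        ranges.append((first, count))
        first += count
    return ranges

# e.g. one million events spread over 8 nodes
chunks = split_events(1_000_000, 8)
```

Each chunk can then be passed to a separate job as its event range, so the total CPU time is divided by the number of available nodes (minus scheduling overhead).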
High energy physics experiments pioneered this approach, but their results are hardly accessible to small and mid-size organizations that may have similar computational requirements, mostly because of the large amount of technical expertise needed.
The goal of our work is to provide a suite allowing easy, quick and highly customizable access to the Grid.
Description of the work
This work started from the need of the SuperB project, a High Energy Physics experiment, to simulate its detector systems.
The effort of developing a distributed simulation production system capable of exploiting Grid resources of multiple flavours resulted in a general-purpose design based on a minimal, standard set of Grid services and capable of fitting the requirements of many different VOs.
The system design includes a central EGI service site providing the web interface to job and metadata management tools.
A database system stores the metadata related to a "distributed work session"; it acts as the back-end for all the suite sub-services and communicates via a RESTful protocol with the jobs running on distributed sites.
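As a minimal sketch of this job-to-database communication, a running job could report its state with a single REST call. The endpoint path and field names below are illustrative assumptions, not the suite's actual API.

```python
import json
import urllib.request

# Hypothetical sketch: build the REST request a running job would send
# to the central bookkeeping database to update its own status.
# The URL scheme and JSON fields are assumptions for illustration.
def build_status_request(base_url, job_id, status):
    payload = json.dumps({"job_id": job_id, "status": status}).encode()
    return urllib.request.Request(
        base_url + "/jobs/" + str(job_id) + "/status",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )

req = build_status_request("https://bookkeeping.example.org", 42, "RUNNING")
# the job would then send it with urllib.request.urlopen(req)
```

Keeping the protocol this simple is what allows the same back-end to serve jobs running under different Grid middleware flavours.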
The whole suite is based on standard Grid services: the WMS as job brokering service and interoperability element across Grid flavours, the VOMS as authentication and accounting system, the LFC and LCG-Utils as file metadata catalog and data-handling solution, GANGA as job management system, and SRM as the data-access protocol.
A web-based user interface has been developed that handles the database interactions and the job preparation; it also provides basic monitoring functionality.
The web interface permits initializing the bookkeeping database for a new session; it includes a parametric search and statistics engine, and it provides a submission interface for shift takers, allowing automatic submission to all the available sites, a submission interface for experts, allowing a fine-grained selection of job submission parameters, and a simplified elog system.
The submission workflow includes the offline site setup (transfer of job input files to the SEs and VO software installation) and the bulk job submission via the web tool suite. Jobs communicate status information to the central DB; at job completion, the output files are transferred to the central site data repository, metadata and log files are stored, and location information is registered in the LFC service.
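The per-job part of this workflow can be outlined as follows. Every helper in the sketch is a hypothetical placeholder standing in for the real Grid operation (REST status update, simulation executable, SRM transfer, LFC registration); the stubs only record what they would do.

```python
# Hedged sketch of the per-job lifecycle described above.
# All helpers are illustrative stubs, not the suite's actual code.
log = []

def report_status(job_id, status):
    log.append(("status", job_id, status))   # would be a REST call to the central DB

def run_simulation(job_id):
    log.append(("simulate", job_id))
    return ["out_%d.root" % job_id]          # names of the produced output files

def transfer_and_register(filename):
    log.append(("transfer", filename))       # SRM transfer + LFC location registration

def run_job(job_id):
    report_status(job_id, "RUNNING")
    for f in run_simulation(job_id):
        transfer_and_register(f)
    report_status(job_id, "DONE")            # metadata/logs stored at this point

run_job(7)
```

The ordering matters: the status update precedes the simulation so that the monitor sees the job as running, and the output transfer and catalog registration happen before the final status flip, so a "DONE" job always has its data already in the central repository.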
Conclusions
The prototype suite has proven to be reliable and efficient although it still requires a careful initial configuration and needs a deeper abstraction from the application specifics.
Further development is needed to provide such an abstraction layer in both the web interface and the bookkeeping database. For instance, a user-defined database schema should be made available by providing an admin web interface that permits environment configuration, table modifications and the setting of display and logic preferences.
Moreover, a better configuration structure is needed in order to allow the customization of sites, job scripts, data management and executables.
At a higher level, the minimal set of Grid services and protocols must be redefined in terms of standards compliance, durability and ease of use, in order to expand the user base.
An authentication management layer will be added in order to provide direct access to the Grid resources from the suite.
Impact
The prototype suite we have developed was successfully used in the summer of 2010 during a large Monte Carlo simulation production for SuperB.
The centralized access site was the CNAF computing center, which provided the web interface, the submission point and the bookkeeping database, and hosted the data repository.
Fifteen remote sites in Italy, France, the UK, the USA and Canada, deploying three different Grid middleware flavours, were involved.
More than 11 billion simulated events have been produced.
Over an effective period of 4 weeks, approximately 180,000 jobs were completed with a ∼8% failure rate, mainly due to executable errors (0.5%), site misconfigurations (2%), proxy expiration (4%), and temporary overloading of the machine used to receive the data transfers from the remote sites (2%). The peak rate reached 7000 simultaneous jobs, with an average of 3500. The total wall clock time spent by the simulation executables is ∼195 years.
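These figures can be cross-checked with a back-of-the-envelope calculation (ours, not from the production report): the quoted total wall clock time implies an average job duration of roughly 9.5 hours, and the ∼8% failure rate corresponds to roughly 14,400 failed jobs.

```python
# Back-of-the-envelope check of the production figures quoted above.
n_jobs = 180_000
wall_clock_years = 195

total_hours = wall_clock_years * 365.25 * 24   # total wall clock time in hours
avg_hours_per_job = total_hours / n_jobs       # roughly 9.5 hours per job
failed_jobs = round(n_jobs * 0.08)             # roughly 14,400 jobs at ~8%
```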
The distributed infrastructure and the developed suite have been fundamental in achieving the SuperB production cycle goals.
The online and offline monitors included in the web interface keep the metadata stored in the bookkeeping database available for querying and for further processing and analysis.
Our suite can be seen as a lightweight, general-purpose experiment framework focused on basic functionalities, designed specifically for organizations that cannot afford the use of the more specialized HEP frameworks but still require an easy-to-use interface to the Grid.
Customization of the web interface, the bookkeeping database, the job executable and the site requirements is considered the key to achieving this goal, together with a small installation and configuration footprint.