26–30 Mar 2012
Leibniz Supercomputing Centre (LRZ)
CET timezone

User-centric monitoring of the analysis and production activities within the ATLAS and CMS Virtual Organisations using the Experiment Dashboard system

27 Mar 2012, 16:40
25m
LRZ 2 (100) (Leibniz Supercomputing Centre (LRZ))


Speaker

Dr Edward Karavakis (CERN)

Overview (For the conference guide)

The Experiment Dashboard is a monitoring system developed for the LHC experiments to provide a view of the Grid infrastructure from the perspective of the Virtual Organisation (VO). It gives a transparent view of the experiment activities across different middleware implementations and combines Grid monitoring data with VO-specific information. Job processing is the core of the VO computing activities. Scientists must be able to monitor the execution status and the application- and Grid-level messages of their tasks, which may run at any site within the VO. The Dashboard Task Monitoring applications collect a user-centric set of information about submitted tasks and expose it to the user. They provide a clear and precise view of how task status evolves and of the reasons for failure, as a function of time or site. Graphical plots are also available, giving analysis and production users a more usable and attractive interface.

Description of the Work

Various fully distributed job submission methods and execution backends are used within both the ATLAS and CMS VOs. More than 700,000 ATLAS and 300,000 CMS jobs are submitted daily to the Worldwide LHC Computing Grid (WLCG) and are processed on different middleware platforms. The LHC job processing activity is divided into two categories: large-scale Monte-Carlo production jobs and user analysis jobs. The main difference between them is that the former is a well-organised activity performed by a group of experts, while the latter is chaotic analysis processing by diverse and geographically widespread members of the physics community. The behaviour of analysis jobs is particularly difficult to predict, as they are normally submitted by users who are not necessarily experienced in using the Grid. All of these factors make monitoring the job processing activities within these VOs more complex. While most existing monitoring applications are coupled to a specific Workload Management System (WMS), such as CRAB Monitoring for CMS and Panda Monitoring for ATLAS, the Dashboard Task Monitoring applications support different middleware implementations and job submission systems. They combine Grid monitoring data with experiment-specific information by collecting it from various sources, such as the user interface of the WMS, the job submission systems, and the jobs themselves, and they present all of this information coherently, as if it came from a single source. The development was user-driven: physicists were invited to test the prototypes in order to gather further requirements and identify weaknesses in the applications. This talk will describe the current status of job processing monitoring, cover the Dashboard Task Monitoring applications for analysis and production users, which are widely used by the ATLAS and CMS communities, and provide an insight into future development plans.
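
As a rough illustration of what presenting reports from several sources "as if all of it came from one source" could involve, the Python sketch below merges job reports into a per-task summary of status counts and failure reasons per site. It is not the Dashboard's actual code or data model: the JobReport fields, the source names, the site names and the function names are all assumptions made for this example.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class JobReport:
    """One status report about a job (hypothetical schema, not the Dashboard's)."""
    source: str       # assumed source label, e.g. "submission_tool", "wms", "job_wrapper"
    task_id: str
    job_id: str
    site: str
    status: str       # assumed values: "submitted", "running", "done", "failed"
    timestamp: int    # seconds since some reference time
    reason: str = ""  # failure reason, if any


def latest_per_job(reports):
    """Keep only the most recent report for each (task, job), whatever its source."""
    latest = {}
    for report in reports:
        key = (report.task_id, report.job_id)
        if key not in latest or report.timestamp > latest[key].timestamp:
            latest[key] = report
    return latest


def summarise_tasks(reports):
    """Return {task_id: {"by_status": Counter, "failures_by_site": Counter}}."""
    summary = defaultdict(
        lambda: {"by_status": Counter(), "failures_by_site": Counter()}
    )
    for report in latest_per_job(reports).values():
        task = summary[report.task_id]
        task["by_status"][report.status] += 1
        if report.status == "failed":
            task["failures_by_site"][(report.site, report.reason)] += 1
    return dict(summary)


# Made-up reports about two jobs of one task, arriving from three sources.
reports = [
    JobReport("submission_tool", "task-1", "job-1", "SITE_A", "submitted", 100),
    JobReport("job_wrapper", "task-1", "job-1", "SITE_A", "done", 250),
    JobReport("submission_tool", "task-1", "job-2", "SITE_B", "submitted", 100),
    JobReport("wms", "task-1", "job-2", "SITE_B", "failed", 180, "stage-out error"),
]
summary = summarise_tasks(reports)
```

Keeping only the latest report per job is one simple way to reconcile sources that disagree; a real system would likely need more careful rules, for example to handle reports that arrive out of order.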

Impact

The Dashboard Task Monitoring applications for analysis and production users have become very popular within the ATLAS and CMS communities and play an important role in the analysis and production operations of the LHC. They also play an important role in the support infrastructure, as they help ensure that only serious issues are escalated to the support teams. For CMS alone, more than 250 distinct users rely on them daily for their work. Close collaboration with users and production teams has kept the tools focused on their exact monitoring needs.

URL

http://dashboard.cern.ch

Conclusions

Major progress has been made since 2009 in the development of applications for monitoring user analysis and production activities. This work is very important, since it contributes to the overall success of the LHC offline computing effort. During the first year of data taking, the Dashboard Task Monitoring applications proved to be an essential component of the LHC computing operations. They are being developed in very close collaboration with the physicists who use the Grid infrastructure to submit analysis and production jobs. As a result, they respond well to the needs of the LHC experiments.

Primary authors

David Tuckett (CERN)
Dr Edward Karavakis (CERN)
Ivan Dzhunov (CERN)
Julia Andreeva (CERN)
Laura Sargsyan (A.I. Alikhanyan National Scientific Laboratory (AM))
Lukasz Kokoszkiewicz (CERN)
Mattia Cinquilli (CERN)
Dr Michael Kenyon (CERN)
Pablo Saiz (CERN)
