26–30 Mar 2012
Leibniz Supercomputing Centre (LRZ)
CET timezone
CALL FOR PARTICIPATION: now closed; successful applicants have been informed

Experience in Grid Site Testing for High Energy Physics with HammerCloud

27 Mar 2012, 16:00
20m
LRZ 2 (100) (Leibniz Supercomputing Centre (LRZ))

Operational services and infrastructure Quality Assurance - Infrastructure

Speakers

Daniel Colin van der Ster (CERN), Ramon Medrano Llamas (CERN)

Impact

HammerCloud runs between 40,000 and 60,000 grid jobs per day for the three VOs it currently tests. In the ATLAS instance, the auto-exclusion feature was deployed more than a year ago, with very promising results for grid reliability: the grid error rate has decreased by up to 50%.

Description of the Work

Frequent validation and stress testing of the network, storage and CPU resources of a grid site are essential to achieve high performance and reliability. HammerCloud was previously introduced with the goal of enabling VO and site administrators to run such tests in an automated or on-demand manner. The ATLAS, CMS and LHCb experiments have all successfully integrated it into their grid operations infrastructures. This work will present the experience of running HammerCloud at full scale for more than 3 years, together with solutions to the scalability issues faced by the service. First, we will show the particular challenges faced when integrating with CMS and LHCb offline computing, including customized dashboards that show site validation reports for the VOs and a new API that integrates tightly with the LHCbDIRAC Resource Status System. Next, a study of the automatic site exclusion component used by ATLAS will be presented, along with results from tuning the exclusion policies. A study of the historical test results for ATLAS, CMS and LHCb will also be presented, including comparisons between the experiments' grid availabilities and a search for site-based or temporal failure correlations. Finally, we will look at future plans that will allow users to gain new insights into the test results; these include developments to allow increased testing concurrency, a larger number of metrics recorded per test job (up to hundreds), and a larger volume of historical job information (up to many millions of jobs per VO).
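
To make the idea of an automatic site exclusion policy concrete, the following is a minimal sketch and not the ATLAS implementation, whose actual rules are not described in this abstract. It assumes a purely hypothetical policy: a site is excluded once the failure rate over its most recent test jobs exceeds a threshold. The class name, the window size and the threshold are all illustrative assumptions; they stand in for the kind of parameters that tuning an exclusion policy might adjust.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SiteExclusionPolicy:
    """Hypothetical policy (an assumption, not HammerCloud's actual rule):
    exclude a site when too many of its recent test jobs have failed."""

    window: int = 10               # recent test jobs considered per site (assumed)
    max_failure_rate: float = 0.5  # exclude above this failed fraction (assumed)
    _results: dict = field(default_factory=dict)

    def record(self, site: str, succeeded: bool) -> None:
        """Store one test job outcome, keeping only the last `window` results."""
        history = self._results.setdefault(site, deque(maxlen=self.window))
        history.append(succeeded)

    def failure_rate(self, site: str) -> float:
        """Fraction of failed jobs among the stored results for a site."""
        history = self._results.get(site)
        if not history:
            return 0.0
        return history.count(False) / len(history)

    def is_excluded(self, site: str) -> bool:
        """Exclude only once a full window of results has been collected."""
        history = self._results.get(site)
        if history is None or len(history) < self.window:
            return False
        return self.failure_rate(site) > self.max_failure_rate


# Usage with made-up site names and outcomes.
policy = SiteExclusionPolicy(window=4, max_failure_rate=0.5)
for ok in (True, False, False, False):
    policy.record("SITE_A", ok)
print(policy.is_excluded("SITE_A"))
```

In this sketch a site re-enters automatically as soon as newer successful test jobs push the failure rate back under the threshold; requiring a full window before excluding avoids reacting to a single failed job.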

URL

http://hammercloud.cern.ch/

Overview (For the conference guide)

HammerCloud is a grid site testing service for the ATLAS, CMS and LHCb experiments centered at CERN in Geneva. This tool, which is provided as an online service for operation managers, site administrators and, in general, grid experts, allows them to perform on-demand tests of their computing facilities in order to validate and measure their performance. In addition, HammerCloud runs automated tests to check the availability and reliability of the sites under different circumstances. The tests consist of real analysis code provided by the physics community to ensure real-world use cases for the grid sites. Indeed, HammerCloud has been employed in HEP for more than 2 years and has helped increase the performance and reliability seen by the grid users. In this work we will present the lessons learnt while deploying, optimizing and evolving the system for the three VOs and the development plans for the near and mid-term future.

Conclusions

HammerCloud has proven to be a fundamental tool for grid operations: it is not only helpful for commissioning new sites and for upgrade and tuning campaigns, but also necessary for monitoring the availability and reliability of sites, providing useful insights for daily operations on grid sites.

Primary author

Daniel Colin van der Ster (CERN)

Co-authors

Andrea Sciaba (CERN), Federica Legger (LMU Munich), Johannes Elmsheuser (LMU Munich), Mario Ubeda Garcia (CERN), Ramon Medrano Llamas (CERN)
