AVAILABILITY/RELIABILITY STATISTICS PROCEDURE MEETING
June 16 2010

Present: David Collados, Tiziana Ferrari, Malgorzata Krakowian, Marcin Radecki, Luuk Ulije, Kostas Koumantaros, Christos Kanelopoulos

====================

David describes the current procedure. The Gridview team is responsible for producing the reports, which are usually released on day 4 or 5 of each month. The steps are:
- confirmation from Gridview that the reports have been generated
- evaluation of the reports before distribution, to check that everything looks correct
- upload of the xls file into EDMS and distribution of the information to the ROC managers

Future: the Gridview team will still generate the reports. A new person needs to be identified to play David's role.

David: in the future it is important to simplify the process so that people can build their reports on the fly.

Tiziana: what tools are involved in this process?
- central DB at CERN on Oracle, where results are stored; tests are executed by SAM and Nagios (two different DBs)
- Gridview: every hour Gridview calculates the status of services according to the probe results and the overall status of the different service types (CE, SE, BDII), computes the overall site availability and stores it in different tables; it also computes the status of the last week (continuous computation)
- end of month: Gridview makes the computation and puts the information in a different table of the DB; a script then runs on that table and creates the reports with OpenReports. Everything is stored in a separate DB dedicated to Gridview, hosted at CERN

Tiziana: are special rights needed to produce reports?
David: only access to the OpenReports URL is needed; liaise directly with Rajesh, team leader of Gridview.

Tiziana: what is needed to fix availability statistics that were penalized by third-party issues?
David: with SAM, sites were not penalized for issues due to the monitoring infrastructure: previously, Gridview recomputed the whole availability reports after the problems were fixed. How can this be done in the future?
Responsibility for monitoring is devolved to the NGIs.

ACTION: a procedure is needed to define what happens when a failure of the monitoring infrastructure occurs.

Note: if monitoring results are not available for some time, site statistics are not affected, as the availability calculation algorithm only considers intervals when data are available. On the other hand, if the WMS has issues in the NGI Nagios infrastructure, something should be done; the problem is how. This could be done by faking the system: handling the timestamps and sending the output as good. This must be noted, and at the end of the month Gridview needs to explain why sites or whole NGIs were affected. Strict control would be needed in order to avoid abuse.

ACTION (Christos): mention the site suspension policy in the OLA (ex EGEE SLD).

David: it is technically possible to change statistics afterwards; we need to implement something and evaluate the effort.

Kostas: feedback from the NGIs on procedures is needed before implementation work happens.

ACTION: David to document the overall procedure for release of reports by today.

Marcin: we need a full picture: what explanations are expected (below threshold, or not following procedures). Full picture in terms of COD work first, and then discuss with developers.

Tiziana: when will engines for flexible computation of availability statistics be ready?
David: this is under the responsibility of the Gridview team: development of the availability calculation algorithm, different profiles. A first prototype is being tested now. Rajesh is the contact point.
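
A minimal sketch of the point in the Note above, that intervals without monitoring data do not affect a site's statistics. This is not the Gridview algorithm; the function name, the hourly True/False/None encoding and the sample data are assumptions made for illustration only.

```python
from typing import Optional, Sequence


def availability(hourly_status: Sequence[Optional[bool]]) -> Optional[float]:
    """Fraction of hours the site was up, counting only hours with data.

    Hypothetical encoding: True = up, False = down,
    None = no monitoring result for that hour.
    """
    # Drop the hours for which no monitoring result exists.
    known = [s for s in hourly_status if s is not None]
    if not known:
        return None  # no data at all: availability is undefined
    # True counts as 1 and False as 0, so sum() is the number of "up" hours.
    return sum(known) / len(known)


# 6 hours up, 2 hours down, 4 hours with no monitoring data
hours = [True] * 6 + [False] * 2 + [None] * 4
print(availability(hours))  # 0.75
```

In this sketch the four hours without data are simply left out, so the result is 6/8 rather than 6/12. A failure that makes the monitoring report wrong results (rather than no results) is not covered by this and is the case the ACTION above refers to.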
=====================================================

David: the process of releasing new certification authority (CA) lists is stuck.
- the Nagios probes need to integrate the new CAs, and the process is blocked because of the integration team at CERN; the PPS process was dismantled
- 3 basic tests of the RPMs
- 7-day period counting: less than 7 days to integrate the latest version of the CAs

=====================================================

--
Tiziana Ferrari
Grid Operations Service
Italian National Institute for Nuclear Physics - CNAF
tel: +39.051.6092.759
fax: +39.051.6092.916
http://www.cnaf.infn.it/~ferrari
-----------------------------------------------------