OLA Task Force kickoff meeting

Europe/Amsterdam
Christos Kanellopoulos (GRNET), Dimitris Zilaskos (Technical staff)
Description
1. Site-NGI OLA changes based on feedback from questionnaire.
   - Update minimum storage/CPU limits
   - Adjust GGUS acknowledge time to 1 working day instead of 5 hours
   - Decrease threshold for new sites. Instead allow a grace period where suspension will not take place
   - Increase availability/reliability thresholds for sites to 80/85 %
   - NGI meetings participation metric: If an NGI holds an internal operations meeting, require participation
     * Suggestion: address it as an NGI specific metric
   - NGI specific metrics: Add a clause that an NGI may specify additional metrics
     * Possible conflicts with EGI limits?
   - Discontinuous availability of resources
     * Permit a site to stop offering resources for an extended period without requiring recertification or raising alerts. Is that different to suspending the site for that period?
   - Timeframe for middleware updates
     * Not all middleware updates are of the same importance
   - Security related requirements
     * Security contacts, ability of the EGI_CERT to suspend in case of security emergency. Perhaps we should ask input from EGI CSIRT. Maybe there is a document already we can refer to.
   - Clause for sites supporting only ops VO for extended periods:
     * Would be nice to have this determined by Nagios somehow.
   - Staged rollout provisions/early adopters
     * How to treat them?
2. Core services specific provisions in the OLA
3. Requirements from Operational tools

EVO information:
Title: OLA Task Force
Description: OLA Task Force
Community: Universe
Password: ola_tf
Meeting Access Information:
- Meeting URL http://evo.caltech.edu/evoNext/koala.jnlp?meeting=M2MvMB2v2nDaDu999tDM92
- Password: ola_tf
- Phone Bridge ID: 2372327 Password: 1016

Eastern European Summer Time (+0300) Start 2010-10-04 12:00 End 2010-10-04 16:00
Japan Standard Time (+0900) Start 2010-10-04 18:00 End 2010-10-04 22:00
Central European Summer Time (+0200) Start 2010-10-04 11:00 End 2010-10-04 15:00
Attendance
Dimitris Zilaskos (TSA1.8, GRNET)
Christos Kanellopoulos (TSA1.8, GRNET)
Tiziana Ferrari (Chief Operations Officer, EGI)
Vera Hasper (Operational documentation)
Helene Cordier (NGI_FRANCE)
Dusan Vudragovic (NGI_AEGIS)
Mats Nylen (NGI_NDGF/Sweden)
Tomasz Szepieniec (NGI_PL)
Gergey Sipos (User support)

Introduction by Tiziana
- Purpose of the OLA TF:
  * recap results from the OLA workshop
  * start working on a clean plan for how to make the OLA evolve during the 1st year of EGI
  * output expected from the group: a document with our proposals for enhancements of the current OLAs and for the definition of new OLAs, to be submitted to the OMB for extended discussion/amendments, and used as a roadmap
  * TSA1.8 has handled the site-NGI OLA agreement
  * which is linked to the EGEE SLD
  * Given that EGI is service oriented, OLAs should formalize the services EGI provides, in order not only to measure availability/reliability, but to have a more "political" nature
  * OLAs formalize which services are provided, and by whom
  * also which services external providers provide to or use from EGI, and vice versa
  So we start from where we are today, and then go to medium term plans
- Helene: put on the agenda the evolution of the OLA document; concerns about how the evolution will be tackled. Changes should be reflected transparently. Once the OLA is in place, procedures are reflected on a wiki page rather than in the document, and these procedures have an impact on the work of NGIs.
- OLA workshop items
- minimum core/storage: Currently 8 cores / 1 TB; some feedback that the storage limit is too high.
Vera: allow a transition period, even one desktop computer today has 8 cores.
Dimitris: 1 TB in 2010 can be achieved with just 1 disk.
Tiziana: remove the limit completely, "small" and "large" are different per NGI. Either keep the current thresholds, or get rid of them entirely and leave it up to the NGIs.
[Long parenthesis starts]
Helene: site-NGI OLA is an NGI business, it has impact on how NGIs/sites work.
Also, how many OLAs are needed?
Tiziana: In the review of EGEE III the European Commission is asking for an infrastructure that can be seamlessly accessed. This may not be just CPU cores and storage but, for example, virtual servers. A calculation of availability and reliability for additional resources is needed. Proposal: ask the developers for a tunable list of services that are provided, and remove the fixed values.
Mats: Agreed, instead find a mechanism where NGI and site reach an agreement.
Helene: Agreed, sites provide services rather than CPU cores/storage, in fact some sites offer only services.
Tiziana: The certification process should establish whether what a site offers is acceptable, for example the services (other than CPU cores/storage).
Tiziana: Will meet James Casey this week to discuss ACE plans (developed by BARC), and will try to raise our points to see what can be done. In EGEE III for Nagios.
Tomasz: There have been cases of sites offering only storage.
Dusan: Would like to see VO-site OLA documents for certain resources.
Tiziana: Work should focus on issues of the current site-NGI OLA first; description of additional metrics will follow.
Tiziana: The signing procedure in EGEE was very difficult, it was not clear who is authorised to do what. In EGI we got rid of the SLA to get away from signing. The comment from the EC in EGEE was that there is no real usefulness in signing if no mechanism for penalties or rewards is in place when the contract is not respected. Suspension is an extreme case, only when a site falls below any reasonable %. The EC is not interested in having a penalty mechanism in place. The suggestion is to get rid of signing, rely on the OMB procedures, and during site certification have a tool that automates the process by clicking a button "I agree."
Dimitris: For example this could be added in GOCDB.
Tiziana: Sort of, GOCDB is also for uncertified sites. Better to have the NGI take care of checking the box.
Dimitris: The NGI clicks on behalf of the site, after it has handled the process internally.
Tiziana: As long as there are no conflicts/duplications in the process. The tool can be very simple, for example a web form as in the Italian grid.
Dimitris: Could it be shared and submitted to JRA1?
Tiziana: Yes, but other NGIs could have something else.
Helene: We are focusing on the technical solution, but there have been legal reasons: a web form will not get rid of the reluctance of site admins to bind themselves to an agreement. Would also like to comment on availability/reliability calculations: are there plans to have a regionalized tool? What about failover?
Helene: What recommendations could be passed on for regionalized versions of the availability tool? Could they be included in the MoU WLCG-EGI-InSPIRE?
Tiziana: Nothing is known so far. Service description database regionalization was postponed due to the priority given to Nagios. Overall the effort from BARC is limited, but anything we want should go into the MoU.
Dusan: There is the Excel tool from gvdev for on demand reports.
Tiziana: Correct, but there is a summarization process every month which fixes issues with the results manually.
Dusan: What about a site-VO OLA template that will include core services used by the VO?
Tiziana: Better not to have a site-VO OLA, but rather one between collections of VOs and EGI/NGI; NGIs provide services, and EGI also provides some services.
Dimitris: VOs tend to focus on cores/storage requirements.
Tiziana: Availability is not just an operational measure, it also has to capture the real quality perceived by the real user. So we could evolve the concept: VOs may use a subset of NGI services. Also there have been complaints that the ops VO tests do not represent actual quality.
Helene: VO specific availability and VO specific tests are available on the dashboard, at least for LHC VOs, but there is no workflow for them. So how should operations handle such VO specific tests?
Dimitris: Provide the VOs with tools, but let them handle them as they see fit, as there are quite a lot of VOs.
Helene: EGI supports VO communities.
Dusan: Provide the tools, let the VOs handle them.
Gergey: There should be some common test suites provided by SA1, and VOs deal with their own tests, but at what cost (in terms of VO training)?
Helene: The VO tests plug-in for LHC is there, but there is no workflow for how to handle its alarms.
Tiziana: ops tests apply to all production sites. Some probes are available only to VOs and Tier1 sites (like VO box).
Gergey: TNA1.3 VO services: there is effort for VO core services. Only the dashboard is mentioned there. We could identify additional tools that VOs can use, and give them to VOs.
Dimitris: The complex-systems VO is using its own Nagios with its own probes, after instructions were provided to them by the local Nagios expert.
Helene: The biomed community also has a dedicated Nagios. But operations procedures for VOs could possibly overlap with NGI procedures.
Gergey: When did biomed start their own monitoring?
Helene: This was started voluntarily at the end of EGEE III, so not sure if it is accounted for in TNA1.3.
[Returning to the point of CPU cores/storage requirements]
Vera: Put a recommendation instead.
Tiziana: The OLA is more binding.
Helene: It could be in the GOCDB so people can refer to what value is acceptable.
Tiziana: Freeze CPU/storage metrics for now, because current probes need them, and revisit them after information about ACE is received from James.
Helene: Can we add values for sites that provide only services (not CPU cores/storage)?
Mats: The current tests require storage elements / worker nodes / cores, perhaps we can change that later.
Dimitris: Keep things as they are, and see if they can be changed later based on the feedback we get from the tools.
- GGUS response time metric
Vera: Why increase it? In any case, an automated answer could fix it. Isn't solution time a better metric?
Dimitris: Many sites have one admin, who may be in meetings for half a day or busy with some other task. Automated answers defeat the purpose of this metric.
Tiziana: We could apply 4 hours to alarm tickets, and 8 hours to the rest. The 5 days limit for solution is not for every problem. Middleware issues may prevent solutions within the limits. The response time is easier to monitor, and usually the default answer is not canned, but a request for additional information. Suggestion: alarm tickets for urgent tickets.
Helene: Some VOs other than LHC should be able to use alarm tickets.
Dimitris: What ensures that alarm tickets will be handled better than normal tickets?
Tiziana: It is mostly about user awareness regarding the severity of the problem; also the idea is that an alarm ticket permits out-of-band communication with the admin, for example by SMS.
Helene: Also, for example, in biomed only knowledgeable people open the tickets and not any VO user. Alarm tickets have a specific response time.
Mats: Agrees to increase to 8 hours; alarm tickets are maybe out of the scope of the site-NGI OLA.
Helene: Will try to get information about how biomed works and provide it.

30 minutes break

- Decrease threshold for new sites. Instead allow a grace period where suspension will not take place
Dusan/Vera/Dimitris: Removal of suspension penalties for 3 months is more acceptable.
Tiziana: A site would not get suspended anyway in 3 months if it was below 50%. Certification procedures should make the need for a grace period redundant anyway, especially in case certification was long. A short certification procedure however may justify some grace period.
Dimitris: Since the site will not get suspended anyway during the 3 months, allow grace in the case of underperforming sites: advise NGIs to close such tickets immediately for newly certified sites.
- NGI meetings participation metric: If an NGI holds an internal operations meeting, require participation
Consensus that this is better left to NGIs as an NGI specific metric.
- Discontinuous availability of resources
Dimitris/Tiziana: Should not be treated in a special way, normal site startup/certification/closure procedures apply.
- Increase availability/reliability thresholds for sites to 80/85 %
Suggestion by Tiziana: It would be challenging to make that change now. However, the criteria for suspension could be increased to 70/70 respectively.
Dimitris: Will check the impact that change could have on the number of sites suspended, in case the impact is too great, and also get in touch with Kostas, since the SEE ROC has many small sites coming from SEE GRID that could be affected.
- Security related requirements
Dimitris: The current OLA already requires having security and other contact information entered into GOCDB and kept up to date.
- Timeframe for middleware updates
Tiziana: UMD roadmap at https://documents.egi.eu/document/100 . Middleware components advertise version information in the Information System about various middleware subsystems. There was a procedure for phasing out middleware releases at the end of EGEE III that Tiziana will look into adapting for EGI. The OLA could be rephrased to mention middleware that is in line with the UMD roadmap instead of "EGI endorsed middleware". It would be useful to have a tool that can monitor the exact middleware version a site or its nodes are using; this will have to be phrased properly and requested from Daniele (JRA1) no later than October 20.
- Clause for sites supporting only ops VO for extended periods:
Dimitris/Mats: Perhaps accounting could do that, but only for computing resources; currently there is no storage accounting or any other method of accounting.
Tiziana/Helene: The CiC portal may contain such functionality.
Rephrase the OLA so that a site supports at least one "non monitoring" VO.
- Will try to find a suitable date for the next EVO, as the agenda was not exhausted
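
As a rough illustration of the availability/reliability figures discussed in these minutes (proposed 80/85 % targets, a suggested 70/70 suspension criterion, and a 3-month grace period for newly certified sites), a check could be sketched as below. This is only a sketch: the names `MonthlyFigures` and `classify`, the rule that either metric falling short is enough, and the way the grace period is applied are assumptions made for the example, not something agreed at the meeting.

```python
from dataclasses import dataclass

# Figures quoted in the discussion; how they combine is an assumption.
TARGET = (80.0, 85.0)      # proposed availability/reliability targets (%)
SUSPENSION = (70.0, 70.0)  # suggested availability/reliability suspension criteria (%)
GRACE_MONTHS = 3           # grace period suggested for newly certified sites


@dataclass
class MonthlyFigures:
    """Hypothetical container for one site's monthly figures."""
    availability: float              # percent
    reliability: float               # percent
    months_since_certification: int


def classify(site: MonthlyFigures) -> str:
    """Return 'ok', 'below_target', 'grace' or 'suspension_candidate'."""
    # Assumption: falling short on either metric is enough to trigger a level.
    below_suspension = (site.availability < SUSPENSION[0]
                        or site.reliability < SUSPENSION[1])
    below_target = (site.availability < TARGET[0]
                    or site.reliability < TARGET[1])
    if below_suspension:
        # Assumption: newly certified sites are not suspended during the grace period.
        if site.months_since_certification < GRACE_MONTHS:
            return "grace"
        return "suspension_candidate"
    if below_target:
        return "below_target"
    return "ok"
```

For instance, `classify(MonthlyFigures(65.0, 90.0, 1))` falls in the grace period under these assumptions, while the same figures for a site certified a year ago would make it a suspension candidate.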