Minutes Grid Operations Meeting, 8 November 2010

Attendance: 32 people

- Dimitris Zilaskos (DZ): gave a report about the group of GGUS tickets, raised by Malgorzata (MK), about new groups in the dteam VO:
  https://gus.fzk.de/ws/ticket_search.php?show_columns_check[]=REQUEST_ID&show_columns_check[]=TICKET_TYPE&show_columns_check[]=AFFECTED_VO&show_columns_check[]=AFFECTED_SITE&show_columns_check[]=RESPONSIBLE_UNIT&show_columns_check[]=STATUS&show_columns_check[]=DATE_OF_CREATION&show_columns_check[]=LAST_UPDATE&show_columns_check[]=SHORT_DESCRIPTION&ticket=&supportunit=VOSupport&vo=dteam&user=&keyword=&involvedsupporter=&assignto=&affectedsite=&specattrib=0&status=all&priority=all&typeofproblem=all&mouarea=&radiotf=1&timeframe=lastmonth&tf_date_day_s=&tf_date_month_s=&tf_date_year_s=&tf_date_day_e=&tf_date_month_e=&tf_date_year_e=&lm_date_day=05&lm_date_month=11&lm_date_year=2010&orderticketsby=GHD_INT_REQUEST_ID&orderhow=descending
  - MK: this is part of the new procedure to create new NGIs.
  - DZ: has been in contact with CERN to migrate the VOMS server, but the CERN people have been unresponsive lately, which has led to delays in this process. DZ acknowledges the tickets, has the permissions to create the new groups, and will proceed with it. Later he can re-synch with CERN without impact on the new groups.
  - Action on Mario (MD) and DK: ask Tiziana how to proceed with the communication with CERN.

- Information (Mario)
  - The SW release workflow ("NSRW: New SW Release Workflow") is being implemented and tested; see https://wiki.egi.eu/wiki/NSRW_IMPLEMENTATION_RT and the RT queue "sw-rel":
    - CA test release workflow with EGI tools (RT ticket #460).
      - Helene (HC) asks where the information about the CA release procedure in EGI can be found.
      - MD: https://wiki.egi.eu/wiki/EGI_IGTF_Release_Process ; work is in progress to integrate this into the above-mentioned NSRW.
    - Nagios is already using/testing the SW workflow (RT ticket #490).
  - Patches in staged rollout, for gLite 3.1 and 3.2 (MD):
    For gLite 3.1 the following components are under staged rollout:
    - L&B: for a long time with no EA team doing it.
    - lcg-CE and glite-CLUSTER: a problem was found in glite-CLUSTER and the patch was rejected; we are waiting for a new version, and the new lcg-CE should be released together with this new node type. glite-CLUSTER can be installed on the same machine as the lcg-CE.
    - WMS: this new version should solve an issue with the interaction with the VOMS server in gLite 3.2, found a few months ago in Ibergrid. MD and Alvaro Casani are doing this test with the WMS and a VOMS in gLite 3.2. The WMS itself seems to behave properly even with a large number of jobs (CERN test).
    For gLite 3.2:
    - ARGUS: staged rollout OK.
    - CREAM: staged rollout OK, but with some warnings in the release notes about the configuration.
    - L&B: staged rollout OK, but with some warnings in the release notes about the configuration.
    - glexec: on hold due to a problem found in the VOMS client.

- Problematic GGUS tickets:
  - H. Cordier: https://gus.fzk.de/ws/ticket_info.php?ticket=54678 - some progress here; answer from the developer:
      "So we have been investigating the problem and this is what we have found so far. The problem is related to the LDAP delete operation and the fact that indexes in the Berkeley Database are growing. This causes the query response time to increase. We have confirmed that this happens with both OpenLDAP 2.3 and 2.4, and it is also independent of the schema used. The rate of degradation depends on how dynamic sites are. For example, if there is an unstable network or sites frequently time out, the rate of degradation will be faster than if things are stable. The workaround is currently to restart the BDII when the response time becomes too large. The next step is to contact the LDAP developers to try and find a solution. Failing that, we will have to add a routine to the BDII update script that will compact the database after it has been updated."
    (A sketch of this restart workaround is given at the end of these minutes.)
    - HC: the long time frame to answer/solve the ticket is not acceptable; the answer is not at all satisfactory.
    - Tiziana (TF): there is a procedure to escalate/prioritize this type of ticket in the TCB; it goes through the DMSU -> prioritization based on impact -> TCB. The DMSU milestone is in: https://documents.egi.eu/secure/ShowDocument?docid=69
  - A. Aeschlimann (AA): https://gus.fzk.de//ws/ticket_info.php?ticket=59041 - no progress.
    - AA and MD: this is an ATLAS SAM critical test; the error happens randomly.
    - AA argues that the problem should either be corrected or the test be removed.
  - A. Paolini (AP): https://gus.fzk.de/ws/ticket_info.php?ticket=63103 - needs discussion between operations and the technology providers.
    - AP and MD: this is a ticket from the biomed VO, where some files at a site were not being transferred due to an SE downtime. There has been some discussion on how to implement downtimes of services so that they are visible in the information system, in the Status field of the service. The ticket was closed by the biomed VO manager, and a Savannah ticket has been opened to the technology providers: https://savannah.cern.ch/bugs/?74976
  - MD stressed that operations are expected to report issues with "long lived untouched" tickets.

AOB
  - TF: asked about the criticality of the "bdii freshness" Nagios test on ARC sites.
    EI: this is solved, and now only one ARC site has problems.
  - Marcin (MR): has a draft document about the procedure to introduce new critical tests.

- Information about operational tools (Emir)
  - Emir (EI): nothing to update.

- COD issues (Malgorzata, Luuk)
  - MK: nothing to update.

MD: Next Grid Operations Meeting on 22 November, 14h00 (Amsterdam time).
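The restart workaround described in the developer's answer to GGUS ticket 54678 could be automated roughly along the following lines. This is only a minimal sketch: the host, port (2170), base DN ("o=grid"), service name ("bdii"), probe filter and 30-second threshold are assumed defaults for a gLite BDII, not values taken from the ticket.

    #!/usr/bin/env python
    # Sketch of the workaround from GGUS ticket 54678: restart the BDII
    # when its LDAP response time has grown too large. Host, port, base
    # DN, service name and threshold are assumptions, not site policy.
    import os
    import subprocess
    import time

    HOST = "localhost"   # BDII host (assumed)
    PORT = "2170"        # default BDII port (assumed)
    BASE = "o=grid"      # default BDII base DN (assumed)
    THRESHOLD = 30.0     # seconds; illustrative limit, tune per site

    def probe_response_time():
        """Time a simple anonymous ldapsearch against the BDII."""
        start = time.time()
        with open(os.devnull, "w") as devnull:
            rc = subprocess.call(
                ["ldapsearch", "-x", "-LLL",
                 "-h", HOST, "-p", PORT, "-b", BASE,
                 "(objectClass=GlueTop)"],
                stdout=devnull, stderr=devnull)
        return rc, time.time() - start

    if __name__ == "__main__":
        rc, elapsed = probe_response_time()
        if rc != 0 or elapsed > THRESHOLD:
            # The probe failed or took too long: apply the workaround
            # from the ticket and restart the BDII service.
            subprocess.call(["/sbin/service", "bdii", "restart"])

Something like this could be run from cron; the database compaction routine mentioned as the longer-term fix would belong in the BDII update script itself and is not sketched here.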