2nd MPI VT meeting 06 Feb 2012
------------------------------

Ivan Diaz Alvarez (CESGA)
Alvaro Simón [Chair] (CESGA)
Viet Tran (SAVBA)
Karolis Eigelis (EGI.eu)
Gergely Sipos (EGI.eu) (connected with Karolis)
Zdenek Sustr (CESNET)
Alessandro Constantini (INFN)
Paschalis Korosiglou (GRNET)
Enol Fernandez (IFCA)
Viera Sipkova (SAVBA)
Roberto Rosende (CESGA)
Gonçalo Borges (LIP)
Emir Imamagic (SRCE)
Vania Boccia (GRISU)

Apologies
---------
John Walsh

Greetings & Roll Call
---------------------
* Alvaro: This is the 2nd meeting of the VT.
* Alvaro: Please check the status on the wiki page.

TASK 1: MPI Documentation (Enol & Paschalis)
------------------------------------
* Alvaro: Most of the work seems to be done. There are comments from Zdenek.
* Enol: I will work on the comments from Zdenek and have the admin guide finished by next week.
* Karolis: Asks about the user guide.
* Enol: After finishing the admin guide, I will start on the user guide.
* Alvaro: The priority was the admin guide. Comments from Esteban were also included. For the next versions, perhaps it can be tracked with GGUS. Everyone OK?
* Karolis: OK with the priority.
* Alvaro: There was an action for Enol to check documentation changes.
* John Walsh (Offline): The SU has now been set up, but as you are aware, it has been requested that we use individual e-mail addresses rather than the MPI-VT list address. I will see if we can get a list dump, and ask the GGUS about adding the individual members.

Probes (delayed to end of meeting)
----------------------------------
* GGUS tickets for sites not publishing correctly are opened by hand, but there are plans to use a probe to check automatically.
* Emir, John Walsh and Gonçalo are involved in this task.

TASK 4: Accounting
----------
* Ivan: No progress on this; we have had no news from John.
* Alvaro: There was a request from Tiziana, since MPI reporting is high priority for some NGIs.
< Gonçalo connects >

TASK 3: Information System
----------
* Alvaro: Gonçalo, can you coordinate the detection of sites publishing bad data along with John Walsh?
* Gonçalo: OK
* Karolis: Why are there no NAGIOS probes to check for the bug with MPI capabilities? And why not develop this probe instead of searching manually?
* Alvaro: There are no NAGIOS probes implemented yet. This task is very important. Gstat did some tests about that, but we can't open tickets based on Gstat.
* Karolis: No, the question is why it is not done yet.
* Alvaro: There is still no expert team for the development of the probes.
* Karolis: Why was this development not done (it was planned), and why does it need to be done in a VT?
* Enol: There are probes there, but they need improvements.
* Alvaro: Yes, they should be improved; we have to check their deficiencies and contact the responsible parties.
* Karolis: Why is there a manual effort instead of probes? These probes should be developed.
* Alvaro: It is only a starting point to see the status of publishing. In the future we shouldn't do that at all.
* Karolis: Who is going to certify these probes? It is a responsibility of the WLCG.
* Alvaro: We do not know the official procedure; we wanted to ask Emir about this point.
* Karolis: Is there documentation about the existing probes?
* Alvaro: Yes, they are documented on a wiki page. This documentation task is assigned to John Walsh and Paschalis. Paschalis, can you test the procedure?
* John Walsh (offline): I will start the process of opening tickets with sites regarding the PCPUT,CPUT torque issue later this week.

<>

TASK 2: Nagios Probes
----------
* Alvaro greets and asks Emir about the probes.
* Emir: Will the probes be implemented by someone? Who would develop them?
* Alvaro: In principle, John Walsh. The idea is to send an MPI job once per week to check MPI jobs using different physical WNs.
* Emir: The problem is who develops them. EMI is unlikely to develop them. I have to check.
* Gergely: It should be SA3 who implements them. The question is who will write the specification.
* Enol: We are working on the specification, and on a better version of it. This task is led by John Walsh.
* Gergely: Where is the task in the wiki?
* Enol: It is task #2.
* Gergely: We need a specification.
* Enol: We are working on that.
* Alvaro: What is the procedure?
* Emir: It is procedure 7 (post link to https://wiki.egi.eu/wiki/PROC07), only when probes are actually developed. About frequency, it can be done every week or month, it does not matter.
* Zdenek: It is important to consider whether this counts for operations and for A/R calculations.
* Emir: This should be considered separately.
* Alvaro asks about sites not publishing correctly and the opening of tickets about them.
* Emir: There is still no GOCDB support for MPI sites. The only way to check for MPI sites would be to have an MPI tag in GOCDB.
* Alvaro asks about the procedure.
* Emir: You should submit a requirement (RT).
* Alvaro puts an action on himself.
* Alvaro asks about test rejection from the sites.
* Emir: I don't think sites would reject the probes if the OTAG does not.
* Emir: Do you plan to differentiate between implementations, or expect all implementations to be included?
* Alvaro: We should check for MPI services on the CREAM-CEs. Is that possible?
* Emir: The question is which MPI implementation.
* Alvaro passes the question to Enol.
* Enol: I have to think about that; we have to consider the consequences.
* Gonçalo: From my experience, there is not much use in implementing flavors of MPI, since this does not scale, and it will explode after a time. The practical approach would be a single type, catching the flavors from BDII. Everyone should be free to use the flavor they want, so a single type should be the best solution.
* Alvaro asks Enol to investigate this situation.
* Gonçalo: Another suggestion.
All the procedures are documented, including pushing probes to production and inclusion in the A/R calculations, developed by OMB. The specification should be the priority, since all the rest should be easy to do, and we should try to help John Walsh get this task done. It is not a question of development, but rather of documentation.
* Alvaro agrees, puts an action on himself, Enol and Paschalis to assist.
* Gonçalo: Maybe there should be a dedicated meeting just for this task.
* Alvaro agrees and puts an action on himself to schedule a dedicated meeting.
* Gonçalo: Enol, is this reasonable?
* Enol: It's OK.
* John Walsh (Offline): For the MPI tests, I will see if it is possible for us to reuse the mpitests-openmpi RPM test-suite. The same tests could probably be used for MPICH2.
* Alvaro agrees that the probes are in the critical path and asks for comments.
** No comments.

Task 5: Batch System Status (Enol & Roberto)
------------------------------------
* Alvaro: You were doing testing with the batch systems. Do you have more info about this task?
* Enol: We did testing with SGE and some flavors of MPI. MPICH had problems with accounting. This is solved with a new version and documented. We tested with Torque, no issues, except the Maui problem. The EMI version is not working properly with MPI, since it only allows the use of one node per job. We need a way to distribute the new version.
* Alvaro: Yes, there is also a workaround. This is another important point. The batch system is not supported by the TP with MPI. What should we do in this case? This also applies to security issues. A third option is to compile from source or use the EPEL version. Gonçalo, what do you think of this?
* Gonçalo: This is a difficult problem. There were many complaints, and it is hard to attribute blame. In this case, the problem was not the MPI team. The first option is to compile and maintain another MAUI version. But it is an effort that is difficult to sustain over time.
* Karolis/Gergely: Why is there no support for this configuration?
* Gonçalo: This is different, since the variable is the batch system.
* Karolis/Gergely: The problem is that sites use unsupported batch systems.
* Gonçalo: MAUI is used by 90% of the sites.
* Karolis: We should consider the support situation for those sites.
* Alvaro: There are problems with EMI's lack of manpower. Perhaps the solution would be to split the maintenance, so that sites are responsible for maintaining their batch system.

<< Emir excuses himself, and promises to attend the probe meeting >>

* Alvaro puts action on himself to check this situation.
* Gonçalo: We should raise awareness of this issue. The problem is that no one is clearly responsible, since it is an interaction, in no man's land.
* Karolis: What is the problem?
* Enol: The problem is the interaction between EMI's MAUI version and MPI. There is no trusted usable version.
* Karolis/Gergely: It should be described on the wiki page.
* Enol: It is described.
* Karolis/Gergely: This problem should be put as a requirement for operations people.
* Enol: Alvaro proposed to raise this on SA1/2.
* Alvaro: I will submit this issue to SA1/2. This issue may arise in the future with other batch systems. We have to find a way to solve this in future cases. I will send a mail to Tiziana and Michel with copy to the VT.
* Alvaro: Any doubts?
* Alessandro: The problem is already reported on GGUS by D. Cessini, and Cristina Aiftimiei wrote that they are working on the problem.
* Alvaro: I know that D. Cessini submitted that ticket, but I doubt there is commitment from EMI to solve this.
* Alvaro: Enol, I understood they didn't want to work on this.
* Enol: I don't know.
* Alvaro puts an action to ask EMI and find alternatives about this.

Task 6: Gather Info from MPI sites (Zdenek)
---------------
* Zdenek: Some progress, there are resources committed from the mpi VO. We are in contact with more RP.
There was feedback from pilot users to check if all is fine. Slow progress overall.
* Zdenek: I registered the VO, but there was no formal confirmation of its registration.
* Alvaro: Can you send the endpoint of this VO so we can see its configuration? CESGA will perhaps support it in the future. More questions?
* Karolis/Gergely: Is the CPU reporting problem in the IS the same problem as the one with Maui?
* Alvaro: The problem with the number of CPUs published is a different issue, documented on the wiki, and it is not the MAUI problem.
* K/G: OK
* Alvaro: Any questions about the MPI VO? Zdenek, you should check if it is correctly published.
* K/G: Everybody should use this VO to check integration.
* Alvaro: Who is managing this VO?
* Zdenek: CESNET
* Alvaro: Members of the VT should be included.
* Zdenek: OK. I sent an invitation; there are 7 users so far, some members of the VT, others from CESNET. Please participate, I will resend the mail.
* Alvaro: Thanks.
* Alvaro: More questions?
* Zdenek: The wiki page doesn't have an end date. We should agree on an end date.
* Alvaro: Yes, we should check this. It should be no more than 3 or 4 months. We should update this.
* Alvaro puts action on Zdenek and Alvaro.

AOB & Closing
-------------
* Alessandro: About the end date. We have many things to do in three months.
* Alvaro: We have to discuss this offline. There are many things for three months. I agree we have to look at this.
* K/G: All VTs should finish in 6 months, so counting from November it should be May at most. Most efforts are spent on making MPI usable. We should finish before May.
* Alvaro: Perhaps we started too late. We will see.
* Gonçalo: In the last OMB, there were a lot of people interested in this VT. They complained they didn't know what was going on in it, and there is overlap. You may receive a request for a report from Tiziana or someone else.
* Alvaro: There was a point about the VT status at the next OMB.
* Zdenek: Not in the next, but yes.
* K/G: Perhaps Tiziana should be included on the mailing list.
* Alvaro: Yes, we can consider that.
* K/G: We should write a document with the issues to be forwarded to SA1/2.
* Alvaro: Yes
* K/G: There should be a deadline; that is more important than the end date.
* Alvaro: You are right. I can write this document. It should not be very technical, but should be based on the wiki.
* K/G: OK, it should be like the wiki, but more detailed and with technical detail.
* Alvaro: OK.
* Action to update the wiki page with technical information.
* Alvaro proposes a new meeting in 2 weeks:
* Zdenek: OK, with a doodle poll.
* Alvaro: Also a meeting about probes soon.

<>

ACTION REVIEW
--------------
* Action 1.1 (Enol): Check and update the MPI wiki to include Zdenek's comments next week. Include a users section.
* Action 1.2 (Alvaro/all): Put current MPI issues, technical information and mitigation plan into the MPI VT wiki.
* Action 3.1 (John Walsh/Gonçalo Borges): Until we have nagios probes for that, Gonçalo will contact John to open GGUS tickets to MPI sites that are not publishing batch system info correctly.
* Action 2.1 (John W./Enol/Paschalis/Alvaro): Create a new wiki section to include the specifications of the new MPI nagios probes to be developed by SA3. Follow the nagios wiki procedure to include the new probes in production.
* Action 2.2 (Alvaro/Enol): Ask for a new GOCDB requirement: include an MPI service in GOCDB. Check whether different MPI services (one for each flavour) are needed or not.
* Action 2.3 (Alvaro): Submit a doodle to schedule the Nagios MPI probes meeting.
* Action 5.1 (Alvaro): Ask about the batch system support issue in EMI. Raise this issue to EGI SA1/2.
* Action 6.1 (Zdenek): Distribute and include the new MPI VO endpoint among MPI VT members; ask MPI sites to support the new VO.
* Action 6.2 (Zdenek): Inform OMB about MPI VT status and work progress.
* Action 7.1 (Zdenek/Alvaro): Set an estimated end date for MPI VT.