3rd MPI VT meeting 29 Feb 2012
------------------------------
Ivan Diaz Alvarez (CESGA)
Alvaro Simón [Chair] (CESGA)
Karolis Eigelis (EGI.eu)
Zdenek Sustr (CESNET)
Alessandro Costantini (INFN)
Paschalis Korosoglou (GRNET)
Viera Sipkova (SAVBA)
Roberto Rosende (CESGA)
Gonçalo Borges (LIP)

Apologies
---------
Emir Imamagic (SRCE)

Greetings & Roll Call
---------------------
* Sound check.
* Alvaro: This is the 3rd meeting of the VT.
* We have two presentations, by Alessandro and Viera.
* Yesterday we had a presentation at the OMB: https://www.egi.eu/indico/conferenceDisplay.py?confId=719
* We will also give a presentation at the Munich CF.
* Alvaro: I will check the status on the wiki page.

MPI VT actions review (30')
Speakers: Alvaro Simon (FCTSG), Zdenek Sustr (CESNET)
-----------------------------------------------------
* We can check the open actions. I have to update the wiki page:
* The first task is the documentation. It was done by Enol (not connected).
* The next action is on me; it is still in progress.
* I have to ask Enol about the action to merge the user and admin documentation into a single endpoint.
* Gonçalo has submitted a proposal for changes to the Nagios probes. There is a wiki page for that: https://wiki.egi.eu/wiki/VT_MPI_within_EGI:Nagios. If you have problems, please update it.
* If you want, you can add your comments there. I have included Enol's and will ask Emir too.
* Gonçalo, do you want to add something?
* Gonçalo: We have to establish a deadline, and once we agree, we should put the requirement on SA3 to write the probes.
* Alvaro: OK, one or two weeks should be enough. What do you think?
* Gonçalo: Better two weeks.
* Alvaro: I will take an action to send a broadcast to close the proposal for SA3. Thanks, Gonçalo.
* The next action, on me and Enol (on GOCDB), is already done. We discussed it at the OMB, and we have pushed to include this in GOCDB. The RT ticket is on hold, but we can probably do it next week, after discussing it.
* The next action, about sites publishing MPI, is in progress; it is for Gonçalo and John. I hope to update this shortly, next week.
* Enol has submitted information on how to check the number of cores executable per job, but this is not published by the information provider.
* I will put an action on Roberto to check this.
* The next action is on the accounting system - no progress. We have to wait for support from APEL.
* The next action is on batch systems; we asked Cristina at EMI. In reality, Torque support is third party for EMI.
* We will have to raise this with Tiziana and the OMB. We will have to wait on this situation.
* Zdenek has included information about the VO on the wiki. Thanks, Zdenek. Is there more information about that?
* Zdenek: Asks Viera about her MPI site.
* Viera: I ran the test from our client on Friday.
* Zdenek: The responsible admin was on leave. I don't know if he informed you.
* Viera: I will try again next week.
* Zdenek: I will report this on the mailing list.
* Alvaro: We have configured a new CREAM CE with MPI at CESGA. Next week it will be in production.
* Zdenek: The BDII had problems last week, so I don't know the status.
* Viera: You can use our site, but there can be problems terminating jobs.
* Alvaro: We have not published the info in the BDII yet.

NGI_IT MPI survey (16:30 - 16:50, 20')
Speaker: Alessandro Costantini (UNIPG)
--------------------------------------
* Alessandro: The report is in the agenda.
* Alessandro begins with the first slide.
* Objective: assess the MPI situation in Italy, via a survey of operations staff.
* Italy --> 57 production sites, 19 with MPI.
* 35% of the sites declare MPI support, totalling several thousand cores.
* 90% of the resources are shared, so the figures are not totally reliable.
* MPICH is supported for fast interconnects.
* 90% of the sites that declare MPI publish the MPI tag. It is not clear whether they have configuration problems. Some experience problems with multiple libraries; better investigation is needed.
* The admins' opinion of the documentation: 45% think it is good.
* The admins have been very polite.
* The user experience must be improved, perhaps with a GGUS group.
* A restyling of the documentation is needed.
* We need to understand common configuration problems, and need to ask the sites about this.
* Speaking with the operations crew: the GLUE 1.3 schema offers something to help us. Using subclusters we can segregate MPI resources, but the WMS can only see one cluster.
* About dedicating MPI resources to a single computation: the batch system can be configured to take this into account. We can assign a number of CPUs to a VO, but there is no way to publish this in the GLUE 1.3 schema.
* The operations teams say that they don't want to make public the number of CPUs dedicated to each VO.
* Slide 11 - We need to fix the MPI and MAUI problems. There is a workaround that modifies the source code. With MAUI 3.4 the problem will perhaps be solved; we are experimenting with installing it on one site. We also discovered problems with compilers.
* SL5.7 --> gcc 4.1 --> problems compiling applications with GFortran and dynamic allocation. This is not a problem for this VT, but perhaps it should be raised here, because it can affect MPI applications.
* Another issue is the impossibility of compiling several versions of the same MPI library.
* I ran a test exercising the JDL attributes and the granularity of nodes. We required 2 nodes, 8 CPUs, SMP granularity 4. I did not get errors from the WMS. I also submitted the JDL directly to the CREAM-CE; in both cases the number of nodes is guaranteed by the batch system, but the distribution of processes can be uneven. We are not able to control how the processes are distributed, so the load balancing makes no sense. (More details in the report.)
* Alvaro: Thanks, Alessandro. NGI_IT has a lot of MPI sites.
* Alessandro: 16.
* Alvaro: That is a lot compared with other NGIs. Some sites in your NGI have MPI but are not publishing it. Do you know the reason?
* Alessandro: That is the point.
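Alessandro's granularity test (2 nodes, 8 CPUs, SMP granularity 4) could be expressed with a JDL fragment along these lines. This is only a sketch: CPUNumber, HostNumber and SMPGranularity are the standard gLite/CREAM parallel-job JDL attributes, but the wrapper script and application names are hypothetical placeholders.

```
[
  JobType        = "Normal";
  CPUNumber      = 8;    // total parallel processes requested
  HostNumber     = 2;    // "2 nodes": distinct worker nodes
  SMPGranularity = 4;    // "SMP=4": minimum processes packed per node
  Executable     = "mpi-start-wrapper.sh";    // hypothetical mpi-start wrapper
  Arguments      = "my_mpi_app OPENMPI";      // hypothetical application name
  InputSandbox   = {"mpi-start-wrapper.sh", "my_mpi_app"};
  StdOutput      = "std.out";
  StdError       = "std.err";
  OutputSandbox  = {"std.out", "std.err"};
]
```

Even when the WMS and the CREAM-CE accept such a request and the batch system guarantees the node count, how the eight processes are spread over the two nodes depends on the batch system configuration, which is the uneven distribution Alessandro reports.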
* For some sites we have no problem, but others have configuration problems with Nagios or at the site. I need to investigate those sites.
* Alvaro: OK, no problem. Another point is the documentation; any feedback is welcome. Enol has made many changes by now; I do not know if they have seen the updates.
* Alessandro: I don't think so. They use the documentation they were able to find.
* Alvaro: We want to concentrate the documentation at a single endpoint.
* Alessandro: The survey was done in January, when the docs were not ready, so we have to take that into account.
* Alvaro: We can wait for the documentation to be complete, broadcast it, and run a new survey.
* Alvaro: About the testing, it is good that the JDL gives no problems; the load balancing is probably off due to configuration issues. We could probably provide a guide on how to configure the batch system.
* Alessandro: I think the test is a normal one, with normal granularity. The problem is in how the batch system behaves. We need a better understanding of how these attributes affect the batch system; it is something we need to understand to run MPI jobs properly.
* Alvaro: I think that will be useful. Another point is the compilers: I'm not sure, but you can ship a custom compiler with the VO software. That is another point to check.
* Alessandro: A lot of MPI users compile their applications on the UI because of these problems. For the users it would be better to compile on the WNs. What you say is a good solution, but perhaps site admins don't want to do that.
* Alvaro: Yes, but it is an option. Perhaps we can add a new section. I don't know if you have more questions.
* Alessandro: One more thing: I think we should investigate what the GLUE schema can offer us. The schema is able to distinguish subclusters, but the WMS is not.
* Alvaro: As you said, we will probably have to ask EMI, since it would be very useful to us if the WMS could understand this. Another problem is the number of slots available for MPI.
But as I said, that is another workaround.

NGI_SK MPI report (16:50 - 17:10, 20')
Speaker: Viera Sipkova (II SAS)
-------------------------------
* There are applications supporting MPI-Start. The main goal is to see whether MPI-Start supports different parallel models.
* We have used two GFortran versions, with MPI-Start and CREAM.
* Viera shows several tables with different options for the number of processes, threads, SMPGranularity, etc.
* All tests (MPI, OpenMP, MPI+OpenMP) terminate with correct results.
* Alvaro: Thanks a lot. This means the JDL works fine, as Alessandro said.
* Viera: Yes; it would be interesting to set more parameters.
* Alvaro: I think good site configuration must be stressed. A guideline would be useful; it is a point we can add to the agenda. I will ask Enol about MPI_USE_OMP=0; perhaps it is configurable internally by the users. More questions?
* Alvaro: Thanks a lot.

AoB
---
No questions. There will be a presentation at the next CF. Perhaps we will do some research to learn about users' current MPI usage. Thanks for participating, see you.