Minutes UNICORE integration task force 2011-03-08 10:00-12:00

Minutes belonging to the agenda under https://www.egi.eu/indico/conferenceDisplay.py?confId=409

Presence:
Michaela Lechner ML
Krzysztof Benedyczak KB
Vera Hansper VH (mikeless)
Marcin Radecki MR
Esteban Freire Garcia EG (IberGrid)
Foued Jrad FR
David Meredith DM
Peter Solagna PS

1) Last meeting's minutes
https://www.egi.eu/indico/materialDisplay.py?materialId=minutes&confId=309 were accepted without comments.

2) Announcement of our new mailing list
https://www.egi.eu/sso/groupView/unicore-integration-tf
unicore-integration-tf@mailman.egi.eu

3) Going through last meeting's action points
https://wiki.egi.eu/wiki/UNICORE_integration_task_force

a) AP: Find out more about the EMI Registry and the static info contained: What is the distinction between the static data in GOCDB and the EMI Registry, and how can we propagate downtime info for just one service endpoint (URLs are needed to distinguish between different instances)?
Progress: ML will report on her email discussion with Laurence Field and plans to push the GLUE 2.0 use case for propagating downtime information.

ML: One of the action points was, for example, to find out more about the planned features of the EMI Registry concerning static information. I had some conversations with Laurence Field, who didn't allow me to quote the email conversation on the wiki, but I think I can say this: With the Information System and EMI Registry, the information source is the Grid service itself, which provides information on what is installed/available on the infrastructure. The EMI Registry should only focus on static information, that is, information that stays the same for the lifetime of the service, such as service type and ID. As such, aggressive caching policies can be used to ensure the information is available even when a service is unstable or down.
Services should explicitly de-register to remove themselves from the registry, so the main purpose of the cache time-to-live is to avoid the registry being polluted with zombie registrations when a service is removed from the infrastructure uncleanly. So we will have to register and de-register services in the EMI Registry, and it focuses on static information; that already answers most of our questions. Now concerning the problem of how to propagate downtime information: the aim of the EMI Registry is to discover what services exist and a URL to find further information about each service. As part of that service information in the GLUE 2.0 model, it is possible to publish service downtime information. How this is implemented is yet to be defined. In the end my discussion with Laurence Field focused on this question of how to propagate downtime information for a service that is managed in the EMI Registry. The conclusion was that we need to push this use case for GLUE 2.0. I will try to do this within the OGF PGI working group, whose next meeting is on the 10th of March. I've already started some discussion about this with Morris Riedel, the leader of PGI.

KB: I talked to another person from the EMI Registry task force and learned that the EMI Registry is at a very early planning stage, and that it will certainly take more than one year before it is implemented so that we can register and de-register components. My conclusion would be to take it into account in the longer term, when it might be a solution for us, but we'll need something in the meantime.

ML: Foued, did you have a chance to look through last meeting's minutes, where we talked about using XML as an interim solution instead of an ugly GOCDB hack? Would that also work for Germany?

[10:17:16] Vera Hansper any improvement to the GOCDB would be welcome

FR: I'll look at last meeting's minutes and have a look at the XML discussion. Is this autoconfigurable?
KB: I can confirm that the XML is autoconfigurable.

FR: I suggest publishing an example XML document, so I can have a look at it.

ML: There was an example in the documentation we looked at last time. Krzysztof, can you please post the link to it again?

KB: Example: https://wiki.plgrid.pl/doku.php?id=pakiet3:publiczne:monitoring:service_list
Definition: http://alfred.studmat.umk.pl/~szczeles/PL-Grid/UMI-Autoconf.html

PS: I was thinking about this XML repository for UNICORE. Emir said that he could fix the Nagios part, but what would be the effect on the dashboard of using that?

ML: AFAIK the dashboard takes the monitoring information directly from Nagios; only downtime information in the dashboard is taken from GOCDB.

[10:19:40] Vera Hansper the dashboard gets info via a message passing bus, IIRC
[10:19:45] Vera Hansper from NAGIOS

PS: So if a UNICORE service that is not registered in GOCDB is down, most of the site has to be put into downtime?

ML: Last time we said that single service outages are maybe not that common, and since we are aiming for an interim solution with XML...

PS: Wouldn't it be best to keep a common line for all middleware stacks in this respect?

ML: Exactly! This is why we don't want to add an ugly hack to GOCDB which then becomes unnecessary when the EMI Registry arrives.

PS: So you are saying a hack to GOCDB is not worth it, since the EMI Registration service is about to come in one year?

DM: We actually still have the second solution, the more GLUE 2 centred solution suggested in the slides we discussed last time. We only discounted the first hack.

[10:26:46] david meredith https://www.egi.eu/indico/getFile.py/access?contribId=4&sessionId=5&resId=0&materialId=slides&confId=153
[10:21:27] Vera Hansper but you can still put a whole site into downtime?
[10:21:30] Vera Hansper David?
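For reference alongside the PL-Grid links posted by KB: a minimal sketch of how a monitoring component could consume such a service-list XML. The element and attribute names below are assumptions chosen for illustration only; the actual schema is the one defined at the links above.

```python
# Sketch: parse a hypothetical per-site service-list XML and extract the
# service endpoints to be monitored. Element/attribute names are assumed,
# not taken from the real PL-Grid service_list definition.
import xml.etree.ElementTree as ET

SAMPLE = """
<serviceList site="EXAMPLE-SITE">
  <service type="unicore6.gateway" url="https://gw.example.org:8080"/>
  <service type="unicore6.registry" url="https://reg.example.org:7778"/>
</serviceList>
"""

def endpoints(xml_text):
    """Return (service type, URL) pairs for every service entry."""
    root = ET.fromstring(xml_text)
    return [(s.get("type"), s.get("url")) for s in root.iter("service")]

for svc_type, url in endpoints(SAMPLE):
    print(svc_type, url)
```

The point of the split discussed below is that only these per-endpoint URLs would live in the XML file, while topology and downtimes stay in GOCDB.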
[10:21:46] Vera Hansper because, if there's a system wide problem, you really want to be able to do that
[10:31:04] david meredith in gocdb you put a site's services into downtime (not a site), to (effectively) put a site into downtime you put all its member services into downtime
[10:31:50] david meredith but there is an action to add a flag to indicate that all SEs of a site are in downtime to make this explicit
[10:33:41] Vera Hansper thanks david

ML: Would that mean, for this second solution, that we would be waiting for a fuller GLUE 2 definition, like EMI?

DM: No, it is not a full GLUE 2 implementation. This solution could be implemented in maybe 3 months of work. However, I'm working on higher-priority things at the moment.

ML: But that means we'll reopen the GOCDB requirements. I'll put this down as a new action point. What would be a realistic timeline?

DM: Let's say 6 months.

KB: We can expect full integration of the EMI Registry at the end of the EMI project, in 3 years.

ML: So that means the second GOCDB solution would in any case be worth it.

KB: Foued, have you seen that I posted the links? The first link is to the example; the second link contains the definition and explanation of the example.

FR: Who should take care of the information in the XML file?

KB: That would be the site admin. I agree that this is the biggest problem in this approach, but it is just a temporary solution.

FR: We'll have the same problems as with the BDII in gLite.

EI: Shouldn't Nagios fill in this XML stuff, and not every site?

KB: But then the site admins would have to tell the Nagios admin that their site will be in downtime, and new information would have to be added when new services have been installed.

EI: We'll use GOCDB for the downtime and not the XML file. Concerning the second part of the question about adding new ServiceEndpoint URLs: these are the exact URLs that have to be monitored.

KB: Would it be possible to split it this way?
We would have to check how to implement it this way, with part of the information in GOCDB and part in the XML file, since for UNICORE in Poland we have everything, including the downtime information, in the XML file.

EI: I think we agreed that some high-level stuff should be in GOCDB, like topology and downtime information. We depend heavily on GOCDB not only for downtime information but also for availability calculations.

FR: I agree with EI to use GOCDB as the downtime source and not to use XML for downtimes as well.

ML: Good to have this point clarified. I think we can go on to the next open action point:

b) AP: ML to test sending a GGUS ticket to UNICORE:
See ticket https://gus.fzk.de/dmsu/dmsu_ticket.php?ticket=68192 Who got it?

KB: I didn't get the ticket, and I did a similar test: I asked a colleague (Rafal), but I didn't receive any email and I am not even allowed to look at the ticket. I wasn't informed about either of these tickets.

ML: What was the ticket number of the one your colleague wrote?

[10:35:51] foued Jrad what is the ticket number?
[10:36:58] Michaela Lechner https://gus.fzk.de/dmsu/dmsu_ticket.php?ticket=68192
[10:37:54] Krzysztof Benedyczak https://gus.fzk.de/ws/ticket_info.php?ticket=68177

KB: And this ticket should definitely have landed on my desk, since it was about UVOS, which is one of the components handled by my product team in EMI ("UNICORE Security"). It is a real ticket that I have to handle.

ML: Okay, I put down a new AP to ask Torsten from GGUS, or somebody else from GGUS, why you didn't get the ticket and who should have gotten it.

ML: Going on with the next APs:

c) AP: ML to send EI minutes of the GOCDB dedicated phone conference
That was done.

d) AP: DM to rename the first 3 UNICORE service types.
unicore6.registry
unicore6.UNICOREX
unicore6.gateway

DM: I haven't had time yet to rename the service types. It shouldn't take too long.
ML: I'll update the AP so that we send you another reminder within a week.

DM: I'll give it priority on my list.

ML: e) AP: KB contacting Germany through ticket 306 https://rt.egi.eu/rt/Ticket/Display.html?id=306
That was done. Since this ticket is a very nice way to follow our arguments and discussion, I put it down as an AP for all of us to regularly update the ticket.

ML: f) AP: official requirement towards APEL (accounting monitoring must be sufficient to decide whether all deployed middlewares of a site publish accounting data)
Nobody from accounting is here today, and I didn't do anything with that yet. This AP stays open.

ML: Continuing with the agenda:

4) Discussion of Progress

a) GOCDB
I think everything was already said there, agreed?

b) Monitoring

KB: One comment from our side: the first release candidate is finished and our colleagues did a great job testing EGI Nagios, so there are quite some changes, but for Emir the current version should be much better from the interaction point of view. I was just waiting for Emir to come back from vacation.

EI: I came back from vacation yesterday; I can start doing something productive in the second half of this week.

KB: Currently the version is just installed together with our PL-Grid probes for gLite, and according to our colleagues there are no problems.

ML: Emir, Krzysztof, if you do something this week, could I just remind you not to do it only in private communication but to update ticket 306 as well?

FR: I have a question, again about the XML file used for monitoring. The same information is published by the UNICORE common monitoring service used by UNICORE 6; why is this service not used, why do we need a new file for this?

KB: The CIS (Central Information Service) was not of production quality when we started with this; it was not stable. Maybe the latest version has improved, but half a year ago it didn't have production quality.
FR: We tested CIS in D-Grid and didn't experience any problems. Furthermore, CIS is still in development, so they can take our requirements and add further needed functionality.

KB: I agree that this is the solution we could use. It is maybe best if I send a complete summary of the problems experienced with CIS to our unicore-integration-tf mailing list later.

FR: If you need a contact person for CIS, I can give you contact information.

KB: Thanks, I have contact with the main developer.

FR: I agree it is not easy to write an information provider for CIS.

KB: I'll send a summary to our mailing list for a more detailed discussion. (ML noting this as an AP.)

FR: For a fast solution I agree to use the XML file now.

ML: Just a question: could this CIS be an alternative to the EMI Registry?

KB: CIS is not supported in EMI.

ML: Continuing in the agenda:

c) Accounting
Nothing new here, I suppose. Then for the next point, Krzysztof was so nice to volunteer to give us an overview of the status of Argus development for UNICORE within EMI:

5) ARGUS authorization framework: Krzysztof Benedyczak will give an overview of the current status

KB: As an introduction, here are some slides from Valery Tschopp at the EGI TF in Amsterdam last year:

[10:57:12] Krzysztof Benedyczak https://www.egi.eu/indico/materialDisplay.py?contribId=219&sessionId=118&materialId=slides&confId=48

Let me explain this from scratch: In a VO, the VO admin assigns attributes and membership to people, and this is controlled by VOMS, but the sites cannot influence this information. However, a site sometimes wants to control access in more fine-grained detail: for example, to ban one user from a certain VO, or to limit access to some of the resources. Argus is the solution for this purpose. Technically, Argus is a complicated beast composed of 3 daemons/service components (PEP, PDP, PAP). Many Argus instances can be stacked together to provide a hierarchical definition of access policies.
A site installs Argus with all three daemons, and the Argus authorization service can then tell whether a user is authorized to use a certain resource. In UNICORE the situation is a bit different:

[11:04:22] Krzysztof Benedyczak https://www.egi.eu/indico/materialDisplay.py?contribId=216&sessionId=118&materialId=slides&confId=48
[11:04:26] Krzysztof Benedyczak 13-16

User attribute data can easily be overridden by site admins. In Argus, users can be banned. In UNICORE there is a default policy for the user attributes: people with the corresponding user attribute can access the site and use its services. The site admin can override such a user attribute in order to say that a user is banned. So the bottom line here is that in the case of UNICORE, Argus is not a crucial service, since authorization can already be controlled in another way, but it might be fruitful to use it anyway: for example in scenarios where two middlewares are deployed together and Argus can control both of them.

ML: What is the current status of interoperability there?

KB: I'll cover your question in a moment. The current plan is that the UNICORE release in EMI 1 will support authorization via Argus. So theoretically it will be possible for site admins to regulate access. Unfortunately, this won't be useful at all yet, since the current Argus policy language is quite simple, and therefore it is not yet possible to express a useful policy for Argus. That part will be implemented later, and afterwards Argus will definitely be a useful solution; it will then be possible to use Argus for gLite and for UNICORE at the same time. In detail, UNICORE would not use Argus for authorization directly, but only query the PAP daemon for the authorization policy, which UNICORE can then use directly: the PAP format is the same as the one used in UNICORE in general, so no network call to Argus would be required at authorization time, and this way it would also work if Argus is shut down.
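The PAP-based approach KB describes can be pictured roughly as follows. This is only a sketch of the idea (fetch policies once, cache them, evaluate locally so authorization survives an Argus outage); the fetch stub, the ban-list policy format, and all names are hypothetical, not the real Argus PAP interface, which serves XACML policies over its own provisioning protocol.

```python
# Sketch: pull the policy set from the PAP, cache it locally, and make
# authorization decisions against the cache. If the PAP is unreachable,
# keep using the last known policies. Everything here is illustrative.

def fetch_policies_from_pap():
    # Stand-in for a network call to the PAP; returns a parsed policy set.
    return {"banned_subjects": {"CN=mallory"}, "default": "permit"}

class LocalAuthorizer:
    def __init__(self, fetcher=fetch_policies_from_pap):
        self._fetcher = fetcher
        self._cache = None

    def refresh(self):
        """Re-pull policies; keep the old cache if the PAP is down."""
        try:
            self._cache = self._fetcher()
        except OSError:
            pass  # PAP unreachable: continue with last known policies

    def is_authorized(self, subject_dn):
        if self._cache is None:
            self.refresh()
        if subject_dn in self._cache["banned_subjects"]:
            return False
        return self._cache["default"] == "permit"

auth = LocalAuthorizer()
print(auth.is_authorized("CN=alice"))
print(auth.is_authorized("CN=mallory"))
```

The design point is that the per-request decision involves no network call; only the periodic refresh touches the PAP.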
But this depicted solution is not implemented at all currently; it is not even in the original plans, and the Argus people don't advertise this PAP daemon. For them, the only entry point is just another Argus. I'd have to understand how Argus really works internally, so that we could do this and implement it in EMI 1.

PS: PAP allows downloading all policies from the service.

KB: So we can just query PAP; this is trivial to do in UNICORE.

PS: Within Argus there is a specific authorisation service that uses those policies.

KB: You mean PDP. It is not a specific service; it is part of any UNICORE installation container.

PS: So this component is already part of every UNICORE service? Every service has to download the Argus policy and implement it according to Argus?

KB: The current approach to using Argus within EMI is to use the PDP for each UNICORE service container, so each single service will access Argus. This takes time and can fail, since Argus can be down or the network can be overloaded, and the protocols used for things like policies change regularly.

ML: So your proposed solution, which sounds very effective, is not yet on the EMI roadmap?

KB: The EMI roadmap is still very general; it only says that UNICORE should be prepared to use Argus. In gLite, for example, the approach is already very different: an extra PIP is used which covers the PDP.

ML: We will have to make sure that interfaces are standardized, stable and clearly defined.

KB: The Argus daemons talk to each other using standardized protocols. The PIP-to-PDP protocol used in the case of gLite, on the other hand, is proprietary. So it is clearly preferable to use PDP and PAP directly, since those two have standardized interfaces. I was able to configure and query PDP and PAP from UNICORE directly. And it is probable that we'll aim for the simplest and easiest development solution within EMI.

ML: Who, besides you, is involved in the UNICORE development part for Argus?
KB: For UNICORE it is just me, maybe Piotr too. Valery is maybe also relevant for this work, since he is leading the XACML working group (an EMI working group), which is aiming at standardizing the XACML attributes used in the policies.

[11:18:44] Krzysztof Benedyczak https://twiki.cern.ch/twiki/bin/view/EMI/EmiJra1T4XACML

ML: Are there tendencies to go bigger? Are there plans to make this into an international standard later on, for example in OGF?

KB: Not yet; it needs to be defined more clearly first.

ML: I agree. I have no further questions at the moment. Thank you for this nice overview, and I hope you can keep us updated on the status of Argus UNICORE development in the future!

ML: Next point in the agenda:

6) UNICORE Best practices
UNICORE is not represented yet in the Best Practices wiki.

a) Promote the existence of the Best Practices working group:
Regular monthly phone meetings of the Best Practices group; the next one is today in the afternoon: https://www.egi.eu/indico/conferenceDisplay.py?confId=413
Somebody with UNICORE experience would be more than welcome!
https://wiki.egi.eu/wiki/Operations_Best_Practices
Meetings of the operational documentation group: https://www.egi.eu/indico/categoryDisplay.py?categId=28

b) Do we already have some best practices we would like to share? Who could join today?

KB: I thought these BPs were more operations-specific, not middleware-specific. Current BPs focus on putting a site into downtime and ticket escalation procedures.

ML: I could not come up with something explicitly UNICORE-specific at first glance either; maybe Vera could enlighten us a little more on what would be expected from us there? But can we not think of at least something that would be interesting to share amongst us?
[11:21:47] Vera Hansper and there is a meeting this afternoon at 13:30
[11:22:23] Vera Hansper a best practice shouldn't be invented just for the sake of it
[11:22:38] Vera Hansper but more a need of something that is a useful way to do something
[11:22:47] Vera Hansper that is out of the scope of standard procedures

FR: How to install more than one UNICORE service on one host could be an example.

[11:24:54] Vera Hansper the best practices area is for any procedure
[11:25:08] Vera Hansper of course
[11:25:14] Vera Hansper or - maybe in a procedure
[11:25:49] Vera Hansper there is a bit of a blurred line between procedures and best practices
[11:26:16] Vera Hansper currently, my group is focused on ROD related procedures (operations)
[11:26:26] Vera Hansper BUT, we will have links to relevant areas

KB: I have no extra time to join this meeting, but as far as I know there is some effort in UNICORE to provide documentation on how to deploy UNICORE in production and use it in a simple way. I can keep the best practices in mind, so when we have some documentation ready, we can share the links. Would you add the links?

[11:27:13] Vera Hansper yes, we will do that

FR: We have no documentation now; maybe when we have some, we can provide the links. Do we have permission to enter new BPs?

ML: Previously there was a dedicated suggestion page; now I only see the mailing list promoted. Vera, what is the currently recommended way?
[11:27:22] Vera Hansper the wiki area is still developing
[11:27:36] Vera Hansper to submit a best practice - you need to send to the mailing list
[11:27:45] Vera Hansper and it will be moderated
[11:28:05] foued Jrad ok
[11:28:31] Vera Hansper https://wiki.egi.eu/wiki/Operations:Best_Practices
[11:28:38] Vera Hansper but this will probably change
[11:28:45] Vera Hansper You can suggest/ask for new Best Practice by: * sending a mail to mailto:operational-documentation-best-practices@mailman.egi.eu
[11:29:13] Vera Hansper the wiki will be (most likely) https://wiki.egi.eu/wiki/Best_Practices
[11:29:21] Vera Hansper in the future

FR: I don't have time today in the afternoon either.

ML: I'm sure Vera can report a little of our conclusions here today to the BPs meeting in the afternoon, and ask them if they would be happy with the links we then suggest via the mailing list.

[11:29:57] Vera Hansper OK
[11:30:09] Vera Hansper yes, send the mailing list the info

ML: So we are almost finished for today. When shall we have another meeting?

7) Next meeting
Shall I make a doodle poll again?

[11:30:24] Krzysztof Benedyczak doodle!
[11:30:30] foued Jrad doodle
[11:30:38] Marcin Radecki yes, doodle

ML: Okay, that is settled then. I'll make another doodle poll, this time with the Global Time Zone setting and the "maybe" option enabled.

8) AOB
If there is nothing, thanks everybody for joining!

== Open action points after the meeting: ==

*AP: Find out more about the EMI Registry and the static info contained: What is the distinction between the static data in GOCDB and the EMI Registry, and how can we propagate downtime info for just one service endpoint (URLs are needed to distinguish between different instances)? (contact Laurence Field)
:Progress: ML will report on her email discussion with Laurence Field and plans to push the GLUE 2.0 use case for propagating downtime information.
: Update: EMI Registry at a very early planning stage; XML solution in the meantime.
Second solution for GOCDB (the GLUE 2.0 based approach) again in discussion. ML to bring forward and discuss the use case in the next OGF PGI WG meeting.
*AP: Reopen the GOCDB requirement ticket for EndPointServiceURL and ask for the second solution with a proposed timeline of 6 months.
*AP: ML to test sending a GGUS ticket to UNICORE:
:See ticket https://gus.fzk.de/dmsu/dmsu_ticket.php?ticket=68192 Who got it?
: Update: https://gus.fzk.de/dmsu/dmsu_ticket.php?ticket=68177 was produced as another test ticket. Check why nobody got the tickets!
: Progress: Ticket towards GGUS created: https://gus.fzk.de/ws/ticket_info.php?ticket=68354
*AP: DM to rename the first 3 UNICORE service types.
:unicore6.registry
:unicore6.UNICOREX
:unicore6.gateway
: Update: remind DM occasionally to put this at the top of his priority list.
*AP: for everybody (especially EI, KB and FR): keep https://rt.egi.eu/rt/Ticket/Display.html?id=306 updated.
*AP: official requirement towards APEL (accounting monitoring must be sufficient to decide whether all deployed middlewares of a site publish accounting data)
*AP: ML to make a doodle poll for the next meeting, this time with Global Time Zones and the "maybe" option enabled.
*AP: KB to send a summary of CIS: why it is not supported by EMI and why the XML approach is advantageous monitoring-wise.
*AP: KB and FR to send documentation links to operational-documentation-best-practices@mailman.egi.eu as soon as they are considered sufficiently complete; a suggested procedure which could be of interest in this context: how to install more than one UNICORE service on one host.