Minutes UNICORE integration task force 2011-02-04 10:00-12:00

Minutes belonging to Agenda under 
https://www.egi.eu/indico/conferenceDisplay.py?confId=309

Participants:

Michaela Lechner (ML)
Marcin Radecki (MR)
Krzysztof Benedyczak (KB)
Cristina del Cano Novales (CN)
Rebecca Breu (RB)
Mathilde Romberg (MR2)
David Meredith (DM)
Peter Sologna (PS)
Emir Imamagic (EI)


ML: Welcome, this task force has been created to increase the level of communication between all parties involved. We begin with an overview of the current status.

Point 1a in the agenda, "Integration of UNICORE services into GOCDB",
was discussed during the OTAG meeting in Amsterdam
(OTAG meeting https://www.egi.eu/indico/conferenceTimeTable.py?confId=153#20110125.detailed).
There are no minutes yet, so Daniele Cesini's notes on the two requirements relevant for UNICORE are in the agenda. (Going through them.)

DM volunteers to go through the relevant slides of his presentation again.
David Meredith's slides on GOCDB status/roadmap:
https://www.egi.eu/indico/getFile.py/access?contribId=4&sessionId=5&resId=0&materialId=slides&confId=153

DM: Adopting new service types is fine. Renaming existing types should not be an issue for UNICORE.

RB: Is the naming already fixed for new service types as well? E.g. unicore6.uvos?

DM: There are no GLUE2 service_type values corresponding to the UNICORE services in there yet. The current list was proposed by KB; I can put it on the GOCDB wiki. So we can just add them as new service type values (see RT ticket 300); we just need to agree on a common name within EMI and EGI.
We should ensure that this list is also adopted within EMI and the EMI registry.

KB: This list was already discussed within EMI. I work mainly for EMI.

(Editor remark: RB and MR2 are also part of EMI.)

DM: I was in contact with Shiraz Memon (JUELICH, leader EMIRegistry project), his response was that this list hasn't been formalized yet.

KB: He is happy to adopt the same list. Maybe some problems will be identified, but we'll work from this list.

DM: They have agreed to follow GLUE2 schema, in the EMIRegistry, so it makes sense to adopt the same values.

(DM goes through slide 10 of his presentation; there are no questions and no disagreement about following the GLUE2 approach here as well.)

DM (going through slide 12 in detail): This is my understanding of the requirement. We can understand the requirement for these ServiceEndpoint URL fields in GOCDB, but there was strong opposition to it: this information shouldn't belong in GOCDB. If it is implemented in the EMIRegistry, it will be deprecated in GOCDB. The question is how to deal with this in the meantime.


KB: Excuse me, but you contradict yourself: the EMIRegistry will be updated, so if the service disappears, it will also disappear from the Registry.
Currently in BDII you have all the data about the services, so what do we actually need GOCDB for?

MR2: For the monitoring you need to have the proper URL for e.g. the UNICOREX service otherwise you can not test it automatically.


EI: GOCDB is more of an administrative database. It holds high-level tags of the infrastructure topology, but you have to go elsewhere for the details.
The idea of the registry is to store this data more permanently, so even if the service goes down it will stay in the EMIRegistry and still be visible a little bit later.

KB: First, you'll still have to look in GOCDB, so the EMIRegistry still consists of dynamic data. Second, GOCDB is not only used by people but also via different interfaces, e.g. web services, as in the case of gLite, to determine what shall be checked by Nagios probes etc. If we want that information, it won't be in the EMIRegistry.

DM: Laurence Field gave a presentation at the EMI F2F in Prague: the EMIRegistry will include both static and dynamic information.

KB: If so, fine. Then for the time being we can just ignore GOCDB.

ML: Emir, you will be able to get monitoring information from the EMIRegistry as well, right?

EI: Currently we have a model that gets me needed information from the BDII. For the EMIRegistry this will be the same. It is maybe a minor thing to fetch the information there. Once the thing is there it shouldn't be a problem to develop such a model.

MR2: That can only work if the EMIRegistry does not only include dynamic information.

ML: We clearly need confirmation from Laurence Field that the EMIRegistry will really also contain static information.

KB: Which static info will be included in the EMIRegistry?

MR: Do we need GOCDB in EMIRegistry area?

EI: You will still need it for the administrative data. EMIRegistry is a service discovery service not for administrative purposes and structure of NGI information. Ports and service URLs on the other hand should go into EMIRegistry.

MR: We need a distinction between the static data in GOCDB and the static information provided by the EMIRegistry. It would be useful to confirm with the EMIRegistry team that it will really publish static information, and how this works in detail. What should be there? We should not fixate on the URL; the URL is just the location of a service, which should be in place.

EI: What should be there: name of the queue, port for the individual service, ... too detailed to go into GOCDB.

KB: Currently GOCDB is used by administrators to record that a service is in downtime. Assuming that this functionality stays in GOCDB, if this information is not in the EMIRegistry, how are we supposed to enter it? Will there be a link from GOCDB to the EMIRegistry? How do we do this?

ML: Good point. Let's make it an action point to collect more information on the EMIRegistry (how does the static info there differ from the static info in GOCDB; downtime for services?).

EI: Do you want to put just one service in downtime? Normally a whole site is put in downtime.

KB: It is not sensible to put a whole site in downtime if only one service is affected.
See the example for UNICORE 6 on slide 12 of David's presentation, which is confusing, since in production you won't have such a URL in practice.
In the example you use https://host:8443/services/TargetSystemFactoryService?res=default_target_system_factory
but in the real case it should be https://host:8443/VSITE/services/TargetSystemFactoryService?res=default_target_system_factory 

Typically, one Gateway machine is the entry point for the whole organisation. Then comes an administrative string, and depending on this first part the gateway redirects the request to the right host. A site can have e.g. two clusters and want to put only one of them in downtime. The VSITE distinguishes which host/cluster is really meant.

EI: Could you enter all VSITEs in GOCDB instead?

KB: Yes, this would solve most problems (90%). However, the last part after "res" can carry additional arguments, which is a problem if we don't have enough information about them. So to be future-proof we need at least one more thing. There can also be several gateways on the same site for load-balancing reasons.
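
(Editor remark: the URL layout KB describes can be illustrated with a minimal Python sketch. It assumes that everything in the path before "services" is the VSITE name, that the service name follows "services", and that the query string carries "res" plus possibly further arguments. The function name split_unicore_url and the placeholder host are illustrative only and are not part of UNICORE or GOCDB.)

```python
from urllib.parse import urlsplit, parse_qs


def split_unicore_url(url):
    """Split a gateway URL into gateway address, VSITE, service and arguments.

    Assumed layout, taken from the example in these minutes:
    https://<gateway>:<port>/<VSITE>/services/<ServiceName>?res=<resource>
    """
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    if "services" not in segments:
        raise ValueError("no 'services' path segment in URL")
    idx = segments.index("services")
    # Everything before 'services' is treated as the VSITE; None if absent.
    vsite = "/".join(segments[:idx]) or None
    service = segments[idx + 1] if idx + 1 < len(segments) else None
    return {
        "gateway": parts.netloc,
        "vsite": vsite,
        "service": service,
        # parse_qs keeps any additional arguments that follow 'res'.
        "args": parse_qs(parts.query),
    }


# URL as shown on the slide (no VSITE) versus the production form.
slide = split_unicore_url(
    "https://host:8443/services/TargetSystemFactoryService"
    "?res=default_target_system_factory"
)
real = split_unicore_url(
    "https://host:8443/VSITE/services/TargetSystemFactoryService"
    "?res=default_target_system_factory"
)
print(slide["vsite"], real["vsite"])
```

(With this split, two clusters behind the same gateway differ only in the "vsite" field, which is why the gateway address alone cannot identify the endpoint to put in downtime.)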

ML: Emir, we have discussed this in detail already in the dedicated GOCDB phone conference when we defined the requirement. I'll refer you to it. 
( https://www.egi.eu/indico/conferenceDisplay.py?confId=229 )

ML: Going to the next point in the Agenda:
 b) Monitoring
(reads the details from the agenda)

KB: Yesterday I sent an email with an update: we are basically done with the whole implementation of the EGI NAGIOS probes. Currently our colleagues from PL-Grid are testing it. Some minor to-dos are left.

EI: What is this service state refresher? It looks like a broker module?

KB: It is explained in the general documentation of the Autoconf module, which has a big readme. There are two options to deploy the system. One (not suitable for EGI NAGIOS) is with the -x option; the other is without the -x option and with this module, which greatly speeds up the status updates of dependent plugins. It is only used for UNICORE NAGIOS probes.
We are testing it right now in production environments; before, it was tested in non-production environments. The quality of service drops without the refresher; you would have to run many more tests.

EI: How do you limit it to UNICORE instances only?

KB: I will have to forward this question to the person doing the implementation.

EI: Ah, I've actually found it in the documentation now. When you say you need to store your certificate, you mean the proxy certificate, right, not the user certificate?

KB: It is a robot certificate.

EI: What if I don't have robot certificate? Most of the NGIs don't use robot certificates, there are CAs that don't trust and support robots.

KB: You need a certificate to run NAGIOS; then you can use the normal certificate.

EI: No, that is a tricky one.

KB: Yes you can, according to EUGridPMA Documents.

EI: This depends on the CA's CP/CPS; e.g. under our CA's CP/CPS you are not allowed to leave your user certificate unprotected on a machine. This is why we came up with the myProxy solution and a special check of the VO attribute. Otherwise it is not safe and not compliant with the CP/CPS. In theory someone could break in and steal the user certificates, which would give them access to all kinds of resources. So this is one big security concern for EGI.

KB: If somebody breaks into the machine, he can equally steal the myProxy certificate.

EI: The whole idea of myProxy is to limit the possible effect. Leaving this security discussion aside, I still think this is a problem.

KB: We can easily provide a solution for you if you want a certificate on the NAGIOS box. It is absurdly safe. It solves the issue much better than myProxy, but you will still need a certificate on the NAGIOS box, which can have no privileges in itself. This is a trivial matter of configuration without any development, but it is quite complicated to explain; I'll write an email with the answer to this question.
Basically you just add one command-line option. Then, on the box where you really have your certificate or where your proxy for monitoring is created, you can generate a cross delegation assertion that you send to the NAGIOS box.

ML: What is the planned timeline now to integrate the Polish probes into EGI NAGIOS, will you work on it together this month?

EI: No, I will be going on holidays. I will start this week, but I won't make it by the end of next week. I will be away for 3 weeks, coming back on 7th of March. Unfortunately, no one else on the SAM team will take over this task, since everybody is working on other things. The realistic timeline is that I start working in March; by that time the Polish probes will have provided the first test results, so we should make it for the March release (end of March).

ML: Going on in the agenda: has someone already tested the current status in GGUS? Maybe there is not much to test besides writing a ticket and seeing whether it is assigned to the UNICORE people.
Let's go back to the discussion. Is it time to add new UNICORE 6 service types in GOCDB? Have we really decided on the naming now? Are the current probes provided by Poland also fine for Germany? I see for example no probes for SIMON here. Is something else missing?
In GOCDB, should we rename the existing service types to the GLUE 2 scheme?

DM: This would need wider agreement, but it is easily doable in GOCDB.

ML: So you think it is not advisable?

DM: Happy to do it, but it has to be widely agreed and known.

ML: Currently it should only affect monitoring, right?

EI: If we are only talking about UNICORE, we will not affect anyone, since no one is using it yet. We will be perfectly safe to switch to GLUE2.

ML: So that means it should be fine for renaming. Somebody against it? Germany?

DM: Ok, I will rename the three service types in GOCDB.

ML: The new names would be:
unicore6.registry
unicore6.UNICOREX
unicore6.gateway

DM: UNICOREX has no equivalent, so presumably yes. Can that be confirmed?

KB: UNICOREX is a container, not a service; therefore it has no equivalent.

ML: So we decide to rename it and you fix the probe configuration then in March.

KB: Nothing to be done for the current probes, this information is entered in xml.

EI: I would need a list of all UNICORE sites. I go to GOCDB to get a list, then I go to the ATP database, so I would need to know that I have to go elsewhere. So we need to work on the details.

KB: In PL-Grid it is our first priority to get monitoring running for the UNICORE sites. We think it is easiest to maintain static information in one place, like one XML file that administrators can edit. That is easier than distributing information a little bit everywhere, like in GOCDB and somewhere else.

ML: We have to vote first on whether we really now go for this solution of using XML until the information can be entered into the EMI registry. Is the XML solution sufficient, or do we go for the interim solution David proposed?

EI: XML is tricky: we can't release UNICORE probes and say this will only work for PL-Grid. Germany will object.

KB: XML is not a PL-Grid specific format. We can include German sites as well. The documentation of the format is at
http://alfred.studmat.umk.pl/~szczeles/PL-Grid/UMI-Autoconf.html (there is an example of the XML data provider; it is not PL-Grid specific).

EI: Ah, I now realize that you were actually doing my job! You are actually generating NAGIOS configuration. My job was to create configuration for you with a consistent set of emails and so on.

KB: Merging our configuration with your configuration is very nice future work.

EI: Krzysztof, you should have contacted me earlier on this... I would still like to create all the config.

KB: That is fine, that was what we were aiming for.

ML: Hopefully we are even faster now! To continue the discussion I would like to have some input from Germany here, Rebecca, Mathilde? (no answer) It seems that we have lost Germany. Will ask them again via e-mail, otherwise it will be unfair, but it looks like a viable solution.

MR2: (Input from Mathilde via e-mail around that time:
As you cannot hear us any longer, I want to say that it would be a good
idea to look at the NAGIOS UNICORE implementation running in NGI-DE in
production since some time now.
)

ML: What should then be the next service types that should be added to the GOCDB?

KB: Without a clear answer to the questions in the open GOCDB requirement tickets and to the downtime problem, I don't think we should add anything new to GOCDB. That wouldn't make sense. Currently it is only sensible to add the sites in GOCDB.

ML: That's a valid opinion! So we possibly wait with that decision until a later meeting. So for the monitoring, is it a problem that there are probes for services that don't have a corresponding service type in the GOCDB right now?

EI: If we use the XML solution this is not a problem. The only info I need is: Use the XML file for this site in the GOCDB in the bootstrap. So from bootstrapping the XML file I then get all the information. This would also solve the problem with what services should be in the GOCDB right now.

ML: That's good. I feel we need some more coordination on this point with NGI Germany. Krzysztof, did you happen to have a look at the German NAGIOS UNICORE 6 sensor probes?

KB: There were some links, but unfortunately I was not able to log in there. I can contact them.

ML: Maybe we can do this officially through the ticket https://rt.egi.eu/rt/Ticket/Display.html?id=306 which has not been updated for a while. Can I put this Actionpoint on you?

KB: Sure, if you would assist me.

ML: If there is nothing more on GOCDB and monitoring I would like to go on to accounting.

(Over the chat:)
[11:59:52] Marcin Radecki at 12.00 I should join another meeting, so have to go right now, bye!
[12:00:06] Krzysztof Benedyczak cześć
[12:00:07] david meredith i have to go. thanks. bye 


ML: We have heard that NGI PL has a locally working accounting solution.

KB: Yes, data is also stored in the OGF UR record format; it is extracted from both the batch system and the UNICORE CE and merged. We are currently deploying it in production inside PL-Grid. It was tested for a long time during the last months. The communication between the UNICORE part and the central part of the PL-Grid project is done using JMS. According to John Gordon, it seems to be okay to move on towards EMI.

ML: Cristina is here to replace John Gordon today and advise us. On his roadmap slides he has a point on collecting requirements for integrating other NGIs' accounting systems until the end of April. Do we need to collect and specify our requirements there? And one more question, first to you, Krzysztof: are you using the same extension to the OGF UR format as John Gordon uses, the one APEL suggested in the UR working group in OGF as an extension for OGF UR 2.0?

KB: Just yesterday I was put down as UR contact point for UNICORE in the UR working group inside EMI. This group inside EMI has just started. The task has moved from the infrastructure to the compute area. Our system is really pluggable: internally we use a proprietary format, and we use a plugin to translate it into the OGF UR format. Even if there are some differences, it is not a big implementation issue for us.

CN: About our current state: we receive records both from sites and regions. We distributed the format and want to get approval for it; we can then discuss how to translate your format into our format.

KB: It was sent yesterday, I was not able to read it yet.

CN: We can keep in touch that way.

KB: In general the EGI roadmap in accounting is important to us.

ML: Please look at the accounting roadmap link in the agenda. My question to Cristina here is: should we provide something from the UNICORE side? What is expected from us and needed?

CN: I am not sure what exactly is meant on the slides either. But what we will need is your job record format and what sort of data you would like to send to the central accounting server: will you send job records, will you send summaries? Those are the important questions I can think of right now.

ML: Should we also say whether we want to use the RUS interface and whether we want to send only the aggregated UR format (as required from SGAS)? Which should be in development anyway, right?

CN: Yes, SGAS and some NGIs like Italy send us aggregated data. At the moment, this is done via MySQL directly; in the near future we will move to ActiveMQ. I guess UNICORE will be in exactly the same situation.

CN: If there are any other questions, I am happy to be contacted personally or over the APEL support mailing list.

KB: I have one more question: I am not aware of the political goals of accounting in EGI. What do we want to achieve?

ML: For the accounting it would certainly be nice to have a common OGF UR version 2. On the more technical side, it might be nice to open an RT ticket to APEL anyway describing what the UNICORE accounting data looks like, to have an official place for the discussion and collection of requirements.

KB: Yes, technically we are all already on a good track with using the same solutions. However, what are the goals for the accounting database in the European space? Will there be two different accounting graphs for the same site?

MR: We are representing the Polish sites. Each NGI will have some autonomy. It is a goal to provide a regional version of the accounting portal. Therefore the publication schemes will change. From our side we'll maintain our own accounting solution.

ML: There might also be legal issues, so some NGIs will only be allowed to publish aggregated data into the European space.

CN: For the regional accounting server it is our requirement that the NGIs will be able to filter the data they want to publish into the central APEL accounting server by site and by (local) VO. In the APEL client it is possible to configure that single user records are not published.
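
(Editor remark: a minimal Python sketch of the filtering and aggregation CN describes. The record fields "site", "vo", "user" and "cpu_hours", the site and VO names, and the function name summarise are hypothetical; this is not the APEL record format or client configuration.)

```python
from collections import defaultdict


def summarise(records, publish_sites, publish_vos):
    """Aggregate job records per (site, VO), keeping only allowed sites and VOs.

    The user field is deliberately dropped, so no single-user records
    leave the regional server.
    """
    totals = defaultdict(lambda: {"jobs": 0, "cpu_hours": 0.0})
    for rec in records:
        if rec["site"] not in publish_sites or rec["vo"] not in publish_vos:
            continue  # filtered out: stays in the regional database only
        key = (rec["site"], rec["vo"])
        totals[key]["jobs"] += 1
        totals[key]["cpu_hours"] += rec["cpu_hours"]
    return dict(totals)


# Hypothetical job records held by a regional accounting server.
records = [
    {"site": "SITE-A", "vo": "vo1", "user": "alice", "cpu_hours": 2.0},
    {"site": "SITE-A", "vo": "vo1", "user": "bob", "cpu_hours": 3.0},
    {"site": "SITE-A", "vo": "local", "user": "carol", "cpu_hours": 5.0},
    {"site": "SITE-B", "vo": "vo1", "user": "dave", "cpu_hours": 1.0},
]
summary = summarise(records, publish_sites={"SITE-A"}, publish_vos={"vo1"})
print(summary)
```

(Only the summary would be sent on; the local VO and the unlisted site never reach the central server in this sketch.)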

KB: So the goal is to have some high-level statistics on the usage, right?

ML: If accounting really works, we might also speculate about monetary consequences. We will know, for instance, how much is computed by one VO in a specific region. This represents some kind of value that should be mirrored in the OLAs. Therefore in the long run it might also have some impact on different kinds of funding. However, for that you really only need the big picture.

PS: I have a question about the monitoring of accounting. Emir, currently the accounting publishing is monitored by the central APEL repository, right?


EI: Yes, the central repository sends us alarms: it is not a pull, it's a push model!

PS: Will the UNICORE sites then also be monitored, as soon as they start to publish data?

EI: Once they start providing summaries, it is in principle always the same kind of data, so I don't foresee any need for change.

PS: And do the regional nagios get this information from the central APEL repository, too?

CN: In the central APEL repository there is a test for the production and certified sites. The test checks when the latest data was received from the site; it doesn't matter whether the data comes as an aggregated summary or in single job record format. So the test is done centrally and is then passed on to any regional NAGIOS.

PS: So there is no need to worry about accounting monitoring from UNICORE side?

CN: Well, if you require any new type of testing, that would be a new requirement.

PS: Not necessary.

KB: There is one issue: when a site has two different middlewares deployed (it is possible to deploy UNICORE and gLite on the same hardware without bigger problems) and one middleware is reporting accounting correctly to the APEL database and the other is not.

EI: The piece of software which aggregates the data into a summary for the site would go to the batch system, so the central database will be aware that it didn't receive all information and throw an error. Can you confirm?

CN: APEL will only get information from the log files it knows about. It knows about data from gLite; I really don't think APEL will care about it. If the gLite data is correct, APEL will be happy.

EI: My understanding is that these things will be integrated at some point.

KB: APEL is scanning the logs, and for UNICORE it is a plug-in; it is more on the level of a local database. Parts are integrated into one component.

EI: Does this mean that there will be two different accounting graphs/summaries for the same site?

KB: We are thinking about this in PL-Grid, since we have exactly this problem. I think we have no perfect solution yet. The central server just puts together data from one site.

EI: Cristina, how do you handle the SGAS and DGAS sites? All the sites like OSG or NorduGrid that provide summaries, will they trigger alarms?

CN: I believe so; we only receive summaries from them. It ends up in the same table. As long as the site is registered in GOCDB as a production site, the test will be created and we see alarms, yes.

EI: So when UNICORE comes up, they will start to provide summaries in the same way as OSG or NorduGrid and use the same test.

KB: Do you have sites where two middlewares are provided from the same site, e.g. ARC and gLite?

(nobody knows about such configuration)

CN: I don't think it is a problem, as long as each middleware is accounted by something and there is no overlap. Otherwise we might end up with duplicated records, which wouldn't be very good. But if we have a site that is accounted for UNICORE and gLite, it should be possible to get those two things together.

KB: The original question was about monitoring of accounting. Yes, if the data of two middlewares is brought together in the APEL repository, everything will be fine, but if we want to check if the accounting is working, that is not enough, since the site is producing APEL accounting data, but we don't know for sure whether it is really producing accounting data from both middlewares deployed.
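
(Editor remark: the gap KB points out can be shown with a minimal Python sketch. It compares a site-level freshness test with a per-middleware one. The 7-day threshold, the site name, the timestamps and the function name stale_publishers are assumptions for illustration; they do not describe the actual APEL test.)

```python
from datetime import datetime, timedelta

MAX_AGE = timedelta(days=7)  # assumed threshold, not the real APEL value


def stale_publishers(last_received, expected, now, max_age=MAX_AGE):
    """Return expected (site, middleware) pairs without sufficiently recent data."""
    stale = []
    for key in sorted(expected):
        latest = last_received.get(key)
        if latest is None or now - latest > max_age:
            stale.append(key)
    return stale


now = datetime(2011, 2, 4, 12, 0)
# Time of the latest accounting data received, per (site, middleware).
# There is no UNICORE entry: that middleware has never published for SITE-A.
last_received = {
    ("SITE-A", "gLite"): datetime(2011, 2, 3, 23, 0),
}
expected = {("SITE-A", "gLite"), ("SITE-A", "UNICORE")}

# Site-level view: any recent record makes the whole site look healthy.
site_latest = max(t for (site, _), t in last_received.items() if site == "SITE-A")
site_ok = now - site_latest <= MAX_AGE

# Per-middleware view: the silent UNICORE publisher is reported.
missing = stale_publishers(last_received, expected, now)
print(site_ok, missing)
```

(The per-middleware view needs a list of which middlewares are deployed at each site, which is the information the site-level test does not currently have.)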

ML: This might actually be a new requirement we will have to bring in to accounting. Does NGI Germany have sites with more than one middleware deployed?

RB: Yes, actually we have a site with three middlewares installed.

EI: On the same machine or different machines, but within the same cluster?

RB: I don't know how others handle this, but here we typically have different frontends which go to the same cluster.

EI: That would be good, but if you had UNICORE and gLite on the same box, how would you distinguish them in GOCDB?

CN: The problem at the moment is that the test is currently done for every single node, but we are publishing per site. So that would be a new requirement for us. I think we can add that. I will make a note on this.

ML: We should also make a ticket to collect the details. I put that in the Actionpoints.

ML: Since Rebecca is now here again, maybe she can answer some questions. XML and coordination of the probes with Poland?

RB: Concerning XML, in the German Grid we have our own resource management system; that is where our NAGIOS is getting the information. The person that could explain this better than me would be Foued. For the probes, maybe we should invite him for the next conference.

ML: So point 5 in the next meeting, Rebecca mentioned we should invite Foued. Somebody else? When shall we have the next meeting?

EI: How often shall we have this meeting?

ML: I thought about once a month should be reasonable?

PS: Once a month or every three weeks would be a fine frequency to have some progress to discuss.

ML: I'll set up a doodle. Should I include some more lists, like the EMI UNICORE mailing list? (common agreement)
For the topics I suggest more accounting; maybe we can already touch on Argus. Otherwise it is maybe best to stay with the progress of today's meeting.

KB: I can give some update on Argus at the next meeting.

ML: Great! AOB?

ML: I declare the meeting closed exactly on time and thank you all for joining! I think we made a lot of progress.


--------------------------------------------------------------------------------------------------------------------


AP: Find out more about the EMI registry and the static info it contains: what is the distinction between the static data in GOCDB and in the EMI registry, and how can we propagate downtime info for just one Service Endpoint (URLs are needed to distinguish between different instances)? (contact Laurence Field)

AP: ML to test sending a GGUS ticket to UNICORE.

AP: ML to send EI minutes of GOCDB dedicated phone conference

AP: DM to rename the first 3 UNICORE service types.
unicore6.registry
unicore6.UNICOREX
unicore6.gateway

AP: KB contacting Germany through ticket 306

AP: official requirement towards APEL (accounting monitoring must be sufficient to decide whether all deployed middlewares of a site publish accounting data)

AP: ML to make a doodle poll for the next meeting.