Overview
Since the LHC startup, HEP users have increased their demands on the computing infrastructure in terms of performance and functionality, often going beyond the expected requirements of the experiment computing model.
Concerning functionality, several users ask for an interactive facility to test their code before submitting it to the Grid. Such an interactive cluster is usually deployed as one or a few powerful machines, but in our experience this solution suffers from scalability limits and maintenance issues. Users also ask to use local batch submission to run their analysis quickly and reliably, in a controlled environment, on small datasets that they have produced themselves.
On the other hand, the network topology and the storage infrastructure must be set up to fulfill the I/O requirements of analysis jobs; this can be difficult, especially within a big multi-VO site, where users can have very different use cases.
Description of the work
Our work consists first of all in creating a recipe to provide the interactive facility using the worker nodes; Torque interactive jobs have been the starting point for deploying such a facility.
In addition, many different storage solutions have been tested in order to choose the one that best fulfills all the user requirements. The tests covered different storage brands and different technologies (hardware RAID, software RAID based on the Linux kernel; RAID5, RAID6; Fibre Channel, external SAS), since one of the primary goals was to ensure that a heterogeneous storage infrastructure could be built successfully while achieving high performance. Lustre has been chosen because it provides POSIX access, the best performance and easy administration. As a result of those tests, our farm migrated from dCache to Lustre.
Finally, we have successfully connected Lustre to the Grid using consolidated solutions such as StoRM, XROOTD and GridFTP, verifying their good behavior in terms of performance and compliance with the Grid requirements.
Conclusions
The number of users has grown considerably during the last year, involving several experiments, VOs and communities (CMS, Alice, Glast/Fermi, Pamela, Theophys, Magic V, computational chemistry, biomedicine, bioinformatics...) that actively use the new infrastructure both locally and from the EGI Grid infrastructure. Meanwhile, the maintenance overhead for the site administrators has dropped considerably.
Impact
Using Torque interactive jobs, the user connects to a front-end machine and runs a command to submit a job to the cluster, just as for local batch submission; the batch manager chooses one CPU to execute the job and returns an interactive shell. The user keeps that CPU until they release the interactive job by logging out. Using the screen utility, the user can also interrupt their work, log out, and later log back in, preserving the session they had left.
In addition, Lustre has been mounted on all the nodes of the farm, so that users can access their data through Grid jobs, local batch jobs and interactive jobs, which greatly simplifies their work.
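As an illustration of the submission step described above, the following minimal Python sketch assembles the kind of command a user might run on the front-end machine. It relies only on the standard Torque options `-I` (interactive job), `-q` (queue) and `-l` (resource list); the helper name `interactive_qsub_command`, the queue name `interactive` and the walltime value are hypothetical and are not taken from our site configuration.

```python
import shlex


def interactive_qsub_command(queue=None, nodes=1, ppn=1, walltime=None):
    """Build the argument list for a Torque interactive job request.

    The defaults ask for a single CPU (one node, one processor per node),
    matching the one-CPU allocation described in the text.
    """
    cmd = ["qsub", "-I"]  # -I requests an interactive job
    if queue is not None:
        cmd += ["-q", queue]  # queue name is site specific (assumption)
    resources = [f"nodes={nodes}:ppn={ppn}"]
    if walltime is not None:
        resources.append(f"walltime={walltime}")
    cmd += ["-l", ",".join(resources)]
    return cmd


if __name__ == "__main__":
    # Hypothetical queue name and walltime, for illustration only.
    cmd = interactive_qsub_command(queue="interactive", walltime="04:00:00")
    # Print the command instead of running it, so the sketch works
    # on machines where Torque is not installed.
    print(shlex.join(cmd))
```

The sketch only prints the command line; on a real front end the user would type the resulting `qsub -I ...` command directly, optionally inside a `screen` session so that the interactive shell survives a logout.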
The new storage configuration has been tested using CMS analysis jobs, resulting in a very high CPU efficiency when compared with other storage solutions.
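The abstract does not state how CPU efficiency was measured. A minimal sketch, assuming the conventional definition (CPU time divided by wall-clock time, per allocated core), is given below; the function name and the sample job figures are purely illustrative and are not measurements from our tests.

```python
def cpu_efficiency(cpu_seconds, wall_seconds, cores=1):
    """Return CPU time divided by the wall-clock time of the allocated cores.

    Values close to 1.0 mean the job was rarely waiting for I/O;
    low values suggest the storage could not feed the CPU fast enough.
    """
    if wall_seconds <= 0 or cores <= 0:
        raise ValueError("wall_seconds and cores must be positive")
    return cpu_seconds / (wall_seconds * cores)


def aggregate_efficiency(jobs):
    """Efficiency over many single-core jobs, given (cpu_seconds, wall_seconds) pairs."""
    total_cpu = sum(cpu for cpu, _ in jobs)
    total_wall = sum(wall for _, wall in jobs)
    return cpu_efficiency(total_cpu, total_wall)


if __name__ == "__main__":
    # Made-up numbers, used only to show how the functions are called.
    sample_jobs = [(3400.0, 3600.0), (1700.0, 2000.0)]
    print(f"aggregate CPU efficiency: {aggregate_efficiency(sample_jobs):.2f}")
```

Summing CPU and wall-clock times before dividing weights long jobs more heavily than short ones, which is usually what matters when comparing storage back ends under the same workload.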
As a consequence of the design of the overall infrastructure, it is very easy for new users to start their local activities, and for new experiments to add resources in the form of nodes and storage servers.