18-22 May 2015
ISCTE-IUL
Europe/Lisbon timezone
Main EGI Conference website: http://go.egi.eu/c15

Contribution

Hadoop on the Cloud: the SlipStream deployment tool

Speakers

  • Dr. Cécile CAVET

Primary authors

Description

Cloud computing is a recent computing paradigm that offers IT resources on demand. The IaaS (Infrastructure-as-a-Service) layer provides virtual machines, storage, and networking to run applications in a customised environment. Cloud technology has recently reached a high level of maturity and now offers a large stack of services, but managing a complex workflow can still be problematic. In particular, automatically deploying an environment on the Cloud can be difficult, especially for a cluster, where master and slave nodes need to exchange information in order to set up system services.

Several tools have been developed to manage automatic deployment on interoperable Cloud platforms, among them SlipStream (1), a PaaS (Platform-as-a-Service) solution. SlipStream provides a Web interface that gives access to several Cloud connectors. Automatic deployment is driven by recipes (shell scripts) that install and configure each type of node. SlipStream was first developed in conjunction with the StratusLab (2) solution, a European research project that has provided a public open-source Cloud solution to the academic community since 2010. For this study, we used StratusLab@LAL (Laboratoire de l’Accélérateur Linéaire, Orsay, France).

In the Big Data context, Hadoop (3) technology is often associated with Cloud technology: processing a huge volume of data requires dedicated Hadoop clusters, which are not yet widely available. A Cloud infrastructure makes it possible to deploy a virtual Hadoop cluster on demand. We used the HortonWorks (4) distribution to build a Hadoop cluster on the StratusLab infrastructure with SlipStream, writing recipes that deploy Hadoop 2 (YARN) services on one master and three slaves. We describe the methodology we followed to fine-tune the parameters of the Hadoop configuration files.
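As an illustration of the kind of work such a recipe does, the sketch below generates minimal Hadoop 2 configuration files for one master and three slaves. It is not the authors' recipe (theirs are shell scripts, which are not reproduced in this abstract). The function names, hostnames, port 8020, and the 4096 MB memory value are assumptions chosen for illustration; only the property names are standard Hadoop 2 settings.

```python
import xml.etree.ElementTree as ET


def hadoop_site_xml(properties):
    """Render a dict of Hadoop properties as a *-site.xml document string."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


def cluster_config(master_host, worker_hosts, nodemanager_memory_mb):
    """Build the files a node recipe might write (hypothetical helper)."""
    return {
        # Every node must know where the HDFS NameNode (on the master) lives.
        # Port 8020 is an assumed value for this sketch.
        "core-site.xml": hadoop_site_xml(
            {"fs.defaultFS": f"hdfs://{master_host}:8020"}
        ),
        # YARN workers register with the ResourceManager on the master;
        # the memory value is an example of a parameter one would tune.
        "yarn-site.xml": hadoop_site_xml(
            {
                "yarn.resourcemanager.hostname": master_host,
                "yarn.nodemanager.resource.memory-mb": nodemanager_memory_mb,
            }
        ),
        # Hadoop 2 'slaves' file: one worker hostname per line.
        "slaves": "\n".join(worker_hosts) + "\n",
    }


if __name__ == "__main__":
    # Assumed hostnames for a one-master, three-slave cluster.
    files = cluster_config("master", ["slave1", "slave2", "slave3"], 4096)
    for filename, content in files.items():
        print(f"--- {filename}\n{content}")
```

The master hostname is the one piece of information every slave needs before its services can start, which is why a deployment tool has to pass it between nodes at run time rather than bake it into the image.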
We also ran performance benchmarks (HiBench (5)) to validate the cluster setup, and typical MapReduce jobs on a huge volume of data to show the value of this kind of data processing for scientific projects.

References:
(1) SlipStream: http://sixsq.com/products/slipstream.html
(2) StratusLab: http://stratuslab.eu/index.html
(3) Hadoop: http://hadoop.apache.org
(4) HortonWorks: http://hortonworks.com
(5) HiBench: https://github.com/intel-hadoop/HiBench