Big Data Platforms

Europe/Amsterdam
Description
The aim of this meeting is to get a clear picture of the platforms that are, or will soon be, available in the EGI community. Participants should be cloud providers willing to bring platforms from a single institutional scope to the European scale, and/or willing to participate in developing or scaling up such platforms to the European level. Each provider should prepare a short presentation about the big data platforms available (or soon to be available) at its institution that could be offered to the whole EGI community in the near future. Connection details: http://connect.ct.infn.it/egi-inspire-jra1/
Björn Hagemeier - Big Data at JSC
1. Motivated by a use case from NASA that requires very large RAM and disk per node - These are not typical at an HPC centre.
    * This use case demonstrates the scale of big data applications
    * Experimented with different platforms (HPC cluster, HDFS, etc.) - but each has its problems for this use case
    * Diego: At the moment we don't have sites in the EGI FedCloud that have enough RAM and storage to satisfy the use case
2. DBSCAN clustering algorithm - JSC created a highly parallel version
    * Applications: Noise reduction, Twitter tweet analysis, outlier detection
3. Ongoing development: Make HDFS available as a UNICORE storage
    * Send YARN jobs through UNICORE to work with such storage
    * Data resides on the cluster; jobs are sent to the nodes where the data is (not to whichever nodes are free, as in grid batch computing)
    * Development should be finished in a few months
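JSC's highly parallel implementation is not reproduced in the minutes; for reference, the sequential core of DBSCAN can be sketched in plain Python (1-D points and a naive O(n²) neighbour scan for brevity; the parallel version distributes exactly this neighbourhood expansion across nodes). This is a minimal sketch, not JSC's code:

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN over 1-D points.

    Labels: None = unvisited, -1 = noise, >= 0 = cluster id.
    min_pts counts the point itself, as in the original paper.
    """
    labels = [None] * len(points)

    def neighbors(i):
        # Naive scan; real implementations use a spatial index.
        return [j for j in range(len(points))
                if abs(points[j] - points[i]) <= eps]

    cid = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1          # provisionally noise
            continue
        labels[i] = cid             # i is a core point: start a cluster
        seeds = [j for j in nbrs if j != i]
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:     # noise reachable from a core point
                labels[j] = cid     # becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cid
            jn = neighbors(j)
            if len(jn) >= min_pts:  # j is also core: expand further
                seeds.extend(jn)
        cid += 1
    return labels
```

With eps=0.5 and min_pts=2, the points [1.0, 1.1, 1.2, 10.0, 10.1, 10.2, 50.0] form two clusters, with 50.0 flagged as noise.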

Daniele Lezzi - Hecuba and integration with COMPSs
1. Hecuba is a set of tools and interfaces that aims to simplify integration with non-relational DBs
    * It is currently used to integrate COMPSs with Cassandra (and other non-relational DBs such as HBase)
    * Cassandra provides replicated, persistent storage for COMPSs
    * Hecuba redefines the access and modification methods for in-memory data so that, if the data is stored in Cassandra, it can be used transparently as if it were in local memory
2. dataClay is a storage platform that aims to give programmers efficient and easy interaction with databases, using object-oriented programming, and to provide big data sharing.
    * Like Hecuba, it also works under COMPSs, and most of the methods used to access the database are the same (make_persistent, delete_persistent, getID, next…)
3. The workflow: COMPSs makes data available in Cassandra and other storage back-ends via Hecuba and dataClay, then schedules tasks onto resources; the tasks interact with the data through these storage layers and benefit from persistent, replicated storage
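The transparent-persistence idea behind Hecuba's make_persistent/delete_persistent interface can be illustrated with a small Python sketch. Here an in-memory dict stands in for Cassandra; apart from the two method names taken from the talk, every class and name below is invented for illustration:

```python
class FakeStore:
    """Stand-in for Cassandra in this sketch: one dict of rows."""
    def __init__(self):
        self.rows = {}


class PersistentDict:
    """Dict-like object; once make_persistent() is called, every
    write goes through to the backing store as well as to memory."""

    def __init__(self, store):
        self._store = store
        self._mem = {}
        self._key = None  # not persistent until make_persistent()

    def make_persistent(self, key):
        self._key = key
        self._store.rows[key] = dict(self._mem)  # flush current state

    def delete_persistent(self):
        if self._key is not None:
            del self._store.rows[self._key]
            self._key = None  # object lives on in memory only

    def __setitem__(self, k, v):
        self._mem[k] = v
        if self._key is not None:          # write-through when persistent
            self._store.rows[self._key][k] = v

    def __getitem__(self, k):
        return self._mem[k]                # reads stay in local memory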

Esteban Freire - Current cloud infrastructure at CESGA
1. OpenNebula site
2. Provides a test Hadoop platform
    * Based on the OpenNebula framework
    * Developing a web portal through which users can instantiate Hadoop clusters
        * The portal will be developed specifically for the CESGA site, but it could be assessed how to extend it to work with other sites
    * Remark from Gergely: Such an extended portal could be a front-end for a 'Hadoop' VO on the EGI FedCloud.
UPDATE from Esteban and Lopez Cacheiro: The new portal could be extended to support additional providers, using for example rOCCI to deploy the virtual machines. The configuration part of the Hadoop cluster, and the services the portal offers on top of it, would remain the same. With these changes the portal could serve as a front-end for a 'Hadoop' VO, and a CLI similar to the one currently in production at CESGA could be provided.
3. Hadoop on FedCloud
        * Hadoop's heavy use of disk and network I/O is the biggest challenge for hosting Hadoop in the FedCloud
        * The FedCloud would fit small and medium-sized Hadoop jobs where the data set is already pre-deployed in HDFS
        * Optimised start-up times are very close to Amazon's, but data stage-in imposes a large overhead - Remark from Gergely: Maybe COMPSs could be used as a data stage-in agent?
        * The TeraGen benchmark was successfully run up to 2 TB and the TeraSort benchmark up to 500 GB
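The proposed portal extension separates what varies per provider (VM deployment, e.g. via rOCCI or OpenNebula) from what stays the same (Hadoop cluster configuration). That split could be structured roughly as below; all names are hypothetical, and a real back-end would call the rOCCI CLI or the OpenNebula API instead of the stub shown:

```python
class Provider:
    """Abstract VM deployment back-end (OpenNebula, rOCCI, ...)."""
    def deploy_vm(self, image):
        raise NotImplementedError


class StubProvider(Provider):
    """Stand-in that just records deployments; a rOCCI back-end
    would invoke the occi client here instead."""
    def __init__(self):
        self.vms = []

    def deploy_vm(self, image):
        vm_id = "vm-%d" % len(self.vms)
        self.vms.append((vm_id, image))
        return vm_id


def create_hadoop_cluster(provider, n_workers):
    """Deployment is provider-specific; configuration is not."""
    master = provider.deploy_vm("hadoop-master")
    workers = [provider.deploy_vm("hadoop-worker") for _ in range(n_workers)]
    # The provider-independent part: the portal would now configure
    # HDFS and YARN on the freshly deployed VMs, identically for
    # every back-end.
    return {"master": master, "workers": workers}
```

A second provider class is then all that is needed to serve another FedCloud site through the same portal and CLI.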

Giacinto Donvito - Big data @ INFN Bari
1. INFN Bari is involved in two national grid projects:
    * Support use cases from public administration and research on open source IaaS solutions
    * OpenStack with KVM virtualisation, Swift with both CDMI and S3 interface on Ubuntu 12.04
    * Approx. 700 cores with 5 TB RAM; 150 disks with a total of 470 TB of storage
2. Working on Hadoop on OpenStack with Sahara: "The Sahara project provides a simple means to provision a data-intensive application cluster (Hadoop or Spark) on top of OpenStack."
    * Tests of Sahara are ongoing (deployment, scalability, tests with use cases)
    * Tests of installation of NoSQL DB on dynamic cluster instantiated with Heat (at least MongoDB and Cassandra)
3. Federation aspect of the work
    * They want to provide platforms (high-level tools) for the users; IaaS is 'only' a required low-level component for this
    * First this would serve one site, but want to expand this to multiple sites to create 'regions'
    * A further step is to test Hadoop across multiple sites by using its core features, such as hierarchical storage that enables distribution of data and jobs across data centres
Remark from Gergely: We should discuss the plans of these two Italian projects in detail, to see how they could be extended so that the results would benefit the EGI FedCloud and would provide opportunities for FedCloud members to contribute to the workplan
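The test plan in item 2, bring up a dynamic cluster with Heat and lay a NoSQL DB on top, can be sketched as a pure-Python simulation. Real code would go through python-heatclient or the Sahara API; every name below is invented for illustration:

```python
class HeatStackSim:
    """Simulates a Heat stack that provisions N identical nodes."""
    def __init__(self, n_nodes):
        self.nodes = ["node-%d" % i for i in range(n_nodes)]


def install_nosql(stack, db):
    """Same installation step regardless of cluster size: pick one
    seed node and have the rest join it (mirroring the bootstrap
    pattern of Cassandra and MongoDB replica sets)."""
    seed, rest = stack.nodes[0], stack.nodes[1:]
    return {"db": db, "seed": seed, "members": [seed] + rest}


cluster = install_nosql(HeatStackSim(4), "cassandra")
```

Because the installation step only depends on the node list the stack hands back, the same routine works for any cluster size Heat instantiates.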

Ignacio Blanquer - UPV-IBM’s Big data observatory & hadoop infrastructure management
1. IBM-UPV 'Big Data Observatory' - A joint action by the School of Informatics of the UPV and IBM to promote BigData practicals in post-graduate courses
    * IBM InfoSphere BigInsights (An IBM’s version of Hadoop and other analytic tools for visualization, “R”, text analysis and tables)
    * Installed on UPV's premises using IBM Bluemix IaaS
    * Currently 2 projects: Suitability of university degrees according to employment demand; Extraction of trending topics from live streamed content in radio and tv.
    * Potential links with EGI (To be proposed to IBM):
        ** Support BigInsights from UPV’s FedCloud site.
        ** Scale-up the concept of Observatory to other centres in EGI.
    * Ignacio and Damian to explore with IBM their interest in these topics
2. IaaS management framework of GRYCAP
    * A set of services for managing VMs and data on IaaS clouds
    * Most relevant services for this task
        ** Automatic configuration and recontextualization service: www.grycap.upv.es/im    
        ** Creation of elastic virtual clusters on top of both public and on-premise IaaS providers: www.grycap.upv.es/ec3
Remark: The Infrastructure Manager is similar to, but richer than, the rOCCI client. It is compatible with the EGI FedCloud, so it could be offered as an alternative interface to FedCloud users. This is an opportunity to explore.
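The elastic-cluster behaviour EC3 provides can be illustrated with a toy autoscaling rule (the policy and all names are invented for this sketch; EC3 itself grows and shrinks real clusters on IaaS providers through the Infrastructure Manager):

```python
def target_workers(pending_jobs, jobs_per_node, min_nodes, max_nodes):
    """Scale the worker count to the queue length, within limits.

    One possible elasticity policy: enough nodes to absorb the
    pending jobs, never below min_nodes or above max_nodes.
    """
    needed = -(-pending_jobs // jobs_per_node)  # ceiling division
    return max(min_nodes, min(max_nodes, needed))
```

For example, with 4 jobs per node and limits of 1 to 10 nodes, a queue of 17 pending jobs would be served by 5 workers, while an empty queue shrinks the cluster back to the single minimum node.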

Jens Jensen (STFC)
- STFC has the Hartree centre which has the full suite of (commercially licensed) IBM Big Data tools. Hartree, however, is a supercomputing centre (it has a BlueGene) focusing on industry applications, so will not generally be available to academic researchers (unless they come with a grant).
- Big data problems are not black-box: you need to understand a big data problem to be able to define and address it.
- GATE (linguistics) - trying to apply this to science cases (www.gate.ac.uk)
- Have a Hadoop cluster and HPC clusters running R. R keeps data in memory, so there is a limit to scalability with R
- We should explore the case studies too from the sites that offer Hadoop solutions.
- Remark from Gergely: Exploring case studies from these sites could be a topic for another teleconference.

Mario David (LIP)
- They are at an early stage.
- Have an OpenStack site and whatever solution they will adopt will integrate into OpenStack

Ruben Valles (BIFI)
- Have an internal testbed with Hadoop, currently investigating it
- Will have additional hardware for production use later this year
- Is happy to join collaborations in big data area within EGI

Viet Tran (IISAS)
- Have several products related to big data, across use cases from:
    * Information Retrieval, Big Data
    * Semantics, Graphs and Networks, Semantic Search
    * Multi-language Text Analysis
    * Knowledge Modeling, Ontologies
- Products - these are ready to use on the IISAS Hadoop cluster:
    * RDB2Onto: Tool for Relational Data to Ontology Individuals Mapping
    * ACoMA: processes email communication on the server side and attaches relevant knowledge to email messages
    * EMBET: Experience Management based on Text Notes - Active and Context sensitive Recommendation System
    * RIDAR: Relevant Internet Data Resource Identification
    * WEBCRAWLER: recursively downloads web pages
    IISAS should check whether these could be shared with other sites and users via the EGI AppDB (maybe as VMs)
- Involved in several projects that are currently active in this area:
    * REDIRNET: Emergency Responder Data Interoperability Network (1.3.2014-31.8.2016)
    * CLAN:  Cloud Computing for Big Data Analytics (1.7.2012-31.12.2015) (National project)
    * SIVVP: Slovak Infrastructure for High Performance Computing (2010-2015) (National project)
    * VEGA: Selected methods, approaches and tools for distributed computing (2012-2015) (National project)
    * VEGA: New methods and approaches on information processing and knowledge bases (2013-2015) (National project)
- HW: Hadoop production and development clusters (384 + 120 cores) with Hadoop, Spark, Hive, Pig
   
Zdenek Sustr (CESNET):
- We are experimenting with Hadoop because several users are requesting this
- The OCCI working group is working on PaaS extensions of OCCI. The OCCI PaaS specification still has Draft status and thus cannot be shared publicly. OCCI 1.2 includes PaaS in its roadmap for 2015, so it is hoped the public release will not take long.
