Supporting Big Data Processing via Science Gateways
EGI CF 2015, 10-13 November, Bari, Italy
Authors: Tamas Kiss, Shashank Gugnani, Gabor Terstyanszky, Peter Kacsuk, Carlos Blanco, Giuliano Castelli
Dr Tamas Kiss, CloudSME Project Director, University of Westminster, London, UK
kisst@wmin.ac.uk
Introduction: MapReduce and big data
MapReduce/Hadoop:
• MapReduce: a framework to process large datasets in parallel, on thousands of nodes, in a reliable and fault-tolerant manner
• Map: input data is divided into chunks and analysed on different nodes in parallel
• Reduce: the partial results are collated and combined into a single result
• Monitoring, scheduling and re-executing failed tasks are the responsibility of the MapReduce framework
• Originally designed for bare-metal clusters – popularity in the cloud is growing
• Hadoop: open-source implementation of the MapReduce framework introduced by Google in 2004
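To make the Map and Reduce roles concrete, below is a minimal WordCount sketch written for Hadoop Streaming. It is purely illustrative (the experiments later in the talk use the standard Java WordCount example); the file name and job invocation in the comment are assumptions.

```python
#!/usr/bin/env python
# Minimal WordCount sketch for Hadoop Streaming (illustrative only).
# The same script can be used as mapper and reducer, e.g. (paths assumed):
#   hadoop jar hadoop-streaming.jar -input /data/in -output /data/out \
#     -mapper "wordcount.py map" -reducer "wordcount.py reduce" -file wordcount.py
import sys

def mapper():
    # Map phase: split each input line into words and emit "word<TAB>1".
    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))

def reducer():
    # Reduce phase: the framework sorts by key, so counts for the same word
    # arrive together and can be summed with a running total.
    current_word, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                print("%s\t%d" % (current_word, total))
            current_word, total = word, 0
        total += int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, total))

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```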
Motivations
Motivation:
• Many scientific applications (e.g. weather forecasting, DNA sequencing, molecular dynamics) have been parallelised using the MapReduce framework
• Installation and configuration of a Hadoop cluster is well beyond the capabilities of most domain scientists
Aim:
• Integration of Hadoop with workflow systems and science gateways
• Automatic setup of the Hadoop software and infrastructure
• Utilisation of the power of cloud computing
Motivations: the CloudSME project
• Develops a cloud-based simulation platform for manufacturing and engineering
• Funded by the European Commission FP7 programme, FoF: Factories of the Future
• July 2013 – March 2016
• EUR 4.5 million overall funding
• Coordinated by the University of Westminster
• 29 project partners from 8 European countries: 24 companies (all SMEs) and 5 academic/research institutions
• Spin-off company established – CloudSME UG
• One of the industrial use cases: data mining of aircraft maintenance data using MapReduce-based parallelisation
Approach
• Set up a disposable cluster in the cloud, execute the Hadoop job and destroy the cluster
• Cluster-related parameters and input files are provided by the user
• The workflow node executable is a program that sets up the Hadoop cluster, transfers files to and from the cluster, and executes the Hadoop job
• Two methods proposed:
  • Single Node Method
  • Three Node Method
Approach: Infrastructure-aware workflow
Aim:
• Execute MapReduce jobs on cloud resources
• Automatically set up and destroy the execution environment in the cloud
Infrastructure-aware workflow:
• The necessary execution environment is transparently set up before, and destroyed after, execution
• This is carried out from the workflow without further user intervention
Steps:
• The execution environment is created dynamically in the cloud
• The workflow tasks are executed
• The infrastructure is broken down, releasing the resources
Approach: Single node method
• Connect to the cloud and launch servers
• Connect to the master node server and set up the cluster configuration
• Transfer input files and the job executable to the master node
• Start the Hadoop job by running a script on the master node
• When the job is finished, delete the servers from the cloud and retrieve the output if the job was successful
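A minimal sketch of what the single workflow node's executable could look like is given below. It is not the project's actual implementation: the `cloud` object and helper names (`launch_servers`, `run_remote`, `copy_to`, `copy_from`, `destroy_servers`) are hypothetical placeholders for whatever client library the gateway uses to reach the cloud (in CloudSME, the servers are launched via the CloudBroker platform).

```python
# Illustrative single-node-method executable: one program that creates the
# disposable cluster, runs the Hadoop job and always tears the cluster down.
# All helpers are hypothetical placeholders, not a real API.

def run_hadoop_job(cloud, n_workers, inputs, executable, output_dir):
    # 1. Connect to the cloud and launch the servers for the cluster.
    master, workers = cloud.launch_servers(count=n_workers + 1)
    try:
        # 2. Connect to the master node and set up the Hadoop cluster
        #    (write the slaves file, format HDFS, start the daemons).
        cloud.run_remote(master, "configure-hadoop.sh", args=[w.ip for w in workers])

        # 3. Transfer the input files and job executable to the master node,
        #    then stage the inputs into HDFS.
        cloud.copy_to(master, inputs + [executable])
        cloud.run_remote(master, "hdfs dfs -put *.dat /input")

        # 4. Start the Hadoop job by running a script on the master node.
        rc = cloud.run_remote(master, "hadoop jar %s /input /output" % executable)

        # 5. Retrieve the output only if the job succeeded.
        if rc == 0:
            cloud.copy_from(master, "/output", output_dir)
        return rc
    finally:
        # Delete the servers from the cloud in every case, so the disposable
        # cluster never outlives the workflow node.
        cloud.destroy_servers([master] + workers)
```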
Approach: Three node method
• Stage 1 – Deploy Hadoop node: launch servers in the cloud, connect to the master node, set up the Hadoop cluster and save the Hadoop cluster configuration
• Stage 2 – Execute node: upload the input files and job executable to the master node, execute the job and retrieve the results
• Stage 3 – Destroy Hadoop node: destroy the cluster to free up resources
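Split across three workflow nodes, the same steps might look roughly as follows. The saved cluster description is what lets several execute nodes reuse one cluster between the deploy and destroy stages. Again, the `cloud` helpers and the `cluster.json` file name are illustrative assumptions, not the project's actual code.

```python
# Illustrative decomposition of the three node method. The deploy node
# persists the cluster description (master and worker addresses) as a small
# file that the workflow passes along its edges to the later nodes.
import json

def deploy_node(cloud, n_workers):
    master, workers = cloud.launch_servers(count=n_workers + 1)
    cloud.run_remote(master, "configure-hadoop.sh", args=[w.ip for w in workers])
    with open("cluster.json", "w") as f:          # output of stage 1
        json.dump({"master": master.ip, "workers": [w.ip for w in workers]}, f)

def execute_node(cloud, inputs, executable):
    with open("cluster.json") as f:               # input from stage 1
        cluster = json.load(f)
    cloud.copy_to(cluster["master"], inputs + [executable])
    cloud.run_remote(cluster["master"], "run-hadoop-job.sh %s" % executable)
    cloud.copy_from(cluster["master"], "/output", ".")

def destroy_node(cloud):
    with open("cluster.json") as f:               # input from stage 1
        cluster = json.load(f)
    cloud.destroy_servers([cluster["master"]] + cluster["workers"])
```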
Implementation: CloudBroker platform
• Seamless access to heterogeneous cloud resources – high-level interoperability
• [Architecture diagram: end users, software vendors and resource providers reach engineering, biology, chemistry, pharma and other applications through user tools (web browser UI, CLI, Java client library, REST web service API) on the CloudBroker Platform, which runs on Amazon, CloudSigma, OpenStack, OpenNebula and Eucalyptus clouds]
Implementation: WS-PGRADE/gUSE
• General-purpose, workflow-oriented gateway framework
• Supports the development and execution of workflow-based applications
• Enables the multi-cloud and multi-grid execution of any workflow
• Supports the fast development of gateway instances by a customization technology
Implementation: WS-PGRADE/gUSE workflows
• Each box describes a task
• Each arrow describes information flow, such as input and output files
• Special nodes describe parameter sweeps
Implementation: SHIWA workflow repository
• Workflow repository to store directly executable workflows
• Supports various workflow systems including WS-PGRADE, Taverna, Moteur, Galaxy, etc.
• Fully integrated with WS-PGRADE/gUSE
Implementation: Supported storage solutions
Local (user's machine):
• Bottleneck for large files
• Multiple file transfers: local machine – WS-PGRADE – CloudBroker – bootstrap node – master node – HDFS
Swift:
• Two file transfers: Swift – master node – HDFS
Amazon S3:
• Direct transfer from S3 to HDFS using Hadoop's distributed copy application (distcp), as sketched below
Input/output locations can be mixed and matched within one workflow
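Hadoop's distributed copy tool (distcp) is a standard command shipped with Hadoop; a typical invocation for staging S3 data straight into HDFS, run on the cluster's master node, would look like the sketch below. The bucket name, paths and credentials are placeholders, and the s3n:// scheme is the one commonly used with Hadoop 2.x (s3a:// is an alternative).

```python
# Illustrative S3 -> HDFS staging with Hadoop's distcp, wrapped in Python so
# it could be called from a node setup script. All names are placeholders.
import subprocess

subprocess.check_call([
    "hadoop", "distcp",
    "-Dfs.s3n.awsAccessKeyId=AKIA...",        # placeholder credentials
    "-Dfs.s3n.awsSecretAccessKey=...",
    "s3n://my-bucket/input/",                  # source: S3 bucket
    "hdfs:///user/hadoop/input/",              # destination: HDFS
])
```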
Experiments and results: Initial testbed
• CloudSME production gUSE (v3.6.6) portal
• Jobs submitted using the CloudSME CloudBroker platform
• All jobs submitted to the University of Westminster OpenStack cloud
• Hadoop v2.5.1 on Ubuntu 14.04 (Trusty) servers
Experiments and results: Hadoop applications used in the experiments
• WordCount – the standard Hadoop example
• Rule-based classification – a classification algorithm adapted for MapReduce
• PrefixSpan – MapReduce version of the popular sequential pattern mining algorithm
Experiments and results
• Single node method: the Hadoop cluster is created and destroyed multiple times (once per job)
• Three node method: multiple Hadoop jobs run between a single pair of create and destroy nodes
Experiments and results
• [Results chart: 5 jobs, each on a 5-node cluster, using the WS-PGRADE parameter sweep feature – single node method]
• [Results chart: single Hadoop jobs on a 5-node cluster – single node method]
Conclusion
• The solution works for any Hadoop application
• The proposed approach is generic and can be used with any gateway environment and cloud
• The user can choose the appropriate method (Single or Three Node) according to the application
• The parameter sweep feature of WS-PGRADE can be used to run Hadoop jobs with multiple input datasets simultaneously
• Can be used for large-scale scientific simulations
• EGI Federated Cloud integration:
  • Already runs on some EGI FedCloud resources: SZTAKI, BIFI
  • WS-PGRADE is fully integrated with EGI FedCloud
  • CloudBroker does not currently support EGI FedCloud directly
Any questions? http://cloudsme.eu http://www.cloudsme-apps.com/