Supporting Big Data Processing via Science Gateways. EGI CF 2015, 10-13 November, Bari, Italy. Authors: Tamas Kiss, Shashank Gugnani, Gabor Terstyanszky, Peter Kacsuk, Carlos Blanco, Giuliano Castelli. Presenter: Dr Tamas Kiss, CloudSME Project Director, University of Westminster, London, UK


Presentation Transcript


  1. Supporting Big Data Processing via Science Gateways. EGI CF 2015, 10-13 November, Bari, Italy. Authors: Tamas Kiss, Shashank Gugnani, Gabor Terstyanszky, Peter Kacsuk, Carlos Blanco, Giuliano Castelli. Dr Tamas Kiss, CloudSME Project Director, University of Westminster, London, UK, kisst@wmin.ac.uk

  2. Introduction: MapReduce and big data. MapReduce/Hadoop • MapReduce: processes large datasets in parallel, on thousands of nodes, in a reliable and fault-tolerant manner • Map: input data is divided into chunks and analysed on different nodes in parallel • Reduce: collates the work and combines the results into a single value • Monitoring, scheduling and re-executing failed tasks are the responsibility of the MapReduce framework • Originally designed for bare-metal clusters – its popularity in the cloud is growing • Hadoop: open source implementation of the MapReduce framework introduced by Google in 2004 (a minimal code sketch of the map and reduce phases follows below)
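To make the map and reduce phases above concrete, here is a minimal WordCount sketch for Hadoop Streaming, written in Python. The presentation does not prescribe any particular implementation; the script name, paths and invocation below are illustrative only.

```python
#!/usr/bin/env python3
# wordcount_streaming.py -- minimal WordCount for Hadoop Streaming.
# Run with "map" to emit (word, 1) pairs, or with "reduce" to sum them.
# Hadoop sorts mapper output by key, so all counts for a given word
# arrive at the reducer consecutively.
import sys


def do_map():
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")


def do_reduce():
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")


if __name__ == "__main__":
    do_map() if sys.argv[1] == "map" else do_reduce()
```

A job like this would typically be submitted via the hadoop-streaming jar, e.g. `hadoop jar hadoop-streaming.jar -files wordcount_streaming.py -mapper "python3 wordcount_streaming.py map" -reducer "python3 wordcount_streaming.py reduce" -input /input -output /output` (the exact jar path depends on the Hadoop installation).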

  3. Motivations. Motivation: • Many scientific applications (such as weather forecasting, DNA sequencing and molecular dynamics) have been parallelized using the MapReduce framework • Installation and configuration of a Hadoop cluster is well beyond the capabilities of most domain scientists. Aim: • Integration of Hadoop with workflow systems and science gateways • Automatic setup of the Hadoop software and infrastructure • Utilization of the power of Cloud Computing

  4. Motivations: the CloudSME project • Develops a cloud-based simulation platform for manufacturing and engineering • Funded by the European Commission FP7 programme, FoF: Factories of the Future • July 2013 – March 2016 • EUR 4.5 million overall funding • Coordinated by the University of Westminster • 29 project partners from 8 European countries • 24 companies (all SMEs) and 5 academic/research institutions • Spin-off company established – CloudSME UG • One of the industrial use cases: data mining of aircraft maintenance data using MapReduce-based parallelisation

  5. Motivations

  6. Approach • Set up a disposable cluster in the cloud, execute the Hadoop job and destroy the cluster • Cluster-related parameters and input files are provided by the user • The workflow node executable is a program that sets up the Hadoop cluster, transfers files to and from the cluster and executes the Hadoop job • Two methods proposed: Single Node Method and Three Node Method

  7. Approach: Infrastructure-aware workflow. Aim: • execute MapReduce jobs on cloud resources • automatically set up and destroy the execution environment in the cloud. Infrastructure-aware workflow: • the necessary execution environment is transparently set up before, and destroyed after, execution • carried out from the workflow without further user intervention. Steps: • the execution environment is created dynamically in the cloud • the workflow tasks are executed • the infrastructure is broken down, releasing its resources

  8. Approach: Single node method • Connect to the cloud and launch the servers • Connect to the master node server and set up the cluster configuration • Transfer the input files and the job executable to the master node • Start the Hadoop job by running a script on the master node • When the job is finished, delete the servers from the cloud and retrieve the output if the job was successful (a sketch of these steps is given below)
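A minimal sketch of what such a single workflow node executable could look like, assuming plain SSH/SCP access to the launched virtual machines. The setup_hadoop_cluster.sh script, the host addresses and the provisioning/teardown steps (left as comments) are hypothetical placeholders for whatever the gateway and cloud API actually provide.

```python
#!/usr/bin/env python3
# single_node_method.py -- sketch of the single node method: one executable
# that configures a freshly launched Hadoop cluster, runs the job and
# fetches the output. Cluster provisioning/teardown is only hinted at in
# comments because it depends on the cloud platform in use.
import subprocess


def ssh(host, command):
    """Run a command on a remote node over SSH."""
    subprocess.run(["ssh", host, command], check=True)


def scp(src, dst):
    """Copy files or directories to/from a remote node."""
    subprocess.run(["scp", "-r", src, dst], check=True)


def run_hadoop_job(master, workers, job_jar, input_dir, output_dir):
    # 1. Configure the freshly launched cluster from the master node
    #    (setup_hadoop_cluster.sh is a placeholder for the real setup script).
    ssh(master, "bash setup_hadoop_cluster.sh " + " ".join(workers))
    # 2. Stage the executable and the input data, then load the input into HDFS
    scp(job_jar, f"{master}:job.jar")
    scp(input_dir, f"{master}:input")
    ssh(master, "hdfs dfs -put input /input")
    # 3. Execute the MapReduce job on the cluster
    ssh(master, "hadoop jar job.jar /input /output")
    # 4. Retrieve the results back to the submitting machine
    ssh(master, "hdfs dfs -get /output output")
    scp(f"{master}:output", output_dir)


if __name__ == "__main__":
    # The cloud launch step (and the final teardown) would happen here via the
    # cloud API; the addresses below are placeholders.
    master, workers = "10.0.0.1", ["10.0.0.2", "10.0.0.3"]
    run_hadoop_job(master, workers, "wordcount.jar", "./input", "./output")
    # ...destroy the servers here, whether or not the job succeeded...
```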

  9. Approach: Three node method • Stage 1, Deploy Hadoop node: launch the servers in the cloud, connect to the master node, set up the Hadoop cluster and save the Hadoop cluster configuration • Stage 2, Execute node: upload the input files and the job executable to the master node, execute the job and get the results back • Stage 3, Destroy Hadoop node: destroy the cluster to free up resources (a sketch of how the nodes can share the cluster configuration follows below)
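A sketch of how the three stages could be split into separate node executables that pass the cluster description between them as an ordinary workflow file. The cluster.json format, field names, helper script and placeholder addresses are illustrative assumptions, not the project's actual implementation.

```python
#!/usr/bin/env python3
# three_node_method.py -- sketch of the three node method: the same steps as
# the single node method, split across Deploy / Execute / Destroy workflow
# nodes so that several Execute nodes can reuse one cluster between a single
# Deploy and Destroy pair. The cluster description travels between nodes as
# a plain file (cluster.json).
import json
import subprocess
import sys


def deploy(config_path):
    # Stage 1 -- Deploy Hadoop node: launch servers (via the cloud API, not
    # shown), configure Hadoop, and save the cluster configuration as the
    # node's output file.
    cluster = {"master": "10.0.0.1", "workers": ["10.0.0.2", "10.0.0.3"]}  # placeholders
    subprocess.run(["ssh", cluster["master"], "bash setup_hadoop_cluster.sh"], check=True)
    with open(config_path, "w") as f:
        json.dump(cluster, f)


def execute(config_path, job_jar):
    # Stage 2 -- Execute node: read the cluster file produced by Deploy,
    # upload the executable, run the job, fetch the results.
    with open(config_path) as f:
        master = json.load(f)["master"]
    subprocess.run(["scp", job_jar, f"{master}:job.jar"], check=True)
    subprocess.run(["ssh", master, "hadoop jar job.jar /input /output"], check=True)
    subprocess.run(["scp", "-r", f"{master}:output", "."], check=True)


def destroy(config_path):
    # Stage 3 -- Destroy Hadoop node: release the cloud resources
    # (the actual teardown call depends on the cloud API in use).
    with open(config_path) as f:
        cluster = json.load(f)
    print("releasing servers:", cluster["master"], *cluster["workers"])


if __name__ == "__main__":
    stage, args = sys.argv[1], sys.argv[2:]
    {"deploy": deploy, "execute": execute, "destroy": destroy}[stage](*args)
```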

  10. Implementation

  11. Implementation: CloudBroker platform • Seamless access to heterogeneous cloud resources – high level interoperability. [Architecture diagram: end users, software vendors and resource providers use tools (web browser UI*, CLI*, Java client library*, REST web service API*) to run applications (engineering, biology, chemistry, pharma, …) through the CloudBroker Platform*, which connects to clouds such as Amazon*, CloudSigma*, OpenStack*, OpenNebula* and Eucalyptus.]

  12. Implementation: WS-PGRADE/gUSE • General purpose, workflow-oriented gateway framework • Supports the development and execution of workflow-based applications • Enables the multi-cloud and multi-grid execution of any workflow • Supports the fast development of gateway instances by a customization technology

  13. Implementation: WS-PGRADE/gUSE • Each box describes a task • Each arrow describes information flow, such as input and output files • Special nodes describe parameter sweeps

  14. Implementation: SHIWA workflow repository • Workflow repository to store directly executable workflows • Supports various workflow systems including WS-PGRADE, Taverna, Moteur, Galaxy, etc. • Fully integrated with WS-PGRADE/gUSE

  15. Implementation: Supported storage solutions. Local (user's machine): • bottleneck for large files • multiple file transfers: local machine – WS-PGRADE – CloudBroker – bootstrap node – master node – HDFS. Swift: • two file transfers: Swift – master node – HDFS. Amazon S3: • direct transfer from S3 to HDFS using Hadoop's distributed copy (distcp) application (illustrated below). Input/output locations can be mixed and matched in one workflow.
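For the Amazon S3 case, the direct transfer can be performed on the master node with Hadoop's distributed copy tool (distcp). The sketch below uses placeholder bucket and HDFS paths and assumes the S3 credentials and connector (s3a, or s3n on older Hadoop releases) are already configured.

```python
#!/usr/bin/env python3
# s3_to_hdfs.py -- illustration of the S3 storage option: distcp copies data
# straight between S3 and HDFS in parallel, avoiding the round trip through
# the user's local machine. All paths are placeholders.
import subprocess


def s3_to_hdfs(bucket_path, hdfs_path):
    # Pull the input data from S3 into HDFS before the job starts.
    subprocess.run(["hadoop", "distcp", f"s3a://{bucket_path}", hdfs_path], check=True)


def hdfs_to_s3(hdfs_path, bucket_path):
    # Push the job output back to S3 after the job finishes.
    subprocess.run(["hadoop", "distcp", hdfs_path, f"s3a://{bucket_path}"], check=True)


if __name__ == "__main__":
    s3_to_hdfs("my-bucket/input", "hdfs:///input")
    # ...run the MapReduce job here...
    hdfs_to_s3("hdfs:///output", "my-bucket/output")
```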

  16. Experiments and results: Initial testbed • CloudSME production gUSE (v3.6.6) portal • Jobs submitted using the CloudSME CloudBroker platform • All jobs submitted to the University of Westminster OpenStack cloud • Hadoop v2.5.1 on Ubuntu 14.04 (Trusty) servers

  17. Experiments and results: Hadoop applications used for the experiments • WordCount – the standard Hadoop example • Rule Based Classification – a classification algorithm adapted for MapReduce • PrefixSpan – a MapReduce version of the popular sequential pattern mining algorithm

  18. Experiments and results • Single node method: the Hadoop cluster is created and destroyed multiple times • Three node method: multiple Hadoop jobs run between a single pair of create/destroy nodes

  19. Experiments and results

  20. Experiments and results. [Charts: 5 jobs, each on a 5-node cluster, using the WS-PGRADE parameter sweep feature (single node method); single Hadoop jobs on a 5-node cluster (single node method).]

  21. Conclusion • The solution works for any Hadoop application • The proposed approach is generic and can be used with any gateway environment and cloud • The user can choose the appropriate method (single or three node) according to the application • The parameter sweep feature of WS-PGRADE can be used to run Hadoop jobs with multiple input datasets simultaneously • Can be used for large-scale scientific simulations • EGI Federated Cloud integration: already runs on some EGI FedCloud resources (SZTAKI, BIFI); WS-PGRADE is fully integrated with EGI FedCloud; CloudBroker does not currently support EGI FedCloud directly

  22. Any questions? http://cloudsme.eu http://www.cloudsme-apps.com/
