Distributed Applications: Examining the Past, Understanding the Present, Preparing for the Future(Grid)
Shantenu Jha
Director, Cyber-Infrastructure Development, CCT
Computer Science
e-Science Institute, Edinburgh
http://www.cct.lsu.edu/~sjha
http://saga.cct.lsu.edu
Outline
• Critical Perspective on Large-Scale Distributed Applications and Production Cyber-Infrastructure (CI)
• Understanding Distributed Applications (DA)
  • Differ from HPC or || App, Challenges of DA
  • DA Development Objectives (IDEAS)
• Understanding SAGA
• Using SAGA to develop Distributed Applications
  • Frameworks
  • Abstractions for Dynamic Execution
  • Data-Intensive Applications
• Discuss how IDEAS are met
• Derive (Initial) User Requirements/Requests for FutureGrid
Critical Perspectives
• Distributed CI: Is the whole greater than the sum of the parts?
• Several BIG projects have success stories on TG
• But REAL science happens at ALL SCALES
  • Tools for individual users to innovate and develop?
• Infrastructure capabilities and policy determine application development, deployment and execution:
  • The proportion of applications that utilize multiple distributed sites sequentially, concurrently or asynchronously is low (~5%)
  • Not referring to tightly-coupled applications across multiple sites
  • TG (exclusively) supported legacy, static execution models
• Move data to computing, or compute where the data is?
  • Distributed Data/Jobs vs Bringing it all into the Cloud
• What novel applications & science has Distributed CI fostered?
Understanding Distributed Applications: Development Challenges
• Fundamentally a hard problem:
  • Dynamic resources, heterogeneous resources
  • Variable control (or lack thereof)
  • Add to it: Complex underlying infrastructure provisioning
• Programming Systems for Distributed Applications:
  • Incomplete? Customization? Extensibility?
• Computational Models of Distributed Computing
• Design Points: More than (peak) performance
  • Primary role of Usage Modes
• Range of DA, no clear taxonomy
Understanding Distributed Applications: Development Challenges
• Distributed Applications Require:
  • Coordination over Multiple & Distributed sites:
    • Scale-up and Scale-out
    • Logically or physically Distributed
• 1st Gen of Peta/Exa/Zetta/Yotta -- Applications requiring multiple-runs, ensembles, workflows..
• Core characteristics and challenges of logically and physically distributed applications are the SAME
• Inter-play of Requirements, Infrastructure, Usage Mode
The ability to develop simple, novel or effective distributed applications lags behind other aspects of CI.
General-purpose Distributed Application Development is lacking in NSF/OCI's portfolio.
Understanding Distributed Applications: Development Objectives
• Interoperability: Ability to work across multiple distributed resources
• Distributed Scale-Out: The ability to utilize multiple distributed resources concurrently
• Extensibility: Support new patterns/abstractions, different programming systems, functionality & Infrastructure
• Adaptivity: Response to fluctuations in dynamic resource and availability of dynamic data
• Simplicity: Accommodate above distributed concerns at different levels easily…
Challenge: How to develop DA effectively and efficiently with the above as first-class objectives?
SAGA: Basic Philosophy
• There is a lack of programmatic approaches that:
  • Provide general-purpose, common grid functionality for applications and thus hide underlying complexity, varying semantics..
  • Provide the building blocks upon which to construct “consistent” higher levels of functionality and abstractions
  • Hide “bad” heterogeneity and provide the means to address “good” heterogeneity
  • Meet the needs of a Broad Spectrum of Applications:
    • Simple scripts, Gateways, Smart Applications and Production Grade Tooling, Workflow…
• Simple, integrated, stable, uniform and high-level interface
  • Simple and Stable: 80:20 restricted scope and Standard
  • Integrated: Similar semantics & style across
  • Uniform: Same interface for different distributed systems
• SAGA: Provides Application* developers with the basic units required to compose high-level functionality across (distinct) distributed systems
(*) One Person’s Application is another Person’s Tool
SAGA: Job Submission
Role of Adaptors (middleware binding)
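The role of adaptors can be illustrated without SAGA itself. The sketch below is plain Python and does not use the SAGA API: the class names (`JobService`, `LocalAdaptor`, `EchoAdaptor`) and the URL schemes (`fork`, `echo`) are hypothetical, chosen only to show how one application-facing interface might dispatch to a middleware-specific binding selected by URL scheme.

```python
import subprocess
import sys
from urllib.parse import urlparse


class LocalAdaptor:
    """Hypothetical adaptor: runs the job as a local child process."""

    def run(self, executable, arguments):
        result = subprocess.run([executable, *arguments],
                                capture_output=True, text=True, check=True)
        return result.stdout


class EchoAdaptor:
    """Hypothetical stand-in for a remote middleware binding.

    It only reports what it would submit; a real adaptor would talk to
    a batch system or grid middleware at this point.
    """

    def run(self, executable, arguments):
        return "would submit: " + " ".join([executable, *arguments])


# Registry mapping a URL scheme to the adaptor class that handles it.
ADAPTORS = {"fork": LocalAdaptor, "echo": EchoAdaptor}


class JobService:
    """Application-facing interface; the same calls work for every backend."""

    def __init__(self, url):
        scheme = urlparse(url).scheme
        if scheme not in ADAPTORS:
            raise ValueError(f"no adaptor registered for scheme {scheme!r}")
        self._adaptor = ADAPTORS[scheme]()

    def run_job(self, executable, arguments=()):
        return self._adaptor.run(executable, list(arguments))


if __name__ == "__main__":
    # Identical application code, two different backends.
    for url in ("fork://localhost", "echo://cluster.example.org"):
        service = JobService(url)
        print(service.run_job(sys.executable, ["-c", "print('hello')"]))
```

The application code never names a backend class; only the URL changes, which is the property the slide attributes to adaptors.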
SAGA-based Frameworks: Types
• Frameworks: Logical structure for Capturing Application Requirements, Characteristics & Patterns
• Runtime and/or Application Framework
• Application Frameworks designed to either:
  • Pattern: Commonly recurring modes of computation
    • Programming, Deployment, Execution, Data-access..
    • MapReduce, Master-Worker, H-J Submission
  • Abstraction: Mechanism to support patterns and application characteristics
• Runtime Frameworks:
  • Load-Balancing – Compute and Data Distribution
• SAGA-based Framework: Infrastructure-independent
Abstractions for Dynamic Execution (1)
Container Task
Adaptive:
• Type A: Fix the number of replicas; vary the cores assigned to each replica.
• Type B: Fix the size of each replica; vary the number of replicas (Cool Walking)
  -- Same temperature range (adaptive sampling)
  -- Greater temperature range (enhanced dynamics)
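The two adaptive types differ only in which quantity is held fixed when the number of available cores changes. A minimal sketch, assuming cores are divided evenly and leftover cores stay idle (the function names and the core counts in the usage loop are illustrative, not from the talk):

```python
def type_a_allocation(available_cores, num_replicas):
    """Type A: replica count is fixed; cores per replica follow availability."""
    if num_replicas <= 0:
        raise ValueError("num_replicas must be positive")
    return {
        "replicas": num_replicas,
        "cores_per_replica": available_cores // num_replicas,
    }


def type_b_allocation(available_cores, cores_per_replica):
    """Type B: replica size is fixed; replica count follows availability."""
    if cores_per_replica <= 0:
        raise ValueError("cores_per_replica must be positive")
    return {
        "replicas": available_cores // cores_per_replica,
        "cores_per_replica": cores_per_replica,
    }


if __name__ == "__main__":
    # As the pool of cores grows, Type A widens each replica while
    # Type B adds more replicas of the same width.
    for cores in (64, 128, 256):
        print(cores, type_a_allocation(cores, 8), type_b_allocation(cores, 16))
```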
Abstractions for Dynamic Execution (2)
SAGA Pilot-Job (BigJob)
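The general Pilot-Job idea is that a placeholder job acquires cores once, and the application then binds its own tasks to those cores at run time. The toy simulation below illustrates only that late-binding step; it does not model BigJob or any SAGA interface, and the `Pilot` class, `run_tasks` function, and round-based execution are assumptions made for the sketch.

```python
from collections import deque


class Pilot:
    """Hypothetical placeholder job that owns a block of cores."""

    def __init__(self, name, cores):
        self.name = name
        self.free_cores = cores
        self.running = []

    def try_start(self, task):
        """Start the task if enough cores are free; report success."""
        if task["cores"] <= self.free_cores:
            self.free_cores -= task["cores"]
            self.running.append(task)
            return True
        return False

    def finish_all(self):
        """Mark every running task as done and release its cores."""
        finished, self.running = self.running, []
        self.free_cores += sum(t["cores"] for t in finished)
        return finished


def run_tasks(pilots, tasks):
    """Bind tasks to pilots in rounds until the queue is empty.

    Returns a list of rounds; each round maps a pilot name to the names
    of the tasks that ran on it during that round.
    """
    queue = deque(tasks)
    rounds = []
    while queue:
        waiting = deque()
        while queue:
            task = queue.popleft()
            # First pilot with enough free cores takes the task.
            if not any(p.try_start(task) for p in pilots):
                waiting.append(task)
        placed = {p.name: [t["name"] for t in p.finish_all()] for p in pilots}
        if not any(placed.values()):
            raise RuntimeError("a task needs more cores than any pilot holds")
        rounds.append(placed)
        queue = waiting
    return rounds
```

Because the pilots are already running, no task in this sketch waits in a system batch queue; the only waiting is for cores inside a pilot.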
Distributed Adaptive Replica Exchange (DARE)
Scale-Out, Dynamic Resource Allocation and Aggregation
Multi-Physics Runtime Frameworks: Extensibility
• Coupled multi-physics requires two distinct but concurrent simulations
• Can co-scheduling be avoided?
  • Adaptive execution model: Yes
• Load-balancing required.
  • Pilot-Job facilitates LB!
  • Across sites? (open Q)
• First demonstrated multi-platform Pilot-Job:
  • MPI-based TG – Condor GI
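One simple way to picture the load-balancing step for two coupled simulations is to split a fixed pool of cores so that both finish a coupling step at about the same time. The sketch below assumes ideal scaling (step time equals work divided by cores), which real codes will not achieve; the function names and work estimates are hypothetical and this is not the balancer used in the work described here.

```python
def balance_cores(total_cores, work_a, work_b):
    """Split cores between two simulations in proportion to their work.

    Assumes ideal scaling, so proportional shares equalize step times.
    Each simulation always receives at least one core.
    """
    if total_cores < 2:
        raise ValueError("need at least one core per simulation")
    if work_a <= 0 or work_b <= 0:
        raise ValueError("work estimates must be positive")
    cores_a = round(total_cores * work_a / (work_a + work_b))
    cores_a = min(max(cores_a, 1), total_cores - 1)
    return cores_a, total_cores - cores_a


def step_time(work, cores):
    """Time for one coupling step under the ideal-scaling assumption."""
    return work / cores


if __name__ == "__main__":
    cores_a, cores_b = balance_cores(64, work_a=3.0, work_b=1.0)
    print(cores_a, cores_b, step_time(3.0, cores_a), step_time(1.0, cores_b))
```

Inside a pilot, such a split can be recomputed between coupling steps from measured step times, which is what makes rebalancing cheap compared with resubmitting jobs.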
Ensemble Kalman Filters: Heterogeneous Sub-Tasks
• Ensemble Kalman filters (EnKF) are recursive filters for handling large, noisy data; here the EnKF is used for history matching and reservoir characterization
• EnKF is a particularly interesting case of irregular, hard-to-predict run-time characteristics
Results: Scale-Out Performance
• Using more machines decreases the TTC and the variation between experiments
• Using BQP decreases the TTC & variation between experiments further
• Lowest time to completion achieved when using BQP and all available resources
Khamra & Jha, GMAC, ICAC’09
But Why does BQP Help? The Case for System Sensors
Autonomic Integration of HPC Grids-Clouds
EnKF: Extensibility and Interoperability (work with M. Parashar et al.; accepted for e-Science 2009)
• Application Objectives:
  • Acceleration
  • Resilience
  • Conservation
• Pull vs Push Task map
Application-level Interoperability: Cloud-Cloud; Cloud-Grid
• Application-level (ALI) vs. System-level Interoperability (SLI)
• Infrastructure Independence is a Pre-requisite for ALI
• The case for both Grids AND Clouds:
  • Hybrid & Heterogeneous workload: data-compute affinity differs
  • Availability zone, Data-transfer cost..
  • Complex data-flow dependency: need runtime determination
• Just because you can use Grids AND Clouds, should you? Important Research Question: When should you?
• Runtime Decision: Mechanism to determine when/if?
  • Should be influenced by Application Objectives
• Programming Model should be Infrastructure independent
  • Same application, priced differently, for same performance
  • Same application, priced same, for different performance
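A runtime decision driven by application objectives can be pictured as a small selection function. The sketch below is an assumption-laden illustration, not the mechanism from the cited work: it covers only two objectives (acceleration and conservation), and the option names, runtimes, and costs are made-up inputs that a real system would have to estimate.

```python
def choose_resource(options, objective, deadline_s=None, budget=None):
    """Pick an execution option according to the application objective.

    options:   list of dicts with hypothetical keys "name", "runtime_s", "cost"
    objective: "acceleration" (fastest within budget) or
               "conservation" (cheapest within deadline)
    Returns the chosen option's name, or None if nothing is feasible.
    """
    feasible = [
        o for o in options
        if (deadline_s is None or o["runtime_s"] <= deadline_s)
        and (budget is None or o["cost"] <= budget)
    ]
    if not feasible:
        return None
    if objective == "acceleration":
        best = min(feasible, key=lambda o: (o["runtime_s"], o["cost"]))
    elif objective == "conservation":
        best = min(feasible, key=lambda o: (o["cost"], o["runtime_s"]))
    else:
        raise ValueError(f"unknown objective {objective!r}")
    return best["name"]


if __name__ == "__main__":
    # Illustrative numbers only.
    options = [
        {"name": "grid", "runtime_s": 3600, "cost": 0.0},
        {"name": "cloud", "runtime_s": 1200, "cost": 8.0},
    ]
    print(choose_resource(options, "acceleration"))
    print(choose_resource(options, "conservation", deadline_s=7200))
```

The same application code can then run wherever the function points, which is only possible if the programming model is infrastructure independent.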
SAGA-based Frameworks: Examples
• SAGA-based Pilot-Job Framework (FAUST)
  • Extend to support Load-balancing for multi-components
• SAGA MapReduce Framework:
  • Control the distribution of Tasks (workers)
  • Master-Worker: File-Based &/or Stream-Based
  • Data-locality optimization using SAGA’s replica API
• SAGA NxM Framework:
  • Compute Matrix Elements, each is a Task
  • All-to-All Sequence comparison
  • Control the distribution of Tasks and Data
  • Data-locality optimization via external (runtime) module
Distributed Data Intensive Applications: Research Challenges
• Goal: Develop DDI scientific applications to utilize a broad range of distributed systems, without vendor lock-in, or disruption, yet with the flexibility and performance that scientific applications demand.
• Frameworks as possible solutions
• Frameworks address some primary challenges in developing Distributed DI Applications
  • Coordination of distributed data & computing
  • Runtime (Dynamic) scheduling, placement
  • Fault-tolerance
• Many Challenges in developing such Frameworks:
  • What are the components? How are they coupled? Functionality expressed/exposed? Coordination?
  • Layering, Ordering, Encapsulations of Components
• “Just because you can’t use MPI (on distributed systems), doesn’t mean you can’t use other approaches”
SAGA-MapReduce (Miceli, Jha et al., CCGrid’09; Merzky, Jha et al., GPC’09)
• Interoperability: Use multiple infrastructures concurrently
• Control the NW placement
• Simple staging of data
• SAGA-Sphere-Sector:
  • Open Cloud Consortium
  • Stream processing model
  • Ongoing work
  • Apply to all elements (files) in a data-set (stream)
Ts: Time-to-solution, including data-staging for SAGA-MapReduce (simple file-based mechanism)
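For readers unfamiliar with the pattern, the master-worker structure behind a MapReduce framework can be sketched in a few lines. This is a generic word-count illustration using local threads as stand-in workers; it is not SAGA-MapReduce, and the function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_chunk(chunk):
    """Map step: count the words in one chunk of text."""
    return Counter(chunk.split())


def reduce_counts(partials):
    """Reduce step: merge the per-chunk counts into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def map_reduce(chunks, num_workers=2):
    """Master: hand chunks to workers, then reduce their results."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce_counts(partials)


if __name__ == "__main__":
    print(map_reduce(["a b a", "b c"]))
```

In a distributed setting the interesting decisions are where each worker runs and where each chunk lives, which is the placement control listed on the slide.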
SAGA All-Pairs: Runtime Data Placement
• Classical: Place task on 4 LONI machines (512px Dell Clusters)
  • Simple data staging
• “Intelligent”: Map a task to a resource based upon Cost
  • Cost = Data Dependency + transfer times (latency)
• “Ignoring Intelligent mapping is no longer an option”
  • Quote: Miceli (undergraduate)
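A cost-based mapping of this kind can be sketched as follows. The cost model here is an assumption made for illustration: each input file that is not already on the candidate resource costs one link latency plus size divided by bandwidth. The function names, site names, and numbers are hypothetical and do not reproduce the model used in the SAGA All-Pairs work.

```python
def transfer_time(size_mb, link):
    """Seconds to move one file over a link (latency plus size/bandwidth)."""
    return link["latency_s"] + size_mb / link["bandwidth_mb_s"]


def placement_cost(task_files, resource, file_locations, file_sizes_mb, links):
    """Total time to stage every non-local input file to the resource."""
    cost = 0.0
    for name in task_files:
        source = file_locations[name]
        if source != resource:
            cost += transfer_time(file_sizes_mb[name], links[(source, resource)])
    return cost


def place_task(task_files, resources, file_locations, file_sizes_mb, links):
    """Map the task to the resource with the lowest staging cost."""
    return min(
        resources,
        key=lambda r: placement_cost(
            task_files, r, file_locations, file_sizes_mb, links
        ),
    )


if __name__ == "__main__":
    # Illustrative inputs: a large file on site1, a small one on site2.
    locations = {"A": "site1", "B": "site2"}
    sizes = {"A": 1000.0, "B": 10.0}
    link = {"latency_s": 0.1, "bandwidth_mb_s": 100.0}
    links = {("site1", "site2"): link, ("site2", "site1"): link}
    print(place_task(["A", "B"], ["site1", "site2"], locations, sizes, links))
```

With these inputs the task is placed next to the large file, since moving the small file is far cheaper than moving the large one.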
Understanding Distributed Applications: Development Objectives Redux
• Interoperability: Ability to work across multiple distributed resources
  • SAGA: Middleware Agnostic
• Distributed Scale-Out: The ability to utilize multiple distributed resources concurrently
  • Support Multiple Pilot-Jobs: Ranger, Abe, QB
• Extensibility: Support new patterns/abstractions, different programming systems, functionality & Infrastructure
  • Pilot-Job also Coupled CFD-MD, Integrated BQP
• Adaptivity: Response to fluctuations in dynamic resource and availability of dynamic data
• Simplicity: Accommodate above distributed concerns at different levels easily…
Early User: An Environment that Supports
• Echo what Andrew Grimshaw said!!
  • e.g., test-bed for Standards interoperation
• Trivial Remarks:
  • Not obsessed with system utilization like TG
  • Policies that support IDEAS as first-class concerns
• Support Dynamic, First-Principles Explicitly Distributed App.
• Dynamic, Adaptive Applications:
  • Dynamic Resource Utilization:
    • e.g., BQP (Jha et al, GMAC, ICAC Barcelona 2009)
    • Grid Observatory (EGEE) – all kinds of Traces
  • Dynamic Adaptive Data:
    • Network Aware Application (Jha et al, IEEE eScience ’07)
    • Data Scheduler: Big Data, Frequent Data
Early User: An Environment that Supports
• Autonomic Computational Science Applications
  • Support the tuning of and by Applications
• Platform for developing (SAGA) AF and RT Frameworks
  • Design, Stand-up and Experiment with Frameworks
  • e.g., load-balancer for dynamic resource allocation
  • SAGA-MapReduce, NxM
  • e.g., Control Relative Placement of Data/Compute
• Supporting Distributed Abstractions – Development, Deployment and Execution-level
• A controlled but realistic environment
  • RAIN – Dynamic Provisioning (provide clean API)
  • (Reproducible) Experimental Manager, VAMPIR
  • [Connection with Grid Observatory]
SAGA-based Tools and Projects
One person’s Tool is another person’s Application
• DESHL
  • DEISA-based Shell and Workflow library
• JSAGA from IN2P3 (Lyon)
  • http://grid.in2p3.fr/jsaga/index.html
• GANGA-DIANE
  • gLite
• XtreemOS (Based upon SAGA for the Distribution)
• NAREGI/KEK
  • SD Specification
  • With gLite adaptors
Advantage of Standards
Acknowledgements
SAGA Team and DPA Team and the UK-EPSRC (UK EPSRC: DPA, OMII-UK, OMII-UK PAL)
People:
SAGA D&D: Hartmut Kaiser, Ole Weidner, Andre Merzky, Joohyun Kim, Lukasz Lacinski, João Abecasis, Chris Miceli, Bety Rodriguez-Milla
SAGA Users: Andre Luckow, Yaakoub el-Khamra, Kate Stamou, Cybertools (Abhinav Thota, Jeff, N. Kim), Owain Kenway
Google SoC: Michael Miceli, Saurabh Sehgal, Miklos Erdelyi
Collaborators and Contributors: Steve Fisher & Group, Sylvain Renaud (JSAGA), Go Iwai & Yoshiyuki Watase (KEK)
DPA: Dan Katz, Murray Cole, Manish Parashar, Omer Rana, Jon Weissman
Abstractions for Distributed Applications and Systems: A Computational Science Perspective Authors: S Jha, D Katz, M Parashar, O Rana, J Weissman Upcoming Book by Wiley (Summer 2010)
SAGA: Building the abstractions to Bridge the Infrastructure-Applications Gap Focus on Application Development and Characteristics, not infrastructure details
DAG-based Workflow Applications: Extensibility Approach
[Figure: three phases: Application Development Phase; Generation & Exec. Planning Phase; Execution Phase]
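The planning phase of a DAG-based workflow can be illustrated with the standard-library `graphlib` module (Python 3.9 or later). This is a generic sketch, not the planner from the talk: the task names are hypothetical, and the plan simply groups tasks into waves whose members have no unmet dependencies and could therefore run concurrently.

```python
from graphlib import TopologicalSorter


def plan_waves(dependencies):
    """Turn a DAG into ordered waves of tasks.

    dependencies maps each task name to the set of tasks it depends on.
    Every task in a wave depends only on tasks from earlier waves.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    # Development phase: the application describes its tasks as a DAG.
    dag = {
        "merge": {"sim_a", "sim_b"},
        "sim_a": {"prep"},
        "sim_b": {"prep"},
    }
    # Planning phase: derive an execution order.
    for number, wave in enumerate(plan_waves(dag), start=1):
        # Execution phase would dispatch each wave to resources here.
        print(number, wave)
```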