850 likes | 1.32k Views
SAGA: An Overview. Shantenu Jha for the SAGA Team http:// saga.cct.lsu.edu. Overview. The SAGA Philosophy A Fresh Perspective on Distributed Applications and CI SAGA in a Nutshell SAGA Landscape Individual APIs OGF standard SAGA in action Applications
E N D
SAGA: An Overview Shantenu Jha for the SAGA Team http://saga.cct.lsu.edu
Overview • The SAGA Philosophy • A Fresh Perspective on Distributed Applications and CI • SAGA in a Nutshell • SAGA Landscape • Individual APIs • OGF standard • SAGA in action • Applications • Tools, Frameworks, Gateways, Access Layers.. • Uptake and Roadmap
Ability to develop simple, novel or effective distributed applications lags behind other aspects of CI • Distributed CI: Is the whole > than the sum of the parts? • Infrastructure capabilities (tools, programming systems) and policy determine applications, type development & execution: • Proportion of App. that utilize multiple distributed sites sequentially, concurrently or asynchronously is low • Not referring to tightly-coupled across multiple-sites • Focus on extending legacy, static execution models • Scale-Out of Simulations? Compute where the data is? • What novel applications & science has Distributed CI fostered • Distinguish challenges of provisioning Distributed CI versus support for application development Critical Perspectives
Critical PerspectivesQuick Analysis • Several Factors responsible for perceived & actual lack of DA • Developing Distributed Applications is fundamentally hard! • Coordination across multiple distinct resources • Range of tools, prog. systems and environments large • Interoperability and extensibility become difficult • Commonly accepted abstractions not available • E.g. Pilot-Job powerful, but no “unifying” tool on TG • Deployment and execution challenges disjoint from the development process • Generally good idea, but application development often influences where and how it can be deployed/executed
Distributed Applications Development Challenges • Developing Distributed Applications is fundamentally hard • Intrinsic: • Design Points: Dynamical and Heterogeneous resources and Variable Control (or lack thereof) • Coordination over Multiple & Distributed sites • Scale-up and Scale-out • Models of Distributed Applications: • More than (peak) performance • Primary role of Usage Modes • Extrinsic: • (Complex) Underlying infrastructure & its provisioning • Programming systems, tools and environments
Distributed ApplicationsDevelopment Challenges • Dist. Application and Programming Systems and Tools • Incompleteness and/or out-of-phase: • Need X and Y, but only X or Y available, • e.g., Master-Worker paradigm supported, but no FT. • Customization: • Works well with tool A but not B, • e.g., Pegasus-DAGMAN-Condor • Robustness and Scalability: • Works well in small or controlled environment, but not at-Scale • e.g., SAGA–based Montage
Distributed Applications and PGILessons from Applications • There exist distributed applications that aim to utilize multiple resources: • Complex Coordination, Data Mgmt and FT issues • Complex structures at different phases • Emerging Infrastructure present operational challenges • Don’t always provide the application requirement • Missing abstractions • Development, Deployment and Execution • e.g. Coding against low-level middleware • Issues of policy and infrastructure design decisions • e.g. Co-allocation supported or not • e.g. Narrow versus broad Grids
Distributed Application and PGI (2) • Lack of explicit support for addressing these challenges becomes self-fulfilling and self-perpetrating • Most PGI not designed to support distributed applications, increasing effort to develop/deploy/execute • Small number of distributed applications on TG leads to focus on single-site applications • (Ironically) Most applications have been developed to hide from heterogeneity and dynamism; not embrace them • Good heterogeneity vs Bad heterogeneity • Dynamism: Performance Advantages • Applications have been brittle and not extensible: • Tied to specific tool or prog. system (& thus PGI)
Distributed Application and PGI (3) • Successful Evolution of Supported Capabilities and PGI • OSG support of HTC • Condor from scavenging system to building block • Condor Flocking • TeraGrid and Gateways (User-level Abstraction) • Several new capabilities for new communities • Not so Successful Evolution of Capabiliites on PGI • Co-scheduling on PGI • Both technical and policy issues • Scale-out of DAGs/Loosely-coupled ensembles • Execution is logically or physically distributed
Distributed Application and PGI (4) • Supporting applications that have “come-of-age” useful • Coupling real-time simulations to live sensor data (LEAD) • Currently neither OSG nor TG can support DDDAS • Often the underlying infrastructure and capabilities change quicker than application • Infrastructure: Grids and Clouds • Many challenges remain the same, e.g., requirements of distribution coordination of data and computation • Role for abstractions • Applications still around: SFExpress, Netsolve->GridSolve
Distributed Application and PGI (5)Miscellaneous Factors • Role of Funding Agencies: • Changed vision, funding landscape • TG: Run anywhere with tools/prog systems to support primarily static HPC applications • Focus on static as opposed to distributed scale-out robust workflows • http://www.cct.lsu.edu/~sjha/presentations/panel_discussion/punch_counterpunch.pdf • Role of Standard: • Lack of standards impedes interoperation • No standards; • Chicken-and-egg situation • Simple standards have been effective, but with limited impact, eg., troika of JSDL, HPC-BP, OGSA-BES
Distributed Applications How do they differ from traditional HPC applications? PGI design needs to be reflect some or all of these: • Performance Models: • Not just “peak utilization”; e.g., HPC & HTC (# of jobs) • Usage Modes: • The same application has multiple usage modes • How applications are developed, deployed and executed is often determined by the infrastructure • Static vs Dynamic Execution: • Static applications is not enough; varying resource conditions, application requirements • Skillful Decomposition vs Aggregation • Primacy of Coordination across distributed resources
Understanding Distributed ApplicationsIDEAS: First Principles Development Objectives • Interoperability:Ability to work across multiple distributed resources • Distributed Scale-Out: The ability to utilize multiple distributed resources concurrently • Extensibility:Support new patterns/abstractions, different programming systems, functionality & Infrastructure • Adaptivity:Response to fluctuations in dynamic resource and availability of dynamic data • Simplicity:Accommodate above distributed concerns at different levels easily… Challenge: How to develop DA effectively and efficiently with the above as first-class objectives?
Overview • The SAGA Philosophy • A Fresh Perspective on Distributed Applications and CI • SAGA in a Nutshell • SAGA Landscape • Individual APIs • OGF standard • SAGA in action • Applications • Tools, Frameworks, Gateways, Access Layers.. • Uptake and Roadmap
SAGA: In a nutshell • There exists a lack of Programmatic approaches that: • Provide general-purpose, basic & common grid functionality for applications and thus hide underlying complexity, varying semantics.. • The building blocks upon which to construct “consistent” higher-levels of functionality and abstractions • Hides “bad” heterogeneity, means to address “good” heterogeneity • Meets the need for a Broad Spectrum of Application: • Simple scripts, Gateways, Smart Applications and Production Grade Tooling, Workflow… • Simple, integrated, stable, uniform and high-level interface • Simple and Stable: 80:20 restricted scope and Standard • Integrated: Similar semantics & style across • Uniform: Same interface for different distributed systems • SAGA: Provides Application* developers with units required to compose high-level functionality across (distinct) distributed systems • (*) One Person’s Application is another Person’s Tool
SAGA: Specification Landscape Blue lines show which packages have input in the Experience document
SAGA: Job Submission Role of Adaptors (middleware binding) SAGA: Role of Adaptors Text
SAGA Task Model • All SAGA objects implement the task model • Every method has three “flavors” • Synchronous version - the implementation • Asynchronous version - synchronous version wrapped in a task (thread) and started • Task version - synchronous version wrapped in a task but not started (task handle returned) • Adaptor can implement own async. version
SAGA Implementation: Extensibility • Horizontal Extensibility – API Packages • Current packages: • file management, job management, remote procedure calls, replica management, data streaming • Steering, information services, checkpoint… • Vertical Extensibility – Middleware Bindings • Different adaptors for different middleware • Set of ‘local’ adaptors • Extensibility for Optimization and Features • Bulk optimization, modular design
SAGA C++ quick tour • Open Source - released under the Boost Software License 1.0 • Implemented as a set of libraries • SAGA Core - A light-weight engine / runtime that dispatches calls from the API to the appropriate middle-ware adaptors • SAGA functional packages - Groups of API calls for: jobs, files, service discovery, advert services, RPC, replicas, CPR, ... (extensible) • SAGA language wrappers - Thin Python and C layers on top of thenative C++ API • SAGA middleware adaptors - Take care of the API call execution on the middleware • Can be configured / packaged to suit your individual needs!
SAGA: Available Adaptors • Job Adaptors • Fork (localhost), SSH, Condor, Globus GRAM2, OMII GridSAM, Amazon EC2, Platform LSF • File Adaptors • Local FS, GlobusGridFTP, Hadoop Distributed Filesystem (HDFS),CloudStore KFS, OpenCloud Sector-Sphere • Replica Adaptors • PostgreSQL/SQLite3, Globus RLS • Advert Adaptors • PostgreSQL/SQLite3, Hadoop H-Base, Hypertable
SAGA: Available Adaptors (2) • Other Adaptors • Default RPC / Stream / SD • Planned Adaptors • CURL file adaptor, gLite job adaptor (Ole), ….. • Open issues • We’re in the process of consolidating the adaptor code base and adding rigorous tests in order to improve adaptor quality • Capability Provider Interface (CPI - the ‘Adaptor API’) is notdocumented or standardized (yet), but looking at existing adaptor code should get you started if you want to develop your own adaptor.
SAGA API: Towards a StandardStandards promote Interoperability • The need for standard programming interface • Trade-off “Go it alone” versus “Community” model • Reinventing the wheel again, yet again, & then again • MPI a useful analogy of community standard • Vendors (Resource Provider), Software developers, users.. • social/historic parallels also important • Time to adoption, after specification .... • OGF the natural choice (SAGA-RG, SAGA-WG) • Spin-off of the Applications Research Group • Driven by UK, EU (German/Dutch), US • Design derived from 23 Use Cases • different projects, applications and functionality • biological, coastal modelling, visualization • Will discuss the advantage of SAGA as a standard specification
Overview • The SAGA Philosophy • A Fresh Perspective on Distributed Applications and CI • SAGA in a Nutshell • SAGA Landscape • Individual APIs • OGF standard • SAGA in Action • Applications • Tools, Frameworks, Gateways. Access Layers.. • Uptake and Roadmap
Understanding Distributed Applications Implicit vs Explicitly Distributed ? • Which approach (implicit vs. explicit) is used depends: • How the application is used? • Need to control/marshal more than one resource? • Why distributed resources are being used? • How much can be kept out of the application? • Can’t predict in advance? • Not obvious what to do, application-specific metric • If possible, Applications should not be explicitly distributed • GATEWAYS approach: • Implicit for the end-users • Supporting Applications? Or Application Usage Modes?
Taxonomy of Distributed Application Development • Example of Distributed Execution Mode: • Implicitly Distributed • HTC of HTC: 1000 job submissions of NAMD the TG/LONI • SAGA shell example (cf DESHL) • Example of Explicit Coordination and Distribution • Explicitly Distributed • DAG-based Workflows (example of Higher-level API) • EnKF-HM application • Example of SAGA-based Frameworks • Pilot-Jobs, Fault-tolerant Autonomic Framework • MapReduce, All-Pairs • Note: An application can belong to more than one type
DAG based Workflow ApplicationsExtensibility and Higher-level API Application Development Phase Generation & Exec. Planning Phase Execution Phase
Ensemble Kalman FiltersHeterogeneous Sub-Tasks • Ensemble Kalman filters (EnKF), are recursive filters to handle large, noisy data; use the EnKF for history matching and reservoir characterization • EnKF is a particularly interesting case of irregular, hard-to-predict run time characteristics:
Results: Scale-Out Performance • Using more machines decreases the TTC and variation between experiments • Using BQP decreases the TTC & variation between experiments further • Lowest time to completion achieved when using BQP and all available resources Khamra & Jha, GMAC, ICAC’09
Extreme Distribution: Frameworks? • History match on a 1 million grid cell problem, with a thousands of ensemble members • The entire system will have a few billion degrees of freedom • This will increase the need for scale-out, autonomy, fault tolerance, self healing etc... 43
Extreme Distribution: Frameworks? • History match on a 1 million grid cell problem, with a thousand ensemble members • The entire system will have a few billion degrees of freedom • This will increase the need for scale-out, autonomy, fault tolerance, self healing etc... 44
Extreme Distribution: Frameworks? • History match on a 1 million grid cell problem, with a thousand ensemble members • The entire system will have a few billion degrees of freedom • This will increase the need for scale-out, autonomy, fault- tolerance, self healing etc... 45
SAGA-based Frameworks: Types • Frameworks: Logical structure for Capturing Application Requirements, Characteristics & Patterns • Runtime and/or Application Framework • Application Frameworks designed to either: • Pattern: Commonly recurring modes of computation • Programming, Deployment, Execution, Data-access.. • MapReduce, Master-Worker, H-J Submission • Abstraction: Mechanism to support patterns and application characteristics • Runtime Frameworks: • Load-Balancing – Compute and Data Distribution • SAGA-based Framework: Infrastructure-independent
SAGA-based Frameworks: Examples • SAGA-based Pilot-Job Framework (FAUST) • Extend to support Load-balancing for multi-components • SAGA MapReduce Framework: • Control the distribution of Tasks (workers) • Master-Worker: File-Based &/or Stream-Based • Data-locality optimization using SAGA’s replica API • SAGA NxM Framework: • Compute Matrix Elements, each is a Task • All-to-All Sequence comparison • Control the distribution of Tasks and Data • Data-locality optimization via external (runtime) module
Abstractions for Dynamic Execution Container Task Adaptive: Type A: Fix number of replicas; vary cores assigned to each replica. Type B: Fix the size of replica, vary number of replicas (Cool Walking) -- Same temperature range (adaptive sampling) -- Greater temperature range (enhanced dynamics)