Notes On the GAE
Harvey B. Newman, California Institute of Technology
Grid-enabled Analysis Environment Workshop, June 24, 2003
GAE Workshop Goals (1)
• “Getting Our Arms Around” the Grid-Enabled Analysis “Problem”
• Review Existing Work Towards a GAE: Components, Interfaces, System Concepts
• Review Client Analysis Tools; Consider How to Integrate Them
• User Interfaces: What Does the GAE Desktop Look Like? (Different Flavors)
• Look At Requirements, Ideas for a GAE Architecture
• A Vision of the System’s Goals and Workings
• Attention to Strategy and Policy
• Develop (Continue) a Program of Simulations of the System
• For the Computing Model, and Defining the GAE
• Essential for Developing a Feasible Vision; Developing Strategies, Solving Problems and Optimizing the System
• With a Complementary Program of Prototyping
GAE Collaboration Desktop Example
• Four-screen Analysis Desktop: 4 Flat Panels, 5120 x 1024; RH9
• Driven by a single server and single graphics card
• Allows simultaneous work on:
• Traditional analysis tools (e.g. ROOT)
• Software development
• Event displays (e.g. IGUANA)
• MonALISA monitoring displays; Other “Grid Views”
• Job-progress Views
• Persistent collaboration (e.g. VRVS; shared windows)
• Online event or detector monitoring
• Web browsing, email
GAE Workshop Goals (2)
• Architectural Approaches: Choose A Feasible Direction
• For example a Managed Services Architecture
• Be Prepared to Learn by Doing; Simulating and Prototyping
• Where to Start, and the Development Strategy
• Existing and Missing Parts of the System [Layers; Concepts]
• When to Adapt Existing Components, Or to Re-Build Them “from Scratch”
• Manpower Available to Meet the Goals; Shortfalls
• Allocation of Tasks; Including Generating a Plan
• Linkage Between Analysis and Grid-Enabled Production
• Planning for Closer Relationship with LCG, Trillium, and the Experiments’ Starting Efforts in this Area
HENP Grids: Services Architecture Design for a Global System
• Self Discovering, Cooperative
• Registered Services, Lookup Services; self-describing
• “Spaces” for Mobile Code and Parameters
• Scalable and Robust
• Multi-threaded: with a thread-pool managing engine
• Loosely Coupled: errors in a thread don’t stop the task
• Stateful: System State as well as task state
• Rich set of “problem” situations: implies Grid Views, and User/System Dialogues on what to do
• For Example: Raise Priority (Burn Quota); or Redirect Work
• Eventually may be increasingly automated as we scale up and gain experience
• Managed; to deal with a Complex Execution Environment
• Real-time higher-level supervisory services monitor, track, optimize and Revive/Restart services as needed
• Policy and strategy-driven; Self-Evaluating and Optimizing
• Investable with increasing intelligence
• Agent Based; Evolutionary Learning Algorithms
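The "Multi-threaded" and "Loosely Coupled" points above can be illustrated with a minimal Python sketch. It is not the GAE engine itself: the function name `run_loosely_coupled` and the subtask names are hypothetical, and the standard-library `ThreadPoolExecutor` stands in for whatever thread-pool managing engine the architecture would actually use. The point shown is only that a failure in one thread is recorded for a supervisor rather than stopping the whole task.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_loosely_coupled(subtasks, max_workers=4):
    """Run callables in a thread pool; a failing subtask is recorded, not fatal.

    `subtasks` maps a (hypothetical) subtask name to a zero-argument callable.
    Returns (results, failures) so a supervisory service could decide what
    to revive or restart.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fn): name for name, fn in subtasks.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:  # isolate the error to this subtask
                failures[name] = exc
    return results, failures


def _broken():
    # Stand-in for a subtask that hits a problem (assumed failure mode).
    raise RuntimeError("site unreachable")


results, failures = run_loosely_coupled({
    "histogram_a": lambda: 42,
    "histogram_b": _broken,
    "histogram_c": lambda: 7,
})
print(sorted(results), sorted(failures))
```

Returning the failures instead of raising leaves the "what to do" decision (raise priority, redirect work, restart) to a higher-level service or to the user, which is the division of labor the slide describes.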
Getting Started Towards a Workable GAE (1)
• Work on Computing Model (Essential) in Parallel
• Focus on a Few Scenarios for Doing Analysis
• “Grid Enabled PROOF” [in CMS; in ATLAS]
• Start with Existing Analysis Applications: Can they be recast in GAE Form?
• Make Some Starting Assumptions
• Need some simple picture of persistency
• Supplementary considerations:
• Multiuser situation (e.g. with avatars; then Analysis Challenges)
• Coming to a few Either/Or Decisions
• List of rudimentary analysis tools, and way of working
• “External” to the application considerations:
• Job planning
• Key role of query estimation (not only beforehand)
• Transparency versus tracking
Getting Started Towards a Workable GAE (2)
• Session or Sessions on the Desktop
• Three Modes of Working; All in the GAE
• Immediate (within a few seconds)
• In the background (seconds to a few minutes)
• Spawn batch job or jobs (minutes to hours)
• Decisions and tradeoffs
• Lay out the strategies and consequences (time, quota, etc.)
• Present Choices
• Monitor progress or get “alarms” and be prepared to re-strategize
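As a rough illustration of the three modes of working, the Python sketch below maps an estimated run time to a mode. The slide gives only approximate ranges, so the two cut-off constants (5 seconds and 3 minutes) are assumptions chosen for the example, as are the names `Mode` and `choose_mode`.

```python
from enum import Enum


class Mode(Enum):
    IMMEDIATE = "immediate"    # answer within a few seconds
    BACKGROUND = "background"  # seconds to a few minutes
    BATCH = "batch"            # minutes to hours


# Assumed cut-offs in seconds; the slide states only rough ranges.
IMMEDIATE_LIMIT_S = 5.0
BACKGROUND_LIMIT_S = 180.0


def choose_mode(estimated_seconds):
    """Pick a mode of working from an estimated time to completion."""
    if estimated_seconds < 0:
        raise ValueError("estimated_seconds must be non-negative")
    if estimated_seconds <= IMMEDIATE_LIMIT_S:
        return Mode.IMMEDIATE
    if estimated_seconds <= BACKGROUND_LIMIT_S:
        return Mode.BACKGROUND
    return Mode.BATCH


for seconds in (2, 60, 3600):
    print(seconds, choose_mode(seconds).value)
```

In a real system the choice would presumably also weigh quota and policy, and would be presented to the user as one of several options rather than decided silently.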
Getting Started Towards a Workable GAE (3)
• Smart Caching: Of Methods, of Data, or of Time-to-Process Info
• Intelligence in the system does not only mean problem solving
• Need to apply intelligence/experience to progressively improve system performance
• Time-to-completion estimation: process a small amount of data to get a realistic first estimate
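The time-to-completion idea can be sketched in a few lines of Python: time a small sample, then extrapolate to the events that remain. This is a minimal sketch under two assumptions not stated on the slide: that cost scales linearly with the number of events, and that the first events are representative of the rest. The function name, the `sample_size` default, and the injectable `clock` parameter are all hypothetical.

```python
import time


def estimate_remaining_seconds(process_event, events, sample_size=100,
                               clock=time.perf_counter):
    """Process the first `sample_size` events and extrapolate linearly.

    Returns the estimated time, in seconds, needed for the events that
    were NOT part of the sample. Assumes per-event cost is roughly constant.
    """
    sample = events[:sample_size]
    if not sample:
        return 0.0
    start = clock()
    for event in sample:
        process_event(event)
    elapsed = clock() - start
    per_event = elapsed / len(sample)
    return per_event * (len(events) - len(sample))


# Usage with a dummy per-event computation (illustrative only).
dummy_events = list(range(10_000))
estimate = estimate_remaining_seconds(lambda e: e * e, dummy_events)
print(f"estimated remaining time: {estimate:.6f} s")
```

Because the sample is actually processed rather than discarded, its results can be kept, and the estimate can be refreshed as more data is processed, in line with the earlier remark that estimation is useful "not only beforehand".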
3 Slides About Building a Computing Model & the GAE System
• These Slides Focus on Simulation/Prototyping, as an Integral Part of designing and building distributed systems for the GAE, and for the Grid-Enabled Production Environment (GPE) as well
Building a Computing Model and an Analysis Strategy (I)
• Generate a Blueprint: A “Computing Model”
• Tasks, Workload, Facilities, Priorities & GOALS
• Persistency; Modes of Accessing Data (e.g. Object Collections)
• What runs where; when to redirect
• The User’s Working Environment
• What is normal (managing expectations)?
• Guidelines for dealing with problems: based on which information?
• Performance and problem reporting/tracking/handling?
• Known Problems: Strategies to deal with those
• Set up, code a Simulation of the Model
• Develop mechanisms and sub-models as needed
• Set up prototypes to measure the performance parameters where not already known to sufficient precision
Building a Computing Model and an Analysis Strategy (II)
• Run simulations (avatars for “actors”; agents; tasks; mechanisms)
• Analyze and evaluate performance
• General performance (throughput; turnaround)
• Ensure “all” work is done, and learn how to do this: within a reasonable time; compatible with the Collaboration’s guidelines
• Vary Model to Improve Performance
• Deal with bottlenecks and other problems
• New strategies and/or mechanisms to manage workflow
• Represent key features and behaviors, for example:
• Responses to Link or Site failures
• User input to redirect data or jobs
• Monitoring information gathering
• Monitoring and management agent actions and behaviors in a variety of situations
• Validate the Model
• Using Dedicated setups
• Using Data Challenges (measure, evaluate, compare; fix key items)
• Learn of new factors and/or behaviors to take into account
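The loop of "run simulations, evaluate throughput and turnaround, vary the model" can be made concrete with a deliberately tiny Python example. It is a toy stand-in, not the simulation program the slides refer to: all jobs are assumed to be submitted at time zero, the facility is reduced to a number of identical slots, and each job simply takes the slot that frees up first. The function name `simulate` and the job durations are hypothetical.

```python
import heapq


def simulate(job_durations, n_slots):
    """Toy model: jobs submitted at t=0 run FIFO on `n_slots` identical slots.

    Returns the makespan, the throughput (jobs per unit time) and the mean
    turnaround (mean finish time, since every job is submitted at t=0).
    """
    if n_slots < 1:
        raise ValueError("n_slots must be at least 1")
    free_at = [0.0] * n_slots  # time at which each slot next becomes free
    heapq.heapify(free_at)
    finish_times = []
    for duration in job_durations:
        start = heapq.heappop(free_at)  # earliest available slot
        finish = start + duration
        finish_times.append(finish)
        heapq.heappush(free_at, finish)
    makespan = max(finish_times, default=0.0)
    n_jobs = len(finish_times)
    return {
        "makespan": makespan,
        "throughput": n_jobs / makespan if makespan else 0.0,
        "mean_turnaround": sum(finish_times) / n_jobs if n_jobs else 0.0,
    }


jobs = [4, 2, 2, 4]          # assumed job durations, arbitrary time units
baseline = simulate(jobs, n_slots=2)
varied = simulate(jobs, n_slots=4)  # "vary the model": double the capacity
print(baseline)
print(varied)
```

Comparing `baseline` with `varied` is the smallest possible version of varying the model to improve performance; link or site failures, user redirection and monitoring agents would each need their own sub-models.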
Building a Computing Model and an Analysis Strategy (III)
• MAJOR Milestone: Obtain a first picture of a Model that Seems to Work
• This may or may not involve changes in the computing resource requirements estimates, or in Collaboration policies and expectations
• It is hard to estimate how long it will take to reach this milestone [most experiments until now have reached it after the start of data taking]
• Evolve the Model to
• Distinguish what works and what does not
• Incorporate evolving site hardware and network performance
• Progressively incorporate new and “better” strategies, to improve throughput and/or turnarounds, or fix critical problems
• Take into account experience with the actual software-system components as they develop
• In parallel with the Model evolution keep developing the overall data analysis + Grid + monitoring “system”; represent it in the simulation
• And the associated strategies