140 likes | 159 Views
This document provides an overview of the programming tools being developed by VGrADS at Rice University for transparent and efficient grid computing. It discusses the need for abstract views of the grid, workflow scheduling, fault tolerance, platform-independent applications, and performance modeling.
E N D
VGrADS Programming Tools Research: Vision and Overview Ken Kennedy Center for High Performance Software Rice University http://vgrads.rice.edu/site_visit/april_2005/slides/kennedy-tools.pdf
Database Supercomputer Database Supercomputer The VGrADS Vision:National Distributed Problem Solving • Where We Want To Be • Transparent Grid computing • Submit job • Find & schedule resources • Execute efficiently • Where We Are • Low-level hand programming • What Do We Need? • A more abstract view of the Grid • Each developer sees a specialized “virtual grid” • Simplified programming models built on the abstract view • Permit the application developer to focus on the problem
Programming Tools • Focus: Automating critical application-development steps: • Initiating application execution • Optimize and launch application on heterogeneous resources • Scheduling workflow graphs • Heuristics required (problems are NP-complete at best) • Good initial results if accurate predictions of resource performance are available (see EMAN demo) • Constructing performance models • Based on loop-level performance models of the application • Requires benchmarking with (relatively) small data sets, extrapolating to larger cases • Building workflow graphs from high-level scripts • Examples: Python (EMAN), Matlab
Initiating Application Execution • Vision: Transparent to the User with Fault Tolerance • Binaries shipped or preinstalled • Reconfigured where necessary • Data moved to computations • Support for fault tolerance • Support for rescheduling, migration and restart • Research • Separation of concerns between workflow management and virtual grid • Support for fault tolerance and rescheduling/migration • Checkpointing, load projection, performance modeling • Platform-independent application representation
Application Initiation and Management Abstract Workflow Workflow Scheduler Virtual Grid Pegasus Data Discovery Workflow Reduction Resource Discovery Mapping Concrete Workflow DAGMan Virtual Grid Execution System Virtual Grid Resources
Support for Fault Tolerance and Rescheduling • Fault tolerance • Workflows: Use Pegasus mechanisms for recovery • Issue: support for registration of completed steps in vgES • MPI steps: Disk-free checkpointing • See Dongarra talk • Rescheduling • Mechanisms for monitoring and extending virtual grids in vgES • Under development • Issue: Determining whether rescheduling will be profitable • Projection of load on current resources without current application • Models to project performance on current and alternative resources
Platform-Independent Applications • Supporting a Single Application Image for Different Platforms • Translation and optimization tools that map the application onto the hardware in an effective and efficient way • At the high level, resource allocation and scheduling • At the low level, optimization, scheduling, and runtime reoptimization • We are working with the LLVM system (from Illinois) • Produces good code for a variety of hardware platforms • Run-time Reoptimization • The Idea: In response to poor performance, reoptimize • The Vision: To reduce the runtime cost of reoptimization, move analysis and planning to compile time • Example: alternative blocking strategies for a loop nest, with shift triggered by excessive cache misses
Scheduling Workflows • Vision: • Application includes performance models for all workflow nodes • Performance models automatically constructed • Software schedules applications onto Grid in two phases • Virtual grid requirement and acquisition • Model based scheduling on the returned vGrid • Research • Scheduling strategy: Whole workflow • Dramatic makespan reduction • Two-phase scheduling • Exploring trade-offs • Expectation: dramatic reduction in complexity with little loss in performance
Set of resources: 50 rtc nodes at Rice (IA-64) 13 medusa nodes at U of Houston (Opteron) RDV data set Varying scheduling strategy RNP - Random / No PerfModel RAP - Random / Accurate PerfModel HCP - Heuristic / GHz Only PerfModel HAP - Heuristic / Accurate PerfModel EMAN Scheduling:Varying Performance Models
Performance Model Construction • Vision: Automatic performance modeling • From binary for a single resource and execution profiles • Models for all target resources • Research • Uniprocessor modeling • Can be extended to MPI steps • Memory hierarchy behavior • Role of instruction mix • Application-specific models • Scheduling
Object Code Binary Instrumenter Instrumented Code Binary Analyzer BB Counts Communication Volume & Frequency Memory Reuse Distance Control flow graph Loop nesting structure BB instruction mix Post Processing Tool Performance Prediction for Target Architecture Architecture neutral model Scheduler Architecture Description Performance Prediction Overview Dynamic Analysis Execute Static Analysis Post Processing
Building Workflow Graphs • Vision: • Application developer writes in scripting language • Candidates: Python (EMAN), Matlab, S-PLUS/R, LabView • Components represent applications • Software constructs workflow and data movement • Research • Related project: Telescoping languages • Type-based specialization • Type analysis • Plan: Harvest this work for Grid application development • Just getting started