Scheduling From the Perspective of the Application By Francine Berman & Richard Wolski Presenter: Kun-chan Lan
Outline of the talk • Overview • Case study • Application-centric scheduling • AppleS Project • Result • Conclusion
Overview.. • Why scheduling is important in a metacomputing system • Better utilization of resources • Performance efficiency • Application-centric scheduling • Everything is evaluated in terms of its impact on the application
..Overview.. • Metacomputing • Aggregation of distributed and high-performance resources on coordinated networks, to deliver the performance required to address modern scientific problems • Heterogeneity (administrative domain, software/hardware architecture, protocol, etc.) • Contention
Parallel computing vs. Metacomputing • Parallel computing • Performance-oriented aggregation of resources from a single site (a multi-processor machine) • Communicate via dedicated devices such as switches, shared memory, etc. • Homogeneous (hardware/software infrastructure, administrative domain, etc.) • Metacomputing • Performance-oriented aggregation of resources from multiple sites • Communicate via a distributed network • Heterogeneous resources • A software infrastructure is required to coordinate distributed networks into a communication substrate
Scheduling for parallel computing • Multiprocessor nodes generally have uniform capabilities • Usually there is a centralized system scheduler • Processors are dedicated to tasks of a single application -- No contention
Scheduling for Metacomputing • Resources are often managed by separate schedulers which are not coordinated (no single system scheduler) • Data conversion between sites • Overlapping of communication and computation to amortize network communication cost • Separately optimized algorithms for tasks on different machines
Outline of the talk • Overview • Case study • Application-centric scheduling • AppleS • Result • Conclusion
CLEO • A high-energy physics project • Each collision detected by CLEO is called an event • Each event is recorded and passed to a program called “pass2” to compute offline the physical properties of the particles • Records computed by “pass2” are read by another program, which compresses certain frequently accessed fields • One terabyte of data is generated per year
Nile.. • A by-product of CLEO • Each CLEO collaborating institution is a site • Goal • Provide a scalable, fault-tolerant, heterogeneous system of hundreds of commodity workstations, with access to a distributed database in excess of 100 TB • Resources (CLEO data) are spread across the United States and Canada at 24 collaborating institutions • Resources can be accessed and used transparently from anywhere by any member of the CLEO collaboration
..Nile.. • Not specific to CLEO; can be used by any application that is easily parallelizable • Currently implemented in CORBA/Java • Three components • Nile Control System (NCS) • Data Repository • User Interface • Interconnecting networks include ATM, FDDI and Ethernet
..Nile.. • NCS: • Site Manager: • Interface between NCS and clients • Receives job requests • For each job request, creates a Job Manager, stores the job context in the Job Database and places the job in the queue • Stateless
..Nile.. • NCS: • Job DB: • Stores the state of jobs • Resource DB: • Maintains the state of available hardware resources at the local site • Data Location Manager: • Translates the logical data specification in the job profile into a set of corresponding physical data objects, which can be used to determine suitable hosts to run the sub-jobs
..Nile.. • NCS: • Job Manager • Divides a single job into a set of sub-jobs which can be executed in parallel • Monitors the state of the sub-jobs • Collects and assembles the results, and passes them back to the Site Manager • Planner • Produces an execution plan consisting of a list of sub-jobs, each having a host machine and a set of data objects
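The Planner's output (a list of sub-jobs, each with a host and a set of data objects) can be illustrated with a small sketch. This is not Nile's CORBA/Java implementation; the function name `plan_sub_jobs`, the `data_locations` mapping, and the "least-loaded host that holds the data" rule are all assumptions made for illustration.

```python
from collections import defaultdict


def plan_sub_jobs(data_locations):
    """Group data objects into sub-jobs, one sub-job per host.

    data_locations maps a data object name to the hosts that store it
    (a hypothetical result of a Data Location Manager lookup).
    Returns a sorted list of (host, [data objects]) pairs.
    """
    assigned = defaultdict(list)
    for obj in sorted(data_locations):
        hosts = data_locations[obj]
        if not hosts:
            raise ValueError(f"no host holds {obj}")
        # Assumed policy: among hosts holding the object, pick the one with
        # the fewest objects so far; ties go to the alphabetically first host.
        host = min(sorted(hosts), key=lambda h: len(assigned[h]))
        assigned[host].append(obj)
    # Drop hosts that were considered but received no data objects.
    return sorted((host, objs) for host, objs in assigned.items() if objs)


if __name__ == "__main__":
    locations = {"run1": ["a", "b"], "run2": ["a"], "run3": ["a", "b"]}
    for host, objs in plan_sub_jobs(locations):
        print(host, objs)
```

A real planner would also weigh host load and network cost; the sketch only shows the shape of the plan.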
Characteristics of CLEO/NILE • The quantity of data for the problem is so large that no single site can provide all the resources needed • Efficient resource allocation is crucial • Execution sites and network interconnections are heterogeneous • Some resources are shared with other applications, so performance might vary greatly based on contention for resources
CASE 2: 3-D REACT • Predicts the energy level of a reaction using quantum mechanics • Simulates a hydrogen-deuterium reaction • Essentially calculates the solution to a six-dimensional Schrödinger equation, and can be decomposed into three tasks • LHSF (local hyper-spherical surface function) • Log-D (logarithmic derivative propagation): uses the result of LHSF as input • ASY: an asymptotic analysis of the matrices generated during the Log-D calculation
Scheduling 3D-REACT • Distribute 3D-REACT across two computation units • Cray C90 at SDSC • 64-node Intel Paragon at Caltech • The problem is divided into smaller sub-domains of 5-20 surface functions per sub-domain, so LHSF and Log-D can be executed concurrently • First the C90 calculates the LHSF for a given sub-domain, and then the result is passed to the Paragon, which calculates the Log-D portion of that sub-domain • While the Paragon is working on the first sub-domain, the C90 can start calculating the second sub-domain • After all the sub-domains have been processed, ASY determines whether the calculation should stop
Characteristics of 3D-REACT • The algorithm implemented by a task is optimized for the machine to which it is assigned • E.g. the Log-D implementation used on the C90 is different from that used on the Paragon • Computation and communication can be pipelined to amortize communication delays • Data might need to be converted into a different format when transferred between sites • E.g. floating-point data needs to be converted when the C90 sends data to the Paragon • Scheduling is critical for performance • Each of the sub-tasks (LHSF/Log-D/ASY) can be executed on either machine
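The benefit of overlapping LHSF on one machine with Log-D on the other can be seen in a small two-stage pipeline model. The sketch below is only an illustration: the function names, the fixed per-sub-domain transfer delay, and the timings in the example are assumptions, not measurements from 3D-REACT.

```python
def pipelined_makespan(lhsf_times, logd_times, transfer=0.0):
    """Finish time when Log-D on sub-domain i overlaps LHSF on sub-domain i+1."""
    lhsf_done = 0.0  # when the first machine finishes the current LHSF
    logd_done = 0.0  # when the second machine finishes the current Log-D
    for t_lhsf, t_logd in zip(lhsf_times, logd_times):
        lhsf_done += t_lhsf
        # Log-D starts once its input has arrived and the machine is free.
        start = max(logd_done, lhsf_done + transfer)
        logd_done = start + t_logd
    return logd_done


def serial_makespan(lhsf_times, logd_times, transfer=0.0):
    """Finish time when nothing overlaps (every step waits for the previous)."""
    return sum(lhsf_times) + sum(logd_times) + transfer * len(lhsf_times)


if __name__ == "__main__":
    lhsf = [4.0, 4.0, 4.0]  # hypothetical seconds per sub-domain
    logd = [3.0, 3.0, 3.0]
    print(pipelined_makespan(lhsf, logd, transfer=1.0))
    print(serial_makespan(lhsf, logd, transfer=1.0))
```

In this model the sub-domain size controls both the per-stage times and the transfer delay, which is why choosing it is a scheduling decision.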
Outline of the talk • Overview • Case study • Application-centric scheduling • AppleS • Result • Conclusion
Generalization of Application-Centric scheduling • Each application develops a schedule to optimize its own performance, without regard to the performance goals of other applications which share the system • The application-centric schedules of different applications are unrelated • However, there are still some commonalities which underlie application-centric program development
Components of Application-Centric scheduling.. • Performance criteria/metrics • Dynamic system state • Application-specific resource locality • Application performance characteristics • User preferences • Prediction
Performance criteria/metrics • Performance criteria/metrics vary with the application • E.g. to minimize execution time • 3D-REACT: by maximizing speedup over a single-machine implementation • NILE: by distributing analysis of independent events • Some common metrics • Execution time • Speedup • Cost of execution cycles • Different users will attempt to optimize their usage of the same resources for different performance criteria at the same time
Dynamic system state • Mixture of dedicated and non-dedicated resources • Should the application wait until the dedicated resources become available, or • Should it execute with lesser performance on the non-dedicated resources currently available? • Requires dynamic assessment of • Current system state • Resource loads • Short-term but accurate prediction
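The wait-or-run-now question can be phrased as a comparison of two predicted completion times. The sketch below assumes a very simple model: a non-dedicated resource delivers only a fraction (`shared_availability`) of its capacity to the application. The function name and all parameters are hypothetical.

```python
def should_wait(wait_s, dedicated_run_s, shared_run_s, shared_availability):
    """Return True if waiting for the dedicated resource finishes sooner.

    wait_s: predicted wait until the dedicated resource is free
    dedicated_run_s: run time on the dedicated resource
    shared_run_s: run time on the shared resource if it were unloaded
    shared_availability: predicted fraction (0, 1] of the shared resource
        that the application would actually receive
    """
    if not 0.0 < shared_availability <= 1.0:
        raise ValueError("shared_availability must be in (0, 1]")
    time_if_wait = wait_s + dedicated_run_s
    time_if_run_now = shared_run_s / shared_availability
    return time_if_wait < time_if_run_now


if __name__ == "__main__":
    # Hypothetical numbers: 30 s wait + 60 s run vs. 50 s run at 50% availability.
    print(should_wait(30.0, 60.0, 50.0, 0.5))
```

The answer is only as good as the predicted wait and availability, which is why short-term but accurate prediction matters.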
Application-specific Resource Locality [Figure: tasks taskX and taskY, resources X and Y] • Applications seek to use “close” resources • “Closeness” is a function of what the application requires from a resource as well as the resource’s capability • “Distance” of resources: the resource performance deliverable to the application • Are X and Y close?
Application Characteristics • Implementation-dependent and implementation-independent • Some common categories of attributes • Task-specific implementation characteristics • Computation paradigm, number/size of data structures, data communication pattern, memory requirement, etc. • Inter-task communication characteristics • Data format for each task, pipeline size, communication regularity and frequency, etc. • Application structure information • Input/output requirements, iteration pattern, etc.
User Preferences • Not necessarily directly related to application performance • Act as a filter over the possible resources and implementations available to the user
Role of Prediction • Prediction tells you • Potential communication and computation behavior of the application • Potential availability and load of resources • Potential performance of the application with respect to candidate schedules • Sources of prediction • App-specific or app-independent benchmarks • Statistical analysis • Sensed or sampled data • Analytical models
Process of scheduling an application • Use user preferences to filter out infeasible schedules • Use application-specific and dynamic information to develop a schedule • Use the application's own notion of performance and resource locality to evaluate the schedule • Predict the performance of candidate schedules • Compare them and determine the “best schedule” that can be implemented on the available resources
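The steps above can be sketched as a filter followed by a minimization over predicted performance. This is a minimal sketch, not the authors' scheduler: `choose_schedule`, the dictionary layout of a candidate, and the host names and timings in the example are all assumptions.

```python
def choose_schedule(candidates, is_allowed, predict_time):
    """Pick the feasible candidate with the smallest predicted execution time.

    candidates: iterable of schedule descriptions (any objects)
    is_allowed: user-preference filter, returns True for acceptable schedules
    predict_time: performance model, returns predicted seconds for a schedule
    """
    feasible = [s for s in candidates if is_allowed(s)]
    if not feasible:
        raise ValueError("no schedule satisfies the user preferences")
    return min(feasible, key=predict_time)


if __name__ == "__main__":
    # Hypothetical candidates with already-predicted times.
    candidates = [
        {"name": "c90-only", "hosts": ["c90"], "predicted_s": 120.0},
        {"name": "c90+paragon", "hosts": ["c90", "paragon"], "predicted_s": 80.0},
        {"name": "paragon+sp2", "hosts": ["paragon", "sp2"], "predicted_s": 70.0},
    ]
    banned = {"sp2"}  # hypothetical user preference: do not use this host
    best = choose_schedule(
        candidates,
        is_allowed=lambda s: banned.isdisjoint(s["hosts"]),
        predict_time=lambda s: s["predicted_s"],
    )
    print(best["name"])
```

Swapping `predict_time` for a cost or speedup function changes the performance criterion without changing the selection logic.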
Outline of the talk • Overview • Case study • Application-centric scheduling • AppleS • Result • Conclusion
AppleS (Application-level Scheduler) • Each application has its own AppleS agent (a scheduler customized for that application) • What does AppleS do? • Selects resources • Determines a performance-efficient schedule • Implements that schedule with respect to the appropriate resource management system • AppleS is NOT a resource management system: it relies on systems such as Globus and Legion
Components of AppleS • Resource Selector • Chooses and filters different resource combinations • Planner • Generates a description of a resource-dependent schedule from a given resource combination • Performance Estimator • Generates a performance estimate for candidate schedules according to the user’s performance metric • Coordinator • Chooses the “best” schedule • Actuator • Implements the “best” schedule on the target resource management system
Input of AppleS: Information Pool • Network Weather Service • Dynamic information on system state and forecasts of resource load • Heterogeneous Application Template (HAT) • Information on the structure, characteristics and implementation of the application and its tasks • Model • Used for performance estimation, planning and resource selection • User Specification (US) • Information on the user’s criteria for performance, execution constraints, preferences for implementation, etc.
Using AppleS • The user provides information to AppleS via the HAT and US • The Coordinator uses this information to filter out infeasible or likely poor schedules • The Resource Selector identifies promising sets of resources, and prioritizes them based on the logical “distance” between resources • The Planner computes a potential schedule for each viable resource configuration • The Performance Estimator evaluates each schedule in terms of the user’s performance objective • The Coordinator chooses the best schedule and then implements it with the Actuator
Using AppleS, Example: 3D-REACT • Assume implementations of LHSF and Log-D are available for several architectures • HAT: specifies, for each implementation, the computation-to-communication ratios for LHSF and Log-D, the degree of overlap that is possible between the two, etc. • The Resource Selector determines viable pairs of resources • The Planner identifies a set of candidate schedules • The Performance Estimator calculates the transfer unit size between LHSF and Log-D for each candidate schedule • The Coordinator sends the best schedule to the Actuator
Outline of the talk • Overview • Case study • Application-centric scheduling • AppleS • Result • Conclusion
Jacobi2D code.. • A distributed, data-parallel, two-dimensional Jacobi iterative solver • Commonly used to solve the finite-difference approximation to Poisson's equation • Variable coefficients are represented as elements of a two-dimensional grid • At each iteration, the new value of each grid element is defined to be the average of its four nearest neighbors during the previous iteration
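The update rule described above (each element becomes the average of its four nearest neighbors from the previous iteration) can be written as a short sequential sketch. Holding the boundary rows and columns fixed is an assumption made here for illustration; the function name `jacobi_step` is hypothetical, and the distributed version would apply the same rule to each processor's region after exchanging border values.

```python
def jacobi_step(grid):
    """Return a new grid after one Jacobi iteration; the input is not modified.

    Interior elements are replaced by the average of their four nearest
    neighbors from the previous iteration. Boundary elements are copied
    unchanged (an assumed fixed boundary condition).
    """
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = 0.25 * (
                grid[i - 1][j] + grid[i + 1][j] + grid[i][j - 1] + grid[i][j + 1]
            )
    return new


if __name__ == "__main__":
    grid = [[4.0, 4.0, 4.0],
            [0.0, 0.0, 0.0],
            [0.0, 0.0, 0.0]]
    for row in jacobi_step(grid):
        print(row)
```

Because each new value depends only on the previous iteration, all elements can be updated in parallel, which is what makes the grid easy to partition across processors.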
..Jacobi2D code • Typically, the Jacobi computation is parallelized by partitioning the grid into rectangular regions, and then assigning each region to a different processor • Parallelism vs. communication overhead • [Figure: P0 is twice as fast as processor P1 or P2]
[Figure: testbed diagram; labels: RS600, FDDI, Alpha workstation]
Three partition methods • HPF Uniform/Blocked • Each processor is assigned (at compile time) a roughly equal-sized square region of the grid to compute • Non-Uniform Strip • Uses good static estimates of resource performance, and uses resource selection to pick a resource set from the total resources • AppleS
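One simple way to realize a non-uniform strip partition is to give each processor a number of grid rows proportional to its estimated speed. The sketch below is an illustration under that assumption, not the partitioner used in the experiments; `strip_rows` and the largest-remainder rounding are hypothetical choices.

```python
def strip_rows(total_rows, speeds):
    """Split total_rows into strips proportional to relative processor speeds.

    speeds: positive relative speed estimates, one per processor.
    Returns a list of row counts that sums exactly to total_rows.
    """
    if not speeds or any(s <= 0 for s in speeds):
        raise ValueError("speeds must be a non-empty list of positive numbers")
    total_speed = sum(speeds)
    shares = [total_rows * s / total_speed for s in speeds]
    rows = [int(x) for x in shares]  # round every share down first
    leftover = total_rows - sum(rows)
    # Give the remaining rows to the largest fractional remainders.
    order = sorted(range(len(speeds)), key=lambda i: shares[i] - rows[i],
                   reverse=True)
    for i in order[:leftover]:
        rows[i] += 1
    return rows


if __name__ == "__main__":
    # One processor twice as fast as the other two.
    print(strip_rows(1000, [2, 1, 1]))
```

With static speed estimates this gives a fixed partition; a scheduler with access to load forecasts could call the same function with predicted deliverable speeds instead.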
Memory availability • Two IBM SP-2 nodes with 128 MB of memory are added to the resource pool • Dedicated access to the two SP-2 nodes and the link between them • The best partitioning is to split the grid evenly between the two SP-2 nodes, as long as neither partition exceeds the available real memory on its node
A lot of page swapping
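The memory condition can be checked with simple arithmetic before choosing an even split. The sketch below assumes a square n x n grid, strips of whole rows, and a fixed number of bytes per grid element (for example 16 bytes for two double-precision copies, old and new); it ignores the operating system's own memory use. The function name and all numbers are hypothetical.

```python
def even_split_fits(n, parts, bytes_per_element, mem_bytes):
    """Return True if an even row-wise split of an n x n grid fits in memory.

    n: grid dimension (n rows by n columns)
    parts: number of nodes sharing the grid
    bytes_per_element: storage per grid element on a node (assumed)
    mem_bytes: real memory available on each node
    """
    rows_per_part = -(-n // parts)  # ceiling division: the largest strip
    return rows_per_part * n * bytes_per_element <= mem_bytes


if __name__ == "__main__":
    mem = 128 * 2**20  # 128 MB per node
    for n in (2000, 5000):
        print(n, even_split_fits(n, 2, 16, mem))
```

When the check fails, an even split would force paging on the nodes, so a scheduler would spread the excess rows over other machines instead.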
Conclusion • A performance-efficient schedule must exploit the concurrency of independent application tasks as well as factor in the impact of resource contention/diversity/autonomy • AppleS: http://apples.ucsd.edu/, still a work in progress • Related work: MARS: http://www.uni-paderborn.de/pc2/projects/mol/mars.htm • CLEO: http://www.lns.cornell.edu/public/CLEO/ • 3D-REACT: http://www.cacr.caltech.edu/Publications/techpubs/CASA/cacr123/web4.htm