Resource Management of Large-Scale Applications on a Grid

Laukik Chitnis and Sanjay Ranka (with Paul Avery, Jang-uk In and Rick Cavanaugh) • Department of CISE, University of Florida, Gainesville • ranka@cise.ufl.edu • 352 392 6838 • http://www.cise.ufl.edu/~ranka/

Overview • High End Grid Applications and Infrastructure at University of Florida • Resource Management for Grids • Sphinx Middleware for Resource Provisioning • Grid Monitoring for better meta-scheduling • Provisioning Algorithm Research for multi-core and grid environments

The Evolution of High-End Applications (and their system characteristics) • 1980: Mainframe Applications (central mainframes) • 1990: Compute Intensive Applications (large clusters, supercomputers) • 2000: Data Intensive Applications (geographically distributed datasets, high speed storage, gigabit networks)

Some Representative Applications HEP, Medicine, Astronomy, Distributed Data Mining

Representative Application: High Energy Physics 1000+ 20+ countries 1-10 petabytes 1-

Representative Application: Tele-Radiation Therapy RCET Center for Radiation Oncology

Representative Application: Distributed Intrusion Detection • NSF ITR Project: Middleware for Distributed Data Mining (PI: Ranka, joint with Kumar and Grossman) • Layers: Application • Data Mining and Scheduling Services • Data Transport Services • Data Management Services

Grid Infrastructure Florida Lambda Rail and UF

Campus Grid (University of Florida) • NSF Major Research Instrumentation Project (PI: Ranka, Avery et al.) • 20 Gigabit/sec Network • 20+ Terabytes • 2-3 Teraflops • 10 Scientific and Engineering Applications • Gigabit Ethernet based Cluster • Infiniband based Cluster

Grid Services The software part of the infrastructure!

Services offered in a Grid • Security Services • Resource Management Services • Monitoring and Information Services • Data Management Services • Note that all the other services use security services

Resource Management Services • Provide a uniform, standard interface to remote resources including CPU, Storage and Bandwidth • Main component is the remote job manager • Ex: GRAM (Globus Resource Allocation Manager)

Resource Management on a Grid [Figure: a User reaches the sites of The Grid (Site 1, Site 2, Site 3, ... Site n) through GRAM; the local schedulers shown are Condor, LSF, PBS and fork] Narration: note the different local schedulers

Scheduling your Application

Scheduling your Application • An application can be run on a grid site as a job • The modules in grid architecture (such as GRAM) allow uniform access to the grid sites for your job • But… • Most applications can be “parallelized” • And these separate parts of it can be scheduled to run simultaneously on different sites • Thus utilizing the power of the grid

Modeling an Application Workflow • Many workflows can be modeled as a Directed Acyclic Graph • The amount of resource required (in units of time) is known to a degree of certainty • There is a small probability of failure in execution (in a grid environment this could happen when resources are no longer available) [Figure: Directed Acyclic Graph]
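
A minimal sketch of the DAG model described on this slide, under stated assumptions: the task names, runtimes and dependencies are hypothetical, resources are treated as unlimited, and the failure probability is left out. It shows how the earliest finish time of each task, and so the workflow's minimum completion time, follows from the precedence constraints.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: task -> (estimated runtime in time units, parent tasks)
workflow = {
    "fetch":   (2.0, set()),
    "filter":  (3.0, {"fetch"}),
    "analyze": (5.0, {"fetch"}),
    "merge":   (1.0, {"filter", "analyze"}),
}

def earliest_finish_times(workflow):
    """Earliest finish time of every task, assuming unlimited resources."""
    # TopologicalSorter expects a mapping of node -> predecessors.
    graph = {task: parents for task, (_, parents) in workflow.items()}
    finish = {}
    for task in TopologicalSorter(graph).static_order():
        runtime, parents = workflow[task]
        # A task can start only after all of its parents have finished.
        start = max((finish[p] for p in parents), default=0.0)
        finish[task] = start + runtime
    return finish

finish = earliest_finish_times(workflow)
makespan = max(finish.values())  # length of the longest path through the DAG
```

The makespan here equals the length of the longest path, which is why later slides focus on the properties of that path.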

Workflow Resource Provisioning • Executing multiple workflows over distributed and adaptive (faulty) resources while managing policies • Applications: Large, Precedence, Time Constraints, Data Intensive • Policies: Access Control, Priority, Quota, Multiple Ownership • Resources: Multi-core, Heterogeneous, Faulty, Distributed

A Real Life Example from High Energy Physics • Merge two grids into a single multi-VO “Inter-Grid” • How to ensure that • neither VO is harmed? • both VOs actually benefit? • there are answers to questions like: • “With what probability will my job be scheduled and complete before my conference deadline?” • Clear need for a scheduling middleware! [Figure: participating sites: UW, MIT, UI, FNAL, Caltech, UCSD, Rice, UF, BU, UM, UC, BNL, ANL, IU, LBL, OU, UTA, SMU]

Typical scenario [Figure: a VDT Client facing three VDT Servers, with question marks between them]

Typical scenario [Figure: the same VDT Client and three VDT Servers with question marks between them, now captioned "@#^%#%$@#"]

Some Requirements for Effective Grid Scheduling • Information requirements • Past & future dependencies of the application • Persistent storage of workflows • Resource usage estimation • Policies • Expected to vary slowly over time • Global views of job descriptions • Request Tracking and Usage Statistics • State information important • Resource Properties and Status • Expected to vary slowly with time • Grid weather • Latency of measurement important • Replica management • System requirements • Distributed, fault-tolerant scheduling • Customisability • Interoperability with other scheduling systems • Quality of Service

Incorporate Requirements into a Framework • Assume the GriPhyN Virtual Data Toolkit: • Client (request/job submission) • Globus clients • Condor-G/DAGMan • Chimera Virtual Data System • Server (resource gatekeeper) • MonALISA Monitoring Service • Globus services • RLS (Replica Location Service) [Figure: a VDT Client facing three VDT Servers, with question marks between them]

Incorporate Requirements into a Framework • Framework design principles: • Information driven • Flexible client-server model • General, but pragmatic and simple • Avoid adding middleware requirements on grid resources • Assume the Virtual Data Toolkit: • Client (request/job submission) • Clarens Web Service • Globus clients • Condor-G/DAGMan • Chimera Virtual Data System • Server (resource gatekeeper) • MonALISA Monitoring Service • Globus services • RLS (Replica Location Service) [Figure: a VDT Client, a Recommendation Engine (marked "?") and three VDT Servers]

Related Provisioning Software

Innovative Workflow Scheduling Middleware • Modular system • Automated scheduling procedure based on modulated service • Robust and recoverable system • Database infrastructure • Fault-tolerant and recoverable from internal failure • Platform independent interoperable system • XML-based communication protocols • SOAP, XML-RPC • Supports heterogeneous service environment • 60 Java Classes • 24,000 lines of Java code • 50 test scripts, 1500 lines of script code

The Sphinx Workflow Execution Framework [Figure components: VDT Client (Sphinx Client, Chimera Virtual Data System, Condor-G/DAGMan) • Clarens WS Backbone • Sphinx Server (Request Processing, Data Warehouse, Data Management, Information Gathering) • VDT Server Site (Globus Resource, Replica Location Service, MonALISA Monitoring Service)]

Sphinx Workflow Scheduling Server • Functions as the Nerve Centre • Data Warehouse • Policies, Account Information, Grid Weather, Resource Properties and Status, Request Tracking, Workflows, etc. • Control Process • Finite State Machine • Different modules modify jobs, graphs, workflows, etc. and change their state • Flexible • Extensible [Figure: Sphinx Server components: Message Interface, Control Process, Graph Reducer, Job Predictor, Graph Predictor, Job Admission Control, Graph Admission Control, Graph Data Planner, Job Execution Planner, Graph Tracker, Data Warehouse, Data Management, Information Gatherer]

SPHINX Scheduling in Parallel for Heterogeneous Independent NetworXs

Policy Based Scheduling • Sphinx provides “soft” QoS through time dependent, global views of • Submissions (workflows, jobs, allocation, etc.) • Policies • Resources • Uses Linear Programming Methods • Satisfy Constraints • Policies, User-requirements, etc. • Optimize an “objective” function • Estimate probabilities to meet deadlines within policy constraints • Reference: J. In, P. Avery, R. Cavanaugh, and S. Ranka, "Policy Based Scheduling for Simple Quality of Service in Grid Computing", in Proceedings of the 18th IEEE IPDPS, Santa Fe, New Mexico, April, 2004 [Figure: Policy Space with axes Submissions, Resources, Time]
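
As an illustration only, one hypothetical way to write such a linear program is given below. Every symbol is an assumption introduced here and not taken from the cited paper: $x_{js}$ is the fraction of job $j$ placed at site $s$, $c_{js}$ an estimated completion cost, $r_j$ the resource demand of job $j$, $J_u$ the jobs of user $u$, and $q_{us}$ the quota that policy grants user $u$ at site $s$.

```latex
\begin{aligned}
\min_{x}\quad & \sum_{j}\sum_{s} c_{js}\, x_{js}
  && \text{(objective function)} \\
\text{s.t.}\quad & \sum_{s} x_{js} = 1
  && \forall j \quad \text{(every job is fully placed)} \\
& \sum_{j \in J_u} r_j\, x_{js} \le q_{us}
  && \forall u, s \quad \text{(policy quota)} \\
& 0 \le x_{js} \le 1
  && \forall j, s
\end{aligned}
```

In this reading, the policy and user requirements appear as constraints and the choice of $c_{js}$ encodes what is being optimized.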

Ability to tolerate task failures • Significant impact of using feedback information • Reference: Jang-uk In, Sanjay Ranka et al., "SPHINX: A fault-tolerant system for scheduling in dynamic grid environments", in Proceedings of the 19th IEEE IPDPS, Denver, Colorado, April, 2005

Grid Enabled Analysis SC|03

Distributed Services for Grid Enabled Data Analysis • Scheduling Service: Sphinx • Monitoring Service: MonALISA • Replica Location Service: RLS • Virtual Data Service: Chimera • Execution Service: Sphinx/VDT • Data Analysis Client: ROOT • File Service and VDT Resource Service at each of Fermilab, Caltech, Florida and Iowa • Connections labeled Clarens, Globus, GridFTP and MonALISA

Evaluation of Information gathered from grid monitoring systems

Limitation of Existing Monitoring Systems for the Grid • Information aggregated across multiple users is not very useful in effective resource allocation. • An end-to-end parameter such as Average Job Delay - the average queuing delay experienced by a job of a given user at an execution site - is a better estimate for comparing the resource availability and response time for a given user. • It is also not very susceptible to monitoring latencies.
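
A minimal sketch of how the Average Job Delay parameter defined above could be computed from job records. The record layout and the sample values are hypothetical; only the definition (mean queuing delay per user per execution site) comes from the slide.

```python
from collections import defaultdict

# Hypothetical job records: (user, site, submit_time, start_time), in seconds
jobs = [
    ("alice", "site1", 0, 30),
    ("alice", "site1", 10, 70),
    ("alice", "site2", 0, 10),
    ("bob",   "site1", 5, 205),
]

def average_job_delay(jobs):
    """Mean queuing delay (start minus submit) for each (user, site) pair."""
    totals = defaultdict(lambda: [0.0, 0])  # (user, site) -> [sum of delays, count]
    for user, site, submitted, started in jobs:
        entry = totals[(user, site)]
        entry[0] += started - submitted
        entry[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}

delays = average_job_delay(jobs)
```

Keeping the value per user, and not aggregated across users, reflects the slide's point: in this sample data the same site looks very different to alice and to bob.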

Effective DAG Scheduling • The completion time based algorithm here uses the Average Job Delay parameter for scheduling • As seen in the adjoining figure, it outperforms the algorithms tested with other monitored parameters.
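
A hedged sketch of what a completion time based choice could look like. It assumes the estimate for a site is simply the user's Average Job Delay there plus an estimated runtime; the actual algorithm behind the slide may differ, and the site names and numbers are hypothetical.

```python
def pick_site(avg_job_delay, est_runtime):
    """Return (site, estimate) minimizing Average Job Delay + estimated runtime."""
    estimates = {
        site: avg_job_delay[site] + est_runtime[site]
        for site in avg_job_delay
    }
    best = min(estimates, key=estimates.get)
    return best, estimates[best]

# Hypothetical per-site values for one user, in seconds
avg_job_delay = {"site1": 45.0, "site2": 10.0, "site3": 120.0}
est_runtime = {"site1": 100.0, "site2": 150.0, "site3": 15.0}

site, estimate = pick_site(avg_job_delay, est_runtime)
```

With these numbers the site with the shortest queue (site2) is not the one chosen, because the estimate covers the whole path from submission to completion.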

Work in Progress: Modeling Workflow Cost and developing efficient provisioning algorithms 1. Developing an objective measure of completion time that integrates performance and reliability of workflow execution: P(Time to complete ≥ T) ≤ ε 2. Relating this measure to the properties of the longest path of the DAG, based on the mean and uncertainty of the time required for the underlying tasks due to (a) variable time requirements arising from different parameter values and (b) failure caused by changes in the underlying resources, etc. 3. Developing novel scheduling and replication techniques to optimize allocation based on these metrics. [Figure: Directed Acyclic Graph]
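
A minimal sketch of the completion time measure in item 1, under strong assumptions that are not from the slides: task runtimes on the longest path are independent, and their sum is approximated by a normal distribution. The per-task means and standard deviations are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical (mean, standard deviation) of runtime for tasks on the longest path
critical_path = [(2.0, 0.5), (5.0, 1.0), (1.0, 0.2)]

def path_distribution(path):
    """Normal approximation of total path time; variances of independent tasks add."""
    mean = sum(m for m, _ in path)
    std = sqrt(sum(s * s for _, s in path))
    return NormalDist(mean, std)

def deadline_for(path, epsilon):
    """Smallest T such that P(time to complete >= T) <= epsilon."""
    return path_distribution(path).inv_cdf(1.0 - epsilon)

def miss_probability(path, deadline):
    """P(time to complete >= deadline) under the same approximation."""
    return 1.0 - path_distribution(path).cdf(deadline)

t95 = deadline_for(critical_path, 0.05)
```

A scheduler could compare `miss_probability` against a user's deadline, or quote `deadline_for` as a completion bound at a chosen confidence level. Failures and retries would widen the distribution and are not modeled here.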

Work in Progress: Provisioning algorithms for multiple workflows (Yield Management) • Quality of Service guarantees for each workflow • Controlled (a cluster of multi-core processors) versus uncontrolled (grid of multiple clusters owned by multiple units) environment [Figure: Multiple Workflows, Dag 1 to Dag 5, arranged in Level 1 to Level 4]

CHEPREO - Grid Education and Networking • E/O Center in Miami area • Tutorial for Large Scale Application Development

Grid Education • Developing a Grid tutorial as part of CHEPREO • Grid basics • Components of a Grid • Grid Services OGSA … • OSG summer workshop • South Padre Island, Texas, July 11-15, 2005 • http://osg.ivdgl.org/twiki/bin/view/SummerGridWorkshop/ • Lectures and Hands-on sessions • Building and Maintaining a Grid

Acknowledgements • CHEPREO project, NSF • GriPhyN/iVDgL, NSF • Data Mining Middleware, NSF • Intel Corporation

Thank You May the Force be with you!

Additional slides

Effect of latency on Average Job Delay • Latency is simulated in the system by purposely retrieving old values for the parameter while making scheduling decisions • The correlation indices with added latencies are comparable, though lower as expected, to the correlation indices of ‘un-delayed’ Average Job Delay parameter. The amount of correlation is still quite high.
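
A hedged sketch of the latency experiment described above. It assumes the correlation in question is a Pearson correlation between the monitored Average Job Delay and the completion time a job actually experienced, and that latency is modeled by handing the scheduler a value that is a fixed number of samples old. Both series are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def stale_view(series, lag):
    """Values a scheduler would see if monitoring data were `lag` samples old."""
    return [series[max(0, i - lag)] for i in range(len(series))]

# Hypothetical measurements taken at successive scheduling decisions
avg_job_delay = [10, 12, 15, 14, 18, 21, 20, 24]
completion_time = [112, 113, 118, 115, 121, 125, 122, 129]

for lag in (0, 1, 2):
    r = pearson(stale_view(avg_job_delay, lag), completion_time)
    print(f"lag={lag}: correlation={r:.3f}")
```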

SPHINX Scheduling Latency • Average scheduling latency for various numbers of DAGs (20, 40, 80 and 100) with different arrival rates per minute.

Demonstration at Supercomputing Conference: Distributed Data Analysis in a Grid Environment • Virtual data service: Chimera • Graphical user interface for data analysis: ROOT • Grid enabled Web service: Clarens • Grid resource management service: VDT server • Grid enabled execution service: VDT client • Grid resource monitoring system: MonALISA • Grid scheduling service: Sphinx • Replica location service: RLS • The architecture has been implemented and demonstrated in SC03 and SC04, Arizona, USA, 2003.

Scheduling DAGs: Dynamic Critical Path Algorithm The DCP algorithm executes the following steps iteratively: • Compute the earliest possible start time (AEST) and the latest possible start time (ALST) for all tasks on each processor. • Select a task which has the smallest difference between its ALST and AEST and has no unscheduled parent task. If there are tasks with the same differences, select the one with a smaller AEST. • Select a processor which gives the earliest start time for the selected task
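
The three steps above can be sketched as follows. This is a simplified illustration, not the published DCP algorithm: AEST and ALST are computed once from the DAG alone, with no communication costs and no recomputation after each assignment (the step that makes DCP dynamic). The DAG and its runtimes are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical DAG: task -> (runtime, parent tasks)
dag = {"a": (2, []), "b": (3, ["a"]), "c": (1, ["a"]), "d": (2, ["b", "c"])}

def aest_alst(dag):
    """Earliest (AEST) and latest (ALST) start time of each task."""
    order = list(TopologicalSorter({t: p for t, (_, p) in dag.items()}).static_order())
    aest = {}
    for t in order:  # forward pass: start after the last parent finishes
        aest[t] = max((aest[p] + dag[p][0] for p in dag[t][1]), default=0)
    length = max(aest[t] + dag[t][0] for t in dag)  # critical path length
    children = {t: [] for t in dag}
    for t, (_, parents) in dag.items():
        for p in parents:
            children[p].append(t)
    alst = {}
    for t in reversed(order):  # backward pass: finish before any child must start
        alst[t] = min((alst[c] for c in children[t]), default=length) - dag[t][0]
    return aest, alst

def schedule(dag, n_procs):
    """Map each task to (processor, start time) following the three steps."""
    aest, alst = aest_alst(dag)
    proc_free = [0] * n_procs
    finish, placement = {}, {}
    unscheduled = set(dag)
    while unscheduled:
        # Only tasks with no unscheduled parent are candidates.
        ready = [t for t in unscheduled if all(p in finish for p in dag[t][1])]
        # Smallest ALST - AEST first; ties go to the smaller AEST, then the name.
        task = min(ready, key=lambda t: (alst[t] - aest[t], aest[t], t))
        data_ready = max((finish[p] for p in dag[task][1]), default=0)
        # Processor that gives the earliest start time for the selected task.
        proc = min(range(n_procs), key=lambda i: max(proc_free[i], data_ready))
        start = max(proc_free[proc], data_ready)
        finish[task] = start + dag[task][0]
        proc_free[proc] = finish[task]
        placement[task] = (proc, start)
        unscheduled.remove(task)
    return placement

placement = schedule(dag, n_procs=2)
```

Tasks whose ALST equals their AEST have no slack and lie on the critical path, which is why they are picked first.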

Scheduling DAGs: ILP - Novel algorithm to support heterogeneity (work supported by Intel Corporation) • There are two novel features: • Assign multiple independent tasks simultaneously: the cost of an assigned task depends on the processor available, and many tasks commence with a small difference in start time. • Iteratively refine the scheduling, using the cost of the critical path based on the assignment in the previous iteration. [Figure: Directed Acyclic Graph]

Comparison of different algorithms [Figure panels: Number of processors = 30; Number of Tasks = 2000]

Time for Scheduling
