Optimizing Execution of Data Intensive Jobs in Computational and Data Grids. Michael Ernst DESY. International Symposium on Grid Computing – ISGC 2005 Taipei. Overview. Motivation / LHC Computing Problem Grid Scheduling Resource Management Systems Optimizing Data Intensive Job Execution
Optimizing Execution of Data Intensive Jobs in Computational and Data Grids Michael Ernst DESY International Symposium on Grid Computing – ISGC 2005 Taipei
Overview
• Motivation / LHC Computing Problem
• Grid Scheduling
• Resource Management Systems
• Optimizing Data Intensive Job Execution
• Conclusions
ISGC 2005 M.Ernst
p-p collisions at LHC
• Crossing rate: 40 MHz
• Event rates: ~10^9 Hz
• Max LV1 trigger: 100 kHz
• Event size: ~1 Mbyte
• Readout network: 1 Terabit/s
• Filter farm: ~10^7 SI2k
• Trigger levels: 2
• Online rejection: 99.9997% (100 Hz from 50 MHz)
• System dead time: ~ %
• Event selection: ~1/10^13
• Luminosity: low 2x10^33 cm^-2 s^-1, high 10^34 cm^-2 s^-1
[Plot labels: Event rate; Level 1 Trigger; Rate to tape; “Discovery” rate]
LHC Computing Outlook
• HEP trend is to fewer and bigger experiments
  • Multi-Petabytes, GB/s, MSI2k
  • Worldwide collaborations, thousands of physicists, …
• LHC experiments will be extreme cases
  • But CDF, D0, BaBar and Belle are approaching the same scale and tackling the same problems even now
• (Worldwide) hardware computing costs at LHC will be in the region of 50M€ per year
  • Worldwide software development for the Grid in HEP is also in this ballpark
• With so few experiments, so many collaborators, so much money:
  • We have to get this right (enough)…
LHC Data Grid Hierarchy
• CERN/Outside resource ratio ~1:2; Tier0/(Σ Tier1)/(Σ Tier2) ~1:1:1
• Online System (Experiment)
• Tier 0 +1: CERN Center, PBs of disk; tape robot
• Tier 1: FNAL Center, GridKa Center, INFN Center, RAL Center
• Tier 2: Tier2 Centers
• Tier 3: Institutes; physics data cache
• Tier 4: Workstations
• Link labels: ~PByte/sec; ~100-1500 MBytes/sec; ~2.5-10 Gbps; 2.5-10 Gbps; ~2.5-10 Gbps; 0.1 to 10 Gbps
• Tens of Petabytes by 2007-8. An Exabyte ~5-7 years later.
• Emerging Vision: A Richly Structured, Global Dynamic System
Scheduled Computing
• Organized, scheduled simulation and large-scale event reconstruction is a task we understand “well”
  • We can make reasonably accurate estimates of the computing required
  • We can perform simple optimizations to share the work between the large computing centers
Interactive Session
Chaotic Computing
• Data Analysis is a “Feeding Frenzy”
• Data is widely dispersed, may be geographically mismatched to available CPU
• Choosing between data and job movement?
  • How/When will we have the information to motivate those choices?
• Move Job to Data
  • Information required to describe the data requirements can (will) be complex and poorly formulated
  • Difficult for a resource broker to make good scheduling choices
  • Current Resource Brokers are quite primitive
• Move Data to Job
  • Moving only those parts of the data that the user really needs
  • All of some events, or some parts of some events?
  • Very different resource requirements
  • Web-Services / Web-Caching may be the right technologies here
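The choice between moving the job and moving the data can be illustrated with a deliberately simple cost comparison. This is only a sketch under stated assumptions: the `Site` fields, the site names, and the rule "start delay = queue wait + transfer time" are hypothetical and are not taken from the talk. A real broker would need the (complex, poorly formulated) data requirements the slide warns about.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    queue_wait_s: float  # assumed estimate of the wait for a free CPU slot
    has_data: bool       # True if the required data set is already at this site


def transfer_time_s(data_gb, bandwidth_gbps):
    """Seconds needed to move data_gb gigabytes over a bandwidth_gbps link."""
    return data_gb * 8.0 / bandwidth_gbps


def choose_site(sites, data_gb, bandwidth_gbps):
    """Return (site name, estimated delay in seconds) for the quickest start.

    A site holding the data pays only its queue wait ("move job to data").
    Any other site also pays the transfer time ("move data to job").
    Assumption: the transfer does not overlap with the queue wait.
    """
    def delay(site):
        extra = 0.0 if site.has_data else transfer_time_s(data_gb, bandwidth_gbps)
        return site.queue_wait_s + extra

    best = min(sites, key=delay)
    return best.name, delay(best)


sites = [
    Site("T1-with-data", queue_wait_s=7200.0, has_data=True),
    Site("T2-idle", queue_wait_s=60.0, has_data=False),
]
small = choose_site(sites, data_gb=100.0, bandwidth_gbps=1.0)
large = choose_site(sites, data_gb=5000.0, bandwidth_gbps=1.0)
print(small)  # small sample: shipping the data to the idle site wins
print(large)  # large sample: waiting in the queue next to the data wins
```

With these made-up numbers the decision flips purely on the size of the data sample, which is the point of the slide: the broker needs both load and data-size information to choose well.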
(Some) Guiding Principles for LHC Computing
• Access to Data is more of a bottleneck than access to CPU
  • Make multiple distributed copies as early as possible - affordable?
• Experiment needs to be able to enact Priority Policy
• Stream data from Raw onwards
• Initial detailed analysis steps will be run at the T1’s
  • Need access to large data samples
  • But T1’s are already fully utilized with Reconstruction & Reprocessing
• T2’s have (by definition?) more limited Disk/Network than the T1’s
  • Good for final analysis, small (TB) samples
  • Make sure there is rapid access to locally replicate these
  • Perfect for Monte-Carlo Production – easy!
  • Much more demanding is their role in end-user analysis
• User Analysis tasks are equal in magnitude to Production tasks
  • 50% Resources for each
LCG Persistency Framework POOL
• POOL is called by experiment frameworks
  • Production Manager
    • Creates and maintains shared file catalogs and (event) collections
  • End User
    • Uses shared or private collections
• POOL Applications are “grid-aware” via Replica Location Service (RLS)
  • File location and meta data queries are submitted to Grid services
• The POOL storage manager allows access to local and remote files
  • Access via Root I/O (e.g. RFIO/dcap)
  • Eventually to be replaced by Grid File Access Library (GFAL)
[Diagram labels: Exp. DB Services; Book Keeping; Production Workflow; Grid (File) Services; File Description; Replica Location; Remote File I/O; POOL client on a CPU Node; User Application; Experiment Framework; RDBMS Services; Collection Description; POOL; Collection Location?; Collection Access; remote access]
Integration as a Fundamental Challenge
• Many sources of data, services, computation
• Registries organize services of interest to a community
• Resource management is needed to ensure progress & arbitrate competing demands
• Security & policy must underlie access & management decisions
• Data integration activities may require access to, & exploration of, data at many locations
• Exploration & analysis may involve complex, multi-step workflows
[Diagram labels: Discovery; Access; R; RM; Policy service; Security service]
Source: Ian Foster
Grid Scheduling
• Minimizing the Average Weighted response time
• Maximize machine utilization / minimize idle time
• r : submission time of a job
• t : completion time of a job
• w : weight/priority of a job
Current approach:
• Extension of job scheduling for multiple computing facilities
• Resource discovery and load-distribution to a remote resource
• Usually batch job scheduling model on remote machine
But actually required for Grid scheduling is:
• Co-allocation and coordination of different resource allocations for a Grid job
• Instantaneous ad-hoc allocation not always suitable
This complex task involves:
• “Cooperation” between different resource providers
• Interaction with local resource management systems (heterogeneous interfaces and functions)
• Support for reservations and service level agreements
• Orchestration of coordinated resource allocations
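The slide names r, t and w, but the formula they belong to did not survive in this transcript. One common definition of the average weighted response time, assumed here rather than quoted from the talk, is AWRT = Σ_j w_j (t_j − r_j) / Σ_j w_j. A minimal sketch of that definition:

```python
def average_weighted_response_time(jobs):
    """Compute sum(w * (t - r)) / sum(w) over jobs given as (r, t, w) tuples.

    r: submission time, t: completion time, w: weight/priority of the job.
    The formula is an assumed, commonly used definition of this metric.
    """
    jobs = list(jobs)
    total_weight = sum(w for _, _, w in jobs)
    if total_weight == 0:
        raise ValueError("at least one job must have a non-zero weight")
    return sum(w * (t - r) for r, t, w in jobs) / total_weight


# Two jobs: response times 10 and 20, weights 1 and 3.
example_jobs = [(0, 10, 1), (5, 25, 3)]
awrt = average_weighted_response_time(example_jobs)
print(awrt)  # (1*10 + 3*20) / 4
```

Because the heavier job dominates, the result sits closer to its response time than a plain average would.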
Allocation for Grid Job (Example)
Reservations are Necessary!
[Timeline diagram: resources and the activities allocated on them over time]
• Data: Storing Data; Data Access
• Network 1: Data Transfer; Data Transfer
• Computer 1: Loading Data; Parallel Computation; Providing Data
• Network 2: Communication for Computation
• Computer 2: Parallel Computation
• Software License: Software Usage
• Storage: Data Storage
• Network 3: Communication for Visualization
• Graphics W/S: Visualization
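Why reservations matter for co-allocation can be sketched as a search for the earliest time at which all required resources are free at once, for example Computer 1, Computer 2 and Network 2 during the parallel computation. The data layout (busy intervals per resource) and the function names are hypothetical; real systems would negotiate this with each local resource manager.

```python
def is_free(busy, start, end):
    """True if the half-open interval [start, end) overlaps no busy interval."""
    return all(end <= b_start or start >= b_end for b_start, b_end in busy)


def earliest_common_start(reservations, duration, not_before=0):
    """Earliest start at which every resource is free for `duration`.

    reservations maps a resource name to a list of (start, end) busy intervals.
    The earliest feasible start is either `not_before` or the end of some
    existing reservation, so only those candidates are tested.
    """
    candidates = {not_before}
    for busy in reservations.values():
        candidates.update(b_end for _, b_end in busy if b_end > not_before)
    for start in sorted(candidates):
        window_end = start + duration
        if all(is_free(busy, start, window_end) for busy in reservations.values()):
            return start
    return None  # not reached for finite reservation lists


existing = {
    "computer1": [(0, 4)],
    "computer2": [(2, 6)],
    "network2": [(7, 8)],
}
start_long = earliest_common_start(existing, duration=2)
start_short = earliest_common_start(existing, duration=1)
print(start_long, start_short)
```

Each resource on its own has free slots early on, but a common slot for all three only appears later. Without reservations, that common slot cannot be guaranteed.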
Local Scheduling Systems
Observation:
• Local resource management systems (LRMS) exist
  • require extension for Grids by additional software or
  • will directly support Grids in the future
• DRMAA (GGF) is available today for some LRMS
• Different LRMS are part of the Grid and perform a lower-level scheduling
• In addition the Grid requires some higher-level scheduling for coordinating resources required by the user’s jobs.
• Multi-level scheduling model
Scheduling Model
Using a Brokerage/Trading strategy:
• Higher-level scheduling: Submit Grid Job Description; Discover Resources; Query for Allocation Offers; Collect Offers; Select Offers; Coordinate Allocations
• Lower-level scheduling: Analyze Query; Generate Allocation Offer
• Policies: Individual user policies; Observe Community policies; Individual owner policies
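The brokerage/trading strategy can be sketched as a higher-level broker that queries lower-level schedulers for offers, collects them, and selects one according to the user's policy. All class names, fields and numbers below are hypothetical illustrations, and the "Coordinate Allocations" step is not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    cpus: int
    deadline: float  # latest acceptable start time


@dataclass
class Offer:
    provider: str
    start: float
    cost: float


class LocalScheduler:
    """Lower-level scheduler: analyzes a query and may generate an offer."""

    def __init__(self, name, free_cpus, next_start, price_per_cpu):
        self.name = name
        self.free_cpus = free_cpus
        self.next_start = next_start
        self.price_per_cpu = price_per_cpu

    def make_offer(self, request: Request) -> Optional[Offer]:
        # Owner policy (assumed): decline requests larger than free capacity.
        if request.cpus > self.free_cpus:
            return None
        return Offer(self.name, self.next_start, request.cpus * self.price_per_cpu)


def broker(request, schedulers, user_policy):
    """Higher-level scheduler: query, collect and select allocation offers."""
    offers = []
    for scheduler in schedulers:
        offer = scheduler.make_offer(request)
        if offer is not None:
            offers.append(offer)
    admissible = [o for o in offers if o.start <= request.deadline]
    return min(admissible, key=user_policy, default=None)


schedulers = [
    LocalScheduler("A", free_cpus=64, next_start=100.0, price_per_cpu=2.0),
    LocalScheduler("B", free_cpus=16, next_start=10.0, price_per_cpu=1.0),
    LocalScheduler("C", free_cpus=128, next_start=500.0, price_per_cpu=0.5),
]


def cheapest(offer):
    return offer.cost


def earliest(offer):
    return offer.start


tight = broker(Request(cpus=32, deadline=200.0), schedulers, cheapest)
relaxed = broker(Request(cpus=32, deadline=1000.0), schedulers, cheapest)
print(tight.provider, relaxed.provider)
```

Swapping `cheapest` for `earliest` changes the selection without touching the providers, which mirrors the separation of user policies from owner policies in the model.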
Functional View of Grid Data Management
• Application
• Metadata Service: Location based on data attributes
• Planner: Data location, Replica selection, Selection of compute and storage nodes
• Replica Location Service: Location of one or more physical replicas
• Information Services: State of grid resources, performance measurements and predictions
• Security and Policy
• Executor: Initiates data transfers and computations
• Data Movement; Data Access
• Compute Resources; Storage Resources
Source: C. Kesselman
User Requirements
• The infrastructure must accommodate choice of service options (e.g. efficiency, reliability, turn-around etc.)
• Advance reservation of resources
  • Most relevant to HEP: Storage
• Users formulate preferences regarding their jobs and need to understand the resource and VO policies
• Policy information and negotiation mechanisms
  • What is the usage policy of the remote resources?
• Prediction-based information
  • How long will my job run on a particular resource?
  • What amount of resources do I need to complete the job in time?
Resource Management System Approach
• Grid resources are not only the CPUs, but also data collections, databases, files, users, administrators, instruments, jobs/applications ...
• Many metrics for scheduling: throughput, cost, latency, deadline, other time and cost metrics...
• Grid resource management consists of job/resource scheduling, security (authorization services,...), local policies, negotiations, accounting, ...
• Approach: user and resource owner driven negotiation process and thus, complex decision making process
• Web Services Agreement Specification (WS-Agreement)
Resource Management System – An Overview
[Layer diagram labels: Grid Environment; User Access Layer; Application; Resource Management System; Resource Discovery; Broker; Job Manager; Toolkit]
Grid-Services
• Data Management
• Adaptive Components
• Authorization Service
Globus Infrastructure
• MDS
• GRAM
• SRM
• GridFTP
Complexity of Resource Management
• Information Collection
  • application requirements (resource requirements, environment, etc.)
  • user preferences (criteria and weight)
• Selection phase
  • choose the best resources (schedule) based on the information provided and on the resource availability (estimates, predictions)
  • from simple matchmaking to multiple optimization techniques
• Execution phase
  • file staging, execution control, job monitoring, migration, usually re-selection of resources, application adaptation
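The information-collection and selection phases can be sketched as simple matchmaking followed by a weighted ranking. The attribute names, the minimum-value interpretation of requirements, and the "lower score is better" weighting are assumptions made for illustration only.

```python
def select_resource(resources, requirements, weights):
    """Pick a resource by matchmaking, then weighted ranking.

    resources:    list of dicts describing each resource (hypothetical schema)
    requirements: minimum values the application needs, e.g. {"cpus": 40}
    weights:      user preferences over criteria where lower is better
    Returns the best matching resource, or None if nothing matches.
    """
    # Matchmaking: keep only resources that satisfy every requirement.
    matching = [
        r for r in resources
        if all(r[key] >= minimum for key, minimum in requirements.items())
    ]
    if not matching:
        return None

    # Ranking: weighted sum of the user's criteria, smallest score wins.
    def score(resource):
        return sum(weight * resource[key] for key, weight in weights.items())

    return min(matching, key=score)


resources = [
    {"name": "siteA", "memory_gb": 2, "cpus": 100, "est_runtime_h": 4.0, "cost": 10.0},
    {"name": "siteB", "memory_gb": 4, "cpus": 50, "est_runtime_h": 2.0, "cost": 30.0},
    {"name": "siteC", "memory_gb": 1, "cpus": 200, "est_runtime_h": 1.0, "cost": 5.0},
]
requirements = {"memory_gb": 2, "cpus": 40}

cost_sensitive = select_resource(resources, requirements, {"est_runtime_h": 1.0, "cost": 0.2})
time_sensitive = select_resource(resources, requirements, {"est_runtime_h": 5.0, "cost": 0.1})
print(cost_sensitive["name"], time_sensitive["name"])
```

Note that siteC looks best on both criteria but is filtered out by matchmaking because it lacks memory. The remaining choice depends only on the user's weights.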
Relevant Criteria
• Concerning particular resources (e.g. memory, CPU) or schedules (e.g. estimated processing time, maximum latency)
• Specific for end-users (e.g. mean response time, latency, cost of computations), resource owners (e.g. machine (under-) utilization) and administrators (e.g. throughput)
• Time criteria, cost criteria (e.g. weighted resource consumption, cost of computations) and resource utilization criteria (e.g. load balancing, machine idleness)
• However, in practice there is a lack of:
  • Low-level mechanisms (or only a limited set) which support e.g. advance reservation
  • Negotiation protocols and agreements
  • Prediction tools which provide advanced analysis and estimations (e.g. execution time, queue wait time)
  • Reliable information sources which describe behaviors of applications and resources, etc.
Job Scheduling using a Central Broker
• One central scheduler exists for multiple applications
• The goal is to match a set of application requirements to all available resources on the basis of various criteria
Application Level Scheduling
• Each application is scheduled by an internal scheduler and forms a self-contained system
• The goal is to match particular application requirements with one (or some) good resource(s) based on various criteria
• Multicriteria techniques (GridLab) are used for the evaluation of resources as well as resource co-allocations (internal mechanisms)
Activities
• Core service infrastructure
  • OGSI/WSRF
  • OGSA
• GGF hosts several groups in the area of Grid scheduling and resource management. Examples:
  • WG Scheduling Attributes (finished)
  • WG Distributed Resource Management Application API (active)
  • WG Grid Resource Allocation Agreement Protocol (active)
  • WG Grid Economic Services Architecture (active)
  • RG Grid Scheduling Architecture (active)
Optimizing Data Intensive Jobs
• Application
• Metadata Service: Location based on data attributes
• Planner: Data location, Replica selection, Selection of compute and storage nodes
• Replica Location Service: Location of one or more physical replicas
• Information Services: State of grid resources, performance measurements and predictions
• Security and Policy
• Executor: Initiates data transfers and computations
• Data Movement; Data Access
• Compute Resources; Storage Resources
Source: C. Kesselman
Make Storage a seamlessly integrated Resource
• Access to information about the location of data sets
• Information about access and transfer costs
• Scheduling of data transfers and data availability
  • optimize data transfers w.r.t. storage space, data access costs etc.
• Perfectly fits into general grid scheduling:
  • access to similar services
  • interaction necessary
[Architecture diagram; labels as shown]
• User side: Grid User/Application; Interactive Data Analysis Tools; Job + Result; Steering Toolbox
• Services: Application Specific Scheduling Service; Job Success Optimizer Service; Information Service (static & run-time collected, scheduled / forecasted); Scheduling Service; Data Management Service; Job Supervisor Service; Network Management Service; Accounting and Billing Service
• Management systems: Management System; Compute Management System; Network Management System
• Managers and resources: Data Manager, Data-Resources; Compute Manager, Compute-Resources; Network Manager, Network-Resources
• Interactions: Query run-time collected information; Query for resources; Reservation; Resources; Execution success; Data Resource Usage; Storage Usage; Network Profiling; Maintain information
Extensions to SE Information Providers
• A “Cost Prediction Module” calculates the load development of the local Storage Element (SE) and makes the information available using an agreed-upon schema (i.e. GLUE; additional dynamic records required)
• Using this information, the local SE can predict the time required to make requested data sets available, and publish it to the Grid Scheduler
• The local SE can calculate the point in time at which a collection of data sets can optimally be made available
• This approach allows jobs to be scheduled to the Compute Element (CE) only when the associated SE has flagged all required data sets as online
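The interplay between prediction and job release can be sketched with a toy Storage Element model. This is not dCache's or GLUE's interface: the class, its fields and the linear staging model (queued plus missing data divided by a fixed staging rate) are hypothetical simplifications.

```python
class StorageElement:
    """Toy SE: tracks which data sets are online and predicts staging time."""

    def __init__(self, online, stage_rate_gb_per_s, queued_gb=0.0):
        self.online = set(online)              # data sets already in the local cache
        self.stage_rate = stage_rate_gb_per_s  # assumed staging rate from tertiary storage
        self.queued_gb = queued_gb             # data already waiting to be staged (current load)

    def predicted_ready_in_s(self, datasets):
        """Predicted seconds until all data sets ({name: size in GB}) are online."""
        missing_gb = sum(size for name, size in datasets.items()
                         if name not in self.online)
        if missing_gb == 0:
            return 0.0
        return (self.queued_gb + missing_gb) / self.stage_rate

    def all_online(self, datasets):
        return all(name in self.online for name in datasets)


def release_to_ce(datasets, se):
    """Release the job to the Compute Element only when all inputs are online."""
    return se.all_online(datasets)


se = StorageElement(online={"run1"}, stage_rate_gb_per_s=0.5, queued_gb=100.0)
job_inputs = {"run1": 50.0, "run2": 200.0}

wait_before = se.predicted_ready_in_s(job_inputs)
released_before = release_to_ce(job_inputs, se)

se.online.add("run2")  # staging of run2 has completed
wait_after = se.predicted_ready_in_s(job_inputs)
released_after = release_to_ce(job_inputs, se)
print(wait_before, released_before, wait_after, released_after)
```

The published prediction lets the Grid scheduler plan ahead, while the online flag keeps the job from occupying a CPU slot while it waits for tape staging.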
Improved Scheduling
[Diagram; RED : improved modules]
• Grid Scheduler: Incoming Jobs; Jobs Scheduled; Query current load; Query load prediction; Query “optimized submission time”; “make ready”
• Local Site: Local Scheduler; Compute Element; Load / Cost Prediction Module; Storage Element; Load Info; Local Cache (dCache); Tertiary Storage System
Development in Progress
• The local Storage Element (e.g. dCache) provides information about its load to allow a “Prediction Module” to make relevant predictions concerning future load development
• We need to agree on a protocol/API to make the “Prediction Module” independent of a particular SE implementation
• A Prediction Module is under development that publishes short-term predictions about the SE’s future load development
• A local scheduler offers long-term guarantees concerning latencies to access data sets at a given load profile
• The global scheduler must be able to incorporate information provided by the local schedulers and the cost prediction modules to compute the optimal location to run the job
Conclusion
• Resource Management and Scheduling is key to success in Next Generation Grids
  • Impossible to handle this manually in large Grids
  • Nor is the orchestration of resources a provider task
• System integration is complex but vital
  • Local systems must be enabled to interact with Grid building blocks
  • Information Systems – Grid resources to expose services for negotiation
• Basic research is still required in this area.
  • No ready-to-implement solution is available.
  • New concepts need to be developed
• Current efforts provide the basic Grid infrastructure! Higher-level services such as Grid scheduling and co-allocation are subject to research
  • RMS systems need to provide extensible negotiation interfaces
  • Grid scheduling must include coordination of different resources
The Goal is the Physics, not the Computing…
• Motivation: at L0 = 10^33 cm^-2 s^-1,
  • 1 fill (6 hrs) ~ 13 pb^-1
  • 1 day ~ 30 pb^-1
  • 1 month ~ 1 fb^-1
  • 1 year ~ 10 fb^-1
• Most of Standard-Model Higgs can be probed within a few months
  • Ditto for SUSY
• Turn-on for detector + computing and software will be crucial
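As a cross-check of the scale of these numbers, the integrated luminosity at a constant 10^33 cm^-2 s^-1 can be computed directly, using 1 pb^-1 = 10^36 cm^-2. The slide's figures are lower than this constant-peak estimate. A plausible reading, which is an assumption and not stated in the talk, is that they include luminosity decay during a fill and machine availability.

```python
CM2_PER_INV_PB = 1e36  # 1 pb^-1 = 10^36 cm^-2, since 1 pb = 10^-36 cm^2


def integrated_luminosity_inv_pb(lumi_cm2_s, seconds):
    """Integrated luminosity in pb^-1 for a constant instantaneous luminosity."""
    return lumi_cm2_s * seconds / CM2_PER_INV_PB


peak = 1e33  # cm^-2 s^-1, the value quoted on the slide
fill_at_peak = integrated_luminosity_inv_pb(peak, 6 * 3600)  # one 6-hour fill
day_at_peak = integrated_luminosity_inv_pb(peak, 24 * 3600)  # one full day

# Fraction of the constant-peak value implied by the slide's "~13 pb^-1 per fill".
implied_fill_fraction = 13.0 / fill_at_peak

print(round(fill_at_peak, 1), round(day_at_peak, 1), round(implied_fill_fraction, 2))
```

A 6-hour fill at constant peak luminosity would give about 21.6 pb^-1, so the quoted ~13 pb^-1 corresponds to roughly 60% of that upper bound.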