

  1. Co-Scheduling CPU and Storage using Condor and SRMs
  Presenter: Arie Shoshani
  Alex Romosan, Derek Wright, Ekow Otoo, Doron Rotem, Arie Shoshani (Guidance: Doug Olson)
  Lawrence Berkeley National Laboratory

  2. Problem: Running jobs on the Grid
  • Grid architecture needs to include components for dynamic reservation & scheduling of:
    • Compute resources – Condor (startd)
    • Storage resources – Storage Resource Managers (SRMs)
    • Network resources – Quality of Service in routers
  • Also need to coordinate:
    • The co-scheduling of resources
      • Compute and storage resources only
    • The execution of the co-scheduled resources
      • Need to get DATA (files) into the execution nodes
      • Start the jobs running on nodes that have the right data on them
    • Recover from failures
    • Balance use of nodes
    • Overall optimization – replicate “hot” files

  3. General Analysis Scenario
  [Diagram: at the client’s site, a logical query is resolved against a metadata catalog into a set of logical files; request planning (Request Interpreter) consults the replica catalog and the Network Weather Service to produce an execution plan and site-specific files; the Request Executer issues requests for data placement and remote computation (an execution DAG) to the Compute Resource Managers and Storage Resource Managers at Sites 1..N, each with disk caches, compute engines, and MSS; result files flow back to the client over the network.]

  4. Simpler problem: run jobs on multi-node uniform clusters
  • Optimize parallel analysis jobs on the cluster
  • Jobs are partitioned into tasks: Job_i: [C_i, {F_ij}, O_i] → {C_i, F_ij, O_ij} (sketched below)
  • Currently using LSF
  • Currently files are NFS mounted – bottleneck
  • Want to run tasks independently on each node
  • Want to send tasks to where the files are
  • Very important problem for HENP applications
  [Diagram: a master node and worker nodes in a cluster, backed by HPSS.]
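To make the decomposition Job_i: [C_i, {F_ij}, O_i] → {C_i, F_ij, O_ij} concrete, here is a minimal Python sketch of the one-task-per-file split. The Job/Task classes and field names are illustrative stand-ins, not the actual JDD data structures.

```python
from dataclasses import dataclass

@dataclass
class Job:
    """A job: one executable (code) applied to a set of input files."""
    code: str     # C_i   - analysis executable / command
    files: list   # {F_ij} - logical input files
    output: str   # O_i   - output name prefix

@dataclass
class Task:
    """One independently schedulable task: the same code on a single file."""
    code: str     # C_i
    file: str     # F_ij
    output: str   # O_ij

def partition(job: Job) -> list:
    """Split a job into one task per input file: Job_i -> {C_i, F_ij, O_ij}."""
    return [Task(job.code, f, f"{job.output}.{k}") for k, f in enumerate(job.files)]

# Example: a 3-file analysis job becomes 3 tasks that can run on different nodes.
tasks = partition(Job("ana.exe", ["run1.root", "run2.root", "run3.root"], "hist"))
```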

  5. SRM is a Service
  • SRM functionality:
    • Manage space
      • Negotiate and assign space to users
      • Manage “lifetime” of spaces
    • Manage files on behalf of a user (illustrated below)
      • Pin files in storage till they are released
      • Manage “lifetime” of files
      • Manage action when pins expire (depends on file types)
    • Manage file sharing
      • Policies on what should reside on a storage resource at any one time
      • Policies on what to evict when space is needed
      • Get files from remote locations when necessary
  • Purpose: to simplify the client’s task
    • Manage multi-file requests
    • A brokering function: queue file requests, pre-stage when possible
    • Provide grid access to/from mass storage systems
      • HPSS (LBNL, ORNL, BNL), Enstore (Fermi), JasMINE (Jlab), Castor (CERN), MSS (NCAR), …
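The pinning, lifetime, and eviction behavior listed above can be pictured with a small toy cache model. This is a sketch of the concepts only; the class and method names below (e.g. PinnedCache) are invented for illustration and are not the SRM implementation or its interface.

```python
import time

class PinnedCache:
    """Toy model of SRM-style space/lifetime management: files stay on disk
    while pinned or until their lifetime expires; unpinned, expired files
    are candidates for eviction when space is needed."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.files = {}   # name -> {"size": ..., "pins": ..., "expires": ...}

    def pin(self, name, size, lifetime_s=3600):
        """Pin a file; extend its lifetime if it is already cached."""
        entry = self.files.setdefault(name, {"size": size, "pins": 0, "expires": 0})
        entry["pins"] += 1
        entry["expires"] = max(entry["expires"], time.time() + lifetime_s)

    def release(self, name):
        """Drop one pin; the file becomes evictable once unpinned and expired."""
        if name in self.files and self.files[name]["pins"] > 0:
            self.files[name]["pins"] -= 1

    def evict_for(self, needed_bytes):
        """Evict unpinned, expired files (oldest lifetime first) until the
        requested space fits; return True if the space is now available."""
        used = sum(e["size"] for e in self.files.values())
        now = time.time()
        for name, e in sorted(self.files.items(), key=lambda kv: kv[1]["expires"]):
            if used + needed_bytes <= self.capacity:
                break
            if e["pins"] == 0 and e["expires"] < now:
                used -= e["size"]
                del self.files[name]
        return used + needed_bytes <= self.capacity
```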

  6. Types of SRMs
  • Types of storage resource managers:
    • Disk Resource Manager (DRM)
      • Manages one or more disk resources
    • Tape Resource Manager (TRM)
      • Manages access to a tertiary storage system (e.g. HPSS)
    • Hierarchical Resource Manager (HRM = TRM + DRM)
      • An SRM that stages files from tertiary storage into its disk cache
  • SRMs and file transfers:
    • SRMs DO NOT perform file transfer
    • SRMs DO invoke a file transfer service if needed (GridFTP, FTP, HTTP, …)
    • SRMs DO monitor transfers and recover from failures
      • TRM: from/to MSS
      • DRM: from/to network

  7. Uniformity of Interface / Compatibility of SRMs
  [Diagram: users/applications and clients go through Grid middleware to a uniform SRM interface; behind the SRMs sit different back-end storage systems – disk caches, Enstore, dCache, JASMine, CASTOR.]

  8. SRMs used in STAR for robust multi-file replication
  [Diagram: a command-line HRM-Client (runnable anywhere) gets a list of files and issues an HRM-COPY request covering thousands of files; the HRM at BNL stages files from tape and performs reads, while the HRM at LBNL performs writes and archives files; files move one at a time via SRM-GET using GridFTP GET (pull mode) between the two disk caches over the network; the system recovers from file transfer, staging, and archiving failures.]

  9. File movement functionality: srmGet, srmPut, srmReplicate
  [Diagram: with srmGet/srmPut, the client drives the transfer – Client-FTP-get (pull) or Client-FTP-put (push) between the client and an SRM, with the SRM issuing FTP-get against an SRM or non-SRM site; with srmReplicate, the SRM drives the transfer itself – SRM-FTP-get (pull) or SRM-FTP-put (push) against the other (SRM or non-SRM) site.]

  10. SRM Methods (a typical call sequence is sketched below)
  • Space management: srmReserveSpace, srmReleaseSpace, srmUpdateSpace, srmCompactSpace, srmGetCurrentSpace
  • FileType management: srmChangeFileType
  • Status/metadata: srmGetRequestStatus, srmGetFileStatus, srmGetRequestSummary, srmGetRequestID, srmGetFilesMetaData, srmGetSpaceMetaData
  • File movement: srm(Prepare)Get, srm(Prepare)Put, srmReplicate
  • Lifetime management: srmReleaseFiles, srmPutDone, srmExtendFileLifeTime
  • Terminate/resume: srmAbortRequest, srmAbortFile, srmSuspendRequest, srmResumeRequest
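A typical multi-file retrieval with these methods follows a prepare/poll/release pattern. The sketch below assumes a hypothetical `srm` client wrapper whose methods are named after the operations on this slide; the real operations are web-service calls with richer argument and status structures than the simplified dictionaries used here.

```python
import time

def fetch_files(srm, surls, poll_s=10):
    """Hedged sketch of the srm(Prepare)Get / status-poll / srmReleaseFiles cycle.

    `srm` is assumed to be a client object exposing the SRM operations listed
    on this slide; argument and return shapes are simplified for illustration.
    """
    req = srm.srmPrepareGet(surls)               # queue the multi-file request
    while True:
        status = srm.srmGetRequestStatus(req)    # per-request status summary
        if all(f["state"] in ("Ready", "Failed") for f in status["files"]):
            break
        time.sleep(poll_s)                       # files are staged asynchronously
    ready = [f["turl"] for f in status["files"] if f["state"] == "Ready"]
    # ... transfer `ready` with GridFTP/HTTP, then let the SRM reclaim the pins:
    srm.srmReleaseFiles(req, surls)
    return ready
```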

  11. Simpler problem: run jobs on multi-node uniform clusters
  • Optimize parallel analysis on the cluster:
    • Minimize movement of files between cluster nodes
    • Use nodes in the cluster as evenly as possible
    • Automatic replication of “hot” files
    • Automatic management of disk space
    • Automatic removal of cold files (automatic garbage collection)
  • Use:
    • DRMs for disk management on each node
      • Space & content (files)
    • HRM for access from HPSS
    • Condor for job scheduling on each node
      • startd to run jobs and monitor progress
    • Condor for matchmaking of slots and files

  12. Architecture
  [Diagram: the JDD (Job Decomposition Daemon) and FSD (File Scheduling Daemon) sit alongside the Condor schedd, Collector, and Negotiator; each worker node runs a startd paired with a DRM; an HRM provides access to HPSS.]

  13. Detailed actions (JDD)
  • JDD partitions jobs into tasks:
    • Job_i: [C_i, {F_ij}, O_i] → {C_i, F_ij, O_ij}
  • JDD constructs 2 files:
    • S(j) – set of tasks (jobs in Condor-speak)
    • S(f) – set of files requested
    • (Also keeps reference counts to files)
  • JDD probes all DRMs:
    • For files they have
    • For missing files it can schedule requests to HRM
  • JDD schedules all missing files:
    • Simple algorithm: schedule round-robin to nodes (sketched below)
    • Simply send the request to each DRM
    • The DRM removes files if needed and gets the file from HRM
  • JDD sends each startd the list of files it needs
    • startd checks with its DRM which of the needed files it has, and constructs a class-ad that lists only relevant files
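A minimal sketch of the JDD file-scheduling step described above, assuming simple dictionary stand-ins for what the DRM probes return; the real JDD talks to the DRMs and the HRM through their own interfaces, and the names below are illustrative.

```python
from itertools import cycle

def schedule_missing_files(requested_files, drm_contents, nodes):
    """Round-robin placement of files that no DRM currently holds.

    requested_files : set of logical file names, i.e. S(f)
    drm_contents    : dict node -> set of files its DRM reports (the JDD probe)
    nodes           : list of worker nodes, cycled over for placement
    Returns a dict node -> list of files that node's DRM should fetch from HRM.
    """
    present = set().union(*drm_contents.values()) if drm_contents else set()
    missing = sorted(requested_files - present)
    plan = {n: [] for n in nodes}
    for node, f in zip(cycle(nodes), missing):
        plan[node].append(f)   # that node's DRM evicts if needed and pulls from HRM
    return plan

# Example probe result: run3.root and run4.root are on no node and get spread out.
plan = schedule_missing_files(
    {"run1.root", "run2.root", "run3.root", "run4.root"},
    {"node1": {"run1.root"}, "node2": {"run2.root"}},
    ["node1", "node2"],
)
```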

  14. Detailed actions (FSD)
  • FSD queues all tasks with Condor
  • FSD checks with Condor periodically on the status of tasks (sketched below)
    • If a task is stuck, it may choose to replicate the file (this is where a smart algorithm is needed)
    • File replication can be made from a neighboring node or from HRM
  • When startd runs a task, it requests the DRM to pin the file, runs the task, and releases the file
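The FSD cycle can be sketched as follows; `condor_status_of`, `replicate`, and the thresholds are placeholders for the real Condor status queries and DRM/HRM replication calls, not existing APIs.

```python
import time

def fsd_loop(tasks, condor_status_of, replicate, stuck_after_s=600, poll_s=60):
    """Illustrative FSD cycle: watch queued tasks, replicate files of stuck ones.

    condor_status_of(task) -> ("idle" | "running" | "done", seconds_in_state)
    replicate(file)        -> copy the file to another node (neighbor DRM or HRM)
    """
    pending = list(tasks)
    while pending:
        still_pending = []
        for task in pending:
            state, age = condor_status_of(task)
            if state == "done":
                continue                      # task finished; drop it
            if state == "idle" and age > stuck_after_s:
                # No node has matched this task for a while: make its file
                # available elsewhere so the matchmaker can place it
                # (this is the slot for the "smart algorithm").
                replicate(task.file)
            still_pending.append(task)
        pending = still_pending
        if pending:
            time.sleep(poll_s)
```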

  15. Architecture – JDD generates the list of missing files
  [Diagram: same architecture as slide 12 (Collector, Negotiator, schedd, JDD, FSD; a startd and DRM on each worker node; HRM in front of HPSS), highlighting the step in which the JDD generates the list of missing files.]

  16. Need to develop
  • Mechanism for startd to communicate with the DRM
    • Recently added to startd
  • Mechanism to check the status of tasks
  • Mechanism to check that a task is finished, and notify the JDD
  • Mechanism to check that the job is done, and notify the client
  • Develop the JDD
  • Develop the FSD

  17. Open Questions (1)
  • What if a file was removed by a DRM?
    • In this case, if the DRM does not find the file on its disk, then the task gets rescheduled
    • Note: usually, only “cold” files are removed
    • Should DRMs notify the JDD when they remove a file?
  • How do you deal with output and merging of outputs?
    • Need DRMs to be able to schedule durable space
    • Moving files out of the compute node is the responsibility of the user (code)
    • Maybe moving files to their final destination should be a service of this system

  18. Open Questions (2)
  • Is it best to process as many files on a single system as possible?
    • E.g., one system has 4 files, but the files are also on 4 different systems. Which is better?
    • Conjecture: if the overhead for splitting a job is small, then splitting is optimized by matchmaking
  • What if file bundles are needed?
    • A file bundle is a set of files that are needed together
    • Need a more sophisticated class-ad
    • How to replicate bundles?

  19. Detailed activities
  • Development work:
    • Design of the JDD and FSD modules
    • Development of software components
  • Use of a real experimental cluster (8 + 1 nodes)
    • Install Condor and SRMs
  • Development of an optimization algorithm
    • Represented as a bipartite graph
    • Using network flow analysis techniques

  20. Optimizing file replication on the cluster (D. Rotem)*
  • Jobs can be assigned to servers subject to the following constraints (checked in the sketch below):
    1. Availability of computation slots on the server; usually these correspond to CPUs
    2. File(s) needed by the job must be resident on the server’s disk
    3. Sufficient disk space for storing the job output
    4. Sufficient RAM
  • Goal: maximize the number of jobs assigned to servers while minimizing file replication costs
  * Article in preparation
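A job-to-server assignment is feasible only when all four constraints hold; a small check such as the following captures them (the `job` and `server` field names are invented for illustration, not part of the system).

```python
def can_assign(job, server):
    """Check the four assignment constraints from this slide.
    `job` and `server` are simple objects with the illustrative fields used below."""
    return (
        server.free_slots >= 1                                  # 1. a computation slot (CPU)
        and all(f in server.resident_files for f in job.files)  # 2. input files on the server disk
        and server.free_disk >= job.output_size                 # 3. space for the job output
        and server.free_ram >= job.ram_needed                   # 4. RAM
    )
```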

  21. Bipartite graph showing files and servers
  • An arc between an f-node and an s-node exists if the file is stored on that server
  • The number in the f-node represents the number of jobs that want to process that file
  • The number in the s-node represents the number of available slots on that server

  22. File replication converted to a network flow problem
  1) The total maximum number of jobs that can be assigned to the servers corresponds to the maximum flow in this network.
  2) By the well-known max-flow min-cut theorem, this is also equal to the capacity of a minimum cut (shown in bold edges), where a cut is a set of edges whose removal disconnects the source from the sink.
  • In the example on the slide, the max flow is 11 and the minimum cut is shown in bold
  • The capacity on each file-to-server arc is the MIN of the numbers on its two endpoint nodes
  (The construction is reproduced in the sketch below.)
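The construction on this slide can be reproduced with a standard max-flow routine. The sketch below uses the networkx library and made-up file/server numbers (both are assumptions of this illustration, not part of the talk); it builds the source → files → servers → sink network and checks that the max flow equals the min-cut capacity.

```python
import networkx as nx

# Bipartite instance: jobs-per-file (f-node numbers), free slots (s-node numbers),
# and which server already holds which file.  The data here is invented.
jobs_per_file = {"f1": 5, "f2": 4, "f3": 6}
slots = {"s1": 4, "s2": 3, "s3": 6}
placed = [("f1", "s1"), ("f2", "s2"), ("f3", "s3"), ("f1", "s3")]

G = nx.DiGraph()
for f, j in jobs_per_file.items():
    G.add_edge("source", f, capacity=j)      # at most j jobs want file f
for s, k in slots.items():
    G.add_edge(s, "sink", capacity=k)        # at most k slots on server s
for f, s in placed:
    # capacity = MIN of the two endpoint nodes, as on the slide
    G.add_edge(f, s, capacity=min(jobs_per_file[f], slots[s]))

flow_value, _ = nx.maximum_flow(G, "source", "sink")
cut_value, (reachable, non_reachable) = nx.minimum_cut(G, "source", "sink")
assert flow_value == cut_value               # max-flow min-cut theorem
print(flow_value)                            # maximum number of assignable jobs
```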

  23. Improving flow by adding an edge
  • Maximum flow improved to 13; the additional edge represents a file replication
  • Problem: find a subset of edges of total minimum cost that maximizes the flow between the source and the sink

  24. Solution
  • Problem: finding a set of edges of minimum cost to maximize flow (MaxFlowFixedCost)
  • The problem is (strongly) NP-complete
  • We use an approximation algorithm that finds a suboptimal solution in polynomial time, called Continuous Maximum Flow Improvement (C-MaxFlowImp), using linear programming techniques
  • We can show that the solution is bounded relative to the optimal
  • This will be implemented as part of the FSD (a simple greedy illustration follows below)
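The C-MaxFlowImp algorithm itself is LP-based and is not reproduced here. Purely to illustrate the shape of the problem, the sketch below greedily adds the candidate replication edge with the best flow-gain-per-cost ratio until a replication budget is spent; this greedy heuristic is an assumption of this write-up, not the authors' method, and it reuses the flow network from the previous sketch.

```python
import networkx as nx

def greedy_replication(G, candidates, budget):
    """Greedy illustration of adding replication edges to increase max flow.

    G          : flow network with "source"/"sink" nodes and edge capacities
    candidates : list of (file_node, server_node, capacity, cost) possible replicas
    budget     : total replication cost allowed
    """
    base, _ = nx.maximum_flow(G, "source", "sink")
    chosen = []
    while budget > 0:
        best = None
        for f, s, cap, cost in candidates:
            if cost > budget or G.has_edge(f, s):
                continue
            G.add_edge(f, s, capacity=cap)             # tentatively replicate f to s
            gain, _ = nx.maximum_flow(G, "source", "sink")
            G.remove_edge(f, s)
            ratio = (gain - base) / cost
            if ratio > 0 and (best is None or ratio > best[0]):
                best = (ratio, f, s, cap, cost, gain)
        if best is None:
            break                                      # no candidate improves the flow
        _, f, s, cap, cost, gain = best
        G.add_edge(f, s, capacity=cap)                 # commit the best replica
        chosen.append((f, s))
        base, budget = gain, budget - cost
    return chosen, base
```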

  25. Conclusions
  • Combining compute and file resources in class-ads is a useful concept
    • Can take advantage of the matchmaker
  • Using DRMs to manage space and the content of space provides:
    • Information for class-ads
    • Automatic garbage collection
    • Automatic staging of missing files from HPSS through HRM
  • Minimizing the number of files in class-ads is the key to efficiency
    • Get only needed files from the DRM
  • Optimization can be done externally to Condor by file replication algorithms
    • The network flow analogy provides a good theoretical foundation
  • Interaction between Condor and SRMs is through existing APIs
    • Small enhancements were needed in startd and DRMs
  • We believe the results can be extended to the Grid, but the cost of replication will vary greatly – the algorithms need to be extended
