An overview of grid middleware and gLite
Outline
• An overview of grid middleware
• Introduction to gLite
• Job management services of gLite
Overview of EGEE

A Grid (diagram label: INTERNET)
• Many machines
• Spread across many locations and administrative domains
• Grid middleware runs on each machine
• High-performance computing
• High-capacity storage
• Meets the needs of scientific computing
• The grid trusts Virtual Organisations (VOs)
• Users join VOs
• A virtual organisation contributes resources and negotiates access
• Additional services also enable the grid
  • Operation
  • Dissemination
A Virtual Organisation is an entity that corresponds to an organisation or a group of people that wants to share computing, data, or software resources.

GRID SERVICES
Authentication, Authorisation (AA)
• Users in many locations and organisations
• Access services (“user interface”): log on, upload credentials, run middleware commands
• Built on the Grid Security Infrastructure: encryption and data integrity, authentication and authorisation
• “Gate keeping”: authenticate users and grant permissions
• Resources in many locations and organisations
System software
• Operating system
• File system: NFS, …
• Local scheduler: PBS, Condor, LSF, …
Hardware
• Computing clusters, …
• Network resources
• Data storage: HPSS, CASTOR, …

Basic job management
How do I run a job on a compute element (CE)? (CE = batch queue)
Tools for users to:
• Submit jobs to a CE
• Monitor jobs
• Get outputs
• Transfer files to a CE
• Transfer files between a CE and a storage element (SE)
Resources: compute elements, data storage, network resources

Information Service (IS)
How do I know which CE could run my job? Which one is free?
• Resources such as CEs and SEs report their status to the IS
• Grid services query the IS before running jobs
Resources: compute elements, data storage, network resources

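The publish-and-query pattern on this slide can be sketched in a few lines of Python. This is a toy illustration only: the class names, the `free_slots` field, and the `example.org` host names are hypothetical and do not correspond to the real gLite information system or its schema.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ResourceStatus:
    """Status record a CE or SE might publish (all fields are hypothetical)."""
    name: str
    kind: str            # "CE" or "SE"
    free_slots: int = 0  # only meaningful for a CE in this sketch


@dataclass
class InformationService:
    """Toy registry: resources publish their status, grid services query it."""
    records: Dict[str, ResourceStatus] = field(default_factory=dict)

    def publish(self, status: ResourceStatus) -> None:
        # A newer report from the same resource replaces the older one.
        self.records[status.name] = status

    def query(self, predicate: Callable[[ResourceStatus], bool]) -> List[ResourceStatus]:
        # Return every published record that satisfies the caller's predicate.
        return [r for r in self.records.values() if predicate(r)]


info = InformationService()
info.publish(ResourceStatus("ce01.example.org", "CE", free_slots=0))
info.publish(ResourceStatus("ce02.example.org", "CE", free_slots=12))
info.publish(ResourceStatus("se01.example.org", "SE"))

# "Which CE is free?" expressed as a query against the published status.
free_ces = info.query(lambda r: r.kind == "CE" and r.free_slots > 0)
print([r.name for r in free_ces])
```

A real information service would also have to expire stale records and cope with resources that stop reporting; the sketch leaves that out.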
File management
“Our data are in files, and we have terabytes of them.”
Services for users:
• Storage
• Transfer
• Replication management
Resources: compute elements, data storage, network resources

Main components
• User Interface (UI): the place where users access the Grid
• Resource Broker (RB): matches the user requirements with the available resources on the Grid
• Information System: collects information about the resources (characteristics and status of CEs and SEs)
• Computing Element (CE): a batch queue on a site’s computers where the user’s job is executed
• Storage Element (SE): provides (large-scale) storage for files

Current production middleware (architecture diagram)
• Components: “User interface”, Information Service, Resource Broker (Workload Manager), Authorisation & Authentication, Replica Catalogue, Logging & Book-keeping, Computing Element, Storage Element
• Labelled flows: Job Submit Event, Input “sandbox”, Input “sandbox” + Broker Info, Output “sandbox”, Job Query, Job Status, Publish, SE & CE info, DataSets info

“gLite 3.0”: the current middleware
• Being deployed on the EGEE production Grid now
• Runs on various Linux releases
  • “Scientific Linux” is the most common
  • Ports to other operating systems are in progress
• History
  • Over the last 2 years, new services were created in successive middleware releases up to gLite 1.5, which has been in pre-production use
  • A subset of these is deployed together with some of the previous middleware (LCG 2.7)
• Contents
  • All components already in LCG 2.7.0, plus upgrades
  • This already includes new versions of VOMS, R-GMA and FTS
  • The Workload Management System (with LB, CE, UI) of gLite 1.5.0

gLite Grid Middleware Services (diagram)
• Access: API, CLI
• Security Services: Authentication, Authorization, Auditing
• Information & Monitoring Services: Information & Monitoring, Application Monitoring
• Data Management: Metadata Catalog, File & Replica Catalog, Storage Element, Data Movement
• Workload Management Services: Accounting, Job Provenance, Package Manager, Computing Element, Workload Management
• Connectivity

http://gridportal.hep.ph.ic.ac.uk/rtm 14:00 on 17 Jan 2007
gLite Job Management Services
WMS’s Architecture
Job management requests (submission, cancellation) are expressed via a Job Description Language (JDL).

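To give a feel for what a JDL request looks like, here is a minimal Python sketch that renders a dictionary as a JDL-style description: a bracketed list of `Name = value;` pairs. The helper `to_jdl` is hypothetical and is not part of gLite. The attribute names used (`Executable`, `StdOutput`, `StdError`, `OutputSandbox`) are common JDL attributes, but the sketch handles only string and string-list values, and its escaping rule is an assumption.

```python
from typing import Mapping, Sequence, Union

JdlValue = Union[str, Sequence[str]]


def to_jdl(attributes: Mapping[str, JdlValue]) -> str:
    """Render string and string-list attributes as a JDL-style text block."""

    def quote(text: str) -> str:
        # Assumed escaping: backslash-escape backslashes and double quotes.
        return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'

    def render(value: JdlValue) -> str:
        if isinstance(value, str):
            return quote(value)
        # Lists are written in braces, e.g. {"a", "b"}.
        return "{" + ", ".join(quote(item) for item in value) + "}"

    body = "\n".join(
        f"  {name} = {render(value)};" for name, value in attributes.items()
    )
    return "[\n" + body + "\n]"


job = {
    "Executable": "/bin/hostname",
    "StdOutput": "std.out",
    "StdError": "std.err",
    "OutputSandbox": ["std.out", "std.err"],
}
print(to_jdl(job))
```

Expression-valued attributes such as requirements and ranking are deliberately left out, because they are expressions rather than quoted strings and would need a different renderer.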
WMS’s Architecture
Keeps submission requests. If no matching resource is available, requests are kept for a while, waiting to be dispatched.

WMS’s Architecture
Repository of resource information, updated via notifications and/or active polling of the sources. It provides the matchmaker with the information needed to decide on the best resources for a request.

WMS’s Architecture
Finds an appropriate CE or resource for a job request according to the information from the ISM, taking into account job preferences, resource status, and policies on resources.

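The matchmaking step can be pictured as a filter followed by a ranking. The Python sketch below is a simplified illustration under stated assumptions: `CEInfo`, its fields, and the `matchmake` function are hypothetical names, and the real matchmaker evaluates requirement and rank expressions against published resource attributes rather than Python callables.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class CEInfo:
    """Snapshot of one CE as it might be cached for matchmaking (hypothetical fields)."""
    name: str
    free_cpus: int
    max_wall_minutes: int
    est_response_s: float


def matchmake(
    ces: Iterable[CEInfo],
    requirements: Callable[[CEInfo], bool],
    rank: Callable[[CEInfo], float],
) -> Optional[CEInfo]:
    """Keep the CEs that satisfy the job requirements, then pick the best ranked."""
    candidates = [ce for ce in ces if requirements(ce)]
    if not candidates:
        return None  # no match: the request would have to wait
    return max(candidates, key=rank)


cached = [
    CEInfo("ce-a", free_cpus=4, max_wall_minutes=120, est_response_s=300.0),
    CEInfo("ce-b", free_cpus=16, max_wall_minutes=30, est_response_s=20.0),
    CEInfo("ce-c", free_cpus=8, max_wall_minutes=720, est_response_s=60.0),
]

# Job preference: at least one free CPU and a one-hour wall-clock limit or more;
# prefer the CE with the shortest estimated response time.
best = matchmake(
    cached,
    requirements=lambda ce: ce.free_cpus > 0 and ce.max_wall_minutes >= 60,
    rank=lambda ce: -ce.est_response_s,
)
print(best.name if best else "no matching CE")
```

Here `ce-b` has the fastest response but fails the wall-clock requirement, so the choice falls on the best-ranked CE among those that remain.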
WMS’s Architecture
Performs the actual job submission and monitoring. Normally this is Condor.

WMS’s Architecture
The Computing Element is the place where your jobs run.

WMS components (1)
WMS components that handle the job during its lifetime and perform the submission:
• Network Server (NS) is responsible for
  • accepting incoming requests from the UI
  • authenticating the user
  • obtaining a delegated full proxy from the user proxy
  • enqueuing the job to the Workload Manager
• Workload Manager (WM) is responsible for
  • calling the Matchmaker to find the resource that best matches the job requirements
  • interacting with the Information System and the file catalog
  • calculating the ranking of all the matched resources
• Information Supermarket (ISM)
  • basically consists of a repository of resource information that is available in read-only mode to the matchmaking engine

WMS components (2)
WMS components that handle the job during its lifetime and perform the submission:
• Job Adapter is responsible for
  • making the final touches to the JDL expression for a job before it is passed to CondorC for the actual submission
  • creating the job wrapper script that sets up the appropriate execution environment on the CE worker node
  • the transfer of the input and output sandboxes
• Job Controller (JC) is responsible for
  • converting the ClassAd job description into a Condor submit file
  • handing the job over to CondorC
• CondorC is responsible for
  • performing the actual job management operations (job submission, job removal)
• Log Monitor is responsible for
  • watching the CondorC log file
  • intercepting interesting events concerning active jobs (events affecting the job state machine)
  • triggering appropriate actions

CE’s Architecture
A Computing Element is built on a homogeneous farm of computing nodes (called Worker Nodes). The CE also contains several other components, such as the gatekeeper and the globus-jobmanager.

CE’s Architecture
Gatekeeper: grants access to the CE and maps the grid user to a local user ID.

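One common way to express such a mapping in Globus-style deployments is a grid-mapfile, in which each line holds a quoted certificate subject followed by a local account name. The Python sketch below parses that line format. It is an illustration only: the slides do not say which mapping mechanism a given gLite CE uses, and the subjects and account names in the example are made up.

```python
import shlex
from typing import Dict


def parse_gridmap(text: str) -> Dict[str, str]:
    """Map certificate subjects to local accounts from grid-mapfile style lines."""
    mapping: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = shlex.split(line)  # honours the quotes around the subject
        if len(parts) >= 2:
            mapping[parts[0]] = parts[1]
    return mapping


# Hypothetical entries, for illustration only.
example = '''
# subject                                   local account
"/C=UK/O=eScience/OU=Example/CN=Jane Doe"   grid001
"/C=IT/O=Example/CN=Mario Rossi"            grid002
'''

accounts = parse_gridmap(example)
print(accounts.get("/C=UK/O=eScience/OU=Example/CN=Jane Doe"))
```

A user whose subject is absent from the mapping gets no local account in this sketch, which corresponds to the gatekeeper refusing access.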
CE’s Architecture
Batch System: a cluster of compute nodes controlled by a head node. It handles the job execution. Examples: Torque (OpenPBS), PBS.

A typical gLite-enabled grid
• Many CEs in a gLite-enabled grid
• A few WMSs coordinate the CEs and broker jobs to suitable CEs

Computing Element components
Gatekeeper
• Grants access to the CE: authenticates users and maps them to local accounts
• Forks the globus-jobmanager
globus-jobmanager
• Forks Condor-C (in the CE) to help submit jobs to the batch system
BLAHPD (Batch Local ASCII Helper Protocol Daemon)
• Offers a single interface for Condor-C (in the CE) to submit jobs to different batch systems
• BLAHPD commands are used by Condor-C (in the CE) to submit jobs to the batch system
Batch System
• Handles the job execution on the available local worker nodes
• Consists of:
  - the Torque resource manager (formerly known as OpenPBS)
  - the Maui job scheduler
• A cluster MUST be homogeneous
Worker nodes
• The hosts that execute the jobs
• Also responsible for downloading and uploading the jobs’ data from or to the WMS or an SE

Job State Machine
Job State Machine (1/9) Submitted: job has been entered by the user at the User Interface but not yet transferred to the Network Server for processing.
Job State Machine (2/9) Waiting: job was accepted by the NS and is waiting for Workload Manager processing, or is being processed by WM Helper modules.
Job State Machine (3/9) Ready: job has been processed by the WM and its Helper modules (CE found) but not yet transferred to the CE (local batch system queue) via JC and CondorC.
Job State Machine (4/9) Scheduled: job is waiting in the queue on the CE.
Job State Machine (5/9) Running: job is running on the CE’s queuing system (inside one of the worker nodes).
Job State Machine (6/9) Done: job exited or is considered to be in a terminal state by CondorC (e.g., submission to the CE has failed in an unrecoverable way).
Job State Machine (7/9) Aborted: job processing was aborted by the WMS (waiting in the WM queue or on the CE for too long, over-use of quotas, expiration of user credentials).
Job State Machine (8/9) Cancelled: job has been successfully cancelled on user request.
Job State Machine (9/9) Cleared: output sandbox was transferred to the user or removed due to the timeout.
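The nine states can be captured as a small state machine. The state names below come from the slides, but the transition table is an assumption inferred from the state descriptions: the slides do not list the allowed transitions, and the sketch ignores details such as resubmission after a failure.

```python
from enum import Enum


class JobState(Enum):
    SUBMITTED = "Submitted"
    WAITING = "Waiting"
    READY = "Ready"
    SCHEDULED = "Scheduled"
    RUNNING = "Running"
    DONE = "Done"
    ABORTED = "Aborted"
    CANCELLED = "Cancelled"
    CLEARED = "Cleared"


# Assumed transitions: every active state may also end in Aborted or Cancelled.
_EXITS = {JobState.ABORTED, JobState.CANCELLED}
ALLOWED = {
    JobState.SUBMITTED: {JobState.WAITING} | _EXITS,
    JobState.WAITING: {JobState.READY} | _EXITS,
    # Ready -> Done covers a submission to the CE that failed unrecoverably.
    JobState.READY: {JobState.SCHEDULED, JobState.DONE} | _EXITS,
    JobState.SCHEDULED: {JobState.RUNNING} | _EXITS,
    JobState.RUNNING: {JobState.DONE} | _EXITS,
    JobState.DONE: {JobState.CLEARED},
    JobState.ABORTED: set(),
    JobState.CANCELLED: set(),
    JobState.CLEARED: set(),
}


def advance(current: JobState, new: JobState) -> JobState:
    """Return the new state, or raise if the transition is not in the table."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {new.value}")
    return new


# Walk a job through the normal path from submission to output retrieval.
state = JobState.SUBMITTED
for nxt in (JobState.WAITING, JobState.READY, JobState.SCHEDULED,
            JobState.RUNNING, JobState.DONE, JobState.CLEARED):
    state = advance(state, nxt)
print(state.value)
```

Keeping the table as data rather than as nested conditionals makes it easy to adjust once the authoritative state diagram is at hand.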
Further information
• EGEE: www.eu-egee.org
• gLite: http://www.glite.org/
• LCG: http://lcg.web.cern.ch/LCG/
• Open Grid Forum: http://www.gridforum.org/
• Globus Alliance: http://www.globus.org/
• VDT: http://www.cs.wisc.edu/vdt/