Monitoring and Fabric Management The European DataGrid Project Team http://www.eu-datagrid.org
Contents • Monitoring and Fabric management overview • What is being monitored • R-GMA • Fabric management
Information and Monitoring Services • EDG information providers • Software that provides information about resources and infrastructure • Provided by the work packages responsible for the resource • Globus MDS (Metacomputing Directory Service, now called the Monitoring and Discovery Service) • Based on OpenLDAP, a hierarchical database • R-GMA (Relational Grid Monitoring Architecture) • A relational implementation of the Global Grid Forum's GMA • Overview • Uses within the testbed
LDAP - Directory Information Tree • computing element • storage element • site information • network information between this and other sites • status • file statistics • supported protocols • storage elements that are close (not necessarily at the same site)
Siteinfo
in=siteinfo,Mds-Vo-name=ral-dev,Mds-Vo-name=uk,o=Grid
objectClass: SiteInfo
objectClass: DataGridTop
objectClass: DynamicObject
siteName: RALDEV
sysAdminContact: grid.sysadmin@rl.ac.uk
userSupportContact: grid.support@rl.ac.uk
siteSecurityContact: grid.security@rl.ac.uk
dataGridVersion: 1.2
installationDate: 20020704142800Z
Computing Element
ceId=dev01.hepgrid.clrc.ac.uk:2119/jobmanager-pbs-M,hn=dev01.hepgrid.clrc.ac.uk,Mds-Vo-name=ral-dev,Mds-Vo-name=uk,o=Grid
objectClass: DataGridTop
objectClass: ComputingElement
CEId: dev01.hepgrid.clrc.ac.uk:2119/jobmanager-pbs-M
GlobusResourceContactString: dev01.hepgrid.clrc.ac.uk:2119/jobmanager-pbs:/O=Grid/O=UKHEP/CN=dev01.hepgrid.clrc.ac.uk
GRAMVersion: ?
Architecture: intel
OpSys: RH 6.2
MinPhysicalMemory: 258
MinLocalDiskSpace: 2048
TotalCPUs: 1
FreeCPUs: 1
NumSMPs: 0
MinSPUProcessors: 0
MaxSPUProcessors: 0
TotalJobs: 0
RunningJobs: 0
IdleJobs: 0
IdleJobs: 0
MaxTotalJobs: 1
MaxRunningJobs: 1
WorstTraversalTime: 108000
EstimatedTraversalTime: 0
Active: TRUE
Priority: 20
MaxCPUTime: 108000
MaxWallClockTime: 432000
AverageSI00: 300
MinSI00: 300
MaxSI00: 300
AuthorizedUser: /O=Grid/O=UKHEP/OU=hepgrid.clrc.ac.uk/CN=Tim Eves
AuthorizedUser: /O=Grid/O=UKHEP/OU=hepgrid.clrc.ac.uk/CN=Tim Folkes
RunTimeEnvironment: RALDEV
AFSAvailable: FALSE
OutboundIP: TRUE
InboundIP: FALSE
QueueName: M
LRMSType: PBS
LRMSVersion: OpenPBS_2.3
Close Storage Element
closeSE=dev02.hepgrid.clrc.ac.uk,ceId=dev01.hepgrid.clrc.ac.uk:2119/jobmanager-pbs-M,hn=dev01.hepgrid.clrc.ac.uk,Mds-Vo-name=ral-dev,Mds-Vo-name=uk,o=Grid
objectClass: CloseStorageElement
objectClass: DataGridTop
objectClass: DynamicObject
CEId: dev01.hepgrid.clrc.ac.uk:2119/jobmanager-pbs-M
CloseSE: dev02.hepgrid.clrc.ac.uk
MountPoint: /flatfiles
Storage Element
seId=dev02.hepgrid.clrc.ac.uk,Mds-Vo-name=ral-dev,Mds-Vo-name=uk,o=Grid
objectClass: StorageElement
objectClass: DataGridTop
objectClass: DynamicObject
SEId: dev02.hepgrid.clrc.ac.uk
CloseCE: dev01.hepgrid.clrc.ac.uk:2119/jobmanager-pbs-M
SEtypearchitecture: disk
SEsize: 13177
SEResourceContactString: grid.support@rl.ac.uk
SEvo: wpsix
Storage Element Protocols
seProtocol=gridftp,seId=dev02.hepgrid.clrc.ac.uk,Mds-Vo-name=ral-dev,Mds-Vo-name=uk,o=Grid
objectClass: StorageElementProtocol
objectClass: DataGridTop
objectClass: DynamicObject
SEId: dev02.hepgrid.clrc.ac.uk
SEProtocol: gridftp
Port: 2811

seProtocol=rfio,seId=dev02.hepgrid.clrc.ac.uk,Mds-Vo-name=ral-dev,Mds-Vo-name=uk,o=Grid
objectClass: StorageElementProtocol
objectClass: DataGridTop
objectClass: DynamicObject
SEId: dev02.hepgrid.clrc.ac.uk
SEProtocol: rfio
Port: 3147

seProtocol=file,seId=dev02.hepgrid.clrc.ac.uk,Mds-Vo-name=ral-dev,Mds-Vo-name=uk,o=Grid
objectClass: StorageElementProtocol
objectClass: DataGridTop
objectClass: DynamicObject
SEId: dev02.hepgrid.clrc.ac.uk
SEProtocol: file
Storage Element Status
in=status,seId=dev02.hepgrid.clrc.ac.uk,Mds-Vo-name=ral-dev,Mds-Vo-name=uk,o=Grid
objectClass: StorageElementStatus
objectClass: DataGridTop
objectClass: DynamicObject
SEfreespace: 12031
SEId: dev02.hepgrid.clrc.ac.uk
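Each entry shown on these slides is a distinguished name followed by "attribute: value" pairs, and some attributes (objectClass, for instance) carry several values. As a rough illustration only, the Python sketch below splits one such entry into its DN and a dictionary of value lists. The function name parse_entry is made up for this example and is not part of any EDG or Globus tool; the sample text is the Storage Element Status entry, written one attribute per line.

```python
from collections import defaultdict

# Storage Element Status entry from the slide, one attribute per line.
ENTRY = """\
in=status,seId=dev02.hepgrid.clrc.ac.uk,Mds-Vo-name=ral-dev,Mds-Vo-name=uk,o=Grid
objectClass: StorageElementStatus
objectClass: DataGridTop
objectClass: DynamicObject
SEfreespace: 12031
SEId: dev02.hepgrid.clrc.ac.uk
"""


def parse_entry(text):
    """Return (dn, attributes); attributes maps each name to a list of values."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    dn = lines[0]
    attrs = defaultdict(list)
    for line in lines[1:]:
        # Split on the first colon only, so values containing ':' stay intact.
        name, _, value = line.partition(":")
        attrs[name.strip()].append(value.strip())
    return dn, dict(attrs)


dn, attrs = parse_entry(ENTRY)
print(dn)
print(attrs["objectClass"])
print(int(attrs["SEfreespace"][0]))
```

Keeping every attribute as a list avoids special-casing multi-valued attributes such as objectClass or AuthorizedUser.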
GRIS/GIIS Hierarchy [Figure: tree with Mds-Vo-name=datagrid at the top, Mds-Vo-name=countryA and Mds-Vo-name=countryB below it, and Mds-Vo-name=siteA to Mds-Vo-name=siteD at the bottom] • Mds-Vo-name=datagrid,o=grid • This will look at all the data • Mds-Vo-name=countryA,Mds-Vo-name=datagrid,o=grid • This will look at all the data from countryA • Mds-Vo-name=countryA,o=grid • This will look at all the data from countryA • Mds-Vo-name=siteB,Mds-Vo-name=countryA,o=grid • This will look at all the data from siteB • Mds-Vo-name=siteB,o=grid • This will look at all the data from siteB
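The search bases listed above all follow one pattern: a chain of Mds-Vo-name components, most specific first, ending in o=grid. A small Python sketch of that pattern follows; the helper name search_base is invented for this illustration and does not come from MDS or any LDAP library.

```python
def search_base(*vo_names, root="o=grid"):
    """Join Mds-Vo-name components (most specific first) into a search base DN."""
    parts = [f"Mds-Vo-name={name}" for name in vo_names]
    return ",".join(parts + [root])


# The search bases named on the slide:
print(search_base("datagrid"))              # all the data
print(search_base("countryA", "datagrid"))  # data from countryA
print(search_base("siteB", "countryA"))     # data from siteB
print(search_base("siteB"))                 # data from siteB
```

The resulting string is what would be handed to an LDAP client as the base of the search; which server to contact is outside the scope of this sketch.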
Map Centre – WP7 • Alternatively, the information can be viewed using WP7's Map Centre • http://ccwp7.in2p3.fr/mapcenter/
R-GMA • Relational Grid Monitoring Architecture • An Overview
The Consumer Producer Model • Uses the Grid Monitoring Architecture from the Global Grid Forum • A relational implementation • Applied to both information and monitoring • Creates the impression that you have one RDBMS per Virtual Organization [Figure: Producer, Registry and Consumer, linked by command flow and information flow]
Relational Approach • Not a general distributed RDBMS, but a way to use the relational model in a distributed environment where ACID properties are not generally important • Producers • announce: SQL “CREATE TABLE” • publish: SQL “INSERT” • Consumers • collect: SQL “SELECT”
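The three SQL verbs above can be tried out against any relational engine. The Python sketch below uses an in-memory SQLite database purely as a stand-in for the "one RDBMS per Virtual Organization" impression; it is not the R-GMA API. The table name cpuLoad is taken from a later slide, while the column names, host names and values are assumptions made for the example.

```python
import sqlite3

# A single in-memory database stands in for the virtual, VO-wide RDBMS.
db = sqlite3.connect(":memory:")

# Producer announces: CREATE TABLE (column names are illustrative).
db.execute("CREATE TABLE cpuLoad (country TEXT, site TEXT, host TEXT, loadAvg REAL)")

# Producers publish: INSERT (hosts and values are made up).
db.executemany(
    "INSERT INTO cpuLoad VALUES (?, ?, ?, ?)",
    [("UK", "RAL", "ral-node", 0.3), ("UK", "GLA", "gla-node", 1.2)],
)

# Consumer collects: SELECT.
rows = db.execute(
    "SELECT site, loadAvg FROM cpuLoad WHERE country = 'UK' ORDER BY site"
).fetchall()
print(rows)
```

In R-GMA itself the rows stay with their producers and a registry puts consumers in touch with them; the single local database here only illustrates the SQL vocabulary.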
R-GMA • API – Servlet communication • http(s) in • XML back [Figure: Application Code uses the Consumer API, which talks to the Consumer Servlet; Sensor Code uses the Producer API, which talks to the Producer Servlet; Registry API and Schema API connect to the Registry Servlet and Schema Servlet; arrows numbered 1 to 9 show command flow and information flow]
Contributions are Views • SELECT * FROM cpuLoad WHERE country = 'UK' AND site = 'RAL' • SELECT * FROM cpuLoad WHERE country = 'UK' AND site = 'GLA'
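One way to read the two queries above: each producer's contribution is the slice of the cpuLoad table selected by its own predicate. The Python sketch below expresses that with SQLite views over a single local table. This is an illustration under assumptions, not how R-GMA is implemented; the view names, column names and rows are invented for the example.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE cpuLoad (country TEXT, site TEXT, host TEXT, loadAvg REAL)")
db.executemany(
    "INSERT INTO cpuLoad VALUES (?, ?, ?, ?)",
    [("UK", "RAL", "ral-node", 0.3), ("UK", "GLA", "gla-node", 1.2)],
)

# Each contribution is described by a predicate over the global table.
db.execute(
    "CREATE VIEW ralContribution AS "
    "SELECT * FROM cpuLoad WHERE country = 'UK' AND site = 'RAL'"
)
db.execute(
    "CREATE VIEW glaContribution AS "
    "SELECT * FROM cpuLoad WHERE country = 'UK' AND site = 'GLA'"
)

ral_hosts = db.execute("SELECT host FROM ralContribution").fetchall()
gla_hosts = db.execute("SELECT host FROM glaContribution").fetchall()
print(ral_hosts, gla_hosts)
```

Knowing each producer's predicate is what would let a query for site = 'RAL' be answered without contacting the producer at GLA.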
Architecture logical overview [Figure: Grid User and Local User; Resource Broker, Grid Info Services and other services (other WPs); fabric management (WP4) subsystems: Fabric Gridification, Resource Management, Monitoring & Fault Tolerance, Configuration Management, Installation & Node Mgmt; Farm A (LSF) and Farm B (PBS); Data Mgmt (WP2); Grid Data Storage (WP5: mass storage, disk pools)]
Architecture logical overview: Fabric Gridification • Interface between Grid-wide services and the local fabric • Provides local authentication, authorization and mapping of grid credentials
Architecture logical overview: Resource Management • Provides transparent access (both job and admin) to different cluster batch systems • Enhanced capabilities (extended scheduling policies, advanced reservation, local accounting)
Architecture logical overview: Installation & Node Mgmt • Provides the tools to install and manage all software running on the fabric nodes • Agent to install, upgrade, remove and configure software packages on the nodes • Bootstrap services and software repositories
Architecture logical overview: Configuration Management • Provides central storage and management of all fabric configuration information • Central DB and a set of protocols and APIs to store and retrieve information
Architecture logical overview: Monitoring & Fault Tolerance • Provides the tools for gathering monitoring information on fabric nodes • Central measurement repository stores all monitoring information • Fault tolerance correlation engines detect failures and trigger recovery actions
User job management (Grid and local) [Figure: Grid User, Local User, Resource Broker, Grid Info Services, other services, Fabric Gridification, Resource Management, Monitoring, Farm A (LSF), Farm B (PBS), Data Mgmt (WP2), Grid Data Storage (mass storage, disk pools); one step is highlighted per slide] • Submit job • Publish resource and accounting information • Optimized selection of site • Authorize • Map grid credentials to local credentials • Select an optimal batch queue and submit • Return job status and output
Automated management of large clusters [Figure: WP4 subsystems (Monitoring & Fault Tolerance, Resource Management, Configuration Management, Installation & Node Mgmt), Farm A (LSF), Farm B (PBS) and other WPs, linked by information and invocation flows; the final slide adds an Automation label; one step is highlighted per slide] • Node malfunction detected • Remove node from queue • Wait for running jobs(?) • Update configuration templates • Trigger repair • Repair (e.g. restart, reboot, reconfigure, …) • Node OK detected • Put back node in queue
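The recovery steps listed on these slides can be read as a fixed sequence applied to one node. The Python sketch below walks a node through that sequence and records each step. It is a toy model under assumptions: the Node class, its fields and the recover function are invented for this illustration, and the repair itself is a stand-in rather than a call to any WP4 subsystem.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical model of a farm node; not a WP4 data structure."""
    name: str
    healthy: bool = True
    in_queue: bool = True
    running_jobs: int = 0
    log: list = field(default_factory=list)


def recover(node):
    """Apply the slide's recovery steps in order and return the step log."""
    if node.healthy:
        return node.log
    node.log.append("node malfunction detected")
    node.in_queue = False
    node.log.append("removed from queue")
    node.running_jobs = 0  # stand-in for waiting until running jobs finish
    node.log.append("waited for running jobs")
    node.log.append("configuration templates updated")
    node.log.append("repair triggered")
    node.healthy = True  # stand-in for restart, reboot or reconfigure
    node.log.append("repaired")
    node.log.append("node OK detected")
    node.in_queue = True
    node.log.append("put back in queue")
    return node.log


broken = Node("node01", healthy=False, running_jobs=2)
for step in recover(broken):
    print(step)
```

Whether to wait for running jobs is marked with a question mark on the slide; the sketch simply assumes the wait happens.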
LCFG (Local ConFiGuration system) • Widely used fabric tool whose purpose is to handle automated installation and configuration in a very diverse and evolving environment • Mechanism: • Abstract configuration parameters are stored in a central repository located on the LCFG server. • Scripts on the host machine (LCFG client) read these configuration parameters and either generate traditional configuration files or directly manipulate various services.
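The mechanism described above (abstract parameters held centrally, client-side scripts turning them into ordinary configuration files) can be sketched in a few lines of Python. Everything named here is hypothetical: the REPOSITORY dictionary, the parameter keys, the host name and the generated file layout are assumptions for illustration and do not reflect LCFG's actual formats or tools.

```python
# Hypothetical central repository: abstract parameters keyed by host name.
REPOSITORY = {
    "node01": {
        "ntp.servers": ["ntp1.example.org", "ntp2.example.org"],
        "ntp.driftfile": "/etc/ntp/drift",
    },
}


def fetch_profile(host):
    """Stand-in for the client reading its parameters from the server."""
    return REPOSITORY[host]


def render_ntp_conf(profile):
    """Turn abstract parameters into the text of a traditional config file."""
    lines = [f"server {server}" for server in profile["ntp.servers"]]
    lines.append(f"driftfile {profile['ntp.driftfile']}")
    return "\n".join(lines) + "\n"


conf_text = render_ntp_conf(fetch_profile("node01"))
print(conf_text, end="")
```

Keeping the parameters abstract means the same repository entry could drive a different renderer if a service changes its file format.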
Local Authorization: LCAS • The Local Centre Authorization Service (LCAS) handles authorization requests to the local computing fabric. • In this release the LCAS is a shared library, which is loaded dynamically by the Globus gatekeeper. The gatekeeper has been slightly modified for this purpose and will from now on be referred to as edg-gatekeeper. • The authorization decision of the LCAS is based on the user's certificate and the job specification in RSL (JDL) format. The certificate and RSL are passed to (plug-in) authorization modules, which grant or deny access to the fabric. Three standard authorization modules are provided by default: • lcas_userallow.mod: checks whether the user is allowed on the fabric (currently the gridmap file is checked). • lcas_userban.mod: checks whether the user should be banned from the fabric. • lcas_timeslots.mod: checks whether the fabric is open at this time of day for DataGrid jobs.
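The plug-in idea above can be illustrated with three small checks that each grant or deny a request. The Python sketch below is not the LCAS interface (LCAS is a shared library loaded by the gatekeeper): the function names, the user names, the opening hours and the rule that every module must grant access are all assumptions made for this example.

```python
from datetime import time

# Hypothetical policy data; real LCAS modules read their own configuration.
ALLOWED_USERS = {"/O=Grid/CN=Alice Example", "/O=Grid/CN=Bob Example"}
BANNED_USERS = {"/O=Grid/CN=Bob Example"}
OPEN_FROM, OPEN_UNTIL = time(8, 0), time(20, 0)


def user_allow(user_dn, rsl, now):
    """Analogue of lcas_userallow.mod: is the user allowed on the fabric?"""
    return user_dn in ALLOWED_USERS


def user_ban(user_dn, rsl, now):
    """Analogue of lcas_userban.mod: deny users on the ban list."""
    return user_dn not in BANNED_USERS


def time_slots(user_dn, rsl, now):
    """Analogue of lcas_timeslots.mod: is the fabric open at this time?"""
    return OPEN_FROM <= now <= OPEN_UNTIL


MODULES = [user_allow, user_ban, time_slots]


def authorize(user_dn, rsl, now):
    """Grant access only if every module grants it (an assumed policy)."""
    return all(module(user_dn, rsl, now) for module in MODULES)


print(authorize("/O=Grid/CN=Alice Example", "&(executable=/bin/date)", time(12, 0)))
print(authorize("/O=Grid/CN=Bob Example", "&(executable=/bin/date)", time(12, 0)))
```

Every module receives the same arguments, so a site could add or remove checks by editing the MODULES list without touching the others.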
Authentication control flow [Figure: two columns compared side by side] • Globus gatekeeper: TLS auth, assist_gridmap, Jobmanager-* • EDG gatekeeper (Globus + LCAS): TLS auth, LCAS (so), assist_gridmap, Jobmanager-* • * And store in job repository
Further Information • Information and Monitoring Services • http://hepunx.rl.ac.uk/edg/wp3/ • Fabric Management • http://cern.ch/hep-proj-grid-fabric/