Monitoring and Fabric Management

cstockdale

Presentation Transcript


  1. Monitoring and Fabric Management
     The European DataGrid Project Team
     http://www.eu-datagrid.org

  2. Contents
     • Monitoring and Fabric management overview
     • What is being monitored
     • R-GMA
     • Fabric management

  3. Information and Monitoring Services
     • EDG information providers
       • Software that provides information about resources and infrastructure
       • Provided by the work packages responsible for the resource
     • Globus MDS (Metacomputing Directory Service, now called Monitoring and Discovery Service)
       • Based on OpenLDAP, a hierarchical database
     • R-GMA (Relational Grid Monitoring Architecture)
       • A relational implementation of the Global Grid Forum's GMA
       • Overview
       • Uses within the testbed

  4. LDAP - Directory Information Tree
     [Diagram: directory tree whose nodes are labelled as follows]
     • computing element
     • storage element
     • site information
     • network information between this and other sites
     • status
     • file statistics
     • supported protocols
     • storage elements that are close (not necessarily at the same site)

  5. Siteinfo
     in=siteinfo,Mds-Vo-name=ral-dev,Mds-Vo-name=uk,o=Grid
     objectClass: SiteInfo
     objectClass: DataGridTop
     objectClass: DynamicObject
     siteName: RALDEV
     sysAdminContact: grid.sysadmin@rl.ac.uk
     userSupportContact: grid.support@rl.ac.uk
     siteSecurityContact: grid.security@rl.ac.uk
     dataGridVersion: 1.2
     installationDate: 20020704142800Z

  6. Computing Element
     ceId=dev01.hepgrid.clrc.ac.uk:2119/jobmanager-pbs-M,hn=dev01.hepgrid.clrc.ac.uk,Mds-Vo-name=ral-dev,Mds-Vo-name=uk,o=Grid
     objectClass: DataGridTop
     objectClass: ComputingElement
     CEId: dev01.hepgrid.clrc.ac.uk:2119/jobmanager-pbs-M
     GlobusResourceContactString: dev01.hepgrid.clrc.ac.uk:2119/jobmanager-pbs:/O=Grid/O=UKHEP/CN=dev01.hepgrid.clrc.ac.uk
     GRAMVersion: ?
     Architecture: intel
     OpSys: RH 6.2
     MinPhysicalMemory: 258
     MinLocalDiskSpace: 2048
     TotalCPUs: 1
     FreeCPUs: 1
     NumSMPs: 0
     MinSPUProcessors: 0
     MaxSPUProcessors: 0
     TotalJobs: 0
     RunningJobs: 0
     IdleJobs: 0
     IdleJobs: 0
     MaxTotalJobs: 1
     MaxRunningJobs: 1
     WorstTraversalTime: 108000
     EstimatedTraversalTime: 0
     Active: TRUE
     Priority: 20
     MaxCPUTime: 108000
     MaxWallClockTime: 432000
     AverageSI00: 300
     MinSI00: 300
     MaxSI00: 300
     AuthorizedUser: /O=Grid/O=UKHEP/OU=hepgrid.clrc.ac.uk/CN=Tim Eves
     AuthorizedUser: /O=Grid/O=UKHEP/OU=hepgrid.clrc.ac.uk/CN=Tim Folkes
     RunTimeEnvironment: RALDEV
     AFSAvailable: FALSE
     OutboundIP: TRUE
     InboundIP: FALSE
     QueueName: M
     LRMSType: PBS
     LRMSVersion: OpenPBS_2.3
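
As a rough illustration of how a client might consume a Computing Element entry like the one above, the following Python sketch parses "name: value" lines into a dictionary that allows repeated attributes (objectClass, AuthorizedUser). The attribute values are copied from the slide; the `parse_entry` and `has_free_slot` helpers and the free-slot rule are assumptions made for illustration, not part of the EDG software.

```python
from collections import defaultdict

# Excerpt of the ComputingElement entry shown on the slide.
ENTRY = """\
objectClass: DataGridTop
objectClass: ComputingElement
CEId: dev01.hepgrid.clrc.ac.uk:2119/jobmanager-pbs-M
TotalCPUs: 1
FreeCPUs: 1
RunningJobs: 0
MaxRunningJobs: 1
LRMSType: PBS
AuthorizedUser: /O=Grid/O=UKHEP/OU=hepgrid.clrc.ac.uk/CN=Tim Eves
AuthorizedUser: /O=Grid/O=UKHEP/OU=hepgrid.clrc.ac.uk/CN=Tim Folkes
"""


def parse_entry(text):
    """Parse 'name: value' lines into {name: [values]}; attributes may repeat."""
    attrs = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first colon only: values such as CEId contain colons.
        name, _, value = line.partition(":")
        attrs[name.strip()].append(value.strip())
    return dict(attrs)


def has_free_slot(attrs):
    """Hypothetical rule: a CE can take a job if it has a free CPU and is
    below its running-job limit."""
    free = int(attrs["FreeCPUs"][0])
    running = int(attrs["RunningJobs"][0])
    limit = int(attrs["MaxRunningJobs"][0])
    return free > 0 and running < limit


ce = parse_entry(ENTRY)
print(ce["CEId"][0])
print(len(ce["AuthorizedUser"]), "authorized users")
print("free slot:", has_free_slot(ce))
```

Keeping every attribute as a list avoids losing the second AuthorizedUser or objectClass value when a name appears more than once.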

  7. Close Storage Element
     closeSE=dev02.hepgrid.clrc.ac.uk,ceId=dev01.hepgrid.clrc.ac.uk:2119/jobmanager-pbs-M,hn=dev01.hepgrid.clrc.ac.uk,Mds-Vo-name=ral-dev,Mds-Vo-name=uk,o=Grid
     objectClass: CloseStorageElement
     objectClass: DataGridTop
     objectClass: DynamicObject
     CEId: dev01.hepgrid.clrc.ac.uk:2119/jobmanager-pbs-M
     CloseSE: dev02.hepgrid.clrc.ac.uk
     MountPoint: /flatfiles

  8. Storage Element
     seId=dev02.hepgrid.clrc.ac.uk,Mds-Vo-name=ral-dev,Mds-Vo-name=uk,o=Grid
     objectClass: StorageElement
     objectClass: DataGridTop
     objectClass: DynamicObject
     SEId: dev02.hepgrid.clrc.ac.uk
     CloseCE: dev01.hepgrid.clrc.ac.uk:2119/jobmanager-pbs-M
     SEtypearchitecture: disk
     SEsize: 13177
     SEResourceContactString: grid.support@rl.ac.uk
     SEvo: wpsix

  9. Storage Element Protocols
     seProtocol=gridftp,seId=dev02.hepgrid.clrc.ac.uk,Mds-Vo-name=ral-dev,Mds-Vo-name=uk,o=Grid
     objectClass: StorageElementProtocol
     objectClass: DataGridTop
     objectClass: DynamicObject
     SEId: dev02.hepgrid.clrc.ac.uk
     SEProtocol: gridftp
     Port: 2811

     seProtocol=rfio,seId=dev02.hepgrid.clrc.ac.uk,Mds-Vo-name=ral-dev,Mds-Vo-name=uk,o=Grid
     objectClass: StorageElementProtocol
     objectClass: DataGridTop
     objectClass: DynamicObject
     SEId: dev02.hepgrid.clrc.ac.uk
     SEProtocol: rfio
     Port: 3147

     seProtocol=file,seId=dev02.hepgrid.clrc.ac.uk,Mds-Vo-name=ral-dev,Mds-Vo-name=uk,o=Grid
     objectClass: StorageElementProtocol
     objectClass: DataGridTop
     objectClass: DynamicObject
     SEId: dev02.hepgrid.clrc.ac.uk
     SEProtocol: file

  10. Storage Element Status
      in=status,seId=dev02.hepgrid.clrc.ac.uk,Mds-Vo-name=ral-dev,Mds-Vo-name=uk,o=Grid
      objectClass: StorageElementStatus
      objectClass: DataGridTop
      objectClass: DynamicObject
      SEfreespace: 12031
      SEId: dev02.hepgrid.clrc.ac.uk

  11. GRIS/GIIS Hierarchy
      [Diagram: hierarchy with Mds-Vo-name=datagrid at the top, then Mds-Vo-name=countryA and Mds-Vo-name=countryB, then Mds-Vo-name=siteA, siteB, siteC and siteD]
      • Mds-Vo-name=datagrid,o=grid
        • This will look at all the data
      • Mds-Vo-name=countryA,Mds-Vo-name=datagrid,o=grid
        • This will look at all the data from countryA
      • Mds-Vo-name=countryA,o=grid
        • This will look at all the data from countryA
      • Mds-Vo-name=siteB,Mds-Vo-name=countryA,o=grid
        • This will look at all the data from siteB
      • Mds-Vo-name=siteB,o=grid
        • This will look at all the data from siteB

  12. Map Centre – WP7
      • Alternatively, the information can be viewed using WP7's Map Center
      • http://ccwp7.in2p3.fr/mapcenter/
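
The search bases listed on the GRIS/GIIS slide follow a simple pattern: the most specific Mds-Vo-name comes first and the root comes last. A minimal Python sketch of that pattern is given here. The `base_dn` and `in_scope` helpers are hypothetical names, and the comma split is a simplification that assumes no escaped commas inside the DN.

```python
def base_dn(*vo_names, root="o=grid"):
    """Build a search base, most specific Mds-Vo-name first, root last."""
    parts = [f"Mds-Vo-name={name}" for name in vo_names]
    return ",".join(parts + [root])


def in_scope(entry_dn, base):
    """True if entry_dn lies at or below base (case-insensitive suffix match).

    Simplification: splits on every comma, so escaped commas are not handled.
    """
    entry = [part.strip().lower() for part in entry_dn.split(",")]
    suffix = [part.strip().lower() for part in base.split(",")]
    return entry[-len(suffix):] == suffix


# DN taken from the Siteinfo slide.
site_dn = "in=siteinfo,Mds-Vo-name=ral-dev,Mds-Vo-name=uk,o=Grid"

print(base_dn("siteB", "countryA"))
print(in_scope(site_dn, base_dn("uk")))
print(in_scope(site_dn, base_dn("ral-dev", "uk")))
print(in_scope(site_dn, base_dn("siteB", "countryA")))
```

A narrower base selects fewer entries: the same site entry is in scope for the country-level base and for its own site-level base, but not for a different site.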

  13. R-GMA
      Relational Grid Monitoring Architecture
      An Overview

  14. The Consumer Producer Model
      • Use the Grid Monitoring Architecture from Global Grid Forum
      • A relational implementation
      • Applied to both information and monitoring
      • Creates impression that you have one RDBMS per Virtual Organization
      [Diagram: Producer, Registry and Consumer, connected by arrows for command flow and information flow]
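
The producer, registry and consumer roles can be sketched in a few lines of Python. This is an in-memory toy under stated assumptions, not the R-GMA API: the class and method names are invented, and the row values are made up for illustration. It only shows the split the slide describes, where the registry handles lookups (command flow) while the data itself moves between producer and consumer (information flow).

```python
class Registry:
    """Maps table names to the producers that have announced them."""

    def __init__(self):
        self._producers = {}

    def register(self, table, producer):
        self._producers.setdefault(table, []).append(producer)

    def lookup(self, table):
        return list(self._producers.get(table, []))


class Producer:
    def __init__(self, registry, table):
        self.table = table
        self.rows = []
        # Command flow: announce this producer to the registry.
        registry.register(table, self)

    def publish(self, row):
        self.rows.append(row)


class Consumer:
    def __init__(self, registry):
        self.registry = registry

    def collect(self, table):
        # Command flow: ask the registry who publishes the table.
        producers = self.registry.lookup(table)
        # Information flow: read the rows from each producer directly.
        return [row for producer in producers for row in producer.rows]


registry = Registry()
ral = Producer(registry, "cpuLoad")
gla = Producer(registry, "cpuLoad")
ral.publish({"country": "UK", "site": "RAL", "load": 0.3})  # illustrative value
gla.publish({"country": "UK", "site": "GLA", "load": 0.7})  # illustrative value

consumer = Consumer(registry)
rows = consumer.collect("cpuLoad")
print(len(rows), "rows from", len(registry.lookup("cpuLoad")), "producers")
```

The consumer sees one combined cpuLoad table even though the rows live with two separate producers, which is the "one RDBMS per Virtual Organization" impression in miniature.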

  15. Relational Approach
      • Not a general distributed RDBMS system, but a way to use the relational model in a distributed environment where ACID properties are not generally important.
      • Producers
        • announce: SQL "CREATE TABLE"
        • publish: SQL "INSERT"
      • Consumers
        • collect: SQL "SELECT"

  16. R-GMA
      [Diagram: components Application Code, Consumer API, Consumer Servlet, Registry API, Registry Servlet, Schema API, Schema Servlet, Producer API, ProducerServlet and Sensor Code, linked by command-flow and information-flow arrows numbered 1 to 9]
      • API – Servlet communication
        • http(s) in
        • XML back
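
The three SQL verbs on the Relational Approach slide can be tried against any relational engine. This sketch uses Python's built-in sqlite3 module with a single in-memory database standing in for the virtual database; that is an assumption for illustration only, since R-GMA distributes these roles across producers and consumers. The column names and load values are made up.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Producer announces: SQL "CREATE TABLE"
db.execute("CREATE TABLE cpuLoad (country TEXT, site TEXT, load_avg REAL)")

# Producer publishes: SQL "INSERT" (illustrative values)
db.executemany(
    "INSERT INTO cpuLoad VALUES (?, ?, ?)",
    [("UK", "RAL", 0.3), ("UK", "GLA", 0.7)],
)

# Consumer collects: SQL "SELECT"
rows = db.execute(
    "SELECT site, load_avg FROM cpuLoad WHERE country = ? ORDER BY site",
    ("UK",),
).fetchall()
print(rows)  # [('GLA', 0.7), ('RAL', 0.3)]
```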

  17. Schema & Contributions

  18. Contributions are Views
      SELECT * FROM cpuLoad WHERE country = 'UK' AND site = 'RAL'
      SELECT * FROM cpuLoad WHERE country = 'UK' AND site = 'GLA'
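
One way to read "contributions are views" is that each producer's share of the cpuLoad table is described by a predicate, exactly like a SQL view. The sketch below, again using sqlite3 as a local stand-in, defines one view per site from the two SELECT statements on the slide. The view names, the third row and all load values are assumptions added for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE cpuLoad (country TEXT, site TEXT, load_avg REAL)")
db.executemany(
    "INSERT INTO cpuLoad VALUES (?, ?, ?)",
    [
        ("UK", "RAL", 0.3),    # illustrative values
        ("UK", "GLA", 0.7),
        ("XX", "OTHER", 0.5),  # hypothetical site outside both views
    ],
)

# Each contribution is the slice of cpuLoad selected by its predicate.
db.execute(
    "CREATE VIEW ral AS "
    "SELECT * FROM cpuLoad WHERE country = 'UK' AND site = 'RAL'"
)
db.execute(
    "CREATE VIEW gla AS "
    "SELECT * FROM cpuLoad WHERE country = 'UK' AND site = 'GLA'"
)

ral_rows = db.execute("SELECT * FROM ral").fetchall()
uk_rows = db.execute(
    "SELECT * FROM ral UNION ALL SELECT * FROM gla ORDER BY site"
).fetchall()
print(ral_rows)  # [('UK', 'RAL', 0.3)]
print(uk_rows)   # [('UK', 'GLA', 0.7), ('UK', 'RAL', 0.3)]
```

The union of the two views gives the UK part of the table, and the row that matches neither predicate is left out.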

  19. Fabric Management

  20. Architecture logical overview
      [Diagram labels: Grid User; Local User; Resource Broker; Grid Info Services; Other services; Fabric mgt subsystems (Fabric Gridification, Resource Management, Monitoring & Fault Tolerance, Configuration Management, Installation & Node Mgmt); Farm A (LSF); Farm B (PBS); Data Mgmt; Grid Data Storage (mass storage, disk pools)]

  21. Architecture logical overview: Fabric Gridification
      • Interface between Grid-wide services and local fabric
      • Provides local authentication, authorization and mapping of grid credentials

  22. Architecture logical overview: Resource Management
      • Provides transparent access (both job and admin) to different cluster batch systems
      • Enhanced capabilities (extended scheduling policies, advanced reservation, local accounting)
      [Diagram as on slide 20, with Data Mgmt marked WP2]

  23. Architecture logical overview: Installation & Node Mgmt
      • Provides the tools to install and manage all software running on the fabric nodes
        • Agent to install, upgrade, remove and configure software packages on the nodes
        • Bootstrap services and software repositories
      [Diagram as on slide 20, with Grid Data Storage marked WP5]

  24. Architecture logical overview: Configuration Management
      • Provides a central storage and management of all fabric configuration information
        • Central DB and set of protocols and APIs to store and retrieve information

  25. Architecture logical overview: Monitoring & Fault Tolerance
      • Provides the tools for gathering monitoring information on fabric nodes
        • Central measurement repository stores all monitoring information
        • Fault tolerance correlation engines detect failures and trigger recovery actions
      [Diagram as on slide 20, with "Other services" relabelled "Other WPs" and "Fabric mgt subsystems" relabelled "WP4 subsystems"]

  26. User job management (Grid and local)
      [Diagram as on slide 20, without Configuration Management and Installation & Node Mgmt, and with Monitoring & Fault Tolerance labelled Monitoring]

  27. User job management (Grid and local)
      • Submit job

  28. User job management (Grid and local)
      • Publish resource and accounting information

  29. User job management (Grid and local)
      • Optimized selection of site

  30. User job management (Grid and local)
      • Authorize
      • Map grid to local credentials

  31. User job management (Grid and local)
      • Select an optimal batch queue and submit
      • Return job status and output
      [Diagram as on slide 26, with Data Mgmt marked WP2]
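
The "map grid to local credentials" step is commonly driven by a grid-mapfile, in which each line holds a quoted certificate subject followed by a local account name. The Python sketch below parses such a file and looks up a user. The certificate subjects come from the Computing Element slide; the local account names and the helper functions are hypothetical, and real deployments may use additional mapping mechanisms not shown here.

```python
import shlex

# Subjects are from the slides; the local account names are hypothetical.
GRIDMAP = '''\
# "certificate subject" local_account
"/O=Grid/O=UKHEP/OU=hepgrid.clrc.ac.uk/CN=Tim Eves" griduser01
"/O=Grid/O=UKHEP/OU=hepgrid.clrc.ac.uk/CN=Tim Folkes" griduser02
'''


def parse_gridmap(text):
    """Return {certificate subject: local account} from grid-mapfile text."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # shlex keeps the quoted subject together despite embedded spaces.
        subject, account = shlex.split(line)[:2]
        mapping[subject] = account
    return mapping


def map_user(mapping, subject):
    """Return the local account for a grid subject, or None if unmapped."""
    return mapping.get(subject)


gridmap = parse_gridmap(GRIDMAP)
print(map_user(gridmap, "/O=Grid/O=UKHEP/OU=hepgrid.clrc.ac.uk/CN=Tim Eves"))
print(map_user(gridmap, "/O=Grid/O=UKHEP/CN=Unknown User"))
```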

  32. Automated management of large clusters
      [Diagram labels: Other WPs; WP4 subsystems (Monitoring & Fault Tolerance, Resource Management, Configuration Management, Installation & Node Mgmt); Farm A (LSF); Farm B (PBS); arrows for Information and Invocation]

  33. Automated management of large clusters
      • Node malfunction detected

  34. Automated management of large clusters
      • Remove node from queue
      • Wait for running jobs(?)

  35. Automated management of large clusters
      • Update configuration templates

  36. Automated management of large clusters
      • Trigger repair

  37. Automated management of large clusters
      • Repair (e.g. restart, reboot, reconfigure, …)

  38. Automated management of large clusters
      • Node OK detected

  39. Automated management of large clusters
      • Put back node in queue

  40. Automated management of large clusters
      [Diagram as on slide 32, with an added label: Automation]
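
The recovery sequence walked through on slides 33 to 39 can be read as a small state machine. The Python sketch below is one possible reading, not the WP4 implementation: the state names, the `Node` class and its methods are invented, and it assumes that the node waits for running jobs to finish before repair starts, a point the slides themselves mark with a question mark.

```python
from enum import Enum


class State(Enum):
    IN_QUEUE = "in queue"
    DRAINING = "draining"
    REPAIRING = "repairing"


class Node:
    def __init__(self, name, running_jobs=0):
        self.name = name
        self.running_jobs = running_jobs
        self.state = State.IN_QUEUE
        self.log = []

    def on_malfunction(self):
        """Monitoring reports a fault: take the node out of the batch queue."""
        if self.state is State.IN_QUEUE:
            self.log.append("remove node from queue")
            self.state = State.DRAINING
            self._maybe_repair()

    def on_job_finished(self):
        self.running_jobs = max(0, self.running_jobs - 1)
        if self.state is State.DRAINING:
            self._maybe_repair()

    def _maybe_repair(self):
        # Assumption: repair starts only once no jobs are left on the node.
        if self.running_jobs == 0:
            self.log.append("update configuration templates")
            self.log.append("trigger repair")
            self.state = State.REPAIRING

    def on_node_ok(self):
        """Monitoring reports the node healthy again: return it to the queue."""
        if self.state is State.REPAIRING:
            self.log.append("node OK detected")
            self.log.append("put node back in queue")
            self.state = State.IN_QUEUE


node = Node("node001", running_jobs=1)  # hypothetical node name
node.on_malfunction()
node.on_job_finished()
node.on_node_ok()
for step in node.log:
    print(step)
```

Each event handler checks the current state first, so a repeated or out-of-order event (for example a second malfunction report during repair) is ignored rather than restarting the sequence.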

  41. LCFG (Local ConFiGuration system)
      • Widely used fabric tool whose purpose is to handle automated installation and configuration in a very diverse and evolving environment
      • Mechanism:
        • Abstract configuration parameters are stored in a central repository located on the LCFG server.
        • Scripts on the host machine (LCFG client) read these configuration parameters and either generate traditional configuration files or directly manipulate various services.
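
The mechanism described for LCFG, abstract parameters held centrally and turned into traditional configuration files on each client, can be pictured with a short Python sketch. Nothing here is LCFG's real syntax or API: the repository layout, the parameter names, the host name and the server names are all hypothetical, and an NTP configuration file is used only because its format is simple.

```python
# Hypothetical central repository: abstract parameters keyed by host.
REPOSITORY = {
    "node001": {
        "ntp.servers": ["ntp1.example.org", "ntp2.example.org"],
        "ntp.driftfile": "/var/lib/ntp/drift",
    },
}


def fetch_parameters(host):
    """Stand-in for the client reading its parameters from the server."""
    return REPOSITORY[host]


def render_ntp_conf(params):
    """Generate a traditional configuration file from abstract parameters."""
    lines = [f"server {server}" for server in params["ntp.servers"]]
    lines.append(f"driftfile {params['ntp.driftfile']}")
    return "\n".join(lines) + "\n"


conf_text = render_ntp_conf(fetch_parameters("node001"))
print(conf_text, end="")
```

Because the file is always regenerated from the central parameters, changing a value in the repository is enough to change it on every client that reads it.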

  42. Local Authorization: LCAS
      • The Local Centre Authorization Service (LCAS) handles authorization requests to the local computing fabric.
      • In this release the LCAS is a shared library, which is loaded dynamically by the Globus gatekeeper. The gatekeeper has been slightly modified for this purpose and will from now on be referred to as edg-gatekeeper.
      • The authorization decision of the LCAS is based upon the user's certificate and the job specification in RSL (JDL) format. The certificate and RSL are passed to (plug-in) authorization modules, which grant or deny access to the fabric. Three standard authorization modules are provided by default:
        • lcas_userallow.mod checks if the user is allowed on the fabric (currently the gridmap file is checked).
        • lcas_userban.mod checks if the user should be banned from the fabric.
        • lcas_timeslots.mod checks if the fabric is open at this time of day for datagrid jobs.

  43. Authentication control flow
      [Diagram: two gatekeeper flows shown side by side]
      • GLOBUS Gatekeeper: TLS auth, assist_gridmap, Jobmanager-*
      • EDG gatekeeper (GLOBUS + LCAS): TLS auth, LCAS (so), assist_gridmap, Jobmanager-*
      * And store in job repository

  44. Further Information
      • Information and Monitoring Services
        • http://hepunx.rl.ac.uk/edg/wp3/
      • Fabric Management
        • http://cern.ch/hep-proj-grid-fabric/
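
The plug-in structure described for LCAS can be sketched as a chain of checks that each receive the certificate subject and the job description. The real LCAS is a shared library loaded by the gatekeeper, so this Python version is only an analogy: the function names, the call signature, the opening hours and the rule that every module must grant access are assumptions made for illustration.

```python
from datetime import time


def user_allow(allowed_subjects):
    """Analogue of lcas_userallow.mod: subject must be on the allowed list."""
    def check(subject, rsl, now):
        return subject in allowed_subjects
    return check


def user_ban(banned_subjects):
    """Analogue of lcas_userban.mod: subject must not be on the ban list."""
    def check(subject, rsl, now):
        return subject not in banned_subjects
    return check


def timeslots(open_from, open_until):
    """Analogue of lcas_timeslots.mod: fabric must be open at this time."""
    def check(subject, rsl, now):
        return open_from <= now <= open_until
    return check


def authorize(modules, subject, rsl, now):
    """Assumption: access is granted only if every module grants it."""
    return all(module(subject, rsl, now) for module in modules)


EVES = "/O=Grid/O=UKHEP/OU=hepgrid.clrc.ac.uk/CN=Tim Eves"
FOLKES = "/O=Grid/O=UKHEP/OU=hepgrid.clrc.ac.uk/CN=Tim Folkes"

modules = [
    user_allow({EVES, FOLKES}),
    user_ban({FOLKES}),                # hypothetical ban, for illustration
    timeslots(time(8, 0), time(18, 0)),  # hypothetical opening hours
]
rsl = "&(executable=/bin/hostname)"

print(authorize(modules, EVES, rsl, time(12, 0)))    # True
print(authorize(modules, FOLKES, rsl, time(12, 0)))  # False: banned
print(authorize(modules, EVES, rsl, time(23, 0)))    # False: outside time slot
```

Keeping each rule in its own module means a site can add or drop a policy without touching the others, which is the point of the plug-in design.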
