M. Radecki, T. Szepieniec, M. Krakowian, T. Szymocha, M. Zdybek, D. Harezlak, and J. Andrzejewski (ACC CYFRONET AGH). Operations in PL-Grid. Cracow Grid Workshop, Cracow, 11.10.2010.
Outline: Goal of Grid Operations. PL-Grid services for users: user registration and account management (PL-Grid Portal), incident reporting, usage monitoring. PL-Grid services for the Polish NGI: service availability monitoring, grid usage accounting, issue tracking. High-level view of EGI, NGI and PL-Grid Operations. Incident Management in PL-Grid. Grid Infrastructure Monitoring. Operations Communication and Documentation.
Goal of PL-Grid Operations: coordinate and carry out the activities and processes required to provide and manage services for PL-Grid users; manage the technology required to provide and support these services.
PL-Grid infrastructure services. Services for users: access to computing power and storage space in the 5 largest Polish computing centers; scientific software (e.g. Gaussian, Fluent, Povray); a user account management system; facilities to report problems and service requests; a resource usage monitoring system; application portals and other tools for users (soon). As the Polish NGI, PL-Grid is obliged to provide some services interfaced to EGI: a service availability monitoring system; an issue tracking and user support system; an accounting (resource usage) system.
User account management. Motivation: the need to determine whether a user is entitled to use PL-Grid resources. The registration process confirms that a user is either a researcher affiliated with a Polish research unit or a ward (an undergraduate or PhD student authorized by a supervisor). Registration must be possible on-line. Implementation: the PL-Grid Portal, based on the Liferay engine. Successful registration results in a Portal account, the PL-Grid “entry point” for the user. The Portal is easily extended with new functionality using JSR 286 portlets and can re-use Liferay's rich component library (e.g. forum, wiki). PL-Grid specific features: easy access to a personal certificate (an X.509 certificate can be obtained on-line, with its scope limited to PL-Grid services only); user account data integrated with PL-Grid tools and services; the user login is used for services that allow login/password authentication/authorization; a broadcast tool to contact all users.
User account management – 1st year experiences. PL-Grid user registration opened at last year's CGW. The PL-Grid Portal technology changed from Java Spring through Google Web Toolkit to Liferay. Agreed formal process description documents proved indispensable: user registration is important for all PL-Grid computing centers, as is the security of the procedure. User statistics (as of 10.10.2010): 204 registered users, of which PL-Grid staff: 64, independent researchers: 56, wards: 84. [Chart: number of registered users, Jan – Oct 2010]
PL-Grid Scientific Software & Helpdesk. PL-Grid offers access to both commercial and free scientific applications: NAMD, ADF, Blender, CFour, CPMD, Dalton, Fluent, Gamess, Gaussian, Gromacs, NWChem, Povray, Turbomole. The availability and current status of the software are monitored and the results are fed to the incident management system, which gives higher availability for users. Users can check whether a program failed due to their own fault or a computing center problem. Issues with monitoring: the monitoring system was designed for site admins and its web interface is unacceptable for users; using the myEGI portal will be considered when it becomes available. PL-Grid Helpdesk allows reporting of issues, problems and service requests. Reports can be made by phone call, e-mail or the PL-Grid Helpdesk web interface; phone call reports are registered by an operator. Registering a report returns an incident identifier to the user, which allows the incident to be referred to and modified later on. An incident is transferred to the EGI level if the solution lies beyond the scope of the Polish NGI; it can still be managed via the PL-Grid Helpdesk.
Resource Usage Monitoring System. Motivation: PL-Grid grant accounting and daily data reports for users. In the first available prototype, users can track their resource usage: the daily status of jobs and the daily workload (CPU time, walltime) per computing center. Currently used in parallel with EGI accounting (APEL).
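The per-user daily report described on this slide can be pictured as a simple aggregation over job records. The sketch below is only an illustration: the `JobRecord` fields, the `daily_usage` function and the site names `SITE-A` and `SITE-B` are hypothetical and do not come from the PL-Grid prototype or from APEL.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class JobRecord:
    """One finished job (hypothetical record layout, not the PL-Grid schema)."""
    user: str
    site: str            # computing center that ran the job
    day: date            # day the job finished
    cpu_seconds: float
    wall_seconds: float


def daily_usage(records, user):
    """Sum CPU time, walltime and job count per (day, site) for one user."""
    totals = defaultdict(lambda: {"cpu_hours": 0.0, "wall_hours": 0.0, "jobs": 0})
    for record in records:
        if record.user != user:
            continue
        entry = totals[(record.day, record.site)]
        entry["cpu_hours"] += record.cpu_seconds / 3600
        entry["wall_hours"] += record.wall_seconds / 3600
        entry["jobs"] += 1
    return dict(totals)


# Made-up sample data for illustration only.
records = [
    JobRecord("alice", "SITE-A", date(2010, 10, 1), 3600, 7200),
    JobRecord("alice", "SITE-A", date(2010, 10, 1), 3600, 3600),
    JobRecord("alice", "SITE-B", date(2010, 10, 1), 7200, 7200),
    JobRecord("bob", "SITE-A", date(2010, 10, 1), 1800, 1800),
]
usage = daily_usage(records, "alice")
```

Keying the totals by `(day, site)` mirrors the "per computing center" breakdown on the slide; a real system would presumably read the records from batch system logs rather than from an in-memory list.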
EGI, NGI & PL-Grid Operations – high level view. [Diagram: Operations Support Teams (EGI: Central Operator on Duty; NGI: Regional Operator on Duty, Regional Technical Support, Site Administrators) use Operations Support Tools (EGI Operations Dashboard, GGUS, PL-Grid Helpdesk, Monitoring); the tools are connected through web services (WebSvc) and JMS.]
PL-Grid Operations: Incident Management. “The main objective of incident management process is to resume regular state of affairs as quickly as possible and minimize the impact on business processes.” (Service Operation, based on ITIL(R) V3.) Identification: incidents are triggered by the monitoring system, users or technical staff. Registration: in the issue tracking system (a PL-Grid adapted Request Tracker); an incident reported by a user or staff is always registered, while problems reported by the monitoring system are registered only if they are long-standing (>24h). Classification: regular middleware services / PL-Grid applications. Escalation: experts are responsible for making sure the problem is solved or reassigned; incidents can be escalated to EGI for software problems. Solution applied & tested => issue closed: the administrator of the failed resource applies the solution and triggers execution of the monitoring system probes; check whether the user is satisfied; if all is OK, close the incident.
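The registration rule on this slide (user and staff reports are always registered, monitoring reports only after more than 24 hours) can be written down as a small predicate. This is a minimal sketch under assumed names: `Source`, `Report` and `should_register` are hypothetical and are not part of Request Tracker or of the PL-Grid tools.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class Source(Enum):
    USER = "user"
    STAFF = "staff"
    MONITORING = "monitoring"


# Threshold taken from the slide: only long-standing (>24h) monitoring problems.
MONITORING_THRESHOLD = timedelta(hours=24)


@dataclass
class Report:
    """A problem report before it becomes a registered incident (hypothetical)."""
    source: Source
    first_seen: datetime
    summary: str


def should_register(report, now):
    """Return True if the report should be registered in the issue tracker."""
    if report.source in (Source.USER, Source.STAFF):
        return True  # reports from people are always registered
    # Monitoring alarms are registered only once they have lasted over 24h.
    return now - report.first_seen > MONITORING_THRESHOLD


now = datetime(2010, 10, 11, 12, 0)
fresh_alarm = Report(Source.MONITORING, datetime(2010, 10, 11, 9, 0), "probe failed")
old_alarm = Report(Source.MONITORING, datetime(2010, 10, 10, 9, 0), "probe failed")
user_report = Report(Source.USER, datetime(2010, 10, 11, 11, 55), "job aborted")
```

Here `should_register(fresh_alarm, now)` is false because the alarm is three hours old, while the 27-hour-old alarm and the user report are both registered.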
Incident Management – PL-Grid experience. Pro-active troubleshooting procedures apply in the first 24h to incidents reported by the monitoring system, involving Regional Technical Support. The incident solution process can be a useful source of knowledge, so PL-Grid introduced the Operational Problems Knowledge Base: the Regional Technical Support team creates entries; the data is re-used when a similar problem occurs again; it is publicly available, with web pages indexed by search engines; each entry contains the full error message and a detailed solution procedure, so in case of problems users can paste the error message into Google Search. KB population started in Aug 2009 and there are ~50 entries. Knowledge base link: https://weblog.plgrid.pl/category/1st-line-support/ Incident Management Metrics evaluate performance: they are quantitative (e.g. number of incidents, individual submitters, GGUS share) and focused on team response time. Issues: team reaction time metrics indicate room for improvement, so incident handling procedures need to be promoted among supporters/experts; the Knowledge Base requires an initial investment, but the more entries it has, the more it pays off.
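A team response time metric of the kind mentioned on this slide could be computed as in the following sketch. The slide does not say which statistic PL-Grid uses; the median is an assumption here, and the function name, tuple layout and team labels are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def median_response_hours(incidents):
    """Median time from registration to first response, in hours, per team.

    `incidents` is an iterable of (team, registered_at, first_response_at)
    tuples; this layout is an assumption made for the example.
    """
    by_team = defaultdict(list)
    for team, registered_at, first_response_at in incidents:
        hours = (first_response_at - registered_at).total_seconds() / 3600
        by_team[team].append(hours)
    return {team: median(values) for team, values in by_team.items()}


# Made-up sample incidents for illustration only.
incidents = [
    ("RTS", datetime(2010, 10, 1, 9, 0), datetime(2010, 10, 1, 10, 0)),
    ("RTS", datetime(2010, 10, 2, 9, 0), datetime(2010, 10, 2, 12, 0)),
    ("site-admins", datetime(2010, 10, 3, 8, 0), datetime(2010, 10, 3, 14, 0)),
]
metrics = median_response_hours(incidents)
```

The median is less sensitive than the mean to a single incident that sat unanswered over a weekend, which is one reason to prefer it for a reaction time indicator.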
Grid Infrastructure Monitoring System. Motivation: it is not acceptable to wait for a user to report a service problem. The PL-Grid monitoring system is an extended version of the EGI Nagios-based system for monitoring grid service availability. PL-Grid extensions: monitoring of PL-Grid scientific software; probes for the availability of the PL-Grid VO (vo.plgrid.pl); other middleware services (being integrated). Alarms are sent to the EGI message bus (based on the ActiveMQ JMS implementation) and then displayed in the EGI Operations Dashboard (incl. PL-Grid extensions). Issues: core services are poorly monitored or not monitored at all; the monitoring system only triggers incidents, and it would be useful to be able to monitor trends and predict failures; there is no control system, as services do not have a management interface (a software maturity issue).
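Nagios plugins report their result through an exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and a one-line status message. The sketch below shows what a very simple software availability probe following that convention might look like. It is an assumption-laden illustration, not the PL-Grid probe: it only checks that an executable is on the PATH, whereas a real probe would presumably run a test job, and the names `check_software`, `format_result` and `main` are hypothetical.

```python
import shutil

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_software(executable, which=shutil.which):
    """Return (status_code, message) for one application.

    `which` is injectable so the check can be exercised without the
    software being installed.
    """
    path = which(executable)
    if path is None:
        return CRITICAL, f"{executable} not found on PATH"
    return OK, f"{executable} found at {path}"


def format_result(code, message):
    """Build the single status line that Nagios displays."""
    return f"{STATUS_NAMES[code]} - {message}"


def main(argv):
    """Run the probe; the caller passes the return value to sys.exit()."""
    if len(argv) != 2:
        print(format_result(UNKNOWN, "usage: probe.py EXECUTABLE"))
        return UNKNOWN
    code, message = check_software(argv[1])
    print(format_result(code, message))
    return code
```

A wrapper script would call `sys.exit(main(sys.argv))` so that Nagios sees the exit code; forwarding the resulting alarm to a message bus is outside the scope of this sketch.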
Operations Communication & Documentation. The PL-Grid Operations Center is distributed and its resources are located in geographically distant centers, which requires means of communication other than face-to-face. Solving an operational problem requires interactive communication (better than e-mail). Coordination of distributed teams requires procedures, work descriptions and handovers. PL-Grid uses: bi-weekly teleconferences where operations issues can be discussed; a Jabber service with an automatically generated contact list of all registered PL-Grid staff; daily handover reports and a quarterly summary filled in by RTS (Regional Technical Support). Operational Documentation: Incident Handling in PL-Grid Helpdesk, https://weblog.plgrid.pl/procedura-obslugi-helpdesku/ ; Operational Procedures for ROD, RTS and site admins, https://weblog.plgrid.pl/procedury-operacyjne-pl-grid/