AsiaPacific Regional Operation Center. Min-Hong Tsai ASGC ISGC May 3, Academia Sinica http://www.twgrid.org/aproc/. Agenda. Introduction ROC Status Recent Activities. APROC Introduction. APROC Goal Provide deployment support facilitating Grid expansion
AsiaPacific Regional Operation Center Min-Hong Tsai ASGC ISGC May 3, Academia Sinica http://www.twgrid.org/aproc/
Agenda • Introduction • ROC Status • Recent Activities
APROC Introduction • APROC Goal • Provide deployment support facilitating Grid expansion • Maximize the availability of Grid services • Supports EGEE sites in Asia Pacific since April 2005 • EGEE CIC • CIC-on-duty rotation: EGEE global operations • Monitoring tool development: GStat and GGUS Search • VO services • EGEE ROC • Monitoring, diagnosis and problem tracking • M/W release deployment support • Security coordination • Site registration • Portal and documentation
ASGCCA • Production service since July 2003 • Taiwan • LCG/EGEE users in Asia Pacific without local CA • Member of both • EUGridPMA • APGridPMA • http://ca.grid.sinica.edu.tw
VO Infrastructure Support • APROC hosts centralized services for VOs • Host VOMS server • VO assigns manager to maintain membership • VO supplies AUP • Host LFC global file catalogue service • Resource Broker • Top-Level BDII • Currently supporting • TWGrid • APeSci
EGEE Site Registration and Certification • Registration Procedure: • http://www.twgrid.org/aproc/doc/admin_intro/newrc/ • Guidance for user and host certificate registration • Registration into GOCDB • Recommend startup documentation • Instructions for further registration in • Mailing lists • VO membership • APROC ticketing system • Consulting on site architecture and deployment • Deployment support and troubleshooting • Site certification • Manual tests • SFT and GStat tests
Middleware and Operations Support • Middleware Support • Installation support • New release testing • Supplementary release notes • Assist in coordination of updates and upgrades • Operations Support • Review and track GGUS and APROC tickets • Monitor and detect new problems • Provide detailed technical support to sites • Support Channels • Phone • Email • TRS Ticketing System
APROC Portal • www.twgrid.org/aproc • Rollout Highlights • Supplemental documentation • Getting started links • Registration information • Contact Info and TRS links • lists.grid.sinica.edu.tw/apwiki • Supplementary release notes • Site Operations Procedures • Technical Howtos • Trouble Shooting FAQs • APF and GDA meeting minutes • Feel free to contribute!
Agenda • Introduction • ROC Status • Recent Activities
Members and Biweekly Meeting • 11 sites, 7 countries, ~600 CPUs • Australia • Japan • India • Korea • Pakistan • Singapore • Taiwan • APF Meetings • Short biweekly meeting between AP sites • Topics • Operation: M/W issues, operations news, review site status • Service challenge: news and announcements • Other topics are welcome, such as BELLE or other regional topics
Site Registration • Recently: • JP-KEK-CRC-01 • In progress: • Australia-UNIMELB-LCG2 • JP-KEK-CRC-02 • TW-THU-HPC • PAKGRID3-LCG2 • Welcomed sites from CERN ROC: • INDIACMS-TIFR • NCP-LCG2 • PAKGRID-LCG2
APROC Usage I • Total computing capacity is increasing • But so is utilization (peak over 80%)
APROC Usage II • Jobs predominantly from Biomed, CMS and ATLAS VOs • Past year: 41 KSI2K Years • This April: 21 KSI2K Years
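The usage figures above are quoted in KSI2K Years. The deck does not say how they were computed; the following is a minimal sketch, assuming a KSI2K Year means one year of CPU time on a processor rated at 1000 SPECint2000. The function name and the ratings in the example are hypothetical and are not taken from the APROC accounting data.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: a non-leap year of wall-clock hours


def ksi2k_years(cpu_hours: float, si2k_per_cpu: float) -> float:
    """Convert raw CPU hours into KSI2K Years.

    cpu_hours    -- CPU time consumed, in hours
    si2k_per_cpu -- SPECint2000 rating of the CPU that did the work
    """
    if cpu_hours < 0 or si2k_per_cpu <= 0:
        raise ValueError("cpu_hours must be >= 0 and si2k_per_cpu must be > 0")
    # Scale hours by the rating in units of 1000 SI2K, then express in years.
    return cpu_hours * (si2k_per_cpu / 1000.0) / HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical example: two CPUs rated 1500 SI2K, each busy for a full year.
    print(ksi2k_years(cpu_hours=2 * HOURS_PER_YEAR, si2k_per_cpu=1500))
```

Normalizing by the rating is what makes usage comparable across sites with different hardware; raw CPU hours alone would not be.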
APROC Availability I • Ideal Grid World: May 3, 2006
APROC Availability II • Daily snapshots of SFT results of each site • Availability of 60-70% • Better if weighted by number of CPUs • CT mostly replica management failure • Sensitive to Information System performance • Network Issues • Network congestion and packet loss • APROC SmokePing to monitor net performance • But monitoring from CERN is more relevant • Scheduled Downtime • Network and power maintenance • Hardware maintenance and upgrade • Middleware upgrade • [Chart annotations: Decommissioned, Slow BDII, 2.4, 2.6, 2.7]
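The remark that availability looks better when weighted by CPU count can be illustrated with a small calculation. This is a sketch only: the site names, CPU counts and pass/fail results below are invented for illustration and are not the APROC SFT data.

```python
def unweighted_availability(sites):
    """Fraction of sites whose daily SFT snapshot passed."""
    if not sites:
        raise ValueError("no sites given")
    return sum(1 for s in sites if s["passed"]) / len(sites)


def cpu_weighted_availability(sites):
    """Fraction of CPUs that sit at sites whose SFT snapshot passed."""
    total_cpus = sum(s["cpus"] for s in sites)
    if total_cpus == 0:
        raise ValueError("total CPU count is zero")
    return sum(s["cpus"] for s in sites if s["passed"]) / total_cpus


if __name__ == "__main__":
    # Hypothetical snapshot: large sites pass, small sites fail.
    snapshot = [
        {"name": "site-a", "cpus": 400, "passed": True},
        {"name": "site-b", "cpus": 100, "passed": True},
        {"name": "site-c", "cpus": 60, "passed": False},
        {"name": "site-d", "cpus": 40, "passed": False},
    ]
    print(unweighted_availability(snapshot))
    print(cpu_weighted_availability(snapshot))
```

With these invented numbers, half of the sites pass but they hold five sixths of the CPUs, so the weighted figure is higher. The weighted value improves only when the larger sites are the more reliable ones.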
Support Issues and Tickets • Remote troubleshooting • Email interaction is slow • Remote testing is limited • Reluctantly ask for access to services • Local diagnostic tools would be helpful
ASGCCA Status • Improvements • Overhaul of certificate registration instructions and application forms • Step-by-step guide for browser certificate management • Addition of FAQ sections to address common tasks • In progress • Certificate import error related to Firefox 1.5 • Design and implement new RA procedures • Revise and update CP/CPS
Agenda • Introduction • ROC Status • Recent Activities
Security Service Challenge 1 • Purpose: to ensure that • Sufficient information is available for audit trace (for IR) • Appropriate communication channels are available • Security Challenge with OSCT • Sending test jobs • Sites recover evidence • DN of job submitter • IP address of submission UI • Executable name • Time when executable ran • Results • Completed March 2006 for a period of one week • Instructions and audit guide sent to participating sites • 4 of 7 APROC sites completed challenge • Some sites could not participate due to scheduled downtime (SD) or unavailability • Some results were incomplete since sites did not have a Resource Broker (RB) • Sites need to contact RB admin for more information • Helpful learning exercise to familiarize security contacts with the auditing process for LCG • Improvements • Sharing of audit techniques between ROCs (GOCWiki) • Tools to extract security audit information • Helpful for future SSC to measure security patch response time
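The four evidence items a site is asked to recover can be treated as a simple record with a completeness check. This is a hypothetical sketch: the class, field names and sample values are illustrative and do not correspond to any LCG log format or to the actual challenge tooling.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class AuditEvidence:
    """Evidence items listed for the challenge; None means not recovered."""
    submitter_dn: Optional[str] = None      # DN of job submitter
    ui_ip_address: Optional[str] = None     # IP address of submission UI
    executable_name: Optional[str] = None   # Executable name
    run_time: Optional[str] = None          # Time when executable ran


def missing_items(evidence: AuditEvidence) -> List[str]:
    """Return the names of evidence fields that are still empty."""
    return [f.name for f in fields(evidence)
            if getattr(evidence, f.name) in (None, "")]


if __name__ == "__main__":
    # Hypothetical incomplete result: one item could not be recovered locally.
    partial = AuditEvidence(
        submitter_dn="/C=TW/O=Example/CN=Test User",
        executable_name="test-job.sh",
        run_time="2006-03-15T10:30:00Z",
    )
    print(missing_items(partial))
```

A check like this makes an incomplete result explicit, so the site knows which item still has to be requested from another party such as the RB admin.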
Pre-Production Service • APROC started PPS service in April 2006 • Previously managed by Application team • PPS deployment with glite-3.0 RC2 complete • Mix of LCG and gLite components • LCG-CE • gLite-CE • MON • combined UI • Integration of production SE and SRM services • FTS still needs to be deployed • Summary • Good way to get experience with gLite middleware • Using YAIM is a very good transition for ROC staff • LCG components are more stable than gLite counterparts • Required significant support from CERN for gLite-CE • Integration with lcg-CE batch system was not trivial • Still troubleshooting • Need significant time to relearn administration and troubleshooting techniques • Administration documentation like that accumulated for LCG in GOCWiki would be helpful
Grid Administrator Tutorial I • Goal and details • Educate and train EGEE Site Administrators • Two-day tutorial with instruction in Chinese • Hosted at Academia Sinica in March 2006 • Topics covered • Grid technology and components • Operations, administration and troubleshooting • Brief overview of Grid applications • Hands-on session to deploy functional sites • 36 Xen servers configured • Simple CA, RB, BDII, VOMS, LFC provided • 5 teams of 6 participants deployed sites (UI, MON, CE, WN, DPM-head, DPM-disk) • Based on Marco La Rosa’s KEK tutorial • http://lists.grid.sinica.edu.tw/apwiki/Grid_Administrator_Tutorial_Hands-on_Instructions
Grid Administrator Tutorial II • Results • 30 participants from 15 institutes • 4.18/5.0 survey evaluation scores • Only a couple of teams were able to complete a fully functional site • Not enough time • Set up YAIM configuration from scratch • Time consuming and error prone • More realistic and gives participants a chance to troubleshoot • Feedback • Break up hands-on session to practice after each lecture • Provide a reference cheat sheet • Acronyms • Grid architecture diagrams • Suggest Linux training material as prerequisite • Provide user and developer tutorials • Significant time to set up hands-on session servers for installation • Is this available in GLIDA?
GStat Development • Instances created for • PPS Service • Regional projects • BalticGrid, EELA, EUChinaGrid, etc. • Usage calculations modified • PhysicalCPU • SizeTotal, SizeFree • Results published • To Service Availability Monitoring Environment (SAME) at CERN • Client tool to retrieve historical data • http://goc.grid.sinica.edu.tw/gocwiki/GStat_Client_Tools
Summary • People: • Jinny Chien • Shu-Ting Liao • Jason Shih • Howard Su • Jeng-Hsueh Wu • Joanna Huang • Aries Hong • Hung-Che Jen • Min Tsai • APROC provides EGEE operations support services to AsiaPacific • There is significant room for improvement in availability • Middleware is becoming more reliable • Network monitoring is critical • Operations procedures to reduce Scheduled Downtime and improve time to recover • Diagnostic tools will be helpful for troubleshooting • Would there be interest in another Administration Tutorial in late Summer or Autumn? • If there is a significant increase in deployment in AsiaPacific, one ROC may not be scalable • Federation of APROC is another option • Please give us feedback on what we can improve • Contact us: • roc@lists.grid.sinica.edu.tw • http://www.twgrid.org/aproc • http://lists.grid.sinica.edu.tw/apwiki