Operating Central European EGEE ROC Marcin Radecki, Tomasz Szepieniec, Aleksander Kusznir and Marian Bubak ACC CYFRONET AGH
Outline
• Introduction
• EGEE and Central European (CE) Region
• Challenges for CE Regional Operating Centre
• Applications & Users
• Cooperation
• Grid Infrastructure
• Conclusions
CGW’06; Cracow; 15-18th October 2006
EGEE – Community
• Possibly the largest production infrastructure; spans over 32 countries
• ca. 200 sites grouped under 11 ROCs
• Scientific community of over 2000 people
• EGEE’06 conference in Geneva
  • 700 attendees
  • 32 "partner" projects present
Central European Region in EGEE
• 7 countries, 22 sites, 1493 CPUs, 70 TB of storage space
• Supports 10 of 11 EGEE-approved VOs plus many associated VOs
• Site sizes range from 2-3 to 300 CPUs
• Need for solutions suitable for both large computing centres and small sites
  • Maintenance model
  • Skills & experience
  • Scalable across a site’s resources
Challenges for CE ROC
• We need to attract new users to the Grid and enable them to work in the new environment so that the resources are used efficiently. Provide the services the users require.
• The Grid spans many administrative domains, each of which needs to cooperate actively in order to share resources and collaborate productively. An excellent opportunity for expertise sharing.
• Having resources is not enough; the infrastructure needs to be stable before real users start to use it, and we should maximize utilization as far as possible.
Grid-enabling users
• Means to gain and retain users
  • Understand users’ needs and satisfy them
  • Easy access, how-to-use documentation (in national languages)
  • Stable working environment
  • User Support infrastructure
• Results:
  • Computational chemistry
    • Mariusz Sterzel (CYFRONET) coordinates computational chemistry applications in EGEE
    • Enabling commercial software - Gaussian VO
    • Study on pyrazoloquinolines (PQ) used for laser light generation
  • Bioinformatics
    • Never Born Protein folding and function recognition - Prof. Irena Roterman’s team (CM-UJ)
  • Others:
    • Many small teams are working within the regional catch-all VO - VOCE
VOs in the Region
• List of supported VOs: alice, atlas, auger, balticgrid, belle, biomed, cms, compass, compchem, crogrid, esr, euchina, gamess, gaussian, geant4, gear, geclipse, hone, hungrid, lhcb, magic, ops, skgrid, voce, vocet, zeus
• Service/Data Challenges and test productions
  • ATLAS Service Challenge 4
  • World-wide In Silico Docking On Malaria: 1st and 2nd (ongoing) data challenges
  • EGEE-ITU
    • International digital broadcasting agreement - new frequency plan
    • Compatibility and complementary analysis
Management of CE ROC
[Organization chart: ROC Manager; User Support Responsible; Operations Responsible; Security Responsible; 1st Line Support; Core Grid Services; Regional Certification of Middleware; Grid Operator On Duty; Pre-Production Service]
• ROC Manager
  • Represents the region in the Project’s managerial bodies
  • Supervises all Service Activities
• Operations
  • Coordinates actions related to infrastructure and middleware
  • Escalates unresolved problems one level higher
  • Fits the Project requirements to the region
• User Support
  • Provides support tools for users
  • Takes part in shifts handling all user tickets in the GGUS system
• Security
  • Incident handling procedures
  • Incident response team
Procedures and Commitments
• Well-defined procedures make collaboration more efficient
  • Clear paths for how we deal with things, to avoid misunderstandings
  • Newcomers are always arriving
  • People tend to forget things over time
• Example procedures:
  • New site registration
  • New site admin joining
  • Site problem handling
  • Sending Weekly Reports
• Monitoring commitments makes people more motivated
Operations - coordinate the work
• Operations is the most time-consuming task
  • To make sure that operational procedures are understood and followed properly
  • To ensure production requirements are met at the sites
  • To work out the best solutions for problems
  • To understand expectations/needs
  • To make sure problems are being solved in a proper way
  • To ensure weekly reports are completed and sent
• Three styles of site administration observed
  • Keep all services ready all the time - "I’m the best admin in the city"
  • React only when a problem report arrives - "I’m a bit occupied"
  • React only if my name appears on a publicly available "black list" - "I’m hard-working on… something important"
Resources and their usage
[Chart legend: Max. CPUs, Jobs Executing, Jobs Queued; annotation: "Avoid low usage periods"]
• Accounting in EGEE
  • July-October ’06: over 672k CPU hours computed in the CE region, equivalent to 275 CPUs running 24x7
  • Problems with "missing" data
  • Update rate: daily
• Our approach to accounting
  • Site performance efficiency study:
    - Up-to-date information on what is going on at a site
    - Maximize site utilization
  • Better to have jobs queued at a site than idle CPUs
  • Being extended towards a new system for fine-grained accounting
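The "275 CPUs running 24x7" equivalence can be checked with simple arithmetic. The sketch below is only an illustration: the slide gives the period as "July-October ’06" without exact dates, so the 102-day window (1 July to roughly 10 October 2006) is an assumption, and the function name `equivalent_cpus` is hypothetical.

```python
def equivalent_cpus(cpu_hours: float, days: float) -> float:
    """Number of CPUs that, running 24x7 for `days`, would deliver `cpu_hours`."""
    return cpu_hours / (24 * days)


# 672k CPU hours is the figure from the slide; 102 days is an ASSUMED window.
cpus = equivalent_cpus(672_000, 102)
print(f"{cpus:.1f}")  # prints 274.5
```

Under that assumed window the result rounds to the 275 CPUs quoted on the slide; a different window length would shift the figure proportionally.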
Stable infrastructure - social aspect
• How EGEE keeps the Grid stable
  • Grid Operator on Duty (GOD) watches the entire grid
    • CE joined this activity in the first round in EGEE-II
  • Raise a ticket for each detected problem
  • Problem diagnosis and solution suggestion
  • Use monitoring tools for problem detection and availability metrics
• 1st Line Support in CE - how to be better than average?
  • Detect and fix failures before the GOD Team notices them and raises a ticket
  • Support site admins with remedial actions
  • Suggest known, well-working practices (expertise sharing)
    • Sharing knowledge is painful: although it saves a lot of time at work, people need a lot of encouragement to do it
Stable infrastructure - monitoring with NAGIOS
• Try to monitor as much functionality as possible
  • E.g. certificate expiration dates on all machines
• Reasonable probe frequency
  • Send a problem notification immediately, but…
  • Do not spam every 5 minutes
• Allow the site admin to indicate that the problem is being worked on
  • Do not send notifications until notified
• Allow the site admin to schedule an extra check at will
  • So the admin can see at once how well the workaround is working
• Smart testing hierarchy
• Monitors CE Core Services
  • Added tests for checking RB, BDII, LFC, VOMS
• Used by 1st line support
  • Overview of the region
  • Detailed check of services
  • Schedule checks when working on fixes
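The notification rules on this slide can be sketched as a small decision function. This is a minimal illustration in Python, not Nagios configuration and not the CE ROC setup: the class name `NotificationPolicy`, the 60-minute re-notification interval, and the reading of an acknowledgement as "stay silent until the service recovers" are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NotificationPolicy:
    """Decides whether a failing check should trigger a notification."""
    renotify_minutes: float = 60.0     # ASSUMED interval; avoids mail every 5 minutes
    last_sent: Optional[float] = None  # time (in minutes) of the last notification
    acknowledged: bool = False         # admin has said "this is being worked on"

    def acknowledge(self) -> None:
        """Site admin signals that the problem is being worked on."""
        self.acknowledged = True

    def on_check(self, now: float, ok: bool) -> bool:
        """Process one probe result at time `now` (minutes); return True to notify."""
        if ok:
            # Recovery: reset state so the next failure notifies immediately.
            self.last_sent = None
            self.acknowledged = False
            return False
        if self.acknowledged:
            return False  # stay silent while the admin is working on it
        if self.last_sent is None or now - self.last_sent >= self.renotify_minutes:
            self.last_sent = now
            return True
        return False


policy = NotificationPolicy()
# Probe every 5 minutes for two hours; the service is failing throughout.
sent = [t for t in range(0, 125, 5) if policy.on_check(t, ok=False)]
print(sent)  # prints [0, 60, 120]
```

The first failure notifies at once, repeats are throttled, and an acknowledgement silences further mail until recovery. Nagios offers comparable mechanisms of its own (notification intervals and problem acknowledgements), which would normally be used instead of custom code.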
Operations metrics results
EGEE Operations metrics results from the last 10 months
Data from EGEE CIC portal: https://egee.in2p3.fr/CIC/index.php?id=cic&subid=cic_roc_metrics&scope=project&project=&metrics=sft
Conclusions
• CYFRONET gained know-how in:
  • Coordination of a large initiative
  • Organization of work for different subtasks
  • Running a stable production infrastructure
  • Accurate Grid job accounting
  • Sensible and precise Grid infrastructure monitoring
  • Facilitating the introduction of application users to the Grid
• Experience gathered in CE ROC may easily be re-used in building a national Polish grid
PL-Grid: a nationwide grid infrastructure
Team of the Academic Computer Centre CYFRONET AGH
Kraków, June-September 2006

This document presents the motivation, goals, concept and approach for creating a national grid infrastructure that is essential for modern scientific research (e-Science) and consistent with the European infrastructure.

PL-Grid as an infrastructure for e-Science
Conducting scientific research today requires advanced information technologies. A growing number of research teams collaborate intensively, and this calls for IT tools that make it possible to collect and exchange the acquired knowledge on a global scale. Experimental results are huge, distributed data sets of diverse structure, and working with them requires tools for data access, integration and processing. Computer simulation is a fully accepted research method, and results obtained from simulations and experiments are increasingly combined. This innovative approach is most visible in high-energy physics, astrophysics, the biological and medical sciences, and the Earth sciences. Realizing this new paradigm of scientific research, called e-Science, requires a grid infrastructure (also called Cyber-Science Infrastructure) comprising software that enables the sharing of various computing resources and tools that support cooperation among partners within so-called virtual organizations.

Fig. 1. PL-Grid as an infrastructure for e-Science
Organizational structure of PL-Grid
[Diagram labels: Reports; Information; Consortium Management Board (Coordinator + members); Recommendations; Proposals; Users’ Council; Consortium Council; Coordination; Domain grids; Evaluation; PL-Grid Operations Centre; Infrastructure (hardware, network)]