
Operating Central European EGEE ROC

Operating Central European EGEE ROC. Marcin Radecki, Tomasz Szepieniec, Aleksander Kusznir and Marian Bubak, ACC CYFRONET AGH. Outline: Introduction; EGEE and Central European (CE) Region; Challenges for CE Regional Operating Centre; Applications & Users; Cooperation


Presentation Transcript


  1. Operating Central European EGEE ROC
     Marcin Radecki, Tomasz Szepieniec, Aleksander Kusznir and Marian Bubak
     ACC CYFRONET AGH

  2. Outline
     • Introduction
     • EGEE and Central European (CE) Region
     • Challenges for CE Regional Operating Centre
     • Applications & Users
     • Cooperation
     • Grid Infrastructure
     • Conclusions
     CGW’06; Cracow; 15-18 October 2006

  3. EGEE: Community
     • Possibly the largest production infrastructure, spanning 32 countries
     • ca. 200 sites grouped under 11 ROCs
     • Scientific community of over 2000 people
     • EGEE’06 conference in Geneva
        • 700 attendees
        • 32 "partner" projects present

  4. Central European Region in EGEE
     • 7 countries, 22 sites, 1493 CPUs, 70 TB of storage space
     • Supports 10 of the 11 EGEE-approved VOs plus many associated VOs
     • Site sizes range from 2-3 to 300 CPUs
     • Need for solutions suitable for both large computing centres and small sites
        • Maintenance model
        • Skills & experience
        • Scalable across a site’s resources

  5. Challenges for CE ROC
     • We need to attract new users to the grid and make it possible for them to work in the new environment, so that the resources are used efficiently. Provide the services the users require.
     • The grid spans many administrative domains, each of which needs to cooperate actively in order to share resources and collaborate productively. An excellent opportunity for sharing expertise.
     • Having resources is not enough; the infrastructure needs to be stable before real users start to use it, and we should maximize utilization as far as possible.

  6. Grid-enabling users
     • Means to gain users and keep them with us
        • Understand users’ needs and satisfy them
        • Easy access, how-to-use documentation (in national languages)
        • Stable working environment
        • User Support infrastructure
     • Results:
        • Computational chemistry
           • Mariusz Sterzel (CYFRONET) coordinates computational chemistry applications in EGEE
           • Enabling commercial software: Gaussian VO
           • Study on pyrazoloquinolines (PQ) used for laser light generation
        • Bioinformatics
           • Never Born Protein folding and function recognition: Prof. Irena Roterman's team (CM-UJ)
        • Others:
           • Many small teams are working within the regional catch-all VO, VOCE

  7. VOs in the Region
     • Supported VOs: alice, atlas, auger, balticgrid, belle, biomed, cms, compass, compchem, crogrid, esr, euchina, gamess, gaussian, geant4, gear, geclipse, hone, hungrid, lhcb, magic, ops, skgrid, voce, vocet, zeus
     • Service/Data Challenges and test productions
        • Atlas Service Challenge 4
        • World-wide In Silico Docking On Malaria data challenges, 1st and 2nd (ongoing)
        • EGEE-ITU
           • International digital broadcasting agreement: new frequency plan
           • Compatibility and complementary analysis

  8. Management of CE ROC
     [Organization chart: ROC Manager; User Support Responsible; Operations Responsible; Security Responsible; 1st Line Support; Core Grid Services; Regional Certification of Middleware; Grid Operator On Duty; Pre-Production Service]
     • ROC Manager
        • Represents the region at the level of the Project managerial bodies
        • Supervises all Service Activities
     • Operations
        • Coordinates actions related to infrastructure and middleware
        • Escalates problems that cannot be solved in the region one level higher
        • Fits the Project requirements to the region
     • User Support
        • Provides support tools for users
        • Takes part in shifts handling all user tickets in the GGUS system
     • Security
        • Incident handling procedures
        • Incident response team

  9. Procedures and Commitments
     • Well-defined procedures make collaboration more efficient
        • Clear paths for how we deal with things, to avoid misunderstandings
        • Newcomers are always around
        • People tend to forget things over time
     • Example procedures:
        • New site registration
        • New site admin joining
        • Site problem handling
        • Sending weekly reports
     • Monitoring commitments makes people more motivated

  10. Operations: coordinating the work
     • Operations is the most time-consuming task
        • To make sure that operational procedures are understood and followed properly
        • To ensure production requirements are met at the sites
        • To work out the best solutions for problems
        • To understand expectations and needs
        • To make sure problems are being solved in a proper way
        • To ensure weekly reports are completed and sent
     • Three styles of site administration observed
        • Keep all services ready all the time: "I’m the best admin in the city"
        • React only when a problem report arrives: "I’m a bit occupied"
        • React only if my name appears on a publicly available "black list": "I’m working hard on… something important"

  11. Resources and their usage
     [Chart: Max. CPUs, Jobs Executing, Jobs Queued; annotation: avoid low usage periods]
     • Accounting in EGEE
        • July-October ’06: over 672k CPU hours computed in the CE region, equivalent to 275 CPUs running 24x7
        • Problems with "missing" data
        • Update rate: daily
     • Our approach to accounting
        • Site performance efficiency study:
           - Up-to-date information on what is going on at a site
           - Maximize site utilization
        • Better to have jobs queued at a site than idle CPUs
        • Is being extended towards a new system for fine-grained accounting
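
The CPU-hour equivalence quoted on this slide can be checked with simple arithmetic. The following is a minimal sketch in Python: the 672k CPU-hour and 275-CPU figures come from the slide, while the function names are illustrative and the length of the reporting period is derived here, not stated in the source. The `utilization` helper relates the chart's "Jobs Executing" and "Max. CPUs" quantities under the assumption of one job per CPU.

```python
HOURS_PER_DAY = 24


def equivalent_cpus(cpu_hours: float, period_days: float) -> float:
    """CPUs that, running 24x7, would deliver `cpu_hours` in `period_days`."""
    return cpu_hours / (period_days * HOURS_PER_DAY)


def implied_period_days(cpu_hours: float, cpus: float) -> float:
    """Length of the period (in days) implied by a CPU-hour total and a CPU count."""
    return cpu_hours / (cpus * HOURS_PER_DAY)


def utilization(jobs_executing: int, max_cpus: int) -> float:
    """Fraction of CPUs busy, assuming one executing job occupies one CPU."""
    if max_cpus <= 0:
        raise ValueError("max_cpus must be positive")
    return min(jobs_executing / max_cpus, 1.0)


if __name__ == "__main__":
    # Figures from the slide: 672k CPU hours, equivalent to 275 CPUs.
    days = implied_period_days(672_000, 275)
    print(f"Implied reporting period: {days:.1f} days")
    print(f"Check: {equivalent_cpus(672_000, days):.0f} CPUs")
```

The implied period is roughly 102 days, which is consistent with a July to mid-October reporting window.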

  12. Stable infrastructure: social aspect
     • How EGEE keeps the Grid stable
        • Grid Operator on Duty (GOD) watching the entire grid
           • CE joined this activity in the first round in EGEE-II
        • Raise a ticket for each detected problem
        • Problem diagnosis and solution suggestion
        • Use monitoring tools for problem detection and availability metrics
     • 1st Line Support in CE: how to be better than the average?
        • Detect and fix failures before the GOD Team notices them and a ticket is raised
        • Support site admins with remedy actions
        • Suggest known, well-working practices → expertise sharing
        • Knowledge comes out of the mind with pain → although sharing it saves a lot of time at work, people need a lot of encouragement to do so

  13. Stable infrastructure: monitoring with NAGIOS
     • Try to monitor as much functionality as possible
        • E.g. the certificate expiration dates of all machines
     • Reasonable probe frequency
     • Send a problem notification immediately, but…
        • Do not spam every 5 minutes
     • Allow the site admin to signal that the problem is being worked on
        • Do not send further notifications until the admin reports back
     • Allow the site admin to schedule an extraordinary check at will
        • To let the admin see at once how well the workaround is working
     • Smart testing hierarchy
     • Monitors CE Core Services
        • Added tests for checking RB, BDII, LFC, VOMS
     • Used by 1st Line Support
        • Overview of the region
        • Detailed check of services
        • Schedule checks when working on fixes
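
The notification rules on this slide (notify immediately, do not repeat every few minutes, stay quiet while an admin has acknowledged the problem) can be illustrated with a small state machine. This is a hedged sketch in plain Python, not Nagios configuration and not the authors' setup; the class name, the one-hour repeat interval and the reset-on-recovery behaviour are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NotificationPolicy:
    """Hypothetical throttling policy for problem notifications."""

    min_interval_s: float = 3600.0       # assumed minimum gap between repeats
    acknowledged: bool = False           # admin said "being worked on"
    last_sent_at: Optional[float] = None

    def acknowledge(self) -> None:
        """Site admin signals that the problem is being worked on."""
        self.acknowledged = True

    def should_notify(self, now: float, service_ok: bool) -> bool:
        """Decide whether a notification goes out for this probe result."""
        if service_ok:
            # Recovery clears the state, so the next failure notifies at once.
            self.acknowledged = False
            self.last_sent_at = None
            return False
        if self.acknowledged:
            return False  # stay quiet while the admin is working on it
        if self.last_sent_at is None or now - self.last_sent_at >= self.min_interval_s:
            self.last_sent_at = now
            return True   # first failure, or the repeat interval has elapsed
        return False      # too soon after the previous notification


if __name__ == "__main__":
    policy = NotificationPolicy(min_interval_s=3600)
    # Probes every 5 minutes while the service is failing.
    for t in (0, 300, 600, 3600):
        print(t, policy.should_notify(now=t, service_ok=False))
```

With probes every five minutes, only the first failure and the one an hour later produce a notification; calling `acknowledge()` silences repeats until the service recovers.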

  14. Operations metrics results
     [Chart: EGEE operations metrics results from the last 10 months]
     Data from EGEE CIC portal: https://egee.in2p3.fr/CIC/index.php?id=cic&subid=cic_roc_metrics&scope=project&project=&metrics=sft

  15. Conclusions
     • CYFRONET gained know-how in:
        • Coordination of a large initiative
        • Organization of work for different subtasks
        • Running a stable production infrastructure
        • Accurate Grid job accounting
        • Sensible and precise Grid infrastructure monitoring
        • Facilitating the introduction of application users to the Grid
     • Experience gathered in CE ROC can easily be reused in building the national Polish grid

  16. National grid infrastructure PL-Grid
     Team of the Academic Computer Centre CYFRONET AGH
     Kraków, June-September 2006

     This document presents the motivation, goals, concept and approach for creating a national grid infrastructure, which is essential for modern scientific research (e-Science) and consistent with the European infrastructure.

     PL-Grid as an infrastructure for e-Science

     Conducting scientific research today requires the use of advanced information technologies. A growing number of research teams collaborate intensively, and this requires IT tools that make it possible to collect and exchange the acquired knowledge on a global scale. Experimental results are huge, distributed data sets of varied structure, and working with them requires tools for data access, integration and processing. Computer simulation is a fully accepted research method, and results obtained from simulations and from experiments are combined more and more often. This innovative approach is most visible in high-energy physics, astrophysics, the biological and medical sciences, and the Earth sciences. Realizing this new paradigm of scientific research, called e-Science, requires a grid infrastructure (also called Cyber-Science Infrastructure), comprising software that enables the sharing of various computing resources and tools that support cooperation among partners within so-called virtual organizations.

     Fig. 1. PL-Grid as an infrastructure for e-Science

  17. Simplified architecture of PL-Grid

  18. Organizational structure of PL-Grid
     [Diagram labels: Reports; Information; Consortium Management Board (Coordinator + members); Recommendations; Proposals; Users' Council; Consortium Council; Coordination; Domain grids; Evaluation; PL-Grid Operations Centre; Infrastructure (hardware, network)]

  19. Work schedule
