
SSC2 and Update on Multi-user Pilot Jobs Framework

SSC2 and Update on Multi-user Pilot Jobs Framework. Mingchao Ma, STFC – RAL HEPSysMan Meeting 20/06/2008. Security Service Challenge. What is it? How does it work? SSC 2 - UKI ROC experience. SSC - What is it?.


Presentation Transcript


  1. SSC2 and Update on Multi-user Pilot Jobs Framework Mingchao Ma, STFC – RAL HEPSysMan Meeting 20/06/2008

  2. Security Service Challenge • What is it? • How does it work? • SSC 2 - UKI ROC experience

  3. SSC - What is it? “The goal of the LCG/EGEE Security Service Challenge is to investigate whether sufficient information is available to be able to conduct an audit trace as part of an incident response, and to ensure that appropriate communications channels are available” Like a fire drill!

  4. SSC – Why and How? • To check that the communication channels among the parties involved (sites, VOs, security contacts, etc.) are functioning; • An exercise for system admins in tracing users’ activities and getting to know the various log files; • Not intrusive – only ‘legal’ operations; • No penetration and no execution of exploits; • Conducted and monitored by OSCT and ROC security officers; • CERN challenges ALL Tier1 sites; • The ROC security officer challenges the Tier2 sites within that ROC

  5. Security Service Challenge • SSC 1: challenges the Workload Management System (WMS) on the Grid: Resource Broker (RB) and Compute Element (CE) (2005) • SSC 2: challenges the Storage Elements on the Grid (2007/2008) • SSC 3: challenges the Operational Diligence of the LCG/EGEE Grid Sites (ongoing) https://twiki.cern.ch/twiki/bin/view/LCG/LCGSecurityChallenge

  6. SSCs - UKI ROC • Security Service Challenge 2 • 22 Tier2 sites (SEs) in the UKI ROC were challenged by the ROC security officer • Security Service Challenge 3 • RAL Tier1 was challenged by CERN on 06 March 2008 http://www.gridpp.ac.uk/security/ssc/ https://www.gridpp.ac.uk/security/ssc/ssc2/index.html http://grid-deployment.web.cern.ch/grid-deployment/ssc/SSC_2/SSC_2_google.html

  7. Security Service Challenge 2 • Timeline • From 21 January 2008 to 10 March 2008 • In total 22 sites (SEs) challenged • Job submission: from 21 Jan. to 28 Jan • 4 weeks (Feb. 2008) cool down period • GGUS ticket opened: 03 March 2008 • Challenge completed: 5pm 10 March 2008

  8. Security Service Challenge 2 • Basic Statistics • 22 SEs/Sites challenged, of which: • One site failed to run the challenge job; • One site opted out of the challenge due to a site rebuild; • One site is no longer part of the EGEE Grid; • Initial response received from the 21 sites; • 18 sites acknowledged the initial alert ticket within 24 hours; • 2 sites acknowledged the ticket within 48 hours; • 1 site acknowledged the ticket within 72 hours;
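
The acknowledgement figures above amount to sorting each site's response delay into 24, 48 and 72 hour buckets. A minimal sketch of that bookkeeping follows. The site names, the time of day the ticket was opened and the acknowledgement timestamps are all made up for illustration; only the ticket date (03 March 2008) comes from the talk.

```python
from datetime import datetime, timedelta

def ack_bucket(opened, acknowledged, limits_hours=(24, 48, 72)):
    """Return the smallest bucket (in hours) that contains the response delay."""
    delay = acknowledged - opened
    for limit in limits_hours:
        if delay <= timedelta(hours=limit):
            return f"within {limit}h"
    return f"over {limits_hours[-1]}h"

# Ticket date is from the talk; the 09:00 opening time is an assumption.
opened = datetime(2008, 3, 3, 9, 0)

# Hypothetical acknowledgement times, not real SSC2 data.
acks = {
    "site-a": datetime(2008, 3, 3, 15, 30),
    "site-b": datetime(2008, 3, 4, 17, 0),
    "site-c": datetime(2008, 3, 6, 8, 0),
}

buckets = {site: ack_bucket(opened, when) for site, when in acks.items()}
for site, bucket in buckets.items():
    print(site, bucket)
```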

  9. Security Service Challenge 2 - Result

  10. Security Service Challenge 2 • Preliminary Analysis • All sites that responded (18) found some traces of the job activities and identified at least one SE operation • The communication channel seems to work well; • Most sites acknowledged the ticket within 24 hours • 1 site took up to 72 hours, because a new staff member had no support role in GGUS and was therefore unable to answer the ticket

  11. Security Service Challenge 2 • Issues observed • None of the 19 sites was able to identify the Lookup operation • Some sites only provided raw logs (though the correct part of the log) with little or no analysis • A few sites had missing logs (log files accidentally deleted due to misconfiguration; log retention of only a month, again due to misconfiguration; or log files lost in a system rebuild, etc.) • SE logs (syntax and format) are still too complex; it seems very difficult to fully reconstruct some operations (site configuration? or insufficient log information?); too many log files!
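
The one-month retention problem follows directly from the SSC2 timeline: the challenge jobs ran between 21 and 28 January 2008, but the GGUS ticket asking sites to trace them was only opened on 03 March 2008. A small sketch of that arithmetic is given below. The dates are from the talk; the helper name and the two retention values tried (31 and 90 days) are illustrative assumptions, not site policy.

```python
from datetime import date

# Dates taken from the SSC2 timeline in this talk.
first_job = date(2008, 1, 21)      # first challenge job submitted
last_job = date(2008, 1, 28)       # last challenge job submitted
ticket_opened = date(2008, 3, 3)   # GGUS ticket opened

def logs_survive(event_day, request_day, retention_days):
    """True if a log entry written on event_day is still kept on request_day."""
    return (request_day - event_day).days <= retention_days

oldest_gap = (ticket_opened - first_job).days
newest_gap = (ticket_opened - last_job).days
print(f"log age when the ticket was opened: {newest_gap} to {oldest_gap} days")

# Hypothetical retention settings, for comparison only.
for retention_days in (31, 90):
    kept = logs_survive(first_job, ticket_opened, retention_days)
    print(f"retention {retention_days} days -> oldest job still traceable: {kept}")
```

Even the most recent challenge job was more than a month old by the time the ticket arrived, so any site keeping logs for about 30 days had nothing left to analyse.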

  12. Multi-user Pilot Jobs Framework

  13. What is a multi-user pilot job? • A multi-user pilot job, hereafter referred to simply as a pilot job, is a Grid job for which the following holds*: • a Grid job is submitted with a set of credentials belonging to either a member of the VO or to a service owned and operated by the VO • when this Grid job begins to execute at a Site, it pulls down and executes workload, hereafter called a user job, owned and submitted by a different member of the VO or multiple user jobs owned and submitted by multiple different members of the VO *Policy on Grid Multi-User Pilot Jobs https://edms.cern.ch/cedar/plsql/doc.info?cookie=7587020&document_id=855383&version=1

  14. Pilot Jobs Framework • A VO/Experiment-specific Workload Management System (WMS): • CMS glideinWMS http://indico.cern.ch/materialDisplay.py?sessionId=4&materialId=0&confId=20230 • LHCb DIRAC WMS http://indico.cern.ch/materialDisplay.py?sessionId=4&materialId=0&confId=20230 • ATLAS PanDA https://twiki.cern.ch/twiki/bin/view/Atlas/PanDA • ALICE ???

  15. A Simplified Diagram [Diagram; recoverable labels: VOMS Server, MyProxy Server, Others • Central Job Repository / VO-specific WMS • End User: Jobs + Proxy • Submit Pilot Job + Pilot Proxy • Get User Jobs & User Proxy • Site 1 and Site 2, each with Worker Node(s) running Pilot Job, Glexec, User Job]
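
The flow in the diagram (user jobs queued centrally, a pilot job landing on a worker node, pulling user jobs and running each under its owner's identity) can be sketched as a toy simulation. This is not the code of any of the experiment frameworks: every class, function and proxy string below is hypothetical, proxies are plain strings, and the identity switch that glexec performs in the diagram is replaced by a stub that only records which user the payload would run as.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class UserJob:
    owner: str        # VO member who submitted the job
    user_proxy: str   # stand-in for the user's delegated credential
    command: str      # payload to execute

class CentralQueue:
    """Stand-in for the central job repository / VO-specific WMS."""

    def __init__(self):
        self._jobs = deque()

    def submit(self, job):
        self._jobs.append(job)

    def fetch(self, pilot_proxy):
        # A real WMS would authenticate the pilot; here we only require a value.
        if not pilot_proxy:
            raise PermissionError("pilot presented no credential")
        return self._jobs.popleft() if self._jobs else None

def run_as_user(job):
    """Stub for the identity switch; no real glexec call is made."""
    return f"{job.command} ran as {job.owner}"

def pilot(queue, pilot_proxy, max_jobs=10):
    """Pull user jobs until the queue is empty or max_jobs is reached."""
    results = []
    for _ in range(max_jobs):
        job = queue.fetch(pilot_proxy)
        if job is None:
            break
        results.append(run_as_user(job))
    return results

queue = CentralQueue()
queue.submit(UserJob("alice", "proxy-alice", "reco.sh"))
queue.submit(UserJob("bob", "proxy-bob", "sim.sh"))

log = pilot(queue, pilot_proxy="proxy-pilot")
print(log)
```

The point of the sketch is that the site only sees the pilot's credential at submission time. The user identities appear later, on the worker node, which is why the identity switch and per-user accounting matter for the site.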

  16. Pilot Job Frameworks Review Workgroup • GDB working group mandated by WLCG MB on Jan. 22, 2008 • Mission • Review security issues in the pilot job framework of each experiment • Pilot jobs are taken as multi-user in this context • Define a minimum set of security requirements • Advise on improvements • Per framework or common to all • Report to GDB and MB • Time frame is a few months • Members • ALICE: Predrag Buncic • ATLAS: Torre Wenaus • CMS: Igor Sfiligoi • LHCb: Andrei Tsaregorodtsev • WLCG: Maarten Litmaath (chair) • EGEE: David Groep • FNAL: Eileen Berman • GridPP: Mingchao Ma • OSG: Mine Altunay * Content from Maarten Litmaath, GDB, 2008/06/11

  17. Questionnaire • Describe in a schematic way all components of the system. • If a component needs to use IPC to talk to another component for any reason, describe what kind of authentication, authorization, integrity and/or privacy mechanisms are in place. If configurable, specify the typical, minimum and maximum protection you can get. • Describe how user proxies are handled from the moment a user submits a task to the central task queue to the moment that the user task runs on a WN, through any intermediate storage. • What happens around the identity change on the WN, e.g. how is each task sandboxed and to what extent? • How can running processes be accounted to the correct user? • How is a task spawned on the WN and how is it destroyed? • How can a site be blocked?

  18. Questionnaire (cont.) • What site security processes are applied to the machine(s) running the WMS? • Who is allowed access to the machine(s) on which the service(s) run, and how do they obtain access? • How are authorized individuals authenticated on the machine(s)? • What is the process for keeping the service(s) and OS patched and up-to-date, especially with respect to security patches? • Do you have an identified security contact? • Describe the incident response plan to deal with security incidents and reports of unauthorized use. • What services (in general) run on the machine(s) that offer the WMS service? • What processes exist to maintain audit logs (e.g. for use during an incident)? • What monitoring exists on the machine(s) to aid detection of security incidents or unauthorized use? • Can you limit the users that can submit jobs to the VO WMS? How?
