Flexible Session Management in a Distributed System

Presentation Transcript


  1. Flexible Session Management in a Distributed System CHEP 2009, Prague

  2. Condor Communication Layer: CEDAR • CEDAR – Condor External DAta Representation • C++ API for network sockets • Cross-platform data representation • Special attention to UDP issues • Flexibility with private networks • Strong focus on security

  3. Basic CEDAR Security Features • Authentication • Encryption • Integrity Checks • Credential Mapping • Authorization Policy

  4. Basic CEDAR Security Features • Autonegotiation of supported features • Server wants: KERBEROS, GSI, OPENSSL, 3DES, BLOWFISH, MD5 • Client wants: KERBEROS, NTSSPI, 3DES, MD5

  5. Basic CEDAR Security Features • Autonegotiation of supported features • Policy: KERBEROS, 3DES, MD5
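
The negotiation on slides 4 and 5 can be sketched as a per-category intersection of the two wish lists. This is only an illustration: the slides do not say how CEDAR resolves ties, so the sketch assumes the server's list order is the preference order, and the grouping of names into categories (`CATEGORIES`) and the function `negotiate` are hypothetical.

```python
# Hypothetical grouping of the feature names from the slides into categories.
CATEGORIES = {
    "authentication": {"KERBEROS", "GSI", "OPENSSL", "NTSSPI"},
    "encryption": {"3DES", "BLOWFISH"},
    "integrity": {"MD5"},
}


def negotiate(server_wants, client_wants):
    """Pick, per category, the first server-listed feature the client also lists."""
    client = set(client_wants)
    policy = {}
    for category, members in CATEGORIES.items():
        policy[category] = None  # stays None if the two sides share nothing
        for feature in server_wants:  # assumption: server order = preference
            if feature in members and feature in client:
                policy[category] = feature
                break
    return policy


server = ["KERBEROS", "GSI", "OPENSSL", "3DES", "BLOWFISH", "MD5"]
client = ["KERBEROS", "NTSSPI", "3DES", "MD5"]
policy = negotiate(server, client)
print(policy)
```

With the two lists from slide 4, the sketch selects KERBEROS, 3DES, and MD5, matching the policy on slide 5.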

  6. Strong Authentication • Strong Authentication can be expensive! • PKI (OpenSSL and GLOBUS) is relatively CPU-intensive • KERBEROS hits the KDC • All require network round trips

  7. Strong Authentication • Network round trips in a distributed grid environment (like glideinWMS) may cross a wide area network with relatively large latency • [Diagram: message exchange, 0.1 s per step] “Hi! I want to talk” (0.1 s); “OK, let’s try GSI” (0.1 s); “Here you go…” (0.1 s)

  8. Strong Authentication • In a single-threaded client blocking on the network, all of those 0.1 s delays quickly become a scalability problem • Solutions? • Don’t block on the network (recent progress, but more to do) • When something is expensive, cache it (fortunately, already supported)

  9. Security Session Cache • A session is a semi-permanent information exchange that is set up and later torn down • Setup is costly, but done only once • Resuming is done often, but is faster • Tearing down is either explicit or based on expiration times

  10. Session Management • Session setup • [Diagram: message exchange] “I want to talk” (0.1 s); “OK, let’s authenticate” (0.1 s); authentication (0.2 s)
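
To see why these delays matter for a blocking, single-threaded client, the arithmetic can be worked through in a few lines. The per-step delays are the ones on slide 10; the count of 100,000 connections, each needing a full setup, is an assumption chosen for illustration, and `serial_cost` is a hypothetical helper.

```python
# Per-step delays shown on the session setup slide, in seconds.
SETUP_STEPS_S = [0.1, 0.1, 0.2]


def serial_cost(n_connections, per_connection_s):
    """Wall-clock time when one blocking thread handles connections one by one."""
    return n_connections * per_connection_s


setup_s = sum(SETUP_STEPS_S)
# Assumption for illustration: 100,000 connections, none of them resumed.
total_s = serial_cost(100_000, setup_s)

print(f"one setup: {setup_s:.1f} s")
print(f"100,000 serial setups: {total_s / 3600:.1f} hours")
```

Under these assumptions a single setup takes 0.4 s and the serial total is about 11 hours, which is the cost that caching sessions is meant to avoid.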

  11. Session Management • Authentication results in secure key exchange • [Diagram: after “I want to talk”, “OK, let’s authenticate”, and authentication, both sides hold secret key: 0x3E42] • (The secret key is actually 192 bits)

  12. Session Management • The secret key is associated with a session ID • [Diagram: “Call this session 1234”; both sides store sess 1234, key 0x3E42]

  13. Session Management • Resuming a session • Sends the ID and uses the secret key • [Diagram: both sides hold sess 1234, key 0x3E42] Client: “Use session 1234. Here is my request, encrypted (with 0x3E42) to prove who I am.” Server: “Here is your response, encrypted (with 0x3E42) to prove who I am.”
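
The resume step can be illustrated with a small sketch in which the client proves possession of the session key without repeating the authentication. Note the substitution: the slide says the request is encrypted with the key, whereas this sketch uses an HMAC from the Python standard library as a stand-in, which proves key possession but provides no confidentiality. The names `SESSIONS`, `make_request`, and `verify_request` are hypothetical, and the two-byte key is the toy value from the slides (the real key is 192 bits).

```python
import hashlib
import hmac

# Server-side session table: session ID -> shared secret (toy value from the slides).
SESSIONS = {"1234": bytes.fromhex("3e42")}


def make_request(session_id, key, payload):
    """Client side: name the session and attach a MAC computed with its key."""
    tag = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {"session": session_id, "payload": payload, "tag": tag}


def verify_request(request):
    """Server side: look up the session and check the MAC; no re-authentication."""
    key = SESSIONS.get(request["session"])
    if key is None:
        return False  # unknown or expired session: a full setup would be needed
    expected = hmac.new(key, request["payload"], hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, request["tag"])


good = make_request("1234", bytes.fromhex("3e42"), b"run my job")
forged = make_request("1234", bytes.fromhex("ffff"), b"run my job")
print(verify_request(good), verify_request(forged))
```

The point of the sketch is that resuming needs only a table lookup and a symmetric-key operation, with no extra round trips to an authentication service.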

  14. Session Management • The session is stored by both sides • sess 1234 • key: 0x3E42 • authn: KERBEROS • user: zmiller@CS.WISC.EDU • from: cobalt.cs.wisc.edu • authz: ALLOW *.wisc.edu • valid until: 2009.03.24.17.50.00
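
A session record like the one above, together with the two teardown modes from slide 9 (explicit and expiration-based), might be modeled as follows. This is a minimal sketch, not Condor's data structure: the class names `Session` and `SessionCache`, the field name `peer` (for the slide's "from"), and the in-memory dictionary are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Session:
    session_id: str
    key: bytes
    authn: str
    user: str
    peer: str  # the "from" field on the slide
    authz: str
    valid_until: datetime


class SessionCache:
    def __init__(self):
        self._sessions = {}

    def store(self, session):
        self._sessions[session.session_id] = session

    def lookup(self, session_id, now):
        """Return a live session, or None if it is unknown or has expired."""
        session = self._sessions.get(session_id)
        if session is None:
            return None
        if now >= session.valid_until:
            del self._sessions[session_id]  # expiration-based teardown
            return None
        return session

    def invalidate(self, session_id):
        self._sessions.pop(session_id, None)  # explicit teardown


cache = SessionCache()
cache.store(Session("1234", bytes.fromhex("3e42"), "KERBEROS",
                    "zmiller@CS.WISC.EDU", "cobalt.cs.wisc.edu",
                    "ALLOW *.wisc.edu", datetime(2009, 3, 24, 17, 50, 0)))
before = cache.lookup("1234", datetime(2009, 3, 24, 12, 0, 0))
after = cache.lookup("1234", datetime(2009, 3, 24, 18, 0, 0))
print(before is not None, after is None)
```

Because both sides store the same record, either side can drop it independently; the other side then simply falls back to a full session setup.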

  15. Basic Condor Operation • User submits a job • Condor schedules the job on an execute node • Submit point connects to execute node and sends job • Job runs to completion • Execute node returns results

  16. High Level Condor Operation (with authentication turned on) • [Diagram: Central Manager, Execute Node, Job Submit Point]

  17. High Level Condor Operation (with authentication turned on) • Execute node advertises itself • Red lines represent authenticated connections

  18. High Level Condor Operation (with authentication turned on) • User submits a job: % condor_submit work.sub

  19. High Level Condor Operation (with authentication turned on) • Condor performs matchmaking • Red lines represent authenticated connections

  20. High Level Condor Operation (with authentication turned on) • Submit node sends job to execute node • Red lines represent authenticated connections

  21. High Level Condor Operation (with authentication turned on) • Job runs to completion

  22. High Level Condor Operation (with authentication turned on) • Execute node sends job results back • Red lines represent authenticated connections

  23. High Level Condor Operation (with authentication turned on) • Job complete! • Repeat ad infinitum, reusing cached security sessions

  24. Problems Crossing the Atlantic • Experience in CMS CCRC-08 with glideinWMS (a dynamic Condor pool on top of the grid): • Even with sessions, authentication cost is killing performance. • Why didn’t we notice this in previous scale tests at Fermilab?! • Because network latency adds significantly to the cost of authentication in Condor (blocking, single-threaded).

  25. Larger Scale View • [Diagram: Execute Nodes, Central Manager, Job Submit Point]

  26. Larger Scale View • User submits 100,000 jobs

  27. Larger Scale View • Condor schedules jobs and passes security session info for each match

  28. Larger Scale View • Send jobs to execute nodes • Lots of authentications!

  29. Meeting • Igor: I will never give up on you guys (yet) • Dan: all blocking network operations MUST be exterminated! • Todd: don’t unwind all our code; use cooperative threads • Miron: guys, listen

  30. The Plan • The Central Manager authenticates both the submit point and the execute nodes • Using this trust relationship, the Central Manager can help establish a security session between the two, good for the duration of the match.

  31. Integrated Security Sessions and Matchmaking • [Diagram: Central Manager, Execute Node, Job Submit Point]

  32. Integrated Security Sessions and Matchmaking • Execute node advertises itself AND match session info (encrypted): match sess 1278, key: 0x72A9 … • Red lines represent authenticated connections

  33. Integrated Security Sessions and Matchmaking • User submits a job: % condor_submit work.sub

  34. Integrated Security Sessions and Matchmaking • Condor schedules the job AND passes match session info to submitter (encrypted): sess 1278, key: 0x72A9 … • Red lines represent authenticated connections

  35. Integrated Security Sessions and Matchmaking • Submit node sends job to execute node • Both sides hold sess 1278, key: 0x72A9 … • Red line here represents resuming the match session, saving an authentication

  36. Integrated Security Sessions and Matchmaking • Job runs to completion

  37. Integrated Security Sessions and Matchmaking • Execute node sends job results back • Both sides hold sess 1278, key: 0x72A9 … • Red line here represents resuming the match session, saving an authentication

  38. Integrated Security Sessions and Matchmaking • Job complete!
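
The flow on slides 31 to 38 can be sketched as three cooperating objects. This is a simplified model under stated assumptions, not Condor's implementation: all class and method names are hypothetical, the matchmaking is trivial, the 192-bit key size is borrowed from slide 11, and the submit point presents the key directly where a real protocol would prove possession of it instead.

```python
import secrets


class ExecuteNode:
    def __init__(self, name):
        self.name = name
        self.match_sessions = {}  # session ID -> key

    def advertise(self):
        """Create a match session and include it in the ad sent to the manager."""
        session_id = secrets.token_hex(4)
        key = secrets.token_bytes(24)  # 192 bits, assumed to match slide 11
        self.match_sessions[session_id] = key
        return {"node": self.name, "session_id": session_id, "key": key}

    def accept(self, session_id, key):
        """Resume succeeds only if the caller holds the advertised key."""
        expected = self.match_sessions.get(session_id)
        return expected is not None and secrets.compare_digest(expected, key)


class SubmitPoint:
    def __init__(self):
        self.match_sessions = {}  # node name -> (session ID, key)


class CentralManager:
    def __init__(self):
        self.ads = {}

    def receive_ad(self, ad):
        # Arrives over a connection on which the execute node was authenticated.
        self.ads[ad["node"]] = ad

    def match(self, submit_point):
        # Trivial stand-in for matchmaking: hand out any advertised node,
        # passing the match session along with the match.
        node, ad = self.ads.popitem()
        submit_point.match_sessions[node] = (ad["session_id"], ad["key"])
        return node


execute = ExecuteNode("exec1")
manager = CentralManager()
submit = SubmitPoint()
manager.receive_ad(execute.advertise())
node = manager.match(submit)
session_id, key = submit.match_sessions[node]
resumed = execute.accept(session_id, key)
print(resumed)
```

The submit point and the execute node never authenticate each other directly; they rely on the fact that each was authenticated by the Central Manager, which relayed the session.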

  39. What about Central Manager? • Communication between submit node and execute node is now cheaper • But Central Manager still has to authenticate everyone (at least once).

  40. Execute Node Ads • [Diagram: Execute Nodes, Central Manager, Job Submit Point] Many authentications

  41. Two Ideas • Easy: 2-tier ClassAd collection (done) • Hard: remove blocking network operations and/or use threads (in progress)

  42. 2-tier Collector • [Diagram: Execute Nodes, sub-collectors, Central Manager; the sub-collectors pass an aggregation of ClassAds to the Central Manager] • Authentication workload distributed across multiple collectors. • Big benefit, even with collectors all on just one machine.
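
A rough model of the 2-tier arrangement: execute nodes report to one of several sub-collectors, each of which bears the authentication cost for its own nodes, and the top level only merges the collected ads. The slides do not say how nodes are assigned to sub-collectors; the hash-based `pick_sub_collector` below is an assumption, as are the class names and the counter used as a stand-in for authentication work.

```python
import zlib


class SubCollector:
    def __init__(self):
        self.ads = {}
        self.authentications = 0

    def receive_ad(self, node, ad):
        self.authentications += 1  # stand-in for authenticating the execute node
        self.ads[node] = ad


def pick_sub_collector(node, sub_collectors):
    """Assumed assignment rule: a stable hash of the node name."""
    return sub_collectors[zlib.crc32(node.encode()) % len(sub_collectors)]


def aggregate(sub_collectors):
    """Top tier: merge ads without authenticating individual execute nodes."""
    merged = {}
    for sub in sub_collectors:
        merged.update(sub.ads)
    return merged


subs = [SubCollector() for _ in range(4)]
nodes = [f"exec{i}" for i in range(1000)]
for name in nodes:
    pick_sub_collector(name, subs).receive_ad(name, {"state": "idle"})

all_ads = aggregate(subs)
print(len(all_ads), [sub.authentications for sub in subs])
```

Each sub-collector sees only its share of the authentications, which is the sense in which the workload is distributed.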

  43. Can we Cross the Atlantic? • glideinWMS tests with one submit node: • Before (Condor 7.1.2): max 4,000 jobs/day (500 simultaneously running) • After (Condor 7.3.1): 200,000 jobs/day (22k-25k simultaneously running) • Now limited by port usage, not scheduler throughput

  44. Conclusion • Security sessions are essential in Condor • The 2-tier collector works • Establishing security sessions through matchmaking is a big win • One submit node is much more convenient than 50

  45. Conclusion • Efficient delegation and caching of trust can be an important optimization in distributed systems.
