Flexible Session Management in a Distributed System CHEP 2009, Prague
Condor Communication Layer: CEDAR
• CEDAR: Condor External DAta Representation
• C++ API for network sockets
• Cross-platform data representation
• Special attention to UDP issues
• Flexibility with private networks
• Strong focus on security

Basic CEDAR Security Features
• Authentication
• Encryption
• Integrity Checks
• Credential Mapping
• Authorization Policy

Basic CEDAR Security Features
• Autonegotiation of supported features
  Server wants: KERBEROS, GSI, OPENSSL, 3DES, BLOWFISH, MD5
  Client wants: KERBEROS, NTSSPI, 3DES, MD5
  Policy: KERBEROS, 3DES, MD5

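The negotiation on this slide can be sketched as a per-feature intersection of what the server and the client list. This is a minimal illustration, not CEDAR's actual API: the function name `negotiate`, the grouping of methods into authentication, encryption, and integrity, and the rule "first server-preferred method the client also supports" are all assumptions.

```python
# Hypothetical sketch: names and the selection rule are assumptions,
# not taken from CEDAR or Condor source code.
from typing import Dict, List, Optional


def negotiate(
    server: Dict[str, List[str]], client: Dict[str, List[str]]
) -> Dict[str, Optional[str]]:
    """For each feature, pick the first server-preferred method the client also lists."""
    policy: Dict[str, Optional[str]] = {}
    for feature, server_methods in server.items():
        client_methods = set(client.get(feature, []))
        # None means the two sides share no method for this feature.
        policy[feature] = next(
            (m for m in server_methods if m in client_methods), None
        )
    return policy


# Method lists from the slide, grouped by feature (the grouping is an assumption).
server_wants = {
    "authentication": ["KERBEROS", "GSI", "OPENSSL"],
    "encryption": ["3DES", "BLOWFISH"],
    "integrity": ["MD5"],
}
client_wants = {
    "authentication": ["KERBEROS", "NTSSPI"],
    "encryption": ["3DES"],
    "integrity": ["MD5"],
}

policy = negotiate(server_wants, client_wants)
print(policy)
# {'authentication': 'KERBEROS', 'encryption': '3DES', 'integrity': 'MD5'}
```

With the lists from the slide, the sketch arrives at the same policy shown there: KERBEROS, 3DES, MD5.
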
Strong Authentication
• Strong authentication can be expensive!
• PKI (OpenSSL and GLOBUS) is relatively CPU-intensive
• KERBEROS hits the KDC
• All require network round trips

Strong Authentication
• Network round trips in a distributed grid environment (like glideinWMS) may cross a wide-area network with relatively large latency
  [Diagram: message exchange, 0.1s per message]
  0.1s  "Hi! I want to talk"
  0.1s  "OK, let's try GSI"
  0.1s  "Here you go…"

Strong Authentication
• In a single-threaded client that blocks on the network, all of those 0.1s delays quickly become a scalability problem
• Solutions?
  • Don't block on the network
    • recent progress, but more to do
  • When something is expensive, cache it
    • fortunately, already supported

Security Session Cache
• A session is a semi-permanent information exchange that is set up and later torn down
• Setup is costly, but done only once
• Resuming is done often, but is faster
• Teardown is either explicit or based on expiration times

Session Management
• Session setup
  0.1s  "I want to talk"
  0.1s  "OK, let's authenticate"
  0.2s  Authentication
• Authentication results in a secure key exchange
  Both sides now hold secret key: 0x3E42 (the secret key is actually 192 bits)
• The secret key is associated with a session ID
  "Call this session 1234"
  Both sides store: sess 1234, key 0x3E42

Session Management
• Resuming a session
  • Sends the ID and uses the secret key
  Client: "Use session 1234. Here is my request, encrypted (with 0x3E42) to prove who I am."
  Server: "Here is your response, encrypted (with 0x3E42) to prove who I am."
  Both sides hold: sess 1234, key 0x3E42

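The idea of resuming a session by proving possession of the shared key can be sketched as follows. Note the substitution: the slide describes encrypting the request with the session key, while this sketch uses an HMAC-SHA256 tag as a stand-in, because Python's standard library has no symmetric cipher. The function names and message layout are hypothetical and are not CEDAR's wire protocol.

```python
# Hypothetical sketch: HMAC stands in for the encryption described on the slide.
import hashlib
import hmac
from typing import Dict, Tuple

SESSION_ID = "1234"
# Illustrative 16-bit key from the slide; the real key is stated to be 192 bits.
SECRET_KEY = bytes.fromhex("3E42")


def make_request(session_id: str, key: bytes, payload: bytes) -> Tuple[str, bytes, bytes]:
    """Client side: name the session and tag the request with the shared key."""
    tag = hmac.new(key, session_id.encode() + payload, hashlib.sha256).digest()
    return session_id, payload, tag


def verify_request(
    sessions: Dict[str, bytes], session_id: str, payload: bytes, tag: bytes
) -> bool:
    """Server side: look up the session key and check the tag."""
    key = sessions.get(session_id)
    if key is None:
        return False  # unknown session: a full authentication would be needed
    expected = hmac.new(key, session_id.encode() + payload, hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)


server_sessions = {SESSION_ID: SECRET_KEY}
sid, payload, tag = make_request(SESSION_ID, SECRET_KEY, b"run job 42")
ok = verify_request(server_sessions, sid, payload, tag)
forged = verify_request(server_sessions, sid, payload, b"\x00" * 32)
print(ok, forged)
```

The point of the sketch is that no authentication round trips occur on resume: a single message carrying the session ID and a key-dependent proof is enough for the server to recognize the client.
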
Session Management
• The session is stored by both sides
  sess 1234
  key: 0x3E42
  authn: KERBEROS
  user: zmiller@CS.WISC.EDU
  from: cobalt.cs.wisc.edu
  authz: ALLOW *.wisc.edu
  valid until: 2009.03.24.17.50.00

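A stored session record like the one above, together with the explicit and expiration-based teardown mentioned earlier, might be modeled as below. This is a sketch under assumptions: the class and field names are invented for illustration, and the "valid until" value is read as year.month.day.hour.minute.second.

```python
# Hypothetical sketch of a session cache; not Condor's actual data structures.
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Optional


@dataclass
class Session:
    session_id: str
    key: bytes
    authn: str          # authentication method that created the session
    user: str           # authenticated identity
    peer: str           # host the session was established with
    authz: str          # cached authorization decision
    valid_until: datetime


class SessionCache:
    def __init__(self) -> None:
        self._sessions: Dict[str, Session] = {}

    def store(self, session: Session) -> None:
        self._sessions[session.session_id] = session

    def lookup(self, session_id: str, now: datetime) -> Optional[Session]:
        """Return a live session, evicting it if it has expired."""
        session = self._sessions.get(session_id)
        if session is None:
            return None
        if now >= session.valid_until:
            del self._sessions[session_id]  # expiration-based teardown
            return None
        return session

    def invalidate(self, session_id: str) -> None:
        self._sessions.pop(session_id, None)  # explicit teardown


cache = SessionCache()
cache.store(
    Session(
        session_id="1234",
        key=bytes.fromhex("3E42"),  # illustrative; the real key is 192 bits
        authn="KERBEROS",
        user="zmiller@CS.WISC.EDU",
        peer="cobalt.cs.wisc.edu",
        authz="ALLOW *.wisc.edu",
        valid_until=datetime(2009, 3, 24, 17, 50, 0),
    )
)
hit = cache.lookup("1234", now=datetime(2009, 3, 24, 12, 0, 0))
miss = cache.lookup("1234", now=datetime(2009, 3, 24, 18, 0, 0))
```

Here `hit` is the stored session, while `miss` is `None` because the lookup time is past the expiration, which also evicts the entry.
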
Basic Condor Operation
• User submits a job
• Condor schedules the job on an execute node
• Submit point connects to execute node and sends job
• Job runs to completion
• Execute node returns results

High Level Condor Operation (with authentication turned on)
[Diagram: Central Manager, Execute Node, and Job Submit Point; red lines represent authenticated connections]
• Execute node advertises itself (authenticated connection)
• User submits a job: % condor_submit work.sub
• Condor performs matchmaking (authenticated connection)
• Submit node sends job to execute node (authenticated connection)
• Job runs to completion
• Execute node sends job results back (authenticated connection)
• Job complete! Repeat ad infinitum, reusing cached security sessions

Problems Crossing the Atlantic
• Experience in CMS CCRC-08 with glideinWMS (dynamic Condor pool on top of grid):
  • Even with sessions, authentication cost is killing performance.
  • Why didn't we notice this in previous scale tests at Fermilab?!
  • Because network latency adds significantly to the cost of authentication in Condor (blocking, single-threaded).

Larger Scale View
[Diagram: Job Submit Point, Central Manager, and many Execute Nodes]
• User submits 100,000 jobs
• Condor schedules jobs and passes security session info for each match
• Submit point sends jobs to execute nodes: lots of authentications!

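To get a feel for why this hurts a blocking, single-threaded submit point, here is a back-of-the-envelope calculation. It assumes, purely for illustration, that every one of the 100,000 jobs needs one full session setup, that setups are serialized, and that each costs the 0.1s + 0.1s + 0.2s shown on the session setup slide. These are the slides' illustrative figures, not measurements.

```python
# Illustrative arithmetic only; the per-step delays are the slides' example values.
SETUP_SECONDS = 0.1 + 0.1 + 0.2   # "I want to talk" + "let's authenticate" + authentication
JOBS = 100_000                    # one full setup per job (assumption)

total_seconds = SETUP_SECONDS * JOBS
total_hours = total_seconds / 3600
print(f"{total_seconds:.0f} s = {total_hours:.1f} h of blocked time")
# 40000 s = 11.1 h of blocked time
```

Under these assumptions the submit point would spend about eleven hours doing nothing but waiting on authentication, which is why avoiding or reusing authentications matters more as latency grows.
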
Meeting
• Igor: I will never give up on you guys (yet)
• Dan: all blocking network operations MUST be exterminated!
• Todd: don't unwind all our code; use cooperative threads
• Miron: guys, listen

The Plan
• The Central Manager authenticates both the submit point and the execute nodes.
• Using this trust relationship, the Central Manager can help establish a security session between the two, good for the duration of the match.

Integrated Security Sessions and Matchmaking
[Diagram: Central Manager, Execute Node, and Job Submit Point; red lines represent authenticated connections]
• Execute node advertises itself AND match session info (encrypted): match sess 1278, key: 0x72A9 …
• User submits a job: % condor_submit work.sub
• Condor schedules the job AND passes match session info (sess 1278, key: 0x72A9 …) to the submitter (encrypted)
• Submit node sends job to execute node, resuming the match session (sess 1278, key: 0x72A9 …) and saving an authentication
• Job runs to completion
• Execute node sends job results back, again resuming the match session and saving an authentication
• Job complete!

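The flow above can be sketched as a small simulation in which the execute node creates a match session, the Central Manager relays it, and the submit side resumes it. All class and method names are hypothetical. The sketch also simplifies the resume step to a direct key comparison, and it assumes the ad and the match notification travel over connections that are already authenticated and encrypted.

```python
# Hypothetical simulation of match-session brokering; not Condor's implementation.
import secrets
from typing import Dict, Optional, Tuple


class ExecuteNode:
    def __init__(self, name: str) -> None:
        self.name = name
        self.sessions: Dict[str, bytes] = {}

    def advertise(self) -> Tuple[str, str, bytes]:
        """Create a match session and return (name, session_id, key) for the ad."""
        session_id = secrets.token_hex(4)
        key = secrets.token_bytes(24)  # 24 bytes = 192 bits
        self.sessions[session_id] = key
        return self.name, session_id, key

    def accept_job(self, session_id: str, key: bytes) -> bool:
        """Resume the match session; no fresh authentication is performed.

        Simplification: a real protocol would prove key possession rather
        than present the key itself.
        """
        return secrets.compare_digest(self.sessions.get(session_id, b""), key)


class CentralManager:
    def __init__(self) -> None:
        self.ads: Dict[str, Tuple[str, bytes]] = {}

    def receive_ad(self, name: str, session_id: str, key: bytes) -> None:
        # Assumed to arrive over an authenticated, encrypted connection.
        self.ads[name] = (session_id, key)

    def match(self) -> Optional[Tuple[str, str, bytes]]:
        """Hand an advertised node and its match session to the submitter."""
        if not self.ads:
            return None
        name, (session_id, key) = self.ads.popitem()
        return name, session_id, key


node = ExecuteNode("exec-1")
manager = CentralManager()
manager.receive_ad(*node.advertise())

matched = manager.match()
name, session_id, key = matched if matched else ("", "", b"")
accepted = node.accept_job(session_id, key)
rejected = node.accept_job(session_id, b"wrong key")
print(name, accepted, rejected)
```

The submit side never authenticates to the execute node directly: it relies on the Central Manager, which both parties already trust, to deliver the session.
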
What about Central Manager?
• Communication between submit node and execute node is now cheaper
• But Central Manager still has to authenticate everyone (at least once).

Execute Node Ads
[Diagram: many Execute Nodes send ads to the Central Manager, causing many authentications; Job Submit Point]

Two Ideas
• Easy: 2-tier ClassAd collection
  • done
• Hard: remove blocking network operations and/or use threads
  • in progress

2-tier Collector
[Diagram: Execute Nodes report to sub-collectors; the Central Manager holds the aggregation of ClassAds]
• Authentication workload is distributed across multiple collectors.
• Big benefit, even with the collectors all on just one machine.

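How a second tier spreads the authentication load can be illustrated with a toy model. The class names, the round-robin assignment of nodes to sub-collectors, and the counts (1,000 nodes, 4 sub-collectors) are assumptions made for the example, not details of the Condor collector.

```python
# Toy model of a 2-tier collector; names and numbers are hypothetical.
from typing import Dict


class SubCollector:
    def __init__(self) -> None:
        self.authentications = 0
        self.ads: Dict[str, dict] = {}

    def receive_ad(self, node: str, ad: dict) -> None:
        self.authentications += 1  # each execute node authenticates here
        self.ads[node] = ad


class TopCollector:
    def __init__(self) -> None:
        self.authentications = 0
        self.ads: Dict[str, dict] = {}

    def receive_batch(self, sub: SubCollector) -> None:
        self.authentications += 1  # one authentication per sub-collector
        self.ads.update(sub.ads)   # aggregation of ClassAds


subs = [SubCollector() for _ in range(4)]
top = TopCollector()

# Execute nodes are spread round-robin over the sub-collectors.
for i in range(1000):
    node = f"node{i}"
    subs[i % len(subs)].receive_ad(node, {"Name": node})

# Each sub-collector forwards its ads to the top-level collector.
for sub in subs:
    top.receive_batch(sub)

print([s.authentications for s in subs], top.authentications, len(top.ads))
# [250, 250, 250, 250] 4 1000
```

In this model the top-level collector still sees every ad but performs only as many authentications as there are sub-collectors, while each sub-collector handles a fraction of the execute nodes.
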
Can we Cross the Atlantic?
• glideinWMS tests with one submit node:
  • Before (Condor 7.1.2)
    • max 4,000 jobs/day
    • (500 simultaneously running)
  • After (Condor 7.3.1)
    • 200,000 jobs/day
    • (22k-25k simultaneously running)
    • now limited by port usage, not scheduler throughput

Conclusion
• Security sessions are essential in Condor
• The 2-tier collector works
• Establishing security sessions through matchmaking is a big win
  • one submit node is much more convenient than 50
• Efficient delegation and caching of trust can be an important optimization in distributed systems.