BaBar MC production
The simple question: How can we run BaBar software on EDG grid sites?
[Diagram labels: Farm @ VU (Amsterdam University); EDG testbed (NIKHEF); BaBar MC production software; A lot of computers; Jobs; Results]

Introduction of Parrot
• We need transparent access to the Objectivity database (requires local file access)
[Diagram labels: Farm @ VU (Amsterdam University); EDG testbed (NIKHEF); BaBar MC production software; A lot of computers; Chirp; Parrot; Jobs; Results]

Parrot functionality
[Diagram: The Parrot Virtual File System]
• Application on top: BaBar MC production, through the POSIX interface (ptrace trap)
• Local cache ("Optimize"): not yet
• Protocols and their servers: HTTP (HTTP Server), FTP (FTP Server), RFIO (RFIO Server), NeST (NeST Server), Chirp (Chirp Server), Condor Proxy (Condor Shadow)
• I/O styles: whole file I/O (get/put); partial file I/O (open, close, read, write, lseek); secure remote RPC
• Other labels: x509; traditional I/O services; integration with Castor; allocation and management; full UNIX semantics; integration with Condor

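The difference between whole file I/O (get/put) and partial file I/O (open, close, read, write, lseek) named on this slide can be illustrated with ordinary local files. The sketch below uses only Python's standard `os`, `shutil`, and `tempfile` modules. It does not call Parrot or any remote server, and the helper names and the sample file are hypothetical.

```python
import os
import shutil
import tempfile


def whole_file_get(src_path, dst_path):
    """Whole-file access: copy the entire file before using any of it."""
    shutil.copyfile(src_path, dst_path)
    with open(dst_path, "rb") as f:
        return f.read()


def partial_read(path, offset, length):
    """Partial-file access: open, seek to an offset, read only what is needed."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.lseek(fd, offset, os.SEEK_SET)
        return os.read(fd, length)
    finally:
        os.close(fd)


def demo():
    """Build a small sample file (hypothetical content) and read it both ways."""
    workdir = tempfile.mkdtemp()
    try:
        src = os.path.join(workdir, "events.dat")
        with open(src, "wb") as f:
            f.write(b"HEADER" + b"x" * 100 + b"TRAILER")
        whole = whole_file_get(src, os.path.join(workdir, "copy.dat"))
        part = partial_read(src, 106, 7)  # only the last 7 bytes
        return len(whole), part
    finally:
        shutil.rmtree(workdir)
```

A database such as Objectivity touches small regions of large files, which is presumably why partial file access matters more here than whole-file transfer.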
The introduction of GCB
[Diagram labels: Farm @ VU (Amsterdam University); EDG testbed (NIKHEF); BaBar MC production software; Condor-G; Chirp; Parrot; NFS; GCB; Relay; Private network; A lot of computers; Some computers; Jobs; Results]

GCB functionality
[Diagram labels: Central Manager; NAT; GCB Server; Relay; Private network; Persistent connection; nodes labelled A, P, and B]

The introduction of GlideIn
[Diagram labels: Farm @ VU (Amsterdam University); EDG testbed (NIKHEF); Condor-G; Job Queue; GlideIn; Batch job; PBS job manager; BaBar MC production software; Chirp; Parrot; NFS; GCB; Relay; Private network; A lot of computers; Some computers; Jobs; Results. Annotations: "72 hour jobs"; "Can’t wait for queues"]

Overview of complete setup
[Diagram labels: Farm @ VU (Amsterdam University); EDG testbed (NIKHEF); Condor-G; Job Queue; GlideIn; Batch job; PBS job manager; BaBar MC production software; Chirp; Parrot; NFS; GCB; Relay; Private network; A lot of computers; Some computers; Jobs; Results. Annotations: "72 hour jobs"; "Can’t wait for queues"]

Leave only the components
[Diagram labels: Farm @ VU (Amsterdam University); EDG testbed (NIKHEF); PBS job manager; BaBar MC production software; Queue; GlideIn; Chirp; Parrot; NFS; GCB; Private network; A lot of computers; Some computers]

The interesting dependencies
• Different MDS scheme
• Objectivity database
  • LOCK server sockets
  • NFS problems
  • UID / hostname checks
• NAT box
  • Drops UDP packets
  • Timeout of 2 minutes
  • Inactive sockets
  • Inactive file I/O
[Diagram labels: Farm @ VU (Amsterdam University); EDG testbed (NIKHEF); PBS job manager; BaBar MC production software; Queue; GlideIn; Chirp; Parrot; NFS; GCB; Private network; A lot of computers; Some computers]

Consequences
• Different MDS scheme
  • Implemented EDG scheme for GlideIn
• Objectivity
  • A lot of debugging
  • Made Parrot mimic hostname and UID
  • Tricked Objectivity into using standard NFS libraries
• Aggressive NAT box
  • Changed GCB to use TCP instead of UDP
  • Used Parrot to keep sockets alive
  • Parrot recovers file I/O when the TCP connection is lost
• We are the first to run Objectivity cross-domain

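The idea of recovering file I/O after a lost TCP connection can be sketched as "remember the offset, retry the same read". The code below is only an illustration of that idea and not Parrot's actual implementation. `FlakySource`, `read_at`, and `read_all_with_recovery` are hypothetical names, and the dropped connection is simulated in memory.

```python
class FlakySource:
    """Hypothetical stand-in for a remote file whose connection can drop."""

    def __init__(self, data, fail_on_calls):
        self.data = data
        self.fail_on_calls = set(fail_on_calls)  # call numbers that fail
        self.calls = 0

    def read_at(self, offset, length):
        self.calls += 1
        if self.calls in self.fail_on_calls:
            raise ConnectionError("simulated lost TCP connection")
        return self.data[offset:offset + length]


def read_all_with_recovery(source, chunk_size=4, max_retries=3):
    """Read the whole source, retrying a failed chunk from the same offset."""
    out = bytearray()
    offset = 0
    while True:
        for attempt in range(max_retries + 1):
            try:
                chunk = source.read_at(offset, chunk_size)
                break
            except ConnectionError:
                # Give up only after the last allowed retry.
                if attempt == max_retries:
                    raise
        if not chunk:  # empty read marks end of file
            return bytes(out)
        out.extend(chunk)
        offset += len(chunk)
```

Because the caller only sees complete chunks, the application above it does not need to know that a connection was lost, which matches the slide's point that the fix lives in the tool rather than in the BaBar code.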
Performance
• Application initializes 10 times slower
• Production 3 times slower
[Plot: time (minutes) versus events, with two curves: production on EDG testbed and production on local machine]

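The two slowdown factors on this slide can be combined into a rough runtime estimate. The sketch below assumes a simple linear model (a fixed initialization cost plus a constant cost per event), which is an assumption and not something the slide states. The function name and the input values in the usage note are hypothetical and are not read off the plot.

```python
def estimated_grid_minutes(local_init_min, local_per_event_min, n_events,
                           init_slowdown=10.0, production_slowdown=3.0):
    """Estimate grid runtime from local timings under a linear model.

    The default slowdown factors (10x initialization, 3x production) are
    the ones quoted on the slide; everything else is an assumed input.
    """
    return (init_slowdown * local_init_min
            + production_slowdown * local_per_event_min * n_events)
```

For example, with an assumed local initialization of 5 minutes and 0.5 minutes per event, `estimated_grid_minutes(5, 0.5, 1000)` gives 50 minutes of initialization plus 1500 minutes of production. For long runs the production factor dominates, so the initialization penalty matters most for short jobs.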
Possible improvements
• Create more sophisticated tool to acquire resources
  • Resource planning, distribution, etc.
  • Maybe something fancy already exists?
• Parrot: Caching
  • On per directory basis
  • Requires debugging
[Diagram labels: Farm @ VU (Amsterdam University); EDG testbed (NIKHEF); PBS job manager; BaBar MC production software; Queue; GlideIn; Chirp; Parrot; NFS; GCB; Private network; A lot of computers; Some computers]

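Caching on a per directory basis could look roughly like the following sketch: a policy decides by directory prefix whether a file may be cached, and a reader consults it before storing anything. This is not Parrot's design. The class names, the `fetch` callback, and the example paths are all hypothetical.

```python
import posixpath


class DirectoryCachePolicy:
    """Decide per directory whether files below it may be cached."""

    def __init__(self, cached_dirs):
        self.cached_dirs = [posixpath.normpath(d) for d in cached_dirs]

    def should_cache(self, path):
        path = posixpath.normpath(path)
        # Match the directory itself or anything strictly below it.
        return any(path == d or path.startswith(d.rstrip("/") + "/")
                   for d in self.cached_dirs)


class CachingReader:
    """Fetch files through a callback, caching only where the policy allows."""

    def __init__(self, fetch, policy):
        self.fetch = fetch      # hypothetical remote fetch: path -> bytes
        self.policy = policy
        self.cache = {}
        self.fetch_count = 0

    def read(self, path):
        if path in self.cache:
            return self.cache[path]
        data = self.fetch(path)
        self.fetch_count += 1
        if self.policy.should_cache(path):
            self.cache[path] = data
        return data
```

A per directory policy lets rarely changing data be cached while frequently updated files are always fetched fresh. Which directories fall into which group is a deployment decision the slides do not specify.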
Move Chirp servers to private nodes
• Use Condor/GCB machinery for Chirp server
  • Solves security issues
  • Allows Chirp server to be on private nodes
  • Requires new Chirp-Condor implementation
[Diagram labels: Farm @ VU (Amsterdam University); EDG testbed (NIKHEF); PBS job manager; BaBar MC production software; Queue; GlideIn; Chirp; Parrot; NFS; GCB; Private network; A lot of computers; Some computers]

Move GCB to head node
• Move GCB to same machine as Central Manager
• Solution required for port conflicts
• Temporary solution: Move CM to a private node
[Diagram labels: Farm @ VU (Amsterdam University); EDG testbed (NIKHEF); PBS job manager; BaBar MC production software; Queue; GlideIn; Chirp; Parrot; NFS; GCB; Private network; A lot of computers; Some computers]

Use EDG data storage
• Write events to EDG data storage (gsiFTP)
• Requires debugging
[Diagram labels: Farm @ VU (Amsterdam University); EDG testbed (NIKHEF); PBS job manager; BaBar MC production software; Queue; GlideIn; Chirp; Parrot; NFS; GCB; EDG data storage; Private network; A lot of computers; Some computers]

Use more sites
• Let GCB manage several private networks at the same time
• Requires solution for conflicting private addresses
[Diagram labels: Farm @ VU (Amsterdam University); EDG testbed (NIKHEF); Other testbed; PBS job manager; BaBar MC production software; Queue; GlideIn; Chirp; Parrot; NFS; GCB; EDG data storage; Private network; A lot of computers; Some computers]

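The address conflict mentioned here arises when two sites behind NAT use overlapping private ranges, so that one address no longer identifies one machine. A minimal sketch of detecting such overlaps with Python's standard `ipaddress` module follows. The function name, the site names, and the address ranges in the usage note are hypothetical, and the sketch only detects conflicts without resolving them.

```python
import ipaddress
from itertools import combinations


def find_address_conflicts(site_networks):
    """Return pairs of sites whose private address ranges overlap.

    site_networks maps a site name to a CIDR string, for example
    {"site-a": "192.168.0.0/24"}.
    """
    nets = {site: ipaddress.ip_network(cidr)
            for site, cidr in site_networks.items()}
    return [(a, b) for a, b in combinations(sorted(nets), 2)
            if nets[a].overlaps(nets[b])]
```

For instance, sites using `192.168.0.0/24` and `192.168.0.0/16` would be reported as a conflicting pair, while a third site on `10.0.0.0/8` would not. A broker serving several sites would presumably need to qualify each private address with a site identifier to tell such hosts apart.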
Conclusions
• It works
  • BaBar MC production runs successfully on the NIKHEF EDG testbed
  • All this experimental software actually works when used together
• It looks easy
  • Our GRID setup is complicated, but…
  • Parrot hides problems related to local file access
  • GCB hides problems related to network configurations
  • GlideIn hides complications with resource gathering
  • The user can just submit his/her jobs to a local batch system
• There is some work to do
  • Performance could be better
    • Initialization 10 times slower
    • Production 3 times slower
    • Caching and (semi-)local event storage should improve this
  • Usability could be improved
    • GlideIn should have a tool to acquire resources
    • Several improvements proposed for GCB/Parrot
  • The improvements are done at the level of the “grid” tools
  • The user benefits without rewriting code