Torrent-based Software Distribution in ALICE
Costin.Grigoras@cern.ch
Outline
• Motivation
• How it works
• Site requirements
• History
• Migration status
Motivation
• ALICE was using site shared areas for installing the pre-compiled experiment software packages
• Large sites suffered from AFS/NFS/… scalability issues and from the shared area being a single point of failure
• Large space was needed for the many active versions
• The old model needed a site-local service to manage the installation, unpacking and deletion of the packages
• Strict site configuration was required to support operation, which excludes the use of ‘opportunistic’ resources/centres
• From the very beginning, the shared SW area and its access from the VO-box were considered a security risk
• All of the above and more are solved by using the Torrent protocol to distribute the software packages
Torrent terminology
• package.tar.gz.torrent: metadata of the original file
  • SHA1 of chunks
  • SHA1 of entire file
  • Tracker location
• package.tar.gz: held by the initial seeder, split into chunks of equal size
• Tracker: clients get the file info and advertise the hashes of their complete chunks
• Clients (leeches and seeders): exchange chunks, preferring high-speed peers
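The chunk hashing described on this slide can be sketched in a few lines of Python. This is a simplified model, not the real bencoded .torrent format: the dictionary keys, the function names and the tracker URL are all assumptions made for illustration.

```python
import hashlib
from typing import Dict, List


def piece_hashes(data: bytes, piece_size: int) -> List[str]:
    """Split data into equal-size chunks (the last one may be shorter)
    and return the SHA-1 hex digest of each chunk."""
    return [
        hashlib.sha1(data[i:i + piece_size]).hexdigest()
        for i in range(0, len(data), piece_size)
    ]


def build_metadata(name: str, data: bytes, piece_size: int,
                   tracker_url: str) -> Dict[str, object]:
    """Collect the items the slide lists for the .torrent file.
    The key names are hypothetical, chosen for readability."""
    return {
        "name": name,
        "length": len(data),
        "piece_size": piece_size,
        "pieces": piece_hashes(data, piece_size),      # SHA1 of chunks
        "file_sha1": hashlib.sha1(data).hexdigest(),   # SHA1 of entire file
        "tracker": tracker_url,                        # tracker location
    }


def verify_piece(piece: bytes, expected_hex: str) -> bool:
    """A client accepts a downloaded chunk only if its hash matches."""
    return hashlib.sha1(piece).hexdigest() == expected_hex
```

Because every chunk is verified on its own, a client can safely accept chunks from any peer and start re-sharing each one as soon as it is complete.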
How it works
• Build servers feed the software repository (one tar.gz / version)
• AliEn file catalogue: torrent://alitorrent.cern.ch/…
• Torrent seeder: alitorrent.cern.ch:8092
• Torrent tracker: alitorrent.cern.ch:8088
• Site X: WN 1, WN 2, … WN n
• Site Y: WN 1, WN 2, … WN n
• No seeding between sites
How it works (2)
• Build servers for SLC5 (32b, 64b), SLC6 (32b, 64b), Mac OS X, Ubuntu, …
• Software repository: 150GB in 600 archives
• Total size of a compressed (4x factor) software ‘set’ per job is ~300MB (this is what is downloaded to the WN)
• One central tracker and seeder
  • Limited to 50MB/s to the world
• Fallback to other download methods if the torrent download fails for any reason
  • wget, xrdcp
  • Packages fetched this way are seeded nevertheless
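The fallback behaviour on this slide could look roughly like the sketch below. The function name and the `seed` callback are hypothetical, and each downloader is a placeholder callable; in practice it would wrap the torrent client, wget or xrdcp, whose exact invocations are not given in the slides.

```python
from typing import Callable, Iterable, List, Tuple

# A downloader takes (source, destination) and raises on failure.
Downloader = Callable[[str, str], None]


def fetch_and_seed(source: str, dest: str,
                   methods: Iterable[Tuple[str, Downloader]],
                   seed: Callable[[str], None]) -> str:
    """Try each download method in order and return the name of the
    first one that succeeds. The file is handed to `seed` whichever
    method fetched it, so local peers can still get it over the LAN."""
    errors: List[str] = []
    for name, download in methods:
        try:
            download(source, dest)
        except Exception as exc:  # any failure triggers the next method
            errors.append(f"{name}: {exc}")
            continue
        seed(dest)
        return name
    raise RuntimeError("all download methods failed: " + "; ".join(errors))
```

A caller would pass the methods in preference order, for example torrent first, then wget, then xrdcp.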
How it works (3)
• Bootstrap
  • The pilot job script fetches the latest AliEn build by Torrent (20MB) and installs it on the local node (`pwd`)
  • The AliEn JobAgent gets a real job from the central queue and downloads the required software packages
  • It continues to seed them in the background so that other local agents can quickly get them over the LAN
  • The JA will run more jobs of the same type (user and SW requirements) within the TTL of the job
• Everything is downloaded into the sandbox of the job, so it is wiped at the end of its execution
Torrent features we use
• Clients explicitly publish their private IP in the central tracker
  • Allowing the discovery of LAN peers via this common service even behind NAT
• Local Peer Discovery
  • Multicast to discover peers on the same network
• Peer exchange
  • Peer lists are distributed between the local peers
• Distributed Hash Tables
  • Decentralized seeder lookup – seeders are trackers
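To illustrate the first feature, here is a toy in-memory tracker in which clients announce their private address. Grouping peers by the public address the tracker sees them from is an assumption made for this sketch (peers behind the same NAT share it); the slides do not say how the real tracker pairs peers, and the class and method names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Peer = Tuple[str, int]  # (private IP, listening port)


class ToyTracker:
    """Hypothetical tracker model, not the ALICE implementation."""

    def __init__(self) -> None:
        # (info_hash, public IP) -> private endpoints announced so far
        self._swarms: Dict[Tuple[str, str], Set[Peer]] = defaultdict(set)

    def announce(self, info_hash: str, public_ip: str,
                 private_ip: str, port: int) -> List[Peer]:
        """Register a client and return the other known peers that share
        its public address, listed by their private (LAN) endpoints."""
        key = (info_hash, public_ip)
        me = (private_ip, port)
        others = sorted(self._swarms[key] - {me})
        self._swarms[key].add(me)
        return others
```

In this model two worker nodes behind the same NAT learn each other's LAN addresses from the central service, while a node at another site never shows up in their peer list.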
Site requirements
• How to allow this to happen
  • iptables rules accepting:
    • Outgoing to alitorrent.cern.ch TCP/8088,8092
    • WN-to-WN on
      • TCP, UDP / 6881:6999 – aria2c default listening ports
      • UDP, IGMP -> 224.0.0.0/4 – local peer discovery
  • Typically this is already the case; in some cases the ports had to be whitelisted (very smart firewalls)
• Implicitly, sites do not exchange any torrent traffic between them
• No service to run on the site or on the machines, no shared area any more, no single point of failure, essentially no local support needed for this
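The firewall requirements above can be expressed as a small rule check, useful for reasoning about which traffic a site has to let through. This is a model of the rules only, not iptables configuration: the function name is hypothetical, and the central server addresses and the LAN range are supplied by the caller because the slides do not give them.

```python
import ipaddress

TRACKER_SEEDER_PORTS = {8088, 8092}              # alitorrent.cern.ch
PEER_PORTS = range(6881, 7000)                   # 6881:6999 inclusive
MULTICAST = ipaddress.ip_network("224.0.0.0/4")  # local peer discovery


def is_allowed(proto, dst_ip, dst_port=None, *, central_ips, lan):
    """Return True if a packet from a worker node matches one of the
    three accept rules listed on the slide."""
    addr = ipaddress.ip_address(dst_ip)
    # Rule 1: outgoing TCP to the central tracker and seeder.
    if proto == "tcp" and addr in central_ips and dst_port in TRACKER_SEEDER_PORTS:
        return True
    # Rule 2: WN-to-WN traffic on the peer listening ports.
    if proto in ("tcp", "udp") and addr in lan and dst_port in PEER_PORTS:
        return True
    # Rule 3: multicast used for local peer discovery.
    if proto in ("udp", "igmp") and addr in MULTICAST:
        return True
    return False
```

With only these rules in place, a connection to a worker node at another site matches none of them, which is consistent with sites not exchanging torrent traffic.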
History
• The deployment has faced only policy difficulties
  • Eventually accepted after understanding the technology
  • There is no evil technology, only evil use…
• First tests at CERN in 02.2009
• Site deployments starting 06.2009
  • As the shared areas were proving insufficient
  • First at the large sites, in operation for 2 years
• Presented in various forums within the collaboration and at CHEPs
• Large awareness call in 01.2012 at the ALICE T1/T2 Workshop in Karlsruhe
Migration status
• First transitions done in close collaboration with the sites
  • Debugging on the WNs, following up the consequences on the local network, firewalls and such
• One month ago we asked all sites for permission to enable torrent
  • Most have confirmed that their policy allows the torrent protocol, have checked the firewall policies, and now run torrent
  • Working with the rest to solve the (mostly) non-technical issues
  • Some mails went to unread mailboxes …
Migration status (2)
• T0 – in operation for 3 years
• T1s – 5 / 6 migrated
• T2s – 36 / 78 migrated
• Currently covering 2/3 of the resources, so on average more than 20K concurrent jobs are using torrent
• Rock solid, very efficient technology
  • No incidents reported
• Aiming for full migration by the time the next AliEn version is deployed, to completely drop the PackManVoBox service and the need for a shared SW area and caches
Conclusion
• Torrents have enabled us to
  • Simplify site operations by removing a VoBox service and the shared SW areas
  • Significantly reduce problems associated with SW deployment, relieving the sites' support staff
  • Have quick software release cycles (both experiment and Grid middleware)
• The migration process was carefully staged
  • Policy limitations clarified in discussion with security experts
  • Discussions and deployment at T0/T1s and selected T2s (regional coverage)
  • Presently – towards complete site coverage
• Lifts some of the requirements for a site VoBox, specific configurations and services
• Forward-looking system - towards opportunistic use of resources and clouds!