DRI Grant impact at the smaller sites. Pete Gronbech, September 2012, GridPP29 Oxford
Target Areas • Internal Cluster networking • Cluster to JANET interconnect • Resilience and redundancy
Cluster Networking • Most sites' clusters have been interconnected at 1 Gb/s • As storage servers grew from ~20 TB to 40 TB, and on to 36-bay units with usable capacities of ~70 TB, the network links had to grow to cope with the number of simultaneous connections from worker nodes • Many sites decided to use trunked or bonded 1 Gb/s links, working on the basis of roughly one 1 Gb/s link per 10 TB • This no longer scales once the very large servers need 6 bonded links • Gigabit networking starts to look expensive when the usable port count of a switch has to be divided by 6 • 10 Gbit switch prices are coming down (see the sketch below)
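A minimal sketch of the arithmetic behind these bullets, using the ~10 TB per 1 Gb/s link rule stated above. The per-switch port figures are illustrative assumptions, not quoted hardware specs:

```python
# The slide's rule of thumb: roughly one 1 Gb/s bonded link per 10 TB
# of usable storage. The 48-port switch figure below is a hypothetical
# example, chosen only to show why heavy bonding makes gigabit costly.
import math

def bonded_links_needed(usable_tb: float, tb_per_link: float = 10.0) -> int:
    """Number of bonded 1 Gb/s links suggested by the ~10 TB/link rule."""
    return math.ceil(usable_tb / tb_per_link)

for usable_tb in (20, 40, 70):
    links = bonded_links_needed(usable_tb)
    print(f"{usable_tb:>3} TB server -> {links} x 1 Gb/s bonded links")

# Dividing a 48-port gigabit switch by 6-link bonds serves only 8 servers,
# which is why 10 Gbit ports start to win once servers reach ~70 TB.
print("servers per 48-port 1G switch at 6 links each:", 48 // 6)
```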
DRI Grant • Has allowed the sites to make the jump to 10 Gbit cluster switching earlier than they had planned • Has allowed some degree of future proofing, by providing enough ports to cover expected cluster expansion over the next few years • Replacing bonded gigabit with 10 Gbit simplifies and tidies up the cabling and configuration (hopefully less to go wrong)
Campus connectivity • Many Grid clusters had 1 or 2 Gbit/s connections to the campus WAN • Many sites have used grant funding to install routers connecting to the campus backbone at 10 Gbit/s • If the campus backbone is itself made up of 10 Gbit links, the danger is that the grid cluster could saturate some of them, blocking other traffic to the JANET connection • Links therefore have to be doubled up on the route to the campus router • The university's JANET connection has to be increased, or the grid link capped, so that both grid and campus traffic can flow unhindered • The alternative is to install a bypass link directly to the JANET router (a rough capacity check follows)
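A hedged back-of-the-envelope check of the saturation risk described above. All of the traffic figures are illustrative assumptions, not measured values from any site:

```python
# If the grid cluster can source close to a full 10 Gb/s, a shared
# 10 Gb/s backbone link leaves little headroom for ordinary campus
# traffic. The numbers below are assumed for illustration only.
backbone_gbps = 10.0      # single shared campus backbone link
grid_peak_gbps = 8.0      # assumed sustained grid transfer peak
campus_other_gbps = 3.0   # assumed normal campus demand

demand = grid_peak_gbps + campus_other_gbps
if demand > backbone_gbps:
    print(f"Oversubscribed by {demand - backbone_gbps:.1f} Gb/s: "
          "double up the links, cap the grid, or bypass to JANET.")
else:
    print("Headroom remains on the shared backbone link.")
```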
Resilience • Where network upgrades were purchased at a cost less than anticipated, some funds were used to upgrade critical service nodes or infrastructure • Storage server head nodes, caching servers, UPS units and improved firewalls were the items chosen by different institutes • All sites were allocated some funds to purchase monitoring nodes; the original intention was to run Gridmon, but the plan changed to PerfSonar (a toy throughput check is sketched below) • The end result is that the Grid clusters at the sites are in a much stronger position than before and will provide a robust service
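To make the role of the monitoring nodes concrete, here is a toy sketch of the kind of end-to-end throughput measurement perfSONAR automates (alongside latency and loss tests). This is not the perfSONAR tool itself; the port number and durations are hypothetical:

```python
# Toy two-host throughput probe: run "server" on one monitoring node
# and "client <host>" on another. Illustrative only; perfSONAR's real
# measurement suite is far more careful than this.
import socket, sys, time

PORT = 5201           # hypothetical test port
CHUNK = 64 * 1024     # 64 KiB send buffer
DURATION = 5          # seconds the client transmits for

def server() -> None:
    with socket.create_server(("", PORT)) as srv:
        conn, addr = srv.accept()
        total = 0
        start = time.monotonic()
        with conn:
            # Count bytes until the client closes its side.
            while data := conn.recv(CHUNK):
                total += len(data)
        elapsed = time.monotonic() - start
        print(f"{total * 8 / elapsed / 1e9:.2f} Gb/s from {addr[0]}")

def client(host: str) -> None:
    payload = b"\0" * CHUNK
    deadline = time.monotonic() + DURATION
    with socket.create_connection((host, PORT)) as sock:
        while time.monotonic() < deadline:
            sock.sendall(payload)

if __name__ == "__main__":
    server() if sys.argv[1] == "server" else client(sys.argv[2])
```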
Lancaster • That Network Upgrade... The mad-scramble network uplift plan for Lancaster took a 3-pronged approach: 1. Upgrade and shanghai the University's backup link: 10G (mostly) just for us 2. Increase connectivity to the campus backbone, and thus between the two "halves" of the grid cluster and the local HEP cluster 3. Add capacity for 10G networking in the cluster using a pair of Z9000 core switches and half a dozen S4810 rack switches • This frees up some of the current switches, which can be retasked to improve the HEP cluster networking
RHUL • Now have 2x1 Gb/s links to Janet, trunked; the second link was added on 7th March but could not be utilised until the old 1 Gb/s firewall was replaced • Network upgraded from a stack of 8x Dell PC6248 (1 Gb/s) to a 2x Force10 S4810 10 Gb/s spine, with the PC6248s attached as leaves by 2x10 Gb/s to each F10 • The old 1 Gb/s firewall is out of warranty/support, to be replaced soon with a Juniper SRX650 (7 Gb/s max)
Sussex • 4x 36-port InfiniBand switches • IB switches arranged in a fat-tree topology (capacity arithmetic sketched below)
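A hedged sketch of the non-blocking arithmetic behind a two-level fat tree built from 36-port switches. The 2-leaf split below is an assumption for illustration; the slide does not say how Sussex divided its four switches:

```python
# General fat-tree rule: each leaf devotes half its ports to hosts and
# half to uplinks, preserving full bisection bandwidth. Leaf count here
# is assumed, not taken from the Sussex configuration.
def fat_tree_capacity(ports_per_switch: int, leaves: int) -> dict:
    down = ports_per_switch // 2          # host-facing ports per leaf
    up = ports_per_switch - down          # uplink ports per leaf
    return {
        "hosts_non_blocking": down * leaves,
        "uplinks_per_leaf": up,
        "max_hosts_full_topology": ports_per_switch ** 2 // 2,
    }

print(fat_tree_capacity(ports_per_switch=36, leaves=2))
# -> 36 hosts at full bisection bandwidth with 2 leaf switches;
#    a complete two-level 36-port fat tree could scale to 648 hosts.
```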
Common Themes • Well-planned cluster networking, balanced and future proof • A vast improvement on the ad hoc, cost-limited designs they replaced • Have brought tangible benefits.
FTS Transfer Rates (charts): transfers to Oxford • transfers from Oxford
Benefits • In August 2012, file transfers to Oxford hit the 5 Gbit/s rate cap for several hours (see the arithmetic below).
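The simple arithmetic behind that observation: sustained transfers at the 5 Gb/s cap move a sizeable fraction of a large storage server per hour. The durations below are illustrative:

```python
# Data volume moved at the 5 Gb/s cap over a few example durations.
rate_gbps = 5.0
for hours in (1, 4, 8):
    # Gb/s -> GB/s (divide by 8), times seconds, converted to TB.
    tb_moved = rate_gbps / 8 * 3600 * hours / 1000
    print(f"{hours} h at {rate_gbps} Gb/s ~ {tb_moved:.1f} TB transferred")
```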
Performance Tuning / Future • Now need to concentrate on improving FTS transfers to the remaining slow sites • Good monitoring is required, both locally and nationally • PerfSonar is being installed across the sites (see next talk) • Work with JANET and site networking teams to increase JANET connectivity where required.