Tier0 Status. Tony Cass, LCG-LHCC Referees Meeting, 16th February 2009.
Tier0 Status
Tony Cass
LCG-LHCC Referees Meeting, 16th February 2009
Tier0 Status - 1
Agenda
• Resources
• CASTOR status and performance
• Progress with new data centre project
Tier0 Status - 2
November Status: Status of 2009 procurements
• CPU
  • First batch
    • Ordered in late August
    • Delivery before November
    • Production in December or early 2009
  • Second batch
    • Received the tender answers
    • Target FC approval in December
    • Delivery before March 2009
    • Production in March – April 2009
• Disk
  • First batch
    • FC approval last week
    • Delivery in December
    • Production January 2009
  • Second batch
    • Received the tender answers
    • Target FC approval in December
    • Delivery before March 2009
    • Production in March – April 2009
• Tape
  • Media availability not a problem, but exact procurement schedule depends on progress with new repack service between now and beginning of 2009
2 of 3 batches already on site
No orders issued following December statement on likely schedule
FC approval not required, but delivery schedule unchanged (installation depends on readiness of racks)
January February
No orders issued following December statement on likely schedule
70 Sun T10KB drives ordered (1TB/cartridge)
T10KA drives to be phased out as repack advances.
Tier0 Status - 4
Procurements: 2009 Status & 2010 outlook
• CPU & Disk
  • ~60% of foreseen 2009 pledges available in April
    • (Additional ATLAS request not included)
  • Balance to be operational in October
    • Tight schedule, but agreed with Purchasing dept.
  • Exploring options to purchase iSCSI disk storage
    • Greater cost/TB, but avoids interruption to CASTOR service due to disk server failure (#1 cause of incidents; disk failures are handled transparently)
  • 2010 procurement planning underway
    • Tenders issued in June; adjudication in ~November.
• Tape
  • Expect ~20PB spare capacity by October.
  • Will purchase “high density” IBM robot in autumn
    • 14,000 slots (14PB)
  • Can convert an existing IBM robot to “high density” version in 2010 (with no service interruption) if additional capacity required.
Tier0 Status - 5
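The tape figures on this slide (14,000 slots giving 14PB) imply roughly 1TB per slot. The following is a minimal sketch of that arithmetic, assuming decimal units (1PB = 1000TB) and one cartridge per slot; the function names are hypothetical and are not taken from any CERN tool.

```python
def robot_capacity_pb(slots: int, tb_per_cartridge: float) -> float:
    """Total capacity in PB if every slot holds one cartridge (assumption)."""
    return slots * tb_per_cartridge / 1000.0  # decimal units: 1 PB = 1000 TB


def implied_tb_per_cartridge(slots: int, capacity_pb: float) -> float:
    """Per-cartridge capacity in TB implied by a slot count and a total in PB."""
    return capacity_pb * 1000.0 / slots


# Figures quoted on the slide: 14,000 slots and 14 PB.
print(robot_capacity_pb(14_000, 1.0))
print(implied_tb_per_cartridge(14_000, 14.0))
```

The same helper can be reused to see how capacity would scale if a later cartridge generation stored more per slot.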
Resource Usage Efficiency (CPU...)
• CPU/Wall ratio has long been a concern:
• But utilisation of the public LXBATCH cluster is generally high:
• Still, we see many jobs waiting for tape recalls
  • New “backfill” option introduced to schedule short jobs when long waits for tape expected.
  • Nice improvement seen:
  • Need to review settings and publicise to improve impact.
Tier0 Status - 6
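To make the CPU/Wall concern concrete, here is a small sketch of how such a ratio can be computed and how a job stalled on a tape recall pulls it down. The `Job` record, the sample numbers, and the `fits_backfill` rule are illustrative assumptions; they do not describe the actual batch system configuration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    cpu_seconds: float   # CPU time actually consumed
    wall_seconds: float  # elapsed time the job held a batch slot


def cpu_wall_ratio(jobs):
    """Aggregate CPU time divided by aggregate wall-clock time."""
    total_wall = sum(j.wall_seconds for j in jobs)
    if total_wall == 0:
        return 0.0
    return sum(j.cpu_seconds for j in jobs) / total_wall


def fits_backfill(short_job_seconds, expected_tape_wait_seconds):
    """Hypothetical rule: a short job may run if it fits inside the expected wait."""
    return short_job_seconds <= expected_tape_wait_seconds


# One fully busy job and one that spent 45 of its 60 minutes waiting for tape.
jobs = [Job(cpu_seconds=3600, wall_seconds=3600),
        Job(cpu_seconds=900, wall_seconds=3600)]
print(cpu_wall_ratio(jobs))      # 4500 / 7200
print(fits_backfill(600, 2700))  # a 10-minute job fits a 45-minute wait
```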
SLC5 Migration
• Migration of batch resources underway
  • All new capacity introduced will be SLC5 based
  • Existing capacity migrated progressively.
• Migration of LXPLUS alias is an issue:
  • Principle is easy: switch when majority of batch capacity is SLC5. But measured where?
    • @ CERN: switch early
    • on grid: switch late.
  • No clear/obvious solution yet.
  • [Rapid migration of other grid sites would help. And is maybe sensible before September anyway?]
Tier0 Status - 7
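The "measured where?" question can be illustrated with a short sketch: the same majority rule gives different answers depending on the scope over which capacity is counted. The capacity numbers below are invented purely for illustration, and the function names are hypothetical.

```python
def slc5_fraction(capacity_by_os):
    """Fraction of capacity running SLC5 within one measurement scope."""
    total = sum(capacity_by_os.values())
    return capacity_by_os.get("SLC5", 0) / total if total else 0.0


def should_switch_alias(capacity_by_os):
    """Majority rule from the slide: switch once more than half is SLC5."""
    return slc5_fraction(capacity_by_os) > 0.5


# Invented numbers: CERN migrates early, the wider grid lags behind.
cern = {"SLC4": 400, "SLC5": 600}
grid = {"SLC4": 9000, "SLC5": 1000}

print(should_switch_alias(cern))  # True: measured at CERN, switch early
print(should_switch_alias(grid))  # False: measured on the grid, switch late
```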
Agenda
• Resources
• CASTOR status and performance
  • Upstream services (SRM, FTS)
  • CASTOR status & plans
  • Metrics
• Progress with new data centre project
Tier0 Status - 9
November Status: SRM & FTS
• SRM 2.7 release is delayed
  • Originally foreseen in June but has still not passed testing/certification
  • Continue with 1.3 until LHC shutdown
    • SLC3 – hardware running out of warranty; retire/replace
    • Cannot be deployed in a fully redundant configuration
    • Built with an old CASTOR client, which constrains the stager deployment
Pre-production clusters in service for all LHC VOs
Production deployment before end-2008
• FTS 2.1 passed certification too close to LHC startup
  • Continue with 2.0 service (SLC3)
  • Setting up an independent 2.1 production service (SLC4) in parallel, allowing VOs to move when convenient
FTS 2.1 production service available
Still being “tested” by experiments but most production transfers already with this version
Tier0 Status - 10
CASTOR Status & Plans
• Status
  • Generally quiet/good...
  • ... except for tape repack
  • BUT we are reasonably confident about our ability to support production; user analysis is the concern and there is no major load.
• CASTOR 2.1.8, with integrated xrootd redirector, should deliver improvements for analysis
  • LSF bypass & reduced latency, but also improved scalability as xrootd daemon has smaller footprint than rfio (to be deprecated?)
  • Also delivers
    • end-to-end checksumming for rfio
    • User space accounting (required for later deployment of quotas)
    • operational improvements (notably automatic draining of disk servers)
    • fixes to problems identified by repack (main reason for deployment delays)
• Schedule: end-Feb release, in production on c2cernt3 end-March, deployment for experiment instances in April.
Tier0 Status - 11
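As a rough illustration of what end-to-end checksumming means in general, the sketch below computes a running checksum over the chunks sent and compares it with one computed over the chunks received. The use of Adler-32 (via Python's `zlib`) and the function names are assumptions made for this example; the slide does not say which algorithm or interface CASTOR 2.1.8 uses.

```python
import zlib


def stream_checksum(chunks):
    """Running Adler-32 over an iterable of byte chunks (algorithm is an assumption)."""
    value = 1  # Adler-32 starting value
    for chunk in chunks:
        value = zlib.adler32(chunk, value)
    return value & 0xFFFFFFFF


def verify_transfer(sent_chunks, received_chunks):
    """True if data checksummed at the source matches data seen at the destination."""
    return stream_checksum(sent_chunks) == stream_checksum(received_chunks)


sent = [b"event-", b"data-", b"block"]
print(verify_transfer(sent, [b"event-data-", b"block"]))  # same bytes, different chunking
print(verify_transfer(sent, [b"event-data-", b"blocK"]))  # one corrupted byte
```

Because the checksum is computed over the byte stream rather than per chunk, it does not depend on how the transfer happens to be split up along the way.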
November Status: Performance metrics
• Metrics have been implemented and deployed on preproduction cluster
  • Data collected in Lemon
  • RRD graphs not yet implemented
• Production deployment delayed for several reasons
  • New metrics imply several changes to exception/alarms and automated actions used in production
  • An unexpected technical dependency on the late SRM 2.7 version
    • Ongoing work to back-port the implementation
All still true
• Much progress, but little visible; considering how best to group metrics for display
  • e.g. group cache hits and garbage collection activity?
However...
Tier0 Status - 12
Agenda
• Resources
• CASTOR status and performance
• Progress with new data centre project
Tier0 Status - 16
New data centre project
• Reminder: the selected strategy is to do a single tender for an overall solution
• Four-phase process developed:
  • Request (many) conceptual designs
  • Commission 3-4 companies submitting conceptual designs to develop an outline design
  • In-house, turn a selected outline design into plans and documents enabling
  • Single tender for overall construction.
Tier0 Status - 17
November Status: Outline Design Phase
• Deadline: 28th November
  • Contacts with all 4 companies during design phase
  • All 4 companies say deadline will be met
  • Meetings to review proposed designs scheduled in week of December 8th.
• Market Survey in preparation as first stage in selection of company for detailed design & construction.
• Discussions in Oslo on 28th November to further investigate possible remote server installation in 2011 (and beyond)
  • RAL also have power available in 2011, but not as much and for a shorter period.
Tier0 Status - 18
Current Status
• Four designs reviewed
  • No clear winner, but consensus on leading design.
• New Management supports project. Good, but…
  • New requirements: “Green” & Prévessin heat recovery option
  • New organisation brings new players to brief
• “Single Contract for construction” agreed
  • Agreement to work with one company to deliver fully acceptable design with modifications for new requirements.
  • Will lead to ~6 month delay.
  • [Personal view] Plan to continue with only one company should be agreed by Directorate now to avoid potential hiccups later. Frédéric Hemmer discussing with Sergio Bertolucci.
• Will need to revisit option to install equipment at University of Oslo.
Tier0 Status - 19
Questions? Comments?
Tier0 Status - 20