Update on replica management Costin.Grigoras@cern.ch
Replica discovery algorithm
• To choose the best SE for any operation (upload, download, transfer) we rely on a distance metric:
  • Based on the network distance between the client and all known IPs of the SE
  • Altered by current SE status
    • Writing: usage + weighted write reliability history
    • Reading: weighted read reliability history
  • Static promotion/demotion factors per SE
  • Small random factor for democratic distribution
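As an illustration of how these components could combine, here is a minimal Java sketch of the ranking. The field names, weights and the size of the random jitter are assumptions made for illustration, not the actual jAliEn code.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/**
 * Minimal sketch of the SE ranking described above; field names, weights and
 * the jitter magnitude are illustrative assumptions, not the jAliEn code.
 */
public class SERanker {

    /** Simplified view of a storage element for ranking purposes. */
    public static class SECandidate {
        final String name;
        final double networkDistance;   // 0 (same C-class) .. 1 (far, far away)
        final double readPenalty;       // weighted read failure history, 0 = perfect
        final double writePenalty;      // weighted write failure history, 0 = perfect
        final double usagePenalty;      // derived from the remaining free space
        final double staticFactor;      // per-SE promotion (<0) or demotion (>0)

        public SECandidate(String name, double networkDistance, double readPenalty,
                           double writePenalty, double usagePenalty, double staticFactor) {
            this.name = name;
            this.networkDistance = networkDistance;
            this.readPenalty = readPenalty;
            this.writePenalty = writePenalty;
            this.usagePenalty = usagePenalty;
            this.staticFactor = staticFactor;
        }
    }

    private static final Random RNG = new Random();

    /** Composite distance for reading: network distance + read history + static factor + jitter. */
    static double readDistance(SECandidate se) {
        return se.networkDistance + se.readPenalty + se.staticFactor
                + 0.01 * RNG.nextDouble();   // small random factor for democratic distribution
    }

    /** Composite distance for writing: usage / free space is also taken into account. */
    static double writeDistance(SECandidate se) {
        return se.networkDistance + se.writePenalty + se.usagePenalty + se.staticFactor
                + 0.01 * RNG.nextDouble();
    }

    /** Order the candidates so that the "closest" SE (lowest composite distance) comes first. */
    static void rankForWriting(List<SECandidate> candidates) {
        Map<SECandidate, Double> score = new HashMap<>();
        for (SECandidate se : candidates)
            score.put(se, writeDistance(se));
        candidates.sort(Comparator.comparingDouble(score::get));
    }
}
```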
Network distance metric
distance(IP1, IP2) increases from 0 to 1 through the following cases:
• same C-class network (distance 0)
• same DNS domain name
• same AS
• f(RTT(IP1, IP2)), if known
• same country: + f(RTT(AS(IP1), AS(IP2)))
• same continent: + f(RTT(AS(IP1), AS(IP2)))
• far, far away (distance 1)
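A sketch of this classification in Java follows. The ordering of the cases comes from the slide above, while the concrete numeric steps, the RTT scaling and the TopologyInfo interface are illustrative assumptions.

```java
/**
 * Sketch of the distance classification above; the ordering of the cases comes
 * from the slide, while the numeric steps and the RTT scaling are assumptions.
 */
public class NetworkDistance {

    /** Minimal, assumed view of what is known about a pair of endpoints. */
    public interface TopologyInfo {
        boolean sameCClassNetwork();
        boolean sameDnsDomain();
        boolean sameAS();
        boolean hasMeasuredRTT();
        double rttMillis();         // direct RTT between the two IPs, if measured
        boolean sameCountry();
        boolean sameContinent();
        double asLevelRttMillis();  // RTT between the two autonomous systems
    }

    /** Map an RTT in milliseconds to a small additive term (assumed scaling and cap). */
    static double rttFactor(double rttMillis) {
        return Math.min(0.2, rttMillis / 1000.0);
    }

    /** Distance between two endpoints, from 0 (same C-class network) to 1 (far, far away). */
    static double distance(TopologyInfo pair) {
        if (pair.sameCClassNetwork())
            return 0.0;
        if (pair.sameDnsDomain())
            return 0.1;
        if (pair.sameAS())
            return 0.2;
        if (pair.hasMeasuredRTT())
            return 0.3 + rttFactor(pair.rttMillis());
        if (pair.sameCountry())
            return 0.5 + rttFactor(pair.asLevelRttMillis());
        if (pair.sameContinent())
            return 0.7 + rttFactor(pair.asLevelRttMillis());
        return 1.0;
    }
}
```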
Network topology (diagram)
SE status component
• Driven by the functional add/get tests (12/day)
• Failing the last test => heavy demotion
• Distance increases with a reliability factor:
  • ¾ last day failures + ¼ last week failures
  • http://alimonitor.cern.ch/stats?page=SE/table
• The remaining free space is also taken into account for writing, with:
  • f(ln(free space / 5TB))
  • Storages with a lot of free space are slightly promoted (cap on the promotion), while the ones running out of space are strongly demoted
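The status terms could look roughly like the sketch below. The ¾ + ¼ weighting and the ln(free space / 5TB) shape come from the slide; the magnitude of the demotion, the promotion cap and the scale factor are assumptions.

```java
/**
 * Sketch of the SE status terms above; the 3/4 + 1/4 weighting and the
 * ln(free space / 5TB) shape come from the slide, the magnitudes of the
 * demotion, the promotion cap and the scale factor are assumptions.
 */
public class SEStatusFactors {

    /** Failing the most recent functional add/get test causes a heavy demotion. */
    static double lastTestPenalty(boolean lastTestFailed) {
        return lastTestFailed ? 10.0 : 0.0;           // assumed magnitude
    }

    /** Weighted reliability penalty: recent failures weigh more than older ones. */
    static double reliabilityPenalty(double lastDayFailureRate, double lastWeekFailureRate) {
        return 0.75 * lastDayFailureRate + 0.25 * lastWeekFailureRate;
    }

    /** Free-space term for writing: f(ln(free space / 5TB)), promotion capped, low space strongly demoted. */
    static double freeSpaceFactor(double freeSpaceBytes) {
        final double FIVE_TB = 5.0 * 1024 * 1024 * 1024 * 1024;
        final double x = Math.log(freeSpaceBytes / FIVE_TB);
        // plenty of space => small (capped) promotion; running out of space => strong demotion
        return Math.max(-0.1, -0.05 * x);             // assumed cap and scale
    }
}
```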
What we gained
• Maintenance-free system
  • Automatic discovery of resources combined with monitoring data
• Efficient file upload and access
  • From the use of well-connected, functional SEs
  • The local copy is always preferred for reading; if there is a problem with it, the other copies are also close by (RTT is critical for remote reading)
  • Writing falls back to progressively more remote locations until the initial requirements are met
Effects on the data distribution
• Raw data-derived files stay clustered around CERN and the T1 that holds a copy
  • job splitting is thus efficient
Effects on MC data distribution
• Some simulation results are spread over ~all sites and in various combinations of SEs
  • yielding inefficient job splitting
  • this translates into more merging stages for the analysis
    • affecting some analysis types
    • overhead from more, shorter jobs
    • no consequence for job CPU efficiency
• (plot of a very bad case)
Merging stages impact on trains
• Merging stages are a minor contributor to the analysis turnaround time (few jobs, high priority)
• Factors that do affect the turnaround:
  • Many trains starting at the same time in an already saturated environment
  • Sub-optimal splitting, with its overhead
  • Resubmission of a few pathological cases
• The cut-off parameters in LPM could be used: at the price of cutting off 2 out of 7413 jobs, the above analysis would finish in 5h
How to fix the MC case
• Old data: consolidate replica sets into larger, identical baskets for the job optimizer to split optimally
• With Markus' help we are now in the testing phase on a large data set for a particularly bad train
  • 155 runs, 58K LFNs (7.5TB), 43K transfers (1.8TB)
  • target: 20 files / basket
• Waiting for the next departure to evaluate the effect on the overall turnaround time of this train
How to fix the MC case (2)
• The algorithm tries to find the smallest number of operations that would yield large enough baskets
  • Taking SE distance into account (the same kind of metric as for the discovery; in particular, usage is also considered, which keeps data nearby for fallbacks etc.)
• jAliEn can now move replicas (delete after copy), copy to several SEs at the same time, and do delayed retries
• TODO: implement a "master transfer" to optimize the two stages of the algorithm (first the copy & move operations, then delete the extra replicas at the end)
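A greedy sketch of the consolidation planning is shown below, under the assumption that the common target is simply the SE set already holding the most files. The class and method names are illustrative, this is not the jAliEn implementation, and for brevity it ignores the SE distance term mentioned above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/**
 * Greedy sketch of the consolidation planning: pick, as the common target, the
 * replica set that already holds the most files, then plan the copies and the
 * deferred deletions needed to align the rest. Names are illustrative; this is
 * not the jAliEn implementation and it ignores SE distance for brevity.
 */
public class ReplicaConsolidator {

    /** Choose as target the replica set that is already the most common one (fewest operations). */
    static Set<String> chooseTargetSet(Map<String, Set<String>> replicasByLFN) {
        Map<Set<String>, Integer> frequency = new HashMap<>();
        for (Set<String> seSet : replicasByLFN.values())
            frequency.merge(seSet, 1, Integer::sum);
        return Collections.max(frequency.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    /** Plan the operations that bring every LFN onto the target SE set. */
    static List<String> planOperations(Map<String, Set<String>> replicasByLFN, Set<String> targetSEs) {
        List<String> operations = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : replicasByLFN.entrySet()) {
            String lfn = entry.getKey();
            Set<String> current = entry.getValue();
            for (String se : targetSEs)
                if (!current.contains(se))
                    operations.add("copy " + lfn + " -> " + se);
            for (String se : current)
                if (!targetSEs.contains(se))
                    operations.add("delete " + lfn + " @ " + se + " (only after the copies succeed)");
        }
        return operations;
    }
}
```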
Option for future MC productions
• Miguel's implementation of the output location extension:
  • "@disk=2,(SE1;SE2;!SE3)"
  • The distance to the indicated SEs is altered by +/- 1
    • after the initial discovery, so broken SEs are eliminated and location is still taken into account
• The set should be:
  • large enough (ln(subjobs)?)
  • set at submission time, per masterjob
  • with a different value each time, e.g.:
    • a space- and reliability-weighted random set of working SEs
• Caveats:
  • Inefficiencies for writing and reading
  • Not using the entire storage space, and later on not using all the available CPUs for analysis (though a large production would)
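One possible reading of the "space- and reliability-weighted random set of working SEs" is sketched below. The weighting by free space and reliability and the ln(subjobs) set size come from the slide; everything else (names, sample values, the selection scheme) is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

/**
 * Sketch of a "space- and reliability-weighted random set of working SEs".
 * The sample values and the exact selection scheme are illustrative
 * assumptions built on the slide, not production code.
 */
public class RandomSESet {

    static class SE {
        final String name;
        final double freeSpaceTB;
        final double reliability;    // 0..1, e.g. fraction of recent successful add/get tests

        SE(String name, double freeSpaceTB, double reliability) {
            this.name = name;
            this.freeSpaceTB = freeSpaceTB;
            this.reliability = reliability;
        }
    }

    /** Draw 'count' distinct SEs, with probability proportional to free space x reliability. */
    static List<SE> pick(List<SE> workingSEs, int count, Random rng) {
        List<SE> pool = new ArrayList<>(workingSEs);
        List<SE> chosen = new ArrayList<>();
        while (chosen.size() < count && !pool.isEmpty()) {
            double total = 0;
            for (SE se : pool)
                total += se.freeSpaceTB * se.reliability;
            double r = rng.nextDouble() * total;
            SE selected = pool.get(pool.size() - 1);   // fallback against rounding
            for (SE se : pool) {
                r -= se.freeSpaceTB * se.reliability;
                if (r <= 0) {
                    selected = se;
                    break;
                }
            }
            pool.remove(selected);
            chosen.add(selected);
        }
        return chosen;
    }

    public static void main(String[] args) {
        // Slide suggestion: the set size could scale as ln(number of subjobs)
        int subjobs = 5000;
        int setSize = Math.max(2, (int) Math.round(Math.log(subjobs)));
        List<SE> working = Arrays.asList(
                new SE("SE1", 120, 0.98), new SE("SE2", 40, 0.95),
                new SE("SE3", 300, 0.90), new SE("SE4", 15, 0.99));
        for (SE se : pick(working, setSize, new Random()))
            System.out.println(se.name);
    }
}
```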
Summary
• Two possibilities to optimize replica placement (with the current optimizer):
  • Implement in LPM the algorithm described before
  • Trigger the consolidation algorithm at the end of a production/job
• And/or fix the "se_advanced" splitting method so that the SE sets become irrelevant