ECMWF plays a vital role in NWP, global forecasts, atmospheric composition monitoring, climate reanalysis, supercomputing, data archiving, and education programmes. This talk covers business operations, a migration incident, data growth projections, TS4500 acceptance tests, and the HPSS 7.5.3 migration and testing.
3rd French HPSS users Meeting Jessica Orban Jessica.orban@ecmwf.int
European cooperation at its best
• ECMWF’s role is to address the critical and most difficult research problems in medium-range NWP (Numerical Weather Prediction) that no one country could tackle on its own
• Deliverables and research
  • Global numerical weather forecasts
  • Composition of the atmosphere: monitoring and forecasting
  • Climate reanalysis: monitoring
  • Supercomputing & data archiving
  • Education programme
European Centre for Medium-Range Weather Forecasts
Summary
• Business as usual
• Migration incident
• Data growth for the next years
• TS4500 acceptance tests
• HPSS 7.5.3 migration
• HPSS 7.5.3 testing
• Bologna
Business as usual
• 3 environments: prod, preprod and test
• HPSS 7.4.2u1p1
• AIX 6.1
• 1 partition
• 3 subsystems:
  • General
    • 2 TiB of disks (4 LUNs of 512 GiB)
    • 7.04 PiB of tapes (431 tapes)
    • 401 883 files, 6 184 directories, 4 filesets
  • MARS
    • No disk
    • 357.18 PiB of tapes (53 509 tapes)
    • 11 327 276 files, 7 020 516 directories, 629 filesets, 623 junctions
Business as usual
  • ECFS
    • 1.16 PiB of disks (64 LUNs of 512 GiB, 268 LUNs of 2 TiB, 138 LUNs of 4 TiB, 2 LUNs of 8 TiB)
    • 95.19 PiB of tapes (17 655 tapes)
    • 367 600 840 files, 30 826 308 directories, 13 filesets, 10 junctions
• 36 CoS/Hier (5 tests)
• 31 active SC (10 disk SC, 21 tape SC)
• Devices:
  • 493 disks
  • SL8500
    • 16 LTO7, 155 T10KD, 56 T10KC
  • TS3500
    • 11 LTO6, 10 LTO7, 6 LTO8
• 70 tape drives moved to direct connection
  • Had to set the QLogic driver parameter "Target enable reset" to 0 (comparable to the lpfc module parameter lpfc_fcp2_no_tgt_reset)
    echo "options qla2xxx ql2xtargetreset=0" > /etc/modprobe.d/qla2xxx.conf
Business as usual
• Repack
  • MARS uses tape-to-tape hierarchies
  • Some research data are deleted after being written to both levels of the hierarchy
  • About 6000 tapes repacked in the last 6 months
• Big Purge
  • 60 PB deleted in April 2018
  • 35 PB deleted in March 2019
Business as usual
• Change of technologies
  • 1 SC (secondary copy) changed from LTO6 to LTO7
    • 2597 LTO5
    • 11608 LTO6
    • 384 LTO7
    • 2400 LTO5 repacked since mid-February
  • 2 SC (secondary copy) changed from LTO6 to LTO8M
    • Write only new data, as we don’t have enough LTO8 drives to migrate data and to repack at the same time
    • 1st SC
      • 11754 LTO6
      • 330 LTO8M
    • 2nd SC
      • 5586 LTO6
      • 170 LTO8M
Business as usual [four figure-only slides]
Migration incident [figure-only slide]
Data growth for the next years
• Estimate based on the new HPC (new ITT mid 2019)
  • end 2019: 410 PB
  • end 2020: 580 PB
  • end 2021: 810 PB
  • end 2022: 1 EB
• Will depend on the next HPC and which upgrades we will be able to afford
  • end 2023: 1.5 EB
  • end 2024: 2.1 EB
  • end 2025: 2.9 EB
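The projection above implies a fairly steady growth rate, which is easier to see as year-over-year factors. The sketch below only re-expresses the slide's figures; treating 1 EB as 1000 PB is an assumption about the units used, and the function names are illustrative.

```python
# Projected archive size at the end of each year, in PB (values from the slide).
# Assumption: 1 EB is taken as 1000 PB.
projection_pb = {
    2019: 410,
    2020: 580,
    2021: 810,
    2022: 1000,
    2023: 1500,
    2024: 2100,
    2025: 2900,
}


def yearly_growth(projection):
    """Return {year: size / previous year's size} for consecutive years."""
    years = sorted(projection)
    return {y: projection[y] / projection[p] for p, y in zip(years, years[1:])}


def cagr(projection):
    """Compound annual growth rate between the first and last year."""
    years = sorted(projection)
    first, last = years[0], years[-1]
    return (projection[last] / projection[first]) ** (1 / (last - first)) - 1


if __name__ == "__main__":
    for year, factor in yearly_growth(projection_pb).items():
        print(f"end {year}: x{factor:.2f} over the previous year")
    print(f"compound annual growth 2019-2025: {cagr(projection_pb):.1%}")
```

Every step in the projection is a factor between roughly 1.2 and 1.5, so the archive is expected to grow about sevenfold over the six years.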
TS4500 acceptance tests
• Functional tests
  • TS1160: bug in D3I5_457 firmware (fixed in new firmware). All tapes must be partitioned again. Data on tape will be lost.
  • Drive error codes are not reported in the GUI
  • Syslog messages are issued with localhost instead of the library’s IP and are hardcoded to local3
  • GUI can’t display 16 Gb/s on FC ports
• Reliability tests
  • Redundant power loss is not always reported in SNMP or syslog
  • Several issues with I/O stations, including tapes dropped inside the library
  • Sometimes, putting an accessor in service messes up the other one
HPSS 7.5.3 migration
• Objectives:
  • Migrate core servers from AIX to Linux (finally)
  • Migrate from single-partition HPSS 7.4.2 to a partitioned (9) HPSS 7.5.3
  • Do this with minimal interruption of service.
    • Conversion done on the fly, with the AIX environment operational
    • Small downtime (2-3 hours?) to complete the conversion and transfer services to the Linux machine.
• Factors:
  • Operation should have happened 2 years ago, in two steps.
  • Other projects (e.g. relocation) delayed this process.
  • Decision to do the two jumps in one go.
  • Our AIX machine has fairly limited resources.
  • One very busy subsystem, with 370 million files
  • A hell of a lot of data to transfer and migrate
  • Very active database with many parallel transactions
HPSS 7.5.3 migration
• General concepts
  • qrep
    • load: copy (with transform) one table at a time.
    • apply changes to loaded tables.
  • qverify
    • compare the contents of source and target databases, and track differences.
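The load / apply changes / verify cycle described above can be illustrated with a toy in-memory model. This is only a sketch of the general idea, not the actual qrep or qverify tooling: tables are plain dictionaries keyed by primary key, and the change-log format and the `add_partition` transform are hypothetical.

```python
from typing import Callable, Dict, Iterable, Tuple

Row = dict
Table = Dict[int, Row]          # primary key -> row
Change = Tuple[str, int, Row]   # (operation, key, row); hypothetical log format


def load(source: Table, transform: Callable[[Row], Row]) -> Table:
    """Initial load: copy every row of one table, transforming it on the way."""
    return {key: transform(row) for key, row in source.items()}


def apply_changes(target: Table, changes: Iterable[Change],
                  transform: Callable[[Row], Row]) -> None:
    """Replay changes captured on the source while the load was running."""
    for op, key, row in changes:
        if op == "delete":
            target.pop(key, None)
        else:  # "insert" or "update"
            target[key] = transform(row)


def verify(source: Table, target: Table,
           transform: Callable[[Row], Row]) -> Dict[str, list]:
    """Compare source and target contents and report the keys that differ."""
    expected = {key: transform(row) for key, row in source.items()}
    return {
        "missing": sorted(expected.keys() - target.keys()),
        "extra": sorted(target.keys() - expected.keys()),
        "different": sorted(k for k in expected.keys() & target.keys()
                            if expected[k] != target[k]),
    }


def add_partition(row: Row) -> Row:
    """Hypothetical transform: tag each row with one of 9 partitions."""
    return {**row, "partition": row["id"] % 9}


if __name__ == "__main__":
    source = {1: {"id": 1, "name": "a"}, 2: {"id": 2, "name": "b"}}
    target = load(source, add_partition)
    # The source keeps changing while the target is being loaded.
    source[2] = {"id": 2, "name": "b-renamed"}
    source[3] = {"id": 3, "name": "c"}
    print("before apply:", verify(source, target, add_partition))
    apply_changes(target, [("update", 2, source[2]), ("insert", 3, source[3])],
                  add_partition)
    print("after apply: ", verify(source, target, add_partition))
```

The point of the model is that the target is only consistent once the backlog of captured changes has been replayed, which is why the latency between source and target updates matters so much during a live conversion.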
HPSS 7.5.3 migration
• 3Q18: test environment converted with minimal issues.
• 4Q18: logic issues discovered while starting the production conversion
  • Heavy usage of renames and partitioning did not mix well during the apply-changes part of qrep
• Dec18: new conversion code delivered, but performance issues encountered on AIX.
  • Latency between source table update and target table update reaches several days.
• 1Q19: additional parallelism and better capture balancing are introduced.
  • We now just about manage to keep up with source-machine updates.
• 2Q19: we are now dealing with qverify performance.
HPSS 7.5.3 migration [figure-only slide]
HPSS 7.5.3 migration
• The conversion date has been moved multiple times.
• May 4th is the ultimate target.
  • Our AIX boxes are not supported anymore.
  • We want some of the 7.5 features.
  • We need 7.5.3 to connect TS1160s
  • Francis will leave ECMWF shortly after.
• Hopefully a qrep-based solution, but...
• ... we have a plan B
  • Accept a long downtime and load databases offline.
  • 10-12 hours downtime (estimate based on dry-run test)
  • What if errors are encountered on the day?
HPSS 7.5.3 testing
• Tape media 60F JE default size is 320 TB
• Recover
  • Tests:
    • Hierarchies
      • Disk to dual copies on tapes
      • Tape to tape
    • Recover of second copy
    • Recover of primary copy
  • Results:
    • It works and uses the TOR and RAO features (RAO calls can be improved)
    • No logs in Alarms and Events
    • No timestamp in the recover logfile (/var/hpss/tmp/recover_<VolID>.txt). If run several times, history is lost
    • Doesn’t indicate which tapes are needed for the recovery
    • CRs opened for dry-run and listing of tapes needed
HPSS 7.5.3 testing
• Db2 configuration
  • Change database parameter LOGSECOND from 10 to -1
• Db2 backup
  • With multiple partitions, each database has 1 backup file per partition
  • Db2_fullbackup.ksh only verifies and copies the last written file to the secondary backup partition
• Tape drive quotas
  • Several major bugs found
    • If a drive is locked while in use, In Use values (read or write) are not updated (fixed in the last patch)
    • When the PVL is restarted, In Use values are reset to 0 (fixed in the last patch)
  • CR opened to set the recall limit as a percentage
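The Db2 backup point above (one backup file per partition, while the script only handles the last written file) suggests checking that every partition has an image before copying anything. The sketch below is a hypothetical helper, not part of Db2_fullbackup.ksh. It assumes backup images follow the usual Db2 naming layout `ALIAS.TYPE.INSTANCE.DBPARTnnn.TIMESTAMP.SEQ`; the database alias, instance name and partition count in the example are made up.

```python
import re

# Assumed backup image naming: ALIAS.TYPE.INSTANCE.DBPARTnnn.TIMESTAMP.SEQ
IMAGE_RE = re.compile(
    r"^(?P<alias>[^.]+)\.(?P<type>\d)\.(?P<instance>[^.]+)"
    r"\.DBPART(?P<part>\d{3})\.(?P<timestamp>\d{14})\.(?P<seq>\d{3})$"
)


def missing_partitions(filenames, alias, expected_partitions,
                       since="00000000000000"):
    """Return partition numbers that have no backup image for `alias`.

    Only images whose timestamp (YYYYMMDDhhmmss) is >= `since` are counted;
    fixed-width digit strings compare correctly as plain strings.
    """
    found = set()
    for name in filenames:
        match = IMAGE_RE.match(name)
        if (match and match.group("alias") == alias
                and match.group("timestamp") >= since):
            found.add(int(match.group("part")))
    return sorted(set(expected_partitions) - found)


if __name__ == "__main__":
    # Hypothetical directory listing: partition 2 has no image.
    listing = [
        "HSUBSYS1.0.hpssdb.DBPART000.20190401010000.001",
        "HSUBSYS1.0.hpssdb.DBPART001.20190401010002.001",
        "HSUBSYS1.0.hpssdb.DBPART003.20190401010005.001",
    ]
    print(missing_partitions(listing, "HSUBSYS1", range(4)))
```

A check of this kind would flag an incomplete set of images even when the most recently written file looks fine.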
HPSS 7.5.3 testing
• Tape drive quotas
  • If the number of unlocked drives for a PVR goes below the recall limit (recall limit not set to -1), the recall limit is automatically changed to the number of available drives
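The recall-limit rule above amounts to a simple clamp. The sketch below models only the behaviour stated on the slide; the function name is illustrative and is not an HPSS API, and whether the limit is restored once drives are unlocked again is not stated, so it is not modelled.

```python
UNLIMITED = -1  # per the slide, a recall limit of -1 is left untouched


def effective_recall_limit(recall_limit: int, unlocked_drives: int) -> int:
    """Recall limit after the automatic adjustment described on the slide."""
    if recall_limit == UNLIMITED:
        return UNLIMITED
    # If fewer drives are unlocked than the limit allows, the limit drops
    # to the number of available drives; otherwise it is unchanged.
    return min(recall_limit, unlocked_drives)


if __name__ == "__main__":
    print(effective_recall_limit(10, 6))
    print(effective_recall_limit(10, 12))
    print(effective_recall_limit(UNLIMITED, 3))
```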
HPSS 7.5.3 testing • Tape drive quotas [figure-only slide]
Data Centre fit-out timeline [timeline figure; quarterly axis labelled 2019 and 2020]
• Phases: Procurements; Data Centre construction; Delivery and installation (racks, fibres, network, servers, storage); Operational services
• Milestones shown: Start; HPC contract signed; N&S infrastructure deployed; 100 Gbps link Reading-Bologna; Bologna DC handover; DHS operation in Bologna; HPC operational in Bologna
• Dates shown: 10-2019, 05-2020
Questions?
Thank you