MSDC: MiniSeed Data Completeness
S. Pintore
Scenario
• A network of remote SeisComP servers, each archiving data on local mass storage and so creating a Peripheral Archive (PA)
• A central server creating a Central Archive (CA)
• Telecommunication network availability < 100%
• Limited bandwidth
Incomplete data
• Data in the CA may be incomplete if the network becomes unreachable and stays so for a long time:
  • some files are missing from the CA
  • some files in the CA contain data gaps
The Mednet SeisComP server network
• Data are stored in 24-hour files
• The segsize parameter is set to 5000 (in 512-byte blocks)
• If the link stays down for longer than about an hour, the data will contain gaps
• If the link stays down for longer than 1-2 days, some files will be missing
• Telecommunication network availability is generally good
• Network faults lasting more than 1 day are more frequent than faults lasting between an hour and 1 day
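
The slide gives the segsize value but not how it relates to the one-hour threshold. As a rough illustration only, the sketch below computes the size of 5000 blocks of 512 bytes and how long such a buffer would last at a constant data rate. The assumption that the buffer bridging an outage holds exactly segsize blocks, and the rate of 711 bytes/s (chosen only so that the result is about an hour), are hypothetical and are not Mednet figures.

```shell
#!/usr/bin/env bash
# Rough estimate of how long a buffer of fixed-size blocks lasts.
# ASSUMPTIONS: the buffer holds exactly <blocks> blocks and data arrive
# at a constant <rate_bps> bytes per second (hypothetical value below).
buffer_seconds() {
  local blocks=$1 block_bytes=$2 rate_bps=$3
  echo $(( blocks * block_bytes / rate_bps ))
}

echo "buffer size: $(( 5000 * 512 )) bytes"
echo "duration at 711 bytes/s: $(buffer_seconds 5000 512 711) s"
```

With these assumed numbers the buffer is 2560000 bytes and lasts 3600 s. A higher data rate or a smaller buffer shortens the outage that can be bridged without gaps.
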
Retransmit or integrate?
• An integrity check is necessary to ensure data quality
• Because of the bandwidth limit, you must choose between:
  • retransmitting the whole file that contains a gap
  • integrating your file by transmitting only the data needed to fill the gap
• These two execution steps aren’t necessarily distinct
Respect the environment
• The procedure that rebuilds the correct data should have a low impact on the systems. It should:
  • run on Linux with low resource usage
  • offer link security
  • allow control of bandwidth use
  • not need specific firewall rules
MSDC solution
• MSDC uses the rsync tool, which is already available, optimised for similar problems and well tested
• The data check is done by rsync, which compares the files in the CA with those in the PA
• It uses rsync over ssh to:
  • secure the connection
  • avoid using the rsync port (873)
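
The rsync-over-ssh approach described above can be sketched as follows. This is a minimal sketch, not the content of msdc.sh: the host name, archive paths, key file name and bandwidth value are hypothetical placeholders, and the choice of options is an assumption. The options themselves are standard rsync ones: `-a` (archive mode), `--itemize-changes` (list what was changed), `--bwlimit` (rate limit, in units of 1024 bytes per second) and `-e` (remote shell to use).

```shell
#!/usr/bin/env bash
# Sketch: build an rsync-over-ssh command that pulls files from a PA into the CA.
# All arguments used in the demo call below are hypothetical placeholders.
build_rsync_cmd() {
  local remote=$1 src=$2 dest=$3 key=$4 kbps=$5
  RSYNC_CMD=(rsync -a --itemize-changes
             --bwlimit="$kbps"
             -e "ssh -i $key"
             "$remote:$src" "$dest")
}

build_rsync_cmd sysop@pa.example.org /archive/2024/ /central/2024/ \
                /home/sysop/msdc/ssh/id_msdc 64

# Print the command instead of running it; to transfer, run: "${RSYNC_CMD[@]}"
printf '%s\n' "${RSYNC_CMD[@]}"
```

Because the transfer runs inside the ssh session, only the ssh port has to be reachable, which is how the rsync port (873) is avoided.
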
What does rsync offer?
• The features of the rsync algorithm:
  • it works on arbitrary data
  • the total data transferred is about the size of a compressed diff file
  • it is fast for large files and large collections of files
  • it doesn’t assume any prior knowledge of the two files, but takes advantage of similarities
  • it is computationally inexpensive
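
To give a feel for why only a small part of a file needs to travel, the sketch below compares two files block by block and counts the blocks whose checksums differ. It is a simplified illustration and not the rsync algorithm itself: real rsync also uses a rolling checksum so that matching blocks are found at any offset, for example after records have been inserted. The function name and the 512-byte default block size are choices made for this example.

```shell
#!/usr/bin/env bash
# Simplified illustration: count fixed-size blocks of NEW whose checksum
# differs from the block at the same offset in OLD.
# Prints "<changed blocks> <total blocks>".
changed_blocks() {
  local old=$1 new=$2 bs=${3:-512}
  local size nblocks i a b changed=0
  size=$(wc -c < "$new")
  nblocks=$(( (size + bs - 1) / bs ))
  for (( i = 0; i < nblocks; i++ )); do
    a=$(dd if="$old" bs="$bs" skip="$i" count=1 2>/dev/null | cksum)
    b=$(dd if="$new" bs="$bs" skip="$i" count=1 2>/dev/null | cksum)
    [[ $a == "$b" ]] || changed=$(( changed + 1 ))
  done
  echo "$changed $nblocks"
}
```

Called as `changed_blocks ca_file pa_file`, it reports how many blocks of the PA copy would have to be sent to repair the CA copy under this fixed-offset simplification.
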
MSDC main features
• msdc.sh can be run from the command line or from a crontab entry
• It is a bash script
• It avoids conflicts between concurrent runs by using a simple locking mechanism
• It logs events and the names of the files that were corrected or definitively lost
• Installation is done by the sysop user in their home directory
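
The slide does not say how the locking is implemented. One common way to get a simple lock in bash, shown here as an assumption rather than as the mechanism msdc.sh actually uses, relies on `mkdir` being atomic: only one process can create the lock directory. The lock path, the function names, and the crontab schedule and path in the comment are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of a simple lock based on the atomicity of mkdir.
# A crontab entry for the sysop user might look like (hypothetical schedule/path):
#   30 2 * * * $HOME/msdc/bin/msdc.sh
LOCKDIR=${LOCKDIR:-/tmp/msdc.lock}   # hypothetical lock location

acquire_lock() {
  # Succeeds only for the first process that creates the directory.
  if mkdir "$LOCKDIR" 2>/dev/null; then
    echo $$ > "$LOCKDIR/pid"
    return 0
  fi
  return 1
}

release_lock() {
  rm -f "$LOCKDIR/pid"
  rmdir "$LOCKDIR" 2>/dev/null
}

# Run a command under the lock; refuse to start if another run holds it.
run_locked() {
  if ! acquire_lock; then
    echo "another instance is running, exiting" >&2
    return 1
  fi
  "$@"
  local status=$?
  release_lock
  return $status
}
```

A script would wrap its main work as `run_locked main_function`. A production version would also release the lock from a `trap` so that an interrupted run does not leave a stale lock behind.
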
Security
• MSDC uses an ssh key pair to automate the ssh connection
• This key pair is dedicated to MSDC; no other connections are possible using it
• MSDC doesn’t interfere with other keys used to automate ssh connections
• It doesn’t need an rsync server running
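
A dedicated key that allows nothing but rsync is usually obtained with the OpenSSH `command="..."` option in `authorized_keys`, which forces a filter script to run and exposes the requested command in `SSH_ORIGINAL_COMMAND`. The sketch below shows what such a filter could look like. It is an assumption about the general technique, not the content of the package's validate_rsync script, and the function name and the path in the comment are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of a forced-command filter for a dedicated ssh key.
# On the server it would be referenced in ~/.ssh/authorized_keys, e.g.
# (hypothetical path, key material elided):
#   command="/home/sysop/msdc/bin/validate_rsync",no-pty,no-port-forwarding ssh-rsa ...
validate_command() {
  local cmd=$1
  case "$cmd" in
    # Reject anything containing shell metacharacters.
    *\&*|*\;*|*\|*|*\<*|*\>*|*\(*|*\{*|*\`*|*\$*)
      echo "rejected"; return 1 ;;
    # Accept only the server side of an rsync transfer.
    "rsync --server"*)
      echo "accepted"; return 0 ;;
    *)
      echo "rejected"; return 1 ;;
  esac
}

# In the real filter, the check would be applied to the variable set by sshd:
#   validate_command "$SSH_ORIGINAL_COMMAND" >/dev/null && exec $SSH_ORIGINAL_COMMAND
```

With such a filter in place, a login attempt or any command other than an rsync server invocation made with the dedicated key is refused, while other keys in the same `authorized_keys` file are unaffected.
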
The MSDC package
• The MSDC package msdc.tgz contains the following files:
  • msdc/bin/msdc.sh
  • msdc/bin/validate_rsync
  • msdc/bin/rsync
  • msdc/doc/README.msdc (documentation)
  • msdc/doc/COPYING (GPL license)
  • msdc/ssh
TODO
• Option to use a different date
Alternative solutions: after the check
• The data check could be done using the SeedStuff utilities (check_file, extr_file, etc.) or the qlib ones (qmerge, etc.)
• For the incomplete files you can either:
  • retransmit the whole file, or
  • use qmerge to extract the data that fill the gaps, then transmit these “patches”, possibly using qmerge again to fill the gaps
• Transmission: you should use a tool that offers security, such as scp or sftp
• You should then automate this procedure