Collecting terabytes of data from FDSN data centers is possible but challenging. ROVER is a command-line client that runs long-term, verifies requested data retrieval, and builds a data index for easy integration into workflows.
ROVER: Hardening Data Delivery over the Internet
The challenge and motivation
Collecting X terabytes of arbitrary data from the FDSN data centers is possible, but:
• Usually only possible by partitioning the request, orchestrated by the user
• Orchestration is non-trivial: errors and re-tries must be handled
• Is it complete? Downloads may be quietly truncated, a weakness of HTTP + streaming
• Local data management, summarization, indexing and sub-setting are all left to users to (re)invent
Enter ROVER: Retrieval of Various Experiment data Robustly
• A command-line client
• Designed to run long-term, until the request is complete (restartable)
• Designed to verify that all requested data that can be retrieved has been retrieved, using the DMC’s availability service
• Designed to check for additions and, in the future, updates to the requested data set
• Builds a data index for summarization, lookup and extraction
• The index is stored in SQLite, for which support is ubiquitous; simple text summaries are trivially generated
• The index is the key to integrating such a data set into a workflow, and a bridge to other systems
ROVER workflow
1. Create the desired data request (a subscription containing Request 1..N)
2. Launch retrieval per request, looping until there is nothing left to retrieve:
2a. Check availability
2b. Compare to local holdings (the data index)
2c. Fetch needed data in parallel (into the miniSEED data set)
2d. Index the new data
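The loop in step 2 can be pictured in a few lines of Python. This is a conceptual sketch only, not ROVER’s actual implementation; the stub functions are hypothetical stand-ins for the availability service, the parallel downloader and the indexer.

# Conceptual sketch of ROVER's retrieval loop (step 2 above).
# NOT the real implementation: the stubs stand in for the
# availability service, parallel downloader and indexer.

HOLDINGS = set()  # stands in for the local SQLite data index

def check_availability(request):
    """2a: ask the data center what it can serve (stubbed)."""
    return set(request)

def fetch_parallel(needed):
    """2c + 2d: download missing data and index it (stubbed)."""
    HOLDINGS.update(needed)

def retrieve(request):
    while True:
        available = check_availability(request)   # 2a
        needed = available - HOLDINGS             # 2b: diff vs. local holdings
        if not needed:
            break                                 # nothing left to retrieve
        fetch_parallel(needed)                    # 2c/2d

retrieve({"IU.ANMO.00.LHZ", "TA.MSTX..BHZ"})
print(sorted(HOLDINGS))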
How to install
Two-part installation:
1) Install mseedindex from source code: https://github.com/iris-edu/mseedindex
Requirements: a C compiler and the make program
2) Install rover using pip:
> pip install rover
Requirements: Python >= 2.7 (and pip)
ROVER: Quick start, an example request
1. Initialize a data repository (and change into that directory):
$ rover init-repository datarepo
$ cd datarepo
2. Create a request file named request.txt containing:
IU ANMO * LHZ 2012-01-01T00:00:00 2012-02-01T00:00:00
TA MSTX -- BH? 2012-01-01T00:00:00 2012-02-01T00:00:00
3. Run rover retrieve to fetch these data:
$ rover retrieve request.txt
• HTTP status page & email notification when done
Data are saved, in miniSEED format, to files with this organization:
<datarepo>/data/<network>/<year>/<day>/<station>.<network>.<year>.<day>
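Once retrieval finishes, any of the saved miniSEED files can be read directly with ObsPy. A minimal sketch; the path below is only an example following the layout shown above, so substitute a file that actually exists in your repository:

from obspy import read

# Read one file from the rover repository; the path is a hypothetical
# example following the <datarepo>/data/... layout described above.
st = read("datarepo/data/IU/2012/001/ANMO.IU.2012.001")
print(st)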
Once you have data: report what is in the repository
List a summary (the extents) of data in the repository:
$ rover list-summary
IU_ANMO_00_LHZ 2012-01-01T00:00:00.069500 2012-01-31T23:59:59.069500
IU_ANMO_10_LHZ 2012-01-01T00:00:00.069500 2012-01-31T23:59:59.069500
TA_MSTX__BHE 2012-01-01T00:00:00.000000 2012-01-31T23:59:59.975000
TA_MSTX__BHN 2012-01-01T00:00:00.000000 2012-01-31T23:59:59.975000
TA_MSTX__BHZ 2012-01-01T00:00:00.000000 2012-01-31T23:59:59.975000
• Limit the summary to specific networks, stations, locations, channels & time ranges
• Alternatively, use list-index for full details: the actual contiguous traces
Once you have data: run your own fdsnws-dataselect service
Run an FDSN web service on your local repository:
https://iris-edu.github.io/portable-fdsnws-dataselect/
• A Python-based web service that returns data based on a time series index
• Most tools that use FDSN web services (FetchData, ObsPy, etc.) can be redirected to alternate services
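For example, ObsPy’s standard FDSN client can be pointed at such a local service. A sketch, assuming the service is listening on localhost port 8080 (an assumption; use whatever host and port you configured):

from obspy import UTCDateTime
from obspy.clients.fdsn import Client

# Point ObsPy at the local portable-fdsnws-dataselect instance
# (localhost:8080 is an assumption; match your service's config).
client = Client(base_url="http://localhost:8080")
st = client.get_waveforms("TA", "MSTX", "--", "BHZ",
                          UTCDateTime("2012-01-01"),
                          UTCDateTime("2012-01-02"))
print(st)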
Once you have data: direct use with ObsPy (next release)
The DMC has contributed a new sub-module to ObsPy, to be included in the next release, that allows direct discovery and reading of data in a rover-created repository:
obspy.clients.filesystem.tsindex.Client
Very similar to other ObsPy interfaces, this module provides get_waveforms(), get_availability_extent(), get_availability() and a few more.
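A sketch of how this client might be used, assuming the repository’s index database is a file named timeseries.sqlite (check your repository for the actual index file name):

from obspy import UTCDateTime
from obspy.clients.filesystem.tsindex import Client

# Open the rover-built index (the file name is an assumption).
client = Client("timeseries.sqlite")

# Read waveforms straight from the indexed miniSEED files.
st = client.get_waveforms("IU", "ANMO", "00", "LHZ",
                          UTCDateTime("2012-01-01"),
                          UTCDateTime("2012-01-08"))
print(st)

# Summarize what the repository holds for one station.
for extent in client.get_availability_extent(network="IU", station="ANMO"):
    print(extent)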
Once you have data: use the data index directly
The data index supports data discovery and summary with no need to crawl through files:
• Filenames, data identifiers (net, sta, loc, chan), earliest and latest times, exact segments, sample rates, low-level details and more...
The index is stored in SQLite, a very powerful single-file database that is easy to use:
$ sqlite3 index.sql 'SELECT filename,network,station,location,channel,starttime,endtime FROM tsindex;'
/path/cola.mseed|IU|COLA|00|LH1|2010-02-27T06:50:00.069539|2010-02-27T07:59:59.069538
/path/cola.mseed|IU|COLA|00|LH2|2010-02-27T06:50:00.069539|2010-02-27T07:59:59.069538
/path/cola.mseed|IU|COLA|00|LHZ|2010-02-27T06:50:00.069539|2010-02-27T07:59:59.069538
...
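Because the index is plain SQLite, it is equally easy to query from a script. A minimal Python sketch using only the standard library; the database file name follows the example above, so adjust it to your repository:

import sqlite3

# Query the rover-built index directly (file name as in the example above).
conn = sqlite3.connect("index.sql")
rows = conn.execute(
    "SELECT filename, network, station, location, channel, "
    "starttime, endtime FROM tsindex WHERE network = 'IU'")
for row in rows:
    print("|".join(str(v) for v in row))
conn.close()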
Main take-away points
• Addresses robust collection of small to large data sets
• Provides an indexed data repository
• Cost: learning a new tool
• Expected release: Spring 2019
• Ask if you would like to be an early tester!
• See a demo at the IRIS booth (808)