Shelter from the Storm Building a Safe Archive in a Hostile World
SCOOP Goal • SURA-funded Coastal Modeling Project • Want to develop the community's cutting-edge techniques so that they are ready for use in tomorrow's production systems • For example, automatic verification of storm-surge models against observed data, to help improve the models
CCT Goals • One of CCT's key research outputs is software • Want this software to be high quality and robust • Want software to be reused across projects • Also want software to be picked up by external users as well as collaborators
The SCOOP Archive • Need to archive lots of files • Atmospheric models (MM5, GFDL) • Hydrodynamic models (ADCIRC, SWAN, etc.) • Observational data (sensor data, buoys) • Requirements are poorly defined: • How much data? We don't know • How long should we keep it? We don't know • Have to interface with bespoke data-transport mechanisms (LDM) • How do we achieve our goals under these conditions?
Basic Archive Operation Upload: • Client signals that it wants to upload some files (names are given) • Archive tells the client where to upload them (transaction handles) • Client uploads the files (independently of the archive) • Client tells the archive it is done • Archive creates the logical filenames • Use the "upload" tool for this (a client-side sketch follows)
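A minimal client-side sketch of this handshake. Only the operation names fileUploadBegin and fileUploadEnd appear in these slides; their signatures, the uploadFile helper, and the URLs below are assumptions made for illustration.

  // Client-side sketch of the upload handshake (hypothetical signatures).
  #include <iostream>
  #include <string>
  #include <vector>

  // Ask the archive to start an upload for the named files; it hands back a
  // transaction handle and tells us where to stage the files (stub).
  std::string fileUploadBegin(const std::vector<std::string>& fileNames,
                              std::string& stageUrl) {
      std::cout << "begin upload of " << fileNames.size() << " files\n";
      stageUrl = "gsiftp://archive.example.org/stage/tx-42/";  // placeholder
      return "tx-42";
  }

  // Transfer one file to the staging location, independently of the archive (stub).
  bool uploadFile(const std::string& fileName, const std::string& stageUrl) {
      std::cout << "uploading " << fileName << " -> " << stageUrl << "\n";
      return true;
  }

  // Tell the archive the transfer is done so it can create logical filenames (stub).
  bool fileUploadEnd(const std::string& transaction) {
      std::cout << "completing transaction " << transaction << "\n";
      return true;
  }

  int main() {
      std::vector<std::string> files = {"adcirc_surge.nc", "swan_waves.nc"};
      std::string stageUrl;
      std::string tx = fileUploadBegin(files, stageUrl);    // steps 1-2
      for (const auto& f : files) uploadFile(f, stageUrl);  // step 3
      fileUploadEnd(tx);                                    // steps 4-5
      return 0;
  }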
Basic Archive Operation Download: • Clients use the catalog service to discover/search for logical filenames • Clients talk to the RLS server to map logical names to physical URLs • Clients interact with the physical URLs directly • Can use the "getdata" CLI tool to encapsulate this (a sketch follows) • Also, there are portal pages...
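A minimal sketch of the download path under the same caveat: queryCatalog, rlsLookup, and fetchUrl are hypothetical stand-ins for the catalog search, the RLS lookup, and the actual transfer.

  #include <iostream>
  #include <string>
  #include <vector>

  // Search the catalog service for logical filenames matching a query (stub).
  std::vector<std::string> queryCatalog(const std::string& query) {
      std::cout << "catalog query: " << query << "\n";
      return {"scoop/adcirc_surge.nc"};
  }

  // Ask the RLS server for the physical URLs registered for a logical name (stub).
  std::vector<std::string> rlsLookup(const std::string& logicalName) {
      return {"gsiftp://archive.example.org/data/" + logicalName};
  }

  // Fetch the file directly from one of its physical locations (stub).
  bool fetchUrl(const std::string& url) {
      std::cout << "fetching " << url << "\n";
      return true;
  }

  int main() {
      for (const auto& logical : queryCatalog("model=ADCIRC")) {  // 1: discover
          std::vector<std::string> urls = rlsLookup(logical);     // 2: resolve
          if (!urls.empty()) fetchUrl(urls.front());              // 3: transfer
      }
      return 0;
  }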
Operations on Service • fileUploadBegin - starts an upload • fileUploadEnd - signals that an upload is complete • logicalNameRetry • removeDeadTransactions • closeArchive (one possible interface is sketched below)
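One way the service front end might be declared. Only the operation names come from this slide; all parameter and return types, and the per-operation comments, are my reading of the names rather than the real interface.

  // Hypothetical service interface; only the operation names are from the slides.
  #include <string>
  #include <vector>

  class ArchiveService {
  public:
      virtual ~ArchiveService() = default;
      // Start an upload: return a transaction handle and a staging location.
      virtual std::string fileUploadBegin(const std::vector<std::string>& fileNames,
                                          std::string& stageUrl) = 0;
      // Signal that the upload for a transaction has completed.
      virtual bool fileUploadEnd(const std::string& transaction) = 0;
      // Retry creation of logical names that could not be registered earlier.
      virtual void logicalNameRetry() = 0;
      // Clean up transactions whose clients never called fileUploadEnd.
      virtual void removeDeadTransactions() = 0;
      // Shut the archive down cleanly, persisting state for restart.
      virtual void closeArchive() = 0;
  };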
Distributed Software • Some services are hosted externally • Can't assume our machine or software never fails • Need to retain the service's state across restarts
Robust Code • Don't assume our service will remain "up" => Keep all internal state in a database => Reload internal state on restart • Don't assume external services are always "up" => Design loosely coupled services => Store pending interactions in the database => Retry these periodically (sketched below) • Do "stress testing" on the service during the test/debug cycle
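A minimal sketch of the retry idea: because every pending interaction lives in the database, a crash or restart only delays the work. The loadPendingInteractions, retryInteraction, and markDone helpers are hypothetical stand-ins for whatever database layer actually holds the state.

  #include <chrono>
  #include <iostream>
  #include <string>
  #include <thread>
  #include <vector>

  struct PendingInteraction { std::string id; std::string payload; };

  // Load interactions that have not yet succeeded from the database (stub).
  std::vector<PendingInteraction> loadPendingInteractions() { return {}; }

  // Re-attempt one interaction against the external service (stub).
  bool retryInteraction(const PendingInteraction& p) {
      std::cout << "retrying " << p.id << "\n";
      return true;
  }

  // Remove a completed interaction from the pending table (stub).
  void markDone(const PendingInteraction& p) {
      std::cout << "done: " << p.id << "\n";
  }

  int main() {
      for (;;) {
          // Pick up anything still pending, whether it was left by a failed
          // external call or by our own restart.
          for (const auto& p : loadPendingInteractions()) {
              if (retryInteraction(p)) markDone(p);
          }
          std::this_thread::sleep_for(std::chrono::minutes(5));
      }
  }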
Keep the internal APIs Simple
  int logname_initialize(void);
  void logname_remove(void);
  bool logname_create_logfile(std::string logical_name,
                              bool name_is_final,
                              const std::vector<std::string>& urls);
  bool logname_delete_logfile(std::string logical_name);
  ulong logname_upload_pending_lognames(ulong max_rows,
                                        ulong& total_found,
                                        ulong& max_rows_used);
Encouraging Reuse • SCOOP Archive has lots of strange rules about filenames and metadata • During design and implementation, keep asking: • Is this for the SCOOP project, or • Is this a generic archive feature? • Use good O-O design to keep SCOOP code separate from archive code
Keeping SCOOP to one side...
  class ArchiveFilingLogic {
  public:
      // Called by the default moveFiles implementation
      virtual bool createPhysicalPath(std::string physicalPath);
      virtual bool moveFiles(std::vector<std::string>& fileNames,
                             std::vector<std::string>& missingFiles,
                             std::string stagePath,
                             std::string physicalPath);
      virtual void physicalLocationForFiles(const std::vector<std::string>& filenames,
                                            std::map<std::string, std::string>& directories,
                                            std::map<std::string, std::string>& errors) = 0;
      virtual std::vector<std::string> logicalNamesForFiles(const std::vector<std::string>& filenames,
                                                            std::string physicalPath) = 0;
  };
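The split, as sketched above, is that the generic archive talks only to ArchiveFilingLogic while SCOOP's filename and metadata rules live in a subclass. The class below (ScoopFilingLogic) and its filing/naming rules are purely hypothetical illustrations of that separation, building on the ArchiveFilingLogic declaration above.

  #include <map>
  #include <string>
  #include <vector>

  // Hypothetical SCOOP-specific subclass: the archive core never sees these rules.
  class ScoopFilingLogic : public ArchiveFilingLogic {
  public:
      void physicalLocationForFiles(const std::vector<std::string>& filenames,
                                    std::map<std::string, std::string>& directories,
                                    std::map<std::string, std::string>& errors) override {
          for (const auto& f : filenames) {
              // Illustrative rule only: file SCOOP data by model-name prefix.
              directories[f] = "/archive/scoop/" + f.substr(0, f.find('_'));
          }
      }

      std::vector<std::string> logicalNamesForFiles(const std::vector<std::string>& filenames,
                                                    std::string physicalPath) override {
          std::vector<std::string> logical;
          for (const auto& f : filenames)
              logical.push_back("scoop:" + f);  // illustrative naming convention
          return logical;
      }
  };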
New Requirements • Handling common compression formats • Producing subsets of data (predictively) • Tracking data before it is ingested • Notifying people when data arrives • Transforming data to other formats • Generating analytical data “on the fly” • Federating data across multiple locations • Good initial design will simplify all this...
Highest Priority... • Archive machine running out of space • People have started to rely on the service • So, currently we are uploading copies of all data to SDSC DataCenter, using SRB • Now need to keep track of URLs on physically distributed resources • But SRB can help with some of the other requirements...