REPLIX
www.mpi.nl/replix
Willem.Elbers(@mpi.nl)
Max Planck Institute for Psycholinguistics, TLA
Agenda
• Goals
• Motivation
• Infrastructure
• Language Archive specific
• Results
• Discussion
REPLIX – Repository / Workspace Workshop – September 2010
REPLIX Goals
• Data replication / synchronization between repositories at a logical level.
  • What is this logical level?
  • More than just moving files.
  • What about access rights?
  • What about structure defined on top of the data?
  • What about persistent identifiers (PIDs)?
  • What about things we didn’t think about?
• Workflow-based and easy to configure and adapt for different scenarios.
  • A workflow is a chain of small tasks (a.k.a. blocks).
  • Easy to develop and integrate new blocks.
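The "chain of small tasks" idea can be sketched in a few lines of Python. This is only an illustration, not the REPLIX implementation: the function names, the dictionary used as shared context, and the two example blocks are all assumptions made for this sketch.

```python
from typing import Callable, Dict, List

# Assumption: a block is any callable that receives the shared workflow
# context and returns the (possibly updated) context.
Block = Callable[[Dict[str, object]], Dict[str, object]]


def run_workflow(blocks: List[Block], context: Dict[str, object]) -> Dict[str, object]:
    """Run each block in order, handing the context from one block to the next."""
    for block in blocks:
        context = block(context)
    return context


def sync_files(context: Dict[str, object]) -> Dict[str, object]:
    # Hypothetical block: a real one would trigger the file transfer here.
    return {**context, "log": list(context.get("log", [])) + ["files synchronized"]}


def sync_permissions(context: Dict[str, object]) -> Dict[str, object]:
    # Hypothetical block: a real one would export and import access rights.
    return {**context, "log": list(context.get("log", [])) + ["permissions synchronized"]}


result = run_workflow([sync_files, sync_permissions], {"node": "root"})
```

Adapting the workflow to another scenario then amounts to passing a different list of blocks.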
REPLIX Goals
• Independent of repository implementation.
  • Repositories use different software solutions and we do not expect them to change.
  • How do we synchronize between different software solutions?
  • Inter-connection layer.
• The originating repository is, and should remain, the owner of the data.
  • A repository should never depend on REPLIX for anything other than synchronization.
  • REPLIX has access to the file system but does not control it.
  • The repository controls its data with policies.
REPLIX Goals
[Figure: architecture diagram. Labels, shown once for each of two repositories: Repository Software / Tools, Repository File System, inter-connect, REPLIX.]
Motivation – REPLIX TLA
• Open up the MPI LAT backup sites (B1, B2) as read-only archives.
• Improve the replication process in general.
  • Speed.
  • Validation.
  • Parts of the archive.
  • Update PID information.
• Keep generalization in mind, to provide an out-of-the-box solution for other repositories.
[Figure labels: Nijmegen, Garching, Göttingen; LAT, B1, B2.]
Motivation - Software
• Implementation of the REPLIX communication system.
  • iRODS looks like a promising candidate (federated zones).
• Implementation of the interface to the repository file system.
  • iRODS looks like a promising candidate.
• Implementation of the inter-connection layer.
  • REPLIX side.
    • iRODS looks like a promising candidate by using a custom module.
  • Repository side.
    • Will require custom programming.
Motivation
• Perform iRODS performance tests.
  • See if iRODS lives up to our expectations.
  • How does iRODS compare to the current rsync process?
• Develop a concrete test case, based on iRODS, to test our ideas.
  • Main archive located at the MPI in Nijmegen, the Netherlands.
  • Backup archive located at the RZG in Garching, Germany.
  • Approximately 25-30 TB.
  • How much do we have to change in the existing software?
Infrastructure
• iRODS zones provide the archive-to-archive connection.
  • A single archive's data exists inside an iRODS zone.
  • Use federated zones, ensuring each archive remains autonomous.
• Loose connection to the file system.
  • iRODS mounted collections.
  • iRODS regular collections are too strict, since iRODS controls all files and their metadata.
• XML-RPC interface to existing software (inter-connection layer).
  • Develop an iRODS micro-service to facilitate XML-RPC communication.
  • Use some reserved disk space for caching purposes.
  • How to handle different method signatures?
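To make the XML-RPC part concrete, the sketch below shows what a request looks like on the wire, using Python's standard `xmlrpc.client` module. It does not show the iRODS micro-service itself. The method name `ams.exportPermissions` and the argument `pid-A` are hypothetical and were chosen only for this illustration.

```python
import xmlrpc.client


def build_request(method: str, *args: object) -> str:
    """Encode a method name plus positional arguments as an XML-RPC request body."""
    return xmlrpc.client.dumps(tuple(args), methodname=method)


# Hypothetical call: ask the repository side to export permissions for one node.
payload = build_request("ams.exportPermissions", "pid-A")

# The receiving side decodes the same body back into arguments and method name.
params, method = xmlrpc.client.loads(payload)
```

Because the request carries only a method name and a list of positional values, one generic caller can in principle address methods with different signatures. Whether that is enough for the signature question raised above remains open.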
Infrastructure
[Figure: architecture diagram. Labels: REPLIX WM, iRODS, icommands, jargon, existing software stack, replix scripts, scripts, core + rule engine, msiXmlRPC, micro-services, virtual file system, rule-base, mounted collection(s), replix rule-base, repository (local) file system.]
Infrastructure
• Two ways of interacting with the REPLIX system:
  • Use the workflow manager (WM).
  • Invoke the workflow rules directly through the icommands.
• The workflow manager is preferred.
  • Exposed through a REST-service interface.
Language Archive Specific
• How does this fit into the existing LAT infrastructure?
[Figure labels: SOURCE and DESTINATION archives (LAT 1, LAT 2), each with PID, corpus-structure, crawler, AMS, LAMUS and IMDI Browser; REPLIX.]
Language Archive Specific
• LAT synchronization workflow:
  • (1) File synchronization.
    • iRODS sync.
  • (2) Start crawler (index all files).
    • msiExecCmd.
  • (3) Permission synchronization.
    • msiXmlRpc.
  • (4) Update PID information.
    • msiXmlRpc.
• Each step is implemented as an iRODS action.
Language Archive Specific
• (1) Synchronize based on nodes in the archive tree.
  • If the node is the root node, synchronize all files.
  • If the node is not the root node, create a list of files that need to be synchronized and synchronize them.
  • File list export functionality needs to be available.
  • Do not touch file content.
• (2) Start the crawler.
  • Use the iRODS “msiExecCmd” micro-service to start the crawler at the destination archive through a script.
  • The time this could take might be a problem.
  • PIDs should remain untouched and can be used as a reference to the parent archive.
  • Do not touch file content.
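The node-based selection in step (1) could look roughly like the sketch below. It assumes that "the files that need to be synchronized" for a non-root node are the files in that node's sub-tree. The in-memory tree representation and all names are hypothetical.

```python
from typing import Dict, List, Optional


def files_to_sync(
    node: str,
    root: str,
    children: Dict[str, List[str]],
    files: Dict[str, List[str]],
) -> Optional[List[str]]:
    """Return None for the root (meaning: synchronize everything),
    otherwise the sorted list of files found in the sub-tree below `node`."""
    if node == root:
        return None
    selected: List[str] = []
    stack = [node]
    while stack:  # iterative depth-first walk over the sub-tree
        current = stack.pop()
        selected.extend(files.get(current, []))
        stack.extend(children.get(current, []))
    return sorted(selected)


# Hypothetical archive tree: root -> a -> a1, root -> b
children = {"root": ["a", "b"], "a": ["a1"]}
files = {"root": ["r.imdi"], "a": ["a.imdi"], "a1": ["a1.wav"], "b": ["b.imdi"]}

subtree_list = files_to_sync("a", "root", children, files)
whole_archive = files_to_sync("root", "root", children, files)
```

Only file names are collected. File content is never read or modified, in line with the "do not touch file content" requirement.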
Language Archive Specific
• (3) Replicate archive permissions.
  • AMS is in charge of the permissions in the archive.
    • (node id, user id, permission) triples.
  • Create an export, based on the selected node, at the source archive.
  • Transfer the export to the destination archive.
  • Import the data into AMS at the destination archive.
• Export based on PIDs.
  • Constant between source and destination archive.
  • Translate between PID and node id, since AMS internally uses node ids.
• How to synchronize users?
  • Discard triples for non-existing users.
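The export/import translation in step (3) can be sketched as follows. This is not the AMS interface: the function names, the lookup tables, and the sample identifiers are assumptions for illustration only.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[int, str, str]      # (node id, user id, permission)
PidTriple = Tuple[str, str, str]   # (PID, user id, permission)


def export_permissions(triples: List[Triple], node_to_pid: Dict[int, str]) -> List[PidTriple]:
    """At the source: replace archive-local node ids by PIDs, which are
    the same at both archives."""
    return [(node_to_pid[node], user, perm) for node, user, perm in triples]


def import_permissions(
    exported: List[PidTriple],
    pid_to_node: Dict[str, int],
    known_users: Set[str],
) -> List[Triple]:
    """At the destination: map PIDs back to local node ids and discard
    triples for users that do not exist there. A PID missing from
    `pid_to_node` raises KeyError in this sketch."""
    return [
        (pid_to_node[pid], user, perm)
        for pid, user, perm in exported
        if user in known_users
    ]


# Hypothetical sample data
source_triples = [(10, "alice", "read"), (11, "bob", "write")]
exported = export_permissions(source_triples, {10: "pid-A", 11: "pid-B"})
imported = import_permissions(exported, {"pid-A": 501, "pid-B": 502}, {"alice"})
```

The triple for "bob" is dropped on import because that user does not exist at the destination in this example.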
Language Archive Specific
• (4) Update PID information.
  • After replicating a file from the source archive to another archive, the file's PID record has to be updated.
  • Create an export at the destination archive.
    • (pid, url) pairs.
  • Transfer to the parent archive.
  • Import into the PID system.
• How to administer these changes to the PID record?
  • New domains are always allowed to be added.
  • An archive is only allowed to update its ‘own’ URL, assuming the domain is constant.
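One possible reading of the two update rules above is sketched below: a URL from a domain not yet in the record is appended, while a URL from a known domain replaces only the entry of that same domain. The record layout (a plain list of URLs, at most one per domain) and the example URLs are assumptions. No real PID or handle system API is used.

```python
from typing import List
from urllib.parse import urlparse


def apply_update(record_urls: List[str], new_url: str) -> List[str]:
    """Return the updated list of URLs stored for one PID.

    Assumption: the record holds at most one URL per domain.
    """
    new_domain = urlparse(new_url).netloc
    updated: List[str] = []
    replaced = False
    for url in record_urls:
        if not replaced and urlparse(url).netloc == new_domain:
            updated.append(new_url)   # an archive updates its 'own' URL
            replaced = True
        else:
            updated.append(url)       # URLs of other domains stay untouched
    if not replaced:
        updated.append(new_url)       # a new domain may always be added
    return updated


# Hypothetical (pid, url) import for a single PID record
record = ["http://archive-a.example/f1.wav"]
record = apply_update(record, "http://archive-b.example/data/f1.wav")
record = apply_update(record, "http://archive-a.example/new/f1.wav")
```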
Results
• Performance test executed.
  • Transfer files from one zone (MPI) to another federated zone (RZG).
  • Gigabit connection.
• Two sets of tests:
  • An increasing number of small files (100 KB).
  • A decreasing number of files of increasing size (1 MB to 1 GB).
Results
• Local pilot to test the initial workflow.
  • Transfer files.
  • Trigger crawler.
    • Invoke script at destination.
  • Export permissions.
    • Invoke XML-RPC at source to create export.
    • Transfer export file.
    • Invoke XML-RPC at destination to import.
• Initial results look promising.
Results
• What to do:
  • Implement the local pilot project in the Nijmegen-Garching environment.
  • Support sub-tree synchronization.
  • Support updating of handle records.
• The interconnection layer requires changes in existing software.
  • The repository is required to provide the interconnection functionality for the synchronization workflow actions it uses.
Questions / Discussion
Any questions?