
REPLIX www.mpi.nl/replix



Presentation Transcript


  1. REPLIX (www.mpi.nl/replix)
     Willem.Elbers(@mpi.nl), Max Planck Institute for Psycholinguistics, TLA

  2. Agenda
     • Goals
     • Motivation
     • Infrastructure
     • Language Archive specific
     • Results
     • Discussion
     REPLIX – Repository / Workspace Workshop – September 2010

  3. REPLIX Goals
     • Data replication / synchronization between repositories at a logical level.
       • What is this logical level?
       • More than just moving files.
       • What about access rights?
       • What about structure defined on top of the data?
       • What about persistent identifiers (PIDs)?
       • What about things we didn’t think about?
     • Workflow based and easy to configure and adapt for different scenarios.
       • Workflow is a chain of small tasks (a.k.a. blocks).
       • Easy to develop and integrate new blocks.
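The "chain of small tasks" idea can be illustrated with a minimal Python sketch. It is not the REPLIX implementation (which the later slides build on iRODS rules); the block names `transfer_files` and `copy_access_rights` and the dictionary-based context are assumptions made only for illustration.

```python
from typing import Callable, Dict, List

# A block is a small task: it receives the shared context and returns
# a (possibly updated) context for the next block in the chain.
Block = Callable[[Dict[str, object]], Dict[str, object]]


def run_workflow(blocks: List[Block], context: Dict[str, object]) -> Dict[str, object]:
    """Run each block in order, passing the context along the chain."""
    for block in blocks:
        context = block(context)
    return context


# Hypothetical blocks, named only for this example.
def transfer_files(ctx: Dict[str, object]) -> Dict[str, object]:
    return {**ctx, "transferred": list(ctx["files"])}


def copy_access_rights(ctx: Dict[str, object]) -> Dict[str, object]:
    return {**ctx, "rights_copied": True}


result = run_workflow(
    [transfer_files, copy_access_rights],
    {"files": ["a.wav", "a.imdi"]},
)
```

Adding a new block then only means writing one more function with the same signature and putting it in the list, which is one way to read "easy to develop and integrate new blocks".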

  4. REPLIX Goals
     • Independent of repository implementation.
       • Repositories use different software solutions and we do not expect them to change.
       • How do we synchronize between different software solutions?
       • Inter-connection layer.
     • Originating repository is and should remain owner of the data.
       • Repository should never depend on REPLIX for anything other than the synchronization.
       • REPLIX has access to the file system but does not control it.
       • Repository controls its data with policies.

  5. REPLIX Goals
     [Diagram, shown for two repositories. Labels: Repository Software / Tools, Repository File System, inter-connect, REPLIX.]

  6. Motivation – REPLIX TLA
     • Open up the MPI LAT backup sites (B1, B2) as read-only archives.
     • Improve the replication process in general.
       • Speed.
       • Validation.
       • Parts of the archive.
       • Update PID information.
     • Keep generalization in mind, to provide an out-of-the-box solution for other repositories.
     [Diagram labels: Nijmegen, Garching, Göttingen; LAT, B1, B2.]

  7. Motivation – Software
     • Implementation of the REPLIX communication system.
       • iRODS looks like a promising candidate (federated zones).
     • Implementation of the interface to the repository file system.
       • iRODS looks like a promising candidate.
     • Implementation of the inter-connection layer.
       • REPLIX side.
         • iRODS looks like a promising candidate by using a custom module.
       • Repository side.
         • Will require custom programming.

  8. Motivation
     • Perform iRODS performance tests.
       • See if iRODS lives up to our expectations.
       • How does iRODS compare to the current rsync process?
     • Develop a concrete test case, based on iRODS, to test our ideas.
       • Main archive located at the MPI in Nijmegen, the Netherlands.
       • Backup archive located at the RZG in Garching, Germany.
       • Approximately 25–30 TB.
       • How much do we have to change in the existing software?

  9. Infrastructure
     • iRODS zones provide the archive-to-archive connection.
       • The data of a single archive exists inside one iRODS zone.
       • Use federated zones, ensuring each archive remains autonomous.
     • Loose connection to the file system.
       • iRODS mounted collections.
       • iRODS regular collections are too strict, since iRODS controls all files and their metadata.
     • XML-RPC interface to existing software (inter-connection layer).
       • Develop an iRODS micro-service to facilitate XML-RPC communication.
       • Use some reserved disk space for caching purposes.
       • How to handle different method signatures?

  10. Infrastructure
     [Architecture diagram. Labels: REPLIX WM; iRODS (icommands, jargon, core + rule engine, micro services, msiXmlRPC, virtual file system, rule-base, replix rule-base, mounted collection(s)); existing software stack; replix scripts; scripts; repository (local) file system.]

  11. Infrastructure
     • Two ways of interacting with the REPLIX system:
       • Use the workflow manager (WM).
       • Invoke the workflow rules directly through the icommands.
     • Workflow manager is preferred.
       • Exposed through a REST-service interface.

  12. Language Archive Specific
     • How does this fit into the existing LAT infrastructure?
     [Diagram. Labels: SOURCE, DESTINATION, LAT 1, LAT 2, REPLIX; components shown on both sides: PID, corpus-structure, crawler, AMS, LAMUS, IMDI Browser.]

  13. Language Archive Specific
     • LAT synchronization workflow:
       • (1) File synchronization.
         • iRODS sync.
       • (2) Start crawler (index all files).
         • msiExecCmd.
       • (3) Permission synchronization.
         • msiXmlRpc.
       • (4) Update PID information.
         • msiXmlRpc.
     • Each step implemented as an iRODS action.

  14. Language Archive Specific
     • (1) Synchronize based on nodes in the archive tree.
       • If the node is the root node, synchronize all files.
       • If the node is not the root node, create a list of files that need to be synchronized and synchronize them.
       • File list export functionality needs to be available.
       • Do not touch file content.
     • (2) Start the crawler.
       • Use the iRODS “msiExecCmd” micro-service to start the crawler at the destination archive through a script.
       • The time this could take might be a problem.
       • PIDs should remain untouched and can be used as a reference to the parent archive.
       • Do not touch file content.
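The node-based selection in step (1) can be sketched in Python as follows. The in-memory `children` and `files` dictionaries are assumptions standing in for the archive's real file list export; returning `None` for the root node is simply this sketch's way of saying "synchronize everything, no list needed".

```python
from typing import Dict, List, Optional


def files_to_sync(
    children: Dict[str, List[str]],   # node -> child nodes (hypothetical layout)
    files: Dict[str, List[str]],      # node -> files attached to that node
    node: str,
    root: str,
) -> Optional[List[str]]:
    """Return None for a full sync, or the file list for a sub-tree."""
    if node == root:
        # Root node: synchronize all files, no explicit list required.
        return None
    # Any other node: collect the files of the node and all its descendants.
    selected: List[str] = []
    stack = [node]
    while stack:
        current = stack.pop()
        selected.extend(files.get(current, []))
        stack.extend(children.get(current, []))
    return sorted(selected)


children = {"root": ["corpusA", "corpusB"], "corpusA": ["session1"]}
files = {
    "corpusA": ["a.imdi"],
    "session1": ["s1.wav", "s1.imdi"],
    "corpusB": ["b.imdi"],
}
subtree_list = files_to_sync(children, files, "corpusA", "root")
full_sync = files_to_sync(children, files, "root", "root")
```

The function only reads names and never opens a file, in line with "do not touch file content".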

  15. Language Archive Specific
     • (3) Replicate archive permissions.
       • AMS is in charge of the permissions in the archive.
         • (node id, user id, permission) triples.
       • Create an export, based on the selected node, at the source archive.
       • Transfer the export to the destination archive.
       • Import the data into AMS at the destination archive.
       • Export based on PIDs.
         • Constant between source and destination archive.
         • Translate between PID and node id, since AMS internally uses node ids.
       • How to synchronize users?
         • Discard triples for non-existing users.
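The export/import of permission triples can be sketched in Python like this. It does not use the AMS API; the lookup dictionaries, the user set, and the example identifiers (`hdl:0000/demo-1`, `n1`, `d7`) are hypothetical. The sketch also assumes every exported PID already resolves to a node at the destination, for example because the crawler has run.

```python
from typing import Dict, List, Set, Tuple

NodeTriple = Tuple[str, str, str]  # (node id, user id, permission)
PidTriple = Tuple[str, str, str]   # (pid, user id, permission)


def export_permissions(
    triples: List[NodeTriple], node_to_pid: Dict[str, str]
) -> List[PidTriple]:
    """Source side: replace archive-local node ids by PIDs."""
    return [(node_to_pid[node], user, perm) for node, user, perm in triples]


def import_permissions(
    exported: List[PidTriple],
    pid_to_node: Dict[str, str],
    known_users: Set[str],
) -> List[NodeTriple]:
    """Destination side: map PIDs back to local node ids, dropping unknown users."""
    imported: List[NodeTriple] = []
    for pid, user, perm in exported:
        if user not in known_users:
            continue  # discard triples for non-existing users
        imported.append((pid_to_node[pid], user, perm))
    return imported


source_triples = [("n1", "alice", "read"), ("n1", "bob", "read")]
exported = export_permissions(source_triples, {"n1": "hdl:0000/demo-1"})
imported = import_permissions(exported, {"hdl:0000/demo-1": "d7"}, {"alice"})
```

Because the PID is the same at both archives while node ids differ, the export carries PIDs and each side does its own translation.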

  16. Language Archive Specific
     • (4) Update PID information.
       • After replicating a file from the source archive to another archive, the file’s PID record has to be updated.
       • Create an export at the destination archive.
         • (pid, url) pairs.
       • Transfer to the parent archive.
       • Import into PID system.
       • How to administer these changes to the PID record?
         • New domains are always allowed to be added.
         • Only allowed to update ‘own’ URL → assume domain is constant.
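One possible reading of the update rule for PID records, as a Python sketch. It models a PID record as a plain list of URLs rather than a real handle record, identifies "own" URLs by host name, and ignores pairs for unknown PIDs; all three choices are assumptions, as are the example PID and the `example.org` / `example.net` URLs.

```python
from typing import Dict, List, Tuple
from urllib.parse import urlparse


def apply_pid_updates(
    records: Dict[str, List[str]],     # pid -> URLs currently in the record
    updates: List[Tuple[str, str]],    # (pid, url) pairs from the destination
) -> Dict[str, List[str]]:
    """Return updated records; the input is left unmodified."""
    result = {pid: list(urls) for pid, urls in records.items()}
    for pid, url in updates:
        if pid not in result:
            continue  # assumption: replication never creates new PIDs
        domain = urlparse(url).netloc
        urls = result[pid]
        for i, existing in enumerate(urls):
            if urlparse(existing).netloc == domain:
                urls[i] = url  # same domain: the archive updates its own URL
                break
        else:
            urls.append(url)   # new domain: always allowed to be added
    return result


records = {"hdl:0000/demo-1": ["https://archive.example.org/a.wav"]}
added = apply_pid_updates(
    records, [("hdl:0000/demo-1", "https://backup.example.net/a.wav")]
)
moved = apply_pid_updates(
    added, [("hdl:0000/demo-1", "https://backup.example.net/moved/a.wav")]
)
```

Matching on the domain is what makes "assume domain is constant" matter: an archive that changed its host name would appear as a new domain rather than as an update of its own entry.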

  17. Results
     • Performance test executed.
       • Transfer files from one zone (MPI) to another federated zone (RZG).
       • Gigabit connection.
     • Two sets of tests:
       • Increasing number of small files (100 KB).
       • Decreasing number of files of increasing size (1 MB → 1 GB).

  18. Results
     • (Local) pilot to test initial workflow.
       • Transfer files.
       • Trigger crawler.
         • Invoke script at destination.
       • Export permissions.
         • Invoke XML-RPC at source to create export.
         • Transfer export file.
         • Invoke XML-RPC at destination to import.
     • Initial results look promising.

  19. Results
     • What to do:
       • Implement local pilot project in Nijmegen–Garching environment.
       • Support sub-tree synchronization.
       • Support updating of handle records.
     • The interconnection layer requires changes in existing software.
       • The repository is required to provide the interconnection functionality for the synchronization workflow actions it uses.

  20. Questions / Discussion
     • Any questions?
