
Beyond code: Versioning data with Git and Mercurial

Explore how to manage data and metadata with the version control systems Mercurial and Git. Learn from two real-life case studies and see why distributed version control matters.

ksilcox


Presentation Transcript


  1. Beyond code: Versioning data with Git and Mercurial Stephanie Collett and Martin Haye California Digital Library, University of California

  2. Not on Agenda

  3. Agenda • Background • Case Study #1: eScholarship Backup • Case Study #2: Zephir Metadata • Summary

  4. Version Control Repository Code

  5. Version Control Repository Data/Metadata

  6. Why distributed?

  7. Case #1 eScholarship Data/Metadata Backup

  8. eScholarship

  9. ~50k scholarly works

  10. 10 files per work XML Metadata }

  11. ~500,000 files total XML Metadata }

  12. XML Metadata Single Mercurial Repository

  13. Working Repository → Backup Repository: Nightly Sync (hg push)

  14. XML Metadata Single Mercurial Repository

  15. XML Metadata Single Mercurial Repository .hgignore

  16. Working Storage → Backup Storage: Nightly Sync (rsync)

  17. 30-60 minutes for the batch job
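The nightly batch job outlined in slides 13 to 17 could be sketched roughly as below. This is a minimal sketch under stated assumptions, not the presenters' actual script: the function names, the commit message, and every path are hypothetical, and it assumes `hg` and `rsync` are installed. It also assumes that the files excluded through `.hgignore` are the ones mirrored by rsync, which the slides suggest but do not state.

```python
import subprocess
from typing import Callable, List


def build_nightly_commands(work_repo: str, backup_repo: str,
                           work_storage: str, backup_storage: str) -> List[List[str]]:
    """Return the commands for one nightly run, in execution order."""
    return [
        # Record added, changed, and removed files (-A is addremove).
        ["hg", "-R", work_repo, "commit", "-A", "-m", "Nightly snapshot"],
        # Send the new changesets to the backup repository.
        ["hg", "-R", work_repo, "push", backup_repo],
        # Mirror the storage tree; the trailing slash copies the
        # directory's contents rather than the directory itself.
        ["rsync", "-a", "--delete", work_storage.rstrip("/") + "/", backup_storage],
    ]


def run_nightly(commands: List[List[str]],
                runner: Callable = subprocess.run) -> None:
    """Run each command, stopping at the first real failure."""
    for cmd in commands:
        result = runner(cmd)
        # hg commit and hg push exit with 1 when there is nothing to do,
        # so that is not treated as an error here.
        ok_codes = (0, 1) if cmd[0] == "hg" else (0,)
        if result.returncode not in ok_codes:
            raise RuntimeError(f"{cmd[0]} failed with exit code {result.returncode}")


if __name__ == "__main__":
    # Hypothetical locations, for illustration only.
    for command in build_nightly_commands("/data/work-repo", "/backup/repo",
                                          "/data/storage", "/backup/storage"):
        print(" ".join(command))
```

Passing `runner` as a parameter keeps the sketch testable without touching a real repository; a scheduler such as cron would call `run_nightly` with the default.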

  18. Logs Date } Commit History Annotation Change

  19. Case #2 Zephir Metadata Management System

  20. Zephir

  21. File system record/

  22. File system record/ marc.xml

  23. File system record/ marc.xml attrbutes.xml summary.xml transform.xsl

  24. File system record/ .git/ marc.xml attrbutes.xml summary.xml transform.xsl

  25. ...
      /pairtree/ab/cd/e/record/.git
      /pairtree/ab/cd/ea/record/.git
      /pairtree/ab/cd/ez/record/.git
      /pairtree/ab/cd/f2/record/.git
      /pairtree/ab/cd/f9/record/.git
      /pairtree/ab/cd/ff/record/.git
      /pairtree/ab/cd/fm/record/.git
      /pairtree/ab/cd/fq/record/.git
      /pairtree/ab/cd/gi/record/.git
      /pairtree/ab/cd/gw/record/.git
      /pairtree/ab/cd/gz/record/.git
      /pairtree/ab/cd/hs/record/.git
      /pairtree/ab/cd/ht/record/.git
      /pairtree/ab/cd/i/record/.git
      ...
      (10 million in total)
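The directory layout on slide 25 follows the pairtree convention of splitting an identifier into two-character segments. A minimal sketch of that mapping is shown below; the function name and the sample identifiers are hypothetical (inferred from the paths on the slide), and the character escaping that the full Pairtree specification applies to identifiers is left out.

```python
from pathlib import PurePosixPath


def pairtree_repo_path(identifier: str, root: str = "/pairtree") -> PurePosixPath:
    """Map a record identifier to the .git directory of its repository."""
    if not identifier:
        raise ValueError("identifier must not be empty")
    # Split into two-character segments; an odd trailing character
    # becomes a one-character segment of its own.
    segments = [identifier[i:i + 2] for i in range(0, len(identifier), 2)]
    return PurePosixPath(root, *segments, "record", ".git")


if __name__ == "__main__":
    print(pairtree_repo_path("abcde"))   # /pairtree/ab/cd/e/record/.git
    print(pairtree_repo_path("abcdea"))  # /pairtree/ab/cd/ea/record/.git
```

Because each record resolves to its own path, a repository can be opened directly from the identifier without scanning the tree.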

  26. Individually

  27. Versioning + Audit Trail + Diffing + Debugging

  28. Collectively

  29. record/ marc.xml

  30. 1 file, ~4k

  31. record/ marc.xml attrbutes.xml summary.xml transform.xsl

  32. 4 files, ~36k

  33. .git/ branches/ config description HEAD hooks/ index info/ objects/ refs/

  34. 43 files, ~132k (record/ + record/.git)

  35. ~132k x 10 million (record/ + record/.git)

  36. 43 files x 10 million (record/ + record/.git)
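The totals implied by slides 30 to 36 can be worked out with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes that "k" on the slides means 1,000 bytes and that every record matches the per-record figures shown; the variable names are illustrative.

```python
RECORDS = 10_000_000          # "10 million" per-record repositories
BYTES_PER_RECORD = 132_000    # ~132k for record/ + record/.git (assumes k = 1,000 bytes)
DATA_BYTES_PER_RECORD = 36_000  # ~36k for the four data files alone
FILES_PER_RECORD = 43         # files in record/ + record/.git

total_bytes = RECORDS * BYTES_PER_RECORD
total_files = RECORDS * FILES_PER_RECORD
# Share of each record's footprint taken up by the .git directory.
git_share = (BYTES_PER_RECORD - DATA_BYTES_PER_RECORD) / BYTES_PER_RECORD

print(f"{total_bytes / 1e12:.2f} TB on disk")   # 1.32 TB on disk
print(f"{total_files:,} files")                 # 430,000,000 files
print(f"{git_share:.0%} of each record is Git bookkeeping")  # 73% ...
```

Under these assumptions the collection needs about 1.32 TB and 430 million files, with roughly three quarters of the space going to the `.git` directories rather than the data.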
