1 / 18

Building a Demonstration Prototype for the Preservation of Large-scale Multi-media Collections

Building a Demonstration Prototype for the Preservation of Large-scale Multi-media Collections Arcot Rajasekar San Diego Supercomputer Center, University of California, San Diego. DIGARCH: Our Project Goals. Design and Development of a “Generic” Prototype for

alaric
Download Presentation

Building a Demonstration Prototype for the Preservation of Large-scale Multi-media Collections

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Building a Demonstration Prototype for the Preservation of Large-scale Multi-media Collections Arcot Rajasekar San Diego Supercomputer Center, University of California, San Diego

  2. DIGARCH: Our Project Goals • Design and Development of a “Generic” Prototype for Preserving Digital Video Collections • Preservation Life-cycle meshing seamlessly with content production • Minimal impact to production life-cycle • Management of Authenticity, Integrity, and Infrastructure Independence • Workflow system that automates accession, description, organization and preservation of videoand associated contents • Metadata definition, extraction and ingestion • Long-term retention and technology migration • At risk Collection: ‘Conversation with History’ video collection • Video, audio, text transcripts, web-based material • Databases of administrative and descriptive metadata • Derived products • Time-line: July 2005 – June 2006

  3. Content Generation & Preservation Workflow Integration Video tape interview at UCB 2 cameras – switched to DVCam UC Berkeley Edit and Format for Broadcast DVCam to DVCam Final Cut Pro Edit System Create Transcript Create Web Site Encode to Broadcast Server MPEG 2 Encode for Web Archive Real Media Create Web Site UC San Diego Accession Modules Workflow System Preservation Infrastructure Chronopolis – Long-term Preservation

  4. Preservation Framework Pre-Interview Interview Broadcast/Transfer Transcription Post Interview Metadata Analysis Schema Generation SIP/AIP Definitions Capture Scripts Metadata Validation Interview Metadata Capture Metadata DB Capture Make SIPS Aggregate AIP/Verify Store/Replicate/Preserve TV production Lifecycle Metadata Definition & Capture Workflow Persistent Archival Workflow

  5. Tools • Kepler • Interactive Workflow Tool • Used in multiple NSF projects – SEEK, GEON, etc • Storage Resource Broker – Infrastructure Independence • Collection Management Tool – Organization & Repurposing • Uniform Access Abstraction with Access Control/Audit Trails • Distributed, Heterogeneous Resource Access • Metadata Catalog – Authenticity & Integrity Maintenance • System, Structural, Descriptive, Preservation Metadata • User-defined and Discipline Specific Metadata • Extensible Schema Support • Web-Preservation Software • Automatic Crawl of Web sites • Auto-translation of links

  6. Research and Development • Metadata and Data Accession • Human Interactions – Study of Methodology • Accession Module • Design and validation of preservation templates • Producer-archive submission pipeline • Authenticity metadata extraction • Integrity metadata extraction • Development of preservation workflow • Mapping of templates to Kepler workflow actors • Validation of workflows • Extensions to the SRB data grid • Administrative tools for applying preservation policies

  7. Architectural Components • Metadata Capture • Descriptive • System & Structural • Provenance • Transcripts & Web Capture • Preservation-ready • Generation & Preservation of AIP • Video AIP • Metadata AIP • Transcript/Web AIP

  8. Capturing/Preserving Interview

  9. AIP

  10. What’s in a Transcript? STATES: • Transcripts in a state of being: “Dummy pages” (transcripts in “gestation”) • Transcript exists but not posted (e.g. when interviewee hasn’t okay-ed it yet) • Transcript not produced yet (e.g. audio not sent to transcriber – due to time and/or funding) • Transcripts posted • Transcripts edited after being posted • Also: • Transcripts in limbo (not scheduled for transcription– postponed) • With dummy page (e.g. 2005:SEN – see next page) • Without dummy page (e.g. 1987: LUTTWAK) FORMS: • Full-text form (1 piece – no outline or chunking): 2003: Terkel http://globetrotter.berkeley.edu/people3/Terkel/terkel-con0.html 1983: Thompson http://globetrotter.berkeley.edu/conversations/Thompson/thompson-con0.html • Regular form (chunked into pieces according to an index) EXAMPLES: • 2004: Englehardt page 1 has 2 embedded pics... http://globetrotter.berkeley.edu/people4/Engelhardt/engelhardt-con0.html • 2002: Lustick see photos (whole page!) -- also Norman Myers – also Wendy Ewall http://globetrotter.berkeley.edu/people2/Lustick/lustick-con0.html • 1999: Shinoda & Iwashita: use of different colors in the interview http://globetrotter.berkeley.edu/conversations/Shinoda_Iwashita/shinoda_iwashita0.html

  11. A Typical Transcript:Dec. 7, 2000: James FallowsOutline + Linked pages

  12. “Smart Harvesting of Transcripts” Algorithm • For each chronological listing page (chron & chron2): • Extract all semantically valid showID:transcriptURL pairs -> video-link prefix name + transcript URL based on “transcript type” and “transcript and video-link dates” • For each pair, crawl the main transcript URL, using the transcript outline (regular expression) to inline the linked pages(tables and images are filtered out) • Save each “blended” .html page to a separate file

  13. WRAP Workflow: Reconfigurable Workflows for Archival Preservation Replicate Ingest Stage Trigger Three Video Formats are preserved: Mov, MPEG, RT An Active “Push” Daemon initiates the workflow when videos become available Each Video is staged into the preservation environment and replicated

  14. Safe Ingest Workflow Switcher Depending on the result from Safe Ingest composite actor, the switchers will decide to do the follow-on actions or to re-do Ingest.

  15. Using MD5 Checksum to Do Safe Ingest MD5 checksum on Client Machine checksum in SRB Comparing the checksums

  16. Players & Roles • UCTV • Video generation and Broadcast • Transcription • Descriptive and production metadata generation • UCSD-TV • Webcast generation • Multiple Formats • Descriptive metadata generation • UCSD Libraries • Preservation Models and Processes • Description, Metadata Definition • Preservation metadata generation • Accession and Arrangement • SDSC • Development of Life-cycle Management systems • Storage • Preservation • Access

  17. Conclusion: Build A Preservation Workflow • Built upon a track record: • SDSC’s collection & storage management capabilities and in developing production software • UCSD Library leadership role in Digital preservation for library collections • Strong experience of UCTV and UCSD-TV in running content generation systems • Uses Proven Technologies: • Kepler and workflows • SRB, MCAT collection management systems • We are well on our way in delivering a prototype

More Related