
Portico: A Case Study in the Migration of Proprietary Formats to the JATS Archiving Format

Portico: A Case Study in the Migration of Proprietary Formats to the JATS Archiving Format. Sheila Morrissey, John Meyer, Sushil Bhattarai, Sachin Kurdikar, Jie Ling, Matthew Stoeffler, Umadevi Thanneeru. Portico & JSTOR: Committed to Preserving the Scholarly Record.


Presentation Transcript


  1. Portico: A Case Study in the Migration of Proprietary Formats to the JATS Archiving Format
     Sheila Morrissey, John Meyer, Sushil Bhattarai, Sachin Kurdikar, Jie Ling, Matthew Stoeffler, Umadevi Thanneeru

  2. Portico & JSTOR: Committed to Preserving the Scholarly Record
     Ithaka helps the academic community use digital technologies to preserve the scholarly record and to advance research and teaching in sustainable ways.
     [Diagram labels: ITHAKA; Digital Preservation; Digitization for Preservation & Access; “Light Archive”; “Dark Archive”]
     JATS-CON 2010

  3. Portico Archive
     • Portico’s objective is to help libraries make a secure and reliable transition from print to a reliance on e-content.
     • Maintains archiving agreements with publishers to collect and preserve content.
     • Receives content directly from publishers.
     • Preserves:
       • Current journals (born digital)
       • Back file journals (reborn digital)
       • E-books
       • Digitized historical collections

  4. An “Insurance Policy” for e-Content
     • Provide libraries with access to archived content when it becomes lost, orphaned or abandoned (regardless of a library’s past or current subscription):
       • Publisher ceases operation
       • Publisher discontinues title
       • Publisher drops back file
     • Provide libraries with post-cancellation access, if the publisher specifically names Portico
       • About 90% of titles in the Archive are covered by Portico post-cancellation access rights.
     • Libraries are asked to pay an annual Archive support payment to defray the cost of preservation, i.e. an “insurance premium”

  5. Portico Archive as of July 19, 2010
     • 114 publisher participants
     • 11,788 committed journal titles
     • 43,253 committed e-books
     • 13 committed digitized collections
     • >14 million articles ingested
     • 688 library participants (48% outside US)
     • 4 trigger events
     • 15 post-cancellation access claims

  6. Portico Preservation Infrastructure
     • Publisher supplies XML source file (including the text, images) and PDF page rendition.
     • Best approach for preserving the intellectual content of the article or book.
     • Authenticate: verify that preserved content is what it purports to be.
     • Verify format: ensure the file meets syntactic and semantic rules of format specification.
     • Repair
     • Normalize (XML)
     • Create preservation metadata
     • Assess archival robustness of file format.
     • Migrate files to ensure future usability of content.
     • Replicate objects and metadata to protect against bit rot and media deterioration.
     • Render articles to meet viewing requirements of delivery platform.
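Two of the steps on this slide, format verification and replication, can be illustrated with a minimal sketch. It is not Portico's tooling, which the slides do not describe. The function names are hypothetical, SHA-256 is an assumed choice of digest, and the well-formedness check covers only the syntactic half of "verify format" (it does not validate against a DTD).

```python
import hashlib
import xml.etree.ElementTree as ET


def is_well_formed(xml_bytes):
    """Return True if the bytes parse as well-formed XML (syntax only, no DTD validation)."""
    try:
        ET.fromstring(xml_bytes)
        return True
    except ET.ParseError:
        return False


def fixity(data):
    """Return a hex digest used to detect silent corruption (bit rot)."""
    return hashlib.sha256(data).hexdigest()


def replicas_match(original, replicas):
    """Check that every replica still has the same digest as the original."""
    expected = fixity(original)
    return all(fixity(copy) == expected for copy in replicas)


if __name__ == "__main__":
    article = b"<article><front/><body/></article>"
    print(is_well_formed(article))
    print(replicas_match(article, [article, article]))
```

Recording the digest in the preservation metadata at ingest time is what would let a later audit distinguish a damaged replica from an intact one.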

  7. Key Challenges for an Archival DTD
     In December 2001, Inera’s “E-Journal Archive DTD Feasibility Study” highlighted these key challenges for an archival DTD:
     • Use of generated and boilerplate text, especially in
       • Label text for figure captions
       • Citation text
     • Author name and affiliation
     • Dates
     • Expression of links between author and affiliation
     • Reference elements
     • Expression of non-article and other content
     • Abbreviations and definitions

  8. Key Challenges for an Archival DTD
     • Keywords
     • Sections, including handling of sections without headers
     • Placement of floating objects, such as figures, tables, graphs
     • Tables, including cell formatting issues (cells with figures, content alignment, etc.)
     • Math
     • Intra-, inter- and extra-article linking
     • Publisher-specific elements
     When reviewing the minutes of the Working Group and the evolution of the DTD, we can confirm that these areas have been the main focus of discussion.

  9. Some Design Constraints
     • IMPLIED, not REQUIRED attributes
     • CDATA instead of controlled list
     • Optional Elements, or relaxed order of elements
     • Surprising location of Elements
     • No Domain Specific Elements

  10. Publisher/Domain Specific Elements
      • Custom-Meta
        • Business Data
        • Allowed in journal-meta, article-meta, front-stub
        • Name/Value pair (may contain 38 different Elements)
      • Named-Content
        • Semantic Significance
        • Allowed in 112 Elements
        • May contain 59 different Elements

  11. Challenges posed by source DTDs
      Extended Semantics for Named-Content
      • Price in Citation
        • Becomes <named-content content-type="price">
      <citation reference="1" id="R1" type="serial">
        <author order="1">
          <name><first>S. P.</first><last>Morgan</last></name>
        </author>
        <journal>
          <sertitle>J. Appl. Phys.</sertitle>
          <URI type="ISSN">0030-3941</URI><price>$01.00</price>
          <volume>29</volume>
          <pages><first>1358</first><last>1368</last></pages>
          <pubdate>1958</pubdate>
        </journal>
        <title>General solution of the Luneburg lens problem</title>
      </citation>
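The price mapping on this slide can be sketched as a small transform. This is an illustration only, assuming Python's standard `xml.etree.ElementTree`; the helper name `to_named_content` is hypothetical and is not taken from Portico's migration code. Only the `<price>` element is retagged, and the rest of the citation stays in the source vocabulary.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the citation shown on the slide.
SOURCE = """<citation reference="1" id="R1" type="serial">
  <journal>
    <sertitle>J. Appl. Phys.</sertitle>
    <URI type="ISSN">0030-3941</URI><price>$01.00</price>
    <volume>29</volume>
  </journal>
</citation>"""


def to_named_content(root, source_tag, content_type):
    """Retag every <source_tag> element as <named-content content-type=...>.

    Text, children, and any existing attributes are left in place.
    """
    # Materialize the matches first so retagging cannot disturb iteration.
    for elem in list(root.iter(source_tag)):
        elem.tag = "named-content"
        elem.set("content-type", content_type)
    return root


citation = to_named_content(ET.fromstring(SOURCE), "price", "price")
price = citation.find(".//named-content")
print(price.get("content-type"), price.text)
```

The same helper would cover the later slides that map a publisher-specific element (funding, division, addinfo) to a `content-type` value, since each of those is also a one-to-one retagging.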

  12. Challenges posed by source DTDs
      More Extended Semantics for Named-Content
      • Affiliation in Footnotes/P
        • Becomes <named-content content-type="aff" id="AFF2">
      <FOOTNOTE ID="N101" TYPE="AFF">
        <P ALPHABET="LATIN" TYPE="INDENT"><AFF ID="AFF2"><IT>Corresponding author address:</IT> Nicholas M. J. Hall, Dept. of Atmospheric and Oceanic Sciences, McGill University, 805 Sherbrooke St. W., Montreal PQ H3A 2K6, Canada.</AFF></P>
      </FOOTNOTE>

  13. Challenges posed by source DTDs
      More Extended Semantics for Named-Content
      • Funding in Acknowledgments/P
        • Becomes <named-content content-type="funding">
      <ack>
        <sectitle>ACKNOWLEDGMENTS</sectitle>
        <p>Q.W.&#x2019;s research is partially supported by AFOSR Grant No. <funding source="USAFOSR"><contract>F49550-05-1-0025</contract></funding>
        and NSF Grants No. <funding source="NSF"><contract>DMS-0204243</contract></funding>,
        No. <funding source="NSF"><contract>DMS-0605029</contract></funding>,
        and No. <funding source="NSF"><contract>DMS-0626180</contract></funding>.
        P.Z. is partially supported by the special funds for major State Research Projects <funding source="UNSPECIFIED"><contract>2005CB321704</contract></funding>
        and National Science Foundation of China for Distinguished Young Scholars <funding source="NSFC"><contract>10225103</contract></funding>.
        H.Z.&#x2019;s work is supported in part by the Naval Postgraduate School Research Initiation Program.</p>
      </ack>

  14. Challenges posed by source DTDs
      More Extended Semantics for Named-Content
      • Organization Division in Affiliation
        • Becomes <named-content content-type="division">
      <Affiliation ID="Aff12">
        <OrgDivision>Optisches Institut</OrgDivision>
        <OrgName>Technische Universität Berlin</OrgName>
        <OrgAddress>
          <City>Berlin</City>
          <Country>Germany</Country>
        </OrgAddress>
      </Affiliation>

  15. Challenges posed by source DTDs
      More Extended Semantics for Named-Content
      • Generic Element (addinfo)
        • Becomes <named-content content-type="addinfo">
      <ref-conf id="CIT0045">
        <ref-conf-text>
          <author-ref-text><surname>Bishop</surname> <givenname>CJ</givenname></author-ref-text>,
          <author-ref-text><surname>Aanenses</surname> <givenname>DM</givenname></author-ref-text>,
          <author-ref-text><surname>Jordan</surname> <givenname>GE</givenname></author-ref-text>,
          <author-ref-text><surname>Kilian</surname> <givenname>M</givenname></author-ref-text>,
          <author-ref-text><surname>Hanage</surname> <givenname>WP</givenname></author-ref-text>,
          <author-ref-text><surname>Spratt</surname> <givenname>BG.</givenname></author-ref-text>
          <presentationtitle>Electronic taxonomy: assigning strains to bacterial species via the internet</presentationtitle>.
          <collectworktitle>BMC Biology</collectworktitle>
          <publicationfield-text><year>2009</year>; <year>7</year></publicationfield-text>: <firstpage>3</firstpage>.
          <addinfo>doi:10.1186/1741-7007-7-3</addinfo>.
        </ref-conf-text>
      </ref-conf>

  16. Challenges posed by source DTDs
      Target DTD Structural Constraints that force the use of Named-Content
      • Table in Table
        • TD contains named-content, which contains a table
          <td><named-content content-type="table"><table-wrap>
      • Figure in Table
        • TD contains named-content, which contains a fig
          <td><named-content content-type="figure"><fig>
      • Display-Formula in Title
        • Title contains named-content, which contains a display-formula
          <title><named-content content-type="display-formula"><display-formula>
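The structural workaround on this slide, placing a wrapper between a cell and the block element the target DTD does not allow there directly, can be sketched as follows. This is a hedged illustration using Python's standard `xml.etree.ElementTree`. The helper `wrap_in_named_content` and the sample cell are hypothetical, and the sketch does not check the result against the DTD.

```python
import xml.etree.ElementTree as ET


def wrap_in_named_content(parent, child, content_type):
    """Replace `child` inside `parent` with <named-content> that contains `child`."""
    index = list(parent).index(child)
    wrapper = ET.Element("named-content", {"content-type": content_type})
    # The text that followed the child must now follow the wrapper instead.
    wrapper.tail = child.tail
    child.tail = None
    parent.remove(child)
    wrapper.append(child)
    parent.insert(index, wrapper)
    return wrapper


# Hypothetical cell with a nested table in mixed content.
td = ET.fromstring('<td>See <table-wrap id="T2"><table/></table-wrap> for details</td>')
wrapper = wrap_in_named_content(td, td.find("table-wrap"), "table")
print(ET.tostring(td, encoding="unicode"))
```

Moving the tail text is the detail that is easy to miss: in mixed content, the text after the nested table belongs to the cell, not to the wrapped element.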

  17. Challenges posed by source DTDs
      • Question/Answer
        • Generic and Structural
        • Is saying <list list-content="question"> enough?
      <Question-Answer>
        <Q><P><L>1</L>. The major advantage of amniotic membrane transplantation in pterygium surgery is</P></Q>
        <A><P><L>A</L>. reduction in surgical time</P></A>
        <A><P><L>B</L>. preservation of conjunctiva</P></A>
        <A><P><L>C</L>. better cosmetic outcomes compared with conjunctival autografting</P></A>
        <A><P><L>D</L>. lowest recurrence rate among the surgical techniques</P></A>
      </Question-Answer>
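One way to see what the slide's question is getting at is to carry out the list mapping and look at what survives. The sketch below is an assumption about how such a mapping might be written, not the mapping Portico used. It turns each `<Q>` and `<A>` into a `<list-item>` with a `<label>` and a `<p>`, and the function name `qa_to_list` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the question/answer block shown on the slide.
SOURCE = """<Question-Answer>
<Q><P><L>1</L>. The major advantage of amniotic membrane transplantation in pterygium surgery is</P></Q>
<A><P><L>A</L>. reduction in surgical time</P></A>
<A><P><L>B</L>. preservation of conjunctiva</P></A>
</Question-Answer>"""


def qa_to_list(qa):
    """Flatten <Q>/<A> children into <list list-content="question"> items."""
    target = ET.Element("list", {"list-content": "question"})
    for item in qa:  # <Q> and <A> children, in document order
        source_label = item.find("P").find("L")
        list_item = ET.SubElement(target, "list-item")
        ET.SubElement(list_item, "label").text = source_label.text
        # The text after </L> starts with ". ", which the label makes redundant.
        ET.SubElement(list_item, "p").text = (source_label.tail or "").lstrip(". ")
    return target


qa_list = qa_to_list(ET.fromstring(SOURCE))
for list_item in qa_list:
    print(list_item.find("label").text, "|", list_item.find("p").text)
```

After this mapping the question and its answers are sibling items that differ only by label and position, so the distinction the source DTD made with `<Q>` and `<A>` is no longer explicit in the markup.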

  18. Challenges posed by source DTDs
      • Synonymy
        • Domain and Semantic
        • Is saying <list list-content="synonymy"> enough?
        • Or <named-content content-type="synonymy"> because of the semantic meaning?
      <SYNONYMY>
        <HEAD>ECHINOSTELIALES</HEAD>
        <ITEM><P><GENSP>Clastoderma debaryanum</GENSP> A. Blytt</P></ITEM>
        <ITEM><P><GENSP>Echinostelium apitectum</GENSP> K.D. Whitney, MC</P></ITEM>
        <ITEM><P><GENSP>Echinostelium coelocephalum</GENSP> T.E. Brooks &amp; H.W. Keller, MC</P></ITEM>
        <ITEM><P><GENSP>Echinostelium minutum</GENSP> de Bary, MC</P></ITEM>
      </SYNONYMY>
      Synonyms are different scientific names that pertain to the same taxon.

  19. Challenges posed by source DTDs
      • Decision Tree (Taxonomic Key)
        • Domain, Semantic, Structural, and Presentation
      <KEY>
        <COUPLET><DESCR><NO>1.</NO>Hypostomal setae (Hy) shorter than half the width of labrum</DESCR>
          <RESP><GENSP>Sycophila mellea</GENSP> (Curtis, 1831), <GENSP>Tetramesa </GENSP>Walker, 1848</RESP></COUPLET>
        <COUPLET><DESCR><NO></NO>--Hypostomal setae longer or about as long as half the width of labrum</DESCR>
          <RESP>2</RESP></COUPLET>
        <COUPLET><DESCR><NO>2.</NO>More than two dorsal setae (D) present on abdominal segments A6-8</DESCR>
          <RESP>3</RESP></COUPLET>
        <COUPLET><DESCR><NO></NO>--At least one of abdominal segments A6-8 with only two dorsal setae</DESCR>
          <RESP>4</RESP></COUPLET>
        <COUPLET><DESCR><NO>3.</NO>Mandibles bidentate</DESCR>
          <RESP><GENSP>E. (Ahtola) atra</GENSP> (Walker, 1832)</RESP></COUPLET>
        <COUPLET><DESCR><NO></NO>--Mandibles unidentate</DESCR>
          <RESP><GENSP>E. nodularis</GENSP> Boheman</RESP></COUPLET>
        <COUPLET><DESCR><NO>4.</NO>Mandibles bidentate</DESCR>
          <RESP><GENSP>Eurytoma appendigaster</GENSP> group</RESP></COUPLET>
        <COUPLET><DESCR><NO></NO>--Mandibles unidentate</DESCR>
          <RESP><GENSP>Eurytoma heriadi</GENSP> Zerova</RESP></COUPLET>
      </KEY>
      A decision tree is a tree-like model of decisions and their possible outcomes.

  20. Concluding Question
      How to support Publisher/Domain Specific constructs in the Archival DTD?
      • Continue use of Named-Content
      • New Miscellaneous Element
      • Support for adding namespaced elements
      • Other
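To make two of these options concrete, the sketch below serializes the same price value once as named-content and once as a namespaced extension element. It is purely illustrative: the namespace URI, the `pub` prefix, and both function names are hypothetical, and nothing in the slides says the archival DTD accepts foreign-namespace elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical extension namespace; not defined by the slides or by any tag set.
EXT_NS = "http://example.org/publisher-ext"
ET.register_namespace("pub", EXT_NS)


def as_named_content(content_type, text):
    """Option 1: generic element, with the semantics carried in an attribute value."""
    element = ET.Element("named-content", {"content-type": content_type})
    element.text = text
    return element


def as_namespaced(local_name, text):
    """Option 3: publisher-specific element in its own namespace."""
    element = ET.Element(f"{{{EXT_NS}}}{local_name}")
    element.text = text
    return element


print(ET.tostring(as_named_content("price", "$01.00"), encoding="unicode"))
print(ET.tostring(as_namespaced("price", "$01.00"), encoding="unicode"))
```

The trade-off the sketch exposes is where the vocabulary lives: with named-content it sits in free-text attribute values, while with namespaced elements it sits in element names that a publisher-supplied schema could in principle describe.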

  21. Questions/Answers?
      Thank you
      John Meyer
      Director of Data Technologies
      100 Campus Drive, Suite 100
      Princeton, NJ 08540
      609 986-2220
      john.meyer@ithaka.org
