Portico: A Case Study in the Migration of Proprietary Formats to the JATS Archiving Format
Sheila Morrissey, John Meyer, Sushil Bhattarai, Sachin Kurdikar, Jie Ling, Matthew Stoeffler, Umadevi Thanneeru

Portico & JSTOR: Committed to Preserving the Scholarly Record
Ithaka helps the academic community use digital technologies to preserve the scholarly record and to advance research and teaching in sustainable ways.
[Diagram: ITHAKA. Labels: Digital Preservation; Digitization for Preservation & Access; “Light Archive”; “Dark Archive”]
JATS-CON 2010

Portico Archive
• Portico’s objective is to help libraries make a secure and reliable transition from print to reliance on e-content.
• Maintains archiving agreements with publishers to collect and preserve content.
• Receives content directly from publishers.
• Preserves:
  • Current journals (born digital)
  • Back file journals (reborn digital)
  • E-books
  • Digitized historical collections

An “Insurance Policy” for e-Content
• Provide libraries with access to archived content when it becomes lost, orphaned or abandoned (regardless of a library’s past or current subscriptions):
  • Publisher ceases operation
  • Publisher discontinues title
  • Publisher drops back file
• Provide libraries with post-cancellation access, if the publisher specifically names Portico
  • About 90% of titles in the Archive are covered by Portico post-cancellation access rights.
• Libraries are asked to pay an annual Archive support payment to defray the cost of preservation: the “insurance premium”

Portico Archive as of July 19, 2010
• 114 publisher participants
• 11,788 committed journal titles
• 43,253 committed e-books
• 13 committed digitized collections
• >14 million articles ingested
• 688 library participants (48% outside US)
• 4 trigger events
• 15 post-cancellation access claims

Portico Preservation Infrastructure
• Publisher supplies XML source file (including the text and images) and PDF page rendition.
• Best approach for preserving the intellectual content of the article or book.
• Authenticate: verify that preserved content is what it purports to be.
• Verify format: ensure the file meets the syntactic and semantic rules of the format specification.
• Repair
• Normalize (XML)
• Create preservation metadata
• Assess archival robustness of file format.
• Migrate files to ensure future usability of content.
• Replicate objects and metadata to protect against bit rot and media deterioration.
• Render articles to meet viewing requirements of delivery platform.

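The "verify format" and "replicate" steps above can be illustrated with a small sketch. This is not Portico's tooling: the function names `is_well_formed` and `fixity_digest` are hypothetical, the check covers only XML well-formedness (the syntactic half of format verification; validation against a DTD would need a validating parser), and SHA-256 is simply one common choice of fixity digest.

```python
import hashlib
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as well-formed XML (syntax only, no DTD validation)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def fixity_digest(data: bytes) -> str:
    """Return a SHA-256 hex digest that can be stored and re-checked later to detect bit rot."""
    return hashlib.sha256(data).hexdigest()


good = "<article><title>Sample</title></article>"
bad = "<article><title>Sample</article>"  # mismatched end tag

print(is_well_formed(good))
print(is_well_formed(bad))

stored = fixity_digest(good.encode("utf-8"))
# A later audit recomputes the digest on each replica and compares it with the stored value.
print(stored == fixity_digest(good.encode("utf-8")))
```

A digest mismatch on a replica would signal that the copy needs repair from another replica.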
Key Challenges for an Archival DTD
In December 2001, Inera’s “E-Journal Archive DTD Feasibility Study” highlighted these key challenges for an archival DTD:
• Use of generated and boilerplate text, especially in
  • Label text for figure captions
  • Citation text
  • Author name and affiliation
  • Dates
• Expression of links between author and affiliation
• Reference elements
• Expression of non-article and other content
• Abbreviations and definitions

Key Challenges for an Archival DTD (continued)
• Keywords
• Sections, including handling of sections without headers
• Placement of floating objects, such as figures, tables, graphs
• Tables, including cell formatting issues (cells with figures, content alignment, etc.)
• Math
• Intra-, inter- and extra-article linking
• Publisher-specific elements
When reviewing the minutes of the Working Group and the evolution of the DTD, we can confirm that these areas have been the main focus of discussion.

Some Design Constraints
• IMPLIED, not REQUIRED attributes
• CDATA instead of controlled list
• Optional Elements, or relaxed order of elements
• Surprising location of Elements
• No Domain Specific Elements

Publisher/Domain Specific Elements
• Custom-Meta
  • Business Data
  • Allowed in journal-meta, article-meta, front-stub
  • Name/Value pair (may contain 38 different Elements)
• Named-Content
  • Semantic Significance
  • Allowed in 112 Elements
  • May contain 59 different Elements

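Because content-type is a free-text (CDATA) attribute rather than a controlled list, the DTD cannot say which Named-Content values are in use; an archive would have to inventory them itself. The sketch below shows one way that might look. The function name `content_type_inventory` and the sample document are hypothetical, assembled from values that appear elsewhere in these slides.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def content_type_inventory(xml_text: str) -> Counter:
    """Count the content-type values found on named-content elements in one document."""
    root = ET.fromstring(xml_text)
    return Counter(
        # The default is defensive only; it marks elements lacking the attribute.
        nc.get("content-type", "(missing)")
        for nc in root.iter("named-content")
    )


sample = (
    "<article><body><p>"
    '<named-content content-type="funding">F49550-05-1-0025</named-content> and '
    '<named-content content-type="funding">DMS-0204243</named-content>; '
    '<named-content content-type="price">$01.00</named-content>'
    "</p></body></article>"
)

inventory = content_type_inventory(sample)
for value, count in sorted(inventory.items()):
    print(value, count)
```

Run across a whole archive, such a tally would show which publisher-specific semantics are being carried in Named-Content.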
Challenges posed by source DTDs
Extended Semantics for Named-Content
• Price in Citation
• Becomes <named-content content-type="price">

<citation reference="1" id="R1" type="serial">
  <author order="1">
    <name><first>S. P.</first><last>Morgan</last></name>
  </author>
  <journal>
    <sertitle>J. Appl. Phys.</sertitle>
    <URI type="ISSN">0030-3941</URI><price>$01.00</price>
    <volume>29</volume>
    <pages><first>1358</first><last>1368</last></pages>
    <pubdate>1958</pubdate>
  </journal>
  <title>General solution of the Luneburg lens problem</title>
</citation>

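The price mapping above amounts to renaming a source element and moving its meaning into the content-type attribute. A minimal sketch, assuming a Python transform with the standard library (the helper `to_named_content` is hypothetical, and only the price element is converted here; a real migration would map every other source element as well):

```python
import xml.etree.ElementTree as ET


def to_named_content(elem: ET.Element, content_type: str) -> None:
    """Rename an element in place to named-content, recording its old role in content-type."""
    elem.tag = "named-content"
    elem.attrib.clear()
    elem.set("content-type", content_type)


source = (
    "<journal><sertitle>J. Appl. Phys.</sertitle>"
    '<URI type="ISSN">0030-3941</URI><price>$01.00</price>'
    "<volume>29</volume></journal>"
)
root = ET.fromstring(source)

# Collect matches first so the rename does not interfere with iteration.
for price in list(root.iter("price")):
    to_named_content(price, "price")

result = ET.tostring(root, encoding="unicode")
print(result)
```

The element's text and position are untouched; only its name and attributes change, which is why the semantics survive solely in the attribute value.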
Challenges posed by source DTDs
More Extended Semantics for Named-Content
• Affiliation in Footnotes/P
• Becomes <named-content content-type="aff" id="AFF2">

<FOOTNOTE ID="N101" TYPE="AFF">
  <P ALPHABET="LATIN" TYPE="INDENT"><AFF ID="AFF2"><IT>Corresponding author address:</IT> Nicholas M. J. Hall, Dept. of Atmospheric and Oceanic Sciences, McGill University, 805 Sherbrooke St. W., Montreal PQ H3A 2K6, Canada.</AFF></P>
</FOOTNOTE>

Challenges posed by source DTDs
More Extended Semantics for Named-Content
• Funding in Acknowledgments/P
• Becomes <named-content content-type="funding">

<ack>
  <sectitle>ACKNOWLEDGMENTS</sectitle>
  <p>Q.W.’s research is partially supported by AFOSR Grant No. <funding source="USAFOSR"><contract>F49550-05-1-0025</contract></funding> and NSF Grants No. <funding source="NSF"><contract>DMS-0204243</contract></funding>, No. <funding source="NSF"><contract>DMS-0605029</contract></funding>, and No. <funding source="NSF"><contract>DMS-0626180</contract></funding>. P.Z. is partially supported by the special funds for major State Research Projects <funding source="UNSPECIFIED"><contract>2005CB321704</contract></funding> and National Science Foundation of China for Distinguished Young Scholars <funding source="NSFC"><contract>10225103</contract></funding>. H.Z.’s work is supported in part by the Naval Postgraduate School Research Initiation Program.</p>
</ack>

Challenges posed by source DTDs
More Extended Semantics for Named-Content
• Organization Division in Affiliation
• Becomes <named-content content-type="division">

<Affiliation ID="Aff12">
  <OrgDivision>Optisches Institut</OrgDivision>
  <OrgName>Technische Universität Berlin</OrgName>
  <OrgAddress>
    <City>Berlin</City>
    <Country>Germany</Country>
  </OrgAddress>
</Affiliation>

Challenges posed by source DTDs
More Extended Semantics for Named-Content
• Generic Element (addinfo)
• Becomes <named-content content-type="addinfo">

<ref-conf id="CIT0045"><ref-conf-text>
  <author-ref-text><surname>Bishop</surname> <givenname>CJ</givenname></author-ref-text>,
  <author-ref-text><surname>Aanenses</surname> <givenname>DM</givenname></author-ref-text>,
  <author-ref-text><surname>Jordan</surname> <givenname>GE</givenname></author-ref-text>,
  <author-ref-text><surname>Kilian</surname> <givenname>M</givenname></author-ref-text>,
  <author-ref-text><surname>Hanage</surname> <givenname>WP</givenname></author-ref-text>,
  <author-ref-text><surname>Spratt</surname> <givenname>BG.</givenname></author-ref-text>
  <presentationtitle>Electronic taxonomy: assigning strains to bacterial species via the internet</presentationtitle>.
  <collectworktitle>BMC Biology</collectworktitle>
  <publicationfield-text><year>2009</year>; <year>7</year></publicationfield-text>: <firstpage>3</firstpage>.
  <addinfo>doi:10.1186/1741-7007-7-3</addinfo>.
</ref-conf-text></ref-conf>

Challenges posed by source DTDs
Target DTD Structural Constraints that force the use of Named-Content
• Table in Table
  • TD contains named-content, which contains a table
    <td><named-content content-type="table"><table-wrap>
• Figure in Table
  • TD contains named-content, which contains a fig
    <td><named-content content-type="figure"><fig>
• Display-Formula in Title
  • Title contains named-content, which contains a disp-formula
    <title><named-content content-type="display-formula"><disp-formula>

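The structural workaround above, where a table cell holds a named-content wrapper that in turn holds the nested table, can be sketched as a small tree operation. The helper `wrap_in_named_content` is hypothetical, and the sample cell is invented for illustration.

```python
import xml.etree.ElementTree as ET


def wrap_in_named_content(parent: ET.Element, child: ET.Element, content_type: str) -> ET.Element:
    """Replace child under parent with a named-content wrapper that contains child."""
    index = list(parent).index(child)
    wrapper = ET.Element("named-content", {"content-type": content_type})
    # Text that followed the child now follows the wrapper instead.
    wrapper.tail = child.tail
    child.tail = None
    parent.remove(child)
    wrapper.append(child)
    parent.insert(index, wrapper)
    return wrapper


cell = ET.fromstring(
    "<td><table-wrap><table><tbody><tr><td>inner</td></tr></tbody></table></table-wrap></td>"
)
# findall returns a list, so the tree can be modified safely inside the loop.
for nested in cell.findall("table-wrap"):
    wrap_in_named_content(cell, nested, "table")

result = ET.tostring(cell, encoding="unicode")
print(result)
```

Here named-content carries no semantics of its own; it serves only as a container the target DTD permits at that position.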
Challenges posed by source DTDs
• Question/Answer
  • Generic and Structural
  • Is saying <list list-content="question"> enough?

<Question-Answer>
  <Q><P><L>1</L>. The major advantage of amniotic membrane transplantation in pterygium surgery is</P></Q>
  <A><P><L>A</L>. reduction in surgical time</P></A>
  <A><P><L>B</L>. preservation of conjunctiva</P></A>
  <A><P><L>C</L>. better cosmetic outcomes compared with conjunctival autografting</P></A>
  <A><P><L>D</L>. lowest recurrence rate among the surgical techniques</P></A>
</Question-Answer>

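To make the question concrete, here is a sketch of what the list option might produce. The target shape (a question list-item with a nested list of answers) and the list-content value "answer" are assumptions for illustration, not a mapping the slides prescribe; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def item_from(entry: ET.Element) -> ET.Element:
    """Build a list-item from a source Q or A whose P starts with an L label."""
    label_src = entry.find("P").find("L")
    item = ET.Element("list-item")
    ET.SubElement(item, "label").text = label_src.text
    # The text after </L> begins with ". "; strip that separator.
    ET.SubElement(item, "p").text = (label_src.tail or "").lstrip(". ")
    return item


def question_answer_to_list(qa: ET.Element) -> ET.Element:
    """Map Question-Answer to a question list whose item holds a nested answer list."""
    outer = ET.Element("list", {"list-content": "question"})
    question = item_from(qa.find("Q"))
    answers = ET.SubElement(question, "list", {"list-content": "answer"})
    for answer in qa.findall("A"):
        answers.append(item_from(answer))
    outer.append(question)
    return outer


source = (
    "<Question-Answer>"
    "<Q><P><L>1</L>. The major advantage of amniotic membrane transplantation "
    "in pterygium surgery is</P></Q>"
    "<A><P><L>A</L>. reduction in surgical time</P></A>"
    "<A><P><L>B</L>. preservation of conjunctiva</P></A>"
    "</Question-Answer>"
)
converted = question_answer_to_list(ET.fromstring(source))
print(ET.tostring(converted, encoding="unicode"))
```

The output keeps the text and the labels, but the question/answer relationship now lives only in attribute values and nesting, which is the concern the slide raises.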
Challenges posed by source DTDs
• Synonymy
  • Domain and Semantic
  • Is saying <list list-content="synonymy"> enough?
  • Or <named-content content-type="synonymy"> because of the semantic meaning?

<SYNONYMY>
  <HEAD>ECHINOSTELIALES</HEAD>
  <ITEM><P><GENSP>Clastoderma debaryanum</GENSP> A. Blytt</P></ITEM>
  <ITEM><P><GENSP>Echinostelium apitectum</GENSP> K.D. Whitney, MC</P></ITEM>
  <ITEM><P><GENSP>Echinostelium coelocephalum</GENSP> T.E. Brooks &amp; H.W. Keller, MC</P></ITEM>
  <ITEM><P><GENSP>Echinostelium minutum</GENSP> de Bary, MC</P></ITEM>
</SYNONYMY>

Synonyms are different scientific names that pertain to the same taxon.

Challenges posed by source DTDs
• Decision Tree (Taxonomic Key)
  • Domain, Semantic, Structural, and Presentation

<KEY>
  <COUPLET><DESCR><NO>1.</NO>Hypostomal setae (Hy) shorter than half the width of labrum</DESCR>
    <RESP><GENSP>Sycophila mellea</GENSP> (Curtis, 1831), <GENSP>Tetramesa</GENSP> Walker, 1848</RESP></COUPLET>
  <COUPLET><DESCR><NO></NO>--Hypostomal setae longer or about as long as half the width of labrum</DESCR>
    <RESP>2</RESP></COUPLET>
  <COUPLET><DESCR><NO>2.</NO>More than two dorsal setae (D) present on abdominal segments A6-8</DESCR>
    <RESP>3</RESP></COUPLET>
  <COUPLET><DESCR><NO></NO>--At least one of abdominal segments A6-8 with only two dorsal setae</DESCR>
    <RESP>4</RESP></COUPLET>
  <COUPLET><DESCR><NO>3.</NO>Mandibles bidentate</DESCR>
    <RESP><GENSP>E. (Ahtola) atra</GENSP> (Walker, 1832)</RESP></COUPLET>
  <COUPLET><DESCR><NO></NO>--Mandibles unidentate</DESCR>
    <RESP><GENSP>E. nodularis</GENSP> Boheman</RESP></COUPLET>
  <COUPLET><DESCR><NO>4.</NO>Mandibles bidentate</DESCR>
    <RESP><GENSP>Eurytoma appendigaster</GENSP> group</RESP></COUPLET>
  <COUPLET><DESCR><NO></NO>--Mandibles unidentate</DESCR>
    <RESP><GENSP>Eurytoma heriadi</GENSP> Zerova</RESP></COUPLET>
</KEY>

A taxonomic key is a tree-like model of decisions and their possible outcomes.

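The tree structure implied by the key can be made explicit with a short sketch that reads couplets 2 and 3 from the sample above into a step table. The function `parse_key` and the dictionary layout are hypothetical; the sketch assumes, as in the sample, that an empty NO marks the alternative lead of the preceding numbered step and that a purely numeric RESP points to another step.

```python
import xml.etree.ElementTree as ET


def parse_key(key: ET.Element) -> dict:
    """Group couplets by step number; each lead maps to a next step (int) or a taxon (str)."""
    steps = {}
    current = None
    for couplet in key.findall("COUPLET"):
        number = couplet.find("DESCR/NO")
        if number.text:  # an empty NO means "same step, alternative lead"
            current = int(number.text.rstrip("."))
        lead = (number.tail or "").lstrip("-").strip()
        outcome = "".join(couplet.find("RESP").itertext()).strip()
        target = int(outcome) if outcome.isdigit() else outcome
        steps.setdefault(current, []).append((lead, target))
    return steps


source = (
    "<KEY>"
    "<COUPLET><DESCR><NO>2.</NO>More than two dorsal setae (D) present on abdominal "
    "segments A6-8</DESCR><RESP>3</RESP></COUPLET>"
    "<COUPLET><DESCR><NO></NO>--At least one of abdominal segments A6-8 with only two "
    "dorsal setae</DESCR><RESP>4</RESP></COUPLET>"
    "<COUPLET><DESCR><NO>3.</NO>Mandibles bidentate</DESCR>"
    "<RESP><GENSP>E. (Ahtola) atra</GENSP> (Walker, 1832)</RESP></COUPLET>"
    "<COUPLET><DESCR><NO></NO>--Mandibles unidentate</DESCR>"
    "<RESP><GENSP>E. nodularis</GENSP> Boheman</RESP></COUPLET>"
    "</KEY>"
)
steps = parse_key(ET.fromstring(source))
for number, leads in steps.items():
    print(number, leads)
```

The step numbers, the branching, and the taxon outcomes are all recoverable from the source markup; a flat generic target element would have to preserve each of them to keep the key usable.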
Concluding Question
How to support Publisher/Domain Specific constructs in the Archival DTD?
• Continue use of Named-Content
• New Miscellaneous Element
• Support for adding namespaced elements
• Other

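For comparison, the first and third options can be set side by side for a single value. This is only an illustration of what the markup might look like: the namespace URI and the `pub` prefix are hypothetical, and how a DTD-based tag set would admit such elements is exactly the open question posed above.

```python
import xml.etree.ElementTree as ET

PUB_NS = "http://example.org/publisher-extensions"  # hypothetical namespace URI
ET.register_namespace("pub", PUB_NS)


def as_named_content(value: str) -> ET.Element:
    """Option 1: generic element, semantics carried in the content-type attribute."""
    elem = ET.Element("named-content", {"content-type": "price"})
    elem.text = value
    return elem


def as_namespaced(value: str) -> ET.Element:
    """Option 3: publisher-specific element kept under its own namespace."""
    elem = ET.Element(f"{{{PUB_NS}}}price")
    elem.text = value
    return elem


generic = ET.tostring(as_named_content("$01.00"), encoding="unicode")
namespaced = ET.tostring(as_namespaced("$01.00"), encoding="unicode")
print(generic)
print(namespaced)
```

The namespaced form keeps the publisher's own element name, while the Named-Content form stays within the existing element set at the cost of moving the meaning into an uncontrolled attribute value.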
Questions/Answers?
Thank you
John Meyer
Director of Data Technologies
100 Campus Drive, Suite 100
Princeton, NJ 08540
609 986-2220
john.meyer@ithaka.org