
Faye Krawitz Jennifer McAndrews Richard O’Keeffe Content Technology Group, AIP

How Well Do You Know Your Data? Converting an Archive of Proprietary Markup Schemes to JATS: A Case Study. Faye Krawitz Jennifer McAndrews Richard O’Keeffe Content Technology Group, AIP.


Presentation Transcript


  1. How Well Do You Know Your Data? Converting an Archive of Proprietary Markup Schemes to JATS: A Case Study Faye Krawitz Jennifer McAndrews Richard O’Keeffe Content Technology Group, AIP

  2. Founded in 1931 • Umbrella organization for 10 physical science societies. Combined membership totals 165,500 scientists, engineers and educators (with some overlap) • One of the world's largest non-profit publishers of scientific information in physics. • Home of the Physics Resources Center • Publish 24+ AIP, member, partner journals/magazines, three of which are co-published with other organizations, and one conference proceedings series • Mission: To inspire every Physical and Applied Scientist in the world to turn to AIP for the information and help that they need AIP at a glance

  3. The AIP Content Collection • 800,000 SGML/XML records encoded in • AIP ISO 12083 “header” SGML DTD (1995-present) • AIP ISO 12083 “full-text” SGML DTD (1995-2005) • AIP “ISO-12083-informed” full-text XML DTD (2005-present) • How was it used? • XML the source for print/online PDFs • The source for HTML rendered on the AIP online platform • And it worked well…but the times they were a changing The AIP Content Ecosystem

  4. AIP-centric! • XML overly specialized for specific AIP products • Required proprietary systems and support • Too many intermediary data transformations • Limited the adoption of new technology and standards • Too costly to maintain • Not the XML format of choice for data recipients What’s the problem…Why change?

  5. Recognition that the intellectual property is the premium asset • Markup the data to maximize its value and enrichment potential • Keep current with industry standards • Better meet client expectations! • Plan for success • Streamlined production workflow • Reorganize units to execute a unified content strategy • Not enough to realize the need to change, but to follow through and execute Redefining AIP’s future content strategy: If you could have anything you want…

  6. Standardization 1: adopt industry standard XML • Eliminate multiple formats and associated transformations • Enhanced data portability • Standardization 2: adopt XML technologies such as XSLT and Schematron • Minimize dependence on specialized applications and skill sets • Speak the same language as the STM Community C’mon…everybody does it!

  7. Journal Archiving and Interchange Tag Set (Not so) Big Surprise! JATS XSLT Schematron

  8. Make the plan known • Keep everyone informed and updated • Get “buy-in” • Ensure the whole organization understands the change in approach • Ensure the whole organization understands the end goal • Ensure the staff understands the important role they play in the success Build for Success: Communication

  9. Organize to succeed • Rethink and deploy an organization that most effectively achieves the goal • For AIP this meant… • Create a unified team following the overall strategy • Foster a definitive sense of ownership for the content as the “intellectual asset” • Develop a clear chain of content responsibility • Designate formal content “gatekeepers” Build for Success: Ownership

  10. Invest in an up-to-date content management system • Efficiently manage content, not have the product(s) manage the systems • Avoid unneeded workflow duplication • Avoid unwanted “end-around” content manipulation • Extensibility to adapt to future needs • Excellent versioning capabilities • Effective reporting tools Build for Success: Infrastructure

  11. Transform Decisions • Use XSLT • Create “mapping specification” for the following: • Transform AIP ISO 12083 “header” SGML DTD • Transform AIP “ISO-12083-informed” full-text XML DTD • On hold: AIP ISO 12083 “full-text” SGML DTD • Test and adapt based on results • Quality Control including Schematron • Document • Train staff and production partners Now What?

  12. Document Analysis • Helpful aids • Existing documentation • Institutional memory • Devise tagging principles • Correct known ambiguities The Process

  13. Identify: • Consistencies • Inconsistencies • Surprises • Evaluate tagging requirements • Create • Document Map (or “specification”) • Sample XML files as needed Document Analysis

  14. Strictly delineated element v. attribute • Defined AIP-specific usage of JATS • Treated <article-meta> as database-like • Avoided customized content models; reserved for later use • Reserved <x> markup for future use; use at transform as debugging tool • Reserved <named-content> for semantic enrichment markup Devised Tagging Principles

  15. Tagging Principles × (Existing documentation + Institutional Memory) = JATS Creating the Document Map

  16. Resulting Map (“spec”)

  17. Before → After: <extra1> → <suffix> • <extra2> → <role> • <extra3> → <degree> Corrected Known Ambiguities
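The before/after renaming on this slide can be sketched as follows. This is a minimal illustration in Python's standard `xml.etree.ElementTree`, not the XSLT the team actually used. Only the three tag pairs come from the slide; the `<author>` wrapper, the `<surname>` element, and the sample values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Mapping taken from the slide; everything else in this sketch is illustrative.
RENAMES = {"extra1": "suffix", "extra2": "role", "extra3": "degree"}

def rename_ambiguous(root):
    """Rename generic source tags in place and return how many were changed."""
    changed = 0
    for elem in root.iter():
        if elem.tag in RENAMES:
            elem.tag = RENAMES[elem.tag]
            changed += 1
    return changed

# Hypothetical source fragment (assumed structure, not taken from the AIP DTDs).
source = ET.fromstring(
    "<author><surname>Smith</surname><extra1>Jr.</extra1>"
    "<extra2>Editor</extra2><extra3>PhD</extra3></author>"
)
count = rename_ambiguous(source)
result = ET.tostring(source, encoding="unicode")
print(count, result)
```

Keeping the mapping in one table means the same table can double as a row in the document map.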

  18. Generated text • Style variation issues • Multi-purpose tags • Multimedia • Time Expected Trouble Spots

  19. The ability to take a tag like <ack> and output the title “ACKNOWLEDGMENTS” is the closest thing we have to magic. Generated Text
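One possible way to deal with generated text during conversion is to write the implied heading into the output explicitly. The sketch below assumes that approach, which the slides do not confirm. Only the `<ack>` to "ACKNOWLEDGMENTS" pairing comes from the slide; the lookup table, the function name, and the sample paragraph are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup: only <ack> -> "ACKNOWLEDGMENTS" comes from the slide.
GENERATED_TITLES = {"ack": "ACKNOWLEDGMENTS"}

def add_generated_title(elem):
    """Insert a <title> child when the element has none and a title is known."""
    text = GENERATED_TITLES.get(elem.tag)
    if text is None or elem.find("title") is not None:
        return False
    title = ET.Element("title")
    title.text = text
    elem.insert(0, title)  # the title becomes the first child
    return True

# Hypothetical source fragment with no explicit heading.
ack = ET.fromstring("<ack><p>We thank our colleagues.</p></ack>")
added = add_generated_title(ack)
ack_xml = ET.tostring(ack, encoding="unicode")
print(added, ack_xml)
```

Running the function twice leaves the element unchanged the second time, so a rerun of the conversion would not duplicate headings.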

  20. INTRODUCTION INTRODUCTION I. INTRODUCTION Introduction Introduction Style Variation Issues
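Heading variants like the ones on this slide can be compared once the label and letter case are factored out. The sketch below is one assumed way to do that with a regular expression; the set of label styles it recognizes (Roman numerals, digits, single capital letters) is a guess, not something the slides specify.

```python
import re

# Leading labels such as "I.", "IV.", "2." or "A." (an assumed set of styles).
LABEL = re.compile(r"^\s*(?:[IVXLC]+|\d+|[A-Z])\.\s+")

def normalize_heading(raw):
    """Split off a leading label and fold case so variants compare equal."""
    match = LABEL.match(raw)
    label = match.group(0).strip() if match else ""
    title = raw[match.end():] if match else raw
    return label, " ".join(title.split()).casefold()

variants = ["INTRODUCTION", "I. INTRODUCTION", "Introduction"]
normalized = [normalize_heading(v) for v in variants]
print(normalized)
```

Typographic differences such as bold or italic would have to be read from the markup around the heading, which this text-only sketch does not see.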

  21. Three distinct rules for handling one SGML element, all within References: 1. When <othinfo> is sibling of <refitem>: a. <othinfo> → remove tag, retain PCDATA b. Retain content/punctuation and trailing space c. MOVE retained PCDATA to before </mixed-citation> of preceding <mixed-citation> 2. When back/citation/ref/othinfo: Strip <othinfo>, retain PCDATA 3. NOTE: nesting of <othinfo> requires: <citation id="r#"><ref><biother><othinfo>…<othinfo><dformula> → <ref><label>#. </label><note><p>….<disp-formula>… Multipurpose tags
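Rule 1 on this slide can be sketched roughly as below. This is a simplified Python illustration that works on source tag names only, whereas the slide's rule moves the text into the JATS `<mixed-citation>`. The `<citation>` parent, the `<jtitle>` element, and the sample reference text are hypothetical.

```python
import xml.etree.ElementTree as ET

def append_text(elem, text):
    """Append text at the very end of elem's content."""
    if len(elem):  # has children: the text goes after the last child
        last = elem[-1]
        last.tail = (last.tail or "") + text
    else:
        elem.text = (elem.text or "") + text

def fold_othinfo(parent):
    """Move the text of each sibling <othinfo> into the preceding <refitem>."""
    previous = None
    moved = 0
    for child in list(parent):
        if child.tag == "refitem":
            previous = child
        elif child.tag == "othinfo" and previous is not None:
            # Keep PCDATA, punctuation and trailing space exactly as found.
            append_text(previous, "".join(child.itertext()))
            # Text that followed <othinfo> must survive its removal.
            previous.tail = (previous.tail or "") + (child.tail or "")
            parent.remove(child)
            moved += 1
    return moved

# Hypothetical source fragment.
citation = ET.fromstring(
    "<citation><refitem>A. Author, <jtitle>J. Phys.</jtitle> 1, 2 (2000)"
    "</refitem><othinfo>; see also erratum. </othinfo></citation>"
)
moved = fold_othinfo(citation)
folded = ET.tostring(citation, encoding="unicode")
print(moved, folded)
```

Rules 2 and 3 depend on where `<othinfo>` sits in the tree, so a full implementation would need to test the element's ancestors before choosing which rule applies.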

  22. 1. <epaps>See supplementary material at <url href="http://dx.doi.org/10.1063/1.3475476">http://dx.doi.org/10.1063/1.3475476</url> <epapsid display="no" type="multimedia">E-JAPIAU-108-032016</epapsid> for essential multimedia.</epaps> 2. <media id="v1" status="essential"> <media-object doi="10.1063/1.3674301.1" file-name="006029jcpv1.mpg" id="mm1" mime-type="video/mpeg" mm-type="video" version="original"><mediaref rids="v1" show-link="yes"></mediaref> 3. <media id="v1" status="essential"> <media-object doi="10.1063/1.3674301.1" file-name="v1.mpg" id="mm1" mime-type="video/mpeg" mm-type="video" version="original"><mediaref rids="v1" show-link="yes"></mediaref> 4. <media id="v1" status="essential"> <media-object doi="10.1063/1.3674301.1" file-name="v1.mpg" id="mm1" mime-type="video/mpeg" mm-type="video" version="original"> <mediaref rids="v1" show-link="yes"></mediaref></media-object></media> <media id="v2" status="essential"> <media-object doi="10.1063/1.3674301.2" file-name="v2.mpg" id="mm2" mime-type="video/mpeg" mm-type="video" version="original"> <mediaref rids="v2" show-link="yes"></mediaref></media-object></media> <media id="v3" status="essential"> <media-object doi="10.1063/1.3674301.3" file-name="v3.mpg" id="mm3" mime-type="video/mpeg" mm-type="video" version="original"> <mediaref rids="v3" show-link="yes"></mediaref></media-object></media>. Multimedia

  23. Time

  24. Unexpected Trouble Spots: Language

  25. Deceptively simple example: • Before: pacs • After: front/spin/docanal/pacs Language

  26. Unexpected Trouble Spots: Nasty Surprises

  27. Expected tagging: <p content-type="leadpara">Weak signal detection possesses the potential application in many fields. By utilizing the sensitivity of the nonlinear system ...</p> Displays online as: Lead Paragraph Weak signal detection possesses the potential application in many fields. By utilizing the sensitivity of the nonlinear system Actual tagging: <p>Weak signal detection possesses the potential application in many fields. By utilizing the sensitivity of the nonlinear system ...</p> No online display Nasty Surprises

  28. Prerequisite training • Content and tagging checks • Incorporating Schematron • Online displays QUALITY CONTROL AND TESTING

  29. Staff Training • NLM/JATS DTD • XPath • XSLT • Schematron QUALITY CONTROL AND TESTING: Prerequisite

  30. Step 1 – Preliminary Testing: • Performed while XSLT was in progress • Analyst checked completed blocks of XSLT code and confirmed programmers’ understanding of instructions • Daily meetings held to discuss new findings or clarifications of instructions Trouble spot detected: specification document needed to be rewritten using XPath terminology. QUALITY CONTROL AND TESTING: Content and tagging checks

  31. Step 2 – Batch Processing • Performed when XSLT was complete. • Converted and parsed approximately 200 files • Investigated hidden problems and determined if an XSLT modification or manual fix was the best course of action to take

  32. Step 3 – Group Testing • Performed when converted files were valid • Ran approximately 200 files from various journals with assorted article types • Entire group checked same sample of files • Check for dropped text • Ran Schematron
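A check for dropped text, as listed in this step, could be automated along the following lines. This is a rough heuristic sketched in Python, not the group's actual procedure: it compares the words of a source file with the words of its converted counterpart and ignores markup. All element names and sample sentences are hypothetical.

```python
import xml.etree.ElementTree as ET

def words_of(xml_string):
    """All character data in document order, split into words."""
    root = ET.fromstring(xml_string)
    return " ".join(root.itertext()).split()

def dropped_words(source_xml, converted_xml):
    """Words present in the source that never appear in the converted file."""
    converted = set(words_of(converted_xml))
    return [w for w in words_of(source_xml) if w not in converted]

# Hypothetical source and two hypothetical conversions of it.
source = "<sec><st>Results</st><p>Signal rises <i>sharply</i> at 4 K.</p></sec>"
good = ("<sec><title>Results</title><p>Signal rises "
        "<italic>sharply</italic> at 4 K.</p></sec>")
bad = "<sec><title>Results</title><p>Signal rises at 4 K.</p></sec>"

clean_report = dropped_words(source, good)
loss_report = dropped_words(source, bad)
print(clean_report, loss_report)
```

Because it compares sets of words, the sketch misses a word that is dropped in one place but survives elsewhere, and it may report false positives when inline markup splits a word differently in the two files. It only narrows down which files a person should read.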

  33. Step 4 – Bulk Processing • Performed when all files were approved from the group testing • Entire corpus of content run, with remaining errors resulting from bad source outliers • XSLT transformed with an accuracy rate of over 99%, but with 800,000 records there was still a large number to be inspected • Where applicable, the source or XSLT was fixed and the files were rerun

  34. Step 5 – Final Cleanup – Analyze flagged data. Investigated tags mapped in the XSLT to <x> or <strike> because the source tags had known problems.
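Finding the flagged data for this cleanup step amounts to listing every `<x>` and `<strike>` element with enough context to locate it. The sketch below shows one assumed way to build such a list in Python. The two tag names come from the slide; the sample article and the report format are hypothetical.

```python
import xml.etree.ElementTree as ET

FLAG_TAGS = {"x", "strike"}  # tag names from the slide

def find_flagged(root):
    """Return (path, tag, text) for every flagged element, in document order."""
    hits = []

    def walk(elem, path):
        here = f"{path}/{elem.tag}"
        if elem.tag in FLAG_TAGS:
            hits.append((here, elem.tag, "".join(elem.itertext())))
        for child in elem:
            walk(child, here)

    walk(root, "")
    return hits

# Hypothetical converted fragment containing both kinds of flag.
article = ET.fromstring(
    "<article><body><p>Measured at <x>[unresolved unit]</x> "
    "with <strike>old label</strike> removed.</p></body></article>"
)
flagged = find_flagged(article)
for path, tag, text in flagged:
    print(path, repr(text))
```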

  35. Central piece in our QC process derived from our pre-existing proprietary QC programs • List of checks or assertions written in XPath language • Tracks ERRORS and WARNINGS specific to our data • Done in parallel while XSLT was being written QUALITY CONTROL AND TESTING: Incorporating Schematron

  36. JATS MARKUP with SCHEMATRON ERROR DETECTED <kwd-group kwd-group-type="pacs-codes"><compound-kwd><compound-kwd-part content-type="code">8440-x</compound-kwd-part> <compound-kwd-part content-type="value">Radiowave …</compound-kwd-part> <compound-kwd-part content-type="code">8440Ba</compound-kwd-part> <compound-kwd-part content-type="value">Antennas:. …</compound-kwd-part></compound-kwd> </kwd-group> JATS MARKUP CORRECTED <kwd-group kwd-group-type="pacs-codes"><compound-kwd><compound-kwd-part content-type="code">8440-x</compound-kwd-part> <compound-kwd-part content-type="value">Radiowave …</compound-kwd-part></compound-kwd> <compound-kwd> <compound-kwd-part content-type="code">8440Ba</compound-kwd-part><compound-kwd-part content-type="value">Antennas:. …</compound-kwd-part></compound-kwd> </kwd-group>

  37. SCHEMATRON RULE <rule id="ERROR_COMPOUND_KEYWORD" context="compound-kwd"><assert role="ERROR_COMPOUND_KEYWORD" test="count(compound-kwd-part) = 2">[ERROR] A compound-kwd must have two compound-kwd-part tags </assert></rule> <rule id="ERROR_COMPOUND_KEYWORD_PART" context="compound-kwd-part"><assert role="ERROR_COMPOUND_KEYWORD_PART" test="@content-type='code' or @content-type='value'">[ERROR] Invalid @content-type used for compound-kwd-part - allowable values are: code and value</assert> </rule>
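The logic of the two Schematron rules above can be restated in Python as a rough equivalent, useful for seeing what each assertion tests. This is an illustration only, not a Schematron processor. The rule identifiers and the markup are taken from the slides; the function name and the list-of-identifiers report format are assumptions.

```python
import xml.etree.ElementTree as ET

ALLOWED_TYPES = {"code", "value"}

def check_compound_keywords(root):
    """Return one rule identifier per failed assertion."""
    errors = []
    # Rule 1: every compound-kwd must hold exactly two compound-kwd-part children.
    for kwd in root.iter("compound-kwd"):
        if len(kwd.findall("compound-kwd-part")) != 2:
            errors.append("ERROR_COMPOUND_KEYWORD")
    # Rule 2: every compound-kwd-part must be typed as "code" or "value".
    for part in root.iter("compound-kwd-part"):
        if part.get("content-type") not in ALLOWED_TYPES:
            errors.append("ERROR_COMPOUND_KEYWORD_PART")
    return errors

PART = '<compound-kwd-part content-type="{}">{}</compound-kwd-part>'
pair_1 = PART.format("code", "8440-x") + PART.format("value", "Radiowave …")
pair_2 = PART.format("code", "8440Ba") + PART.format("value", "Antennas:. …")
OPEN = '<kwd-group kwd-group-type="pacs-codes">'

# Markup with the error: both code/value pairs share one compound-kwd.
detected = ET.fromstring(
    f"{OPEN}<compound-kwd>{pair_1}{pair_2}</compound-kwd></kwd-group>"
)
# Corrected markup: one compound-kwd per pair.
corrected = ET.fromstring(
    f"{OPEN}<compound-kwd>{pair_1}</compound-kwd>"
    f"<compound-kwd>{pair_2}</compound-kwd></kwd-group>"
)
detected_errors = check_compound_keywords(detected)
corrected_errors = check_compound_keywords(corrected)
print(detected_errors, corrected_errors)
```

The first sample fails only the count assertion, since all four parts carry an allowed `content-type`.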

  38. Assumptions at this point are: files are valid and Schematron runs clean • Testing was expanded to online publishing group and random testers throughout organization • Errors found at this point were those more apparent when viewing the online display • Great way to confirm that business rules are being followed QUALITY CONTROL AND TESTING: Online Displays

  39. Don’t go it alone: follow industry best practices and standards • Set yourself up for success • It is impossible to overstate the importance of document analysis • Use analysis as an opportunity to correct known ambiguities • Recognize difference between bad and incorrect data • Create a detailed document map • XPath training is valuable • Use Schematron as a central piece to QC process • Work as a team LESSONS LEARNED & GENERAL CONCLUSIONS

  40. We chose to use pre-existing JATS DTD elements and avoid any JATS module customization. The stock NISO JATS was more than sufficient to accommodate AIP’s tagging needs. We were able to apply our tagging principles and remain true to our business rules. We have achieved the XML quality we were aiming for.

  41. Questions?
