How Well Do You Know Your Data? Converting an Archive of Proprietary Markup Schemes to JATS: A Case Study
Faye Krawitz, Jennifer McAndrews, Richard O’Keeffe
Content Technology Group, AIP
AIP at a glance
• Founded in 1931
• Umbrella organization for 10 physical science societies; combined membership totals 165,500 scientists, engineers, and educators (with some overlap)
• One of the world's largest non-profit publishers of scientific information in physics
• Home of the Physics Resources Center
• Publishes 24+ AIP, member, and partner journals and magazines, three of which are co-published with other organizations, plus one conference proceedings series
• Mission: to inspire every physical and applied scientist in the world to turn to AIP for the information and help that they need
The AIP Content Ecosystem
The AIP Content Collection
• 800,000 SGML/XML records encoded in
  • AIP ISO 12083 “header” SGML DTD (1995-present)
  • AIP ISO 12083 “full-text” SGML DTD (1995-2005)
  • AIP “ISO-12083-informed” full-text XML DTD (2005-present)
• How was it used?
  • XML was the source for print/online PDFs
  • XML was the source for HTML rendered on the AIP online platform
• And it worked well… but the times they were a-changin'
What’s the problem… Why change?
AIP-centric!
• XML overly specialized for specific AIP products
• Required proprietary systems and support
• Too many intermediary data transformations
• Limited the adoption of new technology and standards
• Too costly to maintain
• Not the XML format of choice for data recipients
Redefining AIP’s future content strategy: If you could have anything you want…
• Recognize that the intellectual property is the premium asset
• Mark up the data to maximize its value and enrichment potential
• Keep current with industry standards
• Better meet client expectations!
• Plan for success
• Streamline the production workflow
• Reorganize units to execute a unified content strategy
• It is not enough to realize the need to change; you also have to follow through and execute
C’mon… everybody does it!
• Standardization 1: adopt industry-standard XML
  • Eliminate multiple formats and associated transformations
  • Enhanced data portability
• Standardization 2: adopt XML technologies such as XSLT and Schematron
  • Minimize dependence on specialized applications and skill sets
  • Speak the same language as the STM community
(Not so) Big Surprise!
• JATS: Journal Archiving and Interchange Tag Set
• XSLT
• Schematron
Build for Success: Communication
• Make the plan known
• Keep everyone informed and updated
• Get “buy-in”
• Ensure the whole organization understands the change in approach
• Ensure the whole organization understands the end goal
• Ensure the staff understands the important role they play in the success
Build for Success: Ownership
• Organize to succeed
• Rethink and deploy an organization that most effectively achieves the goal
• For AIP this meant…
  • Create a unified team following the overall strategy
  • Foster a definitive sense of ownership for the content as the “intellectual asset”
  • Develop a clear chain of content responsibility
  • Designate formal content “gatekeepers”
Build for Success: Infrastructure
• Invest in an up-to-date content management system
• Manage content efficiently, rather than letting the product(s) manage the systems
• Avoid unneeded workflow duplication
• Avoid unwanted “end-around” content manipulation
• Extensibility to adapt to future needs
• Excellent versioning capabilities
• Effective reporting tools
Now What?
Transform Decisions
• Use XSLT
• Create a “mapping specification” for the following:
  • Transform AIP ISO 12083 “header” SGML DTD
  • Transform AIP “ISO-12083-informed” full-text XML DTD
  • On hold: AIP ISO 12083 “full-text” SGML DTD
• Test and adapt based on results
• Quality control, including Schematron
• Document
• Train staff and production partners
The Process
• Document Analysis
• Helpful aids
  • Existing documentation
  • Institutional memory
• Devise tagging principles
• Correct known ambiguities
Document Analysis
• Identify:
  • Consistencies
  • Inconsistencies
  • Surprises
• Evaluate tagging requirements
• Create
  • Document Map (or “specification”)
  • Sample XML files as needed
Devised Tagging Principles
• Strictly delineated element v. attribute
• Defined AIP-specific usage of JATS
• Treated <article-meta> as database-like
• Avoided customized content models; reserved for later use
• Reserved <x> markup for future use; used at transform time as a debugging tool
• Reserved <named-content> for semantic enrichment markup
Creating the Document Map
Tagging Principles × (Existing documentation + Institutional Memory) = JATS
Corrected Known Ambiguities
Before      After
<extra1>    <suffix>
<extra2>    <role>
<extra3>    <degree>
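The Before/After mapping above amounts to a tag rename. A minimal sketch in Python follows, assuming the source record can be parsed as XML; the deck's actual transform was written in XSLT. The `<author>` wrapper, the sample values, and the function name `rename_ambiguous_tags` are hypothetical and used only for illustration.

```python
import xml.etree.ElementTree as ET

# Mapping taken from the Before/After table; everything else here is illustrative.
TAG_MAP = {"extra1": "suffix", "extra2": "role", "extra3": "degree"}


def rename_ambiguous_tags(root):
    """Rename, in place, every element whose tag appears in TAG_MAP."""
    for elem in root.iter():
        if elem.tag in TAG_MAP:
            elem.tag = TAG_MAP[elem.tag]
    return root


# Hypothetical source fragment: the <author> wrapper and the values are invented.
source = (
    "<author><extra1>Jr.</extra1><extra2>Editor</extra2>"
    "<extra3>PhD</extra3></author>"
)
converted = rename_ambiguous_tags(ET.fromstring(source))
print(ET.tostring(converted, encoding="unicode"))
```

Keeping the mapping in one table makes it easy to review against the document map before any code is written.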
Expected Trouble Spots
• Generated text
• Style variation issues
• Multi-purpose tags
• Multimedia
• Time
Generated Text
The ability to take a tag like <ack> and output the title “ACKNOWLEDGMENTS” is the closest thing we have to magic.
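Generated text of this kind has to become explicit markup once the rendering system no longer supplies it. The sketch below is one possible reading, not the authors' XSLT: it inserts a `<title>` child into each `<ack>` that lacks one. Only the `<ack>` to “ACKNOWLEDGMENTS” pairing comes from the slide; the lookup table, the function name, and the sample paragraph are assumptions.

```python
import xml.etree.ElementTree as ET

# Only the <ack> entry comes from the slide; a real table would hold more tags.
GENERATED_TITLES = {"ack": "ACKNOWLEDGMENTS"}


def add_generated_titles(root):
    """Insert an explicit <title> into elements whose title used to be generated."""
    # Materialize the iterator first because the loop modifies the tree.
    for elem in list(root.iter()):
        title_text = GENERATED_TITLES.get(elem.tag)
        if title_text is not None and elem.find("title") is None:
            title = ET.Element("title")
            title.text = title_text
            elem.insert(0, title)
    return root


# Hypothetical input: the paragraph text is invented.
ack = ET.fromstring("<ack><p>We thank the referees.</p></ack>")
add_generated_titles(ack)
print(ET.tostring(ack, encoding="unicode"))
```

The guard on an existing `<title>` keeps the function safe to run twice on the same file.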
Style Variation Issues
• INTRODUCTION
• INTRODUCTION
• I. INTRODUCTION
• Introduction
• Introduction
Multipurpose tags
Three distinct rules for handling one SGML element, all within References:
1. When <othinfo> is a sibling of <refitem>:
   a. Remove the <othinfo> tag, retain PCDATA
   b. Retain content/punctuation and trailing space
   c. MOVE retained PCDATA to before </mixed-citation> of preceding <mixed-citation>
2. When back/citation/ref/othinfo: strip <othinfo>, retain PCDATA
3. NOTE: nesting of <othinfo> requires:
   <citation id="r#"><ref><biother><othinfo>…<othinfo><dformula>
   <ref><label>#. </label><note><p>….<disp-formula>…
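Rule 1 above can be sketched as a small tree operation. The example assumes an intermediate tree in which `<refitem>` has already become `<mixed-citation>`, and it flattens any markup nested inside `<othinfo>` to plain text, which is a simplification. The function names, the `<ref>` wrapper, and the citation text are hypothetical; the deck's own implementation was XSLT.

```python
import xml.etree.ElementTree as ET


def append_text(elem, text):
    """Append text at the very end of elem's content, after any child elements."""
    if len(elem):
        last = elem[-1]
        last.tail = (last.tail or "") + text
    else:
        elem.text = (elem.text or "") + text


def fold_othinfo(parent):
    """Move each <othinfo>'s text to just before the preceding </mixed-citation>."""
    previous = None
    for child in list(parent):
        if child.tag == "mixed-citation":
            previous = child
        elif child.tag == "othinfo" and previous is not None:
            # Keep content, punctuation, and trailing space exactly as found.
            append_text(previous, "".join(child.itertext()))
            # Preserve whatever followed the removed element.
            previous.tail = (previous.tail or "") + (child.tail or "")
            parent.remove(child)
    return parent


# Hypothetical reference: the citation text is invented.
ref = ET.fromstring(
    "<ref><mixed-citation>A. Author, <source>J. Test</source> 1, 2 (2000)."
    "</mixed-citation><othinfo> See also the erratum. </othinfo></ref>"
)
fold_othinfo(ref)
print(ET.tostring(ref, encoding="unicode"))
```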
Multimedia
1. <epaps>See supplementary material at <url href="http://dx.doi.org/10.1063/1.3475476">http://dx.doi.org/10.1063/1.3475476</url> <epapsid display="no" type="multimedia">E-JAPIAU-108-032016</epapsid> for essential multimedia.</epaps>
2. <media id="v1" status="essential"> <media-object doi="10.1063/1.3674301.1" file-name="006029jcpv1.mpg" id="mm1" mime-type="video/mpeg" mm-type="video" version="original"><mediaref rids="v1" show-link="yes"></mediaref>
3. <media id="v1" status="essential"> <media-object doi="10.1063/1.3674301.1" file-name="v1.mpg" id="mm1" mime-type="video/mpeg" mm-type="video" version="original"><mediaref rids="v1" show-link="yes"></mediaref>
4. <media id="v1" status="essential"> <media-object doi="10.1063/1.3674301.1" file-name="v1.mpg" id="mm1" mime-type="video/mpeg" mm-type="video" version="original"> <mediaref rids="v1" show-link="yes"></mediaref></media-object></media>
   <media id="v2" status="essential"> <media-object doi="10.1063/1.3674301.2" file-name="v2.mpg" id="mm2" mime-type="video/mpeg" mm-type="video" version="original"> <mediaref rids="v2" show-link="yes"></mediaref></media-object></media>
   <media id="v3" status="essential"> <media-object doi="10.1063/1.3674301.3" file-name="v3.mpg" id="mm3" mime-type="video/mpeg" mm-type="video" version="original"> <mediaref rids="v3" show-link="yes"></mediaref></media-object></media>
Language
Deceptively simple example:
• Before: pacs
• After: front/spin/docanal/pacs
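The difference between naming a tag and naming its full path can be shown with a short sketch. The path `front/spin/docanal/pacs` is from the slide; the `<article>` root, the second `<pacs>` inside `<body>`, and the code values are invented here purely to show how a bare tag name can match more than the intended element.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: only the front/spin/docanal/pacs path is from the slide.
doc = ET.fromstring(
    "<article>"
    "<front><spin><docanal><pacs>8440Ba</pacs></docanal></spin></front>"
    "<body><p>Classified under <pacs>8440-x</pacs>.</p></body>"
    "</article>"
)

# "pacs" alone: every element with that name, wherever it sits.
by_name = list(doc.iter("pacs"))

# "front/spin/docanal/pacs": only the element in the intended context.
by_path = doc.findall("./front/spin/docanal/pacs")

print(len(by_name), [e.text for e in by_name])
print(len(by_path), [e.text for e in by_path])
```

A specification written in path terms leaves the programmer no room to pick the wrong context.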
Nasty Surprises
Expected tagging:
<p content-type="leadpara">Weak signal detection possesses the potential application in many fields. By utilizing the sensitivity of the nonlinear system ...</p>
Displays online as:
Lead Paragraph
Weak signal detection possesses the potential application in many fields. By utilizing the sensitivity of the nonlinear system
Actual tagging:
<p>Weak signal detection possesses the potential application in many fields. By utilizing the sensitivity of the nonlinear system ...</p>
No online display
QUALITY CONTROL AND TESTING
• Prerequisite training
• Content and tagging checks
• Incorporating Schematron
• Online displays
QUALITY CONTROL AND TESTING: Prerequisite
Staff Training
• NLM/JATS DTD
• XPath
• XSLT
• Schematron
QUALITY CONTROL AND TESTING: Content and tagging checks
Step 1: Preliminary Testing
• Performed while the XSLT was in progress
• Analyst checked completed blocks of XSLT code and confirmed the programmers' understanding of the instructions
• Daily meetings held to discuss new findings or clarifications of instructions
Trouble spot detected: the specification document needed to be rewritten using XPath terminology.
Step 2: Batch Processing
• Performed when the XSLT was complete
• Converted and parsed approximately 200 files
• Investigated hidden problems and determined whether an XSLT modification or a manual fix was the best course of action
Step 3: Group Testing
• Performed when converted files were valid
• Ran approximately 200 files from various journals with assorted article types
• Entire group checked the same sample of files
• Checked for dropped text
• Ran Schematron
Step 4: Bulk Processing
• Performed when all files from the group testing were approved
• Ran the entire corpus of content; remaining errors resulted from bad source outliers
• The XSLT achieved an accuracy rate of over 99%, but with 800,000 records that still left a large number of files to inspect (1% of 800,000 is 8,000)
• Where applicable, the source or the XSLT was fixed and the files were rerun
Step 5: Final Cleanup
• Analyzed flagged data
• Investigated tags mapped in the XSLT to <x> or <strike> because the source tags had known problems
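Two of the checks in the steps above lend themselves to a short sketch: the dropped-text check from Step 3 and the review of `<x>` and `<strike>` flags from Step 5. The code below is an assumption about how such checks could be automated, not a description of AIP's tooling. It assumes both files parse as XML, compares text word by word (so a word split differently by inline markup would be reported), and uses invented sample records and function names.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Tags the slide names as markers for source markup with known problems.
FLAG_TAGS = ("x", "strike")


def words(xml_text):
    """Return the document's text content as a list of whitespace-separated words."""
    root = ET.fromstring(xml_text)
    return " ".join(root.itertext()).split()


def dropped_words(source_xml, converted_xml):
    """Words present in the source but missing after conversion.

    Text added by the transform (generated titles, for example) is ignored,
    because only source-minus-converted is computed.
    """
    missing = Counter(words(source_xml)) - Counter(words(converted_xml))
    return sorted(missing.elements())


def flagged_counts(converted_xml):
    """Count flag elements left in a converted file for manual review."""
    root = ET.fromstring(converted_xml)
    return {tag: sum(1 for _ in root.iter(tag)) for tag in FLAG_TAGS}


# Hypothetical records: both fragments are invented for illustration.
source = "<article><ack><p>We thank the referees.</p></ack></article>"
converted = (
    "<article><ack><title>ACKNOWLEDGMENTS</title>"
    "<p>We thank the <x>??</x></p></ack></article>"
)
print(dropped_words(source, converted))
print(flagged_counts(converted))
```

Across 800,000 records, a report like this narrows manual inspection to the files that actually lost text or still carry flags.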
QUALITY CONTROL AND TESTING: Incorporating Schematron
• Central piece in our QC process, derived from our pre-existing proprietary QC programs
• List of checks or assertions written in the XPath language
• Tracks ERRORS and WARNINGS specific to our data
• Done in parallel while the XSLT was being written
JATS MARKUP with SCHEMATRON ERROR DETECTED
<kwd-group kwd-group-type="pacs-codes">
  <compound-kwd>
    <compound-kwd-part content-type="code">8440-x</compound-kwd-part>
    <compound-kwd-part content-type="value">Radiowave …</compound-kwd-part>
    <compound-kwd-part content-type="code">8440Ba</compound-kwd-part>
    <compound-kwd-part content-type="value">Antennas:. …</compound-kwd-part>
  </compound-kwd>
</kwd-group>
JATS MARKUP CORRECTED
<kwd-group kwd-group-type="pacs-codes">
  <compound-kwd>
    <compound-kwd-part content-type="code">8440-x</compound-kwd-part>
    <compound-kwd-part content-type="value">Radiowave …</compound-kwd-part>
  </compound-kwd>
  <compound-kwd>
    <compound-kwd-part content-type="code">8440Ba</compound-kwd-part>
    <compound-kwd-part content-type="value">Antennas:. …</compound-kwd-part>
  </compound-kwd>
</kwd-group>
SCHEMATRON RULE
<rule id="ERROR_COMPOUND_KEYWORD" context="compound-kwd">
  <assert role="ERROR_COMPOUND_KEYWORD" test="count(compound-kwd-part) = 2">[ERROR] A compound-kwd must have two compound-kwd-part tags</assert>
</rule>
<rule id="ERROR_COMPOUND_KEYWORD_PART" context="compound-kwd-part">
  <assert role="ERROR_COMPOUND_KEYWORD_PART" test="@content-type='code' or @content-type='value'">[ERROR] Invalid @content-type used for compound-kwd-part - allowable values are: code and value</assert>
</rule>
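To make the logic of the two assertions concrete, here is a rough Python equivalent applied to the flagged and corrected markup shown earlier. It is a sketch for illustration only: the deck's checks ran as Schematron, and the function name `check_compound_keywords` and the returned list of role names are assumptions.

```python
import xml.etree.ElementTree as ET

ALLOWED_PART_TYPES = {"code", "value"}

FLAGGED = """<kwd-group kwd-group-type="pacs-codes"><compound-kwd>
<compound-kwd-part content-type="code">8440-x</compound-kwd-part>
<compound-kwd-part content-type="value">Radiowave …</compound-kwd-part>
<compound-kwd-part content-type="code">8440Ba</compound-kwd-part>
<compound-kwd-part content-type="value">Antennas:. …</compound-kwd-part>
</compound-kwd></kwd-group>"""

CORRECTED = """<kwd-group kwd-group-type="pacs-codes"><compound-kwd>
<compound-kwd-part content-type="code">8440-x</compound-kwd-part>
<compound-kwd-part content-type="value">Radiowave …</compound-kwd-part>
</compound-kwd><compound-kwd>
<compound-kwd-part content-type="code">8440Ba</compound-kwd-part>
<compound-kwd-part content-type="value">Antennas:. …</compound-kwd-part>
</compound-kwd></kwd-group>"""


def check_compound_keywords(root):
    """Return the role names of the assertions that fail, in document order per rule."""
    errors = []
    # Mirrors test="count(compound-kwd-part) = 2" in context compound-kwd.
    for kwd in root.iter("compound-kwd"):
        if len(kwd.findall("compound-kwd-part")) != 2:
            errors.append("ERROR_COMPOUND_KEYWORD")
    # Mirrors test="@content-type='code' or @content-type='value'".
    for part in root.iter("compound-kwd-part"):
        if part.get("content-type") not in ALLOWED_PART_TYPES:
            errors.append("ERROR_COMPOUND_KEYWORD_PART")
    return errors


print(check_compound_keywords(ET.fromstring(FLAGGED)))
print(check_compound_keywords(ET.fromstring(CORRECTED)))
```

The flagged sample fails the first rule because four parts share one `<compound-kwd>`; the corrected sample passes both rules.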
QUALITY CONTROL AND TESTING: Online Displays
• Assumptions at this point: files are valid and Schematron runs clean
• Testing was expanded to the online publishing group and random testers throughout the organization
• Errors were found at this stage that are more apparent when viewing the rendered content
• Great way to confirm that business rules are being followed
LESSONS LEARNED & GENERAL CONCLUSIONS
• Don’t go it alone: follow industry best practices and standards
• Set yourself up for success
• It is impossible to overstate the importance of document analysis
• Use analysis as an opportunity to correct known ambiguities
• Recognize the difference between bad and incorrect data
• Create a detailed document map
• XPath training is valuable
• Use Schematron as a central piece of the QC process
• Work as a team
We chose to use pre-existing JATS DTD elements and avoid any JATS module customization. The stock NISO JATS was more than sufficient to accommodate AIP’s tagging needs. We were able to apply our tagging principles and remain true to our business rules. We have achieved the XML quality we were aiming for.