Reality Check: What to Expect from Automated Conversion to NLM XML
Devorah Bloom, June 25, 2008
About DCL • Established in 1981 • Pioneer in defining and developing the emerging data conversion industry (founding member of SGML User Group) • Expertise in complex conversion projects including eBooks, TechDocs, Defense, and Libraries • Substantial experience in managing multiple vendors for large-scale projects, with automated tracking and reporting of data throughout • A sophisticated quality control workflow with both automated and human quality control steps to guarantee accuracy • Extensive experience with all key DTDs, including NLM, DITA, S1000D, ATA, DocBook and TEI • Industries served include Technical Documentation, Publishing, Technical Societies, Aerospace, Government, Defense, Health Sciences, and Libraries & Universities • Wrote the data conversion chapters in The XML Handbook and The Columbia Guide to Digital Publishing • Publishes DCLNews, a monthly newsletter devoted to SGML/XML and electronic publishing topics, with a subscriber base of 7,000
Agenda • Overview of problems with automated conversion • Automated conversion: specific examples • Conversion process • Summary • Q & A
Pitfalls of Automated Conversion: Inconsistencies Inconsistencies of legacy content: • Source materials. Examples: paper, PDF, SGML • Variety of style guides. Examples: reference callouts as superscripted numbers or as numbers in parentheses or square brackets; punctuation used between author initials; full or abbreviated labels used for figures and tables
Pitfalls of Automated Conversion: Tagging Automated Tagging Problems: • Multiple ways to tag the same text. Example: <glossary><title>Abbreviations</title><def-list>…</glossary> vs. <glossary><def-list><title>Abbreviations</title>…</glossary> • Visual tagging vs. content tagging. Example: <p><bold>Abstract</bold></p> vs. <abstract><title>Abstract</title>…</abstract> • Text extraction tools. Examples: Jade, Gemini
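The visual-vs-content tagging example above can be sketched in Python. This is only an illustration, not DCL's software: the function names, the `LABEL_TO_ELEMENT` mapping, and the rule that every following `<p>` belongs to the promoted element are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from a bold label to the content element it signals.
LABEL_TO_ELEMENT = {"abstract": "abstract"}

def bold_only_label(elem):
    """Return the label if elem is <p><bold>Label</bold></p> and nothing else."""
    if elem.tag != "p" or len(elem) != 1 or (elem.text or "").strip():
        return None
    bold = elem[0]
    if bold.tag != "bold" or len(bold) or (bold.tail or "").strip():
        return None
    return (bold.text or "").strip() or None

def promote_visual_headings(xml_text):
    """Replace visually tagged headings with content tags (simplified rule)."""
    root = ET.fromstring(xml_text)
    new_children, current = [], None
    for child in list(root):
        label = bold_only_label(child)
        if label is not None and label.lower() in LABEL_TO_ELEMENT:
            # Start a content element and give it a proper <title>.
            current = ET.Element(LABEL_TO_ELEMENT[label.lower()])
            ET.SubElement(current, "title").text = label
            new_children.append(current)
        elif current is not None and child.tag == "p":
            # Assumption: paragraphs after the heading belong to it.
            current.append(child)
        else:
            current = None
            new_children.append(child)
    root[:] = new_children
    return ET.tostring(root, encoding="unicode")

print(promote_visual_headings(
    "<body><p><bold>Abstract</bold></p><p>Text.</p></body>"))
```

A real conversion would need a rule for where the abstract ends; here every following paragraph is absorbed, which is exactly the kind of decision pre-analysis has to settle.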
Text Extraction: Different Tools Adobe
Text Extraction: Different Tools Jade
Outcome Criteria The overall outcome measure of ‘‘optimal potential di3erence’’ was satisfied when current density and electric fields were at their highest attainable value within predetermined measures. Electric fields were restricted between 1 and 10 V/cm to prevent joule heating e3ects,53 while current densities were limited to 1.8 mA/cm2 to prevent localized tissue necrosis. The current density threshold was preset to 1.8 mA/cm2 to adhere to International Electrotechnical Commissions regulations that 2 mA/cm2 should not be exceeded in electrical devices designed for the general population.35 Maintaining a value below the standard of practice was also important in providing a factor of safety since fluctuations may occur in vivo due to variations in ion concentrations, temperature and hydration; variables which were not accounted for with this finite element model.
Text Extraction: Different Tools Gemini
Outcome Criteria The overall outcome measure of ‘‘optimal potential difference’’ was satisfied when current density and electric fields were at their highest attainable value within predetermined measures. Electric fields were restricted between 1 and 10 V/cm to prevent joule heating effects,53 while current densities were limited to1.8 mA/cm2 to prevent localized tissue necrosis. The current density threshold was preset to 1.8 mA/cm2 to adhere to International Electrotechnical Commissions regulations that 2 mA/cm2 should not be exceeded in electrical devices designed for the general population.35
Text Extraction: More Examples
Gemini extracted as: correcting grammar and mastering rhetorical conventions. Beginning in the late s, with the rejection of institutional authority characterizing much social movement of that decade, the focus of writing instruction
Jade extracted as: Specifically, languages that allow sonority falls (e.g., lba, s=-2) tend to allow sonority plateaus (e.g., bda, s=0); languages that allow sonority plateaus tend to allow small sonority rises (e.g., bna, s=1); and
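Ligature damage of the kind seen in the Jade sample ("di3erence", "e3ects") can be flagged automatically before proofreading. The following is a minimal sketch, assuming that a digit sandwiched between letters is suspicious; the function name and the regular expression are illustrative choices, and legitimate tokens such as chemical formulas would also be flagged for a human to dismiss.

```python
import re

# A run of letters, one digit, then more letters: "di3erence", "e3ects".
# Units such as "cm2" are not matched because no letter follows the digit.
SUSPECT_TOKEN = re.compile(r"\b[A-Za-z]+\d[A-Za-z]+\b")

def find_suspect_tokens(text):
    """Return tokens that look like a ligature was replaced by a digit."""
    return SUSPECT_TOKEN.findall(text)

sample = ("optimal potential di3erence was satisfied; joule heating "
          "e3ects,53 while current densities were limited to 1.8 mA/cm2")
print(find_suspect_tokens(sample))
```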
DCL Conversion Process: 1. Pre-Analysis 2. Zoning 3. OCR/Text Extraction 4. Proofreading 5. Styling/Pre-Editing 6. Conversion 7. Parsing 8. Editorial Review 9. Quality Control
Pre-Analysis: Choosing a DTD NLM (JATS) – DTD of Choice • Clearly laid out • Well documented • Amply robust • Public Domain
Pre-Analysis: Limiting the Tag Usage Create a subset DTD OR Create rules via a conversion specification
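When tag usage is limited by a conversion specification rather than a subset DTD, the rule can still be checked mechanically. Below is a minimal sketch, assuming a hypothetical allow-list `ALLOWED_TAGS`; the tag names in it are chosen only for the example and do not represent a real specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical allow-list taken from a conversion specification.
ALLOWED_TAGS = {"article", "front", "body", "sec", "title", "p", "bold", "italic"}

def disallowed_tags(xml_text, allowed=ALLOWED_TAGS):
    """Return the sorted set of element names that the spec does not permit."""
    root = ET.fromstring(xml_text)
    return sorted({el.tag for el in root.iter() if el.tag not in allowed})

doc = ("<article><body><sec><title>T</title>"
       "<p>x <underline>y</underline></p></sec></body></article>")
print(disallowed_tags(doc))
```

A subset DTD enforces the same restriction at parse time; a check like this one simply reports violations without changing the DTD.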
Pre-Analysis: Choosing the Content to Convert Which content will be auto-generated? • TOC • Index • Labels • Titles
Pre-Analysis: Output Goals How true to source should my output be? • Emphasis in titles • Punctuation and connective text in citations
Pre-Analysis: Capturing as Multiple Formats • Math as images and MathML • Tables as images and XHTML
<disp-formula id="FD1">
  <mml:math id="M1" display='block'>
    <mml:semantics>
      <mml:mrow>
        <mml:mi>L</mml:mi>
        <mml:mo>=</mml:mo>
        <mml:mo>∑</mml:mo>
        <mml:mrow>
          <mml:msub>
            <mml:mrow><mml:mi>l</mml:mi></mml:mrow>
            <mml:mi>i</mml:mi>
          </mml:msub>
          <mml:mo>/</mml:mo>
          <mml:mi>N</mml:mi>
        </mml:mrow>
        <mml:mo>.</mml:mo>
      </mml:mrow>
    </mml:semantics>
  </mml:math>
</disp-formula>
Pre-Analysis: Determining Data Elements • Appearance based: alignment, placement, point size, font • Content based: <email> - @; <uri> - www; <degrees> - PhD, MD, BA; <fig> - Figure, Illustration, Chart, Scheme
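The content-based cues listed above translate naturally into pattern rules. This is a minimal sketch that uses only the cues named on the slide; the function name `guess_element` and the exact patterns are assumptions, and a production rule set would be considerably more cautious.

```python
import re

DEGREES = {"PhD", "MD", "BA"}  # cues from the slide, not a complete list
FIG_LABEL = re.compile(r"^(Figure|Illustration|Chart|Scheme)\b")
EMAIL = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")

def guess_element(text):
    """Guess a target element from textual content alone, or return None."""
    text = text.strip()
    if EMAIL.match(text):
        return "email"
    if text.startswith(("www.", "http://", "https://")):
        return "uri"
    if text.rstrip(".,") in DEGREES:
        return "degrees"
    if FIG_LABEL.match(text):
        return "fig"
    return None

for candidate in ["dbloom@dclab.com", "www.dclab.com", "PhD", "Figure 2. Outflow"]:
    print(candidate, "->", guess_element(candidate))
```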
Pre-Analysis: Target XML • Define Usage • When a <fn> will be tagged in <fn-group> or in <author-notes> • List labels inside <list-item> or as an attribute on the <list> • Define Placement • Figure/Table after callout or in its own section • <aff> in each <contrib> or all grouped together afterwards • <bio> in <contrib> or in <back>
Zoning • Identifies what will be converted (omitting headers, footers, and page numbers) • Identifies each structure as a text box, image, or table • Defines reading order for multi-column text
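Defining reading order for multi-column text can be illustrated with a small sort. This is a sketch under strong assumptions: zones are reduced to a top-left coordinate, `y` grows downward from the top of the page, columns are of equal width, and the `Zone` class and `reading_order` function are hypothetical names, not part of any zoning tool.

```python
from dataclasses import dataclass

@dataclass
class Zone:
    label: str
    x: float  # left edge, in points
    y: float  # top edge, measured downward from the top of the page

def reading_order(zones, page_width, columns=2):
    """Order zones column by column, top to bottom within each column."""
    col_width = page_width / columns

    def key(zone):
        column = min(int(zone.x // col_width), columns - 1)
        return (column, zone.y)

    return [zone.label for zone in sorted(zones, key=key)]

zones = [
    Zone("right-top", 320, 50),
    Zone("left-bottom", 40, 400),
    Zone("left-top", 40, 50),
    Zone("right-bottom", 320, 400),
]
print(reading_order(zones, page_width=612))
```

Full-width structures such as titles and spanning tables break this simple rule, which is one reason zoning still needs a person in the loop.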
OCR/Text Extraction Pitfalls of Text Extraction • Special Characters • Emphasis • Ligatures • Hyphens – Soft and Hard
Proofreading: Hyphen Check
**** Hyphenation Conflict ****
>fibro-sis< >out-flow< >post-treatment< >veno-venous< >web-like<
**** Uncertain Status ****
>Budd-Chiari< >Cazals-Hatem< >PTFE-covered< >clinico-pathologic< >ePTFE-coated<
**** Likely Okay ****
>-year-old< >Gray-scale< >Large-volume< >May-June< >US-guided< >balloon-expandable<
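A report with the three groups shown above could be produced along the following lines. The slides do not state the actual criteria, so the rules here are assumptions: a "conflict" is a hyphenated word whose joined form also appears in the text, "likely okay" means every part is in a supplied word list, and everything else is "uncertain".

```python
import re

def classify_hyphenated(text, known_words):
    """Sort hyphenated tokens into conflict / uncertain / likely_ok buckets."""
    tokens = re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text)
    plain = {t.lower() for t in tokens if "-" not in t}
    report = {"conflict": [], "uncertain": [], "likely_ok": []}
    for token in sorted({t for t in tokens if "-" in t}):
        parts = token.lower().split("-")
        if "".join(parts) in plain:
            # The same word occurs unhyphenated elsewhere in the text.
            report["conflict"].append(token)
        elif all(part in known_words for part in parts):
            report["likely_ok"].append(token)
        else:
            report["uncertain"].append(token)
    return report

text = ("Hepatic fibro-sis and fibrosis were seen in Budd-Chiari "
        "patients after US-guided biopsy.")
print(classify_hyphenated(text, known_words={"us", "guided"}))
```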
Styling/Pre-editing • Is a paragraph with a label a list? • Is a bolded paragraph a section title? • Are numbers in parentheses reference callouts or just plain text?
Styling/Pre-editing: Pre-edit Tags • Multiple citations in a single reference • Split Tables • Untitled graphics • Data to be moved
Styling/Pre-editing: Reference Macro • Are the references Harvard or Numeric? • Is the author name last/first or first/last? • What is the placement of the year within the citation? • Is a comma or period used after the author names?
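Two of the questions above (Harvard or numeric, and where the year sits) can be answered with simple heuristics. This is a rough sketch, not the reference macro itself: the function names, the patterns, and the "first half of the string" test are assumptions, and the sample references use an invented author.

```python
import re

# A leading label such as "1.", "1)", "[1]" or "(1)" suggests numeric style.
NUMERIC_LABEL = re.compile(r"^\s*(?:\[\d+\]|\(\d+\)|\d+[.)])\s+")
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")

def reference_style(refs):
    """Return 'numeric' if most references start with a number label."""
    numbered = sum(1 for ref in refs if NUMERIC_LABEL.match(ref))
    return "numeric" if numbered > len(refs) / 2 else "harvard"

def year_position(ref):
    """Return 'early' or 'late' by where the first year appears, else None."""
    match = YEAR.search(ref)
    if match is None:
        return None
    return "early" if match.start() < len(ref) / 2 else "late"

# Invented sample data for illustration only.
numeric_ref = "1. Smith J. The study of Examples. Journal of Examples. 2011;123-124."
harvard_ref = "Smith J. (2011). The study of Examples. Journal of Examples, 123-124."
print(reference_style([numeric_ref]), year_position(numeric_ref))
print(reference_style([harvard_ref]), year_position(harvard_ref))
```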
Conversion Automated conversion software This is the “lights out” part of the process.
Parsing
<ref id="b1">
  <citation citation-type="journal">
    <person-group>
      <query>Centers for Disease Control and</query>
      <name><surname>Prevention</surname></name>
    </person-group>
    <article-title>The study of Examples</article-title>
    <source>Journal of Examples</source>
    <year>2011</year>
    <fpage>123</fpage>
    <lpage>124</lpage>
  </citation>
</ref>
Editorial Review: Visual QA • Math • Tables: alignment, spanning, borders • Images: sizing, placement
Quality Control Sample error log with generic checks
Quality Control Sample error log with custom checks
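A custom check goes beyond what a parser can catch: the citation on the Parsing slide is valid against the DTD, yet an organization name has been split into a leftover `<query>` and a bogus surname. The sketch below shows one way such a check might look; the function name and the two rules are assumptions, not DCL's actual checks.

```python
import xml.etree.ElementTree as ET

def check_citation(ref_xml):
    """Return human-readable findings for suspicious author markup."""
    root = ET.fromstring(ref_xml)
    problems = []
    for group in root.iter("person-group"):
        if group.find("query") is not None:
            problems.append("unresolved <query> in <person-group>")
        for name in group.findall("name"):
            if name.find("given-names") is None:
                surname = name.findtext("surname") or ""
                problems.append("<name> without <given-names>: " + surname)
    return problems

sample_ref = (
    '<ref id="b1"><citation citation-type="journal"><person-group>'
    "<query>Centers for Disease Control and</query>"
    "<name><surname>Prevention</surname></name></person-group>"
    "<article-title>The study of Examples</article-title>"
    "<source>Journal of Examples</source><year>2011</year>"
    "<fpage>123</fpage><lpage>124</lpage></citation></ref>"
)
for finding in check_citation(sample_ref):
    print(finding)
```

Both findings point at the same root cause, so a reviewer sees quickly that the author group needs to be retagged as an organization rather than a person.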
What We’ve Learned • Pre-analysis is crucial to successful conversion • Not all text extraction tools are created equal • Valid XML is not necessarily correct XML • A fully lights-out process is not possible for a high-quality conversion • Post-conversion effort will be necessary but can be drastically reduced
“Give me six hours to chop down a tree and I will spend the first four sharpening the axe.” - Abraham Lincoln
Questions... & Answers • Devorah Bloom • Project Manager • dbloom@dclab.com • 718-307-5720
Data Conversion Laboratory, 61-18 190th St., 2nd Floor, Fresh Meadows, NY 11365 Telephone: (718) 357-8700 Fax: (718) 357-8776 Web: http://www.dclab.com