
June 25, 2008

Reality Check: What to Expect from Automated Conversion to NLM XML. Devorah Bloom, June 25, 2008.



Presentation Transcript


  1. Reality Check: What to expect from automated conversion to NLM XML • Devorah Bloom • June 25, 2008

  2. About DCL • Established in 1981 • Pioneer in defining and developing the emerging data conversion industry (founding member of SGML User Group) • Expertise in complex conversion projects including eBooks, TechDocs, Defense, and Libraries • Substantial experience in managing multiple vendors for large-scale projects, with automated tracking and reporting of data throughout • A sophisticated quality control workflow with both automated and human quality control steps to guarantee accuracy • Extensive experience with all key DTDs, including NLM, DITA, S1000D, ATA, DocBook and TEI • Industries served include Technical Documentation, Publishing, Technical Societies, Aerospace, Government, Defense, Health Sciences, and Libraries & Universities • Wrote the data conversion chapters in The XML Handbook and The Columbia Guide to Digital Publishing • Publishes DCLNews, a monthly newsletter devoted to SGML/XML and electronic publishing topics, with a subscriber base of 7,000

  3. Agenda • Overview of problems with automated conversion • Automated conversion: specific examples • Conversion process • Summary • Q & A

  4. Pitfalls of Automated Conversion: Inconsistencies Inconsistencies of legacy content: • Source materials • Examples: paper, PDF, SGML • Variety of style guides • Examples: Reference callouts as superscripted numbers or numbers in parentheses or square brackets • Punctuation used between author initials • Full or abbreviated labels used for Figures and Tables

  5. Pitfalls of Automated Conversion: Tagging Automated Tagging Problems: • Multiple ways to tag the same text • Example: • <glossary><title>Abbreviations</title><def-list>…</glossary> • vs. • <glossary><def-list><title>Abbreviations</title>…</glossary> • Visual Tagging vs. Content Tagging • Example: • <p><bold>Abstract</bold></p> • vs. • <abstract><title>Abstract</title>…</abstract> • Text extraction tools • Examples: • Jade, Gemini
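
The visual-versus-content distinction on this slide can be sketched in a few lines of Python. This is a minimal illustration only, not DCL's tooling: the helper name `promote_abstract` and the sample input are hypothetical, and it assumes the label paragraph contains nothing but the bold heading and that the single paragraph after it is the abstract text.

```python
import xml.etree.ElementTree as ET

def promote_abstract(body):
    """Turn <p><bold>Abstract</bold></p> plus the next <p> into <abstract>.

    Hypothetical helper: assumes the label paragraph holds only the bold
    heading and that the single following <p> is the abstract text.
    """
    children = list(body)
    for i, p in enumerate(children[:-1]):
        bold = p.find("bold")
        is_label = (
            p.tag == "p"
            and len(p) == 1
            and bold is not None
            and (bold.text or "").strip().lower() == "abstract"
            and not (p.text or "").strip()
            and not (bold.tail or "").strip()
        )
        if not is_label:
            continue
        # Build the content-tagged structure and swap it in for the two paragraphs.
        abstract = ET.Element("abstract")
        title = ET.SubElement(abstract, "title")
        title.text = bold.text.strip()
        following = children[i + 1]
        body.remove(p)
        body.remove(following)
        abstract.append(following)
        body.insert(i, abstract)
        break
    return body

visual = ET.fromstring(
    "<body><p><bold>Abstract</bold></p><p>We study examples.</p></body>"
)
content_xml = ET.tostring(promote_abstract(visual), encoding="unicode")
print(content_xml)
```

A rule this narrow is deliberate: a bold paragraph is only promoted when its text matches a known heading, since bold alone does not say what the content is.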

  6. Text Extraction Pitfalls: Source PDF

  7. Text Extraction: Different Tools Adobe

  8. Text Extraction: Different Tools Jade Outcome Criteria The overall outcome measure of ‘‘optimal potential di3erence’’ was satisfied when current density and electric fields were at their highest attainable value within predetermined measures. Electric fields were restricted between 1 and 10 V/cm to prevent joule heating e3ects,53 while current densities were limited to 1.8 mA/cm2 to prevent localized tissue necrosis. The current density threshold was preset to 1.8 mA/cm2 to adhere to International Electrotechnical Commissions regulations that 2 mA/cm2 should not be exceeded in electrical devices designed for the general population.35 Maintaining a value below the standard of practice was also important in providing a factor of safety since fluctuations may occur in vivo due to variations in ion concentrations, temperature and hydration; variables which were not accounted for with this finite element model.

  9. Text Extraction: Different Tools Gemini Outcome Criteria The overall outcome measure of ‘‘optimal potential difference’’ was satisfied when current density and electric fields were at their highest attainable value within predetermined measures. Electric fields were restricted between 1 and 10 V/cm to prevent joule heating effects,53 while current densities were limited to1.8 mA/cm2 to prevent localized tissue necrosis. The current density threshold was preset to 1.8 mA/cm2 to adhere to International Electrotechnical Commissions regulations that 2 mA/cm2 should not be exceeded in electrical devices designed for the general population.35

  10. Text extraction: More Examples Gemini Extracted as: correcting grammar and mastering rhetorical conventions. Beginning in the late s, with the rejection of institutional authority characterizing much social movement of that decade, the focus of writing instruction Jade Extracted as: Specifically, languages that allow sonority falls (e.g., lba, s=-2) tend to allow sonority plateaus (e.g., bda, s=0); languages that allow sonority plateaus tend to allow small sonority rises (e.g., bna, s=1); and

  11. DCL Conversion Process: 1. Pre-Analysis 2. Zoning 3. OCR/Text Extraction 4. Proofreading 5. Styling/Pre-Editing 6. Conversion 7. Parsing 8. Editorial Review 9. Quality Control

  12. Pre-Analysis: Choosing a DTD NLM (JATS) – DTD of Choice • Clearly laid out • Well documented • Amply robust • Public Domain

  13. Pre-Analysis: Limiting the Tag Usage Create a subset DTD OR Create rules via a conversion specification

  14. Pre-Analysis: Conversion Specification Document

  15. Pre-Analysis: Choosing the Content to Convert Which content will be auto-generated? • TOC • Index • Labels • Titles

  16. Pre-Analysis: Page Layout Concept

  17. Pre-Analysis: Output Goals How true to source should my output be? • Emphasis in titles • Punctuation and connective text in citations

  18. Pre-Analysis: Capturing as Multiple Formats <disp-formula id="FD1"> <mml:math id="M1" display='block'> <mml:semantics> <mml:mrow> <mml:mi>L</mml:mi> <mml:mo>&#x0003D;</mml:mo> <mml:mo>&#x02211;</mml:mo> <mml:mrow> <mml:msub> <mml:mrow> <mml:mi>l</mml:mi></mml:mrow> <mml:mi>i</mml:mi></mml:msub> <mml:mo>&#x0002F;</mml:mo> <mml:mi>N</mml:mi></mml:mrow> <mml:mo>&#x0002E;</mml:mo></mml:mrow> </mml:semantics></mml:math> </disp-formula> • Math as images and MathML • Tables as images and XHTML
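
For readers who do not parse MathML by eye, the markup on this slide reads as the following display formula, transcribed here into LaTeX. The transcription follows the MathML literally, including the trailing period and the absence of summation limits.

```latex
\[
  L = \sum l_i / N .
\]
```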

  19. Pre-Analysis: Determining Data Elements Content Based: • <email> - @ • <uri> - www • <degrees> - PhD, MD, BA • <fig> - Figure, Illustration, Chart, Scheme Appearance Based: • Alignment • Placement • Point size • Font
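
The content-based cues listed on this slide can be expressed as simple patterns. The sketch below is an assumption-laden illustration, not the rule set DCL uses: the `CUES` table and `guess_tags` function are hypothetical names, and the patterns cover only the examples the slide gives.

```python
import re

# Hypothetical cue patterns mirroring the slide's content-based examples.
CUES = [
    ("email", re.compile(r"\S+@\S+")),
    ("uri", re.compile(r"\bwww\.\S+")),
    ("degrees", re.compile(r"\b(?:PhD|MD|BA)\b")),
    ("fig", re.compile(r"^(?:Figure|Illustration|Chart|Scheme)\b")),
]

def guess_tags(text):
    """Return the candidate element names whose cue appears in text."""
    return [name for name, pattern in CUES if pattern.search(text)]

print(guess_tags("Jane Doe, MD, PhD"))           # ['degrees']
print(guess_tags("Contact: jdoe@example.org"))   # ['email']
print(guess_tags("Figure 2. Current density."))  # ['fig']
```

Cues like these only propose a tag. Appearance-based signals (alignment, placement, point size, font) would still be needed to separate, say, a figure caption from a sentence that happens to begin with "Figure".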

  20. Pre-Analysis: Target XML • Define Usage • When a <fn> will be tagged in <fn-group> or in <author-notes> • List labels inside <list-item> or as an attribute on the <list> • Define Placement • Figure/Table after callout or in its own section • <aff> in each <contrib> or all grouped together afterwards • <bio> in <contrib> or in <back>

  21. Conversion Specification Document

  22. Zoning • Identifies what will be converted (omitting header, footers and page numbers) • Identifies each structure as a text box, image or table • Defines reading order for multi-column text
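
Defining reading order for multi-column text amounts to sorting zones by column first and vertical position second. The following is a rough sketch under simplifying assumptions: the `Zone` class and `reading_order` function are hypothetical, columns are assumed to be equal in width, and zones that span the full page (titles, wide tables) are not handled.

```python
from dataclasses import dataclass

@dataclass
class Zone:
    label: str
    x: float  # left edge of the zone on the page
    y: float  # top edge of the zone, increasing downwards

def reading_order(zones, page_width, columns=2):
    """Sort zones column by column, top to bottom within each column."""
    col_width = page_width / columns

    def key(zone):
        # Clamp so a zone starting at the right margin stays in the last column.
        column = min(int(zone.x // col_width), columns - 1)
        return (column, zone.y)

    return sorted(zones, key=key)

zones = [
    Zone("right-top", 320, 50),
    Zone("left-bottom", 40, 400),
    Zone("left-top", 40, 50),
    Zone("right-bottom", 320, 400),
]
order = [z.label for z in reading_order(zones, page_width=600)]
print(order)
```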

  23. OCR/Text Extraction: Pitfalls of Text Extraction • Special Characters • Emphasis • Ligatures • Hyphens – Soft and Hard
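
Two of these pitfalls, ligatures and soft hyphens, can be partly handled in code when the extraction tool emits the proper Unicode code points. The sketch below uses Python's standard `unicodedata` module; the function name `normalize_extracted` and the sample string are illustrative assumptions.

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"

def normalize_extracted(text):
    """Expand ligature code points and drop soft hyphens.

    NFKC compatibility normalization maps U+FB00..U+FB06 to plain letters.
    It also folds other compatibility characters (e.g. superscript digits),
    which may or may not be wanted in a real conversion.
    """
    return unicodedata.normalize("NFKC", text.replace(SOFT_HYPHEN, ""))

raw = "joule heating e\ufb00ects and a poten\u00adtial di\ufb00erence"
clean = normalize_extracted(raw)
print(clean)
```

This does not help when a tool substitutes a wrong character for the ligature, as in the "di3erence" output shown on slide 8. Those cases still need a comparison against the source or a proofreading pass.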

  24. Proofreading: Comparison Tool

  25. Proofreading: Tool for Image OCR

  26. Proofreading: Hyphen Check **** Hyphenation Conflict **** >fibro-sis< >out-flow< >post-treatment< >veno-venous< >web-like< **** Uncertain Status **** >Budd-Chiari< >Cazals-Hatem >PTFE-covered< >clinico-pathologic< >ePTFE-coated< **** Likely Okay **** >-year-old< >Gray-scale< >Large-volume< >May-June< >US-guided< >balloon-expandable
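
A report with these three buckets could be produced by comparing each hyphenated word against the rest of the document. The slide does not state the rules DCL applies, so the ones below are assumptions made for illustration: a word is a conflict if its closed-up form also occurs, likely okay if the hyphenated form occurs at least twice, and uncertain otherwise. The function name `hyphen_report` is hypothetical.

```python
import re
from collections import Counter

def hyphen_report(text):
    """Bucket hyphenated words using assumed, illustrative rules."""
    words = Counter(
        w.lower() for w in re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text)
    )
    report = {"conflict": [], "uncertain": [], "likely_ok": []}
    for word, count in sorted(words.items()):
        if "-" not in word:
            continue
        if words[word.replace("-", "")] > 0:
            # The same word also appears without a hyphen somewhere else.
            report["conflict"].append(word)
        elif count >= 2:
            # Used consistently with a hyphen more than once.
            report["likely_ok"].append(word)
        else:
            report["uncertain"].append(word)
    return report

sample = (
    "Outflow was reduced. The out-flow tract showed web-like changes. "
    "A web-like membrane was seen in the Budd-Chiari case."
)
report = hyphen_report(sample)
print(report)
```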

  27. Styling/Pre-editing • Is a paragraph with a label a list item? • Is a bolded paragraph a section title? • Are numbers in parentheses reference callouts or just plain text?

  28. Styling/Pre-editing: Word Styling Template

  29. Styling/Pre-editing: Pre-edit Tags • Multiple citations in a single reference • Split Tables • Untitled graphics • Data to be moved

  30. Styling/Pre-editing: Reference Macro • Are the references Harvard or Numeric? • Is the author name last/first or first/last? • What is the placement of the year within the citation? • Is a comma or period used after the author names?
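
The first and third of these questions lend themselves to a quick heuristic pass over the reference list. The sketch below is illustrative only and is not the macro shown in the presentation: the function names, the patterns, and the sample references are all assumptions.

```python
import re

# Leading labels such as "1.", "1)", "[1]" or "(1)" suggest a numeric list.
NUMERIC_LABEL = re.compile(r"^\s*(?:\[\d+\]|\(\d+\)|\d+[.)])\s+")
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")

def reference_style(refs):
    """Guess 'numeric' or 'harvard' from how most entries begin."""
    numbered = sum(1 for r in refs if NUMERIC_LABEL.match(r))
    return "numeric" if numbered > len(refs) / 2 else "harvard"

def year_position(ref):
    """Report whether the first year falls in the first or second half."""
    match = YEAR.search(ref)
    if match is None:
        return "missing"
    return "early" if match.start() < len(ref) / 2 else "late"

numeric_refs = [
    "1. Smith J, Jones A. The study of examples. J Examples. 2007;12:123-124.",
    "2. Lee K. More examples. J Examples. 2006;11:1-9.",
]
harvard_refs = [
    "Smith, J. and Jones, A. (2007) The study of examples. "
    "Journal of Examples, 12, 123-124.",
]
print(reference_style(numeric_refs), year_position(numeric_refs[0]))
print(reference_style(harvard_refs), year_position(harvard_refs[0]))
```

Answering these questions once per journal, before conversion, lets the reference parser be configured for one pattern instead of guessing entry by entry.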

  31. Conversion Automated conversion software This is the “lights out” part of the process.

  32. Parsing <ref id="b1"> <citation citation-type="journal"><person-group><query>Centers for Disease Control and</query><name><surname>Prevention</surname></name></person-group><article-title>The study of Examples</article-title> <source>Journal of Examples</source> <year>2011</year> <fpage>123</fpage> <lpage>124</lpage> </citation> </ref>
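
The citation above shows why a parser alone is not enough: an organization name has been split so that "Prevention" is tagged as a surname. A custom check can flag such cases for editorial review. The sketch below is a hypothetical example using Python's standard `xml.etree.ElementTree`. It assumes that `<query>` marks text the converter could not classify and that a personal name should carry `<given-names>`; neither assumption is stated on the slide.

```python
import xml.etree.ElementTree as ET

def check_citation(ref_xml):
    """Return human-readable warnings for one <ref> element string."""
    ref = ET.fromstring(ref_xml)
    warnings = []
    # Assumption: <query> holds text the converter left for a human to resolve.
    for query in ref.iter("query"):
        warnings.append(f"unresolved query text: {query.text!r}")
    # Assumption: a <name> without <given-names> is suspicious.
    for name in ref.iter("name"):
        if name.find("given-names") is None:
            surname = name.findtext("surname", default="")
            warnings.append(f"surname without given-names: {surname!r}")
    return warnings

sample = (
    '<ref id="b1"><citation citation-type="journal"><person-group>'
    "<query>Centers for Disease Control and</query>"
    "<name><surname>Prevention</surname></name></person-group>"
    "<article-title>The study of Examples</article-title>"
    "<source>Journal of Examples</source><year>2011</year>"
    "<fpage>123</fpage><lpage>124</lpage></citation></ref>"
)
warnings = check_citation(sample)
for warning in warnings:
    print(warning)
```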

  33. Editorial Review: Viewing

  34. Editorial Review: Visual QA • Math • Tables: alignment, spanning, border • Images: sizing, placement

  35. Editorial Review: Reporting Stylesheets

  36. Quality Control Sample error log with generic checks

  37. Quality Control Sample error log with custom checks

  38. What We’ve Learned • Pre-analysis is crucial to successful conversion • Not all text extraction tools are created equal • Valid XML is not necessarily correct XML • A fully “lights out” conversion is not possible when high-quality output is required • Post-conversion effort will still be necessary, but it can be drastically reduced “Give me six hours to chop down a tree and I will spend the first four sharpening the axe.” - Abraham Lincoln

  39. Questions... & Answers • Devorah Bloom • Project Manager • dbloom@dclab.com • 718-307-5720 Data Conversion Laboratory 61-18 190th St., 2nd Floor Fresh Meadows, NY 11365 Telephone: (718) 357-8700 Fax: (718) 357-8776 Web: http://www.dclab.com
