
Quality Assurance: Towards Tools for Characterizing and Comparing Digital Documents

Quality Assurance: Towards Tools for Characterizing and Comparing Digital Documents. Natasa Milic-Frayling Microsoft Research Cambridge UK www.planets-project.eu. What is the problem?. Digital is a victim of its own success


Presentation Transcript


  1. Quality Assurance: Towards Tools for Characterizing and Comparing Digital Documents • Natasa Milic-Frayling, Microsoft Research, Cambridge, UK • www.planets-project.eu

  2. What is the problem? • Digital is a victim of its own success, i.e., of the advances in digital technologies that have made digital media so widely used and adopted. • Document formats, software and hardware are becoming obsolete faster than we can ensure the forward compatibility of the content.

  3. What are the technical solutions? • We have two main strategies: • Content migration: migrate to standards that are likely to be supported in the future. • Emulation and simulation: create emulators of hardware and simulators of software systems so that old programmes can run and old data can be used.

  4. Ensure long-term access to Europe’s cultural and scientific heritage • Improve decision-making about long-term preservation • Ensure long-term access to valued digital content • Control the costs through automation and scalable infrastructure • Ensure wide adoption across the user community • Establish a marketplace for preservation services and tools • Build practical solutions • Integrate existing expertise, designs and tools • Share and build • Preservation and Long-term Access through NETworked Services (PLANETS)

  5. PLANETS Partners • The British Library • National Library, Netherlands • Austrian National Library • State and University Library, Denmark • Royal Library, Denmark • National Archives, UK • Swiss Federal Archives • National Archives, Netherlands • HATII at University of Glasgow • University of Freiburg • Technical University of Vienna • University of Cologne • Tessella Plc • IBM Netherlands • Microsoft Research, Cambridge • ARC Seibersdorf research

  6. PLANETS Sub Projects

  7. Preserving office documents: Conversion Tools

  8. Microsoft & PLANETS: Preserving Office Documents • Microsoft Research role within PLANETS: • Conversion of binary Microsoft Office documents into the Office Open XML file format (OpenXML) • We extended the effort to include other formats: • More legacy formats, e.g. WordPerfect • Other open standards, e.g. Open Document Format (ODF). [Diagram labels: Binary MS Office → OpenXML; extended with WordPerfect, DOS Word, ODF and UOF]

  9. Document Conversion Tools: Our Approach • Three-step approach, resulting in a modular and extendible infrastructure: • Identify existing conversion tools and libraries • Wrap these tools and libraries into re-usable components • Integrate these components into PLANETS and other systems. • If possible, do not use the office applications themselves (e.g., Microsoft Office or OpenOffice.org): • They are designed as interactive applications • Message boxes might pop up (“Do you want …”) • The licensing situation is unclear when they run on a server.
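
The "wrap" step for a stand-alone executable can be sketched as below. This is only an illustration under assumptions: the class name CommandLineWrapper, its fields and the "{src}"/"{dst}" placeholders are hypothetical and are not taken from the PLANETS code. The sketch closes standard input and sets a timeout so that a tool that waits for user interaction cannot block a server process, which is the concern the slide raises about interactive applications.

```python
import subprocess
import sys
from dataclasses import dataclass
from pathlib import Path
from typing import Sequence


@dataclass(frozen=True)
class CommandLineWrapper:
    """Hypothetical re-usable component around one converter executable."""

    source_format: str
    target_format: str
    # argv template; the placeholders "{src}" and "{dst}" are an assumption
    # of this sketch, not a convention of any real converter.
    command: Sequence[str]

    def convert(self, src: Path, dst: Path, timeout: float = 60.0) -> Path:
        argv = [part.format(src=src, dst=dst) for part in self.command]
        # No stdin and a hard timeout: the tool cannot wait for a prompt forever.
        result = subprocess.run(
            argv,
            stdin=subprocess.DEVNULL,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        if result.returncode != 0:
            raise RuntimeError(f"conversion failed: {result.stderr.strip()}")
        if not dst.exists():
            raise RuntimeError("converter exited cleanly but wrote no output")
        return dst


# Stand-in "converter" for demonstration: the Python interpreter copying a file.
COPY_TOOL = CommandLineWrapper(
    source_format="txt",
    target_format="txt",
    command=[
        sys.executable,
        "-c",
        "import shutil, sys; shutil.copyfile(sys.argv[1], sys.argv[2])",
        "{src}",
        "{dst}",
    ],
)
```

A real deployment would replace COPY_TOOL with the command line of an actual converter; the calling code would not need to change, which is the point of wrapping.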

  10. Reusable Components [Diagram labels: ToooXML (GUI); Web Service; Watch Folder Tool; TB Interface; Transformer Box (Wrapper) “Binary → OpenXML”]

  11. Extendible Architecture [Diagram labels: ToooXML (GUI); Web Service; Watch Folder Tool; TB Interface; three Transformer Boxes (Wrappers): “ODF → OpenXML”, “WP → OpenXML”, “Binary → OpenXML”]

  12. More Technical Details (1) • Currently two types of wrappers: • For command-line tools (stand-alone executables): OpenXML/ODF Translator (OpenXML → ODF), OpenXML Document Viewer (OpenXML → HTML) • For Microsoft conversion libraries (CNV libraries): WordPerfect → RTF, RTF → OpenXML, … • We allow wrappers to be chained, e.g. WordPerfect → RTF → OpenXML → ODF.
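
Chaining can be pictured as a path search over the available conversions. The sketch below is a hedged illustration, not the Transformer Box implementation: the WRAPPERS list and the plan_chain function are hypothetical names, the conversion pairs are the ones named on the slide, and the direction of each pair is assumed to be source to target.

```python
from collections import deque
from typing import Dict, List, Optional, Sequence, Tuple

Step = Tuple[str, str]  # (source format, target format) handled by one wrapper

# Conversions named on the slide; the grouping by wrapper type follows the slide.
WRAPPERS: List[Step] = [
    ("WordPerfect", "RTF"),  # CNV library
    ("RTF", "OpenXML"),      # CNV library
    ("OpenXML", "ODF"),      # command-line tool: OpenXML/ODF Translator
    ("OpenXML", "HTML"),     # command-line tool: OpenXML Document Viewer
]


def plan_chain(
    source: str, target: str, wrappers: Sequence[Step] = WRAPPERS
) -> Optional[List[Step]]:
    """Breadth-first search for the shortest chain of wrappers, or None."""
    if source == target:
        return []
    previous: Dict[str, Step] = {}  # format -> the step that first reached it
    seen = {source}
    queue = deque([source])
    while queue:
        fmt = queue.popleft()
        for step in wrappers:
            src, dst = step
            if src != fmt or dst in seen:
                continue
            previous[dst] = step
            if dst == target:
                # Walk back from the target to the source to recover the chain.
                chain: List[Step] = []
                node = target
                while node != source:
                    chain.append(previous[node])
                    node = previous[node][0]
                return chain[::-1]
            seen.add(dst)
            queue.append(dst)
    return None
```

Calling plan_chain("WordPerfect", "ODF") yields the three-step chain quoted on the slide; a request with no route, such as ODF to WordPerfect with this list, returns None instead of raising.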

  13. More Technical Details (2) • Microsoft conversion libraries (CNV libraries) • Originally designed to import/export “foreign” document formats into/from Microsoft Word • Based on the Microsoft Conversion API: Foreign2RTF and RTF2Foreign • The Transformer Box CNV Wrapper follows this API. [Diagram labels: Microsoft Word; Transformer Box CNV Wrapper; CNV Library; RTF2Foreign; Foreign2RTF]

  14. Supported Formats • Target formats: OpenXML; ODF; UOF; HTML; XCDL (format defined in PLANETS/PC) • Source formats: WordPerfect 5; WordPerfect 6; DOS Word; Word 2, 6, 95; Word 97-2003; RTF; ODF; OpenXML

  15. Preserving office documents: Conversion services

  16. Conversion applications and service

  17. Conversion applications and service

  18. Conversion applications and service

  19. Conversion applications and service

  20. Conversion applications and service

  21. Conversion applications and service

  22. Understanding the quality criteria: Similarity Assessment

  23. How do we explore and compare digital artefacts? • Perceptive aspects of the digital object • In the past, the printed version of the document and the screen display • Interactive aspects of the digital objects • Dynamic content includes both individual artefacts and the “stream characteristics”. • Non-perceptive aspects of the digital objects • Document object model, cached data, action-generated metadata, hidden formulas, etc.

  24. EXAMPLE: Perceptive features for Word Documents • Two objects in different formats are mapped onto a normalized form • E.g., a WP file is converted into .docx. For both we create an XPS representation of the document • Feature extraction and comparison • For each feature, develop a “digital object probe” that extracts the feature and measures a property of it • E.g., pass the XPS through an OCR package and extract various layout features.
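
The probe idea can be sketched as follows. This is an illustration under stated assumptions only: the Probe class, the three example probes and the compare function are hypothetical, and plain text stands in for the normalized XPS/OCR output, which the slides do not describe in detail. Each probe measures one property, and the comparison reports a relative difference per probe.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass(frozen=True)
class Probe:
    """Hypothetical 'digital object probe': measures one property of one feature."""

    name: str
    measure: Callable[[str], float]


# Stand-in probes over plain text; real probes would read layout features
# extracted from the normalized representation.
PROBES: Sequence[Probe] = (
    Probe("word_count", lambda text: float(len(text.split()))),
    Probe("line_count", lambda text: float(len(text.splitlines()))),
    Probe(
        "paragraph_count",
        lambda text: float(sum(1 for block in text.split("\n\n") if block.strip())),
    ),
)


def compare(
    original: str, converted: str, probes: Sequence[Probe] = PROBES
) -> Dict[str, float]:
    """Relative difference per probe; 0.0 means the two measurements agree."""
    report: Dict[str, float] = {}
    for probe in probes:
        a = probe.measure(original)
        b = probe.measure(converted)
        largest = max(abs(a), abs(b))
        report[probe.name] = 0.0 if largest == 0 else abs(a - b) / largest
    return report
```

A report like this only says where two renderings differ in measurable terms; whether a given difference matters to a reader is the open research question raised later in the talk.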

  25. Conversion applications and service

  26. Conversion applications and service

  27. What is ahead of us? • Research • What is the relationship between the human criteria and automated measurements? • What usage scenarios do we aim for? • Technology • What “instruments” do we need to extract and measure properties of the digital content? • How do we automate the process of inspection and quality assurance? • Legal • How do we run legacy software as services? We need updated licensing agreements. • How do we provide services that combine open source and non-open source software?

  28. Contact: Natasa Milic-Frayling, Microsoft Research Cambridge, natasamf@microsoft.com • Thank you
