
An Experimental Workflow Development Platform for Historical Document Digitisation and Analysis

Clemens Neudecker, KB National Library of the Netherlands. Research Meeting, Amsterdam, 3 November 2011.


Presentation Transcript


  1. Clemens Neudecker, KB National Library of the Netherlands. Research Meeting, Amsterdam, 3 November 2011. An Experimental Workflow Development Platform for Historical Document Digitisation and Analysis

  2. Background • > 20 individual software components for specific challenges • Prototyping new algorithms, improving commercial solutions • Different frameworks (C, C++, Java, etc.), platforms (Win/Linux) • Extensible with 3rd party applications → IMPACT Interoperability Framework (IIF)

  3. Main requirements Behavioural: • Minimize integration effort • Minimize deployment effort • Maximize usability • Maximize scalability Functional: • Modular • Transparent • Expandable • Open source • Platform independent

  4. Architecture • Java • Web Services • Apache • Taverna. Open source, available at https://github.com/impactcentre. Free hackathon, 14/15 November, University of Manchester: http://impact-mygrid-taverna-hackathon.wikispaces.com/

  5. Integration • Only requirement: command line executable • Generic command line wrapper produces web service • Web service exposed as workflow module with documentation

  6. Generic Web Service Wrapper → Easy integration: developers can focus on their application and worry less about integration, which leads to higher quality software components
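
The wrapping idea from slides 5 and 6 can be illustrated with a minimal sketch. It is not the IIF wrapper itself: the function name `make_service`, the `{placeholder}` template syntax and the returned dictionary are assumptions made for illustration, and the web service layer (SOAP/REST) that the real wrapper generates is left out.

```python
import shlex
import subprocess
import sys


def make_service(command_template):
    """Turn a command-line template into a callable 'service' function.

    The template uses str.format placeholders, e.g. "tool {input} {output}".
    (Hypothetical convention, chosen for this sketch only.)
    """
    def service(**params):
        # Quote every parameter so a value with spaces stays one argument.
        quoted = {key: shlex.quote(str(value)) for key, value in params.items()}
        argv = shlex.split(command_template.format(**quoted))
        completed = subprocess.run(argv, capture_output=True, text=True, check=False)
        return {
            "returncode": completed.returncode,
            "stdout": completed.stdout,
            "stderr": completed.stderr,
        }
    return service


if __name__ == "__main__":
    # The Python interpreter stands in for an arbitrary command-line tool.
    echo = make_service("{python} -c {code}")
    print(echo(python=sys.executable, code="print('hello')"))
```

The only thing the tool author supplies here is the command template, which mirrors the "only requirement: command line executable" point on slide 5.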

  7. Workflows • OCR workflow = data pipeline • Building blocks = processing modules (nodes) • Integration = interaction between nodes (mashups) → Collaboration with [logo not preserved in transcript]
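
The "workflow = data pipeline" view on slide 7 can be sketched as function composition. The node names (`binarise`, `segment`, `recognise`) and the dictionary passed between them are hypothetical stand-ins; real nodes would call the wrapped web services rather than append a label.

```python
from functools import reduce


def run_workflow(nodes, data):
    """Pass the data through each processing node in order."""
    return reduce(lambda value, node: node(value), nodes, data)


def make_node(name):
    """Build a stand-in node that only records that it ran (illustration only)."""
    def node(page):
        return {**page, "steps": page["steps"] + [name]}
    return node


# Hypothetical processing modules for an OCR workflow.
binarise = make_node("binarise")
segment = make_node("segment")
recognise = make_node("recognise")

if __name__ == "__main__":
    page = {"id": "page-1", "steps": []}
    print(run_workflow([binarise, segment, recognise], page))
```

Because every node takes and returns the same kind of value, nodes can be reordered or swapped without changing `run_workflow`.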

  8. Workflow management • Web 2.0 style registry: myExperiment • Local client: Taverna Workbench • Web client: Project website

  9. Local client: Taverna Workbench • Background: BioSciences • Developed and maintained by myGrid, UK • Open source • GUI for design and execution of web services & workflows

  10. Remote client: Portal • SOAP/REST API • Remote execution of web services & workflows

  11. Community • Web 2.0 style workflow registry • Community of experts • Sharing of resources • Knowledge exchange • A central meeting point for users and researchers

  12. Scalability • Central ESB proxy manages multiple service copies • Process parallelization, load distribution, failover, security • Served >2M requests • Throughput improvements of 94% with every additional instance • Tested on Dutch Cloud (“Enlighten Your Research”)
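
Load distribution and failover across several service copies, as listed on slide 12, can be sketched as a round-robin proxy. This is a simplified illustration, not the ESB configuration used in the project: the class name `ServiceProxy` is hypothetical, and service copies are modelled as plain Python callables.

```python
from itertools import cycle


class ServiceProxy:
    """Distribute calls over several service copies, skipping failed ones."""

    def __init__(self, instances):
        if not instances:
            raise ValueError("at least one instance is required")
        self._count = len(instances)
        self._rotation = cycle(instances)

    def call(self, request):
        errors = []
        # Try each copy at most once per request, continuing the rotation.
        for _ in range(self._count):
            instance = next(self._rotation)
            try:
                return instance(request)
            except Exception as error:  # fail over to the next copy
                errors.append(error)
        raise RuntimeError(f"all {self._count} instances failed: {errors}")
```

A production proxy would also track instance health and handle security, which this sketch omits.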

  13. Dataset • Access to a representative and annotated dataset of significant size, with metadata, ground truth and search facilities

  14. Evaluation features • Text-based comparison of result with ground truth, using the Levenshtein distance • Layout-based comparison of result with ground truth, using the PAGE (Page Analysis and Ground-truth Elements) framework • Example: [figure not preserved]
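
The text-based comparison on slide 14 relies on Levenshtein distance. The sketch below is a standard dynamic-programming version with unit costs. The `character_accuracy` helper and its normalisation by ground-truth length are an assumption for illustration; the slides do not say how the distance is turned into a score.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute: cost 1)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(ground_truth, ocr_result):
    """1 - distance / len(ground truth), clipped at 0 (assumed normalisation)."""
    if not ground_truth:
        return 1.0 if not ocr_result else 0.0
    distance = levenshtein(ground_truth, ocr_result)
    return max(0.0, 1.0 - distance / len(ground_truth))
```

For example, comparing the ground truth "digitisation" with an OCR result "digitization" gives a distance of 1 over 12 characters.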

  15. The PAGE Format Framework: processing results or ground truth (e.g. binarisation, dewarping, page content) • Two-level architecture: root structure, task-specific sub-formats • Separate XML Schema definitions • Format identification via namespaces • Mapping of dependencies, process chains, alternative processing steps • Linking via IDs
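
Two of the points on slide 15, format identification via namespaces and linking via IDs, can be sketched with Python's standard XML parser. The namespace URIs, element names and the `ref` attribute below are invented for this sketch and are not taken from the actual PAGE schemas.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs, invented for illustration only.
SUB_FORMATS = {
    "urn:example:page:content": "page content",
    "urn:example:page:binarisation": "binarisation",
}

# Hypothetical document: one region links to another through its ID.
SAMPLE = """<Root xmlns="urn:example:page:content">
  <Region id="r1" type="paragraph"/>
  <Region id="r2" type="caption" ref="r1"/>
</Root>"""


def identify_format(root):
    """Read the namespace from the root tag, stored by ElementTree as '{uri}name'."""
    if root.tag.startswith("{"):
        uri = root.tag[1:].split("}", 1)[0]
        return SUB_FORMATS.get(uri, "unknown")
    return "unknown"


def resolve_links(root):
    """Map the id of each element with a 'ref' attribute to the element it names."""
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    return {el.get("id"): by_id.get(el.get("ref"))
            for el in root.iter() if el.get("ref")}


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print(identify_format(root))
    print({source: target.get("id") for source, target in resolve_links(root).items()})
```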

  16. Ground-Truthing Tools • Aletheia • FineReader PAGE Exporter • GT Validator • GT Normalizer

  17. Profile ‘Full Text Recognition’ • Evaluation for general text recognition [Figure: grid of numeric values between 0.0 and 2.0; layout not preserved]

  18. Measures – Segmentation Errors • Miss • Partial miss • Misclassification • Merge • Split [Figure: ground truth compared with segmentation result, with regions labelled Caption and Paragraph]
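
The error types named on slide 18 can be illustrated with a simplified classifier over axis-aligned boxes. This is a sketch under stated assumptions, not the evaluation method used in the project: regions are rectangles `(x0, y0, x1, y1)`, result regions are assumed not to overlap each other, and the definitions of each error type are deliberately crude.

```python
def overlap_area(a, b):
    """Area of the intersection of two boxes given as (x0, y0, x1, y1)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(0, width) * max(0, height)


def area(box):
    return (box[2] - box[0]) * (box[3] - box[1])


def classify(ground_truth, result):
    """Return (error, region name) pairs.

    Both arguments map a region name to a (box, region_type) tuple.
    """
    errors = []
    for gt_name, (gt_box, gt_type) in ground_truth.items():
        hits = [(box, kind) for box, kind in result.values()
                if overlap_area(gt_box, box) > 0]
        if not hits:
            errors.append(("miss", gt_name))
            continue
        # Summing is only valid because result regions are assumed disjoint.
        covered = sum(overlap_area(gt_box, box) for box, _ in hits)
        if covered < area(gt_box):
            errors.append(("partial miss", gt_name))
        if len(hits) > 1:
            errors.append(("split", gt_name))
        if any(kind != gt_type for _, kind in hits):
            errors.append(("misclassification", gt_name))
    for name, (box, _) in result.items():
        touched = [gt_box for gt_box, _ in ground_truth.values()
                   if overlap_area(gt_box, box) > 0]
        if len(touched) > 1:
            errors.append(("merge", name))
    return errors
```

With a caption and a paragraph in the ground truth and a single result block covering both, the sketch reports a misclassification for the caption and a merge for the block.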

  19. OCR Accuracy

  20. Outlook • Online service for testing/evaluation • Specification & Guidelines • Extending the scope: workflows for linguistic analysis (CLARIN), workflows for preservation (SCAPE) • Even better scalability: Map/Reduce • Supported by a community of developers & practitioners

  21. “Anyway, the thing about progress is that it always seems greater than it really is.” Ludwig Wittgenstein, Philosophical Investigations (quoting Johann Nestroy)
