Overview of the IMPACT Project
Hildelies Balk, IMPACT Project Director, KB National Library of the Netherlands
Twitter: @impactocr, #impactproject
Overview of this presentation
• Challenges in digitisation of historical printed text
• IMPACT project and objectives
• IMPACT Achievements
• IMPACT Centre of Competence
• How can we work together with YOU
KB Digital Library Programme
• Goal: offer everyone access to everything published in and about the Netherlands through the internet
• 2013: 10% of the publications published in and about the Netherlands available in digital form (60 M pages by KB, 13 M pages by third parties)
• Offer our full-text collections in such a way that they can be immediately used by researchers
• Example projects:
  • Historical Newspapers – http://kranten.kb.nl
  • Dutch Parliamentary Papers – http://www.statengeneraaldigitaal.nl/
  • Early European Books (ProQuest), 18th and 19th century books (Google), other projects – http://www.kb.nl/hrd/digitalisering/index-en.html
• Timeframe covered: 1618 - 1995
Damaged pages, bleed through, difficult layout, historic fonts … OCR problems
Warping of paper (due to humidity)
Bleed-through and shine-through
Bad printing: blurred, broken, faded characters
Gothic print types
Annotations in the text
Complicated layout
Language challenges: spelling variants, orthographical variants, inflected forms … and more
Historical variants of the Dutch word ‘wereld’ (world):
werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled
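To illustrate why such variant lists matter for full-text search, here is a minimal sketch of a lexicon lookup that maps historical spellings to a modern lemma and expands a modern query into its known variants. The dictionary layout and the function names `normalise` and `expand_query` are assumptions made for this example and do not describe the actual IMPACT lexicon format; only a handful of the variants listed above are included.

```python
# Hypothetical mini-lexicon: historical spelling -> modern lemma.
# The spellings come from the slide; the data structure is an assumption.
VARIANT_TO_LEMMA = {
    "werelt": "wereld",
    "weerelt": "wereld",
    "waereld": "wereld",
    "vvereld": "wereld",
    "werlt": "wereld",
    "weireld": "wereld",
}


def normalise(token):
    """Return the modern lemma for a known historical variant.

    Unknown tokens are returned lower-cased and otherwise unchanged.
    """
    key = token.lower()
    return VARIANT_TO_LEMMA.get(key, key)


def expand_query(lemma):
    """Return the lemma plus all known historical variants, sorted.

    A search for the modern word can then also match old spellings.
    """
    variants = {v for v, l in VARIANT_TO_LEMMA.items() if l == lemma}
    variants.add(lemma)
    return sorted(variants)


if __name__ == "__main__":
    print(normalise("Waereld"))
    print(expand_query("wereld"))
```

A researcher searching for ‘wereld’ would, under this scheme, also retrieve pages that only contain ‘werelt’ or ‘waereld’.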
Institutional challenge: lack of knowledge and expertise, inefficiency
Answering the challenges – IMPACT
IMPACT – Improving Access to Text (2008-2011)
• Large-scale integrating research project
• Consortium of 26 partners
• Good mix of public and private partners
• Users, researchers and industry work together to find solutions
• Each established in a large international network
• Coordinated by the National Library of the Netherlands (KB)
• Co-funded by EU (FP7 ICT Work Programme)
• From 2012: sustainable Centre of Competence with alternative resources
IMPACT objectives
Significantly improve mass digitisation of historical printed text by:
• Innovating OCR software and language technology → tools for each step in the digitisation workflow from scan to publication
• Sharing expertise and building capacity across Europe
• Ensuring that tools and services will be sustained after the end of the project
IMPACT Achievements: summary
• On market: improved commercial OCR
• Ready for real-life testing:
  • Adaptive OCR engine
  • Tools for OCR correction with volunteer involvement
  • Computer lexica for nine languages
  • Digitisation Framework with evaluation tools and dataset
  • Knowledge bank with guidelines and learning resources
  • Service for print space recognition
• For future development:
  • Novel approaches to preprocessing, OCR and post-correction
  • Tools for lexicon building
• Added value: unique network bringing together experts from different communities
• Centre of Competence for digitisation to start 1 January 2012
IMPACT Achievements: examples
Preprocessing: novel approaches to image enhancement
Border removal and dewarping by NCSR and USAL (before and after images)
OCR: improved commercial engine on market: ABBYY FR10
• Historic European fonts: FRE10 recognition of historic fonts:
  • 25% more accurate than FRE9
  • 38% more accurate than FR XIX
OCR correction: two effective tools ready for implementation
• Both make use of volunteer involvement
• CONCERT by IBM: collaborative correction feeds back into Adaptive OCR → promising pilots by libraries
• LMU post-correction tool based on language input → pilot to start soon
Language: lexica for nine languages
Correction of long s with IMPACT lexicon for historical Dutch
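The slide does not say how the lexicon is applied, so the following is only an illustrative sketch of one plausible approach. It assumes the common situation where OCR reads the long s (ſ) as "f": for a token that is not in the lexicon, it tries every way of turning "f" into "s" and accepts a correction only when exactly one candidate is a lexicon word. The tiny `LEXICON` set and the function names are hypothetical stand-ins, not the IMPACT lexicon or tooling.

```python
from itertools import product

# Tiny stand-in for a historical Dutch lexicon (assumption for this sketch).
LEXICON = {"staf", "vast", "des", "wysheit"}


def long_s_candidates(token):
    """Yield every spelling obtained by turning some 'f' characters into 's'."""
    options = [("f", "s") if ch == "f" else (ch,) for ch in token]
    for combo in product(*options):
        yield "".join(combo)


def correct_long_s(token, lexicon=LEXICON):
    """Return a lexicon-backed correction of a possible long-s misreading.

    Tokens already in the lexicon are kept. Otherwise the token is
    replaced only if exactly one f->s candidate is a lexicon word;
    ambiguous or unknown tokens are returned unchanged.
    """
    if token in lexicon:
        return token
    matches = [c for c in long_s_candidates(token) if c in lexicon]
    return matches[0] if len(matches) == 1 else token


if __name__ == "__main__":
    for word in ["ftaf", "vaft", "def", "fiets"]:
        print(word, "->", correct_long_s(word))
```

The number of candidates doubles with each "f" in the token, which is acceptable for single words but would need pruning for longer strings.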
Post-processing: print space recognition
• Functional Extension Parser by UIBK
• Recognition of the structure of book pages
• Enrichment of OCR results with structural information
Evaluation: IMPACT Framework
• Modular and transparent method for evaluating specific workflows
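The slides do not state which metrics the framework uses. As a hedged illustration of what evaluating a workflow against ground truth can look like, the sketch below computes character accuracy from the Levenshtein edit distance between a ground-truth transcription and OCR output. The function names are assumptions for this example and are not part of the IMPACT Framework.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_accuracy(ground_truth, ocr_text):
    """Share of ground-truth characters recognised correctly, clamped to [0, 1]."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(ground_truth, ocr_text)
    return max(0.0, 1.0 - errors / len(ground_truth))


if __name__ == "__main__":
    truth = "de wereld"
    print(character_accuracy(truth, "de werelt"))
    print(character_accuracy(truth, "de wereld"))
```

Running the same measure on the output of two different workflows over the same ground-truth pages gives a simple, comparable score per workflow.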
Evaluation: IMPACT Dataset
• Over half a million representative pages of digitised historical texts (newspapers, books, pamphlets, typewritten material) from the collections of 11 European libraries, with unique IDs and metadata
• Invaluable resource for future research in OCR and language technology
Centre of Competence in digitisation
• New community: bridges the gap between
  • content holders with digitisation programmes and
  • scientific communities in the area of pattern recognition, language technology, image processing
• Mission: making Europe’s heritage accessible in digital form
• Focus on practical solutions
• Provides support in the implementation of the innovative IMPACT solutions for improving access to text
• Provides tools and services for further advancement of the state of the art in the field
• Organises conferences and workshops
How to join
Three levels of membership:
• Open (registration): access to forum and part of the content
• Basic membership (fee): access to all facilities, reduced fee for conferences
• Premium membership (fee): member of the Board, privileges such as free entry to conferences
Want to sign up?
• Mail to impact@kb.nl for information on membership
• Join us on LinkedIn now
• Follow us on Twitter (@impactocr)
• Access through www.impact-project.eu
Houston: our ideas on working together
Low-hanging fruit:
• Sharing open source solutions
• Evaluating them in our framework with Ground Truth
• Building a good set of use cases for all available tools
• Sharing case studies on digitisation problems
Addressing the big remaining challenge:
• Getting the tools to work in real-life environments
• Bridging the gap between technical solutions and content holders' workflows
Questions?
• impact@kb.nl
• www.impact-project.eu
Thank you!