TEI and Windows, a marriage made in … Tomaž Erjavec, Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana. Dagstuhl, 4–8.12.2006
Overview • motivation • overview of the system • case study • problems
Background • Projects: • Text-critical editions of Slovene literature (e-Library) • XIX-century Slovene books (corpus / e-Library) • Goal: to produce TEI-encoded historical texts • But in both projects, the editors/proofreaders/annotators: • do not know XML • are typically involved only for a short period of time
Dilemma • Historical texts are special: • more complex annotations than in typical digital libraries: • corrections & emendations, linking (several transcriptions), multimedia (facsimile, …) • more manual correction of linguistic annotations than in typical corpora: • smaller editions, worse performance of automatic tagging → allowing extensive editing is important • Dilemma: • force them to learn XML and TEI, or • let them use the tools they are familiar with
…let them use Microsoft • Potential advantages • no need for (extensive) training • flexibility of powerful (and familiar!) editors • Disadvantages • the need for a (robust) conversion procedure • unconstrained, opaque and shifting source encoding • much greater chance of errors
Approach Implemented a Web service that allows: • upload of MS files • up-conversion to TEI • down-conversion to a display / edit format • display or download of the results. Steps 2+3 give the user immediate feedback on the quality of the annotation
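The round-trip the service performs can be sketched as follows. This is a minimal illustration in Python; the actual service is a Perl CGI driving XSLT stylesheets, and all function names and the toy string transformations here are hypothetical stand-ins:

```python
# Illustrative sketch of the service's round-trip (hypothetical names;
# the real system uses Perl CGI + XSLT, not these string replacements).

def up_convert(word_xml: str) -> str:
    """Up-convert editor markup to TEI (stand-in for the up-conversion XSLT)."""
    return word_xml.replace("<body>", "<TEI><text><body>") \
                   .replace("</body>", "</body></text></TEI>")

def down_convert(tei: str) -> str:
    """Down-convert TEI to a display format (stand-in for the TEI-to-HTML XSLT)."""
    return tei.replace("TEI", "html")

def handle_upload(word_xml: str) -> str:
    tei = up_convert(word_xml)      # step 2: up-conversion to TEI
    preview = down_convert(tei)     # step 3: down-conversion for display
    return preview                  # immediate feedback for the annotator
```

The point of the design is the loop itself: because steps 2 and 3 run on every upload, annotators see the rendered result of their styling straight away.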
Case study: AHlib • Digital library / corpus of XIX-century Slovene books • Ongoing project at the Austrian Academy / Graz University (related project in Slovenia) • Goal is to compile: • structurally marked-up volumes, linked to the facsimile (for the e-library server) • a hand-lemmatised corpus (for the concordance engine)
Stage 0 • Selection and meta-data • Scanning the originals (PDF) • OCR to MS Word, followed by manual correction (mostly student work)
Example OCR • Preserves: • paragraphs • page breaks and numbers • headings (approximately) • Most mistakes: • title pages • ornaments • č, š, ž, ſ • hyphenations, emphasised text (red = errors in the example)
The Web service • HTML form + Perl CGI • upload of a .zip with file(s) • input is in, or is converted into, XML • use of XSLT for the conversions • display or download of the result
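The upload step can be sketched in Python (a stdlib stand-in for the Perl CGI; the function name and the reporting format are illustrative assumptions): unpack the .zip and check that each member is well-formed XML before attempting conversion.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

def check_upload(zip_bytes: bytes) -> dict:
    """Unpack an uploaded .zip and report which members parse as XML.
    Illustrative sketch; the real service is a Perl CGI."""
    report = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            try:
                ET.fromstring(zf.read(name))
                report[name] = "ok"
            except ET.ParseError as err:
                report[name] = f"not well-formed: {err}"
    return report
```

Rejecting malformed files before the XSLT stage gives the annotator an error message tied to a specific file rather than a failed conversion.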
Stage I: Structural markup • Annotators’ work: • manual correction of OCR errors • emendations • styling in Word • Annotators’ manual + Word .dot file for the styles: • page number, footnote, emendation, gap, figure description, … • Conversion: • RTF to XML: UpCast • UpCast XML to TEI: dedicated XSLT • TEI to HTML: dedicated XSLT • HTML gives a ToC, page index, colour coding
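The Word styles listed above map naturally onto standard TEI elements. The fragment below is illustrative only (not taken from the project's actual output), embedded in a short Python check that it is well-formed:

```python
import xml.etree.ElementTree as ET

# Illustrative TEI fragment (assumed, not from the project) showing a
# plausible mapping of the annotators' styles: page break (pb),
# emendation (choice/sic/corr), footnote (note), gap, figure description.
TEI_FRAGMENT = """
<div>
  <pb n="23"/>
  <p>... <choice><sic>misprinted form</sic><corr>corrected form</corr></choice> ...
     <note place="foot">footnote text</note>
     <gap reason="illegible"/>
  </p>
  <figure><figDesc>description of an ornament</figDesc></figure>
</div>
"""

# The fragment is well-formed XML:
elem = ET.fromstring(TEI_FRAGMENT)
assert elem.find(".//pb").get("n") == "23"
```

Keeping emendations as sic/corr pairs preserves both the original reading and the correction, which is what makes the colour-coded HTML diagnostics possible.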
HTML example
Stage II: unknown words • the text is automatically PoS-tagged and lemmatised • the tagger (TnT) and lemmatiser (CLOG) induce their models from pre-annotated data (corpus + lexicon) • problems: • the training data is modern Slovene • the first manual correction pass does not spot all the OCR mistakes and typos → concentrate on unknown words (lemmas) only • correct both the lemmatisation and the text • perform the correction in several iterations
Using Excel • 1. Dump the unknown lemmas as a “token lexicon” in Excel XML (or a tabular file) • 2. Annotators (partially) correct: • lemmas • the text (when an unknown word is a typo) • 3. Resubmit the text + lexicon with the corrections • 4. Dump the new text + Excel file • 5. Go to 2 until done • 6. Submit the corrected word/lemma lexicon to the system: these words are henceforth known (but possibly ambiguous)
What happens in Excel • if a word-form has several lemmatisations, the lines with the wrong ones are deleted • if there is just one, it is corrected (if necessary) • if typos are noticed, those lines are deleted and the text corrected • a very useful option: sorting by word-form or lemma • when (partially) done, resubmit • in the new lexicon, already-corrected lemmas are marked in green
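Reading the edited sheet back then reduces to rebuilding the word-form-to-lemma mapping from the rows that survived the annotators' deletions. A minimal sketch, assuming the sheet has been parsed into (wordform, lemma) pairs:

```python
def apply_lexicon(rows):
    """Rebuild the lexicon from an edited token lexicon: rows the
    annotator deleted are simply absent, so each word form keeps only
    its surviving lemmatisations (possibly still ambiguous).
    Returns {wordform: [lemma, ...]}. Illustrative sketch."""
    lexicon = {}
    for form, lemma in rows:
        lexicon.setdefault(form, [])
        if lemma not in lexicon[form]:
            lexicon[form].append(lemma)
    return lexicon
```

A form that still maps to more than one lemma after this step is exactly the "known but possibly ambiguous" case that Stage III has to disambiguate.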
Depositing • When the unknown word lexicon for a book is finished, it is deposited with the system • From this point on, these words become known ones • However, the lemmatisations can be ambiguous
Stage III: Full lemmatisation • The same principle as before, but full concordances are included in the Excel file • The point here is to: • check the lemmatisation of known words • disambiguate ambiguous ex-unknown words • still work in progress…
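A concordance line for a word form is just its occurrences with a window of surrounding context; a minimal keyword-in-context sketch (the bracket formatting and window width are illustrative assumptions, not the project's actual Excel layout):

```python
def kwic(tokens, target, width=3):
    """Keyword-in-context lines for a word form, for checking its
    lemmatisation occurrence by occurrence. Illustrative sketch."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}")
    return lines
```

Seeing each occurrence in context is what lets annotators assign different lemmas to different instances of the same ambiguous form.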
Some (−) experiences so far • if an annotator can misunderstand the instructions, they will • if MS software can make your life miserable, it will (esp. Word!) • if the conversion software has a bug, somebody will find it • the editor format sets a hard limit on what can be annotated and how • tokenisation problems cannot be fixed in Excel • Excel has a limit of ~64,000 lines (65,536 rows in Excel 97–2003)
Conclusions • the jury is still out… • the system became rather complicated • very important parts of the “system”: • a good annotators’ manual + on-line help • a training course • a fast start of work afterwards
Further work Minimising the errors of automatic lemmatisation: • a core lexicon for tagging • Dagstuhl solutions