TEI and Windows, a marriage made in … Tomaž Erjavec, Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana. Dagstuhl, 4–8.12.2006
Overview • motivation • overview of the system • case study • problems
Background • Projects: • Text-critical editions of Slovene literature (e-Library) • XIX-century Slovene books (corpus / e-Library) • Goal: to produce TEI-encoded historical texts • But in both projects, the editors/proofreaders/annotators: • do not know XML • are typically involved only for a short period of time
Dilemma • Historical texts are special: • more complex annotations than in typical digital libraries: • corrections & emendations, linking (several transcriptions), multimedia (facsimile, …) • more manual correction of linguistic annotations than in typical corpora: • smaller editions, worse performance of automatic tagging → allowing extensive editing is important • Dilemma: • force them to learn XML and TEI, or • let them use the tools they are familiar with
…let them use Microsoft • Potential advantages • no need for (extensive) training • flexibility of powerful (and familiar!) editors • Disadvantages • the need for a (robust) conversion procedure • unconstrained, opaque and shifting source encoding • much greater chance of errors
Approach Implemented a Web service that allows: • upload of MS files • up-conversion to TEI • down-conversion to a display / edit format • display or download of the results. Steps 2+3 give the user immediate feedback on the quality of the annotation
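The round-trip the service performs can be sketched as follows. This is a minimal illustration in Python; the actual service is a Perl CGI driving XSLT stylesheets, and all function names and the toy string transformations here are hypothetical stand-ins:

```python
# Illustrative sketch of the service's round-trip (hypothetical names;
# the real system uses Perl CGI + XSLT, not these string replacements).

def up_convert(word_xml: str) -> str:
    """Up-convert editor markup to TEI (stand-in for the up-conversion XSLT)."""
    return word_xml.replace("<body>", "<TEI><text><body>") \
                   .replace("</body>", "</body></text></TEI>")

def down_convert(tei: str) -> str:
    """Down-convert TEI to a display format (stand-in for the TEI-to-HTML XSLT)."""
    return tei.replace("TEI", "html")

def handle_upload(word_xml: str) -> str:
    tei = up_convert(word_xml)      # step 2: up-conversion to TEI
    preview = down_convert(tei)     # step 3: down-conversion for display
    return preview                  # immediate feedback for the annotator
```

The point of the design is the loop itself: because steps 2 and 3 run on every upload, annotators see the rendered result of their styling straight away.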
Case study: AHlib • Digital library / corpus of XIX-century Slovene books • Ongoing project at the Austrian Academy / Graz University (related project in Slovenia) • Goal is to compile: • structurally marked-up volumes, linked to the facsimile (for the e-library server) • a hand-lemmatised corpus (for the concordance engine)
Stage 0 • Selection and meta-data • Scanning the originals (PDF) • OCR to MS Word, followed by manual correction (mostly student work)
Example OCR • Preserves: • paragraphs • page breaks and numbers • headings (approximately) • Most mistakes: • title pages • ornaments • č, š, ž, ſ • hyphenations, emphasised text (red = errors in the example)
The Web service • HTML form + Perl CGI • upload of a .zip with file(s) • input is in, or is converted into, XML • use of XSLT for the conversions • display or download of the result
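The upload step can be sketched in Python (a stdlib stand-in for the Perl CGI; the function name and the reporting format are illustrative assumptions): unpack the .zip and check that each member is well-formed XML before attempting conversion.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

def check_upload(zip_bytes: bytes) -> dict:
    """Unpack an uploaded .zip and report which members parse as XML.
    Illustrative sketch; the real service is a Perl CGI."""
    report = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            try:
                ET.fromstring(zf.read(name))
                report[name] = "ok"
            except ET.ParseError as err:
                report[name] = f"not well-formed: {err}"
    return report
```

Rejecting malformed files before the XSLT stage gives the annotator an error message tied to a specific file rather than a failed conversion.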
Stage I: Structural markup • Annotators’ work: • manual correction of OCR errors • emendations • styling in Word • Annotators’ manual + Word .dot file for the styles: • page number, footnote, emendation, gap, figure description, … • Conversion: • RTF to XML: UpCast • UpCast XML to TEI: dedicated XSLT • TEI to HTML: dedicated XSLT • HTML gives a ToC, page index, colour coding
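The Word styles listed above map naturally onto standard TEI elements. The fragment below is illustrative only (not taken from the project's actual output), embedded in a short Python check that it is well-formed:

```python
import xml.etree.ElementTree as ET

# Illustrative TEI fragment (assumed, not from the project) showing a
# plausible mapping of the annotators' styles: page break (pb),
# emendation (choice/sic/corr), footnote (note), gap, figure description.
TEI_FRAGMENT = """
<div>
  <pb n="23"/>
  <p>... <choice><sic>misprinted form</sic><corr>corrected form</corr></choice> ...
     <note place="foot">footnote text</note>
     <gap reason="illegible"/>
  </p>
  <figure><figDesc>description of an ornament</figDesc></figure>
</div>
"""

# The fragment is well-formed XML:
elem = ET.fromstring(TEI_FRAGMENT)
assert elem.find(".//pb").get("n") == "23"
```

Keeping emendations as sic/corr pairs preserves both the original reading and the correction, which is what makes the colour-coded HTML diagnostics possible.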
HTML example
Stage II: unknown words • the text is automatically PoS-tagged and lemmatised • the tagger (TnT) and lemmatiser (CLOG) induce their models from pre-annotated data (corpus + lexicon) • problems: • the training data is modern Slovene • the first manual correction pass does not spot all the OCR mistakes and typos → concentrate on unknown words (lemmas) only • correct both the lemmatisation and the text • perform the correction in several iterations
Using Excel • 1. Dump the unknown lemmas as a “token lexicon” in Excel XML (or a tabular file) • 2. Annotators (partially) correct: • lemmas • the text (when an unknown word is a typo) • 3. Resubmit the text + lexicon with the corrections • 4. Dump the new text + Excel file • 5. Go to 2 until done • 6. Submit the corrected word/lemma lexicon to the system: these words are henceforth known (but possibly ambiguous)
What happens in Excel • if a word-form has several lemmatisations, the lines with the wrong ones are deleted • if there is just one, it is corrected (if necessary) • if typos are noticed, those lines are deleted and the text corrected • a very useful option: sorting by word-form or lemma • when (partially) done, resubmit • in the new lexicon, already-corrected lemmas are marked in green
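Reading the edited sheet back then reduces to rebuilding the word-form-to-lemma mapping from the rows that survived the annotators' deletions. A minimal sketch, assuming the sheet has been parsed into (wordform, lemma) pairs:

```python
def apply_lexicon(rows):
    """Rebuild the lexicon from an edited token lexicon: rows the
    annotator deleted are simply absent, so each word form keeps only
    its surviving lemmatisations (possibly still ambiguous).
    Returns {wordform: [lemma, ...]}. Illustrative sketch."""
    lexicon = {}
    for form, lemma in rows:
        lexicon.setdefault(form, [])
        if lemma not in lexicon[form]:
            lexicon[form].append(lemma)
    return lexicon
```

A form that still maps to more than one lemma after this step is exactly the "known but possibly ambiguous" case that Stage III has to disambiguate.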
Depositing • When the unknown word lexicon for a book is finished, it is deposited with the system • From this point on, these words become known ones • However, the lemmatisations can be ambiguous
Stage III: Full lemmatisation • The same principle as before, but full concordances are included in the Excel file • The point here is to: • check the lemmatisation of known words • disambiguate ambiguous ex-unknown words • still work in progress…
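A concordance line for a word form is just its occurrences with a window of surrounding context; a minimal keyword-in-context sketch (the bracket formatting and window width are illustrative assumptions, not the project's actual Excel layout):

```python
def kwic(tokens, target, width=3):
    """Keyword-in-context lines for a word form, for checking its
    lemmatisation occurrence by occurrence. Illustrative sketch."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}")
    return lines
```

Seeing each occurrence in context is what lets annotators assign different lemmas to different instances of the same ambiguous form.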
Some (−) experiences so far • if an annotator can misunderstand the instructions, they will • if MS software can make your life miserable, it will (esp. Word!) • if the conversion software has a bug, somebody will find it • the editor format sets a hard limit on what can be annotated and how • tokenisation problems cannot be fixed in Excel • Excel has a limit of ~64,000 lines (65,536 rows in Excel 97–2003)
Conclusions • the jury is still out… • the system became rather complicated • very important parts of the “system”: • a good annotators’ manual + on-line help • a training course • a fast start of work afterwards
Further work Minimising the errors of automatic lemmatisation: • a core lexicon for tagging • Dagstuhl solutions