Dagstuhl, 4-8.12.2006

  1. TEI and Windows, a marriage made in …
  Tomaž Erjavec
  Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana
  Dagstuhl, 4-8.12.2006

  2. Overview
  • motivation
  • overview of the system
  • case study
  • problems

  3. Background
  • Projects:
    • Text-critical editions of Slovene literature (e-Library)
    • XIX-century Slovene books (Corpus / e-Library)
  • Goal: to produce TEI-encoded historical texts
  • But in both projects the editors/proofreaders/annotators:
    • do not know XML
    • are typically involved only for a short period of time

  4. Dilemma
  • Historical texts are special:
    • more complex annotations than in typical digital libraries: corrections & emendations, linking (several transcriptions), multimedia (facsimile, …)
    • more manual correction of linguistic annotations than in typical corpora: smaller editions, worse performance of automatic tagging → allowing extensive editing is important
  • Dilemma:
    • force them to learn XML and TEI, or
    • let them use the tools they are familiar with

  5. …let them use Microsoft
  • Potential advantages:
    • no need for (extensive) training
    • flexibility of powerful (and familiar!) editors
  • Disadvantages:
    • the need for a (robust) conversion procedure
    • unconstrained, opaque and shifting source encoding
    • much greater chance of errors

  6. Approach
  Implemented a Web service that allows:
  • upload of MS files
  • up-conversion to TEI
  • down-conversion to a display / edit format
  • display or download of the results
  (2+3) → the user gets immediate feedback on the quality of the annotation

  7. Case study: AHlib
  • Digital library / corpus of XIX-century Slovene books
  • On-going project at the Austrian Academy / Graz University (related project in Slovenia)
  • Goal is to compile:
    • structurally marked-up volumes, linked to the facsimile (for the e-library server)
    • a hand-lemmatised corpus (for the concordance engine)

  8. Example

  9. Stage 0
  • Selection and meta-data
  • Scanning the originals (PDF)
  • OCR to MS Word … manual correction (mostly student work)

  10. Example OCR
  • Preserves:
    • paragraphs
    • page breaks and numbers
    • ~headings
  • Most mistakes:
    • title pages
    • ornaments
    • č, š, ž, ſ
    • hyphenations, emphasized text
  (red = errors)
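As a toy illustration of one of these error classes: end-of-line hyphenation can, in simple cases, be rejoined mechanically. This is only a sketch (in the project these mistakes are corrected manually by the annotators), and the function name is illustrative:

```python
import re

def rejoin_hyphenation(text: str) -> str:
    """Rejoin words that OCR split across a line break with a hyphen."""
    # "lemma-\ntisation" -> "lemmatisation"; hyphens not followed by a
    # newline (i.e. real compound hyphens) are left untouched.
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)

print(rejoin_hyphenation("lemma-\ntisation"))  # -> lemmatisation
```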

  11. The Web service
  • HTML form + Perl CGI
  • upload of a .zip with file(s)
  • input in (or converted into) XML
  • use of XSLT for the conversions
  • display or download of the result
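The upload step (a .zip containing one or more files) is straightforward to sketch. The actual service is a Perl CGI; the Python function below, with illustrative names, only shows the unpacking logic:

```python
import io
import zipfile

def read_upload(zip_bytes: bytes) -> dict:
    """Unpack an uploaded .zip into {filename: content} (illustrative)."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return {name: zf.read(name) for name in zf.namelist()}

# Build a small zip in memory to exercise the function.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("book1.xml", "<text/>")

files = read_upload(buf.getvalue())
print(sorted(files))  # -> ['book1.xml']
```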

  12. Interface

  13. Stage I: Structural markup
  • Annotators' work:
    • manual correction of OCR errors
    • emendations
    • styling in Word
  • Annotators' manual + Word .dot file for styles:
    • page number, footnote, emendation, gap, figure description, …
  • Conversion:
    • RTF to XML: UpCast
    • UpCast XML to TEI: dedicated XSLT
    • TEI to HTML: dedicated XSLT
  • The HTML gives a ToC, page index, and colour coding
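The heart of the up-conversion is a mapping from Word paragraph styles (defined in the annotators' .dot template) to TEI elements. A minimal sketch of such a mapping, with illustrative style names and a simplified element choice; the real conversion is done by UpCast plus a dedicated XSLT stylesheet:

```python
import xml.etree.ElementTree as ET

# Illustrative style-to-TEI mapping; the project's actual style names
# live in the .dot template and the mapping in an XSLT stylesheet.
STYLE_TO_TEI = {
    "PageNumber": ("pb", True),    # empty <pb n="..."/> milestone
    "Emendation": ("corr", False), # <corr>...</corr>
    "Gap":        ("gap", True),   # empty <gap/> element
}

def paragraph_to_tei(style: str, text: str) -> ET.Element:
    """Turn one styled Word paragraph into a TEI element (sketch)."""
    tag, empty = STYLE_TO_TEI.get(style, ("p", False))  # default: <p>
    el = ET.Element(tag)
    if empty:
        if text:
            el.set("n", text)  # e.g. the page number
    else:
        el.text = text
    return el

print(ET.tostring(paragraph_to_tei("Emendation", "pomlad"),
                  encoding="unicode"))  # -> <corr>pomlad</corr>
```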

  14. HTML example

  15. Stage II: Unknown words
  • the text is automatically PoS-tagged and lemmatised
  • the tagger (TnT) and lemmatiser (CLOG) induce their models from pre-annotated data (corpus + lexicon)
  • problems:
    • the training data is modern Slovene
    • the first manual correction does not spot all the OCR mistakes and typos
  → concentrate on unknown words (lemmas) only:
    • correct both the lemmatisation and the text
    • perform the correction in several iterations
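The "unknown word" filter itself is simple: collect every word form the lemmatiser's training lexicon does not cover. A sketch with illustrative names (the real check happens inside the tagging/lemmatisation pipeline):

```python
def unknown_words(tokens, lexicon):
    """Yield each word form not covered by the lexicon, once."""
    seen = set()
    for tok in tokens:
        key = tok.lower()          # case-insensitive lookup
        if key not in lexicon and key not in seen:
            seen.add(key)
            yield tok

known = {"je", "in"}               # toy "training lexicon"
print(list(unknown_words(["Pomlad", "je", "tu", "pomlad"], known)))
# -> ['Pomlad', 'tu']
```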

  16. Using Excel
  1. Dump the unknown lemmas as a "token lexicon" in Excel XML (or a tabular file)
  2. Annotators (partially) correct:
    • the lemmas
    • the text (when an unknown word is a typo)
  3. Resubmit the text + lexicon with corrections
  4. Dump the new text + Excel file
  5. Go to 2 until done
  6. Submit the corrected word/lemma lexicon to the system: these words are henceforth known (but possibly ambiguous)
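The dump in step 1 amounts to writing one row per word-form/lemma candidate. A sketch using a tab-separated file, which Excel opens directly; the column names are illustrative, not the project's actual layout:

```python
import csv
import io

def dump_token_lexicon(entries, out):
    """Write (wordform, lemma) candidates as a TSV 'token lexicon'."""
    w = csv.writer(out, delimiter="\t", lineterminator="\n")
    w.writerow(["wordform", "lemma"])
    for wordform, lemma in sorted(entries):  # sorted for easy review
        w.writerow([wordform, lemma])

buf = io.StringIO()
dump_token_lexicon([("pomlad", "pomlad"), ("bukve", "bukva")], buf)
print(buf.getvalue())
```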

  17. Unknown word lexicon in Excel

  18. What happens with Excel
  • if a word form has several lemmatisations, the rows with the wrong ones are deleted
  • if there is just one, it is (if necessary) corrected
  • if typos are noticed, these rows are deleted and the text is corrected
  • a very useful option: sorting by word form or by lemma
  • when (partially) done, resubmit
  • the new lexicon has the already-corrected lemmas marked in green

  19. Depositing
  • When the unknown-word lexicon for a book is finished, it is deposited with the system
  • From this point on, these words become known ones
  • However, the lemmatisations can be ambiguous
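Depositing merges the surviving (wordform, lemma) rows into the set of known words; a word form can keep more than one lemma, which is exactly the residual ambiguity mentioned above. A sketch with illustrative names:

```python
from collections import defaultdict

def deposit(rows):
    """Merge corrected (wordform, lemma) rows into a known-word lexicon.

    A wordform mapped to more than one lemma remains ambiguous.
    """
    lexicon = defaultdict(set)
    for wordform, lemma in rows:
        lexicon[wordform.lower()].add(lemma)
    return dict(lexicon)

# Slovene "je" is a form of both "biti" (to be) and "jesti" (to eat).
lex = deposit([("je", "biti"), ("je", "jesti"), ("pomlad", "pomlad")])
print(sorted(lex["je"]))  # -> ['biti', 'jesti'], i.e. still ambiguous
```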

  20. Stage III: Full lemmatisation
  • The same principle as before, but full concordances are included in Excel
  • The point here is to:
    • check the lemmatisation of known words
    • disambiguate ambiguous ex-unknown words
  • still work in progress …

  21. Some (-) experiences so far
  • if an annotator can misunderstand the instructions, they will
  • if MS software can make your life miserable, it will (esp. Word!)
  • if the conversion software has a bug, somebody will find it
  • the editor format imposes a bottom line on what can be annotated, and how
  • tokenisation problems cannot be fixed in Excel
  • Excel has a limit of 64,000 lines

  22. Conclusions
  • the jury is still out …
  • the system became rather complicated
  • very important parts of the "system":
    • a good annotators' manual + on-line help
    • a training course
    • a fast start of work afterwards

  23. Further work
  Minimising the errors of automatic lemmatisation:
  • a core lexicon for tagging
  • Dagstuhl solutions
