International Conference “Corpus linguistics – 2013”, St. Petersburg, June 25–27, 2013. Roland Mittmann, M.A., Institute of Empirical Linguistics, Goethe University, Frankfurt am Main, Germany, mittmann@em.uni-frankfurt.de. Old German and Old Lithuanian: The Creation of Two Deeply-Annotated Historical Text Corpora
1. Introduction • Aim: creation of deeply-annotated corpora of historical language stages • Approach: depending on • existing resources from previous analyses • qualities of the language itself • Comparison of approaches: • Old German Reference Corpus (OG/OGRC) • Old Lithuanian Reference Corpus (OL/OLRC)
2. Description of the corpora • Old German Reference Corpus (Referenzkorpus Altdeutsch) • all preserved texts from the oldest stages of German • Old High German and Old Saxon (= Old Low German) • ca. 750–1050 CE • ca. 650,000 word tokens • cooperation of 3 German universities, 2008–2013 • Humboldt University (Berlin) • Goethe University (Frankfurt am Main) • Schiller University (Jena) • several subcorpora already searchable online
2. Description of the corpora OGRC: www.deutschdiachrondigital.de
2. Description of the corpora • Old Lithuanian Reference Corpus (Senosios lietuvių kalbos korpusas) • preserved texts from the oldest stage of Lithuanian • ca. 1520–1800 CE • ca. 10,000,000 word tokens • pilot project covering 540,000 word tokens started in 2012 • international cooperation • Lithuanian Language Institute (LKI, Vilnius) • Goethe University (Frankfurt am Main) • University of Pisa, Italy • experience gained with the OGRC is reused thanks to the cooperation in Frankfurt
2. Description of the corpora • Qualities of the texts of both corpora • types of texts: • religious and secular texts • prose and poetry • translated/adapted and independently composed texts • language: • variation due to diachronic, diatopic and diastratic differences • foreign-language source texts and foreign-language words in the texts: • annotated as similarly as possible to OG/OL word tokens • included in the word token counts given above • Old Lithuanian: balanced choice of texts for pilot project
3. The unequal starting points • Divergence from modern languages • OL considerably closer to Modern Lithuanian than OG to Modern (High or Low) German, not only because of the difference in age: • invention of the printing press in the 15th century and spread of written texts • slower pace of change in European literary languages • moderate language development from OL to Modern Lithuanian (however, large differences in spelling; many variants in OL) • vs. extensive changes in the vowel system between OG and Early Modern times (e.g. reduction of unstressed vowels to schwa/zero)
3. The unequal starting points • Impact on availability of resources • Old Lithuanian • no historical dictionary of Lithuanian, no OL grammar (but OL dictionaries) • dictionaries and grammars of Modern Lithuanian may be helpful • Old German • specific dictionaries and grammars • glossaries for every subcorpus: all attested inflected word forms, linked to the corresponding lemmata • OLRC: basis for compilation of OL grammar and glossary • OGRC: questioning and amending of existing works
3. The unequal starting points • Digital availability of the texts • OG: one printed edition per text digitized by TITUS project in Frankfurt • OL: 10 texts in pilot project • 6 on TITUS • 3 adopted from OL database of Lithuanian Language Institute (LKI) • 1: edition being prepared • TITUS texts: • structural annotation: e.g., chapters and lines for original document and edition • information can be adopted directly, together with the texts
3. The unequal starting points titus.uni-frankfurt.de
3. The unequal starting points • Reference text version • OGRC: • digitized edition as main reference layer • manual addition of original text forms and graphical peculiarities postponed; so far carried out only on sample texts • OLRC: • digitized edition extended by version of original manuscripts or prints • detailed representation of amendments → digitization of original documents required
4.1. The courses of action: OGRC • Pre-annotation • digitization of glossaries for the subcorpora into XML format • linking part-of-speech and morphological data of the word forms with the word tokens in the texts: • extraction of data from glossary files • enrichment with additional part-of-speech and morphological information manually extracted from grammars • most glossaries give attestations with locations in the text → one-to-one attribution • aim of consistent spelling and consistent modern German translation → adaptation of glossary lemmata to standard dictionaries of Old High German and Old Saxon
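The one-to-one attribution described on this slide can be pictured as a lookup from text locations to glossary entries. The following is a minimal sketch in Python, not the project's actual pipeline. The XML layout (elements `entry`, `form`, `att` and their attributes), the location strings and the tag `N` are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical glossary fragment; schema, tags and locations are assumptions.
GLOSSARY_XML = """
<glossary>
  <entry lemma="fater" pos="N">
    <form text="fater" morph="Nom.Sg">
      <att loc="1,1"/>
    </form>
    <form text="fateres" morph="Gen.Sg">
      <att loc="2,3"/>
    </form>
  </entry>
</glossary>
"""

def build_index(xml_text):
    """Map each attested location to the annotation given by its glossary entry."""
    index = {}
    root = ET.fromstring(xml_text)
    for entry in root.iter("entry"):
        for form in entry.iter("form"):
            for att in form.iter("att"):
                index[att.get("loc")] = {
                    "form": form.get("text"),
                    "lemma": entry.get("lemma"),
                    "pos": entry.get("pos"),
                    "morph": form.get("morph"),
                }
    return index

def pre_annotate(tokens, index):
    """Attach glossary data to (location, word) pairs.

    Tokens without a matching attestation keep empty annotation
    so that they can be completed manually later.
    """
    result = []
    for loc, word in tokens:
        hit = index.get(loc)
        if hit is not None and hit["form"] == word:
            result.append((loc, word, hit["lemma"], hit["pos"], hit["morph"]))
        else:
            result.append((loc, word, None, None, None))
    return result
```

Checking the word form as well as the location is a deliberate choice in this sketch: a location that points at a different form is treated as unmatched rather than silently annotated.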
4.1. The courses of action: OGRC • Conversion and manual annotation • conversion into ELAN format • software by Max Planck Institute for Psycholinguistics, Nijmegen, the Netherlands • database structure • with part-of-speech, morphological, lemma and structural pre-annotation • manual annotation: • amendment of information • resolution of ambiguities • addition of simple syntactic annotation
4.1. The courses of action: OGRC • automated creation of standardized version of word tokens • from lemmata plus part-of-speech and morphological data • morphological knowledge of the language stages encoded in a Perl program • standard word forms used to detect annotation mistakes by automated comparison with word forms in text edition
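The consistency check described on this slide can be sketched as follows. The project used a Perl program; this Python version is only an illustration of the idea, and the ending table, the tag names and the function names are assumptions made for the example.

```python
# Toy ending table keyed by (part of speech, morphology); purely illustrative.
ENDINGS = {
    ("N", "Nom.Sg"): "",
    ("N", "Gen.Sg"): "es",
    ("N", "Dat.Sg"): "e",
}

def standard_form(lemma, pos, morph):
    """Build a standardized word form from lemma plus annotation.

    Returns None when the toy table has no rule for the annotation.
    """
    ending = ENDINGS.get((pos, morph))
    if ending is None:
        return None
    return lemma + ending

def suspicious_tokens(tokens):
    """Return (text form, predicted form) pairs where the two disagree.

    A mismatch is only a candidate for review: it may point to an
    annotation mistake or to a legitimate spelling variant.
    """
    flagged = []
    for text_form, lemma, pos, morph in tokens:
        predicted = standard_form(lemma, pos, morph)
        if predicted is not None and predicted != text_form.lower():
            flagged.append((text_form, predicted))
    return flagged
```

With this table, a token `tage` annotated as genitive singular of `tag` would be flagged, because the predicted form is `tages`.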
4.2. The courses of action: OLRC • Pre-annotation • no glossaries → annotation tool that learns from manual annotation required • use of Toolbox (by SIL International, Dallas, Texas) • applying extensible dictionaries • one dictionary with data of Lemuoklis • morphological analyser, lemmatizer and tagger by the LKI • enriched by semi-manually classified data from dictionaries on OL, Slavic loanwords in OL and biblical names • other dictionary with data of Lithuanian language dictionary • retrieval of data on all lemmata in the corpus from its digital version
4.2. The courses of action: OLRC Annotation in Toolbox (OLRC)
4.2. The courses of action: OLRC • lemmatization of word forms of OL texts: automatic if possible, otherwise manual • creation of standardized word forms by Lemuoklis from lemmata, part-of-speech and morphological annotation • Modern Lithuanian-English dictionary → lemma translation • conversion of word tokens into standardized spelling: Consistent Changes program (SIL) • mainly for older texts, specific rules needed for every single author
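The idea of author-specific spelling rules can be illustrated with ordered rewrite rules. The sketch below uses Python regular expressions as a stand-in; it does not reproduce the syntax or behaviour of the Consistent Changes program, and the author keys and the rules themselves are hypothetical examples, not the project's rule sets.

```python
import re

# Hypothetical per-author rule sets, applied in the order listed.
RULES = {
    "author_a": [(r"ſ", "s"), (r"w", "v")],
    "author_b": [(r"ſ", "s")],
}

def standardize(word, author):
    """Apply the author's ordered rewrite rules to one word token.

    Tokens of authors without a rule set are returned unchanged.
    """
    for pattern, replacement in RULES.get(author, []):
        word = re.sub(pattern, replacement, word)
    return word
```

Keeping one ordered list per author means that a rule needed for one writer's orthography cannot accidentally rewrite another writer's text.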
4.2. The courses of action: OLRC • Manual annotation and conversion • in Toolbox: • joining of texts with Lemuoklis’ data • manual disambiguation • Toolbox: no chart structure, restricted number of annotation layers • transfer of data into ELAN • automated split-up of word forms into graphemes → annotation (also OGRC) • e.g., addition of information on multiword expressions, quotations and glossing of words • conversion into image annotation tool ImAnTo (Frankfurt University) • annotation of facsimiles of original documents • selection of image details and linking to annotations
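The automated split-up of word forms into graphemes mentioned on this slide could look roughly like the following sketch. It assumes a greedy longest-match strategy and a hypothetical inventory of multi-character graphemes; neither is taken from the project, which would define its own inventory per language stage.

```python
import unicodedata

# Hypothetical inventory of multi-character graphemes.
MULTIGRAPHS = ("sch", "ch", "th")

def split_graphemes(word, multigraphs=MULTIGRAPHS):
    """Split a word form into graphemes using greedy longest match."""
    ordered = sorted(multigraphs, key=len, reverse=True)
    graphemes = []
    i = 0
    while i < len(word):
        for candidate in ordered:
            if word.startswith(candidate, i):
                unit = candidate
                break
        else:
            # No multigraph starts here, so take a single character.
            unit = word[i]
        i += len(unit)
        # Keep combining diacritics attached to the preceding grapheme.
        while i < len(word) and unicodedata.combining(word[i]):
            unit += word[i]
            i += 1
        graphemes.append(unit)
    return graphemes
```

Each resulting grapheme can then carry its own annotation, for example on a separate tier aligned with the word token.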
4.3. The courses of action: Parallel processing • Tagsets and annotation schemes • part-of-speech and morphological annotation: OGRC: Deutsch Diachron Digital Tagset (DDDTS) • adaptation of TIGER Morphology Annotation Scheme for Modern German, based on Stuttgart-Tübingen Tagset (STTS) • DDDTS used as basis for creation of tagset for OL • distinguishing between lemma-specific and record-specific qualities of word tokens • language of word tokens according to ISO 639-3 (goh, osx; olt; lat)
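One possible reading of the distinction between lemma-specific and record-specific qualities is a two-level data model, sketched below. The class and field names are hypothetical and do not reflect the actual DDDTS layout; only the four ISO 639-3 codes come from the slide.

```python
from dataclasses import dataclass

# ISO 639-3 codes named on the slide.
LANGUAGES = {
    "goh": "Old High German",
    "osx": "Old Saxon",
    "olt": "Old Lithuanian",
    "lat": "Latin",
}

@dataclass(frozen=True)
class Lemma:
    """Qualities that hold for a lemma wherever it occurs (hypothetical fields)."""
    form: str
    pos: str
    inflection_class: str

@dataclass(frozen=True)
class Token:
    """Qualities of one occurrence in a text (hypothetical fields)."""
    text: str
    lemma: Lemma
    pos_in_context: str  # may differ from the lemma's part of speech
    morph: str
    lang: str

    def __post_init__(self):
        # Reject language codes outside the set used in this sketch.
        if self.lang not in LANGUAGES:
            raise ValueError(f"unknown language code: {self.lang}")
```

Storing the language code on the token rather than on the text allows foreign-language words inside an OG or OL text to be marked individually.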
4.3. The courses of action: Parallel processing • The ANNIS database • transfer of subcorpora of both projects into ANNIS database (Potsdam University, Germany) • joining of texts with extensive metadata description • developed by the Middle High German and OG reference corpus projects, adapted by OLRC • complex search patterns possible, more user-friendly search tool in preparation
4.3. The courses of action: Parallel processing Representation in the ANNIS database (OGRC)
5. Conclusion • Comparison of approaches for OL and OG • work on OLRC benefits from course of action applied for OGRC, in spite of various aspects diverging initially • OLRC can use digitized data and tools for Modern Lithuanian, which is not possible for OGRC • lack of glossaries for OLRC → additional adaptive annotation tool • special approaches required for objectives exceeding those of OGRC • e.g. precise annotation of facsimiles of original documents • however, cooperation advantageous: more time for philological work
Thank you for your attention! Спасибо за внимание! Old German Reference Corpus: www.deutschdiachrondigital.de