Lemmatizing and tagging a corpus : which information for which linguistic purposes?

The example of the Greek and Latin LASLA databases compared to others
Dominique Longrée, LASLA, Université de Liège and FUSL (Brussels)

0. Introduction: objectives
• to share the expertise that LASLA has built up over 50 years
- « Laboratoire d’Analyse statistique des Langues anciennes »,
- set up in 1961 at the University of Liège
• to open a discussion:
- which information should a database contain, and for which purposes?
- how does that information influence the results of our linguistic studies?
• to compare the lemmatizing and tagging practices of LASLA with those of other (Greek and Latin) databases

0. Introduction: plan
• LASLA and its databases
• the research project LatLem
• the Opera Latina Web interface and the Hyperbase-Latin CD-ROM
• the process of tokenization
• the process of lemmatization
• the process of tagging (morphosyntactic tags)
• the process of tagging (syntactic, semantic and pragmatic tags)
• the research project LatSynt

1. The LASLA Databases: Greek and Latin
• The Laboratory for Statistical Analysis of Classical Languages (L.A.S.L.A.)
- set up in September 1961
- the first research centre aiming to study classical languages (Greek and Latin) using automatic data processing technologies
• part of the Faculty of Philosophy and Letters at the University of Liège
• Missions:
1) a detailed study of Greek and Latin languages and literatures using computer techniques as well as statistical and quantitative methods;
2) the creation of literary data banks and computer tools, in order to distribute those data banks and make the most of them through all available media.

1. The LASLA Greek Database
1,200,000 words/tokens:
• Attic orators: Andocides, Antiphon, Isocrates and Lysias
• Aristotle: De Anima, De partibus animalium, Categories, Metaphysics, Physics, Historia animalium
• Plato: 8 dialogues
• all classical tragedies: Aeschylus, Sophocles, Euripides and fragments
• Pausanias
• Christian authors, for example St John Chrysostom, De sacerdotio; Hesychius of Jerusalem, Homilies

1. The LASLA Greek Database
Each word form is accompanied by the following data:
1. the reference of the word form, according to the ars citandi;
2. the lemma (the word as it appears in the dictionary of reference, the Greek-English Lexicon of H. G. Liddell, R. Scott and H. S. Jones);
3. the grammatical category of the word (POS).

lemma         token         reference        POS
ὦ 2           ὦ             2 1 1 1 1 1 A    λ
κοινός        κοινὸν        2 1 1 2 2 2 A    γ
αὐτάδελφος    αὐτάδελφον    2 1 1 3 3 3 A    γ
Ἰσμήνη        Ἰσμήνης       2 1 1 4 4 4 A    β
κάρα 1        κάρα          2 1 1 5 5 5 A    β
ἆρα           ἆρ᾽           2 1 2 1 6 6 A    μ
οἶδα          οἶσθ᾽         2 1 2 2 7 7 A    ζ
…….

1. The LASLA Latin Database
• Latin classical texts: 2,000,000 tokens
• The LASLA method:
- Étienne ÉVRARD, « Le laboratoire d’analyse statistique des langues anciennes de l’Université de Liège », Mouvement scientifique en Belgique, 9, 1962, p. 163-169;
- Joseph DENOOZ, « L’ordinateur et le latin, Techniques et méthodes », Revue de l’organisation internationale pour l’étude des langues anciennes par ordinateur, 1978, 4, p. 1-36.

1. The LASLA Latin Database
• the available fully lemmatized and encoded texts:
Classical texts (more than 2,000,000 words/tokens)
- Caesar et alii
- Cato
- Catullus
- Cicero: rhetorical works: all; philosophical works: partim
- Curtius
- Horatius
- Iuvenalis
- Lucretius
- Ovidius
- Persius
- Petronius
- Plautus: 8 plays
- Plinius (Iunior)
- Propertius
- Sallustius
- Seneca
- Tacitus
- Tibullus
- Virgilius

1. The LASLA Latin Database
• the available fully lemmatized and encoded texts: other texts
- Medieval Latin: Sedulius Scottus; hagiographic texts (300,000 words)
- Neo-Latin: Descartes, Spinoza
• texts available next (work in progress):
- Cicero (letters)
- Cornelius Nepos
- Livius
- Suetonius
- Historia Augusta
- Busbecq (by L. Grailet)

1. The LASLA Latin Database
• 2,500,000 words/tokens
• for comparison, the Bibliotheca Teubneriana Latina holds 13 million tokens
• fully lemmatized texts,
• with full morphosyntactic tagging and 1 syntactic tag,
• systematically verified by a philologist

1. The LASLA Latin Database
For each word of the text:
1. the lemma (the word as it appears in the dictionary of reference, the Lexicon totius latinitatis of Forcellini, ed. Corradini, Padua, 1864)
2. an index that distinguishes homograph lemmas (ET 1 = adverb, ET 2 = coordinating conjunction) or flags proper names and adjectives derived from proper names (N next to Roma means “proper name”)
3. the form as it appears in the text
4. the reference, according to the ars citandi
5. the complete morphological analysis in alphanumeric format
6. for verbs, syntactic indications:
- main clause verbs
- subordinate clause verbs (sorted by subordination type)

1. The LASLA Latin Database
• the information available for each Latin form:
Lemma + index | Text form | Reference | Analysis
Index: N = proper name; ET 1 = adverb, ET 2 = coordinating conjunction

1. The LASLA Latin Database
• the information available for each Latin form: the Analysis field
urbem: 13C00
- 1: noun
- 3: 3rd declension
- C: accusative singular
habuere: 52L14 &
- 5: verb
- 2: 2nd conjugation, active
- L: 3rd person plural
- 1: indicative
- 4: perfectum
- &: main clause

1. The LASLA Latin Database
• the information available for each Latin form: the Analysis field
audisset: 5JC32 BN
- 5: verb
- J: 1st conjugation, deponent
- C: 3rd person singular
- 3: subjunctive
- 2: imperfect
- BN: cum clause
requisisse: 53074 AG
- 5: verb
- 3: 3rd conjugation, active
- 0: no person
- 7: infinitive
- 4: perfectum
- AG: Accusativus cum Infinitivo
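The positional reading of these analysis codes can be sketched in Python. This is only an illustration: the lookup tables contain nothing beyond the values spelled out on the slides, and the names `decode_analysis`, `lookup` and the dictionary names, as well as the "unknown" fallback, are assumptions made for this sketch, not part of any LASLA software.

```python
# Lookup tables limited to the codes illustrated on the slides; the real
# LASLA code set is much larger and is not reproduced here.
CATEGORY = {"1": "noun", "2": "adjective", "5": "verb"}
DECLENSION = {"3": "3rd declension"}
CONJUGATION = {"2": "2nd conjugation, active", "3": "3rd conjugation, active"}
CASE_NUMBER = {"A": "nominative singular", "C": "accusative singular"}
# "0" is glossed "unpers." on the slide; rendered here as "no person".
PERSON_NUMBER = {"0": "no person", "C": "3rd person singular",
                 "L": "3rd person plural"}
MOOD = {"1": "indicative", "3": "subjunctive", "7": "infinitive"}
TENSE = {"2": "imperfect", "4": "perfect"}


def lookup(table, code):
    """Return the label for a code, or an explicit 'unknown' marker."""
    return table.get(code, f"unknown ({code})")


def decode_analysis(tag):
    """Decode a five-character analysis code position by position."""
    if len(tag) < 5:
        raise ValueError(f"expected at least 5 characters, got {tag!r}")
    category = lookup(CATEGORY, tag[0])
    result = {"category": category}
    if category == "verb":
        # The third position means person/number for verbs ...
        result["class"] = lookup(CONJUGATION, tag[1])
        result["person_number"] = lookup(PERSON_NUMBER, tag[2])
        result["mood"] = lookup(MOOD, tag[3])
        result["tense"] = lookup(TENSE, tag[4])
    elif category == "noun":
        # ... but case/number for nouns, so the category must be read first.
        result["class"] = lookup(DECLENSION, tag[1])
        result["case_number"] = lookup(CASE_NUMBER, tag[2])
    else:
        result["raw"] = tag[1:]  # not decoded in this sketch
    return result


print(decode_analysis("13C00"))  # urbem
print(decode_analysis("52L14"))  # habuere
```

The point of the sketch is that the same character (here `C`) is read differently depending on the first position, which is why the code must be interpreted left to right.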

1. The LASLA system for tagging
• old-fashioned
• the LatLem project

1. The LASLA Greek and Latin Databases
• accessible through:
- indexes published by:
  G. Olms (Hildesheim)
  the Centre Informatique de Philosophie et Lettres (CIPL-Liège)
- for Greek texts: specific software
- for Latin texts:
  the Opera Latina Web interface
  the Hyperbase-Latin CD-ROM

1. The LASLA Latin Database
accessible through “Opera Latina”: www.ulg.ac.be/cipl/lsl.htm

1. The LASLA Latin Database
accessible through the CD-ROM “Hyperbase-Latin”, produced in collaboration with the UMR 6039 « Bases, corpus, langage » (CNRS-University of Nice)

2. “Tokenizing” a text: establishing the text

2. “Tokenizing” a text: segmenting the text into sentences
/Accusa senatum, accusa equestrem ordinem..., accusa omnes ordines, omnes ciues../
/Accusa senatum,/ /accusa equestrem ordinem...,/ /accusa omnes ordines, omnes ciues../

2. “Tokenizing” a text: segmenting the text into words
• compare with the CD-ROM PHI 05 of the Packard Humanities Institute
• the string -ibil-

2. “Tokenizing” a text: segmenting the text into words
• compare with the CD-ROM PHI 05 of the Packard Humanities Institute
• Vergil’s Aeneid:
- arma virumque cano
- arma uirumque cano
• clitic -que:
- /que<blank>/, /que<,>/, /que<;>/, /que<:>/, /que<.>/
- atque, ubique, undique, quicumque
• amatus est / amatust
• animum aduertere / animaduertere
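The difficulty with the clitic -que can be made concrete with a small Python sketch. It splits a final -que into its own token unless the word is in an exception list; the four exceptions are the ones named on the slide, and a usable list would have to be far longer. The names `NOT_ENCLITIC` and `split_que` are hypothetical, and this is not the LASLA procedure, where segmentation is checked by a philologist.

```python
import re

# Words from the slide in which the final -que is not a detachable clitic.
# Assumption: a real exception list would be much longer than these four.
NOT_ENCLITIC = {"atque", "ubique", "undique", "quicumque"}


def split_que(text):
    """Lowercase the text, split it into words, and detach enclitic -que."""
    tokens = []
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        if word.endswith("que") and len(word) > 3 and word not in NOT_ENCLITIC:
            tokens.extend([word[:-3], "que"])
        else:
            tokens.append(word)
    return tokens


print(split_que("arma uirumque cano"))
# ['arma', 'uirum', 'que', 'cano']
print(split_que("atque undique"))
# ['atque', 'undique']
```

A plain string search for `que` followed by a blank or a punctuation mark would return all of these words alike; only a decision taken word by word separates the clitic from the rest.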

3. Lemmatizing a text
• to allow the recognition of the same lemma in its various occurrences in a text,
• independently of the variety of its forms in those occurrences
1) Greek and Latin are inflected languages

3. Lemmatizing a text
2) Latin spelling is not completely fixed
• assimilation phenomena (inlicio/illicio; adtuli/attuli; quidquid/quicquid)
• haplologies (exspecto/expecto)
• weak phonological status of some phonemes (harena/arena, exhibeo/exibeo, mihi/mi, consul/cosul, etc.)
• transformation of diphthongs into monophthongs (saeta/seta; plaudite/plodite; poenicus/punicus)

3. Lemmatizing a text
2) Latin spelling is not completely fixed
• elision, epenthesis, apheresis, contraction, as well as abbreviation
• disjunction of the parts of compound words, or tmesis
- res publica for respublica
- quo modo for quomodo
- quam... ante for antequam
• diachronic and synchronic morphological variants
- pater familias/pater familiae
- siet/sit
- igni/igne
- fecerit/faxit
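One way to picture what lemmatization has to absorb is a normalization step that maps spelling variants and disjoined compounds onto a single reference form. The Python sketch below uses only pairs quoted on the slides; which member of each pair counts as the reference spelling is an arbitrary choice made here, and the names `VARIANTS`, `MULTIWORD` and `normalize` are hypothetical.

```python
# Variant spelling -> reference spelling. The pairs come from the slides;
# treating the first member of each pair as the reference is an assumption.
VARIANTS = {
    "illicio": "inlicio",
    "attuli": "adtuli",
    "quicquid": "quidquid",
    "expecto": "exspecto",
    "arena": "harena",
    "exibeo": "exhibeo",
    "seta": "saeta",
    "plodite": "plaudite",
}

# Adjacent words that are to be read as a single compound.
MULTIWORD = {
    ("res", "publica"): "respublica",
    ("quo", "modo"): "quomodo",
}


def normalize(tokens):
    """Join known two-word compounds, then map variants to one spelling."""
    out = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in MULTIWORD:
            out.append(MULTIWORD[pair])
            i += 2
        else:
            out.append(VARIANTS.get(tokens[i], tokens[i]))
            i += 1
    return out


print(normalize(["res", "publica", "expecto"]))
# ['respublica', 'exspecto']
```

A table of adjacent pairs cannot handle true tmesis such as quam... ante, where other words intervene, nor morphological variants such as faxit for fecerit; those cases need the form-by-form analysis described in the slides.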

3. Lemmatizing a text
• populus 1, “the people”, and populus 2, “the poplar”
• licet 1, “it is allowed”, and licet 2, “although”

3. Lemmatizing a text

Lemma dux: co-occurring lemmas
Deviation  Corpus  Extract  Word
038        932     934      dvx
015        1616    120      miles
014        1285    105      exercitvs1
009        25801   604      qve
009        2447    107      bellvm
009        2298    98       romanvsa
009        1910    88       hostis
008        1725    78       arma
008        1059    58       legio
007        1113    55       castra2
007        862     45       copia
007        615     37       imperator
007        519     36       avctor
006        40004   802      et2
006        1968    70       vrbs
006        786     39       tot
006        536     31       cohors
006        493     29       agmen
006        371     26       comes

Word form dux: co-occurring word forms
Deviation  Corpus  Extract  Word
038        141     141      dux
005        336     8        romanus
005        170     7        auctor
004        2733    19       erat
004        626     8        bello
004        506     8        exercitus
004        482     8        hostium
004        299     7        miles
004        151     5        comes
004        119     4        militiae
004        113     4        deae
004        106     4        cohortibus
004        87      4        campis
004        53      3        diuersis
004        44      3        copiarum
004        39      3        uoluntatis
004        37      3        rati
004        20      3        gregis

4. Morphosyntactic tagging
• using the POS tag
• research on Greek determiners (UMR 6039-Nice, Michèle Biraud)
- sequences α γ ε β and α μ γ ε β attested in the LASLA files
  α: article
  β: noun
  γ: adjective
  ε: adjective/pronoun
  μ: particle
- οἱ ἄλλοι πάντες ἄνθρωποι
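A search for such POS sequences can be sketched in a few lines of Python, assuming the tagged text is available as (form, tag) pairs. The function name `find_tag_sequences` is hypothetical, and the tag assigned to each word of the example is a reading inferred from the sequence α γ ε β, not taken from the LASLA files.

```python
def find_tag_sequences(tagged, pattern):
    """Return the forms of every run of tokens whose tags match pattern."""
    tags = [tag for _, tag in tagged]
    n = len(pattern)
    return [
        [form for form, _ in tagged[i:i + n]]
        for i in range(len(tags) - n + 1)
        if tags[i:i + n] == list(pattern)
    ]


# Assumed tagging of the example phrase (article, adjective,
# adjective/pronoun, noun).
phrase = [("οἱ", "α"), ("ἄλλοι", "γ"), ("πάντες", "ε"), ("ἄνθρωποι", "β")]

print(find_tag_sequences(phrase, ["α", "γ", "ε", "β"]))
# [['οἱ', 'ἄλλοι', 'πάντες', 'ἄνθρωποι']]
```

Because the query runs on tags rather than on word forms, it retrieves every determiner sequence of this shape whatever the lexical items are, which a search on the raw text cannot do.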

4. Morphosyntactic tagging
• using the POS tag
• research on parallel and reminiscent passages between literary works (Koen Van Haegendoren, Liège)

4. Morphosyntactic tagging
• using the POS tag:
- to characterise authors
- and genres

4. Morphosyntactic tagging
• using the POS tag:
- LASLA Latin texts and
- BFM medieval French texts

4. Morphosyntactic tagging
• lemmatization and POS: the case of adjectives used as nouns
• a solution: amicus 1 (noun) vs. amicus 2 (adjective)
• but “sunt christiani”?
- likewise sanctus, beatus, fidelis or impius…
• another solution: sanctus is analyzed as
- 21A00_4 when used as an adjective (2 for adjective, 1 for first class, A for nominative singular, 4 for masculine)
- 21A0014 when used as a noun (the additional 1 indicating this use)
• also for fideles / omnes fideles, credentes / omnes credentes, laudantes, audientes, legentes

4. Morphosyntactic tagging
• full morphosyntactic analysis: Greek declension in Latin

4. Morphosyntactic tagging
• full morphosyntactic analysis: 4th conjugation

4. Morphosyntactic tagging
• full morphosyntactic analysis: the deviant forms
• a solution: a special code for the whole declension (domus)
• another solution: several lemmas
• e.g. a masculine accusative plural saxos, instead of the neuter plural saxa, can be treated as
- a form of a new masculine lemma saxus (not attested in the dictionaries), or
- an anomalous form of the neuter lemma saxum,
in both cases with the same codification (12: 1 for noun, 2 for 2nd decl.)
• but: e.g. facta est tonitrua in aera

4. Morphosyntactic tagging
• full morphosyntactic analysis: the deviant forms
• e.g. facta est tonitrua in aera
- tonitrua is used as a nominative singular of the first declension
- the dictionaries give
  tonitrus, us (4th decl.)
  tonitruum, i (2nd decl.)
  tonitrus, i (m) (2nd decl.)
  tonitru (n) (4th decl.)
  but no tonitrua
- the form is explained as a neuter plural of tonitruum reinterpreted as a feminine singular, but how should it be lemmatized?
• a solution:
- to consider tonitrua as a form of the lemma tonitruum;
- to record the peculiarity of its use only in the tag corresponding to the morphosyntactic analysis

4. Morphosyntactic tagging
• full morphosyntactic analysis: the deviant forms
• e.g.
- regular forms of the lemma dulcis, “sweet”, are tagged with the code of the adjectives of the second class in -is (24)
- the anomalous form dulciam is tagged as a form of the lemma dulcis, with the code of the adjectives of the first class (21)
• and in the Classical Latin corpus:
- caelum (n) and caelus (m)
- inferni (m) and inferna (n)
- cingula (f), cingulus (m) and cingulum (n)

4. Morphosyntactic tagging
• using the full morphosyntactic analysis: narrative indicative tenses

4. Morphosyntactic tagging
• using the full morphosyntactic analysis: repeated sequences (adj.-adj.)

5. Syntactic tagging
• the ‘Treebank’ approaches
- Index Thomisticus
- Latin Treebanks at Perseus
• based on:
- Dependency Grammar (Prague Dependency Treebank)
- the Latin grammar of H. Pinkster
• a training corpus is tagged manually, and other corpora are encoded with automatic taggers
• problems:
- a method that imposes a specific linguistic framework
- a mix of theoretical linguistic frameworks
- data that are not verified

5. Syntactic tagging: the LatSynt project
• an original and innovative research project on word order and Latin sentence structures
• Objectives:
- to develop automatic parsing procedures based on word order rules (in order to offer an alternative to ‘Treebank’ approaches)
- to evaluate the relevance of recent linguistic descriptions
- to offer new tools for textual data analysis (TDA):
  for modeling enunciative structure
  for classifying and segmenting Latin texts
• Methods:
- to develop automatic procedures grounded on
  the morphological information already encoded in the LASLA database
  the linearity of the text
- to refine and improve the computer programmes in successive stages

5. Syntactic tagging: the LatSynt project (first stage)
• Objectives:
- to mark out the boundaries of clauses with a finite verb (introduced by a subordinating word)
- to specify the level of their subordination (their “embedding”)
• from the alphanumeric data of the LASLA database:

5. Syntactic tagging: the LatSynt project (first stage)
Analysis
audisset: BN (cum clause)
cum: 32 (imperfect subjunctive)

5. Syntactic tagging: the LatSynt project (first stage)
• 1st stage:
quem […] uidi: the code LN14 (‘subordination in QVI’ & ‘perfect indicative’) is transferred to both records,
- that of the form quem (lemma QVI) and
- that of the form uidi (lemma VIDEO)
• 2nd stage:
&0014 +LN14 -LN14 +LN12 +GK32 -GK32 -LN12

5. Syntactic tagging: the LatSynt project (first stage)
• 3rd stage:
<&0014>[+LN14 -LN14]{+LN12 [+GK32 -GK32] -LN12}.
• Final stage:
Tacitus, Annals, 13,11,2 / P2849 /
<&0014>[+LN14-LN14]{+LN12[+GK32-GK32]-LN12}
<&secuta (est)> que lenitas in Plautium Lateranum [+quem ob adulterium Messalinae ordine demotum -reddidit] senatui clementiam suam obstringens crebris orationibus {+quas Seneca testificando [+quam honesta -praeciperet] uel iactandi ingenii uoce principis -uulgabat}
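The step from the flat sequence of clause markers to the bracketed formula can be sketched in Python as follows. This is a reconstruction from the two examples on the slides, not the LatSynt programme itself: it assumes that `+CODE` opens a clause, `-CODE` closes it, `&CODE` marks the main verb, that a clause containing no other clause is written with `[ ]`, and that a clause containing one is written with `{ }`. The `( )` pair for deeper embedding and the names `parse_markers`, `height` and `render` are hypothetical.

```python
# Bracket pairs by clause height: "[]" and "{}" follow the slide examples,
# "()" for anything deeper is an assumption made for this sketch.
BRACKETS = ["[]", "{}", "()"]


def parse_markers(tokens):
    """Build a nested structure from +CODE / -CODE / &CODE markers."""
    root = []
    stack = [root]   # lists of children, innermost last
    open_codes = []  # codes of the clauses currently open
    for tok in tokens:
        if tok.startswith("+"):
            node = {"code": tok[1:], "children": []}
            stack[-1].append(node)
            stack.append(node["children"])
            open_codes.append(tok[1:])
        elif tok.startswith("-"):
            if not open_codes or open_codes.pop() != tok[1:]:
                raise ValueError(f"unmatched closing marker {tok!r}")
            stack.pop()
        elif tok.startswith("&"):
            stack[-1].append(tok)  # main-verb marker, kept as a string
        else:
            raise ValueError(f"unexpected marker {tok!r}")
    if open_codes:
        raise ValueError(f"unclosed clauses: {open_codes}")
    return root


def height(node):
    """0 for a clause with no embedded clause, 1 + max child height otherwise."""
    kids = [c for c in node["children"] if isinstance(c, dict)]
    return 0 if not kids else 1 + max(height(k) for k in kids)


def render(items):
    """Write the structure with <> for the main verb and brackets by height."""
    out = []
    for item in items:
        if isinstance(item, str):
            out.append(f"<{item}>")
            continue
        open_b, close_b = BRACKETS[min(height(item), len(BRACKETS) - 1)]
        inner = render(item["children"])
        code = item["code"]
        body = f"+{code} {inner} -{code}" if inner else f"+{code} -{code}"
        out.append(f"{open_b}{body}{close_b}")
    return "".join(out)


markers = "&0014 +LN14 -LN14 +LN12 +GK32 -GK32 -LN12".split()
print(render(parse_markers(markers)))
# <&0014>[+LN14 -LN14]{+LN12 [+GK32 -GK32] -LN12}
```

The stack makes the level of embedding explicit, and a closing marker that does not match the last opened clause is reported as an error rather than silently accepted.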

5. Syntactic tagging: the LatSynt project (first stage)
• analysing left dislocations:
5,22,2 P0909 1 [+BN35-BN35]<&0014>
ii [+cum ad castra -uenissent], nostri eruptione facta multis eorum interfectis, capto etiam nobili duce Lugotorige suos incolumes <&reduxerunt>

5. Syntactic tagging: the LatSynt project (first stage)
• Results: to bring out
- linguistic regularities (prolepsis)
- distances between texts (Caesar, Tacitus)
- the importance of semantic and pragmatic phenomena
• Perspectives:
- to mark out the boundaries of complex syntagms (in order to mark out the boundaries of subordinate clauses without a subordinator)
- to promote interactions with other research on text topology (at the micro- and macro-structural levels):
  repeated segments (Hyperbase-Latin, in collaboration with BCL-Nice/CNRS)
  syntactic and multidimensional « motifs » (in collaboration with BCL-Nice/CNRS)
- to use the results for text segmentation and classification

6. Tagging: what else?
• semantic and pragmatic information:
- semantic functions: Goal, Recipient, Agent, etc.
- pragmatic functions: Rheme, Topic, Focus, etc.
• building databases available for all kinds of research
- without imposing specific linguistic frameworks or analyses
• tokenization, lemmatization and tagging:
- are not trivial processes
- require thorough theoretical thinking
