
Issues in Building and Exploiting Latin Language Resources

Issues in Building and Exploiting Latin Language Resources. Marco Passarotti Università Cattolica del Sacro Cuore, Milan (Italy). Outlook. Specific issues of ancient languages and texts Language Resources for Latin: Annotated corpora NLP The Index Thomisticus Treebank IT-VaLex


Presentation Transcript


  1. Issues in Building and Exploiting Latin Language Resources. Marco Passarotti, Università Cattolica del Sacro Cuore, Milan (Italy)

  2. Outlook
     • Specific issues of ancient languages and texts
     • Language Resources for Latin:
         • Annotated corpora
         • NLP
         • The Index Thomisticus Treebank
         • IT-VaLex
     • Exploiting Latin Language Resources:
         • Latin Word Order
         • From Syntax to Semantics: Textual Clustering (1)
         • From Lexicon to Semantics: Textual Clustering (2)
     • Is Latin a less-resourced language? What’s still missing?

  3. Specific Issues of Ancient Languages and Texts
     • No native speakers
     • Several interpretations of the same text
     • Several versions of the same text
     • No born-digital texts: relevance of the original source
     • Dead language = “closed” corpus/lexicon
     • Diachrony & dialects: not just one Ancient Greek or Latin
     • Representativeness: our data are just the tip of the iceberg
     • Not the WSJ, but mostly literary and philosophical texts
     • Not (just) for NLP purposes, but for the text itself
     • Computational linguists are not used to dealing with ancient languages (LREC ’12: one paper on Latin!). BUT things are changing: see TLT, ACRH etc.
     • Pencil-and-paper scholars (Classicists) are not used to dealing with digital LRs, NLP tools and modern linguistic theories: it is difficult to find students with the required expertise. BUT things are changing…

  4. Language Resources for Latin

  5. Annotated Corpora: Treebanks
     • Collaboration between CIRCSE and Perseus:
         • Latin Dependency Treebank: Classical Latin (approx. 55,000 annotated tokens)
         • Index Thomisticus Treebank: Thomas Aquinas opera omnia (approx. 180,000 annotated tokens)
     • PROIEL: University of Oslo
         • Several translations of the New Testament: Latin, Greek, Old Church Slavonic, Armenian, Gothic (approx. 120,000 annotated tokens)
     • All dependency-based (via PDT): common guidelines

  6. NLP Tools (1)
     • Morphological analysers: Words (Whitaker), Morpheus (Perseus), LEMLAT (ILC-CNR)
     • Data-driven NLP (best rates)
         Source of data: Index Thomisticus Treebank
         Training set: 61,024 tokens (2,820 sentences)
         Test set: 7,379 tokens (329 sentences)
     • PoS Tagging (HMM-based HunPos tagger):
         • 96.75: coarse-grained PoS + fine-grained PoS
         • 89.90: with morphological features
     • Syntactic Parsing (DeSR):
         • 80.02 (LAS); 85.23 (UAS); 87.79 (LA)
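
The three parsing figures on this slide are the usual dependency-parsing scores: LAS counts tokens whose head and label are both correct, UAS counts tokens whose head is correct, and LA counts tokens whose label is correct. A minimal Python sketch of how such scores are computed follows; the toy trees and the function name `attachment_scores` are illustrative assumptions, and the slides do not say whether punctuation was scored.

```python
def attachment_scores(gold, pred):
    """Compute LAS, UAS and LA (in percent) for aligned token lists.

    gold, pred: lists of (head_index, dependency_label), one entry per token.
    """
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")
    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred))  # head correct
    la = sum(g[1] == p[1] for g, p in zip(gold, pred))   # label correct
    las = sum(g == p for g, p in zip(gold, pred))        # head and label correct
    return {"LAS": 100 * las / n, "UAS": 100 * uas / n, "LA": 100 * la / n}


# Toy four-token sentence (invented, not treebank data); head 0 is the root.
gold = [(2, "Sb"), (0, "Pred"), (2, "Obj"), (3, "Atr")]
pred = [(2, "Sb"), (0, "Pred"), (2, "Adv"), (2, "Atr")]
scores = attachment_scores(gold, pred)
print(scores)
```

In the toy example the third token has the right head but the wrong label and the fourth has the right label but the wrong head, so UAS and LA are both higher than LAS, the same ordering as in the figures reported above.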

  7. NLP Tools (2): 13 Centuries…
     • IT-Train: 44,195; IT-Test: 5,697
     • LDT-Train: 47,662; LDT-Test: 5,481
     • Parser: DeSR

  8. The Index Thomisticus Treebank http://itreebank.marginalia.it

  9. The Corpus
     • Index Thomisticus (Busa):
         • opera omnia of Thomas Aquinas
         • 119 works + 61 of other authors
         • approx. 11 million words
         • morphologically tagged & lemmatized
     • Index Thomisticus Treebank:
         • Dependency-based = LDT & PROIEL
         • approx. 180,000 words (10,000 sentences)
         • from:
             • Scriptum super Sententiis Magistri Petri Lombardi
             • Summa contra Gentiles
             • Summa Theologiae

  10. From FGD to Annotation Layers
     Language is “a system of means of expression with some definite aim” (Theses of the Prague Linguistic Circle, 1929)
     • L0 (w) Words (tokens): automatic segmentation only
     • L1 (m) Morphology: tags (full morphology, 11 categories) + lemma
     • L2 (a) [FORM] Analytical Layer (surface syntax): dependency-based
         Analytical dependency functions: Pred, Sb, Obj, Adv, Atr, Pnom…
     • L3 (t) [MEANING] Tectogrammatical Layer (underlying syntax): dependency-based
         • Autosemantic words only (no function words or punctuation)
         • Functors (valency): Arguments vs. Adjuncts
             • Arguments: ACT, PAT, EFF, ADDR, ORIG
             • Adjuncts (~ 50), semantically defined: LOC, TWHEN, MANN, COND,...
         • Ellipsis resolution & coreference (grammatical only: relative clauses, control-modals, pronouns)
         • Topic/focus articulation (deep word order)

  11. In eodem enim instanti terminatur alteratio ad dispositionem quae est necessitas, et generatio ad formam;

  12. Dynamic Valency Lexicon IT-VaLex http://itreebank.marginalia.it/itvalex/

  13. Valency: number of obligatory complementations of a word
     • ‘arguments’ vs. ‘adjuncts’
     • actants vs. circonstants
     • ‘inner participants’ vs. ‘free modifications’
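
To make the argument/adjunct distinction concrete, here is a minimal Python sketch that collects, for each governing lemma, the set of argument functors attached to it. It is a simplification under stated assumptions: only the five argument functors listed earlier (ACT, PAT, EFF, ADDR, ORIG) are counted, the node format (`id`, `lemma`, `head`, `functor`) is hypothetical, and the toy sentence is invented. It does not reproduce how IT-VaLex is actually built.

```python
from collections import Counter, defaultdict

# Argument functors as listed on the annotation-layers slide.
ARGUMENT_FUNCTORS = {"ACT", "PAT", "EFF", "ADDR", "ORIG"}


def valency_frames(nodes):
    """Count argument frames per governing lemma.

    nodes: list of dicts with keys "id", "lemma", "head", "functor"
    (a hypothetical, simplified stand-in for tectogrammatical nodes).
    """
    by_id = {n["id"]: n for n in nodes}
    args_of = defaultdict(list)
    for n in nodes:
        # Adjuncts such as LOC or TWHEN are skipped: they do not enter the frame.
        if n["functor"] in ARGUMENT_FUNCTORS and n["head"] in by_id:
            args_of[n["head"]].append(n["functor"])
    frames = defaultdict(Counter)
    for head_id, functors in args_of.items():
        frames[by_id[head_id]["lemma"]][tuple(sorted(functors))] += 1
    return frames


# Invented toy clause: a teacher gives a book to a pupil at school.
toy = [
    {"id": 1, "lemma": "do", "head": 0, "functor": "PRED"},
    {"id": 2, "lemma": "magister", "head": 1, "functor": "ACT"},
    {"id": 3, "lemma": "liber", "head": 1, "functor": "PAT"},
    {"id": 4, "lemma": "discipulus", "head": 1, "functor": "ADDR"},
    {"id": 5, "lemma": "schola", "head": 1, "functor": "LOC"},
]
frames = valency_frames(toy)
print(dict(frames["do"]))
```

The LOC node is ignored, so the frame recorded for the toy verb contains three arguments only.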

  14. By Complex Query

  15. One Output

  16. Exploiting Latin Language Resources

  17. Latin Word-order

  18. Latin Word-order

  19. From Syntax to Semantics. Textual Clustering (1). R Environment for Statistical Computing. Package: cluster (function DIANA)

  20. Clustering • deals with finding a structure in a collection of (un)labeled data • the process of organizing objects into groups (clusters) whose members are similar in some way • a cluster is a collection of objects which are “similar” to each other and are “dissimilar” to the objects belonging to other clusters
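
The slides use the DIANA (divisive analysis) function from R. As a rough illustration of the idea, the Python sketch below performs a single divisive split: the object that is on average most dissimilar to the others starts a splinter group, and further objects join it while they are closer, on average, to the splinter group than to the remainder. This is only one split step written from the textbook description, not the R implementation; the function name `diana_split` and the one-dimensional toy data are assumptions.

```python
def diana_split(items, dist):
    """Split one cluster into (remainder, splinter) with a DIANA-style step.

    items: list of distinct hashable object ids; dist(i, j): dissimilarity.
    """
    if len(items) < 2:
        raise ValueError("need at least two objects to split")

    def avg(i, group):
        # Average dissimilarity of i to the members of group other than i.
        others = [j for j in group if j != i]
        return sum(dist(i, j) for j in others) / len(others) if others else 0.0

    rest = list(items)
    seed = max(rest, key=lambda i: avg(i, rest))  # most "eccentric" object
    rest.remove(seed)
    splinter = [seed]
    while len(rest) > 1:
        gains = {i: avg(i, rest) - avg(i, splinter) for i in rest}
        best = max(rest, key=gains.get)
        if gains[best] <= 0:  # nobody is closer to the splinter group
            break
        rest.remove(best)
        splinter.append(best)
    return rest, splinter


# Toy data: five points on a line, two obvious groups.
values = [0.0, 1.0, 2.0, 10.0, 11.0]
rest, splinter = diana_split(
    list(range(len(values))), lambda i, j: abs(values[i] - values[j])
)
print(rest, splinter)
```

A full divisive analysis would repeat this step, each time splitting the cluster with the largest diameter, until every object is in its own cluster.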

  21. Textual Clustering for WSD
     • Distributional Hypothesis (Harris, 1954): words that are used in similar contexts tend to have the same or related meanings
     • Firth (1957): “You shall know a word by the company it keeps”
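
One common way to operationalise the distributional hypothesis is to represent each occurrence of a word by the words around it and to compare occurrences with cosine similarity. The Python sketch below assumes a simple window of neighbouring lemmas as the context; the slides build their features from the treebank, so the window, the function names and the toy lemma sequences here are illustrative assumptions only.

```python
from collections import Counter
from math import sqrt


def context_vectors(sentences, target, window=2):
    """Return one bag-of-context-words Counter per occurrence of target."""
    vectors = []
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok == target:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                vectors.append(Counter(left + right))
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Invented toy lemma sequences, not treebank data.
toy = [
    ["forma", "et", "materia"],
    ["materia", "et", "forma"],
    ["forma", "verbum", "sacramentum"],
]
vecs = context_vectors(toy, "forma")
sim_same = cosine(vecs[0], vecs[1])   # identical context words
sim_other = cosine(vecs[0], vecs[2])  # no shared context words
print(sim_same, sim_other)
```

Occurrences with similar context vectors can then be grouped by a clustering algorithm, and each group read as a candidate sense.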

  22. Clustering forma in the IT-TB

  23. Lemma forma
     • 18,357 occurrences in the IT
     • 5,191 occurrences of forma in the IT-TB
     • a ‘technical’ word in Thomas, showing high polysemy
     • 4 main meanings in the lexicon of Thomas by Deferrari & Barry (1948-1949):
         • “form, shape”, synonym of figura
         • “form”, the actualizing principle that makes a thing to be what it is
         • “mode, manner”
         • “formula”

  24. The Distribution of forma -GsB-

  25. The Distribution of forma -GsB- (tag: 6)

  26. From Lexicon to Semantics. Textual Clustering (2). R Environment for Statistical Computing. Packages: tm, RTextTools, Deducer(Text), lsa. …you shall know a text by the words it keeps

  27. dist = euclidean; hclust = ward. [Dendrogram; labels: Seneca: Dialogues, Seneca: Tragedies, Jerome, Thomas]
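
The settings named on this slide (Euclidean distance, Ward linkage) correspond to agglomerative clustering that, at each step, merges the two clusters whose union increases the within-cluster sum of squares the least. The plain-Python sketch below uses the standard Ward merge cost, |A||B| / (|A| + |B|) times the squared Euclidean distance between centroids. It is a naive illustration on invented two-dimensional vectors, not a reproduction of R's hclust or of the text vectors used in the talk.

```python
def ward_merges(vectors):
    """Agglomerate with Ward's criterion; return [(members_a, members_b, cost), ...]."""
    # cluster id -> (member indices, centroid)
    clusters = {i: ([i], list(v)) for i, v in enumerate(vectors)}
    history = []
    next_id = len(vectors)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                (ma, ca), (mb, cb) = clusters[ids[x]], clusters[ids[y]]
                sq = sum((p - q) ** 2 for p, q in zip(ca, cb))
                # Increase in within-cluster sum of squares if A and B merge.
                cost = len(ma) * len(mb) / (len(ma) + len(mb)) * sq
                if best is None or cost < best[0]:
                    best = (cost, ids[x], ids[y])
        cost, a, b = best
        (ma, ca), (mb, cb) = clusters.pop(a), clusters.pop(b)
        n = len(ma) + len(mb)
        centroid = [(len(ma) * p + len(mb) * q) / n for p, q in zip(ca, cb)]
        clusters[next_id] = (ma + mb, centroid)
        history.append((ma, mb, cost))
        next_id += 1
    return history


# Invented toy "texts" as two-dimensional feature vectors.
toy_vectors = [[0, 0], [0, 1], [5, 5], [5, 6]]
history = ward_merges(toy_vectors)
for members_a, members_b, cost in history:
    print(members_a, members_b, cost)
```

Reading the merge history from first to last gives the dendrogram from the leaves upward; the cost of the final merge is much larger than the earlier ones, which is what a long top branch in the plot would show.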

  28. Jerome – Vulgata: LSA

  29. Is Latin a less-resourced language? What’s still missing?

  30. A BLaRK-like Set for Latin
     • Modules and Tools
         • Text pre-processing: named-entity recognition
         • Lemmatization and morphological disambiguation: PoS taggers (diachrony)
         • Syntactic analysis: parsers and shallow parsing (diachrony)
         • Anaphora and ellipsis resolution
         • Semantic and pragmatic annotation: coreference, semantic roles, TFA
     • Applications
         • Entering and acquiring information: digitization & OCR systems (images of original sources)
         • Against sparsity: common on-line infrastructure for ancient languages LRs
         • e-learning facilities for teaching ancient languages with LRs and NLP tools
     • Data
         • Texts:
             • more treebanked data from more eras
             • TGTS-like annotated texts
             • aligned translation(s)
         • Lexica (mono-/multilingual):
             • semantic-based valency lexicon: semantic roles + semantic features of the arguments
             • word-formation-based lexicon

  31. …heigh-ho, heigh-ho, it's off to work we go!
