
plWordNet as the Cornerstone of a Toolkit of Lexico-semantic Resources

plWordNet as the Cornerstone of a Toolkit of Lexico-semantic Resources. G4.19 Research Group, Institute of Informatics, Wrocław University of Technology; * School of Electrical Engineering and Computer Science, University of Ottawa.


Presentation Transcript


  1. plWordNet as the Cornerstone of a Toolkit of Lexico-semantic Resources. G4.19 Research Group, Institute of Informatics, Wrocław University of Technology; * School of Electrical Engineering and Computer Science, University of Ottawa. Marek Maziarz, Maciej Piasecki, Ewa Rudnicka, Stanisław Szpakowicz*. www.plwordnet.pwr.wroc.pl

  2. Wordnet as a Lexical Resource • Princeton WordNet defines the de facto standard • large size and coverage • open access • thousands of applications • Applications: dictionary vs knowledge representation • Range of description • Ideal size and natural development limits

  3. plWordNet model: linguistic resource • Wordnet vs ontology • O: a strict knowledge representation • W: concepts expressed entirely in a natural language • W: synonymy is a matter of degree • O: certainty and a rigorous construction • W: shaped by the lexico-semantic dependencies • Alternative to formalisation • Corpus analysis and substitution tests • Minimal commitment: defining lexico-semantic relations without committing to any particular theory of lexical semantics or human cognition

  4. plWordNet model: corpus-based development • Main source of lexical knowledge: a very large monolingual corpus • tools for corpus browsing • semi-automatic knowledge extraction • Additional sources: dictionaries and encyclopedias • Lexical unit • lemma-sense pair • a linguistically motivated primitive

  5. plWordNet model: synset definition • Synsets • groups of lexical units sharing certain relations: {afekt 1 ‘passion’, uczucie 2 ‘feeling’} hypernym {miłość 1 ‘love’, umiłowanie 1 ‘affection’, kochanie 1 ~‘loving’} • Constitutive relations • fairly frequent (to describe many LUs) • shared among LUs (to define groups) • grounded in the linguistic tradition (to facilitate their consistent understanding) • used in other wordnets (to improve compatibility)
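
The synset definition on this slide can be illustrated with a small sketch. It assumes, as a simplification, that lexical units fall into one synset exactly when they share the same set of constitutive relation instances. The names `LexicalUnit` and `group_into_synsets` and the target labels `FEELING` and `LOVE` are hypothetical and are not part of plWordNet or its tools.

```python
from collections import defaultdict
from typing import NamedTuple


class LexicalUnit(NamedTuple):
    """A lemma-sense pair, e.g. ('miłość', 1)."""
    lemma: str
    sense: int


def group_into_synsets(relations):
    """Group lexical units that share exactly the same relation instances.

    `relations` maps each LexicalUnit to a set of (relation_name, target_label)
    pairs. The target labels are hypothetical stand-ins for target synsets.
    """
    groups = defaultdict(set)
    for unit, instances in relations.items():
        groups[frozenset(instances)].add(unit)
    return list(groups.values())


# Toy data built from the slide's example; the labels are assumptions.
relations = {
    LexicalUnit("afekt", 1): {("hyponym", "LOVE")},
    LexicalUnit("uczucie", 2): {("hyponym", "LOVE")},
    LexicalUnit("miłość", 1): {("hypernym", "FEELING")},
    LexicalUnit("umiłowanie", 1): {("hypernym", "FEELING")},
    LexicalUnit("kochanie", 1): {("hypernym", "FEELING")},
}

synsets = group_into_synsets(relations)
for synset in synsets:
    print(sorted(f"{unit.lemma} {unit.sense}" for unit in synset))
```

In this toy setting the five lexical units fall into two groups, matching the two synsets shown on the slide. Real wordnet editing relies on the editors' judgement and substitution tests, so strict equality of relation sets is only an approximation of the idea.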

  6. plWordNet model: non-relational aspects • Constitutive features • stylistic registers • verb aspect • semantic verb classes • Referred to in the relation definitions • e.g. relations limited to verbs of the same aspect and semantic class • Glosses help wordnet editors • Usage examples: direct links to the corpus

  7. Relation density: synset relation density in PWN 3.1 and in plWordNet 2.0

  8. Size matters: lexical coverage. Coverage of PWN/plWN for lemmas of different frequency in two similar 1.2G-word corpora (Wikipedia)
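
A coverage measurement of this kind can be sketched as follows. This is a minimal illustration, assuming coverage means the share of corpus lemmas in a frequency band that are present in the wordnet. The function name `coverage_by_band`, the band boundaries, and all counts are invented for the example and do not reproduce the figures from the slide.

```python
def coverage_by_band(corpus_freq, wordnet_lemmas, bands):
    """Return the fraction of corpus lemmas found in the wordnet, per band.

    `corpus_freq` maps lemma -> corpus frequency, `wordnet_lemmas` is a set of
    lemmas described in the wordnet, and `bands` is a list of (low, high)
    pairs read as low <= frequency < high. Empty bands yield None.
    """
    result = {}
    for low, high in bands:
        in_band = [lemma for lemma, freq in corpus_freq.items() if low <= freq < high]
        if not in_band:
            result[(low, high)] = None
            continue
        covered = sum(1 for lemma in in_band if lemma in wordnet_lemmas)
        result[(low, high)] = covered / len(in_band)
    return result


# Invented toy data, for illustration only.
corpus_freq = {"dom": 950, "kot": 120, "pies": 80, "xyzzy": 12, "foo": 3, "bar": 2}
wordnet_lemmas = {"dom", "kot", "pies", "foo"}
bands = [(1, 10), (10, 100), (100, 1000)]

coverage = coverage_by_band(corpus_freq, wordnet_lemmas, bands)
for band, share in coverage.items():
    print(band, share)
```

Reporting coverage per band rather than as a single number shows whether gaps are concentrated among rare lemmas, which is the comparison the slide's chart is about.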

  9. Size matters: plWordNet 2.2 www.plwordnet.pwr.wroc.pl

  10. plWordNet: ongoing work

  11. Size matters: comparison of wordnets

  12. How many words are there? Existing dictionaries ● Woordenboek der Nederlandsche Taal: 430k lemmas ● dictionary of the Grimm brothers: 330k lemmas ● Oxford English Dictionary: 300k lemmas ● ‘Warsaw’ Polish Dictionary: 280k lemmas ● contemporary Polish dictionaries: 130k lemmas ● label: unabridged dictionaries

  13. How many words are there? Approximation: ~174k (10+ lemmas); N10+ = 6,67; COBUILD data

  14. How many words are there? K: Krishnamurthy’s data (2002); GT: Good & Toulmin approximation (1956); plWordNet 3.0: 200k lemmas
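
The Good & Toulmin approximation mentioned on this slide estimates how many previously unseen types (here, lemmas) would appear if the corpus were extended by a fraction t of its current size: the sum over k of (-1)^(k+1) · t^k · Φ_k, where Φ_k is the number of types observed exactly k times. The sketch below implements that textbook formula only; the function name `good_toulmin` and the toy counts are assumptions, and nothing here reproduces how the slide's own figures were obtained.

```python
from collections import Counter


def good_toulmin(type_counts, t):
    """Estimate the number of new types in an extra sample of relative size t.

    `type_counts` is an iterable with one observed frequency per type.
    The alternating series is generally considered reliable only for t <= 1.
    """
    # spectrum[k] = number of types observed exactly k times
    spectrum = Counter(type_counts)
    return sum((-1) ** (k + 1) * t ** k * n_k for k, n_k in spectrum.items())


# Toy frequencies: three types seen once, two seen twice, one seen three times.
toy_counts = [1, 1, 1, 2, 2, 3]

print(good_toulmin(toy_counts, 1))    # corpus doubled
print(good_toulmin(toy_counts, 0.5))  # corpus extended by half
```

For the toy data, doubling the corpus gives 3 - 2 + 1 = 2 expected new types. Types seen once dominate the estimate, which is why the frequency spectrum of rare lemmas matters for questions like the one this slide asks.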

  15. Toolkit of Lexico-semantic Resources • Lexicon of lexico-syntactic structures of multi-word expressions • plWordNet 3.0 (Słowosieć 3.0) • plWordNet 3.0 to WordNet 3.1 mapping • Semantic lexicon of proper names • Mapping to an ontology • A valency lexicon linked to plWordNet

  16. Lexicon of multi-word expressions • Non-trivial morphology of Polish MWEs • more than 100 nominal structural patterns • Description of the lexico-syntactic structures of MWEs • Multi-word LUs as semantic atoms • no internal semantic relations • Dynamic lexicon • a tool for automatic MWE extraction • 60 000 MWEs described in the lexicon and plWordNet

  17. Lexicon of Proper Names • PNs are not a part of the lexicon • a PN is an instance of a type • characterised by referents • not by their semantic properties • Linking PNs via a wordnet • some lexico-syntactic contexts signal the instance-of relation • PNs are represented in wordnets • PNs as derivational bases for common nouns • Dynamic lexicon with 2.5 million PNs verified manually

  18. plWordNet to WordNet 3.1 mapping • plWordNet: built independently to obtain a faithful description • Manual mapping • bottom-up order • comparison of the relation structures • a cascading list of interlingual relations • plWordNet verification as an important side effect • Present state: 72 000 noun and adjective synsets mapped • Target: complete plWordNet 3.0 mapped

  19. Wordnet editor: WordnetLoom

  20. WordnetLoom: editing the mapping

  21. Mapping to ontology • Ontology: unambiguous concepts defined formally • Lexical meanings • imprecisely delimited • constrained by usage, stylistic register and sentiment • Mapping to ontology • precise, formal description for meanings • association: concepts – their lexical embodiment • SUMO selected • Princeton WordNet mapping • Semi-automated mapping of plWordNet

  22. Expectations (diagram of linked resources) • valence lexicon • MWE lexicon • WordNet 3.1 + extension • plWordNet 3.0 • proper names • ontology: SUMO + intermediate level • link label: describes

  23. Applications • Strong universal basis • a comprehensive wordnet: >200 000 lemmas, resulting in ~285 000 LUs and ~210 000 synsets • one of the largest Polish dictionaries ever • Modularly constructed toolkit • a layered architecture, as in large software systems • separate but linked layers • each layer based on a limited set of notions and principles, and exchangeable • The core of the CLARIN-PL language technology infrastructure

  24. Thank you! www.plwordnet.pwr.wroc.pl
