plWordNet as the Cornerstone of a Toolkit of Lexico-semantic Resources. G4.19 Research Group, Institute of Informatics, Wrocław University of Technology * School of Electrical Engineering and Computer Science, University of Ottawa.
plWordNet as the Cornerstone of a Toolkit of Lexico-semantic Resources G4.19 Research Group, Institute of Informatics, Wrocław University of Technology * School of Electrical Engineering and Computer Science, University of Ottawa Marek Maziarz, Maciej Piasecki, Ewa Rudnicka, Stanisław Szpakowicz* www.plwordnet.pwr.wroc.pl
Wordnet as a Lexical Resource • Princeton WordNet defines a de facto standard • large size and coverage • open access • thousands of applications • Applications: dictionary vs knowledge representation • Range of description • Ideal size and natural development limits
plWordNet model: linguistic resource • Wordnet vs ontology • O: a strict knowledge representation • W: concepts expressed entirely in a natural language • W: synonymy is a matter of degree • O: certainty and a rigorous construction • W: shaped by the lexico-semantic dependencies • Alternative to formalisation • Corpus analysis and substitution tests • Minimal commitment: defining lexico-semantic relations without committing to any particular theory of lexical semantics or human cognition
plWordNet model: corpus-based development • Main source of lexical knowledge: a very large monolingual corpus • tools for corpus browsing • semi-automatic knowledge extraction • Additional sources: dictionaries and encyclopedias • Lexical unit • lemma-sense pair • a linguistically motivated primitive
plWordNet model: synset definition • Synsets • groups of lexical units sharing certain relations {afekt 1 ‘passion’, uczucie 2 ‘feeling’} hypernym {miłość 1 ‘love’, umiłowanie 1 ‘affection’, kochanie 1 ~‘loving’} • Constitutive relations • fairly frequent (to describe many LUs) • shared among LUs (to define groups) • grounded in the linguistic tradition (to facilitate their consistent understanding) • used in other wordnets (to improve compatibility)
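The synset example on this slide can be written down as a small data structure. This is only a minimal sketch: the class names `LexicalUnit` and `Synset` and the helper `hypernym_lemmas` are hypothetical, not part of any plWordNet API, and it assumes the slide's diagram means that {afekt 1, uczucie 2} is the hypernym of {miłość 1, umiłowanie 1, kochanie 1}.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LexicalUnit:
    """A lemma-sense pair, the primitive named on the slides."""
    lemma: str
    sense: int


@dataclass
class Synset:
    """A group of lexical units that share constitutive relations."""
    units: list
    hypernyms: list = field(default_factory=list)  # more general synsets


# Example taken from the slide; the direction of the link is an assumption.
feeling = Synset([LexicalUnit("afekt", 1), LexicalUnit("uczucie", 2)])
love = Synset(
    [
        LexicalUnit("miłość", 1),
        LexicalUnit("umiłowanie", 1),
        LexicalUnit("kochanie", 1),
    ],
    hypernyms=[feeling],
)


def hypernym_lemmas(synset):
    """Collect the lemmas of all direct hypernym synsets."""
    return [lu.lemma for h in synset.hypernyms for lu in h.units]
```

Every unit in `love` inherits the same hypernym link, which is the sense in which a shared relation defines the group.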
plWordNet model: non-relational aspects • Constitutive features • stylistic registers • verb aspect • semantic verb classes • Referred to in the relation definitions • e.g. relations limited to verbs of the same aspect and semantic class • Glosses help wordnet editors • Usage examples: direct links to the corpus
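The constraint mentioned above, a relation limited to verbs of the same aspect and semantic class, amounts to a simple precondition check. The sketch below is illustrative only: `VerbLU`, `may_link` and the class label "motion" are assumptions made for the example, not plWordNet's actual feature inventory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerbLU:
    lemma: str
    sense: int
    aspect: str          # "perfective" or "imperfective"
    semantic_class: str  # hypothetical class label


def may_link(a, b):
    """Allow a relation only between verbs that agree in aspect and class."""
    return a.aspect == b.aspect and a.semantic_class == b.semantic_class


biec = VerbLU("biec", 1, "imperfective", "motion")
isc = VerbLU("iść", 1, "imperfective", "motion")
pobiec = VerbLU("pobiec", 1, "perfective", "motion")
```

Here `may_link(biec, isc)` holds, while `may_link(biec, pobiec)` fails because the aspects differ.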
Relation density Synset relation density in PWN 3.1 and in plWordNet 2.0
Size matters: lexical coverage Coverage of PWN/plWN for lemmas of different frequency in two similar 1.2G-word corpora (Wikipedia)
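Coverage per frequency band can be computed along the following lines. This is a sketch under stated assumptions: the function name `coverage_by_band`, the band boundaries and the toy frequency list are invented for illustration and do not reproduce the figures on the slide.

```python
def coverage_by_band(freqs, wordnet_lemmas, bands):
    """Share of corpus lemmas in each frequency band that the wordnet contains.

    freqs: dict mapping lemma -> corpus frequency
    wordnet_lemmas: set of lemmas described in the wordnet
    bands: list of (low, high) pairs; a lemma falls in a band if low <= f < high
    """
    result = {}
    for low, high in bands:
        in_band = [lemma for lemma, f in freqs.items() if low <= f < high]
        if not in_band:
            result[(low, high)] = None  # no lemma in this band
            continue
        covered = sum(1 for lemma in in_band if lemma in wordnet_lemmas)
        result[(low, high)] = covered / len(in_band)
    return result


# Toy data, invented for the example.
toy_freqs = {"dom": 1500, "kot": 900, "pies": 120, "zamek": 40, "afekt": 5}
toy_wordnet = {"dom", "kot", "afekt"}
toy_bands = [(1, 100), (100, 1000), (1000, float("inf"))]
toy_coverage = coverage_by_band(toy_freqs, toy_wordnet, toy_bands)
```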
Size matters: plWordNet 2.2 www.plwordnet.pwr.wroc.pl
How many words are there? - existing dictionaries ● Woordenboek der Nederlandsche Taal 430k lemmas ● dictionary of the Grimm brothers 330k lemmas ● Oxford English Dictionary 300k lemmas ● ‘Warsaw’ Polish Dictionary 280k lemmas ● contemporary Polish dictionaries 130k lemmas unabridged dictionaries
How many words are there? - approximation ~174k (10+ lemmas) N10+ = 6,67 COBUILD data
How many words are there? K - Krishnamurthy’s data (2002), GT - Good & Toulmin approximation (1956) plWordNet 3.0 200k lemmas
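For orientation, the classical Good & Toulmin estimator predicts how many previously unseen types (here, lemmas) appear when a sample is enlarged by a factor t, with 0 ≤ t ≤ 1: the estimate is the alternating sum of t^k · Φ_k, where Φ_k is the number of lemmas seen exactly k times. The slides do not say how the approximation was applied to the COBUILD data, so the sketch below shows only the textbook formula; the function name `good_toulmin` and the toy counts are assumptions.

```python
from collections import Counter


def good_toulmin(counts, t=1.0):
    """Expected number of new lemmas after enlarging the sample by factor t.

    counts: per-lemma frequencies in the observed sample (each >= 1)
    t: relative size of the additional sample; the series is reliable for t <= 1
    """
    if not 0 <= t <= 1:
        raise ValueError("t must lie in [0, 1]")
    # phi[k] = number of lemmas observed exactly k times
    phi = Counter(c for c in counts if c >= 1)
    return sum((-1) ** (k + 1) * t ** k * n for k, n in phi.items())


# Toy sample: three lemmas seen once, two seen twice, one seen three times.
toy_counts = [1, 1, 1, 2, 2, 3]
```

With the toy counts and t = 1 the sum is 3 - 2 + 1, so doubling the sample is expected to reveal two new lemmas.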
Toolkit of Lexico-semantic Resources • Lexicon of lexico-syntactic structures of multi-word expressions • plWordNet 3.0 (Słowosieć 3.0) • plWordNet 3.0 to WordNet 3.1 mapping • Semantic lexicon of proper names • Mapping to an ontology • A valency lexicon linked to plWordNet
Lexicon of multi-word expressions • Non-trivial morphology of Polish MWEs • more than 100 nominal structural patterns • Description of the lexico-syntactic structures of MWEs • Multi-word LUs as semantic atoms • no internal semantic relations • Dynamic lexicon • a tool for automatic MWE extraction • 60 000 MWEs described in the lexicon and plWordNet
Lexicon of Proper Names • PNs are not a part of the lexicon • a PN is an instance of a type • characterised by referents • not by their semantic properties • Linking PNs via a wordnet • some lexico-syntactic contexts signal the instance-of relation • PNs are represented in wordnets • PNs as derivational bases for common nouns • Dynamic lexicon with 2.5 million PNs verified manually
plWordNet to WordNet 3.1 mapping • plWordNet: built independently to obtain a faithful description • Manual mapping • bottom-up order • comparison of the relation structures • a cascading list of interlingual relations • plWordNet verification as an important side effect • Present state: 72 000 N and Adj synsets mapped • Target: complete plWordNet 3.0 mapped
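A cascading list of relations can be read as "try the strongest relation first and fall back to weaker ones only when it fails". The mapping described on the slide is manual, so the sketch below illustrates only that decision order. Synsets are modelled as toy feature sets, and the relation labels, the predicates and the function `map_synset` are hypothetical.

```python
def map_synset(pl_synset, en_candidates, cascade):
    """Return (relation, target) for the first relation in the cascade that holds.

    cascade: ordered list of (relation_name, predicate) pairs, strongest first
    """
    for relation, holds in cascade:
        matches = [en for en in en_candidates if holds(pl_synset, en)]
        if matches:
            # Prefer the most specific matching target.
            return relation, max(matches, key=len)
    return None  # left unmapped for the editor to resolve


# Toy model: a synset is a frozenset of semantic features (an assumption).
toy_cascade = [
    ("interlingual synonymy", lambda pl, en: pl == en),
    ("interlingual hyponymy", lambda pl, en: pl > en),  # pl is more specific
]

animal = frozenset({"animal"})
dog = frozenset({"animal", "canine", "domestic"})
lapdog = frozenset({"animal", "canine", "domestic", "small"})
```

With these toy sets, `dog` maps by synonymy, while `lapdog` has no exact counterpart and falls through to hyponymy of `dog`.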
Mapping to an ontology • Ontology: unambiguous concepts defined formally • Lexical meanings • imprecisely delimited • constrained by usage, stylistic register and sentiment • Mapping to an ontology • precise, formal description for meanings • association: concepts – their lexical embodiment • SUMO selected • Princeton WordNet mapping • Semi-automated mapping of plWordNet
Expectations (diagram; components: valence lexicon, MWE lexicon, WordNet 3.1 + extension, plWordNet 3.0, proper names, ontology: SUMO + intermediate level; link label: describes)
Applications • Strong universal basis • a comprehensive wordnet: >200 000 lemmas resulting in ~285 000 LUs and ~210 000 synsets • one of the largest ever Polish dictionaries • Modularly constructed toolkit • a layered architecture of large software systems • separate but linked layers • each layer exchangeable and based on a limited set of notions and principles • The core of the CLARIN-PL language technology infrastructure
Thank you! www.plwordnet.pwr.wroc.pl