Towards a Truly Bilingual WordNet - plWordNet 3.0

Learn about plWordNet 3.0, a unique and comprehensive lexical database that focuses on the relations between lexical units in Polish. Explore its integration with Princeton WordNet and its mapping procedure for inter-lingual relations.

Presentation Transcript


  1. Towards a Truly Bilingual WordNet – plWordNet 3.0
     Ewa Rudnicka, Marek Maziarz, Maciej Piasecki
     G4.19 Research Group, Institute of Informatics, Wrocław University of Technology
     nlp.pwr.wroc.pl, plwordnet.pwr.wroc.pl

  2. What is a wordnet?

  3. WordNet – a lexico-semantic database
     Princeton WordNet (Fellbaum 1998):
     - a huge electronic lexical database: a kind of thesaurus, yet with a much more advanced structure
     - words grouped into synonym sets called synsets
     - synsets linked via different lexico-semantic relations (such as synonymy, near-synonymy, hypernymy/hyponymy, meronymy/holonymy, antonymy, fuzzynymy)
     - the integration of lexical data gathered from existing resources such as traditional and electronic dictionaries, as well as from corpora
     - psycholinguistic principles: the structure of human lexical memory (cf. Miller 1998)
     - taxonomic hierarchies for nouns, entailment relations for verbs

  4. Multi-lingual wordnets
     Multi-lingual databases consisting of inter-linked 'national'/mono-lingual wordnets:
     - EuroWordNet – transfer method (translation from Princeton WordNet): Dutch, Spanish, Italian, French, German, Czech and Estonian (cf. Vossen 2002)
     - MultiWordNet – semi-automatic acquisition method from the Princeton WordNet: Italian, Spanish, Portuguese, Romanian and Latin (Bentivogli et al.)
     - IndoWordNet (Sinha et al. 2006, Bhattacharyya 2010) – expansion approach from the Hindi wordnet; 16 out of 22 languages of India

  5. plWordNet (Słowosieć)
     - plWordNet – developed fairly independently of Princeton WordNet by applying a unique corpus-based method
     - one of the biggest existing wordnets
     - the emphasis is on relations between lexical units, not between synsets
     - many more relations, some of them specially designed to cover the peculiarities of the morphosyntactic structure of Polish (cf. Piasecki et al. 2009, Maziarz et al. 2012)

  6. plWordNet vs. Princeton WordNet
     Basic common concepts:
     - Lemma – base form representing different inflectional forms and different meanings
     - Lexical unit – a lemma plus sense pair (in wordnets marked with a number)
     - Synset – a set of synonymous lexical units
     Differences:
     - plWN – synsets built of lexical units sharing the same constitutive relations (such as hyponymy, hypernymy, meronymy, holonymy)
     - PWN – a synset represents a 'lexicalised concept' (cf. Miller 1998); synsets built of lexical units linked by the synonymy relation, understood as a conceptual relation established on the basis of linguists' intuitions and dictionary definitions
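The three common concepts can be pictured as a small data model. This is a minimal sketch only: the class and field names (`LexicalUnit`, `Synset`, `lemma`, `sense`, `units`) are illustrative assumptions and do not reflect the actual plWordNet or Princeton WordNet storage format. The example synset is the one used later in the mapping procedure slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexicalUnit:
    """A lemma paired with a sense number, written e.g. 'rejon 3'."""
    lemma: str
    sense: int

    def __str__(self) -> str:
        return f"{self.lemma} {self.sense}"


@dataclass(frozen=True)
class Synset:
    """A set of synonymous lexical units (hypothetical representation)."""
    units: frozenset

    def lemmas(self) -> set:
        # Lemmas only, without sense numbers.
        return {unit.lemma for unit in self.units}


# The source synset {zagranica 1, obczyzna 1, obce terytorium 1}.
zagranica = Synset(frozenset({
    LexicalUnit("zagranica", 1),
    LexicalUnit("obczyzna", 1),
    LexicalUnit("obce terytorium", 1),
}))

print(sorted(str(unit) for unit in zagranica.units))
```

Making both classes frozen keeps them hashable, so synsets can later serve as nodes or dictionary keys in a relation graph.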

  7. Mapping plWordNet on Princeton WordNet
     - Linking plWordNet synsets with Princeton WordNet synsets
     - Defining a set of inter-lingual relations
     - Setting a hierarchy of inter-lingual relations
     - Designing a mapping procedure
     - Mapping direction: plWordNet > Princeton WordNet
     - Domains selected for mapping: person, artefact, location, family relationships, food, time, vocabulary connected with thinking and communication
     - A novel perspective – linking two independent systems
     - The main challenge – different philosophical, theoretical and methodological assumptions

  8. Inter-lingual relations hierarchy
     A set of inter-lingual relations inspired by:
     - inter-lingual relations from EuroWordNet (Vossen 2002)
     - intra-lingual relations from plWordNet (Maziarz et al. 2011)
     1. Synonymy
     2. Partial synonymy
     3. Inter-register synonymy
     4. Hyponymy
     5. Hypernymy
     6. Meronymy
     7. Holonymy

  9. Inter-lingual relations (1)
     - Synonymy (only one per synset) - for large correspondence in sense and position in the source wordnet structure, combined with many indirect inter-lingual links between the source and target synsets
     - Inter-register synonymy - for I-synonyms as defined above, but differing in stylistic register
     - Partial synonymy - in the case of partial correspondence of meanings and/or structures

  10. Partial synonymy

  11. Inter-lingual relations (2)
      - Inter-lingual hyponymy - defined in terms of inclusion of set denotation: a hyponym refers to an object which is included in the denotation set of a hypernym
      - Inter-lingual hypernymy - defined in terms of inclusion of set denotation: a hypernym refers to an object that includes hyponyms in its denotation set
      - Inter-lingual meronymy - for parts, elements or materials of bigger wholes
      - Inter-lingual holonymy - for a whole made of smaller parts, elements or materials
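The set-inclusion reading of hyponymy and hypernymy can be illustrated with ordinary Python sets. This is a toy sketch: the denotation sets below are invented placeholders, the function names are hypothetical, and the use of a proper subset (so that identical denotations do not count as hyponymy) is an assumption rather than something the slides state.

```python
def is_hyponym_of(source_denotation: set, target_denotation: set) -> bool:
    """True when the source denotation is properly included in the target's."""
    return source_denotation < target_denotation


def is_hypernym_of(source_denotation: set, target_denotation: set) -> bool:
    """True when the source denotation properly includes the target's."""
    return source_denotation > target_denotation


# Invented toy denotations: abstract object labels, not real wordnet data.
narrower_concept = {"obj1", "obj2"}
broader_concept = {"obj1", "obj2", "obj3"}

print(is_hyponym_of(narrower_concept, broader_concept))
print(is_hypernym_of(broader_concept, narrower_concept))
print(is_hyponym_of(broader_concept, broader_concept))
```

In practice denotation sets are not enumerable, so the check is made by a lexicographer; the sketch only shows which direction of inclusion each relation names.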

  12. Mapping procedure (1)
      Recognizing the sense of a source synset:
      - checking its position in the network structure (all existing relations, with an emphasis on hypernym(s) and hyponyms; definitions, commentaries; comparing other synsets containing the given lemma)
      Example: {zagranica 1, obczyzna 1, obce terytorium 1}:
      - is a hyponym of {obszar 1, terytorium 1, obręb 1, strefa 1, zona 1, rejon 3}; commentary: 'ograniczona część przestrzeni, zwykle dużych rozmiarów, określona powierzchnia czegoś (np. obszar państwa)' – 'a limited part of an area, usually of big size, a set surface of sth (e.g. state territory)'
      - is a meronym of {świat 3, nieznane 1} – 'world, unknown territory'
      - is a fuzzynym of {granica państwa 1} – 'state border'
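Checking a synset's position in the network amounts to listing its outgoing relations. The sketch below stores the example's three relations as labelled edges; the storage layout and the names `relations`, `add_relation` and `related` are assumptions made for illustration, not the plWordNet tooling.

```python
from collections import defaultdict

# Edges read as "<source> is a <relation> of <target>" (hypothetical layout).
relations = defaultdict(list)


def add_relation(source: str, relation: str, target: str) -> None:
    relations[source].append((relation, target))


def related(source: str, relation: str) -> list:
    """All targets that `source` is linked to by `relation`."""
    return [target for rel, target in relations[source] if rel == relation]


SOURCE = "{zagranica 1, obczyzna 1, obce terytorium 1}"
add_relation(SOURCE, "hyponym",
             "{obszar 1, terytorium 1, obręb 1, strefa 1, zona 1, rejon 3}")
add_relation(SOURCE, "meronym", "{świat 3, nieznane 1}")
add_relation(SOURCE, "fuzzynym", "{granica państwa 1}")

# Inspect the source synset's neighbourhood, relation by relation.
for relation, target in relations[SOURCE]:
    print(f"{SOURCE} is a {relation} of {target}")
```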

  13. Mapping procedure (2)
      Searching for a target synset:
      - choosing candidates for a target synset with the help of intuitions, automatic prompts and dictionaries, e.g. {foreign country 1} - 'any state of which one is not a citizen' – is a hyponym of {state 1, nation 1, country 1, land 9, commonwealth 2, res publica 1, body politic 1} - 'a politically organized body of people under a single government'
      - verifying candidates for a target synset: comparing hypernymic and hyponymic structures (and others, if such exist) with the source synset; checking the existing and/or potential inter-lingual relations; definitions, commentaries; dictionaries
      {state 1, ..} is an inter-lingual hyponym of {państwo 1, kraj 1} - 'zorganizowana politycznie społeczność, zamieszkująca określone terytorium, z niepodległą formą rządów' – 'a politically organised community, inhabiting a certain territory, with an independent form of government'

  14. Mapping procedure (3)
      Choosing a target synset and an inter-lingual relation: {foreign country 1}
      - Synonymy – no (different meaning, structures and relations)
      - Hyponymy – no (meaning, structures and relations do not qualify as a subtype)
      - Meronymy – yes (meaning, structures and relations qualify as a part)
      Linking the source synset with the target synset:
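The choice of relation can be sketched as a walk down the relation hierarchy from slide 8, stopping at the first relation the linguist accepts. This is an assumption about how the hierarchy is applied, not a description of the authors' tool. The verdicts are supplied by hand; relations that the slide does not mention are treated here as not applicable, which is also an assumption.

```python
# Relation hierarchy as listed on slide 8, highest priority first.
HIERARCHY = [
    "synonymy",
    "partial synonymy",
    "inter-register synonymy",
    "hyponymy",
    "hypernymy",
    "meronymy",
    "holonymy",
]


def choose_relation(verdicts: dict):
    """Return the first relation in the hierarchy judged applicable, else None."""
    for relation in HIERARCHY:
        # Missing verdicts default to False (assumption, see lead-in).
        if verdicts.get(relation, False):
            return relation
    return None


# Verdicts from the slide for {zagranica 1, ...} against {foreign country 1}.
verdicts = {"synonymy": False, "hyponymy": False, "meronymy": True}

print(choose_relation(verdicts))  # meronymy
```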

  15. Results of inter-lingual mapping
      About 46 500 inter-lingual links/relations between synsets, which amounts to about 50 000 relations between lexical units
      • Synonymy – 15268
      • Partial synonymy – 971
      • Inter-register synonymy – 676
      • Hyponymy – 23677
      • Hypernymy – 3526
      • Meronymy – 1898
      • Holonymy – 555
      • Mapped branches: people, artefacts, places, food, time units, communication (partly), states and processes (partly), body parts (partly), group names (partly)
      Mapping direction: plWordNet – Princeton WordNet
      Bottom-up approach – starting from the lowest levels in the hierarchy
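As a quick check, the per-relation counts can be summed and compared with the stated total of about 46 500 synset-level links. The figures are copied from the slide; the script itself is only an illustrative sketch.

```python
# Inter-lingual link counts per relation, as reported on the slide.
counts = {
    "Synonymy": 15268,
    "Partial synonymy": 971,
    "Inter-register synonymy": 676,
    "Hyponymy": 23677,
    "Hypernymy": 3526,
    "Meronymy": 1898,
    "Holonymy": 555,
}

total = sum(counts.values())
print(total)  # 46571, consistent with "about 46 500"

# Share of each relation in the total, largest first.
for name, n in sorted(counts.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<24}{n:>6}  {n / total:6.1%}")
```

Hyponymy and synonymy together account for most of the links, with hyponymy alone making up roughly half.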

  16. Types of differences between plWN and PWN
      Lexico-grammatical differences:
      (1) Markedness
          - young being (prosiak 'piglet' -hypo→ młodzik 'young creature')
          - diminutive (prosiaczek 'piggy' ← prosiak + -ek)
          - augmentative
      (2) Lexicalised gender
          - cousin ~ kuzyn (masc.) & kuzynka (fem.)
      (3) Lexical gaps
      Inter-lingual lexico-grammatical differences:
      - marked forms (diminutives, augmentatives)
      - lexicalised gender
      - lexical gaps
      Differences in the definition of synonymy and synset:
      - 'Mixed' PWN synsets – marked and unmarked forms, feminine and masculine, countable and uncountable, hypernym and hyponym
      - hypernymy and (plWN) vs. and/or (PWN)
      Other differences:
      - synset definitions incompatible with relations (PWN)
      - different relations used for coding the same conceptual dependencies
      - more fine-grained meaning differentiation
      - differences boiling down to the content and size of resources

  17. Marked forms
      Lexico-grammatical differences:
      (1) Markedness
          - young being (prosiak 'piglet' -hypo→ młodzik 'young creature')
          - diminutive (prosiaczek 'piggy' ← prosiak + -ek)
          - augmentative
      (2) Lexicalised gender
          - cousin ~ kuzyn (masc.) & kuzynka (fem.)
      (3) Lexical gaps

  18. Differences in lexicalisation

  19. Hyponymy

  20. Different relations for coding the same conceptual dependencies

  21. References
      Fellbaum, Ch. (ed). 1998. WordNet: An Electronic Lexical Database. MIT Press: Cambridge, Massachusetts.
      Maziarz, M., Piasecki, M. and S. Szpakowicz. 2012. Approaching plWordNet 2.0. Proceedings of the 6th Global Wordnet Conference, Matsue. pp. 189-196. Accepted for publication.
      Piasecki, M., Szpakowicz, S. and B. Broda. 2009. A Wordnet from the Ground Up. Oficyna Wydawnicza Politechniki Wrocławskiej: Wrocław.
      Princeton WordNet http://wordnet.princeton.edu/wordnet/
      Słowosieć http://plwordnet.pwr.wroc.pl/wordnet/
      Vossen, P. (ed). 2002. EuroWordNet. General Document. Amsterdam.
