
Worth its Weight in Gold or Yet Another Resource

Worth its Weight in Gold or Yet Another Resource. A Comparative Study of Wiktionary, OpenThesaurus and GermaNet. Christian M. Meyer and Iryna Gurevych First Workshop on Automated Knowledge Base Construction (AKBC), Grenoble, France, May 2010. Previously published in:



Presentation Transcript


  1. Worth its Weight in Gold or Yet Another Resource: A Comparative Study of Wiktionary, OpenThesaurus and GermaNet. Christian M. Meyer and Iryna Gurevych. First Workshop on Automated Knowledge Base Construction (AKBC), Grenoble, France, May 2010. Previously published in: Lecture Notes in Computer Science, Vol. 6008, p. 38-49. http://dx.doi.org/10.1007/978-3-642-12116-6_4
     23.10.2019 | Computer Science Department | UKP Lab – Prof. Dr. Iryna Gurevych | Christian M. Meyer | 1

  2. Motivation: NLP Tasks and Lexical Semantic Knowledge
     [Diagram labels: Applications; Lexical Semantic Knowledge; Expert-built Lexical Semantic Resources (GermaNet, WordNet, OpenCyc)]

  3. Motivation: Expert-built Lexical Semantic Resources
     • used for many years
     • well studied
     • high construction cost
     • limited size
     • hard to keep up-to-date
     [Diagram labels: Applications; Expert-built Lexical Semantic Resources (GermaNet, WordNet, OpenCyc)]

  4. Motivation: Collaboratively-built Lexical Semantic Resources
     • emerging
     • freely available
     • constantly updated
     • competitive with expert-built resources
     • structure- and content-related properties are largely unknown
     [Diagram labels: Applications; Collaboratively-built Lexical Semantic Resources]

  5. Motivation: Collaboratively-built Lexical Semantic Resources
     • How are the resources organized?
     • Which kind of semantic knowledge is encoded?
     • What are their strengths and drawbacks?
     Structure- and content-related properties of collaborative resources are largely unknown.

  6. Motivation: Collaboratively-built Lexical Semantic Resources
     Structure- and content-related properties of collaborative resources are largely unknown, so we perform a comparative study of the resources.
     [Diagram labels: Expert-built Lexical Semantic Resources (GermaNet, WordNet, OpenCyc); Collaboratively-built Lexical Semantic Resources]

  7. Lexical Semantic Resources: Wiktionary
     Collaboratively created online dictionary. An entry contains:
     • Language
     • Etymology
     • Pronunciation
     • Part-of-speech
     • Word senses
     • Synonyms
     • Derived terms
     • Translations
     • …
     [Figure callouts: Word Senses; Semantic Relations]

  8. Lexical Semantic Resources: GermaNet and OpenThesaurus
     GermaNet
     • Semantic network for the German language
     • Created by lexicographers
     • WordNet-like structure
     • [Kunze and Lemnitzer, 2002]
     OpenThesaurus
     • Collaborative (but moderated) collection of synonyms
     • Used in OpenOffice
     • [Naber, 2005]

  9. A Uniform Representation of Resources: Splitting of Synsets
     Example synsets:
     • {vessel, watercraft} is hypernym of {boat}
     • {vessel} is hypernym of {tank, storage tank}
     Splitting steps:
     • Insert synonymy relations within a synset
     • Insert semantic relations between each individual word sense
     [Figure: the synsets {vessel, watercraft}, {boat}, {vessel} and {tank, storage tank} are split into the word senses vessel 1, watercraft 1, boat 1, vessel 2, tank 1 and storage tank 1]
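The two splitting steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `split_synsets`, the synset identifiers `s1`..`s4`, the `word#n` sense labels and the relation name `hypernym_of` are all assumptions made for illustration, as is the assignment of "vessel 1" to {vessel, watercraft} and "vessel 2" to {vessel}.

```python
from itertools import combinations, product

def split_synsets(synsets, relations):
    """Turn synset-level data into edges between individual word senses.

    synsets:   dict mapping a synset id to its list of word-sense labels
    relations: list of (source_synset_id, relation_type, target_synset_id)
    """
    edges = set()
    # Step 1: insert synonymy relations within each synset (both directions).
    for senses in synsets.values():
        for a, b in combinations(senses, 2):
            edges.add((a, "synonym", b))
            edges.add((b, "synonym", a))
    # Step 2: insert the semantic relation between each pair of word senses
    # drawn from the source synset and the target synset.
    for source, relation, target in relations:
        for a, b in product(synsets[source], synsets[target]):
            edges.add((a, relation, b))
    return edges

# The example from the slide (sense numbering is an assumption).
synsets = {
    "s1": ["vessel#1", "watercraft#1"],
    "s2": ["boat#1"],
    "s3": ["vessel#2"],
    "s4": ["tank#1", "storage tank#1"],
}
relations = [("s1", "hypernym_of", "s2"), ("s3", "hypernym_of", "s4")]
edges = split_synsets(synsets, relations)
```

Representing every resource as a graph of word senses in this way is what makes a synset-based resource such as GermaNet comparable with a sense-based one such as Wiktionary.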

  10. Word Sense Disambiguation in Wiktionary: Finding the Correct Target Word Sense
     Sense [1] encodes a synonymy relation to “craft”. But which sense?

  11. Word Sense Disambiguation in Wiktionary: Finding the Correct Target Word Sense (Approach)
     • We apply a method based on semantic relatedness here.
     • Finding the best method for this task is a subject of our current studies.
     [Figure: Explicit Semantic Analysis [Gabrilovich/Markovitch, 2007]; relatedness score shown: 0.094]
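The relatedness-based selection of a target sense can be sketched as follows. This is only an illustration: the slide names Explicit Semantic Analysis as the relatedness measure, whereas the sketch substitutes a simple word-overlap (Jaccard) score so that it stays self-contained. The function names and the example glosses are hypothetical and not taken from Wiktionary.

```python
def relatedness(text_a, text_b):
    """Stand-in relatedness: Jaccard overlap of the word sets (NOT ESA)."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def disambiguate(source_gloss, candidate_glosses, score=relatedness):
    """Return the index of the candidate sense most related to the source sense."""
    scores = [score(source_gloss, gloss) for gloss in candidate_glosses]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Hypothetical glosses for illustration only.
source = "a vehicle designed for travel on water"
candidates = [
    "skill in doing or making something by hand",
    "a vehicle such as a boat for travel on water or in the air",
]
best, scores = disambiguate(source, candidates)
```

Passing a different `score` function is the point of the design: any relatedness measure that maps two texts to a number can be plugged in without changing the selection logic.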

  12. Word Sense Disambiguation in Wiktionary: Finding the Correct Target Word Sense (Evaluation)
     • Upper bound: 2 human annotators A and B judged 250 randomly sampled relations (= 920 pairs of source sense and target candidate)
     • Lower bound: always choose the first word sense (this is usually the most frequent one)
     Measures:
     • AO: percentage of agreement
     • κ: Cohen’s Kappa (chance-corrected)
     • α: Krippendorff’s Alpha with set-valued distance function (MASI)
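Two of the agreement measures listed above can be computed in a few lines. The sketch below covers the observed agreement AO and Cohen's κ, defined as κ = (AO − AE) / (1 − AE) with AE the agreement expected by chance. Krippendorff's α with the MASI distance is not covered. The function names and the toy judgements are assumptions, not data from the study.

```python
from collections import Counter

def observed_agreement(labels_a, labels_b):
    """Share of items on which both annotators chose the same label."""
    return sum(x == y for x, y in zip(labels_a, labels_b)) / len(labels_a)

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators."""
    n = len(labels_a)
    ao = observed_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement: probability that both pick the same label by chance.
    ae = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if ae == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (ao - ae) / (1 - ae)

# Toy judgements (1 = correct target sense, 0 = incorrect); invented data.
annotator_a = [1, 1, 0, 0]
annotator_b = [1, 0, 0, 0]
ao = observed_agreement(annotator_a, annotator_b)
kappa = cohen_kappa(annotator_a, annotator_b)
```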

  13. Structural Analysis: Topological Results
     Analysis:
     • Connectivity
     • Degree distribution
     • Network organization
     • Cluster analysis
     Results:
     • The largest connected component contains the bulk of semantic knowledge
     • The Wiktionary graph is scale-free, which allows analysis results to be extrapolated to future (larger) versions
     • All graphs are small world graphs; they show organizational patterns that differ significantly from random graphs
     [Plot axes: log(degree), log(#nodes)]
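The connectivity and degree-distribution analyses can be sketched on a toy graph. The adjacency-set representation, the function names and the example graph are assumptions for illustration. The sketch treats the graph as undirected.

```python
from collections import Counter, deque

def connected_components(adj):
    """Connected components of an undirected graph, largest first."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:  # breadth-first search from the start node
            node = queue.popleft()
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return sorted(components, key=len, reverse=True)

def degree_distribution(adj):
    """Map each degree to the number of nodes that have it."""
    return Counter(len(neighbors) for neighbors in adj.values())

# Toy graph: one larger component, one pair, one isolated node.
adj = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"},
    "e": {"f"}, "f": {"e"}, "g": set(),
}
components = connected_components(adj)
largest_share = len(components[0]) / len(adj)  # share of nodes in the largest component
degrees = degree_distribution(adj)
```

Plotting `degrees` on log-log axes gives the kind of log(degree) versus log(#nodes) plot named on the slide.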

  14. Content Analysis: Encoded Lexemes, Word Senses and Semantic Relations
     Analysis:
     • Resource size
     • Polysemy
     • Relation type
     • Unidirectional relations
     Results:
     • Wiktionary has most lexemes and word senses
     • GermaNet has most semantic relations
     • In general, more polysemous lexemes in Wiktionary
     • Many “dangling” articles in Wiktionary
     • Predominant type of relation for each resource
     • Most Wiktionary relations are unidirectional
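One way to detect unidirectional relations is to check, for every relation, whether its inverse is also encoded. The sketch below assumes that a triple `(source, type, target)` reads "source has <type> target". The inverse table, the function name and the example triples are illustrative assumptions and are not taken from the study.

```python
# Assumed inverse for each relation type; symmetric types map to themselves.
INVERSE = {
    "synonym": "synonym",
    "antonym": "antonym",
    "hypernym": "hyponym",
    "hyponym": "hypernym",
}

def unidirectional(relations):
    """Return the relations whose inverse counterpart is missing."""
    encoded = set(relations)
    return [
        (source, rel, target)
        for source, rel, target in relations
        if (target, INVERSE[rel], source) not in encoded
    ]

# Invented example: the hypernym/hyponym pair is encoded in both directions,
# the synonymy link from "boat" to "craft" is not.
relations = [
    ("boat", "hypernym", "vessel"),
    ("vessel", "hyponym", "boat"),
    ("boat", "synonym", "craft"),
]
one_way = unidirectional(relations)
one_way_share = len(one_way) / len(relations)
```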

  15. Conclusions: Take-home Message
     • How are the resources organized?
       • Uniform representation needed
       • Largest connected component is sufficient
       • Small world property
     • Which kind of semantic knowledge is encoded?
       • More polysemous lexemes in Wiktionary
       • Many dangling lexemes
     • What are their strengths and drawbacks?
       • Predominant type of relation for each resource
       • Number of semantic relations can be increased in Wiktionary

  16. Conclusions: Future Work
     • Study English resources
     • Improve word sense disambiguation in Wiktionary
     • How large is the information overlap of the resources?
     • Combine the resources at the word sense level

  17. Thank you for your attention! Ubiquitous Knowledge Processing. Additional online material: http://www.ukp.tu-darmstadt.de/data/lexical-resources/

  19. Backup Slides

  20. Terms and Definitions
     [Figure: the notions Term, Lexeme, Word Sense and Semantic Relation, illustrated with “plant” and “works”. Labels recovered from the figure: terms plant, works; lexemes plant (noun), plant (verb), works (noun); word senses plant<botany>, plant<building>, works; semantic relations has hypernym (building), has hyponym, is synonym]

  21. Selection of Resources
     Spectrum from expert-built to collaboratively-built:
     • Constructed by linguists
     • Constructed by a community but reviewed by administrators
     • Constructed by the Web community

  22. Topological Parameters
     • Average path length: calculate the shortest path between each pair of nodes and take the mean.
     • Clustering coefficient: the average probability that two neighbors of a node are connected by an edge.
       For each node $v$: $C_v = \frac{2 n_v}{k(k-1)}$, with $n_v$ = number of edges among the neighbors of $v$ and $k$ = degree of $v$.
     • Topological overlap: the average number of vertices to which both endpoints of an edge are linked.
       For each edge $(u,v)$: $O(u,v) = \frac{n(u,v)}{\min(k_u, k_v)}$, with $n(u,v)$ = number of nodes linked to both $u$ and $v$; $k_u$ and $k_v$ = node degrees.
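The three parameters defined above translate directly into code. The following is a minimal sketch on an undirected adjacency-set graph. The function names and the toy graph are assumptions, and two conventions are chosen here rather than taken from the slides: nodes with fewer than two neighbors get a clustering coefficient of 0, and the average path length only counts pairs of nodes that are connected.

```python
from collections import deque
from itertools import combinations

def clustering_coefficient(adj, v):
    """C_v = 2 * n_v / (k * (k - 1)), n_v = edges among the neighbors of v."""
    neighbors = adj[v]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumed convention for nodes with fewer than two neighbors
    n_v = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return 2 * n_v / (k * (k - 1))

def topological_overlap(adj, u, v):
    """O(u, v) = n(u, v) / min(k_u, k_v) for an edge (u, v)."""
    common = len(adj[u] & adj[v])
    return common / min(len(adj[u]), len(adj[v]))

def average_path_length(adj):
    """Mean shortest-path length over all connected pairs of nodes."""
    total, pairs = 0, 0
    for source in adj:
        distance = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search yields shortest paths
            node = queue.popleft()
            for neighbor in adj[node]:
                if neighbor not in distance:
                    distance[neighbor] = distance[node] + 1
                    queue.append(neighbor)
        total += sum(distance.values())
        pairs += len(distance) - 1
    return total / pairs if pairs else 0.0

# Toy graph: a triangle a-b-c with a pendant node d attached to a.
adj = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
c_a = clustering_coefficient(adj, "a")
o_ab = topological_overlap(adj, "a", "b")
apl = average_path_length(adj)
```

Averaging `clustering_coefficient` over all nodes and `topological_overlap` over all edges gives the graph-level values that the definitions describe.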

  23. Topological Analysis: Connectivity
     • GermaNet shows best connectivity
     • Bulk of contents in largest connected component (CC)
     • OpenThesaurus highly scattered

  24. Topological Analysis: Scale-free Graphs
     [Plot panels: Wiktionary, GermaNet, OpenThesaurus]
     • The Wiktionary degree distribution follows a clear power law, i.e. the graph is scale-free
     • This allows topological findings to be extrapolated to future revisions
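A rough way to check for a power law is to fit a straight line to the degree distribution on log-log axes: for a distribution of the form count ∝ degree^(−γ), the slope is −γ. The sketch below uses ordinary least squares, which is a simple but crude estimator and not necessarily what the authors used. The function name and the synthetic counts are assumptions.

```python
import math

def loglog_slope(degree_counts):
    """Least-squares slope of log(count) against log(degree)."""
    points = [
        (math.log(degree), math.log(count))
        for degree, count in degree_counts.items()
        if degree > 0 and count > 0  # log is undefined for zero
    ]
    if len(points) < 2:
        raise ValueError("need at least two distinct positive degrees")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx

# Synthetic counts that follow count = 1024 * degree**-2 exactly.
degree_counts = {1: 1024, 2: 256, 4: 64, 8: 16}
slope = loglog_slope(degree_counts)
```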

  25. Topological Analysis: Small World Property
     The Small World Property requires the graph to have
     • a small average path length,
     • a high clustering coefficient and
     • a high topological overlap
     (compared to a random graph of similar size) [Barabási and Oltvai, 2004]
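The comparison with a random graph of similar size needs reference values. One option, sketched below, is to use the standard approximations for an Erdős–Rényi random graph with n nodes and mean degree ⟨k⟩: average path length ≈ ln n / ln ⟨k⟩ and clustering coefficient ≈ ⟨k⟩ / n. The slides do not say how the random baseline was obtained, so this choice and the function name are assumptions, and topological overlap is not covered.

```python
import math

def random_baseline(num_nodes, mean_degree):
    """Approximate path length and clustering of a comparable random graph."""
    if num_nodes < 2 or mean_degree <= 1:
        raise ValueError("approximation needs num_nodes >= 2 and mean_degree > 1")
    path_length = math.log(num_nodes) / math.log(mean_degree)
    clustering = mean_degree / num_nodes
    return path_length, clustering

# Illustrative sizes only; not figures from the study.
random_path_length, random_clustering = random_baseline(1000, 10)
```

A graph would then count as a small world if its measured average path length is of the same order as `random_path_length` while its measured clustering coefficient is much larger than `random_clustering`.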

  26. Content Analysis: Resource Size
     • Wiktionary has most lexemes/senses
     • GermaNet has most relations

  27. Content Analysis: Polysemy
     What causes this difference? Possible explanations:
     • Wiktionary contains more words with a high frequency in language (known to be more ambiguous)
     • The community is more likely to create articles for polysemous terms, since they might be more interesting to create
     • The coverage of Wiktionary senses is on average higher
     • Wiktionary word senses are more fine-grained
     Subject of ongoing work.

  28. Content Analysis: Dangling Lexemes

  29. Content Analysis: Dangling Lexemes
     Wiktionary is still growing… Fig. courtesy: http://stats.wikimedia.org/wiktionary/EN/ChartsWikipediaDE.htm

  30. Content Analysis: Semantic Relations

  31. Content Analysis: One Way Relations
     [Figure: example with the word “boat”]

  32. Content Analysis: Polysemy
     Wiktionary has a particularly large number of lexemes with more than 14 senses.
