Worth its Weight in Gold or Yet Another Resource
A Comparative Study of Wiktionary, OpenThesaurus and GermaNet
Christian M. Meyer and Iryna Gurevych
First Workshop on Automated Knowledge Base Construction (AKBC), Grenoble, France, May 2010.
Previously published in: Lecture Notes in Computer Science, Vol. 6008, p. 38-49. http://dx.doi.org/10.1007/978-3-642-12116-6_4
23.10.2019 | Computer Science Department | UKP Lab – Prof. Dr. Iryna Gurevych | Christian M. Meyer | 1
Motivation: NLP Tasks and Lexical Semantic Knowledge
[Diagram labels: Applications; Lexical Semantic Knowledge; Expert-built Lexical Semantic Resources: GermaNet, WordNet, OpenCyc]
Motivation: Expert-built Lexical Semantic Resources
• used for many years
• well studied
• high construction cost
• limited size
• hard to keep up-to-date
[Diagram labels: Applications; Expert-built Lexical Semantic Resources: GermaNet, WordNet, OpenCyc]
Motivation: Collaboratively-built Lexical Semantic Resources
• emerging
• freely available
• constantly updated
• competitive with expert-built resources
• structure- and content-related properties are largely unknown
[Diagram labels: Applications; Collaboratively-built Lexical Semantic Resources]
Motivation: Collaboratively-built Lexical Semantic Resources
Structure- and content-related properties of collaborative resources are largely unknown:
• How are the resources organized?
• Which kind of semantic knowledge is encoded?
• What are their strengths and drawbacks?
Motivation: Collaboratively-built Lexical Semantic Resources
Structure- and content-related properties of collaborative resources are largely unknown.
Perform a comparative study of resources.
[Diagram labels: Expert-built Lexical Semantic Resources: GermaNet, WordNet, OpenCyc; Collaboratively-built Lexical Semantic Resources]
Lexical Semantic Resources: Wiktionary
Collaboratively created online dictionary
• Language
• Etymology
• Pronunciation
• Part-of-speech
• Word senses
• Synonyms
• Derived Terms
• Translations
• …
[Figure callouts: Word Senses; Semantic Relations]
Lexical Semantic Resources: GermaNet and OpenThesaurus
GermaNet
• Semantic network for the German language
• Created by lexicographers
• WordNet-like structure
• [Kunze and Lemnitzer, 2002]
OpenThesaurus
• Collaborative (but moderated) collection of synonyms
• Used in OpenOffice
• [Naber, 2005]
A Uniform Representation of Resources: Splitting of Synsets
Example synset relations:
• {vessel, watercraft} is hypernym of {boat}
• {vessel} is hypernym of {tank, storage tank}
Steps:
• Insert synonymy relations within a synset
• Insert semantic relations between each individual word sense
[Diagram: the synsets {vessel, watercraft}, {boat}, {vessel} and {tank, storage tank} are split into the word senses vessel 1, watercraft 1, boat 1, vessel 2, tank 1 and storage tank 1]
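The two steps on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `split_synsets`, the relation labels and the sense numbering (senses of a lexeme are numbered in the order the synsets are listed) are hypothetical and not taken from the authors' implementation.

```python
from itertools import combinations


def split_synsets(synsets, synset_relations):
    """Turn synset-level data into relations between individual word senses.

    synsets: list of lists of lexemes.
    synset_relations: list of (source_index, label, target_index) triples.
    """
    sense_counter = {}  # lexeme -> number of senses created so far
    senses_of = []      # senses_of[i] = word senses created for synsets[i]
    for synset in synsets:
        senses = []
        for lexeme in synset:
            sense_counter[lexeme] = sense_counter.get(lexeme, 0) + 1
            senses.append((lexeme, sense_counter[lexeme]))
        senses_of.append(senses)

    relations = set()
    # Step 1: insert synonymy relations within a synset (both directions).
    for senses in senses_of:
        for a, b in combinations(senses, 2):
            relations.add((a, "is_synonym_of", b))
            relations.add((b, "is_synonym_of", a))
    # Step 2: copy each synset-level relation to every pair of word senses.
    for src, label, tgt in synset_relations:
        for a in senses_of[src]:
            for b in senses_of[tgt]:
                relations.add((a, label, b))
    return relations


# The example from the slide; indices refer to positions in `synsets`.
synsets = [["vessel", "watercraft"], ["boat"], ["vessel"], ["tank", "storage tank"]]
synset_relations = [(0, "is_hypernym_of", 1), (2, "is_hypernym_of", 3)]
relations = split_synsets(synsets, synset_relations)
```

With this representation, GermaNet-style synsets and Wiktionary-style word senses can be stored in the same graph of word-sense nodes.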
Word Sense Disambiguation in Wiktionary: Finding the Correct Target Word Sense
Sense [1] encodes a synonymy relation to “craft”. But which sense?
Word Sense Disambiguation in Wiktionary: Finding the Correct Target Word Sense – Approach
• We apply a method based on semantic relatedness here.
• Finding the best method for this task is a subject of our current studies.
[Figure: Explicit Semantic Analysis [Gabrilovich/Markovitch, 2007]; value shown in the figure: 0.094]
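A relatedness-based choice of the target sense could look roughly like the sketch below. It is not the authors' method: the slide names Explicit Semantic Analysis, whereas this sketch swaps in a plain bag-of-words cosine as a stand-in relatedness measure so that it runs without external data. The function names and the example glosses are made up for illustration.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words vector of a gloss (lowercased, whitespace-tokenized)."""
    return Counter(text.lower().split())


def bow_relatedness(text_a, text_b):
    """Stand-in relatedness: cosine similarity of bag-of-words vectors (NOT ESA)."""
    a, b = bow(text_a), bow(text_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def pick_target_sense(source_gloss, candidate_glosses, relatedness=bow_relatedness):
    """Return (index of the most related candidate sense, all scores)."""
    scores = [relatedness(source_gloss, gloss) for gloss in candidate_glosses]
    # max() keeps the first maximum, so ties fall back to the first sense.
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Hypothetical glosses, invented for this example.
source = "a vehicle for travelling on water"
candidates = [
    "skill in making things by hand",
    "a vehicle such as a boat or ship for travelling on water",
]
best, scores = pick_target_sense(source, candidates)
```

Passing a different `relatedness` function (for example an ESA implementation) leaves the selection logic unchanged.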
Word Sense Disambiguation in Wiktionary: Finding the Correct Target Word Sense – Evaluation
• Upper bound: 2 human annotators A and B judged 250 randomly sampled relations (= 920 source/target-candidate pairs)
• Lower bound: always choose the first word sense (this is usually the most frequent one)
Measures:
• AO: Percentage of agreement
• κ: Cohen’s Kappa (chance-corrected)
• α: Krippendorff’s Alpha with set-valued distance function (MASI)
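For readers unfamiliar with the first two measures, here is a minimal sketch of how AO and Cohen's κ are computed from two annotators' labels. The label lists are toy data, not the study's annotations, and the function names are illustrative.

```python
from collections import Counter


def observed_agreement(labels_a, labels_b):
    """AO: fraction of items on which both annotators chose the same label."""
    return sum(x == y for x, y in zip(labels_a, labels_b)) / len(labels_a)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement corrected for the agreement expected by chance."""
    n = len(labels_a)
    ao = observed_agreement(labels_a, labels_b)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if both annotators labelled independently
    # according to their own label distributions.
    ae = sum(count_a[l] * count_b[l] for l in count_a.keys() | count_b.keys()) / (n * n)
    if ae == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (ao - ae) / (1 - ae)


# Toy judgments: 1 = "candidate is the correct target sense", 0 = "it is not".
annotator_a = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
annotator_b = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]
ao = observed_agreement(annotator_a, annotator_b)
kappa = cohen_kappa(annotator_a, annotator_b)
```

Krippendorff's α with the MASI distance is not covered here; it additionally needs a distance function between sets of selected senses.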
Structural Analysis: Topological Results
Analysis:
• Connectivity
• Degree distribution
• Network organization
• Cluster analysis
Results:
• The largest connected component contains the bulk of semantic knowledge
• The Wiktionary graph is scale-free, which allows analysis results to be projected to future (larger) versions
• All graphs are small world graphs; they show organizational patterns that differ significantly from random graphs
[Plot axes: log(degree) and log(#nodes)]
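The connectivity and degree-distribution analyses can be illustrated on a toy graph. The sketch below assumes an undirected graph stored as a dictionary of neighbor sets; the helper names and the example edges are hypothetical.

```python
from collections import Counter, deque


def build_graph(edges, extra_nodes=()):
    """Undirected graph as {node: set of neighbors}."""
    adj = {node: set() for node in extra_nodes}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def connected_components(adj):
    """Connected components found by breadth-first search, largest first."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        component, queue = [], deque([start])
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return sorted(components, key=len, reverse=True)


def degree_distribution(adj):
    """Map each degree to the number of nodes having that degree."""
    return Counter(len(neighbors) for neighbors in adj.values())


edges = [("boat", "vessel"), ("vessel", "watercraft"), ("boat", "watercraft"),
         ("vessel", "ship"), ("tank", "storage tank")]
graph = build_graph(edges, extra_nodes=["craft"])  # "craft" is an isolated node
components = connected_components(graph)
distribution = degree_distribution(graph)
```

Plotting `distribution` on log-log axes gives the kind of degree plot referred to on this slide.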
Content Analysis: Encoded Lexemes, Word Senses and Semantic Relations
Analysis:
• Resource Size
• Polysemy
• Relation Type
• Unidirectional Relations
Results:
• Wiktionary has most lexemes and word senses
• GermaNet has most semantic relations
• In general, more polysemous lexemes in Wiktionary
• Many “dangling” articles in Wiktionary
• Predominant type of relation for each resource
• Most Wiktionary relations are unidirectional
Conclusions: Take-home Message
• How are the resources organized?
  • Uniform representation needed
  • Largest connected component is sufficient
  • Small world property
• Which kind of semantic knowledge is encoded?
  • More polysemous lexemes in Wiktionary
  • Many dangling lexemes
• What are their strengths and drawbacks?
  • Predominant type of relation for each resource
  • Number of semantic relations can be increased in Wiktionary
Conclusions: Future Work
• Study English resources
• Improve word sense disambiguation in Wiktionary
• How large is the information overlap of the resources?
• Combine the resources at the word sense level
Thank you for your attention!
Ubiquitous Knowledge Processing
Additional Online Material: http://www.ukp.tu-darmstadt.de/data/lexical-resources/
Backup Slides
Terms and Definitions
[Diagram illustrating the terms Term, Lexeme, Word Sense and Semantic Relation with the example “plant”. Lexemes: plant (noun), plant (verb), works (noun). Word senses: plant<botany>, plant<building>. Relations: is synonym, has hypernym, has hyponym. Further nodes: works, building]
Selection of Resources
From expert-built to collaboratively-built:
• Constructed by linguists
• Constructed by a community but reviewed by administrators
• Constructed by the Web community
Topological Parameters
• Average path length: calculate the shortest path between each pair of nodes and take the mean.
• Clustering coefficient: the average probability that two neighbors of a node are connected by an edge
  • for a node v: $C_v = \frac{2 n_v}{k (k - 1)}$
  • with $n_v$ = number of edges connecting the neighbors of v to each other; $k$ = degree of v
• Topological overlap: the average number of vertices to which both endpoints of an edge are linked
  • for an edge (u, v): $O(u, v) = \frac{n(u, v)}{\min(k_u, k_v)}$
  • with $n(u, v)$ = number of nodes linked to both u and v; $k_u$ and $k_v$ = degrees of u and v
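The two formulas above translate directly into code. The following sketch assumes an undirected graph stored as a dictionary of neighbor sets; the function names and the toy graph are illustrative, and nodes with fewer than two neighbors are assigned a clustering coefficient of 0 by convention.

```python
from itertools import combinations


def clustering_coefficient(adj, v):
    """C_v = 2 * n_v / (k * (k - 1)), with n_v = edges among the neighbors of v."""
    k = len(adj[v])
    if k < 2:
        return 0.0  # convention assumed here: undefined cases count as 0
    n_v = sum(1 for a, b in combinations(adj[v], 2) if b in adj[a])
    return 2 * n_v / (k * (k - 1))


def topological_overlap(adj, u, v):
    """O(u, v) = n(u, v) / min(k_u, k_v) for an existing edge (u, v)."""
    common_neighbors = len(adj[u] & adj[v])
    return common_neighbors / min(len(adj[u]), len(adj[v]))


def average(values):
    values = list(values)
    return sum(values) / len(values)


# Toy undirected graph as {node: set of neighbors}.
adj = {
    "boat": {"vessel", "watercraft"},
    "vessel": {"boat", "watercraft", "ship"},
    "watercraft": {"boat", "vessel"},
    "ship": {"vessel"},
}
c_vessel = clustering_coefficient(adj, "vessel")
o_boat_vessel = topological_overlap(adj, "boat", "vessel")
mean_clustering = average(clustering_coefficient(adj, v) for v in adj)
```

Averaging `clustering_coefficient` over all nodes, and `topological_overlap` over all edges, gives the graph-level values that the slide describes.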
Topological Analysis: Connectivity
• GermaNet shows best connectivity
• Bulk of contents in largest CC
• OpenThesaurus highly scattered
Topological Analysis: Scale-free Graphs
[Plot panels: Wiktionary, GermaNet, OpenThesaurus]
• Wiktionary shows a clear power law, which leads to a scale-free graph
• This allows topological insights to be projected to future revisions
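One rough way to check for a power law is to fit a straight line to the degree distribution on log-log axes and read the exponent off the slope. The sketch below does this with ordinary least squares; it is a simple diagnostic under that assumption, not necessarily the procedure used in the study, and the degree counts are synthetic.

```python
import math


def fit_power_law_exponent(degree_counts):
    """Estimate gamma in #nodes(k) ~ k^(-gamma) by least squares on log-log data.

    degree_counts: {degree: number of nodes with that degree}.
    """
    points = [(math.log(k), math.log(n))
              for k, n in degree_counts.items() if k > 0 and n > 0]
    if len(points) < 2:
        raise ValueError("need at least two positive (degree, count) pairs")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = cov_xy / var_x
    return -slope  # a power law appears as a line with slope -gamma


# Synthetic counts that follow k^(-2) exactly; real data will be noisier.
toy_counts = {k: 1000.0 * k ** -2 for k in range(1, 11)}
gamma = fit_power_law_exponent(toy_counts)
```

A poor linear fit on the log-log data would suggest that the degree distribution does not follow a power law.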
Topological Analysis: Small World Property
• The Small World Property requires the graph to have
  • a small average path length,
  • a high clustering coefficient and
  • a high topological overlap
  • (compared to a random graph of similar size) [Barabási and Oltvai, 2004]
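The comparison against a random graph of similar size can be sketched as follows. The assumptions here are mine: the random baseline keeps the node and edge counts and draws edges uniformly at random, and the average path length is taken over reachable node pairs only. The function names are illustrative.

```python
import random
from collections import deque
from itertools import combinations


def average_path_length(adj):
    """Mean shortest-path length over all reachable ordered node pairs (BFS)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in adj[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        total += sum(dist.values())
        pairs += len(dist) - 1  # exclude the source itself
    return total / pairs if pairs else 0.0


def random_graph_like(adj, seed=0):
    """Random graph with the same number of nodes and edges as `adj`."""
    nodes = sorted(adj)
    edge_count = sum(len(neighbors) for neighbors in adj.values()) // 2
    rng = random.Random(seed)
    edges = rng.sample(list(combinations(nodes, 2)), edge_count)
    rand = {node: set() for node in nodes}
    for u, v in edges:
        rand[u].add(v)
        rand[v].add(u)
    return rand


# Toy graph: a simple path 0 - 1 - 2 - 3.
path_graph = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
apl = average_path_length(path_graph)
baseline = random_graph_like(path_graph, seed=42)
baseline_apl = average_path_length(baseline)
```

The same pattern applies to the clustering coefficient and the topological overlap: compute the value on the resource graph and on the random baseline, then compare.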
Content Analysis: Resource Size
• Wiktionary has most lexemes/senses
• GermaNet has most relations
Content Analysis: Polysemy
What causes this difference? Possible explanations:
• Wiktionary contains more words with a high frequency in language (known to be more ambiguous)
• The community is more likely to create articles for polysemous terms, since they might be more interesting to create
• The coverage of Wiktionary senses is on average higher
• Wiktionary word senses are more fine-grained
Subject of ongoing work
Content Analysis: Dangling Lexemes
Content Analysis: Dangling Lexemes
Wiktionary is still growing…
Fig. courtesy: http://stats.wikimedia.org/wiktionary/EN/ChartsWikipediaDE.htm
Content Analysis: Semantic Relations
Content Analysis: One Way Relations
[Figure: example with the lexeme “boat”]
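Counting one-way relations amounts to checking, for each encoded relation, whether its counterpart in the opposite direction is also encoded. The sketch below assumes that reading of "one way"; the inverse table, the function name and the example triples are hypothetical.

```python
# Assumed inverse of each relation type (illustrative, not from the slides).
INVERSE = {
    "synonym": "synonym",
    "antonym": "antonym",
    "hypernym": "hyponym",
    "hyponym": "hypernym",
}


def one_way_relations(relations):
    """Return relations (source, type, target) whose inverse is not encoded."""
    encoded = set(relations)
    return [(s, r, t) for s, r, t in relations
            if (t, INVERSE[r], s) not in encoded]


# Hypothetical triples: (source, relation type, target).
relations = [
    ("boat", "synonym", "craft"),
    ("craft", "synonym", "boat"),
    ("boat", "hypernym", "vessel"),  # no ("vessel", "hyponym", "boat") encoded
]
missing_inverse = one_way_relations(relations)
share = len(missing_inverse) / len(relations)
```

Adding the missing inverse for each returned triple is one simple way to increase the number of encoded relations.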
Content Analysis: Polysemy
Wiktionary has a particularly large number of lexemes with more than 14 senses