2 Is Bigger (and Better) Than 1: the Wikipedia Bitaxonomy Project. T. Flati, D. Vannella, T. Pasini, R. Navigli. ERC Starting Grant MultiJEDI No. 259234. The Wikipedia structure. Article pages ~4M. Category pages ~ 700K. Two noisy graphs with no explicit hypernym relation.
2 Is Bigger (and Better) Than 1: the Wikipedia Bitaxonomy Project. T. Flati, D. Vannella, T. Pasini, R. Navigli. ERC Starting Grant MultiJEDI No. 259234.
The Wikipedia structure • Article pages: ~4M • Category pages: ~700K • Two noisy graphs with no explicit hypernym relation.
The Wikipedia structure: an example. [Figure: the Pages graph and the Categories graph side by side. Nodes shown: Mickey Mouse, Donald Duck, Superman, Cartoon, Funny Animal, Disney character, The Walt Disney Company, Fictional characters, Fictional characters by medium, Comics by genre, Disney comics, Disney comics characters.]
Our goal: to automatically create a Wikipedia Bitaxonomy for Wikipedia pages and categories in a simultaneous fashion. KEY IDEA: the page and category levels are mutually beneficial for inducing a wide-coverage and fine-grained integrated taxonomy.
Key idea. [Figure: the same Pages and Categories graphs, now connected by "is a" edges. Nodes shown: Mickey Mouse, Donald Duck, Superman, Cartoon, Funny Animal, Disney character, The Walt Disney Company, Fictional characters, Fictional characters by medium, Comics by genre, Disney comics, Disney comics characters.]
A 3-phase method • Starting from two noisy graphs: pages and categories
A 3-phase method • 1. Build the page taxonomy • 2. Bitaxonomy Algorithm • 3. Refine the category taxonomy (+50% categories)
Contributions • Self-contained approach • Page taxonomy and category taxonomy built simultaneously • State-of-the-art results when compared to all other available taxonomies
Assumptions • The first sentence of a page is a good definition (also called gloss)
The WiBi Page taxonomy • [Syntactic step] Extract the hypernym lemma from a page definition using a syntactic parser; • [Semantic step] Apply a set of linking heuristics to disambiguate the extracted lemma. Example: "Scrooge McDuck is a character […]". Syntactic step: the dependency parse (relations nn, nsubj, cop) yields the hypernym lemma "character". Semantic step: the lemma is then disambiguated.
The semantic step: 5 cascading linking heuristics (Crowdsourced, Category, Multiword, Monosemous, Distributional). Example: for the target page (Cristiano Ronaldo) with the ambiguous hypernym ('player'), the linking heuristic returns the disambiguated hypernym (Football player).
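The cascade on this slide could be sketched roughly as follows: heuristics are tried in order and the first one that returns a sense wins. This is only an illustration. The function names, the `(page, lemma)` signature, and the toy lookup table are assumptions, not the authors' implementation, and the two stand-in heuristics do not reproduce the real ones.

```python
from typing import Callable, List, Optional, Tuple

# Assumed interface: a heuristic maps (page title, hypernym lemma)
# to a disambiguated page title, or None if it cannot decide.
Heuristic = Callable[[str, str], Optional[str]]


def link_hypernym(
    page: str, lemma: str, cascade: List[Tuple[str, Heuristic]]
) -> Optional[Tuple[str, str]]:
    """Try each heuristic in order; return (heuristic name, sense) for the first hit."""
    for name, heuristic in cascade:
        sense = heuristic(page, lemma)
        if sense is not None:
            return name, sense
    return None  # no heuristic could disambiguate the lemma


# Toy stand-ins (hypothetical behaviour, for illustration only).
def crowdsourced(page: str, lemma: str) -> Optional[str]:
    return None  # pretend the gloss contains no usable link


def category(page: str, lemma: str) -> Optional[str]:
    toy_table = {("Cristiano Ronaldo", "player"): "Football player"}
    return toy_table.get((page, lemma))


CASCADE = [("crowdsourced", crowdsourced), ("category", category)]
result = link_hypernym("Cristiano Ronaldo", "player", CASCADE)
print(result)
```

Putting the most reliable heuristic first means that later, weaker heuristics only see the cases the earlier ones left open.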
1. Crowdsourced heuristic. Use the links from the crowd! Example: "Mickey Mouse is a funny animal cartoon character and the official mascot of The Walt Disney Company."
2. Category heuristic. Given a page and its ambiguous hypernym, exploit its categories to build a distribution of the hypernym's senses. [Figure: the page Donald Duck (ambiguous hypernym: Character), the categories Characters in Disney package films and Disney comics characters, and the pages Pluto, Hook, Mickey Mouse, Goofy and José Carioca.]
2. Category heuristic. Given a page and its ambiguous hypernym, exploit its categories to build a distribution of the hypernym's senses. [Figure: as before, with the page Donald Duck (ambiguous hypernym: Character), the categories Characters in Disney package films and Disney comics characters, and the glosses of the other pages:] • "Pluto, also called Pluto the Pup, is a cartoon character […]" • "Captain James Hook is a fictional character […]" • "Mickey Mouse is a funny animal cartoon character […]" • "Goofy is a funny animal cartoon character […]" • "José Carioca is a Disney cartoon character […]"
2. Category heuristic. Given a page and its ambiguous hypernym, exploit its categories to build a distribution of the hypernym's senses. Sense counts for Donald Duck (ambiguous hypernym: Character): • Characters in Disney package films: Character (arts) 5, Funny animal 1 • Disney comics characters: Character (arts) 3, Funny animal 1, Cartoon 1 • Total: Character (arts) 8, Funny animal 2, Cartoon 1. Glosses shown: "Pluto, also called Pluto the Pup, is a cartoon character […]"; "Captain James Hook is a fictional character […]"; "Mickey Mouse is a funny animal cartoon character […]"; "Goofy is a funny animal cartoon character […]"; "José Carioca is a Disney cartoon character […]".
2. Category heuristic • Given a page and its ambiguous hypernym, exploit its categories to build a distribution of the hypernym's senses. Result for Donald Duck (ambiguous hypernym: Character): the distribution Character (arts) 8, Funny animal 2, Cartoon 1 selects Character (arts).
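The Donald Duck example can be reproduced with a few lines of Python. This is a minimal sketch that takes the per-category sense counts shown on the slides as given; how those counts are collected from the sibling pages' glosses is not modelled here, and the function names are hypothetical.

```python
from collections import Counter

# Per-category sense counts for the lemma "character", as shown on the slides.
CATEGORY_SENSE_COUNTS = {
    "Characters in Disney package films": Counter(
        {"Character (arts)": 5, "Funny animal": 1}
    ),
    "Disney comics characters": Counter(
        {"Character (arts)": 3, "Funny animal": 1, "Cartoon": 1}
    ),
}


def sense_distribution(categories, counts_by_category):
    """Sum the sense counts over all categories the page belongs to."""
    total = Counter()
    for category in categories:
        total.update(counts_by_category.get(category, Counter()))
    return total


def pick_sense(distribution):
    """Return the most frequent sense, or None if nothing was observed."""
    if not distribution:
        return None
    return distribution.most_common(1)[0][0]


donald_categories = [
    "Characters in Disney package films",
    "Disney comics characters",
]
distribution = sense_distribution(donald_categories, CATEGORY_SENSE_COUNTS)
best = pick_sense(distribution)
print(distribution, best)
```

Summing the two categories gives Character (arts) 8, Funny animal 2, Cartoon 1, so the majority sense Character (arts) is chosen, as on the slide.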
Page taxonomy linking heuristics • 1. Crowdsourced (1.338M) • 2. Category (1.603M) • 3. Multiword (65K) • 4. Monosemous (161K) • 5. Distributional (561K)
The story so far • Phase 1: from the noisy page graph to the page taxonomy
Phase 2: the Bitaxonomy algorithm
The Bitaxonomy algorithm • The information available in the two taxonomies is mutually beneficial; • At each step exploit one taxonomy to update the other and vice versa; • Repeat until convergence.
The Bitaxonomy algorithm. Starting from the page taxonomy. [Figure: pages Real Madrid F.C., Atlético Madrid and Football team; categories Football clubs in Madrid, Football clubs and Football teams; one "is a" edge.]
The Bitaxonomy algorithm. Exploit the cross links to infer hypernym relations in the category taxonomy. [Figure: pages Real Madrid F.C., Atlético Madrid and Football team; categories Football clubs in Madrid, Football clubs and Football teams; two "is a" edges.]
The Bitaxonomy algorithm. Take advantage of cross links to infer back is-a relations in the page taxonomy. [Figure: pages Real Madrid F.C., Atlético Madrid and Football team; categories Football clubs in Madrid, Football clubs and Football teams; three "is a" edges.]
The Bitaxonomy algorithm. Use the relations found in the previous step to infer new hypernym edges. [Figure: pages Real Madrid F.C., Atlético Madrid and Football team; categories Football clubs in Madrid, Football clubs and Football teams; four "is a" edges.]
The Bitaxonomy algorithm. Mutual enrichment of both taxonomies until convergence. [Figure: pages Real Madrid F.C., Atlético Madrid and Football team; categories Football clubs in Madrid, Football clubs and Football teams; four "is a" edges.]
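The alternating update walked through above can be sketched as a small fixed-point loop. This is a simplified illustration, not the authors' algorithm: the voting rule (follow cross links, take the hypernym on the other side, follow cross links back, pick the most voted candidate), the function names, and the cross links in the toy data are all assumptions made for the example. It also does nothing to prevent cycles, which a real system would need to handle.

```python
from collections import Counter


def infer(targets, hyper_other, links_out, links_back, hyper_self):
    """Assign a hypernym to each target that lacks one, by voting via the other taxonomy.

    Returns the number of new is-a edges added to hyper_self.
    """
    added = 0
    for node in targets:
        if node in hyper_self:
            continue
        votes = Counter()
        for other in links_out.get(node, ()):        # cross links to the other side
            hypernym = hyper_other.get(other)        # is-a edge on the other side
            if hypernym is None:
                continue
            for candidate in links_back.get(hypernym, ()):  # cross links back
                if candidate != node:
                    votes[candidate] += 1
        if votes:
            hyper_self[node] = votes.most_common(1)[0][0]
            added += 1
    return added


def bitaxonomy(page_hyper, cat_hyper, page_cats, cat_pages):
    """Alternate category and page updates until no new edge is found."""
    while True:
        added = infer(list(cat_pages), page_hyper, cat_pages, page_cats, cat_hyper)
        added += infer(list(page_cats), cat_hyper, page_cats, cat_pages, page_hyper)
        if added == 0:
            return


# Toy cross links (assumed for illustration, loosely following the slide figure).
page_cats = {
    "Real Madrid F.C.": ["Football clubs in Madrid"],
    "Atlético Madrid": ["Football clubs in Madrid", "Football clubs"],
    "Football team": ["Football teams"],
}
cat_pages = {
    "Football clubs in Madrid": ["Real Madrid F.C.", "Atlético Madrid"],
    "Football clubs": ["Atlético Madrid"],
    "Football teams": ["Football team"],
}
page_hyper = {"Real Madrid F.C.": "Football team"}  # the starting page taxonomy
cat_hyper = {}

bitaxonomy(page_hyper, cat_hyper, page_cats, cat_pages)
print(page_hyper)
print(cat_hyper)
```

On this toy input the single starting edge first yields a category edge, that edge yields a page edge for Atlético Madrid, and the new page edge in turn yields one more category edge before the loop stops.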
Page taxonomy evaluation (cont'd): an appreciable 3% increase in recall and coverage, with unchanged precision
Category taxonomy refinement. Some categories are affected by structural problems, e.g. no pages associated! [Figure: categories Comics characters, Comics characters by protagonist and Garfield characters.]
Category taxonomy refinement • 3 refinement procedures to obtain broader coverage for categories • Single super category • Sub-categories • Super-categories
Single super category • This category has only 1 outgoing edge • So we promote its only super category to hypernym. [Figure: categories Fictional characters by medium, Comics characters, Animated characters, Animation, Comics characters by protagonist, Animated television characters by series and Garfield characters.]
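The single-super-category rule is simple enough to sketch directly. The function name and the super-category edges in the toy data are hypothetical; they only illustrate the rule and are not taken from the slide figure.

```python
def promote_single_super(cat_supers, cat_hyper):
    """For every uncovered category with exactly one super category,
    promote that super category to hypernym. Returns the promoted edges."""
    promoted = {}
    for category, supers in cat_supers.items():
        if category not in cat_hyper and len(supers) == 1:
            promoted[category] = supers[0]
    cat_hyper.update(promoted)
    return promoted


# Hypothetical super-category edges, for illustration only.
cat_supers = {
    "Comics characters by protagonist": ["Comics characters"],
    "Garfield characters": [
        "Comics characters by protagonist",
        "Animated television characters by series",
    ],
}
cat_hyper = {}
promoted = promote_single_super(cat_supers, cat_hyper)
print(promoted)
```

Categories with two or more super categories are left untouched here, since the rule offers no way to choose among them.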
Sub-categories. Focus on subcategories which have already been covered! [Figure: categories Comics by company, Comics characters, Comics characters by company, Comics titles by company, DC Comics characters, Marvel Comics characters and Disney comics.]
Sub-categories. Focus on subcategories which have already been covered! Annotations: 2 paths ending in v; only 1 path ending in u. [Figure: categories Comics by company, Comics characters, Comics characters by company, Comics titles by company, DC Comics characters, Marvel Comics characters and Disney comics.]
Category taxonomy evaluation: coverage. +50% categories covered! [Plot: coverage after each refinement procedure: 1SUP, SUB, SUPER.]
Category taxonomy evaluation: P & R. Annotations: 86%; +35% recall. [Plot: values over iterations, with the refinement procedures 1SUP, SUB and SUPER marked.]
Experimental setup • We created 2 datasets: • 1000 randomly sampled pages; • 1000 randomly sampled categories. • Each item was annotated with the most suitable generalization (lemma+page or category).
Competitors: WikiNet, MENTA, WikiTaxonomy. [Figure: competitors arranged by whether they cover pages, categories, or both.]
Measures • We calculated typical measures to assess the quality of all the taxonomies: • Precision • Recall • Coverage • Specificity • Granularity
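The slides do not define the measures, so the sketch below uses common definitions as an assumption: precision is correct answers over answers given, recall is correct answers over gold items, and coverage is items with any answer over gold items. Specificity and granularity are left out because their definitions cannot be inferred from the slides. All names and the toy data are hypothetical.

```python
def evaluate(gold, predicted):
    """Compute precision, recall and coverage under the assumed definitions.

    gold:      item -> set of acceptable hypernyms
    predicted: item -> proposed hypernym (items without an answer are absent)
    """
    answered = [item for item in gold if item in predicted]
    correct = sum(1 for item in answered if predicted[item] in gold[item])
    precision = correct / len(answered) if answered else 0.0
    recall = correct / len(gold) if gold else 0.0
    coverage = len(answered) / len(gold) if gold else 0.0
    return precision, recall, coverage


# Toy data: 4 annotated items, 3 answered, 2 of them correctly.
gold = {"a": {"x"}, "b": {"y"}, "c": {"z"}, "d": {"w"}}
predicted = {"a": "x", "b": "q", "c": "z"}
precision, recall, coverage = evaluate(gold, predicted)
print(precision, recall, coverage)
```

With these definitions a taxonomy can trade coverage for precision by declining to answer, which is why the measures are reported together.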
Category taxonomy comparison: specificity measure