Encyclopedic Knowledge Patterns from Wikipedia Links. Andrea Giovanni Nuzzolese (1,2) nuzzoles@cs.unibo.it. Aldo Gangemi (1) aldo.gangemi@cnr.it. Valentina Presutti (1) valentina.presutti@cnr.it. Paolo Ciancarini (1,2) ciancarini@cs.unibo.it. ISWC 2011 Bonn, Germany 26 October 2011.
Encyclopedic Knowledge Patterns from Wikipedia Links • Andrea Giovanni Nuzzolese (1,2) nuzzoles@cs.unibo.it • Aldo Gangemi (1) aldo.gangemi@cnr.it • Valentina Presutti (1) valentina.presutti@cnr.it • Paolo Ciancarini (1,2) ciancarini@cs.unibo.it • ISWC 2011, Bonn, Germany, 26 October 2011 • (1) Semantic Technology Laboratory ISTC-CNR • (2) Dipartimento di Scienze dell’Informazione, Università di Bologna • Work supported by the IKS project (EU FP7)
Origin of modern frames and knowledge patterns • «When one encounters a new situation (or makes a substantial change in one's view of the present problem) one selects from memory a structure called a Frame. This is a remembered framework to be adapted to fit reality by changing details as necessary … a frame is a data-structure for representing a stereotyped situation» (Minsky 1975) • Frames, schemas, scripts … «These large-scale knowledge configurations supply top-down input for a wide range of communicative and interactive tasks … the availability of global patterns of knowledge cuts down on non-determinacy enough to offset idiosyncratic bottom-up input that might otherwise be confusing» (Beaugrande, 1980)
Motivations • Improving knowledge exploration and summarization by: • Empirically discovering invariances in the conceptual organization of knowledge – encyclopedic knowledge patterns – from Wikipedia's crowd-sourced page links • Understanding the most intuitive way of selecting relevant entities used to describe a given entity • Identifying the typical / atypical types of things that people use for describing other things
Outline • What we did • Materials and methods • Results • Evaluation • Conclusions and future work
What we did (1/3) • We have proved that Wikipedia page links are a useful resource for studying cognitive aspects of human-knowledge interaction • We have shown that knowledge patterns can be empirically discovered
What we did (2/3) • We have produced 231 encyclopedic knowledge patterns (http://www.ontologydesignpatterns.org/ekp/owl/) • Available as OWL2 ontologies based on DBpedia ontology types • We have evaluated our method to select typical / atypical patterns through a user study
Aemoo • Exploratory search application based on EKP • Cf. Semantic Web Challenge presentation after lunch
Input data • Wikipedia page links generate 107.9M triples • As a comparison, infobox-based triples are 13.6M, including data value triples (9.4M) • “Unmapped” object value triples are only 7% of page links • Paths are used to discover Encyclopedic Knowledge Patterns • Path: Pi,j = [Si, p, Oj] • Such patterns should bring out the most typical types of things that the Wikipedia crowd uses to describe a resource of a given type • [Figure: paths from dbpo:MusicalArtist: linksToMusicalArtist (to dbpo:MusicalArtist), linksToPlace (to dbpo:Place), linksToOrganisation (to dbpo:Organisation)]
Encyclopedic Knowledge Patterns • An Encyclopedic Knowledge Pattern (EKP) is discovered from the paths emerging from Wikipedia page link invariances • EKPs are represented as OWL2 ontologies
Collecting paths from wikilinks • [Figure: dbpo:wikiPageWikiLink links among db:Mickey_Mouse, db:Minnie_Mouse and db:The_Walt_Disney_Company are lifted, via rdf:type and rdfs:subClassOf, to paths between the types dbpo:FictionalCharacter, dbpo:Company, dbpo:Person, dbpo:Organisation and owl:Thing]
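The lifting of wikilinks to type-level paths can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the function name `collect_paths`, the toy link list and the rdf:type assignments are assumptions made for the example, and the sketch uses only the asserted type of each resource, ignoring the rdfs:subClassOf lifting to superclasses.

```python
from collections import defaultdict

# Toy data; the type assignments below are illustrative assumptions.
TYPES = {
    "db:Mickey_Mouse": "dbpo:FictionalCharacter",
    "db:Minnie_Mouse": "dbpo:FictionalCharacter",
    "db:The_Walt_Disney_Company": "dbpo:Company",
}
WIKILINKS = [
    ("db:Mickey_Mouse", "db:Minnie_Mouse"),
    ("db:Mickey_Mouse", "db:The_Walt_Disney_Company"),
]


def collect_paths(wikilinks, types):
    """Map each path (subject type, object type) to the subjects that instantiate it."""
    paths = defaultdict(set)
    for subj, obj in wikilinks:
        s_type, o_type = types.get(subj), types.get(obj)
        if s_type is None or o_type is None:
            continue  # skip links with an untyped endpoint
        paths[(s_type, o_type)].add(subj)
    return dict(paths)


PATHS = collect_paths(WIKILINKS, TYPES)
```

Keeping the set of subject resources per path (rather than a bare count) makes it straightforward to derive per-path indicators afterwards.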
Paths and indicators • Emerging paths are stored in RDF according to the “Knowledge Architecture” vocabulary • Cf. our COLD2011 paper “Extracting core knowledge from linked data” • Paths and types are associated with a set of indicators
nRes(dbpo:MusicalArtist) • Number of resources having type dbpo:MusicalArtist • [Figure: Anthony_Kiedis, Michael_Jackson, Jackie_Jackson, Chad_Smith, John_Lennon and Paul_McCartney, each linked by rdf:type to dbpo:MusicalArtist] • PREFIX dbpo: <http://dbpedia.org/ontology/>
nSubjectRes(Pi,j) • Number of distinct resources that participate in a path as subjects • [Figure: dbpo:wikiPageWikiLink links from subjects of type Si = dbpo:MusicalArtist (Michael_Jackson, Jackie_Jackson, Dave_Grohl, John_Lennon, Paul_McCartney) to objects of type Oj = dbpo:Band (Jackson_5, Nirvana, Foo Fighters, Beatles)] • PREFIX dbpo: <http://dbpedia.org/ontology/>
pathPopularity(Pi,j, Si) • pathPopularity(Pi,j, Si) = nSubjectRes(Pi,j) / nRes(Si) • That is, the fraction of resources of type Si that participate in the path as subjects • [Figure: the dbpo:MusicalArtist and dbpo:Band example, extended with Madonna, Prince, Keith_Jarrett and Charlie_Parker]
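The three indicators can be computed on a toy graph as in the sketch below. The helper names (`n_res`, `n_subject_res`, `path_popularity`) and the data are assumptions for illustration; in particular, the links are invented to produce a round number and are not taken from DBpedia.

```python
# Toy rdf:type assignments and wikilinks (illustrative assumptions only).
TYPES = {
    "Michael_Jackson": "dbpo:MusicalArtist",
    "Jackie_Jackson": "dbpo:MusicalArtist",
    "Dave_Grohl": "dbpo:MusicalArtist",
    "Madonna": "dbpo:MusicalArtist",
    "Jackson_5": "dbpo:Band",
    "Nirvana": "dbpo:Band",
    "Foo_Fighters": "dbpo:Band",
}
LINKS = [
    ("Michael_Jackson", "Jackson_5"),
    ("Jackie_Jackson", "Jackson_5"),
    ("Dave_Grohl", "Nirvana"),
    ("Dave_Grohl", "Foo_Fighters"),  # same subject twice: counted once
]


def n_res(types, cls):
    """nRes(S_i): number of resources typed as cls."""
    return sum(1 for t in types.values() if t == cls)


def n_subject_res(links, types, s_type, o_type):
    """nSubjectRes(P_ij): distinct subjects of type s_type linking to an o_type object."""
    return len({s for s, o in links
                if types.get(s) == s_type and types.get(o) == o_type})


def path_popularity(links, types, s_type, o_type):
    """pathPopularity(P_ij, S_i) = nSubjectRes(P_ij) / nRes(S_i)."""
    total = n_res(types, s_type)
    return n_subject_res(links, types, s_type, o_type) / total if total else 0.0


POPULARITY = path_popularity(LINKS, TYPES, "dbpo:MusicalArtist", "dbpo:Band")
```

In this toy graph three of the four musical artists link to at least one band, so the path popularity is 3/4.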
Boundaries of EKPs • An EKP(Si) is a set of paths, such that Pi,j ∈ EKP(Si) ⟺ pathPopularity(Pi,j, Si) ≥ t • t is a threshold, under which a path is not included in an EKP • How to get a good value for t?
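The membership rule is a simple filter over the paths of one subject type. In the sketch below, the function name `ekp`, the popularity values and the threshold value passed in are all invented for illustration; only the rule itself (keep a path if its popularity is at least t) comes from the definition above.

```python
def ekp(popularity_by_object_type, t):
    """Keep the paths whose pathPopularity is at least t, most popular first."""
    kept = {o: p for o, p in popularity_by_object_type.items() if p >= t}
    return dict(sorted(kept.items(), key=lambda item: item[1], reverse=True))


# Hypothetical popularity values for paths starting at one subject type.
TOY_POPULARITY = {
    "dbpo:Place": 0.30,
    "dbpo:Band": 0.42,
    "dbpo:Drug": 0.01,
    "dbpo:Album": 0.25,
}
TOY_EKP = ekp(TOY_POPULARITY, t=0.2)
```

Raising t shrinks the pattern towards its core paths, while lowering it admits more of the long tail.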
Boundary induction * 40 reasonably covers most “core” path popularity values, as well as many of the unusual ones.
k-means clustering on Path Popularity • [Figure: sample distribution of pathPopularity for DBpedia paths; the x-axis indicates how many paths (on average) are above a certain value t for pathPopularity] • 3 small clusters with ranks above 22.67% • 1 big cluster (4-cluster) with ranks below 18.18% • 1 alternative cluster (6-cluster) with ranks below 11.89%
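Clustering scalar popularity values needs only a one-dimensional k-means. The sketch below is a plain Lloyd's algorithm with a deterministic initialisation; it is not the authors' setup (the slides do not state the tool or the initialisation used), and the function name `kmeans_1d` and the sample values are assumptions.

```python
def kmeans_1d(values, k, iterations=100):
    """Lloyd's algorithm on scalars; returns (centroids, label per input value)."""
    data = sorted(values)
    # Deterministic initialisation: k evenly spaced points of the sorted data.
    centroids = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    labels = []
    for _ in range(iterations):
        # Assignment step: each value goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: each centroid becomes the mean of its members.
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:
            break  # converged: assignments can no longer change
        centroids = updated
    return centroids, labels


# Invented, well-separated popularity values (highest first).
SAMPLE = [0.60, 0.58, 0.30, 0.28, 0.05, 0.04, 0.03]
CENTROIDS, LABELS = kmeans_1d(SAMPLE, k=3)
```

A candidate threshold t can then be read off as the boundary between two adjacent clusters, for example the smallest value assigned to the higher of the two.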
FrameNet-based threshold • Frame elements (aka semantic roles) roughly correspond to paths of EKPs • The average number of frame elements in FrameNet frames is 9 (mode is 7) • The 9th rank in the ranking of path popularity means is 11.89% • Consistent with k-means clustering with 6 clusters
User study • We still have to prove that the EKPs emerging with our threshold values provide an intuitive schema for organizing knowledge • We have conducted a user study • We asked users to rank the paths associated with a sample set of DBPO classes • We have compared users’ ranks to those emerging from our empirical observations
Experimental setting (1/2) • 17 users divided into two groups • A sample of 12 DBPO classes (6+6) • Domains: social, media, commerce, science, technology, geography, and government
Experimental setting (2/2) • Task: «We want to study the best way to describe things by linking them to other things. For example, if you want to describe a person, you might want to link it to other persons, organizations, places, etc. In other words, what are the most relevant types of things that can be used to describe a certain type of things?» • Relevance value scale: • 1 - The type is irrelevant • 2 - The type is slightly irrelevant • 3 - I am undecided between 2 and 4 • 4 - The type is relevant but can be optional • 5 - The type is relevant and should be used for the description
Inter-rater agreement • Average coefficient of concordance for ranks (Kendall’s W); p < 0.0001 for all values • Kendall’s W range: [0,1], where 0 = no agreement and 1 = complete agreement • Reliability test: Cronbach’s alpha
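For reference, Kendall's W for m raters ranking n items is W = 12 S / (m² (n³ − n)), where S is the sum of squared deviations of the per-item rank sums from their mean. The sketch below implements this textbook form under the assumption that each rater supplies a complete ranking without ties; ratings on a 5-level scale would normally contain ties and need the tie-corrected variant, which is not shown. The function name `kendalls_w` is an assumption.

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance for untied rankings.

    rankings: one list per rater, giving the rank (1..n) assigned to each item.
    """
    m = len(rankings)        # number of raters
    n = len(rankings[0])     # number of items
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((total - mean_sum) ** 2 for total in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


FULL_AGREEMENT = kendalls_w([[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]])
NO_AGREEMENT = kendalls_w([[1, 2, 3], [3, 2, 1]])
```

Identical rankings give W = 1, and two exactly opposite rankings cancel out to W = 0, matching the range stated above.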
How good is Wikipedia as a source of EKPs? • We need to compare Wikipedia-based EKPs to those emerging from the users’ choices • We have to compare pathPopularity(Pi,j,Si) of paths associated with the DBPO sample classes to the relevance scores assigned by the users • We need to define a mapping function between pathPopularity(Pi,j,Si) values and the 5-level scale of relevance scores
Mapping function pathPopularity <-> Likert scale • We use the pathPopularityDBpedia distribution • We divide it into 5 intervals following the hypothesized values of t identified through the clusters
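One way to picture such a mapping is a lookup over four cut points that split the popularity range into five intervals. The actual interval boundaries are not given in this transcript, so the cut points below are assumptions: three reuse the cluster boundaries quoted on the k-means slide (11.89%, 18.18%, 22.67%) and the lowest one (1%) is an invented placeholder. The name `to_relevance` is likewise hypothetical.

```python
from bisect import bisect_right

# Assumed cut points; the first value is an invented placeholder.
CUTS = [0.01, 0.1189, 0.1818, 0.2267]


def to_relevance(path_popularity, cuts=CUTS):
    """Map a pathPopularity value to a relevance score from 1 to 5."""
    # bisect_right counts how many cut points are <= the value.
    return bisect_right(cuts, path_popularity) + 1


SCORES = [to_relevance(p) for p in (0.005, 0.05, 0.15, 0.20, 0.30)]
```

A value equal to a cut point falls into the higher interval, which mirrors the "≥ t" convention used for EKP membership.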
What is the “agreement” between DBpedia and our sample users? • Average multiple correlation (Spearman ρ) between users’ assigned scores and pathPopularityDBpedia-based scores • Pi,j ∈ EKP(Si) ⟺ pathPopularity(Pi,j, Si) ≥ 11% • Spearman ρ range: [−1,1], where −1 = complete disagreement (inverse ranking), 0 = no correlation, +1 = complete agreement • Good correlation • Satisfactory precision
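Spearman's ρ is the Pearson correlation of the two rank vectors, with tied values sharing their average rank. The sketch below shows the pairwise coefficient only; the slides report a multiple correlation, whose exact computation is not described here, so this is an illustration of the basic measure rather than of the authors' analysis. Function names and sample scores are assumptions.

```python
def average_ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of positions pos+1 .. end+1
        for i in order[pos:end + 1]:
            ranks[i] = shared
        pos = end + 1
    return ranks


def spearman_rho(xs, ys):
    """Pearson correlation of the rank vectors (undefined if either input is constant)."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Invented relevance scores for five paths: one user versus the DBpedia-based mapping.
USER_SCORES = [5, 4, 4, 2, 1]
DBPEDIA_SCORES = [5, 5, 3, 2, 1]
RHO = spearman_rho(USER_SCORES, DBPEDIA_SCORES)
```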
OWL2 Formalization of EKPs Higher threshold (4-cluster based) used for inducing owl:someValuesFrom restrictions http://www.ontologydesignpatterns.org/ekp/owl/
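The induction of existential restrictions can be pictured as emitting one owl:someValuesFrom axiom per path above the higher threshold. The sketch below generates Turtle text; it does not reproduce the published EKP ontologies. The `ekp:` prefix, the `linksTo<Type>` property naming (modelled on the linksToPlace style labels in the input-data figure), the function name and the popularity values are assumptions, and prefix declarations are omitted from the output.

```python
def ekp_to_turtle(subject_type, popularity_by_object_type, threshold):
    """Emit one owl:someValuesFrom restriction per path at or above the threshold."""
    axioms = []
    for object_type, popularity in sorted(popularity_by_object_type.items()):
        if popularity < threshold:
            continue
        local_name = object_type.split(":", 1)[1]
        axioms.append(
            f"{subject_type} rdfs:subClassOf [\n"
            f"    a owl:Restriction ;\n"
            f"    owl:onProperty ekp:linksTo{local_name} ;\n"
            f"    owl:someValuesFrom {object_type}\n"
            f"] ."
        )
    return "\n".join(axioms)


# Invented popularity values; 0.1818 is the 4-cluster boundary quoted on the k-means slide.
TURTLE = ekp_to_turtle(
    "dbpo:MusicalArtist",
    {"dbpo:Band": 0.42, "dbpo:Place": 0.30, "dbpo:Drug": 0.01},
    threshold=0.1818,
)
```

Paths that fall between the lower and the higher threshold would still belong to the EKP but would not receive an existential restriction under this reading.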
Related work • Basic patterns (e.g. Yago) or induced schemas (e.g. Völker 2011) extracted from linked data • Top-down design (ODP) or reengineering (FrameNet, CLib) of knowledge patterns • Hybrid NLP-SW work: • frame detection (Coppola et al. 2009) • thesaurus induction (Giuliano et al. 2010) (wikilinks used) • NLP exploitation of Wikipedia, DBpedia, Yago: • lexicon extraction (Zesch et al. 2008) • relation extraction using Wikipedia as background knowledge (Nguyen 2007, Nastase 2010) (wikilinks used) • overview in (Medelyan et al. 2009)
Conclusions • Stable criteria for EKP boundary creation • high correlation of path popularity distributions across subject types • Large consensus among (multicultural) users • high interrater agreement and reliability • Good precision • high correlation between users’ and EKP rankings • Open EKP resource for filtering Wikipedia page links, seeing them through lenses, or creating hybrid NLP-SW applications
Ongoing and future work • Relation labeling: several strategies being investigated • By analogy, infobox schema induction • Relation discovery from Wikipedia text • Multicultural EKPs: early check with Italian Wikipedia • First comparisons give good correlation between path rankings on the same subject types • Other ontologies, e.g. Yago • Cf. paths like [Coca-ColaBrands,BoycottsOfOrganizations] • The long tail and atypical types • EKP-based lenses • DBpedia ontology evaluation