(2) Dipartimento di Scienze dell’Informazione, Università di Bologna

Encyclopedic Knowledge Patterns from Wikipedia Links. Andrea Giovanni Nuzzolese (1,2) nuzzoles@cs.unibo.it. Aldo Gangemi (1) aldo.gangemi@cnr.it. Valentina Presutti (1) valentina.presutti@cnr.it. Paolo Ciancarini (1,2) ciancarini@cs.unibo.it. ISWC 2011 Bonn, Germany 26 October 2011.

Presentation Transcript


  1. Encyclopedic Knowledge Patterns from Wikipedia Links • Andrea Giovanni Nuzzolese (1,2), nuzzoles@cs.unibo.it • Aldo Gangemi (1), aldo.gangemi@cnr.it • Valentina Presutti (1), valentina.presutti@cnr.it • Paolo Ciancarini (1,2), ciancarini@cs.unibo.it • ISWC 2011, Bonn, Germany, 26 October 2011 • (1) Semantic Technology Laboratory, ISTC-CNR • (2) Dipartimento di Scienze dell’Informazione, Università di Bologna • Work supported by the IKS project (EU FP7)

  2. Origin of modern frames and knowledge patterns • «When one encounters a new situation (or makes a substantial change in one's view of the present problem) one selects from memory a structure called a Frame. This is a remembered framework to be adapted to fit reality by changing details as necessary … a frame is a data-structure for representing a stereotyped situation» (Minsky 1975) • Frames, schemas, scripts … «These large-scale knowledge configurations supply top-down input for a wide range of communicative and interactive tasks … the availability of global patterns of knowledge cuts down on non-determinacy enough to offset idiosyncratic bottom-up input that might otherwise be confusing» (Beaugrande, 1980)

  3. Motivations • Improving knowledge exploration and summarization by: • Empirically discovering invariances in the conceptual organization of knowledge – encyclopedic knowledge patterns – from Wikipedia's crowd-sourced page links • Understanding the most intuitive way of selecting relevant entities used to describe a given entity • Identifying the typical / atypical types of things that people use for describing other things

  4. Outline • What we did • Materials and methods • Results • Evaluation • Conclusions and future work

  5. What we did (1/3) • We have proved that Wikipedia page links are a useful resource for studying cognitive aspects of human-knowledge interaction • We have shown that knowledge patterns can be empirically discovered

  6. What we did (2/3) • We have produced 231 encyclopedic knowledge patterns (available at http://www.ontologydesignpatterns.org/ekp/owl/) • Available as OWL2 ontologies based on DBpedia ontology types • We have evaluated our method to select typical / atypical patterns through a user study

  7. Aemoo • Exploratory search application based on EKP • Cf. Semantic Web Challenge presentation after lunch

  8. Input data • Wikipedia page links generate 107.9M triples • As a comparison, infobox-based triples are 13.6M, including data value triples (9.4M) • “Unmapped” object value triples are only 7% of page links • Paths are used to discover Encyclopedic Knowledge Patterns • Such patterns should bring out the most typical types of things that the Wikipedia crowd uses to describe a resource of a given type • Path: Pi,j = [Si, p, Oj] • [Diagram: dbpo:MusicalArtist as subject type, with paths linksToMusicalArtist, linksToPlace and linksToOrganisation leading to dbpo:MusicalArtist, dbpo:Place and dbpo:Organisation]

  9. Encyclopedic Knowledge Patterns • An Encyclopedic Knowledge Pattern (EKP) is discovered from the paths emerging from Wikipedia page link invariances • They are represented as OWL2 ontologies

  10. Materials

  11. Collecting paths from wikilinks • [Diagram: db:Mickey_Mouse (rdf:type dbpo:FictionalCharacter) is connected by dbpo:wikiPageWikiLink to db:Minnie_Mouse (dbpo:FictionalCharacter) and to db:The_Walt_Disney_Company (dbpo:Company); rdfs:subClassOf links lead from these types up to dbpo:Person, dbpo:Organisation and owl:Thing; paths are drawn between the subject types and the object types]
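
The path collection step on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the toy triples mirror the Mickey Mouse diagram, the subclass chain is read off the slide, and the choice to emit a path for every combination of subject and object supertypes is an assumption.

```python
from itertools import product

WIKILINK = "dbpo:wikiPageWikiLink"

# Toy data mirroring the slide; the subclass chain is an assumption for illustration.
RDF_TYPE = {
    "db:Mickey_Mouse": "dbpo:FictionalCharacter",
    "db:Minnie_Mouse": "dbpo:FictionalCharacter",
    "db:The_Walt_Disney_Company": "dbpo:Company",
}
SUBCLASS_OF = {
    "dbpo:FictionalCharacter": "dbpo:Person",
    "dbpo:Person": "owl:Thing",
    "dbpo:Company": "dbpo:Organisation",
    "dbpo:Organisation": "owl:Thing",
}
WIKILINKS = [
    ("db:Mickey_Mouse", "db:Minnie_Mouse"),
    ("db:Mickey_Mouse", "db:The_Walt_Disney_Company"),
]


def type_chain(resource):
    """Return the resource's asserted type followed by all of its superclasses."""
    chain = []
    current = RDF_TYPE.get(resource)
    while current is not None:
        chain.append(current)
        current = SUBCLASS_OF.get(current)
    return chain


def collect_paths(wikilinks):
    """Turn resource-level wikilinks into type-level paths [S_i, p, O_j]."""
    paths = set()
    for subject, obj in wikilinks:
        for s_type, o_type in product(type_chain(subject), type_chain(obj)):
            paths.add((s_type, WIKILINK, o_type))
    return paths


paths = collect_paths(WIKILINKS)
for path in sorted(paths):
    print(path)
```

Keeping the paths in a set means that many wikilinks between resources of the same two types collapse into a single type-level path, which is what the indicators on the following slides count against.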

  12. Paths and indicators • Emerging paths are stored in RDF according to the “Knowledge Architecture” vocabulary • Cf. our COLD2011 paper “Extracting core knowledge from linked data” • Paths and types are associated with a set of indicators

  13. nRes(dbpo:MusicalArtist) • Number of resources having type dbpo:MusicalArtist • [Diagram: Anthony_Kiedis, Michael_Jackson, Jackie_Jackson, Chad_Smith, John_Lennon and Paul_McCartney, each with rdf:type dbpo:MusicalArtist] • PREFIX dbpo: <http://dbpedia.org/ontology/>

  14. nSubjectRes(Pi,j) • [Diagram: subjects Si of type dbpo:MusicalArtist (Dave_Grohl, Michael_Jackson, Jackie_Jackson, John_Lennon, Paul_McCartney) connected by dbpo:wikiPageWikiLink to objects Oj of type dbpo:Band (Jackson_5, Nirvana, Foo Fighters, Beatles)] • PREFIX dbpo: <http://dbpedia.org/ontology/>

  15. nSubjectRes(Pi,j) • Number of distinct resources that participate in a path as subjects • [Same diagram as the previous slide] • PREFIX dbpo: <http://dbpedia.org/ontology/>

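  16. pathPopularity(Pi,j,Si) • pathPopularity(Pi,j,Si) = nSubjectRes(Pi,j) / nRes(Si) • [Diagram: the musical artists linking to bands (Dave_Grohl, Michael_Jackson, Jackie_Jackson, John_Lennon, Paul_McCartney; bands Jackson_5, Nirvana, Foo Fighters, Beatles) shown within the full set of musical artists, which also includes Madonna, Prince, Keith_Jarrett and Charlie_Parker]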
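
The three indicators (nRes, nSubjectRes, pathPopularity) can be computed as follows. This is a sketch on toy data: the resource names come from the slides, but the specific artist-to-band links and the flat type table are illustrative assumptions.

```python
# Toy data: type assignments and wikilinks are illustrative assumptions.
ARTISTS = [
    "Michael_Jackson", "Jackie_Jackson", "Dave_Grohl", "John_Lennon",
    "Paul_McCartney", "Madonna", "Prince", "Keith_Jarrett", "Charlie_Parker",
]
BANDS = ["Jackson_5", "Nirvana", "Foo_Fighters", "Beatles"]

RDF_TYPE = {name: "dbpo:MusicalArtist" for name in ARTISTS}
RDF_TYPE.update({name: "dbpo:Band" for name in BANDS})

WIKILINKS = [
    ("Michael_Jackson", "Jackson_5"),
    ("Jackie_Jackson", "Jackson_5"),
    ("Dave_Grohl", "Nirvana"),
    ("Dave_Grohl", "Foo_Fighters"),
    ("John_Lennon", "Beatles"),
    ("Paul_McCartney", "Beatles"),
]


def n_res(rdf_type, cls):
    """nRes(S_i): number of resources having type cls."""
    return sum(1 for t in rdf_type.values() if t == cls)


def n_subject_res(wikilinks, rdf_type, subject_cls, object_cls):
    """nSubjectRes(P_i,j): distinct subjects of type subject_cls that link
    to at least one object of type object_cls."""
    subjects = {
        s for s, o in wikilinks
        if rdf_type.get(s) == subject_cls and rdf_type.get(o) == object_cls
    }
    return len(subjects)


def path_popularity(wikilinks, rdf_type, subject_cls, object_cls):
    """pathPopularity(P_i,j, S_i) = nSubjectRes(P_i,j) / nRes(S_i)."""
    return (n_subject_res(wikilinks, rdf_type, subject_cls, object_cls)
            / n_res(rdf_type, subject_cls))


popularity = path_popularity(WIKILINKS, RDF_TYPE, "dbpo:MusicalArtist", "dbpo:Band")
print(f"{popularity:.2%}")
```

Dave_Grohl links to two bands but counts once, because nSubjectRes counts distinct subjects rather than links.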

  17. Path Popularity distribution example

  18. Boundaries of EKPs • An EKP(Si) is a set of paths, such that Pi,j ∈ EKP(Si) ⟺ pathPopularity(Pi,j, Si) ≥ t • t is a threshold, under which a path is not included in an EKP • How to get a good value for t?
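
The membership rule for EKP(Si) amounts to a simple filter over path popularity values. In the sketch below the popularity figures and the value of t are placeholders chosen for illustration, not results from the study.

```python
def ekp(popularity_by_object_type, t):
    """Return the object types O_j whose path from S_i has pathPopularity >= t."""
    return {
        object_type
        for object_type, popularity in popularity_by_object_type.items()
        if popularity >= t
    }


# Hypothetical pathPopularity values for paths starting at dbpo:MusicalArtist.
popularity = {
    "dbpo:Band": 0.56,
    "dbpo:Place": 0.40,
    "dbpo:Organisation": 0.15,
    "dbpo:Species": 0.01,
}

core = ekp(popularity, t=0.11)  # t is a placeholder threshold
print(sorted(core))
```

Raising t shrinks the pattern towards its most typical paths, and lowering it lets more atypical paths in.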

  19. Boundary induction * 40 reasonably covers most “core” path popularity values, as well as many of the unusual ones.

  20. Boundary induction * 40 covers most “core” path popularity values, as well as many of the unusual ones.

  22. Boundary induction • .9 * 40 covers most “core” path popularity values, as well as many of the unusual ones.

  25. k-means clustering on Path Popularity • Sample distribution of pathPopularity for DBpedia paths; the x-axis indicates how many paths (on average) are above a certain value t of pathPopularity • 3 small clusters with ranks above 22.67% • 1 big cluster (4-cluster) with ranks below 18.18% • 1 alternative cluster (6-cluster) with ranks below 11.89%
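
One way to derive candidate thresholds from a popularity distribution is one-dimensional k-means, sketched below with plain Lloyd iterations. The slides do not say which k-means implementation or initialization was used, so the deterministic seeding here is an assumption, and the input values are hypothetical percentages rather than the DBpedia data.

```python
def kmeans_1d(values, k, max_iterations=100):
    """Cluster scalar values into k groups; returns clusters ordered low to high."""
    values = sorted(values)
    # Assumption: seed centroids at evenly spaced positions of the sorted data.
    centroids = [values[(2 * i + 1) * len(values) // (2 * k)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(max_iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            clusters[nearest].append(v)
        # Recompute each centroid as its cluster mean; keep it if the cluster is empty.
        updated = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return clusters


# Hypothetical mean pathPopularity values (in %) by rank.
mean_popularity = [40.0, 35.0, 30.0, 13.0, 12.0, 11.0, 1.5, 1.0, 0.5]
clusters = kmeans_1d(mean_popularity, k=3)

# The lowest value of a cluster is a candidate threshold t.
candidate_thresholds = [min(cluster) for cluster in clusters if cluster]
print(candidate_thresholds)
```

Running the same procedure with different k gives alternative boundaries, in the same way the slide contrasts the 4-cluster and 6-cluster solutions.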

  27. FrameNet-based threshold • Frame elements (aka semantic roles) roughly correspond to paths of EKPs • The average number of frame elements in FrameNet frames is 9 (mode is 7) • The 9th rank in the ranking of mean path popularity values is 11.89% • Consistent with k-means clustering with 6 clusters

  28. User study • Now, we still have to prove that EKPs emerging with our threshold values provide an intuitive schema for organizing knowledge → We have conducted a user study • We asked users to rank the paths associated with a sample set of DBPO classes • We have compared users’ ranks to those emerging from our empirical observations

  29. Experimental setting (1/2) • 17 users divided into two groups • A sample of 12 DBPO classes (6+6) • Domains: social, media, commerce, science, technology, geography, and government

  30. Experimental setting (2/2) • Task: We want to study the best way to describe things by linking them to other things. For example, if you want to describe a person, you might want to link it to other persons, organizations, places, etc. In other words, what are the most relevant types of things that can be used to describe a certain type of things? • Relevance value scale: 1 = The type is irrelevant; 2 = The type is slightly irrelevant; 3 = I am undecided between 2 and 4; 4 = The type is relevant but can be optional; 5 = The type is relevant and should be used for the description

  31. Inter-rater agreement • Average coefficient of concordance for ranks (Kendall’s W); for all values p < 0.0001 • Kendall’s W range: [0,1], where 0 = no agreement and 1 = complete agreement • Reliability test: Cronbach’s alpha

  32. How good is Wikipedia as a source of EKPs? • We need to compare Wikipedia-based EKPs to those emerging from the users’ choices • We have to compare pathPopularity(Pi,j,Si) of paths associated with the DBPO sample classes to the relevance scores assigned by the users • We need to define a mapping function between pathPopularity(Pi,j,Si) values and the 5-level scale of relevance scores

  33. Mapping function pathPopularity <-> Likert scale • We use pathPopularityDBpedia distribution • We divide it into 5 intervals following the hypothesized t identified through the clusters
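
A mapping from pathPopularity to the 5-level relevance scale can be written as a lookup over interval boundaries. The slide does not list the five intervals, so the boundaries below are assumptions: the two upper ones reuse the cluster thresholds quoted earlier in the talk (11.89% and 18.18%), and the two lower ones are hypothetical placeholders.

```python
from bisect import bisect_right

# Interval boundaries in percent. The upper two reuse thresholds from the
# clustering slide; the lower two are hypothetical placeholders.
BOUNDARIES = [1.0, 2.0, 11.89, 18.18]


def relevance_score(path_popularity_percent):
    """Map a pathPopularity value (in %) to a relevance score from 1 to 5."""
    return bisect_right(BOUNDARIES, path_popularity_percent) + 1


for value in (0.5, 1.5, 5.0, 11.89, 25.0):
    print(value, relevance_score(value))
```

With `bisect_right`, a value that falls exactly on a boundary is assigned to the higher interval, which matches the "≥ t" convention used for EKP membership.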

  34. What is the “agreement” between DBpedia and our sample users? • Pi,j ∈ EKP(Si) ⇔ pathPopularity(Pi,j, Si) ≥ 11% • Average multiple correlation (Spearman ρ) between users’ assigned scores and pathPopularityDBpedia-based scores • Multiple correlation coefficient (Spearman ρ) between users’ assigned scores and pathPopularityDBpedia-based scores • Spearman ρ range: [−1,1], where −1 = complete inverse agreement, 0 = no correlation, +1 = complete agreement • Good correlation • Satisfactory precision
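
Spearman's ρ between user scores and popularity-based scores can be computed as the Pearson correlation of the two rank vectors, with tied values sharing their average rank. This is a plain-Python sketch of the standard coefficient for one pair of score vectors; the "multiple correlation" aggregation across users is not reproduced, and the example scores are invented.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of 1-based positions pos+1..end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman_rho(x, y):
    """Spearman rank correlation between two equally long score vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for four paths: user relevance vs. popularity-based score.
user_scores = [5, 4, 3, 1]
dbpedia_scores = [5, 5, 3, 2]
print(round(spearman_rho(user_scores, dbpedia_scores), 3))
```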

  35. OWL2 Formalization of EKPs • Higher threshold (4-cluster based) used for inducing owl:someValuesFrom restrictions → http://www.ontologydesignpatterns.org/ekp/owl/

  36. Related work • Basic patterns (e.g. Yago) or induced schemas (e.g. Völker 2011) extracted from linked data • Top-down design (ODP) or reengineering (FrameNet, CLib) of knowledge patterns • Hybrid NLP-SW work: • frame detection (Coppola et al. 2009) • thesaurus induction (Giuliano et al. 2010) → wikilinks used • NLP exploitation of Wikipedia, DBpedia, Yago: • lexicon extraction (Zesch et al. 2008) • relation extraction using Wikipedia as background knowledge (Nguyen 2007, Nastase 2010) → wikilinks used • overview in (Medelyan et al. 2009)

  37. Conclusions • Stable criteria for EKP boundary creation • high correlation of path popularity distributions across subject types • Large consensus among (multicultural) users • high interrater agreement and reliability • Good precision • high correlation between users’ and EKP rankings • Open EKP resource for filtering Wikipedia page links, seeing them through lenses, or creating hybrid NLP-SW applications

  38. Ongoing and future work • Relation labeling: several strategies being investigated • By analogy, infobox schema induction • Relation discovery from Wikipedia text • Multicultural EKPs: early check with Italian Wikipedia • First comparisons give good correlation between path rankings on the same subject types • Other ontologies, e.g. Yago • Cf. paths like [Coca-ColaBrands,BoycottsOfOrganizations] • The long tail and atypical types • EKP-based lenses • DBpedia ontology evaluation
