400 likes | 526 Views
NAGA: Searching and Ranking Knowledge. Gjergji Kasneci Joint work with: Fabian M. Suchanek, Georgiana Ifrim, Maya Ramanath, and Gerhard Weikum. Motivation. Example queries Which politicians are also scientists? Which gods do the Maya and the Greeks have in common?
E N D
NAGA: Searching and Ranking Knowledge Gjergji Kasneci Joint work with: Fabian M. Suchanek, Georgiana Ifrim, Maya Ramanath, and Gerhard Weikum
Motivation • Example queries • Which politicians are also scientists? • Which gods do the Maya and the Greeks have in common? • Keyword queries are too weak to express advanced user intentions such as • concepts, • entity properties • relationships between entities
Motivation • Keyword queries are too weak to express advanced user intentions such as • concepts, • entity properties • relationships between entities • Data is not knowledge. • Data extraction and organization needed
Query: Results: Benjamin Franklin Paul Wolfowitz Angela Merkel … isA isA Scientist Politician $x
Greek god Query: Results: Thunder god Wisdom god Agricultural god … type $z type Mayan god $y type type $x
SYSTEMS Web Universal knowledge NAGA Question Answering & Ranking START Katz et al. TREC 2005 TextRunner Banko et al. IJCAI 2007 Ex DBMS Cafarella et al. CIDR 2007 ALICE Banko et al. K-CAP 2007 Question Answering Information Extraction KnowItAll Etzioni et al. WWW 2004 YAGO Suchanek et al. WWW 2007 BLINKS He et al. SIGMOD 2007 Entity Search Cheng et al. CIDR 2007 Semantic Database (Relational Database, XML(XLinks), RDF) Entity (Keyword) (Proximity) Search Ranking Libra Nie et al. WWW 2007 DISCOVER Histridis et al. VLDB 2002 BANKS Bhalotia et al. ICDE 2002 Querying
Outline • Framework • Data model • Query language • Ranking model • Evaluation • Setting • Metrics • Results
Framework (Data model) Excerpt from YAGO: Suchanek et al. WWW 2007 • Entity-relationship (ER) graph • Node label : entity • Edge label : relation • Edge weight : relation “strength” • Fact • Represented by an edge • Evidence pages for a fact f • Web pages from which f was derived • Computation of fact confidence (i.e. edge weights) : locatedIn Max Planck Institute Germany type type type
Framework (Data model) Excerpt from YAGO: Suchanek et al. WWW 2007 • Entity-relationship (ER) graph • Node label : entity • Edge label : relation • Edge weight : relation “strength” • Fact • Represented by an edge • Evidence pages for a fact f • Web pages from which f was derived • Computation of fact confidence (i.e. edge weights) : locatedIn Max Planck Institute Germany type type type
Framework (Data model) Excerpt from YAGO: Suchanek et al. WWW 2007 • Entity-relationship (ER) graph • Node label : entity • Edge label : relation • Edge weight : relation “strength” • Fact • Represented by an edge • Evidence pages for a fact f • Web pages from which f was derived • Computation of fact confidence (i.e. edge weights) : locatedIn Max Planck Institute Germany type type type
Framework (Data model) Excerpt from YAGO: Suchanek et al. WWW 2007 • Entity-relationship (ER) graph • Node label : entity • Edge label : relation • Edge weight : relation “strength” • Fact • Represented by an edge • Evidence pages for a fact f • Web pages from which f was derived • Computation of fact confidence (i.e. edge weights) : locatedIn Max Planck Institute Germany type type type
Framework (Data model) Excerpt from YAGO: Suchanek et al. WWW 2007 • Entity-relationship (ER) graph • Node label : entity • Edge label : relation • Edge weight : relation “strength” • Fact • Represented by an edge • Evidence pages for a fact f • Web pages from which f was derived • Computation of fact confidence (i.e. edge weights) : locatedIn Max Planck Institute Germany type type type
Framework (Query language) • R : set of relationship labels • RegEx(R) : set of regular expressions over R-labels • E : set of entity labels • V : set of variables • Definition (fact template) A fact template is a triple <e1 r e2> where e1 , e2 EV and r RegEx(R) V. Examples: givenNameOf | familiyNameOf Liu $x $x Albert Einstein Mileva Maric
Framework (Query language) • Definition (NAGA query) A NAGA query is a connected directed graph in which each edge represents a fact template. • Examples 1) Which physicist was born in the same year as Max Planck? isA isA 2) Which politician is also a scientist? Physicist Max Planck $y isA isA Scientist Politician $x bornInYear $x bornInYear 4) Which mountain is located in Africa? loctedIn* isA Mountain Africa $x 3) Which scientist are called Liu? 5) What connects Einstein and Bohr? givenNameOf | familiyNameOf isA Liu Scientist $x * Niels Bohr Albert Einstein
Framework (Query language) • Definition (NAGA query) A NAGA query is a connected directed graph in which each edge represents a fact template. • Examples 1) Which physicist was born in the same year as Max Planck? isA isA 2) Which politician is also a scientist? Physicist Max Planck $y isA isA Scientist Politician $x bornInYear $x bornInYear 4) Which mountain is located in Africa? loctedIn* isA Mountain Africa $x 3) Which scientist are called Liu? 5) What connects Einstein and Bohr? givenNameOf | familiyNameOf isA Liu Scientist $x * Niels Bohr Albert Einstein
Framework (Query language) • Definition (NAGA answer) A NAGA answer is a subgraph of the underlying ER graph that matches the query graph. • Examples 1) Which physicist was born in the same year as Max Planck? isA isA 2) Which mountain is located in Africa? Physicist loctedIn* Max Planck $y isA Mountain Africa $x bornInYear $x bornInYear loctedIn* loctedIn* isA Mountain Africa Tanzania Kilimanjaro isA isA 0.98 0.98 0.96 Physicist 0.96 0.96 Max Planck Mihajlo Pupuin 3) What connects Einstein and Bohr? * 0.97 0.97 Niels Bohr Albert Einstein bornInYear 1858 bornInYear hasWonPrize hasWonPrize Nobel Prize Albert Einstein Niels Bohr 0.95 0.95
Framework (Query language) • Definition (NAGA answer) A NAGA answer is a subgraph of the underlying ER graph that matches the query graph. • Examples 1) Which physicist was born in the same year as Max Planck? isA isA 2) Which mountain is located in Africa? Physicist loctedIn* Max Planck $y isA Mountain Africa $x bornInYear $x bornInYear loctedIn loctedIn isA Mountain Africa Tanzania Kilimanjaro isA isA 0.98 0.98 0.96 Physicist 0.96 0.96 Max Planck Mihajlo Pupin 3) What connects Einstein and Bohr? * 0.97 0.97 Niels Bohr Albert Einstein bornInYear 1858 bornInYear hasWonPrize hasWonPrize Nobel Prize Albert Einstein Niels Bohr 0.95 0.95
isA Einstein $x Einstein isa scientist Einstein isa vegetarian Framework (Ranking model) • Question How to rank multiple matches to the same query? • Ranking desiderata Confidence Correct answers • Certainty of IE • Trust/Authority of source “Max Planck born in Kiel” bornIn (Max_Planck, Kiel) (Source: Wikipedia) “They believe Elvis hides on Mars” livesIn (Elvis_Presley, Mars) (Source: The One and Only King‘s Blog) • Informativeness • prominent results preferred • Frequency of facts in the corpus isA isA • Compactness • Prefer “tightly” connected • answers • Size of the answer graph Einstein vegetarian Tom Cruise bornInYear hasWonPrize Nobel Prize Bohr 1962 hasWonPrize diedInYear
isA Einstein $x Einstein isa scientist Einstein isa vegetarian Framework (Ranking model) • Question How to rank multiple matches to the same query? • Ranking desiderata Confidence Correct answers • Certainty of IE • Trust/Authority of source “Max Planck born in Kiel” bornIn (Max_Planck, Kiel) (Source: Wikipedia) “They believe Elvis hides on Mars” livesIn (Elvis_Presley, Mars) (Source: The One and Only King‘s Blog) • Informativeness • prominent results preferred • Frequency of facts in the corpus isA isA • Compactness • Prefer “tightly” connected • answers • Size of the answer graph Einstein vegetarian Tom Cruise bornInYear hasWonPrize Nobel Prize Bohr 1962 hasWonPrize diedInYear
isA Einstein $x Einstein isa scientist Einstein isa vegetarian Framework (Ranking model) • Question How to rank multiple matches to the same query? • Ranking desiderata Confidence Correct answers • Certainty of IE • Trust/Authority of source “Max Planck born in Kiel” bornIn (Max_Planck, Kiel) (Source: Wikipedia) “They believe Elvis hides on Mars” livesIn (Elvis_Presley, Mars) (Source: The One and Only King‘s Blog) • Informativeness • prominent results preferred • Frequency of facts in the corpus isA isA • Compactness • Prefer “tightly” connected • answers • Size of the answer graph Einstein vegetarian Tom Cruise bornInYear hasWonPrize Nobel Prize Bohr 1962 hasWonPrize diedInYear
isA Einstein $x Einstein isa scientist Einstein isa vegetarian Framework (Ranking model) • Question How to rank multiple matches to the same query? • Ranking desiderata Confidence Correct answers • Certainty of IE • Trust/Authority of source “Max Planck born in Kiel” bornIn (Max_Planck, Kiel) (Source: Wikipedia) “They believe Elvis hides on Mars” livesIn (Elvis_Presley, Mars) (Source: The One and Only King‘s Blog) • Informativeness • prominent results preferred • Frequency of facts isA isA • Compactness • Prefer “tightly” connected • answers • Size of the answer graph Einstein vegetarian Tom Cruise bornInYear hasWonPrize Nobel Prize Bohr 1962 hasWonPrize diedInYear
isA Einstein $x Einstein isa scientist Einstein isa vegetarian Framework (Ranking model) • Question How to rank multiple matches to the same query? • Ranking desiderata Confidence Correct answers • Certainty of IE • Trust/Authority of source “Max Planck born in Kiel” bornIn (Max_Planck, Kiel) (Source: Wikipedia) “They believe Elvis hides on Mars” livesIn (Elvis_Presley, Mars) (Source: The One and Only King‘s Blog) • Informativeness • prominent results preferred • Frequency of facts isA isA • Compactness • Prefer “tightly” connected • answers • Size of the answer graph Einstein vegetarian Tom Cruise bornInYear hasWonPrize Nobel Prize Bohr 1962 hasWonPrize diedInYear
isA Einstein $x Einstein isa scientist Einstein isa vegetarian Framework (Ranking model) • Question How to rank multiple matches to the same query? • Ranking desiderata Confidence Correct answers • Certainty of IE • Trust/Authority of source NAGA exploits language models for ranking “Max Planck born in Kiel” bornIn (Max_Planck, Kiel) (Source: Wikipedia) “They believe Elvis hides on Mars” livesIn (Elvis_Presley, Mars) (Source: The One and Only King‘s Blog) • Informativeness • prominent results preferred • Frequency of facts isA isA • Compactness • Prefer “tightly” connected • answers • Size of the answer graph Einstein vegetarian Tom Cruise bornInYear hasWonPrize Nobel Prize Bohr 1962 hasWonPrize diedInYear
Framework (Ranking model) Statistical Language Models for Document IR [Maron/Kuhns 1960, Ponte/Croft 1998, Lafferty/Zhai 2001] • each doc has LM: generative • prob. distr. with parameters • query q viewed as sample • estimate likelihood that q • is sample of LM of doc d • rank by descending likelihoods • (best „explanation“ of q) d1 ? LM(1) q d2 ? LM(2) MLE: sparseness mixture model Background model (smoothing)
background model Framework (Ranking model) isA Albert Einstein $x • Scoring answers Query q with templates q1q2 … qn , e.g. Given g with facts g1g2 … gn , e.g. We use generative mixture models to compute P[q | g] isA Albert Einstein Physicist using generative mixture model estimated using knowledge base graph structure based on IE accuracy and authority analysis estimated by correlation statistics
isA Albert Einstein Vegetarian isA Albert Einstein Physicist Framework (Ranking model) isA Consider • Informativeness Albert Einstein $x Possible results NAGA Ranking (Informativeness)
isA Albert Einstein Physicist isA Albert Einstein Vegetarian Physicist isA Vegetarian more important than Physicist Albert Einstein isA Vegetarian Framework (Ranking model) isA Consider • Informativeness Albert Einstein $x Possible results BANKS Ranking (Bhalotia et al. ICDE 2002) • Relies only on underlying graph structure • Importance of an entity is proportional to its degree
Evaluation (Setting) • Knowledge graph YAGO (Suchanek et al. WWW 2007) • 16 Million facts • 85 NAGA queries • 55 queries from TREC 2005/2006 • 12 queries from the work on SphereSearch (Graupmann et al. VLDB 2005) • We provided 18 regular expression queries
Evaluation (Setting) • The queries were issued to • Google, • Yahoo! Answers, • START (http://start.csail.mit.edu/), • NAGA (Banks scoring) • relies only on the structure of the underlying graph. (see Bhalotia et al. ICDE 2002) • NAGA (NAGA scoring) • top-10 answers assessed by 20 human judges as relevant, less relevant and irrelevant.
Evaluation (Setting) • The queries were issued to • Google, • Yahoo! Answers, • START (http://start.csail.mit.edu/), • NAGA (Banks scoring) • relies only on the structure of the underlying graph. (see Bhalotia et al. ICDE 2002) • NAGA (NAGA scoring) • top-10 answers assessed by 20 human judges as relevant (2), less relevant (1), and irrelevant (0).
Evaluation (Metrics & Results) • NDCG (normalized discounted cumulative gain) • rewards result lists in which relevant results are ranked higher than less relevant ones • Useful when comparing result lists of different lengths • P@1 • to measure how satisfied the user was on average with the first answer of the search engine • We report the Wilson confidence intervals at =0.95%
Evaluation (Metrics & Results) • NDCG (normalized discounted cumulative gain) • rewards result lists in which relevant results are ranked higher than less relevant ones • Useful when comparing result lists of different lengths • P@1 • to measure how satisfied the user was on average with the first answer of the search engine • We report the Wilson confidence intervals at =0.95%
Evaluation (Metrics & Results) • NDCG (normalized discounted cumulative gain) • rewards result lists in which relevant results are ranked higher than less relevant ones • Useful when comparing result lists of different lengths • P@1 • to measure how satisfied the user was on average with the first answer of the search engine • Wilson confidence intervals computed at =0.95%
NAGA is a search engine for Advanced querying of information in ER graphs NAGA queries NAGA answers We a novel scoring mechanism based on generative language models, Applied to the specific and unexplored setting of ER graphs Incorporating confidence, informativeness, and compactness Viability of the approach demonstrated in comparison to state of the art search engines and QA-Systems. Summary
NAGA is a search engine for Advanced querying of information in ER graphs NAGA queries NAGA answers We propose a novel scoring mechanism based on generative language models, Applied to the specific and unexplored setting of ER graphs Incorporating confidence, informativeness, and compactness Viability of the approach demonstrated in comparison to state of the art search engines and QA-Systems. Summary
NAGA is a search engine for Advanced querying of information in ER graphs NAGA queries NAGA answers We propose a novel scoring mechanism based on generative language models, Applied to the specific and unexplored setting of ER graphs Incorporating confidence, informativeness, and compactness Viability of the approach demonstrated in comparison to state of the art search engines and QA-Systems. Summary
Thank youNAGA:http://www.mpi-inf.mpg.de/~kasneci/nagaYAGO: http://www.mpi-inf.mpg.de/~suchanek/yago