A Generative Retrieval Model for Structured Documents
Le Zhao, Jamie Callan
Language Technologies Institute, School of Computer Science, Carnegie Mellon University
Oct 2008
Background
• Structured documents
  • Author-edited fields
    • Library systems: title, metadata of books
    • Web documents: HTML, XML
  • Automatic annotations
    • Part of speech, named entity, semantic role
• Structured queries
  • Human-written
  • Automatically generated
Example Structured Retrieval Queries
• XML element retrieval
  • NEXI queries (Wikipedia XML):
    a) //article[about(., music)]
    b) //article[about(.//section, music)]//section[about(., pop)]
• Question Answering
  • Indri query (ASSERT-style SRL annotation):
    #combine[sentence](
      #combine[target]( love
        #combine[./arg0]( #any:person )
        #combine[./arg1]( Mary ) ) )
(Slide figures: example documents containing "music" and "pop" sections, and the SRL-annotated sentence "[John] [loves] [Mary]" with arg0/arg1 labels.)
Motivation
• Basis: language model + inference network (the Indri search engine / Lemur)
  • Already supports field retrieval, and indexing and retrieving relations between annotations
  • Flexible query language, so new query forms can be tested quickly
• Main problems
  • Approximate matching (structure & keyword)
  • Evidence combination
• Extension of the keyword retrieval model
  • Approximate structure & keyword matching
  • Combining evidence from inner fields
• Goal: outperform keyword retrieval in precision, through
  • A coherent structured retrieval model & better understanding
  • Better smoothing & guidance for query formulation
  • Finer control via accurate & robust structured queries
Roadmap
• Brief overview of Indri field retrieval
• Existing problems
• The generative structured retrieval model
  • Term- & field-level smoothing
  • Evidence combination alternatives
• Experiments
• Conclusions
Indri Document Retrieval
• Query: #combine(iraq war)
• Scoring scope is the document
• Returns a list of scores for documents
• A language model is built from the scoring scope and smoothed with the collection model
• Because of smoothing, partial matches can also be returned
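As a rough illustration of the scoring described above, here is a minimal sketch of query-likelihood scoring with Dirichlet smoothing against the collection model. The statistics, μ value, and function names are invented for the example; this is not Indri's code.

    from math import log

    def dirichlet_doc_score(query_terms, doc_tf, doc_len, coll_prob, mu=2500):
        """Query likelihood of one document, Dirichlet-smoothed with the
        collection model (a sketch, not Indri's implementation)."""
        score = 0.0
        for t in query_terms:
            p = (doc_tf.get(t, 0) + mu * coll_prob[t]) / (doc_len + mu)
            score += log(p)   # #combine multiplies the term probabilities
        return score

    # Hypothetical statistics for one 400-term document and its collection.
    doc_tf = {"iraq": 3, "war": 5}
    coll_prob = {"iraq": 1e-4, "war": 3e-4}
    print(dirichlet_doc_score(["iraq", "war"], doc_tf, 400, coll_prob))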
Indri Field Retrieval
• Query: #combine[title](iraq war)
• Scoring scope is "title"
• Returns a list of scores for title fields
• A language model is built from the scoring scope (title), smoothed with the document and collection models
• Results on the Wikipedia collection:

    score     document-number  start  end  content
    -1.19104  199636.xml       0      2    Iraq War
    -1.58453  1595906.xml      0      3    Anglo-Iraqi War
    -1.58734  14889.xml        0      3    Iran-Iraq War
    -1.87811  184613.xml       0      4    2003 Iraq war timeline
    -2.07668  2316643.xml      0      5    Canada and the Iraq War
    -2.09957  202304.xml       0      5    Protests against the Iraq war
    -2.23997  2656581.xml      0      6    Saddam's Trial and Iran-Iraq War
    -2.35804  1617564.xml      0      7    List of Iraq War Victoria Cross

• Because of smoothing, partial matches can also be returned
Evidence Combination
• Topic: a document with multiple sections about the Iraq war that discusses Bush's exit strategy
    #combine( #combine[section](iraq war) bush #1(exit strategy) )
  The [section] sub-query could return scores (0.2, 0.2, 0.1, 0.002) for one document
• Some options for combining them:
  • #max (Bilotti et al. 2007): only considers the single best match
  • #or: favors many matches, even if they are weak matches
  • #and: biased against many matches, even if they are good matches
  • #average: favors many good matches, but is hurt by weak matches
  • …
• What about documents that don't contain a section element, but do have a lot of matching terms? (see the sketch below)
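As a rough illustration, the options above can be read as the usual probabilistic combinations of the per-section scores. A minimal sketch using the example scores from this slide (not Indri's internal implementation):

    from math import prod

    section_probs = [0.2, 0.2, 0.1, 0.002]   # example per-section match probabilities

    combined = {
        "max":     max(section_probs),                           # only the best match counts
        "or":      1.0 - prod(1.0 - p for p in section_probs),   # rewards many matches, even weak ones
        "and":     prod(section_probs),                          # punished by any weak match
        "average": sum(section_probs) / len(section_probs),      # good matches help, weak ones drag it down
    }
    for name, value in combined.items():
        print(f"#{name}: {value:.4f}")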
Bias Toward Short Fields
• Topic: Who loves Mary?
    #combine( #combine[target]( Loves
      #combine[./arg0]( #any:person )
      #combine[./arg1]( Mary ) ) )
• PMLE(qi|E) = count(qi, E) / |E|
  • Produces very skewed scores when |E| is small
  • E.g., if |E| = 1, PMLE(qi|E) is either 0 or 1
• This biases toward #combine[target](Loves)
  • [target] usually has length 1, while arg0/arg1 are longer
  • With Jelinek-Mercer smoothing, the ratio between having and not having a [target] match is larger than that for an [arg0/arg1] match
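A small numerical illustration of this bias, assuming Jelinek-Mercer smoothing with λ = 0.4 and an invented collection probability: the score ratio between matching and not matching a length-1 [target] field is an order of magnitude larger than for a length-10 argument field.

    def jm_prob(match_count, field_len, coll_prob, lam=0.4):
        """Jelinek-Mercer: (1 - lam) * MLE + lam * collection probability (sketch)."""
        return (1 - lam) * (match_count / field_len) + lam * coll_prob

    coll_prob = 1e-4   # assumed background probability of the query term
    # Length-1 [target] field: match vs. non-match score ratio
    short_ratio = jm_prob(1, 1, coll_prob) / jm_prob(0, 1, coll_prob)
    # Length-10 argument field containing the term once
    long_ratio = jm_prob(1, 10, coll_prob) / jm_prob(0, 10, coll_prob)
    print(round(short_ratio), round(long_ratio))   # roughly 15000 vs. 1500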
The Generative Structured Retrieval Model
• A new framework for structured retrieval
• A new term-level smoothing method
A New Framework
• Example query: #combine( #combine[section](iraq war) bush #1(exit strategy) )
• The query
  • Traditional view: merely the sampled terms
  • New view: specifies a graphical model, i.e., a generation process
• Scoring scope is the document
  • For each document, calculate the probability of the query under the model
• Sections are used as evidence of relevance for the document
  • A section is a hidden variable in the graphical model
  • In general, inner fields are hidden and are used to score outer fields
• Hidden variables are summed over to produce the final score
  • Averaging the scores from the section fields (uniform prior over sections), as written below
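For the example query, and taking the uniform prior over sections literally, the document score can be read roughly as follows (a sketch in the slide's own notation, where S_D is the set of [section] fields in D; the exact form in the paper may differ):

    P(Q|D) ≈ [ (1/|S_D|) · Σ_{s ∈ S_D} P(iraq war | s, D, C) ] · P(bush | D, C) · P(#1(exit strategy) | D, C)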
A New Framework: Field-Level Smoothing
• Term-level smoothing (traditional)
  • No [section] contains "iraq" or "war"
  • Add "prior terms" to [section]: a Dirichlet prior from the collection model
• Field-level smoothing (new)
  • No [section] field in the document: add a "prior field" (+1 prior field)
(Slide figure: terms in a section S are smoothed with P(w|C) using weight μ and length |S|; sections in a document D are scored with P(w|section, D, C), and one prior field is added to D.)
A New Framework: Advantages
• Soft matching of sections
  • Matches documents even without [section] fields, via "prior fields" ((Bilotti et al. 2007) called these "empty fields")
• Aggregation of all matching fields
  • Heuristics: probabilistic OR, Max, …
  • From our generative model: Probabilistic Average (sketched below)
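A minimal sketch of the probabilistic-average combination with one prior ("empty") field, assuming each section's smoothed match probability has already been computed; it only illustrates how documents without [section] fields still receive a soft-match score.

    def avg_model_score(section_probs, prior_field_prob):
        """Probabilistic average over a document's [section] fields plus one
        prior ('empty') field (a sketch of the idea, not the paper's code)."""
        fields = list(section_probs) + [prior_field_prob]   # the +1 prior field
        return sum(fields) / len(fields)                    # uniform prior over fields

    # Document with matching sections vs. a document with no [section] at all.
    print(avg_model_score([0.2, 0.2, 0.1, 0.002], prior_field_prob=0.01))
    print(avg_model_score([], prior_field_prob=0.01))   # still scored, never zero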
Reduction to the Keyword Likelihood Model
• Assume a [term] tag around each term in the collection
• Assume no document-level smoothing (μd = +∞, λd = 0)
• Then, no matter how many empty smoothing fields are added, the AVG model degenerates to the keyword retrieval model:
    #combine( #combine[term]( u )    ==    #combine( u v )
              #combine[term]( v ) )
  (the same collection-level smoothing, Dirichlet or Jelinek-Mercer, is preserved)
Term-Level Smoothing Revisited
• Two-level Jelinek-Mercer (traditional)
  • Equivalently, a more general parameterization
• Two-level Dirichlet (new)
  • Corrects Jelinek-Mercer's bias toward shorter fields
  • The relative gain from a match becomes independent of field length
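The slide presents the smoothing formulas as figures; the sketch below shows one plausible reading of the two schemes (parameter names, values, and the exact parameterization are assumptions, not necessarily the paper's) and illustrates why the Dirichlet form removes the short-field bias.

    def two_level_jm(tf_f, len_f, tf_d, len_d, p_c, lam_f=0.4, lam_c=0.2):
        """Two-level Jelinek-Mercer: field MLE interpolated with a document
        model that is itself interpolated with the collection model."""
        p_d = (1 - lam_c) * (tf_d / len_d) + lam_c * p_c
        return (1 - lam_f) * (tf_f / len_f) + lam_f * p_d

    def two_level_dirichlet(tf_f, len_f, tf_d, len_d, p_c, mu_f=30, mu_d=2500):
        """Two-level Dirichlet: pseudo-counts mu_f, mu_d control smoothing,
        so a field is not rewarded simply for being short."""
        p_d = (tf_d + mu_d * p_c) / (len_d + mu_d)
        return (tf_f + mu_f * p_d) / (len_f + mu_f)

    # Relative gain of matching a term once, for a length-1 vs. a length-10 field:
    p_c, tf_d, len_d = 1e-4, 2, 300
    for len_f in (1, 10):
        jm = two_level_jm(1, len_f, tf_d, len_d, p_c) / two_level_jm(0, len_f, tf_d, len_d, p_c)
        di = two_level_dirichlet(1, len_f, tf_d, len_d, p_c) / two_level_dirichlet(0, len_f, tf_d, len_d, p_c)
        print(len_f, round(jm), round(di))   # J-M: ~281 vs. ~29; Dirichlet: ~42 for both lengths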
Experiments
• Smoothing
• Evidence combination methods
Datasets
• XML retrieval: INEX 06, 07 (Wikipedia collection)
  • Goal: evaluate evidence combination (and smoothing)
  • Topics (modified): element retrieval → document retrieval, e.g. #combine(#combine[section](wifi security))
  • Assessments (modified): any element relevant → document relevant
  • Smoothing parameters trained on INEX 06 (62 topics), tested on INEX 07 (60 topics)
• Question Answering: TREC 2002, AQUAINT corpus
  • Topics
    • Training: 55 original topics → 367 relevant sentences (new topics)
    • Test: 54 original topics → 250 relevant sentences (new topics)
  • For example,
      Question: "who loves Mary"
      Relevant sentence: "John says he loves Mary"
      Query: #combine[target]( love #combine[./arg1](Mary) )
  • Relevance feedback setup, stronger than (Bilotti et al. 2007)
Effects of Two-Level Dirichlet Smoothing
Table 3. A comparison of two-level Jelinek-Mercer and two-level Dirichlet smoothing on the INEX and QA datasets. *: significance level < 0.04; **: significance level < 0.002; ***: significance level < 0.00001
Optimal Smoothing Parameters
• Optimized with grid search
• The optimal Dirichlet values are related to the average length of the fields being queried
Evidence Combination Methods
• For QA, MAX is best
• For INEX
  • Evaluation at the document level does not discount irrelevant portions of text
  • It is not clear which combination method performs best
Better Evaluation for INEX Datasets
• NDCG
  • Assumptions: the degree of relevance is somehow given; the user spends a similar amount of effort on each document, and effort decreases with log-rank
• With the more informative element-level judgments
  • Degree of relevance for a document = relevance density, the proportion of relevant text (in bytes) in the document
  • Discount lower-ranked relevant documents not by the number of documents ranked ahead, but by the length (in bytes) of the text ranked ahead
  • This effectively discounts irrelevant text ranked ahead (see the sketch below)
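A minimal sketch of the byte-based, relevance-density measure described above; the exact gain and discount functions used in the paper may differ, and normalizing by the ideal ranking's score would give the NDCG form.

    from math import log2

    def byte_discounted_dcg(ranked_docs):
        """DCG-style score: a document's gain is its relevance density
        (relevant bytes / total bytes), and the discount grows with the
        amount of text (in bytes) ranked ahead of it, not with its rank."""
        dcg, bytes_ahead = 0.0, 0
        for rel_bytes, total_bytes in ranked_docs:
            density = rel_bytes / total_bytes
            dcg += density / log2(2 + bytes_ahead / 1024)   # bytes ahead, in KB units
            bytes_ahead += total_bytes
        return dcg

    # Three hypothetical documents as (relevant bytes, total bytes).
    docs = [(4000, 5000), (0, 20000), (1000, 8000)]
    print(byte_discounted_dcg(docs))                         # long irrelevant text ranked early...
    print(byte_discounted_dcg([docs[0], docs[2], docs[1]]))  # ...hurts more than ranking it last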
Measuring the INEX Topics with NDCG
• *: p < 0.007 between AVG and MAX, or between AVG and OR
• No significant difference between AVG and keyword!
Error Analysis for INEX 06 Queries and Correction of INEX 07 Queries
• Two kinds of changes (looking only at the training set)
  • Semantic mismatch with the topic (mainly the keyword query) (22/70)
    • Lacking alternative fields: [image] → [image, figure]
    • Wrong AND/OR semantics: (astronaut AND cosmonaut) → (astronaut OR cosmonaut)
    • Misspellings: VanGogh → Van Gogh
    • Over-restricted query terms using phrases: #1(origin of universe) → #uw4(origin universe)
  • All [article] restrictions → whole document (34/70)
• Proofreading of the test (INEX 07) queries
  • Retrieval results of the queries are not referenced in any way
  • Only looked at the keyword query + topic description
Performance After Query Correction
• df = 30: p < 0.006 for NDCG@10; p < 0.0004 for NDCG@20; p < 0.002 for NDCG@30
Conclusions
• A structured query specifies a generative model for P(Q|D); model parameters are estimated from D, and D is ranked by P(Q|D)
• The best evidence combination strategy is task dependent
• Dirichlet smoothing corrects the bias toward short fields and outperforms Jelinek-Mercer
• Guidance for structured query formulation
• Robust structured queries can outperform keyword queries
Acknowledgements
• Paul Ogilvie
• Matthew Bilotti
• Eric Nyberg
• Mark Hoy
Thanks! Comments & Questions?