Finding Needles in the Haystack: Search and Candidate Generation
A presentation by Everett Coraor
What is Hypothesis Generation?
• Hypothesis generation is the process of producing possible answers to a given question.
• Search of potential documents and passages
• Creation of candidate list
• Scoring of candidates
• Balance between casting a wide net and efficiency
Question Analysis Overview
• ESG (English Slot Grammar) is used for parsing
• Each parse tree contains a headword and a list of modifiers
• Recognition of relations such as actorIn and authorOf
• Identification of the LAT (lexical answer type)
• “Robert Redford and Paul Newman starred in this depression-era grifter flick”
  > actorIn(Robert Redford, flick : focus) + actorIn(Paul Newman, flick : focus)
Searching Unstructured Resources
• Three different question/answer pair relationships
• Document-based searches
  - Correct answer is the title of the justifying document
  - “This country singer was imprisoned for robbery in 1972 and pardoned by Ronald Reagan”
• TIC (title in clue) passage searches
  - The title of the justifying document is present within the question
  - “Aleksander Kwasniewski became the president of this country in 1995”
• Title is in neither the question nor the answer
Search Query Generation
• Full query is constructed, weighting the arguments of relations that involve the focus more heavily
  - (2.0 “Robert Redford”) (2.0 “Paul Newman”) star depression era grifter (1.5 flick)
• LAT-only query is generated to narrow the candidate answer list
  - depression era grifter flick
• Unique entity identification
  - first 20th century US president
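The two queries on this slide can be reproduced with a small sketch. It only shows the string assembly. The function names and parameters are hypothetical, the weights 2.0 and 1.5 are taken from the slide's example, and straight quotes stand in for the slide's curly quotes. How the terms, relation arguments and LAT are extracted from the parse is assumed to happen elsewhere.

```python
def build_full_query(terms, relation_args, lat, arg_weight=2.0, lat_weight=1.5):
    """Assemble a weighted query string (hypothetical helper).

    terms: query terms or phrases, in order
    relation_args: terms that are arguments of a relation involving the focus
    lat: the lexical answer type term
    """
    parts = []
    for term in terms:
        # Multi-word phrases are quoted so they are searched as a unit.
        text = f'"{term}"' if " " in term else term
        if term in relation_args:
            parts.append(f"({arg_weight} {text})")
        elif term == lat:
            parts.append(f"({lat_weight} {text})")
        else:
            parts.append(text)
    return " ".join(parts)


def build_lat_query(lat, modifiers):
    """LAT-only query: the LAT preceded by its modifiers."""
    return " ".join(list(modifiers) + [lat])


full_query = build_full_query(
    ["Robert Redford", "Paul Newman", "star", "depression", "era", "grifter", "flick"],
    relation_args={"Robert Redford", "Paul Newman"},
    lat="flick",
)
lat_query = build_lat_query("flick", ["depression", "era", "grifter"])

print(full_query)
print(lat_query)
```

Keeping the weights as parameters makes it easy to try other values. The slide does not say how the original weights were chosen.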
Document and Passage Search
• Title-Based Document Search
  - Indri search engine is used
  - Separate searches for long and short documents
  - Relevant document list size determined empirically, weighing candidate recall against efficiency
• Passage Search
  - Indri search engine
  - Lucene search engine
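One way to picture the recall-versus-efficiency trade-off for the document list size is to measure recall at each cutoff on a development set and keep the smallest cutoff that reaches a target. This is a minimal sketch under that assumption. The slide does not describe the actual procedure, and the ranks below are invented illustration data, not results.

```python
def recall_at_k(answer_ranks, k):
    """Fraction of questions whose justifying document is ranked within the top k.

    answer_ranks: one entry per question; the 1-based rank of the correct
    document, or None if the search never returned it.
    """
    hits = sum(1 for rank in answer_ranks if rank is not None and rank <= k)
    return hits / len(answer_ranks)


def smallest_k_for_recall(answer_ranks, target, max_k):
    """Smallest list size whose recall reaches the target, or None if none does."""
    for k in range(1, max_k + 1):
        if recall_at_k(answer_ranks, k) >= target:
            return k
    return None


# Invented ranks for ten development questions (illustration only).
ranks = [1, 3, None, 7, 2, 12, 1, 5, None, 4]

for k in (5, 7, 12):
    print(k, recall_at_k(ranks, k))
print(smallest_k_for_recall(ranks, target=0.7, max_k=50))
```

A longer list can only raise recall, but every extra document adds downstream processing cost, which is why the smallest adequate cutoff is chosen.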
Indri Passage Search
• #passage[X : Y]
  - X-word window
  - Shifting Y words at a time
• 20-word passages are scored
  - Passages matching a wider range of search terms score higher
• Treats each passage as a “mini-document” and scores it using the document scoring system
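The windowing idea can be sketched without Indri itself. The code below slides an X-word window forward Y words at a time and ranks the windows by the share of distinct query terms they contain. That coverage score is a simplified stand-in chosen for illustration, not Indri's document scoring model, and the function names are hypothetical. Trailing words that do not fill a whole window are ignored here for brevity.

```python
def sliding_passages(words, window, step):
    """Yield (start_index, passage_words) for each X-word window, shifted Y words."""
    last_start = max(len(words) - window, 0)
    for start in range(0, last_start + 1, step):
        yield start, words[start:start + window]


def coverage_score(passage_words, query_terms):
    """Share of distinct query terms present in the passage (stand-in score)."""
    present = {w.lower() for w in passage_words}
    wanted = {t.lower() for t in query_terms}
    return len(wanted & present) / len(wanted)


def best_passage(text, query_terms, window, step):
    """Return (score, start_index, passage_text) for the top-scoring window."""
    words = text.split()  # whitespace tokenization only; punctuation is not stripped
    scored = [
        (coverage_score(p, query_terms), start, " ".join(p))
        for start, p in sliding_passages(words, window, step)
    ]
    # Highest score wins; ties go to the earliest window.
    return max(scored, key=lambda item: (item[0], -item[1]))


text = "x x x x redford and newman starred x x"
result = best_passage(text, ["redford", "newman"], window=4, step=2)
print(result)
```

Overlapping windows (a step smaller than the window) reduce the chance that a good match is split across a passage boundary.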
Lucene Passage Search
• Lucene scores each passage according to query-independent features
• Sentence offset
  - Proximity to beginning of document
• Sentence length
• Number of named entities
  - Passages with more named entities are scored higher
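The three query-independent features can be combined in a simple weighted sum, sketched below. This is not Lucene code. The feature definitions, the weights and all names are assumptions made for illustration, and the named-entity count is taken as given from a separate recognition step. The slide does not say whether longer or shorter sentences are preferred, so the positive length weight is a guess.

```python
from dataclasses import dataclass


@dataclass
class Sentence:
    text: str
    offset: int          # 0 for the first sentence of the document
    named_entities: int  # count supplied by an external NER step (not shown)


def features(sentence):
    """Query-independent features named on the slide (definitions assumed)."""
    return {
        "offset_proximity": 1.0 / (1 + sentence.offset),  # closer to the start scores higher
        "length": len(sentence.text.split()),             # sentence length in words
        "named_entities": sentence.named_entities,        # more entities scores higher
    }


def query_independent_score(sentence, weights):
    """Weighted sum of the features; the weights are hypothetical."""
    feats = features(sentence)
    return sum(weights[name] * value for name, value in feats.items())


WEIGHTS = {"offset_proximity": 1.0, "length": 0.01, "named_entities": 0.5}

s1 = Sentence("Robert Redford and Paul Newman starred in The Sting", offset=0, named_entities=3)
s2 = Sentence("It was released in 1973", offset=4, named_entities=0)

print(query_independent_score(s1, WEIGHTS))
print(query_independent_score(s2, WEIGHTS))
```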
Searching Structured Resources
• Answer Lookup
• ???
• PRISMATIC search
  - Large-scale lexicalized relation resource
  - Gathers aggregate statistics of syntactic or semantic relations
  - “Unlike most sea animals, in the Sea Horse this pair of sense organs can move independently of one another”
Generating Candidates from Search Results
• Structured search candidates
  - The listed word-relations
• Three methods to obtain unstructured search candidates
• Title of document candidate generation
  - Titles of relevant documents
• Wikipedia Title candidate generation
  - Extracts all noun phrases that are exclusively Wikipedia titles
• Anchor Text candidate generation
  - “Neapolitan pizzas are made with ingredients like San Marzano tomatoes, which grow on the volcanic plains south of Mount Vesuvius…”
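Two of the unstructured methods can be sketched with the standard library. The first matches word n-grams against a set of known titles, a rough substitute for real noun-phrase extraction. The second collects the text inside hyperlinks. The title set, the choice of which phrases are hyperlinked, and all function and class names are hypothetical and only illustrate the idea.

```python
from html.parser import HTMLParser


def title_candidates(passage, titles, max_len=4):
    """Longest-match lookup of word n-grams against a set of document titles."""
    lookup = {t.lower(): t for t in titles}
    words = passage.split()
    found, i = [], 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            span = " ".join(words[i:i + n]).strip(".,;:!?\"'").lower()
            if span in lookup:
                found.append(lookup[span])
                i += n  # skip past the matched title
                break
        else:
            i += 1  # no title starts at this word
    return found


class AnchorTextCollector(HTMLParser):
    """Collects the visible text of every <a> element."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._in_anchor = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor, self._buffer = True, []

    def handle_data(self, data):
        if self._in_anchor:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            text = "".join(self._buffer).strip()
            if text:
                self.anchors.append(text)
            self._in_anchor = False


plain = ("Neapolitan pizzas are made with ingredients like San Marzano tomatoes, "
         "which grow on the volcanic plains south of Mount Vesuvius")
hypothetical_titles = {"San Marzano", "Mount Vesuvius"}
by_title = title_candidates(plain, hypothetical_titles)

# Which phrases carry links is assumed for this example.
linked = ('Neapolitan pizzas are made with ingredients like '
          '<a href="#">San Marzano tomatoes</a>, which grow on the volcanic '
          'plains south of <a href="#">Mount Vesuvius</a>')
collector = AnchorTextCollector()
collector.feed(linked)
collector.close()
by_anchor = collector.anchors

print(by_title)
print(by_anchor)
```

In this sketch anchor text picks up the inflected phrase “San Marzano tomatoes”, which exact title lookup misses.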