Breaking through the syntax barrier: Searching with entities and relations
Soumen Chakrabarti, IIT Bombay
www.cse.iitb.ac.in/~soumen
Wish upon a textbox, 1996
• Your information need here
Chakrabarti 2005
Wish upon a textbox, 1998
• “A rising tide of data lifts all algorithms”
• Your information need here
Wish upon a textbox, post-IPO
• Now indexing 4,285,199,774 pages
• Same interface, therefore same 2-word queries
• Mind-reading wizard-in-black-box saves the day
• Your information need (still) here
If music had been invented ten years ago along with the Web, we would all be playing one-string instruments (and not making great music).
– Udi Manber, A9.com, plenary speech, WWW 2004
Examples of “great music”…
• The commercial angle: they’ll call you
  • Want to buy X, find reviews and prices
  • Cheap tickets to Y
• Noun-phrase + Pagerank saves the day
  • Find info about diabetes
  • Find the homepage of Tom Mitchell
• Searching vertical portals: garbage control
  • Searching Citeseer or IMDB through Google
• Someone out there had the same problem
  • +adaptec +aha2940uw +lilo +bios
… and not-so-great music
• Which produces better responses?
  • Opera fails to connect to secure IMAP tunneled through SSH
  • opera connect imap ssh tunnel
• Unable to express many details of information need
  • Opera the email client, not a kind of music
  • The problem is with Opera, not ssh, imap, applet
  • “Secure” is an attribute of imap, but may not juxtapose
Why telegraphic queries fail
• Information need relates to entities and relationships in the real world
• But the search engine gets only strings
• Risk of over- or under-specified queries
  • Never know true recall
  • No time to deal with poor precision
• Query word distribution differs dramatically from the corpus distribution
• Query is inherently incomplete
  • Fix some known info, look for unknown info
Past the syntax barrier: early steps
• Taking the question apart
  • Question has known parts and unknown “slots”
  • Query-dependent information extraction (IE)
• Compiling relations from the Web
  • is-instance-of (is-a), is-subclass-of
  • is-part-of, has-attribute
• Exploit patterns that occur “naturally”…
  • …rather than architect knowledge bases and precise inference mechanisms
• Can only go so far, but Web search standards are low to begin with…
Atypes and ground constants
• Specialize the given domain to a token related to ground constants in the query
• What animal is Winnie the Pooh?
  • instance-of(“animal”) NEAR “Winnie the Pooh”
• When was television invented?
  • instance-of(“time”) NEAR “television” NEAR synonym(“invented”)
• FIND x NEAR GroundConstants(question) WHERE x IS-A Atype(question)
  • Ground constants: Winnie the Pooh, television
  • Atypes: animal, time
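The FIND template above can be sketched as a tiny question-decomposition routine. This is a toy illustration, not the system described in the slides: the wh-word-to-atype map and the stopword list are invented for the example.

```python
# Toy sketch: map a leading wh-phrase to a coarse atype, and treat the
# remaining content words as ground constants. Both tables are assumptions.
WH_ATYPE = {"what animal": "animal", "when": "time", "who": "person"}
STOPWORDS = {"what", "when", "who", "is", "was", "the", "a", "invented"}

def decompose(question):
    """Split a question into (atype, ground constants)."""
    q = question.lower().rstrip("?")
    atype = None
    for wh, t in WH_ATYPE.items():
        if q.startswith(wh):
            atype, q = t, q[len(wh):]
            break
    ground = [w for w in q.split() if w not in STOPWORDS]
    return atype, ground

print(decompose("What animal is Winnie the Pooh?"))  # ('animal', ['winnie', 'pooh'])
print(decompose("When was television invented?"))    # ('time', ['television'])
```

A real system replaces the hand-written tables with the learned feature extractors described on the following slides.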
Taking the question apart
• Atype: the type of the entity that is an answer to the question
• Problem: don’t want to compile a classification hierarchy of entities
  • Laborious, can’t keep up
  • Offline rather than question-driven
• Instead:
  • Set up a very large basis of features
  • “Project” question and corpus onto this basis
Scoring tokens for correct Atypes
• FIND x “NEAR” GroundConstants(question) WHERE x IS-A Atype(question)
• No fixed question or answer type system
• Convert “x IS-A Atype(question)” to a soft match DoesAtypeMatch(x, question)
[Diagram: the question and each passage pass through IE-style surface feature extractors, IS-A feature extractors, and other extractors, producing a question feature vector and a snippet feature vector over answer tokens; a joint distribution is learned over the two]
Features for Atype matching
• Question features: 1-, 2-, 3-token sequences starting with standard wh-words
  • where, when, who, how_X, …
• Passage surface features: hasCap, hasXx, isAbbrev, hasDigit, isAllDigit, lpos, rpos, …
• Passage IS-A features: all generalizations of all noun senses of a token
  • Use WordNet: horse → equid → ungulate, hoofed mammal → placental mammal → animal → … → entity
  • These are node IDs (“synsets”) in WordNet, not strings
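The IS-A feature extractor amounts to walking the hypernym chain and emitting one indicator feature per generalization. A minimal sketch, with a hand-coded fragment of the horse chain standing in for WordNet (a real system would query WordNet synsets, e.g. through NLTK):

```python
# Hand-coded hypernym fragment for illustration only; real systems walk
# WordNet synset IDs, not strings.
HYPERNYM = {
    "horse": "equid", "equid": "ungulate", "ungulate": "placental_mammal",
    "placental_mammal": "animal", "animal": "entity",
}

def isa_features(token):
    """Emit an indicator feature for every generalization of the token."""
    feats, cur = [], token
    while cur in HYPERNYM:
        cur = HYPERNYM[cur]
        feats.append("is-a(%s)" % cur)
    return feats

print(isa_features("horse"))
```

Because every ancestor fires as a separate feature, the learner can discover that, say, is-a(animal) predicts answers to “what animal …” questions without anyone compiling an answer-type hierarchy by hand.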
Supervised learning setup
• Get top 300 passages from an IR engine
  • “Promising but negative” instances
  • Crude approximation to active learning
• For each token, invoke feature extractors
  • Question vector xq, passage vector xp
  • How to represent the combined vector x?
• Label = 1 if the token is in the answer span, 0 otherwise
• Questions and answers taken from logs
Joint feature-vector design
• Obvious “linear” juxtaposition: x = (xq, xp)
  • Does not expose pairwise dependencies
• “Quadratic” form: x = xq ⊗ xp
  • All pairwise products of elements
  • Model has a parameter for every pair
  • Can discount for redundancy in pair info
• If xq (resp. xp) is fixed, what xp (resp. xq) will yield the largest Pr(Y=1|x)?
[Figure: example question-feature/synset pairings, e.g. how_far ↔ region#n#3, when ↔ entity#n#1, what_city ↔ …]
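The two joint representations can be contrasted in a few lines. This is a sketch of the general construction, with made-up two- and three-dimensional vectors; the actual feature spaces are far larger.

```python
# Linear juxtaposition vs. the "quadratic" all-pairs product x = xq (x) xp.
def linear_join(xq, xp):
    return xq + xp                    # (|xq| + |xp|)-dimensional concatenation

def quadratic_join(xq, xp):
    # One coordinate per (question feature, passage feature) pair,
    # so the model gets a parameter for every pair.
    return [qi * pj for qi in xq for pj in xp]

xq = [1.0, 0.0, 0.0]   # e.g. question fires "when" among three wh-features
xp = [1.0, 0.5]        # e.g. token fires hasDigit, weak time-sense feature
print(len(linear_join(xq, xp)))      # 5 dimensions
print(len(quadratic_join(xq, xp)))   # 6 dimensions
```

Only the nonzero products survive, so the quadratic form exposes exactly which question feature co-fired with which passage feature, which is what the linear concatenation cannot express.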
Classification accuracy
• Pairing is more accurate than the linear model
• Are the estimated w parameters meaningful?
• Given a question, can return the most favorable answer feature weights
Parameter anecdotes
• Surface and WordNet features complement each other
• General concepts get negative parameters: use in predictive annotation
• Learning is symmetric (Q ↔ A)
Taking the question apart
• Atype: the type of the entity that is an answer to the question
• Ground constants: which question words are likely to appear (almost) unchanged in an answer passage?
• Arises in Web search sessions too
  • Opera login fails
  • problem with login Opera email
  • Opera login accept password
  • Opera account authentication
  • …
Features to identify ground constants
• Local and global features
  • POS of the word and its neighbors (POS@-1, POS@0, POS@1), case info, proximity to wh-word
• Suppose the word is associated with synset set S
  • NumSense: size of S (is the word very polysemous?)
  • NumLemma: average number of lemmas describing s ∈ S (are there many aliases?)
• Model as a sequential learning problem
  • Each token has local context and global features
  • Label: does the token appear near the answer?
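The global features NumSense and NumLemma can be sketched directly. The tiny sense table below is invented for illustration; a real extractor would read synsets and lemmas from WordNet.

```python
# Invented sense inventory: word -> {synset -> lemma list}. For real data,
# query WordNet instead of this hand-written table.
SENSES = {
    "opera": {"opera.n.01": ["opera"], "opera.n.02": ["opera", "browser"]},
    "login": {"login.n.01": ["login", "logon", "sign-on"]},
}

def global_features(word):
    """NumSense: how polysemous is the word? NumLemma: how many aliases per sense?"""
    syns = SENSES.get(word.lower(), {})
    num_sense = len(syns)
    num_lemma = (sum(len(lemmas) for lemmas in syns.values()) / num_sense
                 if num_sense else 0.0)
    return {"NumSense": num_sense, "NumLemma": num_lemma}

print(global_features("login"))   # one sense but three aliases
print(global_features("opera"))   # two senses
```

Intuitively, a highly polysemous word (high NumSense) or one with many aliases (high NumLemma) is a poor ground constant, since the answer passage may use a different surface form.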
Ground constants: sample results
• Global features (IDF, NumSense, NumLemma) essential for accuracy
  • Best F1 accuracy with local features alone: 71–73%
  • With local and global features: 81%
• Decision trees better than logistic regression
  • F1 = 81% as against LR F1 = 75%
  • Intuitive decision branches
Summary of the Atype strategy
• “Basis” of atypes A; a ∈ A could be a synset, a surface pattern, or a feature of a parse tree
• Question q is “projected” to a vector (wa : a ∈ A) in atype space by learning a conditional model
  • If q is “when…” or “how long…”, then whasDigit and wtime_period#n#1 are large, wregion#n#1 is small
• Each corpus token t has associated indicator features a(t) for every a
  • hasDigit(3,000) = is-a(region#n#1)(Japan) = 1
Reward proximity to ground constants
• A token t is a candidate answer if its atype score is large: combine the projection of the question to “atype space” with the atype indicator features of the token
• Hq(t): reward tokens appearing “near” ground constants matched from the question
• Order tokens by decreasing combined score
• Example: …the armadillo, found in Texas, is covered with strong horny plates
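A minimal scoring sketch: multiply each token's atype score by a proximity reward that decays with distance to the nearest matched ground constant. The geometric decay and all the weights below are illustrative assumptions, not the slides' actual Hq.

```python
# Rank tokens by atype score times a proximity reward Hq(t).
def score_tokens(tokens, atype_score, ground_positions, decay=0.5):
    scored = []
    for pos, tok in enumerate(tokens):
        if ground_positions:
            d = min(abs(pos - g) for g in ground_positions)
            prox = decay ** d         # closer to a ground constant is better
        else:
            prox = 0.0
        scored.append((atype_score.get(tok, 0.0) * prox, tok))
    return sorted(scored, reverse=True)

# "Where is the armadillo found?": ground constant "armadillo" at position 1,
# and suppose the atype model favors the region-sense token "Texas".
tokens = "the armadillo found in Texas is covered with horny plates".split()
print(score_tokens(tokens, {"Texas": 0.9}, [1])[:2])
```

Tokens with a strong atype match but far from every ground constant (and vice versa) are both demoted, which is the point of combining the two signals.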
Evaluation: mean reciprocal rank (MRR)
• nq = smallest rank among answer passages for question q
• MRR = (1/|Q|) Σq∈Q (1/nq)
• Dropping a passage from #1 to #2 is as bad as dropping it from #2 to not reporting it at all
Experiment setup:
• 300 top IR-score passages
• If Pr(Y=1|token) < threshold, reject the token
• If all tokens are rejected, reject the passage
• Points below the diagonal are good
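The MRR formula above is a one-liner; a question whose answer never appears contributes 0, which is why #1 → #2 (a loss of 1/2) costs exactly as much as #2 → unreported (also 1/2).

```python
# Mean reciprocal rank over a set of questions. answer_ranks holds, per
# question, the smallest rank of an answer passage, or None if no answer
# passage was returned at all.
def mrr(answer_ranks):
    return sum(0.0 if r is None else 1.0 / r for r in answer_ranks) / len(answer_ranks)

print(mrr([1, 2, None, 4]))   # (1 + 1/2 + 0 + 1/4) / 4 = 0.4375
```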
Sample results
• Accept all tokens → IR baseline MRR
• Moderate acceptance threshold → non-answer passages eliminated, answer ranks improve
• High threshold → true answers eliminated
  • Another answer gets a poor rank, or rank = ∞
• Additional benefits from proximity filtering
Who provides is-a info?
• Compiled KBs: WordNet, CYC
• Automatic “soft” compilations
  • Google Sets
  • KnowItAll
  • BioText
• Can use these as evidence in scoring answers
Extracting is-instance-of info
• Which researcher built the WHIRL system?
  • WordNet may not know Cohen IS-A researcher
• Google has over 4.2 billion pages
  • “william cohen” on 86,100 pages (p1 = 86.1k/4.2B)
  • researcher on 4.55M pages (p2 = 4.55M/4.2B)
  • +researcher +"william cohen" on 1,730 pages: 18.55× more frequent than expected if independent
• Pointwise mutual information (PMI)
• Can add high-precision, low-recall patterns
  • “cities such as New York” (26,600 hits)
  • “professor Michael Jordan” (101 hits)
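The slide's arithmetic checks out directly from the hit counts. A short calculation, using the approximate 4.2-billion-page index size:

```python
# PMI-style lift from search-engine hit counts: how much more often do the
# two phrases co-occur than independence would predict?
from math import log2

N = 4.2e9                 # approximate number of indexed pages
hits_cohen = 86_100       # "william cohen"
hits_researcher = 4_550_000
hits_both = 1_730         # +researcher +"william cohen"

# p(x, y) / (p(x) * p(y)); the N's partially cancel.
lift = (hits_both / N) / ((hits_cohen / N) * (hits_researcher / N))
pmi = log2(lift)          # pointwise mutual information in bits

print(round(lift, 2))     # 18.55, matching the slide
print(round(pmi, 2))
```

High lift is evidence that "william cohen" IS-A researcher, even though no compiled knowledge base says so.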
Bootstrapping lists of instances
• Hearst 1992, Brin 1997, Etzioni 2004
• A “propose-validate” approach
  • Using existing patterns, generate queries
  • For each web page w returned:
    • Extract potential fact e and assign a confidence score
    • Add the fact to the database if its score is high enough
• Example patterns
  • NP1 {,} {such as | and other | including} NPList2
  • NP1 is a NP2; NP1 is the NP2 of NP3
  • the NP1 of NP2 is NP3
• Start with NP1 = researcher, etc.
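The first pattern family ("NP1 such as NPList2") can be approximated with a regular expression over raw text. This is a deliberately crude sketch: it skips the NP chunking and head-finding that the real extraction rules require, so it only works on clean sentences.

```python
import re

# Crude Hearst-style extractor for "CLASS such as X, Y, and Z".
# No parsing: a single word stands in for NP1, and the list is split
# on commas and and/or. Real systems operate on NP chunks.
def extract_such_as(sentence):
    m = re.search(r"(\w+) such as ([^.]+)", sentence)
    if not m:
        return []
    cls = m.group(1)
    items = [i.strip() for i in
             re.split(r",\s*(?:and|or)\s+|,\s*|\s+(?:and|or)\s+", m.group(2))
             if i.strip()]
    return [(item, cls) for item in items]

print(extract_such_as("cities such as New York, Paris, and Tokyo"))
```

In the propose-validate loop, each extracted (instance, class) pair would then be scored (e.g. by PMI probes as above) before being admitted to the database.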
System details
• The importance of shallow linguistics working together with statistical tests
  • China is a (country)NP in Asia
  • Garth Brooks is a ((country)ADJ (singer)N)NP
• Unary relation example
  • NP1 such as NPList2 & head(NP1) = plural(name(Class1)) & properNoun(head(each(NPList2))) → instanceOf(Class1, head(each(NPList2)))
  • “Head” = head word of the phrase
Compilation performance
• Recall-vs-precision exposes the size and difficulty of the domain
  • “US state” is easy
  • “Country” is difficult
• To improve the signal-to-noise (STN) ratio, stop when the confidence score falls below a threshold
  • Substantially improves recall-vs-precision
Exploiting is-a info for ranking
• Use PMI scores as additional features
• Challenge: make frugal use of expensive inverted-index probes
[Diagram: as before, IE-style surface and WordNet IS-A feature extractors plus the Atype produce question and snippet feature vectors; PMI scores from search-engine probes are added as features before learning the joint distribution]
Concluding messages
• Work much harder on questions
  • Break down into what’s known and what’s not
  • Find fragments of structure when possible
  • Exploit user profiles and sessions
• Perform limited pre-structuring of the corpus
  • Difficult to anticipate all needs and applications
  • Extract graph structure where possible (e.g. is-a)
  • Do not insist on a specific schema
• Design indices and ranking strategies for matching strings and semantic annotations
  • “Tip of the iceberg” under very complex ranking functions