Breaking through the syntax barrier: Searching with entities and relations Soumen Chakrabarti IIT Bombay www.cse.iitb.ac.in/~soumen
Wish upon a textbox, 1996 Your information need here Chakrabarti 2004
“A rising tide of data lifts all algorithms” Wish upon a textbox, 1998 Your information need here
Wish upon a textbox, post-IPO • Now indexing 4,285,199,774 pages • Same interface, therefore same 2-word queries • Mind-reading wizard-in-black-box saves the day Your information need (still) here
If music had been invented ten years ago along with the Web, we would all be playing one-string instruments (and not making great music). Udi Manber, A9.com Plenary speech WWW 2004
Examples of “great music”… • The commercial angle: they’ll call you • Want to buy X, find reviews and prices • Cheap tickets to Y • Noun-phrase + Pagerank saves the day • Find info about diabetes • Find the homepage of Tom Mitchell • Searching vertical portals: garbage control • Searching Citeseer or IMDB through Google • Someone out there had the same problem • +adaptec +aha2940uw +lilo +bios
… and not-so-great music • Which produces better responses? • Opera fails to connect to secure IMAP tunneled through SSH • opera connect imap ssh tunnel • Unable to express many details of information need • Opera the email client, not a kind of music • The problem is with Opera, not ssh, imap, applet • “Secure” is an attribute of imap, but may not juxtapose
Why telegraphic queries fail • Information need relates to entities and relationships in the real world • But the search engine gets only strings • Risk over-/under-specified queries • Never know true recall • No time to deal with poor precision • Query word distribution dramatically different from corpus distribution • Query is inherently incomplete • Fix some info, look for the rest
[Roadmap diagram: axes are “structure in the query” (bag-of-words → Boolean/proximity → SQL/XQuery) and “structure in the corpus” (“free-format text” → HTML → XML → entities+relations).] (1) SQL/XQuery over structured data: broad, deep and ad-hoc schema and data integration is very difficult; many real problems were defined away and “solved” apart from performance issues; so far largely for power-users, queries can be generated by programs. (2) Boolean/proximity over text: the domain of IR; time to analyze queries much more deeply. (3) Entities+relations with no schema: either map to a schema or support queries on uninterpreted graphs; defining, labeling, extracting and ranking answers are the major issues; no universal models; need many more applications.
Past the syntax barrier: early steps • Taking the question apart • Question has known parts and unknown “slots” • Query-dependent information extraction • Compiling relations from the Web • is-instance-of (is-a), is-subclass-of • is-part-of, has-attribute • Graph models for textual data • Searching graphs with keywords and twigs • Global probabilistic graph labeling
Atypes and ground constants • Specialize given domain to a token related to ground constants in the query • What animal is Winnie the Pooh? • instance-of(“animal”) NEAR “Winnie the Pooh” • When was television invented? • instance-of(“time”) NEAR “television” NEAR synonym(“invented”) • FIND x NEAR GroundConstants(question) WHERE x IS-A Atype(question) • Ground constants: Winnie the Pooh, television • Atypes: animal, time
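The FIND … WHERE template above can be sketched as a simple token filter. This is a minimal illustration only: the toy passage, the per-token type sets, and the fixed proximity window are my own assumptions, not the talk's actual system.

```python
def find_candidates(tokens, atype, ground_constants, window=5):
    """tokens: list of (position, word, set of IS-A types for the word)."""
    # Positions where a ground constant from the question was matched
    anchors = [pos for pos, word, _ in tokens if word in ground_constants]
    out = []
    for pos, word, types in tokens:
        # Keep tokens of the required answer type that lie near some anchor
        if atype in types and any(abs(pos - a) <= window for a in anchors):
            out.append(word)
    return out

# Hypothetical passage for "When was television invented?"
passage = [
    (0, "television", {"artifact"}),
    (1, "was", set()),
    (2, "invented", set()),
    (3, "in", set()),
    (4, "1927", {"time"}),
]
print(find_candidates(passage, "time", {"television"}))  # ['1927']
```

The hard part, which the later slides address, is that neither the type sets nor the atype are given crisply; both sides must be learned as soft matches.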
Taking the question apart • Atype: the type of the entity that is an answer to the question • Problem: don’t want to compile a classification hierarchy of entities • Laborious, can’t keep up • Offline rather than question-driven • Instead • Set up a very large basis of features • “Project” question and corpus to basis
Scoring tokens for correct Atypes • FIND x “NEAR” GroundConstants(question) WHERE x IS-A Atype(question) • No fixed question or answer type system • Convert “x IS-A Atype(question)” to a soft match DoesAtypeMatch(x, question) [Architecture diagram: the question and each candidate passage pass through IE-style surface feature extractors and IS-A feature extractors (plus other extractors), yielding a question feature vector and a snippet feature vector over answer tokens; a joint distribution over the two is learned.]
Features for Atype matching • Question features: 1-, 2-, 3-token sequences starting with standard wh-words • where, when, who, how_X, … • Passage surface features: hasCap, hasXx, isAbbrev, hasDigit, isAllDigit, lpos, rpos, … • Passage IS-A features: all generalizations of all noun senses of token • Use WordNet: horse → equid → ungulate → hoofed mammal → placental mammal → animal → … → entity • These are node IDs (“synsets”) in WordNet, not strings
Supervised learning setup • Get top 300 passages from IR engine • “Promising but negative” instances • Crude approximation to active learning • For each token invoke feature extractors • Question vector xq, passage vector xp • How to represent combined vector x? • Label = 1 if token is in answer span, 0 otherwise • Questions and answers from logs
Joint feature-vector design • Obvious “linear” juxtaposition x = (xq, xp) • Does not expose pairwise dependencies • “Quadratic” form x = xq ⊗ xp • All pairwise products of elements • Model has a parameter for every pair • Can discount for redundancy in pair info • If xq (resp. xp) is fixed, what xp (resp. xq) will yield the largest Pr(Y=1|x)? [Diagram pairs question features such as how_far, when, what_city with passage synset features such as region#n#3 and entity#n#1.]
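The “quadratic” joint map is just an outer product of the two feature vectors, so a linear model over it gets one weight per (question-feature, passage-feature) pair. A minimal sketch, with toy dimensions and made-up feature meanings:

```python
def joint_features(xq, xp):
    # "Quadratic" form x = xq ⊗ xp: one feature per
    # (question-feature, passage-feature) pair
    return [q * p for q in xq for p in xp]

xq = [1.0, 0.0, 1.0]   # hypothetical: fires for "when", "how_long" patterns
xp = [0.0, 1.0]        # hypothetical: fires for hasDigit
x = joint_features(xq, xp)
print(x)               # [0.0, 1.0, 0.0, 0.0, 0.0, 1.0]

# A linear model over x therefore carries one parameter per pair:
w = [0.0] * len(x)
score = sum(wi * xi for wi, xi in zip(w, x))
```

Contrast with the “linear” juxtaposition (xq, xp), which has only |xq| + |xp| parameters and cannot express that hasDigit matters specifically for “when” questions.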
Classification accuracy • Pairing more accurate than linear model • Are the estimated w parameters meaningful? • Given question, can return most favorable answer feature weights
Parameter anecdotes • Surface and WordNet features complement each other • General concepts get negative params: use in predictive annotation • Learning is symmetric (QA)
Taking the question apart • Atype: the type of the entity that is an answer to the question • Ground constants: Which question words are likely to appear (almost) unchanged in an answer passage? • Arises in Web search sessions too • Opera login fails • problem with login Opera email • Opera login accept password • Opera account authentication • …
Features to identify ground constants • Local and global features • POS of word (POS@0), POS of adjacent words (POS@-1, POS@+1), case info, proximity to wh-word • Suppose word is associated with synset set S • NumSense: size of S (is word very polysemous?) • NumLemma: average #lemmas describing s ∈ S (are there many aliases?) • Model as a sequential learning problem • Each token has local context and global features • Label: does token appear near answer?
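The two global WordNet features are easy to state concretely. In this sketch a tiny hand-made word → synsets map stands in for WordNet, and the lemma lists are illustrative, not real WordNet content:

```python
# Toy stand-in for WordNet: word -> list of synsets,
# each synset given as its list of lemma names
TOY_WORDNET = {
    "opera": [["opera"], ["opera", "opera_house"], ["Opera"]],
    "login": [["login", "log-in"]],
}

def num_sense(word):
    """NumSense: size of the word's synset set (how polysemous is it?)."""
    return len(TOY_WORDNET.get(word, []))

def num_lemma(word):
    """NumLemma: average number of lemmas per synset (are there many aliases?)."""
    synsets = TOY_WORDNET.get(word, [])
    if not synsets:
        return 0.0
    return sum(len(s) for s in synsets) / len(synsets)

print(num_sense("opera"), num_lemma("login"))  # 3 2.0
```

With the real WordNet these would be computed over the synsets of the token; the intuition is that highly polysemous, alias-rich words make poor ground constants.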
Ground constants: sample results • Global features (IDF, NumSense, NumLemma) essential for accuracy • Best F1 accuracy with local features alone: 71–73% • With local and global features: 81% • Decision trees better than logistic regression • F1 = 81% as against LR F1 = 75% • Intuitive decision branches
Summary of the Atype strategy • “Basis” of atypes A; a ∈ A could be a synset, a surface pattern, or a feature of a parse tree • Question q “projected” to vector (wa : a ∈ A) in atype space via a learned conditional model • If q is “when…” or “how long…”, whasDigit and wtime_period#n#1 are large, wregion#n#1 is small • Each corpus token t has associated indicator features a(t) for every a • hasDigit(3,000) = 1, is-a(region#n#1)(Japan) = 1
Reward proximity to ground constants • A token t is a candidate answer if its atype indicator features a(t) match the question’s projection into atype space • Hq(t): reward tokens appearing “near” ground constants matched from the question • Order tokens by decreasing combined score [Diagram: in the passage “…the armadillo, found in Texas, is covered with strong horny plates”, the projection of the question to “atype space” is matched against the atype indicator features of the token armadillo.]
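Putting the two ingredients together, a candidate token's score combines its atype-weight dot product with the proximity reward Hq(t). The slide does not give the exact combination rule, so the multiplicative form and the toy weights below are my own assumptions:

```python
def answer_score(token_indicators, w, proximity):
    """token_indicators: set of atype features a with a(t) = 1;
    w: the question's projection to atype space (feature -> weight);
    proximity: Hq(t), the reward for being near matched ground constants."""
    atype_match = sum(w.get(a, 0.0) for a in token_indicators)
    # Assumed combination: damp the atype match by the proximity reward
    return atype_match * proximity

# Hypothetical weights for a "when was ... invented?" question
w = {"hasDigit": 1.2, "time_period#n#1": 0.9, "region#n#1": -0.4}

# "1927", two tokens away from the matched ground constant "television"
print(answer_score({"hasDigit", "time_period#n#1"}, w, proximity=1.0 / 2))
```

Tokens are then sorted by decreasing score, and low-scoring tokens (and the passages containing only such tokens) are rejected, as the evaluation slide below describes.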
Evaluation: Mean reciprocal rank (MRR) • nq = smallest rank among answer passages for question q • MRR = (1/|Q|) Σq∈Q (1/nq) • Dropping a passage from #1 to #2 is as bad as dropping it from #2 to not reporting it at all • Experiment setup: 300 top IR-score passages • If Pr(Y=1|token) < threshold, reject the token • If all tokens rejected, reject the passage • Points below the diagonal are good
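The MRR formula is short enough to write out directly; an unreported answer contributes reciprocal rank 0 (rank = ∞), which is what makes a #1 → #2 drop cost as much as a #2 → unreported drop:

```python
def mrr(best_ranks):
    """best_ranks: per question, the smallest rank n_q of an answer
    passage, or None if no answer passage is reported (1/infinity = 0)."""
    return sum(0.0 if r is None else 1.0 / r for r in best_ranks) / len(best_ranks)

# Three questions: answers at rank 1, rank 2, and never reported
print(mrr([1, 2, None]))  # (1 + 1/2 + 0) / 3 = 0.5
```

Note the penalty structure: 1/1 − 1/2 = 1/2 equals 1/2 − 0, matching the slide's remark.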
Sample results • Accept all tokens ⇒ IR baseline MRR • Moderate acceptance threshold ⇒ non-answer passages eliminated, improves answer ranks • High threshold ⇒ true answers eliminated • Another answer gets a poor rank, or rank = ∞ • Additional benefits from proximity filtering
Who provides is-a info? • Compiled KBs: WordNet, CYC • Automatic “soft” compilations • Google Sets • KnowItAll • BioText • Can use as evidence in scoring answers
Extracting is-instance-of info • Which researcher built the WHIRL system? • WordNet may not know Cohen IS-A researcher • Google has over 4.2 billion pages • “william cohen” on 86,100 (p1 = 86.1k/4.2B) • researcher on 4.55M (p2 = 4.55M/4.2B) • +researcher +"william cohen" on 1730: 18.55x more frequent than expected if independent • Pointwise mutual information (PMI) • Can add high-precision, low-recall patterns • “cities such as New York” (26,600 hits) • “professor Michael Jordan” (101 hits)
Bootstrapping lists of instances • Hearst 1992, Brin 1997, Etzioni 2004 • A “propose-validate” approach • Using existing patterns, generate queries • For each web page w returned • Extract potential fact e and assign confidence score • Add fact to database if it has high enough score • Example patterns • NP1 {,} {such as|and other|including} NPList2 • NP1 is a NP2, NP1 is the NP2 of NP3 • the NP1 of NP2 is NP3 • Start with NP1 = researcher etc.
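The extraction half of one such pattern ("NP1 such as NPList2") can be sketched with a regular expression. This is a deliberately crude stand-in: the real systems use shallow NP chunking and head-finding rather than this regex, and the test sentence is invented:

```python
import re

# Crude approximation of "NP1 such as NPList2": a word, then a
# comma-separated list of capitalized names, optionally ending "and X"
PATTERN = re.compile(r"(\w+) such as ((?:[A-Z]\w+, )*[A-Z]\w+(?: and [A-Z]\w+)?)")

def extract_instances(text):
    facts = []
    for m in PATTERN.finditer(text):
        cls = m.group(1)                      # stands in for head(NP1)
        for inst in re.split(r", | and ", m.group(2)):
            facts.append((inst, cls))         # instanceOf(inst, cls)
    return facts

print(extract_instances("We visited cities such as Paris, Berlin and Madrid."))
# [('Paris', 'cities'), ('Berlin', 'cities'), ('Madrid', 'cities')]
```

In the propose-validate loop, each extracted fact would then be scored (e.g. by PMI-style hit-count probes) and added to the database only above a confidence threshold.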
System details • The importance of shallow linguistics working together with statistical tests • China is a (country)NP in Asia • Garth Brooks is a ((country)ADJ (singer)N)NP • Unary relation example: NP1 such as NPList2 & head(NP1) = plural(name(Class1)) & properNoun(head(each(NPList2))) ⇒ instanceOf(Class1, head(each(NPList2))) • “Head” of phrase
Compilation performance • Recall-vs-precision exposes size and difficulty of domain • “US state” is easy • “Country” is difficult • To improve signal-to-noise (STN) ratio, stop when confidence score is lower than threshold • Substantially improves recall-vs-precision
Exploiting is-a info for ranking • Use PMI scores as additional features • Challenge: make frugal use of expensive inverted-index probes [Architecture diagram: as before, question and passage pass through IE-style surface feature extractors and WordNet IS-A feature extractors into a question feature vector and a snippet feature vector over answer tokens for learning a joint distribution; PMI scores from search-engine probes supply additional atype evidence.]
[Roadmap diagram repeated: “structure in the query” (bag-of-words → Boolean/proximity → SQL/XQuery) vs. “structure in the corpus” (“free-format text” → HTML → XML → entities+relations).] (1) SQL/XQuery: broad, deep and ad-hoc schema and data integration is very difficult; many real problems defined away, “solved” apart from performance issues; largely for power-users, can be generated by programs. (2) Boolean/proximity: the domain of IR; time to analyze queries much more deeply. (3) No schema; either map to schema or support query on uninterpreted graphs; defining, labeling, extracting and ranking answers are the major issues. Part 3a: schema-free search on labeled graphs. Part 3b: labeling graphs.
Mining graphs: applications abound • No clean schema, data changes rapidly • Lots of generic “graph proximity” information [Example graph: Email1 (“…your ECML submission titled XYZ has been accepted…”) and Email2 (“Could I get a preprint of your recent ECML paper?”), with EmailTo and EmailDate edges, plus a PDF file (“XYZ, A. Thor, Abstract: …”) with a LastMod edge, all connected through canonical nodes for ECML, A. U. Thor, XYZ and Time; we want to quickly find this file.]
Some useful graph query styles • Find single node with specified type strongly activated by given nodes • paper NEAR { author=“A. U. Thor” conference=“ECML” time=“now” } • May not know exact types of activators • Find connected subgraph that best explains how/why two or more nodes are related • connect { “Faloutsos” “Roussopulous” } • Reward short paths, parallel paths, few distractions • Query is a small graph with match clauses
Measuring activation of single nodes • movie (near “richard gere” “julia roberts”) • http://www.cse.iitb.ac.in/banks/
Reporting connecting subgraphs http://www.cse.iitb.ac.in/banks/
Schema-free twig queries • Query is a small graph, very light on types • Node type (dblpauthor, vldbjournal) • Simple predicates to qualify nodes (string, number) • Match to data graph • Precisely defined matches for local predicates • Reward for proximity
Sample twig result • VLDB journal article, 2001 • Gerhard Weikum’s home page • Max-Planck Institute home page
Main issues • What style of queries to support? • No end-user will type XQuery directly • But bag-of-words is too weak • How to rank result nodes and subgraphs? • Reward for text match, graph proximity, … • “Iceberg queries”: report top few after a lot of computation? • Can ranking be learnt as with question answering? • How to reuse and retarget supervision?
Identifying node types • How do we know a node has a given type? • Find person near “Pagerank” • Find student near (homepage of Tom Mitchell) • Find {verb, ####, “television”} close together • Nodes in our graph have • Visible attributes “hobbies”, “my advisor is”, “start-of-sentence” • Invisible labels “student”, “verb” • Neighbors “HREF” “linear juxtaposition” • Goal: given some node labels, guess all others
Example: Hypertext classification • Want to assign labels to Web pages • Text on a single page may be too little or misleading • Page is not an isolated instance by itself • Problem setup • Web graph G = (V, E) • Node u ∈ V is a page having text uT • Edges in E are hyperlinks • Some nodes are labeled • Make collective guess at missing labels • Probabilistic model? Benefits?
Graph labeling model • Seek a labeling f of all unlabeled nodes so as to maximize Pr(f(V \ VK) | G, f(VK)) • Let VK be the nodes with known labels and f(VK) their known label assignments • Let N(v) be the neighbors of v and NK(v) ⊆ N(v) be the neighbors with known labels • Markov assumption: f(v) is conditionally independent of the rest of G given f(N(v))
Markov graph labeling • Probability of labeling a specific node v, given edges, text, and the partial labeling: sum over all possible labelings of the unknown neighbors NU(v) of v • Markov assumption: the label of v does not depend on unknown labels outside the set NU(v) • Circularity between f(v) and f(NU(v)) • Resolve with some form of iterative Gibbs sampling or MCMC
Iterative labeling in pictures • c = class, t = text, N = neighbors • Text-only model: Pr[c|t] • Using neighbors’ labels to update my estimates: Pr[c | t, c(N)] • Repeat until histograms stabilize • Local optima possible
Iterative labeling: formula • Label estimates of v in the next iteration: take the expectation over NU(v) labelings, with the joint distribution over neighbors approximated as a product of marginals • By the Markov assumption, we finally need a distribution coupling f(v), vT (the text on v) and f(N(v)) • Sum over all possible NU(v) labelings still too expensive to compute • In practice, prune to most likely configurations
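The iterative scheme above can be sketched as simple relaxation labeling: each node's class distribution is re-estimated from its text prior and its neighbors' current marginals, repeated until the histograms stabilize. Everything concrete here is invented for illustration: the tiny three-node chain, the priors, and in particular the mixture-style neighbor coupling, which stands in for the learned distribution coupling f(v), vT and f(N(v)):

```python
def iterate_labels(text_prior, edges, coupling=0.5, iters=20):
    """text_prior: node -> {label: Pr[label | text]}; edges: (u, v) pairs."""
    labels = sorted(next(iter(text_prior.values())))
    est = {v: dict(p) for v, p in text_prior.items()}
    nbrs = {v: [] for v in text_prior}
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    for _ in range(iters):
        new = {}
        for v in est:
            scores = {}
            for c in labels:
                # Pr[c | text] times a product of per-neighbor factors built
                # from the neighbors' current marginals (product-of-marginals
                # approximation to the joint over NU(v))
                s = text_prior[v][c]
                for u in nbrs[v]:
                    s *= (1 - coupling) + coupling * est[u][c]
                scores[c] = s
            z = sum(scores.values())
            new[v] = {c: s / z for c, s in scores.items()}
        est = new
    return est

prior = {
    "a": {"ml": 0.9, "db": 0.1},
    "b": {"ml": 0.5, "db": 0.5},   # ambiguous text; neighbors decide
    "c": {"ml": 0.8, "db": 0.2},
}
est = iterate_labels(prior, [("a", "b"), ("b", "c")])
print(max(est["b"], key=est["b"].get))  # 'ml'
```

Node b, whose own text is uninformative, is pulled toward the class its neighbors favor, which is exactly the collective behavior the hypertext-classification slides describe.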
Labeling results • 9600 patents from 12 classes marked by USPTO • Patents have text and cite other patents • Expand test patent to include neighborhood • ‘Forget’ fraction of neighbors’ classes • Good for other data sets as well