Administrivia
• What: LTI Seminar
• When: Friday Oct 30, 2009, 2:00pm - 3:00pm
• Where: 1305 NSH
• Faculty Host: Noah Smith
• Title: SeerSuite: Enterprise Search and Cyberinfrastructure for Science and Academia
• Speaker: Dr. C. Lee Giles, Pennsylvania State University
• Abstract (excerpt): Cyberinfrastructure or e-science has become crucial in many areas of science, as data access often defines scientific progress. Open source systems have greatly facilitated the design, implementation, and support of cyberinfrastructure. However, there exists no open source system for building an integrated search engine and digital library that covers all phases of information and knowledge extraction, such as citation extraction, automated indexing and ranking, chemical formula search, table indexing, etc. …
• Counts for two writeups if you attend!
Two-Page Status Report on Project – due Wed 11/2 at 9am
This is a chance to tell me how your project is progressing: what you've accomplished, and what you plan to do next. There's no fixed format, but here are some things you might discuss.
• What dataset will you be using? What does it look like (e.g., how many entities are there, how many tokens, etc.)? Looking over the data is always a good first step before you start working with it; what did you do to get acquainted with the data?
• Do you plan on looking at the same problem, or have you changed your plans?
• If you plan on writing code, what have you written so far, in what languages, and what do you still need to do?
• If you plan on using off-the-shelf code, what have you installed, and what experiences have you had with it?
• If you've run a baseline system on the data and gotten some results, what are they? Are they consistent with what you expected?
Poon and Domingos – continued! plus Bellare & McCallum
10-28-2009
Mostly pilfered from Pedro’s slides
Idea: exploit “pattern/relation duality”:
1. Start with some seed instances of (author, title) pairs (e.g., “Isaac Asimov”, “The Robots of Dawn”)
2. Look for occurrences of these pairs on the web.
3. Generate patterns that match heuristically chosen subsets of the occurrences
   - order, URLprefix, prefix, middle, suffix
4. Extract new (author, title) pairs that match the patterns.
5. Go to 2. [some workshop, 1998]
Result: 24M web pages + 5 books → 199 occurrences → 3 patterns → 4047 occurrences (+ 5M pages) → 3947 occurrences → 105 patterns → … → 15,257 books
Relation → Patterns → Relation → …
But:
• mostly learned “science fiction books”, at least in early rounds
• some manual intervention
• special regexes for author/title were used
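A rough sketch of the bootstrapping loop above (Python), just to make the control flow concrete. The helpers find_occurrences, induce_patterns, and match_patterns are hypothetical stand-ins for the web search, pattern generation, and pattern matching steps; they are not from the original system.

  def bootstrap(seed_pairs, find_occurrences, induce_patterns, match_patterns, rounds=4):
      """Pattern/relation bootstrapping: alternate between finding occurrences
      of known (author, title) pairs and using induced patterns to extract new pairs."""
      pairs = set(seed_pairs)
      for _ in range(rounds):
          occurrences = find_occurrences(pairs)        # step 2: where do known pairs occur?
          patterns = induce_patterns(occurrences)      # step 3: (order, urlprefix, prefix, middle, suffix)
          pairs |= set(match_patterns(patterns))       # step 4: extract new pairs, then repeat
      return pairs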
Markov Networks [Review]
• Undirected graphical models (the slide’s example graph has nodes Smoking, Cancer, Asthma, Cough)
• Potential functions defined over cliques
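For reference, the standard joint distribution of a Markov network is
  P(X = x) = (1/Z) Π_k φ_k(x_{k}),   Z = Σ_x Π_k φ_k(x_{k})
where φ_k is the potential function on the k-th clique and x_{k} is the state of that clique. Writing each potential in log-linear form, φ_k(x_{k}) = exp(w_k f_k(x_{k})), gives the weighted-feature form that Markov logic builds on.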
First-Order Logic
• Constants, variables, functions, predicates. E.g.: Anna, x, MotherOf(x), Friends(x, y)
• Literal: a predicate or its negation
• Clause: a disjunction of literals
• Grounding: replace all variables by constants. E.g.: Friends(Anna, Bob)
• World (model, interpretation): an assignment of truth values to all ground predicates
Markov Logic: Intuition
• A logical KB is a set of hard constraints on the set of possible worlds
• Let’s make them soft constraints: when a world violates a formula, it becomes less probable, not impossible
• Give each formula a weight (higher weight ⇒ stronger constraint)
Example: Friends & Smokers
Two constants: Anna (A) and Bob (B)
Ground atoms: Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B), Smokes(A), Smokes(B), Cancer(A), Cancer(B)
(The slides build up, step by step, the ground Markov network that the example formulas — roughly, Smokes(x) => Cancer(x) and Friends(x,y) => (Smokes(x) <=> Smokes(y)) — induce over these atoms.)
Markov Logic Networks
• An MLN is a template for ground Markov networks
• Probability of a world x:
  P(x) = (1/Z) exp( Σ_i w_i n_i(x) )
  where w_i is the weight of formula i and n_i(x) is the number of true groundings of formula i in x
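A brute-force illustration of that formula in Python: enumerate every world over two constants, count true groundings of the two Friends & Smokers formulas, and normalize. (The weights here are made up for the example; real MLN systems never enumerate worlds like this.)

  import itertools
  from math import exp

  people = ["A", "B"]

  def n_true_groundings(world):
      # world maps ground atoms like ("Smokes","A") or ("Friends","A","B") to True/False
      n1 = sum((not world[("Smokes", x)]) or world[("Cancer", x)] for x in people)   # Smokes(x) => Cancer(x)
      n2 = sum((not world[("Friends", x, y)]) or (world[("Smokes", x)] == world[("Smokes", y)])
               for x in people for y in people)                                       # Friends(x,y) => (Smokes(x) <=> Smokes(y))
      return n1, n2

  def unnormalized(world, w1=1.5, w2=1.1):          # illustrative weights
      n1, n2 = n_true_groundings(world)
      return exp(w1 * n1 + w2 * n2)

  atoms = ([("Smokes", x) for x in people] + [("Cancer", x) for x in people]
           + [("Friends", x, y) for x in people for y in people])
  worlds = [dict(zip(atoms, vals)) for vals in itertools.product([False, True], repeat=len(atoms))]
  Z = sum(unnormalized(w) for w in worlds)
  print(unnormalized(worlds[0]) / Z)                # P(x) for the all-False world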
Weight Learning
• Parameter tying: groundings of the same clause share one weight
• Generative learning: pseudo-likelihood
• Discriminative learning: conditional likelihood, with gradient
  ∂/∂w_i log P_w(y|x) = n_i – E_w[n_i]
  where n_i is the no. of times clause i is true in the data, and E_w[n_i] is the expected no. of times clause i is true according to the MLN
• [like CRFs – but we need to do inference. They use a Collins-like method that computes expectations near a MAP soln. WC]
MAP/MPE Inference
• Problem: find the most likely state of the world given evidence
• This is just the weighted MaxSAT problem
• Use a weighted SAT solver (e.g., MaxWalkSAT [Kautz et al., 1997])
The MaxWalkSAT Algorithm
  for i ← 1 to max-tries do
      solution ← random truth assignment
      for j ← 1 to max-flips do
          if Σ weights(sat. clauses) > threshold then
              return solution
          c ← random unsatisfied clause
          with probability p:
              flip a random variable in c
          else:
              flip the variable in c that maximizes Σ weights(sat. clauses)
  return failure, best solution found
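A compact runnable version of the same procedure in Python (clauses are (weight, list of (variable, sign)) pairs; this is a sketch of the pseudocode above, not Alchemy’s implementation):

  import random

  def clause_sat(clause, assign):
      _, lits = clause
      return any(assign[v] == s for v, s in lits)      # satisfied if any literal holds

  def sat_weight(clauses, assign):
      return sum(w for (w, lits) in clauses if clause_sat((w, lits), assign))

  def max_walk_sat(variables, clauses, max_tries=10, max_flips=1000, p=0.5, threshold=None):
      if threshold is None:
          threshold = sum(w for w, _ in clauses)       # by default, demand all clauses satisfied
      best = None
      for _ in range(max_tries):
          assign = {v: random.random() < 0.5 for v in variables}
          for _ in range(max_flips):
              if sat_weight(clauses, assign) >= threshold:
                  return assign
              unsat = [cl for cl in clauses if not clause_sat(cl, assign)]
              if not unsat:
                  return assign
              _, lits = random.choice(unsat)           # a random unsatisfied clause
              if random.random() < p:
                  v = random.choice(lits)[0]           # random-walk move
              else:                                    # greedy move
                  def score(var):
                      assign[var] = not assign[var]
                      s = sat_weight(clauses, assign)
                      assign[var] = not assign[var]
                      return s
                  v = max((var for var, _ in lits), key=score)
              assign[v] = not assign[v]
              if best is None or sat_weight(clauses, assign) > sat_weight(clauses, best):
                  best = dict(assign)
      return best                                      # best assignment found over all tries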
MAP = WalkSAT; expectations = ????
• MCMC????: deterministic dependencies break MCMC; near-deterministic ones make it very slow
• Solution: combine MCMC and WalkSAT → the MC-SAT algorithm [Poon & Domingos, 2006]
Slice Sampling [Damien et al. 1999]
Given the current sample x(k), draw an auxiliary variable u(k) uniformly from [0, P(x(k))], then draw x(k+1) uniformly from the “slice” {x : P(x) ≥ u(k)}.
The MC-SAT Algorithm
  X(0) ← a random solution satisfying all hard clauses         [found with MaxWalkSAT]
  for k ← 1 to num_samples
      M ← Ø
      for all clauses Ci satisfied by X(k–1)
          with probability 1 – exp(–wi), add Ci to M
      end for
      X(k) ← a uniformly random solution satisfying M          [“SampleSat”: MaxWalkSAT + simulated annealing]
  end for
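A Python sketch of the MC-SAT loop, reusing clause_sat and max_walk_sat from the sketch above. One simplification to note: the “uniformly random solution satisfying M” step is approximated here by another MaxWalkSAT call; the real algorithm uses SampleSat (WalkSAT + simulated annealing) to get near-uniform samples.

  from math import exp
  import random

  def mc_sat(variables, soft_clauses, hard_clauses, num_samples=100):
      x = max_walk_sat(variables, hard_clauses)        # X(0): satisfy all hard clauses
      samples = []
      for _ in range(num_samples):
          m = list(hard_clauses)                       # hard clauses are always kept
          for (w, lits) in soft_clauses:
              if clause_sat((w, lits), x) and random.random() < 1 - exp(-w):
                  m.append((w, lits))                  # keep this satisfied clause with prob 1 - e^(-w)
          x = max_walk_sat(variables, m)               # should be a (near-)uniform solution of M
          samples.append(dict(x))
      return samples                                   # sample counts estimate the expectations E[n_i]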
Entity Resolution
Problem: given a database, find duplicate records
Predicates:
  HasToken(token,field,record)
  SameField(field,record,record)
  SameRecord(record,record)
Rules:
  HasToken(+t,+f,r) ^ HasToken(+t,+f,r’) => SameField(f,r,r’)
  SameField(f,r,r’) => SameRecord(r,r’)
  SameRecord(r,r’) ^ SameRecord(r’,r”) => SameRecord(r,r”)
Entity Resolution
Can also resolve fields:
Predicates:
  HasToken(token,field,record)
  SameField(field,record,record)
  SameRecord(record,record)
Rules:
  HasToken(+t,+f,r) ^ HasToken(+t,+f,r’) => SameField(f,r,r’)
  SameField(f,r,r’) <=> SameRecord(r,r’)
  SameRecord(r,r’) ^ SameRecord(r’,r”) => SameRecord(r,r”)
  SameField(f,r,r’) ^ SameField(f,r’,r”) => SameField(f,r,r”)
P. Singla & P. Domingos, “Entity Resolution with Markov Logic”, in Proc. ICDM-2006.
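A hypothetical example (not from the paper) of how the first rule drives matching: suppose record r1 has author “J. Smith” and title “Data Integration Methods”, and record r2 has author “Jane Smith” and the same title. The shared tokens give true groundings such as
  HasToken(Smith, author, r1) ^ HasToken(Smith, author, r2) => SameField(author, r1, r2)
  HasToken(Data, title, r1) ^ HasToken(Data, title, r2) => SameField(title, r1, r2)
Because the “+” variables get a separate learned weight per (token, field) pair, informative tokens like “Smith” push SameField — and, through the next rule, SameRecord(r1, r2) — toward true, while uninformative tokens can get near-zero weights.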
Hidden Markov Models
  obs   = { Obs1, …, ObsN }
  state = { St1, …, StM }
  time  = { 0, …, T }
  State(state!,time)
  Obs(obs!,time)
  State(+s,0)
  State(+s,t) => State(+s',t+1)
  State(+s,t) => State(+s,t+1)   [variant we’ll use – WC]
  Obs(+o,t) => State(+s,t)
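Reading these as templates: “!” declares that exactly one state (and one observation) holds at each time, and a “+” variable gets a separate learned weight for each constant it is grounded to. So with two states St1 and St2, the transition template State(+s,t) => State(+s',t+1) expands into one weighted clause per (s, s') pair — State(St1,t) => State(St1,t+1), State(St1,t) => State(St2,t+1), and so on — and those weights play the role of the HMM transition matrix; likewise Obs(+o,t) => State(+s,t) gives one weight per (observation, state) pair, playing the role of the emission parameters.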
Information Extraction (simplified)
• Problem: extract a database from text or semi-structured sources
• Example: extract a database of publications from citation list(s) (the “CiteSeer problem”)
• Two steps:
  • Segmentation: use an HMM to assign tokens to fields
  • Entity resolution: use logistic regression and transitivity
Information Extraction (simplified)
Predicates:
  Token(token, position, citation)
  InField(position, field, citation)
  SameField(field, citation, citation)
  SameCit(citation, citation)
Rules:
  Token(+t,i,c) => InField(i,+f,c)
  InField(i,+f,c) <=> InField(i+1,+f,c)
  f != f’ => (!InField(i,+f,c) v !InField(i,+f’,c))
  Token(+t,i,c) ^ InField(i,+f,c) ^ Token(+t,i’,c’) ^ InField(i’,+f,c’) => SameField(+f,c,c’)
  SameField(+f,c,c’) <=> SameCit(c,c’)
  SameField(f,c,c’) ^ SameField(f,c’,c”) => SameField(f,c,c”)
  SameCit(c,c’) ^ SameCit(c’,c”) => SameCit(c,c”)
Information Extraction (simplified)
Predicates:
  Token(token, position, citation)
  InField(position, field, citation)
  SameField(field, citation, citation)
  SameCit(citation, citation)
Rules:
  Token(+t,i,c) => InField(i,+f,c)
  InField(i,+f,c) ^ !Token(“.”,i,c) <=> InField(i+1,+f,c)
  f != f’ => (!InField(i,+f,c) v !InField(i,+f’,c))
  Token(+t,i,c) ^ InField(i,+f,c) ^ Token(+t,i’,c’) ^ InField(i’,+f,c’) => SameField(+f,c,c’)
  SameField(+f,c,c’) <=> SameCit(c,c’)
  SameField(f,c,c’) ^ SameField(f,c’,c”) => SameField(f,c,c”)
  SameCit(c,c’) ^ SameCit(c’,c”) => SameCit(c,c”)
More: H. Poon & P. Domingos, “Joint Inference in Information Extraction”, in Proc. AAAI-2007.
Information Extraction (less simplified)
Predicates:
  Token(token, position, citation)
  InField(position, field, citation)
  SameField(field, citation, citation)
  SameCit(citation, citation)
Rules:
  Token(+t,i,c) => InField(i,+f,c)
  InField(i,+f,c) ^ !Token(“.”,i,c) <=> InField(i+1,+f,c)
  f != f’ => (!InField(i,+f,c) v !InField(i,+f’,c))
The template Token(+t,i,c) => InField(i,+f,c) expands into one weighted clause per token/field pair:
  !Token("aardvark",i,c) v InField(i,”author”,c)
  …
  !Token("zymurgy",i,c) v InField(i,"author",c)
  …
  !Token("zymurgy",i,c) v InField(i,"venue",c)
Information Extraction (less simplified)
Predicates:
  Token(token, position, citation)
  InField(position, field, citation)
  SameField(field, citation, citation)
  SameCit(citation, citation)
Rules:
  Token(+t,i,c) => InField(i,+f,c)
  InField(i,+f,c) ^ !Token(“.”,i,c) <=> InField(i+1,+f,c)
  f != f’ => (!InField(i,+f,c) v !InField(i,+f’,c))
  Token(+t,i,c) => InField(i,+f,c)
  InField(i,+f,c) => InField(i+1,+f,c)
  InField(i,+f,c) ^ !HasPunct(c,i) => InField(i+1,+f,c)
  InField(i,+f,c) ^ !HasComma(c,i) => InField(i+1,+f,c)
  => InField(1,”author”,c)
  => InField(2,”author”,c)
  => InField(midpointOfC, "title", c)   [computed off-line – WC]
Information Extraction (less simplified)
Predicates:
  Token(token, position, citation)
  InField(position, field, citation)
  SameField(field, citation, citation)
  SameCit(citation, citation)
Rules:
  Token(+t,i,c) => InField(i,+f,c)
  f != f’ => (!InField(i,+f,c) v !InField(i,+f’,c))
  Token(+t,i,c) => InField(i,+f,c)
  InField(i,+f,c) => InField(i+1,+f,c)
  InField(i,+f,c) ^ !HasPunct(c,i) => InField(i+1,+f,c)
  InField(i,+f,c) ^ !HasComma(c,i) => InField(i+1,+f,c)
  => InField(1,”author”,c)
  => InField(2,”author”,c)
  Center(c,i) => InField(i, "title", c)
Information Extraction (less simplified)
Rules so far:
  Token(+t,i,c) => InField(i,+f,c)
  f != f’ => (!InField(i,+f,c) v !InField(i,+f’,c))
  Token(+t,i,c) => InField(i,+f,c)
  InField(i,+f,c) => InField(i+1,+f,c)
  InField(i,+f,c) ^ !HasPunct(c,i) => InField(i+1,+f,c)
  InField(i,+f,c) ^ !HasComma(c,i) => InField(i+1,+f,c)
  => InField(1,”author”,c)
  => InField(2,”author”,c)
  => InField(midpointOfC, "title", c)
New heuristics:
• Initials tend to appear in the author or venue field.
• Positions before the last non-venue initial are usually not title or venue.
• Positions after the first “venue keyword” are usually not author or title.
Corresponding rules:
  Token(w,i,c) ^ IsAlphaChar(w) ^ FollowBy(c,i,”.”) => InField(c,”author”,i) v InField(c,”venue”,i)
  LastInitial(c,i) ^ LessThan(j,i) => !InField(c,”title”,j) ^ !InField(c,”venue”,j)
  FirstInitial(c,i) ^ LessThan(i,j) => InField(c,”author”,j)
  FirstVenueKeyword(c,i) ^ LessThan(i,j) => !InField(c,”author”,j) ^ !InField(c,”title”,j)
Information Extraction (less simplified) • SimilarTitle(c,i,j,c’,i’,j’): true if • c[i..j] and c’[i’…j’] are both “titlelike” • i.e., no punctuation, doesn’t violate rules above • c[i..j] and c’[i’…j’] are “similar” • i.e. start with same trigram and end with same token • SimilarVenue(c,c’): true if c and c’ don’t contain conflicting venue keywords (e.g., journal vs proceedings)
Information Extraction (less simplified) • SimilarTitle(c,i,j,c’,i’,j’): … • SimilarVenue(c,c’): … • JointInferenceCandidate(c,i,c’): • trigram starting at i in c also appears in c’ • and trigram is a possible title • and punct before trigram in c’ but not c
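A hypothetical illustration of JointInferenceCandidate (not an example from the paper): suppose citation c reads “… A. Author Scaling up statistical learning In ICML …”, with the period after the author name missing, while citation c’ reads “… B. Writer. Scaling up statistical learning. ICML …”. The trigram “Scaling up statistical” starting at position i in c also appears in c’, it looks like a possible title, and it is preceded by punctuation in c’ but not in c — so JointInferenceCandidate(c,i,c’) is true, and (via the rule on the next slide) the segmentation of c is allowed to start a new field at i even though c has no punctuation there.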
Information Extraction (less simplified)
SimilarTitle(c,i,j,c’,i’,j’): …   SimilarVenue(c,c’): …   JointInferenceCandidate(c,i,c’): …
The earlier rule
  InField(i,+f,c) ^ !HasPunct(c,i) => InField(i+1,+f,c)
is replaced by
  InField(i,+f,c) ^ !HasPunct(c,i) ^ (!exists c’: JointInferenceCandidate(c,i,c’)) => InField(i+1,+f,c)
Why is this joint? Recall we also have:
  Token(+t,i,c) ^ InField(i,+f,c) ^ Token(+t,i’,c’) ^ InField(i’,+f,c’) => SameField(+f,c,c’)
This is the Jnt-Seg model.
Information Extraction (less simplified)
• SimilarTitle(c,i,j,c’,i’,j’): …
• SimilarVenue(c,c’): …
• JointInferenceCandidate(c,i,c’): …
• InField(i,+f,c) ^ !HasPunct(c,i) ^ (!exists c’: JointInferenceCandidate(c,i,c’) ^ SameCitation(c,c’)) => InField(i+1,+f,c)
This is the Jnt-Seg-ER model.
Results: segmentation Percent error reduction for best joint model
Results: matching
Metric: fraction of clusters correctly constructed using the transitive closure of pairwise decisions.
Cora F-S: 0.87 F1. Cora TFIDF: 0.84 max F1.
William’s summary • MLNs are a compact, elegant way of describing a Markov network • Standard learning methods work • Network may be very very large • Inference may be expensive • Doesn’t eliminate feature engineering • E.g., complicated “feature” predicates • Experimental results for joint matching/NER are not that strong overall • Cascading segmentation and then matching improves segmentation, maybe not matching • But it needs to be carefully restricted (efficiency?)
Outline
• Goal: given (DBLP record, citation-text) pairs that do match, learn to segment citations.
• Methods:
  • Learn a CRF to align the record and the text (sort of like learning an edit distance)
  • Generate alignments, and use them as training data for a linear-chain CRF that does segmentation (aka extraction)
  • This second CRF does not need records to work
Alignment….
Notation:
• Alignment feature: depends on the alignment a and the x’s
• Extraction feature: depends on a, y(1), and x(2)
Learning for alignment…
• Generalized expectation criterion: rather than minimizing E_data[f] – E_model[f] (plus a penalty term for the weights), minimize a weighted squared difference between E_model[f] and p, where p is the user’s prior on the value of the feature.
• “We simulate user-specified expectation criteria [i.e. p’s] with statistics on manually labeled citation texts.” … top 10 features by MI, p in 11 bins, w=10
• E_model[f] here is a sum of marginal probabilities divided by the size of the variable set.
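Roughly, and in my notation rather than the paper’s, the GE term added to the training objective has the form
  O(θ) = – Σ_k λ_k ( p_k – E_θ[f_k] )²  –  ||θ||² / (2σ²)
so instead of matching model expectations to empirical feature counts from labeled data (as in standard CRF training), the model expectation E_θ[f_k] of each constrained feature is pushed toward the user-supplied target value p_k.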
Results On 260 records, 522 record-text pairs
Results (systems compared):
• CRF trained with extraction criteria derived from labeled data
• CRF trained on records partially aligned with high-precision rules
• CRF trained on DBLP records
• …and also using partial matches to DB records at test time
• “Gold standard”: hand-labeled extraction data
Alignments and expectations Simplified version of the idea: from Learning String Edit Distance, Ristad and Yianilos, PAMI 1998
HMM Example
A two-state HMM (states 1 and 2), with transition probabilities Pr(1→1), Pr(1→2), Pr(2→1), Pr(2→2) and emission probabilities Pr(1→x), Pr(2→x).
Sample output: x^T = heehahaha, s^T = 122121212
HMM Inference
Key point: Pr(s_i = l) depends only on the transition probabilities Pr(l’→l) and on s_{i-1}, so you can propagate probabilities forward along the sequence x_1, x_2, x_3, …, x_T.
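A minimal sketch of that forward propagation in Python (a generic forward algorithm; the parameter values below are made up, not from the lecture):

  def forward(obs, states, start_p, trans_p, emit_p):
      """Forward algorithm: alpha[i][l] = Pr(x_1..x_i, s_i = l)."""
      alpha = [{l: start_p[l] * emit_p[l][obs[0]] for l in states}]
      for x in obs[1:]:
          prev = alpha[-1]
          alpha.append({l: emit_p[l][x] * sum(prev[lp] * trans_p[lp][l] for lp in states)
                        for l in states})
      return alpha                                   # sum(alpha[-1].values()) = Pr(x_1..x_T)

  states = [1, 2]
  start_p = {1: 1.0, 2: 0.0}
  trans_p = {1: {1: 0.3, 2: 0.7}, 2: {1: 0.6, 2: 0.4}}
  emit_p = {1: {"h": 0.6, "e": 0.3, "a": 0.1}, 2: {"h": 0.1, "e": 0.4, "a": 0.5}}
  alpha = forward("heehahaha", states, start_p, trans_p, emit_p)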
Pair HMM Notation Andrew used “null”