Administrivia
• What: LTI Seminar
• When: Friday Oct 30, 2009, 2:00pm - 3:00pm
• Where: 1305 NSH
• Faculty Host: Noah Smith
• Title: SeerSuite: Enterprise Search and Cyberinfrastructure for Science and Academia
• Speaker: Dr. C. Lee Giles, Pennsylvania State University
• Abstract (excerpt): Cyberinfrastructure or e-science has become crucial in many areas of science, as data access often defines scientific progress. Open source systems have greatly facilitated the design, implementation, and support of cyberinfrastructure. However, there exists no open source system for building an integrated search engine and digital library that covers all phases of information and knowledge extraction, such as citation extraction, automated indexing and ranking, chemical formula search, table indexing, etc. …
• Counts for two writeups if you attend!
Two-Page Status Report on Project – due Wed 11/2 at 9am
This is a chance to tell me how your project is progressing: what you've accomplished, and what you plan to do next. There's no fixed format, but here are some things you might discuss.
• What dataset will you be using? What does it look like (e.g., how many entities are there, how many tokens, etc.)? Looking over the data is always a good first step before you start working with it; what did you do to get acquainted with the data?
• Do you plan on looking at the same problem, or have you changed your plans?
• If you plan on writing code, what have you written so far, in what languages, and what do you still need to do?
• If you plan on using off-the-shelf code, what have you installed, and what experiences have you had with it?
• If you've run a baseline system on the data and gotten some results, what are they? Are they consistent with what you expected?
Poon and Domingos – continued! plus Bellare & McCallum
10-28-2009
Mostly pilfered from Pedro’s slides
Idea: exploit “pattern/relation duality”:
1. Start with some seed instances of (author, title) pairs (e.g., “Isaac Asimov”, “The Robots of Dawn”)
2. Look for occurrences of these pairs on the web.
3. Generate patterns that match heuristically chosen subsets of the occurrences
   - order, URLprefix, prefix, middle, suffix
4. Extract new (author, title) pairs that match the patterns.
5. Go to 2. [some workshop, 1998]
Result: 24M web pages + 5 books → 199 occurrences → 3 patterns → 4047 occurrences (+ 5M pages) → 3947 occurrences → 105 patterns → … → 15,257 books
Relation → Patterns → Relation → …
But:
• mostly learned “science fiction books”, at least in early rounds
• some manual intervention
• special regexes for author/title were used
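A rough sketch of the bootstrapping loop above (Python), just to make the control flow concrete. The helpers find_occurrences, induce_patterns, and match_patterns are hypothetical stand-ins for the web search, pattern generation, and pattern matching steps; they are not from the original system.

  def bootstrap(seed_pairs, find_occurrences, induce_patterns, match_patterns, rounds=4):
      """Pattern/relation bootstrapping: alternate between finding occurrences
      of known (author, title) pairs and using induced patterns to extract new pairs."""
      pairs = set(seed_pairs)
      for _ in range(rounds):
          occurrences = find_occurrences(pairs)        # step 2: where do known pairs occur?
          patterns = induce_patterns(occurrences)      # step 3: (order, urlprefix, prefix, middle, suffix)
          pairs |= set(match_patterns(patterns))       # step 4: extract new pairs, then repeat
      return pairs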
Markov Networks [Review]
• Undirected graphical models (the slide’s example graph has nodes Smoking, Cancer, Asthma, Cough)
• Potential functions defined over cliques
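For reference, the standard joint distribution of a Markov network is
  P(X = x) = (1/Z) Π_k φ_k(x_{k}),   Z = Σ_x Π_k φ_k(x_{k})
where φ_k is the potential function on the k-th clique and x_{k} is the state of that clique. Writing each potential in log-linear form, φ_k(x_{k}) = exp(w_k f_k(x_{k})), gives the weighted-feature form that Markov logic builds on.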
First-Order Logic
• Constants, variables, functions, predicates. E.g.: Anna, x, MotherOf(x), Friends(x, y)
• Literal: a predicate or its negation
• Clause: a disjunction of literals
• Grounding: replace all variables by constants. E.g.: Friends(Anna, Bob)
• World (model, interpretation): an assignment of truth values to all ground predicates
Markov Logic: Intuition
• A logical KB is a set of hard constraints on the set of possible worlds
• Let’s make them soft constraints: when a world violates a formula, it becomes less probable, not impossible
• Give each formula a weight (higher weight ⇒ stronger constraint)
Example: Friends & Smokers
Two constants: Anna (A) and Bob (B)
Ground atoms: Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B), Smokes(A), Smokes(B), Cancer(A), Cancer(B)
(The slides build up, step by step, the ground Markov network that the example formulas — roughly, Smokes(x) => Cancer(x) and Friends(x,y) => (Smokes(x) <=> Smokes(y)) — induce over these atoms.)
Markov Logic Networks
• An MLN is a template for ground Markov networks
• Probability of a world x:
  P(x) = (1/Z) exp( Σ_i w_i n_i(x) )
  where w_i is the weight of formula i and n_i(x) is the number of true groundings of formula i in x
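A brute-force illustration of that formula in Python: enumerate every world over two constants, count true groundings of the two Friends & Smokers formulas, and normalize. (The weights here are made up for the example; real MLN systems never enumerate worlds like this.)

  import itertools
  from math import exp

  people = ["A", "B"]

  def n_true_groundings(world):
      # world maps ground atoms like ("Smokes","A") or ("Friends","A","B") to True/False
      n1 = sum((not world[("Smokes", x)]) or world[("Cancer", x)] for x in people)   # Smokes(x) => Cancer(x)
      n2 = sum((not world[("Friends", x, y)]) or (world[("Smokes", x)] == world[("Smokes", y)])
               for x in people for y in people)                                       # Friends(x,y) => (Smokes(x) <=> Smokes(y))
      return n1, n2

  def unnormalized(world, w1=1.5, w2=1.1):          # illustrative weights
      n1, n2 = n_true_groundings(world)
      return exp(w1 * n1 + w2 * n2)

  atoms = ([("Smokes", x) for x in people] + [("Cancer", x) for x in people]
           + [("Friends", x, y) for x in people for y in people])
  worlds = [dict(zip(atoms, vals)) for vals in itertools.product([False, True], repeat=len(atoms))]
  Z = sum(unnormalized(w) for w in worlds)
  print(unnormalized(worlds[0]) / Z)                # P(x) for the all-False world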
Weight Learning
• Parameter tying: groundings of the same clause share one weight
• Generative learning: pseudo-likelihood
• Discriminative learning: conditional likelihood, with gradient
  ∂/∂w_i log P_w(y|x) = n_i – E_w[n_i]
  where n_i is the no. of times clause i is true in the data, and E_w[n_i] is the expected no. of times clause i is true according to the MLN
• [like CRFs – but we need to do inference. They use a Collins-like method that computes expectations near a MAP soln. WC]
MAP/MPE Inference
• Problem: find the most likely state of the world given evidence
• This is just the weighted MaxSAT problem
• Use a weighted SAT solver (e.g., MaxWalkSAT [Kautz et al., 1997])
The MaxWalkSAT Algorithm
  for i ← 1 to max-tries do
      solution ← random truth assignment
      for j ← 1 to max-flips do
          if Σ weights(sat. clauses) > threshold then
              return solution
          c ← random unsatisfied clause
          with probability p:
              flip a random variable in c
          else:
              flip the variable in c that maximizes Σ weights(sat. clauses)
  return failure, best solution found
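A compact runnable version of the same procedure in Python (clauses are (weight, list of (variable, sign)) pairs; this is a sketch of the pseudocode above, not Alchemy’s implementation):

  import random

  def clause_sat(clause, assign):
      _, lits = clause
      return any(assign[v] == s for v, s in lits)      # satisfied if any literal holds

  def sat_weight(clauses, assign):
      return sum(w for (w, lits) in clauses if clause_sat((w, lits), assign))

  def max_walk_sat(variables, clauses, max_tries=10, max_flips=1000, p=0.5, threshold=None):
      if threshold is None:
          threshold = sum(w for w, _ in clauses)       # by default, demand all clauses satisfied
      best = None
      for _ in range(max_tries):
          assign = {v: random.random() < 0.5 for v in variables}
          for _ in range(max_flips):
              if sat_weight(clauses, assign) >= threshold:
                  return assign
              unsat = [cl for cl in clauses if not clause_sat(cl, assign)]
              if not unsat:
                  return assign
              _, lits = random.choice(unsat)           # a random unsatisfied clause
              if random.random() < p:
                  v = random.choice(lits)[0]           # random-walk move
              else:                                    # greedy move
                  def score(var):
                      assign[var] = not assign[var]
                      s = sat_weight(clauses, assign)
                      assign[var] = not assign[var]
                      return s
                  v = max((var for var, _ in lits), key=score)
              assign[v] = not assign[v]
              if best is None or sat_weight(clauses, assign) > sat_weight(clauses, best):
                  best = dict(assign)
      return best                                      # best assignment found over all tries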
MAP = WalkSAT; expectations = ????
• MCMC????: deterministic dependencies break MCMC; near-deterministic ones make it very slow
• Solution: combine MCMC and WalkSAT → the MC-SAT algorithm [Poon & Domingos, 2006]
Slice Sampling [Damien et al. 1999]
Given the current sample x(k), draw an auxiliary variable u(k) uniformly from [0, P(x(k))], then draw x(k+1) uniformly from the “slice” {x : P(x) ≥ u(k)}.
The MC-SAT Algorithm
  X(0) ← a random solution satisfying all hard clauses         [found with MaxWalkSAT]
  for k ← 1 to num_samples
      M ← Ø
      for all clauses Ci satisfied by X(k–1)
          with probability 1 – exp(–wi), add Ci to M
      end for
      X(k) ← a uniformly random solution satisfying M          [“SampleSat”: MaxWalkSAT + simulated annealing]
  end for
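A Python sketch of the MC-SAT loop, reusing clause_sat and max_walk_sat from the sketch above. One simplification to note: the “uniformly random solution satisfying M” step is approximated here by another MaxWalkSAT call; the real algorithm uses SampleSat (WalkSAT + simulated annealing) to get near-uniform samples.

  from math import exp
  import random

  def mc_sat(variables, soft_clauses, hard_clauses, num_samples=100):
      x = max_walk_sat(variables, hard_clauses)        # X(0): satisfy all hard clauses
      samples = []
      for _ in range(num_samples):
          m = list(hard_clauses)                       # hard clauses are always kept
          for (w, lits) in soft_clauses:
              if clause_sat((w, lits), x) and random.random() < 1 - exp(-w):
                  m.append((w, lits))                  # keep this satisfied clause with prob 1 - e^(-w)
          x = max_walk_sat(variables, m)               # should be a (near-)uniform solution of M
          samples.append(dict(x))
      return samples                                   # sample counts estimate the expectations E[n_i]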
Entity Resolution
Problem: given a database, find duplicate records
Predicates:
  HasToken(token,field,record)
  SameField(field,record,record)
  SameRecord(record,record)
Rules:
  HasToken(+t,+f,r) ^ HasToken(+t,+f,r’) => SameField(f,r,r’)
  SameField(f,r,r’) => SameRecord(r,r’)
  SameRecord(r,r’) ^ SameRecord(r’,r”) => SameRecord(r,r”)
Entity Resolution
Can also resolve fields:
Predicates:
  HasToken(token,field,record)
  SameField(field,record,record)
  SameRecord(record,record)
Rules:
  HasToken(+t,+f,r) ^ HasToken(+t,+f,r’) => SameField(f,r,r’)
  SameField(f,r,r’) <=> SameRecord(r,r’)
  SameRecord(r,r’) ^ SameRecord(r’,r”) => SameRecord(r,r”)
  SameField(f,r,r’) ^ SameField(f,r’,r”) => SameField(f,r,r”)
P. Singla & P. Domingos, “Entity Resolution with Markov Logic”, in Proc. ICDM-2006.
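A hypothetical example (not from the paper) of how the first rule drives matching: suppose record r1 has author “J. Smith” and title “Data Integration Methods”, and record r2 has author “Jane Smith” and the same title. The shared tokens give true groundings such as
  HasToken(Smith, author, r1) ^ HasToken(Smith, author, r2) => SameField(author, r1, r2)
  HasToken(Data, title, r1) ^ HasToken(Data, title, r2) => SameField(title, r1, r2)
Because the “+” variables get a separate learned weight per (token, field) pair, informative tokens like “Smith” push SameField — and, through the next rule, SameRecord(r1, r2) — toward true, while uninformative tokens can get near-zero weights.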
Hidden Markov Models
  obs   = { Obs1, …, ObsN }
  state = { St1, …, StM }
  time  = { 0, …, T }
  State(state!,time)
  Obs(obs!,time)
  State(+s,0)
  State(+s,t) => State(+s',t+1)
  State(+s,t) => State(+s,t+1)   [variant we’ll use – WC]
  Obs(+o,t) => State(+s,t)
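Reading these as templates: “!” declares that exactly one state (and one observation) holds at each time, and a “+” variable gets a separate learned weight for each constant it is grounded to. So with two states St1 and St2, the transition template State(+s,t) => State(+s',t+1) expands into one weighted clause per (s, s') pair — State(St1,t) => State(St1,t+1), State(St1,t) => State(St2,t+1), and so on — and those weights play the role of the HMM transition matrix; likewise Obs(+o,t) => State(+s,t) gives one weight per (observation, state) pair, playing the role of the emission parameters.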
Information Extraction (simplified)
• Problem: extract a database from text or semi-structured sources
• Example: extract a database of publications from citation list(s) (the “CiteSeer problem”)
• Two steps:
  • Segmentation: use an HMM to assign tokens to fields
  • Entity resolution: use logistic regression and transitivity
Information Extraction (simplified)
Predicates:
  Token(token, position, citation)
  InField(position, field, citation)
  SameField(field, citation, citation)
  SameCit(citation, citation)
Rules:
  Token(+t,i,c) => InField(i,+f,c)
  InField(i,+f,c) <=> InField(i+1,+f,c)
  f != f’ => (!InField(i,+f,c) v !InField(i,+f’,c))
  Token(+t,i,c) ^ InField(i,+f,c) ^ Token(+t,i’,c’) ^ InField(i’,+f,c’) => SameField(+f,c,c’)
  SameField(+f,c,c’) <=> SameCit(c,c’)
  SameField(f,c,c’) ^ SameField(f,c’,c”) => SameField(f,c,c”)
  SameCit(c,c’) ^ SameCit(c’,c”) => SameCit(c,c”)
Information Extraction (simplified)
Predicates:
  Token(token, position, citation)
  InField(position, field, citation)
  SameField(field, citation, citation)
  SameCit(citation, citation)
Rules:
  Token(+t,i,c) => InField(i,+f,c)
  InField(i,+f,c) ^ !Token(“.”,i,c) <=> InField(i+1,+f,c)
  f != f’ => (!InField(i,+f,c) v !InField(i,+f’,c))
  Token(+t,i,c) ^ InField(i,+f,c) ^ Token(+t,i’,c’) ^ InField(i’,+f,c’) => SameField(+f,c,c’)
  SameField(+f,c,c’) <=> SameCit(c,c’)
  SameField(f,c,c’) ^ SameField(f,c’,c”) => SameField(f,c,c”)
  SameCit(c,c’) ^ SameCit(c’,c”) => SameCit(c,c”)
More: H. Poon & P. Domingos, “Joint Inference in Information Extraction”, in Proc. AAAI-2007.
Information Extraction (less simplified)
Predicates:
  Token(token, position, citation)
  InField(position, field, citation)
  SameField(field, citation, citation)
  SameCit(citation, citation)
Rules:
  Token(+t,i,c) => InField(i,+f,c)
  InField(i,+f,c) ^ !Token(“.”,i,c) <=> InField(i+1,+f,c)
  f != f’ => (!InField(i,+f,c) v !InField(i,+f’,c))
The template Token(+t,i,c) => InField(i,+f,c) expands into one weighted clause per token/field pair:
  !Token("aardvark",i,c) v InField(i,”author”,c)
  …
  !Token("zymurgy",i,c) v InField(i,"author",c)
  …
  !Token("zymurgy",i,c) v InField(i,"venue",c)
Information Extraction (less simplified)
Predicates:
  Token(token, position, citation)
  InField(position, field, citation)
  SameField(field, citation, citation)
  SameCit(citation, citation)
Rules:
  Token(+t,i,c) => InField(i,+f,c)
  InField(i,+f,c) ^ !Token(“.”,i,c) <=> InField(i+1,+f,c)
  f != f’ => (!InField(i,+f,c) v !InField(i,+f’,c))
  Token(+t,i,c) => InField(i,+f,c)
  InField(i,+f,c) => InField(i+1,+f,c)
  InField(i,+f,c) ^ !HasPunct(c,i) => InField(i+1,+f,c)
  InField(i,+f,c) ^ !HasComma(c,i) => InField(i+1,+f,c)
  => InField(1,”author”,c)
  => InField(2,”author”,c)
  => InField(midpointOfC, "title", c)   [computed off-line – WC]
Information Extraction (less simplified)
Predicates:
  Token(token, position, citation)
  InField(position, field, citation)
  SameField(field, citation, citation)
  SameCit(citation, citation)
Rules:
  Token(+t,i,c) => InField(i,+f,c)
  f != f’ => (!InField(i,+f,c) v !InField(i,+f’,c))
  Token(+t,i,c) => InField(i,+f,c)
  InField(i,+f,c) => InField(i+1,+f,c)
  InField(i,+f,c) ^ !HasPunct(c,i) => InField(i+1,+f,c)
  InField(i,+f,c) ^ !HasComma(c,i) => InField(i+1,+f,c)
  => InField(1,”author”,c)
  => InField(2,”author”,c)
  Center(c,i) => InField(i, "title", c)
Information Extraction (less simplified)
Rules so far:
  Token(+t,i,c) => InField(i,+f,c)
  f != f’ => (!InField(i,+f,c) v !InField(i,+f’,c))
  Token(+t,i,c) => InField(i,+f,c)
  InField(i,+f,c) => InField(i+1,+f,c)
  InField(i,+f,c) ^ !HasPunct(c,i) => InField(i+1,+f,c)
  InField(i,+f,c) ^ !HasComma(c,i) => InField(i+1,+f,c)
  => InField(1,”author”,c)
  => InField(2,”author”,c)
  => InField(midpointOfC, "title", c)
New heuristics:
• Initials tend to appear in the author or venue field.
• Positions before the last non-venue initial are usually not title or venue.
• Positions after the first “venue keyword” are usually not author or title.
Corresponding rules:
  Token(w,i,c) ^ IsAlphaChar(w) ^ FollowBy(c,i,”.”) => InField(c,”author”,i) v InField(c,”venue”,i)
  LastInitial(c,i) ^ LessThan(j,i) => !InField(c,”title”,j) ^ !InField(c,”venue”,j)
  FirstInitial(c,i) ^ LessThan(i,j) => InField(c,”author”,j)
  FirstVenueKeyword(c,i) ^ LessThan(i,j) => !InField(c,”author”,j) ^ !InField(c,”title”,j)
Information Extraction (less simplified) • SimilarTitle(c,i,j,c’,i’,j’): true if • c[i..j] and c’[i’…j’] are both “titlelike” • i.e., no punctuation, doesn’t violate rules above • c[i..j] and c’[i’…j’] are “similar” • i.e. start with same trigram and end with same token • SimilarVenue(c,c’): true if c and c’ don’t contain conflicting venue keywords (e.g., journal vs proceedings)
Information Extraction (less simplified) • SimilarTitle(c,i,j,c’,i’,j’): … • SimilarVenue(c,c’): … • JointInferenceCandidate(c,i,c’): • trigram starting at i in c also appears in c’ • and trigram is a possible title • and punct before trigram in c’ but not c
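A hypothetical illustration of JointInferenceCandidate (not an example from the paper): suppose citation c reads “… A. Author Scaling up statistical learning In ICML …”, with the period after the author name missing, while citation c’ reads “… B. Writer. Scaling up statistical learning. ICML …”. The trigram “Scaling up statistical” starting at position i in c also appears in c’, it looks like a possible title, and it is preceded by punctuation in c’ but not in c — so JointInferenceCandidate(c,i,c’) is true, and (via the rule on the next slide) the segmentation of c is allowed to start a new field at i even though c has no punctuation there.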
Information Extraction (less simplified)
SimilarTitle(c,i,j,c’,i’,j’): …   SimilarVenue(c,c’): …   JointInferenceCandidate(c,i,c’): …
The earlier rule
  InField(i,+f,c) ^ !HasPunct(c,i) => InField(i+1,+f,c)
is replaced by
  InField(i,+f,c) ^ !HasPunct(c,i) ^ (!exists c’: JointInferenceCandidate(c,i,c’)) => InField(i+1,+f,c)
Why is this joint? Recall we also have:
  Token(+t,i,c) ^ InField(i,+f,c) ^ Token(+t,i’,c’) ^ InField(i’,+f,c’) => SameField(+f,c,c’)
This is the Jnt-Seg model.
Information Extraction (less simplified)
• SimilarTitle(c,i,j,c’,i’,j’): …
• SimilarVenue(c,c’): …
• JointInferenceCandidate(c,i,c’): …
• InField(i,+f,c) ^ !HasPunct(c,i) ^ (!exists c’: JointInferenceCandidate(c,i,c’) ^ SameCitation(c,c’)) => InField(i+1,+f,c)
This is the Jnt-Seg-ER model.
Results: segmentation Percent error reduction for best joint model
Results: matching
Metric: fraction of clusters correctly constructed using the transitive closure of pairwise decisions.
Cora F-S: 0.87 F1. Cora TFIDF: 0.84 max F1.
William’s summary • MLNs are a compact, elegant way of describing a Markov network • Standard learning methods work • Network may be very very large • Inference may be expensive • Doesn’t eliminate feature engineering • E.g., complicated “feature” predicates • Experimental results for joint matching/NER are not that strong overall • Cascading segmentation and then matching improves segmentation, maybe not matching • But it needs to be carefully restricted (efficiency?)
Outline
• Goal: given (DBLP record, citation-text) pairs that do match, learn to segment citations.
• Methods:
  • Learn a CRF to align the record and the text (sort of like learning an edit distance)
  • Generate alignments, and use them as training data for a linear-chain CRF that does segmentation (aka extraction)
  • This second CRF does not need records to work
Alignment….
Notation:
• Alignment feature: depends on the alignment a and the x’s
• Extraction feature: depends on a, y(1), and x(2)
Learning for alignment…
• Generalized expectation criterion: rather than minimizing E_data[f] – E_model[f] (plus a penalty term for the weights), minimize a weighted squared difference between E_model[f] and p, where p is the user’s prior on the value of the feature.
• “We simulate user-specified expectation criteria [i.e. p’s] with statistics on manually labeled citation texts.” … top 10 features by MI, p in 11 bins, w=10
• E_model[f] here is a sum of marginal probabilities divided by the size of the variable set.
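Roughly, and in my notation rather than the paper’s, the GE term added to the training objective has the form
  O(θ) = – Σ_k λ_k ( p_k – E_θ[f_k] )²  –  ||θ||² / (2σ²)
so instead of matching model expectations to empirical feature counts from labeled data (as in standard CRF training), the model expectation E_θ[f_k] of each constrained feature is pushed toward the user-supplied target value p_k.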
Results On 260 records, 522 record-text pairs
Results (systems compared):
• CRF trained with extraction criteria derived from labeled data
• CRF trained on records partially aligned with high-precision rules
• CRF trained on DBLP records
• …and also using partial matches to DB records at test time
• “Gold standard”: hand-labeled extraction data
Alignments and expectations Simplified version of the idea: from Learning String Edit Distance, Ristad and Yianilos, PAMI 1998
HMM Example
A two-state HMM (states 1 and 2), with transition probabilities Pr(1→1), Pr(1→2), Pr(2→1), Pr(2→2) and emission probabilities Pr(1→x), Pr(2→x).
Sample output: x^T = heehahaha, s^T = 122121212
HMM Inference
Key point: Pr(s_i = l) depends only on the transition probabilities Pr(l’→l) and on s_{i-1}, so you can propagate probabilities forward along the sequence x_1, x_2, x_3, …, x_T.
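A minimal sketch of that forward propagation in Python (a generic forward algorithm; the parameter values below are made up, not from the lecture):

  def forward(obs, states, start_p, trans_p, emit_p):
      """Forward algorithm: alpha[i][l] = Pr(x_1..x_i, s_i = l)."""
      alpha = [{l: start_p[l] * emit_p[l][obs[0]] for l in states}]
      for x in obs[1:]:
          prev = alpha[-1]
          alpha.append({l: emit_p[l][x] * sum(prev[lp] * trans_p[lp][l] for lp in states)
                        for l in states})
      return alpha                                   # sum(alpha[-1].values()) = Pr(x_1..x_T)

  states = [1, 2]
  start_p = {1: 1.0, 2: 0.0}
  trans_p = {1: {1: 0.3, 2: 0.7}, 2: {1: 0.6, 2: 0.4}}
  emit_p = {1: {"h": 0.6, "e": 0.3, "a": 0.1}, 2: {"h": 0.1, "e": 0.4, "a": 0.5}}
  alpha = forward("heehahaha", states, start_p, trans_p, emit_p)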
Pair HMM Notation Andrew used “null”