290 likes | 416 Views
Discriminative Training of Clustering Functions Theory and Experiments with Entity Identification. Xin Li & Dan Roth University of Illinois, Urbana-Champaign. Clustering Current approaches Some problems Making Clustering a Learning problem
E N D
Discriminative Training ofClustering Functions Theory and Experiments with EntityIdentification Xin Li & Dan Roth University of Illinois, Urbana-Champaign
Clustering Current approaches Some problems Making Clustering a Learning problem Supervised Discriminative Clustering Framework The Reference Problem: Entity Identification within & across documents. Outline
Kennedy The Reference Problem Document 1:The Justice Department has officially ended its inquiry into the assassinations ofJohn F. Kennedyand Martin Luther King Jr., finding ``no persuasive evidence'' to support conspiracy theories, according to department documents. The House Assassinations Committee concluded in 1978 thatKennedywas ``probably'' assassinated as the result of a conspiracy involving a second gunman, a finding that broke from the Warren Commission's belief that Lee Harvey Oswald acted alone in Dallas on Nov. 22, 1963. Document 2: In 1953, MassachusettsSen. John F. Kennedymarried Jacqueline Lee Bouvier in Newport, R.I. In 1960, Democratic presidential candidate John F. Kennedy confronted the issue of his Roman Catholic faith by telling a Protestant group in Houston, ``I do not speak for my church on public matters, and the church does not speak for me.'‘ Document 3:David Kennedywas born in Leicester, England in 1959. …Kennedyco-edited The New Poetry (Bloodaxe Books 1993), and is the author of New Relations: The Refashioning Of British Poetry 1980-1994 (Seren 1996).
Entity Identification in Text • Goal: Given names, within or across documents, identify real-world entities behind them. • Problem Definition: Given a set of names and their semantic types, [people], [locations] [Organizations] partition them into groups that refer to different entities. Approaches: • A generative Model [Li, Morie, Roth, NAACL’04] • A discriminative approach [Li, Morie, Roth, AAAI’04] Other works [on citation, and more: Milche et. al; Bilenko et. al.,…] • Intuitively, a discriminative approach, requires • using some similarity measure • between names, followed up by • clustering into clusters that • represent entities.
Clustering • An optimization procedure that takes • A collection of data elements • A distance (similarity) measure on the space of data elements • A Partition Algorithm • Attempts to: • Optimize some quality with respect to the given distance metric.
Example: K-means Clustering • An Optimization Problem: • Data: X = {x1,x2,…} Cluster Names: C = {1,2,3,…,K} • The Euclidean Distance:d(x1,x2) = [ (x1-x2)T(x1-x2)]1/2 • Find a mapping f: X C • That minimizes:jx 2 Cjd(x,j )2 • Wherej = 1/m x 2 Cj x mean of elements in the k-th cluster
Many NLP Applications • Class-based language models: • group similar words together based on their semantics (Dagan et. al 99, Lee et. al ; Pantel and Lin, 2002). • Document categorization; and topic identification (Karypis, Han 99,02). • Co-reference resolution – • build coreference chain of noun phrases (Cardie, Wagstaff 99). • In all cases – fixed metric distance; tuned for the application and the data. (and the algorithm?)
Clustering: Metric and Algorithm There is no ‘universal’ distance metric that is appropriate for all clustering algorithms How do we make sure we have an appropriate one, that reflects the task/designer intentions? d1(x,x’) = [(f1 - f1’) 2+(f2 - f2’) 2]1/2 d2(x,x’) = |(f1+ f2)-(f1’+f2’)| (a) Single-Linkage with Euclidean (c) K-Means with a Linear Metric (b) K-Means with Euclidean
A partition function h(S) = Ad(S) distance metric d clustering algorithm A + K-means X = {x1,x2,…}, C = {c1,c2,…,ck} Euclidean Distance: d(x, x’) = [(x- x’)T(x- x’)]1/2 unlabeled data set S partition h(S) Traditional Clustering Framework • Typically, unsupervised; no learning. • More recently: work on metric learning with supervision: [Bilenko&Mooney 03, 04, Xing et. al.’03, Schultz & Joachims’03, Bach & Jordan03] • Learning a metric; then cluster • Learning while clustering (algorithm specific)
Training Stage: Goal: h*=argmin errS(h,p) labeled data set S A partition function h(S) = Ad(S) supervised learner distance metric d clustering algorithm A + Application Stage: h(S’ ) unlabeled data set S’ partition h(S’) Supervised Discriminative Clustering (SDC) • Incorporates supervision directly into metric training process; • Training is driven by true clustering error • Computed via the chosen data partition algorithm.
Elements of SDC: Partition Function and Error • Goal: A partition function hmaps a set of data pointsSto a partition h(S)ofS. (outcome of a clustering algorithm) [Note difference from multi-class classification] • The partition function h is a function of the parameterized distance metric: d(x1,x2) = wi |xi1- xi2| • Error:Given a labeled data setS; p(S) = {(xi,ci)}1m,the correct partition,and a fixed clustering algorithmA, the training process attempts to findd*, minimizing the clustering error: d*= argmind errS(h,p), whereh(S)=Ad(S). Optimal (given) Partition Learned Partition
A Supervised Clustering Error errS(h,p) = 1/|S|2ij[d(xi,xj)*Aij +(D-d(xi,xj))*Bij] (as opposed to a quality function that depends only on the distance) Two types of errors in pairwise prediction: (xi,xj) h `together’ or ‘apart’ False negative: Aij = I [p(xi)=p(xj) & h(xi)h(xj)], False positive: Bij = I [p(xi) p(xj) & h(xi)=h(xj)], D = maxij d(xi,xj ) . (See paper for a comparison with other error functions)
Training the distance function S Initialize the distance Metric d Cluster S using algorithm h=Ad Update d Evaluate ErrS(h,p) A gradient descent based algorithm
Gradient descent Alg. Learns a metric Iteratively by adjusting the parameter vector by a small amount in the direction that would most reduce the error. Training the distance function
Entity Identification in Text • Goal: Given names, within or across documents, identify real-world entities behind them. • Problem Definition: Given a set of names and their semantic types, [people], [locations] [Organizations] partition them into groups that refer to different entities. Approaches: • A generative Model [Li, Morie, Roth, NAACL’04] • A discriminative approach [Li, Morie, Roth, AAAI’04]
Parameterized Distance Metrics for Name Matching John F. Kennedy ? President Kennedy • Feature Extraction:(John F. Kennedy, President Kennedy)= (1,2 ,…) • Fixed distance: distance (similarity) metric d for names. • d (John F. Kennedy, President Kennedy) 0.6 • d (Chicago Cubs, Cubs) 0.6 • d (United States, USA) 0.7 • A learned distance function parameterized as a Linear function over features (kernelized): d(John F. Kennedy, President Kennedy) = wi i • Make it a pairwise classifier: h(John F. Kennedy, President Kennedy) = `together’ iff wi i <= 0.5 • The distance function can be trained separately, to optimize partition quality, or via SDC, to minimize Error. (via gradient descent)
Features • Relational features that are extracted from a pair of strings, taking into account relative positions of tokens, substring relations, etc.
Experimental Setting • Names of people, locations and organizations. • John F. Kennedy, Bush, George W. Bush • U.S.A, United States, and America • University of Illinois, U. of I., IBM, International Business Machines. • 300 randomly picked New York Times news articles. • 8,600 names annotated by a named entity tagger and manually verified. • Training sets contain names labeled with its global entity. John F. Kennedy Kennedy1 President Kennedy Kennedy1, David Kennedy Kennedy2. Data is available fromhttp://l2r.cs.uiuc.edu/~cogcomp/
Gain from Metric Learning while Clustering • SoftTFIDF (Cohen et. al): Fixed metric • LMR (Li, Morie, Roth, AAAI’04) learned metric via a pairwise classifier; relational features extracted from pairs of strings; feedback from pairwise labels; • SDC: trains a linear weighted distance metric for the single-link clustering algorithm with labeled pairs of 600 names.
Different Clustering Algorithms • Difference across clustering algorithm is not as significant as difference obtained from learning a good metric via SDC.
Summary • A framework for Metric Learning for Clustering that is guided by global supervision with clustering as part of the feedback loop. • A parameterized distance metric is learned in a way that depends on the specific clustering algorithm used. • Significant improvement shown on the Reference Problem: Entity Identification Across documents.
Intuition behind SDC d K=16
John Kennedy 1 2 1 2 3 John Kennedy Davis Relational Features • Relational features:do not depend on specific tokens in the two names, but depend on some abstraction over tokens. • Honorific Equal: Mr., Mrs., President, Prof. • Nickname: Thomas, Tom • Edit Distance. Toward Concept-Based Text Understanding and Mining
Michael Jordan, Michael Jordan How to employ transitivity between names ? Clustering: splitting a set of names. • Distance Metrics:Edit distance, SoftTFIDF, Jora-Winkler • Clustering Algorithms: Single-Link, Complete-Link, K-means, graph cut. Toward Concept-Based Text Understanding and Mining
Outline • Clustering • Current approaches • Some problems • Making Clustering a Learning problem • Supervised Discriminative Clustering Framework • The Reference Problem: • Entity Identification in within & across document.
Entity Identification in Text • Goal: Given names, within or across documents, identify real-world entities behind them. • Problem Definition: Given a set of names and their semantic types, [people], [locations] [Organizations] partition them into groups that refer to different entities. Approaches: • A generative Model [Li, Morie, Roth, NAACL’04] • A discriminative approach [Li, Morie, Roth, AAAI’04] Challenge: millions of entities in the world, but in training, we can only see names of a limited number of entities.