Discriminative Training of Clustering Functions: Theory and Experiments with Entity Identification
Xin Li & Dan Roth, University of Illinois, Urbana-Champaign
Outline
• Clustering
  • Current approaches
  • Some problems
  • Making clustering a learning problem
• Supervised Discriminative Clustering framework
• The Reference Problem: Entity Identification within & across documents
The Reference Problem
Document 1: The Justice Department has officially ended its inquiry into the assassinations of John F. Kennedy and Martin Luther King Jr., finding "no persuasive evidence" to support conspiracy theories, according to department documents. The House Assassinations Committee concluded in 1978 that Kennedy was "probably" assassinated as the result of a conspiracy involving a second gunman, a finding that broke from the Warren Commission's belief that Lee Harvey Oswald acted alone in Dallas on Nov. 22, 1963.
Document 2: In 1953, Massachusetts Sen. John F. Kennedy married Jacqueline Lee Bouvier in Newport, R.I. In 1960, Democratic presidential candidate John F. Kennedy confronted the issue of his Roman Catholic faith by telling a Protestant group in Houston, "I do not speak for my church on public matters, and the church does not speak for me."
Document 3: David Kennedy was born in Leicester, England in 1959. … Kennedy co-edited The New Poetry (Bloodaxe Books 1993), and is the author of New Relations: The Refashioning Of British Poetry 1980-1994 (Seren 1996).
Entity Identification in Text
• Goal: Given names, within or across documents, identify the real-world entities behind them.
• Problem definition: Given a set of names and their semantic types ([People], [Locations], [Organizations]), partition them into groups that refer to different entities.
• Approaches:
  • A generative model [Li, Morie, Roth, NAACL'04]
  • A discriminative approach [Li, Morie, Roth, AAAI'04]
  • Other work (on citation matching and more): [Milch et al.; Bilenko et al.; …]
• Intuitively, a discriminative approach requires some similarity measure between names, followed by clustering into groups that represent entities.
Clustering
• An optimization procedure that takes:
  • a collection of data elements;
  • a distance (similarity) measure on the space of data elements;
  • a partition algorithm.
• Attempts to optimize some quality measure with respect to the given distance metric.
Example: K-means Clustering
• An optimization problem:
  • Data: $X = \{x_1, x_2, \dots\}$; cluster names: $C = \{1, 2, \dots, K\}$
  • The Euclidean distance: $d(x_1, x_2) = [(x_1 - x_2)^T (x_1 - x_2)]^{1/2}$
• Find a mapping $f: X \rightarrow C$ that minimizes $\sum_{j} \sum_{x \in C_j} d(x, \mu_j)^2$,
  where $\mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x$ is the mean of the elements in the $j$-th cluster.
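For concreteness, a minimal K-means sketch in Python/NumPy (ours, not the paper's; the initialization and stopping choices are arbitrary):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-means: assign points to the nearest mean, then recompute means."""
    rng = np.random.default_rng(seed)
    # Initialize the K cluster means with randomly chosen data points.
    mu = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: f(x) = argmin_j d(x, mu_j).
        d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        f = d.argmin(axis=1)
        # Update step: mu_j = mean of the points mapped to cluster j.
        new_mu = np.array([X[f == j].mean(axis=0) if np.any(f == j) else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):  # converged: means stopped moving
            break
        mu = new_mu
    return f, mu
```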
Many NLP Applications
• Class-based language models: group similar words together based on their semantics (Dagan et al. '99; Lee et al.; Pantel and Lin 2002).
• Document categorization and topic identification (Karypis & Han '99, '02).
• Co-reference resolution: build coreference chains of noun phrases (Cardie & Wagstaff '99).
• In all cases: a fixed distance metric, tuned for the application and the data (and the algorithm?).
Clustering: Metric and Algorithm
• There is no 'universal' distance metric that is appropriate for all clustering algorithms.
• How do we make sure we have an appropriate one, which reflects the task/designer intentions?
  • $d_1(x, x') = [(f_1 - f_1')^2 + (f_2 - f_2')^2]^{1/2}$
  • $d_2(x, x') = |(f_1 + f_2) - (f_1' + f_2')|$
[Figure: (a) single-linkage with Euclidean distance; (b) K-means with Euclidean distance; (c) K-means with a linear metric]
Traditional Clustering Framework
[Diagram: an unlabeled data set S is fed to a partition function h(S) = A_d(S), composed of a distance metric d plus a clustering algorithm A, producing the partition h(S). Example: K-means over X = {x_1, x_2, ...} with clusters C = {c_1, c_2, ..., c_k} and the Euclidean distance $d(x, x') = [(x - x')^T (x - x')]^{1/2}$.]
• Typically unsupervised; no learning.
• More recently: work on metric learning with supervision [Bilenko & Mooney '03, '04; Xing et al. '03; Schultz & Joachims '03; Bach & Jordan '03]:
  • learning a metric, then clustering;
  • learning while clustering (algorithm specific).
Supervised Discriminative Clustering (SDC)
[Diagram. Training stage: a labeled data set S feeds a supervised learner with goal $h^* = \arg\min_h err_S(h, p)$, producing a partition function h(S) = A_d(S): a distance metric d plus a clustering algorithm A. Application stage: applying h to an unlabeled data set S' yields the partition h(S').]
• Incorporates supervision directly into the metric training process.
• Training is driven by the true clustering error, computed via the chosen data partition algorithm.
Elements of SDC: Partition Function and Error
• Goal: A partition function h maps a set of data points S to a partition h(S) of S (the outcome of a clustering algorithm). [Note the difference from multi-class classification.]
• The partition function h is a function of the parameterized distance metric $d(x_1, x_2) = \sum_i w_i\,|x_1^i - x_2^i|$.
• Error: Given a labeled data set S with the correct partition $p(S) = \{(x_i, c_i)\}_1^m$ and a fixed clustering algorithm A, the training process attempts to find the d* minimizing the clustering error:
  $d^* = \arg\min_d \, err_S(h, p)$, where $h(S) = A_d(S)$.
[Figure: the optimal (given) partition vs. the learned partition]
A Supervised Clustering Error
$$err_S(h, p) = \frac{1}{|S|^2} \sum_{i,j} \big[\, d(x_i, x_j)\, A_{ij} + (D - d(x_i, x_j))\, B_{ij} \,\big]$$
(as opposed to a quality function that depends only on the distance)
Two types of errors in the pairwise prediction of h, '$(x_i, x_j)$ together' or 'apart':
• False negative: $A_{ij} = I\,[\,p(x_i) = p(x_j) \;\wedge\; h(x_i) \ne h(x_j)\,]$
• False positive: $B_{ij} = I\,[\,p(x_i) \ne p(x_j) \;\wedge\; h(x_i) = h(x_j)\,]$
• $D = \max_{i,j} d(x_i, x_j)$
(See the paper for a comparison with other error functions.)
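A direct transcription of this error into code, assuming the weighted L1 metric $d(x_1, x_2) = \sum_i w_i\,|x_1^i - x_2^i|$ from the previous slide; the function and variable names are ours:

```python
import numpy as np

def weighted_l1(w, x1, x2):
    """Parameterized distance from the slides: d(x1, x2) = sum_i w_i * |x1_i - x2_i|."""
    return float(np.dot(w, np.abs(x1 - x2)))

def clustering_error(X, true_labels, pred_labels, w):
    """err_S(h, p): false negatives pay d(xi, xj), false positives pay D - d(xi, xj)."""
    n = len(X)
    d = np.array([[weighted_l1(w, X[i], X[j]) for j in range(n)] for i in range(n)])
    D = d.max()
    err = 0.0
    for i in range(n):
        for j in range(n):
            same_true = true_labels[i] == true_labels[j]
            same_pred = pred_labels[i] == pred_labels[j]
            if same_true and not same_pred:    # false negative: A_ij = 1
                err += d[i, j]
            elif not same_true and same_pred:  # false positive: B_ij = 1
                err += D - d[i, j]
    return err / n**2
```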
Training the Distance Function
A gradient-descent based algorithm:
1. Initialize the distance metric d.
2. Cluster S using the algorithm h = A_d.
3. Evaluate err_S(h, p).
4. Update d, and repeat from step 2.
The algorithm learns a metric iteratively, adjusting the parameter vector by a small amount in the direction that would most reduce the error.
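A sketch of this loop, building on the clustering_error sketch above. The paper does not spell out the exact update rule, so the gradient here is approximated by finite differences; cluster_fn is our stand-in for whatever clustering algorithm A is plugged in (e.g., single-link under the current weights):

```python
import numpy as np

def train_metric(X, true_labels, cluster_fn, w0, lr=0.01, eps=1e-4, n_steps=50):
    """Gradient-descent training of the metric weights w. The clustering algorithm
    runs inside the loop, so the objective is the true clustering error."""
    w = w0.copy()
    for _ in range(n_steps):
        base = clustering_error(X, true_labels, cluster_fn(X, w), w)
        grad = np.zeros_like(w)
        # Finite-difference estimate of the gradient of err_S w.r.t. each weight.
        for i in range(len(w)):
            w_pert = w.copy()
            w_pert[i] += eps
            err_pert = clustering_error(X, true_labels, cluster_fn(X, w_pert), w_pert)
            grad[i] = (err_pert - base) / eps
        w -= lr * grad  # move in the direction that most reduces the error
    return w
```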
Parameterized Distance Metrics for Name Matching
John F. Kennedy ?= President Kennedy
• Feature extraction: $\Phi$(John F. Kennedy, President Kennedy) = $(\phi_1, \phi_2, \dots)$
• Fixed distance: a distance (similarity) metric d for names, e.g.:
  • d(John F. Kennedy, President Kennedy) ≈ 0.6
  • d(Chicago Cubs, Cubs) ≈ 0.6
  • d(United States, USA) ≈ 0.7
• A learned distance function, parameterized as a linear function over the features (kernelized): d(John F. Kennedy, President Kennedy) = $\sum_i w_i \phi_i$
• Make it a pairwise classifier: h(John F. Kennedy, President Kennedy) = 'together' iff $\sum_i w_i \phi_i \le 0.5$
• The distance function can be trained separately, to optimize partition quality, or via SDC, to minimize the error (via gradient descent).
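A toy rendering of the linear pairwise distance and the thresholded classifier. The feature map below is our own stand-in; the paper's actual features are the relational ones described later:

```python
import numpy as np

def name_features(n1, n2):
    """Toy pairwise dissimilarity features over two name strings."""
    t1, t2 = n1.lower().split(), n2.lower().split()
    return np.array([
        float(t1[-1] != t2[-1]),        # last tokens (surnames) differ
        float(t1[0] != t2[0]),          # first tokens differ
        float(not set(t1) & set(t2)),   # no shared token at all
        abs(len(t1) - len(t2)) / 3.0,   # token-count difference, scaled
    ])

def name_distance(w, n1, n2):
    """Linear distance over pairwise features: d(n1, n2) = sum_i w_i * phi_i."""
    return float(np.dot(w, name_features(n1, n2)))

def together(w, n1, n2, threshold=0.5):
    """Pairwise decision from the slide: 'together' iff d(n1, n2) <= 0.5."""
    return name_distance(w, n1, n2) <= threshold

w = np.array([0.5, 0.2, 0.3, 0.2])  # illustrative weights
print(together(w, "John F. Kennedy", "President Kennedy"))  # True: surnames match
```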
Features • Relational features that are extracted from a pair of strings, taking into account relative positions of tokens, substring relations, etc.
Experimental Setting
• Names of people, locations, and organizations:
  • John F. Kennedy, Bush, George W. Bush
  • U.S.A., United States, and America
  • University of Illinois, U. of I., IBM, International Business Machines
• 300 randomly picked New York Times news articles.
• 8,600 names annotated by a named entity tagger and manually verified.
• Training sets contain names labeled with their global entities: John F. Kennedy → Kennedy1; President Kennedy → Kennedy1; David Kennedy → Kennedy2.
• Data is available from http://l2r.cs.uiuc.edu/~cogcomp/
Gain from Metric Learning while Clustering
• SoftTFIDF (Cohen et al.): a fixed metric.
• LMR (Li, Morie, Roth, AAAI'04): a metric learned via a pairwise classifier, with relational features extracted from pairs of strings and feedback from pairwise labels.
• SDC: trains a linearly weighted distance metric for the single-link clustering algorithm with labeled pairs of 600 names.
Different Clustering Algorithms
• The difference across clustering algorithms is not as significant as the difference obtained from learning a good metric via SDC.
Summary
• A framework for metric learning for clustering that is guided by global supervision, with clustering as part of the feedback loop.
• A parameterized distance metric is learned in a way that depends on the specific clustering algorithm used.
• Significant improvement shown on the Reference Problem: Entity Identification across documents.
Intuition behind SDC
[Figure: distance metric d, K = 16]
Relational Features
[Figure: token-level alignment between the names "John Kennedy" and "John Kennedy Davis"]
• Relational features do not depend on the specific tokens in the two names, but on some abstraction over tokens:
  • Honorific Equal: Mr., Mrs., President, Prof.
  • Nickname: Thomas, Tom
  • Edit Distance
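A sketch of such relational features; the honorific and nickname tables are tiny illustrative stand-ins:

```python
HONORIFICS = {"mr.", "mrs.", "ms.", "dr.", "prof.", "president", "sen."}
NICKNAMES = {"tom": "thomas", "bill": "william", "bob": "robert"}

def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def relational_features(n1, n2):
    """Features over token abstractions rather than the specific tokens."""
    t1, t2 = n1.lower().split(), n2.lower().split()
    h1 = [t for t in t1 if t in HONORIFICS]
    h2 = [t for t in t2 if t in HONORIFICS]
    honorific_equal = float(bool(h1) and h1 == h2)
    # Canonicalize nicknames, then test for overlap beyond literal token matches.
    c1 = {NICKNAMES.get(t, t) for t in t1}
    c2 = {NICKNAMES.get(t, t) for t in t2}
    nickname_match = float(bool(c1 & c2) and not set(t1) & set(t2))
    return honorific_equal, nickname_match, edit_distance(n1, n2)
```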
• Michael Jordan vs. Michael Jordan: how to employ transitivity between names?
• Clustering: splitting a set of names.
• Distance metrics: edit distance, SoftTFIDF, Jaro-Winkler.
• Clustering algorithms: single-link, complete-link, K-means, graph cut.
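Single-link clustering is one way transitivity enters: two names land in the same cluster whenever a chain of sufficiently close pairs connects them. A minimal union-find sketch of ours, with dist standing for any of the metrics above:

```python
def single_link(names, dist, threshold):
    """Union-find single-link clustering: merge any pair closer than threshold.
    Transitivity falls out: chains of close pairs collapse into one cluster."""
    parent = list(range(len(names)))

    def find(i):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if dist(names[i], names[j]) <= threshold:
                parent[find(i)] = find(j)  # union the two clusters
    return [find(i) for i in range(len(names))]  # root id = cluster label
```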
Entity Identification in Text: The Challenge
• There are millions of entities in the world, but in training we can only see the names of a limited number of entities.