Literature Survey: Graph-based Clustering and its Application in Coreference Resolution
Zheng Chen
Computer Science Department, The Graduate Center, The City University of New York
November 24, 2009
Motivations and Goals
• Motivations
  • Graph-based clustering has attracted researchers from various fields
  • Theoreticians are busy studying quality measures and algorithms
  • Practitioners are busy adapting the algorithms to their own applications
  • Some algorithms are well known and popular in one field yet new to another
• Goals
  • Provide an overview of graph-based clustering methodology
  • Apply it to coreference resolution
Outline (Part I: Theory)
Graph-based Clustering Methodology (a five-part story)
Outline (Part II: Application)
Coreference Resolution: A Case Study of Applying Graph-based Clustering
• Entity Coreference Resolution
  • A two-step procedure: classification and clustering
  • Graph-based (Nicolae and Nicolae, 2006)
• Event Coreference Resolution
  • Agglomerative clustering algorithm (Chen et al., 2009a)
  • Graph-based (Chen and Ji, 2009b)
Conclusions
Clustering in Graph Perspective
Graph Notation
Hypothesis
The hypothesis can be stated in different ways:
• A graph can be partitioned into densely connected subgraphs that are sparsely connected to each other
• A random walk that visits a dense subgraph will likely stay in the subgraph until many of its nodes have been visited
• Considering all shortest paths between all pairs of nodes, edges between dense subgraphs are likely to lie on many shortest paths
(Figure labels: Manhattan, Queens)
Modeling
• Determine the meaning of vertices and edges
• Compute the edge weights
• Graph construction
  • Which graph should be chosen, and how should its parameters be chosen? (no theoretical justification)
Measure
Measure: Cheat Sheet
• Formulas
• Objective functions
Measure: Computation Examples
(Figure: weighted graph on nodes 1 to 6 with clusters C1 and C2; edge weights 0.8, 0.8, 0.6 inside C1, 0.1 and 0.2 between C1 and C2, and 0.8, 0.8, 0.7 on the remaining edges)
T = total weight in the graph
• intra_density(C1) = (0.8 + 0.8 + 0.6) / T
• inter_density(C1, C2) = (0.1 + 0.2) / T
• cut(C1, C2) = 0.1 + 0.2
• ratiocut(C1) = cut(C1, C2) / 3
• vol(C1) = 0.8 + 0.8 + 0.6 + 0.1 + 0.2
• ncut(C1) = cut(C1, C2) / vol(C1)
• expansion(C1) = min{1.6/1, 1.4/1, 1.4/1} = 1.4
• conductance(C1) = min{1.6/1.6, 1.4/1.5, 1.4/1.6} = 1.4/1.6
• modularity(C1) = 3/8 - (4/8)^2, where 3/8 is the fraction of edges inside cluster C1 and (4/8)^2 is the expected fraction of edges in C1 if edges were placed at random in the graph
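The computations on this slide can be reproduced with a few lines of Python. This is only a sketch: the edge weights are the ones listed on the slide, but the assignment of edges to node pairs is an assumption, since the figure itself did not survive. `vol` follows the slide's convention (each intra-cluster edge counted once); many texts instead define volume as the sum of node degrees.

```python
# Edge weights come from the slide; which node pair carries which weight is an
# assumption made only so that the example can run.
EDGES = {
    (1, 2): 0.8, (1, 3): 0.8, (2, 3): 0.6,  # inside C1
    (4, 5): 0.8, (4, 6): 0.7, (5, 6): 0.8,  # remaining edges, assumed inside C2
    (2, 5): 0.1, (3, 4): 0.2,               # edges between C1 and C2
}
C1, C2 = {1, 2, 3}, {4, 5, 6}


def cut(a, b):
    """Total weight of edges with one endpoint in a and the other in b."""
    return sum(w for (u, v), w in EDGES.items()
               if (u in a and v in b) or (u in b and v in a))


def intra_weight(c):
    """Total weight of edges with both endpoints in c."""
    return sum(w for (u, v), w in EDGES.items() if u in c and v in c)


def vol(c):
    """Slide convention: intra-cluster weight plus weight leaving the cluster."""
    leaving = sum(w for (u, v), w in EDGES.items() if (u in c) != (v in c))
    return intra_weight(c) + leaving


def modularity_term(c):
    """Unweighted modularity contribution of a single cluster."""
    m = len(EDGES)
    inside = sum(1 for (u, v) in EDGES if u in c and v in c)
    degree_sum = sum((u in c) + (v in c) for (u, v) in EDGES)
    return inside / m - (degree_sum / (2 * m)) ** 2


total = sum(EDGES.values())
print("intra_density(C1) =", intra_weight(C1) / total)
print("cut(C1, C2)       =", cut(C1, C2))
print("ratiocut(C1)      =", cut(C1, C2) / len(C1))
print("ncut(C1)          =", cut(C1, C2) / vol(C1))
print("modularity(C1)    =", modularity_term(C1))
```

With this labelling, the modularity term evaluates to 3/8 - (4/8)^2 = 0.125, matching the slide.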
Measure Summary: optimizing each of the measures is an NP-hard problem
• intra-cluster density and inter-cluster sparsity (Ausiello et al., 2002; Wagner and Wagner, 1993)
• ncut (Shi and Malik, 2000)
• expansion and conductance (Ausiello et al., 2002; Šíma and Schaeffer, 2006)
• bicriteria (Kannan et al., 2000)
• modularity (Brandes, 2006)
Any efficient (polynomial-time) algorithm that has been claimed to solve one of these optimization problems is therefore a heuristic and yields a sub-optimal clustering.
Algorithm
(m: number of edges, n: number of nodes, k: number of clusters)
Spectral clustering: Laplacian Matrix
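For orientation, the graph Laplacians usually meant in spectral clustering can be written as follows. The notation is an assumption rather than the slide's own: $W$ is the weighted adjacency matrix, $D$ the diagonal degree matrix with $d_i = \sum_j w_{ij}$, and $I$ the identity.

```latex
\[
L = D - W, \qquad
L_{\mathrm{sym}} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2}, \qquad
L_{\mathrm{rw}} = D^{-1} L = I - D^{-1} W
\]
```

Here $L$ is the unnormalized Laplacian, while $L_{\mathrm{sym}}$ and $L_{\mathrm{rw}}$ are the symmetric and random-walk normalized variants.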
Spectral clustering: Main Algorithm
Spectral clustering: Comments
• Unnormalized spectral clustering: ratiocut measure; normalized spectral clustering: ncut measure
• Which spectral clustering algorithm do we choose?
  • Regular graph: both work equally well
  • If the degrees in the graph are broadly distributed, prefer normalized to unnormalized
  • Normalized case: prefer the random-walk Laplacian L_rw rather than the symmetric Laplacian L_sym
• Why successful?
  • Does not make assumptions about the form of the clusters
  • Efficient: Lanczos algorithm to solve the eigenvalue problem (m: the number of edges, n: the number of vertices)
  • No worry about "local" optimum traps
• Unstable under different choices of the parameters when constructing the graph
Girvan and Newman Algorithm (Girvan and Newman, 2002)
• Edge betweenness
  • When a graph is made of tightly bound clusters that are loosely interconnected, all shortest paths between clusters have to go through the few inter-cluster connections.
• Algorithm
  1. Calculate the betweenness score for each edge
  2. Remove the edge with the highest score
  3. Recalculate betweenness
  4. Repeat from step 2
• Comments
  • Optimizes the modularity measure
  • Good results on real data
  • Complexity remains an issue: O(n^3) for a sparse graph
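The four steps above can be sketched in plain Python for an unweighted, undirected graph. This is an illustrative sketch, not the authors' implementation: edge betweenness is computed with a Brandes-style breadth-first accumulation, and the stopping rule (stop once `k` connected components exist) is an assumption, since the slide does not say when to stop. The function names and the `adj` dictionary format are hypothetical.

```python
from collections import deque


def edge_betweenness(adj):
    """Edge betweenness of an unweighted, undirected graph (dict: node -> set)."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist, sigma = {s: 0}, {v: 0.0 for v in adj}
        sigma[s] = 1.0
        preds = {v: [] for v in adj}
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path dependencies from the farthest nodes back to s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    # Every node pair was counted once from each endpoint.
    return {e: b / 2.0 for e, b in score.items()}


def components(adj):
    """Connected components as a list of node sets."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def girvan_newman(adj, k):
    """Remove highest-betweenness edges until k components exist (assumed stop rule)."""
    adj = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    while len(components(adj)) < k:
        scores = edge_betweenness(adj)           # steps 1 and 3
        if not scores:
            break
        u, v = max(scores, key=scores.get)       # step 2
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Two triangles joined by the single bridge 3-4.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
clusters = girvan_newman(graph, k=2)
print(sorted(sorted(c) for c in clusters))
```

On this toy graph the bridge carries all nine shortest paths between the two triangles, so it is removed first and the two triangles are returned as clusters.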
Newman fast algorithm (Newman, 2004)
• Algorithm
  1. Place each of the n nodes in its own cluster.
  2. Calculate the increase in Q for all possible cluster pairs.
  3. Merge the pair that leads to the greatest increase in Q.
  4. Repeat steps 2 and 3 until the modularity Q reaches its maximal value.
• Comments
  • Greedy optimization technique
  • Advantage in complexity: O(n^2) on a sparse graph; 50,000 nodes in minutes rather than years
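A naive rendering of these steps is shown below for an unweighted graph. It is a sketch under simplifying assumptions: modularity is recomputed from scratch for every candidate merge (so it does not have the complexity advantage of the real algorithm), and merging stops at the first step where no merge increases Q. The function names are hypothetical.

```python
def modularity(edges, clusters):
    """Q = sum over clusters of (fraction of edges inside) - (fraction of edge ends)^2."""
    m = len(edges)
    label = {v: i for i, c in enumerate(clusters) for v in c}
    inside = [0] * len(clusters)
    degree = [0] * len(clusters)
    for u, v in edges:
        degree[label[u]] += 1
        degree[label[v]] += 1
        if label[u] == label[v]:
            inside[label[u]] += 1
    return sum(inside[i] / m - (degree[i] / (2 * m)) ** 2
               for i in range(len(clusters)))


def greedy_modularity(nodes, edges):
    """Greedy agglomeration: always apply the merge with the largest gain in Q."""
    clusters = [frozenset([v]) for v in nodes]      # step 1: singletons
    best_q = modularity(edges, clusters)
    while len(clusters) > 1:
        candidates = []
        for i in range(len(clusters)):              # step 2: try every pair
            for j in range(i + 1, len(clusters)):
                merged = [c for k, c in enumerate(clusters) if k not in (i, j)]
                merged.append(clusters[i] | clusters[j])
                candidates.append((modularity(edges, merged), merged))
        q, merged = max(candidates, key=lambda t: t[0])
        if q <= best_q:                             # no merge improves Q: stop
            break
        best_q, clusters = q, merged                # step 3: keep the best merge
    return clusters, best_q


# Two triangles joined by the single bridge 3-4.
edge_list = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
found, q = greedy_modularity(range(1, 7), edge_list)
print(sorted(sorted(c) for c in found), round(q, 4))
```

On the toy graph the procedure recovers the two triangles with Q = 5/14.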
Algorithm: Summary
• No algorithm is a panacea
  • A clustering algorithm is usually proposed to optimize some quality measure, so it is unfair to compare two algorithms that favor two different measures
  • No measure can capture the full characteristics of cluster structures, thus there is no perfect algorithm
  • There is no definition of the so-called "best clustering". The "best" depends on the application, data characteristics, granularity and so on.
Evaluation
• Internal (intrinsic) measures
• External (extrinsic) measures
  • Are there any formal constraints (properties, criteria) that an ideal extrinsic measure should satisfy?
  • Do the extrinsic measures proposed so far satisfy the constraints?
Evaluation: Formal Constraints (Amigo et al., 2008)
• homogeneity
• completeness
• rag bag
• cluster size vs. quantity
Rosenberg and Hirschberg (2007)
Evaluation Measures
Measures for Coreference Resolution
• MUC:
  • no credit for separating out singleton clusters
  • all errors are considered to be equal
• B-Cubed:
  • overcomes the two drawbacks of the MUC measure
  • gives multiple credits to a single item
• ECM:
  • seeks an optimal alignment between the system clustering and the reference clustering
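To make the contrast between MUC and B-Cubed concrete, here is a small sketch of both scores as they are usually defined. It assumes that the system and reference clusterings cover exactly the same mentions; the function names and the toy clusterings are hypothetical.

```python
def b_cubed(system, reference):
    """B-Cubed precision, recall and F-measure, averaged over mentions."""
    sys_of = {m: c for c in system for m in c}
    ref_of = {m: c for c in reference for m in c}
    mentions = list(ref_of)
    precision = sum(len(sys_of[m] & ref_of[m]) / len(sys_of[m])
                    for m in mentions) / len(mentions)
    recall = sum(len(sys_of[m] & ref_of[m]) / len(ref_of[m])
                 for m in mentions) / len(mentions)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


def muc_recall(key, response):
    """MUC recall: links of the key clusters that the response preserves.

    MUC precision is obtained by swapping the two arguments.
    """
    resp_of = {m: i for i, c in enumerate(response) for m in c}
    found = possible = 0
    for cluster in key:
        # Number of pieces the response splits this key cluster into.
        parts = {resp_of.get(m, ("missing", m)) for m in cluster}
        found += len(cluster) - len(parts)
        possible += len(cluster) - 1      # a singleton contributes no links
    return found / possible if possible else 0.0


reference = [{"a", "b", "c"}, {"d"}]
system = [{"a", "b"}, {"c", "d"}]
print("B-Cubed P, R, F:", b_cubed(system, reference))
print("MUC recall:", muc_recall(reference, system))
print("MUC precision:", muc_recall(system, reference))
```

In the toy example the reference singleton {d} adds nothing to the MUC denominators, which is the first drawback listed above, whereas B-Cubed scores every mention, including d.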
Satisfaction of Formal Constraints for Various Measures
• Extend the work of Amigo et al. (2008) to more measures: adjusted Rand index, V-measure, MUC measure and ECM measure
• Re-compute all the scores
• None of the measures except the B-Cubed F-measure satisfies all four constraints
• The ECM F-measure fails three constraints: homogeneity, completeness and rag bag
Future Directions
• Scalability
  • graphs in real applications are growing rapidly
  • graphs are changing dynamically
• Stability
  • perturbations in the graph
• Statistical significance
  • how significant is a clustering compared with one produced by a null model of the graph?
Part II Coreference Resolution: a Case Study of Applying Graph-based Clustering Methodology
Coreference Resolution
• Entity coreference resolution: identifying which noun phrases (NPs, or mentions) refer to the same real-world entity in a text.
  • An entity is an object in the real world, such as a person, organization or facility.
  • A mention is a textual reference to an entity.
• Event coreference resolution: identifying which event mentions refer to the same event in a text.
  • An event is a specific occurrence involving participants.
  • An event mention includes a distinguished trigger (the word that most clearly expresses that an event occurs) and the arguments involved (entities or temporal expressions that play certain roles in the event).
Entity Coreference Resolution: an Example
John Perry, of Weston Golf Club, announced his resignation yesterday. He was the President of the Massachusetts Golf Association. During his two years in office, Perry guided the MGA into a closer relationship with the Women's Golf Association of Massachusetts.
Event Coreference Resolution: an Example
[EM4] Ankara police chief Ercument Yilmaz visited the site of the morning blast.
[EM1] An explosion in a cafe at one of the capital's busiest intersections killed one woman and injured another Tuesday.
[EM2] Police were investigating the cause of the explosion in the restroom of the multistory Crocodile Cafe in the commercial district of Kizilay during the morning rush hour.
[EM5] The explosion comes a month after [EM6] a bomb exploded at a McDonald's restaurant in Istanbul, causing damage but no injuries.
[EM7] Radical leftist, Kurdish and Islamic groups are active in the country and have carried out the bombing in the past.
[EM3] The blast shattered walls and windows in the building.
Event Coreference Resolution: an Example
A Parallel Comparison between Entity Coreference Resolution and Event Coreference Resolution
The two problems are similar because:
• the problem descriptions are similar
• the mathematical interpretations are similar
• they can be solved by applying a two-step procedure
• they can be solved by applying graph-based clustering methodology
They are different because:
• entities and events have different attributes and values
Solution: a Two-step Procedure
• classification step: compute the likelihood that one entity mention corefers with another
• clustering step: group the mentions into clusters such that all mentions in a cluster refer to the same entity
Solution: a Two-step Procedure
Classification step
• Learning algorithm
  • decision tree: McCarthy and Lehnert (1995), Soon et al. (2001), Strube et al. (2002), Strube and Muller (2003) and Yang et al. (2003)
  • maximum entropy: Luo et al. (2004)
  • SVM: Finley and Joachims (2005)
  • kernel: Yang et al. (2006)
• Feature sets
  • Soon et al. (2001) define 12 surface-level features in four categories: lexical, grammatical, semantic and positional
  • Ng and Cardie (2002) extend the 12 features to 53 with new features based on common-sense knowledge and linguistic intuitions
  • Ng (2007) proposes another six semantic features
  • Yang and Su (2007) extract semantic relatedness features from Wikipedia
Solution: a Two-step Procedure
Solution: a Two-step Procedure
Clustering step
• closest-first clustering (Soon et al., 2001)
• best-first clustering (Ng and Cardie, 2002)
(Figure: closest-first and best-first clustering of mentions EM1 to EM4 into entities E1 and E2, threshold = 0.5; edge scores 0.3, 0.2, 0.4, 0.6, 0.7, 0.8)
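The difference between the two linking strategies can be sketched as follows. The pairwise scores are made up for illustration (they are not the ones in the figure), `link_mentions` is a hypothetical helper, and mentions are assumed to be numbered in text order.

```python
def link_mentions(n, score, threshold=0.5, strategy="closest"):
    """Cluster mentions 0..n-1 by linking each one to a single antecedent.

    closest: the nearest preceding mention whose score exceeds the threshold
    best:    the highest-scoring preceding mention above the threshold
    """
    cluster = list(range(n))  # every mention starts in its own cluster
    for j in range(1, n):
        # Candidate antecedents, scanned right to left from mention j.
        candidates = [i for i in range(j - 1, -1, -1) if score(i, j) > threshold]
        if not candidates:
            continue
        if strategy == "closest":
            antecedent = candidates[0]
        else:
            antecedent = max(candidates, key=lambda i: score(i, j))
        cluster[j] = cluster[antecedent]
    groups = {}
    for mention, c in enumerate(cluster):
        groups.setdefault(c, []).append(mention)
    return list(groups.values())


# Hypothetical coreference likelihoods for four mentions.
SCORES = {(0, 1): 0.3, (0, 2): 0.7, (1, 2): 0.2,
          (0, 3): 0.3, (1, 3): 0.8, (2, 3): 0.6}


def pair_score(i, j):
    return SCORES[(i, j)]


print(link_mentions(4, pair_score, strategy="closest"))
print(link_mentions(4, pair_score, strategy="best"))
```

Here mention 3 has two antecedents above the threshold: closest-first links it to mention 2 (score 0.6), while best-first links it to mention 1 (score 0.8), so the two strategies produce different clusterings from the same classifier output.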
Solution: From Local clustering to Global clustering
• Problem in the two-step procedure:
  • works in a greedy style without searching the space of all possible clusterings
• Luo et al. (2004)
  • Example: John Perry(1), of Weston Golf Club(2), announced his(3) resignation yesterday.
  • (Figure: search tree over mentions 1 to 3, built with a link model and a start model; nodes [1] 2* 3, [1,2] 3*, [1] [2] 3*; leaves [1,2,3], [1,2] [3], [1,3] [2], [1] [2,3], [1] [2] [3])
  • Heuristic search algorithm that finds the most probable clustering, i.e., at each step of the search process, only the most promising nodes in the tree are expanded.
  • Still works in a greedy style and may miss the optimal clustering
Solution: From Local clustering to Global clustering
• Ng (2005)
  • 54 coreference resolution systems (3 classification algorithms, 3 clustering algorithms, 3 instance creation methods and 2 feature sets)
  • global ranking model
  • rank the 54 candidate clusterings to get the best clustering
  • performance depends on the best clustering from one of the 54 systems
(Figure: system 1 ... system 54 produce clustering 1 ... clustering 54; the ranking model selects the best clustering)
Solution: From Supervised to Unsupervised
• classification step is supervised
• semi-supervised:
  • co-training (Muller et al., 2002)
  • self-training
  • EM
• unsupervised:
  • non-parametric Bayesian models based on Dirichlet processes (Haghighi and Klein, 2007)
  • integer linear programming (Denis and Baldridge, 2007)
  • Markov logic (Poon and Domingos, 2008)
Solution: graph-based clustering methodology
Solution: graph-based clustering methodology
• Nicolae and Nicolae (2006)
Solution: graph-based clustering methodology
• Minimum cut
  • The minimum cut is measured by the number of mentions that are correctly placed in their cluster.
  • Two criteria for correctness: average weight and maximum weight
(Figure: graph on mentions 1 to 5 with edge weights 0.1, 0.6, 0.2, 0.1, 0.5, 0.7, 0.5; three mentions marked correct (√) and two marked incorrect (x); score(cut) = 3)
Solution: graph-based clustering methodology
• BESTCUT Algorithm
  • Example: Mary(1) has a brother(2), John(3). The boy(4) is older than the girl(5).
  • Clustering: {Mary(1), the girl(5)} and {a brother(2), John(3), The boy(4)}
  • Recursive procedure
    • Find the best cut using the algorithm of Stoer and Wagner (1997)
    • Stop the cut? No: continue the procedure on the two subgraphs. Yes: form entities.
(Figure: graph on mentions 1 to 5 with edge weights 0.1, 0.6, 0.2, 0.1, 0.5, 0.7, 0.5)
Finding the Best Cut (Stoer and Wagner, 1997)
(Figure: four copies of the five-mention graph, each showing a candidate cut; scores shown: score(cut1) = 3, score(cut2) = 4, score(cut3) = 5, score(cut2) = 3.5; one cut is labeled "Best Cut")
Solution: graph-based clustering methodology
• Evaluation
Event Coreference Resolution
• Pioneering work in the MUC (Message Understanding Conference) evaluations in the 1990s
  • Humphreys et al., 1997 (ontology)
  • Bagga and Baldwin, 1998 (vector space model)
  • Events are based on scenarios, e.g., management succession, resignation, election, espionage.
• ACE evaluations define 8 fine-grained event types
• Recent work:
  • Chen et al., 2009a (agglomerative clustering)
  • Chen and Ji, 2009b (spectral graph-based clustering)
Event Coreference Resolution: agglomerative clustering (Chen et al., 2009a)
• Similar to the bell tree search algorithm of Luo et al. (2004), but using different notation
• A pairwise event coreference model using event-specific features (triggers, arguments, event attributes)
• Event attributes play an important role in distinguishing coreference from non-coreference
• The performance bottleneck comes from system-generated event mentions
Event Coreference Resolution: graph-based clustering methodology
• Chen and Ji (2009b)
Event Coreference Resolution: graph-based clustering methodology