
Chris Biemann University of Leipzig, NLP-Dept. Leipzig, Germany June 9, 2006

Chinese Whispers: an Efficient Graph Clustering Algorithm and its Application to Natural Language Processing Problems. TextGraphs 06, NYC, USA.


Presentation Transcript


  1. Chinese Whispers: an Efficient Graph Clustering Algorithm and its Application to Natural Language Processing Problems. Chris Biemann, University of Leipzig, NLP-Dept., Leipzig, Germany. June 9, 2006. TextGraphs 06, NYC, USA

  2. Outline • Introduction to Graph Clustering • Chinese Whispers Algorithm • Experiments with Synthetic Data • Application of CW to Language Separation, POS clustering and Word Sense Induction • Extensions

  3. Graph Clustering • Find groups of nodes in undirected, weighted graphs • Hierarchical Clustering vs. Flat Partitioning [Figure: example graph with edge weights 3 and 4]

  4. Desired outcomes? • Colors symbolise partitions [Figure: the same example graph, partitioned]

  5. Chinese Whispers Algorithm [Figure: example graph with nodes A (class L1), B (L4), C (L3), D (L2), E (L3), weighted edges and node degrees] Algorithm:
       initialize: forall vi in V: class(vi) = i;
       while changes:
         forall v in V, randomized order:
           class(v) = highest ranked class in neighborhood of v;
     • Nodes have a class and communicate it to their adjacent nodes
     • A node adopts the majority class in its neighbourhood (one of them, in case of ties)
     • Nodes are processed in random order for some iterations
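The pseudocode on this slide can be sketched in Python as follows. This is a minimal illustration, not the author's implementation: it assumes that the "highest ranked class" is the one with the largest sum of edge weights in the neighbourhood, caps the number of passes with a hypothetical `iterations` parameter (the slides note that the loop may not terminate), and lets a node keep its current class when that class is among the tied winners. The function name and the edge-triple input format are also assumptions.

```python
import random
from collections import defaultdict

def chinese_whispers(edges, iterations=20, seed=0):
    """Cluster an undirected weighted graph given as (u, v, weight) triples."""
    rng = random.Random(seed)
    # Symmetric adjacency map: node -> {neighbour: weight}.
    adj = defaultdict(dict)
    for u, v, w in edges:
        adj[u][v] = w
        adj[v][u] = w
    nodes = sorted(adj)
    # initialize: every node starts in its own class.
    label = {v: i for i, v in enumerate(nodes)}
    for _ in range(iterations):          # capped version of "while changes"
        changed = False
        rng.shuffle(nodes)               # randomized processing order
        for v in nodes:
            # Rank classes by summed edge weight over the neighbourhood (assumption).
            score = defaultdict(float)
            for nb, w in adj[v].items():
                score[label[nb]] += w
            best = max(score.values())
            winners = sorted(c for c, s in score.items() if s == best)
            if label[v] in winners:
                continue                 # assumption: keep current class on ties
            label[v] = rng.choice(winners)
            changed = True
        if not changed:
            break
    return label
```

For example, two triangles with heavy internal edges joined by one light edge end up in two classes, because the light bridge can never outvote the heavy edges inside a triangle:

```python
edges = [("a", "b", 5), ("b", "c", 5), ("a", "c", 5),
         ("d", "e", 5), ("e", "f", 5), ("d", "f", 5),
         ("c", "d", 1)]
labels = chinese_whispers(edges)
```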

  6. Example: CW-Partitioning in two steps

  7. Properties of CW. PRO: • Efficiency: CW is time-linear in the number of edges. The number of edges is bounded by n², with n = number of nodes, but real-world graphs are much sparser • Parameter-free: this includes the number of clusters. CON: • Non-deterministic: due to random-order processing and possible ties w.r.t. the majority • Convergence is not guaranteed: see tie example [figure not preserved]. However, the CONs are not severe for real-world data. CW is formally hard to analyse, so experiments are performed instead.

  8. Experiment: Bi-partite cliques, unweighted • Intuition: Bi-partite cliques should be split into two cliques • CW can split bi-partite cliques into two parts or leave them as a whole • Measure how often CW succeeds: the larger the graph, the safer the split, so CW is meant for large graphs
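A sketch of the scaffolding such an experiment needs: a graph generator and a function that classifies a clustering result. The slide does not define "bi-partite clique"; the code assumes two n-cliques in which node i of one clique has exactly one edge to node i of the other. The names `bipartite_clique` and `outcome` are hypothetical, and any clustering function that returns a node-to-class dictionary can be plugged in.

```python
from itertools import combinations

def bipartite_clique(n):
    """Two n-cliques, pairwise linked by one edge per node (assumed definition)."""
    left = [("L", i) for i in range(n)]
    right = [("R", i) for i in range(n)]
    # Unweighted graph: every edge gets weight 1.
    edges = [(u, v, 1) for side in (left, right) for u, v in combinations(side, 2)]
    edges += [(l, r, 1) for l, r in zip(left, right)]
    return left, right, edges

def outcome(labels, left, right):
    """Classify a clustering as 'split', 'whole' or 'other' (e.g. not yet converged)."""
    left_classes = {labels[v] for v in left}
    right_classes = {labels[v] for v in right}
    if len(left_classes) == 1 and len(right_classes) == 1:
        return "whole" if left_classes == right_classes else "split"
    return "other"
```

Running a clustering many times with different random seeds and counting the `"split"` outcomes per value of `n` would give the success rate the slide refers to.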

  9. Co-occurrences: A source for Graphs • The entirety of all significant co-occurrences is a co-occurrence graph G(V,E) with V: Vertices = Words; E: Edges (v1, v2, s) with v1, v2 words, s significance value • The co-occurrence graph is • weighted by significance (here: log-likelihood) • undirected • has the small-world property
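A minimal sketch of how such a graph could be built. It assumes sentence-level co-occurrence, Dunning's log-likelihood ratio on a 2x2 contingency table as the significance value, and a hypothetical `threshold` of 3.84 (the 5% critical value of the chi-square distribution with one degree of freedom); the slides do not state which window or threshold was used. Pairs that co-occur less often than expected by chance are dropped, which is also an assumption of this sketch.

```python
import math
from collections import Counter
from itertools import combinations

def log_likelihood(k11, k12, k21, k22):
    """Dunning's G^2 for a 2x2 contingency table of sentence counts."""
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    g2 = 0.0
    for observed, r, c in ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1)):
        if observed > 0:
            expected = rows[r] * cols[c] / total
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2

def cooccurrence_graph(sentences, threshold=3.84):
    """Return edges (v1, v2, s) for significantly co-occurring word pairs."""
    n = len(sentences)
    word_freq, pair_freq = Counter(), Counter()
    for sent in sentences:
        words = sorted(set(sent))        # count each word once per sentence
        word_freq.update(words)
        pair_freq.update(combinations(words, 2))
    edges = []
    for (a, b), k11 in pair_freq.items():
        k12 = word_freq[a] - k11         # sentences with a but not b
        k21 = word_freq[b] - k11         # sentences with b but not a
        k22 = n - k11 - k12 - k21        # sentences with neither
        attracted = k11 * n > word_freq[a] * word_freq[b]
        s = log_likelihood(k11, k12, k21, k22)
        if attracted and s >= threshold:
            edges.append((a, b, s))
    return edges
```

The resulting list of `(v1, v2, s)` triples matches the edge definition on the slide and can be passed directly to a weighted graph clustering routine.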

  10. Application: Language Separation • Cluster the co-occurrence graph of a multilingual corpus • Use the words of each class as the lexicon of a language identifier • Almost perfect performance
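One way the cluster lexicon could be used, sketched under the assumption that a sentence is assigned to the class that most of its known words belong to. The slides do not describe the actual language identifier; `identify_language` and the `word_cluster` mapping are hypothetical names.

```python
from collections import Counter

def identify_language(sentence, word_cluster):
    """Return the cluster (language) that most words of the sentence belong to."""
    # Words missing from the lexicon cast no vote.
    votes = Counter(word_cluster[w] for w in sentence if w in word_cluster)
    if not votes:
        return None                      # no known word: undecided
    return votes.most_common(1)[0][0]
```

Here `word_cluster` would be the word-to-class dictionary produced by clustering the multilingual co-occurrence graph.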

  11. Application: Acquisition of POS-classes • Distributional similarity: Words that co-occur significantly with the same neighbours should be of the same POS • Clustering the second-order NB-co-occurrence graph of the BNC (excluding the 2000 most frequent words)
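A second-order graph links words that share first-order co-occurrence partners. The sketch below assumes the edge weight is simply the number of shared partners and uses a hypothetical `min_shared` cut-off; it ignores any distinction between left and right neighbours, which the slides do not spell out.

```python
from collections import defaultdict
from itertools import combinations

def second_order_graph(first_order_edges, min_shared=2):
    """Link two words if they share at least min_shared first-order neighbours."""
    neighbours = defaultdict(set)
    for u, v, _ in first_order_edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    shared = defaultdict(int)
    # Every word is a shared neighbour for each pair of its own neighbours.
    for hub, nbs in neighbours.items():
        for a, b in combinations(sorted(nbs), 2):
            shared[(a, b)] += 1
    return [(a, b, n) for (a, b), n in shared.items() if n >= min_shared]
```

With first-order edges cat-the, dog-the, cat-a, dog-a and cat-runs, the words "cat" and "dog" share two neighbours and become connected, which is the distributional-similarity intuition from the slide.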

  12. Results: POS-clusters • In total: 282 clusters, of which 26 have more than 100 members • Clusters are syntacto-semantically motivated • Purity: 88%
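Purity is commonly computed as the fraction of items that carry the majority gold label of their cluster. The sketch below uses that common definition; whether the 88% on the slide was computed in exactly this way is an assumption, and the function and argument names are hypothetical.

```python
from collections import Counter, defaultdict

def purity(cluster_of, gold_of):
    """Fraction of items whose gold tag is the majority tag of their cluster."""
    tags_by_cluster = defaultdict(list)
    for item, cluster in cluster_of.items():
        if item in gold_of:              # skip items without a gold tag
            tags_by_cluster[cluster].append(gold_of[item])
    total = sum(len(tags) for tags in tags_by_cluster.values())
    if total == 0:
        return 0.0
    majority = sum(Counter(tags).most_common(1)[0][1]
                   for tags in tags_by_cluster.values())
    return majority / total
```

For instance, a cluster holding two nouns and one verb plus a second cluster holding one verb gives a purity of (2 + 1) / 4 = 0.75.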

  13. Application: Word Sense Induction • Co-occurrence graphs of ambiguous words can be partitioned [Dorow & Widdows 03]: leave out the focus word • Clusters contain context words for disambiguation
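The "leave out the focus word" step can be sketched as extracting the subgraph spanned by the focus word's neighbours, without the focus word itself. The function name and the edge-triple format are assumptions of this illustration.

```python
def neighbourhood_without_focus(edges, focus):
    """Edges among the co-occurrence partners of `focus`, excluding `focus`."""
    context = set()
    for u, v, _ in edges:
        if u == focus and v != focus:
            context.add(v)
        elif v == focus and u != focus:
            context.add(u)
    # Keep only edges whose two endpoints are both context words.
    return [(u, v, w) for u, v, w in edges if u in context and v in context]
```

Clustering the returned edges then groups the context words; each group is a candidate sense of the focus word.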

  14. Unsupervised WSI Evaluation Framework. Evaluation: for unambiguous words, merge their co-occurrence graphs and try to split the merged graph back into the original parts • retrieval precision (rP): similarity of the found sense with the gold standard sense • retrieval recall (rR): number of words that have been correctly assigned to the gold standard sense • precision (P): fraction of correctly found disambiguations • recall (R): fraction of correctly found senses. 45 test words of different POS and frequency bands.

  15. Results: WSI • No parameter for expected number of clusters • CW scores comparable to an algorithm specifically designed for WSI

  16. hip [figure not preserved]

  17. hip [figure not preserved]

  18. hip [figure not preserved]

  19. hip [figure not preserved]

  20. Conclusion • Very effective graph partitioning algorithm for weighted, undirected graphs • Possible to process really large graphs • Fuzzy partitioning and hierarchical clustering possible • Especially suited for small-world graphs (sparse adjacency matrix) • Useful in NLP applications such as Language Separation, POS clustering, Word Sense Induction. Download an open-source GUI implementation of Chinese Whispers in Java at http://wortschatz.informatik.uni-leipzig.de/~cbiemann/software/CW.html

  21. Questions? Thank you

  22. Experiment: Convergence • Weighted graphs converge much faster (fewer ties) • For weighted graphs, 15 iterations were enough to partition the 1.7M nodes / 56M edges co-occurrence graph of our main German corpus • Larger graphs result in less uncertainty

  23. Experiment: Small World Mixtures • CW can separate well if the merge rate is not too high • Different sizes of the original SWs do not pose a problem

  24. Experiment: Small World Mixtures • CW can separate well if the merge rate is not too high • Different sizes of the original SWs do not pose a problem

  25. Usages of hip • FIGHT: The punching hip , be it the leading hip of a front punch or the trailing hip of a reverse punch , must swivel forwards , so that your centre-line directly faces the opponent . • MUSIC: This hybrid mix of reggae and hip hop follows acid jazz , Belgian New Beat and acid swing the wholly forgettable contribution of Jive Bunny as the sound to set disco feet tapping . • DANCER: Sitting back and taking it all in is another former hip hop dancer , Moet Lo , who lost his Wall Street messenger job when his firm discovered his penchant for the five-finger discount at Polo stores • HOORAY: Ho , hey , ho hi , ho , hey , ho , hip hop hooray , funky , get down , a-boogie , get down . • MEDICINE: We treated orthopaedic screening as a distinct category because some neonatal deformations (such as congenital dislocation of the hip ) represent only a predisposition to congenital abnormality , and surgery is avoided by conservative treatment . • BODYPART-INJURY: I had a hip replacement operation on my left side , after which I immediately broke my right leg . • BODYPART-CLOTHING: At his hip he wore a pistol in an ancient leather holster .
