Collective Relational Clustering. Indrajit Bhattacharya, Assistant Professor, Department of CSA, Indian Institute of Science
Relational Data • Recent abundance of relational (‘non-iid’) data • Internet • Social networks • Citations in scientific literature • Biological networks • Telecommunication networks • Customer shopping patterns • … • Various applications • Web Mining • Online Advertising and Recommender Systems • Bioinformatics • Citation analysis • Epidemiology • Text Analysis • …
Clustering for Relational Data • A lot of research in Statistical Relational Learning over the last decade • Series of focused workshops in premier conferences • Confluence of different research areas • Recent focus on unsupervised learning from relational data • Regular papers in premier conferences • Recent book: Relational Data Clustering: Models, Algorithms, and Applications, Bo Long, Zhongfei Zhang, Philip S. Yu, CRC Press 2009
Traditional vs Relational Clustering • Traditional clustering focuses on ‘flat’ data • Cluster based on features of individual objects • Relational clustering additionally considers relations • Heterogeneous relations across objects of different types • Homogeneous relations across objects of the same type • Naïve solution: Flatten data, then cluster • Loss of relational and structural information • No influence propagation across relational chains • Cannot discover interaction patterns across clusters • Collective relational clustering looks to cluster different data objects jointly
Early Instances of Relational Clustering • Graph Partitioning Problem • Single-type homogeneous relational data • Co-clustering Problem • Bi-type heterogeneous relational data • General relational clustering considers multi-type data with heterogeneous and homogeneous relationships
Talk Outline • Introduction • Motivating Application: Entity Resolution over Heterogeneous Relational Data • The Relational Clustering Problem • Quick Survey of Relational Clustering Approaches • Probabilistic Model for Structured Relations • Probabilistic Model for Heterogeneous Relations • Future Directions
Application: Entity Resolution Web data on Stephen Johnson
Application: Entity Resolution • [Figure labels: Movie Director, Ind. Researcher, Photographer, Professor, Media Presenter, Administrator]
Application: Entity Resolution • Data contains references to real world entities • Structured entities (People, Products, Institutions,…) • Topics / Concepts (comp science, movies, politics, …) • Aim: Consolidate (cluster) according to entities • Entity Resolution: Map structured references to entities • Sense Disambiguation: Group words according to senses • Topic Discovery: Group words according to topics or concepts
Relationships for Entity Resolution • Each document or structured record is a (co-occurrence) relation between references to persons, places, organizations, concepts, etc. • [Figure labels: Movie Director, Photographer]
Relational Network Among Entities • [Figure: network of entities containing several distinct Stephen Johnson nodes. Other node labels: Univ of Greenwich, Bell Labs, Prog. Lang., HPC, Comp. Sc., Mark Cross, Jeffrey Ullman, Chris Walshaw, Alfred Aho, Photography, White House, EPA, Government, Entertainment, Ansel Adams, Cinema, Media, Music, George W. Bush, Direction, BBC, Leeds University, Peter Gabriel]
Using the Network for Clustering • Given the network, find the assignment of data items or references to these entities • Collective cluster assignment • Find a “nice” network of entities with regularities in the relational structure • Researchers collaborate with colleagues on similar topics • People send emails to colleagues and friends
Collective Cluster Assignment: Example • [Figure: references assigned to numbered clusters; cluster labels shown: 1, 2, 3, 4, 5, 11, 12, 13, 14, 15] • Person-name references shown: Stephen Johnson, S Johnson, SC Jonshon; Stephen Johnson, Steve Johnson, S Johnson, S P Johnson; Jeffrey Ullman, J. Ullman, J D Ullman; Alfred Aho, A Aho, A V Aho; Mark Cross, M Cross; Chris Walshaw, Chris Walsaw, C Walshaw • Institution references shown: Bell Labs, AT&T Bell; U. Greenwich, U. of GWich • Topic terms shown: Parallelization, Structured Mesh, Code generation; code generation, grammar, expression tree • Example text: “… To find a minimal match cost, dynamic programming, approach of [A Aho and S Johnson, 76], is used. …”
Regularity in a Cluster Network • References (shown under both Clustering 1 and Clustering 2): M. G. Everett, S. Johnson, S. Johnson, M. Everett, S. Johnson, A. Aho, Stephen C. Johnson, Alfred V. Aho • Cluster-cluster relation matrices over the clusters M, J1, A, J2 (left and right panels of the figure):
Left panel:
      M   J1  A   J2
M     1   1   0   0
J1    1   1   0   0
A     0   1   1   1
J2    0   0   1   1
Right panel:
      M   J1  A   J2
M     1   1   0   0
J1    1   1   1   0
A     0   0   1   1
J2    0   0   1   1
• Cl. 1 has better separation of attributes • Cl. 2 has fewer cluster-cluster relations
Collective Relational Clustering • Goal: Given relations among data items, assign the items to clusters such that the relational neighborhoods of clusters have regularities (in addition to attribute similarities within clusters) • Challenges: • Collective / joint clustering decisions over relational neighborhoods • Defining regularity in relational neighborhoods • Searching over relational networks
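One simple way to make "regularity in relational neighborhoods" concrete is to count the distinct cluster-cluster relations that a clustering induces. The sketch below does only that; the reference ids, the toy co-occurrences, and the function name `cluster_relations` are hypothetical and are not taken from the talk, and the talk's own regularity criterion may differ.

```python
from itertools import combinations

def cluster_relations(cooccurrences, assignment):
    """Return the set of distinct cluster pairs linked by at least one co-occurrence."""
    links = set()
    for refs in cooccurrences:
        # Clusters touched by this co-occurrence, in a canonical order.
        clusters = sorted({assignment[r] for r in refs})
        links.update(combinations(clusters, 2))
    return links

# Hypothetical toy data: four two-author papers, eight references r1..r8.
# r1, r4 are Everett variants; r2, r3, r5, r7 are Johnson variants; r6, r8 are Aho variants.
cooccurrences = [("r1", "r2"), ("r3", "r4"), ("r5", "r6"), ("r7", "r8")]

# Clustering by name string alone: all "S. Johnson" strings together.
by_attributes = {"r1": "M", "r4": "M", "r2": "J1", "r3": "J1", "r5": "J1",
                 "r7": "J2", "r6": "A", "r8": "A"}
# Clustering that also uses co-authors: the Johnson who writes with Aho is one entity.
by_relations = {"r1": "M", "r4": "M", "r2": "J1", "r3": "J1", "r5": "J2",
                "r7": "J2", "r6": "A", "r8": "A"}

print(len(cluster_relations(cooccurrences, by_attributes)))
print(len(cluster_relations(cooccurrences, by_relations)))
```

On this toy input the relation-aware clustering yields fewer cluster-cluster links, which is the kind of regularity a collective method can prefer when attribute similarity alone is ambiguous.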
Relational Clustering: Different Approaches • Greedy Agglomerative Algorithms • Bhattacharya et al ’04, Dong et al ’05 • Information Theoretic Methods • Mutual Information (Dhillon et al ’03) • Information Bottleneck (Slonim & Tishby ’03) • Bregman Divergence (Merugu et al ’04, Merugu et al ’06) • Matrix Factorization Techniques • SVD, BVD (Long et al ’05, Long et al ’06) • Graph Cuts • Min Cut, Ratio Cut, Normalized Cut (Dhillon ’01)
Relational Clustering: Probabilistic Approaches • Models for Co-clustering • Taskar et al ’01; Hofmann et al ’98 • Infinite Relational Model (Kemp et al ’06) • Mixed Membership Relational Clustering model (Long et al ’06) • Topic Model Extensions • Correlated Topic Models (Blei et al ’06) • Grouped Cluster Model (Bhattacharya et al ’06) • Gaussian Process Topic Models (Agovic & Banerjee ’10) • Markov Logic Network (Kok & Domingos ’08) • Model for Mixed Relational Data (Bhattacharya et al ’08)
Modeling Groups of Entities • Parallel Processing Research Group: Stephen P Johnson, Chris Walshaw, Kevin McManus, Mark Cross, Martin Everett • Bell Labs Group: Stephen C Johnson, Alfred V Aho, Ravi Sethi, Jeffrey D Ullman • P1: C. Walshaw, M. Cross, M. G. Everett, S. Johnson • P2: C. Walshaw, M. Cross, M. G. Everett, S. Johnson, K. McManus • P3: C. Walshaw, M. Cross, M. G. Everett • P4: Alfred V. Aho, Stephen C. Johnson, Jeffrey D. Ullman • P5: A. Aho, S. Johnson, J. Ullman • P6: A. Aho, R. Sethi, J. Ullman
LDA-Group Model • Entity label a and group label z for each reference r • θ: ‘mixture’ of groups for each co-occurrence • Φ_z: multinomial for choosing entity a for each group z • V_a: multinomial for choosing reference r from entity a • Dirichlet priors with α and β • [Plate diagram: nodes α, θ, z, a, r, β, Φ, V; plates P, R, T, A; annotations “generate document” and “generate names”] • Example labels: Group: Bell Labs; Entity: Stephen P Johnson; Reference: S. Johnson
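Read as a generative process, the bullets of the LDA-Group model can be written out as follows. This is a sketch in standard LDA-style notation; the subscripts p (co-occurrence) and r (reference) and the choice to attach α to θ and β to Φ are assumptions made for readability, since the slide only states that Dirichlet priors with α and β are used.

```latex
\begin{align*}
\theta_p &\sim \mathrm{Dirichlet}(\alpha)
  && \text{mixture of groups for co-occurrence } p \\
\Phi_z &\sim \mathrm{Dirichlet}(\beta)
  && \text{distribution over entities for group } z \\
z_r \mid \theta_p &\sim \mathrm{Multinomial}(\theta_p)
  && \text{group label for reference } r \text{ in } p \\
a_r \mid z_r &\sim \mathrm{Multinomial}(\Phi_{z_r})
  && \text{entity label for reference } r \\
r \mid a_r &\sim \mathrm{Multinomial}(V_{a_r})
  && \text{observed reference generated from its entity}
\end{align*}
```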
Inference Using Gibbs Sampling • Approximate inference with Gibbs sampling • Find the conditional distribution for any reference given the current groups and entities of all other references • Sample from the conditional distribution • Repeat over all references until convergence • Applicable when the numbers of groups and entities are known
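The loop described above can be sketched in a deliberately simplified setting. The code below resamples only the group label z of each reference and treats the entity labels as fixed and known, which reduces the conditional to the standard collapsed Gibbs update for LDA; the talk's sampler also resamples entity labels, which is not shown. The function name `gibbs_sweep`, the toy data, and the hyperparameter values are hypothetical.

```python
import random

def gibbs_sweep(docs, z, n_groups, n_entities, alpha, beta, rng):
    """One Gibbs sweep over group labels z, with entity labels held fixed.

    docs[p] lists the entity ids of the references in co-occurrence p;
    z[p][i] is the current group label of the i-th reference in p.
    """
    # Count tables implied by the current assignment.
    doc_group = [[0] * n_groups for _ in docs]
    group_entity = [[0] * n_entities for _ in range(n_groups)]
    group_total = [0] * n_groups
    for p, doc in enumerate(docs):
        for i, a in enumerate(doc):
            t = z[p][i]
            doc_group[p][t] += 1
            group_entity[t][a] += 1
            group_total[t] += 1

    for p, doc in enumerate(docs):
        for i, a in enumerate(doc):
            # Remove this reference from the counts.
            t = z[p][i]
            doc_group[p][t] -= 1
            group_entity[t][a] -= 1
            group_total[t] -= 1
            # Conditional over groups given all other assignments.
            weights = [
                (doc_group[p][g] + alpha)
                * (group_entity[g][a] + beta)
                / (group_total[g] + n_entities * beta)
                for g in range(n_groups)
            ]
            t = rng.choices(range(n_groups), weights=weights)[0]
            # Add the reference back under its new label.
            z[p][i] = t
            doc_group[p][t] += 1
            group_entity[t][a] += 1
            group_total[t] += 1
    return z

# Hypothetical toy input: six co-occurrences over nine entity ids.
docs = [[0, 1, 2, 3], [0, 1, 2, 3, 4], [0, 1, 2], [5, 6, 7], [5, 6, 7], [5, 8, 7]]
rng = random.Random(0)
z = [[rng.randrange(2) for _ in doc] for doc in docs]
for _ in range(50):
    gibbs_sweep(docs, z, n_groups=2, n_entities=9, alpha=0.5, beta=0.1, rng=rng)
print(z)
```

A fixed number of sweeps stands in for a convergence check here; in practice one would monitor the likelihood or the stability of the assignments.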
Nonparametric Entity Resolution • Number of entities is not a parameter • Allow the number of entities to grow with the data • For each reference, choose any existing entity or a new entity a_new • The hidden name for a new entity equally prefers all observed references
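The choice between an existing entity and a new one can be sketched as below. The weighting is an assumption: existing entities are weighted by their current size times the probability they assign to the reference, in the style of a Chinese restaurant process, and the new entity gets a concentration parameter `gamma` spread uniformly over the observed reference strings (the "equally prefers all observed references" bullet). The slide does not specify the prior, so `gamma`, the data layout, and `entity_choice_probs` are all hypothetical.

```python
def entity_choice_probs(ref, entities, observed_refs, gamma=1.0):
    """Probability of assigning `ref` to each existing entity or to a new one.

    entities maps entity id -> (size, ref_probs), where ref_probs maps a
    reference string to the probability that the entity generates it.
    """
    weights = {}
    for eid, (size, ref_probs) in entities.items():
        weights[eid] = size * ref_probs.get(ref, 0.0)
    # A new entity has no preferred name yet: uniform over observed references.
    weights["new"] = gamma / len(observed_refs)
    total = sum(weights.values())
    return {eid: w / total for eid, w in weights.items()}

# Hypothetical current state: two entities and four distinct observed reference strings.
observed_refs = {"S Johnson", "Stephen C Johnson", "A Aho", "M Cross"}
entities = {
    "e1": (3, {"S Johnson": 0.9, "Stephen C Johnson": 0.1}),
    "e2": (2, {"A Aho": 1.0}),
}
probs = entity_choice_probs("S Johnson", entities, observed_refs)
print(probs)
```

Because the new-entity weight never vanishes, the number of entities can grow as more references arrive.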
Faster Inference: Split-Merge Sampling • Naïve strategy reassigns data items individually • Alternative: allow clusters to merge or split • For cluster a_i, find conditional probabilities for: • Merging with an existing cluster a_j • Splitting back into the last merged clusters • Remaining unchanged • Sample the next state for a_i from this distribution • O(n g + e) time per iteration, compared to O(n g + n e) for the naïve strategy
ER: Evaluation Datasets • CiteSeer • 1,504 citations to machine learning papers (Lawrence et al.) • 2,892 references to 1,165 author entities • arXiv • 29,555 publications from High Energy Physics (KDD Cup’03) • 58,515 refs to 9,200 authors • Elsevier BioBase • 156,156 Biology papers (IBM KDD Challenge ’05) • 831,991 author refs • Keywords, topic classifications, language, country and affiliation of corresponding author, etc
ER: Experimental Evaluation • LDA-ER outperforms the baselines on all datasets • A: assigns the same entity to references whose attribute similarity is over a threshold • A*: transitive closure over the decisions in A • Baselines require the threshold as a parameter • Baselines are reported at their best achievable performance over all thresholds • LDA-ER does not require a similarity threshold
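The two baselines can be sketched as follows: A makes pairwise decisions by thresholding attribute similarity, and A* takes the transitive closure of those decisions with a union-find structure. The token-level Jaccard similarity, the threshold value, and the function names are hypothetical stand-ins; the evaluation's actual similarity measure is not given on the slide.

```python
from itertools import combinations

def jaccard(a, b):
    """Hypothetical attribute similarity: Jaccard overlap of lowercased name tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def baseline_a(refs, similarity, threshold):
    """Baseline A: pairs of references whose similarity exceeds the threshold."""
    return [(i, j) for i, j in combinations(range(len(refs)), 2)
            if similarity(refs[i], refs[j]) > threshold]

def baseline_a_star(n, matched_pairs):
    """Baseline A*: transitive closure of pairwise matches; returns one label per reference."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in matched_pairs:
        parent[find(i)] = find(j)
    return [find(x) for x in range(n)]

refs = ["Stephen Johnson", "S Johnson", "Stephen C Johnson", "A Aho"]
pairs = baseline_a(refs, jaccard, threshold=0.3)
labels = baseline_a_star(len(refs), pairs)
print(pairs)
print(labels)
```

In this toy run "S Johnson" and "Stephen C Johnson" are not matched directly by A, but A* places them in the same cluster because both match "Stephen Johnson". The outcome depends entirely on the threshold, which is the parameter the slide notes LDA-ER avoids.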
ER: Trends in Semi-Synthetic Data Bigger improvement with • bigger % of ambiguous refs • more refs per co-occurrence • more neighbors per entity
Entity Resolution over a Document Collection • Example reviews: • “When it comes to create a universe George Lucas is undisputed leader. Harrison Ford has done justice and special effects are superb.” • “Harrison Ford is a resourceful person who stay out of reach to the marshal. David Toohy has written some interesting plots and chases” • “Harrison Ford the adventurer is it in yet another quest. To find his father who is in search of the Holy Grail. George Lucas has done a wonderful job.” • “Lucas script seemed funny enough. It was a fairly good movie with couple of laughs. There was not much story but Ford was good.” • In a document collection, which names refer to the same entities?
Jointly Modeling the Textual Content • Example reviews: • “When it comes to create a universe George Lucas is undisputed leader. Harrison Ford has done justice and special effects are superb.” • “Harrison Ford is a resourceful person who stay out of reach to the marshal. David Toohy has written some interesting plots and chases” • “Harrison Ford the adventurer is it in yet another quest. To find his father who is in search of the Holy Grail. George Lucas has done a wonderful job.” • “Lucas script seemed funny enough. It was a fairly good movie with couple of laughs. There was not much story but Ford was good.” • Words are indicative of the concept entities • Concept entities are related to person entities
Relational Clustering Over Structured and Unstructured Data • Document words belong to two categories • References to structured entities • References to (unstructured) concept entities • Collectively determine clusters for both types of entities • Relational patterns over two types of entities • Simplifications for learning • Observed domain of entities with structured attributes • Observed relationships between domain entities and categories for constructing relational neighborhoods
Generative Model for Documents from Structured Entities • Generate N reviews one by one • First choose a genre, say Action • Choose an Action movie, say Indiana Jones • Generate n mentions for the movie • Choose a movie attribute, say Actor • Get the attribute value, say Harrison Ford • Generate a mention for the attribute value • Harrison Ford → Ford • Generate m Action words • adventurer, quest, justice … • P(t): Prior over genres • P(e | t): Movies for genre • P(w | t): Words for genre • P(c): Prior over movie attributes • [Plate diagram over the variables t, e, w, c, a with plate sizes m, n, N]
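The generative steps listed on this slide can be sketched as a small sampler. All probability tables, the two-column movie table, and the mention-noise rule (use either the full attribute value or only its last token) are hypothetical values chosen for illustration; only the order of the sampling steps follows the slide.

```python
import random

def draw(dist, rng):
    """Sample a key from a dict that maps outcomes to probabilities."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]

def generate_review(model, n_mentions, m_words, rng):
    t = draw(model["p_t"], rng)                    # genre ~ P(t)
    e = draw(model["p_e_given_t"][t], rng)         # movie ~ P(e | t)
    mentions = []
    for _ in range(n_mentions):
        c = draw(model["p_c"], rng)                # attribute ~ P(c)
        value = model["table"][e][c]               # attribute value from the movie table
        # Assumed noise model: full value or last token only.
        mentions.append(rng.choice([value, value.split()[-1]]))
    words = [draw(model["p_w_given_t"][t], rng) for _ in range(m_words)]  # words ~ P(w | t)
    return {"genre": t, "movie": e, "mentions": mentions, "words": words}

# Hypothetical model parameters.
model = {
    "p_t": {"Action": 0.5, "Comedy": 0.5},
    "p_e_given_t": {
        "Action": {"Indiana Jones": 0.5, "Fugitive": 0.5},
        "Comedy": {"American Graffiti": 1.0},
    },
    "p_c": {"Actor": 0.6, "Writer": 0.4},
    "table": {
        "Indiana Jones": {"Actor": "Harrison Ford", "Writer": "George Lucas"},
        "Fugitive": {"Actor": "Harrison Ford", "Writer": "David Twohy"},
        "American Graffiti": {"Actor": "Harrison Ford", "Writer": "George Lucas"},
    },
    "p_w_given_t": {
        "Action": {"adventurer": 0.4, "quest": 0.3, "justice": 0.3},
        "Comedy": {"funny": 0.5, "laughs": 0.5},
    },
}

review = generate_review(model, n_mentions=3, m_words=5, rng=random.Random(0))
print(review)
```

Inference runs this process in reverse: given the mentions and words of a review, it looks for the genre and movie that most plausibly generated them.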
Entity Identification: Evaluation • Movie Reviews • 12,500 reviews: First 10 reviews for top 50 movies for 25 genres • Structured Movie Database from IMDB • 26,250 movies: Top 1250 movies from 25 genres + 25,000 others • Movie table with 7 columns, but no movie name column • Genre + Top 2 actors, actresses, directors, writers • Entity Identification Baseline • Aggregate similarity over all mentions to score entity for doc • Does not use unstructured words in document • Document Classification Baseline • SVM-Light with default parameters • Uses all words in the document, including structured mentions
Ent-Id: Experimental Results on IMDB • Baseline catches up with the joint model only when 35% of the docs are provided for training • Improvement in ent-id accuracy • Significant drop in entropy over entity choices
Ent-Id: Results on Semi-Synthetic Data • Ent-Id improves from 38% to 60% for medium overlap and to 70% when words clearly indicate genre • 80% training data for the baseline, none for the joint model (JM) • Joint model outperforms baseline for large overlap between genres
Future Directions • Handling uncertain relations • Coupling with information extraction • Modeling the cluster network • Regularization for networks • Scalable inference mechanisms • Incorporating domain knowledge and user interaction • Semi-supervision • Active learning
References
• A Agovic and A Banerjee, Gaussian Process Topic Models, UAI 2010
• S Kok and P Domingos, Extracting Semantic Networks from Text via Relational Clustering, ECML 2008
• I Bhattacharya, S Godbole, and S Joshi, Structured Entity Identification and Document Categorization: Two Tasks with One Joint Model, SIGKDD 2008
• I Bhattacharya and L Getoor, Collective Entity Resolution in Relational Data, ACM-TKDD, March 2007
• A Banerjee, S Basu, and S Merugu, Multi-Way Clustering on Relation Graphs, SIAM SDM 2007
• B Long, M Zhang, and P S Yu, A Probabilistic Framework for Relational Clustering, SIGKDD 2007
• D Zhou, J Huang, and B Schoelkopf, Learning with Hypergraphs: Clustering, Classification, and Embedding, NIPS 2007
• B Long, M Zhang, X Wu, and P S Yu, Spectral Clustering for Multi-type Relational Data, ICML 2006
• I Bhattacharya and L Getoor, A Latent Dirichlet Model for Unsupervised Entity Resolution, SIAM SDM 2006
• X Dong, A Halevy, and J Madhavan, Reference Reconciliation in Complex Information Spaces, SIGMOD 2005
• I Bhattacharya and L Getoor, Iterative Record Linkage for Cleaning and Integration, SIGMOD-DMKD 2004
• B Taskar, E Segal, and D Koller, Probabilistic Classification and Clustering in Relational Data, IJCAI 2001
Entity Resolution From Structured Relations
• P1: “JOSTLE: Partitioning of Unstructured Meshes for Massively Parallel Machines”, C. Walshaw, M. Cross, M. G. Everett, S. Johnson
• P2: “Partitioning and Mapping of Unstructured Meshes to Parallel Machine Topologies”, C. Walshaw, M. Cross, M. G. Everett, S. Johnson, K. McManus
• P3: “Dynamic Mesh Partitioning: A Unified Optimisation and Load-Balancing Algorithm”, C. Walshaw, M. Cross, M. G. Everett
• P4: “Code Generation for Machines with Multiregister Operations”, Alfred V. Aho, Stephen C. Johnson, Jeffrey D. Ullman
• P5: “Deterministic Parsing of Ambiguous Grammars”, A. Aho, S. Johnson, J. Ullman
• P6: “Compilers: Principles, Techniques, and Tools”, A. Aho, R. Sethi, J. Ullman
• [Figure: two entity groups. Univ of Greenwich / HPC: Stephen Johnson, Mark Cross, Chris Walshaw. Bell Labs / Prog. Lang.: Stephen Johnson, Jeffrey Ullman, Alfred Aho]
LDA-ER Generative Process: Illustration • For each paper p: • Choose θ_p • For each author: • Sample z from θ_p • Sample a from Φ_z • Sample r from V_a • Example, paper P5: θ_P5 = [p(G1) = 0.1, p(G2) = 0.9] • z = G2, G2, G2 • a = Aho, U, J2 (each drawn from Φ_G2) • r = A. Aho, J. Ullman, S. Johnson (drawn from V_A, V_U, V_J2) • [Figure: Φ_G1 and Φ_G2 over the entities Walshaw, Johnson1, McManus, Cross, Everett, Ullman, Aho, Sethi, Johnson2, with probabilities 0.3, 0.2, 0.3, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2 (order as extracted); V_J1 = Stephen P Johnson, with probabilities 0.04, 0.04, 0.90 over the references S C Johnson, Stephen C Johnson, S Johnson]
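The per-paper loop on this slide can be sketched as a sampler. θ_P5 is taken from the slide; everything else is an assumption made for illustration: the pairing of the Φ probabilities with individual entities (uniform 0.2 over the five G1 entities, and 0.3, 0.2, 0.3, 0.2 over Ullman, Aho, Sethi, Johnson2), the name variants, and the uniform reference distributions standing in for V_a.

```python
import random

def draw(dist, rng):
    """Sample a key from a dict that maps outcomes to probabilities."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]

def generate_paper(theta_p, phi, variants, n_authors, rng):
    """Return one (group, entity, reference) triple per author."""
    authors = []
    for _ in range(n_authors):
        z = draw(theta_p, rng)           # sample group z from theta_p
        a = draw(phi[z], rng)            # sample entity a from Phi_z
        r = rng.choice(variants[a])      # sample reference r from V_a (uniform here, an assumption)
        authors.append((z, a, r))
    return authors

theta_p5 = {"G1": 0.1, "G2": 0.9}  # from the slide
# Assumed pairing of probabilities with entities.
phi = {
    "G1": {"Walshaw": 0.2, "Johnson1": 0.2, "McManus": 0.2, "Cross": 0.2, "Everett": 0.2},
    "G2": {"Ullman": 0.3, "Aho": 0.2, "Sethi": 0.3, "Johnson2": 0.2},
}
# Hypothetical name variants per entity.
variants = {
    "Walshaw": ["C. Walshaw", "Chris Walshaw"],
    "Johnson1": ["S. Johnson", "Stephen P Johnson"],
    "McManus": ["K. McManus", "Kevin McManus"],
    "Cross": ["M. Cross", "Mark Cross"],
    "Everett": ["M. G. Everett", "M. Everett"],
    "Ullman": ["J. Ullman", "Jeffrey D. Ullman"],
    "Aho": ["A. Aho", "Alfred V. Aho"],
    "Sethi": ["R. Sethi", "Ravi Sethi"],
    "Johnson2": ["S. Johnson", "Stephen C. Johnson"],
}

paper = generate_paper(theta_p5, phi, variants, n_authors=3, rng=random.Random(0))
print(paper)
```

Note that "S. Johnson" can be generated from either Johnson entity; which one is more likely for a given paper depends on the group drawn from θ_p, and this is how co-authors inform the resolution.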
Generating References from Entities • Entities are not directly observed • Hidden attribute for each entity • Similarity measure for pairs of attributes • A distribution over attributes for each entity • [Figure: distribution for the entity Stephen C Johnson over the references Alfred Aho, M. Cross, S C Johnson, Stephen C Johnson, S Johnson; values shown: 0.2, 0.6, 0.2, 0.0, 0.0 (order as extracted)]
ER: Performance for Specific Names Significantly larger improvements for ‘ambiguous names’
Simplifying the Problem: Entity Identification • Assume a database of entities is available • IMDB movie database • DBLP, PubMed paper databases • Customer databases in companies
Entity Identification: Still Difficult • Example reviews: • “When it comes to create a universe George Lucas is undisputed leader. Harrison Ford has done justice and special effects are superb.” • “Harrison Ford is a resourceful person who stay out of reach to the marshal. David Toohy has written some interesting plots and chases” • “Harrison Ford the adventurer is it in yet another quest. To find his father who is in search of the Holy Grail. George Lucas has done a wonderful job.” • “Lucas script seemed funny enough. It was a fairly good movie with couple of laughs. There was not much story but Ford was good.” • Candidate movies in the database: • Fugitive: Harrison Ford, David Twohy • Indiana Jones and the Last Crusade: Harrison Ford, George Lucas • Star Wars: Return of the Jedi: Harrison Ford, George Lucas • American Graffiti: Harrison Ford, George Lucas • Not enough information to disambiguate • Noise in entity mentions
The Intuition • Categorization and Entity Identification help each other • Classifier predicts additional attributes from document for use in entity identification • Classifiers for Genre, Rating, Country of the movie … • Entity identification creates labeled data for training the classifier • Reviews tagged with movies labeled with Genre, Rating, etc
Problem Formulation • Entity table: entities E, columns C, type column T • Unobserved central entity for each document • Example document: “Harrison Ford is a resourceful person who stay out of reach to the marshal. David Toohy has written some interesting plots and chases” • Structured mentions derived from column values • Unstructured words determined by type value • Problem: Find the central entity for each document and categorize the documents according to type values
Formalizing the Intuition • Traditional document categorization only considers words as evidence • Traditional entity identification only considers structured mentions as evidence • Mentions suggest entities, and type values relevant for those entities get priority • Here, words suggest type values, and entities relevant for those types get priority