AutoPart: Parameter-Free Graph Partitioning and Outlier Detection Deepayan Chakrabarti (deepay@cs.cmu.edu)
Problem Definition
Group people in a social network, or species in a food web, or proteins in protein interaction graphs, and so on. (Figure: a people × people adjacency matrix, rearranged into people groups.)
Reminder: a graph with N nodes and E directed edges, viewed as an N × N binary adjacency matrix (people × people).
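To make this concrete, here is a minimal sketch (not from the slides) of the assumed input: N nodes and E directed edges stored as an N × N binary adjacency matrix with one 1 per edge. The node count and edge list below are made up for illustration.

```python
import numpy as np

# Hypothetical toy input: N nodes, E directed edges as (source, target) pairs.
N = 5
edges = [(0, 1), (1, 2), (2, 0), (3, 4)]

# N x N binary adjacency matrix: one "dot" (a 1) per directed edge.
A = np.zeros((N, N), dtype=np.uint8)
for src, dst in edges:
    A[src, dst] = 1
```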
Problem Definition
• Goals:
  • [#1] Find groups (of people, species, proteins, etc.)
  • [#2] Find outlier edges (“bridges”)
  • [#3] Compute inter-group “distances” (how similar are two groups of proteins?)
Problem Definition
• Desired properties:
  • Fully automatic (estimate the number of groups)
  • Scalable
  • Allow incremental updates
Related Work
• Graph partitioning: METIS (Karypis+/1998), spectral partitioning (Ng+/2001). Require a measure of imbalance between clusters, or the number of partitions, as input.
• Clustering techniques: k-means and variants (Pelleg+/2000, Hamerly+/2003), information-theoretic co-clustering (Dhillon+/2003). Treat rows and columns separately, or are not fully automatic.
• LSI (Deerwester+/1990). Requires choosing the number of “concepts”.
Outline
• Problem Definition
• Related Work
• Finding clusters in graphs
  • What is a good clustering?
  • How can we find such a clustering?
• Outliers and inter-group distances
• Experiments
• Conclusions
What is a “good” clustering?
• Similar nodes are grouped together
• As few groups as necessary
The rearranged adjacency matrix then consists of a few homogeneous blocks, and a good clustering implies good compression.
Main Idea
Good compression implies good clustering. Rearrange the binary adjacency matrix into node-group × node-group blocks and judge a grouping by its total encoding cost:
Total encoding cost = Σi (ni1 + ni0) · H(pi1)  [code cost]  +  Σi cost of describing ni1, ni0 and the groups  [description cost]
where block i contains ni1 ones and ni0 zeros, and pi1 = ni1 / (ni1 + ni0).
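A hedged sketch of this objective in Python follows. It assumes a dense 0/1 NumPy array `A` and a label array `labels` mapping each node to one of `k` groups; the description-cost bookkeeping here (a few log terms for N, k, and the per-block counts) is only an approximation of the paper's exact formula.

```python
import numpy as np

def binary_entropy(p):
    """H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))

def encoding_cost(A, labels, k):
    """Approximate total cost (description + code) of a k-way node grouping."""
    N = A.shape[0]
    groups = [np.flatnonzero(labels == g) for g in range(k)]
    description = np.log2(N) + np.log2(k)         # rough cost of stating N and k
    code = 0.0
    for gi in groups:
        for gj in groups:
            block = A[np.ix_(gi, gj)]
            n = block.size                        # ni1 + ni0 for this block
            if n == 0:
                continue
            ones = block.sum()                    # ni1
            description += np.log2(n + 1)         # cost of stating ni1
            code += n * binary_entropy(ones / n)  # (ni1 + ni0) * H(pi1)
    return description + code
```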
Examples
• One node group: low description cost, but high code cost (the single block mixes all the 0s and 1s, so its entropy term is large).
• n node groups (one per node): low code cost (every block is trivially homogeneous), but high description cost.
In both cases, total encoding cost = description cost + code cost, as defined above.
What is a “good” clustering, revisited
The clustered matrix wins because both terms are low: a few homogeneous blocks keep the code cost near zero while adding only a small description cost.
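A quick toy check of this trade-off, reusing the hypothetical `encoding_cost` sketch above: on a clean two-block graph, a single group forces a high code cost, while the right two-way grouping makes every block homogeneous.

```python
import numpy as np

# Two perfectly dense communities of 20 nodes each, no cross edges.
A = np.zeros((40, 40), dtype=np.uint8)
A[:20, :20] = 1
A[20:, 20:] = 1

one_group  = np.zeros(40, dtype=int)           # k = 1: the single block is a 50/50 mix
two_groups = np.array([0] * 20 + [1] * 20)     # k = 2: every block is all-0 or all-1

print(encoding_cost(A, one_group, 1))    # large (code cost around 1600 bits)
print(encoding_cost(A, two_groups, 2))   # small (code cost near 0 bits)
```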
Algorithms: running example, an adjacency matrix partitioned into k = 5 node groups.
Algorithms: start with the initial matrix, find good groups for a fixed k by lowering the encoding cost, then choose better values for k; the two steps alternate until the final grouping is reached.
Fixed number of groups k (Reassign): for each node, reassign it to the group which minimizes the code cost.
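A hedged sketch of this step: with k fixed, each node is moved to the group that minimizes the code-cost term Σi (ni1 + ni0) · H(pi1). It is deliberately naive, recomputing the full code cost for every candidate group; the actual algorithm would only update the blocks the node touches.

```python
import numpy as np

def code_cost(A, labels, k):
    """Sum over blocks of (ni1 + ni0) * H(pi1)."""
    cost = 0.0
    groups = [np.flatnonzero(labels == g) for g in range(k)]
    for gi in groups:
        for gj in groups:
            block = A[np.ix_(gi, gj)]
            n = block.size
            if n == 0:
                continue
            p = block.sum() / n
            if 0.0 < p < 1.0:
                cost += n * -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))
    return cost

def reassign(A, labels, k):
    """One Reassign pass: move each node to its code-cost-minimizing group."""
    labels = labels.copy()
    for v in range(A.shape[0]):
        costs = []
        for g in range(k):
            trial = labels.copy()
            trial[v] = g
            costs.append(code_cost(A, trial, k))
        labels[v] = int(np.argmin(costs))
    return labels
```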
Choosing k (Split):
• Find the group R with the maximum entropy per node
• Choose the nodes in R whose removal reduces the entropy per node in R
• Send these nodes to the new group, and set k = k + 1
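A hedged sketch of the Split step. The "entropy per node" of a group is taken here to be the code cost of its rows divided by the group size; that exact choice is an assumption, and the paper's bookkeeping may differ in detail.

```python
import numpy as np

def H(p):
    return 0.0 if p <= 0.0 or p >= 1.0 else -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))

def entropy_per_node(A, labels, k, rows):
    """Code cost of the given rows (under the current grouping), per row."""
    if len(rows) == 0:
        return 0.0
    total = 0.0
    for g in range(k):
        cols = np.flatnonzero(labels == g)
        block = A[np.ix_(rows, cols)]
        if block.size:
            total += block.size * H(block.sum() / block.size)
    return total / len(rows)

def split(A, labels, k):
    """One Split: open group k and move the misfit nodes of the worst group into it."""
    per_node = [entropy_per_node(A, labels, k, np.flatnonzero(labels == g)) for g in range(k)]
    R = int(np.argmax(per_node))                      # group with max entropy per node
    members = np.flatnonzero(labels == R)
    new_labels = labels.copy()
    for v in members:
        rest = members[members != v]
        if entropy_per_node(A, labels, k, rest) < per_node[R]:
            new_labels[v] = k                         # send v to the new group
    return new_labels, k + 1
```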
Algorithms: in the full loop, the inner step (Reassign) finds good groups for a fixed k by lowering the encoding cost, and the outer step (Splits) chooses better values for k; starting from the initial matrix, the two alternate until the final grouping.
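A minimal sketch of this outer search, built on the hypothetical `encoding_cost`, `reassign`, and `split` helpers from the earlier sketches: start with one group, alternate Reassign passes (fixed k) with Splits (k becomes k + 1), and stop once the total encoding cost no longer improves.

```python
import numpy as np

def autopart_like(A, max_splits=20, max_passes=20):
    """Greedy search over groupings; returns (labels, k)."""
    N = A.shape[0]
    labels, k = np.zeros(N, dtype=int), 1
    best_cost = encoding_cost(A, labels, k)
    for _ in range(max_splits):
        cand_labels, cand_k = split(A, labels, k)     # try one more group
        for _ in range(max_passes):                   # inner loop: fixed k
            new_labels = reassign(A, cand_labels, cand_k)
            if np.array_equal(new_labels, cand_labels):
                break
            cand_labels = new_labels
        cost = encoding_cost(A, cand_labels, cand_k)
        if cost >= best_cost:                         # no improvement: keep previous k
            break
        labels, k, best_cost = cand_labels, cand_k, cost
    return labels, k
```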
Algorithms
• Properties:
  • Fully automatic: the number of groups is found automatically
  • Scalable: O(E) time
  • Allow incremental updates: reassign a new node/edge to the group with the least cost, and continue…
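A hedged sketch of the incremental-update idea in the last bullet: when a new node arrives, grow the matrix by one row and column and place the node in whichever existing group increases the (hypothetical) `encoding_cost` the least; the Reassign/Split loop can then be resumed if desired.

```python
import numpy as np

def add_node(A, labels, k, out_edges, in_edges):
    """out_edges / in_edges: length-N 0/1 vectors of the new node's edges."""
    N = A.shape[0]
    A2 = np.zeros((N + 1, N + 1), dtype=A.dtype)
    A2[:N, :N] = A
    A2[N, :N] = out_edges                 # new row: edges from the new node
    A2[:N, N] = in_edges                  # new column: edges to the new node
    costs = [encoding_cost(A2, np.append(labels, g), k) for g in range(k)]
    best = int(np.argmin(costs))          # group giving the least total cost
    return A2, np.append(labels, best)
```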
Outlier edges: deviations from “normality” lower the quality of the compression, so the outliers are the edges whose removal maximally reduces the total encoding cost. (Figure: node-group × node-group matrix with the outlier edges marked.)
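A hedged sketch of this outlier score, reusing the hypothetical `encoding_cost` helper: each existing edge is ranked by how much the total cost drops when the edge is deleted. Recomputing the full cost per edge is deliberately naive, since only that edge's block actually changes.

```python
import numpy as np

def outlier_edges(A, labels, k, top=10):
    """Return the `top` edges whose removal reduces the encoding cost the most."""
    base = encoding_cost(A, labels, k)
    scored = []
    for u, v in zip(*np.nonzero(A)):
        A[u, v] = 0                                   # tentatively remove the edge
        scored.append((base - encoding_cost(A, labels, k), (int(u), int(v))))
        A[u, v] = 1                                   # restore it
    scored.sort(reverse=True)                         # biggest cost reduction first
    return scored[:top]
```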
Inter-cluster distances: two groups are “close” if merging them does not increase the cost by much; distance(i, j) = relative increase in cost on merging groups i and j.
Example with three groups (Grp1, Grp2, Grp3): the pairwise inter-group distances shown are 5.5, 5.1, and 4.5.
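A minimal sketch of this distance, again reusing the hypothetical `encoding_cost` helper: merge groups i and j, recompute the total cost, and report the relative increase.

```python
import numpy as np

def group_distance(A, labels, k, i, j):
    """distance(i, j) = relative increase in total cost when groups i and j are merged."""
    base = encoding_cost(A, labels, k)
    merged = labels.copy()
    merged[merged == j] = i                           # fold group j into group i
    merged[merged > j] -= 1                           # keep group ids 0..k-2 contiguous
    return (encoding_cost(A, merged, k - 1) - base) / base
```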
Experiments: a “quasi block-diagonal” graph with noise = 10%.
Experiments: DBLP dataset
• 6,090 authors in SIGMOD, ICDE, VLDB, PODS, and ICDT
• 175,494 “dots” in the authors × authors matrix, one “dot” per co-citation
Experiments: k = 8 author groups found; one group contains Stonebraker, DeWitt, and Carey. (Figure: authors × authors matrix rearranged into author groups.)
Experiments: inter-group distances among the author groups Grp1 through Grp8.
Experiments: Epinions dataset
• 75,888 users
• 508,960 “dots”, one “dot” per “trust” relationship
• k = 19 groups found; the user-groups × user-groups matrix shows a small, dense “core”
Experiments: the running time (in seconds) is linear in the number of “dots”, so the method is scalable.
Conclusions
• Goals:
  • Find groups
  • Find outliers
  • Compute inter-group “distances”
• Properties:
  • Fully automatic
  • Scalable
  • Allow incremental updates