AutoPart: Parameter-Free Graph Partitioning and Outlier Detection Deepayan Chakrabarti (deepay@cs.cmu.edu)
Problem Definition
Group people in a social network, or species in a food web, or proteins in protein interaction graphs, and so on. (Figure: a people × people adjacency matrix, rearranged into people groups.)
Reminder: a graph with N nodes and E directed edges, viewed as an N × N binary adjacency matrix (people × people).
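To make this concrete, here is a minimal sketch (not from the slides) of the assumed input: N nodes and E directed edges stored as an N × N binary adjacency matrix with one 1 per edge. The node count and edge list below are made up for illustration.

```python
import numpy as np

# Hypothetical toy input: N nodes, E directed edges as (source, target) pairs.
N = 5
edges = [(0, 1), (1, 2), (2, 0), (3, 4)]

# N x N binary adjacency matrix: one "dot" (a 1) per directed edge.
A = np.zeros((N, N), dtype=np.uint8)
for src, dst in edges:
    A[src, dst] = 1
```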
Problem Definition
• Goals:
  • [#1] Find groups (of people, species, proteins, etc.)
  • [#2] Find outlier edges (“bridges”)
  • [#3] Compute inter-group “distances” (how similar are two groups of proteins?)
Problem Definition
• Desired properties:
  • Fully automatic (estimate the number of groups)
  • Scalable
  • Allow incremental updates
Related Work
• Graph partitioning: METIS (Karypis+/1998), spectral partitioning (Ng+/2001). Require a measure of imbalance between clusters, or the number of partitions, as input.
• Clustering techniques: k-means and variants (Pelleg+/2000, Hamerly+/2003), information-theoretic co-clustering (Dhillon+/2003). Treat rows and columns separately, or are not fully automatic.
• LSI (Deerwester+/1990). Requires choosing the number of “concepts”.
Outline
• Problem Definition
• Related Work
• Finding clusters in graphs
  • What is a good clustering?
  • How can we find such a clustering?
• Outliers and inter-group distances
• Experiments
• Conclusions
What is a “good” clustering?
• Similar nodes are grouped together
• As few groups as necessary
The rearranged adjacency matrix then consists of a few homogeneous blocks, and a good clustering implies good compression.
Main Idea
Good compression implies good clustering. Rearrange the binary adjacency matrix into node-group × node-group blocks and judge a grouping by its total encoding cost:
Total encoding cost = Σi (ni1 + ni0) · H(pi1)  [code cost]  +  Σi cost of describing ni1, ni0 and the groups  [description cost]
where block i contains ni1 ones and ni0 zeros, and pi1 = ni1 / (ni1 + ni0).
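A hedged sketch of this objective in Python follows. It assumes a dense 0/1 NumPy array `A` and a label array `labels` mapping each node to one of `k` groups; the description-cost bookkeeping here (a few log terms for N, k, and the per-block counts) is only an approximation of the paper's exact formula.

```python
import numpy as np

def binary_entropy(p):
    """H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))

def encoding_cost(A, labels, k):
    """Approximate total cost (description + code) of a k-way node grouping."""
    N = A.shape[0]
    groups = [np.flatnonzero(labels == g) for g in range(k)]
    description = np.log2(N) + np.log2(k)         # rough cost of stating N and k
    code = 0.0
    for gi in groups:
        for gj in groups:
            block = A[np.ix_(gi, gj)]
            n = block.size                        # ni1 + ni0 for this block
            if n == 0:
                continue
            ones = block.sum()                    # ni1
            description += np.log2(n + 1)         # cost of stating ni1
            code += n * binary_entropy(ones / n)  # (ni1 + ni0) * H(pi1)
    return description + code
```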
Examples
• One node group: low description cost, but high code cost (the single block mixes all the 0s and 1s, so its entropy term is large).
• n node groups (one per node): low code cost (every block is trivially homogeneous), but high description cost.
In both cases, total encoding cost = description cost + code cost, as defined above.
What is a “good” clustering, revisited
The clustered matrix wins because both terms are low: a few homogeneous blocks keep the code cost near zero while adding only a small description cost.
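A quick toy check of this trade-off, reusing the hypothetical `encoding_cost` sketch above: on a clean two-block graph, a single group forces a high code cost, while the right two-way grouping makes every block homogeneous.

```python
import numpy as np

# Two perfectly dense communities of 20 nodes each, no cross edges.
A = np.zeros((40, 40), dtype=np.uint8)
A[:20, :20] = 1
A[20:, 20:] = 1

one_group  = np.zeros(40, dtype=int)           # k = 1: the single block is a 50/50 mix
two_groups = np.array([0] * 20 + [1] * 20)     # k = 2: every block is all-0 or all-1

print(encoding_cost(A, one_group, 1))    # large (code cost around 1600 bits)
print(encoding_cost(A, two_groups, 2))   # small (code cost near 0 bits)
```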
Algorithms: running example, an adjacency matrix partitioned into k = 5 node groups.
Algorithms: start with the initial matrix, find good groups for a fixed k by lowering the encoding cost, then choose better values for k; the two steps alternate until the final grouping is reached.
Fixed number of groups k (Reassign): for each node, reassign it to the group which minimizes the code cost.
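A hedged sketch of this step: with k fixed, each node is moved to the group that minimizes the code-cost term Σi (ni1 + ni0) · H(pi1). It is deliberately naive, recomputing the full code cost for every candidate group; the actual algorithm would only update the blocks the node touches.

```python
import numpy as np

def code_cost(A, labels, k):
    """Sum over blocks of (ni1 + ni0) * H(pi1)."""
    cost = 0.0
    groups = [np.flatnonzero(labels == g) for g in range(k)]
    for gi in groups:
        for gj in groups:
            block = A[np.ix_(gi, gj)]
            n = block.size
            if n == 0:
                continue
            p = block.sum() / n
            if 0.0 < p < 1.0:
                cost += n * -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))
    return cost

def reassign(A, labels, k):
    """One Reassign pass: move each node to its code-cost-minimizing group."""
    labels = labels.copy()
    for v in range(A.shape[0]):
        costs = []
        for g in range(k):
            trial = labels.copy()
            trial[v] = g
            costs.append(code_cost(A, trial, k))
        labels[v] = int(np.argmin(costs))
    return labels
```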
Choosing k (Split):
• Find the group R with the maximum entropy per node
• Choose the nodes in R whose removal reduces the entropy per node in R
• Send these nodes to the new group, and set k = k + 1
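A hedged sketch of the Split step. The "entropy per node" of a group is taken here to be the code cost of its rows divided by the group size; that exact choice is an assumption, and the paper's bookkeeping may differ in detail.

```python
import numpy as np

def H(p):
    return 0.0 if p <= 0.0 or p >= 1.0 else -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))

def entropy_per_node(A, labels, k, rows):
    """Code cost of the given rows (under the current grouping), per row."""
    if len(rows) == 0:
        return 0.0
    total = 0.0
    for g in range(k):
        cols = np.flatnonzero(labels == g)
        block = A[np.ix_(rows, cols)]
        if block.size:
            total += block.size * H(block.sum() / block.size)
    return total / len(rows)

def split(A, labels, k):
    """One Split: open group k and move the misfit nodes of the worst group into it."""
    per_node = [entropy_per_node(A, labels, k, np.flatnonzero(labels == g)) for g in range(k)]
    R = int(np.argmax(per_node))                      # group with max entropy per node
    members = np.flatnonzero(labels == R)
    new_labels = labels.copy()
    for v in members:
        rest = members[members != v]
        if entropy_per_node(A, labels, k, rest) < per_node[R]:
            new_labels[v] = k                         # send v to the new group
    return new_labels, k + 1
```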
Algorithms: in the full loop, the inner step (Reassign) finds good groups for a fixed k by lowering the encoding cost, and the outer step (Splits) chooses better values for k; starting from the initial matrix, the two alternate until the final grouping.
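A minimal sketch of this outer search, built on the hypothetical `encoding_cost`, `reassign`, and `split` helpers from the earlier sketches: start with one group, alternate Reassign passes (fixed k) with Splits (k becomes k + 1), and stop once the total encoding cost no longer improves.

```python
import numpy as np

def autopart_like(A, max_splits=20, max_passes=20):
    """Greedy search over groupings; returns (labels, k)."""
    N = A.shape[0]
    labels, k = np.zeros(N, dtype=int), 1
    best_cost = encoding_cost(A, labels, k)
    for _ in range(max_splits):
        cand_labels, cand_k = split(A, labels, k)     # try one more group
        for _ in range(max_passes):                   # inner loop: fixed k
            new_labels = reassign(A, cand_labels, cand_k)
            if np.array_equal(new_labels, cand_labels):
                break
            cand_labels = new_labels
        cost = encoding_cost(A, cand_labels, cand_k)
        if cost >= best_cost:                         # no improvement: keep previous k
            break
        labels, k, best_cost = cand_labels, cand_k, cost
    return labels, k
```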
Algorithms
• Properties:
  • Fully automatic: the number of groups is found automatically
  • Scalable: O(E) time
  • Allow incremental updates: reassign a new node/edge to the group with the least cost, and continue…
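A hedged sketch of the incremental-update idea in the last bullet: when a new node arrives, grow the matrix by one row and column and place the node in whichever existing group increases the (hypothetical) `encoding_cost` the least; the Reassign/Split loop can then be resumed if desired.

```python
import numpy as np

def add_node(A, labels, k, out_edges, in_edges):
    """out_edges / in_edges: length-N 0/1 vectors of the new node's edges."""
    N = A.shape[0]
    A2 = np.zeros((N + 1, N + 1), dtype=A.dtype)
    A2[:N, :N] = A
    A2[N, :N] = out_edges                 # new row: edges from the new node
    A2[:N, N] = in_edges                  # new column: edges to the new node
    costs = [encoding_cost(A2, np.append(labels, g), k) for g in range(k)]
    best = int(np.argmin(costs))          # group giving the least total cost
    return A2, np.append(labels, best)
```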
Outlier edges: deviations from “normality” lower the quality of the compression, so the outliers are the edges whose removal maximally reduces the total encoding cost. (Figure: node-group × node-group matrix with the outlier edges marked.)
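A hedged sketch of this outlier score, reusing the hypothetical `encoding_cost` helper: each existing edge is ranked by how much the total cost drops when the edge is deleted. Recomputing the full cost per edge is deliberately naive, since only that edge's block actually changes.

```python
import numpy as np

def outlier_edges(A, labels, k, top=10):
    """Return the `top` edges whose removal reduces the encoding cost the most."""
    base = encoding_cost(A, labels, k)
    scored = []
    for u, v in zip(*np.nonzero(A)):
        A[u, v] = 0                                   # tentatively remove the edge
        scored.append((base - encoding_cost(A, labels, k), (int(u), int(v))))
        A[u, v] = 1                                   # restore it
    scored.sort(reverse=True)                         # biggest cost reduction first
    return scored[:top]
```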
Inter-cluster distances: two groups are “close” if merging them does not increase the cost by much; distance(i, j) = relative increase in cost on merging groups i and j.
Example with three groups (Grp1, Grp2, Grp3): the pairwise inter-group distances shown are 5.5, 5.1, and 4.5.
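A minimal sketch of this distance, again reusing the hypothetical `encoding_cost` helper: merge groups i and j, recompute the total cost, and report the relative increase.

```python
import numpy as np

def group_distance(A, labels, k, i, j):
    """distance(i, j) = relative increase in total cost when groups i and j are merged."""
    base = encoding_cost(A, labels, k)
    merged = labels.copy()
    merged[merged == j] = i                           # fold group j into group i
    merged[merged > j] -= 1                           # keep group ids 0..k-2 contiguous
    return (encoding_cost(A, merged, k - 1) - base) / base
```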
Experiments: a “quasi block-diagonal” graph with noise = 10%.
Experiments: DBLP dataset
• 6,090 authors in SIGMOD, ICDE, VLDB, PODS, and ICDT
• 175,494 “dots” in the authors × authors matrix, one “dot” per co-citation
Experiments: k = 8 author groups found; one group contains Stonebraker, DeWitt, and Carey. (Figure: authors × authors matrix rearranged into author groups.)
Experiments: inter-group distances among the author groups Grp1 through Grp8.
Experiments: Epinions dataset
• 75,888 users
• 508,960 “dots”, one “dot” per “trust” relationship
• k = 19 groups found; the user-groups × user-groups matrix shows a small, dense “core”
Experiments: the running time (in seconds) is linear in the number of “dots”, so the method is scalable.
Conclusions
• Goals:
  • Find groups
  • Find outliers
  • Compute inter-group “distances”
• Properties:
  • Fully automatic
  • Scalable
  • Allow incremental updates