Fully Automatic Cross-Associations
Deepayan Chakrabarti (CMU), Spiros Papadimitriou (CMU), Dharmendra Modha (IBM), Christos Faloutsos (CMU and IBM)
Problem Definition
Simultaneously group customers and products, or documents and words, or users and preferences …
[Figure: a Customers × Products matrix rearranged into Customer Groups × Product Groups]
Problem Definition
• Desiderata:
  • Simultaneously discover row and column groups
  • Fully Automatic: No “magic numbers”
  • Scalable to large graphs
Related Work
• K-means and variants: dimensionality curse; choosing the number of clusters
• “Frequent itemsets”: user must specify “support”
• Information Retrieval: choosing the number of “concepts”
• Graph Partitioning: number of partitions; measure of imbalance between clusters
What makes a cross-association “good”?
[Figure: two cross-associations of the same matrix, compared as row groups × column groups]
Why is this better?
• Similar nodes are grouped together
• As few groups as necessary
A few, homogeneous blocks
Good Compression implies Better Clustering
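Main Idea
Good Compression implies Better Clustering
[Figure: binary matrix partitioned into row groups × column groups]
For each block $i$ of the binary matrix: $p_i^1 = n_i^1 / (n_i^1 + n_i^0)$
$\sum_i (n_i^1 + n_i^0) \, H(p_i^1)$ [Code Cost] $+ \sum_i$ (cost of describing $n_i^1$ and $n_i^0$) [Description Cost]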
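The code cost on this slide can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names (`binary_entropy`, `block_counts`, `code_cost`) and the list-based representation of group assignments are hypothetical, and $n_i^1$ and $n_i^0$ are taken to be the numbers of ones and zeros in block $i$, with $H$ the binary Shannon entropy in bits.

```python
from math import log2


def binary_entropy(p):
    """Entropy H(p), in bits, of a 0/1 source with P(1) = p; H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1.0 - p) * log2(1.0 - p)


def block_counts(matrix, row_group, col_group, k, l):
    """Count ones and cells in each of the k x l blocks.

    row_group[r] and col_group[c] give the group index of row r and column c
    (a hypothetical representation chosen for this sketch).
    """
    ones = [[0] * l for _ in range(k)]
    size = [[0] * l for _ in range(k)]
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            ones[row_group[r]][col_group[c]] += value
            size[row_group[r]][col_group[c]] += 1
    return ones, size


def code_cost(matrix, row_group, col_group, k, l):
    """Sum over blocks of (n1 + n0) * H(n1 / (n1 + n0)), in bits."""
    ones, size = block_counts(matrix, row_group, col_group, k, l)
    total = 0.0
    for i in range(k):
        for j in range(l):
            if size[i][j]:
                total += size[i][j] * binary_entropy(ones[i][j] / size[i][j])
    return total


# Toy 4 x 4 matrix with two clean diagonal blocks.
M = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]

single = code_cost(M, [0, 0, 0, 0], [0, 0, 0, 0], 1, 1)   # one mixed block
grouped = code_cost(M, [0, 0, 1, 1], [0, 0, 1, 1], 2, 2)  # four pure blocks
```

With a single group the matrix is half ones, so the code cost is 16 bits; with the 2 × 2 grouping every block is pure and the code cost drops to 0.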
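Examples
Total Encoding Cost = $\sum_i (n_i^1 + n_i^0) \, H(p_i^1)$ [Code Cost] $+ \sum_i$ (cost of describing $n_i^1$ and $n_i^0$) [Description Cost]
• One row group, one column group: Code Cost high, Description Cost low
• m row groups, n column groups: Code Cost low, Description Cost high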
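A small numeric sketch may make the two extremes concrete. The slides do not say how the description cost is computed, so this example assumes a simplified scheme: each non-empty block pays ceil(log2(size + 1)) bits to state its number of ones, and the cost of describing the group memberships themselves is ignored. The function name `costs` is hypothetical.

```python
from math import log2


def entropy(p):
    """Binary entropy in bits; 0 for a pure block."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1.0 - p) * log2(1.0 - p)


def costs(matrix, row_group, col_group):
    """Return (code cost, description cost) for one cross-association.

    ASSUMPTION: description cost = ceil(log2(size + 1)) bits per non-empty
    block, a stand-in for "cost of describing n1 and n0". Group-membership
    costs are deliberately left out of this sketch.
    """
    k = max(row_group) + 1
    l = max(col_group) + 1
    ones = [[0] * l for _ in range(k)]
    size = [[0] * l for _ in range(k)]
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            ones[row_group[r]][col_group[c]] += value
            size[row_group[r]][col_group[c]] += 1
    code, description = 0.0, 0
    for i in range(k):
        for j in range(l):
            n = size[i][j]
            if n == 0:
                continue
            code += n * entropy(ones[i][j] / n)
            # For n >= 0, n.bit_length() equals ceil(log2(n + 1)) exactly.
            description += n.bit_length()
    return code, description


# 8 x 8 matrix with two 4 x 4 blocks of ones on the diagonal.
M = [[1 if (r < 4) == (c < 4) else 0 for c in range(8)] for r in range(8)]

one_group = costs(M, [0] * 8, [0] * 8)                     # 1 row group, 1 col group
finest = costs(M, list(range(8)), list(range(8)))          # every row/col alone
two_by_two = costs(M, [r // 4 for r in range(8)], [c // 4 for c in range(8)])
```

Under these assumptions, the single-group extreme costs 64 bits of code plus 7 of description, the finest grouping costs 0 bits of code plus 64 of description, and the 2 × 2 grouping costs 0 plus 20, the lowest total of the three.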
What makes a cross-association “good”?
[Figure: two cross-associations of the same matrix, compared as row groups × column groups]
Why is this better?
Total Encoding Cost = $\sum_i (n_i^1 + n_i^0) \, H(p_i^1)$ [Code Cost] $+ \sum_i$ (cost of describing $n_i^1$ and $n_i^0$) [Description Cost]
Code Cost: low; Description Cost: low
Algorithms
[Figure: intermediate cross-associations for k=1, l=2; k=2, l=2; k=2, l=3; k=3, l=3; k=3, l=4; k=4, l=4; k=4, l=5; final result with k = 5 row groups and l = 5 col groups]
Algorithms
[Flowchart: Start with initial matrix → Find good groups for fixed k and l ⇄ Choose better values for k and l (lower the encoding cost) → Final cross-associations, here k = 5, l = 5]
Fixed k and l
[Flowchart repeated from the Algorithms slide; this part covers the step “Find good groups for fixed k and l”]
Fixed k and l
[Figure: matrix with row groups and column groups]
• Swaps: for each row:
  • swap it to the row group which minimizes the code cost
Fixed k and l
Ditto for column swaps … and repeat …
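One pass of row swaps could look like the sketch below. The slides only say that each row moves to the row group that minimizes the code cost; this example assumes one common way of doing that, namely scoring a row against each group's current block densities (held fixed during the pass) and picking the cheapest. The function name `row_swap_pass`, the `eps` clamp, and the choice of density 0.5 for empty blocks are assumptions of the sketch.

```python
from math import log2


def row_swap_pass(matrix, row_group, col_group, k, l, eps=1e-9):
    """Reassign every row to the row group that encodes it most cheaply.

    ASSUMPTION: block densities are computed once from the current grouping
    and held fixed for the whole pass; densities are clamped to [eps, 1 - eps]
    so that the logarithms stay finite.
    """
    ones = [[0] * l for _ in range(k)]
    size = [[0] * l for _ in range(k)]
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            ones[row_group[r]][col_group[c]] += value
            size[row_group[r]][col_group[c]] += 1
    density = [[min(max(ones[i][j] / size[i][j], eps), 1 - eps) if size[i][j] else 0.5
                for j in range(l)] for i in range(k)]

    new_groups = []
    for row in matrix:
        # Ones and cells of this row inside each column group.
        row_ones = [0] * l
        row_len = [0] * l
        for c, value in enumerate(row):
            row_ones[col_group[c]] += value
            row_len[col_group[c]] += 1

        def bits(i):
            """Bits needed to code this row with row group i's densities."""
            return -sum(row_ones[j] * log2(density[i][j])
                        + (row_len[j] - row_ones[j]) * log2(1 - density[i][j])
                        for j in range(l))

        new_groups.append(min(range(k), key=bits))
    return new_groups


M = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]

# Row 2 starts in the wrong group; one pass moves it next to row 3.
swapped = row_swap_pass(M, [0, 0, 0, 1], [0, 0, 1, 1], k=2, l=2)
```

Column swaps would apply the same function to the transposed matrix with the roles of `row_group` and `col_group` exchanged, and the two passes would alternate until the grouping stops changing.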
Choosing k and l
[Flowchart repeated from the Algorithms slide; this part covers the step “Choose better values for k and l”]
Choosing k and l
• Split:
  • Find the row group R with the maximum entropy per row
  • Choose the rows in R whose removal reduces the entropy per row in R
  • Send these rows to the new row group, and set k=k+1
Choosing k and l
Split: Similar for column groups too.
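The row split described above can be sketched as follows. The names (`group_bits`, `split_row_group`) are hypothetical, "entropy per row" is taken to mean the code cost of the group's blocks divided by its number of rows, and the guard that keeps at least one row in R is an assumption of this sketch rather than something stated on the slides.

```python
from math import log2


def entropy(p):
    """Binary entropy in bits; 0 for a pure block."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1.0 - p) * log2(1.0 - p)


def group_bits(rows, col_group, l):
    """Code cost, in bits, of the blocks that `rows` form across l column groups."""
    ones = [0] * l
    size = [0] * l
    for row in rows:
        for c, value in enumerate(row):
            ones[col_group[c]] += value
            size[col_group[c]] += 1
    return sum(size[j] * entropy(ones[j] / size[j]) for j in range(l) if size[j])


def split_row_group(matrix, row_group, col_group, k, l):
    """Split the row group with the highest entropy per row; return (groups, k)."""
    members = {i: [r for r in range(len(matrix)) if row_group[r] == i]
               for i in range(k)}

    def per_row(indices):
        if not indices:
            return 0.0
        return group_bits([matrix[r] for r in indices], col_group, l) / len(indices)

    target = max(range(k), key=lambda i: per_row(members[i]))
    base = per_row(members[target])
    # Rows whose removal lowers the entropy per row of the target group.
    movers = [r for r in members[target]
              if per_row([x for x in members[target] if x != r]) < base]
    if len(movers) == len(members[target]):
        movers = movers[:-1]  # ASSUMPTION: never empty the original group
    new_groups = list(row_group)
    for r in movers:
        new_groups[r] = k  # index of the newly created row group
    return new_groups, (k + 1 if movers else k)


M = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1]]

# All rows start in one group; the odd row out is sent to a new group.
new_groups, new_k = split_row_group(M, [0, 0, 0, 0], [0, 0, 1, 1], k=1, l=2)
```

In this toy case only the last row lowers the entropy per row when removed, so it alone moves to the new group and k becomes 2.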
Algorithms
[Flowchart: Start with initial matrix → Find good groups for fixed k and l (Swaps) ⇄ Choose better values for k and l (Splits), each step lowering the encoding cost → Final cross-associations, here k = 5, l = 5]
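The interplay of splits and swaps in the flowchart can be sketched as a generic greedy loop. Everything here is hypothetical scaffolding: `search` takes the cost function and the two steps as parameters, and the stopping rule (stop as soon as a split followed by swaps fails to lower the total encoding cost) is an assumption, since the slides do not spell out the termination condition.

```python
def search(state, total_cost, split, swap):
    """Greedy outer loop: split, re-run swaps, keep the result only if cheaper.

    state      -- any object describing the current cross-association
    total_cost -- function(state) -> number (total encoding cost)
    split      -- function(state) -> state with one more group
    swap       -- function(state) -> state after swaps for fixed k and l
    ASSUMPTION: stop at the first split that does not lower the total cost.
    """
    best = swap(state)
    best_cost = total_cost(best)
    while True:
        candidate = swap(split(best))
        cost = total_cost(candidate)
        if cost >= best_cost:
            return best, best_cost
        best, best_cost = candidate, cost


# Toy stand-ins: the "state" is just a group count, and the cost is lowest at 3.
result = search(0,
                total_cost=lambda groups: (groups - 3) ** 2,
                split=lambda groups: groups + 1,
                swap=lambda groups: groups)
```

Passing the steps in as functions keeps the loop independent of how swaps, splits, and the cost are actually implemented.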
Experiments: “Customer-Product” graph with Zipfian sizes, no noise (k = 5 row groups, l = 5 col groups)
Experiments: “Caveman” graph with Zipfian cave sizes, noise=10% (k = 6 row groups, l = 8 col groups)
Experiments: “White Noise” graph (k = 2 row groups, l = 3 col groups)
Experiments: “CLASSIC” graph of documents & words: k=15, l=19 (axes: Documents, Words)
Experiments: “GRANTS” graph of documents & words: k=41, l=28 (axes: NSF Grant Proposals, Words in abstract)
Experiments: “Who-trusts-whom” graph from epinions.com: k=18, l=16 (axes: Epinions.com user, Epinions.com user)
Experiments: “Clickstream” graph of users and websites: k=15, l=13 (axes: Users, Webpages)
Experiments
[Plot: Time (secs) versus Number of non-zeros, with curves for Splits and Swaps]
Linear in the number of “ones”: Scalable
Conclusions
• Desiderata:
  • Simultaneously discover row and column groups
  • Fully Automatic: No “magic numbers”
  • Scalable to large graphs
Fixed k and l
[Flowchart repeated from the Algorithms slide, with the step “Find good groups for fixed k and l” labeled “swaps”]
Experiments: “Caveman” graph with Zipfian cave sizes, no noise (k = 5 row groups, l = 5 col groups)
Aim
Given any binary matrix, a “good” cross-association will have low cost.
But how can we find such a cross-association?
[Figure: example with k = 5 row groups and l = 5 col groups]
Main Idea
Good Compression implies Better Clustering
Total Encoding Cost = $\sum_i \mathrm{size}_i \cdot H(p_i)$ [Code Cost] $+$ (cost of describing cross-associations) [Description Cost]
Minimize the total cost
Main Idea
Good Compression implies Better Clustering
• How well does a cross-association compress the matrix?
  • Encode the matrix in a lossless fashion
  • Compute the encoding cost
  • Low encoding cost ⇒ good compression ⇒ good clustering