
Fully Automatic Cross-Associations



Presentation Transcript


1. Fully Automatic Cross-Associations. Deepayan Chakrabarti (CMU), Spiros Papadimitriou (CMU), Dharmendra Modha (IBM), Christos Faloutsos (CMU and IBM)

2. Problem Definition
[Figure: customers mapped to customer groups, products mapped to product groups.]
Simultaneously group customers and products, or documents and words, or users and preferences …
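To make the setup concrete, here is a minimal sketch in Python (hypothetical toy data, not from the paper) of the representation used throughout: a binary matrix plus one group label per row and per column.

```python
import numpy as np

# Hypothetical toy data: an 8-customers x 6-products binary matrix,
# where A[i, j] = 1 means customer i bought product j.
rng = np.random.default_rng(0)
A = (rng.random((8, 6)) < 0.3).astype(int)

# A cross-association is just one group label per row and per column.
row_labels = np.array([0, 0, 1, 1, 1, 0, 1, 0])   # k = 2 customer groups
col_labels = np.array([0, 1, 1, 0, 2, 2])         # l = 3 product groups

# Block (r, c) collects the cells of row group r and column group c:
block = A[row_labels == 1][:, col_labels == 2]
print(block.shape)   # (4, 2)
```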

3. Problem Definition
Desiderata:
• Simultaneously discover row and column groups
• Fully Automatic: no "magic numbers"
• Scalable to large graphs

4. Cross-Associations ≠ Co-clustering!

5. Related Work
• K-means and variants: dimensionality curse; choosing the number of clusters
• "Frequent itemsets": user must specify "support"
• Information Retrieval: choosing the number of "concepts"
• Graph Partitioning: number of partitions; measure of imbalance between clusters

6. What makes a cross-association "good"?
[Figure: two cross-associations of the same matrix, side by side, with row groups and column groups marked.]
Why is this better?
• Similar nodes are grouped together
• As few groups as necessary
A few, homogeneous blocks → good compression, which implies better clustering.

7. Main Idea: Good Compression implies Better Clustering
[Figure: binary matrix partitioned into row groups × column groups.]
For each block i, with n_i1 ones and n_i0 zeros, let p_i1 = n_i1 / (n_i1 + n_i0). The encoding cost has two parts:
Code Cost = Σ_i (n_i1 + n_i0) · H(p_i1), where H is the binary Shannon entropy
Description Cost = Σ_i cost of describing n_i1 and n_i0
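A minimal sketch of the code-cost term, directly transcribing Σ_i (n_i1 + n_i0) · H(p_i1) under the label representation above (NumPy assumed; not the authors' implementation):

```python
import numpy as np

def binary_entropy(p):
    """H(p) in bits, with the convention H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def code_cost(A, row_labels, col_labels, k, l):
    """Code Cost = sum over all k*l blocks i of (n_i1 + n_i0) * H(p_i1)."""
    total = 0.0
    for r in range(k):
        for c in range(l):
            block = A[row_labels == r][:, col_labels == c]
            n = block.size                    # n_i1 + n_i0
            if n == 0:
                continue
            p1 = block.sum() / n              # p_i1 = n_i1 / (n_i1 + n_i0)
            total += n * binary_entropy(p1)
    return total
```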

8. Total Encoding Cost = Σ_i (n_i1 + n_i0) · H(p_i1)   (Code Cost)  +  Σ_i cost of describing n_i1 and n_i0   (Description Cost)
Extreme examples:
• One row group, one column group: low description cost, high code cost
• m row groups, n column groups (one group per node): high description cost, low code cost
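Reusing the code_cost sketch and the toy matrix A from above, the two extremes can be checked numerically (exact values depend on the random matrix):

```python
import numpy as np

# One row group, one column group: a single mixed block.
# Description cost is minimal, but H(p) is large -> high code cost.
print(code_cost(A, np.zeros(8, dtype=int), np.zeros(6, dtype=int), 1, 1))

# One group per row and per column: every block is a single cell, so each
# p_i1 is 0 or 1 and the code cost vanishes -- but now 8 * 6 block counts
# must be described, so the description cost explodes.
print(code_cost(A, np.arange(8), np.arange(6), 8, 6))   # 0.0
```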

9. What makes a cross-association "good"? Why is this better?
[Figure: the preferred cross-association versus an alternative, with row groups and column groups marked.]
Under Total Encoding Cost = Code Cost + Description Cost, the better cross-association keeps both parts low: Σ_i (n_i1 + n_i0) · H(p_i1) is low because the blocks are homogeneous, and the cost of describing n_i1 and n_i0 is low because there are few blocks.

10. Algorithms: search over (k, l), alternately growing the number of row and column groups, e.g. k=1,l=2 → k=2,l=2 → k=2,l=3 → k=3,l=3 → k=3,l=4 → k=4,l=4 → k=4,l=5 → … until k = 5 row groups and l = 5 column groups.

11. Algorithms (running example: k = 5, l = 5): start with the initial matrix → find good groups for fixed k and l → choose better values for k and l → repeat, lowering the encoding cost at each step → final cross-associations.

12. Fixed k and l: zoom in on the "find good groups for fixed k and l" step of the loop above (k = 5, l = 5).

13. Fixed k and l
[Figure: row groups × column groups.]
• Swaps: for each row, swap it to the row group which minimizes the code cost.

14. Fixed k and l
[Figure: row groups × column groups after swapping.]
Ditto for column swaps … and repeat …
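A hedged sketch of this swap step (not the authors' code): each row is re-encoded against every row group's block densities and moved to the cheapest one; column swaps are the same routine applied to the transpose.

```python
import numpy as np

def swap_rows(A, row_labels, col_labels, k, l):
    """One pass of row swaps: move each row to the row group whose
    block densities encode that row's cells most cheaply."""
    eps = 1e-12
    for i in range(A.shape[0]):
        best_g, best_cost = row_labels[i], np.inf
        for g in range(k):
            cost = 0.0
            for c in range(l):
                in_c = (col_labels == c)
                blk = A[row_labels == g][:, in_c]
                p = blk.mean() if blk.size else 0.5   # density of block (g, c)
                p = min(max(p, eps), 1.0 - eps)
                ones = int(A[i, in_c].sum())
                zeros = int(in_c.sum()) - ones
                # bits to encode row i's cells in column group c under density p
                cost -= ones * np.log2(p) + zeros * np.log2(1.0 - p)
            if cost < best_cost:
                best_g, best_cost = g, cost
        row_labels[i] = best_g
    return row_labels

# Column swaps are the same routine on the transpose:
# col_labels = swap_rows(A.T, col_labels, row_labels, l, k)
```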

15. Choosing k and l: zoom in on the "choose better values for k and l" step of the loop (k = 5, l = 5).

16. Choosing k and l
• Split:
  • Find the row group R with the maximum entropy per row
  • Choose the rows in R whose removal reduces the entropy per row in R
  • Send these rows to the new row group, and set k = k + 1

17. Choosing k and l. Split: similarly for column groups too.
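A sketch of the split step following the three bullets above; entropy_per_row here means the code cost of a group divided by its number of rows, and the paper's exact bookkeeping may differ:

```python
import numpy as np

def split_row_group(A, row_labels, col_labels, k, l):
    """Split sketch: carve a new row group (label k) out of the row group
    with the highest entropy per row; returns (labels, k + 1)."""
    def entropy_per_row(mask):
        nrows = int(mask.sum())
        if nrows == 0:
            return 0.0
        total = 0.0
        for c in range(l):
            blk = A[mask][:, col_labels == c]
            p = blk.mean()
            if 0.0 < p < 1.0:
                total -= blk.size * (p * np.log2(p) + (1 - p) * np.log2(1 - p))
        return total / nrows

    # 1. Find the row group R with the maximum entropy per row.
    R = max(range(k), key=lambda g: entropy_per_row(row_labels == g))
    # 2. Move out each row whose removal reduces R's entropy per row.
    for i in np.where(row_labels == R)[0]:
        before = entropy_per_row(row_labels == R)
        row_labels[i] = k                       # tentatively move row i out
        if entropy_per_row(row_labels == R) >= before:
            row_labels[i] = R                   # removal didn't help: undo
    return row_labels, k + 1
```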

18. Algorithms, the complete loop: start with the initial matrix → find good groups for fixed k and l (Swaps) → choose better values for k and l (Splits) → keep lowering the encoding cost → final cross-associations.
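Putting the pieces together, a sketch of the outer search loop, assuming the code_cost, swap_rows, and split_row_group sketches above are in scope; total_cost here adds a crude stand-in for the description cost (roughly log2 bits per block count), not the paper's exact formula:

```python
import numpy as np

def total_cost(A, rows, cols, k, l):
    """code_cost (sketched earlier) plus a crude description-cost stand-in:
    roughly log2(block size + 1) bits per block's count of ones."""
    desc = sum(np.log2((rows == r).sum() * (cols == c).sum() + 1.0)
               for r in range(k) for c in range(l))
    return code_cost(A, rows, cols, k, l) + desc

def cross_associate(A, max_outer=10):
    """Alternate splits and swaps, accepting a step only if it lowers the
    total encoding cost; column-side steps reuse the row sketches on A.T."""
    k = l = 1
    rows = np.zeros(A.shape[0], dtype=int)
    cols = np.zeros(A.shape[1], dtype=int)
    best = total_cost(A, rows, cols, k, l)
    for _ in range(max_outer):
        improved = False
        # Try growing k: split a row group, then re-run swaps.
        r2, k2 = split_row_group(A, rows.copy(), cols, k, l)
        r2 = swap_rows(A, r2, cols, k2, l)
        cost = total_cost(A, r2, cols, k2, l)
        if cost < best:
            rows, k, best, improved = r2, k2, cost, True
        # Try growing l: the same moves on the transpose.
        c2, l2 = split_row_group(A.T, cols.copy(), rows, l, k)
        c2 = swap_rows(A.T, c2, rows, l2, k)
        cost = total_cost(A, rows, c2, k, l2)
        if cost < best:
            cols, l, best, improved = c2, l2, cost, True
        if not improved:
            break       # neither split lowered the cost: k and l are chosen
    return rows, cols, k, l

# Usage: rows, cols, k, l = cross_associate(A)
```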

19. Experiments: "Customer-Product" graph with Zipfian sizes, no noise: found k = 5 row groups, l = 5 column groups.

20. Experiments: "Caveman" graph with Zipfian cave sizes, noise = 10%: found k = 6 row groups, l = 8 column groups.

21. Experiments: "White Noise" graph: found k = 2 row groups, l = 3 column groups.

22. Experiments: "CLASSIC" graph of documents & words: k = 15, l = 19.

23. Experiments: "GRANTS" graph of NSF grant proposals & words in their abstracts: k = 41, l = 28.

24. Experiments: "Who-trusts-whom" graph of epinions.com users: k = 18, l = 16.

25. Experiments: "Clickstream" graph of users and webpages: k = 15, l = 13.

26. Experiments: [Plot: running time (secs) vs. number of non-zeros, for splits and swaps.] Time is linear in the number of "ones": scalable.

27. Conclusions
Desiderata:
• Simultaneously discover row and column groups
• Fully Automatic: no "magic numbers"
• Scalable to large graphs

28. Fixed k and l (backup): the swap steps within the loop (start with the initial matrix → find good groups for fixed k and l via swaps → choose better values for k and l → lower the encoding cost → final cross-associations).

29. Experiments: "Caveman" graph with Zipfian cave sizes, no noise: found k = 5 row groups, l = 5 column groups.

30. Aim: given any binary matrix, a "good" cross-association (e.g., k = 5 row groups, l = 5 column groups) will have low cost. But how can we find such a cross-association?

31. Main Idea: Good Compression implies Better Clustering
Total Encoding Cost = Σ_i size_i · H(p_i)   (Code Cost)  +  cost of describing the cross-associations   (Description Cost)
Minimize the total cost.

32. Main Idea: Good Compression implies Better Clustering
• How well does a cross-association compress the matrix?
• Encode the matrix in a lossless fashion
• Compute the encoding cost
• Low encoding cost → good compression → good clustering
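A quick numeric illustration of why homogeneous blocks compress well (simple entropy arithmetic, not from the slides):

```python
import numpy as np

def H(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

# A homogeneous 4x4 block (all ones): p = 1, so H(1) = 0 -> ~0 bits of code cost.
print(16 * H(1.0))   # 0.0
# A maximally mixed 4x4 block (8 ones): p = 0.5, so H(0.5) = 1 bit per cell
# -> 16 bits, i.e. no compression at all. Homogeneous blocks compress well.
print(16 * H(0.5))   # 16.0
```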
