
Fully Automatic Cross-Associations



Presentation Transcript


1. Fully Automatic Cross-Associations. Deepayan Chakrabarti (CMU), Spiros Papadimitriou (CMU), Dharmendra Modha (IBM), Christos Faloutsos (CMU and IBM)

2. Problem Definition
[Figure: customers mapped to customer groups, products mapped to product groups.]
Simultaneously group customers and products, or documents and words, or users and preferences …
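To make the setup concrete, here is a minimal sketch in Python (hypothetical toy data, not from the paper) of the representation used throughout: a binary matrix plus one group label per row and per column.

```python
import numpy as np

# Hypothetical toy data: an 8-customers x 6-products binary matrix,
# where A[i, j] = 1 means customer i bought product j.
rng = np.random.default_rng(0)
A = (rng.random((8, 6)) < 0.3).astype(int)

# A cross-association is just one group label per row and per column.
row_labels = np.array([0, 0, 1, 1, 1, 0, 1, 0])   # k = 2 customer groups
col_labels = np.array([0, 1, 1, 0, 2, 2])         # l = 3 product groups

# Block (r, c) collects the cells of row group r and column group c:
block = A[row_labels == 1][:, col_labels == 2]
print(block.shape)   # (4, 2)
```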

3. Problem Definition
Desiderata:
• Simultaneously discover row and column groups
• Fully Automatic: no "magic numbers"
• Scalable to large graphs

4. Cross-Associations ≠ Co-clustering!

5. Related Work
• K-means and variants: dimensionality curse; choosing the number of clusters
• "Frequent itemsets": user must specify "support"
• Information Retrieval: choosing the number of "concepts"
• Graph Partitioning: number of partitions; measure of imbalance between clusters

6. What makes a cross-association "good"?
[Figure: two cross-associations of the same matrix, side by side, with row groups and column groups marked.]
Why is this better?
• Similar nodes are grouped together
• As few groups as necessary
A few, homogeneous blocks → good compression, which implies better clustering.

7. Main Idea: Good Compression implies Better Clustering
[Figure: binary matrix partitioned into row groups × column groups.]
For each block i, with n_i1 ones and n_i0 zeros, let p_i1 = n_i1 / (n_i1 + n_i0). The encoding cost has two parts:
Code Cost = Σ_i (n_i1 + n_i0) · H(p_i1), where H is the binary Shannon entropy
Description Cost = Σ_i cost of describing n_i1 and n_i0
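A minimal sketch of the code-cost term, directly transcribing Σ_i (n_i1 + n_i0) · H(p_i1) under the label representation above (NumPy assumed; not the authors' implementation):

```python
import numpy as np

def binary_entropy(p):
    """H(p) in bits, with the convention H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def code_cost(A, row_labels, col_labels, k, l):
    """Code Cost = sum over all k*l blocks i of (n_i1 + n_i0) * H(p_i1)."""
    total = 0.0
    for r in range(k):
        for c in range(l):
            block = A[row_labels == r][:, col_labels == c]
            n = block.size                    # n_i1 + n_i0
            if n == 0:
                continue
            p1 = block.sum() / n              # p_i1 = n_i1 / (n_i1 + n_i0)
            total += n * binary_entropy(p1)
    return total
```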

8. Total Encoding Cost = Σ_i (n_i1 + n_i0) · H(p_i1)   (Code Cost)  +  Σ_i cost of describing n_i1 and n_i0   (Description Cost)
Extreme examples:
• One row group, one column group: low description cost, high code cost
• m row groups, n column groups (one group per node): high description cost, low code cost
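Reusing the code_cost sketch and the toy matrix A from above, the two extremes can be checked numerically (exact values depend on the random matrix):

```python
import numpy as np

# One row group, one column group: a single mixed block.
# Description cost is minimal, but H(p) is large -> high code cost.
print(code_cost(A, np.zeros(8, dtype=int), np.zeros(6, dtype=int), 1, 1))

# One group per row and per column: every block is a single cell, so each
# p_i1 is 0 or 1 and the code cost vanishes -- but now 8 * 6 block counts
# must be described, so the description cost explodes.
print(code_cost(A, np.arange(8), np.arange(6), 8, 6))   # 0.0
```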

9. What makes a cross-association "good"? Why is this better?
[Figure: the preferred cross-association versus an alternative, with row groups and column groups marked.]
Under Total Encoding Cost = Code Cost + Description Cost, the better cross-association keeps both parts low: Σ_i (n_i1 + n_i0) · H(p_i1) is low because the blocks are homogeneous, and the cost of describing n_i1 and n_i0 is low because there are few blocks.

10. Algorithms: search over (k, l), alternately growing the number of row and column groups, e.g. k=1,l=2 → k=2,l=2 → k=2,l=3 → k=3,l=3 → k=3,l=4 → k=4,l=4 → k=4,l=5 → … until k = 5 row groups and l = 5 column groups.

11. Algorithms (running example: k = 5, l = 5): start with the initial matrix → find good groups for fixed k and l → choose better values for k and l → repeat, lowering the encoding cost at each step → final cross-associations.

12. Fixed k and l: zoom in on the "find good groups for fixed k and l" step of the loop above (k = 5, l = 5).

13. Fixed k and l
[Figure: row groups × column groups.]
• Swaps: for each row, swap it to the row group which minimizes the code cost.

14. Fixed k and l
[Figure: row groups × column groups after swapping.]
Ditto for column swaps … and repeat …
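A hedged sketch of this swap step (not the authors' code): each row is re-encoded against every row group's block densities and moved to the cheapest one; column swaps are the same routine applied to the transpose.

```python
import numpy as np

def swap_rows(A, row_labels, col_labels, k, l):
    """One pass of row swaps: move each row to the row group whose
    block densities encode that row's cells most cheaply."""
    eps = 1e-12
    for i in range(A.shape[0]):
        best_g, best_cost = row_labels[i], np.inf
        for g in range(k):
            cost = 0.0
            for c in range(l):
                in_c = (col_labels == c)
                blk = A[row_labels == g][:, in_c]
                p = blk.mean() if blk.size else 0.5   # density of block (g, c)
                p = min(max(p, eps), 1.0 - eps)
                ones = int(A[i, in_c].sum())
                zeros = int(in_c.sum()) - ones
                # bits to encode row i's cells in column group c under density p
                cost -= ones * np.log2(p) + zeros * np.log2(1.0 - p)
            if cost < best_cost:
                best_g, best_cost = g, cost
        row_labels[i] = best_g
    return row_labels

# Column swaps are the same routine on the transpose:
# col_labels = swap_rows(A.T, col_labels, row_labels, l, k)
```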

15. Choosing k and l: zoom in on the "choose better values for k and l" step of the loop (k = 5, l = 5).

16. Choosing k and l
• Split:
  • Find the row group R with the maximum entropy per row
  • Choose the rows in R whose removal reduces the entropy per row in R
  • Send these rows to the new row group, and set k = k + 1

17. Choosing k and l. Split: similarly for column groups too.
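A sketch of the split step following the three bullets above; entropy_per_row here means the code cost of a group divided by its number of rows, and the paper's exact bookkeeping may differ:

```python
import numpy as np

def split_row_group(A, row_labels, col_labels, k, l):
    """Split sketch: carve a new row group (label k) out of the row group
    with the highest entropy per row; returns (labels, k + 1)."""
    def entropy_per_row(mask):
        nrows = int(mask.sum())
        if nrows == 0:
            return 0.0
        total = 0.0
        for c in range(l):
            blk = A[mask][:, col_labels == c]
            p = blk.mean()
            if 0.0 < p < 1.0:
                total -= blk.size * (p * np.log2(p) + (1 - p) * np.log2(1 - p))
        return total / nrows

    # 1. Find the row group R with the maximum entropy per row.
    R = max(range(k), key=lambda g: entropy_per_row(row_labels == g))
    # 2. Move out each row whose removal reduces R's entropy per row.
    for i in np.where(row_labels == R)[0]:
        before = entropy_per_row(row_labels == R)
        row_labels[i] = k                       # tentatively move row i out
        if entropy_per_row(row_labels == R) >= before:
            row_labels[i] = R                   # removal didn't help: undo
    return row_labels, k + 1
```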

18. Algorithms, the complete loop: start with the initial matrix → find good groups for fixed k and l (Swaps) → choose better values for k and l (Splits) → keep lowering the encoding cost → final cross-associations.
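Putting the pieces together, a sketch of the outer search loop, assuming the code_cost, swap_rows, and split_row_group sketches above are in scope; total_cost here adds a crude stand-in for the description cost (roughly log2 bits per block count), not the paper's exact formula:

```python
import numpy as np

def total_cost(A, rows, cols, k, l):
    """code_cost (sketched earlier) plus a crude description-cost stand-in:
    roughly log2(block size + 1) bits per block's count of ones."""
    desc = sum(np.log2((rows == r).sum() * (cols == c).sum() + 1.0)
               for r in range(k) for c in range(l))
    return code_cost(A, rows, cols, k, l) + desc

def cross_associate(A, max_outer=10):
    """Alternate splits and swaps, accepting a step only if it lowers the
    total encoding cost; column-side steps reuse the row sketches on A.T."""
    k = l = 1
    rows = np.zeros(A.shape[0], dtype=int)
    cols = np.zeros(A.shape[1], dtype=int)
    best = total_cost(A, rows, cols, k, l)
    for _ in range(max_outer):
        improved = False
        # Try growing k: split a row group, then re-run swaps.
        r2, k2 = split_row_group(A, rows.copy(), cols, k, l)
        r2 = swap_rows(A, r2, cols, k2, l)
        cost = total_cost(A, r2, cols, k2, l)
        if cost < best:
            rows, k, best, improved = r2, k2, cost, True
        # Try growing l: the same moves on the transpose.
        c2, l2 = split_row_group(A.T, cols.copy(), rows, l, k)
        c2 = swap_rows(A.T, c2, rows, l2, k)
        cost = total_cost(A, rows, c2, k, l2)
        if cost < best:
            cols, l, best, improved = c2, l2, cost, True
        if not improved:
            break       # neither split lowered the cost: k and l are chosen
    return rows, cols, k, l

# Usage: rows, cols, k, l = cross_associate(A)
```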

19. Experiments: "Customer-Product" graph with Zipfian sizes, no noise: found k = 5 row groups, l = 5 column groups.

20. Experiments: "Caveman" graph with Zipfian cave sizes, noise = 10%: found k = 6 row groups, l = 8 column groups.

21. Experiments: "White Noise" graph: found k = 2 row groups, l = 3 column groups.

22. Experiments: "CLASSIC" graph of documents & words: k = 15, l = 19.

23. Experiments: "GRANTS" graph of NSF grant proposals & words in their abstracts: k = 41, l = 28.

24. Experiments: "Who-trusts-whom" graph of epinions.com users: k = 18, l = 16.

25. Experiments: "Clickstream" graph of users and webpages: k = 15, l = 13.

26. Experiments: [Plot: running time (secs) vs. number of non-zeros, for splits and swaps.] Time is linear in the number of "ones": scalable.

27. Conclusions
Desiderata:
• Simultaneously discover row and column groups
• Fully Automatic: no "magic numbers"
• Scalable to large graphs

28. Fixed k and l (backup): the swap steps within the loop (start with the initial matrix → find good groups for fixed k and l via swaps → choose better values for k and l → lower the encoding cost → final cross-associations).

29. Experiments: "Caveman" graph with Zipfian cave sizes, no noise: found k = 5 row groups, l = 5 column groups.

30. Aim: given any binary matrix, a "good" cross-association (e.g., k = 5 row groups, l = 5 column groups) will have low cost. But how can we find such a cross-association?

31. Main Idea: Good Compression implies Better Clustering
Total Encoding Cost = Σ_i size_i · H(p_i)   (Code Cost)  +  cost of describing the cross-associations   (Description Cost)
Minimize the total cost.

32. Main Idea: Good Compression implies Better Clustering
• How well does a cross-association compress the matrix?
• Encode the matrix in a lossless fashion
• Compute the encoding cost
• Low encoding cost → good compression → good clustering
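A quick numeric illustration of why homogeneous blocks compress well (simple entropy arithmetic, not from the slides):

```python
import numpy as np

def H(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

# A homogeneous 4x4 block (all ones): p = 1, so H(1) = 0 -> ~0 bits of code cost.
print(16 * H(1.0))   # 0.0
# A maximally mixed 4x4 block (8 ones): p = 0.5, so H(0.5) = 1 bit per cell
# -> 16 bits, i.e. no compression at all. Homogeneous blocks compress well.
print(16 * H(0.5))   # 16.0
```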
