Kronecker Graphs
The Kronecker Graph Model (rmat) • Start with a parameter matrix A • For n vertices, take Kronecker products • Normalize the entries
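A minimal sketch of this construction; the 2x2 parameter values and the choice of k are illustrative, not the deck's:

```python
import numpy as np

def kronecker_parameter_matrix(A, k):
    """Take the k-fold Kronecker product of the parameter matrix A,
    then normalize so the entries sum to 1."""
    P = A.astype(float)
    for _ in range(k - 1):
        P = np.kron(P, A)        # each product multiplies the dimension by A's size
    return P / P.sum()           # normalize the entries to a distribution over edges

# Illustrative 2x2 parameter matrix: gives a graph on 2**k vertices.
A = np.array([[0.9, 0.5],
              [0.5, 0.1]])
P = kronecker_parameter_matrix(A, k=3)   # 8x8 matrix of normalized entries
```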
Generating Edges • One Method • Calculate the whole Kronecker matrix • Sample each edge independently according to its entry • Another Method • Treat parameters as probabilities • Flip coins for each edge
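Sketches of both methods, using the same illustrative parameter matrix as above; the expected-edge scaling and the quadrant-descent details are my assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_edges_dense(P, expected_edges):
    """Method 1: materialize the full Kronecker matrix P (entries summing
    to 1) and flip one independent coin per entry."""
    probs = np.clip(P * expected_edges, 0.0, 1.0)
    return np.argwhere(rng.random(P.shape) < probs)

def sample_edge_recursive(A, k):
    """Method 2 (R-MAT style): never build the big matrix.  For one edge,
    descend k levels, choosing a block at each level with probability
    proportional to the corresponding entry of A."""
    probs = (A / A.sum()).ravel()
    rows, cols = A.shape
    r = c = 0
    for _ in range(k):
        q = rng.choice(len(probs), p=probs)
        r = r * rows + q // cols
        c = c * cols + q % cols
    return r, c

A = np.array([[0.9, 0.5], [0.5, 0.1]])
edges = [sample_edge_recursive(A, k=10) for _ in range(5000)]   # graph on 2**10 vertices
```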
Features • Pro • Fast to generate: parallel and distributed • Few parameters to fit • Self-similarity • Con • Doesn't have a power-law degree distribution • Parameters aren't intuitive • May not be connected • Used in Graph500 benchmark [Seshadhri, Kolda, Pinar]
Variance of Real Graphs [Moreno, Kirschner, Neville, Vishwanathan]
Web Search • Information Retrieval: Given a query "Hugh Laurie", find all documents that mention those words
Web Ranking Before 1998 • Use tf-idf (roughly) • Term frequency – inverse document frequency: score(t, d) ≈ (# of occurrences of t in d) / (# of occurrences of t in D, the corpus)
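A small illustration of tf-idf scoring; it uses the common tf × log-idf form rather than the rough ratio above, and the query and documents are made up:

```python
import math
from collections import Counter

def tfidf_scores(query_terms, docs):
    """Score each document as sum over query terms of tf(t, d) * idf(t),
    with idf(t) = log(#docs / #docs containing t)."""
    counts = [Counter(d.lower().split()) for d in docs]
    n = len(docs)
    idf = {t: math.log(n / max(1, sum(t in c for c in counts))) for t in query_terms}
    return [sum(c[t] * idf[t] for t in query_terms) for c in counts]

docs = ["hugh laurie plays house",
        "laurie is a common surname",
        "an unrelated page about graphs"]
print(tfidf_scores(["hugh", "laurie"], docs))
```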
Results • It was bad • The best documents for a topic may not mention the topic's terms explicitly very often
What are we missing? • Traditional IR only has the text to work with • We have an information network • The hyperlinks are created by intelligent, rational beings!
1998 – HITS (J. Kleinberg) • What if we ranked documents by in-links? The power law distribution on in-degree will get us every time.
HITS • Idea: Different pages and different links play different roles • Some pages are AUTHORITIES • Some pages are HUBS
Hubs • What is a good hub? A page is a good hub if it points to many authorities.
Authorities • What is a good authority? A page is a good authority if many hub pages point to it. How can we find good hubs and good authorities?
HITS • Everyone starts with a hub-score of 1 and authority-score of 1 • A-update: For each page p, auth(p) is the sum of the hub-scores of pages that point to p. • H-update: For each page p, hub(p) is the sum of the auth-scores of pages p points to.
Formally • M is the adjacency matrix, h the hub-scores and a the auth-scores: a ← Mᵀh, h ← Ma • How many iterations should we do? • Calculated on the subgraph that corresponds to the query at hand
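A direct sketch of these two updates in matrix form; the normalization step, iteration count, and toy graph are my choices:

```python
import numpy as np

def hits(M, iters=50):
    """HITS on the query subgraph: M[i, j] = 1 if page i links to page j.
    Alternate the A-update (a = M^T h) and H-update (h = M a),
    normalizing after each round so the scores stay bounded."""
    n = M.shape[0]
    h = np.ones(n)
    a = np.ones(n)
    for _ in range(iters):
        a = M.T @ h
        h = M @ a
        a /= np.linalg.norm(a)
        h /= np.linalg.norm(h)
    return h, a

# Tiny example: page 0 links to 1 and 2, page 3 links to 1.
M = np.array([[0, 1, 1, 0],
              [0, 0, 0, 0],
              [0, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)
hubs, auths = hits(M)
```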
Where does HITS fail? • Assumes a bipartite clique structure to the web • Doesn’t allow more general forms of endorsement
PageRank – try 1 • Instead of h and a scores, just one score. • PR-update(p) = sum over pages q that point to p of PR(q) / outdeg(q), i.e., each page divides its score evenly among its out-links
Where does this fail? Hint: The web graph is directed.
Actual PageRank • Make the graph strongly connected by adding epsilon weight links between all pages. • Let A be the normalized adjacency matrix
Calculating with the Power Method • Start with x = (1/n, …, 1/n) • Calculate Ax • Add ε/n to every entry • Normalize and repeat • Repeat this O((1/ε) log n) times
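A sketch of this power iteration, taking A to act so that each page splits its score evenly over its out-links; eps, the iteration count, the crude dead-end handling, and the toy graph are my choices:

```python
import numpy as np

def pagerank(A, eps=0.15, iters=100):
    """Power method for PageRank.  A is the adjacency matrix with
    A[i, j] = 1 when page i links to page j; eps is the teleport weight."""
    n = A.shape[0]
    out_deg = A.sum(axis=1)
    out_deg[out_deg == 0] = 1                  # crude dead-end handling
    W = A / out_deg[:, None]                   # each page splits its score over its out-links
    x = np.full(n, 1.0 / n)                    # start uniform
    for _ in range(iters):
        x = (1 - eps) * (W.T @ x) + eps / n    # follow links, then teleport
        x /= x.sum()                           # normalize
    return x

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [0, 1, 0]], dtype=float)
print(pagerank(A))
```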
The Random Surfer Model • What natural process can justify PageRank? • How can we model how people might use the web?
The Random Surfer • Starts at some page on the web • With probability (1-ε), selects a random link on the page and follows it • With probability ε, gets bored and jumps to some new random web page.
The Random Surfer • The PageRank vector is the probability that you will visit each website in this process
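A quick simulation of the surfer on a made-up three-page web; the visit frequencies it returns should approach the PageRank vector:

```python
import random
from collections import Counter

def random_surfer(links, eps=0.15, steps=200_000):
    """Simulate the random surfer: follow a random out-link with
    probability 1 - eps, otherwise jump to a uniformly random page.
    Visit frequencies approximate the PageRank vector."""
    pages = list(links)
    page = random.choice(pages)
    visits = Counter()
    for _ in range(steps):
        if links[page] and random.random() > eps:
            page = random.choice(links[page])    # follow a link
        else:
            page = random.choice(pages)          # get bored, jump anywhere
        visits[page] += 1
    return {p: visits[p] / steps for p in pages}

links = {"a": ["b", "c"], "b": ["a"], "c": ["b"]}
print(random_surfer(links))
```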
Random Walks on Graphs [figure: a small graph with transition probabilities such as 1/3, 1/2, 1 labeling its edges]
Stationary Distributions • What does this process converge to? • Connection between eigenvectors and stationary distributions. Why is the top eigenvalue always 1?
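One standard argument for the eigenvalue question, sketched here for a row-stochastic transition matrix P:

```latex
% Each row of P sums to 1, so the all-ones vector is a right eigenvector:
P\,\mathbf{1} = \mathbf{1} \quad\Rightarrow\quad \lambda = 1 \text{ is an eigenvalue.}
% No eigenvalue is larger: if Px = \lambda x and u maximizes |x_u|, then
|\lambda|\,|x_u| = \Big|\sum_v P_{uv} x_v\Big| \le \sum_v P_{uv}\,|x_v| \le |x_u|
\quad\Rightarrow\quad |\lambda| \le 1.
% A stationary distribution is a left eigenvector for this same top eigenvalue: \pi P = \pi.
```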
Mixing Time • How long does it take to converge? • Why does PageRank converge in O((1/ε) log n) iterations?
Undirected Graphs • The stationary distribution is proportional to the degree
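A one-line check of this claim, with P the simple-random-walk matrix and m the number of edges:

```latex
\pi(v) = \frac{\deg(v)}{2m}
\quad\Longrightarrow\quad
(\pi P)(v) = \sum_{u \sim v} \pi(u)\,P_{uv}
           = \sum_{u \sim v} \frac{\deg(u)}{2m}\cdot\frac{1}{\deg(u)}
           = \frac{\deg(v)}{2m} = \pi(v),
% so \pi is stationary; its entries sum to 1 because \sum_v \deg(v) = 2m.
```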
Personalized PageRank • What if the surfer didn't jump uniformly at random, but instead jumped according to a distribution s? • s can be any distribution over the pages
Uses of Personalized PageRank • Creating personalized search results • Topic-sensitive PageRank • Local community detection • Can you compute it more efficiently than PageRank?
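Compared with the earlier power-method sketch, only the teleport target changes; s and the toy graph below are illustrative:

```python
import numpy as np

def personalized_pagerank(A, s, eps=0.15, iters=100):
    """Same power iteration as PageRank, but the teleport step jumps
    according to the personalization distribution s instead of uniformly."""
    out_deg = A.sum(axis=1)
    out_deg[out_deg == 0] = 1
    W = A / out_deg[:, None]
    s = np.array(s, dtype=float)
    s /= s.sum()
    x = s.copy()
    for _ in range(iters):
        x = (1 - eps) * (W.T @ x) + eps * s    # teleport to s, not to 1/n
        x /= x.sum()
    return x

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [0, 1, 0]], dtype=float)
print(personalized_pagerank(A, s=[1, 0, 0]))   # "always jump back to page 0"
```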
The Intentional Surfer • Click data is collected by • Google/Bing Toolbar • Cookies from ad websites • Can use this to get better estimates for click-through rates of each link • Modifies our transition probabilities to improve PageRank
Search Engine Optimization • Designing your page with the ranking function in mind • Co-evolves with search engines • Obvious Tricks • Make a collection of websites to point to you • Buy old webpages • Include text in background color font • Paying others to link to you
Link spam detection [figure: a cluster labeled "Spam" attached to the rest of the web graph]
Connection to HITS • If you link to a lot of spam sites, you are probably also spam. (Hub) • If you are linked to by lots of spam sites, you are probably why that spam collection was built. (Authority) • Start with seed sites with Hub, Authority scores of 1.
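A sketch of that seeded, HITS-style propagation; the seed scoring follows the slide, while the iteration count and rescaling are my choices (run to convergence it would reduce to plain HITS, so only a few rounds are used):

```python
import numpy as np

def spam_hits(M, seed_spam, iters=5):
    """M[i, j] = 1 if page i links to page j.  Seed known spam pages with
    hub and authority score 1 (all others 0), then iterate:
    spam-authority = pointed to by spam hubs, spam-hub = points to spam."""
    n = M.shape[0]
    h = np.zeros(n); h[seed_spam] = 1.0
    a = np.zeros(n); a[seed_spam] = 1.0
    for _ in range(iters):
        a = M.T @ h
        h = M @ a
        a /= max(a.max(), 1e-12)     # rescale so scores stay bounded
        h /= max(h.max(), 1e-12)
    return h, a

# Pages 2 and 3 are known spam (hypothetical example).
M = np.array([[0, 0, 1, 1],
              [0, 0, 0, 1],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
hubs, auths = spam_hits(M, seed_spam=[2, 3])
```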
Trust Propagation • Given some information (i trusts j) or (i does not trust j), how can we model trust in a network?
Types of Trust Propagation • Direct Propagation • Transpose Propagation • Co-citation • Trust Coupling [diagrams: small node patterns over i, j, k, m illustrating each propagation type]
Distrust Propagation • Trust Only • 1-Step Distrust • Propagated Distrust
Propagating Trust and Distrust • Eigenvalue Propagation • Weighted Linear Combination • How do you round this matrix to give trust/distrust?
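A sketch in the spirit of this framework; the combination weights, discount factor, and the choice of B for each distrust variant are illustrative assumptions (the 1-step-distrust variant is omitted):

```python
import numpy as np

def propagate(T, D, alphas=(0.4, 0.4, 0.1, 0.1), k=3, gamma=0.9,
              distrust="propagated", scheme="weighted"):
    """Trust/distrust propagation sketch.
    B is the belief matrix: trust edges alone ("trust_only") or trust
    minus distrust ("propagated").  One atomic step combines direct
    propagation (B), co-citation (B^T B), transpose propagation (B^T),
    and trust coupling (B B^T).  The combined step is iterated either by
    taking a single power ("eigenvalue" propagation) or a discounted sum
    of powers ("weighted" linear combination)."""
    B = T if distrust == "trust_only" else T - D
    a1, a2, a3, a4 = alphas
    C = a1 * B + a2 * (B.T @ B) + a3 * B.T + a4 * (B @ B.T)
    if scheme == "eigenvalue":
        return np.linalg.matrix_power(C, k)
    return sum((gamma ** i) * np.linalg.matrix_power(C, i) for i in range(1, k + 1))

# Tiny hypothetical example: 0 trusts 1, 1 trusts 2, 0 distrusts 3.
T = np.zeros((4, 4)); T[0, 1] = T[1, 2] = 1.0
D = np.zeros((4, 4)); D[0, 3] = 1.0
P = propagate(T, D)
# Thresholding P (the "rounding" question on the slide) yields trust/distrust labels.
```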
Experiments • Epinions ‘web-of-trust’ • 841,372 edges labeled + or -. Try all combinations of trust and distrust propagation. What is the best model?
Project Proposals • Email by 9/26 to: isabelle@eecs.berkeley.edu and anirban.dasgupta+cs294@gmail.com