Lecture 4: Modules/communities in networks
What is a module?
• Nodes in a given module (or community group or a functional unit) tend to connect with other nodes in the same module
• Biology: proteins of the same function (e.g. DNA repair) or sub-cellular localization (e.g. nucleus)
• WWW – websites on a common topic (e.g. physics) or organization (e.g. EPFL)
• Internet – Autonomous Systems/routers by geography (e.g. Switzerland) or domain (e.g. educational or military)
Hierarchical clustering (see the sketch below)
• calculate a “similarity weight” W_ij for all pairs of vertices (e.g. the # of independent paths i → j)
• start with all n vertices disconnected
• add edges between pairs one by one in order of decreasing weight
• result: nested components, where one can take a “slice” at any level of the tree
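A minimal sketch of this procedure in Python, assuming an undirected networkx graph; the similarity weight here is taken to be the number of common neighbors (a simple stand-in for the "# of independent paths" mentioned above), and the union-find bookkeeping is an implementation choice, not part of the slide.

```python
# Hierarchical clustering by similarity weight: a sketch, not the lecture's exact code.
import itertools
import networkx as nx

def hierarchical_clustering(G):
    # similarity weight W_ij for every pair of vertices (common-neighbor count)
    weights = {}
    for i, j in itertools.combinations(G.nodes(), 2):
        w = len(set(G[i]) & set(G[j]))
        if w > 0:
            weights[(i, j)] = w

    # start with all vertices disconnected; add edges in order of decreasing weight
    parent = {v: v for v in G.nodes()}
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    merges = []                                # nested components; a 'slice' = stop at some weight
    for (i, j), w in sorted(weights.items(), key=lambda kv: -kv[1]):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri
            merges.append((w, i, j))
    return merges

if __name__ == "__main__":
    G = nx.karate_club_graph()
    for w, i, j in hierarchical_clustering(G)[:5]:
        print(f"merge {i} and {j} at weight {w}")
```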
Girvan & Newman (2002): betweenness clustering
• Betweenness of an edge i–j is the # of shortest paths going through this edge
• Algorithm (see the sketch below):
  • compute the betweenness of all edges
  • remove the edge with the highest betweenness
  • recalculate betweenness and repeat
• Caveats:
  • betweenness needs to be recalculated at each step
  • very expensive: all-pairs shortest paths – O(N³)
  • may need to repeat up to N times
  • does not scale to more than a few hundred nodes, even with the fastest algorithms
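A minimal sketch of the algorithm using networkx; the stopping rule (a target number of components) is an assumption added for illustration.

```python
# Girvan-Newman betweenness clustering: illustrative sketch, assuming an undirected graph G.
import networkx as nx

def girvan_newman_sketch(G, target_components=2):
    G = G.copy()
    while nx.number_connected_components(G) < target_components and G.number_of_edges():
        # recompute edge betweenness after every removal (the expensive step)
        betweenness = nx.edge_betweenness_centrality(G)
        worst_edge = max(betweenness, key=betweenness.get)   # highest betweenness
        G.remove_edge(*worst_edge)
    return list(nx.connected_components(G))

if __name__ == "__main__":
    for community in girvan_newman_sketch(nx.karate_club_graph(), target_components=2):
        print(sorted(community))
```

networkx also ships a ready-made `girvan_newman` generator in `networkx.algorithms.community`, which can be used instead of this hand-rolled loop.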
Using random walks/diffusion to discover modules in networks
K. Eriksen, I. Simonsen, S. Maslov, K. Sneppen, PRL 90, 148701 (2003)
Why diffusion?
• Any dynamical process equilibrates faster within modules and slower between modules
• Thus its slow modes reveal modules
• Diffusion is the simplest dynamical process (people also use others, like the Ising/Potts model, etc.)
Random walkers on a network
• Study the behavior of many VIRTUAL random walkers on a network
• At each time step, each random walker steps onto a randomly selected neighbor
• They equilibrate to a steady state n_i ~ k_i (solid-state physics: n_i = const)
• Slow modes of equilibration to the steady state allow one to detect modules in a network (see the simulation sketch below)
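A minimal simulation sketch, assuming an undirected networkx graph without isolated nodes: it checks that the steady-state walker occupation n_i grows roughly linearly with the degree k_i.

```python
# Many virtual random walkers equilibrating to n_i ~ k_i: an illustrative sketch.
import random
from collections import Counter
import networkx as nx

def simulate_walkers(G, n_walkers=10_000, n_steps=200, seed=0):
    rng = random.Random(seed)
    nodes = list(G.nodes())
    positions = [rng.choice(nodes) for _ in range(n_walkers)]
    for _ in range(n_steps):
        # each walker steps onto a randomly selected neighbor
        positions = [rng.choice(list(G[v])) for v in positions]
    return Counter(positions)

if __name__ == "__main__":
    G = nx.karate_club_graph()
    occupation = simulate_walkers(G)
    for v in sorted(G, key=G.degree, reverse=True)[:5]:
        print(f"node {v}: degree {G.degree(v)}, walkers {occupation[v]}")
    # walker counts scale roughly linearly with the degree, as stated above
```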
Similarity transformation
• The random-walk transition matrix T_ij = A_ij/K_j (A – adjacency matrix) is asymmetric
• It could in principle have complex eigenvalues/eigenvectors
• Luckily, the symmetric matrix S_ij = A_ij/√(K_i K_j) has the same eigenvalues, with eigenvectors v_i/√K_i
• This is known as a similarity transformation (see the derivation below)
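A one-line check of the similarity transformation, written out under the definitions above (T = A D⁻¹ with D = diag(K_i), S = D^(−1/2) A D^(−1/2)):

```latex
S = D^{-1/2} A D^{-1/2} = D^{-1/2}\left(A D^{-1}\right)D^{1/2} = D^{-1/2}\, T\, D^{1/2},
\qquad
T v = \lambda v \;\Longrightarrow\; S\left(D^{-1/2} v\right) = \lambda\left(D^{-1/2} v\right),
```

so S shares the eigenvalues of T, and its eigenvectors are v_i/√K_i.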
Density of states ρ(λ)
[Figure: filled circles – real AS network; empty squares – degree-preserving randomized version]
Participation ratio: PR(λ) = 1/Σ_i (v(λ)_i)^4
[Figure: participation ratio (0 to ~250) vs. eigenvalue λ over −1 ≤ λ ≤ 1]
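A short numerical sketch tying the last few slides together, assuming an undirected networkx graph: diagonalize the symmetrized matrix S and compute the participation ratio of each eigenvector.

```python
# Spectrum of S_ij = A_ij / sqrt(K_i K_j) and participation ratios: illustrative sketch.
import numpy as np
import networkx as nx

def spectrum_and_pr(G):
    A = nx.to_numpy_array(G)
    k = A.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(k))
    S = d_inv_sqrt @ A @ d_inv_sqrt            # same eigenvalues as T = A D^-1
    eigvals, eigvecs = np.linalg.eigh(S)       # eigh: S is symmetric
    pr = 1.0 / (eigvecs ** 4).sum(axis=0)      # PR(lambda) for normalized eigenvectors
    return eigvals, pr

if __name__ == "__main__":
    eigvals, pr = spectrum_and_pr(nx.karate_club_graph())
    # the slow modes are the eigenvalues closest to +1 (the trivial one is lambda = 1)
    for lam, p in sorted(zip(eigvals, pr), reverse=True)[:4]:
        print(f"lambda = {lam:+.3f}, participation ratio = {p:.1f}")
```

A small participation ratio means the eigenvector is localized on a few nodes, i.e. on a module.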
Slowest eigenvectors of the AS network localize on communities (slide labels: Russia, US Military):

Eigenvector | λ | Countries of the ASes with the largest components
2 | 0.9626 | RU RU RU RU CA RU RU / ?? ?? US US US US ?? (US Department of Defense)
3 | 0.9561 | ?? FR FR FR ?? FR ?? / RU RU RU ?? ?? RU ??
4 | 0.9523 | US ?? US ?? ?? ?? ?? (US Navy) / NZ NZ NZ NZ NZ NZ NZ
5 | 0.9474 | KR KR KR KR KR ?? KR / UA UA UA UA UA UA UA
Using random walks/diffusion to rank information networks – e.g. Google’s PageRank, which made it a $160-billion company
Information networks
• 3×10^5 Phys Rev articles connected by 3×10^6 citation links
• 10^10 webpages in the world
• To find relevant information one needs to search and rank efficiently!
Ranking webpages
• Assign an “importance factor” G_i to every webpage
• Given a keyword (say “jaguar”), find all pages containing it and display them in order of descending G_i
• One solution, still used in scientific publishing, is G_i = K_in(i) (the number of incoming links), but:
  • Too democratic: it doesn’t take into account the importance of the nodes sending the links
  • Easy to trick and artificially boost the ranking (for the WWW)
How Google works
• Google’s recipe (circa 1998) is to simulate the behavior of many virtual “random surfers”
• PageRank: G_i ~ the number of virtual hits the page gets; it is also ~ the steady-state number of random surfers at a given page
• Popular pages send more surfers your way → PageRank ~ K_in weighted by the popularity of the webpages sending each hyperlink
• Surfers get bored following links: with probability α = 0.15 a surfer jumps to a randomly selected page (not following any hyperlinks); see the power-iteration sketch below
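A minimal power-iteration sketch of this recipe, assuming a directed networkx graph of hyperlinks; this is an illustration of the idea, not Google's production algorithm.

```python
# PageRank via power iteration; alpha is the "boredom" (random-jump) probability.
import numpy as np
import networkx as nx

def pagerank_sketch(G, alpha=0.15, n_iter=100):
    nodes = list(G.nodes())
    n = len(nodes)
    idx = {v: i for i, v in enumerate(nodes)}
    g = np.ones(n) / n                          # start surfers uniformly
    for _ in range(n_iter):
        new_g = np.full(n, alpha / n)           # random jumps land anywhere
        for v in nodes:
            out = list(G.successors(v))
            if out:                             # follow a randomly chosen outgoing link
                share = (1 - alpha) * g[idx[v]] / len(out)
                for w in out:
                    new_g[idx[w]] += share
            else:                               # dangling page: jump anywhere
                new_g += (1 - alpha) * g[idx[v]] / n
        g = new_g
    return dict(zip(nodes, g * n))              # normalize so the average G_i = 1

if __name__ == "__main__":
    G = nx.DiGraph([(1, 2), (2, 3), (3, 1), (4, 3), (5, 3)])
    for page, rank in sorted(pagerank_sketch(G).items(), key=lambda kv: -kv[1]):
        print(f"page {page}: G = {rank:.2f}")
```

Ranks are normalized so that the average ⟨G⟩ = 1, matching the convention G_w ≈ 1 used in the community analysis below. networkx's built-in `nx.pagerank(G, alpha=0.85)` computes the same quantity, but its `alpha` is the link-following probability, i.e. 1 − α in the slide's notation.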
How communities in the WWW influence Google ranking
H. Xie, K.-K. Yan, S. Maslov, cond-mat/0409087; physics/0510107; Physica A 373 (2007) 831–836
How do WWW communities influence their average G_i?
• Pages in a web community preferentially link to each other. Examples:
  • Pages from the same organization (e.g. EPFL)
  • Pages devoted to a common topic (e.g. physics)
  • Pages in the same geographical location (e.g. Switzerland)
• Naïve argument: communities “trap” random surfers, making them spend more time inside → they should increase the average Google ranking of the community
Test of the naïve argument
The naïve argument is wrong: the effect could go either way.
[Figure: log10(⟨G⟩_c) vs. the # of intra-community links for Community #1 and Community #2]
[Diagram: community C and outside world W, with intra-community edges E_cc and intra-world edges E_ww]
• G_c – average Google rank of pages in the community; G_w ≈ 1 – in the outside world
• Current from C to W: E_cw · G_c/⟨K_out⟩_c
• It must be equal to the current from W to C: E_wc · G_w/⟨K_out⟩_w
• Thus G_c depends on the ratio between E_cw and E_wc – the numbers of edges (hyperlinks) between the community and the world
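Solving the current balance above for G_c (the α → 0 limit; a short step the slide leaves implicit):

```latex
\frac{E_{cw}\,G_c}{\langle K_{out}\rangle_c} = \frac{E_{wc}\,G_w}{\langle K_{out}\rangle_w}
\quad\Longrightarrow\quad
\frac{G_c}{G_w} = \frac{E_{wc}}{E_{cw}}\cdot\frac{\langle K_{out}\rangle_c}{\langle K_{out}\rangle_w}.
```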
Balancing currents for nonzero α
• J_cw = (1−α) E_cw G_c/⟨K_out⟩_c + α G_c N_c – current from C to W
• It must be equal to J_wc = (1−α) E_wc G_w/⟨K_out⟩_w + α G_w N_w (N_c/N_w) – current from W to C
• Solving this balance for G_c is shown below
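Setting the two currents equal and solving for G_c (using α G_w N_w (N_c/N_w) = α G_w N_c):

```latex
(1-\alpha)\,\frac{E_{cw}\,G_c}{\langle K_{out}\rangle_c} + \alpha\,G_c\,N_c
 \;=\; (1-\alpha)\,\frac{E_{wc}\,G_w}{\langle K_{out}\rangle_w} + \alpha\,G_w\,N_c
\;\;\Longrightarrow\;\;
G_c \;=\; G_w\,
\frac{(1-\alpha)\,E_{wc}/\langle K_{out}\rangle_w + \alpha\,N_c}
     {(1-\alpha)\,E_{cw}/\langle K_{out}\rangle_c + \alpha\,N_c}.
```

For α → 0 this reduces to the simple ratio on the previous slide; when both hyperlink currents are small compared to α N_c, the fraction tends to 1 and G_c ≈ G_w, consistent with the consequences listed next.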
What are the consequences?
• For very isolated communities (E_cw/E^(r)_cw < α and E_wc/E^(r)_wc < α, where E^(r) is the corresponding number of edges in a randomized network) one has G_c = 1: their Google rank is decoupled from the outside world!
• Overall range: α < G_c < 1/α
WWW – the empirical data
• We have data for ~10 US universities (+ all UK and Australian universities)
• Looked closely at UCLA and Long Island University (LIU)
• UCLA has different departments; LIU has 4 campuses
[Figure: for α = 0.001, some pages acquire abnormally high PageRank]
Top PageRank LIU websites for α = 0.001 don’t make sense:
• #1 www.cwpost.liu.edu/cwis/cwp/edu/edleader/higher_ed/hear.html
• #5 …/higher_ed/index.html
• #9 …/higher_ed/courses.html
[Diagram: strongly connected component vs. the outside world]
What about citation networks?
• Better to use α = 0.5 instead of α = 0.15: people don’t click through papers as easily as through webpages
• Time arrow: papers only cite older papers, so small values of α give older papers an unfair advantage
• New algorithm: CiteRank (as in PageRank) – random walkers start from recent papers with probability ~ exp(−t/τ_d); a sketch follows below
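A hedged sketch of a CiteRank-style ranking, built on networkx's personalized PageRank. It assumes a directed citation graph (edge u → v means "u cites v") with a `year` attribute on every node; the values of `tau_d` and `alpha` are illustrative, not the paper's fitted parameters.

```python
# CiteRank-like ranking: random walkers start preferentially from recent papers.
import math
import networkx as nx

def citerank_sketch(citations, current_year=2007, tau_d=3.0, alpha=0.5):
    # teleport probability ~ exp(-age / tau_d)
    weights = {p: math.exp(-(current_year - d["year"]) / tau_d)
               for p, d in citations.nodes(data=True)}
    total = sum(weights.values())
    personalization = {p: w / total for p, w in weights.items()}
    # networkx's alpha is the link-following probability, i.e. 1 - alpha in the slide's notation
    return nx.pagerank(citations, alpha=1 - alpha, personalization=personalization)

if __name__ == "__main__":
    G = nx.DiGraph()
    G.add_nodes_from([("A", {"year": 2001}), ("B", {"year": 2004}),
                      ("C", {"year": 2006}), ("D", {"year": 2007})])
    G.add_edges_from([("B", "A"), ("C", "A"), ("C", "B"), ("D", "C")])
    for paper, score in sorted(citerank_sketch(G).items(), key=lambda kv: -kv[1]):
        print(paper, round(score, 3))
```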
Summary
• Diffusion and modules (communities) in a network affect each other
• In the “hardware” part of the Internet (Autonomous Systems or routers), diffusion allows one to detect modules
• In the “software” part:
  • a diffusion-like process is used for ranking (Google’s PageRank)
  • WWW communities affect this ranking in a non-trivial way
Part 2: Opinion networks
“Extracting Hidden Information from Knowledge Networks”, S. Maslov and Y.-C. Zhang, Phys. Rev. Lett. (2001)
“Exploring an opinion network for taste prediction: an empirical study”, M. Blattner, Y.-C. Zhang, and S. Maslov, in preparation
Predicting customers’ tastes from their opinions on products
• Each of us has personal tastes
• Information about them is contained in our opinions on products
• Matchmaking: opinions of customers with tastes similar to mine can be used to forecast my opinions on products I have not yet tried
• The Internet allows one to do this on a large scale (see amazon.com and many others)
Opinion networks
[Diagram: two bipartite examples – customers connected to movies by opinions, and webpages connected to other webpages by hyperlinks (WWW)]
Storing opinions
[Diagram: the same opinions stored either as a matrix Ω_IJ or as a network of opinions with rated customer–movie edges]
Using correlations to reconstruct customers’ tastes
• Similar opinions → similar tastes
• Simplest model (see the sketch below):
  • Movie-goers → M-dimensional vector of tastes T_I
  • Movies → M-dimensional vector of features F_J
  • Opinions → scalar product: Ω_IJ = T_I · F_J
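A minimal numerical sketch of this model: hidden M-dimensional taste vectors T_I and feature vectors F_J generate opinions Ω_IJ = T_I · F_J, and a naive rank-M SVD of the partially observed matrix is used to guess the unknown opinions. The reconstruction step is a simplified illustration, not the paper's exact procedure.

```python
# Taste/feature model and a naive low-rank reconstruction of unknown opinions.
import numpy as np

rng = np.random.default_rng(0)
M, n_customers, n_movies = 3, 200, 100

T = rng.normal(size=(n_customers, M))        # hidden tastes
F = rng.normal(size=(n_movies, M))           # hidden features
Omega = T @ F.T                              # full matrix of opinions

# keep only a fraction p of the opinions as "known"
p = 0.3
known = rng.random(Omega.shape) < p
observed = np.where(known, Omega, 0.0)

# rank-M SVD reconstruction from the observed entries (rescaled by 1/p)
U, s, Vt = np.linalg.svd(observed / p, full_matrices=False)
predicted = (U[:, :M] * s[:M]) @ Vt[:M]

err = np.abs(predicted - Omega)[~known].mean()
print(f"mean error on unknown opinions: {err:.2f} (opinions have std {Omega.std():.2f})")
```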
Loop correlation
• Predictive power ~ 1/M^((L−1)/2)
• One needs many loops to reliably reconstruct unknown opinions
• Example: for a loop closed through L = 5 known opinions, the predictive power for the unknown opinion is 1/M²
Main parameter: density of edges
• The larger the density of edges p, the easier the prediction
• At p_1 ~ 1/N (N = N_customers + N_movies) macroscopic prediction becomes possible: nodes are connected, but the vectors T_I and F_J are not yet fixed – the ordinary percolation threshold
• At p_2 ~ 2M/N > p_1 all tastes and features (T_I and F_J) can be uniquely reconstructed – the rigidity percolation threshold
Real empirical data (EachMovie dataset) on opinions of customers on movies: 5-star ratings of 1,600 movies by 73,000 users – 1.6 million opinions!
Spectral properties of Ω
• For M < N the matrix Ω_IJ has N−M zero eigenvalues and M positive ones: Ω = R R†
• Using SVD one can “diagonalize” R = U D V† such that the matrices V and U are orthogonal (V†V = 1, U U† = 1) and D is diagonal; then Ω = U D² U†
• The amount of information contained in Ω: NM − M(M−1)/2 ≪ N(N−1)/2 – the # of off-diagonal elements (see the numerical check below)
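A numerical check of these statements, under the assumption Ω = R Rᵀ with R an N×M matrix of real vectors:

```python
# Rank, SVD decomposition, and information count for Omega = R R^T.
import numpy as np

rng = np.random.default_rng(1)
N, M = 50, 4

R = rng.normal(size=(N, M))
Omega = R @ R.T

eigvals = np.linalg.eigvalsh(Omega)
print("positive eigenvalues:", int(np.sum(eigvals > 1e-9)))          # -> M
print("zero eigenvalues:    ", int(np.sum(np.abs(eigvals) <= 1e-9)))  # -> N - M

U, d, Vt = np.linalg.svd(R, full_matrices=False)                      # R = U D V^T
print("Omega == U D^2 U^T:  ", np.allclose(Omega, U @ np.diag(d**2) @ U.T))

# N*M - M*(M-1)/2 numbers determine all N*(N-1)/2 off-diagonal elements
print("independent numbers: ", N*M - M*(M-1)//2, " vs off-diagonal elements:", N*(N-1)//2)
```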