Search Engine Technology (11). Prof. Dragomir R. Radev radev@cs.columbia.edu. SET Fall 2013. … 17. continued …. [Slide from Reka Albert]. [Slide from Reka Albert]. The strength of weak ties. Granovetter’s study: finding jobs
Search Engine Technology (11) Prof. Dragomir R. Radev radev@cs.columbia.edu
SET Fall 2013 … 17. continued …
The strength of weak ties • Granovetter’s study: finding jobs • Weak ties: more people can be reached through weak ties than strong ties (e.g., through your 7th and 8th best friends) • More here: http://en.wikipedia.org/wiki/Weak_tie
Prestige and centrality • Degree centrality: how many neighbors each node has. • Closeness centrality: how close a node is to all of the other nodes • Betweenness centrality: based on the role that a node plays by virtue of being on the path between two other nodes • Eigenvector centrality: the paths in the random walk are weighted by the centrality of the nodes that the path connects. • Prestige = same as centrality but for directed graphs.
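As a rough illustration of the first two definitions above, here is a minimal sketch of degree and closeness centrality on an undirected, connected graph. The toy graph, the function names, and the (n - 1) / (sum of distances) normalization for closeness are assumptions made for this example, not something the slides specify.

```python
from collections import deque

def degree_centrality(adj):
    """Degree centrality: the number of neighbors of each node."""
    return {node: len(neighbors) for node, neighbors in adj.items()}

def closeness_centrality(adj):
    """Closeness: (n - 1) / (sum of shortest-path distances to all other nodes).

    Assumes an undirected, connected graph given as an adjacency dict.
    """
    result = {}
    for source in adj:
        # Breadth-first search gives shortest-path lengths in an unweighted graph.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total = sum(dist.values())
        result[source] = (len(adj) - 1) / total if total > 0 else 0.0
    return result

# Hypothetical toy graph: the path a - b - c.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
degrees = degree_centrality(graph)
closeness = closeness_centrality(graph)
```

On this path graph the middle node b scores highest on both measures, which matches the intuition that it is the best connected and the closest to everyone else.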
SET Fall 2013 … 18. Graph-based methods Harmonic functions Random walks PageRank …
Random walks and harmonic functions • Drunkard’s walk: • Start at position x on a line with points 0, 1, 2, 3, 4, 5 • What is the probability of reaching 5 before reaching 0? • Harmonic functions: • P(0) = 0 • P(N) = 1 • P(x) = ½*P(x-1) + ½*P(x+1), for 0<x<N • (in general, replace ½ with the bias in the walk)
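The boundary conditions and the averaging rule above determine P completely. A minimal sketch, assuming the standard closed-form solution of this problem (often called gambler's ruin); the function name and the `p_right` parameter are illustrative names introduced here:

```python
def reach_n_before_0(x, n, p_right=0.5):
    """Probability that a walk started at x reaches n before reaching 0.

    p_right is the probability of stepping from x to x + 1 (the bias of the walk).
    """
    if p_right == 0.5:
        return x / n                      # unbiased walk: P is linear in x
    r = (1.0 - p_right) / p_right         # ratio of left-step to right-step probability
    return (1.0 - r ** x) / (1.0 - r ** n)

N = 5
P = [reach_n_before_0(x, N) for x in range(N + 1)]

# Check the harmonic property P(x) = 1/2 P(x-1) + 1/2 P(x+1) at interior points.
for x in range(1, N):
    assert abs(P[x] - (0.5 * P[x - 1] + 0.5 * P[x + 1])) < 1e-12

# Same check for a biased walk that steps right with probability 0.6.
Q = [reach_n_before_0(x, N, p_right=0.6) for x in range(N + 1)]
for x in range(1, N):
    assert abs(Q[x] - (0.4 * Q[x - 1] + 0.6 * Q[x + 1])) < 1e-12
```

For the unbiased walk with N = 5 this gives P(x) = x/5, so a walk starting at position 2 reaches 5 before 0 with probability 0.4.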
(**) The original Dirichlet problem • Distribution of temperature in a sheet of metal. • One end of the sheet has temperature t=0, the other end: t=1. • Laplace’s differential equation: ∇²u = 0 • This is a special (steady-state) case of the (transient) heat equation: ∂u/∂t = α∇²u • In general, the solutions to Laplace’s equation are called harmonic functions.
Learning harmonic functions • The method of relaxations • Discrete approximation. • Assign fixed values to the boundary points. • Assign arbitrary values to all other points. • Adjust their values to be the average of their neighbors. • Repeat until convergence. • Monte Carlo method • Perform a random walk on the discrete representation. • Compute f as the probability of a random walk ending in a particular fixed point. • Eigenvector methods • Look at the stationary distribution of a random walk
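The first two methods above can be sketched on the one-dimensional walk between 0 and N. This is a minimal illustration, not a tuned solver: the function names, the tolerance, the trial count, and the fixed random seed are all assumptions chosen for the example.

```python
import random

def relax(n, tol=1e-10, max_sweeps=100000):
    """Method of relaxations on the points 0..n with f(0) = 0 and f(n) = 1."""
    f = [0.0] * (n + 1)       # arbitrary starting values for interior points
    f[n] = 1.0                # fixed boundary value
    for _ in range(max_sweeps):
        biggest_change = 0.0
        for x in range(1, n):
            new_value = 0.5 * (f[x - 1] + f[x + 1])   # average of the neighbors
            biggest_change = max(biggest_change, abs(new_value - f[x]))
            f[x] = new_value
        if biggest_change < tol:                       # repeat until convergence
            break
    return f

def monte_carlo(start, n, trials=20000, seed=0):
    """Estimate f(start) as the fraction of random walks that end at n."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        x = start
        while 0 < x < n:
            x += 1 if rng.random() < 0.5 else -1
        wins += (x == n)
    return wins / trials

relaxed = relax(5)
estimate = monte_carlo(2, 5)
```

Both approaches should agree with the exact value f(x) = x/5; relaxation converges to it deterministically, while the Monte Carlo estimate fluctuates around it with an error that shrinks as the number of trials grows.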
Eigenvectors and eigenvalues • An eigenvector is an implicit “direction” for a matrix: Av = λv, where v (eigenvector) is non-zero, though λ (eigenvalue) can be any complex number in principle • Computing eigenvalues: det(A − λI) = 0
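Eigenvectors and eigenvalues • Example: A = [[−1, 3], [2, 0]] • det(A − λI) = (−1 − λ)*(−λ) − 3*2 = 0 • Then: λ + λ² − 6 = 0; λ1 = 2; λ2 = −3 • For λ1 = 2: (A − 2I)x = 0, i.e. −3*x1 + 3*x2 = 0 and 2*x1 − 2*x2 = 0 • Solutions: x1 = x2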
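The worked example can be checked numerically. The sketch below assumes the example matrix is A = [[-1, 3], [2, 0]], which is the matrix consistent with both the determinant expansion and the solution x1 = x2; the helper name is hypothetical and only real eigenvalues are handled.

```python
import math

def eigenvalues_2x2(a, b, c, d):
    """Real eigenvalues of [[a, b], [c, d]] from det(A - lambda*I) = 0,
    i.e. lambda^2 - (a + d)*lambda + (a*d - b*c) = 0."""
    trace = a + d
    det = a * d - b * c
    disc = trace * trace - 4 * det
    if disc < 0:
        raise ValueError("complex eigenvalues are not handled in this sketch")
    root = math.sqrt(disc)
    return (trace + root) / 2, (trace - root) / 2

# Assumed example matrix A = [[-1, 3], [2, 0]].
lam1, lam2 = eigenvalues_2x2(-1, 3, 2, 0)

# Any vector with x1 = x2 should satisfy A x = lam1 * x; try x = (1, 1).
x = (1.0, 1.0)
Ax = (-1 * x[0] + 3 * x[1], 2 * x[0] + 0 * x[1])
```

Here the characteristic polynomial is λ² + λ − 6, so the two roots are 2 and −3, and multiplying A by (1, 1) gives (2, 2), which confirms the eigenvector for λ1 = 2.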
Stochastic matrices • Stochastic matrices: each row (or column) adds up to 1 and no value is less than 0. Example: • The largest eigenvalue of a stochastic matrix E is real: λ1 = 1. • For λ1, the left (principal) eigenvector is p, the right eigenvector = 1 • In other words, E^T p = p.
Electrical networks and random walks • Ergodic (connected) Markov chain with transition matrix P • w=Pw • [Figure: circuit with nodes a, b, c, d and resistors of 1 Ω, 1 Ω, 1 Ω, 0.5 Ω, 0.5 Ω. From Doyle and Snell 2000]
Electrical networks and random walks • v_x is the probability that a random walk starting at x will reach a before reaching b. • The random walk interpretation allows us to use Monte Carlo methods to solve electrical circuits. • [Figure: circuit with nodes a, b, c, d, resistors of 1 Ω, 1 Ω, 1 Ω, 0.5 Ω, 0.5 Ω, and a 1 V source]
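To make the correspondence concrete, here is a minimal sketch that computes node voltages by relaxation, with each free node set to the conductance-weighted average of its neighbors. The circuit below is a hypothetical series circuit chosen so the answer is easy to verify by hand; it is not the circuit from the figure, whose wiring cannot be recovered from the slide text. The function and variable names are also assumptions.

```python
def node_voltages(conductance, fixed, tol=1e-12, max_sweeps=100000):
    """Solve for node voltages by relaxation.

    conductance: dict node -> dict neighbor -> conductance (1 / resistance).
    fixed: dict node -> clamped voltage (the boundary points).
    """
    v = {node: fixed.get(node, 0.0) for node in conductance}
    for _ in range(max_sweeps):
        biggest_change = 0.0
        for node, nbrs in conductance.items():
            if node in fixed:
                continue
            total = sum(nbrs.values())
            # A walk at `node` moves to a neighbor with probability
            # proportional to the conductance of the connecting edge.
            new_value = sum(c * v[nbr] for nbr, c in nbrs.items()) / total
            biggest_change = max(biggest_change, abs(new_value - v[node]))
            v[node] = new_value
        if biggest_change < tol:
            break
    return v

# Hypothetical series circuit: a --1 ohm-- x --0.5 ohm-- b, with 1 V at a and 0 V at b.
circuit = {
    "a": {"x": 1.0},
    "x": {"a": 1.0, "b": 2.0},
    "b": {"x": 2.0},
}
voltages = node_voltages(circuit, fixed={"a": 1.0, "b": 0.0})
```

In this toy circuit the voltage at x is 1/3, which is both the voltage-divider result 0.5 / (1 + 0.5) and the probability that a walk from x steps to a (conductance 1) rather than to b (conductance 2).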
Markov chains • A homogeneous Markov chain is defined by an initial distribution x and a Markov kernel E. • Path = sequence (x0, x1, …, xn). Xi = xi-1*E • The probability of a path can be computed as a product of probabilities for each step i. • Random walk = find Xj given x0, E, and j.
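Both computations above fit in a few lines. This is a minimal sketch assuming a row-stochastic kernel stored as a list of lists and a distribution stored as a row vector; the two-state kernel and all function names are made up for illustration.

```python
def path_probability(initial, kernel, path):
    """Probability of a path: initial probability times one factor per step."""
    prob = initial[path[0]]
    for prev, cur in zip(path, path[1:]):
        prob *= kernel[prev][cur]
    return prob

def step(dist, kernel):
    """One step of the chain: new distribution = dist * kernel (row vector times matrix)."""
    n = len(kernel)
    return [sum(dist[i] * kernel[i][j] for i in range(n)) for j in range(n)]

def distribution_after(initial, kernel, j):
    """Distribution over states after j steps, starting from `initial`."""
    dist = list(initial)
    for _ in range(j):
        dist = step(dist, kernel)
    return dist

# Hypothetical two-state kernel; each row adds up to 1.
E = [[0.9, 0.1],
     [0.5, 0.5]]
x0 = [1.0, 0.0]                                   # start in state 0 with certainty

p_path = path_probability(x0, E, [0, 0, 1, 1])    # 1.0 * 0.9 * 0.1 * 0.5
x2 = distribution_after(x0, E, 2)
```

Starting from state 0, the path 0, 0, 1, 1 has probability 0.045, and after two steps the chain is in state 0 with probability 0.86 and in state 1 with probability 0.14.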
Stationary solutions • The fundamental Ergodic Theorem for Markov chains [Grimmett and Stirzaker 1989] says that the Markov chain with kernel E has a stationary distribution p under three conditions: • E is stochastic • E is irreducible • E is aperiodic • To make these conditions true: • All rows of E add up to 1 (and no value is negative) • Make sure that E is strongly connected • Make sure that E is not bipartite • Example: PageRank [Brin and Page 1998]: use “teleportation”
Example • [Figure: graph with nodes 1 to 8, shown at t=0 and t=1] • This graph E has a second graph E’ (not drawn) superimposed on it: E’ is the uniform transition graph.
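Superimposing the uniform graph E’ on E can be sketched as a weighted mix of the two kernels. The mixing weight (called `damping` here, with a default of 0.85) and the three-node kernel are assumptions for illustration; the slides do not state a value.

```python
def add_teleportation(E, damping=0.85):
    """Mix a row-stochastic kernel E with the uniform kernel E'.

    With probability `damping` the walk follows E; otherwise it jumps to a
    node chosen uniformly at random. `damping` is an assumed parameter.
    """
    n = len(E)
    uniform = 1.0 / n
    return [[damping * E[i][j] + (1.0 - damping) * uniform for j in range(n)]
            for i in range(n)]

# Hypothetical kernel that is not irreducible: node 2 can never be left.
E = [[0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [0.0, 0.0, 1.0]]
G = add_teleportation(E)

row_sums = [sum(row) for row in G]
smallest_entry = min(min(row) for row in G)
```

Every entry of the mixed kernel is strictly positive, so every node can reach every other node in one step; that makes the chain both irreducible and aperiodic while the rows still add up to 1.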
Eigenvectors • An eigenvector is an implicit “direction” for a matrix. Ev = λv, where v is non-zero, though λ can be any complex number in principle. • The largest eigenvalue of a stochastic matrix E is real: λ1 = 1. • For λ1, the left (principal) eigenvector is p, the right eigenvector = 1 • In other words, E^T p = p.
Computing the stationary distribution
function PowerStatDist(E):
begin
  p(0) = u; (or p(0) = [1,0,…,0])
  i = 1;
  repeat
    p(i) = E^T p(i-1)
    L = ||p(i) - p(i-1)||_1;
    i = i + 1;
  until L < ε
  return p(i-1)
end
Solution for the stationary distribution. Here u is the uniform distribution and ε is the convergence threshold. Convergence rate is O(m).
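The power method above translates almost line for line into Python. A minimal sketch, assuming a row-stochastic kernel stored as a list of lists; the threshold `eps`, the iteration cap `max_iter`, and the two-state example kernel are assumptions added for the example.

```python
def power_stat_dist(E, eps=1e-12, max_iter=100000):
    """Power iteration for the stationary distribution of a row-stochastic E."""
    n = len(E)
    p = [1.0 / n] * n                       # p(0) = u, the uniform distribution
    for _ in range(max_iter):
        # p(i) = E^T p(i-1): entry j collects the mass flowing into state j.
        new_p = [sum(E[i][j] * p[i] for i in range(n)) for j in range(n)]
        L = sum(abs(a - b) for a, b in zip(new_p, p))   # L1 norm of the change
        p = new_p
        if L < eps:
            break
    return p

# Hypothetical two-state kernel; each row adds up to 1.
E = [[0.9, 0.1],
     [0.5, 0.5]]
stationary = power_stat_dist(E)
```

For this kernel the balance condition 0.1 * p0 = 0.5 * p1 gives the stationary distribution (5/6, 1/6), and the iteration converges to it quickly because the second eigenvalue of E is 0.4.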
Example • [Figure: graph with nodes 1 to 8, shown at t=0, t=1, and t=10]
PageRank • Developed at Stanford and allegedly still being used at Google. • Not query-specific, although query-specific varieties exist. • In general, each page is indexed along with the anchor texts pointing to it. • Among the pages that match the user’s query, Google shows the ones with the largest PageRank. • Google also uses vector-space matching, keyword proximity, anchor text, etc.
SET Fall 2013 … 19. Hubs and authorities Bipartite graphs HITS and SALSA Models of the web …
HITS • Hyperlink-Induced Topic Search. • Developed by Jon Kleinberg and colleagues at IBM Almaden as part of the CLEVER engine. • HITS is query-specific. • Hubs and authorities, e.g. collections of bookmarks about cars vs. actual sites about cars. • [Figure labels: Honda, Ford, VW, Car and Driver]
HITS • Each node in the graph is ranked for hubness (h) and authoritativeness (a). • Some nodes may have high scores on both. • Example authorities for the query “java”: • www.gamelan.com • java.sun.com • digitalfocus.com/digitalfocus/… (The Java developer) • lightyear.ncsa.uiuc.edu/~srp/java/javabooks.html • sunsite.unc.edu/javafaq/javafaq.html
HITS • HITS algorithm: • obtain root set (using a search engine) related to the input query • expand the root set by radius one on either side (typically to size 1000-5000) • run iterations on the hub and authority scores together • report top-ranking authorities and hubs • Eigenvector interpretation:
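The iteration step can be sketched as follows, assuming the standard formulation in which a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of pages it links to. Under that formulation the authority vector converges to the principal eigenvector of A^T A and the hub vector to that of A A^T, where A is the adjacency matrix. The four-node graph, the function name, and the fixed iteration count are assumptions for illustration; building the root set and expanding it are not shown.

```python
import math

def hits(adj, iterations=50):
    """Iterate hub and authority scores together on an adjacency matrix.

    adj[i][j] == 1 means page i links to page j.
    """
    n = len(adj)
    hub = [1.0] * n
    auth = [1.0] * n
    for _ in range(iterations):
        # Authority of j: total hub score of the pages that point to j.
        auth = [sum(adj[i][j] * hub[i] for i in range(n)) for j in range(n)]
        # Hub score of i: total authority of the pages that i points to.
        hub = [sum(adj[i][j] * auth[j] for j in range(n)) for i in range(n)]
        # Normalize to unit length so the scores stay bounded.
        auth_norm = math.sqrt(sum(a * a for a in auth)) or 1.0
        hub_norm = math.sqrt(sum(h * h for h in hub)) or 1.0
        auth = [a / auth_norm for a in auth]
        hub = [h / hub_norm for h in hub]
    return hub, auth

# Hypothetical graph: page 0 links to pages 1 and 2; page 3 links to page 1.
adj = [[0, 1, 1, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 1, 0, 0]]
hub_scores, auth_scores = hits(adj)
best_hub = hub_scores.index(max(hub_scores))
best_authority = auth_scores.index(max(auth_scores))
```

In this toy graph page 1 comes out as the top authority because both linking pages point to it, and page 0 is the top hub because it points to both authorities.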
Example [slide from Baldi et al.]
HITS • HITS is now used by Ask.com and Teoma.com. • It can also be used to identify communities (e.g., based on synonyms as well as controversial topics). • Example for “jaguar” • Principal eigenvector gives pages about the animal • The positive end of the second nonprincipal eigenvector gives pages about the football team • The positive end of the third nonprincipal eigenvector gives pages about the car. • Example for “abortion” • The positive end of the second nonprincipal eigenvector gives pages on “planned parenthood” and “reproductive rights” • The negative end of the same eigenvector includes “pro-life” sites. • SALSA (Lempel and Moran 2001)
Models of the Web • Evolving networks: fundamental object of statistical physics, social networks, mathematical biology, and epidemiology • Erdős/Rényi 59, 60 • Barabási/Albert 99 • Watts/Strogatz 98 • Kleinberg 98 • Menczer 02 • Radev 03 • [Figure labels: a, A, B, b]
Evolving Word-based Web • Observations: • Links are made based on topics • Topics are expressed with words • Words are distributed very unevenly (Zipf, Benford, self-triggerability laws) • Model • Pick n • Generate n lengths according to a power-law distribution • Generate n documents using a trigram model • Model (cont’d) • Pick words in decreasing order of r. • Generate hyperlinks with random directionality • Outcome • Generates power-law degree distributions • Generates topical communities • Natural variation of PageRank: LexRank
Readings • paper by Church and Gale (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.38.3957)