Link Analysis and Anti-Spam

Presentation Transcript


  1. Link Analysis and Anti-Spam Tie-Yan Liu Microsoft Research Asia

  2. Outline • First Session • Overview of Link Analysis Technologies • PageRank and HITS • Second Session • More about Link Analysis Algorithms • Third Session • Spam and Anti-Spam • Homework "Web Search and Mining" Course @ USTC, 2005

  3. First Session

  4. Typical Search Engine Architecture

  5. Ranking for the Search Results • Today’s search engines may return millions of pages for a single query • Users cannot possibly preview all of these results • An appropriate ranking is therefore very helpful • Ranking on relevance • Ranking on importance

  6. Traditional IR Ranking • A ranking purely on relevance • Term frequency (tf) • Inverse Document Frequency (idf) • Okapi … • Many other aspects that Dr. Shuming Shi will mention in the next course.

  7. Limitations of Traditional IR • Text-based ranking function • www.harvard.edu can hardly be recognized as one of the most authoritative pages for the query “harvard”, since many other web pages contain “harvard” more often. • The number of pages with the same relevance is still too large for the users to preview. • Pages are not sufficiently self-descriptive • Usually the term “search engine” doesn't appear on the web pages of search engines.

  8. What’s More for Web Search • In order to solve these problems • We must leverage other information on the Web • We must distinguish between pages with the same amount of relevance • Link Analysis • The web is not just a collection of pure-text documents • The hyperlinks are also very important! • A link from page A to page B may indicate: • A is related to B, or • A is recommending, citing, voting for or endorsing B • Links affect the ranking of web pages and thus have commercial value.

  9. Famous Link Analysis Methods • HITS • PageRank

  10. HITS - Kleinberg’s Algorithm • HITS – Hypertext Induced Topic Selection • For each vertex v in a subgraph of interest: • a(v) - the authority of v • h(v) - the hubness of v • A site is very authoritative if it receives many citations. Citations from important sites weigh more than citations from less important sites • Hubness shows the importance of a site as a source of links. A good hub is a site that links to many authoritative sites

  11. Authority and Hubness • [Figure: node 1 with parents 2, 3, 4 and children 5, 6, 7] • h(1) = a(5) + a(6) + a(7) • a(1) = h(2) + h(3) + h(4)

  12. Convergence of Authority and Hubness • Recursive dependency: • $a(v) \leftarrow \sum_{w \in pa[v]} h(w)$ • $h(v) \leftarrow \sum_{w \in ch[v]} a(w)$ • Using linear algebra, we can prove that a(v) and h(v) converge

  13. HITS Example • Find a base subgraph: • Start with a root set R = {1, 2, 3, 4} of nodes relevant to the topic • Expand the root set R to include all the children and a fixed number of parents of nodes in R → a new set S (base subgraph)

  14. HITS Example • Hubs and authorities: two n-dimensional vectors a and h • HubsAuthorities(G):
      $\mathbf{1} \leftarrow [1, \dots, 1] \in \mathbb{R}^{|V|}$
      $a_0 \leftarrow h_0 \leftarrow \mathbf{1}$
      $t \leftarrow 1$
      repeat
          for each $v$ in $V$ do
              $a_t(v) \leftarrow \sum_{w \in pa[v]} h_{t-1}(w)$
              $h_t(v) \leftarrow \sum_{w \in ch[v]} a_{t-1}(w)$
          $a_t \leftarrow a_t / \|a_t\|$
          $h_t \leftarrow h_t / \|h_t\|$
          $t \leftarrow t + 1$
      until $\|a_t - a_{t-1}\| + \|h_t - h_{t-1}\| < \varepsilon$
      return $(a_t, h_t)$
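The iteration on this slide can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the lecturer's code: the function name `hits`, the `{node: [children]}` input format, the Euclidean norm for normalization, the L1 distance in the stopping test, and the small example graph are all illustrative choices.

```python
from math import sqrt


def hits(children, eps=1e-10, max_iter=1000):
    """Authority/hub iteration on a graph given as {node: [children]}."""
    nodes = sorted(set(children) | {w for ws in children.values() for w in ws})
    parents = {v: [] for v in nodes}
    for v, ws in children.items():
        for w in ws:
            parents[w].append(v)

    def normalize(vec):
        # Assumption: Euclidean norm; guard against an all-zero vector.
        norm = sqrt(sum(x * x for x in vec.values())) or 1.0
        return {v: x / norm for v, x in vec.items()}

    a = {v: 1.0 for v in nodes}
    h = {v: 1.0 for v in nodes}
    for _ in range(max_iter):
        # a_t(v) sums the previous hub scores of v's parents,
        # h_t(v) sums the previous authority scores of v's children.
        new_a = normalize({v: sum(h[w] for w in parents[v]) for v in nodes})
        new_h = normalize({v: sum(a[w] for w in children.get(v, [])) for v in nodes})
        change = sum(abs(new_a[v] - a[v]) + abs(new_h[v] - h[v]) for v in nodes)
        a, h = new_a, new_h
        if change < eps:
            break
    return a, h


# Hypothetical graph: pages 1, 2, 3 link to pages 4 and 5.
children = {1: [4, 5], 2: [4], 3: [4, 5]}
authority, hubness = hits(children)
```

In this example page 4 is cited by all three hubs and page 5 by two, so page 4 ends up with the larger authority score, and hubs 1 and 3 (which cite both) outrank hub 2.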

  15. HITS Example Results • [Figure: authority and hubness weights for nodes 1 to 15]

  16. Matrix Notation of HITS • With the adjacency matrix A of the base subgraph defined so that $A_{vw} = 1$ when v links to w, the authority and hubness vectors computed by the algorithm above are, respectively, the principal right and left singular vectors of A.
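The singular-vector statement follows from writing the two update rules in matrix form. The short derivation below is a sketch that assumes the convention $A_{vw} = 1$ when page v links to page w (the slide's own matrix is not preserved in the transcript) and ignores the normalization constants.

```latex
% Assumed convention: A_{vw} = 1 if page v links to page w, else 0.
a_t \propto A^{T} h_{t-1}, \qquad h_t \propto A\, a_{t-1}
\;\Longrightarrow\;
a_t \propto (A^{T} A)\, a_{t-2}, \qquad h_t \propto (A A^{T})\, h_{t-2}
```

Repeated multiplication by $A^{T}A$ (respectively $AA^{T}$) is a power iteration, so when the largest eigenvalue is unique, a tends to the principal eigenvector of $A^{T}A$ (the principal right singular vector of A) and h to the principal eigenvector of $AA^{T}$ (the principal left singular vector).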

  17. PageRank • Introduced by Page et al. (1998) • The rank of a page is proportional to the ranks of its parents, but inversely proportional to its parents’ outdegrees

  18. Matrix Notation • Adjacency matrix A = [matrix not preserved in the transcript]

  19. Matrix Notation • r = Br • PageRank is the eigenvector of B associated with the eigenvalue 1 • B = [matrix not preserved in the transcript]
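Since the matrix B itself did not survive in the transcript, here is a minimal sketch of one common way to build it, assuming $B_{ij} = 1/\text{outdegree}(j)$ when page j links to page i and 0 otherwise. The helper names `transition_matrix` and `matvec` and the three-page graph are hypothetical, and each child list is assumed to contain no duplicates.

```python
def transition_matrix(children, nodes):
    """B[i][j] = 1/outdegree(j) if page j links to page i, else 0."""
    index = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    B = [[0.0] * n for _ in range(n)]
    for v in nodes:
        out = children.get(v, [])
        for w in out:
            # Page v splits its rank evenly among its children.
            B[index[w]][index[v]] = 1.0 / len(out)
    return B


def matvec(B, r):
    """Plain matrix-vector product."""
    return [sum(b * x for b, x in zip(row, r)) for row in B]


# Hypothetical graph: 1 -> 2, 1 -> 3, 2 -> 3, 3 -> 1.
children = {1: [2, 3], 2: [3], 3: [1]}
nodes = [1, 2, 3]
B = transition_matrix(children, nodes)

# For this graph the vector (0.4, 0.2, 0.4) satisfies r = B r.
r = [0.4, 0.2, 0.4]
Br = matvec(B, r)
```

Every page in this example has at least one outlink, so each column of B sums to 1 and an eigenvector with eigenvalue 1 exists.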

  20. Matrix Notation

  21. Markov Chain Notation • Random surfer model • Description of a random walk through the Web graph • The rank of a page is interpreted as the asymptotic probability that a surfer is currently browsing that page • $r_t = M r_{t-1}$ • M: transition matrix for a first-order Markov chain (stochastic) • Does it converge to some sensible solution (as $t \to \infty$) regardless of the initial ranks?

  22. Problem • “Rank Sink” Problem • In general, many Web pages have no inlinks or no outlinks • This results in dangling edges in the graph • E.g. no parent → rank 0: $M^T$ converges to a matrix whose last column is all zero • No children → no solution: $M^T$ converges to the zero matrix

  23. Modification • The surfer will restart browsing by picking a new Web page at random • M = (B + E) • E: escape matrix • M: stochastic matrix • Still a problem? • It is not guaranteed that M is primitive • If M is stochastic and primitive, PageRank converges to the corresponding stationary distribution of M

  24. Distribution of the Mixture Model • The probability distribution that results from combining the Markovian random walk distribution and the static rank source distribution • $r = \varepsilon e + (1 - \varepsilon) x$ • r: PageRank • ε: probability of selecting a non-linked page • Now the transition matrix $[\varepsilon H + (1 - \varepsilon) M]$ is primitive and stochastic, so $r_t$ converges to the dominant eigenvector
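The mixture model can be sketched as a power iteration. This is an illustrative sketch, not the lecturer's implementation: it assumes the static rank source e is uniform (so H has all entries 1/n), that a page with no outlinks spreads its rank uniformly, and that ε = 0.15; the function name `pagerank` and the example graph are hypothetical.

```python
def pagerank(children, nodes, eps=0.15, tol=1e-12, max_iter=1000):
    """Power iteration for r = eps * e + (1 - eps) * (random-walk step)."""
    n = len(nodes)
    r = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Static rank source: with probability eps jump to a uniform random page.
        new_r = {v: eps / n for v in nodes}
        for v in nodes:
            out = children.get(v, [])
            if out:
                share = (1 - eps) * r[v] / len(out)
                for w in out:
                    new_r[w] += share
            else:
                # Assumption: a page without outlinks spreads its rank uniformly.
                share = (1 - eps) * r[v] / n
                for w in nodes:
                    new_r[w] += share
        delta = sum(abs(new_r[v] - r[v]) for v in nodes)
        r = new_r
        if delta < tol:
            break
    return r


# Hypothetical graph; page 4 has no inlinks, so it keeps only the source rank.
children = {1: [2, 3], 2: [3], 3: [1], 4: [3]}
ranks = pagerank(children, [1, 2, 3, 4])
```

Page 4 illustrates the effect of the rank source: without it a page with no parents would have rank 0, whereas here it retains ε/n.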

  25. PageRank vs. HITS - Algorithm

  26. PageRank vs. HITS - Stability • Are link analysis algorithms based on eigenvectors stable, in the sense that the results do not change significantly when the graph changes slightly? • General strategy for evaluating stability: • 1. Start with the original adjacency matrix A • 2. Perturb the matrix to get A*: select k nodes in the graph to add or delete • 3. Compute the distance d(r(A), r(A*)) for some distance measure d and an objective function r that measures the quality of the results for a given matrix • 4. Compute the amount of perturbation p(A, A*) for some distance function p • 5. Identify the conditions, if any, under which small values of p generate large values of d

  27. Stability of HITS • Ng et al. (2001) • A bound on the number of hyperlinks k that can be added to or deleted from one page without affecting the authority or hubness weights • Observations • Stability is determined by the eigengap • Eigengap: difference between the 1st and 2nd eigenvalues • $A^T A$ for authorities, $A A^T$ for hubs • If the eigengap is big, HITS is insensitive to small perturbations; if it is small, HITS is sensitive to them • δ: eigengap $\lambda_1 - \lambda_2$ • d: maximum outdegree of G

  28. Stability of PageRank • Looser bound • Ng et al. (2001) • Bianchini et al. (2001) • Observations • The parameter ε of the mixture model has a stabilizing role • If the k pages to be modified do not have high overall PageRank scores, then the perturbed scores will not be far from the originals

  29. Second Session

  30. Pre-PageRank • PageRank achieved great success in industry, and many people regarded it as a breakthrough in the research field as well. • However, the basic idea of PageRank had already appeared in several earlier works • Mark 1988 • Bray 1996 • Marchiori 1997 • ……

  31. Mark 1988 • To calculate the score S of a document at vertex v: • $S(v) = s(v) + \frac{1}{|ch[v]|} \sum_{w \in ch[v]} S(w)$ • v: a vertex in the hypertext graph G = (V, E) • S(v): the global score • s(v): the score if the document is isolated • ch[v]: children of the document at vertex v • Limitations: • Requires G to be a directed acyclic graph (DAG) • If v has a single link to w, S(v) > S(w) • If v has a long path to w and s(v) < s(w), then S(v) > S(w) • Mark, D. M. (1988), "Network models in geomorphology," Chapter 4 in Modeling in Geomorphologic Systems, edited by M. G. Anderson, John Wiley, pp. 73-97.
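The recursive score on this slide can be computed bottom-up on a DAG. The following is a minimal sketch; the function name `mark_scores`, the dictionary input format, and the example scores are assumptions for illustration, and the recursion would not terminate on a cyclic graph, which is the DAG limitation noted on the slide.

```python
def mark_scores(children, s):
    """S(v) = s(v) + average of S(w) over the children w of v (DAG only)."""
    memo = {}

    def S(v):
        if v not in memo:
            ch = children.get(v, [])
            # A leaf contributes only its isolated score s(v).
            avg = sum(S(w) for w in ch) / len(ch) if ch else 0.0
            memo[v] = s[v] + avg
        return memo[v]

    return {v: S(v) for v in s}


# Hypothetical DAG: a -> b, a -> c, b -> c, with made-up isolated scores.
children = {"a": ["b", "c"], "b": ["c"], "c": []}
s = {"a": 1.0, "b": 2.0, "c": 4.0}
scores = mark_scores(children, s)
```

Here S(c) = 4, S(b) = 2 + 4 = 6, and S(a) = 1 + (6 + 4) / 2 = 6. Page b has a single link to c and ends up scoring higher than c, which is the second limitation listed on the slide.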

  32. Bray 1996 • The visibility of a site is measured by the number of other sites pointing to it • Authority? • The luminosity of a site is measured by the number of other sites to which it points • Hub?
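Both measures are simple degree counts, which a short sketch makes concrete. The function name, the (source, target) pair format, and the sample links are hypothetical; the sketch assumes that self-links and repeated links between the same two sites are not counted, since the slide speaks of "other sites".

```python
def visibility_and_luminosity(links):
    """Count distinct other sites pointing in (visibility) and out (luminosity)."""
    inbound, outbound = {}, {}
    for src, dst in links:
        if src == dst:
            continue  # only links between different sites count
        outbound.setdefault(src, set()).add(dst)
        inbound.setdefault(dst, set()).add(src)
    sites = set(inbound) | set(outbound)
    visibility = {site: len(inbound.get(site, ())) for site in sites}
    luminosity = {site: len(outbound.get(site, ())) for site in sites}
    return visibility, luminosity


# Hypothetical site-level links, including a duplicate and a self-link.
links = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "c"), ("a", "b")]
visibility, luminosity = visibility_and_luminosity(links)
```

In this example site c is the most visible (two other sites point to it) and site a is the most luminous (it points to two other sites), which matches the authority and hub intuition raised on the slide.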

  33. Marchiori (1997) • Hyper information should complement textual information to obtain the overall information • $S(v) = s(v) + h(v)$ • S(v): overall information • s(v): textual information • h(v): hyper information • $h(v) = \sum_{w \in ch[v]} F^{r(v, w)} S(w)$ • F: a fading constant, $F \in (0, 1)$ • r(v, w): the rank of w after sorting the children of v by S(w)
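A small sketch of the faded sum follows. It makes several assumptions that the slide does not state: children are sorted by S(w) in descending order, ranks start at 1, the graph is acyclic so the recursion terminates, and F = 0.5. The function name `marchiori_scores` and the example values are hypothetical.

```python
def marchiori_scores(children, s, F=0.5):
    """S(v) = s(v) + sum over children w of F**r(v, w) * S(w)."""
    memo = {}

    def S(v):
        if v not in memo:
            # Assumption: rank children by overall information, best first.
            child_scores = sorted((S(w) for w in children.get(v, [])), reverse=True)
            h = sum(F ** rank * score
                    for rank, score in enumerate(child_scores, start=1))
            memo[v] = s[v] + h
        return memo[v]

    return {v: S(v) for v in s}


# Hypothetical graph: a -> b, a -> c, with made-up textual scores.
children = {"a": ["b", "c"], "b": [], "c": []}
s = {"a": 1.0, "b": 2.0, "c": 4.0}
overall = marchiori_scores(children, s)
```

With these values, h(a) = 0.5 * 4 + 0.25 * 2 = 2.5, so S(a) = 3.5. Unlike the averaging in Mark's formula, the fading constant makes each additional child count for less.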

  34. Post PageRank • Following the success of PageRank, many new algorithms were proposed. • Fast PageRank calculation (Haveliwala) • Topic-sensitive PageRank • Personalized PageRank • LinkFusion • ……

  35. Fast PageRank calculation [Haveliwala – 1999] • Partition the destination vector into d blocks that each fit into main memory, and compute one block at a time. • This algorithm is quite similar in structure to the Block Nested-Loop Join algorithm in database systems, which also performs very well for data sets of moderate size but eventually loses out to more scalable approaches.

  36. Fast PageRank calculation [Haveliwala – 2003] • Basic observation: • The convergence rates of the PageRank values of individual pages during application of the Power Method are nonuniform. That is, many pages converge quickly, while a few pages take much longer to converge. Furthermore, the pages that converge slowly are generally those with high PageRank.

  37. Topic-Specific PageRank [Haveliwala - WWW02] • Topic-specific PageRanks • For each page, a PageRank value is precomputed for each topic; at query time, the values for the topics most relevant to the query are used. • 16 topics
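One way to picture the idea is a PageRank whose random jump lands only on pages of a given topic, with the per-topic vectors blended at query time. The sketch below is illustrative only: the function names, the two toy topics, the five-page graph, and the query weights 0.9 and 0.1 are all assumptions, and the weighted sum is one plausible way to combine the topic scores.

```python
def biased_pagerank(children, nodes, teleport, eps=0.15, iters=200):
    """PageRank whose random jump follows the distribution `teleport`."""
    r = dict(teleport)
    for _ in range(iters):
        new_r = {v: eps * teleport[v] for v in nodes}
        for v in nodes:
            out = children.get(v, [])
            if out:
                for w in out:
                    new_r[w] += (1 - eps) * r[v] / len(out)
            else:
                # Assumption: a page without outlinks jumps like the teleport step.
                for w in nodes:
                    new_r[w] += (1 - eps) * r[v] * teleport[w]
        r = new_r
    return r


def topic_vector(nodes, topic_pages):
    """Uniform jump distribution over the pages labeled with one topic."""
    return {v: (1.0 / len(topic_pages) if v in topic_pages else 0.0) for v in nodes}


# Hypothetical graph and topic labels.
nodes = [1, 2, 3, 4, 5]
children = {1: [2], 2: [1, 5], 3: [4], 4: [3, 5], 5: [1, 3]}
topic_ranks = {
    "sports": biased_pagerank(children, nodes, topic_vector(nodes, {1})),
    "music": biased_pagerank(children, nodes, topic_vector(nodes, {3})),
}

# Query time: blend the precomputed vectors by (assumed) topic relevance.
query_weights = {"sports": 0.9, "music": 0.1}
combined = {v: sum(query_weights[t] * topic_ranks[t][v] for t in topic_ranks)
            for v in nodes}
```

The expensive per-topic iterations happen offline; only the cheap weighted sum depends on the query.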

  38. Link Fusion [Zeng, WWW04] • In a more generalized scenario, suppose there are N data types. The importance attribute of one type of object can be reinforced by both inter-type and intra-type links (equation not preserved in the transcript) • Suppose w is the attribute vector of all the objects in the URM. Link Fusion can be represented as: $w^{new} = L_{urm}^T w^{old}$ • Such iterative calculation can be continued: $w_n = (L_{urm}^T)^n w_0$ • The result w is the principal eigenvector of $L_{urm}^T$, which can be interpreted as the value of the data objects with respect to a specific attribute.

  39. Limits of Link Analysis • Pay-for-place • Search engine bias: organizations pay search engines for page rank • Advertisements: organizations pay high-ranking pages for advertising space • The primary effect is increased visibility to end users; a secondary effect is increased respectability due to relevance to a high-ranking page

  40. Limits of Link Analysis • Stability • Adding even a small number of nodes/edges to the graph can have a significant impact • Topic drift • A top authority may be a hub of pages on a different topic, resulting in an increased rank for the authority page • Content evolution • Adding or removing links or content can affect the intuitive authority rank of a page, requiring recalculation of page ranks

  41. Third Session

  42. What is Link Spam • Because link analysis plays an important role in search engines, it has large commercial value • Improving one’s PageRank can directly increase one’s clicks and thus earn more money. • Link spam is any attempt to unfairly gain a high ranking on a search engine for a web page, without improving the user experience, by means of deliberate manipulation of the link graph.

  43. Link Spamming Technologies • Adding outlinks • Replicate hub pages • Adding inlinks • Create a honey pot • Infiltrate a web directory • Post links on blogs, wikis, etc. • Participate in link exchange • Buy expired domains • Create own spam farm

  44. Case Study: Spam HITS • The hub score can be increased by adding outlinks from the target page to highly authoritative pages • The authority score can be increased by creating hyperlinks from high-hub-score pages to the target page.

  45. Case Study: Spam PageRank • Factors that influence PageRank • $PR(t) = PR_{static}(t) + PR_{in}(t) - PR_{out}(t) - PR_{sink}(t)$ • Strategies • Own pages are part of the spam farm, maximizing $PR_{static}$ • Accessible pages point to the spam farm, maximizing $PR_{in}$ • Links pointing outside the spam farm are suppressed, minimizing $PR_{out}(t)$ • All pages within the farm have some outlinks, minimizing $PR_{sink}(t)$

  46. Anti-Spam • Early approaches • BHITS, SALSA, DOM, revised HITS, BadRank … • State-of-the-art • TrustRank (2004) • Revised PageRank (VLDB2004) • BadRank + (WWW2005) • SpamRank (WWW2005, workshop) • ……

  47. TrustRank • Basic assumption • Good pages seldom point to spam pages, but spam pages may very likely point to good pages. • Use TrustRank to denote the goodness of a webpage, and use Trust Propagation to label all the web pages starting from a small human-labeled seed set.

  48. TrustRank • Step 1: Initialization • How to select seeds • Inverse PageRank (Hub pages, since they have more influence) • High PageRank (Important pages are more important to search applications) • Step 2: Propagation

  49. TrustRank • Step 3: • Trust Dampening • Trust Splitting
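The three steps can be sketched as a biased propagation from the seed set. This is a minimal sketch under stated assumptions rather than the published algorithm verbatim: the function name `trustrank`, the dampening factor 0.85, the 20 iterations, and the toy graph are illustrative, and trust is split equally among a page's outlinks.

```python
def trustrank(children, nodes, good_seeds, alpha=0.85, iterations=20):
    """Propagate trust from human-labeled good seeds along outlinks."""
    # Step 1: initialization, all trust sits on the good seed pages.
    d = {v: (1.0 / len(good_seeds) if v in good_seeds else 0.0) for v in nodes}
    t = dict(d)
    # Steps 2 and 3: propagation with splitting and dampening.
    for _ in range(iterations):
        new_t = {v: (1 - alpha) * d[v] for v in nodes}
        for v in nodes:
            out = children.get(v, [])
            for w in out:
                # Splitting: divide trust among outlinks; dampening: scale by alpha.
                new_t[w] += alpha * t[v] / len(out)
        t = new_t
    return t


# Hypothetical graph: good pages g1, g2, g3 and spam pages s1, s2.
# Spam links to a good page, but no good page links to spam.
nodes = ["g1", "g2", "g3", "s1", "s2"]
children = {"g1": ["g2"], "g2": ["g1", "g3"], "g3": [],
            "s1": ["s2", "g1"], "s2": ["s1"]}
trust = trustrank(children, nodes, good_seeds={"g1"})
```

Because trust flows only along outlinks from the seeds, the spam pages receive none in this example, which reflects the basic assumption that good pages seldom point to spam.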

  50. BadRank+ • Motivation • Pages in a spam farm are densely connected, and many common pages exist in both the inlinks and the outlinks of these pages. • Propagate the badness of pages in the seed set to detect other spam pages on the Web.
