430 likes | 533 Views
Taking over Search Engines. Web Spamming. What is Spamming ? Spamming is the art of increasing the rank of a page. Why ? Having more visits means gaining more money. How ? Web search engines are the gateways to the web. Get listed in the top results. How much Spam out there ?.
E N D
Web Spamming • What is Spamming ? • Spamming is the art of increasingthe rank of a page. • Why ? • Having more visits means gaining more money. • How ? • Web search engines are the gateways to the web. • Get listed in the top results.
How much Spam out there ? • Real-Web data from the MSN crawler collected during August 2004 • 105,484,446 Web pages
Why is spam bad ? • For Users: • Useless pages. • For Search Engines: • Wastes bandwidth, CPU cycles, storage space. • Pollutes corpus. • Distorts ranking of results. (Again bad news for users !)
Techniques • Web Search Engines use a number of measure to estimate the importance of a page • Content Analysis: TF x IDF, … • Link Analysis: PageRank, … • Also spammers use a number of techniques ! • Content Manipulation, i.e. terms • Stucture Manipulation, i.e. links
Content Manipulation 1 • Repetition Repetition Repetition Repetition Repetition Repetition Repetition Repetition : • Increases the Term Frequency • dumortierite dumose dumous dump dumper dumpage Dumping dumper dumpily : • Makes a document relevant to many queries. • It is effective when using rare words (Inverse Document Frequency).
Where ? • Body, Title, Meta Tag, Anchor, Url.
Content Manipulation 2 • Content Repurposing: • Weaving : • Insertion of spam words into a well formed page copied another web-site. • Phrase Stitching : • Gluing well formed sentences copied from many other web-sites. • Why ? • Overcomes simple statistics that may be taken into account by web search engines
The Big Picture (1)Techniques / Boosting / Term <a href=“target.html”>free, great deals, cheap, inexpensive, cheap, free</a> Link Bombing
Link Manipulation • Links and pages from the attacker point of view
Creating (Hijacked) In-Links • Honey pots. • copies of valuable content (e.g. Unix Man Pages) with hidden links to spam farms or target pages. • Web Directories, Blogs, Wikis • all of the above usually have high Page Rank, and it is possible to add outgoing links to owned pages. • Link Exchange • Buy Expired domain • Creating Link Farms
Spamming HITS • HITS algorithm: • Searches for Hubs and Authorities • Top ranked pages are the more authoritative ones • Spam on HITS • Find a collection of good Hubs • Add links from Hubs to the target page • The target page is now linked to good Hubs !!
PageRank • PageRank in one equation: • PR(p) = M + (1- ) Vp • M is the adjacency matrix of the Web Graph. • is the damping factor. (usually .85) • in case of fairness Vp=1/N (N = # of pages in the Web). • V is the personalization vector. • What happens if a page p has no outgoing links ? • of its PR is lost --> all the PR will be lost eventually. • solution: normalize rows of M. (i.e. insert links to every other page)
Aggregate Page Rank • Total page rank is affected by • Number of pages • Incoming Links • Outgoing Links • Dangling Nodes • Topologies that: • Use as many pages as possible • minimize outgoing links • minimize dangling nodes incoming links WEB-SITE outgoing links
Chain topology (more is better) PR (Web Site) = 0.34 I a O 0.18 0.34 0.47 PR (Web Site) = 0.21+0.29 = 0.50 I a b O 0.11 0.21 0.29 0.37 I a b c d e f O 0.03 0.07 0.09 0.12 0.14 0.16 0.17 0.18 PR (Web Site) = 0.77
Ring topology I a O 0.18 0.34 0.47 0.18 I a O 0.11 0.03 b f 0.15 0.11 PR (Web Site) = 0.86 e c 0.12 0.14 d 0.13
Clique topology I a O 0.18 0.34 0.47 0.18 I a O 0.04 0.03 b f 0.15 0.15 PR (Web Site) = 0.93 e c 0.15 0.15 d 0.15
Increasing Page Rank of a single target page • Complicated structures do not help • chain, ring, clique waste page rank among every node in the website • To maximize the page rank of a target page a • all hijacked pages I must point to a • all boosting pages (b,c,d,e,f) must point to a • no links among boosting pages • the target page must point to all of the boosting pages
Star topology I a O 0.18 0.34 0.47 0.09 b 0.09 0.09 f c PR (a) = 0.43 I a O 0.09 0.03 e d 0.09 0.09
Putting all together • Given many spam farms • Create highly connected topologies among target pages • Link Exchange • every target page will be rewarded proportionally to their previous page rank
Is it worth ? • Page rank has a power low distribution • if a page has a low initial PageRankit is easy to improve it and to get higher ranking • if a page as an higher initial PageRankit is hard to improve it and it is harder to overcome other pages • Consider that: • it is cheap to generate automatically a link farm, but • spamming is expensive in terms of registered domains and IPs.
Hiding Techniques • Discriminate between real users and crawlers in order to hide spam activity to both of them
Content Hiding • Use background color for text. • add keywords • Use small 1 pixel anchor images. • add links
Cloaking • Identify whether the request comes from a real user or a search engine and provide different content. • To users: • provide target pages. • To Search Engines • provide useful and interesting text. • provide a link structure that increase PageRank. • Solution: • Download the same page twice.
Redirection • The redirection mechanism is used to create doorways to target pages • Search Engines: • download the page and crawl its links. • Users: • are immediately redirected to a target page.
Why content hiding is tough • HTML code can be parsed trying to detect spam intrusions. • Javascript code can be parsed too, but it is more difficult. • Eventually, it is needed to interpret the code. • Crawling is already very expensive !
Link analysis algorithms against web spamming • TrustRank • Anti-Trust Rank • Truncated Page Rank • SpamRank
Trust Rank • Observation • Good pages tend to link good pages. • Human is the best spam detector • Algorithm • Select a small subset of pages and let a human classify them • Propagate goodness of pages
Trust Rank: Selection • The seed set S should: • be as small as possible • cover a large part of the Web • Covering is related to out-links in the very same way PageRank is related to in-link • Inverse PageRank ! • A small number of pages with the highest Inverse PageRank is labeled by a human expert.
Trust Rank: Propagation • Initial values • TR(p) = 1, if p was found to be a good page • TR(p) = 0, otherwise • Iterations: • propagate Trust in the same way as PageRank • splitting through out-links • damping (attenuation) • only a fixed number of iteration M.
Anti-Trust Rank • Goal • find spam pages • Algorithm • Obtain a seed set of spam pages labeled by hand. (prefer high PageRank) • Compute PageRank Algorithm on the trasnposed adjacency matrix. • Use the seed set in the personalization vector. • Rank the pages in descending order of their scores.
Truncated Page Rank • Observation • Good pages have high page rank because of pages between 5 and 10 hops away
Truncated Page Rank • Observation • Good pages have high page rank because of pages between 5 and 10 hops away • Spam pages gain page rank because of pages in their neighborhood
Truncated Page Rank • Observation • Good pages have high page rank because of pages between 5 and 10 hops away • Spam pages gain page rank because of pages in their neighborhood • Solution • promote rank coming from far away • demote rank coming from the closest pages
Truncated Page Rank • Rank propagates through links • only a fraction propagates according to the adjacency matrix M • 5 steps of propagation mean • M · M · M · M · M = 5·M5 • We can calculate the page rank of a page by summing up the contributions from different distances: • PR(p) = t ·Mt = damping(t)· Mt • We can replace n with a function like this:
Truncated Page Rank • Strategy: • Pages whose PageRank is largely different from its Truncated PageRank are likely to be spam • Results: • Comparable with TrustRank
Spam Rank • Observations: • Spam pages are usually supported by low PageRank Pages. • Spammers have a limited budget, so they replicate only what they need for boosting PageRank. • Idea: • Find missing statistical features of dishonest supporters. • Due to the self-similarity, the honest supporter set should have a power-law distribution of PageRank.
Spam Rank: Algorithm • Find supporters for each page. • Check whether each set of supporters follows a power-law distribution of its PageRank. • Create penalties for suspicious pages. • Run PageRank using a personalization vector based on penalties. • Spam Rank is a Measure of Undeserved PageRank