Web Information Retrieval (Web IR)
Handout #9: Connectivity Ranking
Ali Mohammad Zareh Bidoki, ECE Department, Yazd University, alizareh@yaduni.ac.ir
Outline
• PageRank
• HITS
• Personalized PageRank
• HostRank
• Distance Rank
Ranking: Definition
• Ranking is the process that estimates the quality of a set of results retrieved by a search engine
• Ranking is the most important part of a search engine
Ranking Types
• Content-based
  • Classical IR
• Connectivity-based (web)
  • Query independent
  • Query dependent
• User-behavior based
Web Information Retrieval
• Queries are short: 2.35 terms on average
• Huge variety in documents: language, quality, duplication
• Huge vocabulary: hundreds of millions of terms
• Deliberate misinformation
  • Spamming! In content-based ranking, a page's rank is completely under the control of the Web page's author
Ranking in Web IR
• Ranking is a function of the query terms and of the hyperlink structure
• Uses the content of other pages to rank the current page
  • It is out of the control of the page's author
  • Spamming is hard
(Figure: a term-document index and the Web graph.)
Connectivity-based Ranking
• Query independent: PageRank
• Query dependent: HITS
Google’s PageRank Algorithm
• Idea: mine the structure of the Web graph
  • Each web page is a node
  • Each hyperlink is a directed edge
PageRank
• Assumption: a link from page A to page B is a recommendation of page B by the author of A (we say B is a successor of A)
  • The quality of a page is related to its in-degree
• Recursion: the quality of a page is related to
  • its in-degree, and
  • the quality of the pages linking to it
Definition of PageRank
• Consider the following infinite random walk (surf):
  • Initially the surfer is at a random page
  • At each step, the surfer proceeds to a randomly chosen successor of the current page (each with probability 1/outdegree)
• The PageRank of a page p is the fraction of steps the surfer spends at p in the limit
• This is the random surfer model
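The random-surfer definition above can be illustrated directly by simulation: count how often the surfer lands on each page and take the fraction of steps. This is a sketch, not the slides' own code; the tiny three-page graph, step count, and seed are assumptions for illustration.

```python
# Monte Carlo illustration of the random-surfer definition of PageRank:
# the PageRank of p is the fraction of steps the surfer spends at p.
import random

def random_surfer(links, steps=100_000, seed=0):
    """links: dict mapping each page to the list of its successors."""
    rng = random.Random(seed)
    nodes = list(links)
    visits = {p: 0 for p in nodes}
    page = rng.choice(nodes)                  # start at a random page
    for _ in range(steps):
        visits[page] += 1
        outs = links[page]
        # jump to a random page if there is no successor (keeps the walk alive)
        page = rng.choice(outs) if outs else rng.choice(nodes)
    return {p: v / steps for p, v in visits.items()}

freq = random_surfer({"A": ["B"], "B": ["C"], "C": ["A", "B"]})
```

On this strongly connected toy graph the empirical frequencies approach the stationary distribution of the walk, which is what the next slide identifies as PageRank.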
PageRank (cont.)
• By the previous theorem, PageRank is the stationary probability distribution of this Markov chain, i.e. the vector R satisfying R = P^T R
PageRank (cont.)
• Example: pages A and B both link to P; A has 4 out-links and B has 3, so
  PR(P) = PR(A)/4 + PR(B)/3
Damping Factor (d)
• The Web graph is not strongly connected
  • Convergence of PageRank is not guaranteed
• Effects of sinking web pages
  • Pages without out-links
  • Trapping pages
• Damping factor (d): the surfer proceeds to a randomly chosen successor of the current page with probability d, or to a randomly chosen web page with probability (1-d), where n is the total number of nodes in the graph:
  PR(p) = (1-d)/n + d · Σ_{q→p} PR(q)/O(q)
PageRank Vector (Linear Algebra)
• R is the rank vector (eigenvector); r_i is the rank value of page i
• P is a matrix where p_ij = 1/O(i) if i points to j, else p_ij = 0
• Goal: find the eigenvector of matrix P with eigenvalue one
• Computed by iterating until convergence (the power method)
• With the damping factor we have (e_i = 1/n): R = (1-d)·E + d·P^T R
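The power-method iteration above can be sketched in a few lines. This is a minimal illustration, not the original implementation; the example graph, damping value d = 0.85, tolerance, and sink handling (spreading a sink's rank uniformly) are assumptions.

```python
# Power-method PageRank sketch with damping factor d.
def pagerank(links, d=0.85, tol=1e-8, max_iter=100):
    """links: dict mapping each page to the list of pages it links to."""
    nodes = list(links)
    n = len(nodes)
    r = {p: 1.0 / n for p in nodes}            # start from the uniform vector
    for _ in range(max_iter):
        new = {p: (1 - d) / n for p in nodes}  # random-jump term (1-d)/n
        for p, outs in links.items():
            if outs:                           # spread d*r(p) over successors
                share = d * r[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:                              # sink page: spread everywhere
                for q in nodes:
                    new[q] += d * r[p] / n
        done = sum(abs(new[p] - r[p]) for p in nodes) < tol
        r = new
        if done:
            break
    return r

ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

Here page C ends up with the highest rank, since it is linked from both A and B.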
PageRank Properties
• Advantages
  • Finds popularity
  • It is offline
• Disadvantages
  • It is query independent
  • All pages compete together regardless of topic
  • Unfairness
HITS (an online, query-dependent algorithm)
• Hypertext Induced Topic Search
• By Kleinberg
HITS (Hypertext Induced Topic Search)
• The algorithm produces two types of pages:
  • Authority: a page is very authoritative if it receives many citations; citations from important pages weigh more than citations from less-important pages
  • Hub: hubness shows the importance of a page as a directory; a good hub is a page that links to many authoritative sites
• For each vertex v ∈ V in the graph of interest:
  • a(v): the authority of v
  • h(v): the hubness of v
HITS
• Example updates for node 1:
  h(1) = a(5) + a(6) + a(7)
  a(1) = h(2) + h(3) + h(4)
(Figure: node 1 links to nodes 5, 6, 7 and is linked from nodes 2, 3, 4.)
Authority and Hubness Convergence
• Authorities and hubs exhibit a mutually reinforcing relationship: a good hub points to many good authorities, and a good authority is pointed to by many good hubs
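The mutually reinforcing updates a(v) = Σ h(u) over in-links and h(u) = Σ a(v) over out-links can be iterated until convergence. The sketch below is an assumption-laden illustration (fixed iteration count, L2 normalization), not the slides' exact algorithm.

```python
# Minimal HITS iteration: alternate authority and hub updates, normalizing
# each vector so the scores do not grow without bound.
import math

def hits(links, iterations=50):
    """links: dict page -> list of pages it points to. Returns (auth, hub)."""
    nodes = set(links) | {q for outs in links.values() for q in outs}
    auth = {p: 1.0 for p in nodes}
    hub = {p: 1.0 for p in nodes}
    for _ in range(iterations):
        # a(v) = sum of h(u) over pages u linking to v
        auth = {p: 0.0 for p in nodes}
        for u, outs in links.items():
            for v in outs:
                auth[v] += hub[u]
        # h(u) = sum of a(v) over pages v that u links to
        hub = {u: sum(auth[v] for v in links.get(u, [])) for u in nodes}
        # normalize both vectors to unit length
        for vec in (auth, hub):
            norm = math.sqrt(sum(x * x for x in vec.values())) or 1.0
            for p in vec:
                vec[p] /= norm
    return auth, hub
```

For example, if two pages both link to a third, the third accumulates authority while the first two accumulate hubness.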
HITS Example
Find a base subgraph:
• Start with a root set R = {1, 2, 3, 4}: nodes relevant to the topic
• Expand the root set R to include all the children and a fixed number (d) of parents of nodes in R
• The result is a new set S, the base subgraph
• The real version of HITS is based on site relations
Topic-Sensitive PageRank (TSPR)
• It precomputes the importance scores offline, as with ordinary PageRank
• However, it computes multiple importance scores for each page: a set of scores of the importance of a page with respect to various topics
• At query time, these importance scores are combined based on the topics of the query to form a composite PageRank score for the pages matching the query
TSPR (cont.)
• With n topics on the Web, the rank of page v is computed separately for each topic t
• The difference from the original PageRank is in the E vector (it is not uniform, and we have n E vectors)
• There are n ranking values for each page
• Problem: finding the topic of a page and of a query (we do not know the user's interest)
TSPR (cont.)
• c_j = category j
• Given a query q, let q' be the context of q (here q' = q)
• P(q'|c_j) is computed from the class term-vector D_j (the term counts of the documents below each of the 16 top-level categories; D_jt gives the total number of occurrences of term t in documents listed below class c_j)
TSPR (cont.)
• The quantity P(c_j) is not as straightforward. Here it is taken to be uniform, although we could personalize the query results for different users by varying this distribution.
• In other words, for some user k, we can use a prior distribution P_k(c_j) that reflects the interests of user k
• This provides an alternative framework for user-based personalization, rather than directly varying the damping vector E
TrustRank
• Spamming on the Web: good and bad pages
• TrustRank is used to combat spamming
• It proposes techniques to semi-automatically separate reputable, good pages from spam
• It first selects a small set of seed pages to be evaluated by an expert
• Once the reputable seed pages are identified manually, it uses the link structure of the web to discover other pages that are likely to be good
TrustRank (cont.)
• Idea: good pages link to other good pages, and bad pages link to other bad pages
TrustRank (cont.)
• It formalizes the notion of a human checking a page for spam by a binary oracle function O over all pages p: O(p) = 1 if p is a good page, and O(p) = 0 if p is spam
Computing the Trustiness of Each Page
• Goal: estimate, for each page, the likelihood that the oracle would judge it good
• Trust propagation uses biased PageRank, T = (1-d)·E + d·P^T T, where E is computed from the normalized oracle vector; for example, if O(5) = O(10) = O(15) = 1, then E(5) = E(10) = E(15) = 1/3
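Trust propagation is PageRank biased toward the good seeds: the jump vector E is nonzero only at oracle-approved pages. The sketch below illustrates this; the example graph, seed set, damping value, and iteration count are assumptions, not from the slides.

```python
# TrustRank sketch: biased PageRank whose jump vector E is the normalized
# oracle vector over the expert-approved good seed pages.
def trustrank(links, good_seeds, d=0.85, iterations=50):
    """links: page -> list of linked pages; good_seeds: set of seed pages."""
    nodes = set(links) | {q for outs in links.values() for q in outs}
    # E(i) = 1/|seeds| for oracle-approved seeds, 0 elsewhere
    e = {p: (1.0 / len(good_seeds) if p in good_seeds else 0.0) for p in nodes}
    t = dict(e)                                # start trust at the seeds
    for _ in range(iterations):
        new = {p: (1 - d) * e[p] for p in nodes}
        for p, outs in links.items():
            for q in outs:                     # trust flows along out-links
                new[q] += d * t[p] / len(outs)
        t = new
    return t
```

Pages reachable from the seeds accumulate trust; pages the seeds never reach keep a score of zero, marking them as likely spam.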
HostRank
• Previous link-analysis algorithms generally work on a flat link graph, ignoring the hierarchical structure of the Web
• They suffer from two problems: the sparsity of the link graph and biased ranking of newly emerging pages
• HostRank considers both the hierarchical structure and the link structure of the Web
Supernodes &amp; the Hierarchical Structure of the Web Graph
• The upper-layer graph is an aggregated link graph consisting of supernodes (such as domains, hosts, and directories)
• The lower-layer graph is the hierarchical tree structure, in which each node is an individual Web page in a supernode and the edges are the hierarchical links between the pages
Hierarchical Random Walk Model
1. At the beginning of each browsing session, a user randomly selects a supernode.
2. After the user finishes reading a page in a supernode, he selects one of the following three actions with a certain probability:
  • Going to another page within the current supernode
  • Jumping to another supernode that is linked by the current supernode
  • Ending the browsing session
Two Stages in HostRank
• First, compute the score of each supernode by a random walk (PageRank)
• Second, propagate the score among the pages inside each supernode
  • Using the Dissipative Heat Conductance (DHC) model
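The two stages can be sketched as below. Stage 1 is PageRank on the aggregated supernode graph; for stage 2 the slides use the DHC model, which is replaced here by a simple uniform split of the host's score among its pages, purely as a stand-in for illustration. The example graphs are assumptions.

```python
# Two-stage HostRank sketch: (1) PageRank over supernodes, (2) distribute
# each supernode's score to its pages (uniform split stands in for DHC).
def host_rank(super_links, pages_of, d=0.85, iterations=50):
    """super_links: supernode -> linked supernodes; pages_of: supernode -> pages."""
    hosts = set(super_links) | {h for outs in super_links.values() for h in outs}
    n = len(hosts)
    r = {h: 1.0 / n for h in hosts}
    for _ in range(iterations):                # stage 1: supernode PageRank
        new = {h: (1 - d) / n for h in hosts}
        for h, outs in super_links.items():
            for g in outs:
                new[g] += d * r[h] / len(outs)
        r = new
    page_score = {}
    for h, pages in pages_of.items():          # stage 2: split inside the host
        for p in pages:
            page_score[p] = r[h] / len(pages)  # uniform stand-in for DHC
    return page_score
```

Aggregating links at the host level densifies the graph, which is how HostRank addresses the sparsity problem mentioned above.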
Ranking &amp; Crawling Challenges
• Rich-get-richer problem
• Unfairness
• Low precision
• Spamming phenomenon
Popularity and Quality
• Definition 1: the popularity of page p at time t, P(p, t), is the fraction of Web users who like the page
  • We can interpret the PageRank of a page as its popularity on the Web
• Definition 2: the quality of a page p, Q(p), is the probability that an average user will like the page when seeing it for the first time
Rich-get-richer Problem
• It causes young, high-quality pages to receive less popularity than they deserve
• It stems from search-engine bias
• Known as the entrenchment effect
Entrenchment Effect
• Search engines show entrenched (already-popular) pages at the top
• Users discover pages via search engines and tend to focus on the top results
(Figure: user attention flows to entrenched pages, while new unpopular pages are overlooked.)
Popularity as a Surrogate for Quality
• Search engines want to measure the “quality” of pages
• Quality is hard to define and measure
• Various “popularity” measures are used in ranking, e.g., in-links, PageRank, user traffic
Measuring Search-Engine Bias
• Random-surfer model
  • Users follow links randomly
  • Never use search engines
• Search-dominant model
  • Users always start with a search engine
  • Only visit pages returned by search engines
• It has been found that it takes 60 times longer for a new page to become popular under the search-dominant model than under the random-surfer model
Popularity Evaluation
(Figure: popularity over time under the random-surfer and search-dominant models.)
Relation between Popularity &amp; Visit Rate in the Random Surfer Model
• r1 is a constant
• We can consider PageRank as popularity: the current PageRank of a page represents the probability that a person arrives at the page by randomly following links on the Web
Search-Dominant Formula Detail
• Derived from an AltaVista log (power-law distribution)