Page Rank Modifications & Alternatives Brett Harper
Overview • Computing Customized Page Ranks • Adaptive Ranking of Web Pages • Generalizing PageRank Damping Functions for Link-Based Ranking Algorithms • An Approach to Confidence Based Page Ranking for User-Oriented Web Search • Web Page Ranking using Link Attributes
Computing Customized Page Ranks • Page rank usually depends on how related a document is to a query, and the quality of the document. • PageRank introduces document authority. • Similar to the citation problem. • Most proposed web ranking algorithms are based on connectivity rather than content. • For customized ranks, the concept of page importance depends on the situation.
Computing Customized Page Ranks • Current solutions build different ranks for topics, users, or queries. • Automatic building of the ranking function from a set of user examples.
Computing Customized Page Ranks • Brin & Page's PageRank • Generalized PageRank, where x is a vector containing ranks, W is an n*n matrix, and e is an n-vector. • Parametric PageRank, where the sum of each of the a's is 1.
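A minimal sketch of the generalized form, assuming it reads x = dWx + e, with W[i][j] = 1/outdegree(j) when page j links to page i, and with the original PageRank recovered by setting every entry of e to (1-d). The function name, the toy graph, and the choice of fixed-point iteration are illustrative assumptions, not taken from the papers.

```python
def generalized_pagerank(links, e, d=0.85, iterations=100):
    """Iterate x <- d*W*x + e until (approximately) the fixed point.

    links: dict mapping page j to the list of pages it links to
           (assumed encoding; W[i][j] = 1/outdegree(j) for each link j -> i).
    e:     list of n forcing terms, one per page.
    """
    x = list(e)
    for _ in range(iterations):
        nxt = list(e)
        for j, targets in links.items():
            for i in targets:
                nxt[i] += d * x[j] / len(targets)
        x = nxt
    return x


# Hypothetical 3-page graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
links = {0: [1, 2], 1: [2], 2: [0]}
d = 0.85

# Original PageRank: every entry of e equals (1 - d).
classic = generalized_pagerank(links, [1 - d] * 3, d)

# Customized rank: a larger forcing term for page 2 raises its score.
boosted = generalized_pagerank(links, [1 - d, 1 - d, 3 * (1 - d)], d)
```

Because the iteration is linear in e, changing one entry of e shifts every rank by a multiple of the corresponding column of (I-dW)^-1, which is what makes e a convenient handle for customization.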
Computing Customized Page Ranks • User requirements are represented as an optimization problem where the variables are the user requirements and the total number of constraints. • The issue of how to obtain constraints is not discussed. • A cost function allows the ranks to be changed in accordance with the requirements. (Quadratic and linear) • Methods for infeasible requirements. • Penalty Function • Number of satisfied constraints, in addition to the cost function.
Computing Customized Page Ranks • WT10G data set • Constraints defined • Adaptive rank computed • Compared to PageRank on entire WT10G dataset
Adaptive Ranking of Web Pages • Alter PageRank by modifying the PageRank equation. • Can be done from perspective of the user or web site administrators. • Modify rank by changing (1-d) in the original PageRank. • Dynamic Control • Static Control
Adaptive Ranking of Web Pages • Rules • B is an r*n matrix, b is a rule vector of size r • Inputs and outputs should be positive • The cost function allows the rank of certain pages to be modified while keeping the current rank of other pages.
Adaptive Ranking of Web Pages • Initial solution was to structure the problem as a quadratic programming problem. • Second solution uses clusters to reduce the number of dimensions. • Pages are clustered based on score • Vector E contains k parameters. • Vector A is the sum of the columns in (I-dW)^-1 that correspond to a certain class.
Adaptive Ranking of Web Pages • Vector E contains k parameters. • Vector A is the sum of the columns in M that correspond to a certain class. • H is defined as BA • is the quadratic term • is the linear term
Adaptive Ranking of Web Pages • Contradicting constraints • Relax the constraints to arrive at a sub-optimal solution • Add s to the cost function (used to balance the importance of the constraints against the original cost function)
Adaptive Ranking of Web Pages • Use a clustering algorithm to split webpages into clusters. • Compute Ai • If there is a feasible solution, use the first formula to find the optimal parameters e1,...,ek. • If no feasible solution exists, use the version for relaxed constraints to find sub-optimal parameters e1,...,ek. • Compute rank as
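The steps above can be sketched as follows, leaving out the optimization step (it needs a quadratic programming solver). The sketch assumes the final rank is x = e1*A1 + ... + ek*Ak, which is what x = (I-dW)^-1 e reduces to when e is constant inside each cluster. The graph, the clustering, and all function names are hypothetical.

```python
def solve_rank(links, e, d=0.85, iterations=200):
    """Approximate x = (I - dW)^-1 e by iterating x <- d*W*x + e."""
    x = list(e)
    for _ in range(iterations):
        nxt = list(e)
        for j, targets in links.items():
            for i in targets:
                nxt[i] += d * x[j] / len(targets)
        x = nxt
    return x


def cluster_columns(links, clusters, n, d=0.85):
    """A_k = sum of the columns of (I - dW)^-1 that belong to cluster k."""
    columns = []
    for members in clusters:
        indicator = [1.0 if page in members else 0.0 for page in range(n)]
        columns.append(solve_rank(links, indicator, d))
    return columns


def rank_from_parameters(columns, params):
    """Combine the per-cluster vectors: x = sum_k e_k * A_k."""
    n = len(columns[0])
    return [sum(e_k * a_k[page] for e_k, a_k in zip(params, columns))
            for page in range(n)]


# Hypothetical 4-page graph split into two clusters.
links = {0: [1, 2], 1: [2], 2: [0], 3: [0]}
clusters = [{0, 2}, {1, 3}]
A = cluster_columns(links, clusters, n=4)

# With every parameter equal to (1 - d) the ordinary PageRank comes back.
baseline = rank_from_parameters(A, [0.15, 0.15])

# Raising the parameter of the second cluster lifts its pages.
adapted = rank_from_parameters(A, [0.15, 0.45])
```

Only k parameters are optimized instead of n, which is the dimensionality reduction the clustering is meant to provide.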
Adaptive Ranking of Web Pages • Used the WT10G data set for experiments • First experiment: Swap importance of two pages located some distance Δ apart. • Effectively modifies the PageRank • Constraints on highly ranked pages disturb the rest of the pages more significantly. • These disruptions appear in blocks due to clustering. • When swapping two pages, the effect is greater on lower ranked than on higher ranked pages. • Quality of results is influenced by # of clusters.
Adaptive Ranking of Web Pages • Second experiment: Change # of clusters • Gradually increase # of clusters used from 5 to 100. • Cost function stops improving at ~60 clusters. • Clustering can reduce the complexity level of the problem. • # of clusters quite small compared to the size of the collection.
Adaptive Ranking of Web Pages • Clustering techniques • Cluster by score • Cluster by rank (variable-sized cluster dimensions) • Cluster by rank with fixed size cluster dimensions
Adaptive Ranking of Web Pages • PageRanks can be modified, but constraints on some pages cause the ranks of all pages to be affected. • The effect of these constraints depends on how highly ranked the constrained page is.
Generalizing PageRank: Damping Functions for Link-Based Ranking Algorithms • Damping functions reduce page importance propagation on long paths. • Focus on linear, exponential, and hyperbolic decay. • Exponential corresponds to original PageRank.
Generalizing PageRank: Damping Functions for Link-Based Ranking Algorithms • For functional rankings, a link matrix is used. • Normalization • Dangling nodes • If P is the resulting matrix after normalization, the rank is defined as
Generalizing PageRank: Damping Functions for Link-Based Ranking Algorithms • An equivalent approach takes into account the branching contribution. • Rank of a node is the weighted sum of incoming paths, with weights that decay exponentially with path length. • PageRank is a functional ranking where the damping function is (1-α)α^t.
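A minimal sketch of a functional ranking, assuming the link matrix is already normalized and the graph has no dangling nodes: a uniform vector is pushed t steps along the links and the results are summed with weight damping(t). With damping(t) = (1-α)α^t the sum approximates PageRank. Function names, the toy graph, and the truncation at max_len are illustrative assumptions.

```python
def functional_rank(links, n, damping, max_len):
    """Return sum over t = 0..max_len of damping(t) * (uniform mass after t steps).

    links: dict mapping page j to the list of pages it links to
           (every page is assumed to have at least one out-link).
    """
    walk = [1.0 / n] * n      # mass of paths of length t ending at each page
    rank = [0.0] * n
    for t in range(max_len + 1):
        weight = damping(t)
        for page in range(n):
            rank[page] += weight * walk[page]
        nxt = [0.0] * n
        for j, targets in links.items():
            for i in targets:
                nxt[i] += walk[j] / len(targets)
        walk = nxt
    return rank


def exponential_damping(alpha):
    """PageRank's damping function: (1 - alpha) * alpha**t."""
    return lambda t: (1 - alpha) * alpha ** t


# Hypothetical 3-page graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
links = {0: [1, 2], 1: [2], 2: [0]}
pagerank_like = functional_rank(links, 3, exponential_damping(0.85), max_len=200)
```

Swapping in a different damping callable changes the ranking without touching the path-summing loop, which is the point of treating PageRank as one member of a family.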
Generalizing PageRank: Damping Functions for Link-Based Ranking Algorithms • Linear Damping
Generalizing PageRank: Damping Functions for Link-Based Ranking Algorithms • Hyperbolic Damping
Generalizing PageRank: Damping Functions for Link-Based Ranking Algorithms • Empirical Damping • Pages that are linked are similar, but the topic changes as the distance increases. • Use decrease in text similarity as an approximation to an empirical damping function. • .uk domain, 18m pages, 200 pages chosen at random, similarity measured using TF.IDF without stemming or stop-word removal • Results show that this is better approximated by linear damping with L=8 or 9 than by exponential damping.
Generalizing PageRank: Damping Functions for Link-Based Ranking Algorithms • Approximating Hyperbolic with Exponential Damping • Find the α that minimizes the difference of weights for different values of β and the maximum path length l.
Generalizing PageRank: Damping Functions for Link-Based Ranking Algorithms • Approximating Exponential with Linear Damping • Find the L that minimizes the difference of weights for different values of α and the maximum path length l.
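A rough sketch of the fitting step, under two assumptions that are not stated on the slides: linear damping has the form 2(L-t)/(L(L+1)) for t < L (zero beyond L, so the weights sum to 1), and the "difference of weights" is measured as the sum of absolute differences over path lengths 0..l. The function names and the candidate range are hypothetical.

```python
def exponential_weight(alpha, t):
    """PageRank damping: (1 - alpha) * alpha**t."""
    return (1 - alpha) * alpha ** t


def linear_weight(L, t):
    """Assumed linear damping: falls linearly to zero at path length L."""
    return 2.0 * (L - t) / (L * (L + 1)) if t < L else 0.0


def weight_mismatch(alpha, L, max_path_len):
    """Assumed error measure: sum of |exponential - linear| over t = 0..max_path_len."""
    return sum(abs(exponential_weight(alpha, t) - linear_weight(L, t))
               for t in range(max_path_len + 1))


def best_linear_length(alpha, max_path_len, candidates=range(1, 101)):
    """Pick the L whose linear weights are closest to the exponential ones."""
    return min(candidates, key=lambda L: weight_mismatch(alpha, L, max_path_len))


best_L = best_linear_length(alpha=0.85, max_path_len=20)
```

A brute-force search over integer L is enough here because the candidate set is small; a different error measure (for example squared differences) may select a different L.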
Generalizing PageRank: Damping Functions for Link-Based Ranking Algorithms • Parameters for the damping function • Characteristic path length (average distance between two nodes) grows sub-logarithmically with the size of the graph. • For a smaller graph, the damping function should decay faster. • The sum of the weights up to the average path lengths of graphs L1 and L2 have to be similar for both rankings to behave in a similar way.
Generalizing PageRank: Damping Functions for Link-Based Ranking Algorithms • Experimental Comparison of precision (PageRank vs. LinearRank) • Used the WebTREC Gov2 collection (25m documents, .gov domain, 2004) • Chose 50 queries at random to run. • PageRank took 39 iterations to run. LinearRank was run for 5, 10, and 20 iterations. • After first 5 results, LinearRank had precision similar to PageRank. • Useful when rankings can't be computed in advance.
An Approach to Confidence Based Page Ranking for User Oriented Web Search • Confidence is the probability of accessing a page for a specific query given past behavior. • Use this probability to enhance page rankings of most relevant pages. • Should also take link structure into account. • Merge pages with similar categories since users lose interest after first few results.
An Approach to Confidence Based Page Ranking for User Oriented Web Search • Extract important features and categories from web pages. • Prune pages from the graph that are not relevant. • Calculate confidence for all features and categories of each page. • Use citations (link structure) and confidence measure to recursively compute the page rank.
An Approach to Confidence Based Page Ranking for User Oriented Web Search • Extract important features and categories from web pages. • Search the full-text and extended anchor text for most relevant features/categories. • in the set of features where N(P,i) is the total # of times page P is accessed for query i and O(i) is the total number of queries made for i. • Pages with high E(P,a) will likely be accessed for the topic a.
An Approach to Confidence Based Page Ranking for User Oriented Web Search • Prune pages from the graph that are not relevant. • Pages without similar features/categories can be connected. • These pages are used for extracting features/ categories, but are pruned if the confidence does not meet a certain threshold. • Citations of pruned pages are also removed.
An Approach to Confidence Based Page Ranking for User Oriented Web Search • Calculate confidence for all features and categories of each page. • in the customized graph. • Calculating C(a,P) for the entire history is not realistic, so only take recent history into account.
An Approach to Confidence Based Page Ranking for User Oriented Web Search • Use citations (link structure) and confidence measure to recursively compute the page rank. • PR(P,a) = (1-d) + d[PR(T1,a)/O(T1)+...+ PR(Tn,a)/O(Tn)], where Ti is a citing page and O(Ti) is the # of outgoing links. • RPR(P,a) = PR(P,a) * C(a,P) • New pages cited by many relevant, high-ranked pages. Can be suppressed by including a time period. • Substitute damping factor d with (1-C(a,P))
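A minimal sketch of the two formulas above, PR(P,a) and RPR(P,a) = PR(P,a) * C(a,P), for a single topic a. The graph, the confidence values, and the function names are hypothetical; the variant that substitutes d with (1-C(a,P)) is not shown.

```python
def topic_pagerank(citing, out_degree, d=0.85, iterations=100):
    """PR(P,a) = (1-d) + d * sum(PR(T,a) / O(T)) over the pages T that cite P.

    citing:     dict mapping each page P to the list of pages citing it.
    out_degree: dict mapping each page T to its number of outgoing links O(T).
    """
    pr = {page: 1.0 for page in citing}
    for _ in range(iterations):
        pr = {page: (1 - d) + d * sum(pr[t] / out_degree[t] for t in sources)
              for page, sources in citing.items()}
    return pr


def confidence_rank(pr, confidence):
    """RPR(P,a) = PR(P,a) * C(a,P)."""
    return {page: pr[page] * confidence[page] for page in pr}


# Hypothetical pruned graph for one topic: A -> B, A -> C, B -> C, C -> A.
citing = {"A": ["C"], "B": ["A"], "C": ["A", "B"]}
out_degree = {"A": 2, "B": 1, "C": 1}

# Hypothetical confidence values C(a,P) estimated from recent access history.
confidence = {"A": 0.2, "B": 0.7, "C": 0.4}

pr = topic_pagerank(citing, out_degree)
rpr = confidence_rank(pr, confidence)
ordering = sorted(rpr, key=rpr.get, reverse=True)
```

Multiplying by the confidence lets a page that users actually open for the topic overtake a page that is merely well cited.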
An Approach to Confidence Based Page Ranking for User Oriented Web Search • The data set was constructed from a list of 7 queries, from which the top 30 results were obtained from Google. • A graph of these nodes was then created, and further expanded to a depth of 2. This new graph contained 500-800 nodes. • Higher ranked pages are not always accessed a higher number of times. • Pages can be accessed for multiple queries. • Pages with higher confidence tend to be ranked higher.
Web Page Ranking using Link Attributes • Tries to improve on current ranking techniques by assigning different weights to links. (WLRank) • Relative position in the page • Tag where the link is contained • Length of anchor text
Web Page Ranking using Link Attributes • L(j,i) is 1 if a link exists or 0 otherwise, and c is a constant that gives a base weight to every link • T(j,i) depends on the tag • AL(j,i) is length of anchor text divided by average anchor text length d. • RP(j,i) is the relative position weighted by constant b. • If W(j,i) = L(j,i) then it is equal to PageRank.
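A rough sketch of link-weighted ranking. The way the terms are combined, W(j,i) = L(j,i) * (c + T(j,i) + AL(j,i) + RP(j,i)), and the normalization of each page's outgoing weights are assumptions, since the formula itself is not on the slide. All names and sample values are hypothetical; `d_avg` stands for the slide's average anchor length d, and `damping` is the usual PageRank damping factor.

```python
def link_weight(exists, tag_weight, anchor_len, rel_pos, c=1.0, b=1.0, d_avg=100.0):
    """Assumed combination: W = L * (c + T + AL + RP)."""
    if not exists:
        return 0.0
    return c + tag_weight + anchor_len / d_avg + b * rel_pos


def weighted_rank(weights, n, damping=0.85, iterations=100):
    """PageRank-style iteration where page j splits its rank in proportion to W(j,i).

    weights: dict {(j, i): W(j, i)} with one entry per link from page j to page i.
    """
    out_total = [0.0] * n
    for (j, _), w in weights.items():
        out_total[j] += w
    rank = [1.0] * n
    for _ in range(iterations):
        nxt = [1 - damping] * n
        for (j, i), w in weights.items():
            nxt[i] += damping * rank[j] * w / out_total[j]
        rank = nxt
    return rank


# With W(j,i) = L(j,i) every link weighs 1 and the result is plain PageRank.
plain = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0, (2, 0): 1.0}
plain_rank = weighted_rank(plain, 3)

# Hypothetical emphasized link 0 -> 1: inside a weighted tag, 50-character anchor.
emphasized = dict(plain)
emphasized[(0, 1)] = link_weight(True, tag_weight=1.0, anchor_len=50, rel_pos=0.5)
emphasized_rank = weighted_rank(emphasized, 3)
```

Normalizing by each page's total outgoing weight keeps the scheme a redistribution of rank, so a heavier link gains only at the expense of the page's other links.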
Web Page Ranking using Link Attributes • Tested against 460k pages in the .CL domain. • Several users provided relevance judgements on the first 10 results of several queries. • Used c=1, b=1, and d=100. • Only used weights for <b> and <h1> tags. • Compare precision based on a perfect ranking for the first 10 answers. • Improvement of 13% on average.
Conclusions • PageRank can be modified to fit user requirements and specific categories. • Different functions can be used to decay PageRank influence on path lengths. • Can improve PageRank through clustering.
References • Tsoi, A. C., Hagenbuchner, M., and Scarselli, F. 2006. Computing customized page ranks. ACM Trans. Internet Technol. 6, 4 (Nov. 2006), 381-414. • Tsoi, A. C., Morini, G., Scarselli, F., Hagenbuchner, M., and Maggini, M. 2003. Adaptive ranking of web pages. In Proceedings of the 12th International Conference on World Wide Web (Budapest, Hungary, May 20 - 24, 2003). WWW '03. ACM, New York, NY, 356-365. • Baeza-Yates, R., Boldi, P., and Castillo, C. 2006. Generalizing PageRank: damping functions for link-based ranking algorithms. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Seattle, Washington, USA, August 06 - 11, 2006). SIGIR '06. ACM, New York, NY, 308-315. • Mukhopadhyay, D., Giri, D., and Singh, S. R. 2003. An approach to confidence based page ranking for user oriented Web search. SIGMOD Rec. 32, 2 (Jun. 2003), 28-33. • Baeza-Yates, R. and Davis, E. 2004. Web page ranking using link attributes. In Proceedings of the 13th International World Wide Web Conference on Alternate Track Papers & Posters (New York, NY, USA, May 19 - 21, 2004). WWW Alt. '04. ACM, New York, NY, 328-329.