Keyword Search in Databases using PageRank

Keyword Search in Databases using PageRank By Michael Sirivianos April 11, 2003

Roadmap • PageRank: Ranking Web Pages using link structure • Ranking Keyword Search Results in Structured Databases • Ranking Combining Individual PageRanks

PageRank(1) • Stanford project • Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd. “The PageRank Citation Ranking: Bringing Order to the Web”. • Started Google

PageRank (2) • Make use of the link structure of the web to calculate a quality ranking (PageRank) for each web page. • Citation counting is a metric for measuring page/paper quality. • PageRank is a more sophisticated citation-counting method that is harder to manipulate than a plain citation count. • Each page has a unique PageRank, independent of the keyword query. • PageRank does NOT express the relevance of a page to a query.

PageRank (3) • Calculation intuition: the PageRank of page P increases when pages with large PageRanks point to P. • The rank of a page is evenly distributed among its forward links. • A problem: when two pages form a loop by pointing to each other but to no other page, and some other page points into the loop, the loop accumulates rank in every iteration and never distributes it. This is called a rank sink.

PageRank is a Usage Simulation • “Random surfer” • Is given a random URL • Clicks randomly on links • After a while gets bored and gets a new random URL • The PageRank of each page is proportional to the number of visits it receives.
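The random-surfer reading can be tried out directly. The following is a minimal sketch, not the original PageRank implementation: the function name `random_surfer`, the four-page `links` graph, the step count, and the seed are all assumptions chosen for illustration. The surfer follows a random outgoing link with probability d = 0.85 and otherwise jumps to a random page.

```python
import random

def random_surfer(links, steps, d=0.85, seed=0):
    """Count page visits of a surfer who clicks a random link with
    probability d and otherwise gets bored and jumps to a random page."""
    rng = random.Random(seed)            # fixed seed so runs are repeatable
    pages = sorted(links)
    visits = {page: 0 for page in pages}
    current = rng.choice(pages)          # start at a random URL
    for _ in range(steps):
        visits[current] += 1
        if links[current] and rng.random() < d:
            current = rng.choice(links[current])   # click a random link
        else:
            current = rng.choice(pages)            # bored: new random URL
    return visits

# Hypothetical four-page web: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
steps = 200_000
visits = random_surfer(links, steps)
for page, count in sorted(visits.items(), key=lambda kv: -kv[1]):
    print(page, round(count / steps, 3))
```

The printed visit frequencies approximate the PageRanks up to a constant factor; a page that nobody links to, like D here, is reached only through random jumps and ends up with the fewest visits.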

PageRank Calculation • PR(A) = (1-d) + d*(PR(T1)/C(T1) + … + PR(Tn)/C(Tn)) • d: damping factor, normally set to 0.85 • T1, …, Tn: pages pointing to page A • PR(A): PageRank of page A • PR(Ti): PageRank of page Ti • C(Ti): the number of links going out of page Ti • Note: d accounts for rank sinks
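As a small illustration of the formula, the sketch below evaluates PR(A) once for a page with three inbound pages. The helper name `pagerank_of` and the inbound values are assumptions made up for this example.

```python
def pagerank_of(inbound, d=0.85):
    """Evaluate PR(A) = (1-d) + d * sum(PR(Ti) / C(Ti)).

    inbound: list of (PR(Ti), C(Ti)) pairs, one per page Ti pointing to A.
    """
    return (1 - d) + d * sum(pr / out_links for pr, out_links in inbound)

# Hypothetical inbound pages: T1 has PR 1 and 2 outgoing links,
# T2 and T3 have PR 1 and a single outgoing link each.
inbound = [(1.0, 2), (1.0, 1), (1.0, 1)]
pr_a = pagerank_of(inbound)
print(round(pr_a, 3))  # 0.15 + 0.85 * (0.5 + 1 + 1) = 2.275
```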

Example of Calculation (1) • [Figure: link graph of four pages, Page A, Page B, Page C and Page D, with links A→B, A→C, B→C, C→A and D→C]

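Example of Calculation (2) • Every page starts with PageRank 1. • [Figure: rank passed along each link] Page A passes 1*0.85/2 to Page B and 1*0.85/2 to Page C; Page B passes 1*0.85 to Page C; Page C passes 1*0.85 to Page A; Page D passes 1*0.85 to Page C.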

Example of Calculation (3) • Each page has not passed on 0.15, so after the first iteration we get: • Page A: 0.85 (from Page C) + 0.15 (not transferred) = 1 • Page B: 0.425 (from Page A) + 0.15 (not transferred) = 0.575 • Page C: 0.85 (from Page D) + 0.85 (from Page B) + 0.425 (from Page A) + 0.15 (not transferred) = 2.275 • Page D: receives none, but has not transferred 0.15 = 0.15

Example of Calculation (4) • Second iteration: • Page A: 2.275*0.85 (from Page C) + 0.15 (not transferred) = 2.08375 • Page B: 1*0.85/2 (from Page A) + 0.15 (not transferred) = 0.575 • Page C: 0.15*0.85 (from Page D) + 0.575*0.85 (from Page B) + 1*0.85/2 (from Page A) + 0.15 (not transferred) = 1.19125 • Page D: receives none, but has not transferred 0.15 = 0.15

Example - Conclusions • Once the values converge, Page C has the highest PageRank and Page A the next highest (after the second iteration Page A is still ahead): Page C has the highest importance in this page graph! • More iterations lead to convergence of the PageRanks.
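The worked example can be replayed and run to convergence with a short script. This is a sketch under stated assumptions: the `links` graph is read off the contributions listed in the calculation slides (A links to B and C, B to C, C to A, D to C), and the names `iterate` and `history` are made up here.

```python
def iterate(ranks, links, d=0.85):
    """One synchronous PageRank update over the whole graph."""
    new = {page: 1 - d for page in links}           # the part "not transferred"
    for src, targets in links.items():
        for dst in targets:
            new[dst] += d * ranks[src] / len(targets)
    return new

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = {page: 1.0 for page in links}               # every page starts at 1
history = []
for _ in range(100):
    ranks = iterate(ranks, links)
    history.append(ranks)

first, second, final = history[0], history[1], history[-1]
for label, snapshot in (("iteration 1", first), ("iteration 2", second), ("converged", final)):
    print(label, {page: round(value, 5) for page, value in snapshot.items()})
```

The first two snapshots match the hand calculation. The converged values are roughly 1.58 for Page C, 1.49 for Page A, 0.78 for Page B and 0.15 for Page D, so the ordering only settles after a few more iterations.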

Base set • In practice, when the user gets bored, he tends to jump to one of his bookmarked pages instead of a random one. These bookmarked pages constitute the base set. • The PR formula is modified to reflect this behavior: PR(A) = (1-d)*E + d*(PR(T1)/C(T1) + … + PR(Tn)/C(Tn)), where E = 1 if A is in the base set and E = 0 otherwise.
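A minimal sketch of the base-set variant follows. The function name `pagerank_with_base_set`, the example graph, and the choice of base set are assumptions for illustration; the only change from the plain iteration is that the (1-d) term is granted to base-set pages alone.

```python
def pagerank_with_base_set(links, base_set, d=0.85, iterations=200):
    """Iterate PR(A) = (1-d)*E + d * sum(PR(Ti)/C(Ti)), with E = 1 for base-set pages."""
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        # Only base-set pages receive the (1-d) share.
        new = {page: (1 - d) * (1 if page in base_set else 0) for page in links}
        for src, targets in links.items():
            for dst in targets:
                new[dst] += d * ranks[src] / len(targets)
        ranks = new
    return ranks

# Hypothetical graph and bookmarks: the user has bookmarked page B only.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank_with_base_set(links, base_set={"B"})
print({page: round(value, 4) for page, value in ranks.items()})
```

In this example page D drops to rank 0, because it is neither bookmarked nor linked to, which is the behavior the keyword-specific rankings rely on.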

Roadmap • PageRank: Ranking Web Pages using link structure • Ranking Keyword Search Results in Structured Databases • Ranking Combining Individual PageRanks

Keyword Query • Input: set of keywords • Output: List of nodes ranked according to their relevance to the keywords • Score of a result-node: • Sum of keyword-specific PRs (OR semantics) • Product of keyword-specific PRs (AND semantics)
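The two scoring rules can be sketched as follows. The function name `score`, the node ids, and the PR values are hypothetical; a node that has no entry for a keyword is assumed to contribute a PR of 0.

```python
from math import prod

def score(node, keyword_ranks, keywords, semantics="OR"):
    """Combine the keyword-specific PRs of a node: sum for OR, product for AND."""
    values = [keyword_ranks[kw].get(node, 0.0) for kw in keywords]
    return sum(values) if semantics == "OR" else prod(values)

# Hypothetical keyword-specific PRs: keyword -> {node id: PR}.
keyword_ranks = {
    "database": {"p1": 0.5, "p2": 0.25},
    "ranking": {"p1": 0.25, "p3": 0.75},
}
query = ["database", "ranking"]
nodes = ["p1", "p2", "p3"]
for semantics in ("OR", "AND"):
    ranked = sorted(nodes, key=lambda n: score(n, keyword_ranks, query, semantics), reverse=True)
    print(semantics, [(n, score(n, keyword_ranks, query, semantics)) for n in ranked])
```

Under AND semantics a node missing from any keyword's list scores 0, so only p1 remains a meaningful result here, while under OR semantics p3 ties with p1.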

Database Schema • Tuples in C, Y, P, A are objects that represent nodes in the schema graph • Primary-key to foreign-key relations represent edges in the graph • All connections are two-way except P – P, which goes only from a paper to the paper it cites

Architecture • List of: node id, node text, PR wrt all keywords • Attributes of PRindex table: keyword, CLOB of (id, PR) list

Modified PageRank Formula • If A has the keyword: PR(A) = (1-d) + d*(weight(T1→A)*PR(T1)/C(T1) + … + weight(Tn→A)*PR(Tn)/C(Tn)) • If A doesn’t have the keyword: PR(A) = d*(weight(T1→A)*PR(T1)/C(T1) + … + weight(Tn→A)*PR(Tn)/C(Tn))
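The modified formula can be sketched as below. Everything concrete in the example is an assumption: the function name `keyword_pagerank`, the node ids, the edge types, and in particular the edge weights, which the slides do not specify.

```python
def keyword_pagerank(nodes, edges, base_set, weight, d=0.85, iterations=100):
    """Keyword-specific PR with typed, weighted edges.

    edges: list of (source, target, edge_type); weight: edge_type -> float.
    base_set: the nodes that contain the keyword.
    """
    out_degree = {n: 0 for n in nodes}              # C(Ti)
    for src, _dst, _type in edges:
        out_degree[src] += 1
    ranks = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Only nodes that contain the keyword receive the (1-d) term.
        new = {n: (1 - d) if n in base_set else 0.0 for n in nodes}
        for src, dst, edge_type in edges:
            new[dst] += d * weight[edge_type] * ranks[src] / out_degree[src]
        ranks = new
    return ranks

# Hypothetical data: two papers by one author, paper1 cites paper2.
nodes = ["paper1", "paper2", "author1"]
edges = [
    ("paper1", "paper2", "cites"),
    ("paper1", "author1", "written_by"),
    ("paper2", "author1", "written_by"),
    ("author1", "paper1", "wrote"),
    ("author1", "paper2", "wrote"),
]
weight = {"cites": 0.7, "written_by": 0.2, "wrote": 0.2}   # assumed weights
ranks = keyword_pagerank(nodes, edges, base_set={"paper2"}, weight=weight)
print({n: round(v, 4) for n, v in ranks.items()})
```

Here paper2 is the only node containing the keyword; paper1 and author1 still receive a nonzero score because rank flows to them over the weighted edges.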

Preprocessing stage (1) • Load the whole database into memory • Create edges Hashtable ( nodeId, nodeId, type of edge ) • Create nodes Hashtable ( nodeId ) • Create text Hashtable ( nodeId, text ) • For each keyword: • Find all nodes that contain the keyword and put them in the base set. • Execute the PR algorithm with that base set.

Preprocessing stage (2) • Create a list of (nodeid, PR) pairs sorted by descending PR. • Store the list in a CLOB in the PRindex table, indexed by keyword.
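Storing and reading back the per-keyword lists might look like the sketch below. It is not the system's actual storage code: an in-memory SQLite database stands in for the real one, a TEXT column holding JSON stands in for the CLOB, and the column names `keyword` and `pr_list`, the helper names, and the PR values are assumptions.

```python
import json
import sqlite3

def build_pr_index(conn, keyword_ranks):
    """Store one descending (id, PR) list per keyword in the PRindex table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS PRindex (keyword TEXT PRIMARY KEY, pr_list TEXT)"
    )
    for keyword, ranks in keyword_ranks.items():
        ordered = sorted(ranks.items(), key=lambda item: item[1], reverse=True)
        conn.execute(
            "INSERT OR REPLACE INTO PRindex VALUES (?, ?)",
            (keyword, json.dumps(ordered)),
        )
    conn.commit()

def load_pr_list(conn, keyword):
    """Fetch the stored list for a keyword; empty if the keyword is not indexed."""
    row = conn.execute(
        "SELECT pr_list FROM PRindex WHERE keyword = ?", (keyword,)
    ).fetchone()
    return [tuple(pair) for pair in json.loads(row[0])] if row else []

conn = sqlite3.connect(":memory:")
# Hypothetical keyword-specific PRs produced by the preprocessing stage.
build_pr_index(conn, {"database": {"n1": 0.5, "n2": 0.25, "n3": 0.75}})
print(load_pr_list(conn, "database"))
# [('n3', 0.75), ('n1', 0.5), ('n2', 0.25)]
```

Keeping each list sorted at indexing time means the query stage can read it from the top without sorting again.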

Query Stage • For each keyword in the input, retrieve its ( id, PR ) list from the database. • Resolve the top-k ids with respect to the sum of PageRanks using Fagin’s algorithm (PODS 2001).

Fagin’s Algorithm • Input: the keyword-specific PR lists, each sorted in descending order and scanned in parallel. • For every node seen so far, keep its maximum possible value: the sum of the PRs already extracted for it from the scanned lists plus, for each list in which it has not been seen yet, the PR of the node currently pointed to in that list. Also keep its minimum value: the sum of the PRs extracted for it so far. • The algorithm terminates when it has found k nodes whose minimum value is greater than the maximum possible value of every other node.
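The bound-keeping scheme described above can be sketched as follows. This is an illustrative implementation under assumptions, not the system's code: the function name `top_k`, the round-robin scan order, and the example lists are made up, and a node missing from a list is assumed to have PR 0 there (sum, i.e. OR, semantics).

```python
def top_k(lists, k):
    """lists: one descending [(node, pr), ...] list per keyword (at least one list).
    Returns up to k (node, minimum value) pairs, reading the lists top-down only."""
    seen = {}                                   # node -> {list index: PR read so far}
    depth = 0
    longest = max(len(lst) for lst in lists)
    while depth < longest:
        # One round of sorted access: read the next entry of every list.
        for i, lst in enumerate(lists):
            if depth < len(lst):
                node, pr = lst[depth]
                seen.setdefault(node, {})[i] = pr
        depth += 1
        # PR at the current position of each list; 0 once a list is fully read.
        frontier = [lst[depth - 1][1] if depth < len(lst) else 0.0 for lst in lists]
        lower = {n: sum(prs.values()) for n, prs in seen.items()}
        upper = {n: lower[n] + sum(f for i, f in enumerate(frontier) if i not in prs)
                 for n, prs in seen.items()}
        best = sorted(seen, key=lambda n: lower[n], reverse=True)[:k]
        if len(best) == k:
            rest = [upper[n] for n in seen if n not in best]
            rest.append(sum(frontier))          # bound for nodes not seen anywhere yet
            if lower[best[-1]] > max(rest):
                return [(n, lower[n]) for n in best]
    # All lists are exhausted, so every minimum value is now the exact score.
    lower = {n: sum(prs.values()) for n, prs in seen.items()}
    return sorted(lower.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Hypothetical PR lists for two keywords.
lists = [
    [("n1", 0.5), ("n2", 0.25), ("n3", 0.125)],
    [("n1", 0.5), ("n3", 0.25), ("n2", 0.125)],
]
print(top_k(lists, 1))
```

For k = 1 the scan stops after two rounds: n1 already has a minimum value of 1.0, while no other node can exceed 0.5. When the algorithm stops early, the returned values are lower bounds and may be smaller than the exact sums.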

Conclusions • We implemented a system for keyword search in databases using PageRank. • It uses an index of keyword-specific Object Ranks.
