The PageRank Citation Ranking: Bringing Order to the Web
Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd
Presented by Fei Li
Motivation and Introduction
  • Why is page importance rating important?
    – New challenges for information retrieval on the World Wide Web:
      • Huge number of web pages: 150 million by 1998, 1000 billion by 2008
      • Diversity of web pages: different topics, different quality, etc.
  • What is PageRank?
    – A method for rating the importance of web pages objectively and mechanically using the link structure of the web.
The History of PageRank
  • PageRank was developed by Larry Page (hence the name Page-Rank) and Sergey Brin.
  • It was first developed as part of a research project about a new kind of search engine. That project started in 1995 and led to a functional prototype in 1998.
  • Shortly after, Page and Brin founded Google.
  16 billion…
Recent News
  • There has been some news that Google will cancel PageRank.
  • There is a large amount of search engine optimization (SEO) activity.
  • SEO practitioners use various tricks to make a web page appear more important under the PageRank rating.
Link Structure of the Web
  • 150 million web pages → 1.7 billion links
  • Backlinks and forward links:
    – A and B are C’s backlinks
    – C is A and B’s forward link
  • Intuitively, a web page is important if it has a lot of backlinks.
  • What if a web page has only one backlink, but that link comes from www.yahoo.com?
A Simple Version of PageRank
  R(u) = c · Σ_{v ∈ B_u} R(v) / N_v
  • u: a web page
  • B_u: the set of u’s backlinks
  • N_v: the number of forward links of page v
  • c: the normalization factor that makes ||R||_1 = 1, where ||R||_1 = |R_1 + … + R_n|
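The definitions above can be turned into a short iteration: every page v splits its current rank evenly among its N_v forward links, and the result is rescaled by c so the ranks sum to 1. The following is a minimal Python sketch under stated assumptions: the function name `simplified_pagerank`, the uniform starting ranks, and the three-page graph are illustrative and are not the example from the slides.

```python
def simplified_pagerank(forward_links, iterations=50):
    """Iterate R(u) = c * sum(R(v) / N_v for v in B_u) on a small graph.

    forward_links maps each page to the list of pages it links to.
    Assumption: every page has at least one forward link.
    """
    pages = list(forward_links)
    rank = {p: 1.0 / len(pages) for p in pages}  # uniform start, sums to 1
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for v, targets in forward_links.items():
            share = rank[v] / len(targets)  # R(v) / N_v
            for u in targets:
                new_rank[u] += share
        total = sum(new_rank.values())
        # c is the factor that rescales the ranks so that they sum to 1
        rank = {p: r / total for p, r in new_rank.items()}
    return rank


# Hypothetical graph: A links to B and C, B links to C, C links to A.
example_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = simplified_pagerank(example_graph)
# A and C converge to about 0.4 each, B to about 0.2.
```

In this small graph C collects rank from both A and B, so it ends up with the same rank as A even though A is its only forward link.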
An Example of Simplified PageRank
  • PageRank calculation: first iteration
  • PageRank calculation: second iteration
  • Convergence after some iterations
A Problem with Simplified PageRank
  • A loop: during each iteration, the loop accumulates rank but never distributes rank to other pages!
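The loop problem can be observed numerically. The sketch below is illustrative only: the four-page graph and the names `iterate_once` and `loop_share` are assumptions made for this example. Pages X and Y link only to each other, while A leaks rank into the loop.

```python
def iterate_once(rank, forward_links):
    """One step of the simplified model: each page splits its rank among its forward links."""
    new_rank = {p: 0.0 for p in rank}
    for v, targets in forward_links.items():
        for u in targets:
            new_rank[u] += rank[v] / len(targets)
    return new_rank


# Hypothetical graph: A and B link to each other, and A also links into the loop X <-> Y.
# Nothing inside the loop links back out.
graph = {"A": ["B", "X"], "B": ["A"], "X": ["Y"], "Y": ["X"]}
rank = {p: 0.25 for p in graph}

loop_share = []  # total rank held by X and Y after each iteration
for _ in range(40):
    rank = iterate_once(rank, graph)
    loop_share.append(rank["X"] + rank["Y"])
```

The values in `loop_share` never decrease, and after 40 iterations the loop holds almost all of the rank while A and B are left with close to none.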
Random Walks in Graphs
  • The Random Surfer Model
    – The simplified model: the standing probability distribution of a random walk on the graph of the web. The surfer simply keeps clicking successive links at random.
  • The Modified Model
    – The modified model: the “random surfer” simply keeps clicking successive links at random, but periodically “gets bored” and jumps to a random page chosen according to the distribution E.
Modified Version of PageRank
  R'(u) = c · Σ_{v ∈ B_u} R'(v) / N_v + c · E(u)
  • E(u): a distribution over web pages that “users” jump to when they “get bored” of following successive links at random.
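A minimal sketch of the modified model, assuming the widely used damping-style formulation in which the surfer follows a link with probability 1 - `jump_probability` and otherwise jumps to a page drawn from E. The parameter name `jump_probability`, its value 0.15, the uniform E, and the four-page graph are illustrative assumptions and do not come from the slides.

```python
def pagerank_with_jumps(forward_links, E, jump_probability=0.15, iterations=100):
    """Random-surfer iteration with occasional jumps to pages drawn from E.

    E maps each page to its jump probability and is assumed to sum to 1.
    Assumption: every page has at least one forward link.
    """
    pages = list(forward_links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # Rank that arrives through "bored" jumps
        new_rank = {p: jump_probability * E[p] for p in pages}
        # Rank that arrives through ordinary link clicks
        for v, targets in forward_links.items():
            share = (1.0 - jump_probability) * rank[v] / len(targets)
            for u in targets:
                new_rank[u] += share
        rank = new_rank
    return rank


# Hypothetical graph containing a loop (X <-> Y) that would absorb all rank
# under the simplified model.
graph = {"A": ["B", "X"], "B": ["A"], "X": ["Y"], "Y": ["X"]}
uniform_E = {p: 1.0 / len(graph) for p in graph}
ranks = pagerank_with_jumps(graph, uniform_E)
```

With the jump term in place, X and Y still rank highest, but A and B keep a nonzero share because some rank is redistributed through E on every iteration.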
Dangling Links
  • Links that point to any page with no outgoing links
  • Most are pages that have not been downloaded yet
  • Affect the model since it is not clear where their weight should be distributed
  • Do not affect the ranking of any other page directly
  • Can be simply removed before the PageRank calculation and added back afterwards
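One way the removal step could look is sketched below. Removing a dangling page can leave another page without outgoing links, so this sketch repeats the removal until nothing changes; that stopping rule, the function name `remove_dangling`, and the sample graph are assumptions made for illustration.

```python
def remove_dangling(forward_links):
    """Repeatedly drop pages with no outgoing links, plus the links pointing to them.

    Returns the remaining graph and the list of removed pages, so the removed
    pages can be added back after the PageRank calculation.
    """
    graph = {p: list(targets) for p, targets in forward_links.items()}
    # Pages that appear only as link targets have no known outgoing links either.
    for targets in forward_links.values():
        for t in targets:
            graph.setdefault(t, [])
    removed = []
    while True:
        dangling = [p for p, targets in graph.items() if not targets]
        if not dangling:
            break
        removed.extend(dangling)
        for p in dangling:
            del graph[p]
        # Drop links that pointed at the pages just removed
        graph = {p: [t for t in targets if t in graph] for p, targets in graph.items()}
    return graph, removed


# Hypothetical crawl: D was linked to but never downloaded, so it has no outgoing links.
crawl = {"A": ["B", "C"], "B": ["A", "D"], "C": ["D"]}
core, removed = remove_dangling(crawl)
```

Here D is removed first, which leaves C without outgoing links, so C is removed in the next pass and only A and B remain.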
PageRank Implementation
  • Convert each URL into a unique integer and store each hyperlink in a database using the integer IDs to identify pages
  • Sort the link structure by ID
  • Remove all the dangling links from the database
  • Make an initial assignment of ranks and start iteration
    – Choosing a good initial assignment can speed up the PageRank computation
  • Add the dangling links back
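The first two steps above (integer IDs and a sorted link structure) might be sketched as follows. An in-memory dictionary and list stand in for the database, and the function name `build_link_table` and the example.com URLs are hypothetical.

```python
def build_link_table(hyperlinks):
    """Assign each URL a unique integer ID and return the links as sorted ID pairs."""
    url_to_id = {}

    def get_id(url):
        # IDs are handed out in order of first appearance
        if url not in url_to_id:
            url_to_id[url] = len(url_to_id)
        return url_to_id[url]

    links = sorted((get_id(src), get_id(dst)) for src, dst in hyperlinks)
    return url_to_id, links


# Hypothetical hyperlinks given as (source URL, destination URL) pairs
hyperlinks = [
    ("http://example.com/a", "http://example.com/b"),
    ("http://example.com/b", "http://example.com/a"),
    ("http://example.com/a", "http://example.com/c"),
]
url_to_id, links = build_link_table(hyperlinks)
```

Sorting by source ID groups all forward links of a page together, which makes it easy to count N_v and stream through the link table during each iteration.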
Convergence Property
  • PR (322 million links): 52 iterations
  • PR (161 million links): 45 iterations
  • Scaling factor is roughly linear in log n
Convergence Property
  • The Web is an expander-like graph
    – Theory of random walks: a random walk on a graph is said to be rapidly-mixing if it quickly converges to a limiting distribution on the set of nodes in the graph. A random walk is rapidly-mixing on a graph if and only if the graph is an expander graph.
    – Expander graph: every subset of nodes S has a neighborhood (the set of vertices accessible via outedges emanating from nodes in S) that is larger than some factor α times |S|. A graph has a good expansion factor if and only if the largest eigenvalue is sufficiently larger than the second-largest eigenvalue.
Searching with PageRank
  • Two search engines:
    – Title-based search engine
    – Full text search engine
  • Title-based search engine
    – Searches only the “Titles”
    – Finds all the web pages whose titles contain all the query words
    – Sorts the results by PageRank
    – Very simple and cheap to implement
    – Title match ensures high precision, and PageRank ensures high quality
  • Full text search engine
    – Called Google
    – Examines all the words in every stored document and also performs PageRank (Rank Merging)
    – More precise but more complicated
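The title-based engine described above is simple enough to sketch in a few lines. This is an illustration, not the authors' implementation: the function name `title_search`, the whitespace tokenization, and the sample titles and rank values are all made up for the example.

```python
def title_search(query, pages, pagerank):
    """Return URLs whose titles contain every query word, highest PageRank first.

    pages maps URL -> title; pagerank maps URL -> precomputed rank.
    """
    words = query.lower().split()
    matches = [
        url
        for url, title in pages.items()
        if all(w in title.lower().split() for w in words)
    ]
    return sorted(matches, key=lambda url: pagerank[url], reverse=True)


# Hypothetical index and ranks
pages = {
    "http://example.com/1": "Stanford University Home",
    "http://example.com/2": "University Rankings",
    "http://example.com/3": "Stanford University Libraries",
}
pagerank = {
    "http://example.com/1": 0.2,
    "http://example.com/2": 0.6,
    "http://example.com/3": 0.5,
}
results = title_search("stanford university", pages, pagerank)
```

Page 2 has the highest rank but is filtered out because its title lacks "stanford". The two remaining matches are ordered by rank, so page 3 comes before page 1.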
Personalized PageRank
  • An important component of the PageRank calculation is E
    – A vector over the web pages (used as a source of rank)
    – A powerful parameter for adjusting the page ranks
  • The E vector corresponds to the distribution of web pages that a random surfer periodically jumps to
  • In Personalized PageRank, E instead consists of a single web page
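To illustrate how a single-page E shifts the ranking, the sketch below runs the same random-surfer iteration twice with different source pages. It assumes a damping-style formulation with a hypothetical `jump_probability` of 0.15, and the three-page site and the function name `personalized_pagerank` are invented for the example.

```python
def personalized_pagerank(forward_links, source_page, jump_probability=0.15, iterations=100):
    """Random-surfer iteration in which every "bored" jump lands on source_page.

    Assumption: every page has at least one forward link.
    """
    pages = list(forward_links)
    # E puts all of its weight on the single chosen page
    E = {p: (1.0 if p == source_page else 0.0) for p in pages}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: jump_probability * E[p] for p in pages}
        for v, targets in forward_links.items():
            share = (1.0 - jump_probability) * rank[v] / len(targets)
            for u in targets:
                new_rank[u] += share
        rank = new_rank
    return rank


# Hypothetical three-page site
site = {"home": ["news", "blog"], "news": ["home"], "blog": ["home", "news"]}
from_blog = personalized_pagerank(site, "blog")
from_news = personalized_pagerank(site, "news")
```

The link structure is identical in both runs, yet "blog" ranks noticeably higher in `from_blog` than in `from_news`, since the chosen page and the pages it links to receive the rank injected through E.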
PageRank vs. Web Traffic
  • Some highly accessed web pages have low PageRank, possibly because
    – People do not want to link to these pages from their own web pages (the example in their paper is pornographic sites…)
    – Some important backlinks are omitted
  • Use usage data as a start vector for PageRank.
Conclusion
  • PageRank is a global ranking of all web pages based on their locations in the web graph structure
  • PageRank uses information which is external to the web pages: backlinks
  • Backlinks from important pages are more significant than backlinks from average pages
  • The structure of the web graph is very useful for information retrieval tasks.