CMPE 493 INTRODUCTION TO INFORMATION RETRIEVAL
PERSONALIZED QUERY EXPANSION FOR THE WEB
Chirita, P.A., Firan, C.S., and Nejdl, W. SIGIR, 2007, pp. 7-14
Bahtiyar Kaba 2007102824
Introduction
• Aim: improve search output by expanding the query, exploiting the user's PIR (Personal Information Repository).
• Why?
• Inherent ambiguity of short queries.
• Ex: "language ambiguity" => a computer scientist and a linguistics scientist are probably searching for different things.
• So, help them formulate a better query by expansion: "language ambiguity in computing".
• The latter term is found by investigating the user's desktop (PIR).
• Studies show 80% of users prefer personalized outputs for their search.
What will we use?
• The personal collection of all documents: text documents, emails, cached Web pages, etc.
• By personalizing this way, we have 2 advantages:
• Better description of the user's interests, since there is a large amount of information.
• Privacy: "profile" information is extracted and exploited locally; we do not need to track the URLs clicked or the queries issued.
Algorithms
• Local desktop query context:
• Determine expansion terms from those personal documents matching the query best.
• Keyword-, expression-, and summary-based techniques.
• Global desktop collection:
• Investigate expansions based on co-occurrence metrics and external thesauri across the entire personal directory.
• Before the details of these, a glance at previous work.
Previous Work
• Two IR research areas: Search Personalization and Automatic Query Expansion.
• There are many algorithms for both domains, but not as many for combining them.
• Personalized search: ranking search results according to user profiles (e.g., by means of past history).
• Query Expansion: derive a better formulation of the query to enhance retrieval, based on exploiting social or collection-specific characteristics.
Personalized Search
• Two major components:
• User Profiles: generated from features of the visited pages.
• Topic preference vectors -> Topic-Sensitive PageRank.
• Advantage of being easy to obtain and process.
• But this may not suffice to obtain a good understanding of the user's interests, and it raises privacy concerns.
• The personalization algorithm itself:
• Topic-oriented PageRank: compute PageRank vectors per topic, then bias the results according to these vectors and the search term similarity.
Query Expansion
• Relevance Feedback:
• Useful information for the expansion terms can be extracted from the relevant documents returned.
• Extract such keywords based on term frequency, document frequency, or summarization of top-ranked documents.
• Co-occurrence:
• Terms highly co-occurring together were shown to increase precision. Assess term relationship levels.
• Thesaurus:
• Expand the query with new terms having close meanings.
• These can be extracted from a large thesaurus, e.g., WordNet.
Query Expansion with PIR
• We have a rich personal collection, but the data is very unstructured in format, content, etc.
• So, we analyze the PIR at various granularity levels, from term frequency within desktop documents to global co-occurrence statistics.
• Then an empirical analysis of the algorithms is presented.
Local Desktop Analysis
• Similar to the relevance feedback method for query expansion, but this time we use the PIR best hits.
• Investigated at 3 granularity levels:
• Term and document frequency:
• Advantage of being fast to compute, as we have a previous offline computation.
• Independently associate a score with each term based on two statistics.
Local Desktop Analysis
• Term Frequency:
• Use actual frequency information and the position where the term first appears.
• TermScore = [1/2 + 1/2 * (nrWords - pos) / nrWords] * log(1 + TF)
• Position information is used because more informative terms appear earlier in the document.
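A minimal sketch of this scoring in Python (function and variable names are my own; the slide only gives the formula):

```python
import math

def term_score(tf, pos, nr_words):
    """Score a term from a relevant desktop document.

    tf       -- term frequency within the document
    pos      -- position of the term's first appearance (0-based)
    nr_words -- total number of words in the document
    Follows TermScore = [1/2 + 1/2 * (nrWords - pos) / nrWords] * log(1 + TF):
    an earlier first appearance and a higher frequency both raise the score.
    """
    position_weight = 0.5 + 0.5 * (nr_words - pos) / nr_words
    return position_weight * math.log(1 + tf)
```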
Local Desktop Analysis
• Document Frequency:
• Given the set of top-k relevant documents, generate snippets focused on the original search request, then order terms by their DF scores.
• Focusing on the query is necessary since DF scores are calculated over the entire PIR.
• TFxIDF weighting may not be good for local desktop analysis, since a term with high DF on the desktop may be rare on the web.
• Ex: "PageRank" may have a high DF in an IR scientist's PIR, giving it a low TFxIDF score, while it discriminates well on the web.
Local Desktop Analysis
• Lexical Dispersion Hypothesis: an expression's lexical dispersion can be used to identify key concepts.
• Compound expressions of the form {adjective? noun+}.
• Generate such compound expressions offline and use them for query expansion at runtime (see the sketch below).
• Further improvements by ordering according to lexical dispersion.
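A rough illustration of extracting {adjective? noun+} compounds with off-the-shelf POS tagging. NLTK is my choice here, not the paper's, and the snippet assumes the tokenizer and tagger models are downloaded:

```python
import nltk  # assumes nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')

def lexical_compounds(text):
    """Collect compound expressions matching the {adjective? noun+} pattern:
    an optional adjective followed by one or more nouns."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    compounds, current = [], []
    for word, tag in tagged:
        if tag.startswith('JJ') and not current:
            current = [word]                 # optional leading adjective
        elif tag.startswith('NN'):
            current.append(word)             # one or more nouns
        else:
            if len(current) >= 2:            # keep multi-word expressions only
                compounds.append(' '.join(current))
            current = []
    if len(current) >= 2:
        compounds.append(' '.join(current))
    return compounds
```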
Local Desktop Analysis
• Summarization:
• The set of relevant desktop documents is identified.
• Then a summary containing the most important sentences is generated as output.
• Most comprehensive output, but not efficient, as it cannot be computed offline.
• Rank sentences according to their salience scores, computed as follows:
Local Desktop Analysis
• Summarization:
• SalienceScore = SW^2 / TW + PS + TQ^2 / NQ
• SW: the number of significant terms; a term is significant if its TF is above a threshold value ms:
• ms = 7 - 0.1 * (25 - NS), if NS < 25; 7, if 25 <= NS <= 40; 7 + 0.1 * (NS - 40), if NS > 40
• PS: position score = (Avg(NS) - SentenceIndex) / Avg(NS)^2
• Scaling it this way, short documents are not affected, as they do not have summaries at the beginning.
• The final term balances towards the original query: the more query terms a sentence contains, the more related it is.
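A small sketch of this sentence scoring with the piecewise threshold written out (the names are mine; the slide defines only the formulas):

```python
def significance_threshold(ns):
    """Threshold ms on term frequency, as a function of the number of
    sentences NS in the document (piecewise definition from the slide)."""
    if ns < 25:
        return 7 - 0.1 * (25 - ns)
    if ns <= 40:
        return 7
    return 7 + 0.1 * (ns - 40)

def position_score(sentence_index, avg_ns):
    """PS = (Avg(NS) - SentenceIndex) / Avg(NS)^2."""
    return (avg_ns - sentence_index) / (avg_ns ** 2)

def salience_score(sw, tw, ps, tq, nq):
    """SalienceScore = SW^2 / TW + PS + TQ^2 / NQ, where SW/TW are the
    significant and total words of the sentence, and TQ/NQ are the query
    terms present in the sentence and the query length."""
    return (sw ** 2) / tw + ps + (tq ** 2) / nq
```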
Global Desktop Analysis
• The previous techniques were based on the documents relevant to the query.
• Now, we rely on information across the entire PIR of the user.
• We have two techniques:
• Co-occurrence statistics
• Thesaurus-based expansion
Global Desktop Analysis
• Co-occurrence statistics: for each term, we compute the terms co-occurring most frequently with it in our PIR collection, then use this information at runtime to expand our queries.
Global Desktop Analysis
• Algorithm:
• Off-line computation:
• 1: Filter potential keywords k with DF in [10, ..., 20% * N]
• 2: For each keyword ki
• 3: For each keyword kj
• 4: Compute SC(ki, kj), the similarity coefficient of (ki, kj)
• On-line computation:
• 1: Let S be the set of keywords potentially similar to an input expression E.
• 2: For each keyword k of E:
• 3: S <- S ∪ TSC(k), where TSC(k) contains the Top-K terms most similar to k
• 4: For each term t of S:
• 5a: Let Score(t) <- Π_{k in E} (0.01 + SC(t, k))
• 5b: Let Score(t) <- #DesktopHits(E|t)
• 6: Select the Top-K terms of S with the highest scores.
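A hedged Python sketch of this pipeline, using document frequencies and the product-based scoring of step 5a; the DF filter of the off-line step and the TSC(k) candidate restriction are left out for brevity, and all names are my own:

```python
import math
from collections import defaultdict
from itertools import combinations

def cooccurrence_counts(documents):
    """Off-line pass: document frequencies of single terms and of term pairs.
    `documents` is a list of token lists from the desktop index."""
    df, pair_df = defaultdict(int), defaultdict(int)
    for doc in documents:
        terms = set(doc)
        for t in terms:
            df[t] += 1
        for a, b in combinations(sorted(terms), 2):
            pair_df[(a, b)] += 1
    return df, pair_df

def cosine_sc(a, b, df, pair_df):
    """Cosine-style similarity coefficient between two keywords."""
    joint = pair_df.get(tuple(sorted((a, b))), 0)
    return joint / math.sqrt(df[a] * df[b]) if df[a] and df[b] else 0.0

def expand(query_terms, vocabulary, df, pair_df, top_k=4):
    """On-line step 5a: score each candidate by the product of
    (0.01 + SC(t, k)) over all query keywords, keep the Top-K."""
    scores = {}
    for t in vocabulary:
        if t in query_terms:
            continue
        score = 1.0
        for k in query_terms:
            score *= 0.01 + cosine_sc(t, k, df, pair_df)
        scores[t] = score
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```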
Global Desktop Analysis
• We have each term's correlated terms calculated offline. At run time, we need to calculate the correlation of every candidate term with the entire query. Two approaches:
• The product of the correlations between the term and all query keywords (step 5a).
• The number of documents in which the proposed term co-occurs with the entire query (step 5b).
• Similarity coefficients are calculated using:
• Cosine similarity (correlation coefficient)
• Mutual information
• Likelihood ratio
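For reference, common document-frequency-based definitions of the latter two coefficients; the paper may use slightly different variants, so treat these as one standard reading:

```python
import math

def mutual_information(df_a, df_b, df_ab, n_docs):
    """Pointwise mutual information estimated from document frequencies."""
    if df_ab == 0 or df_a == 0 or df_b == 0:
        return 0.0
    return math.log(n_docs * df_ab / (df_a * df_b))

def likelihood_ratio(df_a, df_b, df_ab, n_docs):
    """Dunning-style log-likelihood ratio over the 2x2 contingency table
    of documents containing term a and/or term b."""
    if n_docs == 0:
        return 0.0
    k = [[df_ab, df_a - df_ab],
         [df_b - df_ab, n_docs - df_a - df_b + df_ab]]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            obs = k[i][j]
            if obs > 0:
                expected = (k[i][0] + k[i][1]) * (k[0][j] + k[1][j]) / n_docs
                g2 += obs * math.log(obs / expected)
    return 2 * g2
```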
Global Desktop Analysis
• Thesaurus-Based Expansion:
• Identify the set of terms related to the query terms (using thesaurus information), then calculate the co-occurrence level of each possible expansion (i.e., the original search query together with the new term). Select the ones with the highest frequency.
Thesaurus-Based Expansion
• 1: For each keyword k of an input query Q:
• 2: Select the following sets of related terms:
• 2a: Syn: all synonyms
• 2b: Sub: all sub-concepts residing one level below k
• 2c: Super: all super-concepts residing one level above k
• 3: For each set Si of the above-mentioned sets:
• 4: For each term t of Si:
• 5: Search the PIR with (Q|t), i.e., the original query expanded with t
• 6: Let H be the number of hits of the above search (i.e., the co-occurrence level of t with Q)
• 7: Return the Top-K terms as ordered by their H values.
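A rough sketch of this procedure using NLTK's WordNet interface; it assumes the WordNet corpus is installed, and `desktop_hits` is a hypothetical stand-in for the PIR search of step 5:

```python
from nltk.corpus import wordnet as wn  # assumes nltk.download('wordnet')

def related_terms(keyword):
    """Synonyms, sub-concepts (hyponyms) and super-concepts (hypernyms)
    of a keyword, mirroring steps 2a-2c above."""
    syn, sub, sup = set(), set(), set()
    for synset in wn.synsets(keyword):
        syn.update(l.name().replace('_', ' ') for l in synset.lemmas())
        for hypo in synset.hyponyms():
            sub.update(l.name().replace('_', ' ') for l in hypo.lemmas())
        for hyper in synset.hypernyms():
            sup.update(l.name().replace('_', ' ') for l in hyper.lemmas())
    syn.discard(keyword)
    return syn, sub, sup

def thesaurus_expand(query_terms, desktop_hits, top_k=4):
    """Rank candidate expansions by their number of desktop hits when
    appended to the original query; `desktop_hits(query_terms, t)` is a
    placeholder for searching the PIR with (Q|t)."""
    scored = {}
    for k in query_terms:
        for term_set in related_terms(k):
            for t in term_set:
                scored[t] = desktop_hits(query_terms, t)
    return sorted(scored, key=scored.get, reverse=True)[:top_k]
```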
Experiments
• 18 subjects indexed their content under their selected paths: emails, documents, web cache.
• Types of queries:
• Random log query, hitting 10 documents in the PIR.
• Self-selected specific query, which the subject thinks has one meaning.
• Self-selected ambiguous query, which the subject thinks has more than one meaning.
• We set the number of expansion terms to 4.
Experiments
• Measure
• Discounted Cumulative Gain:
• DCG(i) = G(1), if i = 1; DCG(i-1) + G(i) / log(i), otherwise.
• Gives more weight to highly ranked documents and incorporates relevance levels.
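A direct transcription of this recursion; log base 2 is assumed here, the usual choice for DCG, and `gains` holds the per-rank relevance grades G(i):

```python
import math

def dcg(gains):
    """DCG(1) = G(1); DCG(i) = DCG(i-1) + G(i) / log2(i) for i > 1."""
    total = 0.0
    for i, g in enumerate(gains, start=1):
        total += g if i == 1 else g / math.log2(i)
    return total

# Example: relevance grades 2 (highly relevant), 1, 0, 1 for ranks 1..4:
# dcg([2, 1, 0, 1]) == 2 + 1/1 + 0 + 1/2 == 3.5
```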
Experiments
• Labels used in the following results tables:
• Google: actual Google results
• TF, DF: term and document frequency, as described
• LC, LC[O]: regular and optimized lexical compounds
• SS: sentence selection (summarization)
• TC[CS], TC[MI], TC[LR]: term co-occurrence statistics with cosine similarity, mutual information, and likelihood ratio, respectively
• WN[SYN], WN[SUB], WN[SUP]: WordNet-based thesaurus expansion with synonyms, sub-concepts, and super-concepts, respectively
Results
• For log queries, the best performance is achieved with TF, LC[O], and TC[LR].
• We get good results with simple keyword- and expression-oriented techniques (TF, LC[O]), whereas more complicated ones do not show significant improvements.
• For unambiguous self-selected queries, we do not see much improvement, but for ambiguous ones we have a clear benefit.
• For clear (unambiguous) queries, decreasing the number of expansion terms can bring further improvements: the idea behind adaptive algorithms.
Adaptivity
• An optimal personalized query expansion algorithm should adapt itself to the initial query.
• How should we measure this, i.e., how much personal data should be fed into our search?
• Query Length:
• The number of words in the user query; not effective -> there are both short and long complicated queries.
• Query Scope:
• IDF of the entire query:
• log(#documents in collection / #hits for query)
• Performs well when the collection is focused on a single topic.
• Query Clarity:
• Measures the divergence between the language model of the query and the language model of the collection (PIR):
• Σ P(w | Query) * log(P(w | Query) / P(w)), where w is a word in the query, P(w | Query) is the probability of the word in the query, and P(w) is its probability in the entire collection.
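Hedged sketches of the two collection-dependent measures, using simple maximum-likelihood language models; this is my own minimal reading of the formulas above:

```python
import math
from collections import Counter

def query_scope(n_docs_in_collection, n_hits_for_query):
    """Query scope = log(#documents in collection / #hits for query)."""
    return math.log(n_docs_in_collection / max(n_hits_for_query, 1))

def query_clarity(query_tokens, collection_tokens):
    """Sum over query words of P(w|Query) * log(P(w|Query) / P(w)), i.e.
    the divergence between the query and collection language models."""
    q, c = Counter(query_tokens), Counter(collection_tokens)
    q_total, c_total = sum(q.values()), sum(c.values())
    clarity = 0.0
    for w, count in q.items():
        p_wq = count / q_total
        p_w = c.get(w, 0) / c_total
        if p_w > 0:
            clarity += p_wq * math.log(p_wq / p_w)
    return clarity
```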
• Calculate "scope" for the PIR and "clarity" for the web.
• We will use LC[O] (best performance in the previous experiment), plus TF and WN[SYN], which produced good first and second expansion terms.
• Tailor the number of expansion terms as a function of the query's ambiguity, as captured by its scope in the PIR and its clarity on the web.
• The scores for combinations of scope and clarity levels are as follows:
Experiments
• A similar approach was taken as in the previous experiments.
• For top log queries, there is an improvement over Google and even over the static methods (term number = 4).
• For random queries, again better results than Google, but behind the static methods. We may need a better selection of the number of expansion terms.
• For self-selected queries:
• A clear improvement for ambiguous queries.
• A slight performance increase for clear queries.
• The results suggest adaptivity is a further step for research in web search personalization.
Conclusion
• Five techniques for determining expansion terms generated from personal documents.
• The empirical analysis shows a 51.28% improvement.
• Further work to adapt search queries.
• An additional improvement of 8.47%.
Further Work
• Investigations on how to optimally select the number of expansion terms.
• Other query expansion suggestion approaches: Latent Semantic Analysis.