CMPE 493 INTRODUCTION TO INFORMATION RETRIEVAL
PERSONALIZED QUERY EXPANSION FOR THE WEB
Chirita, P.A., Firan, C.S., and Nejdl, W. SIGIR, 2007, pp. 7-14
Bahtiyar Kaba 2007102824
Introduction
• Aim: improve search output by expanding the query, exploiting the user's PIR (Personal Information Repository).
• Why?
• Inherent ambiguity of short queries.
• Ex: "language ambiguity" => a computer scientist and a linguistics scientist are probably searching for different things.
• So, help them formulate a better query by expansion: "language ambiguity in computing".
• The latter term is found by investigating the user's desktop (PIR).
• Studies show 80% of users prefer personalized outputs for their search.
What will we use?
• The personal collection of all documents: text documents, emails, cached Web pages, etc.
• By personalizing this way, we have 2 advantages:
• Better description of the user's interests, since there is a large amount of information.
• Privacy: "profile" information is extracted and exploited locally; we do not need to track the URLs clicked or the queries issued.
Algorithms
• Local desktop query context:
• Determine expansion terms from those personal documents matching the query best.
• Keyword-, expression-, and summary-based techniques.
• Global desktop collection:
• Investigate expansions based on co-occurrence metrics and external thesauri across the entire personal directory.
• Before the details of these, a glance at previous work.
Previous Work
• Two IR research areas: Search Personalization and Automatic Query Expansion.
• There are many algorithms for both domains, but not as many for combining them.
• Personalized search: ranking search results according to user profiles (e.g., by means of past history).
• Query Expansion: derive a better formulation of the query to enhance retrieval, based on exploiting social or collection-specific characteristics.
Personalized Search
• Two major components:
• User Profiles: generated from features of the visited pages.
• Topic preference vectors -> Topic-Sensitive PageRank.
• Advantage of being easy to obtain and process.
• But this may not suffice to obtain a good understanding of the user's interests, and it raises privacy concerns.
• The personalization algorithm itself:
• Topic-oriented PageRank: compute PageRank vectors per topic, then bias the results according to these vectors and the search term similarity.
Query Expansion
• Relevance Feedback:
• Useful information for the expansion terms can be extracted from the relevant documents returned.
• Extract such keywords based on term frequency, document frequency, or summarization of top-ranked documents.
• Co-occurrence:
• Terms highly co-occurring together were shown to increase precision. Assess term relationship levels.
• Thesaurus:
• Expand the query with new terms having close meanings.
• These can be extracted from a large thesaurus, e.g., WordNet.
Query Expansion with PIR
• We have a rich personal collection, but the data is very unstructured in format, content, etc.
• So, we analyze the PIR at various granularity levels, from term frequency within desktop documents to global co-occurrence statistics.
• Then an empirical analysis of the algorithms is presented.
Local Desktop Analysis
• Similar to the relevance feedback method for query expansion, but this time we use the PIR best hits.
• Investigated at 3 granularity levels:
• Term and document frequency:
• Advantage of being fast to compute, as we have a previous offline computation.
• Independently associate a score with each term based on two statistics.
Local Desktop Analysis
• Term Frequency:
• Use actual frequency information and the position where the term first appears.
• TermScore = [1/2 + 1/2 * (nrWords - pos) / nrWords] * log(1 + TF)
• Position information is used because more informative terms appear earlier in the document.
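A minimal sketch of this scoring in Python (function and variable names are my own; the slide only gives the formula):

```python
import math

def term_score(tf, pos, nr_words):
    """Score a term from a relevant desktop document.

    tf       -- term frequency within the document
    pos      -- position of the term's first appearance (0-based)
    nr_words -- total number of words in the document
    Follows TermScore = [1/2 + 1/2 * (nrWords - pos) / nrWords] * log(1 + TF):
    an earlier first appearance and a higher frequency both raise the score.
    """
    position_weight = 0.5 + 0.5 * (nr_words - pos) / nr_words
    return position_weight * math.log(1 + tf)
```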
Local Desktop Analysis
• Document Frequency:
• Given the set of top-k relevant documents, generate snippets focused on the original search request, then order terms by their DF scores.
• Focusing on the query is necessary since DF scores are calculated over the entire PIR.
• TFxIDF weighting may not be good for local desktop analysis, since a term with high DF on the desktop may be rare on the web.
• Ex: "PageRank" may have a high DF in an IR scientist's PIR, giving it a low TFxIDF score, while it discriminates well on the web.
Local Desktop Analysis
• Lexical Dispersion Hypothesis: an expression's lexical dispersion can be used to identify key concepts.
• Compound expressions of the form {adjective? noun+}.
• Generate such compound expressions offline and use them for query expansion at runtime (see the sketch below).
• Further improvements by ordering according to lexical dispersion.
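A rough illustration of extracting {adjective? noun+} compounds with off-the-shelf POS tagging. NLTK is my choice here, not the paper's, and the snippet assumes the tokenizer and tagger models are downloaded:

```python
import nltk  # assumes nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')

def lexical_compounds(text):
    """Collect compound expressions matching the {adjective? noun+} pattern:
    an optional adjective followed by one or more nouns."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    compounds, current = [], []
    for word, tag in tagged:
        if tag.startswith('JJ') and not current:
            current = [word]                 # optional leading adjective
        elif tag.startswith('NN'):
            current.append(word)             # one or more nouns
        else:
            if len(current) >= 2:            # keep multi-word expressions only
                compounds.append(' '.join(current))
            current = []
    if len(current) >= 2:
        compounds.append(' '.join(current))
    return compounds
```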
Local Desktop Analysis
• Summarization:
• The set of relevant desktop documents is identified.
• Then a summary containing the most important sentences is generated as output.
• Most comprehensive output, but not efficient, as it cannot be computed offline.
• Rank sentences according to their salience scores, computed as follows:
Local Desktop Analysis
• Summarization:
• SalienceScore = SW^2 / TW + PS + TQ^2 / NQ
• SW: the number of significant terms; a term is significant if its TF is above a threshold value ms:
• ms = 7 - 0.1 * (25 - NS), if NS < 25; 7, if 25 <= NS <= 40; 7 + 0.1 * (NS - 40), if NS > 40
• PS: position score = (Avg(NS) - SentenceIndex) / Avg(NS)^2
• Scaling it this way, short documents are not affected, as they do not have summaries at the beginning.
• The final term balances towards the original query: the more query terms a sentence contains, the more related it is.
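A small sketch of this sentence scoring with the piecewise threshold written out (the names are mine; the slide defines only the formulas):

```python
def significance_threshold(ns):
    """Threshold ms on term frequency, as a function of the number of
    sentences NS in the document (piecewise definition from the slide)."""
    if ns < 25:
        return 7 - 0.1 * (25 - ns)
    if ns <= 40:
        return 7
    return 7 + 0.1 * (ns - 40)

def position_score(sentence_index, avg_ns):
    """PS = (Avg(NS) - SentenceIndex) / Avg(NS)^2."""
    return (avg_ns - sentence_index) / (avg_ns ** 2)

def salience_score(sw, tw, ps, tq, nq):
    """SalienceScore = SW^2 / TW + PS + TQ^2 / NQ, where SW/TW are the
    significant and total words of the sentence, and TQ/NQ are the query
    terms present in the sentence and the query length."""
    return (sw ** 2) / tw + ps + (tq ** 2) / nq
```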
Global Desktop Analysis
• The previous techniques were based on the documents relevant to the query.
• Now, we rely on information across the entire PIR of the user.
• We have two techniques:
• Co-occurrence statistics
• Thesaurus-based expansion
Global Desktop Analysis
• Co-occurrence statistics: for each term, we compute the terms co-occurring most frequently with it in our PIR collection, then use this information at runtime to expand our queries.
Global Desktop Analysis
• Algorithm:
• Off-line computation:
• 1: Filter potential keywords k with DF in [10, ..., 20% * N]
• 2: For each keyword ki
• 3: For each keyword kj
• 4: Compute SC(ki, kj), the similarity coefficient of (ki, kj)
• On-line computation:
• 1: Let S be the set of keywords potentially similar to an input expression E.
• 2: For each keyword k of E:
• 3: S <- S ∪ TSC(k), where TSC(k) contains the Top-K terms most similar to k
• 4: For each term t of S:
• 5a: Let Score(t) <- Π_{k in E} (0.01 + SC(t, k))
• 5b: Let Score(t) <- #DesktopHits(E|t)
• 6: Select the Top-K terms of S with the highest scores.
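A hedged Python sketch of this pipeline, using document frequencies and the product-based scoring of step 5a; the DF filter of the off-line step and the TSC(k) candidate restriction are left out for brevity, and all names are my own:

```python
import math
from collections import defaultdict
from itertools import combinations

def cooccurrence_counts(documents):
    """Off-line pass: document frequencies of single terms and of term pairs.
    `documents` is a list of token lists from the desktop index."""
    df, pair_df = defaultdict(int), defaultdict(int)
    for doc in documents:
        terms = set(doc)
        for t in terms:
            df[t] += 1
        for a, b in combinations(sorted(terms), 2):
            pair_df[(a, b)] += 1
    return df, pair_df

def cosine_sc(a, b, df, pair_df):
    """Cosine-style similarity coefficient between two keywords."""
    joint = pair_df.get(tuple(sorted((a, b))), 0)
    return joint / math.sqrt(df[a] * df[b]) if df[a] and df[b] else 0.0

def expand(query_terms, vocabulary, df, pair_df, top_k=4):
    """On-line step 5a: score each candidate by the product of
    (0.01 + SC(t, k)) over all query keywords, keep the Top-K."""
    scores = {}
    for t in vocabulary:
        if t in query_terms:
            continue
        score = 1.0
        for k in query_terms:
            score *= 0.01 + cosine_sc(t, k, df, pair_df)
        scores[t] = score
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```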
Global Desktop Analysis
• We have each term's correlated terms calculated offline. At run time, we need to calculate the correlation of every candidate term with the entire query. Two approaches:
• The product of the correlations between the term and all query keywords (step 5a).
• The number of documents in which the proposed term co-occurs with the entire query (step 5b).
• Similarity coefficients are calculated using:
• Cosine similarity (correlation coefficient)
• Mutual information
• Likelihood ratio
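For reference, common document-frequency-based definitions of the latter two coefficients; the paper may use slightly different variants, so treat these as one standard reading:

```python
import math

def mutual_information(df_a, df_b, df_ab, n_docs):
    """Pointwise mutual information estimated from document frequencies."""
    if df_ab == 0 or df_a == 0 or df_b == 0:
        return 0.0
    return math.log(n_docs * df_ab / (df_a * df_b))

def likelihood_ratio(df_a, df_b, df_ab, n_docs):
    """Dunning-style log-likelihood ratio over the 2x2 contingency table
    of documents containing term a and/or term b."""
    if n_docs == 0:
        return 0.0
    k = [[df_ab, df_a - df_ab],
         [df_b - df_ab, n_docs - df_a - df_b + df_ab]]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            obs = k[i][j]
            if obs > 0:
                expected = (k[i][0] + k[i][1]) * (k[0][j] + k[1][j]) / n_docs
                g2 += obs * math.log(obs / expected)
    return 2 * g2
```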
Global Desktop Analysis
• Thesaurus-Based Expansion:
• Identify the set of terms related to the query terms (using thesaurus information), then calculate the co-occurrence level of each possible expansion (i.e., the original search query together with the new term). Select the ones with the highest frequency.
Thesaurus-Based Expansion
• 1: For each keyword k of an input query Q:
• 2: Select the following sets of related terms:
• 2a: Syn: all synonyms
• 2b: Sub: all sub-concepts residing one level below k
• 2c: Super: all super-concepts residing one level above k
• 3: For each set Si of the above-mentioned sets:
• 4: For each term t of Si:
• 5: Search the PIR with (Q|t), i.e., the original query expanded with t
• 6: Let H be the number of hits of the above search (i.e., the co-occurrence level of t with Q)
• 7: Return the Top-K terms as ordered by their H values.
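A rough sketch of this procedure using NLTK's WordNet interface; it assumes the WordNet corpus is installed, and `desktop_hits` is a hypothetical stand-in for the PIR search of step 5:

```python
from nltk.corpus import wordnet as wn  # assumes nltk.download('wordnet')

def related_terms(keyword):
    """Synonyms, sub-concepts (hyponyms) and super-concepts (hypernyms)
    of a keyword, mirroring steps 2a-2c above."""
    syn, sub, sup = set(), set(), set()
    for synset in wn.synsets(keyword):
        syn.update(l.name().replace('_', ' ') for l in synset.lemmas())
        for hypo in synset.hyponyms():
            sub.update(l.name().replace('_', ' ') for l in hypo.lemmas())
        for hyper in synset.hypernyms():
            sup.update(l.name().replace('_', ' ') for l in hyper.lemmas())
    syn.discard(keyword)
    return syn, sub, sup

def thesaurus_expand(query_terms, desktop_hits, top_k=4):
    """Rank candidate expansions by their number of desktop hits when
    appended to the original query; `desktop_hits(query_terms, t)` is a
    placeholder for searching the PIR with (Q|t)."""
    scored = {}
    for k in query_terms:
        for term_set in related_terms(k):
            for t in term_set:
                scored[t] = desktop_hits(query_terms, t)
    return sorted(scored, key=scored.get, reverse=True)[:top_k]
```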
Experiments
• 18 subjects indexed their content under their selected paths: emails, documents, web cache.
• Types of queries:
• Random log query, hitting 10 documents in the PIR.
• Self-selected specific query, which the subject thinks has one meaning.
• Self-selected ambiguous query, which the subject thinks has more than one meaning.
• We set the number of expansion terms to 4.
Experiments
• Measure
• Discounted Cumulative Gain:
• DCG(i) = G(1), if i = 1; DCG(i-1) + G(i) / log(i), otherwise.
• Gives more weight to highly ranked documents and incorporates relevance levels.
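A direct transcription of this recursion; log base 2 is assumed here, the usual choice for DCG, and `gains` holds the per-rank relevance grades G(i):

```python
import math

def dcg(gains):
    """DCG(1) = G(1); DCG(i) = DCG(i-1) + G(i) / log2(i) for i > 1."""
    total = 0.0
    for i, g in enumerate(gains, start=1):
        total += g if i == 1 else g / math.log2(i)
    return total

# Example: relevance grades 2 (highly relevant), 1, 0, 1 for ranks 1..4:
# dcg([2, 1, 0, 1]) == 2 + 1/1 + 0 + 1/2 == 3.5
```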
Experiments
• Labels used in the following results tables:
• Google: actual Google results
• TF, DF: term and document frequency, as described
• LC, LC[O]: regular and optimized lexical compounds
• SS: sentence selection (summarization)
• TC[CS], TC[MI], TC[LR]: term co-occurrence statistics with cosine similarity, mutual information, and likelihood ratio, respectively
• WN[SYN], WN[SUB], WN[SUP]: WordNet-based thesaurus expansion with synonyms, sub-concepts, and super-concepts, respectively
Results
• For log queries, the best performance is achieved with TF, LC[O], and TC[LR].
• We get good results with simple keyword- and expression-oriented techniques (TF, LC[O]), whereas more complicated ones do not show significant improvements.
• For unambiguous self-selected queries, we do not see much improvement, but for ambiguous ones we have a clear benefit.
• For clear (unambiguous) queries, decreasing the number of expansion terms can bring further improvements: the idea behind adaptive algorithms.
Adaptivity
• An optimal personalized query expansion algorithm should adapt itself to the initial query.
• How should we measure this, i.e., how much personal data should be fed into our search?
• Query Length:
• The number of words in the user query; not effective -> there are both short and long complicated queries.
• Query Scope:
• IDF of the entire query:
• log(#documents in collection / #hits for query)
• Performs well when the collection is focused on a single topic.
• Query Clarity:
• Measures the divergence between the language model of the query and the language model of the collection (PIR):
• Σ P(w | Query) * log(P(w | Query) / P(w)), where w is a word in the query, P(w | Query) is the probability of the word in the query, and P(w) is its probability in the entire collection.
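Hedged sketches of the two collection-dependent measures, using simple maximum-likelihood language models; this is my own minimal reading of the formulas above:

```python
import math
from collections import Counter

def query_scope(n_docs_in_collection, n_hits_for_query):
    """Query scope = log(#documents in collection / #hits for query)."""
    return math.log(n_docs_in_collection / max(n_hits_for_query, 1))

def query_clarity(query_tokens, collection_tokens):
    """Sum over query words of P(w|Query) * log(P(w|Query) / P(w)), i.e.
    the divergence between the query and collection language models."""
    q, c = Counter(query_tokens), Counter(collection_tokens)
    q_total, c_total = sum(q.values()), sum(c.values())
    clarity = 0.0
    for w, count in q.items():
        p_wq = count / q_total
        p_w = c.get(w, 0) / c_total
        if p_w > 0:
            clarity += p_wq * math.log(p_wq / p_w)
    return clarity
```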
• Calculate "scope" for the PIR and "clarity" for the web.
• We will use LC[O] (best performance in the previous experiment), plus TF and WN[SYN], which produced good first and second expansion terms.
• Tailor the number of expansion terms as a function of the query's ambiguity, as captured by its scope in the PIR and its clarity on the web.
• The scores for combinations of scope and clarity levels are as follows:
Experiments
• A similar approach was taken as in the previous experiments.
• For top log queries, there is an improvement over Google and even over the static methods (term number = 4).
• For random queries, again better results than Google, but behind the static methods. We may need a better selection of the number of expansion terms.
• For self-selected queries:
• A clear improvement for ambiguous queries.
• A slight performance increase for clear queries.
• The results suggest adaptivity is a further step for research in web search personalization.
Conclusion
• Five techniques for determining expansion terms generated from personal documents.
• The empirical analysis shows a 51.28% improvement.
• Further work to adapt search queries.
• An additional improvement of 8.47%.
Further Work
• Investigations on how to optimally select the number of expansion terms.
• Other query expansion suggestion approaches: Latent Semantic Analysis.