Query Broadening to improve IR

First we look at a method for Information Retrieval query broadening that requires input from the user.
Then we look at an automatic method for query broadening using a thesaurus.
By the end of the lecture you should understand what a thesaurus, a terminology bank and an ontology are, and how they are used to broaden queries.

Some issues to be resolved
- Synonyms (football / soccer, tap / faucet): search for one, find both?
- Homonyms (lead: metal or leash? tap): find both, only want one?
- Local/global contexts determine “good” terms: football articles won’t mention the word ‘football’, and will have a particular meaning for the word ‘goal’.
- Precoordination (proximity query) for multi-word terms: “Venetian blind” vs “blind Venetian”.

Evaluation/Effectiveness measures
- effort: required of the users in formulating queries
- time: between receipt of the user query and production of the list of ‘hits’
- presentation: of the output
- coverage: of the collection
- recall: the fraction of relevant items that are retrieved
- precision: the fraction of retrieved items that are relevant
- user satisfaction: with the retrieved items
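The definitions of recall and precision above can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture: the function name `precision_recall` and the sample document identifiers and relevance judgements are assumptions chosen for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: identifiers of the items the system returned
    relevant:  identifiers of all items judged relevant in the collection
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # retrieved AND relevant
    # Guard against empty sets so the ratios are always defined.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Illustrative (assumed) judgements: two items retrieved, two items relevant,
# one item in common.
p, r = precision_recall(["d1", "d4"], ["d1", "d3"])
print(p, r)  # 0.5 0.5
```

Both measures share the same numerator (relevant items that were retrieved); only the denominator differs, which is why broadening a query tends to trade one against the other.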

Better hits: Query Broadening
- A user unaware of the collection characteristics is likely to formulate a ‘naïve’ query.
- Query broadening aims to replace the initial query with a new one featuring one or other of:
  - new index terms
  - adjusted term weights
- One method uses feedback information from the user.
- Another method uses a thesaurus / term-bank / ontology.

Relevance Feedback
From the response to the initial query, gather relevance information:
- H = set of all hits
- H_R = R = set of retrieved, relevant hits
- H_NR = H - R = set of retrieved, non-relevant hits
Replace query q with the replacement query q':

$$ q' = \alpha q + \frac{\beta}{|H_R|} \sum_{d_i \in H_R} d_i - \frac{\gamma}{|H_{NR}|} \sum_{d_i \in H_{NR}} d_i $$

Note: this moves the query vector closer to the centroid of the “relevant retrieved” document vectors and further from the centroid of the “non-relevant retrieved” document vectors.

Using terms from relevant documents
We expect documents that are similar to one another in meaning (or usefulness) to have similar index terms.
The system creates a replacement query (q’) based on q, but
- adds index terms that have been used to index known relevant documents,
- increases the relative weight of index terms in q that are also found in relevant documents, and
- reduces the weight of terms found in non-relevant documents.

How does this help?
It could help if documents were being missed because of the synonym problem. The user uses the word ‘jam’, but some recipes use ‘jelly’ instead. Once a hit that uses ‘jelly’ has been recognized as relevant, ‘jelly’ will appear in the next version of the query. Now hits may use ‘jelly’ but not ‘jam’.
Conversely, it can help with the homonym problem. If the user wants references to ‘lead’ (the metal) and gets documents relating to dog-walking, then marking the dog-walking references as not relevant reduces the weight of key words associated with dog-walking.

Pros and cons of feedback
- If γ (the weight on the non-relevant hits) is set to 0, non-relevant hits are ignored: this is a positive feedback system, and is often preferred.
- The feedback formula can be applied repeatedly, asking the user for relevance information at each iteration.
- Relevance feedback is generally considered to be very effective for “high-use” systems.
- One drawback is that it is not fully automatic.

Simple feedback example:
T = {pudding, jam, traffic, lane, treacle}
d1 = (0.8, 0.8, 0.0, 0.0, 0.4)   Recipe for jam pudding
d2 = (0.0, 0.0, 0.9, 0.8, 0.0)   DoT report on traffic lanes
d3 = (0.8, 0.0, 0.0, 0.0, 0.8)   Recipe for treacle pudding
d4 = (0.6, 0.9, 0.5, 0.6, 0.0)   Radio item on traffic jam in Pudding Lane
Display the first 2 documents that match the following query:
q = (1.0, 0.6, 0.0, 0.0, 0.0)
Match scores for d1 to d4: r = (0.91, 0.0, 0.6, 0.73)
Retrieved documents are:
d1 : Recipe for jam pudding (relevant)
d4 : Radio item on traffic jam (not relevant)
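The ranking in this example can be reproduced with a short sketch. The slide does not name the matching function; cosine similarity is assumed here because it agrees with the listed scores to within rounding. The names `cosine`, `rank` and `DOCS` are illustrative.

```python
import math

# Document vectors over T = {pudding, jam, traffic, lane, treacle},
# copied from the slide.
DOCS = {
    "d1": (0.8, 0.8, 0.0, 0.0, 0.4),  # Recipe for jam pudding
    "d2": (0.0, 0.0, 0.9, 0.8, 0.0),  # DoT report on traffic lanes
    "d3": (0.8, 0.0, 0.0, 0.0, 0.8),  # Recipe for treacle pudding
    "d4": (0.6, 0.9, 0.5, 0.6, 0.0),  # Radio item on traffic jam in Pudding Lane
}


def cosine(u, v):
    """Cosine of the angle between two term-weight vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank(query, docs, k=2):
    """Return the k best-matching (name, score) pairs, best first."""
    scored = sorted(((cosine(query, vec), name) for name, vec in docs.items()),
                    reverse=True)
    return [(name, score) for score, name in scored[:k]]


q = (1.0, 0.6, 0.0, 0.0, 0.0)
for name, score in rank(q, DOCS):
    print(name, round(score, 2))
# d1 0.91
# d4 0.73
```

The query only mentions ‘pudding’ and ‘jam’, so the traffic-jam item d4 outranks the treacle pudding recipe d3 even though d3 is the more useful hit; this is the homonym problem that feedback is meant to correct.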

Positive and Negative Feedback
Suppose we set α and β to 0.5, and γ to 0.2:

$$
\begin{aligned}
q' &= \alpha q + \frac{\beta}{|H_R|} \sum_{d_i \in H_R} d_i - \frac{\gamma}{|H_{NR}|} \sum_{d_i \in H_{NR}} d_i \\
   &= 0.5\,q + 0.5\,d_1 - 0.2\,d_4 \\
   &= 0.5 \times (1.0, 0.6, 0.0, 0.0, 0.0) + 0.5 \times (0.8, 0.8, 0.0, 0.0, 0.4) - 0.2 \times (0.6, 0.9, 0.5, 0.6, 0.0) \\
   &= (0.78, 0.52, -0.1, -0.12, 0.2)
\end{aligned}
$$

(Note: |H_R| = 1 and |H_NR| = 1)
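The arithmetic above can be checked with a small sketch of the feedback update. The function name `feedback_query` and its default weights are assumptions made for this example; the vectors are the ones from the slides.

```python
def feedback_query(q, relevant, non_relevant, alpha=0.5, beta=0.5, gamma=0.2):
    """Return q' = alpha*q + beta*centroid(relevant) - gamma*centroid(non_relevant)."""

    def centroid(vectors):
        # Mean vector of a list of document vectors; zero vector if the list is empty.
        if not vectors:
            return [0.0] * len(q)
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    c_rel = centroid(relevant)
    c_non = centroid(non_relevant)
    return [alpha * qi + beta * ri - gamma * ni
            for qi, ri, ni in zip(q, c_rel, c_non)]


q = [1.0, 0.6, 0.0, 0.0, 0.0]
d1 = [0.8, 0.8, 0.0, 0.0, 0.4]  # marked relevant
d4 = [0.6, 0.9, 0.5, 0.6, 0.0]  # marked not relevant

q_new = feedback_query(q, [d1], [d4])
print([round(x, 2) for x in q_new])  # [0.78, 0.52, -0.1, -0.12, 0.2]
```

Calling the same function with `gamma=0` gives the positive-only variant, and feeding `q_new` back in as `q` with fresh judgements gives the repeated application described earlier.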

Simple feedback example:
T = {pudding, jam, traffic, lane, treacle}
d1 = (0.8, 0.8, 0.0, 0.0, 0.4)
d2 = (0.0, 0.0, 0.9, 0.8, 0.0)
d3 = (0.8, 0.0, 0.0, 0.0, 0.8)
d4 = (0.6, 0.9, 0.5, 0.6, 0.0)
Display the first 2 documents that match the following query:
q’ = (0.78, 0.52, -0.1, -0.12, 0.2)
Match scores for d1 to d4: r’ = (0.96, 0.0, 0.86, 0.63)
Retrieved documents are:
d1 : Recipe for jam pudding (relevant)
d3 : Recipe for treacle pud (relevant)

Thesaurus
A thesaurus or ontology may contain:
- a controlled vocabulary of terms or phrases describing a specific restricted topic,
- synonym classes,
- a hierarchy defining broader terms (hypernyms) and narrower terms (hyponyms),
- classes of ‘related’ terms.
A thesaurus or ontology may be:
- generic (such as Roget’s thesaurus, or WordNet),
- specific to a certain domain of knowledge, e.g. medical.

Language normalisation
By replacing words from documents and query words with synonyms from a controlled language, we can improve precision and recall.
[Diagram: content analysis yields uncontrolled keywords, which the thesaurus maps to index terms; the user query is mapped by the thesaurus to a normalised query, which is then matched against the index terms.]
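A minimal sketch of language normalisation follows, assuming the controlled language is given as a simple lookup table from each variant to one preferred term. The table `CONTROLLED` and the function `normalise` are hypothetical; the entries reuse synonym examples from this lecture.

```python
# Hypothetical controlled vocabulary: every variant maps to one preferred term.
CONTROLLED = {
    "computer": "computer",
    "data processor": "computer",
    "electronic computer": "computer",
    "information processing system": "computer",
    "football": "football",
    "soccer": "football",
}


def normalise(terms, vocabulary=CONTROLLED):
    """Map each term to its preferred form; unknown terms are kept unchanged."""
    return {vocabulary.get(term.lower(), term.lower()) for term in terms}


# The same mapping is applied to document keywords and to the user query,
# so synonyms meet at the same index term.
index_terms = normalise(["Data processor", "soccer"])
normalised_query = normalise(["electronic computer"])
print(normalised_query & index_terms)  # {'computer'}
```

Without normalisation the query ‘electronic computer’ and the keyword ‘data processor’ share no term and would not match at all.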

Thesaurus / Ontology construction
- Include terms likely to be of value in content analysis.
- For each term, form classes of related words (separate classes for synonyms, hypernyms, hyponyms).
- Form separate classes for each relevant meaning of the word.
- Terms in a class should occur with roughly equal frequency (not easy: natural-language word frequencies follow Zipf’s law).
- Avoid high-frequency terms.
- Construction involves some expert judgment that will not be easy to automate.

Example thesaurus
A public-domain thesaurus (WORDNET) is available from:
http://www.cogsci.princeton.edu/~wn/
/home/cserv1_a/staff/nlplib/WordNet/2.0
/home/cserv1_a/staff/extras/nltk/1.4.2/corpora/wordnet
Synonyms of ‘computer’ (sense 1): computer, data processor, electronic computer, information processing system

Example thesaurus
Synonyms of ‘computer’ (sense 2): calculator, reckoner, figurer, estimator, computer

Terminology (from WordNet Help)
Hypernym is the generic term used to designate a whole class of specific instances. Y is a hypernym of X if X is a (kind of) Y.
Hyponym is the generic term used to designate a member of a class. X is a hyponym of Y if X is a (kind of) Y.
Coordinate words are words that have the same hypernym.
Hypernym synsets are preceded by "->", and hyponym synsets are preceded by "=>".

Hypernyms
Sense 1
computer, data processor, electronic computer, information processing system
  -> machine
    -> device
      -> instrumentality, instrumentation
        -> artifact, artefact
          -> object, physical object
            -> entity, something

Hyponyms
Sense 1
computer, data processor, electronic computer, information processing system
  => analog computer, analogue computer
  => digital computer
  => node, client, guest
  => number cruncher
  => pari-mutuel machine, totalizer, totaliser, totalizator, totalisator
  => server, host

Coordinate terms
Sense 1
computer, data processor, electronic computer, information processing system
  -> machine
    => assembly
    => calculator, calculating machine
    => calendar
    => cash machine, cash dispenser, automated teller machine, automatic teller machine, automated teller, automatic teller, ATM
    => computer, data processor, electronic computer, information processing system
    => concrete mixer, cement mixer
    => corker
    => cotton gin, gin
    => decoder

Thesaurus use
- replace a term in the document and/or query with a term in the controlled language
- replace a term in the query with a related or broader term to increase recall
- suggest narrower terms to the user to increase precision
[Diagram, S (synonym): the document term <data processor> and the query term <electronic computer> are both mapped by the thesaurus to computer (sense 1), so they match.]

Thesaurus use
[Diagram, B (broader term): the query <node (sense 6)> is replaced via the thesaurus by the query <computer (sense 1)>, which is then matched against the whole collection.]

Thesaurus use
[Diagram, N (narrower term): for the user query <computer (sense 1)>, the thesaurus suggests the narrower query "client", which is then matched against the whole collection.]
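Broadening and narrowing with a thesaurus can be sketched with a toy hierarchy. The table `HYPERNYM` below is a hand-built, simplified extract of the WordNet entries quoted in this lecture (one term per synset, one sense per term), and the function names are assumptions for illustration; a real system would query the thesaurus itself.

```python
# Toy hierarchy: term -> its broader term (hypernym).
HYPERNYM = {
    "analog computer": "computer",
    "digital computer": "computer",
    "node": "computer",
    "client": "computer",
    "server": "computer",
    "computer": "machine",
    "calculator": "machine",
    "machine": "device",
}


def broaden(terms):
    """Add the broader term of each query term, aiming at higher recall."""
    expanded = set(terms)
    for term in terms:
        parent = HYPERNYM.get(term)
        if parent is not None:
            expanded.add(parent)
    return expanded


def suggest_narrower(term):
    """List narrower terms to offer the user, aiming at higher precision."""
    return sorted(child for child, parent in HYPERNYM.items() if parent == term)


print(sorted(broaden(["node"])))
# ['computer', 'node']
print(suggest_narrower("computer"))
# ['analog computer', 'client', 'digital computer', 'node', 'server']
```

Broadening is applied automatically here, while narrower terms are only suggested: choosing among them needs the user's knowledge of what is actually wanted.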

Key points
- A thesaurus or ontology can be used to normalise a vocabulary and queries (or documents?).
- It can be used (with some human intervention) to increase recall and precision.
- A generic thesaurus/ontology may not be effective in specialized collections and/or queries.
- Semi-automatic construction of a thesaurus/ontology based on the retrieved set of documents has produced some promising results.
