Query Expansion By: Sean McGettrick
What is Query Expansion? • Query Expansion is the term used when a search engine adds search terms to a user’s weighted search. • The goal is to improve precision and/or recall. • Example: User Query: “car”; Expanded Query: “car cars automobile automobiles auto” etc.
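The “car” example above can be sketched in a few lines of Python. This is only an illustration: the `THESAURUS` table, the `expand_query` function, and the two weight values are assumptions made up for the example, not part of any particular search engine.

```python
# Hypothetical hand-built thesaurus; the entries mirror the "car" example.
THESAURUS = {
    "car": ["cars", "automobile", "automobiles", "auto"],
}


def expand_query(query, thesaurus, original_weight=1.0, expansion_weight=0.5):
    """Return {term: weight}; the user's own terms keep the higher weight."""
    terms = query.lower().split()
    weights = {term: original_weight for term in terms}
    for term in terms:
        for related in thesaurus.get(term, []):
            # setdefault never overwrites a term the user typed themselves.
            weights.setdefault(related, expansion_weight)
    return weights


expanded = expand_query("car", THESAURUS)
print(expanded)
```

Giving added terms a lower weight than the original ones is one simple answer to the "which terms to weight more?" question; the 1.0 and 0.5 values are arbitrary.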
Classes of Query Expansion • Human and/or computer generated thesauri • Relevance feedback • Automatic query expansion
Query Expansion Issues • Two major issues • Which terms to include? • Which terms to weight more? • Concept-Based vs. Term-Based Query Expansion • Is it better to expand based upon the individual terms in the query, or the overall concept of the query?
Relevance of Query Expansion • Query expansion is very important on the web. • The amount of information on the web is always increasing. • In 1999, Google had 135 million pages. It now has over 3 billion. • Search engine users follow specific trends with their searches. • 2-3 words • Broad search term • Do not like to expand their queries either through refining search terms or using Boolean operators
Thesauri • What is a thesaurus in the IR world? • “Any data structure that defines semantic relatedness between words.” • Schutze and Pedersen (1997) • Often more complex than normal thesauri. • Thought to be too broad to be useful.
The Need For Thesauri • Naturally assumed that pulling words from a thesaurus would increase: • The number of documents retrieved. • Possibly precision. • The car example: “car” vs. “car, auto, automobile, vehicle, sedan, etc.” • Which would retrieve the largest number of documents? • Is larger necessarily better?
Human & Automatically Generated Thesauri • Earliest work began in the 1950s. • H.P. Luhn • Thesaurofacet – detailed list of engineering terms • Largely used in such industries as medicine, aerospace, and other technological fields.
Drawbacks of Handcrafted Thesauri • Cost • Development. • Maintenance. • Cost often outweighs benefit. • Time • It often takes a long time for thesauri to develop. • Hard to keep up with the pace of scientific and technological development.
Automatically Generated Thesauri • Need grew from limitations of handcrafted thesauri. • Removes the cost of having experts generate thesauri.
Automatically Generated Thesauri • 3 Steps. • Extract word co-occurrences. • Define word similarities. • Based upon word co-occurrence or lexical relationship. • Cluster words based upon their similarities. • Not proven very successful. • As late as 1990 many industries were still using handcrafted thesauri.
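The three steps above can be sketched on a toy collection as follows. The five documents, the Dice coefficient as the similarity measure, and the single-link threshold clustering are all assumptions chosen to keep the example small; the slides do not say which measure or clustering method was used.

```python
from collections import Counter
from itertools import combinations

# Toy collection (made up for illustration).
docs = [
    "car engine repair",
    "automobile engine repair",
    "car dealer price",
    "automobile dealer price",
    "fruit apple banana",
]

# Step 1: extract word co-occurrences (document-level).
df = Counter()  # number of documents containing each term
co = Counter()  # number of documents containing both terms of a pair
for doc in docs:
    terms = sorted(set(doc.split()))
    df.update(terms)
    co.update(combinations(terms, 2))


# Step 2: define word similarity (Dice coefficient, an assumed choice).
def dice(a, b):
    pair = tuple(sorted((a, b)))
    return 2 * co[pair] / (df[a] + df[b])


# Step 3: cluster words whose similarity reaches a threshold (single link).
def cluster(terms, threshold):
    parent = {t: t for t in terms}

    def find(t):
        while parent[t] != t:
            t = parent[t]
        return t

    for a, b in combinations(terms, 2):
        if dice(a, b) >= threshold:
            parent[find(a)] = find(b)
    groups = {}
    for t in terms:
        groups.setdefault(find(t), set()).add(t)
    return list(groups.values())


groups = cluster(sorted(df), threshold=0.5)
print(groups)
```

Note that “car” and “automobile” never appear in the same document here, so their direct similarity is zero; they end up in the same cluster only through shared neighbours such as “engine”. The cluster also pulls in loosely related words, which hints at why such thesauri can be too broad.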
Relevance Feedback • Began in the 1960s. • Significant improvement in recall and precision over early query expansion work. • Basic process as follows. • The user creates their initial query which returns an initial result set. • The user then selects a list of documents that are relevant to their search. • The system then re-weights and/or expands the query based upon the terms in the documents.
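A minimal sketch of the process above, assuming the user has already marked some documents as relevant. The function name `feedback_expand`, the weights, and the rule of adding the most frequent new terms are illustrative assumptions, not a specific published model.

```python
from collections import Counter


def feedback_expand(query_terms, relevant_docs, num_new_terms=2,
                    boost=1.0, new_term_weight=0.5):
    """Re-weight and expand a query from user-selected relevant documents."""
    weights = {t: 1.0 for t in query_terms}

    # Count terms across the documents the user marked as relevant.
    counts = Counter()
    for doc in relevant_docs:
        counts.update(doc.lower().split())

    # Re-weight: query terms confirmed by the relevant documents get a boost.
    for t in query_terms:
        if counts[t] > 0:
            weights[t] += boost

    # Expand: add the most frequent terms not already in the query
    # (ties broken alphabetically so the result is deterministic).
    candidates = sorted(
        (t for t in counts if t not in weights),
        key=lambda t: (-counts[t], t),
    )
    for t in candidates[:num_new_terms]:
        weights[t] = new_term_weight
    return weights


selected = ["jaguar car dealer", "jaguar car engine"]  # chosen by the user
new_query = feedback_expand(["jaguar"], selected)
print(new_query)
```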
Relevance Feedback Models • Many different types of models. • Depend on methods and theories behind them. • Vector Space. • Probabilistic. • Boolean.
“Ide dec-hi” Method • In this method, all the top-ranked relevant documents are used, as is the highest-ranked non-relevant document. • The non-relevant document is used as a point in the vector space away from which the feedback query is moved. • Up to 160% improvement over non-expanded queries.
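As a rough sketch, the update can be written as: new query = original query + sum of the relevant document vectors − the highest-ranked non-relevant document vector. The code below represents vectors as term-to-weight dictionaries; the function name, the toy vectors, and the choice to drop terms whose weight falls to zero or below are assumptions for illustration.

```python
def ide_dec_hi(query, relevant, top_nonrelevant):
    """Add every relevant document vector to the query and subtract the
    single highest-ranked non-relevant document vector."""
    new_query = dict(query)
    for doc in relevant:
        for term, weight in doc.items():
            new_query[term] = new_query.get(term, 0.0) + weight
    for term, weight in top_nonrelevant.items():
        new_query[term] = new_query.get(term, 0.0) - weight
    # Assumption: terms with non-positive weight are dropped from the query.
    return {t: w for t, w in new_query.items() if w > 0}


query = {"jaguar": 1.0}
relevant = [
    {"jaguar": 1.0, "car": 1.0},
    {"jaguar": 1.0, "car": 1.0, "engine": 1.0},
]
top_nonrelevant = {"jaguar": 1.0, "cat": 1.0}

feedback_query = ide_dec_hi(query, relevant, top_nonrelevant)
print(feedback_query)
```

In this toy case “cat” ends up with a negative weight and is dropped, while “car” and “engine” enter the query from the relevant documents.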
Interactive Query Expansion • Uses a thesaurus. • After initial query is submitted, the system returns a list of associated and relevant words derived from both the result set and a thesaurus. • Useful, but more research is needed.
Pseudo-relevance Feedback • Grew from problems involved in implementing relevance feedback systems. • Users do not like to give manual feedback to the system.
Pseudo-relevance Feedback Process • The system returns an initial set of documents. • The system assumes that the top n documents are relevant to the query. • The system takes terms from these documents to re-weight the query. • Relies largely on the system’s ability to initially retrieve relevant documents.
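The process above might look like the following sketch. The scoring function (a raw count of query-term occurrences), the toy documents, and the parameter names are assumptions; a real system would use its own ranking function.

```python
from collections import Counter


def score(query_terms, doc):
    """Toy ranking function: count occurrences of query terms in the document."""
    words = doc.lower().split()
    return sum(words.count(t) for t in query_terms)


def pseudo_relevance_expand(query_terms, docs, top_n=2, num_new_terms=2):
    # Initial retrieval: rank every document against the original query.
    ranked = sorted(docs, key=lambda d: score(query_terms, d), reverse=True)
    # Assume, without asking the user, that the top n documents are relevant.
    assumed_relevant = ranked[:top_n]
    # Collect candidate expansion terms from those documents.
    counts = Counter()
    for doc in assumed_relevant:
        counts.update(w for w in doc.lower().split() if w not in query_terms)
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    return list(query_terms) + ordered[:num_new_terms]


docs = [
    "car repair car service",
    "car repair shop",
    "banana bread recipe",
    "train schedule",
]
expanded_query = pseudo_relevance_expand(["car"], docs)
print(expanded_query)
```

If the initial ranking puts off-topic documents in the top n, their terms are added to the query as well, which is the dependency on initial retrieval quality noted above.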
Automatic Query Expansion • The process of automatic query expansion using computer generated thesauri. • Works somewhat like pseudo-relevance feedback. • Implementation not as useful, but still widely researched.
Term Co-occurrence Measures • Process of developing relationships between words based upon their co-occurrence in documents. • Clustering • Documents that share a significant number of terms are grouped together. • A thesaurus is then generated from the terms in these categories. • Categories sometimes too narrow or broad. • Does not account for synonyms.
Lexical Co-Occurrence Measures • Instead of looking at the frequency of terms in a document, the proximity of words in a document is looked at. • Context of words becomes important. • Some performance improvement shown in small document collections. • Not quite as good as relevance feedback, but better than pseudo-relevance feedback.
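One way to picture the proximity idea is a sliding window: two words count as co-occurring only if they appear within a few positions of each other. The window size and the function name below are assumptions for illustration, not the specific measure used in the cited work.

```python
from collections import Counter


def window_cooccurrence(text, window=2):
    """Count unordered word pairs that occur within `window` positions."""
    words = text.lower().split()
    pairs = Counter()
    for i, word in enumerate(words):
        # Look only at the next `window` words to avoid double counting.
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                pairs[tuple(sorted((word, other)))] += 1
    return pairs


pairs = window_cooccurrence("the fast car passed the slow car", window=2)
print(pairs.most_common(3))
```

Here “fast” and “slow” appear in the same sentence but never within two words of each other, so they get no count, whereas a document-level measure would treat them as co-occurring.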
Current State of Query Expansion • Query Expansion technology has reached somewhat of a plateau. • This is due to limiting factors of relevance feedback and word co-occurrence. • Current research attempting to refine previous research in the field.
Where To Go From Here? • Grammatically Based Thesauri • Syntactical relationship between words • Words placed into classes • Some improvement on small document collections. Failed on larger ones. • AI Searching • Mostly theory • Intelligent Agents • Could be customized to reflect specific needs of the user • Next logical step in IR, but still far off from commercial use
Works Cited • Attardi, G., S. Di Marco and F. Sebastiani. 1998. Automated Generation of Category-Specific Thesauri for Interactive Query Expansion. • Grefenstette, G. 1992. Use of Syntactic Context to Produce Term Association Lists for Text Retrieval. In Proceedings of the 15th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, Copenhagen, Denmark, ed. N. Belkin, P. Ingwersen and A. M. Pejtersen: pp. 89-97. New York: ACM Press. • Ide, E. 1971. New Experiments in Relevance Feedback. In G. Salton. The SMART Retrieval System: Experiments in Automatic Document Processing. Englewood Cliffs, NJ: Prentice-Hall. • Qiu, Y. 1993. Concept Based Query Expansion. In Proceedings of SIGIR-93, 16th ACM International Conference on Research and Development in Information Retrieval. • Schutze, H. and J. Pedersen. 1997. A Cooccurrence-based Thesaurus and Two Applications to Information Retrieval. Information Processing and Management 33, no. 3: pp. 307-318. • Walker, D. 2001. Query Expansion Using Thesauri.