Query Expansion

Query Expansion. By: Sean McGettrick. What is Query Expansion? Query expansion is the term for a search engine adding search terms to a user’s weighted search. The goal is to improve precision and/or recall.

  1. Query Expansion By: Sean McGettrick

  2. What is Query Expansion? • Query expansion is the term for a search engine adding search terms to a user’s weighted search. • The goal is to improve precision and/or recall. • Example: User Query: “car”; Expanded Query: “car cars automobile automobiles auto”, etc.
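
The "car" example above can be sketched in a few lines of Python. This is only an illustration: the `THESAURUS` table, the `expand_query` function, and the 0.5 weight for added terms are assumptions made for this sketch, not something the slides specify.

```python
# Hypothetical toy thesaurus; a real system would load a curated or generated one.
THESAURUS = {
    "car": ["cars", "automobile", "automobiles", "auto"],
}

def expand_query(query, thesaurus, expansion_weight=0.5):
    """Return {term: weight}. Original terms get 1.0, added terms get expansion_weight."""
    weights = {}
    for term in query.lower().split():
        # A term typed by the user always carries full weight.
        weights[term] = 1.0
        for related in thesaurus.get(term, []):
            # Do not lower the weight of a term that is already in the query.
            weights.setdefault(related, expansion_weight)
    return weights

expanded = expand_query("car", THESAURUS)
print(" ".join(expanded))  # car cars automobile automobiles auto
```

Giving added terms a lower weight than the user's own terms is one possible answer to the "which terms to weight more?" question; other weighting schemes are equally plausible.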

  3. Classes of Query Expansion • Human and/or computer generated thesauri • Relevance feedback • Automatic query expansion

  4. Query Expansion Issues • Two major issues • Which terms to include? • Which terms to weight more? • Concept-Based vs. Term-Based Query Expansion • Is it better to expand based upon the individual terms in the query, or the overall concept of the query?

  5. Relevance of Query Expansion • Query expansion is very important on the web. • The amount of information on the web is always increasing. • In 1999, Google had 135 million pages. It now has over 3 billion. • Search engine users follow specific trends with their searches. • 2-3 words • Broad search term • Do not like to expand their queries either through refining search terms or using Boolean operators

  6. Thesauri • What is a thesaurus in the IR world? • “Any data structure that defines semantic relatedness between words.” • Schutze and Pedersen (1997) • Often more complex than a normal thesaurus. • Thought to be too broad to be useful.

  7. The Need For Thesauri • Naturally assumed that pulling words from a thesaurus would increase: • The number of documents retrieved. • Possibly precision. • The car example: “car” vs. “car, auto, automobile, vehicle, sedan, etc.” • Which would retrieve the largest number of documents? • Is larger necessarily better?
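
Whether a larger result set is better can be checked by computing precision and recall for both queries. The sketch below uses the standard definitions; the document IDs and relevance judgments are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical judgments: four documents are truly relevant.
relevant = {"d1", "d2", "d3", "d4"}

# Hypothetical results for the short query and for the expanded query.
narrow = ["d1", "d2"]
broad = ["d1", "d2", "d3", "d5", "d6", "d7", "d8", "d9"]

print(precision_recall(narrow, relevant))  # (1.0, 0.5)
print(precision_recall(broad, relevant))   # (0.375, 0.75)
```

In this made-up case the expanded query retrieves more relevant documents (higher recall) but also more irrelevant ones (lower precision), which is the trade-off the slide is pointing at.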

  8. Human & Automatically Generated Thesauri • Earliest work began in the 1950s. • H.P. Luhn • Thesaurofacet – detailed list of engineering terms • Largely used in such industries as medicine, aerospace, and other technological fields.

  9. Drawbacks of Handcrafted Thesauri • Cost • Development. • Maintenance. • Cost often outweighs benefit. • Time • It often takes a long time for thesauri to develop. • Hard to keep up with the pace of scientific and technological development.

  10. Automatically Generated Thesauri • Need grew from limitations of handcrafted thesauri. • Removes the cost of having experts generate thesauri.

  11. Automatically Generated Thesauri • 3 Steps. • Extract word co-occurrences. • Define word similarities. • Based upon word co-occurrence or lexical relationship. • Cluster words based upon their similarities. • Not proven very successful. • As late as 1990 many industries were still using handcrafted thesauri.
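
The three steps can be sketched as follows. This is a minimal illustration under several assumptions: co-occurrence is counted per document, similarity is the cosine between co-occurrence vectors, and clustering is simple single-link grouping above a threshold. The slides do not name specific measures, so all function names, the toy documents, and the 0.5 threshold are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations, permutations

docs = [
    "car engine wheel",
    "automobile engine wheel",
    "car automobile engine",
    "bank loan interest",
    "bank interest rate",
]

def cooccurrence(docs):
    """Step 1: for each term, count the terms it shares a document with."""
    vectors = {}
    for doc in docs:
        terms = set(doc.split())
        for a, b in permutations(terms, 2):
            vectors.setdefault(a, Counter())[b] += 1
    return vectors

def cosine(u, v):
    """Step 2: similarity of two terms = cosine of their co-occurrence vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def cluster(vectors, threshold=0.5):
    """Step 3: single-link clustering; terms at or above the threshold are merged."""
    terms = sorted(vectors)
    parent = {t: t for t in terms}

    def find(t):
        while parent[t] != t:
            t = parent[t]
        return t

    for a, b in combinations(terms, 2):
        if cosine(vectors[a], vectors[b]) >= threshold:
            parent[find(a)] = find(b)
    groups = {}
    for t in terms:
        groups.setdefault(find(t), set()).add(t)
    return list(groups.values())

clusters = cluster(cooccurrence(docs))
for group in clusters:
    print(sorted(group))
```

On this toy collection "car" and "automobile" land in the same cluster because they appear alongside the same terms, while the banking terms form a separate cluster. Real collections are far noisier, which is consistent with the slide's remark that the approach has not proven very successful.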

  12. Relevance Feedback • Began in the 1960s. • Significant improvement in recall and precision over early query expansion work. • Basic process as follows. • The user creates their initial query which returns an initial result set. • The user then selects a list of documents that are relevant to their search. • The system then re-weights and/or expands the query based upon the terms in the documents.

  13. Relevance Feedback Models • Many different types of models. • Depend on methods and theories behind them. • Vector Space. • Probabilistic. • Boolean.

  14. “Ide dec-hi” Method • In this method, all of the top-ranked relevant documents are used, as is the highest-ranked non-relevant document. • The non-relevant document is used as a point in the vector space that the feedback query is moved away from. • Up to 160% improvement over non-expanded queries.
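
As a rough sketch, the Ide dec-hi update is usually written as: new query = old query + sum of the relevant document vectors − the top-ranked non-relevant document vector. The Python below implements that formula on sparse term-weight dictionaries. The function name, the toy vectors, and the choice to drop terms whose weight falls to zero or below are assumptions of this sketch.

```python
from collections import Counter

def ide_dec_hi(query, relevant_docs, top_nonrelevant):
    """Q' = Q + sum(relevant docs) - highest-ranked non-relevant doc."""
    new_query = Counter(query)
    for doc in relevant_docs:
        for term, weight in doc.items():
            new_query[term] += weight
    for term, weight in top_nonrelevant.items():
        new_query[term] -= weight
    # Assumption: terms with non-positive weight are dropped from the query.
    return {t: w for t, w in new_query.items() if w > 0}

# Hypothetical term-weight vectors.
query = {"car": 1}
relevant_docs = [{"car": 1, "automobile": 1}, {"car": 1, "sedan": 1}]
top_nonrelevant = {"car": 1, "toy": 1}

print(ide_dec_hi(query, relevant_docs, top_nonrelevant))
# {'car': 2, 'automobile': 1, 'sedan': 1}
```

The term "toy" appears only in the non-relevant document, so it ends with a negative weight and is removed, while terms from the relevant documents are added to the query.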

  15. Interactive Query Expansion • Uses a thesaurus. • After initial query is submitted, the system returns a list of associated and relevant words derived from both the result set and a thesaurus. • Useful, but more research is needed.

  16. Pseudo-relevance Feedback • Grew from problems involved in implementing relevance feedback systems. • Users do not like to give manual feedback to the system.

  17. Pseudo-relevance Feedback Process • The system returns an initial set of documents. • The system assumes that the top n documents are relevant to the query. • The system takes terms from these documents to re-weight the query. • Relies largely on the system’s ability to initially retrieve relevant documents.
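
The process above can be sketched as follows. This simplified version only adds the most frequent new terms from the top n documents and does not re-weight existing terms; the function name, parameters, and toy ranking are hypothetical, and a real system would also filter stop words.

```python
from collections import Counter

def pseudo_relevance_feedback(query_terms, ranked_docs, top_n=2, num_new_terms=2):
    """Expand the query with frequent terms from the top_n ranked documents."""
    # Assume, without asking the user, that the top_n documents are relevant.
    assumed_relevant = ranked_docs[:top_n]
    counts = Counter()
    for doc in assumed_relevant:
        counts.update(doc.lower().split())
    # Keep the most frequent terms that are not already in the query.
    new_terms = [t for t, _ in counts.most_common() if t not in query_terms]
    return list(query_terms) + new_terms[:num_new_terms]

# Hypothetical initial ranking for the query "car".
ranked_docs = [
    "car engine repair engine",
    "car engine sedan repair",
    "toy car kids",
]

print(pseudo_relevance_feedback(["car"], ranked_docs))
# ['car', 'engine', 'repair']
```

If the initial ranking had put "toy car kids" near the top, the expansion would drift toward toys, which illustrates why the method depends on the quality of the first retrieval.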

  18. lol

  19. Automatic Query Expansion • The process of automatic query expansion using computer generated thesauri. • Works somewhat like pseudo-relevance feedback. • Implementation not as useful, but still widely researched.

  20. Term Co-occurrence Measures • Process of developing relationships between words based upon their co-occurrence in documents. • Clustering • Documents that share a significant number of terms are grouped together. • A thesaurus is then generated from the terms in these categories. • Categories sometimes too narrow or broad. • Does not account for synonyms.

  21. Lexical Co-Occurrence Measures • Instead of looking at the frequency of terms in a document, these measures look at the proximity of words within a document. • Context of words becomes important. • Some performance improvement shown in small document collections. • Not quite as good as relevance feedback, but better than pseudo-relevance feedback.
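
One simple way to capture proximity is to count word pairs that fall within a fixed-size window of each other. The sketch below assumes that interpretation; the window size, function name, and sample text are illustrative only, and the slides do not specify which proximity measure was used.

```python
from collections import Counter

def window_cooccurrence(text, window=2):
    """Count unordered word pairs that occur within `window` positions of each other."""
    tokens = text.lower().split()
    counts = Counter()
    for i, left in enumerate(tokens):
        # Look only at the next `window` tokens to the right.
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                counts[tuple(sorted((left, right)))] += 1
    return counts

counts = window_cooccurrence("fast car engine repair shop", window=2)
print(counts[("car", "engine")])  # 1
print(counts[("fast", "shop")])   # 0
```

"fast" and "shop" are in the same document but too far apart to count, which is the difference from document-level term co-occurrence.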

  22. Current State of Query Expansion • Query expansion technology has reached something of a plateau. • This is due to the limiting factors of relevance feedback and word co-occurrence. • Current research is attempting to refine previous research in the field.

  23. Where To Go From Here? • Grammatical Based Thesauri • Syntactical relationship between words • Words placed into classes • Some improvement on small document collections. Failed on larger ones. • AI Searching • Mostly theory • Intelligent Agents • Could be customized to reflect specific needs of the user • Next logical step in IR, but still far off from commercial use

  24. Works Cited • Attardi, G., S. Di Marco and F. Sebastiani. 1998. Automated Generation of Category-Specific Thesauri for Interactive Query Expansion. • Grefenstette, G. 1992. Use of Syntactic Context to Produce Term Association Lists for Text Retrieval. In Proceedings of the 15th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, Copenhagen, Denmark, ed. N. Belkin, P. Ingwersen and A. M. Pejtersen: pp. 89-97. New York: ACM Press. • Ide, E. 1971. New Experiments in Relevance Feedback. In G. Salton. The SMART Retrieval System: Experiments in Automatic Document Processing. Englewood Cliffs, NJ: Prentice-Hall. • Qiu, Y. 1993. Concept Based Query Expansion. In Proceedings of SIGIR-93, 16th ACM International Conference on Research and Development in Information Retrieval. • Schutze, H. and J. Pedersen. 1997. A Cooccurrence-based Thesaurus and Two Applications to Information Retrieval. Information Processing and Management 33, no. 3: pp. 307-318. • Walker, D. 2001. Query Expansion Using Thesauri.
