Improving Internet Search using Query Expansion and Focusing

Prepared for The Fifth International Military Applications Symposium
Conference Theme: Military Personnel Research
Track: Recruiting I
Sumali J. Conlon, Wendy Wang, Jie Zhang, and Yue-hua She
Division of Management Information Systems, School of Business Administration
University of Mississippi, University, MS 38677
Phone (662)-915-5470
• This research is supported by the Office of Naval Research under grant N00014-00-0668.

Outline • Research Goals • Motivation • System Architecture • Data • Techniques • Conclusions and Future Research

Research Goals
Improve precision and recall rates in Internet search
Future goals:
• Automatic Summarization
• Web Mining

Motivation
The Internet and other online information have been growing exponentially.
Online information:
• 20% numerical
• 80% textual
Current web search engines provide low precision and low recall.
Precision rate = (# retrieved pages that are relevant) / (# retrieved pages)
Recall rate = (# relevant pages retrieved) / (# relevant pages on the Web)
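
The two rates defined above can be computed directly from sets of page identifiers. The sketch below is only an illustration: the function name `precision_recall` and the page identifiers are hypothetical, and in practice the set of all relevant pages on the Web is not known and must be estimated.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of page identifiers."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # retrieved pages that are relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical page identifiers, for illustration only.
retrieved_pages = {"p1", "p2", "p3", "p4"}
relevant_pages = {"p2", "p4", "p5", "p6", "p7"}

precision, recall = precision_recall(retrieved_pages, relevant_pages)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four retrieved pages are relevant, and two of the five relevant pages were retrieved.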

Modern Information Retrieval by R. Baeza-Yates, Berthier Ribeiro-Neto, Ricardo Baeza-Yates
List Price: $50.00
Our Price: $50.00
See All New: from $45.49
See All Used: from $32.00
Availability: Usually ships within 24 hours
Edition: Paperback
• Customers who bought this book also bought:
• Readings in Information Retrieval (Morgan Kaufmann Series in Multimedia Information and Systems) by Karen Sparck Jones (Editor), et al
• Managing Gigabytes: Compressing and Indexing Documents and Images by Ian H. Witten, et al
• Understanding Search Engines : Mathematical Modeling and Text Retrieval (Software, Environments, Tools) by Michael W. Berry, Murray Browne
• The Social Life of Information by John Seely Brown, Paul Duguid
• UML Distilled: A Brief Guide to the Standard Object Modeling Language (2nd Edition) by Martin Fowler, Kendall Scott

Database structure
• Books(Isbn, title, authors, list_price, our_price)
• Related_books(main_book_isbn, related_books_isbn)
SQL:
    Select isbn, title, authors
    From books, related_books
    Where main_book_isbn = :main_book_isbn
      and isbn = related_books_isbn;
• Customers who bought this book also bought:
• Readings in Information Retrieval (Morgan Kaufmann Series in Multimedia Information and Systems) by Karen Sparck Jones (Editor), et al
• Managing Gigabytes: Compressing and Indexing Documents and Images by Ian H. Witten, et al
• Understanding Search Engines : Mathematical Modeling and Text Retrieval (Software, Environments, Tools) by Michael W. Berry, Murray Browne
• The Social Life of Information by John Seely Brown, Paul Duguid
• UML Distilled: A Brief Guide to the Standard Object Modeling Language (2nd Edition) by Martin Fowler, Kendall Scott
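
To see the query above run end to end, here is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database. The slides list Oracle as the actual database, so SQLite is a stand-in; the ISBN values are placeholders, the unknown prices are left as NULL, and the helper name `also_bought` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books (isbn TEXT PRIMARY KEY, title TEXT, authors TEXT,
                        list_price REAL, our_price REAL);
    CREATE TABLE related_books (main_book_isbn TEXT, related_books_isbn TEXT);
""")

# Placeholder ISBNs; prices not given in the slides are stored as NULL.
conn.executemany("INSERT INTO books VALUES (?, ?, ?, ?, ?)", [
    ("isbn-1", "Modern Information Retrieval",
     "R. Baeza-Yates, Berthier Ribeiro-Neto", 50.00, 50.00),
    ("isbn-2", "Managing Gigabytes", "Ian H. Witten, et al", None, None),
    ("isbn-3", "UML Distilled", "Martin Fowler, Kendall Scott", None, None),
])
conn.executemany("INSERT INTO related_books VALUES (?, ?)", [
    ("isbn-1", "isbn-2"),
    ("isbn-1", "isbn-3"),
])


def also_bought(conn, main_book_isbn):
    """Run the slide's join for one main book and return the related rows."""
    sql = """
        SELECT isbn, title, authors
        FROM books, related_books
        WHERE main_book_isbn = :main_book_isbn
          AND isbn = related_books_isbn
        ORDER BY isbn
    """
    return conn.execute(sql, {"main_book_isbn": main_book_isbn}).fetchall()


rows = also_bought(conn, "isbn-1")
for isbn, title, authors in rows:
    print(f"{title} by {authors}")
```

The only change to the query is the added ORDER BY clause, which makes the output order deterministic.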

Motivation
• Current search engines work best with:
  – Proper nouns: “Office of Naval Research”
  – Other unusual words: “Geophysics”
• Problems:
  – Lack of syntactic and semantic knowledge
  – Search engines cannot understand human language

4 Levels of Current Information Retrieval Techniques
• Texts are “bags-of-words”
• Statistical techniques
• Linguistics (syntax): “Junior College” vs. “College Junior”. Use a parser.
• Linguistics (syntax and semantics): “white collar worker” includes “professional”. Use a parser and a semantic representation.
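
The “Junior College” vs. “College Junior” contrast can be checked in a few lines. This is only a sketch of why word order matters; the helper name `bag_of_words` is hypothetical, and the comparison uses plain whitespace tokenization rather than a parser.

```python
from collections import Counter


def bag_of_words(text):
    """Represent a text as unordered word counts, discarding word order."""
    return Counter(text.lower().split())


a, b = "Junior College", "College Junior"

# Under a bag-of-words model the two phrases are indistinguishable.
same_bag = bag_of_words(a) == bag_of_words(b)

# Keeping the word sequence (the minimum a syntactic analysis needs) separates them.
same_order = a.lower().split() == b.lower().split()

print(same_bag, same_order)
```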

Linguistic Analysis is very difficult
“We still know very little about how linguistic phrases should be used, and what happens when we manipulate entities, concepts, and relations that these phrases denote, and not just the words used to make them”
From Natural Language Information Retrieval [Strzalkowski, 1999, p 16]

Improving Internet Search
Improving recall rate: query expansion
• Proper Nouns
• Syntactic Analysis
• Lexical Semantic Relations
Improving precision rate: focusing
• Co-occurrences
Both use knowledge from a knowledge base.

System Architecture

[Figure 2. Steps of improving recall rate. Flowchart with the boxes: Input Request; Proper Noun (Y/N); Parser; Search the KB; Send request to the search engines; Query Expansion & Focusing; Calculate Weights; WWW; Threshold (Y/N); Discarded.]

The Parser - Implemented in Prolog
Input: “Who sells computers in LA.”
[Figure: parse tree with nodes S, NP, VP, N, V, NP, N, PP, P, NP over the words “Who sells computers in LA”]
Output: S(noun(“who”), verb_pp(“sells”, noun(“computers”), pp(“in”, noun_p1(“LA”))))

Data • WordNet--Princeton • Proper noun lexicon • Idiom lexicon • Noun phrase lexicon • Co-occurrence lexicon

Data (continued)
• WordNet: a lexical database - Princeton University
  – Synonyms (automobile ISA motor vehicle)
  – Hypernyms (automobile ISA_KIND_OF …)
  – Hyponyms (… ISA_KIND_OF automobile)
  – Holonyms (professor ISA_member_of faculty)
  – Meronyms (air bag PART_OF automobile)
95,600 word forms [Miller et al., 1990; Miller, 1995]

Semantic Network
[Figure: semantic network. Nodes: University, Institution of Higher Education, Educational Institution, Academia, School, Building, person, Organization, Naval Academy. Link labels: ISA, Is_Part_OF, Is_A_Kind_Of, Has_Member.]

Data (continued)
The Knowledge Base
isa(New York City, city)
isa(CA, state)
isa(Mississippi, state)
isa(Texas, state)
isa(IBM, company)
isa(Ole Miss, university)
isa(Informs, Organization)
antonym(fast, slow)
antonym(large, small)
antonym(East, West)
part_of(CPU, computer)
part_of(Florida, Southeast US)
part_of(Georgia, Southeast US)
part_of(Miami, Florida)
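
Facts of this shape are easy to store and query as (relation, subject, object) triples. The sketch below is an assumption about how such a knowledge base could be consulted, not the authors' implementation: the names `FACTS`, `objects`, and `part_of_closure` are hypothetical, and following `part_of` links transitively is an illustrative choice that the slides do not state.

```python
# A subset of the slide's facts, stored as (relation, subject, object) triples.
FACTS = {
    ("isa", "New York City", "city"),
    ("isa", "Mississippi", "state"),
    ("isa", "Ole Miss", "university"),
    ("antonym", "fast", "slow"),
    ("part_of", "CPU", "computer"),
    ("part_of", "Florida", "Southeast US"),
    ("part_of", "Georgia", "Southeast US"),
    ("part_of", "Miami", "Florida"),
}


def objects(relation, subject):
    """All objects o such that relation(subject, o) is a stored fact."""
    return sorted(o for r, s, o in FACTS if r == relation and s == subject)


def part_of_closure(entity):
    """Follow part_of links transitively (assumed behavior, for illustration)."""
    found, frontier = [], [entity]
    while frontier:
        current = frontier.pop()
        for parent in objects("part_of", current):
            if parent not in found:
                found.append(parent)
                frontier.append(parent)
    return found


print(objects("isa", "Ole Miss"))
print(part_of_closure("Miami"))
```

With the closure, a request that mentions Miami could also be matched against pages about Florida or the Southeast US.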

Tools • Perl/CGI • Oracle 9 • Linux

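Traditional Information Retrieval Technique
Vector-Space Model (tf*idf)
$D_i = (W_{i1}, W_{i2}, W_{i3}, \ldots, W_{it})$
$D_i$ = document (or query) text
$W_{ik}$ = weight of term $T_k$
$$W_{ik} = \frac{(\log(f_{ik}) + 1.0) \cdot \log(N/n_k)}{\sqrt{\sum_{j=1}^{t} \left[ (\log(f_{ij}) + 1.0) \cdot \log(N/n_j) \right]^2}}$$
$f_{ik}$ = occurrence frequency of $T_k$ in $D_i$
$N$ = collection size, $n_k$ = number of documents with term $T_k$ assigned
$$S(D_i, Q_j) = \sum_{k=1}^{t} d_{ik} \cdot q_{jk}$$
$S$ = similarity of two texts (documents and queries), where $d_{ik}$ and $q_{jk}$ are the weights of term $T_k$ in $D_i$ and $Q_j$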
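
A minimal sketch of the weighting and similarity formulas above, under stated assumptions: the three toy documents are invented for illustration, logarithms are natural logs, terms that never occur in the collection are skipped (their idf is undefined), and the function names `tfidf_vector` and `similarity` are hypothetical.

```python
import math
from collections import Counter


def tfidf_vector(tokens, doc_freq, n_docs):
    """Cosine-normalized weights W_ik for one text (document or query)."""
    raw = {}
    for term, freq in Counter(tokens).items():
        n_k = doc_freq.get(term, 0)
        if n_k == 0:
            continue  # term not in the collection: idf undefined, so skip it
        raw[term] = (math.log(freq) + 1.0) * math.log(n_docs / n_k)
    norm = math.sqrt(sum(w * w for w in raw.values()))
    if norm == 0.0:
        return {}
    return {term: w / norm for term, w in raw.items()}


def similarity(d, q):
    """S(D, Q): inner product of the two weight vectors."""
    return sum(w * q.get(term, 0.0) for term, w in d.items())


# Toy collection, invented for illustration.
docs = {
    "d1": "fast computers use fast processors".split(),
    "d2": "high-speed computers".split(),
    "d3": "slow trains".split(),
}
n_docs = len(docs)
# n_k: number of documents that contain each term.
doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))

vectors = {name: tfidf_vector(tokens, doc_freq, n_docs)
           for name, tokens in docs.items()}
query_vec = tfidf_vector("fast computers".split(), doc_freq, n_docs)

scores = {name: similarity(vec, query_vec) for name, vec in vectors.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Because the vectors are normalized, the inner product equals the cosine of the angle between them. Note that d2 scores low even though “high-speed” means “fast”, which is the gap that query expansion targets.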

Our Techniques
• Syntax: “The car that is red” = “the red car”
• Query Expansion - Capture same-concept terms (phrases)
• Focusing - Reduce useless phrases
• Vector-Space Model

Query Expansion
• Original search phrase: (x, y)
• The database is queried for similar words (synonyms, hypernyms):
  – x returns {x1, x2, …, xn}
  – y returns {y1, y2, …, ym}
• Expanded queries:
  (x, y), (x, y1), (x, y2), …, (x, ym)
  (x1, y), (x1, y1), (x1, y2), …, (x1, ym)
  …
  (xn, y), (xn, y1), (xn, y2), …, (xn, ym)
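
The grid of expanded queries above is a Cartesian product over each term's alternatives. The sketch below assumes a plain dictionary in place of the lexical database; the name `expand_query` and the related-word lists are hypothetical.

```python
from itertools import product


def expand_query(terms, related):
    """Return every combination of original and related words, original first."""
    alternatives = [
        [term] + [r for r in related.get(term, []) if r != term]
        for term in terms
    ]
    return list(product(*alternatives))


# Hypothetical related words; a real system would look these up in the lexical database.
related_words = {
    "fast": ["high-speed", "quick"],
    "computers": ["computer"],
}

queries = expand_query(["fast", "computers"], related_words)
for q in queries:
    print(" ".join(q))
```

With n related words for x and m for y, the expansion yields (n + 1)(m + 1) queries, so the count grows quickly as the synonym lists get longer.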

Query Expansion in Literature
• Relevance Feedback: collect terms from the output of the original query
• Synonyms: at Cornell and Siemens Corporate Research, Inc. -- did not improve precision and recall rates significantly (too many useless terms)

Example: Search for “Fast Computers”
Synonyms for “Fast” (75 terms)

Query Expansion will produce queries: Too Many Queries, most are not useful!

Focusing Stage -- select only useful synonyms
• Use co-occurrence data from the web
• Compare the expanded list to the terms appearing in the co-occurrence data
• Keep lexically related words if they overlap with co-occurrences
• Send additional requests to the web
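
The overlap test in the focusing stage can be sketched as a simple set intersection. This is an assumption about the filtering step, not the authors' code: the function name `focus` is hypothetical, the candidate synonyms are invented, and the co-occurring terms are taken from phrases such as “high-speed parallel computers”.

```python
def focus(candidates, cooccurring_terms):
    """Keep only candidate words that also appear in the co-occurrence data."""
    observed = {term.lower() for term in cooccurring_terms}
    return [word for word in candidates if word.lower() in observed]


# Hypothetical synonym candidates for "fast" from the lexical database.
candidates = ["high-speed", "quick", "firm", "loyal"]

# Words seen next to "computers" in web text (illustrative sample).
cooccurring = ["high-speed", "powerful", "parallel", "digital", "modern"]

kept = focus(candidates, cooccurring)
print(kept)
```

Only the words that survive this filter are used to build the additional requests, which keeps the number of expanded queries manageable.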

Search for “Fast Computers”
Will produce phrases like:
• fast computers
• fast pace computer
• very fast computers
• modern high-speed computers
• interconnect high-speed computers
• high-speed parallel computers
• high-speed digital computers
• first high-speed computers
• latest high-speed computers
• new high-speed computers
• powerful high-speed computers
• large high-speed computer
• speeding up computers using…
• small, high-speed computers

Extension
Evaluate links:
[Figure: diagram of linked pages labeled A; B, C, D, …, K; B1, B2, B3, …; B11, B12, B13; C1, C2, C3, …]

Limitations • Domain Specific • Requires Sophisticated Parser (syntactic analysis) • Requires Extensive Knowledge Base

Conclusion
• Use NLP techniques to improve precision and recall rates
• Query Expansion and Focusing
• Knowledge base
Future Research
• Automatic Text Summarization
• Mining the Web
• This technique can be used in recruiting!

THANK YOU!
