Improving Internet Search using Query Expansion and Focusing. Prepared for The Fifth International Military Applications Symposium Conference Theme: Military Personnel Research Track :Recruiting I Sumali J. Conlon, Wendy Wang, Jie Zhang, and Yue-hua She
Improving Internet Search using Query Expansion and Focusing
Prepared for The Fifth International Military Applications Symposium
Conference Theme: Military Personnel Research
Track: Recruiting I
Sumali J. Conlon, Wendy Wang, Jie Zhang, and Yue-hua She
Division of Management Information Systems, School of Business Administration, University of Mississippi, University, MS 38677
Phone (662)-915-5470
• This research is supported by the Office of Naval Research under grant N00014-00-0668.
Outline • Research Goals • Motivation • System Architecture • Data • Techniques • Conclusions and Future Research
Research Goals Improve Precision and Recall Rates in Internet Search Future Goals: • Automatic Summarization • Web Mining
Motivation
The Internet and other online information have been growing exponentially. Online information is:
• 20% numerical
• 80% textual
Current web search engines provide low precision and low recall.
$$\text{Precision rate} = \frac{\#\,\text{retrieved pages that are relevant}}{\#\,\text{retrieved pages}}$$
$$\text{Recall rate} = \frac{\#\,\text{relevant pages retrieved}}{\#\,\text{relevant pages on the Web}}$$
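The two rates above can be computed directly from sets of page identifiers. The following is a minimal sketch; the function name `precision_recall` and the page IDs are hypothetical and only illustrate the definitions.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for retrieved and relevant page IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # retrieved pages that are relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 pages retrieved, 2 of them relevant,
# 5 relevant pages exist in total.
retrieved_pages = {"p1", "p2", "p3", "p4"}
relevant_pages = {"p2", "p4", "p5", "p6", "p7"}
print(precision_recall(retrieved_pages, relevant_pages))  # (0.5, 0.4)
```

On the Web the denominator of recall is not known in practice, so a sketch like this only applies to a test collection where the relevant pages have been labeled.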
Modern Information Retrieval by R. Baeza-Yates, Berthier Ribeiro-Neto, Ricardo Baeza-Yates
List Price: $50.00
Our Price: $50.00
See All New: from $45.49
See All Used: from $32.00
Availability: Usually ships within 24 hours
Edition: Paperback
Customers who bought this book also bought:
• Readings in Information Retrieval (Morgan Kaufmann Series in Multimedia Information and Systems) by Karen Sparck Jones (Editor), et al
• Managing Gigabytes: Compressing and Indexing Documents and Images by Ian H. Witten, et al
• Understanding Search Engines: Mathematical Modeling and Text Retrieval (Software, Environments, Tools) by Michael W. Berry, Murray Browne
• The Social Life of Information by John Seely Brown, Paul Duguid
• UML Distilled: A Brief Guide to the Standard Object Modeling Language (2nd Edition) by Martin Fowler, Kendall Scott
Database structure
• Books(Isbn, title, authors, list_price, our_price)
• Related_books(main_book_isbn, related_books_isbn)
SQL:
SELECT isbn, title, authors
FROM books, related_books
WHERE main_book_isbn = :main_book_isbn AND isbn = related_books_isbn;
Customers who bought this book also bought:
• Readings in Information Retrieval (Morgan Kaufmann Series in Multimedia Information and Systems) by Karen Sparck Jones (Editor), et al
• Managing Gigabytes: Compressing and Indexing Documents and Images by Ian H. Witten, et al
• Understanding Search Engines: Mathematical Modeling and Text Retrieval (Software, Environments, Tools) by Michael W. Berry, Murray Browne
• The Social Life of Information by John Seely Brown, Paul Duguid
• UML Distilled: A Brief Guide to the Standard Object Modeling Language (2nd Edition) by Martin Fowler, Kendall Scott
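To make the contrast with free text concrete, the query on this slide can be run against a small in-memory database. This is a sketch only: it uses Python's built-in `sqlite3` rather than the Oracle system named later in the slides, the ISBN values are placeholders, prices not shown on the slide are left empty, and the `ORDER BY` clause is added just to make the output order predictable.

```python
import sqlite3


def build_demo_db():
    """Create the two tables from the slide and load a few sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE books (isbn TEXT PRIMARY KEY, title TEXT, authors TEXT,
                            list_price REAL, our_price REAL);
        CREATE TABLE related_books (main_book_isbn TEXT, related_books_isbn TEXT);
    """)
    # ISBNs are placeholders; unknown prices are stored as NULL.
    conn.executemany("INSERT INTO books VALUES (?, ?, ?, ?, ?)", [
        ("ISBN-0", "Modern Information Retrieval",
         "R. Baeza-Yates, Berthier Ribeiro-Neto", 50.00, 50.00),
        ("ISBN-1", "Managing Gigabytes", "Ian H. Witten, et al", None, None),
        ("ISBN-2", "The Social Life of Information",
         "John Seely Brown, Paul Duguid", None, None),
    ])
    conn.executemany("INSERT INTO related_books VALUES (?, ?)", [
        ("ISBN-0", "ISBN-1"),
        ("ISBN-0", "ISBN-2"),
    ])
    return conn


def related_books(conn, main_book_isbn):
    """Run the slide's join for one main book, using a named parameter."""
    sql = """
        SELECT isbn, title, authors
        FROM books, related_books
        WHERE main_book_isbn = :main_book_isbn
          AND isbn = related_books_isbn
        ORDER BY title
    """
    return conn.execute(sql, {"main_book_isbn": main_book_isbn}).fetchall()


for row in related_books(build_demo_db(), "ISBN-0"):
    print(row)
```

The point of the slide carries over: with a schema, "also bought" is an exact lookup, whereas a search engine only sees the rendered page text.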
Motivation • Current Search Engines work best with • Proper Nouns – “Office of Naval Research” • Other unusual words – “Geophysics” • Problems: • Lack of syntactic and semantic knowledge • Search engines cannot understand human language
4 Levels of Current Information Retrieval techniques • Texts are “bags-of-words” • Statistical techniques • Linguistics--Syntax -- “Junior College” vs. “College Junior” – Use parser • Linguistics--Syntax and Semantics -- “white collar worker” includes “professional” – Use parser and semantic representation
Linguistic Analysis is very difficult “We still know very little about how linguistic phrases should be used, and what happens when we manipulate entities, concepts, and relations that these phrases denote, and not just the words used to make them” From Natural Language Information Retrieval [Strzalkowski, 1999, p 16]
Improving Internet Search improving recall rate: query expansion • Proper Nouns • Syntactic Analysis • Lexical Semantic Relations improving precision rate: focusing • Co-occurrences Both use knowledge from a knowledge base
Figure 2. Steps of improving recall rate
[Flowchart. Nodes: Input Request; Proper Noun? (Y/N); Parser; Search the KB; Query Expansion & Focusing; Send request to the search engines; WWW; Calculate Weights; Threshold? (Y/N); Discarded]
The Parser - Implemented in Prolog
Input: “Who sells computers in LA.”
[Parse tree figure. Node labels: S, NP, VP, N, V, NP, N, PP, P, NP; leaves: Who, sells, computers, in, LA]
Output: S(noun(“who”), verb_pp(“sells”, noun(“computers”), pp(“in”, noun_p1(“LA”))))
Data • WordNet--Princeton • Proper noun lexicon • Idiom lexicon • Noun phrase lexicon • Co-occurrence lexicon
Data (continued)
• WordNet: a lexical database - Princeton University
• Synonyms (automobile ISA motor vehicle)
• Hypernyms (automobile ISA_KIND_OF …)
• Hyponyms (… ISA_KIND_OF automobile)
• Holonyms (professor ISA_member_of faculty)
• Meronyms (air bag PART_OF automobile)
95,600 word forms [Miller et al., 1990; Miller, 1995]
Semantic Network
[Figure. Nodes: University, Institution of Higher Education, Educational Institution, Academia, School, Building, person, Organization, Naval Academy. Edge labels: ISA, Is_Part_Of, Is_A_Kind_Of, Has_Member]
Data (continued)
The Knowledge Base
isa(New York City, city)
isa(CA, state)
isa(Mississippi, state)
isa(Texas, state)
isa(IBM, company)
isa(Ole Miss, university)
isa(Informs, Organization)
antonym(fast, slow)
antonym(large, small)
antonym(East, West)
part_of(CPU, computer)
part_of(Florida, Southeast US)
part_of(Georgia, Southeast US)
part_of(Miami, Florida)
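The facts on this slide can be pictured as (relation, subject, object) triples. The sketch below is an illustration in Python, not the authors' knowledge base implementation; the helper names `query` and `part_of_chain` are hypothetical.

```python
# Facts copied from the slide, stored as (relation, subject, object) triples.
FACTS = [
    ("isa", "New York City", "city"),
    ("isa", "CA", "state"),
    ("isa", "Mississippi", "state"),
    ("isa", "Texas", "state"),
    ("isa", "IBM", "company"),
    ("isa", "Ole Miss", "university"),
    ("isa", "Informs", "Organization"),
    ("antonym", "fast", "slow"),
    ("antonym", "large", "small"),
    ("antonym", "East", "West"),
    ("part_of", "CPU", "computer"),
    ("part_of", "Florida", "Southeast US"),
    ("part_of", "Georgia", "Southeast US"),
    ("part_of", "Miami", "Florida"),
]


def query(relation, subject=None, obj=None):
    """Return (subject, object) pairs matching the relation and any given slot."""
    return [(s, o) for r, s, o in FACTS
            if r == relation and subject in (None, s) and obj in (None, o)]


def part_of_chain(term):
    """Follow part_of links transitively, e.g. Miami -> Florida -> Southeast US."""
    chain, seen, frontier = [], {term}, [term]
    while frontier:
        current = frontier.pop()
        for _, whole in query("part_of", subject=current):
            if whole not in seen:
                seen.add(whole)
                chain.append(whole)
                frontier.append(whole)
    return chain


print(query("isa", obj="state"))   # all terms known to be states
print(part_of_chain("Miami"))      # ['Florida', 'Southeast US']
```

A lookup like `part_of_chain` is one way such facts could feed query expansion, for example widening a request about Miami to Florida.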
Tools • Perl/CGI • Oracle 9 • Linux
Traditional Information Retrieval technique
Vector-Space Model (tf*idf)
$D_i = (W_{i1}, W_{i2}, W_{i3}, \ldots, W_{it})$
$D_i$ = document (or query) text
$W_{ik}$ = weight of term $T_k$ in $D_i$
$$W_{ik} = \frac{(\log f_{ik} + 1.0)\,\log(N/n_k)}{\sqrt{\sum_{j=1}^{t}\left[(\log f_{ij} + 1.0)\,\log(N/n_j)\right]^{2}}}$$
$f_{ik}$ = occurrence frequency of $T_k$ in $D_i$
$N$ = collection size, $n_k$ = number of documents with term $T_k$ assigned
$$S(D_i, Q_j) = \sum_{k=1}^{t} d_{ik}\, q_{jk}$$
$S$ = similarity of two texts (documents and queries), where $d_{ik}$ and $q_{jk}$ are the weights of term $T_k$ in $D_i$ and $Q_j$
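The weighting and similarity formulas on this slide can be sketched in a few lines of Python. This is an illustration under simplifying assumptions: texts are already tokenized, the query is weighted as if it were one more text in the collection, and the function names `tfidf_vectors` and `similarity` are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """texts: list of token lists. Returns one {term: normalized weight} dict per text."""
    n_texts = len(texts)
    # n_k: number of texts that contain term k
    doc_freq = Counter(term for tokens in texts for term in set(tokens))
    vectors = []
    for tokens in texts:
        counts = Counter(tokens)
        # Numerator of W_ik: (log f_ik + 1.0) * log(N / n_k)
        raw = {t: (math.log(f) + 1.0) * math.log(n_texts / doc_freq[t])
               for t, f in counts.items()}
        # Denominator: Euclidean length of the raw weight vector
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: (w / norm if norm else 0.0) for t, w in raw.items()})
    return vectors


def similarity(d, q):
    """Inner product of two weight vectors, S(D_i, Q_j)."""
    return sum(w * q.get(t, 0.0) for t, w in d.items())


# Hypothetical toy collection; the last entry plays the role of the query.
texts = [
    ["fast", "computers", "for", "sale"],
    ["slow", "computers"],
    ["fast", "cars", "for", "sale"],
    ["fast", "computers"],
]
vectors = tfidf_vectors(texts)
query_vec = vectors[-1]
for tokens, vec in zip(texts[:-1], vectors[:-1]):
    print(tokens, round(similarity(vec, query_vec), 3))
```

Because every vector is length-normalized, the inner product behaves like a cosine score, and a term that occurs in every text gets weight zero since $\log(N/n_k) = 0$.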
Our Techniques • Syntax “The car that is red” = “the red car” • Query Expansion - Capture same concept terms (phrases) • Focusing - Reduce useless phrases • Vector-Space Model
Query Expansion
• Original search phrase (x, y)
• Database is queried for similar words (synonyms, hypernyms)
- x returns {x1, x2, …, xn}
- y returns {y1, y2, …, ym}
• Expanded queries:
(x, y), (x, y1), (x, y2), …, (x, ym)
(x1, y), (x1, y1), (x1, y2), …, (x1, ym)
…
(xn, y), (xn, y1), (xn, y2), …, (xn, ym)
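The grid of expanded queries above is a Cartesian product over each word and its related words. A minimal Python sketch follows; the `related` dictionary stands in for the lexical database lookup, and its entries are hypothetical.

```python
from itertools import product


def expand_query(terms, related):
    """Return every combination of each term with its related words.

    terms:   the original search phrase, e.g. ("fast", "computers")
    related: {term: [similar words]}, standing in for the lexical database
    """
    alternatives = [[t] + [r for r in related.get(t, []) if r != t]
                    for t in terms]
    return list(product(*alternatives))


# Hypothetical related words, for illustration only.
related = {
    "fast": ["quick", "high-speed"],
    "computers": ["machines"],
}
for q in expand_query(("fast", "computers"), related):
    print(" ".join(q))
```

With n related words for x and m for y, this yields (n + 1)(m + 1) queries, which is why expansion alone grows quickly and needs a later filtering step.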
Query Expansion in Literature
• Relevance Feedback: collect terms from the output of the original query
• Synonyms: at Cornell and Siemens Corporate Research, Inc. -- did not improve precision and recall rates significantly (too many useless terms)
Example: Search for “Fast Computers”
Synonyms for “Fast” (75 terms)
Query Expansion will produce queries: Too Many Queries, most are not useful!
Focusing Stage -- select only useful synonyms
• Use co-occurrence data from the web
• Compare the expanded list to the terms appearing in the co-occurrence data
• Keep lexically related words if they overlap with co-occurrences
• Send additional requests to the web
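The overlap test in the focusing stage can be sketched as a simple set filter. This is an illustration, not the authors' implementation: the function name `focus`, the candidate synonyms, and the co-occurrence set are all hypothetical examples.

```python
def focus(candidates, cooccurring):
    """Keep only the candidate terms that also appear in the co-occurrence data.

    candidates:  lexically related words produced by query expansion
    cooccurring: words observed next to the other query term on the web
    """
    observed = {word.lower() for word in cooccurring}
    return [c for c in candidates if c.lower() in observed]


# Hypothetical data: a few synonyms of "fast" and words seen with "computers".
synonyms_of_fast = ["quick", "high-speed", "loyal", "immoral"]
seen_with_computers = {"high-speed", "powerful", "new", "parallel"}

kept = focus(synonyms_of_fast, seen_with_computers)
print(kept)  # ['high-speed']
```

Only the surviving terms would then be combined with the rest of the phrase and sent as additional requests, which keeps senses of "fast" that never occur with "computers" out of the query set.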
Search for “Fast Computers”
Will produce phrases like:
• fast computers
• fast pace computer
• very fast computers
• modern high-speed computers
• interconnect high-speed computers
• high-speed parallel computers
• high-speed digital computers
• first high-speed computers
• latest high-speed computers
• new high-speed computers
• powerful high-speed computers
• large high-speed computer
• speeding up computers using…
• small, high-speed computers
Extension
Evaluate links:
[Figure: tree of links. Nodes: A; B, C, D, …, K; B1, B2, B3, …; B11, B12, B13; C1, C2, C3, …]
Limitations • Domain Specific • Requires Sophisticated Parser (syntactic analysis) • Requires Extensive Knowledge Base
Conclusion • Use NLP techniques to Improve precision and recall rates • Query Expansion and Focusing • Knowledge base Future Research • Automatic Text Summarization • Mining the Web • This technique can be used in recruiting!